Bots are causing high memory usage and my host wants me to fix it

I'm getting warnings from my shared host (Lunarpages) saying my CPU and memory usage is high. I'm guessing they want me off shared hosting, but it looks like a lot of the traffic isn't even from humans… how do you prevent bots from racking up 53,000 hits? lol. I know I can IP-deny individual bots, but I keep doing that, and then Lunarpages just sends me another list of the new top ten IPs… is this just something everyone deals with?

CPU Usage - 7.53%
MEM Usage - 1.23%
Number of MySQL procs (average) - 0.14
Top Process %CPU 66.00
Top Process %CPU 65.00
Top Process %CPU 31.50

Top 10 of 88026 Total Sites By KBytes

| # | Hits | Hits % | Files | Files % | KBytes | KBytes % | Visits | Visits % | Hostname |
|---|------|--------|-------|---------|--------|----------|--------|----------|----------|
| 1 | 53683 | 3.61% | 0 | 0.00% | 703394 | 2.79% | 0 | 0.00% | 113.128.7.109 |
| 2 | 24799 | 1.67% | 0 | 0.00% | 421146 | 1.67% | 0 | 0.00% | 219.139.116.166 |
| 3 | 30716 | 2.07% | 0 | 0.00% | 414266 | 1.64% | 0 | 0.00% | 113.128.31.117 |
| 4 | 15159 | 1.02% | 0 | 0.00% | 270894 | 1.07% | 0 | 0.00% | 119.130.163.150 |
| 5 | 15309 | 1.03% | 1 | 0.00% | 259043 | 1.03% | 1 | 0.00% | hn.kd.ny.adsl |
| 6 | 11284 | 0.76% | 0 | 0.00% | 204741 | 0.81% | 0 | 0.00% | 113.128.9.138 |
| 7 | 11526 | 0.78% | 0 | 0.00% | 149724 | 0.59% | 0 | 0.00% | 121.29.126.70 |
| 8 | 3686 | 0.25% | 0 | 0.00% | 139829 | 0.55% | 0 | 0.00% | 115.187.229.179 |
| 9 | 7412 | 0.50% | 0 | 0.00% | 139187 | 0.55% | 0 | 0.00% | 115.218.107.247 |
| 10 | 4103 | 0.28% | 683 | 0.11% | 134515 | 0.53% | 107 | 0.16% | spider-199-21-99-112.yandex.com |

Are these legitimate bots, like search engines, etc? If so, then the obvious solution is to use robots.txt to block them.

But that won’t help if they are some sort of malware. In that case, are they coming from any particular country? If so, you could consider blocking the entire range of IP addresses, but then you’d also be blocking legitimate visitors from that country.
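If you did want to experiment with that at the application level, a rough PHP sketch of a range block might look like the following. This is only illustrative: the 113.128.0.0/16 range is just taken from the hit list above and is not a confirmed bot network, so check your raw logs before blocking anything.

```php
<?php
// Rough sketch only: deny requests whose IP falls inside a given CIDR range.
// IPv4 only. The range below is just an example taken from the stats above --
// verify against your own raw logs before blocking anything.
function ip_in_range($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$blockedRanges = array('113.128.0.0/16'); // illustrative, not a confirmed bot network

foreach ($blockedRanges as $range) {
    if (ip_in_range($_SERVER['REMOTE_ADDR'], $range)) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}
```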

Maybe somebody else will have a better suggestion.

Mike

You could try something like Crawl Protect.

Try this in your robots.txt file - it worked for me when I had Gigabytes of Russian bots:

Crawl-delay: 10

(Put it under the relevant User-agent group.) One caveat: the rule is ignored by Googlebot; Google accepts it but doesn't act on it.

The problem with robots.txt (as I mentioned earlier) is that, if the bots are malignant, they won’t take any notice of it.

Mike

I forgot to mention that the “Gigabytes of robots” were accessing my site on a daily basis.

Does this come in the malignant category?

I don’t know what you mean by “Gigabytes of robots”. In general, if the bot comes from a reputable company, like Google or Alexa, then it will respect robots.txt. These bots are generally well behaved and won’t cause any problems with your hosting.

But if the bot has some nefarious purpose, like harvesting email addresses, then it won’t take any notice of robots.txt, and you’ll have to find some other way of blocking it.

Mike

More and more bots such as brandwatch.net are crawling sites looking for things said about clients. They can suck down a lot of data and use a lot of resources. robots.txt isn’t going to block them. Your best bet is to use htaccess.

There are a couple of ways you can do it. You can block the bots using their user agent string. I have had success with this but have not been able to block Baiduspider no matter how many different permutations I have tried in htaccess. I have blocked the rest of the bots, though. When they visit they get a 403 Forbidden error page. I put this in htaccess:

#Block bots.
SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
SetEnvIfNoCase User-Agent "^baidu" bad_bot
SetEnvIfNoCase User-Agent "^baidu*" bad_bot
SetEnvIfNoCase User-Agent "^Baiduspider/2.0" bad_bot
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
SetEnvIfNoCase User-Agent "^YandexBot" bad_bot
SetEnvIfNoCase User-Agent "^magpie-crawler" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot

magpie-crawler is brandwatch.net.

As I said, I have not been successful at blocking Baidu using this method but have blocked everything else. I’m going to have to resort to using an IP address range to block Baidu because the user agent string isn’t working.

Another method you can use is to rewrite based on the user agent string as described here:

I don’t know how efficient that is.

Also see this for more ideas:

If your web server is Apache:
Quote from http://www.uk-cheapest.co.uk/blog/2010/11/how-do-i-block-the-baiduspider-from-crawling-my-site/ :
"You can easily disable the Baidu spider by placing the following in your .htaccess file:

BrowserMatchNoCase Baiduspider bad_bot
Deny from env=bad_bot

Using this method saves you the trouble of having to find blocks of Baidu IP addresses and block them individually."

However, since it seems you have PHP on your server and since Baidu ignores your robots.txt, how about a reverse DOS?

I block Baidu differently on my web server. When Baidu requests my default web page, some PHP code delays this page request for 999 seconds. This keeps one IP socket busy on both my server and the Baidu server until the default IP ‘timeout error’ occurs, which keeps Baidu from bothering other web servers for a minute or so. It is, kind of, a reverse DOS attack on Baidu. Zero (0) bytes are transferred.
http://gelm.net/How-to-block-Baidu-with-PHP.htm
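
For what it's worth, a minimal sketch of that kind of tarpit might look like the following. This is not the code from the linked page, just an illustration of the idea, and shared hosts may kill long-running scripts regardless:

```php
<?php
// Illustration of the "reverse DOS" idea described above (not the code from
// the linked page): if the visitor identifies as Baiduspider, hold the
// connection open and send nothing back.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'Baiduspider') !== false) {
    ignore_user_abort(true); // keep going even if the crawler drops the connection
    set_time_limit(0);       // ask PHP not to kill the script (hosts may override this)
    sleep(999);              // tie up the crawler's socket for ~16 minutes
    exit;                    // zero bytes of page content are sent
}
```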

Be VERY careful with this. I would never recommend this as an option; instead, you should block it at the firewall level. Though that is of course not that easy when you're on a shared server.

The reason I do not recommend using sleep() in PHP (or any other language, for that matter) in this case is that if someone finds out you are doing this, it becomes very easy to take your server “down”, i.e. use up all of the available web server threads and thereby deny access to the server for real customers.

If you ask me… Lunarpages' network is managed by idiots. If they had any idea of what they were doing, they would block traffic that is causing high loads on their servers. They can easily do this with any good firewall. For example, a real human would not make a dozen connections in the span of a few seconds, so you can drop clients that open too many connections within a certain threshold.
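
If you can't touch the firewall on a shared box, a very rough application-level version of that threshold idea is possible in PHP. This is only a sketch with made-up numbers, and it is nowhere near as effective as dropping the connections at the network level:

```php
<?php
// Rough per-IP rate limiting in PHP, for when you have no firewall access.
// The window, limit and temp-file approach are arbitrary examples -- tune them
// (or replace the files with APCu/memcached) for real traffic.
$ip     = $_SERVER['REMOTE_ADDR'];
$window = 10;  // seconds
$limit  = 20;  // max requests per IP per window
$file   = sys_get_temp_dir() . '/rate_' . md5($ip . '_' . floor(time() / $window));

$count = is_file($file) ? (int) file_get_contents($file) : 0;
file_put_contents($file, $count + 1); // stale counter files are left for the tmp cleaner

if ($count + 1 > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    header('Retry-After: ' . $window);
    exit;
}
```

Even then, by the time PHP runs, the connection has already been accepted and the interpreter spun up, so this only softens the load rather than removing it.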