Blocking bots with .htaccess

Hi,

I noticed two unknown bots in my stats file which seem to be consuming bandwidth and I want to block them. Here are the entries in my stats file:

Unknown robot (identified by 'spider')
Unknown robot (identified by 'bot*')

I searched the web and came up with the following code:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot*
RewriteRule ^.* - [F,L]

I wanted to ask if the above code is correct before implementing it, especially the fourth line; I saw four variations of it, so I am not sure whether the one I picked will work.

Thanks.

Not so much an answer to your actual question, as a suggested solution to your problem.

I love that script - does a great job of eliminating bots which don't play by the rules. I've used it on three sites for years.

Thanks for the link, I will inspect it.

One thing, does it not have a negative effect on the site's performance (loading speed), etc.?

Algorithmically that solution seems more efficient, but performance-wise, how does it compare to the .htaccess way?

No. It adds virtually nothing to the page weight, and only bots ignoring the "nofollow" and visiting the "forbidden" link actually reach the black hole.
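
For anyone wondering how it works: the script publishes a link that human visitors never see and that robots.txt forbids, so only misbehaving bots ever request it. A rough sketch of the robots.txt side, assuming the trap lives at /blackhole/ (the actual path depends on where you install the script):

# robots.txt - well-behaved bots read this and never request the trap URL
User-agent: *
Disallow: /blackhole/

The page itself links to that path with rel="nofollow", so compliant crawlers skip it twice over; anything that still fetches it gets banned by the script.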

[quote="nayen, post:3, topic:192467"]
Algorithmically that solution seems more efficient, but performance-wise, how does it compare to the .htaccess way?
[/quote]

I'm not quite sure I understand what you're asking, and I haven't used the .htaccess method to compare. But as I understand it, bots can "spoof" their identity and thus get round blocks - and there may be the risk of accidentally banning other bots which are not causing problems. The advantage of this method (IMHO) is that it blocks bots based on their behaviour. Bots which respect the robots.txt file and nofollow directives will not fall foul of this; only those which ignore your site settings will be affected - and they're the ones you definitely don't want.

Thanks for the clarification.

I love that script - does a great job of eliminating bots which don't play by the rules.

That's important too - remember that you don't want to block all bots/spiders/what-have-you - just the ones that won't "play by the rules".

That script looks pretty interesting. I'm gonna bookmark it and take another look next time I need something like that, thanks @technobear :smiley:

nayen,

Blackhole is a great way to deal with bad bots ... but back to your question.

Your code is fine EXCEPT that:

  1. I'd not use either start or end anchors, in order to catch anything with 'spider' or 'bot' anywhere in its user agent string (NOT reliable - the user agent can be changed).

  2. I'd not make Apache examine the whole {REQUEST_URI} (using ^ and *), so I'd just use .? for the regex, which will match anything (even an empty string) and trigger the FAIL condition specified.

Dealing specifically with bad bots is the preferred way to go, although it does take a little more effort on your part.

Regards,

DK

Thanks for the reply. With your last sentence, do you mean I should go with something like Blackhole instead of .htaccess blocking? And if I go with .htaccess, should the last line be:

RewriteRule .? - [F,L]

Is that right?

I first want to start with the easier, less complicated way, which seems to be .htaccess in this case; then, based on how things evolve, I will consider Blackhole or a similar script if needed.

Thanks everyone again for all the input.

nayen,

Correct.

If you go with your .htaccess code, remember to EXCLUDE googlebot with another RewriteCond, as TechnoBear's comment about blocking good bots is spot on! I believe that your code should be:

RewriteEngine on
# [NC] makes the match case-insensitive; [OR] means either condition is enough
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]
# .? matches any request; [F] sends 403 Forbidden, [L] stops further rewriting
RewriteRule .? - [F,L]

to eliminate the start and end anchors and match ONLY the * character after bot rather than bot, bott, bottt, ... - remember, * is a regex metacharacter!
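
To spell out the difference (the matched strings below are illustrative):

# bot*   matches any UA containing "bo" plus zero or more "t"s -
#        "googlebot", "bingbot" and friends would all be caught
# bot\*  matches only UAs containing the literal four characters "bot*"
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]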

If you wanted to match bot, {whatever}bot but not googlebot, you'd have to use an if ... then ... else structure using mod_rewrite's SKIP flag. That's relatively advanced and you don't need it (based on your original question), but I've got an example of the SKIP flag's use in my mod_rewrite tutorial at http://dk.co.nz/seo.

Regards,

DK

Thanks a lot DK. I will also check the link you provided. For the moment, I will start with .htaccess and see how things go.

nayen,

If bot\* doesn't work, use bot[*], which should do the same thing, i.e., require the * immediately following bot.
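
In other words, the two spellings are interchangeable ways of requiring a literal asterisk:

RewriteCond %{HTTP_USER_AGENT} bot\*  [NC]  # backslash escapes the metacharacter
RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]  # character class containing only *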

Regards,

DK

Hi DK, I put the .htaccess code in place and, when I check my stats file, it seems the two bots are successfully blocked. I will continue to monitor. Thanks again.

Hi DK,

Since I updated my .htaccess file with the above rules, I have been noticing a slow but gradual decrease in the number of pages indexed on Google, even though the number of pages on my website is increasing. Is it possible the code I have also blocks googlebot?

Here is the code I have:

RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]
RewriteRule .? - [F,L]

I am not experienced with regular expressions, but I suspect the second line might be blocking googlebot too. I'll be glad if you could take a look and tell me whether the above code has issues.

Good morning nayen,

I believe that googlebot identifies itself as "googlebot", which will match neither "spider" nor "bot*", so this code will not tell googlebot to "go away" (Fail) as it will other "spider"s and "bot*"s.

Had the code above NOT required the * immediately following bot, it would have blocked googlebot, too, but that's just not the case here.

Regards,

DK

Thanks a lot DK. After posting here, I signed up to Google Search Console and noticed that the googlebot can crawl my site fine. Also, I read that there may be fluctuations in the number of indexed pages for new sites so it is all cool now.

nayen,

Thank you for the confirmation that all's right with the world ... at least your corner of it! Glad it worked out for you.

Regards,

DK

Sorry for keeping you busy, but it seems bot* is not being blocked (the spider bot is blocked for good). I have tried the following two lines so far (separately):

RewriteCond %{HTTP_USER_AGENT} bot\* [NC]

RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]

I kept observing the stats file after using both lines and the bot still visits and creates hits.

To clarify once more, I see the following in the Robots/Spiders visitors section of the Awstats page:

Unknown robot (identified by 'bot*') 

Any further ideas what I can use to block that one?

nayen,

Yes ... but of the two ways, I'm only sure of one.

  1. Group the bot* with NOT googlebot (with { } brackets???)

OR

  2. Split your bot blocker into two halves:

[code]RewriteEngine on

# Block anything identifying itself as "spider"
RewriteCond %{HTTP_USER_AGENT} spider [NC]
RewriteRule .? - [F,L]

# Block anything identifying itself as "bot" - unless it is googlebot
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .? - [F,L][/code]

2a. Split your bot blocker into two halves but use a Skip flag:

[code]RewriteEngine on

# If the UA is googlebot, skip the next rule (S=1) so it is never blocked
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteRule .? - [S=1]

# Otherwise block anything identifying itself as "spider" or "bot"
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .? - [F,L][/code]

The Skip flag tells mod_rewrite to skip the next S RewriteRules so, if the bot was not googlebot, you'd be back to the original code EXCEPT that you'd be killing all bots (except googlebot) and not just bot*.

If anyone here is an expert on AWSTATS, perhaps they can shed some light on bot* (it may mean 'bot' followed by zero or more characters).

Either of the two #2 options will allow googlebot but block any other bot-identified user agent.
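
And if you ever need to let more than one good bot through, the negated-condition idea extends naturally; a rough sketch (the whitelisted names are examples only, not a recommendation):

[code]RewriteEngine on

# Let whitelisted crawlers through; block any other "spider" or "bot" UA
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|yandex) [NC]
RewriteCond %{HTTP_USER_AGENT} (spider|bot) [NC]
RewriteRule .? - [F,L][/code]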

Regards,

DK

DK,

Honestly, I didn't use the last code samples you provided, but I continued my research on the topic and finally noticed that my Awstats build is from 2010, a very old one, and hence it does not properly recognise some of the more recent legitimate bots/spiders such as YandexBot and Baiduspider. I also checked my raw access logs, which I should have done before, and from the look of it, the bots "spider" and "bot*" I have been trying so hard to block are most probably legitimate: bingbot or YandexBot for "bot*", and Baiduspider for "spider", because there aren't any other entries in the raw access logs containing spider or bot (except googlebot).

After this revelation, I have removed my bot-blocking code from .htaccess and will monitor my raw access log for a couple of days. I may also move my site to a more up-to-date host.

Thank you once more for all your help.

nayen,

Thanks for your message; all resolved at this point but, if you have problems in the future, don't hesitate to come back with questions.

Regards,

DK