Blocking bots with .htaccess

Hi,

I noticed two unknown bots in my stats file which seem to be consuming bandwidth and I want to block them. Here are the entries in my stats file:

Unknown robot (identified by 'spider')
Unknown robot (identified by 'bot*')

I searched the web and came up with the following code:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot*
RewriteRule ^.* - [F,L]

I wanted to ask if the above code is correct before implementing it, especially the fourth line; I saw four variations of it, so I am not sure whether the one I picked will work.

Thanks.

Not so much an answer to your actual question, as a suggested solution to your problem.

I love that script - does a great job of eliminating bots which don't play by the rules. I've used it on three sites for years.

Thanks for the link, I will inspect it.

One thing, does it not have a negative effect on the site's performance (loading speed), etc.?

Algorithmically that solution seems more efficient, but performance-wise, how does it compare to the .htaccess way?

No. It adds virtually nothing to the page weight, and only bots ignoring the "nofollow" and visiting the "forbidden" link actually reach the black hole.
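
For anyone wondering how it works: the script publishes a link that human visitors never see and that robots.txt forbids, so only misbehaving bots ever request it. A rough sketch of the robots.txt side, assuming the trap lives at /blackhole/ (the actual path depends on where you install the script):

# robots.txt - well-behaved bots read this and never request the trap URL
User-agent: *
Disallow: /blackhole/

The page itself links to that path with rel="nofollow", so compliant crawlers skip it twice over; anything that still fetches it gets banned by the script.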

[quote="nayen, post:3, topic:192467"]
Algorithmically that solution seems more efficient, but performance-wise, how does it compare to the .htaccess way?
[/quote]

I'm not quite sure I understand what you're asking, and I haven't used the .htaccess method to compare. But as I understand it, bots can "spoof" their identity and thus get round blocks - and there may be the risk of accidentally banning other bots which are not causing problems. The advantage of this method (IMHO) is that it blocks bots based on their behaviour. Bots which respect the robots.txt file and nofollow directives will not fall foul of this; only those which ignore your site settings will be affected - and they're the ones you definitely don't want.

Thanks for the clarification.

I love that script - does a great job of eliminating bots which don't play by the rules.

That's important too - remember that you don't want to block all bots/spiders/what-have-you - just the ones that won't "play by the rules".

That script looks pretty interesting. I'm gonna bookmark it and take another look next time I need something like that, thanks @technobear :smiley:

nayen,

Blackhole is a great way to deal with bad bots ... but back to your question.

Your code is fine EXCEPT that:

  1. I'd not use either start or end anchors, in order to catch anything with 'spider' or 'bot' anywhere in its user agent string (NOT reliable - the user agent can be changed).

  2. I'd not make Apache examine the whole {REQUEST_URI} (using ^ and *), so I'd just use .? for the regex, which will match anything (even an empty string) and trigger the FAIL condition specified.

Dealing specifically with bad bots is the preferred way to go, although it does take a little more effort on your part.

Regards,

DK

Thanks for the reply. With your last sentence, do you mean I should go with something like Blackhole instead of .htaccess blocking? And if I go with .htaccess, should the last line be:

RewriteRule .? - [F,L]

Is that right?

I first want to start with the easier, less complicated way, which seems to be .htaccess in this case; then, based on how things evolve, I will consider Blackhole or a similar script if needed.

Thanks everyone again for all the input.

nayen,

Correct.

If you go with your .htaccess code, remember to EXCLUDE googlebot with another RewriteCond, as TechnoBear's comment about blocking good bots is spot on! I believe that your code should be:

RewriteEngine on
# [NC] makes the match case-insensitive; [OR] means either condition is enough
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]
# .? matches any request; [F] sends 403 Forbidden, [L] stops further rewriting
RewriteRule .? - [F,L]

to eliminate the start and end anchors and match ONLY the * character after bot rather than bot, bott, bottt, ... - remember, * is a regex metacharacter!
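
To spell out the difference (the matched strings below are illustrative):

# bot*   matches any UA containing "bo" plus zero or more "t"s -
#        "googlebot", "bingbot" and friends would all be caught
# bot\*  matches only UAs containing the literal four characters "bot*"
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]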

If you wanted to match bot, {whatever}bot but not googlebot, you'd have to use an if ... then ... else structure using mod_rewrite's SKIP flag. That's relatively advanced and you don't need it (based on your original question), but I've got an example of the SKIP flag's use in my mod_rewrite tutorial at http://dk.co.nz/seo.

Regards,

DK

Thanks a lot DK. I will also check the link you provided. For the moment, I will start with .htaccess and see how things go.

nayen,

If bot\* doesn't work, use bot[*], which should do the same thing, i.e., require the * immediately following bot.
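
In other words, the two spellings are interchangeable ways of requiring a literal asterisk:

RewriteCond %{HTTP_USER_AGENT} bot\*  [NC]  # backslash escapes the metacharacter
RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]  # character class containing only *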

Regards,

DK

Hi DK, I put the .htaccess code in place and, when I check my stats file, it seems the two bots are successfully blocked. I will continue to monitor. Thanks again.

Hi DK,

Since I updated my .htaccess file with the above rules, I have been noticing a slow but gradual decrease in the number of pages indexed on Google, even though the number of pages on my website is increasing. Is it possible the code I have also blocks googlebot?

Here is the code I have:

RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]
RewriteRule .? - [F,L]

I am not experienced with regular expressions, but I suspect the second line might be blocking googlebot too. I'll be glad if you could take a look and tell me whether the above code has issues.

Good morning nayen,

I believe that googlebot identifies itself as "googlebot", which will match neither "spider" nor "bot*", so this code will not tell googlebot to "go away" (Fail) as it will other "spider"s and "bot*"s.

Had the code above NOT required the * immediately following bot, it would have blocked googlebot, too, but that's just not the case here.

Regards,

DK

Thanks a lot DK. After posting here, I signed up to Google Search Console and noticed that the googlebot can crawl my site fine. Also, I read that there may be fluctuations in the number of indexed pages for new sites so it is all cool now.

nayen,

Thank you for the confirmation that all's right with the world ... at least your corner of it! Glad it worked out for you.

Regards,

DK

Sorry for keeping you busy, but it seems bot* is not being blocked (the spider bot is blocked for good). I have tried the following two lines so far (separately):

RewriteCond %{HTTP_USER_AGENT} bot\* [NC]

RewriteCond %{HTTP_USER_AGENT} bot[*] [NC]

I kept observing the stats file after using both lines and the bot still visits and creates hits.

To clarify once more, I see the following in the Robots/Spiders visitors section of the Awstats page:

Unknown robot (identified by 'bot*') 

Any further ideas what I can use to block that one?

nayen,

Yes ... but of the two ways, I'm only sure of one.

  1. Group the bot* with NOT googlebot (with { } brackets???)

OR

  2. Split your bot blocker into two halves:

[code]RewriteEngine on

# Block anything identifying itself as "spider"
RewriteCond %{HTTP_USER_AGENT} spider [NC]
RewriteRule .? - [F,L]

# Block anything identifying itself as "bot" - unless it is googlebot
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .? - [F,L][/code]

2a. Split your bot blocker into two halves but use a Skip flag:

[code]RewriteEngine on

# If the UA is googlebot, skip the next rule (S=1) so it is never blocked
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteRule .? - [S=1]

# Otherwise block anything identifying itself as "spider" or "bot"
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .? - [F,L][/code]

The Skip flag tells mod_rewrite to skip the next S RewriteRules so, if the bot was not googlebot, you'd be back to the original code EXCEPT that you'd be killing all bots (except googlebot) and not just bot*.

If anyone here is an expert on AWSTATS, perhaps they can shed some light on bot* (it may mean 'bot' followed by zero or more characters).

Either of the two #2 options will allow googlebot but block any other bot-identified user agent.
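
And if you ever need to let more than one good bot through, the negated-condition idea extends naturally; a rough sketch (the whitelisted names are examples only, not a recommendation):

[code]RewriteEngine on

# Let whitelisted crawlers through; block any other "spider" or "bot" UA
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|yandex) [NC]
RewriteCond %{HTTP_USER_AGENT} (spider|bot) [NC]
RewriteRule .? - [F,L][/code]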

Regards,

DK

DK,

Honestly, I didn't use the last code samples you provided, but I continued my research on the topic and finally noticed that my Awstats build is from 2010, a very old one, and hence it does not properly recognise some of the more recent legitimate bots/spiders such as YandexBot and Baiduspider. I also checked my raw access logs, which I should have done before, and from the look of it, the bots "spider" and "bot*" I have been trying so hard to block are most probably legitimate: bingbot or YandexBot for "bot*", and Baiduspider for "spider", because there aren't any other entries in the raw access logs containing spider or bot (except googlebot).

After this revelation, I have removed my bot-blocking code from .htaccess and will monitor my raw access log for a couple of days. I may also move my site to a more up-to-date host.

Thank you once more for all your help.

nayen,

Thanks for your message; all resolved at this point but, if you have problems in the future, don't hesitate to come back with questions.

Regards,

DK