I wanted to ask if the above code is correct before implementing it. In particular, I have seen four variations of the 4th line, so I am not sure whether the one I picked will work.
No. It adds virtually nothing to the page weight, and only bots ignoring the "nofollow" and visiting the "forbidden" link actually reach the black hole.
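For anyone unfamiliar with the technique, the trap is tiny: a hidden link that good bots are told twice not to follow. A minimal sketch (the /trap/ path and markup are placeholders here, not the actual Blackhole script's):

# robots.txt - compliant bots are told not to request the trap URL
User-agent: *
Disallow: /trap/

<!-- hidden link in the page; well-behaved bots also honour the rel="nofollow" -->
<a href="/trap/" rel="nofollow" style="display:none">Do not follow</a>

Only a bot that ignores both directives ever requests /trap/, where the script can log and ban it.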
[quote="nayen, post:3, topic:192467"]
Algorithmically that solution seems more efficient, but how does it perform compared to the .htaccess approach?
[/quote]I'm not quite sure I understand what you're asking, and I haven't used the .htaccess method to compare. But as I understand it, bots can "spoof" their identity and thus get round blocks - and there may be a risk of accidentally banning other bots which are not causing problems. The advantage of this method (IMHO) is that it blocks bots based on their behaviour. Bots which respect the robots.txt file and nofollow directives will not fall foul of this; only those which ignore your site settings will be affected - and they're the ones you definitely don't want.
Blackhole is a great way to deal with bad bots … but back to your question.
Your code is fine EXCEPT that:
I'd not use either start or end anchors, in order to catch anything with "spider" or "bot" anywhere in its user agent string (NOT reliable - the user agent can be changed).
I'd not make Apache examine the entire {REQUEST_URI} (using the ^ and * metacharacters); I'd just use .? for the regex, which will match anything (even a blank request) and apply the FAIL condition specified.
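Putting those two points together, a minimal sketch (your actual conditions may differ):

# no anchors: match "spider" or "bot" anywhere in the user agent string
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# .? matches any request, even a blank one; [F] sends 403 Forbidden
RewriteRule .? - [F]

Note that an unanchored "bot" would also match googlebot; more on that below.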
Dealing specifically with bad bots is the preferred way to go, although it does take a little more effort on your part.
Thanks for the reply. By your last sentence, do you mean I should go with something like Blackhole instead of .htaccess blocking? And if I go with .htaccess, should I have the last line like this:
RewriteRule .? - [F,L]
Is that right?
I want to start with the easier, less complicated approach, which seems to be .htaccess in this case, and then, based on how things evolve, I will certainly consider Blackhole or a similar script if needed.
If you go with your .htaccess code, remember to EXCLUDE googlebot with another RewriteCond, as TechnoBear's comment about blocking good bots is spot on! I believe that your code should be:
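Something along these lines (a sketch reconstructed from the description that follows, so your exact patterns may differ):

# let googlebot through, then trap the problem user agents
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]
RewriteRule .? - [F]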
to eliminate the start and end anchors and match ONLY a literal * character after bot (i.e., bot\*) rather than bo, bot, bott, … - remember, * is a regex metacharacter meaning "zero or more of the preceding character"!
If you wanted to match bot, {whatever}bot but not googlebot, you'd have to use an if … then … else structure using mod_rewrite's SKIP flag. That's relatively advanced and you don't need it (based on your original question), but I've got an example of the SKIP flag's use in my mod_rewrite tutorial at http://dk.co.nz/seo.
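For the curious, a minimal sketch of that if … then … else structure (not the tutorial's exact code):

# IF the user agent is googlebot, skip the next rule (S=1), i.e., let it through
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteRule .? - [S=1]
# ELSE any other user agent containing "bot" gets a 403
RewriteCond %{HTTP_USER_AGENT} bot [NC]
RewriteRule .? - [F]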
Hi DK, I put the .htaccess code in place, and when I check my stats file, it seems the two bots are successfully blocked. I will continue to monitor. Thanks again.
Since I updated my .htaccess file with the above rules, I have noticed a slow but steady decrease in the number of pages indexed on Google, even though the number of pages on my website is increasing. Is it possible that the code I have also blocks googlebot?
I am not experienced with regular expressions, but I suspect the second line might be blocking googlebot too. I'd be glad if you could take a look and tell me whether the above code has issues.
I believe that googlebot identifies itself as "googlebot", and that will match neither "spider" nor "bot*", so this code will not tell googlebot to "go away" (Fail) as it will other "spider"s or "bot*"s.
Had the code above NOT required the * immediately following bot, it would have blocked googlebot, too, but that's just not the case here.
Thanks a lot DK. After posting here, I signed up for Google Search Console and noticed that googlebot can crawl my site fine. Also, I read that there may be fluctuations in the number of indexed pages for new sites, so it is all cool now.
Sorry for keeping you busy, but it seems bot* is not being blocked (spider bot is blocked for good). I have tried the following two lines so far (separately):
The SKIP flag tells mod_rewrite to skip the next S RewriteRules (where S is the number you specify), so, if the bot was not googlebot, you'd be back to the original code EXCEPT that you'd be killing all bots (except googlebot) and not just bot*.
If anyone here is an expert on AWStats, perhaps they can shed some light on bot* (the * there may be a wildcard, i.e., bot followed by zero or more characters, rather than a literal asterisk).
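The two readings would look quite different as mod_rewrite patterns (hypothetical, just to illustrate):

# literal reading: matches only user agents actually containing the characters bot*
RewriteCond %{HTTP_USER_AGENT} bot\* [NC]
# wildcard reading: "bot" followed by zero or more characters, i.e., any UA containing bot
RewriteCond %{HTTP_USER_AGENT} bot [NC]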
Either of the two #2 options will allow googlebot but block any other bot-identified user agent.
Honestly, I didn't use the last code samples you provided, but I continued my research on the topic and finally noticed that my AWStats build is from 2010, a very old one, and hence it does not properly list some of the more recent legitimate bots/spiders such as YandexBot and Baiduspider. I also checked my raw access logs (which I should have done before), and from the look of it, the "spider" and "bot*" entries I have been trying so hard to block are most probably bingbot or YandexBot for "bot*" and Baiduspider for "spider", because there aren't any other entries in the raw access logs containing spider or bot (except googlebot).
After this revelation, I have now removed my bot-blocking code from .htaccess and will monitor my raw access logs for a couple of days. I may also move my site to a more up-to-date host.