I have heard that malicious bots can often be identified by the fact that their requests usually do not contain an HTTP_ACCEPT_LANGUAGE header.
Anyone know how true this is?
And if true, how reliable is it? I want to die() when there is no HTTP_ACCEPT_LANGUAGE header in an effort to kill malicious bots.
Will I lose anything if I do, such as good bots like Google or Yahoo? Will I lose real people?
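For context, the check I have in mind is something like this minimal PHP sketch (treating a missing header as a bot is exactly the assumption I'm asking about):

```php
<?php
// Reject the request outright when no Accept-Language header was sent.
// Caution: some legitimate clients and crawlers may omit this header,
// which is the risk I'm asking about.
if (empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
    header('HTTP/1.1 403 Forbidden');
    die();
}
```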
Do you have any kind of CSRF protection in place?
Actually, I am dealing with form-submission software: it scans the internet for forms, stores the form variables, and then submits crap, in my case every 3 minutes. Since this is form-submission software, I don't think robots.txt can do anything for me; the bot has already come by.
It uses random proxies, so I can't block by IP. It sends a user agent, but that is likely faked and made to appear common.
So that is why I am investigating the Accept-Language thing I have heard about.
I don't know about the HTTP_ACCEPT_LANGUAGE theory, but I used to use the robots.txt method to find them.
Add a Disallow line in robots.txt for a file you do not use on your site, then grab the details of any bot that accesses that page.
It is surprising how many bots are out there. Some bots pretend to be a valid search engine like Google and sneak into your website without you realizing it. Not all bots obey robots.txt either.
Not only that, but as elgumbo mentioned, some use it to find out where you don't want them to go and then go there. Set up a "honey pot" and you'll catch some.
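A minimal sketch of that honey-pot setup, assuming a PHP site (the path /trap.php and the log filename are made-up examples): disallow a page that nothing links to, then log whoever requests it anyway.

```
# robots.txt
User-agent: *
Disallow: /trap.php
```

```php
<?php
// trap.php: disallowed in robots.txt and linked from nowhere, so
// only misbehaving bots should ever request it.
$entry = sprintf("%s %s %s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-');
// Append the visitor's details to a log file for later review.
file_put_contents(__DIR__ . '/bot-log.txt', $entry, FILE_APPEND | LOCK_EX);
```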
Don't think of the robots.txt file as a security measure by any means.
If you're worried about bots, it's worth looking at the code used by the well-known WordPress plugin 'Bad Behavior' (the code can also be used outside the plugin).