I have heard that malicious bots can often be identified by the fact that their requests usually do not contain an HTTP_ACCEPT_LANGUAGE header.
Anyone know how true this is?
And if true, how reliable is it? I want to die() when there is no HTTP_ACCEPT_LANGUAGE header in an effort to kill malicious bots.
Will I lose anything if I do, such as good bots like Google or Yahoo? Will I lose real people?
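For context, the check I have in mind is something like this minimal PHP sketch (treating a missing header as a bot is exactly the assumption I'm asking about):

```php
<?php
// Reject the request outright when no Accept-Language header was sent.
// Caution: some legitimate clients and crawlers may omit this header,
// which is the risk I'm asking about.
if (empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
    header('HTTP/1.1 403 Forbidden');
    die();
}
```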
Do you have any kind of CSRF protection in place?
Actually, I am dealing with form-submission software: it scans the internet for forms, stores the form variables, and then submits crap, in my case every 3 minutes. Since this is form-submission software, I don't think robots.txt can do anything for me; the bot has already come by.
It uses random proxies, so I can't block by IP. It sends a user agent, but that is likely faked and made to appear common.
So that is why I am investigating the Accept-Language thing I have heard about.
I don't know about the HTTP_ACCEPT_LANGUAGE theory, but I used to use the robots.txt method to find them.
Add a Disallow line in robots.txt for a file you do not use on your site, then grab the details of any bot that accesses that page.
It is surprising how many bots are out there. Some bots pretend to be a valid search engine like Google and sneak into your website without you realizing it. Not all bots obey robots.txt either.
Not only that, but as elgumbo mentioned, some use it to find out where you don't want them to go and then go there. Set up a "honey pot" and you'll catch some.
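A minimal sketch of that honey-pot setup, assuming a PHP site (the path /trap.php and the log filename are made-up examples): disallow a page that nothing links to, then log whoever requests it anyway.

```
# robots.txt
User-agent: *
Disallow: /trap.php
```

```php
<?php
// trap.php: disallowed in robots.txt and linked from nowhere, so
// only misbehaving bots should ever request it.
$entry = sprintf("%s %s %s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-');
// Append the visitor's details to a log file for later review.
file_put_contents(__DIR__ . '/bot-log.txt', $entry, FILE_APPEND | LOCK_EX);
```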
Don't think of the robots.txt file as a security measure by any means.
If you're worried about bots, it's worth looking at the code used by the well-known WordPress plugin 'Bad Behavior' (the code can also be used outside the plugin).