A few days ago, I installed the WordPress Global Translator plugin. Since then I’ve noticed a lot of new spiders/bots coming in, which I suppose is normal.
However, one bot in particular, the Baiduspider bot, is disregarding the robots.txt instructions by going where it should not go. The disallow instructions I placed in the robots.txt file for the bot don’t work, either. The bot is also using a number of different IP addresses, so I’m not sure the IP address could be used to deny it access.
I’ve read that Baiduspider is a search engine from China. One of the translations I set up with the plugin was the Simplified Chinese translation.
I hope Baiduspider is not scraping my site, but it seems to be visiting every nook and cranny, including image folders. My site is a photo blog, so I’m concerned about my blog photos getting swiped on a large scale by this bot.
Every time I check the “Latest Visitors” section of my CPanel, Baiduspider is either currently on my site or has recently been back. It’s been coming and going a lot within the last few days.
Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site? I’d like to think it’s a harmless bot. Even if it is harmless, I still don’t like the fact that it’s disregarding the robots.txt file. Should I be concerned about this bot?
You can ban IP addresses on your server/domain to prevent Baidu from indexing your web site. However, if you have no problem with Google indexing your pictures, I can hardly understand why you would have a problem with Baidu.
Thanks for all the specific info, zealus, especially for all the IP addresses.
I actually don’t allow Google to index my images. I don’t mind being indexed by Baiduspider, but it’s set to do whatever it wishes, with no regard to the robots.txt file. Google and a few other bots, on the other hand, at least abide by the robots.txt file.
I guess it’s just the typical battle-of-the-bots world — a love/hate relationship!
Well, let me back up a bit. After reviewing my robots.txt file, the Baiduspider doesn’t appear to have disregarded the specific disallows I had in place when the bot first started coming by. However, as of a few days ago, I inserted the following into the robots.txt file:
I just checked the Last Visitors panel of my CPanel, though, and there were several instances of: “Agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)” being on the site within the last few hours. So unless I’ve incorrectly entered the above Baiduspider disallow in my robots.txt file, it looks to me that at least the disallow statement completely banning the bot from the site is being disregarded by the bot — because it’s clearly still coming by.
I don’t really think it’s worth my time to attempt to block Baiduspider via my .htaccess file by trying to account for all the scores of IP addresses that are listed at http://www.useragentstring.com/Baiduspider_id_248.php for this specific version of the Baiduspider.
But perhaps, out of a degree of ignorance, I’m making more of this than I should — I’m not sure. If I were to succeed at disallowing the bot, would my site not be indexed at all for China?
However, I don’t allow the Googlebot-Image bot to index the individual photos on my site. And my fear is that Baiduspider might be doing that.
I admit, this area is somewhat new territory for me, so any corrective thoughts are appreciated. The only thing I do know is that there is an incredible amount of activity going on by the Baiduspider bot, and it has raised some question marks for me.
Put a blank index.html file in the images folder. This will block public viewing of the folder listing. In the head section of the blank index.html file, put something like a robots noindex meta tag, or whatever the exact command is.
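If the site runs on Apache, the same idea can also be expressed in an .htaccess file inside the images folder. The following is only a sketch under assumptions: it presumes the host permits `Options` and `Header` overrides in .htaccess, that mod_headers is loaded, and the extension list is just an example. It also only helps with crawlers that choose to honor the `X-Robots-Tag` header.

```apache
# Turn off automatic directory listings for this folder and its subfolders.
Options -Indexes

# Ask compliant crawlers not to index image files served from here.
# Requires mod_headers; the extension list is only an example.
<IfModule mod_headers.c>
    <FilesMatch "\.(jpe?g|png|gif)$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>
</IfModule>
```

Neither this nor a blank index.html stops a bot from requesting an image whose URL it already knows; both only hide the folder listing and signal a no-index preference.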
Hey, I know what you mean. Just this morning 72 freaking Baidu bots attacked my site. I don’t mind the attention, but I don’t think it’s needed because they all look through the same pages. I’m thinking about using a wildcard to ban those rascals’ IPs, e.g. 220.181.7. I need to save bandwidth for actual people.
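On Apache, a wildcard like that can be written without any special character. A minimal sketch, assuming the older Apache 2.2-style access directives are available in .htaccess (on Apache 2.4 they need mod_access_compat):

```apache
Order allow,deny
Allow from all

# A partial address matches every host whose IP starts with these bytes,
# so this line covers 220.181.7.0 through 220.181.7.255.
Deny from 220.181.7

# Equivalent CIDR form of the same range:
# Deny from 220.181.7.0/24
```

The 220.181.7 prefix is taken from the post above; whether it covers every address the bot uses is not something this sketch can confirm.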
I too have received many visits from baidu. Every day there are between twelve and fifteen hits; always in pairs, sometimes three at a time. Most IP addresses start with either 123 or 220; a few start with 119. One of the two main IPs always gets a 404 code while the other(s) get a 200 code.
Some time back baidu seemed to be taking my photos (my site has hundreds of photographs). I turned on hotlink protection in CPanel. Since then, baidu only checks the existence of my site but does not crawl pages at all.
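For anyone without CPanel, hotlink protection is commonly implemented with mod_rewrite rules along these lines. This is a sketch, not the exact rules CPanel writes: `example.com` is a placeholder for your own domain, and the extension list is an assumption.

```apache
RewriteEngine On

# Uncomment the next line to let through requests that send no Referer
# at all (direct visits, some privacy tools, and most bots).
# RewriteCond %{HTTP_REFERER} !^$

# Requests whose Referer is not this site are refused.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]

# Answer matching image requests with 403 Forbidden.
RewriteRule \.(jpe?g|png|gif)$ - [F,NC]
```

With the blank-referer line left commented out, a crawler that sends no Referer header is refused every image, which may explain a bot no longer picking up photos. The trade-off is that some legitimate visitors who send no Referer are refused as well.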
I am currently compiling my own list of IP addresses that baidu uses. I plan to block them all once there are no new numbers on the list.
I appreciate the advice. However, baidu is no longer scanning any files in my site – ever since I enabled Hotlink Protection. What it is doing is filling up my stats with visits to " / ". Plus every second or third visit gets the 404 file which skews the information I am working with for monitoring my website. As much trouble as it will be to block all of the various IP addresses, I think that is what I will do. Unless you can suggest some other alternative.
“Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site?”
Baidu should be denied access to your server, and below is a suggestion of what you should do.
If your web server is Apache, you can return a ‘403 Forbidden’ error message by editing the .htaccess file in the root of your server path, e.g.
Order allow,deny
Deny from 119.63.192.0/21
Deny from 123.122.0.0/20
Deny from 220.181.0.0/16
Allow from all
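Because the bot comes from many addresses, an alternative is to match its User-Agent string rather than its IP. The sketch below assumes mod_setenvif is loaded and keeps the same Apache 2.2-style directives as the rules above (on Apache 2.4 these need mod_access_compat). The variable name `bad_bot` is arbitrary and chosen only for this example.

```apache
# Flag any request whose User-Agent contains "Baiduspider" (case-insensitive).
# "bad_bot" is an arbitrary variable name chosen for this sketch.
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot

# Allow everyone, then deny requests that carry the flag.
Order allow,deny
Allow from all
Deny from env=bad_bot
```

To combine this with the IP rules, add the `SetEnvIfNoCase` line and the `Deny from env=bad_bot` line to the existing block. This only stops clients that identify themselves as Baiduspider; a scraper sending a different User-Agent is not affected.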
I like the 999 second thing.
I’ve been thinking about this A LOT lately, IMO searching is content theft technically. A bot’s adherence to your robots file is the only thing that would make it even remotely acceptable in my eyes. So yeah, I think tie the little blighter up.
Those little bot buggers eat up bandwidth, especially on sites run on an out-of-the-box CMS where like 90% of the files are useless and never seen by real users (but I’m way too lazy to sift through them, or to put in a hideously long robots file because there is no “Allow” functionality).
Okay, the thing that must be done is to verify anything before downloading or installing it. Inform yourself first about the “secondary effects” of the programs, so you won’t have problems.
Baidu will gather information about your site, then figure out whether it’s a site they will allow or disallow themselves. Baidu is a search engine. It competed with Google while Google was still in town, and it was #1 even then; now Baidu has taken over completely again.
Anyone know what baidu means in Chinese? Search or another type of random name like Google?