Anyone Dealt with the Baiduspider Bot?

DebNCgal · February 19, 2009, 12:24am

A few days ago, I installed the WordPress Global Translator plugin. Since then I’ve noticed a lot of new spiders/bots coming in, which I suppose is normal.

However, one bot in particular, the Baiduspider bot, is disregarding the robots.txt instructions by going where it should not go. The disallow instructions I placed in the robots.txt file for the bot don’t work, either. The bot is also using a number of different IP addresses, so I’m not sure the IP address could be used to deny it access.

I’ve read that Baiduspider is a search engine from China. One of the translations I set up with the plugin was the Simplified Chinese translation.

I hope Baiduspider is not scraping my site, but it seems to be visiting every nook and cranny, including image folders. My site is a photo blog, so I’m concerned about my blog photos getting swiped on a large scale by this bot.

Every time I check the “Latest Visitors” section of my CPanel, Baiduspider is either currently on my site or has recently been back. It’s been coming and going a lot within the last few days.

Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site? I’d like to think it’s a harmless bot. Even if it is harmless, I still don’t like the fact that it’s disregarding the robots.txt file. Should I be concerned about this bot?

Thanks for any assistance.

Deb

zealus · February 19, 2009, 3:54am

It’s the search bot run by the Chinese search engine Baidu.

Robot Name: BaiDuSpider
Agent_String: Baiduspider+(+http://www.baidu.com/search/spider.htm)
URL: http://www.baidu.com/search/spider.htm
IP Addr: 220.181.32.11 220.181.32.16 220.181.32.22 220.181.32.49 220.181.32.51 220.181.32.64 220.181.32.68 220.181.32.98 220.181.50.207 220.181.50.220 61.135.168.131 61.135.168.14 61.135.168.173 61.135.168.39

More information can be found here: http://www.useragentstring.com/pages/Baiduspider/

You can ban these IP addresses on your server/domain to prevent Baidu from indexing your website. However, if you have no problem with Google indexing your pictures, I can hardly understand why you would have a problem with Baidu.

DebNCgal · February 19, 2009, 12:15pm

Thanks for all the specific info, zealus, especially for all the IP addresses.

I actually don’t allow Google to index my images. I don’t mind being indexed by Baiduspider, but it’s set to do whatever it wishes, with no regard to the robots.txt file. Google and a few other bots, on the other hand, at least abide by the robots.txt file.

I guess it’s just the typical battle-of-the-bots world — a love/hate relationship!

Thank you!

ameRie · February 19, 2009, 12:35pm

I haven’t heard of this before. It’s odd that the Baidu bots would disregard the robots.txt file. Am I understanding that right?

DebNCgal · February 19, 2009, 1:50pm

Well, let me back up a bit. After reviewing my robots.txt file, the Baiduspider doesn’t appear to have disregarded the specific disallows I had in place when the bot first started coming by. However, as of a few days ago, I inserted the following into the robots.txt file:

User-agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
Disallow: /

I just checked the Last Visitors panel of my CPanel, though, and there were several instances of: “Agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)” being on the site within the last few hours. So unless I’ve incorrectly entered the above Baiduspider disallow in my robots.txt file, it looks to me that at least the disallow statement completely banning the bot from the site is being disregarded by the bot — because it’s clearly still coming by.

I don’t really think it’s worth my time to attempt to block Baiduspider via my .htaccess file by trying to account for all the scores of IP addresses that are listed at http://www.useragentstring.com/Baiduspider_id_248.php for this specific version of the Baiduspider.

But perhaps, out of a degree of ignorance, I’m making more of this than I should — I’m not sure. If I were to succeed at disallowing the bot, would my site not be indexed at all for China?

However, I don’t allow the Googlebot-Image bot to index the individual photos on my site. And my fear is that Baiduspider might be doing that.

I admit, this area is somewhat new territory for me, so any corrective thoughts are appreciated. The only thing I do know is that there is an incredible amount of activity going on by the Baiduspider bot, and it has raised some question marks for me.

Thanks.

kev · February 23, 2009, 8:54pm

Put a blank index.html file in the images folder. This will block public viewing of the folder listing. In the head of the blank index.html file, put a robots noindex meta tag, something like <meta name="robots" content="noindex">.

spellDwhy · April 24, 2009, 6:39am

That sounds very good. But I want to get Chinese translation software. Can you help me? Thanks a million.

BzKid · December 8, 2009, 3:37pm

Hey, I know what you mean. Just this morning 72 freaking Baidu bots attacked my site. I don’t mind the attention, but I don’t think it’s needed because they all look through the same pages. I’m thinking about using a '*' as a wildcard to ban those rascals’ IPs, e.g. 220.181.7.*. I need to save bandwidth for actual people.

scheng1 · December 10, 2009, 4:43am

You will be facing plagiarism problems soon. The Chinese in mainland China do not think that plagiarism is a big deal.

plainsman · January 3, 2010, 7:22pm

DebNCgal

I too have received many visits from baidu. Every day there are between twelve and fifteen hits; always in pairs, sometimes three at a time. Most IP addresses start with either 123 or 220; a few start with 119. One of the two main IPs always gets a 404 code while the other(s) get a 200 code.

Some time back baidu seemed to be taking my photos (my site has hundreds of photographs). I turned on hotlink protection in CPanel. Since then, baidu only checks the existence of my site but does not crawl pages at all.

I am currently compiling my own list of IP addresses that baidu uses. I plan to block them all once there are no new numbers on the list.
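For anyone compiling a similar list, here is a minimal Python sketch of one way to do it, rather than collecting addresses by hand. It assumes the server writes the Apache "combined" log format, and the sample log lines, the function name `baidu_networks`, and the choice of grouping by /24 are all made up for illustration. It picks out requests whose user agent contains "Baiduspider" and counts them per network, so new ranges show up as new keys.

```python
import ipaddress
import re
from collections import Counter

# Hypothetical sample lines in Apache "combined" log format (an assumption;
# adjust the pattern below if your host logs in a different format).
SAMPLE_LOG = """\
220.181.32.11 - - [03/Jan/2010:10:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
220.181.32.16 - - [03/Jan/2010:10:00:02 +0000] "GET /missing HTTP/1.1" 404 312 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
61.135.168.14 - - [03/Jan/2010:10:05:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
192.0.2.7 - - [03/Jan/2010:10:06:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U)"
"""

# First field is the client IP; the last quoted field is the user agent.
LINE_RE = re.compile(r'^(\S+) .* "([^"]*)"$')


def baidu_networks(log_text, prefix=24):
    """Count Baiduspider requests per network of the given prefix length."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ip, agent = match.groups()
        if "baiduspider" not in agent.lower():
            continue
        try:
            # strict=False lets us pass a host address and get its network back
            network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        except ValueError:
            continue  # first field was a hostname or otherwise not an IP
        counts[str(network)] += 1
    return counts


if __name__ == "__main__":
    for network, hits in sorted(baidu_networks(SAMPLE_LOG).items()):
        print(network, hits)
```

Keep in mind that the user agent is whatever the client claims to be, so this only lists addresses that identify themselves as Baiduspider.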

Dan_Grossman · January 3, 2010, 9:44pm

I believe the user agent you should be using for this spider in robots.txt is simply “Baiduspider”. Not the full user agent string. Give it a try.

http://www.baidu.com/robots.txt

No different from Google, which asks you to use “Googlebot”, not “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”.
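To illustrate the difference, here is a small Python sketch that feeds both versions of the rule to the standard library's `urllib.robotparser` and asks whether the Baiduspider user agent may fetch a page. The helper name `is_allowed` and the path `/images/photo.jpg` are made up for the example. This only shows how Python's parser matches user-agent lines; it is not a claim about how Baidu's own crawler reads the file, though it is consistent with the advice above.

```python
from urllib.robotparser import RobotFileParser

# The user agent string reported in the thread.
BAIDU_UA = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

# Rule as originally written: the full user agent string.
FULL_STRING_RULE = "User-agent: " + BAIDU_UA + "\nDisallow: /\n"

# Rule as suggested: just the product token.
SHORT_TOKEN_RULE = "User-agent: Baiduspider\nDisallow: /\n"


def is_allowed(robots_txt, user_agent, path="/"):
    """Return True if this robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    page = "/images/photo.jpg"  # hypothetical path
    # Full string: Python's parser finds no matching record, so access is allowed.
    print("full string rule allows Baiduspider:", is_allowed(FULL_STRING_RULE, BAIDU_UA, page))
    # Short token: the record matches and Disallow: / applies.
    print("short token rule allows Baiduspider:", is_allowed(SHORT_TOKEN_RULE, BAIDU_UA, page))
    # Other crawlers are unaffected by a Baiduspider-only record.
    print("short token rule allows Googlebot:", is_allowed(SHORT_TOKEN_RULE, "Googlebot", page))
```

With the short token the parser reports the page as disallowed for Baiduspider while leaving other bots alone, which is the behavior the original rule was meant to produce.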

plainsman · January 3, 2010, 11:23pm

Dan,

I appreciate the advice. However, baidu is no longer scanning any files in my site – ever since I enabled Hotlink Protection. What it is doing is filling up my stats with visits to " / ". Plus every second or third visit gets the 404 file which skews the information I am working with for monitoring my website. As much trouble as it will be to block all of the various IP addresses, I think that is what I will do. Unless you can suggest some other alternative.

gelmce · August 15, 2010, 11:59pm

“Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site?”

Baidu should be denied access to your server and below is a suggestion of
what you should do.

If your web server is Apache, you can return a ‘403 Forbidden’ error message by
editing your .htaccess file in the root of your server path. e.g.
Order allow,deny
Deny from 119.63.192.0/21
Deny from 123.122.0.0/20
Deny from 220.181.0.0/16
Allow from all

Even better, if you have PHP on your web server, is to make Baidu wait
up to 999 seconds for a page request.
See: http://gelm.net/How-to-block-Baidu-with-PHP.htm
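Before relying on the Deny ranges above, it may be worth checking them against addresses you have actually seen. The following Python sketch uses the standard `ipaddress` module to test a few of the IPs listed earlier in this thread against those three ranges. The function name `is_denied` is made up for the example, and the sample picks only four of the listed addresses.

```python
import ipaddress

# The three ranges from the .htaccess rules above.
DENY_RANGES = ["119.63.192.0/21", "123.122.0.0/20", "220.181.0.0/16"]

# A few of the Baiduspider addresses listed earlier in the thread.
LISTED_IPS = ["220.181.32.11", "220.181.50.207", "61.135.168.131", "61.135.168.14"]


def is_denied(ip, ranges=DENY_RANGES):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    for ip in LISTED_IPS:
        print(ip, "denied" if is_denied(ip) else "NOT covered")
```

In this sample the 220.181.x.x addresses are covered, but the 61.135.168.x addresses from the earlier list are not, so a fourth Deny line would be needed if those are still in use.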

BLZ · August 16, 2010, 5:34am

I like the 999 second thing.
I’ve been thinking about this A LOT lately, IMO searching is content theft technically. A bot’s adherence to your robots file is the only thing that would make it even remotely acceptable in my eyes. So yeah, I think tie the little blighter up.

Those little bot buggers eat up bandwidth, especially on sites run on out of box CMS where like 90% of the files are useless and never seen by the real users (but I’m way too lazy to sift through, or put in a hideously long robots file because there is no “Allow” functionality)

Alexa12345 · September 3, 2010, 12:19am

Okay, the thing that must be done is to verify anything before downloading or installing it. Inform yourself first about the “secondary effects” of the programs, so you won’t have problems.

PJdreams · September 8, 2010, 12:10am

China does things differently on the internet.

Baidu will gather information about your site, then figure out for itself whether it’s a site they will allow or disallow. Baidu is a search engine. It was Google’s main competitor while Google was still operating in China, but now Baidu has taken over completely again. It was still #1 when Google was there anyway.

Anyone know what baidu means in Chinese? Search or another type of random name like Google?

susanqy · September 9, 2010, 9:47am

jvfconsulting · September 20, 2010, 6:31pm

Can you show us what your robots.txt is reading? Maybe something was typed incorrectly?