PHP Web Crawler

Hi SP,

I’m creating a small web spider in PHP that will read some RSS feeds for a client. It got me wondering about what a spider does and how it identifies itself to a web server. For instance, when I view the connection logs on my server I sometimes see things like “bing-bot”, “google-bot”, etc., and I was curious: how do they announce their crawling sessions like this? Would it be possible for me to set up something like that in PHP?

Those values come from the User-Agent header sent by the crawler.

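You can send your own value. Note that header() only sets headers on the response your script sends back, not on requests it makes. For file_get_contents() and fopen() over HTTP, set the user_agent ini setting instead:

ini_set('user_agent', 'My Crawler Name');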

or, if you use cURL:

curl_setopt($c, CURLOPT_USERAGENT, 'My Crawler Name');
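
For the RSS use case in the question, here is a minimal sketch of a fetch that sends a custom User-Agent through PHP's HTTP stream wrapper, as an alternative to cURL. The crawler name, the example.com URL, and the function names crawlerContext and fetchFeed are placeholders made up for illustration, not part of any library.

```php
<?php
// Hypothetical crawler identity; the name and URL are placeholders.
const CRAWLER_UA = 'MyFeedCrawler/1.0 (+http://example.com/bot-info)';

/**
 * Build a stream context whose HTTP requests carry the given User-Agent.
 */
function crawlerContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds
        ],
    ]);
}

/**
 * Fetch a URL with the crawler's User-Agent.
 * Returns the response body, or null if the request failed.
 */
function fetchFeed(string $url, string $userAgent = CRAWLER_UA): ?string
{
    $body = @file_get_contents($url, false, crawlerContext($userAgent));
    return $body === false ? null : $body;
}
```

Calling fetchFeed() with a feed URL should make the request appear in the remote server's access log under whatever string you chose, and the returned body can then be handed to an XML parser such as simplexml_load_string().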

If you want to get into the nitty-gritty, check out section 14.43 of RFC 2616:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

User-Agent = "User-Agent" ":" 1*( product | comment )
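
To make that grammar concrete: a product is a name with an optional "/version", and a comment is free text in parentheses. The sketch below assembles one product plus one optional comment. The function name buildUserAgent and the sample values are assumptions for illustration, and the token check is a deliberately conservative subset of what the RFC allows.

```php
<?php
/**
 * Assemble a User-Agent value of the form "product/version (comment)".
 * Illustrative helper only; it handles one product and one optional comment.
 */
function buildUserAgent(string $product, string $version, string $comment = ''): string
{
    // Conservative token check: no whitespace or separator characters.
    $token = '/^[A-Za-z0-9!#$%&*+.^_|~-]+$/';
    if (!preg_match($token, $product) || !preg_match($token, $version)) {
        throw new InvalidArgumentException('Product and version must be single tokens');
    }
    // Nested or unbalanced parentheses in a comment would need escaping; reject them here.
    if (strpbrk($comment, '()') !== false) {
        throw new InvalidArgumentException('Comment must not contain parentheses');
    }

    $ua = $product . '/' . $version;
    if ($comment !== '') {
        $ua .= ' (' . $comment . ')';
    }
    return $ua;
}

echo buildUserAgent('MyFeedCrawler', '1.0', '+http://example.com/bot-info'), "\n";
// prints: MyFeedCrawler/1.0 (+http://example.com/bot-info)
```

Putting a contact URL in the comment is a common convention among crawlers, since it gives site owners somewhere to look when they spot the bot in their logs.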


Thanks guys. Appreciate that.

I would recommend checking out this project.

http://querypath.org/
