PHP Web Crawler

wh33t · April 18, 2015, 2:03am

Hi SP,

I’m creating a small web spider in PHP that will read some RSS feeds for a client. It got me wondering how a spider identifies itself to a web server. For instance, when I view the connection logs on my server I sometimes see things like “bing-bot”, “google-bot”, etc. How do they announce their crawling sessions like this? Would it be possible for me to set up something like that in PHP?

megazoid · April 18, 2015, 8:54am

Those values come from the User-Agent header sent by the crawler.

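You can send your own User-Agent with each request. Note that header() won't do this, since it only sets headers on your own script's response. For PHP's stream functions (file_get_contents(), fopen(), etc.), set the user_agent ini option:

ini_set('user_agent', 'My Crawler Name');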

or, if you use cURL:

curl_setopt($c, CURLOPT_USERAGENT, 'My Crawler Name');
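For a slightly fuller picture, here is a rough sketch of a cURL fetch that identifies itself on every request. The crawler name, the contact URL, and the helper names (crawler_curl_options, fetch_feed) are all made up for illustration; it also assumes the cURL extension is installed.

```php
<?php
// Hypothetical crawler identity; substitute your own name and info URL.
const CRAWLER_UA = 'MyCrawler/1.0 (+http://example.com/bot-info)';

/**
 * Build the cURL options for a polite crawler request.
 * (Helper name is an assumption, not part of any library.)
 */
function crawler_curl_options(string $userAgent = CRAWLER_UA): array
{
    return [
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent request header
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,         // give up after 10 seconds
    ];
}

/**
 * Fetch a URL and return the response body, or null on failure.
 */
function fetch_feed(string $url, string $userAgent = CRAWLER_UA): ?string
{
    $c = curl_init($url);
    curl_setopt_array($c, crawler_curl_options($userAgent));
    $body = curl_exec($c);
    return $body === false ? null : $body;
}
```

Calling fetch_feed('http://example.com/feed.xml') should then show up in that server's access log with the MyCrawler string. On the receiving side, a PHP script can read the same value from $_SERVER['HTTP_USER_AGENT'].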

Mittineague · April 18, 2015, 8:59am

If you want to get into the nitty-gritty, check out section 14.43 of RFC 2616:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

User-Agent = "User-Agent" ":" 1*( product | comment )
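In other words, the value is one or more "product" tokens (name, optionally followed by "/" and a version) and parenthesized comments. As a minimal sketch of putting a conforming value together, something like the following could work. The function name build_user_agent and the example values are hypothetical, and it only covers the simple case of one product plus one optional comment.

```php
<?php
// Characters allowed in an RFC 2616 "token": printable ASCII minus separators.
const UA_TOKEN_RE = '/^[!#$%&\'*+\-.^_`|~0-9A-Za-z]+$/D';

/**
 * Build a "product/version (comment)" User-Agent value.
 * (Illustrative helper, not part of PHP or any library.)
 */
function build_user_agent(string $name, string $version, string $comment = ''): string
{
    if (!preg_match(UA_TOKEN_RE, $name) || !preg_match(UA_TOKEN_RE, $version)) {
        throw new InvalidArgumentException('Name and version must be RFC 2616 tokens');
    }

    $ua = $name . '/' . $version;

    if ($comment !== '') {
        // Backslash-escape backslashes and parentheses so the comment stays well-formed.
        $escaped = strtr($comment, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
        $ua .= ' (' . $escaped . ')';
    }

    return $ua;
}

echo build_user_agent('MyCrawler', '1.0', '+http://example.com/bot'), "\n";
// prints: MyCrawler/1.0 (+http://example.com/bot)
```

Putting a contact URL in the comment is a common convention among the big crawlers, so site owners can find out who is hitting their server.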

wh33t · April 19, 2015, 3:47am

Thanks guys. Appreciate that.

oddz · April 19, 2015, 5:25am

I would recommend checking out this project.

http://querypath.org/
