First-time poster here. I could really do with some help, as I can't find a solution on the internet.
I've made a script that uses cURL to pull information from Wikipedia and insert it into a MySQL DB. So far I have over 3,000 records but need more, and I'm now getting this error when I run the script:
"Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice."
Now I know what I need to do: I found the instructions on the wiki (User-Agent policy - Meta), and apparently I need to send a User-Agent string, but I just don't know how to do this.
I've tried adding this code:
ini_set( "user_agent", "MediaArchiver (+http://www.mywebsite.com/)");
but that doesn't work.
Any idea how I'm going to go about this? Thanks to anyone who can help, because I can't find anything else on the internet to guide me.
I'm still learning cURL myself, but I think you're going to need curl_setopt(). As far as I know, ini_set("user_agent", ...) only sets the User-Agent for PHP's stream functions (file_get_contents() and friends), not for cURL.
Yep, curl_setopt() is the way to go:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, "MediaArchiver (+http://www.mywebsite.com/)");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // get the response back as a string
$html = curl_exec($ch);
Thanks for the quick replies, and I really appreciate the help, but the above code didn't stop the error message.
Has anyone had any experience using cURL on Wikipedia and knows how to get round this?
You have to slow down. Pull too much from Wikipedia in too short a time and they will consider it a bad bot.
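For what it's worth, the crude fix is just to wait between requests. A tiny helper along these lines works; the one-second interval below is my own guess, not an official Wikipedia limit:

```php
<?php
// Returns how many seconds to sleep before the next request, given when the
// last request went out and the minimum interval we want between requests.
function secondsToWait(float $lastRequestAt, float $minInterval, float $now): float
{
    $elapsed = $now - $lastRequestAt;
    return ($elapsed >= $minInterval) ? 0.0 : ($minInterval - $elapsed);
}
```

Then in your fetch loop, before each request, do something like usleep((int)(secondsToWait($last, 1.0, microtime(true)) * 1e6)); and reset $last afterwards.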
Consider hitting DBpedia instead.
It'll mean learning some SPARQL, but it's not far removed from SQL as long as you grok what a namespace is.
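As a taster, here's roughly what hitting DBpedia's public SPARQL endpoint looks like from PHP. The query itself is just an illustration (English abstracts of music artists); treat the class and property names as a starting point and check them against the DBpedia ontology:

```php
<?php
// Build a request URL for DBpedia's SPARQL endpoint, asking for JSON results.
// dbo: is the DBpedia ontology namespace.
$query = <<<SPARQL
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?artist ?abstract WHERE {
  ?artist a dbo:MusicalArtist ;
          dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
LIMIT 100
SPARQL;

$url = "http://dbpedia.org/sparql?format=json&query=" . urlencode($query);
// Fetch $url with cURL (remember the User-Agent!) and json_decode() the result.
```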
Looking at Cups' reply, perhaps I'm not doing this the best way...
What I'm trying to do is compile a database of English Music Artists, Movies, TV Shows, Video Games and Books.
The reason for this is that I want users on my site to be able to add these items to their "profiles", and I don't want them to be able to add false or made-up ones, which is why I'm going for the DB. I did consider writing a cURL script that fetched the information about the media item whenever a user adds a new one to their profile, but that would require too much processing time.
Thus I set about creating a cURL script that goes to Wikipedia, takes the name, image and description of the media item and sticks it in my DB (along with the original page, so I can link back to Wikipedia as per their terms).
DBpedia looks like a better way at a quick glance, but perhaps there's some other suggestion? I did try exporting the MusicBrainz DB, but that only covers music.
Once again, I want to thank everyone for their replies. I'm new to PHP but have been coding in Java for a while.
Yeah, you can hit DBpedia and cache what you want locally; it sounds like you just want the contents of the infobox.
You'll then just need to routinely update your cached data.
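A dumb file cache is enough to start with. This is just a sketch (the md5-keyed JSON files and returning null for stale entries are my own choices, not anything DBpedia requires):

```php
<?php
// Read a cached entry; returns null if it's missing or older than $ttl seconds,
// in which case the caller should re-fetch and cache_put() the fresh data.
function cache_get(string $dir, string $key, int $ttl): ?array
{
    $file = $dir . "/" . md5($key) . ".json";
    if (!is_file($file) || (time() - filemtime($file)) > $ttl) {
        return null;
    }
    return json_decode(file_get_contents($file), true);
}

// Write an entry to the cache as JSON.
function cache_put(string $dir, string $key, array $data): void
{
    file_put_contents($dir . "/" . md5($key) . ".json", json_encode($data));
}
```

So cache_get($dir, "Radiohead", 7 * 24 * 3600) either gives you the stored record or null, and on null you go back to DBpedia.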