How to code a crawl script that copies the first paragraph from Wikipedia?

Hey guys,

Long time no see! I’m trying to implement a PHP script for my WordPress blog so that whenever someone visits my website.com/tag/tag-name section, they get a description of around 250 words from Wikipedia relating to the term.

For example, if my tag page is websitename.com/tag/theory-of-relativity, then the script should look up “Theory of Relativity” on Wikipedia and paste the first 250 words from the “Theory of relativity” article, so I can get a description of the term before I list the WordPress posts tagged with it.
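Roughly, the slug-to-title step I have in mind would look something like this (just a naive sketch; I don’t know Wikipedia’s exact capitalization rules, so treat this as a placeholder):

<?php
// Naive slug-to-title conversion; Wikipedia's real capitalization
// rules are subtler, so treat this as a starting point only.
$slug  = 'theory-of-relativity';         // from /tag/theory-of-relativity
$term  = str_replace('-', ' ', $slug);   // "theory of relativity"
$title = ucfirst($term);                 // "Theory of relativity"
echo $title;
?>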

This may sound like black-hat SEO spam to some of you, but I actually believe it’s good practice since it adds relevance, and it’s by no means unethical. I want my readers to have an idea of what a term is all about (I put a lot of tags on my blog that are very encyclopedic in nature), and this might help.

I have minimal programming experience and I would really appreciate it if someone could help me out with some insights on how to write a piece of code like this. And if someone finds the idea of such a script really useful, why not write it and share it with everyone? Thanks in advance for your efforts.

IMHO no code needed. When you write a post with a “tag” in it, include a link to the Wikipedia page. It will take you a little time to find the link, but only a minimal amount compared to writing the post.

Well, at some stage I think you’ll need to do some scraping, so reading this book will help:
Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL by Michael Schrenk (ISBN 9781593271206)

Another option is the Wikipedia database dumps - the entire Wikipedia database is available for download. Then you wouldn’t need to scrape it, just look it up in a local copy of the database.

Thank you for your replies.

@Mittineague: how can you customize your tag pages so you can add individual content for every tag page? From my experience, you can’t…

@frank1: scraping is exactly what this is about, I believe. Thanks for the suggestion.
@daniel15: yes, but that would mean I’d need to download entire gigabytes and then interpret their database structure. Plus, I need it to be up to date, and the best way to achieve that is by scraping Wikipedia directly.

I tried googling for something similar, but… I couldn’t find anything really relevant.

In that case I’d check out their API. The “parse” action will parse a page and return the HTML for that page. Example URLs:

http://en.wikipedia.org/w/api.php?action=parse&page=SitePoint&format=php - Data for “SitePoint” page in PHP serialised format - to use with unserialize() in PHP.
http://en.wikipedia.org/w/api.php?action=parse&page=SitePoint&format=xml - Data for “SitePoint” page in XML format

No scraping needed, the data is in an easy-to-use format for you.

I’d strongly recommend donating to Wikipedia if you use its data extensively. High usage of their servers means that your scraping costs them quite a bit of money (bandwidth, server processing time, etc.)

Here’s an example for you (PHP):


<?php
$page = 'SitePoint';
$api_url = 'http://en.wikipedia.org/w/api.php?action=parse&page=%s&format=php';

// MediaWiki API needs a user-agent to be specified
$context = stream_context_create(array('http' => array(
	'user_agent' => 'SitePoint example for topic 748667',
)));

// URL-encode the title so multi-word page names survive the query string
$url = sprintf($api_url, urlencode($page));
$data = unserialize(file_get_contents($url, false, $context));

// The parsed page HTML is returned under parse.text.* in the response
echo $data['parse']['text']['*'];
?>

Daniel,

Your code is wonderful and hits the spot, but it’s not there yet.

  1. If you type in a page title of more than one word, it won’t return anything.

For example, “new york” won’t show anything.

Then “new_york” will show you that you need to redirect and use proper capitalization.

Ultimately, “New_York” will show the proper display.

  2. I’d only like to show the first <p> from a Wikipedia listing. This will put less strain on my server as well as Wikipedia’s. And yes, I’ve donated to Wikipedia multiple times now, whenever a call-out was made.

A perfect example of what I’m trying to replicate can be seen at PhysOrg.com.

Just click on any of their posts, check out the right sidebar, and click on a tag or “more”.

CLEAR EXAMPLE: PhysOrg.com - magnetic field

Thank you everyone for your help.

Yes, my code was just an example, not a production-ready script. You will definitely have to modify it.

  1. If you type in a page title of more than one word, it won’t return anything.

For example, “new york” won’t show anything.

Then “new_york” will show you that you need to redirect and use proper capitalization.

Ultimately, “New_York” will show the proper display.

Modify the script to handle that :slight_smile:
Read MediaWiki’s API documentation and see how to handle redirects.
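Something like this might work as a starting point (an untested sketch; the variable names are just examples). It uses the “query” action with the redirects parameter to normalise the title and follow any redirects, and you’d then feed the resolved title into the “parse” call from my earlier example:

<?php
// Untested sketch: resolve capitalisation and redirects via the
// MediaWiki "query" action before calling "parse".
$raw_tag = 'new york';

$context = stream_context_create(array('http' => array(
	'user_agent' => 'SitePoint example for topic 748667',
)));

// "redirects" tells the API to follow redirects; it also normalises
// the title for us (e.g. capitalising the first letter).
$query_url = sprintf(
	'http://en.wikipedia.org/w/api.php?action=query&titles=%s&redirects&format=php',
	urlencode($raw_tag)
);
$query = unserialize(file_get_contents($query_url, false, $context));

// The resolved page is the single entry under query.pages.
$page_info = current($query['query']['pages']);
echo $page_info['title']; // should print "New York"
?>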

  2. I’d only like to show the first <p> from a Wikipedia listing. This will put less strain on my server as well as Wikipedia’s. And yes, I’ve donated to Wikipedia multiple times now, whenever a call-out was made.

I think the API only returns the whole page; from that, you’d grab the first paragraph. Returning the whole page is probably less stressful on their servers anyway, since they cache the whole page (they wouldn’t cache just the first paragraph), so it should be relatively quick to retrieve.
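For the first-paragraph part, a rough sketch using PHP’s DOM extension could look like this (here $html stands in for the $data['parse']['text']['*'] value from my earlier example; I’ve hard-coded a sample so the snippet runs on its own):

<?php
// Rough sketch: pull the first non-empty <p> out of the parsed HTML.
// In practice $html would be $data['parse']['text']['*'].
$html = '<div><p> </p><p>The <b>theory of relativity</b> is…</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // Wikipedia's markup triggers DOM warnings
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$first_paragraph = '';
foreach ($doc->getElementsByTagName('p') as $p) {
	$text = trim($p->textContent);
	if ($text !== '') {             // skip empty or whitespace-only paragraphs
		$first_paragraph = $text;
		break;
	}
}

echo $first_paragraph; // "The theory of relativity is…"
?>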

A perfect example of what I’m trying to replicate can be seen at PhysOrg.com.

They might be using the API, or might just use the Wikipedia dumps I talked about earlier. Perhaps ask them what approach they used?

to: Daniel15 - thanks for sharing the idea about the Wikipedia API - strange, but I never heard about it before :wink: and just scraped it :wink:
to: Sharkyx
I understand that you’re not talking about black-hat SEO, but there are a few services that can show you all the keywords for which Wikipedia is at the top of Google’s SERPs, so you can at least exclude those keywords from your “posts”, because it’s hard to outperform Wikipedia.