Article: Nokogiri Fundamentals: Extract HTML from the Web

An excerpt from http://www.sitepoint.com/nokogiri-fundamentals-extract-html-web/, by Darko Gjorgjievski

Most people get very confused when they try to learn Nokogiri without mastering some fundamentals first. There is a reason for this. Trying to learn Nokogiri without learning the things that make it work is like trying to learn the features of Word without knowing how to type. By the end of this article, you’ll be comfortable with taking a web site and extracting any piece of data from it.

Nokogiri is one of the most popular Ruby gems, with over 37 million downloads at publication time. Just as a comparison, Rails currently has 47 million downloads. My first thought after seeing this was: “Wow, if this gem is so popular, there must be some pretty comprehensive tutorials for it, right?” If you type “rails tutorial” into Google, you’ll get over 280,000 results. A search for “nokogiri tutorial” gives you…fewer than 3,000 results (on the bright side, this number should get bigger after this article is released!).

Nokogiri is a Parser…What?

To make things worse, most tutorials confuse rather than clarify things. When describing Nokogiri, for example, most articles call it a “parser”. Most people have no clear definition of what a parser is. To answer this question, take a look at this StackOverflow answer, where a parser is described as “something that turns some kind of data (usually a string) into another kind of data (usually a data structure)”. This makes things a bit clearer.
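To make that definition concrete, here is a minimal sketch of a toy parser in plain Ruby. It is not Nokogiri and not how Nokogiri works internally; the `key=value` input format and the method name `parse_pairs` are made up purely for illustration. It only shows the general idea of turning a string into a data structure (here, a Hash).

```ruby
# Toy parser: turns a string of "key=value" pairs into a Hash.
# The input format and the name parse_pairs are hypothetical,
# chosen only to illustrate "string in, data structure out".
def parse_pairs(text)
  text.split(";").each_with_object({}) do |pair, result|
    # Split on the first "=" only, so values may contain "=" themselves.
    key, value = pair.split("=", 2)
    result[key.strip] = value.to_s.strip
  end
end

raw  = "title=Nokogiri Fundamentals; author=Darko Gjorgjievski"
data = parse_pairs(raw)

puts data["title"]   # prints "Nokogiri Fundamentals"
puts data["author"]  # prints "Darko Gjorgjievski"
```

Once the string has become a Hash, you can ask it for a piece by name instead of searching through raw text. An HTML parser follows the same idea, except the input is markup and the resulting structure is a tree of elements.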

[Continue reading this article on SitePoint!]
