lindenwalsh — 2011-03-21T12:36:17-04:00 — #1
I'm experimenting with data scraping, and I'm wondering whether DOMDocument can load plain text instead of HTML, or whether cURL is the best way to do this. DOMDocument loads images etc. when using loadHTMLFile(), which is very slow when you're processing a few pages at the same time. Is there a way to ignore images so they won't slow down the process? I know I can strip tags afterwards, but that's after the fact: by then they've already been loaded and have really slowed down the processing.
aamonkey — 2011-03-21T13:56:01-04:00 — #2
DOMDocument does not "load images"; it simply pulls all the HTML/XML from a page and puts it into a document tree.
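To illustrate, here is a minimal sketch that parses an HTML string containing two image tags. The markup and the example.com URLs are made up for illustration. The point is that DOMDocument only stores each src attribute as text in the tree and never requests the image itself.

```php
<?php
// Hypothetical markup; the URLs are placeholders and are never requested.
$html = '<html><body><p>Hello</p>'
      . '<img src="http://example.com/a.png" alt="A">'
      . '<img src="http://example.com/b.png" alt="B">'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html); // parses the string only, no network access

// Collect the src attribute of every <img> element in the tree.
$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

The image files would only be downloaded if your own code went on to fetch those URLs.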
lindenwalsh — 2011-03-21T13:59:15-04:00 — #3
Ok thanks for that. Why would it be slow to load several pages at once? Is it when I'm parsing the HTML out that the images are loaded?
aamonkey — 2011-03-21T14:17:08-04:00 — #4
Probably because the servers you are hitting are slow. The images are never "loaded". If you are taking the URLs of the image links and downloading the files to your server, that might take some time. And if you are outputting the images to the browser, then of course the browser will need to request each of the images from the server.
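One way to check where the time goes is to time each step separately. The sketch below is only an illustration: time_call() is a hypothetical helper, not part of PHP, and the generated markup stands in for a real downloaded page. It measures how long DOMDocument takes to parse a page that is already in memory.

```php
<?php
// Hypothetical helper: runs $fn and returns [result, elapsed seconds].
function time_call(callable $fn)
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

// Stand-in for a downloaded page; in a real run this would come from cURL.
$html = '<html><body>' . str_repeat('<p>row</p>', 1000) . '</body></html>';

list($dom, $parseSeconds) = time_call(function () use ($html) {
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    return $dom;
});

printf("parse took %.4f s\n", $parseSeconds);
```

Wrapping the download step in the same helper and comparing the two numbers should show whether the remote server or the parsing is the bottleneck.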
hurrakan — 2011-03-22T10:45:35-04:00 — #5
I think it would be better to use loadHTML() instead of loadHTMLFile().
HTML pages often contain MANY errors, so you should suppress the warnings the parser raises about them.
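As an alternative to the @ operator, libxml can collect parser errors internally so they can be inspected or discarded. This is a minimal sketch, assuming a deliberately broken snippet of markup that was made up for illustration. How many messages you get back depends on the libxml version installed.

```php
<?php
// Deliberately broken markup (unclosed tags, unknown element) for illustration.
$html = '<html><body><p>Unclosed paragraph<div><foo>text</body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects
libxml_clear_errors();         // free the stored errors
libxml_use_internal_errors($previous); // restore the earlier setting

printf("loaded: %s, parser messages: %d\n", $loaded ? 'yes' : 'no', count($errors));
```

Unlike @, this keeps the messages available if you ever want to log them, and it does not hide unrelated errors raised on the same line.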
Set appropriate cURL options so it fetches the pages as fast as possible.
$ch = curl_init();
// Fall back to a generic agent string when there is no browser request (e.g. on the command line).
curl_setopt($ch, CURLOPT_USERAGENT, isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false); // keep response headers out of $html, which is passed to loadHTML()
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
$html = curl_exec($ch);
if ( ! $html)
{
    // Request failed or returned an empty body: show the error and transfer details.
    echo curl_error($ch).'<pre>'.print_r(curl_getinfo($ch), true).'</pre>';
}
else
{
    $dom = new DOMDocument;
    if ( @ $dom->loadHTML($html)) // suppress warnings about invalid HTML
    {
        // work with $dom here
    }
}
curl_close($ch);