DOMDocument vs cURL for plain text

LindenWalsh · March 21, 2011, 4:36pm

Hi,

I’m experimenting with data scraping, and I’m wondering whether it’s possible to use DOMDocument to load plain text instead of HTML, or is cURL the best way to do this? DOMDocument seems to load images etc. when using loadHTMLFile(), which is very slow when you’re processing a few pages at the same time. Is there a way to ignore images so they won’t slow down the process? I know I can strip tags afterwards, but that’s after the fact: by then they’ve already been loaded and have really slowed down the processing.

Thank you!

aamonkey · March 21, 2011, 5:56pm

DOMDocument does not “load images”; it simply reads all the HTML/XML from a page and puts it into a document tree.
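To make that concrete, here is a minimal sketch. The markup and the example.com image URL are made up for illustration. After parsing, the image is only an element node with a src attribute, and nothing is fetched for it:

```php
<?php
// Hypothetical markup, for illustration only.
$html = '<html><body><p>Hello</p>'
      . '<img src="http://example.com/big.jpg" alt="big">'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html); // parses the string; makes no request for big.jpg

// Each <img> is just an element node carrying attributes.
$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

The image would only be downloaded if your own code took that src value and requested it.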

LindenWalsh · March 21, 2011, 5:59pm

Ok thanks for that. Why would it be slow to load several pages at once? Is it when I’m parsing the HTML out that the images are loaded?

aamonkey · March 21, 2011, 6:17pm

Probably because the servers you are hitting are slow. The images are never “loaded” by DOMDocument. If you are taking the URLs of the image links and downloading the files to your server, that might take some time, and if you are outputting the images to the browser, then of course the browser will need to request each of the images from the server.
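One way to check where the time goes is to time the download and the parse separately. This is only a rough sketch: the timed() helper and the example.com URL are assumptions for illustration, and file_get_contents() is used just to keep it short.

```php
<?php
// Hypothetical helper: returns [result, elapsed seconds] for any callable.
function timed(callable $fn): array
{
    $start  = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

$url = 'http://example.com/'; // placeholder URL, not from the thread

// Step 1: network fetch only.
[$html, $fetchSecs] = timed(function () use ($url) {
    return @file_get_contents($url); // false on failure
});

// Step 2: parsing only; no network is involved here.
[$parsed, $parseSecs] = timed(function () use ($html) {
    $dom = new DOMDocument;
    return is_string($html) && $html !== '' && @$dom->loadHTML($html);
});

printf("fetch: %.3fs, parse: %.3fs\n", $fetchSecs, $parseSecs);
```

If the fetch figure dominates, the remote server or the network is the bottleneck rather than DOMDocument.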

hurrakan · March 22, 2011, 2:45pm

I think it would be better to fetch the page with cURL and pass the result to loadHTML() instead of using loadHTMLFile().

HTML pages often contain MANY errors, so you should suppress the warnings they trigger.
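Besides the @ operator, libxml can collect parse errors internally instead of raising PHP warnings. A minimal sketch, using made-up broken markup for illustration:

```php
<?php
// Hypothetical broken markup, for illustration only.
$html = '<html><body><p>Unclosed paragraph<div>Stray</span></div></body></html>';

// Collect parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom    = new DOMDocument;
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

if ($loaded) {
    echo count($errors), " parse problem(s) ignored\n";
    echo $dom->getElementsByTagName('div')->item(0)->textContent, "\n";
}
```

This keeps the errors available for logging if you ever want to inspect them, which @ does not.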

Set appropriate cURL options so that it gets the pages as fast as possible:


$ch = curl_init();

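// HTTP_USER_AGENT is not set when the script runs from the command line, so fall back to a fixed string
curl_setopt($ch, CURLOPT_USERAGENT, isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0');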
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false); // keep response headers out of $html so loadHTML() only sees the body
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');

if ( ! $html = curl_exec($ch))  
{                
    echo curl_error($ch).'<pre>'.print_r(curl_getinfo($ch), true).'</pre>';
}
else
{             
    curl_close($ch);
    
    $dom = new DOMDocument;
    
    if ( @ $dom->loadHTML($html)) // suppress warning errors about invalid HTML
    {