Scraping images from a website

I am trying to scrape product images from a website to save myself a large and manual task.

Each product has its own page: http://www.example.com/product.php?p=1234

Within this page there is one product image with an ambiguous name such as random_product.png; the only thing distinguishing it from the other images on the page is that its location is catalog/random_product.png.

What I would like to do is have a script scan all the product pages (1 to 6000) and save each image under its product ID, e.g. if random_product.png belonged to product 1234, the script would save the file as 1234.png.

Are there any scripts available that would handle this?

Many thanks in advance.

A site sucker or a bot.

I’ve looked into website grabbers, but they do not save the image with the ID in the name. Are there any you can recommend that would handle this?

Who owns these images? Do they know you are using them? Have they given you permission? Images and video are protected works; you can’t just grab them and use them for your own convenience.

This should give you the gist of it:


<?php
function save_image($pageID) {
    
    $base = 'http://www.example.com/';    //matches the domain in your example product URL
    
    //use cURL functions to "open" page
    //load $page as source code for target page
    
    //Find catalog/ images on this page
    preg_match_all('~catalog/([a-z0-9\\.\\_\\-]+(\\.gif|\\.png|\\.jpe?g))~i', $page, $matches);
    
    /*
    $matches[0] => array of image paths (as in source code)
    $matches[1] => array of file names
    $matches[2] => array of extensions
    */
    
    $success = false; //stays false if no catalog/ image is found on the page
    for($i=0; $i < count($matches[0]); $i++) {
        $source = $base . $matches[0][$i];
        $tgt = $pageID . $matches[2][$i];    //NEW file name: ID + extension
        
        //copy() can fetch a remote URL as long as allow_url_fopen is enabled
        if(copy($source, $tgt)) $success = true;
        else $success = false;
    }
    
    return $success; //Rough validation. Only reports last image from source
}


//Download image from each page
for($i=1; $i<=6000; $i++) {
    if(!save_image($i)) echo "Error with page $i<br>";
}
?>

You’ll have to add your own cURL code to load the HTML source of each page into the $page variable.
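
If it helps, here’s a rough sketch of that part, assuming the page URL pattern from your first post (http://www.example.com/product.php?p=ID); it would go where the two cURL placeholder comments sit inside save_image():

    //Fetch the product page HTML into $page with cURL
    $ch = curl_init('http://www.example.com/product.php?p=' . $pageID);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  //return the HTML rather than echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  //follow any redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           //give up on a dead page after 30 seconds
    $page = curl_exec($ch);
    curl_close($ch);
    
    if($page === false) return false;                //fetch failed, so report this page as an error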

It’d probably be kinder to the hosting web server not to do all 6000 pages in one go, and even for smaller runs you may need to increase your max execution time.
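
Something like this for the calling loop would at least pace the requests and deal with the time limit; the unlimited execution time and the half-second pause are just illustrative values, not requirements:

set_time_limit(0); //remove PHP's execution time limit for a long run

for($i=1; $i<=6000; $i++) {
    if(!save_image($i)) echo "Error with page $i<br>";
    usleep(500000); //pause half a second between pages to go easy on the server
}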

And remember that any copyright restrictions will still apply regardless of how you get the images.

Thanks for that lowdown on the law, but you can sleep easy: the images are my client’s own, and I need to scrape them from his old website for his new one.

Thanks for this, I’ll give it a shot and let you know how I get on; much appreciated. The web server is our own physical dedicated box, so there are no problems with using resources; it only hosts a few websites at present.