kjm7267 — 2013-01-08T16:14:32-05:00 — #1
Good afternoon from unseasonably sunny Palo Alto, California, where I am stuck on a website PHP programming problem. My site shows, in a table, a series of records from a MySQL database table that I also own. Each record in the database table has two fields besides the requisite ID, so the website shows two fields per row, that is, two fields per record.
My database gets its data from about ten different public domain websites, each of which displays information in a manner similar to mine, but each in its own slightly different way. My site effectively aggregates the records of these ten other sites. Each of the ten sites adds a few records each day. Once a record is added to one of the ten, it's valid in perpetuity; that is, the data never gets stale. None of the ten websites seems to have an RSS feed or API that would enable me to update my database automatically. So I've had to do it manually, cutting and pasting information from the ten other websites into my database. Not fun.
Please, how should I think about using PHP to streamline this process? I have done a lot of research on cURL, screen scraping, and other methods, but none seem to ring true. A basic cure would be to find a way to automatically dump all the records from each of the ten into my database. An advanced cure would be to do that, plus to cause some sort of continuous updating. If anyone could point me in the right direction, I would most certainly appreciate it. Thank you!
ronalds — 2013-01-08T21:34:34-05:00 — #2
You can set up a cron job to run your PHP script and take care of the automation. In the PHP script, the page content can be downloaded with cURL, and then the required info can be extracted with [DOM functions](http://dk1.php.net/manual/en/book.dom.php), [SimpleXML](http://dk1.php.net/manual/en/book.simplexml.php) or, in the worst case, regular expressions.
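As a rough illustration of the download-then-extract step, here is a minimal sketch. The function names `fetchPage` and `extractRecords` are made up for this example, and the XPath expression assumes each record on the source page is a table row with at least two `<td>` cells; every real site will need its own expression.

```php
<?php
// Hypothetical helper: download a page with cURL.
// Returns the HTML as a string, or null if the request failed.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

// Hypothetical helper: pull two-field records out of an HTML page.
// Assumption: each record is a <tr> containing at least two <td> cells.
function extractRecords(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $records = [];
    foreach ($xpath->query('//tr[count(td) >= 2]') as $row) {
        $cells = $xpath->query('td', $row);
        $records[] = [
            trim($cells->item(0)->textContent),
            trim($cells->item(1)->textContent),
        ];
    }
    return $records;
}
```

The cron job would then call something like `extractRecords(fetchPage($url) ?? '')` once per source site and hand the result to the database code.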
starlion — 2013-01-10T08:33:46-05:00 — #3
cURL is the way to go, but as has been mentioned in several threads recently, you need to make sure those sites allow you to do this. Most sites prohibit 'scraping' or pulling their data without written permission.
sogo7 — 2013-01-10T22:30:17-05:00 — #4
There is also a seldom-used option with cURL to make parallel requests for pages from remote servers. A ready-made PHP class can be found at -> https://github.com/petewarden/ParallelCurl
So instead of looping and waiting for each of the ten (or more) servers to answer in turn before the next page can be requested, all the requests get sent at once. This can often drastically reduce the time the scraping script needs to complete its run, which helps if the sites being scraped are slow to respond or your regex/data processing is particularly complex. Remember that most shared hosting servers limit script execution time, often to just 30 seconds, in which you have to fetch, filter & store everything. Alternatively, just break the process into smaller chunks and use multiple cron jobs. It all really depends on how much & how often you want to scrape updates.
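For reference, the same idea can be sketched with PHP's built-in `curl_multi_*` functions rather than the class linked above (this is not that class's API). The function name `fetchAll` and the 10-second timeout are assumptions for the example.

```php
<?php
// Fetch several URLs concurrently with the built-in curl_multi_* functions.
// Returns an array mapping each URL to its body, or to null if it failed.
function fetchAll(array $urls, int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity, no busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect per-transfer result codes; anything but CURLE_OK is a failure.
    $failed = [];
    while ($info = curl_multi_info_read($multi)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = in_array($ch, $failed, true)
            ? null
            : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

With ten source sites, the total wait is then roughly that of the slowest site rather than the sum of all ten.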
When you have all your new information ready in arrays, try to use a bulk INSERT SQL statement to shove it into the database in one go; again, this is generally quicker than inserting one row of data at a time in a foreach loop. On a shared web server you have a limited number of database connections available for the entire site, so they need to be opened and then closed as fast as possible. Exceed this limit and the host will often lock the entire database off from the rest of the site. (Some do it for just an hour, with some you have to ask an admin to reconnect it, and if you're on a free web host they'll probably just terminate your entire account without warning.)
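A bulk insert could look something like the sketch below. The helper name `buildBulkInsert`, the table name `records`, and the column names `field_a` and `field_b` are placeholders, not anything from the original site; the PDO usage is shown in comments because the connection details are unknown.

```php
<?php
// Hypothetical helper: build one multi-row INSERT with "?" placeholders.
// $table and $columns must come from your own code, never from scraped
// input, because identifiers cannot be bound as parameters.
function buildBulkInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('No rows to insert.');
    }
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    // Flatten the rows into one parameter list matching the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach (array_values($row) as $value) {
            $params[] = $value;
        }
    }
    return [$sql, $params];
}

// Usage sketch (DSN and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pass);
// [$sql, $params] = buildBulkInsert('records', ['field_a', 'field_b'], $rows);
// $pdo->prepare($sql)->execute($params);
```

Binding the scraped values as parameters, rather than pasting them into the SQL string, also keeps odd characters in the source pages from breaking the query.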
If the top story of an RSS feed has not changed since the last time your script looked at it then there is no point in scraping the destination pages of the links again.
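One simple way to implement that check is to store a fingerprint of whatever was at the top last time and compare it on the next run. The sketch below assumes a small state file per source site; `hasChanged` is a made-up name. Where a site has no feed, the same check could be applied to the first row of its listing page.

```php
<?php
// Hypothetical helper: report whether $content differs from what was seen
// on the previous run, and remember the new fingerprint in $stateFile.
function hasChanged(string $content, string $stateFile): bool
{
    $current = hash('sha256', $content);
    $previous = is_file($stateFile)
        ? trim((string) file_get_contents($stateFile))
        : '';
    if ($current === $previous) {
        return false; // nothing new, skip the expensive scrape
    }
    file_put_contents($stateFile, $current);
    return true;
}
```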
A badly configured scraper script hitting a site too often or grabbing too many pages at once can cause some websites to slow down or even crash completely. If the webmaster notices your web server's IP spread across their access logs like a rash, they could block you or, worse, complain to your web hosting provider. So be gentle.
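In practice, "gentle" can be as simple as enforcing a minimum gap between requests to the same site. A minimal sketch follows; the helper name `secondsToWait` and the 5-second gap are arbitrary choices for illustration, not a recommended value for any particular site.

```php
<?php
// Hypothetical helper: how long to pause before the next request to a host,
// given when the last request was sent and a minimum gap, all in seconds.
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

// Usage sketch: keep at least 5 seconds between hits to the same site.
// $last = 0.0;
// foreach ($urls as $url) {
//     usleep((int) (secondsToWait($last, microtime(true), 5.0) * 1000000));
//     $last = microtime(true);
//     // ... fetch $url here ...
// }
```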