Increase speed?

I have a script that builds an array of about 150 IDs, which then uses file_get_contents() to gather the data associated with each ID. I have tried cURL as well, and it's the same speed.

I am trying to conjure up a way to make the following code faster. I’m considering switching the DB to InnoDB so I can run multiple instances at once, but it seems like there’s probably a better way. Ideas?


$Array = array(1, 2, 3, 4, 5);
foreach($Array as $ID) {
	$file = file_get_contents('http://www.example.com/index.php?id='.$ID);	
	// INSERT data into database
}

Standard Disclaimer Question: Do you have permission to be screen-scraping this data and storing it?

After contemplating the answer to StarLion’s question, feel free to give this thread a read.

It’s public, industrial data that requires no login. No images or wording, only numbers and dates. I asked one department for a copy of the database, which doesn’t change, instead of sending millions of hits to their server, and they just said, “Just send the requests to the server individually.” Ok, fine by me lol.

Honestly, for my purposes, that thread didn’t really have any ideas. I did retest cURL though, and it is a sliver faster than file_get_contents(). The majority of the time is spent sending the requests and waiting for the file contents, so I am thinking multiple instances could speed it up big time. I could have a cron script run 10 different instances for different sets of updates.

Your script will be limited by two factors, neither of which you have control over (that I know of): the source server’s latency and your connection’s bandwidth.

The problem looks to be that you’re downloading the remote content serially, i.e. one at a time. You have to wait for the first to have finished before starting on the second, and so on. The key here is to make the requests in parallel: all of them (or in chunks) at the same time. That way the total time taken is (optimally) only the time of the single slowest request.

This can be done in various ways. You could spawn many instances of your script at the same time, each fetching from one URL only. There is also cURL’s “multi” interface, which allows sending off and receiving many cURL requests simultaneously. It’s a bit of a faff, but a good starting point is the curl_multi_exec() PHP manual page.

If you’re going to be making ~150 requests pretty much simultaneously, you had better make sure that the content provider is really OK with it. That said, there’s no reason why you can’t artificially “slow” the requests (only send N requests at once, for example) and still be much faster than getting the URLs one at a time.
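
To give a rough idea, the curl_multi approach might look something like the following. This is only an untested sketch: the URL is the one from your example, and the batch size of 25 is just a placeholder.

$ids = range(1, 150);   // the IDs you build at the start of your script
$batchSize = 25;        // how many requests to have in flight at once

foreach (array_chunk($ids, $batchSize) as $batch) {
	$mh = curl_multi_init();
	$handles = array();

	foreach ($batch as $id) {
		$ch = curl_init('http://www.example.com/index.php?id=' . $id);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		curl_multi_add_handle($mh, $ch);
		$handles[$id] = $ch;
	}

	// Run every handle in this batch until all the transfers have finished.
	$running = null;
	do {
		curl_multi_exec($mh, $running);
		curl_multi_select($mh); // wait for activity rather than busy-looping
	} while ($running > 0);

	foreach ($handles as $id => $ch) {
		$file = curl_multi_getcontent($ch);
		// INSERT data into database, exactly as in your original loop
		curl_multi_remove_handle($mh, $ch);
		curl_close($ch);
	}

	curl_multi_close($mh);
}

Each batch then takes roughly as long as its slowest request, rather than the sum of all of them.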

Parallel vs. serial is a good point, but PHP wasn’t designed to work in parallel (that I’m aware of). Another solution is to get the data asynchronously, say the night before with a cron job, and then run the main script against “local” data. One advantage of having local data is that until the source file is updated there is no need to download the data again.
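
For illustration, the “local” data idea can be as simple as checking a cached copy before hitting the network. This is only a sketch; the data/ directory and the one-week maximum age are assumptions.

$id = 42;
$cacheFile = __DIR__ . '/data/' . $id . '.html'; // assumed cache location
$maxAge = 7 * 24 * 3600;                         // re-download after a week

if (!file_exists($cacheFile) || (time() - filemtime($cacheFile)) > $maxAge) {
	// Cache is missing or stale: fetch from the remote server and store a local copy.
	$contents = file_get_contents('http://www.example.com/index.php?id=' . $id);
	file_put_contents($cacheFile, $contents);
} else {
	// Otherwise work from the local copy and skip the network entirely.
	$contents = file_get_contents($cacheFile);
}
// parse $contents / INSERT into the database as usual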

You can run multiple “processes” in PHP, but not exactly threading. I’ve written a class to handle “multi-processing” that extends a shared memory segment, but it’s not completely done yet (the shared memory portion). The downside is that this is for CLI work only; Apache doesn’t really support it.

Well, you both obviously know more than me. The script spends 90% of its time waiting for data from the remote site. My server resources aren’t being exhausted, because I can’t speed up the connection time. I figured I could run multiple instances and put more of those resources to work, but if you say it’s not designed to do that, then I’d just be wasting my time building it lol.

My server has 16GB memory and quad 3.4GHz CPUs, and the script is using like 3% of that.

What is the frequency of this data fetch?

Daily, hourly, on page load?

So much to learn…

Command Line Interface and PHP5 Multithreading:
If you have a bottleneck in the database or network connection, then you can speed up your script by up to 1000% just by implementing PHP5 multithreading. For example, you may spend 10 seconds just to establish the HTTP connection when fopen()ing a remote page, and just 1 second to retrieve the content. If you need to fopen() 1000 pages one by one, then you will spend 10×1000 + 1×1000 = 11000 seconds (3 hours and 3 minutes)! If you run 100 threads, then you will spend (10×1000 + 1×1000)/100 = 110 seconds (less than 2 minutes!). Obviously, you will need a powerful enough CPU, enough memory and network bandwidth.

PHP CLI

Hah, well I’m not a professional programmer. I’m only building tools I need. CLI is definitely going to take some learning for me, but I’m sure I’ll get it done. Thanks Denny!

Cups, it’s fetching state industrial information. It fetches information on about 20 thousand wells. I’d like to do it weekly, since I have other websites I’m gathering data from that I want to spread out through the week and run at night.

Ouch, I just read “Unlike the CGI SAPI, CLI writes no headers to the output by default”. If this is the case, then it wouldn’t work since headers are required in some cases.

This is incorrect. There is no “threading” in PHP. You can, however, fork a new “process”, which may mimic threading to those who do not know the difference. Have a look here: http://stackoverflow.com/questions/1762418/process-vs-thread. That being said, forking a new process is almost as beneficial as threading; the loss is in the lack of communication between your processes (hence my extension of a shared memory segment).

Now, Apache also has things to say about pcntl_fork(); it can have unwanted results. What those results are I’m unsure, as I have not had a reason to fork anything through an Apache request; it has all been CLI work for me thus far.
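
For what it’s worth, a bare-bones forked version of the original loop might look like this. It’s an untested sketch, CLI only, and it assumes the pcntl extension is available; 10 workers is an arbitrary choice.

$ids = range(1, 150);
$workers = 10;
$chunks = array_chunk($ids, (int) ceil(count($ids) / $workers));

$pids = array();
foreach ($chunks as $chunk) {
	$pid = pcntl_fork();
	if ($pid == -1) {
		die("Could not fork\n");
	} elseif ($pid === 0) {
		// Child process: fetch only its own share of the IDs, then exit.
		foreach ($chunk as $id) {
			$file = file_get_contents('http://www.example.com/index.php?id=' . $id);
			// INSERT data into database
		}
		exit(0);
	}
	$pids[] = $pid; // parent keeps track of its children
}

// Parent waits for all of the children to finish.
foreach ($pids as $pid) {
	pcntl_waitpid($pid, $status);
}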

We need to know a little bit more about these requests. Cups is on the right track with frequency, and I’m also curious about payload size.

Payload size is about 400KB.

Seems like you should start with a simple test.

Run the script from the first post from the command line. Don’t do anything with the data, just bring it down. Now run several copies of the script from the command line at the same time. You’ll probably want each copy to bring down a different set of random IDs.

That will tell you right away if running something in parallel will help. It could be that the example.com server itself is the bottleneck.
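
One rough way to split the work is to give each copy a slice of the IDs via a command-line argument. A sketch (the script name and the slice count of 10 are just placeholders):

// Run e.g. "php test.php 0", "php test.php 1", ... "php test.php 9" at the same time.
$slice = isset($argv[1]) ? (int) $argv[1] : 0;
$totalSlices = 10;

foreach (range(1, 150) as $i => $id) {
	if ($i % $totalSlices !== $slice) {
		continue; // this ID belongs to another copy of the script
	}
	$file = file_get_contents('http://www.example.com/index.php?id=' . $id);
	// just time the fetches for this test; skip the database INSERT
}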

PHP is designed to allow what I suggested: it is not misusing PHP at all. If you follow one, or both, of the suggestions that I made, let us know how you get on.

Salathe, I’m going to look into what you suggested. I’ve been too busy to write up a test for this though. If it works (which it looks like it should) it would be the easiest way to implement by far.

I’m definitely not going to bombard the remote server with requests, since I don’t want them to implement a way to stop me. The page is typically 300KB-1MB (all files), so I don’t think it would be a big issue to send 25 requests at once. I have about 8k requests total on each site, and they’ve never had a problem with me sending a request about every second. I run it at night too.
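
Something like this is roughly what I have in mind; fetchBatch() here is just a stand-in for the curl_multi loop posted above.

$ids = range(1, 8000);                      // roughly 8k requests per site
foreach (array_chunk($ids, 25) as $batch) {
	fetchBatch($batch);                 // stand-in for fetching one batch in parallel
	sleep(5);                           // pause so the remote server isn't hammered
}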

Thanks for the idea Salathe.