Downloading Feed for Processing is a Bottleneck

Okay, so I’ve been profiling my recent project a bit, as the tests I wrote take a while to run (54 tests in 15-20 seconds). One of those tests takes 13-18 of those seconds.

I personally don’t like tests taking longer than a second to run (in most cases that is doable), but this one I just can’t seem to get around, and the biggest problem is that it will only get worse.

The test in question is given a URL; it downloads the contents of that URL and stores it locally (right now it uses [fphp]file_get_contents[/fphp]). At this moment I’m not certain whether the bottleneck is the network request or the actual processing (but I plan to run a few tests later to help narrow that down).

Here are the requirements:

  1. Must be run through a cronjob
  2. Feed is external to the website, so it isn’t on the same network (must be this way as it is a third party system providing the data)
  3. Must download it and store it locally (takes 13+ seconds)
  4. Must process the feed into a set of MySQL tables and file system caches for use by the website (this literally takes 67 ms to process)

Here is what I’m using for the required steps

  1. Feed is downloaded using file_get_contents
    [list][*]I want to experiment with wget and curl too. Moving it to wget removes PHP from the equation for downloading (a rough sketch of the current step follows this list)[/list]
  2. Feed is being processed using simplexml, which seems to be fine
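For reference, the download step is roughly the following; the URL and local path here are placeholders, and the timeout is just a guard I’d add anyway:

$context = stream_context_create(array('http' => array('timeout' => 30))); // don't let a slow feed hang forever
$xml = file_get_contents('http://feeds.example-vendor.com/inventory.xml', false, $context);
if ($xml === false) {
    exit("Feed download failed\n");
}
file_put_contents('/path/to/cache/feed.xml', $xml);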

Concerns

  1. Download could cause the max execution time to be reached, or exceed memory if the file is large enough (see the sketch after this list)
  2. Currently I’m only receiving 5 records in the feed, and I expect once fully live it will be 50+ records.
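One thing I want to try for concern 1 is having cURL write the response straight to a file instead of buffering it in a string, which should keep memory flat no matter how big the feed gets. A rough sketch (URL and path made up):

$fp = fopen('/path/to/cache/feed.xml', 'wb');
$ch = curl_init('http://feeds.example-vendor.com/inventory.xml');
curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the response straight to disk
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // in case the feed URL redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 120);         // hard cap, well under max_execution_time
$ok = curl_exec($ch);
curl_close($ch);
fclose($fp);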

Other notes of Interest

  1. The whole process is anonymous: it can take any feed and relay it into MySQL/file cache without any coding changes.
    [list][*]Though I don’t recommend using it that way, as you lose key benefits; instead, there are ways to define a feed and process it with minimal coding[/list]
  2. All other functions of the system are under 200 ms, so this download is the odd man out.

So, I’d love to hear your feedback and experience with cURL in a process similar to this, or whether you think ditching PHP for the download and running wget prior to the cronjob run is the best candidate (that is my feeling too). Or if there is another approach I should be looking at, I’m open to that too.

I’ll take a stab; let’s see if I have completely grasped your problem.

I got round this in the past by completely separating the download and cache of the data from the processing of that same data.

I don’t know if you are able to predict when the processing needs to be done or if the fetched data must be accurate to the last second, last minute or if it could be 5 minutes old.

As an example:

a) Cron grabs n RSS feeds ready for processing and caches the files locally, hourly IIRC.

b) Cron then tells the processor to grab the files and do the post-processing, hourly + 3 minutes.
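On the crontab that looked something like this (paths illustrative):

# fetch the feeds at the top of the hour, process them three minutes later
0 * * * *  /usr/bin/php /path/to/fetch_feeds.php
3 * * * *  /usr/bin/php /path/to/process_feeds.php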

If something failed for any of the last RSS feed fetches, it processes the previously cached version (or it may ignore it because it is > x minutes old).

This almost completely removes from the processing step the latency and availability problems that the network can throw up.

But as I say, YMMV: if you are, say, processing exchange rates or stock prices where you want up-to-the-second accuracy and the requests must be made on demand rather than every x minutes, then this approach may not be any help to you at all.

I almost always use cURL for this fetching now, though I have used wget in the past.

I’ve moved from sockets and cURL to FTP whenever possible. Much faster and MUCH more reliable. I had an issue I posted about that never got resolved: a cURL call would “succeed” with no errors, but the remote and local file sizes did not match.

ftp_get() has given me no grief thus far. Can you possibly grab the XML feed through a text-based cURL request, rather than a “download”?
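It’s only a handful of calls; the host, credentials and paths in this sketch are obviously made up:

$conn = ftp_connect('ftp.example-vendor.com');
if ($conn && ftp_login($conn, 'username', 'password')) {
    ftp_pasv($conn, true); // passive mode tends to play nicer with firewalls
    ftp_get($conn, '/path/to/cache/feed.xml', 'inventory.xml', FTP_BINARY);
    ftp_close($conn);
}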

Thanks for both responses.

The feed will likely only change daily, so the “must be up to date” isn’t really a factor. In short, the client I am writing this for uses a third-party service to keep their inventory. They are giving us a feed with the inventory data in it, which I’m going to use to run their website operations. If they update the inventory mid-day they could always kick off the processing cronjob manually.

@Cups ; yes, that is exactly what I was considering when I mentioned using wget. I’d actually just create a shell script that would execute wget, place the feed locally, then call the PHP script to process it (so it would all run synchronously and automatically). I think, taking from your experience, that separating the download of the feed from the actual processing is indeed the best option. The only downside is there isn’t a real sure-fire way to test that the feed was successfully downloaded in my tests (integration/unit test-wise) until it’s time to process it (check for file existence, etc.).

@K_Wolfe ; not sure ftp_get would work for me, but that’s a good suggestion. I don’t think the third party is willing to offer an FTP option. I already asked them to give me an XML feed, as originally they provided CSV output, and some of the data made it hard to process consistently.

So, interesting result: wget downloads the entire feed in 0.4 seconds (less than a second!). That tells me file_get_contents to an external site must be doing a LOT under the hood.
I plan to test curl tomorrow (or tonight if I can’t sleep).

The only downside is there isn’t a real sure-fire way to test that the feed was successfully downloaded in my tests (integration/unit test-wise) until it’s time to process it (check for file existence, etc.).

But in your tests you could spoof the fetching of the file by just creating a new one.

Like I said, it’d also be the place where you test the absence of a file, and also the existence of a file which is > x minutes old, which therefore may be wasteful to process - depending on your needs.

Presuming you are talking about unit testing …


function testSourceExists() {
    // create a file locally
    // test that it is read and then analysed
}

function testMissingSource() {
    // try to read and analyse a file which does not exist
}

function testSourceIsOld() {
    // create a file
    // touch() it and reset the mtime to x minutes ago
    // try to read and analyse it
}

Testing the fetching of the source files from 3rd parties would be pretty pointless and would only provide a snapshot of the state of the network at that particular time. However, sending off a daily timed fetch using cURL and only analysing the return code (200 etc.) would be much quicker, and could provide an early heads-up on URL changes for your target fetches and so on. That kind of failure might be worth logging, with auto-notifications to admins and so on.
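A rough sketch of that daily check (the URL is just a placeholder):

$ch = curl_init('http://feeds.example-vendor.com/inventory.xml');
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status !== 200) {
    error_log("Feed check failed: HTTP {$status}");
    // mail() the admins, log it, etc.
}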

Yeah, that works, and I have tests that do that, but I typically like to see an external request follow through too. cURL is just as slow as file_get_contents, so I think I’ll just have to live with being unable to do that.

We’re also doing quite a lot of fetches of data to be processed at work, and we’ve found that it helps to put the hostname and IP of the target server in your hosts file, because then 1) your script can resolve the hostname faster (no DNS round trip), and 2) you never hit the problem of being unable to get the feed because your DNS is having problems.
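That is, an entry like this in /etc/hosts (IP and hostname are made up):

# pin the feed host to a fixed IP so no DNS lookup is needed
203.0.113.10    feeds.example-vendor.com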

Of course you need to keep in mind you did this and that you need to update the hosts file when the host IP changes.

YMMV, but for us this works really well. Just something to think about.

That is good to know. Unfortunately, I think the host this will eventually be running on may be a shared server, so access to the hosts file is unlikely. I’ll inquire about that though.

Ah, I just figured out why wget was running so much faster. It wasn’t using all of the parameters of the URL (once I put the URL in quotes, it runs much slower). Which ultimately means downloading the feed outside of the processing is looking like a better choice more and more.

Good thread, useful tips here.

I just came back from FOSDEM, where in the Perl devroom Liz Cholet did a talk on “Automating Firefox with MozREPL, AnyEvent and Coro” (the talk abstract is not very useful, but I don’t know if FOSDEM will later have videos or what). I don’t think it matters that she happened to do this with Perl, except I don’t know what PHP’s version of AnyEvent or Python’s Twisted would be.

What she had to do at her job was have Firefox, directed by some script, go to a URL, grab the page and analyse the objects found on that page. Firefox itself also sometimes needed to be modified/queried (you could get data from the plugins in Firefox itself!), and using MozREPL you get access to Firefox’s own objects too. Firefox was being run by WWW::Mechanize::Firefox; I know there’s also a Python version of Mechanize, but in this case you want PHP to not actually call a browser, just something like wget… But anyway,

like Cups suggested, her analysis parts were done separately from the page-loading parts. She used Perl’s AnyEvent for asynchronously sending the page requests, and Coro for threading, so basically the parts of her code that did the analysis would just listen for the response so they could then start the analysis (then Firefox would get directed to call the next URL… I assume she had an array of URLs). K_Wolfe mentioned FTP; Liz was using telnet on port 4242 (for the demo she was using localhost, but at her work she also still used telnet).

I wonder if you could have an asynchronous calling of page URLs with your data-processing parts just listening for success and neither of these processes getting in the way of the rest of your stuff. I’m not sure if it would speed up your total but if currently one process is blocking the next then that might be something to look at.

The other option is, if possible, don’t download all the data at once and avoid batch processing altogether. You mentioned query strings with parameters on the data source, so you may be better off doing some JIT processing when the data is actually used. Cache that response so it’s not fetched every time someone views the page. You will have the overhead of extra requests, but you will also download less data (if certain data is accessed very infrequently, you’ll only update it as it’s needed!).
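A rough sketch of that idea, assuming a hypothetical getRecordUrl() helper and a one-day cache lifetime:

function getRecord($id)
{
    $cacheFile = __DIR__ . "/cache/record_{$id}.xml";
    $maxAge    = 86400; // one day

    // fetch only when the cached copy is missing or stale
    if (!is_file($cacheFile) || (time() - filemtime($cacheFile)) > $maxAge) {
        $xml = file_get_contents(getRecordUrl($id));
        if ($xml !== false) {
            file_put_contents($cacheFile, $xml);
        }
    }

    return is_file($cacheFile) ? simplexml_load_file($cacheFile) : false;
}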

Interesting, but not quite what I’m attempting (I think). In short, this is purely one feed, but written in a way that allows multiple feeds if the business ever needs it (this is the third time they’ve switched over to a new system in 2 years; it makes good money for me, as I get to rewrite their system, but I’m tired of developing it from scratch each time, so I wrote a generalized system this time). Anyway, more to the point, their inventory is stored on this third-party system (a closed system); they don’t have a public API, and the only thing they will provide is a feed of your inventory.

The client wants me to grab the feed and process it into a local database so they can use that data on their website without having to maintain two systems. So at whatever interval they decide, they will run this process to get the feed and update their website with its data; that way they only have to update the data on the third-party site.

Kind of stated above in response to Stomme poes, but the retrieval of the feed is the biggest hit. It is an XML document, so having the full feed is nice (although probably not 100% necessary); however, I only need the full feed when they want to update the local inventory used to run their website. The feed is downloaded once per run (this is planned to be used in a cronjob or from the command line), it’s broken apart into pieces and then processed. The processing literally takes less than 2 seconds. The downloading of the feed, on the other hand, takes 13-15 seconds. (I just wanted to point out that the feed isn’t downloaded on every request to the data source; it is downloaded only when needed to update the local inventory.)

The remark about the parameters was more a gotcha with wget. I forgot that an unescaped & causes wget to behave differently (it actually causes the command itself to be executed differently), oops. Once I placed the URL in quotes, the correct feed was downloaded (it wasn’t caught before because the truncated URL was hitting a cached copy on the external site).
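In PHP terms, escapeshellarg() avoids that whole class of problem when shelling out; a minimal sketch (URL and path are made up):

// without quoting, the shell treats everything after the first & as separate
// background commands, so wget only sees part of the URL
$url = 'http://feeds.example-vendor.com/inventory?format=xml&account=123';
$cmd = 'wget -q -O /path/to/cache/feed.xml ' . escapeshellarg($url);
shell_exec($cmd);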

Once the data is processed and stored within MySQL, an API is used to retrieve the details needed on each page; the API handles all caching on its own. So the retrieval of data is super fast, giving each page very good load times.

Hmmm … an XML file into a db eh?

I might be veering OT here, but this sounds exactly like a challenge I faced a year or two ago.

This might be an aside, but in my case I found it very wasteful that they processed the entire XML file when in fact all they wanted was the diffs.

I did not get the chance to work out how best to compare two XML files before events took over and I left the project.

This has little bearing on your immediate problem, as you say you have the processing time down; it’s just quite fascinating that you face the same challenge, and it makes me wonder even more how best to tackle this particular problem.

Just to be clear, I am reading the XML file; each node has 105 sub-nodes, and each of those sub-nodes correlates to a column in the MySQL table. I developed a “mapper” class that 1) figures out what columns are needed for the table, and 2) their data types.

That class gets passed to the data provider, which is in charge of building its data source (be it a database or flat file system or whatever), processing the feed items into the data source, and then doing any cleanup.

In the end, a table of 105 columns gets generated and the data appropriately inserted.

Just to give you an idea, I actually have two mappers for this project, one for 99% of the nodes in the XML file, the other for a single node (images) that is repeatable per record.

Since storing images in the database doesn’t make sense, I also have a second data provider whose sole job is building a flat file system and keeping it updated when the process is executed. So the two mappers and the two data providers handle updating the system as a whole. The data providers are based on an interface, so they can be called without the caller knowing what type each one is or what it is going to do. Same with the mappers: they are able to figure out the process without really being told much, other than the schema of the XML file and a definition file that dictates what to look for.
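The shape is roughly this; the names here are just illustrative, not my actual code:

interface Mapper
{
    // work out the column names and data types from the feed schema/definition
    public function buildSchema(SimpleXMLElement $feed);
}

interface DataProvider
{
    // create the data source (MySQL table, flat-file cache, ...)
    public function buildDataSource(Mapper $mapper);

    // process the feed items into the data source
    public function process(Mapper $mapper, SimpleXMLElement $feed);

    // any post-run cleanup
    public function cleanup();
}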

Overall, I’m extremely happy with the end result, as what could look like a spaghetti-code mess is 20 lines: set up the mappers, load them, load the feed, process it, send the mappers and feed to each data provider, process them, do cleanup and exit. I could make a few pieces smarter, but in the end I decided that may be overreaching, as it is best to specifically tell the system what to do in the places where I wanted to “make it smarter” (I was approaching overthinking it with no real measurable benefit).

Actually, since the number of records I’m dealing with isn’t large, I’ve at this point just gone with INSERT … ON DUPLICATE KEY UPDATE. I may regret that in the future, but for something that runs once a day and currently takes less than a minute to run, I’m not too worried.
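Per record, that boils down to something like the following (the table and column names are made up; the real table is generated from the feed):

// assumes a PDO connection in $pdo and a unique key on sku
$sql = "INSERT INTO inventory (sku, title, price, updated_at)
        VALUES (:sku, :title, :price, NOW())
        ON DUPLICATE KEY UPDATE
            title = VALUES(title),
            price = VALUES(price),
            updated_at = NOW()";
$stmt = $pdo->prepare($sql);
$stmt->execute(array('sku' => $sku, 'title' => $title, 'price' => $price));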

There are a variety of ways I could find differences fairly quickly with my setup; just loading two feed sources into the system and performing array_diff would likely get me there. I could also run diff (from the command line) to get an indication of how much has changed, to determine whether it makes sense to compare the two files or just flat-out replace the whole table.
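As a rough command-line check (paths made up), something like this would give a coarse “how much changed” number:

// count the lines diff marks as changed between yesterday's and today's feed
$output  = shell_exec('diff /path/to/cache/feed-yesterday.xml /path/to/cache/feed-today.xml | grep -c "^[<>]"');
$changed = (int) trim((string) $output);
if ($changed > 500) {
    // lots of churn: just rebuild the whole table instead of diffing record by record
}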

I spent some time looking for PHP XML-comparison projects I could either use or learn from, but there was not much IIRC, and what there was mostly involved potentially large amounts of memory. There are Java projects dealing with extracting the diffs between XML files, but I ended up just doing what I was told and re-inserting the entire data set into the db nightly.

I did voice my concerns, which probably included the word “brittle”, but then just moved on. Not very far, actually, once I found out their policy was to pay contractors only after they had been paid and that “…this could be up to 90 days”.

$LearningCurve++ ;

Ouch! That definitely is a ++ to LearningCurve, no doubt. These days I follow the policy of half up front and the rest when I show a demo of the completed work. Once payment is received, I’ll install it or provide the client with the project and installation documentation. My theory is, if I’m going to commit to doing my best for you, you need to show me you’re committed to my presence on this project (then again, I can make outlandish demands purely because this isn’t my primary source of income; so I do, and that weeds out all of the clients I likely would have regretted working for ;)).

I do find it interesting that diffing XML files might not have much existing PHP tooling. I may research that a bit and see if a framework couldn’t be developed (and add it to my ever-growing to-do list; that could be a fun project with a lot of interesting results/paths).

that could be a fun project with a lot of interesting results/paths

Yes, that was my lasting memory too. It was something I would have done on my own time actually.

I found the whole idea very intellectually challenging and stimulating, which is why I like this job.

I think.