Managing Large Data Volumes with PHP & MySQL

Hello developers –

Coming down to the final tweaks on an application, and we’ve started to import data… lots of data… and the process, while fast, could be better [4,000-5,000 records / minute with over 10,000,000 to process]. Hoping that you all can offer up some suggestions on what to optimize.

The import consists of around 1500 XML files, ranging from 10 MB to 1 GB each, with a series of products and the corresponding stores selling said products.

  1. Gather an associative array keyed by [known] products - $bigproducts
  2. Loop through each category; download and extract (gunzip) the product-level file
  3. Loop through the products:
     1. Check if the product exists by identifier key in $bigproducts
     2. If the identifier is not found, a query is run on 5 indexed fields looking for the item
     3. No matches: insert a db record within the loop; add the key to $bigproducts
     4. Matches: skip and move on
  4. Access the store-product that matches each product:
     1. Take the product id and run an INSERT with DUPLICATE KEY UPDATE

Of course there’s a ton of secondary work in data cleanup, managing categories and such but that’s the high-level logic.
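In rough, simplified pseudo-PHP [the helper names and the $db handle here are just stand-ins for illustration; the real code does a lot more of that cleanup], the loop looks something like:

// $bigproducts: identifier => product id for every product we already know about
$bigproducts = load_known_product_keys($db);      // hypothetical helper

foreach ($categories as $category) {
    $products = download_and_extract($category);  // hypothetical: fetch + gunzip + parse the XML

    foreach ($products as $product) {
        $key = $product['identifier'];

        if (isset($bigproducts[$key])) {
            $productId = $bigproducts[$key];       // known product: straight on to the store data
        } else {
            // not in the array: fall back to a query on the indexed fields
            $productId = find_product_by_indexed_fields($db, $product);  // hypothetical helper
            if ($productId === false) {
                $productId = insert_product($db, $product);              // truly new: insert now
            }
            $bigproducts[$key] = $productId;
        }

        // store-product rows always get the forced INSERT ... ON DUPLICATE KEY UPDATE
        upsert_store_product($db, $productId, $product['store']);        // hypothetical helper
    }
}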

All of the PHP has been tightly managed to prevent memory leaks: we see less than 100 bytes used per record, all released back through the cycle, with just a small build-up over the course of 100,000 product imports.

As I see it the major bottlenecks are in the queries and the inserts.

Each product not in the array [which is everything in a new category] gets a query. Products that are truly new get an immediate insert. Individual store-product records don’t query data but do a forced INSERT / UPDATE every time.

We’ve considered batching the inserts, but we’re talking about thousands of inserts per category, and there’s also an impact on data quality from not finding duplicates.

We’ve also considered killing the indexes on the table during the process, but since it’s actively adding records that we still need to search against, that also seems like a bad idea.
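[The kind of thing I mean, for reference - MyISAM only as far as I know, and $db is a stand-in for our PDO handle:]

// defer non-unique index maintenance during the bulk load (MyISAM tables)
$db->exec('ALTER TABLE products DISABLE KEYS');
// ... run the import ...
$db->exec('ALTER TABLE products ENABLE KEYS');   // rebuild the indexes in one pass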

Any thoughts on how to kick things up to a faster state? Or what to do to measure where it may be slow?

First question I’d be asking is: can you get an EFFICIENT data format instead of XML? Like, say, something actually machine readable? Don’t let the XML zealots tell you otherwise: XML is no more machine readable than C source code or JSON. The only reason it exists is to be HUMAN readable – which is why, in terms of processing, it’s one of the most grossly inefficient means of storing data out there.

I’ll take a deeper look at your explanation tomorrow – have you done any code profiling to see which sections are actually taking the longest to execute?

Nope. It’s XML or bust… and even worse, individual records aren’t timestamped with updates so I have to assume everything changes until I compare it manually.

We’ve tweaked a lot of the script around memory leaks and delays, so the obvious items are falling off… Just rebuilt a query to use 3 indexes rather than 5, which sped things up to around 10,000 records / minute. Not sure if there’s something more we can profile to understand it?

In typical fashion, as soon as we increased speed we realized the sacrifice in quality was too big and have stepped back down.

Adding ON DUPLICATE KEY UPDATE to the store-product records seems to have sped them up by removing a query on each one. However, adding back in a query for each product that is not found by ID offsets that gain.
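For the store-product records it’s basically this pattern [column names simplified here, and $db is a stand-in for our PDO handle]:

// one round trip per store-product row: insert if new, update if the
// (store_id, product_id) unique key already exists
$stmt = $db->prepare(
    'INSERT INTO store_products (store_id, product_id, price)
     VALUES (?, ?, ?)
     ON DUPLICATE KEY UPDATE price = VALUES(price)'
);
$stmt->execute(array($storeId, $productId, $price));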

Running at around 5,000 / minute.

You seem to have some unique key in the products XML.

You say the major time bottleneck is the queries and inserts.

You could add more MySQL server hardware, if you are indeed write-limited, and split the writes across servers by hashing the key:

$n = count($connections);
$cnn = $connections[crc32($productKey) % $n];
… do insert or update …

Effectively reducing the queries each server sees by a factor of $n.

Then perform a merge of some kind afterwards.
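Fleshed out a little, purely as a sketch (hostnames, credentials and column names here are made up):

// one connection per MySQL server
$connections = array(
    new PDO('mysql:host=db1;dbname=products', $user, $pass),
    new PDO('mysql:host=db2;dbname=products', $user, $pass),
);

$n   = count($connections);
$cnn = $connections[crc32($productKey) % $n];   // the same key always lands on the same server

$stmt = $cnn->prepare(
    'INSERT INTO store_products (product_key, store_id, price)
     VALUES (?, ?, ?)
     ON DUPLICATE KEY UPDATE price = VALUES(price)'
);
$stmt->execute(array($productKey, $storeId, $price));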

This might sound odd, but that massive associative array may be your biggest bottleneck – in terms not just of RAM usage, but also in terms of lookups! PHP associative arrays expand into the namespace since they are not actually indexed arrays in the conventional programming sense (under the hood they don’t work like real arrays at all!). It may in fact be faster to run a query against the database than to pre-load all those keys… I know that sounds weird, but it really hinges on how big that array ends up and how PHP handles associative arrays.

Not sure it would make a difference, but it might be something to try. At least on a key field in a database you have optimizations like binary trees on the indexes. I’m not sure what or even if PHP has that sort of thing attached to associative arrays.
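In other words, instead of checking isset($bigproducts[$key]) against a multi-million entry array, something along these lines (just a sketch, assuming a PDO handle and an indexed identifier column):

// prepared once outside the loop, reused for every product
$lookup = $db->prepare('SELECT id FROM products WHERE identifier = ? LIMIT 1');

$lookup->execute(array($key));
$productId = $lookup->fetchColumn();   // false if the key isn't in the table yet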

What? Are you actually saying that you use associative arrays without knowing all the details of the implementation. I’m very disappointed in you. I thought you knew everything about everything and that only stupid lazy programmers like myself used features without knowing right down to the bit level how they worked. My bubble has been burst.

Ted S,
Is this pretty much a one-time import? Maybe you could rent a cloud instance (Amazon EC2) and basically throw virtual hardware at it. I would also be tempted to just write a C program and interface directly with the database.

One time insert… daily updates [which is a problem when it takes over a day to write]. :smiley:

The box I have this on is fairly nice and not showing any stress but I’ll try another option and see if that speeds things up.

Thanks

The assoc array starts at 0 but grows into millions of items, and there are a couple of them [smaller, but still thousands of records] for other searches too.

The query with indexes is running insanely fast despite going through a million+ searches. I’ll see if killing the array does anything.

Thanks

I would advise actually taking a look at what consumes the most time, cpu, memory, etc. by using some profiling tools (assuming you aren’t already). There’s no point taking random shots in the dark at what might be the weaker parts of the system: find them and work on those that you can get the biggest gains out of.
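Even something crude gives you a starting point if you don’t want to set up a full profiler; for example, wrapping the suspect sections of the loop with timers:

// crude section timing: accumulate wall-clock time per labelled section
$timings = array('lookup' => 0, 'write' => 0);

$t = microtime(true);
// ... the lookup / query section of the loop ...
$timings['lookup'] += microtime(true) - $t;

$t = microtime(true);
// ... the insert / update section of the loop ...
$timings['write'] += microtime(true) - $t;

// after a run (or every N records), see where the time actually went
print_r($timings);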

FYI, we were playing around with this and you were absolutely correct. MySQL was kicking butt, while the bigger the array got, the slower things ran. Great tip.

You’re out of my league… Any suggestions on which tools would help with this?

A lot of the notions – like system arrays being faster and handling cache yourself – just don’t work right in interpreted languages like PHP. Arrays in particular because they aren’t really arrays under the hood… at least not in the sense compiled languages like C or Pascal mean… even PHP’s numeric indexes are NOT actually memory offsets like they are in ‘real’ languages… and the lack of strict typecasting can make things many times slower.

I wasn’t sure if multiple SQL queries were going to be faster, but it sounds like they were. That’s something I’ve hit up against with many SQL “experts”: they get so obsessed with using fewer queries that they forget larger result sets can be painfully slow and chew up your free RAM. This is even more true when you have high network latency to your SQL engine because it’s hosted on a different server.

It flies in the face of SQL Conventional Wisdom – but often multiple smaller queries are FASTER than doing one big one, just because you’re shoving data around in smaller sets. This is particularly the case in interpreted languages or so-called ‘VM’/JIT-compiled languages – and even more so when there’s no strict typecasting. The overhead of working with the data outweighs the penalty of multiple SQL connects.

More than once I’ve gone into people’s servers where they were choking out a dual Xeon setup – and by breaking up their massive queries and getting memory use under control they went from 90% IOWAIT and 4.0+ load to 15% IOWAIT and barely tickling userland.
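The usual shape of that fix is chunking on an indexed key rather than pulling one giant result set; just a sketch (table and column names invented, $db assumed to be a PDO handle):

// walk the table in keyed chunks instead of one massive SELECT
$stmt = $db->prepare(
    'SELECT id, identifier, name FROM products WHERE id > ? ORDER BY id LIMIT 1000'
);

$lastId = 0;
do {
    $stmt->execute(array($lastId));
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // ... work on one small row set at a time, keeping memory flat ...
        $lastId = $row['id'];
    }
} while (count($rows) === 1000);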

Glad to hear that change worked out for you.

Funny guy… :smiley:

It was indeed a guess, but it was an educated guess given how non-typecast languages and PHP associative arrays have to work. I figured they wouldn’t waste the overhead of a B-tree or other indexing optimizations on the namespace.

… and it looks like my guess was right so nana-nana foo-foo. :wink:

Such an optimization to the language would probably double the execution speed or more – particularly for variables that come from outside userland code-space. The problem is it would effectively add around 32 bytes to each and every variable and slow down variable creation. Depending on how many variables (or array indexes/object properties) are created, it could be painfully slow when you create a new variable or entry, offsetting the speed gains when you turn around and access them.

Though that’s why Pascal/Modula, back when they compiled to P-Code, had forward declaration in the first place.

You could make a hybrid with batch inserts if you like. In pseudo code:


$newkeys = array();
$inserts = array();
$i = 0;
foreach ($newproducts as $newproduct) {
  if (in_array($newproduct['id'], $newkeys) || key_exists_in_database($newproduct['id'])) {
    continue; // or do some update if appropriate
  }
  $inserts[] = $newproduct;
  $newkeys[] = $newproduct['id'];
  $i++;
  if ($i == 100) { // 100 is an example, tweak as you like
    insert_into_database($inserts);
    // everything buffered so far is now in the database, so reset the buffers
    $newkeys = array();
    $inserts = array();
    $i = 0;
  }
}
if (count($inserts) > 0) {
  // there are still some products in the buffer that haven't been INSERTed yet -- do that now
  insert_into_database($inserts);
}

Obviously, key_exists_in_database and insert_into_database are mock functions.

Basically you keep a buffer of products (in the example a max of 100), and write them to the database in batches. Indeed, as you say, you can’t find the keys in the database while the items aren’t inserted yet, but that’s what the $newkeys array is for – it stores the product keys of products that still have to be INSERTed into the database. You can also increase the number 100, but don’t make it too big because you’ll lose the advantage (and you may also hit MySQL’s query length limit).
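If it helps, insert_into_database() boils down to a single multi-row INSERT; roughly like this (column names are just examples, and it assumes a PDO connection in $db):

function insert_into_database($inserts) {
    global $db;   // assumed PDO connection; adapt to however you manage yours

    if (empty($inserts)) {
        return;
    }

    // one placeholder group and one set of values per buffered product
    $placeholders = array();
    $values = array();
    foreach ($inserts as $p) {
        $placeholders[] = '(?, ?)';
        $values[] = $p['id'];
        $values[] = $p['name'];
    }

    $sql = 'INSERT INTO products (identifier, name) VALUES ' . implode(', ', $placeholders);
    $db->prepare($sql)->execute($values);
}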

I’ve used this technique in the past and it works quite well! :slight_smile: