Can anyone explain to me what has happened here please?
We have a very large customers table (32.75m rows, 12GB in size) which has a whole load of invalid data in it and I was tasked to reduce its size. We need to remove customers who have no email address recorded and are over 30 days old, but the query to get that data on the server will grind it to a halt.
So, I selected all of the rows, just the ID, email
and dateadded
, 1m at a time, into a CSV outfile. Using the CSV engine I ran an “INSERT SELECT FROM” and imported the records which met the criteria above into a MyISAM table. When this was done (took about 6 seconds per 1m row import file) I exported the IDs into another file.
This identified 28.3 million records which needed to be removed, but I needed to do an extra level of checking before deleting: Since I grabbed those IDs, have these customer records become valid? So, the script selected all rows that now had a valid email address and the ones that do were removed from the list to process. Then, any that remained were stripped from the basket
and the customer
tables and at the end a log was appended to tell me how many rows were deleted, how many per second, and how many valid customers were overlooked.
The process took a few days to complete, on and off, but last night we ran an optimize on the customer table, which took 2h 20min.
Now, my log file suggests that we’ve removed 28,261,695 in 24h 47min and 99.42% of the IDs I gave it were removed. So, we expected the count on the table to be something like 4.5m, however, it’s actually reporting 28,091,795 and 8.3GB still.
I’m going to get a million rows out again and run it through the same process that I did previously to compare the results, but if anyone could shed any light on what may have happened I would be most grateful.
Cheers