Delete Duplicates From All Tables?

AdWarm · September 18, 2013, 9:49am

Hi Guys,

I have 4 tables within the same database. All of them have a field called ‘email’ in them. How do i go about doing a Query which will Delete the row of data if the same ‘email’ is found in another Table leaving just the 1 record? I dont know if this makes it any easier for you but each row of data has an ID number (note it wont be the same ID number in the other tables for the same ‘email’ row, its just an Auto Increment i added before importing the data).

Another thing i need to make you aware of is that theres nearly 10 million records between the 4 tables. Is there anything i should do to speed the process up?

If anyone could help me that would be great.

Thanks,
Mike

cpradio · September 18, 2013, 12:25pm

Can you provide more background? What are these 4 tables and what is their relationship to each other? Do you want to maintain the email address in one of the 4 tables (so really we’d just be removing it from the other 3) and if so, which one should maintain the unique email address?

Can you provide a generic example, such as:

table_name
1 myemail@mydomain.com KEEP
2 myotheremail@mydomain.com

table_name2
4 myemail@mydomain.com

table_name3
16 myemail@mydomain.com

etc.

AdWarm · September 18, 2013, 1:22pm

Below are the table structures:


Table: BB11UKCoreg

1 email text
2 firstname text
3 lastname text
4 ip text 
5 optindate text
6 source text 
7 ID int(11) AUTO_INCREMENT

Table: Cinamuse

1 email text
2 ip text
3 url text 
4 timestamp text
5 ID int(11) AUTO_INCREMENT

Table: ElitemateJS

1 firstname text 
2 lastname text 
3 address1 text
4 town text 
5 email text
6 source text 
7 ip text
8 ID int(11) AUTO_INCREMENT

Table: JSRevG

1 firstname text 
2 lastname text 
3 address1 text
4 address2 text 
5 address3 text
6 address4 text 
7 address5 text 
8 postcode text
9 email text
10 ip text 
11 appdate text 
12 createdate text
13 sourceurl text 
14 ID int(11) AUTO_INCREMENT

What i need to do is this:

Mysql reads the data in DB1
Mysql then reads DB2 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB2
Mysql then reads DB3 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB3
Mysql then reads DB4 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB4

Mysql reads the data in DB2
Mysql then reads DB1 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB1
Mysql then reads DB3 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB3
Mysql then reads DB4 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB4

and so on… does that make sence?

EDIT: there is no relation between the tables.

cpradio · September 18, 2013, 1:35pm

AdWarm:

What i need to do is this:

Mysql reads the data in DB1
Mysql then reads DB2 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB2
Mysql then reads DB3 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB3
Mysql then reads DB4 to see if any ‘email’ is the same as whats in DB1
If mysql see a match the row is deleted from DB4

Mysql reads the data in DB2
Mysql then reads DB1 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB1
Mysql then reads DB3 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB3
Mysql then reads DB4 to see if any ‘email’ is the same as whats in DB2
If mysql see a match the row is deleted from DB4

and so on… does that make sence?

EDIT: there is no relation between the tables.

Wouldn’t DB2 only need to compare itself to DB3 and DB4, as its’ comparison to DB1 is already done? Then DB3 would only need to compare itself to DB4. So you’d have:

DB1 compared to DB2
DB1 compared to DB3
DB1 compared to DB4
DB2 compared to DB3
DB2 compared to DB4
DB3 compared to DB4

AdWarm · September 18, 2013, 3:42pm

Hi,

Ah right, yeah i guess so, sorry my brain isnt 100% active today… Would you be able to help me out please. Take a look below, would this work?


DELETE FROM DB2 WHERE email IN (SELECT email FROM DB1);
DELETE FROM DB3 WHERE email IN (SELECT email FROM DB1);
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB1);

DELETE FROM DB1 WHERE email IN (SELECT email FROM DB2);
DELETE FROM DB3 WHERE email IN (SELECT email FROM DB2);
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB2);

DELETE FROM DB1 WHERE email IN (SELECT email FROM DB3);
DELETE FROM DB2 WHERE email IN (SELECT email FROM DB3);
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB3);

DELETE FROM DB1 WHERE email IN (SELECT email FROM DB4);
DELETE FROM DB2 WHERE email IN (SELECT email FROM DB4);
DELETE FROM DB3 WHERE email IN (SELECT email FROM DB4);

Also should i do anything to each table since theres millions of records to make it process the query quicker?

Thanks

cpradio · September 18, 2013, 3:46pm

AdWarm:

Hi,

Ah right, yeah i guess so, sorry my brain isnt 100% active today… Would you be able to help me out please. Take a look below, would this work?


DELETE FROM DB2 WHERE email IN (SELECT email FROM DB1);
DELETE FROM DB3 WHERE email IN (SELECT email FROM DB1);
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB1);

[s]DELETE FROM DB1 WHERE email IN (SELECT email FROM DB2);[/s]
DELETE FROM DB3 WHERE email IN (SELECT email FROM DB2);
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB2);

[s]DELETE FROM DB1 WHERE email IN (SELECT email FROM DB3);[/s]
[s]DELETE FROM DB2 WHERE email IN (SELECT email FROM DB3);[/s]
DELETE FROM DB4 WHERE email IN (SELECT email FROM DB3);

[s]DELETE FROM DB1 WHERE email IN (SELECT email FROM DB4);[/s]
[s]DELETE FROM DB2 WHERE email IN (SELECT email FROM DB4);[/s]
[s]DELETE FROM DB3 WHERE email IN (SELECT email FROM DB4);[/s]

Also should i do anything to each table since theres millions of records to make it process the query quicker?

Thanks

Yes, that looks right to me although I’ve striked the redundant checks (check the quoted code).

As for making it run quicker, index the email column across all 4 tables (let the indexes get built, then run it). It will definitely take a while, but there isn’t much you can do about that.

AdWarm · September 22, 2013, 1:05pm

I need some help, iv indexed all the email fields in all 4 of the tables and im running one of the queries to delete any dups that are found in the other table and so far its ran for over 24hrs. Is there any other way of making it faster?

SpacePhoenix · September 22, 2013, 11:06pm

How many records are in each table?

Are you running the queries via something like phpMyAdmin?

Have you thought about having PHP interact with the database, remove all duplicates from table 1 then from table one get all the email address and for each one run separate DELETE queries against each table in turn to remove the duplicates.

AdWarm · September 23, 2013, 9:49am

Hi,

Yea im running the query via PHPMYADMIN. I’ve just had to restart MySQL as it was taking too long.

DB1 = 5,929,150 records
DB2 = 786,755 records
DB3 = 387,964 records
DB4 = 2,721,587 records

If i did it via PHP wouldnt that taking a long time to do and then timeout?

Whats the best way of me doing this, i really need some help

Thank you…

SpacePhoenix · September 24, 2013, 6:18am

It’d probably just a similar time done via php but might have a lot less of a chance of timing out. Looking at the tables, what’s the relationship between them?

AdWarm · September 24, 2013, 9:55am

Hi,

Theres no relation to them all.

Thanks