Concatenate Columns & Delete Duplicates

Hello,
This is my table structure:

CREATE TABLE IF NOT EXISTS `the_table_name` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(100) NOT NULL,
  `phone` varchar(25) NOT NULL,
  `nric` varchar(25) NOT NULL,
  `address1` varchar(255) NOT NULL,
  `gender` varchar(10) NOT NULL,
  `race` varchar(10) NOT NULL,
  `date` int(4) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=0 ;

I want to concatenate name, phone, nric, and address1 into a temporary column, identify the duplicate rows in that temporary column, and delete those duplicates so that only unique rows remain.

For example, let’s say that after concatenation, the temporary column contains the following records:
michael, 12345678, 91011231314, baker street
james, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, downing road

This record should be deleted:
michael, 12345678, 91011231314, baker street

These records should remain:
michael, 12345678, 91011231314, baker street
james, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, downing road

How do I achieve this via a query?

Thanks for your help.

BTW, I’m looking for a MySQL solution. Sorry I didn’t mention that earlier.

  1. Add hash column to table:

    ALTER TABLE the_table_name ADD hash VARCHAR( 32 ) NULL DEFAULT NULL;

  2. Calculate hash for each row and store in that column:

    UPDATE the_table_name SET hash = MD5(CONCAT(name, ':', phone, ':', nric, ':', address1));

  3. Delete rows with duplicated hash (but leave one with lowest id):

    DELETE t1 FROM the_table_name t1, the_table_name t2 WHERE t1.id > t2.id AND t1.hash = t2.hash;

  4. Drop hash column:

    ALTER TABLE the_table_name DROP hash;

  5. Done.
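The five steps above can be sketched end to end in a few lines. This is a minimal illustration, using Python's stdlib `sqlite3` as a stand-in for MySQL (SQLite has no `MD5()` and no multi-table `DELETE`, so a plain concatenation serves as the fingerprint and a subquery replaces the self-join delete); the table and sample rows mirror the ones in this thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE the_table_name (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, phone TEXT, nric TEXT, address1 TEXT
    )
""")
rows = [
    ("michael", "12345678", "91011231314", "baker street"),
    ("james",   "12345678", "91011231314", "baker street"),
    ("michael", "12345678", "91011231314", "baker street"),  # duplicate of row 1
    ("michael", "12345678", "91011231314", "downing road"),
]
con.executemany(
    "INSERT INTO the_table_name (name, phone, nric, address1) VALUES (?, ?, ?, ?)",
    rows,
)

# Steps 1-2: add the helper column and fill it with a per-row fingerprint.
# SQLite lacks MD5(), so the raw concatenation stands in for the hash here.
con.execute("ALTER TABLE the_table_name ADD COLUMN hash TEXT")
con.execute(
    "UPDATE the_table_name "
    "SET hash = name || ':' || phone || ':' || nric || ':' || address1"
)

# Step 3: delete every row whose fingerprint also appears on a lower id.
# (MySQL's multi-table DELETE isn't available in SQLite; a GROUP BY
# subquery that keeps MIN(id) per fingerprint has the same effect.)
con.execute("""
    DELETE FROM the_table_name
    WHERE id NOT IN (SELECT MIN(id) FROM the_table_name GROUP BY hash)
""")

# Step 4 in MySQL would be ALTER TABLE the_table_name DROP hash; older
# SQLite builds lack DROP COLUMN, so the helper column is left in place.

remaining = con.execute(
    "SELECT name, address1 FROM the_table_name ORDER BY id"
).fetchall()
print(remaining)
```

Only the earliest copy of each duplicate survives, so the second `michael / baker street` row is the one removed.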


@megazoid has provided a possible solution… which really addresses the question that I was going to ask

  • Why create a temp column if you want to delete those rows?

Who suggested a temp table?

The temp column idea is okay, but not entirely necessary.

My typo. I don’t use temp columns to find duplicates.

Thanks for your reply megazoid. I’ve successfully followed your instructions up to step 2.

However, I don’t understand how to write the query for step 3. Should it be like this?
DELETE t1 FROM the_table_name t1, the_table_name t2 WHERE t1.id > t2.id AND t1.hash = t2.hash;

When I run this query, my server hangs / takes a very long time. At the moment, I’m running this experiment on 80,000 records.

Eventually, I need to run this query on 30 million records…

Alternatively, is there a way to throw away the concatenated duplicates when I import the records as a tab delimited text file into the table? At the moment, I’m using the following query:
    LOAD DATA LOCAL INFILE '/path/to/my/server.com/public_html/tab_delimited_records.txt' INTO TABLE the_table_name (name, phone, nric, address1, gender, race, date);

Thanks for your help.

Try adding an index on the hash column.
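For example (assuming the hash column from step 1 is still in place), the index can be added with:

    ALTER TABLE the_table_name ADD INDEX idx_hash (hash);

With the index, the self-join in step 3 can look up matching hash values directly instead of scanning the whole table for every row, which is what makes the delete slow on large tables.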


Wow, this solved the problem: it took 3.38 seconds to process 80,000 records. Thank you, megazoid, for your guidance. God bless you :)

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.