Concatenate Columns & Delete Duplicates

Hello,
This is my table structure:

CREATE TABLE IF NOT EXISTS `the_table_name` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(100) NOT NULL,
  `phone` varchar(25) NOT NULL,
  `nric` varchar(25) NOT NULL,
  `address1` varchar(255) NOT NULL,
  `gender` varchar(10) NOT NULL,
  `race` varchar(10) NOT NULL,
  `date` int(4) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=0 ;

I want to concatenate name, phone, nric, and address1 into a temporary column, identify the duplicate rows in that temporary column, and delete those duplicates so that only unique rows remain.

For example, let’s say that after concatenation, the temporary column contains the following records:
michael, 12345678, 91011231314, baker street
james, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, downing road

This record should be deleted:
michael, 12345678, 91011231314, baker street

These records should remain:
michael, 12345678, 91011231314, baker street
james, 12345678, 91011231314, baker street
michael, 12345678, 91011231314, downing road

How do I achieve this via a query?

Thanks for your help.

BTW, I’m looking for a MySQL solution. Sorry I didn’t mention that earlier.

  1. Add hash column to table:

    ALTER TABLE the_table_name ADD hash VARCHAR( 32 ) NULL DEFAULT NULL;

  2. Calculate hash for each row and store in that column:

    UPDATE the_table_name SET hash = MD5(CONCAT(name, ':', phone, ':', nric, ':', address1));

  3. Delete rows with duplicated hash (but leave one with lowest id):

    DELETE t1 FROM the_table_name t1, the_table_name t2 WHERE t1.id > t2.id AND t1.hash = t2.hash;

  4. Drop hash column:

    ALTER TABLE the_table_name DROP hash;

  5. Done.
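The five steps above can be sketched end to end in a few lines. This is a minimal illustration, using Python's stdlib `sqlite3` as a stand-in for MySQL (SQLite has no `MD5()` and no multi-table `DELETE`, so a plain concatenation serves as the fingerprint and a subquery replaces the self-join delete); the table and sample rows mirror the ones in this thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE the_table_name (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, phone TEXT, nric TEXT, address1 TEXT
    )
""")
rows = [
    ("michael", "12345678", "91011231314", "baker street"),
    ("james",   "12345678", "91011231314", "baker street"),
    ("michael", "12345678", "91011231314", "baker street"),  # duplicate of row 1
    ("michael", "12345678", "91011231314", "downing road"),
]
con.executemany(
    "INSERT INTO the_table_name (name, phone, nric, address1) VALUES (?, ?, ?, ?)",
    rows,
)

# Steps 1-2: add the helper column and fill it with a per-row fingerprint.
# SQLite lacks MD5(), so the raw concatenation stands in for the hash here.
con.execute("ALTER TABLE the_table_name ADD COLUMN hash TEXT")
con.execute(
    "UPDATE the_table_name "
    "SET hash = name || ':' || phone || ':' || nric || ':' || address1"
)

# Step 3: delete every row whose fingerprint also appears on a lower id.
# (MySQL's multi-table DELETE isn't available in SQLite; a GROUP BY
# subquery that keeps MIN(id) per fingerprint has the same effect.)
con.execute("""
    DELETE FROM the_table_name
    WHERE id NOT IN (SELECT MIN(id) FROM the_table_name GROUP BY hash)
""")

# Step 4 in MySQL would be ALTER TABLE the_table_name DROP hash; older
# SQLite builds lack DROP COLUMN, so the helper column is left in place.

remaining = con.execute(
    "SELECT name, address1 FROM the_table_name ORDER BY id"
).fetchall()
print(remaining)
```

Only the earliest copy of each duplicate survives, so the second `michael / baker street` row is the one removed.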


@megazoid has provided a possible solution… which really addresses the question that I was going to ask

  • Why create a temp column if you want to delete those rows?

Who suggested a temp table?

The temp column idea is okay, but not entirely necessary.

My typo. I don’t use temp columns to find duplicates.

Thanks for your reply megazoid. I’ve successfully followed your instructions up to step 2.

However, I don’t understand how to write the query for step 3. Should it be like this?
DELETE t1 FROM the_table_name t1, the_table_name t2 WHERE t1.id > t2.id AND t1.hash = t2.hash;

When I run this query, my server hangs / takes a very long time. At the moment, I’m running this experiment on 80,000 records.

Eventually, I need to run this query on 30 million records…

Alternatively, is there a way to throw away the concatenated duplicates when I import the records as a tab delimited text file into the table? At the moment, I’m using the following query:
    LOAD DATA LOCAL INFILE '/path/to/my/server.com/public_html/tab_delimited_records.txt' INTO TABLE the_table_name (name, phone, nric, address1, gender, race, date);

Thanks for your help.

Try adding an index on the hash column.
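For example (assuming the hash column from step 1 is still in place), the index can be added with:

    ALTER TABLE the_table_name ADD INDEX idx_hash (hash);

With the index, the self-join in step 3 can look up matching hash values directly instead of scanning the whole table for every row, which is what makes the delete slow on large tables.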


Wow, this solved the problem: it took 3.38 seconds to process 80,000 records. Thank you, megazoid, for your guidance. God bless you :)

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.