[MySQL 5.0.45-community-nt-log] Remove Duplicate Data or Rows

cms9651 · November 22, 2012, 12:54pm

Hello guys!

How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?

Could you please help?
Thanks you very much for your help.

r937 · November 22, 2012, 1:05pm

how? with a healthy dose of patience

what part are you having problems with? identifying the duplicates, or removing them?

cms9651 · November 22, 2012, 1:12pm

You have right my teacher…

I have first problem with identifying the duplicates because I use DISTINCT in my Select query but for the number of rows (more than 10.000.000 rows) I have this error:

[Err] 2013 - Lost connection to MySQL server during query

Any suggestions?
Thank you

r937 · November 22, 2012, 2:28pm

how do you know there are any duplicates at all?

cms9651 · November 22, 2012, 2:42pm

thank you for reply.

I try this query:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

And in the output I have duplicates rows…

r937 · November 22, 2012, 3:39pm

so you don’t actually have a problem identifying the duplicates?

how many collumns in this table?

cms9651 · November 22, 2012, 4:56pm

thank you.

My problems are:

the number of rows in tbl_t over 10.000.000;
delete the duplicate rows in tbl_t;
the number of column in tbl_t are 16.

If try this example query I don’t have problem:

SELECT
	Field1
FROM
	tbl_t
GROUP BY
	Field1
HAVING
	COUNT(*) > 1;

The problem is when add all fields (number 16) of tbl_t in my query for find duplicates:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

MySQL is crashing, what do I do?

r937 · November 23, 2012, 10:52am

what is the primary key of your table?

cms9651 · November 23, 2012, 2:47pm

Field 1_ID int(10) NOT NULL AUTO_INCREMENT, PRIMARY KEY (1_ID)

r937 · November 23, 2012, 3:55pm

i am sorry, something still is not right

you cannot possibly have duplicate rows in the table, not with a PRIMARY KEY

this is, i am sure, because you have used fake column names

something doesn’t make sense here

cms9651 · November 23, 2012, 7:10pm

It’s true… I have used fake column names in my example to simplify:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

For me these are duplicate rows even if 1_id is different, all other data are the same:

+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 1_id    |name 	| asciiname   | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038838 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038839 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+

continue...
+---------+------------+-------------+----------------+-------------------+
| admin2  | population | elevation   | timezone       | modification date |
+---------+------------+-------------+----------------+-------------------+
| 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
+---------+------------+-------------+----------------+-------------------+
| 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
+---------+------------+-------------+----------------+-------------------+

Table structure:


DROP TABLE IF EXISTS `tbl_t`;
CREATE TABLE `tbl_t` (
  `1_id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(200) DEFAULT NULL,
  `asciiname` varchar(200) DEFAULT NULL,
  `alternatenames` varchar(5000) DEFAULT NULL,
  `latitude` char(200) DEFAULT NULL,
  `longitude` char(200) DEFAULT NULL,
  `feature class` char(100) DEFAULT NULL,
  `feature code` char(100) DEFAULT NULL,
  `country code` char(200) DEFAULT NULL,
  `cc2` char(60) DEFAULT NULL,
  `admin1` char(20) DEFAULT NULL,
  `admin2` char(80) DEFAULT NULL,
  `population` varchar(200) DEFAULT NULL,
  `elevation` varchar(200) DEFAULT NULL,
  `timezone` varchar(200) DEFAULT NULL,
  `modification date` date DEFAULT NULL,
  PRIMARY KEY (`1_id`),
  KEY `CODE` (`1_id`),
  KEY `COUNTRYCODE` (`country code`),
  KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;

KyleWolfe · November 23, 2012, 9:44pm

Simply select all columns and group by all columns, and place a min() on the primary key ID, and remove that ID from the group clause. This will give you a unique list of data and the lowest ID.

r937 · November 23, 2012, 11:07pm

kyle, he’s got unique primary keys, putting a MIN() on the primary key won’t help

the duplicates now are given by “even if 1_id is different, all other data are the same”

add a WHERE clause to restrict the duplicates to a range –

SELECT
	id_1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
 [COLOR="#0000FF"]WHERE id_1 BETWEEN 13000 AND 17000[/COLOR]
GROUP BY
	id_1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

this should stop the query from “crashing” but you’ll have to do it multiple times

cms9651 · November 26, 2012, 2:05pm

thank you.

I try this query:

SELECT
	id_1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
 [COLOR="#0000FF"]WHERE id_1 BETWEEN 1 AND 10000[/COLOR]
GROUP BY
	id_1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

The output is:

Affected rows: 0
Time: 1.176ms

this means that between the rows 1 and 10000 there are not duplicates ?

KyleWolfe · November 26, 2012, 2:47pm

CMS. Do you have other tables that use id_1 as a foreign key?

cms9651 · November 26, 2012, 2:52pm

Not, I don’t have.

KyleWolfe · November 26, 2012, 3:31pm

My earlier post will suffice then, at least as a starting point. You don’t even need to worry about bringing your id_1 over and just create a new one.

--Make temp table
CREATE TABLE `tempTable` (
  `1_id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(200) DEFAULT NULL,
  `asciiname` varchar(200) DEFAULT NULL,
  `alternatenames` varchar(5000) DEFAULT NULL,
  `latitude` char(200) DEFAULT NULL,
  `longitude` char(200) DEFAULT NULL,
  `feature class` char(100) DEFAULT NULL,
  `feature code` char(100) DEFAULT NULL,
  `country code` char(200) DEFAULT NULL,
  `cc2` char(60) DEFAULT NULL,
  `admin1` char(20) DEFAULT NULL,
  `admin2` char(80) DEFAULT NULL,
  `population` varchar(200) DEFAULT NULL,
  `elevation` varchar(200) DEFAULT NULL,
  `timezone` varchar(200) DEFAULT NULL,
  `modification date` date DEFAULT NULL,
  PRIMARY KEY (`1_id`),
  KEY `CODE` (`1_id`),
  KEY `COUNTRYCODE` (`country code`),
  KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
--Append unique data to new table, excluding 1_id from select as it is causing your 'dupes'
INSERT INTO tempTable (name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date)
SELECT
  name,
  asciiname,
  alternatenames,
  latitude,
  longitude,
  feature class,
  feature code,
  country code,
  cc2,
  admin1,
  admin2,
  population,
  elevation,
  timezone,
  modification date
FROM
  tbl_t
GROUP BY  
  name,
  asciiname,
  alternatenames,
  latitude,
  longitude,
  feature class,
  feature code,
  country code,
  cc2,
  admin1,
  admin2,
  population,
  elevation,
  timezone,
  modification date

This is just a start… You obviously have an insert query somewhere that was causing duplicates to be created, you’ll have to find that and weed it out. Once you’ve created this temp table you may want to check for duplicate ‘name’ entries.

cms9651 · November 27, 2012, 11:02am

thank you.
I try your suggestion but I have the crash of mysql:
[Err] 126 - Incorrect key file for table ‘C:\WINDOWS\TEMP\#sql240_1e_9.MYI’; try to repair it