[MySQL 5.0.45-community-nt-log] Remove Duplicate Data or Rows

Hello guys!

How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?

Could you please help?
Thanks you very much for your help.

how? with a healthy dose of patience

what part are you having problems with? identifying the duplicates, or removing them?

You have right my teacher… :slight_smile:

I have first problem with identifying the duplicates because I use DISTINCT in my Select query but for the number of rows (more than 10.000.000 rows) I have this error:

[Err] 2013 - Lost connection to MySQL server during query

Any suggestions?
Thank you

how do you know there are any duplicates at all?

thank you for reply.

I try this query:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

And in the output I have duplicates rows…

so you don’t actually have a problem identifying the duplicates?

how many collumns in this table?

thank you.

My problems are:

  • the number of rows in tbl_t over 10.000.000;
  • delete the duplicate rows in tbl_t;
  • the number of column in tbl_t are 16.

If try this example query I don’t have problem:

SELECT
	Field1
FROM
	tbl_t
GROUP BY
	Field1
HAVING
	COUNT(*) > 1;

The problem is when add all fields (number 16) of tbl_t in my query for find duplicates:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

MySQL is crashing, what do I do?

what is the primary key of your table?

Field 1_ID int(10) NOT NULL AUTO_INCREMENT, PRIMARY KEY (1_ID)

i am sorry, something still is not right

you cannot possibly have duplicate rows in the table, not with a PRIMARY KEY

this is, i am sure, because you have used fake column names

something doesn’t make sense here

It’s true… I have used fake column names in my example to simplify:

SELECT
	Field1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
GROUP BY
	Field1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

For me these are duplicate rows even if 1_id is different, all other data are the same:

+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 1_id    |name 	| asciiname   | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038838 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038839 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+

continue...
+---------+------------+-------------+----------------+-------------------+
| admin2  | population | elevation   | timezone       | modification date |
+---------+------------+-------------+----------------+-------------------+
| 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
+---------+------------+-------------+----------------+-------------------+
| 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
+---------+------------+-------------+----------------+-------------------+

Table structure:


DROP TABLE IF EXISTS `tbl_t`;
CREATE TABLE `tbl_t` (
  `1_id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(200) DEFAULT NULL,
  `asciiname` varchar(200) DEFAULT NULL,
  `alternatenames` varchar(5000) DEFAULT NULL,
  `latitude` char(200) DEFAULT NULL,
  `longitude` char(200) DEFAULT NULL,
  `feature class` char(100) DEFAULT NULL,
  `feature code` char(100) DEFAULT NULL,
  `country code` char(200) DEFAULT NULL,
  `cc2` char(60) DEFAULT NULL,
  `admin1` char(20) DEFAULT NULL,
  `admin2` char(80) DEFAULT NULL,
  `population` varchar(200) DEFAULT NULL,
  `elevation` varchar(200) DEFAULT NULL,
  `timezone` varchar(200) DEFAULT NULL,
  `modification date` date DEFAULT NULL,
  PRIMARY KEY (`1_id`),
  KEY `CODE` (`1_id`),
  KEY `COUNTRYCODE` (`country code`),
  KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;

Simply select all columns and group by all columns, and place a min() on the primary key ID, and remove that ID from the group clause. This will give you a unique list of data and the lowest ID.

kyle, he’s got unique primary keys, putting a MIN() on the primary key won’t help :slight_smile:

the duplicates now are given by “even if 1_id is different, all other data are the same”

add a WHERE clause to restrict the duplicates to a range –

SELECT
	id_1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
 [COLOR="#0000FF"]WHERE id_1 BETWEEN 13000 AND 17000[/COLOR]
GROUP BY
	id_1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

this should stop the query from “crashing” but you’ll have to do it multiple times

thank you.

I try this query:

SELECT
	id_1,
	Field2,
	Field3 ,......Field16
FROM
	tbl_t
 [COLOR="#0000FF"]WHERE id_1 BETWEEN 1 AND 10000[/COLOR]
GROUP BY
	id_1,
	Field2,
	Field3 ,......Field16
HAVING
	COUNT(*) > 1;

The output is:

Affected rows: 0
Time: 1.176ms

this means that between the rows 1 and 10000 there are not duplicates ? :confused:

CMS. Do you have other tables that use id_1 as a foreign key?

Not, I don’t have.

My earlier post will suffice then, at least as a starting point. You don’t even need to worry about bringing your id_1 over and just create a new one.

--Make temp table
CREATE TABLE `tempTable` (
  `1_id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(200) DEFAULT NULL,
  `asciiname` varchar(200) DEFAULT NULL,
  `alternatenames` varchar(5000) DEFAULT NULL,
  `latitude` char(200) DEFAULT NULL,
  `longitude` char(200) DEFAULT NULL,
  `feature class` char(100) DEFAULT NULL,
  `feature code` char(100) DEFAULT NULL,
  `country code` char(200) DEFAULT NULL,
  `cc2` char(60) DEFAULT NULL,
  `admin1` char(20) DEFAULT NULL,
  `admin2` char(80) DEFAULT NULL,
  `population` varchar(200) DEFAULT NULL,
  `elevation` varchar(200) DEFAULT NULL,
  `timezone` varchar(200) DEFAULT NULL,
  `modification date` date DEFAULT NULL,
  PRIMARY KEY (`1_id`),
  KEY `CODE` (`1_id`),
  KEY `COUNTRYCODE` (`country code`),
  KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
--Append unique data to new table, excluding 1_id from select as it is causing your 'dupes'
INSERT INTO tempTable (name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date)
SELECT
  name,
  asciiname,
  alternatenames,
  latitude,
  longitude,
  feature class,
  feature code,
  country code,
  cc2,
  admin1,
  admin2,
  population,
  elevation,
  timezone,
  modification date
FROM
  tbl_t
GROUP BY  
  name,
  asciiname,
  alternatenames,
  latitude,
  longitude,
  feature class,
  feature code,
  country code,
  cc2,
  admin1,
  admin2,
  population,
  elevation,
  timezone,
  modification date

This is just a start… You obviously have an insert query somewhere that was causing duplicates to be created, you’ll have to find that and weed it out. Once you’ve created this temp table you may want to check for duplicate ‘name’ entries.

thank you.
I try your suggestion but I have the crash of mysql:
[Err] 126 - Incorrect key file for table ‘C:\WINDOWS\TEMP\#sql240_1e_9.MYI’; try to repair it