Hello guys!
How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?
Could you please help?
Thanks you very much for your help.
Hello guys!
How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?
Could you please help?
Thanks you very much for your help.
how? with a healthy dose of patience
what part are you having problems with? identifying the duplicates, or removing them?
You have right my teacher…
I have first problem with identifying the duplicates because I use DISTINCT in my Select query but for the number of rows (more than 10.000.000 rows) I have this error:
[Err] 2013 - Lost connection to MySQL server during query
Any suggestions?
Thank you
how do you know there are any duplicates at all?
thank you for reply.
I try this query:
SELECT
Field1,
Field2,
Field3 ,......Field16
FROM
tbl_t
GROUP BY
Field1,
Field2,
Field3 ,......Field16
HAVING
COUNT(*) > 1;
And in the output I have duplicates rows…
so you don’t actually have a problem identifying the duplicates?
how many collumns in this table?
thank you.
My problems are:
If try this example query I don’t have problem:
SELECT
Field1
FROM
tbl_t
GROUP BY
Field1
HAVING
COUNT(*) > 1;
The problem is when add all fields (number 16) of tbl_t in my query for find duplicates:
SELECT
Field1,
Field2,
Field3 ,......Field16
FROM
tbl_t
GROUP BY
Field1,
Field2,
Field3 ,......Field16
HAVING
COUNT(*) > 1;
MySQL is crashing, what do I do?
what is the primary key of your table?
Field 1_ID
int(10) NOT NULL AUTO_INCREMENT, PRIMARY KEY (1_ID
)
i am sorry, something still is not right
you cannot possibly have duplicate rows in the table, not with a PRIMARY KEY
this is, i am sure, because you have used fake column names
something doesn’t make sense here
It’s true… I have used fake column names in my example to simplify:
SELECT
Field1,
Field2,
Field3 ,......Field16
FROM
tbl_t
GROUP BY
Field1,
Field2,
Field3 ,......Field16
HAVING
COUNT(*) > 1;
For me these are duplicate rows even if 1_id
is different, all other data are the same:
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 1_id |name | asciiname | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038838 | Costa Verda | Costa Verda | cv | 42.48333 | 1.66667 | T | SLP | AD | 1 | 00 |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
| 3038839 | Costa Verda | Costa Verda | cv | 42.48333 | 1.66667 | T | SLP | AD | 1 | 00 |
+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
continue...
+---------+------------+-------------+----------------+-------------------+
| admin2 | population | elevation | timezone | modification date |
+---------+------------+-------------+----------------+-------------------+
| 02 | 8839 | 369 | Costa Verda | 2010-12-22 |
+---------+------------+-------------+----------------+-------------------+
| 02 | 8839 | 369 | Costa Verda | 2010-12-22 |
+---------+------------+-------------+----------------+-------------------+
Table structure:
DROP TABLE IF EXISTS `tbl_t`;
CREATE TABLE `tbl_t` (
`1_id` int(10) NOT NULL AUTO_INCREMENT,
`name` varchar(200) DEFAULT NULL,
`asciiname` varchar(200) DEFAULT NULL,
`alternatenames` varchar(5000) DEFAULT NULL,
`latitude` char(200) DEFAULT NULL,
`longitude` char(200) DEFAULT NULL,
`feature class` char(100) DEFAULT NULL,
`feature code` char(100) DEFAULT NULL,
`country code` char(200) DEFAULT NULL,
`cc2` char(60) DEFAULT NULL,
`admin1` char(20) DEFAULT NULL,
`admin2` char(80) DEFAULT NULL,
`population` varchar(200) DEFAULT NULL,
`elevation` varchar(200) DEFAULT NULL,
`timezone` varchar(200) DEFAULT NULL,
`modification date` date DEFAULT NULL,
PRIMARY KEY (`1_id`),
KEY `CODE` (`1_id`),
KEY `COUNTRYCODE` (`country code`),
KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
Simply select all columns and group by all columns, and place a min() on the primary key ID, and remove that ID from the group clause. This will give you a unique list of data and the lowest ID.
kyle, he’s got unique primary keys, putting a MIN() on the primary key won’t help
the duplicates now are given by “even if 1_id
is different, all other data are the same”
add a WHERE clause to restrict the duplicates to a range –
SELECT
id_1,
Field2,
Field3 ,......Field16
FROM
tbl_t
[COLOR="#0000FF"]WHERE id_1 BETWEEN 13000 AND 17000[/COLOR]
GROUP BY
id_1,
Field2,
Field3 ,......Field16
HAVING
COUNT(*) > 1;
this should stop the query from “crashing” but you’ll have to do it multiple times
thank you.
I try this query:
SELECT
id_1,
Field2,
Field3 ,......Field16
FROM
tbl_t
[COLOR="#0000FF"]WHERE id_1 BETWEEN 1 AND 10000[/COLOR]
GROUP BY
id_1,
Field2,
Field3 ,......Field16
HAVING
COUNT(*) > 1;
The output is:
Affected rows: 0
Time: 1.176ms
this means that between the rows 1 and 10000 there are not duplicates ?
CMS. Do you have other tables that use id_1 as a foreign key?
Not, I don’t have.
My earlier post will suffice then, at least as a starting point. You don’t even need to worry about bringing your id_1 over and just create a new one.
--Make temp table
CREATE TABLE `tempTable` (
`1_id` int(10) NOT NULL AUTO_INCREMENT,
`name` varchar(200) DEFAULT NULL,
`asciiname` varchar(200) DEFAULT NULL,
`alternatenames` varchar(5000) DEFAULT NULL,
`latitude` char(200) DEFAULT NULL,
`longitude` char(200) DEFAULT NULL,
`feature class` char(100) DEFAULT NULL,
`feature code` char(100) DEFAULT NULL,
`country code` char(200) DEFAULT NULL,
`cc2` char(60) DEFAULT NULL,
`admin1` char(20) DEFAULT NULL,
`admin2` char(80) DEFAULT NULL,
`population` varchar(200) DEFAULT NULL,
`elevation` varchar(200) DEFAULT NULL,
`timezone` varchar(200) DEFAULT NULL,
`modification date` date DEFAULT NULL,
PRIMARY KEY (`1_id`),
KEY `CODE` (`1_id`),
KEY `COUNTRYCODE` (`country code`),
KEY `asciiname` (`asciiname`)
) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
--Append unique data to new table, excluding 1_id from select as it is causing your 'dupes'
INSERT INTO tempTable (name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date)
SELECT
name,
asciiname,
alternatenames,
latitude,
longitude,
feature class,
feature code,
country code,
cc2,
admin1,
admin2,
population,
elevation,
timezone,
modification date
FROM
tbl_t
GROUP BY
name,
asciiname,
alternatenames,
latitude,
longitude,
feature class,
feature code,
country code,
cc2,
admin1,
admin2,
population,
elevation,
timezone,
modification date
This is just a start… You obviously have an insert query somewhere that was causing duplicates to be created, you’ll have to find that and weed it out. Once you’ve created this temp table you may want to check for duplicate ‘name’ entries.
thank you.
I try your suggestion but I have the crash of mysql:
[Err] 126 - Incorrect key file for table ‘C:\WINDOWS\TEMP\#sql240_1e_9.MYI’; try to repair it