Is this code to delete duplicates OK?

I am using a table I was given of all the towns in the UK and I have realised it contains duplicates

e.g



971 	Birmingham 	West Midlands
978 	Birmingham 	West Midlands
993 	Birmingham 	West Midlands
1006 	Birmingham 	West Midlands
1047 	Birmingham 	West Midlands


My table structure is as follows:



CREATE TABLE `towns` (
  `town_id` int(11) NOT NULL,
  `town` varchar(150) collate utf8_unicode_ci NOT NULL,
  `county` varchar(150) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`town_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci COMMENT='Towns';


I have done a search for some code to remove duplicates and found the following code but wanted to get a second opinion before using it as the site is live and I don’t want to break it:)



CREATE TABLE new_table as
SELECT * FROM old_table WHERE 1 GROUP BY [column to remove duplicates by];

DROP TABLE old_table;

RENAME TABLE new_table TO old_table;


If this is correct, then can you advise me of what to put in the square bracket?
If there is a better solution please let me know.

Many thanks

Paul

If the table includes post codes it might have the town repeated for each post code area which is the same town

Thanks SpacePhonenix - the table doesn 't contain postcodes - just town and county

please run this –

SELECT town
  FROM towns
GROUP
    BY town
HAVING COUNT(*) <> COUNT(DISTINCT county)

any results?

Thanks r937

Here is the result I got



Showing rows 0 - 29 (410 total, Query took 0.1852 sec)


so what that means is that there are some town names which exist in more than one county

which makes sense, right? two different counties could have two different towns with the same name, right?

the “remove duplicates” code which you posted would obliterate this information

Eep I am glad I asked.

I don’t suppose you would be so kind as to suggest some code that would do what I need to do? :slight_smile:

sure :slight_smile:

first, which database system is this? it looks like MySQL but i’m just double-checking because we have a MySQL forum but you chose to post in the Databases forum instead

second, does this query return anything –

SELECT MIN(town_id)
     , town
     , county
  FROM towns
GROUP
    BY town
     , county
HAVING COUNT(*) > 1

Thanks Rudy - I appreciate it.

Yes it is MYSQL - sorry, in my haste to post I didn’t notice there were separate forums but hopefully you won’t abandon me for my mistake.

Your query returns



Showing rows 0 - 29 (458 total, Query took 1.0632 sec)


Many thanks

Paul

just to be safe, back up your table data first, then try this –

DELETE towns
  FROM ( SELECT MIN(town_id) AS min_id
              , town        
              , county      
           FROM towns       
         GROUP              
             BY town        
              , county      
         HAVING COUNT(*) > 1 ) AS dupes
INNER
  JOIN towns
    ON towns.town   = dupes.town
   AND towns.county = dupes.county
 WHERE towns.town_id > dupes.min_id

Thanks Rudy, back up made but get the following error:



#1104 - The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET SQL_MAX_JOIN_SIZE=# if the SELECT is okay

wha???

i don’t know why you would get that, nor how to fix it, but i can think of a couple of ways to get around it…

CREATE TABLE towns_to_delete
SELECT towns.town_id
  FROM ( SELECT MIN(town_id) AS min_id
              , town        
              , county      
           FROM towns       
         GROUP              
             BY town        
              , county      
         HAVING COUNT(*) > 1 ) AS dupes
INNER
  JOIN towns
    ON towns.town   = dupes.town
   AND towns.county = dupes.county
 WHERE towns.town_id > dupes.min_id

if this runs, you should inspect a few ids from the towns_to_delete table, and convince yourself that those ids do need to be deleted, then run this –

DELETE FROM towns WHERE town_id IN
( SELECT town_id FROM towns_to_delete )

Thanks Rudy

Unfortunately I get the following similar message



#1104 - The SELECT would examine more than MAX_JOIN_SIZE rows; check your WHERE and use SET SQL_BIG_SELECTS=1 or SET SQL_MAX_JOIN_SIZE=# if the SELECT is okay


you must be doing something else wrong, because that subquery ran fine back in post #8

can i see the exact last query you tried?

Yeah sure - here is the exact query



CREATE TABLE towns_to_delete SELECT towns.town_id
FROM (

SELECT MIN( town_id ) AS min_id, town, county
FROM towns
GROUP
BY town, county
HAVING COUNT( * ) >1
) AS dupes
INNER
JOIN towns ON towns.town = dupes.town
AND towns.county = dupes.county
WHERE towns.town_id > dupes.min_id


I tried the following from post 8 again and it still works. I guess you know it is not exactly the same query


SELECT MIN( town_id ) , town, county
FROM towns
GROUP
BY town, county
HAVING COUNT( * ) >1
LIMIT 0 , 30


ALTER TABLE towns ADD INDEX ( town,county )

this should speed things up, although i don’t think it really addresses your MAX_JOIN_SIZE problem…

Hey Rudy

After running the ALTER TABLE query, I then tried your previous CREATE TABLE towns_to_delete and it now worked.

The ids in the towns_to_delete where indeed the culprits so I ran the other query to delete them from the towns table and it is all fine now.

So I would just like to say thank you so much for all your time on this. It is really appreciated as I am trying to run before I can walk with this site.