Deleting duplicates with irregular patterns

bostboy · November 28, 2010, 7:08pm

I have a company database with name, address and city. I have been able to get rid of duplicates with matches at the beginning of the name with the same address, the same name and address, and the same name with the same beginning of the address.

But I now have some situations like

DiMio and Di Mio
P S Gourmet Coffee and Ps Gourmet Coffee

Both have the same addresses and city but there are too many cases of names being different with the same address that I can’t just use an address match. Some part of the name needs to match also.

I have tried several things, including

SELECT id, name, address, city, count( * ) AS n
FROM t1
GROUP BY name like SUBSTRING(name FROM 5 FOR 4) , address, left( city, 4 )
HAVING n >1

But that didn’t really work well at all.

Is there a way to do this in MySQL or do I need to do it with a php routine of some sort?

r937 · November 28, 2010, 8:02pm

try this –

SELECT id
     , name
     , address
     , city
     , COUNT(*) AS n
  FROM t1
[COLOR="Blue"]GROUP 
    BY REPLACE(name,' ','')[/COLOR]
     , address
     , left(city,4)
HAVING n > 1

bostboy · November 28, 2010, 10:08pm

Thanks, that works for those names pretty well. And also gives me one more tool in the toolbox to go after the others.

Like if one starts with The and the second one doesn’t. And I realize at some point it becomes an individual case and decision. So I’ll head down that road. But the replace really helps. Thanks for the input.