Return duplicates

ServerStorm · January 17, 2012, 8:27pm

Hi,

I have searched the form for duplicates and while there were many questions asked I am not sure what is the best way to do this.

If I am searching users by a registrant_id, first_name, last_name, prov_state, and postal_code where any three of these fields that match in multiple records are considered duplicates but the registrant_id is unique for each member then do you know the best way to return all of these duplicates?

Many thanks,
Steve

r937 · January 17, 2012, 8:33pm

there are four ways of choosing three of four

so you will need four different queries, which, luckily, you can UNION

here’s one of them –

SELECT t.*
  FROM ( SELECT last_name
              , prov_state
              , postal_code 
           FROM daTable
         GROUP
             BY last_name
              , prov_state
              , postal_code 
         HAVING COUNT(*) > 1 ) AS dupe1
INNER
  JOIN daTable AS t
    ON t.last_name   = dupe1.last_name                  
       t.prov_state  = dupe1.prov_state       
       t.postal_code = dupe1.postal_code

having said that, i wonder if you truly are interested in the results of this query – it will return all people named Smith in Oakville

ServerStorm · January 17, 2012, 8:44pm

Thanks Rudy,

As I am not deleting these, I guess I could also just do a normal SELECT too ?


SELECT
	l.lead_id as 'Registrant Number'
	, l.first_name as 'First Name'
	, l.last_name as 'Last Name'
    	, pst.name as 'State/Province'
    	, p.postal_code as 'Zip/Postal Code'
    FROM
    	leads as l
    LEFT OUTER 
    	JOIN leads2addresses as l2a
    		ON l2a.lead_id = l.lead_id
    LEFT OUTER 
    	JOIN addresses as a
    		ON a.address_id = l2a.address_id
    LEFT OUTER 
    	JOIN postal_codes as p
    		ON p.postal_code = a.post_code 
    LEFT OUTER 
    	JOIN provs_states as pst
    		ON pst.abbreviation = p.state_prov_abbr
    		AND p.postal_code = a.post_code
   WHERE
	l.first_name Like '%Bill%'
   AND
       l.last_name Like '%Smi%'
   AND
      p.postal_code Like '%L7L%'

Returns:
[TABLE=“class: grid”]
[TR]
[TD]Registrant Number[/TD]
[TD]First Name[/TD]
[TD]Last Name[/TD]
[TD]State/Province[/TD]
[TD]Zip/Postal Code[/TD]
[/TR]
[TR]
[TD]2[/TD]
[TD]Bill[/TD]
[TD]Smith[/TD]
[TD]Ontario[/TD]
[TD]L7L6M5[/TD]
[/TR]
[TR]
[TD]4[/TD]
[TD]Bill[/TD]
[TD]Smith[/TD]
[TD]Ontario[/TD]
[TD]L7L6M5[/TD]
[/TR]
[/TABLE]

Steve

ServerStorm · January 17, 2012, 8:46pm

r937:

there are four ways of choosing three of four

so you will need four different queries, which, luckily, you can UNION

here’s one of them –
SELECT t.*
  FROM ( SELECT last_name
              , prov_state
              , postal_code 
           FROM daTable
         GROUP
             BY last_name
              , prov_state
              , postal_code 
         HAVING COUNT(*) > 1 ) AS dupe1
INNER
  JOIN daTable AS t
    ON t.last_name   = dupe1.last_name                  
       t.prov_state  = dupe1.prov_state       
       t.postal_code = dupe1.postal_code
having said that, i wonder if you truly are interested in the results of this query – it will return all people named Smith in Oakville

The results of this query will allow the user to click to view the full profile of each lead which has much more info, so your query would be fine.

Regards,
Steve

r937 · January 17, 2012, 9:04pm

okay, where did the idea of deleting come from?

and what does this query have to do with finding duplicates based on three of four columns?

ServerStorm · January 17, 2012, 9:25pm

Just was saying that your supplied method will work as it does the correct job of selecting (and not deleting) the duplicates… sorry for the lack of clarity.

and what does this query have to do with finding duplicates based on three of four columns?

My client wants to view all records where any three columns that match out of Registrant Number, First Name, Last Name, State/Province, or Zip/Postal Code will be shown as a possible duplicate.

So if in a HTML form they fill in three of the possible fields for matching and duplicates are returned then they can manually choose to delete them.

r937 · January 17, 2012, 9:41pm

may i ask why you have to join the address table to the postal code table? what columns do you have in the postal codes table?

also, same question for the states/provinces table – why the join here?

ServerStorm · January 17, 2012, 9:53pm

Addresses:


Table addresses
===============
address_id, intended_use, street, apt_unit_number, street_direction, post_code, region
---------------
address_id       int(11) PK
intended_use     varchar(150)
street           varchar(150)
apt_unit_number  varchar(10)
street_direction varchar(2)
post_code        varchar(20)
region           varchar(150)

Postal Codes:


Table postal_codes==================
postal_code, city_name, state_prov_abbr, country_iso_code
------------------
postal_code      varchar(20) PK
city_name        varchar(64)
state_prov_abbr  char(2)
country_iso_code char(2)

Province and States:


Table provs_states
==================
id, country_id, name, abbreviation
------------------
id               smallint(5) unsigned PK
country_id       char(2)
name             varchar(100)
abbreviation     varchar(3)

Countries:


Table countries===============
country_iso_code, country_name
---------------
country_iso_code char(2) PK
country_name     varchar(150)

This is setup this way as a convenience for the client, to find a persons full address they only need to supply the person’s data and their postal code; the joins provide all the other address type data. Yes it is not as efficient as using it in one table, but their application is ‘heavy’ on the requirement to search for people’s addresses so this extra complication (in code) and slower efficiency is warranted.