Duplicate Issue

tgavin · February 14, 2010, 9:44pm

Im not exactly sure how, but I’ve ended up with duplicates.

In the current table, you’ll see that I used chapt_id as the PK. This allowed for duplicate chapters in some movies.

CREATE TABLE IF NOT EXISTS `chapters` (
  `chapt_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `mov_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `chapter_num` varchar(3) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `chapter_title` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`chapt_id`),
  KEY `mov_id` (`mov_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ;

You can see here, if I export from the chapters table

INSERT INTO `chapters` VALUES
(1996, 65, '1', 'Main Titles'),
(1997, 65, '2', 'The Prisoners Arrive'),
(1998, 65, '3', 'Escape Attempt 1'),
(1999, 65, '4', 'The Blind Spot'),
(2000, 65, '5', 'To the Cooler'),
(2001, 65, '6', 'Bartlett''s Arrival'),
(2002, 65, '7', 'The Objective'),
(2003, 65, '8', 'The Escape Committee'),
(2004, 65, '9', 'Forger & Scrounger'),
(2005, 65, '10', 'Stealing Steel'),
(322864, 65, '1', 'Main Titles'),
(322865, 65, '2', 'The Prisoners Arrive'),
(322866, 65, '3', 'Escape Attempt 1'),
(322867, 65, '4', 'The Blind Spot'),
(322868, 65, '5', 'To the Cooler'),
(322869, 65, '6', 'Bartlett''s Arrival'),
(322870, 65, '7', 'The Objective'),
(322871, 65, '8', 'The Escape Committee'),
(322872, 65, '9', 'Forger & Scrounger'),
(322873, 65, '10', 'Stealing Steel');

I’m trying to take the data from the current “chapters” table and INSERT it into the new “movies_chapters” table. Obviously, looking at the PKs below you can see that I’m running into duplicate errors.

CREATE TABLE IF NOT EXISTS `movies_chapters` (
  `mc_mov_id` int(10) unsigned NOT NULL DEFAULT '0',
  `mc_lang` char(2) CHARACTER SET latin1 NOT NULL,
  `mc_chapter_num` varchar(3) CHARACTER SET latin1 NOT NULL,
  `mc_chapter_title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`mc_mov_id`,`mc_chapter_num`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

I’ll be using the mov_id and the chapter_num as the new PK, so I’m not worried about keeping the chapt_id anymore. What’s the best way to handle this? Delete the duplicates from the chapters table? If so, how?

Thanks!

SpacePhoenix · February 14, 2010, 10:20pm

Do any other tables in the database use the PK from the chapters table is their FK?

tgavin · February 14, 2010, 10:25pm

Nope.

r937 · February 14, 2010, 11:00pm

see if you can spot the difference in these two solutions –

INSERT 
  INTO movies_chapters
     ( mc_mov_id
     , mc_lang
     , mc_chapter_num
     , mc_chapter_title )
SELECT DISTINCT
       mov_id
     , 'en'
     , chapter_num
     , chapter_title  
  FROM chapters

INSERT IGNORE
  INTO movies_chapters
     ( mc_mov_id
     , mc_lang
     , mc_chapter_num
     , mc_chapter_title )
SELECT mov_id
     , 'en'
     , chapter_num
     , chapter_title  
  FROM chapters

tgavin · February 14, 2010, 11:18pm

The syntax is different, but I’m going to say those do the same thing. Does INSERT INGORE use less system resources?

r937 · February 14, 2010, 11:47pm

no, INSERT IGNORE attempts to insert all rows, ignoring dupes, whereas the INSERT which has SELECT DISTINCT inserts only the uniques

system resources is of minor concern here, since you are presumably doing this only once

SpacePhoenix · February 14, 2010, 11:48pm

@Rudy, for the benefit of anyone who comes across this thread what performance difference, if any is there between the two solutions?

r937 · February 14, 2010, 11:55pm

i would guess that the one with DISTINCT might be slower, because it has to sort the entire table on all columns first, whereas the IGNORE only has to do an index lookup on each row

who do you suppose has some sample data in the requisite two table layouts and might be available to test it for us?

tgavin · February 15, 2010, 12:41am

I tried doing both and received this error performing the DUPLICATE query: #2006 - MySQL server has gone away. The chapters table has over 600,000 rows.

Rudy, I understand that the actions are different, but isn’t the outcome still the same? Which method do you recommend?

r937 · February 15, 2010, 12:47am

i would not agonize over which method to use

if one of them failed on me, i’d use the other one in a flash

tgavin · February 15, 2010, 12:51am

Excellent, so they DO produce the same result.

I’m working on this on my local Mac, not a web server. It’s a Mac Pro 8 Core with 6 GB RAM. My query results are

Using INSERT IGNORE
Inserted rows: 579653 (Query took 47.4497 sec)

Using DUPLICATE
#2006 - MySQL server has gone away.

I tried this three times and got the same result. Use INSERT IGNORE.

Thanks Rudy!

crmalibu · February 16, 2010, 9:16pm

Not necessarily. You might have multiple records with the same mov_id and chapter_num, but have different titles. Distinct sees those as distinct rows. An error would be thrown when you try to insert them because your new table only has a unique constraint on mov_id and chapter_num.

This probably isn’t the case in your dataset though.