Optimize a MySQL Query

force · January 24, 2014, 6:24pm

I was working on MySQL query that I finally got working, but although it returns the correct results, it seems fairly inefficient. I was wondering if there might be a better way to do this.

Here is the sample query:

SELECT DISTINCT `contact`.*
FROM `contact`
JOIN `contactSessions` ON `contactSessions`.`CUID`=`contact`.`CUID`
WHERE `contactSessions`.`SID`=51

UNION

SELECT `contact`.*
FROM `contact`
WHERE
`contact`.`CUID` IN
(
    SELECT DISTINCT `contactRelationship`.`superiorUID`
    FROM `contactRelationship`
    LEFT JOIN `contact` ON `contactRelationship`.`subordinateUID`=`contact`.`CUID`
    WHERE `contactRelationship`.`subordinateUID` IN 
        (
            SELECT DISTINCT `contact`.`CUID`
            FROM `contact`
            JOIN `contactSessions` ON `contactSessions`.`CUID`=`contact`.`CUID`
            WHERE `contactSessions`.`SID`=51
        )
)

The two contactSessions.SID WHERE clauses will always be identical. The SID number is the variable value of the whole query.

Basically, what I’m trying to do is select any contacts with a specific SID. Then, also select any parents/superiors when children/subordinates are found with a specific SID. There wouldn’t necessarily be a record for the specific SID and the CUID in the contactSessions table for the parent/superior.

So, here’s a longer explanation of the query:

What the first part of the query does is select all the contact records for any contacts that appear in the contactSessions table with a specific SID.

The second part (after the UNION) selects all the contact CUIDs for any CUIDs that appear in the contactSessions table with a specific SID. Then, when the CUID appears as a child (subordinateUID) in contactRelationship table, the parent’s CUIDs (superiorUIDs) are returned. Then, the results from the contact table are returned when the superiorUID matches the CUID.

Below are the CREATE statements for the tables and sample data.



-- -----------------------------------------------------
-- Table `Contact`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `Contact` (
  `CUID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `firstName` VARCHAR(255) NULL,
  `lastName` VARCHAR(255) NULL,
  PRIMARY KEY (`CUID`),
  UNIQUE INDEX `UID_UNIQUE` (`CUID` ASC))
ENGINE = InnoDB;


-- -----------------------------------------------------
-- Table `ContactRelationship`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `ContactRelationship` (
  `superiorUID` BIGINT UNSIGNED NOT NULL,
  `subordinateUID` BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (`superiorUID`, `subordinateUID`),
  INDEX `fk_ContactRelationship_Contact_idx` (`superiorUID` ASC),
  INDEX `fk_ContactRelationship_Contact1_idx` (`subordinateUID` ASC),
  CONSTRAINT `fk_ContactRelationship_Contact`
    FOREIGN KEY (`superiorUID`)
    REFERENCES `Contact` (`CUID`)
    ON DELETE CASCADE
    ON UPDATE NO ACTION,
  CONSTRAINT `fk_ContactRelationship_Contact1`
    FOREIGN KEY (`subordinateUID`)
    REFERENCES `Contact` (`CUID`)
    ON DELETE CASCADE
    ON UPDATE NO ACTION)
ENGINE = InnoDB;




-- -----------------------------------------------------
-- Table `ContactSessions`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `ContactSessions` (
  `CUID` BIGINT UNSIGNED NOT NULL,
  `SID` INT UNSIGNED NOT NULL,
  INDEX `fk_ContactSubscriptions_Contact1_idx` (`CUID` ASC),
  PRIMARY KEY (`CUID`, `SID`),
  INDEX `fk_ContactSessions_Sessions1_idx` (`SID` ASC),
  CONSTRAINT `fk_ContactSubscriptions_Contact1`
    FOREIGN KEY (`CUID`)
    REFERENCES `Contact` (`CUID`)
    ON DELETE CASCADE
    ON UPDATE NO ACTION)
ENGINE = InnoDB;


INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (1, 'John', 'Doe');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (2, 'Jane', 'Doe');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (3, 'Bob', 'Smith');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (4, 'Bill', 'Johnson');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (5, 'Little Kelly', 'Johnson');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (6, 'Little Jimmy', 'Doe');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (7, 'Little Jenny', 'Doe');
INSERT INTO `contact` (`CUID`, `firstName`, `lastName`) VALUES (8, 'Little Lizzy', 'Doe');


INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (1,6);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (1,7);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (1,8);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (2,6);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (2,7);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (2,8);
INSERT INTO `ContactRelationship` (`superiorUID`, `subordinateUID`) VALUES (4,5);


INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (8,50);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (8,51);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (8,52);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (8,53);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (6,51);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (6,52);
INSERT INTO `ContactSessions` (`CUID`, `SID`) VALUES (5,51);

r937 · January 24, 2014, 8:17pm

efficiency and complexity are only vaguely connected

a complex query such as yours can still be quite efficient

have you run and analyzed the EXPLAIN?

p.s. for the love of all that’s holy, don’t use those horrid backticks in your query sql – bad enough that people generate their dump scripts with phpmyadmin’s “show those horrid backticks” option turned on

force · January 24, 2014, 9:03pm

Well, I’m not sure that the EXPLAIN actually tells me all that much.

The thing is that I use an identical query twice. Once before the union and again in a subquery. I thought there might be a way to store and re-use the results without having to query the same thing twice.

r937 · January 24, 2014, 9:30pm

well, you have three instances of “using temporary” which doesn’t look good

if i had a few hours, i’d get to know your data a little better and see if i could find another way

one thing that i’d try right away is to mremove the DISTINCT from your subqueries and see if that helps

the reason is, the following two code snippets are logically equivalent –

WHERE foo IN ( 9, 3, 7 )

WHERE foo IN ( 9, 3, 3, 9, 7, 7, 3 )

force · January 24, 2014, 9:44pm

It looks like my sample data set is too small to determine if there is a significant performance difference. There seems to be only a .0001 sec difference between leaving out DISTINCT and leaving it in.

Lemon_Juice · January 25, 2014, 10:32am

WHERE col IN(SELECT …) subqueries are generally poorly optimized by MySQL, maybe rewriting them as joins would help?

What version of MySQL are you using? In 5.6 there were some optimizations done to these types of subqueries so you might try upgrading if possible.

force · January 26, 2014, 1:44am

I’m all for any suggestions if you see a way where that could be accomplished. With the JOINs I tried, there would always be results missing.

What version of MySQL are you using? In 5.6 there were some optimizations done to these types of subqueries so you might try upgrading if possible.

The live server is running 5.5

Lemon_Juice · January 26, 2014, 6:18pm

It’s difficult for me to rewrite your query without possibly spending a lot of time on it since I’m not that proficient in it…

But there’s another idea worth considering - how many rows are usually returned by your subqueries? If these are not large numbers of IDs (I mean not more than 1000 or so) then I’m sure you can gain a lot in terms of performance if you split this query into 3 - first execute the innermost (last) SELECT, grab the data and execute the next SELECT higher up injecting the data from the first query into the IN() part as literal values. Then do the same with the next subquery higher up. This is how I optimized this type of subqueries in the past using PHP and it worked pretty well. If you want to do it in SQL alone then you could select your IDs into a temporary table and then execute the parent SELECT with a simple join with the temp table - this should also work fast and possibly could be wrapped up in a stored procedure.

Edit: I can see that the last idea with the temp table gets quite close to rewriting this query using joins so perhaps this could be worth experimenting, here is just an example of rewriting the last subquery:


SELECT DISTINCT `contact`.*
FROM `contact`
JOIN `contactSessions` ON `contactSessions`.`CUID`=`contact`.`CUID`
WHERE `contactSessions`.`SID`=51

UNION

SELECT `contact`.*
FROM `contact`
WHERE
`contact`.`CUID` IN
(
    SELECT DISTINCT `contactRelationship`.`superiorUID`
    FROM `contactRelationship`
    LEFT JOIN `contact` ON `contactRelationship`.`subordinateUID`=`contact`.`CUID`

    INNER JOIN
        (
            SELECT DISTINCT `contact`.`CUID`
            FROM `contact`
            JOIN `contactSessions` ON `contactSessions`.`CUID`=`contact`.`CUID`
            WHERE `contactSessions`.`SID`=51
        ) AS dt
    ON dt.CUID=contactRelationship.subordinateUID
)

If this works then it’s just a matter of applying the same idea to the other subquery. It’s untested but I’d give it a try.

force · January 26, 2014, 6:42pm

There would probably be anywhere between 500 and 4,000 rows in the contact table. I would guess that the other two tables would have slightly fewer rows than the contact table.

I’ll give your query a try and see what I get. Thanks

r937 · January 26, 2014, 9:04pm

okay, i took a closer look at simplifying the subqueries

let’s start at the innermost subquery

first of all, you want to return contact.CUID… but this is going to equal contactSessions.CUID, which you get by searching for contactSessions.SID… so you do not need the join to the contact table!

now, moving to the next outer subquery, you want contactSessions.CUID to be equal to contactRelationship.subordinateUID… and once again, you do not need the join to the contact table, because once you’ve retrieved the contactRelationship row, you have contactRelationship.superiorUID

so let’s rewrite the second of your two UNION queries like this –

SELECT contact.*
  FROM contactSessions
INNER
  JOIN contactRelationship
    ON contactRelationship.subordinateUID = contactSessions.CUID
INNER
  JOIN contact
    ON contact.CUID = contactRelationship.superiorUID    
 WHERE contactSessions.SID = 51

notice that the FROM clause starts with the most restricted table, the one for which you supply a WHERE condition value

mysql is smart enough to re-arrange tables in a multi-table join, but i am a firm believer in writing a query in the manner in which i expect it to be executed

now we combine with the first of your two UNION queries, eliminating the DISTINCT because that’s what UNION (as opposed to UNION ALL) will do for you –

SELECT contact.*
  FROM contactSessions
INNER
  JOIN contact
    ON contact.CUID = contactSessions.CUID  
 WHERE contactSessions.SID = 51

UNION

SELECT contact.*
  FROM contactSessions
INNER
  JOIN contactRelationship
    ON contactRelationship.subordinateUID = contactSessions.CUID
INNER
  JOIN contact
    ON contact.CUID = contactRelationship.superiorUID    
 WHERE contactSessions.SID = 51

run that, and let’s have a look at the EXPLAIN

force · January 27, 2014, 3:12pm

That didn’t even occur to me to use a different table in the FROM clause and join the contact table at the end instead. It looks like that was pretty much the key to the whole thing.

run that, and let’s have a look at the EXPLAIN

Here it is:

Thanks Rudy Looks like the processing time for the query was cut in half.

r937 · January 27, 2014, 7:24pm

well, no

like i tried to explain, you can use inner joins on multiple tables in pretty much any sequence, and mysql will figure out the most efficient access path, invariably starting with the table that has the highest restriction, i.e. will return the fewest rows, thus ensuring that only the minimum number of rows have to be joined (rather than joining tables completely and only then applying the WHERE condition, in effect retrieving a lot and then throwing a whole bunch of joined rows away)

the improvement came about from two things – converting subqueries to joins (which is actually how mysql executes subqueries, but, for some reason, it doesn’t always do it efficiently), and, more importantly, not joining to additional tables if you don’t have to

Lemon_Juice · January 27, 2014, 9:10pm

I am impressed with the solution Rudy has come to, I knew something like that could be done, I just couldn’t wrap my head around it to be able to transform the query to such an optimized state

Yes, the sequence of inner joins doesn’t matter but I’ve experienced on a few occasions a situation where a left join performed faster than an inner join - I don’t know why, especially that I would have expected inner joins to be faster. Therefore, in cases where the inner join were equal to the left join (that is no NULLs in the result) simply switching to the left join improved performance. But I suspect this can depend on many variables so I’m not sure if left joins are always faster…

force · January 28, 2014, 11:00pm

Right, the logic of the query basically got flipped around. Sorry, that’s what I was trying to say earlier.

Indeed, same here.

Thanks again Rudy