IN() too slow, need help rewriting

aamonkey · February 22, 2011, 8:03pm

I have the following query:

SELECT COUNT(VI.id)
FROM Client_Voucher_Items VI
LEFT JOIN Client_Vouchers V
    ON VI.voucherId = V.id
LEFT JOIN Clients C
    ON C.id = V.clientId
LEFT JOIN Items I
    ON I.id = VI.itemId
WHERE I.description = 'Glasses'
AND V.dateVoided IS NULL
AND V.dateCreated >= '2010-01-01'
AND V.dateCreated <=  '2010-01-31'
AND C.id IN (
    SELECT DISTINCT(V.clientId)
    FROM Client_Vouchers V, Clients C
    WHERE C.id = V.clientId
    AND DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >=  60
    AND V.dateCreated >= '2010-01-01'
    AND V.dateCreated <=  '2010-01-31'
    AND V.dateVoided IS NULL
)

Because of the IN() clause (and mysql’s horrible optimization of it) this query is taking 10+ seconds to run. I am not proficient enough with sql to rewrite this without the IN() clause. I can post the table structure if it’s not clear.

thank you for any help

r937 · February 22, 2011, 9:08pm

is it possible to have a client voucher item for a client voucher that doesn’t exist?

is it possible to have a client voucher for a client that doesn’t exist?

is it possible to have a client voucher item for an item that doesn’t exist?

i’m going to guess that the answer to all of these questions is no, so your LEFT OUTER JOINs should be rewritten as INNER JOINs

does that speed things up a little bit?

we can tackle the IN problem next…

aamonkey · February 22, 2011, 9:41pm

Thanks r937 - that did shave a second or two off the query - so now I have this:

SELECT COUNT(VI.id)
FROM Client_Voucher_Items VI
INNER JOIN Client_Vouchers V
    ON VI.voucherId = V.id
INNER JOIN Clients C
    ON C.id = V.clientId
INNER JOIN Items I
    ON I.id = VI.itemId
WHERE I.description = 'Glasses'
AND V.dateVoided IS NULL
AND V.dateCreated >= '2010-01-01'
AND V.dateCreated <=  '2010-01-31'
AND C.id IN (
    SELECT DISTINCT(V.clientId)
    FROM Client_Vouchers V, Clients C
    WHERE C.id = V.clientId
    AND DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >=  60
    AND V.dateCreated >= '2010-01-01'
    AND V.dateCreated <=  '2010-01-31'
    AND V.dateVoided IS NULL
)

r937 · February 23, 2011, 3:21am

okay, we’re in great shape

before rewriting the subquery as a join, i just want to take a moment and confirm that the subquery is returning something different from the main query

in the main query, you’re looking specifically for glasses

but in the subquery, you seem to be looking for any client of age 60+ who has ~any~ voucher, regardless of whether that voucher contains glasses or not

is this correct?

because if it isn’t, it sure looks like those WHERE conditions in the subquery could simply be added to the WHERE clause of the outer query

aamonkey · February 23, 2011, 2:40pm

I get what you’re saying Rudy. The query is supposed to pull the unduplicated number of people over 60 that had vouchers for glasses in a date range, and I see now that I had it written incorrectly.

I rewrote this as:

SELECT COUNT(DISTINCT(V.clientId))
FROM ClientManagement_Client_Voucher_Items VI
INNER JOIN ClientManagement_Client_Vouchers V
    ON VI.voucherId = V.id
INNER JOIN ClientManagement_Clients C
    ON C.id = V.clientId
INNER JOIN SystemSetup_Items I
    ON I.id = VI.itemId
WHERE I.description = 'Glasses'
AND V.dateVoided IS NULL
AND V.dateCreated >= '2010-01-01'
AND V.dateCreated <=  '2010-12-31'
AND DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >=  60

Does that look right to you? It’s yielding slightly different numbers, but I think it’s right and what I had previously was wrong (because I was counting voucher items instead of distinct clients).

Side note - this is one of about 50 subqueries in a massive annual report I’m working on…I’m trying to find the bottlenecks in it and I believe I’ve been staring at it wayyyyy too long.

r937 · February 23, 2011, 2:56pm

certainly looks a lot simpler than what you had before

on a minor note, i try to write my FROM clause in a logical sequence, one that mirrors how i hope/assume the optimizer will approach it – starting with the essential or most restrictive table, and joining from there (so as to minimize the number of rows that it retrieves only to throw away)

since your whole query is about glasses, this is how i would write it…

SELECT COUNT( DISTINCT V.clientId )
  FROM SystemSetup_Items I
INNER
  JOIN ClientManagement_Client_Voucher_Items VI
    ON VI.itemId = I.id 
INNER 
  JOIN ClientManagement_Client_Vouchers V    
    ON V.id = VI.voucherId
   AND V.dateVoided IS NULL
   AND V.dateCreated >= '2010-01-01'
   AND V.dateCreated <=  '2010-12-31'
INNER 
  JOIN ClientManagement_Clients C    
    ON C.id = V.clientId
   AND DATE_FORMAT(V.dateCreated, '%Y') - 
       DATE_FORMAT(C.dob, '%Y') - 
         (DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
          DATE_FORMAT(C.dob, '00-%m-%d')) >=  60
 WHERE I.description = 'Glasses'

note also that DISTINCT is ~not~ a function, and whatever comes after it does not need to be placed in parentheses

aamonkey · February 23, 2011, 3:10pm

I just learned a bunch of stuff

Thanks for the help and the 2nd pair of eyes Rudy - I appreciate it