Solved this problem yesterday after about 45 minutes of experimentation.
Consider the problem of a query against table A that we are pulling records from. Table B is going to be joined and it has related data, but we want the most recent row of B for each A.
(Real life example for those that parse them better: Consider a table of bills, and a table of appraisals that affect those bills - you need to pull the most recent appraisal).
To complicate matters Table C is also going to be joined to table A. We need an aggregate pulled against one of C’s columns. the group by clause will be on A’s primary key.
(Real life continued - here our third table was ‘transactions’ and I needed to sum the amount column).
The problem is this - when you do a group by on A you’ll only get 1 of B’s records. Summing B’s information is non-sensical. But you have to have the group by to get the sum of C’s column correctly.
My solution - an alias derived table.
SELECT
A.id AS id,
A.title AS title,
B.appraised AS appraised
SUM(C.amount) AS paid
FROM A
INNER JOIN C on C.aid = A.id
LEFT JOIN (
SELECT appraised FROM B ORDER BY appraisalDate
) AS B ON B.aid = A.id
GROUP BY A.id
Now the most recent appraisal on table B will be included in the report.
Posted in the hopes someone finds it useful, comments welcome.
the correct way to select the latest B row would be as follows…
SELECT A.id
, A.title
, B.appraised
, SUM(C.amount) AS paid
FROM A
INNER
JOIN C
ON C.aid = A.id
LEFT OUTER
JOIN ( SELECT aid
, MAX(appraisalDate) AS latest
FROM B
GROUP
BY aid ) AS Bm
ON Bm.aid = A.id
LEFT OUTER
JOIN B
ON B.aid = Bm.aid
AND B.appraisalDate = Bm.latest
GROUP
BY A.id
joining to a MAX subquery as a derived table ensures that only one row from B is joined to each A, and thus the GROUP BY on A.id is sufficient
Thanks! From the sounds of it if that ever came up it would have been a PITA to track down.
BTW, the actual query restricts both the outer and aliased query to a range of bills - possibly only on. Here’s the relevant snippet (the whole query is nearly 200 lines).
LEFT JOIN (
SELECT *
FROM ppt_land_appraisals ai
WHERE ai.pptbill_id IN ({$bills})
ORDER BY ai.created_on
) AS a ON b.id = a.pptbill_id
WHERE b.id IN ({$bills})
So with what you’re suggesting that should be…
LEFT JOIN (
SELECT *, MAX(created_on) AS latest
FROM ppt_land_appraisals ai
WHERE ai.pptbill_id IN ({$bills})
GROUP BY ai.pptbill_id
) AS a ON b.id = a.pptbill_id
WHERE b.id IN ({$bills})
Before the join though wouldn’t it still consider all 170,000 rows on that table? That’s what I’m trying to avoid.
And yeah, I need to get rid of the select star before I push to testing. I put it in there for now cause I’m still working out which fields from that particular table I need.
The appraisal table against which the alias table query runs has in excess of 170,000 rows or more. Does MySQL have to look at all of that before starting work on the outer query, or does the limiting where statement of the outer query get migrated down?
I know the end result will be the same, but my hunch is that having the where clause in the subquery will speed the parsing of the statement by limiting how many rows are actually considered for joining.
I could be wrong. I guess when I have time I can time test it.