How to Join most recent record

Solved this problem yesterday after about 45 minutes of experimentation.

Consider the problem of a query against table A that we are pulling records from. Table B is going to be joined and it has related data, but we want the most recent row of B for each A.

(Real life example for those that parse them better: Consider a table of bills, and a table of appraisals that affect those bills - you need to pull the most recent appraisal).

To complicate matters Table C is also going to be joined to table A. We need an aggregate pulled against one of C’s columns. the group by clause will be on A’s primary key.

(Real life continued - here our third table was ‘transactions’ and I needed to sum the amount column).

The problem is this - when you do a group by on A you’ll only get 1 of B’s records. Summing B’s information is non-sensical. But you have to have the group by to get the sum of C’s column correctly.

My solution - an alias derived table.


SELECT 
  A.id AS id, 
  A.title AS title,
  B.appraised AS appraised
  SUM(C.amount) AS paid
FROM A
  INNER JOIN C on C.aid = A.id
  LEFT JOIN (
    SELECT appraised FROM B ORDER BY appraisalDate
) AS B ON B.aid = A.id
GROUP BY A.id

Now the most recent appraisal on table B will be included in the report.

Posted in the hopes someone finds it useful, comments welcome.

there’s a problem with your solution – the value of B.appraised that is returned is ~not~ necessarily the most recent appraisal

this is the “hidden column” problem outlined here: http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html

the correct way to select the latest B row would be as follows…

SELECT A.id
     , A.title
     , B.appraised
     , SUM(C.amount) AS paid
  FROM A  
INNER 
  JOIN C 
    ON C.aid = A.id  
LEFT OUTER
  JOIN ( SELECT aid
              , MAX(appraisalDate) AS latest
           FROM B
         GROUP
             BY aid ) AS Bm
    ON Bm.aid = A.id
LEFT OUTER
  JOIN B
    ON B.aid = Bm.aid
   AND B.appraisalDate = Bm.latest
GROUP 
    BY A.id

joining to a MAX subquery as a derived table ensures that only one row from B is joined to each A, and thus the GROUP BY on A.id is sufficient

Thanks! From the sounds of it if that ever came up it would have been a PITA to track down.

BTW, the actual query restricts both the outer and aliased query to a range of bills - possibly only on. Here’s the relevant snippet (the whole query is nearly 200 lines).


LEFT JOIN (
  SELECT * 
  FROM ppt_land_appraisals ai 
  WHERE ai.pptbill_id IN ({$bills}) 
  ORDER BY ai.created_on
) AS a ON b.id = a.pptbill_id
WHERE b.id IN ({$bills})

So with what you’re suggesting that should be…


LEFT JOIN (
  SELECT *, MAX(created_on) AS latest 
  FROM ppt_land_appraisals ai 
  WHERE ai.pptbill_id IN ({$bills}) 
  GROUP BY ai.pptbill_id
) AS a ON b.id = a.pptbill_id
WHERE b.id IN ({$bills})

Right?

not quite right, as the use of the dreaded, evil “select star” in incompatible with GROUP BY

notice how i did it – the derived table returns only the join column and the MAX, and this is then used to join to the B table

also, you wouldn’t need the WHERE clause for the bills list in the subquery, as this is taken care of when the derived table is joined

Before the join though wouldn’t it still consider all 170,000 rows on that table? That’s what I’m trying to avoid.

And yeah, I need to get rid of the select star before I push to testing. I put it in there for now cause I’m still working out which fields from that particular table I need.

not sure i understand this question…

The appraisal table against which the alias table query runs has in excess of 170,000 rows or more. Does MySQL have to look at all of that before starting work on the outer query, or does the limiting where statement of the outer query get migrated down?

I know the end result will be the same, but my hunch is that having the where clause in the subquery will speed the parsing of the statement by limiting how many rows are actually considered for joining.

I could be wrong. I guess when I have time I can time test it.

or you could stick EXPLAIN in front of the two different query versions, and compare the execution plans

:slight_smile:

:smiley: How else do you get an accurate time test?