Using COUNT and GROUP BY with multiple tables, including derived tables?

Here’s the table structure:

CREATE TABLE songs
(
    song_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name    VARCHAR(50)  NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE ratings
(
    song_id INT     UNSIGNED NOT NULL DEFAULT '0',
    user_id INT     UNSIGNED NOT NULL DEFAULT '0',
    rating  TINYINT UNSIGNED NOT NULL DEFAULT '1',
    PRIMARY KEY (song_id, user_id),
    INDEX       (user_id)
);

CREATE TABLE bookmarks
(
    song_id INT UNSIGNED NOT NULL DEFAULT '0',
    user_id INT UNSIGNED NOT NULL DEFAULT '0',
    PRIMARY KEY (song_id, user_id),
    INDEX       (user_id)
);

Here’s the sql code:

SELECT
    songs.name,
    average_rating,
    total_ratings,
    total_bookmarks
FROM
    songs
    INNER
      JOIN
      (
        SELECT
            song_id,
            AVG(rating) AS average_rating,
            COUNT(*) AS total_ratings
        FROM
            ratings
        GROUP
            song_id
      ) AS ratings_summary
        ON ratings_summary.song_id = songs.song_id
    LEFT
      JOIN
      (
        SELECT
            song_id,
            COUNT(*) AS total_bookmarks
        FROM
            bookmarks
        GROUP BY
            song_id
      ) AS bookmarks_summary
        ON bookmarks_summary.song_id = songs.id
ORDER BY
    total_ratings DESC;

The above query works, but it takes 5 seconds. If I remove the ORDER BY it takes one-tenth of a second. If it remove either one of the sub-queries (derived tables) it also takes one-tenth of a second.

This is really killing me. I’ve done searches about COUNT, GROUP BY, but there aren’t many examples of getting proper counts when joining multiple one-to-many tables.

Using COUNT with GROUP BY is fine if you’re only joining two tables, but once you add a second one-to-many table to the mix, it throws all the aggregate functions out of whack because it causes the extras rows to multiple against each other.

Instead of using derived tables I could use traditional joins directly against the tables, but then I’d have to resort to using sloppy COUNT(DISTINCT song_id) which works around it, but it’s just as slow as the 5 second monster above. I’m really at my wits end here.

The situation gets even worse because I’ll actually need to add more one-to-many joins which’ll make it even slower.

I’m probably not going to be able to help you, but I have a feeling someone else is going to ask you for the output of EXPLAIN.

that’s weird

how long does this take –

SELECT songs.song_id
     , songs.name
     , average_rating
     , total_ratings
  FROM songs
INNER
  JOIN ( SELECT song_id
              , AVG(rating) AS average_rating
              , COUNT(*) AS total_ratings
           FROM ratings
         GROUP
             BY song_id ) AS ratings_summary
    ON ratings_summary.song_id = songs.song_id
UNION ALL
SELECT song_id
     , NULL
     , NULL
     , COUNT(*) AS total_bookmarks
  FROM bookmarks
GROUP
    BY song_id

note there is no ORDER BY yet, and the rows still need to be condensed

presumably the two SELECTs are quick, and i’m wondering if the UNION adds any significant overhead…

Very fast. One-tenth of a second.

rudy, I’m not sure what you’re getting at here. How does this incorporate the “total_bookmarks” field that collects a different set of stats than “total_ratings”?

please try this both with and without ORDER BY

SELECT song_id
     , MAX(name) AS name
     , MAX(average_rating) AS average_rating
     , MAX(total_ratings) AS total_ratings
     , MAX(total_bookmarks) AS total_bookmarks
  FROM (
       SELECT songs.song_id
            , songs.name
            , average_rating
            , total_ratings
            , NULL AS total_bookmarks
         FROM songs
       INNER
         JOIN ( SELECT song_id
                     , AVG(rating) AS average_rating
                     , COUNT(*) AS total_ratings
                  FROM ratings
                GROUP
                    BY song_id ) AS ratings_summary
           ON ratings_summary.song_id = songs.song_id
       UNION ALL
       SELECT song_id
            , NULL
            , NULL
            , NULL
            , COUNT(*) AS total_bookmarks
         FROM bookmarks
       GROUP
           BY song_id
       ) AS d
GROUP
    BY song_id
ORDER
    BY total_ratings DESC

First look, I don’t see why there is a need for 2 sub queries.


SELECT s.name, AVG(r.rating), COUNT(r.user_id), COALESCE(b.popularity, 0) AS popularity 	
	FROM songs AS s
		INNER JOIN ratings AS r ON s.song_id = r.song_id
		LEFT JOIN (SELECT song_id, COUNT(user_id) AS popularity FROM bookmarks GROUP BY 1) AS b ON b.song_id = s.song_id
	GROUP BY 4, 1, s.song_id
	ORDER BY 4 DESC, 1

Maybe?

Thanks rudy. They both take a little over one-tenth of a second to execute. Now I just need to study it to understand how I can join other one-to-many and many-to-many tables to it.

Ren, unfortunately that query takes 15 seconds to execute.

Hmm, not good.

Just out of curiosity how many rows has the songs table?

I’m imaging it to be fairly sizeable, or something else is drastically wrong.

After playing around with this query there seems to be some problems. All the data in the second union is always included in the result set. I don’t see how to properly place conditions on the query. For instance, let’s say I wanted to only include songs that have more than 3 total ratings. Every logical place I try to place the condition results in several columns containing NULL values including the total_bookmarks column which definitely shouldn’t be NULL. I’d like to get this to work as it’s fast.

About 10,000.

well, yeah, that’s how UNION queries work

but the outer query should be collapsing the result set of the UNION

the “more than 3 ratings” would be a HAVING condition inside the ratings_summary subquery

Thanks rudy. You’re right. I was placing the condition in the wrong area. I was able to eliminate the NULL rows by adding the following to the outer query:

...
GROUP
    BY song_id
[b][COLOR="Red"]HAVING
    total_ratings IS NOT NULL[/COLOR][/b]
ORDER
    BY total_ratings DESC

I don’t know if the HAVING clause is the best or most appropriate way to deal with this, but it works.