Adding a related table count to results

I was going to create this post to get some help, but I might not need to.

I want to get some people from my database, and to include with their information a separate field that sums the number of publications each person appears in.

For example:


ID First Last PubCount
 1 Tom   Jones       6
 2 Joe   Bloggs      8

So I think that it might be as easy as creating the two tables separately, then combining the two together. Thoughts of sugar plums, and inner joins with sub queries go running through my head.

So the basic query is going to be:


SELECT ID, FirstNames, Surname, PubCount
FROM Person
  INNER JOIN (
    ...
  ) as Publications
    ON Person.ID = Publications.PersonID

with some magic happening in the subquery part.

I know that I want to select the count of publications for each person, so I am reminded of the GROUP command. Could that work?


SELECT PersonID, count(PersonID) as PubCount
FROM PublicationPerson
GROUP BY PersonID

There are going to be duplicates within that, but they should be weeded out by the inner join.


SELECT ID, FirstNames, Surname, PubCount
FROM Person
  INNER JOIN (
    SELECT PersonID, count(PersonID) as PubCount
    FROM PublicationPerson
    GROUP BY PersonID
  ) as Publications
    ON Person.ID = Publications.PersonID

It seems to run, and not too badly too. But I’m not happy about the inner join of the subquery - it might be an expensive operation.

With nearly 100,000 people and 250,000 publications, it takes 0.13 seconds to get 10,000 people.

What if the the count was pulled up to the main select, so that the subquery isn’t needed?


SELECT Person.ID, FirstNames, Surname, count(PersonID) as PubCount
FROM Person
  INNER JOIN PublicationPerson
    ON Person.ID = PublicationPerson.PersonID
GROUP BY PersonID

For 10,000 results it’s 5 times faster, taking only 0.028 seconds.

So ultimately what I’m asking is: whether the second technique is the better one. Is plain speed a better determining factor than other factors?

i actually like the first one, where you do the count in a subquery

it’s more flexible, as it allows future joins to other related tables without compromising the count – for example, suppose you also wanted to count the number of subscriptions people have along with their publications

in the second method, the simple join with the GROUP BY in the outer query, you’d get cross join effects

mind you, if there is a difference in execution speed (assuming, of course, that there are the appropriate indexes on all tables), it’s always practical to go with the faster one until such time as further enhancement to the query is actually required

Thanks for that advice, for I am (engage shudders of horror) taking over from a departed developer.

This is some of the code that I’ve been left with:


(
  SELECT Person.Id, FirstNames, Surname, NeeName, AlsoKnownAs, NeeName, FormerName, DeathDate, BirthDate, Age, AgeMeasure, Country.Name as Country, CityTown.Name as CityTown, c2.Name as FormerCityTown, FuneralHome.Name AS FuneralHome, Charity.Name AS Charity
  FROM Person
    LEFT JOIN Country ON Country.Id = Person.CountryId
    LEFT JOIN CityTown ON CityTown.Id = Person.CityTownId
    LEFT JOIN CityTown AS c2 ON Person.FormerCityTownId = c2.Id
    LEFT JOIN Charity ON Charity.Id = Person.CharityId
    LEFT JOIN FuneralHome ON FuneralHome.Id = Person.FuneralHomeId
  WHERE Person.AddedOn > DATE_SUB( NOW( ) , INTERVAL ' . $period . ' )
)
UNION DISTINCT
(
  SELECT Person.Id, FirstNames, Surname, NeeName, AlsoKnownAs, NeeName, FormerName, DeathDate, BirthDate, Age, AgeMeasure, Country.Name as Country, CityTown.Name as CityTown, c2.Name as FormerCityTown, FuneralHome.Name AS FuneralHome, Charity.Name AS Charity
  FROM (
    SELECT PersonId
    FROM PublicationPerson
    WHERE AddedOn > DATE_SUB( NOW( ) , INTERVAL ' . $period . ' )
  ) AS RecentPublications
  INNER JOIN Person ON Person.Id = RecentPublications.PersonId
  LEFT JOIN Country ON Country.Id = Person.CountryId
  LEFT JOIN CityTown ON CityTown.Id = Person.CityTownId
  LEFT JOIN CityTown AS c2 ON Person.FormerCityTownId = c2.Id
  LEFT JOIN Charity ON Charity.Id = Person.CharityId
  LEFT JOIN FuneralHome ON FuneralHome.Id = Person.FuneralHomeId
)
ORDER BY Surname ASC , FirstNames ASC

I’m hopeful that the following is a better rendition of the above, where the publication people are instead joined on, and included by the use of NOT NULL


SELECT Person.Id, FirstNames, Surname, NeeName, AlsoKnownAs, NeeName, FormerName, DeathDate, BirthDate, Age, AgeMeasure, Country.Name as Country, CityTown.Name as CityTown, FormerCityTown.Name as FormerCityTown, FuneralHome.Name AS FuneralHome, Charity.Name AS Charity
FROM Person
  LEFT OUTER JOIN Country ON Country.Id = Person.CountryId
  LEFT OUTER JOIN CityTown ON CityTown.Id = Person.CityTownId
  LEFT OUTER JOIN CityTown AS FormerCityTown ON Person.FormerCityTownId = FormerCityTown.Id
  LEFT OUTER JOIN Charity ON Charity.Id = Person.CharityId
  LEFT OUTER JOIN FuneralHome ON FuneralHome.Id = Person.FuneralHomeId
  LEFT OUTER JOIN (
    SELECT DISTINCT PersonId
    FROM PublicationPerson
    WHERE AddedOn > DATE_SUB( NOW( ) , INTERVAL 7 DAY )
  ) as RecentPublications on RecentPublications.PersonId = Person.Id
WHERE
  Person.AddedOn > DATE_SUB( NOW( ) , INTERVAL 7 DAY ) OR
  RecentPublications.PersonId IS NOT NULL
ORDER BY Surname ASC , FirstNames ASC

To which, the inclusion of publication counts should be more usefully achievable.

When it comes to more expansive queries such as this, what sort of performance information would it be useful for me to learn about?

regarding performance, you should learn how to read and interpret the results of EXPLAINs

by the way, those LEFT OUTER JOINs should all be INNER JOINs (except for the one to publications, as a person might not have any)

LEFT OUTER JOIN allows, of course, for the possibility that the row from the left table has no matching row in the right table, and you want the left table row returned anyway

consider this –

Person LEFT OUTER JOIN Country 
ON Country.Id = Person.CountryId

the LEFT OUTER JOIN here says that you have a person with a countryid, and you want that person returned even if that person’s countryid doesn’t exist in the country table

that’s a sure sign of a problem, wouldn’t you say?

:slight_smile:

Off Topic:

i just noticed quirksmode in your sig… do you know ppk? if so, say hello for me :slight_smile:

How much of a problem is it where the country is not known?

Which leads me to another question. Is it considered better to have null values for unknown or unavailable data, or to have a placeholder value (0), in the table being joined to the right?

i guess that would depend on the business rules for the application, whether it’s okay to accept a person into the database without a country

and it is by far better to use NULL instead of a dummy placeholder

for one thing, you will never be able to actually declare a true foreign key with a dummy value