I was going to create this post to get some help, but I might not need to.
I want to get some people from my database, and to include with their information a separate field that sums the number of publications each person appears in.
For example:
ID First Last PubCount
1 Tom Jones 6
2 Joe Bloggs 8
So I think that it might be as easy as creating the two tables separately, then combining the two together. Thoughts of sugar plums, and inner joins with sub queries go running through my head.
So the basic query is going to be:
SELECT ID, FirstNames, Surname, PubCount
FROM Person
INNER JOIN (
...
) as Publications
ON Person.ID = Publications.PersonID
with some magic happening in the subquery part.
I know that I want to select the count of publications for each person, so I am reminded of the GROUP command. Could that work?
SELECT PersonID, count(PersonID) as PubCount
FROM PublicationPerson
GROUP BY PersonID
There are going to be duplicates within that, but they should be weeded out by the inner join.
SELECT ID, FirstNames, Surname, PubCount
FROM Person
INNER JOIN (
SELECT PersonID, count(PersonID) as PubCount
FROM PublicationPerson
GROUP BY PersonID
) as Publications
ON Person.ID = Publications.PersonID
It seems to run, and not too badly too. But I’m not happy about the inner join of the subquery - it might be an expensive operation.
With nearly 100,000 people and 250,000 publications, it takes 0.13 seconds to get 10,000 people.
What if the the count was pulled up to the main select, so that the subquery isn’t needed?
SELECT Person.ID, FirstNames, Surname, count(PersonID) as PubCount
FROM Person
INNER JOIN PublicationPerson
ON Person.ID = PublicationPerson.PersonID
GROUP BY PersonID
For 10,000 results it’s 5 times faster, taking only 0.028 seconds.
So ultimately what I’m asking is: whether the second technique is the better one. Is plain speed a better determining factor than other factors?