Efficient table design and query for very particular case

frodo · June 2, 2010, 8:43pm

Hi,
I have long been a fan of SPFs and have posted numerous questions all receiving very helpful answers. So I will try something more challenging this time. Now, before anybody thinks I am throwing a theoretical problem, this is not the case.

Let’s imagine we have a very large population (approx 1,000,000) of which we know the first name and the age. All individual population units can be part of no group or up to 10 different groups each of which can be between 2 and 20 members large. We know to which groups each unit of the population belongs to, if any. To further complicate the situation the population is growing daily.

I need to query this dataset to find groups in which at least three memebrs with a specific name have an age of x, y and z years plus or minus w. So for example I need to find groups in which there is a John who is 20 (+/- 5 years), an Andrew who is 36 (+/- 5 years) and a Lisa who is 7 (+/- 5 years). The group itself can be larger, but this is irrelevant.

The question is: what is the most efficient way of organising this data and what is the most efficient way of querying it?

I really appreciate any help you can give me.

Thanks,
Adrien

oddz · June 3, 2010, 12:02pm

I don’t think that query does what you want it to do. That is considering that the most inner query results in selecting only row with name equal to Lisa between age 2 and 12. You than attempt to filter on name Andrew when all there will be are Lisa’s – make sense?

In regards to optimization the query I showed you should be fine. If it really becomes an issue you can do some caching server side or something to store the flat result set eliminating the database trip altogether once the data has been resolved once.

The whole philosophy with the query I showed you is to remove all the people that you don’t want through the where clause. Once that filter has been applied you will be left with all the people but won’t have any “groups”. Creating groups is than done by associating each row with a imaginary group. That is done so that the aggregate filter can than be applied through use of group by on the calculated column used to identify a group of people.

oddz · June 3, 2010, 12:29pm

Yeah, I mean its a pretty simple query. You shouldn’t have much trouble with it.

Here is another rearrangement that does the same thing:


SELECT
      CASE
      
         WHEN p.name = 'John' AND p.age BETWEEN 15 AND 25
         THEN 1
     
         WHEN p.name = 'Andrew' AND p.age BETWEEN 31 AND 41
         THEN 2
         
         WHEN p.name = 'Lisa' AND p.age BETWEEN 2 AND 12
         THEN 3
         
       ELSE 0 END grp
  FROM
     population
 WHERE
     p.name IN ('John','Andrew','Lisa')
   AND
     p.age BETWEEN 2 AND 41
 GROUP
    BY
     grp
HAVING
     grp <> 0
   AND
     COUNT(*) > 2

That one is probably more scalable in the end.

The largest difference in this case that the where clause doesn’t eliminate all possible non matches. Instead all rows not matching any criteria are placed into a group that is later removed via the having filter. The obvious advantage of this is that you will only every need to have two conditions in the where clause. The disadvantage is that the case statement is going to cycling through useless data. That shouldn’t be much of an issue though.

frodo · June 3, 2010, 12:19pm

Hi oddz,
Does your query take into account that there are thousands of names in the “population” table and that the trhee I mentioned are just a fraction of them?

Adrien

oddz · June 3, 2010, 12:14pm

Case is just a simple switch statement.

Broken up into simple language the case statement provided in the query means:

When name equals John and Johns age is 15 – 25 set this columns value equal to ‘a’
Otherwise if name equals Andrew and Andrews age is 31 – 41 set this columns value equal to ‘b’
If none of those conditions are matched assume the person is Lisa with age of 2–12 and set the columns value equal to ‘c’.

The case statement is going to applied after the where clause. So the where the clause is going to eliminate all the people who don’t match any of the criteria. The case statement is than used to place each person into the appropriate group. That eliminates the need for another table and makes it possible to filter on the aggregate calculation using a having clause.

frodo · June 3, 2010, 12:08pm

Hi oddz,
Once more thank you for your help. As i mentioned I am not familiar with CASE and WHEN. Would you be able to run me through your query or point me to a resource that explains how these instructions work please?

Thanks,
Adrien

frodo · June 3, 2010, 7:04am

Hi oddz,
Thank you very much for your quick reply.
I have to admit that your SQL is a little too advanced for me so I’m not sure I follow it. However, following your line of thought I’ve packaged a query to act straight on the first table. I however unsure if my or your query would be more efficient (ie use less resources):

SELECT d.group_id FROM (
    SELECT b.group_id FROM (
        SELECT group_id FROM population 
        WHERE name = 'Lisa' 
        AND age BETWEEN 2 AND 12
    ) a, population b 
    WHERE  name = 'Andrew' 
    AND age BETWEEN 31 AND 41 
    AND a.group_id = b.group_id
) c, population d 
WHERE name = 'John' 
AND age BETWEEN 15 
AND 25 AND c.group_id = d.group_id

Thanks,
Adrien

frodo · June 2, 2010, 9:06pm

The best I have managed to come up with is to have two tables.
One which contains all the units in the population, their age and the group they belong to.
The second table contains all possible 3 people permutations of the members forming that specific group.

CREATE TABLE population (
name varchar(30) NOT NULL,
age smallint(6) NOT NULL,
group_id int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE groups (
name_1 varchar(25) NOT NULL,
name_2 varchar(25) NOT NULL,
name_3 varchar(25) NOT NULL,
age_1 smallint(5) unsigned NOT NULL,
age_2 smallint(5) unsigned NOT NULL,
age_3 smallint(5) unsigned NOT NULL,
group_id int(10) unsigned NOT NULL,
KEY name_1 (name_1),
KEY name_2 (name_2),
KEY name_3 (name_3),
KEY age_1 (age_1),
KEY age_2 (age_2),
KEY age_3 (age_3)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

I would then query the second table with a query like:
SELECT group_id FROM groups WHERE name_1 = ‘John’ AND age_1 BETWEEN 15 AND 25 AND name_2 = ‘Andrew’ AND age_2 BETWEEN 31 AND 41 AND name_3 = ‘Lisa’ AND age_3 BETWEEN 2 AND 12

I do however feel that is is greatly inefficient: any ideas?

oddz · June 2, 2010, 9:22pm

Couldn’t you just use your first table and use a single query to dynamically calculate the group?


SELECT
      CASE 
         WHEN p.name = 'John' AND p.age BETWEEN 15 AND 25
         THEN 'a'
      
         WHEN p.name = 'Andrew' AND p.age BETWEEN 31 AND 41
         THEN 'b'
      
         ELSE 'c' END grp
  FROM
      population p
 WHERE
      (p.name = 'John' AND p.age BETWEEN 15 AND 25)
    OR
      (p.name = 'Andrew' AND p.age BETWEEN 31 AND 41)
    OR
      (p.name = 'Lisa' AND p.age BETWEEN 2 AND 12)
 GROUP
    BY
     grp
HAVING
     COUNT(*) > 2

To me it doesn’t make sense to use an actual table for groups since age is a dynamic property of a person. That seems better left to be resolved in the query based on what age range and person name you would like to filter.