Looking to refine/optimize a complex multijoin query

Hello all,

This is a query I’ve been thinking about for quite some time that I would like to optimize to run efficiently in MySQL. It began life in an ASP/Access app about 12 years ago that ran a straight (no joins) query against an Access database of contacts returning about 600 rows. That was saved in an array which was then looped and each row ran another query against a couple of tables that pulled out yearly grade information. It’s not the most efficient way of doing it but it worked.

Now we’re using a more sophisticated CRM on PHP/MySQL called CiviCRM and I’m rebuilding this part of the system as a custom plugin.

In order to achieve the same results as the old system, I’ve duplicated what we had originally done to get a baseline to start with. It’s dead slow compared to the old ASP/Access version so I have some work to do.

Here’s how I’m approaching it:

I have a main query to pull in the contact information as follows:



$sql = "SELECT contact.id
        , contact.last_name
        , contact.first_name
        , contact.external_identifier
        , contact.user_unique_id
        , membership.membership_type_id
        , membership.id AS membership_id
        , membership.join_date AS member_since
        , membership_type.name AS member_type

        FROM contact 
        
        INNER JOIN membership
                ON contact.id = membership.contact_id
        
        INNER JOIN membership_type
                ON membership.membership_type_id = membership_type.id
        
        WHERE contact.first_name IS NOT NULL 
        
        AND contact.last_name IS NOT NULL
        
        ORDER BY contact.last_name";


The CRM has a database abstraction layer that allows me to get a data access object and I can either loop in into an array or loop and execute a query for each row in my main query. I’ve found it is slightly more efficient looping directly from the dataset so here’s how that part works (I’ve skipped a bunch of the html cruft involved to keep the concept from getting muddled up):

Anything in the code below that is prefixed with $dao-> is a value from the results of the query above and specific to the row that is in focus during that iteration of the While loop.



// Execute the query above and return a dataset object 
$dao = CRM_Core_DAO::executeQuery($sql);

// START HTML string variable to hold a table of results $report_table = "<table>, etc..";


while( $dao->fetch( ) ) {   

        $sql = "SELECT professional_development_categories.id AS id
                 , professional_development_categories.category AS category
                 , SUM(professional_development_activities.credits) AS credits
                 , professional_development_categories.minimum
                 , professional_development_categories.maximum
                 , professional_development_categories.description
                 FROM professional_development_categories
                 LEFT OUTER JOIN professional_development_activities
                 ON professional_development_activities.category_id = professional_development_categories.id
                 AND professional_development_activities.contact_id = " . $dao->id . "
                 AND EXTRACT(YEAR FROM professional_development_activities.credit_date) = " . $_SESSION["report_year"] . "
                 GROUP BY professional_development_categories.id";

        // Populate the variable that holds the HTML table of results
        $report_table .= '<tr>';
        $report_table .= '<td>' . $dao->last_name . '</td>';
        $report_table .= '<td>' . $dao->first_name . '</td>';
        $report_table .= '<td>' . $dao->member_type . '</td>';
        $report_table .= '<td>' . $dao->member_since . '</td>';

        // Execute the sub query for each row in the main query to get this years scores for that individual
        $subdao = CRM_Core_DAO::executeQuery($sql);
            
        $total_credits = 0;
        $sub_cells = "";
                
        while( $subdao->fetch( ) ) { 
                 $total_credits += abs($subdao->credits);             
                 $sub_cells .= '<td>' . abs($subdao->credits) . '</td>';
        }
            
        $report_table .= '<td>' . $total_credits . '</td>';
        $report_table .= $sub_cells;

        $report_table .= '</tr>';

}


// END HTML to create a table of results $report_table .= "</table>, etc..";


So hopefully that isn’t too convoluted and makes some sense. I’m looking to make some improvements to it and speed it up. I’d like to combine the queries but I’m not sure how I can do that.

Thanks,
Andrew

Ok, so to try and improve this query, I’ve been playing in PHPMyAdmin running some queries. I figured the first thing I could try is to add the sum of all of the credits from the activities table. So I added that field and LEFT JOIN-ed the table as follows:



SELECT civicrm_contact.id
        , civicrm_contact.last_name
        , civicrm_contact.first_name
        , civicrm_contact.external_identifier
        , civicrm_contact.user_unique_id
        , civicrm_membership.membership_type_id
        , civicrm_membership.id AS membership_id
        , civicrm_membership.join_date AS member_since
        , civicrm_membership_type.name AS member_type

        # ADD Credits
        , civi_cpd_activities.credits AS total_credits

        FROM civicrm_contact 

        # Left Join the table with some criteria
        LEFT JOIN civi_cpd_activities
        ON  civi_cpd_activities.contact_id = civicrm_contact.id AND EXTRACT(YEAR FROM civi_cpd_activities.credit_date) = " . $_SESSION["report_year"] . "

        INNER JOIN civicrm_membership
        ON civicrm_contact.id = civicrm_membership.contact_id
        INNER JOIN civicrm_membership_type
        ON civicrm_membership.membership_type_id = civicrm_membership_type.id
        WHERE civicrm_contact.first_name IS NOT NULL 
        AND civicrm_contact.last_name IS NOT NULL
        ORDER BY civicrm_contact.last_name



Now the problem with this is that it gives me a new row for each civi_cpd_activities.credits and duplicates all the other info about the individual. I would like to SUM(civi_cpd_activities.credits) but for some reason that just gives me a single result. I don’t quite understand why it’s doing that so I need to figure that out.

It’s still a very slow query to run. PHPMyAdmin responded with the following results: Showing rows 0 - 29 ( 684 total, Query took 8.0574 sec). I’ve written some pretty ugly queries but this one is really slow and I haven’t even gotten to the part where I have to pull out the sum of each score category and list that out.

Just for fun, I thought I’d try a sub-query for that summed field and it seems to be 1 second faster than the field with the left join. Still pretty slow for such a seemingly simple query:



SELECT civicrm_contact.id
        , civicrm_contact.last_name
        , civicrm_contact.first_name
        , civicrm_contact.external_identifier
        , civicrm_contact.user_unique_id
        , civicrm_membership.membership_type_id
        , civicrm_membership.id AS membership_id
        , civicrm_membership.join_date AS member_since
        , civicrm_membership_type.name AS member_type
        , (SELECT SUM(civi_cpd_activities.credits) FROM civi_cpd_activities WHERE civi_cpd_activities.contact_id = civicrm_contact.id AND EXTRACT(YEAR FROM civi_cpd_activities.credit_date) = [COLOR=#000000][COLOR=#DD0000]" . [/COLOR][COLOR=#0000BB]$_SESSION[/COLOR][COLOR=#007700][[/COLOR][COLOR=#DD0000]'[/COLOR][COLOR=#0000BB]report_year[/COLOR][COLOR=#DD0000]'[/COLOR][COLOR=#007700]][/COLOR][COLOR=#DD0000] . "[/COLOR][/COLOR]) As total_credits    
        FROM civicrm_contact 

        INNER JOIN civicrm_membership
        ON civicrm_contact.id = civicrm_membership.contact_id

        INNER JOIN civicrm_membership_type
        ON civicrm_membership.membership_type_id = civicrm_membership_type.id

        WHERE civicrm_contact.first_name IS NOT NULL 
        AND civicrm_contact.last_name IS NOT NULL
        ORDER BY civicrm_contact.last_name


By the way, the activities table conatians 17,000 rows of data from the last 12 years so there is some work for the database. The query when run without looking at the activities table takes well under a second to run:

Showing rows 0 - 29 ( 634 total, Query took 0.0075 sec)

With the sub-query:

Showing rows 0 - 29 ( 634 total, Query took 7.3675 sec)

I just had an idea… What about creating a temp table of only the activities results from the year in question and then running the query against that table.

SELECT c.id
     , c.last_name
     , c.first_name
     , c.external_identifier
     , c.user_unique_id
     , m.membership_type_id
     , m.id AS membership_id
     , m.join_date AS member_since
     , mt.name AS member_type
     , a.sum_credits 
  FROM civicrm_contact AS c
INNER 
  JOIN civicrm_membership AS m
    ON m.contact_id = c.id
INNER 
  JOIN civicrm_membership_type AS mt
    ON mt.id = m.membership_type_id
 WHERE c.first_name IS NOT NULL 
   AND c.last_name IS NOT NULL
LEFT OUTER
  JOIN ( SELECT pda.contact_id
              , SUM(pda.credits) AS sum_credits
           FROM professional_development_activities AS pda
          WHERE pda.credit_date >= '2014-01-01'
            AND pda.credit_date  < '2015-01-01'
         GROUP 
             BY pda.contact_id ) AS a
    ON a.contact_id = c.id
ORDER 
    BY c.last_name

notice the two dates – you would replace these with run-time parameters just like you used to have a parameter for “report_year”

the idea is that pda.credit_date would have an index, and this range test (starting with first day of year, up to but not including first day of next year) will allow the index to be utilized, whereas your EXTRACT function would not

Fantastic! Thanks Rudy. I’ll run this and see how it improves the speed of things.

Cheers,
Andrew

Holy smokes, what a difference! I had to move the WHERE clause to the end of the query just before the ORDER BY but the query runs lightning fast:

Showing rows 570 - 599 ( 634 total, Query took 0.0282 sec)

So if I understand correctly the EXTRACT function prevented my query from being able to create or use an index. Is that right? If so, that is definitely a function to be weary of.

My next trick will be to loop out the total credits for each category in the activities table for this time period. I’ll give that a go on my own. Hopefully I won’t have to come running for help and I’ll post my results when I have something. This is amazingly fast now!

doh!!

queries do not create indexes, you have to do that yourself, using ALTER TABLE ADD INDEX or CREATE INDEX

using ~any~ function on a column will negate the use of any index on that column

did your credit_date column already have an index, or did you just create one now?