To COUNT() or to Store value?

Hello all!

I’m going through some of my queries and trying to tweak/optimize some of them. Mainly looking for speed gains and to reduce the server load.

In an article I read yesterday they mentioned not using COUNT() where possible. And I have a situation where I think I can apply this.

I have a customer database that is accessed many times throughout the day by users. It is paginated in normal fashion (viewing 1-10 of 21,234 customers). I use COUNT() on each query to determine the total number of records.

I could run COUNT() once a night in my cron job and write the number to another table (just for holding counts?). Then rather than running a COUNT() on a very large table (thousands of rows), I would only need to run a SELECT with a WHERE on a very small table (5 rows max).
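A minimal sketch of that idea (the `record_counts` table and its columns are just illustrative names, not anything from your schema):

```sql
-- A tiny holding table for precomputed counts
CREATE TABLE record_counts (
  table_name VARCHAR(64) PRIMARY KEY,
  row_count  INT UNSIGNED NOT NULL
);

-- Run nightly from cron: refresh the cached count
REPLACE INTO record_counts (table_name, row_count)
SELECT 'customers', COUNT(*) FROM customers;

-- Each page view then reads the cheap cached value instead
SELECT row_count FROM record_counts WHERE table_name = 'customers';
```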

In reality the total number of records is for reference only. It will not matter if this number is not exact; I could even qualify it as (viewing 1-10 of approx 21,234 customers).

Would I receive any gains in speed and reduction on server resources by doing this? Any ideas as to what gains I could possibly see? Is the juice worth the squeeze?

Thanks in advance for any guidance or suggestions.

If you're using MyISAM tables, a COUNT(*) with no WHERE clause will be instant (MySQL stores the exact row count for MyISAM tables; COUNT(column) gets the same treatment when the column is NOT NULL).
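For example, assuming a MyISAM `customers` table:

```sql
-- Instant on MyISAM: the storage engine keeps an exact row count
SELECT COUNT(*) FROM customers;

-- NOT instant, even on MyISAM: a WHERE clause forces a real index/table read
SELECT COUNT(*) FROM customers WHERE status = 'active';
```

Note this shortcut only applies to the unfiltered count; InnoDB does not store a row count like this, so there every COUNT(*) is a real read.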

Just run EXPLAIN on your query and see what you should be changing.
Thousands of rows is a minuscule number to MySQL, so chances are you're missing some indexes there; that's why your query is slow.
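For instance (`customers` and `last_name` are just placeholder names here):

```sql
-- See how MySQL executes the query; check the "key" and "rows" columns
EXPLAIN SELECT * FROM customers WHERE last_name = 'Smith';

-- If "key" is NULL and "rows" is near the table size, add an index
ALTER TABLE customers ADD INDEX idx_last_name (last_name);
```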

On the other hand, when you show "viewing 1-10 of 21,234 results", it's because you already did the search. When you go to "viewing 11-20 of 21,234 results", you already have the data to display, so you should not need another query.

In other words, do the select once, cache the result for that search, and when you change pages, show the cached result. You can cut your SQL queries down by ~90% like this.

Could you give a brief description of how the caching is accomplished?

The caching will have to be done at the script level.

For the searches / search results, the best way I found was something like this:

search.php


# Search the database (DB::selectValues stands in for your own DB wrapper;
# it returns a flat array of id values)
$results = DB::selectValues("SELECT id FROM users WHERE $special_search_stuff");

# Cache the results in the session (requires session_start() earlier in the script)
$cacheKey = md5($special_search_stuff);
$_SESSION['searches'][$cacheKey] = $results;

# Redirect to the results page
header("Location: /results.php?search=$cacheKey&page=1");
exit;

results.php


# Get the cached id list (requires session_start() earlier in the script)
$users = $_SESSION['searches'][$_GET['search']];

# Get the page you're on, and make it base 0
$page = $_GET['page'] - 1;

# Get the ids to show on this page
$ids = array_slice($users, $page * 10, 10);

# Get the user data from the db (assumes $ids is non-empty)
$data = DB::select("SELECT * FROM users WHERE id IN (" . implode(", ", $ids) . ")");

# Show the users
foreach ($data as $user) {
  echo $user['name'] . "<br />";
}
echo "Showing: " . ($page * 10 + 1) . " - " . ($page * 10 + count($ids)) . " out of " . count($users);

The idea being:

  • Your slow search is done once, and you cache the IDs it returned.
  • The results page does one select, by ID (the primary key).
  • You can add some smarter caching (e.g. cache the user rows from the results page, so when the user looks at the same page twice they don't hit the db at all).

That way:

  • You only run 1 fast select per “next/prev” page.
  • If you're searching records that change a lot (ex: newest users), paging from the cached ID list means you don't see the same users twice. Without it, new users sign up while you're looking at page 1, everything shifts down, and by the time you click page 2 you get half the users that were on page 1 again.

Notes:

  • I used the SESSION in my example to explain the idea more clearly, but in reality you should use something like memcached / APC (depending on the size of your system). That way, if another user runs the same search, you don't have to hit the DB at all, since you already have the results.

  • Also, I cached all the IDs returned by your search (that's stupid in real life, but it keeps the example simpler). Instead, what you should do is see how many pages users usually navigate (let's say 5 pages) and only preload that many records (50 in our example); when the user reaches the end of the list, you preload the next section (50 to 100). That way you run the search once every 5 pages, and 90% of your users will only ever trigger one search, since they will get bored and never go past page 5.
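In SQL terms, that windowed preload is just a LIMIT you re-run as the user pages past the cached window (numbers match the 5-page example above; the WHERE clause is a placeholder for your real search conditions):

```sql
-- Initial search: cache IDs for pages 1-5 only
SELECT id FROM users WHERE status = 'active' LIMIT 0, 50;

-- User pages past page 5: fetch the next window (pages 6-10)
SELECT id FROM users WHERE status = 'active' LIMIT 50, 50;
```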

  • The reason you want to load records by ID on the results pages is also that it simplifies your caching code. If you cache the objects by id, using memcached or APC, a user shown on your results page doesn't have to be loaded again when you view their profile (it gets loaded from the cache).
    Also, if you need to load users 1,2,3,4,5 and users 3,4,5 are already in your cache, your select becomes SELECT * FROM users WHERE id IN (1,2);. And if all 5 users are in the cache (which happens on popular searches), well then you run no select at all.

I used this method on a very active website (30 million+ profiles).

Overkill?

If all this is a bit overkill for you, and you prefer running the search on every results page, just use SQL_CALC_FOUND_ROWS and FOUND_ROWS().

Ex:


# Get the data
SELECT SQL_CALC_FOUND_ROWS * FROM table WHERE ... LIMIT 10,10;
# Get the total number of records that matched your search
SELECT FOUND_ROWS();

Your slow query only runs once, and the FOUND_ROWS() call afterwards is instant.