IN -or- OR (that is the question)

rustybuddy · February 25, 2010, 2:38pm

WOW. Some pretty impressive gains. Re-looking at how my query was designed was definitely the proper action. A little background, this query is actually just 1/3 of a larger more complex query. I modified this to utilize a JOIN rather than a SELECT (to get the list of id’s) and an OR to find the id’s in another table. Here are the numbers (WOW!).

*note: I used a ‘worst case’ scenario for testing so the performance gains will be less in most of my production environments. My tests utilized a set of approx 800 OR’s which would be a dealer with ALL products listed on their site (which would be a very rare occurrence). Also these queries were run with no cache.

Query with multiple OR’s: 0.3606sec
Same Query with JOIN: 0.0232sec

That’s a savings of .3374 secs per query!! Not to mention server resource savings.

Thanks to everyone (especially Dr John) for helping me work through this and optimize this query!!

Vali · February 25, 2010, 4:38pm

If you use IN, and the field for the IN is your primary key, your select will take ~0.0003 sec.
After your joins, if done on keys, you should get your result in ~0.003 sec

Do a DESCRIBE on your query, and change it to get less joins in there, and I’m sure you can make it at least 10 times faster.

Example:


# table has indexes on type1 and type2

# Slower
SELECT id FROM table WHERE type1 = 1 OR type2 IN (1,2,3,4);

# Faster
SELECT id FROM table WHERE type1 = 1
UNION
SELECT id FROM table WHERE type2 IN (1,2,3,4);

Reason:
When you use the OR, MySQL will only use one of the two indexes. So when your using UNION, it does the SELECT twice, both times on indexes.

Play around with DESCRIBE to optimize queries.

Ps:

Sometimes, the queries that slow down your server are not the slow ones, but the “fast enough” ones that get run 100 times per page… I would look for some of those to optimize.

r937 · February 25, 2010, 4:45pm

Vali:

Example:
# table has indexes on type1 and type2

# Slower
SELECT id FROM table WHERE type1 = 1 OR type2 IN (1,2,3,4);

# Faster
SELECT id FROM table WHERE type1 = 1
UNION
SELECT id FROM table WHERE type2 IN (1,2,3,4);
Reason:
When you use the OR, MySQL will only use one of the two indexes. So when your using UNION, it does the SELECT twice, both times on indexes.

this is a fabulous example, may i steal it?

by the way, not all databases work this way, some can do index merges because they can actually utilize more than one index (shocking, eh)

sadly, i’m finding it harder and harder to remember all these little nuances for multiple database systems…

so my strategy is to write SQL that is easy to write and more importantly easy to understand

thus, i would write the OR syntax rather than the UNION syntax, if i even remembered about the UNION syntax

(and like SQL, i don’t use hacks in CSS, either)

but you are right, little nuances like this are extremely important for “hardening” an application that will see heavy use

rpkamp · February 25, 2010, 7:00pm

To continue the discussion from a few posts back, I’ve been thinking about it quite a lot the last few days, and I think that “under the hood” DBMSs rewrite IN to a series of OR’s.

At a low level, the DBMS has to check whether the specified field has some value, and since it impossible to check if the value of a field is in a given collection of values at once, the DBMS would need to check it for each of the possibilities separately.

For those of you who might argue the DBMS could use something like PHP’s in_array() to check if the specified value has one of the values in a given collection I have to disappoint you, “under the hood” PHP is also checking the values one by one.

Since DBMSs rewrite the query once and then execute it, it would be weird if they left the IN in the query, since it will have to be executed as a series of OR’s.
So rewriting IN to series of OR’s makes sense. Well, at least to me it does

Dan_Grossman · February 25, 2010, 7:33pm

MySQL doesn’t actually execute any query in any manner.

The MySQL server has a query optimizer and planner which rewrites the query to a simplified query language, and that query is passed on to the storage engine used by the table.

The storage engine is what actually implements all the information retrieval and modification operations. To talk about the retrieval algorithm at such a low level you have to dig into the storage engine. MyISAM does it totally different than InnoDB, and there are at least 20 engines you can choose from. Each can implement the lower level query operations differently.

rustybuddy · February 25, 2010, 7:52pm

This is very interesting to me… I wasn’t aware that MySQL could only utilize 1 key per select… hmmm…

It’s going to take a bit to wrap my head around integrating this into my query as my remaining OR’s are created dynamically so rather than just tacking some OR’s on to the query I would need it to generate another SELECT statement and a UNION.

For now, I am very satisfied with my performance gains just from switching from many OR’s to one additional JOIN.

felgall · February 25, 2010, 8:35pm

A series of OR conditions can be completely independent with different fields being tested for each.

An IN guarantees that it is one field being tested for all the values and therefore there is at least the possibility that the actual code being run could be more efficient due to one of the values to be compared not needing to be reloaded for each.

Whether a database actually is able to or even tries to make use of the fact that an IN could be executed more efficiently than a series of ORs by not constantly reloading the field being tested depends on the database and various other factors.

Generically a series of OR conditions is not equivalent to an IN since a=b OR c=d does not have an equivalent IN condition. The IN clearly specifies information that requires closer examination to extract from a series of ORs.

So the two can either be implemented the same way by databases (which it appears is usually the case) or the database could take advantage of the higher specificity of the IN statement to implement that variant in a more efficient manner. Of course a series of OR condifitions testing the same field should be implemented the same way as an IN but the potimiser has to actually look more closely at all the code to determine whether the OR conditions are or are not the equivalent of an IN before being able to implement it that way.

rpkamp · February 25, 2010, 11:44pm

Interesting, I hadn’t thought of that. That invalidates my previous argument that DBMSs are most likely to “rewrite IN to a series of OR’s” (Dan, you’re right, this is not technalically the correct way to put it, but you understand my reasoning, right?).

They are not equivalent, but I think we can all argee that every IN can be written as a series of OR’s, but not every series of OR’s can be written as an IN.
So it’s a “one-way relation”, or injection if you will.

In the next couple of days I will dust off the book from a Database course I once took and see if that has anything to tell us about optimizing IN and/or OR. I’ll keep you posted

felgall · February 26, 2010, 1:50am

Which means that IF the optimiser can generate more efficient code for a comparison using IN where it doesn’t need to swap out the field value being compared then there are two possibilities of what it could do with the OR.

It could not check if it is the equivalent of an IN in which case the code can’t retain any field values between comparisons - which would take longer
It could spend extra time up front to see if the OR is an equivalent of IN.

Either way the OR would then take longer than the IN.

The IN and OR can be processed the same way and therefore take the same time only if the IN is considered to be the equivalent of OR and the fact that one side of all the comparisons is the same is ignored in terms of how the processing is done.

That’s why I said right back in my first reply that IN might be faster than OR or they might take exactly the same time depending on just how the code is handled by the optimiser.

In every case where you have a one way relationship like this between two versions of a database query there is at least a theoretical possibility of the more specific one being faster than the other.

Dan_Grossman · February 26, 2010, 1:55am

Since we’re all just pulling things out of thin air now, is anyone interested in digging into the code behind one of the storage engines and finding out what’s going on?

When’s the last time ya did an amortized analysis of the costs of operations on a b-tree?

r937 · February 26, 2010, 4:35am

excellent comment, dan old buddy

i hate it when threads go there…

rpkamp · February 27, 2010, 3:09pm

Okay, so I’ve been reading several chapters in the book I got for a databases course at the university, but I couln’t find any reference to IN in that book whatsoever.

So I went to the MySQL documentation and found the following:

expr IN (value, …)
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.

You can find this here.

I did some experimental testing on a toy table, created by the SQL below, to investigate this.


CREATE TABLE `test` (
 `id` int(10) DEFAULT NULL,
 `some_char` char(1) DEFAULT NULL,
 KEY `id_index` (`id`),
 KEY `some_char_index` (`some_char`)
) ENGINE=MyISAM

INSERT INTO test(id,some_char) VALUES
(1, 'j'),
(2, 'i'),
(3, 'h'),
(4, 'g'),
(5, 'f'),
(6, 'e'),
(7, 'd'),
(8, 'c'),
(9, 'b'),
(10, 'a')

(I first tried this table with a primary key on id, but then MySQL seems to sort results by id, which defies the purpose of what I’m about to show).

Now consider the following query


SELECT id,some_char FROM test WHERE some_char IN ('b','a');

The result is


    id  some_char
------  ---------
    10  a
     9  b

And the EXPLAIN for this query is:


    id  select_type  table   type    possible_keys    key              key_len  ref       rows  Extra      
------  -----------  ------  ------  ---------------  ---------------  -------  ------  ------  -----------
     1  SIMPLE       test    range   some_char_index  some_char_index  2        (NULL)       2  Using where

So, MySQL indeed seems to have sorted the values of the IN values list (‘a’ is first in the result, ‘b’ second, in my IN values list I asked them other way around).

However, in two seperate cases MySQL was not sorting the values.

First case
The query


SELECT id,some_char FROM test WHERE some_char IN ('b','a','c');

Returns the result


    id  some_char
------  ---------
     8  c        
     9  b        
    10  a

And the EXPLAIN of this query:


    id  select_type  table   type    possible_keys    key     key_len  ref       rows  Extra      
------  -----------  ------  ------  ---------------  ------  -------  ------  ------  -----------
     1  SIMPLE       test    ALL     some_char_index  (NULL)  (NULL)   (NULL)      10  Using where

So instead of using the index, MySQL seems have to fallen back to a full table scan.
Interestingly, when we perform the following query on the table


INSERT INTO test(id,some_char) VALUES (11,'z'), (12,'z'), (13,'z')

To give the table a 13 tuples, and perform the same query, we get the result


    id  some_char
------  ---------
    10  a        
     9  b        
     8  c

and the EXPLAIN of this query now is:


    id  select_type  table   type    possible_keys    key              key_len  ref       rows  Extra      
------  -----------  ------  ------  ---------------  ---------------  -------  ------  ------  -----------
     1  SIMPLE       test    range   some_char_index  some_char_index  2        (NULL)       3  Using where

So now MySQL is using the index again.
(For 11 and 12 tuples MySQL also uses a tablescan, for >= 13 tuples it using the index again.)

So MySQL appears to be considering whether sorting the values of the IN values list would be faster than just doing a tablescan.

Second case
When we drop the index some_char_index and perform the same queries (still with 13 tuples) we get

Query:


SELECT id,some_char FROM test WHERE some_char IN ('b','a');

Result:


    id  some_char
------  ---------
     9  b        
    10  a

Explain:


    id  select_type  table   type    possible_keys  key     key_len  ref       rows  Extra      
------  -----------  ------  ------  -------------  ------  -------  ------  ------  -----------
     1  SIMPLE       test    ALL     (NULL)         (NULL)  (NULL)   (NULL)      13  Using where

And for the other query.

Query:


SELECT id,some_char FROM test WHERE some_char IN ('b','a','c');

Result:


    id  some_char
------  ---------
     8  c        
     9  b        
    10  a

Explain:


    id  select_type  table   type    possible_keys  key     key_len  ref       rows  Extra      
------  -----------  ------  ------  -------------  ------  -------  ------  ------  -----------
     1  SIMPLE       test    ALL     (NULL)         (NULL)  (NULL)   (NULL)      13  Using where

So, when there is no index on the some_char column, MySQL does not sort the values of the IN values list and performs a tablescan.

Now to get back to the original problem, when I rewite all IN’s to OR’s (so some_char IN (‘a’,‘b’) becomes some_field=‘a’ OR some_field=‘b’) the results are EXACTLY the same.
This would suggest (but doesn’t prove) that MySQL is rewriting the queries using IN and the queries using OR’s to the same plan.

For InnoDB the results are basically the same, but instead of 13 tuples you need 17 tuples to make MySQL use the index when looking for 3 different values.

DISCLAIMER: All these results are experimental and should be viewed as such. I’m not proving anything, just showing the results of some experiments. Everyone can draw their own conclusions.

r937 · February 27, 2010, 5:40pm

regarding the use of indexes…

mysql is smart enough to figure out that it’s probably faster to do a table scan for a table with only a dozen rows because it can grab them all with only one physical read off the disk instead of two physical reads (index plus data)

so if you’re going to do any testing on whether indexes are used or not, make sure your table has at least a few thousand rows

rpkamp · February 27, 2010, 6:44pm

While I totally agree with this principle, my “results” show something else.
Namely requesting IN(‘b’,‘a’) for the table with 10 rows does use the index.