Subquery vs Join Question

I’m under the assumption it’s usually better tho run a join query than a subquery. Though I’ve read today that it’s preferential to use a subquery over join queries that will create a temporay table.

So is it a good rule of thumb to always write subqueries in place of joins that require the DISTINCT or ORDER BY commands?

wha???

no it isn’t

:slight_smile:

Thanks for weighing in. I read this in a book called High Performance MySQL.

The example they gave references two tables film which has 1,000 rows and film_actor which has 5,642 total rows.

Join Query:
SELECT DISTINCT film.film_if FROM film
INNER JOIN film_actor USING(film_id)

Subquery:
SELECT film_id FROM film
WHERE EXISTS(SELECT * FROM film_actor
WHERE film.film_id=film_actor.film_id)

The queries return 997 results. On my localhost they process at:

Join: .008 - .0015 secs
Subquery: .007 - .0011 secs

The book’s benchmarks show results in Queries per Sec

Join: 185 qps
Subquery: 325 qps

They say when you want to find rows in table1 that match rows in table2 and eliminate any duplicate info a subquery can sometimes perform better

that’s a highly contrived example, and i am at a loss to understand why they would use that example to illustrate a benchmark

ask yourself what the query is supposed to return, whether the subquery version or the join version (hint: they produce the same result)

i can easily explain why the subquery version is faster – it’s because an EXISTS subquery can be optimized, whereas the join, coupled with the use of DISTINCT, retrieves tons of extraneous rows, only to throw away most of them

still wondering what the purpose of the query is?

another type of subquery is called a correlated subquery, and it is notoriously inefficient as compared to a join, but this example with the films and the actors doesn’t use a correlated subquery

you should not draw any general conclusions about subqueries versus joins, as you did in your first post, such as “preferential to use a subquery over join queries that will create a temporay table” (aside: there is no temporary table involved here)

okay, let’s get back to the purpose of the query

there are 1000 films in the films table, and 5642 actors in the film_actors table, right?

have you figured out what the purpose of the query is yet?

find all films that have at least one actor

how useless is that?? we are told that the query returns 997 rows, which isn’t at all surprising (it means there are only 3 films that have no actor at all)

however, i’m sure you will agree that this is quite the contrived example – in real life, it would be much more likely that you would want to find the films that don’t have an actor, rather than to list all the ones that do

please, don’t draw any conclusions from it – other than, yes, an EXISTS subquery is quite efficient

bear in mind, though, that EXISTS subqueries are rare, and not always this contrived

:slight_smile:

Also to consider, how old is the book? what version of mysql are they using in the book? Have advances been made since that time in how mysql performs?

Also to consider, how old is the book? what version of mysql are they using in the book? Have advances been made since that time in how mysql performs?

This could be the issue. The book was written in 2008, using MySQL 5.1. Though I thought at least my local host was using the latest version of MySQL as I just downloaded the Wamp server pack I was testing with.

Though I’ve just found Wamp’s download for 64 bit operating systems, which is the one I’m using, only includes MySQL 5.1.53.

Have there been any upgrades to MySQL’s query optimizer since these versions?

mysql 5.1 is a very mature database engine

any comments on the post i made explaining how contrived that example is?

Sorry, didn’t see your post before my last reply.

Sure, I can see how this can be a manipulated example. There is no need for a join or a subquery. The example could’ve been just as easily written:

SELECT DISTINCT film_id FROM film_actor

Though performance even between this and the original subquery is neck and neck.

But from your explanation, I understand the only reason the subquery is efficient is because of the EXIST clause.

So I guess I should revise my earlier statement:

So is it a good rule of thumb to always write subqueries in place of joins that require the DISTINCT or ORDER BY commands?

to:

So is it a good rule of thumb to always write EXIST() subqueries in place of joins that require the DISTINCT command?

Also you said the DISTINCT join query in this example didn’t create a temporary table.

Was I wrong also in my assumption that ORDER BY and DISTINCT clauses usually create temp tables?

no, it depends on the purpose of the query

usually is such a loaded word :wink:

again, it depends on the query, and on the indexes that are available

my advice: don’t keep looking for generalities :slight_smile: