How to optimize this query

Hey gents,

I have a problem making a query fast enough. So if you guys could take some time to read through this and think of ways to make it quicker, that would be much appreciated.

The setup

The user can specify several topics as their interests.
A question is tagged with many topics.

The goal is to display the questions to the user that are tagged with topics that are within his interests.

To make it a little more difficult, topics are laid out as a polyhierarchic tree structure. That means one topic can have several parent topics. We need to show the user all questions tagged with the topics in his interests and all of their children.

An example

Topic Tree structure:

Computers -> Programming -> Web -> PHP -> CakePHP
Computers -> Programming -> Web -> Frameworks -> CakePHP

I have added “Web” to my interests. Now I need to display all questions that are tagged with “Web”, “PHP”, “Frameworks” or “CakePHP”.

Tables

topics -> id, title
parent_topics -> id, parent_id, topic_id
interests -> id, user_id, topic_id // not parent_topic_id, as we are interested in a specific topic and do not care where it is in the tree

questions -> id, title
questions_topics -> id, question_id, topic_id

My solution so far

I have pre-computed and cached all topic children relationships in memory, so that is not a problem.
Given a user I know all of his interests and all of their children in a matter of a couple milliseconds.

Now having this (huge) list of topic_id’s I need to find the questions for them.

So what I do now is:

select question_id from questions_topics where topic_id IN ([huge_list_of_ids]);

This query is very slow, because the questions_topics table has 24k rows and the list of topic_ids that I feed to it can exceed several hundreds. If I reduce the number of topic_ids in the IN clause, the query runs in ~20ms as opposed to 800ms as it is now.

The next query would be:

select [fields] from questions where id IN ([found_question_ids]) AND deleted = 0 and published = 1 order by last_updated desc limit 0, 20;

That is also pretty slow as the list of question_ids is big. So if you guys have a suggestion for that as well, that would be great.

Ways to improve

  1. I tried indeces on questions_topics.topic_id and (questions_topics.question_id, questions_topics.topic_id). Both make the query even slower (significantly even).

  2. I could save all relationships of a question to its topics AND all of their children in the questions_topics table.

So when a question is tagged with “CakePHP”, I would have relations for it for “CakePHP”, “PHP”, “Frameworks”, “Web”, “Programming” and “Computers”. I could then make the query much faster by only doing:

select question_id from questions_topics where topic_id IN ([just_the_topics_ids_in_my_interests_without_children]);

Since the tree is quite big, this would mean several hundred relationships saved in the DB for each question, making the table quite big and stuffed with redundant data. Plus I’d have to recalculate all of these relationships when someone changes the tree structure. I don’t want to go there if not necessary …

  1. Use something else altogether, like Redis?

  2. Your idea here

Thank you for your time.

huh? that’s counter-intuitive for sure

please do an EXPLAIN on the query, and a SHOW CREATE TABLE on the table

i’m assuming that despite the fact that you chose to post in the Databases forum rather than the MySQL forum, this is still a mysql problem, as i see you are using LIMIT

Here you go:


EXPLAIN SELECT `QuestionTopic`.`question_id`
FROM `questions_topics` AS `QuestionTopic`
WHERE `QuestionTopic`.`topic_id`
IN (
'4ea7b132-6948-4fc7-a7ac-c5634352af75', '4ea7b135-dc44-40df-bbeb-c5634352af75', '4ea7bd3d-b534-4f57-9b34-c5634352af75', '4ea7b3dc-6e8c-4c28-b05c-c5634352af75', '4ea7b3dc-263c-429e-97c0-c5634352af75', '4ea7bd3d-4d90-4c93-abe1-c5634352af75', '4ea7bd4a-639c-4780-80b6-c5634352af75'
)
GROUP BY question_id


id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	QuestionTopic 	range 	topic_id 	topic_id 	108 	NULL 	70 	Using where; Using temporary; Using filesort


CREATE TABLE `questions_topics` (
 `id` char(36) NOT NULL,
 `question_id` char(36) NOT NULL,
 `topic_id` char(36) NOT NULL,
 `latest_update` datetime NOT NULL,
 `created` datetime NOT NULL,
 PRIMARY KEY (`id`),
 KEY `question_id` (`question_id`,`topic_id`),
 KEY `topic_id` (`topic_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Sorry for not posting this in the mysql forum. : /

okay, the EXPLAIN looks logical

now please run this and try your EXPLAIN again –


ALTER TABLE questions_topics
ADD INDEX ( topic_id , question_id )

a many-to-many relationship table like this should always be declared ~without~ its own auto_increment or other surrogate key, but with one of the two two-column indexes declared as the PK, and the other one, with the two columns reversed, as an index

example…

CREATE TABLE questions_topics 
( question_id CHAR(36) NOT NULL
, topic_id CHAR(36) NOT NULL
, latest_update DATETIME NOT NULL
, created DATETIME NOT NULL
, PRIMARY KEY ( question_id , topic_id )
, INDEX ( topic_id , question_id )
) ENGINE=InnoDB DEFAULT CHARSET=utf8

both the PK and the second index are a covering index and they will allow lookups in either direction to be resolved without accessing the table at all!


id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	QuestionTopic 	range 	topic_id,topic_id_2 	topic_id_2 	108 	NULL 	70 	Using where; Using index; Using temporary; Using f...

topic_id_2 is that compound index.

Query went up from 250ms to 480ms, as I have tried before.

Regarding changed CREATE TABLE … will try that. Although the framework I am using does not support compound fields as primary keys for the Active Record magic. Will try though.

Thanks for your time.

weird

it’s the GROUP BY that’s causing this: “Using temporary; Using filesort”

you aren’t returning a bazillion rows here, so you could remove the dupes in php, right?

still, less than a second is pretty good response…

Changing the table layout brings the query down to 380ms, as opposed to my initial 250ms. Any more ideas?

Removing the group by brings the query down from 250ms to 170ms, so that def. helps. Good suggestion.

Do you suggest I should just live with it for now?

How about that second query that fetches the actual questions with their id. That is at 180ms. Do you think this can further optimized?

The query is


SELECT `Question`.`title`, `Question`.`slug`, `Question`.`body`, `Question`.`question_follow_count`, `Question`.`urgent`, `Question`.`answer_count`, `User`.`name`, `User`.`title`, `User`.`id`, `Question`.`id` FROM `questions` AS `Question` LEFT JOIN `users` AS `User` ON (`Question`.`user_id` = `User`.`id`) LEFT JOIN `polls` AS `Poll` ON (`Poll`.`foreign_id` = `Question`.`id`) WHERE `Question`.`id` IN ('4eb1076b-5790-4716-8298-dd4d4352af75', '4eb1076c-d500-4f9e-8be4-dd4d4352af75') AND `Question`.`deleted` <> 1 AND `Question`.`published` = 1 ORDER BY `Question`.`latest_update` desc LIMIT 10;

Adding an index over id, deleted, published, latest_update brings the response time from 180ms to 430ms.

ach du lieber, that is some ugly sql you have there

SELECT Question.title
     , Question.slug
     , Question.body
     , Question.question_follow_count
     , Question.urgent
     , Question.answer_count
     , User.name
     , User.title
     , User.id
     , Question.id 
  FROM questions AS Question 
LEFT 
  JOIN users AS User 
    ON (Question.user_id = User.id) 
LEFT 
  JOIN polls AS Poll 
    ON (Poll.foreign_id = Question.id) 
 WHERE Question.id IN 
       ( '4eb1076b-5790-4716-8298-dd4d4352af75'
       , '4eb1076c-d500-4f9e-8be4-dd4d4352af75' ) 
   AND Question.deleted <> 1 
   AND Question.published = 1 
ORDER 
    BY Question.latest_update desc LIMIT 10;

you said you added an index on ( id, deleted, published, latest_update ) and that showed that you are definitely on top of indexing strategies, because those are all the columns referenced in the WHERE clause

i really cannot explain why your response time is increasing

as for the query itself, why are you joining to the polls table? you aren’t using any of its columns for anything…

This is how CakePHP maps a hasOne relation. It does the join and then has a second query that fetches the poll data


SELECT `Poll`.`id`, `Poll`.`foreign_id`, `Poll`.`question`, `Poll`.`created`, `Poll`.`updated` FROM `polls` AS `Poll` WHERE `Poll`.`foreign_id` = '4eb113d1-6a68-447f-98f5-dd4d4352af75' 

Not elegant I know, but oh well.

Anyway, thanks for your help. I think this is sufficient for now. I can always add some caching strategies to cache the questions for specific parent topic combinations.

As for the uglyness of the sql, that is CakePHP’s output. I didn’t bother to format it beautifully. Sorry about that.

but the join is totally useless !!

sigh

oh those silly framework engineers

:smiley:

:stuck_out_tongue:

I am having trouble finding a proper index for the following query:


EXPLAIN SELECT 
`Feed`.`id`, `Feed`.`type`, `Feed`.`answer_id`,
`Question`.`title`, `Question`.`slug` 
FROM `feeds` AS `Feed` 
LEFT JOIN `questions` AS `Question` ON (`Feed`.`question_id` = `Question`.`id`) 
WHERE `Feed`.`type` = 'new_answer' AND 
`Feed`.`topic_id` IN ('4ea7b2c0-e840-4aec-8a2d-c5634352af75', '4ea7b2c8-9d84-4771-9400-c5634352af75') AND 
`Question`.`answer_count` > 0 
GROUP BY `Feed`.`question_id` 
ORDER BY `Feed`.`created` desc LIMIT 10;

I tried an index on type,topic_id,question_id,created and one on type,topic_id,created.
But EXPLAIN won’t tell me it’s using the index. What am I missing? : /

Thank you for your time.

that query is seriously b0rked

start by removing the GROUP BY clause

Tried that. Also does not use an INDEX.

can’t help you, man, sorry

can’t figure your query out, it is still logically b0rked

Hehe well okay, I have rewritten the logic and have a completely different query now. So all good.

Thx anyway.