Hey gents,
I have a problem making a query fast enough. So if you guys could take some time to read through this and think of ways to make it quicker, that would be much appreciated.
The setup
The user can specify several topics as their interests.
A question is tagged with many topics.
The goal is to display the questions to the user that are tagged with topics that are within his interests.
To make it a little more difficult, topics are laid out as a polyhierarchic tree structure. That means one topic can have several parent topics. We need to show the user all questions tagged with the topics in his interests and all of their children.
An example
Topic Tree structure:
Computers -> Programming -> Web -> PHP -> CakePHP
Computers -> Programming -> Web -> Frameworks -> CakePHP
I have added “Web” to my interests. Now I need to display all questions that are tagged with “Web”, “PHP”, “Frameworks” or “CakePHP”.
Tables
topics -> id, title
parent_topics -> id, parent_id, topic_id
interests -> id, user_id, topic_id // not parent_topic_id, as we are interested in a specific topic and do not care where it is in the tree
questions -> id, title
questions_topics -> id, question_id, topic_id
My solution so far
I have pre-computed and cached all topic children relationships in memory, so that is not a problem.
Given a user I know all of his interests and all of their children in a matter of a couple milliseconds.
Now having this (huge) list of topic_id’s I need to find the questions for them.
So what I do now is:
select question_id from questions_topics where topic_id IN ([huge_list_of_ids]);
This query is very slow, because the questions_topics table has 24k rows and the list of topic_ids that I feed to it can exceed several hundreds. If I reduce the number of topic_ids in the IN clause, the query runs in ~20ms as opposed to 800ms as it is now.
The next query would be:
select [fields] from questions where id IN ([found_question_ids]) AND deleted = 0 and published = 1 order by last_updated desc limit 0, 20;
That is also pretty slow as the list of question_ids is big. So if you guys have a suggestion for that as well, that would be great.
Ways to improve
-
I tried indeces on questions_topics.topic_id and (questions_topics.question_id, questions_topics.topic_id). Both make the query even slower (significantly even).
-
I could save all relationships of a question to its topics AND all of their children in the questions_topics table.
So when a question is tagged with “CakePHP”, I would have relations for it for “CakePHP”, “PHP”, “Frameworks”, “Web”, “Programming” and “Computers”. I could then make the query much faster by only doing:
select question_id from questions_topics where topic_id IN ([just_the_topics_ids_in_my_interests_without_children]);
Since the tree is quite big, this would mean several hundred relationships saved in the DB for each question, making the table quite big and stuffed with redundant data. Plus I’d have to recalculate all of these relationships when someone changes the tree structure. I don’t want to go there if not necessary …
-
Use something else altogether, like Redis?
-
Your idea here
Thank you for your time.