Query for searching in multiple tables – UNION or not?

atoroqo · April 29, 2013, 11:41am

I’m trying to create a query in order to search for a specific value in several different tables. I have a table of photos, another one with keywords and a third with people (both photographers and people in the pictures), and finally one where I keep the image captions with a full text index. The fastest solution I’ve come up with so far is this one:

SELECT various_columns_from_table_"photos"_and_some_columns_from_table_"people"
FROM captions
INNER JOIN photos
ON captions.id = photos.captionid
LEFT JOIN people ON photos.photografph_id = people.id
WHERE MATCH(caption) AGAINST(+'hometown')

UNION

SELECT various_columns_from_table_"photos"_and_some_columns_from_table_"people"
FROM photos
LEFT JOIN people AS photographers ON photos.photografph_id = photographers.id
LEFT JOIN keywords_photos as kp
ON kp.photoid = photos.id
LEFT JOIN keywords AS k
ON k.id = kp.wordid
LEFT JOIN people_photos AS pp
ON photos.id = pp.photoid
LEFT JOIN people as people
ON pp.personid = people.id
WHERE 'hometown'
IN(various_columns_in_the_tables_"photos"_"people"_and_"keywords")
GROUP BY photos.id

Does anyone see any obvious room for improvement or any errors I should correct? I’m not sure about all the LEFT JOINS, but it seems to be working as intended.
I’ve tried to join all the tables together in a single SELECT statement, but all my attempts so far has resulted in searches that take several seconds.

atoroqo · May 9, 2013, 5:52pm

No comments on this, which may be good or not.

I just got a new idea myself. What if I copy the data from all the columns in the photos, people and keywords tables into a new column in the captions table (excluding words/phrases already present in the caption column) and use both columns for this kind of general search? That way the query could be heavily simplified, much more flexible and hopefully faster:

SELECT various_columns_from_table_"photos"_and_some_columns_from_table_"people"
FROM captions
LEFT JOIN photos
ON captions.id = photos.captionid
LEFT JOIN people
ON photos.photografph_id = people.id
WHERE MATCH(caption,newColumn) AGAINST('blah blah blah' IN BOOLEAN MODE)

Well, I guess there’s only one way to find out … But it’s also nice to be supervised by the more experienced coders who are able to spot obvious mistakes.

rpkamp · May 9, 2013, 6:22pm

Copying data across several tables (known as denormalization) is generally not recommended unless the alternative is too slow. Mostly because it’s easy to lose sight over which data is the “original” data. I.e., if the values in the original table and the copy are different, which one is the “real” value?

Anyway, I suppose in this case it depends on how many entries you have in the tables. If it’s not that many I’d go with option 1. If option 1 is too slow, go with option 2, or get a dedicated search service like sphinx search – see http://www.ibm.com/developerworks/library/os-php-sphinxsearch/

atoroqo · May 9, 2013, 6:48pm

Thanks. Yes, I’ve had Sphinx recommended to me a couple of times now, and I guess I’ll have to check it out eventually. But using Sphinx would require switching web hotels, and that’s not up to me to decide. So for now I’m stuck with the indexing and search options provided by MySQL itself.

Option 1 takes about 4-5 seconds with a fairly simple search, so yes, I think some other solution is called for. And as long as I make sure that the extra column is updated automatically every time any of its source columns is changed, there shouldn’t be a problem? Or should there?

rpkamp · May 10, 2013, 3:39pm

No it shouldn’t be a problem. As long you remember that that data is a copy and should never be used as canonical data

atoroqo · May 11, 2013, 4:11pm

OK. I’ll make sure to comment the columns and document everything in case I get hit by a train and someone else takes over the project.

wwb_99 · May 16, 2013, 6:52pm

I wouldn’t be so down on denormalization – it really is how one makes modern apps go in lots of ways and is really powerful when you get your head around it. We are starting to make things that are fundamentally denormalized and then normalize via code when necessary.