Handling plural keywords using FULLTEXT

Hi,

A client has complained to me that her search doesn’t handle plural and singular keywords interchangeably. I have a single MySQL FULLTEXT column on her products table that contains all searchable content.

There is a chair product on there and if you search chairs, nothing comes up, you have to search just chair. (Note the database is small with only about 300 products). What I have done, which seems to work, is when a search term is entered I create a plural and singular of it and search for them both.

Examples:

‘chair’ -> ‘chair chairs’
‘chairs’ -> ‘chairs chair’
‘sony playstation’ -> ‘sony playstation sony playstations’

The bits in single quotes on the right would appear in the MATCH part of the query.

It seems to work although I accept that it isn’t always grammatically correct. E.g.

‘coat for children’ -> ‘coat for childrens’

Are there likely to be any issues in doing this? I looked into using the BOOLEAN wildcard on the right ‘chair*’ but that’s too inaccurate.

As an aside, on a slightly larger database I have a similar column with FULLTEXT. Not all the records are searchable though. The ones that aren’t have a bool column issearchable and the FULLTEXT column is NULL. What do you think is more scalable? Using WHERE issearchable = ‘1’ or just deleting the issearchable column and letting MySQL skip over the null FULLTEXT coulmn. Right now, the table isn’t big enough to make a noticeable difference when benchmarking. I’m thinking seeing as it’s indexed it will be faster for MySQL to skip over null columns rather than first work out which ones are issearchable. Since it’s only got a cardinality of 2 I can’t index it.

Interesting problem.

I think I would try using queries with wildcard even though it would likely return some false positives

A problem would be the different ways plurality is expressed. eg;
chair - chairs
bench - benches
child - children
etc.

And even if you did something like word% or word___
you could still miss words that instead of having suffixes have different spelling eg.
parenthesis - parentheses
man - men

You could actually split the query string into separate words and do some more intelligent expansion of words - in your scripting language (e.g. php) you could detect whether each word is plural or singular and add the proper new word to the query. You could even build a list of irregular nouns so that they are handled like they should.

Personally, I’ve never had much luck with mysql FULLTEXT search - very little customization and weird searching mechanisms, I find it never behaves like I want it to. I usually do searches by using a combination of LIKE and REGEXP clauses that are prepared by my php script - much better results and I have more control of how the search is actually performed. Sure, it’s slower than FULLTEXT but the speed is pretty good for any table with 10000 records or less (and most often online shops don’t exceed this number).

Sorry for the late reply.

Language is a tricky thing…I have just done a bespoke multi-lingual CMS/site and while it wasn’t too complicated it really makes the development time longer as you have to separate every bit of text from the HTML templates.

@Mittineague I know what you mean. Having said that, my idea above probably works about 85% of the time so maybe that’s what I should use for now.

@lemonjuice I agree. FULLTEXT is great for a really quick solution, it’s fast and does work quite well generally and is better than LIKE on its own. However, it’s there’s not much you can do to tweak it. I’m intrigued as to your solution but I’m hopeless with REGEX.

What do you think is better (performance-wise) out of these—or doesn’t it really matter? If you run EXPLAIN it just says it’s using the FULLTEXT index.

SELECT * FROM pages WHERE MATCH(searchtext) AGAINST (‘$keywords’) AND searchtext != ‘’;
SELECT * FROM pages WHERE MATCH(searchtext) AGAINST (‘$keywords’) AND searchtext != NULL;
SELECT * FROM pages WHERE MATCH(searchtext) AGAINST (‘$keywords’) AND issearchable = 1;

Finally, I have just done a site that has a single table that stores the FULLTEXT index for the entire site (pages, blogs, faqs, etc) so I can use InnoDB foreign keys elsewhere. If all my tables were MyISAM would it better practice to let pages, blogs, faqs, etc have their own FULLTEXT index and do a UNION query?

Thanks.

Generally, the regex you would most often need in this case is this:

SELECT * FROM table WHERE text REGEXP '[[:<:]]word[[:>:]]';

This will find rows with word - but not rows with words or sword. This is very basic search and moreover there’s a problem with any non-English accented characters because REGEXP works properly only in the basic latin character set. So what I do is I set up another table that hold data specifically for search - a sort of search index that is updated whenever necessary, for example (searching for products):


CREATE TABLE `search_index` (
	`prod_id` MEDIUMINT(8) UNSIGNED NOT NULL,
	`name` VARCHAR(1000) NOT NULL,
	`text_data` MEDIUMTEXT NOT NULL,
	PRIMARY KEY (`prod_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;

As you can see I have two text fields here - the reason is that I want words found in the product name to be more important than those in the product description and I can use this distinction later. text_data will contain product descriptions, specifications, and all other data that I want to be searchable - simply concatenated into one long string. All this data is inserted with any accented characters converted to their Latin equivalents and any unimportant special characters (like punctuation) are omitted.

When performing the actual search, first I put all my results into a temporary table. I do it because this way I can easily influence the importance (relevance) of different types of matches - in this case name or text_data:


CREATE TEMPORARY TABLE `found_pages_tmp` (
	`prod_id` MEDIUMINT(8) UNSIGNED NOT NULL,
	`relevance` TINYINT(3) NOT NULL,
	PRIMARY KEY (`prod_id`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM

And then I search my product table for sony playstation. First, I’m searching product names and I give these searches highest relevance of 8:


INSERT INTO found_pages_tmp (prod_id, relevance)
	SELECT prod_id, 8 FROM search_index
	WHERE
	 name REGEXP '[[:&lt;:]]sony[[:&gt;:]]' AND (name REGEXP '[[:&lt;:]]playstation[[:&gt;:]]' OR name REGEXP '[[:&lt;:]]playstations[[:&gt;:]]');

The REGEXP part has been created automatically by my PHP script after splitting the query into words, stripping accents and punctuation and taking into account plural expansions, etc.

Then I add to this search results within other text data, they will get the relevance of 1 and will add up to the relevance of any previously found results:


INSERT INTO found_pages_tmp (prod_id, relevance)
	SELECT prod_id, 1 FROM search_index
	WHERE text_data REGEXP '[[:&lt;:]]sony[[:&gt;:]]' AND (name REGEXP '[[:&lt;:]]playstation[[:&gt;:]]' OR name REGEXP '[[:&lt;:]]playstations[[:&gt;:]]')

        ON DUPLICATE KEY UPDATE
	found_pages_tmp.relevance=found_pages_tmp.relevance+1

As you can see a match in the product name is 8 times more important than a match in the product description.

And finally I read the search results for display:


SELECT * FROM found_pages_tmp
    ORDER BY relevance DESC;

This is of course just a simplified version because normally I join the results with the product table and add product name or price as the secondary ORDER BY column. And often I first insert my results into a search cache table so that the search doesn’t have to be repeated if the user navigates across different page numbers of a large result set or repeats the same search within a short period of time. But these details are not the main subject here.

I have used this technique on a few sites and the results are quite good. I can tweak this idea in different ways to get closer to how I want the search to be performed - this is not perfect and quite far from advanced algorithms used by google search but still much better than plain LIKE or FULLTEXT.

(FULLTEXT will not find words that are shorter than 4 or 5 characters - depending on how the server is set up - and if more than 50% of rows are found then you get an empty result - quite unbelievable to me how someone thought it to be a reasonable idea! BTW, I suspect mysql.com site uses FULLTEXT search for their site and I always find it hard to use their search engine - the results are weird and often far from what I intended)

It looks like MySQL 5.6 supports FULLTEXT on Innodb but it behaves somewhat differently than in MyIsam - to me this looks messy…

Wow, thanks Lemonjuice, I’ve bookmarked your post.

By the way, do you know if it’s possible to combine a FULLTEXT query with a LIKE ‘% %’ query and still order by relevancy?

From the little I have read on the subject, I was led to believe the two are mutually exclusive.

That wouldn’t surprise me as I don’t think LIKE can generate a relevancy score. However, if anyone knows otherwise that would to good to know.

LIKE has no concept of relevency – you either match or you don’t.

For fulltext many engines will support handling of plurals pretty transparently. Really depends on what you are using.

I don’t have any seat time with MySql fulltext but in general you’d be fine just leaving null values in the searchable fields and not filtering – you can’t find relevent results in NULL can you?

Off Topic:

So, I guess, regarding this entire thread; if I LIKE it, then it is irrelevant?

Yeah, so I guess you can’t really mix them.

I don’t think FULLTEXT does handle plurals though…

Like I said, I don’t have much seat time with MySql’s implementation but many others handle plurals pretty well. One easy thing to do would be externalize the search to Apache Solr – a HTTP wrapper for lucene – which certainly can handle plurals and all other manner of stemming depending on what stemmer you choose to configure.