Originally published at: http://www.sitepoint.com/efficient-chinese-search-elasticsearch/
If you have played with Elasticsearch, you already know that analysis and tokenization are the most important steps when indexing content. Without them, relevance suffers: your results are poorly ranked and your users are unhappy.
Even with English content, you can lose relevance through bad stemming, miss documents when elision is not handled properly, and so on. It gets worse when you index another language; the default analyzers are not all-purpose.
When dealing with Chinese documents, everything is even more complex, even if we consider only Mandarin, the official language of China and the most spoken language worldwide. Let’s dig into Chinese content tokenization and look at the best ways of doing it with Elasticsearch.
What is so hard about Chinese search?
Chinese characters are logograms: each one represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change and form a whole new word. Another difficulty is that there are no spaces between words or sentences, making it very hard for a computer to know where a word starts or ends.
There are tens of thousands of Chinese characters, although in practice reading and writing Chinese requires knowing between three and four thousand of them. Let’s look at an example: the word “volcano” (火山) is in fact the combination of:
- 火: fire
- 山: mountain
Our tokenizer must be clever enough to avoid separating those two logograms, because their meaning changes when they are split apart.
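To make the problem concrete, here is a minimal Python sketch comparing a naive one-token-per-character split with a dictionary-based "forward maximum matching" segmenter. It is only an illustration of the idea and not how any Elasticsearch analyzer is implemented. The tiny `DICTIONARY`, the function names, and the sample phrase 火山爆发 ("volcanic eruption") are assumptions made up for this example.

```python
# Toy dictionary: hypothetical and far too small for real use.
DICTIONARY = {"火山", "火", "山", "爆发"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def split_per_character(text):
    """Naive approach: every character becomes its own token."""
    return list(text)


def forward_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Greedy segmentation: at each position, take the longest dictionary word.

    Falls back to a single character when nothing in the dictionary matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(split_per_character("火山爆发"))  # ['火', '山', '爆', '发']
print(forward_max_match("火山爆发"))    # ['火山', '爆发']
```

With the per-character split, a search for 火 (fire) would match a document about volcanoes; the dictionary-based version keeps 火山 together as a single token. Greedy matching is only one possible strategy and depends entirely on the quality of its dictionary.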