Efficient Chinese Search with Elasticsearch

Originally published at: http://www.sitepoint.com/efficient-chinese-search-elasticsearch/

If you have played with Elasticsearch, you already know that analysis and tokenization are the most important steps when indexing content. Without them, your relevance will be poor, your users unhappy and your results badly sorted.

Even with English content, you can lose relevance through bad stemming, miss documents when elision is not handled properly, and so on. It's worse if you are indexing another language: the default analyzers are not all-purpose.

When dealing with Chinese documents, everything is even more complex, even if you consider only Mandarin, the official language of China and the most widely spoken language in the world. Let's dig into Chinese content tokenization and look at the best ways of doing it with Elasticsearch.

What is so hard about Chinese search?

Chinese characters are logograms: each one represents a word or a morpheme (the smallest meaningful unit of language). Put together, their meaning can change and represent a whole new word. Another difficulty is that there are no spaces between words or sentences, which makes it very hard for a computer to know where a word starts or ends.

There are tens of thousands of Chinese characters, even if, in practice, written Chinese requires knowledge of only three to four thousand. Let's look at an example: the word “volcano” (火山) is in fact the combination of:

  • 火: fire
  • 山: mountain

Our tokenizer must be clever enough to avoid separating these two logograms, because the meaning changes when they are not together.
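To make this concrete, here is a minimal sketch (assuming a local Elasticsearch node, the official Python client, and, for the second call, the optional analysis-smartcn plugin) comparing how the default standard analyzer and a dictionary-based one tokenize 火山:

```python
from elasticsearch import Elasticsearch

# Assumes Elasticsearch is running locally on the default port.
es = Elasticsearch("http://localhost:9200")

# The standard analyzer emits each Chinese character as its own token,
# so 火山 ("volcano") is split into 火 ("fire") and 山 ("mountain").
resp = es.indices.analyze(body={"analyzer": "standard", "text": "火山"})
print([t["token"] for t in resp["tokens"]])  # ['火', '山']

# A dictionary-based analyzer such as smartcn (from the optional
# analysis-smartcn plugin) knows 火山 is a single word and keeps it whole.
resp = es.indices.analyze(body={"analyzer": "smartcn", "text": "火山"})
print([t["token"] for t in resp["tokens"]])  # expected: ['火山']
```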

Continue reading this article on SitePoint

Wow, you must have done a lot of programming for Chinese search!

Hey cool, I remember this example from BasisTech http://www.basistech.com/text-analytics/rosette/base-linguistics/
(under tokenisation, where the word “student” erroneously appears in the text “Beijing University Biology Department”, which I assumed happened through improper bi-gramming).
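That failure mode is easy to reproduce with Elasticsearch's built-in cjk analyzer, which blindly emits overlapping bigrams. A sketch, assuming a local node and the official Python client: 北京大学生物系 (“Beijing University Biology Department”) contains the characters 大学生 (“university student”), so one of its bigrams is 学生 (“student”).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The built-in cjk analyzer turns a run of Han characters into
# overlapping bigrams, with no notion of actual word boundaries.
resp = es.indices.analyze(
    body={"analyzer": "cjk", "text": "北京大学生物系"}
)
print([t["token"] for t in resp["tokens"]])
# Expected: ['北京', '京大', '大学', '学生', '生物', '物系']
# 学生 ("student") appears even though no student is mentioned,
# so a search for "student" would match this text.
```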

When learning about tokenisation in Sphinx, I ran across BasisTech. Their plugin for CJK languages is the only option I've heard of so far for Sphinx, and apparently they also work with ES. However, unlike most ES or Sphinx plugins, BasisTech Rosette is proprietary.


Sorry, but I don’t even recognize that as being a word. Seems more like a “making text search find glyphs representing concepts” thing to me.

flyGOmachines!! (airplanes)

Mountainsky is that stuff they mine in places like Colorado and sell on nature calendars. It’s real pretty. Compare with plainssky, which they don’t mine in Kansas because nobody wants it.

The case of “student” appearing in the text “Beijing University Biology Department” is entirely plausible: the string 北京大学生物系 contains 大学生 (“university student”) as a substring. But normally, such mistakes should be avoided.
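A dictionary-based segmenter avoids this particular trap. A minimal sketch, again assuming a local node with the optional analysis-smartcn plugin installed (the exact segmentation depends on the plugin's dictionary and version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# smartcn segments along dictionary words instead of blind bigrams,
# so the spurious 学生 ("student") token should not be produced.
resp = es.indices.analyze(
    body={"analyzer": "smartcn", "text": "北京大学生物系"}
)
print([t["token"] for t in resp["tokens"]])
# e.g. ['北京', '大学', '生物', '系']; the exact output varies with
# the plugin version, but 学生 should not be among the tokens.
```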
