Efficient Chinese Search with Elasticsearch

Stomme_poes · December 23, 2014, 6:56pm

Hey cool, I remember this example from BasisTech http://www.basistech.com/text-analytics/rosette/base-linguistics/
(under tokenisation, where the word “student” erroneously appears in the text “Beijing University Biology Department”; I assumed that was caused by improper bigramming).
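
For anyone wondering how “student” can sneak in, here is a minimal sketch of naive character bigramming in Python. The Chinese string is my own assumed rendering of “Beijing University Biology Department” (the post above only has the English gloss), and `char_bigrams` is a hypothetical helper, not the actual tokenizer of ES, Sphinx, or Rosette.

```python
def char_bigrams(text):
    """Split text into overlapping two-character tokens (naive bigramming)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Assumed Chinese for "Beijing University Biology Department".
phrase = "北京大学生物系"

tokens = char_bigrams(phrase)
print(tokens)

# "学生" means "student". It straddles the boundary between
# "大学" (university) and "生物" (biology), so it is not a real word here,
# yet the bigrammer emits it anyway.
print("学生" in tokens)
```

A query for “student” would then match this phrase even though no student is mentioned, which is the kind of false hit a dictionary-based or statistical segmenter is supposed to avoid.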

When learning about tokenisation in Sphinx, I ran across BasisTech. Their plugin for CJK languages is the only option I’ve heard of so far for Sphinx, and apparently it also works with ES. However, unlike most ES or Sphinx plugins, BasisTech Rosette is proprietary.