lunr-languages (non-English search support)

lunr-languages provides plugins that extend lunr to non-English languages.

The lunr-languages project provides lunr plugins that supply specialized stemming and tokenizing for non-English languages. The offline packager uses these plugins to index non-English documents in the package. Some usage information is available on the lunr project home page. The lunr-languages GitHub project has been forked into the TitaniaSoftware GitHub area. The Titania fork has a plugin for Simplified Chinese.

Using a lunr-languages plugin can be as simple as calling builder.use(plugin) on a lunr.Builder instance. That call configures the builder to use the stemmer, trimmer, and stop word filter provided by the plugin. Alternatively, the builder can be configured manually to use the individual components provided by the plugin.
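The manual alternative can be sketched roughly as follows. This is an illustration under stated assumptions, not lunr's own code: BuilderLike, PipelineLike, and LanguageComponents are hypothetical minimal shapes written for this example, the token type is simplified to a string, and the property names trimmer, stopWordFilter, and stemmer are assumed to match what a given plugin exposes.

```typescript
// Hypothetical minimal shapes for this sketch; these are not lunr's real type definitions.
type PipelineFunction = (token: string) => string | undefined;

interface PipelineLike {
  reset(): void;                          // remove every function from the pipeline
  add(...fns: PipelineFunction[]): void;  // append functions in order
}

interface BuilderLike {
  pipeline: PipelineLike;        // applied to document text at index time
  searchPipeline: PipelineLike;  // applied to query terms at search time
}

// Assumed property names for the components a language plugin exposes.
interface LanguageComponents {
  trimmer: PipelineFunction;
  stopWordFilter: PipelineFunction;
  stemmer: PipelineFunction;
}

// Wire a plugin's components into a builder by hand instead of calling builder.use(plugin).
function configureManually(builder: BuilderLike, lang: LanguageComponents): void {
  builder.pipeline.reset();
  builder.pipeline.add(lang.trimmer, lang.stopWordFilter, lang.stemmer);
  // Query terms need the same stemming as indexed terms, or they will not match.
  builder.searchPipeline.reset();
  builder.searchPipeline.add(lang.stemmer);
}
```

Configuring the pipelines by hand makes it possible to leave out or reorder a component, for example to skip the stop word filter for a particular field of documents.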

It is possible to index several languages in one index. However, experience with indexing multiple languages for the offline packager shows that a separate index for each language provides a better search experience. All of the indexes can be written out in one JSON structure. The browser search function must then search each index and merge the results.
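The search-and-merge step might look something like the sketch below. It is not the offline packager's actual code: SearchableIndex, SearchResult, MergedResult, and searchAllLanguages are hypothetical names introduced for illustration, and the sketch assumes that each index returns hits carrying a ref and a score, and that scores from different indexes are comparable enough to sort on directly.

```typescript
// Hypothetical shapes for this sketch; real result objects may carry more fields.
interface SearchResult {
  ref: string;    // document identifier
  score: number;  // relevance score; higher is better
}

interface SearchableIndex {
  search(query: string): SearchResult[];
}

interface MergedResult extends SearchResult {
  language: string; // key of the per-language index that produced the hit
}

// Query every per-language index and merge the hits into one ranked list.
function searchAllLanguages(
  indexes: Record<string, SearchableIndex>,
  query: string
): MergedResult[] {
  const merged: MergedResult[] = [];
  for (const language of Object.keys(indexes)) {
    for (const hit of indexes[language].search(query)) {
      merged.push({ ref: hit.ref, score: hit.score, language });
    }
  }
  // Assumption: scores are comparable across indexes, so a plain sort is acceptable.
  return merged.sort((a, b) => b.score - a.score);
}
```

If scores turn out not to be comparable between languages, one option is to normalize them per index before sorting, or to present the results grouped by language instead.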

The lunr-languages project includes tools and procedures for writing plugins for additional languages. Indexing documents in CJK (Chinese, Japanese, Korean) languages, or in other languages that are typically written without clear word separators (such as Thai), is particularly challenging. One approach to this problem is to provide a morphological segmenter in place of an ordinary tokenizer. The Japanese and Thai lunr-languages plugins use this approach. TitaniaSoftware added a Simplified Chinese plugin based on another open-source segmenter, Rakuten MA. These segmenters can slow down indexing, and because they are based on machine learning, their output is usually not completely accurate.
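To make the idea of replacing a tokenizer with a segmenter concrete, here is a deliberately simple stand-in. It is not the technique used by the plugins mentioned above, which rely on trained models; it is a dictionary-based forward maximum matching segmenter, and the function name segment, the dictionary contents, and the maxWordLength parameter are all hypothetical choices for this illustration.

```typescript
// Forward maximum matching: at each position, take the longest dictionary word
// that starts there; fall back to a single character when nothing matches.
function segment(text: string, dictionary: Set<string>, maxWordLength = 4): string[] {
  const chars = Array.from(text); // split by code point rather than UTF-16 unit
  const tokens: string[] = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    let length = 1;
    // Try the longest candidate first and shrink until a dictionary word is found.
    for (let len = Math.min(maxWordLength, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dictionary.has(candidate)) {
        matched = candidate;
        length = len;
        break;
      }
    }
    // Whitespace carries no searchable content, so it is dropped.
    if (matched.trim() !== "") {
      tokens.push(matched);
    }
    i += length;
  }
  return tokens;
}

// Example with a tiny hypothetical dictionary.
const words = new Set(["喜欢", "搜索", "引擎", "搜索引擎"]);
console.log(segment("我喜欢搜索引擎", words)); // logs [ '我', '喜欢', '搜索引擎' ]
```

A whitespace tokenizer would return the whole sentence as one token, which is why a segmenter is needed at all. The greedy strategy shown here also illustrates why segmentation can be wrong: it always prefers the longest match, even where a shorter split would be the correct reading.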