Table of Contents
mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, and UTF-8.
indexer recodes all documents to the character set specified in the "LocalCharset" indexer.conf command. Internally, recoding is implemented using Unicode. Please note that some recoding may lose data. For example, recoding between any Greek and Russian charsets loses all national characters. This does not matter for single-language sites. If you want to build a multilingual search engine, use the UTF-8 character set as LocalCharset.
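The data loss described above can be reproduced outside mnoGoSearch. The following is a minimal Python sketch, not the indexer's own code: the `recode` helper is hypothetical, and substituting "?" for unmappable characters is Python's behaviour with `errors="replace"`, assumed here only for illustration.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Recode bytes from one charset to another via Unicode.

    Characters missing from the target charset are replaced with b'?'
    (Python's errors="replace"; an assumption made for this sketch).
    """
    return data.decode(source).encode(target, errors="replace")


# The Greek word for "Athens", written with escapes to keep this file ASCII.
greek = "\u0391\u03b8\u03ae\u03bd\u03b1".encode("iso8859_7")

# A Russian charset has no Greek letters, so every character is lost.
print(recode(greek, "iso8859_7", "koi8_r"))

# UTF-8 can represent every Unicode character, so nothing is lost.
print(recode(greek, "iso8859_7", "utf_8").decode("utf_8"))
```

The first call loses every letter, while the UTF-8 round trip preserves the word, which is why UTF-8 is the safe LocalCharset for mixed-language collections.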
You may use the BrowserCharset command to choose the charset used to display search results. BrowserCharset may differ from LocalCharset.
Each charset is recognized by a number of its aliases, because web servers can return the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same charset. mnoGoSearch understands a set of such charset name aliases.
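Alias resolution of this kind can be illustrated with Python's standard codec registry. This sketch uses Python's own alias table, which is not the alias list mnoGoSearch ships with; it only shows the idea of mapping several notations to one canonical charset.

```python
import codecs

# Three notations a web server might send for the same charset.
names = ["iso-8859-2", "iso8859-2", "latin2"]

# codecs.lookup() normalizes the spelling and resolves known aliases,
# so each notation maps to one canonical codec name.
canonical = {name: codecs.lookup(name).name for name in names}

for alias, target in canonical.items():
    print(f"{alias} -> {target}")
```

All three notations resolve to the same canonical name, so documents labelled with any of them can be handled by one recoding table.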
Since 3.2.0 mnoGoSearch has an automatic charset and language guesser. It currently recognizes more than 70 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. They are installed in the /usr/local/mnogosearch/etc/lm/ directory by default. Take a look there to check the list of currently provided charset-language pairs. The guesser works well on texts longer than 500 characters. Shorter texts may not be guessed well.
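The general idea behind N-gram-based text categorization can be sketched in a few lines of Python. This is an illustration of the technique under stated assumptions, not mnoGoSearch's implementation: the function names, the in-memory "language maps", the profile size of 300 and the tiny training sample are all hypothetical, and the real language map file format is not shown here. Each language-charset pair gets a ranked list of its most frequent byte n-grams, and a document is assigned to the pair whose ranking differs least from its own.

```python
from collections import Counter

TOP = 300  # profile size, also used as the penalty for an unseen n-gram


def profile(data: bytes, max_n: int = 3) -> dict:
    """Rank the most frequent byte n-grams (n = 1..max_n) found in data."""
    counts = Counter()
    for word in data.split():
        padded = b"_" + word + b"_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = counts.most_common(TOP)
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def distance(doc: dict, lang_map: dict) -> int:
    """Sum of rank differences; n-grams absent from lang_map cost TOP."""
    return sum(
        abs(rank - lang_map[gram]) if gram in lang_map else TOP
        for gram, rank in doc.items()
    )


def guess(data: bytes, lang_maps: dict) -> str:
    """Return the name of the language-charset pair closest to data."""
    doc = profile(data)
    return min(lang_maps, key=lambda name: distance(doc, lang_maps[name]))


# Hypothetical training text; real language maps are built from far more data.
SAMPLE = "в лесу родилась елочка в лесу она росла зимой и летом стройная зеленая была"
LANG_MAPS = {
    "ru.koi8-r": profile(SAMPLE.encode("koi8_r")),
    "ru.cp1251": profile(SAMPLE.encode("cp1251")),
}

document = "она росла в лесу".encode("cp1251")
print(guess(document, LANG_MAPS))
```

Because the same Russian letters map to different byte values in koi8-r and cp1251, the byte n-grams of a document match only the profile built in its own charset. With short inputs only a few n-grams are available to compare, which is consistent with the note above that short texts may not be guessed well.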