Table of Contents

. Character sets
. Supported character sets
. Recoding
. Recoding at search time
. Character set aliases
. Document charset detection
. Automatic charset guesser
. Default charset
. Default Language

Chapter . Character sets

. Supported character sets

mnoGoSearch supports almost all known 8-bit character sets, as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, and utf8.

. Recoding

The indexer recodes all documents to the character set specified by the "LocalCharset" indexer.conf command. Internally, recoding is implemented using Unicode. Please note that some recodings may lose data. For example, recoding between any Greek and any Russian charset loses all national characters. This does not matter for single-language sites. If you want to build a multi-lingual search engine, use the UTF8 character set as the LocalCharset.
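The data loss described above can be illustrated with a short Python sketch. It is not mnoGoSearch code: Python's built-in codecs stand in for the indexer's recoding tables, the helper name `recode` is hypothetical, and replacing unmappable characters with "?" is an assumption about how the loss shows up.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Recode bytes from one charset to another by going through Unicode.

    Characters that the target charset cannot represent are replaced
    with '?' (an assumption made for this illustration).
    """
    return data.decode(source).encode(target, errors="replace")


# Three Greek letters stored in a Greek 8-bit charset.
greek = "αβγ".encode("iso-8859-7")

# A Russian 8-bit charset has no Greek letters, so the text is lost.
lossy = recode(greek, "iso-8859-7", "koi8-r")

# UTF-8 can represent every Unicode character, so nothing is lost.
safe = recode(greek, "iso-8859-7", "utf-8")

print(lossy)
print(safe.decode("utf-8"))
```

This is why a multi-lingual index is better served by a UTF-8 local charset: every source charset can be recoded into it without dropping characters.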

. Recoding at search time

You may use the BrowserCharset command to choose the charset used to display search results. BrowserCharset may differ from LocalCharset.

. Character set aliases

Each charset is recognized by a number of its aliases. Web servers can return the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all denote the same charset. The search engine understands the following charset name aliases:

Table . Charset aliases

iso-8859-1:

CP819
CSISOLATIN1
IBM819
ISO-8859-1
ISO-IR-100
ISO_8859-1
ISO_8859-1:1987
L1
LATIN1
iso-8859-10:

CSISOLATIN6
ISO-8859-10
ISO-IR-157
ISO_8859-10
ISO_8859-10:1992
L6
LATIN6
iso-8859-11:

ISO-8859-11
TIS-620
TIS620
TACTIS
iso-8859-13:

ISO-8859-13
ISO-IR-179
ISO_8859-13
L7
LATIN7
iso-8859-14:

ISO-8859-14
ISO-IR-199
ISO_8859-14
ISO_8859-14:1998
L8
LATIN8
iso-8859-15:

ISO-8859-15
ISO-IR-203
ISO_8859-15
ISO_8859-15:1998
iso-8859-16:

ISO-8859-16
ISO-IR-226
ISO_8859-16
ISO_8859-16:2000
iso-8859-2:

CSISOLATIN2
ISO-8859-2
ISO-IR-101
ISO_8859-2
ISO_8859-2:1987
L2
LATIN2
iso-8859-3:

CSISOLATIN3
ISO-8859-3
ISO-IR-109
ISO_8859-3
ISO_8859-3:1988
L3
LATIN3
iso-8859-4:

CSISOLATIN4
ISO-8859-4
ISO-IR-110
ISO_8859-4
ISO_8859-4:1988
L4
LATIN4
iso-8859-5:

CSISOLATINCYRILLIC
CYRILLIC
ISO-8859-5
ISO-IR-144
ISO_8859-5
ISO_8859-5:1988
iso-8859-6:

ARABIC
ASMO-708
CSISOLATINARABIC
ECMA-114
ISO-8859-6
ISO-IR-127
ISO_8859-6
ISO_8859-6:1987
iso-8859-7:

CSISOLATINGREEK
ECMA-118
ELOT_928
GREEK
GREEK8
ISO-8859-7
ISO-IR-126
ISO_8859-7
ISO_8859-7:1987
iso-8859-8:

CSISOLATINHEBREW
HEBREW
ISO-8859-8
ISO-IR-138
ISO_8859-8
ISO_8859-8:1988
iso-8859-9:

CSISOLATIN5
ISO-8859-9
ISO-IR-148
ISO_8859-9
ISO_8859-9:1989
L5
LATIN5
armscii-8:

ARMSCII-8
big5:

BIG-5
BIG-FIVE
BIG5
BIGFIVE
CN-BIG5
CSBIG5
cp1250:

CP1250
MS-EE
WINDOWS-1250
cp1251:

CP1251
MS-CYRL
WINDOWS-1251
cp1252:

CP1252
MS-ANSI
WINDOWS-1252
cp1253:

CP1253
MS-GREEK
WINDOWS-1253
cp1254:

CP1254
MS-TURK
WINDOWS-1254
cp1255:

CP1255
MS-HEBR
WINDOWS-1255
cp1256:

CP1256
MS-ARAB
WINDOWS-1256
cp1257:

CP1257
WINBALTRIM
WINDOWS-1257
cp1258:

CP1258
WINDOWS-1258
cp437:

437
CP437
IBM437
cp850:

850
CP850
CSPC850MULTILINGUAL
IBM850
cp852:

852
CP852
IBM852
cp855:

855
CP855
IBM855
cp857:

857
CP857
IBM857
cp860:

860
CP860
IBM860
cp861:

861
CP861
IBM861
cp862:

862
CP862
IBM862
cp863:

863
CP863
IBM863
cp864:

864
CP864
IBM864
cp865:

865
CP865
IBM865
cp866:

866
CP866
CSIBM866
IBM866
cp869:

869
CP869
IBM869
cp874:

CP874
WINDOWS-874
euc-kr:

CSEUCKR
EUC-KR
EUCKR
gb2312:

CHINESE
CSGB2312
CSISO58GB231280
GB2312
GB_2312-80
ISO-IR-58
koi8-r:

CSKOI8R
KOI8-R
koi8-u:

KOI8-U
shift-jis:

CSSHIFTJIS
MS_KANJI
S-JIS
SHIFT-JIS
SHIFT_JIS
SJIS
cp367:

ANSI_X3.4-1968
ASCII
CP367
CSASCII
IBM367
ISO-IR-6
ISO646-US
ISO_646.IRV:1991
US
US-ASCII
utf8:

UTF-8
UTF8
viscii:

CSVISCII
VISCII
VISCII1.1-1
maccyrillic:

MACCYRILLIC
X-MAC-CYRILLIC
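
The alias table above amounts to a lookup from any alias to one canonical charset name. A minimal Python sketch of such a lookup follows. It uses only a small excerpt of the table, the names `ALIASES`, `CANONICAL` and `canonical_charset` are hypothetical, and case-insensitive matching is an assumption rather than documented behaviour.

```python
# A small excerpt of the alias table; the full table has many more entries.
ALIASES = {
    "iso-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "koi8-r": ["CSKOI8R", "KOI8-R"],
    "utf8": ["UTF-8", "UTF8"],
}

# Invert the table once so that each alias maps to its canonical name.
CANONICAL = {
    alias.lower(): name
    for name, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_charset(label):
    """Return the canonical charset name for an alias, or None if unknown.

    Matching ignores case and surrounding whitespace (an assumption).
    """
    return CANONICAL.get(label.strip().lower())


print(canonical_charset("Latin2"))
print(canonical_charset("utf-8"))
```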

. Document charset detection

The indexer detects the document character set in this order:

  1. "Content-type: text/html; charset=xxx"

  2. <META NAME="Content" CONTENT="text/html; charset=xxx">

  3. Defaults from "Local Charset" field in Common Parameters
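
The detection order above can be sketched in Python as follows. This is an illustration under stated assumptions, not the indexer's implementation: the function name `detect_charset`, its parameters, and the regular expressions are hypothetical, and real HTML parsing is more involved than a pattern match.

```python
import re

# Matches charset=xxx, with or without quotes around the value.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches a complete <meta ...> tag.
META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def detect_charset(content_type_header, html, default_charset):
    """Pick a charset using the three-step order described above."""
    # 1. The charset parameter of the HTTP Content-Type header.
    if content_type_header:
        match = CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    # 2. A charset declared in a META tag inside the document.
    for tag in META_RE.findall(html):
        match = CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    # 3. The configured default.
    return default_charset


page = '<META NAME="Content" CONTENT="text/html; charset=cp1251">'
print(detect_charset("text/html; charset=koi8-r", page, "iso-8859-1"))
print(detect_charset("text/html", page, "iso-8859-1"))
print(detect_charset(None, "<html></html>", "iso-8859-1"))
```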

. Automatic charset guesser

Since version 3.2.0, mnoGoSearch has an automatic charset and language guesser. It currently recognizes more than 70 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. By default, they are installed under the /usr/local/mnogosearch/etc/lm/ directory. Look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed reliably.
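
To give a feel for the technique, here is a minimal Python sketch of N-gram-based text categorization in the general form usually attributed to Cavnar and Trenkle: each language gets a ranked profile of its most frequent character n-grams, and a document is assigned to the profile with the smallest "out-of-place" rank distance. It is not the mnoGoSearch guesser and does not read its language map files. The function names, the profile size of 300, the n-gram lengths 1 to 3, and the tiny training samples are all assumptions for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    # Mark word boundaries with '_' so word starts and ends count as n-grams.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Toy training samples; real profiles are built from much larger texts.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {name: ngram_profile(text) for name, text in samples.items()}

print(guess(samples["en"], profiles))
```

Because the profiles are statistical, very short inputs produce few n-grams and unstable rankings, which is consistent with the note above that short texts may not be guessed well.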

. Default charset

Use the Charset indexer.conf command to choose the default charset of indexed servers.

. Default Language

You can set the default language for servers with the DefaultLang indexer.conf variable. This is useful when restricting search by URL language.