Chapter 4. Storing mnoGoSearch data

Table of Contents

SQL storage types
General storage information
Various modes of words storage
Storage mode - single
Storage mode - multi
Storage mode - crc
Storage mode - crc-multi
Storage mode - cache
SQL structure notes
Additional features of non-CRC storage modes
Cache mode storage
Introduction
Cache mode word indexes structure
Cache mode tools
Starting cache mode
Optional usage of several splitters
Using run-splitter script
Doing search
mnoGoSearch performance issues
MySQL performance
Post-indexing optimization
SearchD support
Why using searchd
Starting searchd
Merging several databases
Distributed indexing
Oracle notes
Introduction
Compilation, Installation and Configuration
IBM DB2 notes

SQL storage types

General storage information

mnoGoSearch stores only the unique words found in a document. If a word appears several times in the same document, its weights from the different parts of the document are combined with a binary OR. This means that the number of times a word appears in a document does not affect its weight. What is taken into account is whether the word appears in the more important parts of the document (title, description, etc.).
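The OR-based weighting can be illustrated with a minimal Python sketch. The section names and bit values below are hypothetical and chosen only for illustration; they are not the values mnoGoSearch uses.

```python
# Hypothetical per-section weight bits (illustration only, not mnoGoSearch's values).
SECTION_WEIGHT = {"body": 1, "title": 2, "keywords": 4, "description": 8}

def index_document(sections):
    """Return {word: weight}, where weight is the OR of the bits of
    every section in which the word occurs."""
    weights = {}
    for section, text in sections.items():
        bit = SECTION_WEIGHT[section]
        for word in text.lower().split():
            # Repeated occurrences OR in the same bit again, so the
            # number of occurrences never changes the result.
            weights[word] = weights.get(word, 0) | bit
    return weights

doc = {"title": "search engine", "body": "search search search the web"}
weights = index_document(doc)
```

Here "search" occurs once in the title and three times in the body, yet its weight is simply the title bit ORed with the body bit.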

Various modes of words storage

mnoGoSearch currently supports several modes of word storage: "single", "multi", "crc" and "crc-multi". The default mode is "single". The mode is selected with the DBMode command, which must be given in both the indexer.conf and search.htm files.


Examples:
DBMode single
DBMode multi
DBMode crc
DBMode crc-multi

mnoGoSearch compiled with the built-in database supports only the "single", "crc" and "crc-multi" modes. The "multi" mode is not implemented for the built-in database.

Storage mode - single

When "single" is specified, all words are stored in one table (or in a text file in the built-in database) with the structure (url_id,word,weight), where url_id is the ID of the document and refers to the rec_id field of the "url" table. The word field has a variable-length char(32) SQL type.

Storage mode - multi

If "multi" is selected, words are distributed among 13 different tables according to their length. The structure of these tables is the same as in "single" mode, but a fixed-length char type is used, which is faster in most databases. This usually makes "multi" mode faster than "single" mode. This mode is not implemented for the built-in database.
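The idea of choosing a table by word length can be sketched as follows. The length limits and the "dict" table name prefix are assumptions made for this example (they simply give 13 buckets); the actual tables are defined by the SQL scripts shipped with mnoGoSearch.

```python
import bisect

# Assumed upper length limits, one per table: 13 buckets in total.
# Both the limits and the table names are hypothetical.
BUCKET_LIMITS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 32]

def table_for(word):
    """Return the name of the fixed-width table that would hold `word`."""
    # Find the first bucket whose limit is >= the word length.
    i = bisect.bisect_left(BUCKET_LIMITS, len(word))
    if i == len(BUCKET_LIMITS):
        raise ValueError("word is longer than the widest table allows")
    return "dict%d" % BUCKET_LIMITS[i]
```

Under these assumptions, `table_for("search")` selects the table for 6-character words, while a 13-character word falls into the 16-character table.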

Storage mode - crc

If "crc" mode is selected, mnoGoSearch stores 32-bit integer word IDs calculated with the CRC32 algorithm instead of the words themselves. This mode requires less disk space and is faster than the "single" and "multi" modes. mnoGoSearch relies on the fact that CRC32 produces nearly unique checksums for different words. According to our tests, only 250 pairs of words share the same CRC in a list of about 1,600,000 unique words. Most of these pairs (>90%) contain at least one misspelled word. Word information is stored in the structure (url_id,word_id,weight), where word_id is the 32-bit integer ID calculated with the CRC32 algorithm. This mode is recommended for big search engines.
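A minimal Python sketch of deriving word IDs and checking a word list for collisions is given below. It uses the standard CRC-32 from Python's zlib module; whether mnoGoSearch's own CRC32 routine yields identical values is an assumption, and the helper names are hypothetical.

```python
import zlib

def word_id(word):
    """Map a word to a signed 32-bit integer ID using standard CRC-32."""
    crc = zlib.crc32(word.encode("utf-8"))  # unsigned, 0 .. 2**32 - 1
    # Fold into the signed range of a typical 32-bit SQL INT column.
    return crc - (1 << 32) if crc >= (1 << 31) else crc

def find_collisions(words):
    """Return {id: [words]} for every ID shared by two or more distinct words."""
    by_id = {}
    for w in set(words):
        by_id.setdefault(word_id(w), []).append(w)
    return {i: sorted(ws) for i, ws in by_id.items() if len(ws) > 1}
```

Running `find_collisions` over a large word list is one way to reproduce the kind of collision count quoted above for a given vocabulary.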

Storage mode - crc-multi

When "crc-multi" mode is selected, mnoGoSearch stores CRC32 word IDs in several tables (or binary files in the built-in database), chosen by word length as in "multi" mode and each having the same structure as in "crc" mode. This mode is usually the fastest and is recommended for big search engines.

Storage mode - cache

There is also a new "cache" storage mode. It is the fastest one and makes it possible to index and quickly search several million documents. See the "Cache mode storage" section for an explanation.

SQL structure notes

Please note that we develop mnoGoSearch with MySQL as the backend and often cannot test each version with all of the other supported databases. So, if there is no table definition in the create/your_database directory, you can find the MySQL definition for the same table and adapt it for your backend. MySQL table definitions are always up-to-date.

Additional features of non-CRC storage modes

"single" mode in both SQL and the built-in database, as well as "multi" mode with an SQL database, support substring search. Because "crc" and "crc-multi" do not store the words themselves and use integer values generated by the CRC32 algorithm instead, substring search is not possible in these modes.