

1 Introduction

First of all, mifluz is still in beta.

This program is part of the GNU project, released under the aegis of GNU.

The purpose of mifluz is to provide a C++ library to store a full text inverted index. To put it briefly, it stores occurrences of words in such a way that they can later be searched. The basic idea of an inverted index is to associate each unique word with a list of the documents in which it appears. This list can then be searched to locate the documents containing a specific word.
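To make the idea concrete, here is a minimal in-memory sketch of an inverted index. It is not the mifluz API: the class name TinyInvertedIndex and its methods add_document and lookup are hypothetical names chosen for illustration, and the sketch assumes words are separated by whitespace and documents are identified by integers.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical illustration only: this is NOT the mifluz API.
// Maps each unique word to the sorted set of documents containing it.
class TinyInvertedIndex {
public:
    // Record every whitespace-separated word of `text` as occurring in `doc_id`.
    void add_document(int doc_id, const std::string& text) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) {
            postings_[word].insert(doc_id);
        }
    }

    // Return the documents containing `word` (empty set if the word is unknown).
    std::set<int> lookup(const std::string& word) const {
        std::set<int> result;
        auto it = postings_.find(word);
        if (it != postings_.end()) {
            result = it->second;
        }
        return result;
    }

private:
    std::map<std::string, std::set<int>> postings_;
};

// Small usage example: index two documents, then search for words.
inline void example_usage() {
    TinyInvertedIndex index;
    index.add_document(1, "the quick fox");
    index.add_document(2, "the lazy dog");
    assert(index.lookup("the").size() == 2);  // appears in both documents
    assert(index.lookup("fox").count(1) == 1);
    assert(index.lookup("cat").empty());      // never indexed
}
```

A structure like this lives entirely in memory, which is only workable for a small number of words and documents.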

Implementing a library that manages an inverted index is a very easy task when there is a small number of words and documents. It becomes much harder when dealing with a large number of words and documents. mifluz has been designed with the following upper limits in mind: 500 million documents, 100 giga words, 18 million document updates per day. In its present state, mifluz can store 100 giga words using 600 gigabytes. The best average insertion rate observed so far is 4000 keys/sec on a 1 gigabyte index.
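As a rough back-of-the-envelope check of these figures, the sketch below derives the storage cost per word and the sustained update rate they imply. It assumes that "giga" means 10^9; the constant and function names are invented for this illustration.

```cpp
#include <cassert>

// Figures quoted in the text; "giga" is assumed to mean 10^9.
constexpr double kWords = 100e9;                    // 100 giga words
constexpr double kIndexBytes = 600e9;               // 600 gigabytes
constexpr double kUpdatesPerDay = 18e6;             // 18 million document updates
constexpr double kSecondsPerDay = 24.0 * 60 * 60;   // 86400

// Average index bytes spent per stored word occurrence.
constexpr double bytes_per_word() { return kIndexBytes / kWords; }

// Average document updates per second needed to sustain the daily target.
constexpr double updates_per_second() { return kUpdatesPerDay / kSecondsPerDay; }

inline void check_figures() {
    assert(bytes_per_word() == 6.0);    // 600e9 / 100e9
    assert(updates_per_second() > 208.0 && updates_per_second() < 209.0);
}
```

Under that assumption, the index spends about 6 bytes per word occurrence, and the update target works out to a little over 208 document updates per second. This is not directly comparable to the 4000 keys/sec insertion rate, since one document update may involve many keys.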

mifluz has two main characteristics: it is very simple (one might say stupidly simple :-) and its index takes as much space as the indexed text itself (100% of its size). It is simple because it provides only a few basic functions. It does not contain document parsers (HTML, PDF, etc.). It does not contain a full text query parser. It does not provide result display functions or other user-friendly features. It only provides functions to store word occurrences and retrieve them. Using 100% of the size of the indexed text is rather atypical: most well-known full text indexing systems use only 30%. The advantages mifluz has over most full text indexing systems are that it is fully dynamic (update, delete, insert), uses only a controlled amount of memory while resolving a query, has higher upper limits, and has a simple storage scheme. These advantages are paid for with more disk space.