Friday, May 30, 2014

Text Processing in Information Retrieval



Search engines process text before applying any ranking model (vector space model, probabilistic models, or inference network model) in order to gather information about documents and the terms they contain. Some text preprocessing techniques commonly used in information retrieval systems are:

Tokenization

Tokenization is the technique of converting a character stream into tokens by detecting word boundaries. The most common word boundary is a run of one or more consecutive whitespace characters (space, newline, tab, carriage return, or form feed).
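The whitespace rule described above can be sketched in a few lines of Python. This is only a minimal illustration, and the function name tokenize is an assumption made for this example. Note that punctuation stays attached to the neighbouring word, because whitespace is the only boundary considered here.

```python
import re


def tokenize(text):
    """Split a character stream on runs of one or more whitespace characters."""
    # \s+ matches spaces, tabs, newlines, carriage returns, and form feeds.
    # Empty strings produced by leading or trailing whitespace are dropped.
    return [token for token in re.split(r"\s+", text) if token]


if __name__ == "__main__":
    print(tokenize("Search engines\tprocess\r\ntexts \f before ranking."))
```

A real system would usually add further rules (for punctuation, hyphens, numbers, and so on) on top of this basic split.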

Normalization

Normalization is applied to terms to reduce the number of variant forms of a word. Such syntactic variations can lower recall when an exact match against the query keywords is required. The two most common normalization methods are:

Stemming

Stemming converts a word into its root word, or stem, based on a set of rules. The stem is what is left after any prefixes or suffixes have been removed. For example, connect is the stem of connected, connecting, connection, and connections. The four widely used families of stemming algorithms are affix removal (the Porter stemmer is a well-known example), table lookup, successor variety, and n-grams. The main disadvantage of stemming is that the stem may not be a real word, so it should be used with caution. Light stemming is a variation in which only plural forms are stemmed, for example cars -> car.
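To make the idea of rule-based suffix removal concrete, here is a rough sketch in Python. It is not the Porter stemmer or any other published algorithm: the suffix list, the minimum stem length, and the function names strip_suffix and light_stem are all assumptions chosen only to reproduce the examples in this post.

```python
# Hypothetical, deliberately tiny rule set; longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def light_stem(word):
    """Light stemming sketch: only strip a plural 's' (but leave 'ss' endings alone)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ("connected", "connecting", "connection", "connections"):
        print(w, "->", strip_suffix(w))
    print("running ->", strip_suffix("running"))  # the stem is not a real word
    print("cars ->", light_stem("cars"))
```

The running example shows the drawback mentioned above: naive suffix removal leaves runn, which is not a real word.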

Lemmatization

Lemmatization is another normalization technique, based on dictionary lookups. Its advantage is that it yields valid words; its drawback is that the context of the word must be known in advance.
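A dictionary lookup of this kind might look like the sketch below. The table contents, the part-of-speech labels, and the function name lemmatize are made up for illustration; here the "context" is assumed to be a part-of-speech tag supplied by the caller, which is one common way of providing it.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("cars", "NOUN"): "car",
}


def lemmatize(word, pos):
    """Return the dictionary form of word for the given part of speech.

    Words missing from the table are returned lowercased and otherwise unchanged.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    print(lemmatize("saw", "VERB"))  # the verb form maps to "see"
    print(lemmatize("saw", "NOUN"))  # the tool stays "saw"
```

The two saw entries show why context matters: the same surface form maps to different valid words depending on how it is used.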

Stop Words Removal

Stop words are frequently used words (for example the, is, or, and) that provide very little information about the content. Removing stop words decreases index size drastically, but it also decreases recall for exact-match queries. Another problem is that common stop words like may, can, and will are homonyms of rarer nouns. For example, may is a stop word, but it is also the name of a month, May.
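The filtering step, and the homonym problem with May, can be illustrated with a short Python sketch. The stop list below contains only the words mentioned in this post and is an assumption for the example; real stop lists are much longer. The function name remove_stop_words is likewise illustrative.

```python
# Illustrative stop list built from the examples in this post.
STOP_WORDS = {"the", "is", "or", "and", "may", "can", "will"}


def remove_stop_words(tokens):
    """Drop every token whose lowercased form appears in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


if __name__ == "__main__":
    # "May" (the month) is removed along with the true stop words.
    print(remove_stop_words(["The", "meeting", "is", "in", "May"]))
```

Because the comparison is case-insensitive, the month name May is discarded together with the auxiliary verb, so a query for meetings in May can no longer match on that term.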


After the above processes are executed in order, information about which terms occur in which documents is maintained. Various mathematical models use this information to select and rank documents according to the users' information needs.
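One simple way to hold the document-versus-term information is a mapping from each term to the documents it occurs in, along with a count. The sketch below assumes the documents have already been preprocessed into lists of terms; the function name build_term_document_counts and the document ids are hypothetical, and this is only one possible layout, not a description of any particular search engine.

```python
from collections import Counter


def build_term_document_counts(docs):
    """Map each term to {doc_id: number of occurrences in that document}.

    docs is a dict from document id to a list of already preprocessed terms.
    """
    index = {}
    for doc_id, terms in docs.items():
        for term, count in Counter(terms).items():
            index.setdefault(term, {})[doc_id] = count
    return index


if __name__ == "__main__":
    docs = {
        "d1": ["car", "engine", "car"],
        "d2": ["engine", "repair"],
    }
    for term, postings in sorted(build_term_document_counts(docs).items()):
        print(term, postings)
```

A ranking model can then read term frequencies and document frequencies from this structure when scoring documents against a query.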