Search engines preprocess text before applying any ranking model (Vector Space Model, probabilistic models, or Inference Network Model) in order to gather information about documents and the terms they contain. Some text preprocessing techniques commonly used in information retrieval systems are:
Tokenization
Tokenization is the technique of converting a character stream into tokens by detecting word boundaries. The most common word boundary is a run of one or more consecutive white space characters (space, new line, tab, carriage return, or form feed).
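A minimal sketch of whitespace tokenization in Python is given here for illustration. The function name tokenize is an assumption made for this example, and real tokenizers usually also handle punctuation, hyphens, and similar cases.

```python
import re

def tokenize(text):
    """Split a character stream into tokens on runs of white space."""
    # \s+ matches one or more consecutive white space characters
    # (space, tab, new line, carriage return, form feed).
    # Empty strings produced by leading/trailing white space are dropped.
    return [tok for tok in re.split(r"\s+", text) if tok]

tokens = tokenize("search  engines\tprocess\ntexts\r\nfirst")
print(tokens)  # ['search', 'engines', 'process', 'texts', 'first']
```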
Normalization
Normalization is a technique applied to terms to minimize the number of variations of a word. Syntactic variations may reduce recall if an exact match against the query keywords is required. The two most common methods used for normalization are:
Stemming
Stemming converts a word into its root word, or stem, based on a set of rules. The stem is what is left after any prefixes or suffixes have been removed. For example, connect is the stem of connected, connecting, connection, and connections. The four widely used classes of stemming algorithms are affix removal (of which the Porter stemmer is the best-known example), table lookup, successor variety, and n-grams. The main disadvantage of stemming is that the stem may not be a real word, so it should be used with caution. Light stemming is a variation in which only plural forms are stemmed, for example cars -> car.
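The idea of rule-based suffix removal can be sketched as follows. This is a toy illustration, not the Porter algorithm: the suffix list, the minimum stem length, and the names simple_stem and light_stem are all assumptions chosen for this example.

```python
# Hypothetical, deliberately tiny suffix list. Real stemmers such as
# Porter's use many more rules and conditions.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:  # longer suffixes are listed before shorter ones
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

def light_stem(word):
    """Light stemming: strip only a plain plural 's'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["connected", "connecting", "connection", "connections"]:
    print(w, "->", simple_stem(w))   # each maps to "connect"

print(simple_stem("operating"))      # "operat", a stem that is not a real word
print(light_stem("cars"))            # "car"
```

The "operating" case shows the disadvantage mentioned above: the rules produce a stem that is useful for matching but is not itself a word.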
Lemmatization
Lemmatization is another normalization technique, based on dictionary lookups. Its advantage is that it produces valid words; its drawback is that the context of the word (for example its part of speech) must be known in advance.
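A dictionary-lookup lemmatizer might look roughly like the sketch below. The table contents, the part-of-speech labels, and the name lemmatize are hypothetical and serve only to show why context is needed: the same surface form can map to different lemmas.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
# A real lemmatizer relies on a full lexical resource.
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("connections", "NOUN"): "connection",
}

def lemmatize(word, pos):
    """Return the dictionary form of word for the given part of speech.

    Falls back to the lowercased word when no entry exists.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```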
Stop Words Removal
Stop words are frequently used words (for example the, is, or, and) that provide very little information about the content. Removing stop words decreases index size drastically but also decreases recall when exact matches are required. Another problem is that common stop words such as may, can, and will are homonyms of rarer nouns. For example, may is a stop word, but May is also the name of a month.
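Stop word removal can be sketched as a simple set lookup. The stop list below is a small assumed sample, not a standard list, and remove_stop_words is a name chosen for this example.

```python
# Small assumed stop list for illustration only.
STOP_WORDS = {"the", "is", "or", "and", "may", "can", "will"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form appears in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "meeting", "is", "in", "May"]))
# ['meeting', 'in']
```

Note that the month name May is discarded together with the auxiliary verb, which is the homonym problem described above.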
After the above processes are executed in order, information about which terms occur in which documents is maintained. Various mathematical models use this information to select and rank documents according to the users' information needs.
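One common way to maintain term-versus-document information is an inverted index that maps each term to the documents containing it. The sketch below is one possible layout under simple assumptions: the name build_index and the sample documents are made up for illustration, and the preprocessing is reduced to lowercasing and whitespace splitting.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a dict of {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Simplified preprocessing: lowercase, then split on white space.
        counts = Counter(text.lower().split())
        for term, freq in counts.items():
            index[term][doc_id] = freq
    return index

# Made-up sample documents.
docs = {
    1: "search engines rank documents",
    2: "engines index documents and documents",
}
index = build_index(docs)
print(index["documents"])  # {1: 1, 2: 2}
```

A ranking model can then read term frequencies per document directly from this structure.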