Web Technologies Blog: How Apache Solr Search Works?

Apache Solr is a server built around the Lucene search engine, an information retrieval system that works on the principles of the Vector Space Model. In this model each document is represented as a vector, with each dimension representing a separate term. A set of such documents is called a corpus. A query string is also treated as a vector, and the similarity between the query and a particular document in the corpus is calculated as the cosine of the angle between the two vectors, which gives the score for that document against the given query string.

Consider a document in a corpus and a query string that will be used to search over the documents. These can be represented as the following vectors:

d1=(w1,w2,...,wn)
q=(z1,z2,...,zn)

Here w1,w2,...,wn and z1,z2,...,zn are the weights of the individual terms in the document and in the query, respectively. These weights are calculated as tf*idf (term frequency times inverse document frequency), a numerical statistic that reflects how important a word is to a document in a collection or corpus. Term frequency measures how often a term occurs in a document, whereas inverse document frequency measures how rare the term is across the corpus, so that terms found in only a few documents carry more weight. They can be calculated with the following formulas:

tf = (number of times a term occurs in a document) / (total number of terms in that document)
idf = log { (total number of documents) / (number of documents containing the term + 1) }
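
As a rough illustration, the two formulas above can be written as a few lines of Python. This is only a sketch of the textbook formulas given in this post, not Lucene's or Solr's internal scoring code. The function names, the tiny corpus, and the whitespace tokenization are all assumptions made for the example.

```python
import math

def term_frequency(term, document_terms):
    """tf = occurrences of the term / total number of terms in the document."""
    if not document_terms:
        return 0.0
    return document_terms.count(term) / len(document_terms)

def inverse_document_frequency(term, corpus):
    """idf = log(total documents / (documents containing the term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, document_terms, corpus):
    """Weight of a term in one document, as the product tf * idf."""
    return term_frequency(term, document_terms) * inverse_document_frequency(term, corpus)

# Hypothetical corpus: each document is a list of lowercase terms.
corpus = [
    "solr is built on lucene".split(),
    "lucene is a search library".split(),
    "solr exposes search over http".split(),
    "cats sleep all day".split(),
]

for term in ("lucene", "cats"):
    print(term, tf_idf(term, corpus[0], corpus), tf_idf(term, corpus[3], corpus))
```

A term that does not occur in a document gets tf = 0, so its weight in that document's vector is 0 whatever its idf is.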

The score of a particular document against a given query can be calculated as:

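Score of document = cosine = (d1 . q) / (|d1| |q|), where d1 . q is the dot product w1*z1 + w2*z2 + ... + wn*zn and |d1| is the vector norm, calculated as the square root of (w1^2 + w2^2 + ... + wn^2). |q| is calculated the same way from z1, z2, ..., zn.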
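
The cosine score can be sketched in Python as follows. The function name and the sample weight vectors are made up for illustration, and the sketch assumes both vectors list their weights over the same terms in the same order. Returning 0 for an all-zero vector is a choice made here to avoid dividing by zero, not something the formula itself defines.

```python
import math

def cosine_score(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(w * z for w, z in zip(d, q))        # d . q
    norm_d = math.sqrt(sum(w * w for w in d))     # |d|
    norm_q = math.sqrt(sum(z * z for z in q))     # |q|
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: a vector with no weighted terms scores 0
    return dot / (norm_d * norm_q)

# Hypothetical tf*idf weights over three terms.
d1 = [0.5, 0.0, 0.2]   # shares the first term with the query
d2 = [0.0, 0.7, 0.0]   # shares no term with the query
q = [0.4, 0.0, 0.0]

print(cosine_score(d1, q), cosine_score(d2, q))
```

With these weights d1 scores higher than d2, because d2 has no term in common with the query and its dot product with q is 0.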

It follows that if a term occurs in most of the documents, the idf for that term will be low and it will contribute little to the score.
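
A quick numeric check of this effect, using the idf formula from this post with made-up document counts (a corpus of 1000 documents is an assumption for the example):

```python
import math

def idf(total_documents, documents_with_term):
    """idf = log(total documents / (documents containing the term + 1))."""
    return math.log(total_documents / (documents_with_term + 1))

rare = idf(1000, 9)      # term found in 9 of 1000 documents: log(100), about 4.6
common = idf(1000, 899)  # term found in 899 of 1000 documents: log(1000/900), about 0.1

print(rare, common)
```

Because of the +1 in the denominator, a term that appears in every document gets a slightly negative idf under this formula.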


Sunday, May 20, 2012
