Apache Solr is a server built around the Lucene search engine, an information retrieval system that works on the principles of the Vector Space Model. In this model each document is represented as a vector, with each dimension corresponding to a separate term. A set of such documents is called a corpus. A query string is also treated as a vector, and the similarity between the query and a particular document in the corpus is calculated as the cosine of the angle between the two vectors, which gives the score of that document for the given query string.
Consider a document in a corpus and a query string that will be used to search over the documents. These can be represented as the following vectors:
d1 = (w1, w2, ..., wn)
q = (z1, z2, ..., zn)
Here w1, w2, ..., wn and z1, z2, ..., zn are the weights of the individual terms in the document and in the query, respectively. These weights are calculated as tf*idf (term frequency–inverse document frequency) weights, a numerical statistic that reflects how important a word is to a document in a collection or corpus. Term frequency is the number of times a term occurs in a document, whereas inverse document frequency measures how rare a term is across the corpus, so that terms found in only a few documents weigh more than terms found in most of them. They can be calculated with the following formulas:
tf = (number of times a term occurs in a document)/ (total number of terms in that document)
idf = log( (total number of documents) / (number of documents containing the term + 1) )
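The two formulas above can be sketched in a few lines of Python. This is only an illustration of the formulas as written here, not Lucene's or Solr's actual scoring code; the function names and the tiny tokenized corpus are made up for the example.

```python
import math


def term_frequency(term, doc_tokens):
    """tf = occurrences of the term in the document / total terms in the document."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, corpus):
    """idf = log(total documents / (documents containing the term + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(term, doc_tokens, corpus):
    """Weight of a term in one document: tf * idf."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)


# Hypothetical corpus: each document is a list of already-tokenized terms.
corpus = [
    "solr is built on lucene".split(),
    "lucene is a search library".split(),
    "solr exposes search over http".split(),
    "vectors represent documents".split(),
]

print(tf_idf("lucene", corpus[0], corpus))
```

Note that with the +1 in the denominator, a term that appears in every document gets a slightly negative idf under this formula.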
The score of a particular document against a given query can be calculated as:
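Score of document = cosine(d1, q) = (d1 . q) / (|d1| |q|), where d1 . q = w1*z1 + w2*z2 + ... + wn*zn is the dot product and |d1| is the vector norm, calculated as the square root of (w1^2 + w2^2 + ... + wn^2). |q| is calculated the same way from z1, z2, ..., zn.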
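As a minimal sketch of the cosine score, assuming the document and query have already been turned into weight vectors of equal length (the sample weights below are invented for illustration, and this is the textbook formula rather than Lucene's internal implementation):

```python
import math


def dot(a, b):
    """Dot product: a1*b1 + a2*b2 + ... + an*bn."""
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    """Vector norm: square root of the sum of squared weights."""
    return math.sqrt(sum(x * x for x in v))


def cosine_score(doc_vec, query_vec):
    """Cosine of the angle between the document and query vectors."""
    denom = norm(doc_vec) * norm(query_vec)
    if denom == 0:
        # A vector with all-zero weights has no direction; treat the score as 0.
        return 0.0
    return dot(doc_vec, query_vec) / denom


# Hypothetical tf*idf weights over a three-term vocabulary.
d1 = [0.5, 0.0, 0.3]
q = [0.5, 0.0, 0.0]

print(cosine_score(d1, q))
```

Because the weights here are non-negative, the score falls between 0 (no terms in common) and 1 (vectors pointing in the same direction).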
It follows that if a term occurs in most of the documents, the idf for that term will be low and the term will contribute little to the score.
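A quick way to see this effect is to evaluate the idf formula for a fixed corpus size while the number of documents containing the term grows. The corpus size and document counts below are arbitrary numbers chosen for illustration.

```python
import math


def idf(total_docs, docs_with_term):
    """idf = log(total documents / (documents containing the term + 1))."""
    return math.log(total_docs / (docs_with_term + 1))


total_docs = 1000  # hypothetical corpus size

# The more documents contain the term, the smaller its idf becomes.
for docs_with_term in (1, 10, 100, 900):
    print(docs_with_term, round(idf(total_docs, docs_with_term), 3))
```

The printed values shrink steadily as the term becomes more common, which is why very frequent terms add little to a document's score.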