Information Retrieval systems are based on various models such as the Boolean, Vector Space, Probabilistic, and Graphical models. Evaluation is required in order to test and improve a search engine with respect to the 'relevance' of the records shown in its result set.
Relevance: a measure of how effectively the user's information need was met. Relevance requires human judgement, and it answers the following questions:
– How useful were the results?
– How many of the retrieved results were useful?
– Were any useful results missed for a particular query?
– Did the order of the results make the user’s search easier or harder?
There are several evaluation strategies used for testing the relevance of a search engine. Here we will discuss set-based evaluation, its problems, and Mean Average Precision, which is widely used by TREC.
Set-based evaluation
This is the most fundamental and most basic evaluation strategy. The metrics used for evaluation are:
Precision: This tells us how many of the retrieved results were useful.
Precision = #(relevant items retrieved) / #(retrieved items)
= TP / (TP + FP)
= Prob(relevant | retrieved)
where,
True Positive (TP): A retrieved document is relevant
False Positive (FP): A retrieved document is not relevant
Recall: This tells us whether any useful documents were left unretrieved.
Recall = #(relevant items retrieved) / #(relevant items)
= TP / (TP + FN)
= Prob(retrieved | relevant)
where,
– True Positive (TP): A retrieved document is relevant
– False Negatives (FN): A relevant document is not retrieved
Precision decreases when false positives increase (these are called Type I errors in statistical hypothesis testing). Recall decreases when false negatives increase (these are called Type II errors in statistical hypothesis testing).
                          (Judgement)
                          Relevant          Non-relevant
(Search results)
Retrieved docs            TP                FP (Type I)
Non-retrieved docs        FN (Type II)      TN
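As a rough illustration, here is a small Python sketch (the function and variable names are my own, not from any standard library) that counts TP, FP, and FN for a single query and derives precision and recall from them:

def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant by a human
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant (Type I)
    fn = len(relevant - retrieved)   # relevant but not retrieved (Type II)
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant,
# and one relevant document was missed.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d5"])
print(p, r)  # 0.75 0.75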
Since it is inconvenient to evaluate based on two numbers, the harmonic mean of precision and recall, called the F-measure, is used. The F-measure summarizes both metrics (precision and recall) in a single statistic. The harmonic mean is taken here because it best negates the effect of outliers (the arithmetic mean is the worst choice in this case).
e.g.: P = 0.2, R = 0.9, AM = 0.55, HM ≈ 0.33. Precision is very low in this example, yet the AM is quite high because the outlier R pulls it up; the HM stays close to the low precision.
F = 2*P*R / (P + R)
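A minimal Python sketch of this formula, reproducing the example above (the helper name f_measure is my own):

def f_measure(p, r):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

p, r = 0.2, 0.9
print((p + r) / 2)      # arithmetic mean: 0.55, hides the poor precision
print(f_measure(p, r))  # harmonic mean: ~0.33, pulled towards the low value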
Problem with set-based evaluation: who will measure the recall? The whole data set needs to be reviewed for a given query, so this works best for small corpora. Also, the order of the retrieved documents is not considered in the evaluation.
Mean Average Precision
This is a measure of quality at all recall levels. We compute the average precision for each query and then take the mean of these average precisions over all queries.
For a single query, the average precision is:
AP = (1 / #relevant) * Sum over k = 1 to #relevant of (precision at the rank of the k-th relevant document)
– Most frequently, the arithmetic mean is used over the query sample.
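As a sketch (my own helper names, assuming one ranked result list and one set of judged relevant documents per query), average precision and MAP could be computed like this:

def average_precision(ranked, relevant):
    """Average of the precision values taken at the rank of each relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Arithmetic mean of average precision over the query sample.

    runs: list of (ranked_results, relevant_docs) pairs, one per query.
    """
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Relevant documents found at ranks 1 and 3, out of 2 relevant in total:
print(average_precision(["d1", "d9", "d2"], ["d1", "d2"]))  # (1/1 + 2/3) / 2 ≈ 0.83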
Precision @ k
Since MAP evaluates precision at all recall levels, but only the top portion of the result set matters most to users, we can instead pick some number, say k = 10, as the number of top documents over which precision is measured.
Problem with Precision @ k: Not all queries have more than k relevant results, so even a perfect system may score less than 1.0 for some queries.
e.g.: with k fixed at 4, if query q has only 3 relevant documents, then even a perfect ranking gives precision @ 4 = (1+1+1)/4 = 0.75 < 1.
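A small sketch of precision at a fixed cut-off k (names are mine), reproducing the case above where only 3 relevant documents exist for a cut-off of 4:

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# A perfect ranking, but the query has only 3 relevant documents:
print(precision_at_k(["d1", "d2", "d3", "d7"], ["d1", "d2", "d3"], k=4))  # 0.75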
R-Precision
This is one of the evaluation metrics used at TREC.
It uses a variable result-set cut-off for each query, based on the number of that query's relevant results. In this case a perfect system can score 1.0 over all queries. (The cut-off is no longer a fixed k; for each query it equals R, the number of relevant documents for that query.)
E.g.: Let q1 return documents d1, d2, d3; q2 return documents d1, d2; and q3 return documents d2, d4, d5, d6. Suppose the system is perfect and all returned documents are relevant (and these are exactly the relevant documents for each query).
MAP (R-Precision) = 1/3 {1/3(1+1+1) + 1/2(1+1) + 1/4(1+1+1+1)} = 1
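A sketch of R-precision (helper names mine), reproducing the three-query example above; the cut-off for each query equals its number of relevant documents, so a perfect system scores 1.0:

def r_precision(ranked, relevant):
    """Precision at rank R, where R = number of relevant documents for the query."""
    relevant = set(relevant)
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0

queries = [
    (["d1", "d2", "d3"],       ["d1", "d2", "d3"]),        # q1
    (["d1", "d2"],             ["d1", "d2"]),              # q2
    (["d2", "d4", "d5", "d6"], ["d2", "d4", "d5", "d6"]),  # q3
]
scores = [r_precision(ranked, rel) for ranked, rel in queries]
print(sum(scores) / len(scores))  # 1.0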