Friday, May 30, 2014

Text Processing in Information Retrieval



Search engines process text before applying any ranking model (Vector Space Model, probabilistic models, or inference network models) in order to extract information about documents and the terms they contain. Some text preprocessing techniques commonly used in information retrieval systems are:

Tokenization

Tokenization is the technique of converting a character stream into tokens by detecting word boundaries. The most common word boundary is one or more whitespace characters (space, newline, tab, carriage return, or form feed) in succession.
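
As a rough illustration, here is a minimal whitespace tokenizer in Python; real search engines use more sophisticated rules for punctuation, hyphens, and so on.

import re

def tokenize(text):
    # Split a character stream into tokens on runs of whitespace
    # (space, newline, tab, carriage return, form feed).
    return [token for token in re.split(r"\s+", text) if token]

print(tokenize("Text processing\tin  information\nretrieval"))
# ['Text', 'processing', 'in', 'information', 'retrieval']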

Normalization

Normalization is a technique applied to terms to minimize the number of variations of a word. These syntactic variations can cause a drop in recall if an exact match against the query keywords is required. The two most common methods used for normalization are:

Stemming

This technique converts a word into its root word, or stem, based on some rules. The stem is what is left after any prefixes or suffixes have been removed. For example, connect can be the stem of connected, connecting, connection, and connections. Widely used stemming approaches include affix removal, table lookup, successor variety, n-grams, and the Porter stemmer. The main disadvantage of stemming is that the stem may not be a real word, so it should be used with caution. Light stemming is a variation in which only plural forms are stemmed, for example cars -> car.
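
For instance, assuming the NLTK library is installed (pip install nltk), the Porter stemmer can be tried in a few lines; this is just a sketch, not part of any particular search engine:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections", "cars"]:
    # Each variant is reduced to its stem.
    print(word, "->", stemmer.stem(word))
# connected -> connect, connecting -> connect, connection -> connect,
# connections -> connect, cars -> car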

Lemmatization

Lemmatization is another normalization technique, based on dictionary lookups. An advantage of lemmatization is that it always produces valid words, but the context (such as the part of speech) needs to be known in advance.
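
A small sketch using NLTK's WordNet lemmatizer (this assumes NLTK is installed and the WordNet data has been downloaded via nltk.download('wordnet')); note how the part of speech has to be supplied as context:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("connections"))       # connection (defaults to noun)
print(lemmatizer.lemmatize("was", pos="v"))      # be
print(lemmatizer.lemmatize("better", pos="a"))   # good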

Stop Words Removal

Stop words are frequently used words (for example the, is, or, and) that provide very little information about the content. Removing stop words decreases the index size drastically, but it also decreases recall when exact matches are required. Another problem is that common stop words like may, can, and will are homonyms of rarer nouns; for example, may is a stop word, but it is also the name of a month, May.
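
A minimal sketch of stop-word removal with a hand-picked stop list (the list here is purely illustrative; real systems use much larger, carefully curated lists):

STOP_WORDS = {"the", "is", "or", "and", "a", "an", "of", "to", "in"}

def remove_stop_words(tokens):
    # Drop tokens that carry very little information about the content.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'on', 'mat']  ("on" is kept because it is not in the illustrative list)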


After the above processes are executed in order, the information about documents versus terms is maintained, and the various mathematical models use it to select and rank documents according to the information needs of users.



Wednesday, March 26, 2014

Search Evaluation


Information retrieval systems are based on various models (Boolean, Vector Space, probabilistic, and graphical), and evaluation is required in order to test and improve search engines with respect to the 'relevance' of the records shown in the result set.

Relevance: It is a measure of how effectively the user's information need was met. Relevance requires human judgement, and it answers the following questions:

– How useful were the results?
– How many of the retrieved results were useful?
– Were any useful results missed for a particular query?
– Did the order of the results make the user’s search easier or harder?

There are several evaluation strategies used for testing the relevance of a search engine. Here we will discuss set based evaluation, its problems, and Mean Average Precision, which is widely used by TREC.

Set based evaluation

This is the most fundamental and basic evaluation strategy. The metrics used for evaluation are:

Precision: This tells us how many of the retrieved results were useful.

Precision = #(relevant items retrieved) / #(retrieved items)
          = TP / (TP + FP)
          = Prob(relevant | retrieved)
where,

True Positive (TP): A retrieved document is relevant
False Positive (FP): A retrieved document is not relevant

Recall: This tells us whether any useful pages were left unretrieved.

Recall = #(relevant items retrieved) / #(relevant items)
       = TP / (TP + FN)
       = Prob(retrieved | relevant)

where,

– True Positive (TP): A retrieved document is relevant
– False Negatives (FN): A relevant document is not retrieved

Precision decreases when false positives increase (these are called Type I errors in statistical hypothesis testing). Recall decreases when false negatives increase (these are called Type II errors in statistical hypothesis testing).

                                     (Judgement)
                               Relevant          Non-Relevant
(Search Results)
  Retrieved Docs               TP                FP (Type I)
  Non-Retrieved Docs           FN (Type II)      TN

Since it is inconvenient to evaluate based on two numbers, the harmonic mean of precision and recall, called the F-measure, is taken. The F-measure summarizes both metrics (precision and recall) in a single statistic. The harmonic mean is used because it best negates the effect of outliers (the arithmetic mean is the worst in this regard).
e.g.: P = 0.2, R = 0.9, AM = 0.55, HM ≈ 0.33. Precision is very low in this example, yet the AM is fairly high because the outlying R value pulls it up; the HM stays close to the smaller value.

F = 2*P*R / (P + R)
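
A quick sketch of these metrics in Python, using the numbers from the example above:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p, r = 0.2, 0.9
print(round((p + r) / 2, 2))       # arithmetic mean: 0.55
print(round(f_measure(p, r), 2))   # harmonic mean: 0.33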

Problem for Set Based Evaluation: Who will measure the recall? For a given query, the whole data set needs to be reviewed to find every relevant document, so this works best for small corpora. Also, the order of the retrieved documents is not considered in the evaluation.

Mean Average Precision

This is a measure of quality at all recall levels. Average precision is first computed for each query, and then the mean of these average precisions is taken over all queries.
Here is the average precision for a single query:

AP = (1 / #relevant) * Σ (k = 1 to #relevant) (precision at the rank of the k-th relevant document)

– Most frequently, arithmetic mean is used over the query sample.
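
A sketch of average precision and MAP over binary relevance judgements (the query runs below are made up purely for illustration):

def average_precision(ranked_relevance, num_relevant):
    # ranked_relevance: list of booleans, True if the document at that rank is relevant.
    # num_relevant: total number of relevant documents for the query.
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant document's rank
    return sum(precisions) / num_relevant

def mean_average_precision(runs):
    # runs: one (ranked_relevance, num_relevant) pair per query.
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

runs = [
    ([True, False, True, False], 2),   # AP = (1/1 + 2/3) / 2 ≈ 0.83
    ([False, True, True], 3),          # AP = (1/2 + 2/3) / 3 ≈ 0.39
]
print(round(mean_average_precision(runs), 2))   # ≈ 0.61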

Precision @ k

MAP evaluates precision at all recall levels, but only the top portion of the result set matters most to users, so we can instead pick some number, say k = 10, of top documents and measure precision over just those.

Problem for Precision @ k: Not all queries have k or more relevant results, so even a perfect system may score less than 1.0 for some queries.
e.g.: with k fixed at 4, if query q has only 3 relevant documents, then even a perfect ranking gives a precision of (1 + 1 + 1) / 4 = 0.75 < 1.

R - Precision

This is one of the evaluation metrics used by TREC.
It uses a variable result-set cut-off for each query, based on the number of relevant results for that query. In this case a perfect system can score 1.0 over all queries. (The number of relevant documents is no longer fixed; for each query the cut-off equals its own number of relevant documents.)


E.g.: let q1 return documents d1, d2, d3; q2 return d1, d2; and q3 return d2, d4, d5, d6. Suppose it is a perfect system and all retrieved documents are relevant.

Mean R-Precision = 1/3 { (1/3)(1+1+1) + (1/2)(1+1) + (1/4)(1+1+1+1) } = 1
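
The same example can be checked with a small sketch; each query is cut off at its own number of relevant documents:

def r_precision(ranked_relevance, num_relevant):
    # Precision over the top R results, where R = number of relevant documents.
    top_r = ranked_relevance[:num_relevant]
    return sum(top_r) / num_relevant

queries = [                       # perfect system: every retrieved document is relevant
    [True, True, True],           # q1: d1, d2, d3
    [True, True],                 # q2: d1, d2
    [True, True, True, True],     # q3: d2, d4, d5, d6
]
scores = [r_precision(run, len(run)) for run in queries]
print(sum(scores) / len(scores))  # 1.0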






Wednesday, January 8, 2014

Candidate Selection Using Solr's Phonetic Match Algorithms

Information retrieval is one of the top problems faced in the IT industry today. Nevertheless, there are a number of search engines (many of them open source) employed for this purpose. These search engines are developed for the sole purpose of information retrieval, and they provide a number of options through which we can structure the index and the query to meet our requirements. Even after all these advancements in the field, configuring a search engine to meet our requirements remains a challenging task.

Candidate Selection:
Candidate selection refers to the retrieval of documents from the index for a given query. There are various ways to analyze the different fields of the documents present in the index to meet our requirements.

For example:
Consider a document which contains a field title, and suppose we analyze the field title using the following analyzers in sequence:

1.)  solr.LowerCaseTokenizerFactory
2.)  solr.WordDelimiterFilterFactory
3.)  solr.EdgeNGramFilterFactory 
                (Figure 1)   

The sequence of field analysis shown above would be the same at both query and index time in the analyzer defined for that particular field.

If the field title contains the word Search-Engine, then the field would be present in the index as follows:
a.) First, the LowerCaseTokenizerFactory would be applied and would produce the token:
     "search-engine"
b.) Second, the WordDelimiterFilterFactory would be applied on the same field to produce the tokens:
     "search", "engine", "searchengine"
c.)  Third, the EdgeNGramFilterFactory would produce the following tokens:
      i.)    For the token search: "s", "se", "sea", "sear", "searc", "search".
      ii.)   For the token engine: "e", "en", "eng", "engi", "engin", "engine".
      iii.)  Finally, for the token searchengine: "s", "se", "sea", "sear", "searc", and so on.

Then, if a query comes in for sear, it would match one of the tokens produced above by the final filter applied (the EdgeNGramFilterFactory), and the corresponding document would be retrieved.
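
The effect of this chain can be sketched in a few lines of Python. This is only a rough simulation of the three steps above, not Solr itself, and the gram sizes are assumed values:

import re

def analyze(text, min_gram=1, max_gram=25):
    # Step a: lowercase. Step b: split on non-alphanumeric characters and
    # keep the catenated form. Step c: emit edge n-grams (prefixes) of each token.
    parts = [p for p in re.split(r"[^a-z0-9]+", text.lower()) if p]
    tokens = parts + (["".join(parts)] if len(parts) > 1 else [])
    grams = set()
    for token in tokens:
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            grams.add(token[:n])
    return grams

index_grams = analyze("Search-Engine")
print("sear" in index_grams)   # True: the query 'sear' hits one of the stored prefixes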


Phonetic Matching
Another type of analysis that can be done uses phonetic matching, i.e. "for a particular query, retrieve documents whose fields sound similar to the query".
This type of analysis is particularly useful where, for a particular term, one might expect several differently spelled forms of the query.

For example:
Consider a case where a field (say title) of a document contains a word Stephen. Then a user might send a query like Stefan or Steven and expect us to deliver the desired result. In such cases the usage of phonetics becomes indispensable.

Over the years, a number of algorithms have been developed for this purpose. One of their prime disadvantages is that they also return a large number of false positives (irrelevant results).
For example:
For the query Steven, the results returned might also contain documents where a field has the value Stuffin or even Steph.

As such, many irrelevant results are returned, and hence the choice of phonetic algorithm depends largely on the ratio of relevant to irrelevant results returned.

One of the most popular phonetic matching algorithms is the Beider-Morse Phonetic Matching algorithm, which has a relatively high ratio of relevant to irrelevant results compared to other algorithms. Furthermore, there is support for this algorithm in Solr via solr.BeiderMorseFilterFactory. (There is also support for other phonetic matching algorithms in Solr, but here we will discuss how to use Beider-Morse Phonetic Matching (BMPM) to increase the relevancy of search results and at the same time reduce false positives, i.e. irrelevant results.)

Using Beider Morse Phonetic Matching in Solr:
Consider the following sequence of analysis:

1.)    solr.WhitespaceTokenizerFactory
2.)    solr.LowerCaseFilterFactory
3.)    solr.BeiderMorseFilterFactory
                  (Figure 2)                      

Now for the words Stephen Clark the following tokens would be produced:
1.) First, in solr.WhitespaceTokenizerFactory the text is separated on whitespace, so the tokens produced would be:
                                           "Stephen", "Clark"
2.) Second, in solr.LowerCaseFilterFactory the tokens are converted to lower case:
                                           "stephen", "clark"
3.) Finally, after the third analysis in solr.BeiderMorseFilterFactory, a number of phonetic hashes (tokens) are produced. Some of the tokens produced would be:
                          "steven" and so on (including false positives like "steph", "stuffin", ...)

Now the query sent by a user also undergoes the same analysis, and if the final tokens produced from the user query match any of those in the index, then that document is retrieved.
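
Beider-Morse itself is fairly involved, but the idea of a phonetic hash can be illustrated with the much simpler (and much cruder) Soundex algorithm; the sketch below is a simplified implementation, used only to show how differently spelled names collapse to the same token:

def soundex(name):
    # Simplified Soundex, for illustration only (not Beider-Morse):
    # keep the first letter, encode the rest as digits, drop vowels,
    # collapse adjacent duplicate codes, pad/truncate to four characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # h/w do not reset the previous code
            prev = code
    return (result + "000")[:4]

for n in ["Stephen", "Stefan", "Steven"]:
    print(n, soundex(n))   # all three hash to S315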

The major problem here is again false positives, as described in the last section. Since BMPM also produces irrelevant results, ranking becomes important so that the irrelevant results stay at the bottom of the result set.

For example:
If a user query returns 50 documents (out of which 30 are irrelevant), then by using appropriate ranking methods the irrelevant results will occur at the bottom. So if we use this type of analysis for an auto-suggest feature (found on most websites), where a maximum of only 20 results is displayed, the ranking will effectively show the most relevant results to the users.
Furthermore, this mixture of the concepts of candidate selection and ranking can be controlled by using different types of analysis to retrieve documents.

For example:
Suppose there is a field, say field1, which is analyzed using the sequence shown in Figure 1. Likewise, consider another field, say field2, where analysis is done using the sequence shown in Figure 2.

Now consider the equation:

                                   Score = boost1*field1 + boost2*field2
                                                                     where boost1 >> boost2

The term score is a standard term used in search engines. It provides a measure of the relevancy of a document against a particular query. The higher a document's score, the higher it appears in the result set compared to other documents.
boost1 and boost2 are numerical values multiplied with the field scores and can be thought of as weights.

The above equation states that, while retrieving documents for a particular query, both types of analysis defined in Figure 1 and Figure 2 are used. Since field2 (which uses BMPM) will also generate false positives, the documents retrieved only through field2 will have a much lower score (as boost2 is much less than boost1) and will occur lower in the result set.

For example:
Consider the term Washington present in some documents. The terms Vasington, Washincton, and Bassington are also present in other documents. If we send a query like Washington, all of these results will be retrieved (since every document will match either the analysis in Figure 1, or the analysis in Figure 2, or both). Using the equation defined above, the result set will contain the documents in the following order:

1.) Washington (match found in both the analysis chain defined)
2.) Washincton (match found in both the analysis chain defined)
3.) Vasington (does not match analysis of Figure1 but matches analysis defined in Figure2)
4.) Bassington (does not match analysis of Figure1 but matches analysis defined in Figure2)

Note that the results in the 3rd and 4th positions will have a much lower score than the document at the 1st position. The difference between the scores of the results at the 2nd and 3rd positions is smaller, but since result 2 is still a better match for the query, it occurs higher up than results 3 and 4. Also, since result 4 is quite far off against the query, its score will be much lower than that of result 2 and result 3.
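
The ordering above can be sketched as follows; the per-field match scores and boost values are made-up numbers, chosen only so that boost1 >> boost2:

BOOST1, BOOST2 = 10.0, 0.5            # boost1 >> boost2

def combined_score(field1_score, field2_score):
    # field1: edge-n-gram analysis (Figure 1); field2: phonetic analysis (Figure 2)
    return BOOST1 * field1_score + BOOST2 * field2_score

docs = {
    "Washington": (1.0, 1.0),   # matches both analysis chains
    "Washincton": (0.6, 1.0),   # partial n-gram match plus phonetic match
    "Vasington":  (0.0, 1.0),   # phonetic-only match
    "Bassington": (0.0, 0.8),   # weaker phonetic-only match
}
ranked = sorted(docs, key=lambda d: combined_score(*docs[d]), reverse=True)
print(ranked)   # ['Washington', 'Washincton', 'Vasington', 'Bassington']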