Candidate Selection Using Solr’s Phonetic Match Algorithms
Information Retrieval is a one of the top notch problems faced today in IT Industry. Nevertheless, there are a number of search engines (many of them are Open Source) which are employed for this purpose. These search engines are developed for the sole purpose of Information Retrieval and they provide a number of options through which we can structure the index and the query to meet our requirements. Even after all these advancements in this field, the process of configuring a search engine to meet our requirements is a challenging task.
Candidate Selection:
Candidate Selection refers to the retrieval of documents from the index depending on the query. There are various ways to analyze different fields of documents present in index to meet our requirements.
For example:
Consider a document which contains a field title and we analyze the field title using the following analyzers in sequence:
1.) solr.LowerCaseTokenizerFactory
2.) solr.wordDelimiterFilterFactory
3.) solr.EdgeNGramFilterFactory
(Figure 1)
The sequence of field analysis shown above would be same for both “query” and “index” time in the analyzer defined for a particular field.
If the field “title” contains a word “Search-Engine” then the field in index would be present as:
a.) First, LowerCaseTokenizer filter would be applied and it would produce the following token :
“search-engine”
b.) Second,wordDelimiterFilterFactory would be applied on the same field to produce tokens as:
“search”, ”engine”, ”searchengine”
c.) Third, EdgeNGramFilterFactory would produce the following tokens:
i.) For the token “search” following tokens would be produced;
‘s’, ‘se’, ‘sea’, ‘sear’, ‘searc’ , ‘search’ .
ii.) For the token “engine” following tokens would be produced;
‘e’, ’en’, ’eng’, ’engi’, ’engi’, ’engin’, ’engine’.
iii.) Finally for the word “searchengine” following tokens would be produced;
‘s’, ‘se’, ‘sea’, ‘sear’, ’searc’ and so on.
Then, if a particular query comes for “sear” it would match one of the following tokens produced above in the final filter applied (which is EdgeNGramFilterFactory) and the corresponding document would be retrieved.
Phonetic Matching
One more type of analysis which could be done is by using phonetic matching i.e. "for a particular query retrieve documents where fields have similar sound as compared to the query".
This type of analysis is particularly useful in cases where for a particular term one might expect different forms of queries.
For example:
Consider a case where a field (say ‘title’) of a document contains a word “Stephen”. Then a user might send a query like “Stefan” or “Steven” and expect us to deliver the desired result. In such cases the usage of phonetics becomes indispensable.
Over the years, there has been a number of algorithms developed serving the same purpose. One of the prime disadvantages of these algorithms is that they also return a large number of false negatives (irrelevant results).
For example:
For the query “Steven”, the results returned might also contain documents where a field has value “Stuffin” or even “Steph”.
As such lots of irrelevant results are returned and hence the choice of phonetic algorithm depends largely on the ratio of relevant and irrelevant results returned.
One of the most popular phonetic matching algorithms is Beider Morse Phonetic Matching algorithm which has a relatively high ratio of relevant to irrelevant results as compared to other algorithms. Furthermore there’s also support of this algorithm in Solr called ‘solr.BeiderMorseFilterFactory’. (There’s also support of other phonetic matching algorithms in Solr, but here we will be discussing how to use Beider Morse Phonetic Matching (BMPM) to increase the relevancy of search results and at the same time reduce false positives or irrelevant results).
Using Beider Morse Phonetic Matching in Solr:
Consider the following sequence of analysis:
1.) solr.WhitespaceTokenizerFactory
2.) solr.LowerCaseFilterFactory
3.) solr.BeiderMorseFilterFactory
(Figure 2)
Now for the word “Stephen Clark” following tokens would be produced:
1.) First in solr.WhitespaceTokenizerFactory the word will be separated according to whitespace. So the tokens produced would be:
“Stephen”, “Clark”
2.) Second in solr.LowerCaseFilterFactory tokens would be converted into small case:
“stephen”, “clark”
3.) Finally, after third analysis in solr.BeiderMorseFilterFactory a number of phonetic hashes (tokens) would be produced. Some of the tokens produced would be:
“steven”, “steven” and so on(including false positives like “steph” ,“stuffin”,...)
Now the query sent by user also undergoes the same analysis and if the final tokens produced by the user query matches any of that in the index then that document is retrieved.
The major problem here is again of false positives as described in the last section. Since BMPM also produces irrelevant results ranking becomes important so that the irrelevant results stay at bottom of result set.
For example:
If a user query returns 50 documents (out of which 30 of them are irrelevant) then by using appropriate ranking methods the irrelevant results would occur at bottom. As such, if we have to use such type of analysis for “auto-suggest “feature (found in most web-sites) where a maximum of only 20 results have to be displayed, then this ranking method would display most relevant results to the users effectively.
Furthermore this mixture of concept of Candidate Selection and Ranking could be controlled by using different type of analysis to retrieve the document.
For example:
Suppose there’s a field, say “field1” which is analyzed using the sequence shown in Figure1. Likewise consider another field, say “field2” where analysis is done using the sequence shown in Figure2
Now consider the equation:
Score = boost1*field1 + boost2*field2
where boost1 >> boost2
The term “Score” is a standard term used in search engines. It provides a measure of relevancy of a document against a particular query. The more score a document has in the result set of documents, the higher it will occur in the result set as compared to other documents.
“boost1” and “boost2” are numerical values multiplied to the fields and can be thought of as weight.
The above equation states that; while retrieving documents for a particular query use both type of analysis defined in Figure1 and Figure2. And since field2 (which uses BMPM) will also generate false positives, the documents retrieved using field2 will have a much lower score (as boost2 is much less than boost1) and will occur lower in the result set.
For example:
Consider the term “Washington ” present in some documents. Also, the term “Vasington”, “Washincton” and “Bassington” are also present in other documents. If we send a query like “Washington ” all the results will be retrieved (since every document will match either using analysis in Figure1 or in Figure2 or both as explained in both the Figures). Using the equation defined above, the result set will contain documents in the following order:
1.) Washington (match found in both the analysis chain defined)
2.) Washincton (match found in both the analysis chain defined)
3.) Vasington (does not match analysis of Figure1 but matches analysis defined in Figure2)
4.) Bassington (does not match analysis of Figure1 but matches analysis defined in Figure2)
Note that the results in 3rd and 4th position will have much lower score than the document present at 1st position. However, the difference in scores of results present at 2nd, 3rd position will be less. Since result 2 is still a better match for the query it occurs higher up in the results as compared to result 3 and result 4. Also, since result4 is quite off the league against the query, its score will be much less than that of result2 and result4.