Saturday, June 23, 2012


Solr Improvements for Faster Indexing

Solr is built on top of Lucene, which is the engine behind Solr's indexing process. The following factors affect Solr's indexing performance:

Schema Configurations

Index only the fields that are required for searching, and store only the fields that are required for display purposes. Every additional indexed or stored field increases index time, optimization time, index size, and memory usage. Avoid indexing or storing fields with very long text unless they are really used in search. Solr is an information retrieval solution and should not be used for data storage.

Disable norms, which encode index-time boosts and field-length normalization (used for boosting short field values). This can be done by setting omitNorms=true on an indexed field. Norms consume RAM and increase index size on disk as well. Only fields that are actually scored during search should keep omitNorms=false.
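
As an illustration, a schema.xml fragment along these lines (the field names and the "text"/"string" types are purely illustrative) indexes a searchable text field with norms kept, stores a display-only field without indexing it, and drops norms on a field never used for relevance scoring:

  <field name="title" type="text" indexed="true" stored="true" omitNorms="false"/>
  <field name="thumbnail_url" type="string" indexed="false" stored="true"/>
  <field name="sku" type="string" indexed="true" stored="true" omitNorms="true"/>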

Segment Merge Factor Value

Configure a reasonably high merge factor value (around 10-15). There is a trade-off between index time and search time: with a high merge factor, segments are merged less frequently, which speeds up indexing, but a query must then examine a larger number of segments, so searching becomes slower. A low merge factor has the opposite effect.
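
In solrconfig.xml of that era the value is set roughly as follows (10 is just an illustrative value; tune it against your own indexing/search mix):

  <indexDefaults>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>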

Optimization

When indexes are built fresh, run an optimize at the end of the build; when they are incrementally updated, skip optimization to improve indexing performance. Optimization is a very expensive operation, and a non-optimized index does not hurt query performance much.
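
When an explicit optimize is wanted at the end of a fresh build, it can be sent as an XML message to the update handler, for example (assuming a default local instance):

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<optimize/>'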

Increasing JVM Heap Memory

Increasing the size of the JVM heap generally results in better performance. This can be done by adding the following options to the Tomcat configuration file (catalina.sh):

  JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=1536m"

Better Hardware

Index creation is a computation-heavy process and requires fast CPUs and enough memory. For indexing over 10 million documents we used a dedicated VM with a quad-core Intel CPU (2.8 GHz) and 12 GB RAM.

Avoid Network Transfers

Avoid transferring data over the network during indexing. It is better that the data to be indexed is available on the local disk, and the index should also be created on the local disk rather than on a remote filesystem.

Analyzers

Use only the analyzers that are required, and prefer the ones with the best performance. Every extra tokenizer or filter in the analysis chain adds per-document work at index time.
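
As a sketch, a deliberately lean text field type in schema.xml could be limited to a tokenizer and a single filter (this minimal chain is only an example; use whatever your search semantics actually require):

  <fieldType name="text_min" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>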

Parallel Processing

Use more than one node for parallel indexing and then merge the resulting indexes using Lucene's IndexMergeTool.
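
IndexMergeTool ships in the lucene-misc jar; an invocation looks roughly like this (jar names and index paths are illustrative):

  java -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.misc.IndexMergeTool /indexes/merged /indexes/part1 /indexes/part2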

Configure AutoCommit Option

Commit is an expensive operation, so it is best to make many changes to an index in a batch and issue a single commit at the end. The autoCommit configuration in solrconfig.xml can trigger commits based on the number of pending documents or on elapsed time, which effectively commits documents in bulk.
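
A typical autoCommit block in solrconfig.xml (the thresholds are illustrative) looks like:

  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending documents -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>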

Thursday, May 31, 2012

Solr Query Time Performance Tips

Solr is built on top of the Lucene information retrieval system and implements the Vector Space Model for ranking documents with respect to a particular set of terms called the query string. Solr is written in Java and can be installed as a web application in any servlet container, such as Tomcat. Though Solr is highly scalable and configurable, when it comes to large data sets (billions of documents) the following steps can be considered to improve query time.

Upgrading the Hardware

Hardware Requirements: A lot of computation takes place when indexes are created and when the indexed documents are queried, so it is better to have dedicated physical machines as Solr instances. The hardware requirements of a machine depend on the number of requests the server is expected to handle. For querying over 10 million documents and serving 1.2 million hits per day we used a dedicated VM with a quad-core Intel CPU (2.8 GHz) and 12 GB RAM.

Changing the Software Configurations

Increasing Java Virtual Machine Heap Size: The Java virtual machine manages object life cycles in the heap area, and increasing the heap size generally results in better performance. This can be done by adding the following options to the Tomcat configuration file (catalina.sh):
 
JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=1536m"

Use Solr Caching Features: Solr caches query result sets using a least-recently-used (LRU) eviction policy: when a cache is full, the result sets that have not been served for the longest time are evicted to make room for new ones. More can be read at http://wiki.apache.org/solr/SolrCaching#
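
The individual caches are sized in solrconfig.xml; a sketch with illustrative sizes:

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>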

Using the Distributed Search Feature

For auto-suggestions use a different Solr instance: Auto-suggest fires a huge number of requests for a given query string, and it is an add-on functionality. Moreover, auto-suggest fields generate many options per input field value, which greatly enlarges the set of values under consideration when suggestions are generated for a query string. For the same reason, the field used for auto-suggestion should not be the field used for the general query string in normal search over the indexed documents. It is therefore sensible to run a separate Solr instance/stack for the auto-suggest functionality.

Use replication or shards: When there is a huge number of requests, it is a better idea to distribute the requests over a set of Solr instances that are replicas of each other. The master-slave replication model consists of one master that is dedicated to index creation/updates; the same indexes are pulled by the slaves, which are dedicated to serving requests and are generally put behind a load balancer. This model also solves the read-write problem: the indexes are created by the master only, and read requests are served by the slaves, which perform no write operations. This also prevents the indexes from being corrupted.
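
With Solr's built-in Java replication, the two roles are configured in solrconfig.xml roughly as follows (the master URL and poll interval are illustrative):

  <!-- on the master -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- on each slave -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>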

If there is a huge number of similar documents (say, information about millions of books), then instead of creating one huge index on a single machine, the index can be distributed over a set of Solr instances that are queried collectively. This setup is called sharding. Each shard then holds a smaller set of documents to examine per query. (More can be read about distributed search at http://wiki.apache.org/solr/DistributedSearch.)
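
A distributed query simply lists the shards to fan out to (host names are illustrative):

  http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=title:solr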

Tuning Search Queries

Query over minimum fields: The number of fields used in a query should be kept to a minimum, so that a smaller set of field values has to be considered at runtime. Moreover, queries should be written following the short-circuit (minimal evaluation) strategy of Boolean algebra, so that cheap clauses can rule documents out before expensive ones are evaluated.

Use filter queries: Filter queries (fq) specify a restricted set of documents for the main query to consider. A filter query reduces the set of documents considered for score calculation (and its result is cached independently of the main query), working much like the WHERE clause in MySQL.
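
For example (the field names are illustrative), the following restricts scoring to in-range electronics before the main query term is evaluated:

  http://localhost:8983/solr/select?q=camera&fq=category:electronics&fq=price:[100+TO+500]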

Sunday, May 20, 2012

How Does Apache Solr Search Work?

Apache Solr is a server built around the Lucene search engine, an information retrieval system that works on the principles of the Vector Space Model. In this model each document is represented as a vector, with each dimension representing a separate term. A set of such documents is called a corpus. A query string is also treated as a vector, and the similarity between the query and a particular document in the corpus is calculated as the cosine of the angle between the two vectors, which gives the score of that document against the given query string.

Consider a document in a corpus and a query string that will be used to search over the documents. These can be represented as the following vectors:

d1 = (w1, w2, ..., wn)
q = (z1, z2, ..., zn)

Here w1, w2, ..., wn and z1, z2, ..., zn are the weights of the various terms in the document and the query. These weights are calculated as tf*idf (term frequency times inverse document frequency) weights. The tf*idf weight is a numerical statistic which reflects how important a word is to a document in a collection or corpus. Term frequency measures how often a term occurs in a document, whereas inverse document frequency down-weights terms that occur in many documents, so that rarer terms identify the more relevant documents for a given query. They can be calculated with the following formulas:

tf  = (number of times the term occurs in a document) / (total number of terms in that document)
idf = log { total number of documents / (number of documents containing the term + 1) }

The score of a particular document against a given query can be calculated as:

score(d1, q) = cosine = (d1 . q) / (|d1| |q|), where |d1| is the vector norm: |d1| = sqrt(w1^2 + w2^2 + ... + wn^2)
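
To make this concrete, here is a small worked example with made-up two-term weight vectors:

  d1 = (0.2, 0.4), q = (0.5, 0.5)
  d1 . q = 0.2*0.5 + 0.4*0.5 = 0.30
  |d1| = sqrt(0.2^2 + 0.4^2) = sqrt(0.20) ≈ 0.447
  |q|  = sqrt(0.5^2 + 0.5^2) = sqrt(0.50) ≈ 0.707
  score = 0.30 / (0.447 * 0.707) ≈ 0.95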

It can be inferred that if a term occurs in most of the documents, then the idf for that term will be low and it will contribute less to the score.


Saturday, May 12, 2012

Solr Core vs Solr Shard

Consider two sets of documents, (d1, d2, ..., dm) and (t1, t2, ..., tn), with the constraint that documents taken from different sets differ in some (or all) of their properties. For instance, if we consider a set of cameras and a set of books, we cannot compare the two based on their properties. The best way to index and query such documents is to create a different set of indexes for each set and query them separately. Each such index configuration is called a Solr Core. A Solr Core represents just a set of indexed records, and a single instance of Solr can host multiple cores.
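
In the legacy solr.xml format, two such cores could be declared roughly like this (core names and directories are illustrative):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="books" instanceDir="books"/>
      <core name="cameras" instanceDir="cameras"/>
    </cores>
  </solr>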

Now consider that the number of documents in one of the above sets is large (m or n in the millions). This leads to a large index on disk, which increases both insert/update time and query time due to longer disk seeks. A better solution for handling a huge number of requests over millions of indexed records is to split the index over multiple nodes (separate machines) running multiple Solr instances. For example, taking the first set, (d1, d2, ..., di) goes on one node, (d(i+1), ..., dk) on another, and so on, where the sizes of these subsets sum to m. Each such setup is called a Solr Shard.

The following points can be inferred from the above:
  • Solr Cores may have different schemas, but Solr Shards are replicas of each other (in terms of schema) with different sets of documents.
  • A Solr Core represents just a set of indexed records. A Solr Shard is configured as a set of Solr Cores.
  • A Solr Shard can be a Solr instance with many Solr Cores configured.
  • A Solr Shard setup exists to achieve better query and indexing times when there is a huge number of documents and querying them as a single set takes too long.
  • A Solr Shard will have at least one unique field that serves as the unique identifier across the whole set of documents indexed in the different Solr Shards.
  • A Solr Core represents a unique set of documents. Putting dissimilar records into different cores keeps each schema much simpler.

If two sets of documents share very few intrinsic properties and differ mostly in extrinsic ones, we should try to keep them in one core only. But if the number of shared intrinsic properties is large and these are heavily used in querying the documents, then it is better to go with different cores. Moreover, if the number of documents is large enough, we should also consider Solr Shards. It is worth mentioning here that administering Solr Cores and creating new indexes at runtime is easy to handle without restarting the instance. Lastly, where configuration and administration of many Solr Cores on one node may be difficult, configuring shards as one master and many slaves has its own challenges. On the whole, distributing indexes over multiple nodes and choosing between the multiple-core and shard models should be done wisely, as per the problem's requirements.