Saturday, June 23, 2012


Solr Improvement for Faster Indexing 

Solr is built on top of Lucene, which is the engine behind Solr's indexing process. The following factors affect Solr's indexing performance:

Schema Configurations

Index only the fields that are required for searching, and store only the fields that are required for display. Every additional indexed or stored field increases indexing time, optimization time, index size and memory usage. Avoid indexing or storing fields with very long text unless they are really used in search. Solr is an information retrieval solution and should not be used for data storage.

Disable norms on fields that do not need them. Norms support index-time boosts and field length normalization (the boost given to shorter field values). They can be turned off by setting omitNorms=true on an indexed field. Norms use RAM and increase the index size on disk as well. Keep omitNorms=false only on fields where length normalization or index-time boosts matter for relevance.
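As an illustration of the two points above, a schema.xml fragment might look like the sketch below. The field names are hypothetical, and the field types (string, text_general) are assumed to be defined as in the example schema shipped with Solr.

```xml
<!-- schema.xml fragment; all field names are hypothetical examples -->
<fields>
  <!-- Searched and displayed: indexed and stored, norms kept for length normalization -->
  <field name="title" type="text_general" indexed="true" stored="true" omitNorms="false"/>

  <!-- Searched but never displayed: indexed only; assumed here not to need norms -->
  <field name="body" type="text_general" indexed="true" stored="false" omitNorms="true"/>

  <!-- Used only as an exact-match filter: indexed only, norms disabled -->
  <field name="category" type="string" indexed="true" stored="false" omitNorms="true"/>

  <!-- Displayed but never searched: stored only -->
  <field name="thumbnail_url" type="string" indexed="false" stored="true"/>
</fields>
```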

Segment Merge Factor Value

Configure a relatively high merge factor (around 10-15). There is a trade-off between indexing time and search time. With a high merge factor, more segments are allowed to accumulate before they are merged, so merges happen less often and indexing becomes faster. However, each query then has to search across a larger number of segments, so searching becomes slower.
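A minimal sketch of this setting in solrconfig.xml, assuming the Solr 3.x layout where index settings live under indexDefaults (newer versions use an indexConfig element instead). The value 15 is only an example from the range mentioned above.

```xml
<!-- solrconfig.xml fragment (Solr 3.x layout assumed) -->
<indexDefaults>
  <!-- Number of segments allowed to accumulate before a merge is triggered.
       Higher values favor indexing speed; lower values favor search speed. -->
  <mergeFactor>15</mergeFactor>
</indexDefaults>
```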

Optimization

When an index is built from scratch, optimize it once at the end of the build; when it is only being updated, skip optimization to improve indexing performance. Optimization is a very expensive operation, and a non-optimized index usually does not hurt query performance much.
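For reference, optimization and commit are triggered by separate XML update messages posted to Solr's update handler, so a full rebuild and an incremental update can simply end differently. This is a sketch of the two messages, not a complete indexing script; each one is posted as its own request.

```xml
<!-- Message 1: sent once, at the end of a full rebuild -->
<optimize/>

<!-- Message 2: sent at the end of an incremental update (no optimize) -->
<commit/>
```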

Increasing Heap Memory

Increasing the JVM heap size can improve indexing performance. This can be done by adding the following option to the Tomcat startup script (catalina.sh):

  JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=1536m"

Better Hardware

Index creation is a computation-heavy process and requires high-performance CPUs and enough memory. For indexing over 10 million documents we used a dedicated VM with a quad-core Intel CPU (2.8 GHz) and 12GB RAM.

Avoid Network transfers 

Transferring data over the network should be avoided during indexing. It is better to have the data to be indexed available on the local disk, and to create the index on the local disk rather than on a remote filesystem.

Analyzers

Use only the analyzers that are required, and prefer the ones that have the best performance.

Parallel Processing

Use more than one node for parallel indexing and then merge the resulting indexes using Lucene's IndexMergeTool.

Configure AutoCommit Option

Commit is an expensive operation, so it is best to make many changes to the index in a batch and issue a single commit at the end. The autoCommit setting in the solrconfig.xml file accepts a maximum number of documents and a maximum time, which can be used to commit documents in bulk.
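A minimal sketch of such an autoCommit block in solrconfig.xml follows. The thresholds are illustrative assumptions, not recommended values; a commit is triggered when either limit is reached.

```xml
<!-- solrconfig.xml fragment; threshold values are examples only -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Commit after this many documents have been added since the last commit -->
    <maxDocs>10000</maxDocs>
    <!-- Or after this many milliseconds have passed since the first uncommitted add -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```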