Saturday, June 23, 2012


Solr Improvement for Faster Indexing 

Solr is built on top of Lucene, which is the engine behind Solr's indexing process. The following factors affect Solr's indexing performance:

Schema Configurations

Index only the fields that are required for searching, and store only the fields that are required for display. Every additional indexed or stored field increases indexing time, optimization time, index size and memory use. Avoid indexing or storing fields with very long text unless they are really used in search. Solr is an information retrieval solution and should not be used for data storage.

Disable norms where they are not needed. Norms hold index-time boosts and field length normalization (which favors matches in short field values). They can be disabled by setting omitNorms="true" on an indexed field. Norms use RAM and increase the index size on disk as well. Only fields that are searched and actually benefit from length normalization or index-time boosts should have omitNorms="false".
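As a rough illustration of the two points above, a schema.xml fragment might look like the following. The field names (title, author, thumbnail_url) are made up for this sketch, and the text_general and string types are assumed to be defined as in the example schema shipped with Solr; adjust them to your own schema.

```xml
<!-- Sketch only: field names are hypothetical, types assumed from the Solr example schema -->
<fields>
  <!-- Searched and displayed; norms kept so that length normalization applies -->
  <field name="title" type="text_general" indexed="true" stored="true" omitNorms="false"/>

  <!-- Searched but never displayed, no length normalization needed: indexed only, norms off -->
  <field name="author" type="text_general" indexed="true" stored="false" omitNorms="true"/>

  <!-- Displayed but never searched: stored only -->
  <field name="thumbnail_url" type="string" indexed="false" stored="true"/>
</fields>
```

Changing these attributes on an existing field generally requires reindexing before the change takes full effect.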

Segment Merge Factor Value

Configure a high merge factor (around 10-15). There is a trade-off between index time and search time. With a high merge factor, more segments are allowed to accumulate before they are merged, so merges are less frequent and indexing becomes faster. However, a query then has to search a larger number of segments, so searching becomes slower.
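The merge factor is set in solrconfig.xml. The fragment below is a sketch with an illustrative value. It assumes a Solr release that uses the indexConfig section; older releases put the same setting under indexDefaults or mainIndex, and much newer releases configure merging through a merge policy instead, so check the documentation for your version.

```xml
<!-- Sketch only: the enclosing element depends on the Solr version -->
<indexConfig>
  <!-- Higher value: fewer merges and faster indexing, but more segments to search -->
  <mergeFactor>15</mergeFactor>
</indexConfig>
```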

Optimization

When an index is built from scratch, optimize it once the build is complete; when an existing index is only being updated, skip the optimization to improve indexing performance. Optimization is a very expensive operation, and a non-optimized index will not hurt query performance much.

Increasing JVM Memory

Increasing the JVM heap size can improve indexing performance. This can be done by adding the following options to the Tomcat startup script (catalina.sh):

  JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=1536m"

Better Hardware

Index creation is computationally heavy and requires high-performance CPUs and enough memory. For indexing over 10 million documents we used a dedicated VM with a quad-core Intel CPU (2.8 GHz) and 12 GB of RAM.

Avoid Network Transfers

Avoid transferring data over the network during indexing. The data to be indexed should be available on the local disk, and the index should be created on the local disk rather than on a remote filesystem.

Analyzers

Use only the analyzers that are required, and prefer the ones with the best performance.

Parallel Processing

Use more than one node for parallel indexing and then merge the resulting indexes using Lucene's IndexMergeTool.

Configure AutoCommit Option

Commit is an expensive operation, so it is best to make many changes to the index in a batch and then issue a single commit at the end. The autoCommit configuration in solrconfig.xml has thresholds for the number of documents (maxDocs) and the elapsed time (maxTime), which can be used to commit documents in bulk.
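A minimal sketch of such an autoCommit block in solrconfig.xml is shown below. The thresholds are illustrative assumptions, not recommendations; tune them to your document size and how quickly changes need to become durable.

```xml
<!-- Sketch only: threshold values are illustrative -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Commit once this many documents have been added since the last commit -->
    <maxDocs>10000</maxDocs>
    <!-- ...or once this many milliseconds have passed since the first uncommitted add -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

With this in place, the indexing client can simply send documents in batches without issuing a commit after each one; a commit is triggered by whichever threshold is reached first.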
