Saturday, May 12, 2012

Solr Core Vs Solr Shard

Consider two sets of documents, (d1, d2, ..., dm) and (t1, t2, ..., tn), with the constraint that some (or all) of the properties of a document from one set differ from those of a document from the other. For instance, a set of cameras and a set of books cannot be compared on the basis of their properties. The best way to index and query such documents is to create a separate index for each set and query each one separately. Each such index configuration is called a Solr Core. A Solr Core represents just a set of indexed records. A single instance of Solr can have multiple cores.
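
As a minimal sketch of querying each core separately, the snippet below builds one select URL per core. The host, port and the core names "cameras" and "books" are assumptions made up for illustration; the /solr/<core>/select path is the usual multi-core layout, but your installation may differ.

```python
from urllib.parse import urlencode

# Assumed location of a single Solr instance hosting several cores.
SOLR_BASE = "http://localhost:8983/solr"


def core_select_url(core, query, base=SOLR_BASE):
    """Build the select URL for one core; each core is queried on its own."""
    params = urlencode({"q": query, "wt": "json"})
    return f"{base}/{core}/select?{params}"


# Hypothetical cores, one per kind of document, each with its own fields.
cameras_url = core_select_url("cameras", "brand:Canon")
books_url = core_select_url("books", "author:Tolkien")

print(cameras_url)
print(books_url)
```

Each URL targets exactly one index, so a field such as author never has to exist in the cameras schema.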

Now consider that the number of documents in either of the above sets is large (m or n in the millions). This leads to a large index on disk, which increases the time taken by index inserts, updates and queries because of longer disk seek times. A better way to handle a huge number of requests over millions of indexed records is to split the index over multiple nodes (separate machines), each running its own Solr instance. For example, for the first set, (d1, d2, ..., di) go to one node, (d(i+1), ..., dk) to another, and so on, where the sizes of these subsets add up to m. Each such setup is called a Solr Shard.
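
The split described above can be sketched in a few lines. This is only an illustration of dividing m documents into contiguous groups whose sizes add up to m; the function name split_into_shards is hypothetical, and a real deployment might route documents differently (for example by hashing the unique key).

```python
def split_into_shards(doc_ids, num_shards):
    """Split doc_ids into num_shards contiguous, near-equal slices."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(len(doc_ids), num_shards)
    shards = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one more document each.
        size = base + (1 if i < extra else 0)
        shards.append(doc_ids[start:start + size])
        start += size
    return shards


doc_ids = [f"d{i}" for i in range(1, 11)]  # d1 .. d10, so m = 10
shards = split_into_shards(doc_ids, 3)

for node, docs in enumerate(shards, start=1):
    print(f"node {node}: {docs}")
```

Every document lands in exactly one group, so the shard sizes always sum to m.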

The following points can be inferred from the above:
  • Solr Cores may have different schemas, but Solr Shards are replicas of each other in terms of schema, each holding a different set of documents.
  • A Solr Core represents just a set of indexed records. A Solr Shard is configured as a set of Solr Cores.
  • A Solr Shard can be a Solr instance with many Solr cores configured.
  • A Solr Shard setup aims to improve query and indexing times when the number of documents is huge and querying them as a single set takes too long.
  • Solr Shards need at least one unique field that serves as the unique identifier across the whole set of documents indexed in the different shards.
  • A Solr Core will represent a unique set of documents. Setting different cores for dissimilar records will make the schema much simpler.
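
To show how shards that share a schema can be queried as one set, here is a small sketch that builds a distributed search URL using Solr's shards request parameter, which takes a comma-separated list of host:port/path entries. The node names and the core name "books" are assumptions for illustration only.

```python
from urllib.parse import urlencode


def distributed_select_url(entry_point, shard_addresses, query):
    """Build a select URL that asks one node to fan the query out to all shards."""
    params = {"q": query, "shards": ",".join(shard_addresses)}
    # Keep ':', '/' and ',' readable; Solr accepts them unescaped or escaped.
    return f"{entry_point}/select?{urlencode(params, safe=':/,')}"


# Hypothetical two-node setup; both shards share the same schema.
url = distributed_select_url(
    "http://node1:8983/solr/books",
    ["node1:8983/solr/books", "node2:8983/solr/books"],
    "title:solr",
)
print(url)
```

The node that receives the request merges the per-shard results, which is why the unique identifier field has to be unique across all shards.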

If two sets of documents have very few intrinsic properties and a large number of extrinsic properties, we should try to keep them in a single core. But if the number of intrinsic properties is large and they are heavily used when querying the documents, it is better to go with different cores. Moreover, if the number of documents is large enough, we should also consider Solr Shards. It is worth mentioning that administering Solr Cores and creating new indexes at run time is easy to handle without restarting the instance. Lastly, while configuring and administering many Solr Cores on one node can be difficult, configuring shards as one master and many slaves has challenges of its own. On the whole, how to distribute indexes over multiple nodes, and whether to use multiple cores, shards or both, should be decided carefully according to the requirements of the problem.
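
As an illustration of adding a core at run time without a restart, the sketch below builds a request URL for Solr's CoreAdmin CREATE action. The host, the admin path /solr/admin/cores and the core name "magazines" are assumptions; the admin path is configurable, and the instance directory with its configuration is expected to exist already.

```python
from urllib.parse import urlencode

# Assumed location of the Solr instance and its CoreAdmin handler.
SOLR_BASE = "http://localhost:8983/solr"


def create_core_url(name, instance_dir, base=SOLR_BASE):
    """Build a CoreAdmin CREATE request URL for a new core."""
    params = {"action": "CREATE", "name": name, "instanceDir": instance_dir}
    return f"{base}/admin/cores?{urlencode(params)}"


# Hypothetical new core for a third kind of document.
print(create_core_url("magazines", "magazines"))
```

Sending a GET request to the resulting URL asks the running instance to register the new core alongside the existing ones.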


2 comments:

Unknown said...

Hello, this is a very nice article. I just have one question; as you said "choosing multiple cores and shards models should be chosen wisely", I want to know how can I choose the criteria to split the indexes in a shards model? What kind of indexes will be in the master shard?

Thank you.

Unknown said...

Can you help me understand what a document is in Solr terminology?