
Saturday, May 12, 2012

Solr Core vs. Solr Shard

Consider two sets of documents, (d1,d2,....dm) and (t1,t2,....tn), with the constraint that some (or all) of the properties of documents drawn from different sets differ. For instance, given a set of cameras and a set of books, we cannot compare the two based on their properties. The best way to index and query such documents is to create a separate set of indexes for each set and query each one independently. Each such index configuration is called a Solr Core. A Solr Core represents just one set of indexed records. A single instance of Solr can have multiple cores.
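As a minimal sketch, the cameras and the books could be declared as two cores in solr.xml (the legacy multi-core format of the Solr 3.x era; the core names and directories here are hypothetical):

<!-- solr.xml: one Solr instance hosting two independent cores -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="cameras" instanceDir="cameras" />
    <core name="books" instanceDir="books" />
  </cores>
</solr>

Each core points at its own instance directory, so each carries its own schema and its own index files.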

Now consider that the number of documents in one of the above sets is large (m or n in the millions). This leads to a large index on disk, which in turn increases index insert/update times and query times because of longer disk seeks. A better way to handle a huge number of requests over millions of indexed records is to split the indexes over multiple nodes (separate machines) running multiple Solr instances. Taking the first set as an example: (d1,d2,...di) on one node, (d(i+1),....dk) on another, and so on, such that the subsets together cover all m documents. Each such setup is called a Solr Shard.
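A query is then fanned out to every partition with the shards request parameter. A sketch using SolrJ (assuming SolrJ 3.x on the classpath; the host names are hypothetical):

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public QueryResponse queryAllShards(String q)
        throws MalformedURLException, SolrServerException {
    // Any shard can coordinate the distributed request.
    SolrServer server = new CommonsHttpSolrServer("http://host1:8983/solr");
    SolrQuery query = new SolrQuery(q);
    // List every partition; Solr merges the per-shard results.
    query.set("shards", "host1:8983/solr,host2:8983/solr");
    return server.query(query);
}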

The following points can be inferred from the above:
  • Solr Cores may have different schemas, but Solr Shards are replicas of each other in terms of schema, each holding a different set of documents.
  • A Solr Core represents just one set of indexed records. A Solr Shard is configured as a set of Solr Cores.
  • A Solr Shard can be a full Solr instance with many Solr Cores configured.
  • A Solr Shard setup is used to achieve better query and indexing times when there are a huge number of documents and querying them as a single set takes too long.
  • Every Solr Shard will share at least one unique field whose value serves as the unique identifier over the whole set of documents indexed across the different shards (see the schema fragment after this list).
  • A Solr Core will represent a unique set of documents. Setting up different cores for dissimilar records makes each schema much simpler.
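A minimal fragment of each shard's schema.xml, assuming the unique field is named id (the field name is illustrative):

<!-- Identical in every shard's schema.xml; each document's id must be
     unique across all shards, not just within one. -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
<uniqueKey>id</uniqueKey>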

If two document types have very few intrinsic (type-specific) properties and a large number of extrinsic (shared) ones, we should try to keep them in one core. But if the intrinsic properties are numerous and heavily used when querying the documents, it is better to go with different cores. Moreover, if the number of documents is large enough, we should also consider Solr Shards. It is worth mentioning that administering Solr Cores, including creating new indexes at run time, is easy and does not require restarting the instance. Lastly, where configuration and administration of many Solr Cores on one node may be difficult, a shard configuration with one master and many slaves has its own challenges. On the whole, distributing indexes over multiple nodes and choosing between the multi-core and shard models should be done wisely, as per the requirements of the problem.
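For instance, the CoreAdmin API can create a core on a running instance without a restart (the core name and directory here are hypothetical, and the instance directory must already exist):

http://localhost:8983/solr/admin/cores?action=CREATE&name=cameras2&instanceDir=cameras2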


Monday, August 29, 2011

MongoDB - A Scalable Solution for Persistence of Media Files

MongoDB is a non-relational (schemaless) database that stores records in BSON format (a binary representation of JSON). The best parts of MongoDB are its scalability, its easy-to-integrate APIs (available for the common web application development languages) and its easy command-line usage on the various operating systems. Persistence of records into collections (a concept similar to tables in relational databases) is non-transactional, which makes database operations quite fast compared to relational databases. The point to consider is that MongoDB should not be used where a missed record cannot be accepted, as in banking and e-commerce systems.
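As a quick illustration with the Java driver of that era (the host, database and collection names are hypothetical):

import java.net.UnknownHostException;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

public void insertSample() throws UnknownHostException {
    Mongo mongo = new Mongo("localhost", 27017);
    DB db = mongo.getDB("mediaStore");
    DBCollection profiles = db.getCollection("profiles");
    // No schema to declare: the document's shape is defined at insert time.
    profiles.insert(new BasicDBObject("user", "alice").append("visits", 42));
}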


Some Use Cases of MongoDB

The schemaless and non-transactional properties of MongoDB make it usable in the following scenarios:

  • Archiving: schema changes accumulated in a relational database over a decent span of time make archiving the data there difficult; a schemaless store can hold records whose structure has drifted over the years.
  • Logging: the insertion of records is fast in MongoDB because of the non-transactional writes. The same property, however, is responsible for occasional insertion misses over a large number of insertions (see the write-concern sketch after this list).
  • Real-Time Analytics: it can be used to track the real-time performance metrics (page views, unique visits, etc.) of a given website.
  • User Information Persistence for Identity Provider Systems: user information such as registration data, ratings, session data and profiles can be saved in MongoDB for Identity Provider or SSO systems.
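Where a missed insert matters, the old Java driver makes the trade-off explicit through a write concern; a minimal sketch (the collection is assumed to be already connected):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.WriteConcern;

public void logEvent(DBCollection logs, String message) {
    BasicDBObject entry = new BasicDBObject("msg", message);
    // Fire-and-forget write: fastest, but a failed insert goes unnoticed.
    logs.insert(entry, WriteConcern.NONE);
    // For inserts that must not be lost, pay for acknowledgement instead:
    // logs.insert(entry, WriteConcern.SAFE);
}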
 
Using MongoDB for File Storage

GridFS is a MongoDB specification under which a large file is saved by splitting it into smaller chunks of data (256 KB in size by default). The file is saved using two collections:

  • files: the meta-information such as object id, size and insertion date goes here.
  • chunks: the file data is saved in this collection, one document per chunk.
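For reference, a files document looks roughly like this (all values illustrative):

{
  "_id"        : ObjectId("..."),   // referenced by each chunk's files_id
  "length"     : 1048576,           // total file size in bytes
  "chunkSize"  : 262144,            // the 256 KB default
  "uploadDate" : ISODate("2011-08-29T10:00:00Z"),
  "md5"        : "..."
}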

Insertion API using GridFS

 
Inserting a record requires the necessary meta-information, passed as key-value pairs in the form of a Map. The other required parameters are the file data as a byte array and the collection name, which is used to establish the connection with the server.
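The snippets below rely on a getMongoDBConnection helper that is not shown here; a minimal sketch, assuming a single shared Mongo instance, a database named "mediaStore" and a no-argument RepositoryConnectionException constructor:

import java.net.UnknownHostException;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

private static Mongo mongo;

private static synchronized DBCollection getMongoDBConnection(String collectionName)
        throws RepositoryConnectionException {
    try {
        if (mongo == null) {
            mongo = new Mongo("localhost", 27017);  // hypothetical host/port
        }
        return mongo.getDB("mediaStore").getCollection(collectionName);
    } catch (UnknownHostException e) {
        throw new RepositoryConnectionException();
    }
}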
 
public String insertContent(byte[] fileData, String myCollection, Map<String, String> metainfo)
            throws MongoInsertionException, RepositoryConnectionException {

        mylogger.debug("Inserting record in Mongo Database.");
        DBCollection dbCollection = getMongoDBConnection(myCollection);
        DB db = dbCollection.getDB();
        GridFS myFS = new GridFS(db, myCollection);

        GridFSInputFile gridFileInput = myFS.createFile(fileData);

        // Attach every piece of meta-information to the file document.
        for (Map.Entry<String, String> entry : metainfo.entrySet()) {
            gridFileInput.put(entry.getKey(), entry.getValue());
        }

        gridFileInput.save();
        mylogger.debug("RECORD ADDED SUCCESSFULLY IN MONGO!!!");

        // save() assigns the object id; a missing id means the insert failed.
        if (gridFileInput.getId() == null) throw new MongoInsertionException();

        return gridFileInput.getId().toString();
}
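A hypothetical call, storing an image with its original filename as meta-information:

Map<String, String> metainfo = new HashMap<String, String>();
metainfo.put("filename", "photo.jpg");
String mongoId = insertContent(imageBytes, "mediaFiles", metainfo);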


Deletion of Record in GridFS

A record can be deleted by passing its object id to the GridFS instance.
private void deleteRecord(String id, String collectionName)
                                             throws RepositoryConnectionException {

    DBCollection dbCollection = getMongoDBConnection(collectionName);
    DB db = dbCollection.getDB();
    GridFS myFS = new GridFS(db, collectionName);
    // remove() deletes both the files entry and all of its chunks.
    myFS.remove(new ObjectId(id));
}



Updating a Record in MongoDB using GridFS

Direct update of file data is not possible in MongoDB's GridFS. To update a file's data, the logic can be:
  • Insert a new record with the updated file data.
  • In the meta-information of the newly added record (the child), add the id of the old record.
  • In the meta-information of the old record (the parent), add the id of the newly added record.
public boolean updateRecord(String id, String collectionName, byte[] updatedValue)
    throws MongoInsertionException, RepositoryConnectionException {

        // Insert the child record, pointing back at its parent.
        Map<String, String> metainfo = new HashMap<String, String>();
        metainfo.put("parentid", id);
        String newid = insertContent(updatedValue, collectionName, metainfo);

        /* update the meta-information of the old (parent) record */
        DBCollection dbCollection = getMongoDBConnection(collectionName);
        DB db = dbCollection.getDB();
        GridFS myFS = new GridFS(db, collectionName);

        GridFSDBFile gridFSDBFile = myFS.find(new ObjectId(id));
        if (gridFSDBFile == null) return false;  // parent no longer exists

        /* if an updated version already exists, delete it first */
        String currentUpdatedRecord = (String) gridFSDBFile.get("updatedversion");

        if (currentUpdatedRecord != null && !"".equals(currentUpdatedRecord)) {
            deleteRecord(currentUpdatedRecord, collectionName);
        }

        gridFSDBFile.put("updatedversion", newid);
        mylogger.debug("The meta information updatedversion updated to value:" + newid);
        gridFSDBFile.save();
        return true;
    }



Searching for a Record in GridFS

With the update logic above, a record can be searched by its original (parent) id: if the parent carries an updatedversion entry, the child record it points to is returned instead.

public static byte[] searchRecord(String id, String collectionName)
                                          throws RepositoryConnectionException {

    try {
        mylogger.debug("id=" + id + " collectionName=" + collectionName);
        DBCollection dbCollection = getMongoDBConnection(collectionName);
        DB db = dbCollection.getDB();
        GridFS myFS = new GridFS(db, collectionName);
        GridFSDBFile gridFSDBFile = myFS.find(new ObjectId(id));

        // Follow the updatedversion pointer to the latest child, if any.
        String newid = (String) gridFSDBFile.get("updatedversion");
        if (newid != null && !"".equals(newid))
            gridFSDBFile = myFS.find(new ObjectId(newid));

        InputStream in = gridFSDBFile.getInputStream();
        try {
            // IOUtils is from Apache Commons IO.
            return IOUtils.toByteArray(in);
        } finally {
            in.close();
        }
    } catch (Exception e) {
        // Covers IllegalArgumentException for malformed ids as well.
        mylogger.error("UNABLE TO SEARCH RECORD!!!", e);
        return null;
    }
}