NoSQL databases have become a buzzword, and since I work with them to some extent, someone naturally asked me a couple of days ago what they are good for and what the deal with them is. There is much to say about this subject, but below I will try to explain the essentials in a few words.
The need for this kind of storage came from web applications that had to deal with ever larger quantities of data. That data was sometimes distributed over more than one computer, or it lived on servers that at some point could no longer cope because of their physical limitations. In other words, they could no longer scale vertically, so they had to scale horizontally. For those who are not familiar with these terms: vertical scaling means improving the performance of a database by adding memory or processors to the server that hosts it. Horizontal scaling is the opposite approach: performance is increased by adding more nodes to the network, each hosting an instance of the database.
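To make horizontal scaling a bit more concrete, here is a minimal sketch of how keys could be spread over several nodes by hashing them. The node names and the `node_for_key` and `distribute` helpers are my own illustration, not the API of any particular database.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node that stores `key` by hashing the key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node they are assigned to."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]

# Two nodes share the data; adding two more spreads the same keys thinner.
two_nodes = distribute(keys, ["node-a", "node-b"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for name, placement in (("2 nodes", two_nodes), ("4 nodes", four_nodes)):
    print(name, {node: len(held) for node, held in placement.items()})
```

Note that this simple modulo scheme moves many keys to a different node whenever the number of nodes changes. Real systems often use more careful schemes, such as consistent hashing, to limit that reshuffling.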
Since 2009, many different kinds of NoSQL databases have appeared, each with its own advantages and disadvantages, and each covering specific needs. Some are columnar databases (they store data in so-called “columns”), others are key-value stores, and others store entire documents and are called document store databases, and so on.
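As a rough illustration of how these families differ, the sketch below models the same user record in three ways with plain Python structures. The record, the field names and the `find_by_city` helper are made up for the example, and real products differ in many details.

```python
import json

# Key-value: the store only sees an opaque value behind a key.
kv_store = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document: the store understands the (possibly nested) structure,
# so it can answer queries on fields inside the document.
documents = [
    {"_id": 42, "name": "Ada", "address": {"city": "London"}, "tags": ["admin"]},
    {"_id": 43, "name": "Bob", "address": {"city": "Paris"}},
]

# Column-oriented: row key -> column name -> value.
# Rows do not have to share the same set of columns.
column_family = {
    "42": {"name": "Ada", "city": "London"},
    "43": {"name": "Bob"},
}


def find_by_city(docs, city):
    """Query on a nested field, something a pure key-value store cannot do."""
    return [d for d in docs if d.get("address", {}).get("city") == city]


print(json.loads(kv_store["user:42"])["name"])
print([d["_id"] for d in find_by_city(documents, "London")])
print(sorted(column_family["43"].keys()))
```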
Other distinct characteristics of NoSQL databases: they are schemaless, most of them are open source with convenient licensing, and they are distributed and accessible through simple APIs, which makes them attractive for applications running in the cloud, where they can spread over many network nodes. On the other hand, because NoSQL databases are usually distributed, they are subject to Eric Brewer’s CAP theorem, which says that a distributed system cannot guarantee all three of the following properties at the same time:
- Consistency (all nodes see the same data at the same time)
- Availability (the system responds to every request within a specified amount of time)
- Partition Tolerance (the system keeps working even when some of its nodes fail or cannot reach each other)
So, being distributed means that network partitions can always occur and must be tolerated. From Brewer’s theorem it follows that we must trade off between the two remaining properties. That is why some databases are more available than consistent, and others are more consistent than available. For example, Cassandra is a highly available database, but it is eventually consistent, which means that if no updates are made for a certain period of time, all the nodes will eventually hold the same state.
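The idea of eventual consistency can be sketched with two toy replicas that accept writes independently and later exchange their data. This is only an illustration under my own assumptions (a `Replica` class, a `sync` function and a "latest timestamp wins" rule), not how Cassandra is actually implemented.

```python
class Replica:
    """A toy replica storing key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange data in both directions; the newest timestamp wins."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")

# Both replicas stay available and accept writes on their own.
r1.write("cart", ["book"], timestamp=1)
r2.write("cart", ["book", "pen"], timestamp=2)
before = (r1.read("cart"), r2.read("cart"))  # the replicas disagree

# Once updates stop and the replicas exchange data, they converge.
sync(r1, r2)
after = (r1.read("cart"), r2.read("cart"))

print("before sync:", before)
print("after sync: ", after)
```

Between the two writes and the call to `sync`, a client could read a different cart depending on which replica it asks. That window of disagreement is the price paid for staying available.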
In a future post I will talk a little about the data model in NoSQL databases.