Wednesday, December 18, 2013

CAP theorem and NoSQL databases

I was talking to a friend yesterday who said "RDBMS is going to go away, everyone uses NoSQL these days". That served as the motivation for writing this post. No, I don't think that is the case by any stretch of the imagination. But it got me reading more about NoSQL databases. Let's travel down this path to understand why NoSQL databases are so popular today and how they got started.

To get started, let's first try to understand the CAP theorem. There are three ingredients in the CAP theorem, namely:

  1. Consistency - every node in the cluster sees the same data at any given instant of time.

  2. Availability - the system is always able to serve: no downtime and the lowest possible response time.

  3. Partition tolerance - the system continues to serve even if a link breaks and the cluster is split into two or more parts. Messages may be lost and some node may crash, but you still want to be able to serve.


Now, the CAP theorem states that you can take home only two out of these three. This is where the difference between RDBMS and NoSQL lies! Let's look at the three combinations we can form here[2]:

  1. CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same.

  2. CP - data is consistent between all nodes, and the system maintains partition tolerance by becoming unavailable when a node goes down.

  3. AP - nodes remain online even if they can't communicate with each other, and they resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data, either during or after the partition (a toy sketch of this behaviour follows this list).
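
To make the AP case a bit more concrete, here is a toy Java sketch (my own illustration, not any real database's code) of two replicas that keep accepting writes during a partition and then reconcile with a naive last-write-wins merge once the partition heals:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the AP trade-off: both replicas keep accepting writes
// during a partition and reconcile afterwards, so readers may briefly see
// stale or conflicting data.
public class ApSketch {

    static class Replica {
        final Map<String, String> values = new HashMap<>();
        final Map<String, Long> versions = new HashMap<>();

        void write(String key, String value, long ts) {
            values.put(key, value);
            versions.put(key, ts);
        }

        String read(String key) {
            return values.get(key);
        }

        // Naive last-write-wins reconciliation once the partition heals.
        void syncFrom(Replica other) {
            for (Map.Entry<String, Long> e : other.versions.entrySet()) {
                String key = e.getKey();
                long ts = e.getValue();
                if (!versions.containsKey(key) || versions.get(key) < ts) {
                    write(key, other.values.get(key), ts);
                }
            }
        }
    }

    public static void main(String[] args) {
        Replica nodeA = new Replica();
        Replica nodeB = new Replica();

        // Network partition: each side keeps serving writes independently.
        nodeA.write("item-42:stock", "1", 100);
        nodeB.write("item-42:stock", "0", 105);

        // During the partition the replicas disagree (no consistency).
        System.out.println("A sees " + nodeA.read("item-42:stock")); // 1
        System.out.println("B sees " + nodeB.read("item-42:stock")); // 0

        // Partition heals: replicas resync and converge on the newer write.
        nodeA.syncFrom(nodeB);
        nodeB.syncFrom(nodeA);
        System.out.println("After resync, A sees " + nodeA.read("item-42:stock")); // 0
    }
}
```

The point is simply that both nodes stayed available the whole time, at the cost of briefly disagreeing about the data; real AP systems such as Dynamo or Riak use far more careful reconciliation than this (vector clocks, read repair and so on).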


Now let's look at some popular NoSQL users and see why NoSQL is a good fit for them, while RDBMS, in my opinion, will continue to co-exist.
Let's talk about amazon.com first. Their business model is such that they want to be available all the time. They wouldn't want their site to be down or to have a higher response time at any moment. So it is very essential for them to have the 'A' and 'P' attributes of the CAP theorem, and they would rather give away the 'C' to an extent. An apology from amazon.com saying "we don't have this item, although we showed you it was available earlier" is not as bad as the site itself going down. So if there was one item and two people simultaneously put it into their carts, that could happen; but given their business model they can have alternatives to save their customers from this situation. For instance, they could always keep some extra items in stock.

Similarly, think of facebook.com: suppose you post a picture on your wall. It's not a big deal if one of your friends can see that picture right away and another sees it a few moments later. Again, Facebook doesn't care as much about consistency as it does about availability.

Let's now think about why a cluster or a farm of servers was needed in the first place. It's because everything you do on the internet is stored in a database. Google, Facebook, Amazon, etc. keep all this data to provide personalized search, recommendations and so on. This huge amount of data, on the order of petabytes or zettabytes, cannot be stored on one disk. Trying to store it all on one disk and replicate it to more such disks is a pain, which is why Google chose to use a farm of several servers with smaller disks. Traditional RDBMSs were built to serve best from a single disk, and that is why the people with this huge data came up with BigTable, DynamoDB, etc.

And as we near the end of this article, it's important to have a look at some NoSQL databases. There are many out there, and they can be broadly divided into 4 categories:

  1. Column: HBase, Accumulo

  2. Document: MarkLogic, MongoDB, Couchbase

  3. Key-value: Dynamo, Riak, Redis, Cache, Project Voldemort (see the sketch after this list)

  4. Graph: Neo4J, Allegro, Virtuoso
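
As a tiny taste of the key-value model, here is a sketch in Java using the Jedis client for Redis; the keys and values are made up, and it assumes a Redis server running locally:

```java
import redis.clients.jedis.Jedis;

// Sketch of the key-value model: no schema and no joins,
// just values addressed by a key you design yourself.
public class KeyValueExample {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        jedis.set("user:42:name", "Alice");
        jedis.set("user:42:cart", "item-7,item-13");

        System.out.println(jedis.get("user:42:name")); // Alice

        jedis.close();
    }
}
```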


Note that there isn't a concrete line between the 4 types. As an example, the document-oriented databases and the key-value databases can resemble each other to some extent at times, so the boundaries are a little fuzzy. To conclude, I would say NoSQL databases are popular and are good in certain circumstances, but when you come to something like, say, banking, you really need ACID compliance and therefore an RDBMS. So in my opinion they will co-exist, just as they do today.

References:

  1. http://en.wikipedia.org/wiki/CAP_theorem

  2. http://stackoverflow.com/questions/12346326/nosql-cap-theorem-availability-and-partition-tolerance

  3. http://stackoverflow.com/questions/16779348/does-the-cap-theorem-imply-that-acid-is-not-possible-for-distributed-databases

  4. http://en.wikipedia.org/wiki/NoSQL

  5. https://www.youtube.com/watch?v=qI_g07C_Q5I

Tuesday, December 10, 2013

A sneak peek into the Hadoop ecosystem

I jumped into the IT field for the love I had for it, and the biggest distraction for me has always been trying to know something I know nothing about. Ever since some of my colleagues started working with Hadoop, I have been wanting to read about it. The continuous drive has been a feeling that there has to be something really nice about big data, given how cool everyone talking about it seems. And finally, the latest post on Facebook by a professor saying:
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...

Well, this quote is pretty famous by now, and I must acknowledge it was something which pushed me into studying more about what this actually is and why it is so cool! Hopefully the next time I bump into some cool people I'll have something to talk about as well :D . I finally found some time and energy to study some of it this weekend. Here is a high-level overview of the picture[1] I have in my mind right now:

[Figure: the Hadoop ecosystem]

HDFS

The base of this ecosystem is the Hadoop Distributed File System, derived from Google's whitepaper on the Google File System (GFS). Let's take a simple example[2] to understand HDFS. Say we have a record containing the phone numbers of all the people in a city, and we use a 5-machine Hadoop cluster to keep this data. Let's say we want a replication factor of 3, which means every chunk of data will have 2 extra backup copies stored on different machines. Further, let's assume that you divide the hard disks on your 5 machines into 78 pieces (26 chunks, one per letter, times 3 copies each). You store the phone numbers of all the people whose names start with 'A' on one piece of a disk and keep its backups on two other machines. Similarly, do that for all the people whose names start with 'B' through 'Z'. In this way you organize your data across the 5 machines.
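
As a rough illustration, here is a minimal Java sketch of copying a file into HDFS with a replication factor of 3; the host name and file paths are made up, and it assumes a running cluster with the Hadoop client libraries on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS with a replication factor of 3.
public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode host
        conf.set("dfs.replication", "3");                 // 1 copy + 2 backups

        FileSystem fs = FileSystem.get(conf);
        // HDFS splits the file into blocks and spreads the replicas
        // across different machines in the cluster.
        fs.copyFromLocalFile(new Path("/tmp/phonebook.csv"),
                             new Path("/data/phonebook.csv"));
        fs.close();
    }
}
```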

MapReduce: To generate a report from all the data, you would now need MapReduce code. The MapReduce API is available open source for use, but you will have to write some good Java code to run the map jobs in parallel on all those machines and get the results back (reduce) to generate the final report.
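
Here is a sketch of what such a job could look like for our phone-book example, counting how many entries start with each letter; the input format (one "name,number" per line) and the paths are assumptions on my part:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count how many phone-book entries start with each letter.
public class LetterCount {

    // Map: "Alice,555-0100" -> ("A", 1)
    public static class LetterMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String name = line.toString().split(",")[0].trim();
            if (!name.isEmpty()) {
                context.write(new Text(name.substring(0, 1).toUpperCase()), ONE);
            }
        }
    }

    // Reduce: ("A", [1, 1, 1, ...]) -> ("A", total)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text letter, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(letter, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "letter count");
        job.setJarByClass(LetterCount.class);
        job.setMapperClass(LetterMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/phonebook.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/data/letter-counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and submit it to the cluster; the framework takes care of running the mappers near the data and shuffling the intermediate results to the reducers.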

Hive & Pig Frameworks: Writing MapReduce jobs isnt a piece of cake. So, Facebook made the Hive framework allow an easy way to do the same. Hive uses SQL-ish syntax to do things on the data lying on the Hadoop layer below. Pig is a similar framework built by Yahoo but it is more of a data flow language. In Hive a user can specify that data from two tables must be joined with an easy syntax like SQL, but not what join implementation to use. Pig is procedural and though you will have a little more to write there it does allows you the flexibility to specify an implementation of how you would like to join different chunks of data.

HBase is a NoSQL database that allows you to work with the enormous amount of data stored on the Hadoop system. It is a column-oriented database management system, well suited for sparse data sets, which are common in many big data use cases. HBase does not support a structured query language like SQL; HBase applications are written in Java, much like a typical MapReduce application.
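
For illustration, here is a small sketch using the HBase Java client (the newer Connection/Table API; older releases used HTable); the table, column family and qualifier names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: instead of SQL, HBase is driven through a Java API of puts and gets
// addressed by row key, column family and column qualifier.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("phonebook"))) {

            // Write one sparse row: only the columns that exist are stored.
            Put put = new Put(Bytes.toBytes("alice"));
            put.addColumn(Bytes.toBytes("contact"), Bytes.toBytes("phone"),
                          Bytes.toBytes("555-0100"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("alice")));
            byte[] phone = result.getValue(Bytes.toBytes("contact"),
                                           Bytes.toBytes("phone"));
            System.out.println(Bytes.toString(phone));
        }
    }
}
```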

ZooKeeper is a centralized infrastructure that helps synchronize activity across the cluster. It maintains common objects needed in large cluster environments, such as configuration information and a hierarchical naming space. Applications can leverage these services to coordinate distributed processing across large clusters.
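
As a sketch of the kind of coordination ZooKeeper enables, here is a Java snippet that stores a shared configuration value in a znode so every node in the cluster can read the same setting; the connection string, znode path and value are made up:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: keep a piece of shared configuration in a znode so that every
// node in the cluster reads the same value from one place.
public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });

        String path = "/max-connections";
        byte[] value = "100".getBytes();

        // Create the znode on first use, update it afterwards.
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } else {
            zk.setData(path, value, -1);
        }

        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored));
        zk.close();
    }
}
```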

Sqoop can be used to import data from an RDBMS (say MySQL, Oracle, or MS SQL Server) into HDFS, or vice versa.

Flume is the tool for gathering data from various sources into your Hadoop system. It's mostly used for log data.

Closing Remarks:
To end with, I would like to state that I am by no means an expert on big data. In fact, I am a beginner just interested in knowing this uber-cool technology. With this post, all I aim to do is start a discussion so that we can start learning it together, bit by bit! So if you find a mistake herein, please do help me learn it better :)

References:
1) http://www.youtube.com/watch?v=R-qjyEn3bjs
2) http://www-01.ibm.com/software/data/infosphere/hadoop/