In the last post I described what I understand by BigData. The next step is to understand the BigData ecosystem. There are lots of tools in use in the BigData world; some of them are paid and many of them are free. In this post I will go through some of the basic terms and tools that you have to know to claim your arrival in the BigData world. Yes, you can also use this information to impress your colleagues 😉. So, let’s get started.
Hadoop: This is by far the most popular term, and it is sometimes used as a synonym for BigData. So it is about time you understood what exactly Hadoop is.
Hadoop is a Java-based tool that gives you two important capabilities necessary to survive and conquer the BigData wave:
- Distributed data storage: Through HDFS (Hadoop Distributed File System), which is based on the Google GFS concept, it allows you to store huge amounts of data in a distributed fashion.
- Distributed data processing: Through MapReduce (based on the Google MapReduce whitepaper), it allows you to process that huge amount of data.
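To make the MapReduce idea a bit more concrete, here is a minimal sketch in plain Python that counts words on a single machine. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another on a list of lines."""
    pairs = []
    for line in lines:  # on a cluster, different machines would map different chunks
        pairs.extend(map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Running it prints `{'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}`. The point is only the shape of the computation: map emits key/value pairs, the pairs are grouped by key, and reduce folds each group into one answer.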
Hadoop was also one of the first BigData tools to come out, but it is no longer a single offering: there are many distributions available, endorsed and managed by different companies.
Now, you might ask: what is a distribution?
This is very similar to the Linux distributions that you might have heard about. For example, Cloudera took the core of Hadoop, invested in it, built proprietary features around it, and called it CDH. Similarly, Microsoft partnered with Hortonworks and created the first Windows-based distribution of Hadoop. So each distribution gives you the basic set of features expected of Hadoop, plus some features that are specific to that distribution.
Major Hadoop Distributions:
Below are some of the major distributions that are used in production systems these days.
- Apache Hadoop
Can you explain the basic working of Hadoop?
Hadoop uses distributed computing principles to store huge amounts of data in its distributed file system, known as HDFS (Hadoop Distributed File System). When you store data in Hadoop, it is divided into blocks (64 MB by default in Hadoop 1.X) and stored on different computers that are part of the Hadoop cluster. It uses a master-slave architecture, where one of the computers in the cluster is designated as the master (NameNode) and the others are designated as slaves (DataNodes).
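The block arithmetic is easy to try out. The sketch below is plain Python, not an HDFS API: `split_into_blocks` and `place_blocks` are hypothetical helpers, and the round-robin placement is an assumption made purely for illustration. It also leaves out replication, which real HDFS adds on top.

```python
BLOCK_SIZE_MB = 64  # the default block size quoted above; configurable on a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


def place_blocks(block_sizes, datanodes):
    """Assign block indexes to DataNodes round-robin (an illustrative assumption)."""
    placement = {node: [] for node in datanodes}
    for index in range(len(block_sizes)):
        node = datanodes[index % len(datanodes)]
        placement[node].append(index)
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(200)
    print(blocks)
    print(place_blocks(blocks, ["datanode1", "datanode2", "datanode3"]))
```

A 200 MB file becomes four blocks, `[64, 64, 64, 8]`, and with three DataNodes the first node ends up holding blocks 0 and 3 while the other two hold one block each. The mapping from file to blocks to machines is the kind of metadata the master has to keep track of.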
The master maintains the metadata about the slaves and the data stored in the Hadoop cluster. The master side also controls and distributes the tasks to the slave nodes for execution (strictly speaking, in Hadoop 1.X this scheduling is done by a separate master process called the JobTracker, while the NameNode looks after the storage metadata). The master then collates the results received from the various nodes and makes them available to the person who submitted the job. It also takes care of faults and resubmits the work if a slave node is unable to complete its task.
*This is a very basic explanation of Hadoop in layman's terms. There are several moving parts and concepts that make Hadoop work, which I will try to cover later.
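As a rough illustration of "distribute, collate, retry", here is a toy scheduler in plain Python. Everything in it is hypothetical (`run_job`, `healthy_worker`, `broken_worker`, the retry limit of three); real Hadoop scheduling is far more involved, so treat this only as a picture of the idea.

```python
def healthy_worker(numbers):
    """A slave node that finishes its task: it sums the numbers it was given."""
    return sum(numbers)


def broken_worker(numbers):
    """A slave node that always fails, standing in for a crashed machine."""
    raise RuntimeError("node unavailable")


def run_job(tasks, workers, max_attempts=3):
    """Give each task to a worker, retry on another worker if it fails, then collate."""
    results = []
    for task_id, payload in enumerate(tasks):
        for attempt in range(max_attempts):
            # Pick a different worker on every attempt so a dead node is skipped.
            worker = workers[(task_id + attempt) % len(workers)]
            try:
                results.append(worker(payload))
                break
            except RuntimeError:
                continue  # this node failed; hand the task to the next one
        else:
            raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")
    return sum(results)  # the master combines the partial results


if __name__ == "__main__":
    print(run_job([[1, 2, 3], [4, 5], [6]], [broken_worker, healthy_worker]))
```

This prints `21`: the first and third tasks land on the broken node, fail once, and are re-run on the healthy node, so the person who submitted the job still gets a complete answer.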
Why is there so much written about Hadoop 1.X VS Hadoop 2.X?
As we now know, Hadoop started as a hobby or side project, and the kind of love and attention it got was never expected. When so many people started using it, they came across major limitations that kept it from being a good fit for production systems. So the community started building a new, better, more robust Hadoop to overcome the limitations of the original Hadoop 1.X.
Hadoop 2.X is essentially a new product: a major rewrite that completely changes the Hadoop architecture for the better. It also introduces YARN (Yet Another Resource Negotiator), which is a big step toward making Hadoop versatile and allowing seamless integration with execution frameworks other than MapReduce.