About Hadoop

The Apache™ Hadoop project develops open-source software for reliable, scalable, distributed computing.
Process Big Data on clusters of commodity hardware
– Vibrant open-source community
– Many products and tools reside on top of Hadoop

Hadoop exploits parallel I/O – 100 drives working at the same time can read 1 TB of data in about 2 minutes
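The 2-minute figure can be checked with back-of-envelope arithmetic. The per-drive throughput below (~85 MB/s sequential) is an assumed, typical commodity-drive number, not one stated in the source:

```python
# Back-of-envelope check: reading 1 TB with one drive vs. 100 in parallel.
# Assumption: a commodity drive sustains roughly 85 MB/s on sequential reads.
DRIVE_MBPS = 85
DATA_MB = 1_000_000  # 1 TB in MB (decimal)

single_drive_minutes = DATA_MB / DRIVE_MBPS / 60
hundred_drive_minutes = DATA_MB / (100 * DRIVE_MBPS) / 60

print(f"1 drive:    {single_drive_minutes:.0f} minutes")   # a few hours
print(f"100 drives: {hundred_drive_minutes:.1f} minutes")  # about 2 minutes
```

With these assumed numbers, one drive needs over three hours while 100 drives finish in roughly two minutes, which is the scaling intuition behind HDFS striping data across many machines.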
Hadoop principles:

• Scale-Out rather than Scale-Up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract the complexity of distributed and concurrent applications
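The "bring code to data" and "abstract complexity" principles are easiest to see in the MapReduce programming model: the programmer writes only a map and a reduce function, and the framework handles distribution. Below is a toy, single-process sketch of that model, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (word, 1) pair for every word in every input record
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop scales out", "hadoop brings code to data"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts)  # {'hadoop': 2, 'scales': 1, ...}
```

In real Hadoop, the map function is shipped to the nodes holding the data blocks (code to data), and the shuffle, retries, and parallelism are the framework's job, not the programmer's.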

It is harder and more expensive to Scale-Up
– Add additional resources to an existing node (CPU, RAM)
– Moore's Law can't keep up with data growth
– New units must be purchased if the required resources cannot be added
– Also known as scaling vertically
• Scale-Out
– Add more nodes/machines to an existing distributed system
– The software layer is designed for node additions and removals
– Hadoop takes this approach – a set of nodes are bonded together as a single distributed system
– Very easy to scale down as well
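The scale-out idea can be sketched in a few lines: capacity grows by adding nodes, and the software layer absorbs the change. The round-robin placement below is a deliberately trivial stand-in for HDFS's real block-placement policy:

```python
# Minimal scale-out sketch: data blocks are spread across whatever nodes
# exist; adding a node simply spreads the same data more thinly.
def place_blocks(num_blocks, nodes):
    """Assign each block to a node round-robin; returns node -> block ids."""
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return placement

three_nodes = place_blocks(12, ["n1", "n2", "n3"])
four_nodes = place_blocks(12, ["n1", "n2", "n3", "n4"])  # scale out: add n4

print({n: len(b) for n, b in three_nodes.items()})  # 4 blocks per node
print({n: len(b) for n, b in four_nodes.items()})   # 3 blocks per node
```

No node was upgraded (scale-up); the cluster just got wider, and per-node load dropped, which is exactly why node addition and removal must be cheap in the software layer.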

Given a large number of machines, failures are common
– Large warehouses may see machine failures weekly or even daily
• Hadoop is designed to cope with node failures
– Data is replicated
– Tasks are retried
Hadoop Ecosystem:

At first Hadoop was mainly known for two core products:
– HDFS: Hadoop Distributed FileSystem
– MapReduce: Distributed data processing framework
Today, in addition to HDFS and MapReduce, the term also represents a multitude of products:
– HBase: Hadoop column database; supports batch and random reads and limited queries
– Zookeeper: Highly-available coordination service
– Oozie: Hadoop workflow scheduler and manager
– Pig: Data processing language and execution environment
– Hive: Data warehouse with SQL interface

Hadoop Distributions aim to resolve version incompatibilities
• A distribution vendor will
– Integration-test a set of Hadoop products
– Package Hadoop products in various installation formats
• Linux packages, tarballs, etc.
– Distributions may provide additional scripts to execute
– Some vendors may choose to backport features and bug fixes made by Apache
– Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache's repository

Distribution Vendors:

• Cloudera Distribution for Hadoop (CDH)
• MapR Distribution
• Hortonworks Data Platform
• Apache BigTop Distribution
• Greenplum HD Data Computing Appliance

Cloudera has taken the lead on providing a Hadoop Distribution
– Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles
• Most popular distribution
– http://www.cloudera.com/hadoop
– 100% open-source
• Cloudera employs a large percentage of core Hadoop committers
• CDH is provided in various formats
– Linux packages, virtual machine images, and tarballs

Integrates the majority of popular Hadoop products
– HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, Zookeeper, Flume
• CDH4 is used in this class

Common OSes supported:

• Red Hat Enterprise Linux
• Oracle Linux
• SUSE Linux Enterprise Server
