Hadoop

Kayshav.com

Apache Hadoop framework is a open source software library for distributed processing of large data sets across clusters of computers using simple programming models. It offers scale-up features from a single server to thousands of servers with usage of computing power and storage in each server. The software library has features that detect failures at application layer in any server (or cluster) to achieve high availability (HA) service. Redundancy is also built-in to achieve high availability.

The Modules of Hadoop

Hadoop Common - consists of common utilities to support various hadoop modules.

Hadoop Distributed File System ( HDFS) - provides high through-put for application data management that can be operated on commodity hardware. HDFS is used to scale single Apache Hadoop cluster to a vary large number of nodes (from hundreds to thousands as required).

Hadoop YARN - provides job/tasks scheduling features, data processing, cluster resource management - to keep the system up (HA)

Hadoop Ozone - provides object data store features

MapReduce - a programming paradigm where Map feature transforms data by breaking into key value pairs (tuple) and Reduce uses output data of Map operation and aggregates (or combine) the tuples into smaller datasets. The mapping is performed first and its output is use by the reduce feature. The programming languages used in MapReduce are C++, Java, and Python. Use of parallel processing algorithms and with reduced movement of data massive (in range of peta bytes) data can processed

Detail Reference Apache Hadoop