MapReduce and YARN

Education is not limited to just classrooms. It can be gained anytime, anywhere... - Ravi Ranjan (M.Tech-NIT)


MapReduce is the main data processing component of Hadoop. It is a programming model for processing large data sets: it defines the data processing tasks and distributes them across the nodes of the cluster. It consists of two phases –

  • Map
  • Reduce

Map converts an input dataset into another set of data in which individual elements are broken down into key/value pairs.

The Reduce task takes the output files of the map phase as its input and combines those data tuples into a smaller set of tuples. It always executes after the map job is done.

Features of the MapReduce system

Features of MapReduce are as follows:

  • A complete framework is provided for MapReduce execution.
  • It abstracts the developer from the complexity of distributed programming.
  • Partial failure of the processing cluster is expected and tolerated.
  • Redundancy and fault tolerance are built in.
  • The MapReduce programming model is language independent.
  • Parallelization and distribution are automatic.
  • It enables local processing of data on the nodes where the data resides.
  • It follows a shared-nothing architectural model.
  • It manages all inter-process communication.
  • It manages, in parallel, the distributed servers that run the various tasks.
  • It manages all communications and data transfers between the various parts of the system.
  • It provides redundancy and failure handling for the overall process.

MapReduce follows these simple steps:

  1. Execute the map function on each input received.
  2. The map function emits key/value pairs.
  3. Shuffle, sort and group the outputs.
  4. Execute the reduce function on each group.
  5. Emit the output results on a per-group basis.
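The five steps above can be sketched in plain Java with no Hadoop dependency (class and method names here are illustrative, not part of any Hadoop API), using word count as the example:

```java
import java.util.*;

public class MapReduceSteps {
    // Steps 1-2: the map function emits a (key, value) pair per word
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Step 3: shuffle, sort and group the emitted values by key
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Steps 4-5: reduce each group and emit one result per key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"deer bear river", "car car river"}) {
            pairs.addAll(map(line));                          // steps 1-2
        }
        SortedMap<String, List<Integer>> groups = shuffle(pairs); // step 3
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            System.out.println(g.getKey() + "\t" + reduce(g.getValue())); // steps 4-5
        }
    }
}
```

In a real cluster each step runs distributed across many nodes; this single-process sketch only shows the data flow.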

Map Function

It operates on each key/value pair of the data and transforms it according to the transformation logic provided in the map function. The map function always produces key/value pairs as its output.

Map (key1, value1) ->List (key2, value2)

Reduce Function

It takes the list of values for each key and transforms the data based on the (aggregation) logic provided in the reduce function.

Reduce (key2, List (value2)) ->List (key3, value3)
Map Function for Word Count

// Inside a class extending Mapper<LongWritable, Text, Text, IntWritable>
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

Reduce Function for Word Count

// Inside a class extending Reducer<Text, IntWritable, Text, IntWritable>
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}


MapReduce is the framework used for processing large amounts of data on commodity hardware across a huge cluster. MapReduce is a powerful method of processing data when a large number of nodes are connected to the cluster. The two important tasks of the MapReduce algorithm are Map and Reduce.

The purpose of the Map task is to take a large set of data and convert it into another set of data that is broken down into tuples (rows) or key/value pairs. The Reduce task then takes the output of the Map task as its input and combines those data tuples into a much smaller set of tuples. The Reduce task always follows the Map task.

The biggest strength of the MapReduce framework is its scalability. Once a MapReduce program is written, it can easily be scaled to work over a cluster with hundreds or even thousands of nodes. In this framework, computation is sent to where the data resides.

Hadoop Map Reduce – Key Features & Highlights




PayLoad – The applications that implement the Map and Reduce functions.

Mapper – Maps the input key/value pairs to a set of intermediate key/value pairs.

NameNode – The node that manages the HDFS.

DataNode – The node where data resides before any processing takes place.

MasterNode – The node where the JobTracker runs and which receives job requests from clients.

SlaveNode – The node where the Map and Reduce programs run.

JobTracker – Schedules jobs and assigns them to the TaskTrackers.

TaskTracker – Tracks the task and reports its status to the JobTracker.

Job – An execution of a Mapper and Reducer across a dataset.

Task – An execution of a Mapper or a Reducer on a slice of data.

Task Attempt – An attempt to execute a task on a SlaveNode.

Hadoop YARN Technology

YARN stands for Yet Another Resource Negotiator. It is a cluster management technology and an open-source distributed processing framework. The main objective of YARN is to construct a framework on Hadoop that allows cluster resources to be allocated to specific applications, treating MapReduce as just one of those applications.

It separates the tasks of the job tracker into separate entities. The job tracker handles both job scheduling, which matches tasks with task trackers, and task progress monitoring, which takes care of the tasks, restarts failed or slow tasks, and does task bookkeeping such as maintaining counter totals.

It divides these two roles between two independent daemons: a resource manager, which manages the use of resources across the cluster, and an application master, which manages the lifecycle of applications running on the cluster.

The application master negotiates with the resource manager for cluster resources, expressed in terms of a number of containers, each with a certain memory limit, and then runs application-specific processes in those containers.

The containers are overseen by node managers running on the cluster nodes, which ensure that an application does not use more resources than have been allocated to it.
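As a rough illustration, the container memory limits mentioned above are typically set in the cluster's yarn-site.xml; the property names below are standard YARN settings, but the values are examples only, not recommendations:

```xml
<property>
  <!-- Total memory a NodeManager offers for containers on its node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <!-- Smallest container the scheduler will allocate -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <!-- Largest container a single request may ask for -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
```

A request larger than the maximum allocation is rejected, and requests are rounded up to a multiple of the minimum allocation.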

It is a very efficient technology for managing a Hadoop cluster. YARN is part of Hadoop version 2, developed under the aegis of the Apache Software Foundation.

YARN has introduced a completely new and innovative way of processing data and now rightly sits at the center of the Hadoop architecture. Using this technology, it is possible to stream data in real time, use interactive SQL, process data with multiple engines, and manage data with batch processing, all on a single platform.

Map Reduce on YARN

MapReduce on YARN includes more entities than classic MapReduce. They are:

  • Client – Submits the MapReduce job.


  • YARN resource manager – Manages the allocation of compute resources on the cluster.
  • YARN node managers – Launch and monitor the compute containers on machines in the cluster.
  • MapReduce application master – Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
  • Distributed file system (normally HDFS) – Shares the job files between the other entities.

How does the YARN technology work?

  • YARN lets Hadoop provide enterprise-level solutions, helping organizations achieve better resource management. It is the main platform for delivering consistent solutions, a high level of security, and data governance across the complete spectrum of the Hadoop cluster.
  • Various technologies that reside within the data center can also benefit from YARN. It makes it possible to process data and scale storage linearly in a very cost-effective way. Using YARN, applications can be built that access data and run in a Hadoop ecosystem on a consistent framework.

Some of the features of YARN

  • High degree of compatibility: Applications created using the MapReduce framework can easily run on YARN.
  • Better cluster utilization: YARN allocates the cluster resources in an efficient and dynamic manner, which leads to much better utilization than in previous versions of Hadoop.
  • Utmost scalability: As the number of nodes in the Hadoop cluster grows, the YARN Resource Manager ensures that user requirements are met and that the processing power of the data center does not become a bottleneck.
  • Multi-tenancy: Various engines can efficiently access data on the Hadoop cluster at the same time, thanks to YARN being a highly versatile technology.

Key components of YARN

YARN came into existence because there was an urgent need to separate two distinct tasks in the Hadoop ecosystem, handled by the entities known as the TaskTracker and the JobTracker. Consider the key components of the YARN technology listed below.

  • Global Resource Manager
  • Application Master per application
  • Node Manager per slave node
  • Container per application that runs on a Node Manager

Thus the Node Manager and the Resource Manager became the foundation on which the new distributed applications work. The Resource Manager allocates the cluster's resources to the system's applications. The Application Master works along with the Node Manager, within a specific framework, to get resources from the Resource Manager and to manage the various task components.

A scheduler works within the Resource Manager (RM) framework for the right allocation of resources, ensuring that the constraints of user limits and queue capacities are adhered to at all times. The scheduler provides the right resource as per the requirements of each application.
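The capacity-constrained allocation just described can be sketched as a toy, single-queue scheduler in plain Java (all names here are illustrative; real YARN schedulers also handle queues, locality, and vcores):

```java
public class ToyScheduler {
    // Memory still available to grant, in MB (the simplified "queue capacity")
    private int freeMemoryMb;

    public ToyScheduler(int totalMemoryMb) {
        this.freeMemoryMb = totalMemoryMb;
    }

    // Grant a container of the requested size, or refuse the request
    // if it would exceed the remaining capacity.
    public boolean allocate(int requestMb) {
        if (requestMb > freeMemoryMb) {
            return false;
        }
        freeMemoryMb -= requestMb;
        return true;
    }

    // When a container finishes, its memory returns to the pool.
    public void release(int mb) {
        freeMemoryMb += mb;
    }

    public static void main(String[] args) {
        ToyScheduler rm = new ToyScheduler(4096);
        System.out.println(rm.allocate(1024)); // granted
        System.out.println(rm.allocate(4096)); // refused: exceeds remaining capacity
        rm.release(1024);
        System.out.println(rm.allocate(4096)); // granted after release
    }
}
```

The refusal-then-retry pattern mirrors how an Application Master keeps asking the scheduler until capacity frees up.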

The Application Master coordinates with the scheduler in order to obtain the right resource containers, keeps an eye on their status, and tracks the progress of the process.

The Node Manager manages the application containers, launching them when required, tracks their usage of resources such as memory, processor, network, and disk, and gives detailed reports to the Resource Manager.