Hadoop Apache Pig

Education is not limited to just classrooms. It can be gained anytime, anywhere... - Ravi Ranjan (M.Tech-NIT)

Apache Pig

Pig raises the level of abstraction for processing large datasets. It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. It is an open-source platform originally developed at Yahoo.

Advantages of Pig

  • Code reuse
  • Faster development
  • Fewer lines of code
  • Schema and type checking

Pig is made up of two pieces:

  • First is the language used to express data flows, known as Pig Latin.
  • Second is the execution environment used to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.

A Pig Latin program is a series of operations, or transformations, applied to the input data to produce output. These operations describe a data flow that the Pig execution environment translates into an executable representation and then runs.

What makes Pig Hadoop popular?

  • Easy to learn, read, write, and implement if you already know SQL.
  • It implements a multi-query approach, so data can be processed by several queries in a single pass.
  • Provides a number of nested data types such as maps, tuples, and bags, which are not readily available in MapReduce, along with data operations like filters, ordering, and joins.
  • It has a broad user base: reportedly up to 90% of Yahoo’s MapReduce work and up to 80% of Twitter’s MapReduce work is done through Pig, and companies such as Salesforce, LinkedIn, and Nokia also use Pig extensively.

Apache Pig is a platform for managing large data sets that provides a high-level language for analyzing the data as required. Pig supplies the infrastructure needed to evaluate these programs. An advantage of Pig programming is that it handles parallel execution transparently, which lets it manage very large volumes of data. Programs for this platform are written in the textual language Pig Latin.

Pig Latin comes with the following features:

  • Simple programming: it is easy to code, execute, and manage programs.
  • Better optimization: the system can automatically optimize execution.
  • Extensibility: users can write their own functions to carry out special-purpose processing.
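Extensibility in practice means user-defined functions (UDFs). A hedged sketch of registering and applying a hypothetical UDF (the jar name, file name, and class com.example.Upper are illustrative, not real artifacts):

```pig
-- Register a hypothetical jar containing a custom Java function
REGISTER 'myudfs.jar';
data  = LOAD 'input.txt' AS (name: chararray);
-- Apply the hypothetical UDF com.example.Upper to each record
upper = FOREACH data GENERATE com.example.Upper(name);
```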

Pig can be used for following purposes:

  • ETL data pipeline
  • Research on raw data
  • Iterative processing.
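For example, an ETL-style pipeline can be sketched in Pig Latin as follows (the file names, schema, and filter condition are hypothetical):

```pig
-- Extract: load raw tab-separated log records (hypothetical file and schema)
raw = LOAD 'logs/access.log' USING PigStorage('\t')
      AS (ip: chararray, url: chararray, status: int);
-- Transform: keep only successful requests
ok  = FILTER raw BY status == 200;
-- Load: write the cleaned data back to the file system
STORE ok INTO 'output/clean_logs';
```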

The scalar data types in Pig are int, long, float, double, chararray, and bytearray. The complex data types in Pig are the map, the tuple, and the bag.

Map: A set of key-value pairs. The keys are of type chararray, while the values can be of any Pig data type, including a complex type.

Example: ['city'#'bang', 'pin'#560001]

Here city and pin are the keys, mapped to the values 'bang' and 560001.

Tuple: An ordered collection of fields, each of which can be of any data type. A tuple has a fixed length, and its fields are ordered in sequence.

Example: ('Bangalore', 560001)

Bag: An unordered collection of tuples; the tuples in a bag are separated by commas.

Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
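These complex types can be combined in a load schema. A minimal sketch, assuming a hypothetical input file and field names:

```pig
-- Hypothetical input: each record has a name, a (city, pin) tuple,
-- a bag of phone-number tuples, and a map of preferences
emp = LOAD 'employees.txt'
      AS (name: chararray,
          address: tuple(city: chararray, pin: int),
          phones: bag{t: tuple(number: chararray)},
          prefs: map[]);
```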


LOAD function: The LOAD function loads data from the file system. It is known as a relational operator. The first step in a data-flow language is to specify the input, which is done using the keyword ‘load’.
The LOAD syntax is

LOAD 'mydata' [USING function] [AS schema];
Example: A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
The main relational operators in Pig are as follows:
foreach, order by, filter, group, distinct, join, limit.


foreach: Takes a set of expressions and applies them to every record in the data pipeline, passing the result to the next operator.

A = LOAD 'input' AS (emp_name: chararray, emp_id: long, emp_add: chararray, phone: chararray, preferences: map[]);
B = foreach A generate emp_name, emp_id;


filter: Contains a predicate and lets us select which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;
Here alias indicates the name of the relation, BY is a required keyword, and the expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
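The remaining operators can be sketched in the same style (the relations emps and depts and their fields are hypothetical):

```pig
-- Hypothetical relations: emps(name, dept, salary), depts(dept, location)
grouped = GROUP emps BY dept;                 -- group records by department
joined  = JOIN emps BY dept, depts BY dept;   -- inner join on dept
uniq    = DISTINCT emps;                      -- remove duplicate records
sorted  = ORDER emps BY salary DESC;          -- sort by salary, highest first
top3    = LIMIT sorted 3;                     -- keep only the first 3 records
```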

Running Pig Programs

There are three ways of running Pig programs, all of which work in both local and MapReduce mode:


Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.


Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using the run and exec commands.


You can run Pig programs from Java, in much the same way that you can use JDBC to run SQL programs from Java.

Example: Word count in Pig

Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS count;
Results = ORDER Counts BY count DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';