Wednesday, February 10, 2016

Apache Pig - Maximum Average Salary Analytics

Here I have a large dummy data set of employee records. I will explain how to find the department with the maximum average salary in this data set using Apache Pig.

Salary Data Analytics

Please download the dummy data set from Employee Dummy Data.

I am assuming Apache Hadoop and Pig are up and running on your machine.

Create a Directory and Load the employee_dataset Data into HDFS

Navigate to the directory that contains the dummy data file, then run the following commands:

$ hadoop fs -mkdir /user/training/PIGDATA

$ hadoop fs -mkdir /user/training/PIGDATA/PIG_UDF_DATA

$ hadoop fs -copyFromLocal employee_dataset /user/training/PIGDATA/PIG_UDF_DATA

$ pig -x mapreduce

Processing Data using Apache Pig

Step: 1 - Load dataset with column names and datatypes

grunt> employeeData = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset' USING PigStorage(',') AS (emp_id:int, emp_name:chararray, job_title:chararray, dept_id:int, salary:float);
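To make the effect of the schema in the LOAD statement easier to picture, here is a small Python sketch of how one comma-separated line could be turned into a typed record. It is only an illustration, not how Pig is implemented. The function names `parse_line` and `_cast` and the sample rows are made up for this example, and I am assuming that a missing or malformed field ends up as null (None here), which is how I understand PigStorage with a schema to behave.

```python
from typing import Optional, Tuple

# (emp_id, emp_name, job_title, dept_id, salary), mirroring the AS (...) schema
Row = Tuple[Optional[int], Optional[str], Optional[str], Optional[int], Optional[float]]


def _cast(value, caster):
    """Return caster(value), or None when the field is missing or malformed."""
    if value is None or value == "":
        return None
    try:
        return caster(value)
    except ValueError:
        return None


def parse_line(line: str) -> Row:
    """Split one comma-separated line and cast each field to its declared type."""
    parts = line.rstrip("\n").split(",")
    # Assumption: rows with fewer than 5 fields are padded with nulls
    parts += [None] * (5 - len(parts))
    emp_id, emp_name, job_title, dept_id, salary = parts[:5]
    return (
        _cast(emp_id, int),
        _cast(emp_name, str),
        _cast(job_title, str),
        _cast(dept_id, int),
        _cast(salary, float),
    )


# Hypothetical sample rows, not taken from the real data set
print(parse_line("101,Asha,Engineer,10,5000.5"))
print(parse_line("102,Ravi,Analyst"))  # short row: dept_id and salary become None
```

Rows like the second one are one possible cause of an ACCESSING_NON_EXISTENT_FIELD warning in the job log, since the script later reads fields that the row does not contain.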

Step: 2 - Group records by department

grunt> group_by_dept = GROUP employeeData BY dept_id;

Step: 3 - Calculate average salary by department

grunt> average_sal = FOREACH group_by_dept GENERATE group, AVG(employeeData.salary) AS avgsalary;
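Steps 2 and 3 together are a plain group-and-average. The following Python sketch shows the same idea on a tiny in-memory sample so you can check the logic by hand. The function name `average_salary_by_dept` and the sample values are hypothetical, and I am assuming that AVG skips null salaries and returns null for a group that has no non-null salary.

```python
from collections import defaultdict


def average_salary_by_dept(rows):
    """rows: iterable of (dept_id, salary). Returns {dept_id: average or None}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    depts = set()
    for dept_id, salary in rows:
        depts.add(dept_id)          # GROUP ... BY dept_id
        if salary is not None:      # assumption: null salaries are ignored by AVG
            sums[dept_id] += salary
            counts[dept_id] += 1
    # A group without any usable salary gets None instead of a number
    return {d: (sums[d] / counts[d] if counts[d] else None) for d in depts}


# Hypothetical sample: (dept_id, salary)
sample = [(10, 4000.0), (10, 6000.0), (20, 9000.0), (20, None), (30, None)]
print(average_salary_by_dept(sample))
```

For this sample, department 10 averages 5000.0, department 20 averages 9000.0 (the null row is skipped), and department 30 has no average at all.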

Step: 4 - Sort the average salaries in decreasing order and select the top 1 to get the maximum average salary

grunt> sorted_avg_sal = ORDER average_sal BY avgsalary DESC;

grunt> avg_max_sal_analytic = LIMIT sorted_avg_sal 1;
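The ORDER followed by LIMIT 1 in this step simply picks the department with the highest average. Here is a minimal Python sketch of the same selection. The name `top_department` and the sample averages are hypothetical, and dropping departments whose average is None is a simplification I chose for the sketch.

```python
def top_department(averages):
    """averages: {dept_id: average or None}. Returns a list with at most one (dept_id, average)."""
    # Simplification: departments without an average are left out of the ranking
    usable = [(dept, avg) for dept, avg in averages.items() if avg is not None]
    # ORDER ... BY avgsalary DESC
    ranked = sorted(usable, key=lambda pair: pair[1], reverse=True)
    # LIMIT ... 1
    return ranked[:1]


# Hypothetical averages per department
print(top_department({10: 5000.0, 20: 9000.0, 30: None}))  # [(20, 9000.0)]
```

If two departments share the same highest average, only one of them is returned, so the result should be read as "a top department" rather than a complete list of ties.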

Step: 5 - Store results to HDFS (in mapreduce mode the path below is resolved on HDFS, not on the local disk)

grunt> STORE avg_max_sal_analytic INTO '/home/training/Desktop/output/pig/topdepartment';

Output:

Successfully read 10293 records (535750 bytes) from: "/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset"
Successfully stored 1 records (23 bytes) in: "/home/training/Desktop/output/pig/topdepartment"
Total records written : 1
Total bytes written : 23
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201601290600_0017   ->      job_201601290600_0018,
job_201601290600_0018   ->      job_201601290600_0019,
2016-02-04 05:09:41,199 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2016-02-04 05:09:41,199 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Step: 6 - Check the output

grunt> cat /home/training/Desktop/output/pig/topdepartment


77      9.99999986991104E14
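The value 9.99999986991104E14 looks odd at first, but it is exactly the closest 32-bit float to 1e15, and the salary column was loaded as float. The short Python check below shows that rounding. The helper name `to_float32` is made up for this example. What the check cannot tell us is why department 77 averages to that number: one reading is that the dummy data contains a salary of 1e15 in that department, but I have not inspected the rows, so treat that as an assumption.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and return it."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# 1e15 cannot be stored exactly in 32 bits, so it is rounded
print(to_float32(1e15))  # 999999986991104.0
```

If exact salary values matter for your analysis, declaring the column as double in the LOAD schema would avoid this rounding.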

I hope you enjoyed the article.

Author: Iqubal Mustafa Kaki, Technical Specialist

