Apache Pig - Maximum Average Salary Analytics

I have a large dummy data set of employees. In this post I explain how to find the department with the maximum average salary in that data set using Apache Pig.

Salary Data Analytics

Please download the dummy data set from Employee Dummy Data.

I am assuming that Apache Hadoop and Pig are up and running on your machine.

Create Directory and Load employee_dataset Data into the HDFS

Navigate to the directory that contains the dummy data file, then run the following commands:

$ hadoop fs -mkdir /user/training/PIGDATA

$ hadoop fs -mkdir /user/training/PIGDATA/PIG_UDF_DATA

$ hadoop fs -copyFromLocal employee_dataset /user/training/PIGDATA/PIG_UDF_DATA

$ pig -x mapreduce

Processing Data using Apache Pig

Step: 1 - Load dataset with column names and datatypes

grunt> employeeData = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset' using PigStorage(',') AS (emp_id:int, emp_name:chararray, job_title:chararray, dept_id:int, salary:float);
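The schema above expects five comma-separated fields per row. When a row has fewer, Pig typically treats the missing fields as null and reports an ACCESSING_NON_EXISTENT_FIELD warning, like the one that appears in the job log later in this post. A quick local check of field counts can locate such rows. The sketch below is only an illustration: it uses a small made-up sample in place of employee_dataset.

```shell
# Hypothetical sample standing in for employee_dataset; the second row is
# deliberately missing its salary field.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,Asha,Engineer,10,5000
2,Ben,Engineer,10
3,Chen,Analyst,20,6200
EOF

# Print the line number and field count of every row that does not have
# exactly five comma-separated fields.
short_rows=$(awk -F, 'NF != 5 { print NR ": " NF " fields" }' "$sample")

echo "$short_rows"
rm -f "$sample"
```

To check the real file, point the awk command at employee_dataset instead of the sample. Empty output means every row has five fields.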

Step: 2 - Group records by department

grunt> group_by_dept = GROUP employeeData BY dept_id;

Step: 3 - Calculate average salary by department

grunt> average_sal = FOREACH group_by_dept GENERATE group, AVG(employeeData.salary) AS avgsalary;

Step: 4 - Sort the average salaries in descending order and keep the top record to get the maximum average salary

grunt> sorted_avg_sal = ORDER average_sal BY avgsalary desc;

grunt> avg_max_sal_analytic = LIMIT sorted_avg_sal 1;
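Steps 2 to 4 can be cross-checked outside Pig on a small sample. The following sketch assumes a made-up five-row sample rather than the real employee_dataset. It computes the average salary per department with awk and keeps the highest one, mirroring GROUP, AVG, ORDER ... DESC and LIMIT 1.

```shell
# Hypothetical five-column sample (emp_id, emp_name, job_title, dept_id,
# salary); the values are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,Asha,Engineer,10,5000
2,Ben,Engineer,10,7000
3,Chen,Analyst,20,6200
4,Dina,Manager,30,9000
5,Eli,Analyst,30,4000
EOF

tab=$(printf '\t')

# GROUP BY dept_id + AVG(salary): accumulate sum and count per department.
# ORDER ... DESC + LIMIT 1: reverse numeric sort on the average, keep one row.
top_dept=$(awk -F, '
  { sum[$4] += $5; count[$4]++ }
  END { for (d in sum) printf "%s\t%.2f\n", d, sum[d] / count[d] }
' "$sample" | sort -t "$tab" -k2,2nr | head -n 1)

echo "$top_dept"
rm -f "$sample"
```

For this sample, department 30 comes out on top with an average of 6500.00, since departments 10 and 20 average 6000.00 and 6200.00.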

Step: 5 - Store results to HDFS

grunt> STORE avg_max_sal_analytic INTO '/home/training/Desktop/output/pig/topdepartment';

Output :

Input(s):

Successfully read 10293 records (535750 bytes) from: "/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset"

Output(s):

Successfully stored 1 records (23 bytes) in: "/home/training/Desktop/output/pig/topdepartment"

Counters:

Total records written : 1

Total bytes written : 23

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_201601290600_0017 -> job_201601290600_0018,

job_201601290600_0018 -> job_201601290600_0019,

job_201601290600_0019

2016-02-04 05:09:41,199 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).

2016-02-04 05:09:41,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
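Steps 1 to 5 can also be saved as a script and run non-interactively instead of being typed at the grunt prompt. The sketch below writes the same statements to a file; the file name max_avg_salary.pig is an arbitrary choice made for this example.

```shell
# Write Steps 1-5 to a script file, one Pig statement per line.
cat > max_avg_salary.pig <<'EOF'
employeeData = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset' USING PigStorage(',') AS (emp_id:int, emp_name:chararray, job_title:chararray, dept_id:int, salary:float);
group_by_dept = GROUP employeeData BY dept_id;
average_sal = FOREACH group_by_dept GENERATE group, AVG(employeeData.salary) AS avgsalary;
sorted_avg_sal = ORDER average_sal BY avgsalary DESC;
avg_max_sal_analytic = LIMIT sorted_avg_sal 1;
STORE avg_max_sal_analytic INTO '/home/training/Desktop/output/pig/topdepartment';
EOF

# Count the statements written (lines ending in a semicolon).
statement_count=$(grep -c ';$' max_avg_salary.pig)
echo "$statement_count statements written to max_avg_salary.pig"

# To run it on the cluster (requires Pig and Hadoop, as assumed in this post):
#   pig -x mapreduce max_avg_salary.pig
```

Note that STORE fails if the output directory already exists, so remove the earlier output or change the path before running the script again.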

Step: 6 - Check the output

grunt> cat /home/training/Desktop/output/pig/topdepartment

Output:

77 9.99999986991104E14
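The value 9.99999986991104E14 deserves a second look. It is exactly the 32-bit float closest to 10^15: floats between 2^49 and 2^50 are spaced 2^26 apart, and salary was loaded as float in Step 1. This suggests that department 77 contains a placeholder or outlier salary of about 10^15 rather than a realistic figure, although that is an inference from the output and not something the data set itself states. The arithmetic can be checked with awk, which computes in double precision:

```shell
# Spacing between adjacent 32-bit floats in the range [2^49, 2^50), which
# contains 10^15, is 2^(49 - 23) = 2^26. Rounding 10^15 to the nearest
# multiple of that spacing gives the value a float column would hold.
nearest_float=$(awk 'BEGIN {
  ulp = 2 ^ 26
  printf "%.0f", int(1e15 / ulp + 0.5) * ulp
}')

echo "$nearest_float"
```

The result, 999999986991104, is 9.99999986991104E14 in scientific notation, matching the output above. Declaring salary as double in the LOAD schema would avoid this rounding, and filtering out implausible salaries before grouping would likely give a more meaningful top department.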

I hope you enjoyed the article.

Author: Iqubal Mustafa Kaki, Technical Specialist

If you want to connect with me, please email me at iqubal.kaki@gmail.com


TECHNICAL BLOGS - BIG DATA, HADOOP, JAVA

Wednesday, February 10, 2016