Apache Pig - Maximum Average Salary Analytics
Here I have a large dummy dataset of employees. I will explain how to find the department with the maximum average salary in this dataset using Apache Pig.
Salary Data Analytics
Please download the dummy dataset from Employee Dummy Data.
I am assuming Apache Hadoop and Pig are up and running on your machine.
Create Directory and Load employee_dataset Data into HDFS
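Navigate to the directory that contains the dummy data file, create the target directories in HDFS, copy the file into HDFS, and start Pig in MapReduce mode: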
$ hadoop fs -mkdir /user/training/PIGDATA
$ hadoop fs -mkdir /user/training/PIGDATA/PIG_UDF_DATA
$ hadoop fs -copyFromLocal employee_dataset /user/training/PIGDATA/PIG_UDF_DATA
$ pig -x mapreduce
Processing Data using Apache Pig
Step: 1 - Load dataset with column names and datatypes
grunt> employeeData = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset' USING PigStorage(',') AS (emp_id:int, emp_name:chararray, job_title:chararray, dept_id:int, salary:float);
Step: 2 - Group records by department
grunt> group_by_dept = GROUP employeeData BY dept_id;
Step: 3 - Calculate average salary by department
grunt> average_sal = FOREACH group_by_dept GENERATE group, AVG(employeeData.salary) AS avgsalary;
Step: 4 - Sort the average salaries in decreasing order and keep the top record to get the maximum average salary
grunt> sorted_avg_sal = ORDER average_sal BY avgsalary DESC;
grunt> avg_max_sal_analytic = LIMIT sorted_avg_sal 1;
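To make the logic of Steps 2 to 4 easier to follow, here is a minimal Python sketch of the same group, average, sort, and limit sequence. It is only an illustration that runs locally, not what Pig executes internally. The sample rows and the function name top_department_by_avg_salary are made up for this example; only the column order comes from the LOAD statement above.

```python
from collections import defaultdict

# Hypothetical sample rows in the article's column order:
# emp_id, emp_name, job_title, dept_id, salary
rows = [
    (1, "Asha", "Engineer", 10, 50000.0),
    (2, "Ben", "Analyst", 10, 70000.0),
    (3, "Chen", "Manager", 20, 90000.0),
    (4, "Dina", "Engineer", 20, 40000.0),
    (5, "Eli", "Director", 30, 80000.0),
]


def top_department_by_avg_salary(records):
    """Return (dept_id, avg_salary) for the department with the highest average."""
    # Step 2: GROUP employeeData BY dept_id
    salaries_by_dept = defaultdict(list)
    for _emp_id, _name, _title, dept_id, salary in records:
        salaries_by_dept[dept_id].append(salary)

    # Step 3: FOREACH ... GENERATE group, AVG(employeeData.salary)
    averages = [
        (dept, sum(values) / len(values))
        for dept, values in salaries_by_dept.items()
    ]

    # Step 4: ORDER ... BY avgsalary DESC, then LIMIT 1
    averages.sort(key=lambda pair: pair[1], reverse=True)
    return averages[0]


print(top_department_by_avg_salary(rows))
```

In this sample, department 30 has the highest average because its only employee earns 80000, while departments 10 and 20 average 60000 and 65000.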
Step: 5 - Store results to HDFS
grunt> STORE avg_max_sal_analytic INTO '/home/training/Desktop/output/pig/topdepartment';
Output:
Input(s):
Successfully read 10293 records (535750 bytes) from: "/user/training/PIGDATA/PIG_UDF_DATA/employee_dataset"
Output(s):
Successfully stored 1 records (23 bytes) in: "/home/training/Desktop/output/pig/topdepartment"
Counters:
Total records written : 1
Total bytes written : 23
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201601290600_0017 -> job_201601290600_0018,
job_201601290600_0018 -> job_201601290600_0019,
job_201601290600_0019
2016-02-04 05:09:41,199 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2016-02-04 05:09:41,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
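The ACCESSING_NON_EXISTENT_FIELD warning in the log is usually a sign that some input records have fewer fields than the schema declares. The article does not investigate it, so the following Python sketch is just one possible way to look for such rows in a local copy of the data. The function name find_short_rows and the sample lines are hypothetical; the expected field count of 5 comes from the LOAD schema.

```python
EXPECTED_FIELDS = 5  # emp_id, emp_name, job_title, dept_id, salary


def find_short_rows(lines, expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) for rows with fewer fields than expected."""
    short = []
    for line_number, line in enumerate(lines, start=1):
        # Plain comma split with no quote handling, similar to PigStorage(',')
        fields = line.rstrip("\n").split(",")
        if len(fields) < expected:
            short.append((line_number, len(fields)))
    return short


# Hypothetical sample; the second line is missing dept_id and salary
sample = [
    "1,Asha,Engineer,10,50000\n",
    "2,Ben,Analyst\n",
    "3,Chen,Manager,20,90000\n",
]
print(find_short_rows(sample))
```

To check the real file, pass an open file handle instead of the sample list, for example open("employee_dataset") from the directory where the dataset was downloaded.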
Step: 6 - Check the output
grunt> cat /home/training/Desktop/output/pig/topdepartment
Output:
77	9.99999986991104E14
The first column is the dept_id and the second column is the average salary of that department.
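The value 9.99999986991104E14 is exactly the nearest 32-bit float to 1e15. Because the schema declares salary as float, which is 32-bit in Pig, one plausible reading is that department 77 holds a dummy salary of roughly 1e15 that was rounded when loaded. The article does not confirm this, so treat it as an assumption. The helper name to_float32 below is made up for this sketch, which only demonstrates the rounding itself.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = to_float32(1e15)
print(stored)

# The rounded value matches the number in the Pig output
print(stored == 9.99999986991104e14)
```

If exact salary figures matter, declaring the column as double in the LOAD schema avoids this loss of precision for values of this size.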
I hope you enjoyed the article.
Author: Iqubal Mustafa Kaki, Technical Specialist
Want to connect with me?
You can reach me by email at iqubal.kaki@gmail.com