Friday, February 12, 2016

Embedding Pig Script In Java Application


Initialize Pig Server & Communicate Using Java

Just like SQL we can be embed Pig Script inside Java Application. Thus this is the way I initialize my Pig Server and can communicate the same by using Java code. Here I will explain you the steps, how anybody can perform the action item.














Here I am assuming Hadoop & Pig are up and running onto the machine. 

You can download below explained project and data-sets from the link EmbededPigInsideJavaPOC or from GitHub download link

Create a new Java Project and make sure below libraries should be added in the class path of the project.

i) $HADOOP_HOME/*.jar
ii) $HADOOP_HOME/lib/*.jar
iii) $PIG_HOME/*.jar or pig.jar
iv) $ZOOKEEPER_HOME/*.jar

The following action items need to be performed.

Ø Load input data into the HDFS.
Ø Create an instance of Pig Server class in Java
Ø Create an instance of Pig Server class in Java
Ø Execute Pig Script command by using PigServer class by using PigServer.registerQuery().
Ø To retrieve the result and store use PigServer.openIterator() or PigServer.store().
Ø Use PigServer.registerJar() for registering UDF.

Load input data into the HDFS


Here I have the data-sets of Bollywood Super hit Movies. Place the  bollywood_hit_movies.csv inside your project directory. You can get the data-sets which is in the project folder from the link  EmbededPigInsideJavaPOC or from GitHub download link.


$ hadoop fs -copyFromLocal /home/training/.eclipse_workspace/PigEmbedded/bollywood_hit_movies.csv /home/training/.eclipse_workspace/PigEmbedded/bollywood_hit_movies.csv













EmbededPigScriptInsideJava Class

project In this class, doing the analysis of Bollywood Super Hit Movies by Leading Actor Using Apache Pig Script embedded inside the Java Application.

package com.pig.embeded;

import java.io.IOException;
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.backend.executionengine.ExecException;

/**
 * 
 * @author Iqubal Mustafa Kaki
 * 
 */
public class EmbededPigScriptInsideJava {

public static void main(String args[]) throws ExecException {
Properties p = new Properties();
p.setProperty("fs.default.name", "hdfs://localhost:8020/");
p.setProperty("mapred.job.tracker", "localhost:8021");
PigServer pigServer = new PigServer(ExecType.LOCAL, p);

try {
runMyQuery(pigServer, "bollywood_hit_movies.csv");
} catch (IOException e) {
e.printStackTrace();
}
}

private static void runMyQuery(PigServer pigServer, String inputfile)
throws IOException {

pigServer.setJobName("Embedded PIG Statements in JAVA");
pigServer.registerQuery("bollywood_hit = LOAD '"+ inputfile
+ "' Using PigStorage(',') AS (year: int, movie: chararray,lead_actor:chararray);");
pigServer.registerQuery("actor_superhits = GROUP bollywood_hit BY lead_actor;");
pigServer.registerQuery("number_of_hits = FOREACH actor_superhits GENERATE group, COUNT(bollywood_hit);");
pigServer.store("number_of_hits", "actor_superhit_dir");
}
}



Run the above class, it will create the output actor_superhit_dir inside current project directory.

Input(s):
Successfully read records from: "file:///home/training/.eclipse_workspace/EmbededPigInsideJavaPOC/bollywood_hit_movies.csv"

Output(s):
Successfully stored records in: "actor_superhit_dir"

Job DAG:
job_local_0001

16/02/08 04:59:18 INFO mapReduceLayer.MapReduceLauncher: Success!
















Hope you have enjoyed the article.
Author : Iqubal Mustafa Kaki, Technical Specialist

Want to connect with me
If you want to connect with me, please connect through my email - iqubal.kaki@gmail.com

2 comments:

  1. This is really an awesome article. Thank you for sharing this.It is worth reading for everyone.


    Bigdata Hadoop Training

    ReplyDelete
  2. Very nice Blog,keep sharing more information with us.
    thank you.....

    big data hadoop course

    hadoop admin course

    ReplyDelete