Saturday, February 6, 2016

Writing Apache Pig UDFs using Java

Apache Pig can execute Java, Python, or Ruby code inside a Pig script as a UDF (User Defined Function), so you can use such code to load, aggregate, or run more sophisticated analysis on your data. This post explains how to write an Apache Pig UDF in Java. Make sure Eclipse and Apache Maven are installed on your machine before you start.

You can download the project explained below from the IMUApachePigUDF_POC link or from the GitHub download link.

Create a Maven project for the Apache Pig UDF by following the steps below:

File > New > Maven Project > check "Create a simple project"

Next, add the pig 0.15.0 dependency (along with hadoop-core) to pom.xml:

<build>
  <sourceDirectory>src</sourceDirectory>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.3</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>

<dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.15.0</version>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
  </dependency>
</dependencies>


Write the Java UDF Class

To write an Apache Pig UDF in Java, the class must extend the EvalFunc abstract class and override its exec method. The example UDF below returns the uppercase form of the column passed to it.


UpperCaseAttribute Java Class

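import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that returns the upper-case form of its first argument.
public class UpperCaseAttribute extends EvalFunc<String> {

   @Override
   public String exec(Tuple input) throws IOException {
      // No arguments were passed: nothing to convert.
      if (input == null || input.size() == 0) {
         return null;
      }
      // A null field (for example, a missing column) stays null
      // instead of causing a NullPointerException.
      String str = (String) input.get(0);
      return (str == null) ? null : str.toUpperCase();
   }
}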
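
The exec method has to cope with an empty argument list, and an individual field can also be null, since Pig represents missing values as null. The sketch below mirrors that handling in plain Java so it can be run without Pig on the classpath. The class name UpperCaseLogic and the List<Object> argument are hypothetical stand-ins for the UDF class and Pig's Tuple; they are not part of the Pig API. It also passes Locale.ROOT, which the UDF above does not, only to keep the result independent of the machine's default locale.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the UDF: a List<Object> plays the role of Pig's Tuple
// so the null handling can be exercised without Pig on the classpath.
final class UpperCaseLogic {

    // Returns the upper-case form of the first field,
    // or null when there is nothing to convert.
    static String exec(List<Object> input) {
        if (input == null || input.isEmpty()) {
            return null; // no arguments were passed
        }
        Object field = input.get(0);
        if (field == null) {
            return null; // a null field stays null
        }
        // Assumption: Locale.ROOT is used here for locale-independent output.
        return field.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList("Bill Gates")));      // BILL GATES
        System.out.println(exec(Collections.<Object>emptyList()));          // null
        System.out.println(exec(Collections.<Object>singletonList(null)));  // null
    }
}
```

Returning null rather than throwing lets Pig carry the missing value through the rest of the script instead of failing the whole job on one bad record.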



Export the Class as a JAR

Right-click > Export > JAR file > Next

Step 1: Registering the JAR File

Apache Pig runs on top of Hadoop, so make sure Hadoop is up and running. Then start Pig in MapReduce mode.

pig -x mapreduce

The first step is to register the exported JAR with the command below. Here I assume the JAR is located in /home/training/Desktop.

grunt> REGISTER '/home/training/Desktop/UpperCaseAttribute_UDF.jar';

Step 2: Defining an Alias

After registering the JAR, define an alias for the UDF with the DEFINE operator.

grunt> DEFINE UpperCaseAttribute UpperCaseAttribute();  






Step 3: Load Data into Apache Pig for Use with the UDF

Suppose we have the following person data:

PersonDetails

Bill Gates,CEO,Microsoft
Iqubal,CEO,SZIAS
Steve Jobs,CEO,Apple

Create a Directory and Load the PersonDetails Data into HDFS

Navigate to the directory that contains the PersonDetails file.

$ hadoop fs -mkdir /user/training/PIGDATA

$ hadoop fs -mkdir /user/training/PIGDATA/PIG_UDF_DATA

$ hadoop fs -copyFromLocal PersonDetails /user/training/PIGDATA/PIG_UDF_DATA

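If you left the Grunt shell to run the commands above, start Pig again. REGISTER and DEFINE apply only to the current session, so repeat the statements from Step 1 and Step 2 after restarting.

pig -x mapreduce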

Processing Data using Apache Pig

grunt> PERSONDETAILS = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/PersonDetails' USING PigStorage(',') AS (name:chararray, designation:chararray, company:chararray);

Let us use the UDF we created to convert the person names to upper case.

grunt> PersonNameUpperCase = FOREACH PERSONDETAILS GENERATE UpperCaseAttribute(name);

grunt> dump PersonNameUpperCase;


2016-02-03 07:07:25,512 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-02-03 07:07:25,516 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-03 07:07:25,516 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(BILL GATES)
(IQUBAL)
(STEVE JOBS)
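
To see where this output comes from, the following plain-Java sketch applies the same transformation to the three sample rows without Hadoop or Pig: it splits each line on the comma, as PigStorage(',') does, takes the first field as the name, and upper-cases it. The class PersonNameUpperCaseSketch and its method are hypothetical names used only for illustration, and the parentheses are added by hand to imitate how dump prints a tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: imitates LOAD ... PigStorage(',') followed by
// FOREACH ... GENERATE UpperCaseAttribute(name) on in-memory lines.
final class PersonNameUpperCaseSketch {

    static List<String> upperCaseNames(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Fields in order: name, designation, company.
            String[] fields = line.split(",", -1);
            String name = fields[0];
            // Wrap in parentheses to resemble the tuple format printed by dump.
            result.add("(" + name.toUpperCase(Locale.ROOT) + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> personDetails = Arrays.asList(
            "Bill Gates,CEO,Microsoft",
            "Iqubal,CEO,SZIAS",
            "Steve Jobs,CEO,Apple");
        for (String row : upperCaseNames(personDetails)) {
            System.out.println(row);
        }
    }
}
```

Running it prints the same three tuples as the dump above, which makes it a quick way to check the expected result before submitting a MapReduce job.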

I hope you enjoyed the article.
Author : Iqubal Mustafa Kaki, Technical Specialist


Want to connect with me?
Please reach me by email: iqubal.kaki@gmail.com

4 comments:

  1. Hello, I'm trying to develop a UDF in Java for Pig, but I've run into some issues.
    My main goal is to merge different values using some policies. These policies are defined in an external text file that I want to read when I execute the script, but it seems that the script does not read it. To check this, I tried to save some results from the script's execution to another file, but the UDF does nothing there either.
    My question is: do UDFs not support reading from and writing to an external file?
    Thanks for your attention

  2. I just wanted to learn about Pig UDFs and found this post to be a perfect one. Thanks for sharing this informative post on Pig; I was able to understand the concepts easily and thoroughly enjoyed reading it.