TECHNICAL BLOGS - BIG DATA, HADOOP, JAVA: 2016

Tuesday, March 22, 2016

Elephant Bird API - Loading JSON data Using JsonLoader into Apache Pig

Here, I will explain how to use the Elephant Bird API for loading JSON data into Apache Pig.

Suppose we have the following blog data in JSON format.

Blog JSON:

{

"Blog" : "IMU008390900000",

"BlogType" : "Big Data Blog",

"AuthorDisplayName" : "Iqubal Mustafa Kaki",

"Title" : "Elephant Bird API - Loading JSON data into Apache Pig",

"AuthorEmailId" : "iqubal.kaki@gmail.com"

}
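A note on file layout: as far as I know, Elephant Bird's JsonLoader parses its input one line at a time, so each record in Blog.json should sit on a single line rather than being pretty-printed as above. Treat that as an assumption to verify against your Elephant Bird version. The following is a minimal Java sketch for writing a flat record as one JSON line using only JDK classes; the class name BlogJsonLine and the helper toJsonLine are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BlogJsonLine {

    // Escape the characters that must be escaped inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: serialize a flat string-to-string record as a single JSON line.
    static String toJsonLine(Map<String, String> record) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Same fields as the Blog JSON above, kept in insertion order.
        Map<String, String> blog = new LinkedHashMap<>();
        blog.put("Blog", "IMU008390900000");
        blog.put("BlogType", "Big Data Blog");
        blog.put("AuthorDisplayName", "Iqubal Mustafa Kaki");
        blog.put("Title", "Elephant Bird API - Loading JSON data into Apache Pig");
        blog.put("AuthorEmailId", "iqubal.kaki@gmail.com");
        System.out.println(toJsonLine(blog));
    }
}
```

The printed line can be saved as Blog.json and copied into HDFS before running the load step.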

Steps for loading JSON Data into Apache Pig

First, register the following JARs:

grunt> REGISTER '/home/training/Desktop/ElephantBird/elephant-bird-core-4.13.jar'

grunt> REGISTER '/home/training/Desktop/ElephantBird/elephant-bird-pig-4.13.jar'

grunt> REGISTER '/home/training/Desktop/ElephantBird/elephant-bird-hadoop-compat-4.5.jar'

grunt> REGISTER '/home/training/Desktop/ElephantBird/json-simple-1.1.jar'

Load JSON into Pig Using JsonLoader

grunt> records = load '/user/training/Blog.json' using com.twitter.elephantbird.pig.load.JsonLoader('Blog:chararray,BlogType:chararray,AuthorDisplayName:chararray,Title:chararray,AuthorEmailId:chararray');

Verify the load by dumping the relation

grunt> dump records;

The output will look like this:

Successfully stored 1 records (203 bytes) in: "hdfs://localhost/tmp/temp-726203488/tmp-918944215"

Counters:

Total records written : 1

Total bytes written : 203

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_201603232110_0003

2016-03-23 21:48:54,132 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

2016-03-23 21:48:54,192 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2016-03-23 21:48:54,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

([AuthorEmailId#iqubal.kaki@gmail.com,AuthorDisplayName#Iqubal Mustafa Kaki,BlogType#Big Data Blog,Title#Elephant Bird API Loading JSON data into Apache Pig,Blog#IMU008390900000])

Reference : https://pig.apache.org/docs/r0.11.1/func.html#jsonloadstore

Hope you have enjoyed the article.
Author : Iqubal Mustafa Kaki, Technical Specialist

Want to connect with me
If you want to connect with me, please connect through my email iqubal.kaki@gmail.com

Friday, February 19, 2016

Java API for Invoking Oozie Workflows

The Oozie Java client API allows Java applications to integrate with Apache Oozie. Here I will explain how to use the Java API to interact with and invoke Apache Oozie jobs.

There are two approaches for writing Java code for invoking Oozie jobs.

- Oozie Client Java API

- Local Oozie Client Java API

The project described below, along with its configuration files, can be downloaded from the ExecuteOozieJobUsingJavaAPI link or from the GitHub download link.

Oozie Client Java API

This approach will explain how to submit an Oozie job using the Java Client API.

The code below shows the steps for writing an Apache Oozie client.

import org.apache.oozie.client.OozieClient;

import org.apache.oozie.client.WorkflowJob;

import java.util.Properties;

...

// get an OozieClient for the Oozie server (11000 is the port used by the Oozie server in this setup)

OozieClient wc = new OozieClient("http://kingsolomon:11000/oozie");

// create a workflow job configuration and set the workflow application path

Properties conf = wc.createConfiguration();

conf.setProperty(OozieClient.APP_PATH, "hdfs://kingsolomon:9000/myapp");

// setting workflow parameters

conf.setProperty("jobTracker", "kingsolomon:9001");

conf.setProperty("inputDir", "/myapp/inputdir");

conf.setProperty("outputDir", "/myapp/outputdir");

...

// submit and start the workflow job

String jobId = wc.run(conf);

System.out.println("Workflow job submitted");

// wait until the workflow job finishes, printing the status every 10 seconds

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {

System.out.println("Workflow job running ...");

Thread.sleep(10 * 1000);

}

// print the final status of the workflow job

System.out.println("Workflow job completed ...");

System.out.println(wc.getJobInfo(jobId));

...
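The loop above only checks for the RUNNING state. WorkflowJob.Status also has PREP, SUSPENDED, SUCCEEDED, KILLED and FAILED values, so a poller may prefer to wait for a terminal state and give up after a fixed number of polls. The following is a sketch under stated assumptions, not Oozie code: Status is a local stand-in enum that mirrors WorkflowJob.Status, and waitForCompletion is a hypothetical helper that takes a status supplier where a real program would wrap a call such as wc.getJobInfo(jobId).getStatus().

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

class WorkflowPoller {

    // Local stand-in mirroring the values of org.apache.oozie.client.WorkflowJob.Status.
    enum Status { PREP, RUNNING, SUCCEEDED, KILLED, FAILED, SUSPENDED }

    // A workflow is finished once it has succeeded, been killed, or failed.
    static boolean isTerminal(Status s) {
        return s == Status.SUCCEEDED || s == Status.KILLED || s == Status.FAILED;
    }

    // Hypothetical helper: reads the status at most maxPolls times, sleeping
    // intervalMillis between reads, and returns the last status seen.
    static Status waitForCompletion(Supplier<Status> statusSource, long intervalMillis, int maxPolls) {
        Status current = statusSource.get();
        for (int polls = 1; !isTerminal(current) && polls < maxPolls; polls++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return current;
            }
            current = statusSource.get();
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated status sequence standing in for a real Oozie server.
        Iterator<Status> simulated = Arrays.asList(
                Status.PREP, Status.RUNNING, Status.RUNNING, Status.SUCCEEDED).iterator();
        Status result = waitForCompletion(simulated::next, 10, 20);
        System.out.println("Final status: " + result);
    }
}
```

If the poll limit is reached first, the helper returns a non-terminal status such as RUNNING, which the caller can treat as a timeout.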

Sample Java program to call workflow

Suppose the ExecuteOozieJobUsingJavaAPI project is located at /user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI, with workflow.xml and the other files in the project folder. These files and the Java code are available from the project link above. Copy the files and folders into HDFS so that they are available to the MapReduce jobs that Apache Hive runs internally.

For reference please go through the workflow.xml file which is in the project folder.

We need to set every property that is used in the workflow.xml file.

Note : Be sure to replace the highlighted configuration properties with values specific to your environment.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

import org.apache.oozie.client.WorkflowJob;

public class ExecuteOozieScheduler {

public static void main(String[] args) {

OozieClient wc = new OozieClient("http://kingsolomon:11000/oozie");

Properties conf = wc.createConfiguration();

conf.setProperty(OozieClient.APP_PATH,

"hdfs://kingsolomon:9000/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI/workflow.xml");

conf.setProperty("jobTracker", "kingsolomon:9001");

conf.setProperty("nameNode", "hdfs://kingsolomon:9000");

conf.setProperty("queueName", "default");

conf.setProperty("dbScripts",

"hdfs://kingsolomon:9000/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI");

conf.setProperty("rootFolder",

"hdfs://kingsolomon:9000/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI");

// HDFS directories (multiple directories can be separated by a comma) that contain JARs

conf.setProperty("oozie.libpath", "hdfs://kingsolomon:9000/user/oozie/share/lib");

conf.setProperty("oozie.use.system.libpath", "true");

conf.setProperty("oozie.wf.rerun.failnodes", "true");

try {

String jobId = wc.run(conf);

System.out.println("Workflow job, " + jobId + " submitted");

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {

System.out.println("Workflow job running ...");

Thread.sleep(10 * 1000);

}

System.out.println("Workflow job completed ...");

System.out.println(wc.getJobInfo(jobId));

} catch (Exception r) {

System.out.println("Errors " + r.getLocalizedMessage());

}

}

}
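Rather than hard-coding each environment-specific value as in the program above, the same keys could be kept in a properties file and copied into the job configuration. The following is a minimal sketch using only JDK classes. The class name, the helpers parseJobProperties and requireKeys, and the inline sample text are illustrative; with a real client, the loaded entries would be copied into the Properties object returned by wc.createConfiguration().

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class JobPropertiesLoader {

    // Hypothetical helper: parse job.properties-style text into a Properties object.
    static Properties parseJobProperties(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: fail early if a key used by workflow.xml is missing.
    static void requireKeys(Properties props, String... keys) {
        for (String key : keys) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("Missing property: " + key);
            }
        }
    }

    public static void main(String[] args) {
        // Sample values taken from the program above; replace them for your environment.
        String text =
                "oozie.wf.application.path=hdfs://kingsolomon:9000/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI/workflow.xml\n"
              + "jobTracker=kingsolomon:9001\n"
              + "nameNode=hdfs://kingsolomon:9000\n"
              + "queueName=default\n"
              + "oozie.use.system.libpath=true\n";
        Properties props = parseJobProperties(text);
        requireKeys(props, "oozie.wf.application.path", "jobTracker", "nameNode", "queueName");
        System.out.println("nameNode = " + props.getProperty("nameNode"));
    }
}
```

The key oozie.wf.application.path is the string value behind the OozieClient.APP_PATH constant used in the program above.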

Local Oozie Client Java API

Oozie provides an embedded Oozie implementation, LocalOozie, which is useful for development, debugging and testing of workflow applications within the convenience of an IDE.

The code snippet below shows the usage of the LocalOozie class. All interaction with Oozie is done through the OozieClient Java API, as shown in the previous section.

The examples bundled with Oozie include the complete, runnable class LocalOozieExample, from which this snippet was taken.

The code below shows the steps for writing a Local Oozie client.

import org.apache.oozie.local.LocalOozie;

import org.apache.oozie.client.OozieClient;

import org.apache.oozie.client.WorkflowJob;

import java.util.Properties;

...

// start local Oozie

LocalOozie.start();

// get a OozieClient for local Oozie

OozieClient wc = LocalOozie.getClient();

// create a workflow job configuration and set the workflow application path

Properties conf = wc.createConfiguration();

conf.setProperty(OozieClient.APP_PATH,

"hdfs://kingsolomon:9000/myapp");

// setting workflow parameters

conf.setProperty("jobTracker", "kingsolomon:9001");

conf.setProperty("inputDir", "/myapp/inputdir");

conf.setProperty("outputDir", "/myapp/outputdir");

...

// submit and start the workflow job

String jobId = wc.run(conf);

System.out.println("Workflow job submitted");

// wait until the workflow job finishes, printing the status every 10 seconds

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {

System.out.println("Workflow job running ...");

Thread.sleep(10 * 1000);

}

// print the final status of the workflow job

System.out.println("Workflow job completed ...");

System.out.println(wc.getJobInfo(jobId));

// stop local Oozie

LocalOozie.stop();

...

Sample Java program to call LocalOozie workflow

We need to set every property that is used in the workflow.xml file.

Note : Be sure to replace the highlighted configuration properties with values specific to your environment.

import java.util.Properties;

import org.apache.oozie.local.LocalOozie;

import org.apache.oozie.client.OozieClient;

import org.apache.oozie.client.WorkflowJob;

public class ExecuteLocalOozieScheduler {

public static void main(String[] args) throws Exception {

LocalOozie.start();

// get a OozieClient for local Oozie

OozieClient wc = LocalOozie.getClient();

Properties conf = wc.createConfiguration();

conf.setProperty(OozieClient.APP_PATH,

"hdfs://kingsolomon:9000/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI/workflow.xml");

conf.setProperty("jobTracker", "kingsolomon:9001");

conf.setProperty("nameNode", "hdfs://kingsolomon:9000");

conf.setProperty("queueName", "default");

conf.setProperty("dbScripts", "/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI");

conf.setProperty("rootFolder", "/user/hadoopadmin/Desktop/ExecuteOozieJobUsingJavaAPI");

// HDFS directories (multiple directories can be separated by a comma) that contain JARs

conf.setProperty("oozie.libpath", "hdfs://kingsolomon:9000/user/oozie/share/lib");

conf.setProperty("oozie.use.system.libpath", "true");

conf.setProperty("oozie.wf.rerun.failnodes", "true");

try {

String jobId = wc.run(conf);

System.out.println("Workflow job, " + jobId + " submitted");

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {

System.out.println("Workflow job running ...");

Thread.sleep(10 * 1000);

}

System.out.println("Workflow job completed ...");

System.out.println(wc.getJobInfo(jobId));

} catch (Exception r) {

System.out.println("Errors " + r.getLocalizedMessage());

} finally {

// stop local Oozie, as in the snippet above

LocalOozie.stop();

}

}

}

Sample Output
.....
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job completed ...

Workflow id[......] status[SUCCEEDED]

Tuesday, February 16, 2016

Apache Oozie - Big Data Workflow Engine

Hadoop is used to process huge data sets, and a single job is often not enough to complete all the required work. It is usually better to split the work into several jobs, where the output of one job becomes the input of the next, so that the final result is available once all N jobs have completed. These N jobs need coordination, and Oozie provides it: it runs the jobs in order and passes the output of one job as input to another. Oozie is therefore a workflow scheduler engine that coordinates N jobs as defined in its configuration XML file. For running Oozie, the BOOTSTRAP service should be available.

For example, consider three jobs A, B and C. The output of Job A is the input of Job B. Similarly, the output of Job B is the input of Job C.
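The chaining of jobs A, B and C can be pictured with a small Java sketch. This is only an analogy built on java.util.function.Function: the job bodies are made up, and Oozie itself wires jobs together through workflow.xml rather than through Java code.

```java
import java.util.function.Function;

class JobChain {

    // Each stand-in job appends its own label to whatever input it receives.
    static final Function<String, String> JOB_A = in -> in + " -> A";
    static final Function<String, String> JOB_B = in -> in + " -> B";
    static final Function<String, String> JOB_C = in -> in + " -> C";

    // The output of A feeds B, and the output of B feeds C.
    static final Function<String, String> PIPELINE = JOB_A.andThen(JOB_B).andThen(JOB_C);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("input")); // prints "input -> A -> B -> C"
    }
}
```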

Apache Oozie can therefore be defined as follows.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.

Oozie in the Hadoop ecosystem

The backbone of an Oozie configuration is the workflow XML file. The workflow.xml file holds the configuration of all N jobs and is built from two kinds of node: control nodes and action nodes. Control nodes (such as start, end, kill, decision, fork and join) define where the workflow begins and ends and the order in which the jobs execute. Action nodes hold the details of each job, so a workflow has one action element per job: N jobs means N action elements.
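To make the split between control nodes and action nodes concrete, the following Java sketch parses a tiny workflow definition with the JDK's DOM parser and labels each top-level element. The workflow text is a made-up minimal example (the action bodies are reduced to an empty fs placeholder and the file is not validated against the Oozie schema), and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class WorkflowNodeLister {

    // Element names treated as control nodes in this sketch.
    static final Set<String> CONTROL_TAGS = new HashSet<>(
            Arrays.asList("start", "end", "kill", "decision", "fork", "join"));

    // Made-up two-job workflow: jobA runs first, then jobB, and errors go to "fail".
    static final String SAMPLE_WORKFLOW =
            "<workflow-app name=\"demo-wf\" xmlns=\"uri:oozie:workflow:0.4\">"
          + "<start to=\"jobA\"/>"
          + "<action name=\"jobA\"><fs/><ok to=\"jobB\"/><error to=\"fail\"/></action>"
          + "<action name=\"jobB\"><fs/><ok to=\"end\"/><error to=\"fail\"/></action>"
          + "<kill name=\"fail\"><message>Workflow failed</message></kill>"
          + "<end name=\"end\"/>"
          + "</workflow-app>";

    // Returns entries such as "control:start" or "action:jobA" in document order.
    static List<String> listNodes(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> result = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                Element element = (Element) node;
                String tag = element.getTagName();
                if (tag.equals("action")) {
                    result.add("action:" + element.getAttribute("name"));
                } else if (CONTROL_TAGS.contains(tag)) {
                    result.add("control:" + tag);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse workflow XML", e);
        }
    }

    public static void main(String[] args) {
        for (String entry : listNodes(SAMPLE_WORKFLOW)) {
            System.out.println(entry);
        }
    }
}
```

For the sample above, the two jobs appear as two action entries, while start, kill and end appear as control entries.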

Installation and Configuration

Oozie Installation on CentOS 6

We use the official CDH repository from Cloudera's site to install CDH4. Go to the official CDH download section and download the CDH4 (for example 4.6) repository package, or use the following wget command to download the repository package and then install it.

For OS 32 Bit

$ wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/i386/cloudera-cdh-4-0.i386.rpm

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

For OS 64 Bit

$ wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

After adding the CDH repository, run the command below to install Oozie on the machine.

[root@kingsolomon ~]# yum install oozie

The command above should also install the Oozie client. If it does not, run the command below to install the Oozie client.

[root@kingsolomon ~]# yum install oozie-client

Oozie is now installed on the machine. Next, we will configure it.

Oozie Configuration on CentOS 6

As Oozie does not interact with Hadoop directly, we need to configure it accordingly.

Note : Make all configuration changes while Oozie is not running.

Oozie uses Derby as its default built-in database; however, I recommend using MySQL.

[root@kingsolomon ~]$ mysql -u root -p

Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 3

Server version: 5.5.38 MySQL Community Server (GPL)

Oracle is a registered trademark of Oracle Corporation and/or its

affiliates. Other names may be trademarks of their respective

owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database oozie;

Query OK, 1 row affected (0.00 sec)

mysql> grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';

Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';

Query OK, 0 rows affected (0.00 sec)

mysql> exit

Bye

Now add the MySQL-related configuration to the oozie-site.xml file.

[root@kingsolomon ~]# cd /etc/oozie/conf

[root@kingsolomon ~]# vi oozie-site.xml

Add the following properties.

<property>

<name>oozie.service.JPAService.jdbc.driver</name>

<value>com.mysql.jdbc.Driver</value>

</property>

<property>

<name>oozie.service.JPAService.jdbc.url</name>

<value>jdbc:mysql://master:3306/oozie</value>

</property>

<property>

<name>oozie.service.JPAService.jdbc.username</name>

<value>oozie</value>

</property>

<property>

<name>oozie.service.JPAService.jdbc.password</name>

<value>oozie</value>

</property>

Download and add the MySQL JDBC connectivity driver JAR to Oozie lib directory.

[root@kingsolomon Downloads]# cp mysql-connector-java-5.1.31-bin.jar /var/lib/oozie/

Create the Oozie database schema by running the command below.

[root@kingsolomon ~]# sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run

Sample Output will be

setting OOZIE_CONFIG=/etc/oozie/conf

setting OOZIE_DATA=/var/lib/oozie

setting OOZIE_LOG=/var/log/oozie

setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat

setting CATALINA_TMPDIR=/var/lib/oozie

setting CATALINA_PID=/var/run/oozie/oozie.pid

setting CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20

setting CATALINA_OPTS=-Xmx1024m

setting OOZIE_HTTPS_PORT=11443

setting OOZIE_HTTPS_KEYSTORE_PASS=password

......

Validate DB Connection

DONE

Check DB schema does not exist

DONE

Check OOZIE_SYS table does not exist

DONE

Create SQL schema

DONE

Create OOZIE_SYS table

DONE

Set MySQL MEDIUMTEXT flag

DONE

Oozie DB has been created for Oozie version '3.3.2-cdh4.7.1'

The SQL commands have been written to: /tmp/ooziedb-3060871175627729254.sql

Download the ExtJS library, then extract the contents of the file to /var/lib/oozie/ on the same host as the Oozie server.

[root@kingsolomon Downloads]# unzip ext-2.2.zip

[root@kingsolomon Downloads]# mv ext-2.2 /var/lib/oozie/

Make sure the MySQL service is up and running before starting Oozie.

[root@kingsolomon Desktop]# service mysqld start

Starting mysqld: [ OK ]

Start the Oozie server by running the command below.

[root@kingsolomon Desktop]# service oozie start

Sample Output will be

Setting OOZIE_HOME: /usr/lib/oozie

Sourcing: /usr/lib/oozie/bin/oozie-env.sh

setting OOZIE_CONFIG=/etc/oozie/conf

setting OOZIE_DATA=/var/lib/oozie

setting OOZIE_LOG=/var/log/oozie

setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat

setting CATALINA_TMPDIR=/var/lib/oozie

setting CATALINA_PID=/var/run/oozie/oozie.pid

setting CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20

setting CATALINA_OPTS=-Xmx1024m

setting OOZIE_HTTPS_PORT=11443

setting OOZIE_HTTPS_KEYSTORE_PASS=password

....

Using CATALINA_BASE: /usr/lib/oozie/oozie-server-0.20

Using CATALINA_HOME: /usr/lib/bigtop-tomcat

Using CATALINA_TMPDIR: /var/lib/oozie

Using JRE_HOME: /usr/lib/jvm/java-openjdk

Using CLASSPATH: /usr/lib/bigtop-tomcat/bin/bootstrap.jar

Using CATALINA_PID: /var/run/oozie/oozie.pid

Verify the Oozie Server Status

[root@kingsolomon Desktop]# service oozie status

running

[root@kingsolomon Desktop]# oozie admin -oozie http://kingsolomon:11000/oozie -status

System mode: NORMAL
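The same status check can be made from Java over HTTP. The sketch below assumes the Oozie web services endpoint /v1/admin/status under the server's base URL and a response body containing a systemMode field; both should be verified against your Oozie version. The class and method names are hypothetical, and only JDK classes are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OozieStatusCheck {

    // Matches "systemMode":"NORMAL" and also the unquoted form "systemMode":NORMAL.
    private static final Pattern SYSTEM_MODE =
            Pattern.compile("\"systemMode\"\\s*:\\s*\"?([A-Z_]+)\"?");

    // Build the status URL from a base such as http://kingsolomon:11000/oozie.
    static String statusUrl(String oozieBaseUrl) {
        String base = oozieBaseUrl.endsWith("/")
                ? oozieBaseUrl.substring(0, oozieBaseUrl.length() - 1)
                : oozieBaseUrl;
        return base + "/v1/admin/status";
    }

    // Extract the system mode from the response body, or null if it is absent.
    static String parseSystemMode(String body) {
        Matcher m = SYSTEM_MODE.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    // Perform the HTTP GET against a running server and return the parsed mode.
    static String fetchSystemMode(String oozieBaseUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(statusUrl(oozieBaseUrl)).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            return parseSystemMode(body.toString());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Contact the server only when a base URL is passed explicitly.
            System.out.println("System mode: " + fetchSystemMode(args[0]));
        } else {
            System.out.println(statusUrl("http://kingsolomon:11000/oozie"));
        }
    }
}
```

Running the class with the base URL as its only argument contacts the server; without arguments it only prints the URL it would call.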

Open the Oozie Web Console in a browser at http://kingsolomon:11000/oozie

Hope you have enjoyed the article.

Author : Iqubal Mustafa Kaki, Technical Specialist

