Thursday, February 4, 2016

Apache Pig Big Data Analysis

Big Data Analysis Using Apache Pig

Apache Pig was originally developed at Yahoo. It is an engine built on top of MapReduce: its Pig Engine component accepts Pig Latin scripts as input and converts them into MapReduce jobs. Pig Latin is a high-level scripting language that is well suited to ETL (Extract, Transform, Load) operations. In short, Apache Pig is a high-level scripting platform in the Apache Hadoop ecosystem for analyzing large data sets by expressing the analysis as data flows.

The main motive behind developing Pig was to reduce development time through its multi-query data flow language.

The key features of Apache Pig are as follows:

- It has a procedural data flow language, Pig Latin.
- It is mainly used for programming, i.e. writing data processing scripts.
- It can handle all kinds of data, structured as well as unstructured.
- With Pig's multi-query approach, many operations can be combined in a single data flow, which reduces the number of times the data is scanned. It is commonly claimed that this needs about 1/20th of the code and about 1/16th of the development time.
- It provides a rich set of operators for filtering, joining, sorting, and so on.
- It provides complex data types such as tuples, bags, and maps.
- It is generally used by researchers and programmers.
- It operates on the client side of a cluster.
- It has no dedicated metadata database; schemas and data types are defined in the script itself.
- Through User Defined Functions (UDFs), code written in languages such as Java, Python, and Ruby can be called from a Pig script.

Installing Apache Pig

Here I am assuming Hadoop is already up and running on the machine.

Download Apache Pig 0.15.0 from the Apache Pig download page.

Go to the location where you want to install Pig, then extract the downloaded archive.

$ tar -xzf pig-0.15.0.tar.gz
$ mv pig-0.15.0 pig
Apache Pig Configuration
Setup Environment Variable
Set the environment variables by editing the ~/.bashrc file: append the following lines and save. Note that there must be no spaces around the = sign.

export PIG_HOME=/home/training/pig
export PATH=$PATH:/home/training/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

Then run the command below to reload ~/.bashrc in the current shell.

$ source ~/.bashrc

Verifying the Installation

Verify the installation of Apache Pig by typing the pig command. If the installation is successful, Pig starts and shows the Grunt shell prompt (grunt>).
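
For a check that does not open the interactive shell, the version flag can be used instead. The sketch below assumes the install path /home/training/pig from the steps above; change it if Pig lives somewhere else.

```shell
# Assumption: Pig was unpacked to /home/training/pig as in the steps above.
PIG_HOME="${PIG_HOME:-/home/training/pig}"
export PIG_HOME

# Add Pig's bin directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$PIG_HOME/bin:"*) ;;
  *) PATH="$PATH:$PIG_HOME/bin"; export PATH ;;
esac

# Print the version if the pig launcher can be found; otherwise say what to check.
if command -v pig >/dev/null 2>&1; then
  pig -version
else
  echo "pig not found on PATH; check that $PIG_HOME/bin exists"
fi
```

If the launcher is not found, the usual causes are a typo in ~/.bashrc or a shell that has not reloaded it yet.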

Apache Pig Execution Modes

Local Mode

In local mode, Pig does not need Hadoop or HDFS. All files are read from and written to the local file system, and execution happens on the local machine. This mode is generally used for testing.

$ pig -x local
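
Local mode is also handy for running a whole script file instead of typing statements at the prompt. The sketch below is illustrative only: the file names people.csv and people.pig and the two sample rows are made up for this example.

```shell
# Hypothetical sample data on the local file system (not HDFS).
printf '%s\n' 'alice,engineer' 'bob,analyst' > people.csv

# A two-statement Pig Latin script: load the CSV with a schema, then print it.
cat > people.pig <<'EOF'
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, role:chararray);
DUMP people;
EOF

# Run the script in local mode if Pig is installed; no Hadoop daemons are needed.
if command -v pig >/dev/null 2>&1; then
  pig -x local people.pig
else
  echo "pig not found on PATH; wrote people.pig for later use"
fi
```

Running a script file this way makes it easy to test the logic on a small sample before pointing the same statements at data in HDFS.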

MapReduce Mode

In MapReduce mode, Apache Pig loads and processes data stored in the Hadoop Distributed File System (HDFS). Whenever Pig Latin statements are executed in this mode, a MapReduce job is launched in the back end to perform the operation on the data in HDFS.

$ pig -x mapreduce

Load/Read and Store Data Using Apache Pig 

Apache Pig works on top of Hadoop. Make sure that Hadoop is up and running, then start Pig in MapReduce mode.

$ pig -x mapreduce

Suppose we have the following person data in a file named PersonDetails:

PersonDetails
Bill Gates,CEO,Microsoft
Iqubal,CEO,SZIAS
Steve Jobs,CEO,Apple
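
If the file does not exist yet, one way to create it on the local machine is sketched below. It writes the three rows listed above into a file named PersonDetails in the current directory.

```shell
# Write the three sample rows to a local file named PersonDetails.
cat > PersonDetails <<'EOF'
Bill Gates,CEO,Microsoft
Iqubal,CEO,SZIAS
Steve Jobs,CEO,Apple
EOF

# Print the number of records; each line is one comma-separated record.
wc -l < PersonDetails
```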

Create Directory and Load PersonDetails Data into the HDFS

Navigate to the directory that contains the PersonDetails file, then run:

$ hadoop fs -mkdir /user/training/PIGDATA

$ hadoop fs -mkdir /user/training/PIGDATA/PIG_UDF_DATA

$ hadoop fs -copyFromLocal PersonDetails /user/training/PIGDATA/PIG_UDF_DATA

$ pig -x mapreduce

Processing Data using Apache Pig

grunt> PERSONDETAILS = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/PersonDetails' USING PigStorage(',') AS (name:chararray, designation:chararray, company:chararray);

Reading Data using DUMP command

grunt> dump PERSONDETAILS;

(Bill Gates,CEO,Microsoft)
(Iqubal,CEO,SZIAS)
(Steve Jobs,CEO,Apple)

Storing Data using STORE command

grunt> STORE PERSONDETAILS INTO 'store_pig_latin';

Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201601290600_0012
2016-02-02 07:51:23,723 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Verify by checking the file

grunt> cat store_pig_latin;
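
Because no storage function was given, STORE falls back to PigStorage with its default tab delimiter, so the fields in the output are tab-separated. A sketch of writing comma-separated output instead is given below, run as a script rather than at the prompt. The script name store_csv.pig and the output directory store_pig_csv are made up for this example.

```shell
# Write a Pig Latin script that reloads the data and stores it comma-delimited.
cat > store_csv.pig <<'EOF'
PERSONDETAILS = LOAD '/user/training/PIGDATA/PIG_UDF_DATA/PersonDetails'
    USING PigStorage(',')
    AS (name:chararray, designation:chararray, company:chararray);
-- Passing PigStorage(',') to STORE writes comma-delimited rows instead of tab-delimited ones.
STORE PERSONDETAILS INTO 'store_pig_csv' USING PigStorage(',');
EOF

# Run on the cluster only if both tools are available, then print the part files.
if command -v pig >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
  pig -x mapreduce store_csv.pig
  hadoop fs -cat 'store_pig_csv/part-*'
else
  echo "pig or hadoop not found on PATH; wrote store_csv.pig for later use"
fi
```

The output directory must not already exist, otherwise the STORE statement fails.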

I hope you enjoyed the article.
Author : Iqubal Mustafa Kaki, Technical Specialist

Want to connect with me?
If you want to connect with me, please reach me by email: iqubal.kaki@gmail.com
