Friday, January 29, 2016

Apache Hive XML Data Analysis

Loading XML Data into Apache Hive

It is very interesting to do analysis of XML data using Apache Hive. Suppose we have large XML file and we need to process these data using Hadoop Ecosystem. Here I will explain about Loading of XML data into Apache Hive table.
















Here a sample XML is as following

Company.xml
<Employee>
<Name>Bill Gates</Name>
<Email>BILLGATES@Microsoft.Com</Email>
<Contact>XXXXXXXXXX</Contact>
</Employee>
<Employee>
<Name>Steve Jobs</Name>
<Email>SteveJobs@Apple.Com</Email>
<Contact> XXXXXXXXXX</Contact>
</Employee>
<Employee>
<Name>Iqubal Mustafa Kaki</Name>
<Email>IqubalMustafaKaki@SZIAS.Com</Email>
<Contact>8390900000</Contact>
</Employee>


Here we will load XML data into hive table by using XPATH() .

Thus we will explore XPATH() which will used for converting XML data to String Array.

Steps for loading XML Data into Hive Table

Step:1 Structured the XML, by executing below command

cat Company.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</Employee>|</Employee>\n|g'| grep -v '^\s*$' > company_structured_records.xml


Now we have company_structured_records.xml file for the further operation.

Step:2 Create a Hive table and load the XML file into the table

hive> create table CompanyXMLTablePOC(XMLDATA String);

hive> load data local inpath '/home/hadoopadmin/Desktop/company_structured_records.xml' into table CompanyXmlTablePOC;


Step:3 Convert the XML Data into Array Format Using XPATH()

hive> select xpath(xmldata, 'Employee/*/text()') from CompanyXmlTablePOC;

you will get the result result as below.

OK
["Bill Gates","BILLGATES@Microsoft.Com","XXXXXXXXXX"]
["Steve Jobs","SteveJobs@Apple.Com"," XXXXXXXXXX"]
["Iqubal Mustafa Kaki","IqubalMustafaKaki@SZIAS.Com","8390900000"]
Time taken: 1.503 seconds, Fetched: 3 row(s)




























Step:4  Create the HIVE Table and Insert Data from Above Created Table

hive> create table company AS select xpath_string(xmldata,'Employee/Name'),xpath_string(xmldata,'Employee/Email'),xpath_string(xmldata,'Employee/Contact')from companyxmltablepoc;







hive> select * from company;
OK
Bill Gates BILLGATES@Microsoft.Com XXXXXXXXXX
Steve Jobs SteveJobs@Apple.Com XXXXXXXXXX
Iqubal Mustafa Kaki IqubalMustafaKaki@SZIAS.Com 8390900000

Time taken: 5.616 seconds, Fetched: 3 row(s)





Hope you have enjoyed the article.
Author : Iqubal Mustafa Kaki, Technical Specialist

Want to connect with me
If you want to connect with me, please connect through my email - 
iqubal.kaki@gmail.com

5 comments:

  1. Nice post ! Thanks for sharing valuable information with us. Keep sharing..Hadoop Admin Online Training

    ReplyDelete
  2. really Good blog post. provided a helpful information. I hope that you will post more updates like this Big data hadoop online training Hyderabad

    ReplyDelete
  3. It's a great post! Thank you for sharing your knowledge to others, it was very informative and in depth one.
    Apache Pig Training in Electronic City

    ReplyDelete