
Unstructured data load into Hive database

Through Pyspark

Created by: Sudeshna Bhattacharya


Associate Id: 560878
Created on: April 24, 2019
CDB-IMS-AIA-CDS
Contents

1. Introduction
2. Objective and Scope
3. Import Text Files into HIVE Using Spark
4. Code Implementation
   Read data from Hive Table
5. Auditing
6. SPARK-SUBMIT Command
7. Pyspark Advantage
1. Introduction

Apache Spark is a popular open-source framework that processes data at high speed and supports several languages, including Scala, Python, Java, and R. Here we focus on Spark with Python (PySpark) to demonstrate how Python can leverage Apache Spark to load unstructured and semi-structured data into a Hive database.

As an ETL and data-warehousing tool, Hive supports loading structured data (from other RDBMSs), semi-structured data (from XML files), and unstructured data (.txt, .csv files) from a defined source, and querying that data as required.

2. Objective and Scope

In this tip, we explain how to load unstructured data in text-file format into a Hive table with a Python script, and how to invoke that script through PySpark, the Python API for Apache Spark.

With a file data source, there may be any number of files in different formats. A separate automated Unix script is prepared to invoke the Python script, passing each file name and file path as arguments, so that the data can be extracted, transformed according to business standards, and finally loaded into the Hive database.
2.1 Scope of the code

• The code has been verified on Spark version 2.2.0.cloudera2 with Scala version 2.11.8.
• The Hive version used is Hive 1.1.0-cdh5.12.2.

3. Import Text Files into HIVE Using Spark

With Spark, we can read data from a CSV file, a text file, an external SQL data store, or another data source, apply transformations to the data, and store the result in Hadoop, either directly in the Hadoop Distributed File System (HDFS) or in Hive.

For other RDBMSs, we can read data with the Sqoop command and load it into the Hive database.
Text files with separators can be imported into a Spark DataFrame and then stored as a Hive table using the steps described here.

In this example, we explain how to use Spark's primary data abstraction, the resilient distributed dataset (RDD), translate it into a DataFrame, and store it in Hive. It is also possible to load CSV files directly into DataFrames using the spark-csv package.
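A minimal sketch of the direct CSV route, assuming Spark 2.x (where the CSV reader is built in; the external spark-csv package is only needed on Spark 1.x) and a placeholder file path:

from pyspark.sql import SparkSession

# Illustrative SparkSession; created the same way as in the Code Implementation section.
spark = SparkSession.builder.appName("csv-demo").getOrCreate()

# Placeholder HDFS path; the header and delimiter options depend on the actual file.
csv_df = (spark.read
          .option("header", "true")
          .option("delimiter", ",")
          .csv("/user/demo/input/sample.csv"))
csv_df.show(5)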

Let's assume we have a demo source file (.txt) with the contents shown below.

TEST-CELL XR 2.00

ID : TEST001_14SEP17_AAAA
File name : "TEST001_14SEP17_AAAA.txt"
Time :
LogDate : 12 Sep 2017 1:34:26 PM
RunDate : 12 Sep 2017 1:51:42 PM
FileDate : 12 Sep 2017 1:53:22 PM
User : TESTUSR001
Analysis version : 2.88

Results:
Demo Images : 5
Demo Total cells : 1111
Demo Viable cells : 2323
Demo Nonviable cells : 320
Demo Viability (%) : 9.10
Demo Total cells / ml (x 10^6) : 11.66
Demo Total viable cells / ml (x 10^6) : 9.423

DemoSizeData :
3,7,6,15,5,3,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

DemoViableSizeData :
2,3,1,12,3,6,4,5,6,9,16,11,12,14,24,28,33,39,58,79,78,69,63,74,84,88,83,103,103,88,84,64,85,56,48,44,5
1,36,24,27,26,25,16,17,12,9,9,8,7,3,4,7,1,4,2,2,1,0,0,3,1,0,1,2,1,0,0,0,0,0,

DemoViabilityData :
92.8571,96.0000,97.6744,93.9394,96.8750,92.5000,97.2973,93.3333,88.8889,91.1111,85.2941,93.7500,
95.2381,83.3333,94.4444,92.1053,91.8919,92.8571,90.6977,97.5000,89.7436,83.7209,87.7551,90.6977,
94.8718,86.0465,89.7436,91.3043,91.4286,96.6667

DemoCountData :
28,50,43,33,32,40,37,45,36,45,34,48,42,36,54,38,37,42,43,40,39,43,49,43,39,43,39,46,35,30,35,48,32,5
0,40,49,48,40,44,56,38,44,34,40,46,37,31,39,45,44
4. Code Implementation
Suppose the source data is in a file in text format. The requirement is to load this text file into a Hive table using PySpark.

Create the Python script in the vi editor and import the open-source Python packages that help with reading the text file and with the various data-manipulation tasks, along with the required Spark SQL classes.

To achieve the requirement, two components are involved here:

I. Spark: used to parse the raw file, transform it according to business standards, and apply customization.
II. Hive: used to store the result in the database.

Initialize Spark & Hive Context:


The first step is to initialize the Spark Context and the Hive Context. The Spark Context is used to work with Spark core abstractions such as RDDs, whereas the Hive Context is used to work with DataFrames and Hive tables.

from pyspark.sql import HiveContext


from pyspark import SparkContext, SparkConf

In the interactive PySpark shell, both contexts are initialized automatically; in a standalone script they are created explicitly.
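A minimal sketch of the explicit initialization used when the script is run through spark-submit; the application name is illustrative:

# Illustrative explicit initialization for a standalone script.
conf = SparkConf().setAppName("UnstructuredHiveLoad")
sc = SparkContext(conf=conf)        # entry point for RDD operations
hive_context = HiveContext(sc)      # entry point for DataFrame and Hive operations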

Create Spark Session:


The next step is to create a Spark session, which is the entry point for programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("<app name>").getOrCreate()

Set HDFS Path:


Once we have the sample text file, the next step is to move the file into an HDFS path. This path is used while loading the data.
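As a sketch (all paths and names are placeholders), the file is assumed to have been staged into HDFS beforehand, for example with hdfs dfs -put, and its location is kept in variables for the load step:

# Placeholder HDFS location; the file is assumed to have been staged already,
# e.g. with: hdfs dfs -put TEST001_14SEP17_AAAA.txt /user/demo/input/
v_filePath = "/user/demo/input/"
v_fileName = "TEST001_14SEP17_AAAA.txt"
hdfs_location = v_filePath + v_fileName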

Load Data into RDD:


Next, the raw data needs to be imported into a Spark RDD. In the demo code, we use the textFile function, passing the path of the file location as an argument, to load the data into an RDD.
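A minimal sketch, using the Spark Context and the placeholder HDFS location assumed above:

# Each element of the RDD is one line of the raw text file.
raw_rdd = sc.textFile(hdfs_location)
print(raw_rdd.count())   # quick sanity check on the number of lines read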

Convert RDD into Dataframe:


The unstructured data is then split using Spark's map() function, which creates a new RDD on which we can apply our customized logic. This is where the DataFrame comes into the picture: it provides the facility to interact with Hive through Spark. We can use the toDF function to convert the RDD into a Spark DataFrame (calling toPandas() on that DataFrame would yield a pandas DataFrame if needed).
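A minimal sketch of this step for the demo file, keeping only the "key : value" lines; the column names are illustrative:

# Keep the "key : value" lines and split each one into a (metric, value) pair.
kv_rdd = (raw_rdd
          .filter(lambda line: " : " in line)
          .map(lambda line: line.split(" : ", 1))
          .map(lambda parts: (parts[0].strip(), parts[1].strip())))

# toDF is available on the RDD once the SparkSession/HiveContext exists.
results_df = kv_rdd.toDF(["metric", "value"])
results_df.show(5, truncate=False)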
Rename Dataframe Column:
For readability, the DataFrame columns can be renamed; a pandas DataFrame offers the add_prefix() function, while a Spark DataFrame can be renamed with withColumnRenamed(), as sketched below.
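A minimal sketch, continuing with the illustrative results_df from the previous step:

# Rename a column on the Spark DataFrame.
renamed_df = results_df.withColumnRenamed("value", "metric_value")
# On a pandas DataFrame (after toPandas()), add_prefix("demo_") would prefix every column name.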

This Python pseudocode reads a flat file from HDFS and, after transforming it, loads the file contents into Hive tables.

Load Data into HIVE Table:

This is the step where the DataFrame is loaded into the Hive table. In the demo code, we use write.insertInto to append the transformed data into the Hive table.
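A minimal sketch; the database and table names are placeholders, and insertInto assumes the target Hive table already exists with a matching column layout:

# Append the transformed DataFrame into an existing Hive table.
renamed_df.write.insertInto("demo_db.demo_results", overwrite=False)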

Read data from Hive Table:

We can validate the data loaded through PySpark by running a query with the sqlContext.sql command.
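A minimal sketch, reusing the HiveContext created earlier as the SQL context and the placeholder table name from above:

# Read the loaded rows back from Hive to verify the insert.
sqlContext = hive_context
sqlContext.sql("SELECT * FROM demo_db.demo_results LIMIT 10").show(truncate=False)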

Please find the demo code attached here for reference.

DemoCode.docx
Method Invocation:
We can define a try block that wraps the entire demo code, and include code that writes a message to a log file on successful execution.

import sys
import datetime
import traceback

try:
    # pass the HDFS file location as the first argument
    v_filePath = sys.argv[1]
    # pass the source file name as the second argument
    v_fileName = sys.argv[2]
    # third argument: log file location
    v_stgfilePath = sys.argv[3]
    # open the log file in append mode; the current date is recorded for logging purposes only
    f = open(v_stgfilePath + "/StagingDataLoad.csv", 'a')
    v_currdt = datetime.datetime.now()
    filestr = v_filePath + v_fileName
    <demo_function>(filestr)
    f.write('STAGING,<demo_function>,<demo_script_name.py>,' + v_currdt.strftime("%d-%m-%Y %H:%M %p") + ',SUCCEEDED' + '\n')

5. Auditing
A separate log file can be placed in a specified directory to capture the success message, or the error message in case an exception is encountered.

In the example script shown above, for demo purposes, we define the log-file path and name, open a file handler on the log file in append mode, and write a success message into the file when the run succeeds.

Log entry on success:

STAGING,demo_function,demo_script_name.py,23-04-2019 08:36 AM,SUCCEEDED

In the demo code, an except block is defined to catch run-time errors and write the stack trace into the log file. The exception message is truncated to the last line of the traceback only.

except:
    exceptiondata = traceback.format_exc().splitlines()
    exceptionarray = [exceptiondata[-1]]   # keep only the last line of the traceback
    f.write('STAGING,<demo_function>,<demo_script_name.py>,' + v_currdt.strftime("%d-%m-%Y %H:%M %p") + ',FAILED,' + str(exceptionarray) + '\n')
finally:
    f.close()
Log entry on exception:

STAGING,demo_function,demo_script_name.py,22-04-2019 08:19 AM,FAILED,["AttributeError: 'RDD' object has no attribute 'toPandas'"]

6. SPARK-SUBMIT Command
$ spark2-submit --master yarn demo_script_name.py <file_location> FlatFileSource.txt <log_file_location>

7. Pyspark Advantage

• Easy integration with other languages: Spark itself supports Scala, Java, and R, so PySpark code fits easily into a multi-language Spark environment.
• RDD: PySpark makes it easy for data scientists to work with Resilient Distributed Datasets.
• Speed: the framework is known for far greater speed than traditional data-processing frameworks.
• Caching and disk persistence: a powerful caching and disk-persistence mechanism for datasets makes it considerably faster.
