
CHAPTER 1

INTRODUCTION

1.1 NEED FOR THE PROJECT

The online marketing space is in constant flux as new technologies, services, and
marketing tactics gain popularity and become the new standard. Online store owners are
one of the many segments affected by this constant evolution. For these business owners
to survive and thrive, they need to be able to make better decisions faster, and this is
where data analysis comes into play.
Having access to statistical information from all areas of your online marketing
and sales activities gives you an advantage over competitors that do not have this
information. Understanding trends and recognizing which marketing channels are no longer
profitable allow you to maneuver as a business before damage is done to your bottom
line. Understanding shifts in consumer behavior gives you insights into the
demands of your market. Knowing these things enables you to drop certain products or
make strategic changes in your pricing that will result in big gains or, at the very least,
limit damage to your profits.
There has been an increasing emphasis on big data analytics (BDA) in e-commerce
in recent years. This is because e-commerce firms that inject BDA into their value chain
experience 56% higher productivity than their competitors. Specifically, in the
e-commerce context, big data enables merchants to track each user's behavior and connect
the dots to determine the most effective ways to convert one-time customers into repeat
buyers.
Big data analytics enables e-commerce firms to use data more efficiently, drive a
higher conversion rate, improve decision making and empower customers. BDA is a
distinctive competence of high-performance business processes and supports business
needs such as identifying loyal and profitable customers, determining the optimal price,
detecting quality problems, and deciding the lowest possible level of inventory.
Such a huge amount of data cannot be analyzed without suitable tools, so Hadoop
is used here as the analysis platform. Hadoop is a framework that supports distributed
data storage and analysis: storage is handled by the Hadoop Distributed File System
(HDFS) and the analysis is carried out with Hive.

1.2 OBJECTIVE OF THE PROJECT

A leading e-commerce company, MyCart, is planning to investigate and analyze
products and customer behaviour. It receives a large amount of data about different
products, registered users and the behaviour of users in terms of placing orders and the
subsequent actions made on those orders. Different products belong to different categories
and have different discounts and profit percentages associated with them. Users are spread
across different locations and, based on their behaviour, MyCart wants to capture the
purchase patterns of users and detect possible fraudulent activities. It receives files on a
daily basis and is trying to process them using Big Data tools in order to:
1. Gain competitive advantage
2. Design marketing campaigns
3. Detect possible fraud

1.3 SCOPE OF THE PROJECT
The scope of the application is to analyze the generated data in order to improve the
customer experience. It helps in building personalized connections between users and the
business. Big data analytics gives e-commerce companies a granular digital view of their
customers: what a customer likes and dislikes, and what they are interested in. We also
find the purchasing patterns of customers and any fraudulent activities performed in the
application.
The jobs are automated to run daily using Oozie. Since files arrive daily by 00:30, the
Oozie coordinator job is triggered at 01:00 (1 AM) every day. Automation reduces human
intervention in generating results on a daily basis. Oozie actions such as shell script,
Hive, Java, HBase and email actions are performed; if an action fails, an email has to be
triggered, with the recipients' addresses stored in a configuration file. The data analysis
itself is performed using Hive, making joins less expensive and using better storage
mechanisms by taking advantage of partitioning, bucketing, the ORC file format,
vectorization, cost-based optimization (CBO) and Tez. A sketch of the relevant Hive
settings is shown below.
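As an illustration only (the property values shown are assumed defaults, not the project's
actual configuration), the Hive session settings that enable the optimizations mentioned
above might look like the following:

-- Hedged sketch: enable the Hive optimizations referred to in the scope.
SET hive.execution.engine=tez;                        -- run queries on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;           -- vectorized query execution
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;                             -- cost-based optimization (CBO)
SET hive.compute.query.using.stats=true;              -- let the CBO use table/column statistics
SET hive.exec.dynamic.partition=true;                 -- needed for dynamic partition inserts
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;                      -- honour CLUSTERED BY ... INTO n BUCKETS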

CHAPTER 2
LITERATURE SURVEY

Big data analysis in e-commerce system using Hadoop MapReduce

Every organization generates vast amounts of data from various sources. Web mining is
the process of extracting useful knowledge from web resources. Log files are maintained
by the web server. The challenging task for e-commerce companies is to understand
customer behavior by analyzing these web log files in order to improve the business. An
e-commerce website can generate tens of petabytes of data in its web log files.

S. Sugune et al. [1] discuss the importance of log files in the e-commerce world. The
analysis of log files is used for learning user behavior in an e-commerce system. The
analysis of such large web log files needs parallel processing and a reliable data storage
system. The Hadoop framework provides reliable storage through the Hadoop Distributed
File System and parallel processing for large data sets using the MapReduce
programming model. However, MapReduce programs are harder to understand and
require many more lines of code than the equivalent Hive queries, and applying
MapReduce in real time is more complex.

The Study of Big Data Analytics in E-Commerce

Big data is a collection of huge data sets that cannot be processed using conventional
computing techniques. Big data is not just data; rather, it has become a whole subject
area involving a variety of tools, techniques and frameworks. It refers to using complex
datasets to drive focus, direction, and decision making within a company or organization.
This is achieved by implementing applicable systems for gaining an accurate and deep
understanding of the knowledge obtained by analyzing the organization's data.

In this survey paper, Niranjanmurthy et al. [2] discuss the different types of data held
and their diverse usage for e-commerce, as well as different ways of providing security
and safety for the data when it is used in large-scale services. Using this survey, safety
mechanisms can be provided for such services. However, the capturing of related
information, its sharing, analysis, transfer and presentation are not discussed in the paper.

A survey of clustering techniques for big data analysis

Data mining is a technique used to extract useful information and hidden relationships
among data. Traditional data mining approaches cannot be applied directly to big data,
as they face difficulties in analyzing it. To overcome these difficulties, Saurabh Arora
et al. [3] examine clustering, one of the major data mining techniques, in which mining
is performed by finding clusters containing similar groups of data. In this paper they
discuss some of the current big data clustering techniques, carry out a comprehensive
analysis of them, and identify appropriate clustering algorithms. Clustering is an
efficient technique, but it is also complex and is unable to recover from corrections to
the underlying database, and the IP address of the cluster is transparent to the client
application. For these reasons, merging techniques are used in this project to prepare
the data.

Integration of Massive Streams and Batches via JSON-based Dataflow Algebra

Hirotoshi Cho et al. [4] note that data in the real world are increasingly complex and
unstructured, and that such data should be analyzed using big data techniques. For easy
transfer between servers, JSON is an efficient representation, so the data can be obtained
in JSON format. The paper discusses two major problems: (1) an appropriate combination
of frameworks must be chosen to meet the performance requirements, and (2) code
written in the different programming models bound to those frameworks must be built
and maintained. The authors address the problem of combining multiple data processing
frameworks and present a novel framework, JsFlow, that integrates stream and batch
processing in a single system. JsFlow provides a unified programming model for stream
and batch processing using a JSON-based dataflow algebra. It accepts processing
dataflows and deploys them over appropriate processing frameworks (e.g. Spark and
Spark Streaming) according to the performance requirements. The authors experimentally
confirmed the effectiveness of JsFlow compared with state-of-the-art frameworks for
processing real-world big data. This approach is relevant to the project, as JSON data is
taken as the input for the data analysis.

Hive, Pig & Hbase Performance Evaluation for Data Processing Applications

Information extraction has received significant attention due to the rapid growth of
unstructured data. Researchers need a low-cost, scalable, easy-to-use and fault-tolerant
platform for processing large volumes of data, so it is important to evaluate
MapReduce-based frameworks for data processing applications. In this paper, Vaishali
Chauhan et al. [5] present a comparative study of HBase, Hive and Pig. Hive is a good
choice for writing the processing logic as queries, since it reduces the complexity of
MapReduce programming. HBase is a NoSQL database used to store and retrieve data;
in this project an HBase lookup table is used to enrich the data in the original table. The
processing times of HBase, Hive and Pig are measured on a data set with simple queries,
their performance is observed, and the results are evaluated accordingly.

CHAPTER 3
SYSTEM ANALYSIS

3.1 PRESENT SYSTEM

In the present system, data ingestion and analysis are done with MapReduce.
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The Reduce task
takes the output from a map as an input and combines those data tuples into a smaller
set of tuples. As the name MapReduce implies, the reduce task is always performed
after the map job.

Fig 3.1.1 MapReduce Workflow

3.2 PROPOSED SYSTEM

In the real world, MapReduce programming is popularly used for data analysis.
However, instead of writing 100 to 200 lines of Java code, and for developers who are
not comfortable with Java, the same analysis can be expressed in Hive's simple SQL-like
query language, typically in 10 to 15 lines, as the sketch below illustrates.
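As an illustration of this brevity (the table and column names below are hypothetical and
not the project's schema), a category-wise sales summary that would require a full Java
MapReduce job can be written as a single HiveQL statement:

-- Hedged sketch: hypothetical table sales(category STRING, price BIGINT, discount FLOAT).
-- The equivalent MapReduce version would need a Mapper, a Reducer and a driver class in Java.
SELECT category,
       COUNT(*)              AS orders,
       SUM(price - discount) AS net_revenue
FROM   sales
GROUP  BY category
ORDER  BY net_revenue DESC;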
In the proposed system, the initial data ingestion and validation are done in Hive and
HBase. Hive is a data warehouse infrastructure tool for processing structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and
analysis easy. HBase is a distributed, column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally scalable. HBase is a
data model similar to Google's Bigtable, designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS) and, as part of the Hadoop ecosystem, provides random
real-time read/write access to data stored in HDFS.

Advantages
HBase is suited to write-heavy applications.
HBase provides fast random access to the available data.
The advantage of using Hive is that the data to be analyzed is stored in HDFS, which
provides features such as scalability and redundancy, along with SQL-like querying
over data in Hadoop.

CHAPTER 4
TECHNOLOGY STACK
4.1 HIVE

Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis. Hive gives an SQL-like interface to
query data stored in various databases and file systems that integrate with Hadoop.
Traditional SQL queries must be implemented in the MapReduce Java API to execute
SQL applications and queries over distributed data. Hive provides the necessary SQL
abstraction to integrate SQL-like queries (HiveQL) into the underlying Java API
without the need to implement queries in the low-level Java API. Since most data
warehousing applications work with SQL-based querying languages, Hive supports easy
portability of SQL-based applications to Hadoop. While initially developed by Facebook,
Apache Hive is now used and developed by other companies such as Netflix and the
Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of
Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.

4.1.1 FEATURES OF HIVE

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and
compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like
language called HiveQL with schema on read and transparently converts queries to
MapReduce, Apache Tez or Spark jobs. All three execution engines can run on Hadoop
YARN. To accelerate queries, it provides indexes, including bitmap indexes. Other
features of Hive include:

Indexing to provide acceleration; index types include compaction and bitmap indexes
as of version 0.10, and more index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic
checks during query execution.
Operation on compressed data stored in the Hadoop ecosystem using algorithms
including DEFLATE, BWT, Snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-
mining tools; Hive supports extending the UDF set to handle use cases not supported
by the built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez,
or Spark jobs.

4.1.2 HIVE ARCHITECTURE

Major components of the Hive architecture are:

Fig 4.1.2.1 Hive Architecture

Metastore: Stores metadata for each of the tables, such as their schema and location. It
also includes the partition metadata, which helps the driver to track the progress of the
various data sets distributed over the cluster. The data is stored in a traditional
RDBMS format. The metadata helps the driver to keep track of the data and is
highly crucial; hence, a backup server regularly replicates the data, which can be
retrieved in case of data loss.
Driver: Acts like a controller which receives the HiveQL statements. It starts the
execution of a statement by creating sessions and monitors the life cycle and progress
of the execution. It stores the necessary metadata generated during the execution of
a HiveQL statement and also acts as a collection point for the data or query result
obtained after the Reduce operation.
Compiler: Performs compilation of the HiveQL query, converting the query into
an execution plan. This plan contains the tasks and steps needed to be performed by
Hadoop MapReduce to produce the output specified by the query. The compiler
converts the query to an abstract syntax tree (AST). After checking for compatibility
and compile-time errors, it converts the AST to a directed acyclic graph (DAG). The
DAG divides operators into MapReduce stages and tasks based on the input query
and data.
Optimizer: Performs various transformations on the execution plan to obtain an
optimized DAG. Transformations can be aggregated together, such as converting a
pipeline of joins into a single join, for better performance. It can also split tasks, such
as applying a transformation on data before a reduce operation, to provide better
performance and scalability. The transformation logic used for optimization can,
however, be modified or pipelined using another optimizer.
Executor: After compilation and optimization, the executor executes the tasks
according to the DAG. It interacts with the job tracker of Hadoop to schedule the tasks
to be run and takes care of pipelining the tasks by making sure that a task with a
dependency is executed only after all its prerequisites have run.
CLI, UI, and Thrift Server: The command line interface and user interface (UI) allow
an external user to interact with Hive by submitting queries and instructions and
monitoring the process status. The Thrift server allows external clients to interact with
Hive over the network, much as JDBC/ODBC servers do. The plan produced by this
pipeline can be inspected as sketched below.
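To see the compiler's work for any HiveQL statement, the execution plan can be printed
with EXPLAIN. The query below is a simple illustrative example over one of the project's
core tables, not part of the project code, and the exact stages shown depend on the Hive
version and execution engine:

-- Hedged sketch: print the compiled execution plan (AST -> DAG of stages) for an aggregation.
-- The output lists the MapReduce (or Tez) stages and the operators inside each stage.
EXPLAIN
SELECT category, COUNT(*) AS product_count
FROM   products_info_core
GROUP  BY category;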

4.2 HBASE

HBase is a distributed, column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable. HBase is a data model
similar to Google's Bigtable, designed to provide quick random access to huge amounts
of structured data. It leverages the fault tolerance provided by the Hadoop Distributed
File System (HDFS) and, as part of the Hadoop ecosystem, provides random real-time
read/write access to data in HDFS. One can store data in HDFS either directly or through
HBase, and data consumers read/access the data in HDFS randomly using HBase. HBase
sits on top of the Hadoop File System and provides read and write access.

Fig 4.2.1 Hbase Read and Write


HBase is a column-oriented database and the tables in it are sorted by row.
The table schema defines only column families, which are the key-value pairs. A table
can have multiple column families and each column family can have any number of
columns. Subsequent column values are stored contiguously on disk, and each cell value
of the table has a timestamp. In short, in HBase (see the Hive mapping sketched after
this list):

Table is a collection of rows.


Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
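The project accesses its HBase tables through Hive external tables (Section 7.2 creates the
HBase table 'production_category' with column family 'prod_details'). A minimal sketch of
such a mapping is given below; the Hive column names are assumptions based on how the
lookup table is used later in the appendix, not a verbatim copy of the project DDL:

-- Hedged sketch: Hive external table over the HBase lookup table.
-- ':key' maps the HBase row key (the product id); 'prod_details:category' maps
-- one column inside the 'prod_details' column family.
CREATE EXTERNAL TABLE prod_details (
  prod_id  STRING,
  category STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,prod_details:category")
TBLPROPERTIES ("hbase.table.name" = "production_category");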

4.2.1 FEATURES OF HBASE

HBase is linearly scalable.
It has automatic failover support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.

4.2.2 ARCHITECTURE OF HBASE

In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into Stores. Stores are saved as files in HDFS.
Shown below is the architecture of HBase. HBase has three major components: the
client library, a master server, and region servers. Region servers can be added or
removed as per requirement.

Fig 4.2.2.1 Hbase Architecture

Master Server

The master server

Assigns regions to the region servers and takes the help of Apache Zookeeper for this
task.
Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of
tables and column families.

Regions

Regions are nothing but tables that are split up and spread across the region servers.

Region server

The region servers have regions that

Communicate with the client and handle data-related operations.


Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.

When we take a deeper look into the region server, it contains regions and stores as
shown below:

Fig 4.2.2.2 Region Server

The store contains the memory store (MemStore) and HFiles. The MemStore is just like
a cache: anything that is written to HBase is stored here initially. Later, the data is
transferred and saved in HFiles as blocks, and the MemStore is flushed.

CHAPTER 5
SYSTEM DESIGN

5.1 DATA PREPARATION DIAGRAM

Data coming from web applications and mobile applications is stored on a
file server. The received data is in JSON file format. We receive three types of data:
products_information, user_information, and user_activity. These files contain many
fields, and the amount of data depends on the purchases made. Since the data is needed
in files of a specified size, the incoming files are filtered and finally merged into larger
files for processing, combining the different files into single files of the specified size.

Fig 5.1 Data Preparation

5.2 OVERALL ARCHITECTURAL DIAGRAM

The overall architectural diagram for the e-commerce data analysis is shown below.
First, the data is extracted from various sources on a daily basis, followed by a data
enrichment and validation process. Enrichment means adding value to the data, which is
done using the HBase tables (i.e., missing data in the Hive table can be obtained by
referring to the HBase table). Data validation means validating the data against a set of
rules and eliminating the rows which are not valid. During data analysis, purchase
pattern detection and fraud detection are carried out. Finally, the analyzed data is
exported to a number of databases.

Fig 5.2 System Architecture


CHAPTER 6
SYSTEM IMPLEMENTATION
6.1 MODULES
1 Data Preparation
2 Data Enrichment
3 Rules checking and validation
4 Data analysis
5 Data Export

6.1.1 DATA PREPARATION

Data preparation is the process of gathering, combining, structuring and
organizing data so it can be analyzed as part of business intelligence (BI) and business
analytics (BA) programs. The components of data preparation include data discovery,
profiling, cleansing, validation and transformation; it often also involves pulling
together data from different internal systems and external sources.

Data preparation work is done by information technology (IT) and BI teams as
they integrate data sets for loading into a data warehouse, NoSQL database or Hadoop
data lake repository. In addition, data analysts can use self-service data preparation tools
to collect and prepare data for analysis themselves.

One of the primary purposes of data preparation is ensuring that the information
being readied for analysis is accurate and consistent, so that the results of BI and
analytics applications will be valid. Data is often created with missing values,
inaccuracies or other errors, and data sets stored in separate files or databases often have
different formats that need to be reconciled. The process of correcting inaccuracies and
joining data sets constitutes a big part of data preparation. In our data preparation the
following steps are followed (a consolidation sketch is given after this list).

A single HDFS directory contains all three different types of files. The first task is
to separate these records according to the prefix of each file name.
Because small files bring no benefit in HDFS, the many small files are merged into a
smaller number of large files so that subsequent processing speeds up.
The threshold for small files is 100 MB.
No file should be smaller than 100 MB unless it is the only file of its category (by
category we mean one of the products_info_*, users_info_* or users_activity_*
files).
For example, if we receive three files of the users_info_* category, each of size 50 MB,
they are merged together to make a single file.
If we receive a single file of the products_info_* category with size 81 MB, it cannot
be merged and is left at the same size.
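One way to realize this consolidation entirely in Hive is sketched below. This is only an
assumption about how the merge could be performed, not the project's actual script (which
uses an Oozie shell action): an external table is pointed at the landing directory and its
contents are rewritten into a managed table while Hive's small-file merge settings are
enabled. The table names and landing path are hypothetical.

-- Hedged sketch: consolidate small JSON files of one category into larger files.
SET hive.merge.mapfiles=true;                 -- merge small output files of map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small output files of map-reduce jobs
SET hive.merge.smallfiles.avgsize=104857600;  -- trigger a merge pass below ~100 MB average
SET hive.merge.size.per.task=134217728;       -- target size of merged files

-- Each JSON record is kept as a single opaque line (assumes no \001 delimiter inside the JSON).
CREATE EXTERNAL TABLE users_info_landing (json_line STRING)
LOCATION '/data/mycart/landing/users_info';

CREATE TABLE users_info_merged (json_line STRING);

-- Rewriting the data lets Hive emit a small number of large files.
INSERT OVERWRITE TABLE users_info_merged
SELECT json_line FROM users_info_landing;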

6.1.2 DATA ENRICHMENT

Data enrichment is a general term that refers to the processes used to enhance,
refine or otherwise improve raw data. This idea, and other similar concepts, contributes
to making data a valuable asset for almost any modern business or enterprise, and it
reflects the common imperative of proactively using this data in various ways. Every
organization has its own unique goals for adding value to its data, but many of the tools
for enriching data are universal in their refinement of content and documents to weed
out errors and inconsistencies. This is something any enterprise can appreciate. From
ensuring the accuracy of algorithms, to adding new data to tables, to correcting
typographic or spelling errors, these tools are designed to improve quality across all data
fronts.

In case we get NULL for the product category or user location in the
users_activity_* file, we refer to the lookup table and enrich the fields (see the
sketch after this list).
Let us assume we have the value 'Men Clothes' for column C:V against row key
P110 in our HBase lookup table prod_category.
If in the users_activity_* file we receive product id P110 but the category is
NULL, we can perform a lookup on the basis of the product id and retrieve the
category 'Men Clothes'.
Similarly, if the user id is present in the users_activity_* file but the location is
missing or NULL, we can retrieve the location by performing a lookup over the
user_location table present in HBase.
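A minimal sketch of this enrichment in HiveQL is given below, assuming the HBase lookup
tables are already exposed as the Hive external tables prod_details and user_location. The
exact column names are assumptions based on the appendix, and the target table
user_activity_enriched is hypothetical:

-- Hedged sketch: fill NULL category/location from the HBase-backed lookup tables.
INSERT OVERWRITE TABLE user_activity_enriched
SELECT a.product_id,
       a.user_id,
       COALESCE(a.category, p.category) AS category,   -- take the lookup value when NULL
       COALESCE(a.location, u.location) AS location,
       a.order_date,
       a.shipment_date
FROM   user_activity_stg a
LEFT OUTER JOIN prod_details  p ON a.product_id = p.prod_id
LEFT OUTER JOIN user_location u ON a.user_id    = u.user_id;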

6.1.3 RULES CHECKING AND VALIDATION

Validation rules verify that the data a user enters in a record meets the standards
you specify before the user can save the record. A validation rule can contain a formula
or expression that evaluates the data in one or more fields and returns a value of True
or False. Validation rules also include an error message to display to the user when the
rule returns True because of an invalid value.

Rule checking verifies whether the data entered by the customer is valid. If it is
not valid, a message has to be given to the customer that the data entry is invalid or
does not satisfy the specified constraints.

Given the data provided by the customers, a set of conditions is used to check its
integrity. Three rules are specified to increase the trustworthiness of the data, and every
record must adhere to the rules listed below:

Rule Id   Description
R1        user_id and product_id should not be NULL
R2        order_date should be less than or equal to the shipment_date
R3        age of the user should be a positive number

We have three different tables in the analysis (product information, user
information, user activity), and in these three tables we check the constraints mentioned
above.

RULE 1 checks the key constraints on user_id and product_id. Each table must have
a primary key to identify each row separately. A primary key is a special relational
database table column (or combination of columns) designated to uniquely identify all
table records; its main features are that it must contain a unique value for each row of
data and that it cannot contain NULL values. The NULL check is performed by this
first rule.

23
RULE 2 checks the user_activity table with respect to dates. When users want to
purchase a product, they place an order and pay for it, and only after that do shipment
and delivery take place. The order date must therefore be less than or equal to the
shipment date; equivalently, the shipment (and delivery) date should never be earlier
than the order date. This is what rule 2 checks.

RULE 3 verifies the age of the user. The age of a person should be a positive
number. Sometimes the data entered by the user may be wrong, and this rule validates
it so that the user can be informed about the incorrect entry.

If any of the above rules is violated, that record is dropped from the further
processing pipeline. After validation is complete, the number of records that failed each
rule is identified. If the number of invalid records for any rule is more than its tolerable
threshold, the processing is stopped and the process is terminated with a notification.

Rule   Failure threshold (in percentage)
R1     2
R2     1
R3     1

A different threshold value is used for each rule to indicate the failure to the user (a
sketch of the threshold check is given below).
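As an illustration of how the failure percentage per rule could be computed in Hive before
comparing it against these thresholds (a sketch only: the appendix actually counts failures
with separate hive -e commands written to text files, and the comparison itself is assumed
to happen in an Oozie shell action; the user_activity_excp table is assumed to carry the
same rule_failed column as products_info_excp):

-- Hedged sketch: failure percentage per rule for the user activity data.
SELECT e.rule_failed,
       COUNT(*)                           AS failed_records,
       COUNT(*) * 100.0 / s.total_records AS failure_percentage
FROM   user_activity_excp e
CROSS JOIN (SELECT COUNT(*) AS total_records FROM user_activity_stg) s
GROUP  BY e.rule_failed, s.total_records;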

6.1.4 DATA ANALYSIS

Data analysis is the process of analyzing the data given by the users in order to
support decision making and to find fraudulent activities. In our system the following
analyses take place:

Purchase pattern detection
Fraud detection
Shortcoming identification

Purchase Pattern Detection

What is the most purchased category for every user? Identify the users with the
maximum amount of valid purchases.

Which products are generating the maximum profit? (Profit = (price - discount)
* profit_percentage)

Which resellers are generating the maximum profit?

Which is the most sought-after category corresponding to every occupation?

A sketch of the first of these queries is given below.
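The following HiveQL is an assumed formulation of the "most purchased category per user"
question over the core tables; the column names (user_id, category, cancellation, return)
and the 'N' flag values are taken from the table descriptions in the appendix and may differ
from the real schema:

-- Hedged sketch: most purchased category for every user, counting only valid
-- (not cancelled, not returned) activity from the core tables.
SELECT user_id, category, purchase_count
FROM (
  SELECT user_id, category, purchase_count,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_count DESC) AS rnk
  FROM (
    SELECT a.user_id, p.category, COUNT(*) AS purchase_count
    FROM   user_activity_core a
    JOIN   products_info_core p ON a.product_id = p.product_id
    WHERE  a.cancellation = 'N' AND a.`return` = 'N'   -- flag values are assumptions
    GROUP  BY a.user_id, p.category
  ) g
) t
WHERE rnk = 1;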

Fraud detection

Which user has performed the most returns? What are the valid purchases made by
those users?
Which location is getting the most cancellations?
Which location is getting the most returns?
Which users have witnessed more than a 50% change in their purchase pattern,
corresponding to their top 3 categories of purchase, as compared to the last month?

A sketch of the location-wise cancellation query is given below.
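As an assumed illustration of the location-wise cancellation question (again using the
appendix's table naming, with the location obtained from the user core table; the column
names and the 'Y' flag value are assumptions):

-- Hedged sketch: locations ranked by number of cancellations.
SELECT u.location,
       COUNT(*) AS cancellations
FROM   user_activity_core a
JOIN   users_info_core u ON a.user_id = u.user_id
WHERE  a.cancellation = 'Y'
GROUP  BY u.location
ORDER  BY cancellations DESC;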

Shortcoming Identification:

What is the net worth of cancelled products across every city, for every different
reason type, sorted in descending order of net worth?
What is the net worth of returned products across every city, for every different
reason type, sorted in descending order of net worth?

A sketch of the first of these queries is given below.
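A hedged sketch of the cancelled-products net worth query follows; "net worth" is assumed
here to mean the summed price net of discount, and the city (location) and reason columns
are assumptions based on the table descriptions:

-- Hedged sketch: net worth of cancelled products per city and cancellation reason.
SELECT u.location                AS city,
       a.cancellation_reason     AS reason,
       SUM(p.price - p.discount) AS net_worth
FROM   user_activity_core a
JOIN   products_info_core p ON a.product_id = p.product_id
JOIN   users_info_core    u ON a.user_id    = u.user_id
WHERE  a.cancellation = 'Y'
GROUP  BY u.location, a.cancellation_reason
ORDER  BY net_worth DESC;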

The data analysis is performed using Hive. Joins are made less expensive and
better storage mechanisms are used by taking advantage of partitioning, bucketing, the
ORC file format, vectorization, CBO and Tez.

6.1.5 DATA EXPORT:

The analyzed information is exported into MySQL tables. The final analysis
result corresponding to the most purchased category is loaded into a MySQL table.
The MySQL table must be incrementally loaded every month with the analysis
performed on the previous month's data. Although data is received on a daily basis, the
analysis is carried out monthly, so the results are produced once per month.
The incremental export takes place on the 1st of every month. For example, on
Aug 1, 2016, the table is loaded with the details from July 1, 2016 to July 31, 2016.
This is done by using a partitioned Hive table, or by using MapReduce with
DBOutputFormat, and storing the last exported month in the RDBMS. A sketch of
selecting the previous month's partitions is given below.
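As an assumed illustration of the partition-based selection step (the aggregate table name
user_category_agr appears in the appendix, but its columns and its rptg_dt partition are
assumptions; the actual transfer of the result into MySQL would be done by the export
mechanism described above, which is not shown here):

-- Hedged sketch: pick up only the previous month's daily partitions for export.
-- Intended to run on the 1st of the month; rptg_dt holds 'yyyy-MM-dd' strings as in the appendix.
SELECT user_id, category, purchase_count
FROM   user_category_agr
WHERE  rptg_dt >= add_months(trunc(current_date, 'MM'), -1)  -- first day of previous month
  AND  rptg_dt <  trunc(current_date, 'MM');                 -- first day of current month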

CHAPTER 7
APPENDICES

7.1 SCREEN SHOTS

In our example analysis, purchase details are taken for the data analysis. For that,
three logical tables are used:
1. products_info
2. users_info
3. user_activity

Each of these has four separate physical tables: a raw table, a staging table, an exception
table and a core table.
1. List of raw tables:
products_info_raw, users_info_raw, user_activity_raw
2. List of staging table:
products_info_stg, users_info_stg, user_activity_stg
3. List of Exception table:
products_info_excp, users_info_excp, user_activity_excp
4. List of Core table:
products_info_core, users_info_core, user_activity_core

Raw table description
Initially the raw table is created based on the attributes of the data, and the generated
data is loaded into it. This raw table contains the users' data.

Fig 7.1.1 products_info_raw table description

Hive is started from the command prompt; it loads all the files and adds the resource
jar files in .jar format, after which the terminal is ready for executing Hive commands.
The tables specified above have already been created. The above screenshot describes
the products_info_raw table with its columns and their datatypes.

In the screenshot below, the SELECT command has been used to display the contents
of the table. The products_info_raw table contains data in the order: product id, name,
reseller, category, price, discount, profit percentage. The description of the
users_info_raw table is also displayed along with its data.

Fig 7.1.2 Data of products_info_raw table, and data and description of users_info_raw

The user_activity_raw table description follows, with data in the order: product id,
user id, cancellation, return, cancellation reason, return reason, order date, shipment
date, delivery date, cancellation date, return date. The reason fields basically describe
the advantages and disadvantages of the product (why it was cancelled or returned).

Fig 7.1.3 Data and description of user_activity_raw

Staging table description

We create the staging table and overwrite into it the data from the raw table. This is
the stage where data is held before loading into the core tables. The staging table is
created with partitioning and bucketing, which are used to speed up processing and
searching during the analysis.

Fig 7.1.4 Data and description of products_info_stg

A staging table is created for every table included in the analysis. The screenshots
below describe users_info_stg and user_activity_stg. These tables use the date as the
partition column, because data is received on a daily basis, so the tables are partitioned
by date.

Fig 7.1.5 Data and description of users_info_stg

Exception table description

The users' data in the staging table is validated using the rules discussed in the rules
checking and validation module. Data which does not satisfy the specified rules, or
which raises exceptions, is treated as erroneous and is handled by the exception table.
The table products_info_excp is created with the date as the partition column.

Fig 7.1.6 Description of products_info_excp

An exception table is created for every table. In our example we have three tables,
and each of them has its own exception table. The screenshot below describes
users_info_excp and user_activity_excp.

Fig 7.1.7 Description of users_info_excp, user_activity_excp

HBase lookup tables

We have two HBase lookup tables for data enrichment. These HBase tables are
referred to by external tables created in Hive.

Fig 7.1.8 External tables - prod_details, user_location

Core table description

The core table is the place where our analysis is done. It contains the valid data,
i.e. the data which satisfies all the rules in the validation process. The data in the
staging table is validated using the rules, and during validation the invalid records are
ignored; the valid data is inserted into the core table for data analysis. The screenshot
below describes products_info_core and the data present in the table.

Fig 7.1.9 Data and description of products_info_core

Core tables are created for every table. These tables also use partitioning and
bucketing for fast analysis. The users_info_core table description is shown, and the
data displayed is the valid data inserted after validation has completed.

Fig 7.1.10 Data and description of users_info_core

The screenshot below displays the user_activity_core table description and the data
available in the table, which is overwritten after the rules checking and validation.

Fig 7.1.11 Data and description of user_activity_core

Data analysis
Purchase pattern detection
We create a new table called user_category_agr which takes its input from the core
tables, with the specified conditions, to answer the analysis question below.
What is the most purchased category for every user? Identify the users with the
maximum amount of valid purchases.

Fig 7.1.12 Purchase pattern detection 1

In the same way, we create tables called max_profit and res_max_profit. These are
overwritten with input from the core tables, with the specified conditions, to answer
the following analysis questions.
Which products are generating the maximum profit?
Which resellers are generating the maximum profit?

Fig 7.1.13 Purchase pattern detection 2 & 3

The table occupation_category_aggr is created to answer the following analysis
question.
Which is the most sought-after category corresponding to every occupation?

Fig 7.1.14 Purchase pattern detection 4

Fraud detection
In this analysis we need to find out which products are returned most and which are
cancelled most at the same location; these analyses fall under fraud detection. We
create a table called fraud_detection_work1 to answer the first fraud question.
Which user has performed the most returns? What are the valid purchases made
by those users?

Fig 7.1.15 Fraud detection 1

In the same way, we create the tables fraud_detection_work2 and
fraud_detection_work3 for the analysis. These tables are created using partitioning and
bucketing to speed up the analysis, and they answer the questions below.
Which location is getting the most cancellations?
Which location is getting the most returns?

Fig 7.1.16 Fraud detection 2 & 3

Shortcoming identification
In this analysis we need to find the total investment and the profit gained; these fall
under the category of shortcomings.
What is the net worth of cancelled products across every city, for every different
reason type, sorted in descending order of net worth?
The table shortcoming_identification_1 gives the output of the above question.

Fig 7.1.17 Shortcoming identification 1

In the screenshot below, the table shortcoming_identification_2 is used to find the
output of the following question.
What is the net worth of returned products across every city, for every different
reason type, sorted in descending order of net worth?

Fig 7.1.18 Shortcoming identification 2

7.2 SOURCE CODE
Products_info_raw table creation

create table products_info_raw
(id STRING, name STRING, reseller STRING, category STRING, price int, discount int, profit_percent int)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Products_info_stg table creation

CREATE TABLE products_info_stg
(product_id STRING, product_name STRING, reseller STRING, category STRING, price BIGINT, discount FLOAT, profit_percent FLOAT)
PARTITIONED BY (rptg_dt STRING)
CLUSTERED BY (product_id) INTO 8 BUCKETS STORED AS ORC;

HBase table creation

production_category HBase table creation
create 'production_category', 'prod_details'

user_location HBase table creation
create 'user_location', 'user_details'

DATA INSERTION

Loading data into the products_info_raw table

LOAD DATA LOCAL INPATH
'/home/priya/Documents/CTS/projects/myCart/data/product_info_merge.json'
INTO TABLE products_info_raw;

Loading data into products_info_stg table

INSERT OVERWRITE TABLE products_info_stg PARTITION (rptg_dt)
SELECT id, name, reseller, category, price, discount, profit_percent,
from_unixtime(cast(unix_timestamp() as bigint),'yyyy-MM-dd') as rptg_dt
FROM products_info_raw;

Creating products_info_core table

CREATE TABLE products_info_core
(product_id STRING, product_name STRING, reseller STRING, category STRING, price BIGINT, discount FLOAT, profit_percent FLOAT)
PARTITIONED BY (rptg_dt STRING)
CLUSTERED BY (product_id) INTO 8 BUCKETS STORED AS ORC;

Creating table products_info_excp

CREATE TABLE products_info_excp
(product_id STRING, product_name STRING, reseller STRING, category STRING, price BIGINT, discount FLOAT, profit_percent FLOAT, rule_failed STRING)
PARTITIONED BY (rptg_dt STRING)
CLUSTERED BY (product_id) INTO 8 BUCKETS STORED AS ORC;
Inserting data into products_info_excp and products_info_core tables

FROM products_info_stg p
LEFT OUTER JOIN prod_details l ON
p.product_id = l.prod_id AND p.rptg_dt = from_unixtime(cast(unix_timestamp() as bigint),'yyyy-MM-dd')
INSERT OVERWRITE TABLE products_info_excp PARTITION (rptg_dt)
SELECT p.product_id, p.product_name, p.reseller, p.category, p.price, p.discount, p.profit_percent,
CASE WHEN p.product_id IS NULL THEN 'R1'
     WHEN p.discount >= p.price THEN 'R2'
END AS rule_failed, p.rptg_dt
WHERE (p.product_id IS NULL) OR (p.discount >= p.price)
INSERT OVERWRITE TABLE products_info_core PARTITION (rptg_dt)
SELECT p.product_id, p.product_name, p.reseller,
CASE WHEN p.category IS NULL THEN l.category
     ELSE p.category END AS category, p.price, p.discount, p.profit_percent, p.rptg_dt
WHERE (p.product_id IS NOT NULL) AND (p.discount < p.price);

Data validation & Rules checking

Rules checking on products_info table

1. hive -e "SELECT COUNT(*) FROM mycart.products_info_excp WHERE rule_failed = 'R1'" > products_info_excp_r1.txt
2. hive -e "SELECT COUNT(*) FROM mycart.products_info_excp WHERE rule_failed = 'R2'" > products_info_excp_r2.txt
3. hive -e "SELECT COUNT(*) FROM mycart.products_info_excp WHERE rule_failed = 'R3'" > products_info_excp_r3.txt

Checking the contents of products_info_core table

hive -e "SELECT COUNT(*) FROM mycart.products_info_core" >


products_info_core.txt

REFERENCES
[1] S. Sugune, M. Vithya, J. I. Christy Eunaicy, "Big data analysis in e-commerce system using Hadoop MapReduce", Inventive Computation Technologies (ICICT), International Conference on, 26-27 Aug. 2016.
[2] Pavithra B, Dr. Niranjanmurthy M, Kamal Shaker J, Martien Sylvester Mani F, "The Study of Big Data Analytics in E-Commerce", International Journal of Advanced Research in Computer and Communication Engineering, ICRITCSA, Vol. 5, Special Issue 2, October 2016.
[3] Saurabh Arora, Inderveer Chana, "A survey of clustering techniques for big data analysis", Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference, 25-26 Sept. 2014.
[4] Hirotoshi Cho, Hiroaki Shiokawa, Hiroyuki, "JsFlow: Integration of Massive Streams and Batches via JSON-based Dataflow Algebra", Network-Based Information Systems (NBiS), 2016 19th International Conference on, 7-9 Sept. 2016.
[5] Vaishali Chauhan, Meenakshi Sharma, "Hive, Pig & HBase Performance Evaluation for Data Processing Applications", International Journal of Advance Research in Science and Engineering, Vol. 5, Issue 6, June 2016.
[6] Hsinchun Chen, Roger H. L. Chiang and Veda C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact", MIS Quarterly, Vol. 36, No. 4, pp. 1165-1188, December 2012.
[7] Arti, Sunita Choudhary and G. N. Purohit, "Role of Web Mining in E-Commerce", International Journal of Advanced Research in Computer and Communication Engineering, ISSN (Online): 2278-1021, ISSN (Print): 2319-5940, Vol. 4, Issue 1, pp. 251-253, January 2015.
[8] Mustapha Ismail, Mohammed Mansur Ibrahim, Zayyan Mahmoud Sanusi and Muesser Nat, "Data Mining in Electronic Commerce: Benefits and Challenges", International Journal of Communications, Network and System Sciences, pp. 501-509, December 2015.
[9] Shahriar Akter and Samuel Fosso Wamba, "Big data analytics in E-commerce: a systematic review and agenda for future research", pp. 173-194, March 2016.
[10] Ahmad Ghandour, "Big Data Driven E-Commerce Architecture", International Journal of Economics, Commerce and Management, ISSN 2348 0386, Vol. III, Issue 5, pp. 940-947, May 2015.
[11] O. Liu, W. K. Chong, K. L. Man and C. O. Chan, "The Application of Big Data Analytics in Business World", Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. II, March 16-18, 2016.
[12] Arnoldina Pabedinskaitė, Vida Davidavičienė and Paulius Miliauskas, "Big Data Driven E-Commerce Marketing", 8th International Scientific Conference "Business and Management 2014", ISBN online 978-609-457-651-5, pp. 645-654, May 15-16, 2014.
[13] Uyoyo Zino Edosio, "Big Data Analytics and its Application in E-commerce", ResearchGate, Conference Paper, April 2014.
[14] G. Ilieva, T. Yankova and S. Klisarova, "Big Data Based System Model of Electronic Commerce", Trakia Journal of Sciences, Vol. 13, Suppl. 1, pp. 407-413, 2015.
[15] Niranjanamurthy M, Dr. Dharmendra Chahar, "The Study of E-Commerce Security Issues and Solutions", International Journal of Advanced Research in Computer and Communication Engineering, ISSN (Online): 2278-1021, Vol. 2, Issue 7, July 2013.
[16] Constantine J. Aivalis, Kleanthis Gatziolis, Anthony C. Boucouvalas, "Evolving analytics for e-commerce applications: Utilizing big data and social media extensions", 2016 International Conference on Telecommunications and Multimedia (TEMU), DOI: 10.1109/TEMU.2016.7551938, IEEE, 25-27 July 2016.
[17] Radoslav Fasuga, Ji Rdel, Eduard Kubanda, "Gloffer Framework for the implementation of portal solutions for E-commerce", ICETA 2015, INSPEC Accession Number: 16284512, IEEE, 26-27 Nov. 2015.
