
Hands On Hadoop Tools

A Project Work
Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
IN

Computer Science with IBM Specialization in Big Data and Analytics

Submitted by:
GAGANDEEP SINGH
(17BCS3758)

Under the Supervision of:

Mrs. Shikha Sharma Sarkar

APEX INSTITUTE OF TECHNOLOGY


CHANDIGARH UNIVERSITY, GHARUAN, MOHALI - 140413,
PUNJAB
MARCH 2019
DECLARATION

I, a student of 'Bachelor of Engineering in Computer Science with IBM Specialization in Big Data and Analytics', session 2018-2019, Apex Institute of Technology, Chandigarh University, Punjab, hereby declare that the work presented in this Project Work entitled 'Hands on Hadoop Tools' is the outcome of my own bona fide work and is correct to the best of my knowledge, and that this work has been undertaken with due regard to Engineering Ethics. It contains no material previously published or written by another person, nor material which has been accepted for the award of any other degree or diploma of the university or any other institute of higher learning, except where due acknowledgment has been made in the text.

Date: 27 March 2019

ACKNOWLEDGEMENT

A major project is a golden opportunity for learning and self-development. I consider myself very lucky and honored to have had so many wonderful people lead me through the completion of this project.

First and foremost, I would like to thank Dr. Bhupinder Singh, HOD CSE, who gave us the opportunity to undertake this project.

My grateful thanks to Prof. Shikha Sharma Sarkar for her guidance in the project work.

I would also like to thank my seniors.

The CSE department monitored our progress and arranged all facilities to make life easier. I take this moment to acknowledge their contribution gratefully.

Abstract
Big data is a new driver of world economic and societal change. The world's data collection is reaching a tipping point for major technological changes that can bring new ways of making decisions and managing our health, cities, finance and education. While data complexities are increasing, including data's volume, variety, velocity and veracity, the real impact hinges on our ability to uncover the 'value' in the data through Big Data Analytics technologies. Big Data Analytics poses a grand challenge for the design of highly scalable algorithms and systems that integrate data and uncover large hidden values from datasets that are diverse, complex, and of massive scale. Potential breakthroughs include new algorithms, methodologies, systems and applications in Big Data Analytics that discover useful and hidden knowledge from Big Data efficiently and effectively. Big Data Analytics is relevant to Hong Kong as it moves towards a digital economy and society. Hong Kong is already among the best in the world in Big Data Analytics, holding leadership positions such as chairs and editors-in-chief of important conferences and journals in Big Data related areas. But to maintain these leadership positions, Hong Kong's universities, government and industry must act quickly to address a number of major challenges. These challenges include "foundations," which concerns new algorithms, theory and methodologies for knowledge discovery from large amounts of data, and "systems and applications," which concerns innovative applications and systems useful for supporting Big Data practices. Big data analytics must also be a team effort cutting across academic institutions, government, society and industry, carried out by researchers from multiple disciplines including computer science and engineering, health, data science, and social and policy areas.

Table Of Contents

Title Page i

Declaration of the Student ii

Certificate of the Guide iii

Acknowledgement iv

Abstract v

1) INTRODUCTION 1

2) INSTALLATION 2

3) WORKING OF HIVE 7

4) RESULT 8

5) PROBLEMS FACED 8

6) FUTURE SCOPE 9

7) REFERENCES 10
1) INTRODUCTION

This massive amount of data is known as Big Data. Big Data is a buzzword, or catch-phrase, used to describe a huge volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques; organizations mine it to streamline operations and make quicker, more intelligent decisions. The term Big Data has nowadays become common in IT industries. A huge amount of data has always existed in industry, but there was nothing to exploit it before Big Data came into the picture [3]. An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data comprising billions to trillions of records of millions of people, all from different sources (e.g. the web, sales, customer contact centers, social media, mobile data and so on). The data is typically loosely structured and often incomplete and inaccessible.
2. Four V's of Big Data
Big Data Analytics is the handling of difficult and enormous datasets. This data is different from structured data in terms of parameters such as volume, velocity, variety and veracity (the 4 V's); some definitions add a fifth, value. The four V's that define the challenges of big data management are:

2.1 Volume
Data of all types is ever-increasing day by day, at every scale: KB, MB, TB, PB, ZB, YB. The data accumulates into very large files, and storing this extreme volume of data is the biggest challenge; this issue is mitigated mainly by decreasing storage costs. Data volumes are predicted to grow 50 times by 2020.

2.2 Velocity
The data arrives at high speed, and sometimes one minute is too late, so big data is time-sensitive. For some organizations, handling data velocity is a central task. Social media posts and credit card transactions are completed in milliseconds, and the records they create must be put into databases just as fast.

2.3 Variety
Data sources are tremendously heterogeneous. Records come in many layouts and of any type; they may be structured or unstructured, such as text, audio, video, log files and more. The variations are boundless, and the data enters the network without having been quantified or qualified in any way.

2.4 Veracity
Veracity refers to the trustworthiness of the values that make up a large data set. When we deal with high volume, velocity and variety of data, not all of it will be 100% correct; there will be dirty data. Big data and analytics technologies must work with these kinds of data.

3. Hadoop
Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data [1]. It distributes files among the nodes and allows the system to continue working in case of a node failure. This approach reduces the risk of catastrophic system failure. Hadoop consists of the Hadoop kernel [10], the Hadoop Distributed File System (HDFS), MapReduce, and related projects such as ZooKeeper, HBase and Apache Hive. The Hadoop Distributed File System consists of three components: the NameNode, the Secondary NameNode and the DataNode.
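
A quick way to see HDFS in action is through its shell. The commands below are a minimal sketch (the paths and file name are illustrative, not taken from this report's setup); they copy a local file into the distributed file system and list it:

 hdfs dfs -mkdir -p /user/cloudera/input      # create a directory in HDFS
 hdfs dfs -put test.txt /user/cloudera/input  # copy a local file into HDFS
 hdfs dfs -ls /user/cloudera/input            # list the files in the directory

Behind the scenes, HDFS splits the file into blocks and replicates them across DataNodes, while the NameNode keeps only the metadata.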

3.1 Hadoop and Its Components

HBase: An open-source, distributed, non-relational database system implemented in Java. It runs above the HDFS layer and can serve as both the input and the output for MapReduce jobs.

Oozie: A web application that runs in a Java servlet container. Oozie uses a database to store information about workflows, each of which is a collection of actions. It manages Hadoop jobs in an orderly way.

Sqoop: A command-line interface application that provides a platform for moving data between relational databases and Hadoop, in either direction.

Pig: A high-level platform, built on the MapReduce framework, that is used with Hadoop. It is a high-level data-processing system in which data records are analyzed in a high-level language.

ZooKeeper: A centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.

Hive: An application developed for data warehousing that provides an SQL interface as well as a relational model. The Hive infrastructure is built on top of Hadoop and helps in providing summarization and analysis for the respective queries. Hadoop itself was created by Doug Cutting and Mike Cafarella.


2) INSTALLATION

2.1) Downloading Cloudera

1) Open the Cloudera website. Go to https://www.cloudera.com/downloads/quickstart_vms/5-13.html in your computer's web browser. You can download the Cloudera QuickStart VM (shipped as a .vmdk disk image) here.
2) Scroll down to the latest version of the Cloudera VM. You'll find it near the bottom of the page.
3) Download it.

2.2) Installing and Setting up VirtualBox

1) Open VirtualBox. Double-click (or click once on a Mac) the VirtualBox app icon.
2) Click New. It's a blue badge in the upper-left corner of the VirtualBox window.
3) Select the Cloudera image and set the RAM.
4) Create your virtual machine's virtual hard drive.
5) Run it.

3) WORKING OF HIVE

Hive is developed on top of Hadoop. It is a data warehouse framework for querying and analyzing data stored in HDFS. Hive is open-source software that lets programmers analyze large data sets on Hadoop.

The size of the data sets being collected and analyzed in industry for business intelligence is growing, and in a way this is making traditional data warehousing solutions more expensive. Hadoop, with its MapReduce framework, is being used as an alternative solution for analyzing data sets of huge size. Although Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low level and requires programmers to write custom programs that are hard to maintain and reuse. Hive comes to the rescue of programmers here. Hive evolved as a data warehousing solution built on top of the Hadoop MapReduce framework.

Hive provides an SQL-like declarative language, called HiveQL, which is used for expressing queries. Using HiveQL, users familiar with SQL are able to perform data analysis very easily.

The Hive engine compiles these queries into MapReduce jobs to be executed on Hadoop. In addition, custom MapReduce scripts can be plugged into queries. Hive operates on data stored in tables, which consist of primitive data types and collection data types like arrays and maps.
Hive comes with a command-line shell interface which can be used to create tables and execute queries.
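
As a small illustration of these table features, the sketch below (the table and column names are hypothetical, chosen only for this example) creates a table from the Hive shell that mixes primitive and collection types:

 hive> CREATE TABLE employee (
     >   name STRING,                  -- primitive type
     >   salary FLOAT,                 -- primitive type
     >   skills ARRAY<STRING>,         -- collection type: list of values
     >   contact MAP<STRING, STRING>   -- collection type: key/value pairs
     > )
     > ROW FORMAT DELIMITED
     > FIELDS TERMINATED BY '\t'
     > COLLECTION ITEMS TERMINATED BY ','
     > MAP KEYS TERMINATED BY ':';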

The Hive query language is similar to SQL in that it supports subqueries. With the Hive query language, it is possible to perform MapReduce joins across Hive tables. It has support for simple SQL-like functions (CONCAT, SUBSTR, ROUND, etc.) and for aggregation functions (SUM, COUNT, MAX, etc.). It also supports the GROUP BY and SORT BY clauses, and it is possible to write user-defined functions in the Hive query language.
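
For instance, assuming a hypothetical sales table with columns region and amount, a single HiveQL query can combine several of the features listed above:

 hive> SELECT region,
     >        COUNT(*) AS orders,               -- aggregation function
     >        ROUND(SUM(amount), 2) AS revenue  -- SQL-like function over an aggregate
     > FROM sales
     > GROUP BY region
     > SORT BY region;

Hive compiles this into MapReduce jobs: the GROUP BY column becomes the shuffle key, and the aggregates are computed in the reducers.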

Important characteristics of Hive
In Hive, tables and databases are created first and then data is loaded into
these tables.

Hive is a data warehouse designed for managing and querying only structured data that is stored in tables.

While dealing with structured data, MapReduce lacks the optimization and usability features, such as UDFs, that the Hive framework provides. Query optimization refers to an effective way of executing a query in terms of performance.

Hive's SQL-inspired language separates the user from the complexity of MapReduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schemas, for ease of learning.

Hadoop's programming works on flat files, so Hive can use directory structures to "partition" data and improve performance on certain queries.

A new and important component of Hive is the Metastore, which is used for storing schema information. This Metastore typically resides in a relational database. We can interact with Hive using methods like:
the Web GUI
the Java Database Connectivity (JDBC) interface

Most interactions tend to take place over a command-line interface (CLI). Hive provides a CLI for writing Hive queries in the Hive Query Language (HQL).

Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all the records present in the mentioned table.
Sample query: SELECT * FROM <TableName>;

Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).

For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.

For setting up MySQL as the database for storing metadata information, see the tutorial "Installation and Configuration of HIVE and MYSQL".
Some of the key points about Hive:

The major difference between HQL and SQL is that a Hive query executes on Hadoop's infrastructure rather than on a traditional database.

Hive query execution runs as a series of automatically generated MapReduce jobs.

Hive supports partition and bucket concepts for easy retrieval of data when the client executes a query (see the sketch after this list).

Hive supports custom UDFs (User Defined Functions) for data cleansing, filtering, etc. Hive UDFs can be defined according to the programmers' requirements.
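
As a sketch of the partition and bucket concepts (the logs table and its columns are hypothetical):

 hive> CREATE TABLE logs (ip STRING, url STRING)
     > PARTITIONED BY (log_date STRING)    -- one HDFS subdirectory per date value
     > CLUSTERED BY (ip) INTO 8 BUCKETS;   -- rows hashed on ip into 8 files per partition

A query filtering on log_date then reads only the matching partition directories instead of scanning the whole table.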

Hive Architecture

[Figure: Hive architecture diagram]

Starting with Hive

 Open Cloudera and start the VM.
 Open a terminal in CentOS.
 Type hive in the terminal.
 Hive will be started.
Create an internal table in Hive

 hive> create table ss (column1 datatype, column2 datatype, ...)
     > row format delimited
     > fields terminated by '\t';
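
Once the table exists, it can be checked from the same shell:

 hive> SHOW TABLES;    -- lists the tables in the current database
 hive> DESCRIBE ss;    -- shows the column names and types of ss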

Loading data from a text file into the table we have created

 hive> LOAD DATA INPATH '/user/desktop/test.txt' INTO TABLE ss;

 hive> select * from ss;
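
Note that LOAD DATA INPATH moves a file that is already inside HDFS. If the text file still sits on the local Linux file system of the VM, the LOCAL keyword can be used instead (the path below is illustrative):

 hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/test.txt' INTO TABLE ss;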

 Viewing the table in Hadoop (the Hive data warehouse)

1. Go to the browser and open HDFS on localhost by typing the link:
http://localhost:50070/dfshealth.jsp
2. Now click on "Browse the file system".
3. Now go to the user directory and then into the warehouse directory; you will find the database 'san' there, and inside it you will find your table 'ss'.
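
The same check can be done from the terminal instead of the web UI. Assuming the default warehouse location, the table's files live under /user/hive/warehouse, with a .db suffix on the database directory:

 hdfs dfs -ls /user/hive/warehouse/san.db/ss   # should show the loaded data file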
4) Result

We can observe from the steps above that we have successfully fetched the data from our .txt file into the table created in Hive and into the HDFS directory. Once the data has been successfully stored in our database, we can manipulate it to fit the needs of our future projects.

5) Problems Faced

 Tried on a sandbox, but it did not work (system not compatible).
 Tried on IBM Cloud, but it worked only once per account.
 Tried Microsoft Azure, but it was available only for 10 days.

6) Future Scope

By combining Hive with other Big Data tools, we can manipulate our data more precisely.
7) References

 https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_qs_quick_start.html

 https://doctuts.readthedocs.io/en/latest/hive.html

 https://www.cloudera.com/developers/get-started-with-hadoop-tutorial/setup.html

 https://www.guru99.com/hive-create-alter-drop-table.html
