
Big Data & Hadoop

Seminar Report

Submitted

in the Department of Computer Science & Engineering

for the degree of Bachelor of Technology

Submitted To:
Mrs. Anuja Sharma
Assistant Professor

Submitted By:
Ritvija Tiwari
16BCON556

Department of Computer Science & Engineering

JECRC UNIVERSITY, JAIPUR

2019
ACKNOWLEDGEMENT

First of all, I would like to express my gratitude towards JECRC University, Jaipur, for
providing me with a platform to present my seminar at such an esteemed institute.

I would like to thank Prof. (Dr.) Naveen Hemrajani, HOD, Department of Computer Science &
Engineering, JECRC University, Jaipur, for his constant support.

I am also thankful to all the staff members of the department for their full cooperation and help.

Ritvija Tiwari
16BCON556
INDEX

Topic Page No.

Acknowledgment 2

List of Figures 4

Abstract 5

1. Introduction 6

1.1 What comes under Big Data? 7

1.2 The Structure of Big Data 7-8

2. Benefits of Big data 9

3. Architecture 10

4. Application 12

5. Hadoop 13

5.1 History of Hadoop 13-14

6. Hadoop Architecture 14-15

7. HDFS 16

8. HDFS Architecture 17

9. Advantages of Hadoop 18

10. MapReduce 20
10.1 The Algorithm 20-21

11. How does Hadoop work? 22

12. Practical Usage 23-24

13. Future of Big Data 25-28

14. Conclusion 29

15. References 30

LIST OF FIGURES

Figure No. Title of Figure Page No.

1. Introduction 2

2. What Comes Under Big Data? 3

3. Hadoop 5

4. Hadoop Architecture 8

5. HDFS Architecture 12

6. Practical uses of Big Data 23

7. Future of Big Data 24

8. Salary Structure 25

9. Hadoop Skills 27

ABSTRACT

Big data is a new driver of world economic and societal change. The world's data collection
is reaching a tipping point for major technological changes that can bring new ways of making
decisions and managing our health, cities, finance, and education. While data complexity is
increasing in terms of volume, variety, velocity, and veracity, the real impact hinges on our
ability to uncover the 'value' in the data through Big Data Analytics technologies. Big Data
Analytics poses a grand challenge for the design of highly scalable algorithms and systems that
integrate data and uncover large hidden value from datasets that are diverse, complex, and
of massive scale. Potential breakthroughs include new algorithms, methodologies, systems, and
applications in Big Data Analytics that discover useful and hidden knowledge from big data
efficiently and effectively.

Big data, which refers to data sets that are too big to be handled using existing database
management tools, is emerging in many important applications, such as Internet search,
business informatics, social networks, social media, genomics, and meteorology. Big data
presents a grand challenge for database and data analytics research. A central theme of this
research is to connect big data with people in various ways; recent progress includes user
preference understanding, context-aware and on-demand data mining using crowd intelligence,
summarization and explorative analysis of large data sets, and privacy-preserving data sharing
and analysis.
1.INTRODUCTION
Big data is a term used for data sets so large or complex that traditional data processing
mechanisms are inadequate. The challenges of big data include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, and the privacy of information.

The term often refers simply to the use of predictive analytics or certain other advanced methods to
extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to
more confident decision making, and better decisions can mean greater operational efficiency, cost
reductions, and reduced risk. Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on." Scientists, practitioners of media and advertising, and
governments alike regularly meet difficulties with large data sets in areas including Internet search,
finance, and business informatics. Scientists encounter limitations in e-Science work, including
meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.

The term big data most often refers to the use of predictive analytics or other advanced
methods to extract value from information and rarely refers to a particular size of data set.
Precision in big data may lead to more confident and better decision making, which in turn aids
in achieving greater operational efficiency, reduced cost, and reduced risk. Big data includes
data sets that are too large for commonly used software tools to capture, curate, manage, and
process within a defined time. The size of big data is continuously increasing, having grown
from a few terabytes to petabytes.
Fig-1

1.1 What Comes Under Big Data?


Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.

• Black Box Data − This is a component of helicopters, airplanes, jets, etc. It captures the voices
of the flight crew, recordings of microphones and earphones, and the performance
information of the aircraft.

• Social Media Data − Social media such as Facebook and Twitter hold information and the
views posted by millions of people across the globe.

• Stock Exchange Data − The stock exchange data holds information about the 'buy' and
'sell' decisions made by customers on the shares of different companies.

• Power Grid Data − The power grid data holds information about the power consumed by a
particular node with respect to a base station.

• Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
• Search Engine Data − Search engines retrieve lots of data from different databases.

Fig-2

Big data represents data assets characterized by high volume, velocity, and variety, which
require specific technologies and analytical methods for their transformation into value. Big
data has many characteristics, which are as follows:

• Volume
• Variety
• Velocity
• Veracity
• Variability
• Complexity

• Volume: Volume describes the quantity of information that is produced, which is very
important in the present context; it is the size of the data that determines its value and
potential and whether it can be regarded as big data or not.

• Variety: Variety determines the category to which the big data belongs, and it is also an
essential factor that the data analyst should know.

• Velocity: In the present context, velocity refers to the speed at which the information is
produced.

• Variability: Variability refers to the inconsistency of the data, and it causes problems for
the people who analyze the information.

• Veracity: The quality of the information being captured may vary a lot, and the precision
of analysis depends on the veracity of the source information.

• Complexity: Managing data can become a difficult process when huge amounts of
information come from different sources. All of this information needs to be linked,
correlated, and connected in order to derive the insight it is supposed to convey. Such a
condition of big data is called complexity.

1.2. The Structure of Big Data:

The structure of big data can be explained by the following categories:

• Structured
• Semi-structured
• Unstructured

• Structured: Structured data mostly comes from traditional sources of information and follows
a fixed schema, such as relational database tables.

• Semi-structured: Semi-structured data covers many sources of big data that do not follow a
rigid schema but still carry organizational markers, such as XML or JSON documents.

• Unstructured: Unstructured data includes information such as video, audio, and free text.

2.BENEFITS OF BIG DATA


• Using the information kept in social networks like Facebook, marketing agencies are
learning about the response to their campaigns, promotions, and other advertising media.

• Using information in social media, such as the preferences and product perceptions of their
consumers, product companies and retail organizations are planning their production.

• Using data from the previous medical history of patients, hospitals are providing
better and quicker service.

3.ARCHITECTURE

In 2000, Seisint Inc. developed a C++ based distributed file-sharing framework for data storage
and querying. Structured, semi-structured and/or unstructured data is stored and distributed
across multiple servers. Querying of data is done in a modified C++ dialect called ECL, which
applies a schema-on-read method to create the structure of stored data at query time. In 2004
LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. along with their
high-speed parallel processing platform. The two platforms were merged into HPCC Systems, which
was open sourced under the Apache v2.0 License in 2011. Currently HPCC and the Quantcast File
System are the only publicly available platforms capable of analyzing multiple exabytes of data.

In 2004, Google published a paper on a process called MapReduce that used such an architecture.
The MapReduce framework provides a parallel processing model and associated implementation to
process huge amounts of data. With MapReduce, queries are split and distributed across parallel
nodes and processed in parallel (the Map step). The results are then gathered and delivered (the
Reduce step). The framework was very successful, so others wanted to replicate the algorithm.
Therefore, an implementation of the MapReduce framework was adopted by an Apache open source
project named Hadoop.

MIKE2.0 is an open approach to information management that acknowledges the need for revisions
due to big data implications in an article titled "Big Data Solution Offering". The methodology
addresses handling big data in terms of useful permutations of data sources, complexity in
interrelationships, and difficulty in deleting (or modifying) individual records.

Recent studies show that a multiple-layer architecture is one option for dealing with big data.
A distributed parallel architecture distributes data across multiple processing units, and
parallel processing units provide data much faster by improving processing speeds. This type of
architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop
frameworks. This type of framework looks to make the processing power transparent to the end
user by using a front-end application server.

Big Data Analytics for manufacturing applications can be based on a 5C architecture (connection,
conversion, cyber, cognition, and configuration).

Big Data Lake − With the changing face of business and the IT sector, capturing and storing data
has evolved into a sophisticated system. The big data lake allows an organization to shift its
focus from centralized control to a shared model in order to respond to the changing dynamics of
information management.

4.APPLICATIONS

The applications of big data lie in the following fields:
• Government
• International development
• Manufacturing
• Cyber-physical models
• Media
• Technology
• Private sector
• Science and research

• Government: For example, in the United States of America, the Obama administration
announced the Big Data Research and Development Initiative in 2012, because big data
can be used to address many issues faced by the government. Big data is also utilized
by the Indian government.

• International development: Developments in big data analysis offer cost-effective
opportunities to improve decision making in critical development areas such as health
care, employment, crime, security, and natural disaster management. In this way, big
data is helpful for international development.

• Manufacturing: In manufacturing, big data furnishes an infrastructure for transparency
in the manufacturing industry.

• Cyber-physical models: Current PHM (prognostics and health management) implementations
make use of data gathered during actual usage, and the analytical procedures perform
more precisely when more data is included. This is the role of big data in cyber-physical
models.

• Media: In the media, big data is used together with the Internet of Things for activities
such as targeting of consumers and data capture.

• Technology: Technology companies such as eBay, Amazon, Facebook, and Google utilize big
data across their websites and services.

• Private sector: Applications of big data in the private sector include retail, retail
banking, and real estate.

• Science: The best example of its application in science is the Large Hadron Collider,
which has about 150 million sensors delivering data 40 million times per second.

5. HADOOP

Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server
to thousands of machines, each offering local computation and storage.

5.1. Problem with the Traditional Approach

In the traditional approach, the main issue was handling the heterogeneity of data, i.e.
structured, semi-structured, and unstructured. RDBMSs focus mostly on structured data such as
banking transactions and operational data, while Hadoop specializes in semi-structured and
unstructured data such as text, videos, audio, Facebook posts, logs, etc. RDBMS technology is a
proven, highly consistent, mature system supported by many companies. Hadoop, on the other hand,
is in demand due to big data, which mostly consists of unstructured data in different formats.
Fig-3

5.2. History of Hadoop

The history of Hadoop started in 2002 with the Apache Nutch project. Hadoop was created by
Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its
origins in Apache Nutch, an open source web search engine which is itself a part of the Lucene
project.

2002 – 2004

Apache Nutch was started in 2002 by Doug Cutting as an effort to build an open source web
search engine based on Lucene and Java for the search and index component. Nutch was based on
sort/merge processing. In June 2003, it was successfully demonstrated on 4 nodes by crawling
100 million pages. However, its creators realised that the architecture would not scale to the
billions of pages on the web. Help came with the publication of a paper in 2003 that described
the architecture of Google's distributed filesystem, GFS, which was being used in production at
Google and which would solve their storage needs for very large files.
2004 – 2006

In 2004, they started writing an open source implementation called the Nutch Distributed
Filesystem (NDFS). In the same year, Google published the paper that introduced MapReduce to
the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch,
and by the middle of that year all the major Nutch algorithms had been ported to run using
MapReduce and NDFS. In February 2006, they moved out of Nutch to form an independent subproject
of Lucene called Hadoop.

• 2004 : Initial versions of what is now Hadoop Distributed FileSystem and MapReduce
implemented by Doug Cutting and Mike Cafarella.
• December 2005 : Nutch ported to a new framework. Hadoop runs reliably on 20 nodes.

2006 – 2008
Doug Cutting joined Yahoo! in 2006, which provided him with a dedicated team and the resources
to turn Hadoop into a system that ran at web scale. Hadoop became a top-level Apache project in
2008.

• February 2006 : Apache Hadoop project officially started to support the standalone
development of MapReduce and HDFS.
• February 2006 : Adoption of Hadoop by Yahoo! Grid Team.
• April 2006 : Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006 : Yahoo! set up a 300-node Hadoop research cluster.
• May 2006 : Sort benchmark run on 500 nodes in 42 hours (better hardware than the April
benchmark).
• October 2006 : Research cluster reaches 600 nodes.
• December 2006 : Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours,
500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007 : Research cluster reaches 900 nodes.
• April 2007 : Research clusters – two cluster of 1000 nodes.
• April 2008 : Won 1 Terabyte sort benchmark in 208 seconds on 990 nodes.
• October 2008 : Loading 10 Terabytes of data per day into research clusters.

2008 – now
Since 2008, full-time development has continued, and there have been many releases of Hadoop.

• March 2009 : 17 clusters with a total of 24,000 nodes.


• April 2009 : Won the minute sort by sorting 500 GB in 59 seconds on 1,400 nodes, and the
100 TB sort in 173 minutes on 3,400 nodes.
• 2011 : Yahoo! was running its search engine across 42,000 nodes.
• July 2013 : Won the Gray sort benchmark, sorting at a rate of 1.42 terabytes per minute.

6.HADOOP ARCHITECTURE

At its core, Hadoop has two major layers namely −

• Processing/Computation layer (MapReduce), and


• Storage layer (Hadoop Distributed File System).
Fig-4

7.HADOOP DISTRIBUTED FILE SYSTEM


The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
applications having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −

• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.

• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
8.HDFS OVERVIEW

The Hadoop File System was developed using a distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed
using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the
files are stored across multiple machines. These files are stored in a redundant fashion to
protect the system against possible data loss in case of failure. HDFS also makes applications
available for parallel processing.

Some of the salient features of HDFS are:

• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS (a small sketch of equivalent
access through the Java API follows this list).
• The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
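As a small programmatic counterpart to the command interface mentioned above, the sketch below
uses the standard Hadoop Java FileSystem API to write a file into HDFS and read it back. It is
only a sketch under stated assumptions: the path /user/demo/sample.txt and the written text are
placeholders, and the cluster's configuration files (core-site.xml, hdfs-site.xml) are assumed
to be on the classpath.

// Minimal HDFS read/write sketch using the Hadoop Java FileSystem API.
// Assumes the cluster configuration is on the classpath; the path is hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connects to the configured namenode
        Path file = new Path("/user/demo/sample.txt"); // hypothetical HDFS path

        // Write a small file; HDFS splits it into blocks and replicates them across datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello from HDFS\n".getBytes("UTF-8"));
        }

        // Read the file back; block data is streamed from the datanodes that hold it.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}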

8.1 Features Of 'Hadoop'


• Suitable for Big Data Analysis
As big data tends to be distributed and unstructured in nature, Hadoop clusters are best suited
for the analysis of big data. Since it is processing logic (not the actual data) that flows to
the computing nodes, less network bandwidth is consumed. This concept is called data locality,
and it helps increase the efficiency of Hadoop-based applications.
• Scalability
Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes, thus
allowing for the growth of big data. Also, scaling does not require modifications to application
logic.
• Fault Tolerance
The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes. That
way, in the event of a cluster node failure, data processing can still proceed by using the data
stored on another cluster node.

8.2. HDFS Architecture


Given below is the architecture of a Hadoop File System.
Fig-5

HDFS follows the master-slave architecture and it has the following elements.
8.2.1. Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the
namenode software. It is software that can be run on commodity hardware. The system hosting the
namenode acts as the master server, and it performs the following tasks −

• Manages the file system namespace.

• Regulates clients' access to files.

• Executes file system operations such as renaming, closing, and opening files and
directories.

8.2.2. Datanode
The datanode is commodity hardware with the GNU/Linux operating system and the datanode
software. For every node (commodity hardware/system) in a cluster, there is a datanode.
These nodes manage the data storage of their system.

• Datanodes perform read-write operations on the file systems, as per client request.

• They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
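As an illustration of this division of labour between the namenode and the datanodes, the
following minimal Java sketch asks the namenode for a file's metadata and block layout and
prints the datanodes that hold each block. The file path is hypothetical, and the example
assumes a reachable cluster whose configuration is on the classpath.

// Minimal sketch: the namenode answers metadata queries, the datanodes hold the blocks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // talks to the namenode
        Path file = new Path("/user/demo/input.txt");        // hypothetical HDFS file

        // Metadata (size, permissions, block layout) comes from the namenode.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block is stored (and replicated) on one or more datanodes.
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", datanodes: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}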

8.2.3. Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into
one or more segments and/or stored on individual datanodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a
block. The default block size is 64 MB (128 MB in later Hadoop releases), but it can be
increased as needed by changing the HDFS configuration.
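As a sketch of how these defaults can be adjusted, the example below overrides the block size
and replication factor for files created by one client through the Hadoop Configuration API; the
same properties (dfs.blocksize and dfs.replication) are normally set cluster-wide in
hdfs-site.xml. The 128 MB value and the file path here are illustrative assumptions only.

// Minimal sketch: overriding HDFS block size and replication for a client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks for newly created files
        conf.setInt("dfs.replication", 3);                 // keep three copies of every block

        FileSystem fs = FileSystem.get(conf);
        // Files created through this FileSystem handle use the overridden settings.
        fs.create(new Path("/user/demo/large-input.dat")).close(); // hypothetical path
        fs.close();
    }
}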

8.3. Goals of HDFS

Fault detection and recovery − Since HDFS includes a large number of commodity hardware
components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick
and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage applications with huge
datasets.
Hardware at data − A requested task can be done efficiently when the computation takes place
near the data. Especially where huge datasets are involved, this reduces network traffic and
increases throughput.

9.ADVANTAGES OF HADOOP

• The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.

• Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the
application layer.

• Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.

• Another big advantage of Hadoop is that apart from being open source, it is compatible with
all platforms since it is Java-based.
10.MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at
Google for efficient processing of large amounts of data (multi-terabyte datasets) on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
MapReduce programs run on Hadoop, which is an Apache open-source framework.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

10.1. The Algorithm


Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage.

Map stage − The map or mapper's job is to process the input data. Generally, the input data is
in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data and
creates several small chunks of data.

Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.

The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.

Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.

After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
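To make the map and reduce stages concrete, below is a minimal word-count sketch written against
the standard Hadoop MapReduce Java API. It is illustrative only: the mapper emits a (word, 1)
pair for every word in its input split, the framework shuffles and groups the pairs by key, and
the reducer sums the counts for each word.

// Word-count mapper and reducer: a minimal sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line is split into words, and (word, 1) is emitted for each word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reduce stage: after the shuffle, all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}

With this decomposition, scaling from a single test file to terabytes of input is, as noted
above, mainly a matter of cluster configuration rather than code changes.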
11. How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations to handle large-scale
processing. As an alternative, you can tie together many commodity computers with single CPUs
as a single functional distributed system; practically, the clustered machines can read the
dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one
high-end server. So the first motivational factor behind using Hadoop is that it runs across
clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs (a minimal job-driver sketch follows this list) −

• Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).

• These files are then distributed across various cluster nodes for further processing.

• HDFS, being on top of the local file system, supervises the processing.

• Blocks are replicated for handling hardware failure.

• Checking that the code was executed successfully.

• Performing the sort that takes place between the map and reduce stages.

• Sending the sorted data to a certain computer.

• Writing the debugging logs for each job.
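The sketch below shows what a minimal job driver for the word-count mapper and reducer from
section 10 might look like. It only wires the classes together and submits the job; Hadoop
itself performs the tasks listed above (block placement, task scheduling, the sort between the
map and reduce stages, and re-execution on failure). The input and output HDFS paths are assumed
to be supplied on the command line.

// Minimal job driver sketch, assuming the WordCountMapper and WordCountReducer
// classes from section 10 are on the job's classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up the cluster configuration
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS directories passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; Hadoop schedules map and reduce tasks near the data.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}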


12. Practical Uses of Big Data

12.1. Location Tracking:

Logistics companies have been using location analytics to track and report orders for quite some
time. With big data in the picture, it is now possible to track the condition of goods in
transit and estimate the losses. It is also possible to gather real-time data about traffic and
weather conditions and define routes for transportation. This helps logistics companies mitigate
risks in transport and improve the speed and reliability of delivery.

12.2. Precision Medicine:

With big data, hospitals can improve the level of patient care they provide. 24x7 monitoring can
be provided to intensive care patients without the need for direct supervision. On top of that,
the effectiveness of medication can be improved by analyzing the past records of patients and
the medicines provided to them. The need for guesswork can be significantly reduced.
In the case of certain biopharmaceuticals, there are many variables that impact the final
product. For example, while manufacturing insulin, intense care needs to be taken to ensure a
product of the desired quality. By analyzing all the factors impacting the final drug, big data
analysis can point out key factors that might result in deficiencies in production.

12.3. Fraud Detection & Handling:

The banking and finance sector is using big data to predict and prevent cyber crimes, detect
card fraud, archive audit trails, and more. By analyzing the past data of their customers and
data on previous brute-force attacks, banks can predict future attempts. Big data not only helps
in predicting cyber crimes but also helps in handling issues related to mistaken transactions
and failures in net banking. It can even predict possible spikes in server load so that banks
can manage transactions accordingly.
The Securities and Exchange Commission (SEC) is using big data to monitor financial markets for
possible illegal trades and suspicious activities. The SEC uses network analytics and natural
language processing to identify possible fraud in the financial markets.

12.4. Advertising:

Advertisers are among the biggest players in big data. Be it Facebook, Google, Twitter, or any
other online giant, all of them keep track of user behavior and transactions. These internet
giants provide a great deal of data about people to advertisers so that they can run targeted
campaigns. Take Facebook, for example: there you can target people based on buying intent,
website visits, interests, job role, demographics, and more. All this data is collected by
Facebook's algorithms using big data analysis techniques. The same goes for Google: targeting
people based on clicks gives one set of results, while creating a campaign for leads gives
different results. All of this is made possible using big data.

12.5. Entertainment & Media:

In the field of entertainment and media, big data focuses on targeting people with the right
content at the right time. Based on your past views and your behavior online, you will be shown
different recommendations. This technique is popularly used by Netflix and YouTube to increase
engagement and drive more revenue.
Now even television broadcasters are looking to segment their viewer databases and show
different advertisements and shows accordingly. This allows for better revenue from ads and
provides a more engaging user experience.

Fig-6
13. Future of Big Data and Hadoop Developers

13.1. Hadoop and Big Data


Hadoop is the supermodel of big data. Being skilled in Hadoop is a deciding factor between
getting a springboard for your career and getting left behind. For freshers there is huge scope
if they are skilled in Hadoop. Among open source frameworks, there is almost no alternative that
can deal with petabytes of data the way Hadoop can. In 2015 it was predicted that the Indian Big
Data Hadoop industry would grow fivefold in the analytics sector.

13.2. Job Market in Analytics on the Rise in India


Research suggests that by the end of 2018 India alone will face a shortage of about two lakh
data scientists. The probable growth of big data in India is due to the awareness of how
insights from unstructured data can impact businesses and increase their ROI. Another factor is
that India is considered a hub for outsourcing such operational efficiencies at low cost. One
can see Bangalore emerging as a hub for such outsourcing capabilities.
Fig-7

Jobs for Hadoop developers are on the rise as organisations from different verticals such as
e-commerce, retail, automobile, and telecom adopt analytics to gain an advantage over their
competitors. Increasing demand and cost-effectiveness are also making many international
companies focus on India with plans for expansion. If news reports are right, Twitter is also in
the process of setting up an R&D centre in India.

The pool of trained professionals in data analytics with Hadoop expertise is small compared to
the current and expected demand. The Hadoop market in India is not a fad that will dilute with
time; on the contrary, demand is phenomenal, and learning the skill promises a higher salary and
better job prospects for experienced professionals and freshers alike. Currently, major
companies such as Facebook, Jabong, Snapdeal, Amazon, etc. are using Hadoop to process the
zettabytes of data created through their portals; hence, if you are trained in Hadoop, you will
be highly sought after by employers in India.

13.3. Salary Structure on Big Data and Hadoop professionals in India


The salary structure for a trained professional in Big Data and Hadoop is quite lucrative, with
an average starting package of 6-9 lakh, managers with 7-10 years of experience getting anywhere
close to 15-20 lakh, and in some cases professionals with above 15 years of experience drawing
almost, or more than, 1 crore.
Fig-8

13.4. Big Data and Hadoop Skills will evolve and increase with time in India
At a high level, a Hadoop developer is a person who should enjoy programming. Prior knowledge of
SQL, Java, or any other programming or scripting language will also increase your efficiency as
a developer.
Fig-9

14.CONCLUSION
The availability of big data, low-cost commodity hardware, and new information management and
analytic software has produced a unique moment in the history of data analysis. The convergence
of these trends means that we have the capabilities required to analyze astonishing data sets
quickly and cost-effectively for the first time in history. These capabilities are neither
theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to
realize enormous gains in terms of efficiency, productivity, revenue, and profitability. The age
of big data is here, and these are truly revolutionary times if both business and technology
professionals continue to work together and deliver on the promise.

As the career paths available in big data continue to grow, so does the shortage of big data
professionals needed to fill those positions. In the previous sections of this report, the
characteristics needed to be successful in the field of big data have been introduced and
explained. Characteristics such as communication, knowledge of big data concepts, and agility
are just as important as the technical skills of big data.

Big data professionals are the bridge between raw data and usable information. They should have
the skills to manipulate data at the lowest levels, and they must know how to interpret its
trends, patterns, and outliers in many different forms. The languages and methods used to
achieve these goals are growing in strength and number, a pattern unlikely to change in the near
future, especially as more languages and tools enter and gain popularity in the big data fray.

Regardless of language, method, or specialization, big data scientists face a unique technical
challenge: working in a field where their exact role lacks a clear definition. Within an
organization, they help to solve problems, but even these problems may be undefined. To further
complicate matters, some data scientists work outside any specific organization and its
direction, as in academic research. Further study of concrete applications of big data across
multiple disciplines demonstrates how diversely big data scientists can work.

MULTIPLE CHOICE QUESTIONS (MCQs)


1. Point out the correct statement.

a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real-time data
c) In the Hadoop programming framework output files are divided into lines or records
d) None of the mentioned

2. Hadoop is a framework that works with a variety of related tools. Common cohorts
include:

a) MapReduce, Hive and HBase


b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet

3. What was Hadoop named after?

a) Creator Doug Cutting’s favorite circus act


b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development

4. All of the following accurately describe Hadoop, EXCEPT:

a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach

5. __________ has the world's largest Hadoop cluster.

a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned

6. Facebook tackles big data with _______ based on Hadoop.

a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’

7. Hive also supports custom extensions written in:

a) C#
b) Java
c) C
d) C++

8. ___________ is a general-purpose computing model and runtime system for distributed
data analytics.

a) Mapreduce
b) Drill
c) Oozie
d) None of the mentioned

9. _______ jobs are optimized for scalability but not latency.


a) Mapreduce
b) Drill
c) Oozie
d) Hive

10. _________ function is responsible for consolidating the results produced by each of the
Map() functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
15. References

• http://www.balasubramanyamlanka.com/history-of-hadoop/
• https://www.guru99.com/learn-hadoop-in-10-minutes.html
• https://www.newgenapps.com/blog/5-practical-uses-of-big-data
• https://www.tutorialspoint.com/hadoop/hadoop_streaming.html
