Executive Summary
On the list of technology industry buzzwords,
big data is among the most intriguing ones.
As data volume, velocity and variety proliferate,
and the search for veracity escalates, organizations across industries are placing new bets on
various new data sources such as machine sensor
data, medical images, financial information, retail
sales, radio frequency identification and Web
tracking data. This creates huge challenges for decision-makers, who must extract meaning and untangle trends from more input than ever before.
From a technological perspective, the so-called
four Vs of big data (volume, velocity, variety and
veracity) make it ever more difficult to process big
data on a single system. Even if one disregarded
the storage constraints of a single system, and
utilized a storage area network (SAN) to store
the petabytes of incoming data, processing speed
remains a huge bottleneck. Whether a single-core or multi-core processor is used, a single system would take substantially more time to process the data than if the data were partitioned across an HDFS cluster.

Figure 1: HDFS architecture. A client issues metadata operations to the NameNode and reads and writes blocks directly to DataNodes, while the NameNode directs block operations and replication across racks (Rack 1 and Rack 2).
Hadoop: A Primer
Hadoop's Role
HDFS shares many attributes with other distributed file systems. However, Hadoop has implemented numerous features that allow the file system
to be significantly more fault-tolerant than typical
hardware solutions such as redundant arrays
of inexpensive disks (RAIDs) or data replication alone. What follows is a deep dive into the
reasons Hadoop is considered a viable solution
for the challenges created by big data. The HDFS
components explored are the NameNode and
DataNodes (see Figure 1).
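Much of HDFS's fault tolerance comes from replicating each block across DataNodes on different racks. A minimal sketch of a rack-aware placement policy follows (a hypothetical simplification for illustration, not actual NameNode code; node and rack names are invented):

```python
# Toy rack-aware replica placement: first replica on the writer's node,
# the remaining two on a different rack (so a whole-rack failure cannot
# destroy all copies). Simplified from HDFS's default policy.
def place_replicas(writer_node, nodes_by_rack):
    """nodes_by_rack maps rack name -> list of node names."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    replicas = [writer_node]                       # first replica: local node
    other_rack = next(r for r in nodes_by_rack if r != writer_rack)
    replicas.append(nodes_by_rack[other_rack][0])  # second replica: remote rack
    third = next(n for n in nodes_by_rack[other_rack] if n not in replicas)
    replicas.append(third)                         # third replica: same remote rack
    return replicas

placement = place_replicas("dn1", {"rack1": ["dn1", "dn2"],
                                   "rack2": ["dn3", "dn4"]})
print(placement)  # ['dn1', 'dn3', 'dn4'] -- one local copy, two on the other rack
```

Losing either rack still leaves at least one replica available, which is the property the real policy is designed around.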
The MapReduce framework processes large data
sets across numerous computer nodes (known as
data nodes) where all nodes are on the same local
network and use similar hardware. Computational
processing can occur on data stored either in a file
system (either semi-structured or unstructured)
or in a database (structured). MapReduce can take
advantage of data locality. In MapReduce version
1, the components are JobTracker and TaskTrackers, whereas in MapReduce version 2 (YARN), the
components are the ResourceManager and NodeManagers (see Figure 2).
Figure 2: MapReduce v1 versus YARN. In v1, clients submit jobs to a single JobTracker, which schedules tasks on TaskTrackers co-located with DataNodes. In YARN, a ResourceManager allocates containers on NodeManagers, and a per-application ApplicationMaster coordinates that job's containers.
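Whichever version schedules the work, the computational model is the same map, shuffle-and-sort, and reduce sequence. A toy word count sketches the three phases (plain Python with illustrative function names, not the Hadoop API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in an input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Shuffle & sort: group intermediate pairs by key, in sorted key order.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values into a final count.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big", "data velocity"]   # two input splits
mapped = list(chain.from_iterable(map_phase(s) for s in splits))
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'velocity': 1}
```

In Hadoop the same three phases run on separate machines, with the framework moving the intermediate pairs over the network between map and reduce.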
HDFS is designed to handle such large data volumes: by providing substantial aggregate data bandwidth, it can scale to thousands of nodes per cluster. Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of commodity servers operating in parallel, enabling businesses to run applications on thousands of nodes over thousands of terabytes of data.
In legacy environments, traditional ETL and
batch processes can take hours, days or even
weeks in a world where businesses require
access to data in minutes or even seconds.
Hadoop excels at high-volume batch processing. Because processing happens in parallel across nodes, Hadoop can complete batch workloads many times faster than a single server could.
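The partition-and-merge idea behind that speedup can be sketched locally, with threads standing in for cluster nodes (purely illustrative; Hadoop distributes partitions across machines, not local threads):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Stand-in for a heavy per-partition batch job.
    return sum(x * x for x in partition)

def run_batch(data, n_partitions=4):
    # Split the input into roughly equal partitions, process them
    # concurrently, then merge the partial results.
    size = -(-len(data) // n_partitions)  # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return sum(pool.map(process_partition, parts))

data = list(range(1000))
print(run_batch(data))  # 332833500 -- identical to the sequential answer
```

The key property is that each partition is independent, so adding workers (or nodes) divides the work without changing the result.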
Likewise, when Hadoop is used as an enterprise
data hub (EDH), it can ease the ETL bottleneck
by establishing a single version of truth that can
be accessed and transformed by business users
without the need for a dedicated infrastructure
setup. This makes Hadoop one place to store all data, for as long as desired or required, in its original fidelity, integrated with existing systems. Typical uses include mining social media data to determine customer sentiment, and proactively evaluating server/machine logs and analyzing data sets across multiple data sources.

Figure 3: The MapReduce pipeline — Split → Map → [Combine] → Shuffle & Sort → Reduce → Output.
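The optional combine step shown in Figure 3 pre-aggregates each mapper's output before the shuffle, shrinking the data sent over the network. A toy illustration (simplified; real combiners run inside the Hadoop framework, not as a standalone function):

```python
from collections import Counter

# One mapper's raw output: 800 (key, 1) pairs destined for the shuffle.
map_output = [("error", 1)] * 500 + [("warn", 1)] * 300

def combine(pairs):
    # Local pre-aggregation: sum the 1s for each key on the mapper's node.
    return list(Counter(k for k, _ in pairs).items())

combined = combine(map_output)
print(len(map_output), "->", len(combined))  # 800 -> 2 records to shuffle
```

The reducers receive the same totals either way; the combiner only reduces how many intermediate records cross the network.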
Hadoop clusters can be deployed in one of two ways:
Physical-infrastructure-based.
Virtual-infrastructure-based.
Physical Infrastructure for Hadoop Cluster
Deployment
Hadoop and its associated ecosystem components
are deployed on physical machines with large
amounts of local storage and memory. Machines
are racked and stacked with high-speed network
switches.
The merits:
Delivers the full benefits of Hadoop's performance, especially with locality-aware computation. When a node is too busy to accept additional work, the JobTracker can still schedule work near the node and take advantage of the switch's bandwidth.
The demerits:
Prepackaged Hadoop implementations may be older versions or private branches whose code is not public, which makes failures harder to handle.
The demerits:
Unless there is enough work to keep the CPUs busy, hardware becomes a depreciating investment, particularly if servers aren't being used to their full potential.
Figure 4: Job-tuning factors. Hard factors: number of maps, number of reducers, combiner, custom serialization, shuffle tweaks, intermediate compression. External factors: environment, number of cores, memory size, the network.
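One of the factors in Figure 4, intermediate compression, trades CPU time for network bandwidth during the shuffle. A rough local illustration using gzip (Hadoop would typically use a codec such as Snappy or LZO; the data here is invented):

```python
import gzip

# Map output is often highly repetitive (repeated keys, tab-separated
# records), so it compresses well before being shuffled across the network.
intermediate = ("key1\tvalue\n" * 10000).encode()
compressed = gzip.compress(intermediate)
print(len(intermediate), "->", len(compressed))
assert len(compressed) < len(intermediate) // 10  # repetitive data shrinks a lot
```

Whether the trade pays off depends on the cluster: CPU-bound jobs may lose more to compression overhead than they save in transfer time.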
Figure 5: Cluster sizing. AWS EMR machine sizes (vCPU × memory): m1.medium, 1 × 2 GB; m1.large, 1 × 4 GB; m1.xlarge, 4 × 16 GB. Physical machine sizes (CPU × memory): NameNode, 4 × 4 GB; DataNode, 4 × 4 GB; Client, 4 × 8 GB; processor: 4 cores.
Physical machine distribution: Apache Hadoop. Component versions (AWS EMR / physical machines) — Hadoop: 1.0.3 / 2.0.0+1518; Pig: 0.11.1.1-amzn (rexported) / 0.11.0+36; Hive: 0.11.0.1 / 0.10.0+214; Mahout: 0.9 / 0.7+22.
Data Details

Figure 7: 37 columns; 50 files of 20 million records (2.7 GB) each, for roughly 1 billion records and 135 GB in total.
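The stated sizes are mutually consistent if the 20 million records and 2.7 GB are per-file figures (an assumption; the extracted table does not label them explicitly):

```python
# Sanity check on the Figure 7 numbers, assuming per-file figures.
files = 50
records_per_file = 20_000_000
gb_per_file = 2.7

print(files * records_per_file)   # 1,000,000,000 records -- the 1B data point
print(round(files * gb_per_file)) # 135 GB -- the stated total size
```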
Figure 8: Hive transformation runtimes versus number of records (80M–1B) for AWS EMR (m1.large) and physical machines.

Figure 9: Pig transformation runtimes (seconds, 0–6000) versus number of records (40M–320M) on physical machines.

[Chart: runtimes (seconds, up to 6000) for PM versus VM across Pig (Transformation), Hive (Transformation), Hive (Query-2) and Hive (Query-3).]
Consequential Graphs
Figure 8 shows how the cluster performed for the Hive transformation in both physical and virtual environments: both took almost the same time for smaller data sets (roughly 40 to 80 million records), but as data sizes grew, the physical machines outperformed EMR's m1.large cluster.
Figure 9 compares physical machines and the EMR cluster for the Pig transformation.

Figure 11: Pig and Hive transformation runtimes on m1.large versus physical machines (40M–320M records).

Figure 12: Hive Query-2 and Query-3 runtimes on m1.large versus physical machines (40M–1B records).
[Chart: runtimes in seconds (scale 0–800) for PM, VM(1×4) and VM(4×15) across 1M–6M records.]
Figure 14 summarizes the comparison across performance, scalability, cost and resource utilization:

Performance: comparing physical and virtual machines with the same configuration, the physical machines deliver higher performance; with increased memory, however, a VM can perform better.

Scalability: commissioning and decommissioning physical cluster nodes can be expensive compared with provisioning VMs as needed, so scalability can be highly limited with physical machines.

Cost: provisioning physical machines incurs higher cost than virtual machines, where creating a VM can be as simple as cloning an existing instance and assigning it a unique identity.
Moving Forward
Footnotes
1. HDFS: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
2. MapReduce: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business
process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction,
technology innovation, deep industry and business process expertise, and a global, collaborative workforce
that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 221,700 employees as of December 31, 2015, Cognizant is a member of the NASDAQ-100, the S&P
500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest
growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
World Headquarters
European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com
Copyright 2016, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
TL Codex 1732