Large
Complex
Unstructured
Scale Up!
With the power of the Hubble telescope, we can take amazing pictures of objects 45 million light-years away
Amazing image of the Antennae Galaxies (NGC 4038-4039)
Analogous to scaling up:
non-commodity
specialized equipment
single point of failure*
[Figure: Hadoop 1 architecture. MapReduce layer: JobTracker and TaskTrackers. HDFS layer: NameNode and DataNodes.]
Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
What is Hadoop?
             Traditional RDBMS        MapReduce
Data Size    Gigabytes (Terabytes)    Petabytes (Exabytes)
Access       -                        Batch
Updates      -                        -
Structure    Static Schema            Dynamic Schema
Integrity    High (ACID)              Low
Scaling      Nonlinear                Linear
DBA Ratio    1:40                     1:3000
Hadoop in action
WordCount demo
Map/Reduce
HiveQL on a weblog sample
demo
Code to execute:
hadoop jar AcmeWordCount.jar AcmeWordCount /test/davinci.txt /test/davinci_wordcount
Input excerpt: Vanilla Electronic Texts** ... By Computers, Since 1971**
Output excerpt: laws 2
Project
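The WordCount job above follows the usual map, shuffle, reduce pattern. The sketch below imitates that flow in plain Python on a single machine; it is not the AcmeWordCount code, the function names are hypothetical, the whitespace tokenization is an assumption, and the sample lines are made up rather than taken from davinci.txt.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["the laws of nature", "the laws of motion"]  # made-up input
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

On a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; here all three steps run in one process purely to show the data flow.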
GUID      Parameters
[GUID]    &PageID=1234&Region=89191&IsoCy=BR&Lang=1046&Referrer=hotmail.com&ag=2385105&Campaign=&Event=12034
HiveQL: SQL-like language
Write a SQL-like query, which is compiled into MapReduce functions
Includes functions such as str_to_map, so parsing can be done directly in HiveQL

-- param is assumed to be the column holding the parameter string
select
  GUID,
  str_to_map(param, "&", "=")["IsoCy"],
  str_to_map(param, "&", "=")["Lang"]
from
  weblog;
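To make the parsing step concrete, here is a minimal Python sketch of what a str_to_map-style split does to the sample parameter string from the weblog row (rejoined across its line breaks). The Python function is a hypothetical stand-in that only loosely mirrors the Hive built-in; in particular, skipping the empty piece produced by the leading "&" is a choice made here, not documented Hive behavior.

```python
def str_to_map(text, pair_delim="&", kv_delim="="):
    """Split text into key/value pairs, loosely mirroring Hive's str_to_map."""
    result = {}
    for pair in text.split(pair_delim):
        if not pair:
            continue  # skip the empty piece produced by the leading "&"
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result


params = (
    "&PageID=1234&Region=89191&IsoCy=BR&Lang=1046"
    "&Referrer=hotmail.com&ag=2385105&Campaign=&Event=12034"
)
parsed = str_to_map(params)
print(parsed["IsoCy"], parsed["Lang"])  # prints: BR 1046
```

Keys with no value, such as Campaign, map to an empty string in this sketch.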
demo
Distributed Queries
Hadoop Jobs and Task Tracker web UIs (ports 50030/50070)
HiveQL Query
TeraSort
[Chart: HiveQL query and TeraSort results by node count (8, 16, 32). Data labels on the 0-14000 axis: 12577.5, 7650.02, 5771.23; on the 0-300 axis: 270, 123, 53.]
Key point: the more nodes are added, the more can be calculated, and the faster (minus overhead)
Designed for batch systems (though this will change in the near future)
Roughly linear scalability for HiveQL query as well as TeraSort
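One rough way to check the scaling claim is to compute speedup relative to the smallest cluster. The sketch below uses the data labels from the chart (12577.5, 7650.02, 5771.23 and 270, 123, 53) and assumes, without confirmation from the slide, that each set lists elapsed times for 8, 16, and 32 nodes in that order and that lower is better. Which set belongs to HiveQL and which to TeraSort is not recoverable, so the series names are placeholders.

```python
nodes = [8, 16, 32]

# Data labels from the chart; the mapping to node counts is an assumption.
labels_large_axis = [12577.5, 7650.02, 5771.23]
labels_small_axis = [270, 123, 53]


def speedups(values):
    """Speedup of each measurement relative to the first one."""
    return [values[0] / v for v in values]


def report(name, values):
    """Print measured speedup next to the ideal (perfectly linear) speedup."""
    for n, s in zip(nodes, speedups(values)):
        ideal = n / nodes[0]
        print(f"{name}: {n} nodes -> {s:.2f}x (ideal {ideal:.0f}x)")


report("large-axis series", labels_large_axis)
report("small-axis series", labels_small_axis)
```

Comparing each measured speedup with the ideal column shows how far a run departs from perfectly linear scaling.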
Cassandra
Hive
Scribe
Hadoop
Hadoop
Oozie
Pig (-latin)
BackType
Hadoop
Pig / HBase
Cassandra
MR/GFS
Bigtable
Dremel
SimpleDB
Dynamo
EC2 / S3
MongoDB
Couchbase
HBase
Pegasus
15 out of 17
140,000-190,000 more deep analytical talent positions
1.5 million more data-savvy managers in the US alone
50-60% increase in the number of Hadoop developers within organizations already using Hadoop within a year
€250 billion potential annual value to Europe's public sector
$300 billion potential annual value to US healthcare
demo
Operations
Information Worker
Excel Integration
HiveODBC: PowerPivot, Crescent
AD integration
Kerberos
Cluster Management and Monitoring
Visual Studio Tools Integration
Hadoop JS
HiveQL
Data Scientist
Developer
Java MR
StreamInsight API
Hadoop on Azure and Windows
EIS / Databa
Pig Latin
CQL
OData / Azure Storage
HiveQL
Ocean of Data
File
APPENDIX
ata: techniques and technologies that make handling data at extreme scale econ
Hadoop: Auto-replication
Hadoop stores data in 64 MB blocks and replicates each block to different servers
The replication factor is set in hdfs-site.xml, in the dfs.replication property
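As a worked example of the arithmetic, the sketch below estimates how many blocks and block replicas a single file produces. The 64 MB block size comes from the slide; the replication value of 3 and the 1000 MB file size are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide
REPLICATION = 3     # assumed dfs.replication value; not stated on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies the bytes it holds, so raw storage
    # is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


print(hdfs_footprint(1000))  # hypothetical 1000 MB file -> (16, 48, 3000)
```

Raising the replication value increases fault tolerance at the cost of proportionally more raw storage.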