
Apache Hadoop 3.0 in a Nutshell
Munich, Apr. 2017
Sanjay Radia, Junping Du



About Speakers

Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
Chief Architect of Hadoop Core at Yahoo!
Apache Hadoop PMC and Committer

Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria
Ph.D., University of Waterloo

Junping Du
Apache Hadoop Committer & PMC member
Lead Software Engineer @ Hortonworks YARN Core Team
10+ years developing enterprise software (5+ years as a Hadooper)

Why Hadoop 3.0

The driving reasons:
Lots of content in trunk that did not make it into the 2.x branch
Big features that need a stabilizing major release
JDK upgrade (by itself, does not truly require bumping the major number)

Some features taking advantage of 3.0:
YARN: long running services
Ephemeral ports change (incompatible)
Hadoop command scripts rewrite (incompatible)
Erasure codes



Apache Hadoop 3.0

Key Takeaways
HDFS: Erasure codes
YARN: long running services, scheduler enhancements, isolation & Docker, UI
Lots of Trunk content
JDK8 and newer dependent libraries

Release Timeline
3.0.0-alpha1 - Sep 3, 2016
alpha2 - Jan 25, 2017
alpha3 - Q2 2017 (estimated)
Beta/GA - Q3/Q4 2017 (estimated)

Agenda
Hadoop 3.0 Basics - Major changes you should know before upgrading
JDK upgrade
Dependency upgrade
Changes to default ports for daemons/services
Shell script rewrite
Features
Hadoop Common
Client-Side Classpath Isolation
HDFS
Erasure Coding
Support for more than 2 NameNodes
YARN
Support for long running services
Scheduling enhancements: App/Queue priorities, global scheduling, placement strategies
New UI
ATS v2
MAPREDUCE
Task-level native optimization (HADOOP-11264)
Hadoop Operation - JDK Upgrade
Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
Oracle JDK 7 reached EoL in April 2015!

Moving forward to use new features of JDK8:

Lambda expressions (starting to use these)
Stream API
Security enhancements
Performance enhancements for HashMap, IO/NIO, etc.

Hadoop's evolution with JDK upgrades


Hadoop 2.6.x - JDK 6, 7, 8 or later
Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
Hadoop 3.0.x - JDK 8 or later



Dependency Upgrade
Jersey: 1.9 to 1.19
A root element whose content is an empty collection is now serialized as an empty object ({}) instead of null
Grizzly-http-servlet: 2.1.2 to 2.2.21
Guice: 3.0 to 4.0
cglib: 2.2 to 3.2.0
asm: 3.2 to 5.0.4
netty-all: 4.0.23 to 4.1.x (in discussion)
Protocol Buffer: 2.5 to 3.x (in discussion)



Change of Default Ports for Hadoop Services
Previously, the default ports of multiple Hadoop services were in the Linux
ephemeral port range (32768-61000)
Can conflict with other apps running on the same node

New ports (old -> new):
NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Secondary NN ports: 50091 -> 9869, 50090 -> 9868
DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
KMS service port: 16000 -> 9600
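A quick way to confirm which ports a running 3.0 cluster actually uses is hdfs getconf; a minimal sketch (the keys are standard HDFS config names, and the values in the comments assume the new defaults listed above):

  # NameNode web UI address (default 0.0.0.0:9870 in 3.0)
  hdfs getconf -confKey dfs.namenode.http-address
  # DataNode data-transfer address (default 0.0.0.0:9866 in 3.0)
  hdfs getconf -confKey dfs.datanode.address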



Hadoop Common
Client-Side Classpath Isolation



Client-side classpath isolation
HADOOP-11656 / HADOOP-13070
Problem
Application code's dependencies (including Apache Hive or dependent projects) can conflict with
Hadoop's dependencies
With a single jar file on the classpath, user code that needs a newer commons library and the
Hadoop client that pulls in an older commons library conflict
Solution
Separate server-side jars from client-side jars
Like hbase-client, the client's dependencies are shaded
User code with its newer commons and the shaded Hadoop client can now co-exist, while the
server keeps its older commons - see the sketch below
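A minimal sketch of what this enables (jar versions, the MyHdfsClient class and my-newer-commons.jar are illustrative, assuming the shaded hadoop-client-api / hadoop-client-runtime artifacts produced by this work):

  # Compile and run a client app against only the shaded client jars,
  # so the application is free to bring its own (newer) commons version
  javac -cp hadoop-client-api-3.0.0.jar:hadoop-client-runtime-3.0.0.jar MyHdfsClient.java
  java  -cp .:hadoop-client-api-3.0.0.jar:hadoop-client-runtime-3.0.0.jar:my-newer-commons.jar MyHdfsClient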
HDFS
Support for Three NameNodes for HA
Erasure coding

Current (2.x) HDFS Replication Strategy
Three replicas by default
1st replica on the local node, local rack or a random node
2nd and 3rd replicas on the same remote rack
Reliability: tolerates 2 failures
Good data locality, local short-circuit reads
Multiple copies => parallel IO for parallel compute
Very fast block recovery and node recovery
Parallel recovery - the bigger the cluster, the faster
10TB node recovery: 30 sec to a few hours
3x storage overhead vs 1.4-1.6x for erasure coding
Remember that Hadoop's JBOD is much, much cheaper:
1/10 - 1/20 of SANs
1/10 - 1/5 of NFS
(Diagram: replicas r1, r2, r3 placed on DataNodes across Rack I and Rack II)
Erasure Coding
k data blocks + m parity blocks (k + m)
Example: Reed-Solomon 6+3 - six data blocks b1-b6 plus three parity blocks P1-P3
Reliability: tolerates m failures
Saves disk space
Saves I/O bandwidth on the write path
1.5x storage overhead
Tolerates any 3 failures

                               3-replication       (6, 3) Reed-Solomon
Maximum fault tolerance        2                   3
Disk usage (N bytes of data)   3N                  1.5N
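As a quick sanity check on the table above (the general formula is background, not from the slide): a (k, m) code stores (k + m)/k bytes per byte of data, so

  \text{overhead}_{\mathrm{RS}(6,3)} = \frac{6+3}{6} = 1.5 \qquad \text{vs.} \qquad \text{overhead}_{3\text{-rep}} = \frac{3}{1} = 3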

Block Reconstruction
Block reconstruction overhead:
Higher network bandwidth cost
Extra CPU overhead
Codes that reduce reconstruction cost: Local Reconstruction Codes (LRC), Hitchhiker
(Diagram: blocks b1-b6 and parity blocks P1-P3 spread across nine racks)
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.


Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
Erasure Coding on Contiguous/Striped Blocks
Two approaches:

EC on contiguous blocks (data blocks b1-b6 with parity blocks P1-P3)
Pros: better for locality
Cons: small files cannot be handled

EC on striped blocks (cells C1, C2, ... striped across 6 data blocks and 3 parity blocks, stripe by stripe)
Pros: leverages multiple disks in parallel
Pros: works for small files
Cons: no data locality for readers

(Diagram: files f1, f2, f3 mapped onto data blocks b1-b6 and parity blocks P1-P3; striped layout showing cells C1-C12 and parity cells PC1-PC6 over stripes 1..n)

Apache Hadoop's decision

Starting with striping to deal with smaller files

Hadoop 3.0.0 implements Phase 1.1 and Phase 1.2

Erasure Coding Zone

Create a zone on an empty directory
Shell command: hdfs erasurecode createZone [-s <schemaName>] <path>
All the files under a zone directory are automatically erasure coded
Renames across zones with different EC schemas are disallowed
(See the usage sketch below.)
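A usage sketch based on the alpha-era command above (paths, schema name and file name are illustrative; the exact flag form may differ per release, and the final Hadoop 3.0 CLI exposes this as "hdfs ec -setPolicy -path <path> -policy <policyName>" instead):

  # Create an empty directory and turn it into an erasure coding zone
  hdfs dfs -mkdir /data/ec-zone
  hdfs erasurecode createZone -s RS-6-3 /data/ec-zone
  # Anything written under it is now erasure coded
  hdfs dfs -put big-dataset.csv /data/ec-zone/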

Write Pipeline for Replicated Files
Write pipeline to DataNodes: Writer -> DN1 -> DN2 -> DN3, with data forwarded down the pipeline and acks returned (DN = DataNode)
Durability
Uses 3 replicas to tolerate a maximum of 2 failures
Visibility
Reads are supported for files that are being written
Data can be made visible by hflush/hsync
Consistency
A client can start reading from any replica and fail over to any other replica to read the same data
Appendable
Files can be reopened for append
Parallel Write for EC Files
Parallel write (stripe size 1MB)
The client writes to a group of 9 DataNodes at the same time
Parity bits are calculated at the client side, at write time
Durability
(6, 3)-Reed-Solomon can tolerate a maximum of 3 failures
Visibility (same as replicated files)
Reads are supported for files that are being written
Data can be made visible by hflush/hsync
Consistency
A client can start reading from any 6 of the 9 replicas
When reading from a DataNode fails, the client can fail over to any other remaining replica to read the same data
Appendable (same as replicated files)
Files can be reopened for append
(Diagram: writer streaming data to DN1-DN6 and parity to DN7-DN9 in parallel, with acks returned from each)

EC: Write Failure Handling
DataNode failure
The client ignores the failed DataNode and continues writing
Able to tolerate 3 failures
Requires at least 6 DataNodes
Missing blocks will be reconstructed later
(Diagram: writer continues streaming data and parity to the remaining DataNodes among DN1-DN9)

Replication:
Slow Writers & Replace-DataNode-on-Failure
Write pipeline for replicated files
A DataNode can be replaced in case of failure
Slow writers
A write pipeline may last for a long time
The probability of DataNode failures increases over time
Need to replace the DataNode on failure
EC files
Do not support replace-datanode-on-failure
Slow writer handling is improved
(Diagram: replicated write pipeline Writer -> DN1 -> DN2 -> DN3, with DN4 standing in for a failed node)

Reading with Parity Blocks
Parallel read
Read from the 6 DataNodes holding data blocks
Supports both stateful read and pread
Block reconstruction
Read parity blocks to reconstruct missing blocks
(Diagram: reader fetching Block1-Block6 from DN1-DN6, reconstructing the missing Block3 from parity block Parity1 on DN7)
Network Traffic - Needs Good Network Bandwidth

Pros
Low latency because of parallel write/read
Good for small-size files
Cons
Requires high network bandwidth between client and servers
Higher reconstruction cost
A dead DataNode implies high network traffic and reconstruction time

Workload        3-replication         (6, 3) Reed-Solomon
Read 1 block    1 LN                  1/6 LN + 5/6 RR
Write 1 block   1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR

LN: Local Node, LR: Local Rack, RR: Remote Rack
YARN
YARN Scheduling Enhancements
Support for Long Running Services
Re-architecture for YARN Timeline Service - ATS v2
Better elasticity and resource utilization
Better resource isolation and Docker!!
Better User Experiences
Other Enhancements

Scheduling Enhancements
Application priorities within a queue: YARN-1963
In Queue A, App1 > App2 (see the CLI sketch after this list)
Inter-Queue priorities
Q1 > Q2 irrespective of demand / capacity
Previously based on unconsumed capacity

Affinity / anti-affinity: YARN-1042


More constraints on container locations
Global Scheduling: YARN-5139
Get rid of scheduling triggered on node heartbeat
Replaced with global scheduler that has parallel threads
Globally optimal placement
Critical for long running services: they stick to their allocation, so it had better be a good one
Enhanced container scheduling throughput (6x)
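A CLI sketch for the in-queue application priority item above (the application ID and priority value are made up for illustration):

  # Raise the priority of a running application within its queue
  yarn application -appId application_1493700000000_0042 -updatePriority 10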

Key Drivers for Long Running Services

Consolidation of Infrastructure
Hadoop clusters have a lot of compute and storage resources (some unused)
Can't I use Hadoop's resources for non-Hadoop load?
OpenStack is hard to run; can I use YARN?
But does it support Docker? Yes, we heard you
Hadoop-related data services that run outside a Hadoop cluster
Why can't I run them in the Hadoop cluster?
Run Hadoop services (Hive, HBase) on YARN
Run multiple instances
Benefit from YARN's elasticity and resource management

Built-in Support for Long Running Services in YARN
A native YARN framework: YARN-4692
Abstracts a common framework (similar to Slider) to support long running services
Simplified API (to manage the service lifecycle)
Better support for long running services

Recognition of long running services
Affects the policies for preemption, container reservation, etc.
Auto-restart of containers
Containers for long running services are retried on the same node when they have local state

Service/application upgrade support: YARN-4726
In general, services are expected to run long enough to cross versions

Dynamic container configuration
Ask only for just enough resources, and adjust them at runtime (memory is harder)
Discovering Services in YARN
Services can run on any YARN node; how do you get their IP?
A service can also move due to node failure

YARN service discovery via DNS: YARN-4757
Exposes existing service information in the YARN registry via DNS
The current YARN service registry's records will be converted into DNS entries
Discovery of container IP and service port via standard DNS lookups (see the lookup sketch below)

Application: zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
Container: 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
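A lookup sketch against the example records above (the hostnames are the slide's illustrations; this assumes the YARN registry DNS server from YARN-4757 is running and your resolver points at it):

  # Resolve a service record exposed through the YARN registry DNS
  dig +short zkapp1.user1.yarncluster.com
  # Resolve a container's hostname
  dig +short 1454001598828-0001-01-00004.yarncluster.com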

A More Powerful YARN
Elastic Resource Model
Dynamic Resource Configuration
YARN-291
Allows tuning a NodeManager's resources up or down at runtime
Graceful decommissioning of NodeManagers
YARN-914
Drains a node that's being decommissioned to allow running containers to
finish (see the command sketch after this list)
Efficient Resource Utilization
Support for container resizing
YARN-1197
Allows applications to change the size of an existing container
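A command sketch for the graceful-decommissioning and dynamic-resource items above (timeout, node ID and resource values are illustrative, and the exact flags may vary slightly between 2.8/3.0 releases):

  # Graceful decommission: after listing the node in the RM exclude file,
  # refresh with a 3600-second drain timeout so running containers can finish
  yarn rmadmin -refreshNodes -g 3600
  # Dynamic resource configuration (YARN-291): adjust a live NodeManager's
  # memory (MB) and vcores without restarting it
  yarn rmadmin -updateNodeResource worker-17:45454 16384 8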

More Powerful YARN (Contd.)
Resource Isolation
Resource isolation support for disk and network
YARN-2619 (disk), YARN-2140 (network)
Containers get a fair share of disk and network resources using Cgroups

Docker support in LinuxContainerExecutor
YARN-3611
Support for launching Docker containers alongside regular process containers (see the launch sketch below)
Packaging and resource isolation
Complements YARN's support for long running services
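A launch sketch using the distributed-shell example app (image name and jar path are illustrative; this assumes the NodeManagers run the LinuxContainerExecutor with the Docker runtime enabled):

  # Run a container inside a Docker image instead of a plain process
  yarn jar hadoop-yarn-applications-distributedshell-*.jar \
    -jar hadoop-yarn-applications-distributedshell-*.jar \
    -shell_command "cat /etc/os-release" \
    -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
    -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
    -num_containers 1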

Docker on YARN & YARN on YARN - YCloud

Can use YARN to test Hadoop!!
(Diagram: an outer YARN cluster running Hadoop apps (MR, Tez, Spark) and TensorFlow, plus a nested YARN cluster that itself runs MR, Tez and Spark)
YARN New UI (YARN-3368)

Timeline Service Revolution - Why ATS v2

Scalability & Performance
v1 limitation: single global instance of writer/reader
v1 limitation: local-disk-based LevelDB storage

Reliability
v1 limitation: data is stored on a local disk
v1 limitation: single point of failure (SPOF) for the timeline server

Usability
Handle flows as first-class concepts and model aggregation
Add configuration and metrics as first-class members
Better support for queries

Flexibility
Data model is more describable
Extended to carry more app-specific info
Core Design for ATS v2

Distributed write path
Logical per-app collector + physical per-node writer
Collector/writer launched as an auxiliary service in the NM
Standalone writers will be added later

Separate reader instances

Pluggable backend storage
Built in with a scalable and reliable implementation (HBase)

Enhanced data model
Entity (bi-directional relation) with flow, queue, etc.
Configuration, Metric, Event, etc.

Aggregation & Accumulation
Aggregation: rolling up the metric values to the parent
Online aggregation for apps and flow runs
Offline aggregation for users, flows and queues
Accumulation: rolling up the metric values across time intervals
Accumulated resource consumption for app, flow, etc.
Other YARN work planned in Hadoop 3.X
Resource profiles
YARN-3926
Users can specify resource profile name instead of individual resources
Resource types read via a config file
YARN federation
YARN-2915
Allows YARN to scale out to tens of thousands of nodes
Cluster of clusters which appear as a single cluster to an end user
Gang Scheduling
YARN-624

Thank you!
Reminder: BoFs on Thursday at 5:50pm

