Hadoop 3.0 in a Nutshell
Munich, Apr. 2017
Sanjay Radia, Junping Du
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
Chief Architect of Hadoop Core at Yahoo!
Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria
Ph.D., University of Waterloo
Junping Du
Apache Hadoop Committer & PMC member
Lead Software Engineer @ Hortonworks YARN Core Team
10+ years developing enterprise software (5+ years as a Hadooper)
New ports:
Namenode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
Secondary NN ports: 50091 → 9869, 50090 → 9868
Datanode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
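As a minimal sketch of how the port moves listed above could be applied when updating scripts or firewall rules, the table can be written as a lookup. The helper names `new_port` and `rewrite_address` and the host name are hypothetical, chosen only for illustration; the port numbers are the ones from the slide.

```python
# Old default port -> new default port, copied from the slide above.
PORT_MOVES = {
    "namenode": {50470: 9871, 50070: 9870, 8020: 9820},
    "secondary_namenode": {50091: 9869, 50090: 9868},
    "datanode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
}


def new_port(old_port):
    """Return the new default for an old default port, or the port unchanged."""
    for moves in PORT_MOVES.values():
        if old_port in moves:
            return moves[old_port]
    return old_port


def rewrite_address(address):
    """Rewrite a 'host:port' string to use the new default port (hypothetical helper)."""
    host, _, port = address.rpartition(":")
    return f"{host}:{new_port(int(port))}"


if __name__ == "__main__":
    # nn1.example.com is a placeholder host name.
    print(rewrite_address("nn1.example.com:50070"))
```

Ports that are not in the table pass through untouched, so the same helper can be run over a mixed list of addresses.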
Co-existable!
[Figure: user code using a newer commons talks through the shaded -client to a Hadoop server using an older commons.]
© Hortonworks Inc. 2011-2016. All Rights Reserved
HDFS
Support for Three NameNodes for HA
Erasure coding
3x storage overhead with 3-way replication (r = 3) vs 1.4x-1.6x with erasure coding
Remember that Hadoop's JBOD storage is much, much cheaper:
1/10 to 1/20 the cost of SANs
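The overhead figures above follow from simple arithmetic: replication stores one full copy per replica, while an erasure code with k data units and m parity units stores (k + m) / k bytes per byte of data. A small sketch, assuming (6, 3) and (10, 4) Reed-Solomon as example schemes (the second one is an assumption, not taken from the slide):

```python
from fractions import Fraction


def replication_overhead(replicas):
    """Raw bytes stored per byte of user data with n-way replication."""
    return Fraction(replicas)


def erasure_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data with a (k, m) erasure code."""
    return Fraction(data_units + parity_units, data_units)


if __name__ == "__main__":
    print("3-way replication:", float(replication_overhead(3)))
    print("(6, 3) Reed-Solomon:", float(erasure_overhead(6, 3)))
    print("(10, 4) Reed-Solomon:", float(erasure_overhead(10, 4)))
```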
EC on contiguous blocks
Pros: Better for locality
Cons: Small files cannot be handled
[Figure: data blocks b1 b2 b3 b4 b5 b6 and parity blocks P1 P2 P3]
EC on striped blocks
Pros: Leverages multiple disks in parallel
Pros: Works for small files
Cons: No data locality for readers
[Figure: blocks b1 ... b6 and P1 P2 P3 are divided into cells; stripe 1 holds C1 ... C6 with parity cells PC1 PC2 PC3, stripe 2 holds C7 ... C12 with PC4 PC5 PC6, and so on to stripe n]
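To make the striped layout concrete, here is a rough sketch of how a logical file offset could map onto a stripe and a data block, assuming 6 data blocks per group and round-robin cell placement as in the figure. The function name `locate` and the tiny cell size are hypothetical values for illustration, not HDFS defaults; indices are 0-based, so cell 0 corresponds to C1.

```python
def locate(offset, cell_size, data_units=6):
    """Map a logical file offset to (stripe, data_block, offset_in_cell).

    Cells are laid out round-robin across the data blocks, so cell 0 (C1)
    lands on block 0 (b1), cell 5 (C6) on block 5 (b6), and cell 6 (C7)
    wraps around to block 0 in the next stripe.
    """
    cell = offset // cell_size      # cell index counted across the whole file
    stripe = cell // data_units     # which stripe the cell belongs to
    block = cell % data_units       # which data block stores the cell
    return stripe, block, offset % cell_size


def blocks_touched(file_size, cell_size, data_units=6):
    """Data blocks that hold at least one byte of a file of the given size."""
    cells = -(-file_size // cell_size)  # ceiling division
    return sorted({c % data_units for c in range(cells)})


if __name__ == "__main__":
    CELL = 4  # bytes; deliberately tiny so the numbers are easy to follow
    print(locate(0, CELL))
    print(locate(25, CELL))
    print(blocks_touched(10, CELL))
```

Even a 10-byte file spreads over three blocks here, which is why striping suits small files and parallel disks but gives up locality for readers.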
Durability
(6, 3)-Reed-Solomon can tolerate a maximum of 3 failures
Visibility (same as replicated files)
Reads are supported for files that are being written
Data can be made visible by hflush/hsync
Consistency
A client can start reading from any 6 of the 9 replicas
When reading from a datanode fails, the client can fail over to any other remaining replica to read the same data.
Appendable (same as replicated files)
Files can be reopened for append
Missing blocks will be reconstructed later.
[Figure: the writer streams data and parity cells to the datanodes (DN6, DN7 and DN9 are shown) and receives an ack from each]
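The "any 6 of the 9" property above can be illustrated with a toy Reed-Solomon-style code. This sketch is not the HDFS codec: it swaps in polynomial interpolation over the prime field GF(257) because that fits in a few lines, and the names `encode` and `decode` are hypothetical. The 6 data symbols are treated as values of a polynomial at x = 0..5, and the 3 parity symbols are its values at x = 6..8, so any 6 surviving units pin down the polynomial.

```python
P = 257  # prime modulus, just above the byte range (toy choice, not HDFS's field)


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, parity_units=3):
    """Return the data symbols followed by parity symbols (systematic code)."""
    k = len(data)
    points = list(enumerate(data))
    parity = [lagrange_eval(points, x) for x in range(k, k + parity_units)]
    return list(data) + parity


def decode(survivors, k=6):
    """Recover the k data symbols from a dict {unit_index: value} of survivors."""
    if len(survivors) < k:
        raise ValueError("need at least k surviving units")
    points = sorted(survivors.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]


if __name__ == "__main__":
    data = [72, 97, 100, 111, 111, 112]
    units = encode(data)                      # 9 units: 6 data + 3 parity
    lost = {0, 4, 7}                          # any 3 failures are tolerated
    survivors = {i: v for i, v in enumerate(units) if i not in lost}
    print(decode(survivors) == data)
```

Losing a fourth unit leaves only 5 points, which no longer determine the polynomial, matching the maximum of 3 tolerated failures.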
Network traffic: needs good network bandwidth
Pros
Low latency because of parallel write/read
Good for small-size files
Cons
Requires high network bandwidth between client and server
Higher reconstruction cost
A dead DataNode implies high network traffic and a long reconstruction time
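The higher reconstruction cost listed above can be estimated with back-of-the-envelope arithmetic: rebuilding a lost replica only needs one surviving copy, while rebuilding a lost unit of a (6, 3) Reed-Solomon group needs 6 surviving units. The sketch below assumes, purely for illustration, 1000 lost blocks of 128 MiB each and treats an erasure-coded unit as the same size as a replicated block; both numbers and the function name are hypothetical.

```python
MIB = 1024 * 1024

# Surviving units that must be read to rebuild one lost unit.
READS_REPLICATION = 1   # copy any surviving replica
READS_RS_6_3 = 6        # decode from any 6 surviving units of the group


def rebuild_traffic(lost_blocks, block_size, units_read_per_block):
    """Bytes read over the network to rebuild all blocks of a dead DataNode."""
    return lost_blocks * block_size * units_read_per_block


if __name__ == "__main__":
    lost, size = 1000, 128 * MIB  # assumed values, not measurements
    rep = rebuild_traffic(lost, size, READS_REPLICATION)
    ec = rebuild_traffic(lost, size, READS_RS_6_3)
    print(f"3-replication: {rep / MIB:.0f} MiB read")
    print(f"(6, 3) RS:     {ec / MIB:.0f} MiB read ({ec // rep}x)")
```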
[Table: workload comparison, 3-replication vs (6, 3) Reed-Solomon]
Consolidation of Infrastructure
Hadoop clusters have a lot of compute and storage resources (some unused)
Can't I use Hadoop's resources for non-Hadoop load?
OpenStack is hard to run; can I use YARN?
But does it support Docker? Yes, we heard you.
Hadoop-related data services that run outside a Hadoop cluster
Why can't I run them in the Hadoop cluster?
Run Hadoop services (Hive, HBase) on YARN
Run multiple instances
Benefit from YARN's elasticity and resource management
YARN
YARN New UI (YARN-3368)