Hadoop Course Content

Introduction to Hadoop/Big Data:
Hadoop is an open-source software framework for distributed storage and processing of big data sets
using the MapReduce programming model. It runs on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures
are common occurrences and should be automatically handled by the framework.
The base framework consists of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for
scheduling users' applications; and
• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data
processing.
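To make the MapReduce programming model mentioned above more concrete, here is a minimal in-memory sketch of a word count in Python. It is only an illustration of the map, shuffle, and reduce steps under simplifying assumptions (a single process, no cluster, no fault handling); it does not use the Hadoop API, and the function names map_phase, shuffle, reduce_phase and word_count are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per word, as a mapper would for word count.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, mimicking the shuffle step between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run map over every input line, then shuffle, then reduce per key.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle across the network; the sketch only shows how the three steps fit together.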
Hadoop & HDFS

What is Big Data?
Uses of Big Data
What is Hadoop?
Relation Between Hadoop and Big Data
Features of Hadoop:
(i) Flexible
(ii) Scalable
(iii) Building an efficient data economy
(iv) Robust system
(v) Cost Effective
Challenges of Big Data:

The Four V's of Big Data
(i) Volume
(ii) Variety
(iii) Velocity
(iv) Veracity
How does Hadoop address the Big Data challenges?

(i) Hadoop is built to run on a cluster of machines
(ii) Hadoop clusters scale horizontally
(iii) Hadoop can handle unstructured/semi-structured data
(iv) Hadoop clusters provide storage and computing
(v) Hadoop provides storage for big data at reasonable cost
(vi) Hadoop allows capture of new or more data
(vii) Hadoop provides scalable analytics
Hadoop vs RDBMS
Hadoop vs Data Warehouse
Core Hadoop Components

(i) Hadoop Common
(ii) Hadoop Distributed File System (HDFS)
(iii) MapReduce – Distributed Data Processing Framework of Apache Hadoop
(iv) YARN (Yet Another Resource Negotiator)
Some of the basic terminologies:

1) What is a cluster environment?
2) What is a Hadoop cluster node?
HDFS (Hadoop Distributed File System)

Features of HDFS:
1) Fault Tolerance
2) High Availability
3) Reliability
4) Replication
5) Scalability
Storage Aspects of HDFS

i) HDFS Block
ii) How to configure the block size
HDFS Architecture
1) Name Node
2) Data Node
3) Secondary Name Node
4) Job Tracker
5) Task Tracker
Replication in Hadoop
Data Storage in Data Node
Replication Configuration
HDFS Commands Guide (Practical)

MAPREDUCE
What is MapReduce
Daemons of Hadoop
Hadoop 1.x Architecture
Limitations of Hadoop 1.x Architecture
Hadoop 2.x Architecture

MapReduce Phases
MapReduce Life Cycle
What is a Combiner?
What is a Partitioner?
Apache Pig and Pig Latin

Introduction to Apache Pig
SQL vs. Apache Pig
Physical & Logical Layer
Different Data Types in Apache Pig
Modes of Execution in Apache Pig: Local Mode, MapReduce (Distributed) Mode
Execution Mechanisms: Grunt shell, Script, Embedded
Transformations in Pig
How to write a simple Pig script
UDFs in Pig
Hands-on with Pig Latin scripts
Hive and HiveQL

Hive Introduction
Hive Architecture and Installation
Comparison with Traditional Database
Operators and Functions
Hive Metastore and Integration with MySQL
Hive integration with Hadoop
SQL vs. HiveQL
Hive UDFs, Partitioning, Dynamic Partitioning and Bucketing
RegexSerDe (Regular Expressions)

Hive Tables (Managed Tables and External Tables, Storage Formats, Importing Data, Altering Tables, Dropping Tables)
Hive data formats – Text, ORC, Avro, Parquet
Sqoop
Introduction to Sqoop
How to connect to a relational database using Sqoop
Different Sqoop commands: different flavors of imports, exports, Hive imports
HBase
Hands-on examples with HBase and ZooKeeper
HBase introduction
HBase use cases
HBase basics -- Column families, Scans
HBase architecture and ZooKeeper Service -- Data Model, Operations, Implementation, Consistency, Sessions
HBase Admin -- Schema definition, Basic CRUD Operations
Spark with Scala
Introduction to Scala

Why Scala
Scala Vs Java
Scala Basics
Scala Data types
Scala Packages
Introduction to Spark
Motivation for Spark
Spark vs. MapReduce Processing
Architecture of Spark
Spark Shell Introduction
Caching in Spark
Real-time Examples of Spark
Spark Components: Spark Core & Spark SQL
Spark Streaming
Features of RDD
Lazily Evaluated
Immutable
Partitioned
RDD operations
Actions
Transformations in RDD
Note:
Topic-wise material will be provided with scenarios.
Assignments and tasks will be given for hands-on practice.
Technical assistance will be given even after the course.
