Introduction to Hadoop/Big Data:

Hadoop is an open-source software framework for distributed storage and processing of big data sets
using the MapReduce programming model. It runs on computer clusters built from commodity
hardware. All Hadoop modules are designed on the fundamental assumption that hardware failures
are common and should be handled automatically by the framework. The base framework consists of
the following modules:

• Hadoop Common – contains libraries and utilities needed by the other Hadoop modules;

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster;

• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for
scheduling users' applications; and

• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data
processing.
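
To make the MapReduce programming model named above more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and calls no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. It mimics the three steps of a word-count job: map emits (word, 1) pairs, shuffle groups the pairs by key, and reduce sums each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big data"]))
```

On a real cluster the map and reduce functions would run in parallel on many nodes and the shuffle would move data across the network; this sketch only shows the data flow between the phases.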

Hadoop & HDFS


What is Big Data?
Uses of Big Data

What is Hadoop?

Relation Between Hadoop and Big Data

Features of Hadoop:
(i) Flexible
(ii) Scalable
(iii) Building efficient Data economy
(iv) Robust system
(v) Cost Effective

Challenges of Big Data:


The Four V's of Big Data
(i) Volume
(ii) Variety
(iii) Velocity
(iv) Veracity

How does Hadoop address the Big Data challenges?


(i) Hadoop is built to run on a cluster of machines
(ii) Hadoop clusters scale horizontally
(iii) Hadoop can handle unstructured/semi-structured data
(iv) Hadoop clusters provide storage and computing
(v) Hadoop provides storage for big data at reasonable cost
(vi) Hadoop allows capture of new or more data
(vii) Hadoop provides scalable analytics

Hadoop vs RDBMS

Hadoop vs Data Warehouse

Core Hadoop Components


(i) Hadoop Common
(ii) Hadoop Distributed File System (HDFS)
(iii) MapReduce – Distributed Data Processing Framework of Apache Hadoop
(iv) YARN (Yet Another Resource Negotiator)

Some of the basic terminologies:


1) What is a cluster environment
2) What is a Hadoop cluster node

HDFS (Hadoop Distributed File System)


Features of HDFS:
1) Fault Tolerance
2) High Availability
3) Reliability
4) Replication
5) Scalability

Storage Aspects of HDFS


i) HDFS Block
ii) How to configure the block size

HDFS Architecture
1) NameNode
2) DataNode
3) Secondary NameNode
4) JobTracker
5) TaskTracker

Replication in Hadoop
Data Storage in Data Node
Replication Configuration

HDFS Commands Guide (Practical)


MAPREDUCE
What is MapReduce
Daemons of Hadoop
Hadoop 1.x Architecture
Limitations of Hadoop 1.x Architecture

Hadoop 2.x Architecture


MapReduce Phases
MapReduce Life Cycle

What is a Combiner?
What is a Partitioner?

Apache Pig and Pig Latin


Introduction to Apache Pig
SQL vs. Apache Pig
Physical & Logical Layer
Different Data Types in Apache Pig
Modes of Execution in Apache Pig: Local Mode, MapReduce (Distributed) Mode
Execution Mechanisms: Grunt Shell, Script, Embedded
Transformations in Pig
How to write a simple Pig script
UDFs in Pig
Hands-on with Pig Latin scripts

Hive and HiveQL


Hive Introduction

Hive Architecture and Installation

Comparison with Traditional Database

Operators and Functions

Hive Metastore and Integration with MySQL

Hive integration with Hadoop

SQL vs. HiveQL

Hive UDFs, Partitioning, Dynamic Partitioning and Bucketing

RegexSerDe (Regular Expressions)


Hive Tables (Managed Tables and External Tables, Storage Formats, Importing Data, Altering Tables,
Dropping Tables)

Hive data formats – Text, ORC, Avro, Parquet

Sqoop
Introduction to Sqoop

How to connect to a relational database using Sqoop

Sqoop Commands: different flavors of import, export, Hive imports

HBase
Hands-on with Examples: HBase and ZooKeeper

HBase introduction

HBase use cases

HBase basics -- Column families, Scans

HBase architecture and ZooKeeper Service -- Data Model, Operations, Implementation, Consistency,
Sessions

HBase Admin -- Schema definition, Basic CRUD Operations

Spark with Scala


Introduction to Scala
Why Scala
Scala vs. Java
Scala Basics
Scala Data types
Scala Packages
Introduction to Spark
Motivation for Spark
Spark vs. MapReduce Processing
Architecture of Spark
Spark Shell Introduction
Caching in Spark
Real-time Examples of Spark
Spark Components: Spark Core & Spark SQL
Spark Streaming
Features of RDD: Lazily Evaluated, Immutable, Partitioned
RDD Operations: Transformations, Actions

Note:
Topic-wise material will be provided with scenarios.
Assignments and tasks will be given to provide hands-on practice.
Technical assistance will be given even after the course.
