Course Outline:
Hadoop and Spark for Big Data and Data Science

By Data Science Studio
Week-1: Introduction to Big Data and Data Science

• What is Big Data?
• What is Data Science?
Week-2: Introduction to Hadoop and the Hadoop Ecosystem

• Problems with Traditional Large Scale Systems
• Hadoop
• Data Storage and Ingest
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Homework Lab: Setup Hadoop
Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)

• Distributed Processing on a Cluster
• Storage: HDFS Architecture
• Storage: Using HDFS
• Homework Lab: Access HDFS with Command Line and Hue
• Resource Management: YARN Architecture
• Resource Management: Working with YARN
• Homework Lab: Run a YARN Job
Week-4: Importing Relational Data with Apache Sqoop

• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
• Homework Lab: Import Data from MySQL Using Sqoop
Week-5: Introduction to Impala and Hive

• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Querying Data with Impala and Hive
• Comparing Impala and Hive to Traditional Databases
Week-6: Modeling and Managing Data with Impala and Hive

• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
• Homework Lab: Create and Populate Tables in Impala or Hive
Week-7: Data Formats

• File Formats
• Avro Schemas
• Avro Schema Evolution
• Using Avro with Impala, Hive and Sqoop
• Using Parquet with Impala, Hive and Sqoop
• Compression
• Homework Lab: Select a Format for a Data File
Week-8: Data File Partitioning

• Partitioning Overview
• Partitioning in Impala and Hive
• Conclusion
• Homework Lab: Partition Data in Impala or Hive
Week-9: Capturing Data with Apache Flume

• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configurations
• Homework Lab: Collect Web Server Logs with Flume
Week-10: Spark Basics

• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
• Homework Lab:
o View the Spark Documentation
o Explore RDDs Using Spark Shell
o Use RDDs to Transform a Dataset
Week-11: Working with RDDs in Spark

• Creating RDDs
• Other General RDD Operations
• Homework Lab: Process Data Files with Spark
Week-12: Aggregating Data with Pair RDDs

• Key-value Pair RDDs
• MapReduce
• Other Pair RDD Operations
• Homework Lab: Use Pair RDDs to Join Two Datasets
Week-13: Writing and Deploying Spark Applications

• Spark Application vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Homework Lab: Write and Run a Spark Application
• Configure Spark Properties
• Logging
• Homework Lab: Configure a Spark Application
Week-14: Parallel Processing in Spark

• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-Based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
• Homework Lab: View Jobs and Stages in the Spark Application UI
Week-15: Spark RDD Persistence

• RDD Lineage
• RDD Persistence Overview
• Distributed Persistence
• Homework Lab: Persist an RDD
Week-16: Common Patterns in Spark Data Processing

• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
• Homework Lab: Iterative Processing in Spark
Week-17: Spark SQL and DataFrames

• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• DataFrames and RDDs
• Comparing Spark SQL, Impala and Hive-on-Spark
• Homework Lab: Use Spark SQL for ETL
Week-18: Running Machine Learning Algorithms Using Spark MLlib

• Machine Learning with Spark
• Preparing Data for Machine Learning
• Building a Linear Regression Model
• Evaluating a Linear Regression Model
• Visualizing a Linear Regression Model
Week-19: BigDL Distributed Deep Learning on Apache Spark

• What is Deep Learning?
• What is BigDL?
• Why Use BigDL?
• Installing and Building BigDL
• BigDL Examples
Week-20: Working on Spark in the Cloud

• Spark Implementation in Databricks
• Spark Implementation in Cloudera
• Spark Implementation in Amazon Web Services
