Course Outline:

Hadoop and Spark for Big Data and Data Science


By Data Science Studio

Week-1: Introduction to Big Data and Data Science


• What is Big Data
• What is Data Science

Week-2: Introduction to Hadoop and the Hadoop Ecosystem


• Problems with Traditional Large Scale Systems
• Hadoop
• Data Storage and Ingest
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Homework Lab: Setup Hadoop

Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)


• Distributed Processing on a Cluster
• Storage: HDFS Architecture
• Storage: Using HDFS
• Homework Lab: Access HDFS with Command Line and Hue
• Resource Management: YARN Architecture
• Resource Management: Working with YARN
• Homework Lab: Run a YARN Job

Week-4: Importing Relational Data with Apache Sqoop


• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
• Homework Lab: Import Data from MySQL Using Sqoop

Week-5: Introduction to Impala and Hive


• Introduction to Impala and Hive
• Why use Impala and Hive
• Querying Data with Impala and Hive
• Comparing Impala and Hive to Traditional Databases

Week-6: Modeling and Managing Data with Impala and Hive


• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
• Homework Lab: Create and Populate Tables in Impala or Hive

Week-7: Data Formats


• File Formats
• Avro Schemas
• Avro Schema Evolution
• Using Avro with Impala, Hive and Sqoop
• Using Parquet with Impala, Hive and Sqoop
• Compression
• Homework Lab: Select a Format for Data File

Week-8: Data File Partitioning


• Partitioning Overview
• Partitioning in Impala and Hive
• Conclusion
• Homework Lab: Partition Data in Impala or Hive

Week-9: Capturing Data with Apache Flume


• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configurations
• Homework Lab: Collect Web Server Logs with Flume

Week-10: Spark Basics


• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
• Homework Lab:
    o View the Spark Documentation
    o Explore RDDs Using Spark Shell
    o Use RDDs to Transform a Dataset

Week-11: Working with RDDs in Spark


• Creating RDDs
• Other General RDD Operations
• Homework Lab: Process Data Files with Spark

Week-12: Aggregating Data with Pair RDDs


• Key-value Pair RDDs
• MapReduce
• Other Pair RDD Operations
• Homework Lab: Use Pair RDDs to Join Two Datasets

Week-13: Writing and Deploying Spark Applications


• Spark Application vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Homework Lab: Write and Run a Spark Application
• Configure Spark Properties
• Logging
• Homework Lab: Configure a Spark Application

Week-14: Parallel Processing in Spark


• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File Based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
• Homework Lab: View Jobs and Stages in the Spark Application UI

Week-15: Spark RDD Persistence


• RDD Lineage
• RDD Persistence Overview
• Distributed Persistence
• Homework Lab: Persist an RDD

Week-16: Common Patterns in Spark Data Processing


• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
• Homework Lab: Iterative Processing in Spark

Week-17: Spark SQL and DataFrames


• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• DataFrames and RDDs
• Comparing Spark SQL, Impala and Hive-on-Spark
• Homework Lab: Use Spark SQL for ETL

Week-18: Running Machine Learning Algorithms Using Spark MLlib


• Machine Learning with Spark
• Preparing Data for Machine Learning
• Building a Linear Regression Model
• Evaluating a Linear Regression Model
• Visualizing a Linear Regression Model

Week-19: BigDL Distributed Deep Learning on Apache Spark


• What is Deep Learning
• What is BigDL
• Why use BigDL
• Installing and Building BigDL
• BigDL Examples

Week-20: Working on Spark in the Cloud


• Spark Implementation in Databricks
• Spark Implementation in Cloudera
• Spark Implementation in Amazon Web Services