Semester: II
Year: 2014-2015
Credits (L:T:P): 3:0:1
Core/Elective: Core
Course Objectives:
- To understand big data for business intelligence
- To learn business case studies for big data analytics
- To understand NoSQL big data management
- To manage big data without SQL
- To perform map-reduce analytics using Hadoop and related tools
TOPICS
MODULE I : UNDERSTANDING BIG DATA
10 Hours
What is big data; why big data; convergence of key trends; unstructured data; industry examples of
big data; web analytics; big data and marketing; fraud and big data; risk and big data; credit risk
management; big data and algorithmic trading; big data and healthcare; big data in medicine;
advertising and big data; big data technologies; introduction to Hadoop; open source technologies;
cloud and big data; mobile business intelligence; crowd-sourcing analytics; inter- and trans-firewall
analytics
MODULE II : NOSQL DATA MANAGEMENT
10 Hours
Introduction to NoSQL; aggregate data models; aggregates; key-value and document data models;
relationships; graph databases; schemaless databases; materialized views; distribution models;
sharding; master-slave replication; peer-to-peer replication; sharding and replication; consistency;
relaxing consistency; version stamps; MapReduce; partitioning and combining; composing
MapReduce calculations.
MODULE III : BASICS OF HADOOP
10 Hours
Data format; analyzing data with Hadoop; scaling out; Hadoop streaming; Hadoop pipes; design of
the Hadoop Distributed File System (HDFS); HDFS concepts; Java interface; data flow; data ingest
with Flume and Sqoop; Hadoop I/O; data integrity; compression; serialization; Avro; file-based data
structures
MODULE IV : MAPREDUCE APPLICATIONS
10 Hours
MapReduce workflows; unit tests with MRUnit; test data and local tests; anatomy of a MapReduce
job run; classic MapReduce; YARN; failures in classic MapReduce and YARN; job scheduling;
shuffle and sort; task execution; MapReduce types; input formats; output formats
MODULE V : HADOOP RELATED TOOLS
10 Hours
Introduction to HBase: The Dawn of Big Data; The Problem with Relational Database Systems.
Introduction to Cassandra: The Cassandra Elevator Pitch. Introduction to Pig. Hive: data types and
file formats; HiveQL data definition; HiveQL data manipulation; HiveQL queries.
LAB Experiments
Review the commands available for the Hadoop Distributed File System:
Copy the file foo.txt from local disk to the user's home directory in HDFS
Get a directory listing of the user's home directory in HDFS
Get a directory listing of the HDFS root directory
Display the contents of the HDFS file /user/fred/bar.txt
Retrieve that file to the local disk, saving it as baz.txt
Create a directory called input under the user's home directory
Delete the directory input_old and all its contents
Verify the copy by listing the directory contents in HDFS.
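The tasks above can be sketched with the `hadoop fs` shell. This is a minimal sketch, assuming a running Hadoop cluster whose current user's home directory is /user/fred; the paths and file names simply follow the exercise wording.

```shell
# Assumes a running HDFS and a home directory of /user/fred (hypothetical).
hadoop fs -put foo.txt /user/fred/          # copy foo.txt from local disk to the home directory
hadoop fs -ls /user/fred                    # list the user's home directory
hadoop fs -ls /                             # list the HDFS root directory
hadoop fs -cat /user/fred/bar.txt           # display the contents of bar.txt
hadoop fs -get /user/fred/bar.txt baz.txt   # retrieve the file to local disk as baz.txt
hadoop fs -mkdir /user/fred/input           # create the input directory
hadoop fs -rm -r /user/fred/input_old       # delete input_old and all its contents
hadoop fs -ls /user/fred                    # verify by listing the directory contents
```

Note that `-rm -r` is the Hadoop 2 form; older releases use `-rmr` instead.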
Selection and transformation operations occur in map tasks, while aggregation is handled by
reducers. Join operations are flexible: they can be performed in the reducers or in the mappers,
depending on the size of the leftmost table.
1. Write a query to select only those clicks which correspond to starting, browsing, completing, or
purchasing movies. Use a CASE statement to transform the RECOMMENDED column into integers
where Y is 1 and N is 0. Also, ensure GENREID is not null. Only include the first 25 rows.
2. Write a query to select the customer ID, movie ID, recommended state and most recent rating for each
movie.
3. Load the results of the previous two queries into a staging table. First, create the staging table:
4. Next, load the results of the queries into the staging table.
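Steps 1, 3, and 4 might be sketched in HiveQL as follows. This is a hypothetical sketch: the table name (movie_clicks), column names (CUSTID, MOVIEID, ACTIVITY, RECOMMENDED, GENREID, RATING), and the activity codes standing for starting, browsing, completing, and purchasing are all assumptions, not given in the exercise text.

```sql
-- Step 1: select start/browse/complete/purchase clicks, mapping RECOMMENDED
-- Y/N to 1/0 and excluding null genres. Table, columns, and the activity
-- codes (2, 4, 5, 11) are assumed names, not from the source.
SELECT custid, movieid,
       CASE WHEN recommended = 'Y' THEN 1 ELSE 0 END AS recommended,
       activity, rating
FROM movie_clicks
WHERE activity IN (2, 4, 5, 11)
  AND genreid IS NOT NULL
LIMIT 25;

-- Step 3: create the staging table.
CREATE TABLE movie_stage (
  custid INT, movieid INT, recommended INT, activity INT, rating INT
);

-- Step 4: load the query results into the staging table.
INSERT OVERWRITE TABLE movie_stage
SELECT custid, movieid,
       CASE WHEN recommended = 'Y' THEN 1 ELSE 0 END,
       activity, rating
FROM movie_clicks
WHERE genreid IS NOT NULL;
```

The CASE expression runs in the map tasks, so the Y/N-to-integer transformation adds no extra MapReduce stage.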
Exercise 5: Extract sessions using Pig
While the SQL semantics of HiveQL are useful for aggregation and projection, some analysis is better
described as the flow of data through a series of sequential operations. For these situations, Pig Latin
provides a convenient way of implementing data flows over data stored in HDFS. Pig Latin statements
are translated into a sequence of MapReduce jobs on the execution of any STORE or DUMP command.
Job construction is optimized to exploit as much parallelism as possible, and, much like Hive, temporary
storage is used to hold intermediate results. As with Hive, aggregation occurs largely in the reduce
tasks; map tasks handle Pig's LOAD and FOREACH ... GENERATE statements. The EXPLAIN
command shows the execution plan for any Pig Latin script. As of Pig 0.10, the ILLUSTRATE
command provides sample results for each stage of the execution plan.
In this exercise you will learn basic Pig Latin semantics and the fundamental types in Pig Latin:
data bags and tuples.
Start the Grunt shell and execute the following statements to set up a dataflow with the click-stream data.
Note: Pig Latin statements are assembled into MapReduce jobs, which are launched at the execution of a
DUMP or STORE statement.
1. Group the log sample by movie and dump the resulting bag.
2. Add a GROUP BY statement to the sessionize.pig script to process the click stream data into
user sessions.
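The two steps above might look like the following in the Grunt shell. This is a sketch under assumptions: the input file name (clickstream.log), its tab-delimited layout, and the field names are hypothetical, and a real sessionize.pig script would add time-window logic that is omitted here.

```pig
-- Hypothetical dataflow; file name and field layout are assumptions.
clicks = LOAD 'clickstream.log' USING PigStorage('\t')
         AS (custid:int, movieid:int, activity:int, ts:chararray);

-- Step 1: group the log sample by movie and dump the resulting bag.
by_movie = GROUP clicks BY movieid;
DUMP by_movie;                      -- launches the MapReduce job(s)

-- Step 2: group by customer as the first step toward user sessions.
by_user = GROUP clicks BY custid;
sessions = FOREACH by_user GENERATE group AS custid,
                                    clicks.(ts, movieid, activity);
STORE sessions INTO 'sessions_out';
```

Nothing executes until DUMP or STORE; the LOAD and GROUP statements only extend the logical plan, which EXPLAIN can display before any job is launched.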
Course Outcomes: The students should be able to:
- Describe big data and use cases from selected business domains
- Explain NoSQL big data management
- Install, configure, and run Hadoop and HDFS
- Perform map-reduce analytics using Hadoop
- Use Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data analytics
TEXT BOOKS:
1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence",
Pearson Education, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly, 2012.
REFERENCES:
5. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
6. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
7. Alan Gates, "Programming Pig", O'Reilly, 2011.