
CSE3019 Data Mining

L T P J C
2 0 2 4 4
Version : 1.00
Pre-requisite: None

Course Objectives:
• To introduce the concepts of data mining and data preprocessing
• To provide the skills required to handle large data sets
• To develop the knowledge needed to apply mining algorithms for association and clustering
• To introduce algorithms for mining data streams
• To explain the features of a recommendation engine

Expected Outcomes:
The student will be able to:
• Design data mining algorithms for real-world applications
• Evaluate the performance of various data mining algorithms
• Analyze and leverage data for real-time decision making

Student Learning Outcomes (SLO): 2, 7, 14, 17

Module:1 INTRODUCTION 3 Hours SLO: 2, 7

Data Mining – Data Warehousing – OLAP – Data Preprocessing

Module:2 CLASSIFICATION TECHNIQUES AND FINDING SIMILAR ITEMS 5 Hours SLO: 7, 17

Classification Techniques: Decision Tree, ID3, K-Nearest Neighbour Classifier, Naive Bayes – Near
Neighbour Search – Shingling of Documents – Similarity Preserving – Locality Sensitive Hashing (LSH)
– Applications and Variants of LSH – Distance Measures – High degrees of similarity
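
As an illustration of the shingling and similarity topics in this module, the following is a minimal sketch in Python. It assumes character-level shingles of length k and Jaccard similarity as the set-similarity measure; the function names `shingles` and `jaccard` are hypothetical and not part of the syllabus.

```python
def shingles(text, k=3):
    """Return the set of k-character shingles of a whitespace-normalised string."""
    # Lower-case and collapse runs of whitespace so formatting does not matter.
    normalised = " ".join(text.lower().split())
    # A string shorter than k yields an empty set.
    return {normalised[i:i + k] for i in range(len(normalised) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken to be 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = shingles("the quick brown fox")
doc2 = shingles("the quick brown dog")
similarity = jaccard(doc1, doc2)  # a value between 0.0 and 1.0
```

MinHash and LSH would then be used to approximate this similarity without comparing every pair of documents directly.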

Module:3 MINING DATA STREAMS 4 Hours SLO: 7, 17

Stream Data Model – Sampling Data in a Stream – Filtering Streams – Counting distinct elements in a
stream – Estimating Moments – Counting Ones in a window – Decaying windows
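
As an illustration of sampling data in a stream, the following is a minimal sketch of reservoir sampling, one common technique for keeping a fixed-size uniform sample of a stream whose length is not known in advance. It is an assumed example and not necessarily the method taught in the course; the name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Only k items are held in memory at any time, regardless of how long the stream is.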

Module:4 LINK ANALYSIS 4 Hours SLO: 7, 17

PageRank – Link Spam – Hubs and Authorities
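
As an illustration of the PageRank topic, the following is a minimal sketch of power iteration with a damping (taxation) factor. The damping value 0.85, the even redistribution of rank from dead ends, and the name `pagerank` are assumptions made for this example.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Approximate PageRank; links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the teleport share (1 - beta) / n.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split the damped rank of p evenly among its out-links.
                share = beta * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dead end (assumption): spread its damped rank over all pages.
                share = beta * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

The ranks sum to 1, and pages with more (or more important) in-links end up with a higher score.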

Module:5 FREQUENT ITEM SETS 4 Hours SLO: 7, 17

Market-Basket Model – A-priori Algorithm – Handling larger datasets – Counting Frequent items in a
stream – Limited Pass Algorithms
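
As an illustration of the A-priori algorithm, the following is a minimal in-memory sketch. It assumes baskets are given as sets of items and that support is an absolute count; the name `apriori` and the sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return a dict mapping each frequent itemset (a frozenset) to its support count."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count individual items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidates of size k: every (k-1)-subset must itself be frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Pass k: count only the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"bread", "beer"},
    {"milk", "beer"},
]
itemsets = apriori(baskets, min_support=2)
```

The pruning step relies on monotonicity: an itemset can be frequent only if all of its subsets are frequent.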

Module:6 CLUSTERING 4 Hours SLO: 7, 17

Hierarchical Clustering – K-means Algorithm – Clustering in Non-Euclidean Spaces – Clustering for
Streams and Parallelism
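
As an illustration of the K-means algorithm, the following is a minimal sketch of Lloyd's iteration for points in Euclidean space. It assumes random initial centroids drawn from the data and a fixed number of iterations; the name `kmeans` and the sample points are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, clusters) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

In non-Euclidean spaces the mean is not defined, so a representative point of each cluster would take the place of the centroid.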

Module:7 RECOMMENDATION SYSTEMS 4 Hours SLO: 7, 17

Content based – Collaborative Filtering – Dimensionality Reduction – Case Study
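
As an illustration of collaborative filtering, the following is a minimal sketch of a user-based predictor. It assumes ratings are stored as nested dictionaries (user to item to rating) and uses cosine similarity as the weight; the names `cosine` and `predict` and the sample ratings are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item, or None."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None


ratings = {
    "ann": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 4, "c": 2},
    "cat": {"a": 1, "c": 5},
}
estimate = predict(ratings, "ann", "c")
```

Because ann's ratings resemble bob's far more than cat's, the estimate lands closer to bob's rating of item c.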

Module:8 CONTEMPORARY ISSUES (To be handled by experts from industry) 2 Hours SLO: 2

Total Lecture: 30 Hours


Text Book:
1. Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann, 2011.

Reference Books:
1. Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2011.
2. J. Leskovec, A. Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge
University Press, 2014.

Project SLO: 17
# Generally a team project [3 to 4 members]
# Concepts studied in XXXX should have been used
# A down-to-earth application and an innovative idea should have been attempted
# Report in digital format, with all drawings made using a software package, to be submitted. [Ex. 1. Design of a
traffic light system using sequential circuits OR 2. Design of digital clock]
# Assessment on a continuous basis with a minimum of 3 reviews.

Available online data sources may be used for exploring the following projects,
for example Kaggle, the UCI repository, KDnuggets, and the UCR Time Series Archive.
Projects may be given as group projects.
Sample Projects:
1. Using a programming language that you are familiar with, such as C++ or Java, implement recent
frequent/closed/maximal itemset mining algorithms: Compare the performance of each
algorithm with various kinds of large data sets. Write a report to analyze the situations (e.g., data
size, data distribution, minimal support threshold setting, and pattern density) where one
algorithm may perform better than the others, and state why.
2. The DBLP data set (www.informatik.uni-trier.de/~ley/db/) consists of over one million entries
of research papers published in computer science conferences and journals. Among these entries,
there are a good number of authors that have coauthor relationships.
(a) Propose a method to efficiently mine a set of coauthor relationships that are closely
correlated (e.g., often coauthoring papers together).
(b) Based on the mining results and the pattern evaluation measures, discuss which measure may
convincingly uncover close collaboration patterns better than others.
(c) Based on the study in (a), develop a method that can roughly predict advisor and advisee
relationships and the approximate period for such advisory supervision.
3. Implement the associative classification algorithms and compare the performance of each
algorithm with various kinds of large data sets. Write a report to analyze the situations (e.g., data
size, data distribution, minimal support threshold setting, and pattern density) where one
algorithm may perform better than the others, and state why.
4. Implement fuzzy clustering and probabilistic clustering methods and compare the performance
of each algorithm with various kinds of large data sets. Write a report to analyze the situations
(e.g., data size, data distribution, pattern density and cluster validity) where one algorithm may
perform better than the others, and state why.
5. Implement and compare different outlier detection methods/outlier factors on various kinds of
large data sets. Write a report to analyze the situations (e.g., data size, data distribution, pattern
density) where one algorithm may perform better than the others, and state why.
6. Using a programming language that you are familiar with, such as C++ or Java, implement recent
algorithms for intent mining: Compare the performance of each algorithm with various kinds of
large data sets. Write a report to analyze the results where one algorithm may perform better than
the others, and state why.
7. Design and implement a sentiment analysis algorithm for a Twitter dataset. Experiment with the proposed
idea using different classifiers and identify the best classifier for the chosen data set based on
different performance measures.
8. Design and implement content-based, user-based, and collaborative filtering techniques on any
benchmark dataset to build a recommender system. Prepare a report based on the performance
of the different methods to justify the choice of the best recommender system.

Lab SLO: 14
Indicative List of Experiments:

1. Implement classification techniques for real-world data sets
2. Implement algorithms for similarity matching
3. Implement algorithms for mining data streams
4. Simulate the PageRank algorithm
5. Design and implement link spam detection
6. Implement A-priori algorithm using MapReduce
7. Clustering in Non-Euclidean spaces
8. Clustering for Streams and Parallelism
9. Design and develop a recommendation engine for the given application

Date of Approval by the Academic Council 16.03.17
