Md Tabrez Nafis
Department of Computer Science & Engineering
JAMIA HAMDARD, New Delhi
1
Why Data Mining?
2
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that make all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
3
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
4
What Is Data Mining?
5
Knowledge Discovery (KDD) Process
Task-relevant Data
Data Cleaning
Data Integration
Databases
6
Data Mining and Business Intelligence
[Figure: pyramid showing increasing potential to support business decisions toward the top. Surviving layers, from the top down: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting).]
[Figure: data mining shown at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines.]
8
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes on the order of
terabytes
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, social networks
Heterogeneous databases
Spatial, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
9
Database Processing vs. Data Mining
          Database Processing            Data Mining
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database
10
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than Rs. 10,000 worth of goods in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
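The contrast between the two kinds of question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy transactions, the field names `customer` and `items`, and the helper names are hypothetical and do not come from the slides. The first function answers a precise query and returns a subset of the stored rows; the second looks for items that co-occur with milk, which no single stored row states directly.

```python
from collections import Counter

# Hypothetical toy transactions; customers and item names are illustrative only.
transactions = [
    {"customer": "A", "items": {"milk", "bread"}},
    {"customer": "B", "items": {"milk", "bread", "butter"}},
    {"customer": "C", "items": {"tea"}},
    {"customer": "D", "items": {"milk", "butter"}},
]


def customers_who_bought(item, rows):
    """Database-style query: a precise condition that returns a subset of the rows."""
    return [row["customer"] for row in rows if item in row["items"]]


def items_bought_with(item, rows, min_count=2):
    """Mining-style question: count which other items co-occur with `item`."""
    counts = Counter()
    for row in rows:
        if item in row["items"]:
            counts.update(row["items"] - {item})
    # Keep only companions seen at least `min_count` times (an assumed cutoff).
    return {other: n for other, n in counts.items() if n >= min_count}


milk_buyers = customers_who_bought("milk", transactions)
milk_companions = items_bought_with("milk", transactions)
print(milk_buyers)
print(milk_companions)
```

The `min_count` cutoff is a stand-in for a proper minimum support threshold.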
11
Architecture of Data Mining System
13
Basic Data Mining Tasks (cont’d)
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
14
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
15
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
16
Data Mining Functionalities
Multidimensional concept description: Characterization and
discrimination
Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Bread → Butter [support = 0.5%, confidence = 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown or missing numerical values
17
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g.,
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Periodicity analysis
Similarity-based analysis
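As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation. This z-score test is one simple technique chosen here for illustration; the slides do not prescribe a method. The threshold of 2.0 and the sample amounts are assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical transaction amounts; one is far from the rest.
amounts = [10, 11, 9, 10, 12, 50]
outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided from the application context.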
18
Why Data Mining?—Potential Applications
19
Ex. 1: Market Analysis and Management
Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis—Find associations/co-relations between product sales,
& predict based on such association
Customer profiling—What types of customers buy what products (clustering
or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
20
Ex. 2: Corporate Analysis & Risk Management
21
Ex. 3: Fraud Detection & Mining Unusual Patterns
22
Mining for Knowledge
Knowledge in the form of rules
If <condition_1>&<condition_2>& …&<condition_n> Then
<conclusion>
Types of knowledge
Association
Presence of one set of items/attributes implies presence of
another set.
Classification
Given examples of objects belonging to different groups,
develop profile of each group in terms of attributes of the
objects.
Clustering.
Unsupervised grouping of similar records based on attributes.
Prediction (temporal and spatial).
Historical records collected at fixed intervals of time.
23
Mining Association Rules
24
Data Mining Functionalities:
Mining Frequent Patterns
Frequent patterns are the patterns that occur
frequently in the data. Patterns can include
itemsets, sequences and subsequences.
A frequent itemset refers to a set of items that
often appear together in a transactional data set.
ex: bread and milk
Association Rules
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
This is a single-dimensional association rule. A support of 1% means that 1% of all the transactions under analysis show
that computer and software are purchased together. A confidence of 50% means that a customer who buys a computer also buys software in 50% of cases.
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
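Support and confidence can be computed directly from a list of transactions. The following is a minimal sketch, assuming each transaction is a Python set of item names; the toy data, the function names, and the thresholds are illustrative and not taken from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so treat its confidence as zero
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


# Hypothetical toy transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

s = support({"computer", "software"}, transactions)       # 2 of 5 transactions
c = confidence({"computer"}, {"software"}, transactions)  # 2 of the 3 computer buyers
print(s, c)
```

In this toy data the rule computer => software is kept with thresholds of 30% support and 60% confidence, and discarded if the confidence threshold is raised to 70%.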
An Example Association Rule
27
Data Mining Functionalities:
Classification and Prediction
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the
analysis of a set of training data and is used to predict the class label of objects.
Decision Tree
Neural Network
Prediction models continuous-valued functions, i.e., it is used to predict missing or
unavailable numeric data values rather than class labels.
Although the term prediction can refer to both numeric prediction and class label prediction, here it refers primarily to numeric prediction.
Regression analysis is a statistical method most often used for numeric prediction.
Classification and regression may need to be preceded by relevance analysis,
which attempts to identify attributes that are significantly relevant to the
classification and regression process. Such attributes will be selected for the
classification and regression process. Other attributes, which are irrelevant, can
then be excluded from consideration.
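Numeric prediction by regression can be illustrated with an ordinary least-squares fit of a straight line. This is a minimal sketch under stated assumptions: a single input attribute, a linear relationship, and made-up data points. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a numeric value for an unseen input."""
    return slope * x + intercept


# Hypothetical training data that happens to lie on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
missing_value = predict(5, slope, intercept)
print(slope, intercept, missing_value)
```

Real data rarely lies exactly on a line, and `fit_line` would need a guard for the case where all `xs` are equal.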
Mining Classification Rules
[Figure: Patient Records (Symptoms, Diseases) labeled "Recovered" and "Never Recovered"; questions posed for a new case: "Recover?" / "Not recover?"]
30
An Example of Classification
Credit card data
Each transaction contains transaction date, amount, and a set of
items purchased, etc.
Each customer record contains gender, age, education
background, etc.
Example of rules discovered:
IF use of card >= 9 months continuously & no. of transactions <= 2
THEN Cash Advance = Yes.
Actionable item:
Promote credit services to potential customers who require cash
advances.
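Once a rule like this has been discovered, applying it is a simple filter over customer records. The sketch below encodes the rule as stated; the record layout, the field names `months_of_use` and `num_transactions`, and the sample customers are assumptions made for illustration.

```python
def predicts_cash_advance(months_of_use, num_transactions):
    """Discovered rule: card used >= 9 months continuously and <= 2 transactions."""
    return months_of_use >= 9 and num_transactions <= 2


# Hypothetical customer records.
customers = [
    {"id": 101, "months_of_use": 12, "num_transactions": 1},
    {"id": 102, "months_of_use": 4, "num_transactions": 0},
    {"id": 103, "months_of_use": 9, "num_transactions": 2},
    {"id": 104, "months_of_use": 24, "num_transactions": 15},
]

# Customers to whom credit services could be promoted.
promotion_targets = [
    c["id"]
    for c in customers
    if predicts_cash_advance(c["months_of_use"], c["num_transactions"])
]
print(promotion_targets)
```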
31
Data Mining Functionalities:
Cluster Analysis
Clustering analyzes data objects without consulting
class labels.
Clustering can be used to generate class labels for
a group of data when no such labels existed at the
beginning.
The objects are clustered or grouped based on the
principle of maximizing the intra-class similarity and
minimizing the inter-class similarity.
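One common way to form such groups is k-means, sketched below for one-dimensional data. The slides do not name a particular algorithm, so k-means is an assumption here, as are the starting centers, the fixed iteration count, and the sample values.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means: assign each value to its nearest center, then recompute centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled values with two visible groups.
values = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(values, centers=[1, 12])
print(centers)
print(clusters)
```

The two resulting groups can then be treated as new class labels. Values within a group are close to each other (high intra-class similarity), while the groups themselves are far apart (low inter-class similarity).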
Discovering Clusters
Dividing data objects into groups according to similarity
33
34
Classification ≠ Clustering
Classification: What is the difference between good and bad customers?
Clustering: How can I group the customers?
35
Discovering Sequential Patterns
People who have purchased a VCR are three
times more likely to purchase a camcorder
two to four months after the purchase.
36
An Example of Sequential Pattern Mining
Electricity consumption data:
A set of time series each associated with an
industrial user.
Each time series represents an electricity load
profile of a user at a certain premise.
A reading of the electricity load is taken every 30 min.
The Goal
Identify companies with similar electricity load
profiles using data mining.
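A simple starting point for comparing load profiles is to treat each profile as a vector of readings and measure the Euclidean distance between vectors. The sketch below makes several assumptions: the company names and readings are invented, each profile has only four readings rather than one per 30 minutes, and Euclidean distance is just one possible similarity measure.

```python
import math
from itertools import combinations

# Hypothetical load profiles: one list of readings per industrial user.
profiles = {
    "company_a": [1.0, 2.0, 3.0, 2.0],
    "company_b": [1.1, 2.1, 2.9, 2.0],
    "company_c": [3.0, 1.0, 0.5, 3.0],
}


def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length."""
    return math.dist(p, q)


def most_similar_pair(profiles):
    """Return the pair of users whose profiles are closest to each other."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: profile_distance(profiles[pair[0]], profiles[pair[1]]),
    )


closest = most_similar_pair(profiles)
print(closest)
```

The same distance function could feed a clustering algorithm to group all users rather than only the closest pair.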
37
Web Log Mining
38
An Example of Web Log Mining
39
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
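The steps above can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy illustration only: the records, the field names, the choice of frequency counting as the mining step, and the `min_count` threshold are all assumptions, and real projects spend far more effort on each stage.

```python
from collections import Counter

# Hypothetical raw records; one has a missing value.
raw_records = [
    {"region": "north", "item": "milk"},
    {"region": "north", "item": None},
    {"region": "south", "item": "bread"},
    {"region": "north", "item": "milk"},
    {"region": "north", "item": "bread"},
]


def select(records, region):
    """Create the target data set: keep only task-relevant records."""
    return [r for r in records if r["region"] == region]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Data reduction: keep only the attribute of interest."""
    return [r["item"] for r in records]


def mine(items):
    """Data mining: count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that are frequent enough."""
    return {p: n for p, n in patterns.items() if n >= min_count}


knowledge = evaluate(mine(transform(clean(select(raw_records, "north")))), min_count=2)
print(knowledge)
```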
40
Are All the “Discovered” Patterns Interesting?
41
Requirements and Challenges
Variety of data types.
Noisy and incomplete data
The interestingness problem.
Different kinds of knowledge.
Different levels of abstraction.
Expression and visualization of data mining
results.
Efficiency and scalability of data mining
algorithms.
42
Thank You
43