Introduction to Data Mining
Introduction
Why Data Mining?
What Is Data Mining?
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
Why Data Mining?
Data vs. Information:
Data: recorded facts
Information: patterns underlying the data
The Explosive Growth of Data: from terabytes to
petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
What is Data Mining?
“Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets.
Data mining: searching for knowledge (interesting
patterns) in your data.
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO,
deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
systems
What is Data Mining?
Data mining (knowledge discovery from data)
Refers to extracting or “mining” knowledge from large amounts of
data.
Extraction of interesting (implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of
data.
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Knowledge Discovery Process
Data mining can be viewed as simply an essential step
in the process of knowledge discovery.
Knowledge Discovery Process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the database)
4. Data transformation (where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations)
5. Data mining (an essential process where intelligent methods are
applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined knowledge to
users)
Steps 1 through 4 are different forms of data preprocessing.
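The knowledge discovery steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not a prescribed implementation: the record fields (`customer`, `amount`), the two in-memory sources, and the "frequent customer" pattern are all hypothetical.

```python
from collections import Counter

# Hypothetical raw sources; None marks a missing (noisy) value.
source_a = [{"customer": "ann", "amount": 120.0}, {"customer": "bob", "amount": None}]
source_b = [{"customer": "ann", "amount": 80.0}, {"customer": "cy", "amount": 15.0},
            {"customer": "ann", "amount": 40.0}]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, min_amount):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data transformation: aggregate into purchase counts per customer."""
    return Counter(r["customer"] for r in records)

def mine(counts, min_count):
    """Data mining: extract a simple pattern (frequent customers)."""
    return {c: n for c, n in counts.items() if n >= min_count}

def evaluate(patterns, top=1):
    """Pattern evaluation: rank patterns by count and keep the top ones."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:top]

def present(ranked):
    """Knowledge presentation: render the result for the user."""
    return [f"{name}: {n} purchases" for name, n in ranked]

data = integrate(clean(source_a), clean(source_b))
counts = transform(select(data, min_amount=20.0))
report = present(evaluate(mine(counts, min_count=2)))
print(report)  # ['ann: 3 purchases']
```

The first four functions correspond to the preprocessing steps; only `mine` plays the role of the data mining step proper.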
What Kind of Data Can be Mined?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Database Data
Database data: A database management system
(DBMS) consists of a collection of interrelated data,
known as a database, and a set of software programs to
manage and access the data.
The software programs provide mechanisms
for defining database structures and data storage;
for specifying and managing concurrent, shared, or distributed
data access;
for ensuring consistency and security of the information stored
despite system crashes or attempts at unauthorized access.
A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and
usually stores a large set of tuples (records or rows).
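As a minimal sketch of the relational model described above, the following uses Python's built-in `sqlite3` module. The table name `customer` and its attributes are illustrative assumptions, not taken from the slides.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# Each tuple (row) supplies one value per attribute (column).
rows = [(1, "Smith", 45), (2, "Jones", 31), (3, "Lee", 52)]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

# A query retrieves the tuples that satisfy a condition.
older = conn.execute(
    "SELECT name FROM customer WHERE age > 40 ORDER BY name"
).fetchall()
print(older)  # [('Lee',), ('Smith',)]
conn.close()
```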
Data Warehouse
Data warehouse: A data warehouse is a repository of
information collected from multiple sources, stored under
a unified schema, and usually residing at a single site.
Data in a data warehouse are organized around major
subjects (e.g., customer, item, supplier, and activity).
The data are stored to provide information from a
historical perspective, such as in the past 6 to 12 months,
and are typically summarized.
A data warehouse is usually modeled by a
multidimensional data structure, called a data cube.
Each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate
measure such as count
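A data cube of this kind can be approximated with a dictionary keyed by dimension values. The sketch below is an assumption-laden illustration: the sales facts, the dimension names (item, city, quarter), and the choice of a sum as the aggregate measure are all hypothetical (a count would work the same way).

```python
from collections import defaultdict

# Hypothetical sales facts: (item, city, quarter, units_sold).
facts = [
    ("tv", "Vancouver", "Q1", 3),
    ("tv", "Vancouver", "Q2", 5),
    ("phone", "Vancouver", "Q1", 2),
    ("tv", "Toronto", "Q1", 4),
]

def build_cube(rows, dims):
    """Sum units_sold into cells keyed by the chosen dimensions.

    dims is a tuple of column indexes (0=item, 1=city, 2=quarter).
    """
    cube = defaultdict(int)
    for row in rows:
        cell = tuple(row[d] for d in dims)
        cube[cell] += row[3]
    return dict(cube)

# Detailed cube: one cell per (item, city, quarter) combination.
detailed = build_cube(facts, dims=(0, 1, 2))

# Roll-up: drop the quarter dimension, so each cell summarizes over time.
by_item_city = build_cube(facts, dims=(0, 1))
print(by_item_city)
```

Going from `by_item_city` back to `detailed` corresponds to a drill-down: the same facts viewed at a finer level.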
Data Warehouse
A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for
AllElectronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube in (a).
Data Warehouse
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
Transactional Data
Transactional Data: Each record in a transactional
database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web
page.
A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the
transaction, such as the items purchased in the transaction.
A transactional database may have additional tables, which
contain other information related to the transactions.
Transactions can be stored in a table, with one record
per transaction.
Because most relational database systems do not support
nested relational structures, the transactional database is usually
either stored in a flat file
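The sketch below shows one way a transaction (a trans ID plus a list of items) might be read from a flat file. The file layout (tab after the ID, comma-separated items) and the IDs are hypothetical, and the `unfold` helper is an illustration added here, not a method stated on the slide: it flattens the nested item lists into plain rows that a non-nested relational table can hold.

```python
# Hypothetical flat-file contents: a trans ID, a tab, then comma-separated item IDs.
flat_file = "T100\tI1,I3,I8\nT200\tI2,I8\nT300\tI1,I8\n"

def parse_transactions(text):
    """Return a dict mapping each trans ID to the set of items it contains."""
    transactions = {}
    for line in text.splitlines():
        trans_id, items = line.split("\t")
        transactions[trans_id] = set(items.split(","))
    return transactions

def unfold(transactions):
    """Flatten the nested item lists into (trans_ID, item) rows, one item per row."""
    return sorted((t, i) for t, items in transactions.items() for i in items)

transactions = parse_transactions(flat_file)
rows = unfold(transactions)
print(len(rows))  # 7
```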
What Kind of Patterns Can Be Mined?
Data mining functionalities:
Characterization and discrimination
Mining of frequent patterns, associations, and correlations
Classification and regression
Clustering analysis
Outlier analysis
Data mining functionalities are used to specify the kinds of
patterns to be found in data mining tasks.
In general, data mining tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize properties of the data in a
target data set.
Predictive mining tasks perform induction on the current data in
order to make predictions.
Concept/Class Description
Characterization: summarization of the general
characteristics or features of a target class of data.
The output of data characterization can be presented in
various forms.
E.g., pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables, including crosstabs.
Example:
A customer relationship manager at AllElectronics may order the
following data mining task: Summarize the characteristics of
customers who spend more than $5000 a year at AllElectronics.
The result is a general profile of these customers, such as that
they are 40 to 50 years old, employed, and have excellent credit
ratings.
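The characterization task in the example above can be sketched as a filter followed by a summary. All customer records and field names below are hypothetical; the code only illustrates the idea of profiling a target class.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; none of these values come from real data.
customers = [
    {"age": 44, "employed": True, "credit": "excellent", "yearly_spend": 7200},
    {"age": 48, "employed": True, "credit": "excellent", "yearly_spend": 5600},
    {"age": 29, "employed": False, "credit": "fair", "yearly_spend": 900},
    {"age": 41, "employed": True, "credit": "good", "yearly_spend": 6100},
]

def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in records if r["yearly_spend"] > min_spend]
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "share_employed": sum(r["employed"] for r in target) / len(target),
        "top_credit": Counter(r["credit"] for r in target).most_common(1)[0][0],
    }

profile = characterize(customers, min_spend=5000)
print(profile)
```

The returned dictionary is the "general profile"; in practice it would be shown as a chart or crosstab, as the slide notes.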
Concept/Class Description
Discrimination: Comparison of the general features of
the target class data objects against the general features
of objects from one or multiple contrasting classes.
The forms of output presentation are similar to those for
characteristic descriptions.
Example:
A customer relationship manager at AllElectronics may want to compare
two groups of customers—those who shop for computer products
regularly (e.g., more than twice a month) and those who rarely shop for
such products (e.g., less than three times a year). The resulting
description provides a general comparative profile of these customers,
such as that 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university
education, whereas 60% of the customers who infrequently buy such
products are either seniors or youths, and have no university degree.
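Discrimination compares the same summary across a target class and a contrasting class. The following sketch assumes hypothetical customer records and uses the thresholds from the example (more than twice a month, fewer than three times a year); the age band is an illustrative feature.

```python
# Hypothetical customers: purchases of computer products per year, plus age.
customers = [
    {"purchases_per_year": 30, "age": 25},
    {"purchases_per_year": 40, "age": 35},
    {"purchases_per_year": 26, "age": 55},
    {"purchases_per_year": 1, "age": 70},
    {"purchases_per_year": 2, "age": 16},
    {"purchases_per_year": 0, "age": 30},
]

def share_aged_20_to_40(group):
    """Fraction of a group whose age lies between 20 and 40 inclusive."""
    return sum(20 <= c["age"] <= 40 for c in group) / len(group)

# Target class: more than twice a month, i.e. more than 24 purchases a year.
frequent = [c for c in customers if c["purchases_per_year"] > 24]
# Contrasting class: fewer than three purchases a year.
rare = [c for c in customers if c["purchases_per_year"] < 3]

comparison = {
    "frequent": share_aged_20_to_40(frequent),
    "rare": share_aged_20_to_40(rare),
}
print(comparison)
```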
Frequent Patterns, Association and Correlation Analysis
Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
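As a minimal sketch of constructing a model from training examples and then predicting unknown labels, the code below trains a decision stump (a one-level decision tree) on the gas-mileage example. The training data, the class names "economy" and "standard", and the threshold-search strategy are assumptions made for illustration.

```python
def train_stump(examples):
    """Pick the mileage threshold that misclassifies the fewest training examples.

    examples: list of (mpg, label) pairs with labels "economy" or "standard".
    The learned rule is: predict "economy" when mpg >= threshold.
    """
    values = sorted(mpg for mpg, _ in examples)
    # Candidate thresholds: midpoints between neighboring training values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            ("economy" if mpg >= threshold else "standard") != label
            for mpg, label in examples
        )

    return min(candidates, key=errors)

def predict(threshold, mpg):
    """Assign a class label to a car whose label is unknown."""
    return "economy" if mpg >= threshold else "standard"

# Hypothetical training examples (miles per gallon, class label).
training = [(18, "standard"), (22, "standard"), (24, "standard"),
            (31, "economy"), (35, "economy"), (40, "economy")]

threshold = train_stump(training)
print(threshold, predict(threshold, 38), predict(threshold, 20))
# 27.5 economy standard
```

The other methods listed (naïve Bayes, support vector machines, and so on) follow the same train-then-predict shape with a different model inside.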
Clustering
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing
interclass similarity
Many methods and applications
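One of the many clustering methods is k-means, which is not named on the slide and is used here only as an example. The sketch groups hypothetical house locations without any class labels; taking the first k points as initial centers is an assumption made to keep the run deterministic.

```python
from math import dist  # available in Python 3.8+

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centers."""
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centers, clusters = k_means(houses, k=2)
print(centers)
```

Points end up close to their own center and far from the other one, which is the similarity principle stated above in its simplest form.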
Clustering
The output takes the form of a diagram that shows how
the instances fall into clusters.
Different cases:
Simple 2D representation: involves associating a cluster
number with each instance
Venn diagram: allows one instance to belong to more than one
cluster
Probabilistic assignment: associate instances with clusters
probabilistically
Dendrogram: produces a hierarchical structure of clusters
(dendron is the Greek word for tree)
Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
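A simple way to see the idea of an object that "does not comply with the general behavior" is a robust statistical rule. Note that this is a swapped-in technique, not the clustering or regression by-product methods named above: it flags values far from the median, measured in units of the median absolute deviation (MAD). The 0.6745 factor and 3.5 cutoff are conventional choices assumed here, and the amounts are hypothetical.

```python
from statistics import median

def find_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (based on median and MAD) exceeds cutoff."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # No spread to measure against.
    return [v for v in values if 0.6745 * abs(v - center) / mad > cutoff]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [23.5, 19.0, 25.0, 21.0, 980.0, 22.5, 20.0]
print(find_outliers(amounts))  # [980.0]
```

Whether 980.0 is noise or a fraud signal is exactly the "garbage or treasure" question: the rule only flags it, and the interpretation is left to the analyst.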
Are All Patterns Interesting?
Data mining may generate thousands of patterns: Not all
of them are interesting
What makes a pattern interesting?
Easily understood by humans,
Valid on new or test data with some degree of certainty
Potentially useful
Novel
Validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
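The two objective measures named above, support and confidence, can be computed directly from a set of transactions. The baskets below are hypothetical; the definitions are the standard ones for an association rule A => B.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket transactions.
baskets = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

s = support(baskets, {"computer", "software"})
c = confidence(baskets, {"computer"}, {"software"})
print(s, c)
```

Here the rule computer => software holds in half of all baskets (support) and in two of the three baskets that contain a computer (confidence).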
What Technologies Are Used?
What Kind of Applications Are Targeted?
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible
data mining
Major Issues In Data Mining
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues In Data Mining
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining
KDD conferences:
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
Other related conferences:
DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
Web and IR conferences: WWW, …
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-
PAMI, etc.
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems,
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM SIGGRAPH, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining