Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
DW Lecturer: DM Lecturer:
http://www.inf.unibz.it/dis/teaching/DWDM/index.html
Organization
` LLectures
t
Tuesday & Thursday From 10:30 To 12:30. Rooms are dynamic
Office hours Dr. Kacimi: Thursday From 14:00 to 17:00
(appointment by email)
` Projects
Lab hours Tuesday From 14:00 to 16:00. Room is dynamic
` Requirements: obtain at least 18 points in each of the following
Project: - use & implement algorithms
- write a project report
- present the project
Exam: - have knowledge about the course
- be able to present it
` Textbooks
Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, Second Edition, 2006
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to
Data Mining
Mining", Pearson Addison Wesley
Wesley, 2008
2008, ISBN: 0-32-134136-7
0 32 134136 7
Margaret H. Dunham, Data Mining: Introductory and Advanced
Topics, Prentice Hall, 2003
D t Mi
Data Mining
i
Outline
Sources
Business: Web, e-commerce, transactions, etc.
Science: Remote sensing, bioinformatics, etc.
S i t and
Society d everyone: news, YouTube,
Y T b etc.
t
Gold Mining
Stone
Not Stone Mining
` Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging information harvesting,
dredging, harvesting business intelligence,
intelligence etc.
etc
Knowledge Discovery (KDD) Process
` View from typical database K l d
Knowledge
Evaluation
systems and data warehousing & Presentation
communities
Patterns
Data Mining
Task-relevant
Data
Selection
& transformation
Data Warehouse
Data
Cleaning ` Data mining plays an essential
& Integration role in the knowledge
di
discovery process
Databases
Knowledge Discovery (KDD) Process
` Data Cleaning
Remove noise and inconsistent data
` Data Integration
Combine multiple data sources
` Data Selection
Data relevant to analysis tasks are
retrieved form the data
` Data transformation
Transform data into appropriate form
for mining (summary, aggregation, etc.)
` Data mining
Extract data patterns
` Pattern Evaluation
Identify truly interesting patterns
` Knowledge representation
Use visualization
sua a o a andd knowledge
o edge
representation tools to present the
mined data to the user
Typical Architecture of a Data Mining System
` Knowledge Base
Guide the search
Evaluate
interestingness of User Interface
the results
` Include Pattern Evaluation Knowledge
g
Concept Base
hierarchies Data Mining Engine
User believes
Constraints, Database or Data
thresholds, Warehouse Server
metadata, etc.
Data cleaning, Integration and selection
Database
Technology Statistics
Machine Visualization
Learning Data Mining
Pattern
Recognition
ii Other
Algorithm Disciplines
Why Confluence of Multiple Disciplines?
` Tremendous amount of data
Scalable algorithms to handle terabytes of data (e.g., Flickr had
3.6 billion images in June, 2009)
` High dimensionality of data
Data can have tens of thousands of features (e,g., DNA
microarray)
` High complexity of data
Data can be highly complex, can be of different types, and can
include different descriptors
p
Images can be described using text and visual features such as
color, texture, contours, etc.
Videos can be described using g text, images
g and their descriptors,
p
audio phonemes, etc.
Social networks can have a complex structure
...
` New and sophisticated applications
Applications can be difficult (e.g., medical applications).
Different Views of Data Mining
` Data View
` Knowledge view
` Method view
Kinds of techniques
q utilized
` Application view
Kinds of applications
1.1.2 Data to be Mined
` In principle, data mining should be applicable to any data
repository
Relational databases
Data warehouses
Transactional databases
Ad
Advanced d database
d t b systems
t
Relational Databases
` Database System
Collection of interrelated data, known as database
A set of software p
programs
g that manageg and access the data
` Relational Databases (RD)
A collection of tables. Each one has a unique name
A table contains a set of attributes (columns) & tuples (rows).
(rows)
Each object in a relational table has a unique key and is
described by a set of attribute values. Costumers
Data are accessed using database cust Id
cust_Id Name age income
152 Anna 27 24000
queries (SQL): projection, join, etc. ... ... ... ...
predict the credit risk of costumers based on their income, age and
expenses.
Data Warehouses
` A data warehouse (DW) is a repository of information collected
from multiple sources, stored under a unified schema.
Data source
in Bolzano
Clean Client
Data source Integrate Data Query and
in Paris Transform Warehouse Analysis Tools
Load
refresh Client
Data source
in Madrid
` Data Mining on TD
Perform a deeper analysis
Example: Which items sold well together?
Basically, data mining systems can identify frequent sets in
transactional databases and perform market basket data analysis.
Advanced Database Systems(1)
` Advanced database systems provide tools for handling complex
data
Spatial data (e.g., maps)
Engineering design data (e.g., buildings, system components)
Hypertext and multimedia data (text, image, audio, and video)
Time-related
e e a ed data
da a (e.g., historical
s o ca records)
eco ds)
Stream data (e.g., video surveillance and sensor data)
World Wide Web, a huge, widely distributed information repository
made available by Internet
` Require efficient data structures and scalable methods to handle
Complex object structures and variable length records
Semi structured or unstructured data
Multimedia and spatiotemporal data
Database schema with complex and dynamic structures
Advanced Database Systems(2)
` Example: World Wide Web
Provide rich, worldwide, online and distributed information services.
Data objects are linked together
U
Users traverse
t ffrom one object
bj t via
i lilinks
k to
t another
th
Problems
Data can be highly unstructured
Difficult to understand the semantic of web
pages and their context.
` Data Mining on WWW
Web usage Mining (user access pattern)
Improve system design (efficiency)
Better marketing decisions (adverts, user profile)
Authoritative Web p
page
g Analysis
y
Ranking web pages based on their importance
Automated Web page clustering and classification
Group and arrange web pages based on their content
Web community analysis
Identify hidden web social networks and observe their evolution
1.1.3 Knowledge to be Discovered
` Data mining functionalities are used to specify the kind of
patterns to be found in data mining tasks
` Data mining tasks can be classified into two categories
Descriptive : Characterize the general properties of the data
Predictive : Perform inference on the current data to make
p
predictions
` What to extract?
Users may not have an idea about what kinds of patterns in their
data can be interesting
` What to do?
Have a data mining system that can mine multiple types of
patterns to handle different user and applications needs
needs.
Discover patterns
Example of different
at various granularities Street City Country
granularities
(levels of abstraction)
Allow users to guide the search for interesting patterns
Characterization and Discrimination (1)
` Data can be associated with classes or concepts
Example of data from a store
Classes Concepts
` Typical methods
Hierarchical methods
methods, density
density-based
based methods
methods, Grid
Grid-based
based
methods, Model-Based methods, constraint-based methods, etc.
` Typical Applications
WWW,
WWW sociali l networks,
t k Marketing,
M k ti Biology,
Bi l Lib
Library, etc.
t
Outlier Analysis
` Outlier: A data object that does not comply with the general
behavior of the data learning (i.e., Class label is unknown)
` Noise or exception? One persons
person s garbage could be another
persons treasure
Or ?
` Typical methods
Product of clustering or regression analysis, etc.
` Typical Applications
Useful in fraud detection
Example
How to Uncover fraudulent usage of credit card?
Detect purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same
account
Outliers may also be detected with respect to the location and
type of purchase, or the frequency.
Evolution Analysis
` Evolution Analysis describes trends of data objects whose
behavior changes over time
` It includes
Characterization and discrimination analysis
Association and correlation analysis
Cl ifi ti
Classification and
d prediction
di ti
Clustering of timerelated data
` Distinct features for such analysis
Time-series data analysis
Sequence or periodicity pattern matching
e.g., first
s buy digital
d g a camera,
ca e a, then
e buy large
a ge SSD memory
e o y cards
ca ds
Similarity-based analysis
1.1.4 Techniques Utilized
` Data-intensive
` Data warehouse (OLAP)
` Machine learning
` Statistics
` Pattern recognition
` Visualization
` High-performance
` ...
1.1.5 Applications Adapted
` Knowledge to be discovered
Frequent patterns, correlations, associations, classification,
prediction cluster
prediction,
` Techniques to be used
Statistics, machine learning, visualization, etc.
` D t Mining
Data Mi i is
i interdisciplinary
i t di i li
Large amount of complex data and sophisticated applications