Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Miquel Sànchez-Marrè
http://www.lsi.upc.edu/~miquel
19 de Julio 2013
https://kemlg.upc.edu
Content
Introduction
Knowledge Discovery in Databases
From DM to BD & DS: a Computational Perspective
Data Pre-processing
Data Mining Techniques: Statistical & ML methods
Data Post-Processing
Big Data / Data Science
Social Mining
Scaling from Data Mining to Big Mining
Computational Techniques
Examples of Extrapolation of classical DM Techniques
A look to Mahout
Big Data Tools & Resources
Big Data Trends
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
2
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
INTRODUCTION
Complex Real-world Systems/Domains
Exist in the daily real life of human beings, and normally show a
strong complexity for their understanding, analysis, management or
From DM to BD & DS: a Computational Perspective
solving.
They imply several decision making tasks very complex and
difficult, which usually are faced up by human experts
Some of them, in addition, could have catastrophic consequences
either for human beings or for the environment or for the economy of
one organization
Examples
Environmental System/Domains
Medical System/Domains
Industrial Process Management Systems/Domains
Business Administration & Management Systems/Domains
Marketing
Decisions on products and prices
Decisions on human resources
Decisions on strategies and company policy
Internet
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
4
Need for Decision Making Support Tools
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
5
Management Information Systems
MANAGEMENT INFORMATION SYSTEMS
From DM to BD & DS: a Computational Perspective
DECISION
TRANSACTION
SUPPORT
PROCESSING
SYSTEMS
SYSTEMS
(DSS)
INTELLIGENT DECISION SUPPORT SYSTEMS (IDSS)
AUTOMATIC
MODEL-DRIVEN DATA-DRIVEN
MODEL-DRIVEN
DSS DSS
DSS
TOP-DOWN BOTTOM-UP
APPROACH / APPROACH /
CONFIRMATIVE EXPLORATIVE
MODELS MODELS
DECISION SUPPORT
COSTS REGULATIONS
USER INTERFACE
DIAGNOSIS
Decision AI STATISTICAL NUMERICAL
DATA MI NING
Systems Background/
Subjective
KNOWLEDGE ACQUISITION / LEARNING
(IDSS) Knowledge
ON-LINE /
Off-line
OFF-LINE
data
On-line ACTUATORS
data
Spatial /
Geographical SUBJECTIVE
data INFORMATION / Actuation
ANALYSES
SENSORS
SYSTEM / PROCESS
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
7
Why Data Mining is important ?
From observations to decisions
[Adapted from A.D. Witakker, 1993]
From DM to BD & DS: a Computational Perspective
DECISION
High
Recomendations
INTERPRETATION Consequences
Predictions
Value and
Relevance KNOWLEDGE
to
Decisions Understanding
Data
Observations
Low
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
8
KNOWLEDGE DISCOVERY / DATA MINING
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
KDD Concepts
KDD
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
10
The KDD process
[Fayyad, Piatetsky-Shapiro & Smyth, 1996]
Interpretation /
Evaluation
From DM to BD & DS: a Computational Perspective
Data Mining
Knowledge
Transformation
Preprocessing
Knowledge
Patterns / Models
Selection
Transformed
Data
Preprocessed
Data
Target Data
Data
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
11
From DM to BD & DS: a Computational Perspective Terminology
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
12
From DM to BD & DS: a Computational Perspective
DATA PRE-PROCESSING
https://kemlg.upc.edu
Problems with data
Corrupted Data
Noisy Data
Irrelevant Data / Attribute Relevance
Attribute Extraction
Numeric and Symbolic Data
Scarce Data
Missing Attributes
Missing Values
Fractioned Data
Incompatible Data
Different Source Data
Different Granularity Data
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
14
Data Preparing Data
Data Tranformation
Data Filtering
From DM to BD & DS: a Computational Perspective
Data Sorting
Data Edition
Noise Modelling
New Information Obtention
Visualization
Removing
Data Selection
Sampling
New Information Generation
Data Engineering
Data Fusion
Time Series Analysis
Constructive Induction
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
15
Data Cleaning
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
16
Coding
Address to region
Phone prefix to province code
Birth data to age, and age to decade
To normalize huge magnitudes: millions to thousands (account
balance, credits, ...)
Sex, married/single, ... transformed to binary
Weekly, monthly informations, ... to number of week, number of
month, ...
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
17
Flattering
From DM to BD & DS: a Computational Perspective
DNI hobby
DNI Spo Mus Read
45464748 sport
45464748 music 45464748 1 1 1
45464748 reading 50000001 0 1 0
50000001 music
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
18
Data Selection
From DM to BD & DS: a Computational Perspective
Instance Selection
Feature Selection
Feature Weighting (Feature Relevance Determination)
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
19
Instance Selection
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
20
Feature Selection
Goal: to reduce the number of features (dimensionality)
From DM to BD & DS: a Computational Perspective
Categories of methods:
Feature ranking
Feature ranking methods ranks the features by a metric and
eliminates all features that do not achieve an adequate score
Subset selection
Subset selection methods search the set of possible features
for the optimal subset.
Methods
Wrapper methods (Wrappers): utilize the ML model of interest as a
black box to score subsets of feature according to their predictive
power.
Filter methods (Filters): select subsets of features as a preprocessing
step, independently of the chosen predictor.
Embedded methods: perform feature selection in the process of
training and are usually specific to given ML techniques
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
21
Feature Weighting
Goal: to determine the weight or relevance of all features in
an automatic way
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
22
Feature Weighting / Attribute Relevance (1)
Supervised methods
From DM to BD & DS: a Computational Perspective
Filter methods
Global methods
Mutual Information algorithm
Cross-Category Feature Importance
Projection of Attributes
Information Gain measure
Class Value method
Local methods
Value-Difference Metric
Per Category Feature Importance
Class Distribution Weighting algorithm
Flexible Weighting
Entropy-Based Local Weighting
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
23
Feature Weighting / Attribute Relevance (2)
Wrapper methods
From DM to BD & DS: a Computational Perspective
RELIEF
Contextual Information
Introspective Learning
Diet Algorithm
Genetic algorithms
Unsupervised methods
Gradient Descent
Entropy-based Feature Ranking
UEB-1
UEB-2
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
24
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
Data Mining Techniques
Statistical Techniques
Linear Models: simple regression, multiple regression
Time Series Models (AR, MA, ARIMA)
From DM to BD & DS: a Computational Perspective
UNCERTAINTY
MODELS
From DM to BD & DS: a Computational Perspective
(Stats) Pure
(AI) Certainty Factor (AI) Evidence Theory (AI) Possibility Theory
Probabilistic
Method[MYCIN] [Dempster-Shafer] Fuzzy Logic [Zadeh]
Model
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
28
Descriptive Models
From DM to BD & DS: a Computational Perspective
Clustering Techniques
https://kemlg.upc.edu
Clustering / Conceptual Clustering
From DM to BD & DS: a Computational Perspective
A B C
1 0.1 0.8 0.3
2 0.1 0.3 100
described by A=0.1
3 0.7 0.3 0.45
4 0.3 0.38 0.42 described by B[0.3,0.38] & C [0.42,0.45]
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
30
Clustering Techniques
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
31
K-means Method
Input: X = {x1, ..., xn} // Data to be clustered
From DM to BD & DS: a Computational Perspective
k // Number of clusters
Function K-means
End function
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
32
K-means method
From DM to BD & DS: a Computational Perspective [Charts extracted from “Mahout in action”, Owen et al., 2011]
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
33
Ascendant hierarchical clustering
Dendogram
From DM to BD & DS: a Computational Perspective
a
c h
b g
Data
set
d
f
e
h
c
f
d e a b g
34
Discriminant Models
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
K-NN algorithm
Input: T = {t1, ..., tn} // Training Data points available
D = {d1, ..., dm} // Data points to be classified
k // Number of neighbours
Output: neighbours // the k nearest neigbours
From DM to BD & DS: a Computational Perspective
Function K-NN
End function 36
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
Decision trees
From DM to BD & DS: a Computational Perspective
37
Decision Tree: example
From DM to BD & DS: a Computational Perspective
Criteria
39
ID3: basic idea
From DM to BD & DS: a Computational Perspective
40
© Miquel Sànchez i Marrè & Karina Gibert, KEMLG, 2009
ID3: selection criteria
5. 2 2 5 - B=2 0 3 0,0000
C - +
6. 2 2 7 - B=3 1 0 0,0000
7. 1 2 7 - =5 =7
8. 2 3 7 + C=5 0 3 0,0000 0,4206
- + (5)
C=7 3 2 0,6730
(3)
B
=1 =3
=2
(4)
A C D 5. 2 5 - 8. 2 7 +
1. 1 5 - 6. 2 7 - p n h(p,n) e
2. 2 5 - 7. 1 7 - A=1 1 1 0,6931 0,6931
3. 2 7 + A=2 1 1 0,6931
4. 1 7 +
C=5 0 2 0,0000 0,0000
C=7 2 0 0,0000 43
From DM to BD & DS: a Computational Perspective
Data Post-Processing
https://kemlg.upc.edu
Post-processing Techniques (1)
Post-processing techniques are devoted to transform the
direct results of the data mining step into directly
understandable, useful knowledge for later decision-making
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
46
Interpretation / Result Evaluation
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
47
Result Representation (1)
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
48
Result Representation (2)
90
80
70
60
50
40 Serie1
30
20
10 5
0
1 2 3 4 5
4
3 Serie1
0 20 40 60 80 100
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
49
Result Representation (3)
1
2
3
4
5
1
2
3
4
5
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
50
Two-dimensional Representation
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
51
Three-dimensional Representation
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
52
Validation Methods for Discriminant and Predictive
Models
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
53
Precision/Accuracy Types
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
54
Simple Validation [Stratified] (1)
From DM to BD & DS: a Computational Perspective
Induced
Model
Test
Stratified: The class distribution C1, …,CM in the training set and in the
test set follows the same distribution than the original whole data set
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
55
From DM to BD & DS: a Computational Perspective Simple Validation [Stratified] (2)
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
56
Cross Validation [Stratified]
The initial data set is split into N subsets
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
57
Global Precision of classification
Global Error or Misclassification Rate
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
58
Precision by modalities (1)
Confusion Matrix
From DM to BD & DS: a Computational Perspective
LOGISTIC
Predicted Error Type II /
REGRESSION False Positives
0 1
0 48,02 10,91
Observed
1 22,92 18,14
RBF neural
Predicted
network
0 1
0 47,34 11,60
Observed
1 20,87 20,19
0 1
0 41,34 17,60
Observed
1 12,14 28,92
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
60
From DM to BD & DS: a Computational Perspective ROC Curves
Sensitivity
True Positive rate
(1 – prob(error type I))
versus
1- especificity
False Positive rate
(prob(error type II)):
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
61
Performance Gini Index
Surface between ROC curve and the 45º bisector
From DM to BD & DS: a Computational Perspective
Gini
0,4375 0,4230 0,4445 0,5673
index
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
62
BIG DATA
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
Big Data
Big Data is a collection of data sets so large and
complex that it becomes difficult to process using on-hand
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
65
Big Data
Problems
Big data always give an answer, but it could not make sense
From DM to BD & DS: a Computational Perspective
Techniques
Tools
Architectures
The challenges include (new and old problems):
Capture
Curation
Storage
Search
Sharing
Transfer
Analysis
Visualization
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
67
Data Science
Data science incorporates
varying elements and
builds on techniques and
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
68
Data Science
[Extracted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
69
Big Data Jungle
TECHNICAL / SCIENTIFIC SOCIAL / BUSINESS / COMPANIES
From DM to BD & DS: a Computational Perspective
BIG DATA
“SMALL” DATA
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
70
Scientific Data Mining vs Social Mining
SCIENTIFIC DATA MINING SOCIAL MINING
From DM to BD & DS: a Computational Perspective
AVERAGES
REASONING
ABOUT
INDIVIDUALS
REASONING
ABOUT
PERSONAL
PROTOTYPES
RECOMMENDATIONS
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
71
Social Mining
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
72
Social Netwoks Analysis and Big Data
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
73
Influence
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
74
Discovering influence based on social relationships
[http://www.tweetStimuli.com, A.Tejeda 2012]
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
75
Discovering influence based on social relationships
[http://www.tweetStimuli.com, A.Tejeda 2012]
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
76
From DM to BD & DS: a Computational Perspective
Social Network Analysis and Big Data
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
77
FROM DATA MINING TO MASSIVE DATASET
MINING
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
How to scale from “Small” to Big Data?
MINING
DATA ALGORITHMS
From DM to BD & DS: a Computational Perspective
Slicing Data
Data is splitted in chunks of data
Each chunk is processed by a thread or node of computation
Slicing individuals
Slicing attributes
Recombining the results
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
80
Parallel K-means Method
Input: X = {x1, ..., xn} // Data to be clustered
From DM to BD & DS: a Computational Perspective
k // Number of clusters
Function K-means
End function
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
81
Parallel K-NN algorithm
Input: T = {t1, ..., tn} // Training Data points available
D = {d1, ..., dm} // Data points to be classified
k // Number of neighbours
Output: neighbours // the k nearest neigbours
From DM to BD & DS: a Computational Perspective
End function
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
82
Parallel Decision Tree
From DM to BD & DS: a Computational Perspective
Max = - infinite
Foreach candidate attribute Parallelize the loop !
Relevance = quality of split(attribute)
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
83
MapReduce
[Adapted from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
85
MapReduce Basic Programming Model
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
86
MapReduce Parallelization
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
87
MapReduce Parallel Processing Model
[Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
88
MapReduce Execution Overview
[Chart from “Big Data Course” D. Kossmann & N. Tatbul, 2012]
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
89
K-means with MapReduce
From DM to BD & DS: a Computational Perspective
(xn, ?) (xn, cl(xn)) (All cl(xi)=k, xj) (cl(xi)=k, c1= average (xi | cl(xi) = k))
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
90
From DM to BD & DS: a Computational Perspective
A look to MAHOUT
https://kemlg.upc.edu
Mahout
Mahout is an open source machine learning library from
Apache
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
92
BIG DATA TOOLS & RESOURCES
From DM to BD & DS: a Computational Perspective
https://kemlg.upc.edu
Big Data Tools
NoSQL
From DM to BD & DS: a Computational Perspective
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
94
From DM to BD & DS: a Computational Perspective
Big Data Literature (1)
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
95
From DM to BD & DS: a Computational Perspective
Big Data Literature (2)
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
96
Big Data websites
From DM to BD & DS: a Computational Perspective
http://www.DataScienceCentral.com
http://www.apache.org
http://hadoop.apache.org
http://mahout.apache.org
http://bigml.com
Big Data & Data Science, Curso de Verano, 16-19 de julio de 2013, Universidade da Santiago de Compostela
97
From DM to BD & DS: a Computational Perspective
http://kemlg.upc.edu/