A function $d: S \times S \to \mathbb{R}$ is a distance (semi-metric) if:
$d(x, y) \ge 0$
$d(x, x) = 0$
$d(x, y) = d(y, x), \quad \forall x, y \in S$
and a metric if, in addition:
$d(x, z) \le d(x, y) + d(y, z), \quad \forall x, y, z \in S$ (triangle inequality)
$d(x, y) = 0 \iff x = y$
Metrics
Minkowski:
$d_k(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}$
k = 1, Manhattan:
$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
k = 2, Euclidean:
$d_2(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }$
k = $\infty$, Chebyshev:
$d_\infty(x, y) = \max_i |x_i - y_i|$
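These metrics translate directly to code. The following is a minimal sketch (function names are ours, not from the slides):

```python
import numpy as np

def minkowski(x, y, k):
    """Minkowski distance of order k between two numeric vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y) ** k) ** (1.0 / k))

def manhattan(x, y):   # k = 1
    return minkowski(x, y, 1)

def euclidean(x, y):   # k = 2
    return minkowski(x, y, 2)

def chebyshev(x, y):   # limit k -> infinity
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.max(np.abs(x - y)))
```

For the vectors (0, 0) and (3, 4) this gives 7 (Manhattan), 5 (Euclidean) and 4 (Chebyshev), illustrating how higher k shifts weight toward the largest coordinate difference.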
Metrics
Minkowski
the numeric differences of all features are equally important in taking decisions
the higher-order norms emphasize the larger attribute dissimilarities
due to different scales, one feature may dominate the others in distance calculations
Normalization within the same interval (scaling)
$x \in (a, b) \xrightarrow{\text{scaling}} y \in (A, B): \quad y = \frac{x - a}{b - a}(B - A) + A$
Normalization to zero mean and unit variance (standardization)
$x \sim D(\mu, \sigma) \xrightarrow{\text{st}} y \sim D(0, 1): \quad y = \frac{x - \mu}{\sigma}$
Normalization
(figure: scaling example)
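Both normalizations can be sketched in a few lines (a minimal sketch; function names are ours):

```python
import numpy as np

def scale(x, A=0.0, B=1.0):
    """Min-max scaling: map x from its observed range (a, b) to (A, B)."""
    x = np.asarray(x, float)
    a, b = x.min(), x.max()
    return (x - a) / (b - a) * (B - A) + A

def standardize(x):
    """Map x ~ D(mu, sigma) to y ~ D(0, 1)."""
    x = np.asarray(x, float)
    return (x - x.mean()) / x.std()
```

After `scale`, the minimum maps to A and the maximum to B; after `standardize`, the sample mean is 0 and the sample standard deviation is 1.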
$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$
$\Sigma_{ii}$ = the variance of attribute i
$\Sigma_{ij}$ = the covariance of attributes i and j
Metrics
Mahalanobis
equivalent to linearly transforming the input data and taking the Euclidean metric in the transformed space
performs feature scaling
corrects the correlation between features
clusters with elongated form are easily detected
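A direct implementation of the formula above (a minimal sketch; with the identity covariance it reduces to the Euclidean distance):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d_M(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

With an elongated covariance such as diag(4, 1), a displacement of 2 along the high-variance axis counts the same as a displacement of 1 along the low-variance axis, which is why elongated clusters are detected more easily.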
Cosine distance:
$d_{CS}(x, y) = \arccos(s_{CS}(x, y)), \quad s_{CS}(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$
Jaccard similarity
$s_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
$d_J(X, Y) = 1 - s_J(X, Y) = \frac{|X \cup Y| - |X \cap Y|}{|X \cup Y|}$
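The set formulas translate directly (a minimal sketch; the convention that two empty sets have similarity 1 is our assumption):

```python
def jaccard_similarity(X, Y):
    """|X intersect Y| / |X union Y| for two sets."""
    X, Y = set(X), set(Y)
    if not X | Y:
        return 1.0   # assumption: two empty sets are considered identical
    return len(X & Y) / len(X | Y)

def jaccard_distance(X, Y):
    return 1.0 - jaccard_similarity(X, Y)
```

For example, {1, 2, 3} and {2, 3, 4} share 2 of 4 distinct elements, so the similarity is 0.5 and the distance is 0.5.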
Metrics
Which one?
Fig. 1: k-Means clusters using Euclidean distance
Fig. 2: k-Means clusters using Manhattan and Mahalanobis distances
Metric learning
Why?
Improve the performance in clustering, classification and retrieval
tasks
Dimensionality reduction
How much prior knowledge on the data?
Unsupervised
Semi-supervised
Supervised (classification)
When?
Pre-processing: filter methods
Wrapper methods
How much generalization?
Global: whole data set
Local: one metric per cluster/class
Supervised metric learning
Parameterized metrics
$d_A(x, y) = \sqrt{(x - y)^T A (x - y)}$
$d_{CS}^{A}(x, y) = \arccos\!\left( \frac{x^T A y}{\sqrt{x^T A x} \, \sqrt{y^T A y}} \right)$
Global:
Find $A \succeq 0$ such that similar points are close to each other
Convex optimization problem
Local:
Find $A_i \succeq 0$ which minimizes the intra-cluster inertia for each cluster/class i
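The parameterized metrics can be sketched as follows (a minimal sketch; with A = I they reduce to the ordinary Euclidean distance and cosine distance, and metric learning amounts to choosing a better positive-semidefinite A):

```python
import numpy as np

def d_A(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)); A must be positive semidefinite."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ A @ d))

def d_CS_A(x, y, A):
    """Parameterized cosine distance: arccos of the A-weighted cosine similarity."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = (x @ A @ y) / np.sqrt((x @ A @ x) * (y @ A @ y))
    return float(np.arccos(np.clip(s, -1.0, 1.0)))   # clip guards rounding error
```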
Unsupervised Metric Learning
Linear:
Principal Component Analysis (PCA)
Non-linear:
Kernel PCA, MDS, ISOMAP, ANN (SOM), feature clustering
Principal Component Analysis (PCA)
(Karl Pearson - 1901)
The most used form of factor analysis
Performs dimensionality reduction
Eliminates redundancy
Identifies the factors that best preserve the variance in data
(factor = new independent, unobserved variables supposed to influence the observed
variables)
HOW?
rotates the original feature space
projects the feature vectors onto a limited number of axes
axes = the eigenvectors of the covariance matrix having the largest
eigenvalues
minimize the sum-squared error
are uncorrelated
maximize the variance retained from the original feature set
Principal Component Analysis (PCA)
Steps
1. Normalize the data to zero mean
2. Compute the variance-covariance matrix (an m×m matrix)
3. Compute its eigenvectors and eigenvalues
Example covariance matrix:
$\Sigma = \begin{pmatrix} 0.61 & 0.61 \\ 0.61 & 0.71 \end{pmatrix}$
Principal Component Analysis (PCA)
Example
Eigenvalue equation: $A x = \lambda x$
(the direction of x is not changed by the transformation A; it is only scaled by a factor of $\lambda$)
$\lambda_1 = 1.27, \quad eig_1 = \begin{pmatrix} 0.67 \\ 0.73 \end{pmatrix}$
$\lambda_2 = 0.05, \quad eig_2 = \begin{pmatrix} -0.73 \\ 0.67 \end{pmatrix}$
$\lambda_1 + \lambda_2 = \mathrm{var}(x) + \mathrm{var}(y)$
Principal Component Analysis (PCA)
Example
Project the data:
$D_f = P^T D, \quad P = (eig_1 \; eig_2) = \begin{pmatrix} 0.67 & -0.73 \\ 0.73 & 0.67 \end{pmatrix}$
$\Sigma_f = P^T \Sigma P = \begin{pmatrix} 1.27 & 0 \\ 0 & 0.05 \end{pmatrix}$
(diagonalized covariance matrix)
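The example can be reproduced with NumPy (a minimal sketch; note that the sign of each eigenvector is arbitrary, so signs may differ from the slides):

```python
import numpy as np

# Example covariance matrix from the slides
C = np.array([[0.61, 0.61],
              [0.61, 0.71]])

# eigh returns eigenvalues in ascending order for symmetric matrices
vals, vecs = np.linalg.eigh(C)
lam1, lam2 = vals[1], vals[0]   # lam1 ~ 1.27, lam2 ~ 0.05
P = vecs[:, ::-1]               # columns: eig1 (largest eigenvalue first), eig2

# Diagonalized covariance: P^T C P has the eigenvalues on the diagonal
Sigma_f = P.T @ C @ P
```

The sum of the eigenvalues equals the trace of C (total variance), and the off-diagonal entries of `Sigma_f` vanish: the projected features are uncorrelated.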
Principal components
1. The first component
accounts for a maximal amount of the total variance in the observed variables
is correlated with at least some of the observed variables
2…n. Each subsequent component accounts for a maximal amount of the remaining variance and is uncorrelated with the previous components
Principal components
How many?
The elbow method applied to eigenvalues
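Besides eyeballing the elbow in the eigenvalue plot, a common programmatic rule keeps the smallest number of leading components whose cumulative explained-variance ratio reaches a threshold (the function below is our sketch, not from the slides):

```python
import numpy as np

def n_components_for(eigenvalues, threshold=0.9):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`."""
    lam = np.sort(np.asarray(eigenvalues, float))[::-1]   # descending
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, threshold) + 1)
```

With the eigenvalues from the example (1.27 and 0.05), a single component already retains over 96% of the variance.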
Multidimensional Scaling
(Shepard 1962, Kruskal 1964)
Performs a non-linear mapping of the data onto a lower-
dimensional space preserving the dissimilarities among data
items
Is not an exact procedure but a way to "rearrange" objects in
an efficient manner, so as to arrive at a configuration that best
approximates the observed distances
Metric MDS
the dissimilarities are proportional to distances
minimize $\sum_{i<j} (d_{ij} - \delta_{ij})^2$
Non-metric MDS
dissimilarities are assumed to be merely ordinal and the rank order has to be preserved
minimize $\sum_{i<j} (d_{ij} - f(\delta_{ij}))^2$
Multidimensional Scaling
General scheme
1. Assign points to arbitrary coordinates in p-dimensional space.
2. Compute the Euclidean distances among all pairs of points, to form the D matrix.
3. Compare the D matrix with the input dissimilarity matrix Δ by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
4. Adjust the coordinates of each point in the direction that maximally reduces the stress.
5. Repeat steps 2 through 4 until the stress no longer decreases.
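Steps 1-5 can be sketched as plain gradient descent on the metric-MDS stress (a minimal sketch; the learning rate and iteration count are arbitrary choices of ours):

```python
import numpy as np

def stress(X, delta):
    """Raw stress: sum over i < j of (d_ij - delta_ij)^2."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    return float((np.triu(d - delta, 1) ** 2).sum())

def metric_mds(delta, p=2, iters=1000, lr=0.05, seed=0):
    """Minimal metric MDS: gradient descent on the raw stress.
    `delta` is an n x n symmetric dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    X = rng.standard_normal((n, p))          # step 1: arbitrary coordinates
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))     # step 2: pairwise distances
        np.fill_diagonal(d, 1.0)             # guard against division by zero
        coef = (d - delta) / d               # steps 3-4: stress gradient terms
        np.fill_diagonal(coef, 0.0)
        X -= lr * (coef[:, :, None] * diff).sum(axis=1)
    return X
```

For dissimilarities that are exactly realizable in the plane (e.g. the pairwise distances of three points of a right triangle), the recovered configuration drives the stress close to zero.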
Multidimensional Scaling
Example
Self-Organizing Maps (SOM)
Motivation
competition
cooperation
SOM
Neighborhood
SOM
Batch training
The whole dataset is presented to the map and then the
weight adaptation is performed
A weighted average over all data items is computed
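The batch update can be sketched for a 1-D map as follows (a minimal sketch; the linearly spread initialization and the geometrically shrinking Gaussian neighborhood are our assumptions, not from the slides):

```python
import numpy as np

def batch_som(X, grid=4, iters=30, sigma0=2.0):
    """Minimal 1-D batch SOM. Each epoch: find every data item's
    best-matching unit (BMU), then replace each weight vector by a
    neighborhood-weighted average over ALL data items (batch update)."""
    W = np.linspace(X.min(0), X.max(0), grid)   # spread initial weights over the data range
    pos = np.arange(grid, dtype=float)          # unit coordinates on the map
    for t in range(iters):
        sigma = sigma0 * 0.95 ** t              # slowly shrinking neighborhood (our choice)
        bmu = ((X[:, None, :] - W[None]) ** 2).sum(-1).argmin(1)
        h = np.exp(-(pos[bmu][:, None] - pos[None]) ** 2 / (2 * sigma ** 2))
        W = (h[:, :, None] * X[:, None, :]).sum(0) / h.sum(0)[:, None]
    return W
```

Because every weight is a convex combination of data items, the map always stays inside the data range; as the neighborhood shrinks, the end units migrate toward the extreme clusters.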
SOM
Example
Feature weighting
aims at numerically quantifying the contribution of each feature towards the best possible clustering result.
Feature ranking
aims at establishing a hierarchy of features, which can then serve for feature selection.
UFS- General Schemes
Unsupervised Feature Selection: Global vs. Local, Filters vs. Wrappers
Filters
independent of the clustering algorithm
evaluate subsets using intrinsic properties of the data
Wrappers
evaluate subsets based on the clustering result
Global methods: select features from the entire space
Local methods: select features from the data in each cluster
FS Filter Models
Redundancy based
mutually-dependent features should be discarded
Clustering on features (Duda)
GP for detecting the redundancy of a feature with respect to a subset of
features (Neshatian 09)
Compute the merit of each feature
pairwise dependence scores are computed:
mutual information (Pena 03)
Entropy (Dash)
Unsupervised wrapper FS
Scenario
Search for:
Optimal feature subsets
A method to search efficiently in the space of all possible
subsets of the feature space
A criterion to evaluate/compare various feature subsets
Optimal partitions
A complete (unsupervised) clustering algorithm
A method to search efficiently in the space of all possible
partitions over a given feature subset
A criterion to evaluate/compare various partitions
Main drawback: high computational cost
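The wrapper scenario can be sketched with a tiny k-Means and an exhaustive subset search (a minimal sketch of our own; dividing the inertia by the subset size is a crude stand-in for a proper bias correction). Its cost grows exponentially with the number of features, which is exactly the high computational cost noted above:

```python
import itertools
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Tiny k-Means (Lloyd's algorithm); returns the total within-cluster inertia."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
    return float(d.min(1).sum())

def wrapper_fs(X, k=2):
    """Exhaustive wrapper: score every non-empty feature subset by the
    k-Means inertia it yields, normalized by the subset size."""
    best_score, best_subset = np.inf, None
    for r in range(1, X.shape[1] + 1):
        for subset in itertools.combinations(range(X.shape[1]), r):
            score = kmeans(X[:, list(subset)], k) / r
            if score < best_score:
                best_score, best_subset = score, subset
    return best_subset
```

On data where one feature carries two well-separated clusters and a second feature is pure noise, the wrapper selects the informative feature alone.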
Feature subset evaluation
Based on the quality of the optimal partition generated with
k-Means using the given feature subset
In a completely unsupervised scenario:
Bias with regard to the number of clusters
Bias with regard to the number of features
The bias w.r.t. the number of clusters
Solutions
Multi-objective optimization: the bias introduced in the
primary objective function is counterbalanced by a second
objective function
Several clustering criteria largely unbiased with regard to the
number of clusters already exist:
Davies-Bouldin Index
Silhouette Width
The bias w.r.t. the number of features
Solutions
Multi-objective optimization
Drawback: difficult to extract the best solution from the Pareto
front
Cross-projection normalization used for pairwise
comparisons between feature sets within a greedy algorithm
Drawback:
It is not transitive, which makes its use in global optimization techniques problematic
Searching for the optimal feature subset
Existing approaches:
greedy optimizers
multi-objective GAs
ensemble methods
multi-modal search:
return the best among all local optima detected;
good partitions obtained in different feature subspaces may serve as items in ensemble clustering;
all feature subsets can be used to construct one single feature subspace;