
Data Mining:

Unsupervised Metric Learning


MOC
FII 2010
Cluster Analysis
Dissimilarities and metrics in Data
Mining tasks
Clustering definition: group objects based on
similarity/dissimilarity
Classification: algorithms like kNN

Given a set of objects S (typically S ⊆ ℝ^n), define a dissimilarity as a function $d: S \times S \to \mathbb{R}$ such that:

$d(x, y) \ge 0$
$d(x, x) = 0$   (distance / semi-metric)
$d(x, y) = d(y, x), \; \forall x, y \in S$
$d(x, z) \le d(x, y) + d(y, z), \; \forall x, y, z \in S$   (metric)
$d(x, y) = 0 \Rightarrow x = y$
Metrics
Minkowski:
$d_k(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}$

k = 1, Manhattan:
$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

k = 2, Euclidean:
$d_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

k = ∞, Chebyshev:
$d_\infty(x, y) = \max_i |x_i - y_i|$
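
A minimal NumPy sketch of these special cases of the Minkowski metric (the function names and sample vectors are illustrative):

```python
import numpy as np

def minkowski(x, y, k):
    """Minkowski distance of order k between two feature vectors."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def chebyshev(x, y):
    """Limit case k -> infinity: the largest coordinate-wise difference."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
print(chebyshev(x, y))      # Chebyshev: max(3, 2, 0) = 3.0
```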
Metrics
Minkowski
all numeric feature differences are weighted equally in the distance
higher-order norms emphasize the larger attribute dissimilarities
due to different scales, one feature may dominate the others in distance calculations
Normalization within the same interval (scaling):
$x \in (a, b) \ \xrightarrow{\ \text{scaling}\ }\ y \in (A, B), \qquad y = \frac{x - a}{b - a}\,(B - A) + A$

Normalization to zero mean and unit variance (standardization):
$x \sim D(\mu, \sigma) \ \xrightarrow{\ \text{st}\ }\ y \sim D(0, 1), \qquad y = \frac{x - \mu}{\sigma}$
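
A small NumPy sketch of both normalization schemes, assuming the range, mean and standard deviation are estimated from the data itself (sample values are illustrative):

```python
import numpy as np

def min_max_scale(x, A=0.0, B=1.0):
    """Rescale values from their observed range (a, b) to the target interval (A, B)."""
    a, b = x.min(), x.max()
    return (x - a) / (b - a) * (B - A) + A

def standardize(x):
    """Shift to zero mean and scale to unit variance."""
    return (x - x.mean()) / x.std()

x = np.array([10.0, 20.0, 30.0, 100.0])
print(min_max_scale(x))   # values mapped into [0, 1]
print(standardize(x))     # mean 0, standard deviation 1
```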
Normalization
[Figure: negative effect of scaling on two well-separated clusters]
Metrics
Mahalanobis

$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$

$\Sigma_{ii}$ is the variance of attribute i
$\Sigma_{ij}$ is the covariance of attributes i and j
Metrics
Mahalanobis
equivalent to linearly transforming the input data and taking the Euclidean metric in the transformed space
performs feature scaling
corrects for the correlation between features
clusters with an elongated form are easily detected
the computation time grows quadratically with the number of features
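
A possible NumPy sketch of the Mahalanobis distance; the correlated toy data is illustrative and Σ is estimated from the data set itself:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d_M(x, y) = sqrt((x - y)^T inv(Sigma) (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy data with correlated, differently scaled features (illustrative values).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)
Sigma = np.cov(X, rowvar=False)           # estimated covariance of the data set

print(mahalanobis(X[0], X[1], Sigma))     # scale- and correlation-corrected distance
print(np.linalg.norm(X[0] - X[1]))        # plain Euclidean distance, for comparison
```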
Metrics
Others
Cosine similarity:
$s_{CS}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\ \sqrt{\sum_{i=1}^{n} y_i^2}}$
$d_{CS}(x, y) = \arccos(s_{CS}(x, y))$

Jaccard similarity:
$s_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
$d_J(X, Y) = 1 - s_J(X, Y) = \frac{|X \cup Y| - |X \cap Y|}{|X \cup Y|}$
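
A brief sketch of both measures; the input vectors and sets are illustrative:

```python
import numpy as np

def cosine_similarity(x, y):
    """s_CS(x, y) = <x, y> / (||x|| * ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def cosine_distance(x, y):
    """Angular distance d_CS(x, y) = arccos(s_CS(x, y))."""
    return float(np.arccos(np.clip(cosine_similarity(x, y), -1.0, 1.0)))

def jaccard_distance(X, Y):
    """d_J(X, Y) = 1 - |X ∩ Y| / |X ∪ Y| for two sets."""
    X, Y = set(X), set(Y)
    return 1.0 - len(X & Y) / len(X | Y)

print(cosine_distance(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # pi/4 ≈ 0.785
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))           # 1 - 2/4 = 0.5
```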
Metrics
Which one?

[Fig. 1: k-Means clusters using the Euclidean distance; Fig. 2: k-Means clusters using the Manhattan and Mahalanobis distances]
Metric learning
Why?
Improve the performance in clustering, classification and retrieval
tasks
Dimensionality reduction
How much prior knowledge on the data?
Unsupervised
Semi-supervised
Supervised (classification)
When?
Pre-processing: filter methods
Wrapper methods
How much generalization?
Global: the whole data set
Local: one metric per cluster/class
Supervised metric learning
Parameterized metrics
$d_A(x, y) = \sqrt{(x - y)^T A (x - y)}$

$s_{CS}^{A}(x, y) = \frac{x^T A y}{\sqrt{x^T A x}\ \sqrt{y^T A y}}$

Global:
Find $A \succeq 0$ such that similar points are close to each other
Convex optimization problem

Local:
Find $A_i \succeq 0$ which minimizes the intra-cluster inertia for each cluster/class i
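
A small illustrative sketch of the parameterized metric $d_A$: the matrix A below is hand-picked rather than learned, and the Cholesky factor shows the equivalence with a Euclidean distance in a linearly transformed space:

```python
import numpy as np

def d_A(x, y, A):
    """Parameterized (Mahalanobis-like) metric d_A(x, y) = sqrt((x - y)^T A (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))

# Illustrative positive definite matrix A (in metric learning, A would be optimized).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
L = np.linalg.cholesky(A).T     # A = L^T L, so d_A is Euclidean distance after x -> L x

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(d_A(x, y, A))                       # distance under the parameterized metric
print(np.linalg.norm(L @ x - L @ y))      # identical: Euclidean in the transformed space
```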
Unsupervised Metric Learning

Dimensionality reduction for Clustering:

Feature extraction (filter methods)
  Linear: Principal Component Analysis (PCA)
  Non-linear: Kernel PCA, MDS, ISOMAP, ANN (SOM)

Feature selection (filter + wrapper)
  Feature clustering
Principal Component Analysis (PCA)
(Karl Pearson - 1901)
The most used form of factor analysis
Performs dimensionality reduction
Eliminates redundancy
Identifies the factors that best preserve the variance in data
(factor = new independent, unobserved variables supposed to influence the observed
variables)
HOW?
rotates the original feature space
projects the feature vectors onto a limited number of axes
axes = the eigenvectors of the covariance matrix with the largest eigenvalues; these axes
  minimize the sum-squared projection error
  are uncorrelated
  maximize the variance retained from the original feature set
Principal Component Analysis (PCA)
Steps
Compute the variance-covariance matrix (an m × m matrix, where m is the number of features)
Compute the eigenvectors

Project the data


Principal Component Analysis (PCA)
Example
$\bar{x} = 1.81, \quad \bar{y} = 1.91$

$\mathrm{cov}(x, y) = \mathrm{cov}(y, x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = 0.61$

$\mathrm{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = 0.61$

$\mathrm{var}(y) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = 0.71$

$\Sigma = \begin{pmatrix} 0.61 & 0.61 \\ 0.61 & 0.71 \end{pmatrix}$   (data normalized to zero mean)
Principal Component Analysis (PCA)
Example

Eigenvalue equation:
$A x = \lambda x$   (the direction of x is not changed by the transformation A, it is only scaled by a factor of $\lambda$)

$\lambda_1 = 1.27, \quad \mathrm{eig}_1 = \begin{pmatrix} 0.67 \\ 0.73 \end{pmatrix} \qquad \lambda_2 = 0.05, \quad \mathrm{eig}_2 = \begin{pmatrix} -0.73 \\ 0.67 \end{pmatrix}$

$\lambda_1 + \lambda_2 = \mathrm{var}(x) + \mathrm{var}(y)$
Principal Component Analysis (PCA)
Example
Project the data:
$D_f = P^T D, \qquad P = \begin{pmatrix} \mathrm{eig}_1 & \mathrm{eig}_2 \end{pmatrix} = \begin{pmatrix} 0.67 & -0.73 \\ 0.73 & 0.67 \end{pmatrix}$

$\Sigma_f = \begin{pmatrix} 1.27 & 0 \\ 0 & 0.05 \end{pmatrix}$   (the diagonalized covariance matrix)
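
A short NumPy check of this example: eigendecomposing the covariance matrix above reproduces the eigenvalues and, up to sign, the eigenvectors:

```python
import numpy as np

# Covariance matrix computed in the example above.
Sigma = np.array([[0.61, 0.61],
                  [0.61, 0.71]])

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]                 # sort by decreasing eigenvalue
eigenvalues, P = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues)          # ≈ [1.27, 0.05]
print(P)                    # columns ≈ (0.67, 0.73) and (-0.73, 0.67), up to sign
print(eigenvalues.sum())    # ≈ var(x) + var(y) = 1.32

# Projecting zero-mean data D (one column per observation): D_f = P.T @ D
```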
Principal components
1. The first component
accounts for a maximal amount of total variance in the observed
variables
is correlated with at least some of the observed variables

i. The i-th component


accounts for a maximal amount of variance in the data set that was
not accounted for by the previous components
is correlated with some of the observed variables that did not display
strong correlations with previous components
is uncorrelated with the previous components

n.
Principal components
How many?
The elbow method applied to eigenvalues
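
A minimal numeric companion to the scree/elbow inspection: keep the smallest number of leading eigenvalues that retain a chosen share of the total variance (the eigenvalues and the 95% threshold below are illustrative):

```python
import numpy as np

def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading eigenvalues whose cumulative share of the
    total variance reaches the threshold (an illustrative companion to the
    visual elbow criterion)."""
    vals = np.sort(eigenvalues)[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

eigenvalues = np.array([5.0, 2.5, 0.3, 0.1, 0.1])   # illustrative eigenvalue spectrum
print(n_components_for_variance(eigenvalues))        # 3 components keep >= 95% of the variance
```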
Multidimensional Scaling
(Shepard 1962, Kruskal 1964)
Performs a non-linear mapping of the data onto a lower-
dimensional space preserving the dissimilarities among data
items
Is not an exact procedure but a way to "rearrange" objects in
an efficient manner, so as to arrive at a configuration that best
approximates the observed distances
Metric MDS
the dissimilarities are proportional to distances
minimize $\sum_{i<j} (d_{ij} - \delta_{ij})^2$

Non-metric MDS
dissimilarities are assumed to be merely ordinal and the rank order has to be preserved
minimize $\sum_{i<j} (d_{ij} - f(\delta_{ij}))^2$
Multidimensional Scaling
General scheme
1. Assign points to arbitrary coordinates in p-dimensional
space.
2. Compute Euclidean distances among all pairs of points, to form the D matrix.
3. Compare the D matrix with the input dissimilarity matrix Δ by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
4. Adjust the coordinates of each point in the direction that most reduces the stress.
5. Repeat steps 2 through 4 until the stress won't get any lower (a sketch of this scheme follows below).
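
A hedged sketch using scikit-learn's MDS, which follows this same scheme by iteratively moving the points to reduce the stress; the data and parameter choices are illustrative:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Illustrative high-dimensional data; any precomputed dissimilarity matrix would do.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
delta = pairwise_distances(X)                 # input dissimilarity matrix (Delta)

mds = MDS(n_components=2, dissimilarity="precomputed",
          metric=True, random_state=0)        # metric=False would give non-metric MDS
Y = mds.fit_transform(delta)                  # 2-D configuration found by minimizing stress

print(Y.shape)        # (50, 2)
print(mds.stress_)    # final value of the stress function
```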
Multidimensional Scaling
Example
Self-Organizing Maps (SOM)
Motivation

How can we discover semantic relationships among large amounts of information without manual labor?
How do I know where to place new data if I know nothing about the information's topology?
When I have a topic, how can I find all the information about it if I don't know where to search for it?
SOM
(Kohonen 1995)
Usage:
Clustering
Feature extraction
Data visualization
Learns a topological map:
a mapping from the high dimensional space of the data points onto
the points of a 2D grid
neighboring areas in these maps represent neighboring areas in the
input space
Inspiration - biological basis: brain maps
neighboring areas in the sensory cortex are responsible for neighboring body regions, e.g. the arm and hand regions
Often realized as an ANN
SOM
ANN
Two layers: input layer and output (map) layer
Input and output layers are completely connected
Output neurons are locally interconnected
A topology (neighborhood relation) is defined on
the output layer
SOM
The output layer:
Consists of neurons organized on a regular grid
Each neuron is represented by a d-dimensional weight vector
The neurons are connected to adjacent neurons by a
neighborhood relation
Principles for learning the map:
1. Competition
2. Cooperation
3. Synaptic Adaptation
SOM
Map topologies
SOM
Sequential training
1. Randomly initialise all weights
2. Select an input vector x = [x1, x2, x3, ..., xd]
3. Compare x with the weights mj of each neuron j to determine the winner
4. Update the winner so that it becomes more like x, together with the winner's neighbours
5. Adjust parameters: learning rate & neighbourhood function
6. Repeat from (2) until the map has converged (i.e. no
noticeable changes in the weights) or pre-defined no. of
training cycles have passed
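
A from-scratch sketch of this sequential training loop; the grid size, decay schedules and data below are illustrative choices, not the ones used in the lecture:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Sequential SOM training: pick a sample, find the winner, pull the winner
    and its grid neighbours towards the sample, shrink the learning rate and radius."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    d = X.shape[1]
    weights = rng.random((rows, cols, d))                       # 1. random initialisation
    # grid coordinates of every neuron, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for t in range(n_iter):
        x = X[rng.integers(len(X))]                             # 2. pick an input vector
        dist = np.linalg.norm(weights - x, axis=-1)             # 3. distance to every neuron
        winner = np.unravel_index(dist.argmin(), dist.shape)    #    competition: best-matching unit
        lr = lr0 * np.exp(-t / n_iter)                          # 5. decaying learning rate ...
        sigma = sigma0 * np.exp(-t / n_iter)                    #    ... and neighbourhood radius
        grid_dist = np.linalg.norm(coords - coords[winner], axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))        #    cooperation: Gaussian neighbourhood
        weights += lr * h[..., None] * (x - weights)            # 4. adaptation of the weights
    return weights

# Illustrative usage on random 3-D data (e.g. RGB colours).
X = np.random.default_rng(1).random((500, 3))
som = train_som(X)
print(som.shape)    # (10, 10, 3): one 3-D weight vector per map neuron
```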
SOM
Adaptation

competition: the winning neuron (best-matching unit) for input x is $c = \arg\min_j \lVert x - m_j \rVert$

cooperation: the winner and its neighbours are adapted towards the input,
$m_j(t+1) = m_j(t) + \alpha(t)\, h_{cj}(t)\, [x(t) - m_j(t)]$
SOM
Neighborhood
SOM
Batch training
The whole dataset is presented to the map and then the
weight adaptation is performed
A weighted average over all data items is computed
SOM
Example

Poverty map based on 39 indicators from World Bank statistics (1992)


Feature selection/weighting/ranking
Feature selection
aims at removing redundant and noisy features

Feature weighting
aims at numerically quantifying the contribution of each feature
towards the best possible clustering result.

Feature ranking
aims at establishing a hierarchy of features which can serve
further for feature selection.

UFS - General Schemes
Unsupervised Feature Selection:
  Filter Models | Wrapper Models (Global / Local) | Hybrid Schemes

Filters
  Independent of the clustering algorithm
  Evaluate subsets using an intrinsic property of the data
Wrappers
  Evaluate subsets based on the clustering result
  Global methods: select features from the entire space
  Local methods: select features from the data in each cluster
FS Filter Models
Redundancy based
mutually-dependent features should be discarded
Clustering on features (Duda)
GP for detecting the redundancy of a feature with respect to a subset of
features (Neshatian 09)
Compute the merit of each feature
pairwise dependence scores are computed:
mutual information (Pena 03)
Entropy (Dash)
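
A possible sketch of such a redundancy-based filter: estimate pairwise dependence between features with scikit-learn's mutual information estimator and discard one feature of each strongly dependent pair; the threshold and toy data are illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def redundant_features(X, threshold=0.5):
    """Return indices of features strongly dependent on an earlier (kept)
    feature, based on pairwise mutual information estimates.
    The threshold is an illustrative choice."""
    n_features = X.shape[1]
    # mi[i, j]: estimated mutual information between features i and j
    mi = np.array([mutual_info_regression(X, X[:, j], random_state=0)
                   for j in range(n_features)])
    dropped = set()
    for i in range(n_features):
        if i in dropped:
            continue
        for j in range(i + 1, n_features):
            if mi[i, j] > threshold:
                dropped.add(j)                  # j is redundant given i
    return sorted(dropped)

# Illustrative data: feature 2 is (almost) a copy of feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=300)
print(redundant_features(X))    # expected to flag feature 2
```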

Unsupervised wrapper FS
Scenario
Search for:
Optimal feature subsets
A method to search efficiently in the space of all possible
subsets of the feature space
A criterion to evaluate/compare various feature subsets
Optimal partitions
A complete (unsupervised) clustering algorithm
A method to search efficiently in the space of all possible
partitions over a given feature subset
A criterion to evaluate/compare various partitions
Main drawback: high computational cost
Feature subset evaluation
Based on the quality of the optimal partition generated with
k-Means using the given feature subset
In a completely unsupervised scenario:
Bias with regard to the number of clusters
Bias with regard to the number of features

The bias w.r.t. the number of clusters
Solutions
Multi-objective optimization: the bias introduced in the
primary objective function is counterbalanced by a second
objective function
Several clustering criteria largely unbiased with regard to the
number of clusters already exist:
Davies-Bouldin Index
Silhouette Width (a wrapper-style evaluation sketch using it follows below)
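
A hedged sketch of such a wrapper-style evaluation: each candidate feature subset is scored by the Silhouette Width of the k-means partition obtained on it; the data, candidate subsets and k below are illustrative:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

def score_subset(X, features, k=3):
    """Cluster the data restricted to `features` with k-means and return the
    Silhouette Width of the resulting partition (higher is better)."""
    Xs = X[:, features]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    return silhouette_score(Xs, labels)

# Illustrative data: 2 informative features plus 3 noise features.
X_info, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X = np.hstack([X_info, np.random.default_rng(0).normal(size=(300, 3))])

subsets = [list(c) for r in (1, 2) for c in combinations(range(X.shape[1]), r)]
best = max(subsets, key=lambda f: score_subset(X, f))
print(best)    # expected to favour the informative features (indices 0 and/or 1)
```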

The bias w.r.t. the number of features
Solutions
Multi-objective optimization
Drawback: difficult to extract the best solution from the Pareto
front
Cross-projection normalization used for pairwise
comparisons between feature sets within a greedy algorithm

Drawback:
It is not transitive, which makes its use in global optimization techniques
problematic

Searching for the optimal feature
subset
Existing approaches:
greedy optimizers
multi-objective GAs
ensemble methods
multi-modal search, whose local optima can be used in several ways:
  return the best among all local optima detected;
  good partitions obtained in different feature subspaces may serve as items in ensemble clustering;
  all feature subsets can be used to construct one single feature subspace
