
Information Bottleneck

presented by
Boris Epshtein & Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2004
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Motivation

Clustering Problem
Motivation

• “Hard” clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters

• Each cluster is represented by a centroid

Motivation

• “Good” clustering – should group similar data points together and dissimilar points apart

• Quality of partition – average distortion between the data points and the corresponding representatives (cluster centroids)
Motivation
• “Soft” clustering – each data point is assigned to all clusters with some normalized probability

• Goal – minimize expected distortion between the data points and the cluster centroids
Motivation…

Complexity-Precision Trade-off

• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting
Motivation…

Complexity-Precision Trade-off

• Too complex model → poor generalization
  – can lead to overfitting
  – is hard to learn
• Too simple model
  – cannot capture the real structure of the data

• Examples of approaches:
  – SRM – Structural Risk Minimization
  – MDL – Minimum Description Length
  – Rate Distortion Theory
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Definitions…

Entropy

• The measure of uncertainty about the random variable X:

  H(X) = – Σx p(x) log p(x)
Definitions…

Entropy - Example

– Fair coin (p = 0.5): H(X) = 1 bit

– Unfair coin (e.g. p = 0.9): H(X) ≈ 0.47 bit
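A minimal Python sketch (an addition, not part of the original slides) that reproduces these numbers, assuming base-2 logarithms:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))  # unfair coin -> ~0.47 bit
```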
Definitions…

Entropy - Illustration

[Plot: entropy of a binary variable as a function of p – highest at p = 0.5, lowest at p = 0 and p = 1]
Definitions…

Conditional Entropy

• The measure of uncertainty about the random variable X given the value of the variable Y:

  H(X|Y) = – Σy p(y) Σx p(x|y) log p(x|y)
Definitions…
Conditional Entropy
Example
Definitions…

Mutual Information

• The reduction in uncertainty of X due to the knowledge of Y:

  I(X;Y) = H(X) – H(X|Y) = Σx,y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

  – Nonnegative: I(X;Y) ≥ 0
  – Symmetric: I(X;Y) = I(Y;X)
  – Convex w.r.t. p(y|x) for a fixed p(x)
Definitions…

Mutual Information - Example
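The original example here was an image; as a stand-in, a small Python sketch computing I(X;Y) from a hypothetical joint distribution:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

# Hypothetical joint: X and Y agree 80% of the time
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))  # ~0.28 bits
```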


Definitions…

Kullback-Leibler Distance

  D_KL[p || q] = Σx p(x) log [ p(x) / q(x) ]

  (p and q are defined over the same alphabet)

• A “distance” between distributions

  – Nonnegative: D_KL[p || q] ≥ 0, with equality iff p = q
  – Asymmetric: in general D_KL[p || q] ≠ D_KL[q || p]
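A short Python sketch (an addition, assuming base-2 logs) illustrating both properties:

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl(p, q))  # ~0.53 bits
print(kl(q, p))  # ~0.74 bits -> asymmetric
```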
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Rate Distortion Theory
Introduction
• Goal: obtain compact clustering of the data with minimal expected distortion

• Distortion measure is a part of the problem setup

• The clustering and its quality depend on the choice of the distortion measure
Rate Distortion Theory
Data

• Obtain compact clustering of the data with minimal expected distortion, given a fixed set of representatives
Cover & Thomas
Rate Distortion Theory - Intuition

– Each point in its own cluster (T = X):
  zero distortion, but not compact

– All points in a single cluster:
  high distortion, but very compact
Rate Distortion Theory – Cont.
• The quality of clustering is determined by

  – Complexity, measured by I(T;X) (a.k.a. rate)

  – Distortion, measured by Ed(X,T) = Σx,t p(x) p(t|x) d(x,t)
Rate Distortion Plane
[Plot: rate I(T;X) vs. distortion Ed(X,T); the plane spans from minimal distortion to maximal compression, with D the distortion constraint]
Rate Distortion Function
• Let D be an upper-bound constraint on the expected distortion Ed(X,T)

• Higher values of D mean a more relaxed distortion constraint

  → stronger compression levels are attainable

• Given the distortion constraint, find the most compact model (with the smallest complexity I(T;X))
Rate Distortion Function
• Given
  – Set of points X with prior p(x)
  – Set of representatives T
  – Distortion measure d(x,t)
• Find
  – The most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint
• Rate Distortion Function:

  R(D) = min { I(T;X) : Ed(X,T) ≤ D }
Rate Distortion Function

  F[p(t|x)] = I(T;X) + β·Ed(X,T)

  I(T;X) – complexity term
  Ed(X,T) – distortion term
  β – Lagrange multiplier

  Minimize F!
Rate Distortion Curve

[Plot: the rate–distortion curve in the I(T;X) vs. Ed(X,T) plane, running from minimal distortion to maximal compression]
Rate Distortion Function
Minimize  F[p(t|x)] = I(T;X) + β·Ed(X,T)

Subject to  Σt p(t|x) = 1  ∀x   (normalization)

The minimum is attained when

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

where Z(x,β) = Σt p(t) exp(–β d(x,t)) normalizes the distribution
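The slides do not spell out the variational step; a sketch of it (a standard calculus-of-variations argument, added here) in LaTeX:

```latex
% Stationarity of F plus the normalization constraints:
\frac{\delta}{\delta p(t|x)}\Big[
  \underbrace{\textstyle\sum_{x,t} p(x)\,p(t|x)\log\tfrac{p(t|x)}{p(t)}}_{I(T;X)}
  + \beta\,\underbrace{\textstyle\sum_{x,t} p(x)\,p(t|x)\,d(x,t)}_{Ed(X,T)}
  + \textstyle\sum_x \lambda(x)\sum_t p(t|x)\Big] = 0
\;\Rightarrow\; \log\frac{p(t|x)}{p(t)} + \beta\,d(x,t) + \tilde\lambda(x) = 0
\;\Rightarrow\; p(t|x) = \frac{p(t)}{Z(x,\beta)}\;e^{-\beta\,d(x,t)}
```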
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

  d(x,t) and β are known; p(t) = Σx p(x) p(t|x) itself depends on p(t|x)

  → The solution is implicit
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

For a fixed t:
  when x is similar to t, d(x,t) is small →
  closer points are attached to t with higher probability
Solution - Analysis
Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Fix t; let β → 0:
  this reduces the influence of distortion
  p(t|x) → p(t) – does not depend on x
  this + maximal compression → single cluster

Fix x; let β → ∞:
  most of the conditional probability goes to the t
  with the smallest distortion
  → hard clustering
Solution - Analysis
Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Intermediate β → soft clustering, intermediate complexity

Varying β traces out the rate–distortion curve
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Blahut-Arimoto Algorithm
Input: p(x), representatives T, distortion d(x,t), β

Randomly init p(t|x), then iterate until convergence:

  p(t) = Σx p(x) p(t|x)
  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Optimize a convex function over a convex set →
the minimum is global
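A compact Python sketch of these two alternating updates (an illustration, not the authors' code; the toy problem at the bottom is hypothetical):

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200, seed=0):
    """Alternate the two Blahut-Arimoto updates.
    px: prior over X, shape (n,); d: distortion matrix d[x, t], shape (n, m)."""
    rng = np.random.default_rng(seed)
    pt_x = rng.random(d.shape)                      # random init of p(t|x)
    pt_x /= pt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = px @ pt_x                              # p(t) = sum_x p(x) p(t|x)
        w = pt * np.exp(-beta * d)                  # p(t) exp(-beta d(x,t))
        pt_x = w / w.sum(axis=1, keepdims=True)     # divide by Z(x, beta)
    return pt_x, pt

# Toy usage: 5 points on a line, 2 representatives at 0 and 4
x, t = np.arange(5.0), np.array([0.0, 4.0])
d = (x[:, None] - t[None, :]) ** 2                  # squared-error distortion
p_tx, p_t = blahut_arimoto(np.full(5, 0.2), d, beta=1.0)
```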
Blahut-Arimoto Algorithm
Advantages:
• Obtains compact clustering of the data with minimal expected distortion

• Optimal clustering given a fixed set of representatives
Blahut-Arimoto Algorithm
Drawbacks:
• Distortion measure is a part of the problem
setup
– Hard to obtain for some problems
– Equivalent to determining relevant features

• Fixed set of representatives

• Slow convergence
Rate Distortion Theory –
Additional Insights
– Another problem would be to find optimal representatives given the clustering.

– Joint optimization of clustering and representatives doesn’t have a unique solution (like EM or K-means).
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Information Bottleneck
• Copes with the drawbacks of the Rate Distortion approach

• Compress the data while preserving “important” (relevant) information

• It is often easier to define what information is important than to define a distortion measure

• Replace the distortion upper-bound constraint by a lower-bound constraint over the relevant information

Tishby, Pereira & Bialek, 1999


Information Bottleneck-Example
Given:

[Diagram: documents/words and topics, linked by the joint prior p(Word, Topic)]
Information Bottleneck-Example
Obtain:

[Diagram: a partitioning of words into clusters –
  I(Word;Cluster) measures compactness,
  I(Cluster;Topic) measures how much of I(Word;Topic) is preserved]

Information Bottleneck-Example
Extreme case 1:

  I(Word;Cluster) = 0 → very compact
  I(Cluster;Topic) = 0 → not informative
Information Bottleneck-Example
Extreme case 2:

  I(Word;Cluster) = max → not compact
  I(Cluster;Topic) = max → very informative

Minimize I(Word;Cluster) & maximize I(Cluster;Topic)
Information Bottleneck

[Diagram: words X → clusters T → topics Y; compactness on the X side, relevant information on the Y side]
Relevance Compression Curve
[Plot: relevant information I(T;Y) vs. compression I(T;X), between maximal compression and maximal relevant information; D is the relevance constraint]
Relevance Compression Function
• Let D be the minimal allowed value of the relevant information I(T;Y)

  Smaller D → more relaxed relevant-information constraint

  → stronger compression levels are attainable

• Given the relevant-information constraint, find the most compact model (with the smallest I(T;X))
Relevance Compression Function

  L[p(t|x)] = I(T;X) – β·I(T;Y)

  I(T;X) – compression term
  I(T;Y) – relevance term
  β – Lagrange multiplier

  Minimize L!
Relevance Compression Curve
[Plot: the relevance–compression curve, between maximal compression and maximal relevant information]
Relevance Compression Function
Minimize  L[p(t|x)] = I(T;X) – β·I(T;Y)

Subject to  Σt p(t|x) = 1  ∀x   (normalization)

The minimum is attained when

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

  p(y|x) and β are known; p(t) and p(y|t) depend on p(t|x)

  → The solution is implicit
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

• The KL distance emerges as the effective distortion measure from the IB principle

For a fixed t:
  when p(y|x) is similar to p(y|t), the KL distance is small →
  such points x are attached to t with higher probability

• The optimization is also over the cluster representatives p(y|t)
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

Fix t; let β → 0:
  this reduces the influence of the KL term
  p(t|x) → p(t) – does not depend on x
  this + maximal compression → single cluster

Fix x; let β → ∞:
  most of the conditional probability goes to the t
  with the smallest KL distance (hard mapping)
Relevance Compression Curve
[Plot: the relevance–compression curve; the β → ∞ (hard mapping) solutions lie at the maximal-relevant-information end]
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Iterative Optimization Algorithm (iIB)

• Input: joint prior p(x,y), number of clusters |T|, β

• Randomly init p(t|x), then iterate the three self-consistent equations:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)   [p(cluster | word)]
  p(t) = Σx p(x) p(t|x)                                      [p(cluster)]
  p(y|t) = Σx p(y|x) p(x|t)                                  [p(topic | cluster)]

Pereira, Tishby, Lee, 1993; Tishby, Pereira, Bialek, 2001
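A Python sketch of this loop (an illustration under the same notation, not the authors' implementation):

```python
import numpy as np

def iib(pxy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the three self-consistent iIB equations.
    pxy: joint distribution p(x, y) as an (n, k) array."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                          # p(x)
    py_x = pxy / px[:, None]                      # p(y|x)
    pt_x = rng.random((len(px), n_clusters))      # random init of p(t|x)
    pt_x /= pt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = px @ pt_x                                    # p(t)
        px_t = (pt_x * px[:, None]).T / pt[:, None]       # p(x|t) by Bayes' rule
        py_t = px_t @ py_x                                # p(y|t)
        # KL[p(y|x) || p(y|t)] for every (x, t) pair
        kl = (py_x[:, None, :] * np.log((py_x[:, None, :] + eps) /
                                        (py_t[None, :, :] + eps))).sum(axis=-1)
        w = pt * np.exp(-beta * kl)                       # p(t) exp(-beta KL)
        pt_x = w / w.sum(axis=1, keepdims=True)           # divide by Z(x, beta)
    return pt_x
```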


iIB simulation

• Given:
  – 300 instances of X with prior p(x)
  – Binary relevant variable Y
  – Joint prior p(x,y)

• Obtain:
  – Optimal clustering (with minimal I(T;X))
iIB simulation…

X points and their priors


iIB simulation…

Given x, p(y|x) is given by the color of the point on the map
iIB simulation…

Single Cluster – Maximal Compression


iIB simulation…

[Sequence of frames: as β increases, the single cluster progressively splits into finer and finer soft clusterings]
iIB simulation…

Hard Clustering – Maximal Relevant Information


Iterative Optimization Algorithm (iIB)
Optimize a non-convex functional over 3 convex sets
→ the minimum is local

• Analogy to K-means or EM
• “Semantic change” in the clustering solution
Iterative Optimization Algorithm (iIB)
Advantages:
• Defining a relevant variable is often easier and more intuitive than defining a distortion measure
• Finds a local minimum
Iterative Optimization Algorithm (iIB)
Drawbacks:
• Finds a local minimum (suboptimal solutions)
• Need to specify the parameters (number of clusters |T|, β)
• Slow convergence
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Deterministic Annealing-like algorithm (dIB)

• Iteratively increase the parameter β, and adapt the solution from the previous value of β to the new one
• Track the changes in the solution as the system shifts its preference from compression to relevance
• Tries to reconstruct the relevance–compression curve

Slonim, Friedman, Tishby, 2002


Deterministic Annealing-like algorithm (dIB)

Solution from the previous step: p(t|x), p(t), p(y|t)
Deterministic Annealing-like algorithm (dIB)

Duplicate each cluster and apply a small perturbation to the copies
Deterministic Annealing-like algorithm (dIB)

Apply iIB using the duplicated cluster set as initialization


Deterministic Annealing-like algorithm (dIB)

if the two copies of a cluster turn out different →
  keep the split
else
  use the old cluster
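A small Python sketch of the split step (illustrative; `iib` is the sketch from the iIB section, and the perturbation size is a made-up choice):

```python
import numpy as np

def dib_split_init(pt_x, noise=1e-2, seed=0):
    """Duplicate every cluster and perturb the copies: the initialization
    handed to iIB at the next (higher) value of beta."""
    rng = np.random.default_rng(seed)
    n, m = pt_x.shape
    doubled = np.repeat(pt_x, 2, axis=1) / 2.0      # duplicate each cluster
    doubled += noise * rng.random((n, 2 * m))       # small perturbation
    return doubled / doubled.sum(axis=1, keepdims=True)

# After running iIB from this init, compare each pair of copies (e.g. the
# KL distance between their p(y|t)); keep a split only if the copies diverged.
```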
Deterministic Annealing-like algorithm (dIB)
Illustration

[Plot: which clusters split at which values of β]
Deterministic Annealing-like algorithm (dIB)
Advantages:
• Finds a local minimum (suboptimal solutions)
• Speeds up convergence by adapting the previous solution
Deterministic Annealing-like algorithm (dIB)
Drawbacks:
• Need to specify and tune several parameters:
  – perturbation size
  – step for β (splits might be “skipped”)
  – similarity threshold for splitting
  – may need to vary the parameters during the process
• Finds a local minimum (suboptimal solutions)
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Agglomerative Algorithm (aIB)
• Find a hierarchical clustering tree in a greedy bottom-up fashion
• Results in a different tree for each β
• Each tree gives a range of clustering solutions at different resolutions

[Diagram: one tree per β – the same β, different resolutions along the tree]

Slonim & Tishby 1999
Agglomerative Algorithm (aIB)
Fix β
Start with T = X (every point in its own cluster)

For each pair (ti, tj):
  compute the merged cluster and the cost ΔL(ti, tj) of the merge

Merge the ti and tj that produce the smallest ΔL

Continue merging until a single cluster is left

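A greedy Python sketch of this loop (illustrative; uses the β → ∞ merge cost, i.e. the weighted Jensen-Shannon divergence between the two clusters' p(y|t)):

```python
import numpy as np

def merge_cost(pt, py_t, i, j, eps=1e-12):
    """(p(ti) + p(tj)) * JS divergence between p(y|ti) and p(y|tj)."""
    p = pt[i] + pt[j]
    wi, wj = pt[i] / p, pt[j] / p
    merged = wi * py_t[i] + wj * py_t[j]            # p(y | merged cluster)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return p * (wi * kl(py_t[i], merged) + wj * kl(py_t[j], merged))

def aib(px, py_x):
    """Bottom-up merging from singleton clusters; returns the merge sequence."""
    pt = list(px)                                   # start: one cluster per x
    py_t = [row.copy() for row in py_x]
    merges = []
    while len(pt) > 1:
        cost, i, j = min((merge_cost(pt, py_t, i, j), i, j)
                         for i in range(len(pt))
                         for j in range(i + 1, len(pt)))
        p = pt[i] + pt[j]
        py_t[i] = (pt[i] * py_t[i] + pt[j] * py_t[j]) / p   # merged p(y|t)
        pt[i] = p
        del pt[j], py_t[j]
        merges.append((i, j, cost))
    return merges
```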

Agglomerative Algorithm (aIB)
Agglomerative Algorithm (aIB)
Advantages:

• Non-parametric

• A full hierarchy of clusters for each β

• Simple
Agglomerative Algorithm (aIB)
Drawbacks:
• Greedy – not guaranteed to extract even locally minimal solutions along the tree
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Applications…

Unsupervised Clustering of Images

Modeling assumption:
for a fixed image, colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dim feature space (color + spatial coordinates)

Shiri Gordon et al., 2003
Applications…

Unsupervised Clustering of Images


Apply an EM procedure to estimate the mixture parameters

Mixture of Gaussians model:

  p(v) = Σj αj · N(v; μj, Σj)

Shiri Gordon et al., 2003


Applications…

Unsupervised Clustering of Images

• Assume a uniform prior p(image)
• Calculate the conditional distributions p(y | image) from the mixture model
• Apply the aIB algorithm

Shiri Gordon et al., 2003


Applications…

Unsupervised Clustering of Images

[Example results: image-clustering trees produced by aIB]

Shiri Gordon et al., 2003
Summary

• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Thank you
Blahut-Arimoto algorithm
[Diagram: alternating minimization of the distance between two convex sets of distributions, A and B]

When does it converge to the global minimum?

– when A and B are convex + some requirements on the distance measure

Csiszar & Tusnady, 1984

Blahut-Arimoto algorithm

Reformulate the objective using a distance between the two convex sets of distributions A and B

Information Bottleneck - cont’d
• Assume the Markov relation T ↔ X ↔ Y:

  – T is a compressed representation of X, thus independent of Y if X is given:

    p(t | x, y) = p(t | x)

  – Information processing inequality: I(T;Y) ≤ I(X;Y)