
Information Bottleneck

presented by
Boris Epshtein & Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2004
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Motivation

Clustering Problem
Motivation

• “Hard” clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters

• Each cluster is represented by a centroid

Motivation

• “Good” clustering – should group similar data points together and dissimilar points apart

• Quality of partition – average distortion between the data points and the corresponding representatives (cluster centroids)
Motivation
• “Soft” clustering – each data point is assigned to all clusters with some normalized probability

• Goal – minimize expected distortion between the data points and the cluster centroids
Motivation…

Complexity-Precision Trade-off

• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting
Motivation…

Complexity-Precision Trade-off

• Too complex model → poor generalization
  – can lead to overfitting
  – is hard to learn
• Too simple model
  – cannot capture the real structure of the data

• Examples of approaches:
  – SRM – Structural Risk Minimization
  – MDL – Minimum Description Length
  – Rate Distortion Theory
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Definitions…

Entropy

• The measure of uncertainty about the random variable X:

  H(X) = – Σx p(x) log p(x)
Definitions…

Entropy - Example

– Fair coin (p = 0.5): H(X) = 1 bit

– Unfair coin (e.g. p = 0.9): H(X) ≈ 0.47 bit
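A minimal Python sketch (an addition, not part of the original slides) that reproduces these numbers, assuming base-2 logarithms:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))  # unfair coin -> ~0.47 bit
```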
Definitions…

Entropy - Illustration

[Plot: entropy of a binary variable as a function of p – highest at p = 0.5, lowest at p = 0 and p = 1]
Definitions…

Conditional Entropy

• The measure of uncertainty about the random variable X given the value of the variable Y:

  H(X|Y) = – Σy p(y) Σx p(x|y) log p(x|y)
Definitions…
Conditional Entropy
Example
Definitions…

Mutual Information

• The reduction in uncertainty of X due to the knowledge of Y:

  I(X;Y) = H(X) – H(X|Y) = Σx,y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

  – Nonnegative: I(X;Y) ≥ 0
  – Symmetric: I(X;Y) = I(Y;X)
  – Convex w.r.t. p(y|x) for a fixed p(x)
Definitions…

Mutual Information - Example
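The original example here was an image; as a stand-in, a small Python sketch computing I(X;Y) from a hypothetical joint distribution:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

# Hypothetical joint: X and Y agree 80% of the time
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))  # ~0.28 bits
```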


Definitions…

Kullback-Leibler Distance

  D_KL[p || q] = Σx p(x) log [ p(x) / q(x) ]

  (p and q are defined over the same alphabet)

• A “distance” between distributions

  – Nonnegative: D_KL[p || q] ≥ 0, with equality iff p = q
  – Asymmetric: in general D_KL[p || q] ≠ D_KL[q || p]
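A short Python sketch (an addition, assuming base-2 logs) illustrating both properties:

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl(p, q))  # ~0.53 bits
print(kl(q, p))  # ~0.74 bits -> asymmetric
```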
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Rate Distortion Theory
Introduction
• Goal: obtain compact clustering of the data with minimal expected distortion

• Distortion measure is a part of the problem setup

• The clustering and its quality depend on the choice of the distortion measure
Rate Distortion Theory
Data

• Obtain compact clustering of the data with minimal expected distortion, given a fixed set of representatives
Cover & Thomas
Rate Distortion Theory - Intuition

– Each point in its own cluster (T = X):
  zero distortion, but not compact

– All points in a single cluster:
  high distortion, but very compact
Rate Distortion Theory – Cont.
• The quality of clustering is determined by

  – Complexity, measured by I(T;X) (a.k.a. rate)

  – Distortion, measured by Ed(X,T) = Σx,t p(x) p(t|x) d(x,t)
Rate Distortion Plane
[Plot: rate I(T;X) vs. distortion Ed(X,T); the plane spans from minimal distortion to maximal compression, with D the distortion constraint]
Rate Distortion Function
• Let D be an upper-bound constraint on the expected distortion Ed(X,T)

• Higher values of D mean a more relaxed distortion constraint

  → stronger compression levels are attainable

• Given the distortion constraint, find the most compact model (with the smallest complexity I(T;X))
Rate Distortion Function
• Given
  – Set of points X with prior p(x)
  – Set of representatives T
  – Distortion measure d(x,t)
• Find
  – The most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint
• Rate Distortion Function:

  R(D) = min { I(T;X) : Ed(X,T) ≤ D }
Rate Distortion Function

  F[p(t|x)] = I(T;X) + β·Ed(X,T)

  I(T;X) – complexity term
  Ed(X,T) – distortion term
  β – Lagrange multiplier

  Minimize F!
Rate Distortion Curve

[Plot: the rate–distortion curve in the I(T;X) vs. Ed(X,T) plane, running from minimal distortion to maximal compression]
Rate Distortion Function
Minimize  F[p(t|x)] = I(T;X) + β·Ed(X,T)

Subject to  Σt p(t|x) = 1  ∀x   (normalization)

The minimum is attained when

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

where Z(x,β) = Σt p(t) exp(–β d(x,t)) normalizes the distribution
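The slides do not spell out the variational step; a sketch of it (a standard calculus-of-variations argument, added here) in LaTeX:

```latex
% Stationarity of F plus the normalization constraints:
\frac{\delta}{\delta p(t|x)}\Big[
  \underbrace{\textstyle\sum_{x,t} p(x)\,p(t|x)\log\tfrac{p(t|x)}{p(t)}}_{I(T;X)}
  + \beta\,\underbrace{\textstyle\sum_{x,t} p(x)\,p(t|x)\,d(x,t)}_{Ed(X,T)}
  + \textstyle\sum_x \lambda(x)\sum_t p(t|x)\Big] = 0
\;\Rightarrow\; \log\frac{p(t|x)}{p(t)} + \beta\,d(x,t) + \tilde\lambda(x) = 0
\;\Rightarrow\; p(t|x) = \frac{p(t)}{Z(x,\beta)}\;e^{-\beta\,d(x,t)}
```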
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

  d(x,t) and β are known; p(t) = Σx p(x) p(t|x) itself depends on p(t|x)

  → The solution is implicit
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

For a fixed t:
  when x is similar to t, d(x,t) is small →
  closer points are attached to t with higher probability
Solution - Analysis
Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Fix t; let β → 0:
  this reduces the influence of distortion
  p(t|x) → p(t) – does not depend on x
  this + maximal compression → single cluster

Fix x; let β → ∞:
  most of the conditional probability goes to the t
  with the smallest distortion
  → hard clustering
Solution - Analysis
Solution:

  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Intermediate β → soft clustering, intermediate complexity

Varying β traces out the rate–distortion curve
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Blahut-Arimoto Algorithm
Input: p(x), representatives T, distortion d(x,t), β

Randomly init p(t|x), then iterate until convergence:

  p(t) = Σx p(x) p(t|x)
  p(t|x) = p(t) · exp(–β d(x,t)) / Z(x,β)

Optimize a convex function over a convex set →
the minimum is global
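A compact Python sketch of these two alternating updates (an illustration, not the authors' code; the toy problem at the bottom is hypothetical):

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200, seed=0):
    """Alternate the two Blahut-Arimoto updates.
    px: prior over X, shape (n,); d: distortion matrix d[x, t], shape (n, m)."""
    rng = np.random.default_rng(seed)
    pt_x = rng.random(d.shape)                      # random init of p(t|x)
    pt_x /= pt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = px @ pt_x                              # p(t) = sum_x p(x) p(t|x)
        w = pt * np.exp(-beta * d)                  # p(t) exp(-beta d(x,t))
        pt_x = w / w.sum(axis=1, keepdims=True)     # divide by Z(x, beta)
    return pt_x, pt

# Toy usage: 5 points on a line, 2 representatives at 0 and 4
x, t = np.arange(5.0), np.array([0.0, 4.0])
d = (x[:, None] - t[None, :]) ** 2                  # squared-error distortion
p_tx, p_t = blahut_arimoto(np.full(5, 0.2), d, beta=1.0)
```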
Blahut-Arimoto Algorithm
Advantages:
• Obtains compact clustering of the data with minimal expected distortion

• Optimal clustering given a fixed set of representatives
Blahut-Arimoto Algorithm
Drawbacks:
• Distortion measure is a part of the problem
setup
– Hard to obtain for some problems
– Equivalent to determining relevant features

• Fixed set of representatives

• Slow convergence
Rate Distortion Theory –
Additional Insights
– Another problem would be to find optimal representatives given the clustering.

– Joint optimization of clustering and representatives doesn’t have a unique solution (like EM or K-means).
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Information Bottleneck
• Copes with the drawbacks of the Rate Distortion approach

• Compress the data while preserving “important” (relevant) information

• It is often easier to define what information is important than to define a distortion measure

• Replace the distortion upper-bound constraint by a lower-bound constraint over the relevant information

Tishby, Pereira & Bialek, 1999


Information Bottleneck-Example
Given:

[Diagram: documents/words and topics, linked by the joint prior p(Word, Topic)]
Information Bottleneck-Example
Obtain:

[Diagram: a partitioning of words into clusters –
  I(Word;Cluster) measures compactness,
  I(Cluster;Topic) measures how much of I(Word;Topic) is preserved]

Information Bottleneck-Example
Extreme case 1:

  I(Word;Cluster) = 0 → very compact
  I(Cluster;Topic) = 0 → not informative
Information Bottleneck-Example
Extreme case 2:

  I(Word;Cluster) = max → not compact
  I(Cluster;Topic) = max → very informative

Minimize I(Word;Cluster) & maximize I(Cluster;Topic)
Information Bottleneck

[Diagram: words X → clusters T → topics Y; compactness on the X side, relevant information on the Y side]
Relevance Compression Curve
[Plot: relevant information I(T;Y) vs. compression I(T;X), between maximal compression and maximal relevant information; D is the relevance constraint]
Relevance Compression Function
• Let D be the minimal allowed value of the relevant information I(T;Y)

  Smaller D → more relaxed relevant-information constraint

  → stronger compression levels are attainable

• Given the relevant-information constraint, find the most compact model (with the smallest I(T;X))
Relevance Compression Function

  L[p(t|x)] = I(T;X) – β·I(T;Y)

  I(T;X) – compression term
  I(T;Y) – relevance term
  β – Lagrange multiplier

  Minimize L!
Relevance Compression Curve
[Plot: the relevance–compression curve, between maximal compression and maximal relevant information]
Relevance Compression Function
Minimize  L[p(t|x)] = I(T;X) – β·I(T;Y)

Subject to  Σt p(t|x) = 1  ∀x   (normalization)

The minimum is attained when

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

  p(y|x) and β are known; p(t) and p(y|t) depend on p(t|x)

  → The solution is implicit
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

• The KL distance emerges as the effective distortion measure from the IB principle

For a fixed t:
  when p(y|x) is similar to p(y|t), the KL distance is small →
  such points x are attached to t with higher probability

• The optimization is also over the cluster representatives p(y|t)
Solution - Analysis

Solution:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)

Fix t; let β → 0:
  this reduces the influence of the KL term
  p(t|x) → p(t) – does not depend on x
  this + maximal compression → single cluster

Fix x; let β → ∞:
  most of the conditional probability goes to the t
  with the smallest KL distance (hard mapping)
Relevance Compression Curve
[Plot: the relevance–compression curve; the β → ∞ (hard mapping) solutions lie at the maximal-relevant-information end]
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Iterative Optimization Algorithm (iIB)

• Input: joint prior p(x,y), number of clusters |T|, β

• Randomly init p(t|x), then iterate the three self-consistent equations:

  p(t|x) = p(t) · exp(–β·D_KL[p(y|x) || p(y|t)]) / Z(x,β)   [p(cluster | word)]
  p(t) = Σx p(x) p(t|x)                                      [p(cluster)]
  p(y|t) = Σx p(y|x) p(x|t)                                  [p(topic | cluster)]

Pereira, Tishby, Lee, 1993; Tishby, Pereira, Bialek, 2001
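A Python sketch of this loop (an illustration under the same notation, not the authors' implementation):

```python
import numpy as np

def iib(pxy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the three self-consistent iIB equations.
    pxy: joint distribution p(x, y) as an (n, k) array."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                          # p(x)
    py_x = pxy / px[:, None]                      # p(y|x)
    pt_x = rng.random((len(px), n_clusters))      # random init of p(t|x)
    pt_x /= pt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = px @ pt_x                                    # p(t)
        px_t = (pt_x * px[:, None]).T / pt[:, None]       # p(x|t) by Bayes' rule
        py_t = px_t @ py_x                                # p(y|t)
        # KL[p(y|x) || p(y|t)] for every (x, t) pair
        kl = (py_x[:, None, :] * np.log((py_x[:, None, :] + eps) /
                                        (py_t[None, :, :] + eps))).sum(axis=-1)
        w = pt * np.exp(-beta * kl)                       # p(t) exp(-beta KL)
        pt_x = w / w.sum(axis=1, keepdims=True)           # divide by Z(x, beta)
    return pt_x
```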


iIB simulation

• Given:
  – 300 instances of X with prior p(x)
  – Binary relevant variable Y
  – Joint prior p(x,y)

• Obtain:
  – Optimal clustering (with minimal I(T;X))
iIB simulation…

X points and their priors


iIB simulation…

Given x, p(y|x) is given by the color of the point on the map
iIB simulation…

Single Cluster – Maximal Compression


iIB simulation…

[Sequence of frames: as β increases, the single cluster progressively splits into finer and finer soft clusterings]
iIB simulation…

Hard Clustering – Maximal Relevant Information


Iterative Optimization Algorithm (iIB)
Optimize a non-convex functional over 3 convex sets
→ the minimum is local

• Analogy to K-means or EM
• “Semantic change” in the clustering solution
Iterative Optimization Algorithm (iIB)
Advantages:
• Defining a relevant variable is often easier and more intuitive than defining a distortion measure
• Finds a local minimum
Iterative Optimization Algorithm (iIB)
Drawbacks:
• Finds a local minimum (suboptimal solutions)
• Need to specify the parameters (number of clusters |T|, β)
• Slow convergence
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Deterministic Annealing-like algorithm (dIB)

• Iteratively increase the parameter β, and adapt the solution from the previous value of β to the new one
• Track the changes in the solution as the system shifts its preference from compression to relevance
• Tries to reconstruct the relevance–compression curve

Slonim, Friedman, Tishby, 2002


Deterministic Annealing-like algorithm (dIB)

Solution from the previous step: p(t|x), p(t), p(y|t)
Deterministic Annealing-like algorithm (dIB)

Duplicate each cluster and apply a small perturbation to the copies
Deterministic Annealing-like algorithm (dIB)

Apply iIB using the duplicated cluster set as initialization


Deterministic Annealing-like algorithm (dIB)

if the two copies of a cluster turn out different →
  keep the split
else
  use the old cluster
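A small Python sketch of the split step (illustrative; `iib` is the sketch from the iIB section, and the perturbation size is a made-up choice):

```python
import numpy as np

def dib_split_init(pt_x, noise=1e-2, seed=0):
    """Duplicate every cluster and perturb the copies: the initialization
    handed to iIB at the next (higher) value of beta."""
    rng = np.random.default_rng(seed)
    n, m = pt_x.shape
    doubled = np.repeat(pt_x, 2, axis=1) / 2.0      # duplicate each cluster
    doubled += noise * rng.random((n, 2 * m))       # small perturbation
    return doubled / doubled.sum(axis=1, keepdims=True)

# After running iIB from this init, compare each pair of copies (e.g. the
# KL distance between their p(y|t)); keep a split only if the copies diverged.
```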
Deterministic Annealing-like algorithm (dIB)
Illustration

[Plot: which clusters split at which values of β]
Deterministic Annealing-like algorithm (dIB)
Advantages:
• Finds a local minimum (suboptimal solutions)
• Speeds up convergence by adapting the previous solution
Deterministic Annealing-like algorithm (dIB)
Drawbacks:
• Need to specify and tune several parameters:
  – perturbation size
  – step for β (splits might be “skipped”)
  – similarity threshold for splitting
  – may need to vary the parameters during the process
• Finds a local minimum (suboptimal solutions)
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Agglomerative Algorithm (aIB)
• Find a hierarchical clustering tree in a greedy bottom-up fashion
• Results in a different tree for each β
• Each tree gives a range of clustering solutions at different resolutions

[Diagram: one tree per β – the same β, different resolutions along the tree]

Slonim & Tishby 1999
Agglomerative Algorithm (aIB)
Fix β
Start with T = X (every point in its own cluster)

For each pair (ti, tj):
  compute the merged cluster and the cost ΔL(ti, tj) of the merge

Merge the ti and tj that produce the smallest ΔL

Continue merging until a single cluster is left

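A greedy Python sketch of this loop (illustrative; uses the β → ∞ merge cost, i.e. the weighted Jensen-Shannon divergence between the two clusters' p(y|t)):

```python
import numpy as np

def merge_cost(pt, py_t, i, j, eps=1e-12):
    """(p(ti) + p(tj)) * JS divergence between p(y|ti) and p(y|tj)."""
    p = pt[i] + pt[j]
    wi, wj = pt[i] / p, pt[j] / p
    merged = wi * py_t[i] + wj * py_t[j]            # p(y | merged cluster)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return p * (wi * kl(py_t[i], merged) + wj * kl(py_t[j], merged))

def aib(px, py_x):
    """Bottom-up merging from singleton clusters; returns the merge sequence."""
    pt = list(px)                                   # start: one cluster per x
    py_t = [row.copy() for row in py_x]
    merges = []
    while len(pt) > 1:
        cost, i, j = min((merge_cost(pt, py_t, i, j), i, j)
                         for i in range(len(pt))
                         for j in range(i + 1, len(pt)))
        p = pt[i] + pt[j]
        py_t[i] = (pt[i] * py_t[i] + pt[j] * py_t[j]) / p   # merged p(y|t)
        pt[i] = p
        del pt[j], py_t[j]
        merges.append((i, j, cost))
    return merges
```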

Agglomerative Algorithm (aIB)
Agglomerative Algorithm (aIB)
Advantages:

• Non-parametric

• A full hierarchy of clusters for each β

• Simple
Agglomerative Algorithm (aIB)
Drawbacks:
• Greedy – not guaranteed to extract even locally minimal solutions along the tree
• A large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Applications…

Unsupervised Clustering of Images

Modeling assumption:
for a fixed image, colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dim feature space (color + spatial coordinates)

Shiri Gordon et al., 2003
Applications…

Unsupervised Clustering of Images


Apply an EM procedure to estimate the mixture parameters

Mixture of Gaussians model:

  p(v) = Σj αj · N(v; μj, Σj)

Shiri Gordon et al., 2003


Applications…

Unsupervised Clustering of Images

• Assume a uniform prior p(image)
• Calculate the conditional distributions p(y | image) from the mixture model
• Apply the aIB algorithm

Shiri Gordon et al., 2003


Applications…

Unsupervised Clustering of Images

[Example results: image-clustering trees produced by aIB]

Shiri Gordon et al., 2003
Summary

• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Thank you
Blahut-Arimoto algorithm
[Diagram: alternating minimization of the distance between two convex sets of distributions, A and B]

When does it converge to the global minimum?

– when A and B are convex + some requirements on the distance measure

Csiszar & Tusnady, 1984

Blahut-Arimoto algorithm

Reformulate the objective using a distance between the two convex sets of distributions A and B

Information Bottleneck - cont’d
• Assume the Markov relation T ↔ X ↔ Y:

  – T is a compressed representation of X, thus independent of Y if X is given:

    p(t | x, y) = p(t | x)

  – Information processing inequality: I(T;Y) ≤ I(X;Y)