
Interest Points

How can we find corresponding points?


Image matching

by Diva Sian

by swashford
Harder case

by Diva Sian by scgbt


Not always easy

NASA Mars Rover images


Answer below (look for tiny colored
squares)

NASA Mars Rover images


with SIFT feature matches
Figure by Noah Snavely
Image Matching
Image Matching
Invariant local features
Find features that are invariant to transformations
geometric invariance: translation, rotation, scale
photometric invariance: brightness, exposure,

Feature Descriptors
Advantages of local features
Locality
features are local, so robust to occlusion and clutter
Distinctiveness:
can differentiate a large database of objects
Quantity
hundreds or thousands in a single image
Efficiency
real-time performance achievable
Generality
exploit different types of features in different situations
More motivation
Feature points are used for:
Image alignment (e.g., mosaics)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other
Want uniqueness
Look for image regions that are unusual
Lead to unambiguous matches in other images

How to define unusual?


Human eye movements

What catches your interest?

Yarbus eye tracking


Interest points
original
Suppose you have to click on some point, go away, and come back after I deform the image, and click on the same points again.
Which points would you choose?

deformed
Intuition
Corners
We should easily recognize the point by looking through a small window.
Shifting the window in any direction should give a large change in intensity.

flat region: no change in any direction
edge: no change along the edge direction
corner: significant change in all directions
Source: A. Efros
Let's look at the gradient distributions
Principal Component Analysis
The principal component is the direction of highest variance.
The next component is the direction of highest variance orthogonal to the previous components.

How to compute PCA components (a sketch follows below):
1. Subtract off the mean for each data point.
2. Compute the covariance matrix.
3. Compute eigenvectors and eigenvalues.
4. The components are the eigenvectors ranked by the eigenvalues.
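A minimal sketch of these four steps, assuming NumPy is available (the function name is my own):

import numpy as np

def pca_components(X):
    """X: (n_points, n_dims) data matrix. Returns eigenvalues and eigenvectors ranked by eigenvalue."""
    Xc = X - X.mean(axis=0)                # 1. subtract off the mean for each data point
    C = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigenvalues / eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]      # 4. rank the components by eigenvalue
    return eigvals[order], eigvecs[:, order]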
Corners have both eigenvalues large!
Second Moment Matrix or Harris Matrix

2 x 2 matrix of image derivatives, smoothed by Gaussian weights:

H = Gaussian-weighted sum over the window of [ Ix^2  IxIy ; IxIy  Iy^2 ]

Notation: Ix = dI/dx, Iy = dI/dy, IxIy = Ix * Iy.
First compute Ix, Iy, and IxIy as 3 images; then apply the Gaussian to each.
OR, first apply the Gaussian and then compute the derivatives.
The math
To compute the eigenvalues:

1. Compute the Harris matrix over a window, typically with Gaussian weights:

   H = [ smoothed Ix^2    smoothed IxIy ]
       [ smoothed IxIy    smoothed Iy^2 ]

   What does this equation mean in practice?

2. Compute the eigenvalues of H.


Corner Response Function
Computing eigenvalues is expensive, so the Harris corner detector uses the following alternative:
R = det(H) - k * (trace(H))^2, with k typically around 0.04-0.06.
Reminder: det(H) = lambda_1 * lambda_2 and trace(H) = lambda_1 + lambda_2.
Harris detector: Steps
1. Compute derivatives Ix, Iy and IxIy at each pixel and smooth them with a Gaussian. (Or smooth first and then compute the derivatives.)
2. Compute the Harris matrix H in a window around each pixel.
3. Compute the corner response function R.
4. Threshold R.
5. Find local maxima of the response function (nonmaximum suppression).
A sketch of these steps appears below.

C. Harris and M. Stephens. A Combined Corner and Edge Detector. Proceedings of the 4th Alvey Vision Conference, pages 147-151, 1988.
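A minimal sketch of steps 1-3, assuming NumPy and SciPy's Gaussian filter (the function and parameter names are my own; k = 0.05 is only an illustrative value):

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.05):
    # 1. derivatives Ix, Iy and their products, smoothed with a Gaussian
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = gaussian_filter(Ix * Ix, sigma)   # smoothed Ix^2
    Syy = gaussian_filter(Iy * Iy, sigma)   # smoothed Iy^2
    Sxy = gaussian_filter(Ix * Iy, sigma)   # smoothed IxIy
    # 2.-3. the Harris matrix entries give det and trace, hence the response R
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace          # R = det(H) - k * trace(H)^2

# 4.-5. threshold R and keep only local maxima (nonmaximum suppression) to get corners.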
Harris Detector: Steps
Harris Detector: Steps
Compute corner response R
Harris Detector: Steps
Find points with large corner response: R > threshold
Harris Detector: Steps
Take only the points of local maxima of R
Harris Detector: Results
Simpler Response Function

Instead of R = det(H) - k * (trace(H))^2,
we can use f = det(H) / trace(H) = lambda_1 * lambda_2 / (lambda_1 + lambda_2), proportional to the harmonic mean of the eigenvalues.
Invariance
Suppose you rotate the image by some angle
Will you still pick up the same features?

What if you change the brightness?

Scale?
Properties of the Harris corner detector

Translation invariant? Yes
Rotation invariant? Yes
Scale invariant? No. What's the problem?
At a fine scale, all points along the corner will be classified as edges; the corner only appears at the right (coarser) scale.
Scale invariant detection
Suppose you're looking for corners

Key idea: find scale that gives local maximum of f


f is a local maximum in both position and scale
Common definition of f : Laplacian
(or difference between two Gaussian filtered images with different sigmas)
Lindeberg et al., 1996

Slide from Tinne Tuytelaars
Feature descriptors
We know how to detect good points
Next question: How to match them?

?
Feature descriptors
We know how to detect good points
Next question: How to match them?

?
Lots of possibilities (this is a popular research area)
Simple option: match square windows around the point
State of the art approach: SIFT
David Lowe, UBC http://www.cs.ubc.ca/~lowe/keypoints/
Invariance
Suppose we are comparing two images I1 and I2
I2 may be a transformed version of I1
What kinds of transformations are we likely to encounter in
practice?
Invariance
Suppose we are comparing two images I1 and I2
I2 may be a transformed version of I1
What kinds of transformations are we likely to encounter in
practice?

We'd like to find the same features regardless of the transformation.
This is called transformational invariance
Most feature methods are designed to be invariant to
translation, 2D rotation, scale
They can usually also handle
Limited 3D rotations (SIFT works up to about 60 degrees)
Limited affine transformations (some are fully affine invariant)
Limited illumination/contrast changes
How to achieve invariance
Need both of the following:
1. Make sure your detector is invariant
Harris is invariant to translation and rotation
Scale is trickier
common approach is to detect features at many scales using a
Gaussian pyramid (e.g., MOPS)
More sophisticated methods find the best scale to represent
each feature (e.g., SIFT)
2. Design an invariant feature descriptor
A descriptor captures the information in a region around the
detected feature point
The simplest descriptor: a square window of pixels
What's this invariant to?
Lets look at some better approaches
Rotation invariance for feature descriptors
Find the dominant orientation (blurred gradient) of the image patch.
This is given by x+, the eigenvector of H corresponding to lambda+ (the larger eigenvalue).
Rotate the patch according to this angle.
Rotation Invariant Frame
Scale-space position (x, y, s) + orientation (theta)
Multiscale Oriented PatcheS descriptor
Take a 40x40 square window around the detected feature.
Scale to 1/5 size (using prefiltering).
Rotate to horizontal.
Sample an 8x8 square window centered at the feature.
Intensity normalize the window by subtracting the mean and dividing by the standard deviation in the window: I' = (I - mu) / sigma.
(A rough sketch of these steps follows below.)

CSE 576: Computer Vision

Adapted from slide by Matthew Brown
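A rough sketch of the MOPS steps above, assuming OpenCV and NumPy; the exact prefiltering and sampling in the original method differ in detail, and the function name is my own:

import cv2
import numpy as np

def mops_descriptor(img, x, y, angle_deg, window=40, out=8):
    # rotate the image so the feature's dominant orientation becomes horizontal
    M = cv2.getRotationMatrix2D((float(x), float(y)), angle_deg, 1.0)
    rot = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
    # take the 40x40 window around the feature, prefilter, and scale to 8x8 (1/5 size)
    half = window // 2
    win = rot[int(y) - half:int(y) + half, int(x) - half:int(x) + half].astype(float)
    win = cv2.GaussianBlur(win, (0, 0), 2.0)
    small = cv2.resize(win, (out, out), interpolation=cv2.INTER_AREA)
    # intensity normalize: I' = (I - mean) / std
    return (small - small.mean()) / (small.std() + 1e-8)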


Detections at multiple scales
Scale

Let's look at scale first:

What is the best scale?


Scale Invariance

f(I_{i1...im}(x, sigma)) = f(I_{i1...im}(x', sigma'))

How can we independently select interest points


in each image, such that the detections are
repeatable across different scales?
K. Grauman, B. Leibe
Differences between Inside and Outside

1. We can use a Laplacian function


Scale
But we use a Gaussian. Why Gaussian?
It is invariant to scale change and has several other nice properties (Lindeberg, 1994).

In practice, the Laplacian is approximated using a Difference of Gaussian (DoG).
Difference-of-Gaussian (DoG)

G1 - G2 = DoG


K. Grauman, B. Leibe
DoG example
Take Gaussians at multiple spreads and use DoGs.
(Figure: the image blurred at two different sigma values and their difference.)
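A minimal DoG sketch, assuming SciPy (the sigma values here are only illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(img, sigma1=1.0, sigma2=1.6):
    g1 = gaussian_filter(img.astype(float), sigma1)
    g2 = gaussian_filter(img.astype(float), sigma2)
    return g1 - g2   # approximates the Laplacian of Gaussian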
Scale invariant interest points
Interest points are local maxima in both position and scale. Look for extrema in differences of Gaussians.
Apply Gaussians with different sigmas; the output is a list of (x, y, sigma).
(Figure: pyramid of images at scales sigma_1 ... sigma_5.)
Scale
In practice the image is downsampled for larger sigmas.
Lowe, 2004.
Lowe's Pyramid Scheme

s+2 filters per octave, with sigma_i = 2^(i/s) * sigma_0 for i = 1, ..., s+1 (plus the original sigma_0).
This gives s+3 images per octave (including the original) and s+2 difference images.
The parameter s determines the number of scale samples per octave.
Key point localization
s+2 difference images; the top and bottom are ignored, so s planes are searched.
Detect maxima and minima of the difference-of-Gaussian images in scale space.
(Pyramid construction per octave: resample, blur, subtract.)
Each point is compared to its 8 neighbors in the current image and 9 neighbors in each of the scales above and below (26 neighbors in total); a sketch of this check follows below.
For each max or min found, the output is the location and the scale.
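A hedged sketch of the 26-neighbor comparison, assuming NumPy; a point is kept if it is at least as large (or as small) as everything in its 3x3x3 neighborhood:

import numpy as np

def is_scale_space_extremum(dogs, d, y, x):
    """dogs: list of same-size DoG images; d indexes the current plane."""
    val = dogs[d][y, x]
    cube = np.stack([dogs[d - 1], dogs[d], dogs[d + 1]])[:, y - 1:y + 2, x - 1:x + 2]
    # keep the point if it is the largest or the smallest value in its 3x3x3 neighborhood
    return val >= cube.max() or val <= cube.min()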
Scale-space extrema detection: experimental results over 32 images that were synthetically transformed with noise added.
(Plots: % of keypoints detected / correctly matched and average number detected / matched, as a function of scale sampling - a stability vs. expense trade-off.)
Sampling in scale for efficiency
How many scales should be used per octave? S = ?
The more scales evaluated, the more keypoints are found, but:
S < 3: the number of stable keypoints is still increasing
S > 3: the number of stable keypoints decreases
S = 3: maximum number of stable keypoints
Results: Difference-of-Gaussian

K. Grauman, B. Leibe
How can we find correspondences?

Similarity
Orientation Normalization
Compute orientation histogram
Select dominant orientation [Lowe, SIFT, 1999]
Normalize: rotate to fixed orientation

(Histogram over orientations from 0 to 2*pi.)
T. Tuytelaars, B. Leibe
What's next?

Once we have found the keypoints and a dominant orientation for each, we need to describe the (rotated and scaled) neighborhood about each: a 128-dimensional vector.
Important Point

People just say SIFT.

But there are TWO parts to SIFT.

1. an interest point detector

2. a region descriptor

They are independent. Many people use the


region descriptor without looking for the
points.
Patch Descriptors

65
How can we find corresponding points?
How can we find correspondences?
How do we describe an image patch?
How do we describe an image patch?

Patches with similar content should have similar descriptors.


Raw patches as local descriptors

The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector.
But this is very sensitive to even small shifts and rotations.

70
SIFT descriptor
Full version
Divide the 16x16 window (8x8 case shown below) into a 4x4 grid of cells (2x2
case shown below)
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor

71
Adapted from slide by David Lowe
SIFT descriptor
Full version
Divide the 16x16 window into a 4x4 grid of cells
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor

(Figure: 16 cell histograms of 8 bins each: 8 + 8 + ... + 8 = 128.)

72
Numeric Example

0.37 0.79 0.97 0.98

0.08 0.45 0.79 0.97

0.04 0.31 0.73 0.91

0.45 0.75 0.90 0.98

73
by Yao Lu
Gradient magnitude and orientation are computed from the smoothed image L. For a pixel (x, y) with neighbors L(x-1,y-1) ... L(x+1,y+1):

magnitude(x, y) = sqrt( (L(x+1,y) - L(x-1,y))^2 + (L(x,y+1) - L(x,y-1))^2 )
theta(x, y) = atan( (L(x,y+1) - L(x,y-1)) / (L(x+1,y) - L(x-1,y)) )

by Yao Lu
Orientations in each of the 16 pixels of the cell: they all ended up in two bins, 11 in one bin and 5 in the other (rough count), giving the histogram 5 11 0 0 0 0 0 0.
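A minimal sketch of the per-cell orientation histogram, assuming NumPy (the function name and the magnitude-weighted voting are my own simplification; the full SIFT scheme also applies Gaussian weighting and trilinear interpolation):

import numpy as np

def cell_orientation_histogram(L, bins=8):
    """L: smoothed patch for one cell, with a 1-pixel border around it."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]            # L(x+1,y) - L(x-1,y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]            # L(x,y+1) - L(x,y-1)
    mag = np.sqrt(dx ** 2 + dy ** 2)
    ang = np.arctan2(dy, dx) % (2 * np.pi)     # orientation in [0, 2*pi)
    idx = (ang / (2 * np.pi) * bins).astype(int) % bins
    hist = np.zeros(bins)
    np.add.at(hist, idx, mag)                  # magnitude-weighted votes into 8 bins
    return hist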
SIFT descriptor
Full version
Start with a 16x16 window (256 pixels)
Divide the 16x16 window into a 4x4 grid of cells (16 cells)
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor
Threshold normalize the descriptor: normalize it to unit length, clamp each entry so that it is at most 0.2, and renormalize (a sketch follows below).

Adapted from slide by David Lowe
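A small sketch of that normalization, assuming NumPy (the 0.2 cap comes from the slide; the function name is my own):

import numpy as np

def threshold_normalize(desc, cap=0.2):
    d = desc / (np.linalg.norm(desc) + 1e-8)   # normalize to unit length
    d = np.minimum(d, cap)                     # clamp each entry at 0.2
    return d / (np.linalg.norm(d) + 1e-8)      # renormalize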
Properties of SIFT
Extraordinarily robust matching technique
Can handle changes in viewpoint
Up to about 30 degree out of plane rotation
Can handle significant changes in illumination
Sometimes even day vs. night (below)
Fast and efficient: can run in real time
Various code available
http://www.cs.ubc.ca/~lowe/keypoints/

77
Example

NASA Mars Rover images


with SIFT feature matches 78
Figure by Noah Snavely
Example: Object Recognition

SIFT is extremely powerful for object instance


recognition, especially for well-textured objects

79
Lowe, IJCV04
Example: Google Goggles

80
How do we build a panorama?
We need to match (align) images

81
Matching with Features
Detect feature points in both images

82
Matching with Features
Detect feature points in both images
Find corresponding pairs

83
Matching with Features
Detect feature points in both images
Find corresponding pairs
Use these matching pairs to align images - the
required mapping is called a homography.

84
Automatic mosaicing

85
Recognition of specific objects, scenes

Schmid and Mohr, 1997; Sivic and Zisserman, 2003; Rothganger et al., 2003; Lowe, 2002.
Slide credit: Kristen Grauman
Example: 3D Reconstructions
Photosynth (also called Photo Tourism) developed at UW
by Noah Snavely, Steve Seitz, Rick Szeliski and others
http://www.youtube.com/watch?v=p16frKJLVi0

Building Rome in a day, developed at UW by Sameer


Agarwal, Noah Snavely, Steve Seitz and others
http://www.youtube.com/watch?v=kxtQqYLRaSQ&feature=player_embedded

87
When does the SIFT descriptor fail?

Patches SIFT thought were the same but aren't:

88
Other methods: Daisy
Circular gradient binning

(Figure: SIFT's square gradient binning vs. Daisy's circular binning.)
Picking the Best DAISY, S. Winder, G. Hua, M. Brown, CVPR 2009
Other methods: SURF
For computational efficiency, only compute the gradient histogram with 4 bins:

SURF: Speeded Up Robust Features


Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, ECCV 2006
90
Other methods: BRIEF
Randomly sample pair of pixels a and b.
1 if a > b, else 0. Store binary vector. 011000111000...

(Figure: a random pixel pair (a, b) inside the patch.)

BRIEF: Binary Robust Independent Elementary Features, M. Calonder, V. Lepetit, C. Strecha, ECCV 2010. A sketch of the idea follows below.
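A hedged sketch of the BRIEF idea, assuming NumPy; the real descriptor uses a particular smoothing and sampling pattern, so this is only an illustration:

import numpy as np

# Fixed random test pattern: the same pixel pairs are compared in every patch.
_rng = np.random.default_rng(0)
N_BITS, PATCH = 256, 32
_a = _rng.integers(0, PATCH * PATCH, N_BITS)
_b = _rng.integers(0, PATCH * PATCH, N_BITS)

def brief_descriptor(patch):
    """patch: a PATCH x PATCH image window around the keypoint."""
    flat = patch.ravel()
    return (flat[_a] > flat[_b]).astype(np.uint8)   # 1 if a > b, else 0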
Descriptors and Matching
The SIFT descriptor and the various variants are
used to describe an image patch, so that we can
match two image patches.

In addition to the descriptors, we need a distance measure to calculate how different two patches are.

?
Feature distance
How to define the difference between two features f1, f2?
Simple approach: SSD(f1, f2), the sum of squared differences between the entries of the two descriptors:

SSD(f1, f2) = sum_i (f1_i - f2_i)^2

But it can give good scores to very ambiguous (bad) matches.
(Figure: feature f1 in image I1 and its match f2 in image I2.)
Feature distance in practice
How to define the difference between two features f1, f2?
Better approach: ratio distance = SSD(f1, f2) / SSD(f1, f2')
  f2 is the best SSD match to f1 in I2
  f2' is the 2nd-best SSD match to f1 in I2
This gives large values (~1) for ambiguous matches. WHY? Because an ambiguous feature matches its best and second-best candidates almost equally well. (A sketch of the ratio test follows below.)
(Figure: f1 in I1 with candidate matches f2' and f2 in I2.)
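A minimal sketch of SSD matching with the ratio test, assuming NumPy (the 0.8 cutoff is only an illustrative threshold):

import numpy as np

def match_ratio_test(desc1, desc2, max_ratio=0.8):
    """desc1: (N, 128), desc2: (M, 128). Returns (i, j) pairs that pass the ratio test."""
    matches = []
    for i, f1 in enumerate(desc1):
        ssd = np.sum((desc2 - f1) ** 2, axis=1)    # SSD to every feature in I2
        j, j2 = np.argsort(ssd)[:2]                # best and second-best match
        if ssd[j] / (ssd[j2] + 1e-8) < max_ratio:  # ratio near 1 means ambiguous: reject
            matches.append((i, int(j)))
    return matches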
Eliminating more bad matches

(Figure: example feature distances 50, 75, 200 along the feature-distance axis; 50 is a true match, 200 is a false match.)

Throw out features with distance > threshold


How to choose the threshold?

95
True/false positives

(Figure: the same example feature distances 50, 75, 200, with 50 a true match and 200 a false match.)

The distance threshold affects performance


True positives = # of detected matches that are correct
  Suppose we want to maximize these; how should we choose the threshold?
False positives = # of detected matches that are incorrect
  Suppose we want to minimize these; how should we choose the threshold?
Confusion matrix (predicted class / observation vs. actual class / expectation):
  TP (true positive): correct result
  FP (false positive): unexpected result
  FN (false negative): missing result
  TN (true negative): correct absence of result

TN (true negative): case was negative and predicted negative
TP (true positive): case was positive and predicted positive
FN (false negative): case was positive but predicted negative
FP (false positive): case was negative but predicted positive

Recall (true positive rate) = TP / (TP + FN)
True negative rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
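The four quantities written out as a small helper (names are my own):

def rates(tp, fp, fn, tn):
    recall = tp / (tp + fn)              # true positive rate
    tnr = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, tnr, precision, accuracy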
Evaluating the results
How can we measure the performance of a feature matcher?

True positive rate (recall) = # true positives / # matching features (positives) = TP / (TP + FN)
False positive rate = # false positives / # unmatched features (negatives) = FP / (FP + TN)

(Plot: true positive rate vs. false positive rate; e.g. an operating point at a false positive rate of 0.1 and a true positive rate of 0.7.)
Evaluating the results
How can we measure the performance of a feature matcher?
ROC curve (Receiver Operating Characteristic): sweep the matching threshold and plot the
true positive rate = # true positives / # matching features (positives)
against the
false positive rate = # false positives / # unmatched features (negatives).
ROC Curves
Generated by counting # correct/incorrect matches for different thresholds
Want to maximize area under the curve (AUC)
Useful for comparing different feature matching methods
For more info: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
More on feature detection/description
Lots of applications
Features are used for:
Image alignment (e.g., mosaics)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other
Object recognition (David Lowe)
Sony Aibo

SIFT usage:

Recognize
charging
station

Communicate
with visual
cards

Teach object
recognition
Other kinds of descriptors

There are descriptors for other purposes


Describing shapes
Describing textures
Describing features for image classification
Describing features for a code book

104
Local Descriptors: Shape Context

Count the number of points inside


each bin, e.g.:

Count = ?

...
Count = ?

Log-polar binning: more


precision for nearby points,
more flexibility for farther
points.

105
Belongie & Malik, ICCV 2001
K. Grauman, B. Leibe
Texture
The texture features of a patch can be considered a
descriptor.
E.g., the LBP histogram is a texture descriptor for a patch.
(Varma & Zisserman, 2002, 2003)


Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

107
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 108


http://chir.ag/phernalia/preztags/
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 109


http://chir.ag/phernalia/preztags/
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 110


http://chir.ag/phernalia/preztags/
What is a bag-of-words representation?
For a text document
Have a dictionary of non-common words
Count the occurrence of each word in that document
Make a histogram of the counts
Normalize the histogram by dividing each count by
the sum of all the counts
The histogram is the representation.

apple worm tree dog joint leaf grass bush fence


111
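A tiny sketch of those steps for a text document, using the example dictionary above (the sample sentence is made up):

from collections import Counter

def bag_of_words(doc, dictionary):
    words = doc.lower().split()
    counts = Counter(w for w in words if w in dictionary)   # count only dictionary words
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in dictionary]          # normalized histogram

hist = bag_of_words("the dog ran past the tree and along the fence",
                    ["apple", "worm", "tree", "dog", "joint", "leaf", "grass", "bush", "fence"])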
Bags of features for image classification
1. Extract features

112
Bags of features for image classification
1. Extract features
2. Learn visual vocabulary

113
Bags of features for image classification
1. Extract features
2. Learn visual vocabulary
3. Quantize features using visual vocabulary

114
Bags of features for image classification

1. Extract features
2. Learn visual vocabulary
3. Quantize features using visual vocabulary
4. Represent images by frequencies of
visual words

115
A possible texture representation

histogram

Universal texton dictionary

Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002
1. Feature extraction

Regular grid: every grid square is a feature


Vogel & Schiele, 2003
Fei-Fei & Perona, 2005
Interest point detector: the
region around each point
Csurka et al. 2004
Fei-Fei & Perona, 2005
Sivic et al. 2005

117
1. Feature extraction
1. Detect patches [Mikolajczyk and Schmid 02; Matas, Chum, Urban & Pajdla 02; Sivic & Zisserman 03]
2. Normalize the patch
3. Compute the SIFT descriptor [Lowe 99]

118
Slide credit: Josef Sivic
1. Feature extraction

Lots of feature descriptors


for the whole image or set
of images.

119
2. Discovering the visual vocabulary

feature vector space

What is the dimensionality?


128D for SIFT

120
2. Discovering the visual vocabulary

Clustering

121
Slide credit: Josef Sivic
2. Discovering the visual vocabulary
Visual vocabulary

Clustering

122
Slide credit: Josef Sivic
Viewpoint invariant description (Sivic)
Two types of viewpoint covariant regions computed
for each frame
Shape Adapted (SA) Mikolajczyk & Schmid
Maximally Stable (MSER) Matas et al.
Detect different kinds of image areas
Provide complementary representations of the frame
Computed at twice originally detected region size to
be more discriminating

123
Examples of Harris-Affine Operator

(Shape Adapted Regions)

124
Examples of Maximally Stable Regions

125
Maximally Stable Extremal Regions
J. Matas et al. Distinguished Regions for Wide-baseline Stereo. BMVC 2002.

Maximally Stable Extremal Regions


Threshold image intensities: I > thresh
for several increasing values of thresh
Extract connected components
(Extremal Regions)
Find a threshold when region is Maximally
Stable, i.e. local minimum
of the relative growth
Approximate each region with
an ellipse
If we are shown a sequence of thresholded images I_t, with frame t corresponding to threshold t, we would first see a white image; then 'black' spots corresponding to local intensity minima appear and grow larger.
These 'black' spots eventually merge, until the whole image is black.
The set of all connected components in the sequence is the set of all extremal regions. In that sense, the concept of MSER is linked to that of the component tree of the image. (A usage sketch with OpenCV follows below.)
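OpenCV ships an MSER detector; a minimal usage sketch (the input filename is hypothetical and parameters are left at their defaults):

import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
mser = cv2.MSER_create()                                # default parameters
regions, bboxes = mser.detectRegions(gray)              # pixel lists + bounding boxes
# approximate each sufficiently large region with an ellipse
ellipses = [cv2.fitEllipse(r.reshape(-1, 1, 2)) for r in regions if len(r) >= 5]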
Feature Descriptor

Each region represented by 128 dimensional vector


using SIFT descriptor

128
Noise Removal
Tracking region over 70 frames (must track
over at least 3)

129
Visual Vocabulary for Sivic's Work

Implementation: K-Means clustering

Regions tracked through contiguous frames and average description


computed

10% of tracks with highest variance eliminated, leaving about 1000


regions per frame

Subset of 48 shots (~10%) selected for clustering

Distance function: Mahalanobis

6000 SA clusters and 10000 MS clusters

130
Visual Vocabulary

Shape-Adapted

Maximally Stable

131
Sivic's Experiments on Video Shot Retrieval
Goal: match scene
locations within closed
world of shots
Data: 164 frames from
48 shots taken at 19
different 3D locations;
4-9 frames from each
location

132
Experiments - Results

Precision = # relevant frames retrieved / total # of frames retrieved
Recall = # correctly retrieved frames / # relevant frames
More Pictorial Results

134
Clustering and vector quantization
Clustering is a common method for learning a visual
vocabulary or codebook
Each cluster center produced by k-means becomes a
codevector
Codebook can be learned on separate training set
The codebook is used for quantizing features
A vector quantizer takes a feature vector and maps it to the
index of the nearest code vector in a codebook
Codebook = visual vocabulary
Code vector = visual word
(Figure: each feature vector maps to the index of its nearest code vector: code vector 1, 2, 3, ...)
135
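A minimal sketch of vector quantization against a learned codebook, assuming NumPy (function names are my own):

import numpy as np

def quantize(features, codebook):
    """features: (N, D), codebook: (K, D). Returns the visual-word index of each feature."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(features, codebook):
    words = quantize(features, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-8)   # normalized histogram of visual words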
Another example visual vocabulary

136
Fei-Fei et al. 2005
Example codebook

Appearance codebook
137
Source: B. Leibe
Another codebook

Appearance codebook

138
Source: B. Leibe
Visual vocabularies: Issues

How to choose vocabulary size?


Too small: visual words not representative of all patches
Too large: quantization artifacts,
overfitting
Computational efficiency
Vocabulary trees
(Nister & Stewenius, 2006)

139
3. Image representation: histogram of codewords
(Figure: histogram with frequency on the y-axis and codewords on the x-axis.)
140
Image classification
Given the bag-of-features representations of images from
different classes, learn a classifier using machine learning

141
But what about layout?

All of these images have the same color histogram


142
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0
143
Lazebnik, Schmid & Ponce (CVPR 2006)
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0 level 1
144
Lazebnik, Schmid & Ponce (CVPR 2006)
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0 level 1 level 2


145
Lazebnik, Schmid & Ponce (CVPR 2006)
Finale
Describing images or image patches is very
important for matching and recognition
The SIFT descriptor was invented in 1999 and is still
very heavily used.
Other descriptors are also available, some much
simpler, but less powerful.
Texture and shape descriptors are also useful.
Bag-of-words is a handy technique borrowed from
text retrieval. Lots of people use it to compare images
or regions.
Sivic developed a video frame retrieval system using
this method, called it Video Google.
The spatial pyramid allows us to describe an image
as a whole and over its parts at multiple levels. 146
Acknowledgement
Most Lecture Slides adapted from

Rick Szeliski, Microsoft


Steve Seitz, U. of Washington
Linda Shapiro, U. of Washington
