
Interest Points

How can we find corresponding points?


Image matching

by Diva Sian

by swashford
Harder case

by Diva Sian by scgbt


Not always easy

NASA Mars Rover images


Answer below (look for tiny colored
squares)

NASA Mars Rover images


with SIFT feature matches
Figure by Noah Snavely
Image Matching
Image Matching
Invariant local features
Find features that are invariant to transformations
geometric invariance: translation, rotation, scale
photometric invariance: brightness, exposure,

Feature Descriptors
Advantages of local features
Locality
features are local, so robust to occlusion and clutter
Distinctiveness:
can differentiate a large database of objects
Quantity
hundreds or thousands in a single image
Efficiency
real-time performance achievable
Generality
exploit different types of features in different situations
More motivation
Feature points are used for:
Image alignment (e.g., mosaics)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other
Want uniqueness
Look for image regions that are unusual
Lead to unambiguous matches in other images

How to define unusual?


Human eye movements

What catches your interest?

Yarbus eye tracking


Interest points
original
Suppose you have to click on some point, go away, and come back after I deform the image, and click on the same points again.
Which points would you choose?

deformed
Intuition
Corners
We should easily recognize the point by looking through a small window.
Shifting the window in any direction should give a large change in intensity.

flat region: no change in any direction
edge: no change along the edge direction
corner: significant change in all directions
Source: A. Efros
Let's look at the gradient distributions
Principal Component Analysis
The principal component is the direction of highest variance.
The next component is the direction of highest variance orthogonal to the previous components.

How to compute PCA components (a sketch follows below):
1. Subtract off the mean for each data point.
2. Compute the covariance matrix.
3. Compute eigenvectors and eigenvalues.
4. The components are the eigenvectors ranked by the eigenvalues.
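A minimal sketch of these four steps, assuming NumPy is available (the function name is my own):

import numpy as np

def pca_components(X):
    """X: (n_points, n_dims) data matrix. Returns eigenvalues and eigenvectors ranked by eigenvalue."""
    Xc = X - X.mean(axis=0)                # 1. subtract off the mean for each data point
    C = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigenvalues / eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]      # 4. rank the components by eigenvalue
    return eigvals[order], eigvecs[:, order]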
Corners have both eigenvalues large!
Second Moment Matrix or Harris Matrix

2 x 2 matrix of image derivatives, smoothed by Gaussian weights:

H = Gaussian-weighted sum over the window of [ Ix^2  IxIy ; IxIy  Iy^2 ]

Notation: Ix = dI/dx, Iy = dI/dy, IxIy = Ix * Iy.
First compute Ix, Iy, and IxIy as 3 images; then apply the Gaussian to each.
OR, first apply the Gaussian and then compute the derivatives.
The math
To compute the eigenvalues:

1. Compute the Harris matrix over a window, typically with Gaussian weights:

   H = [ smoothed Ix^2    smoothed IxIy ]
       [ smoothed IxIy    smoothed Iy^2 ]

   What does this equation mean in practice?

2. Compute the eigenvalues of H.


Corner Response Function
Computing eigenvalues is expensive, so the Harris corner detector uses the following alternative:
R = det(H) - k * (trace(H))^2, with k typically around 0.04-0.06.
Reminder: det(H) = lambda_1 * lambda_2 and trace(H) = lambda_1 + lambda_2.
Harris detector: Steps
1. Compute derivatives Ix, Iy and IxIy at each pixel and smooth them with a Gaussian. (Or smooth first and then compute the derivatives.)
2. Compute the Harris matrix H in a window around each pixel.
3. Compute the corner response function R.
4. Threshold R.
5. Find local maxima of the response function (nonmaximum suppression).
A sketch of these steps appears below.

C. Harris and M. Stephens. A Combined Corner and Edge Detector. Proceedings of the 4th Alvey Vision Conference, pages 147-151, 1988.
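A minimal sketch of steps 1-3, assuming NumPy and SciPy's Gaussian filter (the function and parameter names are my own; k = 0.05 is only an illustrative value):

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.0, k=0.05):
    # 1. derivatives Ix, Iy and their products, smoothed with a Gaussian
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = gaussian_filter(Ix * Ix, sigma)   # smoothed Ix^2
    Syy = gaussian_filter(Iy * Iy, sigma)   # smoothed Iy^2
    Sxy = gaussian_filter(Ix * Iy, sigma)   # smoothed IxIy
    # 2.-3. the Harris matrix entries give det and trace, hence the response R
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace          # R = det(H) - k * trace(H)^2

# 4.-5. threshold R and keep only local maxima (nonmaximum suppression) to get corners.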
Harris Detector: Steps
Harris Detector: Steps
Compute corner response R
Harris Detector: Steps
Find points with large corner response: R > threshold
Harris Detector: Steps
Take only the points of local maxima of R
Harris Detector: Results
Simpler Response Function

Instead of R = det(H) - k * (trace(H))^2,
we can use f = det(H) / trace(H) = lambda_1 * lambda_2 / (lambda_1 + lambda_2), proportional to the harmonic mean of the eigenvalues.
Invariance
Suppose you rotate the image by some angle
Will you still pick up the same features?

What if you change the brightness?

Scale?
Properties of the Harris corner detector

Translation invariant? Yes
Rotation invariant? Yes
Scale invariant? No. What's the problem?
At a fine scale, all points along the corner will be classified as edges; the corner only appears at the right (coarser) scale.
Scale invariant detection
Suppose you're looking for corners

Key idea: find scale that gives local maximum of f


f is a local maximum in both position and scale
Common definition of f : Laplacian
(or difference between two Gaussian filtered images with different sigmas)
Lindeberg et al., 1996

Slide from Tinne Tuytelaars
Feature descriptors
We know how to detect good points
Next question: How to match them?

?
Feature descriptors
We know how to detect good points
Next question: How to match them?

?
Lots of possibilities (this is a popular research area)
Simple option: match square windows around the point
State of the art approach: SIFT
David Lowe, UBC http://www.cs.ubc.ca/~lowe/keypoints/
Invariance
Suppose we are comparing two images I1 and I2
I2 may be a transformed version of I1
What kinds of transformations are we likely to encounter in
practice?
Invariance
Suppose we are comparing two images I1 and I2
I2 may be a transformed version of I1
What kinds of transformations are we likely to encounter in
practice?

We'd like to find the same features regardless of the transformation.
This is called transformational invariance
Most feature methods are designed to be invariant to
translation, 2D rotation, scale
They can usually also handle
Limited 3D rotations (SIFT works up to about 60 degrees)
Limited affine transformations (some are fully affine invariant)
Limited illumination/contrast changes
How to achieve invariance
Need both of the following:
1. Make sure your detector is invariant
Harris is invariant to translation and rotation
Scale is trickier
common approach is to detect features at many scales using a
Gaussian pyramid (e.g., MOPS)
More sophisticated methods find the best scale to represent
each feature (e.g., SIFT)
2. Design an invariant feature descriptor
A descriptor captures the information in a region around the
detected feature point
The simplest descriptor: a square window of pixels
What's this invariant to?
Lets look at some better approaches
Rotation invariance for feature descriptors
Find the dominant orientation (blurred gradient) of the image patch.
This is given by x+, the eigenvector of H corresponding to lambda+ (the larger eigenvalue).
Rotate the patch according to this angle.
Rotation Invariant Frame
Scale-space position (x, y, s) + orientation (theta)
Multiscale Oriented PatcheS descriptor
Take a 40x40 square window around the detected feature.
Scale to 1/5 size (using prefiltering).
Rotate to horizontal.
Sample an 8x8 square window centered at the feature.
Intensity normalize the window by subtracting the mean and dividing by the standard deviation in the window: I' = (I - mu) / sigma.
(A rough sketch of these steps follows below.)

CSE 576: Computer Vision

Adapted from slide by Matthew Brown
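A rough sketch of the MOPS steps above, assuming OpenCV and NumPy; the exact prefiltering and sampling in the original method differ in detail, and the function name is my own:

import cv2
import numpy as np

def mops_descriptor(img, x, y, angle_deg, window=40, out=8):
    # rotate the image so the feature's dominant orientation becomes horizontal
    M = cv2.getRotationMatrix2D((float(x), float(y)), angle_deg, 1.0)
    rot = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
    # take the 40x40 window around the feature, prefilter, and scale to 8x8 (1/5 size)
    half = window // 2
    win = rot[int(y) - half:int(y) + half, int(x) - half:int(x) + half].astype(float)
    win = cv2.GaussianBlur(win, (0, 0), 2.0)
    small = cv2.resize(win, (out, out), interpolation=cv2.INTER_AREA)
    # intensity normalize: I' = (I - mean) / std
    return (small - small.mean()) / (small.std() + 1e-8)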


Detections at multiple scales
Scale

Let's look at scale first:

What is the best scale?


Scale Invariance

f(I_{i1...im}(x, sigma)) = f(I_{i1...im}(x', sigma'))

How can we independently select interest points


in each image, such that the detections are
repeatable across different scales?
K. Grauman, B. Leibe
Differences between Inside and Outside

1. We can use a Laplacian function


Scale
But we use a Gaussian. Why Gaussian?
It is invariant to scale change and has several other nice properties (Lindeberg, 1994).

In practice, the Laplacian is approximated using a Difference of Gaussian (DoG).
Difference-of-Gaussian (DoG)

G1 - G2 = DoG


K. Grauman, B. Leibe
DoG example
Take Gaussians at multiple spreads and use DoGs.
(Figure: the image blurred at two different sigma values and their difference.)
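A minimal DoG sketch, assuming SciPy (the sigma values here are only illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(img, sigma1=1.0, sigma2=1.6):
    g1 = gaussian_filter(img.astype(float), sigma1)
    g2 = gaussian_filter(img.astype(float), sigma2)
    return g1 - g2   # approximates the Laplacian of Gaussian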
Scale invariant interest points
Interest points are local maxima in both position and scale. Look for extrema in differences of Gaussians.
Apply Gaussians with different sigmas; the output is a list of (x, y, sigma).
(Figure: pyramid of images at scales sigma_1 ... sigma_5.)
Scale
In practice the image is downsampled for larger sigmas.
Lowe, 2004.
Lowe's Pyramid Scheme

s+2 filters per octave, with sigma_i = 2^(i/s) * sigma_0 for i = 1, ..., s+1 (plus the original sigma_0).
This gives s+3 images per octave (including the original) and s+2 difference images.
The parameter s determines the number of scale samples per octave.
Key point localization
s+2 difference images; the top and bottom are ignored, so s planes are searched.
Detect maxima and minima of the difference-of-Gaussian images in scale space.
(Pyramid construction per octave: resample, blur, subtract.)
Each point is compared to its 8 neighbors in the current image and 9 neighbors in each of the scales above and below (26 neighbors in total); a sketch of this check follows below.
For each max or min found, the output is the location and the scale.
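A hedged sketch of the 26-neighbor comparison, assuming NumPy; a point is kept if it is at least as large (or as small) as everything in its 3x3x3 neighborhood:

import numpy as np

def is_scale_space_extremum(dogs, d, y, x):
    """dogs: list of same-size DoG images; d indexes the current plane."""
    val = dogs[d][y, x]
    cube = np.stack([dogs[d - 1], dogs[d], dogs[d + 1]])[:, y - 1:y + 2, x - 1:x + 2]
    # keep the point if it is the largest or the smallest value in its 3x3x3 neighborhood
    return val >= cube.max() or val <= cube.min()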
Scale-space extrema detection: experimental results over 32 images that were synthetically transformed with noise added.
(Plots: % of keypoints detected / correctly matched and average number detected / matched, as a function of scale sampling - a stability vs. expense trade-off.)
Sampling in scale for efficiency
How many scales should be used per octave? S = ?
The more scales evaluated, the more keypoints are found, but:
S < 3: the number of stable keypoints is still increasing
S > 3: the number of stable keypoints decreases
S = 3: maximum number of stable keypoints
Results: Difference-of-Gaussian

K. Grauman, B. Leibe
How can we find correspondences?

Similarity
Orientation Normalization
Compute orientation histogram
Select dominant orientation [Lowe, SIFT, 1999]
Normalize: rotate to fixed orientation

(Histogram over orientations from 0 to 2*pi.)
T. Tuytelaars, B. Leibe
What's next?

Once we have found the keypoints and a dominant orientation for each, we need to describe the (rotated and scaled) neighborhood about each: a 128-dimensional vector.
Important Point

People just say SIFT.

But there are TWO parts to SIFT.

1. an interest point detector

2. a region descriptor

They are independent. Many people use the


region descriptor without looking for the
points.
Patch Descriptors

65
How can we find corresponding points?
How can we find correspondences?
How do we describe an image patch?
How do we describe an image patch?

Patches with similar content should have similar descriptors.


Raw patches as local descriptors

The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector.
But this is very sensitive to even small shifts and rotations.

70
SIFT descriptor
Full version
Divide the 16x16 window (8x8 case shown below) into a 4x4 grid of cells (2x2
case shown below)
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor

71
Adapted from slide by David Lowe
SIFT descriptor
Full version
Divide the 16x16 window into a 4x4 grid of cells
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor

(Figure: 16 cell histograms of 8 bins each: 8 + 8 + ... + 8 = 128.)

72
Numeric Example

0.37 0.79 0.97 0.98

0.08 0.45 0.79 0.97

0.04 0.31 0.73 0.91

0.45 0.75 0.90 0.98

73
by Yao Lu
Gradient magnitude and orientation are computed from the smoothed image L. For a pixel (x, y) with neighbors L(x-1,y-1) ... L(x+1,y+1):

magnitude(x, y) = sqrt( (L(x+1,y) - L(x-1,y))^2 + (L(x,y+1) - L(x,y-1))^2 )
theta(x, y) = atan( (L(x,y+1) - L(x,y-1)) / (L(x+1,y) - L(x-1,y)) )

by Yao Lu
Orientations in each of the 16 pixels of the cell: they all ended up in two bins, 11 in one bin and 5 in the other (rough count), giving the histogram 5 11 0 0 0 0 0 0.
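A minimal sketch of the per-cell orientation histogram, assuming NumPy (the function name and the magnitude-weighted voting are my own simplification; the full SIFT scheme also applies Gaussian weighting and trilinear interpolation):

import numpy as np

def cell_orientation_histogram(L, bins=8):
    """L: smoothed patch for one cell, with a 1-pixel border around it."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]            # L(x+1,y) - L(x-1,y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]            # L(x,y+1) - L(x,y-1)
    mag = np.sqrt(dx ** 2 + dy ** 2)
    ang = np.arctan2(dy, dx) % (2 * np.pi)     # orientation in [0, 2*pi)
    idx = (ang / (2 * np.pi) * bins).astype(int) % bins
    hist = np.zeros(bins)
    np.add.at(hist, idx, mag)                  # magnitude-weighted votes into 8 bins
    return hist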
SIFT descriptor
Full version
Start with a 16x16 window (256 pixels)
Divide the 16x16 window into a 4x4 grid of cells (16 cells)
Compute an orientation histogram for each cell
16 cells * 8 orientations = 128 dimensional descriptor
Threshold normalize the descriptor: normalize it to unit length, clamp each entry so that it is at most 0.2, and renormalize (a sketch follows below).

Adapted from slide by David Lowe
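A small sketch of that normalization, assuming NumPy (the 0.2 cap comes from the slide; the function name is my own):

import numpy as np

def threshold_normalize(desc, cap=0.2):
    d = desc / (np.linalg.norm(desc) + 1e-8)   # normalize to unit length
    d = np.minimum(d, cap)                     # clamp each entry at 0.2
    return d / (np.linalg.norm(d) + 1e-8)      # renormalize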
Properties of SIFT
Extraordinarily robust matching technique
Can handle changes in viewpoint
Up to about 30 degree out of plane rotation
Can handle significant changes in illumination
Sometimes even day vs. night (below)
Fast and efficient: can run in real time
Various code available
http://www.cs.ubc.ca/~lowe/keypoints/

77
Example

NASA Mars Rover images


with SIFT feature matches 78
Figure by Noah Snavely
Example: Object Recognition

SIFT is extremely powerful for object instance


recognition, especially for well-textured objects

79
Lowe, IJCV04
Example: Google Goggles

80
How do we build a panorama?
We need to match (align) images

81
Matching with Features
Detect feature points in both images

82
Matching with Features
Detect feature points in both images
Find corresponding pairs

83
Matching with Features
Detect feature points in both images
Find corresponding pairs
Use these matching pairs to align images - the
required mapping is called a homography.

84
Automatic mosaicing

85
Recognition of specific objects, scenes

Schmid and Mohr, 1997; Sivic and Zisserman, 2003; Rothganger et al., 2003; Lowe, 2002.
Slide credit: Kristen Grauman
Example: 3D Reconstructions
Photosynth (also called Photo Tourism) developed at UW
by Noah Snavely, Steve Seitz, Rick Szeliski and others
http://www.youtube.com/watch?v=p16frKJLVi0

Building Rome in a day, developed at UW by Sameer


Agarwal, Noah Snavely, Steve Seitz and others
http://www.youtube.com/watch?v=kxtQqYLRaSQ&feature=player_embedded

87
When does the SIFT descriptor fail?

Patches SIFT thought were the same but aren't:

88
Other methods: Daisy
Circular gradient binning

(Figure: SIFT's square gradient binning vs. Daisy's circular binning.)
Picking the Best DAISY, S. Winder, G. Hua, M. Brown, CVPR 2009
Other methods: SURF
For computational efficiency, only compute the gradient histogram with 4 bins:

SURF: Speeded Up Robust Features


Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, ECCV 2006
90
Other methods: BRIEF
Randomly sample pair of pixels a and b.
1 if a > b, else 0. Store binary vector. 011000111000...

(Figure: a random pixel pair (a, b) inside the patch.)

BRIEF: Binary Robust Independent Elementary Features, M. Calonder, V. Lepetit, C. Strecha, ECCV 2010. A sketch of the idea follows below.
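A hedged sketch of the BRIEF idea, assuming NumPy; the real descriptor uses a particular smoothing and sampling pattern, so this is only an illustration:

import numpy as np

# Fixed random test pattern: the same pixel pairs are compared in every patch.
_rng = np.random.default_rng(0)
N_BITS, PATCH = 256, 32
_a = _rng.integers(0, PATCH * PATCH, N_BITS)
_b = _rng.integers(0, PATCH * PATCH, N_BITS)

def brief_descriptor(patch):
    """patch: a PATCH x PATCH image window around the keypoint."""
    flat = patch.ravel()
    return (flat[_a] > flat[_b]).astype(np.uint8)   # 1 if a > b, else 0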
Descriptors and Matching
The SIFT descriptor and the various variants are
used to describe an image patch, so that we can
match two image patches.

In addition to the descriptors, we need a distance measure to calculate how different two patches are.

?
Feature distance
How to define the difference between two features f1, f2?
Simple approach: SSD(f1, f2), the sum of squared differences between the entries of the two descriptors:

SSD(f1, f2) = sum_i (f1_i - f2_i)^2

But it can give good scores to very ambiguous (bad) matches.
(Figure: feature f1 in image I1 and its match f2 in image I2.)
Feature distance in practice
How to define the difference between two features f1, f2?
Better approach: ratio distance = SSD(f1, f2) / SSD(f1, f2')
  f2 is the best SSD match to f1 in I2
  f2' is the 2nd-best SSD match to f1 in I2
This gives large values (~1) for ambiguous matches. WHY? Because an ambiguous feature matches its best and second-best candidates almost equally well. (A sketch of the ratio test follows below.)
(Figure: f1 in I1 with candidate matches f2' and f2 in I2.)
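A minimal sketch of SSD matching with the ratio test, assuming NumPy (the 0.8 cutoff is only an illustrative threshold):

import numpy as np

def match_ratio_test(desc1, desc2, max_ratio=0.8):
    """desc1: (N, 128), desc2: (M, 128). Returns (i, j) pairs that pass the ratio test."""
    matches = []
    for i, f1 in enumerate(desc1):
        ssd = np.sum((desc2 - f1) ** 2, axis=1)    # SSD to every feature in I2
        j, j2 = np.argsort(ssd)[:2]                # best and second-best match
        if ssd[j] / (ssd[j2] + 1e-8) < max_ratio:  # ratio near 1 means ambiguous: reject
            matches.append((i, int(j)))
    return matches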
Eliminating more bad matches

(Figure: example feature distances 50, 75, 200 along the feature-distance axis; 50 is a true match, 200 is a false match.)

Throw out features with distance > threshold


How to choose the threshold?

95
True/false positives

(Figure: the same example feature distances 50, 75, 200, with 50 a true match and 200 a false match.)

The distance threshold affects performance


True positives = # of detected matches that are correct
  Suppose we want to maximize these; how should we choose the threshold?
False positives = # of detected matches that are incorrect
  Suppose we want to minimize these; how should we choose the threshold?
Confusion matrix (predicted class / observation vs. actual class / expectation):
  TP (true positive): correct result
  FP (false positive): unexpected result
  FN (false negative): missing result
  TN (true negative): correct absence of result

TN (true negative): case was negative and predicted negative
TP (true positive): case was positive and predicted positive
FN (false negative): case was positive but predicted negative
FP (false positive): case was negative but predicted positive

Recall (true positive rate) = TP / (TP + FN)
True negative rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
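The four quantities written out as a small helper (names are my own):

def rates(tp, fp, fn, tn):
    recall = tp / (tp + fn)              # true positive rate
    tnr = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, tnr, precision, accuracy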
Evaluating the results
How can we measure the performance of a feature matcher?

True positive rate (recall) = # true positives / # matching features (positives) = TP / (TP + FN)
False positive rate = # false positives / # unmatched features (negatives) = FP / (FP + TN)

(Plot: true positive rate vs. false positive rate; e.g. an operating point at a false positive rate of 0.1 and a true positive rate of 0.7.)
Evaluating the results
How can we measure the performance of a feature matcher?
ROC curve (Receiver Operating Characteristic): sweep the matching threshold and plot the
true positive rate = # true positives / # matching features (positives)
against the
false positive rate = # false positives / # unmatched features (negatives).
ROC Curves
Generated by counting # correct/incorrect matches for different thresholds
Want to maximize area under the curve (AUC)
Useful for comparing different feature matching methods
For more info: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
More on feature detection/description
Lots of applications
Features are used for:
Image alignment (e.g., mosaics)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other
Object recognition (David Lowe)
Sony Aibo

SIFT usage:

Recognize
charging
station

Communicate
with visual
cards

Teach object
recognition
Other kinds of descriptors

There are descriptors for other purposes


Describing shapes
Describing textures
Describing features for image classification
Describing features for a code book

104
Local Descriptors: Shape Context

Count the number of points inside


each bin, e.g.:

Count = ?

...
Count = ?

Log-polar binning: more


precision for nearby points,
more flexibility for farther
points.

105
Belongie & Malik, ICCV 2001
K. Grauman, B. Leibe
Texture
The texture features of a patch can be considered a
descriptor.
E.g., the LBP histogram is a texture descriptor for a patch.
(Varma & Zisserman, 2002, 2003)


Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

107
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 108


http://chir.ag/phernalia/preztags/
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 109


http://chir.ag/phernalia/preztags/
Bag-of-words models
Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud 110


http://chir.ag/phernalia/preztags/
What is a bag-of-words representation?
For a text document
Have a dictionary of non-common words
Count the occurrence of each word in that document
Make a histogram of the counts
Normalize the histogram by dividing each count by
the sum of all the counts
The histogram is the representation.

apple worm tree dog joint leaf grass bush fence


111
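A tiny sketch of those steps for a text document, using the example dictionary above (the sample sentence is made up):

from collections import Counter

def bag_of_words(doc, dictionary):
    words = doc.lower().split()
    counts = Counter(w for w in words if w in dictionary)   # count only dictionary words
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in dictionary]          # normalized histogram

hist = bag_of_words("the dog ran past the tree and along the fence",
                    ["apple", "worm", "tree", "dog", "joint", "leaf", "grass", "bush", "fence"])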
Bags of features for image classification
1. Extract features

112
Bags of features for image classification
1. Extract features
2. Learn visual vocabulary

113
Bags of features for image classification
1. Extract features
2. Learn visual vocabulary
3. Quantize features using visual vocabulary

114
Bags of features for image classification

1. Extract features
2. Learn visual vocabulary
3. Quantize features using visual vocabulary
4. Represent images by frequencies of
visual words

115
A possible texture representation

histogram

Universal texton dictionary

Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002
1. Feature extraction

Regular grid: every grid square is a feature


Vogel & Schiele, 2003
Fei-Fei & Perona, 2005
Interest point detector: the
region around each point
Csurka et al. 2004
Fei-Fei & Perona, 2005
Sivic et al. 2005

117
1. Feature extraction
1. Detect patches [Mikolajczyk and Schmid 02; Matas, Chum, Urban & Pajdla 02; Sivic & Zisserman 03]
2. Normalize the patch
3. Compute the SIFT descriptor [Lowe 99]

118
Slide credit: Josef Sivic
1. Feature extraction

Lots of feature descriptors


for the whole image or set
of images.

119
2. Discovering the visual vocabulary

feature vector space

What is the dimensionality?


128D for SIFT

120
2. Discovering the visual vocabulary

Clustering

121
Slide credit: Josef Sivic
2. Discovering the visual vocabulary
Visual vocabulary

Clustering

122
Slide credit: Josef Sivic
Viewpoint invariant description (Sivic)
Two types of viewpoint covariant regions computed
for each frame
Shape Adapted (SA) Mikolajczyk & Schmid
Maximally Stable (MSER) Matas et al.
Detect different kinds of image areas
Provide complementary representations of the frame
Computed at twice originally detected region size to
be more discriminating

123
Examples of Harris-Affine Operator

(Shape Adapted Regions)

124
Examples of Maximally Stable Regions

125
Maximally Stable Extremal Regions
J. Matas et al. Distinguished Regions for Wide-baseline Stereo. BMVC 2002.

Maximally Stable Extremal Regions


Threshold image intensities: I > thresh
for several increasing values of thresh
Extract connected components
(Extremal Regions)
Find a threshold when region is Maximally
Stable, i.e. local minimum
of the relative growth
Approximate each region with
an ellipse
If we are shown a sequence of thresholded images I_t, with frame t corresponding to threshold t, we would first see a white image; then 'black' spots corresponding to local intensity minima appear and grow larger.
These 'black' spots eventually merge, until the whole image is black.
The set of all connected components in the sequence is the set of all extremal regions. In that sense, the concept of MSER is linked to that of the component tree of the image. (A usage sketch with OpenCV follows below.)
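OpenCV ships an MSER detector; a minimal usage sketch (the input filename is hypothetical and parameters are left at their defaults):

import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
mser = cv2.MSER_create()                                # default parameters
regions, bboxes = mser.detectRegions(gray)              # pixel lists + bounding boxes
# approximate each sufficiently large region with an ellipse
ellipses = [cv2.fitEllipse(r.reshape(-1, 1, 2)) for r in regions if len(r) >= 5]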
Feature Descriptor

Each region represented by 128 dimensional vector


using SIFT descriptor

128
Noise Removal
Tracking region over 70 frames (must track
over at least 3)

129
Visual Vocabulary for Sivic's Work

Implementation: K-Means clustering

Regions tracked through contiguous frames and average description


computed

10% of tracks with highest variance eliminated, leaving about 1000


regions per frame

Subset of 48 shots (~10%) selected for clustering

Distance function: Mahalanobis

6000 SA clusters and 10000 MS clusters

130
Visual Vocabulary

Shape-Adapted

Maximally Stable

131
Sivic's Experiments on Video Shot Retrieval
Goal: match scene
locations within closed
world of shots
Data: 164 frames from
48 shots taken at 19
different 3D locations;
4-9 frames from each
location

132
Experiments - Results

Precision = # relevant frames retrieved / total # of frames retrieved
Recall = # correctly retrieved frames / # relevant frames
More Pictorial Results

134
Clustering and vector quantization
Clustering is a common method for learning a visual
vocabulary or codebook
Each cluster center produced by k-means becomes a
codevector
Codebook can be learned on separate training set
The codebook is used for quantizing features
A vector quantizer takes a feature vector and maps it to the
index of the nearest code vector in a codebook
Codebook = visual vocabulary
Code vector = visual word
(Figure: each feature vector maps to the index of its nearest code vector: code vector 1, 2, 3, ...)
135
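A minimal sketch of vector quantization against a learned codebook, assuming NumPy (function names are my own):

import numpy as np

def quantize(features, codebook):
    """features: (N, D), codebook: (K, D). Returns the visual-word index of each feature."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(features, codebook):
    words = quantize(features, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-8)   # normalized histogram of visual words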
Another example visual vocabulary

136
Fei-Fei et al. 2005
Example codebook

Appearance codebook
137
Source: B. Leibe
Another codebook

Appearance codebook

138
Source: B. Leibe
Visual vocabularies: Issues

How to choose vocabulary size?


Too small: visual words not representative of all patches
Too large: quantization artifacts,
overfitting
Computational efficiency
Vocabulary trees
(Nister & Stewenius, 2006)

139
3. Image representation: histogram of codewords
(Figure: histogram with frequency on the y-axis and codewords on the x-axis.)
140
Image classification
Given the bag-of-features representations of images from
different classes, learn a classifier using machine learning

141
But what about layout?

All of these images have the same color histogram


142
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0
143
Lazebnik, Schmid & Ponce (CVPR 2006)
Spatial pyramid representation
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0 level 1
144
Lazebnik, Schmid & Ponce (CVPR 2006)
Extension of a bag of features
Locally orderless representation at several levels of resolution

level 0 level 1 level 2


145
Lazebnik, Schmid & Ponce (CVPR 2006)
Finale
Describing images or image patches is very
important for matching and recognition
The SIFT descriptor was invented in 1999 and is still
very heavily used.
Other descriptors are also available, some much
simpler, but less powerful.
Texture and shape descriptors are also useful.
Bag-of-words is a handy technique borrowed from
text retrieval. Lots of people use it to compare images
or regions.
Sivic developed a video frame retrieval system using
this method, called it Video Google.
The spatial pyramid allows us to describe an image
as a whole and over its parts at multiple levels. 146
Acknowledgement
Most Lecture Slides adapted from

Rick Szeliski, Microsoft


Steve Seitz, U. of Washington
Linda Shapiro, U. of Washington
