
Histogram of Oriented Gradients (HOG)



HOG is a feature descriptor for image data that is widely used in machine learning and image processing for object classification.

It is a contour-based method: the underlying idea is that an object in an image can be distinguished from the background by its contour. This contour, in turn, is described by the orientation and the magnitude (intensity) of the edge normal vectors.



Step 1: Resize



Step 2: Compute Gradients

To calculate a HOG descriptor, we need to first calculate the horizontal and vertical
gradients; after all, we want to calculate the histogram of gradients. This is easily achieved
by filtering the image with the 1-D centered kernels [-1, 0, 1] (horizontal) and its transpose (vertical).
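A sketch of this step with OpenCV (the image path is a placeholder):

import cv2
import numpy as np

# Load as grayscale float so negative gradient values survive
img = cv2.imread("person.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Horizontal and vertical gradients with the 1-D kernels [-1, 0, 1]
gx = cv2.filter2D(img, cv2.CV_32F, np.array([[-1, 0, 1]], np.float32))
gy = cv2.filter2D(img, cv2.CV_32F, np.array([[-1], [0], [1]], np.float32))

# Gradient magnitude and direction; HOG uses unsigned angles in [0, 180)
magnitude, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
angle = angle % 180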



A “cell” is a rectangular region defined by the number of pixels that belong in each cell. For
example, if we had a 128 x 128 image and defined our pixels_per_cell as 4 x 4, we would
thus have 32 x 32 = 1024 cells:



If we defined our pixels_per_cell as 32 x 32, we would have 4 x 4 = 16 total cells:
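A quick sanity check of this cell arithmetic (the helper is illustrative; square images and cells assumed):

def num_cells(image_size, pixels_per_cell):
    # cells along one axis, squared
    per_axis = image_size // pixels_per_cell
    return per_axis * per_axis

print(num_cells(128, 4))   # 32 x 32 = 1024 cells
print(num_cells(128, 32))  # 4 x 4  = 16 cells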



Step 3: Calculate Histogram of Gradients in 8×8 cells



The next step is to create a histogram of gradients in these 8×8 cells; each cell contains 8 × 8 = 64 gradient values. The histogram contains 9 bins corresponding to the angles 0, 20, 40, …, 160.

Let's first focus on the pixel encircled in blue. It has an angle (direction) of 80 degrees and a magnitude of 2, so it adds 2 to the 5th bin. The gradient at the pixel encircled in red has an angle of 10 degrees and a magnitude of 4. Since 10 degrees is halfway between 0 and 20, the vote of the pixel splits evenly between the two bins.



If the angle is greater than 160 degrees, it is between 160 and 180, and we know the angle
wraps around making 0 and 180 equivalent. So in the example below, the pixel with angle
165 degrees contributes proportionally to the 0 degree bin and the 160 degree bin.
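A minimal sketch of this voting scheme for one cell, given magnitude and angle arrays for the 8×8 cell (the function name is illustrative); it reproduces both the even split at 10 degrees and the 160-to-0 wrap-around:

import numpy as np

def cell_histogram(magnitude, angle, n_bins=9, bin_width=20):
    # Each pixel's magnitude is split proportionally between the
    # two nearest angle bins; bin 8 (160 deg) wraps back to bin 0
    hist = np.zeros(n_bins)
    for mag, ang in zip(magnitude.ravel(), angle.ravel()):
        left = int(ang // bin_width) % n_bins    # lower bin index
        frac = (ang % bin_width) / bin_width     # distance into the bin
        hist[left] += mag * (1 - frac)
        hist[(left + 1) % n_bins] += mag * frac
    return hist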



Step 4: Contrast normalization over blocks

To account for changes in illumination and contrast, we can normalize the gradient
values locally. This requires grouping the “cells” together into larger, connecting “blocks”.
It is common for these blocks to overlap, meaning that each cell contributes to the final
feature vector more than once.

Again, the blocks are rectangular; however, our units are no longer pixels: they are the cells! Dalal and Triggs report that using either 2 x 2 or 3 x 3 cells_per_block obtains reasonable accuracy in most cases.

Consider an RGB color vector [128, 64, 32]. The length of this vector is √(128² + 64² + 32²) = √21504 ≈ 146.64.

This is also called the L2 norm of the vector. Dividing each element of this vector by 146.64 gives us the normalized vector [0.87, 0.43, 0.22].

Now consider another vector whose elements are twice the value of the first vector: 2 × [128, 64, 32] = [256, 128, 64]. You can work it out yourself to see that normalizing [256, 128, 64] also results in [0.87, 0.43, 0.22]: normalization cancels a uniform scaling of the values, which is exactly the effect of a global illumination change.
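The same computation in a few lines of NumPy, including the doubled vector:

import numpy as np

v = np.array([128.0, 64.0, 32.0])
norm = np.linalg.norm(v)        # sqrt(128^2 + 64^2 + 32^2) = 146.64
print(v / norm)                 # approx [0.873 0.436 0.218]

v2 = 2 * v                      # [256, 128, 64]
print(v2 / np.linalg.norm(v2))  # identical normalized vector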



Step 5: Calculate the HOG feature vector

To calculate the final feature vector for the entire image patch (64×128 in the standard pedestrian-detection setup), the 36×1 block vectors are concatenated into one giant vector. What is the size of this vector? Let us calculate.

1. How many positions of the 16×16 blocks do we have? There are 7 horizontal and 15 vertical positions, making a total of 7 x 15 = 105 positions.

2. Each 16×16 block is represented by a 36×1 vector. So when we concatenate them all into one giant vector we obtain a 36×105 = 3780 dimensional vector.
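The same count in code, assuming the standard 64×128 window, 8×8 cells, 16×16 blocks (2×2 cells), a block stride of one cell, and 9 bins (the helper is illustrative):

def hog_length(win_w=64, win_h=128, cell=8, block_cells=2, n_bins=9):
    blocks_x = win_w // cell - (block_cells - 1)   # 8 - 1 = 7
    blocks_y = win_h // cell - (block_cells - 1)   # 16 - 1 = 15
    return blocks_x * blocks_y * block_cells * block_cells * n_bins

print(hog_length())  # 105 blocks x 36 values = 3780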

Fig. 5. Demonstration of a HOG histogram for one block.



SIFT: Motivation

 The Harris operator is not invariant to scale, and correlation is not invariant to rotation¹.
 For better image matching, Lowe’s goal was to develop an interest operator that is invariant to scale and rotation.
 Also, Lowe aimed to create a descriptor that was robust to the variations corresponding to typical viewing conditions. The descriptor is the most-used part of SIFT.

¹But Schmid and Mohr developed a rotation invariant descriptor for it in 1997.

Idea of SIFT
 Image content is transformed into local feature
coordinates that are invariant to translation, rotation,
scale, and other imaging parameters

SIFT Features

Claimed Advantages of SIFT

 Locality: features are local, so robust to occlusion and clutter (no prior segmentation)
 Distinctiveness: individual features can be matched to a large database of objects
 Quantity: many features can be generated for even small objects
 Efficiency: close to real-time performance
 Extensibility: can easily be extended to a wide range of differing feature types, with each adding robustness

Scale Invariant Feature Transform
Basic idea:
• Take a 16x16 square window around the detected feature
• Compute the edge orientation (angle of the gradient minus 90°) for each pixel
• Throw out weak edges (threshold the gradient magnitude)
• Create a histogram of the surviving edge orientations

[Figure: angle histogram over 0 to 2π]

Adapted from slide by David Lowe


SIFT descriptor
Full version
• Divide the 16x16 window into a 4x4 grid of cells (2x2 case shown below)
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor

Adapted from slide by David Lowe


SIFT Algorithm Overview
1. Constructing a scale space. This is the initial preparation. You create internal representations of the original image to ensure scale invariance. This is done by generating a "scale space".

2. LoG approximation. The Laplacian of Gaussian is great for finding interesting points (or key points) in an image, but it is computationally expensive. So we cheat and approximate it using the representation created earlier.

3. Finding keypoints. With the super fast approximation, we now try to find key points. These are maxima and minima in the Difference of Gaussian images calculated in step 2.

4. Getting rid of bad key points. Edges and low-contrast regions are bad keypoints. Eliminating these makes the algorithm efficient and robust. A technique similar to the Harris corner detector is used here.

5. Assigning an orientation to the keypoints. An orientation is calculated for each key point. Any further calculations are done relative to this orientation. This effectively cancels out the effect of orientation, making it rotation invariant.

6. Generating SIFT features. Finally, with scale and rotation invariance in place, one more representation is generated. This helps uniquely identify features. Let's say you have 50,000 features; with this representation, you can easily identify the feature you're looking for (say, a particular eye, or a sign board).
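Since the overview mentions OpenCV, here is a minimal detection sketch; cv2.SIFT_create is the entry point in recent OpenCV releases, and the image path is a placeholder:

import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# Each keypoint carries (x, y), scale, and orientation;
# each descriptor is a 128-dimensional vector
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # N, (N, 128)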
Lowe’s Scale-space Interest Points
 Laplacian of Gaussian kernel
 Scale-normalised (multiplied by σ²)
 Proposed by Lindeberg
 Scale-space detection
 Find local maxima across scale/space
 A good “blob” detector

[ T. Lindeberg IJCV 1998 ]


Lowe’s Scale-space Interest Points: Difference of Gaussians
 The Gaussian is a solution of the heat diffusion equation: ∂G/∂σ = σ∇²G
 Hence σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ), so
G(x, y, kσ) − G(x, y, σ) ≈ (k − 1) σ²∇²G
 k does not need to be very small in practice
Lowe’s Pyramid Scheme
• Scale space is separated into octaves:
• Octave 1 uses scale σ
• Octave 2 uses scale 2σ
• etc.

• In each octave, the initial image is repeatedly convolved with Gaussians to produce a set of scale-space images.

• Adjacent Gaussians are subtracted to produce the DoG images.

• After each octave, the Gaussian image is down-sampled by a factor of 2 to produce an image ¼ the size to start the next level.
You take the original image, and generate
progressively blurred out images. Then, you
resize the original image to half size. And you
generate blurred out images again. And you
keep repeating.
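A simplified sketch of this scheme (Lowe's actual implementation blurs incrementally and resamples from a specific level, so this is illustrative only; k = 2^(1/s) as in the formulas below):

import cv2
import numpy as np

def gaussian_pyramid(img, n_octaves=4, s=3, sigma0=1.6):
    # Per octave: s+3 progressively blurred images, then halve the image
    img = img.astype(np.float32)
    k = 2.0 ** (1.0 / s)
    octaves = []
    for _ in range(n_octaves):
        octaves.append([cv2.GaussianBlur(img, (0, 0), sigma0 * k**i)
                        for i in range(s + 3)])
        img = cv2.resize(img, (img.shape[1] // 2, img.shape[0] // 2))
    return octaves

def difference_of_gaussians(octaves):
    # Adjacent Gaussians subtracted: s+2 DoG images per octave
    return [[b - a for a, b in zip(blurred, blurred[1:])]
            for blurred in octaves]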

Lowe’s Pyramid Scheme

Within one octave there are s+2 filters with scales σᵢ = 2^(i/s) · σ₀ for i = 1, …, s+1 (plus the original image at σ₀). This gives s+3 images per octave, including the original, and subtracting adjacent images gives s+2 difference images.
The parameter s determines the number of images per octave.
Difference-of-Gaussians

D = (G(kσ) − G(σ)) * I = G(kσ) * I − G(σ) * I
Key point localization

 Detect maxima and minima of the difference-of-Gaussian images in scale space (the pyramid is built by blur, subtract, and resample).
 Each point is compared to its 8 neighbors in the current image and 9 neighbors each in the scales above and below.
 Of the s+2 difference images, the top and bottom ones are ignored, so s planes are searched.
 For each max or min found, the output is the location and the scale.

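A direct sketch of this 26-neighbor comparison, where dogs is the list of DoG images for one octave and (i, y, x) indexes scale, row, and column (illustrative; no sub-pixel refinement or thresholds):

import numpy as np

def is_extremum(dogs, i, y, x):
    # 3x3x3 neighborhood: 8 neighbors in the current image plus
    # 9 in each of the scales above and below (26 total)
    cube = np.stack([d[y-1:y+2, x-1:x+2] for d in dogs[i-1:i+2]])
    center = dogs[i][y, x]
    return center == cube.max() or center == cube.min()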
Scale-space extrema detection: experimental results over 32 images that were synthetically transformed and noise added.
[Plots: % of keypoints detected and % correctly matched, together with the average number detected and the average number matched.]
Stability vs. Expense
 Sampling in scale for efficiency: how many scales should be used per octave (S = ?)
 The more scales evaluated, the more keypoints are found
 For S < 3, the number of stable keypoints still increases with S
 For S > 3, the number of stable keypoints decreases
 S = 3 yields the maximum number of stable keypoints
Keypoint Localization & Filtering

 Now we have far fewer points than pixels.
 However, that is still lots of points (~1000s)…
 with only pixel accuracy at best
 and this includes many bad points

Brown & Lowe 2002

Keypoint localization

 Once a keypoint candidate is found, perform a detailed fit to nearby data to determine location, scale, and the ratio of principal curvatures.
 In initial work, keypoints were found at the location and scale of a central sample point.
 In newer work, a 3D quadratic function is fit to improve interpolation accuracy.
 The Hessian matrix is used to eliminate edge responses.
Eliminating the Edge Response

 Reject flats: |D(x̂)| < 0.03
 Reject edges: let α be the eigenvalue with larger magnitude and β the smaller, and let r = α/β, so α = rβ. Then
Tr(H)²/Det(H) = (α + β)²/(αβ) = (rβ + β)²/(rβ²) = (r + 1)²/r,
which is at a minimum when the two eigenvalues are equal. Keep the keypoint only if r < 10, i.e. Tr(H)²/Det(H) < 11²/10.
 What does this look like?
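In code, the test needs only the trace and determinant of the 2x2 Hessian, never the eigenvalues themselves (a sketch of the r < 10 criterion above):

def passes_edge_test(dxx, dyy, dxy, r=10.0):
    # Keep the keypoint iff Tr(H)^2 / Det(H) < (r + 1)^2 / r
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:          # eigenvalues have opposite signs: reject
        return False
    return tr * tr / det < (r + 1) ** 2 / r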

2. Accurate keypoint localization
• Reject points with low contrast (flat) and poorly localized along an edge (edge)
• Fit a 3D quadratic function for sub-pixel maxima

1D illustration with samples f(−1) = 1, f(0) = 6, f(+1) = 5:

f(x) ≈ f(0) + f′(0)·x + ½·f″(0)·x²
f′(0) ≈ (f(1) − f(−1))/2 = 2,  f″(0) ≈ f(1) − 2f(0) + f(−1) = −6
f(x) ≈ 6 + 2x − 3x²
f′(x) = 2 − 6x = 0  ⟹  x̂ = 1/3
f(x̂) = 6 + 2·(1/3) − 3·(1/3)² = 6 + 1/3 ≈ 6.33
2. Accurate keypoint localization
• Taylor series of several variables
• Two variables:

f(x, y) ≈ f(0, 0) + (∂f/∂x · x + ∂f/∂y · y) + ½ (∂²f/∂x² · x² + 2 ∂²f/∂x∂y · xy + ∂²f/∂y² · y²)

In matrix form, with x = [x, y]ᵀ:

f(x) ≈ f(0) + (∂f/∂x)ᵀ x + ½ xᵀ (∂²f/∂x²) x
Accurate keypoint localization
• Taylor expansion in matrix form: x is a vector, f maps x to a scalar

gradient: ∂f/∂x = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]ᵀ

Hessian matrix (often symmetric): ∂²f/∂x² is the n×n matrix whose (i, j) entry is ∂²f/∂xᵢ∂xⱼ
2D illustration

2D example (3×3 grid of sample values):
−17  −1  −1
 −9   7   7
 −9   7   7
Derivation of matrix form

For h(x) = gᵀx:
h(x) = gᵀx = Σᵢ gᵢxᵢ, so ∂h/∂x = [∂h/∂x₁, …, ∂h/∂xₙ]ᵀ = [g₁, …, gₙ]ᵀ = g

For h(x) = xᵀAx:
h(x) = xᵀAx = Σᵢ Σⱼ aᵢⱼ xᵢ xⱼ, so
∂h/∂xₖ = Σᵢ aᵢₖ xᵢ + Σⱼ aₖⱼ xⱼ, i.e. ∂h/∂x = Aᵀx + Ax = (Aᵀ + A)x

Applying these two results to the Taylor expansion (and using the symmetry of the Hessian):
∂/∂x [ f(0) + (∂f/∂x)ᵀ x + ½ xᵀ (∂²f/∂x²) x ] = ∂f/∂x + (∂²f/∂x²) x
Accurate keypoint localization
• x is a 3-vector (x, y, σ); setting the derivative to zero gives the offset x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x)
• Change the sample point if the offset is larger than 0.5
• Throw out low contrast (< 0.03)
Accurate keypoint localization
• Throw out low contrast: |D(x̂)| < 0.03

D(x̂) = D + (∂D/∂x)ᵀ x̂ + ½ x̂ᵀ (∂²D/∂x²) x̂

Since (∂²D/∂x²) x̂ = −(∂D/∂x), the quadratic term simplifies:
½ x̂ᵀ (∂²D/∂x²) x̂ = −½ x̂ᵀ (∂D/∂x) = −½ (∂D/∂x)ᵀ x̂

Hence D(x̂) = D + ½ (∂D/∂x)ᵀ x̂
Maxima in D
Remove low contrast and edges
Keypoint Orientation assignment
The idea is to collect gradient directions and magnitudes around each keypoint, figure out the most prominent orientation(s) in that region, and assign the orientation(s) to the keypoint.

The size of the "orientation collection region" around the keypoint depends on its scale: the bigger the scale, the bigger the collection region.
Orientation Assignment

 Any peak within 80% of the highest peak is used to create a keypoint with that orientation
 ~15% of keypoints are assigned multiple orientations, but these contribute significantly to the stability
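A minimal sketch of the orientation histogram, using Lowe's 36 bins of 10 degrees and the 80%-of-peak rule above (Gaussian weighting and histogram smoothing are omitted for brevity):

import numpy as np

def keypoint_orientations(magnitude, angle_deg, n_bins=36):
    # Accumulate gradient magnitudes into 10-degree orientation bins
    width = 360 // n_bins
    bins = (angle_deg.ravel() // width).astype(int) % n_bins
    hist = np.bincount(bins, weights=magnitude.ravel(), minlength=n_bins)
    # Every peak within 80% of the highest yields its own orientation
    return np.where(hist >= 0.8 * hist.max())[0] * width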
SIFT descriptor
Keypoint Orientation assignment

 Create a histogram of local gradient directions at the selected scale
 Assign the canonical orientation at the peak of the smoothed histogram
 Each key specifies stable 2D coordinates (x, y, scale, orientation)
 If there are 2 major orientations, use both.

Keypoint localization with orientation

On a 233x189 image: 832 initial keypoints; 729 keypoints remain after the gradient (contrast) threshold; 536 keypoints remain after the ratio (edge) threshold.

4. Keypoint Descriptors

 At this point, each keypoint has
 location
 scale
 orientation
 Next is to compute a descriptor for the local image region about each keypoint that is
 highly distinctive
 as invariant as possible to variations such as changes in viewpoint and illumination
Normalization

 Rotate the window to the standard orientation
 Scale the window size based on the scale at which the point was found.

SIFT Descriptor
 A 16x16 gradient window is taken and partitioned into 4x4 subwindows.
 Histogram of 4x4 samples in 8 directions
 Gaussian weighting around the center (σ is 0.5 times the scale of the keypoint)
 4x4x8 = 128 dimensional feature vector

Image from: Jonas Hurrelmann


Lowe’s Keypoint Descriptor
(shown with 4 x 4 descriptors over 16 x 16)

• In the experiments, a 4x4 array of 8-bin histograms is used, a total of 128 features for one keypoint.
• Once you have all 128 numbers, you normalize them (just like you would normalize a vector in school: divide by the root of the sum of squares). These 128 numbers form the "feature vector". The keypoint is uniquely identified by this feature vector.
Lowe’s Keypoint Descriptor

 use the normalized region about the keypoint
 compute the gradient magnitude and orientation at each point in the region
 weight them by a Gaussian window overlaid on the circle
 create an orientation histogram over the 4 x 4 subregions of the window
 4 x 4 descriptors over a 16 x 16 sample array were used in practice; 4 x 4 times 8 directions gives a vector of 128 values. ...

Using SIFT for Matching “Objects”
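A sketch of the usual matching pipeline with OpenCV: brute-force matching of the 128-D descriptors plus Lowe's ratio test (file names are placeholders):

import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor in img1, find the 2 nearest neighbors in img2
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)

# Ratio test: keep a match only if it clearly beats the runner-up
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
print(len(good), "tentative matches")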

Uses for SIFT

 Feature points are used also for:
 Image alignment (homography, fundamental matrix)
 3D reconstruction (e.g. Photo Tourism)
 Motion tracking
 Object recognition
 Indexing and database retrieval
 Robot navigation
 … many others

[ Photo Tourism: Snavely et al. SIGGRAPH 2006 ]


SURF detectors and descriptors



Interest operator

• An interest point detector finds distinctive locations in an image (corners, blobs, T-junctions)
• it should be repeatable
• An interest point descriptor is used for matching
• it should be distinctive and robust

Both should be fast



Harris features
• Recall Harris features?
(Eigenvalues of gradient covariance matrix)

• Dependent upon scale



Scale space

• Successively smooth an image...



ftp://ftp.nada.kth.se/CVAP/reports/Lin08-EncCompSci.pdf
This forms a 3D space, with scale as
the 3rd axis



http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=AF10AB3864DB87B4414F8169
Let’s look at a slice

[Figure: the signal F as a function of x (left); contours of Fxx = 0 in (x, σ) space (right)]



http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.118&rep=rep1&type=pdf
SIFT / SURF

• SIFT – scale invariant feature transform (Lowe 2004)
• SURF – speeded up robust features (Bay et al. 2006)



SURF algorithm

Interest point detector (a usage sketch follows below):
• Compute the integral image
• Apply 2nd derivative (approximate) filters to the image
• Non-maximal suppression (find local maxima in (x, y, σ) space)
• Quadratic interpolation
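These steps are implemented in OpenCV's contrib module; a usage sketch (SURF is non-free, so this needs an opencv-contrib build with nonfree enabled, and the image path is a placeholder):

import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# hessianThreshold discards weak det(H_approx) responses
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
surf.setUpright(True)  # U-SURF: skip orientation assignment

keypoints, descriptors = surf.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # N, (N, 64)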



SURF algorithm

Interest point descriptor:
• Divide the window into 4x4 (16 subwindows)
• Compute the Haar wavelet outputs dx and dy
• Within each subwindow, compute Σdx, Σdy, Σ|dx|, Σ|dy|
• This yields a 64-element descriptor

(Only U-SURF is implemented here – no rotation)


Integral image
• The integral image (a.k.a. summed area table) is a 2D running sum
• S(x, y) = Σ_{x′≤x} Σ_{y′≤y} I(x′, y′)
• To compute: S(x, y) = I(x, y) + S(x−1, y) + S(x, y−1) − S(x−1, y−1)
• To use: V(l, t, r, b) = S(l, t) + S(r, b) − S(l, b) − S(r, t) returns the sum of the values inside the rectangle
• Note: the sum of the values in any rectangle can be computed in constant time!
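Both operations in a few lines of NumPy (cumulative sums build S; the four-corner lookup returns any rectangle sum in O(1)):

import numpy as np

def integral_image(img):
    # S(x, y): running sum of all pixels above and to the left
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(S, t, l, b, r):
    # Sum of img[t..b, l..r] (inclusive) from four corners of S
    total = S[b, r]
    if t > 0: total -= S[t - 1, r]
    if l > 0: total -= S[b, l - 1]
    if t > 0 and l > 0: total += S[t - 1, l - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
S = integral_image(img)
print(rect_sum(S, 1, 1, 2, 2))  # 5 + 6 + 9 + 10 = 30.0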
2nd derivative filters (9x9 filters)

The second derivatives ∂²G(x; σ)/∂x², ∂²G/∂y², and ∂²G/∂x∂y are approximated by 9x9 box filters Dxx, Dyy, and Dxy at scale σ = 1.2.

det(Happrox) = Dxx·Dyy − (0.9·Dxy)²


Changing scale

• The integral image allows us to upsample the filter rather than downsample the image (filter sizes 9x9, 15x15, 21x21, …)
Changing scale (within an octave)

• For the 9x9 filter, l₀ = 3 (the length of the positive or negative lobe in the direction of the derivative)
• To keep a central pixel, l₀ must increase by a minimum of 2 pixels, which increases the filter dimension by 6
• Therefore, the sizes of the filters are 9x9, 15x15, 21x21, 27x27 (l₀ = 3 for 9x9, l₀ = 5 for 15x15, …)
Non-maximal suppression

• Retain a pixel only if it is greater than all 26 of its neighbors in (x, y, σ)



Interpolation

• We now have values at filter sizes 9, 15, 21, 27, i.e. from σ = 1.2 up to σ = 1.2 × (27/9) = 3.6.
• The covered range lies halfway between samples: from size 12 to size 24, i.e. from σ = 1.2 × (12/9) = 1.6 to σ = 1.2 × (24/9) = 3.2.
• That range is exactly one octave!
Interpolation

• For each local maximum, we need to interpolate to get the true location (to overcome discretization effects)
• Using the Hessian values, take a Taylor expansion of the response around the maximum, as in SIFT: H(x) ≈ H + (∂H/∂x)ᵀ x + ½ xᵀ (∂²H/∂x²) x
• Solution using Newton’s method: x̂ = −(∂²H/∂x²)⁻¹ (∂H/∂x)



The next octaves

• First octave filter sizes: 9, 15, 21, 27
• Second octave sizes: 15, 27, 39, 51
• Increase by 12 each time (not 6)
• Spans from 21 (σ = 1.2 × 21/9 = 2.8) to 45 (σ = 1.2 × 45/9 = 6) (some overlap with the first octave)
• OK to measure at every other pixel in the image (saves computation, like downsampling)
• Third octave sizes: 27, 51, 75, 99
• Increase by 24 each time
• Spans from 39 (σ = 1.2 × 39/9 = 5.2) to 87 (σ = 1.2 × 87/9 = 11.6)
• OK to measure at every 4th pixel in the image
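The size and scale bookkeeping above can be reproduced in a few lines (a sketch: the step doubles per octave, each octave starts at the second filter of the previous one, and σ = 1.2 × size / 9):

def surf_octaves(n_octaves=3, base=9, n_filters=4):
    octaves, start, step = [], base, 6
    for _ in range(n_octaves):
        sizes = [start + i * step for i in range(n_filters)]
        octaves.append([(size, round(1.2 * size / 9, 2)) for size in sizes])
        start, step = sizes[1], step * 2
    return octaves

for octave in surf_octaves():
    print(octave)
# [(9, 1.2), (15, 2.0), (21, 2.8), (27, 3.6)]
# [(15, 2.0), (27, 3.6), (39, 5.2), (51, 6.8)]
# [(27, 3.6), (51, 6.8), (75, 10.0), (99, 13.2)]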



SURF octave overview

Filter sizes: 9 15 21 27 33 39 45 51 57 63 69 75 81 87 93 99

Octave 1: 1.6 ≤ σ ≤ 3.2 (σ = 1.2 × 12/9 = 1.6 up to σ = 1.2 × 24/9 = 3.2)
Octave 2: 2.8 ≤ σ ≤ 6.0 (σ = 1.2 × 21/9 = 2.8 up to σ = 1.2 × 45/9 = 6.0)
Octave 3: 5.2 ≤ σ ≤ 11.6 (σ = 1.2 × 39/9 = 5.2 up to σ = 1.2 × 87/9 = 11.6)
But higher octaves are increasingly less useful



SURF descriptor
• Once an interest point has been found:
• Place a window around the point
• Divide it into 4x4 subwindows
• In each subwindow:
• measure dx and dy at 25 (5x5) places
• sum over all 25 places to get 4 values: Σdx, Σdy, Σ|dx|, Σ|dy|

• Note: Haar wavelets are used to measure the differences (similar to gradients)

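A sketch of the four sums for one subwindow, given Haar responses dx and dy at the 5x5 sample points (the arrays here are random stand-ins):

import numpy as np

def subwindow_values(dx, dy):
    # The 4 SURF values for one subwindow:
    # sum(dx), sum(dy), sum(|dx|), sum(|dy|) over 25 samples
    return np.array([dx.sum(), dy.sum(),
                     np.abs(dx).sum(), np.abs(dy).sum()])

rng = np.random.default_rng(0)
descriptor = np.concatenate([
    subwindow_values(rng.normal(size=(5, 5)), rng.normal(size=(5, 5)))
    for _ in range(16)])          # 16 subwindows x 4 values
print(descriptor.shape)           # (64,)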


Details

The descriptor window is 20s × 20s around the interest point (10s on each side), weighted by a Gaussian with σ = 3.3s.

With s = 1.2 this corresponds to a 24 × 24 window (12 on each side), weighted by a Gaussian with σ = 4.

The sign of the LoG, sgn(trace(Happrox)) = sgn(Dxx + Dyy), is stored as well, giving 65 values per interest point (16 x 4 + 1).
Sign of LoG

• The sign of the Laplacian, sgn(Dxx + Dyy), speeds up matching: only interest points with the same sign need to be compared.



Why SURF is better than SIFT



Examples: Flowers



Examples: Tillman (SURF, 5 octaves, 4 scales per octave)

Examples: Tillman (U-SURF, 1 octave, 4 scales per octave)

Examples: Tillman (OpenCV’s SURF)


