
Pedestrian Tracking

Independent Study Spring 2009

Isaac Case

May 13, 2009

Abstract

One key ability of the human visual system is to detect and follow motion. This is a critical
skill in understanding the world around us, as it helps us identify objects of importance and
recognize changes in our environment. This ability is highly desired in computer vision systems,
and tracking motion can be used to monitor activity in an area. For this work, I have applied
this idea of object tracking in an attempt to track pedestrian motion in video.

1 Introduction

The goal of this work was to track people as they walk through some frames of video. It
was intended to track only the motion of pedestrians; all other motion should be ignored.
Partial and full occlusion should be compensated for. It should not matter how many people
are in the video, or how large a person is in comparison to the size of the video frame. Some
of these goals were met; however, others could not be implemented due to the time
constraint of this research, i.e. only ten weeks.

2 Approach

2.1 Overview

In order to accomplish this goal, it was necessary to break the problem down into smaller pieces. The
current implementation is based on the idea of background subtraction. Given a background,
we are able to determine the difference between the current frame and the background to find the
foreground. The foreground should contain all of the objects that are worth investigating. With
a given foreground, the next step is to segment it into regions of interest referred to
as blobs. Then, with a set of blobs, track the motion from frame to frame, matching these
blobs across a series of frames.

2.2 Implementation

The implementation of this system followed the same process as described in the overview.
Each part is a separate algorithm; however, all work together to produce the results given.
It may be possible to replace one subsection without any effect on the other subsections,
and in that way the parts are somewhat independent and could be researched independently.

2.2.1 Background Detection

The first operation was to determine the background for every frame. A very simple approach
was tested at first. That implementation was as simple as just taking the average value of a
series of frames as the background as in:
B = (1/k) · Σ_{n=1}^{k} F_n

where B is the found background, k is the number of frames to be analyzed, and F_n is
the n-th frame. This assumes that the background fills the frame a majority of the time,
the background is fairly static, and the value of k must be large enough to force any non-
background object to become insignificant. Given those conditions, the results can be fairly
good; however, given any deviation from those conditions, the results worsen quickly, as
can be seen in figure 1.

Figure 1: A comparison of averaged backgrounds where (a) meets the criteria for averaging
and (b) does not; (b) is averaged over too few frames
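As a concrete illustration, the straight averaging step can be written in a few lines. The following is a minimal Python/NumPy sketch, not the original MATLAB implementation; `frames` is an assumed list of equally sized frames.

```python
# Minimal sketch of background-by-averaging. Assumes `frames` is a list of
# equally sized numpy arrays (H x W x 3); this is not the author's MATLAB code.
import numpy as np

def average_background(frames):
    """Estimate the background as the per-pixel mean of k frames."""
    stack = np.stack(frames, axis=0).astype(np.float64)  # shape (k, H, W, 3)
    return stack.mean(axis=0)                            # B = (1/k) * sum(F_n)
```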

One major issue with this system of background detection is that it does not allow
the background to change over time. If an object was moving, but then stops moving, it
should become part of the background instead of forever staying as a foreground object. One
example of this is a car that enters a scene and then parks. The car when moving may be of
interest and should be considered part of the foreground, but when the car parks and stops, it
eventually should be ignored again as it is no longer a moving object. An improvement over
the previous method was also implemented for this research. It is based on the same idea
of averaging multiple frames of video, but instead of averaging every pixel for every frame,
it only includes those pixels for which there is little motion over some range of frames. It
follows a similar function; however, it has some important differences.
B_{i,j} = ( Σ f_{i,j} ) / |f|

where f_{i,j} is the set of values for the pixel at position (i, j) over the series of frames
evaluated for which there is little change compared to the frame n frames previous and n
frames in the future, and |f| is the number of elements in the set f.

This is done a whole frame at a time and can be seen in figure 2. The difference between
frame n and frame n − k, referred to as ∆_{n−k,n}, is calculated, as well as the difference between
frame n and frame n + k, again referred to as ∆_{n,n+k}. The section that changes in frame n
is then the boolean AND of the two differences, or ∆_{n−k,n} ∧ ∆_{n,n+k}. This then allows us to
claim that the background for that frame is ¬(∆_{n−k,n} ∧ ∆_{n,n+k}).

Figure 2: The detection of background for one frame (panels: frames at T=−10, T=0, and T=+10; ∆ between T=−10 and T=0; ∆ between T=0 and T=10; detected difference for T=0; detected background for T=0)
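The selective-averaging idea illustrated in figure 2 could look roughly like the following. This is a hedged Python/NumPy sketch rather than the original code; grayscale frames, the frame offset `k`, and the change threshold are assumed simplifications (the actual implementation uses the HSV-based difference described next).

```python
# Sketch of selective averaging: a pixel of frame n is treated as background only
# when it differs little from both frame n-k and frame n+k. `frames` is an assumed
# list of grayscale numpy arrays; `k` and `thresh` are illustrative parameters.
import numpy as np

def selective_background(frames, k=10, thresh=20.0):
    frames = [f.astype(np.float64) for f in frames]
    h, w = frames[0].shape
    acc = np.zeros((h, w))
    count = np.zeros((h, w))
    for n in range(k, len(frames) - k):
        d_prev = np.abs(frames[n] - frames[n - k]) > thresh   # ∆_{n-k,n}
        d_next = np.abs(frames[n] - frames[n + k]) > thresh   # ∆_{n,n+k}
        moving = d_prev & d_next                              # changed region of frame n
        bg_mask = ~moving                                     # ¬(∆ ∧ ∆)
        acc[bg_mask] += frames[n][bg_mask]
        count[bg_mask] += 1
    # Per-pixel mean over only those frames where the pixel was flagged as background.
    return acc / np.maximum(count, 1)
```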

The actual method for calculating the difference between frames in this portion of the
project was based on the HSV color space. The difference was calculated in the following
manner:

∆H = cos⁻¹(cos H_i · cos H_j + sin H_i · sin H_j)
∆S = |S_i − S_j|
∆V = |V_i − V_j|

The difference for the whole region is then defined as a combination of these three chan-
nels. According to Shan et al [3] it is possible to use this color space to help remove shadows
from our difference. For the HSV color model it is done in this manner:

∆I = ∆V ∧ (∆H ∨ ∆S)

where ∆I is the entire image difference for all three channels.
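A small sketch of that HSV difference, with the ∆I = ∆V ∧ (∆H ∨ ∆S) shadow-suppressing combination, is shown below. Hue is assumed to be in radians and the thresholds are illustrative placeholders, not the values used in this project.

```python
# Sketch of the HSV frame difference with the shadow-suppressing combination
# ΔI = ΔV ∧ (ΔH ∨ ΔS). Hue is assumed to be in radians; thresholds are placeholders.
import numpy as np

def hsv_change_mask(hsv_a, hsv_b, th_h=0.5, th_s=0.1, th_v=0.1):
    h1, s1, v1 = hsv_a[..., 0], hsv_a[..., 1], hsv_a[..., 2]
    h2, s2, v2 = hsv_b[..., 0], hsv_b[..., 1], hsv_b[..., 2]
    # Angular hue distance: arccos(cos h1 cos h2 + sin h1 sin h2).
    d_h = np.arccos(np.clip(np.cos(h1) * np.cos(h2) + np.sin(h1) * np.sin(h2), -1.0, 1.0))
    d_s = np.abs(s1 - s2)
    d_v = np.abs(v1 - v2)
    # Value must change AND either hue or saturation must change; shadows mostly
    # change value only, so they are rejected by the (ΔH ∨ ΔS) term.
    return (d_v > th_v) & ((d_h > th_h) | (d_s > th_s))
```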

2.2.2 Image Registration

This change in algorithm helps with some of the problems presented with just averaging
entire frames over some region, but problems still remain. The main problem that still exists
is the issue with camera motion. The previous method works well if the camera is stationary,
but fails if the camera is moving at all. If we did not compensate for camera motion, then
there would be problems with noise being added to our estimated background image due to
motion that is not object motion, but camera motion.
There are many ways to compensate for camera motion, but the method I implemented
requires a distinguishable amount of background to work. The general idea is to register
the frames from two different times one with another. If we can realign these frames, then
we can still determine the background pixel values for those pixels that are visible for a
given time period. We can also ensure that the frame that we will be evaluating against
this background in the future is registered with the found background accurately. The
registration I implemented is based on the idea that for a given image we can detect feature
points, specifically Harris points [1], and with those feature points we can find a
correlation between the feature points of one image and another, see figure 3.

Figure 3: Feature point matching from one frame to another

There are two different algorithms that were attempted for feature point correlation. One
method can sometimes be faster, while the other is slower, but sometimes more accurate.
The faster algorithm uses an approximation of the slower algorithm, which is why they
produce similar results. The first algorithm tried takes all of the feature points, p0 , from
an image frame F_0, and computes the Euclidean distance, as separate x and y components,
from those points to all the feature points, p_1, from frame F_1. Along with the Euclidean
distance, the absolute value of the difference in intensity is also calculated for each feature
point pair where difference in intensity refers to the intensity of the pixel value of the source
image, not the intensity of the harris point itself. All the pairs which have a difference in
intensity greater than some threshold are ignored. For all the other pairs a histogram is
computed for spatial difference for both the x and y components. Given the histogram for
each component, select the bin with the most points. Then given the highest individual
components, see if that correlates to a high number of point pairs that match both the
selected x and y distances. If the number of pairs is greater than a preselected threshold,
then those pairs of points are returned as the x and y shift for the given frames. If it is not,
then we go through different combinations of x values and y values in order of occurrence to
try and find some pair for which there is a match in both directions. In many cases the first
values are selected and the function is quite fast. If it is difficult to find a set of matching
pairs, it may take somewhat longer to find the match. This happens in some frames where
there is either not much background, i.e. a blank white wall, or a background with a lot of
random noise that the Harris corner detector reports as corners, which are less useful.
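The histogram-voting idea could be sketched as follows. This is an assumed simplification in Python/NumPy: `pts0`/`pts1` are Harris corner coordinates, `int0`/`int1` are the image intensities sampled at those points, and the bin width, intensity threshold, and vote threshold are placeholder parameters.

```python
# Simplified sketch of histogram voting over point-pair displacements.
# pts0/pts1 are (N,2)/(M,2) corner coordinates; int0/int1 are intensities at those points.
import numpy as np

def dominant_shift(pts0, int0, pts1, int1, int_thresh=10.0, bin_width=2, min_votes=10):
    # All point pairs and their x/y displacement components.
    dx = pts1[None, :, 0] - pts0[:, None, 0]
    dy = pts1[None, :, 1] - pts0[:, None, 1]
    keep = np.abs(int1[None, :] - int0[:, None]) < int_thresh   # drop dissimilar pairs
    dx, dy = dx[keep], dy[keep]
    if dx.size == 0:
        return None
    # Vote in coarse histograms over the x and y displacements separately.
    bx = np.round(dx / bin_width).astype(int)
    by = np.round(dy / bin_width).astype(int)
    best_x = np.bincount(bx - bx.min()).argmax() + bx.min()
    best_y = np.bincount(by - by.min()).argmax() + by.min()
    # Accept the shift only if enough pairs agree in both directions at once.
    agree = (bx == best_x) & (by == best_y)
    if agree.sum() < min_votes:
        return None
    return dx[agree].mean(), dy[agree].mean()
```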
An alternative approach is similar, but requires on average more time to compute the
feature point matching. The goal is the same for this algorithm, find the pairs of points
for which there is a high amount of correlation and use those points to realign the image.
Instead of looking at the distances between feature points and finding a correlation among
them, we look at the image itself. For each feature point p in p_0, create a window w from
frame F_0. Also, for all points p′ in p_1, create windows w′_1..w′_n from frame F_1. Then, with this
window, perform a normalized cross correlation between window w and the windows w′_1..w′_n.
For all of these cross correlations, pick the one with the highest value, w′_i, and as long as the
highest value is above some threshold, assign that as the match for p, so p correlates to
p′_i.
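A rough sketch of this normalized cross correlation variant is given below; the window half-size and acceptance threshold are assumed values, and candidate points too close to the border are simply skipped.

```python
# Sketch of window-based matching: compare the window around one feature point in F0
# against the windows around all candidate points in F1 and keep the best NCC score.
import numpy as np

def ncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_by_ncc(frame0, p, frame1, candidates, half=7, min_score=0.8):
    y, x = p
    w = frame0[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    best_score, best_pt = -1.0, None
    for (cy, cx) in candidates:
        w1 = frame1[cy - half:cy + half + 1, cx - half:cx + half + 1].astype(np.float64)
        if w1.shape != w.shape:          # skip candidates too close to the border
            continue
        score = ncc(w, w1)
        if score > best_score:
            best_score, best_pt = score, (cy, cx)
    return best_pt if best_score >= min_score else None
```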
Whichever method is used, as a result we have a list of points that match from one frame
to another frame. Using these points we can create a linear transform to shift the image.
For all the points given, we calculate the distance in both the x and y direction. Then, given
those values, ∆x and ∆y, pick the mode of each set and use those as the translation values.
Since the translation of the image will likely result in the images not fully overlapping, pad the
image with empty data on all edges where data may be lacking due to registration. For
background detection that also means that if we are constraining ourselves to only the size
of the original image, we are very likely to lose information. In order to get around this, if
we know that there is camera motion, we assume that the background is likely to move and
pad our found background on each side to compensate. Then we can retain some information
and a tracked background even if the camera moves back and forth.
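Putting those two steps together, a sketch of the mode-based translation and the padding might look like this. The pad size is an assumed parameter, and the shift is assumed to stay within it.

```python
# Sketch: take the mode of the per-pair x and y offsets as the translation, then place
# the shifted frame inside a padded canvas so pixels are not lost at the edges.
import numpy as np

def register_by_mode(pts0, pts1, frame, pad=50):
    dx = np.round(pts1[:, 0] - pts0[:, 0]).astype(int)
    dy = np.round(pts1[:, 1] - pts0[:, 1]).astype(int)
    # Mode of each displacement component is used as the translation.
    tx = np.bincount(dx - dx.min()).argmax() + dx.min()
    ty = np.bincount(dy - dy.min()).argmax() + dy.min()
    h, w = frame.shape[:2]
    # Assumes |tx| and |ty| do not exceed `pad`.
    canvas = np.zeros((h + 2 * pad, w + 2 * pad) + frame.shape[2:], dtype=frame.dtype)
    canvas[pad + ty:pad + ty + h, pad + tx:pad + tx + w] = frame
    return canvas, (tx, ty)
```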
An example of found background for a given frame from a video where the camera was
moving can be seen in figure 4. The black border surrounding the frame is the padding
provided for camera motion. If it were not for the registration from frame to frame the
image would appear much blurrier and it would be more difficult to use as a representation
of the background.

Figure 4: Comparison of the found background image to the actual image frame (panels: found background, original image)

2.2.3 Foreground Selection

Once a background is found for a frame, the next step is to segment the foreground from the
background. For some implementations it is acceptable to just perform a difference between
the found background and the current frame. This process has some drawbacks, mainly that
of ignoring normal variance or noise in the video. For this implementation I would keep
track of approximately twenty frames of found background information. With these twenty
frames it was possible to derive the mean and standard deviation of all pixels in a range
of frames, see figure 5. Specifically, the mean for a given pixel is the average of only those
values at that specific pixel location for which background was claimed to be found by our background
detection algorithm. Non background content is ignored in this assessment of mean. The
standard deviation is also calculated from this data set and ignores the approximated fore-
ground content. Then, given this information it was possible to determine what content in
the current frame was out of range, beyond one standard deviation, from the mean found
background. Content out of range is considered foreground, see figure 6. This is different
from the foreground/background selection process for background detection in that it accounts for
the subtle changes that can occur over time, and also includes other noise sources such as
camera noise or video compression noise, and their variance over time. For multi-channel images,
such as RGB, the difference from the mean is calculated separately on all three channels.
Therefore FG_r = |r − r̄| > σ_r, etc. For the foreground mask of an entire image frame, it
is just the boolean AND of all three channels, so FG_f = FG_r ∧ FG_g ∧ FG_b. Much of this
work was inspired by an equation given by Shan et al [3].

Figure 5: Computation on a range of frames (T=−10 to T=+10) of the per-pixel mean and per-pixel standard deviation for background statistics
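A sketch of this statistical foreground test is given below, assuming a stack of roughly twenty background estimates plus boolean masks marking where background was actually found; it mirrors the description above rather than the exact implementation.

```python
# Sketch of the statistical foreground test: per-pixel mean and standard deviation are
# computed over the background estimates (ignoring pixels never classified as background),
# and the current frame is flagged where it deviates by more than one standard deviation
# on every RGB channel (FG = FG_r ∧ FG_g ∧ FG_b).
import numpy as np

def foreground_mask(bg_stack, bg_valid, frame):
    """bg_stack: (k,H,W,3) background estimates; bg_valid: (k,H,W) boolean masks."""
    valid = bg_valid[..., None]                        # broadcast over the channels
    count = np.maximum(valid.sum(axis=0), 1)
    mean = (bg_stack * valid).sum(axis=0) / count
    var = (((bg_stack - mean) * valid) ** 2).sum(axis=0) / count
    std = np.sqrt(var)
    deviates = np.abs(frame.astype(np.float64) - mean) > std   # per-channel test
    return deviates.all(axis=-1)                       # AND of the three channels
```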

The foreground sections should contain the areas where objects are moving. These are
the objects that we are interested in. Connected components are segmented out into different
blobs. Each blob contains information about itself including the pixel values contained in a
bounding box, the mask associated with the content that was determined to be foreground,
the relative location within the frame where the blob was found, and the centroid of the
blob's mask. This information helps us determine how a blob moves from frame to frame and
allows us to track the blob's motion through the video sequence.

Figure 6: Foreground calculation from background statistics (panels: current frame, per-pixel mean, per-pixel standard deviation, current frame foreground)

2.2.4 Blob Tracking

In video, any motion is an optical illusion created from multiple static images with small
changes from frame to frame; when they are presented to a person in quick succession, it
creates the illusion that the contents of the image have moved. Unfortunately this motion is
not encoded in the video itself; each individual static frame is all we have to work with in
trying to detect motion from
the video. Given the previously explained parts to our system, the actual motion tracking
occurs as a function of finding associations between blobs over time. Much of this work
was inspired by the work shown in Masoud and Papanikolopoulos’s “A novel method for
tracking and counting pedestrians in real-time using a single camera”[2], however not all of
their techniques were implemented.
Just as smooth motion is the illusion created from slight variations frame to frame, each
blob should change only slightly frame to frame if it is to give the illusion of smooth motion.
The means to our goal of tracking motion becomes a function of tracking blobs from any
given previous frame to the current frame. This mapping of blobs from T=0 to T=1 can be
composed over time into the path traveled by any given blob. This is the perceived
motion.
In a previous implementation of this algorithm, one major failure was observed. That was
an assumption that any given blob detected in a video frame could only be identified with a
singular moving object. For a video sequence with only one moving object, or with objects
that never cross paths this never becomes a problem. However, most real world situations
do not follow this strict rule. For situations where individual objects cross paths, or form
groups and then diverge, it must be the case that a singular blob can represent multiple
separate and distinct objects. For this reason, this revision added the ability for a
found blob in a frame to be matched to multiple different previously found objects. This way
different objects need not be lost as different moving objects overlap.
The basis for the tracking of blobs from frame to frame is a “best match” search of all
the blobs from the previous frame to the blobs of the current frame. This matching process
is based on a cost function composed of Euclidean distance, difference in the size
of the bounding box, and the difference between the color histograms of the current frame
blobs and the previous frame blobs.

diff = w_0 · d_E + w_1 · d_A + w_2 · d_H

The Euclidean distance of two blobs is defined as the distance between the centroid of a
given blob in the current frame and the predicted centroid of a blob from the previous frame.
The predicted centroid is calculated based on the current predicted speed of the blob and the
previous frame's known centroid. The difference in size of the bounding box is simply the
absolute value of the difference between the area of the previous bounding box and the area
of the bounding box of the blob being evaluated. The color histogram difference is calculated
as the sum of the percentages of non-matching color values, those values for which there is
no equivalent in the comparison, for the combinations {red, green}, {red, blue}, and {blue, green},
assuming that the image is defined in RGB space. The percentage is calculated relative to the size
of the smaller of the two blobs. Also, only the RGB values for the region of the blob, not
the whole bounding box, are compared. Other pixel values are considered background and
ignored in this histogram comparison. The color histogram value will have a value of 0 for
a perfect match and a maximum value of 3 for a perfect mismatch. This color histogram
matching is similar to work done by Swain and Ballard [4], except that in their research
they were matching an entire image against an image database, whereas this is only matching small
sections of an image to other sections for which shape and size are not constant.
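The cost function could be sketched as below. The blob fields, the weights w_0 to w_2, and the number of histogram bins are assumptions for illustration; the histogram term follows the pairwise {red, green}, {red, blue}, {blue, green} comparison described above.

```python
# Sketch of the matching cost diff = w0*dE + w1*dA + w2*dH between a previous-frame
# blob and a current-frame blob. Blob dictionary keys, weights, and bin counts are
# illustrative assumptions, not values from the original implementation.
import numpy as np

def hist_mismatch(pix_a, pix_b, bins=16):
    """Unmatched color fraction summed over the RG, RB and GB pairs (range 0..3)."""
    n = max(min(len(pix_a), len(pix_b)), 1)            # relative to the smaller blob
    total = 0.0
    for (c1, c2) in [(0, 1), (0, 2), (1, 2)]:
        h_a, _, _ = np.histogram2d(pix_a[:, c1], pix_a[:, c2], bins=bins, range=[[0, 256], [0, 256]])
        h_b, _, _ = np.histogram2d(pix_b[:, c1], pix_b[:, c2], bins=bins, range=[[0, 256], [0, 256]])
        total += 1.0 - np.minimum(h_a, h_b).sum() / n  # fraction with no equivalent
    return total

def blob_cost(prev_blob, cur_blob, w0=1.0, w1=0.01, w2=10.0):
    d_e = np.linalg.norm(np.asarray(prev_blob["predicted_centroid"]) - np.asarray(cur_blob["centroid"]))
    d_a = abs(prev_blob["bbox_area"] - cur_blob["bbox_area"])
    d_h = hist_mismatch(prev_blob["pixels"], cur_blob["pixels"])  # blob-region RGB values only
    return w0 * d_e + w1 * d_a + w2 * d_h
```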
In order to reduce mismatches only blobs for which there is some overlap between the
mask of the predicted previous frame blob and the mask of the current frame blob are
considered. This is determined by first performing a quick comparison of bounding box
overlap. This is a quick method of ignoring most of the possible mismatches. If the bounding
boxes overlap, then given the bounding boxes and the mask of the blobs we perform a boolean
AND operation on the two masks projected in their proper coordinates. After the boolean
AND a sum is performed on the resultant matrix. If the sum of that matrix is non zero, then
there is some overlap. If not, then there is no overlap and this blob is ignored for matching.
Reducing the blobs looked at reduces the complexity of finding a matching blob for tracking
smooth motion frame to frame. This assumes that objects are not moving in jumps larger
than their own size each frame.
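A sketch of that two-stage overlap test, assuming bounding boxes stored as (x0, y0, x1, y1) in frame coordinates and masks cropped to their boxes:

```python
# Sketch of the cheap overlap pre-test: reject the pair if the bounding boxes do not
# intersect, otherwise AND the two masks over the intersection and check for a
# non-zero sum.
import numpy as np

def blobs_overlap(box_a, mask_a, box_b, mask_b):
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    if ax1 <= bx0 or bx1 <= ax0 or ay1 <= by0 or by1 <= ay0:
        return False                         # bounding boxes do not even touch
    # Project both masks onto the intersection rectangle and AND them.
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    sub_a = mask_a[iy0 - ay0:iy1 - ay0, ix0 - ax0:ix1 - ax0]
    sub_b = mask_b[iy0 - by0:iy1 - by0, ix0 - bx0:ix1 - bx0]
    return bool(np.logical_and(sub_a, sub_b).sum() > 0)
```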
Given this cost function, all blobs of the previous frame are compared to the blobs of the
current frame. If there is a 1:1 matching between these blobs, the current frame blobs inherit
all of the properties of the previous frame blobs other than the updated mask, updated image
segment, and updated coordinates. The predicted speed of this blob is calculated based on
the previous speed of the previous frame blob. Currently the function used is:
S_X^T = (∆_X^T + S_X^{T−1}) / 2

S_Y^T = (∆_Y^T + S_Y^{T−1}) / 2

where S_X^T is the speed in the X direction at time T and ∆_X^T is the change in the X direction
between time T and T − 1. The same calculations are performed for speed in the Y direction
and are kept separate as a vector denoting the approximate speed of the blob. This is not the
best predictor of speed as it only works for things moving in a linear manner. Any other
type of constantly changing function would not be well predicted by this function, however,
it is very simple to implement and works fairly well for simple pedestrian movement.
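As a small illustration, the speed update reduces to a couple of lines (a sketch, with blobs represented by plain tuples):

```python
# Sketch of the simple speed predictor: the new speed is the average of the current
# frame-to-frame displacement and the previous speed estimate, kept per axis.
def update_speed(prev_speed, prev_centroid, cur_centroid):
    dx = cur_centroid[0] - prev_centroid[0]
    dy = cur_centroid[1] - prev_centroid[1]
    sx = (dx + prev_speed[0]) / 2.0   # S_X^T = (∆_X^T + S_X^{T-1}) / 2
    sy = (dy + prev_speed[1]) / 2.0
    return (sx, sy)
```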
If there is a many-to-one mapping of previous frame blobs to a current frame blob, then
we take this into consideration and create a new placeholder blob which has
a new characteristic called “sub blobs”. These sub blobs are the independent components
that are believed to make up the combined blob. This relation is held for one frame, then
when it comes time to compare this combined blob to the blobs of the new current frame,
it is decomposed into its core parts, and the “sub blobs” are compared to the current frame
blobs. In this way, individual components retain their individuality even if it appears that
they have been combined into a larger object. If they were to split at some time in the future,
they would retain all previously known knowledge about themselves, i.e. color, mask, speed,
direction, etc.
For all blobs which have been matched either in a 1:1 or a many-to-one mapping for
multiple frames we recognize this as a blob worth tracking and track it. Tracking for simple
1:1 matched blobs is performed by marking the previously known centroids for the object
and drawing the bounding box around the current found location of the blob for this frame.
An example of this type of output can be seen in figure 7.

Figure 7: Tracking the motion of found blobs for a series of frames

For blobs which have been matched in a many-to-one situation, we try to preserve the
independence of the sub blobs. In order to do this we do not draw a bounding box around
the merged blob, but instead draw projected centroids for where we believe the sub blobs
are, and a bounding box around where we believe it to be at this point in time. Current
frame place prediction is based on the last known speed vector of the blob and is calculated
as the number of frames since this object was last an individual multiplied by the known
speed. Since this prediction is based on the last known speed of the sub blob it could be very
incorrect. For this reason the projection is clipped at the boundaries of the larger merged
blob. If it is projected that the centroid of a smaller blob is outside the range of a merged
blob, then the smaller sub blob’s centroid is clipped at the boundary of the bounding box
of the larger merged blob. This prevents the predicted placement of centroids and bounding
boxes from being too far away from the actual motion of the merger of the sub blobs.
Since we maintain the existence of sub blobs this helps with the problem of what happens
in the situation where two people cross paths. In the previous paradigm one would be lost
or ignored until they separated again. In the current implementation we acknowledge the
merger and attempt to do the best we can with what we have. For some scenarios the results
are much improved over the past results, see figure 8.

Figure 8: A comparison of maintaining independence of sub blobs in a merge situation versus not acknowledging merges (panels: blocks crossing, old vs. new method; people crossing, old vs. new method)

So we don’t have trails forever for objects that have disappeared after each frame we
prune the list of blobs that we are interested in. If a blob is not matched frame to frame, it
is marked as having missed frames. If the number of missed frames exceeds some tolerance
value, then that blob is pruned from the list of blobs that we are interested in. It will not be
drawn anymore and it will no longer be compared to any new blobs or matched again. This
ensures that we recognize the fact that a blob has left the scene and is no longer of interest
for tracking.
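A sketch of this pruning step, assuming blobs are stored as dictionaries with an id and a missed-frame counter, and with the tolerance as an assumed parameter:

```python
# Sketch of pruning: blobs that go unmatched accumulate missed frames and are dropped
# once they exceed a tolerance, so stale tracks stop being drawn or compared.
def prune_blobs(tracked, matched_ids, max_missed=5):
    kept = []
    for blob in tracked:
        if blob["id"] in matched_ids:
            blob["missed"] = 0
        else:
            blob["missed"] = blob.get("missed", 0) + 1
        if blob["missed"] <= max_missed:
            kept.append(blob)
    return kept
```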

3 Results

Compared with previous results, this new methodology improves in many areas over past
implementations. It addresses the issues involved with partial occlusion and crossed paths.
It attempts to address changes in the background over time and to ignore some of the normal
variance included in the video. Although not all situations are fully addressed, this imple-
mentation is much improved over previous results and in certain conditions produces fairly
good results.
Specifically with the video file entitled “people01 08 05.avi” as presented on the Com-
puter Vision website, results were much improved over previous results. Portions where people
cross paths are much improved in that each individual's path is preserved and is fairly accurate
given the limitations. Also, the camera noise that was picked up by previous algorithms is
fairly successfully ignored along with changes in lighting from the sun moving behind clouds
and back out again.
Videos with not much background information, i.e. a blank white wall, produced
poor results compared to other video files. Tracking the camera motion appeared not to be possible
with the currently implemented algorithms. Also, the motion of a person walking toward the
camera is not fully comprehended. At the points where the person is distinguishable from
the background only parts of the body are recognized as objects in motion. This causes
failures in object matching as it is recognizing the arms or legs as moving objects instead of
the whole person. Then when tracking the person, instead of tracking the whole, even when
the whole person is detected as a foreground object, the implemented system assumes that
it is just the merger of the appendages instead of recognizing the person as a whole.

4 Conclusion

Unfortunately given the situation posed to the class as our working situation, this method has
many drawbacks. One of the most significant failures seen in attempting to track students
entering a classroom had to do with the white wall background. Given the significant lack
of detail in the background of this situation, it was very difficult to determine the difference
between camera motion and object motion. One possible explanation for this is that the
human visual system is comparing the scene as a whole and is comparing what is perceived
with what makes the most ‘sense’ given the situation. If in a series of frames the movement
appears to be a person moving from right to left without any movement of their appendages,
it becomes likely that it is not the person that is moving, but it is the observer that is
moving. Then with that hypothesis it is possible to observe everything that is not a human
in the scene and see if that theory still makes sense. Unfortunately my implementation is
not advanced enough to handle that type of semantic gap between what is perceived, and
what is possible for all of the different possible objects in the given scene.
One other advantage the human visual system has over the processing I was able to do is
its ability to segment out objects from a static scene. For my analysis, I had to wait for an
object to move before I was able to identify it as an object worth noting. Our visual system,
including our brain for processing, can segment out all possible objects from a scene before
any motion happens, and therefore can pay more attention to regions that are probable for
motion and somewhat safely ignore other regions. This would include paying more attention
to the people in the video and less attention to the blank white wall. In combination with
this ability, our visual system can also bridge the gap between an object and the parts
that compose it. If a person is moving their arms, we know that the person is not going
to move as a whole until the legs are also moving. This connection between arm
motion, leg motion, etc. and the motion of the object/person as a whole is lost in the simple
processing that I have done.

5 Future Work

There is much work left to do in this area. One main portion that was not completed during
this course was the application of a pedestrian detector. Once foreground content has been
detected it is just assumed to be of importance. If this were to be an actual pedestrian
tracking algorithm, one more step would need to be implemented. That would be a step
where once foreground pieces are detected there must be a process whereby each foreground
piece is evaluated as to whether or not it is a pedestrian. In this way we can eliminate
much of the background noise, but also eliminate non-pedestrian movement. There are many
ways that this could be accomplished, but unfortunately none were attempted for this work.
Other possible refinements include a better history of the objects in motion. Currently
the only historical data kept is the previous locations of the center of the mask. Current
speed of the object is somewhat based on previous data, but historical speed is not kept for
reference or statistical query. Keeping such history could more accurately project an object's movement while
occluded. Also with more historical data, it might be possible to implement more complex
predictions of the objects placement. Currently we have looked at linear prediction, but there
are many other ways to accomplish this. Historical references to the objects appearance could
also be very useful in detecting changes over time. In the current implementation we only
have the appearance of the object the last frame it was tracked. This is a poor model for
matching and provides little data for comparison. It also does not ensure that there is a
match between scenes where change is possible.
Currently the probabilistic method for obtaining the foreground data from a frame is
based on the mean and standard deviation in RGB space. It would be useful to do a
comparison of different color spaces to see if any produce better results. One prime example
worth testing is the HSV color space for this detection. Also, other color spaces should
be tested in the background detection section. The use of HSV in the current implementation
is based on some initial tests between RGB, HSV, and YCbCr. Although HSV was chosen based
on some initial results, it may be possible to use a different space and find a better metric
for non changing content, within some range.

6 Time Tracking

This section is provided as a departmental requirement for the independent study. The
original estimate of work to be performed for this independent study was a little optimistic.
Unfortunately due to the nature of the work, progress was fairly slow. One of the main
reasons for this was the time it takes to test new ideas. The majority of the work done for
this was done in Matlab which, although very nice to work with and easy for implementing
fairly complex algorithms, is very slow compared to languages like C++ or Java. Because of
this, often it would take over 20 minutes just to test one changed parameter to an algorithm.
Most of the time spent was not in developing new ideas, but rather in implementing some
ideas, validating that they are sound, testing them on some sample data, retuning parameters
and finally testing on real world data. For this reason, the time to implement many of the
algorithms was much larger than originally expected.
Looking at the individual portions of this project, the breakdown would match the fol-
lowing:
Background Detection 2 weeks
Image Registration 3 weeks
Foreground Selection 2 weeks
Blob Tracking 3 weeks
Although the work was never split cleanly between the four sections, this is an approximate
estimate of how the total time was divided. If hours are requested, an approximation of the
number of hours worked per week on this project would be at least 10 hours per week, often
more in the beginning, as my algorithms were not very well optimized so it took longer
to get any results.

References

[1] Chris Harris and Mike Stephens. A combined corner and edge detector. 4th Alvey Vision
Conference, pages 189–192, 1988.

[2] O. Masoud and N.P. Papanikolopoulos. A novel method for tracking and counting pedes-
trians in real-time using a single camera. Vehicular Technology, IEEE Transactions on,
50(5):1267–1278, Sep 2001.

[3] Yong Shan, Fan Yang, and Runsheng Wang. Color space selection for moving shadow
elimination. pages 496–501, Aug. 2007.

[4] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.
