
Charles University in Prague

Faculty of Mathematics and Physics

BACHELOR THESIS

Adam Vyskovsky
Object tracking by a flying drone
Department of Theoretical Computer Science and Mathematical
Logic

Supervisor of the bachelor thesis: prof. RNDr. Roman Barták, Ph.D.


Study programme: Computer Science
Specialization: IOI

Prague 2014

Hereby, I would like to thank my supervisor Roman Barták for his guidance
throughout this study.
I would also like to thank my family for their lifelong support.
Last but not least, I am grateful to my colleagues and friends for their constructive criticism.

I declare that I carried out this bachelor thesis independently, and only with the
cited sources, literature and other professional sources.
I understand that my work relates to the rights and obligations under the Act
No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that
the Charles University in Prague has the right to conclude a license agreement
on the use of this work as a school work pursuant to Section 60 paragraph 1 of
the Copyright Act.

In Prague, July 30, 2014

Adam Vyskovsky

Název práce: Sledování objektů letajícím dronem

Autor: Adam Vyskovsky

Katedra: Katedra teoretické informatiky a matematické logiky

Vedoucí bakalářské práce: prof. RNDr. Roman Barták, Ph.D.

Abstrakt: Cílem této práce je navrhnout a implementovat vhodnou metodu pro autonomní sledování objektů z kvadrikoptéry pomocí kamery umístěné na její palubě. Je rozebráno několik možností vhodného zpracování obrazu a poté i následného hledání objektu ve video sekvenci. Dále jsou rozebrány možnosti, jak modelovat kvadrikoptéru jako dynamický systém a jak tento model využít pro řízení letové fáze kvadrikoptéry za účelem sledování objektů. Jeden konkrétní model řízení pomocí tzv. PID regulátoru je zvolen a implementován. Dále je navržena metoda pro odhad měřítka světa. Rovněž byla naimplementována platforma pro snadnou komunikaci s kvadrikoptérou.

Klíčová slova: AR.Drone 2.0, autonomní sledování objektů, pronásledování objektů, teorie řízení

Title: Object tracking by a flying drone


Author: Adam Vyskovsky
Department: Department of Theoretical Computer Science and Mathematical
Logic
Supervisor: prof. RNDr. Roman Barták, Ph.D.
Abstract: The goal of this thesis was to design and implement a suitable method
of autonomous object tracking by a flying quadcopter with an onboard camera.
Several methods of image processing and subsequent object tracking in a video
stream are discussed. Afterwards, the quadcopter is studied from the perspective
of a dynamical system. The knowledge gained from studying dynamical systems
is utilized in the flying phase, as one specific model of the dynamical system, a so-called
PID controller, is chosen and implemented. Then we propose a method of
scale estimation of the world. We also designed a platform for easier communication with the quadcopter.
Keywords: AR.Drone 2.0, autonomous object tracking, object following, control
theory

Contents

Introduction

1 Robotic Platform
1.1 Parrot AR.Drone
1.2 Technical Parameters
1.3 Onboard Software
1.4 Communication Protocols

2 Control Theory
2.1 Dynamical Systems
2.2 Open-loop Controller
2.3 Closed-loop Controller
2.4 PID Controller
2.5 Controlling the Quadcopter

3 Object Tracking
3.1 Problem Description
3.2 Computer Vision
3.2.1 Template Matching
3.2.2 Color Detection
3.2.3 Feature Detection and Matching
3.2.4 Cascading Classifiers
3.2.5 Motion Estimation
3.3 Selected Approach
3.3.1 Tracking
3.3.2 Detection
3.3.3 Integration
3.3.4 Learning

4 Method Integration
4.1 Scale Estimation

5 FollowMe Application
5.1 Software Design
5.2 Third Party Software
5.3 Inner Structure of the FollowMe Application

6 User Experience
6.1 Installation
6.2 Connecting to the Quadcopter
6.3 Tracking Object

7 Experiment
7.1 Settings
7.2 Results

Conclusion

Bibliography

List of Abbreviations

Attachments

Introduction
For a couple of years we have seen a growing interest in developing and utilizing unmanned aerial vehicles (UAVs). They have found many applications
such as search and rescue, aerial surveillance, military operations and filmmaking, just to name a few. Historically, UAVs were, and sometimes still are, called
drones because they often performed primitive and repetitive tasks. More difficult
tasks needed the assistance of a human operator. Nowadays, growing importance
is being assigned to creating unmanned and more autonomous robots
capable of performing even more demanding tasks. In this thesis we attempted
to carry out the task of autonomous object tracking and following.
In this thesis we chose one specific robotic platform, the Parrot AR.Drone
2.0 [8] quadcopter (see Figure 1), from the several available
families of UAVs to help us assess the studied techniques in practice. One of the
biggest advantages of this robotic platform is its ease of use and its flying characteristics, which are similar to those of conventional helicopters. Unlike helicopters, quadcopters are equipped with four horizontally aligned rotors. Quadcopters take off and land vertically, hover in place and are able to fly any curve
in 3D space. Together with their small size they are especially suitable for flying
indoors and in dense environments. Quadcopters are inherently unstable and require sophisticated stabilizing control assistance to keep them manageable. This
advanced control system is implemented in the AR.Drone 2.0 internally. Apart
from this basic control assistance, the AR.Drone 2.0 quadcopters are equipped
with several sensors, allowing them to fly autonomously.

Figure 1: Parrot AR.Drone 2.0

Problem Statement and Related Works


The goal of this thesis was to develop an application that would control a quadcopter in flight and would allow it to follow an object of interest autonomously
in an unknown environment.
Several similar projects have been implemented or are still under development in various research teams, academic or private. A team led by Krajník [5]
at the Czech Technical University in Prague exploited the Parrot AR.Drone
quadcopters and developed a platform for robotic research and education. Engel [2] studied and implemented a method of autonomous camera-based navigation of a quadcopter that incorporates the method of simultaneous localization
and mapping (SLAM). Raffaello D'Andrea [11] unveiled some astounding athletic
powers of quadcopters. He set up a flying machine arena equipped with a high-precision motion capture system and a wireless communication network where
quadcopters flew some very difficult figures. Hrasko [3] studied and implemented
a method of autonomously landing a quadcopter on a target. A team at Arizona State University [13] has been focusing on vision-based GPS-denied object
tracking for unmanned aerial vehicles with some promising results.
Because of the still insufficient computational capabilities of the quadcopter's
hardware, the burden of heavy computations was taken off the quadcopter and
passed on to an external computer. At a high level we could split the problem
we faced into four tasks.
The first task was to develop a functional, lightweight and simple platform
for communication with the AR.Drone 2.0 quadcopter from a ground station.
This would allow us to easily test and assess the studied techniques in practice.
The second task consisted of extracting information from the quadcopter's
onboard sensors, interpreting this information, filtering it and processing
it through the application logic. In particular, we needed to feed sensory data into a controller that would be able to navigate the quadcopter through the
environment.
The third task consisted of exploring the field of computer vision and incorporating some of its techniques into the application. In particular, we were interested
in the possibilities of object tracking from the front camera of the quadcopter.
Further, we had to resolve the problem of scale estimation.
The fourth task was to combine results of all the individual parts mentioned
above into a model that would be responsible for controlling the quadcopter
and pursuing the mission of object tracking, which is the main objective of this
thesis.

Thesis Outline
In Chapter 1 we will introduce the Parrot AR.Drone 2.0 robotic platform in greater
detail. In particular, we will give an overview of the mechanical construction
of the quadcopter, describe all its onboard sensors, hardware and software parts
and the means by which it is possible to remotely control the quadcopter.
Chapter 2 then gives a brief description of control theory with
a focus on flying quadcopters. A special treatment will be given to the proportional-integral-derivative controller (PID controller for short). Subsequently we will
delve into the realm of computer vision, object detection and object tracking
in Chapter 3, overview several methods and algorithms and eventually describe the approach to object tracking and detection adopted in this thesis.
In Chapter 4 we will describe how the output from the object tracking
algorithm is connected to the input of the controller. We will also describe a method
of scale estimation.
Afterwards, in Chapter 5, we will describe our application incorporating
the ideas studied in previous chapters and comment on the design decisions
and application logic.
In Chapter 6 we will present the program from the user's perspective and describe the user experience.
In the final Chapter 7 we will present results of an experiment we conducted
in order to assess the implemented algorithms.

1. Robotic Platform

1.1 Parrot AR.Drone

The Parrot AR.Drone was initially meant to serve as a toy for augmented reality
games. The quadcopter was supposed to be controlled by a human operator with
a smart phone or a similar device. However, due to its financial accessibility
it turned out to be an interesting platform for various research projects as well.
In our thesis we will exploit some of the possibilities it opens up. First, we will
take a look at the quadcopter's mechanics, then we will see what kind of hardware
and software it has as well as all of its onboard sensors.

1.2 Technical Parameters

The Parrot AR.Drone 2.0 quadcopter is propelled by four electric rotors with
a power of 14.5 W each. These four rotors are mounted at the corners of a cross supported by a lightweight carbon-fiber construction, which lends it resistance against
mechanical wear. Each rotor has a dedicated 8 MIPS AVR microcontroller
for better flight characteristics. The quadcopter is capable of flying at speeds
of up to 10 m/s. The weight of the whole device is 380 g or 420 g, depending
on the hull of the quadcopter, which is designed either for indoor or outdoor flight.
Depending on the thrust given from each rotor, the quadcopter can be manoeuvred in several directions. The first and the easiest one is when all rotors
are rotating at the same speed. Then, at least in the ideal case, the quadcopter is either hovering in place, ascending or descending, depending on whether
the thrust is sufficient to overcome the gravitational pull. Apart from
that, the quadcopter can rotate around its vertical axis. This can be achieved
by simply increasing the speed of either of the two pairs of diagonally opposite rotors and decreasing the speed of the second pair of opposite rotors, which results
in either clockwise or counterclockwise rotation. Furthermore, the quadcopter
adjusts its pitch or roll by adding more power in one rotor and decreasing power
in the opposite rotor, tilting the quadcopter and flying in the desired direction.
The rotational angles are schematically depicted in Figure 1.1.
The quadcopter's internals consist of a 1 GHz 32-bit ARM Cortex A8 processor with 1 Gbit of DDR2 RAM running at 200 MHz. Additionally, a Wi-Fi b/g/n
device allows the creation of an ad-hoc network for external connection and communication with the quadcopter. The system is running GNU/Linux with
kernel version 2.6.32 with a versatile command line program BusyBox. The control program operating the quadcopter is distributed as a binary executable. For
faster video processing the quadcopter has a dedicated digital signal processing
(DSP) unit TMS320DMC64x running at 800 MHz. The quadcopter is controlled
via WLAN through a set of commands sent from the ground to the quadcopter.
The quadcopter acts upon the environment with its rotors. On the other
hand, it has various sensors by which it perceives the surrounding environment.
It is equipped with two cameras. The frontal HD 720p camera has a resolution of
1280 × 720 pixels running at 30 fps with a wide-angle lens of 92 degrees. However,
one of its downsides is the fact that the images are subject to significant
distortions, and especially while in flight the images are heavily blurred.

Figure 1.1: The pitch, roll and yaw angle terminology
Apart from the frontal camera there is also a camera facing downwards. It has
a resolution of 480 × 360 pixels and is running at 60 fps. This camera is utilized
internally by the Parrot AR.Drone 2.0 onboard software to enhance the quadcopter's stability and resilience against drift. Unfortunately, experiments have
shown that this ability depends heavily on the texture of the surface. In poor
conditions the quadcopter is subject to unwanted drift.
The quadcopter is also equipped with a 3-axis gyroscope, which is able to
measure the pitch, roll and yaw angles of the flying quadcopter at rotational
speeds of up to 2000 deg/s.
The onboard 3-axis accelerometer measures acceleration in all three directions
with a precision of up to 0.05 g.
For more accurate measurements the quadcopter is also equipped with a 3-axis
magnetometer with a precision of up to 6 degrees. Results from the magnetometer
are combined with those from the gyroscope yielding more accurate results.
To estimate the current flying altitude, the quadcopter utilizes two onboard
sensors. The first one is an ultrasound sensor suitable for measuring altitude
just above the ground (heavily exploited in the takeoff and land manoeuvres).
The second sensor is a pressure sensor which aids in measuring the altitude
of a quadcopter several feet above the ground where the ultrasound sensor does
not give reasonable estimates. The precision of this sensor is up to 10 Pa. Basically,
the maximum altitude is limited only by the reach of the Wi-Fi signal.
The standard battery provided along with the Parrot AR.Drone 2.0 has enough
capacity for about ten minutes of flight.

1.3 Onboard Software

One of the main tasks of the onboard software is to stabilize the quadcopter
and keep it manoeuvrable, collect sensory data and make it easier to control
the quadcopter with simple, high-level commands such as takeoff or land. It is not
a truly embedded system in the sense that it does not have a real-time operating
system (RTOS) with hard deadlines and especially the communication channel
does not allow for a system with deadlines. As we mentioned before, the easiest
way to communicate with the quadcopter is to connect to its unencrypted access
point on address 192.168.1.1. Once this connection is established, communication with the quadcopter is led over several channels with a predefined set of
operations.
Parrot provides an SDK for free allowing third party developers to program
their own applications controlling the quadcopter. Unfortunately, at least in the
author's opinion, the SDK is cluttered with too many redundant and badly documented parts. Therefore we resorted to writing our own software,
rather than using the official SDK.
The quadcopter is started by simply plugging in the battery. After the onboard operating system gets booted, it forks a new process responsible for controlling the quadcopter, sets up an ad-hoc unsecured wireless network and waits
for a client to connect. A set of four LEDs, one on each rotor, serves as an indicator of the inner state of the quadcopter. The quadcopter is ready only when all four LEDs
are green (anything else deserves the attention of a human
operator; one possible and common cause might be a low battery). After we obtain
an IP address from the quadcopter's DHCP server we can start communicating
with the controller on four possible channels.

1.4 Communication Protocols

The four communication channels are:

- the navigation channel,
- the command channel,
- the video channel,
- the control channel.
The navigation channel is an unreliable UDP connection running on port 5554.
After sending a packet with a simple predefined bit sequence on the navigation
port the quadcopter starts sending packets loaded with sensory data in regular
intervals, approximately every 65 ms. Each packet basically contains data from
every sensor, giving us e.g. the orientation measured in pitch, roll and yaw angles,
the speed in all three directions in cm/s, the altitude in cm, the battery state, and the quadcopter's internal state as a bit field containing various information (e.g. the result of
the last action, or whether it is hovering or flying).
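To illustrate the navigation channel, the following sketch (in Python) opens the UDP socket and triggers the stream. The trigger bytes and the assumed layout of the packet header are taken from common AR.Drone client implementations and should be checked against the SDK documentation; this is a sketch, not the implementation used in the thesis.

import socket
import struct

DRONE_IP = "192.168.1.1"
NAVDATA_PORT = 5554

# Open the unreliable UDP navigation channel.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", NAVDATA_PORT))
sock.settimeout(1.0)

# A short wake-up packet on the navdata port makes the quadcopter start
# streaming sensory packets roughly every 65 ms.
sock.sendto(b"\x01\x00\x00\x00", (DRONE_IP, NAVDATA_PORT))

packet, _ = sock.recvfrom(4096)

# The payload begins with a header word, the drone state bit field and
# a sequence number (assumed layout; see the SDK for the full format).
header, state, seq = struct.unpack_from("<III", packet, 0)
print("state bit field: 0x%08x, sequence: %d" % (state, seq))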
For the opposite flow of information, from the controller to the quadcopter,
a dedicated command channel is set up on port 5556. Again, it is an unreliable
UDP channel. Through this channel it is possible to navigate the quadcopter.
There may be more than one command sent in a single packet. Each command
must contain a non-negative serial number, which must increase by at least one,
or can be reset by sending a command with serial number set to zero. Otherwise
the command is discarded by the quadcopter. This guarantees that no
two commands will be executed in reverse order compared to the order in which they were
sent, and it overcomes the inherent nondeterministic nature of
the UDP communication protocol at least to some extent. Along with the serial
ID the command can carry more informative data.
Some basic commands are for example the takeoff and land primitives. Then
there are the pitch, roll and yaw angles. These angles are regulated by sending
a value mapped onto the real interval [-1, 1] for each angle separately. Negative
values in the pitch angle instruct the quadcopter to tilt forward, in the roll angle
to tilt left, and in the yaw angle to rotate counterclockwise. Similarly, the opposite
directions are regulated by positive values. Between the extreme values of -1 and
1 the behaviour is approximately linear for all three angles, allowing for a more
fine-grained control over the flight characteristics of the quadcopter.
The quadcopter's flying altitude is regulated in the same way, by sending an instruction with an intensity in
the range [-1, 1]; negative values command a descent.
It is also necessary to send commands regularly or at least to send an indicator
of an ongoing connection, otherwise the quadcopter might consider the connection
as lost. Precisely for this reason there is a noop command (internally called
a watchdog command).
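The following sketch illustrates the command channel: every ASCII command carries an increasing sequence number, the [-1, 1] intensities of a progressive move command are encoded as the bit pattern of a 32-bit float, and a keep-alive is sent for the watchdog. The AT*REF argument values are assumptions and should be taken from the SDK documentation rather than from this sketch.

import socket
import struct

DRONE_IP, CMD_PORT = "192.168.1.1", 5556
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
seq = 1  # must increase by at least one with every command

def send(command, *args):
    """Format one AT command with the next sequence number and send it."""
    global seq
    payload = "AT*%s=%d%s\r" % (command, seq,
                                "".join("," + str(a) for a in args))
    sock.sendto(payload.encode("ascii"), (DRONE_IP, CMD_PORT))
    seq += 1

def f2i(x):
    """Encode a float from [-1, 1] as the integer with the same bit pattern."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

send("COMWDG")                               # watchdog / keep-alive
send("REF", 290718208)                       # take off (assumed SDK constant)
send("PCMD", 1, f2i(0.0), f2i(-0.2), f2i(0.0), f2i(0.0))  # tilt forward gently
send("PCMD", 0, 0, 0, 0, 0)                  # hover again (compensate the move)
send("REF", 290717696)                       # land (assumed SDK constant)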
Each complex command performing an aerial manoeuvre must be compensated for, because the quadcopter keeps performing the last action received. Therefore, at the end of each complex command, it is necessary to send a hover command. A scenario where we lose connection with the quadcopter and the last
command the quadcopter received was a command to move forward at full speed
will most probably result in a crash.
On top of the TCP stack a video channel runs on port 5555. The stream
continuously transmits either the front view or the bottom view, see Figure 1.2
for a front camera view and Figure 1.3 for a bottom camera view. Switching
between individual modes is simply a matter of sending a particular command
to the quadcopter indicating which mode to activate. For our purposes it is quite
natural to focus on the front camera. Due to some legal issues, the video stream
is encoded with a proprietary format, which is fortunately somewhat similar to
the H.264 codec and can be decoded using standard software packages (OpenCV [6] in our case) without
any additional intervention.
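A minimal sketch of reading the video channel, assuming an OpenCV build with FFmpeg support that is able to open the TCP stream directly; whether this works out of the box depends on the codecs available on the ground station.

import cv2

# The video channel is a TCP stream on port 5555; with FFmpeg support
# OpenCV can usually decode it directly.
capture = cv2.VideoCapture("tcp://192.168.1.1:5555")

while True:
    ok, frame = capture.read()      # one decoded front- or bottom-camera image
    if not ok:
        break
    cv2.imshow("AR.Drone video", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
cv2.destroyAllWindows()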
Finally, through the control channel it is possible to view some of the internal
settings and information about the quadcopter, such as the total flying time,
the firmware version, the maximum tilt and many others.
The exact format of each command is documented in the official Parrot
AR.Drone 2.0 SDK [9].


Figure 1.2: The front camera view

Figure 1.3: The bottom camera view


2. Control Theory
First, we will give a summary of some relevant terms frequently used in the field of
control theory. For a comprehensive introduction to control theory and feedback
systems especially, see Murray [1].
Control theory is a branch of engineering and mathematics that deals with
the behaviour of dynamical systems.
A dynamical system is a system that changes state over time according to some
fixed rule. At any given time a dynamical system has a state in some appropriate
state space.
A physical system is affected by some input values, some of which we have
control over. The system generates some system output values. Output values are perturbed by random disturbances. In control theory we would like to
minimize some error between our current state and a desired state or setpoint.
Methods conforming to our description may be compared by the speed at which they reach
the desired setpoint and generally the method that reaches the setpoint fastest
is preferred.
Control theory provides tools for studying and analyzing the behaviour of the
quadcopter and will reveal how to control (give commands to) the quadcopter.

2.1 Dynamical Systems

To keep the system stable we needed a controller. In general, the inputs and outputs of a dynamical system are related by differential equations. The quadcopter
demands precise and real-time control software. Fortunately, the Parrot AR.Drone
2.0 comes with an onboard controller, relieving the user of writing complex
software controlling the rotational speed of the four rotors. The model of the
dynamical system of the quadcopter is depicted in Figure 2.1, where we see how we
can affect the system output values by regulating the right system input values.
The quadcopter has six degrees of freedom in total: three represent the position
in space and the remaining three are the pitch, roll and yaw angles. The
main task was to implement a controller that would execute instructions to reach
a desired setpoint (e.g. fly forward 1 m).

2.2 Open-loop Controller

Open-loop controllers lack any feedback from the environment. They are useful
in situations where the disturbances that may affect the outcome of the system
are negligible and don't affect the system's overall behaviour, or in case the system
has some relaxed conditions on its position in the state space. For a block diagram of an open-loop controller, see Figure 2.2. An open-loop controller for the
quadcopter could be simulated by some function or a table indexed by errors and
returning a corresponding action. This approach has several deficiencies. First
of all, it is impossible to cover all possible errors. Second, the system's output
does depend on the surrounding conditions (e.g. wind affects the behaviour of
the quadcopter). The third reason is that AR.Drone 2.0 units of the same production
behave differently under the same conditions. This fact may be attributed to
mechanical wear, a chipped rotor blade, etc.

Figure 2.1: Dynamical system of the quadcopter [5]
(Block diagram: Setpoint → Controller → Control Commands → System → Quadcopter state (position, velocity, ...).)
Figure 2.2: Open-loop controller

2.3 Closed-loop Controller

A closed-loop controller overcomes the limitations of an open-loop controller by
using feedback from the environment to control the input values and thus the
output values of the system, see Figure 2.3. For example, suppose we would like
to reach some cruising speed starting from a steady state. We need to instruct
the quadcopter by sending it some commands that would affect its speed appropriately. Along with our instructions going from the ground controller to the
quadcopter, the quadcopter sends back sensory data approximately every 65ms
which we can subsequently feed into the closed-loop controller. Even though
we are just sampling a continuous function of time, if the interval between two
consecutive sensory data packets is small enough for a particular application,
then the closed-loop controller will be able to adapt in a timely manner to the development
of the quadcopter's speed and control the quadcopter with the right instructions
to eventually reach the desired speed.

(Block diagram: the setpoint and the measured output are compared to form the measured error; the controller turns the error into the system input, the system produces the system output, and sensors feed the measured output back.)
Figure 2.3: Feedback control loop

2.4 PID Controller

The PID controller is a widely used closed-loop controller. The PID controller has
three separate control parameters: the proportional, the integral and the derivative terms. For a block diagram of a PID controller, see Figure 2.4. The proportional, integral and derivative terms are functions of time and their interpretation
is the following:
The proportional term corresponds to the current error e(t) of the system
and is responsible for diminishing this error (difference between the setpoint
and the current point). The proportional term is always required in a PID
controller.
The integral term represents the past errors $\int_0^t e(\tau)\,d\tau$ and is responsible for
eliminating steady state errors. For a biased system a controller with no
integral part may stabilize the system near the steady state, but never reach
the steady state. One drawback of the integral term is the fact that it may
slow down the process of reaching a setpoint.
The derivative term is a prediction of the future error $\frac{d}{dt}e(t)$. Its key role is
avoiding oscillations of the system around the steady state. Its magnitude
is a function of the rate of change of the error in the system.

(Block diagram: the setpoint and the system output form the error, which feeds the P, I and D terms; their weighted sum becomes the system input.)

Figure 2.4: PID controller block diagram


The weighted sum of these terms is then used to adjust the system actuators
to get the right output value out(t) at time t. The behaviour of a PID controller
is governed by the following equation:

$$\mathrm{out}(t) = C_P\, e(t) + C_I \int_0^t e(\tau)\, d\tau + C_D\, \frac{d}{dt} e(t).$$
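A minimal discrete-time sketch of the controller defined by this equation, assuming that error samples arrive every dt seconds (roughly 65 ms for the navdata stream); the gains shown are placeholders, not the constants used in the thesis.

class PIDController:
    def __init__(self, cp, ci, cd):
        self.cp, self.ci, self.cd = cp, ci, cd
        self.integral = 0.0
        self.previous_error = None

    def update(self, error, dt):
        """Return the control output for the current error sample."""
        self.integral += error * dt                          # accumulated past error
        derivative = 0.0
        if self.previous_error is not None:
            derivative = (error - self.previous_error) / dt  # predicted error trend
        self.previous_error = error
        return (self.cp * error
                + self.ci * self.integral
                + self.cd * derivative)

# Example: regulate the forward motion towards a setpoint 1.5 m away.
pid = PIDController(cp=0.5, ci=0.05, cd=0.3)   # placeholder gains
command = pid.update(error=1.5, dt=0.065)      # clamp to [-1, 1] before sending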
There are several techniques for setting the correct $C_P$, $C_I$ and $C_D$ constants.
These constants set the magnitude of each of the terms in the PID controller
and are specific for each application of the PID controller in a given environment. Figure 2.5 shows the behaviour of the system with the correct constants.
The desired setpoint was one and a half meters away from the quadcopter.
We found the constants by means of trial and error, although there are
existing techniques that assist in automating the search for these constants. We are
optimizing for time, meaning we would like to minimize the time it takes to
reach the desired setpoint within a small tolerance (we cannot expect to reach
the goal precisely). Another important note: it is not enough to reach the goal,
but also to stay there and to stabilize the system. In other words, we don't want to
keep overshooting. Overshooting can be demonstrated very well on a quadcopter
when searching for the correct CP , CI and CD constants. The quadcopter would
oscillate a few times around the desired setpoint when the proportional term is
too large and the derivative term is too small, see Figure 2.6.

(Plot: error [m] on the vertical axis against time [s] on the horizontal axis, shown as discrete samples.)

Figure 2.5: The PID controller navigating the quadcopter to a desired setpoint.
The sharp drop in the graph is due to our specific implementation of the PID
controller. When the quadcopter is close to the desired setpoint (taking into
consideration the precision of the sensory data we feed into the controller) and
its speed is low (it will not overshoot the setpoint), we break out of the control
loop.

(Plot: error [m] against time [s], discrete samples.)

Figure 2.6: PID controller overshoots the setpoint several times

2.5 Controlling the Quadcopter

In controlling the quadcopter we limited it to only five degrees of
freedom: we kept its roll constant, see Figure 2.7. Theoretically, the quadcopter's
3D space exploration possibilities were not diminished (any movement possible before
could be done now as well).
The best imaginable tracking algorithm would exactly replicate the movements made by the tracked object. That is, however, feasible only in some idealized conditions with
no obstacles nearby. We decided to limit the maneuverability of the quadcopter
for two practical reasons. The first reason stemmed from the fact that it was very
difficult, at least on a small scale, to determine whether the tracked object moved
on a straight line, or whether it moved on an arc of a circle with its center exactly
at the quadcopter's current position. In our reference replicating algorithm this
means we don't know what the most appropriate decision would be: whether to
move on a straight line or to move on an arc. The second reason is that by trying
to replicate the moves made by the tracked object, we could hit an obstacle and
hitting an obstacle often results in the loss of control over the quadcopter and
subsequent crash landing. Therefore we decided to implement only the simplified
model, where the quadcopter cannot change its roll.


Figure 2.7: An example illustrating the limited maneuverability of the quadcopter, after disabling its roll (right), compared with the ideal mirroring algorithm
(left).


3. Object Tracking
In robotics and elsewhere, computer vision, object recognition and object tracking
have been subject to a lot of active research with many different approaches and
interesting results. First, we will describe the problem we are trying to solve, then
we will present some of the approaches that were conceivable for the purpose
of our application and finally we will demonstrate the application of one such
novel approach in this thesis.

3.1 Problem Description

Formally stated, the input consists of a finite sequence of matrices $(M_0, \ldots, M_n)$
for some $n \in \mathbb{N}$ (a single matrix represents a single image in the video stream
obtained from the quadcopter and its entries represent individual pixels) and
an initial subset of selected entries from the matrix $M_0$, which represents the
object of interest. The task is to follow this object through the series of images
and for each image $M_i$ output a subset of entries of $M_i$ corresponding to the
object of interest (perhaps with changed appearance) from image $M_0$.
The biggest issue with this problem description is the fact that the object of
interest is not well defined except in the initial image. Evidently, this problem
has a close connection to (binary) classification studied in the field of machine
learning [12] to classify the object of interest from the background. Similarly as
in machine learning, we will make several simplifying assumptions about the nature of the object of interest in order to diminish the dimension of the hypothesis
set and to obtain at least partially satisfactory results.

3.2 Computer Vision

Computer vision is a vast field attempting to tackle several difficult problems. It
basically tries to describe the world from images and reconstruct its properties,
such as shape, illumination, and color distributions. Despite the fact that humans
are able to reconstruct the world from images almost effortlessly, it is a very
challenging task for computers. It has wide ranging applications such as machine
inspection, medical imaging, surveillance, 3D modelling or object recognition and
tracking, just to name a few. The application mentioned last, object recognition
and tracking, will be studied in the following sections and finally applied in
this thesis. For a comprehensive treatment of computer vision consult the book
Computer Vision by Szeliski [16].
There are several problems object recognition and tracking algorithms have
to cope with. One of them is the problem of appearance variations of the target
such as shape deformation, scale changes, illumination change or camera motion.
Objects are sometimes occluded or even completely out of the boundaries of
the image, therefore it is necessary for the tracking algorithm to be able to redetect the target independently of any previous sequence of images processed
so far. The video sequence we extract from the quadcopter is also corrupted by motion
blur and video compression noise. For our application we also require that
the method is capable of real-time image processing.


Now we will describe several different approaches we tried to apply to the problem at hand.

3.2.1 Template Matching

Template matching is a simple technique for finding patches in an image that
match a given template. It works by sliding the given template along the image
and calculating an error as the sum of differences between corresponding pixels
for every position and finally choosing a single position in the image that minimizes the accumulated error. It is possible to enhance this simple method in
several different ways. By scanning the image with a scaled template we can
find the object even if its size changed. An alternative is to scale the image first
(and accordingly the template as well), select plausible candidates and afterwards
search in the neighbourhood of these candidates in the original image for the best
fit. Taking this approach one step further and repeating this operation several
times, we obtain the method of image pyramids, depicted in Figure 3.1. The main
drawback of this method is the fact that it is computationally expensive and weak
at adapting to changes of appearance of the object of interest. Therefore, this
method was rejected in an early stage of our work.

Figure 3.1: Image pyramids. A method that searches for a template in a given
image in a top-down fashion from the coarsest resolution to the most fine-grained resolution (possibly the original image).
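For illustration, a sketch of plain template matching with OpenCV using the squared-difference criterion described above; the multi-scale (pyramid) refinement is omitted and the file names are hypothetical.

import cv2

image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the image and accumulate squared differences.
scores = cv2.matchTemplate(image, template, cv2.TM_SQDIFF)

# The position minimising the accumulated error is the best match.
min_val, _, min_loc, _ = cv2.minMaxLoc(scores)
h, w = template.shape
best_match = (min_loc[0], min_loc[1], w, h)   # x, y, width, height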

3.2.2 Color Detection

Another elementary approach to object detection is to base the algorithm on color information contained in images. For the purpose of color detection it is useful
to convert the images from the RGB color space to the HSV (hue, saturation, value) color space first. Hue is a convenient numerical representation for discerning
between different colors (for a detailed discussion of color spaces consult [16]). The input of the color detection algorithm is an image
along with an acceptable range of hue values. The complement to the given color
range in the color space is filtered out from the image. The output of the algorithm is a binary image, where one is assigned to pixels lying in the given range
and zero is assigned to every other pixel. In Figure 3.2 we see a demonstration
of this method while detecting an orange ball. This binary image is afterwards
processed by a chain of morphological operations. These morphological operations process the images based on shapes. Their main contribution is in removing
noise that is common when doing color detection, isolating individual elements
and joining disparate elements into one.

Figure 3.2: An illustration of color detection. The algorithm correctly filtered
out the parts of the image not containing the orange ball.
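A sketch of the HSV filtering and morphological clean-up described above; the hue range given here is only a rough guess for an orange ball and would have to be tuned to the actual lighting.

import cv2
import numpy as np

frame = cv2.imread("frame.png")
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Keep only pixels whose hue/saturation/value fall into the accepted range
# (rough orange range; the exact bounds depend on lighting and camera).
mask = cv2.inRange(hsv, np.array([5, 100, 100]), np.array([20, 255, 255]))

# Morphological opening removes small noise, closing joins nearby blobs.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)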

3.2.3 Feature Detection and Matching

In the field of computer vision, features in an image represent something interesting, unique, something abstracting the information contained in the image.
Some examples are keypoint features (e.g. corners) that are described by the
appearance of a point neighbourhood. Other examples constitute edges, lines,
patches. Figure 3.3 illustrates features detected by a feature detection algorithm.
In our case the method works as follows. First, we present a feature detection
algorithm with an image containing our object of interest. The algorithm then
tries to find and register these interesting features; this phase is called feature
description, as the algorithm tries to find a compact and expressive representation of the selected features, e.g. as a vector of numbers. In subsequent images
it tries to find features using the same method and afterwards runs a keypoint
matching algorithm, that tries to pair up the corresponding features. There exist
different methods for feature detection, some of which we tested in our application, namely the SIFT (scale-invariant feature transform), SURF (speeded-up
robust features) and BRIEF (binary robust independent elementary features) feature detectors. These algorithms are implemented and well documented in the
OpenCV [6] computer vision software library.

These algorithms gave better results than the methods of template matching
and color detection, as they are robust to scale and rotation, but they still lost
the object way too often, so we sought a better solution.

Figure 3.3: a) An image with detected features (colored circles). b) A template
with detected features. c) A keypoint matching algorithm correctly matched the
corresponding keypoints from image a) and b) and subsequently detected the
template in the image.
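For illustration, a sketch of one detector/matcher pair available in OpenCV; ORB, a BRIEF-based detector, stands in here for the detectors named above, and the file names are hypothetical.

import cv2

template = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors in both images.
orb = cv2.ORB_create()
kp_t, des_t = orb.detectAndCompute(template, None)
kp_f, des_f = orb.detectAndCompute(frame, None)

# Match descriptors with a brute-force Hamming matcher and keep the best pairs.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_t, des_f), key=lambda m: m.distance)
good = matches[:30]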

3.2.4 Cascading Classifiers

This is probably the best, yet the most computationally demanding, object detection algorithm we tested. Cascading classifiers, as the name suggests, cascade
or concatenate several classifiers that together form a very good combined classifier. The principle of a cascaded classifier is shown in Figure 3.4. The method
of cascaded classifiers was first used in a face detector with promising results. We
tested this algorithm with a pre-trained classifier for face detection (included in the OpenCV library). The classifiers are built out of basic decision-tree classifiers [12]. These decision-tree classifiers are fed with several features (e.g. edges, lines, patches). It is possible to
create a classifier for any object. The classifier is pre-trained with a few hundred
samples of a specific object scaled to a fixed size and an even larger set of negative samples, i.e. arbitrary images not containing the specific object. Basically the
only reason why we didn't embed this method into our thesis is the fact that it
is difficult to train a new object detector and it may take several hours or even
days to train a good classifier.
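A sketch of running such a pre-trained face cascade with OpenCV; the path to the cascade file depends on the installation.

import cv2

# Pre-trained frontal-face cascade shipped with OpenCV (path is installation dependent).
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# The classifier scans the image at several scales; each hit is a bounding box.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)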

Figure 3.4: Schematic diagram of the decision process of a face cascaded classifier

3.2.5 Motion Estimation

One serious drawback of the previously discussed methods is their ignorance of
any form of correlation between successive images. Because we expect to track
an object of interest that moves continuously and perhaps will not get occluded
or lost too often, it is reasonable to incorporate into our search an element that
would provide us with a probabilistic distribution over the image plane that could
help us narrow down the search. Motion estimation is the process of linking
successive images by creating motion vectors describing the transformation made
by individual pixels, as illustrated in Figure 3.5. The method is also popular in
several video compression formats, e.g. in the MPEG family of video standards.
Optical flow, which is the study of apparent motion estimation of objects,
makes two basic assumptions. The first assumption is that the projection of
the same point in the real world on the image plane is the same in every frame
(this is the brightness constancy assumption). The second assumption is that
neighbouring pixels show similar motion (this is the spatial coherence constraint).
One of the best known approaches to motion estimation is the Lucas-Kanade
differential method for optical flow [16].
With these two simplifying assumptions in mind, let us suppose that on a small
neighbourhood of a pixel at position (x, y) the motion vector of the image pixels
was (u, v). This motion vector is unknown and is something we would like to
find.
The brightness constancy assumption tells us that I(x, y, t) = I(x+u, y+v, t+
1) in two consecutive frames, where I(x, y, t) is the intensity of a pixel at position
(x, y) at a given frame in time t. If we take the first-order Taylor expansion of
I(x + u, y + v, t + 1), we get the following approximation:
$$I(x+u,\, y+v,\, t+1) \approx I(x,y,t) + \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}.$$

This can be rewritten as:


$$I(x+u,\, y+v,\, t+1) - I(x,y,t) \approx \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}.$$

Now taking into account the brightness constancy assumption, we get the following approximation:
$$0 \approx \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t},$$

which can be written in vector form as

$$0 \approx \nabla I^{T} \begin{pmatrix} u \\ v \end{pmatrix} + \frac{\partial I}{\partial t}.$$
This problem is ill-posed, because we have two unknowns (u, v), but just one
equation. To see why it is not enough to have just this single equation, one can
imagine a situation where points on a vertical edge all moved at the same time
in the same (horizontal) direction. Having only this one equation we are unable
to discern whether there was any further vertical motion of the points along the
edge. Therefore, we need to impose additional constraints. Here comes into play
the second assumption we stated earlier, the spatial coherence constraint.
For example, if we take a $5 \times 5$ window around each pixel, we obtain a system
of 25 equations per pixel:

$$0 \approx \nabla I(p_i)^{T} \begin{pmatrix} u \\ v \end{pmatrix} + \frac{\partial I(p_i)}{\partial t}, \quad i \in \{1, \ldots, 25\}.$$
Hence the system becomes over-constrained. Several methods exist for solving
this problem approximately. One common approach from linear algebra is to
solve this system of over-constrained equations by the least-squares error approximation. Exactly this approach is applied in the Lucas-Kanade optical flow
tracker. To deal well with larger movements in the video stream, the Lucas-Kanade optical flow method uses image pyramids and estimates
the motion of objects from coarser to finer grain. Furthermore, there are several
additional problems that need to be resolved in order to get a reliable tracker,
e.g. some points are suitable for estimating the motion from the set of equations
stated above, while other points may be less suitable. Therefore, before the actual tracking, the Lucas-Kanade method tries to pick the right keypoints to track
in advance. Readers interested in additional details are advised to look into the
book by Szeliski [16] or at the implementation of the Lucas-Kanade optical flow
tracker in the OpenCV library [6].
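A sketch of the pyramidal Lucas-Kanade tracker as exposed by OpenCV: suitable keypoints are picked in the first frame and their displacements are estimated in the next one; the parameter values and file names are illustrative only.

import cv2

prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Pick points that are well suited for tracking (corners with strong gradients).
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                 qualityLevel=0.01, minDistance=7)

# Estimate where those points moved to, using image pyramids for large motions.
new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                 points, None,
                                                 winSize=(15, 15), maxLevel=3)

# Motion vectors of the successfully tracked points.
flow = new_points[status.flatten() == 1] - points[status.flatten() == 1]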
The method of optical flow and motion estimation alone gives poor results
because the camera is not stationary and moves along with the quadcopter, making it nearly impossible to discern between motion of the object and motion of
the quadcopter. An even more critical flaw of this method is the fact that it
cannot bootstrap itself after the object of interest temporarily disappears from the
scene.

3.3 Selected Approach

Finally, after several unsuccessful attempts with the methods described above,
we adopted a novel approach to object tracking from video footage: the tracking-learning-detection (TLD) algorithm first introduced by Kalal [4]. This method
decomposes the task of long-term object tracking into three subtasks: tracking,
learning and detection, hence the name TLD. The method accepts the fact that
tracking and detection become error-prone when operating on their own, but they
can be combined to form a pair that supports each other. The tracker can
provide the detector with learning data in real time and the detector can reinitialize
the tracker in case the object gets lost. Finally, the learner estimates the errors
made by the detector and updates its model to avoid these errors in the future.
In the following sections we will give an overview of the methods employed in this
approach and see how it cleverly combines several ideas described above.

Figure 3.5: Motion vectors computed by the Lucas-Kanade optical flow method.
a) position of hand before, b) position of hand after, c) computed optical flow.

3.3.1 Tracking

Tracking is the process of estimating the motion of an object in consecutive frames
under the prerequisite that the position of the object was known in the previous frames. Tracking algorithms characterize an object by its state, which might
comprise e.g. its location and shape. The objects are represented by a model.
Generative models represent the object regardless of its surroundings. Discriminatory models are concerned with the differences between the object and its
environment. Trackers require only a single initialization step and are usually
very fast and produce smooth trajectories. One downside of the trackers is the
fact that they tend to accumulate error and start to drift away from the real
trajectory of the tracked object. Or, even worse, they completely fail and lose
the object. However, we will see that a good object detector can reinitialize
the tracker.
Trackers have several options for representing the state of an object:

- Points: the tracker estimates the translation of the object.
- Geometric shapes, e.g. bounding boxes (used in TLD).
- Contours, for representing non-rigid objects.
- Articulated models, representing objects consisting of several rigid parts.
- Motion fields.
For an example of these representations of the same object, see Figure 3.6.

Figure 3.6: Various state representations of an object. a) The cyclist is represented as a single point. b) The cyclist is now a single shape. c) Contour representation
may adapt to changes in shape. d) Representation of the cyclist by several rigid
parts. e) Representation of the cyclist as several motion vectors.
Generative trackers that represent the state by a bounding box search for
a rectangle in an image that best matches the model. The most primitive of
these techniques is the template tracking method. A bounding box generative
tracker is incorporated inside the TLD algorithm. The tracking algorithm uses the
Lucas-Kanade method for optical flow, searching for the most likely displacement
of the object of interest.
Discriminatory trackers often build a binary classifier that distinguishes an object from the background. Static discriminatory trackers are basically offline
trained classifiers. Static discriminatory trackers have been successfully deployed
for example in face tracking. Adaptive discriminatory trackers on the other hand
require no offline training and on each frame they perform an update of the
classifier.
It is also desirable to devise and implement a robust tracking algorithm that
would be able to detect tracking failures. A tracker in a general long-term tracking task builds a point trajectory of the object. We can perform a double-check
based on a forward-backward consistency method which assumes that it is irrelevant whether we follow the timeline of the video sequence or whether we replay
the video in reverse. It works as follows. We take some k consecutive frames (only
two frames in our case, but generally k could be much larger) $(I_t, \ldots, I_{t+k})$ starting
at time t, and in the first frame $I_t$ we select a region of interest $p_t$ (it could be a single pixel). Then we apply a tracking algorithm that hopefully tracks the region
along the k selected images and outputs a forward trajectory $T_F = (p_t, \ldots, p_{t+k})$
of this region. Afterwards, we repeat this exact same procedure, except that we
apply it on the k selected images in the reversed order $(I_{t+k}, \ldots, I_t)$ with position
$\hat{p}_{t+k} = p_{t+k}$ as the initial region of interest. This will give us some backward
trajectory $T_B = (\hat{p}_t, \ldots, \hat{p}_{t+k})$. Finally, we take the $T_F$ and $T_B$ trajectories
and compute the resulting forward-backward error, which is the (Euclidean) distance between $p_t$ and $\hat{p}_t$. The smaller the distance is, the greater the confidence
we have in the tracking algorithm that it was able to track our region of interest
correctly. In case the distance is above some chosen threshold, we may try to
re-initialize the tracker by the detector. The forward-backward consistency check
is depicted in Figure 3.7.
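The forward-backward check can be sketched as follows for k = 1 (two frames), using the same pyramidal tracker as above; points whose error exceeds a chosen threshold would then be discarded.

import cv2
import numpy as np

def forward_backward_error(prev_img, next_img, points):
    """Track points forward and then backward; return per-point FB error.

    `points` is an (N, 1, 2) float32 array of positions in prev_img.
    """
    lk = dict(winSize=(15, 15), maxLevel=3)
    forward, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_img, next_img,
                                                points, None, **lk)
    backward, st_b, _ = cv2.calcOpticalFlowPyrLK(next_img, prev_img,
                                                 forward, None, **lk)
    # Euclidean distance between the original point and where the
    # backward pass ended up.
    error = np.linalg.norm(points - backward, axis=2).flatten()
    valid = (st_f.flatten() & st_b.flatten()) == 1
    return forward, error, valid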
The TLD algorithm incorporates the forward-backward consistency method
in its bounding box tracking algorithm. Inside the bounding box we select points
on a rectangular grid (created by considering all intersections of ten horizontal
and ten vertical equidistant lines inside the bounding box) and independently
track these points (now the fact that the points were selected on a rectangular
grid plays no further role) by a Lucas-Kanade tracker. Then we perform the
forward-backward consistency check on these points independently and assign
a forward-backward error to each point. Half of the points with the largest
error are filtered out. The remaining points then estimate the bounding box
displacement. Namely, the displacement in the x axis of the bounding box is
estimated by taking the median over the displacements in the x axis over the
tracked points and the displacement of the bounding box in the y coordinate
is estimated similarly. Finally, the change in scale is estimated, again, by the
median. For each pair of tracked points we compute the ratio between the current
point distance and the previous point distance in two consecutive images and take
the median over all these ratios. One advantage of this method is its robustness
against partial occlusion as seen in Figure 3.8.
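The median-based update of the bounding box can be sketched as follows, assuming old_pts and new_pts are the reliably tracked points that survived the forward-backward filtering; the recentring of the box under a scale change is our own simplification.

import numpy as np
from itertools import combinations

def update_bounding_box(box, old_pts, new_pts):
    """Shift and rescale an (x, y, w, h) box by the median point motion."""
    dx = np.median(new_pts[:, 0] - old_pts[:, 0])
    dy = np.median(new_pts[:, 1] - old_pts[:, 1])

    # Scale change: median ratio of pairwise point distances before and after.
    ratios = [np.linalg.norm(new_pts[i] - new_pts[j]) /
              np.linalg.norm(old_pts[i] - old_pts[j])
              for i, j in combinations(range(len(old_pts)), 2)]
    scale = np.median(ratios)

    x, y, w, h = box
    return (x + dx - (scale - 1) * w / 2,
            y + dy - (scale - 1) * h / 2,
            w * scale, h * scale)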

Figure 3.7: Forward-backward consistency check method [4]


Figure 3.8: Tracking single points independently inside the bounding box [4]

3.3.2 Detection

Detection is the process of finding an object in a single frame. We are interested in long-term object tracking and because we may lose track of an object
it is important to incorporate a detection mechanism in the long-term tracking
algorithm as well. The object is represented by a model that is continually being
trained during the uptime of the algorithm to adapt it to a possible change of
appearance of the object.
The TLD algorithm incorporates the method of a scalable scanning-windows
detector. It creates a scanning-window of several scales and with each scale it
slides along the image. A single window (picked up by the scanning-windows detector) is further scaled into a patch, which is a 15 × 15 square of pixels. Windows
are scaled this way regardless of their initial size. For each patch the algorithm
decides whether it contains the object or not. However, now we are not comparing
the patch against a known template of the object of interest, but rather against
a whole database of templates (this database contains detected appearances of
the object of interest as well as negative examples). For an example of what this
database may look like, see Figure 3.9.

Figure 3.9: An illustration of a database with positive and negative patches.


The classification process has to be very efficient, therefore the algorithm
cascades three stages of classifiers. In the first stage there is a patch variance
classifier. In the second phase there is an ensemble classifier and in the third stage
there is the nearest neighbour classifier (sometimes called the NN-classifier).
Patch variance is the first classifier in the chain. Its purpose is to filter out bad
patches (e.g. sky, walls) early and with little computation cost. We compute the
gray-value variance of the patch with the object from the initial image and reject
patches whose variance is smaller than 50% of the variance of the initial patch.
If the patch was accepted in the first stage, it is passed on to the second
stage. The second stage is an ensemble classifier and is composed of several base
classifiers (ten in our case). Each base classifier Ck performs a number of pixel
comparisons on a patch resulting in a binary vector x. The pixel comparisons
will be performed only on a limited number of pairs of pixels, because it would be
inefficient to compare each pair of pixels. Each base classifier will perform pixel
comparisons on some subset of all of the pairs of pixels. The pixel comparisons
are done on a patch that is first blurred by a Gaussian kernel to increase the
robustness of the method against noise and shift. Individual positions of pairs
of pixels of a patch on which the comparisons are performed in a given base
classifier are selected at random once at the beginning and stay fixed during
the whole process of object tracking. The pairs are restricted to have the same
either horizontal or vertical coordinates. A single comparison of a pair of pixels
in one patch results in either zero or one. The comparison results in one iff
the two pixels have similar intensities. For an example of base classifiers see
Figure 3.10. With a vector created by concatenating the results of individual
pixel comparisons in a single base classifier $C_k$ we index into a table of posterior
probabilities $P_{C_k}(y \mid x)$, where $y \in \{\text{negative patch}, \text{positive patch}\}$. The patch is
accepted by the ensemble classifier iff it was labeled as positive by at least half
of the base classifiers.

Figure 3.10: Pixel comparisons measured within individual base classifiers. The
three squares correspond to three different base classifiers. Blue lines join pixels
(small blue squares) that are compared in these base classifiers [4].
Each base classifier has a posterior table of probabilities with $2^d$ possible
indices, where d is the number of pixel comparisons and is selected to be relatively
small (d = 13 in the TLD implementation). For each binary vector x indexing
into the posterior probability table in each base classifier the resulting probability
is computed as the number of patches with the same binary vector x previously
accepted by the tracking algorithm divided by the total number of patches with
the same binary vector x ever considered by this base classifier Ck . The base
classifier Ck labels a given patch as positive if and only if the resulting probability
is at least 0.5.
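A simplified sketch of a single base classifier of the ensemble: a fixed, randomly chosen set of pixel-pair comparisons forms a binary index into a posterior table that is updated online. The Gaussian blurring of the patch, the restriction of the pairs to a shared coordinate and the exact comparison rule are omitted or simplified here.

import numpy as np

class BaseClassifier:
    def __init__(self, patch_shape, d=13, rng=np.random):
        # Pixel pairs are chosen at random once and then stay fixed.
        ys = rng.randint(0, patch_shape[0], size=(d, 2))
        xs = rng.randint(0, patch_shape[1], size=(d, 2))
        self.pairs = list(zip(ys, xs))
        self.positive = np.zeros(2 ** d)   # positive patches seen per index
        self.total = np.zeros(2 ** d)      # all patches seen per index

    def code(self, patch):
        """Pack the binary vector of pixel comparisons into one integer index."""
        bits = 0
        for (y1, y2), (x1, x2) in self.pairs:
            bits = (bits << 1) | int(patch[y1, x1] > patch[y2, x2])
        return bits

    def posterior(self, patch):
        i = self.code(patch)
        return 0.0 if self.total[i] == 0 else self.positive[i] / self.total[i]

    def train(self, patch, is_positive):
        i = self.code(patch)
        self.total[i] += 1
        self.positive[i] += int(is_positive)

An ensemble of ten such classifiers would accept a patch iff at least half of them return a posterior of at least 0.5, as described above.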
The classifier is initially trained with the selected bounding box and a collection of boxes that have large overlap with the selected bounding box. Furthermore, these overlapping boxes are rotated, scaled and blurred, creating about
200 positive bounding boxes in total. Then a collection of some other random
bounding boxes forming the background of the initial image are selected (these
are the negative patches). All these boxes are then scaled into patches. This is
the initial labeled training data set for the classifier.
The posterior probability tables gradually evolve and adapt to the appearance
of the object and with each new patch we alter the posterior probability tables.
If a patch passed the ensemble classifier, meaning the ensemble classifier estimated
that the given patch contains the object of interest, then some base classifiers
may be corrected. More specifically, let us consider a patch p that was eventually
labeled as positive (not only in the ensemble classifier, but also in the further
stages), but some base classifier Ck labeled the patch as negative. Patch p in Ck
resulted in some binary vector x. Therefore, we may increase the counter of the
number of positive patches with binary vector x in Ck . Similarly, we decrease the
corresponding counter of positive patches if the patch was eventually rejected.
After passing the first two stages a patch p has to pass through the NN-classifier. In the nearest neighbour classifier state space we keep a history of
several patches classified positively and negatively. We compute the similarity
measure of the newly arrived patch with each positive and negative patch in the
NN-classifier.
Similarity between two patches p1 and p2 is computed by the following formula:
S(p1, p2) = (NCC(p1, p2) + 1) / 2,
where NCC denotes the normalized correlation coefficient as applied to image
processing. In the first step of computing NCC we normalize image brightness
of the two patches. In the second step we compute the NCC as the covariance
between the pixels of the two patches divided by the product of their individual
standard deviations. NCC outputs real values between −1 and 1, therefore,
the similarity measure between two patches ranges between zero and one (zero
means two patches are completely anti-correlated and one corresponds to a perfect
match).
We find the nearest neighbour p+ among the positive patches and the nearest
neighbour p among the negative patches. The final decision whether to label
the newly arrived patch as either positive or negative is then governed by the
following formula for relative similarity:
Sr = S(p, p+) / (S(p, p+) + S(p, p−)) > θ,
where Sr ranges between zero and one and θ is a tunable parameter that shapes
the decision boundary in the NN-classifier state space either towards better recall
of previously classified patches or towards precision. Afterwards, we add the
patch p to the NN-classifier state space. In case there are too many patches in
the classifier, we pick one at random and throw it away.
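As a sketch of the two formulas above (assuming grayscale patches stored as flat vectors of equal length), the similarity and the relative-similarity decision might look as follows; the function names are illustrative, not the OpenTLD API.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Normalized correlation coefficient of two brightness-normalized patches.
double ncc(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();                 // patches share the same size
    double meanA = 0, meanB = 0;
    for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= n; meanB /= n;
    double cov = 0, varA = 0, varB = 0;
    for (std::size_t i = 0; i < n; ++i) {
        cov  += (a[i] - meanA) * (b[i] - meanB);
        varA += (a[i] - meanA) * (a[i] - meanA);
        varB += (b[i] - meanB) * (b[i] - meanB);
    }
    return cov / (std::sqrt(varA * varB) + std::numeric_limits<double>::epsilon());
}

// S(p1, p2) = (NCC(p1, p2) + 1) / 2, always in [0, 1].
double similarity(const std::vector<double>& a, const std::vector<double>& b) {
    return 0.5 * (ncc(a, b) + 1.0);
}

// Relative similarity: Sr = S(p, p+) / (S(p, p+) + S(p, p-)); the patch is
// labeled positive when Sr exceeds the tunable threshold theta.
bool nnLabel(double sPlus, double sMinus, double theta) {
    return sPlus / (sPlus + sMinus) > theta;
}
```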
For a patch to be classified as containing the object, it needs to pass through
the sieve of all three classifiers. The advantage of this approach is its scalability.
If we knew beforehand the domain in which we would like to apply the algorithm,
we could substitute the classifiers in the chain with our own to achieve even better
performance.

3.3.3 Integration

After the tracking and detection algorithms process a newly arrived image, an integrator has to decide what information to present. It has to choose between
possibly several patches found by the detector and the best bounding box found
by the tracker. The integrator outputs the patch (more precisely the bounding
box from which this patch was created) with the highest confidence measured by
the equation for relative similarity S r in the NN-classifier state space. In case
neither the tracker nor the detector found a candidate bounding box, the integrator indicates this failure as well. Both the detector and the tracker have different
estimates of the state of the object of interest. While the detector is dependent
on the history of the whole process, the tracker is much more concerned about
the current state of the object and is likely to localize potentially new templates
of the object not considered by the detector.
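A minimal sketch of this integration step, with hypothetical types and assuming each candidate already carries its relative similarity Sr, might look like this:

```cpp
#include <vector>

// Illustrative integrator: among the tracker's box and the detector's
// candidates, output the one with the highest relative similarity Sr in the
// NN-classifier state space, or report failure when there is no candidate.
struct Candidate { double x, y, w, h, relativeSimilarity; };

bool integrate(const std::vector<Candidate>& candidates, Candidate& best) {
    if (candidates.empty()) return false;   // neither tracker nor detector found anything
    best = candidates[0];
    for (const Candidate& c : candidates)
        if (c.relativeSimilarity > best.relativeSimilarity) best = c;
    return true;
}
```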

3.3.4 Learning

In this section we will describe some methods for training a bounding box object detector. These methods rely on binary classifiers that attempt to learn
how to discern an object in a sample patch from the background by constructing
a decision boundary in some feature space. Machine learning as deployed in object detection uses two general ideas: supervised and semi-supervised learning.
The offline supervised learning approach is adequate when many samples of the desired object are provided in advance (e.g. face recognition). In our case we want
an online learning algorithm, therefore we resorted to semi-supervised learning.
In semi-supervised learning we are presented with a set of labeled data and we
want to bootstrap our classifier with unlabeled data (successive images in the
video stream).
Now we will describe a learning method suitable for learning during tracking.
This will be our improved semi-supervised bootstrap. The TLD method takes
a novel approach to learning. It introduces the P-N learning method. It aims
to improve the classifier incorporated in the detection phase of the algorithm by
a pair of experts: the P-expert and the N-expert. The P-expert is an expert
on positive examples and detects when the classifier missed a positive example.
The N-expert on the other hand is an expert on negative examples and detects
when the classifier misclassified a negative example confusing it with a positive
example. These errors augment a training set of the detector by changing the
labels of the incorrectly classified examples and feeding them back into the classifier, but this time as labeled data.
The crucial part is to correctly estimate the error made by the classifier.
Here we have in mind the NN-classifier and hence also the ensemble classifier
that is directly affected by the result of the NN-classifier. Both classifiers were
discussed in sub-section 3.3.2. This is achieved by running the pair of experts on
separately labeled data sets. In each iteration of the bootstrap phase the classifier
outputs data labeled as positive (classifier deems the data to be the object) and
negative (classifier deems the data not to be the object).
The P-expert exploits the fact that the object of interest moves along a continuous path. This means that if, for example, the detector localized the object along some straight line in ten consecutive
frames, suddenly lost it in the eleventh frame while the tracker still tracked it, and then
found the object again in the twelfth frame (on the same line), we know that
the detector's result in the eleventh frame is implausible and we have a false negative. The P-expert
estimates the position of the object from the combination of results of the tracker, the detector and the integrator. The task of the P-expert is to estimate
reliable parts of the trajectory and use them to generate positive examples for
the classifier. The P-expert keeps an ordered history of the positive patches in
the NN-classifier state space. Therefore the positive patches form a trajectory
also in the NN-classifier state space. Patches near this trajectory (the boundary
is declared by the equation for relative similarity and depends on the tunable
parameter θ) form a subspace of the classifier state space called the core. Now,
the P-expert would like to extend the core of the NN-classifier by feeding into the
classifier patches it is certain to contain the object of interest.
The P-expert starts following a trajectory (generated by the tracker) as soon
as it enters the core and stops tracking the trajectory when the tracker loses track
of the object of interest. This trajectory creates positive examples which are then
fed into the classifier as positively labeled data and therefore also extend the core.
A schematic picture of the P-expert is shown in Figure 3.11. To gain even more
positive data, several patches are created by rotating and blurring patches around
this trajectory.

Figure 3.11: Illustration of the P-expert. Green dots are the positive patches in
the classifier state space, red dots are the negative patches. The grey area is the
core. In the left picture the P-expert tracked the object until it was lost by the
tracker and detector. In the middle picture perhaps a false alarm (dashed line)
was detected by the tracker (only the tracker considered the patches along this line
as positive) and consecutively the tracker started tracking the right object again
(line in the top right corner starting in the core), giving the P-expert a chance to
extend the core of positive patches. The extension of the core is depicted in the
right picture.
The N-expert assumes that the object may appear at a single location only.
Therefore if the classifier finds several occurrences of the object, the N-expert may
take this into account and try to locate the false positives. The N-expert does so
by taking into account the structure of the video sequence (the object is likely
to move on a continuous path, so some positions detected by the classifier are
highly unlikely). Therefore, it judges several patches produced by the detector
and selects the patches it is most confident about (that they do not contain the
object of interest) and, again, feeds them into the classifier, but now as negatively
labeled data.
A block diagram illustrating the individual parts of the TLD algorithm and their
mutual interaction is shown in Figure 3.12. For experiments supporting this approach and a more detailed theoretical treatment of the P-N
experts, consult [4].
One very important aspect of this algorithm is the fact that it is effective for
a wide range of problems. Therefore, the problem of object tracking from video
footage may be considered as solved.

Figure 3.12: The conceptual block diagram of the TLD framework [4]

4. Method Integration
After we discussed the relevant parts of control theory and object tracking, we still
need to describe how the individual parts are linked, namely what input is given
to the controller. The object tracking algorithm continuously produces bounding
boxes representing the most probable position of the object of interest.
First, we adjust the horizontal and vertical position of the quadcopter in
such a way that the object of interest is seen in the center of the screen. This
is easily achieved by computing the angle by which the quadcopter needs to
rotate around its vertical axis in order to get the object of interest in the vertical
center of the screen. This angle is then given as input to the PID controller,
which subsequently performs the right action. Similarly, we adjust the height of
the quadcopter so that the object of interest is seen in the horizontal center of
the screen. If the quadcopter is below the object, the controller instructs the
quadcopter to ascend approximately 0.4m. This process may be repeated several
times, until the quadcopter reaches the desired height. Similarly the controller
instructs the quadcopter when it needs to descend. Here we must be careful not to
instruct the quadcopter to descend when it is already flying too low as a contact
with the ground might cause the quadcopter to become uncontrollable. After we
correct the horizontal and vertical displacement, we need to correct the possibly
incorrect distance the quadcopter has to the object of interest. This is done with
the help of a scale estimator as will be described in the following Section 4.1.
The scale estimator outputs a distance the quadcopter shall cover in order to get
in the right distance from the object of interest. This distance is given as input
to the PID controller that then finishes the action. After each action we wait
for a small amount of time (about a second) in order for the tracking algorithm
to stabilize. The reason is that, while the quadcopter is performing an action,
the object tracking algorithm tends to output jittery bounding boxes.
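As an illustration of the first adjustment, the yaw angle could be computed from the horizontal pixel offset of the bounding box center under a pinhole camera model. The field of view and image width below are placeholder assumptions, not the measured parameters of the AR.Drone 2.0 camera.

```cpp
#include <cmath>

const double kPi            = 3.14159265358979323846;
const double kHorizontalFov = 92.0 * kPi / 180.0;  // assumed horizontal FOV in radians
const double kImageWidthPx  = 640.0;               // assumed frame width in pixels

// Angle to rotate around the vertical axis so that the bounding box center
// moves to the middle of the image. Positive: clockwise, negative: counter-clockwise.
double yawAngleToCenter(double boxCenterX) {
    double offsetPx = boxCenterX - kImageWidthPx / 2.0;                  // signed pixel offset
    double focalPx  = (kImageWidthPx / 2.0) / std::tan(kHorizontalFov / 2.0);
    return std::atan2(offsetPx, focalPx) * 180.0 / kPi;                  // degrees to rotate
}
```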

4.1 Scale Estimation

The last issue we had to resolve in order to be able to follow objects of interest
with a quadcopter was the problem of scale estimation. The aim is to keep the
quadcopter at a fixed distance from the object of interest.
If we knew in advance the size of the object of interest and knew that the
object would not change its size dramatically during the course of tracking, it
would be possible to estimate the distance between the quadcopter and the object
with a closed-form formula. Even if we don't know the size of the object in
advance, we might still create an estimate with the following reasoning. Let
x denote the real size of the object of interest, y the real distance between the
quadcopter and the object and let α denote the angle under which the quadcopter
sees the object (to compute this angle we need to know the parameters of the
quadcopter's camera). These three variables are tied by the following formula:
tan(α) ≈ x/y (α is the only known). The relation is only approximate, because
x and y might not be perpendicular to each other. Now, let us imagine that
the quadcopter moved one meter towards the object (the distance covered was
estimated by the inertial system of the quadcopter). Then, we can write down
the following equation: tan(α′) ≈ x/(y − 1) (again α′ is the only known). For an
illustration depicting the situation, see Figure 4.1.

Figure 4.1: Estimating the true size of the object of interest using simple
trigonometry and the inertial measurement unit of the quadcopter.
It is not difficult to combine these two equations and derive the following
relation: x/tan(α) ≈ x/tan(α′) + 1. From this equation we easily derive the
following formula for x: x ≈ 1/(1/tan(α) − 1/tan(α′)). Repeating this procedure
several times we get a system of many equations for the true size of the object
of interest, therefore we obtain an overdetermined system of equations. Taking
into account the nature of the problem at hand, we could ignore some apparently
absurd estimates and then estimate the true size of the object by either the
average, the median, or some other more sophisticated statistic. Unfortunately,
this method was too sensitive to even tiny imprecisions of the sensory data and
the quadcopter never managed to get somewhere near the ground truth.
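For illustration, a sketch of this trigonometric estimate (with hypothetical names), reducing the overdetermined system with a median after discarding absurd values, could look as follows; as noted above, in practice the approach was too sensitive to sensor noise.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Each measurement pairs the viewing angle before a forward move of `step`
// meters with the angle after it; tan(alpha) ~ x/y and tan(alpha') ~ x/(y - step)
// combine into x ~ step / (1/tan(alpha) - 1/tan(alpha')).
struct AngleMeasurement { double before, after, step; };  // radians, radians, meters

double estimateObjectSize(const std::vector<AngleMeasurement>& measurements) {
    std::vector<double> estimates;
    for (const AngleMeasurement& m : measurements) {
        double denom = 1.0 / std::tan(m.before) - 1.0 / std::tan(m.after);
        if (denom > 1e-6)                              // ignore apparently absurd estimates
            estimates.push_back(m.step / denom);
    }
    if (estimates.empty()) return 0.0;
    std::sort(estimates.begin(), estimates.end());
    return estimates[estimates.size() / 2];            // median is robust to outliers
}
```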
Another option was to implement a method combining monocular SLAM and
position estimation from sensory data as discussed in [2]. This method gradually
builds a map of the surrounding environment and progressively creates a more
accurate scale estimate. This method was not adopted in this thesis as it is more
suitable for static environments and not for estimating the scale of a dynamically
moving object.
The second attempt we made at tackling the problem was based on optical flow. This time we wanted to build a very simple reactive agent. If the agent
perceived the object of interest as moving away from it, it would react accordingly. The movement was sensed by the optical flow algorithm. The motion
vectors of the points inside the bounding box would be pointing towards the center of the bounding box. Similarly, the agent would perceive that the object was
approaching it when each vector inside the bounding box would be pointing
in the opposite direction than to the center of the bounding box. Unfortunately,
this method did not work very well. The main reason was that motion vectors
were either too small, or they were unreliable and could be easily confused with
vectors indicating a movement of the object in the camera plane.
Then we experimented with a method combining reinforcement learning and
a method that creates a mapping that estimates the distance to an object of
interest based on the size of the bounding box. Generally speaking, in reinforcement learning, the objective is to teach an agent the appropriate actions and
behavior that would maximize its total expected reward. In our case, the reward reflected the relative size of the bounding box to the ideal size (this is the
size of the bounding box when first selected) and hence the distance to the object. The bigger the gap between the current bounding box size and the desired
bounding box size, the smaller the reward. When the object tracking algorithm
detects an increase/decrease in the size of the bounding box it is desirable to take
some action and instruct the quadcopter with proper commands. Specifically, we
take the observed size of the bounding box and normalize it. The normalization
step produces a real number r that is the ratio between the size of the current
bounding box and the initial bounding box size. Now, we take this ratio r and
see if at any previous time the scale estimator was given as input any other ratio
rp close to r (close here means that the absolute distance between r and rp is less
than 0.08, a constant we set experimentally). If this is the case, we deliberately
discard the newly arrived ratio r and instead look up an appropriate action as
if ratio rp was given as input instead. Otherwise, we take the ratio r. Now, for
each ratio we keep track of the previously tried actions and maintain an estimate
of how good each action was. This estimate is updated every time we receive
some feedback. The feedback consists of the ratio of the size of the bounding box
around the object of interest and the initial bounding box size after the action
was performed. We compute the estimate for an action as the average of all the
ratios ever given as feedback for this particular action.
The action with the best expected outcome is then selected.
The idea was that the agent would gradually build a mapping that would
instruct the quadcopter, based on its current state, which action it should
perform in order to maximize its reward.
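The following sketch shows one possible shape of this mapping. The names, the discretization by the 0.08 radius and the interpretation of the "best expected outcome" as the action whose average feedback ratio is closest to one are our reading of the description above, not a verbatim excerpt of the implementation.

```cpp
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Running average of the feedback ratios observed after an action was tried.
struct ActionValue {
    double sumFeedback = 0.0;
    int tries = 0;
    double average() const { return tries ? sumFeedback / tries : 0.0; }
};

class ScaleEstimator {
    std::map<double, std::vector<std::pair<double, ActionValue>>> table_;  // ratio -> (action, estimate)
    double mergeRadius_ = 0.08;   // ratios closer than this are treated as the same

    double canonical(double r) const {                 // reuse a close, already seen ratio
        for (const auto& entry : table_)
            if (std::fabs(entry.first - r) < mergeRadius_) return entry.first;
        return r;
    }

public:
    // pick the action whose past feedback ratios were closest to 1 (the ideal size)
    double bestAction(double ratio, const std::vector<double>& candidates) {
        auto& actions = table_[canonical(ratio)];
        if (actions.empty())
            for (double a : candidates) actions.push_back({a, ActionValue()});
        double best = actions.front().first, bestScore = 1e9;
        for (const auto& entry : actions) {
            double score = entry.second.tries ? std::fabs(entry.second.average() - 1.0)
                                              : 0.5;   // untried actions get a neutral score
            if (score < bestScore) { bestScore = score; best = entry.first; }
        }
        return best;
    }

    // feedback: the size ratio observed after the commanded action finished
    void giveFeedback(double ratio, double action, double observedRatio) {
        for (auto& entry : table_[canonical(ratio)])
            if (entry.first == action) {
                entry.second.sumFeedback += observedRatio;
                ++entry.second.tries;
            }
    }
};
```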
This approach makes several assumptions. First, it expects that the object of
interest did not move while the quadcopter was executing an action. Second, this
method assumes that the flight manoeuvres are performed accurately enough.
Third, it expects the object detector to be robust and give reasonably accurate
sizes of the bounding boxes.
This final method of scale estimation was the subject of our experiment in which
we quantitatively measured its performance. If the reader would like to see the
details and results of the experiment, please skip to Chapter 7.

5. FollowMe Application
Because of the various insufficiencies of the official SDK provided along with the
Parrot AR.Drone 2.0 quadcopters we decided to implement our own framework.
The whole project was written in the C++ programming language. Apart from
some basic human interaction needed to set up the application, such as selecting an object of interest and triggering a button for takeoff, the communication
between the application and the quadcopter does not require any additional intervention. While in flight, the quadcopter will carry out instructions received
from the FollowMe application and will try to follow the object of interest autonomously.

5.1 Software Design

The application utilizes several features from the latest C++ standard [15], especially the long-awaited portable threading model. All third-party libraries are
written in C/C++ as well, therefore the integration of several different libraries
was smooth. They are all available under some form of license that allows free
use and distribution for scientific purposes. The application is designed in such a
way that logically different pieces are split into separate modules, allowing
individual components to be developed on their own.
The decision to split different tasks into different threads was fairly trivial
taking into account the nature of the task pursued. Let us now give an overview
of the FollowMe application.
All the logic behind the scenes is directed from the GUI. When we connect
to the quadcopter, a new thread is spawned which tries to connect to the ad-hoc
network set up by the drone. This task consists of sending a specific packet on
the drone's navigational channel port, establishing the connection and dispatching a new thread of execution that will take care of receiving navigational data
from the drone and providing them to the main application for further processing. Shortly after the navigational channel is up and running in its own thread,
we start up another thread, this time taking care of sending commands to the
drone. It is also possible to instruct the drone about specific parameters, e.g. the
video stream (front/bottom), bit rate, frames per second. Once this is taken care
of, again inside a dedicated thread, the application starts listening for commands
which are then synchronously dispatched to the drone; if no command is pending, it sends a
no-op command so as not to lose the connection. Finally we connect
to the port number 5555 and try to decode and split the video stream, making it
available in a suitable format for the rest of the application, e.g. the object tracking algorithm. The communication between the quadcopter and the application
is discussed in Section 1.4. Finally, we integrate results from the different components
(object tracking algorithm, scale estimator) and let the PID controller create
commands which are subsequently sent to the quadcopter for execution.

5.2 Third Party Software

Before we plunge into the details of the inner building blocks of the application
we will briefly describe all third party software packages we utilized in this application. The decisions about which software to deploy were governed by some
simple rules. We wanted free software that is widely used in its respective
field, well tested, easy to integrate and distribute, and we wanted all the libraries
to be cross-platform as well.
Qt is a cross-platform application framework used for developing graphical
user interfaces [10]. It has bindings to various programming languages, but especially its C++ binding is widely used. We used version Qt5.
SFML [14] provides a simple interface to various components that a PC usually
has. It is composed of five modules: system, window, graphics, audio and network. We exploited only the last module. We chose this framework especially
because of its low footprint, simple usage and its ease of distribution. SFML
provided us with simple send and receive functions over the UDP communication
protocol. Tested with SFML version 2.1.
OpenTLD [7] was created shortly after the first release of the official TLD
algorithm in Matlab. It is basically a C++ port following the description given
in [4]. This version of the TLD algorithm has some shortcomings, namely it
lacks any API for external bindings. The OpenTLD library is no longer officially
supported. Therefore, we incorporated the source code of this software straight
into our project and made some minor changes that would allow us to utilize this
library right from our own code. Furthermore, we modified some constants that
better suited the needs of our application.
OpenCV [6] is a cross-platform image manipulation library that is extensively
used throughout the application, therefore it is necessary to have this library
installed for the deployment of the FollowMe application. Tested with version
2.4.

5.3 Inner Structure of the FollowMe Application

Now we will describe our application and all its classes in detail. Each class will be
described as is, meaning its inner structure as well as its integration and visibility
to other classes. The logical interconnection of individual classes is depicted in
Figure 5.1.
Drone class is in a sense the central part of the application. It acts as an intermediary between all other classes. Upon the request to connect to the quadcopter
the Drone class divides all the necessary tasks between others and collects their
results back. It collects sensory data and distributes them inside the application
for further use. Once we have processed all the sensory data, it is necessary to
send some commands back to our agent. These commands are dispatched to
another class which keeps a buffer of all the commands and sends them off to
the quadcopter in accordance with the communication protocols mentioned in
Section 1.4. Then we relay all the sensory information to the Main window class
for display.
Figure 5.1: Outline of the FollowMe application logic.

DroneNetwork is a general purpose, yet very simple class which is not intended to be used on its own, but rather to be inherited from by other classes.
The communication between our application and the quadcopter runs over the
simple and unreliable UDP protocol, which might not have been the best design
decision. The primary methods of this class are the send and receive functions.
First, the send function dispatches a raw C string containing a command understood by the AR.Drone 2.0. Second, the receive function receives a raw C string
containing sensory data along with some information about the internal state of
the drone.
DroneNavdata inherits from the DroneNetwork class. This class is started in
a new thread after a successful connection. It runs in an infinite loop. At the
start of each new cycle the thread gets blocked in the receive function until new
data arrive from the quadcopter. The navigational data are afterwards stored in
an internal buffer which basically eliminates the problem that the main loops in
the Drone class (in the usual terminology of parallel programming we would say
it is the consumer) and in the DroneNavdata class (the producer) have different
execution times.
DroneCommander also inherits from the DroneNetwork class. DroneCommander has an internal buffer, which absorbs all the commands from the Drone
class. Then we enter an infinite loop. In each iteration we grab a command from
the front of the buffer and convert it into a format understood by the drone,
add a sequence number increased by one and send the command over the UDP
protocol.
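A minimal sketch of this loop using SFML's UDP socket might look as follows. The AT* command syntax, the sequence number and port 5556 follow the AR.Drone Developer Guide [9], while the class shape, names and command examples are illustrative, not the actual DroneCommander code.

```cpp
#include <SFML/Network.hpp>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <utility>

class DroneCommander {
    sf::UdpSocket socket_;
    std::queue<std::pair<std::string, std::string>> buffer_;  // (name, args), e.g. ("REF", ",290718208")
    std::mutex mutex_;
    unsigned sequence_ = 1;                                   // every command gets an increasing number

public:
    void push(const std::string& name, const std::string& args) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_.push({name, args});
    }

    void runOnce() {                                          // called from the commander thread's loop
        std::pair<std::string, std::string> cmd("COMWDG", ""); // keep-alive if nothing is queued
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!buffer_.empty()) { cmd = buffer_.front(); buffer_.pop(); }
        }
        std::ostringstream at;
        at << "AT*" << cmd.first << "=" << sequence_++ << cmd.second << "\r";
        std::string packet = at.str();
        socket_.send(packet.c_str(), packet.size(),
                     sf::IpAddress("192.168.1.1"), 5556);     // default drone address and command port
    }
};
```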
VideoReceiver is an exception, because formally, at least in the planning
phase, the class was meant to follow the same hierarchical structure as the DroneNavdata or DroneCommander classes. But we made a shortcut, bypassing the
intermediate step over DroneNetwork and directly connecting to the quadcopter
because the OpenCV library provides a very simple way to
download and convert the video stream into images right from the quadcopter.
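Assuming OpenCV is built with FFmpeg support, the shortcut can be as simple as opening the stream URL directly; the function below is an illustrative sketch, not the actual VideoReceiver code.

```cpp
#include <opencv2/opencv.hpp>

// Illustrative sketch of the VideoReceiver shortcut (assumed to run in its
// own thread): open the drone's video stream on port 5555 and hand decoded
// frames to the rest of the application.
void receiveVideo(volatile bool& running, void (*onFrame)(const cv::Mat&)) {
    cv::VideoCapture capture("tcp://192.168.1.1:5555");   // default drone address
    if (!capture.isOpened()) return;                       // connection failed
    cv::Mat frame;
    while (running && capture.read(frame))
        onFrame(frame);                                    // e.g. forward to the tracking algorithm
}
```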
Data Integrator is responsible for collecting results from the object tracking
algorithm and from the scale estimator and connecting these results to the input
of the PID controller as discussed in Chapter 4.
PID Controller class is responsible for navigating the quadcopter precisely
to a setpoint. Its input is either a 3D vector which is the desired translation of
the quadcopter or an angle for a controlled rotation around the vertical axis of
the quadcopter. The output of the controller is a set of commands sent to the
quadcopter. The details of the PID controller are described in Section 2.4.
Scale Estimator class is responsible for building a mapping from the relative
size of the object as seen from the quadcopter to an estimate about the distance between the quadcopter and the tracked object. The underlying algorithm
implemented in this class is described in Section 4.1.
MainWindow class forms the presentation layer of the FollowMe application.
The user interface will be discussed in the following Chapter 6.
Predator class is responsible for successful object tracking. This class periodically collects images from the video stream, tracks the object of interest and
passes the information about the size and position of the tracked object to the
Drone class for further processing. The building blocks and algorithms of this
class are discussed in Section 3.3.
6. User Experience
This chapter will introduce the FollowMe application from the user's perspective.
The FollowMe application allows the user to watch the video stream from the front camera
of the quadcopter and manually select a bounding box representing the object of
interest. The application also shows the current flight status of the quadcopter.
Furthermore, it enables manual control of the quadcopter in flight.

6.1 Installation

The FollowMe application was always intended to be a multi-platform application. Currently, it has been deployed under the GNU/Linux and Microsoft Windows operating systems. Since the object tracking algorithm is computationally
demanding and spawns many threads, it is recommended to run the FollowMe
application on a computer with at least four cores for best performance.
The GNU/Linux version requires all the third-party libraries described in
Section 5.2 to be properly installed on the system. To build the application,
go to the FollowMe directory first. From within that directory type qmake &&
make. After the application has been built, type
build linux/followme to run the application. For the Microsoft Windows operating systems (Windows Vista and later) the application is distributed as a single executable. To run the application, click on the followme.exe icon in the
build windows directory. After the application is launched, the main window of
the application appears, as shown in Figure 6.1.

6.2 Connecting to the Quadcopter

Before connecting to the drone, perform a simple check of the quadcopter to


avoid any accidents due to mechanical problems. The quadcopter is started by
simply plugging in the battery provided along with AR.Drone 2.0 quadcopters.
Shortly after plugging in the battery, the drone will mildly rotate its four rotors.
Next, check that all four LEDs under each rotor are green. It is a good indicator that the quadcopter is in a good state. Afterwards, cover the quadcopter

Figure 6.1: Main window of the FollowMe application.



Figure 6.2: Selecting an object of interest by drawing a bounding box around it.
with either the indoor or outdoor hull, depending on the environment where the
quadcopter will be flying. Before the actual takeoff, it is rudimentary to position
the quadcopter on a flat surface, so that the quadcopter can adjust its sensors
for proper functioning before every flight. In case the quadcopter suffered from
any malfunction (e.g. a hard landing in a previous flight), it is recommended to
unfasten the battery (but keep it plugged in) and push the restart button right
beneath where the battery is fixed. This will cause the quadcopter to hard reset
its sensors after it takes off the next time (it will spin 360 degrees around its main
vertical axis right after takeoff). Before launching the application, check that the
Wi-Fi adapter is connected to the quadcopter's ad-hoc network and that a firewall
does not block any communication between the quadcopter and the FollowMe
application.
When the quadcopter is ready, press the Connect button in the upper left
corner of the FollowMe application and wait until the Connection established
message appears in the status bar at the bottom. In case the connection could
not be established, please check your connectivity.

6.3 Tracking Object

After the connection has been established, the application immediately starts
streaming the video from the front camera of the quadcopter. To turn on/off the
video stream, toggle the Stream video check box. When the object of interest
the quadcopter is going to follow appears in the video stream, either uncheck the
Stream check box or start drawing a bounding box around the object of interest.
This will freeze the video stream so that it is possible to draw a bounding box,
as in Figure 6.2, as small as possible, so that it doesn't contain a lot of
background, but only the desired object.
Selecting the object of interest will unlock the Start tracking button. After
pressing the Start tracking button it is recommended to move the quadcopter or
the object slightly for about ten to fifteen seconds, so that the tracking algorithm
has a chance to bootstrap its learning algorithms. If the object tracking algorithm
is capable of tracking the selected object, it will indicate so by showing a green
bounding box around the object of interest. It will also display a red bounding
box which indicates the desired size of the bounding box. Also a line connecting
the current center of the bounding box and the center of the image will reveal
the horizontal and vertical displacement of the quadcopter.
Then, after pressing the Takeoff button, the quadcopter will take off and will
start following the object autonomously in the air without any further intervention. After the task of following the object has been accomplished, press the Stop
Figure 6.4: Key bindings


Figure 6.3: Settings
tracking button (this button will appear instead of the Start tracking button).
To change the default settings, press the Settings button. This will open up
a dialog window, see Figure 6.3, with several tunable parameters. It is possible
to:
- takeoff right after the Start tracking button is pressed,
- land right after the Stop tracking button is pressed,
- set the size of the bounding box relative to the original size,
- land when the battery drops below a specified level.
It is also possible to record the flight. To start recording, press the Start
Recording button. This will open a window that will ask for the location in the
file system where to store the video stream. After this operation begins, the Start
Recording button will turn into a Stop Recording button that will stop recording the
flight.
To see how to control the quadcopter manually, press the Keys button. This
will open up a window as in Figure 6.4 and display the various bindings between
individual keys and the commands sent over to the quadcopter.
When the quadcopter is ready for takeoff and the flight conditions are good,
press the Takeoff button. Under the Trajectory plotter label the main window of
the application shows the trajectory of the quadcopter in flight in the horizontal
plane (the plotter captures an area of ten by ten metres). In the right bottom
corner of the application there will be a miniature of the selected object of interest.
Next to the miniature it is possible to read the quadcopter's current state:
the speed in all three directions, the altitude and the battery level. After takeoff the
Takeoff button will toggle into a Land button.

7. Experiment
In this chapter we will describe an experiment we conducted in order to assess
the proposed method of scale estimation as described in Section 4.1. The proposed method of scale estimation relies heavily on the performance of the object
tracking algorithm as well as the precision of the inertial measurement unit of the
quadcopter. The experiment should give us an idea of how well we might expect
the scale estimation method to perform in practice.

7.1 Settings

Now we will describe the settings of an experiment that helped us assess the
accuracy of the scale estimator. As the scale estimator makes more and more estimates it is likely to obtain feedback from the environment. Therefore, we would
like to see whether our method truly does improve itself in the course of time as
it gains feedback.
The experiment went as follows. We selected the size of the object of interest
(its width w and height h) and the desired distance d between the quadcopter and
this object (the goal of the scale estimator is to keep the quadcopter at distance d
from the object of interest). Given w, h and d we computed the size of the initial
bounding box (i.e. this is the size of the bounding box the tracking algorithm
would detect, were the quadcopter at distance d from the object of interest and were
the object's width w and height h).
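The computation of the initial bounding box can be sketched with a pinhole approximation; the field of view and resolution below are placeholder assumptions standing in for the real camera parameters of the quadcopter.

```cpp
#include <cmath>

const double kPi            = 3.14159265358979323846;
const double kImageWidthPx  = 640.0;                  // assumed frame width
const double kImageHeightPx = 360.0;                  // assumed frame height
const double kHorizontalFov = 92.0 * kPi / 180.0;     // assumed, radians
const double kVerticalFov   = 51.0 * kPi / 180.0;     // assumed, radians

// Size in pixels of the bounding box of an object with real size w x h meters
// seen from distance d meters (pinhole approximation).
void initialBoundingBox(double w, double h, double d, double& widthPx, double& heightPx) {
    double fx = (kImageWidthPx  / 2.0) / std::tan(kHorizontalFov / 2.0);  // focal length in pixels
    double fy = (kImageHeightPx / 2.0) / std::tan(kVerticalFov   / 2.0);
    widthPx  = fx * w / d;
    heightPx = fy * h / d;
}
```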
Now we took our scale estimator and gave it as input the initial bounding box.
All other bounding boxes were compared against this initial bounding box in the
scale estimator. Afterwards, we gave the scale estimator as input several other
bounding boxes. The possible sizes of these bounding boxes varied from being half
the size of the initial bounding box to being twice that big. This range of possible
sizes of bounding boxes corresponds to some range of possible distances between
the quadcopter and the object of interest. Every bounding box was selected from
this range of possible sizes of bounding boxes uniformly at random. With these
bounding boxes we queried the scale estimator for an action to perform. This
means that for a given bounding box the scale estimator was supposed to output
some distance. This distance would then be, were this experiment performed with
a flying quadcopter, fed as input to the PID controller as discussed in Chapter 4.
After each action the scale estimator may receive some feedback. Now, the feedback received, if any, is likely to be inaccurate. This could be so for several
reasons. One of them is the faulty inertial measurement unit. Another reason
may be due to a bad estimate of the size of the bounding box from the object
tracking algorithm. Yet another error might be caused by the fact that while
the quadcopter was performing an action, the object moved as well. This is the
reason why we only simulated this experiment. We wanted to gain some insight
into how well the scale estimator may perform under different conditions.
Therefore, we will introduce a random variable E that will encompass all of
the potential errors introduced into the system. Each feedback sent into the scale
estimator will be affected by this random variable. E will disturb the distance
between the quadcopter and the object at the moment an action was dispatched
and the moment the scale estimator received some feedback from the environment.
In the ideal case, i.e. if E were always zero, this would correspond to the case
where the action produced by the scale estimator had exactly the intended effect.
So, e.g. if the scale estimator output an action saying to move one
meter forward, then after the action was performed the distance between the
quadcopter and the object of interest was one meter shorter than before the
action was performed and the scale estimator would receive a correct feedback.
We will let E be from the family of normal distributions, although we do not
claim that this distribution truly models the combined error we are trying to
simulate. The normal distribution is entirely described by its mean and standard
deviation. We will fix the mean to be always zero (sometimes we will get too
close, sometimes too far, but on average we expect the action to be performed
accurately). The standard deviation will vary and we will see how it affects
the speed and accuracy with which the scale estimator learns the mapping from
the sizes of the bounding boxes to the actions.
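One run of the simulation can be sketched as follows, reusing the hypothetical ScaleEstimator interface sketched in Section 4.1; the pinhole relation r = d0/d ties the bounding box ratio to the distance, and E is drawn from std::normal_distribution. This is an illustration of the setup, not the actual experiment code.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

double simulateRun(ScaleEstimator& estimator, const std::vector<double>& actions,
                   double desiredDistance, int iterations, double sigma, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> ratio(0.5, 2.0);  // half to twice the initial size
    std::normal_distribution<double> noise(0.0, sigma);      // the random variable E

    double totalError = 0.0;
    for (int i = 0; i < iterations; ++i) {
        double r = ratio(rng);
        double distance = desiredDistance / r;               // distance implied by the ratio
        double ideal = distance - desiredDistance;           // what a perfect estimator would output
        double action = estimator.bestAction(r, actions);
        double achieved = action + noise(rng);               // commanded move disturbed by E
        double newDistance = std::max(0.5, distance - achieved);
        estimator.giveFeedback(r, action, desiredDistance / newDistance);
        totalError += std::fabs(ideal - action);
    }
    return totalError / iterations;                          // average error in meters
}
```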

7.2 Results

We knew beforehand in this experiment the true size of the object of interest,
the desired distance between the quadcopter and the object of interest and the
sizes of the bounding boxes given as input to the scale estimator. Therefore,
we could compare our scale estimator against an ideal algorithm that knows
exactly what action to perform given as input a particular bounding box. Hence,
at the end of each experiment, we could measure how good the final model of the
scale estimator is. To quantitatively compare the results of the scale estimator
we took the final model of the estimator and computed the expected error of
the model given a bounding box selected uniformly at random from the given
range of allowed sizes of bounding boxes. This expected error is measured in
units of meters. In other words, this expected error reveals how well the scale
estimator performed after some fixed number of iterations and with some type
of error from the environment. So the next time we ask the scale estimator to
recommend an action for some (random) bounding box, this will be the average
error.
The sizes of the objects of interest and the desired tracking distances varied
throughout several different runs of our experiment. Each run of the experiment is
fully described by the size of the object of interest in the real world, the distance
between the quadcopter and the object, the standard deviation of E and the
number of bounding boxes given as input to the scale estimator. For the results
to be more robust we repeated each run of the experiment a hundred times.
The final accuracy of the scale estimator was then computed as the average of
the errors of the final model, where the average was computed over the given
range of allowed sizes of bounding boxes. In Figure 7.1 we can see how well the
scale estimator performed on different runs of the experiment.
There are two patterns we can infer from the measured data. The first pattern
tells us that the accuracy of the scale estimator improves as more data (bounding
boxes) is being processed. This trend is depicted in detail in Figure 7.2, where
we fixed the standard deviation of E at 0.2 and only varied the number of bounding
boxes. The second pattern hints that the accuracy of the scale estimator increases
when the standard deviation decreases. Again, this trend is depicted in detail in
Figure 7.3, where we fixed the number of bounding boxes generated to 1000 and
varied the standard deviation of E. Both of these patterns were predictable.
The experiment showed that theoretically the scale estimator can learn a reasonably accurate model given enough data even when the error introduced into
the system is substantial. The accuracy of the scale estimator depends on the
particular size of the object of interest and the tracking distance. Unfortunately, the scale estimator is likely to obtain far less data in practice (after all,
the quadcopter can stay in the air for only about ten minutes). Therefore, the scale
estimator may not have enough time to learn an accurate model that would be
sufficient for successful object tracking.

[Four plots: expected distance error on a new request (vertical axis) versus number of iterations and standard deviation (horizontal axes); see Figure 7.1.]
Figure 7.1: Illustration of the accuracy of the scale estimator in four different
scenarios. In the top left picture the scale estimator was learning actions on
an object initially 3m away and the size of the object was 1m × 1m. Top right
picture: object was initially 2m away and the size of the object was 0.5m × 0.5m.
Bottom left picture: object was initially 4m away and the size of the object was
2m × 2m. Bottom right picture: object was initially 4m away and the size of the
object was 1m × 1m. In each case the scale estimator shows a similar learning
curve, although the results may differ for the same number of bounding boxes
and the same standard deviation.

[Plot: expected distance error on a new request versus number of iterations; see Figure 7.2.]
Figure 7.2: Performance of the scale estimator with fixed standard deviation of
E = 0.3. The scale estimator was learning actions on an object initially 3m away
and the size of the object was 1m × 1m.

[Plot: expected distance error on a new request versus standard deviation; see Figure 7.3.]
Figure 7.3: Performance of the scale estimator with fixed number of iterations of
one thousand. The scale estimator was learning actions on an object initially 3m
away and the size of the object was 1m × 1m.

Conclusion
First, we described the AR.Drone 2.0 robotic platform that served as a testbed
for our experiments. Afterwards, we discussed the quadcopter from the perspective of control theory and considered several possible alternatives for modelling
the system. Then we selected the PID controller as a suitable controller for our
purposes. We showed that this controller performs reasonably well in practice on
the AR.Drone 2.0 when the proportional, derivative and integral constants of the
PID controller are set with care.
Subsequently, we studied the field of object tracking extensively. First, we described many unsuccessful attempts we made at tracking objects from video
footage. The methods we explored were template matching, color detection, feature detection and matching, cascaded classifiers and motion tracking.
Finally, a very effective method of object tracking, the TLD algorithm, was
studied and described. We gave an extensive and thorough description of this
algorithm which incorporates many ideas we presented before in the naive approaches to object tracking. The algorithm allowed us to track a wide range of
objects and not just some constrained subset.
We tried to solve the problem of scale estimation with three different methods.
The first method was based on simple trigonometry. This method suffered from
even tiny imprecisions of the object tracking algorithm and the inertial measurement unit. The second method, based on motion estimation, was impractical,
because it correctly detected only fast and large movements. The third method,
which we proposed, was based on reinforcement learning, merging information
from the inertial measurement unit and the object tracking algorithm. We chose
this third method even though it has its own downsides.
Then we conducted an experiment that revealed how well the proposed
method of scale estimation is likely to perform. The experiment showed that the
scale estimator performs well given enough data, even when the other parts on which
the scale estimator relies produce large errors. Unfortunately, in practice,
we expect to receive much less data than needed to learn a reliable scale estimator.
The problem of scale estimation is by far the biggest issue we face in practice.
The very sophisticated TLD algorithm is sufficient for most cases. However,
the various possible dynamics of the objects we might follow are so diverse (e.g.
an object moving at constant speed on a straight line compared to an object
moving in bursts from side to side) that they often leave the generic algorithm of
scale estimation helpless.
Throughout the thesis we had to face several unexpected problems, some of
which we did not resolve to our satisfaction and some of which we did not even
try to solve. First, it would be great if we could localize the quadcopter more
precisely in space. This problem might be tackled e.g. by a set of external cameras
that would track the position of the quadcopter with higher precision. Second,
the method of scale estimation relies heavily on the object tracking algorithm.
We think a better solution to this problem might use lidar or a similar technology.
Third, for more reliable object tracking by a flying drone it would
be suitable if we could detect obstacles nearby. This would also allow us to try
to find the object of interest in case we have lost track of the object.

Bibliography
[1] Åström, Karl Johan, Murray, Richard M. Feedback Systems: An Introduction for Scientists and Engineers [online]. Princeton: Princeton University Press, 2008. ISBN 0-691-13576-2 [13 July 2014]. Available from: http://www.cds.caltech.edu/~murray/books/AM08/pdf/am08-complete_28Sep12.pdf/.

[2] Engel, Jakob Julian. Autonomous Camera-Based Navigation of a Quadcopter. Munich: Technical University Munich, 2011. Master Thesis. Technical University Munich, Faculty of Informatics, Computer Vision Group. [13 July 2014]. Available from: http://www.vision.in.tum.de/members/engelj/.

[3] Hraško, Andrej. Řízené přistání autonomního drone. Prague: Charles University, 2013. Bachelor Thesis. Charles University, Faculty of Mathematics and Physics, Department of Theoretical Computer Science and Mathematical Logic.

[4] Kalal, Zdenek. Tracking Learning Detection [online]. Guildford: University of Surrey, 2011. Doctoral Thesis. University of Surrey, Faculty of Engineering and Physical Sciences, Centre for Vision, Speech and Signal Processing. [13 July 2014]. Available from: http://xm2vtsdb.ee.surrey.ac.uk/CVSSP/Publications/papers/Kalal-PhD_Thesis-2011.pdf.

[5] Krajník Tomáš, Vonásek Vojtěch, Fišer Daniel, Faigl Jan. AR-Drone as a Platform for Robotic Research and Education. Heidelberg: Springer, 2011. ISSN 1865-0929. [13 July 2014]. International Conference on Research and Education in Robotics. Available from: http://www.labe.felk.cvut.cz/~tkrajnik/ardrone/articles/eurobot.pdf.

[6] Open Source Computer Vision Library [online]. [13 July 2014]. Available from: http://opencv.org/.

[7] OpenTLD library [online]. [13 July 2014]. Available from: http://gnebehay.com/tld/.

[8] Parrot AR.Drone 2.0 [online]. [13 July 2014]. Available from: http://ardrone2.parrot.com/.

[9] Piskorski Stephane, Brulez Nicolas, Eline Pierre, D'Haeyer Frederic. AR.Drone Developer Guide. 2nd edition. Parrot, 2012.

[10] Qt framework [online]. [13 July 2014]. Available from: http://qt-project.org/.

[11] D'Andrea, Raffaello. Flying machine arena [online]. [13 July 2014]. Available from: http://raffaello.name/projects/flying-machine-arena/.

[12] Russell, Stuart, Norvig, Peter. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall, 2009. ISBN 0-13-604259-7.

[13] Saripalli, Srikanth. Vision based GPS-denied Object Tracking and Following for Unmanned Aerial Vehicles [online]. [13 July 2014]. Available from: http://robotics.asu.edu/ardrone2_ibvs/.

[14] Simple and Fast Multimedia Library [online]. [13 July 2014]. Available from: http://sfml-dev.org/.

[15] Stroustrup, Bjarne. The C++ Programming Language. 4th edition. Massachusetts: Addison Wesley, 2013. ISBN 0-321-56384-0.

[16] Szeliski, Richard. Computer Vision: Algorithms and Applications [online]. Springer-Verlag, 2010. ISBN 978-1848829343 [13 July 2014]. Available from: http://www.szeliski.org/Book/.


List of Abbreviations
BRIEF Binary Robust Independent Elementary Features
DSP Digital Signal Processor
HSV Hue, Saturation, Value
NN-Classifier Nearest Neighbour Classifier
PID Controller Proportional Integral Derivative Controller
RTOS Real Time Operating System
SIFT Scale-invariant Feature Transform
SLAM Simultaneous Localization and Mapping
SURF Speeded Up Robust Features
TLD Tracking-Learning-Detection
UAV Unmanned Aerial Vehicle

Attachments
Contents of the CD attached to this bachelor thesis:
thesis Source files of this bachelor thesis along with the text of this bachelor thesis.
FollowMe Source code of the FollowMe application.
readme.txt Instructions on installing and running the FollowMe application.
experiment All source files needed to reproduce the experiment presented
in this bachelor thesis.
build linux Folder where the executable of the FollowMe application is
placed after compilation under the GNU/Linux operating system.
build windows Folder containing an executable of the FollowMe application for the Windows operating system.
