
CUU LONG FISH JOINT STOCK COMPANY (CL-FISH CORP.)
90 Hung Vuong street, My Quy Industrial Zone, Long Xuyen City, An Giang Province, Vietnam
Tel: (84)-76-3931000 Fax: (84)-76-3932446
__________________________________________________________

LIST OF ARTICLES ON YOLO OBJECT DETECTION

1. Yolo: Real-Time Object Detection (https://pjreddie.com), by Joseph Redmon
2. Real-Time Object Detection With Yolo, Yolov2 And Now Yolov3 (https://medium.com), by Jonathan Hui
3. Yolo — You Only Look Once, Real Time Object Detection Explained (https://towardsdatascience.com), by Manish Chablani
4. Yolo: Real-Time Object Detection (https://github.com), by Darren Eng
5. Yolo Object Detection With OpenCv And Python (https://www.arunponnusamy.com), by Arun Ponnusamy
6. How to implement a YOLO (v3) Object detector from Scratch in PyTorch (https://www.kdnuggets.com), by Ayoosh Kathuria
7. Evolution of Object Detection and Localization (https://towardsdatascience.com), by Prince Grover

Cửu Long, October 18, 2018

Approved by the Board of Directors          Prepared by
Nguyễn Xuân Hải                             Trần Huy Thanh

ARTICLE 1: YOLO: REAL-TIME OBJECT DETECTION


You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev.
Comparison to Other Detectors
YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily tradeoff between speed and accuracy simply by changing the size of the model, no retraining required!

Performance on the COCO Dataset


Model               Train           Test       mAP    FLOPS       FPS    Cfg    Weights
SSD300              COCO trainval   test-dev   41.2   -           46     -      link
SSD500              COCO trainval   test-dev   46.5   -           19     -      link
YOLOv2 608x608      COCO trainval   test-dev   48.1   62.94 Bn    40     cfg    weights
Tiny YOLO           COCO trainval   test-dev   23.7   5.41 Bn     244    cfg    weights
SSD321              COCO trainval   test-dev   45.4   -           16     -      link
DSSD321             COCO trainval   test-dev   46.1   -           12     -      link
R-FCN               COCO trainval   test-dev   51.9   -           12     -      link
SSD513              COCO trainval   test-dev   50.4   -           8      -      link
DSSD513             COCO trainval   test-dev   53.3   -           6      -      link
FPN FRCN            COCO trainval   test-dev   59.1   -           6      -      link
Retinanet-50-500    COCO trainval   test-dev   50.9   -           14     -      link
Retinanet-101-500   COCO trainval   test-dev   53.1   -           11     -      link
Retinanet-101-800   COCO trainval   test-dev   57.5   -           5      -      link
YOLOv3-320          COCO trainval   test-dev   51.5   38.97 Bn    45     cfg    weights
YOLOv3-416          COCO trainval   test-dev   55.3   65.86 Bn    35     cfg    weights
YOLOv3-608          COCO trainval   test-dev   57.9   140.69 Bn   20     cfg    weights
YOLOv3-tiny         COCO trainval   test-dev   33.1   5.56 Bn     220    cfg    weights
YOLOv3-spp          COCO trainval   test-dev   60.6   141.45 Bn   20     cfg    weights

How It Works
Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model
to an image at multiple locations and scales. High scoring regions of the image are considered
detections.
We use a totally different approach. We apply a single neural network to the full image. This network
divides the image into regions and predicts bounding boxes and probabilities for each region. These
bounding boxes are weighted by the predicted probabilities.

Our model has several advantages over classifier-based systems. It looks at the whole image at test
time so its predictions are informed by global context in the image. It also makes predictions with a
single network evaluation unlike systems like R-CNN which require thousands for a single image. This
makes it extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. See
our paper for more details on the full system.
What's New in Version 3?
YOLOv3 uses a few tricks to improve training and increase performance, including: multi-scale predictions, a better backbone classifier, and more. The full details are in our paper!
Detection Using A Pre-Trained Model
This post will guide you through detecting objects with the YOLO system using a pre-trained model.
If you don't already have Darknet installed, you should do that first. Or instead of reading all that just
run:

git clone https://github.com/pjreddie/darknet


cd darknet
make
Easy!
You already have the config file for YOLO in the cfg/ subdirectory. You will have to download the
pre-trained weight file here (237 MB). Or just run this:
wget https://pjreddie.com/media/files/yolov3.weights
Then run the detector!
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
You will see some output like this:
layer filters size input output
0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BFLOPs
1 conv 64 3 x 3 / 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BFLOPs
.......
105 conv 255 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 255 0.353 BFLOPs
106 detection

truth_thresh: Using default '1.000000'


Loading weights from yolov3.weights...Done!
data/dog.jpg: Predicted in 0.029329 seconds.
dog: 99%
truck: 93%
bicycle: 99%

Darknet prints out the objects it detected, their confidence, and how long it took to find them. We didn't
compile Darknet with OpenCV so it can't display the detections directly. Instead, it saves them
in predictions.png. You can open it to see the detected objects. Since we are using Darknet on the CPU
it takes around 6-12 seconds per image. If we use the GPU version it would be much faster.

I’ve included some example images to try in case you need inspiration. Try data/eagle.jpg, data/dog.jpg,
data/person.jpg, or data/horses.jpg!
The detect command is shorthand for a more general version of the command. It is equivalent to the
command:
./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/dog.jpg
You don't need to know this if all you want to do is run detection on one image but it's useful to know
if you want to do other things like run on a webcam (which you will see later on).
Multiple Images
Instead of supplying an image on the command line, you can leave it blank to try multiple images in a
row. Instead you will see a prompt when the config and weights are done loading:
./darknet detect cfg/yolov3.cfg yolov3.weights
layer filters size input output
0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BFLOPs
1 conv 64 3 x 3 / 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BFLOPs
.......
104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
105 conv 255 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 255 0.353 BFLOPs

106 detection
Loading weights from yolov3.weights...Done!
Enter Image Path:
Enter an image path like data/horses.jpg to have it predict boxes for that image.

Once it is done it will prompt you for more paths to try different images. Use Ctrl-C to exit the program
once you are done.

Changing The Detection Threshold


By default, YOLO only displays objects detected with a confidence of .25 or higher. You can change
this by passing the -thresh <val> flag to the yolo command. For example, to display all detections you can set the threshold to 0:
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg -thresh 0
Which produces an image with every predicted box drawn.
So that's obviously not super useful but you can set it to different values to control what gets
thresholded by the model.
Tiny YOLOv3
We have a very small model as well for constrained environments, yolov3-tiny. To use this model, first
download the weights:
wget https://pjreddie.com/media/files/yolov3-tiny.weights
Then run the detector with the tiny config file and weights:
./darknet detect cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg
Real-Time Detection on a Webcam

Running YOLO on test data isn't very interesting if you can't see the result. Instead of running it on a
bunch of images let's run it on the input from a webcam!
To run this demo you will need to compile Darknet with CUDA and OpenCV. Then run the command:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights
YOLO will display the current FPS and predicted classes as well as the image with bounding boxes
drawn on top of it.
You will need a webcam connected to the computer that OpenCV can connect to or it won't work. If
you have multiple webcams connected and want to select which one to use you can pass the flag -c
<num> to pick (OpenCV uses webcam 0 by default).
You can also run it on a video file if OpenCV can read the video:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights <video file>
That's how we made the YouTube video above.
Training YOLO on VOC
You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters,
or datasets. Here's how to get it working on the Pascal VOC dataset.
Get The Pascal VOC Data

To train YOLO you will need all of the VOC data from 2007 to 2012. You can find links to the data here.
To get all the data, make a directory to store it all and from that directory run:
wget https://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar
wget https://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
wget https://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar
tar xf VOCtrainval_11-May-2012.tar
tar xf VOCtrainval_06-Nov-2007.tar
tar xf VOCtest_06-Nov-2007.tar
There will now be a VOCdevkit/ subdirectory with all the VOC training data in it.
Generate Labels for VOC
Now we need to generate the label files that Darknet uses. Darknet wants a .txt file for each image with
a line for each ground truth object in the image that looks like:
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height. To generate these files we
will run the voc_label.py script in Darknet's scripts/ directory. Let's just download it again because we
are lazy.
wget https://pjreddie.com/media/files/voc_label.py

python voc_label.py
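Under the hood, the per-object normalization the script performs is just a few lines of arithmetic. Here is a sketch of it (a hypothetical standalone version; the actual script reads the VOC XML annotations, and corner coordinates xmin, xmax, ymin, ymax in pixels are assumed):

def convert(size, box):
    # size = (image_width, image_height); box = (xmin, xmax, ymin, ymax) in pixels
    dw, dh = 1.0 / size[0], 1.0 / size[1]
    x = (box[0] + box[1]) / 2.0   # box center
    y = (box[2] + box[3]) / 2.0
    w = box[1] - box[0]           # box width and height
    h = box[3] - box[2]
    return (x * dw, y * dh, w * dw, h * dh)   # everything normalized to [0, 1]

print(convert((640, 480), (100, 300, 120, 360)))   # -> (0.3125, 0.5, 0.3125, 0.5)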
After a few minutes, this script will generate all of the requisite files. Mostly it generates a lot of label
files in VOCdevkit/VOC2007/labels/ and VOCdevkit/VOC2012/labels/. In your directory you should
see:
ls
2007_test.txt VOCdevkit
2007_train.txt voc_label.py
2007_val.txt VOCtest_06-Nov-2007.tar
2012_train.txt VOCtrainval_06-Nov-2007.tar
2012_val.txt VOCtrainval_11-May-2012.tar
The text files like 2007_train.txt list the image files for that year and image set. Darknet needs one text
file with all of the images you want to train on. In this example, let's train with everything except the
2007 test set so that we can test our model. Run:
cat 2007_train.txt 2007_val.txt 2012_*.txt > train.txt
Now we have all the 2007 trainval and the 2012 trainval set in one big list. That's all we have to do for data setup!
Modify Cfg for Pascal Data

Now go to your Darknet directory. We have to change the cfg/voc.data config file to point to your data:
classes= 20
train = <path-to-voc>/train.txt
valid = <path-to-voc>/2007_test.txt
names = data/voc.names
backup = backup
You should replace <path-to-voc> with the directory where you put the VOC data.
Download Pretrained Convolutional Weights
For training we use convolutional weights that are pre-trained on Imagenet. We use weights from
the darknet53 model. You can just download the weights for the convolutional layers here (76 MB).
wget https://pjreddie.com/media/files/darknet53.conv.74
Train The Model
Now we can train! Run the command:
./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
Training YOLO on COCO

You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters,
or datasets. Here's how to get it working on the COCO dataset.
Get The COCO Data
To train YOLO you will need all of the COCO data and labels. The
script scripts/get_coco_dataset.sh will do this for you. Figure out where you want to put the COCO
data and download it, for example:
cp scripts/get_coco_dataset.sh data
cd data
bash get_coco_dataset.sh
Now you should have all the data and the labels generated for Darknet.
Modify cfg for COCO
Now go to your Darknet directory. We have to change the cfg/coco.data config file to point to your
data:
classes= 80
train = <path-to-coco>/trainvalno5k.txt
valid = <path-to-coco>/5k.txt
names = data/coco.names
backup = backup
You should replace <path-to-coco> with the directory where you put the COCO data.
You should also modify your model cfg for training instead of testing. cfg/yolo.cfg should look like this:
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=8
....
Train The Model
Now we can train! Run the command:
./darknet detector train cfg/coco.data cfg/yolov3.cfg darknet53.conv.74
If you want to use multiple gpus run:

./darknet detector train cfg/coco.data cfg/yolov3.cfg darknet53.conv.74 -gpus 0,1,2,3


If you want to stop and restart training from a checkpoint:
./darknet detector train cfg/coco.data cfg/yolov3.cfg backup/yolov3.backup -gpus 0,1,2,3
YOLOv3 on the Open Images dataset
wget https://pjreddie.com/media/files/yolov3-openimages.weights
./darknet detector test cfg/openimages.data cfg/yolov3-openimages.cfg yolov3-openimages.weights
What Happened to the Old YOLO Site?
If you are using YOLO version 2 you can still find the site here: https://pjreddie.com/darknet/yolov2/
Cite
If you use YOLOv3 in your work please cite our paper!
@article{yolov3,
  title={YOLOv3: An Incremental Improvement},
  author={Redmon, Joseph and Farhadi, Ali},
  journal={arXiv},
  year={2018}
}

ARTICLE 2: REAL-TIME OBJECT DETECTION WITH YOLO, YOLOV2 AND NOW YOLOV3
You only look once (YOLO) is an object detection system targeted for real-time processing. We will introduce
YOLO, YOLOv2 and YOLO9000 in this article. For those only interested in YOLOv3, please skip to
the bottom of the article. Here is the accuracy and speed comparison provided by the YOLO web site.

[Figure: accuracy and speed comparison (source)]

[Video: a demonstration of YOLOv2 performing object detection in real time]

Let’s start with our own testing image below.

[Image: testing image]
The objects detected by YOLO:

[Image: objects detected by YOLO]


Grid cell

For our discussion, we crop our original photo. YOLO divides the input image into an S×S grid. Each grid cell
predicts only one object. For example, the yellow grid cell below tries to predict the “person” object whose center
(the blue dot) falls inside the grid cell.

Each grid cell detects only one object.



Each grid cell predicts a fixed number of boundary boxes. In this example, the yellow grid cell makes two boundary
box predictions (blue boxes) to locate where the person is.

Each grid cell makes a fixed number of boundary box guesses for the object.

However, the one-object rule limits how close detected objects can be. For the picture below, there are 9 Santas in the lower left corner but YOLO can detect only 5.

YOLO may miss objects that are too close.


For each grid cell,

 it predicts B boundary boxes and each box has one box confidence score,

 it detects one object only regardless of the number of boxes B,


 it predicts C conditional class probabilities (one per class for the likeliness of the object class).

To evaluate PASCAL VOC, YOLO uses 7×7 grids (S×S), 2 boundary boxes (B) and 20 classes (C).

YOLO makes S×S predictions with B boundary boxes.


Let’s get into more details. Each boundary box contains 5 elements: (x, y, w, h) and a box confidence score. The
confidence score reflects how likely the box contains an object (objectness) and how accurate is the boundary box.
We normalize the bounding box width w and height h by the image width and height. x and y are offsets to the
corresponding cell. Hence, x, y, w and h are all between 0 and 1. Each cell has 20 conditional class probabilities.

The conditional class probability is the probability that the detected object belongs to a particular class (one
probability per category for each cell). So, YOLO’s prediction has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) =
(7, 7, 30).

The major concept of YOLO is to build a CNN network to predict a (7, 7, 30) tensor. It uses a CNN network to
reduce the spatial dimension to 7×7 with 1024 output channels at each location. YOLO performs a linear regression
using two fully connected layers to make 7×7×2 boundary box predictions (the middle picture below). To make a
final prediction, we keep those with high box confidence scores (greater than 0.25) as our final predictions (the right
picture).
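To make the tensor layout concrete, here is a minimal sketch (NumPy, with a random array standing in for a real network output; variable names are hypothetical) that splits a (7, 7, 30) prediction into boxes and class probabilities and applies the score cut:

import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)             # stand-in for the network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)      # (x, y, w, h, box confidence) per box
class_probs = pred[..., B * 5:]                    # C conditional class probabilities per cell

# class confidence score = box confidence score * conditional class probability
scores = boxes[..., 4:5] * class_probs[:, :, None, :]   # shape (S, S, B, C)

keep = scores.max(axis=-1) > 0.25                  # keep boxes whose best class score passes
print(int(keep.sum()), "boxes kept")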

The class confidence score for each prediction box is computed as:

class confidence score = box confidence score × conditional class probability

It measures the confidence on both the classification and the localization (where an object is located).
We may mix up those scoring and probability terms easily. Here are the mathematical definitions for your future
reference.
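In symbols, consistent with the paper's definitions:

\text{box confidence score} \equiv \Pr(\text{object}) \cdot \text{IoU}
\text{conditional class probability} \equiv \Pr(\text{class}_i \mid \text{object})
\text{class confidence score} \equiv \Pr(\text{class}_i) \cdot \text{IoU} = \text{box confidence score} \times \text{conditional class probability}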

Network design

[Figure: YOLO network architecture (source)]

YOLO has 24 convolutional layers followed by 2 fully connected layers (FC). Some convolution layers use 1 × 1
reduction layers alternatively to reduce the depth of the feature maps. For the last convolution layer, it outputs a
tensor with shape (7, 7, 1024). The tensor is then flattened. Using 2 fully connected layers as a form of linear
regression, it outputs 7×7×30 parameters and then reshapes to (7, 7, 30), i.e. 2 boundary box predictions per
location.
A faster but less accurate version of YOLO, called Fast YOLO, uses only 9 convolutional layers with shallower
feature maps.
Loss function
YOLO predicts multiple bounding boxes per grid cell. To compute the loss for the true positive, we only want one
of them to be responsible for the object. For this purpose, we select the one with the highest IoU (intersection over
union) with the ground truth. This strategy leads to specialization among the bounding box predictions. Each
prediction gets better at predicting certain sizes and aspect ratios.

YOLO uses sum-squared error between the predictions and the ground truth to calculate the loss. The loss function is composed of:

 the classification loss.


 the localization loss (errors between the predicted boundary box and the ground truth).
 the confidence loss (the objectness of the box).

Classification loss

If an object is detected, the classification loss at each cell is the squared error of the class conditional probabilities
for each class:
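From the YOLO paper, where \mathbb{1}_i^{\text{obj}} = 1 if an object appears in cell i and \hat{p}_i(c) is the predicted conditional class probability for class c in cell i:

\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2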

Localization loss

The localization loss measures the errors in the predicted boundary box locations and sizes. We only count the box
responsible for detecting the object.

We do not want to weight absolute errors in large boxes and small boxes equally. i.e. a 2-pixel error in a large box is
the same for a small box. To partially address this, YOLO predicts the square root of the bounding box width and
height instead of the width and height. In addition, to put more emphasis on the boundary box accuracy, we multiply
the loss by λcoord (default: 5).
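From the YOLO paper, with \mathbb{1}_{ij}^{\text{obj}} = 1 when the j-th box in cell i is responsible for the object:

\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]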

Confidence loss

If an object is detected in the box, the confidence loss (measuring the objectness of the box) is:
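\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2

where \hat{C}_i is the predicted box confidence score and \mathbb{1}_{ij}^{\text{obj}} = 1 when the j-th box in cell i is responsible for the object (notation as in the YOLO paper).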

If an object is not detected in the box, the confidence loss is:
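\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2

where \mathbb{1}_{ij}^{\text{noobj}} is the complement of \mathbb{1}_{ij}^{\text{obj}}, and \lambda_{\text{noobj}} weights the loss down, as explained next.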



Most boxes do not contain any objects. This causes a class imbalance problem, i.e. we train the model to detect
background more frequently than detecting objects. To remedy this, we weight this loss down by a factor
λnoobj (default: 0.5).

Loss

The final loss adds localization, confidence and classification losses together.

Inference: Non-maximal suppression
YOLO can make duplicate detections for the same object. To fix this, YOLO applies non-maximal suppression to
remove duplications with lower confidence. Non-maximal suppression adds 2-3% in mAP.

Here is one of the possible non-maximal suppression implementations:

1. Sort the predictions by the confidence scores.


2. Start from the top scores, ignore any current prediction if we find any previous predictions that have the same
class and IoU > 0.5 with the current prediction.
3. Repeat step 2 until all predictions are checked.
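A minimal sketch of that procedure (plain Python; the box format [x1, y1, x2, y2, score, class_id] is a hypothetical choice):

def iou(a, b):
    # intersection over union of two corner-format boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(preds, iou_thresh=0.5):
    preds = sorted(preds, key=lambda p: p[4], reverse=True)   # step 1: sort by score
    kept = []
    for p in preds:
        # step 2: ignore p if a higher-scoring kept box of the same class overlaps too much
        if all(k[5] != p[5] or iou(k, p) <= iou_thresh for k in kept):
            kept.append(p)
    return kept                                               # step 3: all predictions checked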

Benefits of YOLO

 Fast. Good for real-time processing.


 Predictions (object locations and classes) are made from one single network. Can be trained end-to-end to
improve accuracy.
 YOLO is more generalized. It outperforms other methods when generalizing from natural images to other
domains like artwork.

 Region proposal methods limit the classifier to a specific region. YOLO has access to the whole image when predicting boundaries. With the additional context, YOLO demonstrates fewer false positives in background areas.
 YOLO detects one object per grid cell. It enforces spatial diversity in making predictions.

YOLOv2
SSD is a strong competitor for YOLO, at one point demonstrating higher accuracy for real-time processing. Compared with region-based detectors, YOLO has higher localization errors and lower recall (a measure of how well all objects are located). YOLOv2 is the second version of YOLO, with the objective of improving the accuracy significantly while making it faster.
Accuracy improvements
Batch normalization
Add batch normalization in convolution layers. This removes the need for dropouts and pushes mAP up 2%.
High-resolution classifier
YOLO training consists of 2 phases. First, we train a classifier network like VGG16. Then we replace the
fully connected layers with a convolution layer and retrain it end-to-end for the object detection. YOLO trains the
classifier with 224 × 224 pictures followed by 448 × 448 pictures for the object detection. YOLOv2 starts with 224
× 224 pictures for the classifier training but then retunes the classifier again with 448 × 448 pictures using much
fewer epochs. This makes the detector training easier and moves mAP up by 4%.
Convolutional with Anchor Boxes
As indicated in the YOLO paper, the early training is susceptible to unstable gradients. Initially, YOLO makes
arbitrary guesses on the boundary boxes. These guesses may work well for some objects but badly for others
resulting in steep gradient changes. In early training, predictions are fighting with each other on what shapes to
specialize on.

In the real-life domain, the boundary boxes are not arbitrary. Cars have very similar shapes and pedestrians have an
approximate aspect ratio of 0.41.

Since we only need one guess to be right, the initial training will be more stable if we start with diverse guesses that
are common for real-life objects.

[Image: more diverse predictions]



For example, we can create 5 anchor boxes with the following shapes.

[Image: 5 anchor boxes]
Instead of predicting 5 arbitrary boundary boxes, we predict offsets to each of the anchor boxes above. If we constrain the offset values, we can maintain the diversity of the predictions and have each prediction focus on a specific shape. So the initial training will be more stable.
In the paper, anchors are also called priors.
Here are the changes we make to the network:
 Remove the fully connected layers responsible for predicting the boundary box.

 We move the class prediction from the cell level to the boundary box level. Now, each prediction includes 4
parameters for the boundary box, 1 box confidence score (objectness) and 20 class probabilities. i.e. 5 boundary
boxes with 25 parameters: 125 parameters per grid cell. Same as YOLO, the objectness prediction still predicts
the IOU of the ground truth and the proposed box.

 To generate predictions with a shape of 7 × 7 × 125, we replace the last convolution layer with three 3 × 3
convolutional layers each outputting 1024 output channels. Then we apply a final 1 × 1 convolutional layer to
convert the 7 × 7 × 1024 output into 7 × 7 × 125. (See the section on DarkNet for the details.)

Using convolution filters to make predictions.

 Change the input image size from 448 × 448 to 416 × 416. This creates an odd-numbered spatial dimension (a 7×7 vs. an 8×8 grid). The center of a picture is often occupied by a large object. With an odd number of grid cells, it is more certain where the object belongs.

 Remove one pooling layer to make the spatial output of the network 13×13 (instead of 7×7).

Anchor boxes decrease mAP slightly from 69.5 to 69.2 but the recall improves from 81% to 88%. i.e. even though the accuracy slightly decreases, it increases the chances of detecting all the ground truth objects.

Dimension Clusters

In many problem domains, the boundary boxes have strong patterns. For example, in autonomous driving, the 2
most common boundary boxes will be cars and pedestrians at different distances. To identify the top-K boundary
boxes that have the best coverage for the training data, we run K-means clustering on the training data to locate the
centroids of the top-K clusters.

(Image modified from a k-means cluster)

Since we are dealing with boundary boxes rather than points, we cannot use the regular spatial distance to measure
datapoint distances. No surprise, we use IoU.
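The distance used is d(box, centroid) = 1 − IoU(box, centroid). A minimal sketch of anchor clustering over (width, height) pairs (NumPy; the data here is a random stand-in):

import numpy as np

def iou_wh(boxes, centroids):
    # IoU when boxes and centroids share the same center, so only (w, h) matter
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=20):
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min distance = max IoU
        for j in range(k):
            if np.any(assign == j):                           # skip empty clusters
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

boxes = np.abs(np.random.randn(500, 2)) + 0.1                 # stand-in (w, h) data
print(kmeans_anchors(boxes))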

[Figure: anchor cluster analysis (source)]

On the left, we plot the average IoU between the anchors and the ground truth boxes using different numbers of
clusters (anchors). As the number of anchors increases, the accuracy improvement plateaus. For the best return,
YOLO settles down with 5 anchors. On the right, it displays the 5 anchors’ shapes. The purplish-blue rectangles are
selected from the COCO dataset while the black border rectangles are selected from the VOC2007. In both cases,
we have more thin and tall anchors indicating that real-life boundary boxes are not arbitrary.

Unless we are comparing YOLO and YOLOv2, we will reference YOLOv2 as YOLO for now.

Direct location prediction

We make predictions on the offsets to the anchors. Nevertheless, if the prediction is unconstrained, our guesses will be randomized again. YOLO predicts 5 parameters (tx, ty, tw, th, and to) and applies the sigmoid function σ to constrain the possible offset range.
Here is the visualization. The blue box below is the predicted boundary box and the dotted rectangle is the anchor.

Modified from the paper.


With the use of k-means clustering (dimension clusters) and the improvement mentioned in this section, mAP
increases 5%.

Fine-Grained Features

Convolution layers decrease the spatial dimension gradually. As the corresponding resolution decreases, it is harder
to detect small objects. Other object detectors like SSD locate objects from different layers of feature maps. So each
layer specializes at a different scale. YOLO adopts a different approach called passthrough. It reshapes the 28 × 28
× 512 layer to 14 × 14 × 2048. Then it concatenates with the original 14 × 14 × 1024 output layer. Now we apply
convolution filters on the new 14 × 14 × 3072 layer to make predictions.
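A sketch of that passthrough (space-to-depth) reshape in NumPy, using the shapes above (random arrays stand in for real feature maps):

import numpy as np

fine = np.random.rand(28, 28, 512)        # higher-resolution feature map (stand-in)
coarse = np.random.rand(14, 14, 1024)     # original output layer (stand-in)

# space-to-depth: move each 2x2 spatial block into the channel dimension
h, w, c = fine.shape
passthrough = fine.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
passthrough = passthrough.reshape(h // 2, w // 2, 4 * c)      # (14, 14, 2048)

merged = np.concatenate([passthrough, coarse], axis=-1)       # (14, 14, 3072)
print(merged.shape)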

Multi-Scale Training

After removing the fully connected layers, YOLO can take images of different sizes. If the width and height are
doubled, we are just making 4x output grid cells and therefore 4x predictions. Since the YOLO network
downsamples the input by 32, we just need to make sure the width and height are multiples of 32. During training, YOLO takes images of size 320×320, 352×352, … and 608×608 (with a step of 32). For every 10 batches, YOLOv2 randomly selects another image size to train the model. This acts as data augmentation and forces the network to predict well for different input image dimensions and scales. In addition, we can use lower-resolution images for object detection at the cost of accuracy. This can be a good tradeoff for speed on low-power GPU devices. At 288 × 288 YOLO runs at more than 90 FPS with mAP almost as good as Fast R-CNN. At high resolution, YOLO achieves 78.6 mAP on VOC 2007.
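A sketch of that size schedule (Python; the draw below is a hypothetical stand-in for a training loop that switches size every 10 batches):

import random

def pick_input_size(low=320, high=608, step=32):
    # the network downsamples by 32, so the size must be a multiple of 32
    return random.choice(range(low, high + 1, step))

print([pick_input_size() for _ in range(5)])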

Accuracy
Here are the accuracy improvements after applying the techniques discussed so far:

[Table: incremental accuracy improvements (source)]

Accuracy comparison for different detectors:

[Figure: accuracy comparison (source)]

Speed improvement
GoogLeNet

VGG16 requires 30.69 billion floating point operations for a single pass over a 224 × 224 image versus 8.52 billion
operations for a customized GoogLeNet. We can replace the VGG16 with the customized GoogLeNet. However,
YOLO pays a price on the top-5 accuracy for ImageNet: accuracy drops from 90.0% to 88.0%.

DarkNet

We can further simplify the backbone CNN used. Darknet requires only 5.58 billion operations. With DarkNet, YOLO achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet. Darknet uses mostly 3 × 3 filters to extract features and 1 × 1 filters to reduce output channels. It also uses global average pooling to make predictions. Here is the detailed network description:

[Figure: Darknet network description (modified from source)]

We replace the last convolution layer (the crossed-out section) with three 3 × 3 convolutional layers each outputting
1024 output channels. Then we apply a final 1 × 1 convolutional layer to convert the 7 × 7 × 1024 output into 7 × 7
× 125. (5 boundary boxes each with 4 parameters for the box, 1 objectness score and 20 conditional class
probabilities)

[Figure: YOLO with DarkNet]

Training
YOLO is trained with the ImageNet 1000 class classification dataset in 160 epochs: using stochastic gradient
descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and
momentum of 0.9. In the initial training, YOLO uses 224 × 224 images, and then retunes the model with 448 × 448 images for 10 epochs at a 10^-3 learning rate. After the training, the classifier achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.

Then the fully connected layers and the last convolution layer are removed for the detector. YOLO adds three 3 × 3 convolutional layers with 1024 filters each, followed by a final 1 × 1 convolutional layer with 125 output channels (5 box predictions, each with 25 parameters). YOLO also adds a passthrough layer. YOLO trains the network for 160 epochs with a starting learning rate of 10^-3, dividing it by 10 at 60 and 90 epochs. YOLO uses a weight decay of 0.0005 and momentum of 0.9.

Classification
Datasets for object detection have far fewer class categories than those for classification. To expand the classes that
YOLO can detect, YOLO proposes a method to mix images from both detection and classification datasets during
training. It trains the end-to-end network with the object detection samples while backpropagating the classification
loss from the classification samples to train the classifier path. This approach encounters a few challenges:

 How do we merge class labels from different datasets? In particular, object detection datasets and different classification datasets use different labels.
 Any merged labels may not be mutually exclusive, for example, Norfolk terrier in ImageNet and dog in COCO. Since the labels are not mutually exclusive, we cannot use softmax to compute the probability.

Hierarchical classification

Without going into details, YOLO combines labels in different datasets to form a tree-like structure called WordTree. The children form an is-a relationship with their parent, like biplane is a plane. But the merged labels are now not mutually exclusive.

Combining COCO and ImageNet labels into a hierarchical WordTree (source)


Let’s simplify the discussion using the 1000 class ImageNet. Instead of predicting 1000 labels in a flat structure, we
create the corresponding WordTree which has 1000 leaf nodes for the original labels and 369 nodes for their
parent classes. Originally, YOLO predicts the class score for the biplane. But with the WordTree, it now predicts the
score for the biplane given it is an airplane.

Since the children of a parent are mutually exclusive given that parent, we can apply a softmax function over the scores of a node and its siblings to compute each conditional probability, e.g. the score for biplane given that it is an airplane. The difference is, instead of one softmax operation over all classes, YOLO performs multiple softmax operations, one for each parent's children.
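A sketch of the per-group softmax and the chained class probability (NumPy; the two-level tree and its labels are hypothetical):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.random.rand(4)
groups = {"root": [0, 1],      # children of the root: animal, vehicle
          "vehicle": [2, 3]}   # children of vehicle: airplane, car

# one softmax per parent's children
p_root = softmax(logits[groups["root"]])
p_vehicle = softmax(logits[groups["vehicle"]])

# Pr(airplane) = Pr(airplane | vehicle) * Pr(vehicle)
pr_airplane = p_vehicle[0] * p_root[1]
print(pr_airplane)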

The class probability is then computed from the YOLO predictions by going up the WordTree.

For classification, we assume an object is already detected and therefore Pr(physical object)=1.

One benefit of hierarchical classification is that when YOLO cannot distinguish the type of airplane, it gives a high score to the airplane class instead of forcing it into one of the sub-categories.

When YOLO sees a classification image, it only backpropagates classification loss to train the classifier. YOLO finds
the bounding box that predicts the highest probability for that class and it computes the classification loss as well as
those from the parents. (If an object is labeled as a biplane, it is also considered to be labeled as airplane, air, vehicle…
) This encourages the model to extract features common to them. So even if we have never trained a specific class of
objects for object detection, we can still make such predictions by generalizing predictions from related objects.

In object detection, we set Pr(physical object) equals to the box confidence score which measures whether the box
has an object. YOLO traverses down the tree, taking the highest confidence path at every split until it reaches some
threshold and YOLO predicts that object class.

YOLO9000
YOLO9000 extends YOLO to detect objects over 9000 classes using hierarchical classification with a 9418-node WordTree. It combines samples from COCO and the top 9000 classes from ImageNet. YOLO draws four ImageNet samples for every COCO sample. It learns to find objects using the detection data in COCO and to classify these objects with ImageNet samples.
During evaluation, YOLO is tested on categories that it knows how to classify but was not directly trained to locate, i.e. categories that do not exist in COCO. YOLO9000 evaluates its result on the ImageNet object detection

dataset, which has 200 categories. It shares about 44 categories with COCO. Therefore, the dataset contains 156 categories that YOLO has never been directly trained to locate. YOLO extracts similar features for related object types. Hence, we can detect those 156 categories simply from the feature values.
YOLO9000 gets 19.7 mAP overall with 16.0 mAP on those 156 categories. YOLO9000 performs well with new
species of animals not found in COCO because their shapes can be generalized easily from their parent classes.
However, COCO does not have bounding box labels for any type of clothing so the test struggles with categories like
“sunglasses”.
YOLOv3
A quote from the YOLO web site on YOLOv3:
On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev.
Class Prediction
Most classifiers assume output labels are mutually exclusive. That is true if the outputs are mutually exclusive object classes, so YOLO applies a softmax function to convert scores into probabilities that sum up to one. YOLOv3 instead uses multi-label classification. For example, the output labels may be "pedestrian" and "child", which are not mutually exclusive (the sum of the outputs can now be greater than 1). YOLOv3 replaces the softmax function with independent logistic classifiers to calculate the likelihood that the input belongs to a specific label. Instead of using mean square error in calculating the classification loss, YOLOv3 uses binary cross-entropy loss for each label. This also reduces the computation complexity by avoiding the softmax function.
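A small sketch of the difference (NumPy; the scores are made up): softmax forces the probabilities to sum to one, while independent logistic outputs can each be high at the same time:

import numpy as np

scores = np.array([2.0, 1.8, -1.0])        # e.g. pedestrian, child, car

softmax = np.exp(scores) / np.exp(scores).sum()
print(softmax, softmax.sum())              # sums to 1: mutually exclusive labels

sigmoid = 1 / (1 + np.exp(-scores))
print(sigmoid, sigmoid.sum())              # each in (0, 1); the sum can exceed 1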
Bounding box prediction & cost function calculation
YOLOv3 predicts an objectness score for each bounding box using logistic regression. YOLOv3 changes the way the cost function is calculated. If the bounding box prior (anchor) overlaps a ground truth object more than others, the
corresponding objectness score should be 1. For other priors with overlap greater than a predefined threshold (default
0.5), they incur no cost. Each ground truth object is associated with one boundary box prior only. If a bounding box

prior is not assigned, it incurs no classification and localization loss, just confidence loss on objectness. We use tx and
ty (instead of bx and by) to compute the loss.

Feature Pyramid Network (FPN)-like feature pyramid

YOLOv3 makes 3 predictions per location. Each prediction is composed of a boundary box, an objectness score and 80 class scores, i.e. N × N × [3 × (4 + 1 + 80)] predictions.

YOLOv3 makes predictions at 3 different scales (similar to the FPN):

1. In the last feature map layer.


2. Then it goes back 2 layers and upsamples the feature map by 2. YOLOv3 then takes a feature map with higher resolution and merges it with the upsampled feature map using element-wise addition. YOLOv3 applies convolutional filters on the merged map to make the second set of predictions.

3. Repeat step 2 again so the resulting feature map layer has good high-level structure (semantic) information and good resolution spatial information on object locations.

To determine the priors, YOLOv3 applies k-means clustering. Then it pre-selects 9 clusters. For COCO, the widths and heights of the anchors are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326). These 9 priors are grouped into 3 different groups according to their scale. Each group is assigned to a specific feature map above for detecting objects.

Feature extractor

A new 53-layer Darknet-53 is used to replace the Darknet-19 as the feature extractor. Darknet-53 is mainly composed of 3 × 3 and 1 × 1 filters with skip connections like the residual network in ResNet. Darknet-53 has fewer BFLOPs (billion floating point operations) than ResNet-152, but achieves the same classification accuracy at twice the speed.

[Figure: Darknet-53 architecture]

YOLOv3 performance

YOLOv3's COCO AP metric is on par with SSD's but 3x faster. But YOLOv3's AP is still behind RetinaNet's. In particular, AP@IoU=.75 drops significantly compared with RetinaNet, which suggests YOLOv3 has higher localization error. YOLOv3 also shows significant improvement in detecting small objects.

YOLOv3 performs very well in the fast detector category when speed is important.


ARTICLE 3: YOLO — YOU ONLY LOOK ONCE, REAL TIME OBJECT DETECTION EXPLAINED
Original paper (CVPR 2016. OpenCV People’s Choice Award) https://arxiv.org/pdf/1506.02640v5.pdf

YOLOv2: https://arxiv.org/pdf/1612.08242v1.pdf

Biggest advantages:

 Speed (45 frames per second — better than realtime)


 Network understands generalized object representation (This allowed them to train the network on real world images and predictions on artwork were still fairly accurate).
 faster version (with smaller architecture) — 155 frames per sec but is less accurate.
 open source: https://pjreddie.com/darknet/yolo/

High level idea:


Compared to other region proposal classification networks (Fast R-CNN), which perform detection on various region proposals and thus end up performing prediction multiple times for various regions in an image, the YOLO architecture is more like an FCNN (fully convolutional neural network): it passes the image (n×n) once through the FCNN and the output is an (m×m) prediction. Thus the architecture splits the input image into an m×m grid and, for each grid cell, generates 2 bounding boxes and class probabilities for those bounding boxes. Note that a bounding box is more likely to be larger than the grid cell itself. From the paper:
We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates
and class probabilities.
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those
boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several
benefits over traditional methods of object detection. First, YOLO is extremely fast. Since we frame detection as a
regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time
to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and
a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25
milliseconds of latency.
Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region
proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes
contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes
background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the
number of background errors compared to Fast R-CNN.
Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork,
YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly
generalizable it is less likely to break down when applied to new domains or unexpected inputs.
Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes
across all classes for an image simultaneously. This means our network reasons globally about the full image and all
the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high
average precision.
Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is
responsible for detecting that object.
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how
confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
Formally we define confidence as Pr(Object) × IOU. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of
the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally
the confidence prediction represents the IOU between the predicted box and any ground truth box. Each grid cell also
predicts C conditional class probabilities, Pr(Classi |Object). These probabilities are conditioned on the grid cell
containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time we multiply the conditional class probabilities and the individual box confidence predictions:

Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
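As a rough illustration (not code from the paper; the function and variable names below are made up), this is how the class-specific score for one predicted box could be computed:

import numpy as np

# class_probs: Pr(Class_i | Object) for C classes; objectness: Pr(Object);
# iou: IOU between the predicted box and the ground truth box
def class_specific_scores(class_probs, objectness, iou):
    return class_probs * objectness * iou  # = Pr(Class_i) * IOU, one score per class

print(class_specific_scores(np.array([0.7, 0.2, 0.1]), 0.9, 0.8))
# [0.504 0.144 0.072]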
Network Architecture and Training:

The changes to the loss function for better results are interesting. Two things stand out:
1. Differential weighting of confidence predictions from boxes that contain an object and boxes that don't contain an object during training.
2. Predicting the square root of the bounding box width and height to penalize errors in small objects and large objects differently.
Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers.
Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers.
Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to
optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization
error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain
any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from
cells that do contain objects. This can lead to model instability, causing training to diverge early on. To remedy this,
we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for
boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5
and λnoobj = .5.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that
small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of
the bounding box width and height instead of the width and height directly.
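As a rough sketch of this weighting scheme (not the authors' code; the array layout and names are assumed), the loss contribution of a single predictor could look like:

import numpy as np

LAMBDA_COORD = 5.0   # up-weight coordinate errors
LAMBDA_NOOBJ = 0.5   # down-weight confidence errors for empty boxes

def box_loss(pred, truth, responsible):
    # pred, truth: [x, y, w, h, confidence] for a single predictor
    if not responsible:
        # no object assigned: penalize only a non-zero confidence
        return LAMBDA_NOOBJ * pred[4] ** 2
    coord = (pred[0] - truth[0]) ** 2 + (pred[1] - truth[1]) ** 2
    # square roots make the same absolute error cost more for small boxes
    size = (np.sqrt(pred[2]) - np.sqrt(truth[2])) ** 2 \
         + (np.sqrt(pred[3]) - np.sqrt(truth[3])) ** 2
    conf = (pred[4] - truth[4]) ** 2
    return LAMBDA_COORD * (coord + size) + conf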
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to
be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which
prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box
predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall
recall.

Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes
and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict.
Our model struggles with small objects that appear in groups, such as flocks of birds. Since our model learns to predict
bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our
model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple
downsampling layers from the input image. Finally, while we train on a loss function that approximates detection
performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small
error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main
source of error is incorrect localizations.
YOLOv2: https://arxiv.org/pdf/1612.08242v1.pdf
Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization
errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus
mainly on improving recall and localization while maintaining classification accuracy.
By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in
mAP. Batch normalization also helps regularize the model. With batch normalization we can remove dropout from
the model without overfitting.
For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet.
This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting
network on detection. This high resolution classification network gives us an increase of almost 4% mAP.
Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected
layers on top of the convolutional feature extractor. Predicting offsets instead of coordinates simplifies the problem
and makes it easier for the network to learn. We remove the fully connected layers from YOLO and use anchor boxes
to predict bounding boxes. Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes
per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate
model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even
though the mAP decreases, the increase in recall means that our model has more room to improve.
Fine-Grained Features. This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD
both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a
different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.
The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent
features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the
26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features.
Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest
1% performance increase.
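The passthrough layer is essentially a space-to-depth rearrangement. A minimal numpy sketch (assuming a channels-last layout, which differs from Darknet's internal layout):

import numpy as np

def passthrough(features, stride=2):
    # features: (H, W, C), e.g. (26, 26, 512)
    h, w, c = features.shape
    # stack each stride x stride block of spatial positions into the channels
    out = features.reshape(h // stride, stride, w // stride, stride, c)
    out = out.transpose(0, 2, 1, 3, 4)
    return out.reshape(h // stride, w // stride, c * stride * stride)

print(passthrough(np.zeros((26, 26, 512))).shape)  # (13, 13, 2048)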
During training we mix images from both detection and classification datasets. When our network sees an image
labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification
image we only backpropagate loss from the classification specific parts of the architecture.
Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts
and how they relate [12]. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which
is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume
a flat structure to the labels however for combining datasets, structure is exactly what we need.
LESSON 4: YOLO: REAL-TIME OBJECT DETECTION
You only look once (YOLO) is a system for detecting objects on the Pascal VOC 2012 dataset. It can detect the 20
Pascal object classes:
 person
 bird, cat, cow, dog, horse, sheep
 aeroplane, bicycle, boat, bus, car, motorbike, train
 bottle, chair, dining table, potted plant, sofa, tv/monitor
YOLO is joint work with Santosh, Ross, and Ali, and is described in detail in our paper.

How it works

All prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an
image at multiple locations and scales. High scoring regions of the image are considered detections.
We use a totally different approach. We apply a single neural network to the full image. This network divides the
image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are
weighted by the predicted probabilities.
Finally, we can threshold the detections by some value to only see high scoring detections:
Our model has several advantages over classifier-based systems. It looks at the whole image at test time so its
predictions are informed by global context in the image. It also makes predictions with a single network evaluation
unlike systems like R-CNN which require thousands for a single image. This makes it extremely fast, more than
1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on the full system.

Detection Using A Pre-Trained Model

This post will guide you through detecting objects with the YOLO system using a pre-trained model. If you don't
already have Darknet installed, you should do that first.

You already have the config file for YOLO in the cfg/ subdirectory. You will have to download the pre-trained
weight file here (1.0 GB).

Now you can run the Darknet yolo command in testing mode:
./darknet yolo test cfg/yolo.cfg <path>/yolo.weights <image>
I've included some example images to try in case you need inspiration.
Try data/eagle.jpg, data/dog.jpg, data/person.jpg, or data/horses.jpg! Assuming your weight file is in the base
directory, you will see something like this:
./darknet yolo test cfg/yolo.cfg yolo.weights data/dog.jpg
0: Crop Layer: 448 x 448 -> 448 x 448 x 3 image
1: Convolutional Layer: 448 x 448 x 3 image, 64 filters -> 224 x 224 x 64 image
....
27: Connected Layer: 4096 inputs, 1225 outputs
28: Detection Layer
Loading weights from yolo.weights...Done!
data/dog.jpg: Predicted in 8.012962 seconds.
0.941620 car
0.397087 bicycle
0.220952 dog
Not compiled with OpenCV, saving to predictions.png instead
Darknet prints out the objects it detected, its confidence, and how long it took to find them. Since we are using
Darknet on the CPU it takes around 6-12 seconds per image. If we use the GPU version it would be much faster.

We didn't compile Darknet with OpenCV so it can't display the detections directly. Instead, it saves them
in predictions.png. You can open it to see the detected objects:
Hooray!!

Multiple Images

Instead of supplying an image on the command line, you can leave it blank to try multiple images in a row. You will instead see a prompt when the config and weights are done loading:
./darknet yolo test cfg/yolo.cfg yolo.weights


0: Crop Layer: 448 x 448 -> 448 x 448 x 3 image
1: Convolutional Layer: 448 x 448 x 3 image, 64 filters -> 224 x 224 x 64 image
....
27: Connected Layer: 4096 inputs, 1225 outputs
28: Detection Layer
Loading weights from yolo.weights...Done!
Enter Image Path:
Enter an image path like data/eagle.jpg to have it predict boxes for that image. Once it is done it will prompt you
for more paths to try different images. Use Ctrl-C to exit the program once you are done.

A Smaller Model

The original YOLO model uses a lot of GPU memory. If you have a smaller graphics card you can try using the
smaller version of the YOLO model, yolo-small.cfg. You should already have the config file in the cfg/
subdirectory. Download the pretrained weights here (359 MB). Then you can run the model!
./darknet yolo test cfg/yolo-small.cfg yolo-small.weights
The small version of YOLO only uses 1.1 GB of GPU memory so it should be suitable for many smaller graphics
cards.

A Tiny Model

The yolo-tiny.cfg is based on the Darknet reference network. You should already have the config file in
the cfg/ subdirectory. Download the pretrained weights here (172 MB). Then you can run the model!
./darknet yolo test cfg/yolo-tiny.cfg yolo-tiny.weights


The tiny version of YOLO only uses 611 MB of GPU memory and it runs at more than 150 fps on a Titan X.

YOLO Model Comparison

 yolo.cfg is based on the extraction network. It processes images at 45 fps, here are weight files
for yolo.cfg trained on 2007 train/val+ 2012 train/val, and trained on all 2007 and 2012 data.
 yolo-small.cfg has smaller fully connected layers so it uses far less memory. It processes images at 50 fps,
here are weight files for yolo-small.cfg trained on 2007 train/val+ 2012 train/val.
 yolo-tiny.cfg is much smaller and based on the Darknet reference network. It processes images at 155 fps,
here are weight files for yolo-tiny.cfg trained on 2007 train/val+ 2012 train/val.

Changing The Detection Threshold

By default, YOLO only displays objects detected with a confidence of .2 or higher. You can change this by passing the -thresh <val> flag to the yolo command. For example, to display all detections you can set the threshold to 0:
./darknet yolo test cfg/yolo.cfg yolo.weights data/dog.jpg -thresh 0
Which produces:
Real-Time Detection On VOC 2012

If you compile Darknet with CUDA then it can process images waaay faster than you can type them in. To
efficiently detect objects in multiple images we can use the valid subroutine of yolo.
First we have to get our data and generate some metadata for Darknet. The VOC 2012 test data can be
found here but you'll need an account! Once you get the file 2012test.tar you need to run the following commands:
tar xf 2012test.tar
cp VOCdevkit/VOC2012/ImageSets/Main/test.txt .
sed 's?^?'`pwd`'/VOCdevkit/VOC2012/JPEGImages/?; s?$?.jpg?' test.txt > voc.2012.test
These commands extract the data and generate a list of the full paths of the test images. Next, move this list to
the darknet/data subdirectory:
mv voc.2012.test <path-to>/darknet/data
Now you are ready to do some detection! Make sure Darknet is compiled with CUDA so you can be super fast.
Then run:
./darknet yolo valid cfg/yolo.cfg yolo.weights
You will see a whole bunch of numbers start to fly by. That's how many images you've run detection on! On a
Titan X I see this as the final output:

....
10984
10992
Total Detection Time: 250.000000 Seconds
There are 10,991 images in the VOC 2012 test set. We just processed them in 250 seconds! That's 44 frames per
second! If you were using Selective Search it would take you 6 hours to even extract region proposals for all of the
images. We just ran a full detection pipeline in 4 minutes. Pretty cool.

The predicted detections are in the results/ subdirectory. They are in the format specified for Pascal
VOC submission.
If you are interested in reproducing our numbers on the Pascal challenge you should use this weight file (1.0
GB) instead. It was trained with the IOU prediction we describe in the paper which gives slightly better mAP
scores. The numbers won't match exactly since I accidentally deleted the original weight file but they will be
approximately the same.

Real-Time Detection on a Webcam

Running YOLO on test data isn't very interesting if you can't see the result. Instead of running it on a bunch of
images let's run it on the input from a webcam! Here is an example of YOLO running on a webcam that we then
pointed at YouTube videos:
To run this demo you will need to compile Darknet with CUDA and OpenCV. You will also need to pick a YOLO
config file and have the appropriate weights file. Then run the command:

./darknet yolo demo cfg/yolo.cfg yolo.weights


YOLO will display the current FPS and predicted classes as well as the image with bounding boxes drawn on top
of it.
You will need a webcam connected to the computer that OpenCV can connect to or it won't work. If you have
multiple webcams connected and want to select which one to use you can pass the flag -c <num> to pick (OpenCV
uses webcam 0 by default).
Training YOLO

You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or
datasets. Here's how to get it working on the Pascal VOC dataset.

Get The Pascal VOC Data

To train YOLO you will need all of the VOC data from 2007 to 2012. You can find links to the data here. To get
all the data, make a directory to store it all and from that directory run:

curl -O http://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar
curl -O http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
curl -O http://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar
tar xf VOCtrainval_11-May-2012.tar
tar xf VOCtrainval_06-Nov-2007.tar
tar xf VOCtest_06-Nov-2007.tar
There will now be a VOCdevkit/ subdirectory with all the VOC training data in it.

Generate Labels for VOC

Now we need to generate the label files that Darknet uses. Darknet wants a .txt file for each image with a line for
each ground truth object in the image that looks like:
<object-class> <x> <y> <width> <height>
Where x, y, width, and height are relative to the image's width and height. To generate these files we will run the voc_label.py script in Darknet's scripts/ directory. Let's just download it again because we are lazy.
curl -O http://pjreddie.com/media/files/voc_label.py
python voc_label.py
After a few minutes, this script will generate all of the requisite files. Mostly it generates a lot of label files
in VOCdevkit/VOC2007/labels/ and VOCdevkit/VOC2012/labels/. In your directory you should see:
ls
2007_test.txt VOCdevkit
2007_train.txt voc_label.py
2007_val.txt VOCtest_06-Nov-2007.tar
2012_train.txt VOCtrainval_06-Nov-2007.tar
2012_val.txt VOCtrainval_11-May-2012.tar
The text files like 2007_train.txt list the image files for that year and image set. Darknet needs one text file with all
of the images you want to train on. In this example, let's train with everything except the validation set from 2012
so that we can test our model. Run:
cat 2007_* 2012_train.txt > train.txt
Now we have all the 2007 images and the 2012 train set in one big list. That's all we have to do for data setup!
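For reference, the heart of voc_label.py is a conversion from VOC pixel coordinates to the normalized format above. A minimal sketch of that conversion (box assumed as (xmin, xmax, ymin, ymax) in pixels):

def convert(size, box):
    # size: (image_width, image_height); box: (xmin, xmax, ymin, ymax)
    dw, dh = 1.0 / size[0], 1.0 / size[1]
    x = (box[0] + box[1]) / 2.0 * dw  # box center x, relative to image width
    y = (box[2] + box[3]) / 2.0 * dh  # box center y, relative to image height
    w = (box[1] - box[0]) * dw        # box width, relative to image width
    h = (box[3] - box[2]) * dh        # box height, relative to image height
    return (x, y, w, h)

print(convert((640, 480), (100, 300, 120, 360)))  # (0.3125, 0.5, 0.3125, 0.5)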

Point Darknet to Pascal Data

Now go to your Darknet directory. We will have to change the train subroutine of yolo to point it to your copy of the VOC data. Edit src/yolo.c, lines 57 and 58:
57 char *train_images = "/home/pjreddie/data/voc/test/train.txt";
58 char *backup_directory = "/home/pjreddie/backup/";
train_images should point to the train.txt file you just generated and backup_directory should point to a directory
where you want to store backup weights files during training. Once you have edited the lines, re-make Darknet.

Download Pretrained Convolutional Weights

For training we use convolutional weights that are pre-trained on Imagenet. We use weights from
the Extraction model. You can just download the weights for the convolutional layers here (54 MB).

If you want to generate the pre-trained weights yourself, download the pretrained Extraction model and run the following command:

./darknet partial cfg/extraction.cfg extraction.weights extraction.conv.weights 25


But if you just download the weights file it's way easier.

Train!!

You are finally ready to start training. Run:


./darknet yolo train cfg/yolo.cfg extraction.conv.weights
It should start spitting out numbers and stuff.

If you want it to go faster and spit out fewer numbers you should stop training and change the config file a little.
Modify cfg/yolo.cfg so that on line 3 it says subdivisions=2 or 4 or something that divides 64 evenly. Then restart
training as above.
Training Checkpoints

After every 128,000 images Darknet will save a training checkpoint to the directory you specified in src/yolo.c.
These will be titled something like yolo_12000.weights. You can use them to restart training instead of starting
from scratch.
After 40,000 iterations (batches) Darknet will save the final model weights as yolo_final.weights. Then you are
done!
LESSON 5: YOLO OBJECT DETECTION WITH OPENCV AND PYTHON
Image Source: DarkNet github repo
If you have been keeping up with the advancements in the area of object detection, you might have got used to
hearing this word 'YOLO'. It has kind of become a buzzword.
What is YOLO exactly?
YOLO (You Only Look Once) is a method/way to do object detection. It is the algorithm/strategy behind how the code is going to detect objects in the image.
The official implementation of this idea is available through DarkNet (neural net implementation from the ground
up in 'C' from the author). It is available on github for people to use.
Earlier detection frameworks looked at different parts of the image multiple times at different scales and repurposed image classification techniques to detect objects. This approach is slow and inefficient.
YOLO takes an entirely different approach: it looks at the entire image only once, passes it through the network once, and detects objects. Hence the name. It is very fast. That's the reason it has become so popular.
There are other popular object detection frameworks like Faster R-CNN and SSD that are also widely used.
In this post, we are going to look at how to use a pre-trained YOLO model with OpenCV and start detecting objects
right away.
...
OpenCV dnn module
The DNN (Deep Neural Network) module was initially part of the opencv_contrib repo. It was moved to the master branch of the opencv repo last year, giving users the ability to run inference on pre-trained deep learning models within OpenCV itself.
(One thing to note here is that the dnn module is not meant to be used for training. It's just for running inference on images/videos.)
Initially only Caffe and Torch models were supported. Over time, support for different frameworks/libraries like TensorFlow has been added.
Support for YOLO/DarkNet has been added recently. We are going to use the OpenCV dnn module with a pre-
trained YOLO model for detecting common objects.
Let’s get started ..
Enough of talking. Let’s start writing code. (in Python obviously)
# import required packages
import cv2
import argparse
import numpy as np

# handle command line arguments
ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True,
                help='path to input image')
ap.add_argument('-c', '--config', required=True,
                help='path to yolo config file')
ap.add_argument('-w', '--weights', required=True,
                help='path to yolo pre-trained weights')
ap.add_argument('-cl', '--classes', required=True,
                help='path to text file containing class names')
args = ap.parse_args()
Installing dependencies
The following things are needed to execute the code we will be writing:
Python 3
Numpy
OpenCV Python bindings
Python 3
If you are on Ubuntu, it's most likely that Python 3 is already installed. Run python3 in a terminal to check whether it's installed. If it's not installed, use
sudo apt-get install python3
For macOS please refer to my earlier post on deep learning setup for macOS.
I highly recommend using a Python virtual environment. Have a look at my earlier post if you need a starting point.
Numpy
pip install numpy
This should install numpy. Make sure pip is linked to Python 3.x ( pip -V will show this info)
If needed use pip3. Use sudo apt-get install python3-pip to get pip3 if not already installed.
OpenCV-Python
You need to compile OpenCV from source from the master branch on github to get the Python bindings.
(recommended)
Adrian Rosebrock has written a good blog post on PyImageSearch on this. (Download the source from master branch
instead of from archive)
CUU LONG FISH JOINT STOCK COMPANY (CL-FISH CORP.)
90 Hung Vuong street, My Quy Industrial Zone, Long Xuyen City, An Giang Province, Vietnam
Tel: (84)-76-3931000 Fax : (84)-76-3932446
__________________________________________________________

If you are feeling overwhelmed by the instructions to get OpenCV Python bindings from source, you can get the
unofficial Python package using
pip install opencv-python
This is not maintained officially by OpenCV.org. It's a community-maintained one. Thanks to the efforts of Olli-Pekka Heinisuo.
Command line arguments
The script requires four input arguments.
input image
YOLO config file
pre-trained YOLO weights
text file containing class names
All of these files are available on the github repository I have put together. (link to download pre-trained weights is
available in readme.)
You can also download the pre-trained weights in Terminal by typing
wget https://pjreddie.com/media/files/yolov3.weights
This particular model is trained on COCO dataset (common objects in context) from Microsoft. It is capable of
detecting 80 common objects. See the full list here.
Input image can be of your choice. Sample input is available in the repo.
Run the script by typing
$ python yolo_opencv.py --image dog.jpg --config yolov3.cfg --weights yolov3.weights --classes yolov3.txt
Preparing input
# read input image
image = cv2.imread(args.image)

Width = image.shape[1]
Height = image.shape[0]
scale = 0.00392

# read class names from text file
classes = None
with open(args.classes, 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# generate different colors for different classes
COLORS = np.random.uniform(0, 255, size=(len(classes), 3))

# read pre-trained model and config file
net = cv2.dnn.readNet(args.weights, args.config)

# create input blob
blob = cv2.dnn.blobFromImage(image, scale, (416,416), (0,0,0), True, crop=False)

# set input blob for the network
net.setInput(blob)
Read the input image and get its width and height.
Read the text file containing class names in human readable form and extract the class names to a list.
Generate different colors for different classes to draw bounding boxes.

net = cv2.dnn.readNet(args.weights, args.config)

The line above reads the weights and config file and creates the network.

blob = cv2.dnn.blobFromImage(image, scale, (416,416), (0,0,0), True, crop=False)
net.setInput(blob)

The lines above prepare the input image to run through the deep neural network. The scale factor 0.00392 is approximately 1/255, which normalizes pixel values to the [0, 1] range.
Output layer and bounding box
# function to get the output layer names in the architecture
def get_output_layers(net):
    layer_names = net.getLayerNames()
    output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    return output_layers

# function to draw a bounding box on the detected object with the class name
def draw_bounding_box(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    label = str(classes[class_id])
    color = COLORS[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x - 10, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
Generally, in a sequential CNN network there will be only one output layer at the end. In the YOLO v3 architecture we are using, there are multiple output layers giving out predictions. The get_output_layers() function gives the names of the output layers. An output layer is not connected to any next layer.
The draw_bounding_box() function draws a rectangle over the given predicted region and writes the class name over the box. If needed, we can write the confidence value too.
Running inference
# run inference through the network
# and gather predictions from output layers
outs = net.forward(get_output_layers(net))

# initialization
class_ids = []
confidences = []
boxes = []
conf_threshold = 0.5
nms_threshold = 0.4

# for each detection from each output layer
# get the confidence, class id, bounding box params
# and ignore weak detections (confidence < 0.5)
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            center_x = int(detection[0] * Width)
            center_y = int(detection[1] * Height)
            w = int(detection[2] * Width)
            h = int(detection[3] * Height)
            x = center_x - w / 2
            y = center_y - h / 2
            class_ids.append(class_id)
            confidences.append(float(confidence))
            boxes.append([x, y, w, h])
outs = net.forward(get_output_layers(net))
The line above is where the actual forward pass through the network happens. Moment of truth. If we don't specify the output layer names, by default it will return predictions only from the final output layer; any intermediate output layer will be ignored.
We need to go through each detection from each output layer to get the class id, confidence and bounding box corners, and more importantly ignore the weak detections (detections with a low confidence value).
Non-max suppression
# apply non-max suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)

# go through the detections remaining after nms and draw bounding boxes
for i in indices:
    i = i[0]
    box = boxes[i]
    x = box[0]
    y = box[1]
    w = box[2]
    h = box[3]
    draw_bounding_box(image, class_ids[i], confidences[i], round(x), round(y), round(x + w), round(y + h))

# display output image
cv2.imshow("object detection", image)

# wait until any key is pressed
cv2.waitKey()

# save output image to disk
cv2.imwrite("object-detection.jpg", image)

# release resources
cv2.destroyAllWindows()
Even though we ignored weak detections, there will be a lot of duplicate detections with overlapping bounding boxes. Non-max suppression removes boxes with high overlap.
Source: PyImageSearch
Finally we look at the detections that are left and draw bounding boxes around them and display the output image.
I do not own the copyright for the images used in this post. Please refer source for copyright info.
Summary
In this post, we looked at how to use the OpenCV dnn module with a pre-trained YOLO model to do object detection. We can also train a model to detect objects of our own interest that are not covered in the pre-trained one.
We have only scratched the surface. There is a lot more to object detection. I will be covering more on object
detection in the future including other frameworks like Faster R-CNN and SSD. Be sure to subscribe to get notified
when new posts are published.
That’s all for now. Thanks for reading. I hope this post was useful to get started with object detection. Feel free to
share your thoughts in the comments or you can reach out to me on twitter @ponnusamy_arun.
Peace.
Update:
Check out the object detection implementation available in cvlib, which enables detecting common objects through a single function call, detect_common_objects(). Give it a shot and let me know your thoughts.
Cheers.
Subscribe to newsletter
If you are finding this blog interesting, consider subscribing to the newsletter to get notified when new posts go live.
(Don't worry, I publish only one or two posts per month).
LESSON 6: HOW TO IMPLEMENT A YOLO (V3) OBJECT DETECTOR FROM SCRATCH IN PYTORCH
Image Credits: Karol Majek. Check out his YOLO v3 real time detection video here
Object detection is a domain that has benefited immensely from the recent developments in deep learning. Recent
years have seen people develop many algorithms for object detection, some of which include YOLO, SSD, Mask
RCNN and RetinaNet.

For the past few months, I've been working on improving object detection at a research lab. One of the biggest
takeaways from this experience has been realizing that the best way to go about learning object detection is to
implement the algorithms by yourself, from scratch. This is exactly what we'll do in this tutorial.

We will use PyTorch to implement an object detector based on YOLO v3, one of the faster object detection
algorithms out there.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in its entirety at this Github repo.
This tutorial is broken into 5 parts:
Part 1 (This one): Understanding How YOLO works
Part 2: Creating the layers of the network architecture
Part 3: Implementing the forward pass of the network
Part 4: Objectness score thresholding and Non-maximum suppression
Part 5: Designing the input and the output pipelines
Prerequisites
- You should understand how convolutional neural networks work. This also includes knowledge of
Residual Blocks, skip connections, and Upsampling.
- What is object detection, bounding box regression, IoU and non-maximum suppression.
- Basic PyTorch usage. You should be able to create simple neural networks with ease.
I've provided the link at the end of the post in case you fall short on any front.

What is YOLO?
YOLO stands for You Only Look Once. It's an object detector that uses features learned by a deep convolutional neural network to detect an object. Before we get our hands dirty with code, we must understand how YOLO works.

A Fully Convolutional Neural Network


YOLO makes use of only convolutional layers, making it a fully convolutional network (FCN). It has 75
convolutional layers, with skip connections and upsampling layers. No form of pooling is used, and a convolutional
layer with stride 2 is used to downsample the feature maps. This helps in preventing loss of low-level features often
attributed to pooling.

Being a FCN, YOLO is invariant to the size of the input image. However, in practice, we might want to stick to a
constant input size due to various problems that only show their heads when we are implementing the algorithm.
A big one amongst these problems is that if we want to process our images in batches (images in batches can be
processed in parallel by the GPU, leading to speed boosts), we need to have all images of fixed height and width.
This is needed to concatenate multiple images into a large batch (concatenating many PyTorch tensors into one).

The network downsamples the image by a factor called the stride of the network. For example, if the stride of the
network is 32, then an input image of size 416 x 416 will yield an output of size 13 x 13. Generally, stride of any
layer in the network is equal to the factor by which the output of the layer is smaller than the input image to
the network.
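For example, a single stride-2 convolution halves the resolution; a network with an overall stride of 32 stacks five such downsamplings. A hypothetical PyTorch sketch (channel counts picked arbitrarily):

import torch
import torch.nn as nn

# a 3 x 3 convolution with stride 2 halves the spatial resolution
downsample = nn.Conv2d(in_channels=64, out_channels=128,
                       kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 64, 416, 416)
print(downsample(x).shape)  # torch.Size([1, 128, 208, 208])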
Interpreting the output
Typically, (as is the case for all object detectors) the features learned by the convolutional layers are passed onto a
classifier/regressor which makes the detection prediction (coordinates of the bounding boxes, the class label.. etc).

In YOLO, the prediction is done by using a convolutional layer which uses 1 x 1 convolutions.

Now, the first thing to notice is our output is a feature map. Since we have used 1 x 1 convolutions, the size of the
prediction map is exactly the size of the feature map before it. In YOLO v3 (and its descendants), the way you
interpret this prediction map is that each cell can predict a fixed number of bounding boxes.
Though the technically correct term to describe a unit in the feature map would be a neuron, calling it a cell makes
it more intuitive in our context.
Depth-wise, we have (B x (5 + C)) entries in the feature map. B represents the number of bounding boxes each
cell can predict. According to the paper, each of these B bounding boxes may specialize in detecting a certain kind
of object. Each of the bounding boxes have 5 + C attributes, which describe the center coordinates, the dimensions,
the objectness score and C class confidences for each bounding box. YOLO v3 predicts 3 bounding boxes for every
cell.
You expect each cell of the feature map to predict an object through one of it's bounding boxes if the center
of the object falls in the receptive field of that cell. (Receptive field is the region of the input image visible to the
cell. Refer to the link on convolutional neural networks for further clarification).
This has to do with how YOLO is trained, where only one bounding box is responsible for detecting any given object.
First, we must ascertain which of the cells this bounding box belongs to.

To do that, we divide the input image into a grid of dimensions equal to that of the final feature map.
Let us consider the example below, where the input image is 416 x 416 and the stride of the network is 32. As pointed out earlier, the dimensions of the feature map will be 13 x 13. We then divide the input image into 13 x 13 cells.
Then, the cell (on the input image) containing the center of the ground truth box of an object is chosen to be the
one responsible for predicting the object. In the image, it is the cell which marked red, which contains the center of
the ground truth box (marked yellow).
Now, the red cell is the 7th cell in the 7th row on the grid. We now assign the 7th cell in the 7th row on the feature
map (corresponding cell on the feature map) as the one responsible for detecting the dog.
Now, this cell can predict three bounding boxes. Which one will be assigned to the dog's ground truth label? In order to understand that, we must wrap our heads around the concept of anchors.

Note that the cell we're talking about here is a cell on the prediction feature map. We divide the input image into a grid just to determine which cell of the prediction feature map is responsible for the prediction.
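A minimal sketch of that mapping (names assumed): the responsible cell is just the object's center divided by the stride.

def responsible_cell(center_x, center_y, stride=32):
    # pixel coordinates of the object's center -> (row, col) on the feature map
    return int(center_y // stride), int(center_x // stride)

# an object centered near (210, 210) in a 416 x 416 image, stride 32
print(responsible_cell(210, 210))  # (6, 6): the 7th cell in the 7th row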
Anchor Boxes
It might make sense to predict the width and the height of the bounding box, but in practice, that leads to unstable
gradients during training. Instead, most of the modern object detectors predict log-space transforms, or simply offsets
to pre-defined default bounding boxes called anchors.
Then, these transforms are applied to the anchor boxes to obtain the prediction. YOLO v3 has three anchors, which
result in prediction of three bounding boxes per cell.

Coming back to our earlier question, the bounding box responsible for detecting the dog will be the one whose anchor
has the highest IoU with the ground truth box.

Making Predictions
The following formulae describe how the network output is transformed to obtain bounding box predictions.
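bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)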
bx, by, bw, bh are the x, y center co-ordinates, width and height of our prediction. tx, ty, tw, th are what the network outputs. cx and cy are the top-left co-ordinates of the grid cell. pw and ph are the anchor dimensions for the box.
Center Coordinates
Notice we are running our center coordinates prediction through a sigmoid function. This forces the value of the
output to be between 0 and 1. Why should this be the case? Bear with me.

Normally, YOLO doesn't predict the absolute coordinates of the bounding box's center. It predicts offsets which are:

 Relative to the top left corner of the grid cell which is predicting the object.
 Normalised by the dimensions of the cell on the feature map, which are 1.
For example, consider the case of our dog image. If the prediction for center is (0.4, 0.7), then this means that the
center lies at (6.4, 6.7) on the 13 x 13 feature map. (Since the top-left co-ordinates of the red cell are (6,6)).

But wait, what happens if the predicted x, y co-ordinates are greater than one, say (1.2, 0.7)? This means the center lies at (7.2, 6.7). Notice the center now lies in the cell just right of our red cell, the 8th cell in the 7th row. This breaks the theory behind YOLO, because if we postulate that the red box is responsible for predicting the dog, the center of the dog must lie in the red cell, and not in the one beside it.
Therefore, to remedy this problem, the output is passed through a sigmoid function, which squashes the output in a
range from 0 to 1, effectively keeping the center in the grid which is predicting.

Dimensions of the Bounding Box


The dimensions of the bounding box are predicted by applying a log-space transform to the output and then
multiplying with an anchor.
How the detector output is transformed to give the final prediction. Image credits: http://christopher5106.github.io/
The resultant predictions, bw and bh, are normalised by the height and width of the image. (Training labels are chosen this way.) So, if the predictions bw and bh for the box containing the dog are (0.3, 0.8), then the actual width and height on the 13 x 13 feature map are (13 x 0.3, 13 x 0.8).
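Putting the center and dimension transforms together, a minimal decoding sketch for one box (all argument names assumed):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx   # sigmoid keeps the center inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)    # log-space transform applied to the anchor width
    bh = ph * np.exp(th)    # and to the anchor height
    return bx, by, bw, bh

# with cell (6, 6) and center offsets of roughly (0.4, 0.7), the center decodes
# to about (6.4, 6.7) on the 13 x 13 feature map, matching the example above
print(decode_box(-0.4, 0.85, 0.0, 0.0, 6, 6, 3.0, 3.0))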
Objectness Score
The objectness score represents the probability that an object is contained inside a bounding box. It should be nearly 1 for the red and the neighboring grid cells, whereas it should be almost 0 for, say, a grid cell at the corners.

The objectness score is also passed through a sigmoid, as it is to be interpreted as a probability.

Class Confidences
Class confidences represent the probabilities of the detected object belonging to a particular class (Dog, cat, banana,
car etc). Before v3, YOLO used to softmax the class scores.

However, that design choice has been dropped in v3, and the authors have opted for using sigmoid instead. The reason is that softmaxing class scores assumes that the classes are mutually exclusive. In simple words, if an object belongs to one class, then it's guaranteed it cannot belong to another class. This is true for the COCO database on which we will base our detector.

However, this assumption may not hold when we have classes like Women and Person. This is the reason the authors have steered clear of using a softmax activation.
Prediction across different scales
YOLO v3 makes predictions across 3 different scales. The detection layer is used to make detections at feature maps of three different sizes, having strides 32, 16, and 8 respectively. This means that, with an input of 416 x 416, we make detections on scales 13 x 13, 26 x 26 and 52 x 52.
The network downsamples the input image until the first detection layer, where a detection is made using feature
maps of a layer with stride 32. Further, layers are upsampled by a factor of 2 and concatenated with feature maps of
a previous layer having identical feature map sizes. Another detection is now made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer of stride 8.

At each scale, each cell predicts 3 bounding boxes using 3 anchors, making the total number of anchors used 9. (The
anchors are different for different scales)
The authors report that this helps YOLO v3 get better at detecting small objects, a frequent complaint with the earlier
versions of YOLO. Upsampling can help the network learn fine-grained features which are instrumental for detecting
small objects.

Output Processing
For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes. However, in the case of our image, there's only one object, a dog. How do we reduce the detections from 10647 to 1?
Thresholding by Object Confidence
First, we filter boxes based on their objectness score. Generally, boxes having scores below a threshold are ignored.
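As a minimal sketch of this filtering step, assuming each prediction row is laid out as 4 box attributes, then the objectness score, then the class confidences (a common layout for YOLO implementations, not necessarily this article's):

import torch

def threshold_confidence(preds, conf_thresh=0.5):
    # preds: (N, 5 + C) tensor; column 4 holds the objectness score.
    # Keep only the rows whose objectness exceeds the threshold.
    return preds[preds[:, 4] > conf_thresh]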

Non-maximum Suppression
NMS intends to cure the problem of multiple detections of the same object. For example, all 3 bounding boxes
of the red grid cell may detect the same box, or adjacent cells may detect the same object.

If you don't know about NMS, I've provided a link to a website explaining the same.
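For reference, here is a short self-contained sketch of greedy NMS in PyTorch, with boxes in (x1, y1, x2, y2) corner form; the names are illustrative, not this article's code:

import torch

def iou(box, boxes):
    # Intersection-over-union between one box and a set of boxes.
    x1 = torch.max(box[0], boxes[:, 0])
    y1 = torch.max(box[1], boxes[:, 1])
    x2 = torch.min(box[2], boxes[:, 2])
    y2 = torch.min(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.4):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    # it heavily, and repeat on whatever survives.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep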

LESSON 7: EVOLUTION OF OBJECT DETECTION AND LOCALIZATION ALGORITHMS
Understanding the recent evolution of object detection and localization, with intuitive explanations of the underlying concepts.

Object detection is one of the areas of computer vision that is maturing very rapidly, thanks to deep learning. Every
year, new algorithms and models keep outperforming the previous ones. In fact, one of the latest state-of-the-art
software systems for object detection was just released last week by the Facebook AI team. The software is
called Detectron; it incorporates numerous research projects for object detection and is powered by the Caffe2 deep
learning framework.

Today, there is a plethora of pre-trained models for object detection (YOLO, RCNN, Fast RCNN, Mask RCNN,
Multibox, etc.). So, it only takes a small amount of effort to detect most of the objects in a video or an image. But
the objective of my blog is not to talk about the implementation of these models. Rather, it is my attempt to explain
the underlying concepts in a clear and concise manner.

I recently completed Week 3 of Andrew Ng's Convolutional Neural Networks course, in which he talks about object
detection algorithms. Most of the content of this blog is inspired by that course.

Edited: I am currently doing fast.ai's Cutting Edge Deep Learning for Coders course, taught by Jeremy
Howard. I now have implementations of the algorithms discussed below using PyTorch and the fast.ai libraries. Here is
the link to the codes. Check this out if you want to learn about the implementation of the algorithms discussed
below. The implementation has been borrowed from the fast.ai course notebook, with comments and notes.

Brief introduction to CNNs

Before I explain the working of object detection algorithms, I want to spend a few lines on Convolutional Neural
Networks, also called CNNs or ConvNets. CNNs are the basic building blocks for most computer vision tasks
in the deep learning era.

Fig. 1. Convolution demo in Excel

What do we want? We want an algorithm that looks at an image, sees the patterns in it and tells what type of
object is in the image. For example, is that an image of a cat or a dog?

What is an image for a computer? Just a matrix of numbers. For example, see Figure 1 above. The image on the left is just a
28x28 pixel image of the handwritten digit 2 (taken from the MNIST data), which is represented as a matrix of numbers in an
Excel spreadsheet.

How can we teach computers to recognize the object in an image? By making them learn patterns like
vertical edges, horizontal edges, round shapes and maybe plenty of other patterns unknown to humans. Convolutions!
(Look at the figure above while reading this.) A convolution is a mathematical operation between two matrices that gives
a third matrix. The smaller matrix, which we call the filter or kernel (3x3 in Figure 1), is operated on the matrix of image
pixels. Depending on the numbers in the filter matrix, the output matrix can recognize specific patterns present in
the input image. In the example above, the filter is a vertical edge detector, which detects vertical edges in the input image.
In the context of deep learning, the input images and their subsequent outputs are passed through a number of such filters.
The numbers in the filters are learnt by the neural net, and patterns are derived on their own.

Why do convolutions work? Because in most images, objects have consistency in relative pixel densities
(magnitudes of numbers) that can be leveraged by convolutions.
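Here is a minimal NumPy sketch of the operation just described, using a vertical-edge filter like the one in Figure 1; the random image is only a stand-in for the MNIST digit:

import numpy as np

def convolve2d(image, kernel):
    # Valid-mode 2-D convolution (strictly, cross-correlation, which is
    # what deep learning libraries compute under the name "convolution").
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

image = np.random.rand(28, 28)                 # stand-in for the 28x28 digit
print(convolve2d(image, vertical_edge).shape)  # (26, 26)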

I know that only a few lines on CNNs are not enough for a reader who doesn't know about them. But CNNs are not the
main topic of this blog, and I have provided a basic intro so that the reader doesn't have to open 10 more links to
first understand CNNs before continuing further.

After reading this blog, if you still want to know more about CNNs, I would strongly suggest you read this blog
by Adam Geitgey.

Categorization of computer vision tasks

Fig. 2: Common computer vision tasks



Taking the example of the cat and dog images in Figure 2, the following are the most common tasks done by computer vision
modeling algorithms:

1. Image classification: This is the most common computer vision problem, where an algorithm looks at an image
   and classifies the object in it. Image classification has a wide variety of applications, ranging from face detection
   on social networks to cancer detection in medicine. Such problems are typically modeled using Convolutional
   Neural Nets (CNNs).
2. Object classification and localization: Let's say we not only want to know whether there is a cat in the image, but
   where exactly the cat is. Object localization algorithms not only label the class of an object, but also draw a
   bounding box around the position of the object in the image.
3. Multiple object detection and localization: What if there are multiple objects in the image (3 dogs and 2 cats
   as in the above figure) and we want to detect them all? That would be an object detection and localization problem.
   A well-known application of this is in self-driving cars, where the algorithm not only needs to detect the cars, but
   also pedestrians, motorcycles, trees and other objects in the frame. These kinds of problems need to leverage the
   ideas and concepts learnt from image classification as well as from object localization.

Now coming back to computer vision tasks: in the context of deep learning, the basic algorithmic difference among the
above 3 types of tasks is just the choice of the relevant inputs and outputs. Let me explain this in detail with an infographic.

1. Image Classification

Fig. 3: Steps for image classification using CNN



The infographic in Figure 3 shows what a typical CNN for image classification looks like (a code sketch follows these steps).
1. Convolve an input image of some height, width and channel depth (940, 550, 3 in the above case) with n filters
   (n = 4 in Fig. 3). [If you are still confused about what exactly convolution means, please check this link to
   understand convolutions in deep neural networks.]
2. The output of the convolution is treated with non-linear transformations, typically max pooling and ReLU.
3. The above 3 operations of convolution, max pooling and ReLU are performed multiple times.
4. The output of the final layer is sent to a softmax layer, which converts the scores into numbers between 0 and 1, giving
   the probability of the image being of a particular class. We minimize our loss so as to make the predictions from this
   last layer as close to the actual values as possible.
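A minimal PyTorch sketch of such a classifier; the layer sizes are illustrative, not the figure's exact architecture, and in practice the network is trained on the raw scores with a cross-entropy loss:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),  # 3 input channels, n = 4 filters
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(4, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 56 * 56, 2),                  # 2 classes: cat vs. dog
)

x = torch.randn(1, 3, 224, 224)         # one dummy 224x224 RGB image
probs = torch.softmax(model(x), dim=1)  # class probabilities summing to 1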

2. Object classification and localization

Fig. 4: Input and output for object localization problems



Now, to make our model draw bounding boxes around an object, we just change the output labels of the previous
algorithm so as to make our model learn the class of the object and also its position in the image. We add
4 more numbers to the output layer, which encode the centroid position of the object and the width and height
of the bounding box as proportions of the image.

Simple, right? Just add a bunch of output units to spit out the x, y coordinates of the different positions you want to
recognize. These different positions, or landmarks, would be consistent for a particular object across all the images we have.
For example, for a car, the height would be smaller than the width, and the centroid would have some specific pixel density
compared to other points in the image.
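A sketch of what one such training label might look like; this layout is hypothetical, and the exact ordering varies by implementation:

import numpy as np

# [p, bx, by, bw, bh, c1, c2, c3]
#  p          -> is any object present?
#  bx, by     -> centroid, as fractions of image width/height
#  bw, bh     -> box width/height, as fractions of the image
#  c1, c2, c3 -> one-hot class (e.g. car, light, pedestrian)
label = np.array([1.0, 0.55, 0.60, 0.30, 0.20, 1.0, 0.0, 0.0])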

Applying the same logic, what do you think would change if there are multiple objects in the image and we want
to classify and localize all of them? I would suggest you pause and ponder at this moment; you might get the
answer yourself.

3. Multiple objects detection and localization

Fig. 5: Input and output for object detection and localization problems

To detect all kinds of objects in an image, we can directly use what we have learnt so far from object localization. The
difference is that we want our algorithm to be able to classify and localize all the objects in an image, not just one.
So the idea is to crop the image into multiple sub-images and run a CNN on each of them to detect an object.

The way the algorithm works is the following (a code sketch follows the list):

1. Make a window of a size much smaller than the actual image size. Crop it, pass the crop to a ConvNet (CNN), and have
   the ConvNet make its predictions.
2. Keep sliding the window and passing the cropped images into the ConvNet.
3. After cropping all the portions of the image with this window size, repeat all the steps again for a bit bigger window
   size. Again, pass the cropped images into the ConvNet and let it make predictions.
4. At the end, you will have a set of cropped regions containing some object, together with the class and bounding
   box of the object.
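Below is a minimal sketch of the procedure just described; "classify" stands in for a trained ConvNet returning a (class, score) pair, and the window sizes and stride are arbitrary choices:

def sliding_windows(image_h, image_w, window, stride):
    # Yield the top-left corner of every crop for one window size.
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left

def detect(image, classify, window_sizes=(64, 128, 192), stride=32):
    detections = []
    h, w = image.shape[:2]
    for win in window_sizes:                 # step 3: repeat for bigger windows
        for top, left in sliding_windows(h, w, win, stride):
            crop = image[top:top + win, left:left + win]  # steps 1-2: crop and slide
            cls, score = classify(crop)                   # ConvNet prediction
            if score > 0.5:
                detections.append((cls, score, (left, top, win, win)))
    return detections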

This solution is known as object detection with sliding windows. It is a very basic solution, with caveats such as
the following:

A. Computationally expensive: Cropping multiple images and passing them through a ConvNet is going to be
computationally very expensive.

Solution: There is a simple hack to improve the computational efficiency of the sliding window method: replace the
fully connected layers in the ConvNet with 1x1 convolution layers and, for a given window size, pass the input image
only once. So, in the actual implementation, we do not pass the cropped images one at a time; we pass the complete
image at once.

B. Inaccurate bounding boxes: We are sliding square windows all over the image, but maybe the object is
rectangular, or maybe none of the squares matches the actual size of the object perfectly. Although this algorithm
can find and localize multiple objects in an image, the accuracy of the bounding boxes is still poor.

Fig. 6. Bounding boxes from sliding window CNN



I have talked about the most basic solution to an object detection problem. But it has many caveats, is not the most
accurate, and is computationally expensive to implement. So, how can we make our algorithm better and faster?

Better solution? YOLO

It turns out that we have YOLO (You Only Look Once), which is much more accurate and faster than the sliding
window algorithm. It is based on only a minor tweak on top of the algorithms that we already know. The idea is to
divide the image into multiple grid cells. Then we change the labels of our data such that we implement both localization
and classification for each grid cell. Let me explain this with one more infographic.

Fig. 7. Bounding boxes, input and output for YOLO



YOLO in easy steps:

1. Divide the image into multiple grid cells. For illustration, I have drawn a 4x4 grid in the above figure, but the actual
   implementation of YOLO uses a different number of grid cells (7x7 for training YOLO on the PASCAL VOC dataset).
2. Label the training data as shown in the above figure. If C is the number of unique objects in our data and S*S is the
   number of grid cells into which we split our image, then our output vector will be of length S*S*(C+5). For example,
   in the above case our target vector is 4*4*(3+5), as we divided our images into a 4*4 grid and are training for 3 unique
   objects: car, light and pedestrian (a sketch of building this target tensor follows the list).
3. Make one deep convolutional neural net with the loss function defined as the error between the output activations and
   the label vector. Basically, the model predicts the outputs of all the grid cells in just one forward pass of the input
   image through the ConvNet.
4. Keep in mind that the label for an object being present in a grid cell, P(object), is determined by the presence of the
   object's centroid in that cell. This is important so that one object is not counted multiple times in different grid cells.
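Here is a minimal NumPy sketch of that labeling scheme, using the 4x4 grid and 3 classes from the example above; the helper and layout are illustrative, not the article's code:

import numpy as np

S, C = 4, 3                       # 4x4 grid cells, 3 classes (car, light, pedestrian)
target = np.zeros((S, S, C + 5))  # one (C+5)-vector per grid cell

def add_object(target, bx, by, bw, bh, cls):
    # The cell that contains the object's centroid owns the label.
    # bx, by, bw, bh are fractions of image width/height; cls is a class index.
    col, row = int(bx * S), int(by * S)
    target[row, col, 0] = 1.0                 # P(object)
    target[row, col, 1:5] = [bx, by, bw, bh]  # box centroid and size
    target[row, col, 5 + cls] = 1.0           # one-hot class

add_object(target, bx=0.55, by=0.60, bw=0.30, bh=0.20, cls=0)
print(target.size)  # S*S*(C+5) = 128 numbers per image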

Caveats of YOLO and their solutions:

A. It can't detect multiple objects in the same grid cell. This issue can be solved by choosing a smaller grid size. But even
with a smaller grid size, the algorithm can still fail in cases where objects are very close to each other, like an image
of a flock of birds.

Solution: Anchor boxes. In addition to having 5+C labels for each grid cell (where C is the number of distinct objects),
the idea of anchor boxes is to have (5+C)*A labels for each grid cell, where A is the number of anchor boxes required. If one
object is assigned to one anchor box in a grid, another object can be assigned to the other anchor box of the same grid.
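In terms of shapes, the change is just one extra axis per grid cell; a tiny sketch continuing the example above:

import numpy as np

S, C, A = 4, 3, 2                    # grid cells, classes, anchor boxes per cell
target = np.zeros((S, S, A, C + 5))  # (5+C)*A labels per grid cell

# Two objects whose centroids fall in the same cell can now each claim
# their own anchor slot (typically the anchor whose shape matches best).
print(target.size)  # S*S*A*(C+5) = 256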

Fig. 8. YOLO with anchor boxes



B. Possibility of detecting one object multiple times.

Solution: Non-max suppression. Non-max suppression removes low-probability bounding boxes that are
very close to a high-probability bounding box.

Conclusion:
As of today, there are multiple versions of pre-trained YOLO models available in different deep learning
frameworks, including TensorFlow. The latest YOLO paper is "YOLO9000: Better, Faster, Stronger"; the model
is trained on 9000 classes. There are also a number of region-based CNN (R-CNN) algorithms based on selective
region proposals, which I haven't discussed. Detectron, the software system developed by Facebook AI, also
implements a variant of R-CNN, Mask R-CNN.
