Effective Object Detection by Modifying Choice of Basic Parameters of Object Detection

INTERNATIONAL JOURNAL OF RESEARCH IN TECHNOLOGY (IJRT), ISSN No. 2394-9007
Vol. V, No. I, February 2018, www.ijrtonline.org
Impact Factor: 4.012
Published under Asian Research & Training Publication, ISO 9001:2015 Certified

Shristi Shreyasi, S.S. Subashka Ramesh

Manuscript received in February 2018.
Shristi Shreyasi, Research Scholar, Department of Computer Science & Engineering,
SRM Institute of Science & Technology, Ramapuram, Chennai, Tamil Nadu, India.
Prof. S.S. Subashka Ramesh, Asst. Professor, Department of Computer Science &
Engineering, SRM Institute of Science & Technology, Ramapuram, Chennai, Tamil Nadu,
India.

Abstract: General-purpose object detection should be fast, accurate, and able to
recognize a wide variety of objects. Since the introduction of neural networks,
detection frameworks have become increasingly fast and accurate. However, most
detection methods are still constrained to a small set of objects. The traditional
approach, which repurposes a classifier or localizer to perform detection, applies
the model to an image at multiple locations and scales. This paper discusses the
parameters that enhance the effectiveness of object detection approaches in their
application areas. The parameters this paper proposes are an efficient dataset, a
better feature extractor, localization of bounding boxes, and dimension adaptability.
These features play an important role in the behavior of any object detection model
in terms of performance, accuracy, classification process and outcome, and time.

Keywords: Object Detection, Neural Net, Image Recognition, Image Classification.

I. INTRODUCTION

Object detection comes under the field of computer vision. It is the process of
finding instances of real-world objects such as faces, bicycles, and buildings in
images or videos. Object detection algorithms typically use extracted features and
learning algorithms to recognize instances of an object category. It all starts from
classification. Classification is probably the best-known problem in computer vision.
It consists of classifying an image into one of many different categories. In recent
years classification models have surpassed human performance, and the problem has
been considered practically solved. Similar to classification, localization finds the
location of a single object inside the image. Localization can be used for many
real-life problems, for example smart cropping (knowing where to crop images based on
where the object is located) or regular object extraction for further processing
using different techniques. It can be combined with classification, not only locating
the object but also categorizing it into one of many possible categories.

Going one step further, we would want not only to find objects inside an image but
also to find a pixel-by-pixel mask of each detected object. This problem is referred
to as instance or object segmentation. Iterating over the problem of localization
plus classification ends up with the need for detecting and classifying multiple
objects at the same time. Object detection is the problem of finding and classifying
a variable number of objects in an image. It is carried out with methods such as
Viola-Jones detection, feature-based detection, SVM classifiers with HOG features,
and deep-learning object detection. Today, deep learning with artificial neural
networks is used for real-time object detection in applications such as self-driving
cars, assisted living, visual search engines, and aerial image analysis. In this
paper, the aim is to develop a faster deep neural network for real-time video object
detection by exploring the ideas of knowledge-guided training and predicted regions
of interest. Specifically, a new framework is proposed for training neural networks
on classification as well as detection datasets, together with a different approach
of k-means clustering for dimension boxes, localization in the prediction of region
proposals, and handling of objects at various resolutions, to make object detection
more effective, faster, and more accurate.

II. RELATED WORK

Object detection was classically done with the Viola-Jones framework, but the recent
use of deep learning and artificial neural networks in detecting and classifying
images has surpassed human performance in the ImageNet challenge. In detection, the
most widely used form of neural network is R-CNN. R-CNN has since been superseded by
Fast R-CNN, which has in turn been surpassed by Faster R-CNN. More recently, Mask
R-CNN has been taking over the detection field with pixel-level segmentation.

Earlier, R-CNN and its successors used only detection datasets to train their models
to perform object detection. Detection datasets are much more limited in size,
because large, accurately labeled detection datasets are unavailable, whereas
classification datasets have millions of images labeled in various categories for
training models. This becomes a barrier to developing an efficient model for object
detection in real-time videos. Another main cause of complex and time-consuming
outcomes is poorly predicted bounding boxes in the frame
or image provided. As a result, the model becomes unstable and keeps analyzing the
wrong region for a considerable amount of time. A related problem is the adjustment
of anchor or dimension boxes around the object in the video feed. Sometimes, because
objects of one category come in different sizes, the model is uncertain whether a
region should be proposed at all and, if so, which class it should belong to. The
feature extractor also plays an important role in the accuracy of the model. These
seemingly minor vulnerabilities of object detection play an important role in design
complexity, analysis, and final outcome.

III. OVERVIEW OF THE PROPOSED METHOD

The proposed method curbs the vulnerabilities of previous object detection models in
dataset choice, resolution of the input image or frame, feature extraction, and
accurate localization of the predicted bounding boxes. This changes the approach to
detection on a real-time video feed, which will prove useful in real-life
applications of object detection.

IV. FEATURES OF THE PROPOSED WORKS

A. An Efficient Dataset:
The proposed mechanism for building an efficient training dataset is to use
classification and detection data jointly. Superimposing 2D images of textured object
models onto images of real environments at a variety of locations and scales gave a
recognition rate of 98.2% with a VGG network on the BigBird dataset [1]. Combining
20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances, densely
captured in 9 unique scenes, into one dataset resulted in a faster, more accurate
approach to active vision [2]. In the same way, images labeled for detection should
be used to learn detection-specific information such as bounding box coordinate
prediction and objectness, as well as how to classify common objects. Images with
only class labels should be used to expand the number of categories the model can
detect. For training, images from both detection and classification datasets should
be mixed.

This approach presents a few challenges. Detection datasets have only common objects
and general labels, like “person” or “bicycle”. Classification datasets have a much
wider and deeper range of labels. ImageNet has more than a hundred breeds of dog,
including “Norfolk terrier”, “Yorkshire terrier”, and “Bedlington terrier”. To train
on both datasets, a coherent way is needed to merge these labels. The most common
approach to classification is to use a softmax layer across all possible categories
to compute the final probability distribution. Using a softmax layer assumes that the
classes are mutually exclusive. This presents problems for combining datasets, for
example ImageNet and COCO, because the classes “Norfolk terrier” and “dog” are not
mutually exclusive. We could instead use a multi-label model, which does not assume
mutual exclusion, to combine the datasets. However, this approach ignores all the
structure we do know about the data, for example that all of the COCO classes are
mutually exclusive.

A well-labeled dataset is therefore the first step toward making the model understand
what is desired from it. For example, “Norfolk terrier” and “Yorkshire terrier” are
both hyponyms of “terrier”, which is a type of “hunting dog”, which is a type of
“dog”, which is a “canine”, and so on. This type of detailed classification should be
taken as the best approach to handling datasets in object detection. Most approaches
to classification assume a flat structure for the labels; for combining datasets,
however, a tree or graph structure is exactly what we need. Storing the hierarchy in
the nodes of a tree is a much better approach than using sequentially labeled flat
datasets [3]. Decision trees and graphs are versatile and can be modified to suit
whatever level of classification is demanded of the model's outcome.

B. Localization of Bounding Boxes:
Faster R-CNN predicts bounding boxes using hand-picked priors [4]. For each
sliding-window location, it generates multiple possible regions based on k
fixed-ratio anchor boxes (default bounding boxes). Each region proposal consists of
a) an “objectness” score for that region and b) 4 coordinates representing the
bounding box of the region. A combined classification and bounding-box prediction
framework can be used, in which objects are predicted directly in each cell together
with corrections to the anchor boxes. Using only convolutional layers, the region
proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor
boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at
every location in a feature map. Predicting offsets instead of coordinates simplifies
the problem and makes it easier for the network to learn. Using anchor boxes to
predict bounding boxes is thus an efficient way to obtain a good IOU value.

The input resolution should be chosen so that the feature map has an odd number of
locations and therefore a single centre cell. Objects, especially large objects, tend
to occupy the centre of the image, so it is good to have a single location right at
the centre to predict these objects instead of four locations that are all nearby.
When we move to anchor boxes we also decouple the class prediction mechanism from the
spatial location and instead predict class and objectness for every anchor box. The
objectness prediction still predicts the IOU of the ground truth and the proposed
box, and the class predictions predict the conditional probability of that class
given that there is an object. Using anchor boxes, a small decrease in accuracy may
take place. Even though the mAP (mean average precision) decreases, there will be a
considerable increase in recall. An increase in recall means that the model misses
fewer of the objects it should have detected.
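The label hierarchy described in Section IV-A can be made concrete with a small sketch. It assumes, as one possible reading rather than the paper's stated method, that each node stores the conditional probability of the label given its parent, so that the probability of a fine-grained label is the product along its path to the root. The `PARENT` map and the probability values are hypothetical and only mirror the terrier example in the text.

```python
from math import prod

# Hypothetical child -> parent map following the hierarchy named in the text.
PARENT = {
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
    "terrier": "hunting dog",
    "hunting dog": "dog",
    "dog": "canine",
    "canine": None,  # root of this small tree
}


def path_to_root(label):
    """Return the label followed by all of its ancestors, ending at the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def absolute_probability(label, conditional):
    """Multiply P(node | parent) along the path from the label to the root.

    `conditional` maps each node to P(node | parent); for the root it holds
    the probability that the root concept is present at all.
    """
    return prod(conditional[node] for node in path_to_root(label))


# Made-up conditional probabilities, for illustration only.
example_conditional = {
    "canine": 1.0,
    "dog": 0.5,
    "hunting dog": 0.5,
    "terrier": 0.5,
    "Norfolk terrier": 0.5,
    "Yorkshire terrier": 0.25,
}
```

With this layout, an image labeled only “dog” in a detection dataset touches just the nodes returned by `path_to_root("dog")`, while an ImageNet image labeled “Norfolk terrier” touches the whole chain, so the two kinds of label no longer conflict.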

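The anchor-box mechanism of Section IV-B relies on two small computations: turning predicted offsets into a box, and scoring that box by IOU. The sketch below uses the offset parameterization from the Faster R-CNN paper [4] (centre shifts scaled by the anchor size, log-scale width and height). The function names and the centre-format box layout `(cx, cy, w, h)` are assumptions made for this illustration.

```python
import math


def decode_offsets(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def _corners(box):
    """Convert a (cx, cy, w, h) box to (x1, y1, x2, y2) corners."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1, ax2, ay2 = _corners(box_a)
    bx1, by1, bx2, by2 = _corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Because the offsets are relative to the anchor, zero offsets reproduce the anchor itself, which is one reason the network has an easier target than raw coordinates.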
C. Dimension Adaptability by Model:
A multi-scale approach in object detection plays an important role in giving an
accurate output [5]. Real-time video or images contain objects of the same class at
different scales. Detection and classification performed on a frame should therefore
be elastic. Training the model on multiple scales is not enough; the model should
learn to classify an object at every resolution of an image. This can be done by
changing the network every few iterations. After a certain interval, the network
takes a new input image size. We resize the network to that dimension and continue
training. This regime forces the network to learn to predict well across a variety of
input dimensions. This means the same network can predict detections at different
resolutions. This approach stops limiting the model to detecting a class at one
particular input image size, making it much more versatile in the input it accepts
and more accurate in its output.

D. A Better Feature Extractor Framework:
Most detection frameworks rely on VGG-16 as the base feature extractor [6]. VGG-16 is
a powerful, accurate classification network, but it is needlessly complex. The
convolutional layers of VGG-16 require 30.69 billion floating-point operations for a
single pass over a single image. Other frameworks use a network based on the
GoogLeNet architecture [7]. This network is faster than VGG-16, using only 8.52
billion operations for a forward pass. However, its accuracy is slightly worse than
that of VGG-16. A classification model should be such that the features extracted
from the image can be used later in pooling. Building on prior work on network design
as well as common knowledge in the field enhances the feature extraction abilities of
the extractor. As in the VGG models, using mostly 3x3 filters and doubling the number
of channels after every pooling step [6] enriches the capacity of the feature
extractor.

Extraction of correct features that can be used later in pooling plays a very
important role in object detection. Selection of apt features by the feature
extractor will generate more accurate results. In face detection, for example,
effective feature extraction with HOG and LBP features outperformed Haar-like
features and generalized better on noisy data [8]. Choosing a feature extractor that
enhances the performance of the model should therefore be one of the goals in object
detection. The intense computation required in object detection will be reduced
considerably if the choice of feature extractor is apt.

V. CONCLUSION
In conclusion, the features of object detection discussed in this paper play a very
important role in the performance of detection mechanisms. An efficient dataset,
localization of bounding boxes, dimension adaptability, and a better feature
extractor framework are aspects of object detection model design that must be handled
carefully. The performance potential of these features in object detection is large.

REFERENCES
[1] G. Georgakis, A. Mousavian, A. C. Berg, and J. Košecká, "Synthesizing Training
    Data for Object Detection in Indoor Scenes," arXiv:1702.07836v2 [cs.CV],
    8 Sep 2017.
[2] P. Ammirato, P. Poirson, E. Park, J. Košecká, and A. C. Berg, "A Dataset for
    Developing and Benchmarking Active Vision," in 2017 IEEE International Conference
    on Robotics and Automation (ICRA), Singapore, May 29 - June 3, 2017.
[3] S. Vijendra, H. Parashar, and N. Vasudeva, "A New Method of Classification of
    Datasets for Data Mining," in 2011 3rd International Conference on Machine
    Learning and Computing (ICMLC 2011).
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object
    detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[5] Q. Liu, R. Hang, H. Song, and Z. Li, "Learning Multi-Scale Deep Features for
    High-Resolution Satellite Image Classification," arXiv:1611.03591v1 [cs.CV],
    11 Nov 2016.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale
    image recognition," arXiv preprint arXiv:1409.1556, 2014.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
    V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR,
    abs/1409.4842, 2014.
[8] S. Paisitkriangkrai, C. Shen, and J. Zhang, "Face Detection with Effective
    Feature Extraction," arXiv:1009.5758v1 [cs.CV], 29 Sep 2010.
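The training regime of Section IV-C, in which the network takes a new input size every few iterations, can be sketched as a schedule generator. This is a minimal illustration, not the authors' implementation: the function name, the fixed switching interval, and the candidate sizes passed in by the caller are all assumptions, since the paper does not specify them.

```python
import random


def multiscale_schedule(num_iterations, sizes, interval, seed=0):
    """Return the input size to use at each training iteration.

    A new size is drawn from `sizes` every `interval` iterations, so the
    network sees each resolution for a short run before being resized.
    """
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    current = rng.choice(sizes)
    schedule = []
    for iteration in range(num_iterations):
        if iteration > 0 and iteration % interval == 0:
            current = rng.choice(sizes)  # resize the network input here
        schedule.append(current)
    return schedule


# Hypothetical candidate sizes; the paper does not state which ones to use.
candidate_sizes = [320, 352, 384, 416]
schedule = multiscale_schedule(num_iterations=35, sizes=candidate_sizes, interval=10)
```

In a real training loop the value for the current iteration would be used to resize both the batch of images and any size-dependent layers before the forward pass.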

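The operation count quoted for VGG-16 in Section IV-D can be reproduced with a short calculation. The sketch assumes a 224x224 RGB input, 3x3 convolutions with stride 1 and padding that preserves spatial size, 2x2 max pooling that halves it, and a count of one multiply plus one add per weight application with biases ignored. These assumptions are not stated in the paper; the layer list is the standard VGG-16 convolutional configuration from [6].

```python
# Output channels of each convolutional layer; "M" marks a 2x2 max-pooling step.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def conv_flops(config, input_size=224, in_channels=3, kernel=3):
    """Count floating-point operations of the convolutional layers only."""
    size, channels, total = input_size, in_channels, 0
    for item in config:
        if item == "M":
            size //= 2  # pooling halves the spatial resolution
        else:
            # One multiply-accumulate per output position, kernel tap,
            # input channel and output channel.
            macs = size * size * kernel * kernel * channels * item
            total += 2 * macs  # count the multiply and the add separately
            channels = item
    return total


vgg16_flops = conv_flops(VGG16_CONV)
print(f"{vgg16_flops / 1e9:.2f} billion operations")  # prints "30.69 billion operations"
```

The same function shows why channel doubling after each pooling step keeps the cost per stage roughly level: halving the resolution divides the positions by four, while doubling both input and output channels multiplies the work per position by four.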
