Article
An Improved Image Semantic Segmentation Method
Based on Superpixels and Conditional Random Fields
Wei Zhao 1, Yi Fu 1, Xiaosong Wei 1 and Hai Wang 2,*
1 Key Laboratory of Electronic Equipment Structure Design, Ministry of Education, Xidian University,
Xi’an 710071, China; weizhao@xidian.edu.cn (W.Z.); yfu@stu.xidian.edu.cn (Y.F.);
winthor666@gmail.com (X.W.)
2 School of Aerospace Science and Technology, Xidian University, Xi’an 710071, China
* Correspondence: wanghai@mail.xidian.edu.cn; Tel.: +86-029-8820-3115

Received: 11 April 2018; Accepted: 17 May 2018; Published: 22 May 2018 

Abstract: This paper proposes an improved image semantic segmentation method based on
superpixels and conditional random fields (CRFs). The proposed method can take full advantage
of the superpixel edge information and the constraint relationship among different pixels. First,
we employ fully convolutional networks (FCN) to obtain pixel-level semantic features and utilize
simple linear iterative clustering (SLIC) to generate superpixel-level region information, respectively.
Then, the segmentation results of image boundaries are optimized by the fusion of the obtained
pixel-level and superpixel-level results. Finally, we make full use of the color and position information
of pixels to further improve the semantic segmentation accuracy using the pixel-level prediction
capability of CRFs. In summary, this improved method has advantages both in terms of excellent
feature extraction capability and good boundary adherence. Experimental results on both the PASCAL
VOC 2012 dataset and the Cityscapes dataset show that the proposed method can achieve significant
improvement of segmentation accuracy in comparison with the traditional FCN model.

Keywords: image semantic segmentation; superpixels; conditional random fields; fully convolutional
network

1. Introduction
Nowadays, image semantic segmentation has become one of the key issues in the field of computer
vision. A growing number of scenarios demand the abstraction of relevant knowledge or semantic
information from images, such as autonomous driving, human-machine interaction, and image
search engines [1–3]. As a preprocessing step for image analysis and visual comprehension,
semantic segmentation is used to classify each pixel in the image and divide the image into a number
of visually meaningful regions. In the past decades, researchers have proposed various methods
including the simplest pixel-level thresholding methods, clustering-based segmentation methods,
and graph partitioning segmentation methods [4] to yield the image semantic segmentation results.
These methods are efficient because of their low computational complexity and small number of
parameters. However, without artificial supplementary information, their performance on image
segmentation tasks is unsatisfactory.
With the rapid development of deep learning in the field of computer vision, image
semantic segmentation methods based on convolutional neural networks (CNNs) [5–14] have been
proposed one after another, far exceeding the traditional methods in accuracy. The first end-to-end
semantic segmentation model was proposed as a CNN variant by Long et al. [15], known as FCN.
They popularized CNN architectures for dense predictions without any fully connected layers. Unlike
the CNNs, the output of FCN becomes a two-dimensional matrix instead of a one-dimensional vector

Appl. Sci. 2018, 8, 837; doi:10.3390/app8050837 www.mdpi.com/journal/applsci


Appl. Sci. 2018, 8, 837 2 of 17

during image semantic segmentation. For the first time, image classification was realized at the pixel
level, which represented a significant improvement in accuracy. Subsequently, a large number of
FCN-based methods [16–20] have been proposed to promote the development of image semantic
segmentation. As one of the most popular pixel-level classification methods, the DeepLab models
proposed by Chen et al. [21–23] make use of the fully connected CRF as a separated post-processing step
in their pipeline to refine the segmentation result. The earliest version of DeepLab-v1 [21] overcomes
the poor localization property of deep networks by combining the responses at the final FCN layer with
a CRF for the first time. By using this model, all pixels, no matter how far apart they lie, are taken into
account, rendering the system able to recover detailed structures in the segmentation that were lost due
to the spatial invariance of the FCN. Later, Chen et al. extended their previous work and developed the
DeepLab-v2 [22] and DeepLab-v3 [23] with improved feature extractors, better object scale modeling,
careful assimilation of contextual information, improved training procedures, and increasingly
powerful hardware and software. Benefiting from the fine-grained localization accuracy of CRFs,
the DeepLab models are remarkably successful in producing accurate semantic segmentation results.
At the same time, the superpixel method, a well-known image pre-processing technique, has been
rapidly developed in recent years. Existing superpixel segmentation methods can be classified
into two major categories: graph-based methods [24,25] and gradient-ascent-based methods [26,27].
As one of the most widely used methods, SLIC adapts a k-means clustering approach to efficiently
generate superpixels, which has been proved better than other superpixel methods in nearly every
respect [27]. SLIC deserves our consideration for the application of image semantic segmentation due
to its advantages, such as low complexity, compact superpixel size, and good boundary adherence.
Although researchers have made some achievements, there is still much room for improvement
in image semantic segmentation. We observe that the useful details such as the boundaries of images
are often neglected because of the inherent spatial invariance of FCN. In this paper, an improved
image semantic segmentation method is presented, which is based on superpixels and CRFs. Once the
high-level abstract features of images are extracted, we can make use of the low-level cues, such as
the boundary information and the relationship among pixels, to improve the segmentation accuracy.
The improved method is briefly summarized as follows: First, we employ FCN to extract the pixel-level
semantic features, while the SLIC algorithm is chosen to generate the superpixels. Then, the fusion
of the two obtained pieces of information is implemented to get the boundary-optimized semantic
segmentation results. Finally, the CRF is employed to optimize the results of semantic segmentation
through its accurate boundary recovery ability. The improved method possesses not only excellent
feature extraction capability but also good boundary adherence.
The rest of this paper is organized as follows. Section 2 provides an overview of our method.
Section 3 describes the key techniques in detail. In Section 4, experimental results and discussion are
given. Finally, some conclusions are drawn in Section 5.

2. Overview
An improved image semantic segmentation method based on superpixels and CRFs is proposed
to improve the performance of semantic segmentation. In our method, the process of semantic
segmentation can be divided into two stages. The first stage is to extract as much semantic information
as possible from the input image. In the second stage (also treated as a post-processing step), we intend
to optimize the coarse features generated during the first stage. As a widely used post-processing
technique, CRFs are introduced into the image semantic segmentation. However, the performance
of CRFs depends on the quality of feature maps, so it is necessary to optimize the first stage results.
To address this problem, the boundary information of superpixels is combined with the output of the
first stage for the boundary optimization, which helps CRFs recover the information of boundaries
more accurately.
Figure 1 illustrates the flow chart of the proposed method, in which red boxes denote the two stages
and blue boxes denote the three processing steps. In the first step, we use the FCN model to extract
feature information from the input image and obtain pixel-level semantic labels. Although the trained
FCN model has fine feature extraction ability, the result is still relatively rough. Meanwhile, the SLIC
algorithm is employed to segment the input image and generate a large number of superpixels.
In the second step, also the most important one, we use the coarse features to reassign semantic
predictions within each superpixel. Benefiting from the good image boundary adherence of superpixels,
we obtain the boundary-optimized results. In the third step, we employ CRFs to predict the semantic
label of each pixel and further refine the segmentation boundaries. At this point, the final semantic
segmentation result is obtained. Both the high-level semantic information and the low-level cues at
image boundaries are fully utilized in our method.

Figure 1. The flow chart of the proposed method.

3. Key Techniques

In this section, we describe the theory of feature extraction, boundary optimization, and accurate
boundary recovery. The key point is the application of superpixels, which affects both the optimization
of the coarse features and the pixel-by-pixel prediction of the CRF model. The technical details are
discussed below.

3.1. Feature Extraction

Different from the classic CNN model, the FCN model can take images of arbitrary size as inputs
and generate correspondingly-sized outputs, so we employ it to implement the feature extraction in
our semantic segmentation method. In this paper, the VGG-based FCN is selected because it gives
the best performance among the various FCN structures. The VGG-based FCN is transformed from
VGG-16 [28] by keeping the first five convolutional stages and replacing the fully connected layers
with convolutional ones. After multiple iterations of convolution and pooling, the resolution of the
resulting feature map gets lower and lower, so upsampling is required to restore the coarse features
to an output with the same size as the input image. In the implementation, the resolution of the
feature maps is reduced by factors of 2, 4, 8, 16, and 32, respectively. Upsampling the output of the
last layer by a factor of 32 gives the result of FCN-32s. Because such a large magnification loses image
details, the results of FCN-32s are not accurate enough. To improve the accuracy, we add the more
detailed information of the last few pooling layers and combine it with the output of FCN-32s. By this
means, FCN-16s and FCN-8s can be derived.
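The skip-style fusion described above can be sketched as follows. This is a minimal PyTorch
illustration of how FCN-8s-like score maps might be fused and upsampled; the layer names, channel
counts, and the use of bilinear interpolation (in place of learned upsampling layers) are illustrative
assumptions rather than the exact configuration used in the paper.

```python
# Minimal sketch (assumptions: PyTorch, bilinear upsampling) of FCN-8s-style
# score fusion: coarse 1/32-resolution scores are combined with 1/16 (pool4)
# and 1/8 (pool3) feature maps before the final 8x upsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCN8sFusionHead(nn.Module):
    def __init__(self, num_classes, c_pool3=256, c_pool4=512, c_conv7=4096):
        super().__init__()
        # 1x1 convolutions turn intermediate feature maps into per-class scores
        self.score_pool3 = nn.Conv2d(c_pool3, num_classes, kernel_size=1)
        self.score_pool4 = nn.Conv2d(c_pool4, num_classes, kernel_size=1)
        self.score_conv7 = nn.Conv2d(c_conv7, num_classes, kernel_size=1)

    def forward(self, pool3, pool4, conv7):
        # 1/32 -> 1/16: upsample the coarsest scores and add the pool4 scores
        s = F.interpolate(self.score_conv7(conv7), size=pool4.shape[2:],
                          mode="bilinear", align_corners=False)
        s = s + self.score_pool4(pool4)
        # 1/16 -> 1/8: fuse with the pool3 scores
        s = F.interpolate(s, size=pool3.shape[2:],
                          mode="bilinear", align_corners=False)
        s = s + self.score_pool3(pool3)
        # 1/8 -> full resolution
        return F.interpolate(s, scale_factor=8,
                             mode="bilinear", align_corners=False)

# Example with dummy feature maps for a 224x224 input:
head = FCN8sFusionHead(num_classes=21)
pool3 = torch.randn(1, 256, 28, 28)   # 1/8 resolution
pool4 = torch.randn(1, 512, 14, 14)   # 1/16 resolution
conv7 = torch.randn(1, 4096, 7, 7)    # 1/32 resolution
print(head(pool3, pool4, conv7).shape)  # torch.Size([1, 21, 224, 224])
```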
An important problem that needs to be solved in semantic segmentation is how to combine
"where" with "what" effectively. In other words, semantic segmentation classifies the image pixel by
pixel and combines the position and classification information together. On one hand, owing to the
difference in receptive fields, the resolution is relatively higher in the first few convolutional layers,
so the positioning of the pixels is more accurate. On the other hand, in the last few convolutional
layers, the resolution is relatively lower and the classification of the pixels is more accurate. An example
of the three models is shown in Figure 2.

Figure 2. The features extracted from FCN models (from left to right: input, FCN-32s, FCN-16s,
FCN-8s, ground truth).

It can be observed from Figure 2 that the result of FCN-32s is drastically smoother and less refined.
This is because the receptive field of the FCN-32s model is larger and more suitable for macroscopic
perception. In contrast, the receptive field of the FCN-8s model is smaller and more suitable for
capturing details. From Figure 2 we can also see that the result of FCN-8s, which is closest to the
ground truth, is significantly better than those of FCN-16s and FCN-32s. Therefore, we choose FCN-8s
as the front end to extract the coarse features of images. However, the results of FCN-8s are still far
from perfect and insensitive to the details of images. In the following, the two-step optimization is
introduced in detail.
3.2. Boundary Optimization

In this section, more attention is paid to the optimization of image boundaries. There are several
image processing techniques that can be used for boundary optimization. For example, the method
based on graph cut [29] can obtain better edge segmentation results, but it relies on human interaction,
which is unacceptable when processing a large number of images. In addition, some edge detection
algorithms [30] are often used to optimize image boundaries. These algorithms share the common
feature that their parameters are fixed for a particular occasion, which makes them applicable mainly
to specific applications. However, when solving general border tracing problems for images containing
unknown objects and backgrounds, such fixed-parameter approaches often fail to achieve the best
results. In this work, superpixels are selected for the boundary optimization purpose. Generally,
a superpixel can be treated as a set of pixels that are similar in location, color, texture, etc. Owing to
this similarity, superpixels have a certain visual significance in comparison with single pixels. Although
a single superpixel carries no valid semantic information, it is part of an object that does have semantic
information. Moreover, the most important property of superpixels is their ability to adhere to image
boundaries. Based on this property, superpixels are applied to optimize the coarse features extracted
by the front end.

As shown in Figure 3, SLIC is used to generate superpixels, and then the coarse features are
optimized by the object boundaries obtained from these superpixels. To some degree, this method can
improve the segmentation accuracy of the object boundaries. The boundary optimization procedure is
demonstrated completely in Algorithm 1.

Figure 3. Boundary optimization using superpixels.

Figure 4 shows the result of boundary optimization obtained by applying Algorithm 1. As can be
seen from the partially enlarged details in Figure 4, the edge information of the superpixels is utilized
effectively, and thus a more accurate result can be obtained.

It is observed from the red box in Figure 4 that a number of superpixels with sharp, smooth,
and prominent edges adhere well to the object boundaries. Due to diffusion errors in the upsampling
process, a few pixels inside these superpixels carry different semantic information. A common mistake
is to misclassify background pixels as another class; Figure 4 shows that this kind of mistake can be
corrected effectively using our optimization algorithm. There are also superpixels containing several
kinds of semantic information, in which the numbers of pixels with different classes are about the
same. These superpixels are usually found at weak edges or thin structures of images, which are easy
to misclassify in a complex environment. For such superpixels, we keep the segmentation results
delivered by the front end.
Algorithm 1. The algorithm of boundary optimization.

1. Input image I and coarse features L.
2. Apply the SLIC algorithm to segment the whole image into K superpixels R = {R_1, R_2, ..., R_K},
   in which R_i is a superpixel region with label i.
3. Outer loop: For i = 1 : K
   ① Let M = {C_1, C_2, ..., C_N} denote all pixels in R_i, in which C_j is a pixel with classification j.
   ② Get the feature of each pixel in M from the front end. Initialize the weights W_C with 0.
   ③ Inner loop: For j = 1 : N
      Save the feature label of C_j as L_{C_j}, and update the weight W_{C_j} of that label within the
      entire superpixel:
         W_{C_j} = W'_{C_j} + 1/N, in which W'_{C_j} denotes the last value of W_{C_j}.
      If W_{C_j} > 0.8, then exit the inner loop.
      End
   ④ Search W_C.
      If there is a W_{C_j} > 0.8, then move on to the next step.
      Else search the maximum W_max and the sub-maximum W_sub;
         if W_max − W_sub > 0.2, then move on to the next step;
         else continue the outer loop.
   ⑤ Reassign the classification of the current superpixel with L_{C_max}.
   End
4. Output the image Ĩ.
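As a compact illustration of Algorithm 1, the following sketch performs the same superpixel-wise
reassignment with scikit-image's SLIC; it is not the authors' implementation. The per-label frequency
inside a superpixel replaces the incremental 1/N weight accumulation (the early exit of the inner loop
is folded into a direct frequency computation), and the 0.8 and 0.2 thresholds follow the listing above.
The demo inputs are synthetic.

```python
# Sketch (assumption: scikit-image available) of the boundary optimization in
# Algorithm 1: reassign coarse per-pixel labels inside each SLIC superpixel.
import numpy as np
from skimage.segmentation import slic

def boundary_optimize(image, coarse_labels, n_segments=1000):
    """image: (H, W, 3) float/uint8; coarse_labels: (H, W) int label map."""
    superpixels = slic(image, n_segments=n_segments, compactness=10)
    refined = coarse_labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        labels, counts = np.unique(coarse_labels[mask], return_counts=True)
        weights = counts / counts.sum()              # W_C for this superpixel
        order = np.argsort(weights)[::-1]
        w_max = weights[order[0]]
        w_sub = weights[order[1]] if len(order) > 1 else 0.0
        # Reassign only when one class clearly dominates (thresholds from
        # Algorithm 1); otherwise keep the front-end prediction unchanged.
        if w_max > 0.8 or (w_max - w_sub) > 0.2:
            refined[mask] = labels[order[0]]
    return refined

# Toy usage with a random image and random coarse labels:
rng = np.random.default_rng(0)
image = rng.random((120, 160, 3))
coarse = rng.integers(0, 21, size=(120, 160))
print(boundary_optimize(image, coarse, n_segments=200).shape)  # (120, 160)
```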
Figure 4. Boundary optimization and partial enlarged details.

3.3. Accurate Boundary Recovery

After the above boundary optimization, it is still necessary to improve the segmentation accuracy
at thin structures, weak edges, and complex superpositions. Therefore, we employ the CRF model
to recover the boundaries more accurately, i.e., to perform a further optimization. To clearly show
the effect of the CRF model, an example is given in Figure 5.

Figure 5. An example of before/after CRF: (a) before CRF; (b) after CRF; (c) ground truth.

Consider
obtain the pixel-wisewhich
global observations, labels areas random
usuallyvariables
the inputand the relationship
images. In detail, abetween pixels as edges,
global observation 𝐼 is
Consider
and
represented the
as pixel-wise
correspondingly
the input labels
constitute
image ofaas
N random
conditional variables
pixels in our random
method. and
field. thegiven
These
Then, relationship
labels can be
a graph between
𝐺 modelled
= (𝑉, 𝐸),pixels
𝑉after
and as
we 𝐸edges,
obtain global
and correspondingly observations, which are usually the input images. In detail,
denotes the vertices and the edges of the graph, respectively. Let 𝑋 be the vector formed by the we
constitute a conditional random field. These labels a global
can beobservation
modelled 𝐼 is
after
represented
obtain as the input
globalvariables
random 𝑋1 , 𝑋2 ,image
observations, , 𝑋𝑁 , of
…which N
inare pixels
which 𝑋in
usually ourthemethod.
𝑖 is the input
random Then,
images. given
variable In a grapha𝐺global
detail,
representing = (𝑉,label
the 𝐸), 𝑉 and 𝐸 I is
observation
assigned
denotes the
to the pixel
represented vertices
𝑖. Ainput
as the and
conditional the
imagerandom edges of
of N pixels the graph,
field conforms respectively. Let 𝑋
to Gibbs distribution,
in our method. be
Then, givenand the vector formed
G (𝐼,
the pair
a graph = 𝑋) by
(V,canE)the
,be
V and
random
modelled variables 𝑋1 , 𝑋 2 , … , 𝑋 𝑁 , in which 𝑋 𝑖 is the random variable representing the label assigned
E denotes the as
vertices and the edges of the graph, respectively. Let X be the vector formed by the
to the pixel 𝑖. A conditional random field conforms 1
to Gibbs distribution, and the pair (𝐼, 𝑋) can be
random variables X1 , X2 , . . . , X N𝑃(𝑋 𝑥|𝐼) =Xi is∙ 𝑒𝑥𝑝
, in=which the(−𝐸(𝑥|𝐼)),
random variable representing the label(1)assigned
modelled as 𝑍(𝐼)
to the pixel i. A conditional random field conforms to Gibbs distribution, and the pair ( I, X ) can be
1
in which
modelled 𝑃(𝑋of
as E(x) is the Gibbs energy =a𝑥|𝐼) =
labeling ∈ 𝐿𝑁
x∙ 𝑒𝑥𝑝 (−𝐸(𝑥|𝐼)),
and Z(I) is the partition function [31]. The(1)fully
𝑍(𝐼)
connected CRF model [32] employs the energy function 1
P(of
in which E(x) is the Gibbs energy x | I ) = x ∈ 𝐿𝑁 ·exp
X a=labeling and(−
Z(I) | I ))partition
E(isxthe , function [31]. The fully (1)
𝐸(𝑥) = ∑ 𝜙 (𝑥 ) +Z∑( I )𝜓 (𝑥 , 𝑥 ),
connected CRF model [32] employs the𝑖 energy
𝑖 𝑖 𝑖,𝑗 𝑖,𝑗
function 𝑖 𝑗 (2)
in which E(x) is
in which the
𝜙𝑖 (𝑥 Gibbs energy
𝑖 ) is the unary 𝐸(𝑥)
of a labeling
potentials x∈ LN
and Z(I) is the partition function
𝑖 taking[31]. The fully
= ∑𝑖 𝜙that represent the probability
𝑖 (𝑥𝑖 ) + ∑𝑖,𝑗 𝜓𝑖,𝑗 (𝑥𝑖 , 𝑥𝑗 ),
of the pixel the(2)
label
connected
𝑥𝑖 , andCRF model
𝜓𝑖,𝑗 (𝑥 ,
𝑖 𝑗𝑥 ) [32]
is the employs
pairwise the energy
potentials thatfunction
represent the cost of assigning labels 𝑥𝑖 , 𝑥𝑗 to pixels
in
𝑖, 𝑗which 𝑖 (𝑥𝑖 ) time.
at the𝜙same is theInunary potentials
our method, thethat represent
unary the can
potentials probability
be treated of the pixel
as the 𝑖 takingoptimized
boundary the label
𝑥feature
𝑖 , and 𝜓 map that can help improve the ∑
(𝑥 , 𝑥 ) ∑ 𝑥 𝑥

𝑖,𝑗 𝑖 𝑗 is the pairwise
E ( potentials
x ) = that
( x
performance
φ represent
) + of i,j the cost
x i j The pairwise potentialstousually
CRFi,jmodel.
ψ , of
x assigning
, labels 𝑖 , 𝑗 pixels (2)
i i i
𝑖, 𝑗 at the same time. In our method, the unary potentials can be
model the relationship among neighboring pixels and are weighted by color similarity. treated as the boundary optimized
feature
in which φi (map
xi ) isthat
The expression is can
the helppotentials
unary
employed improve the performance
that
for pairwise represent
potentials of CRF
the
[32] model.
asprobability
shown The
below: ofpairwise
the pixelpotentials
i takingusually
the label xi ,
model the
 relationship among neighboring pixels and are weighted by color similarity.
and ψi,j xi , x j is the pairwise potentials that represent the cost of assigning labels xi , x j to pixels i, j at
The expression is employed for pairwise potentials [32] as shown below:
the same time. In our method, the unary potentials can be treated as the boundary optimized feature
map that can help improve the performance of CRF model. The pairwise potentials usually model
Appl. Sci. 2018, 8, 837 7 of 17

the relationship among neighboring pixels and are weighted by color similarity. The expression is
employed for pairwise potentials [32] as shown below:

p i − p j 2 Ii − Ij 2 p i − p j 2
" ! #
Appl. Sci. 2018,
 8, x FOR PEER
 REVIEW 7 of 17
ψi,j xi , x j = µ xi , x j ω1 exp − − + ω2 exp(− ) , (3)
2σα2 2σβ2 2σγ2
||𝑝𝑖 −𝑝𝑗||2 ||𝐼𝑖 −𝐼𝑗 ||2 ||𝑝𝑖 −𝑝𝑗 ||2 (3)
𝜓𝑖,𝑗 (𝑥𝑖 , 𝑥𝑗 ) = 𝜇(𝑥𝑖 , 𝑥𝑗 ) [𝜔1 𝑒𝑥𝑝 (− 2 − 2 ) + 𝜔2 𝑒𝑥𝑝(− 2 )],
2𝜎𝛼 2𝜎𝛽 2𝜎𝛾
in which the first term depends on both pixel positions and pixel color intensities, and the second term
only depends on pixel positions.
in which the first term depends I , I are the color vectors,
i j on both pixel positions and pixeland p , p are the pixel positions.
j color intensities, and theThe other
i second
parameterstermare only described
dependsin onthe previous
pixel positions. work𝐼𝑖 , 𝐼[32].
𝑗 areAs
theshown and 𝑝model
in the Potts
color vectors, 𝑖 , 𝑝𝑗 are[33],
the µ xi , xpositions.
pixel j is equal
6= xother
to 1 if xiThe j , and parameters
0 otherwise.are described
It means in
that the previous
nearby work
similar [32].
pixels As shown
assigned in the
different Potts model
labels [33],be
should
penalized.𝜇(𝑥In
𝑖 , 𝑥𝑗other
) is equal
words, if 𝑥𝑖 ≠ 𝑥pixels
to 1 similar 𝑗 , and 0are
otherwise.
encouraged It means that
to be nearby similar
assigned the samepixels assigned
label, whereas different
pixels
labels should be penalized. In other words, similar pixels are encouraged
that differ greatly in “distance” are assigned different labels. The definition of “distance” is related to be assigned the same
label, whereas pixels that differ greatly in “distance” are assigned different labels. The definition of
to the color and the actual distance; thus, the CRF can segment images at the boundary as much as
“distance” is related to the color and the actual distance; thus, the CRF can segment images at the
possible. Partial enlarged details shown in Figure 6 are used to explain the analysis of the accurate
boundary as much as possible. Partial enlarged details shown in Figure 6 are used to explain the
boundary recovery.
analysis of the accurate boundary recovery.

(c)
(a) (b)
Ground
before CRF after CRF
truth

Figure 6. Accurate boundary recovery and partial enlarged details.


Figure 6. Accurate boundary recovery and partial enlarged details.
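To make the refinement step concrete, the sketch below shows one common way to run mean-field
inference for a fully connected CRF of the form (2)–(3) using the third-party pydensecrf package
(an implementation of [32]); this is an assumption for illustration, not the authors' code. The kernel
parameters loosely mirror the settings reported later in Section 4.1, but the exact mapping of ω and σ
values to these arguments is illustrative.

```python
# Sketch of dense-CRF boundary refinement (assumption: pydensecrf installed).
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, iters=10):
    """image: (H, W, 3) uint8; probs: (n_classes, H, W) softmax scores."""
    n_classes, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    # Unary term: the boundary-optimized class probabilities
    unary = unary_from_softmax(probs.astype(np.float32))
    d.setUnaryEnergy(np.ascontiguousarray(unary))
    # Smoothness kernel (second term of Eq. (3)): pixel positions only
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel (first term of Eq. (3)): positions and colors
    d.addPairwiseBilateral(sxy=49, srgb=3,
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = d.inference(iters)                 # mean-field iterations
    return np.argmax(q, axis=0).reshape(h, w)
```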
4. Experimental Evaluation
4. Experimental Evaluation
In this section, we first describe the used experimental setup, including the datasets and the selection
of parameters.
In this Next,
section, we wedescribe
first provide comprehensive ablation study
the used experimental of each
setup, component
including the of the improved
datasets and the
selectionmethod. Then, the evaluation
of parameters. Next, weofprovide
the proposed method is given,
comprehensive together
ablation with
study ofother
eachstate-of-the-art
component of
methods.
the improved Qualitative
method. Then,andthe
quantitative
evaluation experimental resultsmethod
of the proposed are presented entirely,
is given, and necessary
together with other
comparisons are performed to validate the competitive performance of our method.
state-of-the-art methods. Qualitative and quantitative experimental results are presented entirely,
and necessary comparisons
4.1. Experimental Setup are performed to validate the competitive performance of our method.
We use
4.1. Experimental the PASCAL VOC 2012 segmentation benchmark [34], as it has become the standard
Setup
dataset to comprehensively evaluate any new semantic segmentation methods. It involves 20
Weforeground
use the PASCAL VOC
classes and one2012 segmentation
background benchmark
class. For [34], ason
our experiments it has
VOCbecome
2012, wethe standard
adopt the
dataset extended
to comprehensively
training set of 10,581 images [35] and a reduced validation set of 346 images [20]. We involves
evaluate any new semantic segmentation methods. It further
20 foreground
evaluate classes and onemethod
the improved background
on theclass. For our
Cityscapes experiments
dataset [36], whichon VOC
focuses2012, we adopt
on semantic
the extended trainingofset
understanding of street
urban 10,581 images
scenes. [35] and
It consists a reduced
of around validation
5000 fine annotatedset of 346
images images
of street [20].
scenes
and 20,000 coarse annotated ones, in which all annotations are from 19 semantic
We further evaluate the improved method on the Cityscapes dataset [36], which focuses on semantic classes.
understanding Fromof Figure
urban 7, it canscenes.
street be seen It
that as the number
consists of aroundof superpixels increases; aimages
5000 fine annotated single superpixel will
of street scenes
get closer to the edge of the object. Most of images in VOC dataset have a resolution of around 500 ×
and 20,000 coarse annotated ones, in which all annotations are from 19 semantic classes.
500, so we set the number of superpixels to 1000. For Cityscapes dataset, we set the number of
From Figure 7, it can be seen that as the number of superpixels increases; a single superpixel
superpixels to 6000 due to the high-resolution of images. In our experiments, 10 mean field iterations
will get are
closer to the edge of the object. Most of images in VOC dataset have a resolution of around
employed for CRF. Meanwhile, we use default values of 𝜔2 = 𝜎𝛾 = 3 and set 𝜔1 = 5, 𝜎𝛼 = 49
500 × 500, so we
and 𝜎𝛽 = 3 setbythe
thenumber of superpixels
same strategy in [22]. to 1000. For Cityscapes dataset, we set the number of
superpixels to 6000 due to the high-resolution of images. In our experiments, 10 mean field iterations
Appl. Sci. 2018, 8, 837 8 of 17

are employed for CRF. Meanwhile, we use default values of ω1 = σr = 3 and set ω1 = 5, σα = 49 and
Appl. Sci. 2018, 8, x FOR PEER REVIEW 8 of 17
σβ = Appl.
3 bySci.
the same
2018, strategy
8, x FOR in [22].
PEER REVIEW 8 of 17

100 superpixels
100 superpixels 500superpixels
500 superpixels 1000 superpixels
1000 superpixels

1000 superpixels 3000 superpixels 6000 superpixels


1000 superpixels 3000 superpixels 6000 superpixels
Figure 7. Results with different superpixel number. The first row: SLIC segmentation results with 100,
Figure
Figure
500, Results
7. and
7. Results with
1000with different
different
superpixels superpixel
on superpixel
an number.
example number.
image The first
The
in VOC first row:
row:
dataset. SLIC
SLIC
The segmentation
segmentation
second results with
results with100,
row: SLIC segmentation 100,
500, and 1000
results superpixels
with 1000, 3000, on
andan example
6000 image
superpixels oninanVOC dataset.
example image The
in second row:
Cityscapes SLIC
dataset.
500, and 1000 superpixels on an example image in VOC dataset. The second row: SLIC segmentation segmentation
results with
results with 1000,
1000, 3000,
3000, and
and 6000
6000 superpixels
superpixels on
on an
an example
example image
image in
in Cityscapes
Cityscapes dataset.
dataset.
The standard Jaccard Index (Figure 8), also known as the PASCAL VOC intersection-over-union
(IoU)standard
The metric [34], is introduced
Jaccard for the performance
Index (Figure assessment
8), also known in this paper:
as the PASCAL VOC intersection-over-union
The standard Jaccard Index (Figure 8), also known as the PASCAL VOC intersection-over-union
(IoU) metric [34], is introduced for the performance assessment in this paper:

Ground truth Predicted


pixels in class pixels in class
FN TP FP
Ground truth Predicted
pixels in class pixels in class
FN TP FP

Figure 8. The standard Jaccard Index, in which TP, FP, and FN are the numbers of true positive, false
positive, and false negative pixels.

Figure 8. The standard


According Jaccard
to Jaccard Index,
Index, in which
many TP, FP,criteria
evaluation and FN are the numbers of true
to positive,
evaluatefalse
Figure 8. The standard Jaccard Index, in which TP, FP, and FNhave been
are the proposed
numbers of true positive, the
false
positive, and false
segmentation negative
accuracy. pixels.
Among them, PA, IoU, and mIoU are often used, and their definitions can be
positive, and false negative pixels.
found in the previous work [37]. We assume a total of k + 1 classes (including a background class),
According to amount
and 𝑝𝑖𝑗 is the JaccardofIndex,
pixels many
of classevaluation criteria
𝑖 inferred to class 𝑗.have been proposed
𝑝𝑖𝑖 represents the numberto evaluate
of true the
According
segmentation to Jaccard
accuracy. AmongIndex,
them,many
PA, evaluation
IoU, and criteria
mIoU are have
often been
used, proposed
positives, while 𝑝𝑖𝑗 and 𝑝𝑗𝑖 are usually interpreted as false positives and false negatives,
and their to evaluate
definitions canthe
be
segmentation accuracy.
foundrespectively.
in the previous Among
work [37].
The formulas of PA, them,
We PA,
assume
IoU, IoU, and
a total
and mIoU are of mIoU are
k + 1below:
shown often used, and their definitions
classes (including a background class),
can be
and 𝑝 found
thein the previous workof[37].
classWe𝑖 assume
inferred atototal of k𝑗.+𝑝1 classes (including a background
 𝑖𝑗 is
Pixel amount
Accuracy of pixels
(PA): class 𝑖𝑖 represents the number of true
class), and p
positives, whileij is the amount of pixels of class
𝑝𝑖𝑗 and 𝑝𝑗𝑖 are usually interpreted i inferred to class j. p represents
as false iipositives and thefalse
number of true
negatives,
𝑘
∑ 𝑝 (4)
positives, while Thepformulas
ij and p ji are usually
IoU, interpreted
𝑖=0 as false positives and false negatives, respectively.
𝑖𝑖
respectively. of PA, 𝑃𝐴 =mIoU
and ∑𝑘 𝑘 are ,shown below:
𝑖=0 ∑𝑗=0 𝑝𝑖𝑗
The formulas of PA, IoU, and mIoU are shown below:
 Pixel
whichAccuracy
computes (PA):a ratio between the number of properly classified pixels and the total number of
Pixel
them.
Accuracy (PA):
∑𝑘 𝑝𝑖𝑖k (4)
𝑃𝐴 = ∑𝑘 𝑖=0 ∑i=0, pii
𝑘 𝑝
 Intersection Over Union (IoU): PA = 𝑖=0 ∑𝑗=0k 𝑖𝑗 k , (4)
∑i=0 ∑ j=0 pij
𝑝
which computes a ratio between the 𝐼𝑜𝑈number
= ∑𝑘 of 𝑖𝑖properly
𝑝𝑖𝑗 +∑𝑘
, classified pixels and the total number(5) of
which computes a ratio between the number 𝑗=0of 𝑗=0 𝑝𝑗𝑖 −𝑝𝑖𝑖
properly classified pixels and the total number of them.
them.
which is used to measure whether the target in the image is detected.
Intersection Over Union (IoU):
 Intersection Over Union (IoU):
 Mean Intersection Over Union (mIoU):
𝑝 pii
𝐼𝑜𝑈
IoU ==∑𝑘 1 𝑖𝑖𝑘 , , (5) (5)
𝑚𝐼𝑜𝑈 = k𝑝𝑖𝑗 +∑∑ 𝐼𝑜𝑈
𝑝𝑗𝑖 −𝑝
k𝑖 , 𝑖𝑖 (6)
∑ j=
𝑗=0 𝑘+10 pij + ∑ j=0 p ji − pii
𝑗=0
𝑖=0

which is used to measure whether the target in the image is detected.


which is used to measure whether the target in the image is detected.
 Mean Intersection Over Union (mIoU):
1
𝑚𝐼𝑜𝑈 = ∑𝑘𝑖=0 𝐼𝑜𝑈𝑖 , (6)
𝑘+1
Appl. Sci. 2018, 8, 837 9 of 17

Mean Intersection Over Union (mIoU):

1 k
k + 1 ∑ i =0
Appl. Sci. 2018, 8, x FOR PEER REVIEW mIoU = IoUi , 9 of(6)
17

which isis the


which thestandard
standardmetric
metric
for for segmentation
segmentation purposes
purposes and computed
and computed by averaging
by averaging IoU.
IoU. mIoU
mIoU computes
computes a ratio abetween
ratio between the ground
the ground truth
truth and and
our our predicted
predicted segmentation.
segmentation.

4.2. Ablation
4.2. Ablation Study
Study
The core
The core idea
idea of
of the
the improved
improved method
method lies
lies in
in the
the utility
utility of
of superpixels
superpixels and
and the
the optimization
optimization of
of
CRF model. First, to evaluate the importance of the utility of superpixels, we directly compared
CRF model. First, to evaluate the importance of the utility of superpixels, we directly compared the the
plain FCN-8s
plain FCN-8s model
model with
with the
the boundary
boundary optimized
optimized one.
one. Then,
Then, the
the FCN-8s
FCN-8s with
with CRF
CRF and
and our
our proposed
proposed
method are implemented sequentially. For better understanding, the results of these
method are implemented sequentially. For better understanding, the results of these comparative comparative
experiments are
experiments are shown
shown inin Figure
Figure 99 and
and Table
Table 1.
1.

(c) Optimized by (d) FCN-8s


(a) Input image (b) FCN-8s (e) Proposed Method (f) Ground Truth
superpixels with CRF

Figure 9. An example result of the comparative experiments. (a) The input image, (b) the result of
Figure 9. An example result of the comparative experiments. (a) The input image, (b) the result of
plain
plain FCN-8s, (c)
FCN-8s, (c) the
the boundary
boundary optimization
optimization result
result by
by superpixels,
superpixels, (d)
(d) the
the result
result of
of FCN-8s
FCN-8s with
with CRF
CRF
post-processing,
post-processing, (e)
(e) the
the result
result of
of our
our method, and (f)
method, and (f) the
the ground
ground truth.
truth.

From the results, we see that methods with boundary optimization by superpixels consistently
From the results, we see that methods with boundary optimization by superpixels consistently
perform better than the counterparts without optimization. For example, as shown in Table 1, the
perform better than the counterparts without optimization. For example, as shown in Table 1,
method with boundary optimization is 3.2% better than the plain FCN-8s on VOC dataset, while the
the method with boundary optimization is 3.2% better than the plain FCN-8s on VOC dataset,
improvement becomes 2.8% on Cityscapes dataset. It can be observed that the object boundaries in
while the improvement becomes 2.8% on Cityscapes dataset. It can be observed that the object
Figure 9c are closer to the ground truth than Figure 9b. Based on the above experimental results, we
boundaries in Figure 9c are closer to the ground truth than Figure 9b. Based on the above experimental
can conclude that the segmentation accuracy of the object boundaries can be improved by the utility
results, we can conclude that the segmentation accuracy of the object boundaries can be improved
of superpixels. Table 1 also shows that, for the FCN-8s with CRF applied as a post-processing step,
by the utility of superpixels. Table 1 also shows that, for the FCN-8s with CRF applied as a
the better performance can be observed after boundary optimization. As shown in Figure 9d,e, the
post-processing step, the better performance can be observed after boundary optimization. As shown
results of our method are more similar with the ground truth than that without boundary
in Figure 9d,e, the results of our method are more similar with the ground truth than that without
optimization. The mIoU scores of the two right-most columns also corroborated this point, with an
boundary optimization. The mIoU scores of the two right-most columns also corroborated this point,
improvement of 5% on VOC dataset and 4.1% on Cityscapes dataset.
with an improvement of 5% on VOC dataset and 4.1% on Cityscapes dataset.
Table 1. The mIoU scores of the comparative experiments.
Table 1. The mIoU scores of the comparative experiments.
Dataset\Method Plain FCN-8s With BO 1 With CRF Our Method
DatasetMethod
VOC 2012 Plain FCN-8s
62.7 BO 1
With65.9 With
69.5CRF Our
74.5Method
VOCCityscapes
2012 56.1
62.7 58.9
65.9 61.3
69.5 65.474.5
Cityscapes 1 BO56.1 58.9 Optimization.
denotes the Boundary 61.3 65.4
1 BO denotes the Boundary Optimization.
The purpose of using CRFs is to recover the boundaries more accurately. We compared the
The purpose
performance of using
of plain FCN-8s CRFs is to recover
with/without CRFthe boundaries
under the samemore accurately.
situation. We compared
From Table the
1, it is clear
performance of plain FCN-8s with/without CRF under the same situation. From Table 1, it is
that CRF consistently boosts classification scores on both the VOC dataset and the Cityscapes dataset.
clear thatwe
Besides, CRFcompared
consistently
theboosts classification
performance scores on both
of boundary the VOC
optimized datasetwith/without
FCN-8s and the Cityscapes
CRF.
dataset. Besides, we compared the performance of boundary optimized FCN-8s
In conclusion, methods optimized by CRFs outperform the counterparts by a significant with/without CRF.
margin,
In conclusion, methods optimized by CRFs outperform
which shows the importance of boundary optimization and CRFs. the counterparts by a significant margin,
which shows the importance of boundary optimization and CRFs.
Appl. Sci. 2018, 8, 837 10 of 17

Appl. Sci. 2018, 8, x FOR PEER REVIEW 10 of 17

4.3. Experimental Results


4.3. Experimental Results
4.3.1. Qualitative Analysis
4.3.1. Qualitative Analysis
According to the method proposed in this paper, we have obtained the improved semantic
According to the method proposed in this paper, we have obtained the improved semantic
segmentation results. The
segmentation results. Thecomparisons on VOC
comparisons on VOCdataset
datasetamong
among FCN-8s,
FCN-8s, DeepLab-v2
DeepLab-v2 [22], [22], and our
and our
method are shown in Figure 10.
method are shown in Figure 10.

Input image Superpixels FCN-8s DeepLab-v2 Our Method Ground Truth

B-ground Aeroplane Bicycle Bird Boat Bottle Bus


PASCAL VOC 2012
Car Cat Chair Cow Dining-Table Dog Horse
class-definitions
Motorbike Person Potted-Plant Sheep Sofa Train TV/Monitor

Figure 10. Qualitative results on the reduced validation set of PASCAL VOC 2012. From left to right:
Figure 10. Qualitative results on the reduced validation set of PASCAL VOC 2012. From left to
the original image, SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the
right: the original image, SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the
ground truth.
ground truth.
Appl. Sci. 2018, 8, x FOR PEER REVIEW 11 of 17
Appl. Sci. 2018, 8, 837 11 of 17

As shown in Figure 10, our results are significantly closer to the ground truth than those by FCN-
8s, especially
As shown forinthe segmentation
Figure of the are
10, our results object edges in images.
significantly closer to By the
comparing
ground thetruth results
than obtained
those by
by FCN-8s
FCN-8s, with ours,for
especially it can
the be seen that they
segmentation of are
thethe sameedges
object colorin and they have
images. By acomparing
similar object theoutline,
results
which indicates that our method inherits feature extraction and object
obtained by FCN-8s with ours, it can be seen that they are the same color and they have a similar recognition abilities of FCN.
In addition,
object outline, ourwhich
method can getthat
indicates finer
ourdetails
method than DeepLab-v2
inherits featurein the vast and
extraction majority
object ofrecognition
categories,
which
abilitiesprofits
of FCN.from In the boundary
addition, our optimization
method can get of coarse features.
finer details thanFrom segmentation
DeepLab-v2 in theresults of each
vast majority
image in Figure
of categories, which 10, profits
we also fromnotice
the the following
boundary details: (a)
optimization of All
coarsethese methods
features. From cansegmentation
identify the
classification
results of each of image
the single large object
in Figure 10, we and locate
also noticethetheregion accurately,
following details:due(a)to the
All excellent
these methodsextraction
can
ability of CNNs. (b) In the complex scene, these methods cannot work effectively
identify the classification of the single large object and locate the region accurately, due to the excellent due to the error of
extracted
extractionfeatures,
ability ofevenCNNs. employing
(b) In the different
complex post-processing
scene, these methods steps. (c) Smallwork
cannot objects are possible
effectively due to
be
themisidentified
error of extractedor missed,
features,dueeven
to the lack of thedifferent
employing pixels describing the small
post-processing objects.
steps. (c) Small objects are
It is to
possible worth mentioning that,
be misidentified in some
or missed, details
due to theoflack
Figure 10, such
of the pixelsasdescribing
the rear wheel of theobjects.
the small first bicycle,
the results of DeepLab-v2
It is worth mentioningseem that,better
in some than ours. of
details The segmentation
Figure 10, such as result
the of DeepLab-v2
rear wheel of the almost
first
completely
bicycle, the retained
results ofthe rear wheel
DeepLab-v2 seemof the
betterfirst
thanbicycle.
ours. However, it cannotresult
The segmentation be overlooked
of DeepLab-v2 that
DeepLab-v2
almost completelyfails toretained
properlythe process the front
rear wheel wheels
of the first of the two
bicycle. bicycles.itIncannot
However, contrast,
be the front wheels
overlooked that
processed
DeepLab-v2 byfails
our tomethod
properly are process
closer tothethat of ground
front wheelstruth.
of theFortwothe rear wheel
bicycles. missingthe
In contrast, problem, some
front wheels
analysis
processed is by
madeouron the difference
method are closer among
to that theofbicycle
groundbody, truth.solidForrear wheel,
the rear and missing
wheel background. The
problem,
solid
somerear wheel
analysis is has
made a distinguished
on the difference coloramong
with compassion
the bicycle body,of the solid
background,
rear wheel,but atandthebackground.
same time,
the same
The case wheel
solid rear exists has in the bicycle bodycolor
a distinguished and with
solidcompassion
rear wheel,ofleading to a largebut
the background, probability
at the same of
misrecognition. Additionally, the non-solid wheel is much more common
time, the same case exists in the bicycle body and solid rear wheel, leading to a large probability of than the solid one in the
real scenario; therefore,
misrecognition. Additionally,the CRFthe model
non-solid tends to predict
wheel is much the non-solid
more common wheel.
than the solid one in the real
scenario; therefore, the CRF model tends to predict the non-solid wheel. Cityscapes dataset. Some
The proposed semantic segmentation method is implemented on
visual results
The proposed
proposed semantic by FCN-8s, DeepLab-v2,
segmentation methodand our method on
is implemented areCityscapes
shown in Figure
dataset.11. Similar
Some to
visual
the results
results on PASCAL
proposed by FCN-8s, VOC 2012 dataset, the
DeepLab-v2, andimproved
our method method achieves
are shown better performance
in Figure 11. Similar tothan the
others.
results on PASCAL VOC 2012 dataset, the improved method achieves better performance than others.

Input image Superpixels FCN-8s DeepLab-v2 Our Method Ground Truth

Cityscapes Road Sidewalk Building Wall Fence Pole Traffic light Traffic sign Vegetation
Class definitions Terrain Sky Person Rider Car Truck Bus Train Motorcycle Bicycle

Figure 11. Visual results on the validation set of Cityscapes. From left to right: the original image,
Figure 11. Visual results on the validation set of Cityscapes. From left to right: the original image,
SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the ground truth.
SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the ground truth.
4.3.2. Quantitative Analysis
4.3.2. Quantitative Analysis
 PASCAL VOC 2012 Dataset
PASCAL VOC 2012 Dataset
The per-class IoU scores on VOC dataset among FCN-8s, the proposed method, and other
The methods
popular per-class IoU
are scores
shownoninVOC dataset
Table 2. Itamong
can beFCN-8s, the that
observed proposed method, achieves
our method and otherthe
popular
best
methods are shown in Table 2. It can be observed that our method achieves the best performance
performance in most categories, which is consistent with the conclusion obtained in qualitative in
Appl. Sci. 2018, 8, x FOR PEER REVIEW 12 of 17

analysis. In addition, the proposed method outperforms prior methods in mIoU metric; it reaches the
Appl. Sci. 2018, 8, 837 12 of 17
highest 74.5% accuracy.

Table 2. Per-class
most categories, which IoU
is score on VOC
consistent dataset.
with Best performance
the conclusion of each
obtained category is highlighted
in qualitative in addition,
analysis. In bold.
the proposed method outperforms prior methods in mIoU metric;
Zoom-out CRF-RNNit reaches the highest
GCRF DPN 74.5%Our accuracy.
FCN-8s DeepLab-v2
[38] [20] [39] [40] Method
areo2. Per-class
Table IoU score on85.6
74.3 VOC dataset. Best
86.6performance of each category
87.5 85.2is highlighted
87.7 in 85.5
bold.
bike 36.8 37.3 37.2 39.0 43.9 59.4 40.1
birdFCN-8s 77.0Zoom-Out [38]83.2 DeepLab-v2 82.1
CRF-RNN [20] 79.7GCRF [39] 83.3
DPN [40]78.4 Our Method
83.1
areo boat 74.3 52.4 85.6 62.5 86.6 65.6 87.5 64.2 85.2 65.2 87.7 64.9 66.3
85.5
bikebottle 36.8 67.7 37.3 66.0 37.2 71.2 39.0 68.3 43.9 68.3 59.4 70.3 40.1
74.2
bird 77.0 83.2 82.1 79.7 83.3 78.4 83.1
bus 75.4 85.1 88.3 87.6 89.0 89.3 91.3
boat 52.4 62.5 65.6 64.2 65.2 64.9 66.3
bottle car 67.7 71.4 66.0 80.7 71.2 82.8 68.3 80.8 68.3 82.7 70.3 83.5 82.3
74.2
bus cat 75.4 76.3 85.1 84.9 88.3 85.6 87.6 84.4 89.0 85.3 89.3 86.1 87.5
91.3
car chair 71.4 23.9 80.7 27.2 82.8 36.6 80.8 30.4 82.7 31.1 83.5 31.7 82.3
33.6
cat cow 76.3 69.7 84.9 73.2 85.6 77.3 84.4 78.2 85.3 79.5 86.1 79.9 87.5
82.2
chair 23.9 27.2 36.6 30.4 31.1 31.7 33.6
cow table 69.7 44.5 73.2 57.5 77.3 51.8 78.2 60.4 79.5 63.3 79.9 62.6 62.3
82.2
table dog 44.5 69.2 57.5 78.1 51.8 80.2 60.4 80.5 63.3 80.5 62.6 81.9 85.9
62.3
doghorse 69.2 61.8 78.1 79.2 80.2 77.1 80.5 77.8 80.5 79.3 81.9 80.0 83.0
85.9
horse mbike 61.8 75.7 79.2 81.1 77.1 75.7 77.8 83.1 79.3 85.5 80.0 83.5 83.0
83.4
mbike 75.7 81.1 75.7 83.1 85.5 83.5 83.4
person 75.7 77.1 82.0 80.6 81.0 82.3 86.6
person 75.7 77.1 82.0 80.6 81.0 82.3 86.6
plantplant 44.3 44.3 53.6 53.6 52.0 52.0 59.5 59.5 60.5 60.5 60.5 60.5 56.9
56.9
sheep sheep 68.2 68.2 74.0 74.0 78.2 78.2 82.8 82.8 85.5 85.5 83.2 83.2 86.3
86.3
sofa sofa 34.1 34.1 49.2 49.2 44.9 44.9 47.8 47.8 52.0 52.0 53.4 53.4 49.4
49.4
traintrain 75.5 75.5 71.7 71.7 79.7 79.7 78.3 78.3 77.3 77.3 77.9 77.9 80.4
80.4
tv 52.7 63.3 66.7 67.1 65.1 65.0 69.9
mIoU tv 62.7 52.7 69.6 63.3 71.2 66.7 72.0 67.1 73.2 65.1 74.1 65.0 69.9
74.5
mIoU 62.7 69.6 71.2 72.0 73.2 74.1 74.5

This work proposed a post-processing method to improve the FCN-8s result. In the implementation, superpixel segmentation and the CRF are applied successively to refine the coarse results extracted by FCN-8s. Therefore, the comparison with FCN-8s is the most direct way to illustrate the improvement achieved by our method. Meanwhile, to further evaluate its performance, a comparison with DeepLab-v2 has also been made. The IoU, mIoU, and PA scores of FCN-8s, DeepLab-v2, and our method are given in Figure 12.
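To make the refinement pipeline concrete, the following is a minimal Python sketch of this style of post-processing rather than the authors' implementation: it assumes a softmax probability map probs (shape C x H x W) produced by FCN-8s and a contiguous uint8 RGB image, averages the class scores inside SLIC superpixels, and then runs a fully connected CRF with position and color kernels (via scikit-image and pydensecrf). The function name refine, the superpixel count, and all kernel parameters are illustrative placeholders.

```python
import numpy as np
from skimage.segmentation import slic            # SLIC superpixel generation
import pydensecrf.densecrf as dcrf               # fully connected CRF
from pydensecrf.utils import unary_from_softmax

def refine(image, probs, n_segments=500, crf_iters=5):
    """Hypothetical post-processing: superpixel averaging followed by a CRF."""
    n_classes, h, w = probs.shape

    # Superpixel step: average the FCN-8s scores within each SLIC superpixel
    # so that the prediction adheres to superpixel boundaries.
    segments = slic(image, n_segments=n_segments, compactness=10)
    sp_probs = probs.copy()
    for label in np.unique(segments):
        mask = segments == label
        sp_probs[:, mask] = probs[:, mask].mean(axis=1, keepdims=True)

    # CRF step: unary terms from the smoothed scores, pairwise terms from
    # pixel positions (Gaussian) and positions plus colors (bilateral).
    crf = dcrf.DenseCRF2D(w, h, n_classes)
    crf.setUnaryEnergy(unary_from_softmax(sp_probs))
    crf.addPairwiseGaussian(sxy=3, compat=3)
    crf.addPairwiseBilateral(sxy=60, srgb=13,
                             rgbim=np.ascontiguousarray(image), compat=10)
    q = crf.inference(crf_iters)
    return np.asarray(q).reshape(n_classes, h, w).argmax(axis=0)
```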

FCN-8s DeepLab-v2 Our Method 93.2


91.6
100 75.9
90 74.5
80 71.2
62.7
70
60
Score

50
40
30
20
10
0

Figure 12. IoU, mIoU, and PA scores on the PASCAL VOC 2012 dataset.
Figure 12. IoU, mIoU, and PA scores on the PASCAL VOC 2012 dataset.

It can be observed from Figure 12 that, compared with FCN-8s, the proposed method significantly improves the IoU score in every category, together with higher mIoU and PA scores. In addition, our method is better than DeepLab-v2 in most categories (except for aeroplane, car, and chair). The detailed improvement statistics are shown in Figure 13.

[Bar chart omitted: per-category improvement over FCN-8s and over DeepLab-v2.]

Figure 13. The improvement by our method compared with FCN-8s and DeepLab-v2.
As shown in Figure 13, our method achieves a significant improvement over FCN-8s, reaching up to 11.8% in mIoU and 17.4% in PA, respectively. For DeepLab-v2, the improvements in mIoU and PA are 3.3% and 1.7%. Moreover, Figure 13 intuitively shows the per-category improvement of our method over FCN-8s and DeepLab-v2: an improvement is observed in 100% (21/21) of the categories when compared with FCN-8s and in 85.71% (18/21) of the categories when compared with DeepLab-v2. In summary, our method achieves the best performance among these three methods.
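For reference, the per-class IoU, mIoU, and PA values compared above can all be derived from a single confusion matrix accumulated over the validation pixels. The sketch below is not the authors' evaluation code (and the ignore_index value of 255 is an assumption matching the usual VOC void label); it shows one common way to compute these metrics, with the improvement bars in Figures 13 and 15 then being simple differences between two methods' scores.

```python
import numpy as np

def confusion_matrix(pred, gt, n_classes, ignore_index=255):
    """Accumulate an n_classes x n_classes confusion matrix, skipping void pixels."""
    valid = gt != ignore_index
    idx = n_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def scores(conf):
    """Per-class IoU, mean IoU, and pixel accuracy from a confusion matrix."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return iou, np.nanmean(iou), tp.sum() / conf.sum()

# Improvement as plotted in Figure 13 (mIoU values taken from Table 2):
miou_fcn8s, miou_ours = 0.627, 0.745
print(f"mIoU improvement over FCN-8s: {100 * (miou_ours - miou_fcn8s):.1f} points")
```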
Cityscapes Dataset
We conducted an experiment on the Cityscapes dataset, which differs from the previous one in its high-resolution images and number of classes. We used the provided partitions of the training and validation sets, and the obtained results are reported in Table 3. It can be observed that the evaluation on the Cityscapes validation set is similar to that on the VOC dataset: using our method, the highest mIoU score reaches 65.4%. The IoU, mIoU, and PA scores of FCN-8s, DeepLab-v2, and our method are given in Figure 14.
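For readers who want to reproduce the split, the official fine-annotation train/val partitions (2975 and 500 images) can be loaded, for example, with torchvision's built-in Cityscapes wrapper. The snippet below is only a sketch under that assumption (the paper does not state which data loader was used), and the root path is a placeholder.

```python
from torchvision.datasets import Cityscapes

# `root` must contain the official leftImg8bit/ and gtFine/ directories.
root = "/path/to/cityscapes"
train_set = Cityscapes(root, split="train", mode="fine", target_type="semantic")
val_set = Cityscapes(root, split="val", mode="fine", target_type="semantic")

image, target = val_set[0]           # PIL RGB image and its semantic label map
print(len(train_set), len(val_set))  # 2975 training and 500 validation images
```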
Table 3. Per-class IoU score on the Cityscapes dataset. Best performance of each category is highlighted in bold.

Category        FCN-8s   DPN [40]   CRF-RNN [20]   DeepLab-v2   Our Method
road             95.9      96.3        96.3           96.8         97.2
sidewalk         71.5      71.7        73.9           75.6         78.9
building         85.9      86.7        88.2           88.2         88.8
wall             25.9      43.7        47.6           31.1         35.1
fence            38.4      31.7        41.3           42.6         43.3
pole             31.2      29.2        35.2           41.2         40.2
traffic light    38.3      35.8        49.5           45.3         44.3
traffic sign     52.3      47.4        59.7           58.8         59.3
vegetation       87.3      88.4        90.6           91.6         93.5
terrain          52.1      63.1        66.1           59.6         61.6
sky              87.6      93.9        93.5           89.3         94.2
person           61.7      64.7        70.4           75.8         79.3
rider            32.9      38.7        34.7           41.2         43.9
car              86.6      88.8        90.1           90.1         94.1
truck            36.0      48.0        39.2           46.7         53.4
bus              50.8      56.4        57.5           60.0         66.0
train            35.4      49.4        55.4           47.0         51.8
motorcycle       34.7      38.3        43.9           46.2         47.3
bicycle          60.6      50.0        54.6           71.9         70.4
mIoU             56.1      59.1        62.5           63.1         65.4

[Bar chart omitted: comparison of FCN-8s, DeepLab-v2, and our method; y-axis: IoU Score.]

Figure 14. IoU, mIoU, and PA scores on the Cityscapes dataset.

It can be observed from Figure 14 that the proposed method significantly improves the IoU, mIoU, and PA scores in every category compared with FCN-8s. Similar to the results on the VOC 2012 dataset, our method is better than DeepLab-v2 in most categories (except for pole, traffic light, and bicycle). The detailed improvement statistics are shown in Figure 15.
[Bar chart omitted: mIoU/PA improvements of 9.3/9.1 points over FCN-8s and 2.3/1.2 points over DeepLab-v2, plus per-category improvements.]

Figure 15. The improvement by our method compared with FCN-8s and DeepLab-v2.

As shown in Figure 15, our method obtains a significant improvement over FCN-8s, reaching up to 9.3% in mIoU and 9.1% in PA, respectively. Moreover, Figure 15 intuitively shows the per-category improvement of our method over FCN-8s and DeepLab-v2: an improvement is observed in 100% (19/19) of the categories when compared with FCN-8s and in 84.21% (16/19) of the categories when compared with DeepLab-v2.

5. Conclusions
In this paper, an improved semantic segmentation method is proposed, which utilizes the superpixel edges of images and the constraint relationships between different pixels. First, our method inherits the ability to extract high-level semantic information from the FCN model. Then, to effectively optimize the boundaries of the results, our method takes advantage of the good boundary adherence of superpixels. Finally, we apply the CRF to further predict the semantic label of each pixel, making full use of the local texture features of the image, the global context information, and a smoothness prior. Experimental results show that our method achieves more accurate segmentation results: the mIoU score reaches 74.5% on the VOC dataset and 65.4% on the Cityscapes dataset, which are improvements of 11.8% and 9.3% over FCN-8s, respectively.

Author Contributions: Hai Wang, Wei Zhao, and Yi Fu conceived and designed the experiments; Yi Fu and
Xiaosong Wei performed the experiments; Wei Zhao and Yi Fu analyzed the data; Xiaosong Wei contributed
analysis tools; Hai Wang, Wei Zhao, and Yi Fu wrote the paper.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands deep in deep learning for hand pose estimation. arXiv 2015,
arXiv:1502.06807.
2. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite.
In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI,
USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361.
3. Wan, J.; Wang, D.; Hoi, S.C.H.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep learning for content-based image
retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia,
Orlando, FL, USA, 3–7 November 2014; ACM: Orlando, FL, USA, 2014; pp. 157–166.
4. Kang, W.X.; Yang, Q.Q.; Liang, R.P. The comparative research on image segmentation algorithms.
In Proceedings of the 2009 First International Workshop on Education Technology and Computer Science,
Wuhan, China, 7–8 March 2009; pp. 703–707.
5. Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. Deep neural networks segment neuronal
membranes in electron microscopy images. In Proceedings of the Advances in Neural Information Processing
Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates, Inc.: Red Hook, NY, USA, 2012;
pp. 2843–2851.
6. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from rgb-d images for object detection and
segmentation. In European Conference on Computer Vision; Springer: New York, NY, USA, 2014; pp. 345–360.
7. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous detection and segmentation. In European
Conference on Computer Vision; Springer: New York, NY, USA, 2014; pp. 297–312.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and
semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern
Recognition, Columbus, OH, USA, 24–27 June 2014; IEEE Computer Society: Washington, DC, USA, 2014;
pp. 580–587.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
10. Luo, P.; Wang, G.; Lin, L.; Wang, X. Deep dual learning for semantic image segmentation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 2718–2726.
11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 2881–2890.
12. Ravì, D.; Bober, M.; Farinella, G.M.; Guarnera, M.; Battiato, S. Semantic segmentation of images exploiting
dct based features and random forest. Pattern Recogn. 2016, 52, 260–273. [CrossRef]
13. Fu, J.; Liu, J.; Wang, Y.; Lu, H. Stacked deconvolutional network for semantic segmentation. arXiv 2017,
arXiv:1708.04943.
14. Wu, Z.; Shen, C.; Hengel, A.V.D. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv
2016, arXiv:1611.10080.
15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June
2015; IEEE: New York, NY, USA, 2015; pp. 3431–3440.
16. Ghiasi, G.; Fowlkes, C.C. Laplacian pyramid reconstruction and refinement for semantic segmentation.
In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October
2016; Springer: New York, NY, USA, 2016; pp. 519–534.
17. Lin, G.; Shen, C.; Hengel, A.V.D.; Reid, I. Exploring context with deep structured models for semantic
segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1352–1366. [CrossRef] [PubMed]
18. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of
the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
19. Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to scale: Scale-aware semantic image
segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27 June–1 July 2016; IEEE: New York, NY, USA, 2016; pp. 3640–3649.
20. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.Z.; Du, D.L.; Huang, C.; Torr, P.H.S.
Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International
Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: New York, NY, USA, 2015;
pp. 1529–1537.
21. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep
convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
22. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal.
Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
23. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image
segmentation. arXiv 2017, arXiv:1706.05587.
24. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22,
888–905.
25. Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient graph-based image segmentation. Int. J. Comput. Vis. 2004,
59, 167–181. [CrossRef]
26. Van den Bergh, M.; Boix, X.; Roig, G.; de Capitani, B.; Van Gool, L. Seeds: Superpixels extracted via
energy-driven sampling. In Proceedings of the 12th European Conference on Computer Vision-Volume Part
VII, Florence, Italy, 7–13 October 2012; Springer: New York, NY, USA, 2012; pp. 13–26.
27. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Susstrunk, S. Slic superpixels compared to state-of-the-art
superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2281. [CrossRef] [PubMed]
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
29. Rother, C.; Kolmogorov, V.; Blake, A. Grabcut: Interactive foreground extraction using iterated graph cuts.
In Proceedings of the ACM Transactions on Graphics (TOG), Los Angeles, CA, USA, 8–12 August 2004;
ACM: New York, NY, USA, 2004; Volume 23, pp. 309–314.
30. Maini, R.; Aggarwal, H. Study and comparison of various image edge detection techniques. Int. J.
Image Process. 2009, 3, 1–11.
31. Gadde, R.; Jampani, V.; Kiefel, M.; Kappler, D.; Gehler, P.V. Superpixel convolutional networks using
bilateral inceptions. In Computer Vision, ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer
International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 597–613.
32. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials.
In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada,
Spain, 12–15 December 2011; Curran Associates Inc.: Red Hook, NY, USA, 2011; pp. 109–117.
33. Wu, F. The potts model. Rev. Mod. Phys. 1982, 54, 235. [CrossRef]
34. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual
object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef]
35. Hariharan, B.; Arbelaez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors.
In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November
2011; IEEE Computer Society: Washington, DC, USA, 2011; pp. 991–998.
36. Cordts, M.; Omran, M.; Ramos, S.; Scharwächter, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.;
Schiele, B. The cityscapes dataset. In Proceedings of the CVPR Workshop on the Future of Datasets in Vision,
Boston, MA, USA, 11 June 2015; p. 3.
37. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep
learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857.
38. Mostajabi, M.; Yadollahpour, P.; Shakhnarovich, G. Feedforward semantic segmentation with zoom-out
features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA,
USA, 7–12 June 2015; pp. 3376–3385.
39. Vemulapalli, R.; Tuzel, O.; Liu, M.-Y.; Chellapa, R. Gaussian conditional random field network for semantic
segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27 June–1 July 2016; pp. 3224–3233.
40. Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; Tang, X. Semantic image segmentation via deep parsing network.
In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13
December 2015; IEEE: New York, NY, USA, 2015; pp. 1377–1385.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
