
Recognition of attentive objects with a concept association network for image annotation

Hong Fu a,*, Zheru Chi a, Dagan Feng a,b

a Center for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
b School of Information Technologies, The University of Sydney, NSW 2006, Australia
* Corresponding author. E-mail addresses: enhongfu@eie.polyu.edu.hk, enhongfu@inet.polyu.edu.hk (H. Fu).

Pattern Recognition (2010), doi:10.1016/j.patcog.2010.04.009

Article history: Received 12 May 2009; received in revised form 13 April 2010; accepted 14 April 2010

Keywords: Image annotation; Concept association network (CAN); Attentive objects; Visual classifier; Neural network

Abstract

With the advancement of imaging techniques and IT technologies, image retrieval has become a bottleneck. A key to efficient and effective image retrieval is a text-based approach in which automatic image annotation is a critical task. One important but insufficiently studied issue is the metadata of the annotation, i.e., the basic unit of an image to be labeled. The habitual way is to label the segments produced by a segmentation algorithm. However, a segmentation process often breaks an object into pieces, which not only produces noise for annotation but also increases the complexity of the model. We adopt an attention-driven image interpretation method to extract attentive objects from an over-segmented image and use the attentive objects for annotation. In doing so, the basic unit of annotation is upgraded from segments to attentive objects. Visual classifiers are trained and a concept association network (CAN) is constructed for object recognition. A CAN consists of a number of concept nodes, each of which is a trained neural network (visual classifier) that recognizes a single object. The nodes are connected through correlation links, forming a network. Given an image containing several unknown attentive objects, all the nodes in the CAN generate their own responses, which propagate to other nodes through the network simultaneously. For a combination of nodes under investigation, these loopy propagations can be characterized by a linear system, and the response of the combination of nodes can be obtained by solving that system. The annotation problem is therefore converted into finding the node combination with the maximum response. Annotation experiments show better accuracy with attentive objects than with segments, and show that the concept association network improves annotation performance.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Because of its importance in image retrieval, automatic image annotation has interested researchers for years [1–4]. In this paper, we consider image annotation as labeling specific regions of an image. There are two key issues that need to be addressed in developing an automatic annotation algorithm:

• To partition an image into semantically meaningful objects/regions; and
• To recognize the objects/regions.

The first issue concerns image segmentation. Segmenting an image at a proper level is still an open problem, although a large number of methods have been proposed since the 1980s. Most region-based image labeling solutions rely on the result of an image segmentation algorithm [3,5], by which an image is often over-segmented. In this case, an object is broken into pieces, which not only produces noise for annotation but also increases the complexity of the model. In our previous research, starting from an over-segmented image, we developed an attention-driven image interpretation approach which is able to reconstruct attentive objects from segments [7,8]. The formation of attentive objects benefits image retrieval. As shown in Fig. 1, a number of attentive objects are reconstructed from segments by our attention-driven approach. Although the object reconstruction may introduce some errors and noise, an attentive object obtained in this way is much closer to a real object than a segment. This brings at least two advantages for image annotation: (1) an object can be better characterized than a segment; and (2) the number of units to be annotated is reduced at the object level, so the complexity of the model is decreased.

The second issue is object classification. Besides the visual features of the target itself, other information extracted from the image, as well as information from other sources, is also useful for recognizing an object. The information that benefits object recognition includes the textual tokens attached to the image [6], the relevance feedback from the user in an image retrieval process [9], and the semantic distance between the words captioning the image [10]. Psychologists have also found that the associations among objects in the environment are utilized when human beings recognize and cognize an object [11]. This phenomenon has already been studied in the context of image annotation, with two notable approaches being statistical modeling and graph theory. Representative investigations in statistical modeling are the conditional random fields (CRFs) used by He et al. [12] and Carbonetto et al. [5], in which annotation comes down to an inference problem solved by optimizing a pre-defined criterion. On the other side, graph theory is good at describing relationships, so annotation can be converted into a graph problem after characterizing images and their associated labels by a graph [13,14]. Setting the complex mathematics aside, in this paper we consider the image annotation problem intuitively. Given a number of images, each of which contains labeled basic units (regions/objects), a group of classifiers can be trained to recognize the basic units, and the correlations between the basic units can be recorded. When a new image with several unknown units is to be labeled, all the classifiers respond to these units, and these responses are propagated to other nodes through a concept association network (CAN). This is a loopy structure that was regarded as somewhat unsolvable in other models [12]. Fortunately, we have found that, for a combination of classifiers, these loopy propagations can be characterized by a linear system, and the response of the combined classifiers can be obtained by solving that system. Therefore, the annotation problem is converted into finding the combination of concept nodes with the maximum response.

The rest of this paper is organized as follows. Section 2 introduces the extraction of attentive objects from the over-segmented image. Section 3 describes the construction of a concept association network and the training of visual classifiers for image annotation. Section 4 reports experimental results with discussions. Concluding remarks are drawn in Section 5.

[Fig. 1. Examples of attention-driven image interpretation. Columns: original image, segmentation by JSEG, 1st attentive object, 2nd attentive object, the background.]

2. Extraction of attentive objects from an over-segmented image

Visual attention, a selective process of human early vision, plays a very important role in how human beings understand a scene by intuitively focusing on some objects [17–19]. Aware of this, in our previous study we proposed an attention-driven image interpretation method to extract attentive objects from segments [7,8]. In this method, an image is first segmented by the JSEG algorithm [16], a popular image segmentation method, and then several perceptually attended objects are popped out iteratively, with the background remaining. The procedure of attention-driven image interpretation is described briefly below (a more detailed description can be found in Ref. [7]); a sketch of the loop is given at the end of this section.

Suppose that an image I is segmented into N segments. Two matrices are formed to describe the relationships among the segments: a boundary length matrix L and a feature matrix A. The boundary length matrix characterizes the spatial relationships and the feature matrix characterizes the feature differences among segments. For example, the element l_{i,j} of the boundary length matrix L is defined as the length of the boundary between segment i and segment j; the element a^k_{i,j} of the feature matrix A^k is the difference between segments i and j in terms of the kth feature. After the matrices are obtained, an objective function that characterizes the attention value of a combination of segments is defined. An object popping-out process is then adopted to pop out attentive objects in an iterative way:

(1) Identify the combination of segments whose attention value is the maximum among all combinations and pop out the object formed by this combination of segments.
(2) Eliminate the segments belonging to the popped-out object.
(3) Go to step (1) if a stopping condition is not satisfied.
(4) The remaining segments are, as a whole, interpreted as the background.

An efficient popping-out algorithm was proposed in Ref. [8], in which a heuristic search is used to find the most attentive combination of segments instead of a full search. Fig. 1 shows some examples of attention-driven image interpretation. Compared with a segment, an "attentive object" is much more similar to a semantic object. Therefore, using attentive objects instead of segments potentially benefits image annotation.
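
The following is a minimal Python sketch of the popping-out loop above. The attention objective and the stopping condition are defined in Refs. [7,8] and are treated here as assumed black boxes (`attention_value` and `min_gain`), and the exhaustive enumeration of combinations is shown only for clarity; Ref. [8] replaces it with a heuristic search.

```python
from itertools import chain, combinations

def all_combinations(segments):
    """Every non-empty combination of the remaining segments (exponential;
    Ref. [8] uses a heuristic search instead of this full enumeration)."""
    segs = list(segments)
    return chain.from_iterable(combinations(segs, r) for r in range(1, len(segs) + 1))

def interpret_image(segments, attention_value, max_objects=4, min_gain=0.0):
    """Greedy popping-out loop, steps (1)-(4) of the procedure above.

    segments        -- segment identifiers produced by JSEG
    attention_value -- callable scoring a combination of segments (assumed, from Refs. [7,8])
    """
    remaining = set(segments)
    attentive_objects = []
    while remaining and len(attentive_objects) < max_objects:
        # (1) pop out the combination with the maximum attention value
        best = max(all_combinations(remaining), key=attention_value)
        if attention_value(best) <= min_gain:   # (3) assumed form of the stopping condition
            break
        attentive_objects.append(best)
        remaining -= set(best)                  # (2) eliminate the popped-out segments
    return attentive_objects, remaining         # (4) the rest is the background
```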

3. Construction of a concept association network (CAN) for image annotation

In our study, a CAN over attentive objects is constructed and trained for image annotation. Fig. 2 illustrates the architecture of the proposed CAN. It has a three-layer structure: (1) feature stimulus, (2) visual classifiers, and (3) concept nodes. The annotation process can be divided into three stages. An image is first interpreted as several attentive objects. Then the visual features of these objects are extracted as the stimulus to the visual classifiers, which produce the outputs for the corresponding concept nodes. Finally, in the concept node layer, the image is annotated by considering both the information from the lower layers and the outputs from other concept nodes.

[Fig. 2. Architecture of the proposed concept association network (CAN) with visual classifiers. Concept nodes C_i and C_j are connected by weights w_{ij} and w_{ji}; each node receives the output of its visual classifier V_i or V_j, which is driven by the feature stimulus.]

3.1. Visual classifiers

We first manually annotated the attentive objects extracted from the training images. Some annotated samples are shown in Fig. 3. The annotated objects are used to train visual classifiers, with one visual classifier associated with one concept. For example, all the objects annotated as "zebra" are collected as positive samples, and all other objects are treated as negative samples, for training the visual classifier corresponding to the concept "zebra", as shown in Fig. 4. A collection of objects with the same label constitutes a concept.

[Fig. 3. Examples of annotated images.]

To avoid ambiguity, only objects with explicit semantic meanings are annotated, such as "tiger", "human face", "red flower", etc. Semantically ambiguous objects are assigned to the "unknown" class, for which no visual classifier is trained. In other words, we do not train a concept called "unknown"; rather, the objects labeled as "unknown" are used as negative samples when training the visual classifiers.

A multi-layered perceptron [20] is trained as each visual classifier. The input of the perceptron is a 1 × 93 vector, comprising a 1 × 64 HSV color histogram, a 1 × 20 edge histogram, 7 moment invariants of the object, and the two coordinates of the object's center. The idea of using edge features comes from our earlier work on the three-component image model [21] and face recognition [22]. The perceptron has two hidden layers, one with 50 nodes and the other with 25 nodes. The output layer has one node, with a target output of "1" for a positive sample and "0" for a negative sample. When used to assign a concept to an input object, the perceptron outputs a real number between 0 and 1.
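
As an illustration only, a present-day stand-in for such a visual classifier can be sketched with scikit-learn's MLPClassifier. This is a sketch under the architecture stated above (93 inputs, hidden layers of 50 and 25 nodes, one output), not the authors' original Matlab implementation; the feature extraction is assumed to be available elsewhere and the data below are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_visual_classifier():
    """One binary classifier per concept: 93-dimensional input
    (64-bin HSV color histogram + 20-bin edge histogram
    + 7 moment invariants + 2 centre coordinates),
    two hidden layers of 50 and 25 nodes, one output."""
    return MLPClassifier(hidden_layer_sizes=(50, 25),
                         activation='logistic',
                         max_iter=2000)            # epoch limit as used in Section 4.4.2

# X: (n_samples, 93) object features; y: 1 for "zebra" objects, 0 for all others
X = np.random.rand(200, 93)                        # placeholder features
y = np.random.randint(0, 2, size=200)              # placeholder labels
clf = make_visual_classifier().fit(X, y)
response = clf.predict_proba(X[:1])[0, 1]          # a real number between 0 and 1
```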

3.2. Concept association

Besides the visual features of an individual object being related to its concept, some concepts occur in conjunction with each other. For example, the concept "flower" is usually accompanied by the concept "green plant", and "clothing" often appears together with "human face". Therefore, the correlation among concepts also plays an important role in forming the semantics of an attentive object, which is very useful for annotation. Suppose that we have N concept nodes in the CAN. The numbers of occurrences of the concepts are denoted as

F = (f_1, ..., f_i, ..., f_N)    (1)

where f_i denotes the number of occurrences of concept C_i. The co-occurrences between pairs of concepts are encoded by

W = {w_{i,j}},  i, j = 1, ..., N    (2)

where w_{i,j} is the number of occurrences of concept i when concept j occurs. The correlation matrix among concepts is defined as

M = {m_{i,j}} = {w_{i,j} / f_j},  i, j = 1, ..., N    (3)

where m_{i,j} denotes the occurrence frequency of concept i when concept j occurs.
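
The counts in Eqs. (1)–(3) can be accumulated directly from the object labels of the training images. The sketch below assumes each training image is represented simply as the set of concept indices it contains; the function and variable names are illustrative.

```python
import numpy as np
from itertools import permutations

def build_correlation_matrix(image_concepts, num_concepts):
    """Accumulate f_i, w_{i,j} and m_{i,j} = w_{i,j} / f_j of Eqs. (1)-(3).

    image_concepts -- iterable of per-image sets of concept indices
    """
    f = np.zeros(num_concepts)                         # concept occurrences, Eq. (1)
    w = np.zeros((num_concepts, num_concepts))         # co-occurrence counts, Eq. (2)
    for concepts in image_concepts:
        for i in concepts:
            f[i] += 1
        for i, j in permutations(concepts, 2):         # ordered pairs with i != j
            w[i, j] += 1                               # concept i occurred while j occurred
    m = np.divide(w, f[np.newaxis, :],
                  out=np.zeros_like(w), where=f > 0)   # Eq. (3), guarding unused concepts
    return f, w, m

# Example: three training images labeled with concept indices
f, w, m = build_correlation_matrix([{0, 2}, {0, 1, 2}, {1}], num_concepts=3)
```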

3.3. Image annotation using the CAN

The trained CAN is used to annotate images. An image I to be annotated is first decomposed into objects, i.e.,

I = (O_1, O_2, ..., O_t, ..., O_T)    (4)

where O_t is the t-th object in the image. Each object is associated with a visual feature vector.

A concept permutation for the input objects is denoted by

C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)}    (5)

where the superscript t is the index of object t and the subscript l(t) is the index of the concept in the CAN. For example, C^1_6, C^2_1, C^3_{10} means that objects O_1, O_2, and O_3 are linked to concepts C_6, C_1, and C_{10}, respectively. The response of concept node C_{l(t)} to the input object is defined as

C^t_{l(t)}(O_t) = V_{l(t)}(O_t) + \sum_{j \ne t} m_{l(t),l(j)} C^j_{l(j)}(O_j)    (6)


where C^t_{l(t)}(O_t) is the response of concept node C_{l(t)}, m_{l(t),l(j)} is an element of the correlation matrix, and V_{l(t)}(O_t) is the output of classifier V_{l(t)} for the input O_t. The response of a concept node is determined not only by the output of its own visual classifier (V_{l(t)}(O_t)) but also by the responses from other concept nodes via the network connections (\sum_{j \ne t} m_{l(t),l(j)} C^j_{l(j)}(O_j)). Fig. 5 shows the computation of the response of concept node C_1, which is composed of two sources: the response of its own visual classifier (V_1(O_1)) and the responses from nodes C_3 and C_6 via the network connections (m_{13} C^2_3(O_2) + m_{16} C^3_6(O_3)).

[Fig. 4. Positive and negative training samples for the concept "zebra".]

[Fig. 5. C^1_1(O_1) = V_1(O_1) + m_{13} C^2_3(O_2) + m_{16} C^3_6(O_3): the response of concept node C_1 consists of the response of its own visual classifier V_1(O_1) and the responses from other concept nodes via the network connections (m_{13} C^2_3(O_2) + m_{16} C^3_6(O_3)).]

The idea of using a CAN comes from the "spreading-activation theory" of cognitive psychology proposed by Collins and Loftus [15]. The concepts formed in the human mind are inter-connected to form a network; if a concept is activated, it spreads its influence to its relevant concepts. For example, when we see the word "ship", we may recall other words related to "ship", such as "sea", "shipman", etc. Similarly, if the response of a concept node named "ship" is large enough in the CAN, the node "ship" will send a positive signal to its related nodes "sea" and "shipman". In general, every node can influence the others in the CAN (the influence can be zero), so we have the following simultaneous equations for T objects:

C^1_{l(1)}(O_1) = V_{l(1)}(O_1) + \sum_{j \ne 1} m_{l(1),l(j)} C^j_{l(j)}(O_j)
C^2_{l(2)}(O_2) = V_{l(2)}(O_2) + \sum_{j \ne 2} m_{l(2),l(j)} C^j_{l(j)}(O_j)
...
C^t_{l(t)}(O_t) = V_{l(t)}(O_t) + \sum_{j \ne t} m_{l(t),l(j)} C^j_{l(j)}(O_j)
...
C^T_{l(T)}(O_T) = V_{l(T)}(O_T) + \sum_{j \ne T} m_{l(T),l(j)} C^j_{l(j)}(O_j)    (7)

Eq. (7) can be rewritten in matrix and vector form as

\begin{bmatrix}
1 & -m_{l(1),l(2)} & -m_{l(1),l(3)} & \cdots & -m_{l(1),l(T)} \\
-m_{l(2),l(1)} & 1 & -m_{l(2),l(3)} & \cdots & -m_{l(2),l(T)} \\
-m_{l(3),l(1)} & -m_{l(3),l(2)} & 1 & \cdots & -m_{l(3),l(T)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-m_{l(T),l(1)} & -m_{l(T),l(2)} & \cdots & -m_{l(T),l(T-1)} & 1
\end{bmatrix}
\begin{bmatrix}
C^1_{l(1)}(O_1) \\ C^2_{l(2)}(O_2) \\ \vdots \\ C^T_{l(T)}(O_T)
\end{bmatrix}
=
\begin{bmatrix}
V_{l(1)}(O_1) \\ V_{l(2)}(O_2) \\ \vdots \\ V_{l(T)}(O_T)
\end{bmatrix}    (8)

Therefore, the network response R(·) to the input objects with respect to the concept permutation C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)} is given by

R(C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)}) = \sum_t C^t_{l(t)}(O_t)    (9)

where

\begin{bmatrix}
C^1_{l(1)}(O_1) \\ C^2_{l(2)}(O_2) \\ \vdots \\ C^T_{l(T)}(O_T)
\end{bmatrix}
=
\begin{bmatrix}
1 & -m_{l(1),l(2)} & -m_{l(1),l(3)} & \cdots & -m_{l(1),l(T)} \\
-m_{l(2),l(1)} & 1 & -m_{l(2),l(3)} & \cdots & -m_{l(2),l(T)} \\
-m_{l(3),l(1)} & -m_{l(3),l(2)} & 1 & \cdots & -m_{l(3),l(T)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-m_{l(T),l(1)} & -m_{l(T),l(2)} & \cdots & -m_{l(T),l(T-1)} & 1
\end{bmatrix}^{-1}
\begin{bmatrix}
V_{l(1)}(O_1) \\ V_{l(2)}(O_2) \\ \vdots \\ V_{l(T)}(O_T)
\end{bmatrix}

The annotation problem is to find the set of concepts that best matches the input objects, i.e., the permutation of concepts that has the maximum response to the input stimulus. If all permutations of the N concepts in the CAN mapped onto the T objects are considered, there are N^T combinations. For example, if there are three concept nodes in the CAN, say "apple", "zebra", and "sky", and two objects in the image to be annotated, we have 3^2 = 9 permutations of concept annotations, as listed in Table 1.

Table 1
Nine permutations of concept annotations for a two-object case.

Object | Annotations
O_1    | apple  apple  apple  zebra  zebra  zebra  sky    sky    sky
O_2    | apple  zebra  sky    apple  zebra  sky    apple  zebra  sky

As mentioned above, the permutation of concepts that has the maximum response to the input stimulus is assigned to the objects, i.e.,

(C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)}) = \arg\max_{(C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)})} R(C^1_{l(1)}, C^2_{l(2)}, ..., C^t_{l(t)}, ..., C^T_{l(T)})    (10)
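
Putting Eqs. (6)–(10) together: for one concept permutation, the node responses are obtained by solving the linear system of Eq. (8), and the permutation with the largest summed response of Eq. (9) gives the annotation of Eq. (10). The sketch below illustrates this. It assumes the classifier outputs V_i(O_t) have been precomputed, and the exhaustive enumeration is shown only for clarity; in practice the search is restricted to T ≤ 4 objects and the top 10 concepts per object, as described in Section 4.2.

```python
import numpy as np
from itertools import product

def node_responses(V, M, labels):
    """Solve Eq. (8) for one permutation: the chosen concepts' correlations
    m_{l(t),l(j)} form the off-diagonal entries of the system matrix."""
    T = len(labels)
    A = np.eye(T)
    for t in range(T):
        for j in range(T):
            if j != t:
                A[t, j] = -M[labels[t], labels[j]]
    return np.linalg.solve(A, V)                       # the responses C^t_{l(t)}(O_t)

def annotate(classifier_outputs, M, candidates=None):
    """Eq. (10): the concept permutation with the maximum network response.

    classifier_outputs -- (T, N) array with entry [t, i] = V_i(O_t)
    candidates         -- optional per-object candidate concepts (e.g. the top 10)
    """
    T, N = classifier_outputs.shape
    if candidates is None:
        candidates = [range(N)] * T                    # all N**T permutations
    best_R, best_labels = -np.inf, None
    for labels in product(*candidates):
        V = np.array([classifier_outputs[t, l] for t, l in enumerate(labels)])
        R = node_responses(V, M, labels).sum()         # Eq. (9)
        if R > best_R:
            best_R, best_labels = R, labels
    return best_labels, best_R
```

With two objects and the three concepts "apple", "zebra" and "sky", for instance, `annotate` would evaluate exactly the 3^2 = 9 permutations listed in Table 1.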


4. Experimental results and discussion

4.1. Description of the image database

In total, 5314 images collected from the Internet were interpreted into attentive objects by our efficient popping-out algorithm, from which 20,159 attentive objects were produced. All the objects were manually annotated with 99 concepts (see Fig. 7), including an "unknown" concept. For each concept, one half of the samples was used for training and the other half for testing. Note that the numbers of samples for different concepts are usually different: some concepts have a large number of samples, such as "trees-green", "sky", etc., while others have a very limited number of samples, such as "wild cat", "vase", etc. In total, there are 10,105 training samples and 10,054 test samples (including the "unknown" class). We also prepared training and test samples at the level of segments by decomposing the attentive objects back into segments, with each segment inheriting the concept label of the object it belongs to. This gives 23,715 training samples and 23,667 test samples (including the "unknown" class) for the segment-based implementation.

4.2. Methods under investigation

In order to evaluate the performance of the CAN and of the training schemes, four annotation strategies, at the level of both "attentive objects" and "segments", are considered.

(1) Annotation using visual classifiers at the level of "attentive objects", denoted "Object-VC". This is a conventional classification process. For a test object, the outputs of all visual classifiers are computed and sorted in descending order, and the concepts with the top Nt responses (Nt = 10 in our implementation) are assigned to the sample. For example, for an object whose target concept is "butterfly" (see Fig. 6), the annotation list consists of the top 10 concepts "Bird, Butterfly, Cat, ..., Firework, Fish".

(2) Annotation using the CAN at the level of "attentive objects", denoted "Object-CAN". This is the annotation method proposed in Section 3.3. Compared with "Object-VC", "Object-CAN" combines the responses from the visual classifiers with the responses from other concepts. For a test sample, that is, one of the T objects in an image, there are 98^T combinations (98 concepts). The computational complexity is too high if T is large, so we set T ≤ 4 (this can be achieved by limiting the maximum number of attentive objects in the attention-driven image interpretation) and consider only the top 10 responses in the experiments. The maximum computational complexity is therefore 10^4, about 1.2 s of processing time per image with Matlab 6.5 on a PC with an Intel Core 2 CPU at 1.86 GHz and 3 GB of RAM. For the example shown in Fig. 6, we have T = 3, and all 10^3 combinations are computed and sorted in descending order of the summed response defined in Eq. (10). The annotation list for the object to be annotated is "Butterfly, Bird, Butterfly, ..., Firework, Fish". Duplicated concepts in the list are deleted, keeping the occurrence with the highest rank (a short sketch of this step follows Fig. 6), giving "Butterfly, Bird, ..., Firework, Fish". We keep the top 10 concepts for evaluation.

(3) Annotation using visual classifiers at the level of "segments", denoted "Segment-VC".

(4) Annotation using the CAN at the level of "segments", denoted "Segment-CAN". Since the number of segments in one image usually exceeds 4, we randomly selected three other segments in the image, together with the segment under investigation, to do the annotation.

[Fig. 6. An example of annotation with Object-VC and Object-CAN. Object-VC lists the top 10 concepts for the object to be annotated; Object-CAN sorts the 10^3 permutations in descending order of summed response and, after deleting duplicated concepts, keeps the top 10 concepts for each attentive object.]
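
The de-duplication mentioned in item (2) is straightforward; the following sketch keeps the first (highest-ranked) occurrence of each concept in a sorted annotation list and truncates it to the top 10.

```python
def top_concepts(ranked_concepts, keep=10):
    """Delete duplicated concepts, keeping the occurrence with the highest rank,
    then keep the first `keep` concepts for evaluation."""
    seen, deduped = set(), []
    for concept in ranked_concepts:
        if concept not in seen:
            seen.add(concept)
            deduped.append(concept)
    return deduped[:keep]

# e.g. the Object-CAN list of Fig. 6:
# top_concepts(["Butterfly", "Bird", "Butterfly", "Firework", "Fish"])
# -> ["Butterfly", "Bird", "Firework", "Fish"]
```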


A similar annotation list is generated for "Segment-VC" and "Segment-CAN". The only difference between "Segment-VC/CAN" and "Object-VC/CAN" is that the former take a segment as the basic entity while the latter take an attentive object as the basic entity. A similar training scheme is conducted for "Segment-VC/CAN", which takes more training time because the number of samples increases after an attentive object is broken into segments.

4.3. Evaluation criterion

The rank of a target concept is the position at which the correct class label appears in the classification list produced using all visual classifiers. For example, for an object whose correct label is "flower", the responses of all visual classifiers are computed and sorted in the classification process; a classification list is then obtained, in which the class labels with larger responses appear nearer the front. If the classification list for an object (whose target concept is "flower") is "sun, face, flower, zebra, house, ...", the rank of the target concept is 3. For the objects to be annotated, the annotation precision (AP) over the top 10 concepts is defined as

AP = (1/N) \sum_{i=1}^{10} n(i) (11 - i)    (11)

where n(i) is the number of objects whose target concept is ranked at position i, and N is the total number of objects/segments to be annotated. If all the objects are correctly labeled, i.e., n(1) = N and n(i) = 0 for i > 1, we have AP = (1/N) · N · (11 − 1) = 10. If no target concept is ranked within the top 10, AP = 0. A larger AP value means a more favorable annotation result.
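
A small sketch of the AP measure of Eq. (11): given the rank of the target concept for each annotated object (with any rank beyond 10 counting as a miss), it returns a value between 0 and 10.

```python
def annotation_precision(target_ranks, top=10):
    """Eq. (11): average of (11 - rank) over all objects, where objects whose
    target concept falls outside the top 10 contribute zero."""
    N = len(target_ranks)
    return sum(top + 1 - r for r in target_ranks if r <= top) / N

print(annotation_precision([1, 1, 1]))   # all targets ranked first -> 10.0
print(annotation_precision([3, 12, 1]))  # (8 + 0 + 10) / 3 = 6.0
```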

4.4. Experimental results

4.4.1. Correlation matrix

The correlation matrix of the 99 concepts was computed from the 10,105 training samples collected. Fig. 7 visualizes the associations among these concepts: each node represents one concept, and a link is added between a pair of concepts whenever their association is non-zero. Some concepts are very "popular" and have connections to many other concepts, such as "clothes", "sky", "human face", etc.; if those popular concepts are activated, they spread positive responses to many other concepts. On the other hand, some concepts connect to particular partners only. For example, "penguin" links only to "glacier", which means that only "glacier" receives a positive response when "penguin" is activated. It is also worth noting that there is no connection between some pairs of concepts, such as "sun" and "firework", or "sea lion" and "zebra"; that is, those concepts rarely appear together in an image. As a result, there is no activation between these concepts even if they are activated simultaneously.

[Fig. 7. Visualization of the associations among concepts. Each node represents one concept; a connection between a pair of concepts means that the two concepts are related. A zoom-in of the concept node "cattle" shows an example image of "cattle" together with its associated concepts and correlation values (clothes 0.000099, grass-brown 0.000297, grass-green 0.000792, human-face 0.000099, sky 0.000297). The details of the CAN can be browsed at http://158.132.148.155/chigroup/fuhong/Relations_of_files/Relations_of_frames.htm.]


4.4.2. Training of visual classifiers

A visual classifier is trained for each concept, with the 1 × 93 feature vector described in Section 3.1 as the input. Suppose that a concept C has P_C positive samples and N_C negative samples. Naturally, all these samples could be used for training; however, N_C is usually much larger than P_C, especially when the number of concepts is large. One solution is to retain all positive samples and use only part of the negative samples. How many negative samples should be used? Will the classification performance of the visual classifiers improve if more negative samples are used? To find the answers, five training sets were used to train the visual classifiers, in all of which all positive samples are used while the number of negative samples varies; they are summarized in Table 2. In Set 1, P_C samples randomly selected from the N_C negative samples are used to train the neural network visual classifiers. The number of negative samples is doubled in Set 2 (2 P_C), and Sets 3 and 4 contain 3 P_C and 4 P_C negative samples, respectively. All negative samples are included in Set 5. The maximum number of epochs was set to 2000 and the target sum of squared errors (SSE) to 0.01 for all training runs. The training times are also shown in Table 2. The 7654 test samples (excluding those from the "unknown" class) were used to test the classification performance.

Table 2
Training sets with different numbers of negative samples.

Training set | Positive samples | Negative samples | Training time (h), object level | Training time (h), segment level
Set 1        | P_C              | P_C × 1          | 1   | 3
Set 2        | P_C              | P_C × 2          | 5   | 10
Set 3        | P_C              | P_C × 3          | 10  | 22
Set 4        | P_C              | P_C × 4          | 30  | 56
Set 5        | P_C              | N_C              | 80  | 150

4.4.3. Experiments and discussions

We evaluated ten trials of test objects/segments, each with 100 samples randomly selected from the test set. The box-and-whisker plots for the average annotation precision (AP) in each group over the ten trials are shown in Fig. 8.

[Fig. 8. The annotation precisions (APs) of different approaches at the object level and the segment level. The visual classifiers are trained with different training sets. (a)–(e) Set 1–Set 5.]

[Fig. 9. Average AP (annotation precision) of ten trials for Object-VC, Object-CAN, Segment-VC and Segment-CAN with training Sets 1–5.]


Table 3
Improvement by the CAN at both the attentive object level and the segment level (average AP over ten trials).

Training set | Object-VC | Object-CAN | Improvement by CAN (%) | Segment-VC | Segment-CAN | Improvement by CAN (%)
Set 1        | 2.35      | 3.14       | 33.6                   | 1.61       | 2.01        | 24.8
Set 2        | 2.52      | 2.73       | 8.3                    | 1.26       | 1.54        | 22.2
Set 3        | 2.58      | 2.92       | 13.2                   | 1.30       | 1.40        | 7.7
Set 4        | 3.41      | 3.55       | 4.1                    | 1.53       | 1.55        | 1.3
Set 5        | 7.03      | 7.03       | 0                      | 5.87       | 5.87        | 0

Table 4
Computation time saved in training the visual classifiers versus the additional computational cost of the CAN for 8 new classes.

                                        | Object-VC with Set 3              | Object-CAN with Set 1             | Time saved by CAN
Annotation precision                    | 2.58                              | 3.14                              | –
Training time per class (h)             | 0.1                               | 0.01                              | 0.09
Training time for 8 new classes (h)     | 0.8                               | 0.08                              | 0.72
Annotation time per image (s)           | 2.3                               | 3.5                               | −1.2
Annotation time per object (s)^a        | 0.6                               | 0.9                               | −0.3
Annotation time for 8 new classes (h)^b | 0.6 × 103 × 8/(60 × 60) ≈ 0.14    | 0.9 × 103 × 8/(60 × 60) ≈ 0.21    | −0.07
Total (h)                               | 0.94                              | 0.29                              | 0.65

a Assuming four objects in one image.
b 103 samples per class on average.

(1) The effect of upgrading from the segment level to the object level: Compared with the results at the segment level, the benefits of attentive objects are obvious, since an object carries more comprehensive information than several broken segments. The extra computation brought by the attention-driven image interpretation to produce the attentive objects is 1–2 s per image, thanks to the efficient popping-out algorithm [8]. The extra computational time spent producing attentive objects is therefore about 3 h for the entire image database, which is compensated by the time saved in training the visual classifiers: at the segment level, training the visual classifiers takes much longer because more training samples are involved.

(2) The effect of the training samples: The average APs over ten trials for visual classifiers trained with different numbers of negative samples are shown in Fig. 9. The AP improves as the number of negative samples increases, which indicates that the classification performance is better when more negative samples are used in the training process: with more negative samples the neural network is trained more thoroughly, so the classification performance improves.

(3) The effect of the CAN: In order to evaluate the function of the CAN more closely, the improvements by the CAN at both the attentive object level and the segment level are summarized in Table 3. The results show that the use of the CAN does lift the annotation precision, both at the object level and at the segment level. Another interesting observation is that, in general, the improvement by the CAN strengthens as the number of negative samples involved in training the visual classifiers (VCs) decreases. That is to say, if the VCs are well trained, they can do a decent job independently, without the CAN. On the other hand, when a VC does not perform well, the CAN is able to amend the annotation result, since the CAN makes use of the information from the neighboring nodes. Using all the negative samples to train strong VCs is not always practical for a real-world annotation application: when new samples or new concepts are added, quite a long time is needed to re-train all the VCs. Compared with re-training all the VCs using the full sample set, updating the correlation matrix of the CAN and re-training the few VCs involved using a smaller sample set is an easier task and takes less time. Therefore, the CAN is more feasible for real-world annotation applications, since it can be rapidly updated without losing much precision.

We take Object-VC with Set 3 and Object-CAN with Set 1 as an example comparison, as shown in Table 4. Suppose that we have 90 classes in the network and 8 new classes are added, so that 8 new VCs have to be trained. According to Table 2, the average training time for each VC with Set 3 is 10/98 ≈ 0.1 h, whereas that for Set 1 is 1/98 ≈ 0.01 h. Thus, the training time for the 8 new classes is 0.8 h for Object-VC with Set 3 and 0.08 h for Object-CAN with Set 1, so 0.72 h is saved in the training stage. In the annotation stage, the CAN needs on average 1.2 s more than the VCs for one image, i.e., 0.3 s more per object, assuming 4 objects per image. The average number of samples for each class is 103, so the annotation time for the 8 new classes is 0.14 h for Object-VC with Set 3 and 0.21 h for Object-CAN with Set 1. In summary, Object-VC with Set 3 needs 0.94 h while Object-CAN with Set 1 needs 0.29 h, saving 0.65 h and achieving a more favorable AP, i.e., 3.14 compared with 2.58.

5. Conclusion

In this paper, we have presented an image annotation approach based on attentive objects with a concept association network (CAN). The approach includes three main processes: (1) constructing training samples and training visual classifiers for recognizing attentive objects; (2) constructing a CAN in which each node is attached to a visual classifier and each connection weight reflects the association between two concepts; and (3) annotating objects using the CAN. Experimental results show that the CAN improves annotation performance in general, and that annotation at the level of "attentive objects" results in more meaningful labels than annotation at the level of "segments".

Acknowledgements

The work reported in this paper is supported by a research grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project no. PolyU 5141/07E) and a PolyU research grant (Project no. 1-BB9V).

References

[1] K. Barnard, D. Forsyth, Learning the semantics of words and pictures, in: International Conference on Computer Vision, 2001, pp. II: 408–415.
[2] J. Fan, Y. Gao, H. Luo, G. Xu, Statistical modeling and conceptualization of natural images, Pattern Recognition 38 (2005) 865–885.
[3] H. Frigui, J. Caudill, Region based image annotation, in: ICIP 2006, 2006, pp. 953–956.
[4] K. Halina, P. Mariusz, Resulted word counts optimization—a new approach for better automatic image annotation, Pattern Recognition 41 (2008) 3562–3571.
[5] P. Carbonetto, N. De Freitas, K. Barnard, A statistical model for general contextual object recognition, Lecture Notes in Computer Science 3021 (2004) 350–362.
[6] Y. Wang, T. Mei, S. Gong, X. Hua, Combining global, regional and contextual features for automatic image annotation, Pattern Recognition 42 (2009) 259–266.
[7] H. Fu, Z. Chi, D. Feng, Attention-driven image interpretation with application to image retrieval, Pattern Recognition 39 (9) (2006) 1604–1621.
[8] H. Fu, Z. Chi, D. Feng, An efficient algorithm for attention-driven image interpretation from segments, Pattern Recognition 42 (1) (2009) 126–140.
[9] S. Zhang, B. Li, X. Xue, Semi-automatic dynamic auxiliary-tag-aided image annotation, Pattern Recognition, in press.
[10] Y. Jin, L. Khan, L. Wang, M. Awad, Image annotations by combining multiple evidence & WordNet, in: Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 706–715.
[11] M. Bar, Visual objects in context, Nature Reviews Neuroscience 5 (2004) 617–629.
[12] X. He, R.S. Zemel, M.Á. Carreira-Perpiñán, Multiscale conditional random fields for image labeling, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 695–702.
[13] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning, Pattern Recognition 42 (2) (2009) 218–228.
[14] J.Y. Pan, H.J. Yang, C. Faloutsos, P. Duygulu, Automatic multimedia cross-modal correlation discovery, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), 2004, pp. 653–658.
[15] A.M. Collins, E.F. Loftus, A spreading-activation theory of semantic processing, Psychological Review 82 (6) (1975) 407–428.
[16] Y. Deng, B. Manjunath, Unsupervised segmentation of color-texture regions in images and video, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (8) (2001) 800–810.
[17] L. Itti, C. Koch, Computational modeling of visual attention, Nature Reviews Neuroscience 2 (2001) 194–203.
[18] J. Theeuwes, Visual selective attention: a theoretical analysis, Acta Psychologica 83 (1993) 93–154.
[19] J. Wolfe, The level of attention: mediating between the stimulus and perception, in: L. Harris, M. Jenkin (Eds.), Levels of Perception, 2002, pp. 169–191.
[20] S.S. Haykin, Neural Networks: A Comprehensive Foundation, Maxwell Macmillan International, 1994.
[21] J. Li, G. Chen, Z. Chi, C. Lu, Image coding quality assessment using fuzzy integrals with a three-component image model, IEEE Transactions on Fuzzy Systems 12 (1) (2004) 99–106.
[22] J. Song, Z. Chi, J. Liu, A robust eye detection method using combined binary edge and intensity information, Pattern Recognition 39 (6) (2006) 1110–1125.

Dr. Hong Fu received her Bachelor and Master degrees from Xi’an Jiaotong University in 2000 and 2003, and PhD degree from Hong Kong Polytechnic University in 2007.
She is now a Postdoctoral Fellow in the Department of Electronic and Information Engineering at the Hong Kong Polytechnic University. Her research interests include
image processing, pattern recognition, and artificial intelligence.

Dr. Zheru Chi received his B.Eng. and M.Eng. degrees from Zhejiang University in 1982 and 1985 respectively, and his Ph.D. degree from the University of Sydney in March
1994, all in electrical engineering. Between 1985 and 1989, he was on the Faculty of the Department of Scientific Instruments at Zhejiang University. He worked as a Senior
Research Assistant/Research Fellow in the Laboratory for Imaging Science and Engineering at the University of Sydney from April 1993 to January 1995. Since February
1995, he has been with the Hong Kong Polytechnic University, where he is now an Associate Professor in the Department of Electronic and Information Engineering. Since
1997, he has served on the organization or program committees for a number of international conferences. His research interests include image processing, pattern
recognition, and computational intelligence. Dr. Chi has authored/co-authored one book and 11 book chapters, and published more than 170 technical papers.

Prof. (David) Dagan Feng received his M.E. in EECS from Shanghai Jiao Tong University in 1982, M.Sc. in Biocybernetics and Ph.D. in Computer Science from the University
of California, Los Angeles (UCLA) in 1985 and 1988, respectively, where he received the Crump Prize for Excellence in Medical Engineering. He has been Professor and Head
of Department of Computer Science/School of Information Technologies, and is currently Associate Dean of Faculty of Science at the University of Sydney; Chair Professor of
Information Technology, Hong Kong Polytechnic University; Advisory Professor, Chief Scientist and Chair of the International Advisory Committee, Med-X Research
Institute, Shanghai Jiao Tong University. He is the Founder and Director of the Biomedical & Multimedia Information Technology (BMIT) Research Group, and has published
over 500 scholarly research papers, pioneered several new research directions, and made a number of landmark contributions in his field. He has served as Chair of the IFAC Technical Committee on Biological and Medical Systems, and as Special Area Editor/Associate Editor for IEEE Transactions on Information Technology in Biomedicine and five other key journals in his area. He has been elected a Fellow of ACS, HKIE, IET, IEEE and the Australian Academy of Technological Sciences and Engineering.
