
2017 IEEE Third International Conference on Multimedia Big Data

Image Saliency Analysis based on Retina Simulation

Shu Fang1,2, Yang Yue1,2, Liuyuan He1,2, Kai Du3∗, Yonghong Tian1,2∗, Tiejun Huang1,2∗
{sfang, yueyang999, liyhe, yhtian, tjhuang}@pku.edu.cn, Kai.Du@ki.se
1 National Engineering Laboratory for Video Technology, School of EE&CS, Peking University, Beijing, China
2 Cooperative Medianet Innovation Center, China
3 Department of Neuroscience, Karolinska Institutet, Sweden

Abstract—In this paper, we provide insight into visual saliency modeling from the perspective of simulating the visual information manipulation process in the human vision system. For humans, visual stimuli are converted into spike trains by the retina, and the spike signals are then transferred to brain areas for further analysis. Therefore, we propose to estimate image saliency based on a retina neural network, which is built with realistic morphology and electrophysiology data. Then, we analyze the correlation between the spike trains generated by the retina and image saliency. The experimental results show the effectiveness of retinal spike trains in image saliency analysis.

Index Terms—Visual Saliency, Retina Simulation, Neural Spike Trains

I. INTRODUCTION

Visual attention is a filter that chooses the most important information from the large amount of visual stimuli in a scene, so that only a small subset receives further analysis. By detecting the most important visual subsets, called salient targets, in images or videos, the performance of computer applications [1], [25], [26], [29], [30] can be improved.

To detect salient subsets (pixels, macroblocks or regions), a common solution is to represent each subset with various perception features and to measure its saliency as the rarity of those features. The most frequently used features are color opponencies, orientations, luminance, semantic detection results, etc. For example, Itti et al. [17] propose to estimate saliency by fusing multi-scale local center-surround contrasts from multiple perception features. In [12], saliency is calculated as global rarity, which is derived from a random walk on a fully-connected graph whose nodes are image patches, connected by edges weighted by the mutual similarities of multiple perception features. Bruce et al. [6] represent image patches by projecting the RGB data of patches onto learned independent components and compute saliency as self-information. Cerf et al. [7] incorporate face detection results with the bottom-up saliency map of [12] to achieve better saliency prediction performance. Beyond these spatial saliency models, some approaches [11], [14], [20], [21] estimate visual saliency in the transform domain. For example, [15] and [14] extract spectral residuals over the image intensity channel and adopt the sign values after the Discrete Cosine Transform as visual saliency, respectively. A problem is that the heuristic combination of results from various features can fail when multiple features give different saliency maps. Consequently, some works [3], [18], [22], [23], [32] extract as many features as possible and learn the optimal fusion strategy from training images. For example, Judd et al. [18] train a linear model with fusion weights over predefined low-level (e.g., the features used in [16], [28]), mid-level (e.g., the horizon line) and high-level (e.g., faces and persons) features, as well as a center prior. Bruce et al. [5] present a deep learning model for visual saliency prediction based on fully convolutional networks. Compared to the approaches that combine features heuristically, the data-driven approaches often achieve much better prediction performance. However, features contribute differently to saliency estimation in different scenes.

For human beings, visual stimuli are processed into spike trains by the retina and transferred to the LGN, V1 and higher-level brain areas [4]. Along the visual pathway, visual subsets compete with each other through the neural excitations, inhibitions and modulations of different brain regions. In this manner, the visual subsets that win the competition become salient [24]. We consider a novel solution for visual saliency estimation: simulating this process of visual information manipulation. Specifically, subsequent brain areas process visual information based on the output of the retina, which converts light into spike trains. Thus, producing the spike trains of the retina is critical for solving the visual saliency estimation problem.

Inspired by this idea, we propose to analyze image saliency based on the neural spike trains generated by a retina neural network. The retina neural network is built with realistic data, and it can reproduce the electrophysiological characteristics of retina cells. Then, the relationship between image saliency and the spike trains generated by the retina neural network is analyzed. The experimental results show the effectiveness of retinal spike trains in image saliency estimation.

The rest of this paper is organized as follows: the details of the retina simulation are introduced in Section II. In Section III, we describe how to analyze image saliency via the retina simulation. Experimental results are presented in Section IV, and the paper is concluded in the last section.

∗Kai Du, Yonghong Tian and Tiejun Huang are corresponding authors.

II. RETINA SIMULATION

The retina is a light-sensitive layer of tissue that transfers visual information into electrical signals through five types of neuron cells (i.e., photoreceptors, bipolar cells, ganglion cells, horizontal cells and amacrine cells) [27]. Over the whole retina, different neuron cells connect to each other in a very complex way [9]. However, the connections at the fovea (a small pit on the retina) are simple (see Fig. 2). Moreover, the fovea is critical for the accurate vision of primates and human beings [19]. Therefore, we construct a fovea neural network. In this section, we introduce how to model a single neuron and how to construct a fovea neural circuit and network with the built neurons.

A. Single Cell Modeling

As the fundamental processing unit of the brain, a neuron receives electrical signals from other neurons through tree-like dendrites, changes its membrane voltage, and transmits electrical signals to other neurons through its axon. In computational neuroscience, the HH (Hodgkin-Huxley) model [13] is the standard mathematical model that describes the non-linear electrical dynamics (e.g., the influence of various ion channels) and computes the change of the membrane voltage with a set of differential equations:

I = C_m \frac{dV_m}{dt} + g_K (V_m - V_K) + g_{Na} (V_m - V_{Na}) + g_l (V_m - V_l)    (1)

where I and C_m are the total membrane current and the membrane capacitance per unit area, respectively; g_K (g_Na) and V_K (V_Na) are the potassium (sodium) conductance per unit area and reversal potential, respectively; and g_l and V_l are the leak conductance per unit area and the leak reversal potential, respectively. Note that the parameters of the HH model are mainly the conductances of the diverse ion channels.
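To make Eq. (1) concrete, the following is a minimal numerical sketch of a single compartment governed by it. This is our illustration, not the paper's simulator: the conductances are held constant (so no spikes arise), whereas the detailed models fitted here make g_K and g_Na voltage-dependent through gating variables; the time step and the classical HH parameter values are assumptions.

```python
import numpy as np

def integrate_compartment(I_inj, dt=0.005, C_m=1.0,
                          g_K=36.0, V_K=-77.0,
                          g_Na=120.0, V_Na=50.0,
                          g_l=0.3, V_l=-54.4):
    """Forward-Euler integration of Eq. (1), solved for dV_m/dt.

    I_inj: injected current per unit area (uA/cm^2) at each step.
    dt is in ms and kept small for numerical stability. Conductances
    are constant in this sketch; the full HH model makes g_K and g_Na
    voltage-dependent via gating variables.
    """
    V = np.empty(len(I_inj))
    V[0] = -65.0  # assumed resting potential (mV)
    for t in range(1, len(I_inj)):
        I_ion = (g_K * (V[t-1] - V_K)
                 + g_Na * (V[t-1] - V_Na)
                 + g_l * (V[t-1] - V_l))
        V[t] = V[t-1] + dt * (I_inj[t-1] - I_ion) / C_m
    return V

# 250 ms of a constant current step, mirroring the stimulus length used later
V_m = integrate_compartment(np.full(50000, 10.0))
```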
Fig. 1. Modeling a single neuron. First, we label the scanned image (A) of a realistic retina neuron to get its morphology data (B) with multiple compartments. By inserting HH-like non-linear equations into each compartment, we can manually adjust the model parameters so that the model produces results (D) similar to the electrophysiology data (C) reported in the literature.

At the fovea, the neural circuit is composed of photoreceptors, bipolar cells and ganglion cells. To model each type of retinal neuron at this level of detail, we first collect realistic morphology (2D or 3D) and electrophysiology data taken directly from retinas, and then manually adjust the parameters of the HH model to reproduce each cell's electrical behavior, as shown in Fig. 1. The results in Fig. 1 (C-D) show that the spikes generated by the detailed model capture key attributes (e.g., spike numbers) of the responses of real neurons. In this way, we construct the neural models of photoreceptors, bipolar cells and ganglion cells, respectively. Note that, although photoreceptors and bipolar cells do not fire spikes, we can still model their ion currents with HH-like non-linear equations.

B. Neural Circuit Modeling

Before constructing the neural circuit of the fovea, we first model the synapses that connect neurons in the retina. Different from synapses that can only pass spike signals, the retinal synapses, called "ribbon synapses", can process non-spiking signals (i.e., graded electrical signals). Here, we use an existing detailed model of the ribbon synapse from ModelDB (Accession: 50997).

By connecting the different types of neuron models with ribbon synapse models, following the connectivity at the fovea (nearly 1:1:1), the ON/OFF neuron pathway can be built (displayed in the left part of Fig. 2). As shown in Fig. 2, the photoreceptor, the OFF bipolar cell and the OFF ganglion cell are inhibited during the occurrence of light, while the ON bipolar cell and the ON ganglion cell exhibit excitation, which captures the electrophysiological characteristics of retina cells very well. We can then construct a fovea neural network of size 50 × 64 by replicating the ON/OFF neuron pathway in the manner displayed in the right part of Fig. 2.
Fig. 2. The built neural circuit at the retina fovea and the resulting neural network. The left part shows the structure of the fovea neural circuit, along with the spike trains produced by each type of cell. Based on this neural circuit, we construct a neural network, as shown in the right part, by repeating the same circuit along two dimensions. In this manner, each neural circuit processes the pixel at the corresponding spatial location.
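As a structural illustration of this replication, here is a sketch of how the 50 × 64 network could be assembled. The OnOffCircuit class and its toy response are hypothetical stand-ins for the detailed HH compartmental models and the ModelDB ribbon synapse; they are not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class OnOffCircuit:
    """One fovea micro-circuit (hypothetical stand-in): a photoreceptor
    drives ON and OFF bipolar cells through ribbon synapses, each of
    which drives a ganglion cell (nearly 1:1:1 connectivity)."""
    row: int
    col: int

    def respond(self, intensity):
        # Toy qualitative behavior only: ON cells are excited by light,
        # OFF cells by its absence (intensity assumed in [0, 1]). The real
        # circuit integrates HH-style compartmental models and graded
        # ribbon-synapse transmission.
        on_rate = intensity
        off_rate = 1.0 - intensity
        return on_rate, off_rate

def build_fovea_network(rows=50, cols=64):
    """Replicate the same ON/OFF pathway along two dimensions so that
    the circuit at (r, c) processes the pixel at (r, c), as in Fig. 2."""
    return [[OnOffCircuit(r, c) for c in range(cols)] for r in range(rows)]

network = build_fovea_network()
```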

III. IMAGE SALIENCY ANALYSIS BASED ON SPIKE TRAINS

When an image is given to the built fovea neural network, it outputs a spike train for every pixel. Note that we only use the spike train data recorded during the occurrence of light, with a length of 250 ms (milliseconds). As a result, the neurons represent the raw image data as a temporal sequence of images composed of 0s and 1s. According to existing works [8], [10], the spike latency and the spike number of spike trains are informative. Intuitively, we count the spike number/latency of the ON/OFF ganglion cells' spike trains and represent each pixel with the value of its spike number/latency. Some examples are displayed in Fig. 3, indicating that spike number/latency quantizes only the intensity of the visual stimuli: the brighter the visual information is, the higher (lower) the spike number and the shorter (longer) the spike latency an ON (OFF) ganglion cell obtains.

Fig. 3. Representatives of spike number and spike latency.
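A minimal sketch of these two per-pixel features, assuming the ganglion-cell spike trains are stored as a boolean array of shape (rows, cols, T) with T = 10000 bins of 0.025 ms over the 250 ms light period (the array layout and the silent-pixel convention are our assumptions):

```python
import numpy as np

def spike_features(spikes, dt=0.025):
    """Per-pixel spike-number and first-spike-latency maps.

    spikes: boolean array (rows, cols, T) of ganglion-cell spike trains.
    Returns (count, latency_ms); pixels that never spike are assigned
    the maximum latency of 250 ms by convention.
    """
    count = spikes.sum(axis=-1)                       # spikes per pixel
    first = np.argmax(spikes, axis=-1).astype(float)  # bin of first spike
    first[count == 0] = spikes.shape[-1]              # silent pixels
    return count, first * dt                          # latency in ms
```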

From the perspective of temporal coding, we instead represent each pixel with 0 (if the ganglion cell does not spike) or 1 (if it does) in every 0.025 ms time slice. For convenience, we call this new image representation spikes of time slice (STS). Some representative examples of all the 10000 = 250/0.025 STSs are shown in Fig. 4. From Fig. 4, we can see that the spike patterns in different STSs highlight different parts of an image, for example, the rag in the last image of Fig. 4 (a) and the trees and bushes in Fig. 4 (b). Surprisingly, in some STSs the neurons corresponding to salient targets spike at the same time while those for the background keep silent. This indicates that the STS representation is meaningful for image saliency estimation, since it separates salient targets and distractors with different spatial spike patterns. Thus, we propose to detect salient targets based on STSs.

Fig. 4. Representative STSs of the input images. Rows in (a) and (b) show an input image followed by its STS representations, respectively. Note that these STSs are selected for illustration.
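Under the same assumed (rows, cols, T) spike array as above, extracting the STSs is just a reslicing of the data along time, as sketched here:

```python
import numpy as np

def extract_stss(spikes):
    """Reinterpret spike trains as STSs: stss[t] is the binary image of
    which ganglion cells spiked in the t-th 0.025 ms time slice, giving
    10000 = 250 / 0.025 images for a 250 ms recording."""
    return np.moveaxis(spikes.astype(np.uint8), -1, 0)  # (T, rows, cols)
```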
IV. EXPERIMENTS

In this section, we conduct an experiment to explore the potential of STSs for image saliency estimation on the popular benchmark Toronto [6], which includes 120 images of indoor and outdoor scenes.

To quantify the prediction power of STSs, we adopt the algorithm presented in [17] to extract the local center-surround contrast of every STS as the final saliency, denoted by "saliency-STSs". For an input image that has been resized to the size of 50 × 65, its digital R (or G, B) values can be transferred into light currents with the algorithm in [2]. After receiving the light current injections, the retina neural network outputs 6 types of STSs: Red-(ON/OFF), Green-(ON/OFF) and Blue-(ON/OFF). For the resulting "saliency-STSs", we evaluate the shuffled AUC score (sAUC) [31] and the Judd AUC score (AUC) [18] and keep the maximum score for each image. The distribution of the maximum scores over all 120 images (shown in Fig. 5) demonstrates that there exist STSs that can better separate targets from backgrounds in various scenes.
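The per-STS scoring loop can be sketched as follows. The difference-of-Gaussians contrast is a simplified stand-in for the multi-scale center-surround scheme of [17], and auc_fn stands for an externally provided sAUC [31] or Judd AUC [18] implementation; both are our assumptions rather than the exact pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sts_saliency(sts, sigma_center=1.0, sigma_surround=4.0):
    """Local center-surround contrast of one binary STS, approximated
    by a difference of Gaussians (a simplification of [17])."""
    sts = sts.astype(float)
    contrast = np.abs(gaussian_filter(sts, sigma_center)
                      - gaussian_filter(sts, sigma_surround))
    return contrast / (contrast.max() + 1e-12)  # normalize to [0, 1]

def best_sts_score(stss, fixation_map, auc_fn):
    """Score every saliency-STS against the human fixation map and keep
    the maximum, as done for each of the 120 Toronto images."""
    return max(auc_fn(sts_saliency(s), fixation_map) for s in stss)
```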

Fig. 5. The distribution of the metric scores of the STSs that perform best for each image in Toronto.

V. CONCLUSIONS

In this paper, we propose a new perspective for estimating image saliency, based on the spike trains generated by a detailed retina neural network. This neural network is built with realistic retina data, and it can produce neural spike trains close to those of the real retina. The experimental results show that the spike trains are not only suited to convenient signal transportation, but can also separate salient targets from backgrounds effectively.

ACKNOWLEDGMENT

This work is partially supported by the National Basic Research Program of China under grant 2015CB351806, the National Natural Science Foundation of China under contracts No. 61390515 and No. 61425025, and the Beijing Municipal Commission of Science and Technology under contract No. Z151100000915070.

REFERENCES

[1] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH, New York, NY, USA, 2007.
[2] S. Barnes and B. Hille. Ionic channels of the inner segment of tiger salamander cone photoreceptors. The Journal of General Physiology, 94(4):719–743, 1989.
[3] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 438–445, 2012.
[4] G. S. Brindley. Physiology of the Retina and the Visual Pathway. 1960.
[5] N. D. Bruce, C. Catton, and S. Janjic. A deeper look at saliency: Feature contrast, semantics, and beyond. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 516–524, 2016.
[6] N. D. Bruce and J. K. Tsotsos. Saliency based on information maximization. In Advances in Neural Information Processing Systems (NIPS), pages 155–162, Vancouver, BC, Canada, 2005.
[7] M. Cerf, J. Harel, W. Einhauser, and C. Koch. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 2009.
[8] S. M. Chase and E. D. Young. First-spike latency information in single neurons increases when referenced to population onset. Proceedings of the National Academy of Sciences, 104(12):5175–5180, 2007.
[9] D. Dacey, O. S. Packer, L. Diller, D. Brainard, B. Peterson, and B. Lee. Center surround receptive field structure of cone bipolar cells in primate retina. Vision Research, 40(14):1801–1811, 2000.
[10] T. Gollisch and M. Meister. Rapid neural coding in the retina with relative spike latencies. Science, 319(5866):1108–1111, 2008.
[11] C. Guo, Q. Ma, and L. Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[12] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems (NIPS), pages 545–552, 2007.
[13] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500, 1952.
[14] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):194–201, 2012.
[15] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10–12):1489–1506, 2000.
[17] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[18] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), pages 2106–2113, 2009.
[19] H. Kolb and D. Marshak. The midget pathways of the primate retina. Documenta Ophthalmologica, 106(1):67–81, 2003.
[20] J. Li, L.-Y. Duan, X. Chen, T. Huang, and Y. Tian. Finding the secret of image saliency in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12):2428–2440, 2015.
[21] J. Li, M. Levine, X. An, X. Xu, and H. He. Visual saliency based on scale-space analysis in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):996–1010, 2013.
[22] J. Li, Y. Tian, T. Huang, and W. Gao. Cost-sensitive rank learning from positive and unlabeled data for visual saliency estimation. IEEE Signal Processing Letters, 17(6):591–594, 2010.
[23] J. Li, Y. Tian, T. Huang, and W. Gao. Multi-task rank learning for visual saliency estimation. IEEE Transactions on Circuits and Systems for Video Technology, 21(5):623–636, 2011.
[24] Z. Li. A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6(1):9–16, 2002.
[25] Z. Li, S. Qin, and L. Itti. Visual attention guided bit allocation in video compression. Image and Vision Computing, 29(1):1–14, Jan. 2011.
[26] Z. Ma, L. Qing, J. Miao, and X. Chen. Advertisement evaluation using visual saliency based on foveated image. In IEEE International Conference on Multimedia and Expo (ICME), pages 914–917, 2009.
[27] I. C. Mann. The Development of the Human Eye. 1928.
[28] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[29] S. Wei, D. Xu, X. Li, and Y. Zhao. Joint optimization toward effective and efficient image search. IEEE Transactions on Cybernetics, 43(6):2216–2227, 2013.
[30] S. Wei, Y. Zhao, C. Zhu, C. Xu, and Z. Zhu. Frame fusion for video copy detection. IEEE Transactions on Circuits and Systems for Video Technology, 21(1):15–28, 2011.
[31] J. Zhang and S. Sclaroff. Saliency detection: A Boolean map approach. In IEEE International Conference on Computer Vision (ICCV), pages 153–160, 2013.
[32] Q. Zhao and C. Koch. Learning visual saliency by combining feature maps in a nonlinear manner using AdaBoost. Journal of Vision, 12(6):22, 1–15, 2012.
