
Paper: “Knowing When to Look” - A Fantastic Theory for Image Captioning.

​arxiv

Abstract:
“Attention-based neural encoder-decoder frameworks have been widely adopted for
image captioning. Most methods force visual attention to be active for every generated
word. However, the decoder likely requires little to no visual information from the image
to predict nonvisual words such as “the” and “of”. Other words that may seem visual can
often be predicted reliably just from the language model e.g., “sign” after “behind a red
stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive
attention model with a visual sentinel. At each time step, our model decides whether to
attend to the image (and if so, to which regions) or to the visual sentinel. The model
decides whether to attend to the image and where, in order to extract meaningful
information for sequential word generation. We tested our method on the COCO image
captioning 2015 challenge dataset and Flickr30K. Our approach sets the new
state-of-the-art by a significant margin. The source code can be downloaded from
https://github".com/jiasenlu/AdaptiveAttention”​

“Knowing When to Look” - Adaptive Attention via A Visual Sentinel for Image Captioning

In recent decades, image captioning powered by Deep Learning and attention mechanisms has been widely adopted. Many models have achieved significant results; however, a large number of these models do not take into account the fact that many words such as ‘a’, ‘the’, ‘of’, etc. do not directly attend to visual information or regions of the image. These kinds of words are generated by the language model, which is built into the decoder (an RNN or LSTM) of an image captioning framework. To address this issue, the paper by Jiasen Lu et al., “Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning” [1], proposes a method that clarifies which information the model relies on to formulate the description. In addition, the paper replaces the traditional attention used in image captioning with a novel mechanism called Adaptive Attention.
To begin with, let's review the theory of the attention mechanism [2] and the LSTM [3], the two backbones of the proposed solution.

Attention in Computer Vision

Anyone who wants to deploy an attention algorithm in Computer Vision will cite the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” [4]. It is too famous to skip. In that paper, attention is estimated by the three equations below:

$$z_t = f_{att}(h_{t-1}, V)$$

$$\alpha_t = \mathrm{softmax}(z_t)$$

$$c_t = \sum_{i=1}^{L} \alpha_{ti} v_{ti}$$

where $h_{t-1}$ is the previous hidden state of the LSTM decoder, $V = \{v_1, v_2, \dots, v_L\}$ is the set of $L$ region features of the image, and $c_t$ is the context vector, which is also the context fed into the LSTM cell [3].
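To make these three equations concrete, here is a minimal PyTorch sketch of this soft attention. It is my own illustrative reimplementation, not code from either paper; the additive form of $f_{att}$ and the layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over L region features, conditioned on the previous hidden state:
    z_t = f_att(h_{t-1}, V), alpha_t = softmax(z_t), c_t = sum_i alpha_ti * v_i."""
    def __init__(self, region_dim, hidden_dim, att_dim):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, att_dim)  # projects each region feature v_i
        self.proj_h = nn.Linear(hidden_dim, att_dim)  # projects the previous hidden state
        self.score = nn.Linear(att_dim, 1)            # one attention score per region

    def forward(self, V, h_prev):
        # V: (batch, L, region_dim), h_prev: (batch, hidden_dim)
        z = self.score(torch.tanh(self.proj_v(V) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = F.softmax(z.squeeze(-1), dim=1)       # (batch, L) attention weights
        c = (alpha.unsqueeze(-1) * V).sum(dim=1)      # (batch, region_dim) context vector
        return c, alpha
```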

LSTM Structure

LSTM is a structure designed to resolve the problem of long-term dependencies in recurrent neural networks. Instead of a single tanh layer, an LSTM cell contains several different blocks. For me, no one describes the theory of LSTMs in as much detail as Christopher Olah does in his blog [3]. I summarize some salient points here:

LSTM structure. Image source [3​]


Each LSTM cell is identical, with an input set at time $t$ consisting of $c_{t-1}$, $x_t$, and $h_{t-1}$, which are the previous context vector, the input vector, and the previous hidden state vector, respectively.

LSTM operations. Image source [3] 

The core idea behind LSTM

At the top of an LSTM cell, a horizontal line that runs through the cell carries the context information. Mr. Olah calls this “a conveyor belt” which absorbs only some minor changes as it flows through the cell.

Context vector. Image source [3] 


 
In order to adjust the context information, LSTMs introduce gates, each composed of “a sigmoid neural net layer ($\sigma$) and a pointwise multiplication operation ($\otimes$)”.

LSTM sigmoid and pointwise multiplication. Image source [3] 


As you already know, the sigmoid's output values lie between 0 and 1. In other words, they define how much information is eliminated or kept as the context vector passes through the cell: 0 means no information is kept, and 1 means all of the information is kept.

When attention enters the game, the attention output contributes to the formation of the context vector.

LSTM Step by Step

The first gate, called the forget gate $f$, defines how much information will be carried forward in the cell. This gate looks at the previous hidden state and the input $x_t$, and outputs a number between 0 and 1 that is multiplied with the previous context vector.

​Forget gate in LSTM. Image source [3] 

The second block is the input gate $i$. This gate contains two branches. The first produces $i_t$ by passing the previous hidden state and the input through a sigmoid layer, to decide which values will be updated in the cell. The second branch passes the same information through a tanh layer to produce the candidate values $\tilde{C}_t$ that may be added to the new cell state.

​Input gate in LSTM. Image source [3] 


 
 
Now, it is time to update the old context vector. The new context vector combines the previous context, scaled by the forget gate, with the candidate values, scaled by the input gate:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{C}_t$$
Combined context vector. Image source [3] 
 
Next, the LSTM produces its output, the hidden state, by multiplying the tanh of the context vector with a sigmoid gate computed from the input information: $h_t = o_t \odot \tanh(c_t)$.

Output hidden state. Image source [3] 
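Putting the gates together, here is a minimal sketch of a single LSTM step written with plain tensor operations. It is a from-scratch illustration of the equations summarized above (following [3]), not an optimized implementation; in practice you would simply use torch.nn.LSTMCell.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts holding the weights of the four gates:
    'f' (forget), 'i' (input), 'g' (candidate values), 'o' (output)."""
    f_t = torch.sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])  # how much old context to keep
    i_t = torch.sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])  # how much new information to add
    g_t = torch.tanh(x_t @ W['g'] + h_prev @ U['g'] + b['g'])     # candidate values for the context
    o_t = torch.sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])  # how much context to expose
    c_t = f_t * c_prev + i_t * g_t        # update the "conveyor belt"
    h_t = o_t * torch.tanh(c_t)           # new hidden state (the output)
    return h_t, c_t
```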


 
That's it. We have finished the review of LSTMs. Understanding the intuition behind LSTM operations enables us to interpret the structure of the language model as well as how the model generates words in image captioning.

The paper’s approach

The authors introduce the notion of a “visual sentinel”, which is not tied to any region of the described image. Rather, the visual sentinel is a fallback that the model can refer to when it cannot align a word (articles and function words such as ‘a’, ‘an’, ‘the’, or ‘of’, etc.) to the image. I guess that is why the authors chose the name “sentinel” for this visual information. Please bear in mind that the visual sentinel is extracted from the LSTM, that is, from the language model.

If you were me, you would share my feelings about this idea, and I am not surprised that this paper has been cited nearly 500 times. In fact, the idea was reused in Neural Baby Talk to generate a slotted template for image captioning, which significantly improves the BLEU score, an evaluation metric for sequence generation.

Okay! Let's dig deeper into the method.

Encoder-Decoder for Image Captioning

The mathematical formulation of image captioning is:

$$\theta^* = \arg\max_{\theta} \sum_{(I,y)} \log p(y \mid I; \theta)$$

where $I$ is the input image, $\theta$ are the parameters of the model, and $y = \{y_1, y_2, y_3, \dots, y_T\}$ is the output caption.

Applying the chain rule, we get this beautiful function (dropping $\theta$ for convenience):

$$\log p(y) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, I)$$

Because $y_1, \dots, y_{t-1}$ are generated by the LSTM, the probability can naturally be formulated in terms of hidden states and context vectors. If your brows are furrowed, go back to the earlier parts of this article and re-read the LSTM section.

$$\log p(y_t \mid y_1, \dots, y_{t-1}, I) = f(h_t, c_t)$$

where $h_t = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1})$. Remember that in the paper, the authors write the memory (cell) state as $m_t$ rather than $c_t$, reserving $c_t$ for the attention context vector.
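To see how this objective is computed in practice, here is a minimal sketch that sums the per-word log-probabilities produced by the decoder. The shapes and the log-softmax gathering are my own illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def caption_log_prob(logits, targets):
    """logits:  (T, vocab_size) decoder scores for each time step.
    targets: (T,) indices of the ground-truth words y_1..y_T.
    Returns log p(y | I) = sum_t log p(y_t | y_1..y_{t-1}, I)."""
    log_probs = F.log_softmax(logits, dim=-1)             # per-step log-distributions
    picked = log_probs.gather(1, targets.unsqueeze(1))    # log-prob of each ground-truth word
    return picked.sum()                                   # the quantity maximized over the dataset
```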
That is it for the background! Now it's time for the most novel part: the attention. You should know that there are two kinds of visual information: spatial visual information and, of course, the visual sentinel. This means this article will discuss both spatial attention and the adaptive (sentinel) attention.
Spatial Attention

This attention is formulated from the regions of an image and the current hidden state of the LSTM. The formula is:

$$c_t = f_{att}(V, h_t)$$

And the attention is now:

$$z_t = w_h^T \tanh(W_v V + (W_g h_t)\mathbf{1}^T)$$

$$\alpha_t = \mathrm{softmax}(z_t)$$

$$c_t = \sum_{i=1}^{L} \alpha_{ti} v_{ti}$$

where $V = \{v_1, v_2, \dots, v_L\}$ is the set of $L$ region features, $\mathbf{1}$ is a vector with all elements set to 1, the $W$ matrices are parameters to be learnt, $\alpha_t$ is the attention weight over the features in $V$, and $c_t$ is the context vector.
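In code, the only real change from the earlier attention sketch is that the score is driven by the current hidden state $h_t$. Here is a minimal sketch (again my own illustration, with assumed weight shapes) that also returns the raw scores $z_t$, since they are reused by the sentinel attention below:

```python
import torch
import torch.nn.functional as F

def spatial_attention(V, h_t, W_v, W_g, w_h):
    """z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T); alpha_t = softmax(z_t); c_t = sum_i alpha_ti v_i.
    V: (batch, L, d) region features, h_t: (batch, hidden) current hidden state."""
    # (W_g h_t) 1^T broadcasts the projected hidden state across all L regions.
    z = torch.tanh(V @ W_v + (h_t @ W_g).unsqueeze(1)) @ w_h   # (batch, L) scores
    alpha = F.softmax(z, dim=1)                                # attention over regions
    c_t = (alpha.unsqueeze(-1) * V).sum(dim=1)                 # (batch, d) context vector
    return c_t, alpha, z
```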

Hmm, something seems different. You should notice that this attention is computed from the current hidden state $h_t$, while the earlier one was based on the previous hidden state $h_{t-1}$. The reason is that the authors adapted the attention into a new form. The pictures below will clear the cloud in your mind.

Left: traditional attention. Right: adaptive attention [1] 

The attention on the right… you know… it is A RESIDUAL FORM. The authors claim that this innovation is part of why they achieve better results.
Sentinel Attention

Sentinel attention also produces a weight vector $\alpha$ and a context vector. In addition, it needs a coefficient $\beta_t$, the sentinel gate, which tells the model which kind of visual information to focus on.

The visual sentinel is an output of the language model, so it is easy to define in terms of the LSTM:

$$g_t = \sigma(W_x x_t + W_h h_{t-1})$$

$$s_t = g_t \odot \tanh(c_t)$$

$s_t$ is exactly the visual sentinel (here $c_t$ is the LSTM memory state, i.e., $m_t$ in the paper's notation).
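As a sketch, the sentinel can be computed alongside the LSTM step from the review above (the parameter names are my own; the LSTM memory state plays the role of $c_t$ here):

```python
import torch

def visual_sentinel(x_t, h_prev, c_t, W_x, W_h):
    """s_t = g_t * tanh(c_t): a fallback "visual" vector distilled from the language model.
    c_t is the LSTM memory (cell) state, written m_t in the paper."""
    g_t = torch.sigmoid(x_t @ W_x + h_prev @ W_h)   # sentinel gate over the memory
    s_t = g_t * torch.tanh(c_t)                     # the visual sentinel
    return s_t
```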

From the visual sentinel and the spatial visual information, we obtain a new context vector through the following combination:

$$\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t$$

As I mentioned above, $\beta_t$ is the sentinel gate, and it is obtained from the attention weights computed over $z_t$ concatenated with an extra score for the sentinel:

$$\hat{\alpha}_t = \mathrm{softmax}([z_t; \; w_h^T \tanh(W_s s_t + W_g h_t)])$$

$$\beta_t = \hat{\alpha}_t[k+1]$$

(with $k$ the number of spatial regions, so $\hat{\alpha}_t$ has $k+1$ elements).

Why? That's an exercise for you. But I will give you the key: the concatenation, and reading the whole paper to find the dimensions of all the matrices, which I was too lazy to put in this article.

OK. Only one more equation, to find the probability of the output word, and we are done with the paper. This is it:

$$p_t = \mathrm{softmax}(W_p(\hat{c}_t + h_t))$$
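Putting the pieces together, here is a minimal sketch of one adaptive attention step: it appends a score for the sentinel to the spatial scores $z_t$, reads $\beta_t$ from the extended softmax, mixes the sentinel with the spatial context, and predicts the word distribution. It is my own illustrative reimplementation (the released code is in Torch/Lua), with assumed weight shapes:

```python
import torch
import torch.nn.functional as F

def adaptive_attention_step(V, z, s_t, h_t, W_s, W_g, w_h, W_p):
    """V: (batch, L, d) regions, z: (batch, L) spatial scores, s_t / h_t: (batch, d).
    W_s, W_g: (d, att_dim), w_h: (att_dim,), W_p: (d, vocab_size)."""
    alpha = F.softmax(z, dim=1)                                     # spatial attention weights
    c_t = (alpha.unsqueeze(-1) * V).sum(dim=1)                      # spatial context vector
    z_s = (torch.tanh(s_t @ W_s + h_t @ W_g) @ w_h).unsqueeze(1)    # score for the sentinel
    alpha_hat = F.softmax(torch.cat([z, z_s], dim=1), dim=1)        # extended weights, length L + 1
    beta = alpha_hat[:, -1:]                                        # sentinel gate beta_t = alpha_hat[k+1]
    c_hat = beta * s_t + (1.0 - beta) * c_t                         # adaptive context vector
    p_t = F.softmax((c_hat + h_t) @ W_p, dim=-1)                    # word distribution p_t
    return p_t, beta, alpha_hat
```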
There is much more interesting information in the paper, and I recommend that you open it and read it [1].

[1] Original Paper: ​https://arxiv.org/pdf/1612.01887.pdf


[2] ​https://pixta.vn/attention-again-a-long-survey/
[3] ​https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[4] Show, Attend and Tell: ​https://arxiv.org/abs/1502.03044
