Abstract:
“Attention-based neural encoder-decoder frameworks have been widely adopted for
image captioning. Most methods force visual attention to be active for every generated
word. However, the decoder likely requires little to no visual information from the image
to predict nonvisual words such as “the” and “of”. Other words that may seem visual can
often be predicted reliably just from the language model e.g., “sign” after “behind a red
stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive
attention model with a visual sentinel. At each time step, our model decides whether to
attend to the image (and if so, to which regions) or to the visual sentinel. The model
decides whether to attend to the image and where, in order to extract meaningful
information for sequential word generation. We tested our method on the COCO image
captioning 2015 challenge dataset and Flickr30K. Our approach sets the new
state-of-the-art by a significant margin. The source code can be downloaded from
https://github.com/jiasenlu/AdaptiveAttention”
Anyone who wants to deploy an attention algorithm in Computer Vision must cite the
paper “Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention” [4]. It is too famous to skip. In that paper, attention was estimated by the three
equations below:
z_t = f_att(h_{t−1}, V)
α_t = softmax(z_t)
c_t = Σ_{i=1}^{L} α_{t,i} v_{t,i}
Where h_{t−1} is the hidden state of the LSTM decoder, V = {v_1, v_2, ..., v_L} is the set of
features for the L regions of an image, and c_t is the context vector, which is also the
context fed into the LSTM cells. [3]
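The three equations above can be sketched in NumPy. The additive form of f_att below is a common choice, not necessarily the exact one used in [4], and all weights are random placeholders:

```python
import numpy as np

# Hypothetical dimensions: L image regions, feature/hidden size d.
L, d = 49, 8
rng = np.random.default_rng(0)

V = rng.normal(size=(L, d))        # region features v_1..v_L
h_prev = rng.normal(size=(d,))     # previous LSTM hidden state h_{t-1}

# Illustrative attention parameters (not the paper's trained weights).
W_v = rng.normal(size=(d, d))
W_g = rng.normal(size=(d, d))
w_h = rng.normal(size=(d,))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# z_t = f_att(h_{t-1}, V): one score per region via additive attention.
z = np.tanh(V @ W_v.T + h_prev @ W_g.T) @ w_h   # shape (L,)
alpha = softmax(z)                               # attention weights, sum to 1
c = alpha @ V                                    # context vector c_t, shape (d,)

print(alpha.sum())   # 1.0 (up to float error)
print(c.shape)       # (8,)
```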
LSTM Structure
Across the top of an LSTM cell runs a horizontal line that carries the context (cell state)
information. Colah calls this “a conveyor belt”, which absorbs only minor
changes as it flows through the cell.
When attention enters the game, it contributes to the formation of the
context vector.
The first gate, called the forget gate f, decides how much information is carried
forward in the cell. It looks at the previous hidden state and the input x and
outputs a number that multiplies the previous context vector.
The second block is called the input gate i. This gate contains two branches. The first
produces i_t after a sigmoid layer, deciding which information
will be updated in the cell. The second branch transfers information through a tanh
function to decide how much information will be added to the new cell state.
Combined context vector. Image source [3]
Next, the LSTM produces its output, the hidden state, by multiplying the tanh of the
context vector with a sigmoid gate computed from the input information.
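The gate walk-through above can be made concrete with a minimal LSTM cell step. The weight names and shapes here are illustrative, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 6

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate over the concatenated [h_{t-1}, x_t].
W_f, W_i, W_c, W_o = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)            # forget gate: how much of c_{t-1} to keep
    i = sigmoid(W_i @ z)            # input gate: which candidates to write
    c_tilde = np.tanh(W_c @ z)      # candidate values for the cell state
    c = f * c_prev + i * c_tilde    # the "conveyor belt" update
    o = sigmoid(W_o @ z)            # output gate
    h = o * np.tanh(c)              # hidden state = gated tanh of cell state
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))
print(h.shape, c.shape)   # (6,) (6,)
```

Because h is a product of a sigmoid and a tanh, every component of the hidden state stays in (−1, 1).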
The authors introduce the notion of a “visual sentinel”, which is not tied
to any region of the described image. Rather, the visual sentinel is a fallback
that the model refers to when it cannot align a word (articles like ‘a’, ‘an’,
‘the’, or ‘of’, etc.) to the image. I guess that is why the authors chose the name
“sentinel” for this visual information. Please bear in mind that the visual sentinel
is extracted from the LSTM, i.e., from the language model.
If you were me, you would feel how I felt about this idea. I am not surprised that
this paper has been cited nearly 500 times. In fact, the idea was reused in Neural Baby
Talk to generate a slotted template for Image Captioning, which significantly improves
the BLEU score, a standard metric for sequence-generation tasks.
Where I is the input image, θ is the parameter of the model, and y = {y_1, y_2, ..., y_T} is
the output text. Applying the chain rule, we get this beautiful function (θ is dropped for
convenience):

log p(y) = Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t−1}, I)
with h_t = LSTM(x_t, h_{t−1}, c_{t−1}). Remember that in the paper, the authors write m_t
instead of c_t for the memory state.
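As a quick sanity check on the chain-rule decomposition, here is a toy example (the per-step probabilities are made up; only the identity matters):

```python
import numpy as np

# For any autoregressive model, log p(y) equals the sum of the
# per-step conditional log-probabilities.
cond_probs = [0.5, 0.25, 0.8]   # p(y_1), p(y_2|y_1), p(y_3|y_1,y_2)
joint = np.prod(cond_probs)              # p(y) by the product rule
log_joint = np.sum(np.log(cond_probs))   # sum of log-conditionals
print(np.isclose(np.log(joint), log_joint))   # True
```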
Done with the background knowledge! Now it’s time for the most novel algorithms of the
paper. You should know that there are two kinds of visual information: spatial visual
and, of course, sentinel visual. That means this article will discuss both spatial attention
and sentinel attention.
Spatial Attention
This attention is formulated from the regions of an image and the hidden states of the
LSTM. I note the formula here:

c_t = f_att(V, h_t)
Hmm, something’s wrong. You should notice that this attention is estimated from the
current hidden state, while the old formulation was based on the previous hidden state.
The reason is that the authors adapted attention into a new form. The pictures below will
clear the cloud in your mind.
The attention on the right… you know… it is A RESIDUAL FORM. The authors claim
that this innovation is what lets them achieve a better result.
Sentinel Attention
Sentinel attention also computes a weight vector α and the context vector. In addition, it
needs a coefficient β, the sentinel gate, which tells the model which kind of visual
information to focus on.
g_t = σ(W_x x_t + W_h h_{t−1})
s_t = g_t ⊙ tanh(c_t)
From sentinel visual information and spatial visual information, we obtain a context
vector by this combination.
ĉ_t = β_t s_t + (1 − β_t) c_t
α̂_t = softmax([z_t ; w_h^T tanh(W_s s_t + W_g h_t)])
β_t = α̂_t[k + 1]
Why? It’s an exercise for you. But I will give you the key: the concatenation, plus
reading the whole paper to find the dimensions of all the matrices, which I was too lazy
to include in this article.
OK. Only one equation remains, the probability of the output word, and then we are done
with the paper. This is it:

p_t = softmax(W_p (ĉ_t + h_t))
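The sentinel mechanism can be sketched end-to-end in NumPy. Everything below is a sketch under assumptions: the weights are random placeholders, the spatial scores use an assumed additive form, and the sentinel score reuses the same style of scoring against s_t:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 49, 8   # hypothetical region count and feature/hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V = rng.normal(size=(L, d))             # region features
x_t = rng.normal(size=(d,))             # LSTM input at step t
h_prev, h_t = rng.normal(size=(2, d))   # h_{t-1} and h_t
m_t = rng.normal(size=(d,))             # memory (cell) state

# Placeholder parameters, not trained values.
W_x, W_h, W_v, W_g, W_s = (rng.normal(size=(d, d)) for _ in range(5))
w_h = rng.normal(size=(d,))

# Visual sentinel: g_t = sigmoid(W_x x_t + W_h h_{t-1}), s_t = g_t * tanh(m_t)
g_t = sigmoid(W_x @ x_t + W_h @ h_prev)
s_t = g_t * np.tanh(m_t)

# Spatial scores over the L regions, plus one extra score for the sentinel.
z = np.tanh(V @ W_v.T + h_t @ W_g.T) @ w_h          # shape (L,)
z_s = w_h @ np.tanh(W_s @ s_t + W_g @ h_t)          # scalar sentinel score
alpha_hat = softmax(np.concatenate([z, [z_s]]))     # shape (L+1,)

beta = alpha_hat[-1]           # sentinel gate: attention on the (k+1)-th slot
c_t = alpha_hat[:L] @ V        # spatial context from the image regions
c_hat = beta * s_t + (1 - beta) * c_t   # adaptive context vector

print(c_hat.shape)   # (8,)
```

Because β is just the last component of a softmax, it lies in (0, 1): near 1 the model relies on the language model (the sentinel), near 0 it looks at the image.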
There is much more interesting information inside the paper, and I recommend that you
open it and read it yourself [1].