To motivate the design choices behind AVA, we list below some key findings from the human psychology literature on activity and events. We recommend [38] to the reader interested in a more detailed account.

Foundational work [3] observed in meticulous detail the children of a small town in Kansas going about their daily lives. They proposed a unit they called a behavior episode. Examples of behavior episodes include a group of boys moving a crate across a pit, a girl exchanging remarks with her mother, and a boy going home from school. They noted the hierarchical nature of activity, illustrated in Fig. 2 [3]. At the finest level the actions can be described as simple body movements or manipulation of objects, but at coarser levels the most natural descriptions are in terms of intentionality and goal-directed behavior, as already noted in movies of simple moving geometrical figures, where such attribution is common among human observers [14].

Figure 2. This figure illustrates the hierarchical nature of an activity. From Barker and Wright [3], pg. 247.

Newtson [24] experimentally studied how humans perceive a videotaped behavior sequence by asking them to use key presses to mark breakpoints between segments. Zacks et al. [39] applied this technique to study how humans perceive everyday activities such as making a bed or doing the dishes. There was clear evidence of hierarchical event structure; the large unit boundaries were also likely to be small unit boundaries.

We are now ready to justify three key design choices in AVA:

Why short temporal scale? Coarse-scale activity/event understanding is best done in terms of goals and subgoals. But this is essentially an infinite set, just like the set of all sentences. Another problem is that goals may not even be evident visually: if we see someone walking, are they doing it for exercise or to get to the grocery store? However, if we limit ourselves to fine scales, then the actions are very physical in nature and have clear visual signatures. We hope to eventually understand the coarser scales of activity analysis, mediated by conceptual structures such as scripts and schemas composed from fine scale AVA units, but this is very much future work. Research such as Zacks et al. [39] shows that coarse scale boundaries are a subset of fine scale boundaries and suggests that the fine scale units should prove useful in composing coarser units. We choose to annotate keyframes in 3 second long video segments, as this is dense enough to understand the content while avoiding precise temporal annotation, which is difficult to define; the THUMOS challenge [16] observed that action boundaries (unlike objects) are vague and subjective, leading to significant inter-annotator disagreement and evaluation bias. By contrast, it is feasible for annotators to determine (using 1.5s of context) whether a specific frame contains the given action. This task is intrinsically less ambiguous than precisely specifying the temporal boundaries.

Why person-centric? There are events, such as trees falling, which do not involve people, but our focus is on the activities of people, treated as single agents. There could be multiple people, as in a sports event, or just two people hugging, but each one is an agent with his or her own choices, so we can treat each separately. We do not preclude cooperative or competitive behavior, but each agent is described by pose and interaction with objects and humans.

Why movies? Ideally we want behavior in the wild, recorded in digital form by a modern day version of Barker and Wright [3]. We do not have that. The next best thing is movies. For more than a hundred years cinematographers have been narrating stories in this medium, and if we consider the variety of genres (thrillers, drama, romantic comedies, westerns, etc.) and countries with flourishing movie industries (USA, UK, India, China, Japan, Russia, etc.), we may expect a significant range of human behavior to show up in these stories. We expect some bias in this process. Stories have to be interesting (to quote David Lodge: "Literature is mostly about having sex and not much about having children. Life is the other way round.") and there is a grammar of the film language [2] that communicates through the juxtaposition of shots. That said, within one shot we expect an unfolding sequence of human actions, somewhat representative of reality, as conveyed by good actors. It is not that we regard this data as perfect, just that it is better than working with the assortment of user generated content such as videos of animal tricks, DIY instructional videos, events such as children's birthday parties, and the like. We expect movies to contain a greater range of activities, as befits the telling of different kinds of stories. We make it a point to only include video from movies which are at least 30 minutes long, sampling 15 minute intervals.

This paper is organized as follows. In Section 2 we review previous action recognition datasets and point out the differences with our AVA dataset. In Section 3, we describe the annotation process. Section 4 presents some interesting statistics from the dataset. Section 5 explains the baseline approach and reports results. We conclude in Section 6.
2. Related work

Most popular action recognition datasets, such as KTH [31], Weizmann [4], Hollywood-2 [22], HMDB [21] and UCF101 [33], consist of short clips, manually trimmed to capture a single action. These datasets are ideally suited for training fully-supervised, whole-clip, forced-choice classifiers. Unfortunately, although convenient, this formulation of action recognition is completely unrealistic: real-world action recognition always occurs in an untrimmed video setting and frequently demands spatial grounding as well as temporal localization.

Recently, video classification datasets, such as TrecVid multi-media event detection [25], Sports-1M [19] and YouTube-8M [1], have focused on video classification at a large scale, in some cases with automatically generated, and hence potentially noisy, annotations. They serve a valuable purpose but address a different need than AVA.

Another line of recent work has moved away from video classification towards temporal localization. ActivityNet [6], THUMOS [16], MultiTHUMOS [36] and Charades [32] use large numbers of untrimmed videos, each containing multiple actions, obtained either from YouTube (ActivityNet, THUMOS, MultiTHUMOS) or from crowdsourced actors (Charades). These datasets also provide temporal (but not spatial) localization for each action of interest. AVA differs from them both in terms of content and annotation: we label a diverse collection of movies and provide spatio-temporal annotations for each subject performing an action in a large set of sampled frames.

A few datasets, such as CMU [20], MSR Actions [37], UCF Sports [29] and JHMDB [18], provide spatio-temporal annotations in each frame for short trimmed videos. The main differences with our AVA dataset are: the small number of actions, ranging from 3 (MSR Actions) to at most 21 (JHMDB); the small number of video clips; and the fact that the clips are trimmed. Furthermore, the actions are complex (e.g., pole-vaulting) and not atomic as in AVA. For example, the UCF Sports dataset [29] consists of 10 sports actions, such as weight lifting, horse riding and diving. Recent extensions, such as UCF101 [33], DALY [35] and Hollywood2Tubes [23], evaluate spatio-temporal localization in untrimmed videos, which makes the task significantly harder and results in a performance drop. However, the action vocabulary is still restricted to a limited number of actions, at most 24, and the actions are complex, which makes large-scale extension difficult. Moreover, these datasets do not provide dense coverage of all actions; a good example is BasketballDunk in UCF101, where only the player performing the dunk is annotated, whereas all the other players are not. These datasets are unrealistic: real-world applications require continuous annotation of atomic actions, which can then be composed into higher-level events.

This paper addresses the main limitations of existing datasets for spatio-temporal action recognition, which all consist of a small number of complex actions annotated in a small number of clips, often in specific environments. Here, we depart from these drawbacks and annotate 80 atomic actions densely in 57.6k realistic movie clips, resulting in 210k action labels.

The AVA dataset is also related to still image action recognition datasets [7, 9, 12]. Such datasets provide annotations for humans and their actions. Yet, still image action recognition datasets present two major drawbacks. First, the lack of motion information makes disambiguation in many cases difficult or impossible. We found that the context from surrounding video is essential in order to annotate the individual frames. Take for example the actions walking, standing and falling down: they all look very similar if only one frame is given for annotation. This necessity for motion information is also confirmed by our experiments. Second, modeling complex events as a sequence of individual atomic actions is not possible for still images. This is arguably out of scope here, but clearly required in many real-world applications, for which the AVA dataset does provide training data.
3. Data collection

The AVA dataset generation pipeline consists of three stages: movie and segment selection, person bounding box annotation and action annotation.

3.1. Movie and segment selection

The raw video content of the AVA dataset comes from YouTube. We begin by assembling a list of top actors of various nationalities. For each name we issue a YouTube search query, retrieving up to 2000 results. We only include videos with the film or television topic annotation, a duration of over 30 minutes, at least 1 year since upload, and at least 1000 views. We further exclude black & white, low resolution, animated, cartoon, and gaming videos, as well as those containing mature content. Note that there is no explicit action distribution bias in this process, as the selection and filtering criteria do not include any action related keyword, nor does the process run any automated action classifier on the video content.
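The selection step thus reduces to a pure metadata filter. A minimal sketch of that predicate follows; the field names (topic, duration_min, days_since_upload, views, flags) are our own illustrative choices, not the schema of any actual crawler.

    # Sketch of the video selection filter described above.
    # All field names are hypothetical; the paper does not specify a schema.
    EXCLUDED = {"black_and_white", "low_resolution", "animated",
                "cartoon", "gaming", "mature_content"}

    def keep_video(meta):
        """Return True if a YouTube search result passes the selection criteria."""
        return (meta["topic"] in {"film", "television"}   # topic annotation
                and meta["duration_min"] > 30             # over 30 minutes long
                and meta["days_since_upload"] >= 365      # uploaded >= 1 year ago
                and meta["views"] >= 1000                 # at least 1000 views
                and not EXCLUDED & set(meta["flags"]))    # no excluded category

Any search result failing this predicate is discarded before annotation begins, so no video content needs to be inspected at this stage.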
Each movie contributes equally to the dataset, as we only label a sub-part ranging from the 15th to the 30th minute. We skip the beginning of the movie to avoid annotating titles or trailers. We choose a duration of 15 minutes so we are able to include more movies under a fixed annotation budget, and thus increase the diversity of our dataset.

Each 15-min clip is partitioned into 300 non-overlapping 3s movie segments. We choose 3s because the dataset focuses on atomic actions and our user studies confirm that 3s is a reasonably short length in which atomic actions can be recognized. We choose to label 300 segments consecutively, rather than randomly sampling segments from the whole movie, to preserve sequences of atomic actions in a coherent temporal context.
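The segment arithmetic is simple enough to state in code. A small sketch under the numbers above (timestamps in seconds from the start of the movie; the helper name is ours):

    SEGMENT_LEN = 3.0              # seconds per segment
    START, END = 15 * 60, 30 * 60  # labeled sub-part: 900s to 1800s

    def segment_midpoints(start=START, end=END, seg_len=SEGMENT_LEN):
        """Middle-frame timestamps of consecutive, non-overlapping segments."""
        n = int((end - start) / seg_len)   # 900s / 3s = 300 segments
        return [start + (i + 0.5) * seg_len for i in range(n)]

    mids = segment_midpoints()
    assert len(mids) == 300 and mids[0] == 901.5 and mids[-1] == 1798.5

Each midpoint is the timestamp of the keyframe that annotators label, with the remainder of the 3s segment serving as context.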
Every human bounding box and its corresponding action labels are associated with a segment. Note that although annotators annotate only the middle frame, they take into account the information from the surrounding 3s segment. As stated in the introduction, annotating only one frame in a segment is a design choice to avoid ambiguous temporal annotation, which is not only error prone but also labor intensive. This choice also alleviates the shot boundary problem, as the action label and spatial localization are both well defined in the middle frame. Shots where the transition is too close to the middle frame are marked by annotators and removed from the dataset.
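Putting the pieces of this section together, one annotation record can be pictured as below. This is a hedged sketch of the data model as we read it, not the released AVA format; the class, field names, and example values are all our own.

    from dataclasses import dataclass, field

    @dataclass
    class PersonAnnotation:
        """One annotated person in the middle frame of a 3s segment.

        Field names are illustrative and do not mirror the released files.
        """
        video_id: str          # source YouTube video (placeholder id below)
        segment_mid_ts: float  # middle-frame timestamp in seconds
        box: tuple             # (x1, y1, x2, y2) person bounding box
        actions: list = field(default_factory=list)  # atomic action labels

    ann = PersonAnnotation("movie_abc", 901.5, (0.1, 0.2, 0.4, 0.9), ["stand"])

Segments whose shot transition falls too close to the middle frame, as described above, simply contribute no such records.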
3.2. Bounding box annotation of subjects

Figure 3. User interface for bounding box annotation. The purple box was generated by the person detector. The orange box (missed by the detector) was manually added by an annotator with a mouse.

[Figure residue: example action labels include cook, hand wave, dig, hand shake, lift (a person), and ride (e.g. car, bike, horse).]