Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
400
gorized into two classes. One group of words serve as com- 400
We want to obtain prosodically highlighted words in We encode social cues in the framework of the statisti-
each spoken utterance. To do so, we compare the ex- cal learning model as shown in Figure 3. Each word u(i)
tracted features from each word with those from each ut- is assigned with a weight wp (i) based on its prosodic cat-
terance, which indicates whether a word sounds like high- egory. Similarly, each visual object vj is set with a weight
lighted in the acoustic context. Specifically, for the word wv (j) based on whether it is attended by the speaker and
wi in the spoken utterance uj , we form a feature vector: the learner. In this way, the same method described in pre-
uj uj uj T uj vious section is applied and the only difference is that the
[pw wi
75 p75 pr pr
i
pw
m pm ] , where pm is the mean
i
(s) (s)
wi
delta pitch of the utterance and pm is that of the word and estimate of c(mm |wn , Sm , Sw ) now is given by:
so on. In this way, the prosodic envelope of a word is repre-
sented by 3-dimensional feature vector. We use the support (s) (s) p(mm |wn )
c(mm |wn , Sm , Sw )=
vector clustering (SVC) method [5] to group data point into p(mm |wu(1) ) + ... + p(mm |wu(r) )
X
l X
r
by the model which actually are correct. (2) lexical spotting
(mm , v(j) wv (j)) (wn , u(i) wp (i)) accuracy (recall) measures the percentage of correct words
j=1 i=1
that the model learned among all the 26 words.
(4)
In practice, we set the values of wv (j) and wp (i) to be 3 for Four methods were applied on the same data and the re-
highlighted objects and words. The weights of all the other sults are as follows (precision and recall): (1) purely statis-
words and objects are set to be 1. tical learning ( 75% and 58%). (2) statisical learning with
prosodic cues ( 78% and 58%). (3) statistical learning with
statistical
the cues from visual attention ( 80% and 73%). (4) statisti-
1 prosody-cued cal learning with both attentional and prosodic cues ( 83%
cat attention-cued
0.75
and 77%). Figure 4 shows the comparative results of these
prosodic+attentional
0.5
four approaches for specific instances. Ideally, we want the
association probabilities of the first or second words to be
0.25
high and others to be low. For instance, the first plot rep-
10
kitty-cat meow my watch baby says yeah resents the meaning of the object cat. Both the spoken
0.75
word kitty-cat and the spoken word meow are closely rele-
mirror
vant to this meaning. Therefore, the association probabil-
0.5 ities are high for these two words and are low for all the
others words, such as my, watch and baby, which are not
0.25
correlated with this context. Note that in the meaning of
0 the object bird, we count the word eye as a positive one
mirror David bunnies hiphop can song see
because the mother uttered it several times during the in-
1
big-bird teraction when she presented the object bird to her kid.
0.75
Similarly, when she introduced the object mirror, she also
0.5
mentioned the name of the kid David whose face appeared
0.25 in the mirror.
0
big-bird eye see huh and look yeah The results of the statistical learning approach (the first
bars) are reasonably good. For instance, it obtains big-
1 hand
bird and eye for the meaning bird, kitty-cat for the mean-
0.75
ing cat, mirror for the meaning mirror and hand for
0.5 the meaning hand. But it also makes wrong estimates,
0.25 such as my for the meaning cat and got for the meaning
0 hand. We expect that attentional and prosodic constraints
hand got a that to hat
will make the association probabilities of correct words
higher and decrease the association probabilities of irrel-
Figure 4. The comparative results of the methods considering
different sets of cues. Each plot shows the association probabili- evant words. The method encoding prosodic cues moves
ties of several words to one specific meaning labeled on the top. toward this goal although occasionally it changes the prob-
The first one or two items are correct words that are relevant to the abilities on the reverse way, such as increasing the proba-
meanings and the following words are irrelevant.
bility of my in the meaning cat. What is really helpful
is to encode the cues of joint attention. The attention-cued
6 Experimental Results method significantly improves the accuracy of estimate for
Our model was evaluated by using two video clips from almost every word-meaning pairs. Of course, the method
CHILDES database. We labeled visual contexts in terms of including both joint-attention and prosodic cues achieves
12 objects that occurred in the video clips. For each object, the best performance. Compared with purely statistical
we selected the correctly associated words based on gen- learning, this method highlights the correct associations
eral knowledge. For instance, both the word kitty-cat and (e.g., kitty-cat with the meaning cat), and decreases the ir-
meow are positive instances because both of them are rel- relevant associations, such as got with the meaning hand.
evant to the object cat. Overall, there were 26 positive In this method, we can simply select a threshold and pick
words for all of the 12 objects. The computational model the word-meaning pairs which are overlapped with the ma-
estimated the association probabilities of all the possible jority of words in the target set. We need to point out that
word-meaning associations and then selected lexical items the results here are obtained from very limited data. With-
based on a threshold. Two measures were used to evaluate out any prior knowledge of the language (the worst case
the performance: (1) word-meaning association accuracy in word learning), the model is able to learn a significant
(precision) measures the percentage of the words spotted amount of correct word-meaning associations.
7 Conclusion [12] A. Fernard. Human maternal vocalizations to infants as bi-
ologically relevant signals: An evolutionary perspective. In
We believe that in a natural infant-caregiver interac-
The Adaptive Mind. Oxford University Press, 1992.
tion, the mother provides non-linguistic signals to the in- [13] L. Gleitman. The structural sources of verb meanings. Lan-
fant through her body movements, the direction of her gaze, guage Acquisition, 1:155, 1990.
and the timing of her affective cues via prosody. Previous [14] L. Gleitman, K. Cassidy, R. Nappa, A. Papafragou, and
experiments have shown that some of these non-linguistic J. Trueswell. Hard words. Language Learning and Devel-
signals can play a critical role in infant word learning, but a opment, 1, in press.
detailed estimate of their relative weights has not been pro- [15] K. Hirsh-Pasek, R. M. Golinkoff, and G. Hollich. An emer-
vided. Based on statistical learning and social-pragmatic gentist coalition model for word learning: mapping words
to objects is a product of the interaction of multiple cues. In
theories, this work proposed a unified model of early word
Becoming a word learner: a debate on lexical acquisition.
learning, which integrates statistical and social cues to en- Oxford Press, New York, 2000.
able the word-learning process to function effectively and [16] P. K. Kuhl, F.-M. Tsao, and H.-M. Liu. Foreign-language ex-
efficiently. In our model, we explored the computational perience in infancy: Effects of short-term exposure and so-
role of non-linguistic information, such as joint attention cial interaction on phonetic learning. PNAS, 100(15):9096
and prosody in speech, and provided the quantitative results 9101, July 22 2003.
to compare the effects of different statistical and social cues. [17] A. S. Meyer, A. M. Sleiderink, and W. J. Levelt. View-
We need to point out that the current unified model does not ing and naming objects: eye movements during noun phrase
encode any syntactic properties of the language, which def- production. Cognition, 66:B25B33, 1998.
[18] M. Pena, L. L. Bonatti, marina Nespor, and J. Mehler.
initely play a significant role in word learning, especially
Signal-driven computations in speech processing. Science,
in the later stage. Therefore, one natural extension of the 2002.
current work is to add the syntactic constraints in the cur- [19] K. Plunkett. Theories of early language acquisition. Trends
rent probabilistic framework to study how this knowledge in cognitive sciences, 1:146153, 1997.
can help the lexical acquisition process and how multiple [20] T. Regier. Emergent constraints on word-learning: A com-
sources can be integrated in a general system. putational review. Trends in Cognitive Sciences, 7:263268,
2003.
References [21] D. Richards and J. Goldfarb. The episodic memory model
of conceptual development: An integrative viewpoint. Cog-
[1] D. Baldwin. Early referential understanding: Infants ability nitive Development, 1:183219, 1986.
to recognize referential acts for what they are. Developmen- [22] J. R. Saffran, E. Johnson, R. Aslin, and E. Newport. Statisti-
tal psychology, 29:832843, 1993. cal learning of tone sequences by human infants and adults.
[2] D. A. Baldwin and J. A. Baird. Discerning intentions in Cognition, 1999.
dynamic human action. Trends in Cognitive Sciences, 5(4), [23] J. R. Saffran, E. L. Newport, and R. N. Aslin. Word segmen-
2001. tation: The role of distributional cues. Journal of memory
[3] D. H. Ballard, M. M. Hayhoe, P. K. Pook, and R. P. N. Rao. and language, 35:606621, 1996.
Deictic codes for the embodiment of cognition. Behavioral [24] M. S. Seidenberg. Language acquisition and use: Learning
and Brain Sciences, 20:13111328, 1997. and applying probabilistic constraints. Science, 275:1599
[4] S. Baron-Cohen. Mindblindness: an essay on autism and 1603, March 1997.
theory of mind. MIT Press, Cambridge, 1995. [25] L. Smith. How to learn words: An associative crane. In
[5] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. R. Golinkoff and K. Hirsh-Pasek, editors, Breaking the word
Support vector clustering. Journal of machine Learning Re- learning barrier, pages 5180. Oxford: Oxford University
search, 2:125137, 2001. Press, 2000.
[6] P. Bloom. Intentionality and word learning. Trends in Cog- [26] J. Snedeker and J. Trueswell. Using prosody to avoid ambi-
nitive Sciences, 1(1):912, 1997. guity: Effects of speaker awareness and referential context.
[7] P. Bloom. How children learn the meanings of words. The Journal of Memory and Language, 48:103130, 2003.
MIT Press, Cambridge, MA, 2000. [27] M. Tomasello. Perceiving intentions and learning words in
[8] P. F. Brown, S. Pietra, V. Pietra, and R. L. Mercer. The the second year of life. In M. Bowerman and S. Levin-
mathematics of statistical machine translation:parameter es- son, editors, Language Acquisition and Conceptual Devel-
timation. Computational Linguistics, 19(2), 1993. opment. Cambridge University Press, 2000.
[9] G. Butterworth. The ontogeny and phylogeny of joint vi- [28] C. Yu and D. H. Ballard. A multimodal learning inter-
sual attention. In A. Whiten, editor, Natural theories of face for grounding spoken language in sensory perceptions.
mind: Evolution, development, and simulation of everyday In Fifth International Conference on Multimodal Interface.
mindreading. Blackwell, Oxford, England, 1991. ACM Press, 2003.
[10] N. Chomsky. Aspects of the theory of syntax. MIT Press, [29] C. Yu, D. H. Ballard, and R. N. Aslin. The role of em-
1965. bodied intention in early lexical acquisition. In Proceedings
[11] J. Elman, E. Bates, M. Johnson, A. Karmiloff-Smith, the Twenty Fifth Cognitive Science Society Annual Meetings,
D. Parisi, and K. Plunkett. Rethinking innateness: A con- Boston, MA, July 2003.
nectionist perspective on development. MIT Press, 1996.