Contextual Correlates of Synonymy

HERBERT RUBENSTEIN AND JOHN B. GOODENOUGH
Decision Sciences Laboratory, L. G. Hanscom Field, Bedford, Massachusetts

Experimental corroboration was obtained for the hypothesis that the proportion of words common to the contexts of word A and to the contexts of word B is a function of the degree to which A and B are similar in meaning. The tests were carried out for variously defined contexts. The shapes of the functions, however, indicate that similarity of context is reliable as criterion only for detecting pairs of words that are very similar in meaning.

Introduction

This study is concerned with the relationship between similarity of context and similarity of meaning (synonymy). More specifically, we asked how the proportion of words common to contexts containing word A and to contexts containing word B was related to the degree to which A and B were similar in meaning. The existence of this positive relationship is a basic assumption of statistical association methods in information retrieval [6]. These methods assume that pairs of words which have many contexts in common are semantically closely related and consequently that items of information in which either word occurs will tend to be relevant to the same query. To the extent that this assumption is invalid statistical methods cannot be reliable; our purpose here is to investigate the validity of this assumption.

The more sophisticated of these methods distinguish various orders of association. Thus if the sentence is taken as the unit of context any two words A and B which occur in the same sentence are first-order associates. If some other word C occurs with B in a sentence in which A is absent, then A and C are said to be second-order associates since both have B as a first-order associate. Our concern in this paper is the assumption that the degree of semantic similarity existing between a pair of words is indicated by the frequency with which they stand in first-order association to the same words. More specifically we test what is actually a converse of the assumption; namely, that the more similar in meaning two words are, the greater will be the number of first-order associates they have in common.

It is not a new notion that words which are similar in meaning occur in similar contexts. Joos [3] defined the meaning of a morpheme as "the set of conditional probabilities of its occurrence in context with all other morphemes." From this definition it was a slight step to Harris' view [2]: "If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the [contextual] distributions of A and B are more different than the distributions of A and C." While it seems evident that words which are very similar in meaning will indeed be very similar in contextual distribution and, contrarily, words which are completely dissimilar in meaning will be very dissimilar in their contextual distributions, the nature of the relationship for intermediate values of synonymy cannot be even conjectured.

The general approach is to compare contexts from a corpus of sentences especially generated for a small set of words (hereafter called 'theme words') with human judgments of the degree of synonymy existing between pairs of these theme words. Similarity of contextual distribution is measured by the overlap between the sentence sets of the themes being compared. However context is defined in four different ways, each definition specifying a different set of conditions under which words in the sentence set are to be considered first-order associates to the theme word. Thus the context is defined as:
(1) all words within a sentence,
(2) all content words within a sentence that fall in a certain frequency range according to the Lorge Magazine Count [8],
(3) all content words which stand in closest proximity to the theme word in the grammatical schema of each sentence,
(4) all words judged to be most closely associated with the theme, where words A and C are considered to be closely associated if the occurrence of word C is judged as strongly implying the occurrence of A and vice versa.

Using 65 pairs of words (which range from highly synonymous pairs to semantically unrelated pairs) the

This is ESD-TR-65-218 of the AF Electronic Systems Division, Air Force Systems Command. This research was performed at the Decision Sciences Laboratory as part of Project 2806, Man-Computer Dynamic Interaction. Further reproduction is authorized to satisfy the needs of the U. S. Government.
Some of the data analysis and the writeup were carried out while the authors were on leave at Harvard University. Since the first author was a Research Fellow at the Center for Cognitive Studies, financial support under ARPA SD-187 should be acknowledged.
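The distinction between first-order and second-order associates drawn in the Introduction can be illustrated with a short sketch. It assumes that a sentence is represented as a list of words and that a second-order associate is reported only if it is not already a first-order associate; both choices, the function names, and the two sample sentences are illustrative and do not come from the study.

```python
def first_order(word, sentences):
    """Words that occur in at least one sentence together with `word`."""
    associates = set()
    for sentence in sentences:
        if word in sentence:
            associates.update(w for w in sentence if w != word)
    return associates


def second_order(word, sentences):
    """Words C that occur with some first-order associate B of `word`
    in a sentence from which `word` itself is absent."""
    firsts = first_order(word, sentences)
    result = set()
    for sentence in sentences:
        if word in sentence:
            continue  # only sentences in which `word` is absent qualify
        words = set(sentence)
        if words & firsts:
            # Assumption: words that are already first-order associates
            # are not reported again as second-order associates.
            result.update(words - firsts)
    return result


# Hypothetical sentences, reduced to bare word lists.
sentences = [
    ["gem", "priceless", "ray"],
    ["jewel", "priceless", "box"],
]
print(first_order("gem", sentences))   # the set containing 'priceless' and 'ray'
print(second_order("gem", sentences))  # the set containing 'jewel' and 'box'
```

Here "gem" and "jewel" never share a sentence, yet they become second-order associates because both have "priceless" as a first-order associate.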
The plots clearly support the hypothesis that the more similar words are in meaning, the more similar they are in their contextual distributions. It is apparent however that this relationship is strongest for the highly synonymous pairs, i.e., those with a judged synonymy greater than 3.0. For the intermediate values 1.0-2.7 overlap is almost constant. This constancy may mean that the subjects did not react to the distance 1.0-3.0 on the scale as they did to the distance 2.0-4.0. However, subjects' judgments were reliable in the range 1.0-3.0 since the correlation between subjects' judgments in this interval on two trials

$\mathrm{Gem}_v \cap \mathrm{Jewel}_v$: priceless, jewelry, ray
$N(\mathrm{Gem}_v) = 3$, $N(\mathrm{Jewel}_v) = 4$
$$M_v = \frac{N(\mathrm{Gem}_v \cap \mathrm{Jewel}_v)}{\min[N(\mathrm{Gem}_v),\, N(\mathrm{Jewel}_v)]} = \frac{3}{3} = 1.00$$
$\mathrm{Gem}_k \cap \mathrm{Jewel}_k$: priceless, priceless, jewelry, ray
$N(\mathrm{Gem}_k) = 5$, $N(\mathrm{Jewel}_k) = 6$
$$M_k = \frac{N(\mathrm{Gem}_k \cap \mathrm{Jewel}_k)}{\min[N(\mathrm{Gem}_k),\, N(\mathrm{Jewel}_k)]} = \frac{4}{5} = .80$$
A slightly different formula replaced the denominator of $M_v$ with $N(A_v) + N(B_v) - N(A_v \cap B_v)$ and yielded qualitatively similar results.
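The Gem/Jewel arithmetic above can be checked with a minimal sketch of the two overlap measures. The two word lists are hypothetical: they were chosen only so that their type counts (3 and 4), token counts (5 and 6) and shared words match the figures in the example. The function names are likewise illustrative, and non-empty word lists are assumed.

```python
from collections import Counter


def overlap_types(a_tokens, b_tokens):
    """M_v: shared word types divided by the smaller type count."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / min(len(a), len(b))


def overlap_tokens(a_tokens, b_tokens):
    """M_k: shared word tokens divided by the smaller token count.
    Counter & Counter keeps the minimum count of each word, so a word
    repeated in both lists is counted as many times as it is shared."""
    shared = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    return shared / min(len(a_tokens), len(b_tokens))


def overlap_types_union(a_tokens, b_tokens):
    """Variant of M_v with denominator N(A_v) + N(B_v) - N(A_v n B_v)."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / (len(a) + len(b) - len(a & b))


# Hypothetical contexts consistent with the counts in the example.
gem = ["priceless", "priceless", "jewelry", "ray", "ray"]
jewel = ["priceless", "priceless", "jewelry", "ray", "box", "box"]

print(overlap_types(gem, jewel))        # 1.0
print(overlap_tokens(gem, jewel))       # 0.8
print(overlap_types_union(gem, jewel))  # 0.75
```

Dividing by the smaller of the two counts means that a context wholly contained in a larger one scores 1.00, whereas the union-based variant penalizes the size difference.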
FIG. 2. Effect of leveling. The conditions are the same as in Figure 1 except that the contexts were morphologically leveled before the overlap was calculated. (Horizontal axis: judged synonymy.)

TABLE 4. CORRELATIONS (r) BETWEEN OVERLAP MEASURES FOR UNRESTRICTED CONTEXT

          M_ku    M_vl    M_kl
M_vu      .87     .96     .87
M_ku              .87     .98
M_vl                      .88

u = unleveled; l = leveled

511 pairs could not be submitted for judgment to subjects for practical reasons, but it seemed clear that these pairs were less synonymous than the pairs which were judged to have synonymy values greater than 3.0. Thus using the overlap values for 556 low synonymy pairs we determined a critical value which was exceeded by 1 percent or 5 percent of these pairs and then determined the proportion of the 20 high synonymy pairs that exceeded the critical value.

Table 5 shows the percentages of correct inferences (and critical values) for overlap based on unrestricted context as well as on the other contexts studied. Note that the percentage of correct inferences is about the same when the context is limited to content words.

Context Defined by Word Frequency. It might be
Context Definition

Unrestricted
  CW & FW           75 (34)    ?5 (32)    75 (54)    95 (52)    70 (38)    90 (30)    80 (57)    85 (55)
  CW alone          75 (22)    95 (19)    90 (20)    95 (17)    ...
Lorge Frequency
  0-150             85 (9.1)   90 (6.6)   85 (8.1)   95 (6.4)   70 (8.5)   95 (6.3)   80 (7.5)   85 (5.6)
  151-1000          70 (22)    80 (20)    75 (20)    95 (16)    50 (25)    70 (23)    85 (22)    95 (17)
  1001 and above     5 (59)    15 (56)    35 (49)    60 (4?)     5 (62)    35 (57)    55 (48)    75 (45)
  FW                 0 (82)    10 (79)    45 (8?)    55 (80)     5 (82)     5 (79)    35 (8?)    50 (81)
Grammatical         90 (15)    95 (12)    95 (13)    95 (11)    ...
Association         90 (14)   100 (4.8)   95 (11)   100 (3.7)   ...

The figures in parentheses are the overlap values associated with the given percentages.
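The inference procedure behind these percentages (a critical overlap value exceeded by 1 or 5 percent of the low-synonymy pairs, then the share of high-synonymy pairs exceeding it) might be sketched as follows. How ties and fractional counts were treated in the study is not stated, so the flooring rule, the strict "greater than" comparison, the function names and the sample data are all assumptions.

```python
import math


def critical_value(low_overlaps, error_rate):
    """Overlap value exceeded by at most `error_rate` of the low-synonymy
    pairs. Assumes a non-empty list and 0 <= error_rate < 1."""
    ranked = sorted(low_overlaps, reverse=True)
    # Number of low-synonymy pairs allowed to exceed the critical value.
    allowed = math.floor(error_rate * len(ranked) + 1e-9)
    return ranked[allowed]


def percent_correct(high_overlaps, critical):
    """Percentage of high-synonymy pairs whose overlap exceeds `critical`."""
    hits = sum(1 for value in high_overlaps if value > critical)
    return 100.0 * hits / len(high_overlaps)


# Hypothetical overlap values: 100 low-synonymy pairs, 4 high-synonymy pairs.
low = [i / 100 for i in range(100)]   # 0.00, 0.01, ..., 0.99
high = [0.96, 0.50, 0.99, 0.94]

crit = critical_value(low, 0.05)
print(crit)                         # 0.94
print(percent_correct(high, crit))  # 50.0
```

With these made-up numbers exactly five of the hundred low-synonymy pairs lie above 0.94, so an inference of high synonymy from overlap above that value would be wrong for 5 percent of the low-synonymy pairs.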
FIG. 3. Effect of partitioning context into frequency classes. Each point represents one of 65 pairs of theme words. The contexts were partitioned into four classes: function words (FW) and three content-word classes. The content-word classes represent three intervals according to the Lorge Magazine Count (frequency of occurrence in 4.5 million tokens). Each curve represents the overlap obtained for one of the four partitions of context. (Panels: M_v and M_k. Legend: 0-150, 151-1000, 1000+, FW. Horizontal axis: judged synonymy, 0 to 4.)
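The partition of a context into these four classes can be sketched as below. The function-word list and the frequency counts are hypothetical stand-ins, since the Lorge Magazine Count is not reproduced here, and placing unlisted words in the lowest class is an assumption of this sketch rather than a rule taken from the study.

```python
from collections import defaultdict

# Hypothetical data: a tiny function-word list and made-up frequency counts.
FUNCTION_WORDS = {"the", "of", "a", "is", "in"}
LORGE_COUNT = {"priceless": 40, "jewelry": 120, "ray": 600, "house": 2500}


def frequency_class(word):
    """Assign a word to FW or to one of the three content-word classes."""
    if word in FUNCTION_WORDS:
        return "FW"
    count = LORGE_COUNT.get(word, 0)  # assumption: unlisted words count as 0
    if count <= 150:
        return "0-150"
    if count <= 1000:
        return "151-1000"
    return "1001 and above"


def partition_context(words):
    """Split a context (a list of words) into the four frequency classes."""
    classes = defaultdict(list)
    for word in words:
        classes[frequency_class(word)].append(word)
    return dict(classes)


print(partition_context(["the", "priceless", "ray", "house", "of"]))
```

Overlap would then be computed separately within each class, giving one curve per partition as in the figure.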
connection to the theme words; however, even this conjecture was not borne out by an analysis of the data.

The results shown in Figure 3 are essentially unchanged if leveled data are used.

The percentages of correct inferences for these four Lorge frequency intervals are shown in Table 5. It is evident that content words with a frequency greater than 1000 are relatively useless for discriminating highly synonymous pairs. The statistical significance of the M_k slope for function words permits as high as 55 percent correct inference with 5 percent error. Apparently the small effect evident in Figure 3 has some inference power although there is no explanation why this is so.

Context Grammatically Defined. In this limitation on context the overlap was restricted to those words which were grammatically most closely related to the theme. Content words which were grammatically dependent upon the theme word, which was always a noun, or upon a pronoun standing for the theme were considered related. Thus the following were included: (1) nouns, adjectives or participles describing the theme, (2) the verb if the theme was the subject (and predicate noun if the verb was the copula or a copula substitute), and (3) the noun in a prepositional or possessive phrase modifying the theme. Also included were the following in the event that the theme word was the grammatically dependent form: (1) the verb of which the theme was the object, (2) the noun to which the theme was the appositive, and (3) the

FIG. 4. Effect of limiting the context to words grammatically related to the theme. Each point represents one of 65 pairs of theme words. (Vertical axis: M_v. Horizontal axis: judged synonymy.)
FIG. 6. Proportion of content words at various positions in the sentence that were judged to be in high association relationship with the theme. The negative values on the abscissa are the number of word places before the theme while the positive values are the number of word places after the theme. (Horizontal axis: distance from theme, -16 to 16.)

REFERENCES
1. GIULIANO, V. E., AND JONES, P. E. Studies for the design of an English command and control language system. ESD-TDR-63-673, Arthur D. Little, Inc., Cambridge, Mass., 1963.
2. HARRIS, Z. S. Distributional structure. Word 10 (1954), 146-162.
3. JOOS, M. Description of language design. J. Acoust. Soc. Amer. 22 (1950), 701-708.
4. KUHNS, J. L. The continuum of coefficients of association. In M. E. Stevens, L. Heilprin, & V. E. Giuliano (Eds.), Proceedings of the Symposium on Statistical Association Methods for Mechanized Documentation, National Bureau of Standards, US Government Printing Off., Washington, D. C. In press.
5. MILLER, G. A., NEWMAN, E. B., AND FRIEDMAN, E. A. Length-frequency statistics for written English. Inform. and Contr. 1 (1958), 370-389.
6. STEVENS, M. E., HEILPRIN, L., AND GIULIANO, V. E. (EDS.) Proceedings of the Symposium on Statistical Association Methods for Mechanized Documentation, National Bureau of Standards, US Government Printing Off., Washington, D. C. In press.
7. STILES, H. E. The association factor in information retrieval. J. ACM 8 (1961), 271-279.
8. THORNDIKE, E. L., AND LORGE, I. The Teacher's Word Book of 30,000 Words. Columbia U. Press, New York, 1944.