
Open Information Extraction using Wikipedia

Fei Wu (University of Washington, Seattle, WA, USA; wufei@cs.washington.edu)
Daniel S. Weld (University of Washington, Seattle, WA, USA; weld@cs.washington.edu)

Abstract

Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform?

This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors — using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

1 Introduction

The problem of information extraction (IE), generating relational data from natural-language text, has received increasing attention in recent years. A large, high-quality repository of extracted tuples can potentially benefit a wide range of NLP tasks such as question answering, ontology learning, and summarization. The vast majority of IE work uses supervised learning of relation-specific examples. For example, the WebKB project (Craven et al., 1998) used labeled examples of the courses-taught-by relation to induce rules for identifying additional instances of the relation. While these methods can achieve high precision and recall, they are limited by the availability of training data and are unlikely to scale to the thousands of relations found in text on the Web.

An alternative paradigm, Open IE, pioneered by the TextRunner system (Banko et al., 2007), aims to handle an unbounded number of relations and to run quickly enough to process Web-scale corpora. Domain independence is achieved by extracting the relation name as well as its two arguments. Most open IE systems use self-supervised learning, in which automatic heuristics generate labeled data for training the extractor. For example, TextRunner uses a small set of hand-written rules to heuristically label training examples from sentences in the Penn Treebank.

This paper presents WOE (Wikipedia-based Open Extractor), the first system that autonomously transfers knowledge from the efforts of the editors who collaboratively build Wikipedia in order to train an open information extractor. Specifically, WOE generates relation-specific training examples by matching infobox[1] attribute values to corresponding sentences (as done in Kylin (Wu and Weld, 2007) and Luchs (Hoffmann et al., 2010)), but WOE abstracts these examples to relation-independent training data to learn an unlexicalized extractor, akin to that of TextRunner. WOE can operate in two modes: when restricted to shallow features like part-of-speech (POS) tags, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

[1] An infobox is a set of tuples summarizing the key attributes of the subject in a Wikipedia article. For example, the infobox in the article on "Sweden" contains attributes like Capital, Population and GDP.

We present a thorough experimental evaluation, making the following contributions:
• We present WOE, a new approach to open IE that uses Wikipedia for self-supervised learning of unlexicalized extractors. Compared with TextRunner (the state of the art) on three corpora, WOE yields between 79% and 90% improved F-measure — generalizing well beyond Wikipedia.

• Using the same learning algorithm and features as TextRunner, we compare four different ways to generate positive and negative training examples with TextRunner's method, concluding that our Wikipedia heuristic is responsible for the bulk of WOE's improved accuracy.

• The biggest win arises from using parser features. Previous work (Jiang and Zhai, 2007) concluded that parser-based features are unnecessary for information extraction, but that work assumed the presence of lexical features. We show that abstract dependency paths are a highly informative feature when performing unlexicalized extraction.
2 Problem Definition

An open information extractor is a function from a document, d, to a set of triples, {⟨arg1, rel, arg2⟩}, where the args are noun phrases and rel is a textual fragment indicating an implicit, semantic relation between the two noun phrases. The extractor should produce one triple for every relation stated explicitly in the text, but is not required to infer implicit facts. In this paper, we assume that all relational instances are stated within a single sentence. Note the difference between open IE and the traditional approaches (e.g., as in WebKB), where the task is to decide whether some pre-defined relation holds between (two) arguments in the sentence.

We wish to learn an open extractor without direct supervision, i.e. without annotated training examples or hand-crafted patterns. Our input is Wikipedia, a collaboratively-constructed encyclopedia.[2] As output, WOE produces an unlexicalized and relation-independent open extractor. Our objective is an extractor which generalizes beyond Wikipedia, handling other corpora such as the general Web.

[2] We also use DBpedia (Auer and Lehmann, 2007) as a collection of conveniently parsed Wikipedia infoboxes.
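To make this definition concrete, here is a minimal sketch in code; the class name, field names, and type alias are illustrative choices of ours, not part of WOE:

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass(frozen=True)
    class Triple:
        """One open-IE extraction: two noun phrases and the text fragment relating them."""
        arg1: str   # noun phrase
        rel: str    # textual fragment indicating an implicit semantic relation
        arg2: str   # noun phrase

    # An open information extractor maps a document d to a set of triples,
    # one for every relation stated explicitly in the text (implicit facts are not inferred).
    OpenExtractor = Callable[[str], Set[Triple]]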
3 Wikipedia-based Open IE

The key idea underlying WOE is the automatic construction of training examples by heuristically matching Wikipedia infobox values and corresponding text; these examples are used to generate an unlexicalized, relation-independent (open) extractor. As shown in Figure 1, WOE has three main components: preprocessor, matcher, and learner.

[Figure 1: Architecture of WOE — the Preprocessor (Sentence Splitting, NLP Annotating, Synonyms Compiling) feeds the Matcher (Primary Entity Matching, Sentence Matching), which produces Triples used by the Learner (Pattern Classifier over Parser Features; CRF Extractor over Shallow Features); diagram not reproduced.]

3.1 Preprocessor

The preprocessor converts the raw Wikipedia text into a sequence of sentences, attaches NLP annotations, and builds synonym sets for key entities. The resulting data is fed to the matcher, described in Section 3.2, which generates the training set.

Sentence Splitting: The preprocessor first renders each Wikipedia article into HTML, then splits the article into sentences using OpenNLP.

NLP Annotation: As we discuss fully in Section 4 (Experiments), we consider several variations of our system; one version, WOEparse, uses parser-based features, while another, WOEpos, uses shallow features like POS tags, which may be computed more quickly. Depending on which version is being trained, the preprocessor uses OpenNLP to supply POS tags and NP-chunk annotations — or uses the Stanford Parser to create a dependency parse. When parsing, we force each hyperlinked anchor text to be a single token by connecting its words with an underscore; this transformation improves parsing performance in many cases.
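As a small illustration of the anchor-text step, the sketch below joins each link's anchor into a single underscore-connected token; it assumes the links are still in [[...]] wiki markup and uses function names of our own choosing:

    import re

    def collapse_anchor(anchor_text: str) -> str:
        """Join a multi-word anchor into one token, e.g.
        'University of Washington' -> 'University_of_Washington'."""
        return "_".join(anchor_text.split())

    def collapse_anchors_in_wikitext(wikitext: str) -> str:
        """Rewrite [[target|anchor]] / [[anchor]] links so the parser sees each
        anchor as a single word (simple, non-nested link markup assumed)."""
        def repl(match):
            target, _, anchor = match.group(1).partition("|")
            return collapse_anchor(anchor or target)
        return re.sub(r"\[\[([^\]]+)\]\]", repl, wikitext)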
Compiling Synonyms: As a final step, the preprocessor builds sets of synonyms to help the matcher find sentences that correspond to infobox relations. This is useful because Wikipedia editors frequently use multiple names for an entity; for example, in the article titled "University of Washington" the token "UW" is widely used to refer to the university. Additionally, attribute values are often described differently within the infobox than they are in surrounding text. Without knowledge of these synonyms, it is impossible to construct good matches. Following (Wu and Weld, 2007; Nakayama and Nishio, 2008), the preprocessor uses Wikipedia redirection pages and backward links to automatically construct synonym sets. Redirection pages are a natural choice, because they explicitly encode synonyms; for example, "USA" is redirected to the article on the "United States." Backward links for a Wikipedia entity such as the "Massachusetts Institute of Technology" are hyperlinks pointing to this entity from other articles; the anchor text of such links (e.g., "MIT") forms another source of synonyms.
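A minimal sketch of this compilation step, assuming the redirect pairs and backward-link anchors have already been collected from a Wikipedia dump (the function name and input format are ours):

    from collections import defaultdict

    def compile_synonyms(redirects, backlink_anchors):
        """Build a map from entity title to its synonym set.

        redirects:        iterable of (alias_title, target_title) pairs,
                          e.g. ("USA", "United States").
        backlink_anchors: iterable of (anchor_text, target_title) pairs taken
                          from hyperlinks in other articles, e.g.
                          ("MIT", "Massachusetts Institute of Technology").
        """
        synonyms = defaultdict(set)
        for alias, target in redirects:
            synonyms[target].add(alias)
        for anchor, target in backlink_anchors:
            synonyms[target].add(anchor)
        for target in synonyms:
            synonyms[target].add(target)   # an entity is trivially its own synonym
        return synonyms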
3.2 Matcher

The matcher constructs training data for the learner component by heuristically matching attribute-value pairs from Wikipedia articles containing infoboxes with corresponding sentences in the article. Given the article on "Stanford University," for example, the matcher should associate ⟨established, 1891⟩ with the sentence "The university was founded in 1891 by . . . " Given a Wikipedia page with an infobox, the matcher iterates through all its attributes looking for a unique sentence that contains references to both the subject of the article and the attribute value; these noun phrases will be annotated arg1 and arg2 in the training set. The matcher considers a sentence to contain the attribute value if the value or its synonym is present. Matching the article subject, however, is more involved.
Matching Primary Entities: In order to match shorthand terms like "MIT" with more complete names, the matcher uses an ordered set of heuristics like those of (Wu and Weld, 2007; Nguyen et al., 2007):

• Full match: strings matching the full name of the entity are selected.

• Synonym set match: strings appearing in the entity's synonym set are selected.

• Partial match: strings matching a prefix or suffix of the entity's name are selected. If the full name contains punctuation, only a prefix is allowed. For example, "Amherst" matches "Amherst, Mass," but "Mass" does not.

• Patterns of "the <type>": The matcher first identifies the type of the entity (e.g., "city" for "Ithaca"), then instantiates the pattern to create the string "the city." Since the first sentence of most Wikipedia articles is stylized (e.g. "The city of Ithaca sits . . . "), a few patterns suffice to extract most entity types.

• The most frequent pronoun: The matcher assumes that the article's most frequent pronoun denotes the primary entity, e.g., "he" for the page on "Albert Einstein." This heuristic is dropped when "it" is most common, because the word is used in too many other ways.

When there are multiple matches to the primary entity in a sentence, the matcher picks the one which is closest to the matched infobox attribute value in the parser dependency graph.

Matching Sentences: The matcher seeks a unique sentence to match the attribute value. To produce the best training set, the matcher performs three filterings. First, it skips the attribute completely when multiple sentences mention the value or its synonym. Second, it rejects the sentence if the subject and/or attribute value are not heads of the noun phrases containing them. Third, it discards the sentence if the subject and the attribute value do not appear in the same clause (or in parent/child clauses) in the parse tree.

Since Wikipedia's Wikimarkup language is semantically ambiguous, parsing infoboxes is surprisingly complex. Fortunately, DBpedia (Auer and Lehmann, 2007) provides a cleaned set of infoboxes from 1,027,744 articles. The matcher uses this data for attribute values, generating a training dataset with a total of 301,962 labeled sentences.
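The sketch below restates the primary-entity heuristics as an ordered cascade over a candidate noun phrase. It is a simplification: the signature, the punctuation test, and the helper inputs (entity_type, frequent_pronoun) are our assumptions, and the real matcher operates over parsed noun phrases rather than raw strings:

    import string

    def matches_primary_entity(np, entity, synonyms, entity_type=None, frequent_pronoun=None):
        """Ordered heuristics for deciding whether noun phrase `np` refers to the
        article's primary entity. `entity_type` (e.g. "city") and
        `frequent_pronoun` (e.g. "he") are assumed to come from preprocessing."""
        np = np.strip()
        if not np:
            return False
        if np == entity:                                        # 1. full match
            return True
        if np in synonyms:                                      # 2. synonym-set match
            return True
        if entity.startswith(np):                               # 3. partial match: prefix
            return True
        has_punct = any(c in string.punctuation for c in entity)
        if not has_punct and entity.endswith(np):               #    suffix only if name has no punctuation
            return True
        if entity_type and np.lower() == "the " + entity_type.lower():
            return True                                         # 4. "the <type>" pattern
        if frequent_pronoun and frequent_pronoun.lower() != "it" \
                and np.lower() == frequent_pronoun.lower():
            return True                                         # 5. most frequent pronoun (unless "it")
        return False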
3.3 Learning Extractors

We learn two kinds of extractors, one (WOEparse) using features from dependency-parse trees and the other (WOEpos) limited to shallow features like POS tags. WOEparse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation. In contrast, WOEpos (like TextRunner) trains a conditional random field (CRF) to output certain text between noun phrases when the text denotes such a relation. Neither extractor uses individual words or lexical information for features.

3.3.1 Extraction with Parser Features

Despite some evidence that parser-based features have limited utility in IE (Jiang and Zhai, 2007), we hoped dependency paths would improve precision on long sentences.

Shortest Dependency Path as Relation: Unless otherwise noted, WOE uses the Stanford Parser to create dependencies in the "collapsedDependency" format. Dependencies involving prepositions and conjuncts, as well as information about the referent of relative clauses, are collapsed to get direct dependencies between content words. As noted in (de Marneffe and Manning, 2008), this collapsed format often yields simplified patterns which are useful for relation extraction. Consider the sentence:

    Dan was not born in Berkeley.

The Stanford Parser dependencies are:

    nsubjpass(born-4, Dan-1)
    auxpass(born-4, was-2)
    neg(born-4, not-3)
    prep_in(born-4, Berkeley-6)

where each atomic formula represents a binary dependence from the dependent token to the governor token.

These dependencies form a directed graph, ⟨V, E⟩, where each token is a vertex in V, and E is the set of dependencies. For any pair of tokens, such as "Dan" and "Berkeley", we use the shortest connecting path to represent the possible relation between them:

    Dan --nsubjpass--> born <--prep_in-- Berkeley
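A minimal sketch of computing such a path: the collapsed dependencies are viewed as an undirected graph and the shortest connecting route is found by breadth-first search. The dependency triples are hand-coded for the example sentence; the function and variable names are ours:

    from collections import deque

    # Collapsed Stanford dependencies for "Dan was not born in Berkeley",
    # written as (governor, relation, dependent) triples.
    DEPS = [("born", "nsubjpass", "Dan"),
            ("born", "auxpass", "was"),
            ("born", "neg", "not"),
            ("born", "prep_in", "Berkeley")]

    def core_path(deps, src, dst):
        """Shortest path between two tokens in the dependency graph (BFS over an
        undirected view of the edges), with edge labels kept in the path."""
        adj = {}
        for gov, rel, dep in deps:
            adj.setdefault(dep, []).append((gov, "--%s-->" % rel))   # dependent -> governor
            adj.setdefault(gov, []).append((dep, "<--%s--" % rel))   # governor -> dependent
        queue, seen = deque([[src]]), {src}
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            for nxt, edge in adj.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [edge, nxt])
        return None

    print(core_path(DEPS, "Dan", "Berkeley"))
    # ['Dan', '--nsubjpass-->', 'born', '<--prep_in--', 'Berkeley']

The arrow direction follows the convention above: an edge points from the dependent token to its governor.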
We call such a path a corePath. While we will see that corePaths are useful for indicating when a relation exists between tokens, they don't necessarily capture the semantics of that relation. For example, the path shown above doesn't indicate the existence of negation! In order to capture the meaning of the relation, the learner augments the corePath into a tree by adding all adverbial and adjectival modifiers as well as dependencies like "neg" and "auxpass". We call the result an expandPath, as shown below:

    [Figure omitted: the expandPath for the example sentence.]

WOE traverses the expandPath with respect to the token order of the original sentence when outputting the final expression of rel.
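A minimal sketch of that augmentation, reusing DEPS and core_path() from the previous sketch; the whitelist of modifier relations is our reading of the description above:

    AUGMENT_RELS = {"neg", "auxpass", "advmod", "amod"}   # modifiers kept in the expandPath

    def expand_path(deps, path):
        """Attach modifiers such as 'neg' and 'auxpass' to the tokens on a
        corePath, returning the expandPath as a set of (gov, rel, dep) edges."""
        on_path = {item for item in path if not item.startswith(("-", "<"))}   # tokens, not edge labels
        edges = set()
        for gov, rel, dep in deps:
            if gov in on_path and dep in on_path:          # the corePath's own edges
                edges.add((gov, rel, dep))
            elif gov in on_path and rel in AUGMENT_RELS:   # modifiers hanging off path tokens
                edges.add((gov, rel, dep))
        return edges

    path = core_path(DEPS, "Dan", "Berkeley")
    print(sorted(expand_path(DEPS, path)))
    # [('born', 'auxpass', 'was'), ('born', 'neg', 'not'),
    #  ('born', 'nsubjpass', 'Dan'), ('born', 'prep_in', 'Berkeley')]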
Building a Database of Patterns: For each of the 301,962 sentences selected and annotated by the matcher, the learner generates a corePath between the tokens denoting the subject and the infobox attribute value. Since we are interested in eventually extracting "subject, relation, object" triples, the learner rejects corePaths that don't start with subject-like dependencies, such as nsubj, nsubjpass, partmod and rcmod. This leads to a collection of 259,046 corePaths.

To combat data sparsity and improve learning performance, the learner further generalizes the corePaths in this set to create a smaller set of generalized-corePaths. The idea is to eliminate distinctions which are irrelevant for recognizing (domain-independent) relations. Lexical words in corePaths are replaced with their POS tags. Further, all noun POS tags and "PRP" are abstracted to "N", all verb POS tags to "V", all adverb POS tags to "RB" and all adjective POS tags to "J". Preposition dependencies such as "prep_in" are generalized to "prep". Taking the corePath "Dan --nsubjpass--> born <--prep_in-- Berkeley" as an example, its generalized-corePath is "N --nsubjpass--> V <--prep-- N". We call such a generalized-corePath an extraction pattern. In total, WOE builds a database (named DBp) of 29,005 distinct patterns, and each pattern p is associated with a frequency — the number of matching sentences containing p. Specifically, 311 patterns have f_p ≥ 100 and 3,519 patterns have f_p ≥ 5.
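A minimal sketch of the abstraction from a corePath to an extraction pattern; the POS tags are hand-coded here (in WOE they come from the preprocessor), and the function names are ours:

    import re

    POS = {"Dan": "NNP", "born": "VBN", "Berkeley": "NNP"}   # tags assumed given by the preprocessor

    def abstract_tag(tag):
        """Collapse Penn Treebank tags onto the coarse classes used in patterns."""
        if tag.startswith("NN") or tag == "PRP":
            return "N"
        if tag.startswith("VB"):
            return "V"
        if tag.startswith("RB"):
            return "RB"
        if tag.startswith("JJ"):
            return "J"
        return tag

    def generalize(path, pos=POS):
        """corePath -> generalized-corePath (extraction pattern): lexical tokens
        become coarse POS classes, and 'prep_<word>' labels become 'prep'."""
        out = []
        for item in path:
            if item.startswith(("-", "<")):                    # an edge label
                out.append(re.sub(r"prep_\w+", "prep", item))
            else:                                              # a token
                out.append(abstract_tag(pos[item]))
        return " ".join(out)

    print(generalize(["Dan", "--nsubjpass-->", "born", "<--prep_in--", "Berkeley"]))
    # N --nsubjpass--> V <--prep-- N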
Learning a Pattern Classifier: Given the large number of patterns in DBp, we assume few valid open extraction patterns are left behind. The learner builds a simple pattern classifier, named WOEparse, which checks whether the generalized-corePath from a test triple is present in DBp, and computes the normalized logarithmic frequency as the probability:[3]

    w(p) = max(log(f_p) − log(f_min), 0) / (log(f_max) − log(f_min))

where f_max (54,274 in this paper) is the maximal pattern frequency in DBp, and f_min (set to 1 in this work) is the controlling threshold that determines the minimal frequency of a valid pattern.

[3] How to learn a more sophisticated weighting function is left as a future topic.

Take the previous sentence "Dan was not born in Berkeley" for example. WOEparse first identifies Dan as arg1 and Berkeley as arg2 based on NP-chunking. It then computes the corePath "Dan --nsubjpass--> born <--prep_in-- Berkeley" and abstracts it to p = "N --nsubjpass--> V <--prep-- N". It then queries DBp to retrieve the frequency f_p = 31,767 and assigns a probability of 0.95. Finally, WOEparse traverses the triple's expandPath to output the final expression ⟨Dan, was not born in, Berkeley⟩. As shown in the experiments on three corpora, WOEparse achieves an F-measure which is between 79% and 90% greater than TextRunner's.
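A minimal sketch of this scoring step; the two constants are the values reported above, and the function names are ours:

    import math

    F_MAX = 54274   # maximal pattern frequency observed in DBp
    F_MIN = 1       # minimal frequency of a valid pattern (the default used here)

    def pattern_weight(freq, f_max=F_MAX, f_min=F_MIN):
        """w(p) = max(log f_p - log f_min, 0) / (log f_max - log f_min)."""
        return max(math.log(freq) - math.log(f_min), 0.0) / (math.log(f_max) - math.log(f_min))

    def classify(pattern, db):
        """db maps extraction patterns to their frequencies; unknown patterns score 0."""
        return pattern_weight(db[pattern]) if pattern in db else 0.0

    print(round(pattern_weight(31767), 2))   # matches the worked example above: ~0.95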
[Figure 2 — P/R curves on WSJ, Web, and Wikipedia (precision vs. recall) for WOEparse, WOEpos, and TextRunner; plots not reproduced.]
Figure 2: WOEpos performs better than TextRunner, especially on precision. WOEparse dramatically improves performance, especially on recall.

3.3.2 Extraction with Shallow Features

WOEparse has a dramatic performance improvement over TextRunner. However, the improvement comes at the cost of speed — TextRunner runs about 30X faster by using only shallow features. Since high speed can be crucial when processing Web-scale corpora, we additionally learn a CRF extractor, WOEpos, based on shallow features like POS tags. In both cases, however, we generate training data from Wikipedia by matching sentences with infoboxes, while TextRunner used a small set of hand-written rules to label training examples from the Penn Treebank.

We use the same matching sentence set behind DBp to generate positive examples for WOEpos. Specifically, for each matching sentence, we label the subject and the infobox attribute value as arg1 and arg2 to serve as the ends of a linear CRF chain. Tokens involved in the expandPath are labeled as rel. Negative examples are generated from random noun-phrase pairs in other sentences when their generalized-corePaths are not in DBp.

WOEpos uses the same learning algorithm and selection of features as TextRunner: a second-order CRF chain model is trained with the Mallet package (McCallum, 2002). WOEpos's features include POS tags, regular expressions (e.g., for detecting capitalization, punctuation, etc.), and conjunctions of features occurring in adjacent positions within six words to the left and to the right of the current word.
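A minimal sketch of how one matched sentence becomes a labeled token sequence for the linear-chain CRF. The label names and the span representation are our assumptions; the text above only fixes that the two arguments anchor the ends of the chain and that expandPath tokens are labeled rel:

    def label_sequence(tokens, arg1_span, arg2_span, rel_indices):
        """tokens      : the sentence as a list of tokens
           arg1_span   : (start, end) token indices of the article subject
           arg2_span   : (start, end) token indices of the infobox attribute value
           rel_indices : indices of tokens on the expandPath between them"""
        labels = ["O"] * len(tokens)
        for i in range(*arg1_span):
            labels[i] = "ARG1"
        for i in range(*arg2_span):
            labels[i] = "ARG2"
        for i in rel_indices:
            if labels[i] == "O":
                labels[i] = "REL"
        return list(zip(tokens, labels))

    sent = "Dan was not born in Berkeley".split()
    print(label_sequence(sent, (0, 1), (5, 6), rel_indices=[1, 2, 3, 4]))
    # [('Dan', 'ARG1'), ('was', 'REL'), ('not', 'REL'),
    #  ('born', 'REL'), ('in', 'REL'), ('Berkeley', 'ARG2')]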
As shown in the experiments, WOEpos achieves an F-measure between 15% and 34% better than TextRunner's on three corpora, and this is mainly due to the increase in precision.

4 Experiments

We used three corpora for the experiments: WSJ from the Penn Treebank, Wikipedia, and the general Web. For each dataset, we randomly selected 300 sentences. Each sentence was examined by two people to label all reasonable triples. These candidate triples were mixed with pseudo-negative ones and submitted to Amazon Mechanical Turk for verification. Each triple was examined by 5 Turkers. We mark a triple's final label as positive when more than 3 Turkers marked it as positive.

4.1 Overall Performance Analysis

In this section, we compare the overall performance of WOEparse, WOEpos and TextRunner (shared by the Turing Center at the University of Washington). In particular, we answer the following questions: 1) How do these systems perform against each other? 2) How does performance vary w.r.t. sentence length? 3) How does extraction speed vary w.r.t. sentence length?

Overall Performance Comparison

The detailed P/R curves are shown in Figure 2. To take a closer look, for each corpus we randomly divided the 300 sentences into 5 groups and compared the F-measures of the three systems in Figure 3. We can see that:

• WOEpos is better than TextRunner, especially on precision. This is due to better training data from Wikipedia via self-supervision. Section 4.2 discusses this in more detail.

• WOEparse achieves the best performance, especially on recall. This is because the parser features help to handle complicated and long-distance relations in difficult sentences. In particular, WOEparse outputs 1.42 triples per sentence on average, while WOEpos outputs 1.05 and TextRunner outputs 0.75.

Note that we measure TextRunner's precision & recall differently than (Banko et al., 2007) did. Specifically, we compute the precision & recall based on all extractions, while Banko et al. counted only concrete triples, where arg1 is a proper noun, arg2 is a proper noun or date, and the frequency of rel is over a threshold.
Figure 3: WOEpos achieves an F-measure which is between 15% and 34% better than TextRunner's. WOEparse achieves an improvement between 79% and 90% over TextRunner. The error bar indicates one standard deviation.

Figure 4: WOEparse's F-measure decreases more slowly with sentence length than WOEpos and TextRunner, due to its better handling of difficult sentences using parser features.
Our experiments show that focusing on concrete triples generally improves precision at the expense of recall.[4] Of course, one can apply a concreteness filter to any open extractor in order to trade recall for precision.

[4] For example, consider the Wikipedia corpus. From our 300 test sentences, TextRunner extracted 257 triples (at 72.0% precision) but only 16 concrete triples (at 87.5% precision).
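A minimal sketch of such a concreteness filter; the frequency threshold and the two test predicates are placeholders of ours, not values given in the paper:

    def is_concrete(triple, rel_counts, is_proper_noun, is_date, min_rel_freq=10):
        """arg1 must be a proper noun, arg2 a proper noun or a date, and the
        relation string must occur at least min_rel_freq times overall."""
        arg1, rel, arg2 = triple
        if not is_proper_noun(arg1):
            return False
        if not (is_proper_noun(arg2) or is_date(arg2)):
            return False
        return rel_counts.get(rel, 0) >= min_rel_freq

    def concreteness_filter(triples, rel_counts, is_proper_noun, is_date, **kw):
        """Trade recall for precision by dropping non-concrete extractions."""
        return [t for t in triples
                if is_concrete(t, rel_counts, is_proper_noun, is_date, **kw)]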
The extraction errors made by WOEparse can be categorized into four classes. We illustrate them with the WSJ corpus. In total, WOEparse made 85 wrong extractions on WSJ, caused by: 1) an incorrect arg1 and/or arg2 from NP-chunking (18.6%); 2) an erroneous dependency parse from the Stanford Parser (11.9%); 3) inaccurate meaning (27.1%) — for example, ⟨she, is nominated by, President Bush⟩ is wrongly extracted from the sentence "If she is nominated by President Bush ...";[5] 4) a pattern inapplicable to the test sentence (42.4%).

[5] These kinds of errors might be excluded by monitoring whether sentences contain words such as 'if,' 'suspect,' 'doubt,' etc. We leave this as a topic for future work.

Note that WOEparse is worse than WOEpos in the low-recall region. This is mainly due to parsing errors (especially on long-distance dependencies), which mislead WOEparse into extracting false high-confidence triples. WOEpos does not suffer from such parsing errors and therefore has better precision on high-confidence extractions.

We noticed that TextRunner has a dip point in the low-recall region. There are two typical errors responsible for this. A sample error of the first type is ⟨Sources, sold, the company⟩, extracted from the sentence "Sources said he sold the company", where "Sources" is wrongly treated as the subject of the object clause. A sample error of the second type is ⟨this year, will star in, the movie⟩, extracted from the sentence "Coming up this year, Long will star in the new movie.", where "this year" is wrongly treated as part of a compound subject. Taking the WSJ corpus as an example, at the dip point with recall=0.002 and precision=0.059, these two types of errors account for 70% of all errors.

Extraction Performance vs. Sentence Length

We tested how the extractors' performance varies with sentence length; the results are shown in Figure 4. TextRunner and WOEpos perform well on short sentences, but their performance deteriorates quickly as sentences get longer. This is because long sentences tend to contain complicated and long-distance relations which are difficult for shallow features to capture. In contrast, WOEparse's performance decreases more slowly with sentence length. This is mainly because parser features are more useful for handling difficult sentences, and they help WOEparse to maintain good recall with only a moderate loss of precision.

Extraction Speed vs. Sentence Length

We also tested the extraction speed of the different extractors. We used Java to implement the extractors, and tested on a Linux platform with a 2.4GHz CPU and 4G memory. On average, it takes WOEparse 0.679 seconds to process a sentence, while TextRunner and WOEpos take only 0.022 seconds — about 30 times faster. The detailed extraction speed vs. sentence length is shown in Figure 5: TextRunner's and WOEpos's extraction time grows approximately linearly with sentence length, while WOEparse's extraction time grows quadratically (R² = 0.935) due to its reliance on parsing.
[Figure 5 — extraction time vs. sentence length; plot not reproduced.]
Figure 5: TextRunner's and WOEpos's running time seems to grow linearly with sentence length, while WOEparse's time grows quadratically.

4.2 Self-supervision with Wikipedia Results in Better Training Data

In this section, we consider how the process of matching Wikipedia infobox values to corresponding sentences results in better training data than the hand-written rules used by TextRunner.

To compare with TextRunner, we tested four different ways to generate training examples from Wikipedia for learning a CRF extractor. Specifically, positive and/or negative examples are selected by TextRunner's hand-written rules (tr for short), by WOE's heuristic of matching sentences with infoboxes (w for short), or randomly (r for short). We use CRF+h1−h2 to denote a particular approach, where "+" means positive samples, "−" means negative samples, and hi ∈ {tr, w, r}. In particular, "+w" results in 221,205 positive examples based on the matching sentence set.[6] All extractors are trained using about the same number of positive and negative examples. In contrast, TextRunner was trained with 91,687 positive examples and 96,795 negative examples generated from the WSJ portion of the Penn Treebank.

[6] This number is smaller than the total number of corePaths (259,046) because we require arg1 to appear before arg2 in a sentence — as specified by TextRunner.

The CRF extractors are trained using the same learning algorithm and feature selection as TextRunner. The detailed P/R curves are in Figure 6, showing that using the WOE heuristics to label positive examples gives the biggest performance boost. CRF+tr−tr (trained using TextRunner's heuristics) is slightly worse than TextRunner. Most likely, this is because TextRunner's heuristics rely on parse trees to label training examples, and the Stanford parse on Wikipedia is less accurate than the gold parse on WSJ.

4.3 Design Desiderata of WOEparse

There are two interesting design choices in WOEparse: 1) whether to require arg1 to appear before arg2 in the sentence (denoted 1≺2); and 2) whether to allow corePaths to contain prepositional-phrase (PP) attachments — in what follows, PPa denotes the setting that filters such attachments out. We tested how these choices affect extraction performance; the results are shown in Figure 7.

We can see that filtering PP attachments (PPa) gives a large precision boost with a noticeable loss in recall; enforcing a lexical ordering of relation arguments (1≺2) yields a smaller improvement in precision with a small loss in recall. Take the WSJ corpus for example: setting 1≺2 and PPa achieves a precision of 0.792 (with recall of 0.558). Relaxing 1≺2 to allow either argument order (1∼2) decreases precision to 0.773 (with recall of 0.595). Allowing PP attachments while keeping 1≺2 decreases precision to 0.642 (with recall of 0.687) — in particular, if we use the gold parse, the precision decreases to 0.672 (with recall of 0.685). We set 1≺2 and PPa as the default in WOEparse, given our preference for high precision over high recall.
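A minimal sketch of how these training-time filters might be applied to a candidate corePath. The subject-like-start check restates the rule from Section 3.3.1 and the ordering check implements 1≺2; the PP-attachment test is deliberately left as a caller-supplied predicate, since the paper does not spell out its exact definition:

    SUBJECT_RELS = ("nsubj", "nsubjpass", "partmod", "rcmod")

    def keep_core_path(path, arg1_index, arg2_index,
                       require_order=True, pp_attachment_test=None):
        """Return True if a candidate corePath (token/edge list as produced by
        core_path above) should be kept for pattern building."""
        edges = [e for e in path if e.startswith(("-", "<"))]
        if not edges or not any(rel in edges[0] for rel in SUBJECT_RELS):
            return False                       # must start with a subject-like dependency
        if require_order and not (arg1_index < arg2_index):
            return False                       # 1<2: arg1 must precede arg2 in the sentence
        if pp_attachment_test is not None and pp_attachment_test(path):
            return False                       # PPa: drop paths containing PP attachments
        return True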
4.3.1 Different parsing options

We also tested how different parsing options might affect WOEparse's performance. We used three parsing options on the WSJ dataset: Stanford parsing, CJ50 parsing (Charniak and Johnson, 2005), and the gold parses from the Penn Treebank. The Stanford Parser is used to derive dependencies from the CJ50 and gold parse trees. Figure 8 shows the detailed P/R curves. We can see that although today's statistical parsers make errors, they have negligible effect on the accuracy of WOE.

5 Related Work

Open or Traditional Information Extraction: Most existing work on IE is relation-specific. Occurrence-statistical models (Agichtein and Gravano, 2000; M. Ciaramita, 2005), graphical models (Peng and McCallum, 2004; Poon and Domingos, 2008), and kernel-based methods (Bunescu and R. Mooney, 2005) have been studied. Snow et al. (Snow et al., 2005) utilize WordNet to learn dependency path patterns for extracting the hypernym relation from text. Some seed-based frameworks have been proposed for open-domain extraction (Pasca, 2008; Davidov et al., 2007; Davidov and Rappoport, 2008). These works focus on identifying general relations such as class attributes, while open IE aims to extract relation instances from given sentences.
[Figure 6 — P/R curves on WSJ, Web, and Wikipedia (precision vs. recall) comparing CRF+w−w (= WOEpos), CRF+w−tr, CRF+w−r, CRF+tr−tr, and TextRunner; plots not reproduced.]
Figure 6: Matching sentences with Wikipedia infoboxes results in better training data than the hand-written rules used by TextRunner.

Figure 7: Filtering prepositional phrase attachments (PPa) shows a strong boost to precision, and we see a smaller boost from enforcing a lexical ordering of relation arguments (1≺2).

[Figure 8 — P/R curve on WSJ (precision vs. recall) for WOEparse using Stanford, CJ50, and gold parses; plot not reproduced.]
Figure 8: Although today's statistical parsers make errors, they have negligible effect on the accuracy of WOE compared to operation on gold-standard, human-annotated data.

Another seed-based system, StatSnowball (Zhu et al., 2009), can perform both relation-specific and open IE by iteratively generating weighted extraction patterns. Different from WOE, StatSnowball only employs shallow features and uses L1-normalization to weight patterns. Shinyama and Sekine proposed the "preemptive IE" framework to avoid relation-specificity (Shinyama and Sekine, 2006). They first group documents based on pairwise vector-space clustering, then apply an additional clustering to group entities based on the document clusters. The two clustering steps make it difficult to meet the scalability requirement necessary to process the Web. Mintz et al. (Mintz et al., 2009) use Freebase to provide distant supervision for relation extraction. They applied a similar heuristic, matching Freebase tuples with unstructured sentences (Wikipedia articles in their experiments) to create features for learning relation extractors. Using Freebase to match arbitrary sentences, instead of matching Wikipedia infoboxes within their corresponding articles, will potentially increase the number of matched sentences at a cost in accuracy. Also, their learned extractors are relation-specific. Akbik et al. (Akbik and Broß, 2009) annotated 10,000 sentences parsed with LinkGrammar and selected 46 general linkpaths as patterns for relation extraction. In contrast, WOE learns 29,005 general patterns based on an automatically annotated set of 301,962 Wikipedia sentences.
The KNext system (Durme and Schubert, 2008) performs open knowledge extraction via significant heuristics. Its output is knowledge represented as logical statements instead of information represented as segmented text fragments.

Information Extraction with Wikipedia: The YAGO system (Suchanek et al., 2007) extends WordNet using facts extracted from Wikipedia categories. It only targets a limited number of pre-defined relations. Nakayama et al. (Nakayama and Nishio, 2008) parse selected Wikipedia sentences and perform extraction over the phrase structure trees based on several handcrafted patterns. Wu and Weld proposed the KYLIN system (Wu and Weld, 2007; Wu et al., 2008), which has the same spirit of matching Wikipedia sentences with infoboxes to learn CRF extractors. However, it only works for relations defined in Wikipedia infoboxes.

Shallow or Deep Parsing: Shallow features, like POS tags, enable fast extraction over large-scale corpora (Davidov et al., 2007; Banko et al., 2007). Deep features are derived from parse trees with the hope of training better extractors (Zhang et al., 2006; Zhao and Grishman, 2005; Bunescu and Mooney, 2005; Wang, 2008). Jiang and Zhai (Jiang and Zhai, 2007) did a systematic exploration of the feature space for relation extraction on the ACE corpus. Their results showed limited advantage of parser features over shallow features for IE. However, our results imply that abstracted dependency-path features are highly informative for open IE. There might be several reasons for the different observations. First, Jiang and Zhai's results were obtained on traditional IE, where local lexicalized tokens might contain sufficient information to trigger a correct classification; the situation is different when features are completely unlexicalized, as in open IE. Second, as they noted, many relations defined in the ACE corpus are short-range relations which are easier for shallow features to capture. In practical corpora like the general Web, many sentences contain complicated long-distance relations. As we have shown experimentally, parser features are more powerful in handling such cases.

6 Conclusion

This paper introduces WOE, a new approach to open IE that uses self-supervised learning over unlexicalized features, based on a heuristic match between Wikipedia infoboxes and corresponding text. WOE can run in two modes: a CRF extractor (WOEpos) trained with shallow features like POS tags, and a pattern classifier (WOEparse) learned from dependency path patterns. Compared with TextRunner, WOEpos runs at the same speed but achieves an F-measure which is between 15% and 34% greater on three corpora; WOEparse achieves an F-measure which is between 79% and 90% higher than that of TextRunner, but runs about 30 times slower due to the time required for parsing.

Our experiments uncovered two sources of WOE's strong performance: 1) the Wikipedia heuristic is responsible for the bulk of WOE's improved accuracy, but 2) dependency-parse features are highly informative when performing unlexicalized extraction. We note that this second conclusion disagrees with the findings in (Jiang and Zhai, 2007).

In the future, we plan to run WOE over the billion-document CMU ClueWeb09 corpus to compile a giant knowledge base for distribution to the NLP community. There are several ways to further improve WOE's performance. Other data sources, such as Freebase, could be used to create an additional training dataset via self-supervision. For example, Mintz et al. consider all sentences containing both the subject and object of a Freebase record as matching sentences (Mintz et al., 2009); while they use this data to learn relation-specific extractors, one could also learn an open extractor. We are also interested in merging lexicalized and open extraction methods; the use of some domain-specific lexical features might help to improve WOE's practical performance, but the best way to do this is unclear. Finally, we wish to combine WOEparse with WOEpos (e.g., with voting) to produce a system which maximizes precision at low recall.

Acknowledgements

We thank Oren Etzioni and Michele Banko from the Turing Center at the University of Washington for providing the code of their software and for useful discussions. We also thank Alan Ritter, Mausam, Peng Dai, Raphael Hoffmann, Xiao Ling, Stefan Schoenmackers, Andrey Kolobov and Daniel Suskin for valuable comments. This material is based upon work supported by the WRF / TJ Cable Professorship, a gift from Google, and by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the Air Force Research Laboratory (AFRL).
References

E. Agichtein and L. Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In ICDL.

Alan Akbik and Jürgen Broß. 2009. Wanderlust: Extracting semantic relations from natural language text using dependency grammar patterns. In WWW Workshop.

Sören Auer and Jens Lehmann. 2007. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In ESWC.

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the Web. In IJCAI.

Razvan C. Bunescu and Raymond J. Mooney. 2005. Subsequence kernels for relation extraction. In NIPS.

R. Bunescu and R. Mooney. 2005. A shortest path dependency kernel for relation extraction. In HLT/EMNLP.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In ACL.

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI.

Dmitry Davidov and Ari Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In ACL.

Dmitry Davidov, Ari Rappoport, and Moshe Koppel. 2007. Fully unsupervised discovery of concept-specific relationships by web mining. In ACL.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. Stanford typed dependencies manual. http://nlp.stanford.edu/downloads/lex-parser.shtml.

Benjamin Van Durme and Lenhart K. Schubert. 2008. Open knowledge extraction using compositional language processing. In STEP.

R. Hoffmann, C. Zhang, and D. Weld. 2010. Learning 5000 relational extractors. In ACL.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In HLT/NAACL.

A. Gangemi and M. Ciaramita. 2005. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In IJCAI.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP.

Kotaro Nakayama, T. Hara, and S. Nishio. 2008. Wikipedia link structure and text mining for semantic relation extraction. In CEUR Workshop.

Dat P.T. Nguyen, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Exploiting syntactic and semantic information for relation extraction from Wikipedia. In IJCAI07-TextLinkWS.

Marius Pasca. 2008. Turning web text and search queries into factual knowledge: Hierarchical class attribute extraction. In AAAI.

Fuchun Peng and Andrew McCallum. 2004. Accurate information extraction from research papers using conditional random fields. In HLT-NAACL.

Hoifung Poon and Pedro Domingos. 2008. Joint inference in information extraction. In AAAI.

Y. Shinyama and S. Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In HLT-NAACL.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW.

Mengqiu Wang. 2008. A re-examination of dependency path kernels for relation extraction. In IJCNLP.

Fei Wu and Daniel Weld. 2007. Autonomously semantifying Wikipedia. In CIKM.

Fei Wu, Raphael Hoffmann, and Daniel S. Weld. 2008. Information extraction from Wikipedia: Moving down the long tail. In KDD.

Min Zhang, Jie Zhang, Jian Su, and Guodong Zhou. 2006. A composite kernel to extract relations between entities with both flat and structured features. In ACL.

Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In ACL.

Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: A statistical approach to extracting entity relationships. In WWW.
