Introduction to information extraction


Douglas E. Appelt
Artificial Intelligence Center, SRI International, Menlo Park, California, USA

* Correspondence to: Douglas E. Appelt, Sr. Computer Scientist, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA.

AI Communications 12 (1999) 161-172
ISSN 0921-7126 / $8.00 © 1999, IOS Press. All rights reserved
In recent years, analysts have been confronted with the increasing availability of on-line sources of information in the form of natural-language texts. This increased accessibility of textual information has led to a corresponding interest in technology for processing text automatically to extract task-relevant information. This demand for a technological solution to the problem of dealing with the often-overwhelming quantity of available information has stimulated the development of the field of Information Extraction. This article provides an overview of the problems addressed and the current approaches toward solutions, and assesses the state of the art and its potential for future progress.

1. Introduction

Many computer programs have been designed to process text, most of which one would not be tempted to call text understanding. At the most basic level, text is just a sequence of characters on which one can perform string searches, insertions, deletions, and replacements without even being aware of whether the characters represent newspaper articles, computer programs, tables of numbers, or something else. Moving one level up this hierarchy, some programs are designed to embed the knowledge that they are operating on actual natural-language text. Programs such as information retrieval systems know that the characters on which they operate are words in some language that can be analyzed morphologically (e.g., stemming) and have meanings. The linguistic processing of such systems is limited, because typically very little syntactic or semantic analysis is performed on the text. In fact, a clear case for the utility of performing such an analysis in an information retrieval system has yet to be convincingly made (Sparck-Jones [24], Israel et al. [15]). Instead, information retrieval systems determine the relevance of a text to a query by computing various statistics based on the relative frequency of words occurring in the document, the query, and the corpus as a whole. Often, information retrieval systems are applied to very large corpora on the order of gigabytes of text, or, in the case of World Wide Web search engines, terabytes.

On the other extreme of the text-processing spectrum are programs that could be said to do text understanding or story understanding. Such systems assume that every sentence of a text contains some relevant information and, in addition, that each sentence is related to each other sentence through some relation of coherence (Hobbs [12]) or rhetorical structure (Mann and Thompson [21]), the determination of which follows not only from information that is explicitly present in the text, but also from implicit information that can only be plausibly inferred from what is explicitly present. It has been proposed that this implicit information be recovered through schema recognition or abductive reasoning within a complex theory of commonsense knowledge (Hobbs et al. [14]). Whatever strategy is chosen, success at the task requires some fairly comprehensive representation of the broad spectrum of knowledge that humans bring to bear when understanding texts. As the last two decades of research in artificial intelligence have shown, this can be a tall order. To the extent that full text understanding has been successful at all, it has succeeded only in very narrow domains, where the knowledge representation problem can be sufficiently restricted to make it feasible, and for relatively small corpora, for which the relatively slow and complex linguistic processing can be accomplished in a reasonable amount of time. Since the objectives of text understanding are so ambitious, it is difficult to evaluate success. One means of evaluation is to test whether a system can answer arbitrary questions whose answers follow from information in the text.

Information extraction is situated somewhere between information retrieval and text understanding on this spectrum. Unlike information retrieval, one is interested in identifying not only passages of text that may contain relevant information, but in addition, in extracting the actual relevant facts and representing them in some useful form. However, the relevant facts are restricted to those that are explicitly present in the text.


Furthermore, the kinds of relevant facts of interest are specified in advance. In addition, the range of facts of interest will be restricted to a small number of events and relations applying to entities of a limited set of types. This implies that, unlike text understanding, only a small portion of a text is typically relevant to an extraction task. Because this simpler task is amenable to simpler (and faster) processing methods, it is reasonable to think of applying information extraction systems to relatively large corpora. Although corpora the size of those processed by information retrieval systems are still out of reach for information extraction, current technology makes it possible to process document collections as large as 46,000 texts (Israel et al. [15]) in a reasonable amount of time. Although the evaluation of the performance of an information extraction task is a difficult and tricky enterprise that is discussed in more detail in the next section, it is clearly much easier than evaluating a text understanding system. Because one has a clear picture in advance of what information is being sought, it is feasible to evaluate a system on how much of this information it is capable of finding (recall), and how accurate its claims are about the information it extracts (precision).

2. The Message Understanding Conferences (MUCs)

A very important influence on the development of the field of information extraction has been the Message Understanding Conferences (MUC). These conferences (as well as the Text Retrieval Conferences (TREC)) have been sponsored by the US Government with the intention of developing technology of use to the intelligence community. The MUC proceedings provide an important reference source for understanding the evolution of information extraction systems and the current state of the art. The organizers of the MUC conferences provided an application domain for information extraction, and carefully defined the rules of the extraction task. In addition, the organizers provided a moderate-sized corpus (between 100 and 1000 texts) annotated with the information that was to be extracted. Organizations wishing to participate in the evaluation developed extraction systems in conformance with the task specifications, and were provided with a blind test of 100 texts. The organizers automatically scored the extracted information, and published the results.

Over the course of several of the MUC conferences, a consensus developed among the participants and sponsors of the evaluations about how information extraction systems should be evaluated (Chinchor [8]). It was agreed that the extracted output would be represented as hierarchical attribute-value structures called templates. Human annotators would provide a set of key templates for the training data and the test data that could be compared to the system output with an automatic scoring program. The scoring program would align the templates produced by an extraction system with the key templates. Values that correctly matched the key values were counted as correct, values that did not match the key were incorrect, and attributes with non-empty values that did not align with a key attribute were considered overgeneration. Given the total possible correct responses (P), the number correct (C), the number incorrect (I), and the number overgenerated (O), recall and precision for the output of an extraction system are defined as follows:

recall = C / P,    precision = C / (C + I + O).

A statistic called the F-measure is used as a weighted combination of recall and precision to arrive at an overall figure of merit for an extraction system's performance. This measure is essentially a harmonic mean of recall and precision that can be weighted with a parameter β, whose deviation from the value 1.0 determines whether recall or precision is more heavily weighted. If P and R are the precision and recall measures for a given set of responses, the F-measure is defined as follows:

F = ((β² + 1) · P · R) / (β² · P + R).
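For concreteness, here is a minimal sketch in Python of these metrics, computed from the scoring counts defined above (the function name and the balanced default β = 1.0 are illustrative assumptions):

# MUC-style scoring metrics: recall, precision, and F-measure, computed from
# the total possible correct responses (P) and the counts of correct (C),
# incorrect (I), and overgenerated (O) values in a system's output templates.

def muc_scores(P, C, I, O, beta=1.0):
    recall = C / P
    precision = C / (C + I + O)
    f = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return recall, precision, f

# 70 correct values out of 100 possible, with 20 incorrect and 10 overgenerated:
print(muc_scores(P=100, C=70, I=20, O=10))  # (0.7, 0.7, 0.7)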

Recall, precision, and F-measure are now the most frequently cited metrics in the literature when referring to information extraction system performance.

3. Types of information extraction tasks

The MUC conferences defined several generic types of information extraction tasks that were intended to be prototypes of real information extraction tasks that arise in applications. Sites participating in MUC were evaluated on these general tasks, in the hope that real tasks would be similar enough to the generic ones to lend some validity to the evaluation metrics. The generic tasks were defined as follows:

1. Named Entity Recognition. Systems would mark the extent of certain types of proper names in a text with SGML annotations (see the example following this list). Because most extraction tasks require the identification of persons, companies, government organizations, and locations, these entity types were singled out for the named entity recognition task. Of course, it is easy to imagine extraction tasks that require identification of other types of entities, such as the names of products or military units.

2. Template Element Task. Whereas named entity recognition requires only the recognition of individual names, template element identification requires the individuation of named entities, which might be referred to by several different names in a text (e.g., "William Jefferson Clinton", "Bill Clinton", "Slick Willie"). In addition, some attributes like title or nationality would be included in the template.

3. Template Relation Task. The template relation task requires the identification of a small number of possible relations between the template elements identified in the template element task. This might be, for example, an employee relationship between a person and a company, a family relationship between two persons, or a subsidiary relationship between two companies. Extraction of relations among entities is a central feature of almost any information extraction task, although the possibilities in real-world extraction tasks are endless.

4. Coreference. The coreference task requires the identification of referring expressions and the separation of referring expressions into equivalence classes based on identity of reference. This task is different from the other MUC tasks because it constitutes a component technology: nobody is interested in coreference for its own sake, but it has the potential of enabling higher performance on extraction tasks that are of interest.

5. Scenario Template Task. This is intended to be the MUC approximation to a real information extraction problem. A system is expected to fill a template structure with extracted information involving several relations and events of interest. Examples of such tasks included identification of the perpetrators and victims of terrorist attacks; identification of the names, partners, products, profits, and capitalization of joint ventures; and identification of the positions, companies, and persons involved in high-level management succession events.
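For illustration, a named-entity answer in the SGML style used in the MUC evaluations would mark up a sentence along these lines (the sentence itself is invented):

<ENAMEX TYPE="PERSON">Bill Clinton</ENAMEX> visited the headquarters of
<ENAMEX TYPE="ORGANIZATION">Ford Motor Company</ENAMEX> in
<ENAMEX TYPE="LOCATION">Detroit</ENAMEX> on <TIMEX TYPE="DATE">Tuesday</TIMEX>.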

4. Applications of information extraction

There are many applications of information extraction technology. The most obvious application would be to populate a database with the information extracted from the text. A reasonable information extraction problem would be to read newspaper sports reports and extract the outcomes of basketball games, giving the date of the game, the winning team, the losing team, and the score. This problem is simple enough for information extraction techniques, but not totally trivial, because for boring and repetitive narratives like game scores, reporters expend significant effort to introduce variation into their language simply to reduce monotony. Nevertheless, it would not be unreasonable to expect fairly high performance from such an information extraction system, given appropriate time to build or train it.

If one were a securities analyst, one might be more interested in a task like the MUC-5 joint ventures task, which was to extract information about the formation of new joint ventures, including the name, location, business, capitalization, and founding partners of the venture. Such a task is much more difficult than the basketball scores, since the underlying facts about the world are much more complex. This complexity will almost certainly result in lower performance as measured by recall and precision. The performance of recently developed information extraction systems on tasks of this kind has been significantly lower than human performance. Maintainers of databases would be unlikely to consider the output of information extraction systems sufficiently accurate without significant human postediting.

If information cannot be extracted with sufficient accuracy for database maintenance, it can be used to focus the user's attention on the precise passages of text that contain the targeted information. If the recall of the information extraction system is sufficiently high, then it may be highly useful in enabling the user to ignore most of the irrelevant text when perusing an article.

Another potential application of information extraction technology is in generating topic-directed summaries. Often targeted information is briefly mentioned in an article on a completely different topic. By extracting the targeted information and generating a summary directly from the extracted predicates, as done by Cancedda [6], it is possible to generate summaries that are more useful to a user seeking specific information than summaries generated by more general summarization techniques. Other applications of information extraction systems have been investigated in which the information that is extracted is not of interest to the end user, but is rather used to improve the performance of some system component that is. Recent attempts have been made to improve information retrieval routing systems by reranking their output based on the quantity of relevant information that can be extracted from each document (Israel et al. [15]).

5. The architecture of information extraction systems

Before researchers had much experience with information extraction problems, it was widely believed that information extraction was just a simpler text understanding problem. Some of the early MUC systems (TACITUS, Hobbs et al. [13]; PROTEUS, Grishman et al. [10]; SCISOR, Jacobs and Rau [16]) attempted to take this idea seriously. A general-purpose parsing algorithm would be applied to every sentence of the input to produce a logical form. The systems would then apply various reasoning methods to the logical form to derive the information that was to be extracted.

Experience showed that this approach produced less than satisfactory results. One problem was that even fairly comprehensive grammars had inadequate coverage for real-world texts like newspaper articles. This forced a reliance on robust parsing techniques (Hobbs et al. [13]), which in turn became a source of error. Another problem was that real texts tend to have long sentences (20-word sentences are common, 50-word sentences occur with dismaying frequency, and sentences of 100 or more words are not unheard of), which cause combinatorial problems for parsers, even when employing heuristic or statistical guidance (Magerman and Weir [19]). This combinatorial problem guaranteed that the resulting systems were slow. Because they needed to obtain a logical form before they could determine the relevance of a sentence, they had to parse everything. Parsing everything meant long turnaround time on test runs (several hours to a day for a corpus of 100 texts was common on the hardware of the time), which implied that feedback on system performance to developers was slow in coming.

Because of these problems, researchers soon concluded that information extraction systems were best built by using a combination of shallow analysis techniques and statistical methods. The most frequently employed shallow analysis technique is finite-state parsing. Although natural languages are at least context-free, simple non-recursive constituents can be recognized by finite-state grammars. The limitations of finite-state analysis are not as restrictive as one might first believe, and good implementations can be very fast and robust. For practical application, one usually employs transducers rather than simple recognizers. A transducer accepts input symbols consisting of words and their associated lexical features, and produces as output phrases with features annotating the head, and perhaps other domain-relevant properties. Transducers can be cascaded, with each transducer operating on the output of the previous one. This enables modularization of the system into phases, in which each stage of processing is handled by a transducer operating on the output of the previous phase. It is also possible to handle limited recursive structures like conjunctions, provided one accepts a limitation on the theoretically unbounded depth.

Fig. 1. Modules of an information extraction system.

A typical information extraction system has phases for input tokenization, lexical and morphological processing, some basic syntactic analysis, and some combination of basic syntactic constituents into domain-relevant patterns, as indicated by the left-hand boxes in Fig. 1. If all one is interested in is proper name identification, the parsing and domain analysis phases may not be necessary, but applications that attempt to extract events or relationships among entities will almost certainly have them. In addition to the modules in the left-hand column, information extraction systems may include modules from the right-hand column, depending on the particular requirements of the application. These processes will be discussed in greater detail in the subsequent sections.
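As a minimal sketch of this cascaded, phase-by-phase organization (the function names and the trivial phase implementations are illustrative assumptions, not any particular system's design):

# A toy cascade in the style of Fig. 1: each phase is a stand-in for a
# finite-state transducer operating on the output of the previous phase.

def tokenize(text):
    # real tokenizers handle punctuation, abbreviations, etc.
    return text.rstrip(".").split()

def lexical(tokens):
    # a real system consults a lexicon and gazetteers; here capitalization
    # crudely stands in for a proper-noun feature
    return [(t, "NNP" if t[:1].isupper() else "X") for t in tokens]

def parse(tagged):
    # finite-state noun-group/verb-group recognition would go here;
    # this stub just groups adjacent proper-noun tokens into fragments
    fragments, run = [], []
    for word, tag in tagged + [("", "")]:
        if tag == "NNP":
            run.append(word)
        else:
            if run:
                fragments.append(" ".join(run))
            run = []
    return fragments

def domain(fragments):
    # domain pattern matching over fragment heads would go here
    return [{"ENTITY": f} for f in fragments]

def extract(text):
    return domain(parse(lexical(tokenize(text))))

print(extract("Bridgestone Sports Co said it has set up a venture in Taiwan."))
# [{'ENTITY': 'Bridgestone Sports Co'}, {'ENTITY': 'Taiwan'}]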

6. Knowledge engineering vs. learning

There are two basic approaches to designing the modules of an information extraction system, which I refer to as the knowledge engineering approach and the learning approach. In the knowledge engineering approach, a system developer who is familiar with the workings of the system, the various linguistic resources available, and the requirements of the domain writes rules that are operated on by the underlying processing mechanism, which in the case of information extraction systems is usually a finite-state transducer. Although the human designer usually refers to a corpus of domain-relevant texts when constructing the rules, he or she is of course free to apply any general knowledge, intuitions, or tricks of the trade in the design of the rules. Thus, the ultimate performance that can be obtained by manually crafting rules depends greatly on the skill of the craftsman.

Learning approaches are based on supervised learning, where a large annotated corpus is available to provide examples on which learning algorithms can operate. (Unsupervised learning, in which systems optimize their performance without reference to an annotated corpus, has been investigated, but this is much more an area of research than of practice.) The underlying mechanisms that may be trained include decision trees (McCarthy and Lehnert [20]), maximum entropy models (Berger et al. [4]), and hidden Markov models (Bikel et al. [5]). There is, of course, no particular reason why every module of a system shown in Fig. 1 has to be built according to the same design paradigm. Typically, a number of considerations will influence one's decision about which design approach to adopt for a particular module in an information extraction system:

1. The availability of training data. If an adequate quantity of training data is available, or cheaply and easily obtained, it argues in favor of a learning approach. For tasks like name recognition, it is easy to find people with the requisite knowledge to annotate texts, there is high agreement among annotators, and the quantity of data required is relatively modest (Bikel et al. [5]). The situation is quite different for complex domain-level tasks, where inter-annotator agreement is much lower, and the annotation task is much slower, more difficult, more expensive, or requires extensive domain expertise from the annotators. It is often argued that annotations are cheaper to produce than rules, but for complex domains, annotation costs can be prohibitive.

2. The availability of linguistic resources. If linguistic resources such as lexicons and name lists are available in the target language, then handcrafting of rules may be possible. Otherwise, it may be necessary to rely on training from an annotated corpus in those cases in which the required resources are not available.

3. The availability of developers with the knowledge and skill to craft rules. The availability of a skilled knowledge engineer is an obvious prerequisite for the knowledge engineering approach.

4. The stability of the final specifications. In response to a change in specifications, it is often easier to make minor changes to a set of rules than to reannotate and retrain from a corpus. For example, the MUC named-entity-tagging specifications required identifying astronomical objects as locations. If this were undesirable for a particular application, a handcrafted system could be adapted by deleting a single rule, while a learning system would have to be retrained from a corpus with updated annotations reflecting the changed specifications, a much more difficult process. However, other changes in specifications may be easier to accommodate with a trainable system. If, for example, a system trained on mixed upper and lower case text needed to be used on upper-case-only texts, it would suffice to map the already annotated training corpus to upper case and retrain, while the rule-based system would have a much more difficult task ahead.

5. The level of performance required. Experience has shown that human ingenuity counts for a lot. Data from the MUC-6 and MUC-7 evaluations on the named-entity task suggest that handcrafted systems enjoy an error rate about 30% lower than automatically trained systems (F-measure 96.4 vs. 93.0 for MUC-6, and 93.7 vs. 90.4 for MUC-7). Of course, the performance of automatically trained systems depends on the quantity of training data available, and recent evaluations suggest that with enough data (over 1.2 million words), the two approaches can produce equivalent results, at least on the name recognition task. However, if the required quantity of training data is huge, it diminishes the attractiveness of a trainable system.

It is typically the case that training data for named entity identification tasks is relatively easy to obtain, while training data for complex domain template-filling tasks is often sparse, less reliable, and much more difficult and expensive to obtain. For this reason, the domain-specific analysis modules are often designed using the knowledge engineering approach. Nevertheless, a number of researchers (for example, Bagga et al. [2], Soderland [22]) have investigated the induction of domain rules from examples. Cardie [7] surveys some of the recent work in this area.

7. Lexical and morphological processing

The precise duties of the lexical and morphological processing component of an information extraction system depend on the language that is being processed. Languages without orthographically distinguished word boundaries, like Japanese and Chinese, will require that some word segmentation procedure be applied. Languages with very simple inflectional morphology, like English, can skip morphological analysis entirely and rely on a lexicon expanded with all inflectional variants. Morphological analysis presents a more interesting problem for languages like German, with its compound nominals as well as more complex inflectional variation.

The most important problem faced by this module in any language is the handling of proper names. One characteristic of proper names is that they are productive, and therefore impossible to enumerate exhaustively in advance. This is not to say that large lists of proper names are useless. In fact, many name recognizers employ name lists that are readily available from public-domain sources (the census bureau for person names, the SEC EDGAR database for company names, materials distributed for the MUC evaluations). Two facts are important to remember: new names can be coined that are not on any lists, and name lists usually contain significant overlap with normal words, creating ambiguity that requires contextual resolution. An interesting anecdote from the development of the SRI FASTUS system illustrates the problem. After adding a comprehensive gazetteer of world-wide place names, the system was tested by processing the sentence "I want to go to San Francisco." We were astonished to find out that not only was "San Francisco" a place name, but "I", "want", "go", and "to" were as well! Fortunately, many names have a discernible internal structure (like "DFKI, GmbH") that can be recognized either with handcrafted rules under the knowledge engineering approach or with automatically trained rules derived from an annotated corpus. The recognition of novel proper names is also a potentially productive application for unsupervised learning algorithms: assuming that you have a substantial lexicon of known unambiguous proper names, you can infer the properties of unknown proper names in a large corpus by drawing analogies to known proper names that occur in identical syntactic relationships to the same or similar words (Cucchiarelli et al. [9]).

In addition to name recognition, the lexical and morphological module must assign to words the lexical features that are required by subsequent stages of processing. This can be accomplished either through straightforward lookup in a lexicon, or by tagging with some automatic tagging algorithm. Assuming that one has both a comprehensive lexicon of the desired language and a high-performance tagger available, the question arises as to whether it is desirable to employ one or the other or both. This question has not yet been adequately resolved by research. On one hand, the more comprehensive one's lexicon is, the more likely it is to contain rare word senses that create ambiguity resolution problems. A part-of-speech tagger can often resolve such ambiguity. However, most ambiguity can be successfully resolved during a shallow parsing phase, where the preferred analysis is the one that leads to a preferred parse. Finally, part-of-speech taggers are likely to make errors precisely at those points where one would most want to have accurate information. In the MUC evaluations, systems employing part-of-speech tagging, such as the PROTEUS system (Grishman et al. [10]), did not seem to enjoy a significant advantage over systems that did not (e.g., FASTUS [1]).
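Returning to the gazetteer ambiguity illustrated by the San Francisco anecdote, here is a minimal sketch of gazetteer lookup with a crude contextual filter; the capitalization heuristic and the tiny word list are illustrative assumptions, not how FASTUS actually resolved the problem:

# Naive gazetteer lookup marks every listed word as a place name, which is
# how "I", "want", "go", and "to" (all place names somewhere in the world)
# came to be tagged as locations.

GAZETTEER = {"i", "want", "go", "to", "san", "francisco"}  # tiny illustrative list

def naive_places(tokens):
    # every token found in the gazetteer is marked as a place name
    return [t for t in tokens if t.lower() in GAZETTEER]

def filtered_places(tokens):
    # crude contextual filter: only capitalized tokens remain candidates
    return [t for t in tokens if t.lower() in GAZETTEER and t[:1].isupper()]

tokens = "I want to go to San Francisco".split()
print(naive_places(tokens))     # ['I', 'want', 'to', 'go', 'to', 'San', 'Francisco']
print(filtered_places(tokens))  # ['I', 'San', 'Francisco'] -- "I" still slips through,
                                # showing that real systems need more context than case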


8. Parsing in information extraction systems

For reasons that were discussed earlier, most information extraction systems have forsaken complete syntactic analyses in favor of shallow, fragment analyses that can be easily produced with finite-state transducers. For English, it is possible to write unambiguous grammars (up to lexical ambiguity) for simple constituents such as noun groups and verb groups. The following sentence from a MUC-5 joint venture text indicates how a complex sentence would be analyzed according to such a grammar:

[Bridgestone Sports Co.]NG [said]VG [Friday]NG [it]NG [has set up]VG [a joint venture]NG [in]P [Taiwan]NG [with]P [a local concern]NG [and]P [a Japanese trading house]NG [to produce]VG [golf clubs]NG [to be shipped]VG [to]P [Japan]NG.

In the above example, the parser associates a head with each constituent. Subsequent analysis, in which the fragments found by the initial parser are combined into larger constituents, will match based on properties associated with the head. Most extraction systems would have a rule that combines a conjunction of domain-relevant entities into a single phrase of that type, forming "a local concern and a Japanese trading house" into a single noun group. The parser would not produce any further analysis of "to be shipped to Japan", since that does not provide any domain-relevant information in this example. Assuming the coreference between "it" and "Bridgestone Sports Co." is resolved correctly, and assuming that locative and temporal adverbials are ignored during domain pattern matching, the above analysis would match the domain pattern

<company> <form> joint venture with <company>

which was a frequently invoked pattern in the MUC-5 joint ventures domain.

It is important to realize that the role of parsing in an information extraction system is to uncover sufficient regularities in the input so that relatively general domain patterns can match the heads of the parsed constituents. One could regard a system based on full parsing as one that places the full burden of recognizing the domain patterns on the parser. One could also conceivably build an information extraction system that skipped parsing entirely and worked by matching domain patterns directly against the text. Achieving high performance with such a system would presumably require numerous or very complex patterns to account for all the variation one might see in the input texts.

In an information extraction system, there is no easily stated principle that determines how much syntactic analysis is the right amount, other than that experience suggests the extremes of the scale are not optimal. Most information extraction systems have adopted at least enough syntactic analysis to mark simple noun groups and verb groups, supplemented by some handling of conjunctions and appositives. Within these broad specifications, there is wide variation, determined by the complexity of the extraction task and the types of variation observed in the actual corpora that the system processes.
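A minimal sketch of finite-state fragment chunking of this kind over part-of-speech-tagged input follows; the tag-to-fragment mapping is a deliberately tiny stand-in for the unambiguous noun-group and verb-group grammars described above (for example, it does not distinguish infinitival "to" from the preposition):

# Group adjacent tokens into NG (noun group), VG (verb group), and P
# (preposition) fragments based on Penn-Treebank-style POS tags.

def chunk(tagged_tokens):
    def kind_of(pos):
        if pos.startswith(("DT", "JJ", "NN", "PRP", "CD")):
            return "NG"
        if pos.startswith(("VB", "MD", "RP")):   # particles join verb groups
            return "VG"
        if pos in ("IN", "TO"):
            return "P"
        return None                              # anything else breaks the fragment

    fragments, words, current = [], [], None
    for word, pos in tagged_tokens + [("", "")]:  # sentinel flushes the tail
        k = kind_of(pos) if pos else None
        if k is not None and k == current:
            words.append(word)
        else:
            if words:
                fragments.append((" ".join(words), current))
            words, current = ([word], k) if k else ([], None)
    return fragments

sent = [("it", "PRP"), ("has", "VBZ"), ("set", "VBN"), ("up", "RP"),
        ("a", "DT"), ("joint", "JJ"), ("venture", "NN"),
        ("in", "IN"), ("Taiwan", "NNP")]
print(chunk(sent))
# [('it', 'NG'), ('has set up', 'VG'), ('a joint venture', 'NG'),
#  ('in', 'P'), ('Taiwan', 'NG')]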

9. Coreference in information extraction systems

As the example in the previous section illustrates, when matching domain patterns it is useful to know when a pronoun refers to a domain-relevant entity. It may be possible for an extraction system to avoid doing coreference as a separate step in its analysis. For example, pronominal noun groups could be treated like empty constituents that match any noun group in a domain pattern but contribute no semantic content. The partially instantiated domain analyses would then be merged using the strategies described in Section 11. However, because pronouns and other anaphoric elements are more constrained in how they can match their antecedents than a template unification algorithm would imply, one would expect better performance from a carefully designed coreference module. A coreference module for an extraction system should handle the following coreference problems:

1. Name-alias coreference. Names and their common variants must be recognized as coreferring, e.g., "Ford Motor Company" and "Ford", or "William Jefferson Clinton" and "Slick Willie".

2. Pronoun-antecedent coreference. Pronouns like "he", "she", "they", and "them" must be associated with their antecedents, resolving them to a domain-relevant named entity if possible.

3. Definite description coreference. This type of coreference would hold between "Ford" and "the company", or "Ford" and "the Detroit auto manufacturer". In general this is very hard, because arbitrary world knowledge may be required to make the connection among descriptors. When building an extraction system, it is reasonable to include ontological information for domain-relevant entities that enables such resolution in restricted cases, but doing it in general is unrealistic, except perhaps in the simplest cases, where a definite description is a subset of its antecedent description.

Because coreference resolution attempts to uncover information that makes it easier to match domain patterns, and because it requires minimal syntactic information to work correctly, the coreference resolution module operates after the syntactic analysis has been completed, but before the domain-relevant patterns have been matched. As with other information extraction modules, it is possible to build coreference resolution modules within both the knowledge engineering and the learning paradigm. The knowledge engineering approach to coreference resolution could be described as an attempt to adapt coreference resolution strategies from the discourse processing literature to the information extraction setting, where only a shallow and incomplete syntactic analysis is available. The process used by FASTUS consists of three basic steps (a sketch follows this list):

1. Determine accessible antecedents. For names, all antecedents are accessible. For definite descriptions, it is some part of the previous text, and for pronouns, a somewhat smaller part of the preceding text, with paragraph boundaries, if applicable, providing a useful unit.

2. Filter candidates with a semantic/sortal consistency check. This is where number and gender constraints are checked, and where the consistency of definite descriptions is evaluated. This is, of course, very hard in general, because world knowledge is necessary to make a determination like "auto manufacturer" and "Ford" being coreferential. One can try to write rules for important cases in the domain of interest, but solving the problem in general is impossible without a knowledge base larger than anything that currently exists.

3. Order the remaining candidates by syntactic preference. Antecedents in recent sentences are preferred to antecedents in earlier sentences, ordered strictly by recency, except in the same or immediately preceding sentence, in which preferences are ordered left to right. The latter heuristic is intended to reflect a fact derived from centering principles: that the subjects of adjacent sentences are slightly more likely to corefer, absent any other disambiguating information, and that, at least in English, subjects appear leftmost in surface order.
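Here is a minimal sketch of these three steps for pronouns; the mention representation, the two-sentence accessibility window, and the agreement features are illustrative assumptions:

# Resolve a pronoun against earlier mentions in three steps: accessibility,
# consistency filtering, and ordering by syntactic preference.

def resolve_pronoun(pronoun, mentions):
    # pronoun and mentions are dicts with "sentence" (index), "position"
    # (left-to-right order within the sentence), "number", and "gender".

    # Step 1: accessibility -- pronouns look back only a few sentences.
    accessible = [m for m in mentions
                  if 0 <= pronoun["sentence"] - m["sentence"] <= 2]

    # Step 2: semantic/sortal consistency -- number and gender must agree.
    consistent = [m for m in accessible
                  if m["number"] == pronoun["number"]
                  and m["gender"] in (pronoun["gender"], "unknown")]

    # Step 3: syntactic preference -- prefer recent sentences, and within
    # the same or previous sentence prefer the leftmost (subject) mention.
    consistent.sort(key=lambda m: (pronoun["sentence"] - m["sentence"],
                                   m["position"]))
    return consistent[0] if consistent else None

it = {"sentence": 1, "position": 0, "number": "sg", "gender": "neuter"}
bridgestone = {"sentence": 0, "position": 0, "number": "sg", "gender": "neuter"}
print(resolve_pronoun(it, [bridgestone]) is bridgestone)  # True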

A learning approach to coreference resolution is implemented by associating features with each pair of noun groups, encoding properties such as sameness of gender, number, and sort, and the distance separating them. A corpus is annotated with coreference pairs, and a model is trained using these annotations. (Decision trees are one possible approach, reported by McCarthy and Lehnert [20].)

Regardless of the particular approach chosen, the MUC evaluations have demonstrated that general coreference modules perform at around F-measure 68. Of course, it is of greater interest how well coreference is resolved for domain-relevant entities, since domain-relevant coreference will have an impact on the bottom-line performance of the system. Unfortunately, no official MUC tests attempted to separate domain-specific performance from general performance in coreference resolution. Recent experiments performed by SRI show that disabling the coreference module results in a decline in domain task performance of 3 points of F-measure (a 6.3% increase in error rate) for the MUC-6 management succession task.

10. Recognizing domain-relevant patterns

The critical module in an information extraction system's processing of an input text, as illustrated in Fig. 1, is the one that does the actual extraction of the targeted domain-relevant information. Although the output of some preliminary stages of processing, such as named entity recognition, might be of interest in its own right, the primary purpose of all the stages of processing prior to the domain-specific analysis is to make it easier to extract the targeted information. In the simplest case, this entails merely marking the sentences that contain domain-relevant information. More ambitious systems attempt to express the extracted information in some form that is at least partially independent of the original text in which it was expressed. The representations that have been chosen in the MUC evaluations are attribute-value structures known as templates. Templates consist of a collection of slots (attributes), each of which may be filled by one or more values. These values can consist of the original text, one or more of a finite set of predefined alternatives, or pointers to other template objects. Typically, slot fills are subjected to normalization rules that standardize the representation of fills representing dates, times, job titles, etc. The text below might be represented with the following template structure in an extraction system directed toward the extraction of information about terrorist incidents:


"Carlos Ramon, mayor of the small coastal village of Santo Domingo, was kidnapped last Tuesday by suspected guerrillas of the FMLN."

INCIDENT-0001
    TYPE: KIDNAPPING
    STATUS: SUSPECTED
    DATE: 12-NOV-86
    PERPETRATOR: <ORG-0001>
    TARGET: <PERSON-0001>

PERSON-0001
    NAME: Carlos Ramon
    TITLE: Mayor

ORG-0001
    NAME: FMLN

In building the extraction system, rules are written that match features on the heads of phrases found by earlier modules. When a rule successfully matches a pattern, a template is created expressing the information extracted from the matched pattern. Several patterns may match the same passage of text, and it may therefore be possible to construct several alternative analyses, which may subsequently be merged, or preferred subsets of them selected.

There are two basic approaches to designing the final domain-relevant pattern rules, which could be characterized as the atomic approach and the molecular approach. The most common, and perhaps most straightforward, approach is the molecular one. The molecular approach involves matching all or most of the arguments to an event (the molecule) in a single pattern. For example, one might have a pattern like

<Person> [kidnap-verb, passive] (Adverbial) by <Organization>

The rule would match the sentence in the above example, assuming that the appropriate preprocessing has been done to recognize the named entities and to associate appropriate domain-relevant features like person and organization with the constituents. The rule would specify the creation of an appropriate template structure from features associated with the matched constituents.

The development cycle under the molecular approach is basically one of starting out with a small number of highly reliable rules that capture the common core cases of the domain, but which ignore a broad class of less frequently occurring but nevertheless relevant patterns.

Such a system tends to have high precision, based as it is on common and reliable rules, but low recall, because the aggregate of all the rarer and less reliable patterns is still large. Further development is characterized by an attempt by the rule writer to capture ever-larger numbers of increasingly rare and problematic cases with increasingly general and possibly overgenerating rules. Such systems therefore tend to evolve in the direction of increasing recall, with progressively lower precision.

The molecular approach is very natural, and is really just a simplified variant of the parsing and semantic interpretation steps implemented in many systems in computational linguistics. The problem is that rules like the one above can be rather brittle in the face of real-world data. An extraction system for management succession events might encounter a sentence like the following:

"John Smith, 46, well known for his 'my way or the highway' management style, but who is nevertheless highly regarded by employees and competitors alike, was appointed CEO of Foobarco, Inc."

It is easy to see that the above sentence would not match a simple subject-passive verb-object pattern. One is then confronted with several pattern-writing choices, none of which is a perfect solution to the problem. One can attempt to parse the structure of the intervening subordinate clauses with relative linguistic fidelity, but as this example suggests, the range of variation is large enough to raise the full parsing problem that information extraction systems seek to avoid with shallow analysis. The alternative of skipping a block of unstructured text between the subject and the verb is possible, but this alternative is as likely to lead to incorrect analyses in other cases as to a correct analysis in cases similar to the above example. Another alternative is to make some of the elements in the pattern optional, and settle for making partial matches on adjacent domain-relevant constituents. The missing arguments could then be filled by a heuristic analysis that attempts to fill empty argument positions with plausible candidates from the text.

The latter approach suggests going all the way and building a domain module that recognizes the arguments to an event (the atoms) and combines them into template structures strictly on the basis of intelligent guesses rather than syntactic relationships.


The development cycle of this atomic approach would be characterized by hypothesizing domain-relevant events for any recognized entities, leading to high recall but much overgeneration, and hence low precision. Further development would entail improving the filters and heuristics for combining the atomic elements, gradually improving precision at the expense of the initially high recall.

The atomic approach makes sense as a system development strategy for certain kinds of extraction topics: in particular, when the domain-relevant entities can be classified into categories that can play only one or a very small number of roles in relevant events, or where the events and relations sought are symmetric. For example, if one were interested in labor negotiations, A negotiating with B is equivalent to B negotiating with A. Identifying the participants in the event is all one needs to assign their roles, and the perspective on the event adopted by the author of the text can be ignored for the purpose of extracting the targeted information. One of the MUC-5 topics (microelectronics) lent itself to an approach of this kind, which was successfully applied by one system (Lin [18]). For a complex extraction domain, there is no reason why different approaches cannot be combined for different parts of the overall task.
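As a minimal sketch of a molecular pattern in the kidnapping domain above, applied to sorted fragments from earlier phases (the fragment representation and the crude passive-verb test are illustrative assumptions):

# Apply a molecular pattern, <Person> [kidnap-verb, passive] by <Organization>,
# over a list of (text, kind, sort) fragments produced by earlier phases.

def match_kidnapping(fragments):
    for i, (text, kind, sort) in enumerate(fragments):
        # crude passive-verb-group test; a real system uses the VG's features
        if kind == "VG" and "kidnapped" in text and "was" in text:
            persons = [t for t, k, s in fragments[:i] if s == "person"]
            orgs = [t for t, k, s in fragments[i + 1:] if s == "organization"]
            if persons and orgs:
                return {"TYPE": "KIDNAPPING",
                        "TARGET": persons[-1],     # nearest person on the left
                        "PERPETRATOR": orgs[0]}    # first organization after "by"
    return None

fragments = [("Carlos Ramon", "NG", "person"),
             ("the small coastal village of Santo Domingo", "NG", "location"),
             ("was kidnapped", "VG", None),
             ("last Tuesday", "NG", "time"),
             ("by", "P", None),
             ("suspected guerrillas of the FMLN", "NG", "organization")]
print(match_kidnapping(fragments))
# {'TYPE': 'KIDNAPPING', 'TARGET': 'Carlos Ramon',
#  'PERPETRATOR': 'suspected guerrillas of the FMLN'}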

11. Merging partial results

If the output of one's extraction task is simply to highlight the sentences relevant to the targeted information, it is unnecessary to do more than identify the passages of text that match domain-relevant extraction patterns. However, if one is attempting a more complex task like template filling, there is no reason to believe that all the elements of a single template will be found in a single sentence. It is therefore necessary to merge partial results from different templates. It is by no means a simple problem to decide when two descriptions of an event are in fact talking about the same event. However, information extraction systems generally simplify this determination by resorting to straightforward template unification: if two templates contain identical information in at least one slot, and the other slots are consistent, then the information in the two templates can be combined by essentially unifying the two templates.
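A minimal sketch of this unification rule, with templates represented as dictionaries (the representation is illustrative):

# Merge two templates when at least one shared slot matches exactly and no
# shared slot conflicts; the result is the union of the two templates.

def unify(t1, t2):
    shared = set(t1) & set(t2)
    if any(t1[s] != t2[s] for s in shared):
        return None                    # conflicting slot values: no merge
    if not any(t1[s] == t2[s] for s in shared):
        return None                    # no matching slot to license the merge
    return {**t1, **t2}

a = {"TYPE": "KIDNAPPING", "TARGET": "Carlos Ramon"}
b = {"TYPE": "KIDNAPPING", "PERPETRATOR": "FMLN"}
print(unify(a, b))
# {'TYPE': 'KIDNAPPING', 'TARGET': 'Carlos Ramon', 'PERPETRATOR': 'FMLN'}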

Template merging works reasonably well as a discourse processing strategy, especially considering its simplicity. All participants in the MUC scenario template task evaluations have used some sort of unification merging to combine information extracted from different sentences about the same event. Although merging strategies can be tempered by textual considerations such as the distance between sentences, merging is usually performed in a relatively straightforward way.

Those who are interested in learning approaches to building information extraction systems may question unification merging as too simple and restrictive. It is easy to argue that a merging strategy could be learned that brings multiple factors into consideration, such as textual distance, differential weighting among slots, and perhaps other contextual clues as well. The problem is that an optimal merging strategy is highly dependent on the performance characteristics of an extraction system at a particular stage in its development. When more rules are added that affect the quantity and quality of the generated templates, a new optimal merging strategy would have to be trained on data annotated by a developer examining the raw templates produced by the current extraction system. This can be a time-consuming chore, and the system is likely to change before enough training data can be annotated. For this reason, this kind of supervised learning of merging strategies has not been widely employed.

One way of overcoming these development-state dependencies would be to train the merger in a weakly supervised mode, in which a space of merging strategies is explored and evaluated by testing the system's bottom-line performance on a set of key templates, using a hill-climbing algorithm to find the optimal combination of features to license a merge (Kehler [17]). The problem again is that key templates for a domain are not likely to be available for most real-world extraction tasks, and could only be created with great time and effort. The advantage is that the key templates for the domain task need only be acquired once, and do not need to be revised with every change in the system's performance. In any case, the efficacy of this approach has not yet been fully studied.

12. Future directions

It is reasonable to ask whether there are any inherent limits to the technology that has typically been employed in information extraction systems. Because of the simplicity of the techniques, and their inherent robustness, it has been possible to build systems for simple information extraction tasks like name tagging that closely rival human performance. However, this has not been the case to date with complex information extraction tasks such as the scenario template task in the MUC evaluations.

Fig. 2. Recent MUC scenario template results.

The diagram in Fig. 2 illustrates the performance of the highest-ranking participants in MUC evaluations between 1991 and the present. As one can see from the graph in Fig. 2, while the performance of information extraction systems has been improving over time, two facts are immediately evident:

1. There seems to be a level of performance (around F-measure 60) that looks like an upper bound that can be exceeded only with difficulty.

2. The gap between the best system performance and the performance of a skillful human who is well acquainted with the task is very large.

There are a few problems with comparing the MUC evaluation results from year to year. First, the evaluations were conducted on different topics in different years, and the ground rules of the evaluation were somewhat different in different years. For example, the participants of MUC-6 and MUC-7 had only one month to specialize their systems to the scenario template task definition, while in MUC-5, some sites were able to work for over a year on the topic. MUC-3 and MUC-4 were both evaluated on the same topic and template structure. Also, one can raise arguments about how well the particular template scoring methodology employed in MUC accurately reflects the underlying capabilities of the systems. However one resolves these issues, the currently available data does support one conclusion, namely that there is an upper bound on system performance and that, no matter what the precise details are of measuring this level of performance, the performance gap between the systems and humans is large.

Over the near term, the practical application of information extraction technology will have to focus on problem areas where the extraction task is simple enough that high performance can be achieved (e.g., name identification), or on application niches where existing extraction technology, imperfect though it may be, can still provide added value to applications. One example that is deserving of further examination is using information extraction to improve the precision of document retrieval.

The MUC tasks have always emphasized the processing of a single document at a time, and the information extracted from single documents was always evaluated in isolation from the system's performance on any other document. However, corpora, particularly news feeds from multiple sources, contain articles describing the same events. If a system could successfully identify the same events and entities across multiple documents, it could potentially exploit this redundancy to extract more information about an event than it could from a single article. Aside from some preliminary investigations (e.g., Baldwin and Bagga [3]), this area remains largely unexplored.

The quantity of text available in computer-readable form is increasing at a staggering rate, and the need for humans to cope with this huge volume of textual data will continue to drive research in information extraction, and the demand for applications of extraction technology, for the foreseeable future.

Acknowledgements

The author is grateful to AI Communications editorial board member Elisabeth André for her insightful and useful comments, as well as those of two anonymous referees.

References

[1] D. Appelt, J. Hobbs, J. Bear, D. Israel and M. Tyson, FASTUS: a finite-state processor for information extraction from real-world text, in: Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1993, pp. 1172-1178.
[2] A. Bagga, J. Chai, A. Biermann, C. Guin and A. Hui, A trainable system for the extraction of meaning from text, in: Proceedings of the 1995 International Conference of the Center for Advanced Studies, 1995.
[3] B. Baldwin and A. Bagga, Coreference as the foundation for link analysis over free text databases, in: AAAI Fall Symposium on Artificial Intelligence and Link Analysis, 1998.
[4] A. Berger, S. Della-Pietra and V. Della-Pietra, A maximum entropy approach to natural language processing, Computational Linguistics 22(1) (1996), 39-71.
[5] D. Bikel, S. Miller, R. Schwartz and R. Weischedel, NYMBLE: a high-performance learning name finder, in: Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997, pp. 194-201.
[6] N. Cancedda, Text generation from message understanding conference templates, unpublished PhD dissertation, Università di Roma La Sapienza, 1999.
[7] C. Cardie, Empirical methods in information extraction, AI Magazine 18(4) (1997), 65-77.
[8] N. Chinchor, MUC-4 evaluation metrics, in: Proceedings of the Fourth Message Understanding Conference, 1992, pp. 22-29.
[9] A. Cucchiarelli, D. Luzi and P. Velardi, Automatic semantic tagging of unknown proper names, in: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, 1998, pp. 286-292.
[10] R. Grishman, J. Sterling and C. Macleod, Description of the PROTEUS system as used for MUC-3, in: Proceedings of the Third Message Understanding Conference, 1991, pp. 183-190.
[11] R. Grishman, The NYU system for MUC-6 or where's the syntax? in: Proceedings of the Sixth Message Understanding Conference, 1995, pp. 167-175.
[12] J. Hobbs, Why is discourse coherent? in: Coherence in Natural-Language Texts, F. Neubauer, ed., Helmut Buske Verlag, 1983, pp. 29-69.
[13] J. Hobbs, D. Appelt, J. Bear and M. Tyson, Robust parsing of real-world natural language text, in: Proceedings of the Third Conference on Applied Natural Language Processing, 1992, pp. 186-192.
[14] J. Hobbs, D. Appelt, P. Martin and M. Stickel, Interpretation as abduction, Artificial Intelligence 63(1) (1993), 69-142.
[15] D. Israel, J. Bear, J. Petit and D. Martin, Using information extraction to improve document retrieval, in: Proceedings of the Sixth Text Retrieval Conference (TREC-6), 1997, pp. 367-378.
[16] P. Jacobs and L. Rau, SCISOR: extracting information from on-line news, Communications of the ACM 33(11) (1990), 88-97.
[17] A. Kehler, Probabilistic coreference in information extraction, in: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, pp. 163-173.
[18] D. Lin, Description of the NUBA system as used for MUC-5, in: Proceedings of the Fifth Message Understanding Conference, 1993, pp. 263-275.
[19] D. Magerman and C. Weir, Probabilistic prediction and picky chart parsing, in: Proceedings of the DARPA Speech and Natural Language Workshop, 1992, pp. 128-133.
[20] J. McCarthy and W. Lehnert, Using decision trees for coreference resolution, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 1050-1055.
[21] W. Mann and S. Thompson, Relational propositions in discourse, Discourse Processes 9 (1986), 57-90.
[22] S. Soderland, CRYSTAL: learning domain-specific text analysis rules, Center for Intelligent Information Retrieval Technical Report TE-43, 1996.
[23] S. Soderland and W. Lehnert, Corpus-driven knowledge acquisition for discourse analysis, in: Proceedings of the 12th National Conference on Artificial Intelligence, 1994, pp. 827-832.
[24] K. Sparck-Jones, Information retrieval: how far will really simple methods take you? in: Proceedings of the Twente Workshop on Language Technology, 1998, pp. 71-78.
