
Information Access: State of the Art

Mark T. Maybury
maybury@mitre.org

November 15, 2001 http://www.mitre.org/resources/centers/it/maybury/mark.html


This data is the copyright and proprietary data of the MITRE Corporation. It is made available subject to Limited Rights, as defined in paragraph (a)(15) of the clause at DFAR 252.227-7013. The restrictions governing the use and disclosure of these materials are set forth in the aforesaid clause.

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

Page 2 Copyright 2001 The MITRE Corporation. All rights reserved.

Abstract
This briefing overviews the state of the art in information retrieval, summarization, information extraction, text clustering and question answering. It reports system performance from national evaluations and identifies current commercial products that provide each type of processing. Key findings include:
- automated systems exist that can return documents relevant to a particular subject with around 80% precision but low recall
- automated query generation and relevance feedback are near human performance
- systems can presently identify entities at over 90% accuracy and relations among entities at 70-80% accuracy
- systems can summarize documents to 20% of their source size without information loss, saving users 50% of task time
- systems can also respond to a simple factual question by returning answers from relevant documents at 75% accuracy.
We describe how the community can make rapid progress on information access for intelligence using corpus-based evaluation, which includes: 1) creating challenge problems with supporting data, 2) evaluating system performance on these problems, and 3) comparing approaches and sharing data, resources and tools. The briefing provides detailed pointers to data resources and commercial products.

Current Situation

Analyst data: 10,000 messages per day (IMINT, SIGINT, HUMINT, Intel Reports)

- Data search: the analyst creates a profile to sort messages. The system sorts messages by key words -- limited precision. The analyst can study 200 messages per day.
- Extraction: the analyst reads selected messages to extract facts.
- Interpretation: the analyst builds a link analysis model -- manually, and (at best) partial and static.
- Conclusion: the analyst forms an opinion and authors a report.
- Report: the final output is text: a message, web page or document.

Intelligent Information Access

Inputs:
- Unstructured sources: news, the web
- Semi-structured info: documents
- Structured databases: known terrorist groups, world governments, biological agents
- Knowledge resources: world fact book, gazetteer, hand-crafted taxonomy

Intelligent Information Processing (Retrieve, Extract, Summarize, Visualize) supports question answering for the user, e.g., "Is X a member of a terrorist organization?"

State of the Art: Summary

Automated systems exist that can:
- Return documents relevant to a particular subject with around 80% precision but low recall
- Generate queries and apply relevance feedback at near-human performance
- Identify entities at over 90% accuracy and relations among entities at 70-80% accuracy
- Summarize documents to 20% of their source size without information loss, saving users 50% of task time
- Respond to a simple factual question by returning answers from relevant documents at 75% accuracy

We can make rapid progress on information access for intelligence using corpus-based evaluation:
- Create challenge problems with supporting data
- Evaluate system performance on these problems
- Compare approaches; share data, resources and tools

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

Information Retrieval

Input: query words. Output: ranked list of documents.

Information Retrieval: key words to documents
- Speed, scalability, domain independence and robustness are critical for access to large collections of documents

Data scale (illustrated with biomedical sources such as PIR, Genbank and MEDLINE):
- Collections: Gigabytes
- Documents: Megabytes
- Lists, tables: Kilobytes (e.g., a table of Ebola outbreak reports from PROMED for Gulu, Uganda: dates, cases, new cases, deaths)
- Phrases: Bytes (e.g., "Protease-resistant prion protein interacts with...")

Technique
- Shallow processing provides a coarse-grained result (entire documents or passages)
- The query is transformed into a collection of words; grammatical relations between words are lost
- Documents are indexed by word occurrences
- Search matches the query probe against indexed documents using a Boolean combination of terms, a vector of word occurrences, or a language model

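As a concrete (and deliberately toy) illustration of the technique bullets above -- word-occurrence indexing, Boolean matching of terms, and vector matching -- here is a minimal Python sketch; the document names and texts are invented for the example:

```python
from collections import Counter, defaultdict
import math

def build_index(docs):
    """Index documents by word occurrences (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, terms):
    """Boolean AND: documents containing every query term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def cosine_rank(docs, query):
    """Rank documents by cosine similarity of raw term-count vectors."""
    q = Counter(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        d = Counter(text.lower().split())
        dot = sum(q[w] * d[w] for w in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        scores[doc_id] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "pan am flight 103 lockerbie trial",
    "d2": "health services joint venture dissolved",
}
print(boolean_and(build_index(docs), ["flight", "lockerbie"]))  # {'d1'}
print(cosine_rank(docs, "lockerbie flight")[0])                 # d1
```

Real engines add stemming, stop-word removal and tf/idf weighting, but the query-as-bag-of-words loss of grammatical relations is visible even here.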

Evaluating Text Retrieval

The Text REtrieval Conference (TREC) has been held annually since 1992, run by NIST*
- Successful -- attracting hundreds of international participants from industry, academia and government

Goal: systematic evaluation of retrieval systems using a large (5 GB) common corpus
- Given a set of queries, for each query:
  - Systems return a ranked list of documents
  - Human judges provide relevance assessments for the ranked documents
  - Relevance judgements are used to compute average precision-recall plots for each system

Precision: % of returned documents judged relevant
Recall: % of relevant documents found

*US National Institute of Standards and Technology

Precision and Recall

Question: Who are Mohamed Atta's associates?
Query: mohamed+atta+associates
Answers: Marwan al-Shehhi, Ziad Jarrah, Hani Hanjour, Rohit Agrawal, ...

Precision: 12 out of 106 returned documents were relevant (12/106 = 11%)
Recall: one document that was not retrieved contained a new associate, Said Bahaji (11/12 = 91%)

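The two ratios on this slide can be stated as two one-line functions; a quick sketch of the arithmetic:

```python
def precision(num_relevant_returned, num_returned):
    """Fraction of returned documents that are relevant."""
    return num_relevant_returned / num_returned

def recall(num_relevant_returned, num_relevant_total):
    """Fraction of all relevant documents that were returned."""
    return num_relevant_returned / num_relevant_total

# The slide's figures: 12 of 106 returned documents were relevant,
# and 11 of the 12 relevant documents were retrieved.
print(int(precision(12, 106) * 100))  # 11
print(int(recall(11, 12) * 100))      # 91
```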

Sample TREC Topic

<num> Number: 409
<title> legal, Pan Am, 103
<desc> Description: What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland on December 21, 1988?
<narr> Narrative: Documents describing any charges, claims or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.

Field annotations:
- Title: up to 3 words best describing the topic
- Description: one-sentence description of the topic
- Narrative: description of what makes a document relevant or irrelevant

TREC9 Results for a High-Performing System

[Precision-recall curves for an automatically generated query vs. a manually generated query]

- This is a representative high-performing system
- Manual (expert) choice of query words works better than automatically generated queries
- 4 of the top 5 documents are relevant: precision = 80%, but recall is low
- Note that if you need to find all the literature on a subject, you have to look through lots of junk!

TREC Findings

It is possible to create test collections and an evaluation metric for large corpora
- TREC pioneered evaluation of recall over large document collections, using pooled results of multiple systems to prune the space and estimate recall
- TREC has experimented with different evaluations (tracks) for filtering, routing, Web search, ...
- The basic paradigm is still word-based; so far, the addition of syntax and semantics hasn't helped
- For short queries, adding information (words) to the queries helps: by hand, by thesaurus, or by feedback of relevant documents

Other lessons:
- The need for fine-grained retrieval has led to a new Question Answering track

Precision vs. Documents Retrieved

[Precision as a function of the number of documents retrieved, for automatic and manual queries]

- 3 of the first 5 documents are relevant; 20 of the first 100 documents are relevant
- Systems can use relevant documents to enrich the set of query words via relevance feedback, done automatically or interactively

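Relevance feedback of the kind described above is classically formulated as a Rocchio update; the Python sketch below is a simplified, positive-only version (the query and documents are invented) that just adds the most frequent terms of the judged-relevant documents to the query:

```python
from collections import Counter

def expand_query(query, relevant_docs, top_n=3):
    """Enrich the query with the most frequent terms from documents the
    user judged relevant (a simplified, positive-only Rocchio update)."""
    centroid = Counter()
    for doc in relevant_docs:
        centroid.update(doc.lower().split())
    new_terms = [w for w, _ in centroid.most_common() if w not in query]
    return list(query) + new_terms[:top_n]

relevant = ["pan am flight 103 lockerbie bombing trial",
            "pan am 103 bombing suspects extradited"]
print(expand_query(["lockerbie"], relevant))
```

The full Rocchio formulation also weights the original query against the relevant-document centroid and subtracts terms from non-relevant documents.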

Information Retrieval Resources

IR Journals
- Information Processing and Management (IP&M)
- ACM Transactions on Information Systems (TOIS)
- Journal of the American Society for Information Science (JASIS)
- Journal of Documentation (JDOC)
- Information Retrieval (IR)
- Journal of Intelligent Information Systems (JIIS)

Books
- Modern Information Retrieval (ACM Press Series), R. Baeza-Yates, et al.
- Intelligent Multimedia Information Retrieval, M. Maybury (ed.) & Karen Sparck Jones
- Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, W. Bruce Croft (ed.)
- Introduction to Modern Information Retrieval, G. Salton and M. J. McGill
- Information Retrieval, C. J. van Rijsbergen (www.dcs.gla.ac.uk/Keith/Preface.html)

Conferences
- TREC (Text REtrieval Conference): trec.nist.gov
- ACM SIGIR: www.acm.org/sigir
- ACM CIKM (Conference on Information and Knowledge Management)

Information Retrieval Resources

Research Centers
- Universität Dortmund, Informatik VI: ls6-www.informatik.uni-dortmund.de
- Univ. of Glasgow: www.dcs.gla.ac.uk/idom
- Univ. of Maryland, Medical Informatics and Computational Intelligence Lab: www.enee.umd.edu//medlab/filter/filter_project.html
- Univ. of Massachusetts, Center for Intelligent Information Retrieval (CIIR): ciir.cs.umass.edu
- Univ. of Nevada, Las Vegas, Information Science Research Institute: www.isri.unlv.edu

Bibliographies
- mansci1.uwaterloo.ca/~jjiang/biblio.html
- www-inf.enst.fr/~rungsawa/irrs.html
- www.seas.gwu.edu/student/chulee/bib.html
- www.si.umich.edu/~mjpinto
- joinus.comeng.chungnam.ac.kr/~dolphin/db/indices/a-tree/s/Salton:Gerard.html
- superbook.bellcore.com/~std/LSI.html
- dmoz.org/Computers/Software/Information_Retrieval

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Summarization?

Definition: Summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

Extract vs. abstract:
- An extract is a summary consisting entirely of material copied from the input
- An abstract is a summary at least some of whose material is not present in the input (e.g., subject categories, paraphrase of content)

Illustration of Extracts and Abstracts

25 Percent Extract of Gettysburg Address (sentences 1, 2, 6):
Fourscore and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. The brave men, living and dead, who struggled here, have consecrated it far above our power to add or detract.

10 Percent Extract (sentence 2):
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.

15 Percent Evaluative Abstract:
This speech by Abraham Lincoln commemorates soldiers who laid down their lives in the Battle of Gettysburg. It offers an eloquent reminder to the troops that it is the future of freedom in America that they are fighting for.
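A sentence extract like the 25-percent Gettysburg example can be produced by a simple frequency-based scorer. The sketch below is a generic Luhn-style extractor with an invented test text, not the method behind any particular system in this briefing:

```python
from collections import Counter
import re

def extract_summary(text, ratio=0.25):
    """Score each sentence by the document-wide frequency of its words,
    then copy out the top-scoring sentences in source order
    (a Luhn-style frequency extract)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True)
    keep = set(ranked[:max(1, round(len(sentences) * ratio))])
    return [s for s in sentences if s in keep]  # preserve source order

text = ("The war tests the nation. "
        "The nation endures. "
        "Cats sleep.")
print(extract_summary(text, ratio=0.34))  # ['The war tests the nation.']
```

Because every output sentence is copied verbatim, this can only ever produce extracts; an abstract requires generation or paraphrase on top.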

Intrinsic Evaluation: SUMMAC Q&A Results

[Average Answer Recall (ARA) vs. compression for the CGI/CMU, Cornell/SabIR, GE, ISI, NMSU, Penn, SRA and TextWise summarizers]

- The highest recall is associated with the least reduction of the source
- Content-based automatic scoring (vocabulary overlap) correlates very well with human scoring (passage/answer recall)
- Informativeness: a ratio of accuracy to compression of about 1.5

Intrinsic Evaluation: Japanese Text Summarization Challenge (Fukusima and Okumura 2001)

- At each compression, systems outperformed Lead and TF baselines in content overlap with human summaries, measured against both human extracts and human abstracts
- Subjective grading of coherence and informativeness showed that human abstracts > human extracts > systems and baselines

Fukusima, T. and Okumura, M. 2001. "Text Summarization Challenge: Text summarization evaluation in Japan." Workshop on Automatic Summarization, Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001). New Brunswick, New Jersey: Association for Computational Linguistics.

Extrinsic Methods: Usefulness of a Summary in a Task

Measure the summary's usefulness with respect to some information need or goal, such as:
- Finding documents relevant to one's need from a large collection; routing documents
- Extracting facts from sources
- Producing an effective report or presentation using a summary
- If the summary involves instructions of some kind, measuring the efficiency of executing the instructions
- Assessing the impact of a summarizer on the system in which it is embedded (e.g., how much does summarization help the question answering system?)
- Measuring the amount of effort required to post-edit the summary output to bring it to some acceptable, task-dependent state

There is an unlimited number of tasks to which summarization could be applied.

SUMMAC Time and Accuracy (Ad hoc task, 21 subjects)

- All time differences are significant except between B and S1
- S2 summaries (23% of source length on average) roughly halved decision time relative to F (full text)
- All F-score and recall differences are significant except between F and S2
- Conclusion (ad hoc task): S2 summaries save 50% of task time without impairing accuracy

SUMMAC: Accuracy versus Time by System

[Accuracy plotted against time by system; "+" marks systems using similar sentence extraction]

Commercial Summarizers Compared

Features compared: statistical / machine; trainable / extensible; single documents; multiple documents; controllable compression; component of tool suite; >= 5 file formats; discourse model; sentence length; location; cue phrases; tf or tf/idf; generic focus; query focus; doc format; multi-lingual; fragmentary vs. connected output; SDK; GUI; evaluated.

Summarizers compared: AutoSummarizer (MS Word 97), CONTEXT, Data Hammer, DimSum, Extractor, GE Summarizer, Intelligent Miner, IntelliScope, InText, InXight Summarizer Plus, ProSum, Search 97 Developers Kit, SMART, SUMMARIST, TexNet 32, TextAnalyst 2.0.

[Feature matrix: each summarizer marked per feature above]

Key: a full mark means the product implements the feature; a partial mark means the product has the feature but it is not implemented as fully or as normally as in other products.

Commercial Summarizers

- Oracle ConText: technet.oracle.com/products/oracle7/context/faq.htm and www.oracle.com/products/servers/st_collateral/html/cntxt_qa.html
- IntelliScope: www.lhsl.com/tech/icm/retrieval/toolkit/default.asp
- InXight Summarizer Plus: www.inxight.com/Products/Developer/AD_Summzer.html
- Intelligent Miner: www-4.ibm.com/software/data/iminer/fortext/summarize/summarize.html
- Extractor (NRC): extractor.iit.nrc.ca
- ProSum: visita.labs.bt.com/prosum/word/sample.html
- GE Summarizer
- DimSum (SRA Int.): www.sra.com
- Data Hammer: www.glu.com/datahammer
- Microsoft Word 97 AutoSummarizer: www.microsoft.com/office
- Automatic Summarizer (RES International): res.ca/sum/dev2.html
- Search 97 Developers Kit: www.verity.com/support/documentation/tdk/adv23/05_adv.htm

Summarization Resources

Books/Journals
- Mani, I. and Maybury, M. (eds.) 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge.
- Mani, I. 2001. Automated Text Summarization. John Benjamins, Amsterdam.
- Mani, I. and Hahn, U. Nov 2000. Summarization Tutorial. IEEE Computer.

On-line Summarization Tutorials
- www.mitre.org/resources/centers/it/maybury/summarization/summarization.htm
- www.si.umich.edu/~radev/summarization/radev-summtutorial00.ppt
- www.isi.edu/~marcu/coling-acl98-tutorial.html

Bibliographies
- www.si.umich.edu/~radev/summarization/
- www.cs.columbia.edu/~jing/summarization.html
- www.dcs.shef.ac.uk/~gael/alphalist.html
- www.csi.uottawa.ca/tanka/ts.html

Government initiatives
- DUC-2001 Multi-document Summarization Evaluation (www-nlpir.nist.gov/projects/duc)
- DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) Program (tides.nist.gov, www.darpa.mil/ito/research/tides/projlist.html)

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Information Extraction?

Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations).

Template Extraction

Pipeline: Source -> Analysis -> Templates -> Transformation -> Synthesis

Template:
<TEMPLATE-8806150049-1> :=
    DOC NR: 8806150049
    CONTENT: <TIE_UP_RELATIONSHIP-8806150049-1>
    DATE TEMPLATE COMPLETED: 311292
    EXTRACTION TIME: 0

Text (Wall Street Journal, 06/15/88): MAXICARE HEALTH PLANS INC and UNIVERSAL HEALTH SERVICES INC have dissolved a joint venture which provided health services.

Maybury, M. 1995. Generating Summaries from Event Data. Information Processing and Management 31(5): 735-751.

Source text
"seis labriegos que fueron asesinados por las Autodefensas"
(six farm laborers who were assassinated by the Autodefensas)

Word segmentation (tokenization)
<lex>seis</lex> <lex>labriegos</lex> <lex>que</lex> <lex>fueron</lex> ...

Part-of-speech tagging
<lex pos=CD>seis</lex> <lex pos=NN>labriegos</lex> <lex pos=PP>que</lex> <lex pos=VBM>fueron</lex> ...

Named/Nominal extraction
<persons><num>seis</num> labriegos</persons> que fueron ...

Sentence chunking
<NounGroup>seis labriegos</NounGroup> que <VerbGroup>fueron asesinados</VerbGroup> ...

Sentence parsing (grammatical relations)
<object>seis labriegos</object> que <kill-act>fueron asesinados</kill-act> por <subject>las Autodefensas</subject>

Event extraction
<victims><num>seis</num> labriegos</victims> que <assassinate>fueron asesinados</assassinate> por ...

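A staged pipeline like the one on this slide can be caricatured in a few lines: a single surface pattern that jumps straight from raw text to a filled kill-event. The pattern and field names below are hypothetical, invented for illustration:

```python
import re

# A hypothetical surface pattern for the slide's Spanish kill-event:
# "<victims> que fueron asesinados por <perpetrator>"
KILL_EVENT = re.compile(
    r"(?P<victims>[\w\s]+?)\s+que\s+fueron\s+asesinados\s+por\s+(?P<perpetrator>.+)")

def extract_kill_event(sentence):
    """Fill an event record from one surface pattern, or return None."""
    m = KILL_EVENT.search(sentence)
    if not m:
        return None
    return {"victims": m.group("victims").strip(),
            "perpetrator": m.group("perpetrator").strip()}

event = extract_kill_event(
    "seis labriegos que fueron asesinados por las Autodefensas")
print(event)  # {'victims': 'seis labriegos', 'perpetrator': 'las Autodefensas'}
```

Real systems earn their keep on the variation a single pattern cannot cover (passives, relative clauses, intervening modifiers), which is what the tagging, chunking and parsing stages above provide.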

IBM's Intelligent Miner for Text: Feature Extraction Tool Screen Shot

Information Extraction: Epidemiology Example

1. Extract entities from text (color coded via HTML)

2. Extract outbreak events into a table:

Disease | Source | Country | City_name | Date | Cases | New_cases | Dead
Ebola | PROMED | Uganda | Gula | 26-Oct-2000 | 182 | 17 | 64
Ebola | PROMED | Uganda | Gula | 5-Nov-2000 | 280 | 14 | 89
Ebola | PROMED | Uganda | Gulu | 13-Oct-2000 | 42 | 9 | 30
Ebola | PROMED | Uganda | Gulu | 15-Oct-2000 | 51 | 7 | 31
Ebola | PROMED | Uganda | Gulu | 16-Oct-2000 | 63 | 12 | 33
Ebola | PROMED | Uganda | Gulu | 17-Oct-2000 | 73 | 2 | 35
Ebola | PROMED | Uganda | Gulu | 18-Oct-2000 | 94 | 21 | 39
Ebola | PROMED | Uganda | Gulu | 19-Oct-2000 | 111 | 17 | 41

3. Display events over time [chart of total cases, new cases and deaths, 13 Oct 2000 - 24 Nov 2000]

COTS IE Tools Compared

Features compared: named entities; nominal entities; normalized time; relations; events; noun phrases; extensible via machine learning; extensible via programming; multi-lingual.

Commercial systems: AeroText (Lockheed Martin); IdentiFinder (BBN/Verizon); Intelligent Miner for Text (IBM); NetOwl (SRA); ThingFinder (Inxight); ConText (Oracle); Semio Taxonomy; LexiQuest Mine; LingSoft; CoGenTex/Cornell; TextWise/Syracuse Univ.

Non-profit systems: Alembic (MITRE); GATE (U. Sheffield); Univ. of Arizona; New Mexico State University; Fastus/TextPro (SRI International); Proteus (New York University); TIMEX (MITRE); Univ. of Massachusetts/Amherst.

[Feature matrix: each system marked per feature above, with its supported languages and entity types]

Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, IT=Italian, FR=French, DE=German, DU=Dutch, AR=Arabic; P=People, O=Organization, L=Location, T=Time, M=Money

Entity Extraction Tools: Commercial Vendors

- AeroText (Lockheed Martin), Version 1.0: www.lockheedmartin.com/factsheets/product589_hi.html
- BBN's IdentiFinder: www.gte.com/AboutGTE/gto/bbnt/speech/research/extraction/tools/identifinder.html
- IBM's Intelligent Miner for Text: www-4.ibm.com/software/data/iminer/fortext/index.html
- SRA NetOwl: www.netowl.com
- Inxight's ThingFinder: www.inxight.com/pdfs/products/tf_server_ds.pdf
- Semio: www.semio.com
- ConText (Oracle): technet.oracle.com/products/oracle7/context/tutorial/start.htm
- LexiQuest Mine: www.lexiquest.com
- Lingsoft: www.lingsoft.fi
- CoGenTex: www.cogentex.com
- TextWise: www.textwise.com

Entity Extraction Tools: Non-Profit Organizations

- MITRE's Alembic extraction system and Alembic Workbench annotation tool: www.mitre.org/technology/nlp
- MITRE's TIMEX tagger for resolving references to dates and times: m19593-pc2.mitre.org/toolshed/ACL2000.pdf
- Univ. of Sheffield's GATE: gate.ac.uk
- Univ. of Arizona: ai.bpa.arizona.edu
- New Mexico State University: crl.nmsu.edu
- SRI International's Fastus/TextPro: www.ai.sri.com/~appelt/fastus.html, www.ai.sri.com/~appelt/TextPro
- New York University's Proteus: www.cs.nyu.edu/cs/projects/proteus/
- University of Massachusetts: www-nlp.cs.umass.edu/nlpie.html

Name Analysis Software

Language Analysis Systems Inc. (Herndon, VA): Name Reference Library, www.las-inc.com
- Funding: Office of National Drug Control Policy
- Supports analysis of Arabic, Hispanic, Chinese, Thai, Russian, Korean and Indonesian names; others in future versions
- Product features:
  - Name culture classification
  - Given a name, provides common variants on that name, e.g., Abd Al Rahman or Abdurrahman or ...
  - Implied gender
  - Identifies titles, affixes, qualifiers, e.g., Arabic "abd" means "servant of"; "bin" means "son of", as in Osama Bin Laden
  - Lists top countries where a name occurs
- Cost: free via GSA to Government with a $700 per-license annual fee until August 31, then $3,535 a copy and a $990 annual fee

Identifying Entities

Most research to date has been done on news reports:
- Systems can automatically identify person, organization, location, time and numerical expressions
- Systems exist that identify names ~90-95% accurately in the news (in several languages)
- Simply memorizing names doesn't work, since new people (and organizations) appear in the news, just as new genes are identified and named
- Rules capture local patterns that characterize entities, learned from instances of annotated training data:
  - "XXX met with YYY": XXX and YYY are probably people
  - "XXX bought out YYY": XXX and YYY are probably organizations

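Local patterns like the two rules above can be written directly as regular expressions. A toy sketch, with an invented (and deliberately tiny) pattern inventory and invented sentences:

```python
import re

# Hypothetical local patterns of the kind described above: the verb
# constrains the entity type of its arguments.
PATTERNS = [
    (re.compile(r"([A-Z]\w+) met with ([A-Z]\w+)"), "PERSON"),
    (re.compile(r"([A-Z]\w+) bought out ([A-Z]\w+)"), "ORGANIZATION"),
]

def tag_entities(text):
    """Return (string, type) pairs for every pattern match in the text."""
    entities = []
    for pattern, etype in PATTERNS:
        for m in pattern.finditer(text):
            entities += [(m.group(1), etype), (m.group(2), etype)]
    return entities

print(tag_entities("Smith met with Jones. Acme bought out Initech."))
```

Fielded systems learn thousands of such patterns (or statistical equivalents) from annotated training data rather than hand-listing them.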

Approaches to Identifying Entities

Terminology (name) lists
- This works very well if the list of names and name expressions is stable and available

Tokenization and morphology
- This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)

Use of characteristic patterns
- This works fairly well for novel entities
- Rules can be created by hand or learned via machine learning or statistical algorithms

Evaluating Entity Identification

Evaluation consists of two aspects:
- Detection of the phrase that names an entity
- Classification of the entity (e.g., distinguishing a protein from a gene)

Metrics used for text in NLP evaluations:
- Precision and recall for each entity class, where
  Precision = #CorrectReturned / #TotalReturned
  Recall = #CorrectReturned / #CorrectPossible
- F-measure is the harmonic mean of precision and recall, used as a balanced single measure

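The F-measure defined above is easy to compute from the two ratios; a sketch with invented counts:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g., 9 correct of 10 returned (P = 0.9), 9 of 12 possible (R = 0.75):
p, r = 9 / 10, 9 / 12
print(round(f_measure(p, r), 3))  # 0.818
```

Being a harmonic mean, F is pulled toward the lower of the two numbers, so a system cannot score well by maximizing precision or recall alone.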

Information Extraction Evaluations for Newswire

[F-measure (accuracy) of name extraction by year, 1991-1999, for English, Japanese and Chinese]

- Name extraction is over 90% in English and Japanese; name tagging is improving in Chinese
- Commercial name taggers exist for news reports in multiple languages

Extracting Spoken Names from Broadcast News

[Named entity accuracy for the MITRE, BBN, SRI and Sheffield systems under three recognizer conditions and on human transcripts]

Source: 1998 DARPA Named Entity Evaluation Results

Relation Extraction

Identify (and tag) the relation between two entities:
- A person is_located_at a location (news)
- A gene codes_for a protein (biology)

Relations require more information: identification of two entities and their relationship
- Predicted relation accuracy = Pr(E1) * Pr(E2) * Pr(R) ~ (.93)(.93)(.93) = .80

Information in relations is less local
- Contextual information is a problem: the right word may not be explicitly present in the sentence
- Complex syntax in abstracts is a problem (see examples from Park et al., PSB 2001)

Events involve more relations and are even harder

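The compound-accuracy estimate above is just a product of the component probabilities, assuming (as the slide implicitly does) that the three errors are independent:

```python
def relation_accuracy(p_e1, p_e2, p_rel):
    """A relation is right only if both entities AND the relation itself
    are right, so (assuming independence) the accuracies multiply."""
    return p_e1 * p_e2 * p_rel

# With ~93% accuracy on each component, as on the slide:
print(round(relation_accuracy(0.93, 0.93, 0.93), 2))  # 0.8
```

The same multiplication explains why events, which compose several relations, degrade faster still.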

Information Extraction Evaluations for Newswire

[F-measure by year, 1991-1999, for names (English, Japanese, Chinese), relations and events]

- Relation extraction is now at 80%
- Event extraction is at less than 60%, improving slowly

Information Extraction Resources

Books/Journals
- Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology (International Summer School, SCIE-97, Frascati, Italy), edited by Maria Teresa Pazienza, J. Siekmann and J. G. Carbonell

On-line IE Tutorials
- www.ai.sri.com/~appelt/ie-tutorial
- citeseer.nj.nec.com/gaizauskas98information.html
- NIST Information Extraction web page: www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html

Government initiatives
- ACE: Automated Content Extraction (www.nist.gov/speech/tests/ace)
- TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides)

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Text Clustering?

Definition: Clustering is the process of detecting topics within a document collection, assigning documents to those topics, and labeling these topic clusters.

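One minimal way to realize this definition is single-pass greedy clustering over word-count vectors. This Python sketch is illustrative only (the threshold and example documents are invented, and real products use far richer models); cluster labels could be taken from the most frequent centroid terms:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: join a document to the first
    cluster whose centroid it resembles, else start a new cluster."""
    clusters = []  # list of (centroid Counter, [doc indices])
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)  # fold the document into the topic
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

docs = ["ebola outbreak in uganda",
        "uganda ebola cases rise",
        "stock market rallies on earnings"]
print(cluster(docs))  # [[0, 1], [2]]
```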

Semio Taxonomy: Creation of Browseable Collections

1. Extract phrases and build a reverse index
2. Cluster concepts, drawing on existing knowledge (e.g., a geospatial gazetteer) and specialized dictionaries
3. Create taxonomies and attach concept clusters to them
4. Browse the document collection

A Semiotic Network: What Do We Know About North Korea?

Taxonomy Example: Browse by Geography -- Select a Country

Taxonomy Example: Browse by Geography -- Select a Relevant Concept

Taxonomy Example: Browse by Geography -- Get Documents

Taxonomy Example: View Concepts in Context

IntelGazette - Topic Clustering and Labeling

Automatic topic clustering generates daily and weekly news summaries and links related stories across time and news sources.

IntelGazette


User Profiles to Customize Delivery


COTS Text Clustering Tools Compared

Features compared: clustering over named entities, noun phrases, or all words equally; accepts predefined terms (predefined topics); predefined vs. generated taxonomies; topic tracking; new topic detection; story segmentation; multi-lingual support.

Document clustering, mining, topic detection and visualization systems: Inxight Categorizer and Tree Studio (Inxight); Semio Taxonomy; LexiQuest Mine; InterMedia Text (Oracle); NorthernLight; Autonomy; Lotus Discovery Server (LDS, Lotus); QKS Classifier (Quiver); Fulcrum Knowledge Server (Hummingbird); SPIRE/Themeview (PNNL); VantagePoint (Search Technology Inc.); Mohomine, Inc.; Intelligent Miner for Text (IBM); Oasis/OnTopic (BBN/Verizon).

[Feature matrix: each system marked per feature above, with its supported languages]

Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, FR=French, DE=German, DU=Dutch, IT=Italian, AR=Arabic

Text Clustering Tool Organizations



- Inxight's Categorizer and Tree Studio: www.inxight.com
- Oracle's interMedia Text: www.oracle.com
- Semio Taxonomy: www.semio.com
- LexiQuest Mine: www.lexiquest.com
- Northern Light document clustering: www.northernlight.com
- Autonomy: www.autonomy.com
- Lotus's Discovery Server (LDS): www.lotus.com/km
- Quiver's QKS Classifier: www.quiver.com
- Hummingbird's Fulcrum Knowledge Server: www.hummingbird.com
- PNNL's SPIRE/ThemeView visualization: showcase.pnl.gov/show?it/themeview
- Search Technology's VantagePoint: www.thevantagepoint.com
- Mohomine's text classification components: www.mohomine.com
- IBM's Intelligent Miner for Text: www.ibm.com/software/data/iminer/fortext/tatools.html
- BBN/Verizon's OnTopic/Oasis: www.bbn.com/speech/ontopic.html
Page 58 Copyright 2001 The MITRE Corporation. All rights reserved.

Text Clustering Resources



Books/Journals
- Topic Detection and Tracking Pilot Study (1998) (citeseer.nj.nec.com/allan98topic.html)

Tutorial
- www.parc.xerox.com/istl/projects/ia/sg-clustering.html

Bibliographies
- dewey.yonsei.ac.kr/memexlee/links/clustering.htm - dmoz.org/Reference/Knowledge_Management/Knowledge_Discovery/Text_Mining/

Government initiatives
- Topic Detection and Tracking Evaluation Project (www.nist.gov/speech/tests/tdt/index.htm) - Text REtrieval Conference (trec.nist.gov) - TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides, tides.nist.gov)

Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html (TREC8) - trec.nist.gov/pubs/trec9/index.track.html (TREC9)

Page 59 Copyright 2001 The MITRE Corporation. All rights reserved.

Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering

Page 60 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering

Question Answering (MITREs QANDA System)

Collections: Gigabytes Documents: Megabytes


PIR Genbank MEDLINE

Question Answering: question to answer

Lists,Tables: Kilobytes Phrases: Bytes

Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    ProMED  Uganda   Gulu       26-Oct-2000  182    17         64
Ebola    ProMED  Uganda   Gulu       5-Nov-2000   280    14         89
Ebola    ProMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    ProMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    ProMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    ProMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    ProMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    ProMED  Uganda   Gulu       19-Oct-2000  111    17         41

Protease-resistant prion protein interacts with...

Where did Dylan Thomas die?
1. Swansea: In Dylan: the Nine Lives of Dylan Thomas, Fryer makes a virtue of not coming from Swansea
2. Italy: Dylan Thomas's widow Caitlin, who died last week in Italy aged 81,
3. New York: Dylan Thomas died in New York 40 years ago next Tuesday

What diseases are caused by prions?
1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion, which does not follow the normal rules of microbiology.
3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...

MITRE

Page 61 Copyright 2001 The MITRE Corporation. All rights reserved.

Coreference and Question Answering


Question: What diseases are caused by prions? Qanda answer #3:
These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...

Problem: Need to resolve pronoun to get the real answer (in the
preceding sentence):
Prion disorders -- including bovine spongiform encephalopathy, or ``mad cow disease'' in cattle, CJD in humans, and scrapie in sheep -- are all characterized by progressive neurological degeneration resulting in death.
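One crude workaround can be sketched under the assumption that answers are selected at the sentence level: when a candidate answer sentence opens with a demonstrative or pronoun, return the preceding sentence along with it, since the referent (and often the real answer) usually lives there. The function and word list below are illustrative, not Qanda's actual coreference logic.

```python
DEMONSTRATIVES = ("these", "those", "this", "that", "it", "they")

def expand_answer(sentences, i):
    """If the candidate answer sentence at index i opens with a
    demonstrative or pronoun, include the preceding sentence,
    where the referent is likely to appear."""
    answer = sentences[i]
    first_word = answer.split()[0].lower() if answer.split() else ""
    if first_word in DEMONSTRATIVES and i > 0:
        return sentences[i - 1] + " " + answer
    return answer

doc = [
    "Prion disorders -- including BSE, CJD, and scrapie -- all cause neurological degeneration.",
    "These diseases are caused by a prion, an abnormal version of a naturally-occurring protein.",
]
print(expand_answer(doc, 1))  # prepends the sentence naming the diseases
```

A real system would instead run anaphora resolution, but even this heuristic recovers the disease names in the example above.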
Page 62 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering
Stage 1: Question analysis
- Find the type of object that answers the question: "when" needs a time, "which proteins" needs proteins
Stage 2: Document retrieval
- Using the (augmented) question, retrieve a set of possibly relevant documents via information retrieval
Stage 3: Document processing
- Search documents for entities of the desired type using information extraction
- Search for entities in appropriate relations
Stage 4: Rank answer candidates
Stage 5: Present the answer (N bytes, or a phrase, a sentence, or a summary)
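The five stages above can be sketched as a skeleton. Every function name here is illustrative, and the retrieval, extraction, and ranking components are passed in as stubs rather than real IR/IE engines; this is a sketch of the control flow, not any particular system's API.

```python
# Skeleton of the five-stage Q&A pipeline with illustrative stubs.

QUESTION_TYPES = {
    "when": "TIME", "where": "LOCATION", "who": "PERSON",
    "which proteins": "PROTEIN",
}

def analyze_question(question):
    """Stage 1: map the question word to an expected answer type."""
    q = question.lower()
    for cue, ans_type in QUESTION_TYPES.items():
        if q.startswith(cue):
            return ans_type
    return "ANY"

def answer(question, retrieve, extract_entities, rank):
    ans_type = analyze_question(question)               # Stage 1
    docs = retrieve(question)                           # Stage 2: IR over the collection
    candidates = [e for d in docs
                  for e in extract_entities(d)
                  if e["type"] in (ans_type, "ANY")]    # Stage 3: IE for typed entities
    ranked = rank(question, candidates)                 # Stage 4
    return ranked[:5]                                   # Stage 5: top-5 candidate answers

# Toy plumbing to exercise the skeleton.
docs = ["Dylan Thomas died in New York in 1953."]
retrieve = lambda q: docs
extract_entities = lambda d: [{"text": "New York", "type": "LOCATION"}]
rank = lambda q, cs: cs
print(answer("Where did Dylan Thomas die?", retrieve, extract_entities, rank))
```

The key design point is that the expected answer type computed in Stage 1 filters the entities found in Stage 3, so the ranker only sees type-compatible candidates.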

Page 63 Copyright 2001 The MITRE Corporation. All rights reserved.

Evaluating Question Answering Systems


TREC-8 (99) and TREC-9 (00) included a question answering track; TREC-10 (01) will as well.
TREC-9 Q&A evaluation:
- For each of 700 factual short-answer questions
- Each system must return a ranked list of 5 candidate answers (250-byte or 50-byte) based on the standard TREC document collection
- Each question-answer pair is judged correct or incorrect by a human assessor
- The system score is the mean reciprocal rank of correct answers
For TREC-8 and TREC-9, all questions had answers; for TREC-10, not all questions will have answers.

Page 64 Copyright 2001 The MITRE Corporation. All rights reserved.

TREC Q&A 2000 Results (250-byte)


Top result: Harabagiu and Moldovan, Southern Methodist University
- Mean Reciprocal Rank: 76%
- First Answer Correct: 69%
- Correct Answer in Top 5: 86%

[Bar chart of mean reciprocal rank by participating site; site labels lost in extraction.]

Lesson: question answering works, at least for simple factual questions.

Page 65 Copyright 2001 The MITRE Corporation. All rights reserved.

Top System at TREC Q&A 2000 (SMU): Some Key Features


Question analysis based on identifying:
- Expected answer type (using the WordNet semantic hierarchy)
- Syntactic relations related to the answer type, e.g., "What is the wingspan of a condor?" => quantity(wingspan), of(wingspan, condor)
Iterative retrieval of relevant paragraphs using question key words
- Adjust the size of the retrieved document pool to maximize the probability of finding an answer
Document processing, including dependency parsing and entity extraction
Semantic processing, to ensure a match between the question relations and the proposed answer
Page 66 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering: Status


Question answering has successfully pushed the integration of information retrieval and natural language processing techniques.
To date, question types are very limited:
- Systems assume that an answer is always present (so far, systems do not know what they don't know)
- Systems assume the answer is contained in a single sentence: answers cannot be composed of lists gathered across multiple sources

Page 67 Copyright 2001 The MITRE Corporation. All rights reserved.

Example Analysis: Effect of Answer Multiplicity on Correctness


[Scatter plot: answer correctness (0 to 0.9) vs. number of answer repetitions per question (0 to 70), for individual questions (50) and the average per number of answers; data points lost in extraction.]

Page 68 Copyright 2001 The MITRE Corporation. All rights reserved.

Question/Answering Resources

Books/Journals
- Question Answering Systems: Papers from the AAAI Fall Symposium, Vinay Chaudhri and Richard Fikes, Program Cochairs, AAAI Technical Report FS-99-02
- Hirschman, L. and Gaizauskas, R. (eds.) Forthcoming in Fall 2002. Special Issue on Question Answering, Journal of Natural Language Engineering

On-line Q&A Tutorials


- www.cs.unca.edu/~bruce/acl01/QApresentations/presentations.html - www-users.cs.york.ac.uk/~mdeboni/research/links.html

Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html#qa (TREC8 Q&A papers) - trec.nist.gov/pubs/trec9/index.track.html#qa (TREC9 Q&A papers)

Government initiatives
- AQUAINT: www.ic-arda.org/solicitations/AQUAINT - TREC8: citeseer.nj.nec.com/346894.html - Also, trec.nist.gov

Page 69 Copyright 2001 The MITRE Corporation. All rights reserved.

Conclusion
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering

Page 70 Copyright 2001 The MITRE Corporation. All rights reserved.

Acknowledgements
Special thanks to
- Lynette Hirschman for some IE, IR, and Q&A slides - David Day, Warren Greiff, and Christy Doran for tool and resources research - Inderjeet Mani for summarization evaluation slides - Jim Burnetti, Tom McEntee and Donna Trammell for Semio Taxonomy examples - Penny Chase for Google performance example

Page 71 Copyright 2001 The MITRE Corporation. All rights reserved.

BACKUP

Page 72 Copyright 2001 The MITRE Corporation. All rights reserved.

Mixed Initiative Annotation Methodology Used in the Alembic Workbench


1. Manually annotate raw text
2. Invoke machine learning to derive annotation rules
3. Apply phrase-finding rules to raw text
4. Manually correct machine-annotated text

[Diagram: source texts cycle through these steps, building up rule sets (if ... then ... rules) and a training and test corpus.]

Page 73 Copyright 2001 The MITRE Corporation. All rights reserved.

Alembic Workbench supports:
- Multilingual annotation (UNICODE characters and fonts)
- Machine learning
- Evaluation for content extraction

Page 74

http://www.mitre.org/technology/alembic-workbench Copyright 2001 The MITRE Corporation. All rights reserved.

Alembics Engine: Transformational Phrase Rules


Finding a likely named entity:

(def-phraser-rule
  :anchor :lexeme
  :conditions (:wd :p-o-s (:NNP :NNPS))
  :actions (:create-phrase :NONE))

Assigning a type to a phrase:

(def-phraser-rule
  :anchor :phrase
  :conditions (:phrase :phrase-label :NONE)
              (:left-1 :lex (Dr. Mr. Prof.))
  :actions (:set-label :PERSON))

On Monday Dr. Grieg, IBM's new chief scientist, announced that their new supercomputer, Powerful Purple, will
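The two rules above can be mimicked in a few lines. This is an illustrative simulation under simplifying assumptions (a flat token list, single-word phrases), not Alembic's actual rule engine: rule 1 creates an untyped phrase over proper-noun lexemes, and rule 2 relabels a :NONE phrase as :PERSON when its left neighbor is a title.

```python
TITLES = {"Dr.", "Mr.", "Prof."}

def apply_rules(tokens):
    """tokens: list of (word, part_of_speech) pairs. Returns phrases
    as (word, label) after the two rules fire in sequence."""
    # Rule 1: anchor on lexemes with POS NNP/NNPS -> create phrase :NONE.
    phrases = [[w, "NONE"] for w, pos in tokens if pos in ("NNP", "NNPS")]
    # Rule 2: anchor on :NONE phrases whose left-1 lexeme is a title.
    words = [w for w, _ in tokens]
    for phrase in phrases:
        i = words.index(phrase[0])
        if phrase[1] == "NONE" and i > 0 and words[i - 1] in TITLES:
            phrase[1] = "PERSON"
    return [tuple(p) for p in phrases]

tokens = [("On", "IN"), ("Monday", "NNP"), ("Dr.", "NNP"), ("Grieg", "NNP")]
print(apply_rules(tokens))
```

As in the example sentence, "Grieg" is relabeled PERSON because "Dr." sits to its left, while "Monday" stays an untyped phrase.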

Page 75 Copyright 2001 The MITRE Corporation. All rights reserved.

Some Alembic System Measures

Rule sequence length:
                      English   Spanish   Japanese
  - Names               141       333       100
  - Money, percents      12        19        12
  - Dates/times         fsa     fsa + 24     21
  - Titles                4        30
  - Total               167       406       133

Processing rate* (words/min.):
                      English   Spanish   Japanese
                      23,100    26,300    23,900
* On Sparc 10, without having pursued many opportunities for optimization.

Page 76 Copyright 2001 The MITRE Corporation. All rights reserved.

Alembic Workbench: Relations

Alembic Workbench uses machine learning to (1) identify constituents (slot-filling values) and (2) propose plausible relation instances for selection by human annotators.

Page 77 Copyright 2001 The MITRE Corporation. All rights reserved.

Relation Tagging Interface #3 Screen Dump

Page 78 Copyright 2001 The MITRE Corporation. All rights reserved.

Putting it All Together: Defining Templates (Relations)

Empirical Study of Productivity Gains Afforded by Alembic Workbench (for Named Entities)
[Bar charts of annotation productivity, in words/minute and tags/minute, for four corpus development tools: Emacs (emacs-AB), AWB GUI only (awb-AB), AWB + pre-tagging on 5 docs (awb-5-AB), and AWB + pre-tagging on 100 docs (awb-100-AB).]

Page 79 Copyright 2001 The MITRE Corporation. All rights reserved.

Geospatial News on Demand (GeoNODE)


[Architecture diagram]
- News sources: broadcast news, the World Wide Web, intelligence message traffic, specialist archives
- Processing pipeline: data acquisition/pre-processing, information extraction, data mining and clustering, indexing and news modeling, feeding the GeoNODE database
- Interface views: topic timeline, news histogram, map overview, BNN story skim
- Capabilities: navigate, filter, indexed access, animate reporting trends, create reports/web pages

Page 80 Copyright 2001 The MITRE Corporation. All rights reserved.

Automatic Topic Detection


Embassy Bombing and Counterstrike: clustering identified separate topics for the bombing and the counterstrike.

[Map visualization: separate Bombing and Counterstrike clusters]

Page 81 Copyright 2001 The MITRE Corporation. All rights reserved.
