
Information Access: State of the Art

Mark T. Maybury
maybury@mitre.org

November 15, 2001 http://www.mitre.org/resources/centers/it/maybury/mark.html


This data is the copyright and proprietary data of the MITRE Corporation. It is made available subject to Limited Rights, as defined in paragraph (a)(15) of the clause at DFAR 252.227-7013. The restrictions governing the use and disclosure of these materials are set forth in the aforesaid clause.

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

Page 2 Copyright 2001 The MITRE Corporation. All rights reserved.

Abstract
This briefing overviews the state of the art in information retrieval, summarization, information extraction, text clustering and question answering. It reports system performance from national evaluations and identifies current commercial products that provide each type of processing. Key findings include:
- automated systems exist that can return documents relevant to a particular subject with around 80% precision but low recall
- automated query generation and relevance feedback are near human performance
- systems can presently identify entities at over 90% accuracy and relations among entities at 70-80% accuracy
- systems can summarize documents to 20% of their source size without information loss, saving users 50% of task time
- systems can also respond to a simple factual question by returning answers from relevant documents at 75% accuracy.
We describe how the community can make rapid progress on information access for intelligence using corpus-based evaluation, which includes: 1) creating challenge problems with supporting data, 2) evaluating system performance on these problems, and 3) comparing approaches and sharing data, resources and tools. The briefing provides detailed pointers to data resources and commercial products.

Current Situation

Analyst data: 10,000 messages per day (IMINT, SIGINT, HUMINT, Intel Reports)

- Data search: the analyst creates a profile to sort messages. The system sorts messages by key words -- limited precision. The analyst can study 200 messages per day.
- Extraction: the analyst reads selected messages to extract facts.
- Interpretation: the analyst builds a link analysis model -- manually, and (at best) partial and static.
- Conclusion: the analyst forms an opinion and authors a report.
- Report: the final output is text: a message, web page or document.

Intelligent Information Access

Inputs:
- Unstructured sources: news, the web
- Semi-structured info: documents
- Structured databases: known terrorist groups, world governments, biological agents
- Knowledge resources: world fact book, gazetteer, hand-crafted taxonomy

Intelligent Information Processing (Retrieve, Extract, Summarize, Visualize) supports question answering for the user, e.g., "Is X a member of a terrorist organization?"

State of the Art: Summary

Automated systems exist that can:
- Return documents relevant to a particular subject with around 80% precision but low recall
- Generate queries and apply relevance feedback at near-human performance
- Identify entities at over 90% accuracy and relations among entities at 70-80% accuracy
- Summarize documents to 20% of their source size without information loss, saving users 50% of task time
- Respond to a simple factual question by returning answers from relevant documents at 75% accuracy

We can make rapid progress on information access for intelligence using corpus-based evaluation:
- Create challenge problems with supporting data
- Evaluate system performance on these problems
- Compare approaches; share data, resources and tools

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

Information Retrieval

Input: query words. Output: ranked list of documents.

Information Retrieval: key words to documents
- Speed, scalability, domain independence and robustness are critical for access to large collections of documents

Data scale (illustrated with biomedical sources such as PIR, Genbank and MEDLINE):
- Collections: Gigabytes
- Documents: Megabytes
- Lists, tables: Kilobytes (e.g., a table of Ebola outbreak reports from PROMED for Gulu, Uganda: dates, cases, new cases, deaths)
- Phrases: Bytes (e.g., "Protease-resistant prion protein interacts with...")

Technique
- Shallow processing provides a coarse-grained result (entire documents or passages)
- The query is transformed into a collection of words; grammatical relations between words are lost
- Documents are indexed by word occurrences
- Search matches the query probe against indexed documents using a Boolean combination of terms, a vector of word occurrences, or a language model

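As a concrete (and deliberately toy) illustration of the technique bullets above -- word-occurrence indexing, Boolean matching of terms, and vector matching -- here is a minimal Python sketch; the document names and texts are invented for the example:

```python
from collections import Counter, defaultdict
import math

def build_index(docs):
    """Index documents by word occurrences (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, terms):
    """Boolean AND: documents containing every query term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def cosine_rank(docs, query):
    """Rank documents by cosine similarity of raw term-count vectors."""
    q = Counter(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        d = Counter(text.lower().split())
        dot = sum(q[w] * d[w] for w in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        scores[doc_id] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "pan am flight 103 lockerbie trial",
    "d2": "health services joint venture dissolved",
}
print(boolean_and(build_index(docs), ["flight", "lockerbie"]))  # {'d1'}
print(cosine_rank(docs, "lockerbie flight")[0])                 # d1
```

Real engines add stemming, stop-word removal and tf/idf weighting, but the query-as-bag-of-words loss of grammatical relations is visible even here.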

Evaluating Text Retrieval

The Text REtrieval Conference (TREC) has been held annually since 1992, run by NIST*
- Successful -- attracting hundreds of international participants from industry, academia and government

Goal: systematic evaluation of retrieval systems using a large (5 GB) common corpus
- Given a set of queries, for each query:
  - Systems return a ranked list of documents
  - Human judges provide relevance assessments for the ranked documents
  - Relevance judgements are used to compute average precision-recall plots for each system

Precision: % of returned documents judged relevant
Recall: % of relevant documents found

*US National Institute of Standards and Technology

Precision and Recall

Question: Who are Mohamed Atta's associates?
Query: mohamed+atta+associates
Answers: Marwan al-Shehhi, Ziad Jarrah, Hani Hanjour, Rohit Agrawal, ...

Precision: 12 out of 106 returned documents were relevant (12/106 = 11%)
Recall: one document that was not retrieved contained a new associate, Said Bahaji (11/12 = 91%)

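The two ratios on this slide can be stated as two one-line functions; a quick sketch of the arithmetic:

```python
def precision(num_relevant_returned, num_returned):
    """Fraction of returned documents that are relevant."""
    return num_relevant_returned / num_returned

def recall(num_relevant_returned, num_relevant_total):
    """Fraction of all relevant documents that were returned."""
    return num_relevant_returned / num_relevant_total

# The slide's figures: 12 of 106 returned documents were relevant,
# and 11 of the 12 relevant documents were retrieved.
print(int(precision(12, 106) * 100))  # 11
print(int(recall(11, 12) * 100))      # 91
```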

Sample TREC Topic

<num> Number: 409
<title> legal, Pan Am, 103
<desc> Description: What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland on December 21, 1988?
<narr> Narrative: Documents describing any charges, claims or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.

Field annotations:
- Title: up to 3 words best describing the topic
- Description: one-sentence description of the topic
- Narrative: description of what makes a document relevant or irrelevant

TREC9 Results for a High-Performing System

[Precision-recall curves for an automatically generated query vs. a manually generated query]

- This is a representative high-performing system
- Manual (expert) choice of query words works better than automatically generated queries
- 4 of the top 5 documents are relevant: precision = 80%, but recall is low
- Note that if you need to find all the literature on a subject, you have to look through lots of junk!

TREC Findings

It is possible to create test collections and an evaluation metric for large corpora
- TREC pioneered evaluation of recall over large document collections, using pooled results of multiple systems to prune the space and estimate recall
- TREC has experimented with different evaluations (tracks) for filtering, routing, Web search, ...
- The basic paradigm is still word-based; so far, the addition of syntax and semantics hasn't helped
- For short queries, adding information (words) to the queries helps: by hand, by thesaurus, or by feedback of relevant documents

Other lessons:
- The need for fine-grained retrieval has led to a new Question Answering track

Precision vs. Documents Retrieved

[Precision as a function of the number of documents retrieved, for automatic and manual queries]

- 3 of the first 5 documents are relevant; 20 of the first 100 documents are relevant
- Systems can use relevant documents to enrich the set of query words via relevance feedback, done automatically or interactively

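Relevance feedback of the kind described above is classically formulated as a Rocchio update; the Python sketch below is a simplified, positive-only version (the query and documents are invented) that just adds the most frequent terms of the judged-relevant documents to the query:

```python
from collections import Counter

def expand_query(query, relevant_docs, top_n=3):
    """Enrich the query with the most frequent terms from documents the
    user judged relevant (a simplified, positive-only Rocchio update)."""
    centroid = Counter()
    for doc in relevant_docs:
        centroid.update(doc.lower().split())
    new_terms = [w for w, _ in centroid.most_common() if w not in query]
    return list(query) + new_terms[:top_n]

relevant = ["pan am flight 103 lockerbie bombing trial",
            "pan am 103 bombing suspects extradited"]
print(expand_query(["lockerbie"], relevant))
```

The full Rocchio formulation also weights the original query against the relevant-document centroid and subtracts terms from non-relevant documents.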

Information Retrieval Resources

IR Journals
- Information Processing and Management (IP&M)
- ACM Transactions on Information Systems (TOIS)
- Journal of the American Society for Information Science (JASIS)
- Journal of Documentation (JDOC)
- Information Retrieval (IR)
- Journal of Intelligent Information Systems (JIIS)

Books
- Modern Information Retrieval (ACM Press Series), R. Baeza-Yates, et al.
- Intelligent Multimedia Information Retrieval, M. Maybury (ed.) & Karen Sparck Jones
- Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, W. Bruce Croft (ed.)
- Introduction to Modern Information Retrieval, G. Salton and M. J. McGill
- Information Retrieval, C. J. van Rijsbergen (www.dcs.gla.ac.uk/Keith/Preface.html)

Conferences
- TREC (Text REtrieval Conference): trec.nist.gov
- ACM SIGIR: www.acm.org/sigir
- ACM CIKM (Conference on Information and Knowledge Management)

Information Retrieval Resources

Research Centers
- Universität Dortmund, Informatik VI: ls6-www.informatik.uni-dortmund.de
- Univ. of Glasgow: www.dcs.gla.ac.uk/idom
- Univ. of Maryland, Medical Informatics and Computational Intelligence Lab: www.enee.umd.edu//medlab/filter/filter_project.html
- Univ. of Massachusetts, Center for Intelligent Information Retrieval (CIIR): ciir.cs.umass.edu
- Univ. of Nevada, Las Vegas, Information Science Research Institute: www.isri.unlv.edu

Bibliographies
- mansci1.uwaterloo.ca/~jjiang/biblio.html
- www-inf.enst.fr/~rungsawa/irrs.html
- www.seas.gwu.edu/student/chulee/bib.html
- www.si.umich.edu/~mjpinto
- joinus.comeng.chungnam.ac.kr/~dolphin/db/indices/a-tree/s/Salton:Gerard.html
- superbook.bellcore.com/~std/LSI.html
- dmoz.org/Computers/Software/Information_Retrieval

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Summarization?

Definition: Summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

Extract vs. abstract:
- An extract is a summary consisting entirely of material copied from the input
- An abstract is a summary at least some of whose material is not present in the input (e.g., subject categories, paraphrase of content)

Illustration of Extracts and Abstracts

25 Percent Extract of Gettysburg Address (sentences 1, 2, 6):
Fourscore and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. The brave men, living and dead, who struggled here, have consecrated it far above our power to add or detract.

10 Percent Extract (sentence 2):
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.

15 Percent Evaluative Abstract:
This speech by Abraham Lincoln commemorates soldiers who laid down their lives in the Battle of Gettysburg. It offers an eloquent reminder to the troops that it is the future of freedom in America that they are fighting for.
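A sentence extract like the 25-percent Gettysburg example can be produced by a simple frequency-based scorer. The sketch below is a generic Luhn-style extractor with an invented test text, not the method behind any particular system in this briefing:

```python
from collections import Counter
import re

def extract_summary(text, ratio=0.25):
    """Score each sentence by the document-wide frequency of its words,
    then copy out the top-scoring sentences in source order
    (a Luhn-style frequency extract)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True)
    keep = set(ranked[:max(1, round(len(sentences) * ratio))])
    return [s for s in sentences if s in keep]  # preserve source order

text = ("The war tests the nation. "
        "The nation endures. "
        "Cats sleep.")
print(extract_summary(text, ratio=0.34))  # ['The war tests the nation.']
```

Because every output sentence is copied verbatim, this can only ever produce extracts; an abstract requires generation or paraphrase on top.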

Intrinsic Evaluation: SUMMAC Q&A Results

[Average Answer Recall (ARA) vs. compression for the CGI/CMU, Cornell/SabIR, GE, ISI, NMSU, Penn, SRA and TextWise summarizers]

- The highest recall is associated with the least reduction of the source
- Content-based automatic scoring (vocabulary overlap) correlates very well with human scoring (passage/answer recall)
- Informativeness: a ratio of accuracy to compression of about 1.5

Intrinsic Evaluation: Japanese Text Summarization Challenge (Fukusima and Okumura 2001)

- At each compression, systems outperformed Lead and TF baselines in content overlap with human summaries, measured against both human extracts and human abstracts
- Subjective grading of coherence and informativeness showed that human abstracts > human extracts > systems and baselines

Fukusima, T. and Okumura, M. 2001. "Text Summarization Challenge: Text summarization evaluation in Japan." Workshop on Automatic Summarization, Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001). New Brunswick, New Jersey: Association for Computational Linguistics.

Extrinsic Methods: Usefulness of a Summary in a Task

Measure the summary's usefulness with respect to some information need or goal, such as:
- Finding documents relevant to one's need from a large collection; routing documents
- Extracting facts from sources
- Producing an effective report or presentation using a summary
- If the summary involves instructions of some kind, measuring the efficiency of executing the instructions
- Assessing the impact of a summarizer on the system in which it is embedded (e.g., how much does summarization help the question answering system?)
- Measuring the amount of effort required to post-edit the summary output to bring it to some acceptable, task-dependent state

There is an unlimited number of tasks to which summarization could be applied.

SUMMAC Time and Accuracy (Ad hoc task, 21 subjects)

- All time differences are significant except between B and S1
- S2 summaries (23% of source length on average) roughly halved decision time relative to F (full text)
- All F-score and recall differences are significant except between F and S2
- Conclusion (ad hoc task): S2 summaries save 50% of task time without impairing accuracy

SUMMAC: Accuracy versus Time by System

[Accuracy plotted against time by system; "+" marks systems using similar sentence extraction]

Commercial Summarizers Compared

Features compared: statistical / machine; trainable / extensible; single documents; multiple documents; controllable compression; component of tool suite; >= 5 file formats; discourse model; sentence length; location; cue phrases; tf or tf/idf; generic focus; query focus; doc format; multi-lingual; fragmentary vs. connected output; SDK; GUI; evaluated.

Summarizers compared: AutoSummarizer (MS Word 97), CONTEXT, Data Hammer, DimSum, Extractor, GE Summarizer, Intelligent Miner, IntelliScope, InText, InXight Summarizer Plus, ProSum, Search 97 Developers Kit, SMART, SUMMARIST, TexNet 32, TextAnalyst 2.0.

[Feature matrix: each summarizer marked per feature above]

Key: a full mark means the product implements the feature; a partial mark means the product has the feature but it is not implemented as fully or as normally as in other products.

Commercial Summarizers

- Oracle ConText: technet.oracle.com/products/oracle7/context/faq.htm and www.oracle.com/products/servers/st_collateral/html/cntxt_qa.html
- IntelliScope: www.lhsl.com/tech/icm/retrieval/toolkit/default.asp
- InXight Summarizer Plus: www.inxight.com/Products/Developer/AD_Summzer.html
- Intelligent Miner: www-4.ibm.com/software/data/iminer/fortext/summarize/summarize.html
- Extractor (NRC): extractor.iit.nrc.ca
- ProSum: visita.labs.bt.com/prosum/word/sample.html
- GE Summarizer
- DimSum (SRA Int.): www.sra.com
- Data Hammer: www.glu.com/datahammer
- Microsoft Word 97 AutoSummarizer: www.microsoft.com/office
- Automatic Summarizer (RES International): res.ca/sum/dev2.html
- Search 97 Developers Kit: www.verity.com/support/documentation/tdk/adv23/05_adv.htm

Summarization Resources

Books/Journals
- Mani, I. and Maybury, M. (eds.) 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge.
- Mani, I. 2001. Automated Text Summarization. John Benjamins, Amsterdam.
- Mani, I. and Hahn, U. Nov 2000. Summarization Tutorial. IEEE Computer.

On-line Summarization Tutorials
- www.mitre.org/resources/centers/it/maybury/summarization/summarization.htm
- www.si.umich.edu/~radev/summarization/radev-summtutorial00.ppt
- www.isi.edu/~marcu/coling-acl98-tutorial.html

Bibliographies
- www.si.umich.edu/~radev/summarization/
- www.cs.columbia.edu/~jing/summarization.html
- www.dcs.shef.ac.uk/~gael/alphalist.html
- www.csi.uottawa.ca/tanka/ts.html

Government initiatives
- DUC-2001 Multi-document Summarization Evaluation (www-nlpir.nist.gov/projects/duc)
- DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) Program (tides.nist.gov, www.darpa.mil/ito/research/tides/projlist.html)

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Information Extraction?

Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations).

Template Extraction

Pipeline: Source -> Analysis -> Templates -> Transformation -> Synthesis

Template:
<TEMPLATE-8806150049-1> :=
    DOC NR: 8806150049
    CONTENT: <TIE_UP_RELATIONSHIP-8806150049-1>
    DATE TEMPLATE COMPLETED: 311292
    EXTRACTION TIME: 0

Text (Wall Street Journal, 06/15/88): MAXICARE HEALTH PLANS INC and UNIVERSAL HEALTH SERVICES INC have dissolved a joint venture which provided health services.

Maybury, M. 1995. Generating Summaries from Event Data. Information Processing and Management 31(5): 735-751.

Source text
"seis labriegos que fueron asesinados por las Autodefensas"
(six farm laborers who were assassinated by the Autodefensas)

Word segmentation (tokenization)
<lex>seis</lex> <lex>labriegos</lex> <lex>que</lex> <lex>fueron</lex> ...

Part-of-speech tagging
<lex pos=CD>seis</lex> <lex pos=NN>labriegos</lex> <lex pos=PP>que</lex> <lex pos=VBM>fueron</lex> ...

Named/Nominal extraction
<persons><num>seis</num> labriegos</persons> que fueron ...

Sentence chunking
<NounGroup>seis labriegos</NounGroup> que <VerbGroup>fueron asesinados</VerbGroup> ...

Sentence parsing (grammatical relations)
<object>seis labriegos</object> que <kill-act>fueron asesinados</kill-act> por <subject>las Autodefensas</subject>

Event extraction
<victims><num>seis</num> labriegos</victims> que <assassinate>fueron asesinados</assassinate> por ...

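A staged pipeline like the one on this slide can be caricatured in a few lines: a single surface pattern that jumps straight from raw text to a filled kill-event. The pattern and field names below are hypothetical, invented for illustration:

```python
import re

# A hypothetical surface pattern for the slide's Spanish kill-event:
# "<victims> que fueron asesinados por <perpetrator>"
KILL_EVENT = re.compile(
    r"(?P<victims>[\w\s]+?)\s+que\s+fueron\s+asesinados\s+por\s+(?P<perpetrator>.+)")

def extract_kill_event(sentence):
    """Fill an event record from one surface pattern, or return None."""
    m = KILL_EVENT.search(sentence)
    if not m:
        return None
    return {"victims": m.group("victims").strip(),
            "perpetrator": m.group("perpetrator").strip()}

event = extract_kill_event(
    "seis labriegos que fueron asesinados por las Autodefensas")
print(event)  # {'victims': 'seis labriegos', 'perpetrator': 'las Autodefensas'}
```

Real systems earn their keep on the variation a single pattern cannot cover (passives, relative clauses, intervening modifiers), which is what the tagging, chunking and parsing stages above provide.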

IBM's Intelligent Miner for Text: Feature Extraction Tool Screen Shot

Information Extraction: Epidemiology Example

1. Extract entities from text (color coded via HTML)

2. Extract outbreak events into a table:

Disease | Source | Country | City_name | Date | Cases | New_cases | Dead
Ebola | PROMED | Uganda | Gula | 26-Oct-2000 | 182 | 17 | 64
Ebola | PROMED | Uganda | Gula | 5-Nov-2000 | 280 | 14 | 89
Ebola | PROMED | Uganda | Gulu | 13-Oct-2000 | 42 | 9 | 30
Ebola | PROMED | Uganda | Gulu | 15-Oct-2000 | 51 | 7 | 31
Ebola | PROMED | Uganda | Gulu | 16-Oct-2000 | 63 | 12 | 33
Ebola | PROMED | Uganda | Gulu | 17-Oct-2000 | 73 | 2 | 35
Ebola | PROMED | Uganda | Gulu | 18-Oct-2000 | 94 | 21 | 39
Ebola | PROMED | Uganda | Gulu | 19-Oct-2000 | 111 | 17 | 41

3. Display events over time [chart of total cases, new cases and deaths, 13 Oct 2000 - 24 Nov 2000]

COTS IE Tools Compared

Features compared: named entities; nominal entities; normalized time; relations; events; noun phrases; extensible via machine learning; extensible via programming; multi-lingual.

Commercial systems: AeroText (Lockheed Martin); IdentiFinder (BBN/Verizon); Intelligent Miner for Text (IBM); NetOwl (SRA); ThingFinder (Inxight); ConText (Oracle); Semio Taxonomy; LexiQuest Mine; LingSoft; CoGenTex/Cornell; TextWise/Syracuse Univ.

Non-profit systems: Alembic (MITRE); GATE (U. Sheffield); Univ. of Arizona; New Mexico State University; Fastus/TextPro (SRI International); Proteus (New York University); TIMEX (MITRE); Univ. of Massachusetts/Amherst.

[Feature matrix: each system marked per feature above, with its supported languages and entity types]

Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, IT=Italian, FR=French, DE=German, DU=Dutch, AR=Arabic; P=People, O=Organization, L=Location, T=Time, M=Money

Entity Extraction Tools: Commercial Vendors

- AeroText (Lockheed Martin), Version 1.0: www.lockheedmartin.com/factsheets/product589_hi.html
- BBN's IdentiFinder: www.gte.com/AboutGTE/gto/bbnt/speech/research/extraction/tools/identifinder.html
- IBM's Intelligent Miner for Text: www-4.ibm.com/software/data/iminer/fortext/index.html
- SRA NetOwl: www.netowl.com
- Inxight's ThingFinder: www.inxight.com/pdfs/products/tf_server_ds.pdf
- Semio: www.semio.com
- ConText (Oracle): technet.oracle.com/products/oracle7/context/tutorial/start.htm
- LexiQuest Mine: www.lexiquest.com
- Lingsoft: www.lingsoft.fi
- CoGenTex: www.cogentex.com
- TextWise: www.textwise.com

Entity Extraction Tools: Non-Profit Organizations

- MITRE's Alembic extraction system and Alembic Workbench annotation tool: www.mitre.org/technology/nlp
- MITRE's TIMEX tagger for resolving references to dates and times: m19593-pc2.mitre.org/toolshed/ACL2000.pdf
- Univ. of Sheffield's GATE: gate.ac.uk
- Univ. of Arizona: ai.bpa.arizona.edu
- New Mexico State University: crl.nmsu.edu
- SRI International's Fastus/TextPro: www.ai.sri.com/~appelt/fastus.html, www.ai.sri.com/~appelt/TextPro
- New York University's Proteus: www.cs.nyu.edu/cs/projects/proteus/
- University of Massachusetts: www-nlp.cs.umass.edu/nlpie.html

Name Analysis Software

Language Analysis Systems Inc. (Herndon, VA): Name Reference Library, www.las-inc.com
- Funding: Office of National Drug Control Policy
- Supports analysis of Arabic, Hispanic, Chinese, Thai, Russian, Korean and Indonesian names; others in future versions
- Product features:
  - Name culture classification
  - Given a name, provides common variants on that name, e.g., Abd Al Rahman or Abdurrahman or ...
  - Implied gender
  - Identifies titles, affixes, qualifiers, e.g., Arabic "abd" means "servant of"; "bin" means "son of", as in Osama Bin Laden
  - Lists top countries where a name occurs
- Cost: free via GSA to Government with a $700 per-license annual fee until August 31, then $3,535 a copy and a $990 annual fee

Identifying Entities

Most research to date has been done on news reports:
- Systems can automatically identify person, organization, location, time and numerical expressions
- Systems exist that identify names ~90-95% accurately in the news (in several languages)
- Simply memorizing names doesn't work, since new people (and organizations) appear in the news, just as new genes are identified and named
- Rules capture local patterns that characterize entities, learned from instances of annotated training data:
  - "XXX met with YYY": XXX and YYY are probably people
  - "XXX bought out YYY": XXX and YYY are probably organizations

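Local patterns like the two rules above can be written directly as regular expressions. A toy sketch, with an invented (and deliberately tiny) pattern inventory and invented sentences:

```python
import re

# Hypothetical local patterns of the kind described above: the verb
# constrains the entity type of its arguments.
PATTERNS = [
    (re.compile(r"([A-Z]\w+) met with ([A-Z]\w+)"), "PERSON"),
    (re.compile(r"([A-Z]\w+) bought out ([A-Z]\w+)"), "ORGANIZATION"),
]

def tag_entities(text):
    """Return (string, type) pairs for every pattern match in the text."""
    entities = []
    for pattern, etype in PATTERNS:
        for m in pattern.finditer(text):
            entities += [(m.group(1), etype), (m.group(2), etype)]
    return entities

print(tag_entities("Smith met with Jones. Acme bought out Initech."))
```

Fielded systems learn thousands of such patterns (or statistical equivalents) from annotated training data rather than hand-listing them.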

Approaches to Identifying Entities

Terminology (name) lists
- This works very well if the list of names and name expressions is stable and available

Tokenization and morphology
- This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)

Use of characteristic patterns
- This works fairly well for novel entities
- Rules can be created by hand or learned via machine learning or statistical algorithms

Evaluating Entity Identification

Evaluation consists of two aspects:
- Detection of the phrase that names an entity
- Classification of the entity (e.g., distinguishing a protein from a gene)

Metrics used for text in NLP evaluations:
- Precision and recall for each entity class, where
  Precision = #CorrectReturned / #TotalReturned
  Recall = #CorrectReturned / #CorrectPossible
- F-measure is the harmonic mean of precision and recall, used as a balanced single measure

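The F-measure defined above is easy to compute from the two ratios; a sketch with invented counts:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g., 9 correct of 10 returned (P = 0.9), 9 of 12 possible (R = 0.75):
p, r = 9 / 10, 9 / 12
print(round(f_measure(p, r), 3))  # 0.818
```

Being a harmonic mean, F is pulled toward the lower of the two numbers, so a system cannot score well by maximizing precision or recall alone.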

Information Extraction Evaluations for Newswire

[F-measure (accuracy) of name extraction by year, 1991-1999, for English, Japanese and Chinese]

- Name extraction is over 90% in English and Japanese; name tagging is improving in Chinese
- Commercial name taggers exist for news reports in multiple languages

Extracting Spoken Names from Broadcast News

[Named entity accuracy for the MITRE, BBN, SRI and Sheffield systems under three recognizer conditions and on human transcripts]

Source: 1998 DARPA Named Entity Evaluation Results

Relation Extraction

Identify (and tag) the relation between two entities:
- A person is_located_at a location (news)
- A gene codes_for a protein (biology)

Relations require more information: identification of two entities and their relationship
- Predicted relation accuracy = Pr(E1) * Pr(E2) * Pr(R) ~ (.93)(.93)(.93) = .80

Information in relations is less local
- Contextual information is a problem: the right word may not be explicitly present in the sentence
- Complex syntax in abstracts is a problem (see examples from Park et al., PSB 2001)

Events involve more relations and are even harder

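The compound-accuracy estimate above is just a product of the component probabilities, assuming (as the slide implicitly does) that the three errors are independent:

```python
def relation_accuracy(p_e1, p_e2, p_rel):
    """A relation is right only if both entities AND the relation itself
    are right, so (assuming independence) the accuracies multiply."""
    return p_e1 * p_e2 * p_rel

# With ~93% accuracy on each component, as on the slide:
print(round(relation_accuracy(0.93, 0.93, 0.93), 2))  # 0.8
```

The same multiplication explains why events, which compose several relations, degrade faster still.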

Information Extraction Evaluations for Newswire

[F-measure by year, 1991-1999, for names (English, Japanese, Chinese), relations and events]

- Relation extraction is now at 80%
- Event extraction is at less than 60%, improving slowly

Information Extraction Resources

Books/Journals
- Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology (International Summer School, SCIE-97, Frascati, Italy), edited by Maria Teresa Pazienza, J. Siekmann and J. G. Carbonell

On-line IE Tutorials
- www.ai.sri.com/~appelt/ie-tutorial
- citeseer.nj.nec.com/gaizauskas98information.html
- NIST Information Extraction web page: www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html

Government initiatives
- ACE: Automated Content Extraction (www.nist.gov/speech/tests/ace)
- TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides)

Outline
1. Introduction
2. Information retrieval
3. Summarization
4. Information extraction
5. Text Clustering
6. Question answering

What is Text Clustering?

Definition: Clustering is the process of detecting topics within a document collection, assigning documents to those topics, and labeling these topic clusters.

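One minimal way to realize this definition is single-pass greedy clustering over word-count vectors. This Python sketch is illustrative only (the threshold and example documents are invented, and real products use far richer models); cluster labels could be taken from the most frequent centroid terms:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: join a document to the first
    cluster whose centroid it resembles, else start a new cluster."""
    clusters = []  # list of (centroid Counter, [doc indices])
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)  # fold the document into the topic
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

docs = ["ebola outbreak in uganda",
        "uganda ebola cases rise",
        "stock market rallies on earnings"]
print(cluster(docs))  # [[0, 1], [2]]
```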

Semio Taxonomy: Creation of Browseable Collections

1. Extract phrases and build a reverse index
2. Cluster concepts, drawing on existing knowledge (e.g., a geospatial gazetteer) and specialized dictionaries
3. Create taxonomies and attach concept clusters to them
4. Browse the document collection

A Semiotic Network: What Do We Know About North Korea?

Taxonomy Example: Browse by Geography -- Select a Country

Taxonomy Example: Browse by Geography -- Select a Relevant Concept

Taxonomy Example: Browse by Geography -- Get Documents

Taxonomy Example: View Concepts in Context

IntelGazette - Topic Clustering and Labeling

Automatic topic clustering generates daily and weekly news summaries and links related stories across time and news sources.

IntelGazette


User Profiles to Customize Delivery


COTS Text Clustering Tools Compared

Features compared: clustering over named entities, noun phrases, or all words equally; accepts predefined terms (predefined topics); predefined vs. generated taxonomies; topic tracking; new topic detection; story segmentation; multi-lingual support.

Document clustering, mining, topic detection and visualization systems: Inxight Categorizer and Tree Studio (Inxight); Semio Taxonomy; LexiQuest Mine; InterMedia Text (Oracle); NorthernLight; Autonomy; Lotus Discovery Server (LDS, Lotus); QKS Classifier (Quiver); Fulcrum Knowledge Server (Hummingbird); SPIRE/Themeview (PNNL); VantagePoint (Search Technology Inc.); Mohomine, Inc.; Intelligent Miner for Text (IBM); Oasis/OnTopic (BBN/Verizon).

[Feature matrix: each system marked per feature above, with its supported languages]

Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, FR=French, DE=German, DU=Dutch, IT=Italian, AR=Arabic

Text Clustering Tool Organizations



- Inxight's Categorizer and Tree Studio: www.inxight.com
- Oracle's interMedia Text: www.oracle.com
- Semio Taxonomy: www.semio.com
- LexiQuest Mine: www.lexiquest.com
- Northern Light document clustering: www.northernlight.com
- Autonomy: www.autonomy.com
- Lotus's Discovery Server (LDS): www.lotus.com/km
- Quiver's QKS Classifier: www.quiver.com
- Hummingbird's Fulcrum Knowledge Server: www.hummingbird.com
- PNNL's SPIRE/ThemeView visualization: showcase.pnl.gov/show?it/themeview
- Search Technology's VantagePoint: www.thevantagepoint.com
- Mohomine's text classification components: www.mohomine.com
- IBM's Intelligent Miner for Text: www.ibm.com/software/data/iminer/fortext/tatools.html
- BBN/Verizon's OnTopic/Oasis: www.bbn.com/speech/ontopic.html
Page 58 Copyright 2001 The MITRE Corporation. All rights reserved.

Text Clustering Resources



Books/Journals
- Topic Detection and Tracking Pilot Study (1998) (citeseer.nj.nec.com/allan98topic.html)

Tutorial
- www.parc.xerox.com/istl/projects/ia/sg-clustering.html

Bibliographies
- dewey.yonsei.ac.kr/memexlee/links/clustering.htm - dmoz.org/Reference/Knowledge_Management/Knowledge_Discovery/Text_Mining/

Government initiatives
- Topic Detection and Tracking Evaluation Project (www.nist.gov/speech/tests/tdt/index.htm) - Text REtrieval Conference (trec.nist.gov) - TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides, tides.nist.gov)

Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html (TREC8) - trec.nist.gov/pubs/trec9/index.track.html (TREC9)

Page 59 Copyright 2001 The MITRE Corporation. All rights reserved.

Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering

Page 60 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering

Question Answering (MITREs QANDA System)

Collections: Gigabytes Documents: Megabytes


PIR Genbank MEDLINE

Question Answering: question to answer

Lists,Tables: Kilobytes Phrases: Bytes

Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    ProMED  Uganda   Gulu       26-Oct-2000  182    17         64
Ebola    ProMED  Uganda   Gulu       5-Nov-2000   280    14         89
Ebola    ProMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    ProMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    ProMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    ProMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    ProMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    ProMED  Uganda   Gulu       19-Oct-2000  111    17         41

Protease-resistant prion protein interacts with...

Where did Dylan Thomas die?
1. Swansea: In Dylan: the Nine Lives of Dylan Thomas, Fryer makes a virtue of not coming from Swansea
2. Italy: Dylan Thomas's widow Caitlin, who died last week in Italy aged 81,
3. New York: Dylan Thomas died in New York 40 years ago next Tuesday

What diseases are caused by prions?
1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion, which does not follow the normal rules of microbiology.
3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...

MITRE

Page 61 Copyright 2001 The MITRE Corporation. All rights reserved.

Coreference and Question Answering


Question: What diseases are caused by prions? Qanda answer #3:
These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...

Problem: Need to resolve pronoun to get the real answer (in the
preceding sentence):
Prion disorders -- including bovine spongiform encephalopathy, or ``mad cow disease'' in cattle, CJD in humans, and scrapie in sheep -- are all characterized by progressive neurological degeneration resulting in death.
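One crude workaround can be sketched under the assumption that answers are selected at the sentence level: when a candidate answer sentence opens with a demonstrative or pronoun, return the preceding sentence along with it, since the referent (and often the real answer) usually lives there. The function and word list below are illustrative, not Qanda's actual coreference logic.

```python
DEMONSTRATIVES = ("these", "those", "this", "that", "it", "they")

def expand_answer(sentences, i):
    """If the candidate answer sentence at index i opens with a
    demonstrative or pronoun, include the preceding sentence,
    where the referent is likely to appear."""
    answer = sentences[i]
    first_word = answer.split()[0].lower() if answer.split() else ""
    if first_word in DEMONSTRATIVES and i > 0:
        return sentences[i - 1] + " " + answer
    return answer

doc = [
    "Prion disorders -- including BSE, CJD, and scrapie -- all cause neurological degeneration.",
    "These diseases are caused by a prion, an abnormal version of a naturally-occurring protein.",
]
print(expand_answer(doc, 1))  # prepends the sentence naming the diseases
```

A real system would instead run anaphora resolution, but even this heuristic recovers the disease names in the example above.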
Page 62 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering
Stage 1: Question analysis
- Find the type of object that answers the question: "when" needs a time, "which proteins" needs proteins
Stage 2: Document retrieval
- Using the (augmented) question, retrieve a set of possibly relevant documents via information retrieval
Stage 3: Document processing
- Search documents for entities of the desired type using information extraction
- Search for entities in appropriate relations
Stage 4: Rank answer candidates
Stage 5: Present the answer (N bytes, or a phrase, a sentence, or a summary)
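The five stages above can be sketched as a skeleton. Every function name here is illustrative, and the retrieval, extraction, and ranking components are passed in as stubs rather than real IR/IE engines; this is a sketch of the control flow, not any particular system's API.

```python
# Skeleton of the five-stage Q&A pipeline with illustrative stubs.

QUESTION_TYPES = {
    "when": "TIME", "where": "LOCATION", "who": "PERSON",
    "which proteins": "PROTEIN",
}

def analyze_question(question):
    """Stage 1: map the question word to an expected answer type."""
    q = question.lower()
    for cue, ans_type in QUESTION_TYPES.items():
        if q.startswith(cue):
            return ans_type
    return "ANY"

def answer(question, retrieve, extract_entities, rank):
    ans_type = analyze_question(question)               # Stage 1
    docs = retrieve(question)                           # Stage 2: IR over the collection
    candidates = [e for d in docs
                  for e in extract_entities(d)
                  if e["type"] in (ans_type, "ANY")]    # Stage 3: IE for typed entities
    ranked = rank(question, candidates)                 # Stage 4
    return ranked[:5]                                   # Stage 5: top-5 candidate answers

# Toy plumbing to exercise the skeleton.
docs = ["Dylan Thomas died in New York in 1953."]
retrieve = lambda q: docs
extract_entities = lambda d: [{"text": "New York", "type": "LOCATION"}]
rank = lambda q, cs: cs
print(answer("Where did Dylan Thomas die?", retrieve, extract_entities, rank))
```

The key design point is that the expected answer type computed in Stage 1 filters the entities found in Stage 3, so the ranker only sees type-compatible candidates.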

Page 63 Copyright 2001 The MITRE Corporation. All rights reserved.

Evaluating Question Answering Systems


TREC-8 (99) and TREC-9 (00) included a question answering track; TREC-10 (01) will as well.
TREC-9 Q&A evaluation:
- For each of 700 factual short-answer questions
- Each system must return a ranked list of 5 candidate answers (250-byte or 50-byte) based on the standard TREC document collection
- Each question-answer pair is judged correct or incorrect by a human assessor
- The system score is the mean reciprocal rank of correct answers
For TREC-8 and TREC-9, all questions had answers; for TREC-10, not all questions will have answers.

Page 64 Copyright 2001 The MITRE Corporation. All rights reserved.

TREC Q&A 2000 Results (250-byte)


Top result: Harabagiu and Moldovan, Southern Methodist University
- Mean Reciprocal Rank: 76%
- First Answer Correct: 69%
- Correct Answer in Top 5: 86%

[Bar chart of mean reciprocal rank by participating site; site labels lost in extraction.]

Lesson: question answering works, at least for simple factual questions.

Page 65 Copyright 2001 The MITRE Corporation. All rights reserved.

Top System at TREC Q&A 2000 (SMU): Some Key Features


Question analysis based on identifying:
- Expected answer type (using the WordNet semantic hierarchy)
- Syntactic relations related to the answer type, e.g., "What is the wingspan of a condor?" => quantity(wingspan), of(wingspan, condor)
Iterative retrieval of relevant paragraphs using question key words
- Adjust the size of the retrieved document pool to maximize the probability of finding an answer
Document processing, including dependency parsing and entity extraction
Semantic processing, to ensure a match between the question relations and the proposed answer
Page 66 Copyright 2001 The MITRE Corporation. All rights reserved.

Question Answering: Status


Question answering has successfully pushed the integration of information retrieval and natural language processing techniques.
To date, question types are very limited:
- Systems assume that an answer is always present (so far, systems do not know what they don't know)
- Systems assume the answer is contained in a single sentence: answers cannot be composed of lists gathered across multiple sources

Page 67 Copyright 2001 The MITRE Corporation. All rights reserved.

Example Analysis: Effect of Answer Multiplicity on Correctness


[Scatter plot: answer correctness (0 to 0.9) vs. number of answer repetitions per question (0 to 70), for individual questions (50) and the average per number of answers; data points lost in extraction.]

Page 68 Copyright 2001 The MITRE Corporation. All rights reserved.

Question/Answering Resources

Books/Journals
- Question Answering Systems: Papers from the AAAI Fall Symposium, Vinay Chaudhri and Richard Fikes, Program Cochairs, AAAI Technical Report FS-99-02
- Hirschman, L. and Gaizauskas, R. (eds.) Forthcoming in Fall 2002. Special Issue on Question Answering, Journal of Natural Language Engineering

On-line Q&A Tutorials


- www.cs.unca.edu/~bruce/acl01/QApresentations/presentations.html - www-users.cs.york.ac.uk/~mdeboni/research/links.html

Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html#qa (TREC8 Q&A papers) - trec.nist.gov/pubs/trec9/index.track.html#qa (TREC9 Q&A papers)

Government initiatives
- AQUAINT: www.ic-arda.org/solicitations/AQUAINT - TREC8: citeseer.nj.nec.com/346894.html - Also, trec.nist.gov

Page 69 Copyright 2001 The MITRE Corporation. All rights reserved.

Conclusion
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering

Page 70 Copyright 2001 The MITRE Corporation. All rights reserved.

Acknowledgements
Special thanks to
- Lynette Hirschman for some IE, IR, and Q&A slides - David Day, Warren Greiff, and Christy Doran for tool and resources research - Inderjeet Mani for summarization evaluation slides - Jim Burnetti, Tom McEntee and Donna Trammell for Semio Taxonomy examples - Penny Chase for Google performance example

Page 71 Copyright 2001 The MITRE Corporation. All rights reserved.

BACKUP

Page 72 Copyright 2001 The MITRE Corporation. All rights reserved.

Mixed Initiative Annotation Methodology Used in the Alembic Workbench


1. Manually annotate raw text
2. Invoke machine learning to derive annotation rules
3. Apply phrase-finding rules to raw text
4. Manually correct machine-annotated text

[Diagram: source texts cycle through these steps, building up rule sets (if ... then ... rules) and a training and test corpus.]

Page 73 Copyright 2001 The MITRE Corporation. All rights reserved.

Alembic Workbench supports:
- Multilingual annotation (UNICODE characters and fonts)
- Machine learning
- Evaluation for content extraction

Page 74

http://www.mitre.org/technology/alembic-workbench Copyright 2001 The MITRE Corporation. All rights reserved.

Alembics Engine: Transformational Phrase Rules


Finding a likely named entity:

(def-phraser-rule
  :anchor :lexeme
  :conditions (:wd :p-o-s (:NNP :NNPS))
  :actions (:create-phrase :NONE))

Assigning a type to a phrase:

(def-phraser-rule
  :anchor :phrase
  :conditions (:phrase :phrase-label :NONE)
              (:left-1 :lex (Dr. Mr. Prof.))
  :actions (:set-label :PERSON))

On Monday Dr. Grieg, IBM's new chief scientist, announced that their new supercomputer, Powerful Purple, will
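The two rules above can be mimicked in a few lines. This is an illustrative simulation under simplifying assumptions (a flat token list, single-word phrases), not Alembic's actual rule engine: rule 1 creates an untyped phrase over proper-noun lexemes, and rule 2 relabels a :NONE phrase as :PERSON when its left neighbor is a title.

```python
TITLES = {"Dr.", "Mr.", "Prof."}

def apply_rules(tokens):
    """tokens: list of (word, part_of_speech) pairs. Returns phrases
    as (word, label) after the two rules fire in sequence."""
    # Rule 1: anchor on lexemes with POS NNP/NNPS -> create phrase :NONE.
    phrases = [[w, "NONE"] for w, pos in tokens if pos in ("NNP", "NNPS")]
    # Rule 2: anchor on :NONE phrases whose left-1 lexeme is a title.
    words = [w for w, _ in tokens]
    for phrase in phrases:
        i = words.index(phrase[0])
        if phrase[1] == "NONE" and i > 0 and words[i - 1] in TITLES:
            phrase[1] = "PERSON"
    return [tuple(p) for p in phrases]

tokens = [("On", "IN"), ("Monday", "NNP"), ("Dr.", "NNP"), ("Grieg", "NNP")]
print(apply_rules(tokens))
```

As in the example sentence, "Grieg" is relabeled PERSON because "Dr." sits to its left, while "Monday" stays an untyped phrase.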

Page 75 Copyright 2001 The MITRE Corporation. All rights reserved.

Some Alembic System Measures

Rule sequence length:
                      English   Spanish   Japanese
  - Names               141       333       100
  - Money, percents      12        19        12
  - Dates/times         fsa     fsa + 24     21
  - Titles                4        30
  - Total               167       406       133

Processing rate* (words/min.):
                      English   Spanish   Japanese
                      23,100    26,300    23,900
* On Sparc 10, without having pursued many opportunities for optimization.

Page 76 Copyright 2001 The MITRE Corporation. All rights reserved.

Alembic Workbench: Relations

Alembic Workbench uses machine learning to (1) identify constituents (slot-filling values) and (2) propose plausible relation instances for selection by human annotators.

Page 77 Copyright 2001 The MITRE Corporation. All rights reserved.

Relation Tagging Interface #3 Screen Dump

Page 78 Copyright 2001 The MITRE Corporation. All rights reserved.

Putting it All Together: Defining Templates (Relations)

Empirical Study of Productivity Gains Afforded by Alembic Workbench (for Named Entities)
[Bar charts of annotation productivity, in words/minute and tags/minute, for four corpus development tools: Emacs (emacs-AB), AWB GUI only (awb-AB), AWB + pre-tagging on 5 docs (awb-5-AB), and AWB + pre-tagging on 100 docs (awb-100-AB).]

Page 79 Copyright 2001 The MITRE Corporation. All rights reserved.

Geospatial News on Demand (GeoNODE)


[Architecture diagram]
- News sources: broadcast news, the World Wide Web, intelligence message traffic, specialist archives
- Processing pipeline: data acquisition/pre-processing, information extraction, data mining and clustering, indexing and news modeling, feeding the GeoNODE database
- Interface views: topic timeline, news histogram, map overview, BNN story skim
- Capabilities: navigate, filter, indexed access, animate reporting trends, create reports/web pages

Page 80 Copyright 2001 The MITRE Corporation. All rights reserved.

Automatic Topic Detection


Embassy Bombing and Counterstrike: clustering identified separate topics for the bombing and the counterstrike.

[Map visualization: separate Bombing and Counterstrike clusters]

Page 81 Copyright 2001 The MITRE Corporation. All rights reserved.
