Mark T. Maybury
maybury@mitre.org
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Abstract
This briefing overviews the state of the art in information retrieval, summarization, information extraction, text clustering, and question answering. It reports system performance based on national testing and identifies current commercial products that provide this type of processing. Key findings include:
- Automated systems exist that can return documents relevant to a particular subject with around 80% precision, but low recall
- Automated query and relevance feedback is near human performance
- Systems can presently identify entities at over 90% accuracy and relations among entities at 70-80% accuracy
- Systems can summarize documents to 20% of their source size without information loss, saving users 50% of task time
- Systems can also respond to a simple factual question by returning answers from relevant documents at 75% accuracy
We describe how the community can make rapid progress on information access for intelligence using corpus-based evaluation, which includes: 1) creating challenge problems with supporting data; 2) evaluating system performance on these problems; and 3) comparing approaches, sharing data, resources, and tools. The briefing provides detailed pointers to data resources and commercial products.
Copyright 2001 The MITRE Corporation. All rights reserved.
Current Situation
Analyst Data
10,000 messages per day
(IMINT, SIGINT, HUMINT, Intel Reports)
Data Search
Analyst creates a profile to sort messages
Report
The system sorts messages by key words -- limited precision. The analyst can study 200 messages per day.
The analyst manually builds (at best) partial, static link models.
[Diagram: a user poses questions and receives answers drawn from documents and semi-structured info]
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Information Retrieval

Input: query words
Output: ranked list of documents

Approach
Source  City_name  Date         Cases  New_cases  Dead
ProMED  Gulu       26-Oct-2000  182    17         64
ProMED  Gulu       5-Nov-2000   280    14         89
ProMED  Gulu       13-Oct-2000  42     9          30
ProMED  Gulu       15-Oct-2000  51     7          31
ProMED  Gulu       16-Oct-2000  63     12         33
ProMED  Gulu       17-Oct-2000  73     2          35
ProMED  Gulu       18-Oct-2000  94     21         39
ProMED  Gulu       19-Oct-2000  111    17         41
- Speed, scalability, domain independence, and robustness are critical for access to large collections of documents
Technique
- Shallow processing provides a coarse-grained result (entire documents or passages)
- The query is transformed into a collection of words; grammatical relations between words are lost
- Documents are indexed by word occurrences
- Search matches the query probe against indexed documents using a Boolean combination of terms, a vector of word occurrences, or a language model
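A minimal sketch of this word-occurrence approach: bag-of-words indexing plus cosine matching over count vectors. All names here are illustrative; real systems add tf/idf weighting, stemming, and stop-word removal.

```python
import math
from collections import Counter

def index_docs(docs):
    """Index each document by its word occurrences (bag of words)."""
    return [Counter(doc.lower().split()) for doc in docs]

def cosine(q, d):
    """Cosine similarity between two word-count vectors."""
    num = sum(q[w] * d[w] for w in q.keys() & d.keys())
    den = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in d.values()))
    return num / den if den else 0.0

def search(query, docs):
    """Return document indices ranked by similarity to the query."""
    index = index_docs(docs)
    q = Counter(query.lower().split())
    return sorted(range(len(docs)),
                  key=lambda i: cosine(q, index[i]), reverse=True)

docs = ["prions cause disease in cattle",
        "the court imposed fines",
        "prion disease in humans"]
print(search("prion disease", docs))  # [2, 0, 1]
```

Note that "prions" does not match "prion" at all -- exactly the coarseness that stemming and query expansion try to repair.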
- Human judges provide relevance assessments for the ranked documents
- Relevance judgments are used to compute average precision-recall plots for each system
Precision: % of returned docs judged relevant
Recall: % of relevant docs returned
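The two metrics follow directly from the sets of returned and relevant documents; this small sketch (the function name is illustrative) reproduces the 11/12 recall figure from the example later in this section.

```python
def precision_recall(returned, relevant):
    """Precision: % of returned docs judged relevant.
    Recall: % of relevant docs that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 11 of 12 relevant docs retrieved among 20 documents returned
p, r = precision_recall(returned=range(20),
                        relevant=list(range(11)) + [99])
print(p, round(r, 2))  # 0.55 0.92
```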
Question: Who are Mohamed Atta's associates?
Query: mohamed+atta+associates
Answer: Said Bahaji
Recall: one relevant document not retrieved contained a new associate (11/12 = 91%)
<desc> Description: (one-sentence description of the topic)
What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland on December 21, 1988?
<narr> Narrative:
Documents describing any charges, claims or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.
This is a representative high-performing system. Manual (expert) choice of query words works better than automatically generated queries. Note that if you need to find all the literature on a subject, you have to look through lots of junk!
[Figure: precision-recall plot; recall axis from 0.2 to 0.8]
TREC Findings
It is possible to create test collections and an evaluation metric for large corpora:
- TREC pioneered evaluation of recall over large document collections, using pooled results of multiple systems to prune the space and estimate recall
- TREC has experimented with different evaluations (tracks) for filtering, routing, Web search, ...
- The basic paradigm is still word-based; so far, the addition of syntax and semantics hasn't helped
- For short queries, adding information (words) to the queries helps: by hand, by thesaurus, or by feedback of relevant documents
Other lessons:
Systems can use relevant documents to enrich the set of query words via relevance feedback, done automatically or interactively
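A simplified, Rocchio-style sketch of that feedback step, assuming queries and documents are bags of words. The function name, weights, and term counts are illustrative assumptions, not any TREC system's actual parameters.

```python
from collections import Counter

def relevance_feedback(query_terms, relevant_docs, beta=0.75, top_n=5):
    """Add the highest-weight terms from documents the user judged
    relevant to the query vector (simplified Rocchio feedback)."""
    q = Counter(query_terms)
    centroid = Counter()
    for doc in relevant_docs:
        for word in doc.lower().split():
            centroid[word] += 1 / len(relevant_docs)
    for word, weight in centroid.most_common(top_n):
        q[word] += beta * weight
    return q

q = relevance_feedback(
    ["prion", "disease"],
    ["prions cause scrapie in sheep", "prions cause bse in cattle"])
print(sorted(q, key=q.get, reverse=True)[:3])
```

The enriched query now contains "prions", "cause", etc., so a follow-up search matches documents the original two-word query would miss.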
IR Journals
- Information Processing and Management (IP&M)
- ACM Transactions on Information Systems (TOIS)
- Journal of the American Society for Information Science (JASIS)
- Journal of Documentation (JDOC)
- Information Retrieval (IR)
- Journal of Intelligent Information Systems (JIIS)
Books
- Modern Information Retrieval (ACM Press Series), R. Baeza-Yates, et al.
- Intelligent Multimedia Information Retrieval, M. Maybury (ed.) & Karen Sparck Jones
- Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, W. Bruce Croft (ed.)
- Introduction to Modern Information Retrieval, G. Salton and M. J. McGill
- Information Retrieval, C. J. van Rijsbergen, www.dcs.gla.ac.uk/Keith/Preface.html
Conferences
- TREC (Text REtrieval Conference) trec.nist.gov
- ACM SIGIR www.acm.org/sigir
- ACM CIKM (Conference on Information and Knowledge Management)
Research Centers
- Universität Dortmund, Informatik VI ls6-www.informatik.uni-dortmund.de
- Univ. of Glasgow www.dcs.gla.ac.uk/idom
- Univ. of Maryland -- Medical Informatics and Computational Intelligence Lab www.enee.umd.edu//medlab/filter/filter_project.html
- Univ. of Massachusetts -- Center for Intelligent Information Retrieval (CIIR) ciir.cs.umass.edu
- www.isri.unlv.edu
Bibliographies
- mansci1.uwaterloo.ca/~jjiang/biblio.html
- www-inf.enst.fr/~rungsawa/irrs.html
- www.seas.gwu.edu/student/chulee/bib.html
- www.si.umich.edu/~mjpinto
- joinus.comeng.chungnam.ac.kr/~dolphin/db/indices/a-tree/s/Salton:Gerard.html
- superbook.bellcore.com/~std/LSI.html
- dmoz.org/Computers/Software/Information_Retrieval
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
What is Summarization?
Definition:
Summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).

Extract vs. Abstract
- An extract is a summary consisting entirely of material copied from the input
- An abstract is a summary at least some of whose material is not present in the input, e.g., subject categories, paraphrase of content, etc.
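A minimal sketch of an extractive summarizer of the kind defined above, scoring each sentence by the document frequency of its words. Everything here is illustrative; fielded systems also exploit position, cue phrases, and discourse structure.

```python
import re
from collections import Counter

def extract_summary(text, ratio=0.2):
    """Extractive summary: score each sentence by the average document
    frequency of its words; keep the top ~ratio of sentences in order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    keep = max(1, round(len(sentences) * ratio))
    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    return " ".join(s for s in sentences if s in top)

text = ("Prions cause disease. Cats sleep. Prions cause scrapie. "
        "Dogs bark. Prion disease kills.")
print(extract_summary(text))  # Prions cause disease.
```

Because the output only copies source sentences, this is strictly an extract; an abstract would require generation or paraphrase.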
Highest recall is associated with the least reduction of the source.
[Figure: content-based scores for summarization systems: CGI/CMU, Cornell/SabIR, GE, ISI, NMSU, Penn, SRA, TextWise, Modsumm]

Content-based automatic scoring (vocabulary overlap) correlates very well with human scoring (passage/answer recall).
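Content-based scoring by vocabulary overlap can be sketched as the fraction of the reference summary's vocabulary that the system summary covers -- an illustrative simplification of the actual scoring used in the evaluations.

```python
import re

def vocab_overlap(system_summary, reference_summary):
    """Fraction of the reference summary's vocabulary that also
    appears in the system summary."""
    sys_words = set(re.findall(r'\w+', system_summary.lower()))
    ref_words = set(re.findall(r'\w+', reference_summary.lower()))
    return len(sys_words & ref_words) / len(ref_words) if ref_words else 0.0

print(round(vocab_overlap("prions cause disease",
                          "prions cause scrapie"), 2))  # 0.67
```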
- At each compression, systems outperformed Lead and TF baselines in content overlap with human summaries
- Subjective grading of coherence and informativeness showed that human abstracts > human extracts > systems and baselines
Fukusima, T. and Okumura, M. 2001. "Text Summarization Challenge: Text summarization evaluation in Japan." Workshop on Automatic Summarization, Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL'2001). New Brunswick, New Jersey: Association for Computational Linguistics.
- Assess the impact of a summarizer on the system in which it is embedded, e.g., how much does summarization help the question answering system?
- Measure the amount of effort required to post-edit the summary output to bring it to some acceptable, task-dependent state (there is an unlimited number of tasks to which summarization could be applied)
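The post-editing measure could be approximated with a string-similarity proxy; this sketch uses Python's standard-library difflib, and the function name is an illustrative assumption rather than an established metric.

```python
import difflib

def postedit_effort(system_summary, edited_summary):
    """Proxy for post-editing effort: 1 - character-level similarity
    between the raw output and its human-corrected version
    (0.0 means no edits were needed)."""
    matcher = difflib.SequenceMatcher(None, system_summary, edited_summary)
    return 1.0 - matcher.ratio()

print(postedit_effort("prions cause disease", "prions cause disease"))  # 0.0
```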
S2s (23% of source on avg.) roughly halved decision time relative to F (full-text)!

All F-score and Recall differences are significant except between F & S2.

Conclusion: ad hoc S2s save time by 50% without impairing accuracy!
[Feature-comparison table of commercial automatic summarizers: AutoSummarizer (MS Word 97), CONTEXT, Data Hammer, DimSum, Extractor, GE Summarizer, Intelligent Miner, IntelliScope, InText, InXight Summarizer Plus, ProSum, Search 97 Developer's Kit, SMART, SUMMARIST, TexNet 32, TextAnalyst 2.0. Features compared: single documents, discourse model, sentence length, generic focus, multi-lingual, fragmentary, cue phrases, query focus, doc format, connected, tf or tf/idf, location, SDK, GUI, evaluated.

Key: x - implements feature; (x) - product has feature but not implemented as fully/normally as other products.]
Commercial Summarizers
- Oracle ConText - technet.oracle.com/products/oracle7/context/faq.htm and www.oracle.com/products/servers/st_collateral/html/cntxt_qa.html
- IntelliScope - www.lhsl.com/tech/icm/retrieval/toolkit/default.asp
- InXight Summarizer Plus - www.inxight.com/Products/Developer/AD_Summzer.html
- Intelligent Miner - www-4.ibm.com/software/data/iminer/fortext/summarize/summarize.html
- Extractor (NRC) - extractor.iit.nrc.ca
- ProSum - visita.labs.bt.com/prosum/word/sample.html
- GE Summarizer
- DimSum (SRA Int.) - www.sra.com
- Data Hammer - www.glu.com/datahammer
- Microsoft Word 97 AutoSummarizer - www.microsoft.com/office
- Automatic Summarizer, RES International - res.ca/sum/dev2.html
- Search 97 Developer's Kit - www.verity.com/support/documentation/tdk/adv23/05_adv.htm
Summarization Resources
Books/Journals
- Mani, I. and Maybury, M. (eds.) 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge.
- Mani, I. 2001. Automated Text Summarization. John Benjamins, Amsterdam.
- Mani, I. and Hahn, U. Nov 2000. Summarization Tutorial. IEEE Computer.
Bibliographies
- www.si.umich.edu/~radev/summarization/
- www.cs.columbia.edu/~jing/summarization.html
- www.dcs.shef.ac.uk/~gael/alphalist.html
- www.csi.uottawa.ca/tanka/ts.html
Government initiatives
- DUC-2001 Multi-document Summarization Evaluation (www-nlpir.nist.gov/projects/duc)
- DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) Program (tides.nist.gov, www.darpa.mil/ito/research/tides/projlist.html)
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Template Extraction
Source Analysis Templates
<TEMPLATE-8806150049-1> :=
  DOC NR: 8806150049
  CONTENT: <TIE_UP_RELATIONSHIP-8806150049-1>
  DATE TEMPLATE COMPLETED: 311292
  EXTRACTION TIME: 0
Transformation / Synthesis

Wall Street Journal, 06/15/88: MAXICARE HEALTH PLANS INC and UNIVERSAL HEALTH SERVICES INC have dissolved a joint venture which provided health services.
Maybury, M. 1995. Generating Summaries from Event Data, Information Processing and Management, 31, 5, pp. 735-751.
Source text
... <NounGroup>labriegos</NounGroup> que <VerbGroup>fueron asasinados</VerbGroup> ...
... <kill-act>fueron asasinados</kill-act> por <subject>las Autodefensas</subject> ...
... <assassinate>fueron asasinados</assassinate> por ...
(Spanish: "laborers who were assassinated by the Autodefensas")
IBM's Intelligent Miner for Text: Feature Extraction Tool Screen Shot
3. Display events ...

[Figure: time series of Cases, New_cases, and Dead, 10/13/2000 through 11/24/2000]

[Feature-comparison table of commercial entity extraction tools, including entity types detected and languages supported. Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, IT=Italian, FR=French, DE=German, DU=Dutch, AR=Arabic; P=People, O=Organization, L=Location, T=Time, M=Money]
Identifying Entities
Most research to date has been done on news reports:
- Systems can automatically identify person, organization, location, time, and numerical expressions
- Systems exist that identify names ~90-95% accurately in the news (in several languages)
- Simply memorizing names doesn't work, since new people (and organizations) appear in the news, just as new genes are identified and named
- Rules capture local patterns that characterize entities, learned from instances of annotated training data:
  - "XXX met with YYY": XXX and YYY are probably people
  - "XXX bought out YYY": XXX and YYY are probably organizations
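A toy sketch of such local-pattern rules, hand-written here as regular expressions rather than learned from annotated data; the rule set and labels are illustrative.

```python
import re

# Local patterns of the kind described above: the verb context
# licenses a guess about the entity type of its arguments.
RULES = [
    (re.compile(r'([A-Z]\w+) met with ([A-Z]\w+)'), 'PERSON'),
    (re.compile(r'([A-Z]\w+) bought out ([A-Z]\w+)'), 'ORGANIZATION'),
]

def tag_entities(sentence):
    """Apply each pattern and label the captured names."""
    entities = []
    for pattern, label in RULES:
        for match in pattern.finditer(sentence):
            entities.extend((name, label) for name in match.groups())
    return entities

print(tag_entities("Smith met with Jones after Acme bought out Initech."))
```

A learned system induces thousands of such patterns (with statistical weights) instead of two hand-coded ones.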
[Figure: F-measure (accuracy) of named entity identification by year, 1991-1999]
Relation Extraction
Identify (and tag) the relation between two entities:
- A person is_located_at a location (news)
- A gene codes_for a protein (biology)

Relations require more information: identification of 2 entities and their relationship
- Predicted relation accuracy = Pr(E1) * Pr(E2) * Pr(R) ~ (.93) * (.93) * (.93) = .80

Information in relations is less local
- Contextual information is a problem: the right word may not be explicitly present in the sentence
- Complex syntax in abstracts is a problem (see examples from Park et al., PSB 2001)

Events involve more relations and are even harder
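The compounding of per-component accuracies can be checked directly; independence of the three decisions is the stated assumption.

```python
def relation_accuracy(p_e1, p_e2, p_rel):
    """Expected relation accuracy when both entities and the relation
    must each be identified correctly (assuming independence)."""
    return p_e1 * p_e2 * p_rel

print(round(relation_accuracy(0.93, 0.93, 0.93), 2))  # 0.8
```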
[Figure: extraction accuracy by year, 1992-1999, for names (English, Japanese, Chinese), relations, and events]
Books/Journals
- Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. International Summer School, SCIE-97, Frascati, Italy. Edited by Maria Teresa Pazienza, J. Siekmann, J. G. Carbonell.
On-line IE Tutorials
- www.ai.sri.com/~appelt/ie-tutorial
- citeseer.nj.nec.com/gaizauskas98information.html
- NIST Information Extraction web page: www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
Government initiatives
- ACE: Automated Content Extraction (www.nist.gov/speech/tests/ace)
- TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides)
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Document Clustering, Mining, Topic Detection, and Visualization Systems

[Feature-comparison table. Products: Inxight Categorizer and Tree Studio (Inxight); Semio Taxonomy; LexiQuest Mine; InterMedia Text (Oracle); NorthernLight; Autonomy; Lotus Discovery Server (LDS, Lotus); QKS Classifier (Quiver); Fulcrum Knowledge Server (Hummingbird); SPIRE/Themeview (PNNL); VantagePoint (Search Technology Inc.); Mohomine, Inc.; Intelligent Miner for Text (IBM); Oasis, OnTopic (BBN/Verizon). Features compared: cluster concepts, named entities, noun phrases, topic tracking, story segmentation, predefined taxonomies, generates taxonomies, multilingual support. Key: EN=English, ZH=Chinese, ES=Spanish, JP=Japanese, IT=Italian, FR=French, DE=German, DU=Dutch]
Tutorial
- www.parc.xerox.com/istl/projects/ia/sg-clustering.html
Bibliographies
- dewey.yonsei.ac.kr/memexlee/links/clustering.htm
- dmoz.org/Reference/Knowledge_Management/Knowledge_Discovery/Text_Mining/
Government initiatives
- Topic Detection and Tracking Evaluation Project (www.nist.gov/speech/tests/tdt/index.htm)
- Text REtrieval Conference (trec.nist.gov)
- TIDES: Translingual Information Detection, Extraction and Summarization; DARPA (www.darpa.mil/ito/research/tides, tides.nist.gov)
Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html (TREC8)
- trec.nist.gov/pubs/trec9/index.track.html (TREC9)
Outline
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Question Answering
Source  City_name  Date         Cases  New_cases  Dead
ProMED  Gulu       26-Oct-2000  182    17         64
ProMED  Gulu       5-Nov-2000   280    14         89
ProMED  Gulu       13-Oct-2000  42     9          30
ProMED  Gulu       15-Oct-2000  51     7          31
ProMED  Gulu       16-Oct-2000  63     12         33
ProMED  Gulu       17-Oct-2000  73     2          35
ProMED  Gulu       18-Oct-2000  94     21         39
ProMED  Gulu       19-Oct-2000  111    17         41
Where did Dylan Thomas die?
1. Swansea: "In Dylan: the Nine Lives of Dylan Thomas, Fryer makes a virtue of not coming from Swansea"
2. Italy: "Dylan Thomas's widow Caitlin, who died last week in Italy aged 81, ..."
3. New York: "Dylan Thomas died in New York 40 years ago next Tuesday"

What diseases are caused by prions?
1. "Both CJD and BSE are caused by mysterious particles of infectious protein called prions"
2. "Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion, which does not follow the normal rules of microbiology."
3. "These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ..."
Problem: Need to resolve the pronoun to get the real answer (in the preceding sentence):
Prion disorders -- including bovine spongiform encephalopathy, or ``mad cow disease'' in cattle, CJD in humans, and scrapie in sheep -- are all characterized by progressive neurological degeneration resulting in death.
Question Answering
Stage 1: Question analysis
- Find the type of object that answers the question: when needs a time, which proteins needs a protein
Stage 2: Document retrieval
- Using the (augmented) question, retrieve a set of possibly relevant documents via information retrieval
Stage 3: Document processing
- Search documents for entities of the desired type using information extraction
- Search for entities in appropriate relations
Stage 4: Rank answer candidates
Stage 5: Present the answer (N bytes, or a phrase, a sentence, or a summary)
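The five stages can be sketched end to end. Everything here -- the corpus format, the entity records, the word-overlap ranking heuristic -- is an illustrative assumption, not any fielded system's design.

```python
def answer_question(question, corpus):
    """Minimal sketch of the five-stage QA pipeline above."""
    # Stage 1: question analysis -- map the question word to an answer type
    qtype = {"when": "TIME", "where": "LOCATION", "who": "PERSON"}.get(
        question.lower().split()[0], "ANY")
    # Stage 2: document retrieval -- keep docs sharing words with the question
    q_words = set(question.lower().rstrip("?").split())
    docs = [d for d in corpus if q_words & set(d["text"].lower().split())]
    # Stage 3: document processing -- collect entities of the desired type
    candidates = [e for d in docs for e in d["entities"]
                  if qtype in ("ANY", e["type"])]
    # Stage 4: rank -- here, by query words appearing in the entity's context
    candidates.sort(key=lambda e: len(q_words & set(e["context"].split())),
                    reverse=True)
    # Stage 5: present the top answer
    return candidates[0]["text"] if candidates else None

corpus = [{"text": "Dylan Thomas died in New York",
           "entities": [{"text": "New York", "type": "LOCATION",
                         "context": "dylan thomas died in"}]}]
print(answer_question("Where did Dylan Thomas die?", corpus))  # New York
```

Real systems replace each stage with much richer machinery (question taxonomies, passage retrieval, trained extractors), but the data flow is the same.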
Harabagiu and Moldovan, Southern Methodist University
- Mean Reciprocal Rank: 76%
- First Answer Correct: 69%
- Correct Answer in Top 5: 86%
Lessons: question answering works, at least for simple factual questions.
Question/Answering Resources
Books/Journals
- Question Answering Systems. Papers from the AAAI Fall Symposium, Vinay Chaudhri and Richard Fikes, Program Cochairs, AAAI Technical Report FS-99-02
- Hirschman, L. and Gaizauskas, R. (eds.) Forthcoming in Fall 2002. Special Issue on ...
Paper Collections
- trec.nist.gov/pubs/trec8/index.track.html#qa (TREC8 Q&A papers)
- trec.nist.gov/pubs/trec9/index.track.html#qa (TREC9 Q&A papers)
Government initiatives
- AQUAINT: www.ic-arda.org/solicitations/AQUAINT
- TREC8: citeseer.nj.nec.com/346894.html
- Also, trec.nist.gov
Conclusion
1. Introduction 2. Information retrieval 3. Summarization 4. Information extraction 5. Text Clustering 6. Question answering
Acknowledgements
Special thanks to
- Lynette Hirschman for some IE, IR, and Q&A slides
- David Day, Warren Greiff, and Christy Doran for tool and resources research
- Inderjeet Mani for summarization evaluation slides
- Jim Burnetti, Tom McEntee and Donna Trammell for Semio Taxonomy examples
- Penny Chase for Google performance example
BACKUP
[Diagram: 1. Manually annotate raw text source texts; 2. Invoke machine learning to induce rule sets]

Alembic Workbench supports:
- Multilingual annotation (UNICODE chars and fonts)
- Machine learning
- Evaluation for content extraction
On Monday Dr. Grieg, IBM's new chief scientist, announced that their new supercomputer, Powerful Purple, will ...
Rule sequence length:
                   English   Spanish    Japanese
- Names              141       333        100
- Money, percents     12        19         12
- Dates/times        fsa      fsa + 24     21
- Titles               4        30        167

Processing rate* (words/min.): English 23,100; Spanish 26,300; Japanese 23,900
* On Sparc 10, without having pursued many opportunities for optimization.
Alembic Workbench uses machine learning to (1) identify constituents (slot-filling values) and (2) propose plausible relation instances for selection by human annotators.
Empirical Study of Productivity Gains Afforded by Alembic Workbench (for Named Entities)
[Bar charts: productivity gains in named entity annotation, comparing Emacs against Alembic Workbench configurations awb-5-AB and awb-100-AB]
[GeoNODE architecture diagram: news sources (broadcast news, specialist archives) feed information extraction into the GeoNODE database; the interface provides a news histogram, BNN story skim, and map overview; users navigate, filter, get indexed access, animate reporting trends, and create reports/web pages; sample topics include bombing and counterstrike]