Introduction Information Retrieval

Information Retrieval Introduction
Information Retrieval
Information Retrieval PART I

IntroductionMotivation Basic Concepts Past, Present and the Future The Retrieval Process
Motivation
IR: representation, storage, organization of, and access to information items Focus is on the user information need User information need:
Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
Emphasis is on the retrieval of information (not data)
Motivation
Data retrieval
which
docs contain a set of keywords? Well defined semantics a single erroneous object implies failure!
Information retrieval
information
about a subject or topic semantics is frequently loose small errors are tolerated
IR system:
interpret
contents of information items generate a ranking which reflects relevance notion of relevance is most important
Motivation
IR at the center of the stage

IR
in the last 20 years:
classification
and categorization systems and languages user interfaces and visualization

Still,
area was seen as of narrow interest Advent of the Web changed this perception once and for all
universal
repository of knowledge free (low cost) universal access no central editorial board many problems though: IR seen as key to finding the solutions!
Information Retrieval UNIT I

INTRODUCTION,RETRIEVAL STRATEGIES I: IntroductionMotivation Basic Concepts Past, Present and the Future The Retrieval Process
Basic Concepts
The User Task

Retrieval
Database
Browsing
Retrieval
information purposeful
or data
Browsing
glancing
around F1; cars, Le Mans, France, tourism
Basic Concepts
Logical view of the documents

Accents spacing
Docs
stopwords
Noun groups
stemming
Manual indexing
structure structure Full text Index terms
Document representation viewed as a continuum: logical view of docs might shift
Information Retrieval UNIT I

History of IR
1960-70s:
Initial exploration of text retrieval systems for small corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vectorspace models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area.
11
IR History Continued
1980s:
Large document database systems, many run by companies:
Lexis-Nexis Dialog MEDLINE
12
1990s:
Searching FTPable documents on the Internet
Archie WAIS
Searching the World Wide Web

Lycos Yahoo Altavista
13
1990s continued:
Organized Competitions
NIST TREC
Recommender Systems
Ringo Amazon NetPerceptions
Automated Text Categorization & Clustering
14
Recent IR History
2000s
Link analysis for Web Search
Google
Automated Information Extraction

Whizbang Fetch Burning Glass
Question Answering
TREC Q/A track
15
Recent IR History
2000s continued:
Multimedia IR
Image Video Audio and music
Cross-Language IR
DARPA Tides
Document Summarization
16
The Seven Ages of Information Retrieval
Vannevar Bush's 1945 article set a goal of fast access to the contents of the world's libraries which looks like it will be achieved by 2010, sixty-five years later. Bushs Prediction
Modern History
The information overload problem is much older than you may think
Origins in period immediately after World War II

Tremendous scientific progress during the war Rapid growth in amount of scientific publications available
The Memex Machine
Conceived by Vannevar Bush, President Roosevelt's science advisor Outlined in 1945 Atlantic Monthly article titled As We May Think Foreshadows the development of hypertext (the Web) and information retrieval system
The Memex Machine
Historical aspects
As We May Think'', by Vannevar Bush

Article was originally published in 1945. He imagined that machines would read in visual form
His assertion that logic is suitable for mechanical computation is not yet appreciated
Documents are accessible & viewable from the memex system of Bush Documents may exist on many media: text, pictures, audio. The memex can keep the ``trail'' of documents you read while you follow your curiosity(Basically, it's a persistent history of URLs as you surf the web.) You can create associations between documents You can enter original material Most have been implemented as of 2005
IR Childhood (1945-1955)
Ideas conceived Information explosion after World War II Possibility of information processing machine Memex The hardware seems mostly out of date. user inserting 5000 pages per day into a personal repository and it taking hundreds of years to fill it up. the software goals have not been achieved.
The Schoolboy (1960s)

Many many experiments Use of Precision and Recall Use of relevance feedback
Adulthood (1970s)
The invention of word processing systems time-sharing systems The beginning of information industry OCLC(Online Computer Library Centre) DIALOG BRS(Bibliographic Retrieval Service)
Maturity (1980s)
Mid-Life Crisis (1990s)

Internet put IR to the test. Better understanding of the limit of IR. Large scale evaluations Digital Libraries projects
Predictions
Fulfillment (2000s) Retirement (2010)

The Retrieval Process

Text User Interface
user need
4, 10
Text
Text Operations
6, 7
logical view Query Operations user feedback logical view
Indexing
DB Manager Module
5
query
inverted file
Searching
Index
8
retrieved docs Text Database Ranking
ranked docs

INTRODUCTION,RETRIEVAL STRATEGIES I: IntroductionMotivation Basic Concepts Past, Present and the Future The Retrieval Process Other Related Slides not part of the book
Information Retrieval (IR)

The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent killer app. Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.
30
Typical IR Task
Given:
A corpus of textual natural-language documents. A user query in the form of a textual string.
A ranked set of documents that are relevant to the query.
Find:
31
IR System
Document corpus
Query String
IR System
1. Doc1 2. Doc2 3. Doc3 . .
32
Ranked Documents
Relevance
Relevance is a subjective judgment and may include:
Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).
33
Keyword Search
Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
34
Problems with Keywords

May not retrieve relevant documents that include synonymous terms.
restaurant vs. caf PRC vs. China
May retrieve irrelevant documents that include ambiguous terms.

bat (baseball vs. mammal) Apple (company vs. fruit) bit (unit of data vs. act of eating)
35
Beyond Keywords
We will cover the basics of keyword-based IR, but We will focus on extensions and recent developments that go beyond keywords. We will cover the basics of building an efficient IR system, but We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial size databases.
36
Intelligent IR
Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.
37
IR System Architecture
User Interface User Need User Feedback Query Ranked Docs
Text
Text Operations
Logical View
Query Operations Searching
Indexing
Inverted file
Database Manager
Index
Ranking
Retrieved Docs
Text Database
38
IR System Components
Text Operations forms index words (tokens).
Stopword removal Stemming
Indexing constructs an inverted index of word to document pointers. Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric.
39
IR System Components (continued)

User Interface manages interaction with the user:
Query input and document output. Relevance feedback. Visualization of results.
Query Operations transform the query to improve retrieval:

Query expansion using a thesaurus. Query transformation using relevance feedback.
40
Web Search
Application of IR to HTML documents on the World Wide Web. Differences:
Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.
41
Web Search System

Web
Spider
Document corpus
Query String
1. Page1 2. Page2 3. Page3 . .
IR System
Ranked Documents
42
Other IR-Related Tasks

Automated document categorization Information filtering (spam filtering) Information routing Automated document clustering Recommending information or products Information extraction Information integration Question answering
43
Related Areas
Database Management Library and Information Science Artificial Intelligence Natural Language Processing Machine Learning
44
Database Management
Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of welldefined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.
45
Library and Information Science

Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.
46
Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries:
First-order Predicate Logic Bayesian Networks
Recent work on web ontologies and intelligent information agents brings it closer to IR.
47
Natural Language Processing

Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.
48
Natural Language Processing: IR Directions

Methods for determining the sense of an ambiguous word based on context (word sense disambiguation). Methods for identifying specific pieces of information in a document (information extraction). Methods for answering specific NL questions from document corpora.
49
Machine Learning
Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).
50
Machine Learning: IR Directions

Text Categorization
Automatic hierarchical classification (Yahoo). Adaptive filtering/routing/recommending. Automated spam filtering.
Text Clustering
Clustering of IR query results. Automatic formation of hierarchies (Yahoo).
Learning for Information Extraction Text Mining

51
IR research
System prototyping Interface Retrieval algorithms
Interaction
IR System
Contents
User Satisfaction User
Evaluation
Top Ten Research Issues

10. Relevance Feedback. 9. Information Extraction. 8. Multimedia Retrieval. 7. Effective Retrieval. 6. Routing and Filtering.
Top Ten Research Issues

5. Interfaces and Browsing. 4. Magic (Vocabulary Mapping). 3. Efficient, Flexible Indexing and Retrieval. 2. Distributed IR. 1. Integrated Solutions. A new Industry Content Management
Introduction to Information Retrieval
Unstructured (text) vs. structured (database) data in 1996
55
Introduction to Information Retrieval
Unstructured (text) vs. structured (database) data in 2009
56
Definitions
An Information Retrieval (IR) System attempts to find relevant documents to respond to a users request. The real problem boils down to matching the language of the query to the language of the document.
What is Information?

What do you think? There is no correct definition Cookie Monsters definition:
news or facts about something
Different approaches:

Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science
Dictionary says
Oxford English Dictionary

information: informing, telling; thing told, knowledge, items of knowledge, news knowledge: knowing familiarity gained by experience; persons range of information; a theoretical or practical understanding of; the sum of what is known
Random House Dictionary
information: knowledge communicated or received concerning a particular fact or circumstance; news
Intuitive Notions
Information must

Be something, although the exact nature (substance, energy, or abstract concept) is not clear; Be new: repetition of previously received messages is not informative Be true: false or counterfactual information is misinformation Be about something
Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science , 48(3), 254-269.
Three Views of Information

Information as process Information as communication Information as message transmission and reception
One View
Information = characteristics of the output of a process
Tells us something about the process and the input

Input Input Input Output
Process
Output
Output
Information-generating process do not occur in isolation

Input
Process1
Process2
Output
Ibid.
Wheres the human?
If a tree falls in the forest, and no one is around to hear it, is information transmitted?
In the information as process: Yes, but thats not very interesting to us

Were concerned about information for human consumption

Transmission of information from one person to another Recording of information Reconstruction of stored information
Another View
Information science is characterized by the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient
This implies that the sender has knowledge of the recipient's structure
Text = a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient Information = the structure of any text which is capable of changing the image-structure of a recipient
Nicholas J. Belkin and Stephen E. Robertson. (1976) Information Science and the Phenomenon of Information. Journal of the American Society for Information Science , 27(4), 197-204.
Transfer of Information
Communication = transmission of information
Thoughts Telepathy?
Thoughts
Words Writing
Words
Sounds Speech
Encoding
Sounds
Decoding
Information Hierarchy
More refined and abstract
Wisdom Knowledge
Information
Data
Simply matching on words is a very brittle approach. One word can have a zillion different semantic meanings Consider: Take take a place at the table take money to the bank take a picture take a lot of time take drugs
Difference of IR with rest of CS

What is Different about IR from the rest of Computer Science Most algorithms in computer science have a right answer: Consider the two problems: Sort the following ten integers Find the higest integer Now consider: Find the document most relevant to hippos in the zoo Measuring Effectiveness An algorithm is deemed incorrect if it does not have a right answer. A heuristic tries to guess something close to the right answer. Heuristics are measured on how close they come to a right answer. IR techniques are essentially heuristics because we do not know the right answer. So we have to measure how close to the right answer we can come.
DOCUMENT RETRIEVAL
Document Routing
Predetermined queries or User profiles
Document Routing System Incoming documents
User 1
User 2
User 3
User 4
Result Set: Relevant Retrieved, Relevant and Retrieved
Relevant
Retrieved
Relevant Retrieved Precision = Relevant Retrieved Retrieved Recall = Relevant Retrieved Relevant
Precision and Two points of Recall

Answer set in order of
1.0
similarity coefficient (relevant documents:d5,d2) d1

d2 d3 d4 d5
(1.0, 0.4)
100% recall 50% recall
Precision
0.8
0.6 0.4 0.2
(0.5,0.5)
d6 d7 d8
0.2
0.4
0.6
0.8
1.0
d9
d10
Recall
Precision at 50% recall = 1/2= 50% Precision at 100% recall = 2/5= 40%
Typical and Optimal Precision/Recall Graph
Precision
Recall

Introduction Information Retrieval

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Introduction Information Retrieval

Caricato da

Copyright:

Formati disponibili

Information Retrieval Introduction

Information Retrieval PART I

Emphasis is on the retrieval of information (not data)

IR at the center of the stage

in the last 20 years:

and categorization systems and languages user interfaces and visualization

Information Retrieval UNIT I

The User Task

around F1; cars, Le Mans, France, tourism

Logical view of the documents

structure structure Full text Index terms

Document representation viewed as a continuum: logical view of docs might shift

Information Retrieval UNIT I

Searching the World Wide Web

Automated Text Categorization & Clustering

Automated Information Extraction

The Seven Ages of Information Retrieval

Origins in period immediately after World War II

The Memex Machine

The Memex Machine

As We May Think'', by Vannevar Bush

The Schoolboy (1960s)

Mid-Life Crisis (1990s)

Information Retrieval PART I

The Retrieval Process

Information Retrieval PART I

Information Retrieval (IR)

Problems with Keywords

May retrieve irrelevant documents that include ambiguous terms.

Query Operations Searching

IR System Components (continued)

Query Operations transform the query to improve retrieval:

Web Search System

Other IR-Related Tasks

Library and Information Science

Natural Language Processing

Natural Language Processing: IR Directions

Machine Learning: IR Directions

Learning for Information Extraction Text Mining

User Satisfaction User

Top Ten Research Issues

Top Ten Research Issues

Introduction to Information Retrieval

Unstructured (text) vs. structured (database) data in 1996

Introduction to Information Retrieval

Unstructured (text) vs. structured (database) data in 2009

What do you think? There is no correct definition Cookie Monsters definition:

news or facts about something

Oxford English Dictionary

Random House Dictionary

information: knowledge communicated or received concerning a particular fact or circumstance; news

Three Views of Information

Information as process Information as communication Information as message transmission and reception

Information = characteristics of the output of a process

Tells us something about the process and the input

Information-generating process do not occur in isolation

Wheres the human?

In the information as process: Yes, but thats not very interesting to us

Communication = transmission of information

Difference of IR with rest of CS

Document Routing System Incoming documents

Result Set: Relevant Retrieved, Relevant and Retrieved

Precision and Two points of Recall

similarity coefficient (relevant documents:d5,d2) d1