IR
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading Chapter 1
What is IR today?
2009
2009-12
This is an IR tool!
Paradigm shift
opportunistic
Purposely sensed
User generated
A universe of possibilities
limited only by our imagination!
Basics
Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured
nature (usually text) that satisfies an
information need from within large
collections (usually stored on computers).
IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Unstructured data
Typically
Issues:
[Figure: term-document matrix. Plays: Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Terms: Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Side label: "Matrix could be big".]
Inverted index
[Figure: the terms Brutus, Caesar and Calpurnia, each followed by its list of document numbers.]
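As a minimal sketch of the idea behind the figure, an inverted index can be built by mapping every term to the sorted list of documents that contain it. The function name `build_inverted_index`, the toy documents, and the whitespace tokenization are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep every postings list sorted by docID.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection, keyed by docID.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])     # [1, 4]
print(index["caesar"])     # [1, 2, 4]
print(index["calpurnia"])  # [2]
```

Keeping each list sorted by docID is what makes the scan-based query processing of the following slides possible.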
AND query
[Figure: the lists of Cleopatra and Caesar are intersected; the cost is annotated in comparisons (cmp) and seconds (sec).]
AND query
Cleopatra: 11 31 45 46
Caesar: 12 15 16 31 57
Cost: n+m comparisons, 1 msec
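The n+m bound can be illustrated with a merge-style scan over the two sorted lists. This is a sketch under the assumption that both lists are sorted and duplicate-free; the function name `intersect` is chosen here for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with at most len(p1) + len(p2) steps."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: it satisfies the AND query.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot occur later in p2, so skip it
        else:
            j += 1  # p2[j] cannot occur later in p1, so skip it
    return result

cleopatra = [11, 31, 45, 46]
caesar = [12, 15, 16, 31, 57]
print(intersect(cleopatra, caesar))  # [31]
```

Every iteration advances at least one pointer, so the loop body runs at most n+m times.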
[Figure: lists for the terms "the" and "Calpurnia".]
Two advantages of sorted lists:
Speed: a query requires just one scan of the lists
Space: store smaller integers (gap coding)
Compressed, they occupy 13% of the original text
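Gap coding can be sketched as follows: instead of storing each docID, store its difference from the previous one. The helper names `to_gaps` and `from_gaps` are hypothetical, and the sketch assumes a sorted list of positive docIDs; the actual compressor applied to the gaps is not specified in the slides.

```python
def to_gaps(postings):
    """Replace each docID by its distance from the previous docID."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Rebuild the original docIDs with a running sum over the gaps."""
    postings = []
    total = 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

cleopatra = [11, 31, 45, 46]
gaps = to_gaps(cleopatra)
print(gaps)             # [11, 20, 14, 1]
print(from_gaps(gaps))  # [11, 31, 45, 46]
```

The gaps are never larger than the docIDs they replace, so a variable-length integer code can spend fewer bits on them. Decoding is a running sum, which fits the same left-to-right scan used at query time.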
Query optimization
[Figure: the lists of Brutus, Caesar and Calpurnia.]
Query optimization
Sec. 2.3
[Figure: two lists of document numbers, with entries including 2, 41, 48, 64, 128 and 1, 11, 17, 21, 31.]
Query optimization
[Figure: the lists of Caesar and Calpurnia.]
Binary search
Which list do you bisect at every recursive step?
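One possible answer, given here as a hedged sketch rather than as the slide's own solution: bisect the shorter list at its median, binary-search that element in the longer list, and recurse on the two halves. The function name `intersect_bisect` and the sample lists are assumptions for illustration.

```python
from bisect import bisect_left

def intersect_bisect(a, b):
    """Intersect two sorted, duplicate-free lists by recursive bisection."""
    # Always split the shorter list and binary-search the longer one.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return []
    mid = len(short) // 2
    pivot = short[mid]
    pos = bisect_left(long_, pivot)  # first position with long_[pos] >= pivot
    found = pos < len(long_) and long_[pos] == pivot
    # Slices copy data; a tuned version would pass index ranges instead.
    left = intersect_bisect(short[:mid], long_[:pos])
    right = intersect_bisect(short[mid + 1:], long_[pos + 1 if found else pos:])
    return left + ([pivot] if found else []) + right

calpurnia = [13, 16]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]
print(intersect_bisect(calpurnia, caesar))  # [16]
```

This pays off when one list is much shorter than the other, because each binary search costs only a logarithmic number of comparisons in the longer list.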
Sec. 1.3
Boolean queries:
More general merges
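The same linear scan extends to other Boolean operators. The sketch below shows OR and AND NOT over sorted, duplicate-free lists; the function names `union` and `and_not` and the sample lists are illustrative assumptions.

```python
def union(p1, p2):
    """OR query: merge two sorted lists, keeping each docID once."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result

def and_not(p1, p2):
    """AND NOT query: docIDs that occur in p1 but not in p2."""
    result = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            result.append(p1[i])  # p1[i] cannot occur in the rest of p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1  # excluded by the NOT operand
            j += 1
        else:
            j += 1
    return result

brutus = [1, 2, 4, 11]
caesar = [1, 2, 5]
print(union(brutus, caesar))    # [1, 2, 4, 5, 11]
print(and_not(brutus, caesar))  # [4, 11]
```

Both functions run in time linear in the total length of the two lists.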
IR is much more
Stanford University
But
Documents
Users
Queries
Information needs
[Figure: search engine components: Web, Crawler, Page archive, Page Analyzer, Indexer, text and auxiliary Structure, Query, Query resolver, Ranker. Annotation: "Which pages to visit next?". Algorithmic topics: Hashing, Sorting, Linear Algebra, Clustering, Classification, Dictionaries, Data Compression.]
I data center
Smart algorithms
2007
This is rocket science but you
don't have to be a rocket scientist
to use it