Nowadays IR Is Much More Than Building Search Engines !: Paolo Ferragina

Nowadays IR is much more than building search engines!
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading Chapter 1
What is IR today?
2009
2009-12
This is an IR tool!
Cisco foresees 50 billion devices connected by 2020
Paradigm shift
We now have "devices 2.0" that have their own ID, communication capacity, computing power and, more and more, the ability to interact.
We live in a post-human society
[Amber Case, TED Lecture 2010]
You are a cyborg every time you look at a PC screen or use a cellular phone.
Man has extended his capacities (such as movement, memory, strength, communication, ...) through the use of these devices 2.0.
The new cellular phones have changed the original communication network into a sensing network
An Internet of bits and bodies
Three main types of data
Opportunistic: credit card transactions, telephone calls, bills, web clicks, ...
Purposely sensed: pollution, temperature, wind, movement, acceleration, health sensing, ...
User generated: photos, tweets, posts, emails, query logs on search engines
Trash track: trashtrack_release.mov
A universe of possibilities limited only by our imagination!
The PhD+ course: how to build a start-up?
Basics
Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured
nature (usually text) that satisfies an
information need from within large
collections (usually stored on computers).
IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
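As a rough illustration of the structured query above, the table can be held as a list of records and filtered with a predicate. The helper name `select` and the field names are assumptions made for this sketch; a real database would evaluate the same condition through its own query language.

```python
# Rows copied from the Employee/Manager/Salary table on the slide.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which predicate(row) is true (hypothetical helper)."""
    return [row for row in rows if predicate(row)]


# Salary < 60000 AND Manager = Smith
matches = select(
    employees,
    lambda row: row["salary"] < 60000 and row["manager"] == "Smith",
)
names = [row["employee"] for row in matches]
print(names)  # ['Ivy']
```

Both conditions are exact tests on typed fields, which is what distinguishes this kind of query from keyword search over free text.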
Unstructured data
Typically refers to free text, and allows
Keyword queries including operators
More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
Classic model for searching text documents

Semi-structured data: XML
In fact almost no data is "unstructured"
E.g., this slide has distinctly identified zones such as the Title and Bullets
Facilitates "semi-structured" search such as
Title contains data AND Bullets contain search
Issues:
how do you process "about"?
how do you rank results?
Boolean queries: Exact match
The Boolean retrieval model lets you ask a query that is a Boolean expression:
Boolean queries use AND, OR and NOT to join query terms
Views each document as a set of words
Is precise: a document matches the condition or not.
Perhaps the simplest model to build an IR system on
Many search systems still use it:
Email, library catalog, Mac OS X Spotlight

Implementing the Boolean model

[Figure: term-document incidence matrix, one row per term (Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser) and one column per play (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)]
1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia
The matrix could be very big
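The query on this slide can be answered with bitwise operations on the rows of the incidence matrix. The sketch below is only illustrative: the 0/1 values did not survive in the slide text, so the three rows are assumed values in the style of the usual textbook example, and the helper names `to_bits` and `matching_plays` are made up for this example.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 row per term, one column per play (assumed, illustrative values).
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}


def to_bits(row):
    """Pack a 0/1 row into an int, leftmost column as the most significant bit."""
    bits = 0
    for value in row:
        bits = (bits << 1) | value
    return bits


def matching_plays(bits):
    """Decode a bit vector back into the list of play titles."""
    n = len(plays)
    return [plays[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]


mask = (1 << len(plays)) - 1  # keeps the complement within the 6 columns
brutus = to_bits(incidence["Brutus"])
caesar = to_bits(incidence["Caesar"])
calpurnia = to_bits(incidence["Calpurnia"])

# Brutus AND Caesar BUT NOT Calpurnia
answer = brutus & caesar & (~calpurnia & mask)
result = matching_plays(answer)
print(result)  # ['Antony and Cleopatra', 'Hamlet']
```

With many terms and many documents the matrix is mostly zeros, which is why it gets too big to store explicitly and motivates the inverted index.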
Inverted index
For each term t, we must store a list of all documents that contain t.
Identify each document by a docID, a document serial number
Can we use fixed-size arrays for this?
[Figure: postings lists of docIDs for the terms Brutus, Caesar and Calpurnia]
What happens if the word Caesar is added to document 14?
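A minimal sketch of an inverted index with variable-length postings lists follows. It suggests one possible answer to the two questions on this slide: fixed-size arrays waste space or overflow, while a growable sorted list can absorb a new docID by inserting it in order. The toy documents and the function names `build_index` and `add_posting` are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(docs):
    """docs maps docID -> text; returns term -> sorted list of docIDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)  # docIDs arrive in increasing order
    return index


def add_posting(index, term, doc_id):
    """Insert doc_id into the postings of term, keeping the list sorted."""
    postings = index[term]
    pos = bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)


# Hypothetical toy collection; only the docIDs matter here.
docs = {2: "Brutus killed Caesar", 14: "Brutus spoke", 31: "Calpurnia and Caesar"}
index = build_index(docs)

# The word Caesar is added to document 14.
add_posting(index, "caesar", 14)
print(index["caesar"])  # [2, 14, 31]
```

Inserting in the middle of a list costs time proportional to its length, so real systems tend to batch updates; that trade-off is outside this sketch.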
AND query
[Figure: unsorted postings lists for Cleopatra (45, 11, 46, ...) and Caesar (57, 12, 15, 16, ...)]
If n, m are the lengths of the lists, how many comparisons? n*m
With n*m = 10^12 comparisons, the query takes about 10^3 sec
This is not an engineering problem: you need efficient algorithms!
AND query
[Figure: the same postings lists sorted by docID: Cleopatra (11, 31, 45, 46, ...) and Caesar (12, 15, 16, ...)]
How many comparisons? n+m
With n+m on the order of 10^6 comparisons, the query takes about 1 msec
Which are the top-10 results?
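The gap between n*m and n+m comparisons can be sketched with two small functions. The Cleopatra docIDs are the ones legible on the slide; the Caesar list is partly assumed (the entries 31 and 57 are illustrative), and the function names are invented for this example.

```python
def intersect_naive(a, b):
    """Unsorted lists: compare every pair, about len(a) * len(b) comparisons."""
    return sorted(x for x in a if any(x == y for y in b))


def intersect_merge(a, b):
    """Lists sorted by docID: one linear scan, at most len(a) + len(b) steps."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


cleopatra = [11, 31, 45, 46]
caesar = [12, 15, 16, 31, 57]  # partly assumed values

common = intersect_merge(cleopatra, caesar)
print(common)  # [31]
```

The merge only works because both lists are kept sorted by docID, which is a decision taken at indexing time rather than at query time.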
The Inverted index

[Figure: dictionary terms (Brutus, Calpurnia, the) pointing to their sorted postings lists of docIDs]
Two advantages:
Speed: query requires just a scan
Space: store smaller integers (gap coding)
Compressed, they occupy 13% of the original text
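Gap coding stores the difference between consecutive docIDs instead of the docIDs themselves, so the integers get smaller and need fewer bits. The slide does not say which integer code is applied to the gaps; the sketch below assumes a variable-byte code purely for illustration, and all function names and the sample postings list are made up.

```python
def to_gaps(postings):
    """Replace each docID of a sorted list by its distance from the previous one."""
    gaps, previous = [], 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the docIDs with a running sum over the gaps."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


postings = [11, 31, 45, 173, 174]  # illustrative docIDs
gaps = to_gaps(postings)           # [11, 20, 14, 128, 1]
encoded = vbyte_encode(gaps)
decoded = from_gaps(vbyte_decode(encoded))
```

Here five docIDs fit in six bytes, against twenty bytes as 32-bit integers; the saving on a real collection depends on how the gaps are distributed.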
Intersecting two postings lists
Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.

For each of the n terms, get its postings, then
AND them together.
[Figure: postings lists for Brutus (..., 16, 32, 64, 128), Caesar (..., 16, 21, 34) and Calpurnia (13, 16)]
Query: Brutus AND Calpurnia AND Caesar
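One common heuristic for the ordering question is to process the terms by increasing postings-list length, so that intermediate results stay as small as possible. The sketch below assumes that heuristic; the docIDs that are not legible in the figure are filled in with assumed values, and the names `by_increasing_length` and `and_query` are hypothetical.

```python
def intersect(a, b):
    """Linear merge of two lists sorted by docID."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def by_increasing_length(index, terms):
    """Order the query terms from the shortest postings list to the longest."""
    return sorted(terms, key=lambda term: len(index.get(term, [])))


def and_query(index, terms):
    """AND together the postings of all terms, rarest term first."""
    ordered = by_increasing_length(index, terms)
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # nothing left to intersect
        result = intersect(result, index.get(term, []))
    return result


# Partly assumed postings: only some docIDs are legible on the slide.
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
query = ["Brutus", "Calpurnia", "Caesar"]
order = by_increasing_length(index, query)
result = and_query(index, query)
print(order, result)  # ['Calpurnia', 'Brutus', 'Caesar'] [16]
```

Starting from Calpurnia means no intermediate result is ever longer than two docIDs in this example.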
Query optimization
Can we improve scanning-based intersection?
Skips (still scan-based, but with shortcuts)
Sec. 2.3
Augment postings with skip pointers (at indexing time)
[Figure: two postings lists augmented with skip pointers; legible docIDs include 2, 41, 48, 64, 128 and 1, 11, 17, 21, 31]
How do we deploy them?
Where do we place them?
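A possible way to deploy skip pointers during the merge is sketched below. Two assumptions are made for the example: skips are placed evenly, every sqrt(n) positions of a list of length n (a common heuristic, not necessarily the placement the slides have in mind), and they are kept implicit as array offsets instead of stored pointers. The docIDs of the two sample lists are partly assumed.

```python
from math import isqrt


def intersect_with_skips(a, b):
    """Merge two sorted lists, jumping ahead through evenly spaced skips."""
    skip_a = max(1, isqrt(len(a)))
    skip_b = max(1, isqrt(len(b)))
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer sits at every skip_a-th position of a; follow it
            # only if its target does not overshoot the current docID of b.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


first = [2, 4, 8, 41, 48, 64, 128]      # partly assumed docIDs
second = [1, 2, 3, 8, 11, 17, 21, 31]   # partly assumed docIDs
common = intersect_with_skips(first, second)
print(common)  # [2, 8]
```

More skips mean shorter jumps that succeed more often; fewer skips mean longer jumps that succeed rarely, which is the trade-off behind the placement question.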
Query optimization
Can we improve scanning-based intersection?
Skips (still scan-based, but with shortcuts)
Recursive merge (splitting by pivots)
[Figure: postings lists for Caesar and Calpurnia (legible docIDs: 13, 16, 34 and 16, 21, 34), split around a pivot located by binary search]
Which list do you bisect at every recursive step?
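The recursive merge can be sketched as follows. The choice made here, taking the pivot as the median of the shorter list and locating it by binary search in the longer one, is one possible answer to the question on the slide and is an assumption of this example, as are the function name and the sample lists. Postings lists are assumed to contain no duplicates.

```python
from bisect import bisect_left


def intersect_recursive(a, b):
    """Intersect two sorted lists by splitting both around a pivot."""
    if not a or not b:
        return []
    if len(a) > len(b):
        a, b = b, a                      # take the pivot from the shorter list
    mid = len(a) // 2
    pivot = a[mid]
    pos = bisect_left(b, pivot)          # binary search in the longer list
    found = pos < len(b) and b[pos] == pivot
    left = intersect_recursive(a[:mid], b[:pos])
    right = intersect_recursive(a[mid + 1:], b[pos + 1 if found else pos:])
    return left + ([pivot] if found else []) + right


calpurnia = [13, 16]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]     # partly assumed docIDs
common = intersect_recursive(calpurnia, caesar)
print(common)  # [16]
```

The slices copy data to keep the sketch short; an implementation meant for long lists would pass index ranges instead.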
Sec. 1.3
Boolean queries: more general merges
Exercise: adapt the merge for:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run the merge in time O(x+y)?
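For the first query of the exercise, one possible adaptation of the merge is sketched below; it still scans each list once, so it runs in time O(x+y). The function name and the sample postings are assumptions. The second query behaves differently: the answer to Brutus OR NOT Caesar contains every document that lacks Caesar, so its size is tied to the size of the collection and not to x+y.

```python
def and_not(a, b):
    """docIDs of sorted list a that do not occur in sorted list b."""
    result, i, j = [], 0, 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            result.append(a[i])   # a[i] cannot appear later in b
            i += 1
        elif a[i] == b[j]:
            i += 1                # present in both lists: drop it
            j += 1
        else:
            j += 1                # advance b until it catches up with a[i]
    return result


brutus = [2, 4, 8, 16, 32, 64, 128]      # illustrative docIDs
caesar = [1, 2, 3, 5, 8, 16, 21, 34]     # illustrative docIDs
only_brutus = and_not(brutus, caesar)
print(only_brutus)  # [4, 32, 64, 128]
```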
IR is much more
What about phrases? E.g., "Stanford University"
Proximity: find Gates NEAR Microsoft.
Need index to capture term positions in docs.
Zones in documents: find documents with (author = Ullman) AND (text contains automata).
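A positional index, which records where each term occurs inside each document, is enough to answer both phrase and proximity queries. The sketch below is a simplified illustration: the toy documents, the window size k and the function names are assumptions, and the pairwise position check is written for clarity, not for speed.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps docID -> text; returns term -> {docID: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def near(index, t1, t2, k):
    """docIDs where t1 and t2 occur within k positions of each other."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        if any(abs(x - y) <= k for x in p1[doc_id] for y in p2[doc_id]):
            hits.append(doc_id)
    return hits


def phrase(index, t1, t2):
    """docIDs where t2 immediately follows t1."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        following = set(p2[doc_id])
        if any(x + 1 in following for x in p1[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical toy collection.
docs = {
    1: "Gates founded Microsoft",
    2: "Microsoft hired many engineers before Gates left",
    3: "Stanford University is in California",
    4: "the university near Stanford",
}
index = build_positional_index(docs)
near_hits = near(index, "gates", "microsoft", 2)
phrase_hits = phrase(index, "stanford", "university")
print(near_hits, phrase_hits)  # [1] [3]
```

Storing positions makes the index considerably larger than a docID-only index, which is the price paid for phrase and NEAR queries.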
Ranking search results
Boolean queries give inclusion or exclusion of docs.
But often the results are too many and we need to rank them
Classification, clustering, summarization, text mining, etc.
Web IR and its challenges
Unusual and diverse
Documents
Users
Queries
Information needs
Exploit ideas from social networks
link analysis, click-streams, ...
How do search engines work?
Our topics, on an example
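[Figure: search engine architecture. Components: Web, Crawler ("Which pages to visit next?"), Page archive, Page Analyzer, Indexer (text and auxiliary structure), Query resolver, Ranker, Query. Course topics attached to the components: Hashing, Sorting, Dictionaries, Data Compression, Linear Algebra, Clustering, Classification]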
I data center
[Procs OSDI 2006]

NoSQL
HBase, in Java, Apache license, runs on Hadoop
Hypertable, in C++, GNU license, runs on Hadoop or GlusterFS
Cassandra, in Java, Apache license 2, modeled on Amazon's Dynamo
Smart algorithms
2007
This is rocket science but you
don't have to be a rocket scientist
to use it
