
Nowadays IR is much more than building search engines!

IR

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading Chapter 1

What is IR today?


2009

2009-12

This is an IR tool!


Cisco foresees 50 billion devices connected by 2020


Paradigm shift

We now have devices 2.0 that have their own ID, communication capacity, computing power and, more and more, interaction ability.

We live in a post-human society [Amber Case, TED Lecture 2010]

You are a cyborg every time you look at a PC screen or use a cellular phone.

Man has extended his capacities (such as movement, memory, strength, communication) through the use of these devices 2.0.

The new cellular phones have changed the original communication network into a sensing network.


An Internet of bits and bodies


Three main types of data

Opportunistic
  Credit card transactions
  Telephone calls, bills, web clicks

Purposely sensed
  Pollution, temperature, wind
  Movement, acceleration
  Health sensing

User generated
  Photos, tweets, posts, emails
  Query logs on search engines

Trash track: trashtrack_release.mov

A universe of possibilities limited only by our imagination!


The PhD+ course: how to build a start-up?

Basics


Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured
nature (usually text) that satisfies an
information need from within large
collections (usually stored on computers).


IR vs. databases:
Unstructured vs Structured data
Structured data tends to refer to tables
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
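
As a rough illustration of the query above, the table can be held as a list of records and filtered with a predicate. This is only a sketch: the `select` helper and the lower-case field names are hypothetical, and a real database would use indexes rather than a full scan.

```python
# Rows of the Employee table from the slide.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

def select(rows, predicate):
    """Return the rows that satisfy the predicate (a full scan, no index)."""
    return [row for row in rows if predicate(row)]

# Salary < 60000 AND Manager = Smith
matches = select(employees,
                 lambda r: r["salary"] < 60000 and r["manager"] == "Smith")
print([r["employee"] for r in matches])  # ['Ivy']
```

The numerical range test and the exact text match are both evaluated on well-defined fields, which is what makes the data structured.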


Unstructured data
Typically refers to free text, and allows
  Keyword queries including operators
  More sophisticated concept queries, e.g., find all web pages dealing with drug abuse

Classic model for searching text documents



Semi-structured data: XML

In fact almost no data is "unstructured"
  E.g., this slide has distinctly identified zones such as the Title and Bullets

Facilitates "semi-structured" search such as
  Title contains data AND Bullets contain search

Issues:
  how do you process "about"?
  how do you rank results?
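
A zone query of this kind could be sketched as follows. The slide is modelled as a dictionary of zones; the `zone_contains` helper, the zone names and the crude tokenisation are all assumptions made for illustration.

```python
# A hypothetical slide split into named zones.
slide = {
    "title": "Semi-structured data: XML",
    "bullets": [
        "In fact almost no data is unstructured",
        "Facilitates semi-structured search",
    ],
}

def zone_contains(doc, zone, word):
    """True if `word` occurs (case-insensitively) as a token of the given zone."""
    value = doc.get(zone, "")
    text = " ".join(value) if isinstance(value, list) else value
    tokens = text.lower().replace(":", " ").replace("-", " ").split()
    return word.lower() in tokens

# Title contains "data" AND Bullets contain "search"
hit = (zone_contains(slide, "title", "data")
       and zone_contains(slide, "bullets", "search"))
print(hit)
```

Exact token matching like this says nothing about what a zone is "about", nor about how to order several matching documents, which are the two issues listed above.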

Boolean queries: Exact match

The Boolean retrieval model lets you ask a query that is a Boolean expression:
  Boolean queries are queries using AND, OR and NOT to join query terms
  Views each document as a set of words
  Is precise: a document matches the condition or not.

Perhaps the simplest model to build an IR system on

Many search systems still use it:
  Email, library catalog, Mac OS X Spotlight



Implementing the Boolean model


[Figure: term-document incidence matrix. Columns are the plays Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth; rows are the terms Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. An entry is 1 if the play contains the word, 0 otherwise.]

Query: Brutus AND Caesar BUT NOT Calpurnia

Matrix could be very big
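
With an incidence matrix, the query reduces to bitwise operations on the rows of the three terms. The sketch below assumes the 0/1 values of the textbook example for Brutus, Caesar and Calpurnia, since the matrix cells are not reproduced here; the helper `plays_for` is hypothetical.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play. The values are assumed
# from the textbook example, not taken from this slide.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def plays_for(bits):
    """Decode a bit vector back into play titles."""
    n = len(plays)
    return [plays[i] for i in range(n) if bits >> (n - 1 - i) & 1]

mask = (1 << len(plays)) - 1   # keeps the complement within 6 bits
answer = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))
print(plays_for(answer))  # ['Antony and Cleopatra', 'Hamlet']
```

The approach is simple, but one bit per (term, document) pair is exactly why the matrix could be very big.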

Inverted index

For each term t, we must store a list of all documents that contain t.
  Identify each by docID, a document serial number

Can we use fixed-size arrays for this?

[Figure: postings lists of Brutus, Caesar and Calpurnia, each a list of docIDs; the docIDs that survive are 2, 11, 16, 31, 45, 54, 57, 101, 132, 173, 174.]

What happens if the word Caesar is added to document 14?
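
One way to see why fixed-size arrays are awkward is to build a small index with variable-length lists and then insert a new docID in the middle of one of them. The collection below is a hypothetical toy example and `build_index` is an illustrative helper, not a reference implementation.

```python
from bisect import insort
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of docIDs (its postings list)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)   # docIDs arrive in increasing order
    return index

# Hypothetical toy collection: docID -> text.
docs = {
    2: "Brutus killed Caesar",
    14: "Brutus spoke",
    31: "Calpurnia feared for Caesar",
}
index = build_index(docs)
print(index["caesar"])   # [2, 31]

# The word "Caesar" is added to document 14: the list grows in the middle.
insort(index["caesar"], 14)
print(index["caesar"])   # [2, 14, 31]
```

Because postings lists differ widely in length and change over time, dynamically sized lists are the natural choice here.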

AND query
[Figure: the postings lists of Cleopatra (45, 11, 46, ...) and Caesar (57, 12, 15, 16, 31, ...), not sorted by docID.]

If n, m are the lengths of the lists, how many comparisons? n*m

This is not an engineering problem: you need efficient algorithms!

10^12 comparisons, about 10^3 sec

AND query
[Figure: the same lists sorted by docID. Cleopatra: 11, 31, 45, 46; Caesar: 12, 15, 16, 31, ..., 57.]

How many comparisons? n+m
Which are the top-10 results?

10^6 comparisons, about 1 msec
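
The gap between n*m and n+m can be checked on the small lists of the figure. The sketch below counts loop steps for a pairwise scan and for a lockstep merge of sorted lists; the function names are illustrative and the elided part of the second list is simply left out.

```python
def intersect_naive(a, b):
    """Compare every pair: len(a) * len(b) steps, no sorting needed."""
    steps, result = 0, []
    for x in a:
        for y in b:
            steps += 1
            if x == y:
                result.append(x)
    return result, steps

def intersect_merge(a, b):
    """Walk two sorted lists in lockstep: at most len(a) + len(b) steps."""
    i = j = steps = 0
    result = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result, steps

cleopatra = [11, 31, 45, 46]
caesar = [12, 15, 16, 31, 57]
print(intersect_naive(cleopatra, caesar))   # ([31], 20)
print(intersect_merge(cleopatra, caesar))   # ([31], 7)
```

On lists with millions of entries the same difference separates minutes from milliseconds, which is the point of keeping postings sorted by docID.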

The Inverted index


[Figure: postings lists of Brutus, the and Calpurnia; the values that survive are 13, 16, 10, 32, 13, 21.]

Two advantages:
  Speed: a query requires just a scan
  Space: store smaller integers (gap coding)
    Compressed, they occupy 13% of the original text
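
Gap coding stores the difference between consecutive docIDs instead of the docIDs themselves. The sketch below pairs it with a variable-byte code to show why small gaps save space; the variable-byte step is an assumption added for illustration (the slide names only gap coding), and the docIDs are hypothetical.

```python
def to_gaps(postings):
    """Replace each docID after the first by its distance from the previous one."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings

def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # flag the least-significant (final) byte
        out.extend(reversed(chunk))
    return bytes(out)

postings = [100000, 100007, 100021, 100050]   # hypothetical docIDs
gaps = to_gaps(postings)
print(gaps)                                   # [100000, 7, 14, 29]
print(len(vbyte_encode(postings)), len(vbyte_encode(gaps)))   # 12 6
```

Frequent terms have dense postings lists and therefore small gaps, so they benefit most from this kind of encoding.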


Intersecting two postings lists


Query optimization

What is the best order for query processing?

Consider a query that is an AND of n terms.
  For each of the n terms, get its postings, then AND them together.

Brutus:    ... 16 32 64 128
Caesar:    ... 16 21 34
Calpurnia: 13 16

Query: Brutus AND Calpurnia AND Caesar
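
A common heuristic, described in the reading chapter, is to process the terms in order of increasing postings-list length, so that intermediate results stay small. The sketch below assumes the full postings of the textbook example (the figure here shows only their tails); `and_query` is an illustrative name.

```python
def intersect(a, b):
    """Merge-based intersection of two sorted docID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """AND the terms together, shortest postings list first."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:            # nothing left to match: stop early
            break
        result = intersect(result, index.get(term, []))
    return result

# Postings assumed from the textbook example behind the figure.
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
print(and_query(index, ["Brutus", "Calpurnia", "Caesar"]))  # [16]
```

Starting from Calpurnia means the intermediate result never holds more than two docIDs, whatever the size of the other lists.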


Query optimization

Can we improve scanning-based intersection?

Skips (yet scan-based but with shortcuts)

Sec. 2.3

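Augment postings with skip pointers (at indexing time)

[Figure: two postings lists with skip pointers. Upper list: 2, ..., 41, 48, 64, 128, with skips to 41 and 128; lower list: 1, ..., 11, 17, 21, 31, with skips to 11 and 31.]

How do we deploy them?
Where do we place them?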
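
One possible answer to both questions is sketched below: skips are placed every sqrt(L) entries of a list of length L (a heuristic from the reading chapter), and a skip is followed only when its target does not overshoot the current docID of the other list. The skips are implicit in the array positions rather than stored pointers, and the full list contents are assumed for illustration.

```python
from math import isqrt

def intersect_with_skips(a, b):
    """Intersect sorted lists; each list has a skip every isqrt(len) entries."""
    skip_a, skip_b = max(1, isqrt(len(a))), max(1, isqrt(len(b)))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip leaves from every skip_a-th position; follow it only
            # if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return out

upper = [2, 4, 8, 41, 48, 64, 128]      # assumed full contents of the lists
lower = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(upper, lower))  # [2, 8]
```

More skips mean shorter jumps that succeed more often but cost more comparisons; fewer skips mean longer jumps that rarely succeed.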

Query optimization

Can we improve scanning-based intersection?
  Skips (yet scan-based but with shortcuts)
  Recursive merge (splitting by pivots)

[Figure: recursive merge of the postings lists of Caesar and Calpurnia (docIDs 13, 16, 21, 34), with the pivot located by binary search.]

Which list do you bisect at every recursive step?
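
A possible reading of the recursive merge is sketched below: take the median of the shorter list as pivot, locate it in the longer list by binary search, then recurse on the two halves. Choosing the shorter list is an assumption here, offered as one answer to the question above; slices are copied for clarity, where an index-based version would avoid the copies.

```python
from bisect import bisect_left

def intersect_recursive(short, long_):
    """Intersect sorted lists by pivoting on the median of the shorter one."""
    if len(short) > len(long_):
        short, long_ = long_, short
    if not short:
        return []
    mid = len(short) // 2
    pivot = short[mid]
    pos = bisect_left(long_, pivot)          # binary search in the longer list
    found = pos < len(long_) and long_[pos] == pivot
    left = intersect_recursive(short[:mid], long_[:pos])
    right = intersect_recursive(short[mid + 1:],
                                long_[pos + (1 if found else 0):])
    return left + ([pivot] if found else []) + right

caesar = [1, 2, 3, 5, 8, 16, 21, 34]     # assumed full contents
calpurnia = [13, 16]
print(intersect_recursive(calpurnia, caesar))  # [16]
```

When one list is much shorter than the other, this performs a few binary searches instead of scanning the long list entirely.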


Sec. 1.3

Boolean queries:
More general merges

Exercise: Adapt the merge for:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar

Can we still run the merge in time O(x+y)?
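
For the first query, one possible adaptation of the merge is sketched below; it still makes a single pass over both sorted lists, so it runs in O(x+y). The postings are assumed for illustration. For the second query the answer may contain nearly every docID in the collection, so its size is not bounded by x+y.

```python
def and_not(a, b):
    """Return the docIDs of sorted list a that are absent from sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])      # a[i] cannot appear later in b
            i += 1
        elif a[i] == b[j]:
            i += 1                # present in both: excluded
            j += 1
        else:
            j += 1                # advance b until it catches up with a[i]
    return out

brutus = [2, 4, 8, 16, 32, 64, 128]      # assumed postings
caesar = [1, 2, 3, 5, 8, 16, 21, 34]
print(and_not(brutus, caesar))  # [4, 32, 64, 128]
```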


IR is much more

What about phrases?
  "Stanford University"

Proximity: Find Gates NEAR Microsoft.
  Need index to capture term positions in docs.

Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
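
A positional index, which records where each term occurs inside each document, can serve both phrase and proximity queries. The sketch below uses a hypothetical toy collection and a hypothetical `within` helper; a phrase is treated as "next term exactly one position later", and NEAR as "within k positions in either direction".

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID -> list of token positions in increasing order}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within(index, t1, t2, k, ordered=False):
    """docIDs where t2 occurs within k positions of t1 (after it, if ordered)."""
    hits = []
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    for doc_id in sorted(p1.keys() & p2.keys()):
        for a in p1[doc_id]:
            if any((0 < b - a <= k) if ordered else (0 < abs(b - a) <= k)
                   for b in p2[doc_id]):
                hits.append(doc_id)
                break
    return hits

# Hypothetical toy collection.
docs = {
    1: "Stanford University is in California",
    2: "the university near Stanford avenue",
    3: "Gates founded Microsoft",
}
index = build_positional_index(docs)
print(within(index, "stanford", "university", 1, ordered=True))  # [1]
print(within(index, "gates", "microsoft", 3))                    # [3]
```

Document 2 contains both words of the phrase but not adjacent and in order, so only positions, not plain docIDs, can rule it out.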

Ranking search results

Boolean queries give inclusion or exclusion of docs.

But often the results are too many and we need to rank them
  Classification, clustering, summarization, text mining, etc.


Web IR and its challenges

Unusual and diverse
  Documents
  Users
  Queries
  Information needs

Exploit ideas from social networks
  link analysis, click-streams, ...

How do search engines work?


Our topics, on an example

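[Figure: search-engine architecture. Components: Web, Crawler, Page archive, Page Analyzer, Indexer, text and auxiliary structure, Query resolver, Ranker, Query. Course topics attached to the components: Hashing, Sorting, Dictionaries, Data Compression, Linear Algebra, Clustering, Classification. The crawler is annotated with "Which pages to visit next?".]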

The data centers

[Procs OSDI 2006]


NoSQL
  HBase, in Java, Apache license, runs on Hadoop
  Hypertable, in C++, GNU license, runs on Hadoop or GlusterFS
  Cassandra, in Java, Apache license 2, runs on Amazon's Dynamo

Smart algorithms
2007
This is rocket science but you don't have to be a rocket scientist to use it
