
Indexing and Searching

J. H. Wang
Feb. 20, 2008

The Retrieval Process

[Figure: the classic retrieval process. The user need enters through the User Interface and passes through Text Operations, which produce the logical view of documents and queries. Query Operations (with user feedback) build the query that the Searching module runs against the inverted file produced by Indexing; the Index and the Text Database are handled by the DB Manager Module. Retrieved docs are ordered by Ranking and returned to the user as ranked docs.]

Outline
Conventional text retrieval systems
(8.1-8.3, Salton)
File Structures for Indexing and
Searching (Chap. 8)

Inverted files
Suffix trees and suffix arrays
Signature files
Sequential searching
Pattern matching

Conventional Text Retrieval Systems
Database management, e.g. employee DB
Structured records
Precise meaning for attribute values
Exact match

Text retrieval, e.g. bibliographic systems
Structured attributes and unstructured content
Index terms

Imprecise representation of the text


Approximate or partial matching

Conceptual Information Retrieval

[Figure: queries and documents both feed a similarity computation, supported by retrieval of similar terms.]

Expanded Text Retrieval System

[Figure: queries pass through negotiation and analysis (query formulation) into formal statements; documents pass through text indexing (content analysis) into indexed documents; both sides feed a similarity computation, supported by retrieval of similar terms.]

Ex: the query "Taipei" may retrieve pages such as: Taipei city government, Taipei travel guide, the Wiki page on Taipei, Taipei 101, Taipei Times

Representation
Documents
Indexed terms (or term vectors)
Unweighted or weighted

Queries
Unweighted or weighted terms
Boolean operators: or, and, not
E.g. Taiwan AND NOT Taipei

Efficiency

Data Structure Requirement
Fast access to documents
Very large number of index terms
For each term, a separate index is constructed that stores the document identifiers of all documents identified by that term
Inverted index (or inverted file)

Inverted Index
The complete file is represented as an array of indexed documents.

[Figure: a document-term array with rows Doc 1..Doc 4 and columns Term 1..Term 4, marking which terms occur in which documents.]

Inverted-file Process
The document-term array is inverted (actually transposed).

[Figure: the transposed array, with rows Term 1..Term 4 and columns Doc 1..Doc 4.]

Inverted-file Process
The rows are manipulated according to the query specification. (list merging)
Ex: Query = (term 2 and term 3)
Term 2: 1 1 0 0
Term 3: 0 1 1 1
AND:    0 1 0 0
Ex: Query = ((T1 or T2) and not T3)
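The list-merging step above can be sketched with sorted posting lists. The document numbers below follow the bit-vector example (Doc 1..Doc 4); the postings for T1 are an assumption, since the slide only gives the rows for term 2 and term 3.

```python
# List merging over sorted posting lists; T1's postings are hypothetical.
def intersect(a, b):
    """Merge two sorted posting lists for AND."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    return sorted(set(a) | set(b))

def and_not(a, b):
    return [d for d in a if d not in set(b)]

postings = {"T1": [1, 4], "T2": [1, 2], "T3": [2, 3, 4]}
# Query: (term 2 AND term 3) -> doc 2
print(intersect(postings["T2"], postings["T3"]))  # [2]
# Query: ((T1 OR T2) AND NOT T3) -> doc 1
print(and_not(union(postings["T1"], postings["T2"]), postings["T3"]))  # [1]
```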

Extensions of Inverted Index


Distance Constraints
Term Weights
Synonym Specification
Term Truncation

Distance Constraints
Nearness parameters
Within sentence: terms co-occur in a common sentence
Adjacency: terms occur adjacently in the
text

Implementation
To include term-location information in the inverted index
information: {D345, D348, D350, …}
retrieval: {D123, D128, D345, …}
Cost: size of the indexes
To include sentence numbers for all term occurrences in the inverted index
information: {D345, 25; D345, 37; D348, 10; D350, 8; …}
retrieval: {D123, 5; D128, 25; D345, 37; D345, 40; …}
To include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the inverted index
information: {D345, 2, 3, 5}
retrieval: {D345, 2, 3, 6}
Ex: (information adjacent retrieval)
(information within five words retrieval)
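A distance-constrained query over a positional index can be sketched as below. The documents and word positions are hypothetical, in the spirit of the "information adjacent retrieval" example.

```python
# Minimal sketch of distance-constrained search over a positional index.
# index maps term -> {doc_id: [word positions]}; positions are invented.
def within(index, t1, t2, k):
    """Return docs where t1 and t2 occur within k words of each other."""
    hits = []
    for doc in index.get(t1, {}).keys() & index.get(t2, {}).keys():
        p1, p2 = index[t1][doc], index[t2][doc]
        if any(abs(a - b) <= k for a in p1 for b in p2):
            hits.append(doc)
    return sorted(hits)

index = {
    "information": {"D345": [3], "D348": [10]},
    "retrieval":   {"D345": [4], "D348": [14]},
}
print(within(index, "information", "retrieval", 1))  # adjacency: ['D345']
print(within(index, "information", "retrieval", 5))  # within 5:  ['D345', 'D348']
```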

Term Weights
Term-importance weights
Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

Issues
How to generate term weights? (more
on this later)
How to apply term weights?
Vector queries: the sum of the weights of all
document terms that match the given query
Boolean queries: (more complex)

Term Weights (for Boolean Queries)
Transforming each query into sum-of-products form (or disjunctive normal form)
The weight of each conjunct is the minimum term weight of any document term in the conjunct
The document weight is the maximum of all the conjunct weights

An Example
Example: Q = (T1 and T2) or T3

Document vectors                Conjunct weights       Query weight
                                (T1 and T2)   (T3)     (T1 and T2) or T3
D1 = (T1,0.2; T2,0.5; T3,0.6)       0.2        0.6           0.6
D2 = (T1,0.7; T2,0.2; T3,0.1)       0.2        0.1           0.2

D1 is preferred.
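The min/max evaluation above can be sketched directly. The query is assumed to already be in disjunctive normal form (a list of conjuncts, each a list of terms); terms missing from a document get weight 0, an assumption of this sketch.

```python
# Weighted Boolean evaluation over a DNF query:
# conjunct weight = min of its term weights, document weight = max over conjuncts.
def boolean_weight(doc, dnf):
    return max(min(doc.get(t, 0.0) for t in conj) for conj in dnf)

# Q = (T1 AND T2) OR T3  ->  [["T1", "T2"], ["T3"]]
dnf = [["T1", "T2"], ["T3"]]
D1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
D2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}
print(boolean_weight(D1, dnf))  # 0.6 -> D1 is preferred
print(boolean_weight(D2, dnf))  # 0.2
```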

Synonym Specification
(T1 and T2) or T3
((T1 or S1) and T2) or (T3 or S3)

Term Truncation (or stemming)
Removing suffixes and/or prefixes
Ex: PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

File Structures for Indexing and Searching

Introduction
How to retrieve information?
A simple alternative is to search the whole text sequentially (online search)
Another option is to build data structures over the text (called indices) to speed up the search

Introduction
Indexing techniques
Inverted files
Suffix arrays
Signature files

Notation
n: the size of the text
m: the length of the pattern (m << n)
v: the size of the vocabulary
M: the amount of main memory
available

Inverted Files
Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.

Structure of an inverted file:
Vocabulary: is the set of all distinct words in
the text
Occurrences: lists containing all information
necessary for each word of the vocabulary
(text position, frequency, documents where
the word appears, etc.)

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6

Space Requirements
The space required for the vocabulary is rather small. According to Heaps' law the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice (sublinear)
On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
To reduce space requirements, a technique called block addressing is used

Block Addressing
The text is divided into blocks
The occurrences point to the blocks where the word appears
Advantages:
the number of pointers is smaller than the number of positions
all the occurrences of a word inside a single block are collapsed to one reference
Disadvantages:
an online search over the qualifying blocks is needed if exact positions are required

Example
Text (divided into four blocks):
[Block 1] That house has a [Block 2] garden. The garden has [Block 3] many flowers. The flowers [Block 4] are beautiful

Inverted file:
Vocabulary    Occurrences (block numbers)
beautiful     4
flowers       3
garden        2
house         1
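Block addressing can be sketched in a few lines. The block size of four words is an arbitrary choice mirroring the four-block example above (punctuation is dropped for simplicity); note how the two occurrences of "garden" collapse to a single block reference.

```python
# Sketch of block addressing: occurrences point to blocks, and all
# occurrences of a word inside one block collapse to one reference.
text = "That house has a garden The garden has many flowers The flowers are beautiful"
words = text.lower().split()
BLOCK = 4                                 # arbitrary block size (words)
index = {}
for i, w in enumerate(words):
    block = i // BLOCK + 1
    blocks = index.setdefault(w, [])
    if not blocks or blocks[-1] != block:  # collapse repeats in a block
        blocks.append(block)
print(index["garden"])    # [2]  (both occurrences fall in block 2)
print(index["flowers"])   # [3]
```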

Searching
The search algorithm on an inverted index follows three steps
Vocabulary search: the words present in the query are searched in the vocabulary
Retrieval of occurrences: the lists of the occurrences of all words found are retrieved
Manipulation of occurrences: the occurrences are processed to solve the query

Searching
The searching task on an inverted file always starts in the vocabulary (it is better to store the vocabulary in a separate file)
The structures most used to store the vocabulary are hashing, tries, or B-trees
Hashing, tries: O(m)
An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive, with O(log v) cost)

Construction
All the vocabulary is kept in a suitable data structure, storing for each word a list of its occurrences
Each word of the text is read and searched in the vocabulary
If it is not found, it is added to the vocabulary with an empty list of occurrences; in either case, the new position is added to the end of its list of occurrences

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary trie:
[Figure: a trie branching on the first letters b, f, g, h, leading to]
beautiful: 70
flowers: 45, 58
garden: 18, 29
house: 6

Construction
Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created:
in the first file, the lists of occurrences are stored contiguously (posting file)
in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
The overall process is O(n) worst-case time
Not practical for large texts
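The single-pass construction can be sketched as below; a dict stands in for the vocabulary trie. Positions here are 1-based character offsets computed by the code, so they may differ slightly from the slide's hand counting.

```python
# Single-pass inverted-index construction: each word is looked up in the
# vocabulary and the current position appended to its occurrence list.
import re

def build_index(text):
    vocab = {}
    for m in re.finditer(r"[A-Za-z]+", text):
        vocab.setdefault(m.group().lower(), []).append(m.start() + 1)
    return vocab

index = build_index("That house has a garden. The garden has many flowers.")
print(index["garden"])  # [18, 30]
print(index["house"])   # [6]
```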

Construction
An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
Once the text is exhausted, a number of partial indices Ii exist on disk
The partial indices are merged to obtain the final index

Example
[Figure: hierarchical merging of partial indices. The initial dumps I1..I8 (level 1) are merged pairwise into I1..2, I3..4, I5..6, I7..8 (level 2), then into I1..4 and I5..8 (level 3), and finally into the final index I1..8. The numbers 1, 2, 4, 7 mark the order in which the merges are performed.]

Construction
The total time to generate partial
indices is O(n)
The number of partial indices is
O(n/M)
To merge the O(n/M) partial indices,
log2(n/M) merging levels are necessary
The total cost of this algorithm is O(n
log(n/M))
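One merge step can be sketched as below. Partial indices are modeled as plain dicts of occurrence lists; since each partial index covers a later portion of the text, the positions of I2 are assumed to follow those of I1, so shared words simply concatenate their lists.

```python
# Merging two partial indices: shared words concatenate their occurrence
# lists (I2's positions come later in the text by construction).
def merge(i1, i2):
    out = {w: list(occ) for w, occ in i1.items()}
    for word, occ in i2.items():
        out[word] = out.get(word, []) + occ
    return out

I1 = {"garden": [18], "house": [6]}
I2 = {"garden": [29], "flowers": [45, 58]}
print(merge(I1, I2))
```

Repeating this pairwise over the O(n/M) partial indices, level by level, gives the O(n log(n/M)) total cost stated above.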

Summary on Inverted Files
The inverted file is probably the most adequate indexing technique for text databases
The indices are appropriate when the text collection is large and semi-static
Otherwise, if the text collection is volatile, online searching is the only option
Some techniques combine online and indexed searching

Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix
Index points are selected from the text; they point to the beginnings of the text positions which will be retrievable
The problem with suffix trees is their space overhead

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffixes (at the selected index points):
house has a garden. The garden has many flowers. The flowers are beautiful
garden. The garden has many flowers. The flowers are beautiful
garden has many flowers. The flowers are beautiful
flowers. The flowers are beautiful
flowers are beautiful
beautiful

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Trie:
[Figure: a trie over the suffixes, branching on the first letters b, f, g, h; the leaves point to positions 70 (beautiful), 45 and 58 (the two "flowers" suffixes, distinguished by '.' vs ' '), 18 and 29 (the two "garden" suffixes, distinguished by '.' vs ' '), and 6 (house).]

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Tree:
[Figure: the suffix trie with unary paths compressed; each internal node stores the character position to inspect, and the leaves point to positions 70, 45, 58, 18, 29, and 6.]

Suffix Arrays
An array containing all the pointers to the text suffixes, listed in lexicographical order
The space requirements are almost the same as those for inverted indices
The main drawback of suffix arrays is their costly construction process
They allow binary searches, done by comparing the contents of each pointer
Supra-indices (for large suffix arrays)
The space requirements of a suffix array with a vocabulary supra-index are exactly the same as for inverted indices
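A suffix array over word-start index points, with a prefix query answered by binary search, can be sketched as below (the text is the running garden example; positions are 0-based character offsets, an implementation choice of this sketch).

```python
# Suffix array over word-start index points; prefix search by binary
# search (bisect) over the lexicographically sorted suffixes.
from bisect import bisect_left, bisect_right

text = "That house has a garden. The garden has many flowers."
points = [i for i, c in enumerate(text) if i == 0 or text[i - 1] == " "]
sa = sorted(points, key=lambda i: text[i:])      # the suffix array

def search(prefix):
    keys = [text[i:] for i in sa]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + "\uffff")   # upper bound of the range
    return sa[lo:hi]

print(sorted(search("garden")))  # [17, 29]: both "garden" suffixes
```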

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Array (pointers in lexicographical order of the suffixes):
70  58  45  29  18  6

Supra-Index (l=4, b=2: one sample of l characters every b entries):
beau -> [70, 58]    flow -> [45, 29]    gard -> [18, 6]

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary Supra-Index    Suffix Array    Inverted List
beautiful                 70              70
flowers                   58, 45          45, 58
garden                    29, 18          18, 29
house                     6               6

Construction of Suffix Arrays for Large Texts
[Figure: (1) a small suffix array is built in main memory for the next block of the text; (2) counters record how many suffixes of the long text fall between consecutive entries of the small suffix array; (3) the small suffix array is merged with the long suffix array to produce the final suffix array.]

Signature Files
Characteristics
Word-oriented index structures based on
hashing
Low overhead (10%~20% over the text
size) at the cost of forcing a sequential
search over the index
Suitable for not very large texts
Inverted files outperform signature files
for most applications

Construction and Search
Word-oriented index structures based on hashing
Maps words to bit masks of B bits
Divides the text into blocks of b words each
The mask is obtained by bitwise ORing the signatures of all the words in the text block.

Search
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the word
For all candidate blocks, an online traversal must be performed to verify if the word is actually there
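The scheme above can be sketched as follows. The hash function (md5-derived bit picks) and the parameters B=16, l=3 are arbitrary choices for this sketch, not part of any standard signature-file design.

```python
# Sketch of a signature file: each word hashes to l bits of a B-bit mask,
# block masks OR the word masks, and search ANDs the query mask against
# each block, then verifies candidates to discard false drops.
import hashlib

B, L = 16, 3

def signature(word):
    h = hashlib.md5(word.encode()).digest()
    mask = 0
    for k in range(L):                 # set l pseudo-random bits
        mask |= 1 << (h[k] % B)
    return mask

def build(blocks):
    masks = []
    for block in blocks:
        m = 0
        for w in block.split():
            m |= signature(w)
        masks.append(m)
    return masks

def search(blocks, masks, word):
    w = signature(word)
    # candidate blocks may include false drops; verify by scanning
    return [i for i, m in enumerate(masks)
            if m & w == w and word in blocks[i].split()]

blocks = ["this is a text", "a text has many words", "words are made from letters"]
masks = build(blocks)
print(search(blocks, masks, "text"))   # [0, 1]
```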

Example
Four blocks:
This is a text.   A text has many   words. Words are   made from letters.
    000101             110101            100100             101101

Hash(text)    = 000101
Hash(many)    = 110000
Hash(words)   = 100100
Hash(made)    = 001100
Hash(letters) = 100001

Block 4: 001100 OR 100001 = 101101

False Drop
Assume that l bits are randomly set in the mask
Let α = l / B
For b words, the probability that a given bit of the mask is set is 1 − (1 − 1/B)^{bl} ≈ 1 − e^{−bα}
Hence, the probability that the l random bits of the query mask are also set (a false alarm, or false drop) is Fd = (1 − e^{−bα})^l
Fd is minimized for α = ln(2)/b, which gives
Fd = 2^{−l}, i.e. l = B ln(2) / b

Comparisons
Signature files
Use hashing techniques to produce an
index
advantage
storage overhead is small (10%-20%)

disadvantages
the search time on the index is linear
some answers may not match the query,
thus filtering must be done

Comparisons (Continued)
Inverted files
storage overhead (30% ~ 100%)
search time for word searches is logarithmic

Suffix arrays
potential use in other kinds of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent substrings

Sequential Searching
Brute Force (BF)
Knuth-Morris-Pratt (KMP)
Boyer-Moore Family (BM)
Shift-Or
Suffix Automaton

Exact String Matching
Definition: Given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs
The simplest algorithm: Brute-Force (BF)
Trying all possible pattern positions in the text
Worst-case cost: O(mn); average-case cost: O(n)
O(n) text positions
O(m) worst-case cost for each position
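The brute-force algorithm is a direct sketch: try every text position and compare up to m characters.

```python
# Brute-force exact matching: O(n) positions, O(m) worst case per position.
def brute_force(pattern, text):
    m, n = len(pattern), len(text)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(brute_force("abra", "abracadabra"))  # [0, 7]
```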

Knuth-Morris-Pratt
The KMP method scans the characters left-to-right
When a mismatch occurs, an optimum shift is carried out for pattern P
No new match can be obtained except when some head of the already matching part of P is identical to a tail of the matching part of T
How to detect coincidences between heads of P and tails of T?
Any matching tail of T is also a matching tail of P
Detecting repeating portions in P

Knuth-Morris-Pratt
Next table at position j: the longest proper prefix of P1..j-1 which is also a suffix, and where the characters following the prefix and the suffix are different
j − next[j] − 1 positions can be safely skipped

P:    a b r a c a d a b r a
Next: 0 0 0 0 1 0 1 0 0 0 0 4

T: a b r a c a b r a c a d a b r a
P: a b r a c a d                      (mismatch at the 7th character)
P:           a b r a c a d a b r a   (after the shift, the match succeeds)

At each text comparison, the window or the pointer advances by at least one position, so the algorithm performs at most 2n comparisons (and at least n)
The Aho-Corasick algorithm is an extension of KMP for matching a set of patterns
Patterns are arranged in a trie-like data structure
Ex: hello, elbow, eleven
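KMP can be sketched with the standard failure function (longest proper prefix of P[..j] that is also a suffix); this is the common textbook variant rather than the optimized "next" table shown above, but the shifting idea is the same.

```python
# KMP: precompute the failure function, then scan the text once,
# never moving the text pointer backwards.
def failure(p):
    f = [0] * len(p)
    k = 0
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = f[k - 1]
        if p[j] == p[k]:
            k += 1
        f[j] = k
    return f

def kmp(pattern, text):
    f, k, hits = failure(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = f[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):          # full match ends at position i
            hits.append(i - k + 1)
            k = f[k - 1]
    return hits

print(kmp("abracadabra", "abracabracadabra"))  # [5]
```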

Boyer-Moore Family
The BM method scans characters from right to left
The heuristic which gives the longest shift is selected
Matching shift (or good-suffix shift, δ2 shift)
When some tail of P already matches some substring of T
Occurrence shift (or bad-character shift, δ1 shift)
When a mismatched character is known not to occur in the pattern
Extended δ1 shift: places in coincidence any matching positions between heads and tails of P

Examples
T: abracabracadabra
P: abracadabra
a b r a c a d a b r a  (δ2 = 3)
a b r a c a d a b r a  (δ1 = 5)

T: babcbadcabcaabca
P: abcabcacab
a b c a b c a c a b  (δ2 = 5)
a b c a b c a c a b  (δ1 = 7)
a b c a b c a c a b  (extended δ1 = 8)

Some variations

Simplified BM algorithm
BM-Horspool (BMH) algorithm
BM-Sunday (BMS) algorithm
Commentz-Walter algorithm: an
extension of BM to multi-pattern search
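BM-Horspool, the simplified variant mentioned above, keeps only the occurrence (bad-character) shift, computed from the character under the last position of the current window; it can be sketched as:

```python
# BM-Horspool: shift the window using only the bad-character rule,
# driven by the text character aligned with the pattern's last position.
def horspool(pattern, text):
    m, n = len(pattern), len(text)
    # distance from each character's last occurrence (except the final
    # position) to the end of the pattern; unknown characters shift by m
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool("abracadabra", "abracabracadabra"))  # [5]
```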

Shift-Or
Based on bit-parallelism to simulate the operation of a non-deterministic automaton
It first builds a table B which stores a bit mask b_m…b_1 for each character
B[c] has the i-th bit set to zero if p_i = c
The state of the search is kept in D = d_m…d_1 (initially set to all 1s)
where d_i is zero whenever the state numbered i is active
A match is reported whenever d_m is zero
For each new character T_j: D = (D << 1) | B[T_j]

Example
Pattern P = abracaba (m = 8):
B[a] = 01010110
B[b] = 10111101
B[c] = 11101111
B[r] = 11111011
B[*] = 11111111   (any other character)

Example
Ex: Input T = abcabracaba
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11101111 = 11111111 ()
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11111011 = 11111011 (ABR)
(11111011 << 1) | 01010110 = 11110110 (ABRA, A)
(11110110 << 1) | 11101111 = 11101111 (ABRAC)
(11101111 << 1) | 01010110 = 11011110 (ABRACA, A)
(11011110 << 1) | 10111101 = 10111101 (ABRACAB, AB)
(10111101 << 1) | 01010110 = 01111110 (ABRACABA, A)

Matched! (the highest bit d_m is 0)
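The trace above can be reproduced with a compact implementation (bit i of B[c] is cleared when p_{i+1} = c; a match is reported when bit m of the state is 0):

```python
# Shift-Or: bit-parallel simulation of the NFA for exact matching.
def shift_or(pattern, text):
    m = len(pattern)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, full) & ~(1 << i)
    D, hits = full, []
    for j, c in enumerate(text):
        D = ((D << 1) & full) | B.get(c, full)
        if D & (1 << (m - 1)) == 0:        # state m active -> match
            hits.append(j - m + 1)         # report the start position
    return hits

print(shift_or("abracaba", "abcabracaba"))  # [3]
```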

Suffix Automaton
Suffix automaton on a pattern P: an automaton that recognizes all suffixes of P
[Figure: a non-deterministic suffix automaton with an initial state I and ε-transitions into each position of P.]

The Backward DAWG Matching (BDM) algorithm converts this automaton to a deterministic one
DAWG: directed acyclic word graph

To search a pattern P
The suffix automaton of Pr (the reversed pattern) is built
Search backwards inside the text window for a substring of P using the suffix automaton
Each time a terminal state is reached before hitting the beginning of the window, the position inside the window is remembered
Finding a prefix of the pattern -> suffix of the window
The last prefix recognized backwards is the longest prefix of P in the window
The window is aligned with the longest prefix recognized

Example
P: abracadabra
Pr: arbadacarba
T: a b r a c a b r a c a d a b r a
[Figure: the window is scanned backwards; positions where a prefix of P is recognized are marked, and the window is shifted to align with the longest such prefix.]

Practical Comparison
The clear winners are BNDM and BMS (Sunday)
Classical BM and BDM are also very close
For English texts, Agrep is much faster, because its code is carefully optimized
For longer patterns, BDM is better than BNDM
For extended patterns, BNDM is normally the fastest; otherwise Shift-Or is the best option

Pattern Matching
Searching allowing errors (Approximate String Matching)
Dynamic Programming
Automaton
Regular Expressions and Extended Patterns
Pattern Matching Using Indices
Inverted files
Suffix Trees and Suffix Arrays

Approximate String
Matching
Definition: Given a short pattern P of
length m, a long text T of length n, and
a maximum allowed number of errors k,
find all the text positions where the
pattern occurs with at most k errors
This corresponds to the Levenshtein
distance (edit distance)
With minimum modifications it is adapted
to searching whole words matching the
pattern with k errors
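The classic dynamic-programming solution (Sellers' algorithm) keeps one column C of edit distances between pattern prefixes and the best-matching text suffix, and reports every text position where C[m] ≤ k. A sketch, with "survey"/"surgery" as an illustrative pair:

```python
# Sellers' O(mn) approximate search: C[i] = edit distance between
# P[1..i] and the best text suffix ending at the current position.
def approx_search(pattern, text, k):
    m = len(pattern)
    C = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        diag = C[0]            # old C[i-1]; C[0] stays 0 (match may start anywhere)
        for i in range(1, m + 1):
            old = C[i]
            C[i] = min(old + 1,                          # delete from pattern
                       C[i - 1] + 1,                     # insert into pattern
                       diag + (pattern[i - 1] != c))     # match / substitute
            diag = old
        if C[m] <= k:
            hits.append(j)     # a match ends at text position j
    return hits

print(approx_search("survey", "surgery", 2))  # [4, 5, 6]
```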

Dynamic Programming

Automaton

Regular Expressions

Pattern Matching Using Indices
Inverted Files
Query types such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
The restriction: not able to efficiently find approximate matches or regular expressions that span many words.

Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex searches
Word, prefix, suffix, substring, and range queries
Regular expressions
Unrestricted approximate string matching
Useful in specific areas
Find the longest substring
Find the most common substring of a fixed size

Pattern Matching Using Indices
Suffix Arrays
Some patterns can be searched directly in the suffix array without simulating the suffix tree
Word, prefix, suffix, subword search and range search

Compression
Compressed text: Huffman coding
Taking words as symbols
Use an alphabet of bytes instead of bits

Compressed indices
Inverted Files
Suffix Trees and Suffix Arrays
Signature Files
