
Indexing and Searching

J. H. Wang
Feb. 20, 2008

The Retrieval Process

[Figure: the classic retrieval process. The user need enters through the User Interface and passes through Text Operations, which produce the logical view of documents and queries. Query Operations (with user feedback) build the query that the Searching module runs against the inverted file produced by Indexing; the Index and the Text Database are handled by the DB Manager Module. Retrieved docs are ordered by Ranking and returned to the user as ranked docs.]

Outline
Conventional text retrieval systems
(8.1-8.3, Salton)
File Structures for Indexing and
Searching (Chap. 8)

Inverted files
Suffix trees and suffix arrays
Signature files
Sequential searching
Pattern matching

Conventional Text Retrieval Systems
Database management, e.g. employee DB
Structured records
Precise meaning for attribute values
Exact match

Text retrieval, e.g. bibliographic systems
Structured attributes and unstructured content
Index terms

Imprecise representation of the text


Approximate or partial matching

Conceptual Information Retrieval

[Figure: queries and documents both feed a similarity computation, supported by retrieval of similar terms.]

Expanded Text Retrieval System

[Figure: queries pass through negotiation and analysis (query formulation) into formal statements; documents pass through text indexing (content analysis) into indexed documents; both sides feed a similarity computation, supported by retrieval of similar terms.]

Ex: the query "Taipei" may retrieve pages such as: Taipei city government, Taipei travel guide, the Wiki page on Taipei, Taipei 101, Taipei Times

Representation
Documents
Indexed terms (or term vectors)
Unweighted or weighted

Queries
Unweighted or weighted terms
Boolean operators: or, and, not
E.g. Taiwan AND NOT Taipei

Efficiency

Data Structure Requirement
Fast access to documents
Very large number of index terms
For each term, a separate index is constructed that stores the document identifiers of all documents identified by that term
Inverted index (or inverted file)

Inverted Index
The complete file is represented as an array of indexed documents.

[Figure: a document-term array with rows Doc 1..Doc 4 and columns Term 1..Term 4, marking which terms occur in which documents.]

Inverted-file Process
The document-term array is inverted (actually transposed).

[Figure: the transposed array, with rows Term 1..Term 4 and columns Doc 1..Doc 4.]

Inverted-file Process
The rows are manipulated according to the query specification. (list merging)
Ex: Query = (term 2 and term 3)
Term 2: 1 1 0 0
Term 3: 0 1 1 1
AND:    0 1 0 0
Ex: Query = ((T1 or T2) and not T3)
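The list-merging step above can be sketched with sorted posting lists. The document numbers below follow the bit-vector example (Doc 1..Doc 4); the postings for T1 are an assumption, since the slide only gives the rows for term 2 and term 3.

```python
# List merging over sorted posting lists; T1's postings are hypothetical.
def intersect(a, b):
    """Merge two sorted posting lists for AND."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    return sorted(set(a) | set(b))

def and_not(a, b):
    return [d for d in a if d not in set(b)]

postings = {"T1": [1, 4], "T2": [1, 2], "T3": [2, 3, 4]}
# Query: (term 2 AND term 3) -> doc 2
print(intersect(postings["T2"], postings["T3"]))  # [2]
# Query: ((T1 OR T2) AND NOT T3) -> doc 1
print(and_not(union(postings["T1"], postings["T2"]), postings["T3"]))  # [1]
```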

Extensions of Inverted Index


Distance Constraints
Term Weights
Synonym Specification
Term Truncation

Distance Constraints
Nearness parameters
Within sentence: terms co-occur in a common sentence
Adjacency: terms occur adjacently in the
text

Implementation
To include term-location information in the inverted index
information: {D345, D348, D350, …}
retrieval: {D123, D128, D345, …}
Cost: size of the indexes
To include sentence numbers for all term occurrences in the inverted index
information: {D345, 25; D345, 37; D348, 10; D350, 8; …}
retrieval: {D123, 5; D128, 25; D345, 37; D345, 40; …}
To include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the inverted index
information: {D345, 2, 3, 5}
retrieval: {D345, 2, 3, 6}
Ex: (information adjacent retrieval)
(information within five words retrieval)
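A distance-constrained query over a positional index can be sketched as below. The documents and word positions are hypothetical, in the spirit of the "information adjacent retrieval" example.

```python
# Minimal sketch of distance-constrained search over a positional index.
# index maps term -> {doc_id: [word positions]}; positions are invented.
def within(index, t1, t2, k):
    """Return docs where t1 and t2 occur within k words of each other."""
    hits = []
    for doc in index.get(t1, {}).keys() & index.get(t2, {}).keys():
        p1, p2 = index[t1][doc], index[t2][doc]
        if any(abs(a - b) <= k for a in p1 for b in p2):
            hits.append(doc)
    return sorted(hits)

index = {
    "information": {"D345": [3], "D348": [10]},
    "retrieval":   {"D345": [4], "D348": [14]},
}
print(within(index, "information", "retrieval", 1))  # adjacency: ['D345']
print(within(index, "information", "retrieval", 5))  # within 5:  ['D345', 'D348']
```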

Term Weights
Term-importance weights
Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

Issues
How to generate term weights? (more
on this later)
How to apply term weights?
Vector queries: the sum of the weights of all
document terms that match the given query
Boolean queries: (more complex)

Term Weights (for Boolean Queries)
Transforming each query into sum-of-products form (or disjunctive normal form)
The weight of each conjunct is the minimum term weight of any document term in the conjunct
The document weight is the maximum of all the conjunct weights

An Example
Example: Q = (T1 and T2) or T3

Document vectors                Conjunct weights       Query weight
                                (T1 and T2)   (T3)     (T1 and T2) or T3
D1 = (T1,0.2; T2,0.5; T3,0.6)       0.2        0.6           0.6
D2 = (T1,0.7; T2,0.2; T3,0.1)       0.2        0.1           0.2

D1 is preferred.
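The min/max evaluation above can be sketched directly. The query is assumed to already be in disjunctive normal form (a list of conjuncts, each a list of terms); terms missing from a document get weight 0, an assumption of this sketch.

```python
# Weighted Boolean evaluation over a DNF query:
# conjunct weight = min of its term weights, document weight = max over conjuncts.
def boolean_weight(doc, dnf):
    return max(min(doc.get(t, 0.0) for t in conj) for conj in dnf)

# Q = (T1 AND T2) OR T3  ->  [["T1", "T2"], ["T3"]]
dnf = [["T1", "T2"], ["T3"]]
D1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
D2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}
print(boolean_weight(D1, dnf))  # 0.6 -> D1 is preferred
print(boolean_weight(D2, dnf))  # 0.2
```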

Synonym Specification
(T1 and T2) or T3
((T1 or S1) and T2) or (T3 or S3)

Term Truncation (or stemming)
Removing suffixes and/or prefixes
Ex: PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

File Structures for Indexing and Searching

Introduction
How to retrieve information?
A simple alternative is to search the whole text sequentially (online search)
Another option is to build data structures over the text (called indices) to speed up the search

Introduction
Indexing techniques
Inverted files
Suffix arrays
Signature files

Notation
n: the size of the text
m: the length of the pattern (m << n)
v: the size of the vocabulary
M: the amount of main memory
available

Inverted Files
Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.

Structure of an inverted file:
Vocabulary: is the set of all distinct words in
the text
Occurrences: lists containing all information
necessary for each word of the vocabulary
(text position, frequency, documents where
the word appears, etc.)

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6

Space Requirements
The space required for the vocabulary is rather small. According to Heaps' law the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice (sublinear)
On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
To reduce space requirements, a technique called block addressing is used

Block Addressing
The text is divided into blocks
The occurrences point to the blocks where the word appears
Advantages:
the number of pointers is smaller than the number of positions
all the occurrences of a word inside a single block are collapsed to one reference
Disadvantages:
an online search over the qualifying blocks is needed if exact positions are required

Example
Text (divided into four blocks):
[Block 1] That house has a [Block 2] garden. The garden has [Block 3] many flowers. The flowers [Block 4] are beautiful

Inverted file:
Vocabulary    Occurrences (block numbers)
beautiful     4
flowers       3
garden        2
house         1
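Block addressing can be sketched in a few lines. The block size of four words is an arbitrary choice mirroring the four-block example above (punctuation is dropped for simplicity); note how the two occurrences of "garden" collapse to a single block reference.

```python
# Sketch of block addressing: occurrences point to blocks, and all
# occurrences of a word inside one block collapse to one reference.
text = "That house has a garden The garden has many flowers The flowers are beautiful"
words = text.lower().split()
BLOCK = 4                                 # arbitrary block size (words)
index = {}
for i, w in enumerate(words):
    block = i // BLOCK + 1
    blocks = index.setdefault(w, [])
    if not blocks or blocks[-1] != block:  # collapse repeats in a block
        blocks.append(block)
print(index["garden"])    # [2]  (both occurrences fall in block 2)
print(index["flowers"])   # [3]
```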

Searching
The search algorithm on an inverted index follows three steps
Vocabulary search: the words present in the query are searched in the vocabulary
Retrieval of occurrences: the lists of the occurrences of all words found are retrieved
Manipulation of occurrences: the occurrences are processed to solve the query

Searching
The searching task on an inverted file always starts in the vocabulary (it is better to store the vocabulary in a separate file)
The structures most used to store the vocabulary are hashing, tries, or B-trees
Hashing, tries: O(m)
An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive, with O(log v) cost)

Construction
All the vocabulary is kept in a suitable data structure, storing for each word a list of its occurrences
Each word of the text is read and searched in the vocabulary
If it is not found, it is added to the vocabulary with an empty list of occurrences; in either case, the new position is added to the end of its list of occurrences

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary trie:
[Figure: a trie branching on the first letters b, f, g, h, leading to]
beautiful: 70
flowers: 45, 58
garden: 18, 29
house: 6

Construction
Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created:
in the first file, the lists of occurrences are stored contiguously (posting file)
in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time
The overall process is O(n) worst-case time
Not practical for large texts
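The single-pass construction can be sketched as below; a dict stands in for the vocabulary trie. Positions here are 1-based character offsets computed by the code, so they may differ slightly from the slide's hand counting.

```python
# Single-pass inverted-index construction: each word is looked up in the
# vocabulary and the current position appended to its occurrence list.
import re

def build_index(text):
    vocab = {}
    for m in re.finditer(r"[A-Za-z]+", text):
        vocab.setdefault(m.group().lower(), []).append(m.start() + 1)
    return vocab

index = build_index("That house has a garden. The garden has many flowers.")
print(index["garden"])  # [18, 30]
print(index["house"])   # [6]
```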

Construction
An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text
Once the text is exhausted, a number of partial indices Ii exist on disk
The partial indices are merged to obtain the final index

Example
[Figure: hierarchical merging of partial indices. The initial dumps I1..I8 (level 1) are merged pairwise into I1..2, I3..4, I5..6, I7..8 (level 2), then into I1..4 and I5..8 (level 3), and finally into the final index I1..8. The numbers 1, 2, 4, 7 mark the order in which the merges are performed.]

Construction
The total time to generate partial
indices is O(n)
The number of partial indices is
O(n/M)
To merge the O(n/M) partial indices,
log2(n/M) merging levels are necessary
The total cost of this algorithm is O(n
log(n/M))
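One merge step can be sketched as below. Partial indices are modeled as plain dicts of occurrence lists; since each partial index covers a later portion of the text, the positions of I2 are assumed to follow those of I1, so shared words simply concatenate their lists.

```python
# Merging two partial indices: shared words concatenate their occurrence
# lists (I2's positions come later in the text by construction).
def merge(i1, i2):
    out = {w: list(occ) for w, occ in i1.items()}
    for word, occ in i2.items():
        out[word] = out.get(word, []) + occ
    return out

I1 = {"garden": [18], "house": [6]}
I2 = {"garden": [29], "flowers": [45, 58]}
print(merge(I1, I2))
```

Repeating this pairwise over the O(n/M) partial indices, level by level, gives the O(n log(n/M)) total cost stated above.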

Summary on Inverted Files
The inverted file is probably the most adequate indexing technique for text databases
The indices are appropriate when the text collection is large and semi-static
Otherwise, if the text collection is volatile, online searching is the only option
Some techniques combine online and indexed searching

Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix
Index points are selected from the text; they point to the beginnings of the text positions which will be retrievable
The problem with suffix trees is their space overhead

Example
Text (word positions: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70):
That house has a garden. The garden has many flowers. The flowers are beautiful

Suffixes (at the selected index points):
house has a garden. The garden has many flowers. The flowers are beautiful
garden. The garden has many flowers. The flowers are beautiful
garden has many flowers. The flowers are beautiful
flowers. The flowers are beautiful
flowers are beautiful
beautiful

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Trie:
[Figure: a trie over the suffixes, branching on the first letters b, f, g, h; the leaves point to positions 70 (beautiful), 45 and 58 (the two "flowers" suffixes, distinguished by '.' vs ' '), 18 and 29 (the two "garden" suffixes, distinguished by '.' vs ' '), and 6 (house).]

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Tree:
[Figure: the suffix trie with unary paths compressed; each internal node stores the character position to inspect, and the leaves point to positions 70, 45, 58, 18, 29, and 6.]

Suffix Arrays
An array containing all the pointers to the text suffixes, listed in lexicographical order
The space requirements are almost the same as those for inverted indices
The main drawback of suffix arrays is their costly construction process
They allow binary searches, done by comparing the contents of each pointer
Supra-indices (for large suffix arrays)
The space requirements of a suffix array with a vocabulary supra-index are exactly the same as for inverted indices
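A suffix array over word-start index points, with a prefix query answered by binary search, can be sketched as below (the text is the running garden example; positions are 0-based character offsets, an implementation choice of this sketch).

```python
# Suffix array over word-start index points; prefix search by binary
# search (bisect) over the lexicographically sorted suffixes.
from bisect import bisect_left, bisect_right

text = "That house has a garden. The garden has many flowers."
points = [i for i, c in enumerate(text) if i == 0 or text[i - 1] == " "]
sa = sorted(points, key=lambda i: text[i:])      # the suffix array

def search(prefix):
    keys = [text[i:] for i in sa]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + "\uffff")   # upper bound of the range
    return sa[lo:hi]

print(sorted(search("garden")))  # [17, 29]: both "garden" suffixes
```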

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Suffix Array (pointers in lexicographical order of the suffixes):
70  58  45  29  18  6

Supra-Index (l=4, b=2: one sample of l characters every b entries):
beau -> [70, 58]    flow -> [45, 29]    gard -> [18, 6]

Example
Text: That house has a garden. The garden has many flowers. The flowers are beautiful

Vocabulary Supra-Index    Suffix Array    Inverted List
beautiful                 70              70
flowers                   58, 45          45, 58
garden                    29, 18          18, 29
house                     6               6

Construction of Suffix Arrays for Large Texts
[Figure: (1) a small suffix array is built in main memory for the next block of the text; (2) counters record how many suffixes of the long text fall between consecutive entries of the small suffix array; (3) the small suffix array is merged with the long suffix array to produce the final suffix array.]

Signature Files
Characteristics
Word-oriented index structures based on
hashing
Low overhead (10%~20% over the text
size) at the cost of forcing a sequential
search over the index
Suitable for not very large texts
Inverted files outperform signature files
for most applications

Construction and Search
Word-oriented index structures based on hashing
Maps words to bit masks of B bits
Divides the text into blocks of b words each
The mask is obtained by bitwise ORing the signatures of all the words in the text block.

Search
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the word
For all candidate blocks, an online traversal must be performed to verify if the word is actually there
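The scheme above can be sketched as follows. The hash function (md5-derived bit picks) and the parameters B=16, l=3 are arbitrary choices for this sketch, not part of any standard signature-file design.

```python
# Sketch of a signature file: each word hashes to l bits of a B-bit mask,
# block masks OR the word masks, and search ANDs the query mask against
# each block, then verifies candidates to discard false drops.
import hashlib

B, L = 16, 3

def signature(word):
    h = hashlib.md5(word.encode()).digest()
    mask = 0
    for k in range(L):                 # set l pseudo-random bits
        mask |= 1 << (h[k] % B)
    return mask

def build(blocks):
    masks = []
    for block in blocks:
        m = 0
        for w in block.split():
            m |= signature(w)
        masks.append(m)
    return masks

def search(blocks, masks, word):
    w = signature(word)
    # candidate blocks may include false drops; verify by scanning
    return [i for i, m in enumerate(masks)
            if m & w == w and word in blocks[i].split()]

blocks = ["this is a text", "a text has many words", "words are made from letters"]
masks = build(blocks)
print(search(blocks, masks, "text"))   # [0, 1]
```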

Example
Four blocks:
This is a text.   A text has many   words. Words are   made from letters.
    000101             110101            100100             101101

Hash(text)    = 000101
Hash(many)    = 110000
Hash(words)   = 100100
Hash(made)    = 001100
Hash(letters) = 100001

Block 4: 001100 OR 100001 = 101101

False Drop
Assume that l bits are randomly set in the mask
Let α = l / B
For b words, the probability that a given bit of the mask is set is 1 − (1 − 1/B)^{bl} ≈ 1 − e^{−bα}
Hence, the probability that the l random bits of the query mask are also set (a false alarm, or false drop) is Fd = (1 − e^{−bα})^l
Fd is minimized for α = ln(2)/b, which gives
Fd = 2^{−l}, i.e. l = B ln(2) / b

Comparisons
Signature files
Use hashing techniques to produce an
index
advantage
storage overhead is small (10%-20%)

disadvantages
the search time on the index is linear
some answers may not match the query,
thus filtering must be done

Comparisons (Continued)
Inverted files
storage overhead (30% ~ 100%)
search time for word searches is logarithmic

Suffix arrays
potential use in other kinds of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent substrings

Sequential Searching
Brute Force (BF)
Knuth-Morris-Pratt (KMP)
Boyer-Moore Family (BM)
Shift-Or
Suffix Automaton

Exact String Matching
Definition: Given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs
The simplest algorithm: Brute-Force (BF)
Trying all possible pattern positions in the text
Worst-case cost: O(mn); average-case cost: O(n)
O(n) text positions
O(m) worst-case cost for each position
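The brute-force algorithm is a direct sketch: try every text position and compare up to m characters.

```python
# Brute-force exact matching: O(n) positions, O(m) worst case per position.
def brute_force(pattern, text):
    m, n = len(pattern), len(text)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(brute_force("abra", "abracadabra"))  # [0, 7]
```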

Knuth-Morris-Pratt
The KMP method scans the characters left-to-right
When a mismatch occurs, an optimum shift is carried out for pattern P
No new match can be obtained except when some head of the already matching part of P is identical to a tail of the matching part of T
How to detect coincidences between heads of P and tails of T?
Any matching tail of T is also a matching tail of P
Detecting repeating portions in P

Knuth-Morris-Pratt
Next table at position j: the longest proper prefix of P1..j-1 which is also a suffix, and where the characters following the prefix and the suffix are different
j − next[j] − 1 positions can be safely skipped

P:    a b r a c a d a b r a
Next: 0 0 0 0 1 0 1 0 0 0 0 4

T: a b r a c a b r a c a d a b r a
P: a b r a c a d                      (mismatch at the 7th character)
P:           a b r a c a d a b r a   (after the shift, the match succeeds)

At each text comparison, the window or the pointer advances by at least one position, so the algorithm performs at most 2n comparisons (and at least n)
The Aho-Corasick algorithm is an extension of KMP for matching a set of patterns
Patterns are arranged in a trie-like data structure
Ex: hello, elbow, eleven
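KMP can be sketched with the standard failure function (longest proper prefix of P[..j] that is also a suffix); this is the common textbook variant rather than the optimized "next" table shown above, but the shifting idea is the same.

```python
# KMP: precompute the failure function, then scan the text once,
# never moving the text pointer backwards.
def failure(p):
    f = [0] * len(p)
    k = 0
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = f[k - 1]
        if p[j] == p[k]:
            k += 1
        f[j] = k
    return f

def kmp(pattern, text):
    f, k, hits = failure(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = f[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):          # full match ends at position i
            hits.append(i - k + 1)
            k = f[k - 1]
    return hits

print(kmp("abracadabra", "abracabracadabra"))  # [5]
```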

Boyer-Moore Family
The BM method scans characters from right to left
The heuristic which gives the longest shift is selected
Matching shift (or good-suffix shift, δ2 shift)
When some tail of P already matches some substring of T
Occurrence shift (or bad-character shift, δ1 shift)
When a mismatched character is known not to occur in the pattern
Extended δ1 shift: places in coincidence any matching positions between heads and tails of P

Examples
T: abracabracadabra
P: abracadabra
a b r a c a d a b r a  (δ2 = 3)
a b r a c a d a b r a  (δ1 = 5)

T: babcbadcabcaabca
P: abcabcacab
a b c a b c a c a b  (δ2 = 5)
a b c a b c a c a b  (δ1 = 7)
a b c a b c a c a b  (extended δ1 = 8)

Some variations

Simplified BM algorithm
BM-Horspool (BMH) algorithm
BM-Sunday (BMS) algorithm
Commentz-Walter algorithm: an
extension of BM to multi-pattern search
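BM-Horspool, the simplified variant mentioned above, keeps only the occurrence (bad-character) shift, computed from the character under the last position of the current window; it can be sketched as:

```python
# BM-Horspool: shift the window using only the bad-character rule,
# driven by the text character aligned with the pattern's last position.
def horspool(pattern, text):
    m, n = len(pattern), len(text)
    # distance from each character's last occurrence (except the final
    # position) to the end of the pattern; unknown characters shift by m
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool("abracadabra", "abracabracadabra"))  # [5]
```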

Shift-Or
Based on bit-parallelism to simulate the operation of a non-deterministic automaton
It first builds a table B which stores a bit mask b_m…b_1 for each character
B[c] has the i-th bit set to zero if p_i = c
The state of the search is kept in D = d_m…d_1 (initially set to all 1s)
where d_i is zero whenever the state numbered i is active
A match is reported whenever d_m is zero
For each new character T_j: D = (D << 1) | B[T_j]

Example
Pattern P = abracaba (m = 8):
B[a] = 01010110
B[b] = 10111101
B[c] = 11101111
B[r] = 11111011
B[*] = 11111111   (any other character)

Example
Ex: Input T = abcabracaba
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11101111 = 11111111 ()
(11111111 << 1) | 01010110 = 11111110 (A)
(11111110 << 1) | 10111101 = 11111101 (AB)
(11111101 << 1) | 11111011 = 11111011 (ABR)
(11111011 << 1) | 01010110 = 11110110 (ABRA, A)
(11110110 << 1) | 11101111 = 11101111 (ABRAC)
(11101111 << 1) | 01010110 = 11011110 (ABRACA, A)
(11011110 << 1) | 10111101 = 10111101 (ABRACAB, AB)
(10111101 << 1) | 01010110 = 01111110 (ABRACABA, A)

Matched! (the highest bit d_m is 0)
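The trace above can be reproduced with a compact implementation (bit i of B[c] is cleared when p_{i+1} = c; a match is reported when bit m of the state is 0):

```python
# Shift-Or: bit-parallel simulation of the NFA for exact matching.
def shift_or(pattern, text):
    m = len(pattern)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, full) & ~(1 << i)
    D, hits = full, []
    for j, c in enumerate(text):
        D = ((D << 1) & full) | B.get(c, full)
        if D & (1 << (m - 1)) == 0:        # state m active -> match
            hits.append(j - m + 1)         # report the start position
    return hits

print(shift_or("abracaba", "abcabracaba"))  # [3]
```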

Suffix Automaton
Suffix automaton on a pattern P: an automaton that recognizes all suffixes of P
[Figure: a non-deterministic suffix automaton with an initial state I and ε-transitions into each position of P.]

The Backward DAWG Matching (BDM) algorithm converts this automaton to a deterministic one
DAWG: directed acyclic word graph

To search a pattern P
The suffix automaton of Pr (the reversed pattern) is built
Search backwards inside the text window for a substring of P using the suffix automaton
Each time a terminal state is reached before hitting the beginning of the window, the position inside the window is remembered
Finding a prefix of the pattern -> suffix of the window
The last prefix recognized backwards is the longest prefix of P in the window
The window is aligned with the longest prefix recognized

Example
P: abracadabra
Pr: arbadacarba
T: a b r a c a b r a c a d a b r a
[Figure: the window is scanned backwards; positions where a prefix of P is recognized are marked, and the window is shifted to align with the longest such prefix.]

Practical Comparison
The clear winners are BNDM and BMS (Sunday)
Classical BM and BDM are also very close
For English texts, Agrep is much faster, because its code is carefully optimized
For longer patterns, BDM is better than BNDM
For extended patterns, BNDM is normally the fastest; otherwise Shift-Or is the best option

Pattern Matching
Searching allowing errors (Approximate String Matching)
Dynamic Programming
Automaton
Regular Expressions and Extended Patterns
Pattern Matching Using Indices
Inverted files
Suffix Trees and Suffix Arrays

Approximate String
Matching
Definition: Given a short pattern P of
length m, a long text T of length n, and
a maximum allowed number of errors k,
find all the text positions where the
pattern occurs with at most k errors
This corresponds to the Levenshtein
distance (edit distance)
With minimum modifications it is adapted
to searching whole words matching the
pattern with k errors
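The classic dynamic-programming solution (Sellers' algorithm) keeps one column C of edit distances between pattern prefixes and the best-matching text suffix, and reports every text position where C[m] ≤ k. A sketch, with "survey"/"surgery" as an illustrative pair:

```python
# Sellers' O(mn) approximate search: C[i] = edit distance between
# P[1..i] and the best text suffix ending at the current position.
def approx_search(pattern, text, k):
    m = len(pattern)
    C = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        diag = C[0]            # old C[i-1]; C[0] stays 0 (match may start anywhere)
        for i in range(1, m + 1):
            old = C[i]
            C[i] = min(old + 1,                          # delete from pattern
                       C[i - 1] + 1,                     # insert into pattern
                       diag + (pattern[i - 1] != c))     # match / substitute
            diag = old
        if C[m] <= k:
            hits.append(j)     # a match ends at text position j
    return hits

print(approx_search("survey", "surgery", 2))  # [4, 5, 6]
```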

Dynamic Programming

Automaton

Regular Expressions

Pattern Matching Using Indices
Inverted Files
Query types such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
The restriction: not able to efficiently find approximate matches or regular expressions that span many words.

Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex searches
Word, prefix, suffix, substring, and range queries
Regular expressions
Unrestricted approximate string matching
Useful in specific areas
Find the longest substring
Find the most common substring of a fixed size

Pattern Matching Using Indices
Suffix Arrays
Some patterns can be searched directly in the suffix array without simulating the suffix tree
Word, prefix, suffix, subword search and range search

Compression
Compressed text: Huffman coding
Taking words as symbols
Use an alphabet of bytes instead of bits

Compressed indices
Inverted Files
Suffix Trees and Suffix Arrays
Signature Files
