J. H. Wang
Feb. 20, 2008
[Figure: architecture of a conventional text retrieval system. Text operations produce the logical view of the documents; the indexing module builds an inverted file, kept as the index by the DB Manager Module; query operations (with user feedback) produce the query; searching over the index retrieves documents from the text database; ranking orders the retrieved documents into ranked docs. The numbers 4, 10, 6, 7, 8 annotate related chapters.]
Outline
Conventional text retrieval systems
(8.1-8.3, Salton)
File Structures for Indexing and
Searching (Chap. 8)
Inverted files
Suffix trees and suffix arrays
Signature files
Sequential searching
Pattern matching
Conceptual Information
Retrieval
[Figure: queries are turned into formal statements by negotiation and analysis (query formulation); documents are turned into indexed documents by text indexing (content analysis); similarity computation between the two drives the retrieval of similar terms and documents.]
Representation
Documents
Indexed terms (or term vectors)
Unweighted or weighted
Queries
Unweighted or weighted terms
Boolean operators: or, and, not
E.g. Taiwan AND NOT Taipei
Efficiency
Data structure requirements:
Fast access to documents
Very large number of index terms
Inverted Index
The complete file is represented as
an array of indexed documents.
[Figure: a document-term array with rows Doc 1..Doc 4 and columns Term 1..Term 4.]
Inverted-file Process
The document-term array is inverted
(actually transposed).
[Figure: the transposed array, with rows Term 1..Term 4 and columns Doc 1..Doc 4.]
Inverted-file Process
The rows are manipulated according
to the query specification. (list merging)
Ex: Query = (term 2 and term 3)
Term 2: 1 1 0 0
Term 3: 0 1 1 1
---------------
AND:    0 1 0 0
(only Doc 2 satisfies the query)
Ex: Query = ((T1 or T2) and not T3)
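The list-merging step above can be sketched over sorted postings lists instead of bit rows. A minimal sketch in Python: the postings for T2 and T3 follow the bit rows in the example (T2 in docs 1-2, T3 in docs 2-4), while the postings for T1 and the four-document collection are assumed for illustration.

```python
# Hypothetical postings lists (sorted doc IDs). T1 is assumed for illustration.
postings = {"T1": [1, 2], "T2": [1, 2], "T3": [2, 3, 4]}
ALL_DOCS = [1, 2, 3, 4]

def merge_and(a, b):
    """Intersect two sorted postings lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_or(a, b):
    """Union of two sorted postings lists."""
    return sorted(set(a) | set(b))

def merge_not(a):
    """Complement against the whole document collection."""
    excluded = set(a)
    return [d for d in ALL_DOCS if d not in excluded]

print(merge_and(postings["T2"], postings["T3"]))   # -> [2]
print(merge_and(merge_or(postings["T1"], postings["T2"]),
                merge_not(postings["T3"])))        # -> [1]
```

The linear merge keeps the cost proportional to the lengths of the two lists, which is why postings are stored sorted.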
Distance Constraints
Nearness parameters
Within sentence: terms cooccur in a
common sentence
Adjacency: terms occur adjacently in the
text
Implementation
Term Weights
Term-importance weights
Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
Issues
How to generate term weights? (more
on this later)
How to apply term weights?
Vector queries: the sum of the weights of all
document terms that match the given query
Boolean queries: (more complex)
An Example
Example: Q = (T1 and T2) or T3
For each document, a weight is computed for each conjunct, (T1 and T2) and (T3), and the conjunct weights are then combined into the weight of the whole query (T1 and T2) or T3.
D1 = (T1, 0.2; T2, 0.5; T3, 0.6)
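The slides leave the combining rule for Boolean query weights unspecified; a common choice (the fuzzy-set interpretation) takes the minimum over an AND and the maximum over an OR. A minimal sketch under that assumption:

```python
def eval_weight(query, weights):
    """Evaluate a weighted Boolean query against one document.
    query is a nested tuple; weights maps term -> weight (0 if absent).
    AND -> min of operand weights, OR -> max (fuzzy-set interpretation)."""
    op = query[0]
    if op == "term":
        return weights.get(query[1], 0.0)
    if op == "and":
        return min(eval_weight(q, weights) for q in query[1:])
    if op == "or":
        return max(eval_weight(q, weights) for q in query[1:])
    raise ValueError("unknown operator: " + op)

# D1 from the slide; Q = (T1 and T2) or T3
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
q = ("or", ("and", ("term", "T1"), ("term", "T2")), ("term", "T3"))
print(eval_weight(q, d1))  # -> 0.6, i.e. max(min(0.2, 0.5), 0.6)
```

A vector query, by contrast, would simply sum the weights of all matching document terms.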
Synonym Specification
(T1 and T2) or T3
becomes ((T1 or S1) and T2) or (T3 or S3), where S1 and S3 are synonyms of T1 and T3
Ex:
PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, ...
Introduction
How to retrieve information?
A simple alternative is to search the
whole text sequentially (online
search)
Another option is to build data
structures over the text (called
indices) to speed up the search
Introduction
Indexing techniques
Inverted files
Suffix arrays
Signature files
Notation
n: the size of the text
m: the length of the pattern (m << n)
v: the size of the vocabulary
M: the amount of main memory
available
Inverted Files
Definition: an inverted file is a word-oriented mechanism for indexing a text
collection in order to speed up the searching
task.
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Inverted file:
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
Space Requirements
The space required for the vocabulary is rather
small. According to Heaps' law the vocabulary
grows as O(n^β), where β is a constant between
0.4 and 0.6 in practice (sublinear)
On the other hand, the occurrences demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
To reduce space requirements, a technique
called block addressing is used
Block Addressing
The text is divided into blocks
The occurrences point to the blocks where
the word appears
Advantages:
the number of pointers is smaller than the
number of positions
all the occurrences of a word inside a single
block are collapsed to one reference
Disadvantages:
an online search over the qualifying blocks is
needed if exact positions are required
Example
Text, divided into four blocks:
[Block 1: That house has a] [Block 2: garden. The garden has] [Block 3: many flowers. The flowers] [Block 4: are beautiful]

Inverted file:
Vocabulary   Occurrences
beautiful    4
flowers      3
garden       2
house        1
(both occurrences of garden fall in block 2 and both occurrences of flowers in block 3, so each collapses to a single block reference)
Searching
The search algorithm on an inverted
index follows three steps
Vocabulary search: the words present in
the query are searched in the vocabulary
Retrieval of occurrences: the lists of
occurrences of all the words found are retrieved
Manipulation of occurrences: the occurrences
are processed to solve phrases, proximity, or
Boolean operations
Searching
Searching task on an inverted file always
starts in the vocabulary (It is better to
store the vocabulary in a separate file)
The structures most often used to store the
vocabulary are hash tables, tries, or B-trees
Hash tables, tries: O(m) lookup time
Construction
All the vocabulary is kept in a
suitable data structure, storing for
each word a list of its occurrences
Each word of the text is read and
searched in the vocabulary
If it is not found, it is added to the
vocabulary with an empty list of
occurrences
In either case, the new position is
added to the end of the word's list of
occurrences
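The construction pass above can be sketched in a few lines of Python. This is a minimal sketch: positions are 1-based character offsets into this exact string, so they differ slightly from the slide's figure, which counts over its own layout.

```python
import re

def build_inverted_index(text):
    """Single pass over the text: each word is searched in the vocabulary;
    new words get an empty occurrence list, and the current position is
    appended to the word's list either way."""
    vocabulary = {}
    for m in re.finditer(r"\w+", text):
        word = m.group().lower()
        vocabulary.setdefault(word, []).append(m.start() + 1)  # 1-based offset
    return vocabulary

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_inverted_index(text)
print(index["garden"])  # -> [18, 30]
```

A real indexer would also remove stopwords (that, has, a, the, are) and normalize the remaining terms before insertion.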
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Vocabulary trie (branching on the first letters b, f, g, h):
beautiful: 70
flowers: 45, 58
garden: 18, 29
house: 6
Construction
Once the text is exhausted, the vocabulary is
written to disk with the list of occurrences. Two
files are created:
in the first file, the list of occurrences are stored
contiguously (posting file)
in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
Construction
An option is to use the previous algorithm
until the main memory is exhausted. When
no more memory is available, the partial
index Ii obtained so far is written to
disk and erased from main memory before
continuing with the rest of the text
Once the text is exhausted, a number of
partial indices Ii exist on disk
The partial indices are merged to obtain
the final index
Example
[Figure: merging eight partial indices I1..I8 in a binary fashion. Level 1 merges adjacent pairs into I1...2, I3...4, I5...6, I7...8 (merges 1-4); level 2 produces I1...4 and I5...8 (merges 5-6); level 3 produces the final index I1...8 (merge 7). The leaves are the initial dumps.]
Construction
The total time to generate partial
indices is O(n)
The number of partial indices is
O(n/M)
To merge the O(n/M) partial indices,
log2(n/M) merging levels are necessary
The total cost of this algorithm is
O(n log(n/M))
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Suffixes (one per index point):
(6)  house has a garden. The garden has many flowers. The flowers are beautiful
(18) garden. The garden has many flowers. The flowers are beautiful
(29) garden has many flowers. The flowers are beautiful
(45) flowers. The flowers are beautiful
(58) flowers are beautiful
(70) beautiful
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Suffix Trie
[Figure: a trie over the suffixes at the index points, branching first on b, f, g, h; leaves store text positions: beautiful -> 70, flowers are -> 58, flowers. -> 45, garden has -> 29, garden. -> 18, house -> 6.]
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Suffix Tree
[Figure: the suffix trie compressed into a suffix tree; internal nodes are labeled with the character position used to branch (1, 7, 8), and the leaves store the positions 70, 45, 58, 18, 29, 6.]
Suffix Arrays
An array containing all the pointers to the
text suffixes listed in lexicographical order
The space requirements are almost the same
as those for inverted indices
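Searching a suffix array is two binary searches over the suffixes. A minimal sketch, using the slide's text with one index point per word beginning (positions here are 0-based character offsets into this exact string, so they differ from the slide's own numbering):

```python
def build_suffix_array(text):
    """Index points at every word beginning, sorted by the suffix they start."""
    starts = [0] + [i + 1 for i, c in enumerate(text) if c == " "]
    return sorted(starts, key=lambda p: text[p:])

def suffix_range(text, sa, pat):
    """Two binary searches: leftmost suffix whose len(pat)-prefix is >= pat,
    then leftmost whose prefix is > pat; the slice in between starts with pat."""
    m = len(pat)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
sa = build_suffix_array(text)
print(sorted(suffix_range(text, sa, "garden")))  # -> [17, 29]
```

Each comparison inspects at most m characters, so a search costs O(m log n), which the supra-index described next reduces further in practice.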
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

Suffix Array (suffixes in lexicographical order):
70 58 45 29 18 6

Suffix Array with supra-index:
[Figure: a supra-index sampling every third entry, with 'beau' pointing to the block 70 58 45 and 'gard' to the block 29 18 6.]
Example
Text (word positions in parentheses):
That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)

[Figure: a vocabulary supra-index (beautiful, flowers, garden, house) over the suffix array 70 58 45 29 18 6, shown next to the equivalent inverted list (beautiful: 70; flowers: 45, 58; garden: 18, 29; house: 6). With one supra-index entry per word, the suffix array behaves essentially like an inverted index.]
Construction of Suffix
Arrays for Large Texts
[Figure: the text is split into small pieces; (1) a small suffix array is built in main memory for the next piece; (2) counters record how many suffixes of the new piece fall between consecutive positions of the long suffix array built so far; (3) the two arrays are merged into a longer suffix array, eventually yielding the final suffix array for the long text.]
Signature Files
Characteristics
Word-oriented index structures based on
hashing
Low overhead (10%~20% over the text
size) at the cost of forcing a sequential
search over the index
Suitable for not very large texts
Inverted files outperform signature files
for most applications
Search
Hash the query word to a bit mask W
If W & Bi = W, text block i may contain the word
For all candidate blocks, an online traversal must be
performed to verify whether the word is actually there
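The signature test above can be sketched directly. A minimal sketch: the mask width, the number of bits per word, and the use of md5 as the hash are all assumptions standing in for the hash functions the scheme leaves unspecified.

```python
import hashlib

B_BITS, L_BITS = 16, 3  # mask width and bits set per word (assumed values)

def hash_word(word):
    """Map a word to a B_BITS-wide mask with up to L_BITS bits set."""
    digest = hashlib.md5(word.encode()).digest()
    mask = 0
    for k in range(L_BITS):
        mask |= 1 << (digest[k] % B_BITS)
    return mask

def block_signature(words):
    """Superimposed coding: OR together the masks of all words in the block."""
    sig = 0
    for w in words:
        sig |= hash_word(w)
    return sig

def may_contain(sig, word):
    """W & Bi = W: every bit of the word's mask is set in the block signature."""
    w = hash_word(word)
    return sig & w == w

blocks = [["this", "is", "a", "text"], ["a", "text", "has", "many"],
          ["words", "words", "are"], ["made", "from", "letters"]]
sigs = [block_signature(b) for b in blocks]
candidates = [i for i, s in enumerate(sigs) if may_contain(s, "text")]
# blocks 0 and 1 certainly qualify; other blocks may appear as false drops
```

The test can never miss a block that contains the word; it can only admit extra candidates, which the subsequent online traversal filters out.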
Example
Text (four blocks):
[Block 1: This is a text.] [Block 2: A text has many] [Block 3: words. Words are] [Block 4: made from letters.]

Block signatures:
Block 1: 000101   Block 2: 110101   Block 3: 100100   Block 4: 101101

Hash(text)    = 000101
Hash(many)    = 110000
Hash(words)   = 100100
Hash(made)    = 001100
Hash(letters) = 100001

E.g. Block 4: 001100 OR 100001 = 101101
False Drop
Assume that l bits are randomly set in the
mask, and let α = l/B
For b words, the probability that a given bit of
the mask is set is 1 - (1 - 1/B)^(bl) ≈ 1 - e^(-αb)
Hence, the probability that the l random bits of
a query word are also set (a false alarm) is
Fd = (1 - e^(-αb))^l
Fd is minimized for α = ln(2)/b, which gives
Fd = 2^(-l), i.e. l = B ln(2)/b
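The optimum above is easy to check numerically. A small sketch, with the signature width and the number of words per block chosen purely for illustration:

```python
import math

B = 1024  # bits per block signature (assumed)
b = 64    # distinct words per block (assumed)

l = B * math.log(2) / b          # optimal bits per word: l = B ln(2)/b
alpha = l / B                    # equals ln(2)/b at the optimum
fd_general = (1 - math.exp(-alpha * b)) ** l   # Fd = (1 - e^(-alpha*b))^l
fd_optimal = 2 ** (-l)                         # closed form at the optimum
print(l, fd_optimal)             # l is about 11.09 bits per word here
```

At the optimum each bit of the signature is set with probability 1/2, which is why the general formula collapses to 2^(-l).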
Comparisons
Signature files
Use hashing techniques to produce an
index
advantage
storage overhead is small (10%-20%)
disadvantages
the search time on the index is linear
some answers may not match the query,
thus filtering must be done
Comparisons
(Continued)
Inverted files
storage overhead (30% ~ 100%)
search time for word searches is logarithmic
Suffix arrays
potential use in other kinds of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent substrings
Sequential Searching
Brute Force (BF)
Knuth-Morris-Pratt (KMP)
Boyer-Moore Family (BM)
Shift-Or
Suffix Automaton
Knuth-Morris-Pratt
The KMP method scans the characters left-to-right
When a mismatch occurs, an optimal
shift is carried out for pattern P
No new match can be obtained except when
some head of the already-matching part of P is
identical to a tail of the matching part of T
How to detect coincidences between heads of
P and tails of T?
Any matching tail of T is also a matching tail of P
So it suffices to detect repeating portions in P itself
Knuth-Morris-Pratt
Next table at position j: the length of the longest proper prefix
of P[1..j-1] which is also a suffix of it, such that the characters
following the prefix and the suffix are different
On a mismatch at position j, j - next[j] - 1 positions can be safely skipped

P:    a b r a c a d a b r a
Next: 0 0 0 0 1 0 1 0 0 0 0 4

T: abracabracadabra
   abracad            (mismatch at the 7th character of P)
        abracadabra   (P shifted 7 - next[7] - 1 = 5 positions; match found)
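The idea above can be sketched with the classic failure function (the slides' next table is an optimized variant that additionally requires the characters after the prefix and the suffix to differ):

```python
def kmp_search(text, pat):
    """KMP: precompute the failure function, then scan the text left to right,
    never moving backwards in the text."""
    m = len(pat)
    fail = [0] * m  # fail[j]: length of the longest proper border of pat[:j+1]
    k = 0
    for j in range(1, m):
        while k and pat[j] != pat[k]:
            k = fail[k - 1]
        if pat[j] == pat[k]:
            k += 1
        fail[j] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pat[k]:
            k = fail[k - 1]  # shift the pattern using the precomputed table
        if c == pat[k]:
            k += 1
        if k == m:
            hits.append(i - m + 1)  # 0-based start of the occurrence
            k = fail[k - 1]
    return hits

print(kmp_search("abracabracadabra", "abracadabra"))  # -> [5]
```

The scan inspects each text character at most twice, so the total cost is O(n + m).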
Boyer-Moore Family
The BM method scans characters from right
to left
The heuristic which gives the longest shift is
selected:
Occurrence shift (δ1): based on the text character that caused the mismatch
Matching shift (or good-suffix shift, δ2): when some tail of P already matches some substring of T
Examples:
T: abracabracadabra
P: abracadabra        (δ2 = 3, δ1 = 5)
T: babcbadcabcaabca
P: abcabcacab         (δ2 = 5, δ1 = 7, extended δ1 = 8)
Some variations
Simplified BM algorithm
BM-Horspool (BMH) algorithm
BM-Sunday (BMS) algorithm
Commentz-Walter algorithm: an
extension of BM to multi-pattern search
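Of the variations above, BM-Horspool is the simplest to sketch: it keeps only an occurrence-style shift driven by the last character of the current text window.

```python
def bmh_search(text, pat):
    """BM-Horspool: on each attempt, shift by the distance from the window's
    last text character to its rightmost occurrence in pat[:-1] (or by m)."""
    m, n = len(pat), len(text)
    if m == 0 or n < m:
        return []
    shift = {c: m - 1 - i for i, c in enumerate(pat[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pat:  # comparison (right to left in true BM)
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(bmh_search("abracabracadabra", "abracadabra"))  # -> [5]
```

Because the shift often equals m, BMH skips large parts of the text and is sublinear on average.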
Shift-Or
Based on bit-parallelism to simulate the
operation of a non-deterministic automaton
It first builds a table B which stores a bit mask
b_m...b_1 for each character
B[c] has the i-th bit set to zero if p_i = c
The state of the search is kept in D = d_m...d_1 (initially
set to all 1s)
d_i is zero whenever state i is active, i.e. p_1...p_i matches the text ending at the current position
A match is reported whenever d_m is zero
Example
P = abracaba (m = 8); masks read d_8...d_1, with a zero bit where the character occurs in P:
B[a] = 01010110
B[b] = 10111101
B[c] = 11101111
B[r] = 11111011
B[*] = 11111111
Example
Ex: Input T = abcabracaba, P = abracaba

(11111111 << 1) | B[a] = 11111110 | 01010110 = 11111110  (A)
(11111110 << 1) | B[b] = 11111100 | 10111101 = 11111101  (AB)
(11111101 << 1) | B[c] = 11111010 | 11101111 = 11111111  ()
(11111111 << 1) | B[a] = 11111110 | 01010110 = 11111110  (A)
(11111110 << 1) | B[b] = 11111100 | 10111101 = 11111101  (AB)
(11111101 << 1) | B[r] = 11111010 | 11111011 = 11111011  (ABR)
(11111011 << 1) | B[a] = 11110110 | 01010110 = 11110110  (ABRA, A)
(11110110 << 1) | B[c] = 11101100 | 11101111 = 11101111  (ABRAC)
(11101111 << 1) | B[a] = 11011110 | 01010110 = 11011110  (ABRACA, A)
(11011110 << 1) | B[b] = 10111100 | 10111101 = 10111101  (ABRACAB, AB)
(10111101 << 1) | B[a] = 01111010 | 01010110 = 01111110  (ABRACABA, A)
Matched! (d_8 = 0)
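The trace above is exactly what a Shift-Or implementation computes with machine words. A minimal sketch:

```python
def shift_or_search(text, pat):
    """Shift-Or: B[c] has bit i cleared when pat[i] == c; the state D keeps a
    zero in bit i whenever pat[:i+1] ends at the current text position."""
    m = len(pat)
    all_ones = (1 << m) - 1
    B = {}
    for i, c in enumerate(pat):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones
    hits = []
    for j, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if not D & (1 << (m - 1)):  # d_m is zero: the whole pattern matched
            hits.append(j - m + 1)  # 0-based start position
    return hits

print(shift_or_search("abcabracaba", "abracaba"))  # -> [3]
```

All m automaton states are updated in O(1) bitwise operations per text character, which is the point of bit-parallelism.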
Suffix Automaton
Suffix automaton on a pattern P: an
automaton that recognizes all the suffixes
of P
[Figure: a suffix automaton, with initial state I and transitions on the pattern characters.]
To search for a pattern P:
The suffix automaton of Pr (the reversed pattern) is built
Search backwards inside the text window for a
substring of P using the suffix automaton
Each time a terminal state is reached before
hitting the beginning of the window, the
position inside the window is remembered
Finding a prefix of the pattern -> suffix of the window
Example
P: abracadabra
Pr: arbadacarba
T: a b r a c a b r a c a d a b r a
[Figure: two successive windows over T are scanned backwards with the suffix automaton of Pr; positions where a terminal state is reached (marked x) correspond to prefixes of P and determine the next window shift.]
Practical Comparison
The clear winners are BNDM and BMS
(Sunday)
Classical BM and BDM are also very close
Pattern Matching
Searching allowing errors
(Approximate String Matching)
Dynamic Programming
Automaton
Approximate String
Matching
Definition: Given a short pattern P of
length m, a long text T of length n, and
a maximum allowed number of errors k,
find all the text positions where the
pattern occurs with at most k errors
This corresponds to the Levenshtein
distance (edit distance)
With minimal modifications it can be adapted
to searching for whole words that match the
pattern with at most k errors
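The standard dynamic-programming solution (due to Sellers) keeps one column of the edit-distance matrix per text position. A minimal sketch:

```python
def approx_search(text, pat, k):
    """Column-wise dynamic programming: C[0][j] = 0 lets a match start at any
    text position; report end positions j where C[m][j] <= k."""
    m = len(pat)
    col = list(range(m + 1))  # column for the empty text prefix: C[i] = i
    ends = []
    for j, c in enumerate(text):
        prev, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(prev + (pat[i - 1] != c),  # substitution / match
                      col[i] + 1,                # insertion
                      col[i - 1] + 1)            # deletion
            prev, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)  # 0-based end position of an approximate match
    return ends

# "hose" occurs in "house" with one deletion, ending at position 9
print(approx_search("That house has a garden", "hose", 1))
```

The cost is O(mn) time and O(m) space, which is why the faster automaton and bit-parallel variants listed next matter for long texts.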
Dynamic Programming
Automaton
Regular Expression searches
Compression
Compressed text: Huffman coding
Taking words as symbols
Use an alphabet of bytes instead of bits
Compressed indices
Inverted Files
Suffix Trees and Suffix Arrays
Signature Files