Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Document Preprocessing
Lexical analysis of the text
Elimination of stopwords
Stemming
Selection of index terms
Construction of term categorization structures
Lexical Analysis of the Text
Word separators
space
digits
hyphens
punctuation marks
the case of the letters
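As a concrete illustration, here is a minimal tokenizer sketch in Python; the specific policy (folding case, treating digits, hyphens, and punctuation as separators) is one illustrative choice among the options above, not a fixed standard:

```python
import re

def tokenize(text):
    """Split raw text into index-term candidates.

    Illustrative policy: fold letter case, then treat digits,
    hyphens, and punctuation as separators by keeping only
    alphabetic runs.
    """
    text = text.lower()                  # normalize the case of the letters
    return re.findall(r"[a-z]+", text)   # everything non-alphabetic separates

print(tokenize("State-of-the-art retrieval, 2024!"))
# ['state', 'of', 'the', 'art', 'retrieval']
```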
Elimination of Stopwords
A list of stopwords
words that are too frequent among the documents
articles, prepositions, conjunctions, etc.
Problem
Search for “to be or not to be”?
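A short sketch of the filtering step, using a toy stopword list (assumed for illustration); it also shows the problem just noted, since every word of that query is a stopword:

```python
STOPWORDS = {"a", "an", "the", "of", "and", "to", "be", "or", "not"}  # toy list

def remove_stopwords(tokens):
    """Drop words that are too frequent to discriminate between documents."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["elimination", "of", "stopwords"]))
# ['elimination', 'stopwords']
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the entire query disappears
```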
Stemming
Example
connect, connected, connecting, connection, connections
effectiveness --> effective --> effect
picnicking --> picnic
king -/-> k (should not be reduced to k)
Removing strategies
affix removal: intuitive, simple
table lookup
successor variety
n-gram
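For affix removal, a sketch using NLTK's PorterStemmer, one widely used suffix-stripping stemmer (assumes the nltk package is installed):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))  # all five conflate to 'connect'

print(stemmer.stem("king"))  # 'king': the stemmer correctly refuses king -/-> k
```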
Index Terms Selection
Motivation
A sentence is usually composed of nouns, pronouns,
articles, verbs, adjectives, adverbs, and connectives.
Most of the semantics is carried by the nouns.
Recall and Precision

Recall = \frac{a}{a+d} \qquad Precision = \frac{a}{a+b}

where a = relevant documents retrieved, b = nonrelevant documents retrieved, and d = relevant documents not retrieved (so a + d is the total number of relevant documents and a + b is the total number retrieved).
A Joint Measure
F-score:

F = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}

β is a parameter that encodes the relative importance of recall and precision:
β = 1: equal weight
β < 1: precision is more important
β > 1: recall is more important
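A small sketch computing the three measures from the contingency counts, with a, b, d as defined above:

```python
def precision_recall_f(a, b, d, beta=1.0):
    """a: relevant retrieved, b: nonrelevant retrieved, d: relevant missed."""
    precision = a / (a + b)
    recall = a / (a + d)
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(a=8, b=2, d=8))            # P=0.8, R=0.5, F1≈0.62
print(precision_recall_f(a=8, b=2, d=8, beta=2.0))  # beta>1 shifts weight to recall
```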
Choices of Recall and Precision
Both recall and precision vary from 0 to 1.
Particular choices of indexing and search policies
have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision
and 0.8 recall.
In many circumstances, recall and precision values both in the 0.5 to 0.6 range are satisfactory for the average user.
Term-Frequency Consideration
Function words
for example, "and", "or", "of", "but", …
Content words
words that actually relate to document content
[Figure: term discrimination value dv_j plotted against document frequency, from 0 to N]
Low frequency: dv_j = 0; medium frequency: dv_j > 0; high frequency: dv_j < 0.
TFij × dvj

w_{ij} = tf_{ij} \times dv_j

compared with w_{ij} = tf_{ij} \times \log \frac{N}{df_j}

\log \frac{N}{df_j}: decreases steadily with increasing document frequency.

dv_j: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger.
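A minimal sketch of the tf × idf comparison weight (the dv_j weight itself needs the centroid machinery introduced on the next slide):

```python
import math

def tf_idf(tf_ij, df_j, N):
    """w_ij = tf_ij * log(N / df_j): the weight falls steadily as the
    term appears in more of the N documents."""
    return tf_ij * math.log(N / df_j)

# a term occurring 3 times in a document and in 10 of 1000 documents
print(tf_idf(tf_ij=3, df_j=10, N=1000))  # ~13.8
```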
Document Centroid
Issue: efficiency problem
N(N-1) pairwise similarities
Document centroid C = (c1, c2, c3, ..., ct)
c_j = \sum_{i=1}^{N} w_{ij}
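A toy sketch of the centroid (the weight matrix is invented for illustration). Comparing each document against this one centroid costs N similarity computations instead of the N(N-1) pairwise ones noted above:

```python
import numpy as np

# toy term-document weight matrix W: N documents (rows) x t terms (columns)
W = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0]])

centroid = W.sum(axis=0)  # c_j = sum over i of w_ij, as defined above
print(centroid)           # [3. 4. 2.]
```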
Pr(D|rel), Pr(D|nonrel):
occurrence probabilities of document D in the
relevant and nonrelevant document sets
Assumptions
Terms occur independently in documents
\Pr(D \mid rel) = \prod_{i=1}^{t} \Pr(x_i \mid rel)

\Pr(D \mid nonrel) = \prod_{i=1}^{t} \Pr(x_i \mid nonrel)
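Because the probabilities factor per term, it is natural to work with sums of logarithms; a minimal sketch with invented per-term probabilities:

```python
import math

def log_prob(term_probs):
    """log Pr(D | class) = sum_i log Pr(x_i | class), by the term
    independence assumption; logs avoid underflow for large t."""
    return sum(math.log(p) for p in term_probs)

pr_rel = [0.8, 0.3, 0.6]     # invented Pr(x_i | rel) for a 3-term vocabulary
pr_nonrel = [0.2, 0.4, 0.5]  # invented Pr(x_i | nonrel)
print(log_prob(pr_rel) - log_prob(pr_nonrel))  # log-odds of relevance
```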
Derivation Process
g(D) = \log \frac{\prod_{i=1}^{t} \Pr(x_i \mid rel)}{\prod_{i=1}^{t} \Pr(x_i \mid nonrel)} + \text{constants}

     = \sum_{i=1}^{t} \log \frac{\Pr(x_i \mid rel)}{\Pr(x_i \mid nonrel)} + \text{constants}
For a specific document D

Given a document D = (d_1, d_2, …, d_t), where d_i ∈ {0, 1} records whether term i occurs, and writing p_i = \Pr(x_i = 1 \mid rel) and q_i = \Pr(x_i = 1 \mid nonrel):

g(D) = \sum_{i=1}^{t} \log \frac{\Pr(x_i = d_i \mid rel)}{\Pr(x_i = d_i \mid nonrel)} + \text{constants}

     = \sum_{i=1}^{t} \log \frac{p_i^{d_i} (1 - p_i)^{1 - d_i}}{q_i^{d_i} (1 - q_i)^{1 - d_i}} + \text{constants}

     = \sum_{i=1}^{t} \log \frac{\left(p_i (1 - q_i)\right)^{d_i} (1 - p_i)}{\left(q_i (1 - p_i)\right)^{d_i} (1 - q_i)} + \text{constants}
Term Relevance Weight
g(D) = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{constants}

The first sum does not depend on the particular document, so the document-dependent contribution of term j is captured by the term relevance weight:

tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
Issue
How to compute pj and qj ?
pj = rj / R
qj = (dfj-rj)/(N-R)
With no relevance information, taking p_j to be constant (e.g., 0.5) and q_j ≈ df_j / N gives

tr_j = \log \frac{N - df_j}{df_j} \approx \log \frac{N}{df_j} = idf_j
Estimation of Term-Relevance
Estimate the number of relevant documents r_j in the collection that contain term T_j as a function of the known document frequency df_j of the term T_j.
pj = rj / R
qj = (dfj-rj)/(N-R)
R: an estimate of the total number of relevant documents
in the collection.
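A sketch of these estimates (the counts are invented for illustration; in practice p_j and q_j are smoothed so the logarithm stays defined when r_j = 0 or r_j = df_j):

```python
import math

def term_relevance(r_j, R, df_j, N):
    """tr_j = log[p_j (1 - q_j) / (q_j (1 - p_j))] with
    p_j = r_j / R and q_j = (df_j - r_j) / (N - R)."""
    p_j = r_j / R
    q_j = (df_j - r_j) / (N - R)
    return math.log(p_j * (1 - q_j) / (q_j * (1 - p_j)))

# 8 of R=10 known relevant documents contain T_j; df_j = 50 out of N = 1000
print(term_relevance(r_j=8, R=10, df_j=50, N=1000))  # ~4.5
print(math.log(1000 / 50))                           # idf_j ≈ 3.0 for comparison
```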
Summary
Inverse document frequency idf_j: w_{ij} = tf_{ij} × idf_j (TF×IDF)
Term discrimination value dv_j: w_{ij} = tf_{ij} × dv_j
Probabilistic term weighting tr_j: w_{ij} = tf_{ij} × tr_j