
Modern Information Retrieval

Chapter 7: Text Operations

Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Document Preprocessing
 Lexical analysis of the text
 Elimination of stopwords
 Stemming
 Selection of index terms
 Construction of term categorization structures
Lexical Analysis of the Text
 Lexical issues to resolve
 space (the basic word separator)
 digits
 hyphens
 punctuation marks
 the case of the letters
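
As a rough illustration, a minimal Python tokenizer that treats all of the above as separators (the regular expression and the lowercasing policy are assumptions of this sketch, not the chapter's exact rules):

import re

def tokenize(text):
    # Lowercase to neutralize letter case, then split on anything
    # that is not a letter (spaces, digits, hyphens, punctuation).
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

print(tokenize("State-of-the-art IR, anno 2024!"))
# -> ['state', 'of', 'the', 'art', 'ir', 'anno']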
Elimination of Stopwords
 A list of stopwords
 words that are too frequent among the documents
 articles, prepositions, conjunctions, etc.

 Can reduce the size of the indexing structure considerably

 Problem
 Search for “to be or not to be”?
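
A minimal sketch of stopword elimination (the stopword set here is a tiny invented sample, not a standard list); note how it wipes out the query above:

STOPWORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# -> []  (every word of the query is a stopword)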
Stemming
 Example
 connect, connected, connecting, connection, connections
 effectiveness --> effective --> effect
 picnicking --> picnic
 king -/-> k (stemming should not apply here)

 Removing strategies
 affix removal: intuitive, simple
 table lookup
 successor variety
 n-gram
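
For a quick look at affix removal in practice, NLTK's Porter stemmer can be used (assuming NLTK is installed; its exact outputs may differ slightly from the idealized examples above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["connect", "connected", "connecting", "connection",
          "connections", "effectiveness", "king"]:
    # Porter's algorithm strips common English suffixes in stages.
    print(w, "->", stemmer.stem(w))
# The connect-family all reduce to "connect", "effectiveness"
# reduces to "effect", and "king" is correctly left alone.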
Index Terms Selection
 Motivation
 A sentence is usually composed of nouns, pronouns,
articles, verbs, adjectives, adverbs, and connectives.
 Most of the semantics is carried by the nouns.

 Identification of noun groups
 A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (a sketch follows below)
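
A minimal sketch of the noun-group definition above (the part-of-speech tags are supplied by hand here; in practice they would come from a tagger, and the threshold value is an assumption):

def noun_groups(tagged_tokens, threshold=2):
    # tagged_tokens: list of (word, tag) pairs; group nouns whose
    # positions in the text differ by at most `threshold` words.
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        if last_pos is not None and pos - last_pos > threshold:
            groups.append(current)
            current = []
        current.append(word)
        last_pos = pos
    if current:
        groups.append(current)
    return groups

sentence = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
            ("department", "NOUN"), ("offers", "VERB"), ("a", "DET"),
            ("course", "NOUN")]
print(noun_groups(sentence))
# -> [['computer', 'science', 'department'], ['course']]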
Thesauri
 Roget's Thesaurus (Peter Roget; the example below is from a 1988 edition)
 Example
cowardly adj.
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven,
dastardly, faint-hearted, gutless, lily-livered,
pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

 A controlled vocabulary for indexing and searching
The Purpose of a Thesaurus
 To provide a standard vocabulary for indexing
and searching
 To assist users with locating terms for proper
query formulation
 To provide classified hierarchies that allow the
broadening and narrowing of the current query
request
Thesaurus Term Relationships
 BT: broader
 NT: narrower
 RT: non-hierarchical, but related
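
A sketch of how BT/NT/RT links might be stored and used to broaden a query (the entries and the dictionary layout are invented for illustration):

# Hypothetical thesaurus: term -> relation -> related terms.
THESAURUS = {
    "dog": {"BT": ["canine"], "NT": ["poodle", "terrier"], "RT": ["pet"]},
}

def broaden(term):
    # Broaden a query term by following its BT (broader term) links.
    return THESAURUS.get(term, {}).get("BT", [])

print(broaden("dog"))  # -> ['canine']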
Term Selection
Automatic Text Processing, G. Salton, Chapter 9, Addison-Wesley, 1989.
Automatic Indexing
 Indexing:
 assign identifiers (index terms) to text documents.
 Identifiers:
 single-term vs. term phrase
 controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
 objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names,
dates of publications, …
Two Issues
 Issue 1: indexing exhaustivity
 exhaustive: assign a large number of terms
 nonexhaustive
 Issue 2: term specificity
 broad terms (generic)
cannot distinguish relevant from nonrelevant documents
 narrow terms (specific)
retrieve relatively fewer documents, but most of them are
relevant
Parameters of Retrieval Effectiveness
 Recall

 R = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in the collection}}

 Precision

 P = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}
 Goal
high recall and high precision
 The retrieved part of the collection partitions the items as follows:

                 Relevant items   Nonrelevant items
 Retrieved             a                 b
 Not retrieved         d                 c

 Recall = \frac{a}{a + d}        Precision = \frac{a}{a + b}
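
A small sketch computing both measures from the four counts in the table above (the counts are invented):

def recall_precision(a, b, c, d):
    # a: relevant retrieved, b: nonrelevant retrieved,
    # c: nonrelevant not retrieved, d: relevant not retrieved
    return a / (a + d), a / (a + b)

print(recall_precision(a=8, b=2, c=85, d=5))
# -> (0.6153846153846154, 0.8)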
A Joint Measure
 F-score

 F = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}

 β is a parameter that encodes the relative importance of recall and precision.
 β =1: equal weight
 β <1: precision is more important
 β >1: recall is more important
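
A one-function sketch of the F measure (the example precision/recall values are invented):

def f_score(p, r, beta=1.0):
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(f_score(0.8, 0.2))            # beta=1, equal weight: 0.32
print(f_score(0.8, 0.2, beta=2.0))  # recall weighted up: ~0.235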
Choices of Recall and Precision
 Both recall and precision vary from 0 to 1.
 Particular choices of indexing and search policies
have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision
and 0.8 recall.
 In many circumstances, recall and precision values between 0.5 and 0.6 are more satisfactory for the average user.
Term-Frequency Consideration
 Function words
 for example, "and", "or", "of", "but", …

 the frequencies of these words are high in all texts

 Content words
 words that actually relate to document content

 varying frequencies in the different texts of a collection

 indicate term importance for content


A Frequency-Based Indexing Method
 Eliminate common function words from the document
texts by consulting a special dictionary, or stop list,
containing a list of high frequency function words.
 Compute the term frequency tfij for all remaining terms Tj
in each document Di, specifying the number of
occurrences of Tj in Di.
 Choose a threshold frequency T, and assign to each
document Di all terms Tj for which tfij > T.
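
A sketch of the three steps (the stop list and the threshold value are illustrative assumptions):

from collections import Counter

STOP_LIST = {"a", "and", "but", "of", "or", "the"}

def select_index_terms(doc_tokens, threshold=1):
    # 1. Eliminate function words via the stop list.
    # 2. Compute tf_ij for the remaining terms.
    # 3. Keep the terms whose frequency exceeds the threshold T.
    tf = Counter(t for t in doc_tokens if t not in STOP_LIST)
    return {term for term, freq in tf.items() if freq > threshold}

print(select_index_terms(["the", "cat", "and", "the", "cat", "sat"]))
# -> {'cat'}  ('sat' occurs once, which does not exceed T=1)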
Inverse Document Frequency
 Inverse Document Frequency (IDF) for term Tj

 idf_j = \log \frac{N}{df_j}

 where df_j (the document frequency of term Tj) is the number of documents in which Tj occurs.
 Terms that fulfil both the recall and the precision goals occur frequently in individual documents but rarely in the remainder of the collection.
TFxIDF
 Weight w_{ij} of a term Tj in a document Di:

 w_{ij} = tf_{ij} \times \log \frac{N}{df_j}
 Eliminating common function words
 Computing the value of wij for each term Tj in each
document Di
 Assigning to the documents of a collection all terms with
sufficiently high (tf x idf) factors
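
A compact tf×idf sketch over a toy three-document collection (the documents are invented):

import math
from collections import Counter

docs = [["cat", "sat"], ["cat", "cat", "dog"], ["dog", "ran"]]
N = len(docs)
# df_j: number of documents containing each term
df = Counter(term for d in docs for term in set(d))

def tfidf(doc):
    tf = Counter(doc)
    # w_ij = tf_ij * log(N / df_j)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[1]))
# {'cat': 0.81..., 'dog': 0.40...}  i.e. 2*log(3/2) and 1*log(3/2)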
Term-discrimination Value
 Useful index terms
 Distinguish the documents of a collection from
each other
 Document Space
 Two documents with very similar term sets correspond to points that appear close together in the document space configuration
 When a high-frequency term without
discrimination is assigned, it will increase the
document space density
A Virtual Document Space

 [Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator]
Good Term Assignment
 When a term is assigned to the documents of a
collection, the few objects to which the term is
assigned will be distinguished from the rest of
the collection.

 This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.
Poor Term Assignment
 A high-frequency term that does not discriminate between the objects of a collection is assigned. Its assignment will render the documents more similar to each other.

 This is reflected in an increase in document space density.
Term Discrimination Value
 Definition
 dv_j = Q - Q_j
 where Q and Q_j are the space densities before and after the assignment of term Tj:

 Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} sim(D_i, D_k)

 dv_j > 0: Tj is a good term; dv_j < 0: Tj is a poor term.
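
A sketch of the density Q using cosine similarity (the vector representation and the similarity function are assumptions of this sketch; dv_j then follows by comparing densities with and without term j assigned):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors):
    # Q = 1/(N(N-1)) * sum over all ordered pairs i != k of sim(Di, Dk)
    n = len(vectors)
    return sum(cosine(vectors[i], vectors[k])
               for i in range(n) for k in range(n) if i != k) / (n * (n - 1))

# dv_j = Q - Q_j: density before term j is assigned minus density after.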
Variations of Term-Discrimination Value
with Document Frequency

 Document frequency (increasing from low toward N):
   low frequency:     dv_j = 0
   medium frequency:  dv_j > 0
   high frequency:    dv_j < 0
TFij x dvj
 w_{ij} = tf_{ij} \times dv_j

 compared with w_{ij} = tf_{ij} \times \log \frac{N}{df_j}

 \frac{N}{df_j}: decreases steadily with increasing document frequency
 dv_j: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger
Document Centroid
 Issue: efficiency problem
N(N-1) pairwise similarities
 Document centroid C = (c_1, c_2, ..., c_t)

 c_j = \sum_{i=1}^{N} w_{ij}

 where w_{ij} is the weight of term j in document i.

 Space density computed against the centroid:

 Q = \frac{1}{N} \sum_{i=1}^{N} sim(C, D_i)
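
A sketch of the centroid shortcut (reusing the cosine helper from the previous sketch); it needs only N similarity computations instead of N(N-1):

def centroid_density(vectors):
    n, t = len(vectors), len(vectors[0])
    # c_j = sum over all documents of w_ij
    centroid = [sum(v[j] for v in vectors) for j in range(t)]
    # Q = (1/N) * sum of sim(C, Di)
    return sum(cosine(centroid, v) for v in vectors) / n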
Probabilistic Term Weighting
 Goal
Explicit distinctions between occurrences of
terms in relevant and nonrelevant documents of
a collection
 Definition
Given a user query q, and the ideal answer set of the
relevant documents
 From decision theory, the best ranking algorithm for a document D:

 g(D) = \log \frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)} + \log \frac{\Pr(rel)}{\Pr(nonrel)}
Probabilistic Term Weighting
 Pr(rel), Pr(nonrel):
document’s a priori probabilities of relevance and
nonrelevance

 Pr(D|rel), Pr(D|nonrel):
occurrence probabilities of document D in the
relevant and nonrelevant document sets
Assumptions
 Terms occur independently in documents:

 \Pr(D \mid rel) = \prod_{i=1}^{t} \Pr(x_i \mid rel)

 \Pr(D \mid nonrel) = \prod_{i=1}^{t} \Pr(x_i \mid nonrel)
Derivation Process

 g(D) = \log \frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)} + \log \frac{\Pr(rel)}{\Pr(nonrel)}

      = \log \frac{\prod_{i=1}^{t} \Pr(x_i \mid rel)}{\prod_{i=1}^{t} \Pr(x_i \mid nonrel)} + \text{constants}

      = \sum_{i=1}^{t} \log \frac{\Pr(x_i \mid rel)}{\Pr(x_i \mid nonrel)} + \text{constants}
For a specific document D
 Given a document D = (d_1, d_2, ..., d_t),

 g(D) = \sum_{i=1}^{t} \log \frac{\Pr(x_i = d_i \mid rel)}{\Pr(x_i = d_i \mid nonrel)} + \text{constants}

 Assume d_i is either 0 (absent) or 1 (present), and write
 \Pr(x_i = 1 \mid rel) = p_i, \quad \Pr(x_i = 0 \mid rel) = 1 - p_i
 \Pr(x_i = 1 \mid nonrel) = q_i, \quad \Pr(x_i = 0 \mid nonrel) = 1 - q_i

 so that
 \Pr(x_i = d_i \mid rel) = p_i^{d_i} (1 - p_i)^{1 - d_i}
 \Pr(x_i = d_i \mid nonrel) = q_i^{d_i} (1 - q_i)^{1 - d_i}

 Substituting,

 g(D) = \sum_{i=1}^{t} \log \frac{p_i^{d_i} (1 - p_i)^{1 - d_i}}{q_i^{d_i} (1 - q_i)^{1 - d_i}} + \text{constants}

      = \sum_{i=1}^{t} \log \left[ \left( \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \right)^{d_i} \frac{1 - p_i}{1 - q_i} \right] + \text{constants}
Term Relevance Weight
 g(D) = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{constants}

 The term relevance weight:

 tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
Issue
 How to compute p_j and q_j?

 p_j = r_j / R
 q_j = (df_j - r_j) / (N - R)

 r_j: the number of relevant documents that contain term Tj
 R: the total number of relevant documents
 N: the total number of documents
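
A sketch of tr_j under these estimates (the counts are invented, and the 0.5 smoothing added to avoid zero divisions is an assumption of this sketch, not part of the slide's formula):

import math

def term_relevance(r_j, R, df_j, N, eps=0.5):
    # p_j = r_j / R and q_j = (df_j - r_j) / (N - R), lightly smoothed
    p = (r_j + eps) / (R + 2 * eps)
    q = (df_j - r_j + eps) / (N - R + 2 * eps)
    return math.log(p * (1 - q) / (q * (1 - p)))

print(term_relevance(r_j=8, R=10, df_j=50, N=1000))  # ~4.3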
Estimation of Term-Relevance

 The occurrence probability of a term in the nonrelevant documents, q_j, is approximated by the occurrence probability of the term in the entire document collection:
 q_j = df_j / N

 The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, using a constant value p_j = 0.5 for all j.
Comparison
 tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)} = \log \frac{0.5 \times (1 - \frac{df_j}{N})}{\frac{df_j}{N} \times 0.5} = \log \frac{N - df_j}{df_j}

 When N is sufficiently large, N - df_j ≈ N, so

 tr_j = \log \frac{N - df_j}{df_j} \approx \log \frac{N}{df_j} = idf_j
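
A quick numeric check of the approximation (the collection sizes are invented):

import math

N, df_j = 1_000_000, 100
print(math.log((N - df_j) / df_j))  # tr_j  ~ 9.21024
print(math.log(N / df_j))           # idf_j ~ 9.21034, nearly identical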
Estimation of Term-Relevance
 Estimate the number of relevant documents r_j in the collection that contain term Tj as a function of the known document frequency df_j of the term Tj.
 p_j = r_j / R
 q_j = (df_j - r_j) / (N - R)
 R: an estimate of the total number of relevant documents in the collection.
Summary
 Inverse document frequency, idfj
 tfij *idfj (TFxIDF)
 Term discrimination value, dvj
 tfij *dvj
 Probabilistic term weighting trj
 tfij *trj

 Global properties of terms in a document collection
