Text Statistics

Frequencies of Words

• Some words occur much more frequently than others in documents describing one topic

• The importance of a word in a document depends on its frequency in the document

• The distribution of word frequencies is very skewed

• The two most frequent words in English (“the”, “of”) account for about 10% of all word occurrences

• About one-half of all words occur only once (see the counting sketch below)
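A minimal sketch, using only Python's standard library, of counting word frequencies to see this skew; the inline passage is a small stand-in for a real collection, so the exact percentages are only illustrative:

```python
from collections import Counter

# A tiny inline passage stands in for a real document collection; with real
# text the same few function words dominate even more strongly.
text = """the cat sat on the mat and the dog lay by the door
the rain fell on the roof of the house and the wind blew"""

counts = Counter(text.split())
total = sum(counts.values())

# The few most frequent words account for a large share of all occurrences.
for word, n in counts.most_common(5):
    print(f"{word!r}: {n} occurrences ({n / total:.1%} of all tokens)")

# Many words occur exactly once (hapax legomena).
hapax = sum(1 for n in counts.values() if n == 1)
print(f"{hapax} of {len(counts)} distinct words occur exactly once")
```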
Zipf’s Law/Distribution
• The frequency of the r-th most common word is inversely proportional to r

– The rank of a word times its probability of occurrence is approximately a constant:

  r × P(r) = c

  Equivalently, in raw counts, r × f(r) = c × N, where f(r) is the frequency of the rank-r word and N is the total number of word occurrences

– For English, c ≈ 0.1 (a numeric check is sketched below)
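A small sketch of checking the relation r × P(r) ≈ c. The counts here are synthetic, generated to follow f(r) = C/r, so the product settles near C/N rather than the English value of 0.1; with counts from a real tokenized corpus the same loop applies unchanged:

```python
# Synthetic rank-frequency data following f(r) = C / r; with counts from a
# real English corpus the product below would hover near c ≈ 0.1.
C = 100_000
freqs = [C // r for r in range(1, 1001)]  # frequency of the rank-r word
N = sum(freqs)                            # total word occurrences

for r in (1, 2, 10, 100, 1000):
    p = freqs[r - 1] / N                  # P(r), probability of the rank-r word
    print(f"rank {r:4d}: r * P(r) = {r * p:.3f}")  # roughly constant
```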
Predicting Occurrence Frequencies

• A word that occurs n times has rank r(n) = k / n, where k = A × N, A is a constant, and N is the total number of word occurrences

• But more than one word may have the same frequency, so we assume that the rank r(n) is associated with the last word in the group of words with frequency n

• The number of words with frequency n is then r(n) – r(n+1) = k/n – k/(n+1) = k / (n × (n+1)) (worked through in the sketch below)
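A worked sketch of this counting argument: dividing the count for frequency n by the total vocabulary size r(1) = k gives a proportion, 1 / (n × (n+1)), that does not depend on the corpus at all:

```python
# Proportion of the vocabulary predicted to occur exactly n times:
# (r(n) - r(n+1)) / r(1) = (k/n - k/(n+1)) / k = 1 / (n * (n + 1)).
for n in range(1, 6):
    proportion = 1 / (n * (n + 1))
    print(f"words occurring {n} time(s): {proportion:6.1%} of the vocabulary")

# n = 1 gives 50%, matching the earlier observation that about one-half of
# all words occur only once.
```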


Zipf’s Law Impact on IR

• Good News: Stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.

• Bad News: Most words are extremely rare, so gathering sufficient data about them for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult.
A Log-Log Plot

Taking logarithms of the Zipf relation P(r) = c × r⁻¹ gives a straight line:

  log P(r) = log(c × r⁻¹) = log c – log r

• Zipf's law is quite accurate except at very high and very low ranks (plotted in the sketch below)
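A sketch of producing such a plot, assuming matplotlib is available; the stand-in counts again follow f(r) = C/r, so they hug the Zipf reference line, whereas real counts would peel away at the extremes:

```python
import matplotlib.pyplot as plt

# Stand-in rank-frequency data; replace with sorted counts from a real corpus.
C = 100_000
freqs = [C // r for r in range(1, 5001)]
N = sum(freqs)
ranks = range(1, len(freqs) + 1)

# Both axes are logarithmic, so P(r) = c / r appears as a straight line.
plt.loglog(ranks, [f / N for f in freqs], label="observed P(r)")
plt.loglog(ranks, [0.1 / r for r in ranks], "--", label="Zipf: P(r) = 0.1 / r")
plt.xlabel("rank r (log scale)")
plt.ylabel("P(r) (log scale)")
plt.legend()
plt.show()
```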
Vocabulary Growth

• As the size of the corpus grows, new words occur

• The number of new words occurring in a given amount of new text decreases as the corpus grows: at the start of the analysis almost every word is unseen, but soon most of the words encountered are repetitions of words seen before

– Sources of new words: invented words (e.g., drug names, start-up company names), spelling errors, product numbers, people’s names, email addresses, …

• Heaps’ Law: empirically, the vocabulary size grows with the corpus size as v = k × n^β

– v is the vocabulary size

– n is the corpus size (in word occurrences)

– The values of k and β vary from collection to collection; often 10 ≤ k ≤ 100 and β ≈ 0.5 (the growth bookkeeping is sketched below)
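A minimal sketch of the bookkeeping behind Heaps' law: record the vocabulary size v after each token as the corpus size n grows. On a real collection the (n, v) pairs would approximately fit v = k × n^β; the ten-word stand-in corpus here only shows the mechanics:

```python
def vocabulary_growth(tokens):
    """Return (corpus size, vocabulary size) after each token."""
    seen, growth = set(), []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)                # the vocabulary grows only on new words
        growth.append((n, len(seen)))
    return growth

tokens = "to be or not to be that is the question".split()  # stand-in corpus
for n, v in vocabulary_growth(tokens):
    print(f"after {n:2d} words: vocabulary size {v}")
```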


Prediction Using Heaps’ Law

• The number of new words increases very rapidly while the corpus is small, and continues to increase indefinitely, but at a slower rate as the corpus grows larger (illustrated below)
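A sketch of such a prediction, with k = 40 and β = 0.5 picked from the typical ranges above purely for illustration; they are not fitted values for any particular collection:

```python
k, beta = 40, 0.5  # assumed illustrative parameters, not fitted values

for n in (10**6, 10**8, 10**10):
    v = k * n ** beta  # Heaps' law: v = k * n^beta
    print(f"corpus of {n:.0e} words -> predicted vocabulary ~ {v:,.0f}")
```

The predicted vocabulary keeps growing without bound, but only as the square root of the corpus size under these assumptions.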
Heaps' law suggests that:

(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and

(ii) the size of the dictionary is quite large for large collections.

These two hypotheses have been empirically shown to be true of large text collections, so dictionary compression is important for an effective information retrieval system.


Thank You
