Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Statistics
Frequencies of Words
● The two most frequent words in English (“the”, “of”) account for about
10% of all word occurrences
● About one-half of all words only occur once
Zipf’s Law/Distribution
• The frequency of the r-th most common
word is inversely proportional to r
r x P(r) = c
• A word that occurs n times has the rank r = k / n, where k = A x N, and N is the total number of
word occurrences
• But, more than one word may have the same frequency – we assume that the rank r(n) is
associated with the last of the group of words with the same frequency
• Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces
inverted index storage costs.
• Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for
correlation analysis for query expansion) is difficult since they are extremely rare.
A Log-Log Plot
• The number of new words occurring in a given amount of new text decreases as the size of the
corpus increases because if we start analyzing it and notice a word that you haven’t seen before but
pretty soon you will see a repetition of words.
– Sources of new words: invented words (e.g., drug names, start-up company names), spelling errors,
• Heaps’ Law: empirically, the size of the corpus and the size of the vocabulary is v = k x n^β
– Corpus size n
(i) the dictionary size continues to increase with more documents in the collection, rather than
a maximum vocabulary size being reached, and
(ii) the size of the dictionary is quite large for large collections. These two hypotheses have
been empirically shown to be true of large text collections.