Benvenuto in Scribd!

Zipf's Law and Heaps Law

Caricato da

Il 0% ha trovato utile questo documento (0 voti)

103 visualizzazioni10 pagine

The document discusses word frequencies in documents and Zipf's law. Some key points are: 1) A few words occur very frequently while most words occur rarely, following a skewed distribution. The most common words account for about 10% of occurrences. 2) Zipf's law states that the frequency of the rth most common word is inversely proportional to its rank r. 3) As more text is added to a corpus, the number of new unique words decreases following Heaps' law, but the total vocabulary size continues to increase with the corpus size.

Descrizione originale:

Text Statistics - Zipf's law and Heaps Law

Titolo originale

Zipf's Law and Heaps Law.ppt (2)

Copyright

Formati disponibili

PDF, TXT o leggi online da Scribd

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Segnala questo documento

Copyright:

Formati disponibili

Scarica in formato PDF, TXT o leggi online su Scribd

Segnala contenuti inappropriati

Il 0% ha trovato utile questo documento (0 voti)

103 visualizzazioni10 pagine

Zipf's Law and Heaps Law

Caricato da

advita sharma

Copyright:

Formati disponibili

Scarica in formato PDF, TXT o leggi online su Scribd

Segnala contenuti inappropriati

Salta alla pagina

Sei sulla pagina 1di 10

Cerca all'interno del documento

Text

Statistics
Frequencies of Words

• Some words occur much more frequently than others in documents

describing one topic

● The importance of a word in a document depends on its frequency in the

document

• The distribution of word frequencies is very skewed

● The two most frequent words in English (“the”, “of”) account for about
10% of all word occurrences
● About one-half of all words only occur once
Zipf’s Law/Distribution
• The frequency of the r-th most common
word is inversely proportional to r

– The rank of a word times its probability of

occurrence is approximately a constant

r x P(r) = c

r* ff(r)= c * N where N = no.of words

– For English, c ≈ 0.1
Predicting Occurrence Frequencies

• A word that occurs n times has the rank r = k / n, where k = A x N, and N is the total number of
word occurrences

• But, more than one word may have the same frequency – we assume that the rank r(n) is
associated with the last of the group of words with the same frequency

• The number of words with the same frequency n is r(n) – r(n)+1

Zipf’s Law Impact on IR

• Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces
inverted index storage costs.

• Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for
correlation analysis for query expansion) is difficult since they are extremely rare.
A Log-Log Plot

log Pr = log (c x r-1) = log c – log r

• Zipf is quite accurate except for very

high and low rank.
Vocabulary Growth
• The size of corpus grows, new words occur

• The number of new words occurring in a given amount of new text decreases as the size of the
corpus increases because if we start analyzing it and notice a word that you haven’t seen before but
pretty soon you will see a repetition of words.

– Sources of new words: invented words (e.g., drug names, start-up company names), spelling errors,

product numbers, people’s names, email addresses, …

• Heaps’ Law: empirically, the size of the corpus and the size of the vocabulary is v = k x n^β

– The vocabulary size v

– Corpus size n

– Values of k and β vary from each collection, Often, 10 ≤ k ≤ 100, β ≈ 0.5

Prediction Using Heaps’
Law
• The number of new words will increase very rapidly
when the corpus is small and would continue to
increase indefinitely, but at a slower rate for larger
corpus
Heaps' law suggests that

(i) the dictionary size continues to increase with more documents in the collection, rather than
a maximum vocabulary size being reached, and

(ii) the size of the dictionary is quite large for large collections. These two hypotheses have
been empirically shown to be true of large text collections.

So dictionary compression is important for an effective information retrieval system.

Thank You

Potrebbero piacerti anche

Finnish English Frequency Dictionary: Finnish, #1
Da Everand
Finnish English Frequency Dictionary: Finnish, #1
J.L. Laide
Valutazione: 3.5 su 5 stelle
3.5/5 (3)
Romanian Frequency Dictionary For Learners - Practial Vocabulary - Top 10000 Romanian Words
Da Everand
Romanian Frequency Dictionary For Learners - Practial Vocabulary - Top 10000 Romanian Words
MostUsedWords
Nessuna valutazione finora
Chapter Two Text Operation
Documento44 pagine
Chapter Two Text Operation
Aaron Melendez
Nessuna valutazione finora
2 Text Operation
Documento42 pagine
2 Text Operation
Tensu Aweke
Nessuna valutazione finora
Text Operations 2021
Documento45 pagine
Text Operations 2021
Abdo Ababor
Nessuna valutazione finora
Lexicon & Text Normalization
Documento39 pagine
Lexicon & Text Normalization
asma
Nessuna valutazione finora
2 TextOperations
Documento54 pagine
2 TextOperations
Mulugeta Hailu
Nessuna valutazione finora
2 - Text Operation
Documento45 pagine
2 - Text Operation
Kirubel Wakjira
Nessuna valutazione finora
Kornai - How Many Words Are There
Documento26 pagine
Kornai - How Many Words Are There
Algoritzmi
Nessuna valutazione finora
Chapter Two Text/Document Operations and Automatic Indexing Statistical Properties of Text
Documento13 pagine
Chapter Two Text/Document Operations and Automatic Indexing Statistical Properties of Text
Tolosa Tafese
Nessuna valutazione finora
N-Grams - Text Representation
Documento23 pagine
N-Grams - Text Representation
scribd_comverse
Nessuna valutazione finora
2020b Frequency & Word Lists
Documento36 pagine
2020b Frequency & Word Lists
Roy Altur Robinson
Nessuna valutazione finora
Chap. 2 Text Operation
Documento44 pagine
Chap. 2 Text Operation
adamu
Nessuna valutazione finora
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
Documento66 pagine
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
JoeJune
100% (1)
Nilsen J 2014 Thai Frequency Dictionary 2 Edition PDF
Documento221 pagine
Nilsen J 2014 Thai Frequency Dictionary 2 Edition PDF
Ieda Deutsch
Nessuna valutazione finora
Chapter Two: Text Operations
Documento41 pagine
Chapter Two: Text Operations
endris yimer
Nessuna valutazione finora
X Speech Corpora
Documento5 pagine
X Speech Corpora
Robert Robinson
Nessuna valutazione finora
2 - Text Operation
Documento43 pagine
2 - Text Operation
Hailemariam Setegn
Nessuna valutazione finora
Nowak Quantify Evol Dynamics
Documento29 pagine
Nowak Quantify Evol Dynamics
Jeff Neary
Nessuna valutazione finora
Lecture 3
Documento70 pagine
Lecture 3
Abdullah Khan Qadri
Nessuna valutazione finora
Zipfs Law For Word Frequencies Word Forms Versus
Documento21 pagine
Zipfs Law For Word Frequencies Word Forms Versus
кирилл кичигин
Nessuna valutazione finora
CORPUS TYPES and CRITERIA
Documento14 pagine
CORPUS TYPES and CRITERIA
Elias Zermane
100% (1)
Frequency Analysis of English Vocabulary and Grammar: Stig Johansson and Knut Hofland
Documento1 pagina
Frequency Analysis of English Vocabulary and Grammar: Stig Johansson and Knut Hofland
Michał Tyczyński
Nessuna valutazione finora
2 Text-Operation
Documento60 pagine
2 Text-Operation
Yididiya Yemiru
Nessuna valutazione finora
Natural Language Processing PDF
Documento47 pagine
Natural Language Processing PDF
Saidur Rahman
100% (1)
SLoSP 2007 1
Documento42 pagine
SLoSP 2007 1
Vicky Dethe
Nessuna valutazione finora
Indian Institute of Technology Kanpur
Documento42 pagine
Indian Institute of Technology Kanpur
Avi Alok
Nessuna valutazione finora
Zipf's Law in Toki Pona
Documento4 pagine
Zipf's Law in Toki Pona
Kwiila Kao
Nessuna valutazione finora
Lecture Notes Linguistics
Documento159 pagine
Lecture Notes Linguistics
jasorient
Nessuna valutazione finora
Speech and Language Processing
Documento26 pagine
Speech and Language Processing
seogmin chun
Nessuna valutazione finora
Fuentes para Estudios Tipológicos
Documento6 pagine
Fuentes para Estudios Tipológicos
arkepaster
Nessuna valutazione finora
Using Questionnaires To Investigate Non-Standard Dialects Patrick Honeybone
Documento6 pagine
Using Questionnaires To Investigate Non-Standard Dialects Patrick Honeybone
Alexandra
Nessuna valutazione finora
1904 00812 PDF
Documento42 pagine
1904 00812 PDF
кирилл кичигин
Nessuna valutazione finora
Lecture-8. Only For This Batch
Documento46 pagine
Lecture-8. Only For This Batch
musa
Nessuna valutazione finora
Regular Expression - Sentence Segment
Documento46 pagine
Regular Expression - Sentence Segment
naruto sasuke
Nessuna valutazione finora
Data Compression and Coding: Stefano Marmi (Scuola Normale Superiore, Pisa)
Documento29 pagine
Data Compression and Coding: Stefano Marmi (Scuola Normale Superiore, Pisa)
Bernardo Tarini
Nessuna valutazione finora
Natural Language Processing
Documento47 pagine
Natural Language Processing
Saidur Rahman
Nessuna valutazione finora
SInclair Corpus Concordance Collocation
Documento20 pagine
SInclair Corpus Concordance Collocation
humerajehangir
71% (7)
LEXICON
Documento29 pagine
LEXICON
Advance Certificate in English Language Teaching
Nessuna valutazione finora
Lecture 1 Text Preprocessing PDF
Documento29 pagine
Lecture 1 Text Preprocessing PDF
Ammad Asif
Nessuna valutazione finora
One Sense Per Collocation
Documento6 pagine
One Sense Per Collocation
tamthientai
Nessuna valutazione finora
Introduction To The Study of Language 2: Psycholinguistics
Documento70 pagine
Introduction To The Study of Language 2: Psycholinguistics
Ania
Nessuna valutazione finora
HG3051 Lec06 DIY
Documento59 pagine
HG3051 Lec06 DIY
Rania Abd El Fattah Abd El Hameed
Nessuna valutazione finora
Module5-Representing and Mining Text
Documento24 pagine
Module5-Representing and Mining Text
Green Mongor
Nessuna valutazione finora
5.2 Natural Language Processing
Documento43 pagine
5.2 Natural Language Processing
punit mishra
Nessuna valutazione finora
AI6122 Topic 1.2 - WordLevel
Documento63 pagine
AI6122 Topic 1.2 - WordLevel
Yujia Tian
Nessuna valutazione finora
Source Language (SL) Target Language (TL) : Translation Is The Communication of Meaning From One
Documento22 pagine
Source Language (SL) Target Language (TL) : Translation Is The Communication of Meaning From One
Oleg Voroshchuk
Nessuna valutazione finora
Lecture 2a
Documento31 pagine
Lecture 2a
Fatima syeda
Nessuna valutazione finora
BMED263 Assessment 9
Documento6 pagine
BMED263 Assessment 9
Yohansel Jiménez
Nessuna valutazione finora
Random Generation of Words of Context-Free Languages According To The Frequencies of Letters
Documento13 pagine
Random Generation of Words of Context-Free Languages According To The Frequencies of Letters
blackmailer hack
Nessuna valutazione finora
Languages' Rhythm and Language Acquisition: Franck Ramus
Documento31 pagine
Languages' Rhythm and Language Acquisition: Franck Ramus
nlyn1015
Nessuna valutazione finora
Evaluating Language Models
Documento21 pagine
Evaluating Language Models
Mahesh Yadav
Nessuna valutazione finora
02 Textprocessingboth
Documento46 pagine
02 Textprocessingboth
21051918
Nessuna valutazione finora
Q1 WK1.4 Subject-Verb Agreement
Documento9 pagine
Q1 WK1.4 Subject-Verb Agreement
Leo Ang
Nessuna valutazione finora
Syntax
Documento19 pagine
Syntax
Mahabat
Nessuna valutazione finora
Processing Text: 4.1 From Words To Terms
Documento52 pagine
Processing Text: 4.1 From Words To Terms
Thangam Natarajan
Nessuna valutazione finora
IR Chap7
Documento30 pagine
IR Chap7
biniam teshome
Nessuna valutazione finora
2009 STG CorpLing LangLingCompass
Documento17 pagine
2009 STG CorpLing LangLingCompass
Marta Milic
Nessuna valutazione finora
Data Mining:: Concepts and Techniques
Documento37 pagine
Data Mining:: Concepts and Techniques
suriyachack
Nessuna valutazione finora
Russian English Frequency Dictionary - 2.500 Most Used Words: Russian, #1
Da Everand
Russian English Frequency Dictionary - 2.500 Most Used Words: Russian, #1
J.L. Laide
Valutazione: 1 su 5 stelle
1/5 (2)
Informatics in Logistics Management
Documento26 pagine
Informatics in Logistics Management
Nazareth
0% (1)
Ooad Question Bank
Documento5 pagine
Ooad Question Bank
khusboo_bhatt
Nessuna valutazione finora
Sensible Heat
Documento31 pagine
Sensible Heat
12001301
100% (2)
Tracer-AN Series: MPPT Solar Charge Controller
Documento4 pagine
Tracer-AN Series: MPPT Solar Charge Controller
Nkosilozwelo Sibanda
Nessuna valutazione finora
I.Objectives: Grades 1 To 12 Daily Lesson Log School Grade Level Teacher Learning Area Teaching Dates and Time Quarter
Documento4 pagine
I.Objectives: Grades 1 To 12 Daily Lesson Log School Grade Level Teacher Learning Area Teaching Dates and Time Quarter
MarryShailaine Clet
Nessuna valutazione finora
Hydraulic Backhoe Machine
Documento57 pagine
Hydraulic Backhoe Machine
Lokesh Srivastava
Nessuna valutazione finora
Digital Design Course File
Documento191 pagine
Digital Design Course File
Charan Netha
Nessuna valutazione finora
Real Time Blood Type Determination by Gel Test Method On An Embedded System
Documento4 pagine
Real Time Blood Type Determination by Gel Test Method On An Embedded System
ngocbienk56
Nessuna valutazione finora
Fuses & Circuit Breakers PDF
Documento13 pagine
Fuses & Circuit Breakers PDF
Carlos Luis Santos Somcar
Nessuna valutazione finora
SDLC
Documento2 pagine
SDLC
Tahseef Reza
Nessuna valutazione finora
Example Quality Plan
Documento11 pagine
Example Quality Plan
zafeer
Nessuna valutazione finora
DOE Cooling Catalogue 2017
Documento164 pagine
DOE Cooling Catalogue 2017
Rashaad Sheik
Nessuna valutazione finora
Decline Curve Analysis
Documento37 pagine
Decline Curve Analysis
Ashwin Vel
Nessuna valutazione finora
Technical DataSheet (POWERTECH社) -55559-9 (Motormech)
Documento2 pagine
Technical DataSheet (POWERTECH社) -55559-9 (Motormech)
Bilal
Nessuna valutazione finora
ASW Connection PDF
Documento7 pagine
ASW Connection PDF
Wawan Satiawan
Nessuna valutazione finora
Mode Reversions
Documento15 pagine
Mode Reversions
ISHAAN
Nessuna valutazione finora
XLPE-AL 33 KV Single Core Cable - Hoja de Datos - NKT Cables
Documento2 pagine
XLPE-AL 33 KV Single Core Cable - Hoja de Datos - NKT Cables
kjkljkljlkjljlk
Nessuna valutazione finora
Operation & Service Manual: Murzan Inc
Documento38 pagine
Operation & Service Manual: Murzan Inc
gokul
Nessuna valutazione finora
Man Ssa Ug en 0698
Documento43 pagine
Man Ssa Ug en 0698
Andy L
Nessuna valutazione finora
NGO-CSR Internship Report Template
Documento4 pagine
NGO-CSR Internship Report Template
Priyanka Singh
100% (1)
50rhe PD
Documento40 pagine
50rhe PD
m_moreira1974
Nessuna valutazione finora
Tears of My Enemies Funny Juice Box Enamel Pin Ba
Documento1 pagina
Tears of My Enemies Funny Juice Box Enamel Pin Ba
Boban Stojanović
Nessuna valutazione finora
Sony Video Camera Manual PDF
Documento118 pagine
Sony Video Camera Manual PDF
Gary Hoehler
100% (1)
RSC 406 (English)
Documento11 pagine
RSC 406 (English)
Tuấn Dũng
Nessuna valutazione finora
Stand-mount/Books Helf Louds Peaker System Product Summary
Documento1 pagina
Stand-mount/Books Helf Louds Peaker System Product Summary
Catalin Nacu
Nessuna valutazione finora
C D C S
Documento4 pagine
C D C S
andri
Nessuna valutazione finora
Topic 10 - Advanced Approvals in Salesforce CPQ
Documento18 pagine
Topic 10 - Advanced Approvals in Salesforce CPQ
Ramkumar Poovalai
100% (1)
Trasdata Help
Documento4.852 pagine
Trasdata Help
Paul Galwez
75% (4)
Thrust Stand
Documento43 pagine
Thrust Stand
ABHIMANYU KHADGA
Nessuna valutazione finora
Ijmemr V3i1 009
Documento5 pagine
Ijmemr V3i1 009
Sandesh San
Nessuna valutazione finora