The Original
Google Paper
Google is a common spelling of googol, or
10^100, which fits well with the authors' goal of
building very large-scale search engines.
Outline
Design goals
System features
System anatomy
Results and performance
Paper analysis
Design Goals
1. Scale with the rapid growth of the web
[Chart: web pages indexed and queries per day, 1994 to 2000]
1994: 110,000 web pages indexed; 1,500 queries/day
1997: 100,000,000 web pages indexed; 20,000,000 queries/day
2000: 1,000,000,000 web pages indexed; 100,000,000 queries/day
2. Improved Search Quality
System Features
1. Uses the link structure of the Web
to calculate a quality ranking for
each page, called PageRank.
A probability distribution used to
represent the likelihood that a person
randomly clicking on links will arrive
at any particular page.
It considers the importance of each
page that casts a vote, as votes from
some pages are considered to have
greater value, thus giving the linked
page greater value.
$$PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}$$
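The PageRank formula can be evaluated by simple iteration. The sketch below is a minimal illustration, not the authors' implementation: it reads d as the damping factor, L(A) as the set of pages that link to A, and C(T) as the number of links going out of page T. The function name `pagerank`, the three-page `toy_web` graph, the default d = 0.85, and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that link to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of outgoing links of each page.
    out_count = {p: len(links.get(p, ())) for p in pages}

    # L(A): the pages that link to A.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        pr = {
            a: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[a])
            for a in pages
        }
    return pr


# Toy graph (an assumption for illustration): A links to B and C, B to C, C to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C receives votes from both A and B, so it ends up ranked above B, which receives only half of A's vote.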
2. Uses the anchor text of links on
webpages:
E.g. <a href="http://www.yahoo.com">Yahoo!</a>
The text of a link is associated not only with the
webpage it appears on; it also gives information
(sometimes more relevant) about the webpage it
points to.
Anchors may exist for documents which generally
cannot be indexed by text-based search engines,
such as images, programs, and databases.
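As a rough illustration of associating anchor text with the page a link points to, the sketch below collects anchor texts keyed by target URL using Python's standard `html.parser` module. The class name `AnchorCollector` and the sample HTML are assumptions made for the example; the original system's parser is not described on these slides.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects anchor text keyed by the URL each link points to."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # target URL -> list of anchor texts
        self._href = None                 # href of the <a> tag currently open
        self._text = []                   # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None


collector = AnchorCollector()
collector.feed('<p>Try <a href="http://www.yahoo.com">Yahoo!</a> today.</p>')
anchors = dict(collector.anchors)
```

Because the text is stored under the target URL, a page can be described by its incoming anchors even when its own content (an image, for instance) cannot be indexed as text.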
3. Uses location information for all hits and
thus makes extensive use of proximity in
search.
4. Keeps track of the visual presentation of text
on webpages, such as font size. Words in a
larger or bolder font are given more importance.
5. Stores the complete raw HTML of webpages in a
repository.
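To make the use of location and presentation information concrete, the sketch below records each hit with a word position and a relative font size, then measures how close two query words occur in a document. The `Hit` fields, the font-size scale, and the helper names are hypothetical; the slides do not specify how hits are encoded or how proximity is scored.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    position: int   # word offset within the document
    font_size: int  # relative font size (hypothetical scale, larger = more prominent)


def min_distance(hits_a, hits_b):
    """Smallest gap, in words, between any occurrence of word A and of word B."""
    return min(abs(a.position - b.position) for a, b in product(hits_a, hits_b))


def emphasis(hits):
    """Largest relative font size at which the word appears."""
    return max(h.font_size for h in hits)


# Hypothetical hit lists for the words "search" and "engine" in one document.
search_hits = [Hit(position=3, font_size=1), Hit(position=40, font_size=1)]
engine_hits = [Hit(position=4, font_size=3), Hit(position=90, font_size=1)]

closest = min_distance(search_hits, engine_hits)
strongest = emphasis(engine_hits)
```

A ranking function could favor documents where `closest` is small and `strongest` is large; how those signals are weighted is left open here.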
System Anatomy
2. Repository
Contains the full compressed HTML of every page.
Documents are stored one after another, each prefixed
with its docID, length, and URL.
Compressed with zlib, chosen for its speed, rather
than bzip, which offers a higher compression
ratio.
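A repository of this kind can be sketched as a sequence of self-describing records. The byte layout below (a fixed header with docID, compressed length, and URL length, followed by the URL and the zlib-compressed HTML) is an assumption made for illustration, as are the names `HEADER`, `pack_record`, and `read_records` and the example URLs; the slides only state that each document is prefixed with its docID, length, and URL.

```python
import struct
import zlib

# Hypothetical header: docID (uint32), compressed length (uint32), URL length (uint16).
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def read_records(blob):
    """Walk records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>first</html>")
    + pack_record(2, "http://b.example/", "<html>second</html>")
)
records = list(read_records(repository))
```

Storing the length in each prefix lets a reader skip from one record to the next without decompressing anything.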
4. Lexicon
Contains a list of null-separated words (about 14
million) and a hash table of pointers.
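One way to picture this layout: the words live in a single byte buffer separated by null bytes, and a hash table maps each word to the offset where it starts. The sketch below is a simplified illustration under that reading; the function names and the choice of byte offsets as "pointers" are assumptions.

```python
def build_lexicon(words):
    """Return (null-separated word buffer, hash table of word -> byte offset)."""
    buffer = bytearray()
    pointers = {}
    for word in words:
        if word not in pointers:          # store each distinct word once
            pointers[word] = len(buffer)  # "pointer" = offset into the buffer
            buffer += word.encode("utf-8") + b"\0"
    return bytes(buffer), pointers


def word_at(buffer, offset):
    """Read the null-terminated word that starts at the given offset."""
    end = buffer.index(b"\0", offset)
    return buffer[offset:end].decode("utf-8")


lexicon_bytes, pointers = build_lexicon(["web", "search", "engine", "web"])
```

Here `pointers["search"]` is 4, and `word_at(lexicon_bytes, 4)` reads the word back from the buffer.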
6. Forward Index
Stored in a number of barrels.
If a document contains words that fall into a
particular barrel, its docID is recorded in that
barrel, followed by a list of wordIDs with their hitlists.
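The barrel layout can be sketched as follows, assuming each barrel covers a contiguous range of wordIDs. The fixed `barrel_width`, the input format, and the use of plain word positions as hit lists are all assumptions made for the example.

```python
from collections import defaultdict


def build_forward_barrels(documents, barrel_width):
    """Partition each document's words into barrels by wordID range.

    documents maps docID -> {wordID: hit list} (hypothetical input format).
    Barrel b holds wordIDs in [b * barrel_width, (b + 1) * barrel_width).
    Each barrel entry is (docID, [(wordID, hit list), ...]).
    """
    barrels = defaultdict(list)
    for doc_id, word_hits in documents.items():
        per_barrel = defaultdict(list)
        for word_id, hits in sorted(word_hits.items()):
            per_barrel[word_id // barrel_width].append((word_id, hits))
        # The docID is recorded only in barrels where the document has words.
        for barrel_id, entries in per_barrel.items():
            barrels[barrel_id].append((doc_id, entries))
    return dict(barrels)


# Document 7 contains wordIDs 3 and 105; document 9 contains only wordID 105.
documents = {7: {3: [0, 12], 105: [4]}, 9: {105: [1]}}
forward = build_forward_barrels(documents, barrel_width=100)
```

Document 9 has no words in the first range, so it appears only in barrel 1.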
4. Sorting
The sorter takes each forward barrel and sorts it by
wordID, producing one inverted barrel for title and
anchor hits and another for full-text hits.
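The inversion step can be sketched as regrouping a forward barrel's entries by wordID. This in-memory version is only an illustration of the idea; the function name and the entry format, a list of (docID, [(wordID, hit list), ...]) pairs, are assumptions, and it does not separate title and anchor hits from full-text hits.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn (docID, [(wordID, hits), ...]) entries into wordID -> [(docID, hits), ...]."""
    postings = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            postings[word_id].append((doc_id, hits))
    # Order by wordID, and by docID within each word's posting list.
    return {word_id: sorted(postings[word_id]) for word_id in sorted(postings)}


forward_barrel = [(9, [(105, [1]), (120, [5])]), (7, [(105, [4])])]
inverted = invert_barrel(forward_barrel)
```

After inversion, looking up a wordID yields every document that contains the word, which is the access pattern a query needs.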
Searching
Results and Performance
Future Works
Search times in the current version of Google are
dominated by disk I/O. Introduce query caching
and hardware, software, and algorithmic
optimizations.
Improve search efficiency and quickly scale to
~100 million web pages.
Develop Google into a large-scale research tool
and resource for searchers and researchers.
Thank You
Presented by: Nilay Khandelwal