
The Anatomy of a Large-Scale

Hypertextual Web Search Engine


Sergey Brin and Lawrence Page

The Original
Google Paper
Google is a common spelling of googol, or 10^100, which fit well with the authors' goal of building very large-scale search engines.

Outline
Design goals
System features
System anatomy
Results and performance
Paper analysis

Design Goals

Design Goals
1. Scale with the rapid growth of the web

[Chart: growth of the web and of search usage]

Year              Webpages Indexed    Queries/day
1994              110,000             1,500
1997              100,000,000         20,000,000
2000 (projected)  1,000,000,000       100,000,000

Design Goals
2. Improved Search Quality

The number of documents on the web is increasing rapidly, but users' ability to look through them lags behind.

Current search engines return lots of junk results with little relevance. (Note: we're talking about the year 1998.)

3. Academic Search Engine Research

Push more development and understanding into the academic realm.

Build systems that a reasonable number of people can actually use.

Build an architecture to support novel research activities on large-scale web data.

System
Features

System Features
1. Makes use of the link structure of
the Web to calculate a quality
ranking for each page, called the
PageRank.
A probability distribution used to
represent the likelihood that a person
randomly clicking on links will arrive
at any particular page.
Each link counts as a vote; votes cast by important pages carry more weight, giving the linked page greater value.

PageRank: Bringing Order to the Web

PR(A) = (1 − d) + d · Σ_{Ti ∈ L(A)} PR(Ti) / C(Ti)

PR(A) = PageRank of webpage A
PR(Ti) = PageRank of a webpage Ti pointing to A
C(Ti) = number of outbound links on webpage Ti
L(A) = set of webpages linking to A
d = damping factor, a value between 0 and 1; at each step the random surfer keeps clicking with probability d and stops with probability 1 − d
Note that the PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1.

PageRank: Bringing Order to the Web

Assume a universe of 4 webpages: A, B, C, and D, where B, C, and D all link to A, and B has 2 outbound links, C has 1, and D has 3:

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

Taking into consideration that a random surfer will eventually stop clicking, we assume a damping factor d, generally taken to be 0.85:

PR(A) = (1 − d) + d · (PR(B)/2 + PR(C)/1 + PR(D)/3)
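A minimal Python sketch of this iteration for the 4-page example. The slide only fixes the outbound link counts of B, C, and D, so A's outbound link below is a hypothetical addition to close the graph:

# Iterative PageRank. The slide's denominators give C(B) = 2, C(C) = 1,
# C(D) = 3; A's outbound link is invented just to close the graph.
links = {
    "A": ["B"],            # assumption: not specified on the slide
    "B": ["A", "C"],       # C(B) = 2
    "C": ["A"],            # C(C) = 1
    "D": ["A", "B", "C"],  # C(D) = 3
}

d = 0.85                               # damping factor from the slide
pr = {page: 1.0 for page in links}     # initial PageRanks

for _ in range(50):                    # iterate to (approximate) convergence
    pr = {
        page: (1 - d) + d * sum(pr[p] / len(links[p])
                                for p in links if page in links[p])
        for page in links
    }

print({page: round(rank, 3) for page, rank in pr.items()})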

System Features
2. Makes use of the anchor text of links on webpages:
E.g. <a href="http://www.yahoo.com">Yahoo!</a>
The text of a link is associated not only with the page it appears on; it also gives information (often more relevant) about the page it points to.
Anchors may exist for documents that generally cannot be indexed by text-based search engines, such as images, programs, and databases.

System Features
3. Uses location information for all hits and thus makes extensive use of proximity in search.
4. Keeps track of the visual presentation of text on webpages, such as font size. Words in a bolder or larger font are given more importance.
5. Stores the complete raw HTML of webpages in a repository.

System
Anatomy

Major Data Structures

1. BigFiles
Virtual files spanning multiple file systems, addressable by 64-bit integers.

2. Repository
Contains the full compressed HTML of all pages.
Documents are stored one after another, each prefixed with its docID, length, and URL.
Compressed with a high-speed technique (zlib) rather than a high-compression-ratio one (bzip).
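A sketch of what packing one repository entry might look like. The paper specifies only that entries are stored consecutively, prefixed with docID, length, and URL, and zlib-compressed, so the exact field layout below is an assumption:

import struct
import zlib

def pack_record(doc_id: int, url: str, html: str) -> bytes:
    """Pack one repository entry: docID, lengths, URL, then zlib-compressed
    HTML. Field widths here are illustrative, not Google's actual layout."""
    body = zlib.compress(html.encode("utf-8"))   # fast compression, as in the paper
    url_bytes = url.encode("utf-8")
    header = struct.pack("<QII", doc_id, len(url_bytes), len(body))
    return header + url_bytes + body

record = pack_record(42, "http://www.yahoo.com/", "<html>...</html>")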

Major Data Structures

3. Document Index
Keeps information about each document.
It's a fixed-width index, ordered by docID.
Stores document status, a pointer into the repository, and a checksum.
If the document has been indexed, it points into a variable-width file, docinfo, which contains the URL and title; otherwise it points into a URLlist containing only the URL.

4. Lexicon
Contains a list of null-separated words (about 14 million) and a hash table of pointers.

Major Data Structures

5. Hit Lists
A list of occurrences of a particular word in a particular document, including position, font, and capitalization information (a bit-packing sketch follows these slides).
Hit lists account for most of the space used in both the forward and the inverted indices.

6. Forward Index
Stored in a number of barrels.
If a document contains words that fall into a particular barrel, the docID is recorded in the barrel, followed by a list of wordIDs with their hit lists.

Major Data Structures

7. Inverted Index
Consists of the same barrels as the forward index, except that they have been processed by the sorter.
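The paper describes a compact two-byte encoding for plain hits: a capitalization bit, 3 bits of relative font size, and 12 bits of word position. A sketch of that packing (the exact bit order is an assumption):

def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 bits of
    relative font size, 12 bits of word position (bit order assumed)."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int) -> tuple[bool, int, int]:
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 3, 1047)    # a capitalized word at position 1047
assert unpack_plain_hit(hit) == (True, 3, 1047)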

Crawling the Web

1. Several distributed crawlers.
A URLserver serves lists of URLs to the crawlers.
Each crawler keeps ~300 connections open at once.
At peak, a system of 4 crawlers can crawl ~100 pages/second, roughly 600 KB/second of data.
Each crawler maintains its own DNS cache for fast lookups.

2. The parser handles a huge array of possible errors, including HTML errors, non-ASCII characters, and HTML tags nested hundreds deep.
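A toy sketch of the per-crawler DNS cache; the paper notes only that each crawler keeps its own cache, so everything else here, including the fetch helper, is illustrative:

import socket
import urllib.parse
import urllib.request

dns_cache: dict[str, str] = {}   # hostname -> IP; one cache per crawler

def resolve(host: str) -> str:
    """Resolve a hostname, hitting the OS resolver only on a cache miss."""
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

def fetch(url: str) -> bytes:
    host = urllib.parse.urlsplit(url).hostname
    resolve(host)   # warm the cache; a real crawler would reuse this IP
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()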

Indexing the Web

3. Indexing Documents into Barrels
After each document is parsed, it is encoded into a number of barrels.
Every word is converted into a wordID using an in-memory hash table, the lexicon.
Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels (see the sketch below).

4. Sorting
The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits, and a full-text inverted barrel.
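A sketch of the barrel-indexing step, with the lexicon as an in-memory hash table. Hit lists are reduced to bare word positions, and modulo partitioning stands in for the real wordID ranges (the paper used 64 barrels):

lexicon: dict[str, int] = {}    # in-memory hash table: word -> wordID
NUM_BARRELS = 4                 # illustrative; the paper used 64
forward_barrels = [{} for _ in range(NUM_BARRELS)]

def word_id(word: str) -> int:
    """Look up a word in the lexicon, assigning a fresh wordID if unseen."""
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id: int, words: list[str]) -> None:
    """Record each word occurrence (here just its position) in the barrel
    owning its wordID."""
    for position, word in enumerate(words):
        wid = word_id(word)
        barrel = forward_barrels[wid % NUM_BARRELS]
        barrel.setdefault(doc_id, {}).setdefault(wid, []).append(position)

index_document(1, "the quick brown fox jumps over the lazy dog".split())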

Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
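A simplified sketch of this loop in Python, treating each doclist as a sorted list of docIDs and taking the ranking function as a parameter (the hypothetical rank function below just illustrates the interface):

def search(doclists, rank, k=10):
    """Scan sorted doclists for docIDs present in all of them (steps 4 and 7),
    rank each match (step 5), and return the top k (step 8). The short/full
    barrel switch of step 6 is omitted for brevity."""
    positions = [0] * len(doclists)
    matches = []
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        if len(set(current)) == 1:                 # one document matches every term
            matches.append((rank(current[0]), current[0]))
            positions = [p + 1 for p in positions]
        else:                                      # advance the lagging doclist
            lagging = current.index(min(current))
            positions[lagging] += 1
    return [doc for _, doc in sorted(matches, reverse=True)[:k]]

print(search([[1, 3, 7], [3, 7, 9]], rank=lambda doc: 1.0 / doc))  # -> [3, 7]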

Results and
Performance

Results and Performance

A qualitative analysis of the search results by users has generally been positive.
The current version of Google answers most queries in between 1 and 10 seconds.
Since Google takes the proximity of word occurrences into consideration, its results are more relevant than those of search engines that simply return pages containing all the query words. (E.g., a search for "bill clinton" gives lower importance to pages where "bill" and "clinton" occur independently.)

Future Work
Search times in the current version of Google are dominated by disk I/O; introduce query caching and hardware, software, and algorithmic optimizations.
Improve search efficiency and quickly scale to ~100 million web pages.
Develop Google as a large-scale resource and research tool for searchers and researchers.

Analysis of the Research Paper
Pros
One of the first descriptions of the PageRank algorithm, which changed how search engines ranked and indexed the web.
Using the citation (link) graph and anchor text to rank pages closely resembles how users themselves judge websites.
Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
The paper states that Google does not compromise PageRank for monetary gain, lending more credibility to its search results. This holds true to date.

Analysis of the Research Paper
Cons
One of the first flaws found in the PageRank algorithm was the Google bomb:
Because anchor text is credited to the page it points to, a page will rank highly for a phrase if the sites that link to it use consistent anchor text.
A Google bomb is created when a large number of sites link to a page in this manner.
Ranking quality is insufficient using only PageRank and anchor text. (Google today uses more than 200 different parameters to judge the quality of a webpage.)

Thank You
Presented by: Nilay Khandelwal
