Web Structure Mining

WEB STRUCTURE MINING
SUBMITTED BY:
BLESSY JOHN
R7A
ROLL NO:18
INTRODUCTION
• Web mining is the application of data mining techniques to the Web; search engines are a major application.
• Data mining is the process of discovering useful knowledge from data sources.
• Web mining automatically discovers and extracts information from Web documents.
• Web structure mining discovers useful knowledge from hyperlinks.
WEB MINING
• Extraction of useful patterns from WWW resources
• The WWW is a widely distributed, global information service centre that constitutes a rich source for data mining
• Employs techniques from data mining, information retrieval, etc.
NEED FOR WEB MINING
• Aims at finding and extracting relevant information that is hidden in web-related data
• The challenge is to bring back the semantics of hypertext documents
• The goal is to turn web data into web knowledge

CLASSIFICATION
Web mining is classified into:
• Web content mining
• Web structure mining
• Web usage mining
WEB STRUCTURE MINING
• Generates a structural summary of the Web site and its Web pages
• Uses graph theory to analyse the node and connection structure of a web site
• Analyses the link structure of the web; its purpose is to identify the more preferable documents
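As a rough illustration of the graph view described above, the sketch below treats pages as nodes and hyperlinks as directed edges, then counts incoming links per page. The page paths and the helper name build_link_graph are made up for the example; in-link counting is only the simplest possible signal of a "preferable" document.

```python
from collections import defaultdict

def build_link_graph(links):
    """Build out-link and in-link sets from (source, target) hyperlink pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in links:
        out_links[source].add(target)
        in_links[target].add(source)
        # Touch the other map so every page appears as a key in both.
        out_links[target]
        in_links[source]
    return out_links, in_links

# Hypothetical site: each pair is "page on the left links to page on the right".
links = [
    ("/index", "/products"),
    ("/index", "/about"),
    ("/products", "/about"),
    ("/blog", "/about"),
]
out_links, in_links = build_link_graph(links)

# The page with the most incoming links is a crude candidate for "most preferable".
most_linked = max(in_links, key=lambda page: len(in_links[page]))
print(most_linked, len(in_links[most_linked]))
```

HITS and PageRank, covered later, refine this idea by weighting each incoming link by the quality of the page it comes from.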
WEB STRUCTURE MINING (cont.)
• Discovers the nature of the hierarchy of hyperlinks in the website and its structure
• A hyperlink indicates the author's endorsement of the other web page
• Retrieves information about the relevance and the quality of a web page
Page Layout and Link Analysis for Web Images
WEB BASICS
• The Web is a huge collection of documents linked together by references.
• References from one document to another are hypertext links embedded in HTML.
• HTML describes how the document should be displayed in the browser window.
• Each web document has a web address, called a URL, that identifies it uniquely.
WEB CRAWLERS
• Collect "all" web documents by browsing the Web systematically and exhaustively
• The region of the web to be crawled can be specified by using the URL structure
• Used by a search engine to provide local access to the most recent versions of possibly all web pages
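A minimal sketch of the crawling idea above, assuming a breadth-first strategy and a URL prefix as the way to restrict the crawled region. To keep it runnable offline, the fetch argument is a hypothetical callable that returns HTML for a URL; here it is backed by a small in-memory dictionary with made-up example.com pages instead of real network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def crawl(seed, fetch, url_prefix):
    """Breadth-first crawl from seed, staying inside url_prefix.

    Returns a dict mapping each visited URL to the in-scope URLs it links to.
    """
    visited = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)
        if html is None:          # page could not be retrieved
            visited[url] = []
            continue
        parser = LinkExtractor()
        parser.feed(html)
        targets = []
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(url_prefix) and absolute not in targets:
                targets.append(absolute)
        visited[url] = targets
        queue.extend(t for t in targets if t not in visited)
    return visited

# Made-up pages standing in for the Web.
PAGES = {
    "http://example.com/docs/": '<a href="a.html">A</a> <a href="/blog/">Blog</a>',
    "http://example.com/docs/a.html": '<a href="./">Home</a> <a href="b.html#top">B</a>',
    "http://example.com/docs/b.html": "<p>No links here.</p>",
    "http://example.com/blog/": '<a href="/docs/">Docs</a>',
}
site = crawl("http://example.com/docs/", PAGES.get, "http://example.com/docs/")
for page, targets in site.items():
    print(page, "->", targets)
```

The returned mapping is exactly the link graph that structure mining works on. A real crawler would also need politeness delays, robots.txt handling and re-crawling to keep pages fresh, all of which are omitted here.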
INDEXING AND KEYWORD SEARCH
• There are two types of data: structured and unstructured
• Structured data have keys associated with each data item that reflect its content
• Keyword search is content-based access to unstructured data that does not consider the meaning of the text
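One common way to support keyword search over unstructured text is an inverted index, which maps each word to the documents that contain it. The sketch below is an illustration under that assumption, with made-up documents and AND semantics for multi-word queries; the slides do not prescribe a particular index structure.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing every query word (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up documents for the example.
documents = {
    1: "web structure mining analyses hyperlinks",
    2: "web content mining analyses page text",
    3: "web usage mining analyses server logs",
}
index = build_inverted_index(documents)
print(keyword_search(index, "structure mining"))
```

The match is purely on word identity, which is what "without considering the meaning" implies: a query for "links" would not find document 1 unless the words are first reduced to a common form.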
DOCUMENT REPRESENTATION
• To facilitate matching keywords against documents, some preprocessing steps are taken first:
  • Documents are tokenized
  • Characters are converted to upper or lower case
  • Words are reduced to canonical form (stemming)
  • Stopwords are usually removed
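The four preprocessing steps above can be sketched as a small pipeline. The stopword list and the suffix-stripping rule below are deliberately tiny stand-ins chosen for the example; a real system would use a full stopword list and a proper stemmer such as Porter's.

```python
import re

# A very small, illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "are", "by", "from", "in", "is", "of", "the", "to"}

def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least four characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, lowercase, reduce to a canonical form, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # 1. tokenize
    tokens = [token.lower() for token in tokens]      # 2. normalize case
    tokens = [crude_stem(token) for token in tokens]  # 3. canonical form
    return [token for token in tokens if token not in STOPWORDS]  # 4. stopwords

terms = preprocess("The Web is a huge collection of documents linked by references.")
print(terms)
```

After this step, "documents" and "document" map to the same term, so a keyword query matches both forms.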
ALGORITHMS
• There are two main algorithms used in web structure mining:
1. HITS (Hyperlink-Induced Topic Search)
2. PageRank algorithm
HITS (Hyperlink-Induced Topic Search)
• Link analysis algorithm
• Rates web pages
• Developed by Jon Kleinberg
• Determines two values for a page:
  • Authority: estimates the value of the content of the page
  • Hub: estimates the value of its links to other pages
Hubs and Authorities
• Hub pages point to interesting authorities, i.e. relevant pages
• Authorities are the targets of hub pages
Hubs and Authorities (cont.)
• Authority and hub values are defined in terms of one another in a mutual recursion
• HITS is executed at query time, with the associated hit on performance
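The mutual recursion can be made concrete with a short iterative sketch: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph, page names and iteration count below are assumptions for illustration; in practice HITS is run on a small subgraph of pages related to the query.

```python
import math

def hits(out_links, iterations=50):
    """Iteratively compute (hub, authority) scores for a directed link graph."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        auth = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of pages linked to.
        hub = {page: sum(auth[t] for t in out_links.get(page, ())) for page in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth

# Hypothetical query subgraph: three link collections pointing at two content pages.
out_links = {
    "portal": ["paper", "tutorial"],
    "directory": ["paper", "tutorial"],
    "blog": ["paper"],
    "paper": [],
    "tutorial": [],
}
hub, auth = hits(out_links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Because the scores depend on the query-specific subgraph, they have to be recomputed for each query, which is the performance cost noted above.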
PageRank
• Link analysis algorithm
• Assigns a numerical weight to each element of a hyperlinked set of documents
• The weight of element E is denoted by PR(E)
• Relies on the uniquely democratic nature of the Web
• A link from page A to page B is a vote, by page A, for page B
PageRank (cont.)
• A vote from a page that is itself important counts for more, so an important page A helps to make B important
• PageRank is also a probability distribution: it represents the probability that a person randomly clicking on links arrives at any particular page
• A PageRank of 0.5 means a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank
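The probability interpretation can be illustrated with a small power-iteration sketch. It assumes the commonly used damping factor of 0.85 and spreads the rank of pages without out-links evenly over all pages; both are conventional choices rather than details given in these slides, and the four-page graph is made up.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Compute PageRank by power iteration; the scores sum to 1."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is redistributed evenly.
        dangling = sum(rank[page] for page in pages if not out_links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        # Each page splits its rank equally among the pages it votes for.
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A -> B, B -> C, C -> A and B, D -> B.
out_links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
rank = pagerank(out_links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page B collects votes from A, C and D and ends up with the highest score, while D, which nobody links to, keeps only the baseline share (1 - damping) / N.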
APPLICATIONS
• Information retrieval in social networks
• Finding the relevance of each Web page
• Measuring the completeness of Web sites
• Used in search engines to find relevant information
CONCLUSION
• Search engines use web structure mining to find information.
• New knowledge can be created out of the available information.
• Web content mining can be combined with it to enhance the performance of search engines.
Thank You!
Questions?
