Abstract – URL (Uniform Resource Locator) normalization is an important activity in web mining. Web data can be retrieved more smoothly with an effective URL normalization technique, and URL normalization also reduces the amount of computation in web mining activities. This paper proposes a web mining technique for URL normalization. The proposed technique is based on the content, structure and semantic similarity, and on the web page redirection and forwarding similarity, of a given set of URLs. Web page redirection and forward graphs can be used to measure the similarities between URLs and can also be used to form URL clusters. The URL clusters can then be used for URL normalization. A data structure is also suggested to store the forward and redirect URL information.

Keywords – URL Normalization, Clustering, Web Page Forward and Redirect Similarity Tree.

I. INTRODUCTION

URL normalization is performed by crawlers to determine whether two syntactically different URLs are equivalent. The ultimate aim of URL normalization is to reduce redundant Web crawling by maintaining a set of URLs that point to a unique set of Web pages, and to help search engines return better and unique results. Search engines deploy URL normalization to determine the importance of Web pages as well as to avoid indexing the same Web pages more than once. URL normalization is also referred to as the process of identifying similar and equivalent URLs. Equivalent URLs point to the same required resource, which is the one of interest to the web user.

A. Standard URL Normalization Methods

Several types of normalization may be performed on URLs. Some of the common techniques are converting the scheme and host to lower case, removing the directory index, removing the fragment, removing “www” as the first domain label, removing arbitrary query string variables, and normalization based on URL lists. Standard URL normalization methods are classified into the following three major categories:

1. Syntax based normalization techniques
These methods are based on the syntax of the URL. One example is case normalization, which converts all letters in the scheme and authority components to lower case. Percent-encoded normalization is another technique, which decodes any percent-encoded octet that corresponds to an unreserved character, such as %2D for hyphen and %5F for underscore. Another common technique is path segment normalization, which removes dot segments from the path components.

2. Scheme based normalization techniques
A URL can be normalized using the scheme of the web application and its properties; such techniques fall under scheme based normalization. Some examples are adding a trailing ‘/’ after the authority component of the URL, removing the default port number, such as 80 for the http scheme, and truncating the fragment of the URL, e.g. http://www.example.com/name.html#ali is truncated to http://www.example.com/name.html.

3. Protocol based normalization techniques
Some techniques rely on the communication protocol used by the web application and fall under the category of protocol based normalization techniques. Such normalizations are only appropriate when the results of accessing the resources are equivalent; for example, http://example.com/data is redirected to http://example.com/data/ by the http origin server.

B. Web Similarity Types

A Web page can have the following three types of similarities:

1. Web Content Similarity: Web content similarity refers to the similarity among web pages whose actual web contents, such as html data, images and tables, are similar. Web content similarity is an important measure in URL normalization. Whenever the web page contents of two URLs are similar, both URLs can be assumed to be equivalent.

2. Web Structure Similarity: A URL points to a particular web page, which not only contains useful information but is also the source of other web pages. A user can find a number of useful links on reaching a particular web page, and may select another page from that page. In this way these sequences of web pages form a web structure. For similar kinds of operations on the web, the structure is generally also similar. For example, a sign-up or registration process in a web application consists of a typical set of operations and a typical traversal of some web pages, like first filling the
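The syntax based and scheme based normalizations listed under "Standard URL Normalization Methods" can be illustrated with a short Python sketch. This is not the paper's implementation. The function names (`normalize_url`, `decode_unreserved`, `remove_dot_segments`) and the `DEFAULT_PORTS` table are assumptions introduced here for illustration, and the sketch assumes absolute paths, ignores userinfo and does not handle IPv6 host literals.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters (letters, digits, "-", ".", "_", "~").
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Hypothetical table of default ports used by this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def decode_unreserved(text):
    """Decode percent-encoded octets only when they map to unreserved characters."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        # Keep other escapes encoded, but upper-case their hex digits.
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def remove_dot_segments(path):
    """Remove "." and ".." segments from an absolute path."""
    segments = path.split("/")
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment in (".", ".."):
            if segment == ".." and len(out) > 1:
                out.pop()
            if is_last:
                out.append("")  # keep the trailing slash
        else:
            out.append(segment)
    return "/".join(out)


def normalize_url(url):
    """Apply syntax based and scheme based normalization to one URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # userinfo is dropped in this sketch
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path becomes "/", i.e. a trailing slash after the authority.
    path = remove_dot_segments(decode_unreserved(parts.path)) or "/"
    query = decode_unreserved(parts.query)
    # The fragment is truncated by passing an empty string.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this sketch, two syntactically different URLs that normalize to the same string would be treated as equivalent. Protocol based cases, such as the redirect from /data to /data/, cannot be decided from the string alone and need a request to the origin server.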
A forward and redirect tree is a data structure for representing the web page forwarding and redirecting operations. The URLs are the parent nodes, and the child nodes represent the sequence of page traversal. Web page forwarding refers to the process by which a URL forwards the web request to some other URL for further operations. Web page redirecting is the process by which the requested URL is automatically changed to some other URL for further processing, because the location has changed or for some other reason.

dist = Φ(F_{Fwd,Redirect}(i,t), F_{content-similarity}(i,t), F_{structure-similarity}(i,t), F_{semantic-similarity}(i,t))
return dist

/*****
Set the initial cluster centres to be random numbers. FRDist_tj denotes the average distance value of the forward and redirect from a URL for the cluster t; CSim_tj, StrSim_tj and SemSim_tj denote the average content similarity measure, structural similarity measure and semantic similarity measure, respectively.
****/
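The text does not define the combining function Φ used in the distance computation. As a minimal sketch, assuming Φ is a weighted sum, that all four inputs lie in [0, 1], and that each similarity is converted to a distance as 1 - similarity, the computation could look as follows. The names `combined_distance` and `nearest_cluster`, the equal default weights and the sample feature values are hypothetical.

```python
def combined_distance(fr_dist, content_sim, structure_sim, semantic_sim,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed form of Phi: a weighted sum of four distance components."""
    values = (fr_dist, content_sim, structure_sim, semantic_sim)
    if any(not 0.0 <= value <= 1.0 for value in values):
        raise ValueError("all inputs must lie in [0, 1]")
    # Similarities are turned into distances; the forward/redirect term
    # is assumed to be a distance already.
    components = (fr_dist, 1.0 - content_sim,
                  1.0 - structure_sim, 1.0 - semantic_sim)
    return sum(w * c for w, c in zip(weights, components))


def nearest_cluster(features_by_cluster):
    """Return the cluster key whose feature tuple gives the smallest distance.

    features_by_cluster maps a cluster id t to the tuple
    (fr_dist, content_sim, structure_sim, semantic_sim) for one URL i.
    """
    return min(features_by_cluster,
               key=lambda t: combined_distance(*features_by_cluster[t]))


# Hypothetical feature values for one URL against two clusters.
example = {"t1": (0.1, 0.9, 0.8, 0.9), "t2": (0.7, 0.2, 0.3, 0.1)}
closest = nearest_cluster(example)
```

Any other monotone combination would fit the same interface; the weighted sum is chosen here only because it is the simplest to inspect.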
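The forward and redirect tree described above, with URLs as parent nodes and child nodes following the sequence of page traversal, could be represented along these lines. This is a sketch under assumptions: the class name `FRNode`, the `kind` labels and the example URLs beyond http://example.com/data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FRNode:
    """One URL in a forward and redirect tree (hypothetical layout)."""
    url: str
    kind: str = "root"  # assumed labels: "root", "forward" or "redirect"
    children: List["FRNode"] = field(default_factory=list)

    def add(self, url, kind):
        """Attach the URL that this node forwards or redirects to."""
        child = FRNode(url, kind)
        self.children.append(child)
        return child

    def traversal(self):
        """Return the URLs in depth-first (pre-order) traversal order."""
        urls = [self.url]
        for child in self.children:
            urls.extend(child.traversal())
        return urls

    def leaves(self):
        """Return the URLs at which forwarding or redirecting stops."""
        if not self.children:
            return [self.url]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# Usage: a request URL that is redirected and then forwarded.
root = FRNode("http://example.com/data")
redirected = root.add("http://example.com/data/", "redirect")
redirected.add("http://example.com/data/index.html", "forward")
```

Two request URLs whose trees end at the same leaves would be natural candidates for the same cluster, although how the paper compares trees is not shown in this excerpt.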