2010 International Conference on Advances in Computer Engineering

Clustering Based URL Normalization Technique for Web Mining

Naresh Kumar Nagwani,
Asst Prof., Department Of CSE,
NIT Raipur
nknagwani.cs@nitrr.ac.in

Abstract – URL (Uniform Resource Locator) normalization is an important
activity in web mining. Web data can be retrieved more smoothly when an
effective URL normalization technique is used, and URL normalization also
reduces the amount of computation in web mining activities. A web mining
technique for URL normalization is proposed in this paper. The proposed
technique is based on the content, structure and semantic similarity, and
on the web page redirection and forwarding similarity, of a given set of
URLs. Web page redirection and forwarding graphs can be used to measure
the similarities between URLs and can also be used to form URL clusters.
The URL clusters can then be used for URL normalization. A data structure
is also suggested to store the forward and redirect URL information.

Keywords – URL Normalization, Clustering, Web Page Forward and Redirect
Similarity Tree.

978-0-7695-4058-0/10 $26.00 © 2010 IEEE
DOI 10.1109/ACE.2010.47

I. INTRODUCTION

URL normalization is performed by crawlers to determine whether two
syntactically different URLs are equivalent. The ultimate aim of URL
normalization is to reduce redundant Web crawling by maintaining a set of
URLs that point to a unique set of Web pages, and to help search engines
return better, non-duplicated results. Search engines deploy URL
normalization to determine the importance of Web pages as well as to avoid
indexing the same Web page more than once. URL normalization is also
referred to as the process of identifying similar and equivalent URLs.
Equivalent URLs point to the same resource, which is the one the web user
is interested in.

A. Standard URL Normalization Methods
There are several types of normalization that may be performed on a URL.
Some of the common techniques are converting the scheme and host to lower
case, removing the directory index, removing the fragment, removing "www"
as the first domain label, removing arbitrary query string variables, and
normalization based on URL lists. Standard URL normalization methods are
classified into the following three major categories:
1. Syntax based normalization techniques
These methods are based on the syntax of the URL. One example is case
normalization, which converts all letters in the scheme and authority
components to lower case. Percent-encoding normalization is another
technique, which decodes any percent-encoded octet that corresponds to an
unreserved character, such as %2D for hyphen and %5F for underscore.
Another common technique is path segment normalization, which removes dot
segments from the path component.
2. Scheme based normalization techniques
A URL can also be normalized using the properties of its scheme; such
techniques fall under scheme based normalization. Examples are adding a
trailing '/' after the authority component of the URL, removing the
default port number (such as 80 for the http scheme), and truncating the
fragment of the URL, e.g. http://www.example.com/name.html#ali is
truncated to http://www.example.com/name.html.
3. Protocol based normalization techniques
Some techniques rely on the communication protocol used by the web
application; these fall under protocol based normalization. They are only
appropriate when the results of accessing the resources are equivalent.
For example, http://example.com/data is redirected to
http://example.com/data/ by the http origin server.

B. Web Similarity Types
Web pages can have the following three types of similarity:
1. Web Content Similarity: Web content similarity refers to the similarity
among web pages whose actual contents, such as HTML data, images and
tables, are similar. Web content similarity is an important measure in URL
normalization. Whenever the page contents of two URLs are similar, the two
URLs can be assumed to be equivalent.
2. Web Structure Similarity: A URL points to a specific web page, which
not only contains useful information but is also the source of other web
pages. A user can find a number of useful links on reaching a particular
web page, and may select another page from that page. In this way, these
sequences of web pages form a web structure. For similar kinds of
operations on the web, the structure is generally also similar. For
example, a sign-up or registration process in a web application consists
of a typical set of operations and a typical traversal of web pages, such
as first filling in the
required information, then email verification, then continuation, and so
on. Whenever the similarity of the sequence of web page traversal is used
in an analysis, it is technically known as web structure similarity.
3. Web Semantic Similarity: Web semantic similarity is one of the advanced
parameters for checking the similarity between two URLs. It focuses on the
meaning of the web pages rather than their contents and structures. If the
pages of two URLs are similar in meaning, they are said to be semantically
similar web pages.

C. Clustering
Clustering is a technique for creating groups of similar objects. Each
group, called a cluster, consists of objects that are similar to one
another and dissimilar to the objects of other groups. A large number of
clustering algorithms exist. They are categorized as partitioning based,
hierarchical, density based, grid based and model based clustering
algorithms. K-Means is a popular partitioning based clustering algorithm.
In this paper clustering is applied over a given set of URLs, and groups
of similar URLs are created for URL normalization.
This paper is organized in four sections. Following this introduction,
Section II discusses related and previous work in the area. Section III
presents the proposed technique for URL normalization with the pseudo-code
of the algorithm, and Section IV discusses the conclusion and future scope
of the proposed technique.

II. RELATED WORK DONE

A number of techniques for URL normalization have been suggested by
researchers. Some of them are reviewed in this section. Kim, Jeong, and
Lee [3] gave a method of URL normalization in which URL strings are
transformed into a canonical form and duplicate URLs are eliminated. Lee,
Kim, and Hong [4] suggested three additional URL normalization steps and
two parameters, redundancy rate and coverage loss rate, for evaluating the
effectiveness of URL normalization. The redundancy rate shows how many web
pages are duplicated due to equivalent URLs. The coverage loss rate shows
how many valid web pages are lost due to false positives.
Soon and Lee [1, 5] proposed enhancing standard URL normalization by
incorporating the semantically meaningful metadata of the Web pages. The
metadata taken into account are the body texts of the Web pages, which can
be extracted during HTML parsing. Given a URL which has undergone the
standard normalization mechanism, a URL signature is constructed by
hashing or fingerprinting the body text of the associated Web page using a
Message-Digest algorithm. URLs which share identical signatures are
considered to be equivalent in their scheme. The problem of different URLs
with similar text has been studied by Bar-Yossef, Keidar, and Schonfeld
[2], who named this problem DUST and proposed the DUST-BUSTER algorithm to
solve it.
A new URL normalization approach based on URL clustering is proposed in
this paper. The URL clustering is done using the content, structure and
semantic similarity between the different URLs.

III. PROPOSED TECHNIQUE

This method is based on the structure in which the URLs are traversed.
Once the structure is available, the clustering algorithm is applied to
discover the different URL clusters for URL normalization. The structure
also includes the forwarding or redirecting of a URL from another URL.
Figure 1 depicts the proposed algorithm for URL normalization. A set of
URLs is given as input to the proposed model. From this URL set a
Forward-Redirect tree is generated and the similarity information is
captured; the clustering algorithm is then applied to create the URL
clusters. Figure 2 depicts the possible outcome of the algorithm. In this
way URL normalization can be achieved.

Fig. 1. Clustering based URL normalization technique.

Fig. 2. URL cluster tree for a given URL set.

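The flow in Figure 1 (capture similarity information, cluster, then
normalize) can be sketched in Python under several assumptions that are
not from the paper: each URL is assumed to already have a four-value
feature vector (forward/redirect distance, content, structure and semantic
similarity), plain Euclidean distance stands in for the distance function,
which the paper leaves unspecified, the initial centres are passed in
rather than drawn at random so that the run is repeatable, and the
shortest URL of each cluster is taken as its normalized form. The names
cluster_urls and canonical_map and all feature values are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def cluster_urls(features, centers, max_iter=100):
    """Group URLs around the given initial centers, k-means style.

    features: dict mapping URL -> tuple of four floats (assumed precomputed).
    centers:  list of initial center tuples, one per cluster.
    """
    centers = [tuple(c) for c in centers]
    assignment = {}
    for _ in range(max_iter):
        # Assign every URL to its nearest center.
        new_assignment = {
            url: min(range(len(centers)), key=lambda t: dist(vec, centers[t]))
            for url, vec in features.items()
        }
        if new_assignment == assignment:
            break  # no URL changed cluster, so the loop has converged
        assignment = new_assignment
        # Move each center to the mean of its members; empty clusters stay put.
        for t in range(len(centers)):
            members = [features[u] for u, c in assignment.items() if c == t]
            if members:
                centers[t] = tuple(sum(col) / len(members) for col in zip(*members))
    clusters = [[] for _ in centers]
    for url, t in assignment.items():
        clusters[t].append(url)
    return [sorted(c) for c in clusters if c]


def canonical_map(clusters):
    """Map each URL to a representative: the shortest, then alphabetical."""
    mapping = {}
    for cluster in clusters:
        representative = min(cluster, key=lambda u: (len(u), u))
        for url in cluster:
            mapping[url] = representative
    return mapping


# Hypothetical feature values, for illustration only.
features = {
    "http://example.com/data": (0.1, 0.9, 0.8, 0.9),
    "http://example.com/data/": (0.0, 0.9, 0.8, 0.9),
    "http://example.com/name.html": (0.9, 0.1, 0.2, 0.1),
    "http://www.example.com/name.html": (0.8, 0.1, 0.2, 0.2),
}
initial_centers = [(0.0, 1.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.0)]

clusters = cluster_urls(features, initial_centers)
mapping = canonical_map(clusters)
```

Passing the initial centres explicitly is only a convenience for
repeatability; the paper initializes them randomly. Choosing the shortest
URL as the representative is likewise one possible policy among many.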
The forward and redirect tree is a data structure for representing web
page forwarding and redirecting operations. The URLs are the parent nodes,
and the child nodes represent the sequence of page traversal. Web page
forwarding refers to the process by which a URL forwards the web request
to some other URL for further operations. Web page redirecting is the
process by which the requested URL is automatically changed to some other
URL for further processing, because the location has changed or for some
other reason.

A. Forward and Redirect Tree
The Forward-and-Redirect tree is the basic data structure for the proposed
model. This tree is created by mapping the forward URL information and the
redirect URL information for a given URL. A transition from node N1 to
node N2 is referred to as a forward or redirect transition for the URL
present at node N1. The root node of the Forward-and-Redirect tree is the
first input URL.

B. Algorithm for Proposed Model
The algorithm has four main steps, which are briefly described below.
1. The first step is to define the initial cluster centroids. These
   centroids are initialized randomly.
2. The second step is to compute the distances from all cluster centroids
   for all URLs and assign each URL to the nearest cluster. The distance
   measure used is a function of web content similarity, web structure
   similarity and web semantic similarity. It also includes the forward
   and redirect URLs reachable from a URL.
3. The third step is to calculate the new cluster centroids.
4. The final step is to calculate the value of a cost function that is
   based on a similarity distance measure.

C. Pseudo-code of Algorithm
The pseudo-code of the above algorithm is given in this section, with
comments for explanation.

/**** Input is a set U = {u1, u2, …, uN} consisting of N URLs.
      Output is a set of K URL clusters
      {{u11, u12, …}, {u21, u22, …}, …, {uK1, uK2, …}} ****/

/**** Function Dist(i,t) returns the distance measure between URL i and
      cluster t. This function is needed later in the clustering
      algorithm. The distance function is a function of four parameters:
      1. Forward and redirect cost from a URL
      2. Content similarity between a pair of URLs
      3. Structure similarity between a pair of URLs
      4. Semantic similarity between a pair of URLs ****/

Dist(i,t)
    dist = 0
    dist = Φ(F_{Fwd,Redirect}(i,t), F_{content-similarity}(i,t),
             F_{structure-similarity}(i,t), F_{semantic-similarity}(i,t))
    return dist

/**** Set the initial cluster centres to random numbers. FRDist_tj
      denotes the average forward and redirect distance value from a URL
      for cluster t; CSim_tj, StrSim_tj and SemSim_tj denote the average
      content similarity, structural similarity and semantic similarity
      measures. ****/

for t = 1 to K    /* Initializing the cluster centres */
    FRDist_tj = random number
    CSim_tj = random number
    StrSim_tj = random number
    SemSim_tj = random number
endfor

repeat
    for i = 1 to K    /* For K clusters */
        Calculate the new similarity means for cluster i.
        for j = 1 to N    /* N is the number of URLs */
            calculate Dist(j,i) for URL j
            if (Dist(j,i) < ε)    /* ε is the minimum threshold value
                                     for the distance */
                assign URL j to cluster i
            endif
        endfor
    endfor
until there is no new assignment

IV. CONCLUSION AND FUTURE WORK

In this paper a clustering based web mining technique for URL
normalization is proposed. Similar URLs can be clustered together in order
to normalize them. A new data structure, the forward-and-redirect tree, is
also proposed for web mining. Future work on the proposed algorithm could
include studying and simplifying the various similarity measurement
algorithms for web mining, and implementing the proposed technique over a
standard URL repository.

REFERENCES

[1] Lay-Ki Soon and Sang Ho Lee, "Enhancing URL Normalization using
    Metadata of Web Pages", International Conference on Computer and
    Electrical Engineering, pp. 331-335, 2008.
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, "Do Not Crawl in the DUST:
    Different URLs with Similar Text", in Proceedings of the International
    World Wide Web Conference (WWW 2007), pp. 111-120, May 2007.
[3] Sung Jin Kim, Hyo Sook Jeong, and Sang Ho Lee, "Reliable Evaluations
    of URL Normalization", ICCSA 2006, LNCS 3984, pp. 609-617, 2006.
[4] S.H. Lee, S.J. Kim, and S. Hong, "On URL Normalization",
    Springer-Verlag Lecture Notes in Computer Science, Vol. 3481,
    pp. 1076-1085, 2005.
[5] Lay-Ki Soon and Sang Ho Lee, "Identifying Equivalent URLs using URL
    Signatures", IEEE International Conference on Signal Image Technology
    and Internet Based Systems, 2008.
