Webpage Prediction Using Latest Substring Association Rule Mining

R. P. Chatterjee

Department of Computer Science and Engineering, Meghnad Saha Institute of Technology, Kolkata

M. Ghosh

Dept. of CSE, Supreme Knowledge Foundation Group of Institutions, Mankundu, WB, India

M. K. Das

Dept. of CSE, Supreme Knowledge Foundation Group of Institutions, Mankundu, WB, India

R. Bag

Dept. of CSE, Supreme Knowledge Foundation Group of Institutions, Mankundu, WB, India
ABSTRACT: Web page prediction plays an important role by predicting and fetching in advance
the web pages most likely to be requested next, thereby reducing user latency. This paper
proposes a web page prediction model that gives significant importance to the user's interest,
using a clustering technique, and to the navigational behavior of the user, through the latest
substring association rule. This method achieves better precision compared to recent methods
in web usage mining.
KEYWORDS: Data mining, Association rule, Substring association rule mining

Users surf the internet by entering a URL, by
searching for a topic, or by following links on the
same topic. Web-site designers want to increase the
number of visitors and the time that these visitors
spend on their web site. To accomplish that, they
have to supply attractive content. To make their
content attractive, web-site designers and content
providers need to know what their potential visitors
want, so that they can organize their content
according to their visitors' needs and, if possible,
according to individual preferences. Researchers use
different techniques such as the Markov model
[Jin X et al. 2003], association rule mining, and
clustering [Dutta R et al. 2011]. Clustering plays an
important role. Web usage mining [Barsagade 2003]
is the application of data mining techniques to
extract knowledge from web data, where at least one
of structure or usage data is used in the mining
process. Web usage mining has various application
areas such as web pre-fetching, link prediction, and
personalization. The most important phases of web
usage mining are the reconstruction of user sessions
using heuristic techniques and the discovery of
useful patterns from these sessions using pattern
discovery techniques such as association rule mining
with Apriori. We propose an integrated system for
applying data mining techniques such as association
rules to access log files.

1.1 Related Work

A web page prediction model can give significant
importance to the user's interest using clustering
techniques [Dutta R. et al. 2011] and to the
navigational behavior of the user through a Markov
model. The clustering technique is used to group
similar web pages. Web pages of the same type
reside in the same cluster, and the pages in a cluster
are similar with respect to the topic of the session.
The clustering algorithms [Dutta R. et al. 2011]
considered are K-... Pages are stored in the form of
cellular automata to make the system more memory
efficient. Sequential classifiers [6] built from
association rules obtained through data mining on
large web log data have been proposed by [Yang Q.
et al.]; they use significant statistical correlations to
predict the next likely web page. Another web page
prediction method is web pre-fetching [Jin X, Xu H.
2003], which predicts the next request for web pages
based on the current request of users by analyzing
the server log, fetches those pages in advance, and
loads them into the server cache. It reduces the
perceived access delay to some extent and improves
the service quality of the web server.
Given a web log, the first step is to clean
the raw data. We filter out documents
that are not requested directly by users.
These are image requests in the log that
result from requests to a document
containing links to these files. We
consider web log data as a sequence of
distinct web pages, where subsequences
such as user sessions can be identified by
unusually long gaps between consecutive
requests. We create a unique ID for each
web page link that exists in the web log.
After that, a binary context corresponding
to each unique ID is created to count how
many times a particular web page link has
been visited by users in a particular
session. Next, the Apriori algorithm
(Agrawal et al. 1993) is used to find the
frequent web pages among all the web
pages previously visited by users.
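
The preprocessing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the list of embedded-file extensions, the 30-minute session gap, the function names, and the sample requests are all hypothetical and are not taken from the paper, which does not specify them. The log is assumed to hold (timestamp, URL) pairs for a single user.

```python
from datetime import datetime, timedelta

# Assumed list of extensions for files that are fetched indirectly.
EMBEDDED = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean(requests):
    """Drop requests for embedded files such as images."""
    return [(t, url) for t, url in requests
            if not url.lower().endswith(EMBEDDED)]

def split_sessions(requests, max_gap=timedelta(minutes=30)):
    """Start a new session whenever the gap between requests is too long."""
    sessions, current, last = [], [], None
    for t, url in sorted(requests):
        if last is not None and t - last > max_gap:
            sessions.append(current)
            current = []
        current.append(url)
        last = t
    if current:
        sessions.append(current)
    return sessions

def assign_ids(sessions):
    """Give every distinct page a unique integer ID, in order of first visit."""
    ids = {}
    for session in sessions:
        for url in session:
            ids.setdefault(url, len(ids))
    return ids

def binary_context(sessions, ids):
    """One row per session; entry is 1 if the page was visited in it."""
    rows = []
    for session in sessions:
        row = [0] * len(ids)
        for url in session:
            row[ids[url]] = 1
        rows.append(row)
    return rows

# Hypothetical sample log.
log = [
    (datetime(2020, 1, 1, 10, 0, 0), "/a.html"),
    (datetime(2020, 1, 1, 10, 0, 1), "/logo.png"),
    (datetime(2020, 1, 1, 10, 5, 0), "/b.html"),
    (datetime(2020, 1, 1, 12, 0, 0), "/a.html"),
    (datetime(2020, 1, 1, 12, 1, 0), "/c.html"),
]
sessions = split_sessions(clean(log))
ids = assign_ids(sessions)
context = binary_context(sessions, ids)
```

Summing a column of the binary context gives the number of sessions in which that page was visited, which is the count needed for the support computation.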

2.1 Apriori Algorithm

This algorithm is used for mining frequent item sets
for Boolean rules, where frequent subsets are
extended one item at a time (a step known as
candidate generation) (Agrawal et al. 1993), and
groups of candidates are tested against the data. The
algorithm terminates when no further successful
extensions are found. It first finds all frequent
1-itemsets (Agrawal et al. 1993), then discovers
frequent 2-itemsets, and continues by finding
increasingly larger frequent item sets.
Key Concepts:
Frequent item sets: The sets of items which have at
least the minimum support (sup).
Apriori property: Any subset of a frequent item set
must be frequent.
Join operation: To find Lk, a set of candidate k-item
sets is generated by joining Lk-1 with itself.
For each rule of the form LHS -> RHS, we define the
support and confidence as follows:
sup = count(LHS, RHS)/count(Table) (1)
conf = count(LHS, RHS)/count(LHS) (2)
In the equations above, the function count(Table)
returns the number of records in the log table,
count(LHS) returns the number of records that
match the left-hand side LHS of a rule, and
count(LHS, RHS) returns the number of records that
match both sides of the rule. In the beginning, each
page with sufficient support forms a (length-1)
supported pattern. Then, in the main step, for each k
value greater than 1 and up to the maximum
reconstructed session length, supported patterns
(patterns satisfying the support condition) with
length k+1 are constructed by using the supported
patterns with length k and length 1 as follows. If the
last page of the length-k pattern has a link to the
page of the (length-1) pattern, then a length-(k+1)
candidate pattern is generated by appending that
page. At some k value, if no new supported pattern
is constructed, the iteration halts.

2.2 Latest Substring Association Rule

After finding the frequent web pages with the
Apriori algorithm, we use the Latest Substring
Association Rule to generate the rules. The latest
substrings are in fact the suffixes of the strings in
window W1. These rules take into account not only
the order and adjacency information, but also the
recency information about the LHS string. In this
representation, only the substring ending at the
current time (which corresponds to the end of
window W1) qualifies to be the LHS of a rule. In our
example, it can easily be observed that in window
W1 the first session contains the sequence {A,B,C}
and the second session contains {B,A,C}. The suffix
C occurs at the end of both sessions, and D is the
only predicted page present in window W2. Hence
only one rule, {C} -> D, can be generated. In this
way we use the Latest Substring Association Rule in
web page ranking. The proposed approach is
depicted by the flow chart in Figure 1.
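
The Apriori procedure of Section 2.1, together with the support and confidence of Eqs. (1) and (2), can be illustrated with a short Python sketch. It implements the classic itemset form of Apriori (join, prune, support test) and treats each session as a set of pages; it does not include the link-based pattern extension described in the text. The function names, the sample sessions, and the threshold 0.5 are assumptions made for illustration only.

```python
from itertools import combinations

def support(itemset, sessions):
    """Eq. (1): fraction of sessions that contain every page in itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(lhs, rhs, sessions):
    """Eq. (2): count(LHS, RHS) / count(LHS)."""
    both = sum((lhs | rhs) <= s for s in sessions)
    left = sum(lhs <= s for s in sessions)
    return both / left if left else 0.0

def apriori(sessions, min_sup):
    """Return a dict mapping each frequent itemset to its support."""
    sessions = [frozenset(s) for s in sessions]
    pages = {p for s in sessions for p in s}
    # L1: frequent 1-itemsets.
    level = {frozenset([p]) for p in pages
             if support(frozenset([p]), sessions) >= min_sup}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, sessions)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c, sessions) >= min_sup}
    return frequent

# Hypothetical sessions, each given as a set of visited pages.
sessions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "C"}]
frequent = apriori(sessions, min_sup=0.5)
conf_a_c = confidence(frozenset({"A"}), frozenset({"C"}), sessions)
```

With these sample sessions the frequent item sets are {A}, {B}, {C}, {A,C}, and {B,C}; the candidate {A,B,C} is pruned because its subset {A,B} is not frequent.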

Table 1. Latest substring association rule.

Substring Rules
{C} -> D
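
The rule in Table 1 can be reproduced with a small Python sketch of latest-substring rule generation. It assumes each training case is a pair (W1, W2), where W1 is the ordered tuple of pages in the first window and W2 is the single page that follows; the function name, the minimum count of 2, and the minimum confidence of 0.5 are hypothetical choices, not values given in the paper.

```python
from collections import Counter

def latest_substring_rules(cases, min_count=2, min_conf=0.5):
    """Generate rules whose LHS is a suffix of W1 and whose RHS is W2.

    cases: list of (w1, w2) pairs, w1 a tuple of pages, w2 a page.
    Returns a dict mapping (lhs, rhs) to the confidence of the rule.
    """
    lhs_count = Counter()
    rule_count = Counter()
    for w1, w2 in cases:
        # Only substrings that end at the end of W1 (its suffixes) qualify.
        for i in range(len(w1)):
            suffix = tuple(w1[i:])
            lhs_count[suffix] += 1
            rule_count[(suffix, w2)] += 1
    rules = {}
    for (lhs, rhs), n in rule_count.items():
        conf = n / lhs_count[lhs]
        if n >= min_count and conf >= min_conf:
            rules[(lhs, rhs)] = conf
    return rules

# The two sessions from the example in the text.
cases = [(("A", "B", "C"), "D"), (("B", "A", "C"), "D")]
rules = latest_substring_rules(cases)
```

Here the suffixes of the first session are (C), (B,C), and (A,B,C), and those of the second are (C), (A,C), and (B,A,C). Only (C) occurs in both sessions, so the single rule {C} -> D is kept.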

Figure 1. Flowchart of Latest Substring Association
Rule Mining. The steps shown in the flowchart are:
Scan the transaction database to get the support S of each item.
If S >= min, add the item to the frequent item set L1.
Join Lk-1 with itself to generate a set of candidate k-item sets.
Scan the transaction database to get the support S of each candidate k-item set.
Add the supported candidates to the frequent k-item sets.
For each frequent item set L, generate all non-empty subsets of L.
For each non-empty subset s of L, find the rule containing the R.H.S. in the next window.
Add to strong rules.

For a test case that consists of a sequence of web
page visits, the prediction for the next page visit is
correct if the RHS of the selected rule occurs in
window W2. For N different test cases, let C be the
number of correct predictions. Then the precision is
defined as
precision = C / N

Table 2. User access path steps and corresponding precision.

Figure 2. User access path with respect to precision value.

First, we measured the support of the user access
paths contained in our input file. Among them, we
found the most frequent access paths using the
Apriori algorithm (Agrawal et al. 1993). From the
frequent user access paths we derived the
association rules (Agrawal et al. 1994) and their
corresponding confidence at the different steps of
the user access path. If the rule met the minimum
confidence threshold, we treated the corresponding
test case as correct. We ran the experiment on 10
different test cases for r=4, r=6, r=8, r=10 and r=12,
calculated the precision for each value of r using the
formula given above, and plotted the results in the
graph above.
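
The evaluation just described can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the rule-selection policy (highest confidence, ties broken by the longer LHS), the function names, and the sample rules and test cases are all assumptions. Each test case is a pair of the visited path (window W1) and the set of pages in window W2.

```python
def predict(rules, path, min_conf=0.0):
    """Pick the RHS of the best rule whose LHS is a suffix of path.

    rules: dict mapping (lhs_tuple, rhs_page) to confidence.
    Returns None when no rule matches or meets min_conf.
    """
    best_key, best_rhs = None, None
    for (lhs, rhs), conf in rules.items():
        if conf < min_conf or not lhs or len(lhs) > len(path):
            continue
        if tuple(path[-len(lhs):]) != lhs:
            continue
        key = (conf, len(lhs))  # assumed policy: confidence, then length
        if best_key is None or key > best_key:
            best_key, best_rhs = key, rhs
    return best_rhs

def precision(rules, test_cases, min_conf=0.0):
    """precision = C / N, with C the number of correct predictions."""
    correct = 0
    for w1, w2 in test_cases:
        if predict(rules, w1, min_conf) in w2:
            correct += 1
    return correct / len(test_cases)

# Hypothetical rules and test cases.
rules = {(("C",), "D"): 1.0, (("A",), "B"): 0.6}
test_cases = [
    (("A", "B", "C"), {"D"}),       # rule {C} -> D applies, correct
    (("B", "A"), {"C"}),            # rule {A} -> B applies, wrong
    (("X",), {"D"}),                # no rule applies
    (("A", "C"), {"D", "E"}),       # rule {C} -> D applies, correct
]
p = precision(rules, test_cases)
```

Repeating this computation for each value of r would give the points of a precision curve like the one in Figure 2.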
In this paper, we surveyed association rule mining
techniques using the Apriori algorithm and latest
substring association rules, which we tested
experimentally with good results. The previously
used algorithms have too many parameters for a
non-expert in data mining, and they produce far too
many rules, most of them uninteresting and with low
comprehensibility. Our method overcomes this
problem, but we observed some acute problems
during our experiments. The larger the set of
frequent itemsets, the greater the number of rules
presented to the user, many of which are redundant.
We also faced the problem of dynamic itemset
counting, i.e., whether web pages are frequent or not
should be decided dynamically.
Agrawal, R., Imielinski, T., and Swami, A. N. 1993.
Mining association rules between sets of items in
large databases. In Proceedings of the 1993 ACM
SIGMOD International Conference on
Management of Data, 207-216.
Agrawal, R. and Srikant, R. 1994. Fast algorithms
for mining association rules. In Proc. 20th Int.
Conf. Very Large Data Bases, 487-499.
Baralis, E. and Psaila, G. 1997. Designing templates
for mining association rules. Journal of Intelligent
Information Systems, 9(1):7-32.
Barsagade, N. 2003. Web Usage Mining and Pattern
Discovery. CSE 8331, December 8, 2003.
Dutta, R., Kundu, A., and Mukhopadhyay, D. 2011.
Clustering Based Web Page Prediction.
Jin, X. and Xu, H. 2003. An Approach to Intelligent
Web Pre-fetching Based on Hidden Markov
Model. Dec 12, 2003.
Yang, Q., Li, T., Wang, K., School of Computing
Science, Simon Fraser University, Burnaby, BC,
Canada V5A 1S6, Building Association-Rule
Based Sequential Classifiers for Web-Document