client. Each transaction is comprised of a set of URLs accessed by a client in one visit to the server. For example, using association rule discovery techniques we can find correlations such as the following:

    40% of clients who accessed the Web page with URL /company/product1, also accessed /company/product2; or

    30% of clients who accessed /company/special, placed an online order in /company/product1.

Since usually such transaction databases contain extremely large amounts of data, current association rule discovery techniques try to prune the search space according to support for items under consideration. Support is a measure based on the number of occurrences of user transactions within transaction logs.

Discovery of such rules for organizations engaged in electronic commerce can help in the development of effective marketing strategies. But, in addition, association rules discovered from WWW access logs can give an indication of how to best organize the organization's Web space.

The problem of discovering sequential patterns [35, 52] is to find inter-transaction patterns such that the presence of a set of items is followed by another item in the time-stamp ordered transaction set. In Web server transaction logs, a visit by a client is recorded over a period of time. The time stamp associated with a transaction in this case will be a time interval which is determined and attached to the transaction during the data cleaning or transaction identification processes. The discovery of sequential patterns in Web server access logs allows Web-based organizations to predict user visit patterns and helps in targeting advertising aimed at groups of users based on these patterns. By analyzing this information, the Web mining system can determine temporal relationships among data items such as the following:

    30% of clients who visited /company/products, had done a search in Yahoo, within the past week on keyword w; or

    60% of clients who placed an online order in /company/product1, also placed an online order in /company/product4 within 15 days.

[...] interested in a time interval (within a day, or within a week, etc.) in which a particular file is most accessed.

Discovering classification rules [20, 54] allows one to develop a profile of items belonging to a particular group according to their common attributes. This profile can then be used to classify new data items that are added to the database. In Web usage mining, classification techniques allow one to develop a profile for clients who access particular server files based on demographic information available on those clients, or based on their access patterns. For example, classification on WWW access logs may lead to the discovery of relationships such as the following:

    clients from state or government agencies who visit the site tend to be interested in the page /company/product1; or

    50% of clients who placed an online order in /company/product2, were in the 20-25 age group and lived on the West Coast.

Clustering analysis [24, 38] allows one to group together clients or data items that have similar characteristics. Clustering of client information or data items on Web transaction logs, can facilitate the development and execution of future marketing strategies, both online and off-line, such as automated return mail to clients falling within a certain cluster, or dynamically changing a particular site for a client, on a return visit, based on past classification of that client.

4 Analysis of Discovered Patterns

The discovery of Web usage patterns, carried out by techniques described earlier, would not be very useful unless there were mechanisms and tools to help an analyst better understand them. Hence, in addition to developing techniques for mining usage patterns from Web logs, there is a need to develop techniques and tools for enabling the analysis of discovered patterns. These techniques are expected to draw from a number of fields including statistics, graphics and visualization, usability analysis, and database querying. In this section we provide a survey of the existing tools
and techniques. Usage analysis of Web access behavior being a very new area, there is very little work in it, and correspondingly this survey is not very extensive.

Visualization has been used very successfully in helping people understand various kinds of phenomena, both real and abstract. Hence it is a natural choice for understanding the behavior of Web users. Pitkow et al. [45] have developed the WebViz system for visualizing WWW access patterns. A Web path paradigm is proposed in which sets of server log entries are used to extract subsequences of Web traversal patterns called Web paths. WebViz allows the analyst to selectively analyze the portion of the Web that is of interest by filtering out the irrelevant portions. The Web is visualized as a directed graph with cycles, where nodes are pages and edges are (inter-page) hyperlinks.

On-Line Analytical Processing (OLAP) is emerging as a powerful paradigm for strategic analysis of databases in business settings. It has been recently demonstrated that the functional and performance needs of OLAP require that new information structures be designed. This has led to the development of the data cube information model [18], and techniques for its efficient implementation [2, 22, 50]. Recent work [15] has shown that the analysis needs of Web usage data have much in common with those of a data warehouse, and hence OLAP techniques are quite applicable. The access information in server logs is modeled as an append-only history, which grows over time. Since the size of server logs grows quite rapidly, it may not be possible to provide on-line analysis of all of it. Therefore, there is a need to summarize the log data, perhaps in various ways, to make its on-line analysis feasible. Making portions of the log selectively (in)visible to various analysts may be required for security reasons.

One of the reasons attributed to the great success of relational database technology has been the existence of a high-level, declarative, query language, which allows an application to express what conditions must be satisfied by the data it needs, rather than having to specify how to get the required data. Given the large number of patterns that may be mined, there appears to be a definite need for a mechanism to specify the focus of the analysis. Such focus may be provided in at least two ways. First, constraints may be placed on the database (perhaps in a declarative language) to restrict the portion of the database to be mined (e.g. [36]). Second, querying may be performed on the knowledge that has been extracted by the mining process, in which case a language for querying knowledge rather than data is needed. An SQL-like querying mechanism has been proposed for the WEBMINER system [36]. For example, the query

    SELECT association-rules(A*B*C*)
    FROM log.data
    WHERE date >= 970101 AND domain = "edu"
    AND support = 1.0 AND confidence = 90.0

extracts the rules involving the ".edu" domain after Jan 1, 1997, which start with URL A, and contain B and C in that order, and that have a minimum support of 1% and a minimum confidence of 90%.

5 Web Usage Mining Architecture

We have developed a general architecture for Web usage mining which is presented in [13] and [36]. The WEBMINER is a system that implements parts of this general architecture. The architecture divides the Web usage mining process into two main parts. The first part includes the domain dependent processes of transforming the Web data into suitable transaction form. This includes preprocessing, transaction identification, and data integration components. The second part includes the largely domain independent application of generic data mining and pattern matching techniques (such as the discovery of association rules and sequential patterns) as part of the system's data mining engine. The overall architecture for the Web mining process is depicted in Figure 2.

Data cleaning is the first step performed in the Web usage mining process. Some low level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match so that any number of modules can be combined in any order, as the data analyst sees fit.

Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task. For instance, the format of the data for the association rule discovery task may be different from the format necessary for
mining sequential patterns. Finally, a query mechanism will allow the user (analyst) to provide more control over the discovery process by specifying various constraints. For more details on the WEBMINER system refer to [13, 36].

Figure 2: A General Architecture for Web Usage Mining

6 Research Directions

The techniques being applied to Web content mining draw heavily from the work on information retrieval, databases, intelligent agents, etc. Since most of these techniques are well known and reported elsewhere, we have focused on Web usage mining in this survey instead of Web content mining. In the following we provide some directions for future research.

6.1 Data Pre-Processing for Mining

Web usage data is collected in various ways, each mechanism collecting attributes relevant for its purpose. There is a need to pre-process the data to make it easier to mine for knowledge. Specifically, we believe that issues such as instrumentation and data collection, data integration and transaction identification need to be addressed.

Clearly improved data quality can improve the quality of any analysis on it. A problem in the Web domain is the inherent conflict between the analysis needs of the analysts (who want more detailed usage data collected), and the privacy needs of users (who want as little data collected as possible). This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard on collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with this can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time.

Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information which may not be evident from any one of them. Techniques from data integration [33] should be examined for this purpose.

Web usage data collected in various logs is at a very fine granularity. Therefore, while it has the advantage of being extremely general and fairly detailed, it also has the corresponding drawback that it cannot be analyzed directly, since the analysis may start focusing on micro trends rather than on the macro trends. On the other hand, the issue of whether a trend is micro or macro depends on the purpose of a specific analysis. Hence, we believe there is a need to group individual data collection events into groups, called Web transactions [12], before feeding them to the mining system. While [11, 12, 36] have proposed techniques to do so, more attention needs to be given to this issue.

6.2 The Mining Process

The key component of Web mining is the mining process itself. As discussed in this paper, Web mining has adapted techniques from the fields of data mining, databases, and information retrieval, as well as developing some techniques of its own, e.g. path analysis. A lot of work still remains to be done in adapting known mining techniques as well as developing new ones.

Web usage mining studies reported to date have mined for association rules, temporal sequences, clusters, and path expressions. As the manner in which the Web is used continues to expand, there is a continual need to figure out new kinds of knowledge about user
behavior that needs to be mined.

The quality of a mining algorithm can be measured both in terms of how effective it is in mining for knowledge and how efficient it is in computational terms. There will always be a need to improve the performance of mining algorithms along both these dimensions.

Usage data collection on the Web is incremental in nature. Hence, there is a need to develop mining algorithms that take as input the existing data, mined knowledge, and the new data, and develop a new model in an efficient manner.

Usage data collection on the Web is also distributed by its very nature. If all the data were to be integrated before mining, a lot of valuable information could be extracted. However, an approach of collecting data from all possible server logs is both non-scalable and impractical. Hence, there needs to be an approach where knowledge mined from various logs can be integrated together into a more comprehensive model.

6.3 Analysis of Mined Knowledge

The output of knowledge mining algorithms is often not in a form suitable for direct human consumption, and hence there is a need to develop techniques and tools for helping an analyst better assimilate it. Issues that need to be addressed in this area include usage analysis tools and interpretation of mined knowledge.

There is a need to develop tools which incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge. Section 4 provided a survey of the current literature in this area.

One of the open issues in data mining, in general, and Web mining, in particular, is the creation of intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these tools need to have specific knowledge about the particular problem domain to do any more than filtering based on statistical attributes of the discovered rules or patterns. In Web mining, for example, intelligent agents could be developed that, based on discovered access patterns, the topology of the Web locality, and certain heuristics derived from user behavior models, could give recommendations about changing the physical link structure of a particular site.

7 Conclusion

The term Web mining has been used to refer to techniques that encompass a broad range of issues. However, while meaningful and attractive, this very broadness has caused Web mining to mean different things to different people [21, 36], and there is a need to develop a common vocabulary. Towards this goal we proposed a definition of Web mining, and developed a taxonomy of the various ongoing efforts related to it. Next, we presented a survey of the research in this area and concentrated on Web usage mining. We provided a detailed survey of the efforts in this area, even though the survey is short because of the area's newness. We provided a general architecture of a system to do Web usage mining, and identified the issues and problems in this area that require further research and development.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.

[2] S. Agrawal, R. Agrawal, P.M. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conference, pages 506-521, Mumbai, India, 1996.

[3] R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell. Webwatcher: A learning apprentice for the world wide web. In Proc. AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 1995.

[4] M. Balabanovic, Yoav Shoham, and Y. Yun. An adaptive agent for automated web browsing. Journal of Visual Communication and Image Representation, 6(4), 1995.

[5] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. of 6th International World Wide Web Conference, 1997.

[6] C. M. Brown, B. B. Danzig, D. Hardy, U. Manber, and M. F. Schwartz. The harvest information discovery and access system. In Proc. 2nd International World Wide Web Conference, 1994.

[7] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proc. of 1996 ACM-SIGMOD Int. Conf. on Management of Data, 1996.

[8] P. Buneman, S. Davidson, and D. Suciu. Programming constructs for unstructured data. In Proceedings of ICDT'95, Gubbio, Italy, 1995.

[9] C. Chang and C. Hsu. Customizable multi-engine search tool with clustering. In Proc. of 6th International World Wide Web Conference, 1997.
[10] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The tsimmis project: Integration of heterogeneous information sources. In Proc. IPSJ Conference, Tokyo, 1994.

[11] M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web environment. In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.

[12] R. Cooley, B. Mobasher, and J. Srivastava. Grouping web page references into transactions for mining world wide web browsing patterns. Technical Report TR 97-021, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997.

[13] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern discovery on the world wide web. Technical Report TR 97-027, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997.

[14] R. B. Doorenbos, O. Etzioni, and D. S. Weld. A scalable comparison shopping agent for the world wide web. Technical Report 96-01-03, University of Washington, Dept. of Computer Science and Engineering, 1996.

[15] C. Dyreson. Using an incomplete data cube as a summary data sieve. Bulletin of the IEEE Technical Committee on Data Engineering, pages 19-26, March 1997.

[16] e.g. Software Inc. Webtrends. http://www.webtrends.com, 1995.

[17] W. B. Frakes and R. Baeza-Yates. Information Retrieval Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ, 1992.

[18] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In IEEE 12th International Conference on Data Engineering, pages 152-159, 1996.

[19] K. Hammond, R. Burke, C. Martin, and S. Lytinen. Faq-finder: A case-based approach to knowledge navigation. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press, 1995.

[20] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. In IEEE Transactions on Knowledge and Data Eng., volume 5, pages 29-40, 1993.

[21] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. Dmql: A data mining query language for relational databases. In SIGMOD'96 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, 1996.

[22] V. Harinarayan, A. Rajaraman, and J.D. Ullman. Implementing data cubes efficiently. In Proc. of 1996 ACM-SIGMOD Int. Conf. on Management of Data, pages 311-322, Montreal, Canada, 1996.

[23] Open Market Inc. Open market web reporter. http://www.openmarket.com, 1996.

[24] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.

[25] I. Khosla, B. Kuhn, and N. Soparkar. Database search using information mining. In Proc. of 1996 ACM-SIGMOD Int. Conf. on Management of Data, 1996.

[26] R. King and M. Novak. Supporting information infrastructure for distributed, heterogeneous knowledge discovery. In Proc. SIGMOD 96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, 1996.

[27] T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava. The information manifold. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press, 1995.

[28] D. Konopnicki and O. Shmueli. W3qs: A query system for the world wide web. In Proc. of the 21st VLDB Conference, pages 54-65, Zurich, 1995.

[29] M. Koster. Aliweb - archie-like indexing in the web. In Proc. 1st International Conference on the World Wide Web, pages 91-100, May 1994.

[30] C. Kwok and D. Weld. Planning to gather information. In Proc. 14th National Conference on AI, 1996.

[31] L. Lakshmanan, F. Sadri, and I. N. Subramanian. A declarative language for querying and restructuring the web. In Proc. 6th International Workshop on Research Issues in Data Engineering: Interoperability of Nontraditional Database Systems (RIDE-NDS'96), 1996.

[32] H. Vernon Leighton and J. Srivastava. Precision among WWW search services (search engines): Alta Vista, Excite, Hotbot, Infoseek, Lycos. http://www.winona.msus.edu/is-f/library-f/webind2/webind2.htm, 1997.

[33] E. Lim, S.Y. Hwang, J. Srivastava, D. Clements, and M. Ganesh. Myriad: design and implementation of federated database prototype. Software - Practice & Experience, 25(5):533-562, 1995.
[34] Y. S. Maarek and I.Z. Ben Shaul. Automatically organizing bookmarks per content. In Proc. of 5th International World Wide Web Conference, 1996.

[35] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the First Int'l Conference on Knowledge Discovery and Data Mining, pages 210-215, Montreal, Quebec, 1995.

[36] B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web mining: Pattern discovery from world wide web transactions. Technical Report TR 96-050, University of Minnesota, Dept. of Computer Science, Minneapolis, 1996.

[37] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.

[38] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. of the 20th VLDB Conference, pages 144-155, Santiago, Chile, 1994.

[39] K. A. Oostendorp, W. F. Punch, and R. W. Wiggins. A tool for individualizing the web. In Proc. 2nd International World Wide Web Conference, 1994.

[40] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the web: Going back and forth. In Proceedings of the Workshop on the Management of Semistructured Data (in conjunction with ACM SIGMOD), 1997.

[41] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & webert: Identifying interesting web sites. In Proc. AAAI Spring Symposium on Machine Learning in Information Access, Portland, Oregon, 1996.

[42] M. Perkowitz and O. Etzioni. Category translation: learning to understand information on the internet. In Proc. 15th International Joint Conference on AI, pages 930-936, Montreal, Canada, 1995.

[43] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the web. In Proc. of 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.

[44] J. Pitkow. In search of reliable usage data on the www. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, 1997.

[45] J. Pitkow and Krishna K. Bharat. Webviz: A tool for world-wide web access log analysis. In First International WWW Conference, 1994.

[46] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In International Conference on Deductive and Object Oriented Databases, 1995.

[47] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering of netnews. In Proc. of the 1994 Computer Supported Cooperative Work Conference, ACM, 1994.

[48] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st VLDB Conference, pages 432-443, Zurich, Switzerland, 1995.

[49] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proc. of 1995 Conference on Human Factors in Computing Systems (CHI-95), pages 210-217, 1995.

[50] A. Shukla, P.M. Deshpande, J. Naughton, and K. Ramaswamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. of the 22nd VLDB Conference, pages 522-531, Mumbai, India, 1996.

[51] E. Spertus. Parasite: mining structural information on the web. In Proc. of 6th International World Wide Web Conference, 1997.

[52] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.

[53] R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. K. Gifford. Hypursuit: a hierarchical network search engine that exploits content-link hypertext clustering. In Hypertext'96: The Seventh ACM Conference on Hypertext, 1996.

[54] S.M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, San Mateo, CA, 1991.

[55] M. R. Wulfekuhler and W. F. Punch. Finding salient features for personal web page categorization. In Proc. of 6th International World Wide Web Conference, 1997.

[56] O. R. Zaiane and J. Han. Resource and knowledge discovery in global information systems: A preliminary design and experiment. In Proc. of the First Int'l Conference on Knowledge Discovery and Data Mining, pages 331-336, Montreal, Quebec, 1995.
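
The support and confidence thresholds that appear in the association-rule examples and in the WEBMINER query of this survey can be illustrated with a minimal sketch. It assumes the standard definitions (support is the fraction of transactions containing both URLs; confidence is that count divided by the number of transactions containing the antecedent URL) and considers only rules between pairs of URLs. The function name `pair_rules`, the sample transactions, and the thresholds used in the example call are hypothetical and are not taken from WEBMINER.

```python
from itertools import permutations


def pair_rules(transactions, min_support=0.01, min_confidence=0.90):
    """Return (antecedent, consequent, support, confidence) tuples for URL pairs.

    Assumption: each transaction is the collection of URLs accessed by one
    client in one visit. The defaults mirror the 1% support and 90%
    confidence thresholds of the example query, expressed as fractions.
    """
    n = len(transactions)
    sets = [set(t) for t in transactions]
    urls = sorted(set().union(*sets))
    # Number of transactions containing each single URL.
    count = {u: sum(u in s for s in sets) for u in urls}
    rules = []
    for a, b in permutations(urls, 2):
        both = sum(a in s and b in s for s in sets)
        support = both / n
        confidence = both / count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical sample data: four client visits.
transactions = [
    ["/company/product1", "/company/product2"],
    ["/company/product1", "/company/product2", "/company/special"],
    ["/company/product1"],
    ["/company/special"],
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.6)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this sample, every visit that includes /company/product2 also includes /company/product1, so that rule has confidence 1.0, while the reverse rule holds for two of the three visits containing /company/product1. Enumerating all pairs is only practical for small inputs; as the survey notes, practical algorithms prune the search space by support.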