Manjunath T. N+
+
Ravindra S. Hegadi#
#Karnatak University, Dharwad, Karnataka, India
Archana R A++
++SJB Institute of Technology, Bangalore, Karnataka, India
Abstract:
Data warehousing is one of the current trends in computing environments and information technology
applications. A data warehouse is a system that extracts, cleans and delivers source data into a dimensional
data store and then supports querying and analysis for the purpose of decision making.
From a data warehouse, data flows to various departments for their customized decision support systems.
These individual departmental components are called data marts. A data mart is a set of dimensional
tables supporting a business process. Data marts contain all the atomic detail needed to support drilling down
to the lowest level. Most companies and organizations now have a website. Beneath each web site are
web logs that record every object either posted to or served from the web server. Web logs are important
because they reveal the user traffic on the web site. The activity of parsing web logs and storing the
results in a data mart to analyze customer activity is known as click stream data warehousing. The web
mart database schema is designed to make the underlying data structure more comprehensible to users
and to simplify the query process. The recommended approach for data warehouse data modeling is
dimensional modeling with a star schema. We explore the design and analysis of a web mart at a
detailed level and its relevance today.
Keywords: Data warehousing, ETL, Web log, Data mart, Web mart.
1.
The web mart database schema is designed to make the underlying data structure more comprehensible to
users and to simplify the query process. The recommended approach for data warehouse data modeling is
dimensional modeling with a star schema. The star schema has a central fact table with
dimension tables at the points of the star. The fact table's composite primary key requires a foreign key
field corresponding to the primary key field of each dimension table. The dimension tables are hierarchical and
thus highly denormalised [4]. A fact table is the primary table in the web mart and contains the business facts;
dimension tables are companion tables to the fact table that represent the business-critical dimensions and
contain their attributes. The central fact table lets users analyze the business facts, and the dimension
tables let users analyze those facts along the various business-critical dimensions [10].
Figure-1 presents the overall view of the click stream fact and the associated dimensions.
ISSN : 0975-5462
3141
[Fig-1: Star Schema of the Web Mart Design - Click Stream Fact and its Associated Dimensions. The central Click Stream Fact table is linked to the following dimensions: Date, Universal Date, TOD, Universal TOD, Status, OS, Visitor, Browser, Page, Geography, Session, Referrer, Object.]
2.
This section explains the analytical capabilities of the model by listing the base fact that the user can
analyze and the corresponding dimensions, which let the user drill up, drill down, slice, and dice on the
base fact. Before designing specific click stream data marts, one should collect as many dimensions as may
be relevant in a click stream environment. The dimensions unique to click stream data warehouses are page,
visitor, session and referral. The page dimension describes the page context for a web page event; its
attributes include page key, page source and page function. The visitor dimension gives details about the
visitor; its main attributes are UserId, CookieId, Operating System and Browser. The session dimension
provides one or more levels of diagnosis for the visitor's session as a whole. For example, the local context
of the session might be requesting product information, while the overall session context might be ordering
a product. The referral dimension describes how the customer arrived at the current page [9] [11].
2.1 Facts and Dimensions in the Web Mart
Table-1 lists the objects, i.e. the fact and the dimensions, available in the web mart for analysis.
Table name | Fact/Dimension | Levels
Click Stream Fact | Fact |
Universal Date | Dimension | Year, Quarter, Month, Week, Day
Universal TOD | Dimension | Period of the day, Hour, Minute, Second
Date | Dimension | Year, Quarter, Month, Week, Day
TOD | Dimension | Period of the day, Hour, Minute, Second
Visitor | Dimension | IP Address (or) Visitor Id (or) Cookie Id
Page | Dimension | (a) Object Type, File Type, Page Type, URL; (b) Domain, Site, Directory, URL
Session | Dimension | Session Type
Referrer | Dimension |
Status | Dimension |
Visit | Dimension |
Content Page | Dimension |
Description
Foreign key for the Universal Date dimension
Foreign key for the Universal TOD dimension
Foreign key for the Date dimension
Foreign key for the TOD dimension
Foreign key for the Visitor dimension
Foreign key for the Page dimension
Foreign key for the Session dimension
Foreign key for the Referrer dimension
Foreign key for the Status dimension
Foreign key for the Visit dimension
Foreign key for the Content Page dimension
The time spent, in seconds, by the visitor on a particular object such as a page or file
The number of bytes transferred to the client machine
Table-2: Click Stream Fact Table Description
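The fact table described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the dimension list follows Table-1, but every table and column name below (for example `click_stream_fact`, `time_spent_seconds`, `bytes_transferred`) is hypothetical, since the original field names are not available, and each dimension is reduced to a key plus a single label column.

```python
import sqlite3

# Dimension names follow Table-1; the table and column names are assumed.
DIMENSIONS = ["universal_date", "universal_tod", "date", "tod", "visitor",
              "page", "session", "referrer", "status", "visit", "content_page"]

def build_web_mart(conn):
    """Create one stub table per dimension plus the central fact table."""
    for dim in DIMENSIONS:
        conn.execute(
            f'CREATE TABLE "{dim}_dim" ("{dim}_key" INTEGER PRIMARY KEY, label TEXT)'
        )
    # One foreign key per dimension; together they form the composite primary key.
    fk_cols = ", ".join(
        f'"{d}_key" INTEGER NOT NULL REFERENCES "{d}_dim"("{d}_key")'
        for d in DIMENSIONS
    )
    pk_cols = ", ".join(f'"{d}_key"' for d in DIMENSIONS)
    conn.execute(
        f"CREATE TABLE click_stream_fact ({fk_cols}, "
        f"time_spent_seconds INTEGER, bytes_transferred INTEGER, "
        f"PRIMARY KEY ({pk_cols}))"
    )

conn = sqlite3.connect(":memory:")
build_web_mart(conn)
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(click_stream_fact)")]
foreign_keys = conn.execute("PRAGMA foreign_key_list(click_stream_fact)").fetchall()
```

The fact table ends up with eleven foreign key columns and the two measures, which mirrors the descriptions listed in Table-2.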
2.3 Dimensions
The dimension tables give users the ability to analyze the business measures along different dimensions by
drilling up, drilling down, slicing, and dicing with the attributes of the dimensions. Drilling down adds
detail rows to an existing request, which amounts to asking for more detail. Drilling up removes row headers,
which amounts to viewing the data in a more aggregated form. Slicing constrains the displayed data on an
attribute found in one dimension, and dicing constrains the displayed data on attributes in multiple
dimensions [4].
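The four operations can be illustrated with a small in-memory example. The rows, attribute names and the `aggregate` helper are hypothetical; the sketch assumes the fact rows have already been joined to their dimension attributes.

```python
from collections import defaultdict

# Hypothetical, already-joined fact rows.
COLUMNS = ("year", "month", "page_type", "country", "time_spent_seconds")
ROWS = [
    (2000, "January", "Product info", "India", 40),
    (2000, "January", "News", "USA", 25),
    (2000, "February", "Product info", "USA", 60),
    (2001, "January", "Product info", "India", 15),
]

def aggregate(rows, group_by, where=None):
    """Sum time spent per combination of the group_by attributes.

    `where` maps attribute name to required value (slice/dice constraints).
    """
    where = where or {}
    totals = defaultdict(int)
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if all(record[attr] == value for attr, value in where.items()):
            key = tuple(record[attr] for attr in group_by)
            totals[key] += record["time_spent_seconds"]
    return dict(totals)

by_year = aggregate(ROWS, ["year"])                       # drilled-up view
by_year_month = aggregate(ROWS, ["year", "month"])        # drill down: one more level
sliced = aggregate(ROWS, ["year"], {"country": "India"})  # slice: one dimension constrained
diced = aggregate(ROWS, ["year"],
                  {"country": "USA", "page_type": "Product info"})  # dice: two dimensions
```

Drilling down only lengthens the `group_by` list, while slicing and dicing only add constraints, which is why a single helper covers all four operations.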
2.3.1 Universal Date Dimension
Figure-2 presents the universal date dimension with all its attributes. This dimension facilitates analysis
along the calendar period with respect to Greenwich Mean Time. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. Date is the lowest grain the
user can drill down to, and year is the highest level the user can drill up to. The drill-down path can be
identified by following the arrows.
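[Figure-2: Universal Date dimension with its attributes. Labels recovered from the figure: YEAR, QUARTER, MONTH, DATE, MONTH NUMBER, WEEK NUMBER IN ..., WEEK OF THE ..., DAY OF THE ..., DAY NUMBER IN ...]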
The following table describes the structure of the Universal date dimension table
Field Name | Description | Values/Example
UniversalDatekey | Primary key for the dimension (Surrogate key) | 1,2
UniversalDate | Date | 25/01/2000
UniversalDayOfWeek | Day of the week | Sunday
UniversalDOWNumber | Day of week number | 1-7, Sunday being 1
UniversalWeekOfMonth | Week number in the month | 1-5
UniversalWeekOfYear | Week number in the year | 1-52
UniversalDayOfMonth | Day number in the month | 1-31
UniversalMonthNumber | Month of the year in number | 1-12
UniversalMonth | Month of the year | January
UniversalQuarter | Quarter of the year | 1-4
UniversalYear | Year | 2000
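As an illustration of how one row of this dimension could be derived from a calendar date, the following sketch fills the fields listed above. The function name `universal_date_row` is hypothetical, and the week-of-month and week-of-year definitions are assumptions, since the text does not define them.

```python
from datetime import date

DAY_NAMES = ("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
MONTH_NAMES = ("January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December")

def universal_date_row(key, d):
    """Build one Universal Date dimension row for the calendar date d."""
    # isoweekday() is Monday=1..Sunday=7; remap to Sunday=1..Saturday=7.
    dow_number = d.isoweekday() % 7 + 1
    return {
        "UniversalDatekey": key,
        "UniversalDate": f"{d.day:02d}/{d.month:02d}/{d.year}",
        "UniversalDayOfWeek": DAY_NAMES[dow_number - 1],
        "UniversalDOWNumber": dow_number,
        # Assumed definition: days 1-7 are week 1, days 8-14 are week 2, and so on.
        "UniversalWeekOfMonth": (d.day - 1) // 7 + 1,
        # Assumed definition: 7-day blocks from 1 January, capped at 52.
        "UniversalWeekOfYear": min((d.timetuple().tm_yday - 1) // 7 + 1, 52),
        "UniversalDayOfMonth": d.day,
        "UniversalMonthNumber": d.month,
        "UniversalMonth": MONTH_NAMES[d.month - 1],
        "UniversalQuarter": (d.month - 1) // 3 + 1,
        "UniversalYear": d.year,
    }

row = universal_date_row(1, date(2000, 1, 25))
```

Loading the dimension would then amount to calling this function for every date in the period of interest, with a running surrogate key.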
2.3.2 Universal TOD Dimension
Figure-3 presents the universal TOD (time of day) dimension with all its attributes. This dimension
facilitates analysis along the time of the day with respect to Greenwich Mean Time. Hierarchical attributes
are drawn with arrowhead connections and general attributes with straight-line connections. Second is the
lowest grain the user can drill down to, and period of the day is the highest level the user can drill up to.
The drill-down path can be identified by following the arrows [10] [1].
[Figure-3: Universal TOD Dimension. Hierarchy from highest to lowest level: PERIOD OF THE DAY, HOUR, MINUTE, SECOND]
Table-4 describes the structure of the Universal TOD dimension table.
Field Name | Description | Values/Example
UTODkey | Primary key for the dimension (Surrogate key) | 1,2
UniversalSecond | Second of a minute | 1-60
UniversalMinute | Minute of an hour | 1-60
UniversalHour | Hour of a day | 10-11, 11-12
UniversalTODPeriodofDay | Collection of hours in a day | Morning, Evening
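A time-of-day dimension at one-second grain has 86,400 rows and can be generated rather than loaded. The sketch below is a minimal illustration: the period boundaries in `period_of_day` are assumptions (the text only names Morning and Evening), and seconds and minutes are numbered 0-59 by clock convention rather than the 1-60 range shown in the table.

```python
def period_of_day(hour):
    """Map an hour (0-23) to a period label; the boundaries are assumed."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"

def universal_tod_rows():
    """Yield one row per second of the day, with surrogate keys from 1."""
    key = 0
    for hour in range(24):
        for minute in range(60):
            for second in range(60):
                key += 1
                yield {
                    "UTODkey": key,
                    "UniversalSecond": second,
                    "UniversalMinute": minute,
                    # Hour expressed as an interval, as in the table example "10-11".
                    "UniversalHour": f"{hour}-{hour + 1}",
                    "UniversalTODPeriodofDay": period_of_day(hour),
                }

rows = list(universal_tod_rows())
```

Because the rows are generated in clock order, the row for a given time sits at index `hour * 3600 + minute * 60 + second`.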
2.3.3 Date Dimension
Figure-4 presents the date dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. Date is the lowest grain the
user can drill down to, and year is the highest level the user can drill up to. The drill-down path can be
identified by following the arrows [10] [1].
[Figure-4: Date dimension with its attributes. Labels recovered from the figure: YEAR, QUARTER, MONTH, DATE, DAY OF THE WEEK, MONTH NUMBER]
Description | Values/Example
Primary key for the dimension (Surrogate key) | 1,2
Date | 25/01/2000
Day of the week | Sunday
Day of the week number | 1-7, 1 being Sunday
Week number in the month | 1-5
Week number in the year | 1-52
Day number in the month | 1-31
Month of the year in number | 1-12
Month of the year | January
Quarter of the year | 1-4
Year of the date | 2000
2.3.4 TOD Dimension
Figure-5 presents the TOD dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. Second is the lowest grain the
user can drill down to, and period of the day is the highest level the user can drill up to. The drill-down
path can be identified by following the arrows [10] [1].
[Figure-5: TOD dimension with its attributes. Hierarchy from highest to lowest level: PERIOD OF THE DAY, HOUR, MINUTE, SECOND]
Description | Values/Example
Primary key for the dimension (Surrogate key) | 1,2
Second of a minute | 1-60
Minute of an hour | 1-60
Hour of a day | 10-11, 11-12
Time of the day | 12:00:55; 19:15:25
Collection of hours in a day | Morning, Evening
2.3.5 Visitor Dimension
Figure-6 presents the visitor dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. User id, cookie id or domain
name is the lowest grain the user can drill down to, and country is the highest level the user can drill up
to. The lowest granularity is decided at the client site. The drill-down path can be identified by following
the arrows [10] [1].
[Figure-6: Visitor Dimension with its attributes. Labels recovered from the figure: COUNTRY, COOKIE ID, USER ID, DOMAIN NAME, BROWSER, OPERATING SYSTEM, DEMOGRAPHICS *]
* - Demographics are a collection of many fields. It is also possible to form a hierarchy in the demographic
information. Table-7 describes the structure of the visitor dimension table.
Field Name | Description | Values/Example
Visitorkey | Primary key for the dimension (Surrogate key) | 1,2
UserId | Identification of the Visitor (name or login user id) |
CookieId | Value of the Cookie | A string
IPAddress | IP address of the requesting client | 164.164.22.91
Country | The country of the visitor. Predicted from the domain of the visitor | USA, India
OperatingSystem | The name of the operating system with version | Windows NT 4.0, SCO Unix 7
Browser | | Netscape Navigator 4.6, Internet Explorer 5.1
*FirstName | | John, Philip
*LastName | | Smith, Jacob
*DateOfBirth | | 12/07/1076
*AgeGroup | | 18-25, 25-40
*Gender | | Male, Female
*Occupation | | Engineering, Computer related
*IncomeGroup | |
*ZipCode | |
*State | |
*VisitorCountry | | USA, India
* Optional fields. Collected from the web site visitor through registration forms.
2.3.6 Page Dimension
Figure-7 presents the page dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. URL is the lowest grain the
user can drill down to, and domain and object type are the highest levels the user can drill up to. The
drill-down path can be identified by following the arrows [10] [1].
[Figure-7: Page Dimension with its attributes. Labels recovered from the figure: DOMAIN, SITE, DIRECTORY, URL, OBJECT TYPE, FILE TYPE, PAGE TYPE, PAGE NAME, FILE NAME]
The following table-8 describes the various fields of the page dimension.
Field Name | Description | Values/Example
Pagekey | Primary key for the dimension (Surrogate key) | 1,2
URL | Full path of the page in the server | C:\..\index.html
PageName | Name of the web page | Welcome page, Product info page
PageType | | News pages, Jobs & Career pages
FileType | | Gif, au, ra, html
ObjectType | | Multimedia files, Application, Content pages
FileName | | Index.html, ProductInfo.html
Directory | | C:\Inetpub\doc
Site | |
Domain | |
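Several page attributes can be derived from the requested URL itself. The following sketch uses the standard urllib.parse and posixpath modules. It assumes HTTP-style URLs rather than the server file paths shown in the table, interprets Domain as the top-level domain, and uses a made-up `OBJECT_TYPES` lookup; all three are assumptions rather than the authors' definitions.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed mapping from file extension to object type; a real site defines its own.
OBJECT_TYPES = {
    "html": "Content pages",
    "gif": "Multimedia files",
    "au": "Multimedia files",
    "ra": "Multimedia files",
}

def page_attributes(url):
    """Derive page dimension attributes from a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    directory, file_name = posixpath.split(parts.path)
    file_type = posixpath.splitext(file_name)[1].lstrip(".").lower()
    return {
        "URL": url,
        "Domain": host.rsplit(".", 1)[-1],  # assumption: top-level domain
        "Site": host,
        "Directory": directory,
        "FileName": file_name,
        "FileType": file_type,
        "ObjectType": OBJECT_TYPES.get(file_type, "Other"),
    }

page = page_attributes("http://www.example.com/products/ProductInfo.html")
```

PageName and PageType are not derivable from the URL alone and would typically come from a site-maintained lookup.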
[Figure: Session dimension. Label recovered from the figure: SESSION DESCRIPTION]
* - Session parameters are a collection of fields that describe the conditions for characterizing a session
type.
The following table-9 describes the various fields of the session dimension.
Field Name | Description | Values/Example
Sessionkey | Primary key for the dimension (Surrogate key) | 1,2
SessionType | The type of the user session. A session is defined based on the business rules | Quick hit and gone, Product Ordering
SessionDescription | The description of a particular session |
SessionParameters | The parameters that characterize the particular session. They can be split into multiple fields, based on the business rules provided by the customer. | Ex: If Time Spent is in the range of 1-10 min and the pages visited are in general info or product info, then it is a Looking for Info session.
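The example rule in the table can be expressed as a small classifier. In this sketch only the "Looking for Info" rule comes from the table; the rules for "Product Ordering" and "Quick hit and gone", the page type labels, and the function name `classify_session` are assumptions made for illustration.

```python
INFO_PAGE_TYPES = {"General info", "Product info"}

def classify_session(time_spent_minutes, page_types):
    """Return a session type from the time spent and the page types visited."""
    visited = set(page_types)
    if "Order" in visited:                      # assumed rule
        return "Product Ordering"
    # Rule from the table: 1-10 minutes, only general or product info pages.
    if 1 <= time_spent_minutes <= 10 and visited and visited <= INFO_PAGE_TYPES:
        return "Looking for Info"
    if time_spent_minutes < 1:                  # assumed rule
        return "Quick hit and gone"
    return "Unclassified"
```

For instance, `classify_session(5, ["Product info", "General info"])` yields "Looking for Info", while a 30-minute session over other page types falls through to "Unclassified" until a business rule is defined for it.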
2.3.7 Referrer Dimension
Figure-9 presents the referrer dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. URL is the lowest grain the
user can drill down to, and domain and referrer type are the highest levels the user can drill up to. The
drill-down path can be identified by following the arrows [1].
[Figure: Referrer dimension with its attributes. Labels recovered from the figure: DOMAIN, SITE, URL, REFERRER TYPE, KEY WORD]
Description | Values/Example
Primary key for the dimension (Surrogate key) | 1,2
The URL of the referring page | C:\..\index.html
The Site of the referring page |
The domain of the referring page |
The keyword given by the user as search criteria to reach the page | Web mining, warehousing
The type of the referrer | Ad banner, Search engine
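The search keyword can often be recovered from the query string of the referring URL. The sketch below uses urllib.parse from the standard library. The parameter names in `SEARCH_PARAMS`, the output field names, and the rule that a referrer carrying a keyword is a search engine are all assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_PARAMS = ("q", "query", "p")  # assumed keyword parameter names

def referrer_attributes(referrer_url):
    """Derive referrer dimension attributes from the referring URL."""
    parts = urlsplit(referrer_url)
    query = parse_qs(parts.query)
    # Take the first value of the first known search parameter, if any.
    keyword = next((query[p][0] for p in SEARCH_PARAMS if p in query), None)
    return {
        "URL": referrer_url,
        "Site": parts.hostname,
        "KeyWord": keyword,
        # Crude assumption: a referrer carrying a keyword is a search engine.
        "ReferrerType": "Search engine" if keyword else "Other",
    }

ref = referrer_attributes("http://search.example.com/find?q=web+mining")
plain = referrer_attributes("http://ads.example.org/banner.html")
```

Distinguishing other referrer types such as ad banners would need additional rules, for example a list of known advertising hosts.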
2.3.8 Status Dimension
Figure-9 presents the status dimension with all its attributes. Hierarchical attributes are drawn with
arrowhead connections and general attributes with straight-line connections. Status id is the lowest grain
the user can drill down to, and status type is the highest level the user can drill up to. Status
description provides a description for the status id. The drill-down path can be identified by following
the arrows [10] [1].
[Figure: Status dimension with its attributes. Labels recovered from the figure: STATUS TYPE, STATUS ID, STATUS DESCRIPTION]
Field Name | Description | Values/Example
 | Primary key for the dimension (Surrogate key) | 1,2
 | The Status Code | 101, 201
 | Description of the Status | Successful, File not found error
StatusType | | File errors
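One simple way to populate a status type is to group HTTP status codes by their standard class, which is given by the first digit. This is a sketch under that assumption: the labels below are the standard HTTP class names, not the authors' own status types such as "File errors", and the function name `status_type` is hypothetical.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def status_type(status_code):
    """Classify an HTTP status code by its class (hundreds digit)."""
    return STATUS_CLASSES.get(status_code // 100, "Unknown")
```

With this grouping, drilling up from status id to status type collapses codes such as 404 and 403 into a single "Client error" row.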
2.3.9 Visit Dimension
This dimension has no hierarchy. It is used for identifying the start and end of a visit, as shown in
Table-12.
Field Name | Description | Values/Example
Visitkey | Primary key for the dimension (Surrogate key) | 1,2
Description | The value or description: 'Start' for the start of a visit and 'End' for the end-of-visit page | Start, End
2.3.10 Content Page Dimension
This dimension has no hierarchy. It is used for identifying whether a page is a content page or not, as
shown in Table-13.
Field Name | Description | Values/Example
ContentPagekey | Primary key for the dimension (Surrogate key) | 1,2
Description | 'Yes' to indicate a content page. 'No' to indicate other files | 'Yes', 'No'
4. Results
Now we can proceed to the interesting part of our data warehouse: retrieving information.
4.1 The average number of minutes from login to order
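A measure of this kind could be computed from click events roughly as follows. This is a minimal sketch on made-up data: the event tuples, the event names "login" and "order", and the function name are hypothetical, and only each visitor's first login and first subsequent order are considered.

```python
from datetime import datetime

# Hypothetical click events: (visitor_id, event, timestamp)
EVENTS = [
    ("v1", "login", datetime(2000, 1, 25, 10, 0, 0)),
    ("v1", "order", datetime(2000, 1, 25, 10, 12, 0)),
    ("v2", "login", datetime(2000, 1, 25, 19, 15, 0)),
    ("v2", "order", datetime(2000, 1, 25, 19, 23, 0)),
    ("v3", "login", datetime(2000, 1, 25, 20, 0, 0)),  # never ordered
]

def average_minutes_login_to_order(events):
    """Average gap in minutes between first login and first later order."""
    first_login, first_order = {}, {}
    for visitor, event, ts in sorted(events, key=lambda e: e[2]):
        if event == "login":
            first_login.setdefault(visitor, ts)
        elif event == "order" and visitor in first_login:
            first_order.setdefault(visitor, ts)
    gaps = [(first_order[v] - first_login[v]).total_seconds() / 60
            for v in first_order]
    return sum(gaps) / len(gaps) if gaps else None

average = average_minutes_login_to_order(EVENTS)
```

Visitors who logged in but never ordered are excluded from the average rather than counted as zero.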
4.2 The average number of days from first being invited to the site by email to the first order.
5. Conclusions
Understanding the behavior of users on your website is as valuable as following a customer around a store and
recording his or her every move. Imagine how much better organized your store could be, and how many more
opportunities you would have to sell merchandise, if you knew every move customers make while navigating your
store. The ETL process in click stream data warehousing differs significantly from that of any other source
you are likely to encounter.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
Ralph Kimball, The Data Warehouse ETL Toolkit, Wiley India Pvt Ltd., 2006
Dr. K.V.K.K Prasad, Data warehouse Development Tools, Dreamtech Press, 2006
White paper by Vivek R Gupta, Senior Consultant, System Services Corporation,An Introduction to data warehousing.
Manjunath T.N, Ravindra S Hegadi, Ravikumar G K."Analysis of Data Quality Aspects in DataWarehouse Systems", International
Journal of Computer Science and Information Technologies, Vol. 2 (1), 2010, 477-485
Manjunath T.N, Ravindra S Hegadi, Ravikumar G K." A Survey on Multimedia Data Mining and Its Relevance Today",
International journal of Computer Science and Network Security. Vol. 10 No. 11 pp. 165-170.
Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, Multimedia Data Mining in Digital Libraries: Standards and
Features in Proc. READIT-2007, p. 54.
Shu-Ching Chen, Mei-Ling Shyu, Chengcui Zhang, and Jeff Strickrott, "Multimedia Data Mining for Traffic Video Sequences,"
Proceedings of the Second International Workshop on Multimedia data Mining MDM/KDD'2001), in conjunction with the Seventh
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 78-85, August 26, 2001, San Francisco, CA,
USA.
Valery A. Petrushin and Latifur Khan, Multimedia Data Mining and Knowledge Discovery, 2007 - London: Springer-Verlag, pp.
3- 17
S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Multimedia Mining, WSEAS Transactions on Systems, Issue 10, Volume 3, December
2004, pp. 3263-3268
Sanjiv Purba Data Management Handbook Published by CRC Press, 1999
Bhavani M. Thuraisingham, Data Management Systems: Evolution and Interoperation, Published by CRC Press, 1997
Jiawei Han, Micheline Kamber Data Mining: Concepts and Techniques Published by Morgan Kaufmann, 2001
Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, Multimedia Data Mining in Digital Libraries: Standards and Features,
ACVIT- 07, Dr. Babasaheb Ambedkar MarathWada University, Aurangabad,MS-India
Mori Y, Takahashi H, Oka R. Image-to-word transformation based on dividing and vector quantizing images with words. In:
MISRM'99 First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
Ordonez C, Omiecinski E. Discovering association rules based on image content. In: ADL 99: Proceedings of the IEEE Forum on
Research and Technology Advances in Digital Libraries. Washington, DC: IEEE Computer Society; 1999, p.38.
Chakrabarti, S. (2000): Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2), pp. 1-11.
Ravikumar G K, Manjunath T. N, Ravindra S. Hegadi, Umesh I.M, Cross Industry Survey on Data mining Applications, International
Journal of Computer Science and Information Technologies, Vol. 2 (2) , 2011, 624-628.
Authors Profile
Ravikumar G K received his Bachelor's degree from Siddaganga Institute of Technology, Tumkur (Bangalore
University) in 1996 and his M.Tech in Systems Analysis and Computer Application from Karnataka
Regional Engineering College, Surathkal (NITK) in 2000. He is currently working towards his PhD
in the area of data mining. He has published several papers in international and national
conferences. He has around 14 years of professional experience, spanning the software industry and
teaching. His areas of interest are data warehousing and business intelligence, multimedia and
databases.
Manjunath T N received his Bachelor's degree in Computer Science and Engineering from Bangalore
University, Bangalore, Karnataka, India in 2001 and his M.Tech in Computer Science and Engineering
from VTU, Belgaum, Karnataka, India in 2004. He is currently pursuing a Ph.D at Bharathiar
University, Coimbatore. He has 10 years of industry and teaching experience. His areas of interest
are data warehousing and business intelligence, multimedia and databases. He has published and presented papers
in journals and in international and national conferences.
Dr. Ravindra S Hegadi received his Master of Computer Applications (MCA), M.Phil and Doctorate of
Philosophy (Ph.D) in computer science from Gulbarga University, Karnataka, in 2007. He has 15
years of experience. He has visited various universities overseas as an SME. His areas of interest are image
mining, image processing, databases and business intelligence. He has published and presented papers in
journals and in international and national conferences.
Archana R A received her Bachelor's degree in Computer Science and Engineering from VTU, Belgaum,
Karnataka, India in 2007 and her Master of Technology in computer science from
VTU, Belgaum, Karnataka, India in 2010. She works at SJB Institute of Technology, Bangalore, Karnataka, India and
has 3 years of experience. Her areas of interest are image mining, image processing, databases and
business intelligence. She has published and presented papers in journals and in international and national
conferences.