
Data Warehousing

for BS(CS)

Muhammad Salman Khan

Department of Computer Science


Course Material
 Course Book
 Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley
& Sons Inc., NY.
 Reference Books
 W.H. Inmon, Building the Data Warehouse (Second Edition),
John Wiley & Sons Inc., NY.
 Ralph Kimball and Margy Ross, The Data Warehouse Toolkit
(Second Edition), John Wiley & Sons Inc., NY.

2
Assignments
 Implementing a data warehouse in four phases
 In first phase, the students are required to develop a relational
model for the DBLP database and create this model physically in
a DBMS.
 Students are then required to write extraction routines and insert
the DBLP data in the database.
 In the third phase, students will create a dimensional model of
the above mentioned database/ data model
 In the fourth phase, students will create MOLAP or ROLAP
cubes of the dimensional model to view data for data analysis.

3
Course Introduction

[Figure: Three-tier data warehouse architecture. Information sources (operational DBs, semistructured sources) feed the data warehouse server (Tier 1) through extract, transform, load, and refresh; OLAP servers such as MOLAP and ROLAP form Tier 2; clients for analysis, query/reporting, and data mining form Tier 3. Data marts sit alongside the warehouse.]
4
Operational Sources (OLTP’s)
 Operational computer systems provide information to run day-to-day operations and answer day-to-day questions
 Also called online transactional processing system (OLTP)
 Data is read or manipulated with each transaction
 Transactions/queries are simple, and easy to write
 Usually for middle management
 Examples
 Sales systems

 Hotel reservation systems

 HRM Applications

 General-Ledger

 Payroll

 Etc.

5
Typical decision queries
 Data sets are mounting everywhere, but they are not directly useful for decision support
 Strategic Information = enterprise-wide essential data
 People need timely strategic information to make appropriate decisions and gain competitive advantage
 Decision-making requires answering complex questions over integrated data.
 Decision makers want to know:
 Where to build a new oil warehouse?

 Which markets should they strengthen?

 Which customer groups are most profitable?

 What are the total sales by month/quarter/year for each office?
 Is there any relation between promotion campaigns and sales growth?
 Can OLTP answer all such questions efficiently?

6
Characteristics of strategic information
 Integrated
 Must have a single, enterprise-wide view

 Data Integrity
 Information must be accurate and must conform to business
rules
 Accessible
 Easily accessible with intuitive access paths and responsive
for analysis
 Credible
 Every business factor must have one and only one value

 Timely
 Information must be available within the stipulated time frame

7
Information Crisis

 Despite huge volumes of data, organizations lack usable information


 Two major reasons
 Huge amounts of data are available on disparate platforms and in diverse forms, whereas the data must be available in a format that helps in spotting trends and patterns

 The inability of information technology / information systems to turn this data into usable form

8
History of Decision Support Systems
(DSS)
 Ad-Hoc Reports: IT created a small program for each required report.
 Special Extract Programs: A collection of programs written for anticipated reports.
 Small Applications: Small programs with the option of taking parameters as input from the user.
 IT centers: Special places where users could go to ask for pre-generated reports. IT staff present there helped users obtain information.
 Decision Support Systems: More sophisticated systems; menu-driven, online processing, printable reports, but still built on extracted files.
 Executive Information Systems: Simple reports on the user's desktop, but again preprogrammed reports.
9
Failure of old DSS (cont.)
 Inability to provide strategic information
 IT receives too many ad hoc requests, creating a large overload
 Requests are not only numerous, they change over time
 More understanding leads to requests for more reports
 Users are caught in a spiral of reports
 Users have to depend on IT for information
 Cannot provide enough performance; slow
 Strategic information has to be flexible (not strictly predefined) and conducive to guided analysis

10
Failure of old DSS (Cont.)
 A typical spiral of user needs and IT efforts

11
OLTP vs. DSS
 OLTP (Get the data in)

 DSS (Get the information out)

12
OLTP vs. DSS (cont.)
Trait | OLTP | DSS
User | Middle management | Executives, decision-makers
Function | Day-to-day operations | Analysis & decision support
DB (modeling) | E-R based, normalized | Star-oriented schemas
Data | Current, isolated | Archived, derived, summarized
Unit of work | Transactions | Complex queries
Access type | DML, read | Read
Access frequency | Very high | Medium to low
Records accessed | Tens to hundreds | Thousands to millions
Number of users | Thousands | Very small number
Usage | Predictable, repetitive | Ad hoc, random, heuristic-based
DB size | 100 MB - GB | 100 GB - TB
Response time | Sub-second | Up to minutes

13
Expectations of the new solution

 DB designed for analytical tasks


 Data from multiple applications
 Easy to use
 Ability to perform what-if analysis
 Read-intensive data usage
 Direct interaction with system, without IT assistance
 Periodically updated contents, otherwise stable
 Current & historical data
 Ability for users to initiate reports

14
DW meets expectations

 Provides enterprise view


 Current & historical data available
 Decision-support processing possible without affecting operational sources
 Reliable source of information
 Ability for users to initiate reports
 Acts as a data source for all analytical applications

15
Definition of DW

Inmon's definition:
“A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making.”
Kelly's definition:
“Separately available, integrated, time-stamped, subject-oriented, non-volatile, accessible”

Four properties of DW

16
Subject-oriented

 In operational sources data is organized by applications, or


business processes.
 In a DW, data is organized by subject
 Subjects vary with the enterprise
 They are the critical factors that affect business performance
 Example of Manufacturing Company
 Sales
 Shipment
 Inventory etc

17
Integrated Data

 Data comes from several applications


 Problems of integration come into play
 File layouts, encodings, field names, schemas, and data heterogeneity are the issues
 Bank example of variance: naming conventions, attributes for a data item, account number, account type, size, currency
 In addition to internal sources, there are external data sources
 Data shared by external companies
 Websites
 Others
 Inconsistencies must be removed
 Hence the process of extraction, transformation & loading (ETL)

18
Time variant

 Operational data has current values


 Comparative analysis is one of the best techniques for business
performance evaluation
 Time is critical factor for comparative analysis
 Every data structure in DW contains time element
 In order to promote product in certain area, analyst has to know
about current and historical values
 The advantages are
 Allows for analysis of the past
 Relates information to the present
 Enables forecasts for the future

19
Non-volatile
 Data from operational systems are moved into DW after specific
intervals
 Data is persistent, i.e., non-volatile; it is not removed
 Individual business transactions do not update the DW
 Data in the DW is not deleted
 Nor is it changed by individual transactions

20
Architecture of DW

[Figure: Three-tier DW architecture as before, here showing the staging area between the information sources and the data warehouse, with data marts alongside the warehouse.]

21
Components

 Major components
 Source data component
 Data staging component
 Data storage component
 Information delivery component
 Metadata component
 Management and control component

22
1. Source Data Component
 Source data can be grouped into four categories
 Production data
 Comes from operational systems of enterprise
 Some segments are selected from it depending on requirements
 Narrow scope, e.g. order details
 Internal data
 Private datasheet, documents, customer profiles etc.
 E.g. Customer profiles for specific offering
 Special strategies to transform ‘it’ to DW (text document)
 Archived data
 Old data is archived from operational systems, perhaps into a separate archival database that is still online, into flat files, or, for even older data, onto tapes
 DW have snapshots of historical data
 External data
 Executives depend upon external sources
 E.g. market data of competitors, market indicators.
 For example, car rental business would like to have information
about the production schedule of car making companies to make
strategic decisions
 But integration requires conformance with your data

23
Architecture of DW

[Figure repeated: three-tier DW architecture with staging area and data marts, highlighting the source data component.]

24
2. Data Staging Component
 After data is extracted, data is to be prepared
 Data extracted from sources needs to be
changed, converted and made ready in
suitable format
 Three major functions to make data ready
 Extract
 Transform
 Load

25
Architecture of DW

Information Sources Data Warehouse OLAP Servers Clients


Server (Tier 2) (Tier 3)
(Tier 1)
e.g., MOLAP
Semistructured Analysis
Sources Data serve
Warehouse
extract Query/Reporting
transform
load serve
refresh
e.g., ROLAP
Operational
Data Mining
DB’s serve

Staging area Data Marts

26
3. Data Storage Component
 Separate repository
 Data structured for efficient processing
 Redundancy is increased
 Updated after specific periods
 Read-only access

27
Architecture of DW

Information Sources Data Warehouse OLAP Servers Clients


Server (Tier 2) (Tier 3)
(Tier 1)
e.g., MOLAP
Semistructured Analysis
Sources Data serve
Warehouse
extract Query/Reporting
transform
load serve
refresh
e.g., ROLAP
Operational
Data Mining
DB’s serve

Staging area Data Marts

28
4. Information Delivery Component
 Depending on the type of users, different methods and means of information delivery are used.

29
Meta-data component
 It is information about the data in the data warehouse
 Three kinds of meta-data
 Operational: Contains information about the mappings
between source systems and data warehouse, e.g., field
length, data types

 Extraction and Transformation meta-data: For example,


extraction frequencies, and extraction methods, and data
transformations

 End User meta-data: Support users to navigate information


using their own terminologies

30
Management and Control Component
 It coordinates all activities and components involved in the data warehouse environment. It is usually part of the data warehouse DBMS.

31
Difference between Data Warehouse
and Data Marts
 The terms are sometimes used synonymously; strictly speaking, however, they differ in scale
 A data warehouse is enterprise-wide.
 Data marts are local, at departmental level, or targeted at a particular group of users.
 The main debate is whether to build the data warehouse first (top-down approach) or the data marts first (bottom-up approach). Both approaches have their advantages and disadvantages.

32
Top-down approach

 Advantages:
 Single enterprise wide integrated data
 Disadvantages:
 Takes too long to build
 High chances of failure
 Requires experienced professionals
 Senior management is not likely to see the results
immediately

33
Bottom-up approach
 Advantages
 Faster and easier implementations
 Faster return on investment and proof of concept
 Less risk of failure
 Inherently incremental (can prioritize which data
mart to build first)
 Disadvantages:
 Narrow view of data in each data mart
 Spread redundancy in data marts
 Can cause inconsistent data
34
Practical approach proposed by Ralph
Kimball
 Compromise between top-down and bottom-up
approaches
 First define the requirements of the enterprise

 Design the architecture of the whole warehouse

 Identify data content for each data mart

 Start implementing data marts carefully

 Ensure data remain consistent across all data marts in terms of field lengths, data types, and precision


 The data warehouse thus will be a union of all
data marts

35
DW Design

36
Designing DW

[Figure repeated: three-tier DW architecture with staging area and data marts, shown again as the design context.]

37
Background (ER Modeling)
 For ER modeling, entities are collected from
the environment
 Each entity acts as a table
 Success reasons
 Normalization after ER removes redundancy (to handle update/delete anomalies)
 But the number of tables is increased
 It is useful for fast access of small amounts of data

38
ER Drawbacks for DW / Need of Dimensional
Modeling
 ER models are hard to remember due to the increased number of tables
 Complex for queries involving multiple tables (table joins)
 Ideally no calculated attributes
 The DW does not update data the way an OLTP system does, so there is no need for normalization
 An efficient indexing scheme is needed to avoid scanning all the data

39
Dimensional Modeling
 Dimensional Modeling focuses on subject orientation and the critical factors of the business
 Critical factors are stored in facts
 Redundancy is not a problem; it is accepted to achieve efficiency
 A logical design technique for high performance
 It is the modeling technique used for DW storage
40
Dimensional Modeling (cont.)

 Two important concepts


 Fact
 Numeric measurements, represent business activity/event
 Are pre-computed, redundant
 Example: Profit, quantity sold
 Dimension
 Qualifying characteristics, perspective to a fact
 Example: date (Date, month, quarter, year)

41
Dimensional Modeling (cont.)
 Facts are stored in fact table
 Dimensions are represented by dimension
tables
 Each fact is surrounded by dimension tables
 The layout looks like a star, hence the name Star Schema

42
Example
[Figure: Star schema example]
TIME: time_key (PK), SQL_date, day_of_week, month
PRODUCT: product_key (PK), SKU, description, brand, category
STORE: store_key (PK), store_ID, store_name, address, district, floor_type
CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
43
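A minimal SQL sketch of part of this star schema follows; the table and column names mirror the figure above, while the data types and the use of inline foreign keys are assumptions for illustration only.

CREATE TABLE time_dim (
    time_key     INT PRIMARY KEY,   -- system-generated surrogate key
    sql_date     DATE,
    day_of_week  VARCHAR(10),
    month        VARCHAR(10)
);

CREATE TABLE product_dim (
    product_key  INT PRIMARY KEY,   -- system-generated surrogate key
    sku          VARCHAR(20),
    description  VARCHAR(100),
    brand        VARCHAR(50),
    category     VARCHAR(50)
);

CREATE TABLE sales_fact (
    time_key     INT REFERENCES time_dim (time_key),
    product_key  INT REFERENCES product_dim (product_key),
    -- store_key, clerk_key, customer_key, promotion_key would follow the same pattern
    dollars_sold DECIMAL(10,2),
    units_sold   INT,
    dollars_cost DECIMAL(10,2),
    PRIMARY KEY (time_key, product_key)   -- concatenated key of dimension FKs
);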
Inside Dimensional Modeling
 Inside Dimension table
 Key attribute of dimension table, for
identification
 Large number of columns (wide table)
 Non-calculated attributes, textual attributes
 Attributes are not directly related (e.g., brand
and package size)
 Un-normalized in Star schema
 Drill-down and roll-up are two ways of exploiting dimensions
 Can have multiple hierarchies (product
category for marketing and product category
for accounting)
 Relatively small number of records 44
Inside Dimensional Modeling
 Have two types of attributes
 Key attributes, for connections
 Facts
 Inside fact table
 Concatenated key
 Grain or level of data identified
 Large number of records
 Limited attributes
 Sparse data set
 Degenerate dimensions (e.g., order number, which can be used to compute average products per order)
 Factless fact tables
45
Star Schema Keys
 Surrogate keys in Dimension tables
 Replacement of primary key
 System generated
 Foreign keys in Fact tables
 Collection of primary keys of dimension tables
 Primary key in fact table
 Collection of PKs coming from the dimension tables
 May be a degenerate dimension
 May be a system-generated surrogate key

46
Advantage of Star Schema
 Ease for users to understand
 Optimized for navigation (fewer joins → faster)
 Most suitable for query processing (drill-
down, roll-up)
 Special techniques for join and indexing
for further query optimization

47
Normalization [1]

 “It is the process of decomposing relational tables into smaller tables.”
 Normalization Goals:
1. Remove data redundancy
2. Storing only related data in a table (data
dependency makes sense)
 5 Normal Forms
 The decomposition must be lossless

48
1st Normal Form [2]
 “A relation is in first normal form if and only if
every attribute is single-valued for each tuple”
STU_ID | STU_NAME     | MAJOR        | CREDITS | CATEGORY
S1001  | Tom Smith    | History      | 90      | Comp
S1003  | Mary Jones   | Math         | 95      | Elective
S1006  | Edward Burns | CSC, Math    | 15      | Comp, Elective
S1010  | Mary Jones   | Art, English | 63      | Elective, Elective
S1060  | John Smith   | CSC          | 25      | Comp

49
1st Normal Form (Cont.)
STU_ID | STU_NAME     | MAJOR   | CREDITS | CATEGORY
S1001  | Tom Smith    | History | 90      | Comp
S1003  | Mary Jones   | Math    | 95      | Elective
S1006  | Edward Burns | CSC     | 15      | Comp
S1006  | Edward Burns | Math    | 15      | Elective
S1010  | Mary Jones   | Art     | 63      | Elective
S1010  | Mary Jones   | English | 63      | Comp
S1060  | John Smith   | CSC     | 25      | Comp
50
Another Example (composite key:
SID, Course) [1]

51
1st Normal Form Anomalies [1]
 Update anomaly: Need to update all six rows for the student with ID = 1 if we want to change his location from Islamabad to Karachi
 Delete anomaly: Deleting the information
about a student who has graduated will
remove all of his information from the
database
 Insert anomaly: For inserting the information
about a student, that student must be
registered in a course
52
Solution  2nd Normal Form

 “A relation is in second normal form if and only if it is in first normal form and all the nonkey attributes are fully functionally dependent on the key” [2]
 In the previous example, the functional dependencies are [1]:
SID → Campus, Degree
Campus → Degree
(SID, Course) → Marks
53
Example in 2nd Normal Form [1]

54
Anomalies [1]

 Insert Anomaly: Cannot enter a program (for example, PhD at the Peshawar campus) unless a student gets registered
 Delete Anomaly: Deleting a row from the “Registration” table will delete all information about a student as well as the degree program

55
Solution  3rd Normal Form

 “A relation is in third normal form if it is in second normal form and no nonkey attribute is transitively dependent on the key” [2]
 In the previous example: [1]

Campus → Degree

56
Example in 3rd Normal Form [1]

57
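The 3NF result appears only in the original figure; the following is a hedged SQL sketch of one plausible decomposition of the SID/Campus/Degree/Course/Marks example into 3NF (table and column names and data types are assumptions, not taken from the slide).

-- Campus → Degree: moved into its own table to remove the transitive dependency
CREATE TABLE campus (
    campus  VARCHAR(30) PRIMARY KEY,
    degree  VARCHAR(30)
);

-- SID → Campus
CREATE TABLE student (
    sid     INT PRIMARY KEY,
    campus  VARCHAR(30) REFERENCES campus (campus)
);

-- (SID, Course) → Marks
CREATE TABLE registration (
    sid     INT REFERENCES student (sid),
    course  VARCHAR(30),
    marks   INT,
    PRIMARY KEY (sid, course)
);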
Denormalization [1]

 “Denormalization is the process” of selectively transforming normalized relations into un-normalized form with the intention of “reducing query processing time”
 The purpose is to reduce the number of tables and thus the number of joins in a query

58
Five techniques to denormalize
relations [1]
 Collapsing tables
 Pre-joining
 Splitting tables (horizontal, vertical)
 Adding redundant columns
 Derived attributes

59
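As a hedged illustration of two of these techniques, pre-joining and adding redundant columns, the sketch below uses assumed tables (orders, order_details, products) and assumed column names; exact DDL syntax such as ADD COLUMN varies slightly by DBMS.

-- Pre-joining: store the join of orders and order_details as one wide table
CREATE TABLE order_facts AS
SELECT o.order_id,
       o.order_date,
       o.customer_id,
       d.product_id,
       d.quantity,
       d.unit_price
FROM   orders o
JOIN   order_details d ON d.order_id = o.order_id;

-- Adding a redundant column: copy product_name into the detail table
-- so that reporting queries no longer need to join to the product table
ALTER TABLE order_details ADD COLUMN product_name VARCHAR(100);

UPDATE order_details d
SET    product_name = (SELECT p.product_name
                       FROM   products p
                       WHERE  p.product_id = d.product_id);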
Collapsing tables (one-to-one) [1]

For example, Student_ID, Gender in Table 1 and


Student_ID, Degree in Table 2
60
Pre-joining [1]

61
Splitting tables [1]

62
Redundant columns [1]

63
Derived Attributes

 Derived attributes are pre-calculated or complex measures.
 They reduce query processing time
 Examples include age (current date – date of
birth), previously failed courses, etc.

64
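A small sketch of storing such a derived attribute at load time is shown below; the customer_dim table and its columns are assumptions, and date functions vary by DBMS.

-- Store age as a pre-calculated column so queries need not compute it repeatedly
ALTER TABLE customer_dim ADD COLUMN age INT;

-- Approximate age in years (ignores month/day of birth for simplicity)
UPDATE customer_dim
SET    age = EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM date_of_birth);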
Updates to Dimension Tables

65
Updates to Dimension Tables (Cont.)

 Type-I changes: correction of errors, e.g.,


customer name changes from Sulman
Khan to Salman Khan
 Solution to type-I updates:
 Simply update the corresponding
attribute/attributes. There is no need to
preserve their old values

66
Updates to Dimension Tables (Cont.)

 Type 2 changes: preserving history


 For example, a change in the “address” of a customer: if the user wants to see orders by geographic location, you cannot simply overwrite the old address with the new value. You need to preserve the history (the old value) as well as insert the new value (see the sketch after the next slide)

67
Updates to Dimension Tables (Cont.)
 Proposed solution:

68
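The proposed-solution figure is not reproduced in the text; a common Type 2 approach is sketched below, assuming a customer dimension with a surrogate key, a source-system (natural) key, and effective-date/current-flag columns. All names and values are illustrative assumptions.

-- Close the current row for the customer whose address changed
UPDATE customer_dim
SET    current_flag = 'N',
       effective_to = CURRENT_DATE
WHERE  customer_id  = 1001        -- source-system (natural) key
AND    current_flag = 'Y';

-- Insert a new row with a new surrogate key and the new address
INSERT INTO customer_dim
       (customer_key, customer_id, customer_name, address,
        effective_from, effective_to, current_flag)
VALUES (98765, 1001, 'Salman Khan', 'New address, Karachi',
        CURRENT_DATE, NULL, 'Y');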
Updates to Dimension Tables (Cont.)

 Type 3 changes: When you want to compare


old and new values of attributes for a given
period
 Please note that in Type 2 changes the old
values and new values were not comparable
before or after the cut-off date (when the
address was changed)

69
Updates to Dimension Tables (Cont.)
Solution: Add a new column to hold the old value of the attribute

70
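A minimal Type 3 sketch, assuming we only need to compare the old and new address values around the change date; all names are assumptions.

-- Keep both values in the same row: add an "old" column and a change date
ALTER TABLE customer_dim ADD COLUMN old_address VARCHAR(200);
ALTER TABLE customer_dim ADD COLUMN address_change_date DATE;

UPDATE customer_dim
SET    old_address         = address,          -- preserve the previous value
       address             = 'New address',    -- overwrite with the new value
       address_change_date = CURRENT_DATE
WHERE  customer_id = 1001;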
Updates to Dimension Tables (Cont.)

 What if we want to keep the whole history of changes?


 Should we add a large number of columns to handle it?

71
Rapidly Changing Dimension

 When a dimension has a very large number of records/rows and changes are required frequently, Type 2 change handling is not recommended


 It is recommended to move the rapidly changing attributes into a separate table

72
Rapidly Changing Dimension (Cont.)
 “For example, an important attribute for customers
might be their account status (good, late, very late,
in arrears, suspended), and the history of their
account status” [4]
 “If this attribute is kept in the customer dimension
table and a type 2 change is made each time a
customer's status changes, an entire row is added
only to track this one attribute” [4]
 “The solution is to create a separate account_status
dimension with five members to represent the
account states” [4] and join this new table or
dimension to the fact table.
73
Example

74
Junk Dimensions
 Sometimes there are informative flags and texts in the source system, e.g., yes/no flags, textual codes, etc.
 If such flags are important, combine them into a dimension of their own (a junk dimension) to save storage space
 However, use caution: if there is a large number of combinations, make separate dimensions instead. For example, 5 attributes with 100 distinct values each would give 100^5 = 10 billion rows in a single combined dimension. In this case make a separate dimension for each attribute

75
Junk Dimension Example [3]

76
Junk Dimension Example (Cont.) [3]

77
The Snowflake Schema
 Snowflaking involves normalizing the dimensions of a Star Schema
 Reasons:
 To save storage space
 Easier maintenance of tables with low cardinality
 However, some designers discourage it; according to them it is better to sacrifice storage space than performance and ease of use.
78
Example 1 of Snowflake Schema

79
Example 2 of Snowflake Schema

80
Aggregate Fact Tables

 Use aggregate fact tables when too many rows of the fact tables are involved in producing the required summary results
 Objective is to reduce query processing time

81
Example

Total Possible Rows = 1825 * 300 * 4000 * 1 = 2 billion


82
Solution

 Make aggregate fact tables: some dimensions are summarized while others are not, so there is no need to store every dimension at the highest level of granularity (see the sketch below).
 For example: sales of a product in a year, OR the total number of items sold by category on a daily basis

83
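A hedged sketch of building one such aggregate fact table (daily items sold by category) follows, reusing the sales_fact, product_dim, and time_dim tables sketched earlier; those names are assumptions.

-- Summarize the base fact to category/day grain
CREATE TABLE sales_by_category_daily AS
SELECT p.category,
       t.sql_date,
       SUM(f.units_sold)   AS units_sold,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   time_dim    t ON t.time_key    = f.time_key
GROUP  BY p.category, t.sql_date;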
A way of making aggregates
Example:

84
Making Aggregates

 First determine what is required from your data warehouse, then build the aggregates

85
Example of Aggregate Fact Table

86
Families of Stars

87
Families of Stars (Cont.)
 Transaction tables (day-to-day data) and snapshot tables (data captured at specific intervals)

88
Families of Stars (Cont.)
 Core and custom tables

89
Families of Stars (Cont.)
 Conformed Dimension: The attributes of a dimension
must have the same meaning for all those fact tables
with which the dimension is connected.

90
Extract, Transform, Load (ETL)

 Extract only relevant data from the internal


source systems or external systems, instead
of dumping all data (“data junkhouse”)
 ETL can take 50-70% of the total effort of developing a data warehouse.
 The ETL effort depends on various factors which will be elaborated as we proceed through the lectures on ETL.
91
Major steps in ETL

92
Data Extraction
 Data can be extracted using third party tools
or in-house programs or scripts
 Data extraction issues:
1. Identify sources
2. Method of extraction for each source (manual,
automated)
3. When and how frequently data will be extracted from each source
4. Time window
5. Sequencing of extraction processes

93
How data is stored in operational
systems
 Current value: Values continue to change as daily transactions are performed. We need to monitor these changes to maintain history for the decision-making process, e.g., bank balance, customer address, etc.
 Periodic status: sometimes the history of
changes is maintained in the source system

94
Example

95
Data Extraction Method

 Static data extraction:


1. Extract the data at a certain time point.
2. It will include all transient data and periodic data along
with its time/date status at the extraction time point
3. Used for initial data loading or refresh
 Data of revisions
1. Data is loaded in increments thus preserving history of
both changing and periodic data

96
Incremental data extraction
 Immediate data extraction: involves data
extraction in real time when the transaction
occurs.
 Possible options:
1. Capture through transactions logs
2. Make triggers/Stored procedures
3. Capture via source application
 Deferred data extraction: involves data
extraction at later time.
1. Capture on the basis of time and date stamps
2. Capture by comparing files
97
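A sketch of deferred extraction based on date/time stamps is shown below, assuming the source table carries a last_updated column and that we keep a hypothetical etl_extract_log table recording how far the previous run extracted.

-- Pull only rows changed since the last successful extraction run
SELECT *
FROM   source_orders
WHERE  last_updated > (SELECT MAX(extracted_up_to)
                       FROM   etl_extract_log
                       WHERE  source_name = 'source_orders');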
Data Transformation

 Transformation means to integrate or


consolidate data from various sources
 Major tasks:
1. Format conversions (change in data type, length)
2. Decoding of fields (1, 0 → male, female)
3. Calculated and derived values (e.g., profit from units sold, price, and cost)
4. Splitting of single fields (e.g., “House No 10, ABC Road, 54000, Lahore, Punjab, Pakistan” into its components)
5. Merging of information (information from different sources regarding an entity)
6. Character set conversion

98
Data Transformation (Cont.)

8. Conversion of unit of measures


9. Date/time conversion
10. Key restructuring
11. De-duplication
12. Entity identification
13. Multiple source problem

99
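A small sketch of a few of these transformations applied during staging is given below; the staging_orders table and its columns are assumptions, and string functions vary slightly by DBMS.

SELECT order_id,
       CASE gender_code WHEN 1 THEN 'male'
                        WHEN 0 THEN 'female'
       END                          AS gender,   -- decoding of fields
       units_sold * (price - cost)  AS profit,   -- calculated/derived value
       UPPER(TRIM(city))            AS city      -- format conversion
FROM   staging_orders;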
Data Loading
 Determine when (time) and how (as a whole or in
chunks) to load data
 Four modes to load data (a sketch follows below)
1. Load: removes old data if present, otherwise loads the data
2. Append: the old data is not removed; the new data is appended to the old data
3. Destructive Merge: if the primary key of the new record matches the primary key of an old record, update the old record
4. Constructive Merge: if the primary key of the new record matches the primary key of an old record, do not update the old record; just add the new record and mark it as the superseding record
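A hedged sketch of a destructive merge using the SQL MERGE statement (supported by several DBMSs) is shown below; table and column names are assumptions. A constructive merge would instead insert the new row and flag it as superseding, leaving the old row in place.

MERGE INTO customer_dim target
USING staged_customers source
ON (target.customer_id = source.customer_id)
WHEN MATCHED THEN                      -- destructive merge: overwrite the old record
  UPDATE SET target.customer_name = source.customer_name,
             target.address       = source.address
WHEN NOT MATCHED THEN                  -- otherwise load the new record
  INSERT (customer_id, customer_name, address)
  VALUES (source.customer_id, source.customer_name, source.address);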
Data Loading (cont.)

101
Data Loading (Cont.)
 Data Refresh vs. Data Update
Full refresh reloads the whole data set after deleting the old data; data updates are used to update only the changing attributes
Data Loading (Cont.)
 Loading for dimension tables: You need to define a mapping between the source system key and the system-generated key in the data warehouse; otherwise you will not be able to load/update data correctly
Data Loading (Cont.)
 Updates to dimension table
Data Quality Management

 It is important to ensure that the data is correct in order to make the right decisions


 Imagine a user of the operational system entering wrong region codes for customers. Imagine that the business has never sent an invoice using these region codes (so the errors go unnoticed). What will happen if the data warehouse uses these codes to make decisions?
 You need to put in proper time and effort to ensure data quality
Data Quality

 What is Data: An abstraction/representation/


description of something in reality
 What is Data Quality: Accuracy + Data must
serve its purpose/user expectations
Indicators of quality of data
 Accuracy: Correct information, e.g., address
of the customer is correct
 Domain Integrity: Allowable values, e.g.,
male/female
 Data Type: If a field requires text then no
numeric entries
 Consistency: The content and its form are the same across all source systems, e.g., if the product code of product ABC is 1234 in one system, then it must be 1234 in the other systems for that particular product
Indicators of quality of data (Cont.)

 Redundancy: Data is not redundant; if the data is redundant for some reason (for example, efficiency), then it must be identified as such
 Completeness: There are no missing values in any
field and only valid values
 Conformance to Business rules: Values are
according to the business constraints, e.g., loan
issued cannot be negative
 Well defined structure: Whenever the data item can
be divided in components it must be stored in terms
of components/well structure, e.g., Muhammad
Ahmed Khan can be structured as first name, middle
name, last name. Similar is the case with addresses
Indicators of quality of data (Cont.)

 Data Anomaly: A field must contain only the values for which it was created, e.g., a State field cannot take a city name
 Proper Naming convention
 Timely: timely data updates as required by
user
 Usefulness: The data elements in data
warehouse must be useful and fulfill the
requirements of the users otherwise data
warehouse is not of any value
Indicators of quality of data (Cont.)

 Entity and Referential Integrity: Entity integrity


means every table must have a primary key
and it must be unique and not null.
Referential integrity enforces parent child
relationship between tables, you can not
insert a record in child table unless you have
a corresponding record in parent table
 SQL Server Integration Services (SSIS)
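As a reminder of how these two rules are declared in SQL, a minimal sketch with assumed table and column names (a customers table with primary key customer_id is assumed to exist):

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,   -- entity integrity: unique and not null
    customer_id INT NOT NULL,
    order_date  DATE,
    -- referential integrity: parent-child relationship with customers
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);

-- Inserting an order whose customer_id has no matching row in customers is rejected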
Problems due to quality of data
 Businesses ranked data quality as the biggest problem
during data warehouse designing and usage
Possible causes of low quality data

 Dummy values: For example, to pass a check on postal code, entering dummy or imprecise information such as 4444 (dummy) or 54000 for all areas of Lahore
 Absence of data values: For example, an incomplete address
 Unofficial use of a field: For example, writing comments in the contact field of the customer
Possible causes of low quality data

 Cryptic information: At one time the operational system used ‘R’ for “remove”, later for “reduced”, and at some other point for “recovery”
 Contradicting values: Compatible fields must not contradict each other; e.g., the two fields ZIP code and City can have the values 54000 and Lahore, but not some other city name for ZIP code 54000
Possible causes of low quality data

 Violation of business rule: Issued loan is negative


 Reused primary keys: For example, a business has a 2-digit primary key, so it can have a maximum of 100 customers. When the 101st customer comes, the business might archive the old customers and assign the new customer a primary key from the start. This might not be a problem for the operational system, but you need to resolve such issues because the DW keeps historical data.
Possible causes of low quality data

 Non-unique identifiers: For example different


product codes in different departments
 Inconsistent values: one system is using
male/female to represent gender while other
system is using 1/0
 Incorrect values: Product Code: 466, Product
Name: “Crystal vas”, Height:”500 inch”. It
means that product and height values are not
compatible. Either product name or height is
wrong or maybe the product code as well
Possible causes of low quality data

 Erroneous integration: A person might be both a buyer and a seller to your business. Your customer table might store this person with ID 222 while the seller table uses ID 500. In the data warehouse you might need to integrate this information, and records carrying the same ID in both tables might not refer to the same person
Sources of data pollution

 System migrations or conversions: For example, from manual systems → flat files → hierarchical databases → relational databases …
 Integration of heterogeneous systems: More
heterogeneity means more problems
 Poor database design: For example lack of
business rules, lack of entity and referential
integrity
Sources of data pollution (Cont.)

 Incomplete data entry: For example, no city information
 Wrong information during entry: For example, “United Kingdom” entered as “Unted Kingdom”
Data Cleansing
Examples of data cleansing tools
 Data Ladder, http://www.dataladder.com/products.html
 Human Inference,

http://www.humaninference.com/master-data-management/data-
quality/data-cleansing
 Google Refine,

http://code.google.com/p/google-refine/
 Wrangler

http://vis.stanford.edu/wrangler/
 Text Pipe Pro

http://www.datamystic.com/textpipe.html#.UKjm9eQ3vyo
Information Supply

 It deals with how you will provide information


to the users of the data warehouse.
 It must be easy to use, appealing, and
insightful
 For successful data warehouse you need to
identify users, and when, what, where, and
how (in which form) they need information
Data warehouse modes of usage

 Verification mode: The users interact with the


system to prove or disprove a hypothesis
 Discovery mode: Without any prior
hypothesis. The intention is to discover new
things or knowledge about the business. The
data mining techniques are used here.
Approaches of interaction with data
warehouse
 Informational approach:
1. With queries, reporting tools, and charts
2. The data might be summarized
3. Can perform statistical analysis by using current
and historical data
 Analytical approach:
1. Used for analysis in terms of roll-up, drill-down,
slice, and dice
 Data Mining approach
1. Used for knowledge discovery or hidden patterns
Classes of users
 We need to identify the classes of users in order to get
firm basis what and how to provide information to them
 A typical example of classifying users
User classes (cont.)

 Naive users
 Regular users: daily users but cannot make
queries and reports themselves. They need
query templates and predefined reports
 Power users: Technically sound users, who
can make queries, reports, scripts, import
and export data themselves
User classes from other perspective

 High-level Executives and Managers: Need


standard pre-processed reports to make strategic
decisions
 Technical Analysts: complex analysis, statistical
analysis, drill-down, slice, dice
 Business Analysts: comfortable with technology but
cannot make queries, can modify reports to get
information from different angles up to some extent.
 Business-Oriented users: Predefined GUIs, and
might be support for some ad-hoc queries and
standard reports
The ways of interaction

 Preprocessed reports: routine reports which are


delivered at some specific interval
 Predefined queries and templates: The users can
use own parameters with predefined queries
templates and reports with predefined format
 Limited ad-hoc access: few and simple queries
which are developed from scratch
 Complex ad-hoc access: complicated queries and
analysis. Can be used as a basis for predefined
reports and queries
Tools selection

129
Information delivery framework
Online Analytical Processing (OLAP)

 Used for fast and complex analysis on the data warehouse


 It is not a database design technique but only a category of applications
 Definition from OLAP Council:

“On-Line Analytical Processing (OLAP) is a category of software


technology that enables analysts, managers and executives to
gain insight into data through fast, consistent, interactive access
in a wide variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of the
enterprise as understood by the user.”
 So what is the difference between data warehouse and
OLAP?
OLAP (cont.)
What is the solution?

 Should we write all queries in advance?


 Should we ask users to write SQL
themselves?
 The solution is to facilitate all kinds of aggregates across multiple dimensions
OLAP as FASMI [1] (by Nigel Pendse)
 Fast: Delivery of information at a fairly constant rate. Most queries are answered within 5 seconds, the simplest queries in no more than a second, and very few take more than 20 seconds.
 Analysis: Support for easy analysis (calculations and statistical analysis pre-defined by an application developer or defined ad hoc by the user)
 Shared: ensure security in a multi user
environment (possibly to a single cell level)
and concurrent update locking capabilities.
OLAP as FASMI [1] (by Nigel Pendse)
 Multi-dimensional conceptual view of data
which include full support for hierarchies and
multiple hierarchies.
 Information: Information is defined as all data
and derived information that reside in DW
relevant for the application. The objective
over here is to ascertain the capacity of
various products in terms of how much input
data they can handle and process, instead of
how many Gigabytes of raw data they can
store.
135
OLAP as FASMI

Further reading:
What is OLAP?
by Nigel Pendse, Principal of OLAP Solutions
and Co-author of the OLAPreport.com

Source:
http://dssresources.com/papers/features/pendse04072002.htm

136
Major features of OLAP

 Dimensional Analysis
1. What are cubes?
2. What are hyper-cubes?
 Drill-down and Roll-up (see the SQL sketch after this list)
 Drill through, Drill across
 Slice and Dice
 Pivoting
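In a ROLAP setting, roll-up style summaries like those just listed can be requested directly in SQL with grouping extensions supported by several DBMSs; the sketch below runs against the sales star used earlier and assumes the time dimension also carries year and quarter columns.

-- Sales at every year/quarter/month level, plus subtotals and a grand total
SELECT t.year, t.quarter, t.month,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim t ON t.time_key = f.time_key
GROUP  BY ROLLUP (t.year, t.quarter, t.month);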
Dimensional Analysis
 OLAP supports multi-dimensional analysis
 Cube: a structure with three dimensions (X-axis, Y-axis, Z-axis)
[Figure: the same view shown on a spreadsheet]
Multidimensional Analysis (Spreadsheet view) / Hypercube
[Figure: an example of a hypercube with 4 dimensions]

141
Drill-down and Roll-up [5]
Drill through, Drill across
Slicing [5]
Dicing [5]
Pivoting [5]
A SQL server view for cube [6]
OLAP Implementations [1]

 MOLAP (uses multi-dimensional structure OR


cubes)
 ROLAP (uses relational database approach
for OLAP)
 HOLAP (hybrid approach of MOLAP and
ROLAP)
 DOLAP (OLAP for desktop decision support
system)
MOLAP

 Uses proprietary file formats to store cubes physically


 Does not use SQL; uses proprietary interfaces, such as Microsoft's MDX, to retrieve data
 Remember that cubes store all possible aggregates, calculated across the dimensions and their hierarchies, e.g., the Time dimension and its possible hierarchy levels: day, week, month, quarter, year.
MOLAP advantages [1]

 Fast query processing because the


aggregates are already pre-calculated
MOLAP disadvantages [1]

 Long loading time due to pre-aggregations


especially when there are many dimensions
and for dimensions with large hierarchies.
 Wastage of storage due to sparse cube.
Imagine sales of heaters in summer in Sibbi.
 Maintenance of cubes with updated values (insert, delete, update): the aggregates need to be recalculated
MOLAP disadvantages (cont.) [1]

 Storage issues: A lot of space is needed to store multiple cubes (possibly hundreds), because aggregates must be calculated across dimensions and the hierarchies within them. If the dimensions have large numbers of rows, then physically storing the cubes again becomes difficult.
Solution to storage problem [1]
But at the expense of processing time, so partition so that infrequently joined parts are kept separate
Solution to storage problem [1]

 Virtual cubes: A cube created by joining one or more cubes. It is similar to creating a “view” in relational databases. It is not physically stored but requires processing time when invoked.
Virtual Cube Example [9]

155
Relational OLAP (ROLAP) [1]

 ROLAP is facilitated by Star Schema


 Results are retrieved using SQL
 Cube in ROLAP is supported by Star Schema
(fact tables and dimension tables)
 ROLAP is simply querying the star schema,
querying pre-aggregates (aggregate fact
tables), and calculating summaries (at higher
level) of pre-aggregates at run time
Creating tables using SQL queries [1]
The required queries are [1]:
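The queries themselves appear only in the original slide figure; the following is a plausible sketch (not the original queries) of the kind of aggregate tables meant here, using the sales star sketched earlier. Note that the second query reuses the first, which is exactly the reuse discussed on the next slide.

-- Daily sales by product
CREATE TABLE sales_by_product_daily AS
SELECT product_key, time_key, SUM(dollars_sold) AS dollars_sold
FROM   sales_fact
GROUP  BY product_key, time_key;

-- Monthly sales by product, computable from the daily aggregate above
CREATE TABLE sales_by_product_monthly AS
SELECT s.product_key, t.month, SUM(s.dollars_sold) AS dollars_sold
FROM   sales_by_product_daily s
JOIN   time_dim t ON t.time_key = s.time_key
GROUP  BY s.product_key, t.month;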
Problems with this approach [1]

 The number of queries increases as the number of


aggregates increases due to rise in number of
dimensions
 Therefore avoid wasteful queries. For example, we could have saved the result of the first query in the previous example and computed the other two queries from that result.
 In other words, avoid calculating and saving all
possible aggregates. Such methodology will
consume the storage space and will create the
maintenance problem (same problems as in
MOLAP)
Solution to avoid saving all possible
aggregates [1]
 Compute dynamically the less detailed
aggregates (e.g., year sales) from the
detailed aggregates (e.g., daily sales) at run
time
 In this approach, the query run time might be
slow but will save the storage space
ROLAP problems [1]

 Maintenance (as with MOLAP)


 Divergence from simple hierarchies: It is
possible that simple hierarchies are not
followed (e.g., product category, product sub-
category, product) but instead some other
unconventional hierarchy is followed (e.g.,
sales of green items vs. yellow items OR
sales of items with national flag vs. items
without flag)
ROLAP problems (cont.) [1]

 Different semantics:
1. For example, what constitutes a year: 1st July to 30th June, or 1st January to 31st December? Another example is the week: does it end on Monday or Saturday? (In various departments, e.g., at the University of Lahore, the CS department's holiday is on Monday while others take their holiday on Saturday.)
2. Different semantics will force you into creating many pre-aggregates.
3. You have to figure out the best solution yourself (either calculating pre-aggregates or convincing the various departments to follow the same convention).
ROLAP problems (Cont.) [1]

 Storage problem: The combinatorial


explosion, unconventional hierarchies,
different semantics force you to create
various pre-aggregates which consumes a lot
of storage space
 Aggregation problem: As a general rule of thumb, aggregate to a higher level only if there is a considerable reduction in size compared with the next more detailed table (the aggregate should have no more than about 10% of the detailed table's rows)
Reducing pre-aggregates tables [1]

 Calculating summaries at run time


 Using smart wizards for a trade-off between
performance and storage
Hybrid OLAP (HOLAP) [1]

 HOLAP = MOLAP + ROLAP


 Its purpose is to take advantage of both approaches. It answers queries using MOLAP for dimensions with low cardinality, and if the user requires more detailed analysis it uses ROLAP. This intelligent switch between ROLAP and MOLAP is transparent to the users
Desktop OLAP (DOLAP) [1]
 It is used for the remote users or for the
machines with limited processing power. A
subset is copied to the remote machine
Parallel Execution OR Parallelism [1]

 Basically parallelism is used to speed up the


response time of queries in data warehouse
 The processing of a query is done by many processes instead of one
 For example, whole-year sales can be retrieved by a query partitioned into four quarters and processed individually by four processes, instead of one process executing sequentially
 However, the divided processes should be independent of each other
Benefits of Parallelism [1]

 Large table scans and joins


 Creation of large indexes
 Partitioned index scans
 Bulk inserts, updates, and deletes
 Aggregations
When we can parallelize? [1]

 “Symmetric multi-processors (SMP), clusters,


or Massively Parallel (MPP) systems AND
 Sufficient I/O bandwidth AND
 Underutilized or intermittently used CPUs (for
example, systems where CPU usage is
typically less than 30%) AND
 Sufficient memory to support additional
memory-intensive processes such as sorts,
hashing, and I/O buffers”
Speed-Up & Amdahl’s Law [1]

 Remember, if the dependencies between processes


are high then performance or speed-up will degrade
 Amdahl’s Law:

S = 1 / (f + (1 - f) / N)

 where f = fraction of the task that must be executed sequentially
 N = number of processors
 S = speed-up
Amdahl’s Law (Cont.) [1]

 This formula basically computes the


maximum expected speed-up from the
proportion of task that must be executed
sequentially and available resources
 Example-1: f = 5% and N = 100 then S = 16.8
 Example-2: f = 10% and N = 200 then S =
9.56
Amdahl’s Law (Cont.) [1]
Parallel hardware architecture [1]
 Symmetric Multiprocessing (SMP): shares
memory and I/O disks. In following figure p1,
p2, p3, p4 are processors. Supported by
UNIX and Windows NT
Parallel hardware architecture (cont.) [1]
 Distributed Memory or Massively Parallel
Processing (MPP)
 Approaches: Grid and Cluster
Non-uniform Memory Access (NUMA)

 NUMA deals with a shared memory architecture along with its access timings: L1 cache → L2 cache → L3 cache → local memory → remote memory, so that the data bus is not overloaded as in the case of SMP
Parallel Software Architectures [1]

 Shared disk RDBMS Architecture (Shared


Everything)
Parallel Software Architectures (cont.) [1]

 Shared disk RDBMS Architecture Advantage: It


does not fail as a unit as compared to shared
memory architecture. Remember, in case of failure
in the RAM the Symmetric Multiprocessing (SMP)
fails.
 Shared disk RDBMS Architecture
Disadvantage: It requires some distributed locking
management, so that the data remains consistent
when two or more processes ask for updates to
the same data (data buffer). This requirement
leads to serialization (degradation in performance)
as more and more locking is required
Parallel Software Architectures (cont.) [1]

 Shared Nothing RDBMS Architectures


Parallel Software Architectures (cont.) [1]

Shared Nothing RDBMS Architectures


Basically the tables are partitioned logically and
stored on separate disks. Such as horizontal
split. When any query requires data from
more than one disk the data is joined at run
time and shown to user. Each machine is
responsible for accessing and locking of data
individually.
Parallel Software Architectures (cont.) [1]

 Shared Nothing RDBMS Architecture


Advantage: No distributed locking of data or
serialization problem
 Shared Nothing RDBMS Architecture
Disadvantage: Failure of one node will
cause a part of the whole system to be failed.
Partitioning Techniques [1]

 Ranges: e.g., >2000 and <=2000


 List: e.g., Europe, Asia, North America…
 Round Robin
 Hashing: Usage of hashing function (implemented
by a DBMS) to distribute rows across partitions. It is
done when rows can not be divided using ranges
and lists. e.g., in mySQL (example taken from [7])
CREATE TABLE employees (
    id INT NOT NULL,
    fname VARCHAR(30),
    lname VARCHAR(30),
    hired DATE NOT NULL DEFAULT '1970-01-01',
    separated DATE NOT NULL DEFAULT '9999-12-31',
    job_code INT,
    store_id INT
)
PARTITION BY HASH (store_id)
PARTITIONS 4;
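For comparison, a range-partitioned version of a similar table is sketched below, again in MySQL syntax as in the hash example; the table name and partition boundaries are illustrative assumptions.

CREATE TABLE employees_by_year (
    id INT NOT NULL,
    fname VARCHAR(30),
    lname VARCHAR(30),
    hired DATE NOT NULL DEFAULT '1970-01-01',
    store_id INT
)
PARTITION BY RANGE (YEAR(hired)) (
    PARTITION p_old   VALUES LESS THAN (2000),   -- hired before 2000
    PARTITION p_2000s VALUES LESS THAN (2010),   -- hired 2000-2009
    PARTITION p_new   VALUES LESS THAN MAXVALUE  -- hired 2010 onwards
);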
Partitions with respect to queries

 Full table scan


 Point queries
 Range queries
Partitioning Techniques (cont.) [1]

 Round-Robin: Equal distribution of rows


across partitions. For example if we have four
partitions. Then first row to first partition,
second to the second partition, third to the
third partition, and fourth to fourth partition
and then repeat the same procedure again
which means fifth row to first partition, sixth to
second and so on.
Partitions with respect to queries (cont.)

 In the context of Round-Robin


1. Load will be equivalent, the scan query will
take equal time
2. In the context of point query it is not good
because the records are not partitioned
using values or ranges
Partitions with respect to queries (cont.)

 In the context of Hashing technique


 If the hashing is uniform the number of
records will be approx. same on all nodes
 The point based queries will execute well if
the hashing is done on the same column,
similarly the join will execute well
 The range query will not execute well
because it is not distributed according to
range across partitions
Partitions with respect to queries (cont.)

 In the context of range partition


1. If the partitions are done in proper manner
means that the load is same across
partitions then the scan query will run well
2. The point based and range based queries
execute well
3. Can have unequal distribution that can
degrade query execution in parallel
Some concepts to understand about
parallelism [1]
Some concepts to understand about
parallelism (cont.) [1]
 To get maximum speed
1. Need of enough resources (CPUs, Memory,
and data bus)
2. The query coordinator must be quick in
preparing the final result
3. The processing nodes ideally should finish
tasks in equal time otherwise the query
coordinator has to wait for the node taking
most time.
Spatial parallelism OR temporal
parallelism OR Pipelining [1]
 The basic idea is to break a complex task in
to smaller sub-tasks to execute them
concurrently
Spatial parallelism OR temporal
parallelism (cont.) [1]
Spatial parallelism OR temporal
parallelism (cont.) [1]
 The speed-up can be calculated as

S = (N × T) / ((N + M − 1) × (T / M)) = (N × M) / (N + M − 1)

 where S = speed-up
 N = number of tasks
 M = number of stages
 T = time for the execution of one sequence of tasks (one task through all M stages)
Pipelining limitations [1]

 The speed-up is limited to about 10 for chains of relational operators (a chain of length ten is unusual [see 10]).
 Pipelining will not be effective when all sub-
tasks must be executed to get the final result
such as sorting, aggregation
 The sub-tasks can take different timing to
complete
How relations/tables are organized on
disk [11]
 Two kind of files
1- Primary files: Actual data files
2- Secondary files: Auxiliary files that are used
to speed-up the access of primary files

193
How relations/tables are organized on
disk [11]
 File Organizations
 Relations are stored in terms of logical structures called files, the
way files are organized on disk is called file organization
 Usually files are stored as a sequence of records
 A record is a tuple in a relational database
 Files are read in terms of physical blocks on disk
 There can be many records per block, many blocks per record, or a one-to-one relationship, which is very rare
 File organization simply refers to the way records are stored in terms
of blocks and the way blocks are placed on the storage medium and
interlinked

194
How relations/tables are organized on
disk [11]
 Types of File Organizations
1. Unsorted (piled file organization)
2. Sorted
3. Hashing

195
How relations/tables are organized on
disk [11]
 Records: A record is basically a tuple in a
table/relation
 A file is a sequence of records
 Records can be of fixed length and variable
length
 Records comprises of sequence of fields
(columns or attributes)

196
How relations/tables are organized on
disk [11]
 Blocks: They are basically the physical units of storage
in storage devices (for example, sectors in hard disks)
 Usually (but not necessarily, it depends on the file
system structure) stores records of a single file
 They are of fixed length, based on physical
characteristics of the storage/computing device and
operating system
 Storage device is either defragmented or fragmented
depending on whether contiguous sets of records lie in
contiguous blocks

197
How relations/tables are organized on
disk [11]
 Unspanned records: each record fits entirely within one block

Block m: | Record1 | Record2 | Record3 | unused |

198
How relations/tables are organized on
disk [11]
 Spanned records: a record may continue across blocks

Block m: | Record1 | Record2 | Record3 | Record4 (first part) |
Block p: | Record4 (remaining part) | ...

 If the record size is bigger than the block size, spanned records must be used
199
How relations/tables are organized on
disk [11]
 Unordered or Pile File organization

200
How relations/tables are organized on
disk [11]
 Unordered or Pile File organization
 Simplest file organization
 Records are inserted in the order of their
arrival
 Usually requires secondary files for efficient
search
 Insertion: very easy
 Search very expensive because we need to
search linearly
201
Unordered or Pile File organization
[11]
 Record deletion causes fragmentation and
inefficient in terms of space

202
Unordered or Pile File organization
[11]
 Record update is also problematic in case of
variable length records (needs auxiliary file to
point to new location)

203
Sorted Files [11]
 File organization where records are physically
sorted based on the value of some field called
the ordering field
 Ordering field should be a key field (unique for
each record) and belong to an ordinal domain
(where values can be compared)
 Insertion and deletion: both expensive
 Updating may involve physical migration of
record, if ordering field is modified
 Searching: efficient based on binary search
204
Sorted Files [11]

Making insertion and updating easy

205
Benefits of sorted files [11]

 More efficient than pile-files for key-based


searches
 Suitable mainly for random-access devices
due to binary search
 A better choice when database is mostly
read-only and queries are mostly key-based
retrievals

206
Hashing Techniques [11]

 Provide very fast access to records on certain


search conditions
 Search condition should be an equality
condition on a key field
 Uses a “hashing function” to map keys onto
“buckets” hosting records
 Primarily suited for random-access devices

207
Hashing Techniques [11]

208
Overflow in Hashing Techniques [11]

 When buckets overflow means that it


receives more records than it can hold
 Three techniques to overcome overflow
1. Open addressing: use next available bucket
2. Chaining: Pointers to extra buckets called
overflow buckets
3. Rehashing: Use a second hashing function
to see whether it can hash without overflow

209
Handling overflow through chaining
[11]

210
Indexing [11]

 Indexing is an efficient ways to access any


item
 It speeds up query processing by searching
minimum records instead scanning whole
table/s
 It provides speed up without using additional
and expensive hardware
 Imagine searching a book in shelves in a
library vs. through catalog
Indexing [11]

 Indexing field (indexing attribute): The field on


which an index structure is built (searching is
fast on this field)
 Primary index: An index structure that is
defined on the ordering key field (the field
that is used to physically order records on
disk in sorted file organizations)
 In many cases ordering field is the primary
key
212
Indexing [11]

 Clustering Index: When the ordering field is


not a key field (i.e., not unique) a clustering
index is used instead of a primary index
 Secondary index: An index structure defined
on a non-ordering field and not necessarily
the key field.

213
Primary Index [11]

 Comprises an ordered file of fixed-length records having two fields


 The first field is of the same data type as the ordering key (primary key), and the second field holds a block address

214
Primary Index [11]

215
Primary Index [11]

 The number of entries in the index is equal to the number of disk blocks in the ordered data file
 The first record in each block of the data file is indexed; these records are called anchor records
 Primary indexes are usually sparse indexes

216
Sparse Index [1]
 The index is kept for a block of data items
 It takes less space but at the expense of
some efficiency in terms of time
Dense Indexing [1]
 For each record store the key of that record and the
pointer where the record is actually placed on disk
 If it fits in memory it requires one I/O operation if not then
performance will degrade
Primary index [11]

 Search: Easy, perform binary search on


index file to identify block containing required
record
 Insertion/Deletion: Easy if key values in
record are statically allocated to blocks
(results in wastage of space)
 Else, re-computation of order and accordingly
index updating is required on
insertion/deletion
219
Clustering Index [11]

 Clustering field: A non-key ordering field.


That is blocks are ordered on this field which
does not have the UNIQUE constraint
 Structure of index file similar to primary index
file, but each index points to the first block
having the given value in its clustering field
 One index entry for every distinct value of the
clustering field

220
Clustering Index [11]

221
Clustering Index [11]

 A sparse index, since only distinct values are


indexed
 Insertion and deletion cause problems when
a block can hold more than one value for
clustering field
 Alternative solution: Allocate blocks for each
value of clustering field

222
Clustering Index [11]

223
Secondary Index [1]

 Used to index fields that are not ordering


fields

224
Secondary Index on key field (dense
index) [11]

225
Secondary Index on non-key field [11]

 Three techniques to handle duplication


1. Duplicate index entries
2. Variable length records
3. Extra redirection levels

226
Secondary Index on non-key field [11]

 Duplicate index entries


 Index entries are repeated for each duplicate
occurrences of the non-key attribute
 Binary search becomes more complicated.
Mid-point of a search may have duplicate
entries on either side
 Insertion of records may need restructuring of
index table

227
Secondary Index on non-key field [11]

 Variable length records: use variable length


records for index table in order to
accommodate duplicate key entries

228
Extra Redirection Levels [11]

229
Extra Redirection Levels [11]

 Extra redirection Levels: Most frequently


used technique
 Index records are of fixed length
 Block overflows handled by chaining
 Insertion/deletion of records straightforward

230
Multilevel Indexing [1]
 It uses a little more space but it increases efficiency in
terms of time
 It is good for queries posing some conditions or range
conditions
Example of B-Tree [1]
B-Tree limitations [1]

 A B-Tree index is not a good choice for a table that is large in terms of rows but has few unique values in the indexed column
Bitmap Indexes

234
235
Join Indexes

 Stores results of a join from two or more


tables

236
Recommended Indexes for Data
Warehouse
 For primary key, foreign keys, and frequently
accessed metrics in Fact tables you can use
B-Tree index (distinct values)
 For primary keys in dimensions use B-Tree.
For other attributes that are repetitive in
dimensions use Bitmap index

237
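A sketch of these recommendations in SQL follows; the B-tree CREATE INDEX syntax is standard, while CREATE BITMAP INDEX is Oracle-style syntax and is an assumption for other DBMSs. Table and index names reuse the sales star sketched earlier.

-- B-tree indexes on the fact table's foreign keys
CREATE INDEX idx_sales_time    ON sales_fact (time_key);
CREATE INDEX idx_sales_product ON sales_fact (product_key);

-- Bitmap index on a low-cardinality, repetitive dimension attribute (Oracle syntax)
CREATE BITMAP INDEX idx_product_category ON product_dim (category);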
Join Techniques [1]

 Joins are used to retrieve data from multiple


tables.
 Types:
- Nested loop join
- Sort merge join
- Hash based join

238
Join Techniques (cont.) [1]

 Nested loop join:


- Typically used in OLTP environment
- Nested-loop joins provide efficient access
when tables are indexed on join columns
- For high selectivity it is justified

239
Join Techniques (cont.) [1]

 A nested loop join involves the following


steps:
1. The optimizer determines a main table (i.e. Table_A) and designates
it as the outer table. Table_A is accessed once. If the outer table has
no useful indexes, a full table scan is performed. If an index can reduce
I/O costs, the index is used to locate the rows.
2. The other table is designated as the inner table or Table_B. Table_B is accessed once for each qualifying row (or tuple) in Table_A.
3. For every row in the outer table, DBMS accesses all the rows in the
inner table. The outer loop is for every row in outer table and the inner
loop is for every row in the inner table.
4. Usually the outer table is smaller than inner table.

240
Join Techniques (cont.) [1]

 Sort-merge join
- Sort both relations on join attributes and then
finds matching rows
- Both relations are scanned once
- However, additional cost of sorting

241
Join Techniques (cont.) [1]
 Sort merge join

242
Hash join [1]
- Suitable for large databases
- The choice about which table first gets hashed
plays a pivotal role in the overall performance of
the join operation, and left to the optimizer
- The optimizer decides by using the smaller of
the two tables (say) Table_A or data sources to
build a hash table in the main memory on the
join key used in the WHERE clause.
- It then scans the larger table (say) Table_B and
probes the hashed table to find the joined rows
Hash Join [12]

244
Questions?

245
References
 [1] Abdullah, A.: “Data warehousing handouts”, Virtual
University of Pakistan
 [2] Ricardo, C. M.: “Database Systems: Principles
Design and Implementation”, Macmillan Coll Div.
 [3] Junk Dimension,
http://www.1keydata.com/datawarehousing/junk-
dimension.html
 [4] Advanced Topics of Dimensional Modeling,
https://mis.uhcl.edu/rob/Course/DW/Lectures/Advanced%20Dimensional%20Modeling.pdf
 [5] http://en.wikipedia.org/wiki/OLAP_cube

246
References
 [6]
http://www.mssqltips.com/sqlservertutorial/2011/processi
ng-dimensions-and-cube/
 [7] http://dev.mysql.com/doc/refman/5.1/en/partitioning-
hash.html
 [8] Karen Corral, et al. (2006) The impact of alternative
diagrams on the accuracy of recall: A comparison of
star-schema diagrams and entity-relationship diagrams,
Decision Support Systems, 42(1), 450-468.
 [9]
http://www.ibm.com/developerworks/data/library/techarti
cle/dm-0909infospherevirtualcubes/
 [10] “Readings in Database Systems” by Joseph M. Hellerstein, Michael Stonebraker, 2005
References
[11] Lecture Series on Database Management
System by Dr.S.Srinath, IIIT Bangalore.
Available at: http://nptel.iitm.ac.in

[12] Zurek, T. (1997): “Optimisation of Partitioned Temporal Joins”, PhD thesis, available at:
http://www.dcs.ed.ac.uk/home/tz/phd/thesis/t.htm

248
