Sei sulla pagina 1di 80

Chapter 2

Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1


Outline
 Database Concepts
 Steps in Database Design
 Entity-Relationship Model
 Logical Database Design
 Relational Model
 Object-Relational Model
 Physical Database Design
 Data Warehouse Concepts
 Logical Data Warehouse Design
 Physical Data Warehouse Design
 Data Warehouse Architecture

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 2


Steps in Database Design
 Requirements specification: Collects information about users’
needs with respect to the database system
 Conceptual design: Builds a user-oriented representation of
the database without any implementation considerations
 Logical design: Translates the conceptual schema from the
previous phase into an implementation model common to
several DBMSs, e.g., relational or object-relational
 Physical design: Customizes the logical schema from the
previous phase to a particular platform, e.g., Oracle or SQL
Server

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 3


Entity Relationship Model: University Example
Academic staff Participates Project
(0,n) (1,n)
Employee no Start date Project id
Name End date Project acronym
First name Project name
Last name Description (0,1)
Address Start date
Email End date
Homepage Budget
Research areas (1,n) Funding agency

(partial,exclusive) Teaches
(1,1) (0,n)
(0,n) (0,n)
Assistant Professor Section Prerequisite
Thesis title Status Semester Has prereq Is a prereq
Thesis description Tenure date (0,1) Year
/No courses Homepage (0,1) Course
(1,1)
(1,1) (0,n) Course id
(1,n)
Advisor CouSec Course name
Level

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 4


ER Model: Entity and Relationship Types (1)
 Entity Relationship Model: Most often used conceptual model
for database design
 Entity type: Represents a set of real-world objects of
interest to an application
 Entity: object belonging to an entity type
 Relationship type: Represents an association between
objects
 Relationship: association belonging to a relationship type
 Role: Participation of an entity type in a relationship type
 Cardinalities in a role: Minimum and maximum number of
times that the entity may participate in the relationship type

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 5


ER Model: Entity and Relationship Types (2)
 Types of roles
 Optional vs mandatory: minimum cardinality 0 vs 1
 Monovalued vs multivalued: maximum cardinality 1 vs n
 Relationship types
 Binary vs n-ary: two vs more than two participating entity types
 Binary relationship types:
 One-to-one, one-to-many, or many-to-many depending on the
maximum cardinality of the two roles
 Recursive relationship types: the same entity type
participates more than once in the relationship type
 Role names are then necessary to distinguish different roles

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 6


ER Model: Attributes
 Attributes: Structural characteristics describing objects and
relationships
 Attributes have cardinalities, as for roles
 Depending on cardinalities, attributes may be
 Optional vs mandatory
 Monovalued vs multivalued
 Complex attributes: Attributes composed of other attributes
 Derived attributes: Their value for each instance may be
calculated from other elements of the schema

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 7


ER Model: Identifiers and Weak Entity Types
 Identifier or key: One or several attributes that uniquely
identify a particular object
 Weak entity type: Do not have identifier of their own
 Regular entity types are called strong entity types
 A weak entity type is dependent on the existence of another
entity type, called the identifying or owner entity type
 An identifying relationship type relates a weak entity type to
its owner
 Normal relationship types are called regular relationship types
 Partial key: Set of attributes that can uniquely identify weak
entities related to the same owner entity

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 8


ER Model: Generalization (1)
 Generalization or is-a relationship: Relates perspectives of
the same concept at different abstraction levels
 Supertype: Entity at higher abstraction level
 Subtype: Entity at lower abstraction level
 Population inclusion: Every instance of the subtype is also an
instance of the supertype
 Inheritance: All characteristics (e.g., attributes, roles) of the
supertype are inherited by the subtype
 Substitutability: Each time an instance of a supertype is
required, an instance of the subtype can be used instead

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 9


ER Model: Generalization (2)
 Several generalization relationships accounting for the same
criterion can be grouped
 Types of generalizations
 Total vs partial: depending on whether every instance of the
supertype is also an instance of one of the subtypes
 Disjoint vs overlapping: depending on whether an instance may
belong to one or several subtypes
 Multiple inheritance: A subtype that has several supertypes
 May induce conflicts when a property of the same name is inherited
from two supertypes

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 10


ER Model: Summary of Notation

Simple attribute
(0,1) Complex attribute
Professor Teaches role name Attribute 1
(1,1)
Attribute 2
Identifier (1,1) (0,n) Optional att. (0,1)
Attributes
Other attributes (1,n) Multivalued attr. (1,n)

Entity Type Relationship Type Role Cardinalities Attribute types

Section
(total,exclusive)
CouSec (partial,exclusive)
Partial Identifier
Other Attributes (total,overlapping)
(partial,overlapping)
Identifying
Weak Entity Type Relationship Type Generalization Generalization types

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 11


Relational Model: University Example

Academic staff Academic staff Project


Research area
Employee no Employee no Project id
First name Research area Project acronym
Last name Project name
Address Description (0,1)
Email Participates Start date
Homepage End date
Employee no Budget
Project id Funding agency
Professor Start date
End date
Employee no
Status Course
Tenure date (0,1) Section
No courses Course id
Course id Course name
Semester Level
Assistant Year
Homepage (0,1)
Employee no Employee no Prerequisite
Thesis title
Thesis description Course id
Advisor Has prereq

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 12


Relational Model (1)
 Based on a simple data structure
 A relation or table, composed of attributes or columns
 Each attribute is defined over a domain or data type
 Typical domains: integer, float, date, string, …
 Tuple or row: element of the table
 Provides several declarative integrity constraints
 Not null attribute: a value for the attribute must be provided
 Key: column(s) that uniquely identify one row of the table
 Simple vs composite key: depending on whether key is composed of
one or several columns
 Primary vs alternate key: one of the keys must be chosen as primary
by the designer, the others are alternate keys

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 13


Relational Model (2)
 Referential integrity
 Defines a link between two tables (or twice the same one)
 A set of attributes in one table (foreign key) references the primary
key of the other table
 Values in the foreign key columns must also exist in the primary key
of the referenced table
 Check constraint: defines a predicate that must be valid
when inserting or updating a tuple in a relation
 Predicate can only involve the tuple being inserted/deleted
 All other constraints must be expressed by triggers: an
event-condition-action rule that is automatically activated
when a relation is updated

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 14


Translating from ER to Relational Schemas (1)
 Rule 1: A strong entity type is associated with a table
 Table contains the simple monovalued attributes and the simple
component of the monovalued complex attributes of the entity type
 Table also defines not null constraints for mandatory attributes
 Identifier of the entity type define the primary key of the table

Academic staff Academic staff

Employee no Employee no
Name First name
First name Last name
Last name Address
Address Email
Email Homepage
Homepage
Research areas (1,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 15


Translating from ER to Relational Schemas (2)
 Rule 2: A weak entity type is transformed as a strong entity
type, but the associated table also contains the identifier of
the owner entity
 A referential integrity constraint relates this identifier to the table
associated with the owner entity type
 Primary key of the table: partial identifier of the weak entity type +
identifier of the owner entity type

Section CouSec Course Section


(1,1) (1,n)
Semester Course id Course id
Year ... Semester
... Year
...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 16


Translating from ER to Relational Schemas (3)
 Rule 3: A binary one-to-one relationship type is represented
by including the identifier of one of the participating entity
types in the table associated to the other entity type
 A referential integrity constraint relates this identifier to the table
associated with the corresponding entity type
 Simple monovalued attributes and simple components of monovalued
complex attributes are also included in the table
 Not null constraint for mandatory attributes are also defined

Department Dean Professor Department


(1,1) (0,1)
... ... ...
Dean

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 17


Translating from ER to Relational Schemas (4)
 Rule 4: A binary one-to-many relationship type is
represented by including the identifier of the entity type with
cardinality 1 in the table associated to the other entity type
 A corresponding referential integrity constraint is added
 Attributes and not null constraints are also added as in previous rules

Assistant Advisor Professor Assistant


(1,1) (0,n)
... ... ...
Advisor

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 18


Translating from ER to Relational Schemas (5)
 Rule 5: A binary many-to-many relationship type or an n-ary
relationship type is associated with a table containing the
identifiers of the entity types that participate in all its roles
 Corresponding referential integrity constraints are added
 Attributes and not null constraints are also added as in previous rules
 Identifier if any is the key of the relation, but the combination of all
role identifiers can be used as a key

Academic staff Participates Project Participates


(0,n) (1,n)
Employee no Start date Project id Employee no
... End date ... Project id
Start date
End date

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 19


Translating from ER to Relational Schemas (6)
 Rule 6: A multivalued attribute is associated with a table
containing the identifier of the entity or relationship type to
which it belongs
 Corresponding referential integrity constraint is added
 Primary key of table is composed of all its attributes

Academic staff Academic staff


Research area
... Employee no
Research areas (1,n) Research area

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 20


Translating from ER to Relational Schemas (7)
 Rule 7: A generalization can be translated in 3 ways
 The supertype and the subtype are associated with tables containing
the identifier of the entity type. There is a referential integrity
constraint from the table of the subtype to the table of the subtype

Academic staff Academic staff

Employee no Employee no
Name Name
... ...

Assistant Professor
Assistant Professor
Employee no Employee no
Thesis title Status Thesis title Status
... ... ... ...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 21


Translating from ER to Relational Schemas (8)
 Rule 7 (cont.):
 Only the supertype is associated with a table, which contains also the
attributes of the subtype (these are optional attributes)
Academic staff

Employee no
Name

Thesis title (0,1)
...
Status
...

 Only the subtype is associated with a table containing all attributes


of the subtype and the supertype
Assistant Professor
Employee no Employee no
Name Name
… …
Thesis title Status
... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 22
Defining Relational Schemas in SQL
create table AcademicStaff as (
EmployeeNo integer primary key,
FirstName character varying (30) not null,
LastName character varying (30) not null,
Address character varying (50) not null,
Email character varying (30) not null,
Homepage character varying (64) not null );
create table AcademicStaffResearchArea as (
EmployeeNo integer not null,
ResearchArea character varying (30) not null,
primary key (EmployeeNo,ResearchArea),
foreign key EmployeeNo references AcademicStaff(EmployeeNo) );

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 23


Redundancies and Update Anomalies
Participates Affiliation
Employee no Employee no
Project id Research area
Start date Department id
End date
Project acronym
Project name

 Relation with redundancies may induce anomalies in the


presence of insertions, updates, and deletions
 In relation Participates the information about a project is repeated
for each staff member who works on that project
 In relation Affiliation the information about a research area is
repeated for all departments to which the staff member is affiliated
 Dependencies and normal forms are used to describe such
anomalies

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 24


Functional Dependencies
 Given a relation R and to sets of attributes X and Y in R, a
functional dependency X Y holds if in all tuples of the
relation, each value of X is associated with at most one
value of Y
 In Participates above, Project Id {Project acronym, Project name}
 A key is a particular case of a functional dependency
 A prime attribute is an attribute that is part of a key
 A full functional dependency is a dependency X Y in
which the removal of an attribute from X invalidates the
dependency
 A dependency X Z is transitive if there is a set of
attributes Y such that X Y and Y Z hold

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 25


Multivalued Dependencies
 Given a relation R and to sets of attributes X and Y in R, a
multivalued dependency X Y holds if the value of X
determines a set of values for Y, independently of any other
attributes
 In Affiliation above, Employee no Research area and
Employee no Department Id
 A multivalued dependency X Y is trivial if either Y X
or X Y = R, otherwise it is nontrivial

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 26


Normal Forms (1)
 Normal Form: Integrity constraint certifying that a relation
schema satisfies particular properties
 In the relational model the relations are in first normal form
since all attributes are atomic and monovalues
 A relation schema is in the second normal form if every
nonprime attribute is functionally dependent on every key

Participates Participates Project


Employee no Employee no Project id
Project id Project id Project acronym
Start date Start date Project name
End date End date
Project acronym
Project name

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 27


Normal Forms (2)
 A relation is in the third normal form if it is in the second
normal form and there are no transitive dependencies
between a key and a nonprime attribute
Assistant Assistant Professor
Employee no Employee no Employee no
Thesis title Thesis title First name
Thesis description Thesis description Last name
Advisor id Advisor id
Advisor first name
Advisor last name

 A relation is in the Boyce-Codd normal form if for every


nontrivial dependency X  Y, X is a key or contains a key
Participates Participates Location
Employee no Employee no Location
Project id Project id Project id
Start date Start date
End date End date
Location
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 28
Normal Forms (3)
 A relation is in the fourth normal form if for every nontrivial
dependency X Y, X is a key or contains a key

Affiliation Affiliation Research


Employee no Employee no Employee no
Research area Research area Research area
Department id

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 29


Weaknesses of the Relational Model
 It provides a very simple data structure, which disallows
multivalued and complex attributes
 complex objects are split into several tables performance problems
 The set of types provided by relational DBMSs is very
restrictive (integer, float, string, date, …)
 Insufficient for complex application domains
 There is no integration of operations with data structures
 It is not possible to directly reference an object by use of a
surrogate or a pointer
 links are based on comparison of values performance problems

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 30


Object-Relational Model: Motivations
 Preserves the foundations of the relational model
 Organizes data using an object model
 Allow attributes to have complex types groups related facts into a
single row
 Extensions
 Complex and/or multivalued attributes
 User-defined types with associated methods
 System-generated identifiers
 Inheritance among types
 Defined in the SQL:1999 and SQL:2003 standards
 Current DBMSs (Oracle, …) introduced object-relational
extensions, but do not necessarily comply with the standard

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 31


Object-Relational Model: Definition (1)
 Allow composite types, in addition to predefined types
 Complex values may be stored in a column of a table
 Row type: structured values (i.e., composite attributes)
 Array type: variable-sized vectors of values
 Multiset type: Unordered collection of values, with duplicates
permitted (bags)
 Composite types can be nested
 This is considered an “advanced feature” in the standard

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 32


Object-Relational Model: Definition (2)
 Two sorts of user-defined types
 Distinct types: Represented internally by a predefined type,
but cannot be mixed in expressions
 Does not make sense to compare a person’s name and a SSN
 Structured user-defined types:  class declarations in OO
languages
 May have attributes, which can be of any SQL type
 May have methods, which can be instance or static (class) methods
 Attributes are encapsulated through observer and mutator functions
 Both of them can be used as a domain of a column, of an
attribute of another type, or a table
 Comparison/ordering of values: with user-defined methods

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 33


Object-Relational Model: Definition (3)
 SQL:2003 supports single inheritance:
 A subtype inherits attributes and methods from its supertype
 A subtype can
 Define additional attributes and methods
 Overload inherited methods: provide another definition with a
different signature
 Override inherited methods: provide another implementation with a
the same signature
 Final types: cannot have subtypes
 Final methods: cannot be redefined
 Instantiable type: includes constructor methods for creating
instances

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 34


Object-Relational Model: Definition (4)
 Two types of tables
 Relational tables: usual ones, where domains for attributes
are all predefined or user-defined types
 Typed tables: defined on structured user-defined types
 Constraints (e.g. keys) defined in the table declaration
 Have a self-referencing column that contains identifiers (surrogates)
generated by the DBMS
 Row identifiers can be used for establishing links between tables
 Ref type: values are the unique identifiers
 Always associated with a specified structured type

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 35


Object-Relational Model: University Example

Academic staff Type Participates Type Project Type


Employee no Employee Project id
Name Project Project acronym
First name Start date Project name
Last name End date Description (0,1)
Address Start date
Email End date
Homepage Budget
Research areas (1,n) Funding agency
Participates (0,n) Participates (1,n)

Assistant Type Professor Type Section Type Course Type


Thesis title Status Semester Course id
Thesis description Tenure date (0,1) Year Course name
Advisor No courses Homepage (0,1) Level
Advises (0,n) Professor Sections (1,n)
Teaches (0,n) Course Has prereq (0,n)
Is a prereq (0,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 36


Translating from ER to OR Schemas (1)
 Rule 1: A strong entity type is associated with a structured
type containing all its attributes
 Multivalued attributes defined with the array or the multiset type
 Composite attributes defined with the row or the structured type
 A table of this structured type must be declared
 Table defines not null constraints for mandatory attributes
 Identifier of the entity type define the primary key of the table

Academic staff Academic staff Type

Employee no Employee no
Name Name
First name First name
Last name Last name
Address Address
Email Email
Homepage Homepage
Research areas (1,n) Research areas (1,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 37


Translating from ER to OR Schemas (2)
 Rule 2: A weak entity type is transformed as a strong entity
type, but the structured type contains the identifier of its
owner

Section CouSec Course Section Type Course Type


(1,1) (1,n)
Semester Course id Semester Course id
Year ... Year ...
Homepage (0,1) Homepage (0,1)
Course

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 38


Translating from ER to OR Schemas (3)
 Rule 3: A regular relationship type is associated with
structured type containing the identifiers of all participating
entity types, and all attributes of the relationship
 A table of this structured type must be declared
 Table defines not null constraints for mandatory attributes
 Set of identifiers of entity types define the primary key of the table
 Inverse references may also be defined

Academic staff
Type
Academic staff Project Participates
Participates Type …
(0,n) (1,n) Participates (0,n)
Employee no Start date Project id Acad. staff
... End date ... Project Project
Start date Type
End date

Participates (1,n)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 39


Translating from ER to OR Schemas (4)
 Rule 4: Only single inheritance is supported by the model 
multiple inheritance must be removed by transforming some
generalization links into binary one-to-one relationships, to
which Rule 3 is applied

Assistant Student Assistant Student


Assistant Type Major Assistant Type Major
... ... ... ...
PhD Student

PhD Student PhD Student

Thesis title Thesis title


... …
Student

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 40


Defining Object-Relational Schemas in SQL
create type AcademicStaffType as (
EmployeeNo integer,
Name row (
FirstName character varying (30),
LastName character varying (30) ),
Address character varying (50),
Email character varying (30),
Homepage character varying (64),
ResearchAreas character varying (30) multiset,
Participates ref(ParticipatesType) multiset scope Participates
references are checked )
ref is system generated;
create table AcademicStaff of AcademicStaffType (
constraint AcademicStaffPK primary key (EmployeeNo),
Email with options constraint AcademicStaffEmailNN Email not null,
ref is oid system generated );

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 41


Physical Database Design
 Specifies how database records are stored, accessed, and
related in order to ensure adequate performance of a
database application
 Requires to know the specificities of an application,
properties of data, and usage patterns
 Involves analyzing the transactions/queries that run
frequently, that are critical, and the periods of time in which
there will be a high demand on the database (peak load)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 42


Performance of Database Applications
 Factors for measuring performance of database applications
 Transaction throughput: # transaction processed in a time interval
 Response time: Elapsed time for the completion of a transaction
 Disk storage: Space required to store the database file
 A compromise has to be made among these factors
 Space-time trade-off: Reduce time to perform an operation by using
more space, and vice versa
 Query-update trade-off: Access to data can be more efficient by
imposing some structure upon it
 However the more elaborate the structure, the more time is needed
to built it and maintain it when content changes
 After initial physical design is done, it is necessary to
monitor it and tune it

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43


Data Organization
 A database is organized in secondary storage into files, each
composed of records, at their turn composed of fields
 In a disk, data is stored in disk blocks (pages)
 Set by the operating system during disk formatting
 Transfer of data between main memory and disk takes place
in units of disk blocks
 DBMSs store data on database blocks (pages)
 Selection of a database block depends on several issues
 Locking granularity may be at the block level, not the record level
 For disk efficiency, database block size must be equal to or be a
multiple of the disk block size

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 44


File Organization
 Arrangement of data in a file into records and blocks
 Heap files (unordered files): Records placed in the file in the
order as they are inserted
 Efficient insertions, slow retrieval
 Sequential files (ordered files): Records sorted on the values
of ordering fields
 Fast retrieval, slow inserting and deleting
 Hash files: Use a hash function that calculates the block
(bucket) of a record based on one or several fields
 Collision: Bucket is filled and a new record must be inserted
 Fastest retrieval, but collision management degrades performance

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 45


Indexes
 Additional access structures that speed up retrieval of
records in response to search conditions
 Provide alternative ways of accessing the records based on
the indexing fields on which the index is constructed
 Many different types of indexes
 Clustering vs nonclustering: Whether records in the file are physically
ordered according to the fields of the index
 Single column vs multiple column, depending on the number of
indexing files
 Unique vs nonunique: Whether duplicate values are allowed

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 46


Indexes, Clustering
 Types of indexes (cont.)
 Sparse or dense: Whether there are index records for all search
values
 Single-level vs multilevel: Whether an index is split in several smaller
indexes, with an index to these indexes
 Multi-level indexes often implemented by using B-trees of B+-trees
 Clustering: Tables physically stored together as they share
common columns (cluster key) and are often used together
 Improves data retrieval
 Cluster key stored only once  storage efficiency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 47


Outline
 Database Concepts
 Steps in Database Design
 Entity-Relationship Model
 Logical Database Design
 Relational Model
 Object-Relational Model
 Physical Database Design
 Data Warehouse Concepts
 Logical Data Warehouse Design
 Physical Data Warehouse Design
 Data Warehouse Architecture

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 48


Data Warehouses: Definition (1)
 Operational databases (online transaction processing
systems or OLTP), are not suitable for data analysis
 Contain detailed data, do not include historical data, perform poorly
for complex queries due to normalization
 Data warehouses address requirements of decision-making
users
 A data warehouse is a collection of subject-oriented,
integrated, nonvolatile, and time-varying data to support
data management decisions
 Subject-oriented: focus on particular subjects of analysis
(e.g., purchase, inventory, and sales in a retail company)
 Operational databases focus on specific functions of an application

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 49


Data Warehouses: Definition (2)
 Integrated: Join together data from several operational and
external systems problems due to differences in data
definition and content
 In operational DBs these problems are solved in the design phase
 Nonvolatile: Data modification and removal is disallowed 
scope of the data is long period of time
 Operational DBs keep data for a short period of time, as required for
daily operations
 Time-varying: Keep different values for the same informa-
tion and the time when changes to these values occurred
 Operational DBs typically do not have temporal support
 Online analytical processing (OLAP): Allows decision-making
users to perform interactive analysis of data

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 50


Multidimensional Model
 DWs and OLAP use a multidimensional view of data
 Represented as a data cube or an hypercube
 Dimensions: Perspectives for analyzing data
 Cells (facts): Contain measures, values that are to be analyzed
ity e Milan 24 18 28 14
(C tor
Rome 33 25 23 25
)
S

Nice 12 20 24 33
Paris

18
23
Q1 21 10 18 35 measure
Time (Quarter)

14

17
values

20
Q2 27 14 11 30

12

18
dimensions

33
Q3 26 12 35 32

10
Q4 14 20 47 31
games DVDs
books CDs
Product (Category)
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 51
Hierarchies
 Data granularity: Level of detail of measures
 Data analyzed at different granularities (abstraction levels)
 Hierarchies relate low-level (detailed) concepts to higher-
level (general concepts)
 Example: Store – City – Region/Province – Country
 Given two related levels in a hierarchy, lower level is called
child, higher level is called parent
 Instances of these levels are called members

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 52


Kinds of Hierarchies
 Balanced: Same number of levels from leaf members to root
Store dimension All

Country level France ... Italy

Region/ Ile-de-France Provence-Alpes- Lazio Lombardy


Province level Côte d'Azur ... ...

City level Paris Nice Rome Milan

Store level Store 1 Store 2 Store 3 ... Store 10 Store 11 Store 12

 Noncovering (ragged): Parent of some members does not


belong to the level immediately above
 E.g., small countries do not have Region/Province level
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 53
Kinds of Hierarchies
 Unbalanced: May not have instances at the lower levels
 E.g., some cities do not have stores, sales are through wholesalers
 Parent-child (recursive): Involve twice the same level
Lepage

Ivanov Claus

Vancamp Rochat Pirlot Quintin

Jottard Nguyen Weiss Spiegel Praet

Hallez Berger

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 54


Measure Aggregation and Summarizability
 Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
 Summarizability conditions ensure correct aggregation
 Disjointness of instances: Grouping of instances in a level with
respect to the parent in the next level must result in disjoint sets
 Completeness: All instances are included in the hierarchy and each
instance is related to one parent in the next level
 Correct use of aggregation functions: Type of measures determine
the kind of aggregation functions that can be applied

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 55


Measure Classification: Additivity
 Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all
dimensions
 E.g., sales amount can be summarized when the hierarchies in Store,
Time, and Product dimensions are traversed
 Semiadditive measures (stock or level measures): Can be
meaningfully summarized using addition along some (not
all) dimensions
 E.g., inventory quantities, can be aggregated in the Store dimension,
but cannot be aggregated in the Time dimension
 Nonadditive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any
dimension
 E.g., item price, cost per unit, exchange rate

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 56


Measure Classification: Aggregation Complexity
 Distributive measures: Defined by an aggregation function
that can be computed in a distributed way
 Data is partitioned into n sets, aggregate function applied to each
set, aggregated value is computed by applying a function to these n
subaggregate values
 E.g., count, sum, min, max
 Algebraic measures: Defined by an aggregation function that
has a bounded number of arguments, each is obtained by
applying a distributive function
 E.g., average, computed from sum and count, which are distributive
 Holistic measures: Cannot be computed from other
subaggregate values
 E.g., median, mode, rank

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 57


OLAP Operations: Roll up
 Transforms detailed measures into summarized ones when
one moves up in a hierarchy

Milan 24 18 28 14
ity e

Italy 57 43 51 39
(C tor

Rome 33 25 23 25
)

France
S

Nice 12 20 24 33
Paris 18 Q1 33 30 42 68

Time (Quarter)
23

41
Q1 21 10 18 35
14 17 Roll-up to the Country level
Time (Quarter)

Q2 27 14 11 30
20

37
Q2 27 14 11 30
12 18 Q3 26 12 35 32
33

51
Q3 26 12 35 32 0
1 Q4 14 20 47 31
Q4 14 20 47 31
games DVDs
games DVDs books CDs
books CDs Product (Category)
Product (Category)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 58


OLAP Operations: Drill down
 Opposite to the roll-up operation, i.e., it moves from a more
general level to a detailed level in a hierarchy

Milan 24 18 28 14 Milan 8 6 9 5

ity e
(C tor
ity e

Rome 10 8 11 8

)
(C tor

Rome 33 25 23 25
)

S
Nice 4 7 8 10
S

Nice 12 20 24 33
Paris
Paris 18

6
10
Q1 21 10 18 35 23 Drill-down to the Jan 7 2 6 13
Time (Quarter)

14
14 17

3
Month level

Time (Quarter)
20

7
Q2 27 14 11 30 Feb 8 4 8 12

...
12 18

9
...
Q3 26 12 35 32 33 Mar 6 4 4 10

...

8
1 0

14
Q4 14 20 47 31 ... ... ... ... ...

5
games DVDs Dec 4 4 16 7
books CDs games DVDs
Product (Category) books CDs
Product (Category)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 59


OLAP Operations: Pivot or Rotate
 Rotates the axes of a cube to provide an alternative
presentation of the data

Milan 24 18 28 14

or t
ity e

DVDs 35 30 32 31

y)
eg c
(C tor

Rome 33 25 23 25
)

at du
CDs 18 11 35 47
S

(C Pro
Nice 12 20 24 33 games 10 14 12 20
Paris 18 books

10
Q1 21 10 18 35 23

21
Time (Quarter)

14 17 Paris 21 27 26 14

17

33
20

Store (City)
Q2 27 14 11 30

20
12 18 Pivot Nice 12 14 11 13

28

18
Q3 26 12 35 32 33

47
1 0 Rome 33 28 35 32

19
Q4 14 20 47 31
Milan 24 23 25 18
games DVDs
Q1 Q2 Q3 Q4
books CDs
Product (Category) Time (Quarter)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 60


OLAP Operations: Slice
 Performs a selection on one dimension of a cube, resulting
in a subcube

Milan 24 18 28 14
ity e
(C tor

Rome 33 25 23 25
)

Q1 21 10 18 35

Time (Quarter)
S

Nice 12 20 24 33
Paris 18 Q2 27 14 11 30
Q1 21 10 18 35 23
Time (Quarter)

14 17 Q3 26 12 35 32
Q2 27 14 11 30 20 Slice on Store.City = ‘Paris’
12 18 Q4 14 20 47 31
Q3 26 12 35 32 33
1 0 games DVDs
Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 61


OLAP Operations: Dice
 Defines a selection on two or more dimensions, thus again
defining a subcube

Milan 24 18 28 14
ity e
(C tor

Rome 33 25 23 25
)

ity e
(C tor
S

)
Nice 12 20 24 33 Nice 12 20 24 33

S
Paris 18 Paris
23

(Quarter)
Q1 21 10 18 35
14 17 Dice on Store.Country = ‘France’
Time (Quarter)

Q1 21 10 18 35

Time

14
Q2 27 14 11 30 20 and Time.Quarter= ‘Q1’ or ‘Q2’
1 2 8 Q2 27 14 11 30
1
33
Q3 26 12 35 32 0 games DVDs
1
books CDs
Q4 14 20 47 31
Product (Category)
games DVDs
books CDs
Product (Category)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 62


Other OLAP Operations
 Drill-across: Executes queries involving more than one cube,
which share at least one dimension
 E.g., compare sales data in a cube with purchasing data in another
one
 Drill-through: Allows one to move from data at the bottom
level in a cube, to data in the operational systems from
which the cube was derived
 E.g., for determining the reason for outlier values in a data cube

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 63


Implementations of a Multidimensional Model
 Relational OLAP (ROLAP): Store data in relational databases,
support extensions to SQL and special access methods to
efficiently implement the model and its operations
 Multidimensional OLAP (MOLAP): Store data in special data
structures (e.g., arrays) and implement OLAP operations in
these structures
 Better performance than ROLAP for query and aggregation,
less storage capacity than ROLAP
 Hybrid OLAP (HOLAP): Combine both technologies,
 E.g., detailed data stored in relational databases, aggregations kept
in a separate MOLAP store

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 64


Logical Data Warehouse Design
 In ROLAP systems, tables organized in specialized structures
 Star schema: One fact table and a set of dimension tables
 Referential integrity constraints between fact table and dimension
tables
 Dimension tables may contain redundancy in the presence of
hierarchies
 Snowflake schema: Avoids redundancy of star schemas by
normalizing dimension tables
 Normalized tables optimize storage space, but decrease performance
 Starflake schema: Combination of the star and snowflake
schemas, some dimensions normalized, other not
 Constellation schema: Multiple fact tables that share
dimension tables

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 65


Logical DW Design: Star Schema

Product Store

Product key Store key


Product number Store number
Product name Sales Store name
Description Store address
Size Product fkey Manager name
Category name Store fkey City name
Category descr. Promotion fkey City population
Department name Time fkey City area
Department descr. Amount State name
... Quantity State population
State area
State major activity
Promotion Time ...
Promotion key Time key
Promotion descr. Date
Discount pct. Event
Type Weekday flag
Start date Weekend flag
End date Season
... ...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 66


Logical DW Design: Snowflake Schemas
Product Category Department

Product key Category key Department key


Product number Category name Department name
Product name Description Description
Description Department fkey ...
Size ...
Category fkey
...

Promotion Sales Time

Promotion key Product fkey Time key


Promotion descr. Store fkey Date
Discount pct. Promotion fkey Event
Type Time fkey Weekday flag
Start date Amount Weekend flag
End date Quantity Season
... ...

Store
City State
Store key
Store number City key State key
Store name City name State name
Store address City population State population
Manager name City area State area
City fkey State fkey State major activity
... ... ...
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 67
Logical DW Design: Constellation Schemas
Promotion Sales Store

Promotion key Product fkey Store key


Promotion descr. Store fkey Store number
Discount pct. Promotion fkey Store name
Type Time fkey Store address
Start date Amount Manager name
End date Quantity City name
... City population
City area
State name
Product Time State population
State area
Product key Time key State major activity
Product number Date ...
Product name Event
Description Weekday flag
Size Weekend flag
Category name Season
Category descr. ...
Department name
Department descr.
...
Purchases Supplier

Product fkey Supplier key


Supplier fkey Supplier name
Order time fkey Contact person
Due time fkey Supplier address
Amount City name
Quantity State name
Freight cost ...

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 68


Physical Data Warehouse Design: Views (1)
 Data warehouses are several orders of magnitude larger
than operational database
 Physical design is crucial for ensuring adequate response
time for complex queries
 View: Table derived from a query
 Query may involve base tables or other views
 Materialized view : View physically stored in a database
 Enhances query performance by precalculating costly operations
 This is achieved at the expense of storage space
 Updating materialized views: Modifications of underlying
base tables must be propagated into the view
 Must be performed incrementally to ensure efficiency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 69


Physical Data Warehouse Design: Views (2)
 In a DW not all possible aggregations can be materialized
 Number of aggregations grows exponentially with the number of
dimensions and hierarchies
 Selection of materialized views: Identify the queries to be
prioritized, which determines the views to be materialized
 DBMSs are providing tools that tune the selection of
materialized views given previous queries
 Queries must be rewritten in order to best exploit existing
materialized views to improve response time
 Drawback of materialized view approach: Requires one to
anticipate queries to be materialized
 DW queries are often ad hoc and cannot always be anticipated

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 70


Physical Data Warehouse Design: Indexes (1)
 Traditional indexes are not appropriate for data warehouses
 Designed for OLTP transactions that access a small number of tuples
 Bitmap indexes: For columns with a low number of distinct
values
 Each value is associated with a bit vector whose length is the
number of records in the table
 ith bit is set if ith row has that value in the column
 Small in size, logical operators used to manipulate them
Professor Index on Rank
PId Name ... Rank Rec# Assistant Associate Full
P1 John Smith ... Assistant 1 1 0 0
P2 Ada Jones ... Associate 2 0 1 0
P3 Bill Kock ... Full 3 0 0 1
P4 Ed Low ... Associate 4 0 1 0
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 71
Physical Data Warehouse Design: Indexes (2)
 Join indexes: Materialize relational join by keeping pairs of
row identifiers that participate in the join
Index on Sales Index on
Product Client
P1,C1,...
P1 C1
P1,C2,...
P2 C2
P2,C1,...
P3 ...
P3,C2,...
...
...

 Index selection: Select indexes to minimize execution cost of


a series of queries and updates
 DBMSs are providing tools to assist DBAs to this task

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 72


Fragmentation or Partitioning
 Divide the contents of a relation into several files that can be
more efficiently processed
 Vertical fragmentation: Splits attribute of a table into groups
that are stored independently
 E.g., most often used attributes in one partition
 Horizontal fragmentation: Divides records of a table into
groups according to a particular criterion
 Range partitioning: Partitioned wrt range of values
 E.g., partitioning based on time, each partition contains data about a
particular time period (year, month)
 Hash partitioning: Partitioned wrt hash function
 Combined partitioning: Combine these two methods
 E.g., rows first partitioned according to range values, then subdivided
using a hash function

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 73


Data Warehouse Architecture (1)

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 74


Data Warehouse Architecture (2)
 Data sources
 Operational databases
 Other internal or external sources of information (e.g. files)
 Back-end tier
 Extraction-Transformation-Loading (ETL) tools for manipulating data
from sources
 Data staging area: Intermediate database where manipulation is
done
 OLAP tier
 OLAP Server: Supports multidimensional data and operations
 Front-end tier: Deals with data analysis and visualization
 Composed of OLAP tools, reporting tools, statistical tools, data-
mining tools, …

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 75


Back-End Tier
 Extraction: Gathers data from multiple heterogeneous data
sources
 May be operational databases or files in various formats
 May be internal or external to the organization
 Uses APIs such as ODBC, JDBC, … for achieving interoperability
 Transformation: Modifies data to conform to the data
warehouse format
 Cleaning: Removes errors, inconsistencies, format transformation
 Integration: Reconciles data from different sources
 Aggregation: Summarizes data according to the granularity (level of
detail) of the DW
 Loading: Feeds the DW with transformed data
 Also includes refreshing the data warehouse at a specified frequency

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 76


Data Warehouse Tier
 Enterprise data warehouse: Centralized DW that
encompasses all areas in an organization
 Data mart: Specialized DW targeted to a particular
functional area or user group
 Their data can be derived from the enterprise DW or collected from
data sources
 Metadata repository: Describes the content of the DW
 Business metadata: Meaning (semantics) of data, organization rules,
policies, constraints, …
 Technical metadata: How data is structured/stored in the computer
 Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …
 ETL process: Data lineage (trace to sources), rules, defaults, refresh and
purging rules, algorithms for summarization, …

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 77


OLAP Tier
 OLAP servers that provides multidimensional view from DWs
and data marts
 Can be ROLAP, MOLAP, or HOLAP
 Most database products provide OLAP extensions and
related tools for manipulating cubes
 However, no standardized language for querying data cubes
 Oracle uses Java and query language OLAP DML
 SQL Server uses .NET and query language MDX
 XMLA (XML for Analysis) aims at providing a common
language for exchanging multidimensional data

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 78


Front-End Tier
 OLAP tools: Allow interactive exploration and manipulation
of the warehouse data
 Facilitate formulation of ad hoc queries (no prior knowledge of them)
 Reporting tools: Enable production, delivery and
management of reports (paper and web-based)
 Use predefined queries
 Statistical tools: Used to analyze and visualize the cube data
using statistical methods
 Data-mining tools: Allow users to analyze data to discover
patterns, trends, enable predictions

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 79


Variations of the Architecture
 Enterprise DW without data marts, or data marts without
enterprise DW
 Building enterprise DW is costly, building a data mart is easier
 Integrating independently created data marts is costly
 No OLAP server: Client tools access directly DW
 Extreme situation: No DW either, client tools access sources
 Virtual DW: Set of materialized views over data sources
 Easier to build, but does not provide a real DW solution
 Does not contain historical data, no centralized metadata, no possibility
to clean and transform the data
 No data staging area: If the source systems conform very
closely to the data in the DW
 Rarely the case in real-world situations

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 80

Potrebbero piacerti anche