Table of Contents
Preliminary Remarks
The Basic Construction Set
    Entity Types and Entities
    Attributes and Attribute Values
    Relationship Types and Relations
    ID-dependent Entity Types
    ISA-dependent Entity Types
    Composite Entity Types
Correct Entity Relationship diagrams
    Correct ER diagrams
    Mapping of correct ER diagrams into relational tables
Patterns
    Time dependencies
    Sets, Trees, Partitions and the like
    Accordion principle
    Group data collection
    Star join schema
    Making an ISA-battery dynamic
    Overlapping Foreign Keys
    4NF and 5NF Silly Warnings
Example DB designs
    Asset Management
    Customer Relations Management
Data Design Guidelines
Exercises
    Structural Exercises
    ER Diagram Design Exercises
        Exercise D1 (Monolingual Thesaurus)
        Exercise D2 (Journeys and Vehicles)
        Exercise D3 (Course Administration)
        Exercise D4 (SOX Application Controls)
        Exercise D5 (Household Services)
    Solutions to the Exercises
Appendix A: Bachman Diagrams
Appendix B for Specialists: Mathematics of IDNF
    Third Normal Form (3NF) is not enough
    Referential Integrity Guidelines
    NULL Avoidance
    Semantics Support
    Information versus Data Representation
Preliminary Remarks
The success of process-oriented business analysis methods and of the object-oriented
programming paradigm has brought the danger that people neglect the data side.
Data is just “made persistent” when the program or process does not use it any more. Even
worse, whole data structures are sometimes generated by automatic tools. The
consequence is that all parties involved, business analyst, programmer and even the
user, lose control over the semantics of the data.
This is not advantageous for a company like Swiss Re, where some data must be kept and
understood for decades, and where sophisticated statistical modelling and analysis as well
as very advanced reporting applications are in place.
For new applications it is important that data structures with clearly defined business
semantics are among the first things to be modeled. This document shows how to model
data structures for relational database systems. Why don’t we just take a “best practice”
textbook? Because we can do it better.
Swiss Re Zurich has a huge knowledge base and long practical experience with relational
database systems, beginning in 1983. We are the only company outside the technology
sector that is mentioned in the 1990 book by E. F. Codd, the famous inventor of the relational
database paradigm.
In 1989, when the so-called referential integrity functionality arrived, we interpreted the notion
of “correct entity relationship diagram” in such a way that the gap between classical
normalization and the new functionality was filled seamlessly and, beyond that, the
sometimes hard-to-understand normalization theory could be avoided completely. Please
read more on that in an appendix of this document.
This document is a much extended and improved version of the Swiss Re brochure “Entity
relationship for relational database”, which was published in a first edition in 1991 and in a
second edition in 1998. It contains many examples and many exercises with solutions.
[Figure: entity type “Employee”]
“The Rodale Book of Composting: Easy Methods for Every Gardener, by Grace Gershuny,
with International Standard Book Number 0878579915” is an entity of type “Book”.
[Figure: entity type “Book”]
Suppose you have two copies of the Composting book in your hands. Are they the same
book?
The two copies might be the same book, but surely they are different copies. A local public
lending library having ten copies of the Composting book probably stamps a copy number
into each copy and additionally needs a further entity type “CopyOfBook”.
[Figure: entity type “CopyOfBook”]
If all ten copies are lent, you can book the book, reserve it, and you will get a copy as soon
as one is available, hopefully.
You can tell the librarian: “I booked the book, and now you only give me a copy”. Then if he
says: “This is the book you booked”, you can say “No, this isn’t the book I booked, I didn’t
book a copy of the book I booked, I booked the book”, and he will send you away.
But if he understands you, he probably is a good database designer and well aware of the
identity problem and of the first and fundamental principle of relational database design,
entity integrity: it must be clearly decidable what an entity (of a given type) is, when it is
identical to another entity of the same type, and such an entity is equivalent to exactly one
row in the relational table that corresponds to the given type.
In the relational world entities are identified by attributes and their values.
[Figure: entity type “Employee” with attributes EmpNr, Name, Firstname, Address; entity type “CopyOfBook” with attributes ISBN, CopyNr, ShelfNr, Shape]
Entity integrity requires every relational table to have at least one key, namely an attribute
or maybe several attributes whose values are guaranteed to be unique over the lifetime of
the table.
In some cases the table even needs a primary key, which is just one of the keys designated
as primary. It will become clear in due course when an object type needs a primary key. Since
there can be at most one primary key for a type, we may underline the attributes that define
it. The entity type “CopyOfBook” has the primary key {ISBN, CopyNr}.
The entity integrity principle can hardly be overstressed. If you want to model a database
that can store and keep information about employees that can change over time, for example
“OfficeNr”, then your entity is not an employee but an “employee in a certain time interval”, or
“time-dependent information of a certain employee”, or something similar, and the
corresponding entity type should not carry the name “Employee”, but maybe “EmployeeHistory”.
[Figure: entity type “EmployeeHistory” with attributes EmpNr, ValidToDate, Department, OfficeNr]
Of course there will be some redundancy in the data of our entity type “EmployeeHistory”
insofar as there will come a new entity into existence if the employee changes his office
number or his department, so there can be two or more consecutive “ValidToDate” entries
for the same employee with the same office number or the same department.
But this is a controlled redundancy that is not dangerous and, more importantly, there are
ways to avoid it, to be discussed later.
Another way to depart from the entity integrity principle which should be avoided is to
connect the meaning, the business semantics, of an entity with other entities of the same
type as for instance in the somewhat trivial example
[Figure: entity type “Employee” with attributes EmpNr, Salary, Department, PercentageOfTotalDeptBenefit]
If you are not sure about the choice of natural keys like in
[Figure: entity type “Event” with attributes Date, Place, ShortName, Description]
[Figure: entity type “Event” with attributes EventNr, Date, Place, ShortName, Description]
But the more natural keys one can identify the better. Every natural key helps support the
entity integrity principle. In the above example, the same real event could be defined twice
with different artificial keys. But if we are sure that we will be able to choose different short
names for events that happen to occur at the same date in the same place, then we should
take {Date, Place, ShortName} as additional natural key besides the primary key {EventNr}.
By the way, key conditions map to unique indexes in the database, and every unique index
can help the database system to deliver improved performance.
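As a minimal sketch of the “Event” design with an artificial primary key plus an additional natural key, the table can be tried out with Python's built-in sqlite3 module. The column types, the sample rows and the use of SQLite are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Event (
        EventNr     INTEGER NOT NULL PRIMARY KEY,   -- artificial primary key
        "Date"      TEXT NOT NULL,
        Place       TEXT NOT NULL,
        ShortName   TEXT NOT NULL,
        Description TEXT,
        UNIQUE ("Date", Place, ShortName)           -- additional natural key
    )
""")
conn.execute(
    "INSERT INTO Event VALUES (1, '2024-05-01', 'Zurich', 'Kickoff', 'Project start')"
)

try:
    # The same real event under a different artificial key is refused by the natural key.
    conn.execute(
        "INSERT INTO Event VALUES (2, '2024-05-01', 'Zurich', 'Kickoff', 'Entered twice')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

event_count = conn.execute("SELECT COUNT(*) FROM Event").fetchone()[0]
```

Without the UNIQUE clause the second insert would succeed, and the same real event would exist twice under different values of EventNr.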
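[Figure: relationship type “Assessment” (Ca#, Ch#, E#, Result) with an arrow labelled “m” to the entity type “Candidate” (Ca#, Name etc.), an arrow labelled “1” to “Characteristic” (Ch#, Description) and an arrow labelled “1” to “Expert” (E#, EMail)]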
Every arrow constitutes an existence dependency and will therefore be mapped into a
referential integrity constraint. This will guarantee that a new assessment can be inserted
only when the database already contains the corresponding candidate, characteristic and
expert.
The targets of referential integrity constraints will always be primary keys. This answers the
question of when an object type needs a primary key, namely if there is an ingoing arrow
into it, which says that some other object type is existentially dependent on it.
The arrows have labels, “1” or “m”. They must be interpreted in the context of all object
types that our relationship type is dependent on.
In our example the “1” at the arrow from “Assessment” to “Characteristic” means that per
candidate and expert there can be at most one characteristic such that these constitute an
assessment. Therefore this “1” will map into a key condition, namely the relational table
“Assessment” will have {Ca#, E#} as a key.
The second label “1” in the figure, the one between “Assessment” and “Expert”, also
enforces a key condition, namely per candidate and characteristic there will be at most one
expert such that these constitute an assessment. Therefore the “Assessment” relational
table will have a second key, namely {Ca#, Ch#}.
The label “m” merely means that the designer decided that it is not “1” and does not lead to
any condition.
The whole of the semantics of our assessment database diagram shows up in the following
rudimentary CREATE TABLE statements.
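A minimal sketch of such statements, run through Python's sqlite3 module so that the constraints can be exercised. The column types, the SQLite dialect (including the quoting of names that contain “#”) and the sample rows are assumptions; only the tables, attributes and keys come from the diagram and the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Candidate      ("Ca#" TEXT NOT NULL PRIMARY KEY, Name TEXT);
CREATE TABLE Characteristic ("Ch#" TEXT NOT NULL PRIMARY KEY, Description TEXT);
CREATE TABLE Expert         ("E#"  TEXT NOT NULL PRIMARY KEY, EMail TEXT);
CREATE TABLE Assessment (
    "Ca#"  TEXT NOT NULL REFERENCES Candidate ("Ca#"),
    "Ch#"  TEXT NOT NULL REFERENCES Characteristic ("Ch#"),
    "E#"   TEXT NOT NULL REFERENCES Expert ("E#"),
    Result TEXT,
    UNIQUE ("Ca#", "E#"),    -- label "1" at the arrow to Characteristic
    UNIQUE ("Ca#", "Ch#")    -- label "1" at the arrow to Expert
);
INSERT INTO Candidate VALUES ('c1', 'Alice');
INSERT INTO Characteristic VALUES ('ch1', 'Diligence'), ('ch2', 'Creativity');
INSERT INTO Expert VALUES ('e1', 'e1@example.org'), ('e2', 'e2@example.org');
INSERT INTO Assessment VALUES ('c1', 'ch1', 'e1', 'good');
""")

def rejected(row):
    """Return True if the database refuses the new assessment row."""
    try:
        conn.execute("INSERT INTO Assessment VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

unknown_candidate = rejected(("c9", "ch1", "e1", "good"))      # no such candidate
second_characteristic = rejected(("c1", "ch2", "e1", "fair"))  # violates key {Ca#, E#}
second_expert = rejected(("c1", "ch1", "e2", "fair"))          # violates key {Ca#, Ch#}
```

All three attempts are refused: the first by referential integrity, the other two by the key conditions that the labels “1” map into.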
If the labels of all arrows that go out of a relationship type are “m” as in the following
ministry of commerce database
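[Figure: relationship type “TradeRelation” (C#, L#, P#, FirstTrade) with arrows, all labelled “m”, to the entity types “Company” (C#, Name), “Country” (L#, Name) and “Product” (P#, Name, Description)]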
then we have only one key in “TradeRelation”, the set of all foreign key attributes:
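A minimal sketch of the corresponding table definitions, again using Python's sqlite3 module. Column types and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Company ("C#" TEXT NOT NULL PRIMARY KEY, Name TEXT);
CREATE TABLE Country ("L#" TEXT NOT NULL PRIMARY KEY, Name TEXT);
CREATE TABLE Product ("P#" TEXT NOT NULL PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE TradeRelation (
    "C#" TEXT NOT NULL REFERENCES Company ("C#"),
    "L#" TEXT NOT NULL REFERENCES Country ("L#"),
    "P#" TEXT NOT NULL REFERENCES Product ("P#"),
    FirstTrade TEXT,
    UNIQUE ("C#", "L#", "P#")   -- all labels "m": the only key is the set of all foreign keys
);
INSERT INTO Company VALUES ('c1', 'Example Trading');
INSERT INTO Country VALUES ('l1', 'Switzerland');
INSERT INTO Product VALUES ('p1', 'Cheese', NULL), ('p2', 'Watches', NULL);
INSERT INTO TradeRelation VALUES ('c1', 'l1', 'p1', '1995-01-01');
""")

def rejected(row):
    """Return True if the database refuses the new trade relation."""
    try:
        conn.execute("INSERT INTO TradeRelation VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

same_triple_again = rejected(("c1", "l1", "p1", "2001-06-30"))  # key violation
other_product = rejected(("c1", "l1", "p2", "2001-06-30"))      # accepted
```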
Many relationship types are existentially dependent on only two entity types:
[Figure: relationship type “Covers” (CName, TName) with arrows, both labelled “m”, to the entity types “InsuranceCompany” (Name etc.) and “InsuranceType” (Name, Description)]
Another example:
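[Figure: relationship type “WorksFor” (E#, P#, StartDate) with an arrow labelled “m” to the entity type “Employee” (E#, Name etc.) and an arrow labelled “1” to the entity type “Project” (P#, Name, Description)]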
The label “1” means that every employee works for at most one project. Therefore {E#}
must be a key in “WorksFor”:
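A minimal sketch of the table definitions, using Python's sqlite3 module; column types and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Employee ("E#" TEXT NOT NULL PRIMARY KEY, Name TEXT);
CREATE TABLE Project  ("P#" TEXT NOT NULL PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE WorksFor (
    "E#" TEXT NOT NULL REFERENCES Employee ("E#"),
    "P#" TEXT NOT NULL REFERENCES Project ("P#"),
    StartDate TEXT,
    UNIQUE ("E#")    -- label "1" at the arrow to Project: at most one project per employee
);
INSERT INTO Employee VALUES ('e1', 'Meier'), ('e2', 'Keller');
INSERT INTO Project VALUES ('p1', 'Alpha', NULL), ('p2', 'Beta', NULL);
INSERT INTO WorksFor VALUES ('e1', 'p1', '2024-01-01');
""")

def rejected(row):
    """Return True if the database refuses the new WorksFor row."""
    try:
        conn.execute("INSERT INTO WorksFor VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

second_project = rejected(("e1", "p2", "2024-02-01"))   # violates key {E#}
shared_project = rejected(("e2", "p1", "2024-02-01"))   # accepted: label "m" at Employee
```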
At this point the question arises why we do not just define an attribute “WorksForProject” in
the entity type “Employee” with a referential integrity connection to the entity type
“Project”.
There are several points to consider here. The first is that the entity type “Employee” should
not be overloaded with attributes that are only relevant for some of the employees.
Remember that in our case we would have two attributes, namely “WorksForProject” and
“WorksForStartDate”. The latter explicitly belongs to the relationship of the
employee working for a project; it does not naturally belong to the employee nor to the
project.
A side effect of this unnatural attachment of the attribute “WorksForStartDate” to the entity
type “Employee” is the introduction of a hidden constraint that usually cannot be guaranteed
by the database system alone, namely
The second point to consider is the loss of flexibility. If you map “WorksFor” into a pure
referential integrity condition, then you must be very sure that the business condition “every
employee works for at most one project” is rather stable. With the first employee that
happens to have to work for two projects, one would have to introduce a separate relational
table anyway, and if we already have one as the above CREATE TABLE statement shows,
we only have to drop the key {E#} and add the key {E#, P#}, which can always be done.
Experience shows that business conditions containing the phrase “at most one” are seldom
very stable and should always be considered with caution (“How many wives do you have?”
“At most one!”).
The third point is that one should avoid NULLs wherever possible, because we are not used
to arguing in three-valued logic. Suppose we had a table
Employee(E#, Name,…,WorksForProject,…)
such that “WorksForProject” can be NULL. Assume that the employee with E#=’17’ has NULL
in his “WorksForProject” attribute. Then the harmless query for the other employees who
work in the same project delivers an empty answer, which might be OK. But the query for the
other employees who do not work in the same project delivers an empty answer as well,
because a comparison with NULL is never true.
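Both questions can be tried out in a small sketch with Python's sqlite3 module. The sample employees and the exact wording of the two queries are assumptions; only the table layout and employee ’17’ with a NULL project come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Employee ("E#" TEXT NOT NULL PRIMARY KEY, Name TEXT, WorksForProject TEXT)'
)
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", [
    ("17", "Meier", None),    # employee 17 has NULL in WorksForProject
    ("18", "Keller", "P1"),
    ("19", "Weber", "P2"),
])

# Scalar subquery that yields the project of employee 17, here NULL.
PROJECT_OF_17 = """(SELECT WorksForProject FROM Employee WHERE "E#" = '17')"""

same_project = conn.execute(
    f"""SELECT "E#" FROM Employee
        WHERE "E#" <> '17' AND WorksForProject = {PROJECT_OF_17}"""
).fetchall()

other_project = conn.execute(
    f"""SELECT "E#" FROM Employee
        WHERE "E#" <> '17' AND WorksForProject <> {PROJECT_OF_17}"""
).fetchall()
```

Both result lists are empty: neither “= NULL” nor “<> NULL” evaluates to true, so employees 18 and 19 appear in neither answer.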
The last point is a kind of ‘unity of doctrine’: referential integrity constraints, which go far
beyond classical normalization theory of relational databases, are always business existence
dependencies in our approach. They are not sometimes existence constraints, sometimes
half existence constraints (if you relate them to nonprimary keys that can be NULL) and
sometimes hidden relationship types.
Now let us assume that some of the projects are supervised by an internal specialist who, if
in this role, only supervises one project:
[Figure: relationship type “Supervision” (E#, P#, StartDate) with arrows, both labelled “1”, to the entity types “Employee” (E#, Name etc.) and “Project” (P#, Name, Description)]
Then the labels “1” at the arrows account for two keys in “Supervision”, {E#} and {P#}:
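A minimal sketch of the table definitions with both keys, using Python's sqlite3 module; column types and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Employee ("E#" TEXT NOT NULL PRIMARY KEY, Name TEXT);
CREATE TABLE Project  ("P#" TEXT NOT NULL PRIMARY KEY, Name TEXT, Description TEXT);
CREATE TABLE Supervision (
    "E#" TEXT NOT NULL REFERENCES Employee ("E#"),
    "P#" TEXT NOT NULL REFERENCES Project ("P#"),
    StartDate TEXT,
    UNIQUE ("E#"),   -- label "1" at the arrow to Project: one project per supervisor
    UNIQUE ("P#")    -- label "1" at the arrow to Employee: one supervisor per project
);
INSERT INTO Employee VALUES ('e1', 'Meier'), ('e2', 'Keller');
INSERT INTO Project VALUES ('p1', 'Alpha', NULL), ('p2', 'Beta', NULL);
INSERT INTO Supervision VALUES ('e1', 'p1', '2024-01-01');
""")

def rejected(row):
    """Return True if the database refuses the new Supervision row."""
    try:
        conn.execute("INSERT INTO Supervision VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

second_project_for_e1 = rejected(("e1", "p2", "2024-02-01"))     # violates key {E#}
second_supervisor_for_p1 = rejected(("e2", "p1", "2024-02-01"))  # violates key {P#}
independent_pair = rejected(("e2", "p2", "2024-02-01"))          # accepted
```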
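[Figure: relationship type “Recommendation” (B#, C#, T#, S#) with arrows labelled “m” to the entity types “Book” (B#) and “Class” (C#) and arrows labelled “1” to the entity types “Teacher” (T#) and “Subject” (S#)]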
Here we will have two keys in “Recommendation”, {B#, C#, S#} and {B#, T#, C#},
guaranteeing the business semantics of the labels “1”:
The label “1” at the arrow pointing to “Teacher” means that any book in any subject is
recommended to any class by at most one teacher, and
the label “1” at the arrow pointing to “Subject” means that any book is recommended by
any teacher to any class in at most one subject.
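A minimal sketch of “Recommendation” with its two keys, using Python's sqlite3 module. The parent tables are reduced to their primary keys, and the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Book    ("B#" TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Class   ("C#" TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Teacher ("T#" TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Subject ("S#" TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Recommendation (
    "B#" TEXT NOT NULL REFERENCES Book ("B#"),
    "C#" TEXT NOT NULL REFERENCES Class ("C#"),
    "T#" TEXT NOT NULL REFERENCES Teacher ("T#"),
    "S#" TEXT NOT NULL REFERENCES Subject ("S#"),
    UNIQUE ("B#", "C#", "S#"),   -- label "1" at the arrow to Teacher
    UNIQUE ("B#", "T#", "C#")    -- label "1" at the arrow to Subject
);
INSERT INTO Book VALUES ('b1');
INSERT INTO Class VALUES ('c1');
INSERT INTO Teacher VALUES ('t1'), ('t2');
INSERT INTO Subject VALUES ('s1'), ('s2');
INSERT INTO Recommendation VALUES ('b1', 'c1', 't1', 's1');
""")

def rejected(row):
    """Return True if the database refuses the new recommendation (B#, C#, T#, S#)."""
    try:
        conn.execute("INSERT INTO Recommendation VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

second_teacher = rejected(("b1", "c1", "t2", "s1"))        # violates key {B#, C#, S#}
second_subject = rejected(("b1", "c1", "t1", "s2"))        # violates key {B#, T#, C#}
other_teacher_and_subject = rejected(("b1", "c1", "t2", "s2"))  # accepted
```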
A widespread data structure pattern is the hierarchy. In fact, from the early seventies for
more than a decade many companies in which database systems were used to keep the
business data persistent, had to design everything in hierarchical structures.
It is typical for a hierarchical connection that an entity of the dependent level is uniquely
identified “in a natural way” only within its parent entity.
[Figure: entity type “CopyOfBook” with an arrow labelled “ID” to the entity type “Book” (ISBN, Author, Title)]
Note that we could define {ISBN, CopyNr} as unique, not necessarily as primary key, as long
as there is no object type dependent on “CopyOfBook”.
As usual, the arrow amounts to the referential integrity condition which implements the
existential dependency of “CopyOfBook” on “Book”, and as always the label of the arrow
influences the keys in the dependent object type. In the case of the label “ID”, the dependent
entity type must have a key defined that is an extension of the foreign key attributes. Loosely
speaking, the key of the ID-dependent type must be an extension of the key of the parent type.
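A minimal sketch of the ID-dependency of “CopyOfBook” on “Book”, using Python's sqlite3 module. Column types are assumptions, and the second ISBN is a hypothetical value used only to show that copy numbers are unique within their parent book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Book (ISBN TEXT NOT NULL PRIMARY KEY, Author TEXT, Title TEXT);
CREATE TABLE CopyOfBook (
    ISBN    TEXT    NOT NULL REFERENCES Book (ISBN),
    CopyNr  INTEGER NOT NULL,
    ShelfNr TEXT,
    PRIMARY KEY (ISBN, CopyNr)   -- key of the parent extended by CopyNr
);
INSERT INTO Book VALUES
    ('0878579915', 'Grace Gershuny', 'The Rodale Book of Composting'),
    ('0000000000', 'Hypothetical Author', 'Hypothetical Title');
INSERT INTO CopyOfBook VALUES ('0878579915', 1, 'A1'), ('0878579915', 2, 'A1');
""")

def rejected(row):
    """Return True if the database refuses the new copy."""
    try:
        conn.execute("INSERT INTO CopyOfBook VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

same_copy_again = rejected(("0878579915", 1, "B7"))     # key {ISBN, CopyNr} violated
copy_without_book = rejected(("1111111111", 1, "B7"))   # no such parent book
copy_one_of_other_book = rejected(("0000000000", 1, "B7"))  # accepted
```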
As another example we take loss development triangles used for reinsurance reserves
calculations (IBNR, Incurred But Not Reported). A loss triangle can be visualized as follows:
[Figure: loss development triangle over the accident years 1982 to 1987]
For a deeper understanding see, for instance, Erwin Straub, Non-Life Insurance Mathematics,
Springer 1988, Chapter 7. We are interested here only in a data structure that captures loss
development triangles.
[Figure: entity type “LossDevelopmentTriangle” (T#, Titel, Description) with two ID-dependent entity types, “Premiums” (T#, AccidentYear, Premium) and “Losses” (T#, AccidentYear, DevelopmentYear, Loss)]
Another practical use of ID-dependent entity types are multivalued attributes. Suppose we
have an entity type “Book” with an attribute “Subject”.
[Figure: entity type “Book” with attributes ISBN, Author, Title, Subject]
Now we realize that it should be possible to assign more than one subject to a book. Then
we can design “Subject” as an entity type that is ID-dependent on “Book”.
[Figure: entity type “Subject” (ISBN, Subject) with an arrow labelled “ID” to the entity type “Book” (ISBN, Author)]
As an aside, it should be mentioned here that in both forms “Subject” has no independent
existence: a subject exists in the database only as long as there is a book to which it is
attached.
[Figure: relationship type “Categorization” (ISBN, Subject) with arrows, both labelled “m”, to the entity types “Book” (ISBN, Author, Title) and “Subject” (Subject)]
But the categorization could also “go in the other direction”, namely if it defines a partition
of the entities to be categorized, that is, if every entity needs exactly one category.
[Figure: entity type “Book” (C#, ISBN, Title, Author) with an arrow labelled “ID” to the entity type “Category” (C#, Name, Description)]
This could be the situation in a book store where the shelves are categorized, which is
usually the case.
Many attributes are not really multivalued but do have a natural development of values in
time, as for instance “QuantityInStock” for “Assets”
[Figure: entity type “QuantityInStock” (A#, InStockDate, NumberOfUnits, ActualUnitPrice) with an arrow labelled “ID” to the entity type “Asset” (A#, Title etc.)]
But note that not every historization problem fits into the ID-dependency pattern. If you
have an entity type “E-Current” (whatever “E” means) and another one “E-History”, it is
perfectly fine to design them to be independent of each other.
[Figure: independent entity types “E-Current” (K#, A1, A2, A3 etc.) and “E-History” (K#, ValidToTimestamp, A1, A2, A3 etc.)]
By the way, beginners sometimes tend to believe that in a data design diagram everything
should somehow be connected to everything else. This is not so; on the contrary, the more
independence and “orthogonality of concepts” there is in a data structure, the better.
The world is full of natural ID-dependencies. A last example (for the moment):
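[Figure: entity types “Software” (S#, Name), “User” (U#) and “SoftwareVersion” (S#, V#, CertificationDate), the latter with an arrow labelled “ID” to “Software”; relationship type “Order” (S#, U#) between “Software” and “User” and relationship type “Installed” (S#, V#, U#, InstallDate) between “SoftwareVersion” and “User”, with all their arrows labelled “m”]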
Note that I can have versions of software that I did not order and also might have ordered
software which is not installed yet.
“ISA” comes from “is a” and means that the dependent entities are also of the type of the
parent entities. It corresponds to generalization in the direction towards the parent and to
specialization in the opposite direction.
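[Figure: entity type “BusinessPartner” (B#, Name, Address) with the ISA-dependent entity types “Customer” (B#) and “Supplier” (B#); further attributes of the dependent types: DegreeOfCreditWorthiness, FaxNumber, Reliability]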
The arrows, as always, are existence dependencies and are mapped into referential integrity
constraints, and the labels, as usual, map into key constraints of the dependent types. If a
customer is a business partner, then by entity integrity the foreign key of “Customer” also
must be a key (one row in “Customer” equals a customer and is a business partner equal to
a row in “BusinessPartner”). The same principle applies to “Supplier”.
This construction only makes sense if the dependent entity types do not have the same
attributes and if there are some attributes that apply to all dependents, here “Name” and
“Address” as examples. Note that a customer can be a supplier at the same time.
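A minimal sketch of the ISA construction, using Python's sqlite3 module. Column types and sample rows are assumptions, as is the assignment of “DegreeOfCreditWorthiness” to “Customer” and of “Reliability” to “Supplier”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE BusinessPartner ("B#" TEXT NOT NULL PRIMARY KEY, Name TEXT, Address TEXT);
-- ISA: the foreign key of the dependent type is itself a key.
CREATE TABLE Customer (
    "B#" TEXT NOT NULL PRIMARY KEY REFERENCES BusinessPartner ("B#"),
    DegreeOfCreditWorthiness TEXT
);
CREATE TABLE Supplier (
    "B#" TEXT NOT NULL PRIMARY KEY REFERENCES BusinessPartner ("B#"),
    Reliability TEXT
);
INSERT INTO BusinessPartner VALUES ('b1', 'Example Ltd', 'Zurich');
INSERT INTO Customer VALUES ('b1', 'high');
INSERT INTO Supplier VALUES ('b1', 'good');   -- a customer can also be a supplier
""")

def rejected(sql, params):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False

customer_twice = rejected("INSERT INTO Customer VALUES (?, ?)", ("b1", "low"))
customer_without_partner = rejected("INSERT INTO Customer VALUES (?, ?)", ("b9", "low"))
both_roles = conn.execute(
    'SELECT COUNT(*) FROM Customer c JOIN Supplier s ON s."B#" = c."B#"'
).fetchone()[0]
```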
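[Figure: entity type “StaffMember” (S#, Name, TelNumber) with the ISA-dependent entity types “ExternalStaff” (S#, Company) and “Employee” (S#, PrivateAddress); relationship types “WorksFor” (S#, P#), arrows labelled “m” and “m”, and “Leads” (S#, P#), arrows labelled “m” and “1”, connected to the entity type “Project” (P#, Name)]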
Generalization also sometimes simplifies the structures. Take the case of three independent
entity types “Book”, “Magazine” and “Article”. Articles might be contained in books or in
magazines, books can describe or mention magazines and vice versa, articles might write
about books, and so on.
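[Figure: independent entity types “Magazine”, “Book” and “Article”, connected by the relationship types “MagazineOnBook”, “BookOnMagazine”, “ContainedInBook”, “ContainedInMagazine” and “ArticleOnBook”]
[Figure: entity type “WrittenPiece” with the relationship types “IsContainedIn” (P#Contained, P#In) and “WritesAbout” (P#Writes, P#About), and three arrows labelled “ISA” pointing to “WrittenPiece”]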
A composite entity type is just a relationship type that has been promoted to an entity type
because some other object type will become dependent on it. Then the newly promoted
composite entity type needs a primary key (relationship types need keys but they do not
need primary keys because nothing else is dependent on them).
Take the case of the relationship type “Arrangement” between “StaffMember” and
“Project”.
[Figure: relationship type “Arrangement” (M#, P#, Text) with arrows, both labelled “m”, to the entity types “StaffMember” (M#, Name) and “Project” (P#, Name, Description)]
“Arrangement” has a key, {M#, P#}, enforced by the labels “m” at the arrows, but no
primary key.
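[Figure: composite entity type “Arrangement” (M#, P#, Text) between “StaffMember” (M#, Name) and “Project” (P#, Name, Description), with the entity type “Details” (M#, P#, D#, DateAdded, Text) connected to it by an arrow labelled “ID”]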
As another example we consider internet auctions. There are objects to be offered, maybe
in several auctions if nobody wanted them in the first place. An offering is therefore clearly
a combination of an auction and an object. As such it would be a relation. But now different
participants, known as URLs (network addresses) come into play and make bids for the
offering.
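[Figure: composite entity type “Offering” (O#, A#) with arrows labelled “m” to the entity types “Object” (O#, Picture, Description) and “Auction” (A#, BeginTimestamp, EndTimestamp); relationship type “Bid” (A#, O#, Url, BidTimestamp, BidPrice) with arrows labelled “m” to “Offering” and to the entity type “UrlAtGivenTimestamp” (Url, BidTimestamp)]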
Note that we could have taken “Url” as independent entity type and “UrlAtGivenTimestamp”
ID-dependent on it. That would be necessary if there is a need to store information on the
URL as such, for example whether it is registered to be authorized to take part in the
auction, and so on.
Of course there are also other structurings possible (it is not forbidden to discuss other
options).
As a last example (for the moment) take a company that offers household services like
cleaning or gardening services but also babysitting services. Then it is interesting for the
company to keep an exact trace of the requested timing of services.
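[Figure: composite entity type “ServiceRequest” (R#, H#) with arrows labelled “m” to the entity types “CustomerRequest” (R#, RequestDate) and “HouseholdService” (H#, Description), and the entity type “ServiceStaggering” (R#, H#, S#, Date, TimeFrom, TimeTo) connected to it by an arrow labelled “ID”]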
Correct ER diagrams
Not every collection of rectangles, diamonds, etc. can be considered as a “correct” diagram.
[Figure: the set D of all ER diagrams, drawn as points, containing the subset C of correct diagrams]
Let D be the set of all possible ER diagrams (every point is a diagram). In D there are
diagrams which either cannot be mapped into relational tables in a meaningful way or where
possible results of such a mapping have unwanted qualities.
We will define a subset C of correct diagrams that are easily recognizable as such, together
with a “natural” mapping into relational tables that are fully normalized, have only referential
integrity constraints with decent behaviour and some other nice qualities. For database
specialists there is an appendix where this is elaborated in a more precise way.
The correct diagrams are defined recursively as the empty diagram and those diagrams that
can be produced from a correct diagram by application of one of the following six
operations.
2) Define attribute:
assumption: E is a rectangle or diamond or diamond rectangle
result: new (within E uniquely named) oval connected with E
[Figure sequence: a correct diagram built up step by step from the elements E, F, G, H, I, R, S, A and B, with arrow labels “1”, “m”, “ID” and “ISA”; two of the steps are captioned “Define attribute”]
And so on.
Due to the recursiveness of the definition of correct ER diagrams it is possible to define the
mapping into relational tables very easily, namely also recursively, along the lines that have
been sketched in the examples with given “CREATE TABLE” statements.
Since some diagrams might be drawn for overview purposes that are still correct but do not
contain all necessary attributes or keys etc., we should ensure that the notions
key
primary key
set of foreign key/primary key attribute pairs
The independent entity types are the most important types. Every other type is directly or
indirectly existentially dependent on them and inherits some of their key attributes, so we
should first ensure that they have their (primary) keys defined.
Every set of foreign key/primary key attribute pairs corresponds to an arrow. In most cases
it will be clear from attribute names which pairs constitute an arrow. For instance in the
example
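[Figure: relationship type “WorksFor” (PName, PForename, CName) with an arrow labelled “m” to the entity type “Person” (Name, Forename) and an arrow labelled “1” to the entity type “Company” (Name)]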
where there is a name clash in the attributes of “WorksFor”, resolved by proper renaming,
we have two sets of pairs, namely
{<WorksFor.Pname, Person.Name>,
<WorksFor.Pforename, Person.Forename>}
{<WorksFor.Cname, Company.Name>}
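As a sketch of how these two sets of pairs turn into table definitions, each set becomes one FOREIGN KEY clause. The example uses Python's sqlite3 module; the column types, the sample rows, the choice of {Name, Forename} as primary key of “Person” and the key {Pname, Pforename} in “WorksFor” (read off a label “1” at the arrow to “Company”) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Person (
    Name TEXT NOT NULL, Forename TEXT NOT NULL,
    PRIMARY KEY (Name, Forename)
);
CREATE TABLE Company (Name TEXT NOT NULL PRIMARY KEY);
CREATE TABLE WorksFor (
    Pname     TEXT NOT NULL,
    Pforename TEXT NOT NULL,
    Cname     TEXT NOT NULL,
    -- one clause per arrow, listing the foreign key/primary key attribute pairs
    FOREIGN KEY (Pname, Pforename) REFERENCES Person (Name, Forename),
    FOREIGN KEY (Cname) REFERENCES Company (Name),
    UNIQUE (Pname, Pforename)   -- assumed: label "1" at the arrow to Company
);
INSERT INTO Person VALUES ('Muster', 'Anna');
INSERT INTO Company VALUES ('Example Ltd');
INSERT INTO WorksFor VALUES ('Muster', 'Anna', 'Example Ltd');
""")

def rejected(row):
    """Return True if the database refuses the new WorksFor row."""
    try:
        conn.execute("INSERT INTO WorksFor VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

# Name and forename swapped: the pair does not match any row of Person.
swapped_pair = rejected(("Anna", "Muster", "Example Ltd"))
unknown_company = rejected(("Muster", "Anna", "Nowhere Inc"))
```

The renaming of the attributes in “WorksFor” is what makes the pairing explicit: the order of the columns in each FOREIGN KEY clause follows the order of the pairs.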
Warning: The following page contains mathematically presented material. If you are not
familiar with set theory notation, skip the page and consider again the examples with
added “CREATE TABLE” statements. Grasping them gives you enough understanding for all
practical business examples.
To treat the recursion step “define relationship type” more formally (to be sure that it is
defined for all cases) let the relationship type R be dependent on the (composite) entity
types E1 , E2 , … , En with the sets of foreign key/primary key attribute pairs
corresponding to the arrow pointing from R to E(j), for 1 ≤ j ≤ N. Note that we have n
entity types and N arrows where N ≥ n (in most cases N=n).
If all arrows from R to E(j) have the label “m”, then the set of attributes
must be defined as key in R. This is the set of all foreign key attributes.
But if there is an arrow with label “1”, then for all k such that the arrow from R to E(k)
has the label “1”, the set
must be defined as key in R. This is the set of all foreign key attributes corresponding to
the arrows from R to E(j) for all j ≠ k. Note that in the case of overlapping foreign key
attributes this set does not necessarily equal the set of all foreign key attributes minus
those corresponding to the arrow from R to E(k).
We do not recommend overlapping foreign key attributes, but in some cases they might be
practical. So, if all arrows have label “m”, then we have one key, if one arrow has label “1”,
we also have one key, if two arrows have label “1”, we have two keys, etc.
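The key rule for relationship types can be written compactly. The following is a sketch in a notation assumed here for illustration: $F_j$ stands for the set of foreign key attributes of $R$ that belong to the arrow from $R$ to $E(j)$, for $1 \le j \le N$.

```latex
% All arrows labelled "m": a single key, the set of all foreign key attributes.
\[
  K \;=\; \bigcup_{j=1}^{N} F_j
\]
% Otherwise: one key for every arrow k that is labelled "1".
\[
  K_k \;=\; \bigcup_{\substack{1 \le j \le N \\ j \ne k}} F_j
\]
% With overlapping foreign key attributes, K_k can differ from K \setminus F_k,
% because an attribute of F_k may also occur in some F_j with j \ne k.
```

In the assessment example the three arrows give $F_1 = \{\text{Ca\#}\}$, $F_2 = \{\text{Ch\#}\}$ and $F_3 = \{\text{E\#}\}$. The labels “1” at the arrows to “Characteristic” and “Expert” then yield the keys $K_2 = \{\text{Ca\#}, \text{E\#}\}$ and $K_3 = \{\text{Ca\#}, \text{Ch\#}\}$, as stated there.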
In the case of the recursion step “define ID-dependent entity type” we have a given
(composite) entity type E with primary key {B1, B2, … , Bn}, n ≥ 1. The arrow pointing from
the new entity type F to E corresponds to the set of foreign key/primary key attribute pairs
{<A1, B1>, <A2, B2>, … , <An, Bn>}.
In this case we must define a key in F that includes {A1, A2, … , An}, the set of foreign key
attributes, plus at least one additional attribute of F.
In the case of the recursion step “define ISA-dependent entity type” we have a given
(composite) entity type E with primary key {B1, B2, … , Bn}, n ≥ 1. The arrow pointing from
the new entity type F to E corresponds to the set of foreign key/primary key attribute pairs
{<A1, B1>, <A2, B2>, … , <An, Bn>}.
In this case we must define {A1, A2, … , An}, the set of foreign key attributes, as a key
of F.
Of course, every (composite) entity type with incoming arrows must have a primary key
defined.
Now it should be clear how a correct ER diagram, enriched if necessary with properly
defined keys, primary keys and sets of foreign key/primary key attribute pairs is mapped into
relational tables.
The observation is that the correct diagrams have clearly defined semantics. For example
every arrow is an existence constraint and nothing else. That is one of the reasons why we
always refer to the primary key of the parent table in referential integrity constraints, in
other words, the arrow which corresponds to
maps into
On the other hand, the diagrams show the tables as they will be defined, so the analyst
formulating queries against the tables does not have to switch back and forth between
tables and business semantics of diagrams with a different structuring.
To draw a correct diagram means to move inside rather narrow borders. But up to now
there is no business data structure known that could not be represented by a correct
diagram.
The big benefit of correct diagrams is that they map directly into IDNF relational tables
(Inclusion Dependency Normal Form, which includes Boyce-Codd Normal Form), that they
avoid the appearance of NULLs completely at all dangerous places, and that all semantics
expressible in the diagram can be controlled by the database system (queries that try to
violate these semantics get an SQL error back). More on this in an appendix.
Patterns
Time dependencies
Take the example of a weather statistics table that stores periods of similar weather
situations.
[Figure: entity type “WeatherPeriodStart” with attributes City, FromTime, WeatherSituation, AverageTemperature]
The idea is to capture for example a three day period of cloudless weather in Zurich, a
following period of three hours of rain, and so on.
Note that an entity of type “WeatherPeriodStart” is not a period, only the start of a period.
Even if we let a period end when the next one starts, we do not know when the last one
ends for a given city.
This can lead to wrong answers for certain queries. The search for the weather situation in
Zurich at time ‘t’ would have to look for a row r with r.FromTime ≤ ‘t’ such that there is
no row s for Zurich with r.FromTime < s.FromTime ≤ ‘t’. But since the average
temperature per period is also to be stored, which can be calculated only at the end of a
period, we cannot be sure whether the period that started last has already ended.
In other words to identify a period we need two rows that have none in between. If an
entity should represent the period, it also needs an endpoint.
[Figure: entity type “WeatherPeriod” with attributes City, FromTime, ToTime, WeatherSituation, AverageTemperature]
The above-mentioned query could now be formulated much more easily and would probably
also perform much better. But there is a price to pay, namely the program (or user) that
enters the values must now guarantee that no overlapping periods are entered.
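The two formulations of the query can be compared in a small sketch with Python's sqlite3 module. The key {City, FromTime}, the half-open reading of a period (FromTime included, ToTime excluded), the timestamp format and the sample rows are assumptions; “AverageTemperature” is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WeatherPeriodStart (
    City TEXT NOT NULL, FromTime TEXT NOT NULL, WeatherSituation TEXT,
    PRIMARY KEY (City, FromTime)
);
CREATE TABLE WeatherPeriod (
    City TEXT NOT NULL, FromTime TEXT NOT NULL, ToTime TEXT NOT NULL,
    WeatherSituation TEXT,
    PRIMARY KEY (City, FromTime)
);
INSERT INTO WeatherPeriodStart VALUES
    ('Zurich', '2024-06-01 00:00', 'cloudless'),
    ('Zurich', '2024-06-04 00:00', 'rain');
INSERT INTO WeatherPeriod VALUES
    ('Zurich', '2024-06-01 00:00', '2024-06-04 00:00', 'cloudless'),
    ('Zurich', '2024-06-04 00:00', '2024-06-04 03:00', 'rain');
""")

def situation_from_starts(city, t):
    """Row r with r.FromTime <= t such that no row s starts after r and up to t."""
    row = conn.execute("""
        SELECT r.WeatherSituation FROM WeatherPeriodStart r
        WHERE r.City = ? AND r.FromTime <= ?
          AND NOT EXISTS (SELECT 1 FROM WeatherPeriodStart s
                          WHERE s.City = r.City
                            AND s.FromTime > r.FromTime AND s.FromTime <= ?)
    """, (city, t, t)).fetchone()
    return row[0] if row else None

def situation_from_periods(city, t):
    """Row whose period [FromTime, ToTime) contains t."""
    row = conn.execute("""
        SELECT WeatherSituation FROM WeatherPeriod
        WHERE City = ? AND FromTime <= ? AND ? < ToTime
    """, (city, t, t)).fetchone()
    return row[0] if row else None
```

For a time inside a recorded period both functions agree. For a time after the last recorded period, situation_from_starts still reports the last situation, while situation_from_periods returns None because no stored period covers that time.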
The entity type “WeatherPeriod” is much more reliable in information content than
“WeatherPeriodStart”. Take the example of a hobby meteorologist who sometimes forgets
to observe the weather or to enter the information about it. He should prefer the entity type
“WeatherPeriod”.
[Figure: relationship type “Employed” (P#, C#, FromDate, ToDate) with arrows, both labelled “m”, to the entity types “Person” (P#) and “Company” (C#)]
In this example a person can be employed in one and the same company only once, even
though the employment has a period of validity defined.
A possible resolution of the issue is to take an independent entity type to represent the
periods of validity.
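[Figure: independent entity type “Employment” (E#, FromDate, ToDate), related through the relationship type “EmplApplic” (P#, C#, E#) to “Person” (key P#) and “Company” (key C#)]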
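The difference between the two designs can be tried out with a small SQLite sketch. The table layout is one plausible mapping and not taken from the text: the tables for “Person” and “Company” are left out for brevity, and making E alone the key of “EmplApplic” is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- First design: the key (P, C) admits only one row per person and company.
    CREATE TABLE Employed (
        P INTEGER NOT NULL, C INTEGER NOT NULL,
        FromDate TEXT NOT NULL, ToDate TEXT,
        PRIMARY KEY (P, C)
    );
    -- Second design: periods of validity are entities with their own key E.
    CREATE TABLE Employment (
        E INTEGER PRIMARY KEY, FromDate TEXT NOT NULL, ToDate TEXT
    );
    CREATE TABLE EmplApplic (
        E INTEGER PRIMARY KEY REFERENCES Employment(E),
        P INTEGER NOT NULL, C INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO Employed VALUES (1, 7, '2001-01-01', '2003-12-31')")
try:
    # The same person returns to the same company years later.
    conn.execute("INSERT INTO Employed VALUES (1, 7, '2010-01-01', NULL)")
    second_period_accepted = True
except sqlite3.IntegrityError:
    second_period_accepted = False

conn.executemany("INSERT INTO Employment VALUES (?, ?, ?)",
                 [(100, "2001-01-01", "2003-12-31"), (101, "2010-01-01", None)])
conn.executemany("INSERT INTO EmplApplic VALUES (?, ?, ?)",
                 [(100, 1, 7), (101, 1, 7)])
periods = conn.execute(
    "SELECT COUNT(*) FROM EmplApplic WHERE P = 1 AND C = 7").fetchone()[0]
```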
[Figures: entity type “E” (key K#) with an attribute A, followed by variants in which A is held in ID-dependent entity types “AinTime” (K#, ValidFrom, A), “AValues” (K#, A) and “AValuesInTime” (K#, ValidFrom, A)]
Most sets are defined implicitly; for example, the set of contracts a given employee is
responsible for would be realized by directly relating the contracts to the employees. But if
we need sets of elements explicitly, where the sets may even have names, then we may
take the following pattern.
[Figure: relationship type “BelongsTo” (S#, E#) between “SetOfElement” (S#, Name; label m) and “Element” (E#; label m)]
If the sets should not be overlapping, as for example in a partition, we must change only
one label.
[Figure: relationship type “BelongsTo” (S#, E#) between “SetOfElement” (S#, Name; label 1) and “Element” (E#; label m)]
This is not mathematical set theory, of course, because there we have a lot of generally
accepted axioms that could not be enforced by such data structures alone, but only by
program (or user) controlled constraints. An example would be the extensionality axiom
which says that two sets that have the same elements are equal. So for example there is
only one empty set, but our design pattern would allow many different empty sets.
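Both variants and the remark on empty sets can be illustrated with a short SQLite sketch. The table name “BelongsToPartition” and the sample data are hypothetical; the mapping of the label “1” to a key on E alone is the assumption that makes the sets disjoint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SetOfElement (S INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Element (E INTEGER PRIMARY KEY);
    -- Label "m" on both sides: an element may belong to many sets.
    CREATE TABLE BelongsTo (
        S INTEGER NOT NULL REFERENCES SetOfElement(S),
        E INTEGER NOT NULL REFERENCES Element(E),
        PRIMARY KEY (S, E)
    );
    -- Label "1" on the set side: E alone is the key, so sets cannot overlap.
    CREATE TABLE BelongsToPartition (
        S INTEGER NOT NULL REFERENCES SetOfElement(S),
        E INTEGER PRIMARY KEY REFERENCES Element(E)
    );
    INSERT INTO SetOfElement VALUES (1, 'red'), (2, 'blue'), (3, 'empty A'), (4, 'empty B');
    INSERT INTO Element VALUES (10), (11);
    INSERT INTO BelongsTo VALUES (1, 11), (2, 10), (2, 11);
    INSERT INTO BelongsToPartition VALUES (1, 11), (2, 10);
""")

try:
    # Element 11 already belongs to set 1 in the partition variant.
    conn.execute("INSERT INTO BelongsToPartition VALUES (2, 11)")
    overlap_accepted = True
except sqlite3.IntegrityError:
    overlap_accepted = False

# Nothing in the structure prevents several distinct empty sets.
empty_sets = conn.execute("""
    SELECT COUNT(*) FROM SetOfElement s
    WHERE NOT EXISTS (SELECT 1 FROM BelongsTo b WHERE b.S = s.S)
""").fetchone()[0]
```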
A graph is a set of nodes where any two of them may be connected by an edge.
[Figure: entity type “Node” (key N#) with relationship type “Edge” (foreign keys NLeft# and NRight#), label m on both sides]
Note that in this representation one has to abstract from the direction given by the fact that
there is automatically an ordering in (NLeft#, NRight#), even if we call the foreign keys
Lauren and Rose instead of NLeft# and NRight#.
This pattern would also represent a directed graph, where the edges have a direction on
purpose, that is, are arrows so to speak.
A directed graph where each node has at most one outgoing arrow is a forest.
[Figure: entity type “Node” (key N#) with relationship type “Edge” (foreign keys NDown# and NUp#), labels m and 1]
A subset of nodes and edges of a forest where each two nodes are connected by a path of
edges is a tree. Trees are often used to map hierarchical structures, for example the profit
centers of an enterprise.
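A hierarchy stored this way can be walked with a recursive query. The following SQLite sketch assumes that NDown is the key of “Edge” (at most one outgoing arrow per node); the node names are invented. The key alone does not exclude cycles, which would make the recursion run forever, so acyclicity remains a constraint for the entering program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Node (N INTEGER PRIMARY KEY, Name TEXT);
    -- NDown is the key: every node has at most one outgoing arrow (to NUp).
    CREATE TABLE Edge (
        NDown INTEGER PRIMARY KEY REFERENCES Node(N),
        NUp   INTEGER NOT NULL REFERENCES Node(N)
    );
    INSERT INTO Node VALUES (1, 'Group'), (2, 'Europe'), (3, 'Zurich'), (4, 'Asia');
    INSERT INTO Edge VALUES (2, 1), (3, 2), (4, 1);
""")

def path_to_root(n):
    """Follow the arrows upwards from node n and return the node names in order."""
    rows = conn.execute("""
        WITH RECURSIVE up(N, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT e.NUp, up.depth + 1 FROM up JOIN Edge e ON e.NDown = up.N
        )
        SELECT Node.Name FROM up JOIN Node ON Node.N = up.N ORDER BY up.depth
    """, (n,)).fetchall()
    return [name for (name,) in rows]
```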
Accordion principle
In most operational application systems where hundreds of users enter and ask for millions
of data entries sooner or later the users would like to add new attributes to certain
important entity types (booking, contract, claim, and so on).
Very often this is not possible without changing dozens of programs and user interfaces. If
it is too expensive to add new attributes, users often change the semantics of the coding
structure of an attribute that is already there, and map their envisioned new attribute into an
old one (for instance, all contracts of the new sort are coded with a leading ‘X’). Such
attribute domain shuffling does not make the logic of reporting programs easier.
Therefore the Swiss Reinsurance Company invented in 1985 a then new design pattern and
called it the accordion principle (“Handorgel”). Nowadays in university circles it is called
“descriptor oriented data model”, or something like that.
Suppose you are given an entity type “E” with “mandatory” attributes “A1”, “A2”, “A3”,
and with “optional” attributes “B1”, “B2”, … , “Bn”.
[Figure: entity type “E” with attributes A1, A2, A3 and B1, B2, …, Bn]
Suppose that at design time the optional attributes are not all known in advance. This is the
business case we discuss here, namely that the user wants to introduce new attributes into the
running system later on.
The solution to this problem lies in the movement shown by an accordion, namely to
compress the optional partly unknown B-attributes in horizontal direction and unfold them
again in the vertical direction. In other words, the attribute names “B1”, “B2”, etc
themselves have to become values in a column of a relational table.
[Figure: relationship type “EAttributeValues” (A1, AName, Value) between “E” (A1, A2, A3; label m) and “EAttributes” (AName; label m)]
A typical relation of type “EAttributeValues” would be <a1, B3, b>, which means that the
entity of type “E” with primary key value ‘a1’ has, possibly among others, a defined value
for the attribute ‘B3’, namely ‘b’.
Of course this is only possible if the B-attributes all have the same data type, therefore
sometimes two or more accordions are designed, one for character data and one for numeric
data. This pressure towards norming data types has the advantage that also user interfaces
can be designed in a flexible way independent of concrete B-attributes.
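The accordion can be sketched in SQLite as follows. The text data type for all values, the helper function names and the sample attribute ‘B7’ are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE E (A1 TEXT PRIMARY KEY, A2 TEXT, A3 TEXT);
    CREATE TABLE EAttributes (AName TEXT PRIMARY KEY);
    CREATE TABLE EAttributeValues (
        A1    TEXT NOT NULL REFERENCES E(A1),
        AName TEXT NOT NULL REFERENCES EAttributes(AName),
        Value TEXT NOT NULL,
        PRIMARY KEY (A1, AName)
    );
    INSERT INTO E VALUES ('a1', 'x', 'y');
    INSERT INTO EAttributes VALUES ('B1'), ('B3');
    INSERT INTO EAttributeValues VALUES ('a1', 'B3', 'b');   -- the relation <a1, B3, b>
""")

def add_attribute(name):
    """A user introduces a new optional attribute at runtime: one INSERT, no DDL."""
    conn.execute("INSERT INTO EAttributes VALUES (?)", (name,))

def attributes_of(a1):
    """Unfold the accordion: the optional attributes of one entity as a dict."""
    rows = conn.execute(
        "SELECT AName, Value FROM EAttributeValues WHERE A1 = ?", (a1,))
    return dict(rows.fetchall())

add_attribute("B7")
conn.execute("INSERT INTO EAttributeValues VALUES ('a1', 'B7', 'new')")
```

No program or table definition has to change when ‘B7’ appears; only rows are added.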
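[Figures: “Warehouse” (key W#) and “Article” (key A#) connected by three m:m relationship types “InStock”, “InLocalCatalog” and “OrderedForWarehouse”; then the same two entity types connected by a single relationship type “Concerns” (W#, A#) together with the attributes RType and Name]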
A group of companies collects financial data from subsidiaries in the form of amount vectors
of variable length
The (Ak,ak) are pairs of attributes Ak and corresponding attribute values ak. One and the
same attribute may occur several times, that is Ak = Aj for some k ≠ j, but then the values
have to be different, ak ≠ aj. The number n of attribute-value pairs may be different from
vector to vector, because not every kind of financial data needs the same level of detail in
description.
The central office puts all amount vectors into one data pot, for easy access with SQL to
control certain conditions (for instance certain amounts will have to be equal to the sum of a
set of other amounts, etc).
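[Figure: “Amount” (A#, Amount, Currency) related by “AmountAttrValue” (A#, AName, AValue) to “AttrValue” (AName, AValue)]
[Figure: star join schema with a central relationship type “Facts” (attribute FactsValue) connected with label m to the dimensions D1, D2, D3, D4, …, Dn with keys A1, A2, A3, A4, …, An]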
The ideal star join schema for OLAP (On-line Analytical Processing) consists of dimensions
D1, D2, … , Dn, which are independent entity types, and a facts relationship type
dependent on the dimensions.
The facts table has one or more additional attributes (besides the foreign keys corresponding
to the dimensions), “FactsValue” in the example diagram, which should be additive in the
following sense.
For every set M of tuples <d1, d2, … , dn> of dimensional key values that makes sense
from an information content point, if there is a tuple <m1, m2, … , mn> that information
content wise represents M, then the facts value of this tuple must be equal to the sum of all
facts values of tuples in M.
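The additivity requirement can be checked mechanically once it is stored which key values represent which sets. The sketch below uses two dimensions only; the helper table “D1Rollup”, which says that one D1 key value stands for a set of other D1 key values, is hypothetical and not part of the schema described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Facts (
        D1 TEXT NOT NULL, D2 INTEGER NOT NULL, FactsValue REAL NOT NULL,
        PRIMARY KEY (D1, D2)
    );
    -- Hypothetical helper: which D1 key values a coarser D1 key value represents.
    CREATE TABLE D1Rollup (Parent TEXT NOT NULL, Child TEXT NOT NULL,
                           PRIMARY KEY (Parent, Child));
    INSERT INTO D1Rollup VALUES ('CH', 'ZH'), ('CH', 'GE');
    INSERT INTO Facts VALUES ('ZH', 2024, 70.0), ('GE', 2024, 30.0), ('CH', 2024, 100.0);
""")

def additivity_violations():
    """Return (parent, d2, stored, summed) rows where stored differs from the sum."""
    return conn.execute("""
        SELECT p.D1, p.D2, p.FactsValue, SUM(c.FactsValue)
        FROM Facts p
        JOIN D1Rollup r ON r.Parent = p.D1
        JOIN Facts c ON c.D1 = r.Child AND c.D2 = p.D2
        GROUP BY p.D1, p.D2, p.FactsValue
        HAVING ABS(p.FactsValue - SUM(c.FactsValue)) > 1e-9
    """).fetchall()
```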
This additivity constraint is sometimes difficult to achieve. Take the case of dimension D1
standing for geographical unit with the attributes “Region”, “District”, “Country” and
Some systems exaggerate the usage of the generalisation/specialisation principle and are
confronted with an ISA-battery that should grow and shrink dynamically (at runtime).
[Figure: ISA hierarchy with root “E0” (key E#) and subtypes E1 to E7, some of which carry attributes such as A2, A3, A6, A13 and A17]
To capture the dynamics into structure, we map this to the following diagram.
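[Figure: “EType” (key T#) with ID-dependent entity types “E” (T#, E#) and “EAttr” (T#, A#), and the m:m relationship type “EAValue” (T#, E#, A#, AValue) between them]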
Please note the use of the overlapping foreign key attribute T# in “EAValue”.
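What the overlapping foreign key buys can be seen in a small SQLite sketch: because T is part of both composite foreign keys of “EAValue”, an entity can only receive values for attributes of its own type. The key choices follow one plausible table mapping and are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE EType (T INTEGER PRIMARY KEY);
    CREATE TABLE E     (T INTEGER NOT NULL REFERENCES EType(T), E INTEGER NOT NULL,
                        PRIMARY KEY (T, E));
    CREATE TABLE EAttr (T INTEGER NOT NULL REFERENCES EType(T), A INTEGER NOT NULL,
                        PRIMARY KEY (T, A));
    CREATE TABLE EAValue (
        T INTEGER NOT NULL, E INTEGER NOT NULL, A INTEGER NOT NULL, AValue TEXT,
        PRIMARY KEY (T, E, A),
        FOREIGN KEY (T, E) REFERENCES E(T, E),      -- column T is shared by
        FOREIGN KEY (T, A) REFERENCES EAttr(T, A)   -- both foreign keys
    );
    INSERT INTO EType VALUES (1), (2);
    INSERT INTO E VALUES (1, 500);        -- entity 500 is of type 1
    INSERT INTO EAttr VALUES (1, 3), (2, 17);
    INSERT INTO EAValue VALUES (1, 500, 3, 'ok');
""")

def try_value(t, e, a, value):
    """Try to store an attribute value; return False if a foreign key rejects it."""
    try:
        conn.execute("INSERT INTO EAValue VALUES (?, ?, ?, ?)", (t, e, a, value))
        return True
    except sqlite3.IntegrityError:
        return False
```

Attribute 17 belongs to type 2, so neither `try_value(1, 500, 17, ...)` nor `try_value(2, 500, 17, ...)` can succeed for entity 500 of type 1.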
Every relationship type that is dependent on some object type more than once via different
arrow paths is a potential candidate for overlapping foreign keys. The crucial question is
whether for every relation of the considered type there is one object on which it is
multidependent. We consider an example.
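[Figure: “Customer” (key C#) with ID-dependent entity types “Order” (C#, O#) and “Delivery” (C#, D#), and relationship types “OrderPosition” and “DeliveryPosition” involving “Article” (key A#)]
[Figure, variant 1: ternary relationship type “Export” (C#, P#, L#) between “Company” (key C#), “Product” (key P#) and “Country” (key L#), all with label m]
[Figure, variant 2: relationship types “Produces” (C#, P#) between “Company” and “Product” and “SellsTo” (C#, L#) between “Company” and “Country”, all with label m]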
Introductory courses on data design sometimes speculate on fourth and fifth normal forms
(4NF, 5NF) with the help of such examples. For instance it is said that variant 1 carries
the danger of violating these forms.
Forget that! You can only violate, for example, the fourth normal form with variant 1 if,
besides the diagram, you explicitly write down and demand a constraint saying, for instance,
that the data in “Export” must at every point in time be in such a shape that it could be mapped by
corresponding projections into “Produces” and “SellsTo” of variant 2 in such a way that you
could get back “Export” from “Produces” and “SellsTo” by (natural) join.
But if you are aware of such a constraint and formulate it, you will in this case draw
variant 2 directly. Please note that the business information contents that can be captured by the
diagrams are quite different in variant 1 compared to variant 2.
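Whether a given content of “Export” happens to satisfy such a constraint can be tested directly. The SQLite sketch below compares “Export” with the natural join of its two projections; the sample rows and the function name are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Export (C TEXT, P TEXT, L TEXT, PRIMARY KEY (C, P, L));
    INSERT INTO Export VALUES
        ('Acme', 'bolt', 'FR'), ('Acme', 'bolt', 'IT'), ('Acme', 'nut', 'FR');
""")

def export_is_join_of_projections():
    """True if Export equals the natural join of its (C, P) and (C, L) projections."""
    missing = conn.execute("""
        SELECT pr.C, pr.P, st.L
        FROM (SELECT DISTINCT C, P FROM Export) pr
        JOIN (SELECT DISTINCT C, L FROM Export) st ON st.C = pr.C
        EXCEPT
        SELECT C, P, L FROM Export
    """).fetchall()
    return not missing
```

With the three sample rows the join would also produce <Acme, nut, IT>, which “Export” does not contain. Variant 1 can therefore record that Acme exports nuts to France only, a fact that variant 2 cannot express.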
Example DB designs
Asset Management
Since we want to capture all kinds of assets of different types we take an entity type
“Asset” together with an accordion “Asset_Attributes” and “AAttrVal”. “Asset” might have
a description but consists primarily of its key identifying the abstract entity asset.
[Figure: accordion for assets: “AAttrVal” (A#, AAttr, Attr_Val) between “Asset” (A#, Descr; label m) and “Asset_Attributes” (AAttr, Descr; label m)]
This construct can capture any kind of (asset) entity that is describable by attributes, of
course also future not yet invented types.
The only restriction we impose on the abstract entity asset concerns intangible assets (not a
building or work of art etc.), for instance shares: if we hold 500 shares of General
Motors in stock, we will not enter 500 entities into “Asset” but only one. Therefore a
mechanism is needed to capture the number of units of an asset in time. Our typical pattern
for this is the following.
[Figure: “Asset” (key A#) with ID-dependent entity type “Asset_In_Time” (A#, TimeSt, UnitsHeld)]
[Figure: “Portfolio” (P#, Name, Manager, Create_Date) with ID-dependent entity type “Ptf_Development” (P#, TimeSt)]
“Ptf_Development” will then get related to “Asset” with a relationship attribute (besides the
foreign keys it consists of) to capture the number of units.
[Figure: relationship type “Asset_In_Ptf” (P#, TimeSt, A#, No_Units) between “Ptf_Development” (label m) and “Asset” (label m)]
Next we design another independent entity type “Trade”, also envisaging very different
kinds of trades with different sets of attributes, therefore using the accordion principle
again.
[Figure: accordion for trades: “TAttrVal” (T#, TAttr, Attr_Val) between “Trade” (T#, Trade_ID, User_ID, SettleDate, Buy_Sell; label m) and “Trade_Attributes” (TAttr, Descr; label m)]
The next independent entity type is “Broker”, which stands for all kinds of financial
intermediaries.
[Figure: relationship type “Trade_Asset” (T#, A#, B#, No_Units) between “Trade” (key T#), “Asset” (key A#) and “Broker” (B#, Name, etc.)]
Many assets have an issuer company, but we will not model that as attribute but rather as
another independent entity type since it will be connected also to ratings.
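[Figure: relationship type “Issue” (C#, A#, Issue_Date) between “Issuer_Company” (C#, Name, etc.) and “Asset” (key A#)]
[Figure: “Rating_Org” (R#, Name, Descr) with ID-dependent entity type “Rating” (R#, TimeSt, Remarks)]
[Figure: relationship types “Comp_Rating” (C#, R#, TimeSt, CRating) between “Issuer_Company” and “Rating”, and “Asset_Rating” (A#, R#, TimeSt, ARating) between “Asset” and “Rating”]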
Now let us put the pieces together for the whole picture.
[Figure: overall ER diagram combining the pieces above: Asset with Asset_Attributes, AAttrVal and Asset_In_Time; Portfolio with Ptf_Development and Asset_In_Ptf; Trade with Trade_Attributes, TAttrVal and Trade_Asset; Broker; Issuer_Company with Issue; Rating_Org with Rating, Comp_Rating and Asset_Rating]
There might be more attributes, for instance at every table the authentication identification
(Auth-ID) of the process or user that inserted the row.
As next example we design the kernel of a customer relations management system for a
company whose customers are other companies and where the customer relationships are
very much concentrated on personal relationships with people of the customer companies,
as is typically the case with a reinsurance undertaking.
We take our own employees, the customer companies and those of their employees that come
into our focus for granted and make no attempt to historize the corresponding entities.
[Figure: independent entity types “Employee” (E#, Name, Address, etc.) and “Customer Company” (C#, Name), and the ID-dependent entity type “Company Person” (C#, P#, Name, Address)]
First of all, at every point in time there is (at most) one responsible employee for a customer
company, and this has to be historized.
[Figure: relationship type “Responsible” (C#, E#, FromDate) and its history counterpart “Responsible History” (with ToDate)]
The next construct around marketing activities needs no separate history entity types, since
marketing activity has an abstract key and a time point as one of its attributes.
[Figure: relationship type “MParticipation” (M#, C#, E#) around marketing activities]
Instead of describing marketing activities in a text field “Descr” we could take an accordion
for any number of dynamic attributes in “MarketingActivity” (not shown here, since the
principle should be clear by now).
Contacts are most important, therefore we distinguish different types of contacts (which
could be mails, electronic mails, telephones, personal contacts, etc.). Here also there is no
separate historization construct necessary.
[Figure: “ContactType” (T#, Name, Descr) with ID-dependent entity type “Contact” (T#, Timepoint, Descr, Report), and relationship type “CParticipation” between “Contact”, “Company Person” (C#, P#) and “Employee” (E#)]
For persons of customer companies everything is collected that they themselves make
known, and that could be important to know for an adequate polite treatment of contacts to
them. Of course, this amounts to another accordion.
[Figure: accordion for company persons: “PAttrValues” (C#, P#, Name, AValues) between “Company Person” (label m) and “PAttributes” (Name; label m)]
Note that this accordion could also be used for a historization of addresses of company
persons (by taking attributes like “CurrentAddress”, “AddressTill20060626”, etc.).
[Figure: overall ER diagram of the customer relations kernel combining Customer Company, Company Person, Employee, Responsible and Responsible History, Marketing Activity with MParticipation, ContactType and Contact with CParticipation, and PAttributes with PAttrValues]
There is no royal road to the right data structure, simply because in some cases one can say
only after a few years of a running system whether the data structures had been chosen
adequately in the first place.
[Figure: past, present and future set against reality]
Now one could argue that a data structure or an electronic system in general not only
should predict the future requirements, but also shape them, and therefore have an influence
on the future organization of the business. But this is dangerous because information
technology should follow business and not vice versa.
The only way out of the dilemma between normalizing past requirements and predicting the
future is to design the (data) structures flexibly, in such a way that the user has a maximal
influence on the perceived (data) structures, and can himself respond fast and flexibly to
new business requirements. This can be achieved by using the abstract design patterns
described in this document.
The independent entity types are also the ones that should be described the most carefully.
What exactly is or should be an entity of the given type? The answer to this question has a
close connection to the choice of the keys. It is a fatal habit of object oriented design that
keys are very often almost neglected. Keys in a relational arrangement are explicitly
designated sets of attributes.
It is advisable to consider dependencies on time from the very beginning. An entity can have
several time dimensions, such as accident year, development year, business year, reporting year,
time of estimate, time of creation, and so on. If an entity changes in the course of time, one
should focus not on that entity but on the combination <entity, time> taken as the new
entity to be modeled.
If the data model will carry a system where many users enter tons of individual data entries
(in contrast to batch loaded read only systems), then it should be designed in such a way
that never any deletion of data entered by a user has to take place. The data might be
moved from one table to another history table in one and the same transactional atomic
unit, but it should not be deleted. The reason for this is not only the appearance of state
authorities that want to know what happened nine and a half years ago, but also the fact
that programming errors can cause big damage, and the chances to wrongly delete entries
reduce if deletion as such is a rare happening in a program (I know programs with bad
experience in that respect).
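Moving a row to a history table in one atomic unit might look as follows in SQLite from Python. The table names “Contract” and “ContractHistory”, the MovedAt column and the function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contract        (K INTEGER PRIMARY KEY, Descr TEXT);
    CREATE TABLE ContractHistory (K INTEGER, Descr TEXT, MovedAt TEXT);
    INSERT INTO Contract VALUES (1, 'fire cover'), (2, 'flood cover');
""")

def retire_contract(k):
    """Move one row to the history table; both statements commit or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO ContractHistory "
            "SELECT K, Descr, datetime('now') FROM Contract WHERE K = ?", (k,))
        conn.execute("DELETE FROM Contract WHERE K = ?", (k,))

retire_contract(1)
```

The only DELETE in the program is the one paired with the copy into the history table, so the information entered by the user is never lost.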
Now comes a piece of advice from Radio Erivan, useless but nevertheless with some truth in it.
Find the right mix in generalization, often it enhances flexibility, but if everything that should
happen in the system is generalized to a business event, then that very business event
entity type will be a first class bottleneck in the running system, and recovery turns to a
nightmare.
It should be possible to define and understand the business semantics in the data structures
without considering surrounding program logic or processes. Data should not be just a
byproduct or a garbage of a process. “Make persistent” is a bad advisor for data design.
Do not generate the data model automatically from a business object model. This would
result in semantics gaps, inability to report from the data with direct query access tools (so
that every new report has to be programmed within the same technical environment),
unnecessary excesses of referential integrity, NULLs and outer joins festivals, and the
inability to perceive the data as a value per se. Data structures cannot be changed every
day, processes can be changed several times a day.
If you take an object oriented programming language, be aware that there is no generic
mapping from OO to Relational for all cases. Each and every case must be considered
individually, and the mapping constructed accordingly.
Observations of running systems show extreme differences in what happens if the company
undergoes a reorganization. In some systems the changes can be defined completely by the
user community, and in some other systems many programs have to be adapted or even
reprogrammed. Please draw your own conclusions from that observation.
In systems with a heavy load of batch jobs consider the invention of control data structures
that ease the program restartability, as well as the batch interfacing. One should be able to
abort manually a batch job at any time and restart it without much repetition or
endangerment of the database management system.
To change data structures in a running system is not easy. Therefore, think first, program
later. The data model should be in a stable state before the first line of program is written.
Exercises
Structural Exercises
Exercise S1:
Let the following database for countries, languages and official languages be given. Please
propose a simplification.
[Figure: “Country” (C#, Name, Capital) and “Language” (L#, Name) connected by two m:m relationship types, “OfficialLanguage” (C#, L#) and “SpokenLanguage” (C#, L#, Percent of Population)]
Exercise S2:
Let the following part of a software distribution administration be given. Please make an
enhancement such that also different versions of the same software can be controlled.
[Figure: m:m relationship type “Distribution” (S#, P#) between “Software” (key S#) and “PCAdress” (key P#), with further attributes]
Exercise S3:
Let the following diagram be given with unknown labels x,y and z at the arrows.
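[Figure: ternary relationship type “R” (A1, A2, A3) between “E1” (key A1, label x), “E2” (key A2, label y) and “E3” (key A3, label z)]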
The labels of course have to be “1” or “m”. For every one of the eight possible choices of
these labels please show which keys in R are implied by the labels.
Exercise S4:
For the following two variants please discuss the difference in information capturing
capability.
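[Figure, variant 1: ternary relationship type “Export” between “Company”, “Product” and “Country”]
[Figure, variant 2: relationship type “Manufactures” between “Company” and “Product”, and relationship type “Export” connecting “Manufactures” with “Country”]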
Exercise S5:
[Figure: six diagram variants (Variant 1 to Variant 6) built from entity types E1 to E4, relationship types R1 and R2, and ISA and ID dependencies]
Exercise S6:
Let the following diagram be given where only the primary key attributes of the independent
entity types are drawn in the figure.
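[Figure: “EDPCompany” (key C#) and “SoftwareVersion” (keys S#, V#) connected by the m:m relationship type “ContractFrame”; ID-dependent entity type “ContractActualisation” (Date); relationship type “Responsible” (labels m and 1) to “Employee” (key E#)]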
Please fill in the missing key and foreign key attributes and formulate an SQL CREATE
TABLE statement for “Responsible”.
Exercise S7:
Please describe the following database in prose (requirements that might have led to it).
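[Figure: “Customer” (C#, Name, Address) with ID-dependent “Bill” (C#, B#, Date); “TariffPosition” (T#, Descr, Price); m:m relationship type “Composition” (C#, B#, T#, Number) between “Bill” and “TariffPosition”; ID-dependent on “Bill” the entity types “OrderToPay” (C#, B#, O#, Date) and “Payment” (C#, B#, P#, Date, Amount)]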
Exercise S8:
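[Figure, variant 1: ternary relationship type “Composition” between “Supplier”, “Part” and “Product”]
[Figure, variant 2: relationship type “SuppliesPart” between “Supplier” (label 1) and “Part”, and m:m relationship type “Composition” between “Part” and “Product”]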
(3) If a certain part is needed for a certain product, then it always comes from the same
supplier.
(4) A key of “Composition” consists of part and product (which of course means the
corresponding foreign key attributes).
(6) All products that contain a given part get something from the same supplier.
Please design a correct ER diagram for a monolingual thesaurus. For the requirements
definition we follow more or less the ANSI/NISO standard Z39.19 – 2003.
There are terms with term names. Some of the terms are preferred terms (for the usage in
indexing documents). If β is a preferred term for α, then α is also called a synonym for β
(standard notation α USE β, or equivalently, β UF α, UF means “Used For”).
We want a variant where for any term at most one preferred term is defined, therefore, for
the handling of the so called compound USE references (for example “Snowmobiles USE
Vehicles + Snow”) we differentiate generally also compound terms, which are also terms
but which are composed of some other terms.
Among the preferred terms there are a “Broader Term” and a “Narrower Term” relation
defined, such that x BT y is equivalent to y NT x.
Then “Related Terms” (RT) of different (user definable) categories must be storable.
It must be said here that the design of a diagram is easy, more difficult would be to
formulate an SQL query that finds all interesting preferred terms an index search engine
should look for, given any term. But this document is about ER design, not SQL.
Please design a correct ER diagram for the administration of a vehicle park and planned
journeys realized by chauffeurs.
There are chauffeurs with names, birthdays and telephone numbers, and vehicle categories,
and the information who has a drivers license for which vehicle category.
There are vehicles according to the categories, with serial numbers, construction year and
brand. For each vehicle the mileage in form of numbers of kilometers driven is captured from
time to time.
Journeys must be planned with time interval, destination, purpose and price. If a planned
journey is realized, the information is captured what chauffeur made it with which vehicle.
On a journey a vehicle is driven by only one chauffeur, and on one and the same journey a
chauffeur drives only one vehicle. But in rare cases there can be several vehicles (with equal
number of corresponding chauffeurs) on the same journey.
There are courses with titles and descriptions and a unique course number, as well as
course realizations, where a course is actually given at a certain time and place. The same
course can have several realizations.
Every concrete realization of a course has only one person giving it, but different realizations
of the same course might have different course teachers.
The course teachers are internal or external employees, where for internals their office must
be kept whereas for externals their company.
For every course realization all participants must be stored. The participants all are internal
employees.
Internal employees can apply for course participations, even several times for the same
course, they might have found a better justification in the meantime.
Please design a correct ER diagram for the administration of an aspect of the Sarbanes
Oxley, Section 404, Controls, namely application controls. Application controls are the
functions of applications that support the process level business controls. This is not to be
mixed up with IT controls, which look after the proper handling of all aspects of IT applications.
We must capture business processes with a hierarchical leveling for subprocesses. Risks
have to be identified in processes, one risk in at most one process.
Then we need process level controls (PLC), and the possibility to declare which PLC
mitigates which risk. A risk can be mitigated by several PLCs, and a PLC can mitigate
several risks.
One must be able to define application controls (AC), which support the PLCs. There are
manual PLCs and automated PLCs. The automated PLCs are carried out by one application
control, and the manual PLCs may be supported by several application controls.
Please design a correct ER diagram for the administration of a household services agency.
There are household services (for example with name and description), customers with
name, address and telephone numbers, and customer requests (date, remarks).
A customer request may relate to several household services, and every one of these
service requests may contain a time slicing (for example “next week Thursday and Friday
from 8 to 10 pm babysitting”).
Of course the agency needs a file of persons with the information who can offer what
services, and also with information about special knowledge or skills, for example to identify
those service offering persons that are ready to offer private lessons in mathematics (those
who have weapons of math instruction, not to be mixed up with mass destruction).
A plan must be made that attaches service persons to service request time slices, that can
be declared to be definitive.
Exercise S1 Solution:
[Figure: “Country” (C#, Name, Capital) and “Language” (L#, Name) connected by the single m:m relationship type “SpokenLanguage” (C#, L#, Percent of Population, Official Language YesNo)]
Note that the proposed solution is only possible because there was an attribute “Percent of
Population”. In Switzerland the German language is an official language, but not a spoken
one, so German can be entered with Percent of Population (that speaks it) equal to zero.
Exercise S2 Solution:
for example
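[Figure: “Software” (key S#) with ID-dependent entity type “SoftwareVersion” (S#, Version), and m:m relationship type “Distribution” (S#, Version, P#, further attributes) between “SoftwareVersion” and “PCAdress” (key P#)]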
Exercise S3 Solution:
x   y   z   keys
--------------------------------------------------
m   m   m   {A1,A2,A3}
m   m   1   {A1,A2}
m   1   m   {A1,A3}
m   1   1   {A1,A3} and {A1,A2}
1   m   m   {A2,A3}
1   m   1   {A2,A3} and {A1,A2}
1   1   m   {A2,A3} and {A1,A3}
1   1   1   {A2,A3} and {A1,A3} and {A1,A2}
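The rule behind the table can be written down in a few lines of Python: a label “1” at one entity type makes the foreign keys of the other two a key of R, and if no label is “1” all three together form the key. The function name is hypothetical.

```python
def implied_keys(x, y, z):
    """Keys of R implied by the labels at E1, E2, E3 (each "1" or "m")."""
    attrs = ("A1", "A2", "A3")
    keys = [
        {a for j, a in enumerate(attrs) if j != i}   # the other two attributes
        for i, label in enumerate((x, y, z))
        if label == "1"
    ]
    return keys or [set(attrs)]
```

For example, `implied_keys("m", "1", "1")` yields the two keys {A1,A3} and {A1,A2} of the fourth row.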
Exercise S4 Solution:
Variant 2 has more information capturing capabilities than variant 1. First, every data
content of variant 1 can be mapped into variant 2, by filling “Manufactures” with the
projection of “Export” onto the foreign key attributes that correspond to “Company” and
“Product”. Second, not every data content of variant 2 can be mapped into variant 1
without loss. In variant 2 it is possible to capture information on the manufacturing of
products by companies, without the necessity that these products are exported or that the
export of these products is known.
Exercise S5 Solution:
Exercise S6 Solution:
[Figure: the diagram of exercise S6 with keys filled in: “ContractFrame” (C#, S#, V#), “ContractActualisation” (C#, S#, V#, A#, Date) and “Responsible” (C#, S#, V#, A#, E#)]
Exercise S7 Solution:
The database can capture customers with name and address, bills sent to them as well as
payments and orders to pay. Bills have a date and can be connected with a set of tariff
positions, such that a tariff position can be counted several times (for example 2 times drill
at the dentists). The tariff positions have a description and a price. The payments, that
relate always to one bill, can be part payments, which are entered with date and amount. It
is possible to have several orders to pay per bill, and each one has also a date.
Exercise S8 Solution:
[Figure: ER diagram for the thesaurus: “Term” (TermName) with ID-dependent “ScopeNotes” (TermName, SN#, Note) and ISA subtypes “PreferredTerm” and “CompoundTerm”; relationship types “ComposedOf”, “USE_UF” (UF_TermName, USE_TermName), “BT_NT” (BT_TermName, NT_TermName) and “RT” (TermName1, TermName2, RTCategoryName); entity type “RT_Category” (RTCategoryName)]
[Figure: ER diagram for the vehicle park: “Chauffeur” (C#, Name, BirthDay, Tel), “VehicleCategory” (F#, Description), “HasDriversLicenceFor” (C#, F#), “Vehicle” (V#, F#, Brand, SerialNumber, ConstructionYear, OutOfUse), ID-dependent “Mileage” (V#, Date, Kilometers), “PlannedJourney” (P#, FromDate, ToDate, Destination, Purpose, Price) and “Realization” (P#, C#, V#)]
Please note that except with chauffeurs most information can be kept in their corresponding
object types (and therefore carry their own historization), due to the “out of use” marker at
“vehicle”.
[Figure: ER diagram for the course administration: “Course” (C#, Title, Description), ID-dependent “CourseRealization” (C#, R#, Date, TimeFrom, TimeTo, Location), “Person” (P#, Name) with ISA subtypes “ExternalEmployee” (Company) and “InternalEmployee” (Office), relationship types “Gives” and “Participates”, “Application” (F#, RequestDate, Justification) and “RequestForCourseParticipation” (F#, P#, C#)]
Note that, as usual, this is not the only possible solution, only a proposal. For instance the
only justification for “External Employee” is the “Company” attribute, not common to all
entities in “Person”.
The way “Application” is designed in this proposal means complete openness for requests,
for instance someone (we do not store who) can apply for several colleagues and several
courses, and if he wants to do so, in a way that for every colleague another course is
requested.
[Figure: ER diagram for application controls: “BusinessProcess” (P#) with “IsSubProcessOf”; “Risk” (R#) with “IdentifiedIn”; “ProcessLevelControl” (PLC#) with ISA subtypes “AutomatedPLC” and “ManualPLC”; “Mitigation”; “Application” (A#) with ID-dependent “ApplicationFunction” (A#, CF#); “ApplicationControl” (AC#) with “Uses”, “CarriedOutBy” and “MayBeSupportedBy”]
[Figure: ER diagram for the household services agency: “Customer” (C#, Name, Address, TelNr), ID-dependent “CustomerRequest” (C#, R#, Date, Remarks), “HouseholdServices” (S#, Name, Description), “ServiceRequest” (C#, R#, S#), ID-dependent “ServiceRequestTimeSlices” (C#, R#, S#, T#, From, To), “ServicePerson” (P#, Name, Address, TelNr, BirthYear), ID-dependent “SpecialKnowhow” (P#, K#, Description), “Readiness” (S#, P#) and “ServicePlan” (with attribute Definitive)]
With the appearance of DASD (Direct Access Storage Device) came a powerful feature that
had not been there in times of magnetic tapes: the direct navigation to a given storage
address. Records could contain pointers to other records.
[Figure: set type]
This is the basic construction paradigm for network database systems, the first identifiable
types of database management systems.
Charles Bachman, pioneering architect of one of those network systems (IDS, Integrated
Data Store, at General Electric), also played an important role at the CODASYL (Conference
On Data Systems Languages), where the navigational interplay of programming languages
with network data storage systems was standardized.
Bachman later (1969) published a diagram language for the owner/member paradigm
(containing one-to-many set types drawn with arrows from owner record type to member
record type), and a few years later (1983) generalized it to something that can be
designated as “Binary Entity Relationship”. In the meantime (1976) Peter P. Chen had
published his famous “The Entity Relationship Model”.
Chen supported the same basic construction elements as presented here (with ISA, ID and
manyfold relations), but Bachman continued to propagate the binary model, which
essentially looks like
[Figure: entity types “E” and “F” connected by a line labelled x at “E” and y at “F”]
where per entity e of type E there are y entities of type F, and per entity f of type F there
are x entities of type E.
Here “x” and “y” stand for a mixture of cardinality and existence constraints. In the notation
of Zehnder (1981) x,y ∈ {1, c, m, mc}, with the meaning 1 = exactly one, c = zero or one,
m = one or more, mc = zero or more.
In other dialects “x” and “y” can be given as intervals (for instance x=[0..1] for x=c), or as
a combination of little strokes, double strokes or circles over the line or as crow’s-feet, but
the different notations are all equivalent. For example the figure
[Figure: “Department” (label c) connected to “Employee” (label mc)]
means that for each employee there is at most one department, and for each department
there is a number between zero and many employees.
CODASYL had only one-to-many, but no many-to-many relationships (in favor of reducing
the implementation challenge). Therefore in the following example
[Figure: “Project” (label mc) connected to “Employee” (label mc)]
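the implicit relationship type of “WorksFor” had to be made explicit as follows
[Figure: “WorksFor” as a record type between “Project” and “Employee”, with label 1 at “Project” and at “Employee” and label mc on both sides of “WorksFor”]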
Since Zehnder discussed similar structure changes in the context of binary entity
relationship for relational databases, we call such a step a “Zehnder Normalization”.
Please note that if you do not take the Zehnder normalization step in the example
[Figure: “Department” (label c) connected to “Employee” (label mc)]
and map the two entity types into relational tables and the line between them into a
referential integrity condition (in Employee … FOREIGN KEY (Dept) REFERENCES
Department …), then the cardinality condition “c” forces you to define the foreign key
column “Dept” in “Employee” as NULLable.
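The effect can be tried out with a small sketch. It uses Python's sqlite3 module purely as a convenient SQL engine; the column names DeptNo and EmpNo and the helper functions are assumptions made for this illustration, not part of the original design.

```python
import sqlite3

def build(dept_nullable: bool) -> sqlite3.Connection:
    """Create Department and Employee; the FK column Dept is NULLable or not."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
    null_clause = "" if dept_nullable else "NOT NULL"
    con.executescript(f"""
        CREATE TABLE Department (DeptNo INTEGER PRIMARY KEY);
        CREATE TABLE Employee (
            EmpNo INTEGER PRIMARY KEY,
            Dept  INTEGER {null_clause},
            FOREIGN KEY (Dept) REFERENCES Department (DeptNo)
        );
    """)
    return con

def can_insert_employee_without_department(con: sqlite3.Connection) -> bool:
    """Try to store an employee that has no department (cardinality "c")."""
    try:
        con.execute("INSERT INTO Employee (EmpNo, Dept) VALUES (1, NULL)")
        return True
    except sqlite3.IntegrityError:
        return False
```

With build(True) the employee without a department can be stored, with build(False) it cannot, so in this table layout the constraint “c” can only be respected by allowing NULL in the foreign key column.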
The Bachman diagram language, designed for network database structures, is not very well
suited for relational table data design. Bachman was a very strong opponent of E. F. Codd
and the idea of relational database management systems.
Given a Bachman diagram of 200 entity types with lines between them, it can be
cumbersome to find out whether there is an existence dependency cycle or not (even when
it is already Zehnder-normalized), and it is not easy to find out what the independent entity
types are (these are the most important, in new designs as well as in assessments of given
designs).
The focus on cardinality leads you to ask the wrong questions (Is a connection a real
relationship type or a mere cardinality constraint?). If you ask a user “Are you sure that for
all E there is an F?”, he will answer “yes” even if for one E out of a trillion there is no F.
The Bachman ER diagrams are semantically very meager compared to Chen ER diagrams.
Please note that this appendix is for specialists only. If you have never heard about
normalization in relational databases, do not bother. If you have heard about it but do not
understand it, do not bother either: more than half of the writers of textbooks on
normalization theory do not understand it either.
This appendix gives a sketch of the mathematical background of the presented entity
relationship language and the reason why correct diagrams deliver advantageous relational
table structures. But note that for designers it is not necessary to understand this
background.
If R is a relation schema and F a set of functional dependencies over Attr(R), the set of
attributes of R, then (R,F) is in third normal form (3NF) if for all X, Y ⊆ Attr(R)
    F ⊢ X → Y  ⇒  Y⊆X  ∨  F ⊢ X → Attr(R)  ∨  ∀A∈Y\X (A prime)
The ugly part here is ∀A∈Y\X (A prime), which must be there so that the theorem stating
that third normal form is always achievable holds.
Note that without ∀A∈Y\X (A prime) one would not have to talk about keys, only
superkeys. F ⊢ X → Attr(R) says that X is a superkey (and a key is a minimal superkey).
An attribute is prime (“A prime”) if it is part of a key.
Third normal form cannot be supported by relational database systems. Take the example
R(A,B,C) with F={AB → C, C → B}. Here AB → C can be supported by declaring
UNIQUE(A,B) in the table definition of R(A,B,C), but C → B cannot be supported.
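The example can be replayed with a small sketch, again using Python's sqlite3 module only as a handy SQL engine. The helper try_insert and the sample values are assumptions for illustration; only plain key declarations are used, no triggers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER, UNIQUE (A, B))")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        con.execute("INSERT INTO R (A, B, C) VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert((1, 10, 100))
# Same (A,B) with a different C would violate AB → C: rejected by UNIQUE(A,B).
violates_ab_c = try_insert((1, 10, 200))
# Same C with a different B violates C → B, but no key declaration catches it.
violates_c_b = try_insert((2, 20, 100))
```

The second insert is rejected, the third one is accepted, although afterwards C = 100 occurs with two different values of B.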
The situation is quite different with Boyce-Codd Normal Form (BCNF). (R,F) is in BCNF if
for all X, Y ⊆ Attr(R)
    F ⊢ X → Y  ⇒  Y⊆X  ∨  F ⊢ X → Attr(R)
If Y⊆X, then X → Y is trivial and always valid, and if F ⊢ X → Attr(R), then you can
declare UNIQUE(attributes of X), so the database system can guarantee the validity of
BCNF.
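The BCNF condition can be tested mechanically with the usual attribute closure. The following is a minimal sketch, not an algorithm from this document; the function names are assumptions, and it relies on the standard fact that for a single schema (R,F) it suffices to inspect the dependencies listed in F.

```python
def fd(lhs, rhs):
    """Build a functional dependency lhs → rhs from two attribute strings."""
    return (frozenset(lhs), frozenset(rhs))

def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, schema, fds):
    """F ⊢ attrs → Attr(R), checked via the attribute closure."""
    return closure(attrs, fds) >= set(schema)

def is_bcnf(schema, fds):
    """Every FD in fds is trivial or has a superkey on its left side."""
    return all(rhs <= lhs or is_superkey(lhs, schema, fds) for lhs, rhs in fds)

# The example R(A,B,C) with F = {AB → C, C → B}
F = [fd("AB", "C"), fd("C", "B")]
```

For the example, is_superkey("AB", "ABC", F) holds while is_bcnf("ABC", F) does not, because C → B is neither trivial nor does C determine all attributes.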
So we observe that the database system can guarantee the validity of BCNF (react with an
SQL-error message if you try to violate BCNF), but not so for 3NF. Therefore BCNF should
be the goal of design, not 3NF. The proposed mapping of correct ER diagrams into relational
tables delivers directly BCNF.
Even more important is that the designer does not have to learn normalization theory, which
is difficult to understand (there have been many published corrections to previously
published, erroneous normalization algorithms).
It is clear that
{D → E} ∪ {R[A,B] ⊆ S[D,E]} ⊨ A → B,
which means that in every database where the functional dependency D → E and the
inclusion dependency R[A,B] ⊆ S[D,E] are both valid, also A → B is valid.
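A small sketch may make the interplay concrete. It checks both kinds of dependencies on sample table contents; the function names and the sample rows are assumptions for illustration, and checking single instances of course illustrates the implication but does not prove it.

```python
def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs → rhs is valid in rows (dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def holds_ind(r_rows, r_cols, s_rows, s_cols):
    """True if the inclusion dependency R[r_cols] ⊆ S[s_cols] is valid."""
    s_proj = {tuple(row[c] for c in s_cols) for row in s_rows}
    return all(tuple(row[c] for c in r_cols) in s_proj for row in r_rows)

S = [{"D": 1, "E": "x"}, {"D": 2, "E": "y"}]        # D → E is valid
R_ok = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]     # R[A,B] ⊆ S[D,E] is valid
R_bad = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}]    # A → B is violated
```

R_ok satisfies the inclusion dependency and, as expected, also A → B. R_bad violates A → B and therefore cannot satisfy R[A,B] ⊆ S[D,E] as long as D → E is valid in S.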
But
F ∪ IND ⊨ ρ (ρ a functional or inclusion dependency)
is not decidable, that is, there cannot exist any algorithm which would automatically find
out whether given dependencies follow from a given set of functional and inclusion
dependencies.
There is no way to automatically find out whether any given database structure with
functional and inclusion dependencies is normalized or not. This has been known since 1984.
A way out of this dilemma is the definition of a further normal form that handles the mixed
case of functional and inclusion dependencies.
The inclusion dependency R[X] ⊆ S[Y] between (R,F) and (S,G), where F and G are
corresponding sets of functional dependencies, is keybased if Y is a key of (S,G). Here the
minimality of the key is important.
The database schema {(R1,F1), (R2,F2), ..., (Rn,Fn), IND} is in inclusion dependency
normal form (IDNF), if IND is noncircular, all inclusion dependencies from IND are keybased
and all (Rj,Fj) are in BCNF.
In IDNF databases we are on the safe side: inclusion and functional dependencies are
independent of each other. More precisely we have
    F ∪ IND ⊨ X → Y  ⇒  F ⊢ X → Y
and
    F ∪ IND ⊨ R[X] ⊆ S[Y]  ⇒  IND ⊢ R[X] ⊆ S[Y]
In other words, all valid functional dependencies are controlled by the given set F of
functional dependencies alone, and all valid inclusion dependencies are controlled by the
given set IND of inclusion dependencies alone. It is decidable which dependencies follow
from the given ones, and there are no hidden dependencies following from the mixture that
might destroy normal forms (Mannila, Räihä 1986 and 1992 with BCNF).
The proposed mapping of correct ER diagrams into relational tables delivers directly IDNF.
Our inclusion dependencies are not only keybased, they are primary keybased. Of course we
assume here that a designer does not declare a proper subset of the primary key as another
key. Chances are low that he declares such a key in a CREATE TABLE definition, although SQL
would unfortunately accept it. In any case this would not be a possible result of our
recommended mapping of correct diagrams into relational tables.
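A declaration of the kind warned about here can be sketched as follows, with Python's sqlite3 module as the SQL engine. The table T and its columns A and B are hypothetical names chosen for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table T: the primary key is (A, B), yet A alone is declared
# UNIQUE, so the declared primary key is not a minimal key.
con.execute("""
    CREATE TABLE T (
        A INTEGER NOT NULL,
        B INTEGER NOT NULL,
        PRIMARY KEY (A, B),
        UNIQUE (A)
    )
""")
con.execute("INSERT INTO T VALUES (1, 1)")
try:
    con.execute("INSERT INTO T VALUES (1, 2)")   # new (A,B) pair, but A repeats
    second_accepted = True
except sqlite3.IntegrityError:
    second_accepted = False
```

The CREATE TABLE statement is accepted without complaint, and the second insert is rejected on account of UNIQUE (A). An inclusion dependency referencing (A, B) of such a table would reference a superkey, not a key.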
NULL Avoidance
Relational databases are the only systems in information technology that have a solid
mathematical foundation. However only the NULL-free version of the relational model has a
generally accepted mathematical model.
One of the many unfortunate consequences of the lacking background is that certain
difficulties arising with the presence of NULLs appeared only recently in the literature (for
example Galindo-Legaria, Rosenthal 1997).
A striking example is the fact that the left outer (theta) join is no longer associative in
the presence of NULLs, symbolically
    (R ⟕ S) ⟕ T  ≠  R ⟕ (S ⟕ T)
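One way in which the failure can surface is sketched below with Python's sqlite3 module. The tables, their contents and the join predicate are assumptions made for this illustration: the second join condition uses SQLite's IS operator, a comparison that is satisfied by two NULLs. With plain equality conditions, which are never satisfied by NULL, both bracketings of this particular example would agree.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE R (a INTEGER);
    CREATE TABLE S (a INTEGER, b INTEGER);
    CREATE TABLE T (c INTEGER, d TEXT);
    INSERT INTO R VALUES (1);
    -- S stays empty, so the row of R finds no partner in S
    INSERT INTO T VALUES (NULL, 'x');
""")

# (R ⟕ S) ⟕ T: the NULL padded for S.b meets the NULL in T.c under IS
left_first = con.execute("""
    SELECT R.a, S.a, S.b, T.c, T.d
    FROM R LEFT JOIN S ON R.a = S.a
           LEFT JOIN T ON S.b IS T.c
""").fetchall()

# R ⟕ (S ⟕ T): S ⟕ T is empty, so R is padded with NULLs in all other columns
right_first = con.execute("""
    SELECT R.a, st.sa, st.sb, st.tc, st.td
    FROM R LEFT JOIN (
        SELECT S.a AS sa, S.b AS sb, T.c AS tc, T.d AS td
        FROM S LEFT JOIN T ON S.b IS T.c
    ) AS st ON R.a = st.sa
""").fetchall()
```

The two results differ in the last column: in the first bracketing the row of T is attached to the row of R, in the second it is not.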
The only comfort in this scandalous situation is that vendors of chips also cannot guarantee
the validity of basic laws in number calculations.
Anyway, NULLs and corresponding three-valued logics are too difficult for theoreticians.
Proposals have appeared in the literature (on how to overcome some difficulties confronting
the user), but these have proven logically unfeasible.
NULLs have also been too difficult for the first implementers of SQL. The fact that the
EXISTS quantifier behaves in two-valued fashion in the midst of three-valued calculations,
which is still disturbing for many users, is a clear semantic mistake that was later
declared to be a feature when it was too late to correct it.
And please, if it is too difficult for theoreticians and system implementers, how then should
an end user understand why his grandma is not on the result list of his query? (The solution
is that his grandma has a NULL in the Age column because she refused to announce her age.)
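A query of this kind can be sketched as follows, with Python's sqlite3 module as the SQL engine. The table Person, its contents and the concrete WHERE clause are assumptions for illustration; the query meant in the text is not reproduced here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Person (Name TEXT PRIMARY KEY, Age INTEGER);
    INSERT INTO Person VALUES ('Grandma', NULL);   -- age not announced
    INSERT INTO Person VALUES ('Uncle', 55);
""")

# In two-valued logic this condition would be true for every person.
names = [name for (name,) in con.execute(
    "SELECT Name FROM Person WHERE Age >= 60 OR Age < 60 ORDER BY Name")]
```

Only the uncle is returned. For the grandma both comparisons evaluate to unknown, so the whole condition is unknown and her row is dropped.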
A few days later the missing value becomes known and there is an attempt to update
<a,NULL> to <a,b>. Now the system checks the referential integrity constraint. If there is
no <a,b> value combination in the parent table, things get complicated. Is the update to
<a,b> wrong or was already the insertion of <a,NULL> wrong or unrealistic? The user
has to go back a few days and the system must provide the ability to do so.
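The scenario can be replayed with a small sketch, using Python's sqlite3 module as the SQL engine. The tables Parent and Child and the sample values are hypothetical. SQLite, like the default behaviour of standard SQL, does not check a composite foreign key as long as one of its components is NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
con.executescript("""
    CREATE TABLE Parent (a INTEGER, b INTEGER, PRIMARY KEY (a, b));
    CREATE TABLE Child (
        id INTEGER PRIMARY KEY,
        a  INTEGER,
        b  INTEGER,
        FOREIGN KEY (a, b) REFERENCES Parent (a, b)
    );
    INSERT INTO Parent VALUES (1, 1);
""")

# <7, NULL>: no parent row has a = 7, yet the insert is accepted,
# because a foreign key with a NULL component is not checked.
con.execute("INSERT INTO Child (id, a, b) VALUES (1, 7, NULL)")

try:
    con.execute("UPDATE Child SET b = 2 WHERE id = 1")   # <7, 2> has no parent
    update_accepted = True
except sqlite3.IntegrityError:
    update_accepted = False
```

The insertion passes silently and the problem only shows up at the later update, when it is no longer obvious which of the two steps was the wrong one.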
Happily we have learned how to avoid all this. The presented version of the entity
relationship language with its definition of correct diagrams helps avoid the use of NULLs,
primarily in dangerous places such as keys or foreign keys. Construction patterns like the
accordion principle avoid NULLs generally. In our approach missing information is
represented by the absence of a row in a table, not by a NULL.
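The idea of a missing row instead of a NULL can be sketched for the Department and Employee example, with Python's sqlite3 module as the SQL engine. The table EmployeeDept and all column names are hypothetical and only illustrate the principle; they are not taken from the mapping rules of this document.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
con.executescript("""
    CREATE TABLE Department (DeptNo INTEGER PRIMARY KEY);
    CREATE TABLE Employee   (EmpNo  INTEGER PRIMARY KEY);
    -- Hypothetical relationship table: at most one department per employee,
    -- and no column that may contain NULL.
    CREATE TABLE EmployeeDept (
        EmpNo  INTEGER NOT NULL PRIMARY KEY REFERENCES Employee (EmpNo),
        DeptNo INTEGER NOT NULL REFERENCES Department (DeptNo)
    );
    INSERT INTO Department VALUES (10);
    INSERT INTO Employee VALUES (1);
    INSERT INTO Employee VALUES (2);
    INSERT INTO EmployeeDept VALUES (1, 10);   -- employee 2 has no row here
""")

without_dept = [emp for (emp,) in con.execute("""
    SELECT EmpNo FROM Employee
    WHERE EmpNo NOT IN (SELECT EmpNo FROM EmployeeDept)
""")]
```

Employee 2 has no department, and this fact is expressed by the absence of a row in EmployeeDept. Every foreign key column is declared NOT NULL.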
Semantics Support
We do not only have to design data but also operations and constraints, which have their
counterpart in programs or administrative rules. Since it is impossible to build a graphical
design language which covers the full logic of operations and constraints, a borderline is
needed.
This version of the entity relationship language with its definition of correct diagrams defines a
very clear borderline. All semantics that can be expressed can be guaranteed by the
database management system. The programmer only has to occupy himself with constraints
he writes down separately, not with constraints contained in the semantics of the graphical
design language. Note that this is not the case with Bachman diagrams or other dialects of
binary entity relationship.
Perhaps the most important advantage of the presented approach with correct ER diagrams is
the preference given to information representation normalization over data representation
normalization. This train of thought was first formulated by Victor Markowitz in 1987 in
Haifa (as far as I know). Please consider the following comparison.
[Figure: comparison of two design paths.
Conventional path: Conceptual Design → Zehnder Normalisation → other Conceptual Design →
different kinds of mappings → Relational Design → Relational Normalisation → other relations →
Lucky? (No: go back; Yes: BCNF normalized relations) → Lucky? (No: go back; Yes: done).
Presented path: Correct ER Diagram Design → mapping clearly defined in all cases →
IDNF normalized relations.]
Most textbooks and courses in data design present a long way from conceptual diagrams to
relational tables. In the widespread case of Bachman diagrams there is a first step,
“Zehnder normalization”, eliminating many-to-many connections carrying mutual existence
constraints.
Then there usually is some discussion of mapping possibilities into relational tables,
presented as case-by-case pragmatics.
Then comes the huge and difficult-to-understand relational normalization, which guarantees
only 3NF as a result. If you are lucky, you hit BCNF, and if you are lucky again, your
referential integrity constraints are properly integrated. If not, you go back a few steps.
Even if you have been lucky two times, you might have arrived at a design of relational
tables that are structured very differently compared to the conceptual design. This is not
very helpful, because there is always a translation needed from the conceptual design (that
the business analyst knows by heart) to the relational tables design (that the programmer
knows by heart). Sometimes it is a great advantage if also the programmer understands the
business logic.
Compared to the complications above, with correct ER diagrams we get directly to the wanted
goal in one step that is well defined in all cases, without a structure gap.
The reason for this big advantage is the fact that the definition of the correct ER diagram is
a narrow corset. One has to shape the business information conception until it fits, and then
come all the benefits (my colleagues and I have never seen a business case with data
requirements that could not be mapped into a correct diagram). This is what Markowitz
called information representation normalization.