A Course in Data Design

for Relational Databases



Table of Contents
Preliminary Remarks
The Basic Construction Set
  Entity Types and Entities
  Attributes and Attribute Values
  Relationship Types and Relations
  ID-dependent Entity Types
  ISA-dependent Entity Types
  Composite Entity Types
Correct Entity Relationship diagrams
  Correct ER diagrams
  Mapping of correct ER diagrams into relational tables
Patterns
  Time dependencies
  Sets, Trees, Partitions and the like
  Accordion principle
  Group data collection
  Star join schema
  Making an ISA-battery dynamic
  Overlapping Foreign Keys
  4NF and 5NF Silly Warnings
Example DB designs
  Asset Management
  Customer Relations Management
Data Design Guidelines
Exercises
  Structural Exercises
  ER Diagram Design Exercises
    Exercise D1 (Monolingual Thesaurus)
    Exercise D2 (Journeys and Vehicles)
    Exercise D3 (Course Administration)
    Exercise D4 (SOX Application Controls)
    Exercise D5 (Household Services)
  Solutions to the Exercises
Appendix A: Bachman Diagrams
Appendix B for Specialists: Mathematics of IDNF
  Third Normal Form (3NF) is not enough
  Referential Integrity Guidelines
  NULL Avoidance
  Semantics Support
  Information versus Data Representation

Preliminary Remarks

The success of process-oriented business analysis methods and of the object-oriented
programming paradigm has brought the danger that people neglect the data side.

Data is just “made persistent” when the program or process does not use it any more. Even
worse, whole data structures are sometimes generated by automatic tools. The
consequence is that all parties involved, business analyst, programmer and even the
user, lose control over the semantics of the data.

This is not advantageous for a company like Swiss Re, where some data must be kept and
understood for decades, and where sophisticated statistical modelling and analysis
applications as well as very advanced reporting applications are in place.

For new applications it is important that data structures with clearly defined business
semantics are among the first things to be modeled. This document shows how to model
data structures for relational database systems. Why don’t we just take a “best practice”
textbook? Because we can do it better.

Swiss Re Zurich has a huge knowledge base and long practical experience with relational
database systems, beginning in 1983. We are the only company outside the technology
sector that is mentioned in the 1990 book by E. F. Codd, the famous inventor of the relational
database paradigm.

When the so-called referential integrity functionality arrived in 1989, we interpreted the notion
of “correct entity relationship diagram” in such a way that the gap between classical
normalization and the new functionality was filled seamlessly and, beyond that, the
sometimes hard-to-understand normalization theory could be avoided completely. Please
read more on that in an appendix of this document.

This document is a much extended and improved version of the Swiss Re brochure “Entity
relationship for relational database”, first published in 1991, with a second edition
in 1998. It contains many examples and many exercises with solutions.

Zurich, June 2006


Hanswalter Buff

The Basic Construction Set

Entity Types and Entities

“John Smith” might be an entity of entity type “Employee”.

[Diagram: entity type “Employee”, drawn as a rectangle]

“The Rodale Book of Composting: Easy Methods for Every Gardener, by Grace Gershuny,
with International Standard Book Number 0878579915” is an entity of type “Book”.

[Diagram: entity type “Book”, drawn as a rectangle]

Suppose you have two copies of the Composting book in your hands. Are they the same
book?

The two copies might be the same book, but surely they are different copies. A local public
lending library having ten copies of the Composting book probably stamps a copy number
into each copy and additionally needs a further entity type “CopyOfBook”.

[Diagram: entity type “CopyOfBook”, drawn as a rectangle]

If all ten copies are lent out, you can book the book, that is reserve it, and you will get a
copy as soon as one is available, hopefully.

You can tell the librarian: “I booked the book, and now you only give me a copy”. Then if he
says: “This is the book you booked”, you can say “No, this isn’t the book I booked, I didn’t
book a copy of the book I booked, I booked the book”, and he will send you away.

But if he understands you, he is probably a good database designer and well aware of the
identity problem and of the first and fundamental principle of relational database design,
entity integrity: it must be clearly decidable what an entity (of a given type) is and when it is
identical to another entity of the same type, and every such entity corresponds to exactly one
row in the relational table that corresponds to the given type.

In the relational world entities are identified by attributes and their values.

Attributes and Attribute Values

“EmpNr”, “Name”, “Firstname”, “Address” might be attributes of entity type “Employee”.

[Diagram: entity type “Employee” with attributes “EmpNr”, “Name”, “Firstname”, “Address”]

“ISBN”, “CopyNr”, “ShelfNr”, “Shape” might be attributes of entity type “CopyOfBook”.

[Diagram: entity type “CopyOfBook” with attributes “ISBN”, “CopyNr”, “ShelfNr”, “Shape”]

If <0878579915, 7, B17.3.44, to be replaced> is an entity of type “CopyOfBook”,
corresponding to copy number 7 of the Composting book, which is in bad shape, then
“0878579915” is an attribute value of the attribute “ISBN”, “7” is an attribute value of the
attribute “CopyNr”, and so on.

Entity integrity requires every relational table to have at least one key, namely an attribute
or several attributes whose combined values are guaranteed to be unique over the lifetime of
the table.

In some cases the table needs even a primary key, which is just one of the keys designated
as primary. It will be clear in due course when an object type needs a primary key. Since
there can be at most one primary key for a type, we may underline the attributes that define
it. The entity type “CopyOfBook” has the primary key {ISBN, CopyNr}.
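A minimal sketch of how the key {ISBN, CopyNr} enforces entity integrity. Python's built-in sqlite3 module serves only as a convenient test bed here; the column types (in particular a character type for “ShelfNr”) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CopyOfBook
    ( ISBN    VARCHAR(30) NOT NULL
    , CopyNr  INTEGER     NOT NULL
    , ShelfNr VARCHAR(20) NOT NULL   -- assumed type, to hold values like B17.3.44
    , Shape   VARCHAR(20) NOT NULL
    , PRIMARY KEY (ISBN, CopyNr))
""")

# Copy number 7 of the Composting book.
conn.execute("INSERT INTO CopyOfBook VALUES ('0878579915', 7, 'B17.3.44', 'to be replaced')")
# Another copy of the same book is a different entity and is accepted.
conn.execute("INSERT INTO CopyOfBook VALUES ('0878579915', 8, 'B17.3.44', 'good')")

# A second row for copy 7 would describe the same entity twice and is rejected.
try:
    conn.execute("INSERT INTO CopyOfBook VALUES ('0878579915', 7, 'C01.1.01', 'good')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM CopyOfBook").fetchone()[0]
```

One row per copy remains, which is exactly the “one entity, one row” reading of entity integrity.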

The entity integrity principle can hardly be overstressed. If you want to model a database
that can store information about employees that changes over time, for example
“OfficeNr”, then your entity is not an employee but “employee in a certain time interval”, or
“time-dependent information of a certain employee”, or something similar, and the
corresponding entity type should not carry the name “Employee”, but maybe “EmployeeHistory”:

[Diagram: entity type “EmployeeHistory” with attributes “EmpNr”, “ValidToDate”, “Department”, “OfficeNr”]

An entity of type “EmployeeHistory”, identified by <EmpNr, ValidToDate>, is under no
circumstances an employee.

Of course there will be some redundancy in the data of our entity type “EmployeeHistory”,
insofar as a new entity comes into existence whenever the employee changes his office
number or his department, so there can be two or more consecutive “ValidToDate” entries
for the same employee with the same office number or the same department.

But this is a controlled redundancy that is not dangerous and, more importantly, there are
ways to avoid it, to be discussed later.

Another way to depart from the entity integrity principle which should be avoided is to
connect the meaning, the business semantics, of an entity with other entities of the same
type as for instance in the somewhat trivial example

[Diagram: entity type “Employee” with attributes “EmpNr”, “Salary”, “Department”, “PercentageOfTotalDeptBenefit”]

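If you are not sure about the choice of natural keys, as in

[Diagram: entity type “Event” with attributes “Date”, “Place”, “ShortName”, “Description”]

then you may choose an artificial key, a surrogate key:

[Diagram: entity type “Event” with attributes “EventNr”, “Date”, “Place”, “ShortName”, “Description”]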

But the more natural keys one can identify the better. Every natural key helps support the
entity integrity principle. In the above example, the same real event could be defined twice
with different artificial keys. But if we are sure that we will be able to choose different short
names for events that happen to occur at the same date in the same place, then we should
take {Date, Place, ShortName} as additional natural key besides the primary key {EventNr}.

By the way, key conditions map to unique indexes in the database, and every unique index
can help the database system to deliver improved performance.
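The combination of a surrogate primary key and an additional natural key can be sketched as follows. The column types and the sample event are assumptions; sqlite3 is used only to make the sketch executable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Event
    ( EventNr     INTEGER      NOT NULL
    , Date        DATE         NOT NULL
    , Place       VARCHAR(40)  NOT NULL
    , ShortName   VARCHAR(20)  NOT NULL
    , Description VARCHAR(200) NOT NULL
    , PRIMARY KEY (EventNr)                -- artificial (surrogate) key
    , UNIQUE (Date, Place, ShortName))     -- additional natural key
""")
conn.execute(
    "INSERT INTO Event VALUES (1, '2006-06-01', 'Zurich', 'Kickoff', 'Project kickoff')")

# The same real event entered again under a new artificial key:
# only the natural key prevents the double definition.
try:
    conn.execute(
        "INSERT INTO Event VALUES (2, '2006-06-01', 'Zurich', 'Kickoff', 'Entered twice')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Without the UNIQUE clause the second INSERT would succeed, because the surrogate key alone cannot tell that both rows mean the same event.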

Relationship Types and Relations

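[Diagram: relationship type “Assessment” (attributes “Ca#”, “Ch#”, “E#”, “Result”) with an arrow labeled “m” to entity type “Candidate” (“Ca#”, “Name”, etc.), an arrow labeled “1” to entity type “Characteristic” (“Ch#”, “Description”) and an arrow labeled “1” to entity type “Expert” (“E#”, “EMail”)]

“Assessment” is a relationship type that is existentially dependent on the entity types
“Candidate”, “Characteristic” and “Expert”.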

Every arrow constitutes an existence dependency and will therefore be mapped into a
referential integrity constraint. This will guarantee that a new assessment can be inserted
only when the database already contains the corresponding candidate, characteristic and
expert.

The targets of referential integrity constraints will always be primary keys. This answers the
question of when an object type needs a primary key: whenever there is an incoming arrow,
which says that some other object type is existentially dependent on it.

The arrows have labels, “1” or “m”. They must be interpreted in the context of all object
types that our relationship type is dependent on.

In our example the “1” at the arrow from “Assessment” to “Characteristic” means that per
candidate and expert there can be at most one characteristic such that these constitute an
assessment. Therefore this “1” will map into a key condition, namely the relational table
“Assessment” will have {Ca#, E#} as a key.

The second label “1” in the figure, the one between “Assessment” and “Expert”, also
enforces a key condition, namely per candidate and characteristic there will be at most one
expert such that these constitute an assessment. Therefore the “Assessment” relational
table will have a second key, namely {Ca#, Ch#}.

The label “m” merely means that the designer decided that it is not “1” and does not lead to
any condition.

The whole of the semantics of our assessment database diagram shows up in the following
rudimentary CREATE TABLE statements.

CREATE TABLE Candidate
(Ca# INTEGER NOT NULL
, Name VARCHAR(40) NOT NULL
, PRIMARY KEY (Ca#))

CREATE TABLE Characteristic
(Ch# INTEGER NOT NULL
, Description VARCHAR(99) NOT NULL
, PRIMARY KEY (Ch#))

CREATE TABLE Expert
(E# INTEGER NOT NULL
, Email VARCHAR(99) NOT NULL
, PRIMARY KEY (E#))

CREATE TABLE Assessment
(Ca# INTEGER NOT NULL
, Ch# INTEGER NOT NULL
, E# INTEGER NOT NULL
, Result VARCHAR(99) NOT NULL
, FOREIGN KEY (Ca#) REFERENCES Candidate
, FOREIGN KEY (Ch#) REFERENCES Characteristic
, FOREIGN KEY (E#) REFERENCES Expert
, UNIQUE (Ca#, E#)
, UNIQUE (Ca#, Ch#))
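To see the two key conditions and the existence dependencies at work, the statements can be run against a small test database. This is a sketch under assumptions: SQLite needs identifiers such as Ca# in double quotes and checks foreign keys only after a PRAGMA, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE Candidate
    ( "Ca#" INTEGER NOT NULL, Name VARCHAR(40) NOT NULL, PRIMARY KEY ("Ca#"));
    CREATE TABLE Characteristic
    ( "Ch#" INTEGER NOT NULL, Description VARCHAR(99) NOT NULL, PRIMARY KEY ("Ch#"));
    CREATE TABLE Expert
    ( "E#" INTEGER NOT NULL, Email VARCHAR(99) NOT NULL, PRIMARY KEY ("E#"));
    CREATE TABLE Assessment
    ( "Ca#"  INTEGER NOT NULL
    , "Ch#"  INTEGER NOT NULL
    , "E#"   INTEGER NOT NULL
    , Result VARCHAR(99) NOT NULL
    , FOREIGN KEY ("Ca#") REFERENCES Candidate
    , FOREIGN KEY ("Ch#") REFERENCES Characteristic
    , FOREIGN KEY ("E#")  REFERENCES Expert
    , UNIQUE ("Ca#", "E#")
    , UNIQUE ("Ca#", "Ch#"));
    INSERT INTO Candidate VALUES (1, 'Miller');
    INSERT INTO Characteristic VALUES (10, 'Leadership');
    INSERT INTO Characteristic VALUES (11, 'Teamwork');
    INSERT INTO Expert VALUES (100, 'a@example.org');
    INSERT INTO Expert VALUES (101, 'b@example.org');
""")

def accepted(values):
    """Try to insert one assessment (Ca#, Ch#, E#, Result); report whether it is accepted."""
    try:
        conn.execute("INSERT INTO Assessment VALUES (?, ?, ?, ?)", values)
        return True
    except sqlite3.IntegrityError:
        return False

first              = accepted((1, 10, 100, "good"))
# Same candidate and expert, second characteristic: violates key {Ca#, E#}.
second_char        = accepted((1, 11, 100, "fair"))
# Same candidate and characteristic, second expert: violates key {Ca#, Ch#}.
second_expert      = accepted((1, 10, 101, "poor"))
# Candidate 2 does not exist: violates the existence dependency.
unknown_candidate  = accepted((2, 11, 101, "good"))
# New expert and new characteristic for candidate 1: allowed.
other_combination  = accepted((1, 11, 101, "good"))
```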

If the labels of all arrows that go out of a relationship type are “m”, as in the following
ministry of commerce database

[Diagram: relationship type “TradeRelation” (attributes “C#”, “L#”, “P#”, “FirstTrade”) with arrows labeled “m” to the entity types “Company” (“C#”, “Name”), “Country” (“L#”, “Name”) and “Product” (“P#”, “Name”, “Description”)]

then we have only one key in “TradeRelation”, the set of all foreign key attributes:

CREATE TABLE TradeRelation
(C# INTEGER NOT NULL
, L# INTEGER NOT NULL
, P# INTEGER NOT NULL
, FirstTrade DATE NOT NULL
, FOREIGN KEY (C#) REFERENCES Company
, FOREIGN KEY (L#) REFERENCES Country
, FOREIGN KEY (P#) REFERENCES Product
, UNIQUE (C#, L#, P#))

Many relationship types are existentially dependent on only two entity types:

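[Diagram: relationship type “Covers” (attributes “CName”, “TName”) with arrows labeled “m” to the entity types “InsuranceCompany” (“Name”, etc.) and “InsuranceType” (“Name”, “Description”)]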

Obviously “Covers” has only one key:

CREATE TABLE Covers
(CName CHAR(30) NOT NULL
, TName CHAR(30) NOT NULL
, FOREIGN KEY (CName) REFERENCES InsuranceCompany
, FOREIGN KEY (TName) REFERENCES InsuranceType
, UNIQUE (CName, TName))

Another example:

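[Diagram: relationship type “WorksFor” (attributes “E#”, “P#”, “StartDate”) with an arrow labeled “m” to entity type “Employee” (“E#”, “Name”, etc.) and an arrow labeled “1” to entity type “Project” (“P#”, “Name”, “StartDate”, “Description”)]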

The label “1” means that every employee works for at most one project. Therefore {E#}
must be a key in “WorksFor”:

CREATE TABLE WorksFor
(E# CHAR(10) NOT NULL
, P# CHAR(8) NOT NULL
, StartDate DATE NOT NULL
, FOREIGN KEY (E#) REFERENCES Employee
, FOREIGN KEY (P#) REFERENCES Project
, UNIQUE (E#))

At this point the question arises why we do not just define an attribute “WorksForProject” in
the entity type “Employee” with a referential integrity connection to the entity type
“Project”.

The answer is that this would not be a good idea. Why?

There are several points to consider here. The first is that the entity type “Employee” should
not be overloaded with attributes that are only relevant for some of the employees.
Remember that in our case we would have two attributes, namely “WorksForProject” and
“WorksForStartDate”. The latter explicitly belongs to the relationship of the
employee working for a project; it does not naturally belong to the employee or to the
project.

A side effect of this unnatural attachment of the attribute “WorksForStartDate” to the entity
type “Employee” is the introduction of a hidden constraint that usually cannot be guaranteed
by the database system alone, namely

if WorksForProject IS NULL then WorksForStartDate IS NULL

The second point to consider is the loss of flexibility. If you map “WorksFor” into a pure
referential integrity condition, then you must be very sure that the business condition “every
employee works for at most one project” is rather stable. With the first employee that
happens to have to work for two projects, one would have to introduce a separate relational
table anyway, and if we already have one as the above CREATE TABLE statement shows,
we only have to drop the key {E#} and add the key {E#, P#}, which can always be done.
Experience shows that business conditions containing the phrase “at most one” are seldom
very stable and should always be considered with caution (“How many wives do you have?”
“At most one!”).
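The flexibility argument can be tried out in a small sketch. It assumes that the key is implemented as a named unique index (the name WorksFor_Key is invented) so that it can be dropped and replaced; the foreign keys to “Employee” and “Project” are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WorksFor
    ( "E#"      CHAR(10) NOT NULL
    , "P#"      CHAR(8)  NOT NULL
    , StartDate DATE     NOT NULL);
    -- key {E#}: every employee works for at most one project
    CREATE UNIQUE INDEX WorksFor_Key ON WorksFor ("E#");
    INSERT INTO WorksFor VALUES ('E1', 'P1', '2006-01-01');
""")

def second_project_accepted():
    """Try to let employee E1 work for a second project, P2."""
    try:
        with conn:  # commit on success, roll back on failure
            conn.execute("INSERT INTO WorksFor VALUES ('E1', 'P2', '2006-06-01')")
        return True
    except sqlite3.IntegrityError:
        return False

before = second_project_accepted()   # rejected by key {E#}

# The business rule changes: drop the key {E#} and add the key {E#, P#}.
conn.executescript("""
    DROP INDEX WorksFor_Key;
    CREATE UNIQUE INDEX WorksFor_Key ON WorksFor ("E#", "P#");
""")
after = second_project_accepted()    # now allowed
```

The existing rows satisfy the weaker key automatically, which is why this change can always be made without touching the data.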

The third point is that one should avoid NULLs wherever possible, because we are not used
to reasoning in three-valued logic. Suppose we had a table

Employee(E#, Name,…,WorksForProject,…)

such that “WorksForProject” can be NULL. Assume that the employee with E#='17' has NULL
in his “WorksForProject” attribute. Then the harmless question for other employees that
work in the same project,

SELECT E# FROM Employee WHERE WorksForProject
IN (SELECT WorksForProject FROM Employee WHERE E#='17'),

delivers an empty answer, which might be OK. But the question for other employees that do
not work in the same project,

SELECT E# FROM Employee WHERE WorksForProject
NOT IN (SELECT WorksForProject FROM Employee WHERE E#='17'),

also delivers an empty answer, because every comparison with NULL evaluates to “unknown”
rather than to true. This is confusing, to say the least.
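The effect is easy to reproduce. The following sketch uses an invented three-row “Employee” table in SQLite (where E# must be written in double quotes); only employee '17' has a NULL project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee
    ( "E#"            CHAR(10)    NOT NULL
    , Name            VARCHAR(40) NOT NULL
    , WorksForProject CHAR(8)               -- may be NULL
    , PRIMARY KEY ("E#"));
    INSERT INTO Employee VALUES ('17', 'Smith', NULL);
    INSERT INTO Employee VALUES ('18', 'Jones', 'P1');
    INSERT INTO Employee VALUES ('19', 'Brown', 'P2');
""")

project_of_17 = """(SELECT WorksForProject FROM Employee WHERE "E#" = '17')"""

same_project = conn.execute(
    'SELECT "E#" FROM Employee WHERE WorksForProject IN ' + project_of_17
).fetchall()

other_project = conn.execute(
    'SELECT "E#" FROM Employee WHERE WorksForProject NOT IN ' + project_of_17
).fetchall()
```

Both result lists are empty, although employees '18' and '19' clearly exist and have projects.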

The last point is a kind of ‘unity of doctrine’: referential integrity constraints, which go far
beyond classical normalization theory of relational databases, are always business existence
dependencies in our approach. They are not sometimes existence constraints, sometimes
half existence constraints (if you relate them to nonprimary keys that can be NULL) and
sometimes hidden relationship types.

Now let us assume that some of the projects are supervised by an internal specialist who, if
in this role, only supervises one project:

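[Diagram: relationship type “Supervision” (attributes “E#”, “P#”) with arrows labeled “1” to the entity types “Employee” (“E#”, “Name”, etc.) and “Project” (“P#”, “Name”, “StartDate”, “Description”)]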

Then the labels “1” at the arrows account for two keys in “Supervision”, {E#} and {P#}:

CREATE TABLE Supervision
(E# CHAR(10) NOT NULL
, P# CHAR(8) NOT NULL
, FOREIGN KEY (E#) REFERENCES Employee
, FOREIGN KEY (P#) REFERENCES Project
, UNIQUE (E#)
, UNIQUE (P#))

As another example consider the following.

[Diagram: relationship type “Recommendation” (attributes “B#”, “C#”, “T#”, “S#”) with arrows labeled “m” to the entity types “Book” (“B#”) and “Class” (“C#”) and arrows labeled “1” to the entity types “Teacher” (“T#”) and “Subject” (“S#”)]

Here we will have two keys in “Recommendation”, {B#, C#, S#} and {B#, T#, C#},
guaranteeing the business semantics of the labels “1”:

The label “1” at the arrow pointing to “Teacher” means that any book in any subject is
recommended to any class by at most one teacher, and the label “1” at the arrow pointing
to “Subject” means that any book is recommended by any teacher to any class in at most
one subject.
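The text gives no CREATE TABLE statement for “Recommendation”; a possible mapping, following the pattern of the earlier examples, is sketched here. The column types and sample rows are assumptions, and the identifiers are quoted because SQLite requires it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Book    ("B#" INTEGER NOT NULL, PRIMARY KEY ("B#"));
    CREATE TABLE Class   ("C#" INTEGER NOT NULL, PRIMARY KEY ("C#"));
    CREATE TABLE Teacher ("T#" INTEGER NOT NULL, PRIMARY KEY ("T#"));
    CREATE TABLE Subject ("S#" INTEGER NOT NULL, PRIMARY KEY ("S#"));
    CREATE TABLE Recommendation
    ( "B#" INTEGER NOT NULL
    , "C#" INTEGER NOT NULL
    , "T#" INTEGER NOT NULL
    , "S#" INTEGER NOT NULL
    , FOREIGN KEY ("B#") REFERENCES Book
    , FOREIGN KEY ("C#") REFERENCES Class
    , FOREIGN KEY ("T#") REFERENCES Teacher
    , FOREIGN KEY ("S#") REFERENCES Subject
    , UNIQUE ("B#", "C#", "S#")     -- label "1" at the arrow to Teacher
    , UNIQUE ("B#", "T#", "C#"));   -- label "1" at the arrow to Subject
    INSERT INTO Book VALUES (1);
    INSERT INTO Class VALUES (1);
    INSERT INTO Teacher VALUES (1);
    INSERT INTO Teacher VALUES (2);
    INSERT INTO Subject VALUES (1);
    INSERT INTO Subject VALUES (2);
    INSERT INTO Recommendation VALUES (1, 1, 1, 1);
""")

def accepted(book, cls, teacher, subject):
    """Try to insert one recommendation; report whether it is accepted."""
    try:
        conn.execute("INSERT INTO Recommendation VALUES (?, ?, ?, ?)",
                     (book, cls, teacher, subject))
        return True
    except sqlite3.IntegrityError:
        return False

second_teacher = accepted(1, 1, 2, 1)  # same book, class, subject: rejected
second_subject = accepted(1, 1, 1, 2)  # same book, teacher, class: rejected
new_pairing    = accepted(1, 1, 2, 2)  # other teacher AND other subject: allowed
```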

ID-dependent Entity Types

A widespread data structure pattern is the hierarchy. In fact, for more than a decade from
the early seventies, many companies that used database systems to keep their
business data persistent had to design everything in hierarchical structures.

It is typical for a hierarchical connection that an entity of the dependent level is uniquely
identified “in a natural way” only within its parent entity.

[Diagram: entity type “CopyOfBook” (attributes “ISBN”, “CopyNr”, “ShelfNr”, “Shape”, “AcquireDate”), ID-dependent on entity type “Book” (“ISBN”, “Author”, “Title”, “YearOfPublication”)]

CREATE TABLE Book
(ISBN VARCHAR(30) NOT NULL
, Author VARCHAR(99) NOT NULL
, Title VARCHAR(99) NOT NULL
, YearOfPublication VARCHAR(4) NOT NULL
, PRIMARY KEY (ISBN))

CREATE TABLE CopyOfBook
(ISBN VARCHAR(30) NOT NULL
, CopyNr INTEGER NOT NULL
, ShelfNr INTEGER NOT NULL
, Shape VARCHAR(20) NOT NULL
, AcquireDate DATE NOT NULL
, FOREIGN KEY (ISBN) REFERENCES Book
, PRIMARY KEY (ISBN, CopyNr))

Note that we could define {ISBN, CopyNr} as unique, not necessarily as primary key, as long
as there is no object type dependent on “CopyOfBook”.

As usual the arrow amounts to the referential integrity condition which implements the
existential dependency of “CopyOfBook” on “Book”, and as always the label of the arrow
influences the keys in the dependent object type. In the case of label “ID” the dependent entity
type must define a key that is an extension of the foreign key attributes. Loosely
speaking, the key of the ID-dependent type must be an extension of the key of the parent type.

As another example we take loss development triangles used for reinsurance reserves
calculations (IBNR, Incurred But Not Reported). A loss triangle can be visualized as follows.

                       Accident Years
Development Year   1987  1986  1985  1984  1983  1982
       1            110   130   140   150   120   100
       2                  260   280   300   240   200
       3                        420   450   360   300
       4                              600   480   400
       5                                    600   500
       6                                          500
Premiums            624   624   625   625   627   627

For a deeper understanding see for instance Erwin Straub, Non-Life Insurance Mathematics,
Springer 1988, Chapter 7. Here we are interested only in a data structure that captures loss
development triangles.

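[Diagram: entity type “Premiums” (attributes “T#”, “AccidentYear”, “Premium”), ID-dependent on entity type “LossDevelopmentTriangle” (“T#”, “Titel”, “Description”); entity type “Losses” (“T#”, “AccidentYear”, “DevelopmentYear”, “Loss”), ID-dependent on “Premiums”]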

The diagram leads to the following statements.

CREATE TABLE LossDevelopmentTriangle
(T# INTEGER NOT NULL
, Titel VARCHAR(60) NOT NULL
, Description VARCHAR(200) NOT NULL
, PRIMARY KEY (T#))

CREATE TABLE Premiums
(T# INTEGER NOT NULL
, AccidentYear INTEGER NOT NULL
, Premium REAL NOT NULL
, FOREIGN KEY (T#) REFERENCES LossDevelopmentTriangle
, PRIMARY KEY (T#, AccidentYear))

CREATE TABLE Losses
(T# INTEGER NOT NULL
, AccidentYear INTEGER NOT NULL
, DevelopmentYear INTEGER NOT NULL
, Loss REAL NOT NULL
, FOREIGN KEY (T#, AccidentYear) REFERENCES Premiums
, PRIMARY KEY (T#, AccidentYear, DevelopmentYear))
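As an illustration, the columns for the accident years 1982 and 1983 of the triangle above can be loaded into these tables. The sketch assumes SQLite (quoted identifiers, foreign keys switched on by PRAGMA); the triangle title and the query for the latest known loss per accident year are additions for demonstration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE LossDevelopmentTriangle
    ( "T#" INTEGER NOT NULL
    , Titel VARCHAR(60) NOT NULL
    , Description VARCHAR(200) NOT NULL
    , PRIMARY KEY ("T#"));
    CREATE TABLE Premiums
    ( "T#" INTEGER NOT NULL
    , AccidentYear INTEGER NOT NULL
    , Premium REAL NOT NULL
    , FOREIGN KEY ("T#") REFERENCES LossDevelopmentTriangle
    , PRIMARY KEY ("T#", AccidentYear));
    CREATE TABLE Losses
    ( "T#" INTEGER NOT NULL
    , AccidentYear INTEGER NOT NULL
    , DevelopmentYear INTEGER NOT NULL
    , Loss REAL NOT NULL
    , FOREIGN KEY ("T#", AccidentYear) REFERENCES Premiums
    , PRIMARY KEY ("T#", AccidentYear, DevelopmentYear));
    INSERT INTO LossDevelopmentTriangle VALUES (1, 'Example', 'Two columns of the triangle');
""")

premiums = {1982: 627, 1983: 627}
losses = {1982: [100, 200, 300, 400, 500, 500],   # development years 1..6
          1983: [120, 240, 360, 480, 600]}        # development years 1..5

for year, premium in premiums.items():
    conn.execute("INSERT INTO Premiums VALUES (1, ?, ?)", (year, premium))
for year, column in losses.items():
    for development_year, loss in enumerate(column, start=1):
        conn.execute("INSERT INTO Losses VALUES (1, ?, ?, ?)",
                     (year, development_year, loss))

# Latest known loss per accident year (the diagonal of the triangle).
latest = conn.execute("""
    SELECT l.AccidentYear, l.Loss
    FROM Losses l
    WHERE l.DevelopmentYear = (SELECT MAX(m.DevelopmentYear) FROM Losses m
                               WHERE m."T#" = l."T#"
                                 AND m.AccidentYear = l.AccidentYear)
    ORDER BY l.AccidentYear
""").fetchall()

# A loss for an accident year without a premium row violates the ID-dependency.
try:
    conn.execute("INSERT INTO Losses VALUES (1, 1990, 1, 50)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```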

Another practical use of ID-dependent entity types is multivalued attributes. Suppose we
have an entity type “Book” with an attribute “Subject”.

[Diagram: entity type “Book” with attributes “ISBN”, “Author”, “Title”, “Subject”]

Now we realize that it should be possible to assign more than one subject to a book. Then
we can design “Subject” as an entity type that is ID-dependent on “Book”.

[Diagram: entity type “Subject” (attributes “ISBN”, “Subject”), ID-dependent on entity type “Book” (“ISBN”, “Author”, “Title”)]

As an aside it should be mentioned here that in both forms “Subject” has no independent
existence: a subject exists in the database only as long as there is a book that has it
attached.

But if “Subject” needs to be independent, because subjects are administered elsewhere or
because other object types are dependent on it (but not on the books), then we can of
course design it as follows.

[Diagram: relationship type “Categorization” (attributes “ISBN”, “Subject”) with arrows labeled “m” to the entity types “Book” (“ISBN”, “Author”, “Title”) and “Subject” (“Subject”)]

But the categorization could also “go in the other direction”, namely if it defines a partition
of the entities to be categorized, that is if every entity needs exactly one category.

[Diagram: entity type “Book” (attributes “C#”, “ISBN”, “Title”, “Author”), ID-dependent on entity type “Category” (“C#”, “Name”, “Description”)]

This could be the situation in a book store where the shelves are categorized, which is
usually the case.

Many attributes are not really multivalued but do have a natural development of values over
time, as for instance “QuantityInStock” for “Assets”.

[Diagram: entity type “QuantityInStock” (attributes “A#”, “InStockDate”, “NumberOfUnits”, “ActualUnitPrice”), ID-dependent on entity type “Asset” (“A#”, “Title”, etc.)]

But note that not every historization problem fits into the ID-dependency pattern. If you
have an entity type “E-Current” (whatever “E” means) and another one “E-History”, it is
perfectly fine to design them to be independent of each other.

[Diagram: two unconnected entity types, “E-Current” (attributes “K#”, “A1”, “A2”, “A3”, etc.) and “E-History” (“K#”, “ValidToTimestamp”, “A1”, “A2”, “A3”, etc.)]

It would be wrong here to connect the attributes “K#” by an existence constraint. If an
entity is no longer current, it should be deleted from “E-Current” (and, in the same unit of
recovery, augmented by a timestamp and inserted into “E-History”).
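Moving an entity from the current table to the history table within one unit of recovery might look like the following sketch. The table names ECurrent and EHistory (without the hyphen), the single payload attribute A1, the key {K#, ValidToTimestamp} of the history table and the helper retire are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ECurrent
    ( "K#" INTEGER     NOT NULL
    , A1   VARCHAR(20) NOT NULL
    , PRIMARY KEY ("K#"));
    CREATE TABLE EHistory
    ( "K#"             INTEGER     NOT NULL
    , ValidToTimestamp TIMESTAMP   NOT NULL
    , A1               VARCHAR(20) NOT NULL
    , PRIMARY KEY ("K#", ValidToTimestamp));
    INSERT INTO ECurrent VALUES (1, 'old value');
""")

def retire(key, valid_to):
    """Move one entity from ECurrent to EHistory in a single transaction."""
    with conn:  # commits if both statements succeed, rolls back otherwise
        conn.execute(
            'INSERT INTO EHistory SELECT "K#", ?, A1 FROM ECurrent WHERE "K#" = ?',
            (valid_to, key))
        conn.execute('DELETE FROM ECurrent WHERE "K#" = ?', (key,))

retire(1, "2006-06-30 12:00:00")

current_rows = conn.execute("SELECT COUNT(*) FROM ECurrent").fetchone()[0]
history_rows = conn.execute("SELECT * FROM EHistory").fetchall()
```

Note that no foreign key connects the two tables: after the move, the history row refers to a K# that no longer exists in the current table, which is exactly why an existence constraint would be wrong here.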

By the way, beginners sometimes tend to believe that in a data design diagram everything
should somehow be connected to everything else. This is not so; on the contrary, the more
independence and “orthogonality of concepts” there is in a data structure the better.

The world is full of natural ID-dependencies. A last example (for the moment):

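[Diagram: entity types “Software” (attributes “S#”, “Name”, “Manufacturer”) and “User” (“U#”, “Name”); relationship type “Order” (“S#”, “U#”, “OrderDate”) with arrows labeled “m” to “Software” and “User”; entity type “SoftwareVersion” (“S#”, “V#”, “CertificationDate”), ID-dependent on “Software”; relationship type “Installed” (“S#”, “V#”, “U#”, “InstallDate”) with arrows labeled “m” to “SoftwareVersion” and “User”]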

Note that I can have versions of software that I did not order and also might have ordered
software which is not installed yet.
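One possible mapping of this diagram into tables is sketched below. It follows one reading of the diagram (which attribute belongs to which type is partly an assumption), the column types and sample rows are invented, and “Order” and “User” are quoted because ORDER is an SQL keyword and USER is reserved in several database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Software
    ( "S#" INTEGER NOT NULL, Name VARCHAR(40) NOT NULL, Manufacturer VARCHAR(40) NOT NULL
    , PRIMARY KEY ("S#"));
    CREATE TABLE "User"
    ( "U#" INTEGER NOT NULL, Name VARCHAR(40) NOT NULL
    , PRIMARY KEY ("U#"));
    CREATE TABLE "Order"
    ( "S#" INTEGER NOT NULL, "U#" INTEGER NOT NULL, OrderDate DATE NOT NULL
    , FOREIGN KEY ("S#") REFERENCES Software
    , FOREIGN KEY ("U#") REFERENCES "User"
    , UNIQUE ("S#", "U#"));
    CREATE TABLE SoftwareVersion
    ( "S#" INTEGER NOT NULL, "V#" INTEGER NOT NULL, CertificationDate DATE NOT NULL
    , FOREIGN KEY ("S#") REFERENCES Software
    , PRIMARY KEY ("S#", "V#"));
    CREATE TABLE Installed
    ( "S#" INTEGER NOT NULL, "V#" INTEGER NOT NULL, "U#" INTEGER NOT NULL
    , InstallDate DATE NOT NULL
    , FOREIGN KEY ("S#", "V#") REFERENCES SoftwareVersion
    , FOREIGN KEY ("U#") REFERENCES "User"
    , UNIQUE ("S#", "V#", "U#"));
    INSERT INTO Software VALUES (1, 'Editor', 'Acme');
    INSERT INTO "User" VALUES (1, 'Anna');
    INSERT INTO "User" VALUES (2, 'Ben');
    INSERT INTO SoftwareVersion VALUES (1, 1, '2006-01-15');
    INSERT INTO "Order" VALUES (1, 2, '2006-02-01');     -- Ben ordered, nothing installed
    INSERT INTO Installed VALUES (1, 1, 1, '2006-03-01'); -- Anna installed, never ordered
""")

# Pairs (U#, S#) with an installed version but no order.
installed_not_ordered = conn.execute("""
    SELECT i."U#", i."S#" FROM Installed i
    WHERE NOT EXISTS (SELECT 1 FROM "Order" o
                      WHERE o."S#" = i."S#" AND o."U#" = i."U#")
""").fetchall()

# Pairs (U#, S#) with an order but no installed version.
ordered_not_installed = conn.execute("""
    SELECT o."U#", o."S#" FROM "Order" o
    WHERE NOT EXISTS (SELECT 1 FROM Installed i
                      WHERE i."S#" = o."S#" AND i."U#" = o."U#")
""").fetchall()
```

Because “Order” and “Installed” are independent relationship types, both situations mentioned in the text are representable without any NULLs.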

ISA-dependent Entity Types

“ISA” comes from “is a” and means that the dependent entities are also of the type of the
parent entities. It corresponds to generalization in the direction towards the parent and to
specialization in the opposite direction.

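[Diagram: entity type “BusinessPartner” (attributes “B#”, “Name”, “Address”) with two ISA-dependent entity types, “Customer” (“B#”, “DegreeOfCreditWorthiness”) and “Supplier” (“B#”, “FaxNumber”, “Reliability”)]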

The arrows, as always, are existence dependencies and are mapped into referential integrity
constraints, and the labels, as usual, map into key constraints of the dependent types. If a
customer is a business partner, then by entity integrity the foreign key of “Customer” must
also be a key: one row in “Customer” is one customer, and that customer is the business
partner given by exactly one row in “BusinessPartner”. The same principle applies to “Supplier”.

CREATE TABLE BusinessPartner
(B# INTEGER NOT NULL
, Name VARCHAR(30) NOT NULL
, Address VARCHAR(200) NOT NULL
, PRIMARY KEY (B#))

CREATE TABLE Customer
(B# INTEGER NOT NULL
, DegreeOfCreditWorthiness INTEGER NOT NULL
, FOREIGN KEY (B#) REFERENCES BusinessPartner
, PRIMARY KEY (B#))

CREATE TABLE Supplier
(B# INTEGER NOT NULL
, FaxNumber VARCHAR(20) NOT NULL
, Reliability VARCHAR(10) NOT NULL
, FOREIGN KEY (B#) REFERENCES BusinessPartner
, PRIMARY KEY (B#))

This construction only makes sense if the dependent entity types do not have the same
attributes and if there are some attributes that apply to all dependents, here “Name” and
“Address” as examples. Note that a customer can be a supplier at the same time.
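That a business partner can be a customer and a supplier at the same time, but cannot be a customer twice, can be checked with a few invented rows. The sketch again assumes SQLite with quoted identifiers and foreign keys switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE BusinessPartner
    ( "B#" INTEGER NOT NULL
    , Name VARCHAR(30) NOT NULL
    , Address VARCHAR(200) NOT NULL
    , PRIMARY KEY ("B#"));
    CREATE TABLE Customer
    ( "B#" INTEGER NOT NULL
    , DegreeOfCreditWorthiness INTEGER NOT NULL
    , FOREIGN KEY ("B#") REFERENCES BusinessPartner
    , PRIMARY KEY ("B#"));
    CREATE TABLE Supplier
    ( "B#" INTEGER NOT NULL
    , FaxNumber VARCHAR(20) NOT NULL
    , Reliability VARCHAR(10) NOT NULL
    , FOREIGN KEY ("B#") REFERENCES BusinessPartner
    , PRIMARY KEY ("B#"));
    INSERT INTO BusinessPartner VALUES (1, 'Acme', 'Zurich');
    INSERT INTO BusinessPartner VALUES (2, 'Bolt', 'Basel');
    INSERT INTO Customer VALUES (1, 3);
    INSERT INTO Supplier VALUES (1, '555-0100', 'high');
    INSERT INTO Supplier VALUES (2, '555-0101', 'low');
""")

def accepted(sql):
    """Run one INSERT; report whether the database accepts it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Business partners that are customer and supplier at the same time.
both_roles = conn.execute("""
    SELECT b.Name
    FROM BusinessPartner b
    JOIN Customer c ON c."B#" = b."B#"
    JOIN Supplier s ON s."B#" = b."B#"
""").fetchall()

customer_twice      = accepted("INSERT INTO Customer VALUES (1, 5)")  # key {B#} violated
customer_no_partner = accepted("INSERT INTO Customer VALUES (3, 1)")  # no such business partner
```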

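As another example, consider the following.

[Diagram: entity type “StaffMember” (attributes “S#”, “Name”, “TelNumber”) with the ISA-dependent entity types “Employee” (“S#”, “PrivateAddress”) and “ExternalStaff” (“S#”, “Company”); entity type “Project” (“P#”, “Name”); relationship type “WorksFor” (“S#”, “P#”) with arrows labeled “m” to “StaffMember” and “Project”; relationship type “Leads” (“S#”, “P#”) with arrows labeled “1” and “m”]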

Generalization also sometimes simplifies the structures. Take the case of three independent
entity types “Book”, “Magazine” and “Article”. Articles might be contained in books or in
magazines, books can describe or mention magazines and vice versa, articles might write
about books, and so on.

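[Diagram: entity types “Magazine”, “Book” and “Article”, connected by the relationship types “MagazineOnBook” and “BookOnMagazine” (between “Magazine” and “Book”), “ContainedInBook” and “ArticleOnBook” (between “Article” and “Book”) and “ContainedInMagazine” (between “Article” and “Magazine”), with arrows labeled “1” or “m”]

This calls for a generalization.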

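[Diagram: entity type “WrittenPiece” (attribute “P#”) with the ISA-dependent entity types “Book”, “Magazine” and “Article”; relationship type “IsContainedIn” (attributes “P#Contained”, “P#In”) with two arrows to “WrittenPiece”, labeled “m” and “1”; relationship type “WritesAbout” (“P#Writes”, “P#About”) with two arrows to “WrittenPiece”, both labeled “m”]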

Composite Entity Types

A composite entity type is just a relationship type that has been promoted to an entity type
because some other object type will become dependent on it. Then the newly promoted
composite entity type needs a primary key (relationship types need keys but they do not
need primary keys because nothing else is dependent on them).

Take the case of the relationship type “Arrangement” between “StaffMember” and
“Project”.

[Diagram: relationship type “Arrangement” (attributes “M#”, “P#”, “Text”) with arrows labeled “m” to the entity types “StaffMember” (“M#”, “Name”) and “Project” (“P#”, “Name”, “Description”)]

“Arrangement” has a key, {M#, P#}, enforced by the labels “m” at the arrows, but no
primary key.

Now we want to make an entity type “Details” ID-dependent on “Arrangement”. Then we
promote the relationship type “Arrangement” to the composite entity type with the same
name (graphically embed the diamond into a rectangle) and take the already given key as the
primary key.

[Diagram: as before, but “Arrangement” is now a composite entity type (diamond inside a rectangle), and the entity type “Details” (attributes “M#”, “P#”, “D#”, “DateAdded”, “Text”) is ID-dependent on it]
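In table terms the promotion only turns the existing key of “Arrangement” into a primary key, so that “Details” can reference it. A sketch, with assumed column types, reduced parent tables and invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE StaffMember
    ( "M#" INTEGER NOT NULL, Name VARCHAR(40) NOT NULL, PRIMARY KEY ("M#"));
    CREATE TABLE Project
    ( "P#" INTEGER NOT NULL, Name VARCHAR(40) NOT NULL, PRIMARY KEY ("P#"));
    CREATE TABLE Arrangement
    ( "M#"   INTEGER NOT NULL
    , "P#"   INTEGER NOT NULL
    , "Text" VARCHAR(200) NOT NULL
    , FOREIGN KEY ("M#") REFERENCES StaffMember
    , FOREIGN KEY ("P#") REFERENCES Project
    , PRIMARY KEY ("M#", "P#"));        -- was UNIQUE before the promotion
    CREATE TABLE Details
    ( "M#"      INTEGER NOT NULL
    , "P#"      INTEGER NOT NULL
    , "D#"      INTEGER NOT NULL
    , DateAdded DATE NOT NULL
    , "Text"    VARCHAR(200) NOT NULL
    , FOREIGN KEY ("M#", "P#") REFERENCES Arrangement
    , PRIMARY KEY ("M#", "P#", "D#"));  -- extension of the parent key
    INSERT INTO StaffMember VALUES (1, 'Smith');
    INSERT INTO Project VALUES (1, 'Alpha');
    INSERT INTO Project VALUES (2, 'Beta');
    INSERT INTO Arrangement VALUES (1, 1, 'two days per week');
""")

def detail_accepted(m, p, d):
    """Try to add detail number d to the arrangement of staff member m with project p."""
    try:
        conn.execute("INSERT INTO Details VALUES (?, ?, ?, '2006-06-01', 'note')", (m, p, d))
        return True
    except sqlite3.IntegrityError:
        return False

with_arrangement    = detail_accepted(1, 1, 1)  # arrangement (1, 1) exists
without_arrangement = detail_accepted(1, 2, 1)  # no arrangement (1, 2)
```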

As another example we consider internet auctions. There are objects to be offered, maybe
in several auctions if nobody wanted them in the first place. An offering is therefore clearly
a combination of an auction and an object. As such it would be a relation. But now different
participants, known as URLs (network addresses) come into play and make bids for the
offering.

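[Diagram: composite entity type “Offering” (attributes “O#”, “A#”) with arrows labeled “m” to the entity types “Object” (“O#”, “Picture”, “Description”) and “Auction” (“A#”, “BeginTimestamp”, “EndTimestamp”); relationship type “Bid” (“A#”, “O#”, “Url”, “BidTimestamp”, “BidPrice”) with arrows labeled “m” to “Offering” and to the entity type “UrlAtGivenTimestamp” (“Url”, “BidTimestamp”)]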

Note that we could have taken “Url” as independent entity type and “UrlAtGivenTimestamp”
ID-dependent on it. That would be necessary if there is a need to store information on the
URL as such, for example whether it is registered to be authorized to take part in the
auction, and so on.

Of course there are also other structurings possible (it is not forbidden to discuss other
options).

As a last example (for the moment) take a company that offers household services like
cleaning or gardening services but also babysitting services. Then it is interesting for the
company to keep an exact trace of the requested timing of services.

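[Diagram: composite entity type “ServiceRequest” (attributes “R#”, “H#”) with arrows labeled “m” to the entity types “CustomerRequest” (“R#”, “RequestDate”) and “HouseholdService” (“H#”, “Description”); entity type “ServiceStaggering” (“R#”, “H#”, “S#”, “Date”, “TimeFrom”, “TimeTo”), ID-dependent on “ServiceRequest”]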

Correct Entity Relationship diagrams

Correct ER diagrams

Not every collection of rectangles, diamonds, etc. can be considered as a “correct” diagram.

[Figure: the set D of all possible ER diagrams, each drawn as a point, with the subset C of correct diagrams marked inside it]

Let D be the set of all possible ER diagrams (every point is a diagram). In D there are
diagrams which either cannot be mapped into relational tables in a meaningful way or where
possible results of such a mapping have unwanted qualities.

We will define a subset C of correct diagrams that are easily recognizable as such, together
with a “natural” mapping into relational tables that are fully normalized, have only referential
integrity constraints with decent behaviour and some other nice qualities. For database
specialists there is an appendix where this is elaborated in a more precise way.

The correct diagrams are defined recursively as the empty diagram and those diagrams that
can be produced from a correct diagram by application of one of the following six
operations.

1) Define independent entity type:
   assumption: none
   result: new rectangle (with unique name)

2) Define attribute:
   assumption: E is a rectangle or diamond or diamond rectangle
   result: new (within E uniquely named) oval connected with E

3) Define relationship type:
   assumption: E1 , E2 , … , En given rectangles or diamond rectangles (n ≥ 1)
   result: new diamond R such that for each Ej there is (at least one) arrow
   labeled “1” or “m” pointing from R to Ej
   (total number of arrows at least two)

4) Define ID-dependent entity type:
   assumption: E a rectangle or diamond rectangle
   result: new rectangle with “ID” labeled arrow pointing to E

5) Define ISA-dependent entity type:
   assumption: E a rectangle or diamond rectangle
   result: new rectangle with “ISA” labeled arrow pointing to E

6) Transform relationship type to composite entity type:
   assumption: D a diamond
   result: D becomes a diamond rectangle
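The recursive definition can be read as a small construction kit: a diagram is correct if it was built with these six operations only, and each operation checks its assumption. The following Python sketch is an illustration of that reading, not part of the method itself; the class Diagram, its method names and the sample diagram are all invented.

```python
class Diagram:
    """An ER diagram that can only grow through the six operations."""

    def __init__(self):
        self.kind = {}        # name -> "rectangle" | "diamond" | "diamond rectangle"
        self.attributes = {}  # name -> set of attribute (oval) names
        self.arrows = []      # (source, label, target)

    def _new(self, name, kind):
        if name in self.kind:
            raise ValueError(f"name {name!r} is already used")
        self.kind[name] = kind
        self.attributes[name] = set()

    def _require_entity(self, name):
        # Arrows may only point to rectangles or diamond rectangles.
        if self.kind.get(name) not in ("rectangle", "diamond rectangle"):
            raise ValueError(f"{name!r} is not a rectangle or diamond rectangle")

    def independent_entity_type(self, name):             # operation 1
        self._new(name, "rectangle")

    def attribute(self, owner, name):                    # operation 2
        if owner not in self.kind:
            raise ValueError(f"unknown object type {owner!r}")
        if name in self.attributes[owner]:
            raise ValueError(f"{owner!r} already has an attribute {name!r}")
        self.attributes[owner].add(name)

    def relationship_type(self, name, arrows):           # operation 3
        # arrows: list of (label, target); targets may repeat.
        if len(arrows) < 2:
            raise ValueError("a relationship type needs at least two arrows")
        for label, target in arrows:
            if label not in ("1", "m"):
                raise ValueError(f"invalid label {label!r}")
            self._require_entity(target)
        self._new(name, "diamond")
        self.arrows += [(name, label, target) for label, target in arrows]

    def id_dependent_entity_type(self, name, parent):    # operation 4
        self._require_entity(parent)
        self._new(name, "rectangle")
        self.arrows.append((name, "ID", parent))

    def isa_dependent_entity_type(self, name, parent):   # operation 5
        self._require_entity(parent)
        self._new(name, "rectangle")
        self.arrows.append((name, "ISA", parent))

    def promote_to_composite(self, name):                # operation 6
        if self.kind.get(name) != "diamond":
            raise ValueError(f"{name!r} is not a diamond")
        self.kind[name] = "diamond rectangle"


d = Diagram()
d.independent_entity_type("E")
d.attribute("E", "A")
d.independent_entity_type("F")
d.relationship_type("R", [("1", "E"), ("m", "F")])

# Nothing may depend on a plain relationship type (diamond) ...
try:
    d.id_dependent_entity_type("G", "R")
    rejected_before_promotion = False
except ValueError:
    rejected_before_promotion = True

# ... but after promotion to a composite entity type it is allowed.
d.promote_to_composite("R")
d.id_dependent_entity_type("G", "R")
```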

We build a simple example of a correct ER diagram. Each step adds the named element:

1) Define independent entity type: E
2) Define attribute: A
3) Define independent entity type: F
4) Define relationship type: R (arrows labeled “1” and “m”)
5) Transform relationship type to composite entity type: R
6) Define ID-dependent entity type: G (arrow labeled “ID”)
7) Define relationship type: S (arrows labeled “1” and “m”)
8) Define independent entity type: H
9) Define ISA-dependent entity type: I (arrow labeled “ISA”)
10) Define attribute: B

And so on.

Mapping of correct ER diagrams into relational tables

Due to the recursiveness of the definition of correct ER diagrams it is possible to define the
mapping into relational tables very easily, namely also recursively, along the lines that have
been sketched in the examples with given “CREATE TABLE” statements.

Since some diagrams might be drawn for overview purposes that are still correct but do not
contain all necessary attributes or keys etc., we should ensure that the notions

key
primary key
set of foreign key/primary key attribute pairs

are properly defined.

The independent entity types are the most important types. Every other type is directly or
indirectly existentially dependent on them and inherits some of their key attributes, so we
should first ensure that they have their (primary) keys defined.

Every set of foreign key/primary key attribute pairs corresponds to an arrow. In most cases
it will be clear from attribute names which pairs constitute an arrow. For instance in the
example

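[Diagram: relationship type “WorksFor” (attributes PName, PForename, CName) with an arrow
labeled “m” to “Person” (attributes Name, Forename) and an arrow labeled “1” to “Company”
(attribute Name)]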

where there is a name clash in the attributes of “WorksFor”, resolved by proper renaming,
we have two sets of pairs, namely

{<WorksFor.PName, Person.Name>,
<WorksFor.PForename, Person.Forename>}

corresponding to the arrow from “WorksFor” to “Person”, and

{<WorksFor.CName, Company.Name>}

corresponding to the arrow from “WorksFor” to “Company”.
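
As a hedged sketch, the two sets of pairs translate into two foreign key clauses. The example
uses Python's sqlite3 module merely as a convenient way to run the SQL. The column types, the
primary keys of “Person” and “Company”, and the choice of (PName, PForename) as key of
“WorksFor” (which presumes that the arrow to “Company” carries the label “1”) are assumptions
read off the diagram.

```python
import sqlite3

DDL = """
CREATE TABLE Person (
    Name     TEXT NOT NULL,
    Forename TEXT NOT NULL,
    PRIMARY KEY (Name, Forename)
);
CREATE TABLE Company (
    Name TEXT NOT NULL PRIMARY KEY
);
CREATE TABLE WorksFor (
    PName     TEXT NOT NULL,
    PForename TEXT NOT NULL,
    CName     TEXT NOT NULL,
    PRIMARY KEY (PName, PForename),   -- assumed key: a person works for one company
    FOREIGN KEY (PName, PForename) REFERENCES Person (Name, Forename),
    FOREIGN KEY (CName) REFERENCES Company (Name)
);
"""

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript(DDL)

# Read the foreign keys back from the catalog and group the
# (foreign key attribute, primary key attribute) pairs by parent table.
pairs = {}
for row in con.execute("PRAGMA foreign_key_list(WorksFor)"):
    parent, fk_attr, pk_attr = row[2], row[3], row[4]
    pairs.setdefault(parent, set()).add((fk_attr, pk_attr))
```

Read back this way, `pairs` holds one set of pairs per arrow, matching the two sets listed
above.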

Warning: The following material is presented mathematically. If you are not familiar with
set theory notation, skip it and consider again the examples with added “CREATE TABLE”
statements. Grasping them gives you enough understanding for all practical business
examples.

To treat the recursion step “define relationship type” more formally (to be sure that it is
defined for all cases) let the relationship type R be dependent on the (composite) entity
types E1 , E2 , … , En with the sets of foreign key/primary key attribute pairs

{<R.Aj,1, E(j).Bj,1>, <R.Aj,2, E(j).Bj,2>, … , <R.Aj,nj, E(j).Bj,nj>},

corresponding to the arrow pointing from R to E(j), for 1 ≤ j ≤ N. Note that we have n
entity types and N arrows where N ≥ n (in most cases N=n).

If all arrows from R to E(j) have the label “m”, then the set of attributes

∪_{1 ≤ j ≤ N} {Aj,1, Aj,2, … , Aj,nj}

must be defined as key in R. This is the set of all foreign key attributes.

But if there is an arrow with label “1”, then for all k such that the arrow from R to E(k)
has the label “1”, the set

∪_{1 ≤ j ≤ N, j ≠ k} {Aj,1, Aj,2, … , Aj,nj}

must be defined as key in R. This is the set of all foreign key attributes corresponding to
the arrows from R to E(j) for all j ≠ k. Note that in the case of overlapping foreign key
attributes this set does not necessarily equal the set

( ∪_{1 ≤ j ≤ N} {Aj,1, Aj,2, … , Aj,nj} ) \ {Ak,1, Ak,2, … , Ak,nk}.

We do not recommend overlapping foreign key attributes, but in some cases they might be
practical. To summarize: if all arrows have label “m”, we have one key; if exactly one arrow
has label “1”, we also have one key; if two arrows have label “1”, we have two keys; and so on.
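
The key rule for relationship types can be sketched as a small function. This is only an
illustration: the function name and the representation of an arrow as a pair of its label and
its foreign key attributes are hypothetical.

```python
def relationship_keys(arrows):
    """Return the keys of a relationship type R.

    arrows: list of (label, fk_attributes), one entry per arrow from R,
            where label is "1" or "m".
    """
    attr_sets = [frozenset(attrs) for _, attrs in arrows]
    one_arrows = [k for k, (label, _) in enumerate(arrows) if label == "1"]
    if not one_arrows:
        # All arrows are labeled "m": the single key is the set of all
        # foreign key attributes.
        return [frozenset().union(*attr_sets)]
    # One key per arrow labeled "1": the union over all arrows j != k.
    return [
        frozenset().union(*(s for j, s in enumerate(attr_sets) if j != k))
        for k in one_arrows
    ]


# Overlapping foreign keys: C# occurs in both arrows.
overlap = relationship_keys([("m", ["C#", "O#", "A#"]), ("1", ["C#", "D#"])])
```

In the overlapping case the key is {C#, O#, A#}, whereas removing the attributes of the
“1” arrow from the set of all foreign key attributes would leave only {O#, A#}. This is the
difference the note on overlapping foreign key attributes points out.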

In the case of the recursion step “define ID-dependent entity type” we have a given
(composite) entity type E with primary key {B1, B2, … , Bn}, n ≥ 1. The arrow pointing from
the new entity type F to E corresponds to the set of foreign key/primary key attribute pairs

{<F.A1, E.B1>, <F.A2, E.B2>, … , <F.An, E.Bn>}.

In this case we must define a key in F that includes {A1, A2, … , An}, the set of foreign key
attributes, plus at least one additional attribute of F.

In the case of the recursion step “define ISA-dependent entity type” we have a given
(composite) entity type E with primary key {B1, B2, … , Bn}, n ≥ 1. The arrow pointing from
the new entity type F to E corresponds to the set of foreign key/primary key attribute pairs

{<F.A1, E.B1>, <F.A2, E.B2>, … , <F.An, E.Bn>}.

In this case we must define {A1, A2, … , An}, the set of foreign key attributes, as a key
of F.

Of course, every (composite) entity type with incoming arrows must have a primary key
defined.
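
The two dependent cases can be contrasted in a minimal sketch, run here through Python's
sqlite3 module. The table names F_id and F_isa, the additional attribute Seq and all column
types are hypothetical and serve only to show where the key differs.

```python
import sqlite3

DDL = """
CREATE TABLE E (
    B1 INTEGER NOT NULL PRIMARY KEY
);
-- ID-dependent: the key contains the foreign key attributes
-- plus at least one additional attribute of its own.
CREATE TABLE F_id (
    A1  INTEGER NOT NULL,
    Seq INTEGER NOT NULL,
    PRIMARY KEY (A1, Seq),
    FOREIGN KEY (A1) REFERENCES E (B1)
);
-- ISA-dependent: the foreign key attributes themselves are a key.
CREATE TABLE F_isa (
    A1 INTEGER NOT NULL PRIMARY KEY,
    FOREIGN KEY (A1) REFERENCES E (B1)
);
"""

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript(DDL)
con.execute("INSERT INTO E VALUES (1)")
# Several ID-dependent entities may hang on the same parent entity ...
con.executemany("INSERT INTO F_id VALUES (?, ?)", [(1, 1), (1, 2)])
# ... but at most one ISA-dependent entity.
con.execute("INSERT INTO F_isa VALUES (1)")


def second_isa_row_rejected():
    try:
        con.execute("INSERT INTO F_isa VALUES (1)")
    except sqlite3.IntegrityError:
        return True
    return False
```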

Now it should be clear how a correct ER diagram, enriched if necessary with properly
defined keys, primary keys and sets of foreign key/primary key attribute pairs is mapped into
relational tables.

The observation is that the correct diagrams have clearly defined semantics. For example
every arrow is an existence constraint and nothing else. That is one of the reasons why we
always refer to the primary key of the parent table in referential integrity constraints, in
other words, the arrow which corresponds to

{<R.Aj,1, E(j).Bj,1>, <R.Aj,2, E(j).Bj,2>, … , <R.Aj,nj, E(j).Bj,nj>},

maps into

FOREIGN KEY (Aj,1, Aj,2, … , Aj,nj) REFERENCES E(j).

On the other hand, the diagrams show the tables as they will be defined, so the analyst
formulating queries against the tables does not have to switch back and forth between the
tables and differently structured diagrams of the business semantics.

Drawing a correct diagram means moving within rather narrow limits. But up to now no
business data structure is known that could not be represented by a correct diagram.

The big benefit of correct diagrams is that they map directly into IDNF relational tables
(Inclusion Dependency Normal Form, which includes Boyce-Codd Normal Form), that they
avoid the appearance of NULLs at all dangerous places and that all semantics
expressible in the diagram can be controlled by the database system (statements that try to
violate these semantics get an SQL error back). More on this in an appendix.
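
How the database system controls an arrow can be seen in a small sketch. It assumes SQLite
(through Python's sqlite3 module), where foreign key checking has to be switched on per
connection. The entity type “Branch” is hypothetical and only serves as a child table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
con.executescript("""
CREATE TABLE Company (Name TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Branch (
    CName TEXT NOT NULL,
    City  TEXT NOT NULL,
    PRIMARY KEY (CName, City),
    FOREIGN KEY (CName) REFERENCES Company   -- refers to the primary key of Company
);
INSERT INTO Company VALUES ('Acme');
INSERT INTO Branch VALUES ('Acme', 'Zurich');
""")


def rejected(sql):
    """Return True if the database system refuses the statement."""
    try:
        con.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


orphan_rejected = rejected("INSERT INTO Branch VALUES ('Nowhere Ltd', 'Basel')")
delete_rejected = rejected("DELETE FROM Company WHERE Name = 'Acme'")
```

Both statements would break the existence constraint expressed by the arrow, so both are
answered with an error instead of being executed.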

Patterns

Time dependencies

Take the example of a weather statistics table that stores periods of similar weather
situations.

[Diagram: entity type “WeatherPeriodStart” with attributes City, FromTime, WeatherSituation,
AverageTemperature]

The idea is to capture for example a three day period of cloudless weather in Zurich, a
following period of three hours of rain, and so on.

Note that an entity of type “WeatherPeriodStart” is not a period, only the start of a period.
Even if we let a period end when the next one starts, we do not know when the last one
ends for a given city.

This can lead to wrong answers for certain queries. The search for the weather situation in
Zurich at time ‘t’ would have to look for a row r with r.FromTime ≤ ‘t’ such that there is
no row s for Zurich with r.FromTime < s.FromTime ≤ ‘t’. But since the average
temperature per period is also to be stored, which can be calculated only at the end of a
period, we cannot be sure whether the period that started last has already ended.
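
The search described above can be sketched as follows. The sketch assumes that (City,
FromTime) is the key, that times are stored as ISO 8601 text (so that string order equals
time order) and it leaves out the average temperature. The sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE WeatherPeriodStart (
    City             TEXT NOT NULL,
    FromTime         TEXT NOT NULL,
    WeatherSituation TEXT NOT NULL,
    PRIMARY KEY (City, FromTime)
);
INSERT INTO WeatherPeriodStart VALUES
    ('Zurich', '2024-06-01T00:00', 'cloudless'),
    ('Zurich', '2024-06-04T00:00', 'rain');
""")

# Row r started at or before t, and no row s for the same city
# started after r and at or before t.
QUERY = """
SELECT r.WeatherSituation
FROM WeatherPeriodStart r
WHERE r.City = :city
  AND r.FromTime <= :t
  AND NOT EXISTS (
        SELECT 1
        FROM WeatherPeriodStart s
        WHERE s.City = r.City
          AND s.FromTime > r.FromTime
          AND s.FromTime <= :t)
"""


def situation_at(city, t):
    row = con.execute(QUERY, {"city": city, "t": t}).fetchone()
    return row[0] if row else None
```

Asked for a time weeks after the last entry, the query still answers with the situation of
the period that started last, although nobody knows whether that period is still going on.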

In other words to identify a period we need two rows that have none in between. If an
entity should represent the period, it also needs an endpoint.

[Diagram: entity type “WeatherPeriod” with attributes City, FromTime, ToTime,
WeatherSituation, AverageTemperature]

The above mentioned query can now be formulated much more easily and would probably also
perform much better. But there is a price to pay, namely the program (or user) that enters
the values must now guarantee that no overlapping periods are entered.
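
A hedged sketch of both sides of this trade-off: the simpler point-in-time query and a check
that finds overlapping periods. It assumes half-open periods (FromTime included, ToTime
excluded), ISO 8601 text for times and (City, FromTime) as key. The sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE WeatherPeriod (
    City             TEXT NOT NULL,
    FromTime         TEXT NOT NULL,
    ToTime           TEXT NOT NULL,
    WeatherSituation TEXT NOT NULL,
    PRIMARY KEY (City, FromTime),
    CHECK (FromTime < ToTime)
);
INSERT INTO WeatherPeriod VALUES
    ('Zurich', '2024-06-01T00:00', '2024-06-04T00:00', 'cloudless'),
    ('Zurich', '2024-06-04T00:00', '2024-06-04T03:00', 'rain');
""")


def situation_at(city, t):
    """A single row decides; no second row has to be inspected."""
    row = con.execute(
        "SELECT WeatherSituation FROM WeatherPeriod "
        "WHERE City = ? AND FromTime <= ? AND ? < ToTime",
        (city, t, t),
    ).fetchone()
    return row[0] if row else None


def overlapping_pairs():
    """Pairs of periods of one city where the later one starts before the earlier one ends."""
    return con.execute(
        "SELECT a.City, a.FromTime, b.FromTime "
        "FROM WeatherPeriod a JOIN WeatherPeriod b "
        "  ON a.City = b.City AND a.FromTime < b.FromTime "
        " AND b.FromTime < a.ToTime "
        "ORDER BY a.FromTime, b.FromTime"
    ).fetchall()
```

The overlap check is the part that the table definition alone does not enforce. It would
have to run in the entering program or as a periodic control query.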

The entity type “WeatherPeriod” is much more reliable in information content than
“WeatherPeriodStart”. Take the example of a hobby meteorologist who sometimes forgets
to observe the weather or to enter the information about it. He should prefer the entity type
“WeatherPeriod”.

To represent periods of validity in relationship types is more difficult, because a relationship
is essentially only a combination of the entities on which it depends. Take the example of
persons employed in companies.

[Diagram: relationship type “Employed” (attributes P#, C#, FromDate, ToDate) with arrows
labeled “m” to “Person” (P#) and “Company” (C#)]

In this example a person can be employed in one and the same company only once, even
though the employment has a period of validity defined.
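
The limitation can be made visible in a small sketch. The column names p_no and c_no stand
for P# and C# (an assumption made because “#” is awkward in SQL identifiers), and the sample
rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Person  (p_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Company (c_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Employed (
    p_no     INTEGER NOT NULL REFERENCES Person,
    c_no     INTEGER NOT NULL REFERENCES Company,
    FromDate TEXT NOT NULL,
    ToDate   TEXT NOT NULL,
    PRIMARY KEY (p_no, c_no)   -- both arrows are labeled "m"
);
INSERT INTO Person VALUES (1);
INSERT INTO Company VALUES (7);
INSERT INTO Employed VALUES (1, 7, '2001-01-01', '2003-12-31');
""")


def try_second_employment():
    """Try to record a later employment of the same person in the same company."""
    try:
        con.execute("INSERT INTO Employed VALUES (1, 7, '2010-01-01', '2012-12-31')")
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


error = try_second_employment()
```

The second employment is refused because the key consists of the two foreign keys only. The
dates are ordinary attributes and do not help to distinguish the two employments.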

A possible resolution of the issue is to take an independent entity type to represent the
periods of validity.

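[Diagram: relationship type “EmplApplic” (attributes P#, C#, E#) with arrows labeled “1” to
“Person” (P#) and “Company” (C#) and an arrow labeled “m” to the independent entity type
“Employment” (attributes E#, FromDate, ToDate)]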

As a next example we consider an attribute that should be made time dependent.

[Diagram: entity type “E” with attributes K# and A]

The solution is to convert the attribute into an ID-dependent entity type.

[Diagram: entity type “E” (attribute K#); entity type “AinTime” (attributes K#, ValidFrom, A)
with an “ID” labeled arrow to “E”]
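
A minimal sketch of this pattern, assuming that (K#, ValidFrom) is the key of “AinTime”, that
k_no stands for K# and that dates are stored as ISO 8601 text. The sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE E (k_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE AinTime (
    k_no      INTEGER NOT NULL REFERENCES E,
    ValidFrom TEXT NOT NULL,
    A         TEXT NOT NULL,
    PRIMARY KEY (k_no, ValidFrom)
);
INSERT INTO E VALUES (1);
INSERT INTO AinTime VALUES (1, '2020-01-01', 'old'), (1, '2023-07-01', 'new');
""")


def value_at(k_no, t):
    """Value of attribute A for entity k_no as valid at time t."""
    row = con.execute(
        "SELECT A FROM AinTime WHERE k_no = ? AND ValidFrom <= ? "
        "ORDER BY ValidFrom DESC LIMIT 1",
        (k_no, t),
    ).fetchone()
    return row[0] if row else None
```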

Since it is also possible to realize a multivalued attribute by an ID-dependent entity type, we
may combine these two patterns, namely take a multivalued attribute where each value
itself may be time dependent.

[Diagram: “E” (K#); “AValues” (K#, A) with an “ID” labeled arrow to “E”; “AValuesInTime”
(K#, A, ValidFrom) with an “ID” labeled arrow to “AValues”]

By collapsing hierarchy levels we may simplify to

[Diagram: “E” (K#); “AValuesInTime” (K#, ValidFrom, A) with an “ID” labeled arrow to “E”]

Sets, Trees, Partitions and the like

Most sets are defined implicitly. For example, the set of contracts a given employee is
responsible for would be realized by directly relating the contracts to the employees. But if
we need sets of elements explicitly, where the sets perhaps even have names, then we may
use the following pattern.

[Diagram: relationship type “BelongsTo” (S#, E#) with arrows labeled “m” to “SetOfElement”
(S#, Name) and “m” to “Element” (E#)]

If the sets should not be overlapping, as for example in a partition, we must change only
one label.

[Diagram: relationship type “BelongsTo” (S#, E#) with arrows labeled “1” to “SetOfElement”
(S#, Name) and “m” to “Element” (E#)]

This is not mathematical set theory, of course, because there we have a lot of generally
accepted axioms that could not be enforced by such data structures alone, but only by
program (or user) controlled constraints. An example would be the extensionality axiom
which says that two sets that have the same elements are equal. So for example there is
only one empty set, but our design pattern would allow many different empty sets.
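
The effect of the single changed label, and the remark on empty sets, can be tried out in a
hedged sketch. The table name BelongsToPartition is hypothetical (both variants are kept side
by side here), and s_no and e_no stand for S# and E#.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE SetOfElement (s_no INTEGER NOT NULL PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Element      (e_no INTEGER NOT NULL PRIMARY KEY);
-- Both arrows labeled "m": the key is (s_no, e_no), sets may overlap.
CREATE TABLE BelongsTo (
    s_no INTEGER NOT NULL REFERENCES SetOfElement,
    e_no INTEGER NOT NULL REFERENCES Element,
    PRIMARY KEY (s_no, e_no)
);
-- Arrow to SetOfElement labeled "1": the key is e_no alone,
-- so an element belongs to at most one set.
CREATE TABLE BelongsToPartition (
    s_no INTEGER NOT NULL REFERENCES SetOfElement,
    e_no INTEGER NOT NULL PRIMARY KEY REFERENCES Element
);
-- Two different sets without any elements are perfectly legal rows.
INSERT INTO SetOfElement VALUES (1, 'first empty set'), (2, 'second empty set');
INSERT INTO Element VALUES (10);
""")


def accepts(table, s_no, e_no):
    try:
        con.execute(f"INSERT INTO {table} VALUES (?, ?)", (s_no, e_no))
    except sqlite3.IntegrityError:
        return False
    return True
```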

A graph is a set of nodes where any two of them may be connected by an edge.

[Diagram: entity type “Node” (N#); relationship type “Edge” (NLeft#, NRight#) with two arrows
labeled “m” to “Node”]

Note that for an undirected graph one has to disregard the direction that the ordering of
the pair (NLeft#, NRight#) automatically introduces, even if we call the foreign keys
Lauren and Rose instead of NLeft# and NRight#.

The same pattern also represents a directed graph, where the edges have a direction on
purpose and are arrows, so to speak.

A directed graph without cycles where each node has at most one outgoing arrow is a forest.

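[Diagram: entity type “Node” (N#); relationship type “Edge” (NDown#, NUp#) with two arrows to
“Node”, labeled “m” (NDown#) and “1” (NUp#)]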

A subset of nodes and edges of a forest in which any two nodes are connected by a path of
edges is a tree. Trees are often used to map hierarchical structures, for example the profit
centers of an enterprise.
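
Walking up such a hierarchy can be sketched with a recursive query. The sketch assumes that
NDown# alone is the key of “Edge” (the arrow for NUp# carries the label “1”), that the stored
edges contain no cycle (the key alone does not prevent one) and that the SQLite version at
hand supports WITH RECURSIVE. The sample nodes are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Node (n_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Edge (
    NDown INTEGER NOT NULL PRIMARY KEY REFERENCES Node,  -- at most one outgoing arrow
    NUp   INTEGER NOT NULL REFERENCES Node
);
INSERT INTO Node VALUES (1), (2), (3), (4), (5);
INSERT INTO Edge VALUES (2, 1), (3, 1), (4, 2);          -- node 5 is a tree of its own
""")


def path_to_root(n_no):
    """Nodes on the way from n_no up to the root of its tree."""
    rows = con.execute(
        """
        WITH RECURSIVE up(n, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT e.NUp, up.depth + 1
            FROM Edge e JOIN up ON e.NDown = up.n
        )
        SELECT n FROM up ORDER BY depth
        """,
        (n_no,),
    ).fetchall()
    return [n for (n,) in rows]
```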

Accordion principle

In most operational application systems, where hundreds of users enter and query millions
of data entries, sooner or later the users would like to add new attributes to certain
important entity types (booking, contract, claim, and so on).

Very often this is not possible without changing dozens of programs and user interfaces. If
it is too expensive to add new attributes, users often change the semantics of the coding
structure of an attribute that is already there, and map their envisioned new attribute into an
old one (for instance, all contracts of the new sort are coded with a leading ‘X’). Such
attribute domain shuffling does not make the logic of reporting programs easier.

Therefore the Swiss Reinsurance Company invented in 1985 a then new design pattern and
called it the accordion principle (“Handorgel”). Nowadays in university circles it is called
“descriptor oriented data model”, or something like that.

Suppose you are given an entity type “E” with “mandatory” attributes “A1”, “A2”, “A3”,
and with “optional” attributes “B1”, “B2”, … , “Bn”.

[Diagram: entity type “E” with attributes A1, A2, A3, B1, B2, … , Bn]

Suppose that at design time the optional attributes are not all known in advance. This is the
business case we discuss here, namely that the user wants to introduce new attributes into
the running system later on.

The solution to this problem lies in the movement shown by an accordion, namely to
compress the optional partly unknown B-attributes in horizontal direction and unfold them
again in the vertical direction. In other words, the attribute names “B1”, “B2”, etc
themselves have to become values in a column of a relational table.

[Diagram: relationship type “EAttributeValues” (A1, AName, Value) with arrows labeled “m” to
“E” (A1, A2, A3) and “m” to “EAttributes” (AName)]

A typical relation of type “EAttributeValues” would be <a1, B3, b>, which means that the
entity of type “E” with primary key value ‘a1’ has, possibly among others, a defined value
for the attribute ‘B3’, namely ‘b’.

Of course this is only possible if the B-attributes all have the same data type, therefore
sometimes two or more accordions are designed, for example one for character data and one
for numeric data. This pressure towards standardized data types has the advantage that user
interfaces can also be designed in a flexible way, independent of concrete B-attributes.
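
A hedged sketch of one accordion for character data. It assumes A1 as primary key of “E”
(as in the example relation <a1, B3, b>) and text columns throughout. The two helper
functions are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE E (
    A1 TEXT NOT NULL PRIMARY KEY,
    A2 TEXT NOT NULL,
    A3 TEXT NOT NULL
);
CREATE TABLE EAttributes (AName TEXT NOT NULL PRIMARY KEY);
CREATE TABLE EAttributeValues (
    A1      TEXT NOT NULL REFERENCES E,
    AName   TEXT NOT NULL REFERENCES EAttributes,
    "Value" TEXT NOT NULL,
    PRIMARY KEY (A1, AName)
);
INSERT INTO E VALUES ('a1', 'x', 'y');
INSERT INTO EAttributes VALUES ('B1'), ('B3');
INSERT INTO EAttributeValues VALUES ('a1', 'B3', 'b');
""")


def add_attribute(name):
    """A new optional attribute is a new row, not a new column."""
    con.execute("INSERT INTO EAttributes VALUES (?)", (name,))


def optional_attributes(a1):
    """Unfold the accordion for one entity into a name -> value mapping."""
    return dict(con.execute(
        'SELECT AName, "Value" FROM EAttributeValues WHERE A1 = ?', (a1,)))
```

Adding an attribute at runtime then needs no change of table definitions, for example
`add_attribute('B4')` followed by an insert into “EAttributeValues”.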

We can not only “accordionize” columns but also tables.

[Diagram: relationship types “InStock”, “InLocalCatalog” and “OrderedForWarehouse”, each with
arrows labeled “m” to “Warehouse” (W#) and “Article” (A#)]

As above the names of types become ordinary values.

[Diagram: relationship type “Concerns” (W#, A#, Name) with arrows labeled “m” to “Warehouse”
(W#), “Article” (A#) and “RType” (Name)]

Group data collection

A group of companies collects financial data from subsidiaries in the form of amount vectors
of variable length

(A#, Currency, Amount, (A1,a1), (A2,a2), ... ,(An,an)).

The (Ak,ak) are pairs of attributes Ak and corresponding attribute values ak. One and the
same attribute may occur several times, that is, Ak = Aj for some k ≠ j, but then the values
have to be different, ak ≠ aj. The number n of attribute-value pairs may differ from
vector to vector, because not every kind of financial data needs the same level of detail in
its description.

The central office puts all amount vectors into one data pot, for easy access with SQL to
control certain conditions (for instance certain amounts will have to be equal to the sum of a
set of other amounts, etc).

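[Diagram: entity type “Amount” (A#, Currency, Amount, Origin, InsertTimestamp); entity type
“Attribute” (AName); “AttrValue” (AName, AValue) with an “ID” labeled arrow to “Attribute”;
relationship type “AmountAttrValue” (A#, AName, AValue) with arrows labeled “m” to “Amount”
and “AttrValue”]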

Star join schema

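[Diagram: relationship type “Facts” (attribute FactsValue) with arrows labeled “m” to the
dimensions D1, D2, D3, D4, … , Dn (key attributes A1, A2, A3, A4, … , An)]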

The ideal star join schema for OLAP (On-line Analytical Processing) consists of dimensions
D1, D2, … , Dn, which are independent entity types, and a facts relationship type
dependent on the dimensions.

The facts table has one or more additional attributes (besides the foreign keys corresponding
to the dimensions), “FactsValue” in the example diagram, which should be additive in the
following sense.

For every set M of tuples <d1, d2, … , dn> of dimensional key values that makes sense
from an information content point, if there is a tuple <m1, m2, … , mn> that information
content wise represents M, then the facts value of this tuple must be equal to the sum of all
facts values of tuples in M.

This additivity constraint is sometimes difficult to achieve. Take the case of dimension D1
standing for geographical unit with the attributes “Region”, “District”, “Country” and
“Territory”. The topographical meaning imposes a natural containment ordering with
incomparable elements (where neither element is contained in the other). In these cases it is
important to choose the smallest possible elements from which all others can be built
(“Region” as primary key of the dimension).

Making an ISA-battery dynamic

Some systems exaggerate the usage of the generalisation/specialisation principle and are
confronted with an ISA-battery that should grow and shrink dynamically (at runtime).

[Diagram: entity type “E0” (E#) with a hierarchy of ISA-dependent entity types “E1” to “E7”,
some of them carrying attributes (A2, A3, A6, A13, A17)]

To capture the dynamics into structure, we map this to the following diagram.

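[Diagram: entity type “EType” (T#); “E” (T#, E#) and “EAttr” (T#, A#), each with an “ID”
labeled arrow to “EType”; relationship type “EAValue” (T#, E#, A#, AValue) with arrows
labeled “m” to “E” and “EAttr”]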

Please note the use of the overlapping foreign key attribute T# in ”EAValue“.

Overlapping Foreign Keys

Every relationship type that is dependent on some object type more than once via different
arrow paths is a potential candidate for overlapping foreign keys. The crucial question is
whether for every relation of the considered type there is one object on which it is
multidependent. We consider an example.

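[Diagram: entity type “Customer” (C#); “Order” (C#, O#) and “Delivery” (C#, D#), each with an
“ID” labeled arrow to “Customer”; entity type “Article” (A#); “OrderPosition” (C#, O#, A#)
with arrows labeled “m” to “Order” and “Article”; “DeliveryPosition” (C#, D#, O#, A#) with
arrows labeled “m” to “Delivery” and “OrderPosition”]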

Here C# is contained in two different foreign keys of “DeliveryPosition”, because a delivery
position is dependent on only one customer via two different paths.
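
The overlapping foreign keys can be written down as follows. This is a sketch under several
assumptions: the table name CustomerOrder replaces “Order” (a reserved word in SQL), the
columns c_no, o_no, d_no and a_no stand for C#, O#, D# and A#, “DeliveryPosition” is taken
to depend on “Delivery” and “OrderPosition”, and the sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Customer (c_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Article  (a_no INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE CustomerOrder (
    c_no INTEGER NOT NULL REFERENCES Customer,
    o_no INTEGER NOT NULL,
    PRIMARY KEY (c_no, o_no)
);
CREATE TABLE Delivery (
    c_no INTEGER NOT NULL REFERENCES Customer,
    d_no INTEGER NOT NULL,
    PRIMARY KEY (c_no, d_no)
);
CREATE TABLE OrderPosition (
    c_no INTEGER NOT NULL,
    o_no INTEGER NOT NULL,
    a_no INTEGER NOT NULL REFERENCES Article,
    PRIMARY KEY (c_no, o_no, a_no),
    FOREIGN KEY (c_no, o_no) REFERENCES CustomerOrder
);
CREATE TABLE DeliveryPosition (
    c_no INTEGER NOT NULL,          -- shared by both foreign keys
    d_no INTEGER NOT NULL,
    o_no INTEGER NOT NULL,
    a_no INTEGER NOT NULL,
    PRIMARY KEY (c_no, d_no, o_no, a_no),
    FOREIGN KEY (c_no, d_no)       REFERENCES Delivery,
    FOREIGN KEY (c_no, o_no, a_no) REFERENCES OrderPosition
);
INSERT INTO Customer VALUES (1), (2);
INSERT INTO Article VALUES (7);
INSERT INTO CustomerOrder VALUES (1, 10), (2, 20);
INSERT INTO Delivery VALUES (1, 100);
INSERT INTO OrderPosition VALUES (1, 10, 7), (2, 20, 7);
""")


def accepts(c_no, d_no, o_no, a_no):
    try:
        con.execute("INSERT INTO DeliveryPosition VALUES (?, ?, ?, ?)",
                    (c_no, d_no, o_no, a_no))
    except sqlite3.IntegrityError:
        return False
    return True
```

Because there is only one c_no column, a delivery position cannot combine a delivery of one
customer with an order position of another customer.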

4NF and 5NF Silly Warnings

Please consider the following two variants of diagrams.

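[Diagram, Variant 1: relationship type “Export” (C#, P#, L#) with arrows labeled “m” to
“Company” (C#), “Product” (P#) and “Country” (L#)]

[Diagram, Variant 2: relationship type “Produces” (C#, P#) with arrows labeled “m” to
“Company” (C#) and “Product” (P#); relationship type “SellsTo” (C#, L#) with arrows labeled
“m” to “Company” (C#) and “Country” (L#)]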

Introductory courses on data design sometimes speculate about fourth and fifth normal forms
(4NF, 5NF) with the help of such examples. For instance it is said that variant 1 carries
the danger of violating these forms.

Forget that! You can violate, for example, the fourth normal form with variant 1 only if,
besides the diagram, you explicitly write down and demand a constraint which says, for
instance, that the data in “Export” must at every point in time be such that it can be
mapped by corresponding projections into “Produces” and “SellsTo” of variant 2 and
recovered from “Produces” and “SellsTo” by (natural) join.

But if you are aware of such a constraint and formulate it, you will in this case draw
variant 2 directly. Please note that the business information contents that can be captured
by the diagrams are quite different in variant 1 compared to variant 2.
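
The constraint in question can be stated as a small check on sets of tuples. The function
names and the sample data are hypothetical. The sketch only shows when “Export” could be
recovered from the two projections.

```python
def join_of_projections(export):
    """Project Export onto Produces and SellsTo, then join them again on C#."""
    produces = {(c, p) for c, p, _ in export}
    sells_to = {(c, l) for c, _, l in export}
    return {(c, p, l) for c, p in produces for c2, l in sells_to if c == c2}


def satisfies_join_constraint(export):
    return join_of_projections(export) == set(export)


# Company C1 exports product P1 only to L1 and product P2 only to L2.
selective = {("C1", "P1", "L1"), ("C1", "P2", "L2")}
# Company C1 exports every product it produces to every country it sells to.
complete = {("C1", p, l) for p in ("P1", "P2") for l in ("L1", "L2")}
```

The selective export is legitimate content for variant 1 but cannot be represented in
variant 2, because joining the projections would add the exports of P1 to L2 and of P2 to L1.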

Example DB designs
Asset Management

Since we want to capture all kinds of assets of different types we take an entity type
“Asset” together with an accordion “Asset_Attributes” and “AAttrVal”. “Asset” might have
a description but consists primarily of its key identifying the abstract entity asset.

[Diagram: relationship type “AAttrVal” (AAttr, A#, Attr_Val) with arrows labeled “m” to
“Asset_Attributes” (AAttr, Descr) and “Asset” (A#, Descr)]

This construct can capture any kind of (asset) entity that is describable by attributes, of
course also future not yet invented types.

The only restriction we impose on the abstract entity asset concerns intangible assets (not
a building, a work of art, etc.), for instance shares: if we hold 500 shares of General
Motors in stock, we will not enter 500 entities into “Asset” but only one. Therefore a
mechanism is needed to capture the number of units of an asset over time. Our typical
pattern for this is the following.

[Diagram: entity type “Asset” (A#); “Asset_In_Time” (A#, TimeSt, UnitsHeld) with an “ID”
labeled arrow to “Asset”]

Some groups of assets might be organized in portfolios, with a responsible manager. Of
course, the portfolio must have a developing existence in time.

[Diagram: entity type “Portfolio” (P#, Name, Manager, Create_Date); “Ptf_Development”
(P#, TimeSt) with an “ID” labeled arrow to “Portfolio”]

“Ptf_Development” will then get related to “Asset” with a relationship attribute (besides the
foreign keys it consists of) to capture the number of units.

[Diagram: relationship type “Asset_In_Ptf” (P#, TimeSt, A#, No_Units) with arrows labeled
“m” to “Ptf_Development” (P#, TimeSt) and “Asset” (A#)]

The question whether “Asset_In_Time” and “Asset_In_Ptf” should contain redundant
information or not would be subject to organizational ruling. One could for example say that
only assets not organized in portfolios should be captured in “Asset_In_Time”.

Next we design another independent entity type “Trade”, also envisaging very different
kinds of trades with different sets of attributes, therefore using the accordion principle
again.

[Diagram: relationship type “TAttrVal” (TAttr, T#, Attr_Val) with arrows labeled “m” to
“Trade_Attributes” (TAttr, Descr) and “Trade” (T#, Trade_ID, User_ID, SettleDate, Buy_Sell)]

The next independent entity type is “Broker”, which stands for all kinds of financial
intermediaries.

A trading of assets is then a relationship depending on “Trade”, “Broker” and “Asset”.

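[Diagram: relationship type “Trade_Asset” (A#, B#, T#, No_Units) with arrows labeled “1” to
“Asset” (A#) and “Broker” (B#, Name, etc.) and an arrow labeled “m” to “Trade” (T#)]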

Many assets have an issuer company, but we will not model that as attribute but rather as
another independent entity type since it will be connected also to ratings.

[Diagram: relationship type “Issue” (C#, A#, Issue_Date) with arrows, labeled “1” and “m”,
to “Issuer_Company” (C#, Name, etc.) and “Asset” (A#)]

Now we come to the ratings. We take “Rating_Org”, the rating organization, as
independent, and the actual ratings, which occur at specific points in time, as ID-dependent
on “Rating_Org”.

[Diagram: entity type “Rating_Org” (R#, Name, Descr); “Rating” (R#, TimeSt, Remarks) with an
“ID” labeled arrow to “Rating_Org”]

There can be ratings of assets as well as ratings of issuer companies:

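[Diagram: relationship type “Comp_Rating” (C#, R#, TimeSt, CRating) with an arrow labeled “1”
to “Issuer_Company” (C#) and an arrow labeled “m” to “Rating” (R#, TimeSt); relationship type
“Asset_Rating” (A#, R#, TimeSt, ARating) with an arrow labeled “1” to “Asset” (A#) and an
arrow labeled “m” to “Rating”]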

Now let us put the pieces together for the whole picture.

[Diagram: overall ER diagram for asset management, combining the fragments shown above:
“Asset” with “Asset_Attributes”/“AAttrVal” and “Asset_In_Time”; “Portfolio”,
“Ptf_Development” and “Asset_In_Ptf”; “Trade” with “Trade_Attributes”/“TAttrVal”; “Broker”
and “Trade_Asset”; “Issuer_Company” and “Issue”; “Rating_Org”, “Rating”, “Comp_Rating” and
“Asset_Rating”]

There might be more attributes, for instance at every table the authentication identification
(Auth-ID) of the process or user that inserted the row.

Customer Relations Management

As the next example we design the kernel of a customer relations management system for a
company whose customers are other companies and where the customer relationships are
very much concentrated on personal relationships with people of the customer companies,
as is typically the case with a reinsurance undertaking.

We take our own employees, the customer companies and those of their employees that come
into our focus as given and make no attempt to historize the corresponding entities.

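[Diagram: entity type “Customer Company” (C#, Name, Address); entity type “Employee”
(E#, Name, etc.); “Company Person” (C#, P#, Name, Address) with an “ID” labeled arrow to
“Customer Company”]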

First of all, at every point in time there is (at most) one employee responsible for a
customer company, and this has to be historized.

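[Diagram: relationship type “Responsible” (C#, E#, FromDate) with an arrow labeled “m” to
“Customer Company” (C#, Name, Address) and an arrow labeled “1” to “Employee” (E#, Name,
etc.); separate entity type “ResponsibleHistory” (C#, E#, FromDate, ToDate)]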

Note that “ResponsibleHistory” is a separate entity type with no existence dependencies on
other object types. As soon as the responsibility for a given customer company changes, the
old responsibility has to be augmented with a “ToDate” date, entered into
“ResponsibleHistory” and deleted from “Responsible”, all in a single atomic transaction of
course.
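
The move into the history table can be sketched as one transaction. The sketch assumes C#
as key of “Responsible” and (C#, FromDate) as key of “ResponsibleHistory”, leaves out the
parent tables and foreign keys of “Responsible” for brevity, and uses a hypothetical
function name and invented sample data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Responsible (
    c_no     INTEGER NOT NULL PRIMARY KEY,   -- at most one responsible per company
    e_no     INTEGER NOT NULL,
    FromDate TEXT NOT NULL
);
CREATE TABLE ResponsibleHistory (
    c_no     INTEGER NOT NULL,
    e_no     INTEGER NOT NULL,
    FromDate TEXT NOT NULL,
    ToDate   TEXT NOT NULL,
    PRIMARY KEY (c_no, FromDate)
);
INSERT INTO Responsible VALUES (1, 10, '2024-01-01');
""")


def change_responsible(c_no, new_e_no, change_date):
    # "with con" commits if the block succeeds and rolls back if it raises.
    with con:
        # 1. Copy the old responsibility, augmented with its ToDate, into the history.
        con.execute(
            "INSERT INTO ResponsibleHistory (c_no, e_no, FromDate, ToDate) "
            "SELECT c_no, e_no, FromDate, ? FROM Responsible WHERE c_no = ?",
            (change_date, c_no))
        # 2. Remove it from the current table.
        con.execute("DELETE FROM Responsible WHERE c_no = ?", (c_no,))
        # 3. Enter the new responsibility.
        con.execute("INSERT INTO Responsible VALUES (?, ?, ?)",
                    (c_no, new_e_no, change_date))
```

If any of the three statements fails, none of them takes effect, so a responsibility is
never lost between the two tables.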

The next construct around marketing activities needs no separate history entity types, since
marketing activity has an abstract key and a time point as one of its attributes.

[Diagram: relationship type “MParticipation” (C#, M#, E#) with arrows labeled “m” to
“Customer Company” (C#, Name, Address), “Marketing Activity” (M#, Timepoint, Descr) and
“Employee” (E#, Name, etc.)]

Instead of describing marketing activities in a text field “Descr” we could take an accordion
for any number of dynamic attributes in “MarketingActivity” (not shown here, since the
principle should be clear by now).

Contacts are most important, therefore we distinguish different types of contacts (which
could be mails, electronic mails, telephone calls, personal contacts, etc.). Here, too, no
separate historization construct is necessary.

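[Diagram: entity type “ContactType” (T#, Name, Descr); “Contact” (T#, Timepoint, Descr,
Report) with an “ID” labeled arrow to “ContactType”; relationship type “CParticipation”
(C#, P#, E#, T#, Timepoint) with arrows labeled “m” to “Company Person” (C#, P#, Name,
Address), “Employee” (E#, Name, etc.) and “Contact”]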

For persons of customer companies everything is collected that they themselves make
known and that could be important for treating them appropriately and politely in contacts.
Of course, this amounts to another accordion.

[Diagram: relationship type “PAttrValues” (C#, P#, Name, AValues) with arrows labeled “m” to
“Company Person” (C#, P#, Name, Address) and “PAttributes” (Name)]

Note that this accordion could also be used for a historization of addresses of company
persons (by taking attributes like “CurrentAddress”, “AddressTill20060626”, etc.).

Now for the whole picture.

[Diagram: overall ER diagram for customer relations management, combining the fragments
shown above: “Customer Company”, “Employee” and “Company Person”; “Responsible” and
“ResponsibleHistory”; “Marketing Activity” and “MParticipation”; “ContactType”, “Contact”
and “CParticipation”; “PAttributes” and “PAttrValues”]

Data Design Guidelines

There is no royal road to the right data structure, simply because in some cases one can say
only after a few years of a running system whether the data structures had been chosen
adequately in the first place.

[Figure: past and present reality versus future reality]

Now one could argue that a data structure or an electronic system in general not only
should predict the future requirements, but also shape them, and therefore have an influence
on the future organization of the business. But this is dangerous because information
technology should follow business and not vice versa.

The only way out of the dilemma between normalizing past requirements and predicting the
future is to design the (data) structures flexibly, in such a way that the user has a maximal
influence on the perceived (data) structures, and can himself respond fast and flexibly to
new business requirements. This can be achieved by using the abstract design patterns
described in this document.

In assessing data structures or designing new ones it is recommended to concentrate first
on the independent entity types and their characterizing keys. Everything else is
existentially dependent on them. Chris Date, the most famous interpreter of the relational
idea after Edgar Codd, calls the independent entity types the kernel entity types. They are
the pillars that carry everything else.

The independent entity types are also the ones that should be described the most carefully.
What exactly is or should be an entity of the given type? The answer to this question has a
close connection to the choice of the keys. It is a fatal habit of object oriented design that
keys are very often almost neglected. Keys in a relational arrangement are explicitly
designated sets of attributes.

It is advisable to consider dependencies on time from the very beginning. An entity can have
several time dimensions, such as accident year, development year, business year, reporting
year, time of estimate, time of creation, and so on. If an entity changes in the course of
time, one should focus not on that entity but on the combination <entity, time> taken as the
new entity to be modeled.

If the data model is to carry a system where many users enter tons of individual data entries
(in contrast to batch-loaded read-only systems), then it should be designed in such a way
that data entered by a user never has to be deleted. The data might be moved from one table
to another (history) table within one and the same atomic transaction, but it should not be
deleted. The reason for this is not only that state authorities may want to know what
happened nine and a half years ago, but also that programming errors can cause great damage,
and the chance of wrongly deleting entries is reduced if deletion as such is a rare event in
a program (I know programs with bad experience in that respect).

Now comes a piece of advice from Radio Erivan, useless but nevertheless with some truth in
it. Find the right mix in generalization. Often it enhances flexibility, but if everything
that should happen in the system is generalized to a business event, then that very business
event entity type will be a first-class bottleneck in the running system, and recovery turns
into a nightmare.

It is perfect to have (independent) packages of object types with no connections over
package limits. It is not necessary that everything is connected directly or indirectly to
everything else.

It should be possible to define and understand the business semantics in the data structures
without considering surrounding program logic or processes. Data should not be just a
byproduct or the waste of a process. “Make persistent” is a bad advisor for data design.

Do not generate the data model automatically from a business object model. This would
result in semantics gaps, inability to report from the data with direct query access tools (so
that every new report has to be programmed within the same technical environment),
unnecessary excesses of referential integrity, NULLs and outer joins festivals, and the
inability to perceive the data as a value per se. Data structures cannot be changed every
day, processes can be changed several times a day.

If you take an object oriented programming language, be aware that there is no generic
mapping from OO to Relational for all cases. Each and every case must be considered
individually, and the mapping constructed accordingly.

Observations of running systems show extreme differences in what happens if the company
undergoes a reorganization. In some systems the changes can be defined completely by the
user community, and in some other systems many programs have to be adapted or even
reprogrammed. Please draw your own conclusions from that observation.

In systems with a heavy load of batch jobs consider the invention of control data structures
that ease the program restartability, as well as the batch interfacing. One should be able to
abort manually a batch job at any time and restart it without much repetition or
endangerment of the database management system.

To change data structures in a running system is not easy. Therefore, think first, program
later. The data model should be in a stable state before the first line of program is written.

Exercises
Structural Exercises

Exercise S1:

Let the following database for countries, languages and official languages be given. Please
propose a simplification.

C#
L#
OfficialLanguage
Name Name
m m
Country Language
m L# m
C# SpokenLanguage L#
Capital
C# Percent of
Population

Exercise S2:

Let the following part of a software distribution administration be given. Please make an
enhancement such that also different versions of the same software can be controlled.

Further
Attributes

Software m m PCAdress
Distribution

S# S# P# P#

Exercise S3:

Let the following diagram be given with unknown labels x,y and z at the arrows.

A1 A3
R
x z
E1 A2 E3

A1 y A3

A2 E2

The labels of course have to be “1” or “m”. For every one of the eight possible choices of
these labels please show which keys in R are implied by the labels.

Exercise S4:

For the following two variants please discuss the difference in information capturing
capability.

Company
Variant 1

m m Country
Product Export

Variant 2 m Company

Manufactures
m
m Export m Country

Product

Exercise S5:

Which of the diagrams are correct ones?

E1 E3
m ID
E1
R1
ISA
E3 ISA
ID
m
E2 E2

Variant 1 Variant 2

ISA
m E2 E1 E2

R1 m
R2 ID ID
m
m
E1 E3 E4
E3
ISA
Variant 3 Variant 4

m m
E4 R1 E2

E2
ISA ISA
A1
ID
E3 E1
Variant 6
Variant 5

Exercise S6:

Let the following diagram be given where only the primary key attributes of the independent
entity types are drawn in the figure.

C#

S#
EDPCompany
m m
ContractFrame SoftwareVersion
V#

ID
Date

ContractActualisation m 1
Responsible Employee

E#

Please fill in the missing key and foreign key attributes and formulate an SQL CREATE
TABLE statement for “Responsible”.

Exercise S7:

Please describe the following database in prose (requirements that might have led to it).

Name
Customer C# TariffPosition
C# m
Address
Composition
T#
ID B# T# Descr
m Price
Number
C#
Bill
B# Date
ID ID

OrderToPay Date
C# Payment Amount

B# Date C# O#
P# B#

Exercise S8:

Please compare the two variants.

Supplier
Variant 1

m m Product
Part Composition

Variant 2 1 Supplier

SuppliesPart
m
m Composition m
Product

Part

Which of the following statements are true in which variant?

(1) A product can consist of several parts.

(2) A part always comes from the same supplier.

(3) If a certain part is needed for a certain product, then it always comes from the same
supplier.

(4) A key of “Composition” consists of part and product (which of course means the
corresponding foreign key attributes).

(5) A product always comes from parts of the same supplier.

(6) All products that contain a given part get something from the same supplier.

ER Diagram Design Exercises

Exercise D1 (Monolingual Thesaurus)

Please design a correct ER diagram for a monolingual thesaurus. For the requirements
definition we follow more or less the ANSI/NISO standard Z39.19 – 2003.

There are terms with term names. Some of the terms are preferred terms (for the usage in
indexing documents). If β is a preferred term for α, then α is also called a synonym for β
(standard notation α USE β, or equivalently, β UF α, UF means “Used For”).

We want a variant where at most one preferred term is defined for any term. To handle
the so-called compound USE references (for example “Snowmobiles USE Vehicles + Snow”),
we therefore also distinguish compound terms, which are terms themselves but are
composed of other terms.

Among the preferred terms, a “Broader Term” (BT) and a “Narrower Term” (NT) relation are
defined, such that x BT y is equivalent to y NT x.

Then “Related Terms” (RT) of different (user definable) categories must be storable.

Each term may have scope notes (SN).

It must be said here that the design of a diagram is easy. It would be more difficult to
formulate an SQL query that, given any term, finds all interesting preferred terms an index
search engine should look for. But this document is about ER design, not SQL.
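
To give an impression of the kind of query meant here, the following is a minimal sketch in Python with SQLite. It is not part of the exercise. The table and column names follow the names used in the thesaurus solution, but the sample data, the function name `search_terms` and the reading of “interesting” (the preferred term of the given term plus all of its narrower terms) are assumptions made only for this illustration. Compound terms and related terms are left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE USE_UF (UF_TermName TEXT PRIMARY KEY, USE_TermName TEXT NOT NULL);
    CREATE TABLE BT_NT (BT_TermName TEXT NOT NULL, NT_TermName TEXT NOT NULL,
                        PRIMARY KEY (BT_TermName, NT_TermName));
    -- Hypothetical sample data.
    INSERT INTO USE_UF VALUES ('Cars', 'Automobiles');
    INSERT INTO BT_NT VALUES ('Vehicles', 'Automobiles'),
                             ('Automobiles', 'Sports cars'),
                             ('Automobiles', 'Vans'),
                             ('Vans', 'Minivans');
""")

QUERY = """
WITH RECURSIVE search(TermName) AS (
    -- Start from the preferred term of the given term (or the term itself).
    SELECT COALESCE((SELECT USE_TermName FROM USE_UF WHERE UF_TermName = :term),
                    :term)
    UNION
    -- Add every narrower term of a term that was already found.
    SELECT b.NT_TermName
    FROM BT_NT b JOIN search s ON b.BT_TermName = s.TermName
)
SELECT TermName FROM search ORDER BY TermName
"""

def search_terms(term):
    """Return the preferred term for `term` and all of its narrower terms."""
    return [row[0] for row in con.execute(QUERY, {"term": term})]

print(search_terms("Cars"))
```

The recursion is what makes the query harder than the diagram: the depth of the BT/NT hierarchy is not known in advance.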

Exercise D2 (Journeys and Vehicles)

Please design a correct ER diagram for the administration of a vehicle park and planned
journeys realized by chauffeurs.

There are chauffeurs with names, birthdays and telephone numbers, and vehicle categories,
together with the information about who holds a driver's license for which vehicle category.

There are vehicles according to the categories, with serial numbers, construction year and
brand. For each vehicle the mileage in form of numbers of kilometers driven is captured from
time to time.

Journeys must be planned with time interval, destination, purpose and price. If a planned
journey is realized, it is recorded which chauffeur made it with which vehicle.
On a journey a vehicle is driven by only one chauffeur, and on one and the same journey a
chauffeur drives only one vehicle. But in rare cases there can be several vehicles (with equal
number of corresponding chauffeurs) on the same journey.

Please consider also the historization of the information captured.



Exercise D3 (Course Administration)

Please design a correct ER diagram for a course administration.

There are courses with titles and descriptions and a unique course number, as well as
course realizations, where a course is actually given at a certain time and place. The same
course can have several realizations.

Every concrete realization of a course has only one person giving it, but different realizations
of the same course might have different course teachers.

The course teachers are internal or external employees. For internals their office must
be stored, for externals their company.

For every course realization all participants must be stored. The participants all are internal
employees.

Internal employees can apply for course participation, even several times for the same
course, since they might have found a better justification in the meantime.

Exercise D4 (SOX Application Controls)

(communicated by Peter Aus der Au, but misunderstandings are my own)

Please design a correct ER diagram for the administration of one aspect of the
Sarbanes-Oxley, Section 404, controls, namely application controls. Application controls are
the functions of applications that support the process level business controls. They are not
to be confused with IT controls, which look after the proper handling of all aspects of IT
applications.

We must capture business processes with a hierarchical leveling for subprocesses. Risks
have to be identified in processes, one risk in at most one process.

Then we need process level controls (PLC), and the possibility to declare which PLC
mitigates which risk. A risk can be mitigated by several PLCs, and a PLC can mitigate
several risks.

One must be able to define application controls (AC), which support the PLCs. There are
manual PLCs and automated PLCs. The automated PLCs are carried out by one application
control, and the manual PLCs may be supported by several application controls.

Some functions (programs, modules or whatever) of applications, called application control
functions (ACF), will be used by the ACs, where one AC may use several ACFs and an
ACF might be used by several ACs.

Exercise D5 (Household Services)

Please design a correct ER diagram for the administration of a household services agency.

There are household services (for example with name and description), customers with
name, address and telephone numbers, and customer requests (date, remarks).

A customer request may relate to several household services, and every one of these
service requests may contain a time slicing (for example “next week Thursday and Friday
from 8 to 10 pm babysitting”).

Of course the agency needs a file of persons with the information who can offer what
services, and also with information about special knowledge or skills, for example to identify
those service offering persons that are ready to offer private lessons in mathematics (those
who have weapons of math instruction, not to be mixed up with mass destruction).

A plan must be made that assigns service persons to service request time slices. It must
be possible to declare the plan definitive.

Solutions to the Exercises

Exercise S1 Solution:

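[Figure: entity types Country (C#, Name, Capital) and Language (L#, Name), connected by
the m:m relationship type SpokenLanguage (C#, L#) with the attributes Official Language
(YesNo) and Percent of Population.]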

Note that the proposed solution is only possible because there was an attribute “Percent of
Population”. In Switzerland the German language is an official language, but not a spoken
one, so German can be entered with Percent of Population (that speaks it) equal to zero.

Exercise S2 Solution:

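For example:

[Figure: entity type Software (S#, Further Attributes); entity type SoftwareVersion
(S#, Version), ID-dependent on Software; entity type PC (P#, Address); m:m relationship
type Distribution (S#, Version, P#) between SoftwareVersion and PC.]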

Exercise S3 Solution:

x   y   z   keys
---------------------------------------------
m   m   m   {A1,A2,A3}
m   m   1   {A1,A2}
m   1   m   {A1,A3}
m   1   1   {A1,A3} and {A1,A2}
1   m   m   {A2,A3}
1   m   1   {A2,A3} and {A1,A2}
1   1   m   {A2,A3} and {A1,A3}
1   1   1   {A2,A3} and {A1,A3} and {A1,A2}
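
The rule behind this table can be written down in a few lines. The following Python sketch is only an illustration; it assumes that the labels x, y and z belong to the attributes A1, A2 and A3 in this order, and the function name `keys_from_labels` is made up for the example. A label “1” on an attribute means that the other two attributes determine it, so these two form a key.

```python
from itertools import product

ATTRS = ("A1", "A2", "A3")

def keys_from_labels(labels):
    """Return the keys implied by the labels (x, y, z) of a ternary relationship.

    Each label "1" contributes the key made of the two *other* attributes.
    If there is no label "1", the only key is the full attribute set.
    """
    keys = [
        frozenset(a for a in ATTRS if a != attr)
        for attr, label in zip(ATTRS, labels)
        if label == "1"
    ]
    return keys or [frozenset(ATTRS)]

# Print all eight label combinations in the style of the table.
for labels in product("m1", repeat=3):
    keys = keys_from_labels(labels)
    text = " and ".join("{" + ",".join(sorted(k)) + "}" for k in keys)
    print(*labels, " ", text)
```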

Exercise S4 Solution:

Variant 2 has more information capturing capabilities than variant 1. First, every data
content of variant 1 can be mapped into variant 2, by filling “Manufactures” with the
projection of “Export” onto the foreign key attributes that correspond to “Company” and
“Product”. Second, not every data content of variant 2 can be mapped into variant 1
without loss. In variant 2 it is possible to capture information on the manufacturing of
products by companies, without the necessity that these products are exported or that the
export of these products is known.
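
The two directions of this argument can be tried out with a small sketch in Python and SQLite. The table names Export_V1 and Export_V2, the column types and the sample rows are assumptions for the illustration only; they are not taken from the exercise.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    -- Variant 1: a single relationship table.
    CREATE TABLE Export_V1 (
        Company TEXT NOT NULL, Product TEXT NOT NULL, Country TEXT NOT NULL,
        PRIMARY KEY (Company, Product, Country)
    );
    -- Variant 2: Manufactures, and Export built on top of it.
    CREATE TABLE Manufactures (
        Company TEXT NOT NULL, Product TEXT NOT NULL,
        PRIMARY KEY (Company, Product)
    );
    CREATE TABLE Export_V2 (
        Company TEXT NOT NULL, Product TEXT NOT NULL, Country TEXT NOT NULL,
        PRIMARY KEY (Company, Product, Country),
        FOREIGN KEY (Company, Product) REFERENCES Manufactures
    );
    INSERT INTO Export_V1 VALUES ('Acme', 'Drill', 'Italy'),
                                 ('Acme', 'Drill', 'France'),
                                 ('Bolt', 'Saw', 'Italy');
""")

# Variant 1 -> Variant 2: fill Manufactures with the projection of Export.
con.execute("INSERT INTO Manufactures SELECT DISTINCT Company, Product FROM Export_V1")
con.execute("INSERT INTO Export_V2 SELECT Company, Product, Country FROM Export_V1")

# Variant 2 can hold a fact that has no place in Variant 1:
# a product that is manufactured but (as far as known) not exported.
con.execute("INSERT INTO Manufactures VALUES ('Bolt', 'Hammer')")

manufactures = con.execute(
    "SELECT Company, Product FROM Manufactures ORDER BY 1, 2").fetchall()
exports = con.execute("SELECT COUNT(*) FROM Export_V2").fetchone()[0]
print(manufactures, exports)
```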

Exercise S5 Solution:

Variants 1, 5 and 6 are correct; the others are not.

Exercise S6 Solution:

[Figure: the diagram of exercise S6 with the keys filled in. EDPCompany: C#.
SoftwareVersion: S#, V#. ContractFrame: C#, S#, V#. ContractActualisation: C#, S#, V#, A#
(with attribute Date). Employee: E#. Responsible: C#, S#, V#, A#, E#.]

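CREATE TABLE Responsible
(C# INTEGER NOT NULL
, S# INTEGER NOT NULL
, V# INTEGER NOT NULL
, A# INTEGER NOT NULL
, E# INTEGER NOT NULL
-- the "m" side: reference to the contract actualisation
, FOREIGN KEY (C#,S#,V#,A#) REFERENCES ContractActualisation
-- the "1" side: reference to the responsible employee
, FOREIGN KEY (E#) REFERENCES Employee
-- at most one responsible employee per contract actualisation
, UNIQUE (C#,S#,V#,A#) )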
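
The effect of these constraints can be tried out with the following sketch in Python and SQLite. It is an illustration under assumptions: SQLite needs the column names with “#” in double quotes, the parent tables are reduced to their keys, and the sample values and the helper `rejected` are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript('''
    CREATE TABLE Employee ("E#" INTEGER PRIMARY KEY);
    CREATE TABLE ContractActualisation (
        "C#" INTEGER NOT NULL, "S#" INTEGER NOT NULL,
        "V#" INTEGER NOT NULL, "A#" INTEGER NOT NULL,
        PRIMARY KEY ("C#", "S#", "V#", "A#")
    );
    CREATE TABLE Responsible (
        "C#" INTEGER NOT NULL, "S#" INTEGER NOT NULL,
        "V#" INTEGER NOT NULL, "A#" INTEGER NOT NULL,
        "E#" INTEGER NOT NULL,
        FOREIGN KEY ("C#", "S#", "V#", "A#") REFERENCES ContractActualisation,
        FOREIGN KEY ("E#") REFERENCES Employee,
        UNIQUE ("C#", "S#", "V#", "A#")
    );
    INSERT INTO Employee VALUES (1), (2);
    INSERT INTO ContractActualisation VALUES (10, 20, 30, 1);
    INSERT INTO Responsible VALUES (10, 20, 30, 1, 1);
''')

def rejected(sql):
    """Return True if the database refuses the statement with an integrity error."""
    try:
        con.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A second responsible employee for the same actualisation violates UNIQUE.
second_responsible = rejected("INSERT INTO Responsible VALUES (10, 20, 30, 1, 2)")
# A contract actualisation that does not exist violates the foreign key.
unknown_actualisation = rejected("INSERT INTO Responsible VALUES (10, 20, 30, 2, 1)")
print(second_responsible, unknown_actualisation)
```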

Exercise S7 Solution:

The database can capture customers with name and address, bills sent to them as well as
payments and orders to pay. Bills have a date and can be connected with a set of tariff
positions, such that a tariff position can be counted several times (for example 2 times drill
at the dentists). The tariff positions have a description and a price. The payments, that
relate always to one bill, can be part payments, which are entered with date and amount. It
is possible to have several orders to pay per bill, and each one has also a date.

Exercise S8 Solution:

In variant 1 statements (1), (3) and (4) are true.


In variant 2 statements (1), (2), (3), (4) and (6) are true.

Exercise D1 Solution (Monolingual Thesaurus)

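[Figure: entity type Term (TermName); ScopeNotes (TermName, SN#, Note), ID-dependent on
Term; PreferredTerm and CompoundTerm, both ISA-dependent on Term; RT_Category
(RTCategoryName); relationship type RT (TermName1, TermName2, RTCategoryName);
relationship type ComposedOf (Comp_TermName, Comp_Of_TermName), m:m; relationship type
USE_UF (UF_TermName, USE_TermName), m:1; relationship type BT_NT (BT_TermName,
NT_TermName), m:m on PreferredTerm.]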

Exercise D2 Solution (Journeys and Vehicles)

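[Figure: entity types Chauffeur (C#, Name, BirthDay, Tel), VehicleCategory (F#, Description),
Vehicle (V#, F#, Brand, SerialNumber, ConstructionYear, OutOfUse), Mileage (Date,
Kilometers), ID-dependent on Vehicle, and PlannedJourney (P#, FromDate, ToDate,
Destination, Purpose, Price); relationship type HasDriversLicenceFor (C#, F#), m:m between
Chauffeur and VehicleCategory; relationship type Realization (P#, C#, V#, F#) connecting
PlannedJourney (m), Chauffeur (1) and Vehicle (1).]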

Please note that except with chauffeurs most information can be kept in their corresponding
object types (and therefore carry their own historization), due to the “out of use” marker at
“vehicle”.

Exercise D3 Solution (Course Administration)

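[Figure: entity type Course (C#, Title, Description); CourseRealization (C#, R#, Date,
Location, TimeFrom, TimeTo), ID-dependent on Course; Person (P#, Name); ExternalEmployee
(P#, Company) and InternalEmployee (P#, Office), both ISA-dependent on Person;
RequestForCourseParticipation (F#, Justification, RequestDate); relationship type Gives
between CourseRealization (m) and Person (1); relationship type Participates (C#, R#, P#),
m:m between CourseRealization and InternalEmployee; relationship type Application
(C#, P#, F#) connecting Course, InternalEmployee and RequestForCourseParticipation.]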

Note that, as usual, this is not the only possible solution, only a proposal. For instance the
only justification for “External Employee” is the “Company” attribute, not common to all
entities in “Person”.

The way “Application” is designed in this proposal means complete openness for requests,
for instance someone (we do not store who) can apply for several colleagues and several
courses, and if he wants to do so, in a way that for every colleague another course is
requested.

Exercise D4 Solution (SOX Application Controls)

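[Figure: entity types BusinessProcess (P#), Risk (R#), ProcessLevelControl (PLC#),
AutomatedPLC and ManualPLC (both ISA-dependent on ProcessLevelControl),
ApplicationControl (AC#), Application (A#) and ApplicationControlFunction (A#, CF#),
ID-dependent on Application; relationship types IsSubProcessOf (m:1 on BusinessProcess),
IdentifiedIn (Risk m, BusinessProcess 1), Mitigation (m:m between Risk and
ProcessLevelControl), CarriedOutBy (AutomatedPLC m, ApplicationControl 1),
MayBeSupportedBy (m:m between ManualPLC and ApplicationControl) and Uses (m:m between
ApplicationControl and ApplicationControlFunction).]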

Exercise D5 Solution (Household Services)

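[Figure: entity types Customer (C#, Name, Address, TelNr), HouseholdServices (S#, Name,
Description), CustomerRequest (C#, R#, Date, Remarks), ID-dependent on Customer,
ServicePerson (P#, Name, Address, TelNr, BirthYear) and SpecialKnowhow (P#, K#,
Description), ID-dependent on ServicePerson; relationship type ServiceRequest (C#, R#, S#),
m:m between CustomerRequest and HouseholdServices; ServiceRequestTimeSlices (C#, R#, S#,
T#, From, To), ID-dependent on ServiceRequest; relationship type Readiness (S#, P#), m:m
between HouseholdServices and ServicePerson; relationship type ServicePlan (C#, R#, S#, T#,
P#, Definitive), m:m between ServiceRequestTimeSlices and ServicePerson.]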

Appendix A: Bachman Diagrams

With the appearance of DASD (Direct Access Storage Device) came a powerful feature that
had not been there in times of magnetic tapes: the direct navigation to a given storage
address. Records could contain pointers to other records.

[Figure: owner record type Department, connected by a set type to member record type
Employee.]

This is the basic construction paradigm for network database systems, the first identifiable
types of database management systems.

Charles Bachman, pioneering architect of one of those network systems (IDS, Integrated
Data Store, at General Electric), also played an important role at the CODASYL (Conference
On Data Systems Languages), where the navigational interplay of programming languages
with network data storage systems was standardized.

Bachman later (1969) published a diagram language for the owner/member paradigm
(containing one-to-many set types drawn with arrows from owner record type to member
record type), and a few years later (1983) generalized it to something that can be
designated as “Binary Entity Relationship”. In the meantime (1976) Peter P. Chen had
published his famous “The Entity Relationship Model”.

Chen supported the same basic construction elements as presented here (with ISA, ID and
manyfold relations), but Bachman continued to propagate the binary model, which
essentially looks like

E  x ---------- y  F

where per entity e of type E there are y entities of type F, and per entity f of type F there
are x entities of type E.

Here “x” and “y” stand for a mixture of cardinality and existence constraints. In the notation
of Zehnder (1981) x,y ∈ {1, c, m, mc} with the meaning of

x=1 : exactly one
x=c : at most one
x=m : at least one
x=mc : no condition (between zero and many)

In other dialects “x” and “y” can be given as intervals (for instance x=[0..1] for x=c), or as
a combination of little strokes, double strokes or circles over the line or as crow’s-feet, but
the different notations are all equivalent. For example the figure

Department  c ---------- mc  Employee

means that for each employee there is at most one department, and for each department
there is a number between zero and many employees.

CODASYL had only one-to-many, but no many-to-many relationships (in order to reduce
the implementation challenge). Therefore in the following example

Project  mc ---------- mc  Employee

the implicit relationship type of “WorksFor” had to be made explicit as follows

Project  1 ---------- mc  WorksFor  mc ---------- 1  Employee

Since Zehnder discussed similar structure changes in the context of binary entity
relationship for relational databases, we call such a step a “Zehnder Normalization”.

Please note that if you do not take the Zehnder normalization step in the example

Department  c ---------- mc  Employee

and map the two entity types into relational tables and the line between them into a
referential integrity condition (in Employee … FOREIGN KEY (Dept) REFERENCES
Department …), then the cardinality condition “c” forces you to define the foreign key
column “Dept” in “Employee” as NULLable.

This is not recommended practice.
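
What the alternative looks like can be sketched in Python with SQLite. This is an illustration under assumptions, not a prescribed mapping: the relationship table name WorksIn, the text keys and the sample rows are invented for the example. An employee without a department is represented by the absence of a row, so no column has to be NULLable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE Department (Dept TEXT NOT NULL PRIMARY KEY);
    CREATE TABLE Employee (Emp TEXT NOT NULL PRIMARY KEY);
    -- One row per employee who has a department. Emp alone is the key,
    -- so each employee still has at most one department (label "c").
    CREATE TABLE WorksIn (
        Emp  TEXT NOT NULL PRIMARY KEY REFERENCES Employee,
        Dept TEXT NOT NULL REFERENCES Department
    );
    INSERT INTO Department VALUES ('Sales');
    INSERT INTO Employee VALUES ('Ann'), ('Bob');
    INSERT INTO WorksIn VALUES ('Ann', 'Sales');
""")

# Employees without a department are those with no row in WorksIn.
without_dept = [row[0] for row in con.execute(
    "SELECT Emp FROM Employee "
    "WHERE Emp NOT IN (SELECT Emp FROM WorksIn) ORDER BY Emp")]
print(without_dept)
```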

The Bachman diagram language, designed for network database structures, is not very well
suited for relational table data design. Bachman was a very strong opponent of E. F. Codd
and the idea of relational database management systems.

Given a Bachman diagram of 200 entity types with lines between them, it can be
cumbersome to find out whether there is an existence dependency cycle or not (even when
it is already Zehnder-normalized), and it is not easy to find out what the independent entity
types are (these are the most important, in new designs as well as in assessments of given
designs).

The focus on cardinality lets you ask the wrong questions (Is a connection a real relationship
type or a mere cardinality constraint?). If you ask a user “Are you sure that for all E there is
an F?”, he will answer “yes” even if for one E out of a trillion there will be no F.

Beginners in database design tend to connect everything with everything (directly or
indirectly), encouraged by the cardinality question. But the contrary should be emphasized:
the more mutually independent clusters, the better.

The Bachman ER diagrams are semantically very meager compared to Chen ER diagrams.

A last remark here: Types of database management systems (networked, hierarchical,
relational) last longer than database structures. Therefore it does not make sense to design
data structures independently of the target type of DB system (we will “design for
relational” for some time yet).

Appendix B for Specialists: Mathematics of IDNF

Please note that this appendix is for specialists only. If you have never heard about
normalization in relational databases, do not bother. If you have heard about it but do not
understand it, don’t bother either: more than half of the writers of textbooks on
normalization theory do not understand it either.

This appendix gives a sketch of the mathematical background of the presented entity
relationship language and the reason why correct diagrams deliver advantageous relational
table structures. But note that for designers it is not necessary to understand this
background.

Third Normal Form (3NF) is not enough

If R is a relation schema and F a set of functional dependencies over Attr(R), the set of
attributes of R, then (R,F) is in third normal form (3NF) if

∀ X,Y ⊆ Attr(R) [ (X → Y) ∈ F ⇒ ∀A∈Y\X (A prime) ∨ F ⊢ X → Attr(R) ].

The ugly part here is ∀A∈Y\X (A prime), which must be there for the theorem to be true
which says that third normal form is always achievable.
Note that without ∀A∈Y\X (A prime) one would not have to talk about keys, only
superkeys. F ⊢ X → Attr(R) says that X is a superkey (and a key is a minimal superkey).
An attribute is prime (“A prime”) if it is part of a key.

Third normal form cannot be supported by relational database systems. Take the example
R(A,B,C) with F={AB → C, C → B}. Here AB → C can be supported by declaring
UNIQUE(A,B) in the table definition of R(A,B,C), but C → B cannot be supported.

The situation is quite different with Boyce-Codd Normal Form (BCNF). (R,F) is in BCNF if

∀ X,Y ⊆ Attr(R) [ (X → Y) ∈ F ⇒ Y⊆X ∨ F ⊢ X → Attr(R) ].

If Y⊆X, then X → Y is trivial and always valid, and if F ⊢ X → Attr(R), then you can
declare UNIQUE(attributes of X), so the database system can guarantee the validity of
BCNF.

So we observe that the database system can guarantee the validity of BCNF (react with an
SQL-error message if you try to violate BCNF), but not so for 3NF. Therefore BCNF should
be the goal of design, not 3NF. The proposed mapping of correct ER diagrams into relational
tables delivers directly BCNF.
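
The example R(A,B,C) with F={AB → C, C → B} can be checked mechanically. The following Python sketch is not part of the original argument; the function names `closure`, `candidate_keys`, `is_3nf` and `is_bcnf` are made up for the illustration, and the two checks follow the definitions above literally, that is, they look only at the dependencies listed in F.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            is_superkey = closure(candidate, fds) == schema
            if is_superkey and not any(k <= candidate for k in keys):
                keys.append(candidate)
    return keys

def is_bcnf(schema, fds):
    """Every dependency in fds is trivial or has a superkey on the left."""
    return all(rhs <= lhs or closure(lhs, fds) == schema for lhs, rhs in fds)

def is_3nf(schema, fds):
    """Every dependency has a superkey on the left or only prime attributes in Y minus X."""
    prime = set().union(*candidate_keys(schema, fds))
    return all(closure(lhs, fds) == schema or (rhs - lhs) <= prime
               for lhs, rhs in fds)

R = {"A", "B", "C"}
F = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]

print(candidate_keys(R, F))
print("3NF:", is_3nf(R, F), "BCNF:", is_bcnf(R, F))
```

The schema is in 3NF only because B is prime; C → B itself has no superkey on the left, which is exactly the dependency that no UNIQUE declaration can enforce.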

Even more important is that the designer does not have to learn normalization theory, which
is difficult to understand (there have been many published corrections to previously
published, erroneous normalization algorithms).

Referential Integrity Guidelines

Inclusion dependencies are (generalized) referential integrity constraints deprived of
operational aspects. If we mix functional dependencies with inclusion dependencies, the
implication problem becomes undecidable. We consider an example,

{(R(A,B,C),{}), (S(D,E),{D → E}), {R[A,B] ⊆ S[D,E]}}.

It is clear that

{D → E} ∪ {R[A,B] ⊆ S[D,E]} ⊨ A → B,

which means that in every database where the functional dependency D → E and the
inclusion dependency R[A,B] ⊆ S[D,E] are both valid, also A → B is valid.
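
This interplay can be observed in a running database. The sketch below uses Python with SQLite and rests on assumptions: the column types and sample values are invented, D → E is expressed by making D the primary key of S, and the additional UNIQUE (D, E) is there only because SQLite requires the referenced column list to be declared unique.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    -- D -> E holds because D is the key of S.
    CREATE TABLE S (D INTEGER PRIMARY KEY, E INTEGER NOT NULL, UNIQUE (D, E));
    -- R[A,B] is included in S[D,E]; no functional dependency is declared on R.
    CREATE TABLE R (A INTEGER NOT NULL, B INTEGER NOT NULL, C INTEGER NOT NULL,
                    FOREIGN KEY (A, B) REFERENCES S (D, E));
    INSERT INTO S VALUES (1, 10);
    INSERT INTO R VALUES (1, 10, 100);
    INSERT INTO R VALUES (1, 10, 200);
""")

try:
    # Same A with a different B: this row would violate A -> B.
    con.execute("INSERT INTO R VALUES (1, 20, 300)")
    violation_accepted = True
except sqlite3.IntegrityError:
    violation_accepted = False

print(violation_accepted)
```

Although A → B was never declared, the database rejects the row: the dependency is hidden in the combination of the two declared constraints.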

But
F ∪ IND ⊨ ρ (ρ a functional or inclusion dependency)

is not decidable, that is, there cannot exist any algorithm which would automatically find
out whether given dependencies follow from a given set of functional and inclusion
dependencies.

There is no way to automatically find out whether any given database structure with
functional and inclusion dependencies is normalized or not. This has been known since 1984.

A way out of this dilemma is the definition of a further normal form that handles the mixed
case of functional and inclusion dependencies.

A set IND of inclusion dependencies is noncircular, if for all sequences

R1[X1] ⊆ R2[Y2] , R2[X2] ⊆ R3[Y3] , ...... , Rn-1[Xn-1] ⊆ Rn[Yn]

of inclusion dependencies from IND, we have R1 ≠ Rn .

The inclusion dependency R[X] ⊆ S[Y] between (R,F) and (S,G), where F and G are
corresponding sets of functional dependencies, is keybased if Y is a key of (S,G). Here the
minimality of the key is important.

The database schema {(R1,F1) , (R2,F2) , ......., (Rn,Fn), IND} is in inclusion dependency
normal form (IDNF), if IND is noncircular, all inclusion dependencies from IND are keybased
and all (Rj,Fj) are in BCNF.
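
The first of the three conditions, noncircularity, amounts to the absence of cycles in the directed graph of the inclusion dependencies. The following Python sketch checks just this one condition (not keybasedness and not BCNF). The function name `is_noncircular` and the representation of an inclusion dependency R[X] ⊆ S[Y] as a pair (R, S) are assumptions made for the illustration; the sample data uses the table names of exercise S6.

```python
def is_noncircular(inds):
    """Return True if the inclusion dependencies contain no cycle.

    inds is an iterable of (R, S) pairs, one per dependency R[X] ⊆ S[Y].
    A dependency from a table to itself counts as a cycle.
    """
    graph = {}
    for child, parent in inds:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    def has_cycle(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True            # reached a table on the current path
            if colour[nxt] == WHITE and has_cycle(nxt):
                return True
        colour[node] = BLACK
        return False

    return not any(colour[n] == WHITE and has_cycle(n) for n in graph)

s6 = [
    ("Responsible", "ContractActualisation"),
    ("Responsible", "Employee"),
    ("ContractActualisation", "ContractFrame"),
    ("ContractFrame", "EDPCompany"),
    ("ContractFrame", "SoftwareVersion"),
]
print(is_noncircular(s6))
print(is_noncircular(s6 + [("EDPCompany", "Responsible")]))
```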

In IDNF databases we are on the safe side, inclusion and functional dependencies are
independent from each other, more precisely we have

F ∪ IND ⊨ X → Y ⇒ F ⊢ X → Y

and

F ∪ IND ⊨ Ri[X] ⊆ Rj[Y] ⇒ IND ⊢ Ri[X] ⊆ Rj[Y].



In other words, all valid functional dependencies are controlled by the given set F of
functional dependencies alone, and all inclusion dependencies are controlled by the given set
IND of inclusion dependencies. It is decidable what dependencies follow from the given
ones, and there are no hidden dependencies following from the mixture that might destroy
normal forms (Mannila, Räihä 1986 and 1992 with BCNF).

The proposed mapping of correct ER diagrams into relational tables delivers directly IDNF.

Our inclusion dependencies are not only keybased, they are primary keybased. Of course we
assume here that a designer does not declare a proper subset of the primary key as another
key, but chances are low that he declares in a CREATE TABLE definition something like

……, PRIMARY KEY (A, B), UNIQUE (A), ……

which would unfortunately be accepted by SQL. But this would not be a possible result of
our recommended mapping of correct diagrams into relational tables.
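
That such a declaration is accepted can be seen in the following sketch in Python with SQLite (one system as an example; the table name T and the values are invented). The declared primary key (A, B) is then only a superkey, because A alone is already unique.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Accepted without complaint, although {A} is a proper subset of the primary key.
con.execute("""
    CREATE TABLE T (A INTEGER NOT NULL, B INTEGER NOT NULL,
                    PRIMARY KEY (A, B), UNIQUE (A))
""")
con.execute("INSERT INTO T VALUES (1, 1)")

try:
    # Differs from the first row in B, so PRIMARY KEY (A, B) alone would allow it.
    con.execute("INSERT INTO T VALUES (1, 2)")
    second_row_accepted = True
except sqlite3.IntegrityError:
    second_row_accepted = False

print(second_row_accepted)
```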

There is another aspect of referential integrity guidelines, which is discussed in the
context of NULLs in the next section.

NULL Avoidance

Relational databases are the only systems in information technology that have a solid
mathematical foundation. However only the NULL-free version of the relational model has a
generally accepted mathematical model.

One of the many unfortunate consequences of the lacking background is that certain
difficulties arising with the presence of NULLs appeared only recently in the literature (for
example Galindo-Legaria, Rosenthal 1997).

A striking example is the fact that the left outer (theta) join is no longer associative in the
presence of NULLs, symbolically

(r1 ⋊ r2) ⋊ r3 ≠ r1 ⋊ (r2 ⋊ r3).

The only comfort in this scandalous situation is that vendors of chips also cannot guarantee
the validity of basic laws in number calculations.

Anyway, NULLs and corresponding three-valued logics are too difficult for theoreticians.
Proposals have appeared in the literature (on how to overcome some difficulties confronting
the user), but these have proven logically unfeasible.

NULLs have also been too difficult for the first implementers of SQL. The fact that the
EXISTS quantifier behaves in two-valued fashion in the midst of three-valued calculations,
which is still disturbing for many users, is a clear semantic mistake that was later declared
to be a feature when it was too late to correct it.

And please, if it is too difficult for theoreticians and system implementers, how then should
an end user understand why his grandma is not on the result list of the query

SELECT * FROM OldPeoplesHomeUnderTheSweetTree
WHERE Age <= 80 OR Age > 80 ?

(The solution is that his grandma has a NULL in the Age column because she refused to
announce her age.)
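
The effect is easy to reproduce. The following sketch uses Python with SQLite; the table name Residents and the sample rows are invented for the illustration. For the row with a NULL age both comparisons evaluate to “unknown”, so the row does not satisfy the WHERE clause.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Residents (Name TEXT PRIMARY KEY, Age INTEGER);
    INSERT INTO Residents VALUES ('Grandpa', 85), ('Aunt', 79), ('Grandma', NULL);
""")

# Looks like a condition that every row must satisfy, but it is not.
found = [row[0] for row in con.execute(
    "SELECT Name FROM Residents WHERE Age <= 80 OR Age > 80 ORDER BY Name")]

# The missing row only shows up when NULL is asked for explicitly.
missing = [row[0] for row in con.execute(
    "SELECT Name FROM Residents WHERE Age IS NULL")]

print(found, missing)
```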

Now we go back to the abovementioned referential integrity guidelines in the context of
NULLs. Imagine a foreign key FK(A,B) with no MATCH option defined, which is the
default. Then assume a row is inserted with <A,B>=<a,NULL>, which is accepted by the
system even if there is no ‘a’ value in the active domain of the column corresponding to “A”
in the parent table.

A few days later the missing value becomes known and there is an attempt to update
<a,NULL> to <a,b>. Now the system checks the referential integrity constraint. If there is
no <a,b> value combination in the parent table, things get complicated. Is the update to
<a,b> wrong or was already the insertion of <a,NULL> wrong or unrealistic? The user
has to go back a few days and the system must provide the ability to do so.
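
The scenario can be replayed in a small sketch with Python and SQLite. It rests on assumptions: the table names Parent and Child and the values are invented, and SQLite is used as one example of a system that does not check a foreign key as long as one of its columns is NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE Parent (A TEXT NOT NULL, B TEXT NOT NULL, PRIMARY KEY (A, B));
    CREATE TABLE Child (Id INTEGER PRIMARY KEY, A TEXT, B TEXT,
                        FOREIGN KEY (A, B) REFERENCES Parent (A, B));
    INSERT INTO Parent VALUES ('x', 'y');
    -- Accepted, although no Parent row has A = 'a'.
    INSERT INTO Child VALUES (1, 'a', NULL);
""")

try:
    # A few days later the missing value becomes known.
    con.execute("UPDATE Child SET B = 'b' WHERE Id = 1")
    update_accepted = True
except sqlite3.IntegrityError:
    # Only now does the system notice that <a,b> has no parent row.
    update_accepted = False

print(update_accepted)
```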

Happily we have learned how to avoid all this. The presented version of the entity
relationship language with its definition of correct diagrams helps avoid the use of NULLs,
primarily at the dangerous places like in keys or foreign keys. Construction patterns like the
accordion principle avoid NULLs generally. In our approach the missing information is
mapped into the missing of a row of a table, not into a NULL.

Semantics Support

We do not only have to design data but also operations and constraints, which have their
counterpart in programs or administrative rules. Since it is impossible to build a graphical
design language which covers the full logic of operations and constraints, a borderline is
needed.

This version of the entity relationship language with the definition of correct diagrams defines
a very clear borderline. All semantics that can be expressed can be guaranteed by the
database management system. The programmer only has to occupy himself with constraints
he writes down separately, not with constraints contained in the semantics of the graphical
design language. Note that this is not the case with Bachman diagrams or other dialects of
binary entity relationship.

Information versus Data Representation

Perhaps the most important advantage of the use of the presented approach with correct ER
diagrams is the preference given to information representation normalization over data
representation normalization. This train of thought was first formulated by Victor Markowitz
1987 in Haifa (as far as I know). Please consider the following comparison.

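[Figure: comparison of two design paths. Conventional path: conceptual design, Zehnder
normalisation, different kinds of mappings to a relational design, relational normalisation,
and BCNF normalized relations; at two points the question “Lucky?” decides whether to
continue (Yes) or to go back to other conceptual designs or other relations (No). Proposed
path: correct ER diagram design, a mapping clearly defined in all cases, IDNF normalized
relations.]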

Most textbooks and courses in data design present you a long way from conceptual
diagrams to relational tables. In the widespread case of the use of Bachman diagrams there
is a first step “Zehnder normalization” eliminating many to many connections carrying
mutual existence constraints.

Then there is usually some discussion of mapping possibilities into relational tables,
presented as case-by-case pragmatics.

Then comes the huge and difficult-to-understand relational normalization, whose result
guarantees only 3NF. If you are lucky, you hit BCNF, and if you are lucky again,
your referential integrity constraints are properly integrated. If not, you go back a few steps.

Even if you have been lucky two times, you might have arrived at a design of relational
tables that are structured very differently compared to the conceptual design. This is not
very helpful, because there is always a translation needed from the conceptual design (that
the business analyst knows by heart) to the relational tables design (that the programmer
knows by heart). Sometimes it is a great advantage if also the programmer understands the
business logic.

Compared to the complications above, with correct ER diagrams we get directly to the wanted
goal in one step that is well defined in all cases, without a structure gap.

The reason for this big advantage is the fact that the definition of the correct ER diagram is
a narrow corset. One has to shape the business information conception until it fits, and then
come all the benefits (my colleagues and I have never seen a business case with data
requirements that could not be mapped into a correct diagram). This is what Markowitz
called information representation normalization.
