WHITE PAPER
http://www.datamgmt.com
White Paper - Process Neutral Data Modelling
Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
    The Example Company
    The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
    Assumptions
    Requirements
The Data Model
    Major Entities
    Type Tables
    Band Tables
    Property Tables
    Event Tables
    Link Tables
    Segment Tables
The Sub-Model
    History Tables
    Occurrences and Transactions
Implementation Issues
    The 'Party' Special Case
    Partitioning
    Data Cleansing
    Null Values
    Indexing Strategy
    Enforcing Referential Integrity
    Data Insert versus Data Update
    Row versus Set Based Loading in ETL
    Disk Space Utilisation
    Implementation Effort
Data Commutativity
Data Model Explosion and Compression
    How big does the data model get?
    Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
    General Conventions
    Table Conventions
    Column Conventions
    Index Conventions
    Standard Table Constructs
    Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
    Sales Regions
    Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants
Synopsis
This paper describes in detail the process for creating an enterprise data warehouse physical
data model that is less susceptible to change. Change is one of the largest on-going costs in
a data warehouse and therefore reducing change reduces the total cost of ownership of the
system. This is achieved by removing business process specific data and concentrating on
core business information.
The white paper examines why data-modelling style is important and how issues arise when
using a data model for reporting. It discusses a number of techniques and proposes a specific
solution. The techniques should be considered when building a data warehouse solution even
when an organisation decides against using the specific solution.
This paper is intended for a technical audience and project managers involved with the
technical aspects of a data warehouse project.
Intended Audience
Reader                 Recommended Reading
Executive              Synopsis
Business Users         Synopsis
IT Management          Synopsis
IT Strategy            Entire Document
IT Project Management  Entire Document
IT Developers          Entire Document
Introduction
Commissioning a data warehouse system is a major undertaking. Organisations will invest
significant capital in the development of the system. The data model is always a major
consideration and many projects will invest a significant part of the budget on developing and
re-working the initial data model.
Unfortunately projects also often fail to look at the maintenance costs of the data model that
they develop. A data model that is fit for purpose when developed will rapidly become an
expensive overhead if it needs to change when the source systems change. The cost
involved is not only in the change to the data model but also in the changes to the ETL that
feed the data model.
This problem is exacerbated by the fact that changes to the data model may be done in an
inconsistent way from the original design approach. The data model loses transparency and
becomes even more difficult to maintain.
For many large data warehouse solutions it is not uncommon to have a resource permanently
assigned to maintaining the data model and several more resources assigned to managing
the change in the associated ETL within a short time of going live.
By understanding the problem and using techniques imported from other areas of systems and software development, as well as change management techniques, it is possible to define a method that will greatly reduce this overhead.
This white paper sets out an example of the issues from which to develop a statement of
requirements for the data model and then demonstrates a number of techniques which, when
used together, can address those requirements in a sustainable way.
The Problem
Data modelling is the process of defining the database structures in which to hold information. To understand the Process Neutral Data Modelling approach, this paper first looks at why these database structures have such an impact on the data warehouse.
In order to demonstrate the issues with creating a data model for a data warehouse, more experienced readers are asked to bear with the necessarily simplistic examples that follow. [1]

Figure 1 - Initial Operational System Data Model [2]
This simple data model describes both the widget and the cabinet and provides the
current combinations. It does not provide any historical context: “What was the
previous configuration and when was it changed?”
Historical data can be recorded by simply adding start date and end date to each of the main tables. This provides the ability to report on the historical configuration. [3] In order to facilitate this a separate reporting environment would be set up, because retaining history in the operational system would unacceptably reduce the operational system's performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given date, the query has to allow for the required date being between the start date and the end date of the record in each of the tables. The extra complexity slows the execution of the query (a sketch follows below).
• The volume of data stored has also increased. The storage of dates has a minor impact on the size of each row but this is small when compared to the number of additional rows that need to be stored. [4]

• Data has to be moved from the operational system to the reporting system via an extract, transform and load (ETL) process. This process has to extract the data from the operational system, compare the records to the current records in the reporting system to determine if there are any changes and, if so, make the required adjustments to the existing record (e.g. updating the end date) and insert the new record. Already the process is more complex and time consuming than simply copying the data across. [5]

[1] Data models in this document are illustrative and therefore should be viewed as suitable for making specific points rather than complete production-quality solutions. Some errors exist to explicitly demonstrate certain issues.

[2] There are several conventions for data modelling. In this and subsequent diagrams the link with a 1 and ∞ represents a one-to-many relationship where the '1' record is a primary key field and the '∞' represents the foreign key field.

[3] Note that the 'WIDGET_LOCATIONS' table requires an additional field called 'INSTALL_SEQUENCE' to allow for the case where a widget is re-installed in a cabinet.
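To make the first consequence concrete, below is a minimal sketch of such a point-in-time query. WIDGET_LOCATIONS and INSTALL_SEQUENCE are named in footnote [3]; the WIDGETS and CABINETS tables and their column names are illustrative assumptions.

    -- Hypothetical: which widget was in which cabinet on a given date?
    -- Every history-bearing table contributes its own date-range predicate,
    -- which is the extra complexity described in the first bullet above.
    SELECT w.SERIAL_NUMBER,
           c.CABINET_NAME
    FROM   WIDGETS          w,
           WIDGET_LOCATIONS wl,
           CABINETS         c
    WHERE  wl.WIDGET_ID  = w.WIDGET_ID
    AND    wl.CABINET_ID = c.CABINET_ID
    AND    DATE '2007-06-30' BETWEEN wl.START_DATE AND wl.END_DATE;

Without the history columns the same report would be a simple join; each table that gains start and end dates adds another predicate of this form.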
When the reporting system is built, it accurately reflects the current business
processes, operational systems and provides historical data. From a systems
management perspective there is now an additional database, and a series of ETL or
interface scripts that have to be run reliably every day.
The systems architecture may be further enhanced so that the reporting system becomes a data warehouse and the users make their queries on data marts, or sets of tables where the data has been re-structured in order to simplify the users' query environment. The 'data marts' typically use star-schema or snowflake-schema data modelling techniques or tool specific storage strategies. [6] This adds an additional layer of ETL to move between the data warehouse and the data mart.
However the company doesn't stop here. The product development team create a new type of widget. This new widget allows amber lamps and can optionally be mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that the new OLTP application be more flexible for other future developments.
[4] Assume that everything remains the same except that widgets are moved around (i.e. there are no new widgets and no new cabinet/customer combination); then the WIDGET_LOCATIONS table grows in direct proportion to the number of changes. If each widget were modified in some way once a month then the reporting system table would be twelve times bigger than the operational system after one year, and this before any other change is handled.

[5] Additional functionality such as data cleansing will also impact the complexity of ETL and affect performance.

[6] This is accepted good practice and the design and implementation of data marts is outside the scope of this paper.
These business process changes result in a new data model for the operational system.
The reporting system is also now a live system with a large amount of historical information. It too can be re-designed. The operational system will be implemented to meet the business requirements and timescales regardless of whether the reporting system is ready. It also may not be possible to create the history required for the new data model when it is changed. [7]
If a data mart is built from the data warehouse there are two impacts: firstly, the data mart model will need to be changed to exploit the new data; secondly, the change to the data warehouse model will require the data mart ETL to be modified regardless of any changes to the data mart data model.
The example company does not stop here however, as senior management decide to acquire a smaller competitor. The new subsidiary has its own systems that reflect its own business processes. The data warehouse was built with a promise of providing integrated management reporting, so there is an expectation that the data from the new source system will be quickly and seamlessly integrated into the data warehouse. From a technical perspective this could present issues around mapping the new source system data model to the existing data warehouse data model, critical information data types [8], duplication of keys [9], etc. that all cause problems with the integration of data and therefore slow down the processing.
Within a few short iterations of change it is possible to see the dramatic impact on the
data warehouse and that the system is likely to run into issues.
[7] A common example of this is an organisation that captures the fact that an individual is married or not. Later the organisation decided to capture the name of the partner if someone is married. It is not possible to create the historical information systemically, so for a period of time the system has to support the continued use of the marital status and then possibly run other activities such as outbound calling to complete the missing historical data.

[8] The example database assumed that serial number was numeric and used it as a primary key, but what happens if the acquired company uses alphanumeric serial numbers?

[9] If both companies use numbers starting from 1 for their customer ID then there will be two customers who have the same 'unique' ID, and customers that have two 'unique' IDs.
• A global ERP vendor supplies a system with over five thousand database objects and typically makes a major release every two years, a 'dot' release every six months, and has numerous patches and fixes in between each major release. This type of ERP system is in use in nearly every major company and the data is a critical source to most data warehouses.

• A global food and drink manufacturer that came into existence as a result of numerous mergers and acquisitions, and also divested some assets, found itself with one hundred and thirty-seven general ledger instances in ten countries with seventeen different ERP packages. Even where the ERP packages were the same they were not necessarily using the same version of the package. The business intelligence requirement was for a single data warehouse and a single data model.
Obviously these issues cannot be fixed just by creating the correct data model for the data warehouse, but the objective of the data model design should be twofold: [10]

• To ensure that all the required data can be stored effectively in the data warehouse.

• To ensure that the design of the data model does not impose cost and, where possible, actively reduces the cost of change on the system.
[10] Data Management & Warehousing have published a number of other white papers that are available at http://www.datamgmt.com and look at other aspects of data warehousing and address some of these issues. See Further Reading at the end of this document for more details.
Figure 4 - The Sales Funnel

The most common solutions that are created as a result either add 'flag' or 'indicator' columns to the customer table to represent each category, or create multiple tables for the different categories required and repeat the data in each of the tables.
This example clearly demonstrates that the business process is being embedded into the
data model. The current business process definition(s) of customer are defining how the data
model is created. What has been forgotten is that these ‘customers’ exist outside the
organisation and it is their interaction with different parts of the organisation that defines their
status of being a customer, supplier, etc. In legal documents there is the concept of a ‘party’
where a party is a person or group of persons that compose a single entity that can be identified as one for the purposes of the law. [11] This definition is one that should be borrowed and used in the data model.
If users query a data mart that is loaded with data extracted from the transaction repository, and data marts are built for a specific team or function that only requires one definition of the data, then the current definition can be used to build that data mart [12] and different definitions used for other departments.
[11] http://en.wikipedia.org/wiki/Party_(law)

[12] This also allows flexibility, as, when business processes change, it is possible at a cost to change the rules by which data is extracted. The cost of change is relatively much lower than trying to rebuild the data warehouse and data mart with a new definition.
• Isn’t one of the purposes of building a data warehouse to have a single version of the
truth?
Yes. There is a single version of the truth in the data warehouse and this single
version is perpetuated into the data marts, the difference is that the information in the
data mart is qualified. Asking the question “How many customers do we have?”
should get the answer “Customer Services have X active service contract customers”
and not the answer “X” without any further qualification.
It might be argued that there are too many differences to put all individuals and organisations
in a single table; this and other issues will be discussed later in the paper.
Assumptions
1. The data model is for use in the architectural component called the transaction repository or data warehouse. [13]
2. As the data model is used in the data warehouse it will not be a place where users go to query the data; instead users will query separate dependent data marts.
3. As the data model is used in the data warehouse data will be extracted from it
to populate the data marts by ETL tools.
4. As the data model is used in the data warehouse the data will be loaded into it
from the source systems by ETL tools.
5. Direct updates (i.e. not through formally released ETL processes) will be
prohibited; instead a separate application or applications will exist as a
surrogate source.
6. The data model will not be used in a 'mixed mode' where some parts use one data modelling convention and other parts use another. (This is generally bad practice with any modelling technique but is often the outcome where the responsibility for data modelling is distributed or re-assigned over time.)
Requirements
1. The data model will work on any standard business intelligence relational database. [14] This is to ensure that it can be deployed on any current platform and if necessary re-deployed on a future platform.
2. The data model will be process neutral i.e. it will not reflect current business
processes, practices or dependencies but instead will store the data items and
relationships as defined by their use at the point in time when the information is
acquired.
3. The data model will use a design pattern [15], i.e. a general reusable solution to a commonly occurring problem. A design pattern is not a finished design but a description or template for how to solve a problem that can be used in many different situations.
[13] For further information on Transaction Repositories see the Data Management & Warehousing white paper "An Overview Architecture For Enterprise Data Warehouses".

[14] A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle, Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies compliance with at least the SQL92 standard.

[15] http://en.wikipedia.org/wiki/Software_design_pattern
4. Convention over configuration [16]: This is a software design paradigm which seeks to decrease the number of decisions that developers need to make, gaining simplicity, but not necessarily losing flexibility. It can be applied successfully to data modelling and reduces the number of decisions of the data modeller by ensuring that tables and columns use a standard naming convention and are populated and queried in a consistent fashion. This also has a significant impact on the efforts of an ETL developer.
5. The design should also follow the DRY (Don't Repeat Yourself) principle. This is a process philosophy aimed at reducing duplication. The philosophy emphasizes that information should not be duplicated, because duplication increases the difficulty of change, may decrease clarity, and leads to opportunities for inconsistency. [17]
6. The data model should be significantly static over a long period of time, i.e. there should not be a need to add or modify tables on a regular basis. In this case there is a difference between 'designed' and 'implemented': it is possible to have designed a table but not to implement it until it is actually required. This does not affect the static nature of the data model, as the placeholder already exists.
7. The data model should store data at the lowest possible level [18] and avoid the storage of aggregates.
8. The data model should support the best use of platform specific features whilst not compromising the design. [19]
9. The data model should be time variant. [20]

10. The data model should act as a communication tool to aid the refinement of requirements and an explanation of possibilities.
[16] For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework (http://www.rubyonrails.org/) makes extensive use of this principle.

[17] DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer. They apply it quite broadly to include "database schemas, test plans, the build system, even documentation." When the DRY principle is applied successfully, a modification of any single element of a system does not change other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync. (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database normalisation, but database normalisation is one method for ensuring 'dryness'.

[18] This is the origin of the term 'Transaction Repository' rather than 'Data Warehouse' in Data Management & Warehousing documentation. The transaction repository stores the lowest level of data that is practical and/or available. (See An Overview Architecture for Enterprise Data Warehouses.)

[19] This turns out to be both simple and very effective. For Oracle the most common features that need support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for inserts over updates due to their internal storage mechanisms. For all databases there is variation in indexing strategies. These and other features should be easily accommodated.

[20] Also known as temporal. Most data warehouses are not linearly time variant but quantum time variant. If a status field is updated three times in a day and the data warehouse reflects all changes then it is linearly time-variant. If a data warehouse holds the first and last values only, because a batch process loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day. Quantum time variant solutions can only resolve data to the level of the quantum unit of measure.
Major Entities
Party is, as described in the customer paradigm section above, an example of a type of
table within the Process Neutral Data Modelling method known as a ‘Major Entity’.
These are tables that deliver the placeholders for all major subject areas of the data
model and around which other information is grouped. Each business transaction will
relate to a number of major entities. Some major entities are global i.e. they apply to all
types of organisation (e.g. Calendar) and there are a number of major entities that are
industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very
unusual for an organisation to need a major entity that was not industry wide. Below is
a list of some of the most common:
• Calendar
Every data warehouse will need a calendar. It should always contain data to the day level and never to parts of the day. In some cases there is a need to support sub-types of calendar for non-Gregorian calendars. [21]
• Party
Every organisation will have dealings between parties. This will normally include three major sub-types: individuals, organisations (any formal organisation such as a company, charity, trust, partnership, etc.) and organisational units (the components within an organisation, including the system owner's organisation).
• Geography
The information about where. This is normally sub-typed into two components, address and location. Address information is often limited to postal addresses [22] whilst location is normally described by the longitude and latitude via GPS co-ordinates. Other specialist geographic models exist that may need to be taken into account. [23]
• Account
Every customer will have at least one account if financial transactions are
involved (even those organisations that do not think they currently use the
concept of account will do so as accounting systems always have the concept
of a customer with one or more accounts).
[21] See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the Muslim Year 1429 and the Jewish Year 5768.

[22] Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).

[23] Network Rail in the UK use an Engineers Line Reference, which is based on a linear reference model and refers to a known distance from a fixed point on a track. In Switzerland they have an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).
• Electronic_Address
Any electronic address such as a telephone number, email address, web
address, IP address etc. This is normally sub-typed by the categories used.
• Component
A physical object that cannot be uniquely identified by a serial number but has a part number and is used in the make-up of either an asset or a product service. In the example company there was no particular record of the serial numbers of the lamps; however, they would all have had a part number that described the type of lamp to be used.
• Channel
A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).
• Campaign
A marketing exercise that is designed to promote the organisation, e.g. the
running of a series of adverts on the television.
• Campaign Activities
The running of a specific advert as part of a larger campaign.
• Contract
Depending on the type of business the relationship between the organisation
and its supplier or its customer may require the concept of a contract as well as
that of an account.
This list is not comprehensive, but if an organisation can effectively describe its major entities and combine this information with the interactions between them (the occurrences or transactions) then it has the basis of a very successful data warehouse.
Major Entities can have any meaningful name provided it is not a reserved word in the
database or (as will be seen below) a reserved word within the design pattern of
Process Neutral Data Modelling.
Some readers, who are familiar with the concepts of star schemas and data marts, will
also be aware that these are very close to the basic dimensions that most data marts
use. This should come as no surprise as these are the major data items of any
business regardless of their business processes or of their specific industry sector and
a data mart is only a simplification of the data presented for the user. This effect is
called “natural star schemas” and will be explored in more detail later.
Lifetime Value
The next decision is which columns (attributes) should be included in the table. Much like the processes involved in normalising a database [24], the objective is to minimise duplication of data, and there is also a requirement to minimise updates. To this end the attributes that are included should therefore have 'lifetime value', i.e. they should remain constant once they have been inserted into the database. This means that variable data needs to be handled elsewhere.
Calendar:
Lifetime Value Attributes: Date, Public Holiday Flag

Geography:
Lifetime Value Attributes: Address Line 1, Address Line 2, City, Postcode [25], County, Country
Non-Lifetime Value Attributes: Population

Party (Individuals):
Lifetime Value Attributes: Forename, Surname [26], Date of Birth, Date of Death, Gender [27], State ID Number
Non-Lifetime Value Attributes: Marital Status, Number of Children, Income

Party (Organisations):
Lifetime Value Attributes: Name, Start Date, End Date, State ID Number
Non-Lifetime Value Attributes: Number of Employees, Turnover, Shares Issued

Account:
Lifetime Value Attributes: Account Number, Start Date, End Date
Non-Lifetime Value Attributes: Balance
Other than this lifetime value requirement for columns, every table must comply with the general rules for any table. For example every table will have a key column that uses the table short name made up of six characters and the suffix _DWK [28], a TIMESTAMP column and an ORIGIN column.
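As a minimal sketch, the PARTIES major entity might therefore be declared as follows. The PARTIE short name and the _DWK, TIMESTAMP and ORIGIN columns follow the conventions just described; the data types and the choice of lifetime value attributes are assumptions based on the list above.

    CREATE TABLE PARTIES (
        PARTIE_DWK      INTEGER     NOT NULL,  -- six character short name + _DWK
        FORENAME        VARCHAR(50),           -- lifetime value attributes only;
        SURNAME         VARCHAR(50),           -- variable data such as marital
        DATE_OF_BIRTH   DATE,                  -- status is handled elsewhere
        DATE_OF_DEATH   DATE,
        GENDER          CHAR(1),
        STATE_ID_NUMBER VARCHAR(20),
        TIMESTAMP       TIMESTAMP   NOT NULL,  -- standard audit column (may need
                                               -- quoting as it is a reserved word)
        ORIGIN          VARCHAR(30) NOT NULL,  -- standard source system column
        PRIMARY KEY (PARTIE_DWK)
    );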
[24] http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.

[25] This may occasionally be a special case as postal services do, from time to time, change postal codes that are normally static.

[26] There is a specific special case that deals with the change of name for married women that will be dealt with in the section 'The Party Special Case' later.

[27] One insurance company had to deal with updatable genders due to the fact that underwriting rules require assessment based on birth gender and not gender as a result of re-assignment surgery. Therefore for marketing it had to handle 'current' gender and for underwriting it had to deal with 'birth' gender.

[28] See the data modelling rules appendix for how this name is created.
Type Tables
There is often a need to categorise information into discrete sets of values. The valid set of categories will probably change over time and therefore each category record also needs to have lifetime value. Examples of this categorisation have already occurred with some of the major entities:
To support this and to comply with the requirement for convention over configuration all
_TYPES tables of this format have a standard data model as follows:
• The table will have the same name as the major entity but with the suffix
_TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
• The table will always have a key column that uses the six character short code
and the _DWK suffix.
• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and a _END_DATE column.
This is a type table in its entirety. If a table needs more information (i.e. columns) then
this is not a _TYPES table and must not have the _TYPES extension, as it does not
comply with the rules for a _TYPES table.
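Applying these rules to PARTY_TYPES gives the following sketch; the PARTYT short code, the column prefixes and the data types are assumptions, while the column roles are exactly those listed above.

    CREATE TABLE PARTY_TYPES (
        PARTYT_DWK       INTEGER      NOT NULL,  -- six character short code + _DWK
        PARTY_TYPE       VARCHAR(50)  NOT NULL,  -- the type name
        PARTY_TYPE_DESC  VARCHAR(255),           -- description of the type
        PARTY_TYPE_GROUP VARCHAR(50),            -- groups certain types together
        PARTY_START_DATE DATE         NOT NULL,  -- mandatory, see footnote [29]
        PARTY_END_DATE   DATE,
        PRIMARY KEY (PARTYT_DWK)
    );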
PARTY_TYPES
The start date has little initial value in this context, although it is a mandatory field [29] and therefore has to be completed with a date before the earliest party in this example. Legal types of organisation do change over time and so it is possible that the start and end dates of these will become significant.
These types do not describe the type of role that the party is performing (i.e. Customer, Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the role comes later. The type and group columns are repeated for INDIVIDUAL, as there is no hierarchy of information for this value but the field is mandatory.
[29] Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required information. In order to be consistent they therefore have to be mandatory for all _TYPES tables.
GEOGRAPHY_TYPES
The start date in this context has little initial value, although it is a mandatory field and therefore has to be completed with a date.

These types do not describe the type of role that the geography is performing (i.e. home address, work address, etc.); they describe the type of the geography (postal address, point location, etc.).

The type and group columns are repeated for both values, as there is no hierarchy of information for them.
CALENDAR_TYPES
The convention over configuration design aspect allows for this table, however it is
rarely needed and can therefore be omitted. This is an example where a table can be
described as designed (i.e. it is known exactly what it looks like) but not implemented.
_TYPES tables will appear in other parts of the data model but they will always have
the same function and format.
The consequence of this design re-use [30] is that implementing an application to manage the source of _TYPE data is easy. The system that manages the type data needs to have a single table with the same columns as a standard _TYPES table and an additional column called, for example, DOMAIN. This DOMAIN column has the target system table name (e.g. PARTY_TYPES) in it. The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation, as sketched below.
[30] This is a good use of a Warehouse Support Application as defined in "An Overview Architecture for Enterprise Data Warehouses".
Band Tables
Whilst _TYPES tables classify information into discrete values it is sometimes
necessary to classify information into ranges or bands i.e. between one value and
another. The classic example of this is for telephone calls which are classified as ‘Off-
Peak Rate’ if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls
between 08:00 and 17:59 are classified as ‘Peak Rate’ and charged at a premium.
_BANDS is a special case of the _TYPES table and would store the data as follows:
• The table will have the same name as the major entity but with the suffix
_BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code
and the _DWK suffix.
• The table will have a _BAND column that is the type name.
• The table will have a _START_VALUE and a _END_VALUE that represent the
starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and a _END_DATE column.
The table has to comply with this convention in order to be given the _BANDS suffix.
[31] Note that values are stored as a number of minutes since midnight.
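A sketch of the telephone tariff example as a TIME_BANDS table follows; the TIMBND short code, the data types and the descriptive values are assumptions, and the band values are minutes since midnight as per [31].

    CREATE TABLE TIME_BANDS (
        TIMBND_DWK       INTEGER      NOT NULL,  -- six character short code + _DWK
        TIME_BAND        VARCHAR(50)  NOT NULL,
        TIME_START_VALUE INTEGER      NOT NULL,  -- minutes since midnight
        TIME_END_VALUE   INTEGER      NOT NULL,
        TIME_DESC        VARCHAR(255),
        TIME_GROUP       VARCHAR(50),
        TIME_START_DATE  DATE         NOT NULL,
        TIME_END_DATE    DATE,
        PRIMARY KEY (TIMBND_DWK)
    );

    -- 00:00-07:59 and 18:00-23:59 are 'Off-Peak Rate'; 08:00-17:59 is 'Peak Rate'
    INSERT INTO TIME_BANDS VALUES
        (1, 'Off-Peak Rate',    0,  479, 'Overnight',    'Tariff', DATE '2000-01-01', NULL);
    INSERT INTO TIME_BANDS VALUES
        (2, 'Peak Rate',      480, 1079, 'Business day', 'Tariff', DATE '2000-01-01', NULL);
    INSERT INTO TIME_BANDS VALUES
        (3, 'Off-Peak Rate', 1080, 1439, 'Evening',      'Tariff', DATE '2000-01-01', NULL);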
Property Tables
In the discussion of major entities and lifetime value the data that failed to meet the
lifetime value principle was omitted from the major entity tables, however it still needs
to be stored. This is handled via a property table. Property tables also help to support
the extensibility aspects of the data model.
If we use PARTY as an example then, as already identified, the marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single; some marry, some divorce and some are widowed. These 'status changes' occur through the lifetime of the individual.
To deal with this problem the property table can be modelled as follows:
As can be seen from example above in order to handle the properties two new tables
are created. The first is the PARTY_PROPERTIES table itself and the second a
supporting PARTY_PROPERTY_TYPES table.
In order to store the marital status of an individual a set of data needs to be entered in
the PARTY_PROPERTY_TYPES table:
TYPE GROUP
Single Marital Status
Married Marital Status
Divorced Marital Status
Co-Habiting Marital Status
Figure 9 - Example Party Property Data
The description, start and end date would be filled in appropriately. Note that the start and end date here represent the start and end date of the type and not that of the individual's use of that type. [32]
It is now possible to insert a row in the PARTY_PROPERTIES table that references the
individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES (e.g.
‘Married’). The PARTY_PROPERTIES table can also hold the start date and end date
of this status and optionally where appropriate a text or numeric value that relates to
that property.
[32] The need for start and end dates on such items is often questioned; however, experience shows that legislation changes supposedly static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom.
This means that not only the current marital status can be stored but also historical
information.
PARTY_DWK [33]  PARTY_PROPERTY_DWK  START_DATE   END_DATE
John Smith      Single              01-Jan-1970  02-Feb-1990
John Smith      Married             03-Feb-1990  04-Mar-2000
John Smith      Divorced            05-Mar-2000  06-Apr-2005
John Smith      Co-Habiting         07-Apr-2005
Figure 10 - Example data for PARTY_PROPERTIES
The data shown here describes the complete history of an individual, with the last row showing the current state as the START_DATE is before 'today' and the END_DATE is null. There is also nothing to prevent future information from being held. If John Smith announces that he is going to get married on a specific date in the future then the current record can have its end date set appropriately and a new record added.
TYPE GROUP
Male Number of Children
Female Number of Children
Figure 11 - Example Data for PARTY_PROPERTY_TYPES
In fact any number of new properties can be added to the tables as business processes
and source systems change and new data requirements come about.
The effect of this method, when compared to other methods of modelling this information, is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables. However the properties table is very effective. Firstly, unlike the example, the two _DWK columns are integers [34], as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominately numeric rather than text values.
[33] Text from the related table is used in the _DWK column rather than the numeric key for clarity in these examples.

[34] Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).
The real saving in the number of rows is normally less than expected when compared
to more conventional data model techniques that store duplicated rows for changed
data. The example above has seven rows of data. The alternate approach of repeated
sets of data requires six rows of data and considerably more storage because of the
duplicated data:
NAME        START_DATE   END_DATE     STATUS       UNKNOWN CHILD  FEMALE CHILD  MALE CHILD
John Smith  01-Jan-1970  02-Feb-1990  Single       0              0             0
John Smith  03-Feb-1990  08-Jun-2001  Married      0              0             0
John Smith  09-Jun-2001  09-Jul-2002  Married      0              1             0
John Smith  10-Jul-2002  04-Mar-2000  Married      0              1             1
John Smith  05-Mar-2000  06-Apr-2005  Divorced     0              1             1
John Smith  07-Apr-2005               Co-Habiting  0              1             1
Figure 13 - Example Data for PARTY_PROPERTIES
The other main objection to this technique is often described as the cost of matrix transformation of the data: that is, changing the data from columns into rows in the ETL to load the data warehouse, and then changing the rows back into columns in the ETL to load the data mart(s). This objection is normally due to a lack of knowledge of appropriate ETL techniques that can make this very efficient, such as using SQL set operations like 'UNION', 'MINUS' and 'INTERSECT'.
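The following is a sketch of that set-based style, assuming a wide source table with the hypothetical name SRC_CUSTOMERS. Each property becomes one UNION ALL branch, so the transformation is a handful of set operations rather than a row-by-row loop.

    -- Columns to rows, to load the properties table
    SELECT CUSTOMER_ID, 'Marital Status' AS PROPERTY, MARITAL_STATUS AS PROPERTY_VALUE
    FROM   SRC_CUSTOMERS
    UNION ALL
    SELECT CUSTOMER_ID, 'Income', CAST(INCOME AS VARCHAR(20))
    FROM   SRC_CUSTOMERS;

    -- Rows back to columns, to load a data mart, can use grouped CASE expressions:
    -- MAX(CASE WHEN PROPERTY = 'Marital Status' THEN PROPERTY_VALUE END), etc.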
Event Tables
An event table is almost identical to a property table except that instead of having
_START_DATE and _END_DATE columns it has a single column _EVENT_DATE. It
also has the appropriate _EVENT_TYPES table. The table name has a suffix of
_EVENTS. For example a wedding is an event (happens at a single point in time), but
‘being married’ is a property (happens over a period of time). Events can be stored in
property tables simply by storing the same value in both the start date and end date
columns and this is a more common solution than creating a separate table. The use of
_EVENTS tables is usually limited to places where events form a significant part of the
data and the cost of storing the extra field becomes significant.
It should be noted that this is only required where the event may occur many times
(e.g. a wedding date) rather than information that can only happen once (e.g. first
wedding date) which would be stored in the appropriate major entity as, once set, it
would have lifetime value.
Link Tables
Up to this point major entity attributes within a single record have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS and the appropriate _LINK_TYPES table.
The significant difference in a _LINKS table is that there are two relationships from the major entity (in this case PARTIES); 'works in' and 'is a division of' are examples of the _LINK_TYPE.
It should also be noted that there is a priority to the relationship, because one of the linking fields is the main key (in this case PARTIE_DWK) and the other is the linked key (in this case LINKED_PARTIE_DWK). There are two options: one is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith). This can be made complete with a reversing view [35] but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, therefore the convention could be that the male is stored in the main key and the female is stored in the linked key).
[35] A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTIE_DWK would be swapped with LINKED_PARTIE_DWK.
Segment Tables
The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common but where more detail is not known. The most common business example of this would be the market segmentations done on customers. These segments are normally the result of detailed statistical analysis, the results of which are then stored.
In our example John Smith and Jane Smith could both be part of a segment of
married people along with any number of other individuals for whom it is known that
they are married but there is no information about when or to whom they are married.
Where the _LINKS table provided the peer-to-peer relationship the segment provides
the peer group relationship.
The Sub-Model
The major entities and the six supporting data structures (_TYPES, _BANDS,
_PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern
structure to hold a large part of the information in the data warehouse. This is known as a
Major Entity Sub-Model. Significantly the information that has been stored for a single major
entity sub-model is very close to the typical dimensions of a data mart. This design pattern
provides complete temporal support and the ability to re-construct a dimension or dimensions
based on a given set of business rules.
The set of a major entity and the supporting structures is known as a sub-model. For example
the designed PARTY sub-model consists of:
• PARTIES
• PARTY_TYPES
• PARTY_BANDS
• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES
• PARTY_EVENTS
• PARTY_EVENT_TYPES
• PARTY_LINKS
• PARTY_LINK_TYPES
• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES
Those tables in bold italics might represent the implemented PARTY sub-model
Importantly what has not been provided is the relationships between major entities and the
business transactions that occur as a result of the interaction between major entities.
History Tables
Extending the example above it is noticeable that the party does not contain any
address information; this is held in the geography major entity. This is also another
example where current business processes and requirements may change. At the
outset the source system may provide a contract address and a billing address. A
change in process may require the capture of additional information e.g. contact
addresses and installation addresses.
In practice the only difference between this type of relationship between major entities
and the _LINKS relationship is that instead of two references to the same major entity
there is one relationship to each of two major entities.
There is one minor semantic difference between links and histories. _LINKS tables join
back on to the major entity and therefore one half of the relationship has to be given
priority. In a _HISTORY table there is no need for priority as each of the two attributes
is associated with a different major entity.
Finally note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.
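As a sketch, a history table relating PARTY to GEOGRAPHY (covering the address requirements above) might be declared as follows; the table name, short codes and data types are assumptions.

    CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
        PRTGEO_DWK               INTEGER NOT NULL,
        PARTIE_DWK               INTEGER NOT NULL,  -- reference to PARTIES
        GEOGRA_DWK               INTEGER NOT NULL,  -- reference to the geography major entity
        PARTY_GEOGRAPHY_TYPE_DWK INTEGER NOT NULL,  -- e.g. billing, contract or
                                                    -- installation address
        START_DATE               DATE    NOT NULL,
        END_DATE                 DATE,
        PRIMARY KEY (PRTGEO_DWK)
    );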
The Example
The bank has a number of regions and a central ‘premium’ account function that
caters for some business customers. Each region has a number of branches.
Branches have a manager and a number of staff. Each branch manager reports
to a regional manager.
If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account and contact details for them.
The bank offers a range of services including current, loan and deposit accounts,
credit and debit cards, EPOS (for business accounts only), foreign exchange,
etc.
The bank has a number of channels including branches, a call centre service, a
web service and the ability to use ATMs for certain transactions.
The bank offers a range of transaction types including cash, cheque, standing
order, direct debit, interest, service charges, etc.
After the close of business on the last working day of each month the starting
and ending balances, the average daily balance and any interest is calculated for
each account.
On a daily basis the exposure (i.e. sum of all account balances) is calculated for
each customer along with a risk factor that is a number between 0 and 100 that
is influenced by a number of factors that are reviewed from time to time by the
risk management department. Risk factors might include sudden large deposits
or withdrawals, closure of a number of accounts, long-term non-use of an
account, etc. that might influence account managers’ decisions.
Every transaction that is made is recorded every day and has three associated
dates, the date of the transaction, the date it appeared on the system and the
cleared date.
The bank has a number of regions and a central ‘premium’ account function that
caters for some business customers. Each region has a number of branches.
Branches have a manager. Each branch manager reports to a regional manager.
At this point only existing major entities and history tables have been used. Also
this information would be re-usable in many places just like the conformed
dimensions concept of star schemas but with more flexibility.
[36] See Appendix 2 – Understanding Hierarchies for an explanation as to why the regions are organisational units and not geography.
If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.
The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.
• The product services are held in the product service major entity.
• The product services are associated with an account via a history.
The bank has a number of channels including branches, a call centre service, a
web service and the ability to use ATMs for certain transactions.
This adds the PRODUCT_SERVICE and CHANNEL major entities into the
model.
The bank offers a range of transaction types including cash, cheque, standing
order, direct debit, interest, service charges, etc.
After the close of business on the last working day of each month the starting
and ending balances, the average daily balance and any interest is calculated for
each account.
On a daily basis the exposure (i.e. sum of all account balances) is calculated for
each customer along with a risk factor that is a number between 0 and 100 that
is influenced by a number of factors that are reviewed from time to time by the
risk management department. Risk factors might include sudden large deposits
or withdrawals, closure of a number of accounts, long-term non-use of an
account, etc. that might influence account managers’ decisions.
Every transaction that is made is recorded every day and has three associated
dates, the date of the transaction, the date it appeared on the system and the
cleared date.
This would complete the model for the example. There are some interesting
features to examine. The first is that all amounts would be positive. This is
because for a credit to an account the ‘from account’ would be the sending party
and the ‘to account’ would be the customer’s account. For a debit the ‘to account’
would be the recipient and the ‘from account’ would be the customer’s account.
This has a number of effects. Firstly it complies with the DRY (Don’t Repeat
Yourself) principle and means that extra data is not stored for the transaction. It
also means that a collection of account information not related to any current
party (e.g. a customer at another bank) is built up. This information is useful in
the analysis of fraud, churn, market share, competitive analysis, etc.
For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users, as sketched below.
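A sketch of that extract follows, assuming the transaction table is called RETAIL_BANKING_TRANSACTIONS with FROM_ACCOUNT_DWK, TO_ACCOUNT_DWK and AMOUNT columns, and that the customer's account keys are available in a hypothetical CUSTOMER_ACCOUNTS table.

    -- Credits to a customer's account become positive, debits become negative
    SELECT t.TRANSACTION_DATE,
           ca.ACCOUNT_DWK,
           CASE WHEN t.TO_ACCOUNT_DWK = ca.ACCOUNT_DWK
                THEN  t.AMOUNT                  -- money arriving: credit
                ELSE -t.AMOUNT                  -- money leaving: debit
           END AS SIGNED_AMOUNT
    FROM   RETAIL_BANKING_TRANSACTIONS t,
           CUSTOMER_ACCOUNTS ca
    WHERE  t.TO_ACCOUNT_DWK   = ca.ACCOUNT_DWK
    OR     t.FROM_ACCOUNT_DWK = ca.ACCOUNT_DWK;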
The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.
The process has used seven major entities’ sub-models, an additional type table
and an occurrence or transaction table. Storing this information should
accommodate and absorb almost any change in business process or source
system without the need to change the data warehouse model and will allow
multiple data marts to be built from a single data warehouse quickly and easily.
In effect the type tables act as metadata for how to use and extend the data
model rather than defining the business process explicitly in the data model,
hence the name process neutral data modelling.
It also demonstrates the ability of the data model to support the requirements
process. By knowing the major entities and using a storyboard approach similar
to the example above, and familiar as an approach to agile developers, it is
possible to quickly and easily identify business, data and query requirements.
The model above has been almost fully described in detail by this document, since the self-similar modelling for all the sub-model components has been described along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. To complete the model, these additional attributes just need to be added.
Two other effects that will influence the creation of data marts from this model can also be
seen. Firstly, the creation of dimensions will revolve around the de-normalisation of the
attributes required from each of the major entities into one of the two dimensions
associated with account, as these have the hierarchies for the customer, account manager,
etc. associated with them.
The second effect is that of the natural star schema. It is clear from this diagram that the fact
tables will be based around the ‘Retail Banking Transactions’ table. As has already been
stated there are several data marts that can be built from this fact table, probably at different
levels of aggregation and with different dimensions.
The occurrence or transaction table above is one of perhaps twenty that a large enterprise
would require along with approximately thirty _HISTORY tables. This would be combined with
around twenty major entity sub models to create an enterprise data warehouse data model.
Those readers who are familiar with the Data Management & Warehousing white paper
'How Data Works'[37], which describes natural star schemas in more detail along with a
technique called left to right entity diagrams, will see a correlation as follows:
Level Description
1 _TYPE and _BAND tables; simple, small volume reference data.
2 Major entities; complex, low volume data.
3 Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables; less complex but with greater volume.
4 _HISTORY tables and some occurrence or transaction tables.
5 Occurrence or transaction tables; significant volume but low complexity data.
Figure 19 - Volume & Complexity Correlations
[37] Available for download from http://www.datamgmt.com/whitepapers
Implementation Issues
The use of a process neutral data model and a design pattern is meant to ease the design of
a system but there will always be exceptions and things that need further explanation in order
to fit them into the solution. Much of this section refers to ETL issues that can only be briefly
described in this context.[38]
There is also a requirement to track some form of state identity number. In the United
Kingdom an individual has a National Insurance number and in the United States a
Social Security number; other numbers (e.g. passport and ID card numbers) are simply
stored as properties. Organisations have other numbers: companies have registration
numbers, and charities and trusts have different registration numbers, but VAT numbers
are properties as they can and do change.
Another minor issue is that people have a date of birth and a date of death. This is
simply resolved: the date of birth is the Individual Start Date and the date of death is
the Individual End Date; however, this terminology can sometimes prove controversial.
The solution to the PARTY special case depends on the database technology being
used. If the database supports the creation of views and the 'UNION ALL' SQL
operator[40] then the preferred solution is to split the lifetime attributes vertically into
three underlying tables and present them through a single view. For individuals the
columns are:
• PARTY_DWK
• PARTY_TYPE_DWK
• TITLE
• FORENAME
• CURRENT_SURNAME[41]
• PREVIOUS_SURNAME
• MAIDEN_SURNAME[39]
• DATE_OF_BIRTH
• DATE_OF_DEATH
• STATE_ID_NUMBER
• Other lifetime attributes as required
[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.
[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, recognised the importance of knowing the birth names of everyone on the certificate. For example, a wedding certificate gives the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628
[40] Nearly all business intelligence databases support this functionality.
[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 Data Modelling Standards.
For organisations the columns are:
• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_NAME
• PREVIOUS_ORGANISATION_NAME
• START_DATE
• END_DATE
• STATE_ID_NUMBER
• Other lifetime attributes as required
For organisation units the columns are:
• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_UNIT_NAME
• PREVIOUS_ORGANISATION_UNIT_NAME
• START_DATE
• END_DATE
• Other lifetime attributes as required
Where possible it is often beneficial to create this as a materialized view so that it can
be indexed and its key used as the primary key referenced by the other tables.
Whilst the PARTIES table needs all these techniques they can also be used in part on
other major entities if required.
The alternate strategy where UNION ALL views are not available is to create a single
table including all the columns and use those columns that are appropriate as required
by the query.
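A minimal sketch of the preferred UNION ALL view, assuming the three vertically
partitioned tables are called INDIVIDUALS, ORGANISATIONS and ORGANISATION_UNITS
and showing one possible mapping of the name and date columns (in practice the full
column lists above would be carried through):

    CREATE VIEW PARTIES AS
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_SURNAME                AS CURRENT_NAME,
           DATE_OF_BIRTH                  AS START_DATE,  -- Individual Start Date
           DATE_OF_DEATH                  AS END_DATE,    -- Individual End Date
           STATE_ID_NUMBER
    FROM   INDIVIDUALS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_NAME      AS CURRENT_NAME,
           START_DATE, END_DATE, STATE_ID_NUMBER
    FROM   ORGANISATIONS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_UNIT_NAME AS CURRENT_NAME,
           START_DATE, END_DATE,
           CAST(NULL AS VARCHAR(32))      AS STATE_ID_NUMBER  -- units have no state ID
    FROM   ORGANISATION_UNITS;

Declaring this as a materialized view, where supported, allows PARTY_DWK to be indexed
as described above.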
Partitioning
The Party Special Case is an example of vertical partitioning, i.e. tables that are split
based on the different columns required for the different types. Queries require a view
across the partitions in order to access all the information.
Figure 21 - Vertically Partitioned Data (common data plus separate data for Individuals, Organisations and Organisation Units)
Tables can also be horizontally partitioned, i.e. whilst the table structure remains the
same the table is split on some data item that changes, most commonly the date. This
sometimes requires a view to be able to access all the information but is more
commonly implemented in the database architecture itself.
Figure 22 - Horizontally Partitioned Data (common data split by month: January, February, March)
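For example, on a database with declarative range partitioning (the sketch below uses
Oracle-style syntax, which varies by platform; the table and column names are
illustrative):

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        FROM_ACCOUNT_DWK INTEGER NOT NULL,
        TO_ACCOUNT_DWK   INTEGER NOT NULL,
        TRANSACTION_DATE DATE    NOT NULL,
        AMOUNT           NUMBER  NOT NULL
    )
    PARTITION BY RANGE (TRANSACTION_DATE) (
        PARTITION P200901 VALUES LESS THAN (DATE '2009-02-01'),  -- January
        PARTITION P200902 VALUES LESS THAN (DATE '2009-03-01'),  -- February
        PARTITION P200903 VALUES LESS THAN (DATE '2009-04-01')   -- March
    );

Queries that filter on TRANSACTION_DATE then only need to read the relevant partitions.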
If both horizontal and vertical partitioning are used together this is known as matrix
partitioning. This is uncommon.
Horizontal partitioning is not effective, and often not supported, on MPP platforms that
hash the data internally across multiple nodes. Column or vector storage databases
render vertical partitioning meaningless as a storage strategy, since each column is
already stored separately.
Data Cleansing
Data cleansing itself is outside the scope of this document however the model must
make allowances for it. In particular if data is to be cleaned or standardized then the
original data must also be stored. To this end every column that is to be modified in this
way should have an additional column with the prefix STANDARDIZED_ added to it.
For example there may be a column in a table called NAME that has 'Fred Bloggs'
stored with mixed case, two spaces between the words and a trailing space. The
cleansing routine would replace multiple white spaces with a single space character,
remove leading and trailing white space, and then convert the text to uppercase,
producing 'FRED BLOGGS'. The result would be stored in the column
STANDARDIZED_NAME, leaving the original data in NAME. This technique should be
used wherever data cleansing takes place.
If this column is created then it must always be populated, even if the individual row has
not changed. This is because the fact that there is no change is information in itself, and
it avoids the need, on load and extraction, to determine whether the original or the
cleansed data should be used.
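A minimal sketch of such a cleansing rule, assuming a database that provides the common
(but not universal) REGEXP_REPLACE function and a hypothetical PARTIES table carrying
the NAME column from the example above:

    -- Collapse runs of white space, trim leading and trailing white
    -- space, then convert to uppercase: ' Fred  Bloggs ' becomes
    -- 'FRED BLOGGS'. Run against every row so that STANDARDIZED_NAME
    -- is always populated.
    UPDATE PARTIES
    SET    STANDARDIZED_NAME = UPPER(TRIM(REGEXP_REPLACE(NAME, '\s+', ' ')));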
Null Values
Process neutral data modelling does not require many nulls in the database at all and
they should be avoided wherever possible. All _END_DATES must allow nulls. Some
_START_DATES will need to allow nulls. The _VALUE columns must also allow nulls.
Other than these cases the principles of lifetime value should ensure the data model
requires few other columns that allow a null value.
Indexing Strategy
The data warehouse should only be indexed where necessary, i.e. primary and foreign
keys and one or two other essential columns required for good performance when
extracting information into data marts. Users do not query the data warehouse directly,
so the indexes should be aimed at ensuring that the ETL is as effective as possible.
Where a holistic view of the data warehouse processing is taken, regardless of the data
modelling technique used, it becomes apparent that disabling referential integrity is
more expensive than enabling it and designing processes to accommodate it. Process
Neutral Data Models should always have referential integrity enabled unless there is a
specific case for individual tables that means it cannot be done.
There is also an approach of disabling referential integrity, loading data and then re-
enabling referential integrity. This is acceptable as long as any issues are resolved but
in practice many systems ignore issues and ultimately this affects the quality and
therefore the longevity of the system.
Finally, it is possible to write ETL that always complies with referential integrity, even
when there is missing data, using a technique called 'Defensive Programming'. For
example, if a type is missing from a _TYPES table it is possible to write the value into the
_TYPES table before inserting the data into the main table. Doing so will create a row in
the _TYPES table where the description, group, etc. are set to 'Unknown'. This allows all
data to be processed, allows data quality metrics to be run ('How much of my reference
data is unknown?'), provides early warning of unplanned changes in the source system
and allows users, via the data maintenance application, to fix reference data in a
timely fashion without impacting the load process.
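A hedged sketch of this defensive step for a hypothetical ACCOUNT_TYPES reference table
loaded from a staging table STG_ACCOUNTS (ANSI sequence syntax is shown; Oracle, for
example, would use REFERENCE_DATA_SEQ.NEXTVAL instead):

    -- Seed any reference value present in staging but missing from the
    -- _TYPES table with an 'Unknown' row before the main table load, so
    -- that referential integrity never blocks the load.
    INSERT INTO ACCOUNT_TYPES
           (ACCOUNT_TYPE_DWK, ACCOUNT_TYPE, ACCOUNT_TYPE_DESC,
            ACCOUNT_TYPE_GROUP, ACCOUNT_TYPE_START_DATE, TIMESTAMP, ORIGIN)
    SELECT NEXT VALUE FOR REFERENCE_DATA_SEQ,
           s.ACCOUNT_TYPE, 'Unknown', 'Unknown',
           CURRENT_DATE, CURRENT_TIMESTAMP, 'DEFENSIVE_TYPE_LOAD'
    FROM   (SELECT DISTINCT ACCOUNT_TYPE FROM STG_ACCOUNTS) s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   ACCOUNT_TYPES t
                       WHERE  t.ACCOUNT_TYPE = s.ACCOUNT_TYPE);

The data quality metric is then a simple count of rows whose description is still 'Unknown'.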
Where the data warehouse platform favours inserts it is preferable that the processing
of the data and any staging that is update intensive is performed in the ETL tool or a
dedicated staging database (depending on the architectural constraints and platform
choices made by the organisation) outside the data warehouse.
Since a change in properties only affects the individual attribute rather than the entire
row, as the data warehouse grows each change uses less space, and the total disk space
used drops below that used by the other techniques. Over the lifetime of the data
warehouse, however, it is unlikely that either approach will see either significant cost or
significant savings in disk space.
Implementation Effort
The method chosen for the data modelling can have a significant effect on the effort
involved in building a data warehouse. Process Neutral Data Modelling typically has the
following characteristics when compared to more traditional approaches:
Data Commutativity
In mathematics there is a concept of commutativity[42]: the ability to change the order of
operands without changing the end result. For example 2 + 1 is the same as 1 + 2 and is
therefore commutative, whereas 2 - 1 is not the same as 1 - 2 and is therefore not
commutative. In general data is not commutative; however, it is hierarchical and can
therefore be derived in one direction.
A common question asked about process neutral data modelling is: with so many places
that data can be held, which is the right place? The answer is simple: data should be held
at the most detailed level possible.
Since data will be extracted into data marts the ETL that performs the extraction should
consolidate the information to the appropriate level for that data mart. It is important to note
that this is also a core part of the change management process for the data warehouse.
For example, the initial system that is used as a source collects data at the segment level. A
new system is commissioned to replace it, and the new system collects data at the link
level. The data can immediately be loaded at the link level and then extracted to the
data mart at the segment level. Over time the initial system is de-commissioned and all the
information is gathered at the link level. At this point the extraction can be updated to
supply the data mart with information at the link, property or event level as required.
It is against the DRY (Don’t Repeat Yourself) principle to store the derived data at every level
in the data warehouse unless there is some specific added value that is provided by doing so.
[42] http://en.wikipedia.org/wiki/Commutative
• How big will this data model get, especially if every major entity has ten supporting
tables?[43]
• Can't all the type and band tables be put in a single table; in fact, can't all the
properties, events, links and segments be merged into a single table too?
Before answering these two specific questions it is important to make some observations
about the process of the data modelling.
The objective of a data model is to create a clear, structured environment in which to store
data. Every data modeller will have their own preferences for the way in which they design the
data model. Process neutral data models strive to find a balance between:
It is possible for data modellers to change the rules that they apply to the data model
however before doing so the data modeller should understand the effect on the overall
balance of the system. Projects inevitably fail when the balance is lost and one of these
aspects overrides all the others. Projects should always enforce a single data modelling style.
This compares very favourably with other data warehouse models. Large data
warehouses that have been in production usually exceed this number. Smaller and
newer data warehouses often start with fewer but quickly grow to this sort of size. The
advantage of this approach is that nearly everything that comes along in the future has
already been designed into the solution, therefore there is no long-term data model size
increase provided the model is properly managed.
[43] A Major Entity could have a type table, a band table, a property table with its own type table, an event table with its own type table, a link table with its own type table and a segment table with its own type table, which is a total of ten tables excluding the major entity itself.
A typical Telco billing process will handle billions of unrated call data records through complex
algorithms in order to generate the rated call record and consequently the bill. Furthermore
the billing systems allow rapid change in the rules used for billing so that the company can
bring new products to market quickly.
Given the engineering that has gone into building high performance billing systems and the
amount of change in billing requirements, it is impractical to try to reliably recreate the
billing process in the ETL. It is therefore important to know what factors were used in the
rating of an individual record (e.g. time bands, distance, number types, etc.) but not exactly
how they were applied. The accurate storing of the results generated elsewhere is the
objective of the data warehouse.
This approach can be extended to a general principle that data warehouses should store the
results of batch processes in source systems rather than try to reproduce the algorithms that
generated the result sets.
This has consequences for data quality. Users in the example given above might perceive
data quality issues if the sum of the rated calls does not equal the billed amount. There are
two possible causes for this.
The first cause is inaccurate ETL, which, of course, is a data warehouse problem that has to
be resolved.
The second cause is an issue in the batch process in the source system. This second type of
issue often goes undetected because users of the operational system look at individual bills,
whilst in the data warehouse they are likely to analyse across multiple bills.
simple remedy for this problem. If data is loaded and reconciled against the source system(s)
differences will be found. It may not be possible for the data warehouse to resolve them all.
Instead they must all be explained and the users of the system educated to understand how
the differences between the source systems and the data marts come about. As a result the
users may consider changing the source system or business process to get more accurate
information.
Even if the data models themselves are not used there is much benefit from using the
techniques as a method of analysis for the data modelling. Given that major entities have
types, bands, properties, events, links and segments it becomes much easier to ensure all the
data that might be required has been analysed and discussed in the requirements stage.
Unlike most data modelling approaches, this method has a basis in the understanding of
enterprise architecture, and therefore it is possible to tune the data model to get the optimal
overall solution for the specific situation, because the impact of changing one aspect (e.g.
ETL loading) at the expense of another (e.g. data maintenance) can be clearly seen.
The holistic approach also requires technical discipline and good change control, as should
any other method. The use of this approach often highlights the failure of an organisation in
these areas, and means that sometimes organisations will choose to use other methods that
do not directly expose these failures. Unfortunately hiding these failures does not mean there
is no impact, just that the impact is hidden until it becomes critical and causes problems.
Summary
This white paper has looked at an example company and how the data models of operational
systems within that company evolve. Using this example it has been possible to study the
impact those changes have on the reporting and data warehousing solutions.
To mitigate the impact of these changes the use of a process neutral data model has been
examined. This method creates a data model that stores the core business data in a format
that is abstracted from the current operational systems.
The technique also takes advantage of the benefits of using convention over configuration to
define standard format tables and lifetime value principles to implement a DRY or “Don’t
Repeat Yourself” concept that makes the data model easily understood. Of course there is no
perfect solution to developing a data model and so the implementation issues associated with
this technique are also examined.
Combining the techniques described in the white paper allows a data model to be quickly and
easily developed that is easily understood and that will lower the total cost of ownership
because it is not so susceptible to change.
General Conventions
All table and column names must use uppercase letters, the digits 0-9 and an
underscore '_' to replace a space. No other characters are allowed. This is for
database compatibility reasons.[44]
Table and column names must be no longer than 30 characters including underscores.
This is also for database compatibility reasons.
Table names and column names should be in English. This is because, regardless of
where in the world the system is operating, the majority of source systems will have
English table names, and the amount of time lost trying to translate and match table
and column names, compared with visual inspection and quick comparison in the
same language, is significant.[45]
Table Conventions
Table Names are always plural; a table is a collection of zero, one or more rows
Every table should have a short name or alias. The short name is created using the
following rules:
If a table name is less than six characters the short name is the table name right
padded with ‘Z’ until it is six characters long
E.g. BILLS becomes BILLSZ
If a table name is made up of one word of six or more characters then the short name
is the first six characters
E.g. ACCOUNTS becomes ACCOUN
If a table name is made up of two words then the first three characters of each word
are used to create the short name
E.g. ACCOUNT_TRANSACTIONS becomes ACCTRA
If a table name is made up of three words then the first two characters of each word
are used to create the short name
E.g. CALL_DISTANCE_BAND becomes CADIBA
If a table name is made up of four words then the first two characters of the first two
words and the first character of the third and fourth words are used to make up the
short name
E.g. CALL_DISTANCE_BAND_GROUPS becomes CADIBG
[44] Database Identifier Lengths Comparison: https://test.kuali.org/confluence/display/KULRICE/Database+Table+and+Column+Name+Standards
[45] Taking this further, there are minor differences between UK and US English (e.g. COLOUR vs. COLOR) and therefore strictly speaking the data model should be in US English.
If a table name is made up of five words then the first two characters of the first word
and the first character of the second, third, fourth and fifth words are used to make up
the short name
E.g. THE_QUICK_BROWN_FOX_JUMPED becomes THQBFJ
If a table name is made up of six or more words then the first character of each of the
first six words is used to make up the short name
E.g. THE_QUICK_BROWN_FOX_JUMPED_OVER becomes TQBFJO
If there are any conflicts as a result of this then they should be resolved and
documented by the data modeller.
Table Suffixes
There are a series of table name suffixes that are reserved for specific functions;
these are:
A _PROPERTIES table provides time variant data storage support for non-
lifetime value attributes of a major entity. The singular form of the table that is
being supported always prefixes the _PROPERTIES. (e.g. PARTIES is
supported by PARTY_PROPERTIES, PRODUCTS is supported by
PRODUCT_PROPERTIES, etc.). _PROPERTIES tables can only be
associated with major entities and always have a related _TYPES table.
An _EVENTS table provides time variant data storage support for non-lifetime
value attributes of a major entity that occur more than once but at a point in
time rather than over a period of time (which is covered by _PROPERTIES).
The singular form of the table that is being supported always prefixes the
_EVENTS. (e.g. PARTIES is supported by PARTY_EVENTS, PRODUCTS is
supported by PRODUCT_EVENTS, etc.) _EVENTS tables can only be
associated with one major entity table and always have a related _TYPES
table.
Column Conventions
Column Names are always singular; a column is a single element within a row
TIMESTAMP
A timestamp for each record is held. This is either the date and time that the
row was created, or subsequently when it was last modified. If two systems
update any part of a row within one load process only the last modification is
preserved, and no count of modifications is maintained. The data type of a
timestamp must be TIMESTAMP where supported by the database or DATE
otherwise.
ORIGIN
This is used to identify what made the last change to the record. This should
be the name of the ETL process or mapping that performed the insert or last
update. If two systems update any part of a row within one load process only
the last updating system is preserved, and no count of modifications is
maintained. It is important to note that the origin only reflects the last process
in the chain to insert or update a record. A record may come from multiple
source systems, passing through many ETL processes before being inserted
into the database. The ORIGIN is set to the last ETL process, and the ETL tool
must then contain the audit trail back to the previous system, and so on. The
data type and format of the ORIGIN column must be VARCHAR(32).[46] This
approach is known as tracking the data lineage.
[46] This document uses ANSI SQL92 standards. Other databases may use other data types, e.g. Oracle would use VARCHAR2(32).
Column Suffixes
Standard extensions added to column names
_DWK
The use of _DWK indicates a Data Warehouse Key – a key generated and
maintained within the Data Warehouse – allowing the use of the suffixes _ID,
_CODE, _NUMBER, etc. to denote identifiers brought in from the source data.
Every table must use a _DWK surrogate key rather than any source system
key that may change when the source system is changed. All _DWK columns
are of integer data type.
_TIME
Any field that has the suffix _TIME must contain a time. This information is
stored in a TIME data type if available, otherwise it is stored in the DATE data
type with the date component set to ‘01-JAN-1900’. This is to allow arithmetic
to be performed on time fields.
_DATE
Any field that has the suffix _DATE must contain information stored in the
DATE data type.
_START_DATE
_END_DATE
_EVENT_DATE
_DESC
The description fields are free text fields that describe the record. This should
not be relied on for queries and instead keys and appropriate joins used. The
standard data type and format for a description is VARCHAR (255).
_NUMERIC_VALUE
_TEXT_VALUE
Column Prefixes
There are also a number of standard column prefixes
STANDARDIZED_
Standardized fields are fields that have been cleaned in some way
CURRENT_
The CURRENT_ prefix denotes a current value that might not have lifetime
value in a major entity such as SURNAME in the PARTIES table.
PREVIOUS_
LINKED_
Used where two foreign keys from the same table are used in a _LINK table.
For large projects and models it is sometimes useful to consider using all
abbreviations from the outset.
Index Conventions
Where databases use administrator-defined indexes the following conventions should
be used.
Primary Key indexes should be named PK_XXXXXX where XXXXXX is the six-
character short table name
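For example (an illustrative sketch, assuming ACCOUNTS carries the ACCOUNT_DWK
surrogate key described above):

    -- ACCOUNTS has the six-character short name ACCOUN
    CREATE UNIQUE INDEX PK_ACCOUN ON ACCOUNTS (ACCOUNT_DWK);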
_TYPES
Column Data Type Length Optional
_TYPE_DWK INTEGER NOT NULL
_TYPE VARCHAR 32 NOT NULL
_TYPE_DESC VARCHAR 255 NOT NULL
_TYPE_GROUP VARCHAR 32 NOT NULL
_TYPE_START_DATE DATE NOT NULL
_TYPE_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 27 - Standard _TYPES table
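As an illustration, Figure 27 instantiated for the PARTIES major entity gives the
PARTY_TYPES table; a sketch in ANSI SQL92 terms (note that TIMESTAMP is a reserved
word in some databases and may need quoting):

    CREATE TABLE PARTY_TYPES (
        PARTY_TYPE_DWK        INTEGER      NOT NULL,
        PARTY_TYPE            VARCHAR(32)  NOT NULL,
        PARTY_TYPE_DESC       VARCHAR(255) NOT NULL,
        PARTY_TYPE_GROUP      VARCHAR(32)  NOT NULL,
        PARTY_TYPE_START_DATE DATE         NOT NULL,
        PARTY_TYPE_END_DATE   DATE,                   -- null while current
        TIMESTAMP             DATE         NOT NULL,
        ORIGIN                VARCHAR(32)  NOT NULL,
        CONSTRAINT PK_PARTYP PRIMARY KEY (PARTY_TYPE_DWK)
    );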
_BANDS
Column Data Type Length Optional
_BAND_DWK INTEGER NOT NULL
_BAND VARCHAR 32 NOT NULL
_BAND_START_VALUE NUMBER NOT NULL
_BAND_END_VALUE NUMBER NULL
_BAND_DESC VARCHAR 255 NOT NULL
_BAND_GROUP VARCHAR 32 NOT NULL
_BAND_START_DATE DATE NOT NULL
_BAND_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 28 - Standard _BANDS table
_PROPERTIES
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_PROPERTY_TYPE_DWK INTEGER NOT NULL
_PROPERTY_TEXT_VALUE VARCHAR 32 NULL
_PROPERTY_NUMERIC_VALUE NUMBER NULL
_PROPERTY_START_DATE DATE NOT NULL
_PROPERTY_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 29 - Standard _PROPERTIES table
_EVENTS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_EVENT_TYPE_DWK INTEGER NOT NULL
_EVENT_TEXT_VALUE VARCHAR 32 NULL
_EVENT_NUMERIC_VALUE NUMBER NULL
_EVENT_DATE DATE NOT NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 30 - Standard _EVENTS table
_LINKS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
LINKED_ _DWK INTEGER NOT NULL
_LINK_TYPE_DWK INTEGER NOT NULL
_LINK_TEXT_VALUE VARCHAR 32 NULL
_LINK_NUMERIC_VALUE NUMBER NULL
_LINK_START_DATE DATE NOT NULL
_LINK_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 31 - Standard _LINKS table
_SEGMENTS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_SEGMENT_TYPE_DWK INTEGER NOT NULL
_SEGMENT_TEXT_VALUE VARCHAR 32 NULL
_SEGMENT_NUMERIC_VALUE NUMBER NULL
_SEGMENT_START_DATE DATE NOT NULL
_SEGMENT_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 32 - Standard _SEGMENTS table
_HISTORY
Column Data Type Length Optional
_DWK (first major entity) INTEGER NOT NULL
_DWK (second major entity) INTEGER NOT NULL
_HISTORY_TYPE_DWK INTEGER NOT NULL
_HISTORY_TEXT_VALUE VARCHAR 32 NULL
_HISTORY_NUMERIC_VALUE NUMBER NULL
_HISTORY_START_DATE DATE NOT NULL
_HISTORY_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 33 - Standard _HISTORY table
The _TYPES and _BANDS tables all need a sequence to populate their _DWK field.
This should be a single sequence that is shared amongst all _TYPES and _BANDS
tables. This has two effects: it prevents a larger number of sequences being created
than necessary, and it means that reference data cannot inadvertently be joined to
other reference data.
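A sketch of the shared sequence in ANSI syntax (the name REFERENCE_DATA_SEQ is
illustrative):

    -- One sequence feeds the _DWK of every _TYPES and _BANDS table, so
    -- key values never overlap and a join to the wrong reference table
    -- simply returns no rows.
    CREATE SEQUENCE REFERENCE_DATA_SEQ START WITH 1 INCREMENT BY 1;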
Occurrence or transaction tables do not normally need a primary key. If they do then,
like major entities, each one should have its own sequence.
It is often asked if the CALENDAR table should have a _DWK column or if the date is
sufficient. Either approach will work; however, for consistency the use of a _DWK is
preferred.[47]
[47] Some organisations compromise by using the Julian Day Number, i.e. the integer part of the date (see http://en.wikipedia.org/wiki/Julian_day), as a surrogate key that obscures the underlying information from the users but aids development. This does, from time to time, risk inconsistencies.
Sales Regions
Most business will have a ‘geographic’ structure of some type - sales region being a
prime example. The title ‘sales region’ and the names of the elements (e.g. country
names, state names, city names) adds to the confusion in implying that this is a
geographic hierarchy but this is wrong; it is an organisational hierarchy.
Whilst in concept the business allocated resource to cover different geographic regions
the practicalities of running the business soon overtake the situation. A series of
exemptions are soon created, some accounts being looked after by people out of
region and some accounts being looked after by non-geographic functions. There is no
direct geographical relationship, just a use of geographic names for familiarity.
This is often stored as Jack Doe reporting to John Smith, etc. However the people in
the organisation structure are not the hierarchy. What needs to be stored is the role
as an organisation unit:
Figure 35 - Stored Organisational Hierarchy
The role hierarchy is significantly less dynamic than the people within it and the
organisation changes are much more controlled as the business chooses when to re-
structure but does not choose when staff join or leave. Most large organisations will
have a personnel or human resources department that manages the organisation
hierarchy and if they use a Human Resource Management System it is likely that
every role will have a unique ID and a documented position in the hierarchy.
It is then possible to relate the roles as organisational units to the individuals so:
Figure 36 - Relating Individuals to Roles
This has the added advantage of dealing with temporary resources and also with the
transition of resources (e.g. when someone is moving from one team to another and
fulfils two roles for a short period of time).
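One way to express this in the model described here is a row in the standard
PARTY_LINKS structure; a hedged sketch in which the key values and the 'Fills Role' link
type are purely illustrative:

    -- Jack Doe (PARTY_DWK 1001) fills the organisation unit for the
    -- role (PARTY_DWK 2001) from 1 January 2009; 501 is the
    -- PARTY_LINK_TYPE_DWK of a hypothetical 'Fills Role' type.
    INSERT INTO PARTY_LINKS
           (PARTY_DWK, LINKED_PARTY_DWK, PARTY_LINK_TYPE_DWK,
            PARTY_LINK_START_DATE, TIMESTAMP, ORIGIN)
    VALUES (1001, 2001, 501, DATE '2009-01-01',
            CURRENT_TIMESTAMP, 'HR_ROLE_LOAD');

When someone moves teams the existing row is end-dated and a new row is inserted,
which also covers the short period in which they fill two roles.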
• Vendors such as IBM, Oracle, Sybase and Teradata, who all provide some form of
standard data model for certain industry sectors. These models usually started from
a client project and then went through a period of internal refinement before
becoming 'productized'.
• Industry Organisations such as TMForum in the telecommunications industry that
have decided that there is value in building an industry wide common data model
Both types of provider usually provide logical models and these can be used in one of two
ways:
A logical data model will need to be converted to a physical model, and it is in this
conversion process, when the physical data model is created, that process neutral data
modelling can be used.
As an example, one of the best described and most readily available data models is the
Information Framework (SID)[48] from TMForum.org[49], an industry association focused
on transforming business processes, operations and systems for managing and
monetizing on-line Information, Communications and Entertainment services.
The Information Framework provides the foundation of a "common language" that allows
common representation, as well as a standardized meaning for the relationships that exist
among logical entities: for example, a common definition of what a "customer" is and how
it relates to other elements such as mailing addresses, purchase orders, billing records,
trouble tickets, and so on.
This is an ideal basis for a process neutral data model as there is a defined set of major
entities with lifetime values and relationships:
• Market / Sales
• Product
• Customer
• Service
• Resources
• Supplier / Partner
• Common Business
• Enterprise
[48] TMForum Information Framework (SID): http://www.tmforum.org/InformationFramework/1684/home.html
[49] TMForum: http://www.tmforum.org/browse.aspx
Figure 37 - TMForum Information Framework (SID) Version 8.0 overview[50]
At first glance this appears to offer up a view of the world incompatible with the design
objectives of a process neutral data model (e.g. because there is a customer and a supplier
rather than just a party). This is an incorrect assumption. There are two ways in which the
Information Framework can be used with the approach described in this document.
The first method is to trust the Information Framework and implement it as a process neutral
data model. Therefore there is no major entity called Party, instead there is one called
Customer and one called Supplier. This approach trusts the reference model to have thought
through the industry specific lifetime value issues and be satisfied that it will be fit for purpose.
In the specific case of the TMForum Information Framework this is a safe assumption as it is
widely peer reviewed and widely used by industry experts. This is however not always true of
all vendor data models.
The second method is to use the Information Framework as a point of reference and create a
process neutral data model that meets all the described entities and attributes of the
Information Framework. In this case a Party entity would exist and the attributes and
associated properties, links, etc. would be validated to ensure that all information held in the
Information Framework could be stored in the resulting data model.
Both of these approaches have been successfully used with the TMForum Information
Framework and could be used with other industry standard data models. The choice of
approach will often depend on the quality of the reference data model, the likelihood of
change and the needs of the business.
[50] The SID model is copyright TMForum.org and was taken from http://www.tmforum.org/sdata/content/PracticesStandards/sid/default.aspx
The chances are that you will not know the answer to any one question for 100% of your
friends. (What is Mrs Smith's first name? She lives two doors down and looks after the cat
when you are away.)
The situation is also one that deteriorates rapidly. As a result of reading this white paper you
might decide to contact your 100 best friends and get all the above information. Your friends
are tolerant of your request and provide you with all this information. In six months' time you
decide to update your address book and you contact your tolerant friends again to check that
all the details are still correct. The chances are that at least twenty percent of your friends will
have changed some part of the information over the six months.[52]
The use of synchronisation tools, social networks and personal address book sites[53] has
improved the automation of change notification. It is now possible to update your own details
on a service and for that to automatically update the records of your friends who also use the
service, but the change rate is still high.
[51] From a survey by RapLeaf: http://www.marketingvox.com/more-women-than-men-on-social-networks-have-more-friends-than-men-do-038384/
[52] The percentage varies with age and socio-economic factors. Middle age and high incomes improve stability and reduce the percentage of change; youth and older age, as well as lower incomes, increase the amount of change. This was dramatically demonstrated in the UK with the introduction of the community charge or "poll tax" (http://en.wikipedia.org/wiki/Community_Charge) between 1990 and 1993. Local authorities were responsible for collecting the basic household information and struggled to maintain an accurate list of households. Whilst there was quite a lot of deliberate avoidance that cannot be factored in, there was also regularly 20% of notified change in any single month.
[53] Sites include Facebook, Bebo, Plaxo, LinkedIn and Naymz. Many of these sites are now adding features that allow you to better qualify and quantify these friends into true friends, acquaintances, etc.
This issue transfers into the data warehouse environment. If an individual cannot keep track
of their friends then how does a business keep track of its customers? Businesses only get
informed of changes when the customer requires something. For example, if you register a
'pay as you go' mobile telephone and are required to provide an address when doing so, do
you bother to update the telephone company's records when you change address? However,
if you later need something sent from the telephone company then you will contact them to
update their records.
One Telco data warehouse team attempted to measure how poor the address data was. They
decided to look at post-paid customers, who receive a printed bill each month. The method
was obvious and simple once it was identified: they went to the mailroom and asked how
many bills were returned by the postal service. The answer was about 25,000 per month, or
1% of the bills generated. They also discovered that a team handled the returned mail and by
various methods updated the addresses in the main billing system. Therefore each month 1%
of the post-paid customer data expired. Pre-paid customers, the much larger proportion of the
total customer base, would have much less reason to update their information and would
therefore have a much larger percentage of expired data.
The process neutral data model aids this situation in two simple ways:
• The first way in which it helps is that there is a separate record for each piece of
information (the home address is stored separately from the work address within
PARTY_ADDRESS_HISTORY, etc.), which means that it is easy to maintain each of
the different pieces of information without impacting the others.
• The second way in which it helps is that each piece of information has its own
TIMESTAMP. It is therefore possible to exclude information based on its age, as
sketched below.
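For example (a sketch; the exact column names on a PARTY_ADDRESS_HISTORY table and
the date arithmetic syntax vary by implementation):

    -- Use only address relationships that are still open and have been
    -- confirmed by a source system within the last two years.
    SELECT *
    FROM   PARTY_ADDRESS_HISTORY
    WHERE  PARTY_ADDRESS_HISTORY_END_DATE IS NULL
    AND    TIMESTAMP >= CURRENT_DATE - INTERVAL '2' YEAR;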
Process Neutral Data Modelling may seem a large leap from more widely discussed methods
of data modelling for a data warehouse but it has been used for over fifteen years by some of
the largest organisations in the world.
• Ralph Kimball
Creator of the data mart concept and the need to deliver simple, easy to use
information to business users
• Bill Inmon
Known as the father of data warehousing whose approach required a normalised
database in which to store the lowest level of information
• Ward Cunningham
Owner of c2.com (home of the Portland Pattern Repository) and signatory of the agile
manifesto. Ward also worked with a number of the author’s contemporaries at
Sequent Computers in the early and mid 1990s.
And many others who will no doubt feel that they should have been included and to whom the
author can only apologise for their omission.
[54] This quote, often attributed to Isaac Newton, is by John of Salisbury, from his 1159 Metalogicon. http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants
Further Reading
Data Management & Warehousing have published a number of white papers on data
warehousing and related issues. The following papers are available for download from
http://www.datamgmt.com
This particular document looks at what an organisation will need in order to build and
operate an enterprise data warehouse in terms of the following:
• The toolsets
What types of products and skills will be used to develop a system
• The documentation
How do you capture requirements, perform analysis and track changes in scope of a
typical data warehouse project?
This document is, however, an overview and therefore subsequent documents deal
with specific issues in detail
This paper has identified five sources of change to the system and the aspects of the
system that these sources of change will influence in order to assist the organisation to
develop standards and structures to support the development and maintenance of the
solution. These standards and structures must then evolve as the programme
develops, to meet its changing needs.
The best governance must only be an aid to the development and not an end in itself.
Data Warehouses are successful because of good understanding, discipline and the
skill of those involved. On the other hand systems built to a template without
understanding, discipline and skill will inevitably deliver a system that fails to meet the
users’ needs and sooner rather than later will be left on the shelf, or maintained at a
very high cost but with little real use.
Projects often have poorly set expectations in terms of timescales, the likely return on
investment, the vendors' promises for tools or the expectations set between the
business and IT within an organisation. They also have large technical architectures
and resourcing issues that need to be handled.
This document will outline the building blocks of good project control including the
definition of phases, milestones, activities, tasks, issues, enhancements, test cases,
defects and risks and will discuss how they can be managed, and when, using an event
horizon, the project manager can expect to get information.
To help manage these building blocks this paper will look at the types of tools and
technology that are available and how they can be used to assist the project manager.
It also looks at how these tools fit into methodologies.
The final section of the paper has looked at how effective project leadership and
estimating can improve the chances of success for a project. This includes
understanding the roles of the executive sponsor, project manager, technical architect
and senior business analyst along with the use of different leadership styles,
organisational learning and team rotation.
• For projects using other methodologies or creating their own set of documents to
use as a checklist. This allows the project to ensure that the documentation covers
the essential areas for describing the data warehouse.
• To demonstrate our approach to our clients by describing the templates and
deliverables that are produced.
Data Management & Warehousing believes that the approach or methodology for
building a data warehouse should be to use a series of guides and checklists. This
ensures that small teams of relatively skilled resources developing the system can
cover all aspects of the project whilst being free to deal with the specific issues of their
environment to deliver exceptional solutions, rather than a rigid methodology that
ensures that large teams of relatively unskilled staff can meet a minimum standard.
This paper examines how data is structured and then examines characteristics such as
the data model depth, the data volumes and the data complexity. Using these
characteristics it is possible to look at the effects on the development of reporting
structures, the types of data models used in data warehouses, the design and build of
interfaces (especially ETL for data warehouses), data quality and query performance.
Once the effects are understood it is possible for programmes and projects to reduce
(but never remove) the impact of these characteristics resulting in cost savings for the
business.
This paper also introduces concepts created by Data Management & Warehousing
including:
List of Figures
Figure 1 - Initial Operational System Data Model...................................................................... 6
Figure 2 - Initial Reporting System Data Model......................................................................... 7
Figure 3 - Second Version Operational System Data Model..................................................... 8
Figure 4 - The Sales Funnel .................................................................................................... 10
Figure 5 - Example data for PARTY_TYPES .......................................................................... 17
Figure 6 - Example Data for GEOGRAPHY_TYPES .............................................................. 18
Figure 7 - Example data for TIME_BANDS ............................................................................. 19
Figure 8 - Party Properties Example ....................................................................................... 20
Figure 9 - Example Party Property Data ................................................................................. 20
Figure 10 - Example data for PARTY_PROPERTIES............................................................. 21
Figure 11 - Example Data for PARTY_PROPERTY_TYPES.................................................. 21
Figure 12 - Example Data for PARTY_PROPERTIES ............................................................ 21
Figure 13 - Example Data for PARTY_PROPERTIES ............................................................ 22
Figure 14 - Party Events Example........................................................................................... 22
Figure 15 - Party Links Example ............................................................................................. 23
Figure 16 - Party Segments Example ..................................................................................... 24
Figure 17 – Party Geography History Example ....................................................................... 26
Figure 18 - The Example Bank Data Model ............................................................................ 31
Figure 19 - Volume & Complexity Correlations ....................................................................... 32
Figure 20 - PARTIES view mapping........................................................................................ 34
Figure 21 - Vertically Partitioned Data..................................................................................... 35
Figure 22 - Horizontally Partitioned Data ................................................................................ 35
Figure 23 - Data Commutativity............................................................................................... 39
Figure 24 - _START_DATE Rules .......................................................................................... 47
Figure 25 - _END_DATE Rules............................................................................................... 47
Figure 26 - Column Name Abbreviations ................................................................................ 49
Figure 27 - Standard _TYPES table ........................................................................................ 50
Figure 28 - Standard _BANDS table ....................................................................................... 51
Figure 29 - Standard _PROPERTIES table ............................................................................ 51
Figure 30 - Standard _EVENTS table ..................................................................................... 51
Figure 31 - Standard _LINKS table ......................................................................................... 51
Figure 32 - Standard _SEGMENTS table ............................................................................... 52
Figure 33 - Standard _HISTORY table.................................................................................... 52
Figure 34 - Typical Organisation Hierarchies .......................................................................... 53
Figure 35 - Stored Organisational Hierarchy ........................................................................... 53
Figure 36 - Relating Individuals to Roles................................................................................. 54
Figure 37 - TMForum Information Framework (SID) Version 8.0 overview............................. 56
Figure 38 - Set Processing Technique .................................................................................... 59
Copyright
© 2009 Data Management & Warehousing. All rights reserved. Reproduction not permitted
without written authorisation. References to other companies and their products use
trademarks owned by the respective companies and are for reference purposes only.