The Data Warehouse ETL Toolkit - Chapter 02

The Data Warehouse
ETL Toolkit
by Ralph Kimball
VSV Training
Chapter 2: ETL Data Structures
Prepared by: Hien Bui
Date: 09/02/2008
2.0 Introduction - ETL Data Structures
The ETL team will need a number of different
data structures to meet all the legitimate
staging needs.
2.1 To Stage or Not to Stage
The decision to store data in a physical
staging area versus processing it in memory is
ultimately the choice of the ETL architect.
Whether to stage your data depends on two
conflicting objectives:
Getting the data from the originating source to
the ultimate target as fast as possible

Having the ability to recover from failure
without restarting from the beginning of the
process
2.1 To Stage or Not to Stage (ct)
Consider the following reasons for staging data
before it is loaded into the data warehouse:
Recoverability. In most enterprise environments, it's a
good practice to stage the data as soon as it has been
extracted from the source system and then again
immediately after each of the major transformation steps.
Backup. Quite often, massive volume prevents the data
warehouse from being reliably backed up at the database
level.
Auditing. Many times the data lineage between the
source and target is lost in the ETL code.
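The recoverability argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's implementation: the run_pipeline helper, the step names, and the choice of JSON staging files are all hypothetical. Each step stages its output, and a rerun skips any step whose staged output already exists.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(steps, source_rows, staging_dir):
    """Run (name, function) steps in order, staging every result as JSON.

    On a rerun, a step whose staging file already exists is skipped and its
    staged output is read back instead. Returns (rows, names_of_steps_run).
    """
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    rows = source_rows
    executed = []
    for name, func in steps:
        staged = staging_dir / f"{name}.json"
        if staged.exists():
            # Recover: reuse the staged output instead of redoing the step.
            rows = json.loads(staged.read_text())
            continue
        rows = func(rows)
        # Write to a temporary file and rename it, so a crash in mid-write
        # never leaves a partial file that looks like a finished stage.
        tmp = staged.with_suffix(".tmp")
        tmp.write_text(json.dumps(rows))
        tmp.replace(staged)
        executed.append(name)
    return rows, executed


# Hypothetical steps, used only for the illustration.
def extract(rows):
    return list(rows)


def clean(rows):
    return [dict(row, name=row["name"].strip().title()) for row in rows]


def failing_load(rows):
    raise RuntimeError("target unavailable")


def load(rows):
    return rows
```

If a first run fails in the load step, a second run against the same staging directory reads the staged output of extract and clean and executes only load. In a real job flow the staged files would have to be cleared or versioned per batch; that is left out of the sketch.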
2.2 Designing the Staging Area
The staging area stores data on its way to the final
presentation area of the data warehouse.
Make sure you give serious thought to the various roles
that staging can play in your overall data warehouse
operations.
A given staging file can also be used for restarting the
job flow if a serious problem develops downstream,
and the staging file can be a form of audit or proof that
the data had specific content when it was processed.
2.2 Designing the Staging Area (ct)
The data-staging area must be owned by the
ETL team.
Users are not allowed in the staging area for
any reason.
Reports cannot access data from the staging
area.
Only ETL processes can write to and read from
the staging area.
2.2 Designing the Staging Area (ct)
The volumetric worksheet lists each table in the
staging area with the following information:
Table Name. The name of the table or file in the staging
area.
Update Strategy. This field indicates how the table is
maintained.
Load Frequency. Reveals how often the table is loaded
or changed by the ETL process.
ETL Job(s). Staging tables are populated or updated via
ETL jobs.
Initial Row Count. The ETL team must estimate how
many rows each table in the staging area initially
contains.

2.2 Designing the Staging Area (ct)
Average Row Length. For size-estimation purposes,
you must supply the DBA with the average row length in
each staging table.
Grows With. Even though tables are updated on a
scheduled interval, they don't necessarily grow each time
they are touched.
Expected Monthly Rows. This estimate is based on
history and business rules.
Expected Monthly Bytes. Expected Monthly Bytes is a
calculation of Average Row Length times Expected
Monthly Rows.

2.2 Designing the Staging Area (ct)
Initial Table Size. The initial table size is
usually represented in bytes or megabytes.

Table Size 6 Months. An estimation of table
sizes after six months of activity helps the DBA
team to estimate how the staging database or
file system grows.
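The worksheet arithmetic can be sketched as follows. Only the Expected Monthly Bytes formula (Average Row Length times Expected Monthly Rows) is stated above. The formulas used here for the initial size and the six-month size, the class name, and the sample figures are assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class StagingTableEstimate:
    """One row of a volumetric worksheet (field names are illustrative)."""
    table_name: str
    initial_row_count: int
    average_row_length: int      # bytes per row
    expected_monthly_rows: int

    @property
    def expected_monthly_bytes(self) -> int:
        # Stated in the worksheet: row length times expected monthly rows.
        return self.average_row_length * self.expected_monthly_rows

    @property
    def initial_table_size_bytes(self) -> int:
        # Assumption: initial size is row length times initial row count.
        return self.average_row_length * self.initial_row_count

    def size_after_months(self, months: int) -> int:
        # Assumption: linear growth on top of the initial size.
        return self.initial_table_size_bytes + months * self.expected_monthly_bytes


# Hypothetical table: 1,000,000 initial rows of 200 bytes, 50,000 new rows a month.
estimate = StagingTableEstimate("stg_orders", 1_000_000, 200, 50_000)
print(estimate.expected_monthly_bytes)   # 10000000
print(estimate.size_after_months(6))     # 260000000
```

Keeping the estimates in bytes and converting to megabytes only when reporting avoids rounding errors when many tables are summed.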
The ETL architect needs to arrange for the allocation
and configuration of data files that reside on the file
system as part of the data-staging area to support the
ETL process.
2.3 Data Structures in the ETL System
In this section, we describe the important
types of data structures you are likely to need
in your ETL system.
2.3.1 Flat Files

When data is stored in columns and rows within a file
on your file system to emulate a database table, it is
referred to as a flat or sequential file.
Arguments in favor of relational tables.
It is always faster to WRITE to a flat file as long
as you are truncating or inserting.
There is no real concept of UPDATING existing
records of a flat file efficiently.
When you READ from a staging table in the ETL
system
Being able to work in SQL and get automatic
database parallelism for free is a very elegant
approach.
2.3.1 Flat Files (ct)

Staging source data for safekeeping and
recovery.
Sorting data. Sorting is a prerequisite to virtually
every data integration task.
Filtering. Suppose you need to filter on an attribute
that is not indexed on the source database.
Replacing/substituting text strings.
Aggregation
Referencing source data.
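Two of the uses listed above, filtering and sorting, can be sketched with Python's standard csv module. This is a minimal sketch under assumptions: the filter_and_sort helper and the column names are hypothetical, the file is assumed to be comma-delimited with a header row, and the rows are sorted in memory, which would not suit a very large flat file.

```python
import csv
import io


def filter_and_sort(src, dst, predicate, sort_key):
    """Copy rows from src to dst, keeping rows that pass predicate, sorted.

    src and dst are open text files (or file-like objects). Assumes src is
    non-empty and has a header row. Returns the number of rows written.
    """
    reader = csv.DictReader(src)
    rows = [row for row in reader if predicate(row)]
    rows.sort(key=sort_key)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Hypothetical extract: keep California rows, ordered by customer_id.
source = io.StringIO("customer_id,state,amount\n3,CA,10\n1,NY,5\n2,CA,7\n")
target = io.StringIO()
written = filter_and_sort(
    source,
    target,
    predicate=lambda row: row["state"] == "CA",
    sort_key=lambda row: int(row["customer_id"]),
)
print(written)             # 2
print(target.getvalue())
```

The filter runs without touching the source database, so a missing index on the source attribute does not matter.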
2.3.2 XML Data Sets

XML is a language for data communication.
XML metadata consists of tags unambiguously
identifying each item in an XML document.

XML has extensive capability for declaring
hierarchical structures, such as complex forms
with nested repeating subfields.
XML is today an extremely effective medium
for moving data between otherwise
incompatible systems.
2.3.2 XML Data Sets (ct)

DTDs, XML Schemas, and XSLT
The DTD declaration for our customer example
could be cast as:
<!ELEMENT Customer (Name, Address, City?, State?, Postalcode?)>
<!ELEMENT Name (#PCDATA)>
XML Schemas contain much more database-oriented
information about data types.
XSLT is a general mechanism for translating
one XML document into another XML document.
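As an illustration of reading such a document into rows for staging, here is a short sketch using Python's standard xml.etree.ElementTree. The Customers wrapper element, the sample values, and the customers_to_rows helper are assumptions. Note that ElementTree does not validate against a DTD, so the optional fields (City?, State?, Postalcode?) are handled by returning None when an element is absent.

```python
import xml.etree.ElementTree as ET

# Field names follow the Customer element declaration.
CUSTOMER_FIELDS = ("Name", "Address", "City", "State", "Postalcode")


def customers_to_rows(xml_text):
    """Flatten every Customer element into a dict, one key per field.

    Missing optional elements come back as None.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for customer in root.iter("Customer"):
        rows.append({field: customer.findtext(field) for field in CUSTOMER_FIELDS})
    return rows


# Hypothetical document: the Customers wrapper is an assumption.
sample = """
<Customers>
  <Customer>
    <Name>Ann Lee</Name>
    <Address>1 Main St</Address>
    <City>Austin</City>
  </Customer>
</Customers>
"""
for row in customers_to_rows(sample):
    print(row)
```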
2.3.3 Relational Tables

Apparent metadata. One of the main
drawbacks of using flat files is that they lack
apparent metadata.
Relational abilities. Enforcing data or
referential integrity among entities is easy to
accomplish in a relational environment.
Open repository
DBA support
SQL interface
2.3.4 Independent DBMS Working Tables
If you decide to store your staging data in a
DBMS, you have several architecture options
when you are modeling the data-staging
schema.
To justify the use of independent staging
tables, we'll use one of our favorite
aphorisms: Keep it simple.
Most of the time, the reason you create a
staging table is to set the data down so you
can again manipulate it using SQL or a
scripting language.
2.3.5 Third Normal Form Entity/Relation Models
There are arguments that the data-staging
area is perhaps the central repository of all
the enterprise data that eventually gets
loaded into the data warehouse.
Remember two of the goals for designing your
ETL processes that we described at the beginning
of this chapter: Make them fast and make them
recoverable.
2.3.6 Nonrelational Data Sources
A common reason for creating a dedicated
staging environment is to integrate
nonrelational data.
The power of ETL tools in handling
heterogeneous data minimizes the need to
store all of the necessary data in a single
database.
2.3.7 Dimensional Data Models: The Handoff from the Back Room to the Front Room
Dimensional data structures are the target of
the ETL processes, and these tables sit at the
boundary between the back room and the
front room.
Dimensional data models are by far the most
popular data structures for end user querying
and analysis.
This section is a brief introduction to the main
table types in a dimensional model.
2.3.8 Fact Tables

A single measurement creates a single fact table record.
2.3.9 Dimension Tables

The dimensional model does not anticipate or
depend upon the intended query uses.

One of the great strengths of dimensional
models is their ability to gracefully add
dimensional context that is valid in the
context of the measurement event.
The primary surrogate keys in each dimension
are paired with corresponding foreign keys in
the fact table.
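The pairing of dimension surrogate keys with fact table foreign keys can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical, and SQLite stands in for whatever DBMS is actually used. Each row in fact_sales records one measurement and points at a dimension row through product_key.

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory dimension and fact table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.execute(
        "CREATE TABLE dim_product ("
        " product_key INTEGER PRIMARY KEY,"   # surrogate key
        " product_code TEXT NOT NULL,"        # natural key from the source
        " product_name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE fact_sales ("
        " product_key INTEGER NOT NULL REFERENCES dim_product (product_key),"
        " sale_date TEXT NOT NULL,"
        " quantity INTEGER NOT NULL)"         # the measurement
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "W-100", "Widget"), (2, "G-200", "Gadget")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, "2008-01-05", 3), (1, "2008-01-09", 2), (2, "2008-01-09", 1)],
    )
    conn.commit()
    return conn


def quantity_by_product(conn):
    """Join facts to the dimension through the surrogate key and total them."""
    return conn.execute(
        "SELECT d.product_name, SUM(f.quantity)"
        " FROM fact_sales AS f"
        " JOIN dim_product AS d ON d.product_key = f.product_key"
        " GROUP BY d.product_name ORDER BY d.product_name"
    ).fetchall()


print(quantity_by_product(build_star_schema()))  # [('Gadget', 1), ('Widget', 5)]
```

With foreign key enforcement switched on, an attempt to insert a fact row whose product_key has no matching dimension row is rejected.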
2.3.10 Atomic and Aggregate Fact Tables
It's good practice to partition fact tables
stored in the staging area because their
resulting aggregates will most likely be based
on a specific period, perhaps monthly or
quarterly.
Dimensionally designed tables in the staging
area are in many cases required for
populating on-line analytic processing (OLAP)
cubes.
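The relationship between atomic rows and a period-based aggregate can be sketched in plain Python. The row layout, the monthly_aggregate helper, and the ISO date strings are assumptions made for the example. Atomic rows are rolled up to one total per month and product.

```python
from collections import defaultdict


def monthly_aggregate(atomic_rows):
    """Sum quantity per (month, product_key) from atomic fact rows.

    Assumes sale_date is an ISO 'YYYY-MM-DD' string, so the first seven
    characters identify the month.
    """
    totals = defaultdict(int)
    for row in atomic_rows:
        month = row["sale_date"][:7]
        totals[(month, row["product_key"])] += row["quantity"]
    return dict(totals)


# Hypothetical atomic fact rows.
atomic = [
    {"sale_date": "2008-01-05", "product_key": 1, "quantity": 3},
    {"sale_date": "2008-01-09", "product_key": 1, "quantity": 2},
    {"sale_date": "2008-02-01", "product_key": 1, "quantity": 4},
]
print(monthly_aggregate(atomic))  # {('2008-01', 1): 5, ('2008-02', 1): 4}
```

If the atomic table is partitioned by the same period, rebuilding one month's aggregate only needs to read that month's partition.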
2.3.11 Surrogate Key Mapping Tables
Surrogate key mapping tables are designed to
map natural keys from the disparate source
systems to their master data warehouse
surrogate key.
Mapping tables can be equally effective if
they are stored in a database or on the file
system.
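A surrogate key mapping table can be sketched as an in-memory lookup. The SurrogateKeyMap class, its method names, and the source system names are hypothetical. A real mapping table would be persisted in a database or a file, as noted above.

```python
class SurrogateKeyMap:
    """Map (source system, natural key) pairs to warehouse surrogate keys."""

    def __init__(self):
        self._keys = {}
        self._next_key = 1

    def lookup_or_assign(self, source_system, natural_key):
        """Return the existing surrogate key, or assign the next free one."""
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = self._next_key
            self._next_key += 1
        return self._keys[pair]

    def add_alias(self, source_system, natural_key, surrogate_key):
        """Record that another source's natural key is the same entity."""
        if surrogate_key not in self._keys.values():
            raise KeyError(f"unknown surrogate key: {surrogate_key}")
        self._keys[(source_system, natural_key)] = surrogate_key


# Hypothetical source systems "crm" and "billing" describing one customer.
key_map = SurrogateKeyMap()
customer_key = key_map.lookup_or_assign("crm", "C-17")
key_map.add_alias("billing", "900", customer_key)
print(key_map.lookup_or_assign("billing", "900") == customer_key)  # True
```

The alias step is where natural keys from disparate sources converge on one master surrogate key. Deciding that two natural keys describe the same entity is a matching problem outside this sketch.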
2.4 Planning and Design Standards
The data-staging area must be a controlled
environment.
People, especially developers, are very
creative when it comes to reusing existing
resources.
2.4.1 Impact Analysis

Impact analysis examines the metadata
associated with an object (in this case a table or
column) and determines what is affected by a
change to its structure or content.
Once a table is created in the staging area,
you must perform impact analysis before any
changes are made to it.
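Impact analysis can be sketched as a walk over dependency metadata. The dependents dictionary, the object names, and the impacted_objects helper are assumptions for the illustration. In practice this metadata would come from the ETL tool or a metadata repository.

```python
from collections import deque


def impacted_objects(dependents, changed):
    """Return every object downstream of `changed`, in breadth-first order.

    `dependents` maps an object name to the objects that read from it directly.
    """
    seen = set()
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for name in dependents.get(current, ()):
            if name not in seen and name != changed:
                seen.add(name)
                order.append(name)
                queue.append(name)
    return order


# Hypothetical metadata: which jobs and tables depend on which objects.
dependents = {
    "stg_customer": ["job_load_dim_customer", "job_dedupe_customer"],
    "job_load_dim_customer": ["dim_customer"],
    "dim_customer": ["job_load_fact_sales"],
}
print(impacted_objects(dependents, "stg_customer"))
```

Here a change to stg_customer reaches two jobs directly, and through them the dimension table and the fact load.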
2.4.2 Metadata Capture

Metadata has many different meanings
depending on its context.

Types of metadata derived by the staging
area include the following:
Data Lineage
Business Definitions
Technical Definitions
Process Metadata
2.4.3 Naming Conventions
The data-staging area may contain tables or
elements that are not in the data warehouse
presentation layer and do not have
established naming standards.
Work with the data warehouse team and DBA
group to embellish the existing naming
standards to include special data-staging
tables.
2.4.4 Auditing Data Transformation Steps
Replacing natural keys with surrogate keys
Combining and deduplicating entities
Conforming commonly used attributes in
dimensions
Standardizing calculations, creating
conformed key performance indicators (KPIs)
Correcting and coercing data in the data
cleaning routines
2.5 Summary
We have reviewed the primary data structures you
need in your ETL system.

We started by making the case for staging data in
many places, for transient and permanent needs.
A mature ETL environment will be a mixture of flat
files, independent relational tables.
We touched on some best-practice issues, including
adopting a set of consistent design standards,
performing systematic impact analyses on your table
designs, and changing those table designs.
