
The Data Warehouse ETL Toolkit
by Ralph Kimball
VSV Training
Chapter 2: ETL Data Structures
Prepared by: Hien Bui
Date: 09/02/2008

2.0 Introduction - ETL Data Structures
The ETL team will need a number of different
data structures to meet all the legitimate
staging needs.

2.1 To Stage or Not to Stage
The decision to store data in a physical
staging area versus processing it in memory is
ultimately the choice of the ETL architect.
Whether to stage your data depends on two
conflicting objectives:
Getting the data from the originating source to
the ultimate target as fast as possible
Having the ability to recover from failure
without restarting from the beginning of the
process

2.1 To Stage or Not to Stage (ct)
Consider the following reasons for staging data before
it is loaded into the data warehouse:
Recoverability. In most enterprise environments, it's
good practice to stage the data as soon as it has been
extracted from the source system and then again
immediately after each of the major transformation steps.
Backup. Quite often, massive volume prevents the data
warehouse from being reliably backed up at the database
level. Saved and archived load files can serve as a
fallback.
Auditing. Many times the data lineage between the
source and target is lost in the ETL code. Staged data
lets you compare the input of a step with its output.
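
The recoverability argument above can be illustrated with a minimal sketch, assuming CSV staging files on the local file system. The helper names (stage_rows, run_step), the file names, and the sample rows are hypothetical and are not taken from the book. Each step stages its output, and a rerun picks up the staged file instead of repeating the work.

```python
import csv
import os
import tempfile

def stage_rows(rows, path, fieldnames):
    """Write rows to a staging file so later steps can restart from it."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def read_staged(path):
    """Read a staging file back as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def run_step(path, fieldnames, produce):
    """Reuse the staged file if it exists; otherwise run the step and stage it."""
    if not os.path.exists(path):
        stage_rows(produce(), path, fieldnames)
    return read_staged(path)

def extract():
    # Hypothetical stand-in for a query against the source system.
    return [{"id": "1", "name": " alice "}, {"id": "2", "name": "BOB"}]

def transform(rows):
    # Hypothetical cleaning step: trim and normalize the name.
    return [{"id": r["id"], "name": r["name"].strip().title()} for r in rows]

staging_dir = tempfile.mkdtemp()
extract_file = os.path.join(staging_dir, "customer_extract.csv")
clean_file = os.path.join(staging_dir, "customer_clean.csv")

extracted = run_step(extract_file, ["id", "name"], extract)
cleaned = run_step(clean_file, ["id", "name"], lambda: transform(extracted))
print(cleaned)
```

If the job fails after the extract has been staged, a rerun skips the extract and resumes from customer_extract.csv. A production version would also need to guard against partially written staging files, which this sketch does not do.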

2.2 Designing the Staging Area
The staging area stores data on its way to the final
presentation area of the data warehouse.
Make sure you give serious thought to the various roles
that staging can play in your overall data warehouse
operations.
A given staging file can also be used for restarting the
job flow if a serious problem develops downstream,
and the staging file can be a form of audit or proof that
the data had specific content when it was processed.

2.2 Designing the Staging Area (ct)
The data-staging area must be owned by the
ETL team.
Users are not allowed in the staging area for
any reason.
Reports cannot access data from the staging
area.
Only ETL processes can write to and read from
the staging area.

2.2 Designing the Staging Area (ct)
The volumetric worksheet lists each table in the
staging area with the following information:
Table Name. The name of the table or file in the staging
area.
Update Strategy. This field indicates how the table is
maintained.
Load Frequency. Reveals how often the table is loaded
or changed by the ETL process.
ETL Job(s). Staging tables are populated or updated via
ETL jobs.
Initial Row Count. The ETL team must estimate how
many rows each table in the staging area initially
contains.

2.2 Designing the Staging Area (ct)
Average Row Length. For size-estimation purposes,
you must supply the DBA with the average row length in
each staging table.
Grows With. Even though tables are updated on a
scheduled interval, they don't necessarily grow each time
they are touched.
Expected Monthly Rows. This estimate is based on
history and business rules.
Expected Monthly Bytes. Average Row Length multiplied
by Expected Monthly Rows.

2.2 Designing the Staging Area (ct)
Initial Table Size. The initial table size is
usually represented in bytes or megabytes.
Table Size 6 Months. An estimate of table
sizes after six months of activity helps the DBA
team to estimate how the staging database or
file system grows.
The ETL architect needs to arrange for the allocation
and configuration of data files that reside on the file
system as part of the data-staging area to support the
ETL process.
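
The worksheet arithmetic can be sketched as follows. Only the Expected Monthly Bytes formula (Average Row Length times Expected Monthly Rows) is stated in the slides. The formulas for Initial Table Size and Table Size 6 Months are assumptions made for illustration, and the class name, table name, and figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StagingTableEstimate:
    table_name: str
    initial_row_count: int
    average_row_length: int       # bytes per row
    expected_monthly_rows: int

    @property
    def expected_monthly_bytes(self) -> int:
        # Stated in the worksheet: average row length x expected monthly rows.
        return self.average_row_length * self.expected_monthly_rows

    @property
    def initial_table_size(self) -> int:
        # Assumption: average row length x initial row count, in bytes.
        return self.average_row_length * self.initial_row_count

    def size_after_months(self, months: int) -> int:
        # Assumption: linear growth on top of the initial size.
        return self.initial_table_size + months * self.expected_monthly_bytes

est = StagingTableEstimate(
    table_name="S_CUSTOMER",      # hypothetical staging table
    initial_row_count=100_000,
    average_row_length=200,
    expected_monthly_rows=5_000,
)
print(est.expected_monthly_bytes)   # 1000000
print(est.initial_table_size)       # 20000000
print(est.size_after_months(6))     # 26000000
```

Summing size_after_months(6) over every row of the worksheet would give the DBA team a rough figure for the whole staging area.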

2.3 Data Structures in the ETL System
In this section, we describe the important
types of data structures you are likely to need
in your ETL system.

2.3.1 Flat Files
When data is stored in columns and rows within a file
on your file system to emulate a database table, it is
referred to as a flat or sequential file.
Weighing flat files against relational tables:
It is always faster to WRITE to a flat file as long as
you are truncating or inserting.
There is no real concept of UPDATING existing
records of a flat file efficiently, which favors
relational tables for data that must be updated in place.
When you READ from a staging table in the ETL
system, database storage is generally the better
choice if you need to filter or join.
Being able to work in SQL and get automatic
database parallelism for free is a very elegant
approach.

2.3.1 Flat Files (ct)


Flat files are most practical for the following tasks:
Staging source data for safekeeping and
recovery.
Sorting data. Sorting is a prerequisite to virtually
every data integration task.
Filtering. Suppose you need to filter on an attribute
that is not indexed on the source database.
Replacing/substituting text strings.
Aggregation
Referencing source data.
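
Several of the tasks listed above (filtering, sorting, aggregation) can be done on a staged flat file without touching the source database. The following is a minimal sketch using Python's csv module. The column names and sample rows are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical staged extract; in practice this would be a file on disk.
staged = io.StringIO(
    "order_id,region,amount\n"
    "3,EAST,20.00\n"
    "1,WEST,15.50\n"
    "2,EAST,7.25\n"
)

rows = list(csv.DictReader(staged))

# Filtering: keep one region without needing an index on the source.
east = [r for r in rows if r["region"] == "EAST"]

# Sorting: order by the numeric value of order_id.
east.sort(key=lambda r: int(r["order_id"]))

# Aggregation: total amount for the filtered rows.
total = sum(float(r["amount"]) for r in east)

print([r["order_id"] for r in east])  # ['2', '3']
print(total)                          # 27.25
```

This loads the whole file into memory, which suits a small example. Very large staging files are usually sorted with a dedicated sort utility instead.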

2.3.2 XML Data Sets


XML is a language for data communication.
XML metadata consists of tags unambiguously
identifying each item in an XML document.
XML has extensive capability for declaring
hierarchical structures, such as complex forms
with nested repeating subfields.
XML is today an extremely effective medium
for moving data between otherwise
incompatible systems.

2.3.2 XML Data Sets (ct)


DTDs, XML Schemas, and XSLT
The DTD declaration for our customer example
could be cast as:
<!ELEMENT Customer (Name, Address, City?, State?, Postalcode?)>
<!ELEMENT Name (#PCDATA)>
XML Schemas contain much more database-oriented
information about data types.
XSLT is a general mechanism for translating
one XML document into another XML document.
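
To show how a document shaped like the Customer DTD above might be read into a flat staging row, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document and the customer_to_row helper are hypothetical. ElementTree does not validate against a DTD, so the optional elements (City?, State?, Postalcode?) are simply returned as None when absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the Customer element declaration.
doc = """<Customer>
  <Name>Jane Doe</Name>
  <Address>1 Main St</Address>
  <City>Springfield</City>
</Customer>"""

# Child elements in the order the DTD declares them.
FIELDS = ["Name", "Address", "City", "State", "Postalcode"]

def customer_to_row(xml_text):
    """Flatten one Customer element into a dict keyed by child tag."""
    root = ET.fromstring(xml_text)
    # findtext returns None when an optional child element is missing.
    return {field: root.findtext(field) for field in FIELDS}

row = customer_to_row(doc)
print(row["Name"])    # Jane Doe
print(row["State"])   # None
```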

2.3.3 Relational Tables


Apparent metadata. One of the main
drawbacks of using flat files is that they lack
apparent metadata.
Relational abilities. Enforcing data or
referential integrity among entities is easy to
accomplish in a relational environment.
Open repository
DBA support
SQL interface

2.3.4 Independent DBMS Working Tables
If you decide to store your staging data in a
DBMS, you have several architecture options
when you are modeling the data-staging
schema.
To justify the use of independent staging
tables, we'll use one of our favorite
aphorisms: Keep it simple.
Most of the time, the reason you create a
staging table is to set the data down so you
can again manipulate it using SQL or a
scripting language.

2.3.5 Third Normal Form Entity/Relation Models
There are arguments that the data-staging
area is perhaps the central repository of all
the enterprise data that eventually gets
loaded into the data warehouse.
Remember two of the goals for designing your
ETL processes we describe at the beginning of
this chapter: Make them fast and make them
recoverable.

2.3.6 Nonrelational Data Sources
A common reason for creating a dedicated
staging environment is to integrate
nonrelational data.
The power of ETL tools in handling
heterogeneous data minimizes the need to
store all of the necessary data in a single
database.

2.3.7 Dimensional Data Models: The Handoff from the Back Room to the Front Room
Dimensional data structures are the target of
the ETL processes, and these tables sit at the
boundary between the back room and the
front room.
Dimensional data models are by far the most
popular data structures for end-user querying
and analysis.
This section is a brief introduction to the main
table types in a dimensional model.

2.3.8 Fact Tables


A single measurement creates a single fact table record.

2.3.9 Dimension Tables


The dimensional model does not anticipate or
depend upon the intended query uses.


One of the great strengths of dimensional
models is their ability to gracefully add
dimensional context that is valid in the
context of the measurement event.
The primary surrogate keys in each dimension
are paired with corresponding foreign keys in
the fact table.
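
The pairing of dimension surrogate keys with fact table foreign keys can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical and chosen only to illustrate the relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key
    product_code TEXT,                 -- natural key from the source system
    product_name TEXT
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),  -- foreign key
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'SKU-A', 'Widget'), (2, 'SKU-B', 'Gadget');
-- One fact row per measurement event.
INSERT INTO fact_sales VALUES (1, 2, 20.0), (1, 1, 10.0), (2, 5, 75.0);
""")

# Join each fact row to its dimension row through the surrogate key.
rows = conn.execute("""
    SELECT d.product_name, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.product_name
    ORDER BY d.product_name
""").fetchall()
print(rows)  # [('Gadget', 5, 75.0), ('Widget', 3, 30.0)]
```

The query constrains and groups only through the dimension, so adding another dimension later means adding another key column to the fact table and leaving existing queries unchanged.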

2.3.10 Atomic and Aggregate Fact Tables
It's good practice to partition fact tables
stored in the staging area because the
resulting aggregates will most likely be based
on a specific period, perhaps monthly or
quarterly.
Dimensionally designed tables in the staging
area are in many cases required for
populating on-line analytic processing (OLAP)
cubes.

2.3.11 Surrogate Key Mapping Tables
Surrogate key mapping tables are designed to
map natural keys from the disparate source
systems to their master data warehouse
surrogate key.
Mapping tables can be equally effective if
they are stored in a database or on the file
system.
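
A surrogate key mapping table can be sketched as an in-memory lookup keyed by source system and natural key. The SurrogateKeyMap class, its method names, and the sample keys are hypothetical. A real mapping table would be persisted in a database or a file, as the slide notes.

```python
class SurrogateKeyMap:
    """Maps (source_system, natural_key) pairs to warehouse surrogate keys."""

    def __init__(self):
        self._keys = {}
        self._next_key = 1

    def lookup_or_assign(self, source_system, natural_key):
        """Return the existing surrogate key, or assign the next one."""
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = self._next_key
            self._next_key += 1
        return self._keys[pair]

    def link(self, source_system, natural_key, surrogate_key):
        """Attach another source's natural key to an existing surrogate key."""
        self._keys[(source_system, natural_key)] = surrogate_key

key_map = SurrogateKeyMap()
crm_key = key_map.lookup_or_assign("CRM", "C-1001")
# The same customer is known to the billing system under a different key.
key_map.link("BILLING", "98-77", crm_key)
billing_key = key_map.lookup_or_assign("BILLING", "98-77")
other_key = key_map.lookup_or_assign("CRM", "C-1002")

print(crm_key, billing_key, other_key)  # 1 1 2
```

Deciding that two natural keys refer to the same entity is a separate matching problem. This sketch assumes that decision has already been made before link is called.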

2.4 Planning and Design Standards
The data-staging area must be a controlled
environment.
People, especially developers, are very
creative when it comes to reusing existing
resources.

2.4.1 Impact Analysis


Impact analysis examines the metadata
associated with an object (in this case a table or
column) and determines what is affected by a
change to its structure or content.
Once a table is created in the staging area,
you must perform impact analysis before any
changes are made to it.
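
One way to picture impact analysis is as a walk over dependency metadata. The sketch below assumes the metadata is available as a simple mapping from each object to the objects that read from it. The DEPENDENTS mapping, the object names, and the impacted_by function are hypothetical.

```python
from collections import deque

# Hypothetical metadata: each object maps to the objects that read from it.
DEPENDENTS = {
    "stg_customer": ["job_clean_customer"],
    "job_clean_customer": ["stg_customer_clean"],
    "stg_customer_clean": ["job_load_dim_customer", "job_audit_customer"],
    "job_load_dim_customer": ["dim_customer"],
}

def impacted_by(obj, dependents):
    """Return every object downstream of obj, found breadth-first."""
    seen = set()
    queue = deque([obj])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(impacted_by("stg_customer_clean", DEPENDENTS))
# ['dim_customer', 'job_audit_customer', 'job_load_dim_customer']
```

Running this before altering a staging table lists the jobs and tables that would need to be reviewed first.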

2.4.2 Metadata Capture


Metadata has many different meanings
depending on its context.
Types of metadata derived by the staging
area include the following:
Data Lineage
Business Definitions
Technical Definitions
Process Metadata

2.4.3 Naming Conventions
The data-staging area may contain tables or
elements that are not in the data warehouse
presentation layer and do not have
established naming standards.
Work with the data warehouse team and DBA
group to embellish the existing naming
standards to include special data-staging
tables.

2.4.4 Auditing Data Transformation Steps
Transformation steps to audit include:
Replacing natural keys with surrogate keys
Combining and deduplicating entities
Conforming commonly used attributes in
dimensions
Standardizing calculations, creating
conformed key performance indicators (KPIs)
Correcting and coercing data in the data
cleaning routines

2.5 Summary
We have reviewed the primary data structures you
need in your ETL system.
We started by making the case for staging data in
many places, for transient and permanent needs.
A mature ETL environment will be a mixture of flat
files, independent relational tables, normalized
models, and other structured files such as XML
documents.
We touched on some best-practice issues, including
adopting a set of consistent design standards,
performing systematic impact analyses on your table
designs, and changing those table designs.
