Version 1.0
Wes Dumey
Copyright 2006
Protected by the ‘Open Document License’
ETL Methodology Document
By using this document you are agreeing to the terms listed above.
Page 2 of 12 6/13/2012
Overview
This document is designed for use by business associates and technical resources to better
understand the process of building a data warehouse and the methodology employed to
build the EDW.
ETL Definitions
ETL (Extract, Transform, Load) – The physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database.

EDW (Enterprise Data Warehouse) – The logical data warehouse designed for enterprise information storage and reporting.

DM (Data Mart) – A small subset of a data warehouse specifically defined for a subject area.
Documentation Specifications
Accurate business information requirements are a primary driver of the entire process.
Durable Impact Consulting will use standard documents prepared by the Project
Management Institute for requirements gathering, project signoff, and compiling all
testing information.
Tables
All destination tables will utilize the following naming convention:
EDW_<SUBJECT>_<TYPE>
There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate,
Staging, Temp, and Audit. Sample names are listed below the quick overview of table
types.
Each type of table will be kept in a separate schema. This will decrease maintenance
work and time spent looking for a specific table.
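The naming convention above can be enforced mechanically. The sketch below assumes short type suffixes (FACT, DIM, AGG, STG, TEMP, AUDIT) as abbreviations for the six table types; the actual suffixes used in the EDW may differ.

```python
# Assumed abbreviations for the six table types named above.
TABLE_TYPES = {"FACT", "DIM", "AGG", "STG", "TEMP", "AUDIT"}

def is_valid_table_name(name):
    """Check a name against the EDW_<SUBJECT>_<TYPE> convention."""
    parts = name.split("_")
    return (len(parts) >= 3
            and parts[0] == "EDW"
            and parts[-1] in TABLE_TYPES
            and all(parts[1:-1]))  # subject segments must be non-empty
```

A validator like this can run as a pre-deployment check so nonconforming tables never reach a schema.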
ETL Processing
The following types of ETL jobs will be used for processing. The table below lists each
job type and its naming convention and explains the job's functions.
Comments
Every job will have a standard comment template that specifically spells out the
following attributes of the job:
In addition, a job data dictionary will describe every job in a table that can be easily
searched via standard SQL.
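Such a dictionary table can be queried with standard SQL as intended. The sketch below uses an in-memory SQLite database; the table name ETL_JOB_DICTIONARY, its columns, and the sample jobs are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ETL_JOB_DICTIONARY (
    JOB_NAME TEXT PRIMARY KEY,
    JOB_TYPE TEXT,
    DESCRIPTION TEXT)""")
conn.executemany(
    "INSERT INTO ETL_JOB_DICTIONARY VALUES (?, ?, ?)",
    [("SRC_CUSTOMER", "Source", "Extracts customer rows from the source system"),
     ("TRN_CUSTOMER", "Transform", "Applies business rules to customer data")])

# A standard SQL search over the job descriptions:
rows = conn.execute(
    "SELECT JOB_NAME FROM ETL_JOB_DICTIONARY WHERE DESCRIPTION LIKE ?",
    ("%customer%",)).fetchall()
```

Keeping the dictionary in a relational table means any analyst with SQL access can answer "which jobs touch customer data?" without reading job source.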
Auditing
The ETL methodology maintains a process for providing audit and logging capabilities.
For each run of the process, a unique batch number composed of the time segments is
created. This batch number is loaded with the data into the PSA (persistent staging area)
and all target tables. In addition, an entry with the following data elements will be made
in the ETL_PROCESS_AUDIT table.
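A batch number "composed of the time segments" can be built as below. The exact segment layout (year through second) is an assumption; any layout works as long as it is unique per run and sortable.

```python
from datetime import datetime

def make_batch_number(run_time):
    """Compose the unique batch number from the run's time segments:
    year, month, day, hour, minute, second (assumed layout)."""
    return run_time.strftime("%Y%m%d%H%M%S")

batch = make_batch_number(datetime(2006, 6, 13, 14, 30, 5))
```

Because the same batch number is stamped on the PSA rows, the target rows, and the audit entry, a single value joins all three when tracing a run.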
The audit process will allow for efficient logging of process execution and encountered
errors.
Quality
Due to the sensitive nature of data within the EDW, data quality is a driving priority.
Quality will be handled through the following processes:
1. Source job – the source job will contain a quick data-scrubbing mechanism that
verifies the data conforms to the expected type (numeric fields contain numbers;
character fields contain letters).
2. Transform – the transform job will contain matching metadata of the target table
and verify that NULL values are not loaded into NOT NULL columns and that
the data is transformed correctly.
3. QualityCheck – a separate job is created to do a cursory check on a few identified
columns and verify that the correct data is loaded into these columns.
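The source-job type check from step 1 can be sketched as a single predicate; the type names NUMERIC and CHARACTER are illustrative, and types outside those two are assumed to pass through unchecked.

```python
def conforms(value, expected_type):
    """Quick scrub from the source job: numeric fields must hold digits,
    character fields must hold letters; other types pass unchecked."""
    if expected_type == "NUMERIC":
        return value.isdigit()
    if expected_type == "CHARACTER":
        return value.isalpha()
    return True
```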
Source Quality
A data-scrubbing mechanism will be constructed. This mechanism will check identified
columns for anomalies (e.g., embedded carriage returns) and value domains. If an
error is discovered, the data is fixed and a record is written to the
ETL_QUALITY_ISSUES table (see below for the table definition).
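The scrubbing pass above can be sketched as follows. The issue-record layout and issue codes are illustrative assumptions; the real records would carry the full ETL_QUALITY_ISSUES columns.

```python
def scrub(row, domains):
    """Check columns for anomalies (embedded carriage returns) and for
    out-of-domain values; fix the data in place and return the issue
    records that would be written to ETL_QUALITY_ISSUES."""
    issues = []
    for col, value in row.items():
        if "\r" in value or "\n" in value:
            # Fix the anomaly, then record that it happened.
            row[col] = value.replace("\r", " ").replace("\n", " ")
            issues.append({"COLUMN": col, "ISSUE": "EMBEDDED_CR"})
        if col in domains and row[col] not in domains[col]:
            issues.append({"COLUMN": col, "ISSUE": "OUT_OF_DOMAIN"})
    return issues
```

Fixing the data while still logging the issue keeps the load moving without losing the audit trail.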
Transform Quality
The transformation job will employ a matching metadata technique. If the target table
enforces NOT NULL constraints, a check will be built into the job preventing NULLS
from being loaded and causing a jobstream abend.
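The matching-metadata check can be reduced to listing the violated columns before the load ever runs, so the row can be diverted instead of abending the jobstream. The function name and row shape here are illustrative.

```python
def not_null_violations(row, not_null_columns):
    """Matching-metadata check from the transform job: list the NOT NULL
    target columns this row would violate, so the job can divert the row
    before the load abends the jobstream."""
    return sorted(c for c in not_null_columns if row.get(c) is None)
```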
Quality Check
Quality check is the last point of validation within the jobstream. QC can be configured
to check any percentage of rows (0–100%) and any number of columns (1–X). QC is
designed to pay attention to the most valuable or vulnerable rows within the data sets. QC
will use a modified version of the data-scrubbing engine used during the source job to
derive correct values and reference the rules listed in the ETL_QC_DRIVER table. Any
suspect rows will be pulled from the insert/update files and updated to an 'R' status in the
PSA table, and an issue code will be created for the failure.
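The configurable-percentage check can be sketched as below. Checking the leading rows is a stand-in for whatever row selection the real ETL_QC_DRIVER rules specify, and the PSA_STATUS field name is an assumption.

```python
def quality_check(rows, pct, rule):
    """QC sketch: apply a rule to pct% of the rows. Failing rows are
    marked 'R' (rejected) and pulled from the set bound for the
    insert/update files."""
    sample_size = len(rows) * pct // 100
    passed, suspect = [], []
    for i, row in enumerate(rows):
        if i < sample_size and not rule(row):
            row["PSA_STATUS"] = "R"
            suspect.append(row)
        else:
            passed.append(row)
    return passed, suspect
```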
Data that fails the QC job will not be loaded into the EDW, based on defined rules. An
entry will be made in the following table (ETL_QUALITY_ISSUES). An indicator
will show the value of the column as defined in the rules ('H' high, 'L' low), allowing
resources to be directed efficiently when tracing errors.
ETL_QUALITY_ISSUES
ETL_QUALITY_AUDIT
Lookup Dimension
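As a minimal sketch of a typical lookup-dimension step: natural keys from the source are resolved to dimension surrogate keys, with unmatched rows falling back to a default "unknown" member so fact loads never drop rows. The key map and the -1 default are illustrative assumptions, not the EDW's actual design.

```python
customer_dim = {"CUST001": 1, "CUST002": 2}  # natural key -> surrogate key

def lookup_dimension_key(dim, natural_key, unknown_key=-1):
    """Resolve a natural key to its surrogate key; unmatched rows get
    the default 'unknown' dimension member."""
    return dim.get(natural_key, unknown_key)
```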
Transform
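A minimal sketch of a transform job, consistent with the quality and audit sections above: values are standardized and the run's batch number is stamped on each row before the load step. The specific field handling shown is an illustrative assumption, not the EDW's actual transformation rules.

```python
def transform(row, batch_nbr):
    """Transform-job sketch: standardize string values and stamp the
    audit batch number before the load step."""
    out = {k: v.strip().upper() if isinstance(v, str) else v
           for k, v in row.items()}
    out["BATCH_NBR"] = batch_nbr
    return out
```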
Load
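A minimal sketch of a load job, using an in-memory SQLite database as a stand-in for the warehouse; the target table and column names are illustrative assumptions following the EDW_<SUBJECT>_<TYPE> convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EDW_CUSTOMER_DIM (CUST_KEY INTEGER, NAME TEXT, BATCH_NBR TEXT)")

def load(rows):
    """Load-job sketch: bulk-insert transformed rows into the target
    EDW table, batch number included for auditing."""
    conn.executemany(
        "INSERT INTO EDW_CUSTOMER_DIM VALUES (?, ?, ?)",
        [(r["CUST_KEY"], r["NAME"], r["BATCH_NBR"]) for r in rows])
    conn.commit()

load([{"CUST_KEY": 1, "NAME": "SMITH", "BATCH_NBR": "20060613143005"}])
row_count = conn.execute("SELECT COUNT(*) FROM EDW_CUSTOMER_DIM").fetchone()[0]
```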
Closing
After reading this ETL document you should have a better understanding of the issues
associated with ETL processing. This methodology has been created to address as many
of those issues as possible while providing high performance, ease of maintenance, and
scalability, and remaining workable in a real-time ETL processing scenario.