
ETL Design Methodology Document

Microsoft SQL Server Integration Services


Durable Impact Consulting, Inc.
www.durableimpact.com

Version 1.0
Wes Dumey

Copyright © 2006
Protected by the ‘Open Document License’

Document Licensing Standards


This document is protected by the copyright laws of the United States of America. In
order to facilitate development of an open source ETL Methodology document, users are
permitted to modify this document at will, with the expressed understanding that any
changes that improve upon or add to this methodology become property of the open
community and must be forwarded back to the original author for inclusion in future
releases of this document. This document or any portion thereof may not be sold or
bartered for any form of compensation without expressed written consent of the original
author.

By using this document you are agreeing to the terms listed above.


Overview
This document is intended for business associates and technical resources who want to better
understand the process of building a data warehouse and the methodology employed to
build the enterprise data warehouse (EDW).

This methodology has been designed to provide the following benefits:


1. A high level of performance
2. Scalability to any size
3. Ease of maintenance
4. Boiler-plate development
5. Standard documentation techniques

ETL Definitions

ETL – Extract, Transform, Load: the physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database.
EDW – Enterprise Data Warehouse: the logical data warehouse designed for enterprise information storage and reporting.
DM – Data Mart: a small subset of a data warehouse specifically defined for a subject area.

Documentation Specifications
Accurate business information requirements are a primary driver of the entire process.
Durable Impact Consulting will use standard documents prepared by the Project
Management Institute for requirements gathering, project signoff, and compiling all
testing information.

ETL Naming Conventions


To maintain consistency, all ETL processes will follow a standard naming methodology.


Tables
All destination tables will utilize the following naming convention:
EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate,
Staging, Temp, and Audit. A quick overview of the table types is given first, followed by sample names.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (e.g., totals by day)
Staging – tables used to store data during ETL processing; the data is not removed immediately
Temp – tables used during ETL processing that can be truncated immediately afterwards (e.g., storing order IDs for lookup)
Audit – tables used to keep track of the ETL process (e.g., processing times by job)

Each type of table will be kept in a separate schema. This will decrease maintenance
work and time spent looking for a specific table.
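
As a minimal sketch of this layout in SQL Server (the schema names below are illustrative assumptions, not names mandated by this methodology):

-- One schema per table type; CREATE SCHEMA must be the only statement in its batch.
CREATE SCHEMA fact;
GO
CREATE SCHEMA dim;
GO
CREATE SCHEMA agg;
GO
CREATE SCHEMA stg;
GO
CREATE SCHEMA tmp;
GO
CREATE SCHEMA audit;
GO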

Sample table names:

EDW_RX_FACT – fact table containing RX subject matter
EDW_TIME_DIM – dimension table containing TIME subject matter
EDW_CUSTOMER_AG – aggregate table containing CUSTOMER subject matter
ETL_PROCESS_AUDIT – audit table containing PROCESS data
STG_DI_CUSTOMER – staging table sourced from the DI system, used for CUSTOMER data processing
ETL_ADDRESS_TEMP – temp table used for ADDRESS processing
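
For illustration, a fact table following the EDW_<SUBJECT>_<TYPE> convention could be declared as below. The column list is hypothetical and only shows the shape; the fact schema is carried over from the sketch above.

CREATE TABLE fact.EDW_RX_FACT (
    RX_KEY        BIGINT        NOT NULL,  -- illustrative surrogate key
    TIME_KEY      INT           NOT NULL,  -- joins to EDW_TIME_DIM
    CUSTOMER_KEY  INT           NOT NULL,  -- joins to the customer dimension
    FILL_QUANTITY DECIMAL(9, 2) NULL,      -- illustrative measure
    BATCH_NUMBER  BIGINT        NOT NULL   -- audit batch number (see Auditing)
);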

ETL Processing

The following types of ETL jobs will be used for processing. The list below gives each job
type, its function, and its naming convention.

Extract – extracts information from a source system and places it in a staging table.
  Naming convention: Extract<Source><Subject> (e.g., ExtractDICustomer)

Load PSA – loads the persistent staging area.
  Naming convention: LoadPSA<Table>

Source and LoadTemp – sources information from STG tables, performs column validation, and loads the temp tables used in processing.
  Naming convention: Source<Table> (e.g., SourceSTGDICustomer)

Lookup Unload Dimensions – looks up and unloads dimension tables into flat files.
  Naming convention: LookupUnloadDimensions

Lookup Unload Facts – looks up and unloads fact tables into flat files.
  Naming convention: LookupUnloadFacts

Transform Facts – transforms the fact subject area data and generates insert files.
  Naming convention: TransformFacts

Transform Dimensions – transforms the dimension subject area data and generates insert files.
  Naming convention: TransformDimensions

Quality Check – checks the quality of the data before it is loaded into the EDW.
  Naming convention: QualityCheck<Subject> (e.g., QualityCheckCustomer)

Aggregate – aggregates data.
  Naming convention: Aggregate

Update Records – loads/inserts the changed records into the EDW.
  Naming convention: UpdateRecords

ETL Job Standards


All ETL jobs will be created with a boiler-plate approach. This approach allows for rapid
creation of similar jobs while keeping maintenance low.

Comments

Every job will have a standard comment template that specifically spells out the
following attributes of the job:

Job Name: LoadPSA
Purpose: Load the ETL_PSA_CUSTOMERS
Predecessor: Extract Customers
Date: July 10, 2007
Author: Wes Dumey
Revision History:
  April 21, 2007 – Created the job from standard template
  May 22, 2007 – Added new columns to PSA tables

In addition, a job data dictionary will describe every job in a table, so that it can be easily
searched via standard SQL.
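
One possible shape for the job data dictionary is sketched below; the table and column names are assumptions made for illustration, not a definition from this methodology.

-- Hypothetical job data dictionary, searchable with standard SQL.
CREATE TABLE audit.ETL_JOB_DICTIONARY (
    JOB_NAME     VARCHAR(100) NOT NULL,
    PURPOSE      VARCHAR(500) NOT NULL,
    PREDECESSOR  VARCHAR(100) NULL,
    AUTHOR       VARCHAR(100) NOT NULL,
    CREATED_DATE DATE         NOT NULL
);

-- Example search: find every job whose purpose mentions the PSA customer table.
SELECT JOB_NAME, PURPOSE, PREDECESSOR
FROM audit.ETL_JOB_DICTIONARY
WHERE PURPOSE LIKE '%ETL_PSA_CUSTOMERS%';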


Persistent Staging Areas


Data will be received from the source systems in its native format. The data will be
stored in a PSA table following the naming standards listed previously. The table will
contain the following layout:

ROW_NUMBER (NUMBER) – unique for each row in the PSA
DATE (DATE) – date the row was placed in the PSA
STATUS_CODE (CHAR(1)) – indicates the status of the row ('I' inducted, 'P' processed, 'R' rejected)
ISSUE_CODE (NUMBER) – code uniquely identifying problems with the data when STATUS_CODE = 'R'
BATCH_NUMBER (NUMBER) – batch number used to process the data (auditing)
Data columns follow.
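
A hedged sketch of this layout for the STG_DI_CUSTOMER table is shown below. SQL Server types are substituted for the generic NUMBER/DATE types, and the trailing data columns are illustrative assumptions.

CREATE TABLE stg.STG_DI_CUSTOMER (
    [ROW_NUMBER]  BIGINT       NOT NULL,  -- unique for each row in the PSA
    [DATE]        DATE         NOT NULL,  -- date the row was placed in the PSA
    STATUS_CODE   CHAR(1)      NOT NULL,  -- 'I' inducted, 'P' processed, 'R' rejected
    ISSUE_CODE    INT          NULL,      -- populated when STATUS_CODE = 'R'
    BATCH_NUMBER  BIGINT       NOT NULL,  -- batch number used to process the data
    -- data columns follow; these two are illustrative
    CUSTOMER_ID   VARCHAR(20)  NULL,
    CUSTOMER_NAME VARCHAR(100) NULL
);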

Auditing
The ETL methodology maintains a process for providing audit and logging capabilities.

For each run of the process, a unique batch number composed of the time segments is
created. This batch number is loaded with the data into the PSA and all target tables. In
addition, an entry with the following data elements will be made into the
ETL_PROCESS_AUDIT table.

DATE (DATE, indexed) – run date
BATCH_NUMBER (NUMBER) – batch number of the process
PROCESS_NAME (VARCHAR) – name of the process that was executed
PROCESS_RUN_TIME (TIMESTAMP) – time (HH:MI:SS) of process execution
PROCESS_STATUS (CHAR) – 'S' success, 'F' failure
ISSUE_CODE (NUMBER) – code of the issue related to a process failure (if 'F')
RECORD_PROCESS_COUNT (NUMBER) – row count of records processed during the run


The audit process will allow for efficient logging of process execution and encountered
errors.
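
A minimal sketch of the batch number and audit entry is shown below. The yyyymmddhhmiss layout of the time segments and the audit schema are assumptions carried over from the earlier sketches, and PROCESS_RUN_TIME is assumed to hold a TIME value.

-- Build a batch number from the run's date and time segments (format assumed).
DECLARE @BatchNumber BIGINT;
SET @BatchNumber = CAST(CONVERT(VARCHAR(8), GETDATE(), 112)
    + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', '') AS BIGINT);

-- Record one run of a process in the audit table (values are illustrative).
INSERT INTO audit.ETL_PROCESS_AUDIT
    ([DATE], BATCH_NUMBER, PROCESS_NAME, PROCESS_RUN_TIME,
     PROCESS_STATUS, ISSUE_CODE, RECORD_PROCESS_COUNT)
VALUES
    (CAST(GETDATE() AS DATE), @BatchNumber, 'ExtractDICustomer',
     CAST(GETDATE() AS TIME), 'S', NULL, 15000);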

Quality
Due to the sensitive nature of data within the EDW, data quality is a driving priority.
Quality will be handled through the following processes:

1. Source job – the source job will contain a quick data-scrubbing mechanism that
   verifies the data conforms to the expected type (numeric fields contain numbers and
   character fields contain letters).
2. Transform – the transform job will contain metadata matching the target table and
   will verify that NULL values are not loaded into NOT NULL columns and that the
   data is transformed correctly.
3. QualityCheck – a separate job is created to perform a cursory check on a few identified
   columns and verify that the correct data is loaded into them.

Source Quality

A data scrubbing mechanism will be constructed. This mechanism will check identified
columns for anomalies (e.g., embedded carriage returns) and for value domains. If an
error is discovered, the data is fixed and a record is written to the
ETL_QUALITY_ISSUES table (see below for the table definition).
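
A hedged sketch of one such scrub is shown below: it records rows containing embedded carriage returns or line feeds in ETL_QUALITY_ISSUES and then fixes the data. The staging table, column names, and issue code are illustrative values carried over from the earlier examples.

DECLARE @BatchNumber BIGINT = 20070710120000;  -- assumed to come from the auditing step

-- Log each offending value before it is corrected.
INSERT INTO audit.ETL_QUALITY_ISSUES
    ([DATE], BATCH_NUMBER, PROCESS_NAME, COLUMN_NAME,
     COLUMN_VALUE, EXPECTED_VALUE, ISSUE_CODE, SEVERITY)
SELECT CAST(GETDATE() AS DATE), @BatchNumber, 'SourceSTGDICustomer',
       'CUSTOMER_NAME', CUSTOMER_NAME, 'No embedded CR/LF', 101, 'L'
FROM stg.STG_DI_CUSTOMER
WHERE CUSTOMER_NAME LIKE '%' + CHAR(13) + '%'
   OR CUSTOMER_NAME LIKE '%' + CHAR(10) + '%';

-- Fix the data in place.
UPDATE stg.STG_DI_CUSTOMER
SET CUSTOMER_NAME = REPLACE(REPLACE(CUSTOMER_NAME, CHAR(13), ' '), CHAR(10), ' ')
WHERE CUSTOMER_NAME LIKE '%' + CHAR(13) + '%'
   OR CUSTOMER_NAME LIKE '%' + CHAR(10) + '%';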

Transform Quality

The transformation job will employ a matching metadata technique. If the target table
enforces NOT NULL constraints, a check will be built into the job to prevent NULLs
from being loaded and causing a jobstream abend.
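
As an illustrative sketch (table and column names are assumed from the earlier examples), the transform could divert rows that would violate a NOT NULL constraint instead of letting the load abend:

-- Reject rows whose CUSTOMER_ID would violate a NOT NULL constraint in the target.
UPDATE stg.STG_DI_CUSTOMER
SET STATUS_CODE = 'R',
    ISSUE_CODE  = 201                  -- illustrative issue code: NULL in a NOT NULL column
WHERE CUSTOMER_ID IS NULL;

-- Only clean rows move forward into the insert file.
SELECT [ROW_NUMBER], CUSTOMER_ID, CUSTOMER_NAME, BATCH_NUMBER
FROM stg.STG_DI_CUSTOMER
WHERE STATUS_CODE <> 'R';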

Quality Check

Quality check is the last point of validation within the jobstream. QC can be configured
to check any percentage of rows (0-100%) and any number of columns (1-X). QC is
designed to pay attention to the most valuable or vulnerable rows within the data sets. QC
will use a modified version of the data scrubbing engine used during the source job to
derive correct values and reference the rules listed in the ETL_QC_DRIVER table. Any
suspect rows will be pulled from the insert/update files, updated to an 'R' status in the
PSA table, and assigned an issue code for the failure.
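
A hedged sketch of the QC rejection step follows. TABLESAMPLE is used here to check an approximate percentage of rows, and the rule shown is purely illustrative; in practice the rules would be read from ETL_QC_DRIVER.

-- Check roughly 10 percent of the PSA rows and reject any that fail the rule.
UPDATE psa
SET psa.STATUS_CODE = 'R',
    psa.ISSUE_CODE  = 301                       -- illustrative issue code
FROM stg.STG_DI_CUSTOMER AS psa
INNER JOIN (
    SELECT [ROW_NUMBER]
    FROM stg.STG_DI_CUSTOMER TABLESAMPLE (10 PERCENT)
    WHERE CUSTOMER_ID NOT LIKE '[0-9]%'         -- illustrative rule from ETL_QC_DRIVER
) AS suspects
    ON suspects.[ROW_NUMBER] = psa.[ROW_NUMBER];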

Logging of Data Failures


Data that fails the QC job will not be loaded into the EDW, based on the defined rules. An
entry will be made into the ETL_QUALITY_ISSUES table below. An indicator will
record the severity of the failing column as defined in the rules ('H' high, 'L' low). This
indicator will allow resources to be used efficiently to trace errors.

ETL_QUALITY_ISSUES

DATE (DATE) – date of entry
BATCH_NUMBER (NUMBER) – batch number of the process creating the entry
PROCESS_NAME (VARCHAR) – name of the process creating the entry
COLUMN_NAME (VARCHAR) – name of the column failing validation
COLUMN_VALUE (VARCHAR) – value of the column failing validation
EXPECTED_VALUE (VARCHAR) – expected value of the column failing validation
ISSUE_CODE (NUMBER) – issue code assigned to the error
SEVERITY (CHAR) – 'H' high, 'L' low

ETL_QUALITY_AUDIT

DATE (DATE) – date of entry
BATCH_NUMBER (NUMBER) – batch number of the process creating the entry
PROCESS_NAME (VARCHAR) – name of the process creating the entry
RECORD_PROCESS_COUNT (NUMBER) – number of records processed
RECORD_COUNT_CHECKED (NUMBER) – number of records checked
PERCENTAGE_CHECKED (NUMBER) – percentage of the data set that was checked
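
A minimal sketch of how the QC job might record its coverage is shown below; the counts and batch number are illustrative, and the audit schema follows the earlier sketches.

DECLARE @Processed INT = 15000;   -- rows in the batch (illustrative)
DECLARE @Checked   INT = 1500;    -- rows actually examined by QC (illustrative)

INSERT INTO audit.ETL_QUALITY_AUDIT
    ([DATE], BATCH_NUMBER, PROCESS_NAME,
     RECORD_PROCESS_COUNT, RECORD_COUNT_CHECKED, PERCENTAGE_CHECKED)
VALUES
    (CAST(GETDATE() AS DATE), 20070710120000, 'QualityCheckCustomer',
     @Processed, @Checked,
     CAST(@Checked AS DECIMAL(9, 2)) * 100 / @Processed);  -- percentage checked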

ETL Job Templates


Boiler-plate templates are provided for each of the following job types:

Extract
Source Combined with Load Temp
Lookup Dimension
Transform
Load

Closing
After reading this ETL document you should have a better understanding of the issues
associated with ETL processing. This methodology has been created to address as many of
those issues as possible while providing a high level of performance, ease of maintenance,
scalability, and suitability for real-time ETL processing.

