
Data Integration –

Techniques for Extracting,


Transforming and Loading Data

Internal and Confidential


Agenda
• Module 1 - Source Data Analysis & Modeling
• Module 2 – Data Capture Analysis & Design
• Module 3 – Data Transformation Analysis & Design
• Module 4 – Data Transport & Load Design
• Module 5 – Implementation Guidelines

April 13, 2011


Module 1 –
Source Data Analysis and
Modeling



Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Data Acquisition Concepts
• What is Data Acquisition?
– Data acquisition is the process of moving data to and within the data warehousing environment
• Goal oriented
– Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
– Source Driven – the activities necessary for getting data into the warehousing environment from various sources
– Target Driven – activities for acquisition within the warehousing environment
• Data Acquisition activities
– Identifying the set of data needed
– Choosing the extract approach
– Extracting data from the source
– Applying transformations
– Loading the data warehouse

[Diagram: data sources (operational and external data) feed ETL into staging (Data Intake); ETL moves staging data into the data warehouse (Data Distribution); ETL then delivers warehouse data to the data marts (Information Delivery).]
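The five acquisition activities above can be sketched as one small pipeline. This is a minimal illustration, not any particular ETL tool's API; all names (extract_source, apply_transforms, load_warehouse) and the record layout are assumptions.

```python
# Minimal sketch of the five data acquisition activities as a pipeline.
# Illustrative only: names and record shapes are assumptions.

def extract_source(source_rows, needed_fields):
    """Extract: capture only the data elements identified as needed."""
    return [{f: row[f] for f in needed_fields} for row in source_rows]

def apply_transforms(rows, transforms):
    """Transform: apply each transformation rule, in order, to every row."""
    for rule in transforms:
        rows = [rule(row) for row in rows]
    return rows

def load_warehouse(warehouse_table, rows):
    """Load: append the transformed rows to the warehouse table."""
    warehouse_table.extend(rows)
    return len(rows)

source = [{"cust_id": 1, "name": " ann lee ", "internal_flag": "x"}]
warehouse = []

rows = extract_source(source, ["cust_id", "name"])      # identify + extract
rows = apply_transforms(rows, [lambda r: {**r, "name": r["name"].strip().title()}])
loaded = load_warehouse(warehouse, rows)                # load
```

Each stage maps to one bullet above: choosing the needed fields is the "identify" step, the lambda stands in for a transformation rule, and the list plays the role of a warehouse table.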


Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Source Data Analysis – Scope of Data Sources
• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data, and
– Determining how and when to transform the data
• Acquisition process design must pay attention to:
– Ensuring data quality
– Efficiently loading warehouse databases
– Using tools effectively
– Planning for error recovery

[Diagram: scoping questions about the kinds of data – which subjects (product, customer, finance, process, HR, organization)? which entities (customer, order, product…)? which business events (receive order, ship order, cancel order…)? how much history (2003, 2004, 2005)? which enterprise events (merger with…, acquisition of…, termination of…)?]


Source Data Analysis – Identification of Sources
• Types of sources
– Operational systems
– Secondary systems
– Backups, logs, and archives
– Shadow systems
– DSS/EIS systems, and
– External data
• On-going versus single-load sources


Source Data Analysis – Evaluation of Sources

Qualifying criteria | Assessment questions
Availability | How available and accessible is the data? Are there technical obstacles to access? Or ownership and access authority issues?
Understandability | How easily understood is the data? Is it well documented? Does someone in the organization have depth of knowledge? Who works regularly with this data?
Stability | How frequently do data structures change? What is the history of change for the data? What is the expected life span of the potential data source?
Accuracy | How reliable is the data? Do the business people who work with the data trust it?
Timeliness | When and how often is the data updated? How current is the data? How much history is available? How available is it for extraction?
Completeness | Does the scope of the data correspond to the scope of the data warehouse? Is any data missing?
Granularity | Is the source the lowest available grain (most detailed level) for this data?


Evaluation of Sources – Origin of data
• Original Point of Entry
– The best practice technique is to evaluate the original point of entry. Ask: “Is this the very first place that the data is recorded anywhere within the business?”
– If “yes”, then you have found the original point of entry. If “no”, then the source may not be the original point of entry. Ask the follow-up question: “Can the element be updated in this file/table?”
– If not, then this is not the original point of entry. If “yes”, then the data element may be useful as data warehousing source data

[Diagram: tracing elements such as CUSTOMER-NUMBER, CUSTOMER-NAME and GENDER across candidate files (CLAIM, CUSTOMER, POLICY, DRIVER) and asking: which is the point of origin?]
Evaluation of Sources – Origin of data
• Original Point of Entry – this practice has many benefits
– Data timeliness and accuracy are improved
– It simplifies the set of extracts from the source system

• Business System of Record
– To what system do the business people go when they are validating results?
– If the business identifies a system as the “system of record”, then it must be considered as a probable warehousing data source

• Data Stewardship
– In organizations that have a data stewardship program, involve the data stewards


Evaluation of Sources – An example

[Diagram: a Source Data Store Matrix and a Source Data Element Matrix, with fields such as CLAIM-NUMBER evaluated against the qualifying criteria.]
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Source Data Modeling – Overview of Warehouse Data Modeling
• Contextual Models – business goals and drivers; information needs
• Conceptual Models – warehousing subjects and business questions; facts and qualifiers; target configuration; source composition; source subjects
• Logical Models – staging, warehouse, and mart ER models; data mart dimensional models; integrated source data model (ERM); triage
• Structural Models – staging area structure; warehouse structure; relational and dimensional mart structures; source data structure model
• Physical Models – staging physical design; warehouse physical design; data mart physical designs; source data file descriptions
• Functional Databases – implemented warehousing databases; source data files


Source Data Modeling
Source Data Modeling Objectives
• A single logical model representing a design view of all source data within scope
• An entity relationship model in 3rd normal form (a business model without implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those entities
• Traceability from logical relationships to the specific data sources that implement those relationships
• Verification that each logical attribute is identifiable in implemented data sources

Source Data Modeling Challenges
• Many data sources do not have data models
• Where data models exist, they are probably outdated and almost certainly not integrated
• Many source structures are only documented in code (e.g. COBOL definitions of VSAM files)
• Sometimes multiple and conflicting file descriptions exist for a single data structure


Source Data Modeling – The activities

[Diagram: the source data modeling workflow by model layer. Contextual (scope): business goals, business drivers, and information needs determine what kinds of data the warehouse targets. Conceptual (analyze): build the source composition model; for each source data store, validate an existing data model if one exists; otherwise choose a modeling approach – top-down builds a source subject model, bottom-up derives a model from existing data. Logical (design): integrate the per-source models into a single source logical model (ERM). Structural (specify): document the structure of each data store (matrix). Physical (optimize): locate existing file descriptions. Functional (implement): locate and extract from the existing data stores.]
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Understand the Scope of Data Sources

[Diagram: the contextual and conceptual steps of the source data modeling workflow, ending in two activities – identify & name subjects, and associate subjects with sources.]

• The source composition model uses set notation to develop a subject area model
• It classifies each source by the business subjects that it supports
• It helps to understand:
• which subjects have a robust set of sources
• which sources address a broad range of business subjects
• It is helpful to plan, size, sequence and schedule development of the DW increments
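The set notation of the composition model translates directly into code. A minimal sketch: the source and subject names echo the deck's example, but the exact groupings assigned here are illustrative assumptions.

```python
# Source composition model in set notation: each source is classified
# by the business subjects it supports. Groupings are illustrative.

sources = {
    "MIS customer table": {"CUSTOMER", "REVENUE"},
    "LIS policy file":    {"POLICY", "REVENUE"},
    "APMS policy master": {"POLICY", "CUSTOMER"},
    "CPS claim master":   {"CLAIM"},
}

# Which subjects have a robust set of sources?
subject_sources = {}
for source, subjects in sources.items():
    for subject in subjects:
        subject_sources.setdefault(subject, set()).add(source)

# Which sources address a broad range of business subjects?
broad_sources = {s for s, subjects in sources.items() if len(subjects) > 1}
```

Inverting the source→subjects map answers the first planning question (robust subjects); counting subjects per source answers the second (broad sources).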


Composition Subject Model – Example

[Diagram: a set-notation (Venn) model relating business subjects (CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, MARKETPLACE) to sources such as the MIS customer table, LIS policy file, APMS premium file, RPS policy file, APMS policy master, MIS product table, CPS claim master, CPS claim action file, LIS claim file, CPS party file, CPS claim detail file, and the MIS auto and residential marketplace tables.]
Composition Subject Matrix – Example

[Diagram: a matrix of data sources versus business subjects.]
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Understanding Source Content – Integrated View of Non-Integrated Sources
Source Data Modeling…
“It is not within the charter of the warehousing program to redesign data sources”

• Understand the existing source data designs
• Merge all of the designs into one representative model
• The source logical data modeling process is not one of design but of integration
• To develop a logical source data model, you will need to integrate design information from multiple inputs, including:
– merging and integrating existing data models
– extending the subject model that represents any source
– extracting design structures from source data stores with reverse engineering techniques


Understanding Source Content – Using Existing Models

[Diagram: the conceptual and logical steps of the workflow – validate each existing model (check for currency & accuracy), then integrate (combine into a single model).]

• This modeling activity begins with collection of existing data models
• Models must be validated to ensure accuracy and currency
• Existing models “jump start” the process
• Merging models means identifying and resolving redundancy and conflict across source models


Understanding Source Content – Working Top-Down

[Diagram: the top-down path of the workflow, expanded into four steps.]

• Identify, name & describe entities
• Identify, name & describe relationships
• Identify, name & describe attributes
• Map to data stores


Understanding Source Content – Working Bottom-Up
• Derive the data model from the file descriptions

• The source data element matrix serves as the tool to perform source data modeling

• Source modeling and source assessment work well together and share the same set of documentation techniques.

File | Field | Attribute (what fact?) | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium | POLICY-NUMBER | Unique policy ID | Yes | POLICY |
APMS Premium | NAME | Policy holder name | | CUSTOMER |
APMS Premium | ADDRESS | Policy holder address | | CUSTOMER |
APMS Premium | PREMIUM-AMOUNT | Cost of policy premium | | POLICY |
APMS Premium | POLICY-TERM | Coverage duration | | POLICY |
APMS Premium | BEGIN-DATE | Start date of coverage | | POLICY |
APMS Premium | END-DATE | End date of coverage | | POLICY |
APMS Premium | DISCOUNT-CD | Identifies kind of discount | Partial | DISCOUNT |
APMS Premium | SCHEDULE | Basis of discount amt | | DISCOUNT |
APMS Policy | POLICY-NUMBER | Unique policy ID | Yes | POLICY |
APMS Policy | CUSTOMER-NUMBER | Unique customer ID | Yes | CUSTOMER | POLICY -> CUSTOMER
APMS Policy | VIN | Vehicle ID number | Yes | VEHICLE | POLICY -> VEHICLE
APMS Policy | MAKE | Vehicle manufacturer | | VEHICLE |


Integrating Multiple Views – Resolving Redundancy & Conflict

[Diagram: resolve redundancy and conflict, normalize, then verify the model.]

• Examine sets of entities and relationships:
– “customer places order” and “person places order” → customer and person are redundant
– “customer places order” and “customer sends order” → places is redundant with sends
• Examine sets of entities and attributes:
– When differently named entities have a high degree of similarity in their sets of attributes, they may be redundant
Understanding Source Content – Data Profiling: Looking at the Data

Look at the data to:
• discover patterns
• know how it is used
• understand data quality
• identify all data values

3 types of profiling (classify and organize):
• Column profiling
• Dependency profiling
• Redundancy profiling
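Column profiling, the first of the three types, can be sketched in a few lines: for one column, identify all distinct values, flag null or blank values, and list the most frequent values. The field name and record shape below are illustrative assumptions.

```python
from collections import Counter

# Minimal column-profiling pass: distinct values, null/blank counts,
# and the most frequent values for a single column. Illustrative only.

def profile_column(rows, column):
    values = [row.get(column) for row in rows]
    frequencies = Counter(values)
    return {
        "distinct": len(frequencies),
        "null_or_blank": sum(1 for v in values if v is None or str(v).strip() == ""),
        "top_values": frequencies.most_common(3),
    }

data = [{"gender": "F"}, {"gender": "M"}, {"gender": "F"}, {"gender": None}]
report = profile_column(data, "gender")
```

Dependency and redundancy profiling extend the same idea across columns and across tables, respectively.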


Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Logical to Physical Mapping
Tracing business data to physical implementation
• Two-way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– This provides comprehensive documentation of source data, and a detailed mapping from the business view to implemented data elements


Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills



Understanding the Source Systems – A Team Effort
• Understanding the source systems feeding the warehousing environment is a critical success factor

• All members of the warehousing team have a role in this effort

• The acquisition (ETL) team is generally responsible to:
– document the source layouts
– perform reverse engineering as needed
– determine point of entry
– identify system of record
– look at actual data values

• The data modelers are likely to:
– create the single source logical model (using the inputs gathered by the acquisition team)

• Business analysts/representatives are involved to:
– look at the data and help understand the values
– help to identify point of entry
– help to determine system of record


Module 2 –
Data Capture Analysis and
Design



Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture



Data Capture Concepts – An overview
• Data capture – the activities involved in getting data out of sources
• Synonym for data extraction

Data Capture Analysis – What to extract?
– Performed to understand requirements for data capture
• Which data entities and elements are required by target data stores?
• Which data sources are needed to meet target requirements?
• What data elements need to be extracted from each data source?

Data Capture Design – When to extract? How to extract?
– Performed to understand and specify methods of data capture
• Timing of extracts from each data source
• Kinds of data to be extracted
• Occurrences of data (all or changes only) to be captured
• Change detection methods
• Extract technique (snapshot or audit trail)
• Data capture method (push from source or pull from warehouse)

[Diagram: source → Extract → Transform → Load → target.]


Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture



Source/Target Mapping – Mapping Objectives
– The primary technique used to perform data capture analysis
– Mapping source data to target data
– Three levels of detail:
• Entities
• Data stores
• Data elements
– The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store

[Diagram: data availability flows in from the source, data requirements flow in from the target; source/target mapping sits between them and drives Extract, Transform, and Load.]


Source and Target as Roles

The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store:
• Data Intake – source: operational/external data; target: staging data
• Data Distribution – source: staging data; target: data warehouse
• Information Delivery – source: data warehouse; target: data marts


Mapping Techniques

• Map source entities to target entities (using the source & target logical data models, e.g. customer, product, service)
• Map source data stores to target data stores
• Map source data elements to target data elements (using structural & physical models and the design transformations)


Source/Target Mapping: Examples

[Diagram: worked examples of entity mapping, data store mapping, and data element mapping.]
Source/Target Mapping: Full Set of Data Elements

[Diagram: the full set of mapped data elements accumulates from several inputs – the source data element matrix and logical business models on the source side, file/table descriptions and physical design on the target side – with further elements added by business questions, by triage, and by transform logic as source/target mapping, triage, and transform design proceed.]


Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture



Source Data Triage
What to extract – opportunities
• Source/target mapping analyzes need; triage analyzes opportunity
• Triage is about extracting all data with potential value

What is Triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion

Why Triage?
• It ensures that a complete set of attributes is captured in the warehousing environment
• Rework is minimized

Triage and Acquisition
• Performing triage is a joint effort between the acquisition team and the warehousing data modelers


The Triage Technique

• Start from the source systems needed for the increment
• Select the needed files
• Identify elements addressing known business questions
• Eliminate operational and redundant elements
• Take all other business elements
• The result is the first draft of element mapping for the staging area or atomic DW


Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture



Kinds of Data
• Event data
• Reference data
• Metadata
• Source system keys

[Diagram: membership examples of each kind of data – customer, member number, membership.]
Data Capture Methods

| ALL DATA | CHANGED DATA
PUSH TO WAREHOUSE | Replicate source files/tables | Replicate source changes or transactions
PULL FROM SOURCE | Extract source files/tables | Extract source changes or transactions


Detecting Data Changes

How to know which data has changed?

• Detecting changes at the source
• source date/time stamps
• source transaction files/logs
• replicated source data changes
• DBMS logs
• compare back-up files

• Detecting changes after extract
• compare warehouse extract generations
• compare warehouse extract to the source system
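The "compare warehouse extract generations" option can be sketched by keying two extract files on a business key and diffing them. Field and key names below are illustrative assumptions.

```python
# Detect changes after extract by comparing two extract generations
# keyed on a business key. Illustrative sketch.

def diff_generations(previous, current, key):
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    inserted = [curr[k] for k in curr.keys() - prev.keys()]
    deleted  = [prev[k] for k in prev.keys() - curr.keys()]
    updated  = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserted, updated, deleted

gen1 = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
gen2 = [{"id": 2, "amt": 25}, {"id": 3, "amt": 30}]
ins, upd, dels = diff_generations(gen1, gen2, "id")
```

This trades source-side work (timestamps, logs, triggers) for warehouse-side storage of the previous generation, which is often the only option when the source cannot be instrumented.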


Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture



Timing Issues

[Diagram: timing concerns at each stage – frequency of acquisition (OLTP sources → data extraction), data transformation via work tables, latency of load into the intake layer (warehouse loading), and periodicity of the data marts.]


Source System Considerations

• When is the data ready in each source system?
• How will I know when it’s ready?
• How long will it remain in the steady state?
• How will I know when source systems fail?
• How will I respond to source system failures?
• How will I recover from a failure?


Handling Time Variance – Techniques and Methods
• SNAPSHOT
– Periodically posts records as of a specific point in time
– Records all data of interest without regard to changes
– Acquisition techniques to create snapshots:
• DBMS replication
• Full file unload or copy

• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.
– Acquisition techniques:
• DBMS triggers
• DBMS replication
• Incremental selection
• Full file unload/copy

Important distinction between snapshot and audit trail:
• Audit trail – only changed data is extracted and loaded
• Snapshot – all data is extracted and loaded, whether changed or not
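The distinction can be sketched directly: a snapshot posts every record as of a point in time, while an audit trail records only the changes, with change details. The record shapes and field names below are illustrative assumptions.

```python
# Snapshot versus audit trail, sketched minimally. Illustrative only.

def snapshot(rows, as_of):
    """Snapshot: record all data of interest, changed or not, as of one point in time."""
    return [{**row, "snapshot_date": as_of} for row in rows]

def audit_record(key, before, after, changed_at, reason):
    """Audit trail: record the details of a single change to data of interest."""
    return {"key": key, "before": before, "after": after,
            "changed_at": changed_at, "reason": reason}

policies = [{"policy": "P1", "premium": 100}, {"policy": "P2", "premium": 200}]
snap = snapshot(policies, "2011-04-13")                              # 2 rows out
trail = [audit_record("P1", 100, 110, "2011-04-14T09:00", "rate change")]  # 1 change out
```

Note the volume difference: the snapshot emits every row on every cycle, while the audit trail emits one record per detected change.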


Module 3 –
Data Transformation Analysis &
Design



Module 3 – Data Transformation Analysis &
Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes



Transformation Concepts – An overview
• Data Transformation
• The changes that occur to the data after it is extracted
• Transformation processing removes:
– Complexities of operational environments
– Conflicts and redundancies of multiple databases
– Details of daily operations
– Obscurity of highly encoded data

• Transformation Analysis
• Integrate disparate data
• Change granularity of data
• Assure data quality

• Transformation Design
• Specifies the processing needed to meet the requirements determined by transformation analysis
• Determines the kinds of transformations:
– Selection
– Filtering
– Conversion
– Translation
– Derivation
– Summarization
• Transformations are organized into programs, scripts, modules, jobs, etc. that are compatible with the chosen tools and technology


Module 3 – Data Transformation Analysis &
Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes



Data Integration Requirements
• Integration
– Create a single view of the data

• Integration & Staging Data
– organize data by business subjects
– ensure integrated identity through use of common, shared business keys

• Integration & Warehouse Data
– implement data standards, including derivation of conformed facts and structuring of conformed dimensions
– ensure integration of internal identifiers – where staging integrates real-world keys, the warehouse needs to do the same for surrogate keys

• Integration & Data Marts
– intended to satisfy business-specific/department-specific requirements


Data Granularity Requirements
• Granularity
– Each change of data grain, from atomic data to progressively higher levels of summary, is achieved through transformation

• Granularity & Staging Data
– Staging data is kept at the atomic level

• Granularity & Warehouse Data
– In a 3-tier environment, the warehouse should contain all common and standard summaries

• Granularity & Data Marts
– Derivation of summaries specific to individual needs


Data Quality Requirements
• Data Cleansing
– the process by which data quality needs are met
– ranges from filtering bad data to replacing data values with some alternative default or derived values

• Cleansing & Staging Data
– the earlier the data is cleansed, the better the result
– it is sometimes important for staging data to reflect exactly what was contained in the source systems; in that case, either delay data cleansing transformation until data is moved from staging to warehouse, or keep both cleansed and un-cleansed data in the staging area

• Cleansing & Warehouse Data
– data not cleansed in staging is cleansed before being loaded into the warehouse

• Cleansing & Data Marts
– cleansing at the data marts is not necessarily desirable; however, as a practical matter it may be necessary


Module 3 – Data Transformation Analysis &
Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes



Transformation Design – Approach

Transformation requirements → identify transformation rules & logic → determine transformation sequences → specify transformation processes → transformation specifications


Module 3 – Data Transformation Analysis &
Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes



Kinds of Transformations

This transformation type… | is used to…
Selection | Choose one source to be used among multiple possibilities
Filtering | Choose a subset of rows from a source data table, or a subset of records from a source data file
Conversion and Translation | Change the format of data elements
Derivation | Create new data values, which can be inferred from the values of existing data elements
Summarization | Create new data values by aggregating existing detail data to a higher level of granularity


Selection

[Diagram: choose among alternative sources based upon selection rules – extracted data from source #1 or source #2 feeds the transformed target, sometimes from source 1, sometimes from source 2.]

Example rule: ‘If membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table’
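The selection rule quoted above can be sketched directly. The record shapes and field names are illustrative assumptions.

```python
# Selection: choose which source supplies the member name, based on
# the membership type. Illustrative sketch of the rule above.

def select_member_name(membership_type, master_row, contact_row):
    if membership_type == "individual":
        return master_row["member_name"]   # from the membership master file
    return contact_row["member_name"]      # from the business contact table

name = select_member_name("individual",
                          {"member_name": "Ann Lee"},
                          {"member_name": "Lee Holdings"})
```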


Filtering

[Diagram: eliminate some data from the target set based on filtering rules – some rows or values are discarded between the extracted source data and the transformed target.]

Example rule: ‘If the last 2 digits of the policy number are 04, 27, 46, or 89, extract the data for the data mart; otherwise exclude the policy and all associated data’
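The filtering rule quoted above, sketched directly. Treating policy numbers as strings is an assumption.

```python
# Filtering: keep a policy only when the last two digits of its number
# fall in the sampling set from the rule above.

SAMPLE_SUFFIXES = {"04", "27", "46", "89"}

def keep_policy(policy_number):
    return policy_number[-2:] in SAMPLE_SUFFIXES

policies = ["100104", "100105", "234589"]
kept = [p for p in policies if keep_policy(p)]   # excluded rows are discarded
```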


Conversion

[Diagram: change data content and/or format based on conversion rules – the value/format in is different than the value/format out.]

Example rule: ‘For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default the century to 19’
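The conversion rule quoted above, sketched directly. The YYDDD input layout (two-digit year plus day-of-year) is an assumed Julian-date format; the slide does not specify it.

```python
import datetime

# Conversion: reformat a pre-1994 Julian date (assumed YYDDD layout)
# to YYYYMMDD, defaulting the century to 19.

def julian_to_yyyymmdd(yyddd):
    year = 1900 + int(yyddd[:2])          # default century to 19
    day_of_year = int(yyddd[2:])
    d = datetime.date(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)
    return d.strftime("%Y%m%d")
```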


Translation

[Diagram: decode data whose values are encoded, based on rules for translation – encoded values in, both encoded and decoded values out.]

Example rule: ‘If membership-type-code is ‘C’, translate to ‘Business’; if membership-type-code is ‘P’, blank, or null, translate to ‘Individual’; otherwise translate to ‘Unknown’’
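The translation rule quoted above, sketched directly.

```python
# Translation: decode membership-type-code per the rule above.
# 'C' -> Business; 'P', blank, or null -> Individual; otherwise Unknown.

def translate_membership_type(code):
    if code == "C":
        return "Business"
    if code is None or code == "P" or str(code).strip() == "":
        return "Individual"
    return "Unknown"
```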


Derivation

[Diagram: use existing data values to create new data based on derivation rules – new data values are created, with more values out than in.]

Example rule: ‘Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts) − (sum of all discount amounts)’
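The derivation rule quoted above, computed directly. The sample amounts are illustrative.

```python
# Derivation: Total Premium Cost = base premium + sum of additional
# coverage amounts - sum of discount amounts, per the rule above.

def total_premium_cost(base_premium, additional_coverages, discounts):
    return base_premium + sum(additional_coverages) - sum(discounts)

total = total_premium_cost(500.0, [120.0, 80.0], [50.0])   # 500 + 200 - 50
```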


Summarization

[Diagram: change data granularity based on rules of summarization – atomic or base data in, summary data out.]

Example rules: ‘For each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))’; ‘For each week (sum daily transaction count, sum daily dollar total)’
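The first nested rule above, sketched directly: group atomic transactions by (store, product line, day) and accumulate a count and a dollar total per group. Record shapes are illustrative assumptions.

```python
# Summarization: per store, per product line, per day, count the
# transactions and accumulate their total dollar value.

def summarize(transactions):
    summary = {}
    for t in transactions:
        key = (t["store"], t["product_line"], t["day"])
        count, amount = summary.get(key, (0, 0.0))
        summary[key] = (count + 1, amount + t["amount"])
    return summary

txns = [
    {"store": "S1", "product_line": "auto", "day": "2011-04-13", "amount": 10.0},
    {"store": "S1", "product_line": "auto", "day": "2011-04-13", "amount": 15.0},
    {"store": "S2", "product_line": "home", "day": "2011-04-13", "amount": 7.5},
]
daily = summarize(txns)
```

The second rule (weekly totals) would re-run the same pattern over this daily output, which is exactly the grain dependency discussed later.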
Identifying Transformation Rules

For any source-to-target data element association, what needs exist for:
• selection?
• filtering?
• conversion?
• translation?
• derivation?
• summarization?
Specifying Transformation Rules

The cells of the source/target mapping expand to identify transformations by type & name; the transformation logic is documented separately. For example:

• cleansing: DTR027 (default value)
• derivation: DTR008 (derive name)

DTR027 (Default Membership Type): If membership-type is null or invalid, assume “family” membership
DTR008 (Derive Name): If membership-type is “family”, separate the name using the comma – insert the characters prior to the comma in customer-last-name and the characters after the comma in customer-first-name; else move the name to customer-biz-name
Module 3 – Data Transformation Analysis &
Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes



Dependencies and Sequences
• Time Dependency – when one transformation rule must execute before another
• example: summarization of derived data cannot occur before the derivation

• Rule Dependency – when execution of a transformation rule is based upon the result of another rule
• example: different translations occur depending on the source chosen by a selection rule

• Grain Dependency – when developing one level of summary is based on the results of a previous summarization
• example: quarters can’t be summarized annually before months are summarized on a quarterly basis


Dependencies and Sequences

[Diagram: specify selection → specify filtering → specify conversion & translation → specify derivation → specify summarization.]

1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan
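Sequencing rules by dependency is a topological sort. A minimal sketch, assuming no circular dependencies (there is no cycle guard); the rule IDs follow the deck's DTRnnn naming, but these particular dependency pairs are illustrative.

```python
# Sequence transformation rules so each one executes after the rules
# it depends on. Minimal topological sort; assumes no cycles.

def sequence_rules(rules, depends_on):
    ordered, done = [], set()

    def visit(rule):
        for dep in depends_on.get(rule, []):
            if dep not in done:
                visit(dep)
        if rule not in done:
            done.add(rule)
            ordered.append(rule)

    for rule in rules:
        visit(rule)
    return ordered

# Hypothetical dependencies: DTR030 summarizes data derived by DTR008,
# and DTR008 needs DTR027's defaulted value first.
order = sequence_rules(["DTR030", "DTR008", "DTR027"],
                       {"DTR030": ["DTR008"], "DTR008": ["DTR027"]})
```

The resulting order respects both time and rule dependencies; grain dependencies fall out the same way when each summary level is modeled as a rule depending on the level below it.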
Modules and Programs

Transformation rules such as:

DTR027 (Default Membership Type): If membership-type is null or invalid, assume “family” membership
DTR008 (Derive Name): If membership-type is “family”, separate the name using the comma – insert the characters prior to the comma in customer-last-name and the characters after the comma in customer-first-name; else move the name to customer-biz-name

together with the dependencies among the rules, determine the structures of modules, programs, scripts, etc.


Job Streams & Manual Procedures – Completing the ETL Design

• Transformation rules and their implementation
• Extract & load dependencies
• Scheduling, execution, restarts, verification, and communication are handled through automated & manual procedures
Module 4 –
Data Transportation & Loading
Design



Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design



Overview

[Diagram: Source Data → Extract → Transform → Load → Target Data, with data transport and database load wherever platforms change.]


Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design



Data Transport Issues

• which platforms?
• data volumes?
• transport frequency?
• network capacity?
• ASCII vs EBCDIC?
• data security?
• transport methods?


Data Transport Techniques

• Open FTP
• Secure FTP
• Alternatives to FTP
• Data compression
• Data encryption
• ETL tools
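Compression and transfer verification can be sketched with the standard library: compress the extract before transport, carry a checksum alongside it, and verify on the receiving platform. File names and the sample extract content are illustrative only.

```python
import gzip
import hashlib
import os
import tempfile

def prepare_for_transport(path_in, path_out):
    """Compress an extract file for transport; return an MD5 checksum
    the receiving platform can use to verify the transfer."""
    with open(path_in, "rb") as src:
        data = src.read()
    with gzip.open(path_out, "wb") as dst:
        dst.write(data)
    return hashlib.md5(data).hexdigest()

def verify_after_transport(path_gz, expected_md5):
    """Decompress on the target platform and compare checksums."""
    with gzip.open(path_gz, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5

# Round-trip demo on a throwaway extract file
workdir = tempfile.mkdtemp()
extract = os.path.join(workdir, "extract.dat")
with open(extract, "wb") as f:
    f.write(b"cust_id|name\n101|Smith, Jane\n" * 1000)

checksum = prepare_for_transport(extract, extract + ".gz")
ok = verify_after_transport(extract + ".gz", checksum)
```

The checksum step matters as much as the compression: it catches truncated or corrupted transfers before a bad file reaches the database load.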


Module 4 – Data Transport & Load Design

• Data Transport and Load Concepts
• Data Transport Design
• Database Load Design


Database Load Issues

• which DBMS?
• relational vs dimensional?
• tables & indices?
• load frequency?
• load timing?
• data volumes?
• exception handling?
• restart & recovery?
• load methods?
• referential integrity?
Populating Tables
• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows
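The three populate strategies can be sketched against any SQL database; this uses an in-memory SQLite table with an assumed `sales` schema purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# Strategy 1: drop and rebuild the table (full refresh)
conn.execute("DROP TABLE sales")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# Strategy 2: insert (only) rows into the table (append-only load)
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Strategy 3: delete old versions of changed rows, then insert the new ones
changed = [(2, 25.0)]
conn.executemany("DELETE FROM sales WHERE id = ?", [(cid,) for cid, _ in changed])
conn.executemany("INSERT INTO sales VALUES (?, ?)", changed)
conn.commit()
```

Drop-and-rebuild is simplest but reloads everything; insert-only suits append-style fact data; delete-and-insert handles changed rows without in-place updates.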


Indexing

Load → Indices, Tables

• update indices at load?
• drop & rebuild?
• index segmentation?


Updating

Allow updating of rows in tables (and their indices)?

• Isn’t the warehouse read only?
• Updating Business Data
• Updating Row-level Metadata



Referential Integrity
• RI is the condition where every reference to another table has a foreign key/primary key match.

• Three common options for enforcing RI:
  • DBMS checking
  • Test load files before load using a tool/custom application
  • Test database(s) after load using a tool/custom application
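The "test load files before load" option can be sketched as a pre-load pass that splits rows with orphan foreign keys away from the clean ones. The `orders`/`customers` field names are illustrative.

```python
def check_referential_integrity(load_rows, fk_field, parent_keys):
    """Split load rows into (clean, violations): a row violates RI when
    its foreign key has no matching primary key in the parent table."""
    clean, violations = [], []
    for row in load_rows:
        (clean if row[fk_field] in parent_keys else violations).append(row)
    return clean, violations

customers = {101, 102}                      # primary keys of the parent table
orders = [
    {"order_id": 1, "customer_id": 101},    # matches a customer
    {"order_id": 2, "customer_id": 999},    # orphan foreign key
]
clean, bad = check_referential_integrity(orders, "customer_id", customers)
```

Checking before the load keeps bad rows out of the warehouse entirely; the post-load option finds the same violations but only after they are already in the database.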



Timing Considerations
• User Expectations
• Data Readiness
• Database synchronization



Exception Processing

Transform → Load → Target data

• rows that pass edits (“ok”) continue to the target data
• exceptions are suspended for correction, written to reports and logs, or discarded
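A minimal sketch of the routing above: each transformed row goes to the target, a suspend file, or a discard log. The specific edit rules (null and negative amounts) are invented examples of recoverable vs unrecoverable exceptions.

```python
def route_row(row):
    """Route a transformed row: clean rows go to the target, recoverable
    problems are suspended for correction, hopeless rows are discarded."""
    if row.get("amount") is None:
        return "discard"    # nothing recoverable in the row
    if row["amount"] < 0:
        return "suspend"    # hold for manual correction and reload
    return "target"

streams = {"target": [], "suspend": [], "discard": []}
for row in [{"amount": 10}, {"amount": -5}, {"amount": None}]:
    streams[route_row(row)].append(row)
```

In practice each stream would also feed the exception reports and log shown on the slide, so that suspended rows can be corrected and replayed through the load.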
Integrating with ETL processes

The EXTRACT and TRANSFORM & LOAD job streams each involve scheduling, dependencies, execution, restart, verification, communication, and process metadata.

• Loading as a part of a single transform & load job stream
• Loads triggered by completion of the transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing (subject to tool capabilities)
• Loading partitions
• Updating summary tables
Module 5 – Implementation Guidelines


Module 5 – Implementation Guidelines

• Data Acquisition Technology
• ETL Summary


Technology in Data Acquisition

ETL Technology spans Source Systems, Data Access, Database Loading, and Database Management, and covers:

• Data Mapping
• Data Transformation
• Data Conversion
• Data Cleansing
• Data Movement
• Storage Management
• Metadata Management
ETL - Critical Success Factors
Each factor below is marked (v) where it applies across the data store roles (Intake, Integration, Distribution, Information Delivery) and the data transformation roles (Granularity, Cleansing):
1. Design for the Future, Not for the Present v v v v v v
2. Capture and store only changed data v
3. Fully understand source systems and data v v v v v
4. Allow enough time to do the job right v v v v v v
5. Use the right sources, not the easy ones v v v
6. Pay attention to data quality v v v v v v
7. Capture comprehensive ETL metadata v v v v v v
8. Test thoroughly and according to a test plan v v v v v v
9. Distinguish between one-time and ongoing loads v v v
10. Use the right technology for the right reasons v v v v v v
11. Triage source attributes v v v
12. Capture atomic level detail v v
13. Strive for subject orientation and integration v v
14. Capture history of changes in audit trail form v
15. Modularize ETL processing v v v v v v
16. Ensure that business data is non-volatile v v v
17. Use bulk loads and/or insert-only processing v
18. Complete subject orientation and integration v v
19. Use the right data structures (relational vs. dimensional) v v v v v
20. Use shared transformation rules and logic v v v v v
21. Design for distribution first, then for access v
22. Fully understand each unique access need v v v
23. Use DBMS update capabilities v v v
24. Design for access before other purposes v v
25. Design for access tool capabilities v v
26. Capture quality metadata and report data quality v v v



Exercises

• Exercise 1: Source Data Options
• Exercise 2: Source Data Modeling
• Exercise 3: Data Capture
• Exercise 4: Data Transformation
• Exercise 5: Data Acquisition Decision
