
SAD 2008/09
H. Galhardas

ETL (Extract-Transform-Load) process
[Figure: Data warehouse architecture - operational DBs and other sources are extracted, transformed, loaded and refreshed through a data staging area into the data storage layer (data warehouse, data marts, metadata, monitor & integrator); an OLAP server serves front-end tools for queries, reports, analysis and data mining.]
Metadata repository
Description of the structure of the DW
  Schema, views, hierarchies, etc.
Operational metadata
  Data lineage: history of migrated data + transformations applied
  Currency of data: active, archived or purged
  Monitoring information: warehouse usage statistics, error reports, audit trails
Algorithms used for summarization
Mapping from operational data sources to the DW
  Data extraction, cleaning, transformation rules
  Data refresh and purging rules
Gateway descriptions
Data related to system performance
  Indices and profiles to improve access performance, etc.
Business metadata
  Business terms and definitions, data ownership information, etc.
Agenda
ETL issues:
Overview of the ETL process
Extract
Transformation/cleaning
Data staging area (DSA)
Load
Building dimensions
Building fact tables
Data cleaning
Brief overview of ETL and data cleaning tools
Overview of the ETL process
The most underestimated process in DW development
The most time-consuming process in DW development
Often, 80% of development time is spent on ETL
Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.
ETL construction process
Plan
1) Make a high-level diagram of the source-destination flow
2) Test, choose and implement an ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
4) Construct and test the build of a static dimension
5) Construct and test change mechanisms for one dimension
6) Construct and test the remaining dimension builds
Construction of fact tables and automation
7) Construct and test the initial fact table build
8) Construct and test the incremental update
9) Construct and test the aggregate build
10) Design, construct, and test the ETL automation
Definition
An ETL process is a directed acyclic graph:
Activities and record sets are the nodes
Input-output relationships between nodes are the edges
It models a workflow of activities to perform:
Appropriate filtering
Intermediate data staging
Transformations
Loading
Example
[Figure: ETL workflow for two sources, S1_PARTSUPP and S2_PARTSUPP. Each source is transferred by FTP to the DSA, where DIFF compares the new snapshot (DS.PS_NEW.PKEY) with the previous one (DS.PS_OLD.PKEY). Surrogate keys are assigned (Add_SPK using LOOKUP_PS.SKEY, with SUPPKEY=1 and SUPPKEY=2), cost and date format conversions are applied (A2EDate with a NotNULL check, AddDate with DATE=SYSDATE), and quantities are checked (CheckQTY, QTY>0), with rejected records logged at each check. The union of the two flows is loaded into DW.PARTSUPP, from which the aggregate views V1 (PKEY, DAY, MIN(COST)) and V2 (PKEY, MONTH, AVG(COST)) are derived.]
Extraction
Goal: to identify the correct subset of source data to be submitted to the ETL workflow for further processing, and to extract it fast
Takes place at idle times of the source system, typically at night
Two strong requirements:
The source must suffer minimum overhead
Minimum interference with the SW configuration on the source side
[Figure: sources S1_PARTSUPP and S2_PARTSUPP transferred via FTP to the staging area]
Extraction policies for capturing changes in data
1. (naïve) Extract the whole source
2. Extract a snapshot of the data and compare it with the previous snapshot (see the sketch after this list)
Only the data changes need to be processed
3. Use triggers in the source, activated when a modification takes place
Only possible if the data source is a relational DB
4. Parse the log file of the source (log sniffing) to detect the modifications in the data
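A minimal sketch of policy 2 (snapshot comparison), assuming the current and previous extracts are staged as tables ps_new and ps_old keyed by pkey, with cost and qty as the compared columns (all names are illustrative):

-- Rows inserted or changed since the previous snapshot
SELECT n.*
FROM ps_new n
LEFT JOIN ps_old o ON o.pkey = n.pkey
WHERE o.pkey IS NULL        -- newly inserted rows
   OR o.cost <> n.cost      -- changed rows: compare the relevant columns
   OR o.qty <> n.qty;

-- Rows deleted since the previous snapshot
SELECT o.*
FROM ps_old o
LEFT JOIN ps_new n ON n.pkey = o.pkey
WHERE n.pkey IS NULL;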
Types of data sources
Non-cooperative sources
Snapshot sources: provide only a full copy of the source
Specific sources: each one is different, e.g., legacy systems
Logged sources: write a change log (DB log)
Queryable sources: provide a query interface, e.g., SQL
Cooperative sources
Replicated sources: publish/subscribe mechanism
Call back sources: call external code (the ETL) when changes occur
Internal action sources: only internal actions when changes occur (DB triggers are an example)
The extract strategy is very dependent on the source types
Transformation
Deals with several types of conflicts and problems:
Schema-level problems
Naming conflicts: homonyms and synonyms
Structural conflicts: different representations of the same object, or conversion of types
Record-level problems
Duplicate records, e.g., "John Smith" and "Jonh Smith"
Contradictory info, e.g., different birth dates for the same person
Different granularity, e.g., sales per year vs. per month
Value-level problems
Different formats, e.g., "male" vs. "m" (see the sketch after this list)
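A minimal sketch of a value-level normalization, assuming a staging table stg_customer with a free-form gender column (table and column names are illustrative):

UPDATE stg_customer
SET gender = CASE
               WHEN LOWER(TRIM(gender)) IN ('m', 'male') THEN 'M'
               WHEN LOWER(TRIM(gender)) IN ('f', 'female') THEN 'F'
               ELSE NULL    -- unrecognized values are set to NULL and can be logged for review
             END;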
Transformation, integration and cleaning
Wide variety of functions:
Normalizing
Denormalizing
Reformatting
Recalculating
Summarizing
Merging data, etc.
Data Staging Area (DSA)
Intermediate area of the DW where the transformation phase takes place
Steps:
1. Snapshots of source data are compared with previous versions to detect the newly inserted or updated data
2. New data is stored on disk so that the process doesn't start from scratch in case of failure
3. Data undergoes several filters and transformations (a sketch follows below)
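A minimal sketch of step 3, applying the QTY > 0 check from the example workflow and logging the rejected rows; ds_ps and log_rejected are illustrative staging table names:

-- Keep a copy of the rows that fail the quality check
INSERT INTO log_rejected (pkey, suppkey, qty, cost, reject_reason)
SELECT pkey, suppkey, qty, cost, 'QTY not positive'
FROM ds_ps
WHERE qty IS NULL OR qty <= 0;

-- Remove them from the staging table before further transformations
DELETE FROM ds_ps
WHERE qty IS NULL OR qty <= 0;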
[Figure: DSA portion of the example ETL workflow - snapshot DIFF against the previous extract, surrogate-key assignment via LOOKUP_PS, date and quantity checks, and logging of rejected records]
To stage or not to stage
A conflict between:
getting the data from the operational systems as fast as possible
having the ability to restart without repeating the process from the beginning
Reasons for staging:
Recoverability: stage the data as soon as it has been extracted from the source systems, and immediately after major processing (cleaning, transformation, etc.)
Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
Auditing: keeps the lineage between the source data and the transformations applied before the load into the data warehouse
Designing the staging area
The staging area is owned by the ETL team
no indexes, no aggregations, no presentation access, no querying, no service level agreements
Users are not allowed in the staging area for any reason: staging is a construction site
Reports cannot access data in the staging area: tables can be added or dropped without modifying the user community
Only ETL processes can read/write the staging area
ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area
The staging area consists of both RDBMS tables and data files
Staging Area data structures
Flat files
fast to write, append to, sort and filter (grep), but slow to update, access or join
enable restart without going back to the sources
Relational tables
metadata, SQL interface, DBA support
Dimensional model constructs: facts, dimensions, atomic fact tables, aggregate fact tables (OLAP cubes)
Surrogate key mapping tables (a sketch follows below)
map the natural keys from the OLTP systems to the surrogate keys of the DW
can be stored in files or in the RDBMS
you can use the IDENTITY function if you go with the RDBMS approach
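A minimal sketch of a surrogate key mapping table kept in the RDBMS, assuming a customer dimension fed by several source systems; all names are illustrative and IDENTITY syntax varies between DBMSs:

CREATE TABLE map_customer (
    customer_sk   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key assigned by the DBMS
    source_system VARCHAR(20) NOT NULL,   -- OLTP system the key comes from
    customer_nk   VARCHAR(40) NOT NULL,   -- natural key in that system
    UNIQUE (source_system, customer_nk)
);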
Load (1)
Goal: fast loading into the DW
Loading deltas is much faster than a total load
SQL-based update is slow
Large overhead (optimization, locking, etc.) for every SQL call
Bulk loading using a DBMS-specific utility is much faster
Some load tools can also perform UPDATEs
Indexes on tables slow the load down a lot
Drop the indexes and rebuild them after the load (see the sketch below)
Can be done per partition
Parallelization
Dimensions can be loaded concurrently
Fact tables can be loaded concurrently
Partitions can be loaded concurrently
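A minimal sketch of the drop-index / bulk-load / rebuild pattern, shown here with PostgreSQL's COPY as the bulk loader; table, index and file names are illustrative, and other DBMSs have their own utilities (e.g., SQL*Loader, bcp):

DROP INDEX IF EXISTS idx_partsupp_date;   -- indexes slow the bulk load down

COPY dw_partsupp (pkey, suppkey, cost, qty, load_date)
FROM '/staging/partsupp_delta.csv' WITH (FORMAT csv);

CREATE INDEX idx_partsupp_date ON dw_partsupp (load_date);   -- rebuild after the load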
Load (2)
Relationships in the data
Referential integrity must be ensured
Can be done by the loader
Aggregates
Must be built and loaded at the same time as the detail data (see the sketch below)
Today, RDBMSs can often do this
Load tuning
Load without logging
Sort the load file first
Make only simple transformations in the loader
Use loader facilities for building aggregates
Use the loader within the same database
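A minimal sketch of building an aggregate at load time, corresponding to the V2 aggregate (monthly average cost per part) in the example workflow; dw_partsupp and agg_partsupp_month are illustrative names:

INSERT INTO agg_partsupp_month (pkey, month_key, avg_cost)
SELECT pkey,
       EXTRACT(YEAR FROM load_date) * 100 + EXTRACT(MONTH FROM load_date) AS month_key,
       AVG(cost)
FROM dw_partsupp
GROUP BY pkey, EXTRACT(YEAR FROM load_date) * 100 + EXTRACT(MONTH FROM load_date);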
Agenda
ETL issues:
Overview of the ETL process
Extract
Transformation/cleaning
Data staging area (DSA)
Load
→ Building dimensions
Building fact tables
Data cleaning
Brief overview of ETL and data cleaning tools
Building Dimensions
Static dimension table
Assignment of keys: production keys are mapped to DW keys using a mapping table
Combination of data sources: find a common key?
Handling dimension changes
Slowly changing dimensions (a Type 2 sketch follows below)
Find the newest DW key for a given production key
The table mapping production keys to DW keys must be updated
Load of dimensions
Small dimensions: replace
Large dimensions: load only the changes
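A minimal sketch of a slowly changing dimension handled with Type 2 history (expire the current row, insert a new version); the slides do not prescribe a particular SCD type, and dim_customer with valid_from/valid_to/is_current columns is an illustrative design:

-- Expire the current version of the changed member (natural key '4711' is an example)
UPDATE dim_customer
SET valid_to = CURRENT_DATE, is_current = 'N'
WHERE customer_nk = '4711' AND is_current = 'Y';

-- Insert the new version with a fresh surrogate key (MAX+1 here for brevity; a sequence is better)
INSERT INTO dim_customer (customer_sk, customer_nk, city, valid_from, valid_to, is_current)
SELECT MAX(customer_sk) + 1, '4711', 'Lisboa', CURRENT_DATE, DATE '9999-12-31', 'Y'
FROM dim_customer;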
The basic structure of a dimension
Primary key (PK)
Meaningless, unique integer
Also known as the surrogate key
Joins to fact tables
Is a foreign key in the fact tables
Natural key (NK)
Meaningful key extracted from the source systems
1-to-1 relationship to the PK for static dimensions
1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
Descriptive attributes
Primarily textual; numbers are legitimate, but not numbers that are measured quantities
100 such attributes is normal
Static or slowly changing only
Generating surrogate keys for Dimensions
Via triggers in the DBMS
Read the latest surrogate key, generate the next value, create the record
Disadvantage: severe performance bottlenecks
Via the ETL process: an ETL tool or a third-party application generates the unique numbers (a sketch follows below)
A surrogate key counter per dimension
Maintain consistency of surrogate keys between dev, test and production
Using smart keys
Concatenate the natural key of the dimension in the source(s) with the timestamp of the record in the source or the data warehouse
Tempting, but wrong
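A minimal sketch of the ETL-maintained counter approach, using one database sequence per dimension; seq_product_sk, map_product and stg_product are illustrative names, stg_product is assumed to hold one row per natural key, and sequence syntax varies between DBMSs (e.g., NEXT VALUE FOR vs. NEXTVAL):

CREATE SEQUENCE seq_product_sk START WITH 1;

-- Assign a surrogate key to every natural key that has no mapping yet
INSERT INTO map_product (product_sk, product_nk)
SELECT NEXT VALUE FOR seq_product_sk, s.product_nk
FROM stg_product s
WHERE NOT EXISTS (SELECT 1 FROM map_product m WHERE m.product_nk = s.product_nk);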
Why smart keys are wrong
By definition
Surrogate keys are supposed to be meaningless
Do you update the concatenated smart key if the natural key changes?
Performance
Natural keys may be chars and varchars, not integers
Adding a timestamp makes the key very big
The dimension is bigger
The fact tables containing the foreign key are bigger
Joining facts with dimensions on chars/varchars becomes inefficient
Heterogeneous sources
Smart keys work for homogeneous environments, but more likely than not the sources are heterogeneous, each with its own definition of the dimension
How does the definition of the smart key change when another source is added? It doesn't scale very well
One advantage: simplicity in the ETL process
The grain of a dimension
The definition of the key of the dimension in business terms: what does the dimension represent?
Analyze the source systems so that a particular set of fields in the source corresponds to the grain of the dimension
Verify that a given source (file) implements the intended grain
Nothing should be returned by the following query against the source system/file; if something is returned, the fields A, B and C do not represent the grain of the dimension:

select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
Building fact tables
Two types of load:
Initial load
ETL for all data up until now
Done when the DW is started for the first time
Often problematic to get correct historical data
Very heavy: large data volumes
Incremental update (see the sketch below)
Move only the changes since the last load
Done periodically (monthly/weekly/daily/hourly/...) after the DW start
Less heavy: smaller data volumes
Dimensions must be updated before facts
The relevant dimension rows for new facts must be in place
Special key considerations if the initial load must be performed again
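A minimal sketch of an incremental fact load that replaces natural keys with surrogate keys through the dimension mapping tables; stg_sales, map_product, map_customer, fact_sales and the etl_control table holding the last load date are all illustrative names:

INSERT INTO fact_sales (product_sk, customer_sk, sale_date, qty, amount)
SELECT mp.product_sk,
       mc.customer_sk,
       s.sale_date,
       s.qty,
       s.amount
FROM stg_sales s
JOIN map_product mp ON mp.product_nk = s.product_nk     -- dimensions were loaded first,
JOIN map_customer mc ON mc.customer_nk = s.customer_nk  -- so every lookup succeeds
WHERE s.sale_date > (SELECT last_load_date FROM etl_control);  -- only changes since the last load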
Agenda
ETL issues:
Overview of the ETL process
Extract
Transformation/cleaning
Data staging area (DSA)
Load
Building dimensions
Building fact tables
→ Data cleaning
Brief overview of ETL and data cleaning tools
Data Cleaning
Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get high-quality data!
Why Data Cleaning and Transformation?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=""
noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
e.g., Salary="-10"
inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
e.g., Age="42", Birthday="03/07/1997"
e.g., rating was "1, 2, 3", now rating is "A, B, C"
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data comes from:
data values not available when collected
different criteria between the time when the data was collected and when it is analyzed
human/hardware/software problems
Noisy data comes from:
data collection: faulty instruments
data entry: human or computer errors
data transmission
Inconsistent (and redundant) data comes from:
different data sources, hence non-uniform naming conventions/data codes
functional dependency and/or referential integrity violations
Why Is Data Cleaning Important?
The data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
No quality data, no quality decisions!
Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Types of data cleaning
Conversion, parsing and normalization
Text coding, date formats, etc.
Most common type of cleansing
Special-purpose cleansing
Normalize spellings of names, addresses, etc.
Remove duplicates, e.g., duplicate customers
Domain-independent cleansing
Approximate, fuzzy joins on not-quite-matching keys
Example: approximate joins
Goal: find records from different datasets that could be the same entity

Table R
Name            SSN           Addr
Jack Lemmon     430-871-8294  Maple St
Harrison Ford   292-918-2913  Culver Blvd
Tom Hanks       234-762-1234  Main St
...

Table S
Name            SSN           Addr
Ton Hanks       234-162-1234  Main Street
Kevin Spacey    928-184-2813  Frost Blvd
Jack Lemon      430-817-8294  Maple Street
...
A database solution
SELECT T.ssn_r, T.ssn_s, T.distance
FROM (SELECT R.SSN AS ssn_r, S.SSN AS ssn_s,
             editDistance(R.name, S.name) AS distance
      FROM R, S) T
WHERE T.distance < maxDist;

Problem: no optimization is supported for a Cartesian product with external function calls (editDistance is a user-defined function, maxDist a similarity threshold)
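The slides do not prescribe a workaround, but a common way to reduce the Cartesian product in practice is blocking: only record pairs that agree on a cheap blocking key are compared with the expensive function. A minimal sketch, reusing the assumed editDistance UDF and maxDist threshold, with the first letter of the name as the blocking key:

SELECT R.SSN, S.SSN, editDistance(R.name, S.name) AS distance
FROM R, S
WHERE SUBSTRING(R.name, 1, 1) = SUBSTRING(S.name, 1, 1)   -- blocking key: same first letter
  AND editDistance(R.name, S.name) < maxDist;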
Data quality vs cleaning
Data quality = data cleaning +
Data enrichment
enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
Data profiling (see the sketch after this list)
analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
Data monitoring
deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
Data stewards are responsible for data quality
DW-controlled improvement
Source-controlled improvement
Construct programs to check data quality
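A minimal data profiling sketch, computing row counts, null counts, distinct counts and out-of-range values for an illustrative source table src_customer:

SELECT COUNT(*)                                              AS row_count,
       COUNT(*) - COUNT(birth_date)                          AS null_birth_dates,
       COUNT(DISTINCT customer_nk)                           AS distinct_natural_keys,
       SUM(CASE WHEN age < 0 OR age > 120 THEN 1 ELSE 0 END) AS suspicious_ages
FROM src_customer;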
ETL Technological Solution
An ETL tool is a tool that reads data from one or more sources, transforms the data so that it is compatible with a destination, and loads the data into that destination.
Desired features:
Automated data movement across data stores and the analytical area, in support of a data warehouse or data mart
Extensible, robust, scalable infrastructure
Standardization of ETL processes across the enterprise
Reusability of custom and pre-built functions
Better utilization of existing hardware resources
Faster change control & management
Integrated metadata management
Complete development environment, "work as you think" design metaphor
Buy versus Build
Vendor tools promote standardization of the ETL process and reusability of custom and pre-built functions, lowering the time (and cost) of additional ETL efforts
Vendor ETL tools are somewhat self-documenting
Many tools can connect to a variety of sources (RDBMSs, non-relational, different OSs, ERP, PeopleSoft, etc.) without exposing the ETL developer to the differences
Vendor tools deal better with changes in the source systems or in the necessary transformations, reducing long-term ETL maintenance
Metadata management is a huge advantage, especially when sharing data among many applications
ETL prices have not dropped much over the years, but vendor tools offer increased functionality, performance, and usability
Commercial ETL tools
ETL tools from the big vendors, e.g.:
Oracle Warehouse Builder
IBM DB2 Warehouse Manager
Microsoft Integration Services
Offer much functionality at a reasonable price (included):
Data modeling
ETL code generation
Scheduling DW jobs
...
The best tool does not exist
Choose based on your own needs
Check first whether the standard tools from the big vendors are OK
http://www.etltool.com/
Magic Quadrant for Data Quality Tools, 2007
...
Open-source ETL tools
Some open source ETL tools:
Talend
Enhydra Octopus
Clover.ETL
Classification of DC tools (2005)
[Figure: classification of data cleaning tools (2005), grouped into commercial and research tools]
References
The Data Warehouse ETL Toolkit, R. Kimball, J. Caserta, Wiley, 2004
Extraction-Transformation-Loading, P. Vassiliadis, A. Simitsis, in Encyclopedia of Database Systems, Eds: Liu, Ling and Özsu, M. Tamer, Springer, 2009
Data Preprocessing slides, Jiawei Han and M. Kamber, http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html
Extract, Transform, and Load slides, Torben Bach Pedersen, Aalborg University, http://www.cs.aau.dk/~tbp/Teaching/DWML06/DWML06.html