
VSV Training

Chapter 3: Extracting
Prepared by: Thang Nguyen, Dung Phan
Date: 09/02/2008

3.0 Overview
The ETL process needs to effectively integrate systems that have different:
Database management systems
Operating systems
Hardware
Communications protocols

3.1 The Logical Data Map

3.1.1 Designing Logical Before Physical
1. Have a plan.
2. Identify data source candidates.
3. Analyze source systems with a data-profiling tool.
4. Receive walk-through of data lineage and business rules.
5. Receive walk-through of the data warehouse data model.
6. Validate calculations and formulas.

3.2 Inside the Logical Data Map
Before descending into the details of the various sources you will encounter, we need to explore the actual design of the logical data mapping document.

3.2.1 Components of the Logical Data Map
Target table name.
Target column name.
Table type.
SCD (slowly changing dimension) type.
Source database.
Source table name.
Source column name.
Transformation.
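
As a rough illustration, one row of the logical data map might look like the following sketch, expressed as a Python dict; every table, column, and rule name here is invented.

```python
# A minimal sketch of one logical data map entry. All table, column,
# and transformation names here are hypothetical examples.
mapping_entry = {
    "target_table_name": "DIM_CUSTOMER",
    "target_column_name": "customer_name",
    "table_type": "Dimension",        # e.g., Fact or Dimension
    "scd_type": 1,                    # 1 = overwrite, 2 = add new row
    "source_database": "ORDERS_PROD",
    "source_table_name": "CUST_MASTER",
    "source_column_name": "CUST_NM",
    "transformation": "TRIM and convert to title case",
}
```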


3.2.2 Using Tools for the Logical Data Map
Some ETL and data-modeling tools directly capture logical data mapping information. There is a natural tendency to want to indicate the data mapping directly in these tools.

3.3 Building the Logical Data Map
The analysis of the source system is usually broken into two major phases:
The data discovery phase
The anomaly detection phase

3.3.1 Data Discovery Phase
1. Collecting and Documenting Source Systems
2. Keeping Track of the Source Systems
3. Determining the System-of-Record
4. Analyzing the Source System: Using Findings from Data Profiling

3.3.1.1 Collecting and Documenting Source Systems
The source systems are usually established in various pieces of documentation, including interview notes, reports, and the data modeler's logical data mapping.

3.3.1.2 Keeping Track of the Source Systems
Subject area.
Interface name.
Business name.
Priority.
Department/Business use.
Business owner.
Technical owner.
DBMS.
Production server/OS.
# Daily users.
DB size.
DB complexity.
# Transactions per day.
Comments.

3.3.1.3 Determining the System-of-Record
Like a lot of the terminology in the data warehouse world, the system-of-record has many definitions; the variations depend on whom you ask.

3.3.1.4 Analyzing the Source System: Using Findings from Data Profiling
1. Unique identifiers and natural keys.
2. Data types.
3. Relationships between tables.
4. Discrete relationships.
5. Cardinality of relationships and columns.
One-to-one.
One-to-many.
Many-to-many.
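
A minimal profiling sketch of items 1 and 5, run here against a local SQLite copy of the source; the database, table, and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect("source_copy.db")

# Unique identifier check: a true natural key has no duplicate values.
dupes = conn.execute(
    "SELECT CUST_NM, COUNT(*) FROM CUST_MASTER "
    "GROUP BY CUST_NM HAVING COUNT(*) > 1"
).fetchall()
print("duplicate natural keys:", len(dupes))

# Column cardinality: distinct values vs. total rows hints at
# one-to-one vs. one-to-many relationships.
distinct, total = conn.execute(
    "SELECT COUNT(DISTINCT CUST_ID), COUNT(*) FROM ORDERS"
).fetchone()
print(f"{distinct} distinct customers across {total} order rows")
```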

3.3.2 Data Content Analysis
NULL values.
Dates in non-date fields.
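
A minimal sketch of checks for both issues; the sample rows and the expected date format are assumptions.

```python
from datetime import datetime

# Count NULL-ish values, and flag values in a text column that should
# parse as dates but do not.
rows = [{"ship_dt": "2008-02-09"}, {"ship_dt": ""}, {"ship_dt": "00000000"}]

nulls, bad_dates = 0, 0
for row in rows:
    value = row["ship_dt"]
    if value in ("", None):           # NULL values
        nulls += 1
        continue
    try:                               # dates stored in a non-date (text) field
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        bad_dates += 1
print(nulls, "nulls;", bad_dates, "unparseable dates")
```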

3.3.3 Collecting Business Rules in the ETL Process
You might think at this stage in the process that all of the business rules must have been collected.

3.4 Integrating Heterogeneous Data Sources
1. Identify the source systems.
2. Understand the source systems (data profiling).
3. Create record matching logic.
4. Establish survivorship rules.
5. Establish non-key attribute business rules.
6. Load conformed dimension.
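
A minimal sketch of steps 3 and 4: match records from two hypothetical source systems on a shared natural key, then apply a simple survivorship rule. The system names, fields, and priority order are all assumptions.

```python
# Two hypothetical sources describing the same customer.
crm_rows = {"C-1001": {"name": "Ada Lovelace", "phone": ""}}
erp_rows = {"C-1001": {"name": "A. Lovelace", "phone": "555-0100"}}

def survive(primary, secondary):
    """Take each attribute from the higher-priority source, falling
    back to the secondary source when the primary value is missing."""
    merged = {}
    for field in primary.keys() | secondary.keys():
        merged[field] = primary.get(field) or secondary.get(field)
    return merged

# Record matching on the natural key shared by both systems.
for natural_key in crm_rows.keys() & erp_rows.keys():
    conformed = survive(crm_rows[natural_key], erp_rows[natural_key])
    print(natural_key, conformed)
```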

3.4.1 The Challenge of Extracting from Disparate Platforms
Each data source can be in a different DBMS and also on a different platform. Databases and operating systems, especially legacy and proprietary ones, may require different procedure languages to communicate with their data.

3.4.2 Connecting to Diverse Sources through ODBC
ODBC manager.
ODBC driver.
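
A minimal sketch of reaching a source through the ODBC manager/driver pair, using the third-party pyodbc module; the DSN, credentials, and query are hypothetical.

```python
import pyodbc  # assumes an ODBC driver and a DSN are already configured

# The ODBC manager resolves the DSN to the right driver for the source.
conn = pyodbc.connect("DSN=LegacyOrders;UID=etl_user;PWD=secret")
cursor = conn.cursor()
for row in cursor.execute("SELECT CUST_ID, CUST_NM FROM CUST_MASTER"):
    print(row.CUST_ID, row.CUST_NM)
conn.close()
```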

3.5 Mainframe Sources
COBOL copybooks
EBCDIC character sets
Numeric data
REDEFINES fields
Packed decimal fields
Multiple OCCURS fields
Multiple record types
Variable record lengths

3.5.1 Working with COBOL Copybooks

3.5.2 EBCDIC Character Set
On both the legacy mainframe systems and the UNIX- and Windows-based systems where most data warehouses reside, data is stored as bits and bytes.

3.5.3 Converting EBCDIC to ASCII
You might think that since both systems use bits and bytes, data from your mainframe system is readily usable on your UNIX or Windows system.
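
In fact, the byte values differ. A minimal sketch of the translation using Python's built-in cp037 codec, one common EBCDIC code page; which code page actually applies depends on the mainframe and is an assumption here.

```python
# "HELLO" encoded in EBCDIC code page 037.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
text = ebcdic_bytes.decode("cp037")   # decode EBCDIC to a Python string
ascii_bytes = text.encode("ascii")    # re-encode as ASCII for the target
print(text)                           # HELLO
```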

3.5.4 Transferring Data between Platforms
Luckily, translating data from EBCDIC to ASCII is quite simple. In fact, it's virtually automatic, assuming you use File Transfer Protocol (FTP) to transfer the data from the mainframe to your data warehouse platform.
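
A minimal sketch of such a transfer with Python's ftplib: retrlines issues the RETR in ASCII (TYPE A) mode, which performs the EBCDIC-to-ASCII translation during the transfer. The host, credentials, and dataset name are hypothetical.

```python
from ftplib import FTP

ftp = FTP("mainframe.example.com")
ftp.login("etl_user", "secret")
with open("customers.txt", "w") as out:
    # ASCII-mode transfer, delivered line by line to the callback.
    ftp.retrlines("RETR 'PROD.CUSTOMER.EXTRACT'",
                  lambda line: out.write(line + "\n"))
ftp.quit()
```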

3.5.5 Handling Mainframe Numeric Data
When you begin to work with quantitative data elements, such as dollar amounts, counts, and balances, you can see that there's more to these numbers than meets the eye.

3.5.6 Using PICtures


3.5.7 Unpacking Packed Decimals
Reformat data
Transfer data
Use robust ETL tools
Use a utility program
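
A minimal sketch of unpacking an IBM packed-decimal (COMP-3) field by hand, for cases where no utility or ETL tool is available; the sample bytes and scale are assumptions.

```python
def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    """COMP-3 stores two digits per byte, with the sign in the final
    nibble (0xD means negative). `scale` is the implied decimal places."""
    digits = []
    for byte in raw:
        digits.append((byte >> 4) & 0x0F)
        digits.append(byte & 0x0F)
    sign_nibble = digits.pop()        # last nibble is the sign, not a digit
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return value / (10 ** scale)

# Bytes 0x12 0x34 0x5C hold +12345; with two implied decimals: 123.45
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))
```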

3.5.8 Working with Redefined Fields

3.5.9 Multiple OCCURS

3.5.10 Managing Multiple Mainframe Record Type Files

3.5.11 Handling Mainframe Variable Record Lengths
Convert all data
Transfer the file
The last option

3.5.12 Extracting from IMS, IDMS, Adabas, and Model 204

3.6 Flat Files
Delivery of source data.
Working/staging tables.
Preparation for bulk load.

Not all flat files are created equal. Flat files essentially come in two flavors:
Fixed length
Delimited

3.6.1 Processing Fixed Length Flat Files

3.6.2 Processing Delimited Flat Files
Flat files often come with a set of delimiters that separate the data fields within the file.
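
A minimal sketch of processing both flavors; the field offsets, file name, and pipe delimiter are assumptions.

```python
import csv

# Fixed length: every field lives at a known byte offset.
line = "1001Ada Lovelace   19981105"
record = {
    "cust_id": line[0:4],
    "name": line[4:19].rstrip(),
    "first_order": line[19:27],
}
print(record)

# Delimited: fields are separated by a delimiter, here a pipe.
with open("orders.psv", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        print(row)
```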

3.7 XML Sources

3.7.1 Character Sets
Character sets are groups of unique symbols used for displaying and printing computer output.

3.7.2 XML Meta Data
We hear quite often that XML is nothing more than a flat file that contains data.
DTD (Document Type Definition)
Base data.
Element structures.
Mixed content.
Nillable.
Cardinality.
Allowed values.

3.7.2 XML Meta Data (cont)
XML Schema
Elements that appear in an XML document
Attributes that appear in an XML document
The number and order of child elements
Data types of elements and attributes
Default and fixed values for elements and attributes
Extensible to future additions
Support of namespaces
Namespaces
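
A minimal sketch of reading element structure, attributes, and values from an XML source, which is exactly the metadata a DTD or XML Schema would formalize; the document below is invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="A-1"><amount currency="USD">19.95</amount></order>
  <order id="A-2"><amount currency="EUR">7.50</amount></order>
</orders>
"""
root = ET.fromstring(doc)
for order in root.findall("order"):          # child element structure
    amount = order.find("amount")
    # attributes and element text carry the actual data values
    print(order.get("id"), amount.text, amount.get("currency"))
```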

3.8 Web Log Sources

3.8.1 W3C Common and Extended Formats
Regardless of the OS, Web logs have a common set of columns that usually include the following:
Date.
Time.
c-ip.
Service Name.
s-ip.
cs-method.
cs-uri-stem.

3.8.1 W3C Common and Extended Formats (cont)
cs-uri-query.
sc-status.
sc-bytes.
cs(User-Agent).
cs(Cookie).
cs(Referrer).

Additional extended fields are described in your Web server documentation or on the W3C Web site:
Server Name.
cs-username.
Server Port.
Bytes Received.
Time Taken.
Protocol Version.

3.8.2 Name Value Pairs in Web Logs
Take a look at the query string and notice the following segments:
/product/
product.asp
? The question mark indicates that parameters were sent to the program file.
In this example, there are three parameters: p, c, and s.
3.8.2 Name Value Pairs in Web Logs (cont)
The parameters are captured in the Web log in name-value pairs. In this example, you can see three parameters, each separated by an ampersand (&).
p indicates the product number
c indicates the product category number
s indicates the search string entered by the user to find the product
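
A minimal sketch of splitting those name-value pairs out of the logged URI; the URI below is a hypothetical example.

```python
from urllib.parse import urlparse, parse_qs

uri = "/product/product.asp?p=1234&c=77&s=garden+hose"
parsed = urlparse(uri)
# parse_qs splits on '&' and decodes each name=value pair.
params = parse_qs(parsed.query)  # {'p': ['1234'], 'c': ['77'], 's': ['garden hose']}
print(params["p"][0], params["c"][0], params["s"][0])
```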

3.9 ERP System Sources

3.10 Extracting Changed Data

3.10.1 Detecting Changes
Using audit columns.
Database log scraping or sniffing.
Timed extracts.
Process of elimination.
Initial and incremental loads.
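
A minimal sketch of the audit-column approach: extract only rows stamped after the previous load's high-water mark. SQLite stands in for the source DBMS, and all table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect("source_copy.db")
last_run = "2008-02-08 00:00:00"   # persisted high-water mark from the prior load

changed = conn.execute(
    "SELECT CUST_ID, CUST_NM, LAST_UPDT_TS FROM CUST_MASTER "
    "WHERE LAST_UPDT_TS > ?",      # constrain on the audit column
    (last_run,),
).fetchall()
print(len(changed), "changed rows since", last_run)
```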

3.10.2 Extraction Tips
Constrain on indexed columns.
Retrieve only the data you need.
Use DISTINCT sparingly.
Use SET operators sparingly.
Use HINT as necessary.
Avoid NOT.
Avoid functions in your WHERE clause.
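
A hedged before/after illustration of two of these tips; the table, columns, and SQL dialect are assumptions.

```python
# Before: the function wrapped around ORDER_DT defeats any index on
# that column, and SELECT * drags back columns the ETL never uses.
bad_sql = """
SELECT * FROM ORDERS
WHERE TO_CHAR(ORDER_DT, 'YYYY-MM-DD') = '2008-02-09'
"""

# After: a range predicate on the raw (indexed) column can use the
# index, and the select list names only the columns actually needed.
good_sql = """
SELECT ORDER_ID, CUST_ID, ORDER_AMT FROM ORDERS
WHERE ORDER_DT >= DATE '2008-02-09'
  AND ORDER_DT <  DATE '2008-02-10'
"""
```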

3.10.3 Detecting Deleted or Overwritten Fact Records at the Source
Negotiate with the source system owners, if possible, for explicit notification of all deleted or overwritten measurement records.
Periodically check historical totals of measurements from the source system to alert the ETL staff that something has changed. When a change is detected, drill down as far as possible to isolate the change.
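
A minimal sketch of that periodic totals check; the periods and figures are invented.

```python
# Compare a historical total from the source system against the same
# total in the warehouse, then drill down by period when they disagree.
source_totals = {"2008-01": 10500.00, "2008-02": 8200.00}
warehouse_totals = {"2008-01": 10500.00, "2008-02": 7950.00}

for period, src in sorted(source_totals.items()):
    dw = warehouse_totals.get(period, 0.0)
    if abs(src - dw) > 0.01:
        # A mismatch suggests facts were deleted or overwritten at the
        # source; drill down (by day, then by key) to isolate the change.
        print(f"{period}: source {src} != warehouse {dw}")
```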

Summary
The Logical Data Map
The Challenge of Extracting from Disparate Platforms
Extracting Changed Data
