
Data Integration

Dr. N. P. Singh Professor Management Development Institute Mehrauli Road, Sukhrali Gurgaon -122001 E-mail: knpsingh@mdi.ac.in

Extract-Transform-Load (ETL)

[Figure: ETL flow]
Sources -> Extract -> DSA (data staging area) -> Transform & Clean -> Load -> DW (data warehouse)
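The three stages in the figure can be sketched in a few lines of Python. This is only an illustrative sketch: the row layout, the cleaning rule (drop rows with a missing COST) and the function names are assumptions, not something the slides prescribe.

```python
from datetime import date

def extract(source_rows):
    """Copy raw rows from a source into the staging area (DSA)."""
    return [dict(row) for row in source_rows]

def transform_and_clean(staged_rows, load_date):
    """Assumed rule: drop rows without a COST and stamp the rest with the load date."""
    cleaned = []
    for row in staged_rows:
        if row.get("COST") is None:
            continue  # cleaning step: reject incomplete rows
        cleaned.append({**row, "DATE": load_date})
    return cleaned

def load(warehouse, rows):
    """Append the cleaned rows to the warehouse table (here a plain list)."""
    warehouse.extend(rows)
    return warehouse

# Hypothetical source rows, used only for illustration.
source = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5, "COST": 2.5},
    {"PKEY": 2, "SUPPKEY": 11, "QTY": 7, "COST": None},
]
dw = load([], transform_and_clean(extract(source), date(2024, 1, 1)))
```

Keeping the stages as separate functions mirrors the figure: the staging area holds a copy of the source data, so cleaning never touches the source itself.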


The lifecycle of a Data Warehouse and its ETL processes


[Figure: lifecycle stages and the artifacts they produce]
Reverse Engineering of Sources & Requirements Collection
Conceptual Model for DW, Sources & Activities
Logical Design
Logical Model for DW, Sources & Activities
Tuning, Full Activity Description
Physical Model for DW, Sources & Activities
Software Construction
Software & SW Metrics
Administration of DW
Metrics

Conceptual Model
Entities of our model:
Concepts
Attributes
Part-of Relationships
Transformations
Serial Composition of Transformations
Provider Relationships
Notes
ETL Constraints
Candidate Relationships

Conceptual Model
[Figure: graphical notation]
Symbols for: concept, attribute, transformation, ETL_constraint, note, part of, provider 1:1, provider N:M, serial composition, candidate (candidate1 ... candidaten -> target), active candidate, {XOR}

Conceptual Model
Concepts
defined by a name and a finite set of attributes
represent an entity in the source database or in the DW

Attributes
same role as in ER/dimensional models
a granular module of information

We do not use standard UML notation for concepts and attributes because we need to treat attributes as first-class citizens of our model.

Conceptual Model
Part-of Relationships
a finite set of attributes
emphasize the fact that a concept is composed of a set of attributes

Conceptual Model
Example
Source 1
S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}

Data Warehouse
DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
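The two concepts in this example can be written down as plain data structures, which makes the part-of relationship explicit. The sketch below is an illustration only; the class names `Attribute` and `Concept` and the helper `make_concept` are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str

@dataclass(frozen=True)
class Concept:
    name: str
    attributes: tuple  # part-of: the attributes that make up the concept

    def attribute_names(self):
        return [a.name for a in self.attributes]

def make_concept(name, attribute_names):
    """Hypothetical helper: build a concept from a list of attribute names."""
    return Concept(name, tuple(Attribute(n) for n in attribute_names))

s1_partsupp = make_concept("S1.PARTSUPP", ["PKEY", "SUPPKEY", "QTY", "COST"])
dw_partsupp = make_concept("DW.PARTSUPP", ["PKEY", "SUPPKEY", "DATE", "QTY", "COST"])

# Attributes of the warehouse concept that the source cannot supply directly.
missing_in_source = [
    n for n in dw_partsupp.attribute_names()
    if n not in s1_partsupp.attribute_names()
]
```

Here `missing_in_source` contains only `DATE`, which is why the warehouse needs some other way to populate that attribute.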

Conceptual Model
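[Figure: S1.PARTSUPP with attributes PKey, SuppKey, Qty, Cost and DW.PARTSUPP with attributes PKey, SuppKey, Date, Qty, Cost, each attribute attached to its concept by a part-of relationship]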

Conceptual Model
Transformations
defined by a finite set of input/output attributes and a symbol
abstractions that represent parts, or full modules, of code executing a single task
two categories:
filtering or data cleaning operations (e.g., checking for foreign key violations)
transformation operations (e.g., aggregation)
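To make the two categories concrete, here is a minimal sketch with one filter (rejecting foreign key violations) and one transformation (a SUM aggregation). The function names and sample rows are hypothetical.

```python
from collections import defaultdict

def filter_foreign_key_violations(rows, key, valid_keys):
    """Filter: keep only rows whose key value exists in the referenced set."""
    return [r for r in rows if r[key] in valid_keys]

def aggregate_sum(rows, group_by, measure):
    """Transformation: sum a measure per combination of grouping attributes."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[g] for g in group_by)] += r[measure]
    return dict(totals)

# Hypothetical rows; supplier 99 is not a known supplier.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 3},
    {"PKEY": 2, "SUPPKEY": 99, "QTY": 4},
]
known_suppliers = {10, 11}

kept = filter_foreign_key_violations(rows, "SUPPKEY", known_suppliers)
totals = aggregate_sum(kept, ["PKEY", "SUPPKEY"], "QTY")
```

The filter changes which rows survive, while the aggregation changes their shape; that difference is what separates the two categories.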

Conceptual Model
Provider Relationships
defined by a finite set of input/output attributes and an appropriate transformation
map a set of input attributes to a set of output attributes through a relevant transformation*
two kinds: provider 1:1 and provider N:M

* If the attributes are semantically and physically compatible, no transformation is required
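A provider relationship can be read as "these input attributes feed that output attribute, optionally through a transformation". The sketch below encodes that reading; the `providers` table, the function `apply_providers` and the conversion rate of 0.5 are assumptions made for illustration.

```python
def apply_providers(source_row, providers):
    """Build a target row from a source row.

    providers maps each target attribute to (input attributes, transformation).
    A transformation of None means the attributes are compatible and the
    value is copied unchanged.
    """
    target = {}
    for target_attr, (source_attrs, transformation) in providers.items():
        inputs = [source_row[a] for a in source_attrs]
        if transformation is None:
            target[target_attr] = inputs[0]  # direct 1:1 provider
        else:
            target[target_attr] = transformation(*inputs)
    return target

# Hypothetical mapping: COST is converted with an assumed rate of 0.5.
providers = {
    "PKEY": (["PKEY"], None),
    "QTY": (["QTY"], None),
    "COST": (["COST"], lambda cost: cost * 0.5),
}

row = {"PKEY": 1, "SUPPKEY": 10, "QTY": 4, "COST": 10.0}
target_row = apply_providers(row, providers)
```

Because each entry lists its input attributes, the same structure could describe a provider with several inputs by naming more than one source attribute and passing a transformation that accepts them all.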

Conceptual Model
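[Figure: provider relationships from S1.PARTSUPP (PKey, SuppKey, Qty, Cost) to DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), with the transformations SK and NN placed on the mappings]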

Conceptual Model
Notes
informal tags, exactly as in UML modeling
used for:
simple comments explaining design decisions
explanation of the semantics of the applied transformation
tracing of runtime constraints

Conceptual Model
[Figure: the same S1.PARTSUPP to DW.PARTSUPP mapping, with transformations SK and NN, now annotated with the note "Date = SysDate()"]

Conceptual Model
ETL Constraints
defined by a finite set of attributes and a single transformation
express the fact that the data of a certain concept fulfill several requirements
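As one example of such a requirement, a primary key (PK) constraint states that no two rows of a concept share the same key values. A minimal check, assuming rows are dictionaries and using a hypothetical function name, might look like this:

```python
def primary_key_violations(rows, key_attrs):
    """Return the key values that occur more than once, in order of detection."""
    seen = set()
    duplicates = []
    for r in rows:
        key = tuple(r[a] for a in key_attrs)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates

# Hypothetical rows: the pair (1, 10) appears twice.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 9},
    {"PKEY": 2, "SUPPKEY": 10, "QTY": 4},
]
violations = primary_key_violations(rows, ["PKEY", "SUPPKEY"])
```

An empty result means the constraint holds for the data checked.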

Conceptual Model
[Figure: the same mapping with an ETL constraint PK attached to S1.PARTSUPP, the transformations SK and NN, and the note "Date = SysDate()"]

Conceptual Model
Candidate Relationships
defined by a single candidate concept and a single target concept
used when a certain DW concept can be populated by more than one candidate source concept

Active Candidate Relationship
a specialization of candidate relationships
denotes the candidate that has been selected for the population of the target concept
notation: candidate1 ... candidaten -> target, annotated {XOR}

Conceptual Model
[Figure: candidate relationships involving Annual PartSupps and Recent PartSupps ({XOR}), S1.PartSupp, S2.PartSupp and DW.PartSupp]
Note: Due to accuracy and small size (< update window)
Note: Necessary providers: S1 and S2
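The {XOR} annotation says that exactly one candidate is chosen at a time. The sketch below enforces that rule in code. It assumes, for illustration only, that Annual PartSupps and Recent PartSupps are the candidates for a single target and that Recent PartSupps is the one selected; the class name `CandidateSet` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CandidateSet:
    target: str
    candidates: List[str]
    active: Optional[str] = None  # at most one active candidate (XOR)

    def activate(self, candidate):
        """Select one candidate; any previous choice is replaced."""
        if candidate not in self.candidates:
            raise ValueError(f"{candidate} is not a candidate for {self.target}")
        self.active = candidate

# Assumed target and choice, for illustration.
candidates = CandidateSet(
    target="S2.PartSupp",
    candidates=["Annual PartSupps", "Recent PartSupps"],
)
candidates.activate("Recent PartSupps")
```

Storing a single `active` value rather than a flag per candidate makes it impossible to mark two candidates as active at once.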

Conceptual Model
[Figure: full conceptual model for the example]
Concepts: Annual PartSupps, Recent PartSupps ({XOR}), S2.PARTSUPP, S1.PARTSUPP, DW.PARTSUPP
Attributes shown: PKey, SuppKey, Date, Qty, Cost, Department
Labels on the S2 mappings: S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)
Transformations and constraints: SK, PK, NN, f, $2, American to European Date
Notes: Due to accuracy and small size (< update window); Necessary providers: S1 and S2; Date = SysDate()
Runtime constraint: {Duration<4h}
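The "American to European Date" transformation named in the figure is a format conversion. A minimal sketch, assuming the American form is MM/DD/YYYY and the European form is DD/MM/YYYY (the slides do not spell out the exact formats):

```python
from datetime import datetime

def american_to_european_date(text):
    """Convert 'MM/DD/YYYY' to 'DD/MM/YYYY'; raises ValueError on bad input."""
    parsed = datetime.strptime(text, "%m/%d/%Y")
    return parsed.strftime("%d/%m/%Y")

converted = american_to_european_date("12/31/2023")
```

Parsing into a date first, rather than swapping substrings, also rejects impossible dates such as month 13.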

Conceptual Model: first attempts


[Figure: first attempt at the conceptual model, with two intermediate concepts PS1 and PS2]
Concepts: Annual PartSupps, Recent PartSupps ({XOR}), S2.PARTSUPP, S1.PARTSUPP, PS1, PS2, DW.PARTSUPP
Attributes shown: PKey, SuppKey, Date, Qty, Cost, Dept
Labels on the S2 mappings: S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)
Transformations and constraints: SK, PK, NN, f, American to European Date
Notes: Due to accuracy and small size (< update window); Necessary providers: S1 and S2; Date = SysDate()
Runtime constraint: {Duration<4h}
Join condition: PS1.Pkey+=PS2.PKey PS1.SuppKey+=PS2.SuppKey PS1.Dept+=PS2.Dept

Instantiation & Specialization Layers


The key issues:
genericity
identification of a small set of generic constructs to capture all cases

usability
construction of a palette of frequently used types

Instantiation & Specialization Layers

Metamodel layer
a set of generic entities, able to represent any ETL scenario
involves the classes Concept, Attribute, Transformation, ETL Constraint and Relationship

Template layer
a set of built-in specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios

Schema layer
a specific ETL scenario
all the entities of the Schema layer are instances of the classes of the Metamodel layer
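The three layers map naturally onto classes, subclasses and objects. The sketch below illustrates that correspondence only; treating DW.PartSupp as a fact table is an assumption, and the class bodies are deliberately minimal.

```python
# Metamodel layer: generic classes.
class Concept:
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = list(attributes)

class Transformation:
    def __init__(self, name):
        self.name = name

# Template layer: built-in specializations (IsA).
class FactTable(Concept):
    pass

class SurrogateKeyAssignment(Transformation):
    pass

class Aggregation(Transformation):
    pass

# Schema layer: the entities of one specific scenario (InstanceOf).
dw_partsupp = FactTable("DW.PartSupp", ["PKey", "SuppKey", "Date", "Qty", "Cost"])
sk = SurrogateKeyAssignment("SK")
```

Since `FactTable` inherits from `Concept`, every schema-layer object is also an instance of a metamodel class, which is the property the Schema layer description states.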

Instantiation & Specialization Layers


[Figure: the three layers, linked by IsA (Template to Metamodel) and InstanceOf (Schema to Metamodel)]
Metamodel Layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship
Template Layer: Fact Table, Dimension, ER Entity, ER Relationship; Surrogate Key Assignment, Aggregation, American to European Date, $2; Part Of, Provider, Serial Composition, Candidate
Schema Layer: S2.PartSupp, DW.PartSupp, Candidate 1, Candidate 2, SK, f

Instantiation & Specialization Layers


Template layer
Four groups of logical transformations
Filters
Unary transformations
Binary transformations
Composite transformations

Two groups of physical transformations
Transfer operations
File operations

Instantiation & Specialization Layers


Filters
Selection (σ)
Not null (NN)
Primary key violation (PK)
Foreign key violation (FK)
Unique value (UN)
Domain mismatch (DM)

Composite transformations
Slowly changing dimension (Type 1,2,3) (SDC-1/2/3)
Format mismatch (FM)
Data type conversion (DTC)
Switch (*)
Extended union (U)

Unary transformations
Push
Aggregation (γ)
Projection (π)
Function application (f)
Surrogate key assignment (SK)
Tuple normalization (N)
Tuple denormalization (DN)

File operations
EBCDIC to ASCII conversion (EB2AS)
Sort file (Sort)

Transfer operations
Ftp (FTP)
Compress/Decompress (Z/dZ)
Encrypt/Decrypt (Cr/dCr)

Binary transformations
Union (U)
Join (⋈)
Diff (Δ)
Update Detection (UPD)
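A few entries from this palette can be sketched as small functions: one filter (NN), one unary transformation (SK) and one binary transformation (Diff). The signatures and the in-memory lookup table are assumptions for illustration; a real surrogate key assignment would normally consult a persistent lookup table.

```python
def not_null(rows, attr):
    """Filter (NN): keep rows where the attribute has a value."""
    return [r for r in rows if r.get(attr) is not None]

def assign_surrogate_keys(rows, natural_key, lookup):
    """Unary (SK): replace a natural key with an integer surrogate key.

    lookup maps natural keys to surrogates and is extended in place.
    New keys get len(lookup) + 1, which assumes surrogates are 1..n.
    """
    result = []
    for r in rows:
        natural = r[natural_key]
        if natural not in lookup:
            lookup[natural] = len(lookup) + 1
        result.append({**r, natural_key: lookup[natural]})
    return result

def diff(new_rows, old_rows):
    """Binary (Diff): rows present in new_rows but not in old_rows."""
    return [r for r in new_rows if r not in old_rows]

# Hypothetical rows with natural keys "A" and "B".
rows = [
    {"PKEY": "A", "COST": 1.0},
    {"PKEY": "B", "COST": None},
    {"PKEY": "A", "COST": 2.0},
]
lookup = {}
keyed = assign_surrogate_keys(not_null(rows, "COST"), "PKEY", lookup)
changed = diff(keyed, [{"PKEY": 1, "COST": 1.0}])
```

Because each function takes rows and returns rows, they can be chained, which is the idea behind serial composition of transformations.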

Methodology
Step 1
Identification of the proper data stores

Step 2
Candidates and active candidates for the involved data stores

Step 3
Attribute mapping between the providers and the consumers

Step 4
Annotating the diagram with runtime constraints
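The four steps can be followed by filling in a simple description of a scenario, one part per step, and then checking it for consistency. Everything below is an illustrative assumption: the dictionary layout, the function `check_scenario`, the choice of target and active candidate, and the two attribute mappings.

```python
scenario = {
    # Step 1: the data stores involved.
    "data_stores": ["S1.PARTSUPP", "S2.PARTSUPP", "DW.PARTSUPP"],
    # Step 2: candidates and the active candidate per target (assumed).
    "candidates": {"S2.PARTSUPP": ["Annual PartSupps", "Recent PartSupps"]},
    "active_candidates": {"S2.PARTSUPP": "Recent PartSupps"},
    # Step 3: (provider attribute, transformation, consumer attribute).
    "mappings": [
        ("S1.PARTSUPP.PKEY", "SK", "DW.PARTSUPP.PKEY"),
        ("S1.PARTSUPP.COST", "NN", "DW.PARTSUPP.COST"),
    ],
    # Step 4: runtime constraints annotated on the diagram.
    "runtime_constraints": ["Duration < 4h"],
}

def check_scenario(s):
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    stores = set(s["data_stores"])
    for target, active in s["active_candidates"].items():
        if active not in s["candidates"].get(target, []):
            problems.append(f"{active} is not a candidate for {target}")
    for source_attr, _transformation, target_attr in s["mappings"]:
        for qualified in (source_attr, target_attr):
            store = qualified.rsplit(".", 1)[0]  # strip the attribute name
            if store not in stores:
                problems.append(f"{qualified} refers to an unidentified data store")
    return problems

problems = check_scenario(scenario)
```

The checks follow the order of the steps: a mapping in step 3 only makes sense for data stores identified in step 1, and an active candidate must come from the candidates listed in step 2.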
