Dr. N. P. Singh Professor Management Development Institute Mehrauli Road, Sukhrali Gurgaon -122001 E-mail: knpsingh@mdi.ac.in
Extract-Transform-Load (ETL)
[Figure: the ETL pipeline. Data is extracted from the Sources into the Data Staging Area (DSA), transformed and cleaned, and loaded into the Data Warehouse (DW).]
Logical Design
Conceptual Model
Entities of our model:
Concepts, Attributes, Part-of Relationships, Transformations, Serial Composition of Transformations, Provider Relationships, Notes, ETL Constraints, Candidate Relationships
[Figure: graphical notation for concept, attribute, transformation, ETL_constraint, note, provider 1:1, provider N:M, serial composition, and part of.]
Concepts
Each concept has a name and a finite set of attributes. Concepts represent an entity in the source database or in the DW.
Attributes
Attributes play the same role as in ER/dimensional models: each is a granular module of information.
We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model.
Part-of Relationships
Defined by a finite set of attributes. They emphasize the fact that a concept is composed of a set of attributes.
Example
Source 1: S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
Data Warehouse: DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
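The two concepts of the example, with their attributes and part-of relationships, can be written down as a minimal Python sketch. The class names `Attribute` and `Concept` and the helper `concept` are illustrative choices made here, not part of the model's definition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attribute:
    """A granular module of information, kept as a first-class object."""
    name: str


@dataclass(frozen=True)
class Concept:
    """A named entity of a source database or of the DW."""
    name: str
    attributes: Tuple[Attribute, ...]

    def part_of(self):
        """List the part-of relationships as (attribute, concept) pairs."""
        return [(a.name, self.name) for a in self.attributes]


def concept(name, *attribute_names):
    """Hypothetical convenience constructor for this sketch."""
    return Concept(name, tuple(Attribute(n) for n in attribute_names))


s1_partsupp = concept("S1.PARTSUPP", "PKEY", "SUPPKEY", "QTY", "COST")
dw_partsupp = concept("DW.PARTSUPP", "PKEY", "SUPPKEY", "DATE", "QTY", "COST")

for pair in dw_partsupp.part_of():
    print(pair)
```

Keeping `Attribute` as its own class mirrors the point that attributes are treated as first-class citizens rather than as fields hidden inside a concept.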
[Figure: the concepts S1.PARTSUPP and DW.PARTSUPP, each connected to its attributes by part-of relationships.]
Transformations
Defined by a finite set of input/output attributes and a symbol. Transformations are abstractions that represent parts, or full modules, of code executing a single task. Two categories:
filtering or data cleaning operations (e.g., foreign key violations)
transformation operations (e.g., aggregation)
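The two categories can be illustrated with a small Python sketch: one filtering operation (rejecting foreign key violations) and one transformation operation (aggregation). The function names, the row layout, and the choice of a sum as the aggregate are assumptions made for this example.

```python
from collections import defaultdict


def filter_fk_violations(rows, key, valid_keys):
    """Filtering / cleaning: keep only rows whose key exists in valid_keys."""
    return [row for row in rows if row[key] in valid_keys]


def aggregate_sum(rows, group_by, measure):
    """Transformation: sum a measure per value of the group_by attribute."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)


rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5},
    {"PKEY": 1, "SUPPKEY": 11, "QTY": 7},
    {"PKEY": 2, "SUPPKEY": 99, "QTY": 3},  # supplier 99 is unknown
]

clean = filter_fk_violations(rows, "SUPPKEY", valid_keys={10, 11})
totals = aggregate_sum(clean, "PKEY", "QTY")
print(totals)
```

Each function executes a single task, which is the granularity a transformation is meant to have in the conceptual model.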
Provider Relationships
Defined by a finite set of input attributes, a finite set of output attributes, and an appropriate transformation. They map a set of input attributes to a set of output attributes through a relevant transformation.
Two forms: provider 1:1 and provider N:M.
[Figure: provider relationships from the attributes of S1.PARTSUPP to the attributes of DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), passing through the SK and NN transformations.]
Notes
Informal tags, exactly as in UML modeling. Used for simple comments explaining design decisions, explanation of the semantics of the applied transformation, and tracing of runtime constraints.
[Figure: the same diagram, now carrying the note Date = SysDate().]
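A rough Python sketch of what the provider relationships of this example could look like once implemented. Several details are assumptions, since the slides do not spell them out: SK is read as surrogate key assignment applied to PKEY and SUPPKEY, NN as a not-null check on COST, and the lookup dictionaries and the function names are hypothetical.

```python
from datetime import date


def surrogate_key(lookup, production_key):
    """SK: return the surrogate key of a production key, assigning a new one if unseen."""
    if production_key not in lookup:
        lookup[production_key] = len(lookup) + 1
    return lookup[production_key]


def load_partsupp(source_rows, pkey_lookup, suppkey_lookup, today=None):
    """Map S1.PARTSUPP rows to DW.PARTSUPP rows (assumed wiring, see lead-in)."""
    today = today or date.today()  # the note: Date = SysDate()
    target_rows = []
    for row in source_rows:
        if row["COST"] is None:  # NN: reject rows with a null COST
            continue
        target_rows.append({
            "PKEY": surrogate_key(pkey_lookup, row["PKEY"]),
            "SUPPKEY": surrogate_key(suppkey_lookup, row["SUPPKEY"]),
            "DATE": today,
            "QTY": row["QTY"],    # provider 1:1, copied unchanged
            "COST": row["COST"],
        })
    return target_rows


source = [
    {"PKEY": "P-7", "SUPPKEY": "S-3", "QTY": 4, "COST": 2.5},
    {"PKEY": "P-8", "SUPPKEY": "S-3", "QTY": 1, "COST": None},
]
result = load_partsupp(source, {}, {}, today=date(2020, 1, 1))
print(result)
```

Passing `today` explicitly keeps the sketch deterministic; a real load would more likely take the date from the database server.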
ETL Constraints
Defined by a finite set of attributes and a single transformation. They express the fact that the data of a certain concept must fulfill several requirements.
[Figure: the same diagram with a PK ETL constraint added.]
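As an illustration of how an ETL constraint such as PK might be checked at load time, here is a minimal Python sketch. The choice of (PKEY, SUPPKEY, DATE) as the key attributes is an assumption for the example; the slides do not state which attributes the PK constraint covers.

```python
from collections import Counter


def pk_violations(rows, key_attrs):
    """Return the key values that occur more than once in rows."""
    counts = Counter(tuple(row[a] for a in key_attrs) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2020-01-01", "QTY": 5},
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2020-01-01", "QTY": 6},  # duplicate key
    {"PKEY": 2, "SUPPKEY": 10, "DATE": "2020-01-01", "QTY": 3},
]

violations = pk_violations(rows, ["PKEY", "SUPPKEY", "DATE"])
print(violations)
```

An empty result means the data of the concept fulfills the requirement the constraint expresses.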
Candidate Relationships
Defined by a single candidate concept and a single target concept. Used when a certain DW concept can be populated by more than one candidate source concept.
[Figure: candidate relationships for DW.PartSupp. Labels: S1.PartSupp, Annual PartSupps, S2.PartSupp, Recent PartSupps, joined by an {XOR} constraint. Attached note: "Due to accuracy and small size (< update window)".]
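The {XOR} annotation can be read as "exactly one candidate is active". A minimal Python sketch of that rule follows; treating "Annual PartSupps" and "Recent PartSupps" as the two candidates, and the function names themselves, are assumptions of this example.

```python
def activate_candidate(candidates, chosen):
    """Mark one candidate concept as active and all others as inactive."""
    if chosen not in candidates:
        raise ValueError(f"{chosen!r} is not a candidate")
    return {name: name == chosen for name in candidates}


def satisfies_xor(active_flags):
    """{XOR}: exactly one candidate may populate the target concept."""
    return sum(1 for active in active_flags.values() if active) == 1


candidates = ["Annual PartSupps", "Recent PartSupps"]
flags = activate_candidate(candidates, "Recent PartSupps")
print(flags, satisfies_xor(flags))
```

The reason for the choice (accuracy, size relative to the update window) stays in a note on the diagram; the code only records the outcome.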
[Figure: the complete conceptual diagram for the PARTSUPP example. Labels: S1.PARTSUPP, DW.PARTSUPP, PS2; attributes PKey, SuppKey, Qty, Cost; transformations SK and $2; PK constraints; notes "Date = SysDate()", "Due to accuracy and small size (< update window)", and "Necessary providers: S1 and S2"; runtime constraint {Duration<4h}.]
Usability: construction of a palette of frequently used types
Template layer
A set of built-in specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios.
Schema layer
A specific ETL scenario. All the entities of the Schema layer are instances of the classes of the Metamodel layer.
[Figure: the Metamodel, Template, and Schema layers, connected by IsA and InstanceOf links. Labels in the figure: Part Of, Provider, Serial Composition, Candidate; Fact Table, Dimension, ER Entity, ER Relationship; American to European Date, Surrogate Key Assignment, Aggregation, $2; SK, f, DW.PartSupp.]
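The three layers map naturally onto classes, subclasses, and objects. The Python sketch below is one possible reading: the metamodel classes are generic, the templates specialize them (IsA), and the scenario consists of instances (InstanceOf). That DW.PartSupp is a Fact Table, and the attribute list given to it, are assumptions of the example.

```python
# Metamodel layer: generic entities of the model.
class Concept:
    def __init__(self, name, attributes=()):
        self.name = name
        self.attributes = tuple(attributes)


class Transformation:
    symbol = None  # every transformation is drawn with a symbol


# Template layer: built-in specializations (IsA links).
class FactTable(Concept):
    pass


class SurrogateKeyAssignment(Transformation):
    symbol = "SK"


# Schema layer: the entities of one specific scenario (InstanceOf links).
dw_partsupp = FactTable("DW.PartSupp", ["PKEY", "SUPPKEY", "DATE", "QTY", "COST"])
sk = SurrogateKeyAssignment()

print(issubclass(FactTable, Concept))    # IsA
print(isinstance(dw_partsupp, Concept))  # InstanceOf, via the template
```

Because `FactTable` inherits from `Concept`, every schema-level object is still an instance of a metamodel class, as the Schema layer requires.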
Composite transformations
Slowly changing dimension, Type 1, 2, 3 (SDC-1/2/3); Format mismatch (FM); Data type conversion (DTC); Switch (*); Extended union (U)
Unary transformations
Push; Aggregation; Projection; Function application (f); Surrogate key assignment (SK); Tuple normalization (N); Tuple denormalization (DN)
File operations
EBCDIC to ASCII conversion (EB2AS); Sort file (Sort)
Transfer operations
Ftp (FTP); Compress/Decompress (Z/dZ); Encrypt/Decrypt (Cr/dCr)
Binary transformations
Union (U); Join; Diff; Update Detection (UPD)
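To give a feel for two of the binary templates, here is a hedged Python sketch of Diff and Update Detection over lists of rows. The semantics used are assumptions for illustration: Diff returns the rows of the new input whose key is absent from the old input, and Update Detection returns the rows whose key exists in both but whose values changed.

```python
def key_of(row, key_attrs):
    """Build a hashable key from the given attributes of a row."""
    return tuple(row[a] for a in key_attrs)


def diff(new_rows, old_rows, key_attrs):
    """Rows of new_rows whose key does not appear in old_rows."""
    old_keys = {key_of(r, key_attrs) for r in old_rows}
    return [r for r in new_rows if key_of(r, key_attrs) not in old_keys]


def update_detection(new_rows, old_rows, key_attrs):
    """Rows of new_rows whose key exists in old_rows but whose values differ."""
    old_by_key = {key_of(r, key_attrs): r for r in old_rows}
    return [
        r for r in new_rows
        if key_of(r, key_attrs) in old_by_key and old_by_key[key_of(r, key_attrs)] != r
    ]


old = [{"PKEY": 1, "QTY": 5}, {"PKEY": 2, "QTY": 4}]
new = [{"PKEY": 1, "QTY": 5}, {"PKEY": 2, "QTY": 9}, {"PKEY": 3, "QTY": 1}]

inserted = diff(new, old, ["PKEY"])
updated = update_detection(new, old, ["PKEY"])
print(inserted, updated)
```

Both functions take two inputs and produce one output, which is what distinguishes the binary templates from the unary ones in the palette.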
Methodology
Step 1: Identification of the proper data stores
Step 2: Candidates and active candidates for the involved data stores
Step 3: Attribute mapping between the providers and the consumers
Step 4: Annotating the diagram with runtime constraints