
Extract, Transform and Load (ETL)
Eduardo Almeida
Master ALMA, Université de Nantes
{eduardo.almeida@univ-nantes.fr}

Goal
To present the general concepts of the
Extract, Transform and Load (ETL) process

To present an open source ETL tool

To ETL

Bibliography
Berson, Alex and Smith, Stephen J.
Data Warehousing, Data Mining & OLAP
Kimball, Ralph
The Data Warehouse Toolkit
Inmon, William H.
Building the Data Warehouse
Noirault, Claire
Business Intelligence avec Oracle 10g
http://asktom.oracle.com
Donsez, Didier (presentations)
Université Joseph Fourier

DW Overall architecture

Extract, Transform and Load (ETL)

DW Overall architecture (staging area)

ETL

Extract

Extract

Production data
Heterogeneous data sources
Heterogeneous representations
Incremental vs. full loading

Extract

Extraction
Logical (Full, Incremental)
Physical

Full Extraction
Export of one table or a set of tables from the source (e.g., )
Extraction using programs (e.g., PL/SQL, Java, etc.)

Advantages
No need to keep track of changes at the source
No additional information (e.g., timestamps) needed on the source

Drawbacks
Large volume of data
Performance impact on the data sources and on the ETL process

Incremental Extraction
Requires a mechanism to identify modified data
A DATE attribute (e.g., a last-modification timestamp)
Triggers
Comparison of original and current values (e.g., MINUS operator)
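
The DATE attribute and the original / current comparison can be sketched as follows. This is a minimal illustration, not the Oracle setup the slides assume: Python's standard sqlite3 module stands in for the source database, SQLite's EXCEPT plays the role of Oracle's MINUS, and the tables `clients` and `clients_snapshot` and the column `last_modified` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
    INSERT INTO clients VALUES (1, 'Alice', '2024-01-10'),
                               (2, 'Bob',   '2024-02-15'),
                               (3, 'Carol', '2024-03-01');
    -- Hypothetical copy of the rows as they were at the previous extraction.
    CREATE TABLE clients_snapshot (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO clients_snapshot VALUES (1, 'Alice'), (2, 'Robert');
""")

def extract_since(conn, last_extraction):
    """DATE attribute: keep only rows modified after the previous run."""
    return conn.execute(
        "SELECT id, name FROM clients WHERE last_modified > ? ORDER BY id",
        (last_extraction,),
    ).fetchall()

def extract_by_difference(conn):
    """Original / current value: source rows that are absent from the snapshot."""
    return conn.execute(
        "SELECT id, name FROM clients "
        "EXCEPT SELECT id, name FROM clients_snapshot ORDER BY id"
    ).fetchall()

by_date = extract_since(conn, "2024-02-01")
by_diff = extract_by_difference(conn)
print(by_date)
print(by_diff)
```

The timestamp approach relies on the source maintaining the column reliably, while the difference approach needs no help from the source but has to compare the full table on every run.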

Physical Extraction
Requires a mechanism to identify modified data
Log files
Dump files
Flat files
Partitioning (source tables are partitioned along a date key)

Transform

Transform

Integration
Cleansing
Standardizing
Enrichment
Sort
Filter
...

Transform
Data integration

Transform
Data Cleansing
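Data cleansing is the act of detecting and correcting (or removing) corrupt or inaccurate records.
Example: the same city arrives under several spellings and must be stored as a single value in the DW
São Paulo
S. Paulo
SP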
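
A minimal sketch of this kind of cleansing, assuming a small hand-made lookup table (the `CANONICAL_CITIES` mapping and the helper names are hypothetical, and a real project would more likely use a reference data set):

```python
import unicodedata

# Hypothetical lookup table: normalized spelling -> canonical value.
CANONICAL_CITIES = {
    "sao paulo": "São Paulo",
    "s paulo": "São Paulo",
    "sp": "São Paulo",
}

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(kept.lower().split())

def cleanse_city(raw):
    """Return the canonical city name, or None when the value is unknown."""
    return CANONICAL_CITIES.get(normalize(raw))

cleaned = [cleanse_city(v) for v in ["São Paulo", "S. Paulo", "SP", "Nantes?"]]
print(cleaned)
```

Values that map to None are the records to correct by hand or to reject, which matches the "correcting (or removing)" part of the definition.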

Transform
Standardizing
Address
number, street, city, country, zip
street, number, neighborhood, city, country, zip
Phone
+33 (0) 2 40 55 66 77
330240556677
Name
Johnny Hallyday
Hallyday, Johnny
JOHNNY HALLYDAY
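
The phone and name examples above could be standardized along these lines. This is only a sketch: the target formats (digits only, "First Last" in title case) are assumptions chosen to match the slide, and the function names are hypothetical.

```python
import re

def standardize_phone(raw):
    """Keep digits only, dropping '+', spaces and parentheses."""
    return re.sub(r"\D", "", raw)

def standardize_name(raw):
    """Return 'First Last' in title case, also accepting 'Last, First' input."""
    raw = raw.strip()
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        raw = f"{first} {last}"
    return " ".join(word.capitalize() for word in raw.split())

phones = [standardize_phone(p) for p in ["+33 (0) 2 40 55 66 77", "330240556677"]]
names = [standardize_name(n)
         for n in ["Johnny Hallyday", "Hallyday, Johnny", "JOHNNY HALLYDAY"]]
print(phones)
print(names)
```

Address standardization is harder because the field order itself varies between sources, so it usually needs a parser per source format rather than a single rule.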

Load

Load
Large volume of data
Significant processing load
Run during periods of low system use
Verify referential integrity after the load
From fact table to dimension
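
The referential integrity check from fact table to dimension can be sketched as an anti-join that lists fact rows without a matching dimension row. The tables `sales_fact` and `products_dim` and their contents are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products_dim (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY,
                             product_id INTEGER, amount REAL);
    INSERT INTO products_dim VALUES (1, 'Kettle'), (2, 'Teapot');
    INSERT INTO sales_fact VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 99, 5.0);
""")

def find_orphans(conn):
    """Fact rows whose product_id has no matching row in the dimension."""
    return conn.execute("""
        SELECT f.sale_id, f.product_id
        FROM sales_fact f
        LEFT JOIN products_dim d ON d.product_id = f.product_id
        WHERE d.product_id IS NULL
        ORDER BY f.sale_id
    """).fetchall()

orphans = find_orphans(conn)
print(orphans)
```

An empty result means every fact row points to an existing dimension row; any row returned has to be fixed or rejected before the load is considered complete.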

Command line tools

Extract
Oracle 'exp' command
exp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
exp scott/tiger file=emp.dmp tables=(emp,dept)
exp scott/tiger tables=emp query="where deptno=10"
exp scott/tiger file=abc.dmp tables=abc query=\"where sex=\'f\'\" rows=yes

Extract
Extracting into Flat Files Using SQL*Plus
SET echo off
SET pagesize 0
SPOOL country_city.dat
SELECT distinct t1.country_name ||'|'|| t2.cust_city
FROM countries t1, customers t2
WHERE t1.country_id = t2.country_id
AND t1.country_name= 'United States of America';
SPOOL off
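
For comparison, the same flat-file extraction can be sketched in Python. This is an illustration under assumptions, not a replacement for SQL*Plus: SQLite stands in for Oracle, the sample rows are invented, and the function name `spool_country_city` is hypothetical.

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (country_id TEXT PRIMARY KEY, country_name TEXT);
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY,
                            cust_city TEXT, country_id TEXT);
    INSERT INTO countries VALUES ('US', 'United States of America'), ('FR', 'France');
    INSERT INTO customers VALUES (1, 'Boston', 'US'), (2, 'Boston', 'US'),
                                 (3, 'Austin', 'US'), (4, 'Nantes', 'FR');
""")

def spool_country_city(conn, path, country):
    """Write one 'country|city' line per distinct pair, like the SPOOL script."""
    rows = conn.execute("""
        SELECT DISTINCT t1.country_name, t2.cust_city
        FROM countries t1
        JOIN customers t2 ON t1.country_id = t2.country_id
        WHERE t1.country_name = ?
        ORDER BY t2.cust_city
    """, (country,)).fetchall()
    with open(path, "w", encoding="utf-8") as out:
        for country_name, city in rows:
            out.write(f"{country_name}|{city}\n")
    return len(rows)

path = os.path.join(tempfile.mkdtemp(), "country_city.dat")
written = spool_country_city(conn, path, "United States of America")
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```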

Load
Oracle 'imp' command
imp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
imp scott/tiger file=emp.dmp tables=(emp,dept)

Load
Scenario
My system has both clients and clients_dim tables
I want to load clients_dim table from an export of
clients

How to load using 'imp'?


rename clients to clients_temp;
rename clients_dim to clients;
imp alma1 fromuser=almax touser=alma1 tables=clients file=almax.clients.dmp log=almax.clients.log IGNORE=Y
rename clients to clients_dim;
rename clients_temp to clients;

Load
Using SQL*Loader
sqlldr user control=control.ctl
The control.ctl file has the load information:
load data
infile 'country_city.dat'
into table country_city
fields terminated by "|" optionally enclosed by '"'
( country_name, cust_city )
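
What the control file asks SQL*Loader to do (split each record on "|", accept optional double quotes around a field, insert into `country_city`) can be sketched in Python. This only mimics the parsing rules for illustration; it is not how SQL*Loader is implemented, SQLite stands in for Oracle, and the sample records are invented.

```python
import csv
import io
import sqlite3

# Stand-in for country_city.dat; the second record uses the optional quotes.
DATA = ('United States of America|Boston\n'
        '"United States of America"|"Salt Lake City"\n')

def load_country_city(conn, lines):
    """Parse '|'-terminated fields, optionally enclosed by '"', and insert them."""
    reader = csv.reader(lines, delimiter="|", quotechar='"')
    rows = [tuple(record) for record in reader if record]
    conn.executemany(
        "INSERT INTO country_city (country_name, cust_city) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_city (country_name TEXT, cust_city TEXT)")
loaded = load_country_city(conn, io.StringIO(DATA))
cities = [r[0] for r in
          conn.execute("SELECT cust_city FROM country_city ORDER BY cust_city")]
print(loaded, cities)
```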

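Load
Using PL/SQL
DECLARE
  -- PL/SQL variables holding the category looked up for each product
  nom_cat VARCHAR2(25);
  descr   VARCHAR2(100);
  -- Cursor over the source products; code_categorie is needed for the lookup
  CURSOR cur IS
    SELECT ref_produit, nom_produit, code_categorie
    FROM produits;
BEGIN
  FOR crec IN cur LOOP
    -- Fetch the category attributes of the current product
    SELECT nom_categorie, description
    INTO nom_cat, descr
    FROM categories
    WHERE code_categorie = crec.code_categorie;

    -- Insert the denormalized row into the dimension table
    INSERT INTO products_dim (ref_produit, nom_produit,
                              nom_categorie, description)
    VALUES (crec.ref_produit, crec.nom_produit,
            nom_cat, descr);
  END LOOP;
  COMMIT;
END;
/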
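
As a point of comparison, the row-by-row loop can often be replaced by a single set-based INSERT ... SELECT with a join. The sketch below shows that alternative, not the method used in the slides; SQLite stands in for Oracle, the sample rows are invented, and the table layouts are assumed from the column names in the PL/SQL block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (code_categorie INTEGER PRIMARY KEY,
                             nom_categorie TEXT, description TEXT);
    CREATE TABLE produits (ref_produit INTEGER PRIMARY KEY, nom_produit TEXT,
                           code_categorie INTEGER);
    CREATE TABLE products_dim (ref_produit INTEGER, nom_produit TEXT,
                               nom_categorie TEXT, description TEXT);
    INSERT INTO categories VALUES (1, 'Drinks', 'Hot and cold drinks');
    INSERT INTO produits VALUES (100, 'Tea', 1), (101, 'Coffee', 1);
""")

def load_products_dim(conn):
    """Populate products_dim with one INSERT ... SELECT instead of a row loop."""
    with conn:  # commits on success, rolls back on error
        conn.execute("""
            INSERT INTO products_dim
                (ref_produit, nom_produit, nom_categorie, description)
            SELECT p.ref_produit, p.nom_produit, c.nom_categorie, c.description
            FROM produits p
            JOIN categories c ON c.code_categorie = p.code_categorie
        """)
    return conn.execute("SELECT COUNT(*) FROM products_dim").fetchone()[0]

inserted = load_products_dim(conn)
dim_rows = conn.execute(
    "SELECT ref_produit, nom_categorie FROM products_dim ORDER BY ref_produit"
).fetchall()
print(inserted, dim_rows)
```

One behavioral difference is worth keeping in mind: the inner join silently skips products whose category is missing, whereas a SELECT ... INTO inside the PL/SQL loop raises NO_DATA_FOUND in that case.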

Cursor
PL/SQL Variables

Kettle
Open source ETL tool
http://kettle.pentaho.org/

Kettle
Kettle is designed to help you with your ETTL
needs, which include the Extraction,
Transformation, Transportation and Loading of
data.

Runs on Java

Has a graphical user interface called Spoon

Kettle Tutorial
Open a terminal
$ spoon.sh

Transformation

Kettle Tutorial
1 - Explorer

2 - Connections

3 - Configuration

4 - Test the connection

Kettle Tutorial
1 - Design (creation palette)

2 - Drag and drop

Kettle Tutorial
1 - Step name

2 - Write the SQL

Kettle Tutorial

1 - Insert into table

2 - Link

Kettle Tutorial
1 - Run

2 - Check the results

Kettle Tutorial

1 - Filter

Kettle Tutorial
1 - Run

2 - Check the results

Kettle Tutorial

1 - Aggregation

Kettle Tutorial
1 - Step name

2 - Group field

3 - Aggregate field
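
To make the filter and aggregation steps concrete, here is a rough Python equivalent of what such a transformation computes. It does not use Kettle's API and says nothing about how Kettle works internally; the rows, field names and threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows, as a table input step might produce them.
rows = [
    {"city": "Nantes", "amount": 120.0},
    {"city": "Paris", "amount": 80.0},
    {"city": "Nantes", "amount": 30.0},
    {"city": "Paris", "amount": 5.0},
]

def filter_rows(rows, min_amount):
    """Filter step: pass only rows whose amount reaches the threshold."""
    return [row for row in rows if row["amount"] >= min_amount]

def group_and_sum(rows, group_field, aggregate_field):
    """Aggregation step: sum aggregate_field for each value of group_field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[aggregate_field]
    return dict(totals)

totals = group_and_sum(filter_rows(rows, 10.0), "city", "amount")
print(totals)
```

Here `group_field` corresponds to the group field of the step dialog and `aggregate_field` to the aggregate field.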
