
ETL Process

ETL stands for Extract-Transform-Load and refers to how data is handled within a company. ETL is the process that organizes the flow of data between different systems in an organization. It provides the methods and tools needed to move data from multiple sources into a data warehouse, reformat it, clean it, and load it into another database, data mart, or data warehouse. ETL is part of Business Intelligence, an area also referred to as Data Management.

The idea is that an ETL application reads the primary data from the databases of the main systems, performs transformation, validation, quality processing, and filtering, and finally writes the data to the warehouse. At that point the data becomes available for users to analyze.

The most popular ETL tools and applications on the market

IBM WebSphere DataStage (formerly Ascential DataStage and Ardent DataStage)
Pentaho Data Integration (Kettle ETL) - an open source Business Intelligence tool
SAS ETL Studio
Oracle Warehouse Builder
Informatica PowerCenter
Cognos DecisionStream
Ab Initio
BusinessObjects Data Integrator (BODI)
Microsoft SQL Server Integration Services (SSIS)
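The read, transform, validate, filter, and write flow described in the ETL Process section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how DataStage or any of the listed tools works internally: the `sales` table, the `name` and `amount` fields, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def extract(source_rows):
    """Extract: yield raw records from a source (here, an in-memory list)."""
    for row in source_rows:
        yield dict(row)


def transform(rows):
    """Transform: clean, validate, and filter the raw records."""
    for row in rows:
        # Cleaning: normalize whitespace and capitalization of the name.
        name = (row.get("name") or "").strip().title()
        # Validation: skip rows whose amount is missing or not numeric.
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue
        # Filtering: skip rows with an empty name or a negative amount.
        if not name or amount < 0:
            continue
        yield (name, round(amount, 2))


def load(conn, rows):
    """Load: write the transformed rows into the (hypothetical) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales (customer, amount) VALUES (?, ?)", rows)
    conn.commit()


def run_etl(source_rows, conn):
    """Run the three steps in order and return what ended up in the warehouse."""
    load(conn, transform(extract(source_rows)))
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    source = [
        {"name": "  alice ", "amount": "10.5"},
        {"name": "bob", "amount": "n/a"},  # rejected by validation
        {"name": "dave", "amount": 7},
    ]
    print(run_etl(source, sqlite3.connect(":memory:")))
```

Keeping each step as a separate generator means rows stream from the source to the target one at a time, which loosely mirrors how data moves along links between stages in an ETL job.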

DataStage components

DataStage has four main components:
Administrator - user interface used to configure DataStage projects and users. It handles the administration of DataStage projects in development and production environments.
Designer - used to create, design, and compile DataStage jobs (it also allows testing and running them). Mostly used by developers.
Director - used to validate, schedule, test, run, and monitor DataStage jobs. Used by operators and testers.
Manager - user interface used to view and edit the contents of the repository.

DataStage Manager screen

DataStage Administrator screen

DataStage Designer screen with a sequence

Job design - DataStage palette

A list of all DataStage stages:

DataStage server palette - general stages:

DataStage server palette - file stages:

DataStage server palette - database stages:

DataStage server palette - transformation and filtering stages:

DataStage server palette - sequence elements:

ODBC stages allow DataStage to connect to any data source that supports the Open Database Connectivity (ODBC) standard. They are mainly used to extract or load data. However, an ODBC stage can also be helpful for aggregating data and as a lookup stage, in which case it can take the place of an aggregator stage or a hash file.

Each ODBC stage can have any number of inputs or outputs. Input links specify the data that is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in several ways: with an SQL statement constructed by DataStage, a user-defined SQL query, or a stored procedure. Output links specify the data that is extracted (they correspond to SQL SELECT statements). The data on an output link is passed through the ODBC connector and processed by the underlying database.

If the processing target is an Oracle database, it may be worth considering the ORACLE (ORAOCI9) stage. It generally performs significantly better than the ODBC stage and exposes more configuration options and parameters native to the Oracle database. One very useful option is to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a large volume of data into an Oracle table, it may improve performance to drop the indexes in the before SQL tab, then recreate the indexes and analyze the table in the after SQL tab (the 'ANALYZE TABLE xxx COMPUTE STATISTICS' SQL statement).

Update actions in Oracle stage

The destination table can be updated using various update actions in the Oracle stage. It is crucial to select the key columns properly, because they determine which columns appear in the WHERE clause of the SQL statement. Update actions available from the dropdown list:
Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
Insert rows without clearing - only adds new rows (INSERT statement).
Delete existing rows only - deletes matched rows (issues only the DELETE statement).
Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
Update existing rows only - updates existing rows (UPDATE statement).
Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first; if it succeeds, the INSERT is omitted.
Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first; if it succeeds, the UPDATE is omitted.
User-defined SQL - the data is written using a user-defined SQL statement.
User-defined SQL file - the data is written using a user-defined SQL statement read from a file.
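The two combined update actions, and the role of key columns in the WHERE clause, can be illustrated with a short Python sketch. It is not the SQL that the Oracle stage actually generates: SQLite stands in for Oracle, the helper names are hypothetical, and "the UPDATE succeeds" is interpreted here as "the UPDATE matched at least one row", which is an assumption.

```python
import sqlite3


def build_update(table, columns, key_columns):
    """Build an UPDATE whose WHERE clause consists of the key columns.

    Assumes at least one non-key column; table and column names are
    trusted identifiers, not user input.
    """
    non_keys = [c for c in columns if c not in key_columns]
    set_clause = ", ".join(f"{c} = ?" for c in non_keys)
    where_clause = " AND ".join(f"{c} = ?" for c in key_columns)
    return f"UPDATE {table} SET {set_clause} WHERE {where_clause}", non_keys


def build_insert(table, columns):
    """Build a plain parameterized INSERT for the given columns."""
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def update_or_insert(conn, table, row, key_columns):
    """'Update existing rows or insert new rows': UPDATE first, then INSERT."""
    columns = list(row)
    sql, non_keys = build_update(table, columns, key_columns)
    params = [row[c] for c in non_keys] + [row[c] for c in key_columns]
    cursor = conn.execute(sql, params)
    if cursor.rowcount == 0:  # no row matched the key, so insert instead
        conn.execute(build_insert(table, columns), [row[c] for c in columns])
        return "inserted"
    return "updated"


def insert_or_update(conn, table, row, key_columns):
    """'Insert new rows or update existing rows': INSERT first, then UPDATE."""
    columns = list(row)
    try:
        conn.execute(build_insert(table, columns), [row[c] for c in columns])
        return "inserted"
    except sqlite3.IntegrityError:  # key already present, so update instead
        sql, non_keys = build_update(table, columns, key_columns)
        params = [row[c] for c in non_keys] + [row[c] for c in key_columns]
        conn.execute(sql, params)
        return "updated"
```

With a hypothetical table `customers (id INTEGER PRIMARY KEY, name TEXT)` and `id` as the key column, `build_update` produces `UPDATE customers SET name = ? WHERE id = ?`. Choosing the wrong key columns would change that WHERE clause and therefore which rows are touched. The order of the two statements matters mostly for performance: issuing the UPDATE first suits loads that mostly change existing rows, while INSERT first suits loads that are mostly new rows.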

Examples of the use of ODBC and ORACLE stages in DataStage

SQL generated by ODBC stage

ODBC columns view

Oracle destination stage update action
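As a further example, the "before SQL" and "after SQL" idea described for the Oracle stage (drop indexes, load, then recreate indexes and gather statistics) can be sketched in Python. This is a stand-in under stated assumptions, not the stage's implementation: SQLite replaces Oracle, the `facts` table and `idx_facts_id` index are hypothetical, and SQLite's `ANALYZE` is used where Oracle would use its own statistics statement.

```python
import sqlite3


def bulk_load(conn, rows, before_sql=(), after_sql=()):
    """Run the 'before' statements, load the rows, then run the 'after' statements."""
    for statement in before_sql:
        conn.execute(statement)
    # Main dataflow: insert all rows into the hypothetical target table.
    conn.executemany("INSERT INTO facts (id, value) VALUES (?, ?)", rows)
    for statement in after_sql:
        conn.execute(statement)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (id INTEGER, value TEXT)")
    conn.execute("CREATE INDEX idx_facts_id ON facts (id)")

    bulk_load(
        conn,
        [(i, f"value {i}") for i in range(10_000)],
        # Before SQL: drop the index so it is not maintained row by row.
        before_sql=["DROP INDEX IF EXISTS idx_facts_id"],
        # After SQL: rebuild the index once and refresh the statistics.
        after_sql=["CREATE INDEX idx_facts_id ON facts (id)", "ANALYZE facts"],
    )
    print(conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0])
```

Whether dropping and recreating indexes actually helps depends on the data volume and on the database, so it is worth measuring both variants before adopting this pattern.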