
Department of Computer Science

ETL STATE OF THE ART


by

RICARDO FORTUNA RAMINHOS

Monte de Caparica, April 2007

Acronyms
Table 1: List of Acronyms

3GL - Third Generation Language
3M - Multi Mission Module
3NF - Third Normal Form
API - Application Programmers Interface
ARS - Access Rule Service
AS - Audit Service
BTO - Built-To-Order
CA - Computer Associates
CASE - Computer-Aided Software Engineering
CDC - Change Data Capture
CDI - Customer Data Integration
CLF - Common Log Format
CLIPS - C Language Integrated Production System
CORBA - Common Object Request Broker Architecture
COM - Component Object Model
COTS - Commercial Off-The-Shelf
CPU - Central Processing Unit
CRC - Cyclic Redundancy Checksum
CRM - Customer Relationship Management
CSV - Comma Separated Values
CWM - Common Warehouse Model
DAS - Data Access Service
DBMS - Database Management System
DDL - Data Definition Language
DM - Data Mart
DIM - Data Integration Module
DLL - Dynamic Link Library
DODS - Data Object Design Studio
DOM - Document Object Model
DOML - DODS-XML
DPM - Data Processing Module
DSL - Data System Libraries
DTD - Document Type Definition
DTS - Data Transformation Services
DW - Data Warehouse
EAI - Enterprise Application Integration
ECM - Enterprise Content Management
EDR - Enterprise Data Replication
EII - Enterprise Information Integration
EIM - Enterprise Information Management
EJBs - Enterprise Java Beans
ELT - Extract, Load and Transform
ER - Entity Relationship
ERP - Enterprise Resource Planning
ETI - Evolutionary Technologies International
ETLT - Extract, Transform, Load and Transform
ETTL - Extract, Transform, Transport and Load
Envisat - Environmental Satellite


ESA - European Space Agency
ETL - Extract, Transform and Load
FCT (1) - Faculdade de Ciências e Tecnologia
FCT (2) - Flight Control Team
FFD - File Format Definition
FTP - File Transfer Protocol
FQS - Federated Query Service
FSS - Federated Schema Service
GUI - Graphical User Interface
HMM - Hidden Markov Modelling
HTML - Hyper Text Mark-up Language
HTTP - Hyper Text Transfer Protocol
HTTPS - Hyper Text Transfer Protocol Secure
IBHIS - Integration Broker for Heterogeneous Information Sources
IBIS - Internet-Based Information System
IDE - Integrated Development Environment
IDMS - Integrated Database Management System
IKM - Integration Knowledge Module
IIS - Internet Information System
IMAP - Internet Message Access Protocol
IMS - Information Management System
INTEGRAL - International Gamma-Ray Astrophysics Laboratory (satellite)
IP - Internet Protocol
IT - Information Technology
J2EE - Java 2 Enterprise Edition
JCL - Job Control Language
JDBC - Java Data Base Connection
JESS - Java Expert System Shell
JMS - Java Message Service
JNDI - Java Naming and Directory Interface
JVM - Java Virtual Machine
KETTLE - Kettle ETTL Environment
LDAP - Lightweight Directory Access Protocol
LKM - Load Knowledge Module
MAPI - Messaging Application Programming Interface
MEO - Medium Earth Orbits
MOM - Message-Oriented Middleware
MR - Metadata Repository
MS - Microsoft
MT - Monitoring Tool
NoDoSE - Northwestern Document Structure Extractor
ODBC - Open Data Base Connectivity
ODS - Operational Data Storage
OLAP - On-Line Analytical Processing
OLE DB - Object Linking and Embedding Database
OS - Operating System
PDF - Portable Document Format
PL/SQL - Procedural Language / Structured Query Language
POP - Post Office Protocol
RAM - Random Access Memory
RAT - Reporting and Analysis Tool
RDBMS - Relational Database Management Systems
Regex - Regular Expression
RIFL - Rapid Integration Flow Language

RMI - Remote Method Invocation
RS - Registry Service
RSH - Remote Shell
S/C - Spacecraft
S/W - Space Weather
SADL - Simple Activity Definition Language
SAX - Simple API for XML
SDK - Java Software Development Kit
SEIS - Space Environment Information System for Mission Control Purposes
SESS - Space Environment Support System for Telecom and Navigation Systems
SFTP - Secure File Transfer Protocol
SML - Simple Mapping Language
SOA - Services Oriented Architecture
SNMP - Simple Network Management Protocol
SOAP - Simple Object Access Protocol
SQL - Structured Query Language
SSIS - SQL Server Integration Services
TM - Transformation Manager
UDAP - Uniform Data Access Proxy
UDOB - Uniform Data Output Buffer
UDET - Uniform Data Extraction and Transformer
UNL - Universidade Nova de Lisboa
URL - Uniform Resource Locator
URS - User Registration Service
UTL - Universal Transformation Language
VSAM - Virtual Storage Access Method
XADL - XML-based Activity Definition Language
XMI - XML Metadata Interchange
XML - eXtended Markup Language
XMM - X-Ray Multi-Mission (satellite)
XPath - XML Path Language
XPDL - XML Process Definition Language
XQuery - XML Query Language
XSD - XML Schema Definition
XSL - Extensible Stylesheet Language
XSL FO - XSL Formatting Objects
XSLT - XSL Transformations
WSDL - Web Service Definition Language
WWW - World Wide Web


Index
1.1 MOTIVATION: THE CORRECT ETL TOOL
1.2 ETL CONCEPTUAL REPRESENTATION AND FRAMEWORK
1.3 CLASSICAL DATA INTEGRATION ARCHITECTURES
   1.3.1 Hand Coding
   1.3.2 Code Generators
   1.3.3 Database Embedded ETL
   1.3.4 Metadata Driven ETL Engines
1.4 APPROACHES TO DATA PROCESSING
   1.4.1 Data Consolidation
   1.4.2 Data Federation
   1.4.3 Data Propagation
   1.4.4 Hybrid Approach
   1.4.5 Change Data Capture
   1.4.6 Data Integration Technologies
1.5 METADATA FOR DESCRIBING ETL STATEMENTS
1.6 RESEARCH ETL TOOLS
   1.6.1 AJAX
   1.6.2 ARKTOS
   1.6.3 Clio
   1.6.4 DATAMOLD
   1.6.5 IBHIS
   1.6.6 IBIS
   1.6.7 InFuse
   1.6.8 INTELLICLEAN
   1.6.9 NoDoSE
   1.6.10 Potter's Wheel
1.7 FREEWARE / OPEN SOURCE AND SHAREWARE ETL TOOLS
   1.7.1 Enhydra Octopus
   1.7.2 Jitterbit
   1.7.3 KETL
   1.7.4 Pentaho Data Integration: Kettle Project
   1.7.5 Pequel ETL Engine
   1.7.6 Talend Open Studio
1.8 COMMERCIAL ETL TOOLS
   1.8.1 ETL Market Analysis
   1.8.2 Business Objects Data Integrator
   1.8.3 Cognos DecisionStream
   1.8.4 DataMirror Transformation Server
   1.8.5 DB Software Laboratory's Visual Importer Pro
   1.8.6 DENODO
   1.8.7 Embarcadero Technologies DT/Studio
   1.8.8 ETI Solution v5
   1.8.9 ETL Solutions Transformation Manager
   1.8.10 Group1 Data Flow
   1.8.11 Hummingbird Genio
   1.8.12 IBM WebSphere DataStage
   1.8.13 Informatica PowerCenter
   1.8.14 iWay Data Migrator
   1.8.15 Microsoft SQL Server 2005
   1.8.16 Oracle Warehouse Builder
   1.8.17 Pervasive Data Integrator
   1.8.18 SAS ETL
   1.8.19 Stylus Studio
   1.8.20 Sunopsis Data Conductor
   1.8.21 Sybase TransformOnDemand
1.9 SPACE ENVIRONMENT INFORMATION SYSTEM FOR MISSION CONTROL PURPOSES
   1.9.1 Objectives
   1.9.2 Architecture
   1.9.3 Data Processing Module
   1.9.4 Evaluation
1.10 CONCLUSIONS

REFERENCES


Index of Figures
Figure 1: AJAX two-level framework (example for a library)
Figure 2: MOF layers using UML and Java as comparison
Figure 3: Data integration techniques: consolidation, federation and propagation
Figure 4: Push and pull modes of data consolidations
Figure 5: Part of the scenario expressed with XADL
Figure 6: The syntax of SADL for the CREATE SCENARIO, CONNECTION, ACTIVITY, QUALITY FACTOR statements
Figure 7: Part of the scenario expressed with SADL
Figure 8: AJAX architecture
Figure 9: Example of the Matching transformation
Figure 10: Specifying the check for format transformation
Figure 11: The architecture of ARKTOS
Figure 12: Scenario for the propagation of data to a data warehouse
Figure 13: Clio architecture
Figure 14: A relational to XML Schema mapping
Figure 15: DATAMOLD Architecture
Figure 16: Structure of a naive HMM
Figure 17: The Architecture of the IBHIS Operational System
Figure 18: IBHIS GUI
Figure 19: Set up of the IBHIS broker
Figure 20: Architecture of IBHIS
Figure 21: Query interface in IBIS
Figure 22: Query result in IBIS
Figure 23: Global Architecture of InFuse
Figure 24: Angie transformation pipeline
Figure 25: Angie in advance mode
Figure 26: INTELLICLEAN graphical interaction
Figure 27: INTELLICLEAN graphical interaction (explaining a rule)
Figure 28: User Level Architecture
Figure 29: Example schema for a simulation output
Figure 30: NoDoSE Document Structure Editor
Figure 31: Potter's Wheel user interface with column transformations selected


Figure 32: Potter's Wheel Architecture
Figure 33: Enhydra Octopus Architecture
Figure 34: XML Syntax for Defining a Transformation
Figure 35: Octopus Generator input and outputs
Figure 36: Octopus Generator output options
Figure 37: Octopus Loader inputs
Figure 38: After defining an Operation
Figure 39: Defining a Database Source
Figure 40: Transformation Mapping
Figure 41: Operation Queue and Log
Figure 42: KETL Architecture
Figure 43: SQL script file structure
Figure 44: OS Job example - execute a job on the machine running the KETL engine
Figure 45: SQL Job example - execute a job on a database
Figure 46: Text File Input
Figure 47: Spoon Screenshot
Figure 48: Mapping between input and target fields
Figure 49: Chef with a job
Figure 50: Conditional Job Hop
Figure 51: Pequel script for processing an Apache CLF Log file
Figure 52: Talend architecture
Figure 53: Talend Open Studio
Figure 54: Defining a Business Model
Figure 55: Defining a delimited file
Figure 56: Example of a tMap transformation
Figure 57: Mapper graphical interface
Figure 58: Presenting statistics to the user
Figure 59: METAspectrum Evaluation
Figure 60: Magic Quadrant for Extraction, Transformation and Loading
Figure 61: Data Integrator
Figure 62: Distinct values (left) and tables relationship analysis (right)
Figure 63: Detection and visualization of data patterns
Figure 64: Data Validation
Figure 65: Data Integrator (Data Cleansing window)


Figure 66: Example of Multi-User Collaboration
Figure 67: Impact Analysis
Figure 68: Composer web-based application
Figure 69: Cognos 8 BI
Figure 70: Cognos Data Manager
Figure 71: DataMirror Tool Suite
Figure 72: DataMirror Tree-structured Query Technology
Figure 73: Architecture for Transformation Server XML-based engine
Figure 74: Visual Importer Architecture
Figure 75: Data Source Options
Figure 76: Mapping Editor Screen Overview
Figure 77: Mapping Panel
Figure 78: Example of an actions pipeline
Figure 79: Package Screen Overview
Figure 80: Scheduling options
Figure 81: Execution Monitor
Figure 82: Denodo Virtual Data Port Architecture
Figure 83: Denodo ITPilot architecture
Figure 84: Data Flow Designer (example 1)
Figure 85: Data Flow Designer (example 2)
Figure 86: TM Suite of Tools
Figure 87: Design Tool IDE
Figure 88: Data Flow Architecture
Figure 89: Data Flow Server
Figure 90: Data Flow (process level planning)
Figure 91: Manipulating data in the view through set-based logic
Figure 92: Hummingbird Genio Architecture
Figure 93: Genio Designer
Figure 94: Data Access Method One
Figure 95: Data Access Method Two
Figure 96: Data Access Method Three
Figure 97: Realtime ETL with WebSphere DataStage
Figure 98: WebSphere Design Studio
Figure 99: MapStage layout


Figure 100: Informatica Architecture
Figure 101: Mapping Designer
Figure 102: Administration Console - GRID management
Figure 103: Pushdown Optimization Viewer
Figure 104: Mapping for Partial Pushdown Processing
Figure 105: Data flow example: data sources (left) and diagram area (right)
Figure 106: Join properties
Figure 107: Filtering input data (limited to the 2003 year)
Figure 108: Mapping source to target columns
Figure 109: Process flow example
Figure 110: BI Development Studio Interface in Visual Studio
Figure 111: A SSIS data flow example
Figure 112: A SSIS report example
Figure 113: A cleansing data flow example
Figure 114: SSIS data visualizations example
Figure 115: Warehouse Builder Architecture
Figure 116: Defining the fields delimiters
Figure 117: Design Center
Figure 118: Mapping Editor
Figure 119: Process Editor (empty flow)
Figure 120: Process flow and XPDL script code
Figure 121: Pervasive Integration Architecture
Figure 122: Process Designer
Figure 123: Map Designer - Source Connection Section
Figure 124: Map Designer - Map Section
Figure 125: Extract Schema Designer
Figure 126: Integration Manager
Figure 127: Software Architecture for the SAS Intelligence Platform
Figure 128: Process Flow for a SAS ETL Studio Job
Figure 129: Import Parameters Window
Figure 130: Set Column Definition Window
Figure 131: Empty job based on a single transformation
Figure 132: Process Library Tree
Figure 133: Process Designer Window


Figure 134: A pipeline definition example
Figure 135: Convert to XML editor
Figure 136: Pipeline XML outputs
Figure 137: DB to XML Data Source module
Figure 138: XSLT mapper
Figure 139: Sunopsis architecture
Figure 140: Designer - Diagram view
Figure 141: Designer - LKM definition example
Figure 142: Designer - Flow definition
Figure 143: Operator application
Figure 144: TransformOnDemand Architecture
Figure 145: Process Designer Main Window
Figure 146: Text Data Provider
Figure 147: Data Calculator
Figure 148: Splitter Window
Figure 149: Job Definition
Figure 150: Runtime Manager
Figure 151: SEIS system architecture modular breakdown


Index of Tables
Table 1: List of Acronyms


1.1 Motivation: The correct ETL Tool

The development of any new information system poses many challenges and doubts to the team responsible for its implementation. A significant part of these is related to the Extract, Transform and Load (ETL) component, responsible for the acquisition and normalization of data. During the requirement and design phases, five questions commonly drive the implementation of the ETL component:

1. Can an existing Commercial Off-The-Shelf (COTS) solution be reused?
2. Must a custom solution be developed specifically for this problem?
3. What is the associated cost (time and manpower)?
4. How robust shall the application be?
5. Is the application easy to maintain and extend?

To answer the first question, a survey of the state of the art in the ETL domain is usually conducted. Depending on the budget associated with the information system project, a greater evaluation effort may be put into research / open-source applications or into commercial applications. An alternative to the COTS approach is to develop a custom (and usually very specific) ETL solution. This approach is frequent when the ETL component must follow a strict set of requirements that are considered too specific. The decision between adopting a COTS product and developing an ETL component from scratch is influenced by four main parameters: the associated cost (third question), the required robustness level (fourth question), and maintainability and extensibility issues (fifth question).

Unfortunately, practice has shown that the choice of the correct ETL tool is often underestimated, minimized or even ignored. This frequently happens because the choice becomes an administrative rather than a technological issue: research and open-source tools are rejected due to their perceived lack of credibility, and a commercial tool is selected with the associated cost as the only evaluation criterion. According to the Gartner ETL evaluation report [1], in-house development procedures and poor administrative decisions when selecting an appropriate ETL tool consume up to 70% of the resources of an information system project (e.g. a data warehouse project).

An example of the incorrect selection of an ETL tool in a real-world situation, taken from [2], is described below. It shows how an upgrade to an appropriate ETL component contributed greatly to cost savings in the order of $2.5 billion in a single year:

There are few companies that have been more aggressive than Motorola in pursuing e-business. Motorola made public last year that one of its corporate-wide goals - a strategic rather than a tactical one - was to get all spending into electronic systems. But, in order to make this kind of progress, the company has had to lean heavily on a business intelligence initiative. Chet Phillips, IT director for BI at Motorola, was responsible for this initiative. "At the beginning of 2002, the procurement leaders at Motorola were given a goal to drop $2.5 billion worth of spend out of the cost structure on the direct and indirect side, and they needed a way of looking at spend comprehensively," Phillips says.

Gathering the spend in one location would provide the visibility and decision support that the procurement leaders needed; in the way of such aggregation, however, was Motorola's reliance on many different enterprise systems: three version levels of Oracle, SAP (particularly within the semiconductor organization), and Ariba on the indirect procurement side. Motorola already had an enterprise application integration tool from vendor webMethods that touched a lot of different systems, but Phillips explains how, by nature, it could not fit the need at hand. "EAI communicates between the different systems -- it's transaction-level data interaction," Phillips says. To go deeper in getting the data out, Motorola got an ETL tool from BI vendor Informatica. Phillips describes the benefits of the tool: "By using its capability to pull data as opposed to requesting that source systems push data, we covered ground quickly without using intensive IT resources and we had minimal intrusion on source systems."

Motorola's BI project handed the baton off to the procurement organization, which could now examine $48 billion worth of spending, one million purchase orders, and six million receipts at the desired level of detail. For its part, the procurement organization has come through for the corporation. Motorola reaped $2.5 billion in cost savings last year thanks to its new e-procurement tools and processes, and expects to save more this year.


Thus, the first step in selecting an ETL tool consists in determining the current state of the art. Unfortunately, the existing ETL surveys are affected by one or more of the following drawbacks:

o Incomplete: ETL surveys such as [3] only refer to commercial tools. Research and open-source initiatives are not taken into consideration in these surveys;
o Non-extensive: only a limited number of surveys correlate more than one ETL tool and, for these, the number of ETL tools covered is rather limited, usually restricted to the top three or four market leaders;
o Biased: multiple ETL tool evaluations exist that are sponsored by individual ETL vendors or consortiums of vendors. These evaluations (usually white papers) are biased towards the sponsoring vendors' software and cannot be considered reliable;
o Expensive: the ETL Survey for 2006-2007 performed by ETL Tools [3], an independent company within the ETL domain, costs around €1,700 and is not open to the public.

This report provides a solution for the above-mentioned drawbacks, to some extent 1. Initially, the current trends in ETL methodology, architecture and data processing are explained globally. Afterwards, less common ETL features (e.g. real-time ETL) are presented, as well as declarative approaches to ETL. Next, the survey of ETL applications is presented, organized into three domains: research, open-source and commercial. Finally, an evaluation of the ETL domain is presented, based on the analysed tools.

1.2 ETL Conceptual Representation and Framework

Work in the area of ETL conceptual representation and methodology standardization has so far been limited to a few scarce research initiatives. Most of this work has been developed in the area of conceptual representation of ETL processes [4-6] and ETL methodology [7-9] by computer science researchers from the University of Ioannina. Both works were envisaged to ease the documentation and formalization effort for ETL at the early stages of data warehouse definition (not describing technical details regarding the actual implementation of ETL tasks). A set of iconographic symbols has been suggested for the conceptual representation of ETL primitives such as concepts, instances, transformations, relations and data flows.
1 Some of the applications could not be properly evaluated due to the shortage of technical information and / or the unavailability of software for public usage.


The same researchers proposed a general methodology for dealing with ETL processes [10] (following the proposed ETL conceptual representation), based on a two-layered design that attempts to separate the logical and physical levels. Using this framework, any ETL / data cleaning program involves two activities (Figure 1):

1. The design of a graph of data transformations that should be applied to the input data - the logical level;
2. The design of performance heuristics that can improve the execution speed of data transformations without sacrificing accuracy - the physical level.

Figure 1: AJAX two-level framework (example for a library)

At the logical level, the main constituent of an ETL program is the specification of a data flow graph whose nodes are operations of the following types: mapping, view, matching, clustering and merging; the input and output data flows of the operators are logically modelled as database relations. The design of the logical operators was based on the semantics of SQL primitives, extended to support a larger range of transformations. Each operator can make use of externally defined functions or algorithms, written in a third generation language (3GL) and then registered within the tool's library of functions and algorithms. The semantics of each operator includes the automatic generation of a variety of exceptions that mark tuples which cannot be handled automatically by an operator. At any stage of execution of the program, a data lineage mechanism enables users to inspect exceptions, analyze their provenance in the data flow graph and interactively correct the data items that contributed to their generation. Corrected data can then be re-integrated into the data flow graph.

At the physical level, decisions can be made to speed up the execution. First, the implementation of the externally defined functions can be optimized. Second, an efficient algorithm can be selected to implement a logical operation from a set of alternative algorithms.

Both the conceptual representation and the methodology have been put into practice with the AJAX prototype (analysed under the Research ETL Tools section). However, the proposed standards for conceptual representation and framework had no impact on the remaining research / commercial tools and have not been adopted. All further research using the conceptual representation and framework proposed by AJAX has been conducted by the same research team, which has meanwhile enhanced the tool (AJAX II), although without any visible impact on the rest of the ETL domain.

Instead, most metadata-based ETL tools usually follow (at least to some extent) the abstract Meta Object Facility (MOF) approach. MOF is an OMG metamodelling standard that defines a four-layer architecture in which an item belonging to a layer L is an instance of a concept item described in the layer above (i.e. L + 1). Figure 2 provides a parallelism between UML diagrams and Java using the MOF hierarchy as an example.

Figure 2: MOF layers using UML and Java as comparison


A parallelism between metadata 2 and the MOF standard can easily be derived for the definition of the instance and concept terms. Concepts refer to the definition of entity types (e.g. car, person, or table), while instances are actual specifications of those entities (e.g. an Audi A4, John, or a round wooden table). Thus, the information layer (M0) represents actual data (e.g. a record in a database), while the model layer (M1) contains the instance metadata describing the information-layer objects. The metamodel layer (M2) contains the definitions of the several types of concept metadata that are stored in the previous layer (M1). The meta-metamodel layer (M3) contains common rule definitions for all concepts (e.g. structural rules regarding concepts), enabling a normalized metadata representation.

As described, the MOF layer representation is quite abstract, and each research group / tool vendor instantiates its own objects following internal representation schemes and functions, providing no standardization. Further, no methodology, operation set or metadata representation has been defined or proposed for ETL as a specialization of MOF 3.

2 Data that describes data.

3 Besides the AJAX initiative, which can be seen as a subset of the MOF layers specialized with a declarative language, a minimal subset of operators and an iconographic representation.

1.3 Classical Data Integration Architectures

Historically, four main approaches have been followed for solving the data integration problem [11-13]: hand coding, code generators, database embedded ETL and metadata driven ETL engines.

1.3.1 Hand Coding

Since the dawn of data processing, integration issues have been solved through the development of custom hand-coded programs, written by in-house or consulting programming teams. Although this approach appears at first to be a low-cost solution, it quickly evolves into a costly, hard-to-maintain and time-consuming task. This usually happens because all the relevant knowledge is represented in low-level source code, making it hard to understand, maintain and update, and especially prone to error during maintenance tasks. The problem is even greater for legacy systems where the original team that developed the code is no longer available.

In the mid-1990s, this paradigm started to be replaced by a number of third-party products (code generators and engine-based tools) from vendors specialized in data integration and ETL. Surprisingly, even though ETL tools have been developed for over 10 years and are now mature products, hand coding still persists as a significant approach to solving transformation problems. These efforts still proliferate in many legacy environments, low-budget data migration projects and very specific ETL scenarios. Although hand-coded ETL provides unlimited flexibility, it has an associated cost: an ETL component created from scratch may prove hard to maintain and evolve in the near future, depending on its complexity.
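For illustration only, the sketch below shows the shape a hand-coded ETL job typically takes; the file name, field names and target schema are hypothetical, and Python stands in for whatever 3GL an in-house team might use. Note how the business rules live directly in the source code, which is precisely what makes this style cheap to start with and costly to maintain.

import csv
import sqlite3

def run_job(src_path="customers.csv", db_path="warehouse.db"):
    """Hand-coded ETL: extract from a CSV file, transform in-line, load into SQLite."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS customer (
                       id INTEGER PRIMARY KEY, name TEXT, country TEXT, revenue REAL)""")
    with open(src_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Transformation rules are buried here, in code, not in metadata.
            name = row["name"].strip().title()
            country = row.get("country", "").upper() or "UNKNOWN"
            revenue = float(row["revenue"].replace(",", "."))  # locale normalization
            con.execute("INSERT OR REPLACE INTO customer VALUES (?, ?, ?, ?)",
                        (int(row["id"]), name, country, revenue))
    con.commit()
    con.close()

if __name__ == "__main__":
    run_job()

Any change to a cleansing rule or to the target schema means editing and redeploying this program, which is the maintenance cost discussed above.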

1.3.2 Code Generators

Code generators were the first attempt to increase data processing efficiency, replacing possibly inefficient, manually developed source code. Code generation frameworks present a graphical front-end where users can map processes and data flows and then automatically generate source code (such as C or COBOL) as the resulting run-time solution, which can be compiled and executed on various platforms. Generally, ETL code-generating tools can handle more complex processing than their engine-based counterparts. Compiled code is generally accepted as the fastest kind of solution and also enables organizations to distribute processing across multiple platforms to optimize performance. Although code generators usually offer visual development environments, they are sometimes not as easy to use as engine-based tools and can lengthen overall development times in direct comparison with them.

Code generators were a step up from hand coding for developers, but the approach did not gain widespread adoption: as solution requirements and Information Technology (IT) architecture complexity grew, issues around code maintenance and inaccuracies in the generation process led to higher rather than lower costs.
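The code-generation idea can be sketched roughly as follows; the mapping structure and field names are invented for the example and not modelled on any particular product. A declarative mapping, such as a graphical front-end might produce, is turned into the source code of a standalone transformation routine that can then be deployed and run independently of the design tool.

# Hypothetical mapping that a graphical designer might produce.
MAPPING = {
    "target_table": "customer",
    "fields": [
        {"target": "name",    "source": "cust_name", "transform": "value.strip().title()"},
        {"target": "country", "source": "cust_ctry", "transform": "value.upper()"},
    ],
}

def generate(mapping):
    """Emit the source code of a standalone transformation function."""
    lines = ["def transform(record):",
             "    out = {}"]
    for f in mapping["fields"]:
        lines.append(f"    value = record[{f['source']!r}]")
        lines.append(f"    out[{f['target']!r}] = {f['transform']}")
    lines.append("    return out")
    return "\n".join(lines)

if __name__ == "__main__":
    code = generate(MAPPING)
    print(code)            # the generated source can be reviewed and versioned
    namespace = {}
    exec(code, namespace)  # stands in for the compile / deploy step
    print(namespace["transform"]({"cust_name": " ada lovelace ", "cust_ctry": "uk"}))

The generated artefact runs without the generator, which is the strength of the approach; the weakness is that regenerating and redeploying it is needed after every mapping change.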



1.3.3 Database Embedded ETL

With origins in the early code generators, Database Management System (DBMS) vendors have embedded ETL capabilities in their products, using the database as the engine and Structured Query Language (SQL) as the supporting language. Some DBMS vendors have opted to include third-party ETL tools that leverage common database functionality, such as stored procedures and enhanced SQL, increasing the transformation and aggregation power. This enables third-party ETL tools to optimize performance by exploiting the parallel processing and scalability features of the DBMS. Other DBMS vendors offer ETL functions that mirror features available from specialist ETL vendors. Many database vendors offer graphical development tools that exploit the ETL capabilities of their database products, competing directly with third-party ETL solution providers.

Database-centric ETL solutions vary considerably in quality and functionality. To some extent, these products have exposed the limited capability of SQL and database-specific extensions (e.g. PL/SQL, stored procedures) to handle cross-platform data issues, eXtended Markup Language (XML) data, data quality, profiling and the business logic needed for enterprise data integration. Further, most organisations do not wish to be dependent on a single proprietary vendor's engine. However, for some specific scenarios, the horsepower of the relational database can be used effectively for data integration, with better results than those of other ETL tools.
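A minimal sketch of the database-embedded style is shown below, using SQLite from Python purely as a stand-in for an enterprise DBMS; the staging and target tables are hypothetical. The transformation logic is expressed as a single set-based SQL statement executed inside the database engine, rather than row by row in an external program.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_sales (sale_id INTEGER, cust TEXT, amount TEXT, sale_date TEXT);
    CREATE TABLE dw_sales  (sale_id INTEGER PRIMARY KEY, cust TEXT,
                            amount REAL, sale_year INTEGER);
    INSERT INTO stg_sales VALUES (1, ' acme ', '10,50', '2006-03-14'),
                                 (2, 'Globex', '7,25',  '2006-11-02');
""")

# The "T" and "L" of ETL expressed as one set-based SQL statement,
# executed by the database engine itself.
con.execute("""
    INSERT INTO dw_sales (sale_id, cust, amount, sale_year)
    SELECT sale_id,
           UPPER(TRIM(cust)),
           CAST(REPLACE(amount, ',', '.') AS REAL),
           CAST(SUBSTR(sale_date, 1, 4) AS INTEGER)
    FROM stg_sales
""")
con.commit()
print(con.execute("SELECT * FROM dw_sales").fetchall())

This is where the approach shines (set-based processing close to the data) and also where its limits show: cross-platform sources, XML handling and richer business logic do not fit naturally into a single SQL statement.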

1.3.4 Metadata Driven ETL Engines

Informatica pioneered a new data integration approach by presenting a data server, or engine, powered by open, interpreted metadata as the main driver for transformation processing. This approach addressed complexity and met performance needs, while also enabling re-use and openness, since it is metadata driven. Other ETL tool vendors have since adopted this approach, through other types of engines and languages, and it has become the current trend in ETL data processing.

Many of these engine-based tools have integrated metadata repositories that can synchronize metadata from source systems, target databases and other business intelligence tools. Most of these tools automatically generate metadata at every step of the process and enforce a consistent metadata-driven methodology that all developers must follow. Proprietary scripting languages are used for representing metadata, running within a generally rigid, centralised ETL server. These engines use language interpreters to process ETL workflows at run-time; the workflows are defined by developers in a graphical environment and stored in a metadata repository, which the engine reads at run-time to determine how to process incoming data. This way it is possible to abstract some of the implementation issues, make data mapping graphically oriented and introduce automated ETL processes. Key advantages of this approach are:

o Technical people with broad business skills but without programmer expertise can use ETL tools effectively;
o ETL tools have connectors prebuilt for most source and target systems;
o ETL tools deliver good performance even for very large data sets;
o ETL tools can often manage complex load-balancing scenarios across servers, avoiding server deadlock.

Although this approach is based on metadata interpretation, the need for custom code is rarely eliminated. Metadata driven engines can be augmented with selected processing modules hand coded in an underlying programming language. For example, a custom CRC (Cyclic Redundancy Checksum) algorithm could be developed and introduced into an ETL tool if the function was not part of the core function package provided by the vendor.

Another significant characteristic of an engine-based approach is that all processing takes place in the engine, not on the source systems. The engine typically runs on a server machine and establishes direct connections to the source systems. This architecture may raise issues, since some projects require a great deal of flexibility for deploying transformations and other data integration components outside a centralized server.
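The interpretation loop at the heart of a metadata-driven engine can be caricatured in a few lines. The sketch below is purely illustrative and is not the design of any vendor's engine: the workflow is plain data read at run time, and the registry shows how a custom function, such as the CRC example above, could be plugged in alongside the built-in operations.

import zlib

# Operation registry; a custom CRC step is registered like any built-in operation.
REGISTRY = {
    "uppercase": lambda rec, field: {**rec, field: rec[field].upper()},
    "crc32":     lambda rec, field: {**rec, field + "_crc": zlib.crc32(rec[field].encode())},
}

# The "metadata": a workflow definition that would live in a repository
# and is only read at run-time.
WORKFLOW = [
    {"op": "uppercase", "field": "name"},
    {"op": "crc32",     "field": "name"},
]

def run(workflow, records):
    """Interpret the workflow metadata against a stream of records."""
    for rec in records:
        for step in workflow:
            rec = REGISTRY[step["op"]](rec, step["field"])
        yield rec

if __name__ == "__main__":
    source = [{"name": "ajax"}, {"name": "arktos"}]
    for row in run(WORKFLOW, source):
        print(row)

Because the workflow is data rather than code, it can be edited graphically, stored in a repository, versioned and re-used without recompiling anything, which is the essence of the engine-based argument above.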

1.4 Approaches to Data Processing

Data integration is usually accomplished using one (or a combination) of the following techniques [14]: consolidation, federation and propagation, as depicted in Figure 3.



Figure 3: Data integration techniques: consolidation, federation and propagation

1.4.1 Data Consolidation

Data consolidation gathers data from different input sources and integrates it into a single persistent data store. This centralized data can be used either for reporting and analysis (the data warehouse approach) or as a data source for external applications.

When using data consolidation, a delay or latency period is usually present between the data being entered at the source system and the data becoming available at the target store. Depending on business needs, this latency may range from a few seconds to several days. The term near real time is used to describe a data exchange operation with minimal latency (usually in the range of seconds to minutes). Data with zero latency is known as real-time data and is almost impossible to achieve using data consolidation.

Whenever the exchanged data involves high-latency periods (e.g. more than one day), a batch approach is applied, where data is pulled from the source systems at scheduled intervals. This pull approach uses queries that take periodic snapshots of the source data. Queries are able to retrieve the current version of the data, but unable to capture intermediate changes that might have occurred since the last snapshot: a source value could have been updated several times during this period, and those intermediate values would not be visible at the target data store.

Conversely, when the exchanged data involves low-latency periods (e.g. seconds), the target data store must be updated by online data integration applications that continuously capture and push data changes occurring at the source systems to the target store. This push technique requires data changes to be captured, using some form of Change Data Capture (CDC) technique. Both pull and push consolidation modes can be used together: for example, an online push application may accumulate data changes in a staging area, which is then queried at scheduled intervals by a batch pull application. While the push model follows an event-driven approach, the pull mode gathers data on demand (Figure 4).

Figure 4: Push and pull modes of data consolidations

Applications commonly use the consolidated data for querying, reporting and analysis purposes. Updating the consolidated data is usually not allowed, due to data synchronization problems with the source systems. However, a few data integration products enable this writing capability, providing ways to handle possible data conflicts between the updated data in the consolidated data store and the originating source systems.

Data consolidation allows large volumes of data to be transformed (restructured, reconciled, cleansed and aggregated) as they flow from the source systems to the target data store. As disadvantages, this approach requires intensive computing to support the consolidation process, network bandwidth for transferring the data, and disk space for the target data store.

Data consolidation is the main approach used by data warehousing applications to build and maintain an operational data store and an enterprise data warehouse, and ETL is one of the most common technologies used to support it. Besides ETL, another way of accomplishing data consolidation is by using Enterprise Content Management (ECM) technology.


Most ECM solutions focus on the consolidation and management of unstructured data such as documents, reports and Web pages 4.
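To make the pull / push distinction of this section concrete, the sketch below (with hypothetical tables and columns, and SQLite standing in for the target store) contrasts a batch pull that takes periodic snapshots guarded by a high-water mark with an event-driven push handler that applies individual change events as they arrive. As discussed above, the pull variant cannot see intermediate states that occurred between two snapshots.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders (id INTEGER, status TEXT, updated_at TEXT)")
con.execute("CREATE TABLE dw_orders  (id INTEGER PRIMARY KEY, status TEXT)")

last_snapshot = "1970-01-01T00:00:00"   # high-water mark kept by the batch job

def pull_snapshot(con):
    """Batch pull: periodically query everything changed since the last run."""
    global last_snapshot
    rows = con.execute(
        "SELECT id, status, updated_at FROM src_orders WHERE updated_at > ?",
        (last_snapshot,)).fetchall()
    for id_, status, updated_at in rows:
        con.execute("INSERT OR REPLACE INTO dw_orders VALUES (?, ?)", (id_, status))
        last_snapshot = max(last_snapshot, updated_at)
    con.commit()

def push_change(con, event):
    """Event-driven push: apply one captured change as soon as it occurs."""
    con.execute("INSERT OR REPLACE INTO dw_orders VALUES (?, ?)",
                (event["id"], event["status"]))
    con.commit()

# Pull mode: only the state visible at snapshot time reaches the target.
con.execute("INSERT INTO src_orders VALUES (1, 'SHIPPED', '2007-04-02T10:00:00')")
pull_snapshot(con)
# Push mode: every change event reaches the target with low latency.
push_change(con, {"id": 2, "status": "NEW"})
print(con.execute("SELECT * FROM dw_orders ORDER BY id").fetchall())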

1.4.2 Data Federation

Data federation provides a unified virtual view over one or more data sources. When a query is issued against this virtual view, a data federation engine distributes the query to the underlying data sources, then retrieves and integrates the resulting data according to the virtual view before returning the results. Data federation always pulls data from the source systems on demand, driven by query invocation, and data transformations are performed after the information has been extracted from the data sources. Enterprise Information Integration (EII) is a technology that enables the federated approach.

Metadata is the key element of a federated system; it is used by the federation engine to access the data sources and may vary in complexity. Simple metadata configurations may consist only of the definition of the virtual view and of how it maps onto the data sources. In more complex situations, the metadata may describe the data volumes held in the data sources and which access paths should be used to reach them (allowing the federated solution to greatly optimize access to the source data). Some federation engines take metadata even further, describing additional business rules such as semantic relationships between data elements that cut across source systems (e.g. customer data, where a common customer identifier may be mapped to the various customer keys used in the different source systems).

The main advantage of the federated approach is that it provides access to data without the need to consolidate it into another data store, which is valuable when the cost of data consolidation outweighs the business benefits it would provide. Data federation can be especially useful when data security policies and license restrictions prevent the source data from being copied. However, data federation is not suited to dealing with large amounts of data, to sources with significant data quality problems, or to situations where the performance impact and overhead of accessing multiple data sources at runtime become a bottleneck.

4 ECM will not be further discussed, since ETL is the main technology for data consolidation.
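A toy federation engine, assuming two hypothetical source systems held in memory, illustrates the mechanics described above: metadata defines how the virtual view maps onto each source, the query is dispatched on demand, and the partial results are merged before being returned, with nothing persisted on the federation side.

# Metadata: how the virtual view "customer" maps onto each source system.
VIEW_METADATA = {
    "customer": [
        {"source": "crm",     "fields": {"id": "cust_id",   "name": "full_name"}},
        {"source": "billing", "fields": {"id": "client_no", "name": "client_name"}},
    ],
}

# Stand-ins for remote systems; a real engine would issue SQL or web-service calls.
SOURCES = {
    "crm":     [{"cust_id": 1, "full_name": "Acme"}],
    "billing": [{"client_no": 2, "client_name": "Globex"}],
}

def query_view(view):
    """Pull data on demand from every mapped source and merge it into one result."""
    result = []
    for mapping in VIEW_METADATA[view]:
        for row in SOURCES[mapping["source"]]:        # sub-query pushed to the source
            result.append({virt: row[phys] for virt, phys in mapping["fields"].items()})
    return result

print(query_view("customer"))
# [{'id': 1, 'name': 'Acme'}, {'id': 2, 'name': 'Globex'}]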



1.4.3 Data Propagation

The main focus of data propagation applications is to copy data from one location to another. These applications usually operate online and push data to the target location using an event-driven approach. Updates from the source to the target systems may be performed either synchronously or asynchronously. While synchronous propagation requires that the data updates occur within the same transaction, asynchronous propagation is independent of the update transaction at the data source. Regardless of the synchronization type, propagation guarantees the delivery of the data to the target system. Enterprise Application Integration (EAI) and Enterprise Data Replication (EDR) are two examples of technologies that support data propagation.

The key advantage of data propagation is that it can be used for real-time / near-real-time data movement, as well as for workload balancing, backup and recovery. Data propagation tools vary considerably in terms of performance and of data restructuring and cleansing capabilities. Some tools support the movement and restructuring of high volumes of data, whereas EAI products are often limited in these two respects. This is partly because enterprise data replication has a data-centric architecture, whereas EAI is message- or transaction-centric.
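The synchronous / asynchronous distinction can be sketched as follows, with hypothetical tables and an in-process queue standing in for a message-oriented middleware or replication log: the synchronous variant updates source and target inside one transaction, while the asynchronous variant only records the change and lets an independent worker deliver it to the target later.

import queue
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE tgt_account (id INTEGER PRIMARY KEY, balance REAL);
    INSERT INTO src_account VALUES (1, 100.0);
""")
outbox = queue.Queue()   # stand-in for a message-oriented middleware queue

def update_sync(con, id_, balance):
    """Synchronous propagation: source and target updated in the same transaction."""
    with con:  # both statements commit or roll back together
        con.execute("UPDATE src_account SET balance = ? WHERE id = ?", (balance, id_))
        con.execute("INSERT OR REPLACE INTO tgt_account VALUES (?, ?)", (id_, balance))

def update_async(con, id_, balance):
    """Asynchronous propagation: the source commits; delivery happens later."""
    with con:
        con.execute("UPDATE src_account SET balance = ? WHERE id = ?", (balance, id_))
    outbox.put((id_, balance))       # change event queued for guaranteed delivery

def propagation_worker(con):
    """Independent task that drains the queue and applies changes to the target."""
    while not outbox.empty():
        id_, balance = outbox.get()
        with con:
            con.execute("INSERT OR REPLACE INTO tgt_account VALUES (?, ?)", (id_, balance))

update_sync(con, 1, 120.0)
update_async(con, 1, 150.0)
propagation_worker(con)
print(con.execute("SELECT * FROM tgt_account").fetchall())   # [(1, 150.0)]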

1.4.4 Hybrid Approach

Data integration applications need not be limited to a single data integration technique; they can use a hybrid approach that combines several of them. Customer Data Integration (CDI), where the objective is to provide a harmonized view of customer information, is a good example of this approach. A simple example of CDI is a consolidated customer data store that holds customer data captured from different data sources; the data entry latency in the consolidated database will depend on whether data is consolidated online or in batch. Another possible approach to CDI is data federation, where a virtual customer view is defined over the data sources. This view could be used by external applications to access current customer information, and the federated approach may use metadata to relate customer information based on a common key. A hybrid of data consolidation and data federation is also possible: common customer data (e.g. name, address) could be consolidated in a single store, while the remaining customer fields (e.g. customer orders), usually specific to each source, could be federated. This scenario could be extended even further through data propagation: for example, if a customer updates his or her name and address during a transaction, this change could be sent to the consolidated data store and then propagated to other source systems, such as a retail store customer database.

1.4.5 Change Data Capture

Both data consolidation and data propagation create, to some extent, copies of source data, requiring a way to identify and handle data changes that occur in the source systems. Two approaches are common for this purpose: rebuilding the target data store on a regular basis, keeping data minimally synchronized between source and target systems (which is impractical except for small data stores), or providing some form of Change Data Capture (CDC) capability.
If the source data carries a time-stamp recording the date of the last modification, this can be used to locate the data that has changed since the CDC application last executed. However, unless a new record or version of the data is created at each modification, the CDC application will only be able to identify the most recent change for each individual record and not all the changes that might have occurred between application runs. If no time-stamp is associated with the source data, then in order to enable CDC the data sources must be modified either to create a time-stamp or to maintain a separate data file or message queue of the data changes.
CDC can be implemented in various ways. In Relational Database Management Systems (RDBMS), common approaches are to add database update triggers that take a copy of the modified data, or to isolate data changes by processing the DBMS recovery log. Triggers may have a high negative impact on the performance of source applications, since the trigger and the source data update are usually processed in the same physical transaction, thus increasing the transaction latency. Processing the recovery log causes less impact, since it is usually an asynchronous task independent from the data update. In non-DBMS applications (e.g. document based), time stamping and versioning are quite common, which eases the CDC task. When a document is created or modified, the document metadata is usually updated to reflect the date and time of the activity. Many unstructured data systems also create a new version of a document each time it is modified.
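As a rough illustration of the two options, the sketches below assume a hypothetical customer table and change table; the first uses a last_modified time-stamp for pull-based extraction, the second an Oracle-style trigger (trigger syntax varies between DBMSs) that records every change.

-- Time-stamp based capture (pull): extract rows modified since the last run.
-- The etl_control table and its columns are hypothetical.
SELECT *
FROM   customer
WHERE  last_modified >
       (SELECT last_run_time FROM etl_control WHERE job_name = 'customer_cdc');

-- Trigger based capture (push): copy every change into a separate change table.
CREATE OR REPLACE TRIGGER customer_cdc_trg
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
DECLARE
  v_id   customer.customer_id%TYPE;
  v_type CHAR(1);
BEGIN
  -- the key comes from :NEW for inserts/updates and from :OLD for deletes
  v_id := COALESCE(:NEW.customer_id, :OLD.customer_id);
  IF INSERTING THEN
    v_type := 'I';
  ELSIF UPDATING THEN
    v_type := 'U';
  ELSE
    v_type := 'D';
  END IF;
  INSERT INTO customer_changes (customer_id, change_type, change_time)
  VALUES (v_id, v_type, SYSDATE);
END;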



1.4.6 Data Integration Technologies

As previously introduced, several technologies are available for implementing the data integration techniques described above: Extract, Transform and Load (ETL), Enterprise Information Integration (EII), Enterprise Application Integration (EAI) and Enterprise Data Replication (EDR). Each technology is reviewed below.

1.4.6.1 Extract, Transform and Load (ETL)

The ETL technology provides the means of extracting data from source systems, transforming it accordingly and loading the results into a target data store. Databases and files are the most common inputs and outputs for this technology. ETL is the main consolidation support for data integration. Data can be gathered either using a schedule-based pull mode or based on event detection. When using the pull mode, data consolidation is performed in batch, while with the push technique the propagation of data changes to the target data store is performed online. Depending on the input and output data formats, data transformation may require a few or many steps: e.g. data formatting, arithmetic operations, record restructuring, data cleansing or content aggregation. Data loading may result in a complete refresh of the target store or may be performed gradually by multiple updates at the target destination. Common interfaces for data loading are ODBC, JDBC, JMS, native database and application interfaces.
The first ETL solutions were limited to running batch jobs at pre-defined scheduled intervals, capturing data from file or database sources and consolidating it into a data warehouse (or relational staging area). Over recent years, a wide set of new features has been introduced, customizing and extending the capabilities of ETL tools. Some significant examples follow:
o Multiple data sources (e.g. databases, text files, legacy data, application packages, XML files, web services, unstructured data);
o Multiple data targets (e.g. databases, text files, web services);
o Improved data transformation (e.g. data profiling and data quality management, standard programming languages, DBMS engine exploitation);
o Better administration (job scheduling and tracking, metadata management, error recovery);
o Better performance (e.g. parallel processing, load balancing, caching);
o Better visual development interfaces;
o Data federation support.
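As a minimal sketch of the kind of transformation and load step an ETL engine generates or executes, the statement below combines date formatting, an arithmetic operation, a simple cleansing filter and an aggregation; the staging and warehouse tables are hypothetical.

-- Transform and load in one set-oriented statement, from a staging table
-- into a (hypothetical) daily sales table of the data warehouse.
INSERT INTO dw_daily_sales (sale_date, store_id, total_amount)
SELECT CAST(s.sale_ts AS DATE)        AS sale_date,     -- date formatting
       s.store_id,
       SUM(s.quantity * s.unit_price) AS total_amount   -- arithmetic + aggregation
FROM   stg_sales s
WHERE  s.quantity > 0                                   -- simple cleansing filter
GROUP BY CAST(s.sale_ts AS DATE), s.store_id;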


1.4.6.1.1 Tuning ETL

ETL is a traditional data integration technique widely used in information systems. However, for some specific cases, variations on standard ETL can increase performance drastically, taking advantage of RDBMS technology and special tuning features. Besides traditional ETL, two new trends exist:
o ELT (Extract, Load and Transform): Oracle and Sunopsis are the leaders of this technology, where data is first loaded into a staging area database and only afterwards are the transformations applied. The ELT technology has been constrained by database capabilities. Since ELT had its origins with RDBMSs, the technology tended to be suitable for just one database platform. ELT also lacked functionality, as the vendors were more concerned with building a database than an ELT tool. Sunopsis was the only exception, an ELT tool not owned by an RDBMS vendor, until it was acquired by Oracle;
o ETLT (Extract, Transform, Load and Transform): Informatica is the leader of this technology, which consists of a database pushdown optimization added to traditional ETL: a standard ETL process delivers data to a target database, where further transformations are performed (for performance reasons) before the information is moved into the target tables. Microsoft SSIS also has good ETLT capabilities with the SQL Server database.
Summarizing, while ELT presents itself as a novel approach that takes advantage of database optimization, ETLT can be considered a simple extension of ETL with some tuning functionality. Comparing ETL and ELT [15], the following advantages can be identified for ELT:
o ELT leverages RDBMS engine hardware for scalability;
o ELT keeps all data in the RDBMS all the time;
o ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput;
o ELT can achieve three to four times the throughput rates on an appropriately tuned RDBMS platform;
while the key negative points are:
o ELT relies on proper database tuning and proper data model architecture;
o ELT can easily use 100% of the hardware resources available for complex and huge operations;
o ELT cannot balance the workload;
o ELT cannot reach out to alternate systems (all data must exist in the RDBMS before ELT operations take place);
o ELT easily quadruples disk storage requirements;
o ELT can take longer to design and implement;
o ELT involves more steps (less complicated per step), usually resulting in more SQL code.

Finally, the key advantages of ETL over ELT are:
o ETL can balance the workload / share the workload with the RDBMS;
o ETL can perform more complex operations in single data flow diagrams (data maps);
o ETL can scale with separate hardware;
o ETL can handle partitioning and parallelism independently of the data model, database layout, and source data model architecture;
o ETL can process data in-stream, as it transfers from source to target;
while the key negative points are:
o ETL requires separate and equally powerful hardware in order to scale;
o ETL can "bounce" data to and from the target database and requires separate caching mechanisms, which sometimes do not scale to the magnitude of the data set;
o ETL cannot perform as fast as ELT, due to RAM and CPU resource limits.
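The ELT pattern discussed above can be sketched in two steps. The orders feed, the table names and the PostgreSQL-style bulk load command are assumptions (each RDBMS has its own bulk-load utility); the point is that the set-oriented transformation runs inside the database engine.

-- Step 1 (Extract + Load): bulk-load the raw extract into a staging table.
COPY stg_orders FROM '/data/extract/orders.csv' CSV HEADER;

-- Step 2 (Transform, inside the database): the transformation is pushed down to
-- the RDBMS engine, here a cleansing filter plus a dimension key lookup.
INSERT INTO dw_orders (order_key, customer_key, order_date, amount)
SELECT o.order_id,
       c.customer_key,
       CAST(o.order_date AS DATE),
       o.amount
FROM   stg_orders   o
JOIN   dim_customer c ON c.source_customer_id = o.customer_id
WHERE  o.amount IS NOT NULL;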

1.4.6.2 Enterprise Information Integration (EII)

Enterprise Information Integration provides a virtual view of dispersed data, supporting the data federation approach for data integration. This view can be used for on-demand querying of transactional data, data warehouses and / or unstructured information. EII enables applications to see dispersed data sets as a single database, abstracting away the complexities of retrieving data from multiple sources, heterogeneous semantics and data formats, and disparate communication interfaces. EII products have evolved from two different technological backgrounds, relational DBMS and XML, but the current trend of the industry is to support both approaches, via SQL (ODBC and JDBC) and XML (XQuery and XPath) data interfaces.
EII products with a strong DBMS background take advantage of the research performed in developing distributed Database Management Systems (DBMSs), which has the objective of providing transparent, full read / write access over distributed data. A key issue in distributed DBMSs is the performance impact of distributed processing on mission-critical applications (especially when supporting write access to distributed data). To overcome this problem, most EII products provide only read access to heterogeneous data and just a few tools allow limited update capabilities. Another important performance option is the ability of EII products to cache results and allow administrators to define rules that determine when the data in the cache is valid or needs to be refreshed.

1.4.6.3 EII versus ETL

EII data federation cannot replace the traditional ETL data consolidation approach used for data warehousing, due to the performance and data consistency issues of a fully federated data warehouse. Instead, EII should be used to extend data warehousing to address specific needs. Complex query processing that requires access to operational transaction systems may affect the performance of the operational applications running on those systems; EII can improve performance in these situations by sending simpler and more specific queries to the operational systems. A potential problem with EII arises when transforming data from multiple source systems, since data relationships may be complex or confusing and the data quality may be poor, preventing good federated access. These issues point out the need for a more rigorous approach to system modelling and analysis for EII. The following are circumstances in which EII may be a more appropriate alternative for data integration than ETL:
o Direct write access to the source data: Updating a consolidated copy of the source data is generally not advisable due to data integrity issues. Some EII products enable this type of data update;
o It is difficult to consolidate the original source data: For widely heterogeneous data and content, it may be impossible to bring all the structured and unstructured data together in a single consolidated data store;


o Federated queries cost less than data consolidation: The cost and performance impact of using federated queries should be compared with the network, storage, and maintenance costs of using ETL to consolidate data in a single store. When the source data volumes are too large to justify consolidation, or when only a small percentage of the consolidated data is ever used, a federated solution is more appropriate.
The arguments in favour of ETL compared to EII are:
o Read-only access to reasonably stable data is required: Creating regular snapshots of the data source isolates users from the ongoing changes to source data, defining a stable set of data that can be used for analysis and reporting;
o Users need historical or trend data: Operational data sources may not have a complete history available at all times (e.g. when a sliding window approach is used). This history can be built up over time through the ETL data consolidation process;
o Data access performance and availability are key requirements: Users want fast access to local data for complex query processing and analysis;
o User needs are repeatable and can be predicted in advance: When most of the performed queries are well defined, repeated in time, and require access to only a known subset of the source data, it makes sense to create a copy of the data in a consolidated data store for its manipulation;
o Data transformation is complex: Due to performance issues it is inadvisable to perform complex data transformation as part of an EII federated query.

1.4.6.4 Enterprise Application Integration (EAI)

Enterprise Application Integration provides a set of standard interfaces that allow application systems to communicate and exchange business transactions, messages and data, accessing data transparently while abstracting from its location and format. EAI supports the data propagation approach for data integration and is usually used for real-time operational transaction processing. Access to application sources can be performed through several technologies, such as web services, Microsoft .NET interfaces or the Java Message Service (JMS). EAI was designed for propagating small amounts of data between applications (it does not support the complex data structures handled by ETL products), either synchronously or asynchronously, within the scope of a single business transaction. If asynchronous propagation is used, the business transaction may be broken into multiple lower-level transactions (e.g. a travel request could be broken down into airline, hotel and car reservations, although in a coordinated way).

1.4.6.5 EAI versus ETL

EAI and ETL are not competing technologies and in many situations they are used together to complement one another: EAI can be a data source for ETL, and ETL can be a service to EAI. The main objective of EAI is to provide transparent access to a wide set of applications. Therefore, an EAI-to-ETL interface could be used to give ETL access to this application data, e.g. through web service communication. Using this interface, custom point-to-point adapters for these data source applications would not need to be developed for ETL purposes. In the opposite architectural configuration, the interface could also be used as a data target by an ETL application. Currently most of these interfaces are still in their early stages of development and, in many cases, instead of an EAI-to-ETL interface, organizations use EAI to create data files, which are then fed to the ETL application.

1.4.6.6 Enterprise Data Replication (EDR)

The Enterprise Data Replication technology is less well known than ETL, EII or EAI, even though it is widely used in data integration projects. This lack of visibility arises because EDR is often packaged together with other solutions (e.g. all major DBMS vendors include data replication capabilities in their products, as do many CDC-based solutions that also offer data replication facilities). EDR is not limited to data integration purposes but is also used for backup and recovery, data mirroring and workload balancing. Some EDR products support two-way synchronous data propagation between multiple databases. Online data transformation, applied while data is flowing between two databases, is also a common property of EDR tools. The major difference between the EDR and EAI approaches is that EDR data replication is used for transferring considerable amounts of data between databases, while EAI is designed for moving messages and transactions between applications. A hybrid approach combining a data replication tool and an ETL tool is very common: e.g. EDR can be used to continuously capture and transfer big data sets to a staging area, and on a regular basis this data is extracted from the staging area by a batch tool that consolidates it in a data warehouse infrastructure.

1.5 Metadata for Describing ETL Statements

The metadata-driven ETL design is the most common among ETL tools. The key feature in this design is the metadata definition, which provides a language for expressing ETL statements and, when instantiated, instructs which operations shall be performed by the ETL engine, as well as their order. The structure and semantics of this language do not follow any specific standard, since none has been commonly accepted so far for describing ETL tasks. Depending on the type of ETL tool (research, open-source or commercial), one of four approaches (or a combination of them) is usually followed for the representation of ETL metadata:
1. Proprietary metadata: a private metadata definition (e.g. binary) that is not made available to the public domain. Usually used in the commercial ETL domain;
2. Specific XML-based language: freely defines the structure and semantics through an XML Schema or Document Type Definition (DTD), according to the specific functionalities / necessities of the supporting ETL engine. Usually each research / open-source ETL tool follows its own language, completely different from the others both in structure and in semantics;
3. XML-based Activity Definition Language (XADL) [16, 17]: an XML language for data warehouse processes, based on a well-defined DTD;
4. Simple Activity Definition Language (SADL) [16, 17]: a declarative definition language motivated by the SQL paradigm.
Since the first approach cannot be discussed (it is unknown) and the second one can assume almost any structure and specification, these will not be further addressed. Regarding the XADL and SADL approaches, even though they were not defined specifically for supporting ETL assertions, they will be explained using the following set of data transformations:
1. Push data from table LINEITEM of source database S to table LINEITEM of the DW database;
2. Perform a referential integrity violation check for the foreign key of table LINEITEM in database DW that references table ORDER. Delete any violating rows;


3. Perform a primary key violation check on table LINEITEM. Report violating rows to a file.
Figure 5 depicts a subset of the XADL definition for this scenario.

Figure 5: Part of the scenario expressed with XADL
In lines 3-10 the connection instructions are given for the source database (the data warehouse database is described similarly). Line 4 describes the Uniform Resource Locator (URL) of the source database. Line 8 presents the class name for the employed Java Data Base Connection (JDBC) driver: JDBC communicates with an instance of a DBMS through the DBMSFin driver. Lines 67-102 describe the second activity of the scenario. First, in lines 68-85 the structure of the input table is given. Lines 86-92 describe the error type (i.e., the functionality) of the activity: they declare that all rows that violate the foreign key constraint should be deleted. The target column and table are specifically described. Lines 93-95 deal with the policy followed for the identified records and declare that, in this case, they should be deleted. A quality factor returning the absolute number of violating rows is described in lines 96-98. This quality factor is characterized by the SQL query of line 97, which computes its value, and by the report file where this value should be stored.
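Independently of XADL, transformation 2 of the scenario essentially reduces to plain SQL of the following form. The orderkey column name is an assumption, since the scenario does not list the table columns, and ORDER is quoted because it is a reserved word.

-- Quality factor: number of LINEITEM rows violating the foreign key to "ORDER"
SELECT COUNT(*)
FROM   DW.LINEITEM
WHERE  orderkey NOT IN (SELECT orderkey FROM DW."ORDER");

-- Policy: delete the violating rows
DELETE FROM DW.LINEITEM
WHERE  orderkey NOT IN (SELECT orderkey FROM DW."ORDER");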


Four definition statements compose the SADL language (Figure 6):
o CREATE SCENARIO: specifies the details of a scenario (ties all other statements together);
o CREATE CONNECTION: specifies the details of each database connection;
o CREATE ACTIVITY: specifies an activity;
o CREATE QUALITY FACTOR: specifies a quality factor for a particular activity.

Figure 6 presents the syntax for the main statements.

Figure 6: The syntax of SADL for the CREATE SCENARIO, CONNECTION, ACTIVITY, QUALITY FACTOR statements
Returning to the scenario example, Figure 7 presents a representation using SADL. Lines 1-4 define the scenario, which consists of three activities. The connection characteristics for connecting to the data warehouse are declared in lines 6-9. An example of the SADL description of an activity can be seen in lines 11-16 for the reference violation checking activity. Finally, lines 18-22 express the declaration of a quality factor, which counts the number of rows that do not pass the foreign key violation check. The quality factor is traced into a log file.

Figure 7: Part of the scenario expressed with SADL


XADL is rather verbose and complex to write compared to SADL, yet it is more comprehensible, since it is quite detailed in its appearance and produces programs that are easily understandable even by a non-expert. SADL, on the other hand, is more compact and resembles SQL, making it more suitable for a trained designer.

1.6 Research ETL Tools

This section covers the ETL tools developed within the research and academic domains. Software prototypes are quite scarce (not to say non-existent) and information is mainly available in scientific papers and journals. In most cases, the presented work does not refer to a complete ETL software solution but focuses on particular features of ETL, mainly related to automatic learning.

1.6.1 AJAX

AJAX [18] is an extensible and flexible data-cleaning tool developed at INRIA, France. The main application goal is to ease the specification and execution of data cleaning programs for one or multiple input data sources. A key feature of AJAX is its proposed framework, wherein the program logic is modelled as a directed graph of data transformations. Further, the framework proposes a clear separation between the logical and physical levels of data cleansing. While the logical level supports the design of the data cleansing workflow and the specification of cleansing operations, the physical level refers to their implementation. Besides this framework, AJAX proposes a declarative language based on SQL statements, enriched by four macro operators (i.e. transformations):
o Mapping: enables standardization of data formats (e.g. date formats), merging or splitting fields in order to produce more suitable formats;
o Matching: computes an approximate join between two relations, assigning a distance value to each pair in the Cartesian product using a pre-defined distance function. This operator is fundamental for duplicate detection;
o Clustering: groups together matching pairs with a high similarity value by applying given grouping criteria;
o Merging: applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
Executing an AJAX program may raise exceptions (e.g. a name has a null value). AJAX proposes two ways of dealing with exceptions: pre-defined code can be invoked, or a special table containing all the tuples that caused the exception can be automatically generated (the AJAX default option). On exception, fault tuples are placed in a table and the user may use the explainer functionality to select one or several tuples and backtrack into the data cleaning process. The explainer functionality is supported by a conceptual metadata repository that stores annotations about the execution of data cleaning programs, describing the transformations as well as their input/output. AJAX is extensible since it can be customized in various ways:
1. The predefined macro-operators can invoke external domain-specific functions (e.g. a string comparison function in the case of the matching macro-operator) that have been previously added to an open library of functions;
2. Macro-operators can be combined with pure SQL statements in order to formulate more complex data cleaning programs;
3. The initial set of provided algorithms can be extended as needed (e.g. new clustering methods);
4. Both a regular SQL query engine and an external homegrown execution engine support the execution of the operators. This dual architecture makes it possible to exploit the computing capabilities offered by relational database systems, as well as particular optimization techniques.
AJAX has been implemented in the Java and C programming languages, and the RDBMS used was Oracle, version 8i. The XML metadata repository was supported by Oracle iFS for storing XML documents. The architecture of AJAX (Figure 8) includes two types of components: the repositories that manage data or fragments of code, and the operational units that constitute the execution core. Three repositories exist within AJAX:
o an RDBMS for storing the input dirty data and the output cleaned data;
o a library of functions that can be called within each transformation, storing their code as well as a description of their characteristics to be exploited by the AJAX query optimizer;
o a library of algorithms that implement the clustering transformation.

The core of the AJAX system is implemented by the following operational units: the analyzer, which parses the cleaning programs; the optimizer, which selects an optimal execution plan for executing a cleaning program; the execution engine, which schedules the tasks composing the selected plan; and the explainer, which triggers an audit trail mechanism.

Figure 8: AJAX architecture
During the specification phase, the user describes the cleaning program to be applied to the data. This can be achieved by directly writing a program using the AJAX specification language or through the SPEC-GUI application. Then the user may write a set of external functions and programs implementing the application-dependent functions and algorithms (not present in the system libraries) using the REGT-GUI application. AJAX's optimizer generates several alternative execution plans for the given program and chooses the optimal one based on its estimated cost. The human expert is able to visualize partial results of the cleaning execution and introduce modifications (e.g. correct field values, invalidate automatic actions that proved to be incorrect) that are then fed back into the execution cycle. Traces of the actions performed by the execution engine are stored in the metadata repository. At the end of each cleaning iteration, the DEBG-GUI allows the analysis of both the cleaned data and the metadata. A cleaning program consists of a sequence of macro-operator instances. Each operator takes as input one or several data flows, and produces as output one or several data flows, globally forming a directed cyclic graph.


Syntactically, each operator specification has a FROM clause that indicates the input data flow, a CREATE clause, which names the output data flow (for further reference), a SELECT clause specifying the shape of the output data and a WHERE clause to filter out non-interesting tuples from the input. An optional LET clause describes the transformation that has to be applied to each input item (tuple or group of tuples) in order to produce output items. This clause contains limited imperative primitives: the possibility to define local variables, to call external functions or to control their execution via if/then/else constructs. Finally, the clustering operator includes a BY clause which specifies the grouping algorithm to be applied, among the ones existing in the AJAX library. An example of the syntax for the Matching transformation is depicted in Figure 9.

CREATE MATCH M
SELECT m1.key key1, m2.key key2, SIMILARITY
FROM MP m1, MP m2
LET sim1 = nameSimilarityFunction(m1.lowerName, m2.lowerName)
    sim2 = streetSimilarityFunction(m1.street, m2.street)
    SIMILARITY = min(sim1, sim2)
WHERE (m1.key != m2.key) and SIMILARITY > 0.6

Figure 9: Example of the Matching transformation

1.6.2 ARKTOS

ARKTOS [16, 17] is a framework capable of modelling and executing ETL scenarios, data cleaning, scheduling and data transformation. ARKTOS provides three ways to describe ETL scenarios: a graphical front end and two declarative languages, XADL (an XML variant) and SADL (an SQL-like language). ARKTOS is based on a simple metamodel, where the main process type is the entity Activity. An activity is an atomic unit of work and a discrete step in the chain of data processing. Since activities are supposed to process data in a data flow, each activity is linked to input and output tables of one or more databases. An SQL statement gives the logical and declarative description of the work performed by each activity. A Scenario is a set of activities to be executed together, and is the outcome of a design process. Each activity is accompanied by an Error Type and a Policy. The error type of an activity identifies the problem the process is concerned with, while the policy signifies the way the offending data should be treated. Around each activity, several Quality Factors can be defined (implemented through the use of SQL queries). ARKTOS is also enriched with a set of generic template entities that correspond to the most popular data cleaning tasks (e.g. primary or foreign key violations) and policies (e.g. deleting offending rows, reporting offending rows to a file or table).
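Although the SQL templates actually generated by ARKTOS are not reproduced here, a primitive such as a NULL value check with a "report to a contingency table" policy and an associated quality factor would boil down to statements of roughly this form (table, column and contingency-table names are hypothetical):

-- Error type: NULL value existence check on column EMAIL of table CUSTOMER
-- Policy: report the offending rows to a contingency table of the same structure
INSERT INTO customer_contingency
SELECT * FROM customer WHERE email IS NULL;

-- Quality factor: absolute number of offending rows
SELECT COUNT(*) FROM customer WHERE email IS NULL;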


ARKTOS offers the following primitive operations to support the ETL process: (i) primary key violation checking, (ii) reference violation checking, (iii) NULL value existence checking, (iv) uniqueness violation checking and (v) domain mismatch checking. Further, ARKTOS possesses transformation primitive operations that enable the transformation of data according to some pattern (which can be either built-in or user-defined). For example, a transformation primitive can be used to transform a date field format (Figure 10).

Figure 10: Specifying the check for format transformation
The customization (graphically or declaratively) includes the specification of input and output (if necessary), contingency policy and quality factors. Each operation is mapped to an SQL statement. Once a primitive filter is defined in the context of a scenario, it is possible that some rows do not fulfil its criteria at runtime. For each such filter, the user is able to specify a policy for the treatment of the violating rows: (i) ignore (i.e., do not react to the error and let the row pass), (ii) delete (from the source data store), (iii) report to a contingency file or (iv) report to a contingency table. Within the tool, the user specifies the desired treatment for the rows identified by the error type of the activity, by selecting one of the aforementioned policies.
ARKTOS comprises several cooperating modules, namely the SADL and XADL parsers, the graphical scenario builder and a scheduler (Figure 11). The graphical scenario builder is the front-end module of the application, i.e., the module employed by the user to graphically sketch the logical structure of an ETL process. The scenario builder equips the designer with a palette of primitive transformations used to graphically construct the architecture of the data flow within the system.


Figure 11: The architecture of ARKTOS
The graphical interface consists of a number of visual objects which guide the user in the process of sketching the sequence of the transformation steps: rectangle shapes represent data stores, oval shapes represent transformation activities, rounded-corner rectangles represent user-defined quality factors (Figure 12). The priority of activities and the linkage of activities to data stores are depicted with links between them. The XADL and SADL parsers are the declarative equivalent of the graphical scenario builder.

Figure 12: Scenario for the propagation of data to a data warehouse
The central ARKTOS module performs three typical tasks: (i) it interprets a submitted scenario into a set of Java objects, (ii) it places the objects into a sequence of executable steps and (iii) it performs all the consistency checks to ensure the correctness of the design process.


The consistency checking involves (i) the detection of possible cycles in the definition of a process flow, (ii) the identification of isolated activities in the scenario and (iii) the proper definition of all the activities within a scenario, with respect to input/output attributes, scripts and policies.

1.6.3 Clio

Clio [19, 20] is a semi-automatic tool developed at IBM that provides a declarative way of specifying mappings between either XML or relational hierarchical schemas. Mappings are compiled into an abstract query graph representation that captures the transformation semantics of the mappings. The query graph can then be serialized into different query languages, depending on the kind of schemas and systems involved in the mapping: XQuery, XSL Transformations (XSLT), SQL, and SQL/XML queries. There are two complementary levels at which mappings between schemas can be established. The first, mostly syntactic, level employs schema-matching techniques to establish a set of non-interpreted correspondences between elements (e.g. terms, field names) in two different schemas. The second level at which mappings can be established is a more operational one, which relates instances over the source schema(s) with instances over the target schema. Establishing an operational mapping is necessary if one needs to move actual data from a source to a target. A pictorial view of Clio's architecture is shown in Figure 13.

Figure 13: Clio architecture
At the core of the system are the mapping generation and the query generation components. The mapping generation component takes as input correspondences between the source and the target schemas and generates a schema mapping consisting of a set of logical mappings that provide an interpretation of the given correspondences. The logical mappings are declarative assertions (expressed as source-to-target constraints) that work as abstractions for physical transformations (e.g. using SQL, XQuery) that execute at runtime. Such abstractions are easier to reason about and are independent of any physical transformation language. Nonetheless, they capture most of the information that is needed to generate the physical artefacts automatically. The query generation component has the role of converting a set of logical mappings into an executable transformation script. This is a generic module, independent of execution-language specifics, holding a plugin component for each execution language. During design the user can interact with the system through the GUI component. The user can view, add and remove correspondences between the schemas, attach transformation functions to such correspondences, and inspect the generated logical mappings and the generated transformation script.
To understand the data translation implied by the value correspondences, one must first identify the different attributes that form a real world object at the source and target, i.e. associations. Associations are computed, all at once, when pre-defined schemas and their constraints are loaded into the mapping tool. Value correspondences are then grouped by the associations that they affect and are interpreted as a whole, not individually. In effect, the mapping is viewed at an association level rather than at an attribute level. The result of this phase is a logical mapping that reflects the many ways the target associations can be constructed from the source associations through the given value correspondences. Consider the two schemas shown in Figure 14.

Figure 14: A relational to XML Schema mapping


The source relational schema on the left, expenseDB, contains three tables: company, grant, and project. The nested target schema on the right, statisticsDB (specified either as a DTD or XML Schema), groups information about companies and their funding by cities. In order to handle relational and XML schemas, Clio has an internal schema model that is expressive enough to capture both structures. Given these two schemas, the user defines value correspondences from source to target.
Consider, for example, the value correspondence V1 in Figure 14. This correspondence means that for each company name in the source, one organization with the same name must exist in the target. In the same manner, V2 indicates that for each principal investigator (pi) in the source, there must exist a funding of some organization in the target that has the same pi. The lines marked r1, r2 and r3 in the figure specify foreign key constraints. According to the foreign key r1, each grant tuple in the source is associated (assuming non-null foreign keys) with a company tuple. In the target, funds are nested under org elements. Therefore, a very likely interpretation of the group {V1, V2} is to map, via V2, each pi in the source to a pi in the target that is nested under precisely the same organization that is generated, via V1, by the company associated to the source pi. Such semantics cannot be achieved by naively looking at V1 and V2 separately: a target instance built in such a way would lose the association that exists between principal investigators and companies in the source.
Consider now the third value correspondence V3 in Figure 14. At the source, amount and pi of a grant are both in the same tuple. However, in the target, those two pieces of information are located in different elements: pi is in funds and amount in financials. The association between amount and pi of grant is achieved through the foreign key r3. The value correspondences V2 and V3 indicate that for each grant tuple in the source, there must be a financials tuple in the target with the same amount and a funds tuple with the same pi. Even if V2 and V3 are considered together for the generation of a mapping, the association between amount and pi of a grant will be lost in the target unless the appropriate aid value is generated in the target funds and financials tuples that have the pi and amount values of the source.
The second phase of schema mapping is the data translation phase, which produces the query implied by the logical mapping created by the previous phase. Following the example, the query generation phase should be able to create a query that: (i) retrieves company names from the source, (ii) for each such company retrieves the principal investigators (pi) of its grants, (iii) nests them under the generated organization elements. Second, new values may need to be generated in the target. Recall that aid in the target is required in order to maintain the association between the amount of a grant and its pi. Since there is no correspondence to determine how that value will be generated from the source, Clio must generate a value that is the same for aid of financials and aid of funds given the same mapped source values. Finally, in the query generation process each logical mapping is compiled into a query graph that encodes how each target element/attribute is populated from the source-side data. For each logical mapping, query generators walk the relevant part of the target schema and create the necessary join and grouping conditions. The query graph also includes information on what source data-values each target element/attribute depends on.
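To give a feel for the outcome, the sketch below shows the flavour of an SQL/XML query that nests funds under org elements for this example. It is not Clio's actual output: the join column cid is an assumption, and the concatenated string merely stands in for the Skolem-like aid value that Clio would generate.

-- Illustrative SQL/XML only; "grant" is quoted because it is a reserved word.
SELECT XMLELEMENT(NAME "org",
         XMLELEMENT(NAME "name", c.name),
         XMLAGG(XMLELEMENT(NAME "funds",
                  XMLELEMENT(NAME "pi",  g.pi),
                  -- stand-in for the generated aid value shared with financials
                  XMLELEMENT(NAME "aid", 'SK(' || c.name || ',' || g.pi || ')'))))
FROM   company c
JOIN   "grant" g ON g.cid = c.cid
GROUP BY c.cid, c.name;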

1.6.4 DATAMOLD

DATAMOLD [21] is a tool for the automatic segmentation of unformatted text records into structured elements, able to automatically learn the data structure when presented with a meaningful set of training examples. The prime motivation for this tool is the cleaning problem of transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like City and Street. The core of DATAMOLD is a powerful statistical technique called Hidden Markov Modelling (HMM), supported by multiple sources of information including the sequence of elements, their length distribution, distinguishing words from the vocabulary and an optional external data dictionary. The tool combines information about different aspects of a record, including:
o Characteristic words in each element: the dictionary, along with the symbol hierarchy, learns characteristic words in each element, intuitively capturing patterns of the form "words like street appear in road-names" and "house-numbers usually consist of digits";
o Number of symbols in each element: sometimes, records can be segmented based on the typical lengths of different elements. For example, title fields are long whereas location names are small. The HMMs and the transition probabilities capture this information;
o Partial ordering amongst elements: most often there is a partial ordering amongst elements: for example, the house number appears earlier in the address record than the zipcode. The transition probabilities and the structure of the outer HMM help learn this information;
o Non-overlapping elements: the approach attempts to simultaneously identify all the elements of a record. Thus, the different inner HMMs collaborate to pick the segmentation that is globally optimal. This contrasts with other systems that extract each element in isolation.
Figure 15 presents an architectural overview of DATAMOLD. The input to DATAMOLD is a fixed set of E elements of the form House #, Street and City, and a collection of T example addresses that have been segmented into one or more of these elements. No fixed ordering is assumed amongst the elements, nor are all the elements required to be present in all addresses. Two other optional inputs to the training process are a taxonomy on the syntax of symbols in the training data and a database of relationships amongst symbols.

Figure 15: DATAMOLD Architecture
A Hidden Markov Model (HMM) is a probabilistic finite state automaton comprising a set of states, a finite dictionary of discrete output symbols and edges denoting transitions from one state to another. Each edge is associated with a transition probability value. Each state emits one symbol from the dictionary according to a probability distribution for that state. Beginning from the start state, an HMM generates an output sequence O = o1, o2, ..., ok by making k transitions from one state to the next until the end state is reached. The ith symbol oi is generated by the ith state based on that state's probability distribution over the dictionary symbols. In general, an output sequence can be generated through multiple paths, each with some probability. The sum of these probabilities is the total probability of generating the output sequence. The HMM thus induces a probability distribution on sequences of symbols chosen from a discrete dictionary. The training data helps learn this distribution. During testing, the HMM outputs the most probable state transitions that could have generated an output sequence.
The basic HMM model consists of:
o a set of n states;
o a dictionary of m output symbols;
o an n x n transition matrix A, where the ijth element aij is the probability of making a transition from state i to state j;
o an n x m emission matrix B, where entry bjk denotes the probability of emitting the kth output symbol in state j.
Figure 16 depicts a typical HMM for address segmentation. The number of states n is 10 and the edge labels depict the state transition probabilities. For example, the probability of an address beginning with House Number is 0.92 and that of seeing a City after Road is 0.22. The dictionary and the emission probabilities are not shown for compactness.

Figure 16: Structure of a naive HMM
Experiments on real address data sets yielded an accuracy of 90% for Asian addresses and 99% for US addresses.
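In the standard notation used above (a sketch, not taken from the DATAMOLD paper), the probability that the model emits an output sequence O = o1, ..., ok along a particular state path s1, ..., sk (with s0 the start state) is the product of the corresponding transition and emission probabilities; the total probability of O sums this product over all paths, and segmentation selects the most probable path:

P(O, s_1 \ldots s_k) = \prod_{i=1}^{k} a_{s_{i-1} s_i} \, b_{s_i o_i},
\qquad
P(O) = \sum_{s_1 \ldots s_k} P(O, s_1 \ldots s_k),
\qquad
\hat{s}_1 \ldots \hat{s}_k = \arg\max_{s_1 \ldots s_k} P(O, s_1 \ldots s_k)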

1.6.5 IBHIS

The Integration Broker for Heterogeneous Information Sources (IBHIS) [22] aims to provide a broker, based on the concept of data federation, that enables the coherent use of a set of distributed, heterogeneous data sources from a single perspective. The domain of application is UK Health and Social Care, where hospitals, social care clinics and local doctors use largely autonomous, distributed and heterogeneous systems. The federated approach has been especially useful within this domain since the underlying data is of a confidential nature and a data consolidation approach would not be possible. IBHIS users are provided with transparent access to the data sources once set-up is complete. During set-up, the registration of the users and the underlying data sources takes place. The system (or federation) administrator constructs the federated schema and resolves all the semantic differences. Data recorded or created during the system set-up are passed as metadata to the Operational System. The architecture of the IBHIS broker is depicted in Figure 17.

Figure 17: The Architecture of the IBHIS Operational System
The role of the operational system is to receive a query from the user, identify his/her access rights, locate and query the appropriate data sources and return the results to the user. To provide this functionality, the operational system consists of communicating web services and a user interface. The Graphical User Interface (GUI) depicted in Figure 18 is responsible for presenting a list of available queries to the user according to his/her profile, formulating a federated query to pass to the Federated Query Service and displaying the final result. The Access Rule Service (ARS) is responsible for the initial user authentication and subsequent authorisation. Within the architecture, the ARS is primarily concerned with authorising access to the available data resources.


The Federated Schema Service (FSS) maintains the Federated Schema and all the mappings between the export schema and the federated schema created during broker set-up.

Figure 18: IBHIS GUI
The Federated Query Service (FQS) contains two sub-modules:
1. The Query Decomposer splits the federated query into a set of local queries. This is performed in consultation with the FSS that holds the mappings between federated and export schemas;
2. The Query Integrator receives the set of local results from the Data Access Service and integrates them into a federated record. This module sends the federated query and the federated record to the Audit Service.
The Audit Service (AS) contains two sub-modules that keep track of every action of IBHIS that may need to be recreated or audited in the future.
1. User Audit (per session): holds information such as: user login date, time, Internet Protocol (IP) address, logout, sequence of federated queries and sequence of federated records;


2. System Audit (per registration): holds information about Data Sources (e.g. registration date and time, intervals of availability) and User Setup (e.g. time-stamped creation, deletion, profile update, user registration/deletion).
The Data Access Services (DAS) are data intensive and are responsible for providing data from their respective data sources. The broker administrator will provide the consumer of the service (the FQS) with the following information:
o the data that the DAS provides, and its format;
o the domain and functionality related to the data;
o the security requirements for using the service;
o other non-functional characteristics (e.g. quality of service, cost).

The administrator will then publish the description file into the Registry Service, for runtime interrogation by the IBHIS operational system. The DAS is used to provide transparent access to the distributed, autonomous, heterogeneous data sources. When the FQS decomposes the Federated Query into a set of local queries, the FQS uses the Registry Service to find a corresponding DAS that provides the required data outputs for each sub-query. It uses the DAS description to bind with the data service, which will access the local data sources. The system set-up (Figure 19) is used by the system administrator to register users and the data source schemas. The export schemas (and the domain ontology) are consulted by the Schema Integration Service, which provides the federated schema to the Operational System. Every action is recorded by the Audit Service.

Figure 19: Set up of the IBHIS broker

The Registry Service (RS) acts as a metadata store about available data sources, with an interface for registering the metadata (available to the system set-up) and an interface for looking up data sources that satisfy the user's query at runtime (available to the operational system). Before users are added to the system, the administrator must define the available roles and build their corresponding set of data access rights using the URS component. The Ontology Service retrieves information about data held within the accessible data sources through the Registry Service. This knowledge is gained through inspection of the data schemas and the details held within the registry for each data source. It then checks the ontology for related synonyms and semantic concepts and passes the details to the Schema Integration Service. The Schema Integration Service retrieves the registered export schemas from the Registry Service and translates them into a common model, e.g. an Entity Relationship (ER) model. Relationships between components of the translated schemas are identified and the appropriate mappings are generated. The process involves resolving various forms of heterogeneity and results in the federated schema. The Ontology Service provides domain information and helps to resolve domain definition heterogeneity.

1.6.6 IBIS

The Internet-Based Information System (IBIS) [23] is a system for the semantic integration of heterogeneous data sources, applying the data federation approach for data integration. IBIS uses a relational mediated schema to query a variety of heterogeneous data sources, including data sources on the Web, relational databases and legacy sources. Each non-relational source is wrapped to provide a relational view over it. A key issue is that the system allows the specification of integrity constraints in the global schema. Since sources are autonomous and incomplete, the extracted data in general do not satisfy the constraints. To deal with this characteristic, IBIS adapts and integrates the data extracted from the sources making use of the constraints in the global schema, so as to answer queries as well as possible with the information available. In this way, the intensional information in the constraints over the global schema allows one to obtain additional answers. Traditional federated systems answer a query posed over the global schema by unfolding each atom of the query using the corresponding view. The reason why unfolding is sufficient in those systems is that the federated mapping essentially specifies a single database conforming to the global schema. Instead, due to the presence of integrity constraints over the global schema, there are several potential global databases conforming to the data in the sources, and hence query answering has to deal with a form of incomplete information.
Query processing in IBIS is separated into three phases: (i) the query is expanded to take into account the integrity constraints in the global schema; (ii) the atoms in the expanded query are unfolded according to their definition in terms of the mapping, obtaining a query expressed over the sources; (iii) the expanded and unfolded query is executed over the retrieved source database, to produce the answer to the original query. Query unfolding and execution are the standard steps of query processing in federated data integration systems, while the expansion phase is the distinguishing feature of the IBIS query processing method. The system architecture of IBIS is shown in Figure 20.

Figure 20: Architecture of IBIS
Four subsystems are identified: the wrapping subsystem (provides a uniform layer for all the data sources by presenting each source as a set of relations), the configuration subsystem (supports system management and configuration of all the metadata), the IBIS core (implements the actual data integration algorithms and controls all the parts of the system) and the user interface, which is divided into a Web interface and an application interface. In addition to these subsystems, a data store is used to store temporary data used during query processing, as well as data extracted from the sources and cached during the processing of previous queries.
The IBIS Core represents the set of components that take care at runtime of all the aspects of query processing. User queries are issued to the IBIS core by the application interface. The core evaluates a query by extracting data from the sources and executing the query over the extracted data. The extraction of data is performed as follows: starting from the set of initial values in the query, IBIS accesses as many sources as possible (according to their access limitations). The newly obtained tuples (if any) are used to access the sources again, getting new tuples, and so on, until there is no way of reaching new values. At each step, the values obtained are stored in a data store. To avoid wrappers being overloaded with a number of binding tuples (i.e., access requests) that exceeds their capacity, they are fed with batches of binding tuples that do not exceed a prefixed maximum size. Also, the new values in the tuples that are stored in the retrieved source database are not immediately poured into the domain tables, in order not to cause a performance bottleneck due to the excessive production of binding tuples.
The limitations in accessing the sources make the issue of data extraction inherently complex and costly, resulting in long times for the extraction of all obtainable tuples. On the other hand, experiments have also shown that the system retrieves tuples (and values) that are significant for the answer in a time that is usually very short compared to the total extraction time. This is due to the recursive nature of the extraction process, which obtains new values from the already retrieved ones; hence, a lower number of steps is required to obtain values extracted earlier, and these values have been shown to be more likely part of the answer to the query. IBIS offers the possibility of using previously extracted tuples in order to quickly answer queries based on those values. With this feature, queries can be chained, in the sense that each query uses the retrieved source databases of all the queries preceding it in the chain. IBIS is able to avoid producing binding tuples that have already been issued to the sources during the extraction of previous queries.
IBIS is equipped with a Web interface (Figure 21 and Figure 22). In practice, the time required for answering a query may be significantly long and therefore the traditional submit-and-wait interaction with Web-based systems is not suitable for IBIS. In order to cope with this problem, IBIS provides two strategies. The first one consists in showing tuples to the user as soon as they are obtained, while the answering process is going on. The second is the ability to continue the answering process while a user is logged off and present the obtained answers as soon as the user logs on again. E-mail and pager alerts are also available, to signal the user that a query has been completed.

Figure 21: Query interface in IBIS

Figure 22: Query result in IBIS
When a user query is processed, the set of constants appearing in the query is crucial, since at the beginning of data extraction such values represent the only way to access the sources. Therefore, adding values before starting the extraction process may significantly alter the extraction process itself.
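As a toy illustration of the expansion phase described earlier (the schema, constraint and query are hypothetical and not taken from IBIS), consider a global schema with relations emp(name) and works_for(name, dept), together with the integrity constraint that every emp name also appears in works_for:

-- User query over the global schema (before expansion): who works for some department?
SELECT name FROM works_for;

-- Expanded query: the constraint guarantees that every retrieved emp name works for
-- some (possibly not yet retrieved) department, so those names are certain answers too.
SELECT name FROM works_for
UNION
SELECT name FROM emp;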

1.6.7 InFuse

InFuse [24] is a database-centred middleware system for data integration. As depicted in Figure 23, the global architecture of InFuse consists of three tiers: the fusion engine for process and metadata management, the FraQL query processor for data management of heterogeneous sources and a front end for interactive graphical data analysis and exploration.

Figure 23: Global Architecture of InFuse
The fusion engine represents the central part of the system. Since the fusion process consists of several dependent steps, the fusion engine manages the definition and persistence of processes and controls their execution. Process definitions as well as the states of running processes are stored in a metadata repository. Besides data mining and machine learning algorithms, the fusion engine provides additional basic services and a CORBA-based API to connect to different front ends. To support data analysis techniques on heterogeneous data sources, the fusion engine relies on the features of the query processor FraQL.
FraQL [25] is a declarative language supporting the specification of a data cleansing process. The language is an extension to SQL based on an object-relational data model. It supports the specification of schema transformations as well as data transformations (e.g. standardization and normalization of values), possibly extended by user-defined functions. The implementation of user-defined functions has to follow specific requirements within the individual data cleansing process. With its extended join and union operators, in conjunction with user-defined reconciliation functions, FraQL supports the identification and elimination of duplicates. Both the union and join operators can be applied with an additional by clause, which denotes a user-defined function for the resolution of contradictions between tuples fulfilling the comparison clause. FraQL also supports filling in missing values and eliminating invalid tuples by detection and removal of outliers. Missing values can be filled in with values computed from others, e.g. the average or median of the attribute, or the attribute value of all tuples in a relation. Outliers are handled by partitioning the data based on a certain attribute and smoothing the data by different criteria like mean, median or boundaries.
A set of plugins is used by the front end Angie to support different means of visual information representation, ranging from a traditional visualisation pipeline to advanced glyph construction and volume rendering. Figure 24 depicts a simple data mining process modelled and executed within the workbench. In this example a simple classification process is illustrated. First, a data sample is obtained and used to learn a Bayesian classification model, and finally the model is applied for data classification.

Figure 24: Angie transformation pipeline

In Figure 25 the same process is shown, but an advanced visualisation technique is used to investigate the data.



Figure 25: Angie in advanced mode

1.6.8 INTELLICLEAN

INTELLICLEAN [25, 26] is a rule-based approach to data cleansing with the main focus on duplicate elimination. The proposed framework consists of three stages. In the Pre-Processing stage syntactical errors are eliminated and the values are standardized in format and consistency 5. The Processing stage represents the evaluation, over the data items, of cleansing rules that specify actions to be taken under certain circumstances. Four different classes of rules exist:

1. Duplicate identification rules specify the conditions under which tuples are classified as duplicates (an illustrative sketch is given below);
2. Merge/Purge rules specify how duplicate tuples are to be handled 6;
3. Update rules specify the way data is to be updated in a particular situation. This enables the specification of integrity constraint enforcing rules. For each integrity constraint an Update rule defines how to modify the tuple in order to satisfy the constraint. Update rules can also be used to specify how missing values ought to be filled in;
4. Alert rules specify conditions under which the user is notified for interaction.

5 It is not specified in detail how this is accomplished.
6 It is not specified how the merging is to be performed or how its functionality can be declared.

During the first two stages of the data cleansing process the actions taken are logged, providing documentation of the performed operations. In the human verification and validation stage these logs are investigated to verify and possibly correct the performed actions. The prototype was developed using Java and JDBC to connect to an ORACLE database server, while the expert system engine was supported by the Java Expert System Shell (JESS) [27], which enables C Language Integrated Production System (CLIPS) like [28] rules to be used. Figure 26 presents the INTELLICLEAN graphical interface. The Query window allows users to keep track of the state of the tables during data processing. The Log Window logs all actions and results (e.g. from the Log Window, it is possible to see that two similarity measures have been defined and three rules have been loaded).
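In the prototype the rules are declarative, CLIPS-like rules interpreted by JESS. Purely as an illustration of what a duplicate identification rule encodes, the following Java sketch flags two records as duplicates when a field similarity measure exceeds a threshold; the Record type, the token-overlap measure and the threshold value are assumptions and are not part of INTELLICLEAN.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: the condition of a duplicate identification "rule" expressed
// procedurally - two records are considered duplicates when their NAME fields are
// sufficiently similar according to a simple token-overlap (Jaccard) measure.
public class DuplicateRuleSketch {

    static final double THRESHOLD = 0.8; // assumed threshold value

    record Record(String id, String name) { }

    /** Jaccard similarity over whitespace-separated tokens (a stand-in measure). */
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    /** The condition part of a duplicate identification rule. */
    static boolean isDuplicate(Record r1, Record r2) {
        return !r1.id().equals(r2.id()) && similarity(r1.name(), r2.name()) >= THRESHOLD;
    }
}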

Figure 26: INTELLICLEAN graphical interaction

Figure 27 presents an example of a rule creation procedure. If the user clicks on the table column NAME, a default Rule:Name is automatically created. The available table fields (e.g. NAME, ADDRESS, DC, TEL) will appear in the rule creation wizard. For these, the user may select a set of measure functions and define the appropriate threshold values. According to these values, the rule in CLIPS syntax is automatically generated in the View Rule window.

Figure 27: INTELLICLEAN graphical interaction (explaining a rule)

1.6.9 NoDoSE

The Northwestern Document Structure Extractor (NoDoSE) [29] is an interactive tool for semi-automatically determining the structure of semi-structured text documents (e.g. HTML pages, text files) for data extraction purposes. Given the complexity of parsing text files, NoDoSE implements a semi-automatic approach to text processing, i.e. the user indicates a few of the regions of a document that are interesting and the application identifies similar regions automatically. Figure 28 illustrates the NoDoSE architecture. The input to the extractor is a set of text documents that are instances of the same document type (e.g. reports generated by a weekly file backup program). Using a GUI, the user hierarchically decomposes these files, outlining their interesting regions and describing their semantics. This task is expedited by a mining component that attempts to infer the files' grammar using the information the user has input so far. Once the format of the document has been determined and verified by successfully parsing all input documents, the extractor is able to produce different outputs: a report file (e.g. Comma Separated Values - CSV), wrapper code to be executed, or data loaded into a DBMS.


Figure 28: User Level Architecture

Before extracting data from the documents the user must decide how to model the data. One possibility for the data is shown in Figure 29. Documents of type SimulationRun contain three top-level components: a timestamp, a list of input parameters for each simulation node, and a list of measured results for each node. The parameters for each node (part of NodeParams) are represented as a list of <name, value> pairs.

interface NodeParams {
  attribute int node_number;
  attribute string node_name;
  attribute List<Struct OneParam {string name, string value}> parameters;
}
interface NodeResults {
  attribute int node_number;
  attribute string node_name;
  attribute List<Struct OneResult {string name, real average, real std, int num}> results;
}
interface SimulationRun {
  attribute String timestamp;
  attribute List<NodeParams> node_params;
  attribute List<NodeResults> node_results;
}

Figure 29: Example schema for a simulation output

The decomposition process begins by loading a single document into NoDoSE, which the user then hierarchically decomposes using a GUI. Next, additional documents of the same type are loaded into the system and automatically parsed. Any errors are corrected by using the GUI and reparsing. The process is complete when all of the documents have been successfully parsed.

The first step in decomposing a document is indicating its top-level structure, in this case a record of type SimulationRun. Next, each of its three fields (timestamp, node_params, and node_results) is added by selecting the relevant portion of the text in the document window and clicking on the add structure button in the tool bar (Figure 30). The type, type name, and label of each field can be entered using the controls on the bottom portion of the window. Since the node_params and node_results fields are complex types (lists), the decomposition process continues until all the leaves of the document tree are atomic types.

Figure 30: NoDoSE Document Structure Editor

Externally, documents are represented as flat files that serve as inputs to NoDoSE. Internally, however, it is necessary to store information about the structure of the documents. Hence, for every file that is loaded by the user, NoDoSE maintains a tree that maps the structural elements of the document to the text of the file. Each node of the tree represents one of the structural components of the document. The following values are stored in each node (a sketch of such a node is given at the end of this section):

o TypeName: Every node in the tree, and thus every component of a document, must be either an atomic type (e.g. Integer, Float, String, Date, EmailAddress and URL) or a named composite type (e.g. Set<Type>, Bag<Type>, List<Type>, Record{Type1 fieldName1, Type2 fieldName2,});
o startOffset, endOffset: These two values indicate which portion of the file corresponds to the structural component. For non-root nodes, the offsets are relative to the start of the parent node's region;


o Authored: This identifies the creator of the node: the user or one of the mining components. Maintaining the originator of a node is useful when mining the document structure, since user-identified regions can usually be given greater credence than regions identified by the mining components;
o ConfidenceValue: This is a value between 0 and 1 indicating how confident the author is about the node's correctness. It is typically set to 1, meaning complete confidence, for nodes added by the user and is set to a lower value for nodes inferred by one of the mining components. One practical use of the confidence value is to alert the user to nodes that may not have been parsed correctly.

At any time the user can ask NoDoSE to try to infer the remaining elements by mining the text. If the tool mistakenly identifies elements, the user can correct a few of the errors and ask for the text to be re-mined. After determining the grammar for a particular file, NoDoSE is loaded with all the other files of the same type. These are automatically parsed using the grammar inferred from the first file. It is possible, though, that parsing fails if any of the additional files contains something that was not present in the first parsed file. In this case, the user must correct the parsed tree for the new file, describing the new field using the GUI as before. The extractor will then update its grammar to account for the new field. When all of the files of the same document type have been successfully parsed, the conversion process step is complete.

Once parsing is complete, it is necessary to specify how to output the data that has been extracted from the parsed files. One option is to write the data into a text file. The format of the file and which data shall be output are specified using a simple GUI-based report generator. The intent of this component is not to replace the querying and reporting functions of a DBMS but to provide a quick means of writing simple files, such as comma or tab delimited tabular data. For users who need to perform more complex operations on the data, NoDoSE can generate a schema file and a load file for use by a load utility provided by a third-party DBMS. Finally, if the input documents are to be exposed through a query interface, the Lex / Yacc code needed by a wrapper can be generated.
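A minimal Java sketch of the parse-tree node described above might look as follows. The class is purely illustrative (NoDoSE's actual data structures are not published in this form) and only captures the fields listed in this section.

// Illustrative parse-tree node for a NoDoSE-style document decomposition.
// Field names follow the description in the text; everything else is assumed.
public class DocumentNode {
    public enum Author { USER, MINER }

    private final String typeName;     // atomic (e.g. "Integer") or composite (e.g. "List<Record>")
    private final int startOffset;     // relative to the parent node's region for non-root nodes
    private final int endOffset;
    private final Author authored;     // who created the node: the user or a mining component
    private final double confidence;   // 0..1; typically 1 for user-added nodes

    private final java.util.List<DocumentNode> children = new java.util.ArrayList<>();

    public DocumentNode(String typeName, int startOffset, int endOffset,
                        Author authored, double confidence) {
        this.typeName = typeName;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.authored = authored;
        this.confidence = confidence;
    }

    public void addChild(DocumentNode child) { children.add(child); }

    /** Nodes below this threshold can be flagged for the user to review. */
    public boolean needsReview(double threshold) {
        return confidence < threshold;
    }
}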

1.6.10 Potter's Wheel

Potter's Wheel [30] is an interactive data cleaning system that integrates data transformation and automatic error detection. Users gradually build transformations to clean the data by adding or undoing transformations on a spreadsheet-like interface (Figure 31), and the resulting tuples are presented immediately to the user. These transformations are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, the application automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs.

Figure 31: Potter's Wheel user interface with column transformations selected

Most of the transformations present in Potter's Wheel are simple and easy to specify graphically. A significant part of these refer to algebraic operations over an underlying data set, e.g. format, drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position, divide, selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. For complex transformations, Potter's Wheel lets users specify the desired results on example values, and automatically infers a suitable transform, using the structure extraction techniques described below.

Potter's Wheel allows users to define custom domains, and corresponding algorithms to enforce domain constraints. The application does not follow a regular expression approach but rather defines structures in terms of user-defined domains. For example, flight records like Taylor, Jane, JFK to ORD on April 23, 2000 Coach are not represented as a regular expression (that would hardly detect anomalies), but as a structure (e.g. <Airport> to <Airport> on <Date> <Class>) that enables the detection of logical errors like false airport codes or dates. Potter's Wheel provides the following default domains: arbitrary ASCII strings, character strings, Integers, sequences of Punctuation, C-style Identifiers, floating point values, spell-checked English words, common Names (checked by referring to the online 1990 census results), Money and generic regular expressions.

Summarizing, the application's engine is able to detect and handle three different kinds of data discrepancies:

1. Structural discrepancies: When two values refer to the same data type, but are structurally different (e.g. 2005.07.12 and 12.07.05);
2. Schema discrepancies: Resulting from an erroneous data integration strategy over different source data;
3. Domain constraint violations:
   a. Involving a single tuple: When the single value directly violates the defined domain constraints;
   b. Involving more than one tuple: When a constraint violation may not occur for single tuples, but may exist when simultaneously analysing more than one tuple (e.g. a functional dependency between the Birthday and Age fields).

The main components of the Potter's Wheel architecture (Figure 32) are a Data Source, a Transformation Engine that applies transformations along two paths, an Online Reorderer to support interactive scrolling and sorting at the user interface, and an Automatic Discrepancy Detector. The ability to undo incorrect transformations is an important requirement for interactive transformation. However, if the specified transformations are directly applied on the input data, many transformations (such as regular-expression-based substitutions and some arithmetic expressions) cannot be undone unambiguously. Undoing these transformations would require a physical undo, i.e., the system would have to maintain multiple versions of the (potentially large) dataset. Instead, Potter's Wheel never changes the actual data records. It merely collects transformations as the user adds them, and applies them only to the records displayed on the screen, in essence showing a view using the transformations specified so far. Undos are done logically, by removing the concerned transformation from the sequence and redoing the rest before publishing to the screen.


Figure 32: Potter's Wheel Architecture

Data Source: Potter's Wheel accepts input data as a single, pre-merged stream that can come from an ODBC source or any ASCII file descriptor (or pipe). Data read from the input is displayed on a spreadsheet interface that allows users to interactively re-sort any column, and to scroll in a representative sample of the data, even over large datasets. When the user starts Potter's Wheel on a dataset, the spreadsheet interface appears immediately, without waiting until the input has been completely read. This is important when transforming large datasets or never-ending data streams.

Transformations specified by the user need to be applied in two scenarios. First, they need to be applied when records are rendered on the screen. With the spreadsheet user interface this is done when the user scrolls or jumps to a new scrollbar position. Since the number of rows that can be displayed on screen at a time is small, users perceive transformations as being instantaneous. Second, transformations need to be applied to the records used for discrepancy detection because, as argued earlier, discrepancies are checked on transformed versions of the data. While the user is specifying transformations and exploring data, the discrepancy detector runs in the background, applying appropriate algorithms to find errors in the data. Hence, tuples fetched from the source are transformed and sent to the discrepancy detector, in addition to being sent to the Online Reorderer. The discrepancy detector first parses the values in each field into sub-components according to the structure inferred for the column (a sequence of user-defined domains). Then algorithms are applied to each sub-component, depending on its domain. For example, if the structure of a column is <number><word><time> and a value is 19 January 06:45, the discrepancy detector finds 19, January, and 06:45 as sub-components belonging to the <number>, <word>, and <time> domains, and applies the detection algorithms specified for those domains.
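The following Java sketch illustrates the idea behind this domain-based parsing step: a column structure is a sequence of domains, a value is split into sub-components, and each sub-component is checked by its domain. It is a simplified illustration only; the actual Potter's Wheel structure-inference and detection algorithms are considerably more sophisticated, and the regular expressions below are stand-ins.

import java.util.List;
import java.util.regex.Pattern;

// Simplified illustration of domain-based discrepancy detection.
// A column structure such as <number><word><time> is modelled as a list of domains;
// a value is tokenised and each token is validated against its domain.
public class DomainCheckSketch {

    interface Domain {
        boolean matches(String token);   // does the token belong to this domain?
    }

    static final Domain NUMBER = t -> Pattern.matches("\\d+", t);
    static final Domain WORD   = t -> Pattern.matches("[A-Za-z]+", t);
    static final Domain TIME   = t -> Pattern.matches("\\d{2}:\\d{2}", t);

    /** Returns the index of the first sub-component violating its domain, or -1 if none. */
    static int firstDiscrepancy(List<Domain> structure, String value) {
        String[] tokens = value.trim().split("\\s+"); // whitespace-separated sub-components
        if (tokens.length != structure.size()) {
            return 0; // structural mismatch: wrong number of sub-components
        }
        for (int i = 0; i < tokens.length; i++) {
            if (!structure.get(i).matches(tokens[i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Domain> structure = List.of(NUMBER, WORD, TIME);
        System.out.println(firstDiscrepancy(structure, "19 January 06:45"));  // -1 (no discrepancy)
        System.out.println(firstDiscrepancy(structure, "19 Jan4ary 06:45"));  // 1  (word violated)
    }
}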


After the user is satisfied with the sequence of transformations, Potter's Wheel can compile the transformation set and export it as either a C or Perl program, or a Potter's Wheel macro.

1.7 Freeware / Open Source and Shareware ETL Tools

This section comprises the ETL tools developed within the freeware and open source domains. The software can be used freely and in some cases presents a suitable suite of ETL tools and capabilities, although still far from the capabilities of commercial solutions. Special care must be taken regarding the maturity of the application suite as well as the community support for that integration tool.

1.7.1 Enhydra Octopus

Enhydra Octopus [31, 32] is a simple Java-based ETL tool that enables simple data cleansing / transformation when transferring data between JDBC data sources 7. Data normalization, creation of artificial keys and execution of SQL statements before, during or after data transfer are common features that are useful when copying / moving data (usually between databases) and are supported by the tool. The data transformation pipeline is defined using a simple XML-based declarative language that provides a global ordering over the transformations to perform and references the transformations' logic. A transformation can be defined either by inserting JavaScript code within the transformation's XML metadata script, or by implementing an external Java class. Enhydra Octopus consists of two main applications (Figure 33):

o Octopus Generator: Enables the creation of SQL and XML metadata files from a source database or source DOML 8 file. SQL files may include SQL statements for creating the database, tables, primary keys, indexes and foreign keys. Users can customize which SQL files will be created. XML files describe the relations between data and between tables, and are where the transformation rules are defined;
o Octopus Loader: A Java-based ETL engine for transferring data from a JDBC source to a JDBC target according to the metadata files previously created by the OctopusGenerator.

7 Besides the traditional JDBC database driver, Enhydra Octopus also supports JDBC drivers for CSV and XML files.
8 The DODS-XML (DOML) format is used by the Data Object Design Studio (DODS), an XSL-based tool for database communication.

Figure 33: Enhydra Octopus Architecture

The OctopusGenerator generates OctopusLoader loadjob skeletons (or, alternatively, DOML files) from an existing JDBC data source. OctopusGenerator supports many different types of databases: MSQL, DB2, QED, Oracle, PostgreSQL, McKoi, MySQL, HypersonicSQL, InstantDB, Access, Csv, XML, Excel, Standard, Sybase and Paradox. If a JDBC database is taken as source, after the OctopusGenerator connects to it, it reads all metadata describing the relationships between tables and the columns in those tables. As a result, three files can be generated by the application:

o An XML file: This is the main metadata file for the data processing. In it, OctopusGenerator will store information about:
  o JDBC parameters for the source and target databases (driver name, driver class, user and password);
  o SQL files (which SQL files will be generated and the path to them);
  o Import definitions to external transformation files.
o SQL files: In the SQL files, OctopusGenerator will store information about:
  o Tables and their columns that will be created as target tables;
  o Primary keys that will be created in target tables;
  o Foreign keys that will be created in target tables;
  o Indexes that will also be created in target tables.
o DOML file (optional): A DOML file with the structure of the source database.

Instead, if a DOML file is taken as source, OctopusGenerator will read all the metadata associated with the file (database tables, primary keys, foreign keys and indexes), and the outputs mentioned above can be generated. Metadata regarding data cleansing and transformations is coded manually using an appropriate XML editor. Besides data transformations, the OctopusLoader can perform data cleaning operations if they are described in the XML metadata file. Octopus provides three methods for this purpose:

o If some table fields have a null value: Replace null values with the column's default value;
o Cutting of data: If the application tries to insert/update a value in a table field whose length is greater than allowed, that value will be truncated to the appropriate length;
o Cleaning: Replace invalid foreign key values with the relation column's default value, or with a null value.

If data cleaning features are present in the XML metadata file, a log table / file may be created (if customized by the user) with all the events that took place while the loading process was occurring. All transformations must be defined by the user through Java classes implementing a well-known interface. References to these classes shall be present in the XML metadata file. Figure 34 presents an example of such a reference.

<transformations>
  <transformation name="transformation" tableName="BOOKS" tableID="0"
      transformatorClassName="org.webdocwf.util.loader.TestTransformer"
      transformatorConfig="">
    <sourceColumns>
      <sourceColumn name="XMLNAME"/>
    </sourceColumns>
    <targetColumns>
      <targetColumn name="TRANSXMLNAME" valueMode="Overwrite"/>
    </targetColumns>
  </transformation>
</transformations>

Figure 34: XML Syntax for Defining a Transformation

The XML metadata defines the source (inputs) and target (outputs) for each transformation according to the schema created by the OctopusGenerator application. OctopusGenerator and OctopusLoader share the same graphical user interface, which is mainly composed of two tabs, one for managing the OctopusGenerator metadata and another for managing the OctopusLoader metadata. Regarding the OctopusGenerator application, two main metadata panels exist, one for managing the JDBC connections to the input and output databases (Figure 35) and another for customizing the generation of output files (Figure 36).

Figure 35: Octopus Generator input and outputs

Regarding the OctopusGenerator output files configuration, metadata is comprised within four categories (Figure 36):

o Generation of SQL files: Customizes the generation of SQL files as output files. The user can select which SQL files shall be generated;
o Generation of XML files: Customizes the generation of XML output files;
o Generation of DOML files: Customizes the generation of the DOML output file;
o Logging: Specifies the verbosity level for the log, and the path and filename for the log file.

When all the metadata has been defined, the user generates the output files according to the metadata specifications.

Figure 36: Octopus Generator output options

Finally, in the metadata panel for the OctopusLoader application (Figure 37), one can define metadata regarding the processing of the ETL scripts created by the OctopusGenerator application.

Figure 37: Octopus Loader inputs


When all the metadata has been defined, the user executes the ETL engine and data is transferred / transformed, provided no error is detected during the process.
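As noted above, each transformation entry references a user-supplied Java class (org.webdocwf.util.loader.TestTransformer in Figure 34). The sketch below only illustrates the general shape of such a class; the actual Octopus transformer interface and its method signature are not reproduced here and must be taken from the Octopus documentation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical transformer for the Figure 34 example: it receives the XMLNAME
// source value(s) and returns the value(s) to be written to TRANSXMLNAME.
// The Octopus transformer interface is assumed, not shown; this class would have
// to implement it for the loader to invoke transformValue().
public class TestTransformer {

    /** Trims and upper-cases each incoming source value (illustrative logic only). */
    public List<String> transformValue(List<String> sourceValues) {
        List<String> result = new ArrayList<>();
        for (String value : sourceValues) {
            result.add(value == null ? null : value.trim().toUpperCase());
        }
        return result;
    }
}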

1.7.2 Jitterbit

Jitterbit [33] is an open source ETL tool based on a client/server paradigm, enabling users to design, configure, test and deploy data integration projects. The application can handle data from various types of systems and uses Web Services, XML files, HTTP/S, FTP, ODBC, file sharing, and flat and hierarchic file structures for data exchange. All integration operations execute on the Jitterbit integration server, while the Jitterbit client provides a graphical front end to:

o Define integration operations, including source and target systems;
o Create document definitions, from simple flat file structures (e.g. comma delimited) to complex hierarchic file structures;
o Map source to target data, with possible execution of data transformations in the process;
o Set schedules, create success and failure events and track the results of the integration operations;
o Display the operation queue and allow users to prioritize events;
o Provide visibility into operation execution with history details, error logging, and the ability to monitor and manage error resolution during integration operations.

Being a collaborative open-source application, Jitterbit can generate Jitterpaks: portable and sharable XML documents that include common mapping / transformation operations that can be published and used freely by the community. Jitterbit defines specific semantics for four types of expressions:

o Operation: A collection of definitions (e.g. source, target, transformation) that are grouped together to create a particular integration. An operation is the overall definition for an integration and can be run on a schedule;
o Source: Identifies where the information is currently stored (e.g. FTP server, File Share directory, database);
o Target: Identifies where the transformed information will be sent (e.g. FTP server, File Share directory, database);
o Transformation: Describes how information from a source is mapped to a target. Transformations can be used in three different ways:
  o Transformation Operation: The operation includes a source (e.g. text, XML, database) that is mapped and transformed to a target (e.g. text, XML, database);
  o Web Services Operation: The operation transforms a source (e.g. text, XML, database) into a Web Service request. The Web Service request retrieves the appropriate information from the remote host and returns the relevant information via a Web Service response. This response is then mapped and transformed to a target (e.g. text, XML, database);
  o Archive Operation: This is the most basic operation. It simply moves information from a source (files only) directly to a target (files only) without any information manipulation.

Each of these operations can be created within separate projects or included in the same project. Operations can be set to run on the success or failure of a previously executed operation (i.e. a successful web service operation could launch a separate archive operation). A simple data integration project using Jitterbit comprises six steps 9:

1. Create / Reuse a Project: Launching the Jitterbit graphical application, the user can create a new data integration project or reuse an existing one;
2. Create an Operation: When creating an Operation, the user defines and assembles all the components (i.e. Source, Target and Transformation) required for the definition of an integration step. Depending on the type of activity to perform (e.g. transformation, web service call, archive), a customized configuration for the Operation is presented to the user. Figure 38 presents an example of an Operation already defined for performing a data transformation. Besides customizing the type of activity, the user must also define the Source and Target of the data and the Transformation to be applied. Further, the user can customize the task to be executed according to a defined schedule or only on specific request. Finally, the user can enable / disable warnings and errors to be logged for later debugging. If none of the Source, Transformation and Target items has been previously defined, the user can create new instances as required.

9 For complex data integration projects, steps 2 to 5 are repeated several times.



Figure 38: After defining an Operation

3. Create a Source: A source definition instructs Jitterbit how to retrieve its source data for the integration. Figure 39 depicts an example of a source definition for a PostgreSQL database (the Type combo-box has been set to Database).

Figure 39: Defining a Database Source


For the depicted example, besides the database driver definition the user must select the timeout associated with the database connection, the server name where the database resides (which may be remote) and the login and password required for connecting to the database. After saving the source metadata the user can test the connection to see if everything has been defined correctly.

4. Create a Target: The procedure for creating a target is the same as for creating a source reference. Source and target definitions are completely independent of one another.

5. Create a Transformation: Transformations map data from a source to a target format. Although mapping is the main concern of the transformation step in Jitterbit, during this process the user can apply simple data transformations (e.g. arithmetic, string, date / time) before uploading data to the target schema. Figure 40 presents an example of this mapping. Independently of the source / target schemas, both are represented as a tree. To perform a mapping the user must select a node from each tree and press the Map button. Defined mappings are shown below both the source and target trees and can be edited / removed as required.

Figure 40: Transformation Mapping

For the depicted transformation the source schema corresponds to a database, while the target schema refers to a DTD (the target output shall be an XML file compliant with that DTD). Depending on the selected data source / target schemas, specific steps may be required. For example, in the case of a relational database, if more than one table is selected and a relationship exists between them, the user must specify it in order for this relation to be correctly mapped into the tree structure (in the source tree a 1 / N relation exists between the OrderHeader and OrderDetail tables). Jitterbit has the ability to test transformations on the fly by allowing the user to load sample data and run a test transformation. This process does not actually run the operation but gives the user a preview of how the data will look at the target.

6. Activate and Monitor the Operation: After testing the correctness of the project the user deploys it to the Transformation Server, making it ready for execution. Once available at the server the project can be initiated from the GUI. Doing so will immediately place the operation into the server's queue, regardless of its schedule setting. Assuming that the server is running, the operation will be fed into the system queue. The operation queue (Figure 41) can be displayed in the client application, together with the transformations' status. Further, log information is also available regarding the transformation execution. This information can be useful for understanding why an error has occurred when applying a specific operation.

Figure 41: Operation Queue and Log



1.7.3 KETL

KETL [34] is an open source Java ETL scripting tool based on metadata. The application's main features are (Figure 42):

o An independent ETL engine that enables the execution of complex ETL transformations. Clusters of KETL servers can be created, where each server contributes to the workload and manages a configurable number of executors;
o Job execution and scheduling supporting multiple job types, conditional exception handling, email notification and time-based scheduling. Four job types are built into the application (and may be further extended):
  1. SQL: Executes a hand-coded SQL statement via JDBC;
  2. OS: Executes an Operating System (OS) level command;
  3. XML: Executes an XML-defined job;
  4. Sessionizer: Accesses the KETL session.
o A centralized repository that allows multiple KETL instances to leverage job and parameter definitions;
o Performance monitoring, which collects job statistics in the repository, allowing later analysis of problematic jobs;
o Extraction and loading of relational, flat file and XML data sources, via JDBC and proprietary database APIs;
o In the event of a job failure, alerts can be generated and sent via email or pager.

Figure 42: KETL Architecture

Jobs are defined using XML and can be executed manually from the command line or via the KETL server. Server execution requires job details to be defined in the KETL metadata repository, as well as any associated job dependencies. Jobs can be triggered to execute once or according to a schedule. If a job is triggered but is already executing it will be rejected, unless configured for multiple execution. The metadata for the ETL jobs' execution is stored in SQL scripts (Figure 43) or XML files (Figure 44 and Figure 45). The XML files can be loaded into the metadata repository using a KETL command prompt command. The SQL scripts are condensed, using a Perl script, into two SQL scripts: one for job definitions and the other for job dependencies.

-- {
-- DEPENDS_ON = <PREVIOUS_JOB_1>, <PREVIOUS_JOB_2>;
-- JOB_TYPE_ID = 1;
-- PARAMETER_LIST_ID = 3;
-- PROJECT_ID = 3;
-- NAME = 'Load some table';
-- DESCRIPTION = 'Loads some table from sources X and Y';
-- RETRY_ATTEMPTS = 0;
-- SECONDS_BEFORE_RETRY = 0;
-- }
<SQL statement here>
-- END_SQL_JOB

Figure 43: SQL script file structure

<JOB ID="TRANSIPS" NAME="TRANS_IPS" PROJECT="Demo Project" TYPE="OSJOB" PARAMETER_LIST="tmpIPAddress">
  <DEPENDS_ON>WRITE_TRANS_IPS</DEPENDS_ON>
  <OSJOB>
    /u02/DI/dnstran -translate -normal exchange /u02/DI/tmpIPAddress.tmp
  </OSJOB>
</JOB>

Figure 44: OS Job example - execute a job on the machine running the KETL engine

<JOB ID="TRUNCDEMO12" NAME="TRUNCATE_DEMO_1_2" PROJECT="Demo Project" TYPE="SQL" PARAMETER_LIST="stageDB">
  <SQL>
    <STATEMENT AUTOCOMMIT="FALSE">TRUNCATE TABLE stage.demo1</STATEMENT>
    <STATEMENT AUTOCOMMIT="FALSE">TRUNCATE TABLE stage.demo2</STATEMENT>
  </SQL>
</JOB>

Figure 45: SQL Job example - execute a job on a database



1.7.4 Pentaho Data Integration: Kettle Project

KETTLE [35] is an acronym for Kettle ETTL Environment, where ETTL stands for Extract, Transform, Transport and Load of data. The application comprises a set of four tools that allow data manipulation from various databases. It does so by providing a graphical user environment in which to describe the ETTL operations based on metadata. KETTLE is used for:

o Data warehouse population;
o Export of database(s) to text file(s) or other databases;
o Import of data into databases, ranging from text files to Excel sheets;
o Information enrichment by looking up data in various information stores (e.g. databases, text files, Excel sheets);
o Data cleaning by applying complex conditions in data transformations;
o Application integration.

The SPOON application allows transformations to be designed using a GUI (Figure 46, Figure 47); these can then be executed with the Kettle tool Pan, a data transformation engine.

Figure 46: Text File Input

A hop connects one step to another and the direction of the data flow is indicated with an arrow on the graphical view pane. A hop can be enabled or disabled (e.g. for testing purposes). When a hop is disabled, the steps downstream of the disabled hop are cut off from any data flowing upstream of it.

Figure 47: Spoon Screenshot

By right-clicking on a hop it is possible to view both the input and output data flowing through it. In cases where field mapping from a source to a target is required, a dialog is available helping to determine which input fields correspond to which table field (Figure 48).

Figure 48: Mapping between input and target fields

The PAN application allows the execution of transformations designed with Spoon (either in XML format or residing in a Repository) in batch mode.


The CHEF application allows the creation of jobs that further automate the complex task of updating data warehouses, by allowing each transformation, job or script to be checked for whether it ran correctly or not. Chef is a graphical user interface (Figure 49) for designing jobs that can be executed with the Kettle tool Kitchen. Steps are easily added to a transformation by selecting a step type from the tree on the left and dragging it onto the canvas. Jobs are described through XML-based metadata that can be placed in the Kettle database repository.

Figure 49: Chef with a job

Conditional logic can be used to determine whether a job step shall be executed or not, based on the selected conditional statement (Figure 50):

Figure 50: Conditional Job Hop

o Unconditional: Specifies that the next job entry will be executed regardless of the result of the originating job entry;
o Follow when result is true: Specifies that the next job entry will only be executed when the result of the originating job entry was true;
o Follow when result is false: Specifies that the next job entry will only be executed when the result of the originating job entry was false.

A job entry is part of a job and provides a wide range of functionalities (e.g. shell, email, SQL, FTP and HTTP operations). The Log View shows what is happening when a job is running, displaying the details of the completed job entries. It also shows the log as it would appear if the job were launched by Kitchen. KITCHEN allows the execution of jobs designed with Chef in batch mode, whether stored in an XML file or in the database repository. Usually jobs are scheduled in batch mode to run automatically at regular intervals. Kitchen is capable of performing a multitude of functions, such as executing transformations, executing jobs, verifying file existence and getting files using FTP, SFTP or HTTP. Scheduling of Kettle tasks is performed via external tools: on Windows the task scheduler or the at command-line utility can be used, while on Unix the operating system cron table is advisable, as sketched below.
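As an illustration, a Unix crontab entry launching a Kitchen job nightly could look like the following. The installation path, job file and Kitchen option names are assumptions and should be checked against the installed Kettle version.

# Run the nightly warehouse load at 02:00 every day (illustrative paths and options)
0 2 * * * /opt/kettle/kitchen.sh -file=/opt/kettle/jobs/nightly_load.kjb -level=Basic >> /var/log/kettle/nightly_load.log 2>&1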

1.7.5 Pequel ETL Engine

Pequel 10 ETL Engine [36, 37] is a simple event-driven scripting interface for file processing and transformation that transparently generates and executes Perl and C code. For non-technical users Pequel will transparently generate, build and execute the transformation process, while developers may examine and extend the generated transformation program. The scripting language is event driven, with each event defining a specific stage in the overall transformation process. Each event section is filled in systematically by a list of items, e.g. condition statements, field names, property settings, aggregation statements and calculation statements. A set of aggregation functions and macros is also available. Perl statements and regular expressions can be embedded within Pequel statements. Pequel supports the following incoming data stream formats: variable length delimited, CSV, fixed length, Apache Common Log Format (CLF) and any format that Perl pack/unpack can handle.

10 The name Pequel is derived from perlish sequence.


Some examples of Pequel features are:

o Selecting Columns: Output selected columns from an input data stream;
o Selecting Records: Output selected records based on filtering conditional statements;
o Perform calculations: Calculate on input fields to generate new derived fields, using Perl expressions. Calculations can be performed on both numeric fields (mathematical) and string fields;
o Grouping and Aggregating Data: Records with similar characteristics can be grouped together (e.g. max, min, mean, sum and count on grouped record sets);
o In-Memory Sort-less Aggregation: Grouping can be performed in memory on unsorted input data;
o Data Conversion: Convert data using any of the built-in macros and Perl regular expressions. These include converting from one data type to another, reformatting, splitting a field into two or more fields, combining two or more fields into one field and converting date fields from one date format to another;
o Distributed Data Processing: Data can be distributed based on conditions;
o Piped Data Processing: The output from one Pequel process can be piped into a second Pequel process (an example is given at the end of this section);
o Database Connectivity: Direct access to Oracle and Sqlite database tables.

Pequel scripts (.plq files) are created manually using a text editor (syntax highlighting is available for the vim editor). Once the script is created, its validity can be checked and, if required, the source code can be generated. Finally, the Pequel application can be executed at the command prompt:

pequel scriptfile.pql < file_in > file_out

A Pequel script is divided into sections. Each section describes an event and starts with a name, followed by a list of items. For example, the input section is activated whenever an input record is read, while the output section is activated whenever an aggregation is performed. The input section defines the field format of the input data stream. It can also define new calculated (derived) fields. The output section defines the format of the output data stream. The output section is required in order to perform aggregation, and consists of input fields, aggregations based on grouping the input records, and new calculated fields. Each field definition must begin with a type (e.g. numeric, decimal, string, date). A naming notation can be used to specify that a field shall be temporary (only for intermediate calculations) and will not be output.

A key feature of Pequel is its built-in tables feature. Tables consist of key / value pairs and are used to perform merges and joins on multiple input data sources. They can also be used to access external data for cross-referencing and value lookups. Pequel statements can contain a mix of Perl code, including regular expressions, field names, Pequel-Macro calls and Pequel-Table calls. The Pequel compiler will first parse the statement for Pequel field names, macros and table names, and translate these into Perl code. Any text following and including the # symbol or // is considered comment text. If the cpp preprocessor is available then comments are limited to C style comments (// and /* ... */) and the # becomes a macro directive.

Figure 51 presents an example of a script that splits records on a space delimiter and parses quoted fields and square-bracketed fields. This example requires a C compiler because the input_delimiter_extra option will instruct Pequel to generate C code.

options
    header                        // (default) write header record to output.
    optimize                      // (default) optimize generated code.
    nulls
    transfer                      // Copy input to output
    input_delimiter( )            // Input delimiter is space.
    input_delimiter_extra(\"[)    // For Apache Common Log Format (CLF).
    inline_CC(CC)                 // C compiler.
    inline_clean_after_build(0)   // Pass-through Inline options:
    inline_clean_build_area(0)
    inline_print_info(1)
    inline_build_noisy(1)
    inline_build_timers(0)
    inline_force_build(1)
    inline_directory()
    inline_optimize("-xO5 -xinline=%auto")  // Solaris 64 bit
    inline_ccflags("-xchip=ultra3 -DSS_64BIT_SERVER -DBIT64 -DMACHINE64")

input section
    IP_ADDRESS,
    TIMESTAMP,
    REQUEST,

output section

Figure 51: Pequel script for processing an Apache CLF Log file
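As mentioned in the feature list above, Pequel processes can be chained through standard pipes. Using the command-line form shown earlier, a hypothetical two-stage pipeline (script and file names are illustrative only) would be:

pequel aggregate.pql < access_log.txt | pequel format_report.pql > report.csv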



1.7.6 Talend Open Studio

Talend Open Studio [38, 39] is an open-source, metadata-driven ETL solution developed by Talend and JasperSoftware, based on the Eclipse graphical framework. JasperETL Professional [40] uses the same code base as the open source edition, but has broader platform support. The tool suite contains the following features:

o Business Modeller: Provides a non-technical graphical view of the business information workflow;
o Numerous Connectors: Allow input and output from and to different data sources (e.g. flat files, XML, databases, Post Office Protocol (POP) and FTP servers);
o Wizards: Metadata configuration wizards configure heterogeneous data sources and complex file formats;
o Job Designer: A graphical editor and functional view of the ETL process;
o Transformation Mapper: A graphical editor and viewer of complex mappings and transformations;
o Real-time Debugging: Allows tracking of ETL statistics and traces throughout the transformation process in real time.

The Perl programming language was selected to support Open Studio, since Perl was originally developed to perform data extraction and transformation functions. Talend's architectural design (Figure 52) is based on three client applications, Scheduler, Designer and Administrator, and is supported by two types of servers: processing servers and a main (metadata) server.

Figure 52: Talend architecture


The Scheduler graphical interface is based on the crontab command. The Scheduler generates cron-compatible entries that are executed periodically via the crontab program.

Figure 53: Talend Open Studio

The layout of the Open Studio window (Figure 53) is composed of four panels:

o Repository: A toolbox (supported by an independent DBMS) gathering all the reusable technical items that can be used either to describe business models or job designs;
o Graphical workspace: A flowcharting editor where both business models and job designs can be laid out. A Palette contains all the available model components, from which the user can drag and drop shapes, branches, notes or technical components to the workspace and then define them in the Properties panel;
o Properties: Gathers information relative to the selected graphical elements in the modelling workspace or about the actual execution of a complete job;
o Code Viewer: Visualizes the code generated for the selected component, or a sub-set of the code for the whole job. This is only applicable to job designs, since no code is generated from business models.


Talend Open Studio enables a top-down approach where architects can participate from the general business model down to the more precise details of the technical application. Architects who want to model their flow management at a macro level can use a Business model, a non-technical view of a business workflow. The architect can symbolise systems, steps and needs using multiple shapes and create relations among them through the Modeller application (Figure 54). The nature of these connections can be defined using the Repository elements. Further, notes and comments can be added to the model in order to identify elements or connections at a later stage. A job design is the functional layer of a business model, technically implementing the data flow.

Figure 54: Defining a Business Model

Input sources are defined through wizards. For a database table schema the user must select the type of database, its location, name and user permissions. Next the schema for the input table must be created. From the available set of tables one or more are selected and a new schema is composed by selecting fields from them (e.g. add / remove columns, change names).

File schema creation is very similar for all types of file connections, e.g. Delimited, Positional, Regular Expression (Regex) or XML. First a sample file must be selected and uploaded into the wizard. A File viewer gives an instant picture of the loaded file, allowing the user to check the file structure and consistency, as well as the presence of a header or footer. The next wizard step (Figure 55) sets the Encoding, as well as the Field and Row separators. If the file preview shows a header message, this can be excluded from the parsing. The same is applicable to footers. The Limit of rows allows the user to restrict the extent of the file being parsed. In the last step a schema is generated, similarly to the database input source. At this step the application may propose a possible schema given the sample data.

Figure 55: Defining a delimited file

File Positional (i.e. fixed width) schemas are defined in a similar way to delimited files. The only difference consists of a wizard step where position markers (similar to the Microsoft Excel position markers for data importation) are used to determine the fixed width parameters. For setting up a File Regex schema (i.e. files containing redundant information, such as log files) the procedure is also similar, and the user must only define the regular expression for capturing the data fields. Finally, for handling XML files, a specific wizard step is presented where the XML hierarchy is defined as a tree. Specific nodes can be dragged and dropped from the tree and an XML Path Language (XPath) expression is automatically generated for those values. By repeatedly using this procedure a new schema can be defined.

The transformation Palette contains multiple actions for input / output definition, logging information, Perl code execution, FTP and email transfer, data transformation, data aggregation and detection of repeated values. The data transformation action tMap (Figure 56) is a powerful transformation that can perform multiple data transformation operations: data multiplexing and demultiplexing, data transformation, field filtering using constraints and data rejection. Since tMap is an advanced component, it is managed by a specific graphical interface: the Mapper application.

Figure 56: Example of a tMap transformation

The Mapper graphical interface (Figure 57) is composed of several panels:

o The Input panel (top left): Offers a graphical representation of all incoming data flows;
o The Variable panel (central): Allows redundant information to be factorized;
o The Output panel (top right): Allows mapping data and fields from Input tables and Variables to the appropriate Output rows;
o Input and Output schemas description (bottom): The Schema editor tab offers a schema view of all columns of the input and output tables selected in their respective panels.

tMap can be used for data transformation through Perl code execution or to join data from different input sources. The join operation can be performed in two ways: simple joins between input tables and inner joins with data rejection capabilities (i.e. if the inner join cannot be established for any reason, then the requested data will be rejected and placed in a pre-defined table), as illustrated by the sketch below.
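The inner-join-with-rejects behaviour described above can be shown with a small, generic sketch. tMap itself generates Perl in this version of Talend; the Java code below is only a conceptual illustration and uses hypothetical record and key names.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Conceptual illustration of an inner join with a reject flow:
// rows whose key has no match in the lookup table are diverted to a reject list
// instead of being silently dropped.
public class InnerJoinWithRejectsSketch {

    public static void main(String[] args) {
        // Main flow: order id -> customer id (hypothetical data)
        Map<String, String> orders = Map.of("o1", "c1", "o2", "c2", "o3", "c9");
        // Lookup flow: customer id -> customer name
        Map<String, String> customers = Map.of("c1", "Alice", "c2", "Bob");

        List<String> joined = new ArrayList<>();
        List<String> rejects = new ArrayList<>();

        for (Map.Entry<String, String> order : orders.entrySet()) {
            String customerName = customers.get(order.getValue());
            if (customerName != null) {
                joined.add(order.getKey() + " -> " + customerName);   // inner join succeeded
            } else {
                rejects.add(order.getKey());                          // no match: send to reject table
            }
        }

        System.out.println("joined  = " + joined);
        System.out.println("rejects = " + rejects);
    }
}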


Filters are available and can be used to perform selection among the input fields, and propagate only partial data to the outputs. Filters are also defined by Perl operators and functions, and logical operators can be used in the definition.

Figure 57: Mapper graphical interface

Reject options define the nature of an output table. These reject tables group data that do not satisfy one or more filters defined in the regular output tables. This way, data rejected from other output tables are gathered in one or more dedicated tables, and can be used to spot errors or unpredicted cases. The Reject principle concatenates all the non-Reject tables' filters and defines them as an ELSE statement. On job execution, statistics are available (Figure 58) displaying each component's performance rate (e.g. time, number of records, average values per record), helping to detect any bottleneck in the data processing flow. The tracking feature provides a row-by-row view of the data processing.

Figure 58: Presenting statistics to the user


1.8 Commercial ETL Tools

This section comprises the ETL tools developed within the commercial domain. Such application suites are very complete, not only in terms of ETL capabilities but also regarding complementary tools (e.g. data profiling, grid computing) that are also the property of the ETL tool vendor. Depending on whether the ETL tool vendor is simultaneously a DBMS vendor or not, different approaches to ETL may be followed (ETL versus ELT versus ETLT). However, most commercial solutions can be generalized to a metadata-based architecture where metadata is seamlessly generated by graphical client tools, interpreted and executed by some kind of engine, and resides in a centralized metadata repository.

1.8.1 ETL Market Analysis

The ETL market comprises multiple tools for the design and population of data warehouses, data marts and operational data stores. Most of these tools enable the periodic extraction, transformation and integration of data from any number of heterogeneous data sources (frequently transaction databases) into time-based databases used predominantly for query and reporting purposes. It is usual for these tools to provide developers with an interface for designing source-to-target mappings, transformations and metadata handling. The market is quickly evolving towards multimode, multipurpose data integration platforms suitable for use beyond the traditional domain of business intelligence and data warehousing.

This section presents two independent surveys conducted in April 2004 and May 2005 by the market analysis companies METAspectrum [41] and Gartner [42], respectively. The surveys only refer to commercial ETL applications; no research or freeware ETL applications have been analysed. Since the information held in these reports is proprietary (1600 per report copy), the full contents are not available to the public, and no information regarding the 2006 ETL survey could be found. The presentation of the market surveys follows chronological order, starting with an explanation of the analysis criteria followed by an overview of the survey findings.



1.8.1.1 METAspectrum Market Summary

The market survey performed by METAspectrum [41] in April 2004 followed a set of seven criteria for evaluating ETL tools 11:

o Platform Support: Support for an enterprise's existing sources, targets and execution environments is fundamental. Increasingly, support for non-DBMS sources (e.g. Web services, log files) is also becoming a critical concern;
o Transformations: Developers require both a broad palette of selectable data transformations and flexibility in developing and incorporating new logic;
o Data Management Utilities: Going outside the tool for high-performance sorting, job scheduling and data transport can be a nuisance and a maintenance headache;
o Performance Characteristics: Data has to be integrated faster and batch windows are progressively shrinking;
o Developer Environment Features: GUI features and flexibility, multi-developer capabilities, code debugging and application versioning are useful capabilities;
o Metadata: Support for metadata sources and interchange standards (e.g. Common Warehouse Model (CWM), XML Metadata Interchange (XMI)), metadata browsing/reporting and repository extensibility;
o Viability: Even near-term metadata standards do not provide for the porting of ETL applications, so enterprises should be concerned with long-term support.

Figure 59 depicts a graphical representation of the evaluation of the ETL tools. The graphic contains two axes (tool performance versus market presence). The market applications have been grouped into three clusters: follower, challenger and leader applications. Market leaders have stable, mature products with a broad array of data sourcing and targeting options, seamless access to mainframe data, robust developer environments and job parallelization. They have also leveraged strong financial cushions, enabling them to innovate and acquire ancillary capabilities (e.g. data quality).

11 Only a short description of the evaluation criteria and an overall evaluation of the ETL products has been made public for this report. Evaluation values for each of the criteria have not been disclosed.



Figure 59: METAspectrum Evaluation

Many challengers in the ETL market offer built-in intelligence features that help speed up the mapping of data sources, the tuning of jobs or the handling of errors during runtime. Others offer enormous libraries of built-in transformations, enterprise information integration-style data integration, or conjoined BI capabilities. Followers in this market are those that are relying on an existing installed base for service and maintenance fees as they search for an acquirer or a new home for their technology. Some have chosen to specialize in unpopular or vertical industry data sources. Others are nudging their way into the ETL marketplace with alternate data integration paradigms.

1.8.1.2 Gartner Market Summary

The market survey performed by Gartner [42] in May 2005 followed a set of eleven criteria for evaluating ETL tools 12:
o Ease of Deployment and Use: Implementation, configuration, design and development productivity;

12 Similar to the METAspectrum survey, individual evaluation values for each criterion have not been disclosed.


o Breadth of Data Source and Target Support: Connectivity to a range of database types, applications and other infrastructure components;
o Richness of Transformation and Integration Capabilities: Support for a variety of transformation types and the ability to handle complexity in transforming and merging data from multiple sources;
o Performance and Scalability: Ability to process large data volumes and support the needs of large enterprises in a timely and cost-effective manner;
o Metadata: Discovery, audit, lineage, impact analysis, interoperability with other tools;
o Vendor Viability and Overall Execution: Vendor focus, financials, innovation, partnerships, pricing, support capabilities and breadth of customer references;
o Data Quality Functionality: Data quality analysis, matching, standardization, cleansing and monitoring of data quality;
o Service Orientation: Ability to deploy data integration functionality as a service and consume other services as a source of data;
o Real-time and Event-driven Capabilities: Support for real-time data sources such as message queues, low-latency data delivery, change data capture;
o Portability: Seamless deployment across multiple platforms, distributed and mainframe;
o Breadth of Vision: Degree to which the vendor acknowledges and supports data integration patterns beyond traditional ETL for BI and data warehousing.
The Magic Quadrant graphic (Figure 60) is based on two axes (ability to execute versus completeness of vision). This divides vendors into four brackets: leaders (big on vision and execution), challengers (big on execution, less big on vision), visionaries (big on vision, not as good at execution) and niche players (short on both vision and execution).



Figure 60: Magic Quadrant for Extraction, Transformation and Loading
According to the ETL market panorama in 2005, the tools evaluated by Gartner [42] received the following remarks:
o Ab Initio Software pursues opportunities in financial services, retail and telecommunications. Customers and prospects continue to see the same strengths and challenges for Ab Initio's technology and business practices. Its lack of public visibility (either documentation or software) has slowed down its market evolution;
o Ascential Software (IBM) has continued its trend of growth, increased market visibility and greater brand awareness. Strong financial results in 2004 demonstrate further improvements in execution. The acquisition by IBM in 2005 detracts slightly from Ascential's near-term ability to execute, raising questions for Ascential prospects and customers regarding product plans and overlaps with existing IBM products;
o Business Objects has made solid strides in execution during 2004, increasing its frequency of appearance in client and competitive situations. It has delivered good ETL license revenue growth and continues to expand the functionality of Data Integrator. The vendor's vision and positioning of the product will likely remain BI-focused, limiting its ability to win broader data integration business;


o Cognos continues to use its ETL capabilities strictly to support sales and implementations of its BI products, and does not often appear in stand-alone ETL deals. Cognos continues to make modest improvements in functionality but recognizes the limitations of its product, and occasionally partners with other ETL vendors when requirements are significant in scale and complexity;
o Computer Associates (CA) does not promote ETL as a stand-alone product. As a result, it does not compete for significant ETL deals. With gaps in functionality relative to other vendors and market demand, CA's ability to execute against ETL competitors is declining. However, it is finding useful ways to embed its ETL functionality in other products;
o DataMirror retains a strong differentiation in its real-time change data capture capabilities. It has improved its positioning toward and experience in non-BI scenarios. However, DataMirror needs to develop greater competence in complex transformations and certified integration with packaged applications;
o Embarcadero Technologies is struggling to grow its presence in the ETL market. The base for its ETL product remains small (approximately 100 customers), although it is building an original equipment manufacturer channel due to the product's low price, extensibility and ease of embedding;
o Evolutionary Technologies International (ETI) remains flat on execution, retaining a small presence in the market with an active base estimated at 250 customers. Limited availability and the high cost of ETI skills continue to be issues for customers, causing some to shift to other products. However, trends in market demand, such as broad platform support and data integration scenarios beyond traditional ETL, align with ETI's strengths;
o Group 1 Software entered the ETL market via its acquisition of the Sagent Technology ETL tools in 2003 and was itself acquired by Pitney Bowes in 2004. This removes viability concerns around the Sagent technology but raises questions regarding how important it is to its new owner, because Pitney Bowes and Group 1 are focused on mailing automation, customer communications solutions and data quality;
o Hummingbird has had limited visibility in the ETL market. Hummingbird continues to make modest functional enhancements to its ETL tool;
o IBM, like the other major DBMS vendors, has had ETL functionality available with its DBMS product for some time. DB2 Data Warehouse Center and Warehouse Manager never achieved significant market awareness.

They were overshadowed by IBM's reselling of Ascential's, and more recently Informatica's, ETL products. IBM's acquisition of Ascential further detracts from IBM's ability to execute with its original ETL offering;
o Informatica has refined its positioning and strategy toward data integration by removing its PowerAnalyzer BI tools and Superglue metadata management offering as separate products. Improved financial results for 2004 and 2005 represent stronger execution;
o iWay Software is placing a stronger focus on marketing its ETL product and, as a result, sales have increased. Although product functionality is improving, the company's vision is oriented toward BI and lacks substance in areas such as data quality;
o Microsoft's position in the ETL market will remain constant until the delivery of SQL Server 2005 and its Integration Services component, which replaces the SQL Server 2000 Data Transformation Services. SQL Server 2005 beta sites show positive feedback regarding ease of use, functionality and performance;
o Oracle continues to build "mind share" in ETL as the customer base for Oracle Warehouse Builder expands. Although Oracle's vision has expanded to include data quality, Warehouse Builder remains Oracle-centric in architecture;
o Pervasive Software offers broad data integration capabilities, with a large customer base demonstrating a wide range of application types. Support for real-time and event-driven capabilities, as well as the addition of data quality functionality and mainframe support, further enhances its broad vision for data integration. Pervasive must continue to build on its core strengths of low cost and ease of use, achieving a higher degree of architectural robustness and gaining credibility in larger enterprise-level implementations;
o SAS has made solid improvements in functionality and ease of use in its ETL offering, including richer metadata management and tighter integration with the DataFlux data quality technology, which it also owns. These advances have increased SAS's appearances in competitive situations against the market leaders, although nearly always in the SAS customer base. While this is positive, SAS's corporate strategy and focus on BI will limit its ability to win business in broader, non-BI data integration scenarios;
o Sunopsis has made strides in building market awareness beyond its base in Europe. Sunopsis has a range of capabilities, spanning ETL and real-time messaging, and an architecture that enables the distribution of transformation

workload across data sources and targets. Its product plans call for the addition of richer data quality functionality and data federation, leading to a broader data integration platform.
The pace of change is accelerating in the market for ETL tools. In 2003, vendors began expanding their capabilities, adding basic data quality functionality, elements of real-time data movement and richer metadata support. Since 2004, acquisitions and partnerships have increased, which has moved many ETL providers into new domains and competitive situations (for example, application integration suites, virtual data federation, adapters/connectivity and replication). In 2005, organizations expanded their intent and vision for the deployment of ETL tools. Participants in many ETL tool selection projects evaluated the tools and vendors for suitability in the traditional business intelligence (BI) use cases, as well as in Enterprise Information Management (EIM), batch-oriented application integration and system migrations.

1.8.2 Business Objects Data Integrator

Business Objects comprises a suite of BI supporting tools that enable data integration from different sources and provide a rich set of ETL methods for batch and real-time integration, as well as built-in data quality features. A key component of the Business Objects BI solution is the Data Integrator component [43]. It is responsible for the automatic extraction, transformation and movement of data from / to diverse data sources. Using Data Integrator it is possible to build and manage data integration jobs within a single graphical environment (Figure 61).

Figure 61: Data Integrator


A drag-and-drop interface allows building jobs that profile, transform, validate, cleanse and move data. Debugging features are also present for the analysis of problematic data throughout the ETL process. In addition, the application can automatically generate documentation reports regarding the defined ETL tasks. Data Integrator offers web service support and allows batch or real-time data integration jobs to be published as a web service and invoked by external applications. Further, the application can also call external web service applications for accessing data. Data Integrator is a scalable application, supporting parallel processing,

distributed processing and real-time data movement. It follows an open services-based architecture that allows integration with third-party products using industry standard protocols such as CWM, XML, HTTP/HTTPS, JMS, Simple Network Management Protocol (SNMP) and web services. Intelligent threading is used for enabling parallelization in several ways. Within a single data transformation, it can dynamically launch multiple threads. In a single job, multiple transformations and job steps can run in a parallel structure. Multiple jobs can run simultaneously through multiple instances of the Data Integrator job engine. Data Integrator can enhance performance by distributing workload to source systems such as databases and mainframes. For example, Data Integrator can leverage a database server to execute aggregate functions at the source level without moving the data to the Data Integrator server, minimizing unnecessary data movement across the network. Based on its metadata-based connectivity, Data Integrator is able to access SAP, Siebel, PeopleSoft, JDE and Oracle. Native connectivity is available for most database types or through ODBC. Access to legacy mainframe applications is also possible using integrated IBM technology. Data Integrator also supports flat files, XML and web services. If the source application is proprietary, the Java Software Development Kit (SDK) can be used for defining a connection interface. Data Integrator can perform a wide range of data transformations, structured in libraries (that may be extended by the user), making the data transformation process reusable across several data integration projects. Some examples of available transformations are: XML pipelining, pivot and reverse pivot of rows and columns, data cleansing and matching, change data capture and data validation.
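The source-side workload distribution described above can be illustrated with a minimal JDBC sketch (a generic illustration of aggregation push-down, not Data Integrator's internal mechanism; connection details and table / column names are hypothetical): the GROUP BY is executed by the source database, so only the summarised rows travel over the network to the ETL server.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PushdownAggregation {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; any JDBC source would do.
        try (Connection source = DriverManager.getConnection(
                "jdbc:oracle:thin:@//srchost:1521/SALES", "etl", "secret")) {

            // The GROUP BY runs inside the source database engine,
            // so only one row per customer crosses the network.
            String pushedDown =
                "SELECT customer_id, SUM(amount) AS total_amount " +
                "FROM sales_detail GROUP BY customer_id";

            try (PreparedStatement ps = source.prepareStatement(pushedDown);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // The ETL engine only has to load the aggregated rows.
                    loadIntoTarget(rs.getLong("customer_id"),
                                   rs.getBigDecimal("total_amount"));
                }
            }
        }
    }

    private static void loadIntoTarget(long customerId, java.math.BigDecimal total) {
        // Placeholder for the actual load step.
        System.out.println(customerId + " -> " + total);
    }
}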


Prior to the definition of ETL / data cleansing tasks, it is fundamental to understand the data present at the data sources. Data Integrator provides multiple data profiling capabilities, e.g. distinct values detection (Figure 62), number of NULL values, data patterns (Figure 63) and relationships between two tables (Figure 62).

Figure 62: Distinct values (left) and tables relationship analysis (right)

Figure 63: Detection and visualization of data patterns
The available data validation features (Figure 64) help build a firewall between source and target systems to filter out unwanted data based on business rules. Data can be audited throughout the ETL process to verify the integrity of the data flow. Adding data quality mechanisms into the ETL process is performed via the Data Cleansing window (Figure 65). Through the Data Integrator Designer user interface, developers can drag and drop data cleansing transformations from the transformation library. Data cleansing operations within Business Objects follow six major requirements [44] that have been implemented as application features:
o Parse: Identify and isolate data elements in data structures;
o Standardize: Normalize data values and formats according to business rules;


Figure 64: Data Validation

Figure 65: Data Integrator (Data Cleansing window)
o Correct: Verify, scrub and append data, based on a set of sophisticated algorithms that work with secondary data sources;
o Enhance: Append additional data, thereby increasing the value of the information;
o Match: Identify duplicate records within multiple tables or databases;


o Consolidate: Combine unique data elements from matched records into a single source.
Data Integrator offers complete functionality for team-based development, allowing users to securely check work in and out of a central metadata repository and compare differences between objects (Figure 66).

Figure 66: Example of Multi-User Collaboration
Data Integrator enables the creation of reports regarding statistical metadata values (e.g. volume of data processed, error rates, CPU processing time). With Data Integrator users can have end-to-end impact analysis and data lineage. If a change occurs in source data, the administrator can quickly see which BI reports are affected (Figure 67).

Figure 67: Impact Analysis

Using the data lineage capabilities, the administrator can view the context of the data in BI reports. Users can see when it was updated, how it was computed, and where it came from, all the way back to the original transactional source. This visibility is critical to help users trust their information, as well as for debugging the ETL data flow. The web-based Business Objects Composer application [45, 46] has been designed to complement the Data Integrator application (Figure 68). The application contains some of Data Integrator's functionalities, accessible via any common web browser. Using Composer the user can collaborate with a team to gather business requirements, profile source data, build mappings, validate the accuracy of the design, generate data integration jobs and document the project.

Figure 68: Composer web-based application
Data warehousing initiatives involve source system identification and a process of mapping source data to the target system. Composer focuses specifically on these issues, providing ETL design capabilities and an understanding of the data mappings before implementing and executing ETL jobs. The user can systematically define ETL projects, sources, targets and mappings, as well as perform data profiling to examine the structure and quality of source data and incorporate data quality checks. A task list is proposed for defining a standard or custom workflow for design and project completion: (i) Review target data model; (ii) Identify source systems; (iii) Analyse and profile source systems;


(iv) Document source data defects and anomalies; (v) Define business rules required for the project; (vi) Define data quality rules; (vii) Develop mappings for the target tables; and (viii) Integrate business and quality rules with mappings. Having completed the ETL design, the user can validate the work by running reports that describe data lineage, mappings and data-store structures. Further, database and project metadata browsing is available. For example, the user can explore table relationships, including views for lineage and impact, star schemas, or the location of a table in the project.

1.8.3 Cognos DecisionStream

Cognos DecisionStream [47-50] is a key component within Cognos 8 Business Intelligence. DecisionStream is an enterprise-wide ETL solution designed for BI, with the following highlights:
o A multi-platform, server-based ETL engine that processes large volumes of data with low hardware investment;
o A graphical interface that makes transformation processes intuitive for the user;
o A flexible dimensional framework;
o Integration with Cognos analysis, reporting, dashboarding and scorecarding software for business intelligence;
o A single Lightweight Directory Access Protocol (LDAP) based component that delivers seamless secure access;
o A tested and proven scalable platform through load balancing and a distributed architecture supported by Unix and Windows;
o BI applications managed from a central console;
o Monitoring of multiple servers and fine-tuning of multi-server environments;
o Creation and management of BI metadata and business rules in a single metadata model, providing a consistent enterprise data view;
o An open Application Programmers Interface (API) based on a web services architecture.
The architecture (Figure 69) is built on three distinct tiers:
o A presentation tier handles all user interaction in the Web environment;
o An application tier with purpose-built services is used for all BI processing;
o A data tier provides access to a wide range of data sources.


Cognos DecisionStream works with multiple data sources:
o All relational databases, including dimensionally aware sources like SAP BW, Oracle, SQL Server, IBM, Teradata, Sybase and ODBC;
o Widely deployed Enterprise Resource Planning (ERP) systems, including SAP, PeopleSoft and Siebel;
o Enterprise data warehouses and marts, with both Third Normal Form (3NF) and star schemas;
o All widely used On-Line Analytical Processing (OLAP) sources, including SSAS, DB2 OLAP Server and Essbase;
o Modern data sources, such as XML, JDBC, LDAP and Web Service Definition Language (WSDL);
o Satellite sources, including Excel files, Access files, flat files and more;
o Mainframe sources, including Virtual Storage Access Method (VSAM), Information Management System (IMS) and Integrated Database Management System (IDMS);
o Content management data, including FileNet, Documentum and OpenSoft.

Figure 69: Cognos 8 BI


As output of the data integration, results may be delivered in a variety of formats, such as a dimensional framework or relational tables. The dimensional framework especially benefits organizations by:
o Ensuring that data is structured and checked for integrity according to specific business dimensions, such as customer, time and location;
o Enforcing a consistent view of corporate information across the enterprise data warehouse (all data transformation jobs use the same dimensional framework);
o Providing a central location to update dimension definitions;
o Aggregating transactional data along dimensional hierarchies;
o Managing many of the complex processes associated with warehouse dimension table creation and management.
Cognos data integration comprises two main components: a Windows-based dimensional design environment and a multi-platform, server-based engine. The graphical interface of the design environment deals with the definition of the transformation processes. The transformation engine is integrated with an aggregation engine, allowing data integration in a single pass rather than using multi-pass aggregation. The primary differentiator of Cognos compared to other ETL tools is that it has a multidimensional model at its core. This was designed specifically to build dimensional data marts. The designer and core engine work in terms of fact and dimension deliveries, not in terms of arbitrary table movement. Once the source data has been transformed, Cognos loads it into the destination target database. Further, it supports the delivery of dimensional information to any appropriate storage/access platform, allowing organizations to mix and match relational and OLAP databases and to choose the most suitable technology. Organizations can partition information between databases and access tools according to specific requirements. Flexible partitioning also lets the organization send data to multiple targets at the same time. Cognos enables incremental updates that are split into two distinct steps to improve updating speed. The first step is to insert new data in bulk, reducing the demand on processing resources. The next step updates changes to existing data, a process that involves going into the database, finding the row to modify, updating it, and then saving the change.
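A minimal JDBC sketch of this two-step incremental update follows (a generic illustration of the technique, not Cognos's implementation; table and column names are hypothetical): new rows are appended with batched inserts, and only the rows already known to have changed are then updated.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class IncrementalLoad {

    // Step 1: append brand-new rows in bulk, which is cheap for the database.
    static void bulkInsert(Connection dw, List<Object[]> newRows) throws Exception {
        String sql = "INSERT INTO fact_sales (sale_id, amount) VALUES (?, ?)";
        try (PreparedStatement ps = dw.prepareStatement(sql)) {
            for (Object[] row : newRows) {
                ps.setLong(1, (Long) row[0]);
                ps.setBigDecimal(2, (java.math.BigDecimal) row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    // Step 2: locate and modify only the rows that already exist but have changed.
    static void applyUpdates(Connection dw, List<Object[]> changedRows) throws Exception {
        String sql = "UPDATE fact_sales SET amount = ? WHERE sale_id = ?";
        try (PreparedStatement ps = dw.prepareStatement(sql)) {
            for (Object[] row : changedRows) {
                ps.setBigDecimal(1, (java.math.BigDecimal) row[1]);
                ps.setLong(2, (Long) row[0]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}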


Besides basic table-to-table data movement, Cognos data integration software also extracts data from transaction-style data sources including applications, traditional legacy files and purchased data (mailing lists), as well as new data sources resulting from e-business (e.g. e-commerce transactions). There are many events in the data integration phase. These include the delivery of the target tables, calls to existing business rule modules, report launches, email notices and the rebuilding of database indexes. These events are all designed in a drag-and-drop visual palette to produce a coordinated set of commands called a JobStream. A JobStream can multi-task events and allow commands to be executed in a parallel or serial manner. Conditional events control the processing path. DecisionStream's graphical design environment (Figure 70) provides:

Figure 70: Cognos Data Manager
o Visual reports for JobStream actions, build processes, source and target mappings, and star joins of fact and dimension tables in a star schema;
o A reference explorer for prototyping and validating business rules in the dimensional framework;
o An integrated SQL browser for interactively editing and testing queries;
o The ability to package components and easily move them from environment to environment;


o The ability to test functions and scripts as they are developed in the same environment;
o Automatic generation of the Data Definition Language (DDL) for creating tables and indexes.
Section A) defines dimension and fact builds based on the dimensional framework. Developers can control ETL activities, from the extraction to the data mart and metadata delivery. Section B) comprises the dimensions library and templates that ease the task of transforming source data into a common framework. Finally, section C) visualizes the entire ETL process. Application designers combine data sources into data streams, control how data is merged, transformed and aggregated, and define output targets for transformed data. Automated wizards lead developers through the steps of creating a dimension or fact build, and through the many transformation functions.

1.8.4 DataMirror Transformation Server

DataMirror Transformation Server [51-54] is an ETL engine for bi-directional data transformation between XML, database and text formats, based on the CDC data integration approach. Instead of using triggers or performing queries against the database, Transformation Server uses CDC technology to capture changed data from database logs. This ensures good performance even for large data volumes, and critical source applications are not adversely impacted. Further, Transformation Server can operate in batch refresh or net-change CDC mode. Transformation Server is mainly focused on data warehouse loading, data synchronization between systems and Web applications, distributing and consolidating data between different applications and managing other replication-based requirements. The server application supports all major databases including DB2, UDB, Microsoft SQL Server, Oracle, Sybase, Teradata, XML and PointBase.

Figure 71: DataMirror Tool Suite


The server is able to transform and move data between the following source / target types: database to XML, XML to database, database to text, text to database, text to XML, XML to text, database to database, XML to XML and text to text. A rich set of out-of-the-box transformations, supported by a graphical application, makes the data integration programming-free, i.e. all ETL information is stored as metadata. The application follows an open architecture that allows users to plug in Java objects for complex or specific processing requirements. In addition, open API class libraries are available, enabling third-party applications to easily interact with Transformation Server. The CDC approach followed in Transformation Server is supported by DataMirror's Tree-structured Query Technology, based on a tree model and object mapping techniques (Figure 72), enhancing performance for complex relational queries. This technology is based on the relational model of the source database and the tree model of the target XML document (ideal for structuring the retrieved data and preserving its relationships). The core concept of Tree-structured Query Technology is to transform the relationships of data, or hierarchical data, from the relational model to the tree model. The result is an XML document that contains the data and its relationships, as a database snapshot. Enterprise data is typically structured as repeating sets of hierarchical entities such as those stored in a relational database. Traditional transformation engines scale up but lack flexibility and cannot deal with hierarchical data. Emerging XML technologies like Transformation Server provide a novel approach to the development of transformation engines.
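The relational-to-tree idea can be sketched as follows (an illustration only, not DataMirror's engine), assuming a hypothetical CUSTOMER / ORDERS parent-child pair: the one-to-many relationship of the relational model becomes element nesting in the resulting XML document.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RelationalToTree {

    // Builds <customers><customer id=".."><order id=".."/>...</customer>...</customers>
    static Document export(Connection db) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("customers");
        doc.appendChild(root);

        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT c.id AS cust_id, c.name, o.id AS order_id " +
                 "FROM customer c LEFT JOIN orders o ON o.cust_id = c.id " +
                 "ORDER BY c.id")) {

            Element current = null;
            long currentId = Long.MIN_VALUE;
            while (rs.next()) {
                long custId = rs.getLong("cust_id");
                if (custId != currentId) {            // new parent row: open a new branch
                    current = doc.createElement("customer");
                    current.setAttribute("id", String.valueOf(custId));
                    current.setAttribute("name", rs.getString("name"));
                    root.appendChild(current);
                    currentId = custId;
                }
                long orderId = rs.getLong("order_id");
                if (!rs.wasNull()) {                  // child rows become nested elements
                    Element order = doc.createElement("order");
                    order.setAttribute("id", String.valueOf(orderId));
                    current.appendChild(order);
                }
            }
        }
        return doc;
    }
}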

Figure 72: DataMirror Tree-structured Query Technology
In XML-based transformation engines, both the Document Object Model (DOM) and Simple API for XML (SAX) models have limitations in handling data, scalability and flexibility.


The Streamed XML Transformation Technology takes the advantages of both the SAX and DOM models and applies them globally or locally, wherever required, to best achieve the desired results. The data input source can be a database, XML document or text stream. Transformation Server's functionality can easily be extended to other data sources through customized data readers. The same is true for the target or output formats. The mapping objects act as a controller and direct the proper flow of the stream from source to target (using the XML format for the transformation process and data exchange). The Transformation Server engine processes the data from the input stream and writes it to the output stream according to the mapping rules. During this transformation process, the data format and structure, as well as the data itself, can be manipulated and re-organized. Besides the server's XML-based transformation engine, a visual object-mapping tool is also available. Transformation Server has built-in functions complementary to XPath and XSLT, such as date formatting, string conversion, date / time functions, database-specific extensions for stored procedure calls, data lookup and generation of key values using the database's stored procedures and sequences, as well as Java objects.

Figure 73: Architecture for Transformation Server XML-based engine

1.8.5 DB Software Laboratory's Visual Importer Pro

DB Software Laboratory's Visual Importer Professional [55, 56] is a business intelligence application for data integration with simple ETL capabilities. The application includes non-ETL features such as an SQL script editor / checker, email notification (triggered on success or failure), simple file operations (e.g. copy, move, compare), data compression and FTP. Figure 74 presents an architectural overview of Visual Importer. Data can be imported from two types of


data sources: text files and ODBC data sources, and exported only to ODBC data sources, Oracle 7-9i and SQL Server 7 / 2000. Requests are executed sequentially, according to the order in which they are placed in an Execution Queue. All ETL metadata definitions (e.g. sources, ETL tasks) are stored in a relational Repository with a simple database schema that can be supported by MS Access (default), Oracle, MS SQL Server, Interbase, MySQL and PostgreSQL. Metadata import and export are managed via SQL text files that create the supporting database and populate it according to the defined metadata.

Figure 74: Visual Importer Architecture
In order to load data from a data source into a data target, the user must define a data mapping between a target table and a data source. If a file-based data source is chosen, the user must define the relevant data by specifying the file format, fixed or delimited, and the corresponding fixed-width values / delimiter characters that comprise the data (Figure 75). Further, it is possible to define repeating data blocks and to skip non-relevant text lines at the start of the file.



Figure 75: Data Source Options
Independently of using a file or relational data source, the user will always need to perform a data mapping between source and target schemas, through the Mapping Editor (Figure 76). This application allows the user to create, delete, modify and test data mappings to the target databases. During the mapping process, fields from different source databases and / or files can be used.

Figure 76: Mapping Editor Screen Overview


The cornerstone of the mapping process is the Mapping panel (Figure 77). In this panel, the user performs the actual mapping: selecting the source and target fields to map, handling errors in case of data unavailability or NULL values, defining default values and defining data conversion / calculation formulas.

Figure 77: Mapping Panel
Thus, Visual Importer is capable of performing data mapping and simple data conversion. Some examples of calculation formulas follow:
Field Multiplication: {INTEGER_F}*{FLOAT_F}
Concatenation: "{INTEGER_F}" + " kilos"
Conditional: if({FLOAT_F}>{INTEGER_F},1,2)
The Expression Editor application can be used in the definition of more complex statements that otherwise would have to be defined manually. This application is similar in layout and functionality to the MS Access Function Builder application. The main component of the Visual Importer application is the Package Editor. This enables the creation of action pipelines, either using the mappings previously created or other data management functionalities. Figure 78 depicts an example of a pipeline for the creation of a zip package and its delivery through an email.
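The semantics of the calculation formulas shown above can be expressed in a few lines of plain Java (the field names are placeholders taken from the examples; this only illustrates what the expressions compute, not Visual Importer's own evaluation engine):

public class MappingFormulas {
    public static void main(String[] args) {
        int integerF = 4;          // stands for {INTEGER_F}
        double floatF = 2.5;       // stands for {FLOAT_F}

        // Field multiplication: {INTEGER_F}*{FLOAT_F}
        double product = integerF * floatF;                     // 10.0

        // Concatenation: "{INTEGER_F}" + " kilos"
        String label = integerF + " kilos";                     // "4 kilos"

        // Conditional: if({FLOAT_F}>{INTEGER_F},1,2)
        int flag = (floatF > integerF) ? 1 : 2;                 // 2

        System.out.println(product + ", " + label + ", " + flag);
    }
}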



Figure 78: Example of an action pipeline
Some examples of items that can be used in a package pipeline definition are: on success / error conditions, add import, add export, add SQL script, add package, file manipulation operations, start application, emailing, FTP operations and data compression operations. An action pipeline can be composed of other action pipelines (Figure 79), enabling a simple way of grouping related tasks together and providing a scalable graphical way of representing pipeline actions.

Figure 79: Package Screen Overview


For this type of pipeline, double-clicking on an action expands its detail and presents it in another window (e.g. Figure 78). Associated with a package, the user can select a schedule for its execution on a regular basis (Figure 80). The package can be executed once, daily, weekly or monthly. The user may also specify the day of the week or month for the execution.

Figure 80: Scheduling options
Once the definition of a package is completed, it can be executed through the Execution Monitor screen (Figure 81), which allows checking its status (i.e. Executing, Submitted, Failed or Finished) or debugging if an error happens. Detailed information (not present in this window) can be analysed in a Log area.

Figure 81: Execution Monitor

1.8.6 DENODO

Denodo Virtual Data Port [57-59] is a mediator architecture for data integration applications. The framework allows the definition of a unified data model

through data federation, combining the data models of individual data sources via SQL. Queries are split into a set of sub-queries, executed in parallel, and the resulting data is integrated by applying the relations implied in the query. Denodo Virtual Data Port (depicted in Figure 82) takes as input semi-structured and structured data from heterogeneous (possibly distributed) data sources. The architecture comprises wrappers for different data sources, e.g. Hyper Text Mark-up Language (HTML), flat files, relational databases and web sites (accessed via SQL statements), which are semi-automatically generated by the core component, Denodo Virtual Port. This core is composed of a Query/Plan Generator, a Query Optimizer and an Execution Engine, which enable the execution of federated SQL queries. A cache module stores materialized views, avoiding querying the sources when queries can be solved using previously obtained cache results. Results are presented asynchronously as they are found, in order not to wait for all data sources before beginning to use the returned data.
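The query decomposition just described can be sketched with plain JDBC and an executor pool (a generic illustration of federated sub-query execution, not Denodo's engine; the two sources and their queries are hypothetical): the sub-queries run in parallel and their partial results are merged as they become available.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FederatedQuery {

    static List<String> runSubQuery(Connection source, String sql) throws Exception {
        List<String> rows = new ArrayList<>();
        try (Statement st = source.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                rows.add(rs.getString(1));          // keep it simple: one column per source
            }
        }
        return rows;
    }

    // Splits the global query into one sub-query per source and runs them in parallel.
    static List<String> execute(Connection crmDb, Connection billingDb) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<String>> crm = pool.submit(
                () -> runSubQuery(crmDb, "SELECT customer_name FROM customers"));
            Future<List<String>> billing = pool.submit(
                () -> runSubQuery(billingDb, "SELECT customer_name FROM invoices"));

            // Integrate the partial results (here a trivial union; a mediator would
            // apply the joins and relations implied by the global query).
            List<String> merged = new ArrayList<>(crm.get());
            merged.addAll(billing.get());
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}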

Figure 82: Denodo Virtual Data Port Architecture
Each source exports a combination of relations, called base relations, following a relational model. Base relations generally have limitations in the way in which they can be queried, e.g. in Web sources query possibilities are often limited to those provided by some type of HTML form. Each wrapper must provide access to the base relations of a source in such a way that, when faced by the mediator, it behaves like a table within a relational database.


The Virtual Data Port comes with built-in wrappers for the following data sources:
o Database tables or views accessible through JDBC or ODBC;
o Web Services;
o Excel (via ODBC);
o Structured or semi-structured flat files (e.g. logs, XML);
o Web sources (HTML pages) through wrappers generated using Denodo ITPilot.

Further, specific wrappers can be defined through an open interface. As an example, for extracting the required information from HTML pages (or other markup languages), the tool uses a specification language to generate specialized grammars. The proposed language (called DEXTL) makes use of various heuristics in the presentation of data. In addition, it provides graphic tools to construct the wrapper specifications visually, through an iterative process. Once the base relations have been defined and their wrappers constructed, each relation of the global schema is defined by a query involving the base relations, in a similar way to the definition of views in a conventional database. The query is expressed in a language very similar to SQL. The Denodo Virtual Data Port can be further extended with:
o The Denodo Virtual Data Port Administration tool, which allows administration of the unified data model, the desired combination of sources and the cache configuration;
o Periodic scheduling of queries, e.g. to preload information, send alerts or record retrieved data;
o Denodo ITPilot (Figure 83), which is responsible for automating navigation through web sites (e.g. link tracking, form filling, login/password, JavaScript) and content extraction. Once the information has been structured, it can be directly used by applications or combined with other sources using the Denodo Virtual DataPort.



Figure 83: Denodo ITPilot architecture

1.8.7 Embarcadero Technologies DT/Studio

Embarcadero Technologies DT/Studio [60, 61] is a scalable and extensible ETL solution composed of three applications: DT/Engine, DT/Designer and DT/Console. DT/Studio is a server-based solution that offers a centralized hub-and-spoke integration model, not requiring one-to-one interfaces between data sources (whose number would grow geometrically with each additional data source). DT/Studio offers a clear visualization of metadata and process flow. It is entirely built in Java and provides easy-to-use extensibility APIs that allow users to add custom business logic and functionality. The application suite can handle the following data sources:
o Relational Database Platforms: IBM DB2, IBM Informix, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, Sybase and Teradata databases through JDBC, ODBC and native bulk facilities;
o Other Data Sources: JMS, XML, flat files and non-relational Microsoft sources (Microsoft Access, Microsoft Excel).
Further, the ETL engine supports CDC for real-time systems that need to capture and send the changes from the source system to the targets.


DT/Studio's Data Flow Designer (Figure 84 and Figure 85) makes data movement visible, starting from the macro view of sources and targets through to data flow design. DT/Studio features the DT/Engine, a portable and extensible Java-based server component. As data grows in volume, ETL projects need to scale efficiently. DT/Studio's clustering and parallel processing capabilities support large load demands. Detailed automatic email notifications for task failure or completion enable quick and accurate troubleshooting. Tasks can be easily restarted from the last point of processing to reduce failure recovery time and ensure that processing time is not wasted. Open APIs and custom transformations allow developers to extend DT/Studio by customizing processes for specific needs, including custom calculations and business logic, or by interfacing with other applications. Using Java and JavaScript, developers can build on existing knowledge instead of having to learn a proprietary scripting language. Through an extensive set of out-of-the-box features, Embarcadero DT/Studio reduces complexity by eliminating the need to write extensive amounts of custom code. Around one thousand ready-to-use functions to transform data are available, including standard type conversions, financial functions, aggregations, units of measure and dates.
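As an illustration of this kind of Java extensibility (a hypothetical plug-in contract, not DT/Studio's actual API), a custom transformation could be handed to a Java-based engine as a small class implementing a well-known interface:

import java.util.Map;

// Hypothetical contract a Java-based ETL engine might ask plug-ins to implement.
interface RowTransformation {
    Map<String, Object> transform(Map<String, Object> row);
}

// Example of custom business logic: derive a price band from a unit price.
public class PriceBandTransformation implements RowTransformation {
    @Override
    public Map<String, Object> transform(Map<String, Object> row) {
        double price = ((Number) row.get("unit_price")).doubleValue();
        String band = price < 10 ? "LOW" : price < 100 ? "MEDIUM" : "HIGH";
        row.put("price_band", band);   // the engine would pass the enriched row downstream
        return row;
    }
}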

Figure 84: Data Flow Designer (example 1)

DT/Studio offers a model-driven approach to data integration, extending users' ability to do upfront analysis and streamline the workflow process. Users can quickly solve common data integration problems by using data transformation templates. DT/Studio provides command-line invocation of tasks, enabling unattended execution of predefined jobs. In addition, it offers advanced job scheduling capabilities through its tight integration with the Embarcadero Job Scheduler. Users can take advantage of unattended job execution to chain tasks and manage interdependencies amongst processes. DT/Studio's advanced reporting features provide metrics to measure process quality so that users can proactively manage their environments rather than reacting to failing processes. DT/Designer includes an embedded full macro-building interface that enables users to automate repetitive operations or encode best practices directly into the interface.

Figure 85: Data Flow Designer (example 2)



1.8.8 ETI Solution v5

ETI Solution v5 [62-64] is an ETL code generator that enables the creation of interfaces in Java, C++, C, COBOL, RPG and SAP ABAP. ETI provides Built-To-Order (BTO) connectors that integrate mainframe, legacy and proprietary applications. BTO connectors are created using the Integration Center (iCenter), part of ETI Solution v5, eliminating the need for hand coding and for proprietary runtime integration engines (costly and often a performance bottleneck), resulting in neither additional maintenance overhead nor new flaws being introduced into an operational system. ETI's Integration Design Studio enables the specification of business rules (for example nested if-then-else statements) from dynamically created menu-based dialogues. This allows ETI integration engineers to create specifications in such a way that a natural language description of the transformation is also captured as part of the metadata. Moreover, since BTO connectors are native code, they run up to 20 times faster than traditional ODBC/JDBC-based solutions. The primary target users of ETI Solution are data modellers who are familiar with the data and the mapping specifications (usually represented in spreadsheets or documents) that are used to define and/or document the relationships between the source data and the desired output. ETI Solution uses a declarative style of user interface, where users define what they want done, and ETI Solution figures out the steps required to produce that result. ETI iCenter provides transformation capabilities including multi-step processing, conditional logic, data lookup, string manipulation, conversion capabilities, profiling, metadata discovery and data cleansing capabilities, all built into the connector. iCenter integration engineers use a rapid integration development methodology made possible through ETI Solution v5. This version of the ETI platform also includes a user interface that helps customers automate the capture and communication of mapping specifications. Data owners at the customer site download a web service-based analyst client that provides them with a GUI into which the source and target data layouts have been loaded. They then drag and drop to indicate the desired mapping, attaching annotations where they require business or transformation logic to appear. When they are finished, the information becomes immediately accessible to an iCenter integration engineer through the ETI Solution platform. In addition, iCenter integration engineers have access to a broad range of tools provided with the ETI Solution platform, including


ETI Data Profiler for data discovery, ETI Data Cleanser for complex cleansing, reformatting and matching, and ETI Impact Analyst to determine the impact of a proposed change. As a result, operational and documented connectors are delivered in a few weeks, regardless of the systems or execution protocol (batch, event-driven, or publish and subscribe). The ready-to-deploy integration solution includes:
o ETI BTO Integration executable file(s);
o Installation and deployment guide based on the customer's infrastructure;
o Comprehensive mapping specification(s) including metadata reports;
o Detailed business rules and transformation specification reports;
o Test reports.

BTO integration is only possible due to the open architecture of ETI Solution v5. This architecture allows the same core engines to generate different types of code based on the contents of a set of rules and data structures called Data System Libraries (DSL). Each DSL encapsulates all the coding for a particular environment.
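The idea of driving the same generator core with environment-specific libraries can be sketched as a simple template substitution (a toy illustration of rule-driven code generation in general, not ETI's DSL format):

import java.util.Map;

public class ConnectorGenerator {

    // A tiny "DSL": each target environment contributes its own code template.
    static final Map<String, String> TEMPLATES = Map.of(
        "java",  "public class %NAME% { void run() { /* read %SOURCE%, write %TARGET% */ } }",
        "cobol", "       IDENTIFICATION DIVISION. PROGRAM-ID. %NAME%. * %SOURCE% -> %TARGET%"
    );

    // The same generator core produces different native code from the same mapping.
    static String generate(String environment, String name, String source, String target) {
        String template = TEMPLATES.get(environment);
        return template.replace("%NAME%", name)
                       .replace("%SOURCE%", source)
                       .replace("%TARGET%", target);
    }

    public static void main(String[] args) {
        System.out.println(generate("java", "LoadCustomers", "CRM.CUSTOMER", "DW.DIM_CUSTOMER"));
    }
}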

1.8.9 ETL Solutions Transformation Manager

Transformation Manager (TM) [12, 65] comprises a suite of metadata-driven code generator programs that enable the creation, testing, debugging and scheduling of data transformations. The deployment of the generated source code is possible in both Java 2 Enterprise Edition (J2EE) and Microsoft .NET architectures. Figure 86 depicts an overview of TM's architecture and its main components:
o Connectivity: TM maps between several types of data. Built-in adapters provide access to RDBMS (e.g. Oracle, DB2, SQL Server 2000, Sybase, Access, Ingres), XML, Excel, CSV, various flat file formats, messaging services (e.g. CORBA, JMS, HTTPS and MQSeries) and Java classes. A generic adapter exists for accessing other data formats by implementing Java classes that conform to a published open interface;
o Design Tool: Provides a drag and drop creation environment for data mapping and transformation using a simple transformation syntax that is automatically built according to user interaction;
o Test Tool: Simulates a run-time environment for interactive testing and development where transformation results can be analysed;


o Debugger: Provides an interactive environment for analysis or transformation execution. The debugger permits setting breakpoints and stepping through transformation statements;
o Deployment Tool: Provides a stand-alone environment for scheduling and running transformations. A wizard enables transformations to be selected, data connections to be made and results to be viewed;
o Metadata Text Repository: TM models and transformations are stored in a metadata text repository, which is fully compatible with version control systems, enabling a multi-user environment for sharing and developing model and transformation metadata;
o Code Generator: TM is a code generator producing either Java code or XML transformation descriptions, which can be run in a Java or .NET run-time environment. Generated Java code can be executed on any Java Virtual Machine (JVM), and therefore integrated into a client architecture/application, e.g. Customer Relationship Management (CRM), ERP, workflow, trading systems or messaging systems, whereas the .NET source code provides identical transformation runtime features for the Microsoft architecture.

Figure 86: TM Suite of Tools
The main TM component (or at least the most visible) within the suite of tools is the TM Design Tool (Figure 87).



Figure 87: Design Tool IDE
Key features of the Design Tool application are:
o Graphical User Interface: Allows the user to describe transformations between different types of data using a drag and drop paradigm. The transformation descriptions may be edited by the user to solve complex transformation requirements;
o Wizards: Wizards assist users in virtually all key aspects of transformation design and creation (e.g. database lookups);
o Source and Target Models: All aspects of the source and target models are displayed using a specific iconography (e.g. primary key, relationships, mandatory elements, parts of the model that have been mapped);
o Sub-Transformations: A transformation may be divided into sub-transformations. A sub-transformation is either executed each time the transformation is executed (independent transform) or called by another transformation as it runs (dependent transform). Each transformation describes a mapping between a source and target model;
o Transformation Syntax: The Design Tool allows the user to describe transformations between different types of data using a drag and drop paradigm and displays a simple, humanly readable, high-level transformation syntax, the Simple Mapping Language (SML);
o Coding Patterns: Transformation statements are semi-declarative and include coding patterns to control flow and perform iterations, such as BEGIN END, CASE, FOR EACH, IF THEN ELSE and REPEAT;
o Built-in Functions: Over 270 built-in functions are provided, grouped in categories (e.g. Aggregate, Data Quality, Error Handling, Maths, Matrix, XML);
o User-Defined Functions: User-defined Java functions can be developed internally in the TM Design Tool or externally in an Integrated Development Environment (IDE) application (see the sketch after this list);
o Traceability: Two forms of traceability are provided, one that relates to metadata and the other that relates to instance data.
Most aspects of the models are captured and represented in a normalized way regardless of their source (e.g. entities, attributes, relationships, cardinality, attribute types, constraints), ensuring that model constraints are enforced in the generated transformations. Models are viewed as a tree structure consisting of entities, attributes and relationships. Instance data may be used to subset the models, displaying only the part of the model relating to the instance data. Models can be modified, extended and exported in alternative forms, e.g. a model captured as a DTD or from a database can be generated as an XML Schema (XSD). A model comparator is also available to analyse two versions of a model and produce a configurable report summarising the differences. Around 330 built-in functions covering maths functions, string manipulation, date conversions, transaction handling, database tuning, data cleansing and reconciliation are included to provide an extremely powerful transformation development environment. Local variables, accessible in the transform, and global variables, accessible by all sub-transformations of the transformation project, may be used. Control flow and iteration coding patterns, such as BEGIN END, IF THEN ELSE, FOR EACH, CASE and CATCH ERROR, can be included. An integral part of TM is its ability to carry out various quality checks on source and desired target data. Inconsistent, incorrect or redundant values in a dataset can be subject to data quality / data cleansing techniques. Features that may be used to determine data quality include:
o XSD facet constraints that may be applied to models regardless of model format;
o In-built data quality functions (e.g. check for a date within a range);
o Error handling and logging allowing full control to respond to detected errors.
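A user-defined function and a data quality check of the kind listed above could be as simple as the following (a hypothetical example; the exact signatures TM expects are not reproduced here):

import java.time.LocalDate;

// Hypothetical user-defined functions that a transformation could call,
// e.g. a data quality check for a date range and a simple cleansing helper.
public class CustomFunctions {

    // Returns true when the date lies inside the accepted business range.
    public static boolean dateInRange(LocalDate value, LocalDate from, LocalDate to) {
        return value != null && !value.isBefore(from) && !value.isAfter(to);
    }

    // Normalises free-text country codes before loading them into the target model.
    public static String normaliseCountry(String raw) {
        if (raw == null) return "UNKNOWN";
        String c = raw.trim().toUpperCase();
        return c.startsWith("PORT") ? "PT" : c;
    }
}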



1.8.10 Group1 Data Flow

Data Flow [66-68] is a process-oriented transformation environment where users can build data flow plans to perform data manipulation tasks, not just visualising the transformation processes but also inspecting the data itself. Data Flow is aimed at data integration and data federation, providing both conventional ETL and EII capabilities, either together or singly, using a common platform. The framework is based on three main features. First, the application is focused on processes rather than mappings. No specific definitions on how to map a data field in a data source to a data target are performed. The mapping approach typically fails to capture exceptions that are common in business environments, due to the lack of interactivity between users and developers. However, user / developer interaction is only reasonable when the end-users are available and can understand such terminology. Processes operate at a higher level than mappings and enable fast application development in which the user can participate. Second, the application does not limit itself to ETL per se. Indeed, the company describes its approach as the visualization of processes and data: Data Flow can present the results of its Extraction and Transformation processes (without Loading) to the user. The user can see the processed data directly and use it (e.g. graphs) within the environment (or using third-party business intelligence tools or spreadsheets). Finally, the approach is engine-based and can be configured to suit different environments. In particular, it can be tuned to provide mass movement of data where the number of users involved is minimal. Alternatively, a second configuration is available in which the engine is optimised for large numbers of users but relatively small amounts of data. These two configurations are called Data Load Server and Data Access Server, respectively. This means that the Data Load Server can provide conventional ETL data movement capabilities, while the Data Access Server provides EII capabilities or data federation. Figure 88 presents the architecture of the Data Flow framework. The Data Flow Server is the core component of the data integration. It is a massively parallel, multi-threaded server that applies data transformations to data sources. Data Load Server is a Data Flow Server containing the transformation capabilities for data movement and loading. It is used to extract data from multiple operational systems, integrate data, and load it into a database or flat file. Its core competency is the expressiveness of its transformations and the efficiency of their execution.



Figure 88: Data Flow Architecture
Data Access Server is a Data Flow Server containing the transformation capabilities for data visualization and distribution. It is used to combine heterogeneous data sources, perform analytics on the data, and deliver the data sets to many users simultaneously in the form of reports, charts, cross tabs, flat files or display functionality on third-party tools. Its core competency is the ability to prepare exactly the data set the analyst needs and to do it in a documented and repeatable way. WebLink is the web-based front end for the Data Access Server. The reporting tool of the Data Flow environment enables users to create custom and standard reports for browser viewing, while the business analysis tool of the Data Flow environment lets users analyze data in cross tabs and charts over the web. The Automation component is used to set up schedules, event triggers and alerts. The stages in an automation process are defined using a fairly standard icon-driven drag-and-drop approach to create a data flow diagram. This component can be replaced by a third-party scheduling application, provided that it has a command line interface. Admin is a unified administration tool for managing the users and tuning the Data Flow environment (Figure 89). The Design Studio is the primary design and development tool for Data Flow. This client application uses a graphical data flow methodology to help EII and ETL application users design processes, easing collaboration between business users and IT.



Figure 89: Data Flow Server
An example of Design Studio process-level development (referred to as a plan) is illustrated in Figure 90.

Figure 90: Data Flow (process level planning)


In a plan, yellow boxes (objects) represent individual operations and blue boxes represent a set of operations that have been previously defined and stored in a repository (based on relational technology that can run on top of a SQL Server, Oracle, DB2, Informix or Sybase database) so that they can be reused. The user can click on a blue box to see the plan that it represents. Each of the objects can have business functions, and comments can be added to further specify the transformation for documentation and collaboration purposes. In practice, the developer would step through this plan with the user, analysing the generated data at each stage of the process. Data will typically be displayed in a grid (table) format below the plan, where the grid layout is defined by dragging and dropping the relevant columns into the display pane. Data can then be manipulated in a variety of ways. Using the Analytical Calculator (Figure 91), the user can perform a variety of statistical and other functions, using set-based logic.

Figure 91: Manipulating data in the view through set-based logic
In addition, there is an Expression Builder, allowing additional transformations that manipulate data inside a plan, using Boolean logic and other options. These transformations include functions such as joining information from multiple data sources, transposing data sets, time-data manipulation, string manipulation, lookups and data warehouse key generation. This application environment allows users to use a rich set of pre-built transformation objects by dragging and dropping them on screen to build the transformation flow logic.

The transformations can also be easily customized to create specific business transformation rules that can be shared and reused by developers. Data Flow fits into the general class of black-box applications, as opposed to code-generating applications. If required, the user can inspect and edit the SQL code that is generated for access to source data. Data Flow's massively parallel, multi-threaded design allows it to process huge amounts of data using multi-processor UNIX and Windows servers. To maximize throughput, Data Flow processes data as a continuous stream of in-memory blocks. No intermediate staging or unnecessary disk I/O is required. Data Flow's transformations simply operate on the block in memory and then pass it on to the next transformation in the pipeline.
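The block-pipelining idea can be sketched in a few lines of Java (a generic illustration, not Group 1's engine): each transformation operates on an in-memory block of rows and hands it directly to the next stage, with no staging to disk in between.

import java.util.List;
import java.util.function.UnaryOperator;

public class BlockPipeline {

    // A block is simply a batch of rows kept in memory (Java 16+ for records).
    record Row(String customer, double amount) { }

    // Each stage transforms a block and passes it on; stages are applied in order.
    static List<Row> run(List<Row> block, List<UnaryOperator<List<Row>>> stages) {
        for (UnaryOperator<List<Row>> stage : stages) {
            block = stage.apply(block);      // no disk I/O between stages
        }
        return block;
    }

    public static void main(String[] args) {
        List<Row> block = List.of(new Row(" acme ", 10.0), new Row("Beta", -5.0));

        List<UnaryOperator<List<Row>>> stages = List.of(
            b -> b.stream().map(r -> new Row(r.customer().trim(), r.amount())).toList(),
            b -> b.stream().filter(r -> r.amount() > 0).toList());

        System.out.println(run(block, stages));   // [Row[customer=acme, amount=10.0]]
    }
}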

1.8.11 Hummingbird Genio

Hummingbird Genio [69, 70] is a data integration solution for ETL and EAI, part of the Hummingbird Integration Suite. It transforms, cleanses, and directs information across support systems for projects that might include data warehouses, data consolidation and data archiving. The architecture (Figure 92) is built on an extensible, component-based, hub-and-spoke design. This features a centralized engine and metadata repository (the hub) and data sources and targets (the spokes) between which the hub exchanges data.

Figure 92: Hummingbird Genio Architecture

Genio follows an open and extensible client/server design, prepared for future developments through an internal programming API (the ETL Development Kit). This constitutes a standard for all product functionality, simplifying the development of additional features and unifying the look and feel of the different applications in the suite. With this approach, users are not limited to the functions provided by the tool; rather, they are free to develop their own reusable transformation code to any degree of complexity. The architecture supports distribution and synchronization of data transformation and exchange processes over multiple engines, which is crucial as data volumes increase in size and transformation processes increase in complexity. Genio components include the Genio Engine, Genio Repository, Genio Designer, Genio Scheduler, Genio MetaLinks, Genio DataLinks and Hummingbird Met@Data. The Repository stores and manages all aspects of data transformation and exchange process metadata. All technical metadata (e.g. data structures), business metadata (e.g. transformation rules) and production metadata (e.g. logs) are stored in the repository. The repository is database-neutral and completely open; it can reside on Oracle, Microsoft SQL Server, Sybase Adaptive Server, IBM UDB or Informix. Each component in a data transformation and exchange process is created as an object and stored in the repository. Since Genio automatically identifies any change made to metadata, it provides impact analysis, requiring every object impacted by the change to be addressed before the next data transformation and exchange process is executed. The Designer (Figure 93) is a multi-user graphical environment for designing data transformation and exchange processes. Data structures can be imported directly from source and target systems or through metadata bridges (MetaLinks). User-defined business rules, functions and procedures created in Genio Designer are stored as objects within the Genio Repository and are completely reusable from project to project. Genio incorporates a graphical interface that provides a procedural scripting environment for designing data transformation processes, which can be scheduled or triggered by external events such as file modifications, file transfers and e-mails.



Figure 93: Genio Designer

The Scheduler component triggers and controls processes that can be scheduled to execute on an event or calendar basis. The Scheduler provides real-time control and monitoring of process executions as well as full history and audit-trail reporting. If desired, Genio Scheduler can work alongside external schedulers such as IBM Tivoli or CA-Unicenter. MetaLinks are pluggable metadata bridges embedded in the Designer that enable the import of data structures from Computer-Aided Software Engineering (CASE) tools or ERP systems (e.g. SAP R/3, SAP BW, SAP IDoc, Sybase PowerDesigner and CA Erwin). DataLinks provide native connectivity to most relational and multi-dimensional database systems, as well as flat files. Data access is achieved through native APIs or ODBC. For each type of connectivity, Genio has a specific grammar that is used to generate native SQL statements. Genio offers native population of multidimensional databases, such as Hyperion Essbase and Oracle Express. With this functionality, users can directly create hierarchies or members, set all necessary attributes, and load or refresh cubes. Through native access, users do not require an additional staging area or complex multi-layer tools from multiple vendors.


Hummingbird Met@Data is a metadata management solution that enables IT professionals to collect, organize and document structured and unstructured data to optimize its deployment and use across an organization. Genio supports all middleware providing ODBC, XML, file-based or Message-Oriented Middleware (MOM) interfaces, such as IBM WebSphere MQ. By leveraging third-party solutions, Genio is able to provide access to most legacy and custom applications. Further, it can use the FTP protocol, the Messaging Application Programming Interface (MAPI), Remote Shell (RSH) or any external application to get access to data or push data to the target environment. Genio offers multiple strategies to perform incremental extraction. Simple approaches, such as selection limits based on time stamps, the use of database log tables or the use of database triggers to capture changes, can be implemented. These techniques are environment-neutral and can be implemented without any additional software investment (a minimal sketch of the timestamp-based approach is given at the end of this subsection). Genio also provides specific DataLinks that can automatically capture changes from databases or VSAM files. Genio has a set of 100 transformation functions that can be used to build complex target expressions from the source data or variables. These functions cover the entire spectrum of string, date, number and Boolean manipulation. Complex clauses such as IF, THEN, ELSE or CASE can also be used. These functions are processed inside the Genio Engine, but they are automatically translated, when possible, into native SQL functions if the execution takes place on a remote database. To extend the processing capabilities it is also possible to use legacy functions in a Dynamic Link Library (DLL) written in C++ or another language. Further, Genio can call an executable, external batch or shell script for specialized transformation needs. Genio has multiple loading strategies: single row, packet and bulk. In certain cases, the loading can be performed by the source database directly, when the developer has decided to bypass the engine altogether. Genio offers a methodology that distributes transformation workload by offloading certain tasks to idle database engines during off-peak hours to maximize efficiency and system performance. Three modes exist:

1. Use of the Genio Engine Exclusively (Figure 94): Extracts data from the source database, transforming it within the engine, and then populating target databases directly. This architectural schema is recommended when sources and targets are heterogeneous, or when the required transformations cannot be performed by remote source or target databases.



Figure 94: Data Access Method One

2. Use of the Genio Engine with Processing on the Remote Database (Figure 95): Involves extracting data and transforming it on the source database platform. The resulting data is then transmitted through the Genio platform and loaded into the target database, where any further transformation is carried out. If possible, aggregations are processed on the source, thereby reducing the volume of data transmitted between source and target and, as a result, the bandwidth required to move data through the network.

Figure 95: Data Access Method Two

3. Exclusive Use of the Transformation Capabilities of Remote Databases (Figure 96): The source and target are on the same server, so it is not necessary for the data to leave the server or to be transported through a communications layer. Since the Genio engine only sends SQL statements adapted to the relational database, the RDBMS manages the extractions, transformations and insertions (or updates).

Figure 96: Data Access Method Three

Each data access mode can be developed through a common Genio interface and defined using the same graphical metaphor and programming methodology. By maximizing user control over data flow, the Genio data access architecture enables users to significantly improve the performance of their data exchange processes. Regardless of which data access model is chosen, Genio's impact analysis capabilities are maintained, ensuring that, if changes are made to any element of the data exchange process, administrators are notified prior to the next scheduled execution.
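Of the incremental extraction strategies mentioned above, the timestamp-based one is the simplest to illustrate. The sketch below uses plain SQL issued through Python's sqlite3 module; it is not Genio syntax, and the orders table, its columns and the high-water mark value are hypothetical.

# Sketch of timestamp-based incremental extraction: only rows changed since
# the last successful run are pulled from the source. Generic SQL through
# sqlite3 for illustration -- not Genio syntax; names are made up.
import sqlite3

def extract_increment(conn, last_run_ts):
    cur = conn.execute(
        "SELECT id, customer, amount, updated_at "
        "FROM orders WHERE updated_at > ?",
        (last_run_ts,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, "acme", 10.0, "2007-03-01T10:00:00"),
            (2, "globex", 25.5, "2007-03-05T09:30:00"),
        ],
    )
    # Only the second row is newer than the stored high-water mark.
    print(extract_increment(conn, "2007-03-02T00:00:00"))
    # After a successful load, the high-water mark would be advanced to the
    # largest updated_at value seen, ready for the next incremental run.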

1.8.12 IBM WebSphere DataStage

IBM WebSphere DataStage [71-74] (formerly known as Ascential DataStage) is currently the ETL market leader in enterprise data integration. The solution is based on a client-server architecture, where the server engine accepts a set of execution requests while the client side provides a suite of tools for graphical design, deployment and monitoring of the ETL process. Two types of integration servers are supported across a wide range of platforms: an Event Server (Figure 97), for event-driven integration, and a Command Server, for use when batch processing of data is preferred. The Event Server includes a complete run-time environment that supports multi-threaded processing and manages execution of maps based on user-defined events, which it monitors continuously. One or more events such as scheduled time events, file events, database events or message events can automatically trigger execution of maps, according to previously defined specifications. The Event Server manages system resources to optimize performance and provides real-time performance and operations monitoring facilities. A Command Server executes, on demand, an integration script that has been designed and tested within the Design Studio, invoked from a command line, batch file, shell script, Job Control Language (JCL) job or scheduling program.

Figure 97: Real-time ETL with WebSphere DataStage

The platform makes it possible to solve large-scale business problems through high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, IBM WebSphere DataStage can scale to satisfy the demands of large data volumes, stringent real-time requirements and reduced batch windows. IBM's parallel technology operates by a divide-and-conquer technique: the largest integration jobs are split into subsets that are processed concurrently across all available processors ("partition parallelism"), while records flow between the stages of a job without intermediate staging ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers almost linear scalability. DataStage supports a high number of heterogeneous data sources and targets in a single job, including text files, XML data structures, enterprise application systems (e.g. SAP, Siebel, Oracle and PeopleSoft), almost any database (e.g. Oracle, IBM DB2 Universal Database, IBM Informix, Sybase, Teradata and Microsoft SQL Server), web services, SAS, messaging and EAI products (e.g. WebSphere MQ and SeeBeyond). DataStage is able to efficiently deal with a wide variety of proprietary and commercial business applications that produce data with complex, semi-structured and interdependent data types. In addition, the Development Kit offers another deployment option, consisting of a set of application programming interfaces for a number of programming languages and environments such as C, Java, Component Object Model (COM), Common Object Request Broker Architecture (CORBA), C#, Remote Method Invocation (RMI), BEA WebLogic and IBM WebSphere Business Integration Message Broker. IBM WebSphere DataStage provides a Service-Oriented Architecture (SOA) for publishing data integration logic as shared services that can be reused. In fact, the DataStage SOA Edition allows jobs to be designed without any knowledge of how they will be accessed. This prevents the job developer from having to know anything about Enterprise Java Beans (EJBs), JMS queues or Simple Object Access Protocol (SOAP) message headers. DataStage features an extensive data integration development environment, with a library of more than 400 pre-built functions and routines that are internally mapped into SQL statements or stored procedures. DataStage was designed with a clear focus on enterprise data integration challenges, providing:

o GUI-based and wizard-driven support for top-down design, test, deployment and management of transactional and operational data integration solutions;

o Elimination of hand coding for integrating complex application and transaction content across diverse business environments;

o Interoperability with applications, databases, middleware, messaging and business process management;

o Native and seamless interoperability with the WebSphere Data Integration Suite, enabling complete focus on the integration of corporate data.

The Design Studio consists of multiple GUI-based applications (Figure 98) that run on Windows platforms. These applications include designers for data integration flows, data object definitions, data transformation maps and content-based routing rules.

Figure 98: WebSphere Design Studio

The Integration Flow Designer is a graphical manager for data flows, used during development to define a set of logically related data integration maps. An analyzer checks the flow for logical consistency in the sources and destinations that have been defined. Once a process flow is designed, the Integration Flow Designer is able to port and transfer all the integration objects in a system to the appropriate server location. In addition, the Integration Flow Designer can deploy process control information, automatically configuring an Event Server. The Map Designer (Figure 99) is a graphical facility for the creation of maps, which consist of rules for data-content transformation, data validation and content-based routing. The process uses convenient "from" and "to" windows, drag-and-drop techniques, and spreadsheet-like rules. These are augmented by a set of pre-defined functions for operations such as conditional testing, table lookups, mathematical functions, character string parsing and data extraction.

Figure 99: MapStage layout

The Database Interface Designer is used to import metadata about database objects such as queries, tables and stored procedures to meet mapping and execution requirements. Databases can be both sources of input data and destinations for output data. The Database Interface Designer provides the means to generate data types automatically for queries and tables based on database catalogue information. The Management Console provides error capture, logging and reporting services. It also includes tools for monitoring and tuning performance, and it can work with third-party enterprise systems management solutions. The WebSphere application suite can be further complemented, mainly by the following IBM applications:

o Understand: IBM WebSphere ProfileStage;

o Cleanse: IBM WebSphere QualityStage;

o Federate: IBM WebSphere Information Integrator Standard Edition, IBM WebSphere Information Integrator Classic Federation for z/OS and IBM WebSphere Information Integrator Content Edition.

1.8.13 Informatica PowerCenter

Informatica PowerCenter 8 [75, 76] is a single, unified enterprise data integration platform for accessing, discovering, integrating and delivering data. The platform consists of a high-performance, highly available data server, a global metadata infrastructure, and GUI-based development and administration tools. Figure 100 presents an architectural overview of Informatica's ETL solution.

Figure 100: Informatica Architecture

Tools: Informatica provides a suite of design, view and management tools for design / develop / test / refine (Figure 101), administration and configuration, performance monitoring, metadata management, and reporting and analysis tasks. The toolset serves as a visual interface for business analysts, developers, testers and operational users. The platform provides mapping templates for various data integration scenarios, which are customized with parameter-based mappings, eliminating a significant amount of redundant work. A single Web-based administration console serves all administrative, operational and tuning-related tasks.

Data Services: The SOA-based platform delivers data services to consuming client applications through a variety of standards-based protocols. Web Services expose batch and real-time data services. The platform also provides support for JMS and other messaging systems, such as IBM MQ Series and TIBCO Rendezvous, through its ability to publish to and subscribe to message queuing systems. Finally, the platform also supports clients accessing data services over programmatic APIs such as JDBC and ODBC.



Figure 101: Mapping Designer

Data Integration Services: The set of core data integration services enables IT applications to access, understand, transform and manage data. These services profile, cleanse, match, transform, move, federate and monitor the quality of data, ensuring that it is consistent, accurate and delivered in a timely fashion. The platform can profile and assess data quality on an ongoing basis. Business analysts can define quality metrics and business rules to measure and monitor the quality of enterprise data against these metrics. The platform can cleanse, standardize and enhance data through parsing and standardization (e.g. of geographic names and addresses). The solution features a high-performance data transformation engine that is driven by metadata-based instructions outlined in the design environment. Organizations can have federated access to multiple, disparate data sources, allowing virtual data views to be created without having to physically move data.

Universal Data Access: The Informatica platform can access a wide array of data sources, including:

o Relational databases: e.g. Oracle, DB2, SQL Server, Sybase, Teradata, Informix;

o Mainframe data: e.g. IMS, VSAM, DataCom;

o Enterprise applications: e.g. SAP, Siebel, PeopleSoft;

o Messaging systems: e.g. IBM MQ Series, TIBCO Rendezvous, webMethods;


o Standards: e.g. XML, Web Services, JMS, ODBC, JDBC, LDAP, Internet Message Access Protocol (IMAP), POP;

o Unstructured data: e.g. Word, Portable Document Format (PDF), Excel;

o Semi-structured data: e.g. HL7, ACORD, FIXML, SWIFT, EDI.

The platform was designed to handle increasingly large data volumes over time, being able to scale to very large data volumes or to rely on CDC-based techniques.

Metadata Services: The Informatica platform is built on a foundation of metadata services. These include metadata management, lineage, change management, object versioning and reporting/analysis. Metadata services enable end-to-end visibility and analysis of the data integration lifecycle and streamline the overall data integration process. The platform features a metadata repository that works as a hub in the architecture. Metadata services are provided on top of the repository, allowing for the import and management of metadata assets throughout the entire data integration lifecycle (e.g. business rules, physical models, relationships and all end-user-created metadata). Data lineage services trace a complete data integration solution from the data model through to business end use. Data lineage helps users answer questions about where data came from and determine which rules were applied to it along the way. Impact analysis assesses the impact of making a change across the entire data integration environment.

Infrastructure Services: The Informatica platform features infrastructure services such as high availability, grid-based scalability, multi-level security and parallel optimization. The platform delivers seamless failover (the ability to run without interruption on a secondary node if a primary node is unavailable) and automated recovery capabilities to ensure that data integration services are always available. Based on its load-balancing algorithm, PowerCenter failover may be fine-tuned to be instant or deferred for a specified time, say after 10 minutes of unsuccessful attempts to re-establish connectivity. Administrators may specify the node for restart, authorize PowerCenter to automatically detect the optimal node based on existing load and resource availability, and configure any number of secondary, tertiary and backup nodes. The Informatica platform can be deployed in a scalable heterogeneous grid environment, with data integration services configured to scale across several CPUs and physical servers. With support for dynamic load balancing, the platform ensures optimized CPU utilization independently of runtime changes in data volumes and node utilization. In addition, all of these capabilities are configured and managed through a Web-based administration console (Figure 102), providing administrators the flexibility and controls that simplify configuration and management in complex grid environments.

Figure 102: Administration Console GRID management

With an underlying thread-based architecture, the platform abstracts mapping specifications from parallel execution plans and uses a heuristic algorithm to dynamically adapt to both changes in node population and repartitioning of an RDBMS. A related feature offers administrators an easy way to scale out data processing across a grid by breaking up a single data integration service task into sub-tasks, or partition groups, to be executed on multiple nodes. This feature is enabled by a simple session property flag in the PowerCenter administration console. PowerCenter's execution of sub-tasks is based on a master/worker process model. The master process breaks up a data integration service session into sub-tasks and submits them to the load balancer. The load balancer dispatches the sub-tasks to multi-threaded "worker" processes on various nodes, which run concurrently to execute the sub-tasks (and which may exchange data amongst themselves). Further, the architecture enables data processing to be pushed down [11] into the relational database to optimize the use of existing database assets. Pushdown optimization provides the flexibility to push data transformation processing to the most appropriate processing resource, whether within a source or target database or through the PowerCenter server. Pushdown can easily be configured by setting a flag in the PowerCenter 8 Workflow Manager. How much of the data integration work can be pushed to the database depends on the pushdown optimization configuration, the data transformation logic and the mapping configuration. When pushdown optimization is used, PowerCenter writes one or more SQL statements to the source or target database based on the data transformation logic. PowerCenter analyzes the data transformation logic and mapping configuration to determine the data transformation logic it can push to the database. At run time, PowerCenter executes any SQL statement generated against the source or target tables, and it processes within PowerCenter any data transformation logic that it cannot push to the database. The Pushdown Optimization Viewer (Figure 103) allows previewing the data flow, the amount of transformation logic that can be pushed to the source or target database, and the SQL statements that will be generated at run time, as well as any messages related to pushdown optimization. For example, the mapping depicted in Figure 103 can be pushed into the source database.

Figure 103: Pushdown Optimization Viewer

Pushdown processing is based on a two-pass scan of the mapping metadata. In the first pass, PowerCenter scans the mapping objects starting with the source definition object and moving towards the target definition object. When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformations upstream of this object are grouped together with equivalent SQL for execution inside the source system. In the second pass, PowerCenter scans in the opposite direction (i.e. from the target definitions towards the source definitions). When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scanning process stops, and all transformation objects downstream of this object are grouped together with equivalent SQL for execution inside the target system. Partial pushdown processing occurs when either the source and target systems are in different database instances, or only some of the data transformation logic can be represented in SQL. In such cases, some processing may be pushed into the source database, some processing occurs inside PowerCenter, and some processing may be pushed to the target database (e.g. Figure 104).

Figure 104: Mapping for Partial Pushdown Processing

Full pushdown processing occurs when the source and target relational database management systems are the same instance and the data transformation logic can be completely represented in SQL. In this case, PowerCenter pushes the entire mapping processing down into the database system.
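The two-pass scan described above can be sketched as a small routine that walks an ordered list of mapping objects from the source side and then from the target side, leaving whatever cannot be expressed in SQL to the engine. This is a simplified reconstruction for illustration only, with invented transformation names; it is not Informatica's actual algorithm.

# Simplified sketch of the two-pass pushdown scan described above: group
# SQL-representable transformations for source-side execution, then for
# target-side execution; whatever remains runs inside the engine.
# (A reconstruction for illustration -- not Informatica's actual algorithm.)

def pushdown_plan(mapping, sql_capable):
    """mapping: ordered list of transformation names, source side first.
    sql_capable: set of transformations expressible as SQL."""
    # First pass: from the source definition forward.
    source_side = []
    i = 0
    while i < len(mapping) and mapping[i] in sql_capable:
        source_side.append(mapping[i])
        i += 1
    # Second pass: from the target definition backward, never overlapping
    # with what was already claimed for the source side.
    target_side = []
    j = len(mapping) - 1
    while j >= i and mapping[j] in sql_capable:
        target_side.insert(0, mapping[j])
        j -= 1
    # Everything in between cannot be expressed in SQL and stays in the engine.
    engine_side = mapping[i:j + 1]
    return source_side, engine_side, target_side

if __name__ == "__main__":
    mapping = ["filter_rows", "join_customers", "custom_java_scoring",
               "aggregate_totals", "update_strategy"]
    sql_capable = {"filter_rows", "join_customers", "aggregate_totals",
                   "update_strategy"}
    src, eng, tgt = pushdown_plan(mapping, sql_capable)
    print("pushed to source DB:", src)   # partial pushdown example
    print("processed in engine:", eng)
    print("pushed to target DB:", tgt)

When every object is SQL-representable and source and target share the same instance, the first pass claims the whole mapping, which corresponds to the full pushdown case described above.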

1.8.14 iWay Data Migrator

iWay Data Migrator [77-80] is a suite of ETL software components forming a SOA that executes ETL processes in real time (CDC-based), in batch, event-driven or on a schedule, and comprises management and auditing capabilities. Two components compose the client/server architecture: the Data Migrator Server (which houses and executes data and process flows) and the Data Migrator Data Management Console client (a graphical interface where the user can design, test and run data and process flows). Data Migrator provides a wide set of adapters for seamless access and data migration, and it can also use ODBC, JDBC, XML and IBM WebSphere MQ connectivity for further database access. Adapters provide a simple way to make data sources available as part of a SOA. These are SQL-based components that can interact with over 250 information systems without custom code development. iWay adapters can provide statistics on their usage and performance characteristics, and when used in ETL processes they play an important role in data flow auditing. In addition to SQL, adapters may interpret a request language called WebFOCUS that provides a richer set of features than SQL and extends the types of requests that can be processed, especially those against non-relational data sources like flat files. The Data Migrator Data Management Console follows a point-and-click interface and is mainly composed of three graphical components:

o Data Flow Designer: A flow-control interface (Figure 105) that allows visual creation of joins, table and column searches, and dynamic error notification. Users can add external routines (e.g. COBOL or C-based programs) to the ETL function list and select tables for data extraction;

o Process Flow Designer: A flow-control interface that integrates several jobs in a single process flow. A job can refer to a previously created data flow, external programs or scripts created outside Data Migrator. Conditional branching is possible, as well as management facilities like scheduling, e-mail delivery and synchronization periods;

o Web Console: Administrators can review job statistics and drill down into job logs, review the job queue, start and stop servers and server processes, schedule jobs, configure e-mail nodes, and manage adapters and metadata.

Data Migrator servers can be visualized through the Data Management Console. Associated with each server is a set of data and process flows; depending on whether a data or a process flow is being created, different diagram views are used. All SQL language specifics are hidden from the non-expert user. When a text file is the data source, a specific syntax is generated by Data Migrator transparently to the user; thus, all data sources are presented to the user as relational tables. During the ETL specification process the user only interacts with a set of wizards and graphical diagrams. To create a data flow (Figure 105), it is first required to select the data sources where the data is coming from and drag and drop a set of tables (left application pane) onto the data flow diagram (right application pane). By double-clicking a data source icon it is possible to browse the table metadata as well as retrieve sample data. When multiple table sources are dragged into the data flow diagram area and they contain common column names, a Join icon is automatically created, expressing a JOIN operator. By double-clicking on the operator icon the user can specify the properties of the JOIN operation (Figure 106).

Figure 105: Data flow example: data sources (left) and diagram area (right)

By default, tables appear inner-joined by the column name they have in common. This can be customized, as can the type of join: left or right outer join, cross join or no join at all. At this level the user can also specify a set of transformations to be executed after the data is joined.

Figure 106: Join properties


It is possible to view and change the SQL statements automatically generated for the JOIN operator, as well as to test the query by retrieving some records from the data sources and applying the SQL to them. Further, filtering criteria can be defined to limit the retrieval of data to a specific subset. For this purpose the Filter Calculator window (Figure 107) can be used to define a filter; in this component, references to data columns are available, as well as SQL functions and simple arithmetic functions.

Figure 107: Filtering input data (limited to the year 2003)

Next comes the specification of target tables and related loading options (e.g. how to handle duplicates, CDC parameterization). Having defined all target tables, the mapping between source and target columns follows (Figure 108). Source and target columns with the same name and data type are automatically mapped by default. During the mapping process it is also possible to create new target columns or perform simple transformations. Data Migrator can perform transformations outside the limits of SQL and can incorporate routines containing IF-THEN-ELSE logic. It also allows for advanced transformation functions, such as the ability to create 3GL subroutines and add them to a library of subroutines for reuse. Data Migrator can load data into several target tables in a single data flow, simply by adding another target object to the workflow. Once the data flow is complete it can be executed. When execution completes, the user can review the log, which shows a complete record of the ETL process and any problems that may have occurred.
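The default mapping rule (source and target columns with the same name and data type are mapped automatically) can be sketched as follows; the column definitions are hypothetical and the code is illustrative only, not Data Migrator's internal logic.

# Sketch of the default source-to-target mapping rule described above:
# columns are mapped automatically when name and data type match; anything
# left over must be mapped (or transformed) by hand.
# (Illustrative only -- not Data Migrator's internal behaviour.)

def auto_map(source_cols, target_cols):
    """Both arguments are dicts of column name -> data type."""
    mapped, unmapped_targets = {}, []
    for name, dtype in target_cols.items():
        if source_cols.get(name) == dtype:
            mapped[name] = name            # same name, same type: map 1:1
        else:
            unmapped_targets.append(name)  # needs a manual rule
    return mapped, unmapped_targets

if __name__ == "__main__":
    source = {"order_id": "INTEGER", "order_date": "DATE", "total": "DECIMAL"}
    target = {"order_id": "INTEGER", "order_date": "DATE",
              "total_eur": "DECIMAL"}
    mapped, manual = auto_map(source, target)
    print(mapped)   # {'order_id': 'order_id', 'order_date': 'order_date'}
    print(manual)   # ['total_eur'] -> needs an explicit transformation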


As data flows are built, they can be assembled into a larger ETL process flow. These process flows can be scheduled and can use conditional flow logic, stored procedures or remote procedure calls. In addition, different data flows can be assigned to run in parallel.

Figure 108: Mapping source to target columns

Figure 109 presents an example of a process flow. The data flow is executed according to a pre-defined schedule for a single data flow (icon with a blue background). Conditional branching is performed according to the successful execution (or not) of the data flow: in case of success (green arrow) a stored procedure is executed, and if the data flow fails (red arrow) an e-mail is sent to the system administrator. Data Migrator provides several ways to run data and process flows: on demand, event-driven (e.g. start a request when other processes complete) or scheduled (e.g. on a recurring basis or on specified days of the month). Data Migrator also enables impact analysis, which helps to determine which resources (e.g. a target table or data flow) can be affected by a change to another resource (e.g. a source table or stored procedure). Data Migrator can audit the ETL process: each process and data flow generates a complete record of results, including a log of significant events that occurred during the run. Flow summaries and detailed logging are available from the Web Console.
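The conditional branching in Figure 109, with a stored procedure on the success path and an e-mail on the failure path, follows a simple try/else pattern, sketched below. The step names and the administrator address are invented for illustration.

# Sketch of the conditional branching shown in Figure 109: run the data flow,
# then follow the "success" branch (stored procedure) or the "failure" branch
# (e-mail to the administrator). Step names and the address are hypothetical.

def run_data_flow():
    """Stand-in for the scheduled data flow; raises on failure."""
    print("data flow executed")

def run_stored_procedure():
    print("stored procedure executed (success branch)")

def email_administrator(error):
    print(f"mail to admin@example.org: data flow failed: {error} (failure branch)")

def process_flow():
    try:
        run_data_flow()
    except Exception as error:        # red arrow
        email_administrator(error)
    else:                             # green arrow
        run_stored_procedure()

if __name__ == "__main__":
    process_flow()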



Figure 109: Process flow example

Data Migrator capabilities can be further expanded by using the remaining iWay Software data management applications:

o Resource Analyzer: Monitors resource consumption, identifying inefficiencies in the information architecture and eliminating system bottlenecks;

o Resource Governor: Controls the business intelligence environment, managing resource consumption and ensuring good performance and availability.

1.8.15 Microsoft SQL Server 2005

SQL Server 2005 is a database platform providing enterprise-class data management with integrated BI tools. The database engine provides secure, reliable storage for both relational and structured data. The BI Development Studio is a common environment for building BI solutions based on Visual Studio, including a database engine, analysis services and reporting services. It is used to design SQL Server Integration Services (SSIS) packages for data-management applications. SSIS packages are designed, developed and debugged in BI Development Studio by dragging tasks from the toolbox, setting their properties, and connecting tasks with precedence constraints (Figure 110).



Figure 110: BI Development Studio Interface in Visual Studio

SQL Server 2005 includes a redesigned enterprise ETL platform, called SSIS [81-84]. This new platform is the successor to SQL Server 2000 Data Transformation Services (DTS). SSIS provides the following features:

o Collaborative development tool: Enables collaborative ETL with the BI Development Studio development tool. It is hosted in Visual Studio, where it automatically takes advantage of Visual Studio's collaborative capabilities for source code management, version control and multi-user project management;

o Separate management tool: SQL Server Management Studio is SSIS's administration tool, where users operate and maintain BI database objects, as well as all SQL Server 2005 components;

o Data quality and profiling functions: SSIS enables the design of transformations that handle data quality directly in the data flow, featuring capabilities such as Fuzzy Lookup (which matches incoming dirty data with clean records in a reference table) and Fuzzy Grouping (which detects similarities between input rows and merges duplicates). SSIS complements these runtime data quality functions with design-time data profiling capabilities;


o Data lineage and impact analysis via metadata: All the services of SQL Server 2005 share common metadata, resulting in broad metadata-based visibility from data sources, through ETL processes, and into reporting and analytic processes. This enables downstream impact analysis, so IT can see how a change in a source system will affect not only ETL but also reports. It also enables the reverse, namely upstream data lineage, where a report consumer can drill all the way back to the source of a report's data;

o ETL that can behave like EII: Whereas most ETL tools load data into a database or file from which applications access it, SSIS can pass data directly to an application, similar to the way EII works. The benefits are that information delivery approaches real time and that IT does not need to design and maintain a persistent data store. In SSIS, an ETL developer simply defines the target application as a data reader. When this component is included in a data-flow pipeline, the package containing the DataReader destination can be used as a data source. This allows SSIS to be used not only as a traditional ETL tool to load data warehouses, but also as a data source that can deliver integrated, reconciled and cleansed data from multiple sources on demand.

The primary features of SSIS used to implement core business logic are contained in the Control Flow and Data Flow components. The Control Flow is the task workflow engine that coordinates the business process flow logic for a package. Every package has exactly one primary Control Flow, whether it contains a single step or dozens of interconnected tasks. The tasks within the Control Flow are linked by constraints: success, failure, completion, and custom constraint expressions with Boolean logic. The Data Flow is the data processing engine that handles data movement, transformation logic, data organization, and the extraction and commitment of the data to and from sources and destinations. Unlike the Control Flow, there can be multiple Data Flows defined in a package, orchestrated by the Control Flow. The convention used for data flow layout, task naming and annotations starts with a 3- or 4-letter abbreviation of the type of task or transformation, followed by a 3- or 4-word description of the object function. Package layout generally flows down first and then to the right, as tasks or transformations break out from the primary control flows and data flows. Annotations help to document packages by describing each task in more detail, as depicted in Figure 111.



Figure 111: A SSIS data flow example

Integrated into the data flow structure design are custom auditing steps that track high-level execution details. These include package starting, ending and failure times, as well as row counts that assist in validating data volumes and processing. A core aspect of the auditing is to quickly identify errors. SSIS introduces the concept of Event Handler Control Flows, and the OnError event handler is employed for this purpose, trapping any package errors. To tie everything together and present well-organized information to the administrator, as well as a troubleshooting tool for the developer, a series of linked Reporting Services reports may be designed, correlating the auditing, validation and logging information (Figure 112). At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data; even copying the data in memory is avoided as far as possible.
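The auditing pattern described above (package start, end and failure times, row counts, and an OnError handler trapping package errors) can be sketched generically as follows. This is not the SSIS object model; it only shows the shape of the logic, and the step names are invented.

# Generic sketch of the auditing pattern described above: record package
# start/end/failure times and row counts, and trap errors in a single
# OnError-style handler. This is not the SSIS object model.
from datetime import datetime

audit_log = []

def on_error(package, step, error):
    audit_log.append({"package": package, "step": step,
                      "event": "OnError", "message": str(error),
                      "time": datetime.now()})

def run_package(name, steps):
    audit_log.append({"package": name, "event": "start", "time": datetime.now()})
    rows_processed = 0
    try:
        for step_name, step in steps:
            rows_processed += step()          # each step returns its row count
    except Exception as error:
        on_error(name, step_name, error)
        audit_log.append({"package": name, "event": "failure",
                          "rows": rows_processed, "time": datetime.now()})
        raise
    audit_log.append({"package": name, "event": "end",
                      "rows": rows_processed, "time": datetime.now()})

if __name__ == "__main__":
    steps = [("extract", lambda: 1000), ("load", lambda: 1000)]
    run_package("DailyLoad", steps)
    for entry in audit_log:
        print(entry)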



Figure 112: A SSIS report example

With SSIS, all types of data (e.g. structured, unstructured, XML) are converted to a tabular structure before being loaded into its buffers. Any operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage them. This architecture allows SSIS to be used not only for large datasets, but also for complex data flows. As the data flows from source(s) to destination(s), the stream of data can be split, merged, combined with other data streams, and otherwise manipulated. SSIS can consume data from, and deliver data to, a variety of sources including Object Linking and Embedding Database (OLE DB), managed ADO.NET, ODBC, flat files, Excel and XML, using a specialized set of components called adapters. SSIS can even consume data from custom data adapters (developed in-house or by third parties). SSIS includes a set of powerful data transformation components that allow data manipulations that are essential for building data warehouses (e.g. aggregate, sort, lookup, pivot, merge, data conversion or audit). XML data is converted into tabular data, which can then be easily manipulated in the data flow (e.g. with XSLT or XPath). SSIS is integrated with the data mining functionality in Analysis Services. Data mining abstracts out the patterns in a dataset and encapsulates them in a mining model. This model can be used to make predictions on whether data belongs to a dataset or may be anomalous, allowing data mining to be used as a tool for implementing data quality. Support for complex data routing in SSIS allows anomalous data not only to be identified, but also to be automatically corrected and replaced with correct values. Figure 113 shows an example of a cleansing data flow.
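The routing idea behind such a cleansing flow can be sketched as below, where a simple range check stands in for the Analysis Services mining model and anomalous rows are split off, corrected and merged back into the clean stream. The field names and threshold are hypothetical.

# Sketch of the routing idea behind a cleansing data flow: rows predicted to
# be anomalous are split off and corrected before rejoining the clean stream.
# A simple range check stands in for the mining model; names are invented.

EXPECTED_RANGE = (0.0, 10_000.0)   # hypothetical plausible range for "amount"

def is_anomalous(row):
    low, high = EXPECTED_RANGE
    return not (low <= row["amount"] <= high)

def correct(row):
    # Replace the suspect value with a default; a real flow might instead
    # use the mining model's predicted value.
    return {**row, "amount": 0.0, "corrected": True}

def cleanse(rows):
    clean, corrected = [], []
    for row in rows:                      # conditional split
        if is_anomalous(row):
            corrected.append(correct(row))
        else:
            clean.append(row)
    return clean + corrected              # streams merged again

if __name__ == "__main__":
    incoming = [{"id": 1, "amount": 250.0}, {"id": 2, "amount": -999_999.0}]
    for row in cleanse(incoming):
        print(row)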

Figure 113: A cleansing data flow example

SSIS is not only integrated with the data mining features from Analysis Services; it also has text mining components (identification of relationships between business categories and text data). This allows the discovery of relevant terms in text data. BI Development Studio also has features for true run-time debugging of SSIS packages, including the ability to set breakpoints and watch variables. The Data Viewer provides the ability to view rows of data as they are processed in the data-flow pipeline. This visualization of data can be in the form of a regular text grid or a graphical presentation such as a scatter plot or bar graph (Figure 114).


In fact, it is possible to have multiple connected viewers that can display the data simultaneously in multiple formats.

Figure 114: SSIS data visualizations example

In addition to providing a development environment, SSIS exposes all its functionality via a set of APIs. These APIs are both managed (.NET Framework) and native (Win32) and allow developers to extend the functionality of SSIS by developing custom components in any language supported by the .NET Framework (such as Visual C# or Visual Basic .NET) and in C++. These custom components can be workflow tasks and data-flow transformations (including source and destination adapters).

1.8.16 Oracle Warehouse Builder

Oracle Warehouse Builder 10g [85-89] is Oracle's business integration design tool for handling both data and metadata. It provides data quality, data auditing, integrated relational and dimensional modelling, and life cycle data management. The basic architecture comprises two components: the design environment (which handles metadata) and the runtime environment (which handles physical data). The repository is based on the CWM standard and consists of a set of tables in an Oracle database that is accessed via a Java-based access layer. The front end of the tool (entirely written in Java) features wizards and graphical editors for logging onto the repository. Warehouse Builder comprises four main components:

o ETL Core features: Provided free with the database Standard Edition;

o Enterprise ETL Option: Supports large-scale and complex deployments by improving the scalability and performance of ETL jobs. Some of the features included in this option are slowly changing dimensions and impact analysis;

o Data Quality Option: The aim of the Data Quality Option is to support the transformation of data into quality information on an ongoing basis;

o Connectors: Connectors allow applications to quickly and easily extract data to/from CRM and ERP applications. The following connectors are available: SAP, Oracle E-Business Suite and PeopleSoft 8/9.

The Warehouse Builder architecture is depicted in Figure 115:

Figure 115: Warehouse Builder Architecture

o Design Client and Deployment Manager: The rectangle (A) represents the Design Client, which provides the graphical interface used to define sources, targets and ETL processes. The Design Client contains the Deployment Manager. Generated code from the Deployment Manager goes to the Runtime Access User (1) on the Oracle Database server. Warehouse Builder can be installed without an Oracle Database on any number of computers designated as design clients;

o Design Repository Schema, Design Repository and Design Browser: The database instance (x) containing the Design Repository schema is typically installed on a server shared by a number of Design Clients (A). The Design Repository schema (B) contains the Design Repository, which stores metadata definitions for all the sources, targets and ETL processes that constitute the design metadata. Through the Design Browser (C1 or C2), the user can access the content of the Design Repository in read-only mode;

o Runtime Access User: Because the Runtime Repository schema (2) requires system privileges, the Runtime Access User (1) is the security measure that shields it from abuse;

o Runtime Repository Schema: The Runtime Repository schema (2) owns the audit tables and audit/service packages that are accessed by the Target Schema (3). The Runtime Repository schema is the central user within the system that manages connections to the various targets in the Target Schema;

o Runtime Audit Browser: The Runtime Audit Browser (D1 or D2) also communicates with the Runtime Repository schema (2), enabling reports to be run on the audit and error information captured when jobs to load and refresh the ETL targets are executed;

o Target Schema: The Target Schema (3) is the actual target into which data is loaded, and contains the target data and data objects, such as cubes, dimensions, views and mappings. The Target Schema contains packages to execute the ETL processes that access the audit/service packages in the Runtime Repository schema (2);

o Runtime Platform Service: The Runtime Platform Service (4) manages native execution and calls to Oracle Enterprise Manager for remote execution.

Oracle Warehouse Builder includes several pre-defined transformations, as well as a library of pre-defined functions and procedures, to transform data. The ETL processes designed with Warehouse Builder can be translated into PL/SQL packages. These packages are deployed to the Oracle database and stored as packages available for execution.

To enable faster development of warehousing solutions, Warehouse Builder provides custom procedures and functions written in PL/SQL, and it allows existing PL/SQL to be reused as well as new PL/SQL transformations to be written. Because the final process runs on the Oracle database, Warehouse Builder supports all constructs supported by the Oracle database. The Mapping Editor application can also be used to design data transformations using SQL code components. For example, activities such as joining disparate data sources or splitting data streams into multiple output streams can be implemented as SQL components. This enables Warehouse Builder to generate efficient SQL code to move data from source to target. Warehouse Builder provides a graphical interface to model and create an ETL process based on objects that are stored in an open repository. After completing the initial design, this can be deployed to the runtime platform. Because the runtime platform is the Oracle database, Warehouse Builder is a code-generating tool rather than an ETL engine-based tool. Warehouse Builder requires metadata to describe a source object for use within the product. For this purpose, it groups sources in specific metadata groups, called modules. A module references an OS directory or a database schema. Flat files are a typical data source for warehouse projects. A set of wizards based on a sample file enables the definition of metadata for the ETL process for that class of files, comprising the following steps:

1. Define the sample file path;
2. Define the internal name for the sample file, the character set and the number of sample characters;
3. Select how the file is delimited and the number of records present in the file (Figure 116);
4. Specify the number of rows to skip;
5. Define the fields' data types and validation rules;
6. Review the selected configuration in a summary page;



Figure 116: Defining the field delimiters

7. The metadata is made visible in the Design Center application under the SOURCE folder (Figure 117).
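The outcome of the sampling wizard (steps 1-7 above) is essentially a metadata record describing the file and its fields. The sketch below shows the kind of information collected; the names and values are hypothetical, and Warehouse Builder itself stores this metadata in its repository rather than as Python.

# Sketch of the flat-file metadata the sampling wizard collects (steps 1-7
# above): file location, delimiter, rows to skip and per-field types and
# validation. Names and values are hypothetical.
flat_file_module = {
    "sample_file": "/data/incoming/orders_sample.csv",    # step 1
    "internal_name": "ORDERS_SRC",                         # step 2
    "character_set": "WE8ISO8859P1",
    "delimiter": ",",                                      # step 3
    "records_in_sample": 200,
    "rows_to_skip": 1,                                     # step 4 (header row)
    "fields": [                                            # step 5
        {"name": "ORDER_ID",   "type": "INTEGER", "nullable": False},
        {"name": "ORDER_DATE", "type": "DATE",    "mask": "YYYY-MM-DD"},
        {"name": "AMOUNT",     "type": "NUMBER",  "check": "AMOUNT >= 0"},
    ],
}

if __name__ == "__main__":
    # Step 6: the wizard shows a summary page of exactly this configuration.
    for key, value in flat_file_module.items():
        print(f"{key}: {value}")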

Figure 117: Design Center

When designing a mapping in Warehouse Builder, the Mapping Editor interface (Figure 118) is used. Mappings extract data from the source, transform the data and load it into the target module. Activities external to Warehouse Builder (such as e-mail, FTP commands and operating system executables) can also be used. While designing a mapping, the user selects operators from the Mapping Editor Palette panel and places them on the canvas.

Figure 118: Mapping Editor

Using the Process Flow Editor, it is possible to design process flows that join together mappings and other activities. By default, the Process Flow Editor is launched with only Start and End_Success activities on the canvas (Figure 119).

Figure 119: Process Editor (empty flow)

Activities are dragged from the Palette panel onto the canvas and then connected through transitions until a pipeline is defined from the start node to the end node (Figure 120). Transitions indicate the sequence and conditions in which activities occur in the process flow. Transitions are unconditional by default, indicating that the process flow continues after the preceding activity completes, regardless of its ending state. However, they can be customized in order to route the process flow depending on the existence of warnings, errors or the correct execution of the previous activity. Warehouse Builder process flows comply with the XML Process Definition Language (XPDL). When a process flow is created, Warehouse Builder generates an XML file in the XPDL format. This XML file can be plugged into any workflow engine that follows the XPDL standard (Figure 120).

Figure 120: Process flow and XPDL script code

1.8.17 Pervasive Data Integrator

Data Integrator [90-93] is Pervasive's ETL solution for data integration, whose architecture (Figure 121) consists of a unified set of developer-focused tools, separated into three distinct parts:



Figure 121: Pervasive Integration Architecture

Integration Architect: A set of IDE tools that enables developers to design effective multi-step integration processes and data transformations across data sources and targets, together with a highly differentiated set of schema design tools to help access data sources;

Integration Repository: An open, XML-based file storage facility used by developers to track and store their work across multiple projects. The openness of this design allows client tools to work directly with the XML-formatted documents that represent everything from schema designs to business process modelling;

Integration Engine: The workhorse of the integration product line, providing the processing power to execute the maps and designs created by developers with the Integration Architect. Engines can execute locally, for quick access to view data and test designs, or be distributed into the company's infrastructure to run designs in the production environment.

Integration Architect has the tools and designers for connectivity to a wide range of sources and targets: Process Designer, Map Designer, Structured Schema Designer, Document Schema Designer, Extract Schema Designer, Repository Explorer, Repository Manager, and the Universal Data Browser. An overview of the most important tools follows.

Process Designer: A graphical interface (Figure 122) for drag-and-drop integration process design. Using simple flowchart symbols, the Process Designer links multi-step integration processes together in a single, automated integration task.

Figure 122: Process Designer

During the design process, any number of steps can be dragged onto the Process Designer canvas (marked as 1) from the pre-built list of process flow icons on the left-hand side of the screen (marked as 2). These are used to build transformation steps that can call transformations (i.e. maps), decision steps where logic can be recorded to push a flow down one or more paths depending on the situation, looping logic and parallel processing logic, among others. After dropping an item onto the canvas, the mappings are then altered to build the integration process, by adjusting their properties or using the Map Designer application to handle data transformations where needed.

Map Designer: A data map and transformation designer, providing a visual user interface and wizards. The mapping process is internally codified in XML, allowing reusability. The basic steps for map design are (Figure 123):

1 - Connect to a source: Handles all connectivity options to the data source for the transformation design. Hundreds of native connectors are available, from legacy and modern databases to flat files. Once connected, the developer can browse the source data to assist with the transformation process;

2 - Connect to a target: Depending on the selected options, the transformation can replace data in the target, append data, or delete all data and reload. Further, parent-child relationships of the data can be maintained across more than one table in a target database.

Figure 123: Map Designer Source Connection Section

3 - Map: In this tab the data transformation is designed (Figure 124). Field structures are displayed for source and target data, and a simple drag and drop of source fields onto the target enables a simple pass-through transformation. Using the Expression Builder application, event actions and transformation scripting options can be defined.

Figure 124: Map Designer Map Section


Expression Builder: Uses a development scripting language called Rapid Integration Flow Language (RIFL), developed to promote developer efficiency on common tasks encountered in integration projects. The language comes with hundreds of pre-built functions to automate developer tasks, ranging from automatic conversion of date formats and Boolean logic to math functions and event handling;

Extract Schema Designer (Figure 125): Used for integrating unstructured text data (e.g. e-mail, HTML or any other raw text) into business processes;

Figure 125: Extract Schema Designer

Integration Manager (Figure 126): This central console is a single point of administration and monitoring for a set of deployed engines, improving operational efficiency, especially when using geographically distributed integration engines. Further, it can be used to define the scheduling of jobs and to collect runtime statistics.
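A Map Designer-style field mapping with one Expression Builder-style transformation (a date reformat) might look like the sketch below. It is written in plain Python rather than RIFL, and the field names are invented.

# Sketch of a field mapping with one expression-style transformation (a date
# reformat). Plain Python, not RIFL; field names are invented.
from datetime import datetime

def to_iso_date(value):
    """Expression: convert 'DD/MM/YYYY' source dates to ISO format."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")

# Target field -> either a source field name (pass-through) or a function
# of the source record (expression).
FIELD_MAP = {
    "customer_id": "CUST_NO",
    "full_name":   lambda rec: f'{rec["FIRST_NAME"]} {rec["LAST_NAME"]}',
    "order_date":  lambda rec: to_iso_date(rec["ORD_DATE"]),
}

def transform(record):
    out = {}
    for target, rule in FIELD_MAP.items():
        out[target] = record[rule] if isinstance(rule, str) else rule(record)
    return out

if __name__ == "__main__":
    src = {"CUST_NO": 42, "FIRST_NAME": "Ana", "LAST_NAME": "Silva",
           "ORD_DATE": "15/04/2007"}
    print(transform(src))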



Figure 126: Integration Manager

1.8.18 SAS ETL

SAS ETL Studio [94-96] is the main application of the SAS ETLQ suite, managing the ETL process flow through an interactive design process that allows designers to build logical process workflows, identify input and output data flows and create business rules through metadata. SAS introduces the ETLQ paradigm, where traditional ETL procedures are extended by Quality methods for verification and data cleansing purposes. SAS ETL Studio cannot be used in isolation; it is integrated into the SAS Intelligence Platform (Figure 127), which is organized into the following main tiers:

o SAS Foundation: Comprises a broad range of core data manipulation functions, such as distributed data management, data access across multiple database sources, data visualization and data mining. SAS ETL Studio generates SAS code and submits it to Base SAS for execution. It also uses Base SAS to access SAS data and SAS/ACCESS to access non-SAS data. SAS ETL Studio also uses SAS/CONNECT to submit generated SAS code to remote machines and to interact with remote libraries;

o SAS Business Intelligence Infrastructure: A suite of servers and services that enable the execution of SAS Web servers and application servers;


o SAS Client Services: Provides a suite of Web-based and desktop front-end interfaces to the content and applications generated from the SAS BI Infrastructure and the SAS Foundation. SAS ETL Studio is located in this layer.

Figure 127: Software Architecture for the SAS Intelligence Platform

The SAS ETLQ transformation engine is responsible for the execution of built-in transformations. Wizards within SAS ETL Studio remove the need to manage custom code by providing pre-built template transformations. All transformations are tracked and registered in a central metadata repository. Integrated metadata management and standards enforcement facilitate the creation, storage and exchange of metadata. ETL metadata information is gathered by the ETL process itself during the wizard steps. Integrated workflow scheduling and load balancing allow the automatic execution of ETL processes and the redirection of jobs and processes to the resources with the least load. An integrated management console provides tools for deploying and maintaining all ETL assets. SAS ETL Studio is a Java-based ETL design tool with a drag-and-drop interface that provides a single point of control for managing ETL processes. Wizards are used extensively for defining all tasks, including extractions, the creation of jobs, process flow diagrams and data transformations. A process designer lets developers build and edit ETL jobs using built-in design templates and an extensive library of transformations. Multiple data sources can be integrated during the wizard steps (e.g. DB2, Sybase, Teradata, external files, ODBC sources, Oracle, SAS sources). A SAS ETL Studio source-to-target flow (a job) can be defined in three steps:

1. Specify metadata for data sources (e.g. tables in an operational system);
2. Specify metadata for data targets (e.g. tables in a data warehouse);


3. Create jobs that specify how data is extracted, transformed and loaded from sources to targets.
SAS ETL Studio uses each job to generate SAS code that reads sources and creates targets (e.g. Figure 128). A minimal sketch of this metadata-driven pattern is given below.
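The following Python sketch illustrates the general pattern behind these three steps: source and target metadata are registered first, and a job object then uses that metadata to generate the code that moves data from source to target. The class names and the generated statement are illustrative assumptions, not SAS ETL Studio APIs (a real job would emit SAS code rather than SQL).

# Illustrative sketch of the metadata-driven job pattern described above.
# Class and attribute names are assumptions, not SAS ETL Studio APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class TableMetadata:
    """Step 1 / Step 2: metadata describing a source or target table."""
    name: str
    columns: List[str]

@dataclass
class Job:
    """Step 3: a job ties source and target metadata together and
    generates the code that reads the source and creates the target."""
    source: TableMetadata
    target: TableMetadata

    def generate_code(self) -> str:
        cols = ", ".join(self.target.columns)
        # A real tool would emit SAS code here; plain SQL is used only
        # to illustrate the idea of code generation from metadata.
        return (f"INSERT INTO {self.target.name} ({cols})\n"
                f"SELECT {cols} FROM {self.source.name};")

source = TableMetadata("ALL_EMP", ["emp_id", "name", "gender"])
target = TableMetadata("ALL_EMP2", ["emp_id", "name", "gender"])
print(Job(source, target).generate_code())

The point of the design is that the job holds no hand-written code: everything executed later is derived from the registered metadata.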

Figure 128: Process Flow for a SAS ETL Studio Job
In the display, each round object represents the metadata for a table, and each square object represents the metadata for a process:
o ALL_EMP represents the metadata for a table that is the source for a data transfer process;
o ALL_EMP2 represents the metadata for a table that is the target of a load process and the source for an extract process;
o All Male Emp represents the metadata for a table that is the target of a load process.
To create a process flow in SAS ETL Studio, the user must first add metadata for the sources and targets in the flow. Any data store can be a source, a target, or both; there is no metadata differentiation. Source / Target designer wizards enable the user to generate metadata for tables or external files that exist in physical storage. In most cases, a source designer displays a list of tables that currently exist for a selected database and generates metadata from the physical structure of the selected tables. If a file is selected as data source, a specific wizard is presented, enabling the user to:
o Extract information from flat files (fixed or delimited format);
o Import column-aligned data or data that is not column-aligned (using single or multiple delimiter values);
o Import variable-length records and fixed-length records;
o Import character, numeric and non-standard numeric data (such as currency data or signed numbers);
o Specify how missing values should be dealt with;
o Remove or add columns in the imported data.


For column-aligned data (Figure 129), the External File source designer uses sample data from the external file to estimate the length and data type of the columns. If the data in the external file is not arranged in columns, a fixed-width column definition wizard is presented to the user.
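A rough idea of what such a wizard does internally is given by the following Python sketch, which scans a few sample rows of a delimited file and guesses a type and maximum length for each column. The heuristics and the sample data are assumptions for illustration only, not the tool's actual inference rules.

# Sketch of column metadata inference from sample data (illustrative only).
import csv
import io

sample = io.StringIO(
    "emp_id;name;salary\n"
    "1;Smith;1200.50\n"
    "2;Jones;980.00\n"
)

def guess_type(values):
    """Very small heuristic: integer, then numeric, otherwise character."""
    for caster, name in ((int, "integer"), (float, "numeric")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "character"

reader = csv.reader(sample, delimiter=";")
header = next(reader)
rows = list(reader)

for index, column in enumerate(header):
    values = [row[index] for row in rows]
    print(column, guess_type(values), "max length", max(len(v) for v in values))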

Figure 129: Import Parameters Window
When the import parameters are defined, the wizard reads the source table / file and derives default metadata for the columns in the target (based on the columns in the source) that can be further customized by the user (Figure 130).

Figure 130: Set Column Definition Window
Having defined both data sources and targets, a Job creation wizard can be invoked to create the data transformation process. In its simplest form, the wizard can be used to create an empty job with a single transformation (Figure 131).



Figure 131: Empty job based on a single transformation
Starting from an empty job definition, the user can create a process flow diagram by dragging and dropping tables (from the Inventory tree) and transformations (from the Process Library tree - Figure 132) into the Process Designer window (Figure 133). Further, the following tasks can be performed:
o View or update the metadata for sources, targets and transformations within the selected job;
o View or update the code that is generated for the entire selected job or for a transformation within that job;
o View a log that indicates whether code was successfully generated for the selected job or for one of its transformations.

Figure 132: Process Library Tree
The Process Library contains two other kinds of transformation templates: Java plug-in transformation templates and SAS code transformation templates. SAS code transformation templates are created with the Transformation Generator wizard. All transformation templates available in the Process Library are SAS code transformations.



Figure 133: Process Designer Window

1.8.19 Stylus Studio

Stylus Studio 2007 [97] comprises a suite of XML tools integrated within the same XML editor. Stylus Studio's XML Editor enables XSLT development, including an XSLT editor, XSLT debugger, XSLT mapping, XSLT profiling, visual HTML-to-XSLT stylesheet design and XSL Formatting Objects (XSL FO). The application also includes multiple visual XML editing views, Intelligent XML Editing, an integrated XML Parser, an XML Validator and XML differencing. Stylus Studio can be used either as a desktop tool or as a Java code generator; Java code can be generated, compiled and then executed outside the Stylus Studio framework. Within the scope of ETL, the XML Pipeline is the most relevant feature. XML Pipelines can be used for modelling, editing, debugging and deploying XML applications, and XML Publisher can further refine their XML outputs by creating data reports (XML to PDF or XML to HTML). An XML pipeline (Figure 134) is a way to express the various steps in the processing of XML documents. Stylus Studio provides powerful graphical tools for expressing XML pipelines in terms of a series of operations, e.g. converting, validating and transforming, using XML, XSLT, XQuery, XML Schema and XPath


technologies. With the application, software architects can visualize data integration applications at a high level and automate many common tasks (e.g. debugging of XML processing applications).

Figure 134: A pipeline definition example
As an example, Stylus Studio could define a pipeline for creating the following XML application:
1. Getting a catalogue of books from a text file (convert to XML);
2. Getting an order from an EDIFACT file (convert to XML);
3. Using XQuery to extract the order information;
4. Using XSLT to publish an HTML order report;
5. Using XQuery to generate an XSL FO stylesheet;
6. Using XSL FO to publish a PDF order report.
The creation of an XML Pipeline is performed visually by dragging and dropping different XML operations from the toolbox onto the XML Pipeline editor's canvas, associating them with their respective XML artefacts (e.g. stylesheets, queries) and then connecting them together (Figure 134). Pipelines can contain other pipelines, enabling reusability. With the XML Pipeline Simulator and XML Pipeline Debugger, each individual pipeline component can be tested and debugged as a unit. The debugger does more than simply show what is going in and what is coming out of each connected


operation. It allows the user to step into the XSLT and XQuery code, using the XSLT Debugger and XQuery Debugger to see what is happening inside. If those programs happen to call into Java, the application allows debugging those extension functions as well. Once a pipeline is successfully defined, the user can pack it into a single program for deployment (either within the Stylus Studio framework or externally as a Java application). The user can define a pipeline either through a top-down or a bottom-up strategy, depending only on whether the pipeline must be created from scratch or previously created pipelines can be used as building blocks. Stylus Studio employs adapter components for providing an XML view over non-XML sources like text (Figure 135), binary or database data sources (Figure 137), through the "Convert-to-XML" module. In the case of a file data source, a representative file must be selected for defining the conversion process. Then the file encoding and layout settings are specified. Alternatively, the user can let Stylus Studio infer this information from the file, and then alter it if required. The "Convert-to-XML" editor (Figure 135) consists of a "properties" window (bottom), a "document" pane (top) and a "schema" pane (right). The "properties" window uses a tree to display information about the input file and the properties defining how it will be converted to XML. By default the document pane displays files in a grid (useful for selecting and defining sections, rows and fields). The document pane provides both display and editing features that help working with the input file and defining the XML output. Toggle buttons allow controlling the display of a ruler, control characters and matching patterns (defined through regular expressions). The properties contents change based on the selection in the document pane. The schema pane displays a representation of the XML schema for the XML document that the converter will create. Further, it allows the definition of matching patterns and of the element names for the nodes that will be created in the output XML document. Delimiter characters and separators are displayed with a different background. Delimiter regions can be defined in the input file, which is especially useful when the region contents have heterogeneous formats and formatting rules. Filters can be defined in order to select just a subset of the text lines / fields to be propagated into the XML outputs of the conversion.
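The two ideas discussed above (a regular-expression-driven conversion of raw text into XML, followed by a transformation step that publishes a report) can be approximated with the following Python sketch using the third-party lxml library. The input layout, element names and stylesheet are invented for the example; they are not artefacts generated by Stylus Studio.

# Sketch of a two-step "convert, then transform" pipeline (illustrative only).
import re
from lxml import etree

raw_text = """\
BK001|XML in a Nutshell|39.95
BK002|Learning XSLT|29.95
"""

# Step 1: Convert-to-XML style step, one regex "matching pattern" per line.
pattern = re.compile(r"(?P<id>\w+)\|(?P<title>[^|]+)\|(?P<price>[\d.]+)")
catalogue = etree.Element("catalogue")
for line in raw_text.splitlines():
    match = pattern.match(line)
    if match:                      # lines that do not match are filtered out
        book = etree.SubElement(catalogue, "book", id=match["id"])
        etree.SubElement(book, "title").text = match["title"]
        etree.SubElement(book, "price").text = match["price"]

# Step 2: XSLT step publishing an HTML report from the intermediate XML.
xslt = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalogue">
    <html><body><ul>
      <xsl:for-each select="book">
        <li><xsl:value-of select="title"/> - <xsl:value-of select="price"/></li>
      </xsl:for-each>
    </ul></body></html>
  </xsl:template>
</xsl:stylesheet>""")
report = etree.XSLT(xslt)(catalogue)
print(etree.tostring(report, pretty_print=True).decode())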



Figure 135: Convert to XML editor
The outputs of an XML pipeline can be produced directly in XML format (Figure 136), or XSLT operations can be defined for creating HTML reports (designed with the Stylus Studio HTML publisher). Alternatively, XSL FO operations can be performed for the creation of PDF reports. Besides textual input, Stylus Studio can also handle database inputs through the DB to XML Data Source module, which simulates an XML stream with the database contents.

Figure 136: Pipeline XML outputs


According to the selected database, tables and fields, SQL/XML code is automatically generated by the application (Figure 137).
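Such generated statements typically rely on the standard SQL/XML publishing functions (XMLELEMENT, XMLATTRIBUTES, XMLFOREST). The snippet below is a hand-written illustration of that kind of query, not output captured from Stylus Studio; the table and column names are invented, and support for these functions varies by database.

# Hand-written illustration of a SQL/XML publishing query (invented schema);
# a "DB to XML" module would generate something of this shape automatically.
generated_query = """
SELECT XMLELEMENT(NAME "product",
         XMLATTRIBUTES(p.id AS "id"),
         XMLFOREST(p.name AS "name", p.price AS "price"))
FROM products p
"""

def stream_xml(connection):
    # Executed through any DB-API connection, each row comes back as one
    # XML fragment that can feed the rest of the pipeline.
    cursor = connection.cursor()
    cursor.execute(generated_query)
    for (fragment,) in cursor:
        yield fragment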

Figure 137: DB to XML Data Source module
In a similar way, the module can be used for mapping XML to a database (in this case the update statements would replace the select statements). In cases where a source-to-target mapping is required with different schemas, the XSLT mapper component can be used for defining the mapping graphically (Figure 138), and the XSLT code is automatically generated.

Figure 138: XSLT mapper


1.8.20 Sunopsis Data Conductor

Sunopsis Data Conductor [98-101] is the ELT solution from Sunopsis for exchanging data between legacy databases and Data Warehouses / Data Marts, as well as for migrating data between heterogeneous systems. In this application the ELT paradigm is used and all data transformations are executed by the RDBMS engine(s) already installed. Data Conductor generates the native code (e.g. SQL, PL/SQL, Transact-SQL, bulk loaders) needed to load data directly from the sources into the target server and enables data transformations to be executed in bulk mode by an RDBMS. However, this solution is limited by the expressiveness of the transformation methods available in the RDBMS engines. Three main advantages are identified [102] regarding the use of Data Conductor:
1. Bulk Processing: By generating native SQL code, the ELT approach leverages the powerful bulk data transformation capabilities of the RDBMS. In addition, since data is loaded directly from source systems to the target server, only one set of network transfers is required, instead of two or more as with an engine-based approach. Inserts and updates are handled as bulk operations and no longer performed row-by-row;
2. Business Rules: With a business-rules-driven paradigm, the developer only defines what he/she wants to do, and the data integration tool automatically generates the data flow, including whatever intermediate steps are required, based on a library of knowledge modules. The what to do, i.e. the business rules, is stored in a central metadata repository where it can be easily reused. The implementation details, specifying how to do it, are stored in a separate knowledge module library, and can be shared between multiple business rules within multiple ETL processes. With this approach it is easy to make incremental changes either to the rules or to the implementation details, as they are, in essence, independent;
3. Lower Cost: ELT is a cost-effective approach to data integration since no middle-tier ETL engine, dedicated ETL server hardware or maintenance expenses are required.
The architecture of the Sunopsis ELT solution is depicted in Figure 139. Data Conductor uses the target data warehouse, or in some circumstances the source database, to do all data transformation. Data Conductor is supported by a Metadata Repository that can be held in any relational database. It stores metadata about the source and target databases, details of the mappings, the process flows and details of the execution results.
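The bulk-processing argument can be made concrete with the following sketch: instead of fetching, transforming and re-inserting rows through an external engine, one set-based statement is pushed down to the target RDBMS. The SQL is generic and hand-written for illustration (the schema and table names are invented), not code generated by Data Conductor.

# Illustration of the ELT idea: push one set-based statement to the target
# RDBMS rather than moving rows through an external transformation engine.
BULK_TRANSFORM = """
INSERT INTO dw.sales_fact (order_id, customer_id, amount_eur)
SELECT o.order_id,
       o.customer_id,
       o.amount * fx.rate          -- transformation runs inside the RDBMS
FROM   staging.orders o
JOIN   staging.fx_rates fx ON fx.currency = o.currency
"""

def load(connection):
    cursor = connection.cursor()
    cursor.execute(BULK_TRANSFORM)   # one bulk operation, one round trip
    connection.commit()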

Data Conductor is written in Java and communicates with source and target databases via JDBC, and with other sources (e.g. LDAP directories, flat files) via methods such as JMS, Java Naming and Directory Interface (JNDI) and JDBC/OS. The Designer, Operator, Metadata Navigator and Security Manager communicate directly with the repository, while the Sunopsis Agent acts as a scheduler carrying out non-database activities.

Figure 139: Sunopsis architecture
Of all the client applications, the Designer is the most relevant one (Figure 140 and Figure 142). The application may hold multiple ETL projects that are stored within the Metadata Repository. Projects have Interfaces for mapping purposes and Packages for Process Flow definition. Procedures refer to transformations, User Functions represent functionalities written in SQL or Java, and Knowledge Modules are the implementation methods that actually load data. Data Conductor takes the approach of splitting a data mapping into business rules (the what of the data mapping) and the flow (the how of the mapping). For example, a business analyst might create a data interface that declaratively states that the sales fact table is populated from the orders table, the order items table and data from an XML document, whilst the database developer would pick a flow implementation type that actually uses SQL to get data from the tables, XQuery to get the XML data and a bit of Java to join it together. These flows are encapsulated into Knowledge Modules that can be extended by the user. Knowledge Modules are split into Loading modules for getting data from a source


onto a target platform, and Implementation modules that handle situations such as insert/updates and bulk loads. This way the user can define mappings logically and change the actual implementation type if the platform technology changes. An interface (business rules) for mapping data is defined using a Diagram view (Figure 140), defining both the source and target tables.

Figure 140: Designer Diagram view
The user can define joins between source tables and add details of the other mappings that are not simple column-to-column copies. The user can also customize where to perform transformations: on the source, staging or target databases, depending on what is most appropriate. The data flow is then defined, specifying how the data will be loaded (Figure 141). The flows work with Knowledge Modules, encapsulated best practices that instruct Data Conductor in the most efficient way to load data from a data source. This is specified in two parts: (i) how the data is loaded and (ii) how it is integrated. The load method is specified by choosing a Load Knowledge Module (LKM), whilst the integration is specified by choosing an Integration Knowledge Module (IKM). Various LKMs and IKMs are provided with the Sunopsis platform and these can be further extended. As an example, a generic SQL to SQL LKM can be used to get data and then a SQL Incremental Update IKM can be chosen to load data into the target database; a sketch of this separation is given below.
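As a rough approximation of the business-rules / knowledge-module split, the Python fragment below keeps the mapping declarative and delegates the "how" to interchangeable module objects. All names and the generated SQL are assumptions for illustration, not Sunopsis output.

# Sketch of the "what vs. how" split: a declarative mapping plus pluggable
# knowledge modules that generate the actual load code (names invented).
mapping = {
    "target": "dw.sales_fact",
    "source": "staging.orders",
    "key": "order_id",
    "columns": ["order_id", "customer_id", "amount"],
}

class AppendIKM:
    """Simplest integration strategy: append everything."""
    def generate(self, m):
        cols = ", ".join(m["columns"])
        return f"INSERT INTO {m['target']} ({cols}) SELECT {cols} FROM {m['source']}"

class IncrementalUpdateIKM:
    """Merge-style strategy, in the spirit of a 'SQL Incremental Update' module."""
    def generate(self, m):
        cols = ", ".join(m["columns"])
        updates = ", ".join(f"t.{c} = s.{c}" for c in m["columns"] if c != m["key"])
        return (f"MERGE INTO {m['target']} t USING {m['source']} s "
                f"ON (t.{m['key']} = s.{m['key']}) "
                f"WHEN MATCHED THEN UPDATE SET {updates} "
                f"WHEN NOT MATCHED THEN INSERT ({cols}) "
                f"VALUES ({', '.join('s.' + c for c in m['columns'])})")

# Swapping the module changes the implementation, not the mapping.
print(IncrementalUpdateIKM().generate(mapping))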



Figure 141: Designer LKM definition example
Although an interface can be executed without further specifications, for more complex operations the definition of a data flow is advisable, keeping multiple interfaces and procedures together as a graph (Figure 142).

Figure 142: Designer Flow definition


Data flows can be executed alone or integrated in a schedule (interface execution can also be scheduled). When a defined interface or data flow is executed, the Operator application (Figure 143) is launched, presenting the progress of the ELT operations. Clicking on the individual interfaces, the user can see what is loaded into the target tables and also check whether data needs correction. Further, by double-clicking on any step, the user is presented with the SQL that the tool is generating.

Figure 143: Operator application

1.8.21 Sybase TransformOnDemand

Sybase TransformOnDemand [103-105] is an integrated ETL environment. The architectural design (Figure 144) is supported by a transformation bus, reducing the number of dedicated peer-to-peer interfaces between heterogeneous systems. When a process is activated, the engine retrieves the associated rules from a Repository, reads, transforms, validates and writes the processed data. Two engines can be executed in the transformation process. The Universal Transformation Language (UTL) Engine serves small to medium transformation volume needs, while the GRID Engine can concurrently process jobs across multiple machines. Sybase provides multiple connectors for accessing data. Each connector exposes three properties to the transformation bus: the object name, the object structure and a data stream. Connectors deliver sample data for interactive validation and metadata that describes the content they deliver. Connectors provide connectivity to: Oracle, Sybase, SybaseIQ, MySQL, MS SQLServer, Informix, ODBC, MS


Access, Postgres, MS Excel, Btrieve, Allbase, Image, Unidata, Universe, Paradox, Progress, Teradata, dBase, Ingres, Interbase, DB2/UDB, DB2/400, DB2/MVS, Flat Files, Log Files, FTP, HTTP, XML, Web Services, SOAP Services, HTML, SAP R/3, SAP BW and SAP NetWeaver.

Figure 144: TransformOnDemand Architecture
All metadata is stored in a centralized Repository, allowing the exchange and reuse of transformation rules across multiple teams and projects. The Repository also stores performance data that is collected by the engines during the transformation processes. This data can be analyzed using the Performance Analyzer application, which enables the administrator to automatically detect bottlenecks in the transformation chain. The Process Designer (Figure 145) enables the definition of data exchange / transformation between data sources without programming. The application allows the transformation flow to be represented visually and transformation components to be linked by graphical association. Data simulation features enable real-time testing and quality control during the definition process. The Process Designer consists of four major areas: the Navigator (top-left) allows the selection of Repositories, Projects or Jobs; the Design area (top-right) allows the user to design the project flow, drag components onto it, connect them to one another and simulate the data flow; the Properties area (lower-left) allows setting the properties of the component currently selected in the Design section; the Component Store (lower-right) contains all components that are installed and available for designing projects. A project consists of various components that can be arranged on the project design area by dragging them from the component store (i.e. Source / Transform / Processing / Destination) into the


workspace. Once a component is added, it can be customized through the definition of its properties.

Figure 145: Process Designer Main Window
To define a process flow, the user must first select a Data Provider from the Source tab of the Component Store and drag it to the Design area. Considering that the source is a relational table, the desired connectivity (e.g. ODBC), source system (e.g. a database) and data stream (e.g. a table) must be customized. Next follows the specification of the SQL query used to extract data, which can be created either manually or through the Query Designer graphical component. Using the Query Designer, the user can drag one or more tables from a table catalogue and then select the fields to add to the query. Through this graphical manipulation, non-experts can define which data shall be extracted without SQL knowledge. Alternatively, a flat file data source can also be defined (Figure 146). The component is loaded with a text sample and, according to the delimiter specification, a preview of the extracted data contents is displayed in tabular format below. A data target can be defined by selecting a Data Sink component from the Destination tab of the component store (for a database data source). This must be customized according to the database connectivity and target system. If the target table has been previously created it can be set in the component


properties; otherwise, the column structure must be defined (by default it will have the same structure as the data source).

Figure 146: Text Data Provider
Once the data flow is executed, the number of records read from the source is presented on the link between source and target. The user can right-click on the link and view the mapping definition, as well as simulate the mapping with a set of sample values extracted from the source. Transformations are added to the data flow by dragging transformation components onto the links (e.g. dropping a transformation on the link that connects source and target data components). Depending on the dragged component, specific wizards are triggered, e.g. Data Calculator (Figure 147) and Splitter (Figure 148). The Data Calculator enables the user to apply complex transformation rules to one or multiple fields in the process flow. As soon as a transformation expression (e.g. upper case conversion) is defined, the input buffer is processed to display the transformation result.
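A minimal sketch of this immediate-feedback behaviour follows: an expression is applied to a small sample buffer and the result is shown next to the input, which is essentially what the Data Calculator preview does. The expressions here are plain Python callables and the field names are invented; they are not the tool's own expression language.

# Sketch of transformation expressions applied to a preview buffer.
sample_buffer = [
    {"product": "widget", "price": "12.5"},
    {"product": "gadget", "price": "7.0"},
]

# Expressions comparable to "upper case conversion" in the Data Calculator.
expressions = {
    "product": str.upper,
    "price": lambda value: f"{float(value):.2f} EUR",
}

for record in sample_buffer:
    preview = {field: expressions.get(field, lambda v: v)(value)
               for field, value in record.items()}
    print(record, "->", preview)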



Figure 147: Data Calculator
The Splitter transformation enables feeding more than one target with the data available in the process flow. In the Data Splitter window the upper part displays sample data, while the lower part shows the output ports, the split conditions and the current port status. A splitter can have any number of output ports.
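The routing logic of such a splitter can be pictured with the following Python sketch, in which each output port owns a condition and every record is sent to the ports whose conditions it satisfies. The field names and conditions are invented for illustration.

# Sketch of splitter routing: records go to the ports whose condition they meet.
records = [
    {"product_name": "3D printer", "qty": 2},
    {"product_name": "1TB disk", "qty": 5},
]

ports = {
    "port_one": lambda r: r["qty"] >= 5,
    "port_two": lambda r: r["qty"] < 5,
}

routed = {name: [] for name in ports}
for record in records:
    for name, condition in ports.items():
        if condition(record):
            routed[name].append(record)

print(routed)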

Figure 148: Splitter Window


In the previous example data is being split according to the condition product name > "2" on port one and product name < "2" on port two. The green / red colours indicate in which port the record selected in the upper panel will be placed. When a project is complete and ready to be processed, it can be executed manually or integrated in a Job schedule. A Job allows one or multiple projects to be set up (Figure 149).

Figure 149: Job Definition
While projects require some kind of user interaction during simulation or execution, a job can be scheduled to run without user interaction. Depending on the success or failure of a project within a job, the data flow can be controlled through conditional branching. The components for Job execution are also listed in the component store. Every Job is started by a Start component, which is integrated with the operating system task scheduler. When defining a job, a project behaves as any other component that can be dragged into the Design area and customized in the Properties area. If multiple projects are placed independently of one another they will be executed in parallel. To enforce the completion of each preceding component, the Synchronizer component must be added. Input ports of a Synchronizer can be declared to be critical, which causes an overall Job failure if the Project connected to the input Port fails. Every Job component can contain pre- or post-processing routines. The Runtime Manager (Figure 150) manages jobs and gives the administrator an overview of the currently scheduled job tasks. As the Runtime Manager is

based on the operating system task-scheduling manager, it will contain any scheduled task defined on the system.
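The parallel-execution and synchronisation behaviour described above can be sketched with Python's concurrent.futures; the project functions and the notion of "critical" inputs are stand-ins for the Job components, not Sybase APIs.

# Sketch: independent projects run in parallel; a synchronizer fails the job
# when a project connected to a critical input port fails (names invented).
from concurrent.futures import ThreadPoolExecutor

def project_a():
    return "loaded customers"

def project_b():
    raise RuntimeError("source file missing")

projects = {"A": (project_a, True),    # True marks a critical input port
            "B": (project_b, False)}

def run_job():
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(func) for name, (func, _) in projects.items()}
    for name, future in futures.items():
        critical = projects[name][1]
        if future.exception() is not None and critical:
            return f"job failed: critical project {name} failed"
    return "job completed (non-critical failures are tolerated)"

print(run_job())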

Figure 150: Runtime Manager

1.9 Space Environment Information System for Mission Control Purposes


The space environment and its effects are being progressively taken into account in spacecraft manufacture, from the earliest design phases until the operational state is reached. New interplanetary missions, state-of-the-art payloads and the scientific and navigation missions in a highly dynamic space environment need a better understanding of the space environment to assure the mission performance within the design margins. Most of the space environment effects on the spacecraft are considered during the design phases - long-term effects or temporal effects - since they can be predicted with acceptable accuracy thanks to space environment models and tools developed for this purpose. This is the case of SPENVIS [106], which provides users access to several models and tools to produce a space environment specification for any space mission (e.g. particle fluxes, atomic oxygen degradation). On the other hand, during the operational phase of the spacecraft, some anomalies due to the space environment or to unpredictable space weather events can occur and affect the spacecraft behaviour. These anomalies are mainly originated by solar activity (i.e. Solar Proton Events, Coronal Mass Ejections), whose effects are not evaluated during the system design with the same level of accuracy as the effects produced by the well-known space environment. Solar events and their effects are predicted with difficulty, and the spacecraft anomalies produced by the space environment are not always attributed to it, due to the lack of proper operational tools able to integrate and correlate space environment information and housekeeping data of the spacecraft simultaneously.


Scientific and navigation spacecraft, orbiting in the Medium Earth Orbit (MEO) environment, are a good example of how space environment models cannot be as realistic as in other orbits, due to the high variation of the environment at this altitude. The continuous operation of the payload of these spacecraft is a critical issue due to the nature of the products they supply. Better space environment models describing the dynamics of MEO orbits can help to design the spacecraft system and predict its behaviour with better accuracy. Good access to space environment and space weather databases, together with a deep knowledge of the space environment design (i.e. spacecraft shielding information, radiation testing data, house-keeping telemetry designed to monitor the behaviour of the spacecraft systems versus the space environment effects), will help to increase the lifetime of spacecraft missions and improve the construction of the next generation of spacecraft. Such data integration systems require both real-time and historical data from multiple sources (possibly with different formats) that must be correlated by a space-domain expert, through visual inspection, using monitoring and reporting tools. This is the main principle of the Space Environment Information System for Mission Control Purposes (SEIS) project 13. The SEIS project has been developed for the European Space Agency (ESA) [107] by UNINOVA [108] as prime contractor and DEIMOS Engenharia [109] as subcontractor. The author participated in the project's implementation as a UNINOVA team member and was responsible for the partial definition of metadata ETL scripts for processing input data files that were found relevant in the project's scope.

1.9.1 Objectives

SEIS is a multi-mission decision support system capable of providing near real-time monitoring [110] and visualization, in addition to offline historical analysis [111], of Space Weather (S/W) and Spacecraft (S/C) data, events and alarms to the Flight Control Teams (FCT) responsible for the International Gamma-Ray Astrophysics Laboratory (Integral), Environmental Satellite (Envisat) and X-Ray Multi-Mission (XMM) satellites. Since the Integral S/C has been selected as the reference

13 This is also the main principle for the Space Environment Support System for Telecom and Navigation Systems (SESS) project.


mission, all SEIS services, offline and online, will be available for it, while the Envisat and XMM teams will only benefit from a fraction of all the services available for the Integral mission. The following list outlines SEIS's core services:
o Reliable Space Weather and Spacecraft data integration;
o Inclusion of Space Weather and Space Weather effects estimations generated by a widely accepted collection of physical Space Weather models;
o Near real-time alarm-triggered events, based on rules extracted from the Flight Operations Plan (FOP) which capture users' domain knowledge;
o Near real-time visualization of ongoing Space Weather and Spacecraft conditions through the SEIS Monitoring Tool (MT) [110];
o Historical data visualization and correlation analysis (including automatic report design, generation and browsing) using state-of-the-art OLAP client/server technology - the SEIS Reporting and Analysis Tool (RAT) [111].

1.9.2 Architecture

In order to provide users with the previously mentioned set of services, the system architecture depicted in Figure 151 was envisaged.

Figure 151: SEIS system architecture modular breakdown
The SEIS architecture (depicted in Figure 151) is divided into several modules according to their specific roles:
o Data Processing Module (DPM): Is responsible for the file retrieval, parameter extraction and further transformations applied to all identified data,


ensuring it meets the online and offline availability constraints, whilst having reusability and maintainability issues in mind (further detailed in the following section);
o Data Integration Module (DIM): Acts as the system's supporting infrastructure database, providing high quality integrated data services to the SEIS client applications, using three multi-purpose databases: the Data Warehouse (DW) [13], the Operational Data Storage (ODS) and the Data Mart (DM);
o Forecasting Module (3M): A collection of forecast and estimation model components capable of generating Space Weather [112] and Spacecraft data estimations. Interaction with any of these models is accomplished using remote Web Services invocation, which relies on XML message-passing mechanisms;
o Metadata Module (MR): SEIS is a metadata-driven system, incorporating a central metadata repository [113] that provides all SEIS applications with means of accessing shared information and configuration files;
o Client Tools: The SEIS system comprises two client tools, which take advantage of the collected real-time and historical data: the SEIS Monitoring Tool and the SEIS Reporting and Analysis Tool, respectively.

1.9.3 Data Processing Module

The Data Processing Module integrates three components (UDAP, UDET and UDOB) that act as a pipeline for data processing. An explanation of the functionality of each component follows.

1.9.3.1 UDAP

The Uniform Data Access Proxy (UDAP) [114] is the primary interface with the external data distribution services, acting as the starting point of the data processing chain composed of UDAP, UDET and UDOB. UDAP is responsible for retrieving all input files from the different data service providers' locations via the HTTP and FTP protocols, being able to cope with remote service availability failures and performing recovery actions whenever possible. This component is also responsible for preparing, invoking (through a web service layer) and processing the space weather and spacecraft model data outputs generated by the estimation and forecasting 3M application. All the retrieved data is afterwards stored in a local file cache repository for backup purposes. Once stored, files are immediately sent for processing to UDET.
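The retrieval-with-recovery behaviour can be sketched as follows. The URL, retry policy and cache location are illustrative assumptions, and the real UDAP also handles FTP and the 3M web-service interaction, which are omitted here.

# Sketch of resilient file retrieval with a local cache (illustrative only).
import time
import urllib.request
from pathlib import Path

CACHE = Path("file_cache")

def retrieve(url: str, retries: int = 3, delay_s: float = 10.0) -> Path:
    CACHE.mkdir(exist_ok=True)
    target = CACHE / url.rsplit("/", 1)[-1]
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                target.write_bytes(response.read())
            return target          # stored locally, ready to be handed to UDET
        except OSError:
            if attempt == retries:
                raise              # recovery exhausted; report the failure
            time.sleep(delay_s)    # simple back-off before the next attempt

# retrieve("http://example.org/space_weather/latest.txt")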


This component also supports the addition, removal and update of its specified metadata in real-time, while the UDAP application is actually running. For maintainability and reusability purposes, these metadata definitions are stored in a centralized Metadata Repository. Although UDAP can be considered an engine for input file download, it has been integrated into a graphical application that enables the user to control file download at data service provider and input file level. Further, a graphical component is available that enables the visualization of all download actions performed by UDAP, as well as the responses of UDET to data processing requests. Besides visualization, this component also enables filtering and querying of logging data. The application has been implemented using Microsoft .NET [115] and Internet Information System [116] technologies.

1.9.3.2 UDET

The Unified Data Extractor and Transformer (UDET) [114] is the second component in the chain of the Data Processing pipeline. The main goal of UDET is data processing, which includes performing extraction and transformation activities according to user declarative definitions (File Format Definitions, FFD) for the online and offline data files received from UDAP. After processing, the results are sent to the respective UDOB (Uniform Data Output Buffer) offline or near real-time instance. The application has been implemented using Microsoft .NET and IIS technologies and can be executed in two ways:
o Web Service: Provides a transparent data processing mechanism, capable of accepting data processing requests and delivering processed data into the respective target UDOB. Since the processing tasks are mainly processor intensive, the deployment scenario should at least comprise two UDET-UDOB instances, one for processing and delivering near real-time data and the other for offline data processing;
o Portable library: The extraction and transformation logic has been gathered in a common package that may be used by external applications. The FFD Editor, capable of creating, editing and testing FFDs given an example input file, would be the main user application for this library. However, due to time constraints this application has not been developed in the scope of SEIS.
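The concrete FFD grammar is not reproduced here; the fragment below is only a hypothetical, simplified stand-in showing the kind of declarative, metadata-driven processing UDET performs. The real FFDs are XML documents with a richer vocabulary; the field names, transformation names and sample line are all invented.

# Hypothetical, simplified stand-in for an FFD-style declarative definition;
# the real FFDs are XML documents with a richer vocabulary.
ffd = {
    "delimiter": ";",
    "fields": [
        {"name": "timestamp", "index": 0, "transform": "strip"},
        {"name": "proton_flux", "index": 1, "transform": "to_float"},
    ],
}

TRANSFORMS = {
    "strip": str.strip,
    "to_float": float,
}

def process(line: str, definition: dict) -> dict:
    """Extract and transform one input line according to the declaration."""
    parts = line.split(definition["delimiter"])
    return {f["name"]: TRANSFORMS[f["transform"]](parts[f["index"]])
            for f in definition["fields"]}

print(process("2004-11-07T12:00 ; 354.2", ffd))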



1.9.3.3 UDOB

The Uniform Data Output Buffer (UDOB) [114] constitutes the endpoint component of the Data Processing Module, also commonly known as the Staging Area. The primary role of UDOB is to behave as an intermediate data buffer from which the same data is made available to the ODS, the DW or any other data-retrieving client. UDOB has been developed as a relational database component that is as generic as possible, since all SEIS client tools and services feed on data stored in relational database tables and/or multi-dimensional structures (which are also populated from data stored in relational databases). UDOB has been implemented using Microsoft .NET [115], IIS [116] and SQL Server [81] technologies.

1.9.4 Evaluation

Although the SEIS project (and all its underlying components) has been enormously successful in practice, it was a prototype system and some restrictions and simplifications were imposed in order to reach a functional implementation within the project's schedule. Thus, the SEIS Data Processing Module presented some shortcomings:
o The SEIS ETL solution was not independent from the operating system. The SEIS DPM architecture was based on proprietary Microsoft .NET and IIS technologies, which made the usage of the Windows operating system mandatory;
o Although using a declarative language suppressed the need for source code development, SEIS DPM did not follow a clear separation of concerns between domain expertise and computer-science expertise right from the project start. This resulted in a somewhat tangled solution;
o In SEIS all FFDs were created without any graphical support besides an XML editor, which required extensive XML knowledge from the domain user during the FFD definition task;
o No formal specification was defined for the supporting language besides the XML Schema representation;
o UDOB is too tightly coupled to the target data delivery database, implemented with Microsoft SQL Server. A generic interface should be available, abstracting any specific reference to the target database / application and thus promoting a possible reuse of the DPM package in other problems / domains;


o UDOB is not a feasible approach when dealing with a large set of data, where a relational staging area may become a performance bottleneck. In SEIS, when processing a massive set of historical telemetry data that would be inserted directly into the Data Warehouse, UDOB was found to be a major bottleneck. For this case a file-based approach would be more suitable than the relational scheme of UDOB. At the time, a new data processing pipeline had to be developed, removing UDOB and replacing it with a file-based output component. Processing time was reduced from several months to several days;
o Data quality mechanisms such as data typing and validation rules were missing in the SEIS declarative language. In case of a change in the provided file format, invalid data could be loaded into UDOB without raising any error at the UDET level;
o The SEIS supporting language was not extensible in terms of the definition of new transformations (a common change, directly dependent on the way data is formatted). If a new transformation was required, a direct change to DPM's core source code had to be performed;
o The scalability and performance of the DPM components must be improved for dealing with big volumes of textual data. The DPM solution was found to be scalable only for small text files (below 200KB); above this threshold performance started to degrade exponentially;
o Engine functionalities (i.e. retrieval and data processing) are not isolated from the presentation level (GUI). In UDAP, both layers were merged together in the same application, requiring further computational resources;
o In case of failure during a data processing task, SEIS DPM followed a passive approach, only registering the occurrence in a log file.

1.10 Conclusions

This report presented the current state of the art for the ETL domain. This discussion has been performed based on five main aspects:
o A theoretical overview of the domain in terms of the importance of data integration, approaches to data integration and involved technologies;
o Academic research effort for the ETL domain;
o Development effort produced by the open source community regarding ETL;
o ETL commercial applications;


o The author's previous experience with a custom ETL solution for a space domain project.
Based on this analysis some conclusions can be derived:
o Academic software prototypes are quite scarce (not to say nonexistent) and information is mainly available in scientific papers and journals. In most cases, the presented work does not refer to a complete ETL software solution but focuses on particular features of ETL, mainly related to automatic learning;
o Open source software can be used freely and in some cases presents a suitable set of ETL tools and capabilities, although still far from the capabilities of commercial ETL solutions. Special care must be taken regarding the stability of the application suite's development as well as the community support for the integration tool;
o Commercial application suites are very complete, not only in terms of ETL capabilities but also regarding complementary tools (e.g. data profiling, grid computing) that are also property of the ETL tool vendor. Depending on whether the ETL tool vendor is simultaneously a DBMS vendor or not, different approaches to ETL may be followed (ETL versus ELT versus ETLT). However, most commercial solutions can be generalized to a metadata-based architecture where metadata is seamlessly generated by graphical client tools, interpreted and executed by some kind of engine, and resides in a centralized metadata repository;
o Although most technical aspects of the ETL language are transparent to the user, technical expertise is still required (allied to domain expertise);
o Semi-structured data is considered a secondary source and is often even ignored. Wizards for semi-structured data can only be applied to a very limited set of files (e.g. fixed width, CSV);
o For database sources, transformation pipelines are usually very small (data is usually already normalized), which makes text data hard to transform;
o A business meta layer should be explored in the future, providing a further abstraction layer on top of the already existing data flow layer for the definition of ETL pipelines.



References
1. 2. 3. 4. 5. 6. Barlas, D., Gartner Ranks ETL. Line56 - The E-Business Executive Daily, 2003. Barlas, D. (2003) Motorola's E-Business Intelligence. Line56 - The E-Business Executive Daily Volume, Tools, E. ETL Tool Survey 2006-2007. 2007 [cited; Available from: http://www.etltool.com/etlsurveybackground.htm. Vassiliadis, P. and A. Simitsis. Conceptual modelling for ETL processes. in DOLAP. 2002. Vassiliadis, P., A. Simitsis, and S. Skiadopoulos. On the Logical Modelling of ETL Processes. in CAiSE 2002. 2002. Toronto, Canada. Simitsis, A. and P. Vassiliadis. A Methodology for the Conceptual Modelling of ETL Processes. in Decision Systems Engineering Workshop (DSE'03) in conjunction with the 15th Conference on Advanced Information Systems Engineering (CAiSE '03). 2003. Klagenfurt, Austria. Vassiliadis, P. A Framework for the Design of ETL Scenarios. in 15th Conference on Advanced Information Systems Engineering (CAiSE '03). 2003. Klagenfurt, Austria. Vassiliadis, P. A generic and customizable framework for the design of ETL scenarios. in Information Systems. 2005. Vassiliadis, P., A. Simitsis, and S. Skiadopoulos. Modelling ETL Activities as Graphs. in 4th International Workshop on the Design and Management of Data Warehouses (DMDW'2002) in conjunction with CAiSE02. 2002. Toronto, Canada. Galhardas, H., et al. Declarative Data Cleaning: Language, Model and Algorithms. in The 27th VLDB Conference. 2001. Rome, Italy. Informatica. How to Obtain Flexible, Cost-effective Scalability and Performance through Pushdown Processing. 2007 [cited. Ltd, E.S., Transformation Manager: Meta-data Driven Flexible Data Transforms For any Environment. 2005. Kimball, R. and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Multidimensional Modelling, ed. Wiley. 2002. White, C. Data Integration: Using ETL, EAI and EII Tools to Create an Integrated Enterprise. 2005 [cited. Linstedt, D. ETL, ELT - Challenges and Metadata. 2006 [cited; Available from: http://www.b-eye-network.com/blogs/linstedt/archives/2006/12/etl_elt_challen.php. Vassiliadis, P., et al. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. Vassiliadis, P., et al., ARKTOS: towards the modeling, design, control and execution of ETL processes. Information Systems 26, 2001. Galhardas, H., et al. AJAX: An Extensible Data Cleaning Tool. in SIGMOD 2000. 2000. Popa, L., et al., Mapping XML and Relational Schemas with Clio. 2002. Haas, L., et al. Clio Grows Up: From Research Prototype to Industrial Tool. in SIGMOD 2005. 2005. Baltimore, Maryland, USA. Borkar, V., K. Deshmukh, and S. Sarawagi. Automatic segmentation of text into structured records. in SIGMOD 2001. 2001. Santa Barbara, California, USA. Manchester, T.U.o., K. University, and U.o. Durham. Integration Broker for Heterogeneous Information Sources. 2004 [cited; Available from: http://www.informatics.manchester.ac.uk/ibhis/. Cali, A., et al. IBIS: Semantic Data Integration at Work. in CAiSE 2003. 2003. Klagenfurt/Velden, Austria. Dunemann, O., et al. A Databaser-Support Workbench for Information Fusion: InFuse. Heiko Mller, J.-C.F. Problems, Methods, and Challenges in Comprehensive Data Cleansing. 2003. Lup, L.W. IntelliClean. 2000 [cited; Available from: http://www.comp.nus.edu.sg/~iclean/. Friedman-Hill, E. JESS, the Rule Engine for the Java Platform. 2007 [cited; Available from: http://www.jessrules.com/jess/index.shtml.

7. 8. 9.

10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22.

23. 24. 25. 26. 27.


Conclusions 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. Riley, G. CLIPS, A Tool for Building Expert Systems. 2006 [cited; Available from: http://www.ghg.net/clips/CLIPS.html. Adelberg, B. NoDoSE - A tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. 1998. Raman, V. and J. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. in 27th VLDB Conference. 2001. Roma, Italy. Raicevic, I., S. Milosevic, and A. Madl, Enhydra Octopus. 2006. Dutina, R., Z. Milakovic, and S. Milosevic. Enhydra Octopus Application. [cited. Jitterbit. Jitterbit 1.1. 2006 [cited; Available from: http://www.jitterbit.com/. Networks, K. KETL Project. 2007 [cited; Available from: http://www.ketl.org/en/home. Integration, P.D. Kettle Project. 2007 [cited; Available from: http://kettle.pentaho.org/. Gaffiero, M. Pequel ETL Data Transformation Engine. 2006 [cited; Available from: https://sourceforge.net/projects/pequel. Gaffiero, M. Pequel ETL 2.4-6 User Guide. 2006 [cited; Available from: http://search.cpan.org/src/GAFFIE/ETL-Pequel-2.4-6b/docs/pequeluserguide.pdf. Talend. Talend Open Studio. 2007 [cited; Available from: http://www.talend.com/. Talend. Talend Open Studio User's Guide. 2007 [cited. JasperSoftware. JasperETL. 2007 [cited; Available from: http://jasperforge.org/sf/projects/jasperetl. Group, M. ETL Tools METAspectrum Evaluation. 2004 [cited; Available from: http://www.sas.com/offices/europe/czech/technologies/enterprise_intelligence_platfor m/Metagroup_ETL_market.pdf. Friedman, T. and B. Gassman. Magic Quadrant for Extraction, Transformation and Loading. 2005 [cited; Available from: http://mediaproducts.gartner.com/reprints/oracle/127170.html. BusinessObjects. Data Integrator. 2006 [cited; Available from: http://www.businessobjects.com/products/dataintegration/dataintegrator/default.asp. BusinessObjects, BusinessObjects Data Cleansing. 2007. BusinessObjects. Busines Objects Composer. 2007 [cited; Available from: http://www.businessobjects.com/products/dataintegration/composer.asp. BusinessObjects, Info Sheet - BusinessObject Composer. 2007. Cognos. Cognos DecisionStream. 2006 [cited; Available from: http://www.cognos.com/products/business_intelligence/data_preparation/index.html. Cognos, The Right Architecture for Business Intelligence. 2007. Cognos. Cognos Data Integration. 2007 [cited. Cognos. Data Preparation with Cognos Business Intelligence. 2007 [cited. DataMirror. DataMirror Transformation Server. 2006 [cited; Available from: http://www.datamirror.com/products/tserver/default.aspx. DataMirror. Managing your data the XML way: Data transformation, exchange and integration. 2007 [cited. DataMirror, High-Performance Data Replication. 2007. DataMirror. Real-Time Event Awareness and Data Integration. 2007 [cited. Laboratory, D.S. Visual Importer Professional. 2006 [cited; Available from: http://www.dbsoftlab.com/e107_plugins/content/content.php?content.50. Laboratory, D.S. Visual Importer Professional & Enterprise User Manual. 2006 [cited. Technologies, D. Denodo Platform. 2006 [cited; Available from: http://www.denodo.com/. Pan, A., et al. The Denodo Data Integration Platform. in 28th VLDB Conference. 2002. Hong Kong, China. Denodo, Denodo Virtual DataPort 3.0 - Data Sheet. 2007. Embarcadero. DT/Studio. 2006 [cited; Available from: http://www.embarcadero.com/products/dtstudio/index.html. Technologies, E. ETL-based Integration for All of Us. 2003 [cited. ETI. ETI Solution v5. 
2006 [cited; Available from: http://www.eti.com/products/solutions.html. Inc, E.T.I. Data Integration Management - ETI Solution vs. Engine-based Data Integration Products. 2006 [cited.

42.

43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63.


Conclusions 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. Inc, E.T.I., 8 Reasons to Consider ETI Built-to-Order Integration. 2006. Ltd, E.S. Transformation Manager Feature Summary. 2006 [cited. Software, G. Data Flow. 2007 [cited; Available from: http://www.g1.com/Products/Data-Integration/. Software, G., Data Flow Data Integration Solution - Turn Disparate Data into Valuable Information. 2007. Software, G. Sagent Data Flow from Group 1 Software: an extract from the Bloor Research report, Data Integration, Volume 1. 2007 [cited. Corporation, O.T. Hummingbird Genio - Data Sheet. 2007 [cited. Corporation, O.T., Hummingbird Integration Suite - A Technical Overview. 2007. IBM. WebSphere Application Server. 2006 [cited. Ascential, IBM WebSphere DataStage TX. 2007. IBM. IBM WebSphere DataStage Version 7.5.2 - Transformation using MapStage. 2007 [cited. Ascential. IBM WebSphere DataStage. 2007 [cited. Informatica. PowerCenter 8. 2006 [cited; Available from: http://www.informatica.com/products/powercenter/default.htm. Informatica. Enterprise Data Integration. 2007 [cited. iWay. DataMigrator. 2007 [cited; Available from: http://www.iwaysoftware.com/index.html. IWay, Data Integration Solutions. 2007. iWay. dm71demo. 2007 [cited. iWay. iWay Adapter Administration for UNIX, Windows, OpenVMS, OS/400, OS/390 and z/OS. 2007 [cited. Microsoft. Microsoft SQL Server 2005. 2006 [cited; Available from: http://www.microsoft.com/sql/default.mspx. Microsoft. Microsoft SQL Server 2005 - Integration Services. 2005 [cited. Hathi, K. An Introduction to SQL Server 2005 Integration Services. 2005 [cited. Russom, P., C. Moore, and C. Teubner. Microsoft Addresses Enterprise ETL. 2005 [cited. Oracle. Warehouse Builder. 2006 [cited; Available from: http://www.oracle.com/technology/products/warehouse/index.html. Oracle. User's Guide 10g Release 1 (10.1). 2003 [cited. Oracle. Transformation Guide 10g Release 1 (10.1). 2003 [cited. Oracle. Installation and Configuration Guide 10g Release 1 (10.1). 2004 [cited. Oracle. Oracle By Example (OBE) - Oracle Warehouse Builder 10g Release 2. 2007 [cited; Available from: http://www.oracle.com/technology/obe/admin/owb10gr2_gs.htm. Software, P. Data Integrator. 2006 [cited; Available from: http://www.pervasive.com/dataintegrator/. Software, P. Pervasive Integration Architecture. 2005 [cited. Software, P. Pervasive Data Integrator - Product Sheet. 2007 [cited. Software, P. Evaluator's Guide: Pervasive Integration Products. 2004 [cited. SAS. SAS Data Integration. 2006 [cited; Available from: http://www.sas.com/technologies/dw/index.html. Institute, S., SAS 9.1.3 ETL Studio: User's Guide. 2004. SAS, SAS ETL Studio - Fact Sheet. 2007. Studio, S. Stylus Studio 2007. 2006 [cited; Available from: http://www.stylusstudio.com/. Sunopsis. Sunopsis Data Conductor. 2006 [cited; Available from: http://www.sunopsis.com/corporate/us/products/sunopsis/snps_dc.htm. Consulting, R.M. Sunopsis Data Conductor : Creating an Oracle Project. 2006 [cited; Available from: http://www.rittmanmead.com/2006/11/16/sunopsis-dataconductor-creating-an-oracle-project/. Consulting, R.M. Moving Global Electronics Data using Sunopsis. 2006 [cited; Available from: http://www.rittmanmead.com/2006/11/30/moving-global-electronicsdata-using-sunopsis/. Consulting, R.M. Getting Started with Sunopsis Data Conductor. 2006 [cited; http://www.rittmanmead.com/2006/11/10/getting-started-withAvailable from: sunopsis-data-conductor/.

90. 91. 92. 93. 94. 95. 96. 97. 98. 99.

100.

101.


Conclusions 102. 103. 104. 105. 106. 107. 108. 109. 110. Sunopsis Is ETL becoming obsolete? Why a Business-Rules-Driven "E-LT" Architecture is better. Volume, Solonde. TransformOnDemand. 2006 [cited; Available from: http://cms.solonde.com/cms/front_content.php?idcat=31. Solonde, Information Integration Architecture. 2007. Solonde. TransformOnDemand User Guide. 2007 [cited. ESA. SPENVIS - Space Environment Information System. 2006 [cited; Available from: http://www.spenvis.oma.be/spenvis/. ESA. ESA. 2007 [cited; Available from: http://www.esa.int. UNINOVA. Instituto de Desenvolvimento de Novas Tecnologias. 2006 [cited; Available from: http://www.uninova.pt/website/. Engenharia, D. Deimos Engenharia. 2006 [cited; Available from: http://www.deimos.pt/. Moura-Pires, J., M. Pantoquilho, and N. Viana. Space Environment Information System for Mission Control Purposes: Real-Time Monitoring and Inference of Spacecraf Status. in 2004 IEEE Multiconference on CCA/ISIC/CACSD. 2004. Taipei, Taiwan. Pantoquilho, M., et al. SEIS: A Decision Support System for Optimizing Spacecraft Operations Strategies. in IEEE Aerospace Conference. 2005. Montana, USA. Belgian Institute for Space Aeronomy, Space Applications Services, and P.S. Institute. SPENVIS - Space Environment Information System. 1998 [cited 2004; Available from: http://www.spenvis.oma.be/spenvis/. Ferreira, R., et al. XML Based Metadata Repository for Information Systems. in EPIA 2005 - 12th Portuguese Conference on Artificial Intelligence, Covilh. 2005. Covilh, Portugal. Viana, N., Extraction and Transformation of Data from Semi-structured Text Files in Real-Time, in Departamento de Informtica. 2005, UNL/FCT Universidade Nova de Lisboa / Faculdade de Cincias e Tecnologia: Caparica. Microsoft. Microsoft .NET. 2006 [cited; Available from: http://www.microsoft.com/net/default.mspx. Microsoft. Microsoft Internet Information Services. 2003 [cited; Available from: http://www.microsoft.com/windowsserver2003/iis/default.mspx.

111. 112.

113.

114.

115. 116.

