
Road map to ETL / TALEND

Agenda

 Why ETL?

 What is ETL – Desirable Features

 Various ETL Tools available

 Talend – Introduction
 various versions

 Use of Talend as

Why ETL?

 ETL tools are used to manage data problems such as the following:

 Data is scattered across different locations

 Data is stored in different types of sources

 The volume of data (structured, semi-structured and unstructured) keeps growing over time

What is ETL?

ETL Tools - About

 ETL stands for Extract, Transform and Load. An ETL tool extracts data from heterogeneous
source systems (typically RDBMSs), transforms it, for example by applying calculations or
concatenating fields, and then loads it into a data warehouse (DW) system. The data is loaded
into the DW in the form of dimension and fact tables.

 The ETL process alone is often estimated to consume about 70 percent of the effort in data warehousing

 Popular ETL tools:


 IBM InfoSphere DataStage, Informatica, Ab Initio, SSIS, Pentaho, CloverETL, SAS

 Features to consider when choosing an ETL tool:


 Data Connectivity
 Performance
 Transformation Flexibility
 Pricing
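
The extract, transform and load steps described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class EtlSketch, the semicolon-delimited input format, and the in-memory "dimension" and "fact" structures are hypothetical stand-ins, not part of Talend or any other ETL tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of an ETL flow; not generated by any ETL tool.
class EtlSketch {

    // One transformed row: a concatenated customer name and a calculated amount.
    static final class Sale {
        final String customer;
        final int amount;
        Sale(String customer, int amount) { this.customer = customer; this.amount = amount; }
    }

    // Extract: split source lines of the assumed form "first;last;quantity;unitPrice".
    static List<String[]> extract(List<String> sourceLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : sourceLines) {
            rows.add(line.split(";"));
        }
        return rows;
    }

    // Transform: concatenate the name fields and calculate quantity * unitPrice.
    static List<Sale> transform(List<String[]> rows) {
        List<Sale> sales = new ArrayList<>();
        for (String[] r : rows) {
            int amount = Integer.parseInt(r[2]) * Integer.parseInt(r[3]);
            sales.add(new Sale(r[0] + " " + r[1], amount));
        }
        return sales;
    }

    // Load: give each distinct customer a surrogate key (dimension table)
    // and store {customerKey, amount} pairs (fact table).
    static void load(List<Sale> sales, Map<String, Integer> customerDim, List<int[]> salesFact) {
        for (Sale s : sales) {
            Integer key = customerDim.get(s.customer);
            if (key == null) {
                key = customerDim.size() + 1;
                customerDim.put(s.customer, key);
            }
            salesFact.add(new int[] {key, s.amount});
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("Ada;Lovelace;2;10", "Alan;Turing;1;25", "Ada;Lovelace;3;5");
        Map<String, Integer> customerDim = new LinkedHashMap<>();
        List<int[]> salesFact = new ArrayList<>();
        load(transform(extract(source)), customerDim, salesFact);
        System.out.println("Dimension rows: " + customerDim);
        System.out.println("Fact row count: " + salesFact.size());
    }
}
```

In this sketch, the repeated customer is stored once in the dimension map, while every sale becomes its own fact row that points at the customer's key.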
TALEND - What it is
 Talend is an open-source data integration platform and vendor that helps companies make better
real-time, sales-trend and business management decisions.

 Talend is a software integration platform that provides solutions for data integration, data
quality, data management, data preparation and Big Data.

 Talend also ships plugins that make it easy to integrate with the Big Data ecosystem.

 Gartner has placed Talend in the Leaders quadrant of its Magic Quadrant for Data Integration Tools.

 A Talend component is a functional piece that is used to perform a single operation.

Talend – System Requirements

 Recommended Operating system:


 Microsoft Windows 10
 Ubuntu 16.04 LTS
 Apple macOS 10.13/High Sierra

Memory Requirement
Memory - Minimum 4 GB, Recommended 8 GB

Storage Space - 30 GB

In addition, you need an up-and-running Hadoop cluster (preferably Cloudera).

Note: Java 8 must be installed, with the environment variables already set
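
One way to confirm the Java prerequisite is a small check like the one below. It is a minimal sketch: the class name JavaEnvCheck is hypothetical, and treating JAVA_HOME as the relevant environment variable is an assumption, since the slide does not name the variables.

```java
// Hypothetical helper for checking the Java prerequisite; not part of Talend.
class JavaEnvCheck {

    // Java 8 reports "1.8" as its specification version; later releases report "9", "11", and so on.
    static boolean isJava8(String specVersion) {
        return "1.8".equals(specVersion);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        // Assumption: JAVA_HOME is the environment variable of interest.
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("Java specification version: " + spec
                + (isJava8(spec) ? " (Java 8)" : " (not Java 8)"));
        System.out.println("JAVA_HOME: " + (javaHome == null ? "not set" : javaHome));
    }
}
```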

Types of Components: Part 1 of 3

 File
 The File family groups components that read and write data in all types of files, from the most popular to the most
specific format (in the Input and Output subfamilies). In addition, the Management subfamily groups File-
dedicated components that perform various tasks on files, including unarchiving, deleting, copying, comparing
files and so on.
 E.g: tFileCompare, tFileExists, tFileInputDelimited, tFileInputExcel, tFileList, tFileUnarchive etc.

 Processing
 The Processing family gathers components that help you to perform all types of processing tasks on data flows,
including aggregation, mapping, transformation, denormalizing, filtering and so on.
 E.g: tAggregateRow, tFilterRow, tNormalize etc

 Database
 The Databases family groups most popular database connectors. These connectors cover various needs
including opening connections, reading and writing tables, committing transactions as a whole, and
performing rollbacks for error handling. More than 40 RDBMSs are natively supported.
 E.g: tSybaseConnection, etc
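
To make the Processing family more concrete, the sketch below shows the kind of row filtering and aggregation such components perform, written as plain Java. It is an illustration only: the class ProcessingSketch, the Order row type and both method names are hypothetical, and this is not the code that tFilterRow or tAggregateRow generates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of filtering and aggregating a data flow; not Talend-generated code.
class ProcessingSketch {

    // Hypothetical row type: one order with a region and an amount.
    static final class Order {
        final String region;
        final int amount;
        Order(String region, int amount) { this.region = region; this.amount = amount; }
    }

    // Filtering step: keep only rows whose amount reaches the threshold.
    static List<Order> filterRows(List<Order> rows, int minAmount) {
        return rows.stream()
                .filter(o -> o.amount >= minAmount)
                .collect(Collectors.toList());
    }

    // Aggregation step: group rows by region and sum their amounts.
    static Map<String, Integer> aggregateByRegion(List<Order> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(o -> o.region, TreeMap::new, Collectors.summingInt(o -> o.amount)));
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("EU", 120), new Order("US", 40), new Order("EU", 80), new Order("US", 200));
        System.out.println(aggregateByRegion(filterRows(orders, 50)));
    }
}
```

Chaining the two methods mirrors how components are connected in a Job: the output flow of one step becomes the input flow of the next.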

Types of Components: Part 2 of 3
 Business Intelligence components
 The BI family groups connectors that cover needs such as reading or writing multidimensional or
OLAP databases, outputting Jasper reports, tracking database changes in slowly changing dimension
(SCD) tables and so on.
E.g: tDB2SCD, tIngresSCD, tMSSqlSCD, tMysqlSCD, tOracleSCD, tSybaseSCD etc

 The ELT family groups the most popular database connectors, as well as processing components, all
dedicated to the ELT mode, where the target DBMS becomes the transformation engine. This
mode supports Teradata, Oracle, Netezza, Dataupia, QuickFire, Datallegro and Vertica.
 E.g: tELTTeradataOutput, tELTFilterRows

 The System family groups components that help you interact with the operating system.
 E.g: tRunJob, tSetEnv, tSSH, tSystem

 The Custom Code family gathers components that cover specific, on-the-fly coding needs.
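
The ELT mode mentioned on this slide, where the target DBMS becomes the transformation engine, can be illustrated by a job that only builds SQL and leaves the actual work to the database. The sketch below is a hedged example: the class EltSketch, the method buildPushdownSql, and the table and column names are all hypothetical, and the statement is not what Talend's ELT components actually emit.

```java
// Illustration of the ELT idea: the job emits SQL and the target database does the work.
// Table and column names are hypothetical; this is not SQL generated by Talend.
class EltSketch {

    // Builds one INSERT ... SELECT statement so that filtering and aggregation
    // run inside the target DBMS instead of inside the integration job.
    static String buildPushdownSql(String targetTable, String stagingTable, int minAmount) {
        return "INSERT INTO " + targetTable + " (region, total_amount) "
                + "SELECT region, SUM(amount) FROM " + stagingTable
                + " WHERE amount >= " + minAmount
                + " GROUP BY region";
    }

    public static void main(String[] args) {
        System.out.println(buildPushdownSql("sales_summary", "staging_orders", 50));
    }
}
```

In an ETL-style job the same filter and sum would run in the job's own Java process; here the data never leaves the target database.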
Types of Components: Part 3 of 3

 The Business component family groups connectors that cover specific business
needs, such as reading from and writing to CRM or ERP databases and reading
from or writing to an SAP system.

E.g: tMicrosoftCRMInput, tSalesforceGetDeleted, tSAPConnection, tSugarCRMInput etc

TALEND vs Informatica PowerCenter
 Talend generates native Java code, so jobs can run anywhere Java is supported. PowerCenter, on the
other hand, generates metadata that is stored in an RDBMS repository and executed by its
proprietary engine.

 Talend is a code generator. It can run either as an ETL engine (on its own standalone server) or
as an ELT engine (natively on the target server). The Java code that Talend generates
can be run on any platform that supports Java: a server sitting in your data
center, the cloud, or even your laptop.

 While both platforms provide components that handle the majority of tasks required for data
integration, there are situations where something custom is required. This often means
custom coding, which is an arduous and inefficient process in PowerCenter.

 In Talend, by contrast, you can build your own custom components in Java and integrate them into
the Studio without any hassle.

Talend Products:

Talend Cloud

 Integrate data and people easily with Talend Cloud, a highly scalable and secure
managed cloud integration platform-as-a-service (iPaaS).

 Talend was cited as a Strong Performer in The Forrester Wave: Strategic iPaaS and
Hybrid Integration Platforms, Q1 2019

 Integration platform as a service (iPaaS) is a set of automated tools for connecting
software applications that are deployed in different environments. iPaaS is often used
by large business-to-business (B2B) enterprises that need to integrate on-premises
applications and data with cloud applications and data.
 Data Integration [First version launched in 2006]
 Develop 10x faster with a modern data integration and data quality platform for
relational databases, flat files, cloud apps and platforms
 Big Data Integration [First version launched in 2012]
 Unleash the potential of real-time and IoT analytics by leveraging the power of Spark
Streaming and machine learning.
 Cloud API Services
 Build, test, and deploy APIs up to 80% faster by eliminating the need to use several
tools or manually code.
 Data Catalog
 Create a central, governed catalog of enriched data that is documented automatically and can be shared and
collaborated on easily.

 Data Quality [First version launched in 2008]


 Leverage the full power and scale of big data with the leading data integration platform built on Spark for cloud
and on-premises.

 Data Preparation [First version launched in 2016]


 Transform how IT and users work together with governed access to self-service tools for discovering, cleansing,
and sharing data.

 Master Data Management [First version launched in 2009]


 Unleash the potential of data, applications, and processes by creating a single “version of the truth” for cloud, big
data, and mobile applications.

 Talend Data Fabric [First version launched in 2017]


 Meet all your integration needs with a single, unified platform

Introduction to Talend Open Studio for Big Data

 Talend provides unified development and management tools to integrate and process all of
your data with an easy-to-use visual designer.

 Built on top of Talend's data integration solution, the big data solution is a powerful
tool that enables users to access, transform, move and synchronize big data by
leveraging the Apache Hadoop Big Data Platform and makes the Hadoop platform
ever so easy to use.

 The Talend Open Studio for Big Data functional architecture is an architectural model
that identifies Talend Open Studio for Big Data functions, interactions and
corresponding IT needs. The overall architecture has been described by isolating
specific functionalities in functional blocks.

Functional Architecture of Talend Open Studio for Big Data

Functional Architecture of Talend Open Studio for Big Data - Overview
 From Talend Studio, you design and launch Big Data Jobs that leverage a Hadoop cluster to
handle large data sets. Once launched, these Jobs are sent to, deployed on and executed on this
Hadoop cluster.

 The Oozie workflow scheduler system is integrated within the Studio through which you can
deploy, schedule, and execute Big Data Jobs on a Hadoop cluster and monitor the execution
status and results of these Jobs.

 Talend Open Studio is a free open source ETL tool for Data Integration and Big Data.

 It is an Eclipse based developer tool and job designer.

 You only need to drag and drop components and connect them to create and run ETL or ELT
Jobs.

 TOS can be used easily to bridge between:


 Databases, mainframes, file systems, web services, packaged enterprise applications,
data warehouses, OLAP applications, SaaS and cloud-based applications etc

 The tool creates the Java code for the Job automatically, so you need not write a single line of
code. There are multiple options for connecting to data sources such as RDBMSs, Excel, SaaS and the Big
Data ecosystem, as well as apps and technologies like SAP, CRM, Dropbox and many more.

Talend Open Studio for BigData

© Talend 2011

Overview of Talend Open Studio

 The Repository is where all the resources — Folders, Jobs, Schema Definitions and
Connections, parameters and variables are defined.

 The Design Area (Workspace) is where Jobs are assembled.

 The Palette is a library of all components available.

 The Perspective determines the overall layout of the Studio and the arrangement of
the different areas within the Studio.

 A component is a snippet of Java code that is generated when a job is executed

Benefits of using TALEND Open Studio:
 Provides the features needed for data integration and synchronization, including about 900
components, built-in connectors, automatic conversion of Jobs to Java code and much more.

 The tool is completely free, so there are big cost savings.

 Over the last 12 years, many large organizations have adopted TOS for data integration, which
indicates a high level of trust in the tool.

 The Talend community for Data Integration is very active.

 Talend keeps adding features to these tools, and the documentation is well structured and
very easy to follow.

Summary

TALEND - Usecases

Virgin Mobile in UK – has a customer base of about 100 million

Groupon – Portal and E-commerce

Orange – Media and Entertainment

References:

 https://www.talend.com

