
1.

IBM Infosphere Data stage content


Contents

Introduction about Data stage


Difference between server jobs and Parallel jobs
Pipeline parallelism and Partition Parallelism
Partition Techniques
Configuration file
Processing Environment
Data Stage Client and Server Components

Data stage Designer

Introduction about Data stage Designer


Repository
Palette
Types of Links

File Stages

Sequential file
Data set
File set
Lookup file set

Database Stages
Oracle Enterprise
ODBC Enterprise
Dynamic RDBMS

Processing Stages

Aggregator
Change Apply
Change Capture
Compare
Compress
Copy
Decode
Difference
Encode
Expand
Filter
Funnel
Generic
Join
Look up
Merge
Modify

Pivot
Remove Duplicate
External Filter
Sort
Surrogate Key Generator
Switch
Transformer

Debugging Stages

Column Generator
Head
Peek
Row Generator
Sample
Tail

Datastage Manager
Introduction about Data stage Manager
Importing the Jobs
Exporting the Jobs

2. IBM Infosphere Information Analyzer content


Information Analyzer Overview

Overview about Information Analyzer Tabs

Home
Overview
Investigate
Data Quality
Operate

Types of Analysis in IA
Column Analysis

Overview
Frequency distribution
Domain and completeness
Data Class Analysis

Format Analysis

Base Line Analysis


Foreign Key and Cross Domain Analysis

Primary Key Analysis

Single Column Analysis


Multi Column Analysis

Data rules creation process

Defining Data Rule Definition


Setting Bench Marks
Data Rule Logic
Validation
Deriving Data rule from rule definition

Rule Sets Creation Process


Create List of Data rules
Defining Rule set Definition
Adding Data rules to rule set

Creating IA Projects

Importing Meta Data


Adding Data Source to IA project
Adding users to IA project
Adding Groups to IA project

Information Analyzer Projects Roles

Business Analyst
Data Steward
Data Operator
Drill Down user

Virtual Column Creation Process


Different Types of Reports in IA

Column Domain Report


Column Frequency Report
Data Rule Exception Report
Project Summary Report

======================================================

3. IBM InfoSphere QualityStage Content:


Introduction about Data Quality
Data Quality Issues
Rule Set Files

Classification Table
Dictionary File
Pattern Action File
Reference Tables
Override Tables

User-Defined Rule Sets Creation Process


Investigate Stage
Character Discrete:

Investigate Character Discrete with C Mask


Character Discrete with T Mask
Character Discrete with X Mask

Character Concatenate:
Word Investigate:

Word Investigate for Full Name


Word Investigate for Address
Word Investigate for Area

Standardize Stage

Standardize Country
Standardize Domain Preprocessing
Standardize Name
Standardize Address
Standardize Area

Match Frequency Stage


Data Rules stage
AVI (Address Verification Interface Stage)

DATA WAREHOUSE:
A data warehouse (DWH) is a collection of transactional and historical data that is maintained for analysis purposes.
Three types of tools are typically used on any data warehousing project:
1. ETL tools
2. OLAP (or reporting) tools
3. Modeling tools

ETL TOOL:
ETL stands for Extraction, Transformation, and Loading. An ETL developer (someone with data warehousing expertise) extracts data from heterogeneous databases or flat files, transforms the data from source to target (the DWH) by applying transformation rules, and finally loads the data into the DWH.
There are several ETL tools available in the market:
1. DataStage
2. Informatica
3. Ab Initio
4. Oracle Warehouse Builder
5. BODI (Business Objects Data Integration)
6. MSIS (Microsoft Integration Services)
OLAP:
OLAP stands for Online Analytical Processing; these tools are also called reporting tools.
An OLAP developer analyzes the data warehouse and generates reports based on selection criteria.
There are several OLAP tools available:
1. Business Objects
2. Cognos
3. ReportNet
4. SAS
5. MicroStrategy
6. Hyperion
7. MSAS (Microsoft Analysis Services)
MODELING TOOL:
Someone who works with a modeling tool such as ERwin is called a data modeler. A data modeler designs the database of the DWH with the help of such tools.
An ETL developer extracts data from source databases or flat files (.txt, .csv, .xls, etc.) and populates the DWH. While populating data into the DWH, staging areas may be maintained between source and target; these are called staging area 1 and staging area 2.

STAGING AREA:
A staging area is a temporary place used for cleansing unnecessary, unwanted, or inconsistent data.
Note: A data modeler can design a DWH in two ways:
1. ER Modeling
2. Dimensional Modeling
ER Modeling:
ER modeling stands for entity-relationship modeling. In this model the tables are always called entities, and the design may be in second normal form, third normal form, or somewhere between 2NF and 3NF.
Dimensional Modeling:
In this model the tables are called dimension tables or fact tables. It can be subdivided into three schemas:
1. Star Schema
2. Snowflake Schema
3. Multi-Star Schema (also called Hybrid or Galaxy)

Star Schema:
A fact table surrounded by dimension tables is called a star schema; the layout looks like a star.
If a star schema contains only one fact table, it is called a simple star schema.
If a star schema contains more than one fact table, it is called a complex star schema.

Sales Fact table:


Sale_id
Customer_id
Product_id
Account_id
Time_id
Promotion_id
Sales_per_day
Profit_per_day
Account Dimension:
Account_id
Account_type
Account_holder_name
Account_open_date
Account_nominee
Account_open_balance
Promotion:
Promotion_id
Promotion_type
Promotion_date
Promotion_designation
Promotion_Area

Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_startdate
Product_expdate
Product_maxprice
Product_wholeprice
Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name
Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_Year
DIMENSION TABLE:
If a table contains a primary key and provides detailed (master) information about an entity, it is called a dimension table.
FACT TABLE:
If a table contains mostly foreign keys together with transactional measures and provides summarized information, it is called a fact table.

DIMENSION TYPES:
Several dimension types are available.
CONFORMED DIMENSION:
If a dimension table is shared by more than one fact table (i.e. it has a foreign-key relationship with more than one fact table), it is called a conformed dimension.
DEGENERATE DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (i.e. maintains a foreign key in another fact table), it is called a degenerate dimension.
JUNK DIMENSION:
A junk dimension contains text values such as genders (male/female) and flag values (true/false) that are not useful for generating reports on their own.
DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in non-key attributes, the table is called a dirty dimension.
FACT TABLE TYPES:
There are 3 types of facts in a fact table:
1. Additive facts
2. Semi-additive facts
3. Non-additive facts
ADDITIVE FACTS:
Facts that can be summed across all of the dimensions of the fact table (for example, sales amount per day) are called additive facts.
SEMI-ADDITIVE FACTS:
Facts that can be summed across some dimensions but not all (for example, an account balance, which cannot be meaningfully summed across time) are called semi-additive facts.
NON-ADDITIVE FACTS:
Facts that cannot be summed across any dimension (for example, ratios or percentages) are called non-additive facts.
SNOWFLAKE SCHEMA:

A snowflake schema keeps normalized data in its dimension tables. In this schema some dimension tables do not maintain a direct relationship with the fact table; instead they are related to the fact table through another dimension table.

DIFFERENCE BETWEEN STAR SCHEMA AND SNOWFLAKE SCHEMA:

Star schema:
- Maintains denormalized data in the dimension tables.
- Performance is better when joining the fact table to the dimension tables, compared with a snowflake schema.
- All dimension tables maintain a direct relationship with the fact table.

Snowflake schema:
- Maintains normalized data in the dimension tables.
- Performance decreases when joining the fact table to a dimension table and on to its shrunken dimension tables, because more joins are required than in a star schema.
- Some dimension tables do not maintain a direct relationship with the fact table.

INTRODUCTION ABOUT DATA STAGE:

1. Data stage:
DataStage is a comprehensive ETL tool. In other words, DataStage is a data integration and transformation tool which enables the collection and consolidation of data from several sources, its transformation, and its delivery into one or multiple target systems.
ETL stands for Extraction, Transformation, Load:
E --> Extraction from any source
T --> Transformation (a rich set of transformation capabilities)
L --> Loading into any target
There are several ETL tools available in the market:
1. DataStage
2. Informatica
3. Ab Initio
4. Oracle Warehouse Builder
5. BODI (Business Objects Data Integration)
6. MSIS (Microsoft Integration Services)

History of Data stage:


The history begins in 1997, when the first version of DataStage was released by VMARK, a US-based company.
Lee Scheffler is regarded as the father of DataStage.
In those days DataStage was called Data Integrator.
In 1997 Data Integrator was acquired by a company called Torrent.
In 1999 the Informix company acquired Data Integrator from Torrent.
In 2000 Ascential acquired Data Integrator, and after that it was released as Ascential DataStage Server Edition.
Versions 6.0 to 7.5.1 supported only UNIX environments, because the server could be configured only on UNIX platforms.
In 2004 version 7.5.x2 was released, which supported server configuration on Windows platforms as well.
In December 2004 version 7.5.x2 shipped with the Ascential suite components:
ProfileStage,
QualityStage,
AuditStage,
MetaStage,
DataStage PX,
DataStage TX,
DataStage MVS.
These were all individual tools.
In February 2005 IBM acquired all the Ascential suite components, and IBM released IBM DS EE, i.e. the Enterprise Edition.
In 2006 IBM made some changes to IBM DS EE: ProfileStage and AuditStage were integrated into one component, and together with QualityStage, MetaStage, and DataStage PX the product became IBM WebSphere DS & QS 8.0.
In 2009 IBM released another version, IBM InfoSphere DS & QS 8.1.

Features Of Data stage:


There are 5 important features of DataStage, they are
- Any to Any,
- Platform Independent,
- Node Configuration,
- Partition Parallelism, and
- Pipeline Parallelism.

Any to Any:
DataStage can extract data from any source and load data into any target.

Platform Independent:
A job that can run on any type of processor is called platform independent.
DataStage jobs can run on three types of processors:
1. Uniprocessor
2. Symmetric Multiprocessor (SMP)
3. Massively Parallel Processor (MPP)

Node Configuration:
A node is a logical CPU, i.e. an instance of a physical CPU.
The process of creating these virtual CPUs (nodes) is called node configuration.

Example:
Suppose an ETL job has to process 1000 records.
On a uniprocessor it takes 10 minutes to process the 1000 records,
but on an SMP system the same job takes about 2.5 minutes.

Difference between server jobs and parallel jobs:

Parallel jobs:
1. DataStage parallel jobs can run in parallel on multiple nodes.
2. Parallel jobs support partition parallelism (Round Robin, Hash, Modulus, etc.).
3. The transformer in parallel jobs compiles in C++.
4. Parallel jobs run on more than one node.
5. Parallel jobs run on the UNIX platform.
6. The major difference is at the job architecture level: parallel jobs process data in parallel. They use the configuration file to know how many nodes (CPUs) are defined for parallel processing.
Server jobs:
1. DataStage server jobs do not run on multiple nodes.
2. Server jobs do not support partition parallelism (Round Robin, Hash, Modulus, etc.).
3. The transformer in server jobs compiles in BASIC.
4. Server jobs run on only one node.
5. Server jobs run on the UNIX platform.
6. The major difference is at the job architecture level: server jobs process data in sequence, one stage after another.

Configuration File:
What is a configuration file, and what is its use in DataStage?
It is a normal text file. It holds information about the processing and storage resources that are available for use during parallel job execution.
The default configuration file contains entries such as:
Node - a logical processing unit which performs the ETL operations.
Pools - a named collection of nodes.
Fastname - the server (host) name; jobs are executed on the host with this name.
Resource disk - a permanent storage area where data set files are stored.
Resource scratchdisk - a temporary storage area where staging operations (for example sorting) are performed.
Configuration file:
Example:
{
  node "node1"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}

Note:
In a configuration file, no two nodes can have the same name.
At least one node must belong to the default node pool, which has the name "" (a zero-length string).

Pipeline parallelism:
Pipe:
A pipe is a channel through which data moves from one stage to another.

Pipeline parallelism is the technique of processing extraction, transformation, and loading simultaneously: a downstream stage starts consuming records as soon as the upstream stage produces them, instead of waiting for the whole data set.
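As a rough analogy only (this is ordinary UNIX shell, not DataStage syntax, and the file name and column layout are assumed purely for illustration), a command pipeline shows the same idea: the three steps below run as concurrent processes, and each record flows to the next step as soon as it is produced.

# emp.txt is an assumed comma-separated input file: id,name,salary
# extract (cat), transform (awk upper-cases the name column), and load
# (the redirection to emp_target.txt) all run at the same time
cat emp.txt | awk -F',' 'BEGIN{OFS=","} {$2=toupper($2); print}' > emp_target.txt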

Partition Parallelism:
Partitioning:
Partitioning is the technique of dividing the data into chunks (partitions) so that each chunk can be processed on a separate node.
DataStage supports 8 partitioning techniques.
Partitioning plays an important role in DataStage:
every stage in DataStage is associated with a default partitioning technique.
The default partitioning technique is Hash.
Note:
Selection of a partitioning technique is based on:
1. The data (volume, type)
2. The stage
3. The number of key columns
4. The key column data types
Partitioning techniques are grouped into two categories (a rough sketch of the difference follows the lists):
1. Key-based
2. Key-less
Key-based partitioning techniques:
1. Hash
2. Modulus
3. Range
4. DB2
Key-less partitioning techniques:
1. Random
2. Round Robin
3. Entire
4. Same
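Outside DataStage, the difference between the two categories can be sketched with a couple of shell commands (the file name, the numeric key in column 1, and the four-way split are assumptions made only for illustration):

# Key-based (modulus on the numeric key in column 1): every row with the same
# key value lands in the same partition file
awk -F',' '{ print > ("part_" ($1 % 4) ".txt") }' emp.txt

# Key-less (round-robin on the row number): rows are dealt evenly across the
# four partition files regardless of their key values
awk -F',' '{ print > ("part_" (NR % 4) ".txt") }' emp.txt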

Data stage Architecture:

DataStage is a client-server technology, so it has server components and client components.
Server components (UNIX):
PX Engine
DataStage Repository
Package Installer

Client components (Windows):
DataStage Administrator
DataStage Manager
DataStage Director
DataStage Designer

Data stage Server:

The server components are classified into:
DataStage Server:
It is the heart of DataStage and contains the Orchestrate engine. This engine picks up requests from the client components, performs the requested operations, and responds to the client components; if required, it fetches information from the DataStage repository.
DataStage Repository:
The repository contains jobs, table definitions, file definitions, routines, shared containers, etc.
Package Installer:
It is used to install software packages and provides compatibility with other software.

Client Components:
The client components are classified into:
Data stage Administrator
Data stage Manager
Data stage Director
Data stage Designer
Data stage Administrator:
The DS administrator can create and delete projects,
give permissions to users,
and define global parameters.

Data stage Manager:
The DataStage Manager can import and export jobs,
create routines,
and edit the configuration file.
Data stage Director:
The DataStage Director can validate jobs,
run jobs,
monitor jobs,
schedule jobs,
and view the job logs.
Data stage Designer:
Through the DataStage Designer a developer can design jobs and compile and run them.

Differences between 7.5.x2 and 8.0.1

7.5.x2:
1. Four client components (DS Designer, DS Director, DS Manager, DS Administrator)
2. Architecture components: server components and client components
3. Two-tier architecture
4. OS-dependent with respect to users
5. Capable of Phase 3 and Phase 4
6. No web-based administration
7. File-based repository

8.0.1:
1. Five client components (DS Designer, DS Director, Information Analyzer, DS Administrator, Web Console)
2. Architecture components:
   Common User Interface
   Common Repository
   Common Engine
   Common Connectivity
   Common Shared Services
3. N-tier architecture
4. OS-independent with respect to users (only a one-time dependency)
5. Capable of all phases
6. Web-based administration through the Web Console
7. Database-based repository

Data stage 7.5x2 Client Components:

In 7.5x2 we have 4 client components.
Data stage Designer:
It is used to create, compile, and run jobs, and to perform multiple-job compiles.
It can handle four types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs
Data stage Director:
Using the DataStage Director you can
schedule the jobs and run the jobs,
monitor the jobs, unlock the jobs, and run batch jobs,
view the job status and logs,
and handle messages.
Data stage Manager:
Can import and export the repository components.
Node configuration.
Data stage Administrator:
Can create projects,
delete projects,
and organize projects.

Server Components:
We have 3 server components:
1. PX Engine: it executes DataStage jobs and automatically selects the partitioning technique.
2. Repository: it contains the repository components.
3. Package Installer: the package installer has packs and plug-ins.

Data stage 8.0.1 Client Components:

In 8.0.1 we have 5 client components.
Data stage Designer:
It is used to create, compile, and run jobs, and to perform multiple-job compiles.
It can handle five types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs
5. Data quality jobs
Data stage Director:
Using the DataStage Director you can
schedule the jobs and run the jobs,
monitor the jobs, unlock the jobs, and run batch jobs,
view the job status and logs,
and handle messages.
Data stage Administrator:
Can create projects,
delete projects,
and organize projects.
Web Console:
Through this administration component the following tasks can be performed:
1. Security services
2. Scheduling services
3. Logging services
4. Reporting services
5. Domain management
6. Session management
Information Analyzer:
It is also a console of the IBM InfoSphere Information Server.
It performs all the activities of Phase 1:
1. Column analysis
2. Baseline analysis
3. Primary key analysis
4. Foreign key analysis
5. Cross-domain analysis

Data stage 8.0.1 Architecture:

1. Common User Interface
The unified user interface is called the common user interface. It includes:
1. Web Console
2. Information Analyzer
3. Data stage Designer
4. Data stage Director
5. Data stage Administrator
2. Common Repository
The common repository is divided into two parts:
1. Global repository: DataStage job files are stored here.
2. Local repository: used for storing individual (project-level) files.
The common repository is also called the metadata server.
3. Common Engine:
It is responsible for the following:
Data profiling analysis
Data quality analysis
Data transformation analysis
4. Common Connectivity
It provides the common connections to the common repository.

Stage Enhancements and Newly Introduced Stages: Comparison between 7.5x2 and 8.0.1:
Stage Category      Stage Name                        Available in 7.5x2                   Available in 8.0.1
Processing Stage    SCD (Slowly Changing Dimension)   Not available                        Available
Processing Stage    FTP (File Transfer Protocol)      Not available                        Available
Processing Stage    WTX (WebSphere TX)                Not available                        Available
Processing Stage    Surrogate Key                     Available                            Available (enhancements added)
Processing Stage    Lookup                            Available (Normal, Sparse lookup)    Available (adds Range, Caseless lookup)
Database Stage      iWay                              Not available                        Available
Database Stage      Classic Federation                Not available                        Available
Database Stage      ODBC Connector                    Not available                        Available
Database Stage      Netezza                           Not available                        Available
Database Stage      SQL Builder                       Not available                        Available (used across the database stages)
Note: Enhancements were made to both the database stages and the processing stages.
Datastage Designer Window:
It has:
Title bar  --> IBM InfoSphere DataStage and QualityStage Designer
Menu bar   --> File, Edit, View, Repository, Diagram, Import, Export, Tools, Window, Help
Tool bar   --> tool options such as Job Properties, Compile, Run
Repository --> the repository pane, which contains the repository components

File Stages:
----------------
Sequential file stage:
===============
The sequential file stage is a file stage which is used to read data sequentially or in parallel.
If it reads 1 file  - it reads the data sequentially.
If it reads N files - it reads the data in parallel.
The sequential file stage supports 1 input link | 1 output link | 1 reject link.
To read the data we have read methods. The read methods are:
a) Specific file(s)
b) File pattern
Specific file(s) is used for particular, named files,
and File pattern is used with wildcards.
In the error (reject) mode we have:
Continue
Fail
Output
If you select Continue - records with a data type mismatch are dropped and the rest of the data is sent to the target.
If you select Fail     - the job aborts on any data type mismatch.
If you select Output   - the mismatched data is sent to a reject data file.
The error data we get is:
Data type mismatch
Format mismatch
Condition mismatch
We also have the option:
Missing File Mode, with three sub-options:
Depends
Error
OK
(This controls how the stage behaves if any of the listed files is missing.)
Different options usage in the sequential file stage:
-----------------------------------------------------
Read Method = Specific file(s) --> the stage reads the named file(s); a single file is read in sequential mode.
Read Method = File pattern     --> the stage reads all files that match the pattern.
Note: if Read Method = Specific file(s), the stage asks for the input file path;
if Read Method = File pattern, it asks for the pattern.

Example for file pattern:

Emp1.txt
Emp2.txt
To read the data of the above two files, the file pattern should be Emp?.txt
? --> matches exactly one character
* --> matches zero or more characters
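A quick way to check a pattern outside DataStage is to list the matching files in the shell (the directory contents below are assumed for illustration):

# assuming the directory holds Emp1.txt, Emp2.txt and Emp10.txt
ls Emp?.txt    # matches Emp1.txt and Emp2.txt only (? = exactly one character)
ls Emp*.txt    # matches Emp1.txt, Emp2.txt and Emp10.txt (* = zero or more characters)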
Example jobs for the lab handout:
1. Read Method = Specific file(s)
   Reject Mode = Continue, Fail, or Output
   Note: if Reject Mode = Output, you must provide an output reject link for the rejected records; otherwise the job gives an error.
2. Read Method = File pattern
   Reject Mode = Continue, Fail, or Output
   Note: if Reject Mode = Output, you must provide an output reject link for the rejected records; otherwise the job gives an error.
3. Read Method = Specific file(s)
   Reject Mode = Continue, Fail, or Output
   FileNameColumn = InputRecordFilepath
   Note 1: if Reject Mode = Output, you must provide an output reject link for the rejected records; otherwise the job gives an error.
   Note 2: FileNameColumn is an option; if we select it, we have to create one extra column in the extended column properties, and the output then contains the extra InputRecordFilepath column at the output.
4. Read Method = Specific file(s)
   Reject Mode = Continue, Fail, or Output
   RowNumberColumn = InputRowNumberColumn
   Note 1: if Reject Mode = Output, you must provide an output reject link for the rejected records; otherwise the job gives an error.
   Note 2: RowNumberColumn is an option; if we select it, we have to create one extra column in the extended column properties, and the output then contains the extra InputRowNumberColumn column at the output.

Sequential file options:
Filter
FileNameColumn
RowNumberColumn
Read First Rows
NullFieldValue
1. Filter option
The Filter option passes the incoming data through a UNIX command (for example sed or grep) before the stage parses it.
sed command:
-------------------
sed is a stream editor for filtering and transforming text from standard input to standard output.
sed '5q'               --> displays the first 5 lines
sed '2p'               --> displays all lines, but the 2nd line is displayed twice
sed '1d'               --> displays all records except the first record
sed '1d;2d'            --> displays all lines except the first and second records
sed -n '2,4p'          --> prints only records 2 to 4
sed -n -e '2p' -e '3p' --> displays only the 2nd and 3rd lines
sed '$d'               --> deletes the trailer (last) record
sed G                  --> inserts a blank line after each line
grep commands:
----------------------
1) grep 'bhaskar'    --> displays the lines that contain the string "bhaskar"
2) grep -v 'bhaskar' --> displays the lines that do not contain "bhaskar"
3) grep -i 'bhaskar' --> case-insensitive match (bhaskar, Bhaskar, BHASKAR, ...)
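As a hedged illustration of how these commands relate to the sequential file stage: the Filter property takes a UNIX command through which the incoming data is piped before parsing. The two values below are made-up examples, not taken from a real job.

# hypothetical Filter value: drop the first (header) line of every file read by the stage
sed '1d'

# hypothetical Filter value: pass through only the rows that contain the word SALES
grep 'SALES'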
