File Stages
Sequential file
Data set
File set
Lookup file set
Database Stages
Oracle Enterprise
ODBC Enterprise
Dynamic RDBMS
Processing Stages
Aggregator
Change Apply
Change Capture
Compare
Compress
Copy
Decode
Difference
Encode
Expand
Filter
Funnel
Generic
Join
Lookup
Merge
Modify
Pivot
Remove Duplicates
External Filter
Sort
Surrogate Key Generator
Switch
Transformer
Debugging Stages
Column Generator
Head
Peek
Row Generator
Sample
Tail
Datastage Manager
Introduction to DataStage Manager
Importing the Jobs
Exporting the Jobs
Home
Overview
Investigate
Data Quality
Operate
Types of Analysis in IA
Column Analysis
Overview
Frequency distribution
Domain and completeness
Data Class Analysis
Format Analysis
Creating IA Projects
Business Analyst
Data Steward
Data Operator
Drill Down user
======================================================
Classification Table
Dictionary File
Pattern Action File
Reference Tables
Override Tables
Character Concatenate:
Word Investigate:
Standardize Stage
Standardize Country
Standardize Domain Preprocessing
Standardize Name
Standardize Address
Standardize Area
DATA WAREHOUSE:
A data warehouse (DWH) is a collection of transactional and historical data, maintained in the DWH for analysis purposes.
There are 3 types of tools that should be maintained on any data warehousing project:
1. ETL Tools
2. OLAP Tools (or) Reporting Tools
3. Modeling Tool
ETL TOOL:
ETL stands for Extraction, Transformation, and Loading. An ETL developer (someone with
DWH expertise) extracts data from heterogeneous databases or flat files, transforms the
data from source to target (the DWH), applying transformation rules during the
transformation, and finally loads the data into the DWH.
There are several ETL tools available in the market:
1. Data stage
2. Informatica
3. Abinitio
4. Oracle Warehouse Builder
5. Bodi (Business Objects Data Integration)
6. MSIS (Microsoft Integration Services)
OLAP:
OLAP stands for Online Analytical Processing; these tools are also called reporting tools.
An OLAP developer analyzes the data warehouse and generates reports based on selection
criteria.
There are several OLAP tools available:
1. Business Objects
2. Cognos
3. Report Net
4. SAS
5. Micro Strategy
6. Hyperion
7. MSAS (Microsoft Analysis Services)
MODELING TOOL:
Someone who works with the ERwin tool is called a data modeler. A data modeler designs
the database of the DWH with the help of such tools.
An ETL developer extracts data from source databases or flat files (.txt, .csv, .xls, etc.)
and populates the data into the DWH. While populating data into the DWH, some staging
areas may be maintained between source and target; these are called staging area 1 and
staging area 2.
STAGING AREA:
A staging area is a temporary place used for cleansing unnecessary, unwanted, or
inconsistent data.
Note: A Data Modeler can design DWH in two ways
1. ER Modeling
2. Dimensional Modeling
ER Modeling:
ER modeling stands for entity-relationship modeling. In this model, tables are always
called entities, and the design may be in second normal form, third normal form, or
somewhere between 2NF and 3NF.
Dimensional Modeling:
In this model, tables are called dimensions or fact tables. Dimensional modeling can be
subdivided into three schemas:
1. Star Schema
2. Snow Flake Schema
3. Multi Star Schema (or) Hybrid (or) Galaxy
Star Schema:
A fact table surrounded by dimension tables is called a star schema; the layout looks like a star.
If a star schema contains only one fact table, it is called a simple star schema.
If a star schema contains more than one fact table, it is called a complex star
schema.
Promotion_Area:
Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_stratdate
Product_expdate
Product_maxprice
Product_wholeprice
Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name
Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_Year
DIMENSION TABLE:
A table that contains a primary key and provides detail information
(master information) about an entity is called a dimension table.
FACT TABLE:
A table that contains multiple foreign keys, holds transactions, and provides
summarized information is called a fact table.
DIMENSION TYPES:
There are several types of dimensions:
CONFORMED DIMENSION:
If a dimension table is shared with more than one fact table (i.e., it has a foreign key in
more than one fact table), it is called a conformed dimension.
DEGENERATED DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (i.e., maintains a
foreign key in another fact table), it is called a degenerate dimension.
JUNK DIMENSION:
A junk dimension contains text values such as genders (male/female) and flag values
(true/false) that are not useful for generating reports.
DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in a non-key attribute,
such a table is called a dirty dimension.
FACT TABLE TYPES:
There are 3 types of facts in a fact table:
1. Additive facts
2. Semi additive facts
3. Non additive facts
ADDITIVE FACTS:
A fact that can be summed across all the dimensions of the fact table is called an
additive fact.
SEMI ADDITIVE FACT:
A fact that can be summed only across some dimensions, up to some extent (e.g., an
account balance, which sums across accounts but not across time), is called a
semi-additive fact.
NON ADDITIVE FACT:
A fact that cannot be summed across any dimension (e.g., a ratio or percentage) is
called a non-additive fact.
SNOW FLAKE SCHEMA:
A snowflake schema maintains normalized data in its dimension tables. In this schema,
some dimension tables do not maintain a direct relationship with the fact table;
instead, they maintain a relationship with another dimension.
1. Data stage:
DataStage is a comprehensive ETL tool; in other words, it is a data integration
and transformation tool that enables the collection and consolidation of data from
several sources, its transformation, and its delivery into one or multiple target systems.
ETL stands for Extraction, Transformation, Load:
E -> Extraction from any source
T -> Transformation (rich set of transformation capabilities)
L -> Loading into any target
Any to Any:
DataStage can extract data from any source and load data into any target.
Platform Independent:
A job that can run on any processor is called platform independent.
DataStage jobs can run on 3 types of processors:
1. UNI Processor
2. Symmetric Multi Processor (SMP), and
3. Massively Parallel Processor (MPP).
Node Configuration:
A node is a logical CPU, i.e., an instance of a physical CPU.
The process of creating these virtual CPUs is called node configuration.
Example:
Suppose an ETL job has to process 1000 records.
On a uniprocessor it takes 10 minutes to process the 1000 records,
but on an SMP the same job takes about 2.5 minutes.
A major difference is at the job-architecture level: server jobs process in sequence,
one stage after the other.
Configuration File:
What is a configuration file, and what is its use in DataStage?
It is a normal text file that holds information about the processing and storage
resources available for use during parallel job execution.
The default configuration file contains entries such as:
Node: a logical processing unit that performs all ETL operations.
Pools: a collection of nodes.
Fastname: the server name; ETL jobs are executed using this name.
Resource disk: a permanent storage area that stores the data set files.
Resource scratch disk: a temporary storage area where staging operations are
performed.
Configuration file:
Example:
{
  node "node1"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}
Note:
In a configuration file, no two nodes may have the same name.
At least one node must belong to the default node pool, whose name is "" (a zero-length string).
Pipeline parallelism:
In pipeline parallelism, a downstream stage starts consuming rows as soon as the
upstream stage produces them, so all stages of the job run concurrently.
Pipe:
A pipe is a channel through which data moves from one stage to another stage.
Partition Parallelism:
Partitioning:
Partitioning is a technique of dividing the data into chunks.
DataStage supports 8 types of partitioning.
Partitioning plays an important role in DataStage.
Every stage in DataStage is associated with a default partitioning technique;
the default partitioning technique is Hash.
Note:
Selection of a partitioning technique is based on:
1. Data (volume, type)
2. Stage
3. Number of key columns
4. Key column data type
Partitioning techniques are grouped into two categories:
1. Key-based
2. Keyless
Key-based partitioning techniques:
1. Hash
2. Modulo
3. Range
4. DB2
Keyless partitioning techniques:
1. Random
2. Round robin
3. Entire
4. Same
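Outside DataStage, the distinction between keyless and key-based partitioning can be sketched with plain awk; the file names and the two-partition count below are invented for this demo:

```shell
# Demo data: one "id,value" record per line (invented for this sketch).
printf '%s\n' 1,a 2,b 3,c 4,d 5,e 6,f > records.txt

# Keyless (round robin): record number N goes to partition N mod 2,
# regardless of the record's contents.
awk -F, '{ print > ("rr_part" NR % 2 ".txt") }' records.txt

# Key-based (modulo on the id column): records with equal keys always
# land in the same partition, which is what keyed stages rely on.
awk -F, '{ print > ("key_part" $1 % 2 ".txt") }' records.txt

wc -l rr_part0.txt rr_part1.txt key_part0.txt key_part1.txt
```

Hash partitioning works the same way as the modulo line, except the key is first run through a hash function, so the key need not be numeric.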
Client components(Windows)
Data stage Administrator
Data stage Manager
Data stage Director
Data stage Designer
Data stage Administrator:
The DataStage Administrator can create and delete projects,
give permissions to users,
and define global parameters.
Server Components:
We have 3 server components
1. PX Engine: it executes DataStage jobs and automatically selects the partitioning
technique.
2. Baseline Analysis
3. Primary Key Analysis
4. Foreign Key Analysis
5. Cross-Domain Analysis
Stage availability across versions:
Processing stages: SCD (Slowly Changing Dimension), FTP (File Transfer Protocol),
WTX (WebSphere TX), Surrogate Key, Lookup (Normal Lookup, Sparse Lookup)
Database stages: IWAY, Classic Federation, ODBC Connector (SQL builder), Netezza
In the earlier release most of these stages were Not Available (Surrogate Key and
Lookup were Available); in version 8 they are Available.
Note: enhancements have been done to the database stages and the processing stages.
Datastage Designer Window:
It has a Title Bar: IBM InfoSphere DataStage and QualityStage Designer
Menu Bar: File, Edit, View, Repository, Diagram, Import, Export, Tools, Window, Help
Toolbar: tool options such as Job Properties, Compile, Run
Repository: the repository pane, which contains the repository components
File Stages:
Sequential File Stage:
The Sequential File stage is a file stage used to read data sequentially or in
parallel.
If it reads 1 file, it reads the data sequentially.
If it reads N files, it reads the data in parallel.
Sequential file supports 1 Input link |1 Output Link | 1 reject link.
To read the data, we have read methods. Read methods are
a) Specific files
b) File Patterns
Specific File is for a particular file,
and File Pattern is used for wildcards.
In Error Mode, it has:
Continue
Fail and
Output
If you select Continue - on any data type mismatch, the mismatched record is dropped and
the rest of the data is sent to the target.
If you select Fail - the job aborts on any data type mismatch.
Output - the mismatched data is sent down the reject link to a rejected-data file.
The types of error data we get are:
Data type Mismatch
Format Mismatch
Condition Mismatch
We also have the option Missing File Mode: in this option
we have three sub-options:
Depends
Error
OK
(that is, how to handle the case where a file is missing)
Different Options usage in Sequential file:
Note 1: If Reject Mode = Output, you must provide an output reject link for the rejected
records; otherwise it gives an error.
Note 2: Row Number Column: if we select this option, we have to create one extra column
in the extended column properties; the output file will then contain that extra
row-number column.
Sequential File options:
Filter
FileNameColumn
RowNumberColumn
Read First Rows
NullFieldValue
1. Filter option:
Sed command:
sed is a stream editor for filtering and transforming text from standard input to
standard output.
sed 5q             -> displays the first 5 lines
sed 2p             -> displays all lines, but the 2nd line is displayed twice
sed 1d             -> displays all records except the first record
sed '1,2d'         -> displays all lines except the first and second records
sed -n '2,4p'      -> prints only records 2 to 4
sed -n -e 2p -e 3p -> displays only the 2nd and 3rd lines
sed '$d'           -> deletes the trailer (last) record
sed G              -> inserts a blank line after each line
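These commands can be checked on any UNIX shell; the sample file below is invented for the demo:

```shell
# Build a 6-line sample file (invented demo data).
printf '%s\n' r1 r2 r3 r4 r5 r6 > sample.txt

sed 5q sample.txt          # first 5 lines: r1..r5
sed -n '2,4p' sample.txt   # only lines 2 to 4: r2 r3 r4
sed '$d' sample.txt        # drop the trailer (last) line: r1..r5
```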
Grep commands:
1) grep string     Ex: grep bhaskar      -> prints lines containing the string
2) grep -v string  Ex: grep -v bhaskar   -> prints lines NOT containing the string
3) grep -i string  Ex: grep -i bhaskar   -> case-insensitive match
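A quick sanity check of the three flags; the file name and its contents are invented for the demo:

```shell
# Demo file with mixed-case names (invented data).
printf '%s\n' bhaskar Bhaskar ravi > names.txt

grep bhaskar names.txt     # lines containing the exact string
grep -v bhaskar names.txt  # lines NOT containing the string
grep -i bhaskar names.txt  # case-insensitive match
```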