RCP

Switch Stage :
Definition : The switch stage takes a single data set as input and assigns each input row to an output data set based on the value of a selector field.
It can have a single input link, up to 128 output links and a single rejects link. This stage performs an operation similar to a C switch statement. Rows that satisfy none of the cases are output on the rejects link.
Property:
when Selector Mode = Auto
Input :
Output - 1 :
Output - 2 : Reject :
Output - 3 :
Job (after execution):
Property :
when Selector Mode = Hash
Input :
Output - 1 :
Output - 2 :
Output - 3 :
Property :
when Selector Mode = User Defined Mapping
Input :
Output - 1 :
Output - 2 :
Reject :
Output - 3 :
Modify Stage:
Definition : The Modify stage alters the record schema of its input data set. The modified data set is then output. It is a processing stage.
It can have a single input and single output.
Null Handling:
Job (before handling):
Step 1:
"NULL" value has to be handled
Null Handling
Step 2:
Syntax : Column_Name=Handle_Null('Column_Name',Value)
Input Link Columns
Output Link Columns
Step 3:
"NULL" value has been handled
Drop Column(s): Job (before execution):
Step 1:
The column "MGR" has to be dropped
Step 2:
Dropping Column
Syntax : DROP Column_Name
Input Link Columns
Output Link Columns
Step 3:
The column "MGR" has been dropped
Keep Column(s): Job (before execution):
Step 1:
The column "EmpNo" has to be kept
Keeping Column
Step 2:
Syntax : KEEP Column_Name
Input Link Columns
Output Link Columns
Step 3:
The column "EmpNo" is kept
Type Conversion:
Job (before execution):
Step 1:
The column "HireDate" has to converted to Date
Type Conversion
Step 2:
Syntax : Column_Name=type_conversion('Column_Name')
Input Link Columns
Output Link Columns
Step 3:
The column "HireDate" has been converted to Date
Multiple Specifications: Job (before execution):
Step 1:
The column "MGR" has to be Null handled
The column "HireDate" has to converted to Date
Null Handling
Step 2:
Type Conversion
Input Link Columns
Output Link Columns
Step 3:
The column "MGR" has been Null handled
The column "HireDate" has been converted to Date
Compress Stage
The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data.
Steps :
Set the Stage Properties : "Command = Compress"
Load the Metadata in the Output tab
Example Job
Limitations
A compressed data set cannot be processed by many stages until it is expanded, i.e., until its rows are returned to their normal format. Stages that do not perform column based processing or reorder the rows can operate on compressed data sets. For example, you can use the copy stage to create a copy of the compressed data set.
Expand Stage
The Expand stage uses the UNIX compress or GZIP utility to expand the data set. It converts a data set from a stream of raw binary data into sequence of records.
Steps :
Set the Stage Properties : "Command = Uncompress"
Load the Metadata in the Output tab
Example Job :
Runtime Column Propagation:

DataStage EE is flexible about meta data. It can cope with the
situation where meta data isnt fully defined. You can define part of your schema and specify that, if your job encounters extra
columns that are not defined in the meta data when it actually
runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). RCP is always on at runtime. Design and compile time column mapping enforcement. RCP is off by default. Enable first at project level. (Administrator project properties) Enable at job level. (job properties General tab) Enable at Stage. (Link Output Column tab)
Enabling RCP at Project Level:
Enabling RCP at Job Level:
To Do
Column definition will define all columns that must be carried through to the next stage Column definition column name must match those defined in the schema file Ensure RCP is disabled for the output links
record {final_delim=end, record_delim='\n', delim='|', quote=double, charset="ISO8859-1"} ( REGION_ID:int32 {quote=none}; SALES_CITY:ustring[max=255]; SALES_ZONE:ustring[max=255]; SALES_TOTAL:int32 {quote=none}; )
When the input format changes

ONLY the schema file must be modified! Data Set will always contain the columns for which the definition is included within the stages as well as the computed field
Enabling RCP at Stage Level:
Go to output links columns tab. For transformer you can find the output links columns tab by first going to stage properties.

When RCP is Disabled DataStage Designer will enforce Stage Input Column to Output Column mappings. At job compile time modify operators are inserted on output links in the generated OSH.

When RCP is Enabled DataStage Designer will not enforce mapping rules. No Modify operator inserted at compile time. Danger of runtime error if column names incoming do not match column names outgoing link case sensitivity.
Merits of RCP:
RCP is useful when you are dealing with hundreds of columns. RCP can be very useful is in the case of a complex program with more stages, i.e., the more places you have to correct table definitions. With RCP you change it once at the beginning and the changes will flow through the rest of the program. RCP is a feature of the base orchestrate framework. It cannot be switched "off". What actually happens when RCP is switched off is an implicit Modify operator is added after every stage that changes column definitions. This may mean potentially doubling the number of operators in your job causing performance hits.
Demerits of RCP
RCP should be switched off at certain stages in your job where leaving it on will cause all those headaches (Defaulting, dropping issues). Example of such a stage would be funnel. A recommended best practice is to document through annotations where in the job RCP has been switched off. you can write jobs that contain no column metadata.
Configuration Files:
Objectives:
Understand how DataStage EE uses configuration files to determine parallel behavior. Use this understanding to: Build a EE configuration file for a computer system. Change node configurations to support adding resources to processes that need them. Create a job that will change resource allocations at the stage level.
Configuration File Concepts:

Determine the processing nodes and disk space connected to each node. When system changes, need only change the configuration file no need to recompile jobs. When DataStage job runs, platform reads configuration file Platform automatically scales the application to fit the system.
Processing Nodes Are:

Locations on which the framework runs applications.
Logical rather than physical construct.

Do not necessarily correspond to the number of CPUs in your system. Typically one node for two CPUs. Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node.
Optimizing Parallelism:
Degree of parallelism determined by number of nodes defined. Parallelism should be optimized, not maximized Increasing parallelism distributes work load but also increases Framework overhead. Hardware influences degree of parallelism possible. System hardware partially determines configuration.
More Factors to Consider:

Communication amongst operators: Should be optimized by your configuration. Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link. SMP leave some processors for operating system. Desirable to equalize partitioning of data.
Use an experimental approach: Start with small data sets. Try different parallelism while scaling up data set sizes.
Factors Affecting Optimal Degree of Parallelism:

CPU intensive applications: Benefit from the greatest possible parallelism. Applications that are disk intensive: Number of logical nodes equals the number of disk spindles being accessed.
Configuration File:
Text file containing string data that is passed to the Framework Sits on server side Can be displayed and edited
Name and location found in environmental variable APT_CONFIG_FILE

Components Node Fast name Pools Resource
Node Options:
Node name name of a processing node used by EE: Typically the network name Use command uname n to obtain network name Fastname: Name of node as referred to by fastest network in the system Operators use physical node name to open connections NOTE: for SMP, all CPUs share single connection to network Pools: Names of pools to which this node is assigned Used to logically group nodes Can also be used to group resources Resource: Disk Scratchdisk
Sample Configuration File:
{ node "Node1" { fastname "BlackHole" pools "" "node1" resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" } resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" } } }
The Configuration File:

CPU CPU CPU CPU CPU CPU CPU
An SMP
A Cluster
Ethernet or High-speed network
CPU
{ node "node0" { fastname "node0_byn" pools "" "node0" "node0_fddi" resource disk "/dsee/s0" { } resource disk "/dsee/s1" { } resource scratchdisk "/scratch" { } } node "node1" { fastname "node0_byn" pools "" "node1" "node1_fddi" resource disk "/dsee/s0" { } resource disk "/dsee/s1" { } resource scratchdisk "/scratch" { } }
{ node "node0" { fastname "node0_css" pools "" "node0" "node0_fddi" resource disk "/dsee/s0" { } resource scratchdisk "/scratch" { } } node "node1" { fastname "node1_css" pools "" "node1" "node1_fddi" resource disk "/dsee/s0" { } resource scratchdisk "/scratch" { } } }
No change in Application. Just a Config File change.
Disk Pools:
Pool "bigdata"
Disk pools allocate storage By default, EE uses the default pool, specified by ""
Sorting Requirements:
Resource pools can also be specified for sorting: The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool.
Another Configuration File Example:

{ node "n1" { fastname s1" pool "" "n1" "s1" "sort" resource disk "/data/n1/d1" {} resource disk "/data/n1/d2" {} resource scratchdisk "/scratch" } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/data/n2/d1" {} resource scratchdisk "/scratch" } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/data/n3/d1" {} resource scratchdisk "/scratch" } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/data/n4/d1" {} resource scratchdisk "/scratch" } ... }
{"sort"}
{}
4 2 1
5 3
{}
{}
Resource Types:
Disk Scratchdisk DB2 Oracle Saswork Sortwork Can exist in a pool Groups resources together
Using Different Configurations:
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File:

Scoping the hardware: Is the hardware configuration SMP, Cluster, or MPP? Define each node structure (an SMP would be single node): Number of CPUs CPU speed Available memory Available page/swap space Connectivity (network/back-panel speed)
Is the machine dedicated to EE? If not, what other applications are running on it?
Get a breakdown of the resource usage (vmstat, mpstat, iostat)
Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
Working with Configuration Files:

You can easily switch between config files: '1-node' file - for sequential execution, lighter reportshandy for testing 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism 'BigN-nodes' file - aims at full data-partitioned parallelism
Only one file is active while a step is running The Framework queries (first) the environment variable: $APT_CONFIG_FILE
# nodes declared in the config file needs not match # CPUs

Same configuration file can be used in development and target machines.
Scheduling Nodes, Processes, and CPUs:

DS/EE does not: know how many CPUs are available Nodes = # logical nodes declared in config. file schedule Ops = # ops. (approx. # blue boxes in V.O.) Who knows what? Processes = # Unix processes
CPUs = # available CPUs
Nodes User Orchestrate O/S

Y Y
Ops
N Y
Processes
Nodes * Ops "
CPUs
N Y
Who does what? DS/EE creates (Nodes*Ops) Unix processes The O/S schedules these processes on the CPUs
Configuring DSEE Node Pools:

{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {} resource scratchdisk "/temp" {"sort"} } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/orch/n2/d1" {} resource disk "/orch/n2/d2" {} resource scratchdisk "/temp" {} } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/orch/n3/d1" {} resource scratchdisk "/temp" {} } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/orch/n4/d1" {} resource scratchdisk "/temp" {} }
3 1
4 2
Configuring DSEE Disk Pools:

{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {"bigdata"} resource scratchdisk "/temp" {"sort"} } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/orch/n2/d1" {} resource disk "/orch/n2/d2" {"bigdata"} resource scratchdisk "/temp" {} } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/orch/n3/d1" {} resource scratchdisk "/temp" {} } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/orch/n4/d1" {} resource scratchdisk "/temp" {} }
3
1
4 2
Job Properties
View/Modify from Designer/Director General Properties

Job Properties
Job Descriptions Enable RCP To be discussed shortly Multiple Instances To allow parallel execution of multiple instances. Job Design must ensure that there is no conflict, e.g. writing into the same file, etc. Before/After Job Subroutine
Before/After Job Subroutine

Job Properties
Execute subroutine Creation/definition of these to discussed later Does not return data values After job routine can be conditionally executed based on successful job execution
Defining through Job Properties > Parameters Used to pass business & control parameters to the jobs Recap of sample usage:
Recap Of Job Parameters

#XXX# Usage as stage parameter for string substitution
Default value optional
Direct usage for expression evaluation
Setting Parameter Values
Recap Of Job Parameters Passed by calling sequence**/script/program If value set is by calling program, this will overrides default value If no default value, calling sequence/script MUST set parameter, else job fails
Used For
Flexibility Change business parameters Reuse Run same job with different parameters to handle different needs Portability set path, user name, password, etc. according to the environment
Some Common Parameters Run Date Business Date Path Filename suffix/prefix (input/output) User Name/password Database connection parameters, DSNs, etc. Base currency, etc.
Can also set/override Environment Variables Values - valid only within the job
Recap Of Job Parameters
Orchestrate Shell Script that is compiled by the engine

RCP

Caricato da

Informazioni sul documento

Descrizione originale:

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

RCP

Caricato da

Copyright:

Formati disponibili

Switch Stage :

when Selector Mode = Auto

Job (after execution):

when Selector Mode = Hash

Job (after execution):

when Selector Mode = User Defined Mapping

Job (after execution):

It can have a single input and single output.

"NULL" value has to be handled

Input Link Columns

Output Link Columns

"NULL" value has been handled

Job (after execution):

Drop Column(s): Job (before execution):

The column "MGR" has to be dropped

Syntax : DROP Column_Name

Input Link Columns

Output Link Columns

The column "MGR" has been dropped

Job (after execution):

Keep Column(s): Job (before execution):

The column "EmpNo" has to be kept

Syntax : KEEP Column_Name

Input Link Columns

Output Link Columns

The column "EmpNo" is kept

Job (after execution):

The column "HireDate" has to converted to Date

Input Link Columns

Output Link Columns

The column "HireDate" has been converted to Date

Job (after execution):

Multiple Specifications: Job (before execution):

The column "MGR" has to be Null handled

The column "HireDate" has to converted to Date

Input Link Columns

Output Link Columns

The column "MGR" has been Null handled

The column "HireDate" has been converted to Date

Job (after execution):

Load the Metadata in the Output tab

Load the Metadata in the Output tab

Runtime Column Propagation:

Enabling RCP at Project Level:

Enabling RCP at Job Level:

When the input format changes

Enabling RCP at Stage Level:

Runtime Column Propagation:

Runtime Column Propagation:

Configuration File Concepts:

Processing Nodes Are:

Logical rather than physical construct.

More Factors to Consider:

Factors Affecting Optimal Degree of Parallelism:

Name and location found in environmental variable APT_CONFIG_FILE

Sample Configuration File:

The Configuration File:

No change in Application. Just a Config File change.

Another Configuration File Example:

Using Different Configurations:

Lookup stage where DBMS is using a sparse lookup type

Building a Configuration File: