
DataStage Designer Tips & Tricks

Jim Tsimis
Advanced Technical Support

Mike Carney
Advanced Consulting Group

Michael Ruland
Field Engineering

Steven Totman
Product Manager Connectivity

Agenda
Designer Session
- General Debug Tips & Tricks
- Handling Complex Flat Files
- Joy of the Command Line
- Transaction Handling Tips & Tricks
  - Managing transactions in Server, Enterprise Edition, Enterprise MVS Edition and RTI
- Re-usability Tips & Tricks
  - Shared Containers, Templates, Pre-configured stages, Runtime column propagation
- Performance Tuning

General Debug Tips & Tricks

Server - Generating Test Data

General Debug

Stage Variables are always executed and drive the transformer stage

Notice there are no input links


Output rows are generated until the constraint evaluates to false; at that point the job stops.
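As a rough sketch of the pattern (the names, values and the Rnd() call are illustrative, not from the original slide), the no-input Transformer could be set up like this:

    Stage variable:  svRowNum    derivation: svRowNum + 1       (initial value 0)
    Constraint:      svRowNum <= 10000                          (job stops after 10,000 rows)
    Output columns:  CUST_ID = svRowNum
                     AMOUNT  = Rnd(10000) / 100                 (assumes the DataStage BASIC Rnd() function)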

Enterprise Edition - Generating Test Data

General Debug

Building test data from live data


- Head stage: selects the first N records from each partition of an input data set and copies them to an output data set.
- Tail stage: selects the last N records from each partition of an input data set and copies them to an output data set.
- Sample stage: samples an input data set. It operates in two modes: Percent mode extracts rows selected by a random number generator and writes a given percentage of them to each output data set; Period mode extracts every Nth row from each partition, where N is the period you supply.
- Filter stage: transfers, unmodified, the records of the input data set that satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links.
- External Filter stage: lets you specify a UNIX command that acts as a filter on the data you are processing. For example, you could use the stage to grep a data set for a certain string or pattern and discard records that do not contain a match.

Sequential File stage FILTER option: use this to specify that the data is passed through a filter program before being written to a file or files on output, or before being placed in a dataset on input.
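Both the External Filter stage and the Sequential File FILTER option take an ordinary UNIX command that reads stdin and writes stdout. As a rough illustration (patterns and ranges are made up), typical filter commands might look like:

    grep 'ACTIVE'      # keep only records containing the string ACTIVE
    grep -v '^#'       # drop comment records that start with #
    gzip -dc           # decompress gzipped source data as it is read
    cut -c1-80         # keep only the first 80 characters of each record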

Handling Complex Flat Files

Server - Decoding Multi-formatted Files

Input column definitions (3 columns)
The selected complex column is decoded into individual columns

Enterprise Edition - Decoding Multi-formatted Files

Indicate the columns to import
Map the columns to their destination

Enterprise Edition - Taming the import

Print field option

od -x -A x
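When an import keeps rejecting records, it often helps to look at the raw bytes to see where fields really start and end. A minimal sketch, assuming a hypothetical data file mydata.dat:

    od -x -A x mydata.dat | head -20   # dump the start of the file in hex, with hex offsets
    od -c mydata.dat | head -20        # same bytes as characters/escapes, handy for spotting delimiters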


Working with Schemas

Converting Copybooks To Schemas
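As a rough illustration of the idea (hypothetical copybook and column names; the exact type mapping depends on how the copybook is imported), a small COBOL group such as:

    01  CUSTOMER-REC.
        05  CUST-ID     PIC 9(8).
        05  CUST-NAME   PIC X(30).
        05  BALANCE     PIC S9(7)V99 COMP-3.

might end up as an Enterprise Edition schema along these lines:

    record
    ( cust_id: decimal[8,0];
      cust_name: string[30];
      balance: decimal[9,2];
    )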


Enterprise Edition - Working with Complex Files


- Make Subrecord stage: combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors.
- Promote Subrecord stage: promotes the columns of an input subrecord to top-level columns; it can also promote the columns in vectors of subrecords, in which case it acts as the inverse of the Combine Records stage.
- Split Subrecord stage: separates an input subrecord field into a set of top-level vector columns.
- Make Vector stage: combines specified columns of an input data record into a vector of columns of the same type.
- Split Vector stage: promotes the elements of a fixed-length vector to a set of similarly named top-level columns.
- Combine Records stage: combines records in which particular key-column values are identical into vectors of subrecords.
- Column Import stage: imports data from a single column and outputs it to one or more columns.
- Column Export stage: exports data from a number of columns of different data types into a single column of data type string or binary.


Enterprise Edition - Complex Structures


Subrecords
A subrecord is a nested data structure. A column of type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. [Slide diagram: an example subrecord structure, with the Make and Promote operations indicated.]

Vectors
A vector is a one-dimensional array of any type except tagged. Elements of a vector are all of the same type and are numbered from 0. A vector can be of fixed or variable length. For a fixed-length vector the length is stated explicitly; for a variable-length vector a property defines a link field that gives the length at run time. [Slide diagram: the Make and Split operations on vectors.]
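To make the subrecord and vector ideas concrete, here is a rough schema sketch (field names are made up, and the exact syntax should be checked against the schema documentation):

    record
    ( order_id: int32;
      monthly_totals[12]: decimal[8,2];
      items[]: subrec
      ( sku: string[8];
        qty: int32;
      );
    )

Here monthly_totals is a fixed-length vector of 12 decimals, while items is a variable-length vector of subrecords, each carrying a sku and a qty.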


Enterprise Edition
Combining Vectors and Subrecords

Combined, vectors and subrecords give you a rich ability to represent very complex data structures.


Joy of the Command Line


The Joy of the Command Line


- What is dsjob?
- A utility to back up all the jobs in a project
- A utility to take BMPs from the command line
- dsjob exposed as a web service
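dsjob is the command-line interface to the DataStage server: it can list projects and jobs, run jobs, and pull status and log information, which makes it the natural building block for scripted housekeeping such as the backup script below. A few representative invocations (host, credentials, project and job names are placeholders):

    dsjob -server etlhost -user dsadm -password secret -lprojects                       # list projects on the server
    dsjob -server etlhost -user dsadm -password secret -ljobs MyProject                 # list jobs in a project
    dsjob -server etlhost -user dsadm -password secret -run -jobstatus MyProject MyJob  # run a job and wait for its status
    dsjob -server etlhost -user dsadm -password secret -jobinfo MyProject MyJob         # report current job status
    dsjob -server etlhost -user dsadm -password secret -logsum MyProject MyJob          # summarize the job log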


Automatically Backing up Projects


@echo off
rem This batch script is used to backup all the projects on a DataStage server. It
rem must be run from a DataStage client machine and the parameters below should be
rem modified to fit your environment. Use of parameters was avoided to simplify backup
rem and allow the command to be customized to a particular environment.
rem
rem Based on design by Manoli Krinos
rem Modified by M Ruland to allow iteration through a complete server set of projects
rem *****************************************************
rem Replace the following variables prior to running
rem *****************************************************
rem Host is server name
rem User is username to use to attach to DataStage
rem PW is password to use to attach to DataStage
rem BackupDir is the directory to place the backed up project in (don't forget final \)
rem DsxCmd is the location of the export command on the client
rem DsxCmd1 is the dsjob command used to retrieve the project list
rem TempProjFile is a temp file to store project names
rem DSLog is the name of the log file accumulated during the backup
rem *****************************************************
Set Host=yourhosthere
Set User=yourusername
Set PW=yourpassword
Set BackupDir=E:\Data\AutoBackup\UserConference\
SET DsxCmd=E:\Progra~1\Ascential\DataStage7Beta\dscmdexport.exe
SET DsxCmd1=C:\Ascential\DataStage\Engine\bin\dsjob.exe
Set TempProjFile=c:\temp\ProjectList.txt
Set DSLog=DataStageDumpLog
rem ------------------------------------------------------------------------
rem Get the current Date
rem ------------------------------------------------------------------------
FOR /f "tokens=2-4 delims=/ " %%a in ('DATE/T') do SET DsxDate=%%c%%a%%b
rem ------------------------------------------------------------------------
rem Get the current Time
rem ------------------------------------------------------------------------
FOR /f "tokens=1* delims=:" %%a in ('ECHO.^|TIME^|FINDSTR "[0-9]"') do (SET DsxTime=%%b)
rem ------------------------------------------------------------------------
rem Set delimiters so that the current time can be broken down into components,
rem then execute a FOR loop to parse the DsxTime variable into Hr/Min/Sec/Hun.
rem ------------------------------------------------------------------------
SET delim1=%DsxTime:~3,1%
SET delim2=%DsxTime:~9,1%
FOR /f "tokens=1-4 delims=%delim1%%delim2% " %%a in ('echo %DsxTime%') do (
  set DsxHr=%%a
  set DsxMin=%%b
  set DsxSec=%%c
  set DsxHun=%%d)
ECHO *** Backing up server %Host% == please be patient
rem Note: the original handout passed %user% as the password here; it should be %PW%
%DsxCmd1% -server %Host% -user %User% -password %PW% -lprojects > %TempProjFile%
echo AutoProjectBackup run on %DsxDate%%DsxHr%%DsxMin%%DsxSec% with the following parameters > %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Host=%Host%, User=%User%, BackupDir=%BackupDir%, DsxCmd=%DsxCmd%, DsxCmd1=%DsxCmd1% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo TempProjFile=%TempProjFile%, DSLog=%DSLog% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Following Projects found on %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
type %TempProjFile% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
rem *************************
rem **  Begin backup loop  **
rem *************************
for /F "tokens=1" %%i in (%TempProjFile%) do (
  ECHO The current Project is %%i >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO Backing up Project %%i
  rem ----------------------------------------------------------------------
  rem Issue message to screen (stdio) that the export is starting.
  rem ----------------------------------------------------------------------
  ECHO Exporting Project=%%i on Host=%Host% into File=%BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx ... >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  %DsxCmd% /H=%Host% /U=%User% /P=%PW% %%i %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  IF NOT %ERRORLEVEL%==0 GOTO BADEXPORT
  ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO *** Completed Export for Project: %%i on Host: %Host% to File: %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
)
ECHO *** Export completed successfully for projects:
type %TempProjFile%
GOTO EXITPT
rem ------------------------------------------------------------------------
:BADEXPORT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
ECHO *** ERROR: Failed to Export Project: %%i on Host: %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
:EXITPT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
del %TempProjFile%

This script is available from your account team and backs up all projects on an identified server

Available on Ascential Developers Net when it launches in December


Automated Diagramming
"E:\Program Files\Ascential\DataStage\dsdesign.exe" /h=YourHost /u=UserID /p=***** YourProject YourJobName /saveasbmp=e:\Diagrams\YourProject\JobDesigns\YourJobName.bmp

A script is available from Ascential that obtains all the jobs within a selected project and creates a BMP diagram of each job in a selected folder. This can be an effective way to create files that MetaStage can later use to present a graphical representation of the DataStage job designs in an HTML or XML report.
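A minimal sketch of that idea, combining dsjob -ljobs with the dsdesign /saveasbmp switch shown above (the server name, credentials and paths are placeholders, and the real Ascential script handles errors and naming more carefully):

    @echo off
    rem Sketch: save a BMP of every job design in one project
    Set Host=yourhosthere
    Set User=yourusername
    Set PW=yourpassword
    Set Project=YourProject
    Set DsJobCmd=C:\Ascential\DataStage\Engine\bin\dsjob.exe
    Set DsDesign=E:\Program Files\Ascential\DataStage\dsdesign.exe
    Set OutDir=e:\Diagrams\%Project%\JobDesigns\

    rem List the jobs in the project, then render each one to a BMP
    %DsJobCmd% -server %Host% -user %User% -password %PW% -ljobs %Project% > c:\temp\JobList.txt
    for /F "tokens=1" %%j in (c:\temp\JobList.txt) do (
      "%DsDesign%" /h=%Host% /u=%User% /p=%PW% %Project% %%j /saveasbmp=%OutDir%%%j.bmp
    )
    del c:\temp\JobList.txt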


Job Control using Web Services


- Job Control Web Service in C# .NET
- Job Control Web Service in HTML using Web Service Behavior
- Job Control Web Service in Office XP documents
- Job Control Web Service in VBScript


Unit Of Work Tips & Tricks


Server Unit of Work


Unit of Work processing is currently available in the Oracle and ODBC stages.

Used to specify whether to continue or to roll back if a link is skipped due to a constraint on it not being satisfied.

Used to specify whether to continue or to roll back on failure of the SQL statement.


DataStage Enterprise Edition MVS Unit Of Work Support

The DataStage Enterprise Edition MVS (XE/390) Business Rule stage provides the ease of graphical construction through drag-and-drop facilities, as well as the ability to customize the processing rules to meet specific demands.


Enterprise Edition - Unit of Work


[Slide diagram: transactions T1-T4 flow from a Source Queue through Work Queue(s) into Bank Account, Stock Account and Tax Account targets, with a Reject Queue for failures and an EOU (end-of-unit) marker closing the unit of work.]


Enterprise Edition - Unit of Work


Enterprise Edition framework enhancement (end of unit):
- Causes records to flush through the flow
- Stages complete their work but stay running for the next unit of work
MQ-READ unit-of-work solution:
- Utilizes (end of unit)
- MQSeries transaction manager (XA/Open two-phase commit)
- Coordinated transactions between MQSeries, Oracle and other RDBMSs
- Guarantees no loss of data
- Automatic checkpoint/restart
- Highly scalable, near real time

Only available through Ascential Professional Services


RTI Enabled Jobs

The job acts like a web service via the Real Time Integration Server (RTI)

Calls to service treated as unit of work


- Multi-row units of work are supported
- Service arguments are complex arrays


Re-usability Tips & Tricks


Reuse Job Templates

Creates an XML template that can be used as a starter job. Facilities exist to allow consultants to further customize the template so that token values can be replaced during job creation.


Copy Paste From Job to Job


You can also now copy and paste directly into a shared container on both the Enterprise Edition and Server canvases.


Pre-configured Stages
- Implemented through shared containers
- Configure the stage with parameters
- Create an empty job with parameters
- When developing a new job:
  - Start with the empty job with parameters
  - Drag/drop the preconfigured stages while holding the CTRL key
  - Minimal configuration is required


Enterprise Edition - Reusable Transformers


Implemented through shared containers.
Runtime Column Propagation:
- Turn it on for the Transformer stage
- Specify minimal input columns: only the columns used in derivations (not copy-through columns)
- Specify minimal output columns: new columns or columns that require a derivation expression (a complex expression vs. a plain column name)
- The Enterprise Edition framework automatically propagates all other columns


Performance Tuning in Parallel Environments


Performance and Tuning


Any given system can be tuned to favor one application so much that it actually hurts the performance of other applications. This phenomenon is exacerbated as parallel capabilities are introduced into the system. Many factors affect the performance of an application:
- RDBMS configuration and performance
- Memory vs. system working set size
- CPUs vs. system load
- Data input/output throughput rates
- Amdahl's law (an application is gated by its slowest component)


Performance and Tuning


Best practices:
- Establish baselines (especially with I/O); use a copy with no output
- Avoid using only one flow for tuning/performance testing; prototyping can be a powerful tool
- Work in increments: change one thing at a time
- Evaluate data skew: repartition to balance the data flow
- Isolate and solve: determine which stage is causing a problem
- Distribute file systems (if possible) to eliminate bottlenecks
- Do NOT involve the RDBMS in initial testing (see above)
- Understand and evaluate the tuning knobs available


Performance and Tuning


Establishing a baseline:
- Set up at least 3 configurations: sequential; max parallel; max parallel
- Use real data if possible, else use the table definition
- Create or generate a dataset 2-3 times the size of available RAM (limit the test to 10-15 minutes)
- Using the sequential configuration file:
  - Read the dataset to a copy (copy f)
  - Rerun and watch for caching
  - Add a write to a dataset
  - Run a read/sort/copy test (use a relatively random key for the sort)
- Using the max parallel configuration file:
  - Create a non-skewed dataset
  - Rerun the tests above; tune the configuration to obtain a linear application speed-up
  - Review the entire I/O system
  - Review the configuration file to spread I/O activity
- Using the max parallel configuration:
  - Create a non-skewed dataset
  - Rerun the tests above
  - Stress the system, looking for areas of contention

Performance and Tuning


Buffering (Enterprise Edition and Server)
Facility added behind the scenes to optimize and regulate data flow. Its primary purpose is to match the rate data is produced upstream with the rate it is consumed downstream. (see next slide)

Partitioning/Sorting (Enterprise Edition)


Operations added behind the scenes so the developer does not have to worry about partitioning and sorting, while assuring that the flow operates correctly (set APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION to suppress the automatic insertion).

Operator Combination (Enterprise Edition)


Operators are combined behind the scenes to improve performance (set APT_DISABLE_COMBINATION to turn operator combination off).


Performance and Tuning


Controlling the Buffers in DataStage Enterprise Edition
APT_BUFFER_MAXIMUM_TIMEOUT set to 1 for pre V7
Controls the speed of data flow after buffering

APT_BUFFER_MAXIMUM_MEMORY default is 3M
Increase for large memory configurations to avoid buffering to disk

APT_BUFFER_DISK_WRITE_INCREMENT default is 1M
Increase to create larger bursts of I/O during buffering to disk

APT_BUFFER_FREE_RUN default is N * APT_BUFFER_MAXIMUM_MEMORY


increase to reduce data flow impedance for large memory configurations
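Where these get set depends on your site standards; typically as environment variables in dsenv, in the Administrator, or as job-level environment parameters. A rough illustration with made-up values (not recommendations; size them to your memory and I/O budget):

    export APT_BUFFER_MAXIMUM_MEMORY=50000000       # allow ~50 MB per buffer before spilling to scratch disk
    export APT_BUFFER_DISK_WRITE_INCREMENT=4000000  # write to scratch disk in ~4 MB bursts
    export APT_BUFFER_FREE_RUN=2                    # let buffers run further ahead before applying back-pressure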

Controlling the buffers in DataStage Server:
- Set BUFFERSIZE and TIMEOUT for in-process/inter-process row buffering (default is 128K)
- Set for the project in the Administrator, or in the job properties for a particular job


Performance and Tuning


Evaluating performance with Enterprise Edition
APT_DUMP_SCORE
used to understand the details of a data flow.

APT_PM_PLAYER_TIMING
Used to understand the CPU characteristics of a data flow

APT_RECORD_COUNTS
Used to check for data skew across data partitions
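These are plain environment variables, so a simple way to get a one-off reading is to set them for a single run (a shell-style sketch; they can equally be defined per job as environment parameters):

    export APT_DUMP_SCORE=1         # print the job score: operators, datasets and partitioning actually used
    export APT_PM_PLAYER_TIMING=1   # report CPU time consumed by each player process
    export APT_RECORD_COUNTS=1      # report the record count per operator per partition (reveals skew)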

Evaluating performance with Server


Performance statistics
enabled in the Tracing panel of the Job run options presented when a server job is run (Director or Designer)


Performance Tuning
The Configuration File
Tells DataStage how to exploit the underlying computer hardware. For any given system there is no single ideal configuration file, because different jobs exercise the system in different ways. General hints (assuming an SMP environment); a sketch of a simple configuration file follows the list:
- Avoid using the disks that land input and output data as scratch or resource disk
- Do not use NFS or other remotely mounted disks for scratch disk
- Understand the file systems underneath the mount points used by the configuration file
- Separate the I/O between nodes as much as possible to provide maximum I/O bandwidth
- Run your application using various configurations to understand its behavior during volume testing, before moving to production
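A minimal two-node SMP sketch (the host name and mount points are placeholders): each node points its resource and scratch disk at different file systems so the I/O load is spread, per the hints above.

    {
      node "node1" {
        fastname "etlhost"
        pools ""
        resource disk "/fs1/ds/data" {pools ""}
        resource scratchdisk "/fs2/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "etlhost"
        pools ""
        resource disk "/fs3/ds/data" {pools ""}
        resource scratchdisk "/fs4/ds/scratch" {pools ""}
      }
    }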


Ascential Developer Net

Launches In December


DataStage Operator Tips & Tricks (tomorrow): Agenda

Tuesday 9:15am
Operator Tips & Tricks Session
- Upgrades & Installs
- Version Control
- Production Automation
- Running in a High Availability environment


EOD/EOT

E388 8195 92A2 4086 9699 4081 A3A3 8595 8489 9587 40A3 8889 A240 A285 A2A2 8996 9540 !

Please let us know if you have any comments or suggestions regarding this material.

Jim.Tsimis@ascentialsoftware.com
Mike.Carney@ascentialsoftware.com
mruland@ascentialsoftware.com
Steven.Totman@ascentialsoftware.com


