
DataStage Designer Tips & Tricks

Jim Tsimis
Advanced Technical Support

Mike Carney
Advanced Consulting Group

Michael Ruland
Field Engineering

Steven Totman
Product Manager Connectivity

Agenda
Designer Session
- General Debug Tips & Tricks
- Handling Complex Flat Files
- Joy of the Command Line
- Transaction Handling Tips & Tricks
  - Managing transactions in Server, Enterprise Edition, Enterprise MVS Edition and RTI
- Re-usability Tips & Tricks
  - Shared Containers, Templates, Pre-configured stages, Runtime column propagation
- Performance Tuning

General Debug Tips & Tricks

Server - Generating Test Data

General Debug

Stage Variables are always executed and drive the transformer stage

Notice there are no input links


Output rows are generated until the constraint evaluates to false; at that point the job stops.
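As a rough sketch of the pattern (the names, values and the Rnd() call are illustrative, not from the original slide), the no-input Transformer could be set up like this:

    Stage variable:  svRowNum    derivation: svRowNum + 1       (initial value 0)
    Constraint:      svRowNum <= 10000                          (job stops after 10,000 rows)
    Output columns:  CUST_ID = svRowNum
                     AMOUNT  = Rnd(10000) / 100                 (assumes the DataStage BASIC Rnd() function)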

Enterprise Edition - Generating Test Data

General Debug

Building test data from live data


- Head stage: selects the first N records from each partition of an input data set and copies them to an output data set.
- Tail stage: selects the last N records from each partition of an input data set and copies them to an output data set.
- Sample stage: samples an input data set. It operates in two modes: Percent mode extracts rows selected by a random number generator and writes a given percentage of them to each output data set; Period mode extracts every Nth row from each partition, where N is the period you supply.
- Filter stage: transfers, unmodified, the records of the input data set that satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links.
- External Filter stage: lets you specify a UNIX command that acts as a filter on the data you are processing. For example, you could use the stage to grep a data set for a certain string or pattern and discard records that do not contain a match.

Sequential File stage FILTER option: use this to specify that the data is passed through a filter program before being written to a file or files on output, or before being placed in a dataset on input.
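Both the External Filter stage and the Sequential File FILTER option take an ordinary UNIX command that reads stdin and writes stdout. As a rough illustration (patterns and ranges are made up), typical filter commands might look like:

    grep 'ACTIVE'      # keep only records containing the string ACTIVE
    grep -v '^#'       # drop comment records that start with #
    gzip -dc           # decompress gzipped source data as it is read
    cut -c1-80         # keep only the first 80 characters of each record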

Handling Complex Flat Files

Server - Decoding Multi-formatted Files

Input column definitions (3 columns)
The selected complex column is decoded into individual columns

Enterprise Edition - Decoding Multi-formatted Files

Indicate the columns to import
Map the columns to their destination

Enterprise Edition - Taming the import

Print field option

od -x -A x
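When an import keeps rejecting records, it often helps to look at the raw bytes to see where fields really start and end. A minimal sketch, assuming a hypothetical data file mydata.dat:

    od -x -A x mydata.dat | head -20   # dump the start of the file in hex, with hex offsets
    od -c mydata.dat | head -20        # same bytes as characters/escapes, handy for spotting delimiters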


Working with Schemas

Converting Copybooks To Schemas
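As a rough illustration of the idea (hypothetical copybook and column names; the exact type mapping depends on how the copybook is imported), a small COBOL group such as:

    01  CUSTOMER-REC.
        05  CUST-ID     PIC 9(8).
        05  CUST-NAME   PIC X(30).
        05  BALANCE     PIC S9(7)V99 COMP-3.

might end up as an Enterprise Edition schema along these lines:

    record
    ( cust_id: decimal[8,0];
      cust_name: string[30];
      balance: decimal[9,2];
    )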


Enterprise Edition - Working with Complex Files


- Make Subrecord stage: combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors.
- Promote Subrecord stage: promotes the columns of an input subrecord to top-level columns; it can also promote the columns in vectors of subrecords, in which case it acts as the inverse of the Combine Records stage.
- Split Subrecord stage: separates an input subrecord field into a set of top-level vector columns.
- Make Vector stage: combines specified columns of an input data record into a vector of columns of the same type.
- Split Vector stage: promotes the elements of a fixed-length vector to a set of similarly named top-level columns.
- Combine Records stage: combines records in which particular key-column values are identical into vectors of subrecords.
- Column Import stage: imports data from a single column and outputs it to one or more columns.
- Column Export stage: exports data from a number of columns of different data types into a single column of data type string or binary.


Enterprise Edition - Complex Structures


Subrecords
A subrecord is a nested data structure. A column of type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. [Slide diagram: an example subrecord structure, with the Make and Promote operations indicated.]

Vectors
A vector is a one-dimensional array of any type except tagged. Elements of a vector are all of the same type and are numbered from 0. A vector can be of fixed or variable length. For a fixed-length vector the length is stated explicitly; for a variable-length vector a property defines a link field that gives the length at run time. [Slide diagram: the Make and Split operations on vectors.]
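To make the subrecord and vector ideas concrete, here is a rough schema sketch (field names are made up, and the exact syntax should be checked against the schema documentation):

    record
    ( order_id: int32;
      monthly_totals[12]: decimal[8,2];
      items[]: subrec
      ( sku: string[8];
        qty: int32;
      );
    )

Here monthly_totals is a fixed-length vector of 12 decimals, while items is a variable-length vector of subrecords, each carrying a sku and a qty.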


Enterprise Edition
Combining Vectors and Subrecords

Combined, vectors and subrecords give you a rich ability to represent very complex data structures.


Joy of the Command Line


The Joy of the Command Line


- What is dsjob?
- A utility to back up all the jobs in a project
- A utility to take BMPs from the command line
- dsjob exposed as a web service
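dsjob is the command-line interface to the DataStage server: it can list projects and jobs, run jobs, and pull status and log information, which makes it the natural building block for scripted housekeeping such as the backup script below. A few representative invocations (host, credentials, project and job names are placeholders):

    dsjob -server etlhost -user dsadm -password secret -lprojects                       # list projects on the server
    dsjob -server etlhost -user dsadm -password secret -ljobs MyProject                 # list jobs in a project
    dsjob -server etlhost -user dsadm -password secret -run -jobstatus MyProject MyJob  # run a job and wait for its status
    dsjob -server etlhost -user dsadm -password secret -jobinfo MyProject MyJob         # report current job status
    dsjob -server etlhost -user dsadm -password secret -logsum MyProject MyJob          # summarize the job log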


Automatically Backing up Projects


@echo off
rem This batch script is used to backup all the projects on a DataStage server. It
rem must be run from a DataStage client machine and the parameters below should be
rem modified to fit your environment. Use of parameters was avoided to simplify backup
rem and allow the command to be customized to a particular environment.
rem
rem Based on design by Manoli Krinos
rem Modified by M Ruland to allow iteration through a complete server set of projects
rem *****************************************************
rem Replace the following variables prior to running
rem *****************************************************
rem Host is server name
rem User is username to use to attach to DataStage
rem PW is password to use to attach to DataStage
rem BackupDir is the directory to place the backed up project in (don't forget final \)
rem DsxCmd is the location of the export command on the client
rem DsxCmd1 is the dsjob command used to retrieve the project list
rem TempProjFile is a temp file to store project names
rem DSLog is the name of the log file accumulated during the backup
rem *****************************************************
Set Host=yourhosthere
Set User=yourusername
Set PW=yourpassword
Set BackupDir=E:\Data\AutoBackup\UserConference\
SET DsxCmd=E:\Progra~1\Ascential\DataStage7Beta\dscmdexport.exe
SET DsxCmd1=C:\Ascential\DataStage\Engine\bin\dsjob.exe
Set TempProjFile=c:\temp\ProjectList.txt
Set DSLog=DataStageDumpLog
rem ------------------------------------------------------------------------
rem Get the current Date
rem ------------------------------------------------------------------------
FOR /f "tokens=2-4 delims=/ " %%a in ('DATE/T') do SET DsxDate=%%c%%a%%b
rem ------------------------------------------------------------------------
rem Get the current Time
rem ------------------------------------------------------------------------
FOR /f "tokens=1* delims=:" %%a in ('ECHO.^|TIME^|FINDSTR "[0-9]"') do (SET DsxTime=%%b)
rem ------------------------------------------------------------------------
rem Set delimiters so that the current time can be broken down into components,
rem then execute a FOR loop to parse the DsxTime variable into Hr/Min/Sec/Hun.
rem ------------------------------------------------------------------------
SET delim1=%DsxTime:~3,1%
SET delim2=%DsxTime:~9,1%
FOR /f "tokens=1-4 delims=%delim1%%delim2% " %%a in ('echo %DsxTime%') do (
  set DsxHr=%%a
  set DsxMin=%%b
  set DsxSec=%%c
  set DsxHun=%%d)
ECHO *** Backing up server %Host% == please be patient
rem Note: the original handout passed %user% as the password here; it should be %PW%
%DsxCmd1% -server %Host% -user %User% -password %PW% -lprojects > %TempProjFile%
echo AutoProjectBackup run on %DsxDate%%DsxHr%%DsxMin%%DsxSec% with the following parameters > %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Host=%Host%, User=%User%, BackupDir=%BackupDir%, DsxCmd=%DsxCmd%, DsxCmd1=%DsxCmd1% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo TempProjFile=%TempProjFile%, DSLog=%DSLog% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
echo Following Projects found on %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
type %TempProjFile% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
rem *************************
rem **  Begin backup loop  **
rem *************************
for /F "tokens=1" %%i in (%TempProjFile%) do (
  ECHO The current Project is %%i >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO Backing up Project %%i
  rem ----------------------------------------------------------------------
  rem Issue message to screen (stdio) that the export is starting.
  rem ----------------------------------------------------------------------
  ECHO Exporting Project=%%i on Host=%Host% into File=%BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx ... >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  %DsxCmd% /H=%Host% /U=%User% /P=%PW% %%i %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  IF NOT %ERRORLEVEL%==0 GOTO BADEXPORT
  ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO *** Completed Export for Project: %%i on Host: %Host% to File: %BackupDir%%HostName%%%i%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.dsx >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO ************************************************************************** >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
  ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
)
ECHO *** Export completed successfully for projects:
type %TempProjFile%
GOTO EXITPT
rem ------------------------------------------------------------------------
:BADEXPORT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
ECHO *** ERROR: Failed to Export Project: %%i on Host: %Host% >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
:EXITPT
ECHO. >> %BackupDir%%DSLog%%DsxDate%%DsxHr%%DsxMin%%DsxSec%full.log
del %TempProjFile%

This script is available from your account team and backs up all projects on an identified server

Available on Ascential Developers Net when it launches in December


Automated Diagramming
"E:\Program Files\Ascential\DataStage\dsdesign.exe" /h=YourHost /u=UserID /p=***** YourProject YourJobName /saveasbmp=e:\Diagrams\YourProject\JobDesigns\YourJobName.bmp

A script is available from Ascential that obtains all the jobs within a selected project and creates a BMP diagram of each job in a selected folder. This can be an effective way to create files that MetaStage can later use to present a graphical representation of the DataStage job designs in an HTML or XML report.
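A minimal sketch of that idea, combining dsjob -ljobs with the dsdesign /saveasbmp switch shown above (the server name, credentials and paths are placeholders, and the real Ascential script handles errors and naming more carefully):

    @echo off
    rem Sketch: save a BMP of every job design in one project
    Set Host=yourhosthere
    Set User=yourusername
    Set PW=yourpassword
    Set Project=YourProject
    Set DsJobCmd=C:\Ascential\DataStage\Engine\bin\dsjob.exe
    Set DsDesign=E:\Program Files\Ascential\DataStage\dsdesign.exe
    Set OutDir=e:\Diagrams\%Project%\JobDesigns\

    rem List the jobs in the project, then render each one to a BMP
    %DsJobCmd% -server %Host% -user %User% -password %PW% -ljobs %Project% > c:\temp\JobList.txt
    for /F "tokens=1" %%j in (c:\temp\JobList.txt) do (
      "%DsDesign%" /h=%Host% /u=%User% /p=%PW% %Project% %%j /saveasbmp=%OutDir%%%j.bmp
    )
    del c:\temp\JobList.txt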


Job Control using Web Services


- Job Control Web Service in C# .NET
- Job Control Web Service in HTML using Web Service Behavior
- Job Control Web Service in Office XP documents
- Job Control Web Service in VBScript


Unit Of Work Tips & Tricks


Server Unit of Work


Unit of Work processing is currently available in the Oracle and ODBC stages.

Used to specify whether to continue or to roll back if a link is skipped due to a constraint on it not being satisfied.

Used to specify whether to continue or to roll back on failure of the SQL statement.


DataStage Enterprise Edition MVS Unit Of Work Support

The DataStage Enterprise Edition MVS (XE/390) Business Rule stage provides the ease of graphical construction through drag-and-drop facilities, as well as the ability to customize the processing rules to meet specific demands.


Enterprise Edition - Unit of Work


[Slide diagram: transactions T1-T4 flow from a Source Queue through Work Queue(s) into Bank Account, Stock Account and Tax Account targets, with a Reject Queue for failures and an EOU (end-of-unit) marker closing the unit of work.]


Enterprise Edition - Unit of Work


Enterprise Edition framework enhancement (end of unit):
- Causes records to flush through the flow
- Stages complete their work but stay running for the next unit of work
MQ-READ unit-of-work solution:
- Utilizes (end of unit)
- MQSeries transaction manager (XA/Open two-phase commit)
- Coordinated transactions between MQSeries, Oracle and other RDBMSs
- Guarantees no loss of data
- Automatic checkpoint/restart
- Highly scalable, near real time

Only available through Ascential Professional Services


RTI Enabled Jobs

The job acts like a web service via the Real Time Integration Server (RTI)

Calls to service treated as unit of work


- Multi-row units of work are supported
- Service arguments are complex arrays


Re-usability Tips & Tricks


Reuse Job Templates

Creates an XML template that can be used as a starter job. Facilities exist to allow consultants to further customize the template so that token values can be replaced during job creation.


Copy Paste From Job to Job


You can also now copy and paste directly into a shared container on both the Enterprise Edition and Server canvases.


Pre-configured Stages
- Implemented through shared containers
- Configure the stage with parameters
- Create an empty job with parameters
- When developing a new job:
  - Start with the empty job with parameters
  - Drag/drop the preconfigured stages while holding the CTRL key
  - Minimal configuration is required


Enterprise Edition - Reusable Transformers


Implemented through shared containers.
Runtime Column Propagation:
- Turn it on for the Transformer stage
- Specify minimal input columns: only the columns used in derivations (not copy-through columns)
- Specify minimal output columns: new columns or columns that require a derivation expression (a complex expression vs. a plain column name)
- The Enterprise Edition framework automatically propagates all other columns


Performance Tuning in Parallel Environments


Performance and Tuning


Any given system can be tuned to favor one application so much that it actually hurts the performance of other applications. This phenomenon is exacerbated as parallel capabilities are introduced into the system. Many factors affect the performance of an application:
- RDBMS configuration and performance
- Memory vs. system working set size
- CPUs vs. system load
- Data input/output throughput rates
- Amdahl's law (an application is gated by its slowest component)


Performance and Tuning


Best practices:
- Establish baselines (especially with I/O); use a copy with no output
- Avoid using only one flow for tuning/performance testing; prototyping can be a powerful tool
- Work in increments: change one thing at a time
- Evaluate data skew: repartition to balance the data flow
- Isolate and solve: determine which stage is causing a problem
- Distribute file systems (if possible) to eliminate bottlenecks
- Do NOT involve the RDBMS in initial testing (see above)
- Understand and evaluate the tuning knobs available


Performance and Tuning


Establishing a baseline:
- Set up at least 3 configurations: sequential; max parallel; max parallel
- Use real data if possible, else use the table definition
- Create or generate a dataset 2-3 times the size of available RAM (limit the test to 10-15 minutes)
- Using the sequential configuration file:
  - Read the dataset to a copy (copy f)
  - Rerun and watch for caching
  - Add a write to a dataset
  - Run a read/sort/copy test (use a relatively random key for the sort)
- Using the max parallel configuration file:
  - Create a non-skewed dataset
  - Rerun the tests above; tune the configuration to obtain a linear application speed-up
  - Review the entire I/O system
  - Review the configuration file to spread I/O activity
- Using the max parallel configuration:
  - Create a non-skewed dataset
  - Rerun the tests above
  - Stress the system, looking for areas of contention

Performance and Tuning


Buffering (Enterprise Edition and Server)
Facility added behind the scenes to optimize and regulate data flow. Its primary purpose is to match the rate data is produced upstream with the rate it is consumed downstream. (see next slide)

Partitioning/Sorting (Enterprise Edition)


Operations added behind the scenes so the developer does not have to worry about partitioning and sorting, while assuring that the flow operates correctly (set APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION to suppress the automatic insertion).

Operator Combination (Enterprise Edition)


Operators are combined behind the scenes to improve performance (set APT_DISABLE_COMBINATION to turn operator combination off).


Performance and Tuning


Controlling the Buffers in DataStage Enterprise Edition
APT_BUFFER_MAXIMUM_TIMEOUT set to 1 for pre V7
Controls the speed of data flow after buffering

APT_BUFFER_MAXIMUM_MEMORY default is 3M
Increase for large memory configurations to avoid buffering to disk

APT_BUFFER_DISK_WRITE_INCREMENT default is 1M
Increase to create larger bursts of I/O during buffering to disk

APT_BUFFER_FREE_RUN default is N * APT_BUFFER_MAXIMUM_MEMORY


increase to reduce data flow impedance for large memory configurations
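Where these get set depends on your site standards; typically as environment variables in dsenv, in the Administrator, or as job-level environment parameters. A rough illustration with made-up values (not recommendations; size them to your memory and I/O budget):

    export APT_BUFFER_MAXIMUM_MEMORY=50000000       # allow ~50 MB per buffer before spilling to scratch disk
    export APT_BUFFER_DISK_WRITE_INCREMENT=4000000  # write to scratch disk in ~4 MB bursts
    export APT_BUFFER_FREE_RUN=2                    # let buffers run further ahead before applying back-pressure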

Controlling the buffers in DataStage Server:
- Set BUFFERSIZE and TIMEOUT for in-process/inter-process row buffering (default is 128K)
- Set for the project in the Administrator, or in the job properties for a particular job


Performance and Tuning


Evaluating performance with Enterprise Edition
APT_DUMP_SCORE
used to understand the details of a data flow.

APT_PM_PLAYER_TIMING
Used to understand the CPU characteristics of a data flow

APT_RECORD_COUNTS
Used to check for data skew across data partitions
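These are plain environment variables, so a simple way to get a one-off reading is to set them for a single run (a shell-style sketch; they can equally be defined per job as environment parameters):

    export APT_DUMP_SCORE=1         # print the job score: operators, datasets and partitioning actually used
    export APT_PM_PLAYER_TIMING=1   # report CPU time consumed by each player process
    export APT_RECORD_COUNTS=1      # report the record count per operator per partition (reveals skew)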

Evaluating performance with Server


Performance statistics
enabled in the Tracing panel of the Job run options presented when a server job is run (Director or Designer)


Performance Tuning
The Configuration File
Tells DataStage how to exploit the underlying computer hardware. For any given system there is no single ideal configuration file, because different jobs exercise the system in different ways. General hints (assuming an SMP environment); a sketch of a simple configuration file follows the list:
- Avoid using the disks that land input and output data as scratch or resource disk
- Do not use NFS or other remotely mounted disks for scratch disk
- Understand the file systems underneath the mount points used by the configuration file
- Separate the I/O between nodes as much as possible to provide maximum I/O bandwidth
- Run your application using various configurations to understand its behavior during volume testing, before moving to production
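A minimal two-node SMP sketch (the host name and mount points are placeholders): each node points its resource and scratch disk at different file systems so the I/O load is spread, per the hints above.

    {
      node "node1" {
        fastname "etlhost"
        pools ""
        resource disk "/fs1/ds/data" {pools ""}
        resource scratchdisk "/fs2/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "etlhost"
        pools ""
        resource disk "/fs3/ds/data" {pools ""}
        resource scratchdisk "/fs4/ds/scratch" {pools ""}
      }
    }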


Ascential Developer Net

Launches In December


DataStage Operator Tips & Tricks (tomorrow): Agenda

Tuesday 9:15am
Operator Tips & Tricks Session
- Upgrades & Installs
- Version Control
- Production Automation
- Running in a High Availability environment


EOD/EOT

E388 8195 92A2 4086 9699 4081 A3A3 8595 8489 9587 40A3 8889 A240 A285 A2A2 8996 9540 !

Please let us know if you have any comments or suggestions regarding this material.

Jim.Tsimis@ascentialsoftware.com
Mike.Carney@ascentialsoftware.com
mruland@ascentialsoftware.com
Steven.Totman@ascentialsoftware.com


