
DataStage
Best Practices & Performance Tuning


Contents

1 Environment Variable Settings
  1.1 Environment Variable Settings for All Jobs
  1.2 Additional Environment Variable Settings
2 Configuration Files
  2.1 Logical Processing Nodes
  2.2 Optimizing Parallelism
  2.3 Configuration File Examples
    2.3.1 Example for Any Number of CPUs and Any Number of Disks
    2.3.2 Example that Reduces Contention
    2.3.3 Smaller Configuration Example
  2.4 Sequential File Stages (Import and Export)
    2.4.1 Improving Sequential File Performance
    2.4.2 Partitioning Sequential File Reads
    2.4.3 Sequential File (Export) Buffering
    2.4.4 Reading from and Writing to Fixed-Length Files
    2.4.5 Reading Bounded-Length VARCHAR Columns
  2.5 Transformer Usage Guidelines
    2.5.1 Choosing Appropriate Stages
    2.5.2 Transformer NULL Handling and Reject Link
    2.5.3 Transformer Derivation Evaluation
    2.5.4 Conditionally Aborting Jobs
  2.6 Lookup vs. Join Stages
  2.7 Capturing Unmatched Records from a Join
  2.8 The Aggregator Stage
  2.9 Appropriate Use of SQL and DataStage Stages
  2.10 Optimizing Select Lists
  2.11 Designing for Restart
  2.12 Database OPEN and CLOSE Commands
  2.13 Database Sparse Lookup vs. Join
  2.14 Oracle Database Guidelines
    2.14.1 Proper Import of Oracle Column Definitions (Schema)
    2.14.2 Reading from Oracle in Parallel
    2.14.3 Oracle Load Options
3 Tips for Debugging Enterprise Edition Jobs
  3.1 Reading a Score Dump
  3.2 Partitioner and Sort Insertion
4 Performance Tips for Job Design
5 Performance Monitoring and Tuning
  5.1 The Job Monitor
  5.2 OS/RDBMS-Specific Tools
  5.3 Obtaining Operator Run-Time Information
  5.4 Selectively Rewriting the Flow
  5.5 Eliminating Repartitions
  5.6 Ensuring Data is Evenly Partitioned
  5.7 Buffering for All Versions
  5.8 Resolving Bottlenecks
    5.8.1 Variable Length Data
    5.8.2 Combinable Operators
    5.8.3 Disk I/O
    5.8.4 Buffering

1 Environment Variable Settings


DataStage EE provides a number of environment variables to control how jobs operate on a UNIX system. In addition to providing required information, environment variables can be used to enable or disable various DataStage features, and to tune performance settings.

1.1 Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.

Environment Variable Settings for All Jobs:

$APT_CONFIG_FILE = [filepath]
    Specifies the full pathname to the EE configuration file.

$APT_DUMP_SCORE = 1
    Outputs the EE score dump to the DataStage job log, providing detailed
    information about the actual job flow, including operators, processes, and
    datasets. Extremely useful for understanding how a job actually ran in the
    environment. (See section 3.1, Reading a Score Dump.)

$OSH_ECHO = 1
    Includes a copy of the generated osh in the job's DataStage log. Starting
    with v7, this option is enabled when the "Generated OSH visible for
    Parallel jobs in ALL projects" option is enabled in DataStage
    Administrator.

$APT_RECORD_COUNTS = 1
    Outputs record counts to the DataStage job log as each operator completes
    processing. The count is per operator per partition.

$APT_PM_SHOW_PIDS = 1
    Places entries in the DataStage job log showing the UNIX process ID (PID)
    for each process started by a job. Does not report PIDs of DataStage
    phantom processes started by Server shared containers.

$APT_BUFFER_MAXIMUM_TIMEOUT = 1
    Maximum buffer delay in seconds.

$APT_THIN_SCORE = 1 (DataStage 7.0 and earlier only)
    Setting this environment variable significantly reduces memory usage for
    very large (>100 operator) jobs.
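As a sketch, these recommended settings can be exported in the shell that launches a job. The configuration path and the commented-out dsjob invocation below are hypothetical; in practice the same values are usually set once in DataStage Administrator at the project level.

```shell
# Hypothetical paths and names; adjust to your own installation.
export APT_CONFIG_FILE=/opt/datastage/configs/prod_4node.apt
export APT_DUMP_SCORE=1             # score dump in the job log
export OSH_ECHO=1                   # generated osh in the job log
export APT_RECORD_COUNTS=1          # per-operator, per-partition row counts
export APT_PM_SHOW_PIDS=1           # UNIX PIDs for each player process
export APT_BUFFER_MAXIMUM_TIMEOUT=1 # maximum buffer delay, in seconds

# dsjob -run -mode NORMAL MyProject MyJob   # hypothetical invocation
echo "APT_CONFIG_FILE=$APT_CONFIG_FILE"
```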

1.2 Additional Environment Variable Settings

Ascential recommends setting the following environment variables on an as-needed basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific EE stages.

NOTE: The environment variable settings in this section are only examples. Set values that are optimal for your environment.

Sequential File Stage Environment Variables:

$APT_EXPORT_FLUSH_COUNT = [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export
    operator) flushes its internal buffer to disk. Setting this value to a low
    number (such as 1) is useful for real-time applications, but there is a
    small performance penalty from increased I/O.

$APT_IMPORT_BUFFER_SIZE = [Kbytes]
$APT_EXPORT_BUFFER_SIZE = [Kbytes]
    Define the size of the I/O buffer for Sequential File reads (imports) and
    writes (exports), respectively. The default is 128 (128K), with a minimum
    of 8. Increasing these values on heavily-loaded file servers may improve
    performance.

$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
    In some disk array configurations, setting this variable to a value equal
    to the read/write size in bytes can improve performance of Sequential File
    import/export operations.

$APT_DELIMITED_READ_SIZE = [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads
    ahead to find the next delimiter. The default is 500 bytes, but this can
    be set as low as 2 bytes. Set this to a lower value when reading from
    streaming inputs (e.g. socket, FIFO) to avoid blocking.

$APT_MAX_DELIMITED_READ_SIZE = [bytes]
    By default, Sequential File (import) reads ahead 500 bytes to find the
    next delimiter. If it is not found, the importer looks ahead
    4*500=2000 (1500 more) bytes, and so on (4X each time) up to 100,000
    bytes. This variable controls the upper bound, which is 100,000 bytes by
    default. When more than 500 bytes of read-ahead is desired, use this
    variable instead of $APT_DELIMITED_READ_SIZE.
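For example, a job writing to a heavily-loaded file server might be tuned as below. The specific numbers are illustrative assumptions, not recommendations; choose values to match your own servers and disk arrays.

```shell
# Illustrative values only; tune to your own environment.
export APT_EXPORT_FLUSH_COUNT=1             # flush every row (real-time targets)
export APT_IMPORT_BUFFER_SIZE=256           # 256K read buffer (default is 128)
export APT_EXPORT_BUFFER_SIZE=256           # 256K write buffer (default is 128)
export APT_CONSISTENT_BUFFERIO_SIZE=1048576 # e.g. match a 1MB array stripe size

echo "import buffer: ${APT_IMPORT_BUFFER_SIZE}K"
```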

Oracle Environment Variables:

$ORACLE_HOME = [path]
    Specifies the installation directory for the current Oracle instance.
    Normally set in a user's environment by Oracle scripts.

$ORACLE_SID = [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
    These two environment variables work together to specify how often target
    rows are committed for target Oracle stages with the Upsert method.
    Commits are made whenever the time interval has passed or the row interval
    is reached, whichever comes first. By default, commits are made every 2
    seconds or 5000 rows.

$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
    Specifies Oracle SQL*Loader options used in a target Oracle stage with the
    Load method. By default, this is set to
    OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
    When set, a target Oracle stage with the Load method will limit the number
    of players to the number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES = [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a
    target Oracle stage with the Load method is written to files instead of
    invoking the Oracle SQL*Loader. The filepath specified by this environment
    variable specifies the file with the SQL*Loader commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
    Allows DataStage to handle Oracle databases which use the special
    characters # and $ in column names.
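A sketch of how these variables might be combined for a target Oracle load. The ORACLE_HOME path, SID, and commit intervals shown are hypothetical; the OPTIONS string is the default noted above.

```shell
# Hypothetical Oracle environment; adjust paths and names to your site.
export ORACLE_HOME=/u01/app/oracle/product/10.2.0
export ORACLE_SID=DWPROD

# Upsert commits: every 10000 rows or every 5 seconds, whichever comes first.
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5

# Direct-path parallel load (this is the documented default value).
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)'

echo "$APT_ORACLE_LOAD_OPTIONS"
```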

Job Monitoring Environment Variables:

$APT_MONITOR_TIME = [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating
    job monitor information at runtime. To enable size-based job monitoring,
    unset this environment variable and set $APT_MONITOR_SIZE instead.

$APT_MONITOR_SIZE = [rows]
    Determines the minimum number of records the job monitor reports. The
    default of 5000 records is usually too small. To minimize the number of
    messages during large job runs, set this to a higher value
    (e.g. 1000000).

$APT_NO_JOBMON = 1
    Disables job monitoring completely. In rare instances, this may improve
    performance. In general, this should only be set on a per-job basis when
    attempting to resolve performance bottlenecks.

$APT_RECORD_COUNTS = 1
    Prints record counts in the job log as each operator completes processing.
    The count is per operator per partition.
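For instance, to switch from the default time-based monitoring to size-based monitoring, unset the time interval and set a row threshold. The row count shown is illustrative.

```shell
# Switch to size-based job monitoring (hypothetical threshold).
unset APT_MONITOR_TIME              # disable the time-based monitor
export APT_MONITOR_SIZE=1000000     # report once per 1,000,000 rows instead

echo "monitor size: $APT_MONITOR_SIZE rows"
```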

2 Configuration Files
The configuration file tells DataStage Enterprise Edition how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources, such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.

When you modify the system by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without your having to alter the job design.

There is not necessarily one ideal configuration file for a given system, because of the high variability in the way different jobs work. For this reason, multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources. At runtime, the configuration file is specified through the environment variable $APT_CONFIG_FILE.

2.1 Logical Processing Nodes

The configuration file defines one or more EE processing nodes on which parallel jobs will run. EE processing nodes are a logical rather than a physical construct. For this reason, it is important to note that the number of processing nodes does not necessarily correspond to the actual number of CPUs in your system.

Within a configuration file, the number of processing nodes defines the degree of parallelism and resources that a particular job will use to run. It is up to the UNIX operating system to actually schedule and run the processes that make up a DataStage job across physical processors. A configuration file with a larger number of nodes generates a larger number of processes that use more memory (and perhaps more disk activity) than a configuration file with a smaller number of nodes.

While the DataStage documentation suggests creating half the number of nodes as physical CPUs, this is a conservative starting point that is highly dependent on system configuration, resource availability, job design, and other applications sharing the server hardware. For example, if a job is highly I/O dependent, or dependent on external (e.g. database) sources or targets, it may be appropriate to have more nodes than physical CPUs.

For typical production environments, a good starting point is to set the number of nodes equal to the number of CPUs. For development environments, which are typically smaller and more resource-constrained, create smaller configuration files (e.g. 2-4 nodes). Note that even in the smallest development environments, a 2-node configuration file should be used to verify that job logic and partitioning will work in parallel (as long as the test data can sufficiently identify data discrepancies).
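The sizing heuristic above can be sketched as a small shell fragment. This is only an illustration of the starting-point rule, not a substitute for measuring your own workload; `nproc` is Linux-specific, and ENV_TYPE is a hypothetical switch.

```shell
# Starting-point heuristic: one node per CPU in production, 2 nodes in dev.
# nproc is Linux-specific; use an equivalent (psrinfo, lsdev, sysctl) elsewhere.
CPUS=$(nproc)
ENV_TYPE=production        # hypothetical: "production" or "development"

if [ "$ENV_TYPE" = "production" ]; then
    NODES=$CPUS
else
    NODES=2
fi

echo "suggested starting node count: $NODES"
```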

2.2 Optimizing Parallelism

The degree of parallelism of a DataStage EE application is determined by the number of nodes you define in the configuration file. Parallelism should be optimized rather than maximized. Increasing parallelism may better distribute your workload, but it also adds to your overhead, because the number of processes increases. Therefore, you must weigh the gains of added parallelism against the potential losses in processing efficiency. The CPUs, memory, disk controllers, and disk configuration that make up your system influence the degree of parallelism you can sustain.

Keep in mind that the closest to equal partitioning of data contributes to the best overall performance of an application running in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew. When business requirements dictate a partitioning strategy that is excessively skewed, remember to change the partition strategy to a more balanced one as soon as possible in the job flow. This will minimize the effect of data skew and significantly improve overall job performance.
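One way to quantify skew outside of DataStage is to compare the largest partition against the average partition size. The per-partition counts below are hypothetical; in practice they can be taken from the per-operator, per-partition counts that $APT_RECORD_COUNTS writes to the job log. A ratio of 1.0 means perfectly even partitions.

```shell
# Hypothetical per-partition row counts; the fourth partition is badly skewed.
counts="250000 248000 251000 1000"

# Skew = largest partition / average partition size.
echo $counts | awk '{ max=0; sum=0
    for (i=1; i<=NF; i++) { sum+=$i; if ($i>max) max=$i }
    printf "skew=%.2f\n", max/(sum/NF) }'
```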

2.3 Configuration File Examples

Given the large number of considerations for building a configuration file, where do you begin? For starters, the default configuration file (default.apt) created when DataStage is installed is appropriate for only the most basic environments. The default configuration file has the following characteristics:
- number of nodes = number of physical CPUs
- disk and scratchdisk storage use subdirectories within the DataStage install filesystem

You should create and use a new configuration file that is optimized to your hardware and file systems. Because different job flows have different needs (CPU-intensive? Memory-intensive? Disk-intensive? Database-intensive? Sorts? The need to share resources with other jobs/databases/applications? etc.), it is often appropriate to have multiple configuration files, each optimized for a particular type of processing.

With the synergistic relationship between hardware (number of CPUs, speed, cache, available system memory, number and speed of I/O controllers, local vs. shared disk, RAID configurations, disk size and speed, network configuration and availability), software topology (local vs. remote database access, SMP vs. clustered processing), and job design, there is no definitive science for formulating a configuration file. This section attempts to provide some guidelines based on experience with actual production applications.

IMPORTANT: Follow the order of all sub-items within individual node specifications in the example configuration files given in this section.

2.3.1 Example for Any Number of CPUs and Any Number of Disks

Assume you are running on a shared-memory multi-processor (SMP) server, which is the most common platform today. Let's assume these properties:
- computer host name "fastone"
- 6 CPUs
- 4 separate file systems on 4 drives, named /fs0, /fs1, /fs2, /fs3

You can adjust the sample to match your precise environment. The configuration file you would use as a starting point would look like the one below. Assuming that the system load from processing outside of DataStage is minimal, it may be appropriate to create one node per CPU as a starting point. In the following example, the way disk and scratchdisk resources are handled is the important point.
{ /* config files allow C-style comments */
  /* Configuration files do not have flexible syntax. Keep all the sub-items
     of the individual node specifications in the order shown here. */
  node "n0" {
    pools ""                 /* on an SMP, node pools aren't used often */
    fastname "fastone"
    resource scratchdisk "/fs0/ds/scratch" {}  /* start with fs0 */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {}            /* start with fs0 */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n1" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {}  /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {}            /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
  node "n2" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs2/ds/scratch" {}  /* start with fs2 */
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource disk "/fs2/ds/disk" {}            /* start with fs2 */
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
  }
  node "n3" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs3/ds/scratch" {}  /* start with fs3 */
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource disk "/fs3/ds/disk" {}            /* start with fs3 */
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
  }
  node "n4" {
    pools ""
    fastname "fastone"
    /* Now we have rotated through starting with a different disk, but the
       fundamental problem in this scenario is that there are more nodes than
       disks. So what do we do now? The answer: something that is not perfect.
       We're going to repeat the sequence. You could shuffle differently,
       i.e. use /fs0 /fs2 /fs1 /fs3 as an order, but that most likely won't
       matter. */
    resource scratchdisk "/fs0/ds/scratch" {}  /* start with fs0 again */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {}            /* start with fs0 again */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n5" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {}  /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {}            /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
} /* end of entire config */

The pattern of the configuration file above is a "give every node all the disks" example, albeit in different orders to minimize I/O contention. This configuration method works well when the job flow is complex enough that it is difficult to determine and precisely plan for good I/O utilization. Within each node, EE does not stripe the data across multiple filesystems. Rather, it fills the disk and scratchdisk filesystems in the order specified in the configuration file. In the example above, the order of the disks is purposely shifted for each node, in an attempt to minimize I/O contention.

Even in this example, giving every partition (node) access to all the I/O resources can cause contention, but EE attempts to minimize this by using fairly large I/O blocks. This configuration style works for any number of CPUs and any number of disks, since it doesn't require any particular correspondence between them. The heuristic here is: when it's too difficult to figure out precisely, at least go for achieving balance.

2.3.2 Example that Reduces Contention

The alternative to the first configuration method is more careful planning of the I/O behavior to reduce contention. You can imagine this could be hard given our hypothetical 6-way SMP with 4 disks, because setting up the obvious one-to-one correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to be good for overall performance, since we would create a hotspot. We could give every CPU two disks and rotate them around, but that would be little different from the previous strategy. So, let's imagine a less constrained environment with two additional disks:
- computer host name "fastone"
- 6 CPUs
- 6 separate file systems on 6 drives, named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5

Now a configuration file for this environment might look like this:
{ node "n0" { pools "" fastname "fastone" resource disk "/fs0/ds/data" {pools ""} resource scratchdisk "/fs0/ds/scratch" {pools ""} } node "node2" { fastname "fastone" pools "" resource disk "/fs1/ds/data" {pools ""} resource scratchdisk "/fs1/ds/scratch" {pools ""} } node "node3" { fastname "fastone" pools "" resource disk "/fs2/ds/data" {pools ""} resource scratchdisk "/fs2/ds/scratch" {pools ""} } node "node4" { fastname "fastone" pools "" resource disk "/fs3/ds/data" {pools ""} resource scratchdisk "/fs3/ds/scratch" {pools ""} } node "node5" { fastname "fastone" pools "" resource disk "/fs4/ds/data" {pools ""}

Page 10 of 30

Data Stage Best Practices & Performance Tuning


resource scratchdisk "/fs4/ds/scratch" {pools ""} } node "node6" { fastname "fastone" pools "" resource disk "/fs5/ds/data" {pools ""} resource scratchdisk "/fs5/ds/scratch" {pools ""} } } /* end of entire config */

While this is the simplest scenario, it is important to realize that no single player, stage, or operator instance on any one partition can go faster than the single disk it has access to.

You could combine strategies by adding in a node pool where disks have a one-to-one association with nodes. These nodes would then not be in the default node pool, but in a special one that you would specifically assign to stage/operator instances.

2.3.3 Smaller Configuration Example

Because disk and scratchdisk resources are assigned per node, depending on the total disk space required to process large jobs, it may be necessary to distribute file systems across nodes in smaller environments (fewer available CPUs/memory). Using the above server example, this time with 4 nodes:
- computer host name "fastone"
- 4 CPUs
- 8 separate file systems, named /fs0 through /fs7
{ node "node1" { fastname "fastone" pools "" resource disk "/fs0/ds/data" {pools ""} /* start with fs0 */ resource disk "/fs4/ds/data" {pools ""} resource scratchdisk "/fs4/ds/scratch" {pools ""} /* start with fs4 */ resource scratchdisk "/fs0/ds/scratch" {pools ""} } node "node2" { fastname "fastone" pools "" resource disk "/fs1/ds/data" {pools ""} resource disk "/fs5/ds/data" {pools ""} resource scratchdisk "/fs5/ds/scratch" {pools ""} resource scratchdisk "/fs1/ds/scratch" {pools ""} } node "node3" { fastname "fastone" pools "" resource disk "/fs2/ds/data" {pools ""} resource disk "/fs6/ds/data" {pools ""} resource scratchdisk "/fs6/ds/scratch" {pools ""} resource scratchdisk "/fs2/ds/scratch" {pools ""} } node "node4" { fastname "fastone" pools "" resource disk "/fs3/ds/data" {pools ""} resource disk "/fs7/ds/data" {pools ""} resource scratchdisk "/fs7/ds/scratch" {pools ""} resource scratchdisk "/fs3/ds/scratch" {pools ""} } } /* end of entire config */

The 4-node example above illustrates another concept in configuration file setup: you can assign multiple disk and scratchdisk resources to each node.

Unfortunately, the physical limitations of available hardware and disk configuration don't always lend themselves to the clean configurations illustrated above.

Other configuration file tips:
- Consider avoiding the disk(s) that your input files reside on. Often those disks will be hotspots until the input phase is over. If the job is large and complex, this is less of an issue, since the input part is proportionally less of the total work.
- Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles, even if they're located on a RAID system. Do not trust high-level RAID/SAN monitoring tools, as their cache hit ratios are often misleading.
- Never use NFS file systems for scratchdisk resources. Know what's real and what's NFS: real disks are directly attached, or are reachable over a SAN (a storage-area network: dedicated, just for storage, low-level protocols).
- Proper configuration of scratch and resource disks (and the underlying filesystem and physical hardware architecture) can significantly affect overall job performance.
- Beware if you use NFS (and, often, SAN) filesystem space for disk resources. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.

2.4 Sequential File Stages (Import and Export)

2.4.1 Improving Sequential File Performance

If the source file is fixed-width or delimited, the "Readers Per Node" option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.

If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.

On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing these values may improve performance.

Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

2.4.2 Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method for a Sequential File read:
- Don't read from a Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
- When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

2.4.3 Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty associated with increased I/O.



2.4.4 Reading from and Writing to Fixed-Length Files

Particular attention must be paid when processing fixed-length fields using the Sequential File stage:

If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.

If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.

When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set these column properties.

To display each field value during import, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties, of the Orchestrate 7.0 Operators Reference.

2.4.5 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, the extra characters will be silently truncated. Starting with v7.01, setting the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs DataStage to reject records with strings longer than their declared maximum column length.
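As a sketch, the variable only needs to be present in the job's environment before the import runs. The value "1" below is an assumption — presence of the flag is what typically matters for switches of this kind:

```shell
# Hedged sketch: reject (rather than silently truncate) over-length string
# fields on import, available starting with v7.01.
# The value "1" is an assumption; the variable simply needs to be set.
export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1
echo "overrun rejection flag: $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS"
```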

2.5 Transformer Usage Guidelines

2.5.1 Choosing Appropriate Stages

The parallel Transformer stage always generates C++ code, which is then compiled into a parallel component. For this reason, it is important to minimize the number of Transformers, and to use other stages (Copy, Filter, Switch, etc.) when derivations are not needed.

The Copy stage should be used instead of a Transformer for simple operations, including:
- Job design placeholder between stages (unless the Force option is set to true, EE will optimize this out at runtime)
- Renaming columns
- Dropping columns
- Default type conversions

Note that rename, drop (if runtime column propagation is disabled), and default type conversion can also be performed in the output Mapping tab of any stage.

NEVER use the BASIC Transformer stage in large-volume job flows. Instead, user-defined functions and routines can expand parallel Transformer capabilities.

Consider, if possible, implementing complex derivation expressions that follow regular patterns with Lookup tables instead of a Transformer with nested derivations. For example, the derivation expression:

If A=0,1,2,3 Then B=X
If A=4,5,6,7 Then B=C

could be implemented with a lookup table containing values for column A and corresponding values of column B.

Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.

In v7 and later, the Filter and/or Switch stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions. In v7 and later, the Modify stage can be used for non-default type conversions, null handling, and character string trimming. See section 7.5 for more information.

Buildops should be used instead of Transformers in the handful of scenarios where complex reusable logic is required, or where existing Transformer-based job flows do not meet performance requirements.

2.5.2 Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression.
To create a Transformer reject link in DataStage Designer, right-click on an output link and choose Convert to Reject.


The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:

If ISNULL(link.col) Then ... Else ...

Note that if an incoming column is only used in a pass-through derivation, the Transformer will allow this row to be output. DataStage release 7 enhances this behavior by placing warnings in the log file when discards occur.

2.5.3 Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer. For example, TrimLeadingTrailing(string) works only if string is a Varchar field; thus, the incoming column must be type Varchar before it is evaluated in the Transformer.

2.5.4 Conditionally Aborting Jobs

The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the Abort After Rows count to the number of rows allowed before the job should be aborted (e.g. 1).

Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is important to set the Sequential File buffer flush (see section 2.4.3) or database commit parameters.

2.6 Lookup vs. Join Stages

The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the datasets are larger than available resources, the JOIN or MERGE stage should be used.

If the reference to a Lookup is read directly from an Oracle table, and the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate.

2.7 Capturing Unmatched Records from a Join

The Join stage does not provide reject handling for unmatched records (such as in an Inner Join scenario). If unmatched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of a match on key values.


During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE. A Filter stage can then be used to test for NULL values in unmatched columns.

In some cases, it is simpler to use a Column Generator to add an indicator column, with a constant value, to each of the outer links and test that column for the constant after the join has been performed. This approach is also handy with Lookups that have multiple reference links.

2.8 The Aggregator Stage

By default, the output data type of a parallel Aggregator stage calculation or recalculation column is Double. Starting with v7.01 of DataStage EE, the optional property Aggregations/Default to Decimal Output specifies that all calculations or recalculations produce decimal output of the specified precision and scale. You can also specify that the result of an individual calculation or recalculation is decimal by using the optional Decimal Output subproperty.

2.9 Appropriate Use of SQL and DataStage Stages

When using relational database sources, there is often a functional overlap between SQL and DataStage stages. Although it is possible to use either SQL or DataStage to solve a given business problem, the optimal implementation leverages the strengths of each technology to provide maximum throughput and developer productivity. While there are extreme scenarios where the appropriate technology choice is clear, there may be gray areas where the decision should be based on factors such as developer productivity, metadata capture and re-use, and ongoing application maintenance costs.

The following guidelines can assist with the appropriate use of SQL and DataStage technologies in a given job flow:

a) When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job. This minimizes impact on network and memory resources, and leverages the database capabilities.

b) Use a SQL Join to combine data from tables with a small number of rows in the same database instance, especially when the join columns are indexed.

c) When combining data from very large tables, or when the source includes a large number of database tables, the DataStage EE Sort and Join stages can be significantly faster than an equivalent SQL


query. In this scenario, it can still be beneficial to use database filters (WHERE clause) if appropriate.

d) Avoid the use of database stored procedures (e.g. Oracle PL/SQL) on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage components.

2.10 Optimizing Select Lists


For best performance and optimal memory usage, it is best to explicitly specify column names on all source database stages, instead of using an unqualified Table or SQL SELECT * read. For the Table read method, always specify the Select List subproperty. For Auto-Generated SQL, the DataStage Designer will automatically populate the select list based on the stage's output column definitions. The only exception to this rule is when building dynamic database jobs that use runtime column propagation to process all columns in a source table.

2.11 Designing for Restart


To enable restart of high-volume jobs, it is important to separate the transformation process from the database write (Load or Upsert) operation. After transformation, the results should be landed to a parallel data set. Subsequent job(s) should read this data set and populate the target table using the appropriate database stage and write method. As a further optimization, a Lookup stage (or Join stage, depending on data volume) can be used to identify existing rows before they are inserted into the target table.

2.12 Database OPEN and CLOSE Commands


The native parallel database stages provide options for specifying OPEN and CLOSE commands. These options allow commands (including SQL) to be sent to the database before (OPEN) or after (CLOSE) all rows are read/written/loaded to the database. OPEN and CLOSE are not offered by plug-in database stages.

For example, the OPEN command could be used to create a temporary table, and the CLOSE command could be used to select all rows from the temporary table and insert them into a final target table. As another example, the OPEN command can be used to create a target table, including database-specific options (tablespace, logging, constraints, etc.) not possible with the Create option. In general, don't let EE generate target tables unless they are used for temporary storage. There are few options to specify Create table options, and doing so may violate data-management (DBA) policies.

It is important to understand the implications of specifying a user-defined OPEN or CLOSE command. For example, when reading from DB2, a default OPEN statement


places a shared lock on the source. When specifying a user-defined OPEN command, this lock is not sent and should be specified explicitly if appropriate. Further details are outlined in the respective database sections of the Orchestrate Operators Reference, which is part of the Orchestrate OEM documentation.

2.13 Database Sparse Lookup vs. Join


Data read by any database stage can serve as the reference input to a Lookup operation. By default, this reference data is loaded into memory like any other reference link (Normal Lookup). When directly connected as the reference link to a Lookup stage, both the DB2/UDB Enterprise and Oracle Enterprise stages allow the lookup type to be changed to Sparse, sending individual SQL statements to the reference database for each incoming Lookup row. Sparse Lookup is only available when the database stage is directly connected to the reference link, with no intermediate stages.

IMPORTANT: The individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage JOIN stage between the input and the DB2 reference data than it is to perform a Sparse Lookup. For scenarios where the number of input rows is significantly smaller (e.g. 1:100 or more) than the number of reference rows in an Oracle table, a Sparse Lookup may be appropriate.

2.14 Oracle Database Guidelines


2.14.1 Proper Import of Oracle Column Definitions (Schema)

DataStage EE always uses the Oracle table definition, regardless of explicit job design metadata (Data Type, Nullability, etc.). IMPORTANT: To avoid unexpected default type conversions, always import Oracle table definitions using the orchdbutil option (in v6.0.1 or later) of DataStage Designer.

2.14.2 Reading from Oracle in Parallel

By default, the Oracle Enterprise stage reads sequentially from its source table or query. Setting the partition table option to the specified table will enable parallel extracts from an Oracle source. The underlying Oracle table does not have to be partitioned for parallel read within DataStage EE.

It is important to note that certain types of queries cannot run in parallel. Examples include:
- queries containing a GROUP BY clause that are also hash partitioned on the same field
- queries performing a non-collocated join (a SQL JOIN between two tables that are not stored in the same partitions with the same partitioning strategy)

2.14.3 Oracle Load Options

When writing to an Oracle table (using Write Method = Load), Parallel Extender uses the Parallel Direct Path Load method. When using this method, the Oracle stage cannot write to a table that has indexes (including indexes automatically generated by Primary Key constraints) unless you specify the Index Mode option (maintenance, rebuild). Setting the environment variable $APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE, PARALLEL=FALSE) also allows loading of indexed tables without index maintenance; in this instance, the Oracle load will be done sequentially.

The Upsert Write Method can be used to insert rows into a target Oracle table without bypassing indexes or constraints. In order to automatically generate the SQL required by the Upsert method, the key column(s) must be identified using the check boxes in the column grid.
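The load-options variable above is set as an ordinary environment variable. A minimal sketch of the setting described in the text (quoting matters, since the value contains spaces and parentheses):

```shell
# Sketch: allow direct-path load into an indexed Oracle table without index
# maintenance. As noted above, this forces the load to run sequentially.
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)'
echo "$APT_ORACLE_LOAD_OPTIONS"
```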



3 Tips for Debugging Enterprise Edition Jobs


There are a number of tools available to debug DataStage EE jobs. The general process for debugging a job is:

- Check the DataStage job log for warnings. These may indicate an underlying logic problem or unexpected data type conversion. When a fatal error occurs, the log entry is sometimes preceded by a warning condition.
- Enable the Job Monitoring environment variables detailed in section 5.2.
- Use the Data Set Management tool (available in the Tools menu of DataStage Designer or DataStage Manager) to examine the schema, look at row counts, and manage source or target Parallel Data Sets.
- For flat (sequential) sources and targets:
  o To display the actual contents of any file (including embedded control characters or ASCII NULs), use the UNIX command od -xc.
  o To display the number of lines and characters in a specified ASCII text file, use the UNIX command wc -lc [filename]. Dividing the total number of characters by the number of lines provides an audit to ensure all rows are the same length. NOTE: the wc command counts UNIX line delimiters, so if the file has any binary columns, this count may be incorrect.
- Use $OSH_PRINT_SCHEMAS to verify that the job's runtime schemas match what the job developer expected in the design-time column definitions.
- Examine the score dump (placed in the DataStage log when $APT_DUMP_SCORE is enabled).
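The od/wc audit above can be scripted. A self-contained sketch — the file name and contents are hypothetical stand-ins for a real fixed-length source file:

```shell
# Create a hypothetical fixed-length test file: 3 rows of 4 characters each.
printf 'AAAA\nBBBB\nCCCC\n' > /tmp/fixed_audit.txt

# Show raw bytes, including any embedded control characters or ASCII NULs.
od -c /tmp/fixed_audit.txt

# Audit: characters / lines should equal the record length plus 1 for the
# UNIX newline delimiter; a non-integer result means ragged rows.
lines=$(wc -l < /tmp/fixed_audit.txt)
chars=$(wc -c < /tmp/fixed_audit.txt)
echo "record length incl. delimiter: $(( chars / lines ))"   # prints 5
```

For a real audit, point the commands at the actual source file and compare the computed length against the expected record width.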

3.1 Reading a Score Dump

When attempting to understand an EE flow, the first task is to examine the score dump, which is generated when you set APT_DUMP_SCORE=1 in your environment. A score dump includes a variety of information about a flow, including how composite operators and shared containers break down; where data is repartitioned, and how it is repartitioned; which operators, if any, have been inserted by EE; what degree of parallelism each operator runs with; and exactly which nodes each operator runs on. Also available is some information about where data may be buffered.


The following score dump shows a flow with a single dataset, which has a hash partitioner that partitions on key field a. It shows three stages: Generator, Sort (tsort) and Peek. The Peek and Sort stages are combined; that is, they have been optimized into the same process. All stages in this flow are running on one physical node; the job runs 3 processes across 2 logical nodes.

##I TFSC 004000 14:51:50(000) <main_program> This step has 1 dataset:
ds0: {op0[1p] (sequential generator)
  eOther(APT_HashPartitioner { key={ value=a } })->eCollectAny
  op1[2p] (parallel APT_CombinedOperatorController:tsort)}
It has 2 operators:
op0[1p] {(sequential generator)
  on nodes (
    lemond.torrent.com[op0,p0]
  )}
op1[2p] {(parallel APT_CombinedOperatorController:
    (tsort)
    (peek)
  ) on nodes (
    lemond.torrent.com[op1,p0]
    lemond.torrent.com[op1,p1]
  )}

In a score dump, there are three areas to investigate:
- Are there sequential stages?
- Is needless repartitioning occurring?
- In a cluster, are the computation-intensive stages shared evenly across all nodes?

3.2 Partitioner and Sort Insertion

Partitioner insertion and sort insertion are two processes that can add components to the work flow. Because these processes, especially sort insertion, can be computationally expensive, understanding the score dump can help a user detect any superfluous sorts or partitioners.

EE automatically inserts partitioner and sort components in the work flow to optimize performance. This makes it possible for users to write correct data flows without having to deal directly with issues of parallelism. However, there are some situations where these features can be a hindrance. Presorted data coming from a source other than a dataset must be explicitly marked as sorted, using the Don't Sort, Already Sorted key property in the Sort stage. This same mechanism can be used to override sort insertion on any specific link. Partitioner insertion may be disabled on a per-link basis by specifying SAME partitioning on the appropriate link. Orchestrate users accomplish this by inserting same partitioners.


In some cases, setting $APT_SORT_INSERTION_CHECK_ONLY=1 may improve performance if the data is pre-partitioned or pre-sorted but EE does not know this. With this setting, EE still inserts sort stages, but instead of actually sorting the data, they verify that the incoming data is sorted correctly. If the data is not correctly sorted, the job will abort.

As a last resort, $APT_NO_PART_INSERTION=1 and $APT_NO_SORT_INSERTION=1 can be used to disable the two features on a flow-wide basis. It is generally advised that both partitioner insertion and sort insertion be left alone by the average user, and that more experienced users carefully analyze the score to determine whether sorts or partitioners are being inserted sub-optimally.
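A sketch of the settings described above, in the order of preference the text suggests (the last-resort flags are shown commented out precisely because they disable a safety feature flow-wide):

```shell
# Sketch: when data is known to be pre-sorted, have the inserted tsort
# components verify order instead of re-sorting. The job aborts if the
# incoming data turns out not to be sorted.
export APT_SORT_INSERTION_CHECK_ONLY=1

# Last resort only -- disables automatic insertion for the whole flow:
# export APT_NO_PART_INSERTION=1
# export APT_NO_SORT_INSERTION=1
```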



4 Performance Tips for Job Design


- Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance; it also makes each transfer of a record from one stage to the next more expensive.
  o When reading from database sources, use a select list to read needed columns instead of the entire table (if possible).
  o To ensure that columns are actually removed using a stage's Output Mapping, disable runtime column propagation for that column.
- Always specify a maximum length for Varchar columns. Unbounded strings (Varchars without a maximum length) can have a significant negative performance impact on a job flow. There are limited scenarios when the memory overhead of handling large Varchar columns would dictate the use of unbounded strings instead, for example:
  o Varchar columns of a large (e.g. 32K) maximum length that are rarely populated
  o Varchar columns of a large maximum length with highly varying data sizes
  Placing unbounded columns at the end of the schema definition may improve performance.
- In DataStage v7.0 and earlier, limit the use of variable-length records within a flow. Depending on the number of variable-length columns, it may be beneficial to convert incoming records to fixed-length types at the start of a job flow, and trim to variable-length at the end of the flow before writing to a target database or flat file (using fixed-length records can dramatically improve performance). DataStage v7.01 and later implement internal performance optimizations for variable-length columns that specify a maximum length.
- Avoid type conversions if possible.
  o Be careful to use the proper data type from the source (especially Oracle) in EE job design.
  o Enable $OSH_PRINT_SCHEMAS to verify that the runtime schema matches the job design column definitions.
  o Verify that the data type of defined Transformer stage variables matches the expected result type.
- Minimize the number of Transformers. Where appropriate, use other stages (e.g. Copy, Filter, Switch, Modify) instead of the Transformer.
- NEVER use the BASIC Transformer in large-volume data flows. Instead, user-defined functions and routines can expand the capabilities of the parallel Transformer.


- Buildops should be used instead of Transformers in the handful of scenarios where complex reusable logic is required, or where existing Transformer-based job flows do not meet performance requirements.
- Minimize and combine the use of Sorts where possible.
  o It is sometimes possible to re-arrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings. If data has already been partitioned and sorted on a set of key columns, specifying the don't sort, previously sorted option for those key columns in the Sort stage will reduce the cost of sorting and take greater advantage of pipeline parallelism.
  o When writing to parallel datasets, sort order and partitioning are preserved. When reading from these datasets, try to maintain this sorting, if possible, by using SAME partitioning.
  o The stable sort option is much more expensive than a non-stable sort, and should only be used if there is a need to maintain row order other than as needed to perform the sort.
  o Performance of individual sorts can be improved by increasing the memory usage per partition using the Restrict Memory Usage (MB) option of the standalone Sort stage. The default setting is 20MB per partition. Note that sort memory usage can only be specified for standalone Sort stages; it cannot be changed for inline (on a link) sorts.

5 Performance Monitoring and Tuning


5.1 The Job Monitor
The Job Monitor provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a Job Monitor snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some job semantics, a snapshot image of the flow may not be a representative sample of the performance over the course of the entire job.

The CPU summary information provided by the Job Monitor is useful as a first approximation of where time is being spent in the flow. However, it does not include operators that are inserted by EE. Such operators include sorts that were not explicitly included and the sub-operators of composite operators. The Job Monitor also does not monitor sorts on links. For these components, the score dump can be of assistance; see Reading a Score Dump (section 3.1).

A worst-case scenario occurs when a job flow reads from a dataset and passes immediately to a sort on a link. The job will appear to hang when, in fact, rows are being read from the dataset and passed to the sort.

5.2 OS/RDBMS-Specific Tools

Each OS and RDBMS has its own set of tools that may be useful in performance monitoring. Talking to the system administrator or DBA may provide some useful monitoring strategies.



5.3 Obtaining Operator Run-Time Information

Setting $APT_PM_PLAYER_TIMING=1 provides information for each stage in the DataStage job log. For example:
##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04, op=0, ptn=0
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk elapsed: 0.04 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1, ptn=0
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk elapsed: 0.01 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1, ptn=1
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk elapsed: 0.00 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)

This output shows that each partition of each operator has consumed about one tenth of a second of CPU time during its runtime portion. In a real-world flow, we'd see many more operators and partitions.

It can often be very useful to see how much CPU each operator, and each partition of each component, is using. If one partition of an operator is using significantly more CPU than others, it may mean the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys, might be a useful strategy. If one operator is using a much larger portion of the CPU than others, it may be an indication that there is a problem in your flow. Common sense is generally required here; for example, a sort is going to use dramatically more CPU time than a copy. This, however, gives you a sense of which operators are using more of the CPU, and when combined with other metrics presented in this document, the information can be very enlightening.

Setting $APT_DISABLE_COMBINATION=1, which globally disables stage combination, may be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag changes the performance behavior of your flow, so this should be done with care.

Unlike the Job Monitor CPU percentages, setting $APT_PM_PLAYER_TIMING provides timings for every operator within the flow.
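A sketch of a profiling run's environment, combining the two variables discussed above. Disabling combination is optional and changes the flow's runtime behavior, so it is shown as a deliberate, temporary measurement setting:

```shell
# Sketch: per-operator CPU timing in the job log.
export APT_PM_PLAYER_TIMING=1

# Optional, for finer-grained attribution: report each operator separately
# instead of as a combined process. This changes runtime behavior -- use
# only while measuring, then remove.
export APT_DISABLE_COMBINATION=1
```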



5.4 Selectively Rewriting the Flow

One of the most useful mechanisms for determining what is causing bottlenecks in your flow is to isolate sections of the flow by rewriting portions of it to exclude stages from the set of possible causes. The goal of modifying the flow is to see whether the modified flow runs noticeably faster than the original. If the flow is running at roughly an identical speed, change more of the flow.

While editing a flow for testing, it is important to keep in mind that removing one operator might have unexpected effects elsewhere in the flow. Comparing the score dump between runs is useful before drawing conclusions about what has made the performance difference. When modifying the flow, be aware of introducing any new performance problems. For example, adding a persistent dataset to a flow introduces disk contention with any other datasets being read. This is rarely a problem, but it might be significant in some cases.

Reading and writing data are two obvious places to be aware of potential performance bottlenecks. Changing a job to write into a Copy stage with no outputs discards the data; keep the degree of parallelism the same, with a nodemap if necessary. Similarly, landing any read data to a dataset can be helpful if the point of origin of the data is a flat file or RDBMS.

This pattern should be followed, removing any potentially suspicious stages while trying to keep the rest of the flow intact. Removing any customer-created operators or sequence operators should be at the top of the list. Much work has gone into the 7.0 release to improve Transformer performance.

5.5 Eliminating Repartitions

Superfluous repartitioning should be eliminated. Due to operator or license limitations (import, export, RDBMS operators, SAS operators, and so on), some operators run with a degree of parallelism that is different from the default. Some of this cannot be eliminated, but understanding where, when, and why these repartitions occur is important for understanding the flow. Repartitions are especially expensive when the data is being repartitioned on an MPP, where significant network traffic is generated.

Sometimes a repartition can be moved further upstream in order to eliminate a previous, implicit repartition. Imagine an Oracle read that does some processing and is then hashed and joined with another dataset. There might be a repartition after the Oracle read stage and then the hash, when only one repartitioning is ever necessary.

Similarly, a nodemap on a stage may prove useful for eliminating repartitions. For example, a transform between a DB2 read and a DB2 write might need a nodemap placed on it to force it to run with the same degree of parallelism as the two DB2 stages, in order to avoid two repartitions.

5.6 Ensuring Data is Evenly Partitioned

Due to the nature of EE, the entire flow runs only as fast as its slowest component. If data is not evenly partitioned, the slowest component is often a result of data skew. If one partition has ten records, and another has ten million, EE simply cannot make ideal use of the resources.

Setting $APT_RECORD_COUNTS=1 displays the number of records per partition for each component. Ideally, counts across all partitions should be roughly equal. Differences in data volumes between keys often skew this data slightly, but any significant difference in volume (over 5 or 10%) should be a warning sign that alternate keys, or an alternate partitioning strategy, might be required.
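A one-line sketch of enabling the per-partition counts for a skew check:

```shell
# Sketch: log per-partition record counts so data skew is visible in the
# job log; compare counts across partitions after the run.
export APT_RECORD_COUNTS=1
```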

5.7 Buffering for All Versions

Buffer operators are introduced in a flow anywhere that a directed cycle exists, or anywhere that the user or operator requests them using the C++ API or osh. The default goal of the buffer operator on a specific link is to make the source stage's output rate match the consumption rate of the target stage. In any flow where the buffer operator behaves incorrectly, performance degrades. For example, the target stage may have two inputs, and wait until it has exhausted one of those inputs before reading from the next. Identifying these spots in the flow requires an understanding of how each stage involved reads its records, and is often only found by empirical observation.

A buffer operator tuning issue exists when a flow runs slowly as one massive flow, but each component runs quickly when the flow is broken up. For example, replacing an Oracle write with a Copy stage vastly improves performance, and writing that same data to a dataset, then loading it with the Oracle write, also goes quickly; yet when the two are put together, performance grinds to a crawl. For more information on buffering, see Appendix A, Data Set Buffering, in the Orchestrate 7.0 User Guide.

5.8 Resolving Bottlenecks

5.8.1 Variable Length Data

In releases prior to 7.0.1, using fixed-length records can dramatically improve performance; therefore, limit the use of variable-length records within a flow. This is no longer an issue in 7.0.1 and later releases.

5.8.2 Combinable Operators

Combined operators generally improve performance at least slightly, and in some cases the improvement may be dramatic. However, there are situations where combining operators actually hurts performance, and identifying them can be difficult without trial and error. The most common situation arises when multiple operators that perform disk I/O, such as Sequential File (import and export) and Sort, are combined. In I/O-bound situations, turning off combination for these specific operators may yield a performance increase.

The per-stage combination setting is a new option in the Advanced stage properties of DataStage Designer version 7.x. Combinable operators often provide a dramatic performance increase when a large number of variable-length fields are used in a flow. To experiment with this, try disabling combination for any stages that perform I/O and for any Sort stages. $APT_DISABLE_COMBINATION=1 globally disables operator combining.

5.8.3 Disk I/O

Total disk throughput is often a fixed quantity that EE has no control over. There are, however, some settings and rules of thumb that are often beneficial:

- If data is going to be read back in, in parallel, it should never be written as a sequential file. A dataset or fileset is a much more appropriate format.
- When importing fixed-length data, the Number of Readers Per Node option on the Sequential File stage can often provide a noticeable performance boost compared with a single process reading the data. However, if there is a need to assign a number in source-file row order, the -readers option cannot be used, because it opens multiple streams at evenly spaced offsets in the source file. Also, this option can only be used for fixed-length sequential files.
- Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks. $APT_CONSISTENT_BUFFERIO_SIZE=n forces import to read data in chunks of size n or a multiple of n.
- Memory-mapped I/O is in many cases a big performance win; however, in certain situations, such as a remote disk mounted via NFS, it may cause significant performance problems. $APT_IO_NOMAP=1 and $APT_BUFFERIO_NOMAP=1 turn off this feature and can sometimes improve performance. AIX and HP-UX default to NOMAP; $APT_IO_MAP=1 and $APT_BUFFERIO_MAP=1 can be used to turn memory-mapped I/O on for these platforms.
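Collected as a shell fragment, the I/O-related settings above might look like the following. The values are illustrative examples, not recommendations:

```shell
#!/bin/sh
# Illustrative disk-I/O tuning settings; values are examples only.
export APT_CONSISTENT_BUFFERIO_SIZE=1048576  # import reads in 1MB-aligned chunks
export APT_IO_NOMAP=1                        # disable memory-mapped I/O
export APT_BUFFERIO_NOMAP=1                  # disable memory-mapped buffering I/O
# On AIX and HP-UX, where NOMAP is the default, the opposite switches apply:
# export APT_IO_MAP=1
# export APT_BUFFERIO_MAP=1
echo "NOMAP=$APT_IO_NOMAP bufferio_size=$APT_CONSISTENT_BUFFERIO_SIZE"
```

Such a fragment would typically be sourced before invoking the job, or the equivalent values set as job parameters in Designer.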

5.8.4 Buffering

Buffer operators are intended to slow down their input to match the consumption rate of the output. When the target stage reads very slowly, or not at all, for a length of time, upstream stages begin to slow down. This can cause a noticeable performance loss if the optimal behavior of the buffer operator is something other than rate matching. By default, the buffer operator has a 3MB in-memory buffer. Once that buffer reaches two-thirds full, the operator begins to push back on the rate of the upstream stage. Once the 3MB buffer is filled, data is written to disk in 1MB chunks.

In the following discussion, settings in all caps are environment variables and affect all buffer operators; settings in all lowercase are buffer-operator options and can be set per buffer operator.
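The default thresholds described above work out as follows (a small arithmetic sketch; the numbers simply mirror the defaults stated in the text):

```shell
#!/bin/sh
# Default buffer-operator thresholds, per the description above.
max_memory=3000000                    # default in-memory buffer (~3MB)
pushback=$(( max_memory * 2 / 3 ))    # push-back starts at two-thirds full
disk_chunk=1000000                    # spill-to-disk write size (~1MB)
echo "push-back threshold: $pushback bytes; disk chunk: $disk_chunk bytes"
```

So with the defaults, upstream stages start to feel push-back after roughly 2MB of buffered data, and any overflow spills to disk 1MB at a time.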


In most cases, the easiest way to tune the buffer operator is to eliminate the push-back and allow it to buffer the data to disk as necessary. $APT_BUFFER_FREE_RUN=n (or the bufferfreerun option) does this: the buffer operator reads n * max_memory (3MB by default) bytes before beginning to push back on the upstream stage. If there is enough disk space to buffer large amounts of data, this usually fixes any egregious slow-down caused by the buffer operator.

If a significant amount of memory is available on the machine, increasing the maximum in-memory buffer size is likely to be very useful whenever the buffer operator is causing any disk I/O. $APT_BUFFER_MAXIMUM_MEMORY (or maximummemorybuffersize) controls this; it defaults to roughly 3000000 (3MB).

For systems where small to medium bursts of I/O are not desirable, the 1MB write-to-disk chunk size may be too small. $APT_BUFFER_DISK_WRITE_INCREMENT (or diskwriteincrement) controls this and defaults to roughly 1000000 (1MB). This setting may not exceed max_memory * 2/3.

Finally, in a situation where a large, fixed buffer is needed within the flow, queueupperbound (no environment variable exists; it can only be set at the osh level) can be set equal to max_memory to force a buffer of exactly max_memory bytes. Such a buffer blocks the upstream stage once it has filled, until data is read by the downstream stage, so this setting should be used with extreme caution. It is rarely necessary for good performance, but is useful when there is large variability in the response time of the data source or target.

For releases 7.0.1 and beyond, per-link buffer settings are available in EE. They appear on the Advanced tab of the Input and Output tabs. The settings saved on an Output tab are shared with the Input tab of the next stage, and vice versa, like Columns.
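A buffer-tuning experiment combining these environment variables might be sketched as follows. The values are illustrative only; note the constraint that the disk-write increment must stay at or below two-thirds of the maximum memory:

```shell
#!/bin/sh
# Illustrative buffer-operator tuning; values are examples only.
export APT_BUFFER_FREE_RUN=4                    # buffer up to 4 * max_memory before push-back
export APT_BUFFER_MAXIMUM_MEMORY=12000000       # ~12MB in-memory buffer per buffer operator
export APT_BUFFER_DISK_WRITE_INCREMENT=4000000  # spill to disk in ~4MB chunks
# Sanity check: the increment may not exceed max_memory * 2/3.
limit=$(( APT_BUFFER_MAXIMUM_MEMORY * 2 / 3 ))
if [ "$APT_BUFFER_DISK_WRITE_INCREMENT" -le "$limit" ]; then
  echo "increment OK (limit $limit bytes)"
else
  echo "increment too large (limit $limit bytes)" >&2
fi
```

Because these environment variables affect every buffer operator in the job, prefer the per-link Advanced tab settings (7.0.1 and later) when only one link is the bottleneck.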
