
Performance Tuning

Database Performance

 Try not to use a sort stage when you can use an ORDER BY clause in the database.
 Filtering a record set with a constraint is much slower than filtering in the database with a SELECT ... WHERE clause.
 Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally
faster than using ODBC or OLE.
 Tune the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts,
updates, and selects.
 Sort the data in the database wherever possible and reduce the use of the DataStage Sort stage
for better job performance.
 Work with the database administrator to create appropriate indexes on tables for better
performance of DataStage queries.
 Use the native database stages.

File Performance

 If the source file is fixed width, the Readers Per Node option can be used to read a single input
file in parallel at evenly-spaced offsets.
 The environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE
can be used to improve I/O performance.
 Use the Data Set stage for intermediate persistent storage instead of the Sequential File stage:
Data Sets read and write data in the parallel engine's native format and are easily administered
through DataStage itself. Data Sets should not be edited or deleted directly from UNIX.
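As a sketch, the two I/O buffer variables mentioned above can be exported in the environment before a job run (or set as job parameters). The values below are purely illustrative, not tuned recommendations; confirm the units and sensible ranges in your version's documentation.

```shell
# Illustrative values only -- not tuned recommendations.
export APT_IMPORT_BUFFER_SIZE=1024   # buffer used when reading (importing) files
export APT_EXPORT_BUFFER_SIZE=1024   # buffer used when writing (exporting) files
echo "import=$APT_IMPORT_BUFFER_SIZE export=$APT_EXPORT_BUFFER_SIZE"
```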

Configuration File

 The degree of parallelism of a DataStage job is determined by the number of nodes defined in
the configuration file. A configuration file with more nodes generates more processes, which adds
processing overhead compared with a configuration file with fewer nodes. When choosing a
configuration file, therefore, weigh the benefits of increased parallelism against the losses in
processing efficiency (increased overhead and slower start-up time). Ideally, use a configuration
file with fewer nodes when the amount of data to be processed is small, and one with more nodes
when the data volume is large.
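For illustration, a minimal two-node configuration file (the server name and resource paths here are hypothetical) has this shape:

```
{
  node "node1" {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2" {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}
```

Jobs pick up the file named by the $APT_CONFIG_FILE environment variable, so keeping several such files (two-node, four-node, and so on) lets you match parallelism to data volume per job.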

Partitioning

 Proper partitioning of data is another aspect of DataStage Job design, which significantly
improves overall job performance. Partitioning should be set in such a way so as to have
balanced data flow i.e. nearly equal partitioning of data should occur and data skew should be
minimized.
 Avoid unnecessary re-partitioning of data, which adds overhead and can skew the data flow.

Transformer 

 Take precautions when using expressions or derivations on nullable columns within the parallel
Transformer: 
 Always convert nullable columns to in-band values before using them in an expression or
derivation. 
 Always place a reject link on a parallel Transformer to capture / audit possible rejects. 

Lookup 

 A Lookup is most appropriate when the reference data is small enough to fit into available shared
memory. If the reference Data Sets are larger than available memory, use the Join or Merge stage.
 Limit the use of database Sparse Lookups to scenarios where the number of input rows is
significantly smaller (for example 1:100 or more) than the number of reference rows, or when
exception processing is required.

Join 

 Be particularly careful to observe the nullability properties for input links to any form of Outer Join.
Even if the source data is not nullable, the non-key columns must be defined as nullable in the
Join stage input in order to identify unmatched records. 

Aggregator

 Use Hash method Aggregators only when the number of distinct key column values is small.
 A Sort method Aggregator should be used when the number of distinct key values is large or
unknown.

Modify

 The Modify stage is the most efficient stage available: any transformation that can be
implemented in Modify will run faster there than in a Transformer.
 Transformations that touch a single field, such as keep/drop, type conversions, some string
manipulations, and null handling, are the primary operations which should be implemented using
Modify instead of Transformer derivations.
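As an illustrative sketch (the column names are hypothetical, and the exact specification syntax should be checked against the Modify stage documentation for your version), such single-field operations are expressed as specifications of the form:

```
KEEP cust_id, cust_name, balance;
cust_name = handle_null(cust_name, 'unknown');
balance_int:int32 = int32_from_decimal(balance);
```

Each specification keeps or drops columns, substitutes an in-band value for nulls, or performs an explicit type conversion, with no per-row expression evaluation overhead.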

Sort

 If data has already been partitioned and sorted on a set of key columns, specify the "don't sort,
previously sorted" option for those key columns in the Sort stage.
 The stable sort option is much more expensive than a non-stable sort, and should only be used if
there is a need to maintain row order other than as needed to perform the sort.
 The performance of individual sorts can be improved by increasing the memory usage per
partition with the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20
MB per partition. Note that sort memory usage can only be specified for standalone Sort stages;
it cannot be changed for inline (on a link) sorts.

Sequence Job

 Always use the “Reset if required, then run” option in Job Activity stages.
 Enable checkpointing in the Sequence job properties so that a failed sequence can be restarted
from the point of failure.

Environment Variable

 DataStage EE provides a number of environment variables to control how jobs operate on a
UNIX system. In addition to providing required information, environment variables can be used to
enable or disable various DataStage features, and to tune performance settings.
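A hedged sketch of how such variables are typically set before invoking a job (the configuration-file path is hypothetical; only rely on variables documented for your version):

```shell
# Illustrative settings; adjust names, paths, and values for your installation.
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/2node.apt
export APT_RECORD_COUNTS=True        # log per-operator record counts
export APT_NO_SORT_INSERTION=False   # keep automatic sort insertion enabled
echo "config=$APT_CONFIG_FILE"
```

The same variables can also be defined per project in the Administrator client, or per job as job parameters, which is generally preferable to session-level exports.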

Job Design

 Minimize the use of the Transformer stage; where possible use Copy, Modify, Filter, or Row
Generator instead.
 Reduce the number of lookups in a job design.
 Minimize the warnings.
 Use RCP (Runtime Column Propagation) wherever possible.
 Remove unneeded columns as early as possible within the job flow.

Architecture

 DataStage supports grid architecture from version 8.1. A grid makes proper use of the available
hardware and speeds up processing, so implement a grid architecture where possible.

Server Settings

 Ensure there is enough space on the scratch disk; otherwise jobs that sort large volumes of data
might fail.
 Ensure the NLS settings are correct.
 Take care with environment variables when moving from one environment to another. Migrating
the entire DSParams file can carry development-environment settings into production, while
migrating only the user-defined section can result in a corrupt DSParams file.

8.5 Performance Enhancements

 Connector stages (such as the Oracle Connector and DB2 Connector) are the upgraded versions
of the Enterprise stages, with better error handling, so prefer Connector stages.
 Because a Connector stage supports multiple input links, it can load multiple tables at a time.
 Use the Pivot Enterprise stage for vertical pivoting instead of stage variables in a Transformer.
 Use the Transformer's looping capability for looping and aggregation.
 Use the new XML Input and XML Transformer stages, which can handle large and complex XML
with much faster processing.

Performance Monitoring

Job Monitor

 The IBM InfoSphere DataStage job monitor can be accessed through the IBM InfoSphere
DataStage Director.

Score Dump

 In order to resolve any performance issues it is essential to understand the data flow within the
jobs. To help understand a job flow, take a score dump by setting the APT_DUMP_SCORE
environment variable to true prior to running the job.
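For example, the variable can be exported in the session (or added as a job parameter) before the run; after the job completes, the score appears in the job log:

```shell
# Enable the score dump for subsequent job runs. The score shows the
# operators, processes, and partitioning the framework actually generated.
export APT_DUMP_SCORE=true
echo "APT_DUMP_SCORE=$APT_DUMP_SCORE"
```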

Resource Estimation

 Drawing on sophisticated analytical information and a deep understanding of the parallel
framework, IBM added Resource Estimation to DataStage (and QualityStage) 8.x. It can be used
to determine the required system resources, or to analyze whether the current infrastructure can
support the jobs that have been created.

Performance Analysis

 IBM InfoSphere DataStage 8.x adds a new capability called Performance Analysis. It is enabled
through a job property on the Execution tab, which collects performance data at job execution time.