Information Management
Module Objectives
Explain performance tuning methodology
Selectively disable operator combination
Understand configuration file guidelines
Understand the impact of partitioning
Understand the impact of sorting
Understand the impact of the Transformer
Use the Performance Analyzer
Optimizing Performance
Processing large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
  Job design
  Stage properties
  DataStage parameters
  Configuration file
  Disk subsystems (RAID, SAN)
  Source and target databases
  Network, etc.
Individual job design, including shared containers
Stages chosen, overall design approach
Partitioning strategy
Operator combination
Buffering (as a last resort)
Change ONE item at a time, then examine the impact
Use the job score to determine:
  Number of processes generated
  Operator combination
  Framework-inserted sorting and partitioning
Be cautious with the rows/sec figures calculated by Director: they are based on the elapsed time of the entire job, not of each stage
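To see why a rate based on whole-job elapsed time can mislead, consider a small arithmetic sketch. The row count and timings are made-up illustration values, not measurements, and the function name is hypothetical.

```python
def rows_per_sec(rows: int, seconds: float) -> float:
    """Return throughput in rows per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows / seconds


# Hypothetical job: 1,000,000 rows and 100 s of total elapsed time,
# of which one stage was actively processing for only 20 s
# (the rest was startup, upstream work, and waiting for data).
ROWS = 1_000_000
JOB_ELAPSED_S = 100.0
STAGE_ACTIVE_S = 20.0

# Rate derived from the elapsed time of the entire job.
job_level_rate = rows_per_sec(ROWS, JOB_ELAPSED_S)
# Rate the stage sustained while it was actually busy.
stage_level_rate = rows_per_sec(ROWS, STAGE_ACTIVE_S)

print(f"job-level rate:   {job_level_rate:,.0f} rows/sec")
print(f"stage-level rate: {stage_level_rate:,.0f} rows/sec")
```

Under these assumed numbers the job-level figure is five times lower than what the stage sustained while busy, so a low rows/sec value alone does not identify a slow stage.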
Operator Combination
At run time, the DataStage parallel framework attempts to combine stages (operators) into a single process
Operator combination is intended to improve overall performance and lower resource usage
Combination only occurs between stages (operators) that:
  Use the same partitioning method
    Repartitioning prevents operator combination between the producer and consumer stages
    Implicit repartitioning (sequential operators) prevents combination
  Are combinable
    Set automatically within the stage / operator definition
    Can also be set within the stage's Advanced properties
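The two conditions for combination can be illustrated with a toy model. This is a simplified sketch, not the framework's actual algorithm: the `Stage` class, the partitioning labels, and the stage names are all hypothetical, and the flow is assumed to be a simple linear chain.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    name: str                # hypothetical stage name
    partitioning: str        # illustrative label, e.g. "hash" or "sequential"
    combinable: bool = True  # models the stage's combinability setting


def combine(stages: List[Stage]) -> List[List[str]]:
    """Group a linear chain of stages into processes.

    A stage joins the previous stage's group only if both stages are
    combinable and no repartitioning happens between them (modeled here
    as both having the same partitioning label).
    """
    groups: List[List[str]] = []
    prev: Optional[Stage] = None
    for stage in stages:
        can_join = (
            prev is not None
            and prev.combinable
            and stage.combinable
            and stage.partitioning == prev.partitioning
        )
        if can_join:
            groups[-1].append(stage.name)
        else:
            groups.append([stage.name])
        prev = stage
    return groups


flow = [
    Stage("A", "hash"),
    Stage("B", "hash"),
    Stage("C", "hash", combinable=False),  # combination disabled on this stage
    Stage("D", "sequential"),              # implicit repartitioning
]
print(combine(flow))
```

In this model, A and B share one process, while C and D each run in their own. Disabling combination on a single stage therefore isolates it without affecting the rest of the chain.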
For easier debugging, selectively disable combination through the Designer stage properties so that you can tell which stage produced a warning or error log message
In cluster, grid, or MPP environments, named pools can be used to further control resources
  Examples: minimizing data shipping, direct database connections
Use parallel Data Sets to land intermediate results between parallel jobs
  No conversion overhead: data is stored in native internal format
  Retains data partitioning and sort order
  Maximum performance through parallel I/O
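The benefit of retaining partitioning and sort order can be pictured with a toy model. This is only a sketch in plain Python lists, not the Data Set format: the row layout, the modulo-based partitioner, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (key, payload); the layout is hypothetical


def hash_partition(rows: List[Row], partitions: int) -> List[List[Row]]:
    """Place each row in a partition chosen from its key (key modulo count)."""
    parts: List[List[Row]] = [[] for _ in range(partitions)]
    for row in rows:
        parts[row[0] % partitions].append(row)
    return parts


def is_key_partitioned_and_sorted(parts: List[List[Row]]) -> bool:
    """Check that each partition is sorted and no key spans two partitions."""
    seen = set()
    for part in parts:
        keys = [key for key, _ in part]
        if keys != sorted(keys):
            return False
        part_keys = set(keys)
        if part_keys & seen:
            return False
        seen |= part_keys
    return True


rows = [(4, "d"), (1, "a"), (3, "c"), (2, "b"), (1, "e")]

# Job 1: partition, sort within each partition, then land the result as-is.
landed = [sorted(part) for part in hash_partition(rows, 2)]

# Job 2: reads the landed partitions unchanged, so it can rely on the
# existing partitioning and sort order instead of redoing that work.
print(landed)
print(is_key_partitioned_and_sorted(landed))
```

Landing the data in a flat sequential form instead would lose the per-partition layout, and the downstream job would have to repartition and re-sort.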
Impact of Partitioning
Impact of Sorting
Use the Restrict Memory Usage option to increase the amount of memory available for sorting per partition
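A rough sizing sketch can help when choosing a per-partition limit. It assumes rows are spread evenly across partitions and that a partition whose data exceeds its limit has to spill to scratch disk; the function names, row size, and limit values are illustrative assumptions, not recommended settings.

```python
def fits_in_sort_memory(total_rows: int, avg_row_bytes: int,
                        partitions: int, limit_mb_per_partition: int) -> bool:
    """Return True if one partition's share of the data fits in its sort memory.

    Assumes an even spread of rows across partitions, which real data
    rarely guarantees.
    """
    bytes_per_partition = (total_rows / partitions) * avg_row_bytes
    return bytes_per_partition <= limit_mb_per_partition * 1024 * 1024


def total_sort_memory_mb(partitions: int, limit_mb_per_partition: int) -> int:
    """The limit applies per partition, so the total grows with the partition count."""
    return partitions * limit_mb_per_partition


# 4,000,000 rows x 100 bytes over 4 partitions = 100,000,000 bytes per partition.
print(fits_in_sort_memory(4_000_000, 100, 4, 20))    # False: exceeds 20 MB
print(fits_in_sort_memory(4_000_000, 100, 4, 128))   # True: fits within 128 MB
print(total_sort_memory_mb(4, 128))                  # 512
```

Because the limit is per partition, raising it on a job with many partitions multiplies the total memory the sort may claim on the machine.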
Parallel Data Set maintains partitions and sort order across jobs
Impact of Transformer
Use stage variables to perform calculations used by constraints and multiple derivations
Never use the BASIC Transformer
  Doesn't show up in the standard palette by default
  Intended to provide a migration path for existing DataStage Server applications that use DataStage BASIC routines
  Runs sequentially
  Invokes the DataStage server engine
  Extremely expensive (slow)!
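The stage variable advice amounts to computing a shared expression once per row and reusing it. The sketch below is a Python analogy, not Transformer derivation syntax; the column names and the tax rate are hypothetical.

```python
from typing import Dict, Optional

TAX_RATE = 0.25  # illustrative value


def transform_row(row: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Process one row, evaluating the shared expression only once."""
    # Plays the role of a stage variable: computed once per row.
    net_amount = row["quantity"] * row["unit_price"] * (1 - row["discount"])

    # Constraint: reuses the value instead of recomputing the expression.
    if net_amount <= 0:
        return None

    # Two derivations that both reuse the same value.
    return {
        "net_amount": net_amount,
        "tax": net_amount * TAX_RATE,
    }


print(transform_row({"quantity": 2, "unit_price": 10.0, "discount": 0.5}))
print(transform_row({"quantity": 2, "unit_price": 10.0, "discount": 1.0}))
```

Without the intermediate variable, the same multiplication would be repeated in the constraint and in each derivation for every row.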
For optimum performance, consider more appropriate stages instead of a Transformer in parallel job flows:
Use a non-Transformer stage (e.g., the Copy stage) to:
  Rename columns
  Drop columns
  Perform default type conversions
  Split output
Sequential File stage file pattern reads start with a single cat process
  Setting $APT_IMPORT_PATTERN_USES_FILESET allows parallel I/O
  Dynamically builds a File Set header file listing the files that match the pattern
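The difference between one reader and several can be pictured by spreading the matched files across readers. This is a conceptual sketch only: the round-robin assignment, the function name, and the file names are assumptions, and the framework's actual distribution of files may differ.

```python
from typing import Dict, List


def assign_files(files: List[str], readers: int) -> Dict[int, List[str]]:
    """Spread the matched files across readers in round-robin order."""
    if readers < 1:
        raise ValueError("readers must be at least 1")
    assignment: Dict[int, List[str]] = {r: [] for r in range(readers)}
    for index, path in enumerate(sorted(files)):
        assignment[index % readers].append(path)
    return assignment


# Hypothetical files matched by a pattern such as sales_*.txt
matched = ["sales_03.txt", "sales_01.txt", "sales_02.txt",
           "sales_04.txt", "sales_05.txt"]

print(assign_files(matched, 1))  # a single reader handles every file
print(assign_files(matched, 2))  # two readers share the list
```

With a single reader all files pass through one process in sequence; with a file list, each reader can work on its own subset at the same time.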
Performance Analysis
Use the Director monitor to watch the throughput (rows/sec) during a job run
Compare job run durations
Turn on APT_PM_PLAYER_TIMING and APT_PM_PLAYER_MEMORY to report player calls and memory allocation
Limitations of these approaches:
  Long-running jobs couldn't be watched for record throughput changes throughout the job run
  The job monitor didn't allow recording for playback
  Job monitor throughput rates included time waiting for data
  Couldn't determine what was happening on the machines
Performance Analyzer
Visualization tool that provides deeper insight into job runtime behavior
Part of the DataStage engine
Offers several categories of visualizations:
  Record throughput (rows/sec)
  CPU utilization
  Job timing & memory utilization
  Physical machine utilization
Open the job in Designer
Select Record job performance data in Job Properties
Run your job; performance collection has little impact on overall job performance
To view the results, click the Performance Analysis icon in Designer
Example Job
[Figure: Performance Analyzer chart with callouts for machine utilization, stages in job, and lengths of time]
Process phases
Hover over a line to identify the stage port it represents
Timeline
Filters
Module Summary
Explain performance tuning methodology
Selectively disable operator combination
Understand configuration file guidelines
Understand the impact of partitioning
Understand the impact of sorting
Understand the impact of the Transformer
Use the Performance Analyzer