Front cover
Student Notebook
ERC 1.0
Trademarks
IBM and the IBM logo are registered trademarks of International Business Machines Corporation. The following are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide: DataStage, InfoSphere, DB2, Informix.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other product and service names might be trademarks of IBM or other companies.
Copyright International Business Machines Corporation 2005, 2011. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Contents
Course description  xv
Agenda  xvii

Unit 0. IBM InfoSphere Advanced DataStage v8  0-1
Course objectives  0-2
Agenda  0-3
Agenda  0-4
Introductions  0-5

Unit 1. Introduction to the Parallel Framework Architecture  1-1
Unit objectives  1-2
Why study the parallel architecture?  1-3
What we need to master  1-4
DataStage parallel job documentation  1-5
Key parallel concepts  1-6
Scalable hardware environments  1-7
Drawbacks of traditional batch processing  1-8
Pipeline parallelism  1-9
Partition parallelism  1-10
Partitioning illustration  1-11
DataStage combines partitioning and pipelining  1-12
Job design versus execution  1-13
Defining parallelism  1-14
Configuration file  1-15
Example configuration file  1-16
Job Design Examples  1-18
Generating mock data  1-19
Job design for generating mock data  1-20
Specifying the generating algorithm  1-21
Inside the Lookup stage  1-22
Configuration file displayed in job log  1-23
Checkpoint  1-24
Exercise 1  1-25
Unit summary  1-26

Unit 2. Compilation and Execution  2-1
Unit objectives  2-2
Parallel Job Compilation  2-3
Parallel job compilation  2-4
Transformer job compilation notes  2-5
Generated OSH  2-6
Stage to OSH operator mappings  2-7
Generated OSH primer  2-8
DataStage GUI versus OSH terminology  2-9
Configuration File  2-10
Configuration file  2-11
Processing nodes (partitions)  2-12
Configuration file format  2-13
Node options  2-14
Sample configuration file  2-15
Resource pools  2-16
Sorting resource pools  2-17
Another configuration file example  2-18
Constraining operators to specific node pools  2-19
Configuration Editor  2-20
Configuration editor  2-21
Parallel Runtime Architecture  2-22
Parallel Job startup  2-23
Parallel job run time  2-24
Viewing the job Score  2-25
Example job Score  2-26
Job execution: The orchestra metaphor  2-27
Runtime control and data networks  2-28
Parallel data flow  2-29
Monitoring job startup and execution in the log  2-30
Counting the total number of processes  2-31
Parallel Job Design Examples  2-32
Peeking at the data stream  2-33
Peeking at the data stream design  2-34
Using Transformer stage variables  2-35
Checkpoint  2-36
Exercise 2 - Compilation and Execution  2-37
Unit summary  2-38

Unit 3. Partitioning and Collecting Data  3-1
Unit objectives  3-2
Partitioning and collecting  3-3
Partitioning and collecting icons  3-4
Partitioners  3-5
Where partitioning is specified  3-6
The Score  3-7
Viewing the Score operators  3-8
Interpreting the Score partitioning  3-9
Score partitioning example  3-10
Partition numbers  3-11
Partitioning methods  3-12
Selecting a partitioning method  3-13
Selecting a partitioning method, continued  3-14
Same partitioning algorithm  3-15
Caution regarding Same partitioning  3-16
Round Robin and Random  3-17
Parallel runtime example  3-18
Entire partitioning  3-19
Hash partitioning  3-20
Unequal distribution example  3-21
Modulus partitioning  3-22
Range partitioning  3-23
Using Range partitioning  3-24
Example partitioning icons  3-25
Auto partitioning  3-26
Preserve partitioning flag  3-27
Partitioning strategy  3-28
Partitioning strategy, continued  3-29
Collecting Data  3-30
Collectors  3-31
Specifying the collector method  3-32
Collector methods  3-33
Sort Merge example  3-34
Non-deterministic execution  3-35
Choosing a collector method  3-36
Collector method versus Funnel stage  3-37
Parallel Job Design Examples  3-38
Parallel number sequences  3-39
Row Generator sequences of numbers  3-40
Generated numbers  3-41
Transformer example using @INROWNUM  3-42
Transformer example using parallel variables  3-43
Header and detail processing  3-44
Job design  3-45
Inside the Transformer  3-46
Examining the Score  3-47
Difficulties with the design  3-48
Examining the Score  3-49
Generating a header detail data file  3-50
Inside the Column Export stage  3-51
Inside the Funnel stage  3-52
Checkpoint  3-53
Exercise 3 - Read data with multiple record formats  3-54
Unit summary  3-55

Unit 4. Sorting Data  4-1
Unit objectives  4-2
Traditional (sequential) sort  4-3
Parallel sort  4-4
Example parallel sort  4-5
Stages that require sorted data  4-6
Parallel sorting methods  4-7
In-Stage sorting  4-8
Sort stage  4-9
Stable sorts  4-10
Resorting on sub-groups  4-11
Don't sort (previously grouped)  4-12
Partitioning and sort order  4-13
Global sorting methods  4-14
Inserted tsorts  4-15
Changing inserted tsort behavior  4-16
Sort resource usage  4-17
Partition and sort keys  4-18
Optimizing job performance  4-19
Job Design Examples  4-20
Fork join job example  4-21
Fork join job design  4-22
Examining the Score  4-23
Difficulties with the design  4-24
Optimized solution  4-25
Score of optimized Job  4-26
Checkpoint  4-27
Exercise 4 - Optimize a fork join job  4-28
Unit summary  4-29

Unit 5. Buffering in Parallel Jobs  5-1
Unit objectives  5-2
Introducing the buffer operator  5-3
Identifying buffer operators in the Score  5-4
How buffer operators work  5-5
Buffer flow control  5-6
Buffer tuning  5-7
Cautions  5-8
Changing buffer settings in a job stage  5-9
Buffer resource usage  5-10
Buffering for group stages  5-11
Join stage internal buffering  5-12
Avoiding buffer contention in fork-join jobs  5-13
Parallel Job Design Examples  5-14
Revisiting the header detail job design  5-15
Buffering solution  5-16
Redesigned header detail processing job  5-17
Checkpoint  5-18
Exercise - Optimize a fork join job  5-19
Unit summary  5-20

Unit 6. Parallel Framework Data Types  6-1
Unit objectives  6-2
Data formats  6-3
Data sets  6-4
Example schemas  6-5
Type conversions  6-6
Source to target type conversions  6-7
Using Modify Stage For Type Conversions  6-8
Processing external data  6-9
Sequential file import conversion  6-10
COBOL file import conversion  6-11
Oracle automatic conversion  6-12
Standard Framework data types  6-13
Complex data types  6-14
Schema with complex types  6-15
Complex types column definitions  6-16
Complex Flat File Stage  6-17
Complex Flat File (CFF) stage  6-18
Sample COBOL copybook  6-19
Importing a COBOL File Definition (CFD)  6-20
COBOL table definitions  6-21
COBOL file layout  6-22
Specifying a date mask  6-23
Example data file with multiple formats  6-24
Sample job With CFF Stage  6-25
File options tab  6-26
Records tab  6-27
Record ID tab  6-28
Selection tab  6-29
Record options tab  6-30
Layout tab  6-31
View data  6-32
Processing multi-format records  6-33
Transformer constraints  6-34
Nullability  6-35
Nullable data  6-36
Null transfer rules  6-37
Nulls and sequential files  6-38
Null field value examples  6-39
Viewing data with Null values  6-40
Lookup stage and nullable columns  6-41
Default values  6-42
Nullability in lookups  6-43
Outer joins and nullable columns  6-44
Checkpoint  6-45
Exercise 6 - Test nullability  6-46
Unit summary  6-47

Unit 7. Reusable components  7-1
Unit objectives  7-2
Using Schema Files to Read Sequential Files  7-3
Schema file  7-4
Creating a schema file  7-5
Importing a schema  7-6
Creating a schema from a table definition  7-7
Reading a sequential file using a schema  7-8
Runtime Column Propagation (RCP)  7-9
Runtime Column Propagation (RCP)  7-10
Enabling Runtime Column Propagation (RCP)  7-11
Enabling RCP at Project Level  7-12
Enabling RCP at Job Level  7-13
Enabling RCP at Stage Level  7-14
When RCP is Disabled  7-15
When RCP is Enabled  7-16
Where do RCP columns come from? (1)  7-17
Where do RCP columns come from? (2)  7-18
Where do RCP columns come from? (3)  7-19
Shared Containers  7-20
Shared containers  7-21
Creating a shared container  7-22
Inside the shared container  7-23
Inside the shared container Transformer  7-24
Using a shared container in a job  7-25
Mapping input / output links to the container  7-26
Interfacing with the shared container  7-27
Checkpoint  7-28
Exercise 7 - Reusable components  7-29
Unit summary  7-30

Unit 8. Advanced Transformer Logic  8-1
Unit objectives  8-2
Transformer Null Handling  8-3
Transformer legacy null handling  8-4
Legacy null processing example  8-5
Inside the Transformer stage  8-6
Transformer stage properties  8-7
Results  8-8
Transformer non-legacy null handling  8-9
Transformer stage properties  8-10
Results with non-legacy null processing  8-11
Transformer Loop Processing  8-12
Transformer loop processing  8-13
Repeating columns example  8-14
Solution using multiple-output links  8-15
Inside the Transformer stage  8-16
Limitations of the multiple output links solution  8-17
Loop processing  8-18
Creating the loop condition  8-19
Loop variables  8-20
Repeating columns solution using a loop  8-21
Inside the Transformer  8-22
Transformer Group Processing  8-23
Transformer group processing  8-24
Building a Transformer group processing job (1)  8-25
Building a Transformer group processing job (2)  8-26
Group processing example job  8-27
Transformer stage variables  8-28
Stage Variable Derivations  8-29
Specifying the Loop  8-30
Runtime errors  8-31
Validating rows before saving them in the queue  8-32
Checkpoint  8-33
Exercise 8 - Transformer Logic  8-34
Unit summary  8-35

Unit 9. Extending the Functionality of Parallel Jobs  9-1
Unit objectives  9-2
Ways of adding new functionality  9-3
Wrapped Stages  9-4
Building Wrapped stages  9-5
Wrapped stages  9-6
Wrapped stage example  9-7
Creating a Wrapped stage  9-8
Defining the Wrapped stage interfaces  9-9
Specifying Wrapped stage properties  9-10
Job with Wrapped stage  9-11
Exercise 9 - Wrapped stages  9-12
Build Stages  9-13
Build stages  9-14
Example job with Build stage  9-15
Creating a new Build stage  9-16
Build stage elements  9-17
Anatomy of a Build stage  9-18
Defining the input, output interfaces  9-20
Interface table definition  9-21
Specifying the input interface  9-22
Specifying the output interface  9-23
Transfer  9-24
Defining a transfer  9-25
Anatomy of a transfer  9-26
Defining stage properties  9-27
Specifying properties  9-28
Defining the Build stage logic  9-29
Definitions tab  9-30
Pre-Loop tab  9-31
Per-Record tab  9-32
Post-Loop tab  9-33
Writing to the job log  9-34
Using a Build stage in a job  9-35
Stage properties  9-36
Build stages with multiple ports  9-37
Build Macros  9-38
Build macros  9-39
Turning off auto read, write, and transfer  9-40
Reading records using macros  9-41
APT Framework Classes  9-42
APT framework and utility classes  9-43
Framework class sampler  9-44
APT_String Build stage example  9-45
Exercise 9 - Build stages  9-46
External Functions  9-47
Parallel routines  9-48
External function example  9-49
Another external function example  9-50
Creating an external function  9-51
Defining the input arguments  9-52
Calling the external function  9-53
Exercise 9 - External Function Routines  9-54
Checkpoint  9-55
Unit summary  9-56

Unit 10. Accessing Databases  10-1
Unit objectives  10-2
Overview  10-3
Connector stages  10-4
Connector stage usage  10-5
Connector stage look and feel  10-6
Connector stage GUI  10-7
Connection properties  10-8
Usage properties - Generate SQL  10-9
Deprecated stages  10-10
Database stages  10-11
Do it in DataStage or in the Database?  10-12
Connector Stage Functionality  10-13
Reading with Connector stages  10-14
Before/After SQL  10-15
Sparse lookups  10-16
Writing using Connector stages  10-17
Parameterizing the table action  10-18
Optimizing the insert/update performance  10-19
Commit interval  10-20
Bulk load  10-21
Cleaning Up failed DB2 loads  10-22
Error Handling in Connector stages  10-23
Error handling in Connector stages  10-24
Connector stage with reject link  10-25
Specifying reject conditions  10-26
Added error code information examples  10-27
Multiple Input Links  10-28
Multiple input links  10-29
Inside the Connector - stage properties  10-30
Job Design Examples  10-31
Data Connection Objects  10-32
Standard insert plus update example  10-33
Insert-Update Example  10-34
Checkpoint  10-35
Exercise 10. Working with Connectors  10-36
Unit summary  10-37

Unit 11. Processing XML Data  11-1
Unit objectives  11-2
XML stage  11-3
Schema Library Manager  11-4
Schema Library Manager window  11-5
Schemas  11-6
Schema file  11-7
Composing XML Data  11-8
Composing XML data  11-9
Compositional Job  11-10
Inside the XML stage  11-11
Inside the Assembly editor  11-12
Input step  11-13
Composer step - XML Target tab  11-14
Composer step - XML Document Root tab  11-15
Composer step - Validation tab  11-16
Composer step - Mappings tab  11-17
XML file output  11-18
Parsing XML Data  11-19
Parsing XML data  11-20
Parser step - XML Source tab  11-21
Parser step - Document Root tab  11-22
Transforming XML Data  11-23
Transforming XML data  11-24
Transformation Example - HJoin  11-25
Editing the HJoin step  11-26
Switch step  11-27
Aggregate step  11-28
Checkpoint  11-29
Exercise 11 - XML stage  11-30
Unit summary  11-31

Unit 12. Slowly Changing Dimensions Stages  12-1
Unit objectives  12-2
Surrogate Key Generator Stage  12-3
Surrogate Key Generator stage  12-4
Example job to create surrogate key state files  12-5
Editing the Surrogate Key Generator stage  12-6
Example job to update the surrogate key state file  12-7
Specifying the update information  12-8
Slowly Changing Dimensions Stage  12-9
Slowly Changing Dimensions stage  12-10
Star schema database structure and mappings  12-11
Example Slowly Changing Dimensions (SCD) job  12-13
Working in the SCD stage  12-14
Selecting the output link  12-15
Specifying the purpose codes  12-16
Surrogate key management  12-17
Dimension update specification  12-18
Output mappings  12-19
Checkpoint  12-20
Exercise 12 - Slowly Changing Dimensions  12-21
Unit summary  12-22

Unit 13. Best Practices  13-1
Unit objectives  13-2
Job Design Guidelines  13-3
Overall job design  13-4
Balancing performance with requirements  13-5
Modular job design  13-6
Establishing job boundaries  13-7
Use job sequences to combine job modules  13-8
Adding environment variables as job parameters  13-9
Stage Usage Guidelines  13-10
Reading sequential files  13-11
Reading a sequential file in parallel  13-12
Parallel file pattern I/O  13-13
Partitioning and sequential files  13-14
Other sequential file tips  13-15
Buffering sequential file writes  13-16
Lookup Stage Guidelines  13-17
Lookup stage  13-18
Partitioning lookup reference data  13-19
Lookup reference data  13-20
Lookup file sets  13-21
Using Lookup File Set stages  13-22
Aggregator Stage Guidelines  13-23
Aggregator  13-24
Using Aggregator to sum all input rows  13-25
Transformer Stage Guidelines  13-26
Transformer performance guidelines  13-27
Transformer vs. other stages  13-28
Modify stage  13-29
Optimizing Transformer expressions  13-30
Simplifying Transformer expressions  13-31
Transformer stage compared with Build stage  13-32
Transformer decimal arithmetic  13-33
Transformer decimal rounding  13-34
Conditionally aborting a job  13-35
Job Design Examples  13-36
Summing all rows with Aggregator stage  13-37
Conditionally aborting the job  13-38
Checkpoint  13-39
Exercise 13 - Best practices  13-40
Unit summary  13-41
Course description
IBM InfoSphere Advanced DataStage v8
Duration: 4 days
Purpose
This course is designed to introduce advanced job development techniques in DataStage v8.5.
Audience
Experienced DataStage developers who seek training in more advanced DataStage techniques and an understanding of the parallel framework architecture.
Prerequisites
DataStage Essentials course or equivalent and at least one year of experience developing parallel jobs using DataStage.
Objectives
After completing this course, you should be able to:
- Describe the parallel processing architecture and development and runtime environments
- Describe the compile process and the runtime job execution process
- Describe how partitioning and collection works in the parallel framework
- Describe sorting and buffering in the parallel framework and optimization techniques
- Describe and work with parallel framework data types
- Create reusable job components
- Use loop processing in a Transformer stage
- Process groups in a Transformer stage
- Extend the functionality of DataStage by building custom stages and creating new Transformer functions
- Use Connector stages to read and write from relational tables and handle errors in Connector stages
- Process XML data in DataStage jobs using the XML stage
- Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions
- List job and stage best practices
Agenda
Day 1
(00:30) Welcome
(01:35) Unit 1 - Introduction to the Parallel Framework Architecture
(02:10) Unit 2 - Compilation and Execution
(02:10) Unit 3 - Partitioning and Collecting Data
Day 2
(01:40) Unit 4 - Sorting Data
(01:00) Unit 5 - Buffering in Parallel Jobs
(02:00) Unit 6 - Parallel Framework Data Types
(01:45) Unit 7 - Reusable components
Day 3
(02:10) Unit 8 - Advanced Transformer Logic
(04:10) Unit 9 - Extending the Functionality of Parallel Jobs
(01:55) Unit 10 - Accessing Databases (start if there is time)
Day 4
(-------) Unit 10 - Accessing Databases, continued
(01:40) Unit 11 - Processing XML Data
(01:20) Unit 12 - Slowly Changing Dimensions Stages
(01:50) Unit 13 - Best Practices
Unit 0. IBM InfoSphere Advanced DataStage v8
Course objectives
After completing this course, you should be able to:
Describe the parallel processing architecture and development and runtime environments
Describe the compile process and the runtime job execution process
Describe how partitioning and collection works in the parallel framework
Describe sorting and buffering in the parallel framework and optimization techniques
Describe and work with parallel framework data types
Create reusable job components
Use loop processing in a Transformer stage
Process groups in a Transformer stage
Extend the functionality of DataStage by building custom stages and creating new Transformer functions
Use Connector stages to read and write from relational tables and handle errors in Connector stages
Process XML data in DataStage jobs using the XML stage
Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions
List job and stage best practices
Agenda
Day 1
Unit 1: Introduction to the Parallel Framework Architecture
Exercise 1
Agenda
Day 3
Unit 8: Advanced Transformer Logic
Exercise 8
Unit 10: Accessing Databases (start)
Day 4
Unit 10: Accessing Databases (finish)
Exercise 4
Introductions
Name
Company
Where you live
Your job role
Current experience with products and technologies in this course
Database ETL tools DataStage Programming
Class expectations
Unit 1. Introduction to the Parallel Framework Architecture
Unit objectives
After completing this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline and partition parallelism
Describe the role of the configuration file
Design a job that creates robust test data
Notes:
Learning DataStage at the GUI job design level is not enough. To develop the ability to design sound, scalable jobs, it is necessary to understand the underlying architecture. This is because the DataStage client is primarily a productivity tool; it is not intended to mirror the underlying architecture.
Development environment
How to develop efficient, well-performing GUI job designs
How to debug and change the GUI job design based on the generated OSH and Score and messages in the job log
Notes:
To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We need to understand what gets generated from the GUI design and how this gets executed by the parallel framework. We also need to be able to debug and modify our job designs based on what we see happen at runtime.
Notes:
This slide lists and summarizes the main DataStage guides covering the material in this course. DataStage documentation is installed during the DataStage client installation.
Scalable processing:
Add more resources (CPUs and disks) to increase system performance
Example system: 6 CPUs (processing nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to an SMP system
Notes:
Parallel processing is the key to building jobs that are highly scalable. The parallel engine uses the processing node concept. Standalone processes, rather than thread technology, are used. This process-based architecture is platform-independent and allows greater scalability across resources within the processing pool.
GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per system
Typically SAN-based shared storage
MPP
Multiple nodes with dedicated memory, storage
2 to 1000s of CPUs
Notes:
DataStage parallel jobs are designed to be platform-independent. A single job, if properly designed, can run across resources within a single machine (SMP) or multiple machines (cluster, GRID, or MPP architectures). While DataStage can run on a single-CPU environment, it is designed to take advantage of parallel platforms.
Complex to manage
Lots of small jobs
Notes:
Traditional batch processing consists of a distinct set of steps, defined by business requirements. Between each step, intermediate results are written to disk. This processing may exist outside of a database (using flat files for intermediate results) or within a database (using SQL, stored procedures, and temporary tables). There are several problems with this approach: First, each step must complete and write its entire result set before the next step can begin. Secondly, landing intermediate results incurs a large performance penalty through increased I/O. In this example, a single source incurs 7 times the I/O to process. Thirdly, with increased I/O requirements come increased storage costs.
Pipeline parallelism
Transform, enrich, load processes execute simultaneously
Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
Notes:
In this diagram, the arrows represent rows of data flowing through the job. While earlier rows are undergoing the Loading process, later rows are undergoing the Transform and Enrich processes. In this way a number of rows (7 in the picture) are being processed in parallel.
Partition parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Subsets are called partitions
Notes:
Partitioning breaks a data set into smaller sets. This is a key to scalability. However, the data needs to be evenly distributed across the partitions; otherwise, the benefits of partitioning are reduced. It is important to note that what is done to each partition of data is the same. How the data is processed or transformed is the same.
Partitioning illustration
[Diagram: the source Data is partitioned into subset1, subset2, and subset3; the same Operation runs on each subset, on Node 1, Node 2, and Node 3.]
Here the data is partitioned into three subsets
The same operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data will be processed roughly three times faster
Notes:
This diagram depicts how partition parallelism is implemented in DataStage. The data is split into multiple data streams which are each processed separately by the same stage operations.
Within DataStage, pipelining, partitioning, and repartitioning are automatic
Job developer only identifies:
Sequential vs. parallel operations (by stage)
Method of data partitioning
Configuration file (which identifies resources)
Advanced stage options (buffer tuning, operator combining, etc.)
Notes:
By combining both pipelining and partitioning, DataStage creates jobs with higher volume throughput. The configuration file drives the parallelism by specifying the number of partitions.
At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)
Notes:
Much of the parallel processing paradigm is hidden from the designer. The designer simply designates the process flow, as shown in the upper portion of this diagram. The Parallel engine, using definitions in a configuration file, will actually execute processes that are partitioned and parallelized, as illustrated in the bottom portion. A misleading feature of the lower diagram is that it makes it appear as if the data remains in the same partitions through the duration of the job. In fact, partitioning and re-partitioning occurs on a stage-by-stage basis. There will be times when the data moves from one partition to another.
Defining parallelism
Execution mode (sequential / parallel) is controlled by stage definition and properties
Default is parallel for most stages
Can override default in most cases (Advanced Properties tab)
By default, Sequential File stage runs in sequential mode
Can run in parallel mode when using multiple readers
By default, Sort stage (and most other stages) run in parallel mode
Notes:
Stages run in two possible execution modes: sequential and parallel. The default is parallel for most stages. The Sequential File stage, however, runs in sequential mode by default. The Sort stage, and most other stages, run in parallel mode. If a stage runs in sequential mode, it will run on only one of the available nodes specified in the configuration file. If a stage runs in parallel mode, it can use all the available nodes specified in the configuration file. The Score provides this information.
Configuration file
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE environment variable
Optimizes overall throughput and matches job characteristics to overall hardware resources
Allows you to change hardware and resources without changing job design
Notes:
The configuration file determines the degree of parallelism (number of partitions) of jobs that use it. Each job runs under a configuration file. The configuration file is specified by the $APT_CONFIG_FILE job parameter. DataStage job runs can point to different configuration files by using job parameters. Thus, a job can utilize different hardware architectures without being recompiled. It might pay, for example, to have a 4-node configuration file running on a 2-processor box if the job is resource bound, because disk I/O can be spread among more controllers.
Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Nameless node pool (""). Nodes in it are available to stages.
Notes:
This example shows a typical configuration file. Pools can be applied to nodes or other resources. The curly braces following some disk resources specify the resource pools associated with that resource. A node pool is simply a collection of nodes. The pools a given node belongs to are listed after the key word pool for the given node. A stage that is constrained to use a particular named pool will run only on the nodes that are in that pool. By default, all stages run on the nodes that are in the nameless pool (). Following the keyword node is the name of the node (logical processing unit). The order of resources is significant. The first disk is used before the second, and so on. Keywords, such as sort and bigdata, when used, restrict the signified processes to the use of the resources that are identified. For example, sort restricts sorting to node pools and scratch disk resources labeled sort. Database resources (not shown here) can also be created that restrict database access to certain nodes.
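The configuration file itself appears only as an image on the slide, so the following is a minimal sketch of the format the notes describe; the host name and directory paths are hypothetical. It defines two processing nodes on a single SMP host, gives each node its own disk and scratch disk, and places the first node's scratch disk in the sort pool as well as the default (nameless) pool:

  {
    node "node1" {
      fastname "devhost"
      pools ""
      resource disk "/ibm/ds/data/node1" {pools ""}
      resource scratchdisk "/ibm/ds/scratch/node1" {pools "" "sort"}
    }
    node "node2" {
      fastname "devhost"
      pools ""
      resource disk "/ibm/ds/data/node2" {pools ""}
      resource scratchdisk "/ibm/ds/scratch/node2" {pools ""}
    }
  }

A job run with $APT_CONFIG_FILE pointing at a file like this would execute its parallel stages two ways; a sort operator that spills to disk would use the scratch disk assigned to the sort pool before falling back to the default scratch disk.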
Question: Can objects be constrained to specific CPUs? No, a request is made to the operating system and the operating system chooses the CPU.
Notes:
Among its many uses, the Row Generator stage can be used to generate mock or test data. When used with Lookup stages in a job, large amounts of robust mock data can be generated.
Row Generator
Notes:
In this job design, the Row Generator stage generates integers to look up. For different columns, it cycles through integer sets, generating all possible combinations. The lookup files map these integers to specific values. For example, FName maps different integer values to first names. LName maps different integer values to last names. And so on.
Cycle through
Notes:
The number of values to cycle through should be different for each set of integers, so that all possible combinations will be generated, for example: 000 111 220 301 010 121 200. Here the first column cycles through 0-3, the second 0-2, and the third 0-1.
Notes:
This shows the inside of the Lookup stage. Notice how the integer columns (int1, int2, and so on) are specified as keys into the lookup files. The values the keys are mapped to are returned in the output.
Notes:
The job log contains a lot of valuable information. One message displays the configuration file the job is running under. This slide shows that message.
Checkpoint
1. What two main factors determine the number of nodes a stage in a job will run on?
2. What two types of parallelism are implemented in parallel jobs?
3. What stage is often used to generate mock data?
Notes:
Write your answers here:
Exercise 1
In this lab exercise, you will:
Generate mock data
Examine the job log
Unit summary
Having completed this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline and partition parallelism
Describe the role of the configuration file
Design a job that creates robust test data
Unit 2. Compilation and Execution
Unit objectives
After completing this unit, you should be able to:
Describe the main parts of the configuration file
Describe the compile process and the OSH that is generated during it
Describe the role and the main parts of the Score
Describe the job execution process
[Diagram: compiling the job on the DataStage server generates OSH plus compiled Transformer components.]
Notes:
During the compile process, DataStage generates all the code for the job. The compilation process generates OSH (a scripting language) from the job design and also C++ code for any Transformer stages that are used in the job. For each Transformer, DataStage builds a C++ operator. This explains why jobs with Transformers often take longer to compile (but not to run).
On clustered and grid runtime environments, job processing is distributed across multiple platforms
Transformer operators must be available to all the platforms the job is running on
To share Transformer operators, you can share the project directory across the different platforms
Alternatively, you can set $APT_COPY_TRANSFORM_OPERATOR on the first job run to distribute Transformer operators to all the platforms
Build and custom stage code must be shared or distributed manually
Notes:
As previously mentioned, DataStage generates and then compiles C++ source code for each Transformer in a job. These become custom operators in the OSH. This explains why jobs with Transformers often take longer to compile. This also creates a problem if the jobs are run in a grid or clustered environment which distributes the processing across multiple platforms. These custom operators must exist on each of the platforms. This is not a problem for other standard stages because their corresponding operators will at installation time have been distributed to all the platforms.
Generated OSH
Enable viewing of generated OSH in Administrator:
Notes:
You can view generated OSH in Designer in several places, as shown above. To view the OSH, you must enable this in Administrator on the Parallel tab. When enabled it is enabled for all projects.
Data Set stage: copy operator
Sort: tsort operator
Aggregator: group operator
Row Generator, Column Generator, Surrogate Key Generator: all mapped to the generator operator
Oracle:
  Source: oraread
  Sparse Lookup: oralookup
  Target Load: orawrite
  Target Upsert: oraupsert
  Target: lookup -createOnly
Notes:
The stages on the diagram do not necessarily map one-to-one to OSH operators. For example, the Sequential File stage when used as a source is mapped to the import operator. When used as a target it is mapped to the export operator. The converse is also true. Different stages can be mapped to a single operator. For example, the Row Generator and Column Generator stages are both mapped to the generator operator with different parameters.
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
  (
    a:int32;
    b:string[max=12];
    c:nullable decimal[10,2] {nulls=10};
  )
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a' -asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
  keep a,b,c;
)] 'SortSt:lnk_sorted.v'
;
Virtual data set is used to connect output of one operator to input of another
Virtual data sets are generated to connect operators
Have *.v extensions
Notes:
Data sets connect the OSH operators. These are virtual data sets, that is, in-memory data flows. These data sets correspond to links in the job diagram. Link names are used in data set names. So good practice is to name links meaningfully, so they can be recognized in the OSH. To determine the execution order of the operators, trace the output to input data sets. For example, if operator1 has dataSet1.v as an output and this data set is input to operator2, then operator2 follows operator1 in the execution order.
GUI                  OSH
table definition     schema
property             format
SQL column type      C++ data type
link                 virtual dataset
row                  record
column               field
stage                operator
Notes:
This slide lists some of the equivalencies of terminology between the DataStage GUI and the generated OSH. OSH terms and DataStage GUI terms have an equivalency. The GUI frequently uses terms from both paradigms. Log messages almost exclusively use OSH terminology because this is what the parallel engine executes.
Configuration File
Configuration file
Specifies the processing nodes
Determines the degree of parallelism
Identifies resources connected to each processing node
When system resources change, only need to change the configuration file
No need to modify or recompile jobs
Notes:
The Parallel Job Developer's Guide documents the configuration file.
Notes:
Processing nodes are specified in the configuration file. These do not necessarily correspond to computer nodes. A single computer node can run multiple processing nodes.
Name and location of the file to be used is determined by the $APT_CONFIG_FILE environment variable
Primary elements:
Node name
Fast name
Pools
Resources
Notes:
This slide lists the configuration file format. The primary elements are the node name, fast name, pools, and resources.
Node options
Node name
  User-defined name of a processing node
Fast name
  Name of computer system upon which a node is located, as referred to by the fastest network in the system
  Specified for each processing node
  Used by DataStage operators to open connections
  For non-distributive systems such as SMP, the operators all run on a single system
  So all processing nodes will have the same fast name
Node pools
  Names of pools to which a node is assigned
  Used to logically group nodes
Resource pools
  Names of pools to which resources are assigned
  Used to logically group resources
  The default pool: specified by the empty string ()
  By default, all operators can use any node assigned to the default pool
  By default, resources assigned to the default pool are available to all operators
Resources
  Disk
  Scratch disk
Notes:
The node name is not required to correspond to anything physical. It is a user-defined name for a virtual location where operators can run. Fast name is the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. For non-distributive systems such as SMP, the operators all run on a single system. So regardless of the number of processing nodes, the fast name will be the same for all of them. The fast name is the physical node name that operators use to open connections for high-volume data transfers. Typically this is the principal node name as returned by the UNIX command uname -n.
Notes:
There are a set of resource pool reserved names, including: db2, oracle, informix, sas, sort, lookup, buffer. Certain types of operators will use resources assigned to these reserved name pools. For example, the sort operator will use scratch disk assigned to the sort pool, if it exists. If it exhausts the space on the sort pool, it will use other default scratch disk.
Resource pools
By default, operators use the default pool, specified by the empty string ()
But operators can be constrained to use resources assigned to specific named pools, for example bigdata
Notes:
Resource pools allocate resources, mainly disk resources, to nodes as specified in the configuration file. One resource pool, specified by the empty string (), is special: it is the default pool of resources to be used by operators.
Sort stage looks first for scratch disk resources in the sort pool
Then it looks for resources assigned to default disk pools
Notes:
One type of resource pool, the sort pool, specifies disk resources to be used if a sorting operation runs out of memory. If it runs out of sort disk resources, it will use scratch disk resources.
Node pool for sort operator
Resource pool for sort operator
Fast names are all different, so running on a grid or cluster
Notes:
This slide shows an example of a configuration file. Notice the sort keyword used for the first node which designates disk resources to use by a sort operator running on node n1. This is disk the sort operator will use if it runs out of memory.
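As a hedged sketch of the kind of file these notes describe (host names and paths are hypothetical), the fragment below gives each node a different fastname, which indicates a cluster or grid; node n1 also belongs to a named sort node pool, and its scratch disk is assigned to the sort resource pool in addition to the default pool:

  {
    node "n1" {
      fastname "host1"
      pools "" "sort"
      resource disk "/ds/data/n1" {pools ""}
      resource scratchdisk "/ds/scratch/n1" {pools "" "sort"}
    }
    node "n2" {
      fastname "host2"
      pools ""
      resource disk "/ds/data/n2" {pools ""}
      resource scratchdisk "/ds/scratch/n2" {pools ""}
    }
    node "n3" {
      fastname "host3"
      pools ""
      resource disk "/ds/data/n3" {pools ""}
      resource scratchdisk "/ds/scratch/n3" {pools ""}
    }
  }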
Notes:
In this example, since a sparse lookup is viewed as the bottleneck, the stage has been set to execute on multiple nodes. These are nodes that are assigned to the extra node pool. It is also given extra resources. These are resources that are assigned to the extra resource pool.
Configuration Editor
View and edit configuration files
Click Tools>Configurations in Designer
There is a button you can click to check a configuration for syntax errors
Notes:
DataStage Designer has a configuration editor you can use to create and edit configuration files. The editor also contains functionality for checking the configuration file for syntax errors.
Configuration editor
Select configuration
Check configuration
Notes:
This slide shows the configuration editor. Select the configuration file to edit from the list box at the top. When you click the Check button the editor checks the syntax and displays the results in the lower window.
Identifies degree of parallelism and node assignments for each operator
Inserts sorts and partitioners as needed to ensure correct results
Defines connection topology (virtual data sets) between adjacent operators
Inserts buffer operators to prevent deadlocks
Defines number of actual operating system processes
Where possible, multiple operators are combined within a single process to improve performance and optimize resource requirements
Set $APT_STARTUP_STATUS to show each step of job startup
Set $APT_PM_SHOW_PIDS to show process IDs in log messages
Notes:
The Score is one of the main runtime debugging tools. It is generated from the OSH and configuration file. This slide lists some of the information the Score contains.
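As a hedged illustration (these are the reporting variables named in this course; for such variables, simply defining them, for example setting them to 1 or True, is what enables the reporting), they can be exported in the engine environment, for example in the dsenv file, or added as job parameters:

  export APT_STARTUP_STATUS=1    # log each step of job startup
  export APT_PM_SHOW_PIDS=1      # include player process IDs in log messages
  export APT_DUMP_SCORE=1        # write the Score to the job log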
Notes:
Generating the Score and initiating the operator processes is part of the job overhead. Processing does not begin until this occurs. This is inconsequential for jobs processing very large amounts of data, but it can be consequential for jobs processing smaller amounts of data.
Notes:
The only place the Score is displayed is in the job log. Unfortunately, the message does not contain the word Score in its heading. Look for the message heading that begins main program: This step has N data sets.
Notes:
The Score contains a lot of useful information including the number of operators and data sets, and the mappings of operators to processing nodes. Recall that the names of the nodes are arbitrary. 1 in node1 is just part of an arbitrary string name; it does not identify where it is in the partitioning order. p0, p1, identify the partitions and their ordering, as determined by the configuration file. The last entry identifies the number of player processes. In this example, there is one for the Row Generator stage, which is running sequentially, and four each for the two Peek stages, which are running in parallel using all the nodes. The total is nine processes.
[Diagram: the Conductor process communicates with a Section Leader (SL) and Player (P) processes on each Processing Node.]
Default Communication:
SMP: Shared Memory
MPP: Shared Memory (within hardware node); TCP (across hardware nodes)
Notes:
The conductor node has the start-up process. It creates the Score based on OSH and configuration file. Then it starts up section leader processes. Section leaders manage communication between the conductor node and the players. Error and information messages returned by an operator running on a node (that is, a player process) are passed to the section leader who then passes them to the conductor.
[Diagram: three Section Leaders (0, 1, 2), each managing a generator player (generator,0 to generator,2) and a copy player (copy,0 to copy,2); dotted data channels connect the players directly.]
Notes:
The dotted lines are communication channels between player processes for passing data. Data that moves between nodes (for example, between section leader 1 and section leader 2) is being repartitioned. Every player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, messages, errors, and data. Note that the data channel does not go through the section leader/conductor, as this would limit scalability. Data flows directly from upstream operators to downstream operators.
Row order is undefined (non-deterministic) across partitions and across multiple links
Order within a particular link and partition is deterministic
Based on partition type and optionally on the sort order
For example, cannot update a source or reference file used in the same flow
Notes:
Conceptually, you can picture a running parallel job as a series of conveyor belts transporting rows. The order of the rows across the partitions is non-deterministic. Within a single partition the order is determined.
Define Section Leaders
Send Score to Section Leaders
Start Players
Set up data connections between Players
Notes:
This slide shows some of the information contained in the log about the start-up processing. Reporting environment variables control how much of this information shows up in the log.
Total number of processes = Conductor + Section Leader processes + Player processes for all operators
Notes:
The total number of processes a job generates is important to performance. If you can reduce the number of processes a job is using, relative to a certain configuration file, you can improve its performance.
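As a worked example using the job described earlier (one sequential operator plus two operators each running four-way parallel under a 4-node configuration file): player processes = 1 + 4 + 4 = 9, section leader processes = 4 (one per node), plus 1 conductor, for a total of 9 + 4 + 1 = 14 operating system processes.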
Notes:
Sometimes it would be nice to know what is happening to the data at a particular place in the job. For example, maybe you want to know what the data is before it is processed by a Transformer stage. This is one use you can make of Copy and Peek stages.
Peek stage
Notes:
This slide shows a job with Copy stages used to get snapshots of the data during the processing. The second Copy stage will be optimized away, because it has only one output and the Force property has been set to False. The first may be combined with the Transformer in the final optimization. This information can be seen in the log.
Notes:
In this job example, stage variables are defined. Stage variables are executed top to bottom, just like columns. They are executed before any output links are processed.
Checkpoint
1. Why do jobs with Transformer stages take longer to compile?
2. What do the following objects in the GUI correspond to in the OSH? Table definition, stage, link?
3. Suppose node2 is in a node pool named sort. Then a Sort stage operator can run on this node. What about other stage operators? What about, for example, a Transformer stage operator? Could it run on node2?
4. From the Score we learn that this job generates three operators. The first runs sequentially. The last two run in parallel each on two nodes. How many player processes does it run? How many total processes?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Describe the main parts of the configuration file
Describe the compile process and the OSH that is generated during it
Describe the role and the main parts of the Score
Describe the job execution process
Unit 3. Partitioning and Collecting Data
Unit objectives
After completing this unit, you should be able to:
Understand how partitioning works in the framework
View collectors and partitioners in the Score
Select collecting and partitioning algorithms
Generate sequences of numbers (surrogate keys) in a partitioned, parallel environment
partitioner
collector
Notes:
Partitioners are generated by default or when you specify them explicitly in the stage. They distribute rows of a link into smaller segments that can be processed independently in parallel. Collectors reverse this process. They combine parallel partitions of a link into a single partition for sequential processing.
Fan-Out Partitioner
Sequential to parallel
Fan-In Collector
Parallel to sequential
Partitioner and collector icons always appear left to right regardless of the angle of the link
Notes:
This slide shows a job opened in Designer. The partition and collector icons show up on the input links going to a stage. They always appear left to right regardless of the angle of the link.
Partitioners
Partitioners are inserted before stages running in parallel. The previous stage may be running:
Sequentially
In Parallel
[Diagram: a stage running sequentially feeds a partitioner (fan-out icon) into a stage running in parallel; a stage running in parallel feeds another parallel stage through a repartitioning icon.]
Notes:
Partitioners are inserted before stages running in parallel. The previous stage may be running in parallel or sequentially. The former yields a box or butterfly icon, depending on whether there is repartitioning. The latter yields a fan-out icon.
Partitioning tab
Partitioning method
Notes:
This slide shows the Inputs>Partitioning tab. Auto is the default. If a partitioning method other than Auto is selected, then this information can go into the OSH. If Auto is selected, the framework inserts partitioners when the Score is composed.
The Score
Set $APT_DUMP_SCORE to include the Score in the job log
The Score includes information such as:
How the data is partitioned and collected
Including partitioning keys
Extra operators and buffers inserted into the flow
Degree of parallelism each operator runs on, and on which nodes
tsort operators inserted into the flow
Notes:
The setting of the environment variable $APT_DUMP_SCORE determines whether the Score is displayed in the job log. The Score contains a lot of valuable information, some of which is listed in this slide.
Notes:
Under each operator is a list of nodes the operator is running on. This will include multiple nodes if the operator is running in parallel. For each node (for example, node1), the partition (p0) the node name is associated with is shown. Each operator has a name derived from the GUI stage the operator was generated from and an alias (op0, op1, and so on) used within the Score. So, for example, op0 was generated from a Row Generator stage. Following the operator alias is the number of partitions (1p, 2p, and so on) that it is running on.
Producer | Indicator | Consumer
Partitioning method is associated with producer
Even though it is set in the consumer stage on the GUI
Collector method is associated with consumer
Separated by an indicator:
  ->  Sequential to Sequential
  <>  Sequential to Parallel
  =>  Parallel to Parallel (SAME)
  #>  Parallel to Parallel (not SAME)
  >>  Parallel to Sequential
  >   No producer or no consumer
May also include [pp] notation when Preserve Partitioning flag is set
Notes:
At the operator level, partitioning and collecting involves a pair of operators. The first operator produces the rows; the second consumes them. At the GUI level in the job design we specify the partitioning or collecting algorithm always and only at the consumer stage. So here the GUI is a little misleading when we specify a partitioning method. To interpret the score partitioning and collecting methods, first look for the indicator symbol in the row between the two operators. The indicator identifies the parallelism sequence between the two operators, as shown in the list. Look to the left of the indicator to determine the partitioning method. eAny indicates Auto, which is a default as determined by the type of stage. If we had, for example, chosen Entire as the partitioning method for the Transformer, we would see eEntire to the left of the indicator. Look to the right of the indicator symbol to determine the collection method. Since the Transformer is running in parallel there is no useful information on the right side. The eCollectAny symbol indicates that even when a Transformer operator is running in parallel it still has to retrieve the data from the producer operator.
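As a rough, hand-written sketch (not an actual log capture; the stage names, node counts, and exact operator wording are hypothetical and vary by job and release), a data set entry in the Score for a sequential Row Generator feeding a four-way parallel Transformer might read something like:

  ds0: {op0[1p] (sequential Row_Generator_0)
        eAny<>eCollectAny
        op1[4p] (parallel Transformer_1)}

Reading it as described above: eAny to the left of the <> indicator is the producer's (Auto) partitioning method, <> marks a sequential-to-parallel boundary, and eCollectAny on the right is the consumer's collection method.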
Combinability mode: Don't Combine
Partitioning method: Hash by c1
Sequential to parallel
Collector: Ordered
Partitioner: Hash
Collector: Ordered
Parallel to sequential
Notes:
This slide shows the Score generated for the job displayed. Property settings in the stages in the job affect the contents of the Score. For example, Hash by c1 has been set in the Transformer stage. Notice that a hash partitioner is generated in the Score, as indicated.
Partition numbers
At runtime, the parallel framework determines the degree of parallelism for each stage from:
Configuration file
Execution mode (Stage>Advanced tab)
Stage constraints if applicable (Stage>Advanced tab)
Partition #
Notes:
At runtime, the parallel framework determines the degree of parallelism for each stage from the configuration file and other settings. Partitions are assigned numbers, starting at zero. In the log, the partition number is appended to the stage name in messages.
Partitioning methods
Keyless Partitioning: Rows are distributed independently of data values
  Same: Existing partitioning is not altered
  Round Robin: Rows are evenly alternated among partitions
  Random: Rows are assigned randomly to partitions
  Entire: Each partition gets the entire data set (rows are duplicated)
Keyed Partitioning: Rows are distributed based on values in specified key columns
  Hash: Rows with same key column values go to the same partition
  Modulus: Assigns each row of an input data set to a partition, as determined by a specified numeric key column
  Range: Similar to hash, but partition mapping is user-determined and partitions are ordered
  DB2: Matches DB2 EEE partitioning
Notes:
This slide lists the two main categories of partitioning methods: Keyless and Keyed. Auto (the default method): DataStage chooses an appropriate partitioning method; Round Robin, Same, or Hash are most commonly chosen. Random: DataStage uses a random algorithm to choose where the row goes; the result is that you cannot know where a row will end up. Hash: DataStage's internal algorithm applied to key values determines the partition. The data type of the key value is irrelevant; all key values are converted to characters before the algorithm is applied. Range: The partition is chosen based on a range map, which maps ranges of values to specified partitions. There is a stage that can be used to build the range map, but its use is not required. DB2: DB2 has published its hashing algorithm and DataStage copies that. Use when hashing to partitioned DB2 tables.
Enable Show Instances in Director Job Monitor to show data distribution across partitions:
Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the job log as each stage operator completes processing
Notes:
In general, when it comes to choosing a partitioning method, you should choose a partitioning method that gives approximately an equal number of rows to each partition, but satisfies business requirements. This ensures that processing is evenly distributed across nodes.
Aggregator, Join, Merge, Sort, Remove Duplicates, Transformers and Build stages (when processing groups)
A partitioning method needed to ensure correct results may lead to uneven distribution
Notes:
The partition method must match the stage logic. Some stages, for example, require that all related records (by key) are in the same partition. This includes any stage that operates on groups of related data. For best performance, leverage partitioning performed earlier in the flow.
Keyless
[Diagram: with Same partitioning, existing partitions pass through unchanged; row IDs 0 3 6 / 1 4 7 / 2 5 8 stay in the same partitions downstream.]
Notes:
This slide illustrates the Same partitioning algorithm. It is a keyless method that retains the current distribution and order of the rows from the previous parallel stage.
Do not follow a Data Set stage with a stage using Same partitioning
This would occur if one job writes to a data set that a second job reads
The downstream stage will run with the degree of parallelism used to create the data set
Regardless of the degree of parallelism defined in the job's configuration file
Notes:
Same has low overhead, but there are times when it should not be used. Do not follow a stage running sequentially (for example, a Sequential File stage) with a stage using Same partitioning. And do not follow a Data Set stage with a stage using Same partitioning.
Keyless
Fairly low overhead
Round Robin assigns rows to partitions like dealing cards
The row assignment will always be the same for a given configuration file
[Diagram: input rows 8 7 6 5 4 3 2 1 0 are dealt Round Robin into three partitions: 6 3 0 / 7 4 1 / 8 5 2]
Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs
Notes:
Round Robin and Random are two other keyless methods. In both cases, rows are evenly distributed across partitions. Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs.
Notes:
It is very important to know that row order is undefined across partitions in different job runs. This is an example that illustrates this. In this example, we see that the row containing a:3, which is in partition 2, arrives first. In the second job run, the row containing a:3 arrives after other rows in other partitions, for example, a:2 in partition 1.
Entire partitioning
Each partition gets a copy of each row
Useful for distributing lookup and reference data
May have performance impact in MPP / clustered environments
On SMP platforms, Lookup stage uses shared memory instead of duplicating the entire reference data
On MPP platforms, each server uses shared memory for a single local copy
Keyless
[Diagram: input rows 8 7 6 5 4 3 2 1 0 partitioned with Entire; every partition receives a copy of all the rows (0 1 2 3 ...).]
Notes:
Entire partitioning is another keyless method. Each partition gets a complete copy of each row. This is very useful for distributing lookup and reference data. On SMP platforms, the Lookup stage uses shared memory instead of duplicating the entire reference data, so there is no performance impact.
Hash partitioning
Keyed partitioning method
Rows are distributed according to values in key columns
Rows with same key values go into the same partition
Prevents matching rows from hiding in other partitions
For example, with Join, Merge, Remove Duplicates,
Keyed
Values of key column: 0 3 2 1 0 2 3 2 1 1
[Diagram: Hash distributes the key values into three partitions: 0 3 0 3 / 1 1 1 / 2 2 2]
Partition distribution is relatively equal if the data across the source key columns is evenly distributed
Notes:
For certain stages (Remove Duplicates, Join, Merge) to work correctly in parallel, the user must use a keyed method such as Hash. In this example, the numbers are values of key column. Hash guarantees that all the rows with key value 3 end up in the same partition. Hash does not guarantee continuity. Here, threes are bunched with zeros, not with neighboring two values. Hash may not provide an even distribution of the data. Use key columns that have enough values to distribute data across the available partitions. For example, gender would be a poor choice of key because all rows would flow into two partitions.
Source Data (hash partitioned on LName)
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 1
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 2
ID  LName  FName    Address
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard

Hash partitioning distribution matches source data key values distribution. Here, the number of distinct hash key values limits parallelism!

Notes:
This is an example of unequal distribution of rows down the different partitions. This is something you would want to avoid if possible. Partition 1 would take much longer to process and so the job as a whole would take longer.
Modulus partitioning
Keyed partitioning method
Rows are distributed according to the values in a numeric key column
The modulus determines the partition:
partition = MOD (key_value / number of partitions)
Values of key column: 0 3 2 1 0 2 3 2 1 1
Keyed
MODULUS
Faster than Hash
Guarantees that rows with identical key values go into the same partition
Partition size is relatively equal if the data within the key column is evenly distributed
[Diagram: the key values distribute into three partitions: 0 3 0 3 / 1 1 1 / 2 2 2]
Notes:
Modulus is a keyed partitioning method that works like Hash, except that it can only be set for numeric columns. Rows are distributed according to the values in a numeric key column. Like Hash, which is slower, Modulus guarantees that rows with identical key values go into the same partition.
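For instance, with a 4-node configuration (four partitions), a row whose numeric key value is 10 goes to partition 10 MOD 4 = 2, and a row with key value 7 goes to partition 7 MOD 4 = 3; any two rows with the same key value necessarily land in the same partition.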
Range partitioning
Rows are distributed by range according to the values in one or more key columns
Pre-process the data to generate a range map
More expensive than Hash partitioning
Must read entire data twice to guarantee results
Keyed
Values of key column: 4 0 5 1 6 0 5 4 3
[Diagram: a Range Map file assigns the values to three partitions: 0 1 0 / 4 4 3 / 5 6 5]
Guarantees that rows with identical values in key columns end up in the same partition
Rows outside the map go into the first or last partition
Limited use: only useful in cases where incoming data distribution is consistent over time
Notes:
Range partitioning is a keyed method. Rows are distributed by range according to the values in one or more key columns. The partitioning is based on a range map. If the source data distribution is consistent over time, it may be possible to re-use the range map file and thereby avoid the time it takes to pre-process the data. Note that at runtime, values that are outside of a given range map will land in the first or last partition as appropriate.
Keyed
Reference this Range Map file when specifying Range partitioning
Note that range map files are specific to a given configuration file
Notes:
In general, it is best not to use range partitioning, because it requires two passes over the data to guarantee good results: One pass to create the range map. Another pass running the job using the Range partitioning method.
Same partitioner
Re-partition
watch for this!
Auto partitioner
Notes:
Reading link markings:
S----------------->S (no marking)
S----(fan out)---->P (partitioner)
P----(fan in)----->S (collector)
P----(box)-------->P (no reshuffling: partitioner using Same method)
P----(butterfly)-->P (reshuffling: partitioner using another method)
Auto partitioning
DataStage inserts partitioners as necessary to ensure correct results
Generally chooses Round Robin or Same
Inserts Hash for stages that require matched key values (Join, Merge, Remove Duplicates)
Inserts Entire on Lookup reference links
Since DataStage has limited awareness of your data and business rules, best practice is to explicitly specify Hash partitioning when needed
DataStage has no visibility into Transformer logic
Hash is required before Sort and Aggregator stages
DataStage sometimes inserts unnecessary partitioners
Check the Score
Notes:
When Auto is chosen, DataStage inserts partitioners as necessary to ensure correct results. Auto generally chooses Round Robin when going from sequential to parallel. It generally chooses Same when going from parallel to parallel. Since DataStage has limited awareness of your data and business rules, best practice is to explicitly specify Hash partitioning when needed, that is, when processing requires groups of related records.
Set automatically by some operators (Sort, Hash partitioning)
Can be manually set (Stage>Advanced tab)
Functionally equivalent to explicitly specifying SAME partitioning
But allows DataStage to over-ride and optimize for performance
Preserve Partitioning setting is part of data set metadata
Log warnings are issued when Preserve Partitioning flag is set but downstream operators cannot use the same partitioning
Notes:
The Preserve Partitioning flag is used in a stage before stages that use Auto. It has 3 possible settings, but most often the default is used. Sometimes you may want to choose Set. In that case, downstream stages are to attempt to retain partitioning and sort order.
Partitioning strategy
Use Hash when stage requires grouping of related values
Use Modulus if group key is a single integer column
Better performance than Hash
Range may be appropriate in cases where data distribution is uneven but consistent over time
Know your data!
How many unique values in the Hash key columns?
Notes:
This slide lists some best practices for setting stage partitioning.
Across jobs:
Use data sets to retain partitioning
Notes:
This slide continues the list of best practices for setting stage partitioning.
collector
Collecting Data
Stage running Sequentially
Collectors
Collectors combine partitions into a single input stream going to a sequential stage
...
Sequential Stage
Notes:
Collector methods combine partitions into a single input stream going to a sequential stage or stream. This slide illustrates this process. At the top are multiple data partitions reduced to one.
Notes:
Collector method is defined on the Input>Partitioning tab just as for the partitioning. The word Collector indicates that we are selecting a collector method as opposed to a partitioning method.
Collector methods
Auto: Read in the first row that shows up from any partition. Output row order is undefined (non-deterministic). Default collector method.
Round Robin: Pick rows from the input partitions in round robin order. Slower than Auto, rarely used.
Ordered: Read all rows from the first partition, then the second, and so on. Preserves the order of rows that exists within each partition.
Sort Merge: Produces a single stream of rows sorted on specified key columns from input sorted on those keys. Row order is not preserved for non-key columns.
Notes:
This slide lists the collector methods available. Auto (the default) reads rows from partitions as soon as they arrive. This can yield different row orders in different runs with identical data (non-deterministic execution). Round Robin picks the first row from partition 0, the next from partition 1, even if other partitions can produce rows faster than partition 1. Ordered is the great American novel collector. Assume you just finished writing the great American novel. You use DataStage to spell check in parallel. Partition 0 holds chapter one, partition 1 holds chapter 2, and so on. You need a collector before sending your opus to the printer. The default collector (Auto) will print lines in random-looking order. Round Robin will print the first line from partition 0, the next from partition 1, and so on. What you need is the Ordered collector: it will first read all lines from partition 0, then from partition 1, and so on.
Notes:
Sort Merge produces a (globally) sorted sequential stream from within-partition sorted rows. Let us look at how it works on a two-node example. Rows have one column, an integer, and it is the key column. Assume that the rows are already sorted within each of the two partitions. Sort Merge produces a sorted sequential stream using the following algorithm: always pick the next row from the partition that produces the smallest key value. This produces the desired ordered sequence, 0011223355, regardless of the original partitioning, as long as the input data partitions are sorted by key.
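As a rough illustration of this algorithm (not DataStage code), the following Python sketch merges two already-sorted partitions by always taking the smallest available key; the partition contents are the ones used in the notes.

import heapq

partition_0 = [0, 1, 2, 3, 5]   # rows already sorted within the partition
partition_1 = [0, 1, 2, 3, 5]
merged = list(heapq.merge(partition_0, partition_1))
print("".join(str(k) for k in merged))   # 0011223355, as in the notes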
Non-deterministic execution
The collector may yield non-deterministic results in case of ties
partition_0    partition_1
-----------    -----------
0,"a"          0,"p"
1,"x"          1,"y"
2,"K"          2,"a"
3,"y"          3,"x"
5,"p"          5,"j"
The third row can equally be (1,"x") or (1,"y") because there is a tie (same key value 1) between partitions. This can be avoided by Hash partitioning on the key values:
Then key 1 could not exist in both partitions
Notes:
The Sort Merge collector can yield non-deterministic results in some cases, for example, when there are rows in two or more partitions that fit the sort sequence. In this example, the third row can equally be 1,x or 1,y because there is a tie (same key value 1) in both partitions. Which one is chosen depends on the relative speed in which these partitions are processing rows.
Ordered is only appropriate in special cases.
The Round Robin collector can sometimes be used to reconstruct the original (sequential) row ordering for Round Robin partitioned inputs. Intermediate processing must not have altered the row order or reduced the number of rows. Rarely used.
Notes:
This slide describes some best practices for choosing a collector method. Generally Auto is the fastest and most efficient method of collection. When you need sorted data, select Sort Merge.
Funnel stage
Stage that runs in parallel. Merges data from multiple links. Table definitions (schema) of all links must match.
Collector
Funnel
Notes:
Sometimes links are confused with partitions, so collectors seem like Funnel stages. But remember that a single link can (and most often does) contain multiple partitions. They are not the same thing.
Transformer
Notes:
In the parallel world, creating unique sequences of numbers is complicated. In the partitioned world, each operation is performed on each partition, so a counter in a Transformer counts the number of rows in the partition (not globally), and each partition creates a duplicate list. But there are some system variables that can be used to generate unique sequences. The Surrogate Key Generator stage is discussed in a later unit.
Notes:
In the Row Generator stage you can create a unique sequence of numbers in a particular column by setting the properties shown here. Set Initial value to part (partition number). Set Increment to partcount (number of partitions).
Generated numbers
Notes:
This shows the results of setting the Row Generator properties as shown in the previous slide. The number of nodes in this example equals 2.
RowCount output
Notes:
This slide shows the results of a Transformer using @INROWNUM. Assume that there are 4 partitions. @INROWNUM will contain the number of the row going through the partition. Each partition repeats the same sequence of integers.
RowCount output
Notes:
This slide shows how to use the system variables along with @INROWNUM to generate a unique sequence of integers. Assume that there are 4 partitions. @INROWNUM will contain the number of the row going through the partition. The formula @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) will yield the following sequence of integers for the rows going down partition 0: 0, 4, 8, and so on. For partition 1, the series will be 1, 5, 9; for partition 2, it will be 2, 6, 10; for partition 3, it will be 3, 7, 11.
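A quick arithmetic check of this formula, written as plain Python rather than a Transformer derivation (assuming @INROWNUM starts at 1 and there are 4 partitions):

num_partitions = 4
for partition_num in range(num_partitions):
    # @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) for the first three rows
    seq = [partition_num + num_partitions * (inrownum - 1) for inrownum in range(1, 4)]
    print("partition", partition_num, ":", seq)
# partition 0 : [0, 4, 8]   partition 1 : [1, 5, 9]
# partition 2 : [2, 6, 10]  partition 3 : [3, 7, 11]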
Source file
Target file
Notes:
It is sometimes necessary to process files that have a header and detail format. The header row contains information that applies to all the detail rows that follow (up to the next header row). These two types of rows have different formats, so their individual columns cannot be specified on the Columns tab.
Job design
Split records into header and detail streams. Parse out individual fields
Notes:
Here is a job that can be used to process a header detail file. The source file is a variable format data file. The trick is to read the rows in as single fields. Then the individual fields can be parsed out in the Transformer stage.
Notes:
This shows how the Field function in a Transformer can be used to parse columns. The Column Import stage can also be used.
Notes:
This slide shows the Score for the job. Notice the inserted Hash partitioners.
Solution:
Select the Entire partitioning algorithm for Header into the Join. Select the Same partitioning algorithm for Detail into the Join. Not all Detail records are in the same partition, but in every partition they're in, there's a Header.
Notes:
This job design has some performance issues. Because of the parallelism that occurs in DataStage, this is not a particularly easy task to accomplish. The header will only go down one partition but we actually need to put it down all partitions. Two solutions suggest themselves: Join them together by hashing on the key. The problem with this approach is that the join is hashing on a single value and, in essence, running in sequential mode. Or take the header information and copy it to all partitions and the join will run in parallel.
Notes:
Notice that in the revised job design, there are different partitioners.
Notes:
You may be interested in knowing how to create a header detail file. This example shows one way. The Column Export stages are used to put the header and detail records, which have different formats, into a single format. This is necessary in order to use the Funnel stage to merge these records together.
Columns to export
Notes:
This slide shows the inside of the Column Export stage. The Explicit column method has been chosen and the individual columns in the input link are explicitly listed. These are combined into a single column of output named Header. This can be funneled together with the individual column of output from the detail link.
All input links must have the same column metadata
Notes:
This slide shows the inside of the Funnel stage. All input links must have the same column metadata.
Checkpoint
1. What two sections does a Score contain?
2. How does Modulus partitioning differ from Hash partitioning?
3. What collection method can be used to collect sorted rows in multiple partitions into a single sorted partition?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Understand how partitioning works in the Framework
View collectors and partitioners in the Score
Select collecting and partitioning algorithms
Generate sequences of numbers (surrogate keys) in a partitioned, parallel environment
Unit objectives
After completing this unit, you should be able to:
Sort data in the parallel framework
Find inserted sorts in the Score
Reduce the number of inserted sorts
Optimize Fork-Join parallel jobs
Source Data
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Sorted Result (Sort on: LName (asc), FName (desc))
ID  LName  FName    Address
6   Dodge  John     75 Boston Boulevard
5   Dodge  Horace   17840 Jefferson
1   Ford   Henry    66 Edison Avenue
7   Ford   Henry    4901 Evergreen
4   Ford   Eleanor  7900 Jefferson
10  Ford   Eleanor  1100 Lakeshore
3   Ford   Edsel    7900 Jefferson
9   Ford   Edsel    1100 Lakeshore
2   Ford   Clara    66 Edison Avenue
8   Ford   Clara    4901 Evergreen

Notes:
This slide discusses the traditional (sequential) sort. This process of sorting data uses one primary key column and (optionally) multiple secondary key columns to generate a sequential, ordered result set. This is the method that SQL uses in SQL statements with an ORDER BY clause. This will be contrasted with the parallel sort described on the next slide.
Parallel sort
In many cases, there is no need to globally sort data
Sorting is most often needed to establish order within specified groups of data
Join, Merge, Aggregator, Remove Duplicates, for example. This sort can be done in parallel!
Hash partitioning can be used to gather related rows into single partitions: it assigns rows with the same key column values to the same partition.
Sorting is used to establish grouping and order within each partition based on key columns
Rows with the same key values are grouped together within the partition
Hash and Sort keys need not totally match. This is often the case before a Remove Duplicates stage:
Hash ensures that all duplicates are in the same partition. Sort groups the rows and then establishes an ordering within each group, for example, by latest date.
Notes:
This slide discusses the parallel sort. In many cases, there is no need to globally sort data. In these cases a parallel sort can be used and this will be much faster. If a global sort is needed Sort Merge can be used to accomplish this after the parallel sort.
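The following Python sketch illustrates the idea only (it is not how DataStage implements it): hash-partition on the grouping key, then sort each partition independently on LName ascending and FName descending. The two-partition layout and the hash function are assumptions.

rows = [("Ford", "Henry"), ("Dodge", "John"), ("Ford", "Clara"),
        ("Dodge", "Horace"), ("Ford", "Edsel")]
num_partitions = 2

# Step 1: hash partitioning gathers rows with the same LName into one partition.
partitions = {p: [] for p in range(num_partitions)}
for lname, fname in rows:
    partitions[hash(lname) % num_partitions].append((lname, fname))

# Step 2: each partition sorts only its own rows (LName asc, FName desc).
for p, part_rows in partitions.items():
    part_rows.sort(key=lambda r: r[1], reverse=True)   # secondary key first
    part_rows.sort(key=lambda r: r[0])                 # stable sort on primary key
    print(p, part_rows)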
Notes:
This illustrates a parallel sort. Each partition sorts the data within it separately from the others. Each row lists ID, LName, FName, Address; the sort key is LName (asc), FName (desc).

Part 0 input:
2, Ford, Clara, 66 Edison Avenue
8, Ford, Clara, 4901 Evergreen
Part 0 after parallel sort:
2, Ford, Clara, 66 Edison Avenue
8, Ford, Clara, 4901 Evergreen

Part 1 input:
3, Ford, Edsel, 7900 Jefferson
5, Dodge, Horace, 17840 Jefferson
9, Ford, Edsel, 1100 Lakeshore
Part 1 after parallel sort:
5, Dodge, Horace, 17840 Jefferson
3, Ford, Edsel, 7900 Jefferson
9, Ford, Edsel, 1100 Lakeshore

Part 2 input:
4, Ford, Eleanor, 7900 Jefferson
6, Dodge, John, 75 Boston Boulevard
10, Ford, Eleanor, 1100 Lakeshore
Part 2 after parallel sort:
6, Dodge, John, 75 Boston Boulevard
4, Ford, Eleanor, 7900 Jefferson
10, Ford, Eleanor, 1100 Lakeshore

Part 3 input:
1, Ford, Henry, 66 Edison Avenue
7, Ford, Henry, 4901 Evergreen
Part 3 after parallel sort:
1, Ford, Henry, 66 Edison Avenue
7, Ford, Henry, 4901 Evergreen
Stages that can minimize memory usage by requiring the data to be sorted
Join, Merge, Aggregator (using the Sort method, rather than the Hash method)
Notes:
There are a number of stages that require sorted data. This includes stages that process groups of data, and stages that can minimize memory usage by requiring the data to be sorted.
In-stage sorts
Partitioning cannot be Auto
Input links with in-stage sorts will have a Sort icon:
Both methods generate the same internal tsort operator in the OSH
Notes:
There are two parallel sorting methods available in DataStage: The Sort stage, and in-stage sorts. Internally they both use the same tsort operator, so there is no difference in terms of performance.
In-Stage sorting
- Easier job maintenance (fewer stages on job canvas) - But fewer options (tuning, features)
Notes:
This shows how to define an in-stage sort. It requires a partitioning method other than Auto, as illustrated. The same key can be specified for sorting, partitioning, or both sorting and partitioning.
Sort stage
Offers more options than an in-stage sort
Notes:
This shows the Sort stage. The Sort stage offers more options than an in-stage sort. The default sort utility is DataStage, which is recommended.
Stable sorts
Preserves the order of non-key columns within each sort group. Slower than non-stable sorts.
Use only when needed. Enabled by default in the Sort stage. Not enabled by default for in-stage sorts.
Notes:
Both the Sort stage and in-stage sorts offer stable sorts. A stable sort preserves the order of non-key columns within each sort group. This is necessary for some business purposes, but stable sorts are slower than non-stable sorts. Use only when needed. It is enabled by default in the Sort stage, so be sure to disable this if it is not needed.
Resorting on sub-groups
Use Sort Key Mode property to re-use key column groupings from previous sorts
Uses significantly less memory and disk
Sorts within previously sorted groups, not the total data set. Outputs rows after each group, not after the total data set.
Notes:
A major property that the Sort stage has that is not available for in-stage sorts is the Sort Key Mode property. Use the Sort Key Mode property to re-use key column groupings from previous sorts. This uses significantly less memory and disk and improves performance.
When rows are previously sorted by a key, all the rows are grouped together and, moreover, the groups are in sort order. In either case the Sort stage can be used to sort by a sub-key within each group.

Sorted by col1:   1,c  1,b  2,r  2,a  3,a
Grouped by col1:  2,r  2,a  3,a  1,c  1,b
Notes:
The Sort Key Mode offers two options: Don't Sort (Previously Sorted) and Don't Sort (Previously Grouped). This slide discusses the difference. When rows were previously grouped by a key, all the rows with the same key value are grouped together, but the groups of rows are not necessarily in sort order.
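A small Python illustration of the "previously grouped" case (conceptual only, not the Sort stage itself), using the col1/col2 rows shown above: because rows already arrive grouped on col1, only each group has to be sorted on the sub-key, and rows can be emitted group by group.

from itertools import groupby

grouped_rows = [(2, "r"), (2, "a"), (3, "a"), (1, "c"), (1, "b")]  # grouped, not sorted, on col1
for _, group in groupby(grouped_rows, key=lambda r: r[0]):
    ordered = sorted(group, key=lambda r: r[1])   # sort within the group only
    print(ordered)                                # each group is output as soon as it ends
# [(2, 'a'), (2, 'r')]  [(3, 'a')]  [(1, 'b'), (1, 'c')]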
Re-partitioned
2 101 3
1 102 103
Notes:
Try to avoid repartitioning after a sort because this destroys the sort order. In that case, you will not be able to reuse that sort order downstream.
Sequential mode
Sort Merge
Notes:
There are two ways you can create a global sort: Operate the Sort stage in sequential mode, or use parallel sort followed by Sort Merge. In general, parallel sort along with the Sort Merge collector will be much faster than a sequential sort unless data is already sequential. Database systems sort in a similar parallel way to achieve adequate performance.
Inserted tsorts
By default, tsort operators are inserted into the Score as necessary
Before any stage that requires matched key values (Join, Merge, RemDups)
op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0)) on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Only inserted if the user has not explicitly defined the sort
Explicitly defined sorts show up as Sort operators qualified with the name of the stage
Notes:
By default, tsort operators are inserted into the Score as necessary. By default they will be inserted before any stage that requires matched key values (Join, Merge, RemDups). They are only inserted if the user has not explicitly defined the sort. Explicitly defined sorts show up as Sort operators qualified with the name of the stage.
op1[4p] {(parallel inserted tsort operator {key={value=LastName, subArgs={sorted}}, key={value=FirstName, subArgs={sorted}}}(0)) on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Notes:
You can use the $APT_SORT_INSERTION_CHECK_ONLY and $APT_NO_SORT_INSERTION environment variables to change behavior of automatically inserted sorts. When $APT_NO_SORT_INSERTION is turned on, tsort operators are not inserted even for checking.
Decreasing this value may hurt performance. This option is unavailable for in-stage sorts.
When the memory buffer is filled, sort uses temporary disk space in the following order:
Scratch disks in the $APT_CONFIG_FILE sort named disk pool
Scratch disks in the $APT_CONFIG_FILE default disk pool
The default directory specified by $TMPDIR
The UNIX /tmp directory
Notes:
By default, Sort uses 20MB per partition as an internal memory buffer per partition. You can change the default using the Restrict Memory Usage option, which may improve performance.
Notes:
There is a difference between partition and sort keys, and they do not have to be the same. Partitioning assigns related records to the same partition; sorting establishes grouping and order within each partition.
Specify only necessary key columns. Avoid stable sorts unless needed. Re-use previous sort keys.
Use Sort Key Usage key column option
Within Sort stage, try adjusting Restrict Memory Usage to see if more memory will help
Notes:
This slide lists some best practices in using the Sort stage. A basic principle of optimization is to minimize the number of sorts within a job flow. You can do this by defining the sort as far as possible upstream and reusing the sort as far as possible downstream.
Notes:
A fork join is one important job design you should be aware of. The data stream is split into two output streams and then joined back. In this example, the data is split so that an aggregation can be performed which is then joined back to each row.
Fork
Notes:
This slide shows the job design. The Copy stage is used to fork the data to the Aggregator and Join. The Join stage merges the aggregation result back to the main stream.
Notes:
This shows the Score for the fork join job. Notice that hash partitioners are inserted, even though they were not explicitly specified in the job. Similarly tsort operators have been inserted.
Under the covers, DataStage inserts by default Hash partitioners and tsort operators before the Aggregator and Join stages
Default when Auto is chosen
Notes:
The sort icons have been added to show what is going on under the covers, even when no sorts are defined. Sorting is occurring before the Aggregator stage and on both input links to the Join stage. This job can be optimized to remove so many sorts.
Optimized solution
Explicitly set a Sort by Zip before the Copy stage. Explicitly specify Same as the partitioner for the Aggregator and Join stages. Notice that the data is repartitioned and sorted once, instead of three times.
Notes:
To optimize the job, the sort has been moved upstream before the Copy stage. Same partitioners have been specified to avoid repartitioning which destroys the sort. In earlier versions of DataStage, tsort operators are inserted and perform sorts unless the $APT_SORT_INSERTION_CHECK_ONLY environment variable is set. In the latest versions of DataStage, it is not necessary to set this variable, because tsort operators will not be inserted, as shown on the next slide.
Notes:
Notice that in the Score for the optimized job there are no inserted tsort operators.
Checkpoint
1. Name two stages that require the data to be sorted.
2. What are the advantages of using a Sort stage in a job design rather than an in-stage sort?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Sort data in the parallel framework
Find inserted sorts in the Score
Reduce the number of inserted sorts
Optimize Fork-Join jobs
Unit objectives
After completing this unit, you should be able to:
Describe how buffering works in parallel jobs
Tune buffers in parallel jobs
Avoid buffer contentions
Buffer
Some stages (Sort, Aggregator in Hash mode) internally buffer the entire dataset before outputting a row
Buffer operators are never inserted after these stages
Notes:
In the Score, buffer operators are inserted to prevent deadlocks and to optimize performance. Buffers provide resistance for incoming rows so that operators are not overwhelmed with incoming rows.
(PeekNull) ) on nodes ( ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3] )} op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12) on nodes ( ecc3672[op5,p0] )} It runs 12 processes on 4 nodes.
Notes:
In the Score, buffer operators are displayed in the operators section. This example Score shows one buffer that has been inserted.
When the buffer memory is filled, rows are spilled to scratch disk
Producer
Buffer
Consumer
Notes:
To prevent deadlocks, the buffer operator provides resistance to incoming rows. Buffer sizes can be specified, but keep in mind that this will be allocated per partition and per operator. The total amount of memory may be great.
Producer
Buffer
Consumer
$APT_BUFFER_FREE_RUN
Buffer will offer resistance to new rows, slowing down the rate rows are produced
Notes:
As the buffer fills, it will begin to push back once the $APT_BUFFER_FREE_RUN threshold is crossed. By default the number is 50%, at which point the buffer will offer resistance. 50% is designated as .5. If you set $APT_BUFFER_FREE_RUN to greater than 100%, it will stop the buffer from offering any resistance.
Buffer tuning
Apply to stage (operator) links (input or output).
Buffer policy: $APT_BUFFERING_POLICY specifies the default buffering policy:
AUTOMATIC_BUFFERING (Auto buffer): Initial installation default. Buffer only if necessary to prevent a deadlock.
FORCE_BUFFERING (Buffer): Unconditionally buffer all links.
NO_BUFFERING (No buffer): Do not buffer under any circumstances. May lead to deadlocks.
Buffer settings:
$APT_BUFFER_MAXIMUM_MEMORY: Maximum amount of memory per buffer (default is 3 MB)
$APT_BUFFER_FREE_RUN: Amount of memory to consume before offering resistance
$APT_BUFFER_DISK_WRITE_INCREMENT: Size of blocks of data moved to and from disk by the buffering operator
Notes:
This slide summarizes the buffer settings. $APT_BUFFERING_POLICY specifies the default buffering policy. This can be set to AUTOMATIC_BUFFERING (Auto buffer), FORCE_BUFFERING (Buffer), or NO_BUFFERING (No buffer). The other settings customize the degree of buffering.
Cautions
In general, buffer tuning should be done cautiously. Default settings are appropriate for most jobs. For jobs processing very wide rows, it may be necessary to increase the default buffer size to handle more rows in memory.
Calculate total record width using internal storage for each column data type, length, and scale. For variable length columns, use the maximum length
Notes:
Only tune buffers if you know what you are doing. Improper buffer settings can cause deadlocks. The width of a row determines how many rows can fit into a buffer; therefore, wide rows may require larger buffers. In this context, rows with more than 1000 columns are considered wide.
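A back-of-the-envelope sketch of that calculation in Python (the column list and byte sizes are invented for illustration; real internal storage sizes depend on the data types in your own job):

columns = [("cust_id", 4), ("name", 100), ("notes", 2000)]   # (column, max bytes)

record_width = sum(size for _, size in columns)              # use the maximum length for variable-length columns
buffer_bytes = 3 * 1024 * 1024                               # default 3 MB buffer per partition
print(record_width, "bytes per row,", buffer_bytes // record_width, "rows per buffer")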
Edit settings
Notes:
Buffer settings can be specified in a job stage. These settings apply only to the operator generated by the relevant stage. They are made on the Inputs>Advanced tab or Outputs>Advanced tab of a stage and apply to the link, either the input link or the output link.
When buffer memory is filled, temporary disk space is used in the following order:
Scratch disks in the $APT_CONFIG_FILE buffer named disk pool
Scratch disks in the $APT_CONFIG_FILE default disk pool
The default directory specified by $TMPDIR
The UNIX /tmp directory
Notes:
This slide discusses buffer resource usage. By default, each buffer operator uses 3MB per partition of virtual memory. This can be changed through Advanced link properties or globally using $APT_BUFFER_MAXIMUM_MEMORY.
Some stages (Sort, Aggregator in Hash mode) must read the entire input before outputting a single record
Setting the Don't Sort (Previously Sorted) key option changes Sort stage behavior to output on groups instead of the entire dataset
Notes:
Stages that process groups of data (Join, Merge, Aggregator in Sort mode) cannot output a row until either an end-of-data or end-of-group event occurs. End of data and end of group are events that cause something to happen within the system. For example, in an Aggregator stage there may be a change in the key value of the group that is being processed. This indicates that the group is done, and at this point the stage can output the summary row. Once an operator gets to the end of data, it can shut itself down.
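As a conceptual sketch only (not the Aggregator's actual implementation), the following Python generator shows why sorted input lets a group-processing stage emit a summary row at each key change instead of holding the whole data set:

def aggregate_sorted(rows):
    current_key, total = None, 0
    for key, amount in rows:
        if current_key is not None and key != current_key:
            yield current_key, total          # end-of-group: key changed, emit the summary row
            total = 0
        current_key, total = key, total + amount
    if current_key is not None:
        yield current_key, total              # end-of-data: emit the final group

print(list(aggregate_sorted([("A", 10), ("A", 5), ("B", 7)])))   # [('A', 15), ('B', 7)]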
The second link (#1, Right by link ordering) buffers all rows with key values that match the driver row
Notes:
For Join and Merge stages, the order of links is important. This slide illustrates that there is a difference between buffering as done by the buffer operator and buffering as it is done in the Join stage. Both can affect performance, but in different ways. An example of a job where join buffering can degrade performance appears later in this unit.
Notes:
If buffering becomes an issue in a fork-join job, one solution is to split the job into two separate jobs. Develop the single fork-join job first. Check if testing indicates a buffer-related performance issue. See if you can resolve it by changing buffer settings. If not, try changing the job design.
When there is only one header row and no subsequent change in the join column, data is buffered until the end of the group. Problem: processing is halted until all rows in the group are read.
Buffer Header Src Buffer Detail Out
Notes:
Everything gets buffered up, and the join will not output data until the end-of-group condition. If the groups contain small numbers of records, then the performance impact is minimal. The problem is most severe in cases where there is a single header row to be merged with all rows.
Buffering solution
Perform the join in the Transformer using stage variables to store the header record information. Data is Hash partitioned to ensure that header and detail records in a group are not spread across different partitions.
Notes:
Here we consider one possible solution to a buffering issue. Perform the join in the Transformer using stage variables to store the header record information. In this case, the data does not need to be split into two streams and then joined back.
Parse out the OrderNum and RecType columns. Store header info in stage variables.
Notes:
The two fields (OrderNum and RecType) that are in common to both header and detail records are parsed out using the Column Import stage. OrderNum is needed so that the data can be Hash partitioned by OrderNum before it is processed by the Transformer. It is assumed here that the data in the Orders file is sorted so that each group is contiguous and the header record precedes the detail records that make up the group. This order is required at the time of the Hash partitioning before the Transformer. So the Column Import stage has to run sequentially. Header record info (RecType = A) is stored in the Name and OrderDate fields. Detail records (RecType= B) are written out with the added header information.
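The stage-variable logic can be pictured with a short Python sketch (illustrative only; the field names RecType, OrderNum, Name, and OrderDate follow the notes above, and the sample rows are invented):

def merge_header_detail(rows):
    name, order_date = None, None               # stand-ins for the stage variables
    for rec in rows:
        if rec["RecType"] == "A":               # header row: remember its values
            name, order_date = rec["Name"], rec["OrderDate"]
        else:                                   # detail row: write it out with the header info attached
            yield dict(rec, Name=name, OrderDate=order_date)

orders = [
    {"RecType": "A", "OrderNum": 1, "Name": "Acme", "OrderDate": "2011-01-01"},
    {"RecType": "B", "OrderNum": 1, "Item": "bolt"},
    {"RecType": "B", "OrderNum": 1, "Item": "nut"},
]
for out in merge_header_detail(orders):
    print(out)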
Checkpoint
1. Which property determines the degree to which a buffer offers resistance to new rows?
2. Name two stages that must read the entire set of input records before outputting a single record.
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Describe how buffering works in parallel jobs
Tune buffers in parallel jobs
Avoid buffer contentions
Unit objectives
After completing this unit, you should be able to:
Describe virtual data sets
Describe schemas
Describe data type mappings and conversions
Describe how external data is processed
Handle nulls
Work with complex data
External data
External data
Conversion
Conversion
Notes:
Parallel operators process in-memory data sets. For data that exists outside of DataStage (external) data, conversions are performed. These conversions can involve data type conversions, recordization (breaking up the data block into individual records), and columnization (breaking up the records into individual fields).
Data sets
Structured internal representations of data within the parallel framework. Consist of:
Schema: Describes the format of records and columns.
Data: Partitioned according to the number of nodes.
Virtual data sets are in memory. Correspond to job links.
Persistent data sets are stored on disk.
Descriptor file: Lists schema, configuration file, data file locations, flags.
Multiple data files: One per node. Stored in disk resource file systems.
Notes:
Data sets are structured internal representations of data. They include a schema, which describes the format of records and columns, and the data.
Example schemas
Column name: data type
Record properties
Notes:
The schema includes all the properties found in a table definition, including extended properties. The schema data types are C++ types. The extended properties are in brackets following the data type. Notice also the record-level property, which lists the record delimiter and the column delimiter.
Type conversions
DataStage provides conversion functions between input and output data types. Default type conversions:
Examples: int -> varchar, varchar -> char, char -> varchar
Generally what makes sense (see chart on next page)
Not defaulted: char -> date, date -> char. Variable to fixed-length string conversions are padded, by default, with ASCII Null (0x0) characters. Use $APT_STRING_PADCHAR to change the default.
Warnings are issued for default conversions with potentially unexpected results
For example, varchar(100) -> varchar(50)
Truncation may occur
Notes:
When an input column is mapped to an output column of a different type, a data type conversion occurs. Some of these are default conversions that occur automatically. Other conversions must be explicitly specified in, for example, a Transformer derivation. The default pad character is in keeping with the C++ end-of-string character (ASCII Null (0x0)).
[Type conversion chart: a matrix with the source framework types int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, and timestamp along one axis and the same destination types along the other; each cell is marked d (default conversion available), e (explicit conversion required), or de (both). The matrix does not reproduce cleanly here.]
Notes:
You can use this chart as a reference. In the chart, e means that you need to explicitly define the conversion. d means a default conversion is available. Note that this chart uses framework data types, not DataStage GUI type names.
A format specifier is only applicable for some conversions, for example, date conversions. The input column name is in parentheses. Example: converting a string to a date:
OrderDate:date = date_from_string [%mm/%dd/%yyyy] (inDate)
Format specifier
Notes:
The Modify stage can be used to perform type conversions. It generates the modify operator in the OSH. The modify operator is also inserted by DataStage as necessary into the OSH and Score to perform required conversions.
Manual conversions
COBOL files
Import / export operators are used to perform the conversions
Converting with source stage: import. Converting with target stage: export.
Notes:
External data can come from many sources. Some can be converted automatically, for example, relational data. Other data requires a manual conversion. In the case of a sequential file, the Sequential File stage is used to import the data from the file into the internal framework format.
Notes:
In the Sequential File stage Format and Columns tabs, you specify how you want the data converted. The GUI column data types are SQL types. A schema is generated from the table definition, which the import operator uses to perform the conversion.
Notes:
COBOL files are another source of external data that needs to be converted. The Complex Flat File (CFF) stage is used for this purpose. The Complex Flat File stage supports complex data types including arrays (OCCURS) and groups. The schema generated from the CFF stage includes complex framework types. Note the use of the subrec type, which corresponds to a COBOL group.
Oracle table, input columns Stage output columns, with schema types
Notes:
Database stages, such as the Oracle Connector stage, convert types automatically. The modify operator, which is inserted into the Score, is used to perform these conversions.
Time
Default string format: %hh:%nn:%ss
Timestamp
Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
Notes:
This slide lists the standard (non-complex) framework data types. These correspond to the standard SQL types.
Subrecord
A group or structure of elements. Elements of the subrecord can be of any type. Subrecords can be embedded.
Notes:
This slide lists the framework complex types. A vector is a one-dimensional array. All the elements in the array have to be of the same type. A subrecord is a group or structure of elements. The elements of the subrecord can be of any type.
subrecord
vector
Table definition with complex types: Authors is a subrecord; Books is a vector of three strings of length 5.
Notes:
On the Layout tab, you can view the metadata according to different views. In this way you can easily see how a COBOL file description (CFD) will be converted to a schema. Shown in this screenshot is the Parallel view schema that includes some complex types.
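As a loose analogy only (Python, not DataStage schema syntax), the Authors/Books definition above could be pictured like this; the field names are taken from the slide and everything else is invented for illustration:

from dataclasses import dataclass
from typing import List

@dataclass
class Author:                    # plays the role of a subrecord (a group of elements)
    FirstName: str
    LastName: str

@dataclass
class Publication:
    Title: str
    Authors: List[Author]        # subrecord elements can themselves be complex
    Books: List[str]             # vector: one-dimensional, all elements of the same type

p = Publication("Sample", [Author("Ada", "L")], ["ABCDE", "FGHIJ", "KLMNO"])
print(p)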
Vector
Notes:
Importing metadata from a COBOL copybook can generate these level structures for use in the Complex Flat File stage. In COBOL, the higher level numbers indicate a subrecord (group). The OCCURS is equivalent to a vector (array).
Columns can be loaded for each record type. On the Records ID tab, you specify how to identify each type of record. Columns from any or all record types can be selected for output.
This allows columns of data from multiple records of different types to be combined into a single output record
Notes:
The Complex Flat File (CFF) stage can be used to process data in a mainframe COBOL file. A COBOL file is described by a COBOL file description (CFD). COBOL copybooks with multiple record formats can be imported. In the stage separate table definitions can be loaded for each record type.
Notes:
This slide shows an example of a COBOL copybook. It has three record types as indicated by the call-outs.
Level 01 items
Notes:
In Designer, you can import CFD files. These will be converted into one or more table definitions. In this example, a single file contains three record types: CLIENT, COVERAGE, and POLICY. These correspond to the level 01 items in the CFD file.
Level numbers
Notes:
This shows the table definition for the CLIENT record type that was imported. The level numbers are preserved in the table definition indicating the column hierarchy.
COBOL layout
Notes:
In the table definition Layout tab, you can switch from the Parallel view to the COBOL and back. In the screenshot, PIC X(30) is a COBOL data type, equivalent to Char(30).
Notes:
You can specify date masks for columns that contain dates. Double-click to the left of the column number on the Columns tab to open the Edit Column Meta Data window. Select a field that contains date values. Then select the date mask that describes the format of the date from the Date format list. The SQL type is changed to Date. All dates are stored in a common format, which is described in project or job properties. By default, dates are stored in DB2 format.
Notes:
For clarity in this example, each record type has been placed on a separate line. Spaces have been added between fields. In practice, the records might follow each other immediately without being placed on a separate line. In the file used in the lab exercises, records follow each other immediately with a single record character, the pipe (|), separating them. In this example, client information is stored as a group of three types of records: CLIENT, POLICY, COVERAGE. There is one CLIENT record type which is the first record of the group. This can be followed by one or more POLICY records. Each POLICY record is followed by one or more COVERAGE records. Client Ralesh has two insurance policies. The first is for motor vehicles (MOT). He has two coverages under this policy. The second policy is for travel (TRA). He has one coverage under this policy.
CFF stage
Notes:
The Transformer in this job is used to split the data into multiple outputs streams. In the Transformer, a separate constraint is defined on each output link. Alternatively, the three output links with their constraints could have come directly from the CFF stage. The CFF stage supports multiple output links and constraints. A Transformer is required if derivations need to be performed on any columns of data.
Job parameter
Notes:
The CFF stage contains a number of tabs. This shows the File options tab. Here you can specify one or more files to be read.
Records tab
Active record type
Add another record type
Load columns for record type
Set as master
Notes:
Define each record type on the Records tab. Here we see that three record types have been defined. For each type, click the Load button to load the table definition that defines the type. To add another record type click a button at the bottom of the Records tab. Click the far right icon to set it as master. When a master record is read the output buffer will be emptied.
Record ID tab
Notes:
On the Record ID tab, you specify how to identify which type of record you are currently reading. The condition specified here says that if the RECTYPE_3 field contains a 3, then the record is a COVERAGE record. A constraint must be defined for each record type.
Selection tab
Notes:
After each record is read, a record will be sent out the output link. This tab is where you specify the columns of the output record. Notice that the output record can contain values from any or all of the record types. Since only a single record type is read at a time, only some of the output columns (those which get their values from the current record type) will receive values. The other columns will retain whatever value they had before or they will be empty. Whenever the master record is read, all columns are emptied before the new values are written. It is crucial to be aware that although each output record has all of these columns, not all of these columns will necessarily have valid data. When you process these records, for example, in a Transformer, you need to determine which fields contain valid data.
Notes:
On the Records options tab, you specify format information about the file records. Here, the file is described as a text file (rather than binary), as an ASCII file (rather than EBCDIC), and a file with records separated by the pipe (|).
Layout tab
COBOL layout
Notes:
The Layout tab is a very useful tab. It displays the length of the record (as described by the metadata), and the lengths and offsets of each column in the record. It is crucial that the metadata accurately describe the actual physical layout of the file. Otherwise, errors will occur when the file is read.
View data
Notes:
Click the View Data button to view the data. When you view the data, you are viewing the data in all the output columns. Notice that output columns for a given row can contain data from previous reads. For example, when the second record, which is a POLICY record, is read, the CLIENT columns are populated with data from the previous record, which was a CLIENT record. So you need to distinguish, usually within a Transformer, which columns contain valid data.
Notes:
Usually a CFF stage will be followed by a Transformer stage so that the different record types can be identified and processed. In this example, when the IsClient stage variable equals Y, then we know that the CLIENT columns contain valid data. When the IsPolicy stage variable equals Y, then we know that the POLICY columns contain valid data. When the IsCoverage stage variable equals Y, then we know that the COVERAGE columns contain valid data.
Transformer constraints
Notes:
These constraints ensure that a record is written out to the CLIENT output link only when the columns contain valid client information. And so on, for the POLICY and COVERAGE output links.
Nullability
Nullable data
Out-of-band: an internal data value marks a field as null.
Advantage: the value cannot be mistaken for a valid data value of the given type.
In-band: a specific user-defined field value indicates a null.
Disadvantage: must reserve a field value that cannot be used as valid data elsewhere.
Examples: for numeric fields, the most negative possible value; the empty string.
Modify stage:
Notes:
Nulls are categorized as out-of-band and in-band. The former is an internal data value that marks a field as null. The latter is a specific user-defined field value that indicates a null. The advantage of an out-of-band null is that it cannot be mistaken for a valid data value of the given type.
Source Field    Destination Field    Result
not_nullable    not_nullable         Source value propagates to destination.
not_nullable    nullable             Source value propagates to destination.
nullable        nullable             Source value or Null propagates.
nullable        not_nullable         WARNING messages in log. If source value is Null, a fatal error occurs. Must handle in Transformer or Modify stage.
Notes:
When mapping between source and destination columns of different nullability settings, there are four possibilities. The last case (nullable -> not_nullable) is the only case that creates a problem.
Null field representation can be any string, regardless of valid values for actual column data type
Notes:
Nulls can be written to and read from sequential files. In the file, some value or lack of a value (for example, indicated by two side-by-side column delimiters) means null. You can specify this in a Sequential File stage.
Notes:
This slide shows some examples of values you can specify. The null field representation can be any string, regardless of valid values for actual column data type.
Notes:
When you view the file data within DataStage, the word NULL is displayed by DataStage for null values, regardless of their actual value in the file.
K: integer A: varchar(20)
Notes:
A best practice, when using the Lookup stage, is to specify that the reference link key columns are nullable. This ensures that the Lookup stage assigns null values to non-key reference columns for unmatched rows. These lookup failure rows can then be identified in a Transformer following the Lookup stage.
Default values
What happens if non-key reference columns are not nullable?
Lookup stage assigns a default value to a row without a match. The default value depends on the data type. For example:
Integer columns default to zero. Varchar defaults to the empty string (not to be confused with Null). Char defaults to a fixed-length string of $APT_STRING_PADCHAR characters.
K: integer A: varchar(20)
Notes:
If non-key reference columns are not nullable then the Lookup stage assigns a default value to a row without a match. The default value depends on the data type. This makes it more difficult to identify whether the row is a lookup failure in subsequent stages such as a Transformer.
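A small Python analogy (not the Lookup stage API) of why nullable reference columns make lookup failures easier to detect downstream; the reference data and values are invented for illustration:

reference = {10: "Manager", 20: "Engineer"}    # key -> JOB_DESCRIPTION

def lookup(key, nullable=True):
    if key in reference:
        return reference[key]
    return None if nullable else ""            # "" is the varchar default, easy to confuse with real data

for key in (10, 99):
    print(key, repr(lookup(key, nullable=True)), repr(lookup(key, nullable=False)))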
Nullability in lookups
Continue. If lookup fails, returns Nulls in reference columns
Lookup non-key reference column: if nullable, returns Nulls; otherwise, returns an empty string.
Notes:
This slide shows the inside of the Lookup stage. Here the JOB_DESCRIPTION column has been set to nullable, so that null will be returned by the Lookup stage for a lookup failure. Note that the output column that this column is mapped to must also be nullable, or you will get a runtime error.
Right
Right
K: integer A: varchar(20)
Notes:
Like the Lookup stage, the Join stage can generate nulls when using outer joins.
Checkpoint
1. What type of files contain the metadata that is typically loaded into the CFF stage?
2. Does the CFF stage support variable length records?
3. What does it accomplish to select a record type as a master?
4. Which of the following conversions are automatic and which require manual conversions using a Transformer or Modify stage? integer --> varchar, date --> char, varchar --> char, char --> varchar, char --> date
5. Suppose the Lookup Failure option is "Continue". The reference link column is a varchar but not nullable. What values will be returned for rows that are lookup failures?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Describe virtual data sets
Describe schemas
Describe data type mappings and conversions
Describe how external data is processed
Handle nulls
Work with complex data
Unit objectives
After completing this unit, you should be able to:
Create a schema file
Read a sequential file using a schema
Describe Runtime Column Propagation (RCP)
Enable and disable RCP
Create and use shared containers
Schema file
Alternative way of specifying column definitions and record formats
Similar to a table definition
Written in a plain text file. Can be imported as a table definition. Can be created from a table definition. Can be used in place of a table definition in a Sequential File stage.
Requires Runtime Column Propagation (RCP). The schema file path can be parameterized.
Enables a single job to process files with different column definitions
Notes:
The format of each line describing a column is: column_name:[nullability]datatype; Here column_name is the name that identifies the column. Names must start with a letter or an underscore (_) and can contain only alphanumeric or underscore characters. The name is not case sensitive and can be of any length. You can optionally specify whether a column is allowed to contain a null value or whether this would be viewed as invalid. If the column can be null, insert the word nullable. By default, columns are not nullable. You can also include the nullable property at record level to specify that all columns are nullable, and then override the setting for individual columns by specifying not nullable. For example: record nullable ( Age:int32; BirthDate:date; ) Following the nullability specifier is the C++ data type of the column.
Notes:
This slide lists several ways to create a schema file. Another good way of capturing a schema is to set the $OSH_PRINT_SCHEMAS environment variable and copy entries from the DataStage Director log.
7-5
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Importing a schema
KM4001.0
Notes:
Schemas can be imported from data sets, file sets, files on the DataStage Server system or your workstation, and from database tables.
7-6
Advanced DataStage v8
V5.4
Student Notebook
Uempty
Layout
Save schema
KM4001.0
Notes:
It is easy to create a schema file from an existing table definition. Open the table definition to the Layout tab. This displays the schema. Then right-click and select Save As. The file is saved on your client system.
7-7
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
To use a schema file to read from a sequential file, first add the Schema File optional property. Schemas can only be used when Runtime Column Propagation (RCP) is turned on in the stage. This is discussed later in this unit.
7-8
Advanced DataStage v8
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
7-9
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Benefits of RCP
Input values are implicitly mapped to output columns based on the column name
Job flexibility:
Job can process input files and tables with different column layouts
Component logic can apply to a single named column
All other columns flow through untouched
KM4001.0
Notes:
This slide describes RCP and lists some benefits of using it. The key feature of RCP is that when it is turned on columns of data can flow through a stage without being explicitly defined in the stage. The key benefit is a flexible job design.
V5.4
Student Notebook
Uempty
Job level
Stage level
Settings at a lower level override settings at a higher level
KM4001.0
Notes:
RCP can be enabled at the project level, the job level, or even the stage level.
7-11
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
In the Administrator client, you must set the Enable Runtime Column Propagation for Parallel Jobs property if you are to use RCP in the project at any level. Check the Enable Runtime Column Propagation for new links property (not recommended) to have it turned on by default.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
If RCP has been enabled for the project in Administrator, it can be turned on at the job level on the Job Properties General tab. This will turn it on for all stages in the job.
7-13
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Transformer
KM4001.0
Notes:
If RCP has been enabled for the project in Administrator, it can be turned on at the stage level. How this is done varies somewhat for different types of stages. Shown here are the Sequential File Stage and the Transformer stage.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
When RCP is turned off, every output column must have an input column explicitly mapped to it. Otherwise the job will not compile. In this example, this is indicated by the columns in red.
7-15
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
When RCP is turned on, output columns do not have to have input columns explicitly mapped to them. The job will compile. In this example, this is indicated by the columns not being in red. However, a runtime error will occur if no incoming columns match unmapped target column names.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a stage Columns tab) can get into the job. One way, shown here, is from columns previously defined in the job flow.
7-17
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
If the schema file path is parameterized, the schema used to read the files can change in different job runs
(Diagram: a sample row (33, Alvin, Ohio) flowing through stages whose defined columns vary between CustID, Name and CustID, Name, Address.)
Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a stage Columns tab) can get into the job. Another way is from a sequential file read with a Sequential File stage using a schema file.
V5.4
Student Notebook
Uempty
Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a stage Columns tab) can get into the job. Another way is by reading from a relational table using SELECT *.
7-19
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Shared Containers
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
Shared containers
Encapsulate job design components into a named stored container Provide named reusable job components stored in the Repository
Example: Apply stored Transformer business logic to convert dates from one format to another
KM4001.0
Notes:
Shared containers encapsulate job design components into a named stored container. In this way they provide named reusable job components stored in the Repository which can be inserted into jobs. Shared containers are inserted by reference: Changes made to the shared container outside of the job will apply to the job, although the job must be recompiled.
7-21
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Selected components
KM4001.0
Notes:
The easiest way to create a shared container is by selecting components from within an existing job. This also allows you to test the container at the time you build it.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
This shows the inside of a shared container. Input and Output stages are used in the shared container to provide an interface to links in the containing job it is added to. The container will only work in a job if there are input and output links in the job that can be validly mapped to the Input and Output stages of the container.
7-23
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
In this example shared container, the Transformer will process two columns named InDate1 and InDate2. Other columns will flow through by RCP. Two columns will need to match InDate1 and InDate2 by name.
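As an illustration of the kind of reusable logic such a container might hold (the link name and format strings below are assumptions, not taken from the lab files), a derivation that converts InDate1 from one string format to another could look like:

DateToString(StringToDate(InLink.InDate1, "%mm/%dd/%yyyy"), "%yyyy-%mm-%dd")

All other columns that arrive on the input link simply propagate through the container by RCP.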
V5.4
Student Notebook
Uempty
Shared Container
KM4001.0
Notes:
This shows the shared container in a job. If RCP is being used, input and output link columns are matched by name. If RCP is not being used, the number, order, and types of columns must match up.
7-25
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
After you link the shared container into the job, open the shared container. A shared container can contain job parameters. If they exist, they can be specified on the Stage tab. On the Inputs and Outputs tabs, map job links to the container links. Click Validate to validate the interface mapping compatibility.
V5.4
Student Notebook
Uempty
Notes:
This shows one type of interface where RCP is being used. So there are many input columns in the link mapped to the container. Two of the columns have to match InDate1 and InDate2. The other columns will flow through the container by RCP.
7-27
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Checkpoint
1. What are two benefits of RCP?
2. What can you use to encapsulate stages and links in a job to make them reusable?
KM4001.0
Notes:
Write down your answers here:
1.
2.
Unit summary
Having completed this unit, you should be able to:
Create a schema file
Read a sequential file using a schema
Describe Runtime Column Propagation (RCP)
Enable and disable RCP
Create and use shared containers
KM4001.0
Notes:
Unit objectives
After completing this unit, you should be able to:
Describe and set null handling in the Transformer
Use Loop processing in the Transformer
Process groups in the Transformer
KM4001.0
Notes:
Example:
Fname : Lname : Address1 : City : State : PostalCode -> Address
Here, Address is a stage variable or output column in a Transformer. All others are input columns
If Address1 or any other of the input columns is null in the current row being processed, then the row will be rejected
Set the Abort on unhandled null option to abort the job when a row with an unhandled null is processed by the Transformer
Add a reject link from the Transformer to capture rejected rows
This property is not compatible with the Abort on unhandled null option
Copyright IBM Corporation 2011
KM4001.0
Notes:
Before IS release 8.5, input rows processed with unhandled nulls were dropped or rejected by the Transformer stage. This behavior is called legacy null handling. With IS 8.5, this behavior can be turned on or off. With legacy behavior it is recommended that you add a reject link from the Transformer to capture rejected rows.
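For comparison, nulls can be handled explicitly in a derivation so that the row is not rejected. A minimal sketch, assuming an input link named lnk and the built-in IsNull and NullToValue functions:

If IsNull(lnk.Address1) Then "" Else lnk.Address1

or, more compactly, NullToValue(lnk.Address1, ""). A concatenation built entirely from handled values, such as NullToValue(lnk.Fname, "") : " " : NullToValue(lnk.Lname, ""), no longer contains an unhandled null and so is not rejected.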
Notes:
This is a legacy null processing example. Notice the reject link to capture rejected rows. Here we assume that the source file contains nulls in the FName and Zip columns. Suppose that a derivation for a Transformer stage variable contains Zip. And suppose that a derivation for a target column contains FName. It is expected that rows containing these nulls will be rejected.
8-5
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Stage variable
Notes:
This shows the inside of the Transformer stage. There is a nullable input column in the derivation of the stage variable and in the derivation for the output column.
8-6
Advanced DataStage v8
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
Open the Transformer Stage Properties window to specify legacy null handling.
8-7
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Results
First case: Input rows with nulls are rejected or dropped
Second case: When the Abort on unhandled null option is set, the job aborts
KM4001.0
Notes:
Now let us look at some results. In the first test, input rows with nulls are rejected or dropped as we see from messages in the job log. In the second test where the Abort on unhandled null option is set, the job aborts.
8-8
Advanced DataStage v8
V5.4
Student Notebook
Uempty
Here, Address is a stage variable or output column in a Transformer. All others are input columns
If Address1 or any other of the input columns is null in the current row being processed, then null will be written to the Address stage variable
Set the Abort on unhandled null option to abort the job when a row with an unhandled null is processed by the Transformer
Expressions containing nulls will not abort the job
They will evaluate to null
KM4001.0
Notes:
With IS release 8.5 non-legacy null handling, derivations involving unhandled nulls return null. This is true whether they are derivations for output columns or stage variables. What happens if you set the Abort on unhandled null option? Expressions containing nulls will not abort the job, since they are being handled by evaluating to null. However, be aware that nulls written to non-nullable output columns will abort the job. This is always true.
8-9
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
This shows the Non-legacy setting on the Transformer Stage Properties General tab.
V5.4
Student Notebook
Uempty
Results:
If output columns are nullable, no rows are rejected
If output columns are non-nullable, rows with nulls are dropped or rejected
Case: Legacy null processing not set; Abort on unhandled null is set
Result: The Transform operator aborts (thereby aborting the job) when a target column is non-nullable
Copyright IBM Corporation 2011
KM4001.0
Notes:
This slide summarizes the results for non-legacy null processing.
8-11
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
Transformer loop processing enables each row to be processed an indefinite number of times within a Transformer, each iteration producing separate output. Something similar can be done without a loop by using multiple output links, but then the number of output rows per input row is fixed by the number of output links. Loop processing has many uses, including the ability to process rows containing an indefinite number of values within a single column.
8-13
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
(Example data: an input Size column with values 12, 7, 8; the output repeats the Size value once per color of each product.)
Notes:
Here is an example of rows that have repeating columns. A product can have up to four colors. These colors are entered into the Color columns, and nulls are added if there are fewer than four colors. In this example, one row is output for each color of a product, and each output row contains just one of the colors.
V5.4
Student Notebook
Uempty
The Funnel stage collects the records into one output stream
Funnel stage
KM4001.0
Notes:
This example shows how this can be done using multiple output links. Each ColorN Transformer output link writes out a record based on the value of the ColorN column in the input row. The Funnel stage collects the records into one output stream.
8-15
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
Inside the Transformer, the constraint for each Color link checks whether the corresponding Color column contains a color (that is, whether the column is not null).
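A minimal sketch of such a constraint, assuming the input link is named lnk (the link name is hypothetical): the Color1 output link would use

Not(IsNull(lnk.Color1))

so that a row is written to that link only when the Color1 column actually contains a value; the Color2 through Color4 links test the corresponding columns.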
V5.4
Student Notebook
Uempty
Size
Colors Red | Blue | Yellow | Black Green | yellow Tan | Orange | Black
Requires the extra step of collecting all the output rows from the multiple links into a single stream
Copyright IBM Corporation 2011
KM4001.0
Notes:
The main limitation of the multiple output links solution is that it works only in cases where the maximum number of potential output rows is known. We can also imagine a similar case involving variable record formats. The variable record formats example is similar to that shown here, except that each of the colors is in a separate column. So the first row would have six columns, the second four, and the third three. How to read a file with multiple format records is discussed elsewhere in this course.
8-17
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Loop processing
For each row read into the Transformer, the loop condition is tested While the loop condition remains true, the loop variables are processed in order from top to bottom
After the loop variables are processed each output link is processed
If the output link constraint is true then process output columns
When the loop condition is false, the loop variables are not processed and no output rows are written out
KM4001.0
Notes:
Now let us see how this can be done with loop processing. For each row read into the Transformer, the loop condition is tested. While the loop condition remains true, the loop variables are processed in order from top to bottom.
V5.4
Student Notebook
Uempty
@ITERATION system variable holds a count of the number of times the loop has iterated, starting at 1
Reset to 1 when a new input row is read by the Transformer
If you can determine the number of iterations that are needed, then the loop condition can be specified as follows:
@ITERATION <= numNeededIterations
For a delimited list of items, the Count function can be used:
Count("Red/Blue/Green", "/") + 1 -> 3 items
KM4001.0
Notes:
Each loop has a loop condition. The loop continues while the loop condition remains true. It should become false after the input row has been fully processed. The @ITERATION system variable can be used in the condition.
8-19
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Loop variables
Executed in order from top to bottom
Similar to stage variables
Defined on the Transformer Stage>Loop Variables tab
Loop variables can be referenced in derivations for other loop variables and in derivations for output columns
KM4001.0
Notes:
Loop variables are similar to stage variables. Their derivations are executed in order from top to bottom.
V5.4
Student Notebook
Uempty
Use the Field function to extract the next color as you iterate through the list
Only one output link is needed
KM4001.0
Notes:
Here is a repeating columns solution using a loop. Notice that the job is simpler in overall design. This slide outlines the main steps of the solution.
8-21
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
The varColors stage variable is used to create a colors list by examining the Color columns in the input row. Then the number of colors in the list is counted. varColorCount contains the number of colors to process in the loop. This variable is used in the loop condition. During each loop iteration, the color in the list corresponding to the loop iteration is extracted and written out in a separate row.
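A minimal sketch of how the derivations might be laid out (the variable and link names are hypothetical, and the handling of missing colors is simplified):

varColors (stage variable): NullToValue(lnk.Color1, "") : "/" : NullToValue(lnk.Color2, "") : "/" : NullToValue(lnk.Color3, "") : "/" : NullToValue(lnk.Color4, "")
varColorCount (stage variable): Count(varColors, "/") + 1
Loop condition: @ITERATION <= varColorCount
varColor (loop variable): Field(varColors, "/", @ITERATION)
Output column Color derivation: varColor

A real solution would also avoid counting the empty entries produced by null Color columns; this sketch ignores that complication.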
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
8-23
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
LastRowInGroup(In.Col) function can be used to determine when the last row in a group is being processed
Transformer stage must be preceded by a Sort stage that sorts the data by the group key columns
KM4001.0
Notes:
Transformer group processing provides an alternative to using an Aggregator stage. It can add aggregation results to individual rows without using a Fork-Join job design. The LastRowInGroup(In.Col) function can be used to determine when the last row in a group is being processed.
V5.4
Student Notebook
Uempty
Use the SaveInputRecord() function to save the group records in the Transformer queue
They are saved so that the aggregation result can be added to each one before it is written out of the Transformer
Execute in a stage variable derivation
Returns the number of rows in the queue
Use LastRowInGroup(group_key_columns) to determine when the last row in the group is being processed
Execute in a stage variable derivation
Returns True (1) if the last row in the group is being processed; else returns False (0)
Requires a Sort stage before the Transformer
KM4001.0
Notes:
This slide lists the main steps of our example. In this example, the aggregation result will be added to each row.
8-25
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
This slide continues the list of the main steps of our example.
V5.4
Student Notebook
Uempty
List of IDs
Count
KM4001.0
Notes:
This slide shows the job and the expected results. For each customer row, the job adds a count of the number of customers in the same postal code and a list of the customer IDs in the same postal code. The Sort stage is required when using the LastRowInGroup(group_key_columns) function.
8-27
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
This slide lists the stage variables that will be used.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
This slide explains the derivations for each of the stage variables.
8-29
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
The loop condition uses the @ITERATION system variable. The GetSavedInputRecord() function retrieves the next row from the queue.
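Putting the pieces together, a minimal sketch of the pattern (the variable and column names are hypothetical; the lab solution may differ in detail):

svCount (stage variable): SaveInputRecord()
svLastRow (stage variable): LastRowInGroup(lnk.PostalCode)
svIDList (stage variable): accumulates lnk.CustID values, resetting when a new group starts
Loop condition: svLastRow And (@ITERATION <= svCount)
svNext (loop variable): GetSavedInputRecord()
Output columns: the queued row's columns plus svCount and svIDList

SaveInputRecord() returns the current queue size, so svCount always holds the number of saved rows; because the loop condition is false until the last row of a group arrives, nothing is written until the whole group has been queued.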
V5.4
Student Notebook
Uempty
Runtime errors
The number of calls to GetSavedInputRecord() must be equal to the number of calls to SaveInputRecord()
Runtime error if GetSavedInputRecord() is called before SaveInputRecord() is called
Runtime error if GetSavedInputRecord() is called three times but SaveInputRecord() was only called twice
Runtime error if SaveInputRecord() is called but GetSavedInputRecord() is never called
After GetSavedInputRecord() is called once, it must be called enough times to empty the queue before another call to SaveInputRecord()
Runtime error if there are only two iterations of the loop, each iteration calling GetSavedInputRecord(), but there are three or more records in the queue
A warning is written to the job log whenever a multiple of the loop warning threshold is reached
Set in the Transformer Stage properties Loop Variables tab or using the APT_TRANSFORM_LOOP_WARNING_THRESHOLD environment variable
Set to 10,000 by default
Applies both to the number of loop iterations and the number of records written to the queue
Jobs can be set to abort after a certain number of warnings in the Job Run Options window
KM4001.0
Notes:
A number of runtime errors are possible. This slide lists some of the main cases.
8-31
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
For example, only save rows with valid customer IDs (ID >= 10100)
For example, only save rows with valid postal codes (zip > 10000 and <= 99999)
Programming complication:
Be careful not to include data from invalid rows in group totals
KM4001.0
Notes:
One thing you can do in Transformer group processing that cannot be done using, for example, an Aggregator stage is to validate the rows before saving them in the queue. Invalid rows do not become part of the group summary.
V5.4
Student Notebook
Uempty
Checkpoint
1. What function can you use in a Transformer to determine when you are processing the last row in a group? What additional stage is required to use this function?
2. What function can you use in a Transformer to save copies of input rows?
3. What function can you use in a Transformer to retrieve saved rows?
KM4001.0
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Describe and set null handling in the Transformer
Use Loop processing in the Transformer
Process groups in the Transformer
KM4001.0
Notes:
Unit objectives
After completing this unit, you should be able to:
Create Wrapped stages
Create Build stages
Create new External Function routines
Describe Custom stages
KM4001.0
Notes:
Custom stages
A way of creating a new stage in C++ that compiles into a new framework operator
New operators are instantiations of the APT_Operator class
You define a new stage that invokes the custom operator
Property values are passed to the operator by the stage
C++ source is created, compiled, and linked outside of DataStage
Wrapper stages
Wrap an existing executable into a new custom stage
External functions
Define a new parallel routine (function) to use in Transformer stages
Specify input arguments
C++ function is created, compiled, and linked outside of DataStage
KM4001.0
Notes:
This slide lists and describes four ways the functionality of DataStage can be extended. Custom stages are beyond the scope of this course. They require low-level knowledge of C++ and the Framework class libraries.
9-3
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Wrapped Stages
KM4001.0
Notes:
Pipe-safe:
Can read rows sequentially
No random access to data
Copyright IBM Corporation 2011
KM4001.0
Notes:
You may improve performance of an existing legacy application that meets the requirements for parallelism by wrapping it and running it in parallel.
9-5
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Wrapped stages
KM4001.0
Notes:
Wrapped stages are treated as black boxes. DataStage has no knowledge of their contents, and no means of managing anything that occurs inside the wrapped stage.
9-6
Advanced DataStage v8
V5.4
Student Notebook
Uempty
Create and load table definitions that define the input and output interfaces
KM4001.0
Notes:
We will take a look at a Wrapped stage example. This example will wrap the UNIX ls command. It will have one property: the directory to be listed.
9-7
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
To create a new Wrapped stage, click the right mouse button over the Stage Types folder and then click New>Other>Parallel Stage Type (Wrapped). Specify the new stage type and the command to be executed.
9-8
Advanced DataStage v8
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
On the Interfaces tab, specify input and output interfaces. This is done by selecting an existing table definition that specifies the expected input or output columns and their types.
9-9
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
On the Properties tab you can specify stage properties. This stage has one property named Dir, the directory to be listed. Here we want to run, for example, ls c:/KM400Files, so we choose the conversion property Value Only. If you chose -Name Value, the following would be executed: ls -Dir c:/KM400Files. This is not proper syntax for the ls command, so the job would abort.
V5.4
Student Notebook
Uempty
Wrapped stage
KM4001.0
Notes:
This shows a job with Wrapped stage. The stage functions as any other stage functions. Also shown here are some sample results of running the job.
9-11
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
Build Stages
KM4001.0
Notes:
9-13
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Build stages
Work like existing parallel stages
Extend the functionality of parallel jobs
Can be used in any parallel jobs
Coded in C++
Predefined macros can be used in the code
Predefined header files make additional framework classes and class functions available
Documentation
Parallel Job Advanced Developer Guide: Specifying Your Own Parallel Stages
KM4001.0
Notes:
Build stages, like Wrapped stages, work like existing parallel stages and extend the functionality of parallel jobs. They differ from Wrapped stages in that their functionality is created in DataStage using C++ code.
V5.4
Student Notebook
Uempty
Build stage
KM4001.0
Notes:
This shows a job with a Build stage. The stage functions as any other stage functions. The Copy stage here is used to change column names so that they match the Build stage's expected interface.
9-15
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
The Build stage window is similar to the Wrapped stage window. On the General tab you provide the stage type name and the name of the operator that will be generated by the stage.
V5.4
Student Notebook
Uempty
Input / output
Transfer method
Code
KM4001.0
Notes:
This slide lists the main tasks that need to be done in the Build stage: Specify properties, define input/output interfaces, specify the transfer method, and write the C++ code.
9-17
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
auto / noauto
auto / noauto
Build
Properties
Interface input / output fields are defined by table definitions
C++ variables, includes added to Definitions tab
C++ code added to Pre-Loop, Per-Record, Post-Loop of Build tab
Copyright IBM Corporation 2011
KM4001.0
Notes:
On the left is the input, virtual data set whose rows will be read. Its schema provides the names and types of the input field values. On the right is the output, virtual data set whose rows will be written. Its schema provides the names and types of the output field values.
The Build stage has an input interface consisting of one or more ports along with their schemas. This interface is specified by means of Table Definitions referenced when building the stage, one Table Definition for each input port. The Build stage also has an output interface consisting of one or more ports along with their schemas. This interface is specified by means of Table Definitions referenced when building the stage.
There are three ways for data to move across or through the Build stage:
(1) Code assignments. Fields enumerated in the output interface can be assigned values. These values can be based on values in referenced input fields.
(2) Transfers. When a Transfer is specified (whether automatic or manual), the whole input record is copied to the output schema. The input record includes columns of values specified in the input interface as well as all the columns in the input data set schema. The transferred columns are added after the columns explicitly enumerated in the output interface.
(3) RCP. RCP functions like
a Transfer. RCP on an input port adds the whole input record to the input as a block of fields. RCP on an output port copies the whole input record to the output schema. RCP must be turned on for both the input and output. RCP is also incompatible with Transfer.
9-19
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
To define the interfaces, first create table definitions for each input and output link. Then specify whether you want the stage to automatically read and write records or you want to control this using input/output macros.
V5.4
Student Notebook
Uempty
This DataStage table definition defines the input interface to a Build stage
Provides column names and types
The input link to the Build stage must have columns with the same names and compatible types as the input interface columns
If necessary, use Copy stage to modify incoming field names
Copyright IBM Corporation 2011
KM4001.0
Notes:
This shows an example of an interface table definition. Choose C++ field data types for input interface. Otherwise, class function and operator signatures will not directly apply.
9-21
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Default port names are in0, in1, in order defined
Select table definition that defines the input fields
Required to reference specific input columns in the C++ code
KM4001.0
Notes:
RCP is an alternative to the mechanism of transfer for moving data not explicitly assigned to output columns through the stage. If it is turned on, then you cannot specify either automatic or manual transfers. Without some special reason, it should be turned off, since the transfer mechanism is more flexible.
V5.4
Student Notebook
Uempty
auto / no auto
Table definition
Default port names are out0, out1, in order defined
Select table definition that defines the output interface
KM4001.0
Notes:
Specifying the output interface is similar to specifying the input interface. Select table definition that defines the output interface.
9-23
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Transfer
Used to pass unreferenced input link columns through the Build stage
Input link columns are passed as a block to the output link
Auto transfers occur at the end of each iteration of the Per-Record loop code
Transfer macros are used in code to explicitly transfer records from input buffers to output buffers
DoTransfer(transfer_index): index is the integer of a defined transfer: 0, 1, ...
DoTransfersFrom(input)
DoTransfersTo(output)
TransferAndWriteRecord(output)
KM4001.0
Notes:
The transfer mechanism is used to pass unreferenced input link columns through the Build stage. Transfers can be done automatically by the stage or manually specified in the code.
V5.4
Student Notebook
Uempty
Defining a transfer
Definition order defines the transfer index: 0, 1, ...
Refer to ports by specified names or default names
Specify whether transfer is to be done automatically at the end of each loop
Specify type of transfer (separate or combined)
Copyright IBM Corporation 2011
KM4001.0
Notes:
Define the transfer on the Transfer tab. Specify the input port that is to be transferred to the specified output port. Also specify whether transfer is to be done automatically at the end of each loop.
9-25
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Anatomy of a transfer
(Diagram: input interface fields Qty, Price, TaxRate; the full input record OrderNum, ItemNum, Qty, Price, TaxRate is transferred as inRec.*; the output interface field Amount plus the transferred block outRec.* yield output columns OrderNum, ItemNum, Qty, Price, TaxRate, Amount.)
ReadRecs (auto or explicit) brings enumerated, input interface values into the input buffer. If a transfer is specified, then the whole input record also comes in as a block of values
Assignments in Per-Record move values to enumerated output fields
Transfers copy input fields as a block (inRec.*) to the output buffer
Duplicate columns coming from a transfer are dropped with warnings in the log
For example, if the input record contained a column named Amount, this would be dropped. Explicit assignments in the code to output interface columns take precedence over transferred column values
If RCP is enabled instead of a Transfer, the picture is the same. If neither Transfer nor RCP is specified, then inRec.* and outRec.* will not exist
Copyright IBM Corporation 2011
KM4001.0
Notes:
This slide looks at the anatomy of a transfer. ReadRecs (auto or explicit) brings enumerated, input interface values into the input buffer. If a transfer is specified, then the whole input record also comes in as a block of values. Transfers copy input fields as a block (inRec.*) to the output buffer. Duplicate columns coming from a transfer are dropped with warnings in log.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
Defining stage properties in a Build stage is similar to defining stage properties in a Wrapped stage. Specifying the conversion is required. Most of the time you should choose Name Value.
9-27
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Specifying properties
Property type
Default value
If data type is List, open the Extended Properties window to define the members
KM4001.0
Notes:
In a Build stage, you would generally choose -Name Value. The other options are not very useful since you are not invoking the operator created by the Build stage from the command line.
V5.4
Student Notebook
Uempty
Pre-Loop
Code executed once, prior to entering the Per-Record loop
Per-Record
Executed for each input record
Post-Loop
Code executed once, after exiting the Per_Record loop
KM4001.0
Notes:
The code is written on several different tabs depending on its purpose. This slide lists and describes the different tabs.
9-29
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Definitions tab
KM4001.0
Notes:
On the Definitions tab you define variables and specify any header files you want to include.
V5.4
Student Notebook
Uempty
Pre-Loop tab
Initialize variables
Code to be executed before input records are processed
This code is executed only once
KM4001.0
Notes:
On the Pre-Loop tab, you specify the code that is to be executed before input records are processed. This code is executed only once.
9-31
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Per-Record tab
Build macro
Qualified input column
Unqualified output column
Code to be executed for each input record read in
This code is executed once for each input record
Copyright IBM Corporation 2011
KM4001.0
Notes:
Most of your code will be written on the Per-Record tab. This is code to be executed for each input record read in. In this example, the code is C++ code. Notice the macros that are used in the code, for example endLoop().
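As an illustration only (the column names follow the transfer diagram earlier in this unit, and the exact lab code may differ), Per-Record code that fills the enumerated output column could be as simple as:

  // compute the output interface column from the qualified input columns
  Amount = in0.Qty * in0.Price * (1 + in0.TaxRate);

With automatic read/write and an automatic transfer defined, the stage reads the next record, runs this code, transfers the remaining input columns, and writes the output record on each iteration.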
V5.4
Student Notebook
Uempty
Post-Loop tab
Framework types
Property
Framework functions
Code to be executed after all input records are processed
This code is executed only once
KM4001.0
Notes:
On the Post-Loop tab you specify Code to be executed after all input records are processed. Notice in this example, the reference to a property, Debug. This property was defined on the Properties tab as shown earlier and can be referenced in the code.
9-33
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
Using the errorLog object is the only way to write error messages to the log (messages with yellow or red icons by default). Writing an error message does not abort the job. To abort the job, you can call the failStep() macro after writing an error message to the log. The message number is an index to the message. However, this is not relevant for Build stages, so you can choose any number.
V5.4
Student Notebook
Uempty
Build stage
KM4001.0
Notes:
This shows an example of a job using a Build stage. A Build stage functions as any other stage.
9-35
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Stage properties
KM4001.0
Notes:
This shows the inside of the Build stage in the job. Notice that the properties are displayed and edited in the same way as other stages.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
Build stages can have multiple input/ output ports. They are indexed 0, 1, 2, and so on. In macros you can specify which port you are reading the record from or which port you are writing the record to.
9-37
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Build Macros
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
Build macros
Informational
inputs() -> number of inputs
outputs() -> number of outputs
transfers() -> number of transfers
Flow Control
endLoop(): Exit Per-Record loop and go to Post-Loop code
nextLoop(): Read next input record
failStep(): Abort the job
Input/Output
Input / output ports are indexed: 0, 1, 2, ...
readRecord(index), writeRecord(index), inputDone(index)
holdRecord(index): suspends next auto read
discardRecord(index): suspends next auto write
discardTransfer(index): suspends next auto transfer
Transfers
Transfers are indexed: 0, 1, 2, ...
doTransfer(index): Do specified transfer
doTransfersFrom(index): Do all transfers from specified input
doTransfersTo(index): Do all transfers to specified output
transferAndWriteRecord(index): Do all transfers to specified output, then write a record
KM4001.0
Notes:
This slide lists most of the macros that are available to you and puts them into different categories depending on their functions.
9-39
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
In most cases it is simpler to let the Build stage automatically handle reading, writing, and transferring records. But for maximum control you can turn off this functionality and explicitly handle it yourself in the code.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
This slide lists and describes the macros you can use for reading. readRecord(0) reads a record from the first input link. Use readRecord(0) in the pre-Loop logic to bring in the first record. You can use the inputDone() macro to test whether the current readRecord(0) instance contains a genuine record to be processed.
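A minimal sketch of manual record handling, using only the macros listed in this unit and assuming a single input and output port with auto read/write turned off (the column assignment is hypothetical, and exact macro behavior should be checked in the Parallel Job Advanced Developer Guide):

  Pre-Loop:   readRecord(0);
  Per-Record: if (inputDone(0)) {
                  endLoop();
              } else {
                  Amount = in0.Qty * in0.Price;
                  writeRecord(0);
                  readRecord(0);
              }

The pre-loop read primes the first record; each Per-Record iteration tests for end of input, processes and writes the current record, and then reads the next one.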
9-41
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
IBM/InformationServer/PXEngine/include/apt_util
Classes of useful functions and macros
Automatically included in Build stages
No need to explicitly include them
APT_ prefix distinguishes utility objects from standard C++ objects
string.h
Defines string handling functions and operators
errlog.h
Functions for writing messages to the DataStage log
KM4001.0
Notes:
When you install DataStage, the APT framework and utility classes are installed. You can include the header files for these classes and then use any of the class functions in your code. The APT_ prefix distinguishes utility objects from standard C++ objects.
9-43
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
This slide lists some of the APT classes. It is beyond the scope of this course to look at this in detail. But you can open up the header files and study the functions that are available.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
This slide shows an APT_String Build stage example. In this example s1 and s2 are declared to be APT_String objects. This allows the + operator to be used for string concatenation, as well as the toLower and toUpper class functions to be used.
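A hedged sketch of what such Per-Record code might look like (the field names are hypothetical, and only the + concatenation operator mentioned above is relied on; check string.h for the exact toUpper/toLower signatures before using them):

  APT_String s1 = in0.FirstName;
  APT_String s2 = in0.LastName;
  // concatenate the two APT_String values into the unqualified output column
  FullName = s1 + " " + s2;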
9-45
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
External Functions
KM4001.0
Notes:
9-47
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Parallel routines
Two types:
External function
Returns a value: Use in Transformer derivations and constraints
External Before/After
Does not return values
Can be executed before/after a job runs or before/after a Transformer stage
Specify in Job Properties or Transformer Stage Properties
KM4001.0
Notes:
External function routines extend the functionality of a Transformer stage. There are two types. Our focus is on the external function type that can be used in the Transformer.
V5.4
Student Notebook
Uempty
Function returns Y if key words are found in the input string; else returns N
KM4001.0
Notes:
The function itself is coded outside of DataStage in the usual C++ way. In this example, keyWords returns a string (char*). It returns Y if it finds in the input parameter string (inString) any of the words listed in the code (hello, ugly). This is a simple example, but a function like it can serve a real business purpose. An enterprise may want a function that checks text for particular business names.
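A plain C++ sketch of what such a function might look like (this is an illustration compiled outside of DataStage into an object or shared library; the actual lab code may differ):

  #include <string.h>

  char* keyWords(char* inString)
  {
      // return "Y" if either key word occurs anywhere in the input string
      if (strstr(inString, "hello") != 0 || strstr(inString, "ugly") != 0)
          return (char*)"Y";
      return (char*)"N";
  }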
9-49
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Function returns Y if key words are found in the input string; else returns N This version of the function uses the APT_String class functions
Note that orchestrate.h is included. This file includes all Framework classes
KM4001.0
Notes:
This shows another example using class functions in the framework classes. The framework class is included at the top of the code. Note here that orchestrate.h is included. This file includes all Framework classes.
V5.4
Student Notebook
Uempty
Return type
KM4001.0
Notes:
Once you have coded the external function outside of DataStage you need to register it within the DataStage GUI. In this example, the C++ executable object file is referenced in the Library path box. The function return type (char*) is specified in the Return type box.
9-51
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Define all the input arguments
Only an input argument is defined in this example
KM4001.0
Notes:
As noted, the keyWords function has one input argument. Define this on the Arguments tab.
V5.4
Student Notebook
Uempty
New function
External functions are listed in the DSRoutines folder in the DataStage Expression Editor
Copyright IBM Corporation 2011
KM4001.0
Notes:
Once created, the external function is available in the Transformer in the DSRoutines folder.
9-53
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
KM4001.0
Notes:
V5.4
Student Notebook
Uempty
Checkpoint
1. What is a Wrapper stage? How does it differ from a Build stage?
2. What defines the input and output interfaces to Build and Wrapper stages?
3. True or false? External functions are C++ functions that are coded within the DataStage GUI.
KM4001.0
Notes:
Write your answers here:
9-55
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Unit summary
Having completed this unit, you should be able to:
Create Wrapped stages
Create Build stages
Create new External Function routines
Describe Custom stages
KM4001.0
Notes:
Unit objectives
After completing this unit, you should be able to:
Use Connector stages to read from relational tables
Use Connector stages to write to relational tables
Handle SQL errors in Connector stages
Use Connector stages with multiple input links
Optimize jobs that write to relational tables by separating inserts from updates
KM4001.0
Notes:
Overview
KM4001.0
Notes:
10-3
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Connector stages
Used to read and write to database tables
Types of Connector stages:
Individual databases: DB2, Oracle, Teradata, and so on
ODBC: connect to data sources using ODBC drivers
Data sources include databases and non-relational data sources such as files
Any source that has an ODBC driver available
Documentation
See the set of Connectivity Guides for each database type
KM4001.0
Notes:
Connector stages are used to read and write to database tables. This slide lists some of the types. Other database types are supported. This is a partial list of the main types. Other types include Informix, Sybase, SQL Server, and others.
V5.4
Student Notebook
Uempty
Write: parallel connections to the server
Supports bulk loading
Multiple input links can be used to write rows to multiple tables within the same unit of work
Can be used for lookups
Supports sparse lookups
You can create your own SQL or let the stage generate the SQL
Create your own SQL manually, using a tool outside of DataStage, or using SQL Builder
SQL Builder is accessible from within the stage and fully integrated with the stage
The Connector stage optionally generates SQL based on the table name and column definitions
KM4001.0
Notes:
Connector stages offer parallel support for both reading and writing. They also support bulk loading. You can create your own SQL or let the stage generate the SQL.
10-5
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Job parameters can be inserted into any properties
Required properties are visually identified
Properties are divided into two basic categories
Connection properties
Data Connection objects can be used to populate these properties
Usage properties
KM4001.0
Notes:
All Connector stages have the same look and feel and the same core set of properties. Some may include properties specific to the database type.
V5.4
Student Notebook
Uempty
Navigation Panel
Properties Columns
Test connection
View data
KM4001.0
Notes:
This slide shows the inside of the ODBC Connector stage and highlights some of its features. The Navigation panel provides a way of moving between different sets of properties. Click the stage icon in the middle to access the stage properties. Click a link icon to access properties related to that link.
10-7
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Connection properties
ODBC Connection Properties
Data source name or database name
User name and password
Requires a defined ODBC data source on the DataStage Server
Use Test to test the connection
Can Load Connection properties from a Data Connection object (discussed later)
Copyright IBM Corporation 2011
KM4001.0
Notes:
This slide discusses the connection properties in the Connector stage.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
This slide discusses the Usage properties in the Connector stage. It focuses on the Generate SQL properties.
10-9
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Deprecated stages
Enterprise stages: DB2 UDB Enterprise, Oracle Enterprise, Teradata Enterprise, and others
Plug-in stages: DB2 UDB API, DB2 UDB Load, Oracle OCI Load, Teradata API, Dynamic RDBMS, and others
Plug-in stages are stages ported over from Server jobs
Run sequentially
Invoke the DataStage Server engine
Cannot span multiple servers in grid or cluster configurations
Deprecated stages have been removed from the Palette but are still available in the Repository Stage Types folder
KM4001.0
Notes:
There are many, many database stages available. Many of these have been deprecated. That is, they have been replaced by the Connector stages that are the focus of this unit. In some cases, you may want to use one of the deprecated stages. Deprecated stages have been removed from the Palette but are still available in the Repository Stage Types folder.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
Database stages
All available database stages Connector stages
Stages in Palette
KM4001.0
Notes:
The DataStage Repository window displays all available database stages that are available in the Stage types>Parallel>Database folder. Not all of these stages are included in the default Designer Palette. You can customize the Palette to add additional stage types by dragging them from the Repository window to the Palette.
10-11
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
When the choice is not obvious, test the possibilities to see which yields the best performance
KM4001.0
Notes:
When reading data from a database, it is often possible to use either SQL or DataStage for some tasks. In these situations you should leverage the strengths of each technology.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
KM4001.0
Notes:
10-13
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Appropriate for dynamic source flows, where columns move through by RCP
Selected columns
Copyright IBM Corporation 2011
KM4001.0
Notes:
This slide lists some best practices using the Connector stages.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
Before/After SQL
Before SQL statements are executed before the stage starts processing
For example, create temporary table to write to
After SQL statements are executed after the stage finishes processing
For example, INSERT INTO the actual table the rows SELECTed FROM the temporary table
For example, delete the temporary table
Before SQL
Copyright IBM Corporation 2011
KM4001.0
Notes:
Before/After SQL can be used in Connector stages. Before SQL statements are executed before the stage starts processing. After SQL statements are executed after the stage finishes processing.
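As an illustration only (the table names and the DB2-style syntax are assumptions, not taken from the lab):

  Before SQL: CREATE TABLE STG_CUSTOMER AS (SELECT * FROM CUSTOMER) WITH NO DATA
  After SQL:  INSERT INTO CUSTOMER SELECT * FROM STG_CUSTOMER;
              DROP TABLE STG_CUSTOMER

The job itself then writes to the temporary STG_CUSTOMER table, and the After SQL moves the staged rows into the real table and removes the staging table.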
10-15
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Sparse lookups
By default, lookup data is loaded into memory
A sparse lookup sends individual SQL statements to the database for each input row
Expensive operation from a performance point of view
Appropriate when the number of input rows is significantly smaller than the number of reference rows (1:100 or more)
KM4001.0
Notes:
When a Connector stage is being used for a lookup, the Sparse lookup option is available. By default lookup data is loaded into memory. A sparse lookup sends individual SQL statements to the database for each input row. This is a very expensive operation from a performance point of view. It may be appropriate when you are dealing with huge lookup tables that cannot fit into memory.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
Table action
Append
Create
Replace: Drop the table if it exists; then create it
Truncate: Empty the table before writing to it
Can be parameterized
Create within the Connector stage
KM4001.0
Notes:
Connector stages offer several types of write operation, including bulk load.
10-17
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Select the property
Click the Use Job Parameter icon
Click New Parameter
Specify the parameter
Parameter name
Default value
KM4001.0
Notes:
All Connector properties can be parameterized. When creating a parameter for a list type property, it is best to create it in the Connector stage. To do this click New Parameter from the Use Job Parameter icon.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
Choose Insert then update or Update then insert based on the expected number of inserts over updates
For larger data volumes, it is often faster to identify insert and update data within the job and separate them into different Connector target stages
Copyright IBM Corporation 2011
KM4001.0
Notes:
The Connector stage offers two types of insert plus update (sometime called upsert) statements. For the Insert then update write mode, the insert statement is executed first. If the insert fails with a unique-constraint violation, the update statement is executed. The Update then insert is the reverse. Choose Insert then update or Update then insert based on the expected number of inserts over updates. For example, if you expect more updates than inserts, choose the latter.
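As a rough illustration of what the two generated statements look like (the table and column names are hypothetical; ORCHESTRATE.column is the placeholder syntax for input link columns):

  INSERT INTO CUSTOMER (CUSTID, NAME) VALUES (ORCHESTRATE.CUSTID, ORCHESTRATE.NAME)
  UPDATE CUSTOMER SET NAME = ORCHESTRATE.NAME WHERE CUSTID = ORCHESTRATE.CUSTID

With Insert then update, the INSERT runs first and the UPDATE runs only for rows whose insert fails with a unique-constraint violation.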
10-19
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Student Notebook
Commit interval
Auto commit
If Off (default), commits are made after the number of records specified by the Record count property are processed
If On, commits are made after each write operation
Record count
Number of rows before a commit
Default is 2000 rows
Must be a multiple of Array size
Record size
Auto commit
Copyright IBM Corporation 2011
KM4001.0
Notes:
This slide discusses how commits are handled.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
V5.4
Student Notebook
Uempty
Bulk load
Most Connector stages support bulk load
Insert/update uses database APIs
Allows concurrent processing with other jobs and applications
Does not bypass database constraints, indexes, triggers
The Load control set of properties is enabled to set utility-specific parameters
Load control
KM4001.0
Notes:
Most Connector stages support bulk load. Insert/update uses database APIs. Bulk load uses database-specific parallel load utilities. It can be significantly faster than insert/update for large data volumes.
Clean-up on failure
Notes:
In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table inaccessible (quiesced exclusive or load pending state). You can reset the target table to the normal mode by rerunning the job with the Clean-up on failure option turned on.
Notes:
Reject links can be added to Connector stages to specify conditions under which rows are rejected. If no conditions are specified, SQL errors abort the job. Conditions include SQL error and row not updated.
Reject link
Notes:
This example shows a Connector stage with a reject link. Connector stages can have multiple input links. (This is discussed later in this unit.) If there are multiple input links there can be multiple reject links.
Reject conditions
Abort condition
Notes:
Select the reject link in the Navigation panel to specify the reject link conditions and other properties.
Reject condition
Notes:
This slide shows the types of messages that will show up in the job log when reject conditions are specified. The top shows an insert error message. The bottom shows an update error message.
Notes:
Multiple input links write rows to multiple tables within the same unit of work. Reject links can be created for each input link based on SQL error or Row not updated conditions.
Record ordering
Notes:
In the Navigation panel you see reject links corresponding to each of the input links. Click the stage properties icon to specify the record ordering properties which apply to both input links.
Notes:
Data Connection objects can be used to store data connection property values in a named Repository object. They are similar to parameter sets. Passwords can be encrypted.
Notes:
In this example, Insert then update has been chosen for the Write mode.
Insert-Update Example
Separate links for updates and inserts
Notes:
For the Inserts link, just use the standard Insert write mode, or you can choose Insert then update. From a performance point of view this is equivalent to doing only inserts, since inserts are tried first and updates are performed only if the insert fails. For the Updates link, use Update, or Update then insert if you want to be safe.
Checkpoint
1. What is a sparse lookup?
2. How do you decide whether to use Insert then update or Update then insert write modes?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Use Connector stages to read from relational tables
Use Connector stages to write to relational tables
Handle SQL errors in Connector stages
Use Connector stages with multiple input links
Optimize jobs that write to relational tables by separating inserts from updates
Unit objectives
After completing this unit, you should be able to:
Use the XML stage to parse, compose, and transform XML data
Use the Schema Library Manager to import and manage XML schemas
Use the Assembly editor in the XML stage to build an assembly of parsing, composing, and transformation steps
XML stage
Use to parse, compose, and transform XML data
Located in Real Time folder in Designer Palette
Supports both input and output links
Supports multiple input and output links
Can have both or just one or the other
Notes:
The XML stage can be used to parse, compose, and transform XML data. It can also combine any of these operations. The XML stage is configured by creating an assembly. An assembly consists of a series of parse, compose, and transform steps.
Notes:
The Schema Library Manager is fully integrated with the XML stage but is also available in Designer outside the stage. Imported schemas can be organized into libraries.
Notes:
This slide shows the inside of the Schema Library Manager window. In this example, km400 is a schema library category (folder) used to organize the libraries. The km400 category contains one library named EmpDept. You can create new categories and libraries. Click the Import New Resource button to import a schema file.
Schemas
Describes the structure of an XML document
XML data is hierarchical
Contains objects within objects
Most data processed in DataStage jobs is flat (tabular)
Consists of rows
Each row consists of columns of single values
Example structure
Employees: list of employees
Employee:
Employee ID
Job title
Name
Gender
Birth date
Hire date
Work department
> Department number
> Department name
> Department location
Notes:
Schemas describe the structure of an XML document. XML data is hierarchical, but most data processed in DataStage jobs is flat. Input and output links from the XML stage are used to map the flat data to the hierarchical data and vice versa.
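A small Python sketch of the same idea, using hypothetical employee and department fields: the flat rows a link carries versus the nested shape an XML schema describes.

# Flat (tabular) rows, as DataStage links normally carry them:
rows = [
    {"emp_id": 10, "name": "Ann",   "dept_no": "A01", "dept_name": "Sales"},
    {"emp_id": 11, "name": "Bob",   "dept_no": "A01", "dept_name": "Sales"},
    {"emp_id": 12, "name": "Carla", "dept_no": "B02", "dept_name": "Finance"},
]

# The same data as a hierarchy (objects within objects), the shape an
# XML document described by a schema would have:
employees = [
    {
        "emp_id": r["emp_id"],
        "name": r["name"],
        "work_department": {"dept_no": r["dept_no"], "dept_name": r["dept_name"]},
    }
    for r in rows
]
print(employees[0]["work_department"]["dept_name"])   # Sales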
Schema file
PersonType has three elements
Employee type is an extension of PersonType Additional elements Additional attributes One of EmployeeType elements
Notes:
This slide shows an example of a schema file. Different objects and their elements are highlighted.
XML targets include: File, XML string to pass downstream, or LOB (Large OBject) to go into LOB-aware target field
For LOB targets, the last stage in the job must be a LOB-aware stage, such as the DB2 Connector or Oracle Connector
Document root: Select from the top-level elements of the schema library Mappings
Define how to create the target nodes
Target nodes can be either list nodes or content nodes (values)
Once a target list node is mapped its content nodes become available for mappings
Select item for mapping or use Auto map or enter constant value
Notes:
A composition step in the XML stage can be used to create an XML document from DataStage tabular data. The document created is based on a referenced schema library. The compositional step target does not have to be a file. It can also be an XML string passed downstream in the job or an LOB (Large OBject).
Compositional Job
Tabular data to be composed into an XML document comes from upstream sources
Output link is not needed if target is a file
Specify path to output directory and filename prefix
XML stage
Notes:
This shows an example of a compositional job. First, the data from two tables is joined. This becomes input to the XML stage. The XML stage composes an XML document from this data. In this example, the XML target is a file, so no output link is necessary. In the XML stage a path to the file is specified.
Stage properties
Assembly editor
Notes:
The XML stage has the same basic GUI as a Connector stage. However, most of the work is done in the Assembly editor which is invoked from the stage.
Open palette
Palette steps
Notes:
This shows the inside of the Assembly editor. Notice the Libraries tab at the top where you can invoke the Schema Library Manager. Click the Palette button to open the palette. The palette contains the list of steps that you can add to your assembly. The assembly steps are shown on the left. You can add any number of steps in any order. The Input Step and Output Step are always present and are always first and last, respectively. If there is no input or output link, then the corresponding step will be empty, but still present.
Input step
The output from one step becomes the input for the next step The input to this step is the link data
There can be multiple links
Notes:
The input to the Input step is the input link data. The input step maps this data into the stage. The mapped data then becomes available to the step following the Input step.
String or LOB
A result-string node will be created on Output tab
Select target
Notes:
Once you have added a step, for example, the Composer step shown here, you can open it. Inside there are tabs to edit based on the type of step. A Composer step has an XML Target tab. On this tab you specify the type of target. In this example, the target is a file or set of files. You specify the directory path and the prefix to use for the name when creating the XML file or files.
Notes:
Once you have added a step, for example, the Composer step shown here, you can open it. Inside there are tabs to edit based on the type of step. A Composer step has a Document Root tab. On this tab you browse a schema library for the root of the document you want to compose.
Menu of options
Notes:
The Validation tab exists in all the different types of steps. Select the type of validation and the action to take for exceptions. For Strict validation (default), the job fails with any violations. For Minimal validation, the job ignores violations although they are recorded in the log. Either of these types of validation can be modified using the menu lists available.
Notes:
On the Mappings tab, you specify mappings to the target document nodes. The target nodes can be list nodes or value nodes. The object mapped to the target node must be the same level of object. For example, the employee node is an object that contains a number of elements and attributes. You can map a source link object to this node because the link contains a number of columns. But you cannot map a single column to the employee node. In general, list nodes must be mapped to lists and value nodes must be mapped to values. Map the list nodes first. Then the elements it contains are available for mapping.
Employee elements
Notes:
This shows the XML document output produced in this example. Notice that the employee object attributes (for example, EmpNo) and elements (for example, dateOfBirth) have been populated with values.
Notes:
A parsing step can be used to convert XML hierarchical data into DataStage tabular data. It is the reverse of a compositional step. Here, we need to specify the format of the XML source that is to be flattened. We do this by referencing a schema library and choosing the document root.
Notes:
On the XML Source tab specify the type of source: file, string, or set of files.
Document Root
Selected root
Notes:
On the Document root tab, browse for the document root in a schema library.
Union: Combine two lists into a single list with a pre-defined structure
Switch: Split items in a list into one or more new lists
Notes:
There are several types of transformation steps that you can choose from. This slide lists and describes the types.
Notes:
In this example we are using the HJoin transformation step to join the data from the two source tables, much as you can do with the Join stage. The difference is that the result of the join is an XML hierarchical object, not a flat tabular object.
Notes:
On the Configuration tab, you select the parent list, the child list, and the key used to join the two lists together. In this example, the Department link provides the child list and the Employee link provides the parent list.
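The following Python sketch, using hypothetical Employee and Department data, shows the kind of result an HJoin step produces: child items nested under each parent item, rather than flat joined rows.

employees   = [{"emp_id": 10, "name": "Ann", "dept_no": "A01"},
               {"emp_id": 11, "name": "Bob", "dept_no": "B02"}]
departments = [{"dept_no": "A01", "dept_name": "Sales"},
               {"dept_no": "B02", "dept_name": "Finance"}]

# Index the child list by the join key, then nest matching children
# under each parent item instead of producing flat joined rows.
children_by_key = {}
for d in departments:
    children_by_key.setdefault(d["dept_no"], []).append(d)

joined = [dict(e, departments=children_by_key.get(e["dept_no"], []))
          for e in employees]

print(joined[0])
# {'emp_id': 10, 'name': 'Ann', 'dept_no': 'A01',
#  'departments': [{'dept_no': 'A01', 'dept_name': 'Sales'}]}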
Switch step
Categorize items in a list into one or more target lists based on a constraint. Constraints include:
isNull, Greater than, Equals, Compare, Like, and so on
A default target captures all items that fail to go to any of the other targets
Targets become output nodes from the step
List to categorize
Switch targets
Added targets
Notes:
Another type of transformation step you can create is a Switch step. It functions a little like a Transformer stage: it splits the data into one or more target lists based on constraints.
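A minimal Python sketch of the Switch logic (hypothetical items and constraints): each item goes to the first target whose constraint it satisfies, and anything that matches no constraint goes to the default target.

items = [{"prod": "P1", "qty": 0}, {"prod": "P2", "qty": 7}, {"prod": "P3", "qty": None}]

# Each target has a constraint; the first matching constraint wins.
targets = {
    "no_stock":  lambda item: item["qty"] == 0,
    "has_stock": lambda item: item["qty"] is not None and item["qty"] > 0,
}

output = {name: [] for name in targets}
output["default"] = []
for item in items:
    for name, constraint in targets.items():
        if constraint(item):
            output[name].append(item)
            break
    else:
        output["default"].append(item)   # failed every constraint

print({name: len(lst) for name, lst in output.items()})
# {'no_stock': 1, 'has_stock': 1, 'default': 1}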
Aggregate step
Aggregate one or more items in a list
Tasks:
Select the list
Add items and specify aggregation functions
Functions include: Sum, Max, Min, First, Last, Average, Concatenate
List to aggregate
Notes:
Another type of transformation step you can create is an Aggregate step. It functions a little like an Aggregator stage: it performs summary calculations over elements in a list.
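A minimal Python sketch of the idea, with hypothetical sales items and a few of the listed functions:

sales = [{"store": "S1", "amount": 100.0},
         {"store": "S1", "amount":  40.0},
         {"store": "S2", "amount":  75.0}]

# One aggregation function per selected item; here Sum, Max, and First.
amounts = [s["amount"] for s in sales]
result = {"sum_amount": sum(amounts),
          "max_amount": max(amounts),
          "first_store": sales[0]["store"]}
print(result)   # {'sum_amount': 215.0, 'max_amount': 100.0, 'first_store': 'S1'}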
Checkpoint
1. What three types of steps can be performed within an XML stage?
2. What three types of XML targets are supported?
Unit summary
Having completed this unit, you should be able to:
Use the XML stage to parse, compose, and transform XML data
Use the Schema Library Manager to import and manage XML schemas
Use the Assembly editor in the XML stage to build an assembly of parsing, composing, and transformation steps
Unit objectives
After completing this unit, you should be able to:
Design a job that creates a surrogate key source key file
Design a job that updates a surrogate key source key file from a dimension table
Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions
Notes:
The Surrogate Key Generator stage is used to create and update a surrogate key state file. There is one file per dimension table. The file stores the last used surrogate key integer for the dimension table.
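Conceptually, the state file simply remembers the last surrogate key handed out for that dimension. The sketch below illustrates the idea in Python; the real file format and locking are internal to DataStage, and the path is hypothetical.

import os, struct

STATE_FILE = "prod_dim.sk"          # hypothetical path; one file per dimension

def next_surrogate_key():
    # Read the last key used, hand out the next one, and write it back.
    last = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            last = struct.unpack("q", f.read(8))[0]
    new = last + 1
    with open(STATE_FILE, "wb") as f:
        f.write(struct.pack("q", new))
    return new

print(next_surrogate_key(), next_surrogate_key())   # 1 2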
Notes:
Without any links the stage is used just to create a state file for a dimension table.
Notes:
In this example, the stage is just creating a file. The path to the file to be created is specified.
Figure 12-6. Example job to update the surrogate key state file
Notes:
If there are links going into the Surrogate Key stage, as shown in this example, the stage can be used to update the state file based on the surrogate keys that already exist in the table.
Notes:
Here, the column that contains the surrogate keys needs to be indicated. The Source Update action in this case is Update.
Inserts new rows into the dimension table as required
Updates existing rows in the dimension table as required
Type 1 fields of a matching row are overwritten
Type 2 fields of a matching row are retained as history rows
A new record with the new field value is added to the dimension table and made the current record
Notes:
The Slowly Changing Dimensions (SCD) stage is designed for processing a star schema data warehouse. It is an extremely powerful stage that performs all the necessary tasks. It performs a lookup into a star schema dimension table to see whether the incoming row is an insert or an update, inserts new rows into the dimension table as required, and updates existing rows as required. It can perform both Type 1 and Type 2 updates.
Fact table
Notes:
Here is an example of a star schema data warehouse. This example is used in the lab exercise for this unit. The fact table is the center of the star schema. It contains the numerical (factual) data that is aggregated over to produce analytical reports covering the different dimensions. Non-numerical (non-factual) information is stored in the dimension tables. This information is referenced by surrogate key values in the fact table rows. This example star schema database has two dimensions. The StoreDim table stores non-numerical information about stores. Each store has been assigned a unique surrogate key value (integer). Each row stores information about a single store, including its name, its manager, and its business identifier (a.k.a. natural key, business key). The ProdDim table stores non-numerical information about products. Each row stores information about a single product, including its brand, its description, and its business identifier. Each row in the fact table references a single store and a single product by means of their surrogate keys. Why are surrogate keys used rather than the business keys? There are two major reasons. First, surrogate keys can yield better performance because they are
numbers rather than, possibly, long strings of characters. Second, it is possible for there to be duplicate business keys coming from different source systems. For example, the business key X might refer to bananas in Australia, but tomato soup in Mexico.

In this example, each row in the fact table contains a sales amount and units for a particular product sold by a particular store for some given period of time (not shown in this example). For simplicity, the time dimension has been omitted. A source record contains sales detail from a sales order. It includes information about the product sold and the store that sold the product. This information needs to be put into the star schema. The store information needs to go into the StoreDim table, the product information needs to go into the ProdDim table, and the factual information needs to go into the Facttbl table. Moreover, the record put into the fact table must contain surrogate key references to the corresponding rows in the StoreDim and ProdDim tables.

In this example, the Mgr field in the StoreDim table is considered a Type 1 dimension table attribute. This means that if a source record that references a certain store lists a different manager, then this is to be considered a simple update of the record for that store. The value in the source data replaces the value in the existing store record by means of a simple update to the existing record. Similarly, Brand is a Type 1 dimension table attribute of the ProdDim table.

In this example, the Descr field is a Type 2 dimension table attribute. Suppose a source data record contains a different product description for a given product than the current record for that product in the ProdDim table. The record in the ProdDim table is not simply updated with the new product description. The record is retained with the old product description but flagged as non-current, and a new record is created for that product with the new product description. This new record is flagged as current. The field that is used to flag a record as current or non-current is called the Current Indicator field. Two additional fields (called Effective Date and Expire Date) are used to specify the date range during which the description is applicable.
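The following Python sketch summarizes the Type 1 and Type 2 processing described above for the ProdDim example. The column names are hypothetical; in a real job the SCD stage and its Connector stages perform this work against the database.

from datetime import date

END_OF_TIME = date(9999, 12, 31)
prod_dim = []                      # rows of the ProdDim dimension table
next_sk = iter(range(1, 10**6))    # stand-in for the surrogate key state file

def process_source(prod_id, brand, descr, effective):
    current = next((r for r in prod_dim
                    if r["PRODID"] == prod_id and r["CURRENT"] == "Y"), None)
    if current is None:                                 # new product -> insert
        prod_dim.append({"PRODSK": next(next_sk), "PRODID": prod_id,
                         "BRAND": brand, "DESCR": descr, "CURRENT": "Y",
                         "EFFDATE": effective, "EXPDATE": END_OF_TIME})
    else:
        current["BRAND"] = brand                        # Type 1: overwrite in place
        if descr != current["DESCR"]:                   # Type 2: expire + new row
            current["CURRENT"], current["EXPDATE"] = "N", effective
            prod_dim.append({"PRODSK": next(next_sk), "PRODID": prod_id,
                             "BRAND": brand, "DESCR": descr, "CURRENT": "Y",
                             "EFFDATE": effective, "EXPDATE": END_OF_TIME})

process_source("X1", "Acme", "Banana, green",  date(2011, 6, 1))
process_source("X1", "Acme", "Banana, yellow", date(2011, 7, 1))   # Type 2 change
print(len(prod_dim), [r["CURRENT"] for r in prod_dim])              # 2 ['N', 'Y']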
Notes:
This shows the SCD job. It processes two dimensions so there are two SCD stages. For each SCD stage there is both a reference link to a Connector stage used for lookup into the dimension table and an output link to a Connector stage used to insert and update rows in the dimension table. It is important to note that both Connector stages access the same dimension table. That is, two stages are used to write to the same table.
Surrogate key field
Type 1 fields
Type 2 fields
Current Indicator field for Type 2
Effective Date, Expire Date for Type 2
Notes:
The SCD stage is designed like a wizard. There is a series of five fast path pages that guide you through the process. This slide lists and describes the five pages. The following slides go through each step.
Notes:
On the first page all you need to do is to select the output link from the SCD stage. Recall that there are two output links from the SCD stage. One goes to the Connector stage that updates the dimension table. The other goes out the SCD to the downstream stage, which in this case happens to be another SCD stage. The output link is the latter link, not the one that goes to the Connector stage.
Type 1 field
Notes:
On this fast path page, you select the purpose codes for the columns in the dimension table. Select Surrogate Key for the table column that contains the surrogate keys. Select Business Key for the table column that contains the natural or business key. Also map the field in the input record that contains the business key. This information is used for the lookup that determines whether the record is an insert or an update. For any type 1 updates select Type 1. For any type 2 updates select Type 2. Not all fields in the dimension table are required to have purpose codes or required to be updated. Choose one field as the Current Indicator field if you are performing type 2 updates. This field will be used to indicate whether the record is the currently active record or an historical record. You can use Effective Date and Expiration Date codes to specify the date range that a particular record is effective.
Notes:
On the Surrogate Key Management tab, specify the path to the surrogate key state file associated with this dimension table. You can specify the number of values to retrieve at each physical read of the file. The larger the block of numbers, the fewer reads are required and the better the performance. In this example, the path is in Windows format (rather than UNIX), which implies that the job is running on a Windows system. The surrogate key state files are always located on the DataStage Server system, never on the client system.
Functions used to calculate history date range
Figure 12-16. Dimension update specification
Notes:
On the Dim Update tab, you specify how updates are to be performed given the purpose codes you specified earlier. For the Surrogate Key column, invoke the NextSurrogateKey() function to retrieve the next available surrogate key from the state file or from the block of surrogate key values held in memory by the stage. For the Type 1 and Type 2 updates, map the columns from the source file that will be used to update the table columns. For the Current Indicator field, specify the values for current and non-current. In this example Y means current and N means non-current. For the Effective Date and Expiration Date fields, specify the functions or values that are to be used.
Output mappings
Notes:
On the Output Mappings tab, specify the columns that will be sent out of the stage. In this example, surrogate key (PRODSK) is output. The business key field is not retained because it is not used in the fact table. The sales fields are output because they will be processed in the next SCD stage which updates the STOREDim dimension table. The product columns are dropped because they are not used in the fact table.
Checkpoint
1. How many Slowly Changing Dimension stages are needed to process a star schema with 4 dimension tables?
2. How many Surrogate Key state files are needed to process a star schema with 4 dimension tables?
3. What's the difference between a Type 1 and a Type 2 dimension field attribute?
4. What additional fields are needed for handling a Type 2 slowly changing dimension field attribute?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Design a job that creates a surrogate key source key file
Design a job that updates a surrogate key source key file from a dimension table
Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions
Unit objectives
After completing this unit, you should be able to:
Describe overall job guidelines
Describe stage usage guidelines
Describe Lookup stage guidelines
Describe Aggregator stage guidelines
Describe Transformer stage guidelines
Performance
Resources
Restartability
Notes:
The first requirement for a job is that it meets the business requirements. The other requirements on this slide should be examined only after this first requirement is met.
Notes:
Resource usage can grow dramatically as the degree of parallelism increases. Ulimit restricts the number of processes that can be spawned and the amount of memory that can be allocated. Ulimit may prevent your parallel job from running. Reading everything into memory may not be possible with large amounts of data.
Notes:
Shared containers create operations that can be reused. If you use a shared container in a job that utilizes RCP, the shared container acts somewhat like a function call. The data not affected by the shared container automatically passes through the flow; only the data needed by the shared container is affected. Remember, a shared container may have many stages and will demand resources for all the processes hidden by the shared container. Landing intermediate results to a data set preserves the partitioning and is very efficient.
Resource utilization
Break a job up into smaller jobs requiring less resources
Performance
Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
Depends on processing requirements and ability to tune buffering
Notes:
The developer is responsible for making jobs restartable. DataStage does not have an automatic restart. Separate long-running processes from other processes.
The Do not checkpoint run property causes the stage activity to execute on every run
Notes:
If you use a job sequencer and one stage activity fails, the job sequence can be re-run, and it will start at the step that failed. It does not do any specialized processing, like rolling back rows from a database. If you want something like this you will need to build it into the job design.
Use $PROJDEF to pick up the default value at Administrator level when the job is run
Picks up the value current at Administrator level at the time the job is run
Notes:
When an environment variable is added as a job parameter its default value is added. Use $PROJDEF to pick up the default value at Administrator level when the job is run.
Parallel I/O:
Single file when Readers Per Node > 1
Multiple individual files
Reading with a file pattern, when $APT_IMPORT_PATTERN_USES_FILESET is turned on
Notes:
Sequential row order cannot be maintained when reading a file in parallel.
Reading a sequential file in parallel
The Number of Readers Per Node optional property can be used to read a single input file in parallel at evenly spaced offsets
Notes:
The readers per node can be set for both fixed and variable-length files.
Notes:
Round robin is the fastest way to partition data.
When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VarChar columns
Set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead
Notes:
You should specify the pad character because normally you do not want ASCII nulls as padding.
Environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer
$APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row Setting this value too low incurs a performance penalty!
Notes:
When DataStage issues a write, it writes to memory and assumes the record made it to disk. However, with operating system buffering, this may not be true if there is a hard crash. Setting the $APT_EXPORT_FLUSH_COUNT to 1 will guarantee the record is written to disk.
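A small Python sketch of the trade-off (the file name is hypothetical; DataStage handles this internally): flushing after every row is safest but slowest, flushing after larger batches is faster.

import os

flush_count = 1                      # analogous to $APT_EXPORT_FLUSH_COUNT=1

with open("target.txt", "w") as f:   # hypothetical target file
    for n, row in enumerate(["a", "b", "c"], start=1):
        f.write(row + "\n")
        if n % flush_count == 0:
            f.flush()                # push application/OS buffers...
            os.fsync(f.fileno())     # ...all the way to disk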
Lookup stage
Lookup stage runs in two phases:
Read all rows from reference link into memory; indexed by lookup key
Process incoming rows
Reference data should be small enough to fit into physical (shared) memory
For reference data sets larger than available memory, use the JOIN or MERGE stage
Lookup stage processing cannot begin until all reference links have been read into memory
Notes:
The Lookup stage always runs in two phases: First, it reads all rows from reference link into memory (until end-of-data), indexing by lookup key. Secondly, it processes incoming rows. Lookup processing cannot begin until data for all reference links have been read into memory.
Notes:
On SMP configurations, it is usually best to specify ENTIRE for lookup reference data partitioning. For clustered/GRID/MPP configurations you should consider a keyed (for example, Hash) partitioning method.
Header
HeaderRef
Src Detail
Out
Notes:
Because the Lookup stage cannot begin to process data until all reference data has been loaded, you should never create lookup reference data from a fork of incoming source data.
Lookup file sets can only be used as reference input link to a Lookup stage
The partitioning method and key columns specified when the Lookup file set was created will be used to process the reference data
Notes:
Lookups read data from the source into memory and create indexes on that data. If you are going to reuse data that does not change much, create a lookup file set because the indexes are saved with the data. Lookup file sets can only be read in a lookup reference, which limits their real-world use. No utilities can read a lookup file set, including orchadmin.
Within the Lookup stage editor, you cannot change the Lookup key column derivations
Key column names in the Lookup file set must match source key column names
Notes:
Beware when using Lookup File Set stages. Lookup file sets are specific to the configuration file used to create them.
Aggregator
Match input partitioning to Aggregator stage groupings
Use Hash method for a limited number of distinct key values (that is, limited number of groups)
Uses 2K of memory per group
Incoming data does not need to be pre-sorted
Results are output after all rows have been read
Output row order is undefined
Even if input data is sorted
Notes:
Because rows depend on each other, partitioning matters. Hash performs aggregations in memory and will build a table. Data does not need to be sorted.
Parallel
Sequential
Notes:
The Aggregator stage does not contain a sum all function, and by default it runs in parallel. This slide outlines the steps for summing over all rows in all partitions.
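In outline, the technique is a two-stage aggregation: a parallel Aggregator produces one partial sum per partition, and a second Aggregator running sequentially adds the partial sums together. A Python sketch of the idea:

from functools import reduce

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]   # rows spread across 3 nodes

# Step 1 (parallel): each partition computes its own partial sum.
partial_sums = [sum(rows) for rows in partitions]

# Step 2 (sequential): a single sequential aggregation collapses the
# partial sums into the grand total over all rows in all partitions.
grand_total = reduce(lambda a, b: a + b, partial_sums, 0.0)
print(partial_sums, grand_total)    # [6.0, 9.0, 6.0] 21.0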
Notes:
Transformer stages have a lot of overhead. You can reduce the overhead if you reduce the number of Transformer stages.
Notes:
Transformer stages have a lot of overhead. You can reduce the overhead by replacing Transformer stages with other stages that have less overhead. For example, use a Copy stage rather than a Transformer if all you want to do is rename some columns or split the output stream.
Modify stage
May perform better than the Transformer stage in some cases
Consider as a possible alternative for:
Non-default type conversions
Null handling
String trimming
Date / Time handling
Drawback of Modify stage is that it has no expression editor
Expressions are hand-coded
Can be used to parameterize column names
Only stage where you can do this
Notes:
The Modify stage can do many of the types of derivations the Transformer stage can do, but it has less overhead. On the negative side, it is not as user-friendly or maintainable. The specific syntax for Modify is detailed in the DataStage Parallel Job Developer's Guide.
Examples:
Portions of output column derivations that are used in multiple derivations Where an expression includes calculated constant values:
Use the stage variable initial value to calculate once for all rows
Where an expression is used as a constant (same value for every row read):
Set it as the stage variable initial value
Notes:
You can improve Transformer performance by optimizing Transformer expressions. This slide lists some ways.
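The main idea can be sketched in Python (hypothetical derivation): compute the constant part once, the way a stage variable initial value is evaluated once before any rows are processed, instead of re-evaluating it in every row's derivation.

import math

rows = [{"radius": r} for r in (1.0, 2.5, 4.0)]

# Slower shape: the constant sub-expression is re-evaluated for every row.
slow = [row["radius"] * (2 * math.pi) for row in rows]

# Stage-variable shape: compute the constant once ("initial value")
# and reuse it in each row's derivation.
CIRCUMFERENCE_FACTOR = 2 * math.pi          # evaluated once, before any rows
fast = [row["radius"] * CIRCUMFERENCE_FACTOR for row in rows]

assert slow == fast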
Simplifying Transformer expressions
Leverage built-in functions to simplify complex expressions
For example:
Original expression:
IF Link_1.ProdNum = "000" OR Link_1.ProdNum = "800" OR
   Link_1.ProdNum = "888" OR Link_1.ProdNum = "866" OR
   Link_1.ProdNum = "877" OR Link_1.ProdNum = "855" OR
   Link_1.ProdNum = "844" OR Link_1.ProdNum = "833" OR
   Link_1.ProdNum = "822" OR Link_1.ProdNum = "900"
THEN "N" ELSE "Y"
Simplified expression:
IF index('000|800|888|866|877|855|844|833|822|900', Link_1.ProdNum, 1) > 0 THEN "N" ELSE "Y"
Notes:
Simplifying Transformer expressions can save some time.
Notes:
Build stages provide a lower-level method to build framework components. You may be able to accomplish what a Transformer is doing in a Build stage and get better performance. Only replace those Transformers that are bottlenecks. Build stages are much more difficult to maintain.
Notes:
Default internal decimal variables are precision 38, scale 10. You can change these defaults using the environment variables described on this slide.
ceil: round up. 1.4 -> 2, -1.6 -> -1
floor: round down. 1.6 -> 1, -1.4 -> -2
round_inf: round to nearest integer, up for ties. 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero: discard any fractional digits to the right of the rightmost fractional digit supported. 1.56 -> 1.5, -1.56 -> -1.5
Notes:
Use $APT_DECIMAL_INTERM_ROUND_MODE to specify decimal rounding.
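For reference, the four modes behave like the following Python sketch built on the decimal module. This is an illustration of the rounding rules listed above, not the engine's implementation.

from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

def apply_mode(value, mode, digits=0):
    q = Decimal(1).scaleb(-digits)           # e.g. digits=1 -> quantize to 0.1
    return Decimal(str(value)).quantize(q, rounding=mode)

print(apply_mode(1.4,   ROUND_CEILING))   # ceil        ->  2
print(apply_mode(-1.6,  ROUND_CEILING))   # ceil        -> -1
print(apply_mode(1.6,   ROUND_FLOOR))     # floor       ->  1
print(apply_mode(1.5,   ROUND_HALF_UP))   # round_inf   ->  2  (ties away from zero)
print(apply_mode(-1.5,  ROUND_HALF_UP))   # round_inf   -> -2
print(apply_mode(1.56,  ROUND_DOWN, 1))   # trunc_zero  ->  1.5
print(apply_mode(-1.56, ROUND_DOWN, 1))   # trunc_zero  -> -1.5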
Notes:
Use the Abort After Rows property in the Transformer constraints to conditionally abort a job. Here, the constraint describes the condition that you are measuring to determine whether to abort. These might, for example, be rows that contain out-of-range values.
Notes:
The number of rows that should have been written to the CUSTS table is written to the CUSTS_Log sequential file, because the same number of rows that go down the ToCount link go down the CUSTS link. The database may reject some rows; in that case, the number in the log will be greater than the number in the table. So the CUSTS_Log file provides a check. On the next slide we see that the purpose of the Transformer in this job is to conditionally abort the job.
Notes:
Here, we are conditionally aborting the job when the number of rows going down the Rejects link reaches 50 (in any given partition).
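In outline, the Transformer is doing something like the following Python sketch (the reject condition is hypothetical): count the rows that satisfy the reject constraint in each partition and abort once the Abort After Rows threshold is reached.

ABORT_AFTER_ROWS = 50

class JobAbort(Exception):
    pass

def transformer(partition_rows):
    rejects = 0
    for row in partition_rows:
        if not row.get("valid", True):        # the reject-link constraint
            rejects += 1
            if rejects >= ABORT_AFTER_ROWS:   # Abort After Rows reached
                raise JobAbort("too many rejects in this partition")
        else:
            yield row                         # row continues down the output link

rows = [{"valid": True}] * 10 + [{"valid": False}] * 60
try:
    list(transformer(rows))
except JobAbort as e:
    print("job aborted:", e)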
Checkpoint
1. What effect does using $PROJDEF as a default value of a job parameter have?
2. What optional property can you use to read a sequential file in parallel?
3. What is the default partitioning method for the Lookup stage?
Notes:
Write your answers here:
Unit summary
Having completed this unit, you should be able to:
Describe overall job guidelines
Describe stage usage guidelines
Describe Lookup stage guidelines
Describe Aggregator stage guidelines
Describe Transformer stage guidelines