Sei sulla pagina 1di 530

IBM Information Server 8.

x DataStage / QualityStage Fundamentals


ValueCap Systems

ValueCap Systems - Proprietary

Introduction
What is Information Server(IIS) 8.x Suite of applications that share a common repository Common set of Application Services (hosted by WebSphere app server) Data Integration toolset (ETL, Profiling and data quality) Employs scalable parallel processing Engine Supports N-tier layered architecture Newer version of data integration/ETL tool set offered by IBM Web Browser Interface to manage security and authentication

ValueCap Systems - Proprietary

Product Suite
IIS Organized into 4 layers
Client: Administration, Analysis,Development, and User Interface Metadata Repository: Single repository for each install. Can reside in DB2, Oracle or SQL Server database. Stores configuration, design and runtime metadata. DB2 is supplied database. Domain: Common Services. Requires WebSphere Application Server. Single domain for each install. Engine: Core engine that run all ETL jobs. Engine install includes connectors, packs, job monitors, performance monitors, log service etc.,

Note : Metadata Repository , Domain and Engine can reside in either same server or separate server. Multiple engines can exist in a single Information Server install.

ValueCap Systems - Proprietary

Detailed IS Architecture
Client layer
DataStage & QualityStage Client Admin Console Client Reporting Console Client Information Analyzer WebSphere Business Glossary Fast Track Metadata Workbench

Domain layer

WebSphere Application Server

IADB (Profiling)

Metadata Repository layer

Metadata Services IBM WebSphere Metadata Server

Import/Export Manager

External Data Sources (Erwin, Cognos)

Engine layer

Metadata DB

DataStage & QualityStage Engine

ValueCap Systems - Proprietary

Information Server 8.1 Components


Core Components:
Information Analyzer profiles and establishes an understanding of source systems, and monitors data rules on an ongoing basis to eliminate the risk of proliferating incorrect and inaccurate data. QualityStage standardizes and matches information across heterogeneous sources. DataStage extracts, transforms and loads data between multiple sources and targets. Metadata Server provides unified management, analysis and interchange of metadata through a shared repository and services infrastructure. Business Glossary defines data stewards and creates and manages business terms, definitions and relates these to physical data assets. Metadata Workbench provides unified management, analysis and interchange of metadata through a shared repository and services infrastructure. FastTrack Easy-to-use import/export features allow business users to take advantage of familiar Microsoft Excel interface and create new specifications. Federation Server defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching. Information Services Director enables information access and integration processes for publishing as reusable services in a SOA.

ValueCap Systems - Proprietary

Information Server 8.1 Components


Optional Components:
Rational Data Architect provides enterprise data modeling and information integration design capabilities. Replication Server provides high-speed, event-based replication between databases for high availability, disaster recovery, data synchronization and data distribution. Data Event Publisher detects and responds to data changes in source systems, publishing changes to subscribed systems, or feeding changed data into other modules for event-based processing. InfoSphere Change Data Capture Log-based Change Data Capture (CDC) technology, acquired in the DataMirror acquisition, detects and delivers changed data across heterogeneous data sources such as DB2, Oracle, SQL Server and Sybase. Supports service-oriented architectures (SOAs) by packaging realtime data transactions into XML documents and delivering to and from messaging middleware such as WebSphere MQ. DataStage Pack for SAP BW (DataStage BW Pack) The DataStage BW Pack is a companion product of the IBM Information Server. The pack was originally developed to support SAP BW and currently supports both SAP BW and SAP BI. The GUIs of the DataStage BW Pack are installed on the DataStage Client. The runtime part of the Pack is installed on the DataStage Server.

ValueCap Systems - Proprietary

IBM Information Server 8.x DataStage / QualityStage Fundamentals


ValueCap Systems

ValueCap Systems - Proprietary

Course Objectives
Upon completion of this course, you will be able to:
Understand principles of parallel processing and scalability Understand how to create and manage a scalable job using DataStage Implement your business logic as a DataStage Job Build, Compile, and Execute DataStage Jobs Execute your DataStage Jobs in parallel Enhance DataStage functionality by creating your own Stages Import and Export DataStage Jobs
ValueCap Systems - Proprietary

Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
ValueCap Systems - Proprietary

Page 10 Page 73

Page 116 Page 252 Page 309 Page 364

Page 420 Page 450 Page 477 Page 526

IS DataStage Overview
In this section we will discuss: Product History Product Architecture Project setup and configuration Job Design Job Execution Managing Jobs and Job Metadata

ValueCap Systems - Proprietary

10

Product History
Prior to IBMs acquisition of Ascential Software, Ascential had performed a series of its own acquisitions: Ascential started off as VMark before it became Ardent Software and introduced DataStage as an ETL solution Ardent was then acquired by Informix and through a reversal in fortune, Ardent management took over Informix. Informix was then sold to IBM and Ascential Software was spun out with approximately $1 Billion in the bank as a result. Ascential Software kept DataStage as its cash cow product, but started focusing on a bigger picture: Data Integration for the Enterprise

ValueCap Systems - Proprietary

11

Product History (Continued)


With plenty of money in the bank and a weakening economy, Ascential embarked upon a phase of acquisitions to fulfill its vision as becoming the leading Data Integration software provider.
DataStage Standard Edition was the original DataStage product and is also known as DataStage Server Edition. Server will be going away with the Hawk release later in 2006. DataStage Enterprise Edition was originally Orchestrate, which had been renamed to Parallel Extender after the Torrent acquisition. Valitys Integrity was renamed to QualityStage DataStage TX was originally known as Mercator and renamed when purchased by Ascential. ProfileStage was once Metagenixs Metarecon software
ValueCap Systems - Proprietary

12

Product History (Continued)


By 2004, Ascential had completed its acquisitions and turned its focus onto completely integrating the acquired technologies. Ascentials Data Integration Suite:
Service-Oriented Architecture Real-Time Integration Services and Event Management

DISCOVER Discover data content and structure ProfileStage

PREPARE Standardize, match, and correct data

TRANSFORM and DELIVER

Transform, enrich, and deliver data

QualityStage

DataStage

Parallel Execution Engine Meta Data Management Enterprise Connectivity

ValueCap Systems - Proprietary

13

Product History (Continued)


In 2005, IBM acquired Ascential. In November of 2006, IBM released Information Server version 8, which included WebSphere Application Server, DataStage, QualityStage, and other tools, some of which are part of the standard install, and some of which are optional: FastTrack Metadata Workbench Information Analyzer (formerly ProfileStage) WebSphere Federation Server and others.
ValueCap Systems - Proprietary

14

Old DataStage Client/Server Architecture


4 Clients, 1 Server
Client - Microsoft Windows NT/2K/XP/2003

Designer

Director

Administrator

Manager

DataStage Enterprise Edition Framework

DataStage Repository

Server WIN, UNIX (AIX, Solaris, TRU64, HP-UX, USS)


ValueCap Systems - Proprietary

15

New DataStage Client/Server Architecture


3 Clients, 1 (or more) Server(s)
Client - Microsoft Windows XP/2003/Vista

Clients now handle both DataStage and QualityStage


Designer Director Administrator

No more Manager client Common Repository can be on a separate server Default J2EE-compliant Application Server is WebSphere Application Server
16

DataStage Enterprise Edition Framework

Common Repository Application Server

Server WIN, UNIX (Linux, AIX, Solaris, HP-UX, USS)


ValueCap Systems - Proprietary

DataStage Clients: Administrator


DataStage Administrator Manage licensing details Create, update, administer projects and users Manage environment variable settings for entire project

ValueCap Systems - Proprietary

17

DataStage Administrator Logon

When 1st connecting to the Administrator, you will need to provide the following: Server address where the DataStage repository was installed Your userid Your password Assigned project

ValueCap Systems - Proprietary

18

DataStage Administrator Projects


Next, click on Add to create a new DataStage project. In this course, each student will create his/her own project In typical development environments, many developers can work on the same project.
C:\IBM\InformationServer\Projects\ANALYZEPROJECT

Project paths / locations can be customized


C:\IBM\InformationServer\Projects\

ValueCap Systems - Proprietary

19

DataStage Administrator Projects


Once a project has been created, it is populated with default settings. To change these defaults, click on the Properties button to bring up the Project Properties window.

C:\IBM\InformationServer\Projects\Sample

Next, click on Environment button


ValueCap Systems - Proprietary

20

DataStage Administrator Environment


This window displays all of the default environment variable settings, as well as the user defined environment variables.

Do not change any values for now

Click here when done


ValueCap Systems - Proprietary

21

DataStage Administrator Other Options


Useful options to set for all projects include: Enable job administration in Director this allows various administrative actions to be performed to jobs via the Director interface Enable Runtime Column Propagation for Parallel Jobs aka RCP a feature which allows column metadata to be automatically propagated at runtime. More on this later
ValueCap Systems - Proprietary

22

DataStage Clients: Designer


DataStage Designer Develop DataStage jobs or modify existing jobs Compile jobs Execute jobs Monitor job performance Manage table definitions Import table definitions Manage job metadata Generate job reports
ValueCap Systems - Proprietary

23

DataStage Designer Login

After login in you should see a similar screen:

ValueCap Systems - Proprietary

24

DataStage Designer Where to Start?


Open any existing Open a job that DataStage job you were recently working on

Select to create a new DataStage job

For majority of lab exercises, you will be selecting Parallel Job or using the Existing and Recent tabs.
ValueCap Systems - Proprietary

25

DataStage Designer Elements

Indicates Parallel Canvas (i.e. Parallel DataStage Job)

These boxes can be docked in various locations within this interface. Just click and drag around

The DataStage Designer user interface can be customized to your preferences. Here are just a few of the options

Icons can be made larger by right-clicking inside to access menu.

Categories can be edited and customized as well ValueCap Systems - Proprietary

26

DataStage Designer Toolbar

New Job

Save / Save All

Job Compile

Grid Lines

Snap to Grid

Open Existing Job

Job Properties

Run Job

Link Markers

Zoom In / Out

Some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.
ValueCap Systems - Proprietary

27

DataStage Designer Paradigm

Left-click and drag the stage(s) onto the canvas. You can also left-click on the stage once and then position your mouse cursor on the canvas and left-click again to place the chosen stage there.

ValueCap Systems - Proprietary

28

DataStage Designer Paradigm

To create the link, you can right-click on the upstream stage and drag the mouse pointer to the downstream stage. This will create a link as shown here. Alternatively, you can select the link icon from the General category in your Palette by left-clicking on it.

ValueCap Systems - Proprietary

29

DataStage Designer Design Feedback

When Show stage validation errors under the Diagram menu is selected (the default) DataStage Designer uses visual cues to alert users that theres something wrong. Placing the mouse cursor over an exclamation mark on a stage will display a message indicating what the problem is. A red link indicates that the link cannot be left dangling and must have a source and/or target attached to it.

ValueCap Systems - Proprietary

30

DataStage Designer Labels

You may notice that the default labels that are created on the stages and links are not very intuitive. You can easily change them by left-clicking once on the stage or link and then start typing a more appropriate label. This is considered to be a best practice. You will understand why shortly. Labels can also be changed by right-clicking on the stage or link and selecting the Rename option.

ValueCap Systems - Proprietary

31

DataStage Designer Stage Properties


Double-clicking on any stage on the canvas or right-clicking and selecting Properties will bring up the options dialogue for that particular stage. Almost all stages will require you to open and edit their properties and set them to appropriate values. However, almost all property dialogues follow the same paradigm.

ValueCap Systems - Proprietary

32

DataStage Designer Stage Properties

Heres an example of a fairly common stage properties dialogue box. The Properties tab will always contain the stage specific options. Mandatory entries will be highlighted red. The Input tab allows you to view the incoming data layout as well as define data partitioning (we will cover this in detail later). The Output tab allows you to view and map the outgoing data layout.
ValueCap Systems - Proprietary

33

DataStage Designer Stage Input


Once youve changed the link label to something more appropriate, it will make it easier to track your metadata. This is especially true if there are multiple inputs or outputs.

We will discuss partitioning in detail later

Another useful feature on the Input properties tab is the fact that you can see what the incoming data layout looks like.

ValueCap Systems - Proprietary

34

DataStage Designer Stage Output


On the Output tab, there is a Mapping tab and another Columns tab. Note that the columns are missing on the Output side. Where did they go? We saw them on the Input, right?

The answer lies in the Mapping tab. This is the Source to Target mapping paradigm you will find throughout DataStage. It is a means of propagating design-time metadata from source to target
ValueCap Systems - Proprietary

35

DataStage Designer Field Mapping

Source to Target mapping is achieved by 2 methods in DataStage: Left-clicking and dragging a field or collection of fields from the Source side (left) to the Target side (right). Left-clicking on the Columns bar on the Source side and dragging it into the Target side. This is illustrated above. When performed correctly, you will see the Target side populated with some or all of the fields from the Source side, depending on your selection.
ValueCap Systems - Proprietary

36

DataStage Designer Field Mapping

Once the mapping is complete, you can go back into the Output Columns tab and you will notice that all of the fields youve mapped from Source to Target now appear under the Columns tab. You may have also noticed the Runtime column propagation option below the columns. This is here because we enabled it in the Administrator. If you do not see this option, it is likely because it did not get enabled.
ValueCap Systems - Proprietary

37

DataStage Designer RCP


What is Runtime Column Propagation? Powerful feature which allows you to bypass Source to Target mapping At runtime (not design time), it will automatically propagate all source columns to the target for all stages in your job. What this means: if you are reading in a database table with 200 columns/fields, and your business logic only affects 2 of those columns, then you only need to specify 2 out of 200 columns and subsequently enable RCP to handle the rest.
ValueCap Systems - Proprietary

38

DataStage Designer Mapping vs RCP


So, why Map when you can RCP? Design time vs runtime consideration When working on a job flow that affects many fields, it is easier to have the metadata there to work with Mapping also provides explicit documentation of what is happening Note that RCP can be combined with Mapping
Enable RCP by default, and then turn it off when you only want to propagate a subset of fields. Do this by only mapping fields you need. It is often better to keep RCP enabled at all times, but be careful when you only want to keep certain columns and not others!
ValueCap Systems - Proprietary

39

DataStage Designer Table Definitions


Table Definitions in DataStage are the same as a table layout or schema. You can manually enter everything and these can be saved for re-use later
Specify location where the table definition is to be saved. Once saved, table definition can be accessed from the repository view.

ValueCap Systems - Proprietary

40

DataStage Designer Metadata Import


Table Definitions can also be automatically generated by translating definitions stored in various formats. Popular options include COBOL copybooks and RDBMS table layouts. RDBMS layouts can be accessed via a couple of options: ODBC Table Definitions Orchestrate Schema Definitions (via orchdbutil option) Plug-in Meta Data Definitions

ValueCap Systems - Proprietary

41

DataStage Designer Job Properties


The Parameters tab allows users to add environment variables both pre-defined and user-defined.

Once selected, it will show up in the Job Properties window. The default value can be altered to a different value. Parameters can be used to control job behavior as well as referenced within stages to allow for simple adjustment of properties without having to modify the job itself.
ValueCap Systems - Proprietary

42

DataStage Designer Job Compile/Run

Before a job can be executed, it must first be saved and compiled. Compilation will validate that all necessary options are set and defined within each of the stages in the job.

Compile

Run

To run the job, just click on the run button on the Designer. Alternatively, you can also click on the run button from within the Director. The Director will contain the job run log, which provides much more detail than the Designer will.
ValueCap Systems - Proprietary

43

DataStage Designer Job Statistics


As a job is executing, you can right-click on the canvas and select Show performance statistics to monitor your jobs performance. Note that the link colors signify job status. Blue means it is running and green means it has completed. If the link is red, then the job has aborted due to error.

ValueCap Systems - Proprietary

44

DataStage Designer Export

The Designer is also used for exporting and importing DataStage jobs, table Definitions, routines, containers, etc Items can be exported in 1 of 2 formats: DSX or XML. DSX format is DataStages internal format. Both formats can be opened and viewed in a standard text editor. We do not recommend altering the contents unless you really know what you are doing!.
ValueCap Systems - Proprietary

45

DataStage Designer Export

You can export the contents of the entire project, or individual components. You can also export items into an existing file by selecting the Append to existing file option. Exported projects, depending on the total number of jobs, can grow to be several megabytes. However, these files can be easily compressed.
ValueCap Systems - Proprietary

46

DataStage Designer Import

Previously exported items can be imported via the Designer. You can choose to import everything or only selected content. DSX files from previous versions of DataStage can also be imported. The upgrade to the current version will occur on the fly as the content is being imported into the repository.

ValueCap Systems - Proprietary

47

DataStage Clients: Director


DataStage Director: Execute DataStage jobs Compile jobs Reset jobs Schedule jobs Monitor job performance Review job logs

ValueCap Systems - Proprietary

48

DataStage Director Access


The easiest way to access the Director is from within the Designer. This will bypass the need to re-login again. Alternatively, you will have to double-click on the Director icon to bring up the Director interface.

ValueCap Systems - Proprietary

49

DataStage Director Interface


The Directors default interface shows a list of Jobs along with their status. You will be able to see if jobs are compiled, how long it took to run, and when it was last run.

ValueCap Systems - Proprietary

50

DataStage Director Toolbar


Open Project Job Scheduler Run Job Reset Job

Job Status View

Job Log

Some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear..
ValueCap Systems - Proprietary

51

DataStage Director Interface


Whenever a job runs, you can view the job log in the Director. Current entries are in black, whereas previous runs will show up in blue. Double-click on any entry to access more details. What you see here is often just a summary view
ValueCap Systems - Proprietary

52

DataStage Director Monitor


To enable job monitoring from within the Director, go to Tools menu and select New Monitor. You can set the update intervals as well as specify which statistics you would like to see. Colors correspond to status. Blue means it is running, green means it has finished, and red indicates a failure.

ValueCap Systems - Proprietary

53

IBM Information Server DataStage / QualityStage Fundamentals Labs


ValueCap Systems

ValueCap Systems - Proprietary

54

Lab 1A: Project Setup & Configuration

ValueCap Systems - Proprietary

55

Lab 1A Objective

Learn to setup and configure a simple project for IBM Information Server DataStage / QualityStage

ValueCap Systems - Proprietary

56

Creating a New Project


Log into the DataStage Administrator using the userid and password provided to you by the instructor. Steps are outlined in the course material.

Click on Add button to create a new project. Your instructor may advise you on a project name do not change the default project directory. Click OK when finished.
ValueCap Systems - Proprietary

57

Project Setup
Click on the new project you have just created and select the Properties button. Under the General tab, check the boxes next to:
Enable job administration in the Director Enable Runtime Column Propagation for Parallel Jobs

Next, click on the Environment button to bring up the Environment Variables editor.

ValueCap Systems - Proprietary

58

Environment Variable Settings


The Environment Variables editor should be similar to the screen shot shown here: We only need to change a couple of values
APT_CONFIG_FILE instructor will provide value. Click on Reporting and set APT_DUMP_SCORE to TRUE. Instructor will provide details if any other environment variable needs to be defined.
ValueCap Systems - Proprietary

59

Setting APT_CONFIG_FILE defines the default configuration file used by jobs in the project. Setting APT_DUMP_SCORE will enable additional diagnostic information to appear in the Director log. Click OK button when finished editing Environment Variables. Click OK and then Close to exit the Administrator. You have now finished configuring your project.

ValueCap Systems - Proprietary

60

Lab 1B: Designer Walkthrough

ValueCap Systems - Proprietary

61

Lab 1B Objective

Become familiar with DataStage Designer.

ValueCap Systems - Proprietary

62

Getting Into the Designer


Log into the DataStage Designer using the userid and password provided to you by the instructor. Be sure to select the project you just created when login in.

Once connected, select the Parallel Job option and click on OK. You should see a blank canvas with the Parallel label in the upper left hand corner.
ValueCap Systems - Proprietary

63

Create a Simple Job


We will construct the following simple job:

Use the techniques covered in the lecture material to build the job. Job consists of a Row Generator stage and a Peek stage.
For the Row Generator, you will need to enter the following table definition:

Alter the stage and link labels to match the diagram above.
ValueCap Systems - Proprietary

64

Compile and Run the Job


Save the job as lab1b Click on the Compile button.
Did the job compile successfully? If not, can you determine why not? Try to correct the problem(s) in order to get the job to compile.

Once the job has compiled successfully, right-click on the canvas and select Show performance statistics Click on the Job Run button. Once your job finishes executing, you should see the following output:

ValueCap Systems - Proprietary

65

Lab 1C: Director Walkthrough

ValueCap Systems - Proprietary

66

Lab 1C Objective

Become familiar with DataStage Director.

ValueCap Systems - Proprietary

67

Getting Into the Director


Log into the DataStage Director using the userid and password provided to you by the instructor. You can also use the shortcut shown in the course materials.

Once connected, you should see the status of lab1b, which was just executed from within the Designer:

ValueCap Systems - Proprietary

68

Viewing the Job Log


Click on the Job Log button on the toolbar to access the log for lab1b. The log should be very similar to the screenshot here:

There should not be any red (error) icons.


ValueCap Systems - Proprietary

69

Director Job Log


Take a closer look at some of the entries in the log. Double click on the following highlighted selections:

Also note the Job Status

First one shows the configuration file being used. The next few entries show the output of the Peek stage.
ValueCap Systems - Proprietary

70

Stage Output
The Peek stage output in the Director log should be similar to the following:

Peek stage is similar to inserting a Print statement into the middle of a program. Where did this data come from? The data was generated by the Row Generator stage! You will learn more about this powerful stage in later sections & labs.
ValueCap Systems - Proprietary

71

Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
ValueCap Systems - Proprietary

Page 10 Page 73

Page 116 Page 252 Page 309 Page 364

Page 420 Page 450 Page 477 Page 526

72

Parallel Framework Overview


In this section we will discuss: Hardware and Software Scalability Traditional processing Parallel processing Configuration File overview Parallel Datasets

ValueCap Systems - Proprietary

73

Scalability
Scalability is a term often used in product marketing but seldom well defined: Hardware vendors claim their products are highly scalable
Computers Storage Network

Software vendors claim their products are highly scalable


RDBMS Middleware
ValueCap Systems - Proprietary

74

Scalability Defined
How should scalability be defined? Well, that depends on the product. For Parallel DataStage : The ability to process a fixed amount of data in decreasing amounts of time as hardware resources (cpu, memory, storage) are increased Could also be defined as the ability to process growing amounts of data by increasing hardware resources accordingly.

ValueCap Systems - Proprietary

75

Scalability Illustrated
Linear Scalability: runtime decreases as amount of hardware resources are increased. For example: a job that takes 8 hours to run on 1 cpu, will take 4 hours on 2 cpus, 2 hours on 4 cpus, and 1 hour on 8 cpus. Poor Scalability: results when running time no longer improves as additional hardware resources are added. Super-linear Scalability: occurs when the job performs better than linear as amount of hardware resources are increased.

run time

poor scalability

Hardware Resources (CPU, Memory, etc)


(assumes that data volumes remain constant)

ValueCap Systems - Proprietary

76

Hardware Scalability
Hardware vendors achieve scalability by: Using multiple processors Having large amounts of memory Installing fast storage mechanisms Leveraging a fast back plane Using very high bandwidth, high speed networking solutions

ValueCap Systems - Proprietary

77

Examples of Scalable Hardware


SMP 1 physical machine with 2 or more processors and shared memory. MPP 2 or more SMPs interconnected by a high bandwidth, high speed switch. Memory between nodes of a MPP is not shared. Cluster more than 2 computers connected together by a network. Similar to MPP. Grid several computers networked together. Computers can be dynamically assigned to run jobs.

ValueCap Systems - Proprietary

78

Software Scalability
Software scalability can occur via: Executing on scalable hardware Effective memory utilization Minimizing disk I/O Data partitioning Multi-threading Multi-processing

ValueCap Systems - Proprietary

79

Software Scalability DS EE
Parallel DataStage achieves scalability in a variety of ways: Data Pipelining Data Partitioning Minimizing disk I/O In memory processing We will explore these concepts in detail!

ValueCap Systems - Proprietary

80

The Parallel Framework


The Engine layer consists, in large part, of the Parallel Framework (aka Orchestrate). The Framework was written in C++ and has a published and documented API DS/QS jobs run on top of the Framework via OSH OSH is a scripting language much like Korn shell The Designer client will generate OSH automatically Framework relies on a configuration file to determine level of parallelism during job execution.
ValueCap Systems - Proprietary

81

Parallel Framework

DataStage Job executes on the Framework at runtime

Configuration File Configuration File contains virtual map of available system resources.

Parallel Framework Framework will reference the Configuration File to determine the degree of parallelism for the job at runtime.

ValueCap Systems - Proprietary

82

Traditional Processing
Suppose we are interested in implementing the following business logic where A, B, and C represent specific data transformation processes:
file A B C
RDBMS

Manual implementation of the business logic typically results in the following:


file A B C
RDBMS Invoke loader

staging area:

disk

disk

disk

While the above solution works and eventually delivers the correct results, problems will occur when data volumes increase and/or batch windows decrease! Disk I/O is the slowest link in the chain. Sequential processing prohibits scalability

ValueCap Systems - Proprietary

83

Data Pipelining
What if, instead of persisting data to disk between processes, we could move the data between processes in memory?
file A B C
RDBMS Invoke loader

staging area:

disk

disk

disk

The application will certainly run faster simply because we are now avoiding the disk I/O that was previously present.
file A B C
RDBMS

This concept is called data pipelining. Data continuously flows from Source to Target, through the individual transformation processes. The downstream process no longer has to wait for all of the data to be written to disk it can now begin processing as soon as the upstream process is finished with the first record!
ValueCap Systems - Proprietary

84

Data Partitioning
Parallel processing would not be possible without data partitioning. We will devote an entire lecture to this subject matter later in this course. For now: Think of partitioning as the act of distributing records into separate partitions for the purpose of dividing the processing burden from one processor to many.
Data File Partitioner
Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000

ValueCap Systems - Proprietary

85

Parallel Processing
By combining data pipelining and partitioning, you can achieve what people typically envision as being parallel processing:

Input file A B C
RDBMS

In this model, data flows from source to target, upstream stage to downstream stage, while remaining in the same partition throughout the entire job. This is often referred to as partitioned parallelism.

ValueCap Systems - Proprietary

86

Pipeline Parallel Processing


There is, however, a more powerful way to perform parallel processing. We call this spaghetti pipeline parallelism.

Input file A B C
RDBMS

What makes pipeline parallelism powerful is the following: Records are not bound to any given partition Records can flow down any partition Prevents backup and hotspots from occurring in any given partition The parallel framework does this by default!
ValueCap Systems - Proprietary

87

Pipeline Parallelism Example


Suppose you are traveling from point A to point B along a 6 lane toll-way. Between the start and end points, there are 3 toll stations your car must pass through and pay toll.
During your journey, you will most likely change lanes. These lanes are just like partitions During your journey, you will likely use the toll station with the least number of cars Think about the fact that other cars are doing the same! Each car is like a record, toll stations are processes What would happen if you are stuck in a single lane during the entire journey?

This is a simple real-world example of pipeline parallelism!


ValueCap Systems - Proprietary

88

Configuration Files
Configuration files are used by the Parallel Framework to determine the degree of parallelism for a given job. Configuration files are plain text files which reside on the server side Several configuration files can co-exist, however, only one can be referenced at a time by a job Configuration files have a minimum of one processing node defined and no maximum Can be edited through the Designer or vi or other text editors Syntax is pretty simple and highly repetitive.

ValueCap Systems - Proprietary

89

Configuration File Example


Here is a sample configuration file which will allow a job to run 4 way parallel. The path will be different for windows installations.
{ node node_1" { fastname dev_server" pool " resource disk "/data/work" {} resource scratchdisk "/data/scratch" } node node_2" { fastname dev_server" pool " resource disk "/data/work" {} resource scratchdisk "/data/scratch" } node node_3" { fastname dev_server" pool " resource disk "/data/work" {} resource scratchdisk "/data/scratch" } node node_4" { fastname dev_server" pool " resource disk "/data/work" {} resource scratchdisk "/data/scratch" } }

Label for each node, can be anything needs to be different for each node
{}

Hostname for the ETL server can also use IP address


{}

{}

Location for parallel dataset storage used to spread I/O can have multiple entries per node

{}

Location for temporary scratch file storage used to spread I/O can have multiple entries per node 90

ValueCap Systems - Proprietary

Reading & Writing Parallel Datasets


Suppose that in each scenario illustrated below, we are reading in or writing out 4000 records. Which performs better?
Data File Partitioner
Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000

VS

Data Data File Data File Data File Files

Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000

-- OR -Collector
Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000 Records 1 - 1000

Data File

VS

Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000

Data Data File Data File Data File Files

ValueCap Systems - Proprietary

91

Parallel Dataset Advantage


Being able to read and write data in parallel will almost always be faster and more scalable than reading or writing data sequentially.
Data Data File Data File Data File Files
Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000 Records 1 - 1000 Records 1001 - 2000 Records 2001 - 3000 Records 3001 - 4000

Data Data File Data File Data File Files

Parallel Datasets perform better because: data I/O is distributed instead of sequential, thus removing a bottleneck data is stored using a format native to the Parallel Framework, thus eliminating need for the Framework to re-interpret data contents data can be stored and read back in a pre-partitioned and sorted manner
ValueCap Systems - Proprietary

92

Parallel Dataset Mechanics


Datasets are made up of several small fragments or data files Fragments are stored per the resource disk entries in the configuration file
This is where distributing the I/O becomes important!

Datasets are very much dependent on configuration files.


Its a good practice to read the dataset using the same configuration file that was originally used to create it.

ValueCap Systems - Proprietary

93

Using Parallel Datasets


Parallel datasets should use a .ds extention. The .ds file is only a descriptor file containing metadata and location of actual datasets. When writing data to a parallel dataset, be sure to specify whether to create, overwrite, append, or insert.

ValueCap Systems - Proprietary

94

Browsing Datasets
deleting datasets

Dataset viewer can be accessed from the Tools menu in the Designer. Use the Dataset viewer to see all metadata as well as records stored within the dataset. Alternatively, if all you want to do is browse the records in the dataset, you can use the View Data button in the properties window for the dataset stage.

ValueCap Systems - Proprietary

95

Lab 2A: Simple Configuration File

ValueCap Systems - Proprietary

96

Lab 2A Objectives

Learn to create a simple configuration file and validate its contents. Note: You will need to leverage skills learned during previous labs to complete subsequent labs.

ValueCap Systems - Proprietary

97

Creating a Configuration File


Log into the DataStage Designer using your assigned userid and password. Click on the Tools menu to select Configurations

ValueCap Systems - Proprietary

98

Configuration File Editor


The Configuration File editor should pop up, similar to the one you see here. Click on New and select default. We will use this as our starting point to create another config file.

ValueCap Systems - Proprietary

99

Checking the Configuration File


Once you have opened the default configuration file, click on the Check button at the bottom. This action will validate the contents of the configuration file.
Always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file!

What is in your configuration file will depend on the hardware environment you are using (i.e. number of cpus). For Example, on a 4 cpu system, you will likely see a configuration file with 4 node entries defined.
ValueCap Systems - Proprietary

100

Editing the Configuration File


At this point, how many nodes do you see defined in your default configuration file?
Remember, this dictates how many way parallel your job will run. If you see 8 node entries, then your job will run 8-way parallel.

Regardless of how many cpus your system has, edit the configuration file and create as many node entries as you have cpus.
The default may already have the nodes defined. Copy and paste is the fastest way to do this if you need to add nodes. Keep in mind that node names need to be unique, while everything else can stay the same! Pay attention to the { }s!!! Your instructor may choose to provide you with alternate resource disk and resource scratchdisk locations to use.
ValueCap Systems - Proprietary

101

Save and Check the Config File


Once you have finished editing the configuration file, click on the Save button and save it as something other than default.
Suggestions include using your initials along with the number of nodes defined. This helps prevent other students from accidentally using the wrong configuration file.
o For example: JD_Config_4node.apt

Once you have saved your configuration file, click on the Check button again at the bottom. This action will validate the contents of your configuration file.
Again, always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file! If the validation fails, use the error message to determine what the problem is. Correct the problem and repeat the above step.
ValueCap Systems - Proprietary

102

Save and Check the Configuration File


Next, re-edit the configuration file you just created (and validated) and remove all node entries except for the first one. Check it again and, if no errors are returned, save it as a 1node configuration using the same nomenclature you applied to the multi-node configuration file you previously had created.
For example: JD_Config_1node.apt Note: when you check the configuration, it may prompt you to save it first. You can check the configuration without saving it first, but always remember to save it once it passes the validation test.
ValueCap Systems - Proprietary

103

Checking the Configuration File


What does Parallel DataStage do when it is checking the config file? Validates syntax
Correct placement of all { }, , , etc Correct spelling and use of keywords such as node, fastname, resource disk, resource scratchdisk, pool, etc

Validates information
Fastname entry should match hostname or IP rsh permissions, if necessary, are in place Read and Write permissions exist for all of your resource disk and scratchdisk entries
ValueCap Systems - Proprietary

104

Changing Default Settings


Exit the Manager and go into the Administrator be sure to select your project and not someone elses. Enter the Environment editor
Find and set APT_CONFIG_FILE to the 1node configuration file you just created. This makes it the default for your project. Find and set APT_DUMP_SCORE to TRUE. This will enable additional diagnostic information to appear in the Director log.

Click OK button when finished editing Environment Variables. Click OK and then Close to exit the Administrator. You have now finished configuring your project.
ValueCap Systems - Proprietary

105

Lab 2B: Applying the Configiguration File to a Simple DataStage Job.

ValueCap Systems - Proprietary

106

Lab 2B Objective

Use your newly created configuration files to test a simple DataStage application.

ValueCap Systems - Proprietary

107

Create Lab2B Using Lab1B


Open the job you created in Lab 1B should be called lab1b Save the job again using Save As use the name lab2b Next, find the job properties icon:
Click on the job properties icon to bring up the Job Properties window.

ValueCap Systems - Proprietary

108

Editing Job Parameters


Click on the Parameters tab Find and click on the Add Environment Variable button You will see the big (and sometimes confusing) list of environment variables. Take some time to browse through these. Find and select APT_CONFIG_FILE

ValueCap Systems - Proprietary

109

Defining APT_CONFIG_FILE
Once selected, you will return to the Job Properties window. Verify that the value for APT_CONFIG_FILE is the same as the 1node configuration file you defined previously in Lab 2A.

Save, Compile, and Run your job.

ValueCap Systems - Proprietary

110

Running Using Parameters


When you run your job, you should see the following Job Run Options dialogue: Note that it shows you the default configuration file being used, which happens to be the one defined previously in the Administrator. Keep this value for now, and just click on Run. Go to the Director to view the job run log.

ValueCap Systems - Proprietary

111

Director Log Output


Look for a similar entry in the job log for lab2b: Double click on it. You should see the contents of the 1node configuration file used. Click on Close to exit from the dialogue. Click on Run again and this time, change the APT_CONFIG_FILE parameter to the multiple node configuration file you defined in Lab 2A. Click the Run button.
ValueCap Systems - Proprietary

112

Director Log Output


Again, look for a similar entry in the job log for lab2b: Double click on it. You should see the contents of the multiple node configuration file used. Click on Close to exit from the dialogue. You have just successfully run your job sequentially and in parallel by simply changing the configuration file!

ValueCap Systems - Proprietary

113

Using APT_DUMP_SCORE
Another way to verify degree of parallelism is to look at the following output in your job log:

The entries Peek,0 and Peek,1 show up as a result of you having set APT_DUMP_SCORE to TRUE. The numbers 0 and 1 signify partition numbers. So if you have a job running 4 way parallel, you should see numbers 0 through 3.
ValueCap Systems - Proprietary

114

Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
ValueCap Systems - Proprietary

Page 10 Page 73

Page 116 Page 252 Page 309 Page 364

Page 420 Page 450 Page 477 Page 526

115

Data Import and Export


In this section we will discuss: Data Generation, Copy, and Peek Data Sources and Targets
Flat Files Parallel Datasets vs Filesets RDBMS Other

Related Stages

ValueCap Systems - Proprietary

116

Generating Columns and Rows


DataStage allows you to easily test any job you develop by providing an easy way to generate data.
Row Generator generates as many records as you want Column Generator generates extra fields within existing records. Must first have input records.

To use either stage, you will need to have a table or column definition. You can generate as little as 1 record with 1 column. Columns can be of any supported data type
Integer, Float, Double, Decimal, Character, Varchar, Date, and Timestamp
ValueCap Systems - Proprietary

117

Row Generator
The Row Generator is an excellent stage to use when building jobs in Datastage. It allows you to test the behavior of various stages within the product. To configure the Row Generator to work, you must define at least 1 column. Looking at what we did for the job in Lab 1b, we see that 3 columns were defined:

We could have also loaded an existing table definition instead of entering our own.
ValueCap Systems - Proprietary

118

Row Generator
Suppose we want to stick with the 3 column table definition we created. As you saw in Lab 2B, the Row Generator will produce records with miscellaneous 10-byte character, integer, and date values. There is, however, a way to specify values to be generated. To do so, double click on the number next to the column name.

ValueCap Systems - Proprietary

119

Column Metadata Editor


The Column Metadata editor allows you to provide specific data generation instructions for each and every field. Options vary by data types. Frequent options include cycle-through user-defined values, random values, incremental values, and alphabetic algorithm
ValueCap Systems - Proprietary

120

Character Generator Options


For a Character or Varchar type, when you click on Algorithm you will have 2 options: cycle cycle through only the specific values you specify. alphabet methodically cycle through characters of the alphabet. This is the default behavior:

ValueCap Systems - Proprietary

121

Number Generator Options


For an Integer, Decimal, or Float type, your 2 options are:
cycle cycle through numbers beginning at the initial value and incrementing by the increment value. You can also define an upper limit. random randomly generate numerical values. You can define an upper limit and a seed for the random number generator. You can also use the signed option to generate negative numbers.

Note: In addition, with Decimal types, you also have the option of defining percent zero and percent invalid

ValueCap Systems - Proprietary

122

Other Data Type Generator Options


Date, Time, and Timestamp data types have some useful options:
Epoch: earliest date to use. For example, the default value is 1960-01-01. Scale Factor: Specifies a multiplier to the increment value for time. For example, a scale factor of 60 and an increment of 1 means the field increments by 60 seconds. Use Current Date: Generator will insert current date value for all rows. This cannot be used with other options.

ValueCap Systems - Proprietary

123

Column Generator
The Column Generator is an excellent stage to use when you need to insert a new column or set of columns into a record layout. Column Generator requires you to specify the name of the column first, and then in the output-mapping tab, you will need to map source to target. In the output-columns tab, you will need to customize the column(s) added the same way as it is done in the Row Generator.
For example, if you are generating a dummy key, you would want to make it an Integer type with an initial value of 0 and increment of 1. When running this in parallel, you can start with an initial value of part and increment of partcount. part is defined in the Framework as the partition number and partcount is the number of partitions.

ValueCap Systems - Proprietary

124

Making Copies of Data


Copy Stage (in the Processing Palette): Incredibly flexible with little or no overhead at runtime Often used to create a duplicate of the incoming data:

Can also be used to terminate a flow:


Records get written out to /dev/null Useful when you dont care about the target or just want to test part of the flow.
ValueCap Systems - Proprietary

125

Take a Look at the Data


Peek Stage (in the Development / Debug Pallette): Often used to help debug a job flow Can be inserted virtually anywhere in a job flow
Must have an input data source

Outputs fixed amount of records into the job log


For example, in Lab 2B: Output volume can be controlled

Can also be used to terminate any job flow. Similar in behavior to inserting a print statement into your source code.
ValueCap Systems - Proprietary

126

Importing Data from Outside DataStage


If you have access to real data, then you probably will not have a lot of use for the Row Generator! DataStage can read in or import data from a large variety of data sources:
Flat Files Complex Files RDBMSs SAS datasets Queues Parallel Datasets & Filesets FTP Named Pipes Compressed Files Etc

ValueCap Systems - Proprietary

127

Importing Data
There are 2 primary means of importing data from external sources:
Automatically DataStage automatically reads the table definition and applies it to the incoming data. Examples include RDBMSs, SAS datasets, and parallel datasets. Manually user must define the table definition that corresponds to the data to be imported. These table definitions can be entered manually or imported from an existing copy book or schema file. Examples include flat files and complex files.

ValueCap Systems - Proprietary

128

Manual Data Import


When DataStage reads in data from an external source, there are 2 steps that will always take place: Recordization
DataStage carves out the entire record based on the table definition being used. Record delimiter is defined within table definition

Columnization
DataStage parses through the record it just carved out and separates out the columns, again based on the table definition provided. Column delimiters are also defined within the table definition

Can become very troublesome if you dont know the correct layout of your data!
ValueCap Systems - Proprietary

129

DataStage Data Types


In order to properly setup a table definition, you must 1st understand the internal data types used within DataStage:
Integer: Signed or unsigned, 8-, 16-, 32- or 64-bit integer. In the Designer you will see TinyInt, SmallInt, Integer, and BigInt instead. Floating Point: Single- (32 bits) or double-precision (64 bits); IEEE. In the Designer you will see Float, and Double instead. String: Character string of fixed or a variable length. In the Designer you will see Char and VarChar instead. Decimal: Numeric representation compatible with the IBM packed decimal format. Decimal numbers consist of a precision (number of decimal digits) greater than 1 with no maximum, and scale (fixed position of decimal point) between 0 and the precision.
ValueCap Systems - Proprietary

130

DataStage Data Types (continued)


Date: Numeric representation compatible with RDBMS notion of date (year, month and day). The default format is month/date/year. This is represented by the default format string of: %mm/%dd/%yyyy Time: Time of day with either one second or one microsecond resolution. Time values range from 00:00:00 to 23:59:59.999999. Timestamp: Single field containing both a date and time. Raw: Untyped collection of contiguous bytes of a fixed or a variable length. Optionally aligned. In the Designer you will see Binary.
ValueCap Systems - Proprietary

131

DataStage Data Types (continued)


Subrecords (subrec): Nested form of field definition that consists of multiple nested fields, Similar to COBOL record levels or C structs. A subrecord itself does not define any storage; instead, the fields of the subrecord define storage. The fields in a subrecord can be of any data type, including tagged. In addition, you can also nest subrecords and vectors of subrecords, to any depth of nesting. Tagged Subrecord (tagged): Any one of a mutually exclusive list of possible data types, including subrecord and tagged fields. Similar to COBOL redefines or C unions, but more type-safe. Defining a record with a tagged type allows each record of a data set to have a different data type for the tagged column.
ValueCap Systems - Proprietary

132

Null Handling
All DataStage data types are nullable
Tags and subrecs are not nullable, but their fields are

Null fields do not have a value DataStage null is represented by an out-of-band indicator
Nulls can be detected by a stage Nulls can be converted to/from a value Null fields can be ignored by a stage, can trigger error, or other action
Exporting a nullable field to a flat file without 1st defining how to handle the null will cause an error.

ValueCap Systems - Proprietary

133

Data Import Example


Suppose you have the following data:
Last, First, Purchase_DT, Item, Amount, Total Smith,John,2004-02-27,widget #2,21,185.20 Doe,Jane,2005-07-03,widget #1,7,92.87 Adams,Sam,2006-01-15,widget #9,43,492.93

What would your table definition look like for this data?
You need column names, which are provided for you You need data types for each column You need to specify , as the column delimiter You need to specify newline as the record delimiter
ValueCap Systems - Proprietary

134

Data Import Example (continued)


It is critical that you fill out the Format options correctly, otherwise, DataStage will not be able to perform the necessary recordization and columnization!
Sequential File Stage

Data types must also match the data itself, otherwise it will cause the columnization step to fail.

ValueCap Systems - Proprietary

135

Data Import Example (continued)


Once all of the information is properly filled out, you can press the View Data button to see a sample of your data and at the same time, validate that your table definition is correct.

If your table definition is not correct, then the View Data operation will fail.
ValueCap Systems - Proprietary

136

Data Import Example (continued)

The table definition we used above worked for the data we were given. Was this the only table definition that would have worked? No, but this was the best one
VarChar is perhaps the most flexible data type, so we could have defined all columns as VarChars. All numeric and date/time types can be imported as Char or VarChar as well, but the reverse is rarely true. Decimal types can typically be imported as Float or Double and vice versa, but be careful with precision you may lose data! Integer types can also be imported as Decimal, Float, or Double.
ValueCap Systems - Proprietary

137

Data Import Reject Handling


Data is not always clean. When unexpected or invalid values come up, you can choose one of the following (see the sketch below):
Continue - default option. Discards any record where a field does not import correctly
Fail - abort the job as soon as an invalid field value is encountered
Output - send reject records down a reject link to a Dataset. Rejects can also be passed on to other stages for further processing.
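As a hedged, conceptual illustration (plain Python, not DataStage code, with a hypothetical three-column record), the sketch below contrasts the three reject-handling policies.

# Conceptual sketch (not DataStage code) of the three reject-handling policies.
def import_record(line):
    last, first, amount = line.split(",")
    return {"Last": last, "First": first, "Amount": int(amount)}  # may raise ValueError

def read_file(lines, policy="continue"):
    good, rejects = [], []
    for line in lines:
        try:
            good.append(import_record(line))
        except ValueError:
            if policy == "fail":        # Fail: abort the whole job
                raise
            elif policy == "output":    # Output: send the record down a reject link
                rejects.append(line)
            # Continue: silently discard the bad record
    return good, rejects

good, rejects = read_file(["Smith,John,21", "Doe,Jane,not-a-number"], policy="output")
print(len(good), "imported,", len(rejects), "rejected")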

ValueCap Systems - Proprietary

138

Exporting Data to Disk


Once the data has been read into DataStage and processed, it is then typically written out somewhere. These targets can be the same as the sources which originally produced the data or a completely different target.
Exporting data to a flat file is easier than importing it from a flat file, simply because DataStage will use the table definition that has been propagated downstream to define the data layout within the output target file.
You can easily edit the formatting properties within the Sequential File stage for items such as null handling, delimiters, quotes, etc.
Consider using a parallel dataset instead of a flat file to stage data on disk! It is much faster and easier if there is another DSEE application which will consume the data downstream.
ValueCap Systems - Proprietary

139

Data Export to Flat File Example


Here's an example of what it takes to set up the Sequential File stage to export data to a flat file.

ValueCap Systems - Proprietary

140

Data Export to Parallel Dataset Example


With DataStage Parallel Datasets, regardless of whether the dataset is a source or a target, all you need to specify is its name and location! There is no need to worry about data types, handling nulls, or delimiters.

ValueCap Systems - Proprietary

141

Automatic Data Import


Besides flat files and other manual sources, DataStage can also import data from a Parallel Dataset or RDBMS without the need to first define a table definition! Parallel Datasets are self-describing datasets native to DataStage
Easiest way to read and write data

RDBMSs often store table definitions internally


For example, the DESCRIBE or DESCRIBE TABLE command often returns the table definition associated with the given table

DataStage has the ability to:


Automatically extract the table definition during design time.
Automatically extract the table definition to match the data at runtime, and propagate that table definition downstream using RCP.
ValueCap Systems - Proprietary

142

Parallel Datasets vs Parallel Filesets


Parallel Dataset vs Parallel Fileset - the primary difference is the storage format:
Parallel datasets are stored in a native DataStage format
Parallel filesets are stored as ASCII

Parallel filesets use a .fs extension vs .ds for parallel datasets


The .fs file is also a descriptor file; however, it is ASCII and only contains the location of each fragment and the layout.

Parallel datasets are faster than parallel filesets


Parallel datasets avoid the recordization and columnization process because data is already stored in a native format.
ValueCap Systems - Proprietary

143

Parallel Datasets vs RDBMS


Parallel Dataset vs RDBMS - logically and functionally very similar:
Parallel datasets have data that is partitioned and stored across several disks
The table definition (aka schema) is stored and associated with the table

Parallel datasets can sometimes be faster than loading into or extracting from an RDBMS. Some conditions that can make this happen:
Non-partitioned RDBMS tables
Remote location of the RDBMS
Sequential RDBMS access mechanism
ValueCap Systems - Proprietary

144

Importing RDBMS Table Definitions


There are a couple of options you can choose for importing a RDBMS table definition for use during design time. Import Orchestrate Schema is one option.
Once you enter all the necessary parameters, you can click on the Next button to import the table definition. Once imported, the table definition can be used at design time

Select from DB2, Oracle, or Informix


ValueCap Systems - Proprietary

145

Importing RDBMS Table Definitions


Other options for importing an RDBMS table definition include using ODBC or Plug-In Metadata access.
The ODBC option requires that the correct ODBC driver be set up
The Plug-In Metadata option requires that it be set up during install.

Once setup, each option guides you through a simple process to import the table definition and save it for future re-use.
ValueCap Systems - Proprietary

146

Using Saved Table Definitions


There are 2 ways to reference a saved table definition in a job. The first is to select it from the repository tree view on the left side, and then drag and drop it onto the link.

Table definition icon shows up on link

The presence of the icon on the link signifies that a table definition is present, or that metadata is present on the link. Why do this when DataStage can do this automatically at runtime? Sometimes it is easier or more straightforward to have the metadata available at design time.
ValueCap Systems - Proprietary

147

Using Saved Table Definitions

Another way to access saved table definitions is to use the Load button on the Output tab of any given stage. Note that you can also do this on the Input tab, but that is the same as loading it on the Output tab of the upstream (preceding) stage.
ValueCap Systems - Proprietary

148

Loading Table Definitions


When loading a previously saved table definition, the column selection dialogue will appear. This allows you to optionally eliminate certain columns which you do not want to carry over. This is useful when you are only reading in some columns or your select clause only has some columns.

ValueCap Systems - Proprietary

149

RDBMS Connectivity
DataStage offers an array of options for RDBMS connectivity, ranging from ODBC to highly-scalable native interfaces. For handling large data volumes, DataStage's highly-scalable native database interfaces are the best way to go. While the icons may appear similar, always look for the _enterprise label.

DB2 - parallel extract, load, upsert, and lookup
Oracle - parallel extract, load, upsert, and lookup
Teradata - parallel extract and load
Sybase - sequential extract; parallel load, upsert, and lookup
Informix - parallel extract and load

ValueCap Systems - Proprietary

150

Parallel RDBMS Interface



Usually a query is submitted to a database sequentially, and the database then distributes the query to execute it in parallel. The output, however, is returned sequentially. Similarly, when loading, data is loaded sequentially first, before being distributed by the database.

DataStage avoids this bottleneck by establishing parallel connections into the database and executing queries, extracting data, and loading data in parallel. The degree of parallelism changes depending on the database configuration (i.e. the number of partitions that are set up).
151

ValueCap Systems - Proprietary

DataStage and RDBMS Scalability



While the database itself may be highly scalable, the overall solution, which includes the application accessing the database, is not. Any sequential bottleneck in an end-to-end solution will limit its ability to scale!

DataStage's native parallel connectivity into the database is the key enabler for a truly scalable end-to-end solution.

ValueCap Systems - Proprietary

152

Extracting from the RDBMS


Extracting data from DB2, Oracle, Teradata, and Sybase is pretty straightforward. The stage properties dialog is very much the same for each database, despite its different behavior under the covers. For all database stages, to extract data you will need to provide the following:
Read Method - full table scan or user-defined query
Table - table name, if using Table as the Read Method
User - user id (optional with DB2)
Password - password (optional with DB2)
Server/Database - used by some databases for establishing connectivity
Options - database-specific options

ValueCap Systems - Proprietary

153

Loading to the RDBMS


Loading data into DB2, Oracle, Teradata, and Sybase is also pretty straightforward. The stage properties dialog is very much the same for each database, despite its different behavior under the covers. For all database stages, to load data you will need to provide the following:
Table - name of the table to be loaded
Write Method - write, load, or upsert. Details will be discussed shortly.
Write Mode - Append, Create, Replace, or Truncate
User - user id (optional with DB2)
Password - password (optional with DB2)
Server/Database - used by some databases for establishing connectivity
Options - database-specific options
ValueCap Systems - Proprietary

154

Write Methods Explained


Write/Load - often the default option. Used to append data into an existing table, create and load data into a target table, or drop an existing table, recreate it, and load data into it. The mechanics of the load itself depend on the database.
Upsert - update or insert data into the database. There is also an option to delete data from the target table.
Lookup - perform a lookup against a table inside the database. This is useful when the lookup table is much larger than the input data.
ValueCap Systems - Proprietary

155

Write Modes Explained


Append - default option. Appends data into an existing table.
Create - creates a table using the table definition provided by the stage. If the table already exists, the job will fail. Inserts data into the created table.
Replace - if the target table exists, drop the table first. If the table does not exist, create it. Insert data into the created table.
Truncate - delete all records from the target table, but do not drop the table. Insert data into the empty table.
A rough sketch of what each mode implies for the target table follows below.
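The sketch below is a hedged, conceptual illustration (plain Python, not DataStage code, and not the exact SQL the stages issue) of what each Write Mode implies for the state of the target table before rows are inserted.

# Conceptual sketch: roughly what each Write Mode implies for the target
# table (illustrative only; the stages issue database-specific commands).
def prepare_target(mode, table_exists):
    if mode == "Append":
        return []                                   # use the existing table as-is
    if mode == "Create":
        if table_exists:
            raise RuntimeError("job fails: table already exists")
        return ["CREATE TABLE target (...)"]
    if mode == "Replace":
        steps = ["DROP TABLE target"] if table_exists else []
        return steps + ["CREATE TABLE target (...)"]
    if mode == "Truncate":
        return ["DELETE FROM target"]               # keep the table, just empty it
    raise ValueError(mode)

print(prepare_target("Replace", table_exists=True))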
ValueCap Systems - Proprietary

156

Connectivity
DataStage Oracle Enterprise Stage

ValueCap Systems - Proprietary

157

Configuration for Oracle


To establish connectivity to Oracle, certain environment variables and stage options need to be defined: Environment Variables (defined via DataStage Administrator)
ORACLE_SID - name of the ORACLE database to access
ORACLE_HOME - location of the ORACLE home
PATH - append $ORACLE_HOME/bin
LIBPATH or LD_LIBRARY_PATH - append $ORACLE_HOME/lib32 or $ORACLE_HOME/lib64, depending on the operating system. The path must be spelled out.

Stage Options
User - Oracle user-id
Password - Oracle user password
DB Options - can also accept SQL*Loader parameters such as:
o DIRECT = TRUE, PARALLEL = TRUE,
ValueCap Systems - Proprietary

158

Specifics for Extracting Oracle


Extracts from Oracle:
Default option (depending on the version used) is to use the SQL Builder interface, which allows you to use a graphical interface to create a custom query.
Note: the query generated will run sequentially by default.

User-Defined Query option allows you to enter your own query or copy and paste an existing query.
Note: the custom query will run sequentially by default.

Running SQL queries in parallel requires the use of the following option:
Partition Table option - enter the name of the table containing the partitioning strategy you are looking to match
ValueCap Systems - Proprietary

159

Oracle Parallel Extract

Both sets of options above will yield identical results. Leaving out the Partition Table option would cause the extract to execute sequentially.

ValueCap Systems - Proprietary

160

Specifics for Loading Oracle


There are 2 ways to put data into Oracle:
Load (default option) - leverages the Oracle SQL*Loader technology to load data into Oracle in parallel.
Load uses the Direct Path load method by default
Fastest way to load data into Oracle
Select Append, Create, Replace, or Truncate mode

Upsert - update or insert data in an Oracle table
Runs in parallel
Uses standard SQL Insert and Update statements
Use auto-generated or user-defined SQL
Can also use the DELETE option to remove data from the target Oracle table
ValueCap Systems - Proprietary

161

Oracle Index Maintenance


Loading to a range/hash partitioned table in parallel is supported; however, if the table is indexed:
Rebuild can be used to rebuild global indexes. Can specify NOLOGGING (speeds up the rebuild by eliminating the log during index rebuild) and COMPUTE STATISTICS to provide stats on the index.
Maintenance is supported for local indexes partitioned the same way the table is partitioned.
Don't use both the rebuild and maintenance options in the same stage - either the global or local index must be dropped prior to the load.
Using DB Options:
DIRECT=TRUE,PARALLEL=TRUE,SKIP_INDEX_MAINTENANCE=YES allows the Oracle stage to run in parallel using direct path mode, but indexes on the table will be unusable after the load.

ValueCap Systems - Proprietary

162

Relevant Stages
Column Import - imports only a subset of the columns in a record, leaving the rest as raw or string. This is useful when you have a very wide record and only plan on referencing a few columns.
Column Export - combines 2 or more columns into a single column.
Combine Records - combines records in which particular key-column values are identical into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys. All adjacent records whose key columns contain the same value are gathered into the same record as subrecords.
Make Subrecord - combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. Specify the vector columns to be made into a vector of subrecords and the name of the new subrecord.
ValueCap Systems - Proprietary

163

Relevant Stages (Continued)


Split Subrecord - the inverse of Make Subrecord. Creates one new vector column for each element of the original subrecord. Each top-level vector column that is created has the same number of elements as the subrecord from which it was created. The stage outputs columns of the same name and data type as those of the columns that comprise the subrecord.
Make Vector - combines specified columns of an input data record into a vector of columns. The stage has the following requirements:
The input columns must form a numeric sequence, and must all be of the same type.
The numbers must increase by one.
The columns must be named column_name0 to column_namen, where column_name starts the name of a column and 0 and n are the first and last of its consecutive numbers.
The columns do not have to be in consecutive order.

All these columns are combined into a vector of the same length as the number of columns (n+1). The vector is called column_name. Any input columns that do not have a name of that form will not be included in the vector but will be output as top-level columns. A conceptual sketch of this naming rule follows below.
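As a hedged, conceptual illustration (plain Python, not DataStage code, using a hypothetical "era" column family), the sketch below shows the Make Vector naming rule: columns named base0..baseN are gathered into one vector named base, while everything else passes through as top-level columns.

# Conceptual sketch of the Make Vector naming rule.
import re

record = {"era0": 3.78, "era2": 4.05, "era1": 2.25, "playerID": "aasedo01"}

def make_vector(record, base_name):
    pattern = re.compile(rf"^{re.escape(base_name)}(\d+)$")
    elements, passthrough = {}, {}
    for name, value in record.items():
        match = pattern.match(name)
        if match:
            elements[int(match.group(1))] = value
        else:
            passthrough[name] = value
    vector = [elements[i] for i in sorted(elements)]   # order by the numeric suffix
    return {base_name: vector, **passthrough}

print(make_vector(record, "era"))   # {'era': [3.78, 2.25, 4.05], 'playerID': 'aasedo01'}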
ValueCap Systems - Proprietary

164

Relevant Stages (Continued)


Split Vector - promotes the elements of a fixed-length vector to a set of similarly named top-level columns. The stage creates columns of the format name0 to nameN, where name is the original vector's name and 0 and N are the first and last elements of the vector.
Promote Subrecord - promotes the columns of an input subrecord to top-level columns. The number of output records equals the number of subrecord elements. The data types of the input subrecord columns determine those of the corresponding top-level columns.
DRS - Dynamic Relational Stage. DRS reads data from any DataStage stage and writes it to one of the supported relational databases. It also reads data from any of the supported relational databases and writes it to any DataStage stage. It supports the following relational databases: DB2/UDB, Informix, Microsoft SQL Server, Oracle, and Sybase. It also supports generic ODBC.
ValueCap Systems - Proprietary

165

Relevant Stages (Continued)


ODBC - access or write data to remote sources via an ODBC interface.
Stored Procedure - allows a stored procedure to be used as:
A source, returning a rowset
A target, passing a row to a stored procedure to write
A transform, invoking logic within the database

The Stored Procedure stage supports input and output parameters or arguments. It can process the returned value after the stored procedure is run. It also provides status codes indicating whether the stored procedure completed successfully and, if not, allowing for error handling. Currently supports DB2, Oracle, and Sybase.
Complex Flat File - as a source stage it imports data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUPs, REDEFINES, OCCURS or OCCURS DEPENDING ON clauses. When used as a target, the stage exports data to one or more complex flat files. It does not write to MVS datasets.

ValueCap Systems - Proprietary

166

Lab 3A: Flat File Import

ValueCap Systems - Proprietary

167

Lab 3A Objectives

Learn to create a table definition to match the contents of the flat file
Read in the flat file using the Sequential File stage and the table definition just created

ValueCap Systems - Proprietary

168

The Data Files


There are 4 data files you will be importing. You will be using these files for future labs. The files contain Major League Baseball data.
Batting.csv - player hitting statistics
Pitching.csv - pitcher statistics
Salaries.csv - player salaries
Master.csv - player details

The files all have the following format


The 1st row in each file contains the column names
Data is in ASCII format
Records are newline delimited
Columns are comma separated

ValueCap Systems - Proprietary

169

Batters File
The layout of the Batting.csv file is:
Column Name - Description
playerID - Player ID code
yearID - Year
teamID - Team
lgID - League
G - Games
AB - At Bats
R - Runs
H - Hits
DB - Doubles
TP - Triples
HR - Homeruns
RBI - Runs Batted In
SB - Stolen Bases
IBB - Intentional walks

Tips:
1. Use a data type that most closely matches the data. For example, for the Games column, use Integer instead of Char or VarChar!
2. When using a VarChar type, always fill in a maximum length by entering a number in the length column.
3. When defining numerical types such as Integer or Float, there's no need to fill in length or scale values. You only do this for Decimal types.

Open the file using vi or any other text editor to view the contents - note the contents and data types
Create a table definition for this data, and save it as batting.
ValueCap Systems - Proprietary

170

Pitchers File
The layout of the Pitching.csv file is:
Column Name - Description
playerID - Player ID code
yearID - Year
teamID - Team
lgID - League
W - Wins
L - Losses
SHO - Shutouts
SV - Saves
SO - Strikeouts
ERA - Earned Run Average

Tips: 1. Be careful to choose the right data type for the ERA column. Your choices should boil down to Float vs Decimal

Open the file using vi or any other text editor to view the contents - note the contents and data types
Create a table definition for this data, and save it as pitching.
ValueCap Systems - Proprietary

171

Salary File
The layout of the Salaries.csv file is:
Column Name - Description
yearID - Year
teamID - Team
lgID - League
playerID - Player ID code
salary - Salary
Tips: 1. Salary value is in whole dollars. Again be sure to select the best data type. While it may be tempting to use Decimal, the Framework is more efficient at processing Integer and Float types. Those are considered native to the Framework.

Open the file using vi or any other text editor to view the contents - note the contents and data types
Create a table definition for this data, and save it as salaries.

ValueCap Systems - Proprietary

172

Master File
The layout of the Master.csv file is:
Column Name - Description
playerID - A unique code assigned to each player
birthYear - Year player was born
birthMonth - Month player was born
birthDay - Day player was born
nameFirst - Player's first name
nameLast - Player's last name
debut - Date player made first major league appearance
finalGame - Date player made last major league appearance

Tips:
1. Treat birthYear, birthMonth, and birthDay as Integer types for now.
2. Be sure to specify the correct Date format string: %mm/%dd/%yyyy

Open the file using vi or any other text editor to view the contents - note the contents and data types
Create a table definition for this data, and save it as master.

ValueCap Systems - Proprietary

173

Testing the Table Definitions


Create the following flow by linking a Sequential File stage to a Peek stage:

Next, find the batting table definition you created, then click and drag the table onto the link.
On the link:
Look for the icon that signifies the presence of a table definition
ValueCap Systems - Proprietary

174

Testing the Table Definition


In the Sequential File stage properties:
Fill in the File option with the correct path and filename. For example: C:\student01\training\data\Batting.csv
Click on the Format tab and review the settings. Are these consistent with what you see in the Batting.csv data file?
In the Columns tab, you will note that the table definition you previously selected and dragged onto the link is now present. Alternatively, you could have used the Load button to bring it in, or typed it all over again!
Next, click on the View Data button to see if you got everything correct! Click OK to view the data.

ValueCap Systems - Proprietary

175

Viewing Data
If everything went well, you should see the View Data window pop up:

If you get an error instead, take a look at the error message to determine the location and nature of the error. Make the necessary corrections and try again.
ValueCap Systems - Proprietary

176

Testing lab3a
Save the job as lab3a_batting
Compile the job and then click on the run button. Go into the Director and take a look at the job log.
Look out for Warnings and Errors! Errors are fatal and must be resolved. Warnings can be an issue; in this case, a warning could be telling you that certain records failed to import. This is a bad thing!

Typical mistakes include formatting and data type mismatches


Verify that the column delimiter is correct. Everything should be comma separated.
Are you using the correct data types?
ValueCap Systems - Proprietary

177

lab3a_batting Results
For your lab3a_batting job:
You should see "Import complete. 25076 records imported successfully, 0 rejected." There should be no rejected records!
Find the Peek output line in the Director's Log. Double-click on it. It should look like the following:

ValueCap Systems - Proprietary

178

Importing Rest of the Files


Repeat the process for the Pitching, Salaries, and Master files.
Save the jobs as lab3a_pitching, lab3a_salaries, and lab3a_master accordingly

When finished, your job should resemble one of the diagrams on the right.
Be sure to rename the stages accordingly.

Make sure that View Data works for each and every input file.
ValueCap Systems - Proprietary

179

Validating Results
For your lab3a_pitching job:
You should see Import complete. 11917 records imported successfully, 0 rejected. There should be no rejected records!

For your lab3a_salaries job:


You should see Import complete. 17277 records imported successfully, 0 rejected. There should be no rejected records!

For your lab3a_master job:


You should see Import complete. 3817 records imported successfully, 0 rejected. There should be no rejected records!
ValueCap Systems - Proprietary

180

Lab 3B: Exporting to a Flat File

ValueCap Systems - Proprietary

181

Lab 3B Objective

Write out the imported data files to ASCII flat files and parallel datasets
Use different formatting properties

ValueCap Systems - Proprietary

182

Create Lab 3B Using Lab 3A


Open the jobs you created in Lab 3A - lab3a_batting, lab3a_pitching, lab3a_salaries, and lab3a_master
Save each job again using Save As - use the names lab3b_batting, lab3b_pitching, lab3b_salaries, and lab3b_master accordingly.

ValueCap Systems - Proprietary

183

Edit lab3b_batting
Go to lab3b_batting and edit the job to look like the following:

To do so, perform the following steps:


Click on the Peek stage and delete it
Attach the Copy stage in its place
Place a Sequential File stage and a Dataset stage after the Copy
Draw a link between the Copy and the 2 output stages
Update the link and stage names accordingly
ValueCap Systems - Proprietary

184

Edit lab3b_batting
In the Copy stage's Output Mapping tab, map the source columns to the target columns for both output links:

ValueCap Systems - Proprietary

185

Source to Target Mapping


Once the mapping is complete, you should see the table definition icon present on the links. An easier way to do this would be to right-click on the Copy stage and use the Auto-map columns feature.


ValueCap Systems - Proprietary

186

Output Stage Properties


In the Parallel Dataset stage options, fill in the appropriate path and filename for where the dataset descriptor file should reside.
For example: Use Batting.ds as the filename

For the Sequential File stage, fill in the appropriate path and filename for where the data file should reside.
For example: Use Batting.txt as the filename
ValueCap Systems - Proprietary

187

Sequential File Formatting


In the Sequential File stage properties Format tab, change the Delimiter option to | (pipe character):

Save and compile lab3b_batting
Run the job and view the results in the Director
ValueCap Systems - Proprietary

188

Viewing the Results


In the Director's Log, there should be no warnings or errors. Look for the following output in the Log.
View Batting.txt from the Unix prompt and verify that the columns are now pipe (|) delimited:
aasedo01|1985|BAL|AL|54|0|0|0|0|0|0|0|0|0 ackerji01|1985|TOR|AL|61|0|0|0|0|0|0|0|0|0 agostju01|1985|CHA|AL|54|0|0|0|0|0|0|0|0|0 aikenwi01|1985|TOR|AL|12|20|2|4|1|0|1|5|0|0 alexado01|1985|TOR|AL|36|0|0|0|0|0|0|0|0|0 allenga01|1985|TOR|AL|14|34|2|4|1|0|0|3|0|0 allenne01|1985|NYA|AL|17|0|0|0|0|0|0|0|0|0 armasto01|1985|BOS|AL|103|385|50|102|17|5|23|64|0|4 armstmi01|1985|NYA|AL|9|0|0|0|0|0|0|0|0|0 atherke01|1985|OAK|AL|56|0|0|0|0|0|0|0|0|0
ValueCap Systems - Proprietary

189

Understanding the Output


How does DataStage know which columns to write out and in what order? If you look in the Columns tab under the output Sequential File stage properties, you will see that the table definition from the source has been propagated to the target:
DataStage uses this along with the formatting to create the output file
ValueCap Systems - Proprietary

190

Exporting Rest of the Files


Repeat the process for lab3b_pitching, lab3b_salaries, and lab3b_master accordingly
The number of records read in each time should match the number of records written out
Make sure there are no warnings or errors in the Director's Log for each job.

ValueCap Systems - Proprietary

191

Lab3A and Lab3B Review


Congratulations! You have successfully:
Created table definitions to describe existing data layouts
Imported data from flat files using the newly created table definitions and the Sequential File stage
Exported data to flat files using the Sequential File stage
Written data to Parallel Datasets using the Parallel Dataset stage

ValueCap Systems - Proprietary

192

PLACEHOLDER SLIDE
Insert the appropriate set of database connectivity slides here depending on the customer environment:
DB2
Oracle
Teradata
Sybase
ValueCap Systems - Proprietary

193

Connectivity
DataStage DB2 Enterprise Stage

ValueCap Systems - Proprietary

194

Lab 3C: Inserting Into RDBMS

ValueCap Systems - Proprietary

195

Lab 3C Objective

Insert the data stored within the Datasets created in Lab 3B into the database

ValueCap Systems - Proprietary

196

Creating Jobs for Lab 3C


In this lab you will use the data you wrote out to the Datasets in Lab 3B as the source data to be loaded into the target database table. The instructor should have pre-configured the necessary settings for database connectivity.
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

ValueCap Systems - Proprietary

197

Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters:

Use your own directory path, userid, and password (NOTE: userid and password may not be necessary, depending on DB2 setup)

ValueCap Systems - Proprietary

198

Lab 3C DB2
Create a new job and pull together the following stages (Dataset, Peek, and DB2 Enterprise):

Rename the links and stages accordingly
In the Dataset stage:

Use the FILEPATH parameter along with the Dataset filename created earlier in Lab 3B
Load the Batting table definition in the Columns tab. While this step is optional, it does provide design time metadata.
ValueCap Systems - Proprietary

199

Lab 3C DB2 (continued)


In the DB2 Enterprise stage:
For the Table option, precede the table name with your initials. If using a shared database environment, this will prevent conflicts.

ValueCap Systems - Proprietary

200

Lab 3C DB2 (continued)


Make sure to map source columns to target columns, going from Dataset to Peek to DB2.
Save the job as lab3c_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were loaded:
Log into the database and issue a select count(*) from tablename query to double-check the record count
ValueCap Systems - Proprietary

201

Lab 3C DB2 (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3c_pitching, lab3c_salaries, and lab3c_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

202

Lab 3D: Extracting From RDBMS

ValueCap Systems - Proprietary

203

Lab 3D Objective

Extract Batting, Pitching, Salaries, and Master tables from the Database

ValueCap Systems - Proprietary

204

Creating Jobs for Lab 3D


In this lab you will extract the data you loaded into the database in Lab 3C
You will leverage the same setup for database connectivity as Lab 3C
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

Use the same USERID and PASSWORD job parameters (if needed) from Lab 3C
ValueCap Systems - Proprietary

205

Lab 3D DB2
Create a new job and pull together the following stages (DB2 Enterprise and Peek):

For the DB2 Enterprise stage:


Use the Table Read Method instead of SQL-Builder

ValueCap Systems - Proprietary

206

Lab 3D DB2 (continued)


Optionally, you can load the Batting table definition in the Columns tab for the DB2 Enterprise stage properties.
DataStage automatically extracts the table definition from the database and uses RCP to propagate it downstream
You can try the job with and without the table definition loaded in the Columns tab
If you load the Batting table definition, you may receive some type conversion warnings at runtime in the Director's Log, which can be ignored for this job.

ValueCap Systems - Proprietary

207

Lab 3D DB2 (continued)


Save the job as lab3d_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were extracted.

ValueCap Systems - Proprietary

208

Lab 3D DB2 (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3d_pitching, lab3d_salaries, and lab3d_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

209

Lab3C and Lab3D Review


Congratulations! You have successfully:
Loaded Batting, Pitching, Salaries, and Master data into the respective database tables
Extracted Batting, Pitching, Salaries, and Master tables from the database

ValueCap Systems - Proprietary

210

Connectivity
DataStage Oracle Enterprise Stage

ValueCap Systems - Proprietary

211

Lab 3C: Inserting Into RDBMS

ValueCap Systems - Proprietary

212

Lab 3C Objective

Insert the data stored within the Datasets created in Lab 3B into the database

ValueCap Systems - Proprietary

213

Creating Jobs for Lab 3C


In this lab you will use the data you wrote out to the Datasets in Lab 3B as the source data to be loaded into the target database table. The instructor should have pre-configured the necessary settings for database connectivity.
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

ValueCap Systems - Proprietary

214

Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters:

Use your own directory path, userid, and password

ValueCap Systems - Proprietary

215

Lab 3C Oracle
Create a new job and pull together the following stages (Dataset, Peek, and Oracle Enterprise):

Rename the links and stages accordingly
In the Dataset stage:

Use the FILEPATH parameter along with the Dataset filename created earlier in Lab 3B
Load the Batting table definition in the Columns tab. While this step is optional, it does provide design time metadata.
ValueCap Systems - Proprietary

216

Lab 3C Oracle (continued)


In the Oracle Enterprise stage:
Configure the settings as shown below, using the USERID and PASSWORD job parameters
For the Table option, precede the table name with your initials. If using a shared database environment, this will prevent conflicts.

ValueCap Systems - Proprietary

217

Lab 3C Oracle (continued)


Make sure to map source columns to target columns, going from Dataset to Peek to Oracle.
Save the job as lab3c_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were loaded:
Log into the database and issue a select count(*) from tablename query to double-check the record count
ValueCap Systems - Proprietary

218

Lab 3C Oracle (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3c_pitching, lab3c_salaries, and lab3c_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

219

Lab 3D: Extracting From RDBMS

ValueCap Systems - Proprietary

220

Lab 3D Objective

Extract Batting, Pitching, Salaries, and Master tables from the Database

ValueCap Systems - Proprietary

221

Creating Jobs for Lab 3D


In this lab you will extract the data you loaded into the database in Lab 3C
You will leverage the same setup for database connectivity as Lab 3C
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

Use the same USERID and PASSWORD job parameters from Lab 3C
ValueCap Systems - Proprietary

222

Lab 3D Oracle
Create a new job and pull together the following stages (Oracle Enterprise and Peek):

For the Oracle Enterprise stage:


Configure the settings as shown below, using the USERID and PASSWORD job parameters
Use the Table Read Method instead of SQL-Builder
ValueCap Systems - Proprietary

223

Lab 3D Oracle (continued)


Optionally, you can load the Batting table definition in the Columns tab for the Oracle Enterprise stage properties.
DataStage automatically extracts the table definition from the database and uses RCP to propagate it downstream
You can try the job with and without the table definition loaded in the Columns tab
If you load the Batting table definition, you may receive some type conversion warnings at runtime in the Director's Log, which can be ignored for this job.

ValueCap Systems - Proprietary

224

Lab 3D Oracle (continued)


Save the job as lab3d_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were extracted:

ValueCap Systems - Proprietary

225

Lab 3D Oracle (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3d_pitching, lab3d_salaries, and lab3d_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

226

Lab3C and Lab3D Review


Congratulations! You have successfully:
Loaded Batting, Pitching, Salaries, and Master data into the respective database tables
Extracted Batting, Pitching, Salaries, and Master tables from the database

ValueCap Systems - Proprietary

227

Connectivity
DataStage Teradata Enterprise Stage

ValueCap Systems - Proprietary

228

Lab 3C: Inserting Into RDBMS

ValueCap Systems - Proprietary

229

Lab 3C Objective

Insert the data stored within the Datasets created in Lab 3B into the database

ValueCap Systems - Proprietary

230

Creating Jobs for Lab 3C


In this lab you will use the data you wrote out to the Datasets in Lab 3B as the source data to be loaded into the target database table. The instructor should have pre-configured the necessary settings for database connectivity.
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

ValueCap Systems - Proprietary

231

Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters:

Use your own directory path, userid, and password

ValueCap Systems - Proprietary

232

Lab 3C Teradata
Create a new job and pull together the following stages (Dataset, Peek, and Teradata Enterprise):
Rename the links and stages accordingly
In the Dataset stage:
Use the FILEPATH parameter along with the Dataset filename created earlier in Lab 3B
Load the Batting table definition in the Columns tab. While this step is optional, it does provide design time metadata.

ValueCap Systems - Proprietary

233

Lab 3C Teradata (continued)


In the Teradata Enterprise stage:
For the Table option, precede the table name with your initials. If using a shared database environment, this will prevent conflicts.

NOTE: You may also need to specify a Database option and provide the name of the database to connect to.

ValueCap Systems - Proprietary

234

Lab 3C Teradata (continued)


Make sure to map source columns to target columns, going from Dataset to Peek to Teradata.
Save the job as lab3c_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were loaded:
Log into the database and issue a select count(*) from tablename query to double-check the record count
ValueCap Systems - Proprietary

235

Lab 3C Teradata (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3c_pitching, lab3c_salaries, and lab3c_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

236

Lab 3D: Extracting From RDBMS

ValueCap Systems - Proprietary

237

Lab 3D Objective

Extract Batting, Pitching, Salaries, and Master tables from the Database

ValueCap Systems - Proprietary

238

Creating Jobs for Lab 3D


In this lab you will extract the data you loaded into the database in Lab 3C
You will leverage the same setup for database connectivity as Lab 3C
Confirm this with the instructor
If the database is not pre-configured, then obtain the necessary connectivity details such as database server, database name, location, etc.

Use the same USERID and PASSWORD job parameters from Lab 3C
ValueCap Systems - Proprietary

239

Lab 3D Teradata
Create a new job and pull together the following stages (Teradata Enterprise and Peek):
For the Teradata Enterprise stage:
Configure the settings as shown below, using the USERID and PASSWORD job parameters
Use the Table Read Method
You may need to specify the Database option
ValueCap Systems - Proprietary

240

Lab 3D Teradata (continued)


Optionally, you can load the Batting table definition in the Columns tab for the Teradata Enterprise stage properties.
DataStage automatically extracts the table definition from the database and uses RCP to propagate it downstream
You can try the job with and without the table definition loaded in the Columns tab
If you load the Batting table definition, you may receive some type conversion warnings at runtime in the Director's Log, which can be ignored for this job.

ValueCap Systems - Proprietary

241

Lab 3D Teradata (continued)


Save the job as lab3d_batting
Compile and run the job
Verify that there are no errors in the Director's Log. Use the Job Performance Statistics to verify that 25076 records were extracted.

ValueCap Systems - Proprietary

242

Lab 3D Teradata (continued)


Repeat the process for Pitching, Salaries, and Master in jobs lab3d_pitching, lab3d_salaries, and lab3d_master respectively
The number of records read in each time should match the number of records written out
Make sure there are no errors in the Director's Log for each job.
There may be some warnings concerning data conversion. These are not critical.
ValueCap Systems - Proprietary

243

Lab3C and Lab3D Review


Congratulations! You have successfully:
Loaded Batting, Pitching, Salaries, and Master data into the respective database tables
Extracted Batting, Pitching, Salaries, and Master tables from the database

ValueCap Systems - Proprietary

244

Lab 3E: Importing a COBOL Copybook

ValueCap Systems - Proprietary

245

Lab 3E Objective

Import a COBOL copybook and save it as a table definition
Compare the DataStage table definition to the copybook

ValueCap Systems - Proprietary

246

Lab 3E Copybook
We will import the following COBOL copybook:
01 CLIENT-RECORD.
   05 FIRST-NAME   PIC X(16).
   05 LAST-NAME    PIC X(20).
   05 GENDER       PIC X(1).
   05 BIRTH-DATE   PIC X(10).
   05 INCOME       PIC 9999999V99 COMP-3.
   05 STATE        PIC X(2).
   05 RECORD-ID    PIC 999999999 COMP.

The copybook is located in a file called customer.cfd
You will need to have a copy of this file locally on your computer in order to import it into DataStage
ValueCap Systems - Proprietary

247

Importing the Copybook


To import a COBOL copybook:
Right-click on Table Definitions
Select Import
Select COBOL File Definitions

Click on Import to translate the Copybook into a DataStage table definition.

ValueCap Systems - Proprietary

248

Copybook Imported
If no errors occur, then your copybook has successfully imported and has been translated into a DataStage table definition. Double-click on the newly created table definition to view it

ValueCap Systems - Proprietary

249

Viewing the Translated Copybook


You can click on the Layout tab to view the copybook in its original format as well as its newly translated format! This is the DataStage internal schema format
Clicking on the Columns tab will show the table definition in a grid format
ValueCap Systems - Proprietary

250

Agenda
1. DataStage Overview - Page 10
2. Parallel Framework Overview - Page 73
3. Data Import and Export - Page 116
4. Data Partitioning, Sorting, and Collection - Page 252
5. Data Transformation and Manipulation - Page 309
6. Data Combination - Page 364
7. Custom Components: Wrappers - Page 420
8. Custom Components: Buildops - Page 450
9. Additional Topics - Page 477
10. Glossary - Page 526
ValueCap Systems - Proprietary

251

Data Partitioning, Sorting and Collection


In this section we will discuss:
Data Partitioning
Sorting Data
Duplicate Removal
Data Collection
Funnel Stage

ValueCap Systems - Proprietary

252

Data Partitioning
In Chapter 2 we very briefly touched upon the topic of partitioning:
[Diagram: a Partitioner splits a sequential Data File of records 1 - 4000 into four partitions of 1000 records each]

We also discussed the fact that parallelism would not be possible without partitioning.
For example, how would the following be accomplished without partitioning the data as it comes in from the sequential input file?
[Diagram: a sequential Input file feeding parallel stages A, B, and C, which load an RDBMS]

ValueCap Systems - Proprietary

253

What Is Data Partitioning?


Data Partitioning, simply put, is a means of distributing records amongst partitions.
A partition is like a division or logical grouping
Several partitioning algorithms exist

When you sit down at a card game, how are the cards dealt out?
The dealer typically distributes the cards evenly to all players. Each player winds up with an equivalent amount of cards.

When partitioning data, it is often desirable to achieve a balance of records in each partition
Too many records in any given partition is referred to as a data skew. Data skews cause overall processing times to take longer to finish.
ValueCap Systems - Proprietary

254

Data Partitioner Overview


In DataStage, there are many options for partitioning data:
Auto (default)
Random
Roundrobin
Same
Entire
Modulus
Range
Hash
DB2
ValueCap Systems - Proprietary

255

Auto Partitioning
By default, partitioning is always set to Auto
Auto means the Framework will decide the most optimal partitioning algorithm based on what the job is doing.

Partitioning is accessed from the same location for any given stage with an input link attached:

ValueCap Systems - Proprietary

256

Random Partitioning
The records are partitioned randomly, based on the output of a random number generator. No further information is required. Suppose we have the following record:
playerID - varchar yearID - integer teamID - char[3] ERA - float

The randomly partitioned records may look like:


Partition #1
behenri01 blackbu02 blylebe01 bordiri01 butchjo01 candejo01 clancji01 clarkbr01 clarkst02 coopedo01 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 CLE KCA MIN NYA MIN CAL TOR CLE TOR NYA 7.78 4.33 3.00 3.21 4.98 3.80 3.78 6.32 4.50 5.40

Partition #2
aasedo01 armstmi01 beckwjo01 boddimi01 brownma02 brownmi01 camacer01 caudibi01 1985 1985 1985 1985 1985 1985 1985 1985 BAL NYA KCA BAL MIN BOS CLE TOR 3.78 3.07 4.07 4.07 6.89 21.6 8.10 2.99

Partition #3
ackerji01 alexado01 atherke01 barklje01 birtsti01 blylebe01 1985 1985 1985 1985 1985 1985 TOR TOR OAK CLE OAK CLE 3.23 3.45 4.30 5.27 4.01 3.26

Partition #4
agostju01 allenne01 bairdo01 bannifl01 barojsa01 beattji01 beller01 berenju01 bestka01 1985 1985 1985 1985 1985 1985 1985 1985 1985 CHA NYA DET CHA SEA SEA BAL DET SEA 3.58 2.76 6.24 4.87 5.98 7.29 4.76 5.59 1.95

ValueCap Systems - Proprietary

257

Roundrobin Partitioning
Records are distributed very evenly amongst all partitions. Use this method (or Auto) when in doubt. The roundrobin partitioned records may look like:
Partition #1
aasedo01 allenne01 bannifl01 beckwjo01 bestka01 blylebe01 boydoi01 burrira01 camacer01 cerutjo01 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 BAL NYA CHA KCA SEA MIN BOS ML4 CLE TOR 3.78 2.76 4.87 4.07 1.95 3.00 3.79 4.81 8.10 5.40

Partition #2
ackerji01 armstmi01 barklje01 behenri01 birtsti01 boddimi01 brownma02 burttde01 candejo01 clancji01 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 TOR NYA CLE CLE OAK BAL MIN MIN CAL TOR 3.23 3.07 5.27 7.78 4.01 4.07 6.89 3.81 3.80 3.78

Partition #3
agostju01 atherke01 barojsa01 Beller01 blackbu02 boggsto01 brownmi01 butchjo01 Carych01 clarkbr01 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 CHA OAK SEA BAL KCA TEX BOS MIN DET CLE 3.58 4.30 5.98 4.76 4.33 11.57 21.6 4.98 3.42 6.32

Partition #4
alexado01 bairdo01 beattji01 berenju01 blylebe01 bordiri01 burnsbr01 bystrma01 caudibi01 clarkst02 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 TOR DET SEA DET CLE NYA CHA NYA TOR TOR 3.45 6.24 7.29 5.59 3.26 3.21 3.96 5.71 2.99 4.50

ValueCap Systems - Proprietary

258

Same Partitioning
Preserves whatever partitioning that is already in place.
Data remains in the same partition throughout the flow (aka partitioned parallelism), or until the data is repartitioned on purpose
Does not care about how the data was previously partitioned
Sets the Preserve Partitioning flag to prevent automatic repartitioning later

Same partitioning is most useful for preserving sort order

ValueCap Systems - Proprietary

259

Entire Partitioning
Places a complete copy of the data into each partition:
[Diagram: with Auto partitioning, a Data File of records 1 - 100,000 is split into four partitions of 25,000 records each; with Entire partitioning, each of the four partitions receives a complete copy of records 1 - 100,000]

Entire partitioning is useful for making a copy of the data available on all processing nodes of a shared nothing environment.
No shared memory exists between processing nodes
Entire forces a copy to be pushed out to each node
The Lookup stage does this
ValueCap Systems - Proprietary

260

Modulus Partitioning
Distributes records using a modulus function on the key column selected from the available list.
partition number = (field value) mod n, where n is the number of partitions, yielding a value in the range 0, 1, ..., n-1 (a conceptual sketch follows the example below)

Using our previous example record,


playerID - varchar yearID - integer teamID - char[3] ERA - float

we will perform a modulus partition on yearID. Results would look like this (Modulus+1=part#):
Partition #1
aasedo01 alexado01 allenne01 anderal02 anderri02 aquinlu01 atherke01 augusdo01 bailesc01 bairdo01 . . . 1988 1988 1988 1988 1988 1988 1988 1988 1988 1988 BAL DET NYA MIN KCA KCA MIN ML4 CLE TOR 4.05 4.32 3.84 2.45 4.24 2.79 3.41 3.09 4.9 4.05

Partition #2
aasedo01 ackerji01 agostju01 alexado01 allenne01 armstmi01 atherke01 bairdo01 bannifl01 barklje01 . . . 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 BAL TOR CHA TOR NYA NYA OAK DET CHA CLE 3.78 3.23 3.58 3.45 2.76 3.07 4.3 6.24 4.87 5.27

Partition #3
aasedo01 ackerji01 agostju01 agostju01 akerfda01 alexado01 allenne01 anderal02 andujjo01 aquinlu01 . . . 1986 1986 1986 1986 1986 1986 1986 1986 1986 1986 BAL TOR CHA MIN OAK TOR CHA MIN OAK TOR 2.98 4.35 7.71 8.85 6.75 4.46 3.82 5.55 3.82 6.35

Partition #4
aasedo01 akerfda01 aldrija01 alexado01 allenne01 allenne01 anderal02 anderri02 andersc01 andujjo01 . . . 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 BAL CLE ML4 DET CHA NYA MIN KCA TEX OAK 2.25 6.75 4.94 1.53 7.07 3.65 10.95 13.85 9.53 6.08
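The sketch below is a hedged, conceptual illustration in plain Python (not DataStage code) of the modulus rule applied to yearID with 4 partitions; as in the example above, 1988 (modulus 0) lands in partition #1, 1985 in #2, 1986 in #3, and 1987 in #4.

# Conceptual sketch of modulus partitioning on the integer key yearID:
# partition index = (key value) mod (number of partitions).
def modulus_partition(records, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        partitions[rec[key] % n_partitions].append(rec)
    return partitions

records = [{"playerID": "aasedo01", "yearID": 1985},
           {"playerID": "aasedo01", "yearID": 1986},
           {"playerID": "aasedo01", "yearID": 1987},
           {"playerID": "aasedo01", "yearID": 1988}]

# Partition numbers printed as modulus + 1, matching the slide's convention.
for i, part in enumerate(modulus_partition(records, "yearID", 4), start=1):
    print("Partition #%d:" % i, [r["yearID"] for r in part])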

ValueCap Systems - Proprietary

261

Range Partitioning
Partitions data into approximately equal-size partitions based on one or more partitioning keys.
Range partitioning is often a preprocessing step to performing a total sort on a dataset
Requires an extra pass through the data to create the range map

Suppose we want to range partition based on a baseball pitcher's Earned Run Average (ERA). We first have to create a range map file as shown here (a conceptual sketch of the idea follows below).
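As a hedged, conceptual illustration (plain Python, not DataStage code, with a made-up ERA sample), the sketch below shows the idea behind a range map: break points are derived from a sample of the key values so that each range holds roughly the same share of records, and each incoming record is routed to the partition whose range contains its key.

# Conceptual sketch of range partitioning on ERA.
import bisect

era_sample = sorted([0.00, 1.95, 2.25, 3.43, 3.78, 4.05, 4.48, 5.27, 5.92, 7.78])
n_partitions = 4

# Break points chosen so each range holds roughly the same share of the sample.
step = len(era_sample) // n_partitions
range_map = [era_sample[i * step] for i in range(1, n_partitions)]   # [2.25, 3.78, 4.48]

def range_partition(era):
    return bisect.bisect_right(range_map, era)      # partition number 0 .. n-1

for era in [0.5, 3.44, 4.4, 6.1]:
    print(era, "-> partition", range_partition(era))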

ValueCap Systems - Proprietary

262

Range Partitioning (Continued)


Once the range map is created, it must also be referenced from within the job that is performing the range partitioning:

Sorting typically occurs whenever range partitioning is performed, in order to best group records belonging in the same range.
Actual range values are determined by the Framework using an algorithm that attempts to achieve an optimal distribution of records.
ValueCap Systems - Proprietary

263

Range Partitioning (Continued)


Using the same example record, once the data has been range partitioned and sorted on just ERA, the results would resemble the following:
Partition #1
benneer01 mercejo02 dascedo01 chenbr01 stricsc01 brownke03 seoja01 gaettga01 damicje01 ceronri01 . . . 1995 2003 1990 2002 2002 1990 2002 1997 1999 1987 CAL MON CHN NYN MON NYN NYN SLN MIL NYA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Partition #2
clarkma01 moyerja01 downske01 ballaje01 bonesri01 wehrmda01 tomkobr01 loiseri01 coneda01 remlimi01 . . . 1996 2001 1990 1989 1994 1985 1997 1998 1999 2004 NYN SEA SFN BAL ML4 CHA CIN PIT NYA CHN 3.43 3.43 3.43 3.43 3.43 3.43 3.43 3.44 3.44 3.44

Partition #3
desseel01 cruzne01 marotmi01 towerjo01 zitoba01 tomkobr01 roberna01 clancji01 hudsoch02 johnto01 . . . 2001 2002 2002 2003 2004 2005 2005 1988 1988 1988 CIN HOU DET TOR OAK SFN DET TOR NYA NYA 4.48 4.48 4.48 4.48 4.48 4.48 4.48 4.49 4.49 4.49

Partition #4
foulkke01 searara01 willica01 smallma01 welleto01 bairdo01 staplda02 wengedo01 broxtjo01 smithmi03 . . . 2005 1985 1994 1996 2004 1987 1988 1998 2005 1987 BOS ML4 MIN HOU CHN PHI ML4 SDN LAN MIN 5.91 5.92 5.92 5.92 5.92 5.93 5.93 5.93 5.93 5.94

Range partitioning is very effective for producing balanced partitions and can be efficient if data characteristics do not change over time.
ValueCap Systems - Proprietary

264

Hash Partitioning
Partitions records based on the value of a key column or columns
All records with the same key column value will wind up in the same partition
Hash partitioning is often a preprocessing step to performing a total sort on a dataset
Poorly chosen partition key(s) can result in a data skew - that is, the majority of the records wind up in one or two partitions while the rest of the partitions receive no data.
For example, hash partitioning on gender would result in a data skew where the majority of records are spread between only 2 partitions. Skews are bad! (A conceptual sketch of hash partitioning follows below.)
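The sketch below is a hedged, conceptual illustration in plain Python (not DataStage code; crc32 merely stands in for whatever hash function the partitioner actually uses) showing the key property: records with the same key value(s) always land in the same partition, which is also why a low-cardinality key such as gender produces a skew.

# Conceptual sketch of hash partitioning on one or more key columns.
import zlib

def hash_partition(record, keys, n_partitions):
    key_value = "|".join(str(record[k]) for k in keys)
    return zlib.crc32(key_value.encode()) % n_partitions   # crc32 as a stand-in hash

records = [
    {"playerID": "aasedo01", "teamID": "BAL", "yearID": 1985},
    {"playerID": "aasedo01", "teamID": "BAL", "yearID": 1986},   # same keys -> same partition
    {"playerID": "abbotji01", "teamID": "CAL", "yearID": 1989},
]

for rec in records:
    print(rec["playerID"], rec["yearID"], "-> partition",
          hash_partition(rec, ["playerID", "teamID"], 4))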
ValueCap Systems - Proprietary

265

Hash Partitioning (Continued)


To select the key(s) to perform a Hash partitioning on, click on the Input tab found in the stage properties.
Select the Hash partition type and then select the partitioning key(s)
Check the Sort box to also sort the data once it has been partitioned
NOTE: Selection order matters! The first key selected will always act as the primary key
ValueCap Systems - Proprietary

266

Hash Partitioning (Continued)


Using the same example record, once the data has been hash partitioned and sorted on playerID and teamID, the results would resemble the following:
Partition #1
aasedo01 aasedo01 aasedo01 aasedo01 aasedo01 abbotpa01 abbotpa01 abbotpa01 abbotpa01 abbotpa01 . . . 1986 1985 1988 1987 1990 1991 1990 1992 2002 2000 BAL BAL BAL BAL LAN MIN MIN MIN SEA SEA 2.98 3.78 4.05 2.25 4.97 4.75 5.97 3.27 11.96 4.22

Partition #2
aasedo01 abbotky01 abbotky01 abbotky01 abbotky01 aceveju01 ackerji01 ackerji01 ackerji01 ackerji01 . . . 1989 1991 1996 1992 1995 2003 1990 1985 1991 1986 NYN CAL CAL PHI PHI TOR TOR TOR TOR TOR 3.94 4.58 20.25 5.13 3.81 4.26 3.83 3.23 5.20 4.35

Partition #3
aardsda01 abbotji01 abbotji01 abbotji01 abbotji01 abbotji01 abbotji01 abbotji01 abbotpa01 abbotpa01 . . . 2004 1989 1991 1995 1992 1996 1990 1999 1993 2003 SFN CAL CAL CAL CAL CAL CAL MIL CLE KCA 6.75 3.92 2.89 4.15 2.77 7.48 4.51 6.91 6.38 5.29

Partition #4
abbotji01 abbotji01 abbotji01 abbotji01 abbotpa01 abregjo01 aceveju01 aceveju01 aceveju01 aceveju01 . . . 1998 1995 1993 1994 2004 1985 2001 2003 1998 1999 CHA CHA NYA NYA TBA CHN FLO NYA SLN SLN 4.55 3.36 4.37 4.55 6.70 6.38 2.54 7.71 2.56 5.89

Note that all records with the same playerID and teamID value are now in the same partition.
ValueCap Systems - Proprietary

267

DB2 Partitioning
Distributes the data using the same partitioning algorithm as DB2.
Must supply specific DB2 table via the Partition Properties

NOTE: DB2 Enterprise stage automatically invokes the DB2 partitioner prior to loading data in parallel.
ValueCap Systems - Proprietary

268

Partitioning And Then Sorting!


When sorting data, always make sure it is pre-partitioned using either the Range or Hash partitioners.
Why? What happens when data is sorted without first Range or Hash partitioning it? The result will be useless in steps like de-duping or merging, since the data will not be truly sorted: records that belong together in the same partition would likely be found on different partitions. When running sequentially, partitioning is not necessary!

Partitioning and Sorting are very expensive operations


They require lots of CPU and disk I/O
Do not unnecessarily partition and sort the data!
ValueCap Systems - Proprietary

269

Sorting Techniques
There are 2 ways to sort data
The easiest way is to specify the sort key(s) on the input link properties for any given stage that supports an input link.
The actual Sort stage is also an option for sorting data within a flow. Functionality-wise, it is identical.

Sorting requires data to be pre-partitioned using either Range or Hash
Sorting sets the Preserve Partitioning flag, which forces Same partitioning to occur downstream
This avoids messing up the sorted order of the records
ValueCap Systems - Proprietary

270

Sort Properties
Common properties for sorting include
Unique removes duplicates, where duplicates are determined by the specified key fields being sorted on. For example, if sorting on playerID and teamID, then all records with the same playerID and teamID will be considered identical and only 1 will be kept (see the example below).
Stable preserves the relative order of records that have equal values in the sort keys, for example the order left behind by an earlier sort. If the option is not set, the order of records that tie on the sort key is not guaranteed.
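For instance, using the partition #1 rows shown earlier, a Unique sort on playerID and teamID would collapse the four aasedo01 / BAL rows into a single record (which of the four survives is not specified here):

    before: aasedo01 1986 BAL 2.98 / aasedo01 1985 BAL 3.78 / aasedo01 1988 BAL 4.05 / aasedo01 1987 BAL 2.25
    after:  a single aasedo01 BAL row (the remaining three are removed as duplicates)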
ValueCap Systems - Proprietary

271

Sort Stage Properties


The Sort Stage is more flexible than the sort option on the stage input link properties. Sort Stage allows for more advanced options to be leveraged:
Restrict Memory Usage by default, DS uses 20MB of memory per partition for sorting. It is a good idea to increase this amount if there is plenty of memory available. Sorting on disk is much slower than sorting in memory.
Create Cluster Key Change Column creates a marker to indicate each time a key change occurs.
o Useful for applications needing to identify key changes in order to apply business logic to individual record groups
ValueCap Systems - Proprietary

272

Removing Duplicates
The Remove Duplicates stage removes duplicate records based on specified keys.
Must specify at least 1 key
Selected key columns define a duplicate record
Can choose to keep the first or last record
Similar to using the Unique option under the Sort options

Records must be Hash partitioned and Sorted on the same keys

ValueCap Systems - Proprietary

273

Data Collection
Data collection is the opposite of data partitioning:
(Diagram: records 1-1000, 1001-2000, 2001-3000, and 3001-4000 flow from four partitions through a Collector into a single Data File)

All records from all partitions are gathered into a single partition. Collectors are used when:
Writing out to a sequential file
Processing data through a stage that runs sequentially

ValueCap Systems - Proprietary

274

Data Collection Methods


In DataStage, there are a few options for collecting data:
Auto (Default)
Roundrobin
Ordered
Sort Merge

ValueCap Systems - Proprietary

275

Auto Collector
By default, collecting is always set to Auto
Auto means the Framework will decide the optimal collecting algorithm based on what the job is doing.

Collector type is accessed from the same location for any given sequential stage with an input link attached:

ValueCap Systems - Proprietary

276

Roundrobin Collector
Collects records from multiple partitions in a roundrobin manner.
Collects a record from the first partition, then the second, then the third, and so on, until it reaches the last partition and then starts over again.
Extremely fast!
Typically the same as Auto

ValueCap Systems - Proprietary

277

Ordered Collector
Reads all records from the first partition, then all records from the second partition, and so on until all partitions have been read.
Useful for maintaining sort order: if the data was previously partition-sorted, the outcome will be a sorted single partition.
Could be slow if some partitions get backed up.

ValueCap Systems - Proprietary

278

Sort Merge Collector


Reads records in an order based on one or more key columns of the record.
Will maintain the sorted order
Must select at least 1 collecting key column
Collector key(s) should match partition-sorting key(s)
Similar to Ordered collector

Sort Merge not only acts as a collector, but also manages data flow from many partitions to fewer partitions
For example, a job can run 8-way parallel and then slow down to 4-way parallel. To accomplish this, the Framework leverages the Sort Merge to maintain the sort order and partitioning strategy.

ValueCap Systems - Proprietary

279

Link Indicators
The icons found on links are an indicator of what is happening in terms of partitioning or collecting.
Auto partitioning
Sequential to Parallel, data is being partitioned
Data re-partitioning
Same partitioning
Partition and Sort
Sort and Collect data
Collect data

ValueCap Systems - Proprietary

280

Funnel Stage
Collects many links and outputs only 1 link
All input links must possess the exact same data layout

Do not confuse with Collectors!!!


Funnel keeps everything running in parallel: many links come in, one link goes out. Each link represents many partitions.
Collectors go from parallel to sequential: one link with many partitions comes in, and all data is put into a single sequential partition on the output.

ValueCap Systems - Proprietary

281

Lab 4A: Data Partitioning

ValueCap Systems - Proprietary

282

Lab 4A Objectives

Learn more about the Peek stage Learn to invoke different partitioners Observe outcome from different partitioners Roundrobin, Entire, and Hash

ValueCap Systems - Proprietary

283

Changing Default Settings


Go to the Designer and use the Configuration File editor to create a 4 node configuration file as was done in Lab2A.
You may be able to skip this step if one was already previously created. Be sure to Save and Check the configuration file accordingly (a sketch of a typical 4-node file is shown below).
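For reference, a 4-node configuration file typically looks like the sketch below; the hostname and resource paths are placeholders and should match your own installation:

    {
      node "node1"
      {
        fastname "your_server_name"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
      node "node2"
      {
        fastname "your_server_name"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
    }

node3 and node4 repeat the same entries with their own node names, giving 4 logical processing nodes on the same server.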

In the Administrator, select Project Properties and enter the Environment editor
Find and set APT_CONFIG_FILE to the 4node configuration file you just created. This makes it the default for your project. Make sure APT_DUMP_SCORE is set to True.

Click OK button when finished editing Environment Variables. Click OK and then Close to exit the Administrator.
ValueCap Systems - Proprietary

284

Peek Stage Behavior


We will be using the Peek stage throughout this lab. Keep in mind the following:
Peek outputs 10 rows per partition by default in the Director log. You can specify for it to output as many or as few records per partition as you would like to see.
Peek displays the column names by default; this can be disabled.
Peek displays all columns by default; you can specify only the columns you would like to look at.
ValueCap Systems - Proprietary

285

Creating Lab4A - Roundrobin


Open lab3a_batting
Re-name and Save-As lab4a
Under the Peek stage Input properties, change the Partition type to Roundrobin
Save and compile
Run the job
Open the Director and view the job log
Compare your output to the one on the next slide. It should be similar. Note that there is no distinct pattern.
ValueCap Systems - Proprietary

286

Roundrobin Partitioning Output


Batting_Peek,0: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:bairdo01 yearID:1985 teamID:DET lgID:AL G:21 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:bandoch01 yearID:1985 teamID:CLE lgID:AL G:73 AB:173 R:11 H:24 DB:4 TP:1 HR:0 RBI:13 SB:0 IBB:0
Batting_Peek,0: playerID:barklje01 yearID:1985 teamID:CLE lgID:AL G:21 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:beattji01 yearID:1985 teamID:SEA lgID:AL G:18 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:beller01 yearID:1985 teamID:BAL lgID:AL G:4 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:berenju01 yearID:1985 teamID:DET lgID:AL G:31 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:bestka01 yearID:1985 teamID:SEA lgID:AL G:15 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
(Partitions 1 through 3 each show 10 records of their own in the same format, with no distinct pattern across partitions.)

ValueCap Systems - Proprietary

287

Entire Partitioning
Go back into the Peek stage Input properties, and change the Partition type to Entire
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one on the next slide. Note that each output is identical, since Entire places a copy of the data into each partition.

ValueCap Systems - Proprietary

288

Entire Partitioning Output


Batting_Peek,0: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0
Batting_Peek,0: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0
Batting_Peek,0: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4
Batting_Peek,0: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
(Partitions 1, 2, and 3 show exactly the same 10 records, since Entire copies every record to every partition.)

ValueCap Systems - Proprietary

289

Hash Partitioning (and Sorting)


Go back into the Peek stage Input properties, and change the Partition type to Hash and click on Sort
Select RBI as the key to Hash and Sort on
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one on the next slide. Note that data is grouped and sorted by RBI. All records with the same RBI value will be in the same partition.
ValueCap Systems - Proprietary

290

Hash and Sort Partitioning Output


Batting_Peek,0: playerID:brunato01 yearID:1994 teamID:ML4 lgID:AL G:16 AB:28 R:2 H:6 DB:2 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:sageraj01 yearID:1995 teamID:COL lgID:NL G:10 AB:3 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:burkeja02 yearID:2005 teamID:CHA lgID:AL G:1 AB:1 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:levinal01 yearID:2005 teamID:SFN lgID:NL G:9 AB:2 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:ortizra01 yearID:2005 teamID:CIN lgID:NL G:30 AB:54 R:1 H:4 DB:2 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:percitr01 yearID:2005 teamID:DET lgID:AL G:26 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:shielsc01 yearID:2005 teamID:LAA lgID:AL G:78 AB:1 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:seleaa01 yearID:2005 teamID:SEA lgID:AL G:21 AB:3 R:1 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:schoesc01 yearID:2005 teamID:TOR lgID:AL G:80 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Batting_Peek,0: playerID:washbja01 yearID:2005 teamID:LAA lgID:AL G:29 AB:4 R:1 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
(Partition 0 happens to hold the RBI:0 group; partitions 1 through 3 each show 10 records from their own RBI groups, such as RBI:3, RBI:6, and RBI:1, in the same format.)

ValueCap Systems - Proprietary

291

Changing the Sort Order


The results from the Hash and Sort were fairly boring; after all, it would be more interesting to see who had the highest RBI. Go back into the Peek stage Input properties, and change the Sort direction to Descending instead of Ascending.
Change the Sort Direction for the RBI key by right-clicking on it
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one on the next slide. Note that the data is still grouped and sorted by RBI; however, you can now see the players with the highest RBI statistics.

ValueCap Systems - Proprietary

292

Hash and Sort Partitioning Output


Batting_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14
Batting_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9
Batting_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15
Batting_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14
Batting_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9
Batting_Peek,0: playerID:castivi02 yearID:1998 teamID:COL lgID:NL G:162 AB:645 R:108 H:206 DB:28 TP:4 HR:46 RBI:144 SB:5 IBB:7
Batting_Peek,0: playerID:teixema01 yearID:2005 teamID:TEX lgID:AL G:162 AB:644 R:112 H:194 DB:41 TP:3 HR:43 RBI:144 SB:4 IBB:5
Batting_Peek,0: playerID:ramirma02 yearID:2005 teamID:BOS lgID:AL G:152 AB:554 R:112 H:162 DB:30 TP:1 HR:45 RBI:144 SB:1 IBB:9
Batting_Peek,0: playerID:sweenmi01 yearID:2000 teamID:KCA lgID:AL G:159 AB:618 R:105 H:206 DB:30 TP:0 HR:29 RBI:144 SB:8 IBB:5
Batting_Peek,0: playerID:gonzaju03 yearID:1996 teamID:TEX lgID:AL G:134 AB:541 R:89 H:170 DB:33 TP:2 HR:47 RBI:144 SB:2 IBB:12
(Partitions 1 through 3 are likewise sorted in descending RBI order within their own RBI groups; partition 1, for example, starts with ramirma02 at RBI:165 and partition 3 with sosasa01 at RBI:160.)

ValueCap Systems - Proprietary

293

Lab 4B: Data Collection

ValueCap Systems - Proprietary

294

Lab 4B Objective

Use collectors to process data sequentially
View the difference between the SortMerge, Ordered, and Roundrobin collectors

ValueCap Systems - Proprietary

295

Lab4B Collectors
Open lab4a and Save-As lab4b
Edit the job and add a second Peek stage
Go to the Advanced Stage properties for the 2nd Peek
Change the Execution Mode to Sequential
Click OK, save and compile lab4b.
Run lab4b and view the results in the Director log.
ValueCap Systems - Proprietary

296

Auto Collector Output


Instead of 4 outputs, Sequential_Peek will only produce 1 output in the Director log. This is because the 2nd Peek stage was running sequentially. Output should look like the following:
Sequential_Peek,0: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9
Sequential_Peek,0: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37
Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14
Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9
Sequential_Peek,0: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10
Sequential_Peek,0: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6
Sequential_Peek,0: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3
Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15
Sequential_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14
Sequential_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9

Note that in Auto mode, the Collector maintained the sort order on RBI
This suggests that the Framework decided to use the SortMerge Collector
ValueCap Systems - Proprietary

297

SortMerge Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to SortMerge
Be sure to set the Sort direction to Descending
No need to click on Sort
Save and compile the job
Run the job
Go to the Director and view the job log
Compare the output of the Sequential_Peek stage to the output on the previous slide. The output should be the same.
ValueCap Systems - Proprietary

298

Ordered Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to Ordered
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one on the next slide.

ValueCap Systems - Proprietary

299

Ordered Collector Output


Output from the Sequential_Peek should look like the following:
Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14
Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9
Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15
Sequential_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14
Sequential_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9
Sequential_Peek,0: playerID:castivi02 yearID:1998 teamID:COL lgID:NL G:162 AB:645 R:108 H:206 DB:28 TP:4 HR:46 RBI:144 SB:5 IBB:7
Sequential_Peek,0: playerID:teixema01 yearID:2005 teamID:TEX lgID:AL G:162 AB:644 R:112 H:194 DB:41 TP:3 HR:43 RBI:144 SB:4 IBB:5
Sequential_Peek,0: playerID:ramirma02 yearID:2005 teamID:BOS lgID:AL G:152 AB:554 R:112 H:162 DB:30 TP:1 HR:45 RBI:144 SB:1 IBB:9
Sequential_Peek,0: playerID:sweenmi01 yearID:2000 teamID:KCA lgID:AL G:159 AB:618 R:105 H:206 DB:30 TP:0 HR:29 RBI:144 SB:8 IBB:5
Sequential_Peek,0: playerID:gonzaju03 yearID:1996 teamID:TEX lgID:AL G:134 AB:541 R:89 H:170 DB:33 TP:2 HR:47 RBI:144 SB:2 IBB:12

Ordered Collector takes all records from the 1st partition, then the 2nd, then the 3rd, and finally the 4th.
Compare this output with the output from partition 0 for the Hash and Sort exercise in lab4a.
If the records had originally been range partitioned, then the resulting output would show up fully sorted.
ValueCap Systems - Proprietary

300

Roundrobin Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to Roundrobin
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one below:
Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14
Sequential_Peek,0: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9
Sequential_Peek,0: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6
Sequential_Peek,0: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37
Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9
Sequential_Peek,0: playerID:thomafr04 yearID:2000 teamID:CHA lgID:AL G:159 AB:582 R:115 H:191 DB:44 TP:0 HR:43 RBI:143 SB:1 IBB:18
Sequential_Peek,0: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3
Sequential_Peek,0: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10
Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15
Sequential_Peek,0: playerID:vaughmo01 yearID:1996 teamID:BOS lgID:AL G:161 AB:635 R:118 H:207 DB:29 TP:1 HR:44 RBI:143 SB:2 IBB:19

ValueCap Systems - Proprietary

301

Lab 4C: Funnel

ValueCap Systems - Proprietary

302

Lab 4C Objective

Create a job to illustrate Funnel stage operation

ValueCap Systems - Proprietary

303

Lab4C Building the Job


Create the following flow which consists of
3 Row Generator stages
1 Funnel stage
1 Peek stage

Enter the following table definition under the Row Generator stage properties Output Columns tab.

ValueCap Systems - Proprietary

304

Lab4C Saving the Table Definition


Once entered, save the table definition as shown below:

This allows us to re-use the table definition in other Row Generator stages.
ValueCap Systems - Proprietary

305

Lab4C Completing the Job


Load the saved table definition into the other 2 Row Generator stages
Edit the Row Generator properties such that Generator_1 generates 100 rows, Generator_2 200 rows, and Generator_3 300 rows
Once your job looks like the one on the right, save the job as lab4c.
Compile the job Run the job
ValueCap Systems - Proprietary

306

Lab4C Results
Verify that the record count going to the Peek stage is 600 rows (100+200+300):

Remember, links Input1, Input2, and Input3 get combined in the Funnel stage, which outputs only 1 link while maintaining the same number of partitions.
ValueCap Systems - Proprietary

307

Agenda
1. DataStage Overview  Page 10
2. Parallel Framework Overview  Page 73
3. Data Import and Export  Page 116
4. Data Partitioning, Sorting, and Collection  Page 252
5. Data Transformation and Manipulation  Page 309
6. Data Combination  Page 364
7. Custom Components: Wrappers  Page 420
8. Custom Components: Buildops  Page 450
9. Additional Topics  Page 477
10. Glossary  Page 526
ValueCap Systems - Proprietary

308

Data Transformation and Manipulation


In this section we will discuss the following:
Modify
Switch
Filter
Transform
Related Stages

ValueCap Systems - Proprietary

309

Modify Stage
Modify stage is useful and effective for light transformations:
Drop columns permanently remove columns that are not needed from the record structure
Keep columns specify which columns to keep (opposite of drop columns)
Null handling specify alternative null representation
Substring obtain only a subset of bytes from a Char column
Change data types alter column data types. Data must be compatible between data types.

o For example, a column of type Char[3] with a value of ABC cannot be changed to an Integer type.
ValueCap Systems - Proprietary
310

Modify Stage Properties


The Modify stage does not offer much support in terms of the correct syntax to use for the transformations it supports. To successfully use the Modify stage, you will need to consult the user's guide for the correct syntax. In general, the format is as follows:
DROP columnname [, columnname]
KEEP columnname [, columnname]
new_columnname [:new_type] = [explicit_conversion_function] old_columnname
ValueCap Systems - Proprietary

311

Null Handling Using Modify


Source Field    Destination Field    Result
Not Nullable    Not Nullable         Source value propagates to destination
Not Nullable    Nullable             Source value propagates to destination
Nullable        Not Nullable         If the source value is not null, the source value propagates. If the source value is null, an error occurs unless the Modify stage is used to handle the null representation
Nullable        Nullable             Source value propagates to destination
ValueCap Systems - Proprietary

312

Modify Stage Examples


Example transformations available in the Modify stage:
playerID = substring[0,3] (playerID) will start at the first character and grab the first 3 bytes.
startDate = year_from_date (startDate) will only retain the year value and discard month and day.
salary = handle_null (salary,0000000.00) will populate 0000000.00 into the salary column for any incoming salary column containing a null.
salary2 = string_from_decimal (salary) will copy the value from the salary column into the salary2 column. If salary2 did not previously exist, it will create it.
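Taken together, such transformations would be entered in the stage's specification roughly as follows (a sketch only; check the exact property name and function spellings against the user's guide for your release):

    KEEP playerID, startDate, salary
    startDate = year_from_date (startDate)
    salary = handle_null (salary, 0000000.00)
    salary2 = string_from_decimal (salary)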
ValueCap Systems - Proprietary

313

Switch Stage
Switch stage is useful for splitting up records and sending them down different links based on a key value.
Similar in behavior to a switch/case statement in C (see the sketch below)
Must provide a Selector field to perform the switch operation
Must specify the case value and corresponding output link number (starts at 0)
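Purely as an analogy (this is illustrative C++, not something DataStage executes), the routing logic resembles:

    #include <iostream>

    // Route a record based on an integer selector value, the way the
    // Switch stage maps case values to output link numbers (starting at 0).
    void route(int selectorValue) {
        switch (selectorValue) {
            case 1:  std::cout << "send record down output link 0\n"; break;
            case 2:  std::cout << "send record down output link 1\n"; break;
            default: std::cout << "If Not Found behavior: fail, drop, or reject\n"; break;
        }
    }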
ValueCap Systems - Proprietary

314

Switch Stage Properties


Selector Mode
User-defined Mapping (default) the user provides an explicit mapping of values to outputs
Auto can be used when there are as many distinct selector values as output links
Hash input rows are hashed on the selector column modulo the number of output links and assigned to an output link accordingly. In this case, the selector column must be of a type that is convertible to Unsigned Integer and may not be nullable.

ValueCap Systems - Proprietary

315

Switch Stage Properties (Continued)


If Not Found
Fail (default) an invalid selector value will cause the job to fail
Drop drops the offending record containing an invalid selector value
Output sends the offending record containing an invalid selector value to a reject link

Discard Value
Specifies an integer value of the selector column, or the value to which it was mapped using Case, that causes a row to be dropped (not rejected). Optional
ValueCap Systems - Proprietary

316

Filter Stage
Filter stage acts like the WHERE clause in a SQL SELECT statement
Supports 1 input and multiple output links, similar to the Switch stage
Can attach a reject link
Valid WHERE clause operations:
o six comparison operators: =, <>, <, >, <=, >=
o true / false
o is null / is not null
o like 'abc' (the second operand must be a regular expression)
o between (for example, A between B and C is equivalent to B <= A and A <= C)
o is true / is false / is not true / is not false
o and / or / not
ValueCap Systems - Proprietary

317

Other Filter Stage Properties


Output Rejects set to False by default. When set to True, values which do not meet any of the Filter criteria will be sent to a reject link.
Output Row Only Once set to False by default. When set to True, a record is output only to the first WHERE clause it satisfies; when set to False, the record is output to all WHERE clauses it satisfies.
Nulls Value determines whether a null value is treated as 'Greater Than' or 'Less Than' other values.
ValueCap Systems - Proprietary

318

Filter Stage Example


In this example, we filter on the pitchers' ERA statistics
ERA values below 3.25 will be sent down the first link
ERA values between 3.25 and 5.00 will be sent down the second link
ERA values greater than 5.00 will be sent down the third link
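The three Where clauses for this example might be written as follows (the ERA column name comes from the course data; whether 5.00 belongs to the second or third link is a design choice):

    Where (output link 0):  ERA < 3.25
    Where (output link 1):  ERA >= 3.25 and ERA <= 5.00
    Where (output link 2):  ERA > 5.00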

ValueCap Systems - Proprietary

319

Transform Stage
Transformer stage provides an extensible interface for defining data transformations
Supports 1 input and multiple outputs, including reject
Different user interface from other stages
o Source to target mapping is primary interface
(Screenshot: Source Columns on the left are mapped to Target Columns on the right)

Contains several pre-built transformations


o Transformations can be combined

Supports Constraints very similar to Filter stage functionality


ValueCap Systems - Proprietary

320

Transform Stage Features


As mentioned on the previous slide, the Transform stage contains many pre-built transformations.
To access these, do the following:

1. Double-click in the column derivation area to bring up the derivation editor
2. Right-click to access the menu
3. If the cursor is at the beginning of the line when you right-click, you will get the following:
4. If the cursor is at the end of the line when you right-click, you will get the following: select Function to access the pre-built transforms

ValueCap Systems - Proprietary

321

Transform Stage Features (Continued)


The Operand Menu provides easy access to other features besides the many pre-built transformations.
DS Macro returns job related information
DS Routine calls a function from a UNIX shared library, and may be used anywhere an expression can be defined
Job Parameter insert any pre-defined DataStage Job Parameter
Input Column provides a list of columns from the input link to choose from
Stage Variable a globally defined variable that can be derived using any supported transformation, and then re-used or referenced within any Derivation in the Transformer stage
ValueCap Systems - Proprietary

322

Transform Stage Features (Continued)


System Variable Framework level information that can be referenced as part of the derivation or constraint
o @FALSE The value is replaced with 0.
o @TRUE The value is replaced with 1.
o @INROWNUM Input row counter.
o @OUTROWNUM Output row counter (per link).
o @NUMPARTITIONS The total number of partitions for the stage.
o @PARTITIONNUM The partition number for the particular instance.

String enter a string value which will become a hard-coded value assigned to the column
Parentheses () inserts a pair of parentheses into the derivation field
If Then Else inserts If Then Else into the derivation field
ValueCap Systems - Proprietary

323

Transform Stage Stage Variables


Stage Variables offer the following advantages:
Similar to global program variables

o Scope is limited to the Transformer


Use to simplify derivations and constraints
Use to avoid duplicate coding
Retain values across reads
Use to accumulate values and compare current values with prior reads

ValueCap Systems - Proprietary

324

Transform Stage Derivations


Derivations can be applied to each output column or Stage Variable.
Specifies the value to be moved to an output column
Every output column must have a derivation.

o This can include a 1 to 1 map of the input value/column.


An output column does not require an input column

o Can hard code specific values o Can include derivations based on built-in or user-defined functions
ValueCap Systems - Proprietary

325

User Defined Transformer Routines


The Transformer stage provides an interface for incorporating user created functions written in C++
External Function Type: This calls a function from a UNIX shared library and may be used anywhere an expression can be defined. Any external function defined appears in the expression editor operand menu under Parallel Routines.
External Before/After Routine Type: This calls a routine from a UNIX shared library, and can be specified in the Triggers page of a Transformer stage Properties dialog box.

Note: Functions must be compiled with a C++ compiler (not a C compiler).
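As a rough illustration, such a function is an ordinary C++ function compiled into a UNIX shared library and then registered as a Parallel Routine in the Designer; the name and signature below are hypothetical, so check the documentation for the calling conventions your release expects.

    // bonus.cpp - compile with a C++ compiler into a shared library,
    // then declare it as a Parallel Routine so it appears under the
    // operand menu in the Transformer's expression editor.
    int add_bonus(int salary)
    {
        // trivial example transformation: add a fixed bonus to a salary value
        return salary + 1000;
    }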


ValueCap Systems - Proprietary

326

Transform Stage Sample Derivations


Job design:

Stage Variable derivation expands AL to American League and NL to National League, and stores the value in a Stage Variable called league
league is mapped to the newly introduced league_name column on both outputs; the transform is defined only once
Constraint separates AL records from NL records
yearID column is mapped to the year_in_league column
DownCase() makes all characters lower case (a sketch of these derivations follows below)
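A sketch of how these pieces might be entered (the input link name inLink is hypothetical, and applying DownCase() to playerID is just one possible use):

    league (Stage Variable derivation):   If inLink.lgID = "AL" Then "American League" Else "National League"
    league_name (output column):          league
    Constraint on the AL output link:     inLink.lgID = "AL"
    year_in_league (output column):       inLink.yearID
    playerID (output column):             DownCase(inLink.playerID)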
ValueCap Systems - Proprietary

327

Transform Stage Constraints


Constraints are used to filter out records
A constraint applies to the entire record
A constraint specifies a condition under which incoming rows of data will be written to an output link
If no constraint is specified, all records are passed through the link
Constraints are defined for each individual output link

o Checking the Reject Row box will send down that link only those records that did not meet the conditions specified in the constraints of the other links.
o No constraint is required for a Reject link.
ValueCap Systems - Proprietary

328

Transformer Stage Execution Order


Within the Transformer stage, there is an order of execution:
1. Stage Variables derivations are executed first
2. Constraints are executed before derivations
3. Column derivations in earlier links are executed before later links
4. Derivations in higher columns are executed before lower columns

ValueCap Systems - Proprietary

329

Transformer Null Handling


We had previously discussed null handling via the Modify stage. Specifically, null handling should occur for the following condition:
Source Field    Destination Field    Result
Nullable        Not Nullable         If the source value is not null, the source value propagates. If the source value is null, an error occurs unless the Modify stage is used to handle the null representation

The Transformer stage can also be used to perform null handling.


ValueCap Systems - Proprietary

330

Transformer Null Handling (Continued)


The Transformer provides several built-in null handling functions:
IsNotNull() returns true if the expression or input value does not evaluate to Null
IsNull() returns true if the expression or input value does evaluate to Null
NullToEmpty() sets the input value to an empty string if it was Null on input
NullToValue() sets the input value to a specific value if it was Null on input
NullToZero() sets the input value to zero if it was Null on input
SetNull() assigns a Null to the target column
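Hedged examples of how these functions might appear in derivations (the link and column names are hypothetical):

    NullToValue(inLink.salary, 0)                              use 0 whenever salary arrives as Null
    NullToEmpty(inLink.teamID)                                 substitute an empty string for a Null teamID
    If IsNull(inLink.debut) Then SetNull() Else inLink.debut   explicitly pass a Null through to a nullable target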
ValueCap Systems - Proprietary

331

Transformer Type Conversions


Similar to the Modify stage, the Transformer also offers a variety of built-in type conversion functions
Some type conversions are handled automatically by the framework, as indicated by d in the table. An m indicates manual conversion. Manual conversions available in the Transformer are listed here.

ValueCap Systems - Proprietary

332

Other Useful Transformer Functions


Date and Time Transformations
Date, Time, and Timestamp manipulations

String Transformations
Trim off spaces or characters
Compare values
Pad characters
Etc.
ValueCap Systems - Proprietary

333

Transform Stage Usage Sample


The best way to learn the Transformer is to use it extensively.
Use the Row Generator to generate data to test against
Test the built-in transformations against the generated data
Insert a Peek stage before and after the Transformer to compare before and after results

o BeforePeek will show the records before they are transformed
o AfterPeek will show the records as a result of the transformations applied in the Transformer stage.
ValueCap Systems - Proprietary

334

Relevant Stages
Change Capture Compares two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. An extra column is added to the output data set, containing a change code with values encoding the four actions: insert, delete, copy, and edit.
Change Apply Reads a record from the change data set (produced by Change Capture) and from the before data set, compares their key column values, and acts accordingly: If the before keys come before the change keys in the specified sort order, the before record is copied to the output and the change record is retained for the next comparison. If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:
Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order may be different from the after data set given to Change Capture.
Delete: The value columns of the before and change records are compared. If the value columns are the same, or if Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison. If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys has matching value columns, the first-encountered record is deleted. This may cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued and both the change record and the before record are discarded, i.e. no output record results.
Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This may be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued and the change record is copied to the output, but the stage retains the same before record for the next comparison.
Copy: The change record is discarded. The before record is copied to the output.
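A small hypothetical illustration may help; the key is the first field, the value is the second, and the change codes are shown by name rather than by their numeric values:

    before data set:  (A,100) (B,200) (C,300)
    after data set:   (A,100) (B,250) (D,400)
    Change Capture output:
        (B,250)  edit    - key exists in both, value changed
        (C,300)  delete  - key exists only in the before data set
        (D,400)  insert  - key exists only in the after data set
        (A,100)  copy    - unchanged (whether copies are output depends on the stage options)

Change Apply then replays these actions against the before data set to reproduce the after data set.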

ValueCap Systems - Proprietary

335

Relevant Stages
Difference Takes 2 presorted datasets as inputs and outputs a single data set whose records represent the difference between them. The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other. The stage generates an extra column, DiffCode, which indicates the result of each record comparison.
The Difference stage is similar, but not identical, to the Change Capture stage. The Change Capture stage is intended to be used in conjunction with the Change Apply stage. The Difference stage outputs the before and after rows to the output data set, plus a code indicating if there are differences. Usually, the before and after data will have the same column names, in which case the after data set effectively overwrites the before data set and so you only see one set of columns in the output. If your before and after data sets have different column names, columns from both data sets are output; note that any key and value columns must have the same name.

Compare Performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns. The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set. It is recommended that you use runtime column propagation (RCP) in this stage to allow DataStage to define the output column schema. The stage outputs three columns:
Result: Carries the code giving the result of the comparison.
First: A subrecord containing the columns of the first input link.
Second: A subrecord containing the columns of the second input link.
ValueCap Systems - Proprietary

336

Relevant Stages
Encode use any command available from the Unix command line to encode / mask data. The stage converts a data set from a sequence of records into a stream of raw binary data. An encoded data set is similar to an ordinary one, and can be written to a data set stage. You cannot use an encoded data set as an input to stages that perform column-based processing or re-order rows, but you can input it to stages such as Copy. You can view information about the data set in the data set viewer, but not the data itself. You cannot repartition an encoded data set.
Decode use any command available from the Unix command line to decode / unmask data. It converts a data stream of raw binary data into a data set. As the input is always a single stream, you do not have to define meta data for the input link.
Compress use either the Unix compress or GZip utility to compress data. It converts a data set from a sequence of records into a stream of raw binary data. A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a DataSet stage. However, a compressed data set cannot be processed by many stages until it is expanded. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.
Expand use either the Unix uncompress or GZip utility to de-compress data. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data.
ValueCap Systems - Proprietary

337

Relevant Stages
Surrogate Key generates a unique key column for an existing data set. You can specify certain characteristics of the key sequence. The stage generates sequentially incrementing unique integers from a given starting point. The existing columns of the data set are passed straight through the stage. Can be executed in parallel.
Column Generator generates additional column(s) of data and appends them onto an incoming record structure
Head outputs the first N records in each partition. Can optionally select records from certain partitions or skip a certain number of records.
Tail outputs the last N records in each partition. Can optionally select records from certain partitions.
Sample outputs a sample of the incoming data. Can be configured to perform either a percentage (random) or periodic sampling. Can distribute samples to multiple output links.
ValueCap Systems - Proprietary

338

Lab 5A: Modify vs Transformer

ValueCap Systems - Proprietary

339

Lab 5A Objectives
Learn more about the Modify and Transformer stages
Use the Modify stage to perform date field manipulations
Use the Transformer stage to perform the same date field manipulations
Compare results by using the Compare stage
Verify that the results are the same

ValueCap Systems - Proprietary

340

Lab5a High Level Overview


As a 1st step, lay out the following flow:

Make sure to label the stages and links accordingly. Use the Master.ds dataset in the Dataset stage, which was created in lab3b_master
In the Output Columns tab, click on Load to load the Master table definition (previously saved in lab3).

Head and Copy stages will use their default properties


Head will only keep the first 10 records per partition
Copy creates an identical copy of the data for each output link
ValueCap Systems - Proprietary

341

Lab5a Job Design


In the Modify stage, enter the following specifications:
Source to target mapping tells Modify to keep the debut column.

This splits the date column into separate Year, Month, and Day columns.
In the Output Columns tab for Modify, click on Load to load the Master table definition again. Add the following 3 additional column definitions:
debutYear Integer
debutMonth Integer
debutDay Integer
Modify will create these as part of the transformations defined by the specifications.
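The specifications themselves are shown in a screenshot that is not reproduced here; based on the format and functions covered earlier, they would look something like the sketch below (month_from_date and month_day_from_date are assumed function names, so verify them against the user's guide):

    debutYear = year_from_date (debut)
    debutMonth = month_from_date (debut)
    debutDay = month_day_from_date (debut)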
ValueCap Systems - Proprietary

342

Lab5a SimpleTransform Mappings


In the SimpleTransform stage, map the following:

Create a new Integer column and name it debutID


ValueCap Systems - Proprietary

343

Lab5a SimpleTransform Transformation


Next, double-click in the Derivation area next to debutID:

Double-click here to access Derivations

Enter the following Derivation for the target debutID column:


fromModify.debutYear+fromModify.debutDay*fromModify.debutMonth

This will calculate a debutID field based on the Year, Month, and Day that the baseball player played his first game.
Note: fromModify in the derivation is the name of the input link. If you used a different link name, then use that one.
ValueCap Systems - Proprietary

344

Lab5a ComplexTransform Mappings


In the ComplexTransform stage, map the following:

Create a new Integer column and name it debutID


ValueCap Systems - Proprietary

345

Lab5a ComplexTransform Stage Variable


Next, right-click under Stage Variables and select Insert New Stage Variable
A new Stage Variable named StageVar should be created

Double-click the Derivation field next to StageVar and enter the following Derivation:
YearFromDate(toTransformer.debut) + MonthDayFromDate(toTransformer.debut) * MonthFromDate(toTransformer.debut)
Each of these functions can be accessed by right-clicking and selecting Function from the menu; look under the Date & Time category. Column names can be accessed by right-clicking and selecting Input Column, or by typing toTransformer. and selecting from the list. You can keep it all on one line. Hit Enter when finished. Note: toTransformer in the derivation is the name of the input link. If you used a different link name, use that one.
ValueCap Systems - Proprietary

346

Lab5a ComplexTransform Transformation


Click on StageVar and drag it into the Derivation field for debutID.
When finished, your Derivation for debutID should appear similar to the one on the left. Stage Variables allow for the same derivation to be mapped to many different fields, but only calculated once. In this example, we are only mapping it to debutID, but we could have mapped it to any other Integer column.
ValueCap Systems - Proprietary

347

Lab5a Save, Compile, Run


Once finished editing the transformations, save the job as lab5a. Compile the job Run the job and open the Director to view the log
Compare the Peek output from the Modify and SimpleTransform combination against the Peek output from the ComplexTransform. The debutIDs for a given playerID should be identical. You can limit the number of columns output from the Peek (via Peek stage properties) to make it easier to read in the output log.

Once the job runs correctly, save a copy as lab5a_2.


ValueCap Systems - Proprietary

348

Lab5a Compare the Outputs


Instead of manually comparing the 2 outputs, there is an easier way to do this. Modify lab5a to look like the following:

ValueCap Systems - Proprietary

349

Lab5a Compare Stage


In the Compare stage properties, set up the following options:

We are assuming that all records will be unique in the Master file. We are comparing records based on playerID and debutID; any record with the same playerID and debutID value will be compared. If a different record shows up, then the job will abort.

For both input links, Hash and Sort on playerID and debutID.
ValueCap Systems - Proprietary

350

Lab5a Save, Compile, Run


Once finished editing, save the job, still keeping it as lab5a. Compile the job Run the job and open the Director to view the log
Did the job run to completion or did it fail? Assuming that the job was assembled correctly, the job should have run to completion successfully with no errors or warnings, implying that the Compare worked and all records are identical. In the Peek output, you will notice a new column called result. A value of 0 in result indicates that the records were identical.

Once the job runs correctly, save a copy as lab5a_2.


ValueCap Systems - Proprietary

351

Lab5a Using Full Data Volume


Now let's test against the full data volume by removing the Head stage:

Be sure to either retain the link labelled toCopy or rename the link going into the Copy stage to toCopy
Not doing this will break the source to target mapping on the output of the Copy stage.
ValueCap Systems - Proprietary

352

Lab5a Save, Compile, Run


Once finished editing, save the job, still keeping it as lab5a Compile the job Run the job and open the Director to view the log
Did the job run to completion or did it fail? The job should have run to completion successfully with no errors or warnings, implying that the Compare worked and all records are identical. Final output should show 3817 records processed. This can be seen from the Performance Monitor, which can be accessed from the Director: Tools -> New Monitor.
ValueCap Systems - Proprietary

353

Lab 5B: Filter vs Switch vs Transformer !

ValueCap Systems - Proprietary

354

Lab 5B Objective

Separate records in the Master file into 3 groups:


Records where birthYear <= 1965 Records where birthYear between 1966 and 1975 Records where birthYear >= 1976

Use Filter, Switch, and Transformer stages to accomplish the same task and achieve same results!

ValueCap Systems - Proprietary

355

Lab5b The BIG Picture!


Here's how we will achieve this! Use the same Master.ds and table definition as lab5a. When you are done, this is what your job will look like. Note that stages can be resized!
ValueCap Systems - Proprietary

356

Lab5b Design Strategy


Here's the best way to approach building such a complex job: 1. Build little by little. 2. Test progress along the way. In this example, you would first build #1, test it, then add #2, test it, then add #3, and test it. This way you minimize your debug efforts!
ValueCap Systems - Proprietary

357

Lab5b Filter Properties


In the Filter stage properties, you will need to define 3 outputs, one for each birthYear range. Note that output link numbering starts at 0.

How do you figure out which link number corresponds to which output link? Solution: click the Output Link Ordering tab.

ValueCap Systems - Proprietary

358

Lab5b Switch Properties


In the Switch stage properties, you will need to define at least 2 outputs, and optionally a 3rd:
While cumbersome, you will need to explicitly define which values flow down which output link. What about values where birthYear >= 1976? You have 2 options:

o Explicitly define the Case mappings
o Send all values where birthYear >= 1976 down a reject link as shown on the right (Note: Reject links do not allow mapping!)
Note that output link numbering here also starts at 0.
ValueCap Systems - Proprietary
359

Lab5b Transformer Properties


In the Transformer you will need to have 3 outputs links, each with a unique constraint:
birthYear <= 1965 birthYear >=1966 And birthYear<=1975 birthYear >= 1976

Note that column names need to be preceded by input link names. These Constraint Derivations can be entered either by typing them in manually or by using the GUI interface.
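For example, assuming the input link into the Transformer is named toTransformer (substitute your own link name), the middle constraint would be entered as:

    toTransformer.birthYear >= 1966 And toTransformer.birthYear <= 1975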
ValueCap Systems - Proprietary

360

Lab5b Save, Compile, Run


As you complete each section, save the job lab5b Compile and run the job. For each date range, you should consistently see the following record counts (also shown on next page)
1955 to 1965: 862 1966 to 1975: 1903 1976 to 1986: 1052 Total records: 3817

There should be no warnings or errors reported.

ValueCap Systems - Proprietary

361

Lab5b - Results

Verify that your record count results match those shown on the right. Also make sure the results are consistent for the Filter, Switch, and Transformer.

ValueCap Systems - Proprietary

362

Agenda
1. DataStage Overview  Page 10
2. Parallel Framework Overview  Page 73
3. Data Import and Export  Page 116
4. Data Partitioning, Sorting, and Collection  Page 252
5. Data Transformation and Manipulation  Page 309
6. Data Combination  Page 364
7. Custom Components: Wrappers  Page 420
8. Custom Components: Buildops  Page 450
9. Additional Topics  Page 477
10. Glossary  Page 526
ValueCap Systems - Proprietary

363

Data Transformation and Manipulation


In this section we will discuss the following: Join
InnerJoin LeftOuterJoin RightOuterJoin FullOuterJoin

Merge Lookup
DataSet RDBMS

Aggregator
ValueCap Systems - Proprietary

364

Performing Joins
There are 4 different types of joins explicitly supported by DataStage:
InnerJoin (default) LeftOuterJoin RightOuterJoin FullOuterJoin

To illustrate the functionality, we will use the following 2 sets of record inputs. Note that all columns in this example are of character type.

Left Input                     Right Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken


365

ValueCap Systems - Proprietary

InnerJoin
InnerJoin will result in records containing LeftField and RightField where KeyField is an exact match
Left Input                     Right Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

InnerJoin Output
LastName   KeyField   FirstName
Clemens    456        Roger
Ryan       789        Nolan
Ryan       789        Ken


366

ValueCap Systems - Proprietary

LeftOuterJoin
LeftOuterJoin will result in all records from the left input and only the records from the right input where KeyField is an exact match
Left Input                     Right Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

LeftOuterJoin Output
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Maddux     012        (blank)
Clemens    456        Roger

What happened here? Because there was no match, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

367

ValueCap Systems - Proprietary

RightOuterJoin
RightOuterJoin will result in all records from the right input and only the records from the left input where KeyField is an exact match
Left Input                     Right Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

RightOuterJoin Output
LastName   KeyField   FirstName
(blank)    123        Randy
Clemens    456        Roger
Ryan       789        Nolan
Ryan       789        Ken

What happened here? Because there was no match, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

368

ValueCap Systems - Proprietary

FullOuterJoin
FullOuterJoin will result in all records from both inputs and the records where KeyField is an exact match.

Left Input                     Right Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

FullOuterJoin Output
LastName   leftRec_KeyField   rightRec_KeyField   FirstName
Ryan       789                789                 Nolan
Ryan       789                789                 Ken
Maddux     012                (blank)             (blank)
(blank)    (blank)            123                 Randy
Clemens    456                456                 Roger

Note: A blank is populated where there's no match. If the field was numeric type, a zero would have been inserted. If the field was nullable, then a null would have been inserted.
ValueCap Systems - Proprietary

369

Join Stage Properties


Join stage
Must have at least 2 inputs, but supports more Must specify at least 1 join key
o Join key(s) must have same name on both inputs

All inputs must be pre-hashed and sorted by the join key(s). No reject capability
o Need to perform post-processing to detect failed matches (check for nulls, blanks, or 0s); applicable for LeftOuterJoin, RightOuterJoin, and FullOuterJoin (see the example after this list).

Always use Link Ordering to differentiate between Left and Right input Links!
o Label your links accordingly
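For example (a sketch only, using column names from a lab later in this course and assuming unmatched values arrive as blanks rather than nulls), a Filter stage placed after a LeftOuterJoin could separate the failed matches with a Where clause such as:

    nameFirst = '' And nameLast = ''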
ValueCap Systems - Proprietary

370

Merge Stage Properties


Merge combines a Master and an Update record based on matching key field(s)
Must have at least 2 inputs 1 Master input and 1+ Update input(s) Can have multiple Updates, but only 1 Master input
o Update records are consumed once matched with Master record o Master records must be duplicate free o Update records can contain duplicates

Key needs to have same field name All inputs must be pre hashed and sorted by the merge key(s) Supports optional reject record processing simply attach reject link Always use the Link Ordering tab to verify correct Master and Update order
o Label your links accordingly
ValueCap Systems - Proprietary

371

Merge Stage
To illustrate the Merge stage functionality, we will use the following 2 sets of record inputs. Note that all columns in this example are of character type.

Master Input                   Update Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

ValueCap Systems - Proprietary

372

Merge Stage Action Keep Unmatched


Merge with Unmatched Masters Mode = Keep
Master Input                   Update Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Merge Properties:

Merge Output
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Maddux     012        (blank)
Clemens    456        Roger

What happened here? Because there was no Update, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

Optional Reject Output
KeyField   FirstName
123        Randy

ValueCap Systems - Proprietary


373

Merge Stage Action Drop Unmatched


Merge with Unmatched Masters Mode = Drop
Master Input                   Update Input
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Merge Properties:

Merge Output
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Clemens    456        Roger

Dropped Record
LastName   KeyField
Maddux     012

Optional Reject Output
KeyField   FirstName
123        Randy


374

ValueCap Systems - Proprietary

Lookup Stage Properties


Lookup maps record(s) from the lookup table to an input record with matching lookup key field(s)
Must have at least 2 inputs 1 Primary input and 1+ Lookup Table input(s) Can have multiple Lookup Tables, but only 1 Primary input
o Lookup tables are expected to be duplicate free, but duplicates are allowed o Update records can contain duplicates

Inputs do not need to be partitioned or sorted Lookup Tables are pre-loaded into shared memory
o Always make sure that your lookup table fits in available shared memory

Uses interface very similar to that of the Transformer stage


ValueCap Systems - Proprietary

375

Lookup Stage Properties (Continued)


Lookup stage supports conditional lookups
Derivations for conditional lookup entered similar to Transformer derivations:
Allow Duplicates in Lookup Table

Supports various error handling modes:


o Continue pass input records that fail the lookup and/or condition through to the output. o Drop permanently drop records that fail the lookup and/or condition o Fail default option, causes entire job to fail if lookup and/or condition are not met o Reject output records that fail the lookup and/or condition to a reject link

ValueCap Systems - Proprietary

376

How to Perform a Lookup


Step #1: Identify the lookup key(s). Lookup key(s) can be designated by checking the key column.

Step #2: Map the input key to the corresponding lookup key. Field names do not need to match

Step #3: Map the input columns from both the input and the lookup table to the output.

ValueCap Systems - Proprietary

377

Lookup Table Sources


Lookup tables can be from virtually any source
The reference link going into the Lookup stage can be from a larger flow, not just a data source such as flat files or parallel Datasets Lookup Filesets
o Allows lookup tables to be persistent o Must pre-define lookup key(s) o Creates a persistent indexed lookup table o Uses a .fs extension

Sparse Lookups RDBMS


o Database lookup instead of loading the lookup table into shared memory, the lookup table remains inside the database o Good for situations where lookup table already resides inside the database and is much larger than the primary input data
ValueCap Systems - Proprietary

378

Performing Sparse Lookups


Sparse Lookups
Supported for the following Enterprise stages:
o Oracle, DB2, Sybase, and ODBC

Lookup table source must be one of the supported RDBMS stages above
o Specify Lookup Type = Sparse in the RDBMS stage
o Optionally specify your own lookup SQL by using the User Defined SQL option instead of Table Read Method (see the sketch below)

Lookup stage still works the same way as before
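As a hedged illustration only (the table and column names are taken from this course's sample data, and the ORCHESTRATE.column placeholder syntax is an assumption you should verify against your database stage's documentation), a user-defined sparse lookup statement might look like:

    select nameFirst, nameLast from MASTER where playerID = ORCHESTRATE.playerID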


ValueCap Systems - Proprietary

379

Lookup Example Continue Mode (No Dups)


Lookup with Lookup Failure Mode set to Continue and Duplicates to false will result in all records from the Primary input and only the records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Continue Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Maddux     012        (blank)
Clemens    456        Roger

What happened here? Because the lookup failed, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

380

ValueCap Systems - Proprietary

Lookup Example Continue Mode (With Dups)


Lookup with Lookup Failure Mode set to Continue and Duplicates to true will result in all records from the Primary input and only the records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Continue Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Maddux     012        (blank)
Clemens    456        Roger

What happened here? Because the lookup failed, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

381

ValueCap Systems - Proprietary

Lookup Example Drop Mode (No Dups)


Lookup with Lookup Failure Mode set to Drop and Duplicates to false will result in only records from the Primary input and the corresponding records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Drop Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Clemens    456        Roger

ValueCap Systems - Proprietary

382

Lookup Example Drop Mode (With Dups)


Lookup with Lookup Failure Mode set to Drop and Duplicates to true will result in only records from the Primary input and the corresponding records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Drop Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Clemens    456        Roger


383

ValueCap Systems - Proprietary

Lookup Example Reject Mode (No Dups)


Lookup with Lookup Failure Mode set to Reject and Duplicates to false will result in only records from the Primary input and the corresponding records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Reject Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Clemens    456        Roger

Reject Record
LastName   KeyField
Maddux     012

ValueCap Systems - Proprietary

384

Lookup Example Reject Mode (With Dups)


Lookup with Lookup Failure Mode set to Reject and Duplicates to true will result in only records from the Primary input and the corresponding records from the lookup table where KeyField is an exact match
Primary Input                  Lookup Table
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Lookup with Reject Mode Results
LastName   KeyField   FirstName
Ryan       789        Nolan
Ryan       789        Ken
Clemens    456        Roger

Reject Record
LastName   KeyField
Maddux     012

ValueCap Systems - Proprietary

385

Aggregator Stage Properties


Aggregator stage performs aggregations based on user-defined grouping criteria
Must specify at least 1 key for grouping criteria Input data must be minimally hash partitioned by the grouping key. Optionally define a column or columns for calculation
o Over 15 aggregation functions available o If aggregation function is not selected, all functions will be performed.
ValueCap Systems - Proprietary

386

Aggregator Stage Properties (Continued)


Aggregation types
Calculation performs an aggregation function against one or more selected columns

Re-Calculation Similar to Calculate but performs the specified aggregation function(s) on a set of data that had already been previously aggregated, using the Summary Output Column property to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then specify the aggregation functions to perform against it, and the output column to carry the result. Row Count count the total number of unique records within each group as defined by the grouping key criteria.
ValueCap Systems - Proprietary

387

Aggregator Stage Properties (Continued)


Aggregation Methods
Hash Default option. Use this mode when the number of unique groups, as defined by grouping key(s), are relatively small; generally, fewer than about 1000 groups per megabyte of memory.
o Input data should be previously hash partitioned by the grouping key(s) o Memory intensive

Sort Use this mode when there are a large number of unique groups as defined by grouping key(s), or if unsure about the number of groups
o Input data should be previously hash partitioned and sorted by the grouping key(s) o Uses less memory, but more disk I/O
ValueCap Systems - Proprietary

388

Aggregator Example
Suppose we would like to find out, based on the data in our baseball Salaries.ds dataset, the following:
How many players are on each team each year What the average salary is per team each year.

The flow would look like the following

ValueCap Systems - Proprietary

389

Aggregator Example (Continued)

Default output data type is double


Output Mapping:

Output Mapping:

Calculate the player count and average salary separately and join the results together afterwards. Note: Data is being hash and sorted prior to the copy. Why?
ValueCap Systems - Proprietary

390

Lab 6A: Join & Lookup

ValueCap Systems - Proprietary

391

Lab 6A Objectives
Use Join stage to map a baseball players first and last name to his corresponding Batting record(s) Repeat the above functionality using the Lookup stage Repeat the above functionality using the Merge stage

ValueCap Systems - Proprietary

392

Lab6a Data Sources


Review the table definitions for the data we will be working with:
Batting Data
Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
G             Games
AB            At Bats
R             Runs
H             Hits
DB            Doubles
TP            Triples
HR            Homeruns
RBI           Runs Batted In
SB            Stolen Bases
IBB           Intentional walks

We will leverage the playerID key that exists in both datasets to identify and map the correct nameFirst and nameLast columns. Note that a given playerID value will likely appear in many records, based on how many years he played in the league. While the playerID will be the same, yearID should always be different.

Master Data
Column Name   Description
playerID      A unique code assigned to each player
birthYear     Year player was born
birthMonth    Month player was born
birthDay      Day player was born
nameFirst     Player's first name
nameLast      Player's last name
debut         Date player made first major league appearance
finalGame     Date player made last major league appearance

nameFirst and nameLast are the 2 columns you are interested in.

ValueCap Systems - Proprietary

393

Lab6a Job Design


The objective is to leverage the Join stage to map the player's correct first and last names to his batting record. Build a job similar to the following:

Make sure this is the left or primary input.

Hash & Sort on Join key

Which Join type will you use to ensure that all records from your Batting.ds file make it through to the output? We only care about picking up nameFirst and nameLast columns from the Master data
Only map these two columns on the output of the Join stage, and remember to disable RCP for this stage so that other columns are not propagated along.
ValueCap Systems - Proprietary

394

lab6a_join Job Results


Save the job as lab6a_join Compile and Run How many records did the Job Monitor report on the output of the Join?
25076 player batting records (4720 unique player batting records) 3817 master player records

In the Director Job Log, what did the Peek stage report?
Here's an example of the output:

Based on the above record counts and Peek output, it's obvious that we don't have master data for all players in the batting data.

ValueCap Systems - Proprietary

395

lab6a_join Design Update


Next, let's separate out Batting records for which no Master is available. Append onto the original flow a Filter stage to separate out records where Master data is not available:

What kind of Filter criteria can you use to accomplish this?


If there's no Master data match, what happens to the nameFirst and nameLast columns? If there's no match, wouldn't nameFirst = nameLast?
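As a hint (a sketch only, assuming the unmatched name columns come through as blanks rather than nulls), the Where clause on the no-match output could be as simple as nameFirst = '', with the matched output using nameFirst <> ''.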

ValueCap Systems - Proprietary

396

lab6a_join Updated Job Results


Save the job again as lab6a_join Compile and Run How many records did the Job Monitor report on each output of the Filter?
19877 player batting records where there was a Master record match 5199 player batting records where there was no Master record match

ValueCap Systems - Proprietary

397

lab6a_lookup Overview
Save lab6a_join as lab6a_lookup Next, replace the Join stage with the Lookup stage
Make sure to have your links setup correctly Use the same lookup key as join key Make sure that the Fail condition is set to Continue so that the job does not fail when a lookup failure is encountered

ValueCap Systems - Proprietary

398

lab6a_lookup Job Results


Save the job as lab6a_lookup Compile and Run How many records did the Job Monitor report on each output of the Filter?
Should be the same as lab6a_join 19877 player batting records where there was a Master record match 5199 player batting records where there was no Master record match

ValueCap Systems - Proprietary

399

lab6a_lookup Design Update


Is there another way to achieve the same results without using the Filter stage? Try creating the following flow to see if you can replicate the same results:
Save As lab6a_lookup2 19877 player batting records where there was a Master record match 5199 player batting records where there was no Master record match

ValueCap Systems - Proprietary

400

Lab6a What about the Merge?


Could you have used the Merge stage to replicate the same functionality in lab6a_join and lab6a_lookup?
The Batting data contains multiple records with the same playerID value. That is because a given player typically stays in the league multiple years. The Master data contains unique records. Merge expects its Master input to contain no duplicates, where duplicate records are defined by the merge keys only. Merge also consumes the Update record once a match occurs against an incoming Master record. Answer: Yes! What if you used the Master data as the Master input and the Batting data as the Update?
ValueCap Systems - Proprietary

401

lab6a_merge Design
Save job lab6a_lookup2 as lab6a_merge Edit the job to reflect the following Save, Compile, and Run Your results should match:
19877 player batting records where there was a Master record match 5199 player batting records where there was no Master record match
ValueCap Systems - Proprietary

402

Lab 6B: Aggregator

ValueCap Systems - Proprietary

403

Lab 6B Objectives

Use the Aggregator stage to perform the following:


Find the pitcher with the best ERA per team per year Find the pitcher(s) with the highest salary per team per year

o Note: Some pitchers may have the same salary


Determine if it's the same person!

ValueCap Systems - Proprietary

404

Lab6b Data Sources


Review the table definitions for the data we will be working with:
Pitching Data
Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
W             Wins
L             Losses
SHO           Shutouts
SV            Saves
SO            Strikeouts
ERA           Earned Run Average

We will leverage the playerID, yearID, lgID, and teamID keys that exist in both datasets to identify and map the correct salary column.

Salaries Data
Column Name   Description
yearID        Year
teamID        Team
lgID          League
playerID      Player ID code
salary        Salary

ValueCap Systems - Proprietary

405

Lab6b Job Design


Here's what the job will look like once you are finished building it!
To ease development, you will build this job one part at a time and test the results along the way.

ValueCap Systems - Proprietary

406

lab6b_aggregator Step 1
Use the Aggregator stage to find the pitcher with the lowest ERA on each team, each year.

Should be <=, >=

Use the Filter stage to eliminate records where ERA < 1 AND W < 5 (see the sketch after this list)
o It's not likely for a pitcher to have a legitimate season ERA less than 1.00 and have won fewer than 5 games

In the Aggregator stage, isolate the record with the lowest ERA per team per year
o Group by teamID and yearID keys o Calculate minimum value for ERA
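A hedged sketch of those two settings (exact property names may differ slightly in your stage editors, and minERA is just an illustrative output column name):

    Filter, Where clause:        ERA >= 1 Or W >= 5
    Aggregator, Grouping Keys:   teamID, yearID
    Aggregator, Calculation:     Column = ERA, Minimum Value Output Column = minERA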
ValueCap Systems - Proprietary

407

lab6b_aggregator Step 1 (Continued)


Step 1 continued
Aggregator will produce 1 record per team per year, containing the lowest ERA value for that year.
o Will not contain playerID

playerID Lookup Need to use Lookup, Join, or Merge to map playerID and other relevant columns back to the output of the Aggregator
o For Example:

Note: Be sure to disable RCP if not mapping all columns across

ValueCap Systems - Proprietary

408

lab6b_aggregator Step 1 (Continued)


Save the job as lab6b_aggregator Compile Run the job Verify, that your record counts match the following:

ValueCap Systems - Proprietary

409

lab6b_aggregator Step 2
Use another Aggregator stage to find the pitcher with the highest salary on each team, each year. Extend the flow as shown below:

ValueCap Systems - Proprietary

410

lab6b_aggregator Step 2 (Continued)


Salary Lookup first, you will need to remove all records from the Salaries data which do not belong to a pitcher (the Salaries data contains both batter and pitcher data)
Use the Pitching data to identify the pitchers Perform a Lookup against the Salaries data and only the pitcher salaries will be returned!

ValueCap Systems - Proprietary

411

lab6b_aggregator Step 2 (Continued)


Calculate Top Salary Use the Aggregator stage to find the highest paid pitcher on each team for each year
Group by teamID and yearID keys Calculate maximum value for salary

playerID Lookup Need to use Lookup, Join, or Merge to map playerID and other relevant columns back to the output of the Aggregator.
For Example:

Note: Be sure to disable RCP if not mapping all columns across

ValueCap Systems - Proprietary

412

lab6b_aggregator Step 2 (Continued)


Save the job, keeping it as lab6b_aggregator Compile and Run the job Verify, that your record counts match the following:

NOTE: It is likely that salary data was not available for all pitchers in the Pitching data set Also, some pitchers may have the same salary.

ValueCap Systems - Proprietary

413

lab6b_aggregator Step 3
Finally, determine whether or not the pitchers with the best ERA records are also the ones who are being paid the most
Extend the flow as shown below:

ValueCap Systems - Proprietary

414

lab6b_aggregator Step 3 (Continued)


The Answer use the Lookup stage to find all records where best ERA = highest salary
Lookup will send all matching records where the pitcher with the best ERA also had the highest salary Enable rejects and all records which do not match the above criteria will flow down the reject link instead

ValueCap Systems - Proprietary

415


lab6b_aggregator Final Results


Save the job, keeping it as lab6b_aggregator Compile and Run the job Verify, that your record counts match the following:

Answer: Having the best ERA does not correlate to being the best paid pitcher on the team!
ValueCap Systems - Proprietary

417

lab6b_aggregator Optimization
A simple optimization that can be performed in this job is to hash and sort the data only once, before Copy1, instead of doing it twice as before. Remember, the data needs to be hashed and sorted for the Aggregator stage to function properly when using the Sort mode.

ValueCap Systems - Proprietary

When processing large volumes of data, eliminating unnecessary hash and sorts will improve your performance!
418

Agenda
1. DataStage Overview  Page 10
2. Parallel Framework Overview  Page 73
3. Data Import and Export  Page 116
4. Data Partitioning, Sorting, and Collection  Page 252
5. Data Transformation and Manipulation  Page 309
6. Data Combination  Page 364
7. Custom Components: Wrappers  Page 420
8. Custom Components: Buildops  Page 450
9. Additional Topics  Page 477
10. Glossary  Page 526
ValueCap Systems - Proprietary

419

Wrappers
In this section we will discuss the following: Wrappers
What is a wrapper Use case How to create

ValueCap Systems - Proprietary

420

What is a Wrapper?
DataStage allows users to leverage existing applications within a job by providing the means to call the executable from within a Wrapper. Wrappers can be:
Any executable application (C, C++, COBOL, Java, etc) which supports standard input and standard output, or named pipes Executable scripts (Korn shell, awk, PERL, etc) Unix commands

DataStage treats wrappers as a black box


Does not manage or know about what goes on within a wrapper.

Wrappers support from 0 to many input and output links


Should match the action of the executable being wrappered

ValueCap Systems - Proprietary

421

What is a Wrapper? (Continued)


Wrappers are considered to be external applications
Data must be 1st exported from DataStage before it's passed on to the wrapped executable. Once processed by the wrapped executable, data must be imported back into DataStage before further processing can occur.
Data is exported from DataStage to Unix. Data is processed through Unix's grep command. Data is imported from Unix to DataStage.

Example Wrapper ValueCap Systems - Proprietary

422

Why Wrapper?
Wrappers allow existing executable functionality to be redeployed as a stage within DataStage
Re-use the logic in 1 or many different jobs Achieve higher scalability and better performance than running it sequentially
o Some applications cannot or should not be executed in parallel. o Some applications require the entire dataset in a single partition, thus inhibiting its ability to process in parallel

Avoid re-hosting of complex logic by creating a Wrapper instead of a complete DataStage job.

ValueCap Systems - Proprietary

423

Wrapper Use Case


An existing legacy COBOL application performs the following:
Reads data in from a flat file Performs lookups against a database table Scores the data Writes the results out to a flat file

Because this COBOL application does not need to be processed sequentially and can support named pipes for the input, it becomes an ideal candidate for becoming a Wrapper.
The Wrapper will appear as a stage and can be used in any applicable job
COBOL Wrapper

Input file A w B
RDBMS

ValueCap Systems - Proprietary

424

Creating a Wrapper Step 1


To create a Wrapper, 1st do the following:

Stage Type Name will be the name that shows up on the palette Command is where the name of the executable is entered.

grep is a Unix command for searching; in this example, it searches for any text containing the string NL (see the example below).

Execution Mode is parallel by default, but can be sequential.
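For instance, the behaviour of the wrapped command is exactly what you would see at a Unix prompt (the sample records below are made up, and assume comma-delimited exported data):

    echo "aardsda01,2006,CHN,NL" | grep NL    # record contains NL, so it passes through
    echo "suzukic01,2006,SEA,AL" | grep NL    # no NL, so the record is filtered out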


ValueCap Systems - Proprietary

425

Creating a Wrapper Step 2


Next, you should define the Input and Output Interfaces

The Input Interface tells DataStage how to export the data in a format that is digestible by the wrappered application remember, the data is being sent to the wrappered executable to be processed The Output Interface tells DataStage how to re-import the data that has been processed by the wrappered executable. This action is very similar to what happens when DataStage is reading in a flat file. For multiple Inputs and/or Outputs, define an interface for each
o Note: Link numbering starts at 0
ValueCap Systems - Proprietary

426

Creating a Wrapper Step 3


For both Input and Output Interfaces, be sure to specify the Stream properties
Defines characteristic of both input and output data Also supports the use of named pipes for input/output

ValueCap Systems - Proprietary

427

Creating a Wrapper Step 4


Next, define any environment variables and/or exit codes which may occur from the wrappered executable.
For example, an executable may return a 1 when finished successfully
ValueCap Systems - Proprietary

428

Creating a Wrapper Step 5


The Wrapper also supports command line arguments and options to be defined

This step is optional Use the Properties tab to enter this information Will see an example of this in the lab.

ValueCap Systems - Proprietary

429

Creating a Wrapper Step 6


Finally, to create the Wrapper:

1st click on the Generate button 2nd click on OK This will create the Wrapper and store it under the Category you specify.

ValueCap Systems - Proprietary

430

Locating the Wrapper


Once created, the Wrapper will be accessible from the List View or the Palette
Category name can be user-defined or changed later Wrappers can be exported via the Manager and re-imported into any other DataStage project. Use Wrappers just like any other stage in a job Double-Click the Wrapper in the List View to change its properties or definition

ValueCap Systems - Proprietary

431

Lab 7A: Simple Wrapper

ValueCap Systems - Proprietary

432

Lab 7A Objectives
Create a simple Wrapper using the Unix sort command Apply the Wrapper in a DataStage job

ValueCap Systems - Proprietary

433

Unix Sort
To learn how to use the Unix sort utility, simply type in man sort at the Unix command line to bring up the online help
It should look similar to the screenshot to the right: Sort utility can take data from standard input and write the sorted data to standard output
ValueCap Systems - Proprietary

434

Creating a New Wrapper


To create a new Wrapper, right-click on Stage Types, select New Parallel Stage, and click on Wrapped Enter the following information in the Wrapper stage editor

ValueCap Systems - Proprietary

435

Defining the Input Interface


Click on the Wrapped -> Interfaces -> Input tabs
Use the Batting table definition we created in Lab 3
Select the appropriate table definition
Specify Standard Input as the stream property

ValueCap Systems - Proprietary

436

Defining the Output Interface


Click on the Wrapped -> Interfaces -> Output tabs
Use the Batting table definition we created in Lab 3
Select the appropriate table definition
Specify Standard Output as the stream property

ValueCap Systems - Proprietary

437

Generating the Wrapper


Click on Generate button and then the OK button at the bottom of the Wrapper stage editor

Look in the Repository View under the Stage Types Wrapper category
Verify that the newly created UnixSort Wrapper is there

ValueCap Systems - Proprietary

438

Testing the Wrapper


To test the newly created UnixSort Wrapper stage, assemble the following job:

Use Batting.ds dataset Use the Batting table definition created in Lab 3 Use the Input Partitioning tab on the UnixSort stage to specify Hash on playerID do not click on the Sort box!
o Remember, you must hash on the sort key!

ValueCap Systems - Proprietary

439

UnixSort Test Output


Save the job as lab7a Compile the job Run the job and view the results in the Director log
The Peek output should reveal that all records are sorted based on playerID By default, the Unix Sort utility will use the first column as the column to perform the sort on.

ValueCap Systems - Proprietary

440

Lab 7B: Advanced Wrapper

ValueCap Systems - Proprietary

441

Lab 7B Objectives
Create a Wrapper using the Unix sort command which supports user-defined options Apply Wrapper in a DataStage job

ValueCap Systems - Proprietary

442

Advanced UnixSort Wrapper


The Unix Sort utility supports user-defined options
One option is to specify the column delimiter it looks for whitespace delimiter by default Another option is to specify the column to perform the sort on Example: sort -t , +1 -2 will look for , as the column delimiter and perform a sort using column #2 as the sort key

Create a new Wrapper that allows you to specify a column delimiter and a key to perform the sort on
ValueCap Systems - Proprietary

443

Defining the AdvancedUnixSort Wrapper


Right-click on the UnixSort Wrapper stage in the Repository View and select Create Copy Edit the copied Wrapper and change the name to AdvancedUnixSort Keep everything else the same on the General tab
ValueCap Systems - Proprietary

444

AdvancedUnixSort Wrapper Options


Click on the Properties tab and define the following:

-t used by sort to define the column delimiter. Use , as the default value, as this is what DataStage uses when exporting data.
start defines the start position for the sort key, based on column number reference (+1 = end of 1st column).
stop defines the stop position for the sort key, based on column number reference (-2 = end of 2nd column).
Specify the Conversion values as shown above.
ValueCap Systems - Proprietary

445

Generating the Wrapper


Click on Generate button and then the OK button at the bottom of the Wrapper stage editor

Look in the Repository View under the Stage Types Wrapper category
Verify that the newly created AdvancedUnixSort Wrapper is there

ValueCap Systems - Proprietary

446

Testing the Advanced Wrapper


To test the newly created AdvancedUnixSort Wrapper stage, assemble the following job:

Use Batting.ds dataset Use the Batting table definition created in Lab 3 Edit the properties for the AdvancedUnixSort stage
o Specify the Column Delimiter to be , o Set the End Position to -3 (i.e. teamID) o Set the Start Position to +2 o Specify the Hash key to be teamID
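With those settings, and given the Conversion values defined on the Properties tab, the command DataStage assembles should be roughly equivalent to running the following at a Unix prompt (an approximation for illustration only):

    sort -t , +2 -3

i.e. sort on column #3 (teamID), using , as the column delimiter.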
ValueCap Systems - Proprietary

447

AdvancedUnixSort Test Output


Save the job as lab7b Compile the job Run the job and view the results in the Director log
The Peek output should reveal that all records are now sorted based on teamID By default, the Unix Sort utility would have used the first column as the column to perform the sort on.

ValueCap Systems - Proprietary

448

Agenda
1. DataStage Overview  Page 10
2. Parallel Framework Overview  Page 73
3. Data Import and Export  Page 116
4. Data Partitioning, Sorting, and Collection  Page 252
5. Data Transformation and Manipulation  Page 309
6. Data Combination  Page 364
7. Custom Components: Wrappers  Page 420
8. Custom Components: Buildops  Page 450
9. Additional Topics  Page 477
10. Glossary  Page 526
ValueCap Systems - Proprietary

449

Buildops
In this section we will discuss the following: Buildops
What is a Buildop Use cases How to create Example

Note: Buildop is considered advanced functionality within DataStage. In this section you will learn the basics for how to create a simple Buildop.
ValueCap Systems - Proprietary

450

What is a Buildop?
DataStage allows users to create a new stage using standard C/C++; this stage is called a Buildop. Buildops can be any ANSI-compliant C/C++ code
Code must be syntactically correct Code must be able to be compiled by C/C++ compiler If code does not work outside of DataStage, it will not work within DataStage!

DataStage Framework treats Buildops as a native stage


Stage created using Buildop is native to the Framework, whereas a Wrapper is calling an external executable which is non-native. Does not need to export / import data like the Wrapper did Does not manage or know about what goes on within the custom written code itself, but does manage the parallel execution of the code
ValueCap Systems - Proprietary

451

Why Buildop?
Buildops allow users to extend the functionality provided by DataStage out of the box Buildops offer a high performance means of integrating custom logic into DataStage Once created, a Buildop can be reused in any job and shared across projects Buildops only require the core business logic to be written in C/C++
DataStage will take care of creating the necessary infrastructure to execute the business logic
ValueCap Systems - Proprietary

452

Buildop Use Case


A custom scoring algorithm does the following:
Identifies customers who live within a certain area AND have household income above some fixed amount Each record which meets the above criteria is identified by a special value populated in the status column

Because this logic does not need to be processed sequentially, it becomes an ideal candidate for becoming a Buildop.
Buildop

Input file A b B
RDBMS

ValueCap Systems - Proprietary

453

Buildop vs Transformer
The use case scenario described on the previous slide could also have been easily implemented in the transformer. Buildop Advantages:
Use standard C/C++ code, allows existing logic to be re-used High performance buildops are considered native to the Framework, whereas the Transformer must generate code Supports multiple inputs and outputs

Transformer Advantages:
Simple graphical interface Pre-defined functions and derivations are easy to access No need to pre-define input and output interface
ValueCap Systems - Proprietary

454

Creating a Buildop Step 1


To create a Buildop, 1st do the following:

Stage Type Name will be the name that shows up on the palette. Operator is the reference name that the Framework will use; this is often kept the same as the Stage Name. Execution Mode is parallel by default, but can be sequential.
ValueCap Systems - Proprietary

455

Creating a Buildop Step 2


Next, you must define the Input and Output Interfaces

The Input Interface describes to the Buildop the column(s) being operated on. Note: Only specify the columns that will be used within the Buildop. Any column defined must be referenced in the code! The Output Interface describes to the Buildop the column(s) being written out. Note: Only specify the columns that will be used within the Buildop. Any column defined must be referenced in the code! For multiple Inputs and/or Outputs, define an interface for each
o Define Port Names in order to track inputs / outputs o When there's only 1 input/output, there's no need to define Port Name
ValueCap Systems - Proprietary

456

Creating a Buildop Step 2 (Continued)


Auto Read/Write
This defaults to True which means the stage will automatically read and/or write records from/to the port. If set to False, you must explicitly control the read and write operations in the code. Once a record is written out, it can no longer be accessed from within the Buildop

RCP Runtime Column Propagation


False by default. Set this to True to force all columns not defined in the Input Interface to propagate through and be available downstream. If set to False, no columns other than those defined in the Output Interface will show up downstream.
ValueCap Systems - Proprietary

457

Creating a Buildop Step 3 (Optional)


The Transfer tab allows users to define record transfer behavior between input and output links.

Useful when there's more than 1 input or output link.
Auto Transfer: defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.
Separate: defaults to False, which means the transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.
ValueCap Systems - Proprietary

458

Creating a Buildop Step 4 (Optional)


In the Build Logic Definitions tab, you can define variables which will be used as part of the business logic. Variables can be standard C types or Framework data types
Some examples Framework data types include: APT_String, APT_Int32, APT_Date, APT_Decimal

ValueCap Systems - Proprietary

459

Creating a Buildop Step 5 (Optional)


In the Build Logic Pre-Loop and Post-Loop tabs, you can define logic using C/C++ to be executed before and after record processing:
Pre-Loop: logic that is processed before any records have been processed
Post-Loop: logic that is processed after all records have been processed

ValueCap Systems - Proprietary

460

Creating a Buildop Step 6


The core C/C++ business logic is entered in the Per-Record tab
Use any standard ANSI C/C++ code. Leverage built-in Framework function and macro calls.

Directly reference columns!

Per-Record processing: logic is executed against each record. Once a record has been written out, it cannot be recalled. Buffering of records and managing the record input and output flow is possible, but these are advanced topics.
ValueCap Systems - Proprietary

461

Creating a Buildop Step 7


Finally, to create the Buildop, click on the Generate button to compile the logic into a stage

If there are no syntax errors or other violations in the Buildop definition, you should obtain an Operator Generation Succeeded status window similar to the one below:

ValueCap Systems - Proprietary

462

Locating the Buildop


Once created, the Buildop will be accessible from the List View or the Palette
Category name can be user-defined or changed later. Buildops can be exported via the Designer and re-imported into any other DataStage project. Use Buildops just like any other stage in a job. Double-click the Buildop in the List View to change its properties or definition.

ValueCap Systems - Proprietary

463

Buildop Usage Example


Heres an example of the Buildop in action:
Input Interface Output Interface

Input Columns

Output Columns

Note the new column

Sample Output:

ValueCap Systems - Proprietary

464

Lab 8A: Buildop

ValueCap Systems - Proprietary

465

Lab 8A Objectives
Create a Buildop to perform the following:
Derive the pitcher's Win-Loss percentage based on his Win-Loss record for the season and populate the result into a new column. Expand the lgID value to either National League or American League and populate the result into a new column.

ValueCap Systems - Proprietary

466

Lab 8A Overview
Here's the simple job we will be creating to test out the Buildop. Overview:
Use the Pitching.ds dataset and the Pitching table definition created in Lab 3 (the Win-Loss and lgID columns are in the Pitching data). Use the following formula to calculate Win-Loss Percentage:

o WLPercent = (Wins / (Wins+Losses) ) x 100


If lgID = AL, then set leagueName = American League If lgID = NL, then set leagueName = National League
ValueCap Systems - Proprietary

467

Creating a New Buildop


To create a new Buildop, right-click on Stage Types, select New Parallel Stage, and click on Build Enter the following information in the Buildop stage editor

ValueCap Systems - Proprietary

468

Input / Output Buildop Table Definitions


Create an input and an output table definition for Buildop
Remember, only specify the columns that will be referenced within the Buildop code itself. For the input, create the following and save it as /Labs/Lab8/Buildop_Input

For the output, create the following and save it as /Labs/Lab8/Buildop_Output

ValueCap Systems - Proprietary

469

Defining the Input Interface


Click on the Build -> Interfaces -> Input tabs. Select the Buildop_Input table definition you just created.
Do not use the Pitching table definition! Set Auto Read to True. Set RCP to True.

ValueCap Systems - Proprietary

470

Defining the Output Interface


Click on the Buildop -> Interfaces -> Output tabs. Select the Buildop_Output table definition you just created.
Do not use the Pitching table definition! Set Auto Write to True. Set RCP to True.

ValueCap Systems - Proprietary

471

Defining the Per-Record Logic


Enter the following C code into the Per-Record section
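The screenshot of the code is not reproduced here; below is a minimal sketch consistent with the lab objectives, assuming the Buildop_Input columns are W, L, and lgID, the Buildop_Output columns are WLPercent and leagueName, and the single-port convention where interface columns are referenced directly by name (verify the column-access style and string type behaviour against the Buildop documentation):

/* Win-Loss percentage; guard against pitchers with no decisions. */
if (W + L > 0)
    WLPercent = ((double) W / (double) (W + L)) * 100.0;
else
    WLPercent = 0.0;

/* Expand the league code into a full league name. */
if (lgID == "AL")
    leagueName = "American League";
else if (lgID == "NL")
    leagueName = "National League";
else
    leagueName = lgID;   /* pass anything else through unchanged */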

ValueCap Systems - Proprietary

472

Generating the Buildop


When finished editing the Per-Record section, click on the Generate button If everything was entered correctly, you should get a similar success dialogue:

Click Close and then OK


ValueCap Systems - Proprietary

473

Locating the Buildop


Once successfully created, the buildop will be accessible from the Repository View and the Palette (under the Buildop category).

ValueCap Systems - Proprietary

474

Testing the Buildop


Test the newly created Buildop using the flow discussed at the beginning of this lab:

Save as lab8a Compile and run lab8a - a random sample output is shown here:

ValueCap Systems - Proprietary

475

Agenda
1. DataStage Overview  Page 10
2. Parallel Framework Overview  Page 73
3. Data Import and Export  Page 116
4. Data Partitioning, Sorting, and Collection  Page 252
5. Data Transformation and Manipulation  Page 309
6. Data Combination  Page 364
7. Custom Components: Wrappers  Page 420
8. Custom Components: Buildops  Page 450
9. Additional Topics  Page 477
10. Glossary  Page 526
ValueCap Systems - Proprietary

476

Additional Topics
In this section we will provide a brief overview of: Job Report Generator Containers
Local Shared

Job Sequencer DataStage Designer


Export Import

ValueCap Systems - Proprietary

477

Job Report Generator


Developers often neglect to document the applications they develop. Fortunately, DataStage jobs are, for the most part, self-documenting and fairly easy to decipher. Another built-in feature, the Report Generator, offers an automated way to document DataStage jobs.

ValueCap Systems - Proprietary

478

Generating a Job Report


Access the Generate Report option from the File menu in the Designer
Make sure that the job is open in the Designer
Specify name for the report.

Then click on OK to create the report.


ValueCap Systems - Proprietary

479

Viewing the Job Report


Once the report is generated successfully, a dialog will appear that will let you view the report, or open the Report Console so that you can view all reports. You can also open the Report Console by opening the Web Console from the Start menu, and selecting the Reporting tab.

ValueCap Systems - Proprietary

480

Viewing the Job Report


The report is a hyperlinked document which allows you to access information about details of the job.

ValueCap Systems - Proprietary

481

Job Report Details


For example, clicking on the SimpleTransform stage will show the following documentation:

All derivations will be listed

ValueCap Systems - Proprietary

482

Containers
Containers are used in DataStage to visually simplify jobs and create re-usable logic flows
Containers can contain 1 or more stages and have input/output links Local Containers are only accessible from within the job where it was created

o Local Container can be converted into a Shared Container o Local Containers can be deconstructed back into the original stages within the flow
Shared Containers are accessible to any job within a project

o Shared Container can be converted into a Local Container
o Shared Containers cannot be deconstructed
ValueCap Systems - Proprietary

483

Creating a Container
First, draw a line around the specific stages that you would like to place into a container. Make sure that only the stages you want are selected!
In this example, we are only selecting the Transformer and the Funnel

ValueCap Systems - Proprietary

484

Creating a Container (Continued)


Next, click on the Edit menu, select Construct Container, and then either Local or Shared.
You can also use the icons on the toolbar

ValueCap Systems - Proprietary

485

Creating a Container
Once created, the job with a shared container will look like the following:

The contents of the Container can be viewed in a separate window, by right-clicking on the Container and selecting the Properties option.

ValueCap Systems - Proprietary

486

Shared vs Local Containers


The primary difference between Shared and Local Containers is that Shared containers can be re-used in other jobs.

ValueCap Systems - Proprietary

487

Job Sequencer
The Job Sequencer provides an interface for managing the execution of multiple jobs
To create a Job Sequence, select it from the New menu. Next, drag and drop the jobs onto the canvas and link them as you would with any 2 stages.

In this example, lab5a_1 will execute 1st, and then lab5a_2, and then lab5a_3.

ValueCap Systems - Proprietary

488

Sequencer Stages
The Job Sequencer has a lot of built-in function to assist with job flow management
Handle exceptions such as errors and warnings Send message via email or pager Execute external applications or scripts Wait for file activity prior to executing job

o Useful for batch applications which are dependent on arrival of input data
Control execution based on completion and condition of executed jobs

ValueCap Systems - Proprietary

489

Job Sequencer Example


Once a Job Sequence has been created, it behaves just like any other DataStage job

If the DataStage jobs use Job Parameters, you must pass in the value for those parameters from within the Sequencer

o Can define Job Parameters for a Job Sequence and pass those parameters into the interface for each job being called.
Need to Save the job, Compile, and Run it. Sequencer Job can be scheduled just like any other DataStage job. ValueCap Systems - Proprietary

490

DataStage Manager
Use the DataStage Designer client to import or export:
Entire Project
1 or Many Jobs
Shared Containers
Buildops
Wrappers
Routines
Table Definitions
Executables

Supports internal DSX format or XML for imports and exports


ValueCap Systems - Proprietary

491

Export Interface
Specify name and location for export
Specify whole project (backup) or individual objects
Append or Overwrite existing DSX or XML export files
Note: Items should not be open in the Designer when performing exports
ValueCap Systems - Proprietary

492

When to Export
Use the Designer to perform job / project exports
When upgrading DataStage, it's considered a good practice to:

1. Export the projects
2. Delete the projects
3. Perform the upgrade
4. Re-import the projects

Upgrades will proceed much faster this way.
Export jobs, containers, stages, etc., and check the DSX or XML files into source control
Export to a DSX or XML file in order to migrate items between DataStage servers
Export the entire project as a means of creating a backup
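As a hedged aside, project backups do not have to be taken through the GUI. Information Server 8.x also ships an istool command-line client that can export DataStage assets to an .isx archive (a different format from the Designer's DSX/XML). The sketch below is an assumption-heavy example: the install path, host, credentials, engine name, project name and exact option spellings should all be verified against the istool help on your system.

  # Sketch: export every object in project "dstrain" to an archive file.
  cd /opt/IBM/InformationServer/Clients/istools/cli
  ./istool export \
      -domain ishost:9080 -username dsadm -password secret \
      -archive /backups/dstrain_backup.isx \
      -datastage '"etlengine/dstrain/*/*.*"'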
ValueCap Systems - Proprietary

493

Import Interface
The import interface is simpler than that of the export
Specify location of the DSX or XML
Use the Perform Usage Analysis feature to ensure nothing gets accidentally overwritten during import

You can also select only specific items to import by using the Import Selected option
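Imports can be scripted as well. As a companion to the export sketch earlier (and with the same caveat that the host, credentials, paths, target project and exact option spellings are assumptions to verify against the istool help), an archive can be loaded into a project from the command line:

  # Sketch: import a previously exported archive into project "dstrain".
  ./istool import \
      -domain ishost:9080 -username dsadm -password secret \
      -archive /backups/dstrain_backup.isx \
      -datastage '"etlengine/dstrain"'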
ValueCap Systems - Proprietary

494

Lab 9A: Job Report

ValueCap Systems - Proprietary

495

Lab 9A Objectives
Generate a Job Report:
Open job lab5a_1
Use the Job Report utility to generate a report
Examine the results

ValueCap Systems - Proprietary

496

Lab 9A Overview
Open the job lab5a_1

ValueCap Systems - Proprietary

497

Generating a Job Report


Access the Generate Report option from the File menu in the Designer
Make sure that lab5a_1 is open in the Designer
1. Specify the location where the report will be generated and saved.

2. Click on OK to create the report.

ValueCap Systems - Proprietary

498

Viewing Reports

3. After the report is generated, you should see the dialog box shown above. Click on the Reporting Console link.

ValueCap Systems - Proprietary

499

Viewing Reports

4. This should take you to the Reporting tab of the Information Server Web Console, shown above. Starting with the Reports option in the Navigation pane on the left, navigate to the folder containing the job report you just created.
ValueCap Systems - Proprietary

500

Viewing Reports
Your Web Console should now look something like this:

ValueCap Systems - Proprietary

501

Viewing Reports
5. Select the report you just created, and click View Report Result in the pane on the right. You should now see a job report similar to the one shown on the left. Try clicking on the stage icons and see what happens.

ValueCap Systems - Proprietary

502

Lab 9B: Shared Containers

ValueCap Systems - Proprietary

503

Lab 9B Objectives
Create a shared container using a subset of logic from a previously created job
Edit the Shared Container to make it more generic
Reuse the Shared Container in a separate job

ValueCap Systems - Proprietary

504

Lab 9B Overview
Open the job lab6a_lookup
Left-click and drag your cursor around the stages as shown below by the red box:

ValueCap Systems - Proprietary

505

Creating the Shared Container


Next, click on the Shared Container icon on the toolbar

You can also click on the Edit menu, select Construct Container, and then select Shared.
Save the Shared Container as MasterLookup

Your flow should now look similar to this:

ValueCap Systems - Proprietary

506

Modify the Job


Modify your job by adding 2 copy stages as shown below:

This is to work around an issue with performance statistics.


The Peek stages will only report the number of records output to the Director log
Adding the Copy stages will display an accurate record count

ValueCap Systems - Proprietary

507

Testing the Shared Container


Save the job as lab9b_1
Compile the job and run it
There should be the following output:
19877 player batting records going to Copy1, where there was a Master record match
5199 player batting records going to Copy2, where there was no Master record match

ValueCap Systems - Proprietary

508

Editing the Shared Container


In order to make the Shared Container useful for other data sources, we will need to edit the Input and Output Table Definitions and leverage RCP (Runtime Column Propagation).
Open the MasterLookup Shared Container:

Edit the Input and Output Table Definitions and remove all columns except for playerID, nameFirst and nameLast
Make sure RCP is enabled everywhere

Save the Shared Container and close the window


ValueCap Systems - Proprietary

509

Shared Container Re-Use


Create the following job flow using the Pitching.ds dataset and Table Definition

Be sure to have RCP enabled throughout your job
Table Definitions on the output of the Shared Container are optional because of RCP

ValueCap Systems - Proprietary

510

Testing the Shared Container


Save the job as lab9b_2
Compile the job and run it
There should be the following output:
9691 pitching records going to Copy1, where there was a Master record match
2226 pitching records going to Copy2, where there was no Master record match

You can also try processing the Salaries dataset using the Shared Container created in this lab.
ValueCap Systems - Proprietary

511

Lab 9C: Job Sequencer

ValueCap Systems - Proprietary

512

Lab 9C Objectives
Use the Job Sequencer to run jobs lab9b_1 and lab9b_2 back to back

ValueCap Systems - Proprietary

513

Lab 9C Overview
Create a Job Sequence by selecting File > New and then choosing Job Sequence

To create the Job Sequence, select job lab9b_1 in the Repository and drag it onto the canvas. Next, drag lab9b_2 onto the canvas. Right-click on the Job_Activity_0 stage and drag the link to the Job_Activity_1 stage.
This will run lab9b_1 first and then lab9b_2

ValueCap Systems - Proprietary

514

Job Parameters
Before the jobs can be run, you must specify the values to be passed to the Job Parameters
Both lab9b_1 and lab9b_2 use $APT_CONFIG_FILE and $FILEPATH

ValueCap Systems - Proprietary

515

Defining Job Sequencer Parameters


Go to the Parameters tab of the Job Properties dialog and click on Add Environment Variable
Select $APT_CONFIG_FILE and $FILEPATH from the list

Click on OK when finished


ValueCap Systems - Proprietary

516

Inserting Parameter Values


Next, go back into the Job_Activity stage properties and for each Parameter, click on Insert Parameter Value to insert the Parameters you just defined
Do this for both stages

ValueCap Systems - Proprietary

517

When finished, save the job as lab9c. Compile the job, but do not run it yet.
First, make sure that both lab9b_1 and lab9b_2 are compiled and ready to run

Run lab9c and view the results in the Director log


There should have been no errors
The results from each individual job can be viewed from the Director by selecting the log for that job
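If you prefer the command line to the Director, much of the same information can be pulled with dsjob. This is only a sketch; the host, credentials, engine and project names are placeholders.

  DSJOB=/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob
  AUTH="-domain ishost:9080 -user dsadm -password secret -server etlengine"

  $DSJOB $AUTH -jobinfo dstrain lab9c      # overall status of the sequence
  $DSJOB $AUTH -logsum dstrain lab9b_1     # log summary for the first job
  $DSJOB $AUTH -logsum dstrain lab9b_2     # log summary for the second job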

ValueCap Systems - Proprietary

518

Lab 9D: Project Export

ValueCap Systems - Proprietary

519

Lab 9D Objectives
Use the DataStage Manager to export your entire project
This will provide you with a backup of the work you have done this week

ValueCap Systems - Proprietary

520

Export Your Entire Project

Save all of your work. Close all open jobs. In the Designer, under the Export menu, select DataStage Components. In the Repository Export dialog, click on Add.
ValueCap Systems - Proprietary

521

Export Your Entire Project

In the Select Items dialog, click on the project, which is the top level of the hierarchy. Click on OK. Now, you will probably have to wait a couple of minutes.
ValueCap Systems - Proprietary

522

Export Your Entire Project


When control returns to the dialog, fill in the Export to file path. Click Export. A progress bar will appear. Eventually, you will see a dialog that will tell you that some read-only objects were not exported.

ValueCap Systems - Proprietary

523

Congratulations!
You have successfully completed all of your labs! You have created a backup of your labs which you can take with you and later import into your own project elsewhere.

ValueCap Systems - Proprietary

524

Agenda
1. DataStage Overview - Page 10
2. Parallel Framework Overview - Page 73
3. Data Import and Export - Page 116
4. Data Partitioning, Sorting, and Collection - Page 252
5. Data Transformation and Manipulation - Page 309
6. Data Combination - Page 364
7. Custom Components: Wrappers - Page 420
8. Custom Components: Buildops - Page 450
9. Additional Topics - Page 477
10. Glossary - Page 526

ValueCap Systems - Proprietary

525

Glossary
Administrator - DataStage client used to control project global settings and permissions.
Collector - Gathers records from all partitions and places them into a single partition. Forces sequential processing to occur.
Compiler - Used by the DataStage Designer to validate the contents of a job and prepare it for execution.
Configuration File - File used to describe to the Framework how many ways parallel a job should be run (see the sample file below).
  Node - virtual name for the processing node
  Fastname - hostname or IP address of the processing box
  Pool - virtual label used to group processing nodes and resources in the config file
  Resource Disk - designates where Parallel Datasets are to be written
  Resource Scratchdisk - designates where DataStage should create temporary files
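For reference, a minimal two-node configuration file of the kind pointed to by $APT_CONFIG_FILE might look like the sketch below; the hostname and directory paths are placeholders.

  {
    node "node1"
    {
      fastname "etlserver"
      pools ""
      resource disk "/data/datasets" {pools ""}
      resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
      fastname "etlserver"
      pools ""
      resource disk "/data/datasets" {pools ""}
      resource scratchdisk "/data/scratch" {pools ""}
    }
  }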

Dataset - DataStage data storage mechanism which allows for data to be stored across multiple files on multiple disks. This is often used to spread out I/O and expedite file reads and writes.
Designer - DataStage client used primarily to design, create, execute, and maintain jobs.
ValueCap Systems - Proprietary

526

Glossary
Director - DataStage client used to schedule, run, and monitor DataStage jobs and to view their logs.
Export - Process by which data is written out of DataStage to any supported target.
Funnel - Stage used to gather many links, where each link contains many partitions, into a single parallel link. All input links must have the same layout.
Generator - Stage used to create rows of data based on the table definition and parameters provided. Often useful for testing applications where real data is not available.
Grid - Large collection of computing resources which allows for MPP-style processing of data. Grid computing often allows for dynamic configuration of available computing resources.
Import - Process by which data is read into DataStage and translated to the DataStage internal format.
Job - A collection of stages arranged in a logical manner to represent a particular piece of business logic. Jobs must first be compiled before they can be executed.
ValueCap Systems - Proprietary

527

Glossary
Link - A conduit between 2 stages which enables data transfer from the upstream stage to the downstream stage.
Manager - DataStage client used to import/export objects from the DataStage server repository. These include table definitions, jobs, and custom built stages.
MPP - Massively Parallel Processing. Computing architecture where memory and disk are not shared across hardware processing nodes.
Operator - Same as a stage. Operators are represented by stages in the Designer, but referenced directly by the Framework.
Partition - Division of data into parts for the purpose of parallel processing.
Parallelism - Concurrent processing of data.
  Partitioned Parallelism - divide-and-conquer approach to processing data. Data is divided into partitions and processed concurrently. Data remains in the same partition throughout the entire life of the job.
  Pipelined Parallelism - upstream and downstream stages run concurrently, with records passed between them in memory as they are produced, so a stage does not have to finish before the next one starts. This helps eliminate potential bottlenecks and avoids landing intermediate data.

ValueCap Systems - Proprietary

528

Glossary
Peek - Stage which allows users to view a subset of records (default 10 per partition) as they pass through.
Pipelining - The ability to process data and pass data between processes in memory instead of having to first land data to disk.
RCP - Runtime Column Propagation. Feature which allows columns to be automatically propagated at runtime without the user having to manually perform source-to-target mapping at design time.
RDBMS - Relational Database Management System. A database that is organized and accessed according to the relationships between data values.
Reject - Record that is rejected by a stage because it does not meet a specific condition.
Scalability - From a DataStage perspective, it's the ability for an application to process the same amount of data in less time as additional hardware resources are added to the computing platform.
SMP - Symmetric Multi-Processing. Computing architecture where memory and disk are shared by all processors.

ValueCap Systems - Proprietary

529

Glossary
Stage - A component in DataStage that performs a predetermined action against the data. For example, the Sort stage will sort all records based on a chosen column or set of columns.
Table Definition - A schema containing field names and their associated data types and properties. Can also contain descriptions about the content of the field(s).
Wrapper - An external application, command, or other independently executable object that can be called from within DataStage as a stage. Wrappers can accept many inputs and many outputs, but the inputs and outputs must be pre-defined.

ValueCap Systems - Proprietary

530
