Introduction
What is Information Server (IIS) 8.x?
- Suite of applications that share a common repository
- Common set of application services (hosted by WebSphere Application Server)
- Data integration toolset (ETL, profiling, and data quality)
- Employs a scalable parallel processing engine
- Supports an N-tier layered architecture
- Newer version of the data integration/ETL toolset offered by IBM
- Web browser interface to manage security and authentication
Product Suite
IIS is organized into 4 layers
- Client: Administration, analysis, development, and user interface.
- Metadata Repository: Single repository for each install. Can reside in a DB2, Oracle, or SQL Server database. Stores configuration, design, and runtime metadata. DB2 is the supplied database.
- Domain: Common services. Requires WebSphere Application Server. Single domain for each install.
- Engine: Core engine that runs all ETL jobs. The engine install includes connectors, packs, job monitors, performance monitors, the log service, etc.
Note: The Metadata Repository, Domain, and Engine can reside on the same server or on separate servers. Multiple engines can exist in a single Information Server install.
Detailed IS Architecture
Client layer
- DataStage & QualityStage client
- Admin Console client
- Reporting Console client
- Information Analyzer
- WebSphere Business Glossary
- FastTrack
- Metadata Workbench
Domain layer
IADB (Profiling)
Import/Export Manager
Engine layer
Metadata DB
Course Objectives
Upon completion of this course, you will be able to:
- Understand principles of parallel processing and scalability
- Understand how to create and manage a scalable job using DataStage
- Implement your business logic as a DataStage job
- Build, compile, and execute DataStage jobs
- Execute your DataStage jobs in parallel
- Enhance DataStage functionality by creating your own stages
- Import and export DataStage jobs
Agenda
1. DataStage Overview
2. Parallel Framework Overview
3. Data Import and Export
4. Data Partitioning, Sorting, and Collection
5. Data Transformation and Manipulation
6. Data Combination
7. Custom Components: Wrappers
8. Custom Components: Buildops
9. Additional Topics
10. Glossary
IS DataStage Overview
In this section we will discuss:
- Product history
- Product architecture
- Project setup and configuration
- Job design
- Job execution
- Managing jobs and job metadata
Product History
Prior to IBM's acquisition of Ascential Software, Ascential had performed a series of its own acquisitions:
- Ascential started off as VMark before it became Ardent Software and introduced DataStage as an ETL solution.
- Ardent was then acquired by Informix, and through a reversal of fortune, Ardent management took over Informix.
- Informix was then sold to IBM, and Ascential Software was spun out with approximately $1 billion in the bank as a result.
- Ascential Software kept DataStage as its cash-cow product but started focusing on a bigger picture: data integration for the enterprise.
QualityStage
DataStage
Designer
Director
Administrator
Manager
DataStage Repository
- No more Manager client
- Common Repository can be on a separate server
- Default J2EE-compliant application server is WebSphere Application Server
When first connecting to the Administrator, you will need to provide the following:
- Server address where the DataStage repository was installed
- Your user ID
- Your password
- Assigned project
C:\IBM\InformationServer\Projects\Sample
For the majority of lab exercises, you will be selecting Parallel Job or using the Existing and Recent tabs.
These boxes can be docked in various locations within the interface; just click and drag them around.
The DataStage Designer user interface can be customized to your preferences. Here are just a few of the options.
New Job
Job Compile
Grid Lines
Snap to Grid
Job Properties
Run Job
Link Markers
Zoom In / Out
These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.
Left-click and drag the stage(s) onto the canvas. You can also left-click on the stage once and then position your mouse cursor on the canvas and left-click again to place the chosen stage there.
To create the link, you can right-click on the upstream stage and drag the mouse pointer to the downstream stage. This will create a link as shown here. Alternatively, you can select the link icon from the General category in your Palette by left-clicking on it.
When Show stage validation errors under the Diagram menu is selected (the default), DataStage Designer uses visual cues to alert users that there's something wrong. Placing the mouse cursor over an exclamation mark on a stage will display a message indicating what the problem is. A red link indicates that the link cannot be left dangling; it must have a source and/or target attached to it.
You may notice that the default labels created on the stages and links are not very intuitive. You can easily change them by left-clicking once on the stage or link and then typing a more appropriate label. This is considered a best practice; you will understand why shortly. Labels can also be changed by right-clicking on the stage or link and selecting the Rename option.
Here's an example of a fairly common stage properties dialog box. The Properties tab always contains the stage-specific options. Mandatory entries are highlighted in red. The Input tab allows you to view the incoming data layout as well as define data partitioning (we will cover this in detail later). The Output tab allows you to view and map the outgoing data layout.
Another useful feature of the Input properties tab is that you can see what the incoming data layout looks like.
The answer lies in the Mapping tab. This is the source-to-target mapping paradigm you will find throughout DataStage. It is a means of propagating design-time metadata from source to target.
Source-to-target mapping is achieved by 2 methods in DataStage:
1. Left-clicking and dragging a field or collection of fields from the Source side (left) to the Target side (right).
2. Left-clicking on the Columns bar on the Source side and dragging it into the Target side. This is illustrated above.
When performed correctly, you will see the Target side populated with some or all of the fields from the Source side, depending on your selection.
Once the mapping is complete, you can go back into the Output Columns tab, and you will notice that all of the fields you've mapped from source to target now appear under the Columns tab. You may have also noticed the Runtime column propagation option below the columns. It is here because we enabled it in the Administrator; if you do not see this option, it likely was not enabled.
Once selected, it will show up in the Job Properties window, where the default value can be altered. Parameters can be used to control job behavior, and they can be referenced within stages to allow simple adjustment of properties without having to modify the job itself.
Before a job can be executed, it must first be saved and compiled. Compilation will validate that all necessary options are set and defined within each of the stages in the job.
Compile
Run
To run the job, just click the run button in the Designer. Alternatively, you can click the run button from within the Director. The Director contains the job run log, which provides much more detail than the Designer.
The Designer is also used for exporting and importing DataStage jobs, table definitions, routines, containers, etc. Items can be exported in one of two formats: DSX or XML. DSX is DataStage's internal format. Both formats can be opened and viewed in a standard text editor. We do not recommend altering the contents unless you really know what you are doing!
You can export the contents of the entire project or individual components. You can also export items into an existing file by selecting the Append to existing file option. Exported projects, depending on the total number of jobs, can grow to several megabytes. However, these files compress easily.
Previously exported items can be imported via the Designer. You can choose to import everything or only selected content. DSX files from previous versions of DataStage can also be imported. The upgrade to the current version will occur on the fly as the content is being imported into the repository.
Job Log
These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.
Lab 1A Objective
Learn to set up and configure a simple project for IBM Information Server DataStage / QualityStage
Click on the Add button to create a new project. Your instructor may advise you on a project name; do not change the default project directory. Click OK when finished.
Project Setup
Click on the new project you have just created and select the Properties button. Under the General tab, check the boxes next to:
- Enable job administration in the Director
- Enable Runtime Column Propagation for Parallel Jobs
Next, click on the Environment button to bring up the Environment Variables editor.
Setting APT_CONFIG_FILE defines the default configuration file used by jobs in the project. Setting APT_DUMP_SCORE enables additional diagnostic information in the Director log. Click the OK button when finished editing environment variables. Click OK and then Close to exit the Administrator. You have now finished configuring your project.
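For example, the values might look like the following (the configuration file path is an assumption based on the install directory shown earlier, not a value from the slides):

    APT_CONFIG_FILE = C:\IBM\InformationServer\Server\Configurations\default.apt
    APT_DUMP_SCORE  = True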
Lab 1B Objective
Once connected, select the Parallel Job option and click on OK. You should see a blank canvas with the Parallel label in the upper left hand corner.
Use the techniques covered in the lecture material to build the job. The job consists of a Row Generator stage and a Peek stage.
For the Row Generator, you will need to enter the following table definition:
Alter the stage and link labels to match the diagram above.
Once the job has compiled successfully, right-click on the canvas and select Show performance statistics. Click on the Job Run button. Once your job finishes executing, you should see the following output:
Lab 1C Objective
Once connected, you should see the status of lab1b, which was just executed from within the Designer:
The first entry shows the configuration file being used. The next few entries show the output of the Peek stage.
Stage Output
The Peek stage output in the Director log should be similar to the following:
The Peek stage is similar to inserting a print statement into the middle of a program. Where did this data come from? The data was generated by the Row Generator stage! You will learn more about this powerful stage in later sections and labs.
Agenda
1. DataStage Overview
2. Parallel Framework Overview
3. Data Import and Export
4. Data Partitioning, Sorting, and Collection
5. Data Transformation and Manipulation
6. Data Combination
7. Custom Components: Wrappers
8. Custom Components: Buildops
9. Additional Topics
10. Glossary
Scalability
Scalability is a term often used in product marketing but seldom well defined. Hardware vendors claim their products are highly scalable:
- Computers
- Storage
- Network
Scalability Defined
How should scalability be defined? That depends on the product. For parallel DataStage:
- The ability to process a fixed amount of data in decreasing amounts of time as hardware resources (CPU, memory, storage) are increased
- Could also be defined as the ability to process growing amounts of data by increasing hardware resources accordingly
Scalability Illustrated
- Linear scalability: runtime decreases proportionally as hardware resources are increased. For example, a job that takes 8 hours to run on 1 CPU will take 4 hours on 2 CPUs, 2 hours on 4 CPUs, and 1 hour on 8 CPUs.
- Poor scalability: results when running time no longer improves as additional hardware resources are added.
- Super-linear scalability: occurs when the job performs better than linear as hardware resources are increased.
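One common way to quantify this (a general definition, not from the slides): speedup S(n) = T(1) / T(n), where T(n) is the runtime on n CPUs. Linear scalability means S(n) = n; in the example above, S(8) = 8 hours / 1 hour = 8.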
[Chart: run time vs. hardware resources, contrasting linear, poor, and super-linear scalability curves]
Hardware Scalability
Hardware vendors achieve scalability by:
- Using multiple processors
- Having large amounts of memory
- Installing fast storage mechanisms
- Leveraging a fast backplane
- Using very high bandwidth, high-speed networking solutions
Software Scalability
Software scalability can occur via:
- Executing on scalable hardware
- Effective memory utilization
- Minimizing disk I/O
- Data partitioning
- Multi-threading
- Multi-processing
Software Scalability DS EE
Parallel DataStage achieves scalability in a variety of ways:
- Data pipelining
- Data partitioning
- Minimizing disk I/O
- In-memory processing
We will explore these concepts in detail!
Parallel Framework
Configuration File: contains a virtual map of the available system resources.
Parallel Framework: references the Configuration File to determine the degree of parallelism for the job at runtime.
Traditional Processing
Suppose we are interested in implementing the following business logic where A, B, and C represent specific data transformation processes:
[Diagram: file → A → disk → B → disk → C → disk → RDBMS, with a disk staging area between each process]
While the above solution works and eventually delivers the correct results, problems occur when data volumes increase and/or batch windows shrink. Disk I/O is the slowest link in the chain, and sequential processing prohibits scalability.
Data Pipelining
What if, instead of persisting data to disk between processes, we could move the data between processes in memory?
[Diagram: file → A → B → C → RDBMS (invoke loader), with the disk staging areas eliminated]
The application will certainly run faster simply because we are now avoiding the disk I/O that was previously present.
[Diagram: file → A → B → C → RDBMS, with data flowing continuously in memory]
This concept is called data pipelining. Data continuously flows from source to target, through the individual transformation processes. The downstream process no longer has to wait for all of the data to be written to disk; it can begin processing as soon as the upstream process is finished with the first record!
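To make the idea concrete, here is a minimal illustrative sketch in Python (not DataStage code; the stage names and transformations are invented for illustration). Each stage is a generator, so records stream through in memory and the downstream stage starts as soon as the first record arrives:

    # Illustrative sketch only -- not DataStage code.
    # Each "stage" is a generator: records stream through in memory,
    # so stage B starts working as soon as A emits its first record.

    def source():
        for i in range(1, 5):              # stand-in for reading a file
            yield {"id": i}

    def stage_a(records):
        for rec in records:
            rec["a"] = rec["id"] * 2       # transformation A
            yield rec

    def stage_b(records):
        for rec in records:
            rec["b"] = rec["a"] + 1        # transformation B
            yield rec

    # No staging to disk between stages -- the rows are pipelined.
    for row in stage_b(stage_a(source())):
        print(row)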
Data Partitioning
Parallel processing would not be possible without data partitioning. We will devote an entire lecture to this subject later in the course. For now, think of partitioning as the act of distributing records into separate partitions so that the processing burden is divided among many processors rather than one.
[Diagram: a partitioner distributes a data file into four partitions: records 1-1000, 1001-2000, 2001-3000, and 3001-4000]
Parallel Processing
By combining data pipelining and partitioning, you can achieve what people typically envision as being parallel processing:
[Diagram: input file → A → B → C → RDBMS, with each stage running across multiple partitions]
In this model, data flows from source to target, upstream stage to downstream stage, while remaining in the same partition throughout the entire job. This is often referred to as partitioned parallelism.
[Diagram: input file → A → B → C → RDBMS, with records free to flow down any partition]
What makes pipeline parallelism powerful is the following:
- Records are not bound to any given partition
- Records can flow down any partition
- This prevents backups and hotspots from occurring in any given partition
- The parallel framework does this by default!
Configuration Files
Configuration files are used by the Parallel Framework to determine the degree of parallelism for a given job:
- Configuration files are plain text files that reside on the server side
- Several configuration files can co-exist; however, only one can be referenced at a time by a job
- Configuration files have a minimum of one processing node defined and no maximum
- They can be edited through the Designer, vi, or other text editors
- The syntax is simple and highly repetitive
Each node entry in the file has:
- A label for the node: can be anything, but needs to be different for each node
- A resource disk entry: the location for parallel dataset storage, used to spread I/O; there can be multiple entries per node
- A resource scratchdisk entry: the location for temporary scratch file storage, also used to spread I/O; there can be multiple entries per node
An example is sketched below.
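A minimal two-node sketch (the hostname and directory paths are placeholders, not values from the course environment):

    {
        node "node1"
        {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/disk1" {pools ""}
            resource scratchdisk "/data/ds/scratch1" {pools ""}
        }
        node "node2"
        {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/disk2" {pools ""}
            resource scratchdisk "/data/ds/scratch2" {pools ""}
        }
    }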
[Diagram: four partitions funneled through a collector into a single sequential data file, vs. the same four partitions written in parallel to a dataset]
Parallel datasets perform better because:
- Data I/O is distributed instead of sequential, removing a bottleneck
- Data is stored in a format native to the Parallel Framework, eliminating the need for the Framework to re-interpret data contents
- Data can be stored and read back in a pre-partitioned and sorted manner
Browsing Datasets
Dataset viewer can be accessed from the Tools menu in the Designer. Use the Dataset viewer to see all metadata as well as records stored within the dataset. Alternatively, if all you want to do is browse the records in the dataset, you can use the View Data button in the properties window for the dataset stage.
Lab 2A Objectives
Learn to create a simple configuration file and validate its contents. Note: You will need to leverage skills learned during previous labs to complete subsequent labs.
What is in your configuration file depends on the hardware environment you are using (i.e., the number of CPUs). For example, on a 4-CPU system you will likely see a configuration file with 4 node entries defined.
Regardless of how many CPUs your system has, edit the configuration file and create as many node entries as you have CPUs.
The default may already have the nodes defined. If you need to add nodes, copy and paste is the fastest way. Keep in mind that node names need to be unique, while everything else can stay the same. Pay attention to the { }s! Your instructor may choose to provide you with alternate resource disk and resource scratchdisk locations to use.
Once you have saved your configuration file, click on the Check button at the bottom again. This action validates the contents of your configuration file.
Always do this after you have created a configuration file: if it fails this simple test, there is no way any job will run using it! If the validation fails, use the error message to determine what the problem is, correct it, and repeat the step above.
The check validates that:
- The fastname entry matches the hostname or IP
- rsh permissions, if necessary, are in place
- Read and write permissions exist for all of your resource disk and scratchdisk entries
Click the OK button when finished editing environment variables. Click OK and then Close to exit the Administrator. You have now finished configuring your project.
Lab 2B Objective
Use your newly created configuration files to test a simple DataStage application.
Defining APT_CONFIG_FILE
Once selected, you will return to the Job Properties window. Verify that the value for APT_CONFIG_FILE is the same as the 1-node configuration file you defined previously in Lab 2A.
Using APT_DUMP_SCORE
Another way to verify degree of parallelism is to look at the following output in your job log:
The entries Peek,0 and Peek,1 show up because you set APT_DUMP_SCORE to True. The numbers 0 and 1 signify partition numbers, so if you have a job running 4-way parallel, you should see numbers 0 through 3.
Agenda
1. DataStage Overview
2. Parallel Framework Overview
3. Data Import and Export
4. Data Partitioning, Sorting, and Collection
5. Data Transformation and Manipulation
6. Data Combination
7. Custom Components: Wrappers
8. Custom Components: Buildops
9. Additional Topics
10. Glossary
Related Stages
To use either stage, you will need a table or column definition. You can generate as little as 1 record with 1 column. Columns can be of any supported data type: Integer, Float, Double, Decimal, Character, Varchar, Date, and Timestamp.
Row Generator
The Row Generator is an excellent stage to use when building jobs in DataStage: it allows you to test the behavior of various stages within the product. To configure the Row Generator, you must define at least 1 column. Looking at what we did for the job in Lab 1B, we see that 3 columns were defined:
We could have also loaded an existing table definition instead of entering our own.
Row Generator
Suppose we want to stick with the 3-column table definition we created. As you saw in Lab 2B, the Row Generator will produce records with miscellaneous 10-byte character, integer, and date values. There is, however, a way to specify the values to be generated. To do so, double-click on the number next to the column name.
Note: In addition, with Decimal types you also have the option of defining percent zero and percent invalid.
Column Generator
The Column Generator is an excellent stage to use when you need to insert a new column or set of columns into a record layout. The Column Generator requires you to specify the name of the column first; then, in the output Mapping tab, you map source to target. In the output Columns tab, you customize the added column(s) the same way as in the Row Generator.
For example, if you are generating a dummy key, you would want to make it an Integer type with an initial value of 0 and an increment of 1. When running in parallel, you can instead use an initial value of part and an increment of partcount, where part is defined by the Framework as the partition number and partcount is the number of partitions.
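Worked through with 4 partitions: partition 0 generates 0, 4, 8, ..., partition 1 generates 1, 5, 9, ..., and so on, so the generated key values never collide across partitions.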
This stage can also be used to terminate any job flow. It is similar in behavior to inserting a print statement into your source code.
Importing Data
There are 2 primary means of importing data from external sources:
- Automatically: DataStage automatically reads the table definition and applies it to the incoming data. Examples include RDBMSs, SAS datasets, and parallel datasets.
- Manually: the user must define the table definition that corresponds to the data to be imported. These table definitions can be entered manually or imported from an existing copybook or schema file. Examples include flat files and complex files.
Columnization
DataStage parses through the record it just carved out and separates out the columns, again based on the table definition provided. Column delimiters are also defined within the table definition.
This can become very troublesome if you don't know the correct layout of your data!
Null Handling
All DataStage data types are nullable:
- Tagged types and subrecords are not nullable, but their fields are
- Null fields do not have a value; a DataStage null is represented by an out-of-band indicator
- Nulls can be detected by a stage
- Nulls can be converted to/from a value
- Null fields can be ignored by a stage, can trigger an error, or can drive some other action
Exporting a nullable field to a flat file without first defining how to handle the null will cause an error (see the sketch below).
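One way to define that handling is the null_field import/export property in a record schema. A hedged sketch (the field name and the replacement value are illustrative assumptions, not from the slides):

    record {delim=','} (
        salary: nullable int32 {null_field='NULL'};
    )

On export, a null salary is written as the text NULL; on import, the text NULL becomes a null.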
What would your table definition look like for this data?
- You need column names, which are provided for you
- You need data types for each column
- You need to specify ',' as the column delimiter
- You need to specify newline as the record delimiter
A schema-file sketch of such a definition follows.
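As an illustrative sketch only (the column names and types here are assumptions for illustration, since the sample data itself appears on the slide image), the equivalent DataStage schema-file form might look like:

    record {final_delim=end, record_delim='\n', delim=','} (
        playerID: string[max=20];
        yearID: int32;
        teamID: string[3];
    )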
The data types must also match the data itself; otherwise the columnization step will fail.
If your table definition is not correct, then the View Data operation will fail.
The table definition we used above worked for the data we were given. Was it the only table definition that would have worked? No, but it was the best one:
- VarChar is perhaps the most flexible data type, so we could have defined all columns as VarChars
- All numeric and date/time types can be imported as Char or VarChar as well, but the reverse is rarely true
- Decimal types can typically be imported as Float or Double and vice versa, but be careful with precision: you may lose data!
- Integer types can also be imported as Decimal, Float, or Double
Parallel datasets can sometimes be faster than loading/extracting an RDBMS. Some conditions that can make this happen:
- Non-partitioned RDBMS tables
- Remote location of the RDBMS
- Sequential RDBMS access mechanism
Once set up, each option guides you through a simple process to import the table definition and save it for future re-use.
The presence of the icon on the link signifies that a table definition is present, i.e., that metadata is available on the link. Why do this when DataStage can do it automatically at runtime? Because sometimes it is easier or more straightforward to have the metadata available at design time.
Another way to access saved table definitions is to use the Load button on the Output tab of any given stage. Note that you can also do this on the Input tab, but that is the same as loading it on the Output tab of the upstream (preceding) stage.
RDBMS Connectivity
DataStage offers an array of options for RDBMS connectivity, ranging from ODBC to highly scalable native interfaces. For handling large data volumes, DataStage's highly scalable native database interfaces are the best way to go. While the icons may appear similar, always look for the _enterprise label.
- DB2: parallel extract, load, upsert, and lookup
- Oracle: parallel extract, load, upsert, and lookup
- Teradata: parallel extract and load
- Sybase: sequential extract; parallel load, upsert, and lookup
- Informix: parallel extract and load
Usually a query is submitted to a database sequentially; the database then distributes the query to execute it in parallel. The output, however, is returned sequentially. Similarly, when loading, data is loaded sequentially first, before being distributed by the database.
DataStage avoids this bottleneck by establishing parallel connections into the database and executing queries, extracting data, and loading data in parallel. The degree of parallelism changes depending on the database configuration (i.e., the number of partitions that are set up).
While the database itself may be highly scalable, the overall solution, which includes the application accessing the database, may not be. Any sequential bottleneck in an end-to-end solution will limit its ability to scale!
DataStage's native parallel connectivity into the database is the key enabler for a truly scalable end-to-end solution.
Connectivity
DataStage Oracle Enterprise Stage
Stage Options
- User: Oracle user ID
- Password: Oracle user password
- DB Options: can also accept SQL*Loader parameters such as DIRECT=TRUE, PARALLEL=TRUE
The User-Defined Query option allows you to enter your own query or copy and paste an existing query.
Note: a custom query will run sequentially by default. Running SQL queries in parallel requires the Partition Table option: enter the name of the table containing the partitioning strategy you are looking to match.
Both sets of options above will yield identical results. Leaving out the Partition Table option would cause the extract to execute sequentially.
You can also use the DELETE option to remove data from the target Oracle table.
Relevant Stages
- Column Import: import only a subset of the columns in a record, leaving the rest as raw or string. This is useful when you have a very wide record and only plan on referencing a few columns.
- Column Export: combine 2 or more columns into a single column.
- Combine Records: combines records in which particular key-column values are identical into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys; all adjacent records whose key columns contain the same value are gathered into the same record as subrecords.
- Make Subrecord: combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. You specify the vector columns to be made into a vector of subrecords and the name of the new subrecord.
All these columns are combined into a vector of the same length as the number of columns (n+1). The vector is called column_name. Any input columns that do not have a name of that form will not be included in the vector but will be output as top-level columns.
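For illustration (the column names are assumed, not from the slide): input columns a0, a1, and a2, all of the same type, would be combined into a 3-element vector named a, while an unrelated column b would remain a top-level column.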
- Stored Procedure: supports input and output parameters or arguments, and it can process the returned value after the stored procedure is run. It also provides status codes indicating whether the stored procedure completed successfully and, if not, allows for error handling. Currently supports DB2, Oracle, and Sybase.
- Complex Flat File: as a source stage, it imports data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUP, REDEFINES, OCCURS, or OCCURS DEPENDING ON clauses. When used as a target, the stage exports data to one or more complex flat files. It does not write to MVS datasets.
Lab 3A Objectives
- Learn to create a table definition to match the contents of the flat file
- Read in the flat file using the Sequential File stage and the table definition just created
Batters File
The layout of the Batting.csv file is:
playerID — Player ID code
yearID — Year
teamID — Team
lgID — League
G — Games
AB — At Bats
R — Runs
H — Hits
DB — Doubles
TP — Triples
HR — Homeruns
RBI — Runs Batted In
SB — Stolen Bases
IBB — Intentional walks
Tips:
1. Use a data type that most closely matches the data. For example, for the Games column, use Integer instead of Char or VarChar!
2. When using a VarChar type, always fill in a maximum length by entering a number in the Length column.
3. When defining numerical types such as Integer or Float, there's no need to fill in length or scale values. You only do this for Decimal types.
Open the file using vi or any other text editor to view its contents; note the contents and data types. Create a table definition for this data and save it as batting.
Pitchers File
The layout of the Pitching.csv file is:
playerID — Player ID code
yearID — Year
teamID — Team
lgID — League
W — Wins
L — Losses
SHO — Shutouts
SV — Saves
SO — Strikeouts
ERA — Earned Run Average
Tips:
1. Be careful to choose the right data type for the ERA column. Your choices should boil down to Float vs. Decimal.
Open the file using vi or any other text editor to view its contents; note the contents and data types. Create a table definition for this data and save it as pitching.
Salary File
The layout of the Salaries.csv file is:
yearID — Year
teamID — Team
lgID — League
playerID — Player ID code
salary — Salary
Tips:
1. The salary value is in whole dollars. Again, be sure to select the best data type. While it may be tempting to use Decimal, the Framework is more efficient at processing Integer and Float types; those are considered native to the Framework.
Open the file using vi or any other text editor to view its contents; note the contents and data types. Create a table definition for this data and save it as salaries.
Master File
The layout of the Master.csv file is:
playerID — A unique code assigned to each player
birthYear — Year player was born
birthMonth — Month player was born
birthDay — Day player was born
nameFirst — Player's first name
nameLast — Player's last name
debut — Date player made first major league appearance
finalGame — Date player made last major league appearance

Tips:
1. Treat birthYear, birthMonth, & birthDay as Integer types for now.
2. Be sure to specify the correct Date format string: %mm/%dd/%yyyy
Open the file using vi or any other text editor to view its contents; note the contents and data types. Create a table definition for this data and save it as master.
Next, find the batting table definition you created, and click and drag the table onto the link. On the link, look for the icon that signifies the presence of a table definition.
Viewing Data
If everything went well, you should see the View Data window pop up:
If you get an error instead, take a look at the error message to determine the location and nature of the error. Make the necessary corrections and try again.
Testing lab3a
Save the job as lab3a_batting. Compile the job and then click on the run button. Go into the Director and take a look at the job log.
Look out for warnings and errors! Errors are fatal and must be resolved. Warnings can be an issue; in this case, one could be warning you that certain records failed to import. That is a bad thing!
lab3a_batting Results
For your lab3a_batting job:
You should see: Import complete. 25076 records imported successfully, 0 rejected. There should be no rejected records! Find the Peek output line in the Director's log and double-click on it. It should look like the following:
When finished, your job should resemble one of the diagrams on the right.
Be sure to rename the stages accordingly.
Make sure that View Data works for each and every input file.
Validating Results
For your lab3a_pitching job:
You should see: Import complete. 11917 records imported successfully, 0 rejected. There should be no rejected records!
Lab 3B Objective
- Write out the imported data files to ASCII flat files and parallel datasets
- Use different formatting properties
Edit lab3a_batting_out
Go to lab3b_batting and edit the job to look like the following:
Edit lab3b_batting
In the Copy stage's Output Mapping tab, map the source columns to the target columns for both output links:
For the Sequential File stage, fill in the appropriate path and filename for where the data file should reside.
For example: Use Batting.txt as the filename
Save and compile lab3b_batting. Run the job and view the results in the Director.
PLACEHOLDER SLIDE
Insert appropriate set of database connectivity slides here depending on customer environment: DB2 Oracle Teradata Sybase
Connectivity
DataStage DB2 Enterprise Stage
Lab 3C Objective
Insert the data stored within the Datasets created in Lab 3B into the database
Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters.
Use your own directory path, user ID, and password. (NOTE: user ID and password may not be necessary, depending on the DB2 setup.)
Lab 3C DB2
Create a new job and pull together the following stages (Dataset, Peek, and DB2 Enterprise):
Lab 3D Objective
Extract Batting, Pitching, Salaries, and Master tables from the Database
Use the same USERID and PASSWORD job parameters (if needed) from Lab 3C
Lab 3D DB2
Create a new job and pull together the following stages (DB2 Enterprise and Peek):
Connectivity
DataStage Oracle Enterprise Stage
Lab 3C Objective
Insert the data stored within the Datasets created in Lab 3B into the database
Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters:
Lab 3C Oracle
Create a new job and pull together the following stages (Dataset, Peek, and Oracle Enterprise):
Lab 3D Objective
Extract Batting, Pitching, Salaries, and Master tables from the Database
Use the same USERID and PASSWORD job parameters from Lab 3C
Lab 3D Oracle
Create a new job and pull together the following stages (Oracle Enterprise and Peek):
Connectivity
DataStage Teradata Enterprise Stage
Lab 3C Objective
Insert the data stored within the Datasets created in Lab 3B into the database
Setting Up Parameters
To make things easier, we will use job parameters. Go to the Administrator and open your project properties. Access the Environment Variable settings and create the following 3 parameters:
Lab 3C Teradata
Create a new job and pull together the following stages (Dataset, Peek, and Teradata Enterprise). Rename the links and stages accordingly. In the Dataset stage:
- Use the FILEPATH parameter along with the Dataset filename created earlier in Lab 3B
- Load the Batting table definition in the Columns tab; while this step is optional, it does provide design-time metadata
NOTE: You may also need to specify a Database option and provide the name of the database to connect to.
Lab 3D Objective
Extract Batting, Pitching, Salaries, and Master tables from the Database
Use the same USERID and PASSWORD job parameters from Lab 3C
Lab 3D Teradata
Create a new job and pull together the following stages (Teradata Enterprise and Peek). For the Teradata Enterprise stage:
- Configure the settings as shown below, using the USERID and PASSWORD job parameters
- Use the Table Read Method
- You may need to specify the Database option
Lab 3E Objective
- Import a COBOL copybook and save it as a table definition
- Compare the DataStage table definition to the copybook
Lab 3E Copybook
We will import the following COBOL copybook:
01 CLIENT-RECORD.
   05 FIRST-NAME   PIC X(16).
   05 LAST-NAME    PIC X(20).
   05 GENDER       PIC X(1).
   05 BIRTH-DATE   PIC X(10).
   05 INCOME       PIC 9999999V99 COMP-3.
   05 STATE        PIC X(2).
   05 RECORD-ID    PIC 999999999 COMP.
The copybook is located in a file called customer.cfd. You will need a copy of this file locally on your computer in order to import it into DataStage.
Copybook Imported
If no errors occur, then your copybook has been successfully imported and translated into a DataStage table definition. Double-click on the newly created table definition to view it.
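As a general expectation (not stated on the slide): the COMP field (RECORD-ID) should map to a binary Integer type, and the COMP-3 field (INCOME) should map to a packed Decimal with precision 9 and scale 2.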
Agenda
1. DataStage Overview
2. Parallel Framework Overview
3. Data Import and Export
4. Data Partitioning, Sorting, and Collection
5. Data Transformation and Manipulation
6. Data Combination
7. Custom Components: Wrappers
8. Custom Components: Buildops
9. Additional Topics
10. Glossary
Data Partitioning
In Chapter 2 we very briefly touched upon the topic of partitioning:
[Diagram: a partitioner distributes a data file into four partitions: records 1-1000, 1001-2000, 2001-3000, and 3001-4000]
We also discussed the fact that parallelism would not be possible without partitioning.
For example, how would the following be accomplished without partitioning the data as it comes in from the sequential input file?
[Diagram: sequential input file partitioned into a parallel flow: input file → A → B → C → RDBMS]
When you sit down at a card game, how are the cards dealt?
The dealer typically distributes the cards evenly to all players, so each player winds up with an equivalent number of cards.
When partitioning data, it is likewise desirable to achieve a balance of records in each partition.
Too many records in any given partition is referred to as data skew. Data skews cause overall processing to take longer to finish.
Auto Partitioning
By default, partitioning is always set to Auto
Auto means the Framework will decide the most optimal partitioning algorithm based on what the job is doing.
Partitioning is accessed from the same location for any given stage with an input link attached:
Random Partitioning
The records are partitioned randomly, based on the output of a random number generator. No further information is required. Suppose we have the following record layout: playerID (varchar), yearID (integer), teamID (char[3]), ERA (float).
Partition #2
aasedo01 1985 BAL 3.78
armstmi01 1985 NYA 3.07
beckwjo01 1985 KCA 4.07
boddimi01 1985 BAL 4.07
brownma02 1985 MIN 6.89
brownmi01 1985 BOS 21.6
camacer01 1985 CLE 8.10
caudibi01 1985 TOR 2.99
Partition #3
ackerji01 1985 TOR 3.23
alexado01 1985 TOR 3.45
atherke01 1985 OAK 4.30
barklje01 1985 CLE 5.27
birtsti01 1985 OAK 4.01
blylebe01 1985 CLE 3.26
Partition #4
agostju01 1985 CHA 3.58
allenne01 1985 NYA 2.76
bairdo01 1985 DET 6.24
bannifl01 1985 CHA 4.87
barojsa01 1985 SEA 5.98
beattji01 1985 SEA 7.29
beller01 1985 BAL 4.76
berenju01 1985 DET 5.59
bestka01 1985 SEA 1.95
Roundrobin Partitioning
Records are distributed very evenly among all partitions. Use this method (or Auto) when in doubt. The roundrobin-partitioned records may look like this:
Partition #1
aasedo01 1985 BAL 3.78
allenne01 1985 NYA 2.76
bannifl01 1985 CHA 4.87
beckwjo01 1985 KCA 4.07
bestka01 1985 SEA 1.95
blylebe01 1985 MIN 3.00
boydoi01 1985 BOS 3.79
burrira01 1985 ML4 4.81
camacer01 1985 CLE 8.10
cerutjo01 1985 TOR 5.40
Partition #2
ackerji01 1985 TOR 3.23
armstmi01 1985 NYA 3.07
barklje01 1985 CLE 5.27
behenri01 1985 CLE 7.78
birtsti01 1985 OAK 4.01
boddimi01 1985 BAL 4.07
brownma02 1985 MIN 6.89
burttde01 1985 MIN 3.81
candejo01 1985 CAL 3.80
clancji01 1985 TOR 3.78
Partition #3
agostju01 1985 CHA 3.58
atherke01 1985 OAK 4.30
barojsa01 1985 SEA 5.98
Beller01 1985 BAL 4.76
blackbu02 1985 KCA 4.33
boggsto01 1985 TEX 11.57
brownmi01 1985 BOS 21.6
butchjo01 1985 MIN 4.98
Carych01 1985 DET 3.42
clarkbr01 1985 CLE 6.32
Partition #4
alexado01 1985 TOR 3.45
bairdo01 1985 DET 6.24
beattji01 1985 SEA 7.29
berenju01 1985 DET 5.59
blylebe01 1985 CLE 3.26
bordiri01 1985 NYA 3.21
burnsbr01 1985 CHA 3.96
bystrma01 1985 NYA 5.71
caudibi01 1985 TOR 2.99
clarkst02 1985 TOR 4.50
Same Partitioning
Preserves whatever partitioning is already in place:
- Data remains in the same partition throughout the flow (aka partitioned parallelism), or until the data is deliberately repartitioned
- Does not care how the data was previously partitioned
- Sets the Preserve Partitioning flag to prevent automatic repartitioning later
Entire Partitioning
Places a complete copy of the data into each partition:
[Diagram: with Auto, a 100,000-record data file is split so each of four partitions receives 25,000 records; with Entire, each partition receives all 100,000 records]
Entire partitioning is useful for making a copy of the data available on all processing nodes of a shared-nothing environment:
- There is no shared memory between processing nodes
- Entire forces a copy to be pushed out to each node
- The Lookup stage does this
Modulus Partitioning
Distributes records using a modulus function on the key column selected from the available list.
partition number = (field value) mod n, where n is the number of partitions; the result ranges over 0, 1, ..., n-1
Here we perform a modulus partition on yearID. The results would look like this (modulus + 1 = partition #):
Partition #1
aasedo01 1988 BAL 4.05
alexado01 1988 DET 4.32
allenne01 1988 NYA 3.84
anderal02 1988 MIN 2.45
anderri02 1988 KCA 4.24
aquinlu01 1988 KCA 2.79
atherke01 1988 MIN 3.41
augusdo01 1988 ML4 3.09
bailesc01 1988 CLE 4.9
bairdo01 1988 TOR 4.05
...
Partition #2
aasedo01 1985 BAL 3.78
ackerji01 1985 TOR 3.23
agostju01 1985 CHA 3.58
alexado01 1985 TOR 3.45
allenne01 1985 NYA 2.76
armstmi01 1985 NYA 3.07
atherke01 1985 OAK 4.3
bairdo01 1985 DET 6.24
bannifl01 1985 CHA 4.87
barklje01 1985 CLE 5.27
...
Partition #3
aasedo01 1986 BAL 2.98
ackerji01 1986 TOR 4.35
agostju01 1986 CHA 7.71
agostju01 1986 MIN 8.85
akerfda01 1986 OAK 6.75
alexado01 1986 TOR 4.46
allenne01 1986 CHA 3.82
anderal02 1986 MIN 5.55
andujjo01 1986 OAK 3.82
aquinlu01 1986 TOR 6.35
...
Partition #4
aasedo01 1987 BAL 2.25
akerfda01 1987 CLE 6.75
aldrija01 1987 ML4 4.94
alexado01 1987 DET 1.53
allenne01 1987 CHA 7.07
allenne01 1987 NYA 3.65
anderal02 1987 MIN 10.95
anderri02 1987 KCA 13.85
andersc01 1987 TEX 9.53
andujjo01 1987 OAK 6.08
...
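Worked through with n = 4: 1988 mod 4 = 0 (partition #1), 1985 mod 4 = 1 (partition #2), 1986 mod 4 = 2 (partition #3), and 1987 mod 4 = 3 (partition #4), matching the yearID values shown above.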
Range Partitioning
Partitions data into approximately equal-sized partitions based on one or more partitioning keys:
- Range partitioning is often a preprocessing step to performing a total sort on a dataset
- It requires an extra pass through the data to create the range map
Suppose we want to range partition based on a baseball pitcher's Earned Run Average (ERA). We first have to create a range map file, as shown here.
Sorting typically occurs whenever range partitioning is performed, in order to best group records belonging to the same range.
The actual range values are determined by the Framework using an algorithm that attempts to achieve an optimal distribution of records.
Partition #2
clarkma01 1996 NYN 3.43
moyerja01 2001 SEA 3.43
downske01 1990 SFN 3.43
ballaje01 1989 BAL 3.43
bonesri01 1994 ML4 3.43
wehrmda01 1985 CHA 3.43
tomkobr01 1997 CIN 3.43
loiseri01 1998 PIT 3.44
coneda01 1999 NYA 3.44
remlimi01 2004 CHN 3.44
...
Partition #3
desseel01 2001 CIN 4.48
cruzne01 2002 HOU 4.48
marotmi01 2002 DET 4.48
towerjo01 2003 TOR 4.48
zitoba01 2004 OAK 4.48
tomkobr01 2005 SFN 4.48
roberna01 2005 DET 4.48
clancji01 1988 TOR 4.49
hudsoch02 1988 NYA 4.49
johnto01 1988 NYA 4.49
...
Partition #4
foulkke01 2005 BOS 5.91
searara01 1985 ML4 5.92
willica01 1994 MIN 5.92
smallma01 1996 HOU 5.92
welleto01 2004 CHN 5.92
bairdo01 1987 PHI 5.93
staplda02 1988 ML4 5.93
wengedo01 1998 SDN 5.93
broxtjo01 2005 LAN 5.93
smithmi03 1987 MIN 5.94
...
Range partitioning is very effective for producing balanced partitions and can be efficient if data characteristics do not change over time.
Hash Partitioning
Partitions records based on the value of a key column or columns:
- All records with the same key column value will wind up in the same partition
- Hash partitioning is often a preprocessing step to performing a total sort on a dataset
- Poorly chosen partition key(s) can result in a data skew; that is, the majority of the records wind up in one or two partitions while the rest of the partitions receive little or no data
  - For example, hash partitioning on gender would result in a skew where the majority of records are spread between just 2 partitions. Skews are bad!
A small sketch of key-based placement follows.
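An illustrative Python sketch (not the Framework's actual hash function; the sample key values are assumptions):

    # Illustrative sketch only -- not the Framework's actual hash function.
    # Records with the same key value always land in the same partition.
    records = [("aasedo01", "BAL"), ("abbotji01", "CAL"), ("aasedo01", "BAL")]
    num_partitions = 4

    partitions = [[] for _ in range(num_partitions)]
    for player_id, team_id in records:
        p = hash((player_id, team_id)) % num_partitions  # key-based placement
        partitions[p].append((player_id, team_id))
    # Both ("aasedo01", "BAL") records are guaranteed to share a partition.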
Partition #2
aasedo01 1989 NYN 3.94
abbotky01 1991 CAL 4.58
abbotky01 1996 CAL 20.25
abbotky01 1992 PHI 5.13
abbotky01 1995 PHI 3.81
aceveju01 2003 TOR 4.26
ackerji01 1990 TOR 3.83
ackerji01 1985 TOR 3.23
ackerji01 1991 TOR 5.20
ackerji01 1986 TOR 4.35
...
Partition #3
aardsda01 2004 SFN 6.75
abbotji01 1989 CAL 3.92
abbotji01 1991 CAL 2.89
abbotji01 1995 CAL 4.15
abbotji01 1992 CAL 2.77
abbotji01 1996 CAL 7.48
abbotji01 1990 CAL 4.51
abbotji01 1999 MIL 6.91
abbotpa01 1993 CLE 6.38
abbotpa01 2003 KCA 5.29
...
Partition #4
abbotji01 1998 CHA 4.55
abbotji01 1995 CHA 3.36
abbotji01 1993 NYA 4.37
abbotji01 1994 NYA 4.55
abbotpa01 2004 TBA 6.70
abregjo01 1985 CHN 6.38
aceveju01 2001 FLO 2.54
aceveju01 2003 NYA 7.71
aceveju01 1998 SLN 2.56
aceveju01 1999 SLN 5.89
...
Note that all records with the same playerID and teamID value are now in the same partition.
DB2 Partitioning
Distributes the data using the same partitioning algorithm as DB2. You must supply the specific DB2 table via the partition properties.
NOTE: the DB2 Enterprise stage automatically invokes the DB2 partitioner prior to loading data in parallel.
Sorting Techniques
There are 2 ways to sort data:
- The easiest way is to specify the sort key(s) on the input link properties of any stage that supports an input link
- The Sort stage is also an option for sorting data within a flow; functionality-wise, it is identical
Sorting requires data to be pre-partitioned using either Range or Hash. Sorting sets the Preserve Partitioning flag, which forces Same partitioning to occur downstream; this avoids disturbing the sorted order of the records.
Sort Properties
Common properties for sorting include:
- Unique: removes duplicates, where duplicates are determined by the specified key fields being sorted on. For example, if sorting on playerID and teamID, then all records with the same playerID and teamID are considered identical and only 1 is kept.
- Stable: indicates that incoming data is already pre-sorted, and not to re-sort that data. For example, if the data is already sorted on playerID and teamID, and the new sort key is ERA, then the data will end up sorted on playerID, teamID, and ERA. If the option is not set, the data will be sorted on ERA only.
Removing Duplicates
The Remove Duplicates stage removes duplicate records based on specified keys:
- Must specify at least 1 key
- The selected key columns define what a duplicate record is
- Can choose to keep the first or last record
- Similar to using the Unique option under the Sort options
Data Collection
Data collection is the opposite of data partitioning:
[Diagram: a collector gathers four partitions (records 1-1000, 1001-2000, 2001-3000, 3001-4000) back into a single data file]
All records from all partitions are gathered into a single partition. Collectors are used when:
- Writing out to a sequential file
- Processing data through a stage that runs sequentially
Auto Collector
By default, collecting is always set to Auto
Auto means the Framework will decide the most optimal collecting algorithm based on what the job is doing.
Collector type is accessed from the same location for any given sequential stage with an input link attached:
Roundrobin Collector
Collects records from multiple partitions in a roundrobin manner:
- Collects a record from the first partition, then the second, then the third, and so on until it reaches the last partition, then starts over again
- Extremely fast! Typically the same as Auto
Ordered Collector
Reads all records from the first partition, then all records from the second partition, and so on until all partitions have been read:
- Useful for maintaining sort order: if the data was previously partition-sorted, the outcome will be a sorted single partition
- Could be slow if some partitions get backed up
Sort Merge not only acts as a collector, but also manages data flow from many partitions to fewer partitions.
For example, a job can run 8-way parallel and then slow down to 4-way parallel. To accomplish this, the Framework leverages Sort Merge to maintain the sort order and partitioning strategy.
Link Indicators
The icons found on links indicate what is happening in terms of partitioning or collecting:
- Auto partitioning
- Sequential to parallel: data is being partitioned
- Data re-partitioning
- Same partitioning
- Partition and sort
- Sort and collect data
- Collect data
Funnel Stage
Collects many links and outputs only 1 link. All input links must possess the exact same data layout.
Lab 4A Objectives
- Learn more about the Peek stage
- Learn to invoke different partitioners
- Observe the outcome of different partitioners: Roundrobin, Entire, and Hash
In the Administrator, select Project Properties and enter the Environment editor:
- Find and set APT_CONFIG_FILE to the 4-node configuration file you just created. This makes it the default for your project.
- Make sure APT_DUMP_SCORE is set to True.
Click the OK button when finished editing environment variables. Click OK and then Close to exit the Administrator.
G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0 G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0 G:46 AB:76 R:10 H:19 DB:7 TP:0 HR:2 RBI:15 SB:0 IBB:1 G:111 AB:343 R:48 H:92 DB:15 TP:1 HR:14 RBI:52 SB:2 IBB:0 G:34 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0 G:156 AB:534 R:59 H:142 DB:26 TP:0 HR:5 RBI:56 SB:7 IBB:3 G:4 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0 G:132 AB:411 R:54 H:125 DB:13 TP:5 HR:8 RBI:42 SB:4 IBB:3 G:153 AB:500 R:73 H:137 DB:26 TP:3 HR:11 RBI:59 SB:17 IBB:2 G:29 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0 playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4 playerID:baineha01 yearID:1985 teamID:CHA lgID:AL G:160 AB:640 R:86 H:198 DB:29 TP:3 HR:22 RBI:113 SB:1 IBB:8 playerID:balbost01 yearID:1985 teamID:KCA lgID:AL G:160 AB:600 R:74 H:146 DB:28 TP:2 HR:36 RBI:88 SB:1 IBB:4 playerID:barfije01 yearID:1985 teamID:TOR lgID:AL G:155 AB:539 R:94 H:156 DB:34 TP:9 HR:27 RBI:84 SB:22 IBB:5 playerID:baylodo01 yearID:1985 teamID:NYA lgID:AL G:142 AB:477 R:70 H:110 DB:24 TP:1 HR:23 RBI:91 SB:0 IBB:6 playerID:bellbu01 yearID:1985 teamID:TEX lgID:AL G:84 AB:313 R:33 H:74 DB:13 TP:3 HR:4 RBI:32 SB:3 IBB:1 playerID:bentobu01 yearID:1985 teamID:CLE lgID:AL G:31 AB:67 R:5 H:12 DB:4 TP:0 HR:0 RBI:7 SB:0 IBB:2 playerID:berrada01 yearID:1985 teamID:NYA lgID:AL G:48 AB:109 R:8 H:25 DB:5 TP:1 HR:1 RBI:8 SB:1 IBB:0 playerID:blackbu02 yearID:1985 teamID:KCA lgID:AL G:33 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0
Entire Partitioning
Go back into the Peek stage's Input properties and change the Partition type to Entire.
- Save and compile the job
- Run the job
- Go back to the Director and view the job log
- Compare your output to the one on the next slide. Note that each partition's output is identical, since Entire places a copy of the data into each partition.
playerID:weaveje01 yearID:2002 teamID:DET lgID:AL G:2 AB:7 R:0 H:2 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:tocajo01 yearID:2001 teamID:NYN lgID:NL G:13 AB:17 R:3 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:valdeis01 yearID:1997 teamID:LAN lgID:NL G:28 AB:57 R:0 H:5 DB:1 TP:0 HR:0 RBI:1 SB:1 IBB:0 playerID:valdema01 yearID:1997 teamID:MON lgID:NL G:47 AB:19 R:0 H:2 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:valenfe01 yearID:1997 teamID:SDN lgID:NL G:13 AB:17 R:0 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:valenfe01 yearID:1997 teamID:SLN lgID:NL G:5 AB:5 R:1 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:wadete01 yearID:1997 teamID:ATL lgID:NL G:12 AB:12 R:0 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:germaes01 yearID:2003 teamID:OAK lgID:AL G:5 AB:4 R:0 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:whitega01 yearID:1997 teamID:CIN lgID:NL G:11 AB:9 R:0 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0 playerID:greenga01 yearID:1991 teamID:TEX lgID:AL G:8 AB:20 R:0 H:3 DB:1 TP:0 HR:0 RBI:1 SB:0 IBB:0
playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37 playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10 playerID:galaran01 yearID:1997 teamID:COL lgID:NL G:154 AB:600 R:120 H:191 DB:31 TP:3 HR:41 RBI:140 SB:15 IBB:2 playerID:griffke02 yearID:1996 teamID:SEA lgID:AL G:140 AB:545 R:125 H:165 DB:26 TP:2 HR:49 RBI:140 SB:16 IBB:13 playerID:gonzaju03 yearID:2001 teamID:CLE lgID:AL G:140 AB:532 R:97 H:173 DB:34 TP:1 HR:35 RBI:140 SB:1 IBB:5 playerID:ortizda01 yearID:2004 teamID:BOS lgID:AL G:150 AB:582 R:94 H:175 DB:47 TP:3 HR:41 RBI:139 SB:0 IBB:8 playerID:dawsoan01 yearID:1987 teamID:CHN lgID:NL G:153 AB:621 R:90 H:178 DB:24 TP:2 HR:49 RBI:137 SB:11 IBB:7 playerID:bondsba01 yearID:2001 teamID:SFN lgID:NL G:153 AB:476 R:129 H:156 DB:32 TP:2 HR:73 RBI:137 SB:13 IBB:35 playerID:delgaca01 yearID:2000 teamID:TOR lgID:AL G:162 AB:569 R:115 H:196 DB:57 TP:1 HR:41 RBI:137 SB:0 IBB:18 playerID:giambja01 yearID:2000 teamID:OAK lgID:AL G:152 AB:510 R:108 H:170 DB:29 TP:1 HR:43 RBI:137 SB:2 IBB:6
Lab 4B Objective
- Use collectors to process data sequentially
- View the difference between the SortMerge, Ordered, and Roundrobin collectors
Lab4B Collectors
Open lab4a and Save-As lab4b Edit the job and add a second Peek stage: Go to the Advanced Stage properties for the 2nd Peek
Change the Execution Mode to Sequential
Click OK, save and compile lab4b
Run lab4b and view the results in the Director log
Note that in Auto mode, the Collector maintained the sort order on RBI
This suggests that the Framework decided to use the SortMerge Collector
SortMerge Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to SortMerge
Be sure to set the Sort direction to Descending - no need to click on Sort
Save and compile the job
Run the job
Go to the Director and view the job log
Compare the output of the Sequential_Peek stage to the output on the previous slide. The output should be the same.
Ordered Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to Ordered
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one on the next slide.
Ordered Collector takes all records from the 1st partition, then the 2nd, then the 3rd, and finally the 4th.
Compare this output with the output from partition 0 for the Hash and Sort exercise in lab4a
If the records were originally range partitioned, then the resulting output would show up sorted.
Roundrobin Collector
Go back into the Sequential_Peek stage Input properties, and change the Collector type to Roundrobin
Save and compile the job
Run the job
Go back to the Director and view the job log
Compare your output to the one below:
Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14
Sequential_Peek,0: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9
Sequential_Peek,0: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6
Sequential_Peek,0: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37
Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9
Sequential_Peek,0: playerID:thomafr04 yearID:2000 teamID:CHA lgID:AL G:159 AB:582 R:115 H:191 DB:44 TP:0 HR:43 RBI:143 SB:1 IBB:18
Sequential_Peek,0: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3
Sequential_Peek,0: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10
Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15
Sequential_Peek,0: playerID:vaughmo01 yearID:1996 teamID:BOS lgID:AL G:161 AB:635 R:118 H:207 DB:29 TP:1 HR:44 RBI:143 SB:2 IBB:19
Lab 4C Objective
Enter the following table definition under the Row Generator stage properties Output Columns tab.
This allows us to re-use the table definition in other Row Generator stages.
Lab4C Results
Verify that the record count going to the Peek stage is 600 rows (100+200+300):
Remember, links Input1, Input2, and Input3 get combined in the Funnel stage, which outputs only 1 link while maintaining the same number of partitions.
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
Modify Stage
Modify stage is useful and effective for light transformations:
Drop columns - permanently remove columns that are not needed from the record structure
Keep columns - specify which columns to keep (opposite of drop columns)
Null handling - specify an alternative null representation
Substring - obtain only a subset of bytes from a Char column
Change data types - alter column data types. Data must be compatible between data types.
o For example, a column of type Char[3] with a value of ABC cannot be changed to an Integer type.
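To make this concrete, here is a minimal sketch of Modify stage specifications, assuming the Orchestrate-style conversion names (year_from_date, handle_null, substring) and column names taken from the course datasets; verify the exact names against your installed documentation before using them:

    DROP finalGame
    KEEP playerID; nameFirst; nameLast; debut
    debutYear:int32 = year_from_date(debut)
    nameFirst = handle_null(nameFirst, 'unknown')
    lastInitial:string[1] = substring[0,1](nameLast)

The first two lines drop and keep columns, the third changes the data type by converting the debut date into an integer year, the fourth substitutes 'unknown' for null values, and the fifth extracts the first byte of nameLast.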
Switch Stage
Switch stage is useful for splitting up records and sending them down different links based on a key value.
Similar in behavior to a switch/case statement in C (see the sketch below)
Must provide a Selector field to perform the switch operation
Must specify the case value and corresponding link number (starts at 0)
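As a conceptual illustration only (send_to_link and drop_record are hypothetical helpers, not DataStage APIs), the stage behaves like this C fragment:

    /* the value of the Selector field drives which output link gets the record */
    switch (selector_value)
    {
        case 0:  send_to_link(0); break;  /* Case value 0 -> output link 0 */
        case 1:  send_to_link(1); break;  /* Case value 1 -> output link 1 */
        default: drop_record();   break;  /* no matching case: drop or reject */
    }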
Discard Value
Specifies an integer value of the selector column, or the value to which it was mapped using Case, that causes a row to be dropped (not rejected). This property is optional.
Filter Stage
Filter stage acts like the WHERE clause in a SQL SELECT statement
Supports 1 input and multiple output links, similar to the Switch stage
Can attach a reject link
Valid WHERE clause operations (sample clauses follow the list):
o six comparison operators: =, <>, <, >, <=, >=
o true / false
o is null / is not null
o like 'abc' (the second operand must be a regular expression)
o between (for example, A between B and C is equivalent to B <= A and A <= C)
o is true / is false / is not true / is not false
o and / or / not
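A few sample where clauses in this syntax, using column names from the course datasets for illustration:

    where lgID = 'AL' and yearID >= 1985
    where salary is not null
    where RBI between 100 and 160
    where nameLast like 'A.*'

Note that in the last example the operand of like is a regular expression, as described above.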
Transformer Stage
Transformer stage provides an extensible interface for defining data transformations
Supports 1 input and multiple outputs, including reject
Different user interface from other stages
o Source to target mapping is primary interface
1. Double-click in the column derivation area to bring up the derivation editor
2. Right-click to access the menu
3. If the cursor is at the beginning of the line when you right-click, you will get the following:
4. If the cursor is at the end of the line when you right-click, you will get the following: select Function to access pre-built transforms
String - enter a string value which will become a hardcoded value assigned to the column
Parentheses - inserts a pair of parentheses () into the derivation field
If Then Else - inserts If Then Else into the derivation field
o Can hard-code specific values
o Can include derivations based on built-in or user-defined functions
Stage Variable derivation expands AL to American League and NL to National League, and stores the value in a Stage Variable called league
league is mapped to the newly introduced league_name column on both outputs - the transform is defined only once
Constraint separates AL records from NL records
yearID column is mapped to the year_in_league column
DownCase() makes all characters lower case
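For reference, the derivations just described might be entered as follows, assuming the input link is named toTransformer (substitute your own link name):

    Stage Variable league:  If toTransformer.lgID = "AL" Then "American League" Else "National League"
    Column league_name:     league
    Column year_in_league:  toTransformer.yearID
    Column nameLast:        DownCase(toTransformer.nameLast)
    Constraint (AL link):   toTransformer.lgID = "AL"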
o Checking the Reject Row box sends down this link only those records that did not meet the conditions specified in the constraints.
o No constraint is required for a Reject link.
String Transformations:
Trim off spaces and characters
Compare values
Pad characters
Etc.
o BeforePeek will show the records before they are transformed
o AfterPeek will show the records as a result of the transformations applied in the Transformer stage
Relevant Stages
Change Capture - Compares two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. An extra column is put on the output dataset, containing a change code with values encoding the four actions: insert, delete, copy, and edit.
Change Apply - Reads a record from the change data set (produced by Change Capture) and from the before data set, compares their key column values, and acts accordingly:
If the before keys come before the change keys in the specified sort order, the before record is copied to the output. The change record is retained for the next comparison.
If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:
Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order may be different from the after data set given to Change Capture.
Delete: The value columns of the before and change records are compared. If the value columns are the same, or if Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison. If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys has matching value columns, the first-encountered record is deleted. This may cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued, and both the change record and the before record are discarded, i.e. no output record results.
Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This may be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued, and the change record is copied to the output; the stage retains the same before record for the next comparison.
Copy: The change record is discarded. The before record is copied to the output.
Relevant Stages
Difference - Takes 2 presorted data sets as inputs and outputs a single data set whose records represent the difference between them. The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other. The stage generates an extra column, DiffCode, which indicates the result of each record comparison.
The Difference stage is similar, but not identical, to the Change Capture stage. The Change Capture stage is intended to be used in conjunction with the Change Apply stage. The Difference stage outputs the before and after rows to the output data set, plus a code indicating if there are differences. Usually, the before and after data will have the same column names, in which case the after data set effectively overwrites the before data set and so you only see one set of columns in the output. If your before and after data sets have different column names, columns from both data sets are output; note that any key and value columns must have the same name.
Compare - Performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns. The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set. It is recommended that you use runtime column propagation (RCP) in this stage to allow DataStage to define the output column schema. The stage outputs three columns:
Result: Carries the code giving the result of the comparison.
First: A subrecord containing the columns of the first input link.
Second: A subrecord containing the columns of the second input link.
Relevant Stages
Encode - use any command available from the Unix command line to encode / mask data. The stage converts a data set from a sequence of records into a stream of raw binary data. An encoded data set is similar to an ordinary one, and can be written to a Data Set stage. You cannot use an encoded data set as an input to stages that perform column-based processing or re-order rows, but you can input it to stages such as Copy. You can view information about the data set in the data set viewer, but not the data itself. You cannot repartition an encoded data set.
Decode - use any command available from the Unix command line to decode / unmask data. It converts a data stream of raw binary data into a data set. As the input is always a single stream, you do not have to define metadata for the input link.
Compress - use either the Unix compress or GZip utility to compress data. It converts a data set from a sequence of records into a stream of raw binary data. A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a DataSet stage. However, a compressed data set cannot be processed by many stages until it is expanded. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.
Expand - use either the Unix uncompress or GZip utility to de-compress data. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data.
Relevant Stages
Surrogate Key - generates a unique key column for an existing data set. Can specify certain characteristics of the key sequence. The stage generates sequentially incrementing unique integers from a given starting point. The existing columns of the data set are passed straight through the stage. Can be executed in parallel.
Column Generator - generates additional column(s) of data and appends them onto an incoming record structure
Head - outputs the first N records in each partition. Can optionally select records from certain partitions or skip a certain number of records.
Tail - outputs the last N records in each partition. Can optionally select records from certain partitions.
Sample - outputs a sample of the incoming data. Can be configured to perform either a percentage (random) or periodic sampling. Can distribute samples to multiple output links.
Lab 5A Objectives
Learn more about the Modify and Transformer stages
Use the Modify stage to perform date field manipulations
Use the Transformer stage to perform the same date field manipulations
Compare results by using the Compare stage
Verify results are the same
Make sure to label the stages and links accordingly. Use the Master.ds dataset in the Dataset stage, which was created in lab3b_master
In the Output Columns tab, click on Load to load the Master table definition (previously saved in lab3).
This splits the date column into separate Year, Month, and Day columns
In the Output Columns tab for Modify, click on Load to load the Master table definition again. Add the following 3 additional column definitions:
debutYear Integer
debutMonth Integer
debutDay Integer
Modify will create these as part of the transformations defined by the specifications.
This will calculate a debutID field based on the Year, Month, and Day that the baseball player played his first game. Note: fromModify in the derivation is the name of the input link. If you used a different link name, then use that one.
Double-click the Derivation field next to StageVar and enter the following Derivation:
YearFromDate(toTransformer.debut) + MonthDayFromDate(toTransformer.debut) * MonthFromDate(toTransformer.debut)
Each of these functions can be accessed by right-clicking and selecting Function from the menu - look under the Date & Time category. Column names can be accessed by right-clicking and selecting Input Column, or by typing toTransformer. and selecting from the list. You can keep it all on one line. Hit Enter when finished. Note: toTransformer in the derivation is the name of the input link. If you used a different link name, use that one.
We are assuming that all records will be unique in the Master file
We are comparing records based on playerID and debutID. Any record with the same playerID and debutID value will be compared. If a different record shows up, then the job will abort.
For both input links, Hash and Sort on playerID and debutID.
Be sure to either retain the link labelled toCopy or rename the link going into the Copy stage to toCopy
Not doing this will break the source to target mapping on the output of the Copy stage.
Lab 5B Objective
Use the Filter, Switch, and Transformer stages to accomplish the same task and achieve the same results!
How do you figure out which link number corresponds to which output link? Solution: Click the Output Link Ordering tab.
o Explicitly define the Case mappings
o Send all values where birthYear >= 1976 down a reject link as shown on the right
Note: Reject links do not allow mapping!
Note that output link numbering here also starts at 0.
Note that column names need to be preceded by input link names. These Constraint Derivations can be entered either by manually typing them in or by using the GUI interface.
Lab5b - Results
Verify that your record count results match those shown on the right. Also make sure the results are consistent for the Filter, Switch, and Transformer.
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
(Section overview diagram: Merge, Lookup, and Aggregator stages combining data from DataSet and RDBMS sources)
Performing Joins
There are 4 different types of joins explicitly supported by DataStage:
InnerJoin (default)
LeftOuterJoin
RightOuterJoin
FullOuterJoin
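Just as the Filter stage mirrors a SQL WHERE clause, these four join types map directly onto standard SQL joins; a sketch with placeholder table names:

    SELECT * FROM left_input L INNER JOIN       right_input R ON L.KeyField = R.KeyField
    SELECT * FROM left_input L LEFT OUTER JOIN  right_input R ON L.KeyField = R.KeyField
    SELECT * FROM left_input L RIGHT OUTER JOIN right_input R ON L.KeyField = R.KeyField
    SELECT * FROM left_input L FULL OUTER JOIN  right_input R ON L.KeyField = R.KeyField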
To illustrate the functionality, we will use the following 2 sets of record inputs. Note that all columns in this example are of character type. Left Input Right Input
InnerJoin
InnerJoin will result in records containing LeftField and RightField where KeyField is an exact match
Left Input Right Input
InnerJoin Output
LeftOuterJoin
LeftOuterJoin will result in all records from the left input and only the records from the right input where KeyField is an exact match
Left Input Right Input
LeftOuterJoin Output
RightOuterJoin
RightOuterJoin will result in all records from the right input and only the records from the left input where KeyField is an exact match
Left Input Right Input
RightOuterJoin Output
What happened here? The right-input record with KeyField 123 had no match in the left input, so its LastName column is blank. Because there was no match, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.
FullOuterJoin
Left Input Right Input
FullOuterJoin will result in all records from both inputs and the records where KeyField is an exact match
FullOuterJoin Output (table not reproduced): the output carries both a leftRec_KeyField and a rightRec_KeyField column; matched keys (789, 456) are populated on both sides, while unmatched keys (012, 123) appear with blanks on the non-matching side.
Note: A blank is populated where there's no match. If the field was a numeric type, a zero would have been inserted. If the field was nullable, then a null would have been inserted.
All inputs must be pre-hashed and sorted by the join key(s). No reject capability
o Need to perform post-processing to detect failed matches (check for nulls, blanks, or 0s) applicable for LeftOuterJoin, RightOuterJoin, and FullOuterJoin.
Always use Link Ordering to differentiate between Left and Right input Links!
o Label your links accordingly
Key needs to have the same field name
All inputs must be pre-hashed and sorted by the merge key(s)
Supports optional reject record processing - simply attach a reject link
Always use the Link Ordering tab to verify correct Master and Update order
o Label your links accordingly
Merge Stage
To illustrate the Merge stage functionality, we will use the following 2 sets of record inputs. Note that all columns in this example are of character type. Master Input Update Input
Merge Properties (first example) and the resulting Merge Output (table not reproduced): output columns LastName, KeyField, and FirstName, with rows for Ryan (789), Ryan (789), Maddux (012), and Clemens (456).
Merge Properties (second example) and the resulting Merge Output (table not reproduced): output columns LastName, KeyField, and FirstName, with rows for Ryan (789), Ryan (789), and Clemens (456).
Inputs do not need to be partitioned or sorted
Lookup Tables are pre-loaded into shared memory
o Always make sure that your lookup table fits in available shared memory
Step #2: Map the input key to the corresponding lookup key. Field names do not need to match
Step #3: Map the input columns from both the input and the lookup table to the output.
Lookup table source must be one of the supported RDBMS stages above
o Specify Lookup Type = Sparse in the RDBMS stage
o Optionally specify your own lookup SQL by using the User Defined SQL option instead of Table Read Method
What happened here? The lookup for this record failed, so its FirstName column is blank. Because the lookup failed, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.
Re-Calculation - Similar to Calculate, but performs the specified aggregation function(s) on a set of data that had already been previously aggregated, using the Summary Output Column property to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then specify the aggregation functions to perform against it, and the output column to carry the result.
Row Count - count the total number of unique records within each group as defined by the grouping key criteria.
Sort Use this mode when there are a large number of unique groups as defined by grouping key(s), or if unsure about the number of groups
o Input data should be previously hash partitioned and sorted by the grouping key(s)
o Uses less memory, but more disk I/O
Aggregator Example
Suppose we would like to find out, based on the data in our baseball Salaries.ds dataset, the following:
How many players are on each team each year
What the average salary is per team each year
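In SQL terms (with Salaries as a placeholder table name), the two questions amount to:

    SELECT teamID, yearID,
           COUNT(*)    AS player_count,
           AVG(salary) AS avg_salary
    FROM   Salaries
    GROUP  BY teamID, yearID

The Aggregator stage computes the same result by grouping on teamID and yearID, counting rows, and calculating the mean of salary.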
Output Mapping:
Calculate the player count and average salary separately and join the results together afterwards. Note: Data is being hash-partitioned and sorted prior to the Copy stage. Why?
Lab 6A Objectives
Use the Join stage to map a baseball player's first and last name to his corresponding Batting record(s)
Repeat the above functionality using the Lookup stage
Repeat the above functionality using the Merge stage
We will leverage the playerID key that exists in both datasets to identify and map the correct nameFirst and nameLast columns. Note that a given playerID value will likely appear in many records, based on how many years he played in the league. While the playerID will be the same, yearID should always be different.
Master Data
Column Name - Description
playerID - A unique code assigned to each player
birthYear - Year player was born
birthMonth - Month player was born
birthDay - Day player was born
nameFirst - Player's first name
nameLast - Player's last name
debut - Date player made first major league appearance
finalGame - Date player made last major league appearance
Which Join type will you use to ensure that all records from your Batting.ds file make it through to the output? We only care about picking up nameFirst and nameLast columns from the Master data
Only map these two columns on the output of the Join stage, and remember to disable RCP for this stage so that other columns are not propagated along.
In the Director Job Log, what did the Peek stage report?
Here's an example of the output:
Based on the above record counts and Peek output, it's obvious that we don't have master data for all players in the batting data.
lab6a_lookup Overview
Save lab6a_join as lab6a_lookup
Next, replace the Join stage with the Lookup stage
Make sure to have your links set up correctly
Use the same lookup key as the join key
Make sure that the Fail condition is set to Continue so that the job does not fail when a lookup failure is encountered
lab6a_merge Design
Save job lab6a_lookup2 as lab6a_merge
Edit the job to reflect the following
Save, Compile, and Run
Your results should match:
19877 player batting records where there was a Master record match
5199 player batting records where there was no Master record match
Lab 6B Objectives
We will leverage the playerID, yearID, lgID, and teamID keys that exist in both datasets to identify and map the correct salary column.
Salaries Data
Column Name - Description
yearID - Year
teamID - Team
lgID - League
playerID - Player ID code
salary - Salary
lab6b_aggregator Step 1
Use the Aggregator stage to find the pitcher with the lowest ERA on each team, each year.
Use the Filter stage to eliminate records where ERA < 1 AND W < 5
o It's not likely for a pitcher to have a legitimate season ERA less than 1.00 and have won fewer than 5 games
In the Aggregator stage, isolate the record with the lowest ERA per team per year
o Group by teamID and yearID keys
o Calculate minimum value for ERA
playerID Lookup - Need to use Lookup, Join, or Merge to map playerID and other relevant columns back to the output of the Aggregator
o For Example:
lab6b_aggregator Step 2
Use another Aggregator stage to find the pitcher with the highest salary on each team, each year. Extend the flow as shown below:
playerID Lookup - Need to use Lookup, Join, or Merge to map playerID and other relevant columns back to the output of the Aggregator.
For Example:
NOTE: It is likely that salary data was not available for all pitchers in the Pitching data set. Also, some pitchers may have the same salary.
lab6b_aggregator Step 3
Finally, determine whether or not the pitchers with the best ERA records are also the ones who are being paid the most
Extend the flow as shown below:
playerID Lookup - Need to use Lookup, Join, or Merge to map playerID and other relevant columns back to the output of the Aggregator.
For Example:
Answer: Having the best ERA does not correlate to being the best paid pitcher on the team!
lab6b_aggregator Optimization
A simple optimization that can be performed in this job is to hash and sort the data only once, before Copy1, instead of doing it twice as before. Remember, the data needs to be hash-partitioned and sorted for the Aggregator stage to function properly when using the Sort mode.
When processing large volumes of data, eliminating unnecessary hash and sorts will improve your performance!
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
Wrappers
In this section we will discuss the following: Wrappers
What is a wrapper
Use case
How to create
What is a Wrapper?
DataStage allows users to leverage existing applications within a job by providing the means to call the executable from within a Wrapper. Wrappers can be:
Any executable application (C, C++, COBOL, Java, etc.) which supports standard input and standard output, or named pipes
Executable scripts (Korn shell, awk, PERL, etc.)
Unix commands
Why Wrapper?
Wrappers allow existing executable functionality to be redeployed as a stage within DataStage
Re-use the logic in 1 or many different jobs
Achieve higher scalability and better performance than running it sequentially
o Some applications cannot or should not be executed in parallel.
o Some applications require the entire dataset in a single partition, thus inhibiting its ability to process in parallel
Avoid re-hosting of complex logic by creating a Wrapper instead of a complete DataStage job.
Because this COBOL application does not need to be processed sequentially and can support named pipes for the input, it becomes an ideal candidate for becoming a Wrapper.
The Wrapper will appear as a stage and can be used in any applicable job
COBOL Wrapper
(Job flow diagram: Input file -> COBOL Wrapper stage -> RDBMS)
Stage Type Name will be the name that shows up on the palette
Command is where the name of the executable is entered
grep is a Unix command for searching; in this example, it searches for any text containing the string NL
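At the shell level, the wrappered command behaves like the line below (the file names are illustrative; in the job, DataStage feeds the stage's input link to grep's standard input and reads the surviving records back from its standard output):

    grep NL < exported_records.txt > matching_records.txt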
The Input Interface tells DataStage how to export the data in a format that is digestible by the wrappered application; remember, the data is being sent to the wrappered executable to be processed. The Output Interface tells DataStage how to re-import the data that has been processed by the wrappered executable. This action is very similar to what happens when DataStage is reading in a flat file. For multiple Inputs and/or Outputs, define an interface for each
o Note: Link numbering starts at 0
This step is optional. Use the Properties tab to enter this information. We will see an example of this in the lab.
1. Click on the Generate button
2. Click on OK
This will create the Wrapper and store it under the Category you specify.
Lab 7A Objectives
Create a simple Wrapper using the Unix sort command
Apply the Wrapper in a DataStage job
Unix Sort
To learn how to use the Unix sort utility, simply type in man sort at the Unix command line to bring up the online help
It should look similar to the screenshot to the right. The sort utility can take data from standard input and write the sorted data to standard output.
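For example, the following invocation (file names are illustrative) sorts whatever arrives on standard input and writes the result to standard output - exactly the contract a Wrapper relies on:

    sort < unsorted_records.txt > sorted_records.txt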
Interfaces
Input tabs
Interfaces
Output tabs
Look in the Repository View under the Stage Types Wrapper category
Verify that the newly created UnixSort Wrapper is there
Use the Batting.ds dataset
Use the Batting table definition created in Lab 3
Use the Input Partitioning tab on the UnixSort stage to specify Hash on playerID - do not click on the Sort box!
o Remember, you must hash on the sort key!
Lab 7B Objectives
Create a Wrapper using the Unix sort command which supports user-defined options
Apply the Wrapper in a DataStage job
Create a new Wrapper that allows you to specify a column delimiter and a key to perform the sort on
-t - used by sort to define the column delimiter. Use , as the default value, as this is what DataStage uses when exporting data
start - defines the start position for the sort key, based on column number reference (+1 = end of 1st column)
stop - defines the stop position for the sort key, based on column number reference (-2 = end of 2nd column)
Specify the Conversion values as shown above
Look in the Repository View under the Stage Types Wrapper category
Verify that the newly created AdvancedUnixSort Wrapper is there
Use the Batting.ds dataset
Use the Batting table definition created in Lab 3
Edit the properties for the AdvancedUnixSort stage
o Specify the Column Delimiter to be ,
o Set the End Position to -3 (i.e. teamID)
o Set the Start Position to +2
o Specify the Hash key to be teamID
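For reference, the equivalent stand-alone command using the old-style sort key syntax described earlier - start after the 2nd comma-delimited column and stop after the 3rd, i.e. sort on teamID - would be:

    sort -t ',' +2 -3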
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
Buildops
In this section we will discuss the following: Buildops
What is a Buildop
Use cases
How to create
Example
Note: Buildop is considered advanced functionality within DataStage. In this section you will learn the basics of how to create a simple Buildop.
What is a Buildop?
DataStage allows users to create a new stage using standard C/C++ - this stage is called a Buildop. Buildops can be any ANSI-compliant C/C++ code:
Code must be syntactically correct
Code must be able to be compiled by a C/C++ compiler
If code does not work outside of DataStage, it will not work within DataStage!
Why Buildop?
Buildops allow users to extend the functionality provided by DataStage out of the box
Buildops offer a high-performance means of integrating custom logic into DataStage
Once created, a Buildop can be reused in any job and shared across projects
Buildops only require the core business logic to be written in C/C++
DataStage will take care of creating the necessary infrastructure to execute the business logic
Because this logic does not need to be processed sequentially, it becomes an ideal candidate for becoming a Buildop.
Buildop
(Job flow diagram: Input file -> Buildop stage -> RDBMS)
Buildop vs Transformer
The use case scenario described on the previous slide could also have been easily implemented in the transformer. Buildop Advantages:
Use standard C/C++ code, allowing existing logic to be re-used
High performance - Buildops are considered native to the Framework, whereas the Transformer must generate code
Supports multiple inputs and outputs
Transformer Advantages:
Simple graphical interface
Pre-defined functions and derivations are easy to access
No need to pre-define input and output interfaces
Stage Type Name will be the name that shows up on the palette
Operator is the reference name that the Framework will use - this is often kept the same as the Stage Name
Execution Mode is parallel by default, but can be sequential
The Input Interface describes to the Buildop the column(s) being operated on. Note: Only specify the columns that will be used within the Buildop. Any column defined must be referenced in the code!
The Output Interface describes to the Buildop the column(s) being written out. Note: Only specify the columns that will be used within the Buildop. Any column defined must be referenced in the code!
For multiple Inputs and/or Outputs, define an interface for each
o Define Port Names in order to track inputs / outputs
o When there's only 1 input/output, there's no need to define a Port Name
Useful when there's more than 1 input or output link
Auto Transfer - Defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.
Separate - Defaults to False, which means the transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.
Post-Loop
Logic that is processed after all records have been processed
Per-Record processing
Logic is executed against each record. Once a record has been written out, it cannot be recalled. Buildops do allow buffering of records and management of record input and output flow - advanced topics.
If there are no syntax errors or other violations in the Buildop definition, you should obtain an Operator Generation Succeeded status window similar to the one below:
Input Columns
Output Columns
Sample Output:
Lab 8A Objectives
Create a Buildop to perform the following:
Derive the pitcher's Win-Loss percentage based on his Win-Loss record for the season and populate the result into a new column
Expand the lgID value to either National League or American League and populate the result into a new column
Lab 8A Overview
Here's the simple job we will be creating to test out the Buildop. Overview:
Use the Batting.ds dataset
Use the Batting table definition created in Lab 3
Use the following formula to calculate Win-Loss Percentage:
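The formula itself appears on the slide; the standard definition - wins divided by total decisions, W / (W + L) - is assumed below. Here is a minimal sketch of the Per-Record C/C++ logic (not the exact lab solution), assuming columns W, L, and lgID are declared on the input interface and winPct and leagueName on the output interface; Buildop code refers to interface columns by name:

    // Win-Loss percentage = W / (W + L)  (assumed formula)
    if (W + L > 0)
        winPct = (float)W / (float)(W + L);
    else
        winPct = 0.0;  // no decisions recorded for the season

    // Expand the lgID code into the full league name
    if (lgID == "NL")
        leagueName = "National League";
    else
        leagueName = "American League";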
Save as lab8a. Compile and run lab8a - a random sample output is shown here:
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
Additional Topics
In this section we will provide a brief overview of:
Job Report Generator
Containers: Local, Shared
Containers
Containers are used in DataStage to visually simplify jobs and create re-usable logic flows
Containers can contain 1 or more stages and have input/output links
Local Containers are only accessible from within the job where they were created
o Local Container can be converted into a Shared Container
o Local Containers can be deconstructed back into the original stages within the flow
Shared Containers are accessible to any job within a project
o Shared Container can be converted into a Local Container
o Shared Containers cannot be deconstructed
Creating a Container
First, draw a line around the specific stages that you would like to place into a container. Make sure that only the stages you want are selected!
In this example, we are only selecting the Transformer and the Funnel
Creating a Container
Once created, the job with a shared container will look like the following:
The contents of the Container can be viewed in a separate window by right-clicking on the Container and selecting the Properties option
Job Sequencer
The Job Sequencer provides an interface for managing the execution of multiple jobs
To create a Job Sequence, select it from the New menu. Next, drag and drop the jobs onto the canvas and link them as you would with any 2 stages.
In this example, lab5a_1 will execute 1st, and then lab5a_2, and then lab5a_3.
Sequencer Stages
The Job Sequencer has a lot of built-in functions to assist with job flow management
Handle exceptions such as errors and warnings
Send messages via email or pager
Execute external applications or scripts
Wait for file activity prior to executing a job
o Useful for batch applications which are dependent on arrival of input data
Control execution based on completion and condition of executed jobs
If the DataStage jobs use Job Parameters, you must pass in the value for those parameters from within the Sequencer
o Can define Job Parameters for a Job Sequence and pass those parameters into the interface for each job being called.
Need to Save the job, Compile, and Run it. A Sequencer Job can be scheduled just like any other DataStage job.
DataStage Manager
Use the DataStage Designer client to import or export:
Entire Project
1 or Many Jobs
Shared Containers
Buildops
Wrappers
Routines
Table Definitions
Executables
Export Interface
Specify name and location for the export
Specify whole project (backup) or individual objects
Append or Overwrite existing DSX or XML export files
Note: Items should not be open in the Designer when performing exports
When to Export
Use the Designer to perform job / project exports
When upgrading DataStage, it's considered a good practice to:
1. Export the projects 2. Delete the projects 3. Perform the upgrade 4. Re-import the projects.
Upgrades will proceed much faster
Export jobs, containers, stages, etc. and check the DSX or XML file into source control
Export to a DSX or XML in order to migrate items between DataStage servers
Export the entire project as a means of creating a backup
Import Interface
The import interface is simpler than that of the export
Specify the location of the DSX or XML
Use the Perform Usage Analysis feature to ensure nothing gets accidentally overwritten during import
You can also select only specific items to import by using the Import Selected option
Lab 9A Objectives
Generate a Job Report:
Open job lab5a_1
Use the Job Report utility to generate a report
Examine the results
Lab 9A Overview
Open the job lab5a_1
Viewing Reports
3. After the report is generated, you should see the dialog box shown above. Click on the Reporting Console link.
Viewing Reports
4. This should take you to the Reporting tab of the Information Server Web Console, shown above. Starting with the Reports option in the Navigation pane on the left, navigate to the folder containing the job report you just created.
Viewing Reports
Your Web Console should now look something like this:
Viewing Reports
5. Select the report you just created, and click View Report Result in the pane on the right. You should now see a job report similar to the one shown on the left. Try clicking on the stage icons and see what happens.
Lab 9B Objectives
Create a shared container using a subset of logic from a previously created job
Edit the Shared Container to make it more generic
Reuse the Shared Container in a separate job
Lab 9B Overview
Open the job lab6a_lookup
Left-click and drag your cursor around the stages as shown below by the red box:
You can also click on the Edit menu, select Construct Container, and then select Shared. Save the Shared Container as MasterLookup
Edit the Input and Output Table Definitions and remove all columns except for playerID, nameFirst, and nameLast
Make sure RCP is enabled everywhere
Be sure to have RCP enabled throughout your job. Table Definitions on the output of the Shared Container are optional because of RCP.
You can also try processing the Salaries dataset using the Shared Container created in this lab.
Lab 9C Objectives
Use the Job Sequencer to run jobs lab9b_1 and lab9b_2 back to back
Lab 9C Overview
Create a Job Sequence by selecting File > New and choosing Job Sequence
To build the sequence, select job lab9b_1 in the Repository, drag it onto the canvas, and then do the same for lab9b_2. Right-click on the Job_Activity_0 stage and drag the link to the Job_Activity_1 stage.
This will run lab9b_1 first and then lab9b_2
Job Parameters
Before the jobs can be run, you must specify the values to be passed to the Job Parameters
Both lab9b_1 and lab9b_2 use $APT_CONFIG_FILE and $FILEPATH
When finished, save the job as lab9c. Compile the job, but do not run it yet.
First, make sure that both lab9b_1 and lab9b_2 are compiled and ready to run
Lab 9D Objectives
Use the DataStage Manager to export your entire project
This will provide you with a backup of the work you have done this week
Save all of your work. Close all open jobs. In the Designer, under the Export menu, select DataStage Components. In the Repository Export dialog, click on Add.
In the Select Items dialog, click on the project, which is the top level of the hierarchy. Click on OK. Now, you will probably have to wait a couple of minutes.
Congratulations!
You have successfully completed all of your labs! You have created a backup of your labs which you can take with you and later import into your own project elsewhere.
Agenda
1. DataStage Overview 2. Parallel Framework Overview 3. Data Import and Export 4. Data Partitioning, Sorting, and Collection 5. Data Transformation and Manipulation 6. Data Combination 7. Custom Components: Wrappers 8. Custom Components: Buildops 9. Additional Topics 10. Glossary
Glossary
Administrator - DataStage client used to control project global settings and permissions.
Collector - Gathers records from all partitions and places them into a single partition. Forces sequential processing to occur.
Compiler - Used by the DataStage Designer to validate the contents of a job and prepare it for execution.
Configuration File - File used to describe to the Framework how many ways parallel a job should be run.
Node - virtual name for the processing node
Fastname - hostname or IP address of the processing box
Pool - virtual label used to group processing nodes and resources in the config file
Resource Disk - designates where Parallel Datasets are to be written
Resource Scratchdisk - designates where DataStage should create temporary files
Dataset - DataStage data storage mechanism which allows data to be stored across multiple files on multiple disks. This is often used to spread out I/O and expedite file reads and writes.
Designer - DataStage client used primarily to design, create, execute, and maintain jobs.
Glossary
Director - DataStage client used to manage the execution of DataStage jobs: running, scheduling, and monitoring jobs and viewing their logs.
Export - Process by which data is written out of DataStage to any supported target.
Funnel - Stage used to gather many links, where each link contains many partitions, into a single parallel link. All input links must have the same layout.
Generator - Stage used to create rows of data based on the table definition and parameters provided. Often useful for testing applications where real data is not available.
Grid - Large collection of computing resources which allows for MPP-style processing of data. Grid computing often allows for dynamic configuration of available computing resources.
Import - Process by which data is read into DataStage and translated to the DataStage internal format.
Job - A collection of stages arranged in a logical manner to represent a particular piece of business logic. Jobs must first be compiled before they can be executed.
Glossary
Link - A conduit between 2 stages which enables data transfer from the upstream stage to the downstream stage.
Manager - DataStage client used to import/export objects from the DataStage server repository. These include table definitions, jobs, and custom built stages.
MPP - Massively Parallel Processing. Computing architecture where memory and disk are not shared across hardware processing nodes.
Operator - Same as a stage. Operators are represented by stages in the Designer, but referenced directly by the Framework.
Partition - Division of data into parts for the purpose of parallel processing.
Parallelism - Concurrent processing of data.
Partitioned Parallelism - divide-and-conquer approach to processing data. Data is divided into partitions and processed concurrently. Data remains in the same partition throughout the entire life of the job.
Pipelined Parallelism - parallel data processing similar to partitioned parallelism, except data does not have to remain within the same partition throughout the life of the job. This allows records to be processed across various partitions, helping eliminate potential bottlenecks.
Glossary
Peek - Stage which allows users to view a subset of records (default 10 per partition) as they pass through.
Pipelining - The ability to process data and pass data between processes in memory instead of having to first land data to disk.
RCP - Runtime Column Propagation. Feature which allows columns to be automatically propagated at runtime without the user having to manually perform source to target mapping at design time.
RDBMS - Relational Database Management System. A database that is organized and accessed according to the relationships between data values.
Reject - Record that is rejected by a stage because it does not meet a specific condition.
Scalability - From a DataStage perspective, it's the ability of an application to process the same amount of data in less time as additional hardware resources are added to the computing platform.
SMP - Symmetric Multi-Processing. Computing architecture where memory and disk are shared by all processors.
Glossary
Stage - A component in DataStage that performs a predetermined action against the data. For example, the Sort stage will sort all records based on a chosen column or set of columns.
Table Definition - A schema containing field names and their associated data types and properties. Can also contain descriptions of the content of the field(s).
Wrapper - An external application, command, or other independently executable object that can be called from within DataStage as a stage. Wrappers can accept many inputs and many outputs, but the inputs and outputs must be pre-defined.