WebSphere DataStage®
Version 8
SC18-9889-00
Note
Before using this information and the product that it supports, be sure to read the general information under “Notices and trademarks” at the end of this document.
Contents
Chapter 1. Introduction
Chapter 2. Tutorial project goals
Chapter 3. Module 1: Opening and running the sample job
   Lesson 1.1: Opening the sample job
      The Designer client
      The sample job for the tutorial
      Starting the Designer client and opening the sample job
      Lesson checkpoint
   Lesson 1.2: Viewing and compiling the sample job
      Exploring the Sequential File stage
      Exploring the Data Set stage
      Compiling the sample job
      Lesson checkpoint
   Lesson 1.3: Running the sample job
      Running the job
      Viewing the data set
      Lesson checkpoint
   Module 1: Summary
Chapter 4. Module 2: Designing your first job
   Lesson 2.1: Creating a job
      Lesson checkpoint
   Lesson 2.2: Adding stages and links to the job
      The job design
      Adding stages and linking them
      Specifying properties and column metadata for the Sequential File stage
      Specifying properties for the Lookup File Set stage and running the job
      Lesson checkpoint
   Lesson 2.3: Importing metadata
      Importing metadata into your repository
      Loading column metadata from the repository
      Lesson checkpoint
   Lesson 2.4: Adding job parameters
      Job parameters
      Defining job parameters
      Adding job parameters to your job design
      Supplying values for the job parameters
      Lesson checkpoint
   Lesson 2.5: Creating parameter sets
      Parameter sets
      Creating a parameter set from existing job parameters
      Lesson checkpoint
   Module 2 Summary
Chapter 5. Module 3: Designing a transformation job
   Lesson 3.1: Designing the transformation job
      The transformer job
      Creating the transformation job and adding stages and links
      Configuring the Data Set stages
      Configuring the Transformer stage
      Running the transformation job
      Lesson checkpoint
   Lesson 3.2: Combining data in a job
      Using a Lookup stage
      Creating a lookup job
      Configuring the Lookup File Set stage
      Configuring the Lookup stage
      Lesson checkpoint
   Lesson 3.3: Capturing rejected data
      Lesson checkpoint
   Lesson 3.4: Performing multiple transformations in a single job
      Adding new stages and links
      Configuring the Business_Rules Transformer stage
      Configuring the Lookup operation
      Lesson checkpoint
   Module 3 Summary
Chapter 6. Module 4: Loading a data target
   Lesson 4.1: Creating a data connection object
      Data connection objects
      Creating a data connection object
      Lesson checkpoint
   Lesson 4.2: Importing column metadata from a database table
      Lesson checkpoint
   Lesson 4.3: Writing to a database
      Connectors
      Creating the job
      Configuring the Data Set stage
      Configuring the ODBC connector
      Lesson checkpoint
   Module 4 summary
Chapter 7. Module 5: Processing in parallel
   Lesson 5.1: Exploring the configuration file
      Opening the default configuration file
      Example configuration file
      Lesson checkpoint
   Lesson 5.2: Partitioning data
      Viewing partitions in a data set
      Creating multiple data partitions
      Lesson checkpoint
   Lesson 5.3: Changing the configuration file
      Creating a configuration file
      Deploying the new configuration file
Learning objectives
By completing this tutorial, you will achieve the following learning objectives:
• Learn how to design parallel jobs that extract, transform, and load data.
• Learn how to run the jobs that you have designed, and how to view the results.
• Learn how to create reusable objects that can be included in other job designs.
The company GlobalCo is merging with WorldCo. Both companies have worldwide customer bases and, because their businesses are similar, they have some customers in common. The new merged company
wants to build a data warehouse for the delivery and billing information. The exercises in this tutorial
focus on a small portion of the work that needs to be done to accomplish this goal.
Your part of the project is to work on the GlobalCo data that records billing details for customers. You
must read this data from a comma-separated file, and then cleanse and transform the data in preparation
for it to be merged with the equivalent data from WorldCo. This data ultimately forms the bill_to
dimension table in the finished data warehouse.
Learning objectives
As you work through the job scenario, you will learn how to do the following tasks:
• Design parallel jobs that extract, transform, and load data
• Run the jobs that you design and view the results
• Create reusable objects that can be included in other job designs
This tutorial should take approximately four hours to finish. If you explore other concepts related to this
tutorial, it can take longer to complete.
Skill level
You can do this tutorial with only a basic understanding of WebSphere DataStage concepts.
Audience
This tutorial is intended for WebSphere DataStage designers who want to learn how to create parallel
jobs.
System requirements
Prerequisites
You need to complete the following tasks before starting the tutorial:
• Get DataStage developer privileges from the WebSphere DataStage administrator
• Check that the WebSphere DataStage administrator has installed and set up the tutorial by following the procedures described in Appendix A
The sample job extracts data from a comma-separated file and writes the data to a staging area. The data
that the job writes is used by later modules in the tutorial.
Learning objectives
After you complete the lessons in this module, you will understand how to do the following tasks:
• Start the WebSphere DataStage and QualityStage Designer (Designer client) and attach to a project.
• Open an existing job.
• Compile a job so that it is ready to run.
• Open the Director client and run a job.
• View the results of the job.
Prerequisites
This lesson shows you how to start the Designer client and open the sample job that is supplied with the
tutorial.
The Designer client is like a workbench or a blank canvas that you use to build jobs. The Designer client
has a palette that contains the tools that form the basic building blocks of a job:
• Stages connect to data sources to read or write files, and to process data.
• Links connect the stages; your data flows along these links.
• Annotations provide information about the jobs that you create.
The Designer client uses a repository where you can store the objects that you are creating as part of the
design process. These objects can be reused by other job designers. The sample job is an object in the
repository that is included with the tutorial. The sample job uses a table definition, which is also an
object in the repository.
In the design area of the Designer client, you work with the tools and objects to create your job designs.
The sample job opens in a design window.
The data that you use in this job is the bill-to information from GlobalCo. This data becomes the bill_to
dimension for the star schema.
The job opens in the Designer client display area. The following figure shows the Designer client with the
samplejob job open. The Tutorial folder is shown in the repository tree.
[Figure: the Designer client with the sample job open, showing the palette and the design area]
Lesson checkpoint
In this lesson, you opened your first job.
The sample job has a Sequential File stage to read data from the flat file, and a Data Set stage to write data to the staging area. The two stages are joined by a link. The data that flows between the two stages was defined when the job was designed; when the job runs, the data flows along this link.
Lesson checkpoint
In this lesson, you explored a simple data extraction job that reads data from a file and writes it to a
staging area.
You run the job from the Director client. The Director client is the operating console. You use the Director
client to run and troubleshoot jobs that you are developing in the Designer client. You also use the
Director client to run fully developed jobs in the production environment.
You use the job log to debug any errors you receive when you run the job.
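Although this tutorial uses the Director client throughout, you can also run jobs from the command line by using the dsjob tool that is installed with the WebSphere DataStage server. As a sketch (the project and job names here are examples):

    dsjob -run -jobstatus tutorial_project samplejob

The -jobstatus option waits for the job to finish and returns a code that reflects the job's completion status.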
Lesson checkpoint
In this lesson you ran the sample job and looked at the results.
Module 1: Summary
You have now opened, compiled, and run your first data extraction job.
Now that you have run a data extraction job, you can start creating your own jobs. The next module
guides you through the process of creating a simple job that does more data extraction.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
• Starting the Designer client.
• Opening an existing job.
• Compiling the job.
• Starting the Director client from the Designer client.
• Running the sample job.
• Viewing the results of the sample job and seeing how the job extracts data from a comma-separated file and writes it to a staging area.
Additional resources
For more information about the features that you have learned about, see the following guides:
• IBM WebSphere DataStage Designer Client Guide
• IBM WebSphere DataStage Director Client Guide
The job that you design will read two flat files and populate two lookup tables. The two lookup tables
will be used by a more complex job that you will create in the next module.
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
• Add stages and links to a job.
• Specify the properties of the stages and links to determine what they will do when the job is run.
• Specify column metadata.
• Consolidate your knowledge of compiling and running jobs.
If you closed the Designer client after completing module 1, you will need to start the Designer client
again.
You create a parallel job and save it to a new folder in the Tutorial folder in the repository tree.
To create a job:
1. In the Designer client, select File → New.
2. In the New window, select the Jobs folder in the left pane and then select the parallel job icon in the
right pane.
3. Click OK. A new empty job design window opens in the design area.
4. Select File → Save.
5. In the Save Parallel Job As window, right-click on the Tutorial folder and select New → Folder from
the shortcut menu.
6. Type in a name for the folder, for example, My Jobs, and then move to the Item name field.
7. Type in the name of the job in the Item name field. Call the job populate_cc_spechand_lookupfiles.
8. Check that the Folder path field contains the path \Tutorial\My Jobs, then click Save.
You have created a new parallel job named populate_cc_spechand_lookupfiles and saved it in the folder
Tutorial\My Jobs in the repository.
Lesson checkpoint
In this lesson you created a job and saved it to a specified place in the repository.
The first part of the job reads a comma-separated file that contains a series of customer numbers, a
corresponding code that identifies the country in which the customers are located, and another code that
specifies the customer’s language. You are designing a job that reads the comma-separated file and writes
the contents to a lookup table in a lookup file set. This table will be used by a subsequent job when it
populates a dimension table.
A job consists of stages that are linked together to describe the flow of data from a data source to a data target. A stage is a graphical representation of the data itself, or of a transformation that is performed on that data.
Your job design should now look something like the one shown in this figure:
You will use the default values for the remaining fields.
11. Add two more rows to the table to specify the remaining two columns and fill them in as follows:
Your Columns tab should look like the one in the following figure (if you have National Language
Support installed, there is an additional field named Extended):
12. Click the Save button to save the column definitions that you specified as a table definition object in
the repository. The definitions can then be reused in other jobs.
13. In the Save Table Definition window, enter the following information:

    Data source type: Saved
    Data source name: CustomerCountry.csv
    Table/file name:  country_codes_data
14. Click OK to specify the locator for the table definition. The locator identifies the table definition.
15. In the Save Table Definition As window, save the table definition in the Tutorial folder and name it
country_codes_data.
16. Click the View Data button and click OK in the Data Browser window to use the default settings.
The data browser shows you the data that the CustomerCountry.csv file contains. Since you specified
the column definitions, the Designer client can read the file and show you the results.
17. Close the Data Browser window.
18. Click OK to close the stage editor.
19. Save the job.
Notice that a small table icon has appeared on the Country_codes_data link. This icon shows that the link
now has metadata. You have designed the first part of your job.
Specifying properties for the Lookup File Set stage and running the job
In this part of the lesson, you configure the next stage in your job. You already specified the column
metadata for data that will flow down the link between the two stages, so there are fewer properties to
specify in this task.
You have now written a lookup table that can be used by another job later on in the tutorial.
Lesson checkpoint
You have now designed and run your very first job.
In this lesson, you will add more stages to the job that you designed in Lesson 2.2. The stages that you
add are similar to the ones that you added in lesson 2.2. The stages read a comma-separated file that
contains code numbers and corresponding special delivery instructions. The contents are again written to
a lookup table that is ready to use in a later job. The finished job contains two separate data flows, and it
will write data to two separate lookup file sets. Rather than type the column metadata, you import the
column metadata from the source file, and use that metadata in the job design.
The column definitions that you viewed are stored as a table definition in the repository.
In this part of the lesson, you are consolidating the job design skills that you learned and loading the
column metadata from the table definition that you imported earlier.
1. Add a Sequential File stage and a Lookup File Set stage to your job and link them together. Position
them under the stages and link that you added earlier in this lesson.
2. Rename the stages and link as follows:
Your job design should now look like the one shown in this figure:
3. Open the stage editor for the special_handling Sequential File stage and specify that it will read the
file SpecialHandling.csv and that the first line of this file contains column names.
4. Click the Format tab.
5. In the record-level category, select the Record delimiter string property from the Available
properties to add.
6. Select DOS format from the Record delimiter string list. This setting ensures that the file can be read
by UNIX or Linux WebSphere DataStage servers.
7. Click the Columns tab.
8. Click Load. You load the column metadata from the table definition that you previously saved as an
object in the repository.
9. In the Table Definitions window, browse the repository tree to the folder where you stored the
SpecialHandling.csv column definitions.
10. Select the SpecialHandling.csv table definition and click OK.
11. In the Selected Columns window, ensure that all of the columns appear in the Selected columns list
and click OK. The column definitions appear in the Columns tab of the stage editor.
12. Close the Sequential File stage editor.
13. Open the stage editor for the special_handling_lookup stage.
14. Specify a path name for the destination file set and specify that the lookup key is the
SPECIAL_HANDLING_CODE column then close the stage editor.
15. Save, compile, and run the job.
Job parameters
Sometimes, you want to specify information when you run the job rather than when you design it. In
your job design, you can specify a job parameter to represent this information. When you run the job,
you are then prompted to supply a value for the job parameter.
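For example, you reference a job parameter in a stage property by enclosing its name in number signs. As a sketch (the parameter name is an example): instead of typing a literal path in the File property of a Sequential File stage, you type #SourceFileName#, and you are prompted for the actual path each time that you run the job.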
You specified the location of four files in the job that you designed in Lesson 2.3. In each part of the job,
you specified a file that contains the source data and a file to write the lookup data set to. In this lesson,
you will replace all four file names with job parameters. You will then supply the actual path names of
the files when you run the job.
You will save the definitions of these job parameters in a parameter set in the repository. When you want to use the same job parameters in a job later on in this tutorial, you can load them into the job design from the parameter set. Parameter sets enable the same job parameters to be used by different jobs.
Lesson checkpoint
You defined job parameters to represent the file names in your job and specified values for these
parameters when you ran the job.
In this lesson, you will create a parameter set from the job parameters that you created in Lesson 2.4. You
will also supply a set of default values for the parameters in the parameter set that are also available
when the parameter set is used.
Parameter sets
You use parameter sets to define job parameters that you are likely to reuse in other jobs. Whenever you
need this set of parameters in a job design, you can insert them into the job properties from the
parameter set. You can also define different sets of values for each parameter set. These parameter sets
are stored as files in the WebSphere DataStage server installation directory and are available to use in
your job designs or when you run jobs that use these parameter sets. If you make any changes to a parameter set object, these changes are reflected in job designs that use the object up until the time that the job is compiled. A job runs with the parameter values that it was compiled with. However, if you edit the job design after the job is compiled, the design links to the current version of the parameter set and picks up any later changes to it.
You can create parameter sets from existing job parameters, or you can specify the job parameters as part
of the task of creating a new parameter set.
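As a sketch of how this looks in a job design (the names here are examples): if a parameter set named TutorialFiles contains a parameter named SourceFileName, a stage property references it as #TutorialFiles.SourceFileName#, and when you run the job you can either accept one of the named sets of default values or supply new values.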
You created a parameter set that is available for another job that you will create later in this tutorial. The
current job continues to use the individual parameters rather than the parameter set.
Lesson checkpoint
You have now created a parameter set.
Module 2 Summary
In this module, you designed and ran a data extraction job.
You also learned how to create reusable objects such as table definitions and parameter sets that you can
include in other jobs that you design.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
• Creating new jobs and saving them in the repository
• Adding stages and links and specifying their properties
• Specifying column metadata and saving it as a table definition to reuse later
• Specifying job parameters to make your job design more flexible, and saving the parameters in the repository to reuse later
The job that you design will read the GlobalCo bill_to data that was written to a data set when you ran
the sample job in Module 1 of this tutorial. Your job will perform some simple cleansing of the data. Your
job will transform the data by dropping the columns that you do not need and by trimming some of the
data in the columns that you do need.
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
• Use a Transformer stage to transform data
• Handle rejected data
• Combine data by using a Lookup stage
The job will also specify some stricter data typing for the remaining columns. Stricter data typing helps
to impose quality controls on the data that you are processing.
Finally, the job applies a function to one of the data columns to delete space characters that the column
contains. This transformation job prepares the data in that column for a later operation.
The transformation job that you are designing uses a Transformer stage, but there are also several other
types of processing stages available in the Designer client that can transform data. For example, you can
use the Modify stage in your job, if you want to change only the data types in a data set. Several of the
processing stages can drop data columns as part of their processing. In the current job, you use the
Transformer stage because you require a transformation function that you can customize. Several
functions are available to use in the Transformer stage.
By specifying stricter data typing for your data, you will be able to better diagnose inconsistencies in
your source data when you run the job.
5. Double-click the Derivation field for the CUSTOMER_NUMBER column in the stripped_bill_to link.
The expression editor opens.
6. In the expression editor, type the following text: trim(full_bill_to.CUSTOMER_NUMBER,' ','A'). The
text specifies a function that deletes all the space characters from the CUSTOMER_NUMBER column
on the full_bill_to link before writing it to the CUSTOMER_NUMBER column on the stripped_bill_to
link. Your Transformer stage editor should look like the one in the following figure:
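As a quick check of the expression: the third argument, 'A', specifies that all occurrences of the character in the second argument (here, a space) are removed. For example, trim(' 12 34 56 ', ' ', 'A') returns '123456'.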
You will base your new job on the transformation job that you created in Lesson 3.1. You will add a
Lookup stage that looks up the data that you created in Lesson 2.2.
You can also combine data in a parallel job by using a Join stage. When the reference table is large, a job can run faster if it combines data by using a Join stage. For the job that you are designing, the reference table is small, and so a Lookup stage is preferred. The Lookup stage is most efficient when the data that is being looked up fits into the available physical memory.
You can configure a Lookup stage to search for data in a lookup file set or in a relational database. The job that you are designing looks up data in a reference table in the lookup file set that was
created in Lesson 2.2 of this tutorial. When you use lookup file sets, you must specify the lookup key
column when you define the file set. You defined the key columns for the lookup tables that you used in
this lesson when you created the file sets in Module 2.
Ensure that the TrimAndStrip job that you created in Lesson 3.1 is open, and that you have a
multi-window view in the design area of the Designer client. In multi-window view, you can see all the
open jobs in the display area. To switch from single-window view to multi-window view, click the
minimize button in the Designer Client menu bar.
1. Create a job, name it CleansePrepare, and save it in the tutorial folder in the repository.
2. In the TrimAndStrip job, drag the mouse cursor around the stages in the job to select them and
select Edit → Copy.
3. In the CleansePrepare job, select Edit → Paste. The stages appear in the CleansePrepare job. You can
now close the TrimAndStrip job.
4. Select the Processing area in the palette and drag a Lookup stage to the CleansePrepare job. Position
the Lookup stage just below the int_GlobalCoBillTo stage and name it Lookup_Country.
5. Select the stripped_bill_to link, position the mouse cursor in the link’s arrowhead, and drag to the
Lookup stage. You moved the link with its associated column metadata to allow data to flow from
the Transformer stage to the Lookup stage.
The job that you designed should look like the one in the following figure:
Lesson checkpoint
With this lesson, you started to design more complex and sophisticated jobs.
Ensure that the CleansePrepare job that you created in Lesson 3.2 is open and active.
In the Lookup stage for the job that you created in Lesson 3.2, you specified that processing should
continue on a row if the lookup operation fails. Any rows with CUSTOMER_NUMBER values that were not matched in the lookup table were passed through, and the COUNTRY column for those rows was set to NULL. In this lesson, you will specify that non-matching rows are written to a reject link. The reject link
captures any customer numbers that do not have an entry in the country codes table. You can examine
the rejected rows and decide what action to take.
1. From the File section of the palette, drag a Sequential File stage to the CleansePrepare job and
position it under the Lookup_Country Lookup stage. Name the Sequential File stage Rejected_Rows.
2. Draw a link from the Lookup stage to the Sequential File stage. Name the link rejects. Because the
Lookup stage already has a stream output link, the new link is designated as a reject link and is
shown as a dashed line. Your job should resemble the one in the following figure:
Lesson checkpoint
You learned the following tasks:
• How to add a reject link to your job
• How to configure the Lookup stage so that it rejects data when a lookup fails
In this lesson, you will further transform your data to apply some business rules and perform another
lookup of a reference table.
2. Select the following columns in the country_code input link and drag them to the
with_business_rules output link:
• CUSTOMER_NUMBER
• CUST_NAME
• ADDR_1
• ADDR_2
• CITY
• REGION_CODE
• ZIP
• TEL_NUM
3. In the metadata area for the with_business_rules output link, add the following new columns: SOURCE, SPECIAL_HANDLING_CODE, SETUP_DATE, and RECNUM. The new columns appear in the graphical representation of the link, but are highlighted in red because they do not yet have valid derivations.
4. In the graphical area, double-click the Derivation field of the SOURCE column.
5. In the expression editor, type 'GlobalCo':. Position your mouse pointer immediately to the right of this text, right-click and select Input Column from the menu. Then select the COUNTRY column from the list. When you run the job, the SOURCE column for each row will contain the two-letter country code prefixed with the text GlobalCo, for example, GlobalCoUS. (The completed derivation is shown after this procedure.)
6. In the Transformer stage editor toolbar, click the Stage Properties tool on the far left. The
Transformer Stage Properties window opens.
7. Click the Variables tab and, by using the techniques that you learned for defining table definitions, add the xtractSpecialHandling and TrimDate stage variables to the grid.
When you close the Properties window, these stage variables appear in the Stage Variables area
above the with_business_rules link.
8. Double-click the Derivation fields of each of the stage variables in turn and type the following
expressions in the expression editor:
9. Select the xtractSpecialHandling stage variable and drag it to the Derivation field of the
SPECIAL_HANDLING_CODE column and drop it on the with_business_rules link. A line is drawn
between the stage variable and the column, and the name xtractSpecialHandling appears in the
Derivation field. For each row that is processed, the SPECIAL_HANDLING_CODE column writes
the current value of the xtractSpecialHandling variable.
10. Select the TrimDate stage variable and drag it to the Derivation field of the SETUP_DATE column
and drop it on the with_business_rules link. A line is drawn between the stage variable and the
column, and the name TrimDate appears in the Derivation field. For each row processed, the
SETUP_DATE column writes the current value of the TrimDate variable.
11. Double-click the Derivation field of the RECNUM column and type 'GC': in the expression editor. Right-click and select System Variable from the menu. Then select @OUTROWNUM. You added row numbers to your output. (The completed derivation is shown after this procedure.)
Your transformer editor should look like the one in the following picture:
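For reference, the two derivations that you typed in steps 5 and 11 should read like the following sketch (country_code is the input link name that is used earlier in this lesson):

    SOURCE column derivation:  'GlobalCo' : country_code.COUNTRY
    RECNUM column derivation:  'GC' : @OUTROWNUM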
Lesson checkpoint
In this lesson, you consolidated your existing skills in defining transformation jobs and added some new
skills.
Module 3 Summary
In this module you refined and added to your job design skills.
You learned how to design more complex jobs that transform the data that your previous jobs extracted.
Lessons learned
By completing this module, you learned the following concepts and tasks:
• How to drop data columns from your data flow
• How to use the transform functions that are provided with the Designer client
• How to combine data from two different sources
• How to capture rejected data
In the tutorial modules that you completed so far, you were working with comma-separated files and
staging files in internal formats (data sets and lookup file sets). In this module, you start working with a
relational database. The database is the ultimate target for the data that you are working with.
For these lessons, you will use the database that is hosting the repository. The tutorial supplies scripts
that your database administrator runs to create the tables that you need for these lessons.
Because different types of relational database can be used to host the repository, the lessons in this
module use an ODBC connection that makes the lessons database-independent. Your database
administrator needs to set up a DSN that you can use to connect to the database by using ODBC.
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
• Define a data connection object that you use and reuse to connect to a database.
• Import column metadata from a database.
• Write data to a relational database target.
Prerequisites
Ensure that your database administrator runs the relevant database scripts that are supplied with the tutorial and sets up a DSN that you can use when you connect to the database through the ODBC connector.
If you change the details of a data connection while you are designing a job, these changes are reflected
in the job design. However, after you compile your job, the data connection details are fixed in the
executable version of the job. Subsequent changes to the job design will once again link to the data
connection object and pick up any changes that were made to that object.
You can create data connection objects directly in the repository. Also, you can create data connection
objects when you are using a connector stage to import metadata by saving the connection details. This
lesson shows you how to create a data connection object directly in the repository.
7. Click OK.
8. In the Save Data Connection As window, select the tutorial folder and click Save.
Lesson checkpoint
You learned how to create a data connection object and store the object in the repository.
In Lesson 2.3, you learned how to import column metadata from a comma-delimited file. In this lesson,
you will import column metadata from a database table by using the ODBC connector. When you import
data by using a connector, the column definitions are saved as a table definition in the project repository
and in the dynamic repository. The table definition is then available to be used by other projects and by
other components in the information integration suite.
The table definition is imported and appears in the tutorial folder. The table definition has a different
icon from the table definitions that you used previously. This icon identifies that the table definition was
imported by using a connector and is available to other projects and to other suite components.
Lesson checkpoint
You learned how to import column metadata from a database by using a connector.
Double-check that your database administrator ran the scripts to set up the database and database table
that you need to access in this lesson. Also ensure that the database administrator set up a DSN for you
to use for the ODBC connection.
Connectors
Connectors are stages that you use to connect to data sources and data targets to read or write data.
The Database section of the palette in the Designer client contains many types of stages, and several of them connect to the same types of data sources or targets. For example, if you click the down arrow next to the ODBC icon in the palette, you can choose to add either an ODBC connector stage or an ODBC Enterprise stage to your job.
If your database type supports connector stages, use them because they provide the following advantages
over other types of stages:
• They create job parameters from the connector stage (without first defining the job parameters in the job properties).
• They save any connection information that you specify in the stage as a data connection object.
• They reconcile data types between source and target to avoid runtime errors.
• They generate detailed error information if a connector encounters problems when the job runs.
In this exercise, you will use the table definition that you imported in Lesson 4.2. Notice that the column
definitions are the same as the table definition that you created by editing the Transformer stage and
Lookup stage in the job in Lesson 3.4.
1. Double-click the BillToSource Data Set stage to open the stage editor.
2. Select the File property on the Properties tab of the Output page and set it to the data set that you
created in Lesson 3.4. Use a job parameter to represent the data set file.
3. In the Columns page, click Load.
4. In the Table Definitions window, open the tutorial folder, select the table definition that you created in
Lesson 4.2, and click OK. The columns grid is populated with the column metadata.
5. Click OK to close the stage editor.
You wrote the BillTo data to the tutorial database table. This table forms the bill_to dimension of the star
schema that is being implemented for the GlobalCo delivery data in the business scenario that the
tutorial is based on.
Lesson checkpoint
You learned how to use a connector stage to connect to and write to a relational database table.
In Lesson 4.1, you learned how to define a data connection object; in Lesson 4.2, you imported column metadata from a database; and in Lesson 4.3, you learned how to write data to a relational database target.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
• Loading data into data targets
• Using the Designer client’s reusable components
When you design parallel jobs in the Designer client, you design them as if they run sequentially, without being concerned about how parallel processing is implemented. You specify the logic of the job, and WebSphere DataStage determines the best implementation on the available hardware. However, you can exert more precise control over how a job is implemented.
In this module, you learn about how you can control whether jobs run sequentially or in parallel, and
you look at the partitioning of data.
Learning objectives
After completing the lessons in this module, you will know how to do the following:
• Use the configuration file to optimize parallel processing.
• Control parallel processing at the stage level in your job design.
• Control the partitioning of data so that it can be handled by multiple processors.
Prerequisites
You must be working on a computer with multiple processors.
You must have DataStage administrator privileges to create and use a new configuration file.
Unless you specify otherwise, the parallel engine uses a default configuration file that is set up when
WebSphere DataStage is installed.
The default configuration file is created when WebSphere DataStage is installed. Although the system has
four processors, the configuration file specifies two processing nodes. Specify fewer processing nodes
than there are physical processors to ensure that your computer has processing resources available for
other tasks while it runs WebSphere DataStage jobs.
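The following sketch shows the general form of such a configuration file with two processing nodes (the host name and resource paths are examples; the values in your installation will differ):

    {
        node "node1"
        {
            fastname "myserver"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
        }
        node "node2"
        {
            fastname "myserver"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }

Each node entry defines one processing node: fastname is the network name of the computer that hosts the node, and the disk and scratchdisk resources tell the parallel engine where to store data sets and temporary files.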
Configuration files can be more complex and sophisticated than the example file and can be used to tune
your system to get the best possible performance from the parallel jobs that you design.
Lesson checkpoint
In this lesson, you learned how the configuration file is used to control parallel processing.
Most partitioning operations result in a set of partitions that are as near to equal size as possible,
ensuring an even load across your processors.
As you perform other operations, you might need to control partitioning to ensure that you get consistent results. For example, suppose that you are using an Aggregator stage to summarize your data. You must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.
In this lesson, you will run the sample job that you ran in Lesson 1. By default, the data that is read from
the file is not partitioned when it is written to the data set. You change the job so that it has the same
number of partitions as there are nodes defined in your system’s default configuration file.
This exercise teaches you how to use the data set management tool to look at data sets and how they are
structured.
The sample job reads a comma-separated file. By default, comma-separated files are read sequentially and
all their data is stored in a single partition. In this exercise, you will override the default behavior and
specify that the data that is read from the file will be partitioned by using the round-robin method. The
round-robin method sends the first data row to the first processing node, the second data row to the
second processing node, and so on.
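For example, with the two processing nodes that are defined in the default configuration file, round-robin partitioning of six rows places rows 1, 3, and 5 in partition 0 and rows 2, 4, and 6 in partition 1, so the load is spread evenly across the nodes.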
Lesson checkpoint
In this lesson, you learned some basics about data partitioning.
This lesson demonstrates that you can quickly change configuration files to affect how parallel jobs are
run. When you develop parallel jobs, first run your jobs and test the basic functionality before you start
implementing parallel processing.
You deployed your new configuration file. Keep the Administrator client open, because you will use it to
restore the default configuration file at the end of this lesson.
You will see how the configuration file overrides other settings in your job design. Although you
previously partitioned the data that is read from the GlobalCo_BillTo comma-separated file, the
configuration file specifies that the system has only a single processing node available, and so no data
partitioning is performed.
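A single-node configuration file follows the same format as the two-node example in Lesson 5.1. As a sketch (again with example values for the host name and paths):

    {
        node "node1"
        {
            fastname "myserver"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }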
Lesson checkpoint
You learned how to create a configuration file and use it to alter the operation of parallel jobs.
You also learned how to control the partitioning of data at the level of individual stages.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
• The configuration file
• How to use the configuration editor to edit the configuration file
• How to control data partitioning
Now that you have successfully completed the WebSphere DataStage tutorial, you can complete the IBM
WebSphere QualityStage tutorial. The WebSphere QualityStage tutorial implements the standardization of
customer information and the removal of duplicate entries from the data.
Lessons learned
By completing this tutorial, you learned about the following concepts and tasks:
• How to extract, transform, and load data by using WebSphere DataStage
• How to use the parallel processing power of WebSphere DataStage
• How to reuse job design elements
You need DataStage administrator privileges and Windows administrator privileges to perform some of
the installation and setup tasks. You need a higher level of system knowledge and database knowledge to
complete the installation and setup tasks than you need to complete the tutorial.
Module 4 of the tutorial imports metadata from a table in a relational database and then writes data to
the table. WebSphere DataStage uses a repository that is hosted by a relational database (DB2 by default)
and you can create the table in that database. There are data definition language (DDL) scripts for DB2,
Oracle, and SQL Server in the tutorial folder. To create the table:
1. Open the administrator client for your database (for example, the DB2 Control Center).
2. Create a new database named Tutorial.
3. Connect to the new database.
4. Run the appropriate DDL script to create the tutorial table in the new database. The scripts are in the
tutorial folder and are named as follows:
• DB2_table.ddl
• Oracle_table.ddl
• SQLserver_table.ddl
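The supplied scripts are authoritative. Purely as an illustration of the table's shape, based on the columns that the job in Lesson 3.4 writes (the table name and all data types shown here are assumptions), the DDL resembles:

    -- Illustrative sketch only; run the supplied DDL script for your database.
    CREATE TABLE BILL_TO (
        CUSTOMER_NUMBER        VARCHAR(10),   -- assumed type and length
        CUST_NAME              VARCHAR(40),   -- assumed
        ADDR_1                 VARCHAR(40),   -- assumed
        ADDR_2                 VARCHAR(40),   -- assumed
        CITY                   VARCHAR(30),   -- assumed
        REGION_CODE            VARCHAR(10),   -- assumed
        ZIP                    VARCHAR(10),   -- assumed
        TEL_NUM                VARCHAR(20),   -- assumed
        SOURCE                 VARCHAR(20),   -- assumed
        SPECIAL_HANDLING_CODE  INTEGER,       -- assumed
        SETUP_DATE             VARCHAR(20),   -- assumed
        RECNUM                 VARCHAR(10)    -- assumed
    );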
You define the DSN on the computer where the WebSphere DataStage server is installed. The procedure
on a Windows computer is different from the procedure on a UNIX or Linux computer. You require
administrator privileges on the Windows computer.
You define the DSN on the computer where the WebSphere DataStage server is installed. The procedure
on a UNIX or Linux computer is different from the procedure on a Windows computer. To set up a DSN
on a UNIX or Linux computer, you edit three files:
• dsenv
• odbc.ini
• uvodbc.config
The entries that you make in each file depend on the type of database. Full details are in the IBM
Information Server Planning, Configuration, and Installation Guide.
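As an illustration only (the entries vary by database type and ODBC driver, and every name and path here is a placeholder), a DSN named Tutorial might be defined by an odbc.ini entry such as:

    [Tutorial]
    Driver=/path/to/your/odbc/driver/library.so
    Description=DSN for the tutorial database
    Database=Tutorial

with a matching entry in uvodbc.config:

    <Tutorial>
    DBMSTYPE = ODBC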
The product documentation is available in the IBM Information Server information center at publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp.
You can order IBM publications online or through your local IBM representative.
• To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/order.
• To order publications by telephone in the United States, call 1-800-879-2755.
To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at
www.ibm.com/planetwide.
Contacting IBM
You can contact IBM by telephone for customer support, software services, and general information.
Customer support
To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).
Software services
To learn about available service options, call one of the following numbers:
• In the United States: 1-888-426-4343
• In Canada: 1-800-465-9600
General information
Accessible documentation
Documentation is provided in XHTML format, which is viewable in most Web browsers.
Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing
the online documentation using a screen reader.
Your feedback helps IBM to provide quality information. You can use any of the following methods to
provide comments:
• Send your comments by using the online readers’ comment form at www.ibm.com/software/awdtools/rcf/.
• Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version number of the product, and the name and part number of the information (if applicable). If you are commenting on specific text, please include the location of the text (for example, a title, a table number, or a page number).
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program,
or service that does not infringe any IBM intellectual property right may be used instead. However, it is
the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can send
license inquiries, in writing, to:
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some
states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of
the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided
by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or
any equivalent agreement between us.
Any performance data contained herein was determined in a controlled environment. Therefore, the
results obtained in other operating environments may vary significantly. Some measurements may have
been made on development-level systems and there is no guarantee that these measurements will be the
same on generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
All statements regarding IBM’s future direction or intent are subject to change or withdrawal without
notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to the names and addresses used by an
actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.
Each copy or any portion of these sample programs or any derivative work, must include a copyright
notice as follows:
© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. ©
Copyright IBM Corp. _enter the year or years_. All rights reserved.
Trademarks
IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document.
Java™ and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.
Microsoft®, Windows, Windows NT®, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both.
Intel®, Intel Inside® (logos), MMX and Pentium® are trademarks of Intel Corporation in the United States,
other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product or service names might be trademarks or service marks of others.