
WebSphere QualityStage


Version 8

WebSphere QualityStage Tutorial

SC18-9925-00
Note
Before using this information and the product that it supports, be sure to read the general information under “Notices and
trademarks” on page 43.

© Copyright International Business Machines Corporation 2004, 2006. All rights reserved.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

Chapter 1. The Designer Client structure . . . 1
   About DataStage Projects . . . 1
   About QualityStage jobs . . . 1
   WebSphere DataStage and QualityStage stages . . . 2
   Server and client components . . . 2

Chapter 2. Tutorial project goals . . . 3
   Copying tutorial data . . . 4
   Starting a project . . . 4
      Creating a job . . . 4
      Importing tutorial sample data . . . 5

Chapter 3. Module 1: Investigating source data . . . 7
   Lesson 1.1: Setting up and linking an investigate job . . . 7
      Lesson checkpoint . . . 8
   Lesson 1.2: Renaming links and stages in an Investigate job . . . 8
      Lesson checkpoint . . . 9
   Lesson 1.3: Configuring the source file . . . 9
      Lesson checkpoint . . . 10
   Lesson 1.4: Configuring the Copy stage . . . 11
      Lesson checkpoint . . . 11
   Lesson 1.5: Configuring the Investigate stage . . . 11
      Lesson summary . . . 12
   Lesson 1.6: Configuring the InvestigateStage2 icon . . . 13
      Lesson summary . . . 14
   Lesson 1.7: Configuring the target reports . . . 14
      Lesson checkpoint . . . 15
   Lesson 1.8: Compiling and running jobs . . . 15
      Lesson checkpoint . . . 15
   Module 1 Summary . . . 15

Chapter 4. Module 2: Standardizing data . . . 17
   Lesson 2.1: Setting up a standardize job . . . 17
      Lesson checkpoint . . . 19
   Lesson 2.2: Configuring the Standardize job stage properties . . . 19
      Configuring Standardize properties . . . 19
      Configuring the Transformer properties . . . 21
      Configuring the Standardize Copy stage . . . 22
      Configuring the Match Frequency stage . . . 22
      Lesson checkpoint . . . 22
   Lesson 2.3: Configuring the target data sets . . . 23
      Lesson checkpoint . . . 23
   Summary for standardized data job . . . 23

Chapter 5. Module 3: Grouping records with common attributes . . . 25
   Lesson 3.1: Setting up an Unduplicate Match job . . . 25
      Lesson checkpoint for the Unduplicate Match job . . . 27
   Lesson 3.2: Configuring the Unduplicate Match job stage properties . . . 27
      Configuring the Unduplicate Match stage . . . 28
      Configuring the Funnel stage . . . 29
      Lesson 3.2 checkpoint . . . 29
   Lesson 3.3: Configuring the Unduplicate job target files . . . 30
      Lesson checkpoint . . . 30
   Summary for Unduplicate stage job . . . 30

Chapter 6. Module 4: creating a single record . . . 33
   Lesson 4.1: Setting up a survive job . . . 33
      Lesson checkpoint . . . 34
   Lesson 4.2: Configuring the survive job stage properties . . . 34
      Configuring the Survive stage . . . 35
      Configuring the target file . . . 36
      Lesson checkpoint . . . 36
   Module 4: survive job summary . . . 36

Chapter 7. WebSphere QualityStage tutorial summary . . . 39

Accessing information about IBM . . . 41
   Contacting IBM . . . 41
   Accessible documentation . . . 41
   Providing comments on the documentation . . . 42

Notices and trademarks . . . 43
   Notices . . . 43
   Trademarks . . . 45

Index . . . 47

Chapter 1. The Designer Client structure
IBM® WebSphere® QualityStage is a data cleansing component that is part of the WebSphere DataStage®
and QualityStage Designer (Designer client).

The Designer client provides a common user interface in which you design your data quality jobs. In
addition, you have the power of the parallel processing engine to process large stores of source data.

The integrated stages available in the Repository provide the basis for accomplishing the following data
cleansing tasks:
v Resolving data conflicts and ambiguities
v Uncovering new or hidden attributes from free-form or loosely controlled source columns
v Conforming data by transforming data types into a standard format
v Creating one unique result

Learning objectives

The key points that you should keep in mind as you complete this tutorial include the following
concepts:
v How the processes of standardization and matching improve the quality of the data
v The ease of combining both QualityStage and DataStage stages in the same job
v How the data flows in an iterative process from one job to another
v The surviving data results in the best available record

About DataStage Projects


The Designer client provides projects as a method for organizing your re-engineered data. You define
data files and stages and you build jobs in a specific project. WebSphere QualityStage uses these projects
to create and store files on the client and server.

Each QualityStage project contains the following components:


v QualityStage jobs
v Stages that are used to build each job
v Match specification
v Standardization rules
v Table definitions

In this tutorial, you will create a project by using the data that is provided.

About QualityStage jobs


WebSphere QualityStage uses jobs to process data.

To start a QualityStage job, you open the Designer client and create a new Parallel job. You build the
QualityStage job by adding stages, source and target files, and links from the Repository, and placing
them onto the Designer canvas. The Designer client compiles the Parallel job and creates an executable
file. When the job runs, the stages process the data using the data properties that you defined. The result
is a data set that you can use as input for the next job.



In this tutorial, you build four QualityStage jobs. Each job is built around one of the Data Quality stages
and additional DataStage stages.

WebSphere DataStage and QualityStage stages


A stage in QualityStage performs an action on data. The type of action depends on the stage that you
use.

The Designer client stages are stored in the Designer tool palette. You can access all the QualityStage
stages in the Data Quality group in the palette. You configure each stage to perform the actions on the
data that obtain the required results. Those results are used as input data to the next stage. The
following stages are included in QualityStage:
v Investigate stage
v Standardize stage
v Match Frequency stage
v Unduplicate Match stage
v Reference Match stage
v Survive stage

In this tutorial, you use most of the QualityStage stages.

You can also add any of the DataStage stages to your job. In some of the lessons, you add DataStage
stages to enhance the types of tools for processing the data.

Server and client components


You load the client and server components to process the data.

The following server components are installed on the server:


Repository
A central store that contains all the information required to build a QualityStage job.
WebSphere DataStage server
Runs the QualityStage jobs.

The following DataStage client components are installed on any personal computer:
v WebSphere DataStage and QualityStage Designer
v DataStage Director
v DataStage Administrator

In this tutorial, you use all of these components when you build and run your QualityStage project.



Chapter 2. Tutorial project goals
The goal of this tutorial is to use Designer client stages to cleanse customer data to remove all the
duplicates of customer addresses and provide a best case for the correct address.

In this tutorial, you have the role of a database analyst for a bank that provides many financial services.
The bank has a large database of customers; however, there are problems with the customer list because
it contains multiple names and address records for a single household. Because the marketing department
wants to market additional services to existing customers, you need to find and remove duplicate
addresses.

For example, a married couple have four accounts, each in their own names. The accounts include two
checking accounts, an IRA, and a mutual fund.

In the bank’s existing system, customer information is tracked by account number rather than customer
name, number, or address. For this one customer, the bank has four address entries.

To save money on the mailing, the bank wants to consolidate the household information so that each
household receives only one mailing. In this tutorial, you are going to use WebSphere QualityStage to
standardize all customer addresses. In addition, you need to locate and consolidate all records of
customers who are living at the same address.

Learning objectives

The purpose of this tutorial is to provide a working knowledge of the QualityStage process flow through
the jobs. In addition, you learn how to do the following tasks:
v Set up each job in the project
v Configure each stage in the job
v Assess results of each job
v Apply those results to your business practices

With the completion of these tasks, you will understand how QualityStage stages restructure and cleanse
the data by using applied business rules.

This tutorial will take approximately 2.5 hours to complete.

Skill level

To use this tutorial, you need an intermediate to advanced level of understanding of data analysis.

Audience

This tutorial is intended for business analysts and systems analysts who are interested in understanding
QualityStage.

System requirements
v IBM Information Server Suite
v Microsoft® Windows® XP or Linux® operating systems



Prerequisites

To complete this tutorial, you need to know how to use


v IBM WebSphere DataStage and QualityStage Designer
v Personal computers

Expected results

Upon the completion of this tutorial, you should be able to use the Designer client to create your own
QualityStage projects to meet your company’s business requirements and data quality standards.

Copying tutorial data


The tutorial requires data that is located on the IBM Information Suite CD.

You will locate the tutorial data and copy it into your server.
1. Locate the tutorial folder Tutorialdata on the IBM Information Suite CD.
2. Copy the folder Tutorialdata to your system.

Starting a project
You use a Designer client project as a container for your QualityStage jobs.

You open the DataStage Designer client to begin the tutorial. The DataStage Designer Parallel job
provides the executable file that runs your QualityStage jobs.
1. Click Start → All Programs → IBM Information Server → Start the agent.
2. Click Start → All Programs → IBM Information Server → IBM WebSphere DataStage and
QualityStage Designer. The Attach to Project window opens.
3. In the Domain field, type the name of the server that you are connected to.
4. In the User name field, type your user name.
5. In the Password field, type your password.
6. In the Project field, attach to a project or accept the default, and then click OK. The New Parallel job
opens in the Designer client.

You are going to import the tutorial sample jobs and data.

Creating a job
The Designer client provides the interface to the parallel engine that processes the QualityStage jobs. You
are going to save a job to a folder in the DataStage repository.

If it is not already open, open the DataStage Designer client.

To create a new DataStage job:


1. From the New window, select the Jobs folder in the left pane and then select the Parallel Job icon in
the right pane.
2. Click OK. A new empty job design window opens in the job design area.
3. Click File → Save.
4. In the Save Parallel Job As window, right-click the Jobs folder and select New → Folder from the
shortcut menu.
5. Type in a name for the folder, for example, MyTutorial.
6. In the Item name field, type the title Investigate and click Save.



You have created a new parallel job named Investigate and saved it in the folder Jobs\MyTutorial in the
repository.

You are going to import the tutorial data into your project.

Importing tutorial sample data


The data for this tutorial contains sample jobs, table definitions, a Match specification, and source data.

You can import the tutorial sample data to begin the tutorial lessons.
1. Click the Import menu and select DataStage Components. The DataStage Repository Import window
opens.
2. In the Import from file field, browse to locate the QSTutorial data.
3. Click Import all.

The sample tutorial jobs are displayed in the repository under the Jobs/QSTutorial folder. You can open
each job and look at how it is designed on the canvas. Use these jobs as a reference when you create your
own jobs.

You can begin Module 1.



Chapter 3. Module 1: Investigating source data
This module explains how to set up and process an Investigate job.

The Investigate job provides data from which you can create reports in the Information Server Web
console. You can use the information in the reports to make basic assumptions about the data and what
further steps you need to take to attain the goal of providing a legitimate address for each customer in
the database.

Learning objectives

After completing the lessons in this module, you will know how to do the following tasks:
1. Add QualityStage or DataStage stages and links to a job.
2. Configure stage properties to specify what action they take when the job is run.
3. Load and process customer data and metadata.
4. Compile and run a job.
5. Produce data for reports.

This module should take approximately 30 minutes to complete.

Lesson 1.1: Setting up and linking an investigate job


You create each QualityStage job by adding Data Quality stages and DataStage sequential files and stages
to the Designer canvas. You link the icons on the canvas together so that the data flows from the source
file through each stage.

The goals for this lesson are to add stages to the Designer canvas and link them together.

To set up an investigate job:


1. Click Palette → Data Quality to select the Investigate stage.
If you do not see the palette, click View → Palette.
2. Drag the Investigate stage icon onto the Designer canvas and drop it near the middle of the canvas.
3. Drag a second Investigate stage icon and drop it beneath the first Investigate stage. You use two
investigate stages to create the data for the reports.
4. Click Palette → File and select Sequential File.
5. Drag the sequential file onto the Designer canvas and drop it to the left of the first Investigate stage.
This sequential file will become the source file.
6. Click Palette → Processing and select the Copy stage. This stage duplicates the data from the source
file and copies it to the two Investigate stages.
7. Drag the Copy stage onto the Designer canvas and drop it between the sequential file and the first
Investigate stage.
8. Click Palette → File, and drag a second sequential file onto the Designer canvas and drop it to the
right of the first Investigate stage.
The data from the Investigate stage is sent to the second sequential file which becomes the target
file.
9. Drag a third sequential file onto the Designer canvas and drop it also to the right of the Investigate
stage and beneath the second Sequential File. You now have a source file, a Copy stage, two
Investigate stages, and two target files.



10. Drag a fourth sequential file onto the Designer canvas and drop it beneath the third sequential file as
the final target file. Next, you are going to link all the stages together.
11. Click Palette → General → Link.
a. Drag a link from the source sequential file to the Copy stage.
b. Continue linking the other stages. The following figure shows the completed Investigate job.

Lesson checkpoint
When you set up the Investigate job, you are connecting the source file and its source data and metadata
to all the stages and linking the stages to the target files.

In completing this lesson, you learned the following about the Designer:
v How to add stages to the Designer canvas
v How to combine Data Quality and Processing stages on the Designer canvas
v How to link all the stages together

Lesson 1.2: Renaming links and stages in an Investigate job


When creating a large job in the Designer client, it is important to rename each stage, file, and link with
meaningful names to avoid confusion when selecting paths during stage configuration.

When you rename the links and stages, do not use spaces. The Designer client resets the name back to
the generic name if you include spaces. The goals for this lesson are to replace the generic names for the icons
on the canvas with more appropriate names.

To rename icons on the canvas:


1. To rename a stage, complete the following steps:
a. Click the name of the source SequentialFile until a highlighted box appears around the name.
b. Type SourceFile in the box.
c. Click outside the box to deselect it.
2. To rename a link, complete the following steps:
a. Right-click the generic link name DSLinkXX that connects SourceFile to the Copy icon and select
Rename from the shortcut menu. A highlighted box appears around the default name.



b. Type Customerdata and click outside the box. The default name changes to Customerdata.
3. Right-click the generic link name that connects the Copy icon to the Investigate icon.
4. Repeat step 2, except type Name in the box.
5. Right-click the generic link name that connects the Copy icon to the second Investigate icon.
6. Repeat step 2, except type AddrCityState in the box.
7. Rename the following stages as described in step 1:

Stage Change to
Copy CopyStage
Investigate InvestigateStage
Investigate InvestigateStage2

8. Rename the three target files from the top in the following order:
a. NameTokenReport
b. AreaTokenReport
c. AreaPatternReport
9. Rename the links as described in step 2:

Link Change to
From InvestigateStage to NameTokenReport TokenData
From InvestigateStage2 to AreaTokenReport AreaTokenData
From InvestigateStage2 to AreaPatternReport AreaPatternData

Renaming the elements on the Designer canvas provides better organization to the Investigate job.
Related information
“Lesson 2.1: Setting up a standardize job” on page 17
Standardizing data is the first step you take when you do data cleansing. In Lesson 2.1, you add a
variety of stages to the Designer canvas. These stages include the Transformer stage, which applies
derivations to handle nulls, and the Match Frequency stage, which adds frequency data.
“Lesson 3.1: Setting up an Unduplicate Match job” on page 25
Sorting records into related attributes is the second step in data cleansing. In this lesson, you are
going to add the Data Quality Unduplicate Match stage and a Funnel stage to match records and
remove duplicates.

Lesson checkpoint
In this lesson, you changed the generic stages and links to names appropriate for the job.

You learned the following tasks:


v How to select the default name field in order to edit it
v The correct method to use in changing the name

Lesson 1.3: Configuring the source file


You attach the source data and metadata to the sequential file, which serves as the source for the job.

The goal of this lesson is to attach the input data of customer names and addresses and load the
metadata.



To add data and metadata to the Investigate job, you configure the source file to locate the input data file
input.csv stored on your computer and load the metadata columns.

To configure the source file:


1. Double-click the SourceFile icon to open the Properties window.
2. To select the tutorial data file, complete the following steps:
a. Click Source → File to activate the File field.

b. Click in the File field and select Browse for File.


c. Locate the directory where you copied the tutorial input.csv file.
d. Click input.csv to select the file. You can click View Data to see the quality of the input data. You
see bank customer names and addresses. The addresses are shown in a disorganized way which
makes it difficult for the bank to analyze the data.
3. Click Columns.
4. Click Load.
5. From the Table Definitions window, click the Table Definitions folder.
6. Click the Tutorial folder.
7. Click Input to load the source metadata. The metadata columns are shown in the Columns page.

8. Click Save to save the data at a location that you choose.


9. Click OK to close the Sequential File window.

Lesson checkpoint
In this lesson, you attached the input data (customer names and addresses) and loaded the metadata.

You learned how to do the following tasks:


v Attaching source data to the source file
v Adding column metadata to the source file



Lesson 1.4: Configuring the Copy stage
The Copy stage duplicates the source data and sends it to the two Investigate stages.

This lesson explains how you can add a DataStage Processing stage, the Copy stage, to a QualityStage
job. You use the Copy stage to duplicate the metadata and send the output metadata to the two
Investigate stages.

To configure a Copy stage:


1. Double-click the Copy stage icon to open the Properties window.
2. Click Input → Columns. The metadata you loaded in the SourceFile has propagated to the Copy stage.
3. Click Output → Mapping. You will map the columns that display in the left Columns box to the right
box.
4. In the Output name field, select Name if it is not already selected. Selecting the correct output link
assures that the data goes to the correct Investigate stage, either InvestigateStage or InvestigateStage2.
5. To copy the data from the Columns pane to the Name pane, complete the following steps:
a. Place your cursor in the Columns pane, right click and choose Select All.
b. Then choose Copy.
c. Place your cursor in the Name pane, right click and choose Paste Column. The column metadata
copies into the Name pane and lines show the columns linking from Columns to Name.
6. Change the Output name field to AddrCityState.
7. Repeat step 5 to map the columns to the AddrCityState output link.

This procedure shows you how to map columns to two different outputs.

Lesson checkpoint
In this lesson, you mapped the input metadata to the two output links to continue the propagation of the
metadata to the next two stages.

You learned how to do the following tasks:


v Adding a DataStage stage to a QualityStage job
v Propagating metadata to the next stage
v Mapping metadata to two output links

Lesson 1.5: Configuring the Investigate stage


The Word Investigate option of the Investigate stage parses name and address data into recognizable
patterns using a rule set that classifies personal names and addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you are going to select the
NAME rule set to apply USPS® standards.
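The following sketch, which is not part of the tutorial job, illustrates in Python the general idea behind a
word investigation: each free-form value is split into tokens, each token is assigned a class, and the
resulting patterns are counted. The single-letter classes used here (W for a word, N for a number) and the
sample values are placeholders only; the actual NAME rule set classifies tokens against its own
classification tables.

    import re
    from collections import Counter

    def word_pattern(value):
        # Split a free-form value into tokens and classify each one crudely:
        # N for an all-numeric token, W for anything else.
        tokens = re.findall(r"[A-Za-z0-9]+", value.upper())
        return " ".join("N" if token.isdigit() else "W" for token in tokens)

    # Hypothetical values, only to show that similar structures share a pattern.
    values = ["JOHN A SMITH", "MARY T JONES", "ACME CORP 230"]
    pattern_counts = Counter(word_pattern(v) for v in values)
    # Counter({'W W W': 2, 'W W N': 1})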

To configure the Investigate stage icon:


1. Double-click the InvestigateStage icon.
2. Click Word Investigate to open the Word Investigate window. The Name column that was propagated
to the Investigate stage from the Copy stage is shown in the Available Columns pane.

3. Select Name from the Available Column pane and click to move the Name column into the
Standard Columns pane. The Investigate stage analyzes the Name column using the rule set that you
select in step 4.



4. In the Rule Set: field, click and select a rule set for the Investigate stage.
a. In the Open window, double-click the Standardization Rules folder to open the Standardization
Rules tree.
b. Double-click the USA folder and select NAME from the list. The NAME rule set parses the Name
column according to United States Post Office standards for names.
c. Right-click NAME and select Provision All.
d. Click OK to exit the Open window.
5. Click Token Report in the Output Dataset area.
6. Click Stage Properties → Output → Mapping.
7. Map the output columns:
a. Click the Columns pane.
b. Right-click and select Select All in the context menu.
c. Select Copy.
d. Click in the NameData pane.
e. Right-click and select Paste Column. The columns on the left side map to the columns on the right
side.
Your map should look like the following figure:

8. Click Columns. Notice that the Output columns are populated when you map the columns in the
Mapping tab.

Lesson summary
This lesson explained how to configure the Investigate stage using the USNAME rule set.

You learned how to configure the Investigate stage in the Investigate job by doing the following tasks:
v Selecting the columns to investigate



v Selecting a rule set from the Standardization Rules folder
v Mapping the output columns

Lesson 1.6: Configuring the InvestigateStage2 icon


The Word Investigate option of the Investigate stage parses name and address data into recognizable
patterns using a rule set that classifies personal names and addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you are going to select the
USAREA rule set to apply USPS® standards.

To configure the InvestigateStage2 icon:


1. Double-click the InvestigateStage2 icon.
2. Click Word Investigate to open the Word Investigate window. The address columns that were
propagated to the second Investigate stage from the Copy stage are shown in the Available Columns
pane.
3. Select the following columns to move to the Standard Columns pane. The second Investigate stage
analyzes the address columns using the rule set that you select in step 5.
v City
v State
v Zip5
v Zip4

4. Click to move each selected column to the Standard Columns pane.

5. In the Rule Set: field, click to locate a rule set for the InvestigateStage2.
a. In the Open window, double-click the Standardization Rules folder to open the Standardization
Rules tree.
b. Double-click the USA folder and select AREA from the list. The rule set parses the City, State,
Zip5 and Zip4 columns according to the United States Post Office standards.
c. Right-click and select Provision All.
d. Click OK to exit the Open window. AREA.SET is shown in the Rule Set field.
6. Click Token Report and Pattern Report in the Output Dataset area. When you assign data to two
outputs, you need to verify that the link ordering is correct. Link ordering assures that the data is
sent to the correct reports through the assigned links that you named in Lesson 1.2. The Link
Ordering tab is not shown if there is only one link.
7. Click Stage Properties → Link Ordering and select the link to change, if you need to change the
order.
8. Move the links up or down as described next:

v Click to move the link name up a level.

v Click to move the link name down a level.


The following figure shows the correct order for the links.



9. Click Output → Mapping. Since there are two output links from the second Investigate stage, you
need to map the columns to each link by following these steps:
a. In the Output name field, select AreaPatternData.
b. Select the Columns pane.
c. Right-click and select Select All and then Copy.
d. Select the AreaPatternData pane, right-click and select Paste. The columns are mapped to the
output link AreaPatternData.
e. In the Output name field, select AreaTokenData.
f. Repeat substeps b to d, except select the AreaTokenData pane.
10. Click OK to close the InvestigateStage2 window.

Lesson summary
This lesson explained how to configure the second Investigate stage to the AREA rule set.

You learned how to configure the second Investigate stage in the Investigate job by doing the following
tasks:
v Selecting the columns to investigate
v Selecting a rule set from the Standardization Rules folder
v Verifying the link ordering for the output reports
v Mapping the output columns to two output links

Lesson 1.7: Configuring the target reports


The source data information and column metadata are propagated to the target data files for later use in
creating Investigation reports.

The Investigate job modifies the unformed source data into readable data that is later configured into
Investigation reports.

To configure the data files:


1. Double-click the NameTokenReport icon on Designer client canvas.
2. Click Target → File.

3. Click and select tokrpt.csv from the tutorial data folder.


4. Click OK to close the stage.
5. Double-click the AreaPatternReport icon.
6. Repeat steps 2 to 4 except select areapatrpt.csv from the tutorial data folder.
7. Double-click the AreaTokenReport icon.
8. Repeat steps 2 to 4 except select areatokrpt.csv from the tutorial data folder.



Lesson checkpoint
This lesson explained how to configure the target files for use as reports.

You configured the three target data files by linking the data to each report file.

Lesson 1.8: Compiling and running jobs


You test the completeness of the Investigate job by running the compiler and then you run the job to
process the data for the reports.

You compile the Investigate job in the Designer client. After the job compiles, you open the Director and
run the job.

To compile and run the job:


1. Click File → Save to save the Investigate job on the Designer canvas.

2. Click to compile the job. The Compile Job window opens and the job begins to compile. When
the compiler finishes, the following message is shown: Job successfully compiled with no errors.
3. Click Tools → Run Director. The Director application opens with the job shown in the Director Job
Status View window.

4. Click to open the Job Run Options window.


5. Click Run.

After the job runs, Finished is shown in the Status column.


Related information
“Lesson 3.3: Configuring the Unduplicate job target files” on page 30
You are going to attach files to the four output records. The records in the MatchedOutput file become
the source records for the next job.
“Lesson 4.2: Configuring the survive job stage properties” on page 34
You will load matched and duplicates data from the Unduplicate Match job, configure the Survive
stage with rules that test columns to a set of conditions, and configure a target file.

Lesson checkpoint
In this lesson, you learned how to compile and process an Investigate job.

You processed the data into three output files by doing the following tasks:
v Compiling the Investigate job
v Running the Investigate job in the Director

Module 1 Summary
In Module 1, you set up, configured, and processed a QualityStage Investigate job.

An investigate job looks at each record column-by-column and analyzes the data content of the columns
that you select. The Investigate job loads the name and address source data stored in the bank’s database,
parses the columns into a form that can be analyzed, and then organizes the data into three data files.

The Investigate job modifies the unformed source data into readable data that you can configure into
Investigation reports using the Information Server Web console. You select QualityStage Reports to
access the reports interface in the Web console.



The next module organizes the unformed data into standardized data that provides usable data for
matching and survivorship.

Lessons learned

By completing this module, you learned about the following concepts and tasks:
v How to correctly set up and link stages in a job so that the data propagates from one stage to the next
v How to configure the stage properties to apply the correct rule set to analyze the data
v How to compile and run a job
v How to create data for analysis



Chapter 4. Module 2: Standardizing data
In Module 2, you configure the Data Quality Standardize stage to standardize name and address
information. This information is derived from the bank’s customer database.

When you looked at the data in Module 1, you may have noticed that some addresses were free form
and nonstandard. The goal of removing duplicates of customer addresses and guaranteeing that a single
address is the correct address for that customer would be impossible without standardizing.

Standardizing or conditioning ensures that the source data is internally consistent, that is, each type of
data has the same type of content and format. When you use consistent data, the system can match
address data with greater accuracy during the match stage.
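The following minimal sketch, which is not part of the tutorial and does not use the QualityStage rule
sets, shows in Python why conditioning helps matching: once equivalent spellings are mapped to one
standard form, two records for the same address produce identical tokens and can be compared reliably.
The abbreviation table and the addresses are hypothetical.

    # Map common variants to one standard token so that equivalent addresses
    # end up with identical content and format.
    ABBREVIATIONS = {"ST": "STREET", "ST.": "STREET", "N.": "N", "NORTH": "N"}

    def condition(address):
        tokens = address.upper().replace(",", " ").split()
        return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

    condition("123 North Elm St.")   # '123 N ELM STREET'
    condition("123 N. Elm Street")   # '123 N ELM STREET'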

Learning objectives

After completing the lessons in this module, you will know how to do the following tasks:
1. Add stages and links to a Standardize job.
2. Configure the various stage properties to correctly process the data when the job is run.
3. Use derivations to handle nulls.
4. Generate frequency and standardized data.

This module should take approximately 60 minutes to complete.

Lesson 2.1: Setting up a standardize job


Standardizing data is the first step you take when you do data cleansing. In Lesson 2.1, you add a variety
of stages to the Designer canvas. These stages include the Transformer stage, which applies derivations to
handle nulls, and the Match Frequency stage, which adds frequency data.

If you have not already done so, open the Designer client.

As you learned in Lesson 1.1, you are going to add stages and links to the Designer canvas to create a
standardize job. The Investigate job that you completed helped you determine how to formulate a
business strategy using Investigation reports. The Standardize job applies rule sets to the source data to
condition it for matching.

To set up a Standardize job:


1. Add the following stages to the Designer canvas from the palette.
v Data Quality Standardize stage to the middle of the canvas
v Sequential file to the left of the Standardize stage
v Data Set file to the right of the Standardize stage
v Transformer stage between the Standardize stage and the Data Set file
v Copy stage between the Transformer stage and the Data Set file
v Data Quality Match Frequency stage below the Copy stage
v Second Data Set file to the right of the Match Frequency stage
Do not worry about the exact positioning of the stages and files; after linking them, you can adjust their
location on the canvas.
2. Right-click the Sequential file and drag to create a link from the Sequential file to the Standardize
stage.



3. Drag links to all the stages as explained in step 2.
If the link is red, click to activate the link and drag it until it meets the stage. It should turn black.
When all the icons on the canvas are linked, you can click on a stage and drag it to change its
position.
4. Rename the stages to the following names by typing the name in the highlighted box:

Stage Change to
SequentialFile Customer
Standardize stage Standardize
Transformer stage CreateAdditionalMatchColumns
Copy stage Copy
Data_Set file Stan
Match Frequency stage MatchFrequency
Data_Set file Frequencies

5. Rename the following links by highlighting the default name:

Link Change to
From Customer to Standardize Input
From Standardize to CreateAdditionalMatchColumns Standardized
From CreateAdditionalMatchColumns to Copy ToCopy
From Copy to Stan StandardizedData
From Copy to MatchFrequency ToMatchFrequency
From MatchFrequency to Frequencies ToFrequencies

The following figure shows the completed Standardized job.

Related information



“Lesson 1.2: Renaming links and stages in an Investigate job” on page 8
When creating a large job in the Designer client, it is important to rename each stage, file, and link
with meaningful names to avoid confusion when selecting paths during stage configuration.

Lesson checkpoint
In this lesson, you learned how to set up a Standardize job. The purpose of the Standardize stage is to
generate the type of data that can then be used in a match job.

You set up and linked a Standardize job by doing the following tasks:
v Adding Data Quality and Processing stages to the Designer canvas
v Linking all the stages
v Renaming the links and stages

Lesson 2.2: Configuring the Standardize job stage properties


In this lesson, you will configure the stage properties for the Standardize job stages on the Designer
canvas.

When you configure the Standardize job, you will be completing the following tasks:
v Loading source data and metadata
v Adding compliant rule sets for United States names and addresses
v Applying derivations to handle nulls
v Copying data to two output links
v Creating frequency data

To configure the source file stage properties:


1. Double-click the Customer file to open the Output Properties window.
2. Select the File property under the Source category, and in the File field enter the path name for the
tutorial data folder C:\Tutorial\Data\Input.csv. The file you have attached is the source file that the
stage reads when the job runs.
3. Click the Columns tab and click Load. The Table Definitions window opens.
4. Click the Table Definitions → Tutorial → Input folder. The table definitions load into the Columns tab
of the source file.
5. Click OK to close the source file.

You attached the source data to the Source file and loaded table definitions to organize the data into
standard address columns.

Configuring Standardize properties


When you configure the Standardize stage, you are applying rules to name and address data to parse it
into standard column format.

First, configure the source file.

To configure the Standardize stage:


1. Double-click Standardize to open the Standardize Stage window.
2. Click New Process to open the Standardize Rule Process window.
3. From the Standardize Rule Process window, click Standardization Rules → USA. The rule set you are
selecting for the standardize job is Domain Specific. The reason for selecting this rule set is to create
consistent, industry-standard data structures and matching structures.
4. To select a standardization rule, complete the following steps:



a. Select NAME for the rule set. The NAME rule set appears in the Rule Set field. You select rules
from the USA folder because the name and address data is from the United States.
b. From the Available Columns pane, select Name.

c. Click to move the Name column into the Selected Column pane. The Optional NAMES
Handling field becomes active.
d. Click OK.

5. Click New Process and select ADDR from the USA country name.
6. Move the following column names from the Available Columns pane to the Selected Columns pane:
v AddressLine1
v AddressLine2
7. Click New Process and select AREA from the USA country name.
8. Move the following column names from the Available Columns pane to the Selected Columns pane:
v City
v State
v Zip5
v Zip4
You are going to map the Standardize stage output columns and save the table definitions to the
Table Definitions folder.
9. Click Output → Mapping and select the columns in the left pane and copy them to the right pane.
10. Click Columns.
You are going to create the Identifier for the Table Definitions.
a. Click Save. The Save Table Definitions window opens. The file name is shown in the Table/File
name field.
b. In the Data source type field, type Table Definitions.
c. In the Data source name field, type QualityStage.
d. Click OK.
e. Type Standardized in the Item name field.
f. Click Save and close the Standardize Stage window.



You have configured the Standardize stage to apply NAME, ADDR, and AREA rule sets to the customer
data and saved Table Definitions.

Configuring the Transformer properties


The DataStage Transformer stage increases the number of columns that the matching stage uses to select
matches. The Transformer stage also applies derivations to handle nulls.

To configure the transformer properties:


1. Open the Transformer stage.
2. From the Transformer Stage window, right-click the Input columns and choose Select All to highlight
all the columns from the Standardize stage.
3. Drag the selected columns to the Output link.
You have now populated the Output pane and the Output metadata pane.
4. Select the top row of the ToCopy pane and complete the following steps to add three new columns to
the Transformer stage:
a. Right-click the row and select Insert row.
b. Add two more rows as explained in step a.
c. Right-click the top inserted row and select Edit row and the Edit Column Meta Data window
opens.
d. In the Column name field, type MatchFirst1.
e. In the SQL type field, select VarChar.
f. In the Length field, select 1.
g. In the Nullable field, select Yes.
h. Click Apply and Close to remove the window.
i. Right-click the next row and select Edit row.
j. In the Column name field, type HouseNumberFirstChar.
k. Repeat substeps e to h.
l. Edit the last new row.
m. In the Column name field, type ZipCode3.
n. Repeat substeps e to h, except in substep f, select 3.
5. To add derivations to the columns, complete the following substeps (a sketch of this null-handling
logic follows these steps):
a. Double-click the Derivation area for the MatchFirst1 column to open the Expression Editor and
type the following derivation: If isNull(Standardize.MatchFirstName_NAME) then Setnull()
Else Standardize.MatchFirstName_NAME[1,1]. The expression that you entered detects whether the
MatchFirstName_NAME column contains a null. If it does, it handles it. If it contains a string, it extracts
the first character and writes it to the MatchFirst1 column.
b. Repeat substep a for the HouseNumberFirstChar column and type the following derivation: If
isNull(Standardize.HouseNumber_ADDR) then Setnull() Else
Standardize.HouseNumber_ADDR[1,1].
c. Repeat substep a for the ZipCode3 column and type the following derivation: If
isNull(Standardize.ZipCode_AREA) then Setnull() Else Standardize.ZipCode_AREA[1,3].
6. Finally, map the three input columns that the derivations reference to the output columns by
completing the following steps:
a. Scroll the Standardized pane until you locate MatchFirstName_NAME.
b. Click and drag the column and drop it on the same column name in the ToCopy pane.
c. Repeat substeps a and b with HouseNumber_ADDR and ZipCode_AREA.
d. Click OK to close the Transformer Stage window.
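The three derivations in step 5 all follow the same null-handling pattern. The following Python sketch,
which is not part of the job, shows the equivalent logic: if the input column is null, the new column is set
to null; otherwise the first one or three characters are extracted. The sample values are hypothetical.

    def first_chars(value, length):
        # Equivalent of: If isNull(col) then Setnull() Else col[1,length]
        return None if value is None else value[:length]

    match_first1 = first_chars("WILLIAM", 1)         # 'W'
    house_number_first_char = first_chars(None, 1)   # None (the null is preserved)
    zip_code3 = first_chars("01824", 3)              # '018'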



Configuring the Standardize Copy stage
As you learned in Lesson 1.4, a Copy stage duplicates data and writes it to more than one output link. In
this lesson, the Copy stage duplicates the metadata from the Transformer stage and writes it to the Match
Frequency stage and the target file.

The metadata from the Standardize and Transformer stages is duplicated and written to two output links.

To configure a Copy stage:


1. Double-click the Copy stage and click Output → Mapping.
2. To copy the data to the StandardizedData output link, follow these steps:
a. From the Output name field, select StandardizedData.
b. Copy the columns from the left pane to the right pane.
3. To copy the data to the ToMatchFrequency output link, repeat step 2 except select ToMatchFrequency
in the Output name field.
4. Close the Copy stage.

Configuring the Match Frequency stage


The Match Frequency stage generates frequency information using any data that provides the columns
needed by a match.

The Match Frequency stage processes frequency data independently from executing a match. The output
link of the stage carries four columns:
v qsFreqVal
v qsFreqCounts
v qsFreqColumnID
v qsFreqHeaderFlag
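Conceptually, the stage counts how often each value occurs in the columns that a match uses. The
following Python sketch only illustrates that idea; the column values are hypothetical, and the real stage
writes its results into the four columns listed above. Frequency information matters because agreement
on a rare value is stronger evidence of a match than agreement on a common one.

    from collections import Counter

    # Hypothetical standardized values for one match column.
    zip_codes = ["01824", "01824", "01824", "02139"]
    frequencies = Counter(zip_codes)
    # Counter({'01824': 3, '02139': 1}) -- agreement on the rarer value '02139'
    # can be weighted more heavily than agreement on the common '01824'.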

To configure the Match Frequency stage:


1. Double-click the Match Frequency stage icon to open the Match Frequency Stage window.
2. Click Do not use a Match Specification. At this point you do not know the columns that would be
used in the match specification.
3. From the Output → Mapping tab, copy the columns in the left pane to the right pane.
4. To create Table Definitions, continue with the following substeps:
a. Click Columns.
b. Click Save. The Save Table Definitions window opens.
c. Click OK and the Save Table Definitions As window opens.
d. Select Table Definitions → Tutorial → Save.
5. Click OK to close the Output → Column tab and the Match Frequency stage.

Lesson checkpoint
This lesson explained how to configure the source file and all the stages for the Standardize job.

You have now applied settings to each stage and mapped the output files to the next stage for the
Standardize job. You learned how to do the following tasks:
v Configure the source file to load the customer data and metadata
v Apply United States postal service compliant rule set to the customer name and address data
v Add additional columns for matching and create derivations to handle nulls
v Write data to two output links and associate the data to the correct links



v Create frequency data

Lesson 2.3: Configuring the target data sets


The two target data sets in the Standardize job store the standardized and frequency data that you can
use as source data in the Unduplicate Match job.

First, configure the stages as explained in Lesson 2.2.

You are going to apply the following tasks:


v Attach the file to the Stan target data set
v Attach the file to the Frequencies data set

To configure the target data sets:


1. Double-click the Stan target data set.
2. From the Data Set window, click Input → Properties and select Target → File.

3. Click and select Browse for file.


4. Locate the Tutorial Data folder.
5. Select stan in the Tutorial Data folder and click OK.
6. Double-click the Frequencies target data set.
7. Repeat steps 2 to 4.
8. Select frequencies in the Tutorial Data folder and click OK. These two files will be used as the
source data sets for the Unduplicate Match job.

9. Click to compile the job in the Designer client.

10. Click to run the job.

The job standardizes the data according to applied rules and adds additional matching columns to the
metadata. The data is written to two target data sets as the source files for a later job.

Lesson checkpoint
This lesson explained how to attach files to the target data sets to store the processed standardized
customer name and address data and frequency data.

You have configured the Stan and Frequencies target data set files to accept the data when it is processed.

Summary for standardized data job


In Module 2, you set up and configured a standardize job.

Running a standardize job conforms the data to ensure that all the customer name and address data has
the same content and format. The Standardize job loaded the name and address source data stored in the
bank’s database and added table definitions to organize the data into a format that could be analyzed by
the rule sets. Further processing by the Transformer stage increased the number of columns. In addition,
frequency data was generated for input into the match job.



Lessons learned

By completing this module, you learned about the following concepts and tasks:
v Creating standardized data to match records effectively
v Running DataStage and Data Quality stages together in one job
v Applying country-specific rule sets to analyze the address data
v Using derivations to handle nulls
v Creating the data that can be used as source data in a later job



Chapter 5. Module 3: Grouping records with common
attributes
The Unduplicate Match stage uses standardized data and frequency data to match records and remove
duplicate records.

The Unduplicate Match stage is one of two stages that matches records while removing duplicates and
residuals. The other matching stage is Reference Match stage.

With the Unduplicate Match stage, you are grouping records that share common attributes. The Match
specification that you apply was configured to separate all records with weights above a certain match
cutoff as duplicates. The master record is then identified by selecting the record within the set that
matches to itself with the highest weight.

Any records that are not part of a set of duplicates are residuals. These records along with the master
records are used for the next pass. You do not include duplicates because you want them to belong to
only one set.

The reason you use a matching stage is to ensure data integrity. Data integrity is assured because you are
applying probabilistic matching technology. This technology is applied to any relevant attribute for
evaluating the columns, parts of columns, or individual characters that you define. In addition, you can
apply agreement or disagreement weights to key data elements.
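The following Python sketch is a conceptual illustration of the ideas above, not the QualityStage matcher:
records whose weights place them in the same set above the match cutoff are duplicates, the master is the
record that matches itself with the highest weight, and only masters and residuals move on to the next
pass. The match sets and the self_weight field are hypothetical inputs that a real match run would
compute.

    def split_pass(records, match_sets):
        # match_sets: lists of records already judged to represent the same
        # entity; each record is assumed to carry a 'self_weight' value.
        masters, duplicates = [], []
        in_a_set = set()
        for match_set in match_sets:
            master = max(match_set, key=lambda r: r["self_weight"])
            masters.append(master)
            duplicates.extend(r for r in match_set if r is not master)
            in_a_set.update(id(r) for r in match_set)
        residuals = [r for r in records if id(r) not in in_a_set]
        # Masters and residuals feed the next pass; duplicates stay assigned
        # to the set found in this pass.
        return masters, duplicates, residuals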

Learning objectives

When you build the Unduplicate Match stage, you are going to perform the following tasks:
1. Add Data Quality and DataStage links and stages to a job.
2. Add standardize and frequencies data as the source files.
3. Configure stage properties to specify what action they take when the job is run.
4. Remove duplicate addresses after the first pass.
5. Apply a Match specification to determine how matches are selected.
6. Funnel the common attribute data to a separate target file.

This module should take approximately 30 minutes to complete.

Lesson 3.1: Setting up an Unduplicate Match job


Sorting records into related attributes is the second step in data cleansing. In this lesson, you are going to
add the Data Quality Unduplicate Match stage and a Funnel stage to match records and remove
duplicates.

If you have not already done so, open the Designer client.

As you learned in the previous module, you are going to add stages and links to the Designer canvas to
create an Unduplicate Match job. The Standardize job you just completed created a stan data set and a
frequencies data set. The information from these data sets is going to be used as the input data when
you design the Unduplicate Match job.

To set up an Unduplicate Match job:


1. Add the following stages to the Designer canvas from the palette.
v Data Quality Unduplicate Match stage to the middle of the canvas



v Data set file to the top left of the Unduplicate Match stage
v A second Data set file to the lower left of the Unduplicate Match stage
v Processing Funnel stage to the upper right of the Unduplicate Match stage
v Three Sequential files, one to the right of the Funnel stage and the other two to the right of the
Unduplicate Match stage
2. Right-click the top data set and drag to create a link from the data set to the Unduplicate Match stage.
3. Drag links to all the stages as explained in step 2.
4. Rename the stages by typing the following names in the highlighted edit box:

Stage Change to
top left data set MatchFrequencies
lower left data set StandardizedData
Unduplicate Match Unduplicate
Funnel CollectMatched
top right Sequential file MatchOutput_csv
middle right Sequential file ClericalOutput_csv
lower right Sequential file NonMatchedOutput_csv

5. Rename the following links by typing the name in the highlighted box:

Links Change to
From MatchFrequencies to Unduplicate MatchFrequencies
From StandardizedData to Unduplicate StandardizedData
Unduplicate to CollectMatched MatchedData
Unduplicate to CollectMatched Duplicates
CollectMatched to MatchOutput_csv MatchedOutput
Unduplicate to ClericalOutput_csv Clerical
Unduplicate to NonMatchedOutput_csv NonMatched



Related information
“Lesson 1.2: Renaming links and stages in an Investigate job” on page 8
When creating a large job in the Designer client, it is important to rename each stage, file, and link
with meaningful names to avoid confusion when selecting paths during stage configuration.

Lesson checkpoint for the Unduplicate Match job


In this lesson, you learned how to set up an Unduplicate Match job. During the processing of this job, the
records are matched using the Match specification created for this tutorial. The records are then sorted
according to their attributes and written to a variety of output links.

You set up and linked an Unduplicate Match job by doing the following tasks:
v Adding Data Quality and Processing stages to the Designer canvas
v Linking all the stages
v Renaming the links and stages with appropriate names

Lesson 3.2: Configuring the Unduplicate Match job stage properties


In this lesson, you configure the stage properties for the Unduplicate Match job stages on the Designer
canvas.

When you configure the Unduplicate Match job, you are completing the following tasks:
v Loading data and metadata for two source files
v Applying a Match specification to the Unduplicate Match job and selecting output links
v Combining unsorted records

To configure the MatchFrequencies and StandardizedData data sets:


1. Double-click the MatchFrequencies data set to open the Output Properties window.
2. Select the File property under the Source category, and in the File field enter the path name for the
tutorial data folder C:\Tutorial\Data\Frequencies.csv.



3. Click the Columns tab and click Load. The Table Definitions window opens.
4. Click the Table Definitions → Tutorial → Frequencies folder. The table definitions load into the
Columns tab of the source file.
5. Click OK to close the MatchFrequencies window.
6. Double click the StandardizedData file.
7. Repeat steps 2 to 5 except in steps 2 and 4, select stan.

With this lesson, you have loaded the data that resulted from the Standardize job into the source files for
the Unduplicate Match job.

Configuring the Unduplicate Match stage


With the Unduplicate Match stage, you are grouping records that share common attributes.

To configure the Unduplicate Match stage:

1. Double-click Unduplicate and click the Match Specification button.


2. From the Repository window, double-click the Match Specifications → Matches folder.
3. Select NameandAddress.MAT.
4. Right-click NameandAddress.MAT and select Provision All from the menu.
5. Click OK. You are attaching an Unduplicate Match specification that was created for the tutorial.
6. Click the following Match Output options:
v Match. Sends matched records as output data.
v Clerical. Separates those records that require clerical review.
v Duplicates. Includes duplicate records that are above the match cutoff.
v Residuals. Separates records that are not duplicates as residuals.
7. Keep the default Dependent Match Type. After the first pass, duplicates are removed with every
additional pass.
8. Click Stage Properties → Link Ordering. The links should be in the following order:

Link label Link name


Match MatchedData
Clerical Clerical
Duplicate Duplicates
Residual NonMatched

9. Click Output → Mapping and map the following columns to the correct links:
a. If not selected, select MatchedData from the Output name field.
b. Copy all the columns in the left pane to the MatchedData pane.
c. Select Duplicates from the Output name field.
d. Copy all the columns in the left pane to the Duplicates pane.
e. Select Clerical from the Output name field.
f. Copy all the columns in the left pane to the Clerical pane.
g. Select NonMatched from the Output name field.
h. Copy all the columns in the left pane to the NonMatched pane.
10. Click OK to close the Stage Properties window.



Configuring the Funnel stage
With the Funnel stage, you are combining records as they arrive in an unordered format.

To configure a continuous funnel:


1. Double-click the CollectMatched stage and click Stage → Advanced.
2. Select Sequential from the Execution mode menu. This setting allows you to use the sort function.
3. Click Input → Partitioning. From this page, you are going to set the sort function.
a. Select Sort Merge from the Collector type menu.
b. From the Sorting area, click Perform sort.
c. Then, click Stable to preserve any previously sorted data sets.
d. From the Available list, select the sort key qsMatchSetID.

4. Click Output → Mapping.


5. Copy the columns from the Columns pane to the MatchedOutput pane.
6. Click OK to close the stage window.

Lesson 3.2 checkpoint


In Lesson 3.2, you configured the source files and stages of the Unduplicate Match job.

You learned how to do the following tasks:


v Load data and metadata generated in a previous job
v Apply a Match specification to process the data according to matches and duplicates
v Combine records into a single file

Lesson 3.3: Configuring the Unduplicate job target files
You are going to attach file names to the target files that receive the output records. The records in the MatchedOutput file become the source records for the next job.

To configure the target files:


1. Double-click the MatchedOutput_csv file. You will attach a file name to the match records.
2. Click Target → File.
3. In the File field, enter C:\Tutorial\Data\MatchedOutput.csv.
4. Repeat steps 2 to 3 for each additional target file.
v For the ClericalOutput_csv file, type ClericalOutput.csv.
v For the NonMatchedOutput_csv file, type NonMatchedOutput.csv.
5. Click OK to close the window.

6. Click the Compile button to compile the job in the Designer client.


7. Click Tools → Run Director to open the DataStage Director. The Director opens with the Unduplicate Match job visible in the Director window with the Compiled status.
8. Click Run.

You have configured the target files.


Related information
“Lesson 1.8: Compiling and running jobs” on page 15
You test the completeness of the Investigate job by running the compiler and then you run the job to
process the data for the reports.

Lesson checkpoint
In this lesson, you have combined the matched and duplicate address records into one file. The
nonmatched and clerical output records were separated into individual files. The clerical output records
can be reviewed manually for matching records. The nonmatched records will be used for the next pass.
The matched and duplicate address records will be used in the Survive job.

You have learned how to separate the output records from the Unduplicate Match stage to the various
target files.

Summary for Unduplicate stage job


In Module 3, you set up and configured an Unduplicate stage job to isolate matched and duplicate name and address data into one file.

In creating the Unduplicate stage job, you added a Match specification to apply the blocking and matching criteria to the standardized and frequency data created in the Standardize job. After the Match specification was applied, the resulting records were sent out through four output links, one for each type of record. The matched and duplicate records were sent to a Funnel stage, which combined them into one output stream that was written to a file. The nonmatched (residual) records were sent to a file, as were the clerical output records.

Lessons learned

By completing Module 3, you learned about the following concepts and tasks:
v Applying a Match specification to the Unduplicate stage
v How the Unduplicate stage groups records with similar attributes

v Ensuring data integrity by applying probability matching technology

Chapter 6. Module 4: creating a single record
In Module 4, you are designing a Survive job to isolate the best record for each customer’s name and
address.

With the Unduplicate Match job, groups of records with similar attributes were identified. In the Survive job, you will specify which columns and column values from each group create the output record for the group. The output record can include the following information:
v An entire input record
v Selected columns from the record
v Selected columns from different records in the group

You select column values based on rules for testing the columns. A rule contains a set of conditions and a
list of targets. If a column tests true against the conditions, the column value for that record becomes the
best candidate for the target. After testing each record in the group, the columns declared best candidates
combine to become the output record for the group. Whether a column survives is determined by the
target. Whether a column value survives is determined by the rules.

Learning objectives

After completing the lessons in this module, you will be able to do the following tasks:
1. Add stages and links to a Survive job.
2. Choose the selected column.
3. Add the rules.
4. Map the output columns.

This module should take approximately 20 minutes to complete.

Lesson 4.1: Setting up a survive job


The Survive job, which creates the best results record, is the last job in the data cleansing process. The best results record is the name and address that has the highest probability of being correct for each bank customer.

In this lesson, you will add the Data Quality Survive stage, the source file of combined data from the
Unduplicate Match job, and the target file for the best records.

To set up a survive job:


1. Add the following stages to the Designer canvas from the palette:
v Data Quality Survive stage to the middle of the canvas
v Sequential file to the left of the Survive stage
v Second Sequential file to the right of the Survive stage
2. Right-click the left Sequential file and drag to create a link from the file to the Survive stage.
3. Drag a second link from the Survive stage to the output Sequential file.
4. Rename the following stages:

Stage Change to
left Sequential file MatchedOutput
Survive stage Survive
right Sequential file Survived_csv

5. Rename the following links:

Links Change to
From MatchedOutput to Survive Matchesandduplicates
From Survive to Survived_csv Survived

Lesson checkpoint
In this lesson, you learned how to set up a Survive job by adding the results of the Unduplicate Match job as the source data, the Survive stage, and a target file for the best record of each group.

You have learned that the Survive stage takes one input link and one output link.

Lesson 4.2: Configuring the survive job stage properties


You will load matched and duplicates data from the Unduplicate Match job, configure the Survive stage
with rules that test columns to a set of conditions, and configure a target file.

With the survive job, you are testing column values to ascertain which columns are the best candidates
for that record. These columns are combined to become the output record for the group. In selecting a
best candidate, you can specify that these column values be tested:
v Record creation date
v Data source
v Length of data in a column
v Frequency of data in a group

To configure the source file:


1. Double-click the MatchedOutput file to access the Properties page.
2. Click Source → File and select Browse for file.
3. Find and load the MatchedOutput.csv file.
4. Click Columns.
5. Click Load and select MatchedOutput in the Table Definitions folder.
6. Click OK to close the MatchedOutput window.

You have loaded the Table Definitions into the MatchedOutput file and attached the MatchedOutput.csv file.
Related information
“Lesson 1.8: Compiling and running jobs” on page 15
You test the completeness of the Investigate job by running the compiler and then you run the job to
process the data for the reports.

Configuring the Survive stage
You will configure the Survive stage with rules that compare column values against a best case. (A conceptual sketch of how these rules are applied follows the steps below.)

To configure the survive stage:


1. Double-click the Survive stage.
2. Click New Rule to open the Survive Rules Definition window. The Survive stage requires a rule that
contains one or more targets and a TRUE condition expression.
You define the rule by specifying each of the following elements:
v Target column or columns
v Column to analyze
v Technique to apply to the column being analyzed

3. Select AllColumns from Available Columns and move AllColumns to the Target column. When you select AllColumns, you are assigning the first record in the group as the best record.
4. From the Survive Rule (Pick one) area, click Analyze Column and select qsMatchType from the Use
Target drop-down menu. You are selecting qsMatchType as the target to which to compare other
columns.
5. From the Technique field drop-down menu, click Equals. The rules syntax for the Equals technique
is c."column" = "DATA".
6. In the Data field, type MP.
7. Click OK to close the New Rule window.
8. Follow steps 2 to 5 to add the following columns and rules:

Targets Analyze Column Technique


GenderCode_USNAME GenderCode_USNAME Most Frequent (Non-blank)
FirstName_USNAME FirstName_USNAME Most Frequent (Non-blank)
MiddleName_USNAME MiddleName_USNAME Longest
PrimaryName_USNAME PrimaryName_USNAME Most Frequent (Non-blank)

You can view the rules you added in the Survive Stage grid.
9. From the Select the group identification data column list, choose qsMatchSetID.
10. Click Stage Properties → Output → Mapping.
11. Select the columns from the Columns pane and copy them to the Survived pane.
12. Click OK to close the Mapping tab.
13. Click OK to close the Survive Stage window.
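
The Survive stage evaluates these rules itself; the following Python sketch only approximates the logic so that you can see how the rules interact. The record whose qsMatchType value equals MP seeds the whole output record (the AllColumns rule), and the column rules then replace individual values with the most frequent non-blank value in the group or, for the middle name, the longest value. The qsMatchType code DA and the sample values are assumptions; the column names, the MP code, and the techniques come from the rules you just defined.

    from collections import Counter

    # One match-set group from the MatchedOutput file (values are invented).
    group = [
        {"qsMatchType": "MP", "GenderCode_USNAME": "M", "FirstName_USNAME": "JON",
         "MiddleName_USNAME": "Q", "PrimaryName_USNAME": "SMITH"},
        {"qsMatchType": "DA", "GenderCode_USNAME": "M", "FirstName_USNAME": "JOHN",
         "MiddleName_USNAME": "QUINCY", "PrimaryName_USNAME": "SMITH"},
        {"qsMatchType": "DA", "GenderCode_USNAME": "M", "FirstName_USNAME": "JOHN",
         "MiddleName_USNAME": "", "PrimaryName_USNAME": "SMYTH"},
    ]

    def most_frequent_nonblank(values):
        nonblank = [value for value in values if value]
        return Counter(nonblank).most_common(1)[0][0] if nonblank else ""

    # AllColumns rule: the record whose qsMatchType equals "MP" becomes the
    # starting point for the surviving record.
    best = dict(next(record for record in group if record["qsMatchType"] == "MP"))

    # Column rules override individual values.
    best["GenderCode_USNAME"] = most_frequent_nonblank(r["GenderCode_USNAME"] for r in group)
    best["FirstName_USNAME"] = most_frequent_nonblank(r["FirstName_USNAME"] for r in group)
    best["MiddleName_USNAME"] = max((r["MiddleName_USNAME"] for r in group), key=len)
    best["PrimaryName_USNAME"] = most_frequent_nonblank(r["PrimaryName_USNAME"] for r in group)

    print(best)  # the record that survives for this group

Running the sketch prints a single combined record for the group, which is the kind of best record that the Survived_csv target file holds for each match set.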

Configuring the target file
You will be configuring the target file.
1. Double-click the target file and click Target → File to activate the File field.
2. Type the path name of the record.csv file.
3. Click OK to close the window.

Lesson checkpoint
You have set up the survive job, renamed the links and stages, and configured the source file, the target file, and the Survive stage.

With Lesson 4.2, you learned how to define simple rules that are applied to selected columns. These rules are evaluated across the records in each group to find the best record.

Module 4: survive job summary


With Module 4, you have completed the last job in the QualityStage work flow. In this module, you set up and configured the survive job to select the best record from the matched and duplicate name and address data created in the Unduplicate Match stage.

In configuring the Survive stage, you selected target columns from the source file, added a rule to each column, and chose the technique that tests the data. After the Survive stage processed the records and selected the best record, that record was sent to the output file.

Lessons learned

In completing Module 4, you learned about the following tasks and concepts:
v Using the Survive stage to create the best candidate in a record
v How to apply simple rules to the column values

Chapter 7. WebSphere QualityStage tutorial summary
From the lessons in this tutorial, you learned how QualityStage can help an organization manage and maintain the quality of its data. It is imperative for companies that their customer data be of high quality: up-to-date, complete, accurate, and easy to use.

The tutorial presented a common business problem, verifying customer names and addresses, and showed how to use QualityStage jobs to reconcile the various names that belong to one household. The four modules of the tutorial covered the four jobs in the QualityStage work flow. These jobs provide the following assurances:
v Investigating data to identify errors and validate the contents of fields in a data file
v Conditioning data to ensure that the source data is internally consistent
v Matching data to identify all records in one file that correspond to similar records in another file
v Identifying which records from the match data survive to create a best candidate record

Lessons learned

By completing this tutorial, you learned about the following concepts and tasks:
v About the QualityStage work flow
v How to set up a QualityStage job
v How data created in one job is the source for the next job
v How to create quality data using QualityStage

Accessing information about IBM
IBM has several methods for you to learn about products and services.

You can find the latest information on the Web at www-306.ibm.com/software/data/integration/info_server/:
v Product documentation in PDF and online information centers
v Product downloads and fix packs
v Release notes and other support documentation
v Web resources, such as white papers and IBM Redbooks™
v Newsgroups and user groups
v Book orders

To access product documentation, go to this site:

publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp

You can order IBM publications online or through your local IBM representative.
v To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/order.
v To order publications by telephone in the United States, call 1-800-879-2755.

To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at
www.ibm.com/planetwide.

Contacting IBM
You can contact IBM by telephone for customer support, software services, and general information.

Customer support

To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).

Software services

To learn about available service options, call one of the following numbers:
v In the United States: 1-888-426-4343
v In Canada: 1-800-465-9600

General information

To find general information in the United States, call 1-800-IBM-CALL (1-800-426-2255).

Go to www.ibm.com for a list of numbers outside of the United States.

Accessible documentation
Documentation is provided in XHTML format, which is viewable in most Web browsers.

XHTML allows you to view documentation according to the display preferences that you set in your
browser. It also allows you to use screen readers and other assistive technologies.

Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing
the online documentation using a screen reader.

Providing comments on the documentation


Please send any comments that you have about this information or other documentation.

Your feedback helps IBM to provide quality information. You can use any of the following methods to
provide comments:
v Send your comments using the online readers’ comment form at www.ibm.com/software/awdtools/rcf/.
v Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version
number of the product, and the name and part number of the information (if applicable). If you are
commenting on specific text, please include the location of the text (for example, a title, a table number,
or a page number).

Notices and trademarks
Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program,
or service that does not infringe any IBM intellectual property right may be used instead. However, it is
the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.

IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can send
license inquiries, in writing, to:

IBM Director of Licensing


IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation


Licensing 2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan

The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some
states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of
the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.

The licensed program described in this document and all licensed material available for it are provided
by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or
any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the
results obtained in other operating environments may vary significantly. Some measurements may have
been made on development-level systems and there is no guarantee that these measurements will be the
same on generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.

All statements regarding IBM’s future direction or intent are subject to change or withdrawal without
notice, and represent goals and objectives only.

This information is for planning purposes only. The information herein is subject to change before the
products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to the names and addresses used by an
actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright
notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. ©
Copyright IBM Corp. _enter the year or years_. All rights reserved.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document.

See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks.

The following terms are trademarks or registered trademarks of other companies:

Java™ and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT®, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both.

Intel®, Intel Inside® (logos), MMX and Pentium® are trademarks of Intel Corporation in the United States,
other countries, or both.

UNIX® is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product or service names might be trademarks or service marks of others.




