
QualityStage 8 Essentials DX741

Copyright IBM Corporation 2009 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

4.0.3

Copyright, Disclaimer of Warranties and Limitation of Liability


Copyright IBM Corporation February 2007 IBM Software Group One Rogers Street Cambridge, MA 02142 All rights reserved. Printed in the United States. IBM and the IBM logo are registered trademarks of International Business Machines Corporation. The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: AnswersOnLine AIX APPN AS/400 BookMaster C-ISAM Client SDK Cloudscape Connection Services Database Architecture DataBlade DataJoiner DataPropagator DB2 DB2 Connect DB2 Extenders DB2 Universal Database Distributed Database Distributed Relational DPI DRDA DynamicScalableArchitecture DynamicServer DynamicServer.2000 DynamicServer with Advanced DecisionSupportOption DynamicServer with Extended ParallelOption DynamicServer with UniversalDataOption DynamicServer with WebIntegrationOption DynamicServer, WorkgroupEdition Enterprise Storage Server FFST/2 Foundation.2000 Illustra Informix Informix4GL InformixExtendedParallelServer InformixInternet Foundation.2000 Informix RedBrick Decision Server J/Foundation MaxConnect MVS MVS/ESA Net.Data NUMA-Q ON-Bar OnLineDynamicServer OS/2 OS/2 WARP OS/390 OS/400 PTX QBIC QMF RAMAC RedBrickDesign RedBrickDataMine RedBrick Decision Server RedBrickMineBuilder RedBrickDecisionscape RedBrickReady RedBrickSystems RelyonRedBrick S/390 Sequent SP System View Tivoli TME UniData UniData&Design UniversalDataWarehouseBlueprint UniversalDatabaseComponents UniversalWebConnect UniVerse VirtualTableInterface Visionary VisualAge WebIntegrationSuite WebSphere WebSphere DataStage

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.


Course Contents
Data Quality Issues
QualityStage 8 Architecture
Developing with QualityStage
Investigation
Standardize
Match
Survivorship
Special Topics
Globalization (NLS)
Address Verification Stage


Course contents
Data quality issues
Information Server purpose and architecture
Introduction to DataStage and QualityStage
Investigation
Standardization
Match
Survivorship
Special Topics
  Data quality methodology
  QualityStage Migration Tool


Data Quality Issues


Unit objectives
After completing this unit, you should be able to:
List the five common data quality contaminants
Describe each of the following processes:
  Investigation
  Standardization
  Match
  Survivorship


Data quality challenges


Different or inconsistent standards in structure, format or values
Missing data, default values
Spelling errors, data in wrong fields
Buried information
Data anomalies


Data quality: why do we care?

Accurate reports
Accurate information for support operations
Support development of applications that go beyond the original scope for which the data was designed
  Master Data Management
  Data Warehouse
  Analytical applications


Example - Master Data Management

[Diagram: Source 1, Source 2, and Source 3 are aligned, harmonized, and consolidated into a consolidated customer view.]


Different or inconsistent standards

Source | Name Field | Location
Source 1 | MARK DI LORENZO | MA93
Source 1 | DENIS E. MARIO | CT15
Source 1 | TOM & MARY ROBERTS | IL21
Source 2 | DILORENZO, MARK | 6793
Source 2 | MARIO, DENISE | 0215
Source 2 | ROBERTS, TOM & MARY | 8721
Source 3 | MARC DILORENZO ESQ | BOSTON
Source 3 | MRS DENNIS MARIO | HARTFORD
Source 3 | MR & MRS T. ROBERTS | CHICAGO

Missing data & default values


Do the field values match the meta data labels?

NAME Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Astorial Fedrl Savings Kevin Cooke, Receiver

SOC. SEC. # 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216

TELEPHONE 6173380300 3380321 415-392-2000 508-466-1200 212-235-1000 FAX 528-9825 5436

John Doe Trustee for K 111111111


Buried information

Legacy Meta Desc.


NAME 1 ADDRESS 1 ADDRESS 2 ADDRESS 3 ADDRESS 4 ADDRESS 5

Legacy Record Values Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTA DTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345


The anomalies nightmare

CUSNUM | NAME | ADDRESS | SALES $
90328574 | IBM | 187 N.Pk. Str. Salem NH 01456 | 8,494.00
90328575 | I.B.M. Inc. | 187 N.Pk. St. Sarem NH 01456 | 3,432.00
90238495 | International Bus. M. | 187 No. Park StSalem NH 04156 | 2,243.00
90233479 | Int. Bus. Machines | 187 Park Ave Salem NH 04156 | 5,900.00
90233489 | Inter-Nation Consults | 15 Main St. Andover MA 02341 | 6,800.00
90234889 | Int. Bus. Consultants | PO Box 9 Boston MA 02210 | 10,243.00
90345672 | I.B. Manufacturing | Park Blvd. Boston MA 04106 | 15,999.00

Callouts: No common key; Anomalies; Lack of standards; Spelling errors


What data challenges do you face?


No consistent naming convention
Business terms and spillover text
Missing values or data in the wrong fields
Buried information
Misspelling
No unique key linking records together

Acct # 5154155 5152335 5146261 87121 87458

Name Peter J. Lalonde LaLonde, Peter Lalonde, Sofie Pete & Soph Lalond P. Lalonde FBO

Address 40 Beacon St.

City Melrose,

State Mass YES

Zip 02176 MA MA

Note ODP 02111 CHK ID FR Alert

76 George 617-210-0824 Boston 40 Bacon Street 76 George Road Melrose Boston

MASS MA 02

S.Lalonde40 Becon Rd. Melrose

176


Why investigate?
Discover potential anomalies in the data
Examine single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data


Investigate single domain report


[Report screenshot: single domain investigation, showing the field, sample source data, frequency count, and % of total.]


Investigate word pattern report


[Report screenshot: freeform text (word) investigation, showing the field pattern, sample source data, frequency, and % of total.]

What is standardize?
Applying business logic to data chaos.
Pattern manipulation

Enforcing business standards on data elements.


Standards definition

Transforming the input to an output which meets the business requirement.


Field structuring


How to standardize
Parse specific data fields into smaller, atomic data elements
  Atomic data elements are called tokens
Categorize identified elements
  Separate Name, Address, and Area from freeform Name & Address lines
  Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
Refine data elements

Example 1
Name = DR PAUL E JONES becomes:
  Title = DR
  First Name = PAUL
  Middle Name = E
  Last Name = JONES

Example 2
Part Description = BLK LATEX GLOVE becomes:
  Color = BLACK
  Type = LATEX
  Part = GLOVE

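The part-description example (BLK LATEX GLOVE) can be sketched as a lookup against a small classification table. This is only an illustrative Python sketch, not QualityStage's rule-set language: the `CLASSIFICATION` entries and the `standardize_part` function are hypothetical names invented here, and a real rule set also applies pattern-action logic that this omits.

```python
# Hypothetical classification table: token -> (standard value, output field).
CLASSIFICATION = {
    "BLK": ("BLACK", "Color"),
    "BLACK": ("BLACK", "Color"),
    "LATEX": ("LATEX", "Type"),
    "GLOVE": ("GLOVE", "Part"),
    "GLOVES": ("GLOVE", "Part"),
}


def standardize_part(description):
    """Parse a free-form description and bucket each known token into a field."""
    fields = {}
    unhandled = []
    for token in description.upper().split():
        if token in CLASSIFICATION:
            standard_value, field = CLASSIFICATION[token]
            # Keep the first value seen for a field; later duplicates are ignored.
            fields.setdefault(field, standard_value)
        else:
            # Tokens the table does not know are reported, not silently dropped.
            unhandled.append(token)
    return fields, unhandled


print(standardize_part("BLK LATEX GLOVE"))
```

Returning the unhandled tokens separately mirrors the idea that unclassified data should be reviewed and fed back into the rules rather than lost.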

Why standardize?
Normalize values in data fields to standard values
  Transform First Name = MIKE -> MICHAEL
  Transform Title = Doctor -> Dr
  Transform Address = ST. Michael Street -> Saint Michael St.
  Transform Color = BLK -> BLACK

Apply phonetic coding to key words - facilitates record linkage
  NYSIIS
  Soundex
  Typically applied to Name fields (first, last, street, city)
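To make the phonetic coding idea concrete, here is a minimal sketch of standard American Soundex in Python. It is the textbook algorithm, and QualityStage's own implementation may differ in details; the function names `soundex` and `reverse_soundex` are chosen here for illustration. The reverse variant simply encodes the word spelled backwards.

```python
# Letter groups of standard American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")  # vowels and Y have no code
        if code and code != previous:
            digits.append(code)
        previous = code
    return (first + "".join(digits) + "000")[:4]


def reverse_soundex(word):
    """Soundex of the word spelled backwards."""
    return soundex(word[::-1])


print(soundex("Robert"), soundex("Rupert"), reverse_soundex("SUMMER"))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is what lets a matching step treat them as candidates for the same entity despite spelling differences.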


QualityStage standardize
Uses a highly flexible pattern recognition language
Can employ field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
Contains customizable classification and standardization tables
Utilizes results from data investigation


QualityStage standardize report example


[Report screenshot: standardization results, showing the original data and an Ind./Org. flag.]


Match
Conditioned data and QualityStage's matching engine link the previously unlinkable.
Match Construction:
  Reliability of input data defines a match result.
Statistical Analysis & Match Scoring:
  Linkage probability determined on a sliding scale by field level comparison.
Report Generation:
  All business rules applied have an easy to understand report structure.


What is match?
Identifying all records on one file that correspond to similar records on another file
Identifying duplicate records in one file
Building relationships between records in multiple files
Performing statistical and probabilistic matching
Calculating a score based on the probability of a match


Why match?
Identify duplicate entities within one or more files
Perform householding
Create consolidated view of customer
Establish cross-reference linkage


How to match
Single file (Unduplication) or two file (Reference)
Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison)
Generation of composite weights from multiple fields
Use of probabilistic or statistical algorithms
Application of match cutoffs or thresholds to identify automatic and clerical match levels
Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)


QualityStage match
A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
Field-by-field comparisons for agreement or disagreement
Assignment of weights or penalties
Overrides for unique data conditions
Score results to determine the probability of matched records
Thresholds for final match determination
Ability to measure informational content of data
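The scoring idea behind field-by-field weights and thresholds can be sketched with the classic probabilistic record linkage formulas. The following Python is an assumption-laden illustration, not QualityStage's engine: `m` (probability that a field agrees when two records truly match) and `u` (probability that it agrees by chance) are the usual parameters of that theory, the values in `FIELD_PROBS` are invented, and fields are compared by exact equality only, whereas a real match uses fuzzy comparisons.

```python
import math

# Hypothetical (m, u) probabilities per field; real values come from the data.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "zip": (0.90, 0.05),
}


def field_weights(m, u):
    """Return (agreement weight, disagreement weight) for one field."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def composite_weight(record_a, record_b, field_probs):
    """Sum the per-field weights into one composite score for a record pair."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        agree, disagree = field_weights(m, u)
        same = record_a.get(field) == record_b.get(field)
        total += agree if same else disagree
    return total


def decide(weight, match_cutoff, clerical_cutoff):
    """Apply cutoffs: automatic match, clerical review, or nonmatch."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


a = {"last_name": "LALONDE", "first_name": "PETER", "zip": "02176"}
b = {"last_name": "LALONDE", "first_name": "PETE", "zip": "02176"}
print(composite_weight(a, b, FIELD_PROBS))
```

A field that rarely agrees by chance (small `u`) earns a large agreement weight, which is one way to read the source's point about measuring the informational content of data.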


QualityStage match examples


What is survive?
Creation of best-of-breed surviving data based on record or field level information
Development of cross-reference file of related keys
Creating output formats:
  Relational table with primary and foreign keys
  Transactions to update databases
  Cross-reference files


Why survive?
Provide consolidated view of data
Provide consolidated view containing the best-of-breed data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business rules
Create cross-reference keys


How to survive
Highly flexible rules
Record or field level survivorship decisions
Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
Rules can incorporate multiple tests
QualityStage features
  Point-and-click (GUI-based) creation of business rules to determine best-of-breed surviving data
  Performed at record or field level


QualityStage survive examples

Example 1: The longest populated Middle and Last Name Matched Survived First Middle Last Name First Middle Last Name Name Name Name Name MARI MARI S LEMELSONLEMELSONLAPPNER LAPPNER

MARI S

LEMELSON

Example 2: The longest populated Middle Name, Date of Birth, and SSN

Matched
First Name | Middle Name | Last Name
DENISE | | TRIANO
DENISE | F | TRIANO
DOB and SSN values in the matched group: 19580211, 98524173

Survived
First Name | Middle Name | Last Name | DOB | SSN
DENISE | F | TRIANO | 19580211 | 98524173
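A field-level "longest populated value" rule like the one in these examples can be sketched in a few lines of Python. This is a simplified illustration, not the Survive stage itself: the function name `survive_longest` and the record layout are hypothetical, and which matched record carries the DOB and which carries the SSN is assumed here for the sake of the demonstration.

```python
def survive_longest(records, fields):
    """Build one surviving record from a group of matched records.

    For each listed field the longest populated value in the group wins;
    every other field is copied from the first record of the group.
    """
    survivor = dict(records[0])
    for field in fields:
        # Missing or None values count as empty; ties keep the earliest record.
        survivor[field] = max((r.get(field) or "" for r in records), key=len)
    return survivor


# Assumed layout of a matched group, modelled on Example 2.
matched = [
    {"first": "DENISE", "middle": "", "last": "TRIANO", "dob": "19580211", "ssn": ""},
    {"first": "DENISE", "middle": "F", "last": "TRIANO", "dob": "", "ssn": "98524173"},
]
print(survive_longest(matched, ["middle", "dob", "ssn"]))
```

Because the rule is applied per field, the survivor can combine values that no single input record held together, which is the cross-population effect described for survivorship.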


Course lab project design

Policy Select US Data for further processing Identify Duplicate Customer Records Investigate Assess Data Quality Condition Name, Address and Area

Standardize Country

Investigate Conditioned Results

Survive the Best Customer Record

Apply User Overrides


Checkpoint
1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.


Checkpoint solutions
1. (T/F) Data quality investigation cleans the source data.
Answer: False

2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
Answer: False

3. (T/F) Survivorship data can be either record based or field based.
Answer: True


Unit summary
Having completed this unit, you should be able to:
List the five common data quality contaminants
  Different standards
  Missing and default values
  Spillover and buried information
  Anomalies
  No consolidated view
Describe each of the following processes:
  Investigation
  Standardization
  Match
  Survivorship


Lab 1: Review course project


Course business case: WINN Insurance CRM project
See QualityStage Essentials Exercises


Lab 2: Copy student files


Copy student files to disk
Use C: drive as root for folder


QualityStage 8 Architecture


Unit objectives
After completing this unit, you should be able to:
Describe the Data Quality architecture
Identify server and client components


Information Server conceptual architecture


[Diagram components: Information Analyzer, DataStage, QualityStage, and Information Director; Information Server with Metadata Access Services, Client logon access, Logging, Security, and Other Services; Metadata Server & Repository.]


QualityStage technical overview


Uses DataStage (parallel version)
  DataStage design environment
  Parallel execution engine
  Stages are native enterprise operators
  Match designer is embedded in DataStage Designer Client
  Get DataStage data connectivity by default
    No need for meta brokers, plug-ins
  Common meta data

Legacy (pre-version 8) QS job execution
  Migration utility available to aid conversion from QS 7.x to QS 8
  Converted jobs can be compiled and executed in the QS 8 environment


DataStage/QualityStage physical architecture


[Diagram: the DataStage/QualityStage clients (Designer, Director, Administrator) run on Windows and connect to projects via TCP/IP; the projects reside on the Information Server, on UNIX or Windows.]


DataStage clients
Administrator
  Add and delete projects
  Set project defaults
  Set project environment parameters

Designer
  Maintain data definitions
  Add, modify, and delete jobs
  Add, modify, and delete match specifications
  Manage rule sets
  Compile jobs
  Run jobs
  Provision rule sets and match specifications

Director
  Run jobs
  Review job log
  Schedule jobs

DataStage Administrator
Administrator
  Create or delete projects
  Set project defaults
  Apply security

Project list


Project property defaults


DataStage Designer
Designer
  Client GUI for designing jobs
  Windows 2000+, XP
  Build meta data
  Build Jobs
  Modify Standardization Rules
  Build match specifications
  Designer Repository Database

Sample QualityStage job as viewed in Designer


Designer canvas, repository, and palette


DataStage Director
Director
  Client GUI for managing job execution
  Windows 2000+, XP
  Run jobs: set job options and parameters
  View job log
  Schedule job execution


Job log viewed in Director


Checkpoint
1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?


Checkpoint solutions
1. (T/F) DataStage Administrator executes jobs.
Answer: False

2. (T/F) DataStage Designer configures projects.


Answer: False

3. Which DataStage component displays objects in the designer database?


Answer: the repository view


Unit summary
Having completed this unit, you should be able to:
Describe the Data Quality architecture
Identify server and client components


Lab 3: Configure QualityStage project

Create a project using Administrator (if necessary)
Set project properties
  General defaults
  Environment variables
  Security groups and roles


Developing with QualityStage


Unit objectives
After completing this unit, you should be able to:
Import meta data
Build DataStage/QualityStage Jobs
Run jobs
Review results


DataStage/QualityStage project
Components
  Jobs
  Stages within jobs
  Table Definitions
The Designer repository view shows project components


Job definition
A job is an executable DataStage/QualityStage program
Created by job compilation
Jobs can be run in batch or in real time


Job development overview


Designer client
  Import or enter file meta data defining your sources and targets
  Add stages and links defining the process
  Compile the job
  Run the job (Designer or Director)
  Review data results

Server
  Runs the job
  View job log


Log onto project in Designer or Director

User name and Password controlled by Information Server

List of valid projects



Designer repository components


Database which stores
  Data file definitions
  Job designs
  Standardization rules
  Data connection objects


Project structure Repository view In Designer


DataStage/QualityStage design environment


Data definitions

Stages


Data definitions
Entered or loaded via DataStage import mechanisms
  Sequential file
  ODBC
  Native database connection

New and redefined columns can be added on the data flow via Transformer stage


Data Quality folder


Stages are the building blocks
Focused in function
All phases of data quality:
Investigate Standardize Match Frequency Match Unduplicate Match Reference Match Survive International postal MNS Optional Address Verification


Standardization rule sets


Pre-defined rules for parsing and standardizing:
  Name
  Address
  Area (City, State and Zip)
Multi-national address processing
Validate structure:
  Tax ID
  US Phone
  Date
  Email
Append ISO country codes
Rule sets are stored in the repository and provisioned to the job execution area

Rule set for USNAME



Rule set components


Can modify some rule set components
Test rule sets
Copy rule sets


Match Specifications in the DataStage Repository


Created using the Match Designer
Allows online testing of match criteria


Executing a job via Director

Director
  Click run button
  Set run options
  Execute job
  View job log
  View job monitor
Server
  Executes the job


Running a job in Director


Director
  Client GUI for running jobs
  Windows 2000+, XP
  View job logs and monitor
  Job scheduling

Job status view


Execution environment

Data Quality Job Log


Job Monitor statistics


Job development process



Import meta data
Define job
  Draw stages and links
  Set stage properties
  Compile
Run the job
Review results


Checkpoint
1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?


Checkpoint solutions
1. (T/F) The job monitor displays link statistics.
Answer: True

2. (T/F) The job log is viewed in DataStage Designer.


Answer: False

3. What protocol is used for communication between the DataStage clients and server?
Answer: TCP/IP


Unit summary
Having completed this unit, you should be able to:
Import meta data
Build DataStage/QualityStage Jobs
Run jobs
Review results


Lab 4: Import meta data


DataStage import mechanisms
DataStage components
Any object built in DataStage, such as jobs, table definitions, match specifications


Lab 5: Build and run DataStage job


Read sequential file
Must use format tab to handle nulls


Investigation


Unit objectives
After completing this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results


Investigation
Verify the domain
  Review each field of interest and verify the data matches the meta data
Identify data formats, missing and default values
Identify data anomalies
  Format
  Structure
  Content
Discover unwritten business rules
Identify data preparation requirements


Investigate stage
Features
  Analyze free-form and single domain fields
  Provide frequency distributions of distinct values and patterns

Investigate methods
  Character Discrete
  Character Concatenate
  Word


Investigate methods

Method | Why
Character Discrete | Analyzing field values, formats, and domains
Character Concatenate | Cross-field correlation, checking logic relationships between fields
Word Investigation | Identifying free-form fields that may require parsing and discovery of key words for classification


Investigate terminology
Field Masks
  Options that represent the data
  Options: Character (C), Type (T), Skipped (X)
Tokens
  Individual units of data

Character Mask | Usage
C | For viewing the actual character values of the data
T | For viewing the pattern of the data
X | For ignoring characters


Field mask examples

Token | Mask | Result
02116 | CCCCC | 02116
02116 | CCCXX | 021
01832-4480 | TTTTTTTTTT | nnnnn-nnnn
XJ2 6EM | TTTTTTT | aanbnaa
(617) 338-0300 | CCCCCCCCCCCCCC | (617) 338-0300
617-338-0300 | TTTTTTTTTTTT | nnn-nnn-nnnn
6173380300 | CCCXXXXXXXXX | 617
(617)3380300 | CCCXXXXXXXXX | (61
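The behaviour of the three masks can be reproduced with a short Python sketch. This is an illustration inferred from the examples on this slide, not the Investigate stage's code: the function name `apply_mask` is invented here, the type symbols (`n` for a digit, `a` for a letter, `b` for a blank, other characters shown as themselves) are read off the sample results, and dropping characters that lie beyond the end of the mask is an assumption.

```python
def apply_mask(value, mask):
    """Apply a character mask (C, T, or X per position) to a field value."""
    result = []
    # zip stops at the shorter input, so characters past the mask are dropped
    # (an assumption) and a mask longer than the value is harmless.
    for char, option in zip(value, mask.upper()):
        if option == "C":
            result.append(char)          # show the actual character
        elif option == "T":
            if char.isdigit():
                result.append("n")       # numeric
            elif char.isalpha():
                result.append("a")       # alphabetic
            elif char == " ":
                result.append("b")       # blank
            else:
                result.append(char)      # special characters stay as they are
        elif option == "X":
            continue                     # skip this position
        else:
            raise ValueError(f"unknown mask option: {option!r}")
    return "".join(result)


print(apply_mask("01832-4480", "TTTTTTTTTT"))
print(apply_mask("XJ2 6EM", "TTTTTTT"))
```

Counting the masked results rather than the raw values is what turns thousands of distinct phone numbers or postal codes into a handful of formats that can be reviewed by eye.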


Character discrete: field mask (C)haracter


Usage: Domain quality
View the contents of each field to verify that the data values match the field labels

Mechanism: Investigate stage


Generates reports for frequency


Character discrete - character results


Character discrete: field mask (T)ype


Usage: Data formats (patterns):
View the format of fields that you suspect may follow or conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.

Generates reports for frequency


Investigation Implementation


QualityStage Investigation job character

Double Click


Investigation - Character

Select Column

Add


Investigation - Character

Select mask


View investigation report


Character concatenate
Identify Field Relationships
  Investigate one or more fields to uncover any relationship between the field values.
  Uses combinations of character masks
  Generates reports for frequency


Character concatenate results DOB and DOD Fields


Word investigate
Usage: Free-form field pattern analysis
To view the pattern of the data within a freeform text field and parse it into individual tokens

QualityStage process
  Apply rule sets to free-form fields
  Discover parsing requirements
  Discover patterns in data
  Generate reports for pattern frequency distributions and token report


Word investigation results


Pattern Report
  How to use: Look at the most frequently occurring patterns. Use them to estimate how much work is needed to modify a rule set for a customer.
Token Report
  How to use: Review tokens with an SME to verify that tokens are properly classified. Identify the most frequently occurring unclassified tokens and add them to the rule set.


Rule sets
Rules for parsing, classifying, and organizing data
Rule Set Domains
  Country processing
  Pre-processing
  Domain Processing
    Name: Business and Personal
    Street Address
    Area: Locality, City, State and Zip/Postal codes
  Multinational Address Processing


Parsing
Parse free-form data with a SEPLIST and a STRIPLIST
  SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
  STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern

The SEPLIST is always applied first
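The two lists can be modelled with a small tokenizer. The Python below is a sketch of the rules as stated on this slide, not the actual rule-set parser: `tokenize` is a name invented here, and it assumes that a separator which is also in the STRIPLIST splits tokens without being emitted itself.

```python
def tokenize(text, seplist, striplist):
    """Split text into tokens using a SEPLIST and a STRIPLIST.

    Separators are handled first: each one ends the current token and
    becomes a token itself unless it is also in the STRIPLIST.
    Remaining STRIPLIST characters are simply dropped.
    """
    tokens = []
    current = []
    for char in text:
        if char in seplist:
            if current:
                tokens.append("".join(current))
                current = []
            if char not in striplist:
                tokens.append(char)
        elif char in striplist:
            continue
        else:
            current.append(char)
    if current:
        tokens.append("".join(current))
    return tokens


address = "120 Main St. N.W."
print(tokenize(address, seplist=" .", striplist=" "))
print(tokenize(address, seplist=" ", striplist=" ."))
print(tokenize(address, seplist=" .", striplist=" ."))
```

With the period in the SEPLIST only, each period survives as its own token; with the period in the STRIPLIST only, N.W. collapses into the single token NW; with the period in both lists, N and W become separate tokens and the periods disappear.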


Parsing example
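Example: 120 Main St. N.W.

SEPLIST: space and period; STRIPLIST: space
Token1 | Token2 | Token3 | Token4 | Token5 | Token6 | Token7 | Token8
120 | Main | St | . | N | . | W | .

SEPLIST: space; STRIPLIST: space and period
Token1 | Token2 | Token3 | Token4
120 | Main | St | NW

SEPLIST: space and period; STRIPLIST: space and period
Token1 | Token2 | Token3 | Token4 | Token5
120 | Main | St | N | W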


Data typing: classifying tokens


Identify and type the token in terms of its business meaning and value

PATTERN KEY (USADDR rule set):
^ | Numeric token
? | Unclassified alpha token
@, <, > | Mixed token
T | Street type
U | Unit type

Token | 120 | Main | Street | Apt | 6C
Class | ^ | ? | T | U | >
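Token classification along the lines of this pattern key can be sketched as a table lookup followed by default typing. The Python below is illustrative only: the `CLASSIFICATION` entries are hypothetical stand-ins for a real classification table, and the split of mixed tokens into `>` (digits followed by letters, as in 6C), `<` (letters followed by digits), and `@` (any other mix) is an assumption about how the three mixed classes differ.

```python
import re

# Hypothetical classification entries; a real table is far larger.
CLASSIFICATION = {
    "STREET": "T", "ST": "T", "AVE": "T",
    "APT": "U", "APARTMENT": "U", "FLOOR": "U",
}


def classify(token):
    """Return the one-character class for a single token."""
    word = token.upper()
    if word in CLASSIFICATION:
        return CLASSIFICATION[word]      # known word: user-defined class
    if re.fullmatch(r"[0-9]+", word):
        return "^"                       # numeric token
    if re.fullmatch(r"[A-Z]+", word):
        return "?"                       # unclassified alpha token
    if re.fullmatch(r"[0-9]+[A-Z]+", word):
        return ">"                       # mixed, leading numeric (assumed)
    if re.fullmatch(r"[A-Z]+[0-9]+", word):
        return "<"                       # mixed, leading alpha (assumed)
    return "@"                           # any other mix (simplification)


def pattern(text):
    """Classify each whitespace-separated token and join the classes."""
    return "".join(classify(token) for token in text.split())


print(pattern("120 Main Street Apt 6C"))
```

The resulting pattern string is what a word investigation counts, so addresses with very different contents but the same structure fall into one pattern.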


Example: word investigate

[Diagram: the input 10 MAPLE STREET APARTMENT 222 is parsed into tokens; known words are classified and default tags assigned (^ for 10, ? for MAPLE); reports are then produced based on patterns and tokens (pattern report and token report).]


Investigation - Word


Link ordering


Investigation define output files


Sort output (optional)


Review word reports patterns and tokens


Data quality assessment process


Review and analyze each field for the following information:
  How often is the field populated?
  What are the anomalies and out-of-range values?
  How often does each one occur?
  How many unique values were found?
  What is the distribution of the data or patterns?

Use Investigate results to:
  Update project business requirements
  Define development plan and application design
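Several of the assessment questions above reduce to a frequency distribution over one field. The following Python sketch shows one way to compute them outside the tool; the function name `profile_field` and the sample values are invented for illustration, and treating empty or whitespace-only strings as unpopulated is an assumption.

```python
from collections import Counter


def profile_field(values):
    """Summarize one field: population, unique values, and distribution."""
    total = len(values)
    populated = [v for v in values if v is not None and v.strip() != ""]
    frequency = Counter(populated)
    return {
        "total": total,
        "populated": len(populated),
        "unique": len(frequency),
        # (value, count, percent of all records), most frequent first
        "distribution": [
            (value, count, round(100 * count / total, 1))
            for value, count in frequency.most_common()
        ],
    }


# Hypothetical state-field values, including a missing and an empty one.
states = ["MA", "MA", "Mass", "", None, "MA"]
print(profile_field(states))
```

Rare values at the bottom of the distribution are usually where anomalies, default values, and misfielded data show up first.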


Checkpoint
1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.


Checkpoint solutions
1. (T/F) Character discrete investigation examines a single domain.
Answer: True

2. (T/F) Word investigation examines a single domain.


Answer: False

3. Name the three character masks.


Answer: C, T, and X


Unit summary
Having completed this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results


Lab 6: Build investigate jobs


Character with C mask
Character with T mask
Character concatenate
Word


Standardize


Unit objectives
After completing this unit, you should be able to:
Describe the Standardize stage
Identify rule sets
Build jobs using the Standardize stage
Interpret standardization results
Investigate unhandled data and patterns


Standardize
Transformation
  Parsing free form fields
  Comparison threshold for classifying like words
  Bucketing data tokens

Standardization
  Applying standard values and standard formats

Phonetic Coding for use in Matching
  NYSIIS
  Soundex


Standardize example
Input File:
Address Line 1 | Address Line 2
1721 W ELFINDALE ST | UNIT 20
1721 W ELFINDALE ST # 20 |
16200 VENTURA BOULEVARD | SUITE 201
C/O JOSEPH C REIFF | 12 WESTERN AVE
1705 W St | PHILADELPHIA
1655 PONCE DE LEON AVENUE | 15TH FLOOR

Result File:
House # | Dir | Str. Name | Type | Unit Type | Unit Value | Floor Type | Floor Value
1721 | W | ELFINDALE | ST | UNIT | 20 | |
1721 | W | ELFINDALE | ST | | 20 | |
16200 | | VENTURA | BLVD | STE | 201 | |
12 | | WESTERN | AVE | | | |
1705 | | W | ST | | | |
1655 | | PONCE DE LEON | AVE | | | FLOOR | 15


Standardize process
Parse
  10 | MAPLE | STREET | APARTMENT | 222

Classify & assign default tags
  10 = ^, MAPLE = ?, STREET = T, APARTMENT = U, 222 = ^

Process Patterns and Bucket Data

Output File
House Number | Street Name | Street Type | Unit Type | Unit
10 | MAPLE | ST | APT | 222

Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type


Standardize stage
Standardize Stage
Uses Rule sets for:
  Country processing
  Pre-domain processing
    USPREP
  Domain processing
    USADDR
    USAREA
    USNAME

Multi-national Address
WAVES
Address Verification Interface (optional)


Types of rule sets


Preparatory steps (not always required)
  Country Identifier: COUNTRY
  Domain Pre-processor: USPREP

Domain Specific
  USNAME
  USADDR
  USAREA


Example: country identifier


Input Record
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Record
US | Y | 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA | Y | SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB | Y | 28 GROSVENOR STREET LONDON W1X 9FE
US | N | 123 MAIN STREET

Copyright IBM Corporation 2009

Example: domain preprocessor

Input Record (mixed domain):
  Field 1: JIM HARRIS (781) 322-2426
  Field 2: 92 DEVIR STREET MALDEN MA 02148
Output Record:
  Name Domain:     JIM HARRIS
  Address Domain:  92 DEVIR STREET
  Area Domain:     MALDEN MA 02148
  Other Domain:    (781) 322-2426

Example: domain specific


Input Record:
  100 SUMMER STREET 15TH FLOOR
Output Record:
  House Number                    100
  Street Name                     SUMMER
  Street Suffix Type              ST
  Floor Type                      FL
  Floor Value                     15
  Address Type                    S
  NYSIIS of Street Name           SANAR
  Reverse Soundex of Street Name  R520
  Input Pattern                   ^+T>U

Rule sets
Rule sets contain logic for:
  Parsing
  Classifying
  Processing data by pattern and bucketing data
Three required files:
  Classification Table
  Dictionary File
  Pattern Action File
Optional files:
  Lookup tables
  Override tables

Rule set files


Classification Table: Contains standard abbreviations that identify and classify key words.
Dictionary File: Defines the output file fields that store the parsed and conditioned data.
Pattern Action File: Contains a series of patterns and programming commands to condition the data.
Reference Tables: Optional conversion and lookup tables for converting and returning standardized values.
Override Tables: Tables for storing overrides entered into the Designer GUI.

Classification table
Contains the words for classification, standardized versions of words, and data class
Data class (data tag) is assigned to each data token
Default classes are the same across all rule sets
User-defined classes are assigned in the classification table
  Users may modify, add or delete these classes
  User-defined classes are a single letter

Default classes
Class      Description
^          A single numeric
+          A single unclassified alpha (word)
?          One or more consecutive unclassified alphas
@          Complex mixed token, e.g., C3PO
>          Leading numeric, e.g., 6A
<          Trailing numeric, e.g., A6
Zero (0)   Null class

User-defined classes
Rule set  Class  Description
USNAME    G      Generational, e.g., Senior, I, II
USNAME    P      Prefix, e.g., Dr., Mr., Miss
USADDR    T      Street Type
USADDR    D      Directional
USADDR    B      Box Type
USAREA    S      State Abbreviation

Classification table example


Each entry has three columns: Token, Standard form, Classification.

;-----------------------------------------------------------------------------
; USADDR Classification Table
;-----------------------------------------------------------------------------
; Classification Legend
;-----------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-----------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-----------------------------------------------------------------------------
DRAW      "PO BOX"  B
DRAWER    "PO BOX"  B
PO        "PO BOX"  B
POB       "PO BOX"  B
POBOX     "PO BOX"  B
POBX      "PO BOX"  B
PODRAWER  "PO BOX"  B
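The interplay between the classification table and the default classes can be sketched in Python. This is only an illustration, not QualityStage code: `classify_token` and `input_pattern` are hypothetical names, the table is a small in-memory stand-in, and the STREET and APARTMENT rows are assumptions inferred from the Standardize process slide. The sketch emits `+` for each unclassified word and sends every token it cannot place to `@`.

```python
import re

# Hypothetical stand-in for a few classification table rows:
# token -> (standard form, class). The PO BOX rows come from the slide;
# STREET and APARTMENT are assumed for illustration.
CLASSIFICATION_TABLE = {
    "DRAW": ("PO BOX", "B"),
    "DRAWER": ("PO BOX", "B"),
    "POB": ("PO BOX", "B"),
    "STREET": ("ST", "T"),
    "APARTMENT": ("APT", "U"),
}


def classify_token(token, table=CLASSIFICATION_TABLE):
    """Return the class for one token: table class first, then a default class."""
    upper = token.upper()
    if upper in table:
        return table[upper][1]
    if re.fullmatch(r"\d+", upper):
        return "^"  # a single numeric
    if re.fullmatch(r"[A-Z]+", upper):
        return "+"  # a single unclassified alpha
    if re.fullmatch(r"\d+[A-Z]+", upper):
        return ">"  # leading numeric, e.g. 6A
    if re.fullmatch(r"[A-Z]+\d+", upper):
        return "<"  # trailing numeric, e.g. A6
    return "@"      # anything else is treated as a complex mixed token here


def input_pattern(text, table=CLASSIFICATION_TABLE):
    """Split on whitespace and concatenate the class of every token."""
    return "".join(classify_token(token, table) for token in text.split())


print(input_pattern("10 MAPLE STREET APARTMENT 222"))
```

A pattern rule can then address the run of unclassified words with `?`, as in the key on the Standardize process slide.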


Comparison threshold
May be used in the Classification table
Used to efficiently make entries into the classification table
Helps overcome spelling and data entry errors
Not required
Threshold uses a logical string comparator

Threshold level  Meaning
900              Exact match
850              Almost certainly the same
800              Most likely equivalent
750              Most likely not the same
700              Almost certainly not the same

Classification table example with comparison threshold


; USADDR Classification Table
;-----------------------------------------------------------------------------
; Classification Legend
;-----------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-----------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-----------------------------------------------------------------------------
DRAW       "PO BOX"  B
DRAWER     "PO BOX"  B
NORTHEAST  NE        D  850
NORTHWEST  NW        D  850
NW         NW        D
S          S         D
SO         S         D
SOUTH      S         D

Dictionary file
Defines the field definitions for the output file
Moving data to these output fields is called bucketing the data
The order in which the fields are listed in the dictionary file defines the order in which they appear in the output file
Dictionary file entries are similar to field definitions

Dictionary file example


;;QualityStage v8.0
\FORMAT\ SORT=N
;-----------------------------------------------------------------------------
; USADDR Dictionary File
;-----------------------------------------------------------------------------
; Total Dictionary Length = 411
;-----------------------------------------------------------------------------
; Business Intelligence Fields
;-----------------------------------------------------------------------------
HouseNumber              C 10 S HouseNumber              ;0001-0010
HouseNumberSuffix        C 10 S HouseNumberSuffix        ;0011-0020
StreetPrefixDirectional  C 3  S StreetPrefixDirectional  ;0021-0023
StreetPrefixType         C 20 S StreetPrefixType         ;0024-0043
StreetName               C 25 S StreetName               ;0044-0068
StreetSuffixType         C 5  S StreetSuffixType         ;0069-0073
StreetSuffixQualifier    C 5  S StreetSuffixQualifier    ;0074-0078

Pattern-Action file
Contains the rules for standardization; that is, the actions to execute with a given pattern of tokens
Records are processed from the top down
Written in Pattern Action Language (PAL)
Complex parsing can be coded in this file

Pattern Action file process

Street Address Pattern
  10 Hollow Oak Road
Pattern Action Language
  COPY [1] {HN}
  COPY_S [2] {SN}
  COPY_A [3] {ST}
Result: 10 Hollow Oak Rd
  {HN}  10
  {SN}  Hollow Oak
  {ST}  Rd
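The three actions on this slide can be mimicked in Python to make the bucketing step concrete. This is a sketch under assumptions, not PAL itself: it assumes that COPY copies the operand as-is, COPY_S copies it while keeping the spaces between words, and COPY_A copies the standard abbreviation from the classification table. The `STANDARD_FORMS` entry and all function names are hypothetical.

```python
# Hypothetical classification table entry: token -> standard abbreviation.
STANDARD_FORMS = {"ROAD": "Rd"}


def copy(operand):
    """Assumed COPY behavior: copy the operand text unchanged."""
    return operand


def copy_s(operand):
    """Assumed COPY_S behavior: copy all words, keeping single spaces between them."""
    return " ".join(operand.split())


def copy_a(operand, standard_forms=STANDARD_FORMS):
    """Assumed COPY_A behavior: copy the standard abbreviation, if one is known."""
    return standard_forms.get(operand.upper(), operand)


def bucket_street_address(operands):
    """Apply the slide's three actions to operands [1], [2] and [3]."""
    house_number, street_name, street_type = operands
    return {
        "HN": copy(house_number),
        "SN": copy_s(street_name),
        "ST": copy_a(street_type),
    }


print(bucket_street_address(["10", "Hollow Oak", "Road"]))
```

The returned dictionary plays the role of the dictionary file fields {HN}, {SN} and {ST}.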

Optional lookup tables


Called from the Pattern Action File
Rule sets may contain lookup tables such as:
  Common First Names and Enhanced First Names
    Barb & Barbara
    Ted & Edward
  Gender based on name
  State abbreviations
  Common city abbreviations
    NYC = New York City
    LA = Los Angeles

Standardize process
Parse: 10 MAPLE STREET APARTMENT 222
Classify & assign default tags (Classification Table): ^ = 10, ? = MAPLE
Process Patterns and Bucket Data (Pattern Action File)
Output fields (Dictionary File):
  House Number  Street Name  Street Type  Unit Type  Unit
  10            MAPLE        ST           APT        222

Standardizing international data


Two methods
Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
  Uses out-of-the-box, included functionality/rules
Method 2: Use Multinational Standardize, WAVES, or AVI
  WAVES requires purchase of WAVES database
  AVI requires purchase of database for address validation

Course lab project design


Figure: course lab project flow. Boxes on the slide:
  Policy
  Investigate: Assess Data Quality
  Standardize Country
  Select US Data for further processing
  Condition Name, Address and Area
  Investigate Conditioned Results
  Apply User Overrides
  Identify Duplicate Customer Records
  Survive the Best Customer Record

Country rule set


The Country rule set appends the two-byte ISO country code
Input to the country rule set includes:
  Default country code (designated by ZQdefault valueZQ, for example ZQUSZQ for a US default)
  Street Address
  City or locality
  State
  Zip or Postal code
  Country field (if it exists)
Output:
  Two-byte ISO country code
  Flag identifying explicit or default decision

Standardization implementation


Standardization jobs

Country Rule Set

USPREP Rule Set

Domain-specific Rule Sets


Standardization US Name, Address, Area


Standardization


Standardization mapping columns



Selecting US data
The DataStage Filter stage selects and/or rejects records based on a set of values for a field
Selecting or splitting data with compound or complex logic may require a Transformer stage


Domain pre-processor rule sets


Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
  For example, if the city, state and zip are found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
The pre-processor rule set prepares the data for processing by domain-specific rule sets

Domain rule sets


Domain rule sets expect only data for that domain as the input
Domain rule sets that come with QualityStage are:
  Name
  Street address
  Area (city, state and zip)

USNAME rule set


The USNAME rule set works on both personal names and organization names for US data
Data is parsed into name components
Phonetic codes of the First Name and Primary Name are created for matching

USADDR rule set


This rule set is applied to street address fields
The Address Type flag identifies different types of addresses
  S  Street address
  B  Box address
  R  Rural route address
Phonetic coding of the Street Name is created for matching

USAREA rule set


This rule set is applied to city, state and postal code fields
Data is parsed into city name, state abbreviation, zip code and zip plus four
Phonetic coding of the city name is created for matching

Standardize results
Business Intelligence fields
  Parsed from the original data; they may be used in matching and are generally moved to the target system
Matching fields
  Generally created to help during the match process and dropped after successful matching
Reporting fields
  Created specifically to help review the results of Standardize and to recognize handled and unhandled data

Business intelligence fields


Intelligent data parsed and bucketed from the input free-form field
USNAME Examples: Title, First Name, Middle Name, Primary Name, Generational
USADDR Examples: HouseNumber, Directional, Street Name, Unit Types, Box Types, Unit Values, Building Names
USAREA Examples: City, State, Zip5, Zip4

Standardize matching fields


Phonetic coding
  NYSIIS
  Reverse NYSIIS
  Soundex
  Reverse Soundex
Hash keys
  First 2 characters of the first five words
Packed Keys
  Data concatenated, or packed
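Two of these matching fields are easy to illustrate in Python. The sketch below implements standard American Soundex, a reverse Soundex (Soundex of the reversed word), and a hash key built from the first two characters of the first five words. It is an illustration only: the function names are hypothetical, and whether QualityStage's own coding agrees in every edge case is an assumption, although the reverse Soundex of SUMMER does reproduce the R520 shown in the domain-specific example.

```python
# Letter groups of standard American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code to ''
    return (encoded + "000")[:4]


def reverse_soundex(word):
    """Soundex of the word read right to left."""
    return soundex(word[::-1])


def hash_key(text, words=5, chars=2):
    """Concatenate the first `chars` characters of the first `words` words."""
    return "".join(word[:chars] for word in text.split()[:words])


print(soundex("SUMMER"), reverse_soundex("SUMMER"))
print(hash_key("100 SUMMER STREET 15TH FLOOR"))
```

Reverse codes are useful when errors tend to occur at the start of a word, since the code is then anchored on the last letter.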


Standardize reporting fields


Unhandled Pattern
  The pattern generated for tokens not processed by the selected rule set.
Unhandled Data
  The remaining tokens not processed by the selected rule set.
Input Pattern
  The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data
  The tokens not processed by the rule set because they represent a data exception.
User override flag
  Flag indicating what kind of user overrides were applied to this record.

Investigate NAME unhandled patterns and data


Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Description   Type
Unhandled Pattern   C
Unhandled Data      X
Input Pattern       X
Name domain data    X

Standard practice: investigate handled and unhandled data
Review the business intelligence fields to ensure accurate bucketing of the data
Build a Character Discrete Investigation for each field and review the contents and the format
Build Investigation to review:
  Unhandled Patterns
  Unhandled Data
  Input Pattern
  Input Fields

Customizing rule sets


A rule set may require modification if some input data is:
  Not processed
  Incorrectly processed
QualityStage provides functions to:
  Test strings for classifications using the Rules Analyzer
  Apply user Overrides
  Modify classification table

User overrides
Provides the user with the ability to modify rule sets
The following types of rule sets can be modified using User Overrides:
  Domain Pre-processor rule sets
  Domain rule sets
There are five types of user overrides, relating to classifications, patterns, and text strings
User overrides are GUI driven
The rule set should be provisioned after modifications are applied

User classification override


User Classifications may override or add:
  Additional words, recognized as a keyword and classified
  New abbreviation, variation
  Misspelling of a word
Each classification override specifies:
  Original values (Token values)
  Standard value
  Class

Apply classification override


Unhandled Data
  Input Pattern   Original Data
  +,+             HOCHREITER , CAROLYNNE
Override: add CAROLYNNE as a valid first name to the classification table
  Token Value   Standard Value   Class
  Carolynne     Carolynne        F
Corrected Pattern
  Input Pattern   Original Data
  +,F             HOCHREITER , CAROLYNNE

Text overrides
Allow the user to specify overrides based on an entire text string
Use this override for special cases and specific handling of a string of text
Input Text Overrides
  Applied to the original text string
Unhandled Text Overrides
  Applied to the Unhandled Data field

Input text overrides

Unhandled Pattern   Input Text
++                  ZACHARIA GELLMAN
++                  TOMMOTHY CABBOTT
++                  REIFF FUNERAL
Override
  Input Text: REIFF FUNERAL
  Override: Move text string to the Primary name field
Results
  Input Pattern   Primary Name
  ++              REIFF FUNERAL

Pattern overrides
Allow the user to specify overrides based on an entire pattern
Use this override when most or all records should be processed with identical logic
Input Pattern Overrides
  Applied to the original text string
Unhandled Pattern Overrides
  Applied to the Unhandled Data field

Unhandled pattern overrides


Unhandled Pattern   Input Text
+, +                HAYWARD, WINSLOW
+, +                ESHAGHIAN , JOUBI
+, +                BOULDER, CORONA
Override for Unhandled Pattern +, +
  Move the first + to Primary Name
  Comma provides context
  Move the second + to First Name
Results
  Unhandled Pattern   First Name   Primary Name
  +, +                WINSLOW      HAYWARD
  +, +                JOUBI        ESHAGHIAN
  +, +                CORONA       BOULDER

User override precedence


Override type         Purpose
User Classification   Recognize words to classify
Input Text            Modify logic based on the input string
Input Pattern         Modify logic based on the input pattern
Unhandled Text        Modify logic based on the Unhandled data string
Unhandled Pattern     Modify logic based on the unhandled pattern

Investigate address and area unhandled patterns

Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Description   Type
Unhandled Pattern   C
Unhandled Data      X
Input Pattern       X
Address Domain      X

Overrides
Purpose
  Correct problems found during standardization
Rule set may require overrides because you have data
  Not processed
  Incorrectly processed
Override types
  Classification
  Input pattern
  Input text
  Unhandled pattern
  Unhandled text
Can be tested with rules analyzer


Overrides screen


Checkpoint
1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.

Checkpoint solutions
1. (T/F) WAVES can standardize name fields.
  Answer: False
2. (T/F) Rule sets are used in standardization processing.
  Answer: True
3. Name the components of rule sets.
  Answer: Classification table, dictionary, pattern action file, lookup tables

Unit summary
Having completed this unit, you should be able to:
  Describe the Standardize stage
  Identify rule sets
  Build jobs using the Standardize stage
  Interpret standardization results
  Investigate unhandled data and patterns

Lab 7: Standardize country


Word investigation
  Uses COUNTRY rule set
Rule set found in Other folder
Adds ISO country code to records

Lab 8: Select US records


Uses Select stage to separate records with US ISO code
Could also use Transformer stage

Lab 9: Standardize USPREP


Word investigation
Uses rule set

Rule set found in US folder


Lab 10: Standardize USNAME, USADDR, USAREA
Word investigation
  Uses rule sets
Rule sets found in US folder

Lab 11: Investigate unhandled patterns


Character concatenate investigation
C mask used to produce histogram
X mask used to display other fields of interest

Lab 12: Apply user overrides


Classification
Input pattern
Unhandled pattern

Match


Unit objectives
After completing this unit, you should be able to:
  Build a QualityStage job to identify matching records
  Apply multiple match passes to increase efficiency/efficacy
  Interpret and improve match results

Match stage
Statistically-based method for determining matches
25 match comparison algorithms providing a full spectrum of fuzzy matching functions
Ability to measure informational content of data
Identify duplicate entities within one or more files
Match specification built with Match Designer
Critical field settings

What constitutes a good match?


Which of the following record pairs is a match? And how do you know?
W W HOLDEN HOLDEN W HOLDEN W HOLDEN 12 MAIN ST 12 MAINE ST 128 MAIN PL 128 MAINE PL 02111 12/8/62 02110 12/8/62 02111 12/8/62 02110 12/8/62 338-0824 338-0824

WM HOLDEN WILL HOLDEN

128A MAIN SQ 128A MAINE SQ

Do you compare all the shared or common fields?
Do you give partial credit?
Are some fields (or some values) more important to you than others? Why?
Do more fields increase your confidence? By how much? What is enough?

The value of information content


Information content measures the significance of one field over another (Discriminating Value)
  A Gender Code contributes less information than a Tax-Id Number
Information content also measures the significance of one value in a field over another (Frequency)
  In a First-Name Field, JOHN contributes less information than DWEZEL
Significance is determined by a value's reliability and its ability to discriminate; both can be calculated from your data

Distribution of weights
Field comparison weights for one record pair:
  WILLIAM   J      HOLDEN   128   MAIN    ST    02111   12/8/62
  WILLAIM   JOHN   HOLDEN   128   MAINE   AVE   02110   12/8/62
  +1        +1     +17      +2    +4      -1    +7      +9        = 40
The weighted score is a relative measure of the probability of a match
Thresholds defined can be used for automated processing
Figure: histogram of # of Pairs (0 to 4000) by Weight of Comparisons (-20 to 40). Non-Matches lie at the low-weight end (less confidence), Matches at the high-weight end (more confidence), with a grey area between them.

Weights
Measures the information content of a data value
Each field contributes to the confidence (probability) of a match

Types of weights
If a field matches, the agreement weight is used
  Agreement weight is a positive value
If a field doesn't match, the disagreement weight is used
  Disagreement weight is a negative value
Partial weight is assigned for non-exact or fuzzy matches
Missing values have a default weight of zero
Weights for all field comparisons are summed to form a composite weight
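The summation can be checked against the WILLIAM/WILLAIM HOLDEN pair from the Distribution of weights slide. The Python sketch below only does the arithmetic; the field names are assumptions (the slide lists the weights without labels), and `composite_weight` is a hypothetical helper in which a missing comparison is represented by `None` and contributes zero.

```python
def composite_weight(field_weights):
    """Sum the per-field weights; a missing value (None) contributes zero."""
    return sum(weight for weight in field_weights.values() if weight is not None)


# Weights from the slide; the field labels are assumed for illustration.
slide_weights = {
    "first_name": 1,
    "middle_name": 1,
    "last_name": 17,
    "house_number": 2,
    "street_name": 4,
    "street_type": -1,   # ST vs AVE disagrees, so the weight is negative
    "zip": 7,
    "birth_date": 9,
}

print(composite_weight(slide_weights))
```

Adding a field whose value is missing on either record leaves the composite weight unchanged.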


Matching terminology
Informational Content: Measures the significance of one field value over another
Weight: Measures the informational content of a data value
Composite Weight: Measures the confidence of a match
Match Cutoffs: Distinguish matches from non-matches
False Positives: Records with a score above the High cutoff that really aren't a match
False Negatives: Records below the low cutoff that really are a match

Measuring the conditions of uncertainty


Reliability of the data in a given field
Estimated as the probability that the field agrees given the record pair is a match

Probability of a random agreement of values


Estimated as the probability the field agrees given the record pair is not a match


Reliability (m-probability)
Approximated as 1 - error rate for the given field
The higher the m-probability, the larger the penalty (the more negative the disagreement weight) when the field does not match, since the data is considered reliable

Chance agreement (u-probability)


The u-probability can be approximated as the probability that a field agrees at random (by chance)
QualityStage uses a frequency analysis to determine the probability of chance agreement for all values
  Created by a Match Frequency stage
Rare values bring more weight to a match
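How the m- and u-probabilities turn into weights can be sketched with the standard probabilistic record linkage (Fellegi-Sunter) formulas: agreement weight log2(m/u) and disagreement weight log2((1-m)/(1-u)). The slides do not state the formula, so treat its use here as an assumption about how the product derives weights; the function names are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when the field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


# A value that agrees by chance 1 time in 100 is worth more than one
# that agrees by chance 1 time in 10.
print(agreement_weight(0.9, 0.01), agreement_weight(0.9, 0.1))

# A more reliable field (higher m) is penalized more when it disagrees.
print(disagreement_weight(0.99, 0.1), disagreement_weight(0.9, 0.1))
```

This shows both effects named on the slides: rare values (small u) raise the agreement weight, and reliable fields (large m) make the disagreement weight more negative.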


Blocking
Grouping together like records that have a high probability of producing matches
Only like records are compared to each other, making the match more efficient and computationally feasible
Records in a block match exactly on one to several blocking fields

Blocking example: sample data Block on NYSIIS of Last Name


NYSIIS LNAME   NAME                     ADDRESS              ZIP
YANG           YUNG , WAYNE D           9000 SHEPARD DRIVE   78753
GARAS          GEROSA, FRAN X           29 AARONS CT         06877
YANG           YOUNG , JONATHAN A       1767 TOBEY ROAD      30341
GARAS          GERISA, FRANCIS          29 AARONS CT         06877
GARAS          GEROSA, FRANCIS XAVIER   29 AARONS COURT      06877
MATAC          MARCUS MATIC             100 SUMMER STREET    02111
GARAS          GEROSA, MARY             29 AARONS CT         06877
JANCAN         RENEE JENKINS            100 SUMMER STREET    02111
YANG           YOUNG THERESA C          1767 TOBEY ROAD      30341

Blocking example NYSIIS of Last Name


NYSIIS   NAME                     ADDRESS              ZIP
YANG     YUNG , WAYNE D           9000 SHEPARD DRIVE   78753
YANG     YOUNG , JONATHAN A       4220 BELLE PARK DR   77072
YANG     YOUNG THERESA C          1767 TOBEY ROAD      30341

GARAS    GEROSA, FRAN X           29 AARONS CT         06877
GARAS    GEROSA, FRANCIS XAVIER   29 AARONS COURT      06877
GARAS    GEROSA, MARY             29 AARONS CT         06877
GARAS    GARISA, FRANCIS          29 AARONS CT         06877

MATAC    MARCUS MATIC             100 SUMMER STREET    02111

JANCAN   RENEE JENKINS            100 SUMMER STREET    02111

Blocks with only one record are considered residuals
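The grouping on this slide can be reproduced with a few lines of Python. This is a sketch of the idea only, not how the Match stage is implemented: the NYSIIS keys are copied from the slide rather than computed, and `build_blocks` and `split_residuals` are hypothetical helper names.

```python
from collections import defaultdict

# Records from the slide, reduced to the blocking key and the name.
records = [
    {"nysiis": "YANG", "name": "YUNG , WAYNE D"},
    {"nysiis": "GARAS", "name": "GEROSA, FRAN X"},
    {"nysiis": "YANG", "name": "YOUNG , JONATHAN A"},
    {"nysiis": "GARAS", "name": "GARISA, FRANCIS"},
    {"nysiis": "GARAS", "name": "GEROSA, FRANCIS XAVIER"},
    {"nysiis": "MATAC", "name": "MARCUS MATIC"},
    {"nysiis": "GARAS", "name": "GEROSA, MARY"},
    {"nysiis": "JANCAN", "name": "RENEE JENKINS"},
    {"nysiis": "YANG", "name": "YOUNG THERESA C"},
]


def build_blocks(rows, blocking_key):
    """Group records whose blocking field matches exactly."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_key]].append(row)
    return dict(blocks)


def split_residuals(blocks):
    """Separate blocks worth comparing from single-record blocks (residuals)."""
    candidates = {key: rows for key, rows in blocks.items() if len(rows) > 1}
    residuals = [rows[0] for rows in blocks.values() if len(rows) == 1]
    return candidates, residuals


candidates, residuals = split_residuals(build_blocks(records, "nysiis"))
print(sorted(candidates), [row["name"] for row in residuals])
```

Only records inside the YANG and GARAS blocks are compared with each other; MARCUS MATIC and RENEE JENKINS have no partner in this pass.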



Blocking strategy
Choose fields with reliable data
Choose fields with a good distribution of values
Combinations of fields may be used

Examples of blocking strategies


Zip code for matching addresses
NYSIIS of last name for matching individuals
Brand name for matching products
Combination of zip code and NYSIIS of street name for matching addresses
Combination of NYSIIS of last name and first letter of first name for matching individuals

Blocking summary
Blocking groups together like records
Matching is more efficient for small block sizes
  Blocks should have fewer than 1000 records (guideline, not a hard and fast rule)
Blocking fields must match exactly for a candidate set to be created/evaluated
Beware of block overflow
  Computational resources run out
  Comparisons are not completed
  Every record in the block becomes an automatic residual
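Why small blocks matter can be seen by counting record pairs. Assuming every record in a block is compared with every other record in that block, a block of n records needs n(n-1)/2 comparisons. The Python sketch below counts pairs for the nine-record blocking example (block sizes 3, 4, 1 and 1) and for a block at the 1000-record guideline; the function names are hypothetical.

```python
def pair_count(n):
    """Number of distinct record pairs among n records."""
    return n * (n - 1) // 2


def blocked_pair_count(block_sizes):
    """Total pairs when records are only compared within their own block."""
    return sum(pair_count(size) for size in block_sizes)


print(pair_count(9))                     # all nine records compared with each other
print(blocked_pair_count([3, 4, 1, 1]))  # same records after blocking on NYSIIS
print(pair_count(1000))                  # one block at the guideline size
```

The pair count grows with the square of the block size, which is why an oversized block can exhaust resources.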


Match types
Unduplication
  Identifies duplicate candidates in one file
Reference Match (Two File)
  One-to-one correspondence
    For every record on the stream link we expect to find a match to one record on the reference link
  Many-to-one correspondence
    More than one record on the stream link can match the same record on the reference link

Comparing data values


Different comparisons for different data
17 comparison methods
Most common:
  CHAR (character comparison): compares character by character, left to right
  UNCERT (character uncertainty): tolerates phonetic errors, transpositions, random insertion, deletion, and replacement of characters
  CNT_DIFF: counts keying errors in numeric fields; you set a tolerance threshold
  NAME_UNCERT: can be used to compare character values; if the strings are different lengths, the shorter of the two lengths is used

Match Implementation


Tasks required in match process


Standardize the data
Add data columns needed for blocking
Generate match frequency report using Match Frequency stage
Build match specification in Match Designer
Add pass
  Blocking columns
  Match commands
Configure match test results environment
Run pass
Review results
Tune the match
Add cutoffs
Set overrides
Add more passes
Repeat steps until match results are acceptable

Standardize columns and generate match frequency


Match frequency stage

Map fields


Match frequency generation


Lab 13: Match frequency


Use Match Frequency stage in a match job


Match Designer
Used to build a match specification that will be used in a match job
Features
  Design control center
  Data-centric
  Graphical representation of statistics
  Independent of job design
  Iterative development

Match Design - Unduplicate


Match design creating specification


How to create a new match specification

Right-click in non-root area of repository


Match design - unduplicate


The Major Components
  Pass Composer
  Test results
  Histogram
  Holding Area
  Data Viewer
  Decision Rules
  Cutoff Tuning

Match design - unduplicate

Select match type example unduplicate

Will initially get one pass called MyPass


Match design - unduplicate

Click table definition icon

Use load button to access table definition of standardized data set



Match design - unduplicate

Select pass icon

Blocking

Match Commands


Match design - unduplicate

Save passes and specification


Match design - unduplicate

Name and place passes and specification


Match design - unduplicate

Set up test results area
Questions:
  Where is the standardized data?
  Where is the frequency report?
  What ODBC-accessed database will store test results?

Match design - unduplicate

Standardized sample data

Note: these must be data sets


Frequencies data set

Data Source Name User Name Password


Match design - unduplicate

Add Blocking Columns


Match design - unduplicate

Business Name

Select Column

Click Apply or OK


Match design - unduplicate

Add MATCH Column


Match design - unduplicate

Business Name


Match design - unduplicate

Compare Type


Match design - unduplicate

Data Column

Right-Click to view data frequencies


Match design - unduplicate

Frequencies


Match design - unduplicate

Select Parameter


Fully configured pass

Click test pass to run the pass against the data

Expanded view will show details of blocking and match commands


Match design after test pass run


Match design - unduplicate

Grouping option:
  Match Sets: See all matches and duplicates together
  Match Pairs+Sort: See the master record repeated

Match design - unduplicate


Default Display (Grouped by Match Sets)

Grouped by Match Pairs and then sorted Ascending by Weight


Match design - unduplicate


Compare Weights: See how any two records score


Match design - unduplicate

Statistics Tab
  Change What Shows

Match design - unduplicate


Change How Shows

Match design - unduplicate


TOTAL Statistics Tab
  Change How Shows
  Change What Shows

Lab 14: Configure test results database
Build a DB2 database to contain match test results
Build an ODBC source to connect the database to QualityStage

Lab 15: Match specification


Use Match Designer to build specification for unduplicate job
Configure test results area

Match improvement strategy


1. Set critical values for important fields
2. Review calculated weights
     Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes

Critical fields
Used to identify fields that must agree in order for records to be linked
  Critical: field values must agree exactly or the records cannot be linked (considered a match)
  Critical Missing OK: field values must agree exactly on values not considered missing values
QualityStage feature: Variable Special Handling

Variable special handling


Weight overrides
Allow you to adjust the agreement and/or disagreement weights for specific situations
  Add to calculated weight
  Replace weight
Set on the Match Commands screen

Weight override screen


Cutoffs
There are two cutoffs
  Match cutoff (high cutoff)
  Clerical cutoff (low cutoff)
Records with a weight equal to or above the Match cutoff are considered matches
Records with a weight below the low cutoff are not matches
Records with a weight greater than or equal to the low cutoff and less than the high cutoff are considered clerical records for manual review
Cutoffs can be set at the same value, eliminating clerical records
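The cutoff rules translate directly into a small decision function. The Python sketch below is illustrative only: `classify_pair` is a hypothetical name, and the cutoff values used in the example calls (20 and 10) are invented for the demonstration, not taken from the course.

```python
def classify_pair(weight, match_cutoff, clerical_cutoff):
    """Classify a composite weight as 'match', 'clerical' or 'nonmatch'."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical (low) cutoff must not exceed match (high) cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"  # between the cutoffs: goes to manual review
    return "nonmatch"


# Hypothetical cutoffs: match at 20, clerical at 10.
print(classify_pair(26.31, 20, 10))
print(classify_pair(10.83, 20, 10))
print(classify_pair(5.0, 20, 10))

# Equal cutoffs leave no clerical range at all.
print(classify_pair(19.9, 20, 20))
```

Raising the match cutoff trades false positives for more clerical review; lowering the clerical cutoff trades false negatives for the same.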

Setting the match cut-off


Weights  Data fields

Definite Match
27.82    PO BOX 930202
27.82    PO BOX 930202
27.82    PO BOX 930202

Definite Match
38.65    35 COLLIER RD
38.65    35 COLLIER RD

Questionable Match
25.81    928 S 1ST ST NW STE 610
14.45        S 1ST ST NW STE 610

Multiple match passes


Additional passes are helpful in overcoming data errors and missing values in block fields
You should always create at least two match passes
Change blocking strategies for each pass

Example: multiple match passes


Pass  Weights  Data fields
1     26.31    JASON BIRCH   1350 WALTON WAY   30901
1     26.31    JASON BIRSH   1350 WALTON WAY   30901
1     20.42    JOHN SMITH    2047 PRINCE AVE   30604
1     10.83    MARY SMITH    2047 PRINCE AVE   30604
1     RES A    JOHN SMITH    P.O. BOX 123      30604
2     20.42    JOHN SMITH    2047 PRINCE AVE   30604
2     10.19    JOHN SMITH    P.O. BOX 123      30604

Pass 1 blocked on street name
Pass 2 found additional matched records in which the street name was different but the names were the same
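
A rough Python sketch of why a second pass with a different blocking key helps, using the records from this example. Only blocking is shown; weight calculation and cutoffs are left out. The function name candidate_pairs and the dictionary keys are hypothetical, and the full street line stands in for the street-name block field.

```python
from collections import defaultdict
from itertools import combinations

RECORDS = [
    {"name": "JASON BIRCH", "street": "1350 WALTON WAY", "zip": "30901"},
    {"name": "JASON BIRSH", "street": "1350 WALTON WAY", "zip": "30901"},
    {"name": "JOHN SMITH", "street": "2047 PRINCE AVE", "zip": "30604"},
    {"name": "MARY SMITH", "street": "2047 PRINCE AVE", "zip": "30604"},
    {"name": "JOHN SMITH", "street": "P.O. BOX 123", "zip": "30604"},
]


def candidate_pairs(records, block_key):
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_key]].append(idx)
    pairs = set()
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(members, 2))
    return pairs


pass1 = candidate_pairs(RECORDS, "street")  # pass 1: block on street
pass2 = candidate_pairs(RECORDS, "name")    # pass 2: block on name
print(sorted(pass1))
print(sorted(pass2))
```

The P.O. box record (index 4) is never compared with the other JOHN SMITH record (index 2) in pass 1 because they sit in different street blocks; blocking on name in pass 2 brings them together.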

Match Implementation Unduplicate job

Unduplication implementation

Double Click

Unduplication Implementation

Verify link order for both input and output

Map all output links

Checkpoint
1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?

Checkpoint solutions
1. (T/F) Match specifications are created using Designer.
Answer: True

2. (T/F) An unduplicate match can be used against two files.
Answer: False

3. Which match specification component determines the extent of the clerical review records?
Answer: cutoff values

Unit summary
Having completed this unit, you should be able to:
Build a QualityStage job to identify matching records
Apply multiple match passes to increase efficiency/efficacy
Interpret and improve match results

Lab 16: Unduplicate job


Build unduplicate job using the match specification

Survive

Unit objectives
After completing this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job

Survive stage
Point-and-click creation of business rules to determine surviving data: the user decides how to survive data
Performed at record or field level: very flexible
Creates a single, consolidated record containing the best-of-breed data
Provides consolidated view of the data

Survive example
Survive Input (Match Output)
Group: 1, 1, 23, 23, 23
Legacy: D150, A1367, D689, A436, D352
First: Bob, Robert, William, Billy, William
Middle: A, Alex
Last: Dixon, Dickson, Obrian, OBrian, Obrian
No.: 1500, 1500, 5901, 5901, 5901
Dir.: SE, SW, SW
Str. Name, Type, Unit No.: ROSS CLARK CIR; ROSS CLARK CIR; 74TH ST STE 202; 74TH ST; 74 ST #202

Survived Consolidated Output


Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Robert           Dickson  1500  SE    ROSS CLARK  CIR
23     D689    William  Alex    OBrian   5901  SW    74TH        ST    STE 202

Cross-Reference File
Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352

Survive rules
A rule contains a condition and a set of target fields
    When the condition is met, the field becomes a candidate for the best
    All records in a group are tested against the condition
    The best populates the target fields

Multiple targets are permitted for the same rule

Survive rules
Custom Rule
Build your own logical expression
Comparison (=, !=, <, >, <=, >=)
Logical (and, or, not)
Indicate the current and best records with the following notation
    c.field indicates the current record
    b.field indicates the best record

Parentheses ( ) can be used for grouping complex conditions
String literals are enclosed in double quotation marks, such as "MARS"
A semicolon (;) terminates a rule

Building survive rules


Survive Rules Definition screen lets you easily build, delete and manage survivor rules

Survive techniques
Pre-defined Techniques
    Source
    Recency
    Frequency
    Most complete (longest string)

User-specified logic
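
The pre-defined techniques can be pictured as simple selection functions over the values in one match group. The Python sketch below is an analogy only, not the Survive stage's code: the function names, the updated and source fields, and the tie-breaking (first value wins) are assumptions made for illustration.

```python
from collections import Counter


def non_blank(values):
    """Drop empty or whitespace-only values."""
    return [v for v in values if v and v.strip()]


def most_complete(values):
    """Longest non-blank string; the first one wins a tie."""
    candidates = non_blank(values)
    return max(candidates, key=len) if candidates else ""


def most_frequent(values):
    """Most common non-blank value; the first one seen wins a tie."""
    candidates = non_blank(values)
    return Counter(candidates).most_common(1)[0][0] if candidates else ""


def most_recent(records, field, date_field="updated"):
    """Value of `field` from the record with the latest ISO-format date."""
    return max(records, key=lambda r: r[date_field])[field]


def from_source(records, field, preferred, source_field="source"):
    """Value of `field` from the first record of the preferred source."""
    for rec in records:
        if rec[source_field] == preferred:
            return rec[field]
    return ""


# Hypothetical group: the dates and source codes are invented for the sketch.
group = [
    {"first": "William", "unit": "STE 202", "updated": "2008-03-01", "source": "D"},
    {"first": "Billy", "unit": "", "updated": "2009-01-15", "source": "A"},
    {"first": "William", "unit": "#202", "updated": "2007-11-30", "source": "D"},
]
print(most_frequent([r["first"] for r in group]))
print(most_complete([r["unit"] for r in group]))
print(most_recent(group, "first"))
print(from_source(group, "first", "A"))
```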

Target fields
Fields you want to write to the output file
Populated based on meeting the conditions of the survivor rule(s)
Fields not listed as targets are excluded from the output file
May have multiple targets for each rule

Example: complex survive rule


The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.
The prefix b. indicates the current best record
The prefix c. indicates the current record being tested against the survivor rule
FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;

TARGET: FIELD3
CONDITION: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)
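
To make the rule concrete, here is a Python sketch that applies the same condition to one group of matched records. It mimics the rule's logic only; the function names, the choice to keep the first qualifying record, and the empty result when no record qualifies are assumptions of the sketch, not documented Survive stage behavior.

```python
def condition(c):
    """(SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)"""
    return len(c["FIELD3"].strip()) >= 5 and len(c["FIELD1"].strip()) > 0


def survive_field3(group):
    """Return the surviving FIELD3 value for one group of matched records."""
    best = None
    for current in group:
        # Assumption: the first record that meets the condition becomes
        # the best record and is not displaced by later records.
        if best is None and condition(current):
            best = current
    # Assumption: the target stays empty when no record qualifies.
    return best["FIELD3"] if best is not None else ""


group = [
    {"FIELD1": "", "FIELD3": "LONGVALUE"},  # FIELD1 is empty: fails
    {"FIELD1": "X", "FIELD3": "ABC  "},     # FIELD3 too short once trimmed: fails
    {"FIELD1": "Y", "FIELD3": "HELLO"},     # meets both parts of the condition
]
print(survive_field3(group))
```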

Survive Implementation

Survive QualityStage job

Double Click

Survive stage properties

Survive stage properties

Output Column

Technique

Survive stage properties

Complex available

Checkpoint
1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.

Checkpoint solutions
1. (T/F) Survivorship can allow more than one record to survive.
Answer: False

2. (T/F) Survivorship rules deal with the complete record only.
Answer: False

3. Name three survive rules.
Answer: most recent record, longest non-blank, most frequent non-blank

Unit summary
Having completed this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job

Lab 17: Survivorship job


Build survivorship job

Survive job with XREF file

Special Topics

Full Run

Full run single job

1. Double Click

Full run using DataStage job sequencer

QualityStage Migration Tool

QualityStage Migration Tool Overview


The QualityStage Migration Tool (QSMT) provides the ability to migrate QualityStage 7.5 jobs and Standardization Rule Sets to the QS8 environment.
QSMT analyzes the QS 7.5 server project directory to construct dsx files, which can be imported into the QS8 common repository using the import facility of the DS and QS8 Designer.

QualityStage migration tool overview


QSMT functionality offers three types of QS 7.5 object migrations:
    QS 7.5 Standardization Rule Set
    QS 7.5 job in combined mode
    QS 7.5 job in expanded mode

Two modes for migrating jobs to QS8:

Combined Mode
    Use when you need to take a legacy process and just run it in QS8
    Allows control before and after the legacy process
    Will always run after importing without any manual tuning

Expanded Mode
    Use when you need to add QS8 operators within a migrated process
    May require some manual tuning to run

Rule set migration


The QSMT can migrate Standardization Rule Sets in one of two ways:
    Explicitly: you may specify the rule set you want to migrate
    By job dependency: you may migrate all rule sets associated with a particular job

Note: Regardless of the migration mode, all migrated rule sets follow the new naming convention: QS-7.5-Ruleset-Name_QS-7.5-Project-Name

Combined mode migration


Use this mode to get a legacy QS job up and running in QS8 with as little effort as possible.
Jobs will import and run without modifications
After importing, a migrated job will appear in the Jobs folder of the repository view in the QS/DS 8 Designer client
Jobs are renamed by QSMT within the QS8 package to minimize name collisions
The new job name has the following naming convention: QS-7.5-Job-Name_QS-7.5-Project-Name

QSMT combined mode migration


The job consists of a single instance of the QS 8 Legacy Job stage, together with some number of DS Sequential File stages, which are linked to the Legacy Job stage as inputs or outputs

QSMT combined mode migration


All the QS stages run under the control of the single Legacy Job stage in Combined Mode
The list of operations can be seen by opening the Legacy Job stage

File IO to external files is performed by the Information Server Sequential File stages

QSMT combined mode & running a QS8 job


Once imported, Legacy jobs are run the same as any other QS8 job
Prior to compiling, be sure any required rule sets are provisioned to the server
Run as you would any other QS8 job

QSMT expanded mode


Use to re-implement the job in the QS8 environment
After importing, a migrated job will appear in the Jobs folder in the same way as in Combined Mode
The job consists of one or more stages for each 7.5 stage, plus DS PX Sequential File stages, linked to represent the 7.5 job flow
For complex jobs, stages may need to be reorganized to improve readability

QS stage migration reference table


QS 7.5 Stage Type          QS8 Stage Type    Conditions
Abbreviate                 Legacy Job        Always
Build                      Legacy Job        Always
Collapse                   Legacy Job        Always
FFC                        Copy              Delimited text used in 7.5 stage
FFC                        ODBC Enterprise   ODBC used in 7.5 stage
Investigate                Legacy Job        Always
Match*                     Legacy Job        Always
Multinational Standardize  MNS               Always
Parse                      Legacy Job        Always
Select                     Legacy Job        Merge used in 7.5 stage
Select                     Filter            Split, Accept, or Reject used in 7.5

* Currently working on converting Match specifications for GA

QS stage migration reference table


QS 7.5 Stage Type   QS8 Stage Type   Conditions
Sort                Sort             Always
Standardize         Standardize      Always
Survive             Legacy Job       If target columns overlap
Survive             Survive          If target columns do not overlap
Transfer            Legacy Job       Always
Unijoin             Legacy Job       Always
WAVES               WAVES            Always

QSMT expanded mode & running a QS8 job


Prior to compiling, be sure to complete the following:
    Provision any required rules to the server
    Add ODBC connection information to any ODBC read or write stages appearing in the job
    To complete the migration, perform the following for every Standardize, Survive, MNS and WAVES stage that appears on the canvas:
        Open the stage editor for the stage (e.g. by double-clicking it)
        Click OK

Once the above tasks are completed, compile and run as you would any other job

Lab 18: QualityStage Migration Tool


Migrate 7.5 QualityStage jobs to version 8

Globalization

Objectives
After completing this module you will be able to:
Build jobs that read and write Japanese data
Modify client settings to display Japanese data with correct characters

Terminology
Character Set
    An ordered list of characters used for text
    Examples: Latin, Cyrillic, Unicode

Character Encoding
    How each character in a character set is represented as bits
    Examples: UTF-8, UTF-16BE, GB18030 are encodings of Unicode

Codepage
    Microsoft Windows term for Encoding, often used in other contexts too
    Examples:
        1252 is Windows Latin-1, a superset of ISO 8859-1
        932 is the Windows code page for Shift-JIS

Character Sets
Latin
    Italian, Spanish, French, English alphabets

Cyrillic alphabet
    Subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)

ASCII
    Represents 128 characters (7-bit); 8-bit extensions such as ISO 8859-1 represent 256

Unicode
    Code space of 1,114,112 code points; the Basic Multilingual Plane holds 65,536 of them
    Standard for representing the characters of all languages
    Includes Chinese, Japanese, and Korean

Character encoding
Definition
    A system that pairs each character from a character set with something else, such as a number
Two common computer encodings for Unicode
UTF-8
    Variable-length encoding for Unicode
    Encodes each character in one to four bytes

UTF-16
    Variable-length encoding for Unicode (two or four bytes per character)
    Allows either endian representation; the byte order is indicated by a byte order mark (BOM) at the start of the data
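
The byte counts and the byte order mark are easy to check with Python's built-in codecs. This is a small illustration, not part of DataStage; the four sample characters are arbitrary choices.

```python
import codecs

# One character from each UTF-8 length class:
# "A", e-acute (U+00E9), a CJK character (U+6771), an emoji (U+1F600).
samples = ["A", "\u00e9", "\u6771", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-be")) for ch in samples]
print(utf8_lengths)   # [1, 2, 3, 4]
print(utf16_lengths)  # [2, 2, 2, 4]

# The generic "utf-16" codec writes a byte order mark in front of the data;
# which of the two marks appears depends on the machine's byte order.
with_bom = "A".encode("utf-16")
has_bom = with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom, len(with_bom))  # True 4
```

The explicit "utf-16-be" codec fixes the byte order in its name and therefore writes no mark, which is why the second list shows two bytes for the first three characters.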

NLS
NLS: National Language Support
    Globalization + Localization/Translation

NLS map
    What DataStage uses to convert between external and internal encodings
    Internal encoding is UTF-8 for the Server engine, UTF-16 for the Parallel engine

Where DataStage NLS Mapping Happens


Diagram labels: External character set; Information Server Common Design Repository; Map; Server; Client; Scripts, etc.; Logs; Parallel Engine running job, Unicode (UTF-16); Messages, XML (UTF-8); Job Monitor, Unicode (UTF-16); Windows code page; DataStage & QualityStage Runtime Objects, Unicode (UTF-8)

Examples of DataStage NLS Maps


Parallel            Description
IBM367              Standard (US) ASCII 7-bit set
Big5                TAIWAN: "Big 5" standard
IBM1026             IBM EBCDIC variant 1026 (Turkish)
GB2312              CHINESE: EUC as per GB 2312
ISO_8859-1:1987     ISO Standard 8859 part 1: Latin-1
ISO_8859-5:1988     ISO Standard 8859 part 5: Latin-Cyrillic
KS_C_5601-1987      KOREAN: EUC as per KS C 5601
windows-1253        MS Windows codepage 1253 (Greek)
windows-1255        MS Windows codepage 1255 (Hebrew)
IBM865              PC DOS code page 865 (Nordic)
Shift_JIS           JAPANESE: Shift-JIS main map
TIS-620             THAILAND: Industrial Standard 620

Setting a Client/Server Map


Admin Client (whole server)
Associates a server map with the current Windows code page
Diagram labels: Client; Map; DataStage & QualityStage Runtime Objects, Unicode (UTF-8)

Setting Job-Level Maps


Admin Client (per project)
Sets the default map name to use with all Parallel jobs in this project, unless you override it in the job properties dialog

Setting a Stage-Level Map


Diagram labels: External character set; Map; Server; Parallel Engine running job, Unicode (UTF-16)

Various stages have an NLS Map tab: e.g. Sequential File, External Source, External Target, File Set
Define character set mappings (ustring to and from external file)
Applied at stage or individual field level

Setting a Column-Level Map


Char may be "extended" for Unicode
Non-default NLS map (for relevant types)

For Sequential File-type stages


NChar, and Char with extended type, offer a drop-down list of map names in the NLS Map property

Converting string to ustring: manual control


Transformer, Modify, etc.
    Conversion between string and ustring happens automatically, taking the current map from context (job level or stage/operator level)
    Fine control via explicit conversion functions

Conversions may use a specific map name
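
As an analogy only, Python's bytes and str types behave much like string and ustring here: converting between them always involves a named encoding, which plays the role of the map. The sketch below uses Python codec names, not DataStage functions, and the sample text is an arbitrary choice.

```python
text = "\u6771\u4eac"  # two Japanese characters (Tokyo)

# External data: bytes in a specific encoding, here Shift-JIS.
external = text.encode("shift_jis")

# Correct map: the bytes become the original Unicode text again.
internal = external.decode("shift_jis")

# Wrong map: the same bytes read as Windows-1252 give garbled text.
wrong = external.decode("cp1252", errors="replace")

# Writing the text back out with a different map changes the bytes.
as_utf8 = internal.encode("utf-8")

print(internal == text)             # True
print(wrong == text)                # False
print(len(external), len(as_utf8))  # 4 6
```

Choosing the wrong map rarely raises an error; it usually produces plausible-looking but wrong characters, which is why setting the map explicitly matters for non-English data.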

NLS Implementation using Investigate stage


Job Design from Lab

Job-level NLS map

Investigation results for Japanese city column

Input data Client machine with codepage set to JPN

Output report data Client machine with codepage set to JPN

Unit summary
Having completed this unit, you should be able to:
Build a QualityStage investigation job for non-English data
View correctly-formatted results in DataStage/QualityStage data viewer

Lab 19: NLS


Build investigation job for city using Japanese data

Address Verification Interface Stage

Objectives
After completing this module you will be able to:
Build jobs using the AV stage to parse and verify address data

AVI Stage
Provides
    Transliteration (e.g. Japanese to Latin)
    Parsing
    Address validation

WAVES equivalent
Does not provide postal certification discounts
    Use CASS (US), DPID (Australia), or SERP (Canada) if certification is desired

Supports real-time

Components
AV stage
Reference data
    16 geographies
    Purchased via Passport system

API libraries
    AddressDoctor

Reference Data
Required for validation function only
Requires annual license agreement
Location pointed to by AV stage
Some databases are memory intensive
Load options
    Partial preload
        Indexes loaded to memory

    Full preload
        Data loaded to memory
        Fast access but must have adequate memory

    No preload
        Data accessed from disk, slowest method

Job components

Input address data

AVI stage

Optional error file

Stage properties

Function

Reference data location

Navigation

Mapping input columns to address elements

Transliterate mode
Map input columns to address elements
Multiple input columns can be mapped to one address element

Options offer the choice to increase the number of address lines

Map columns to output link

Parsing mode
Input sample

Output sample

Validation mode
Uses reference data from a database
Map input columns to address elements
Can activate error link
Creates validation summary report
Sample output (only two of the validation columns shown)

Validation mode statuses


Part of output record
Document actions taken by AV stage
Short code
Verbose code
Example

Validation summary report sample (USPREP)


Validation Summary Report
Company Name:
List Identifier:
Processing Date (yyyy/mm/dd): 2009/02/25
Total Number Of Records Processed: 2843
Passed: 2843 100.00%
Failed: 0 0.00%
Validated: 2233 78.54%
Corrected: 415 14.60%
Has Suggestion: 195 6.86%
PostCode Failed: 70 2.46%
City Failed: 37 1.30%
Street Failed: 274 9.64%
Country Failed: 0 0.00%
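
The percentages in the report are each count divided by the total number of records processed. A short Python check using the counts from this sample report is shown below; the variable and function names are arbitrary, and the code is not part of the AV stage.

```python
TOTAL = 2843

COUNTS = {
    "Passed": 2843,
    "Failed": 0,
    "Validated": 2233,
    "Corrected": 415,
    "Has Suggestion": 195,
    "PostCode Failed": 70,
    "City Failed": 37,
    "Street Failed": 274,
    "Country Failed": 0,
}


def percent(count, total):
    """Format count as a percentage of total with two decimals."""
    return f"{100 * count / total:.2f}%"


report = {label: percent(n, TOTAL) for label, n in COUNTS.items()}
for label, value in report.items():
    print(f"{label}: {COUNTS[label]} {value}")

# In this sample, the three outcome counts add up to the total.
outcomes = COUNTS["Validated"] + COUNTS["Corrected"] + COUNTS["Has Suggestion"]
print(outcomes == TOTAL)
```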

Unit summary
Having completed this unit, you should be able to:
Build jobs using the AV stage to parse and verify address data

Lab 20: AV Stage


Build AV job to parse Japanese address data

Review prebuilt job that validated USPREP data from earlier lab
