Copyright IBM Corporation 2009 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an as-is basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.
Course Contents
- Data Quality Issues
- QualityStage 8 Architecture
- Developing with QualityStage
- Investigation
- Standardize
- Match
- Survivorship
- Special Topics
- Globalization (NLS)
- Address Verification Stage
Course contents
- Data quality issues
- Information Server purpose and architecture
- Introduction to DataStage and QualityStage
- Investigation
- Standardization
- Match
- Survivorship
- Special Topics: data quality methodology, QualityStage Migration Tool
Unit objectives
After completing this unit, you should be able to:
List the five common data quality contaminants
Describe each of the following processes:
- Investigation
- Standardization
- Match
- Survivorship
Figure: consolidating three sources. Source 1, Source 2, and Source 3 are aligned, harmonized, and consolidated; each source records the same locations (BOSTON, HARTFORD, CHICAGO) under a different standard, such as the codes MA93, CT15, IL21 in one source versus 6793, 0215, 8721 in another.
Buried information

Legacy record values with information buried in free-form name fields:
- Denise Mario DBA Marc Di Lorenzo ETAL
- Tom & Mary Roberts
- First Natl Provident
- Astorial Fedrl Savings
- Kevin Cooke, Receiver
- Robert A. Jones TTE Robert Jones Jr.
- First Natl Provident FBO Elaine & Michael Lincoln UTA DTD 3-30-89
- 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345
No common key

CUSNUM    NAME                   ADDRESS                          SALES $
90328574  IBM                    187 N.Pk. Str. Salem NH 01456      8,494.00
90328575  I.B.M. Inc.            187 N.Pk. St. Sarem NH 01456       3,432.00
90238495  International Bus. M.  187 No. Park St Salem NH 04156     2,243.00
90233479  Int. Bus. Machines     187 Park Ave Salem NH 04156        5,900.00
90233489  Inter-Nation Consults  15 Main St. Andover MA 02341       6,800.00
90234889  Int. Bus. Consultants  PO Box 9 Boston MA 02210          10,243.00
90345672  I.B. Manufacturing     Park Blvd. Boston MA 04106        15,999.00
Anomalies

Five records for one household show name anomalies and field spillover:
Names: Peter J. Lalonde; LaLonde, Peter; Lalonde, Sofie; Pete & Soph Lalond; P. Lalonde FBO
City/Zip spillover: City values include "Melrose,", "MA", and "MASS"; Zip values include "02176", "MA 02", and "176" (the city, state, and ZIP spill across the City and Zip fields).
Why investigate?
Discover potential anomalies in the data
Examine single-domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules
Verify the reliability of the data in the fields to be used as matching criteria
Gain a complete understanding of the data
Figure: frequency distribution report showing, for each value, its frequency count and percentage of the total.
What is standardize?
Applying business logic to data chaos.
Pattern manipulation
How to standardize
Parse specific data fields into smaller, atomic data elements; atomic data elements are called tokens
Categorize the identified elements
Examples: separate Name, Address, and Area from free-form name and address lines; identify distinct material categories (e.g., sutures vs. orthopedic equipment)
Example 2
Part Description = BLK LATEX GLOVE becomes:
- Color = BLACK
- Type = LATEX
- Part = GLOVE
Why standardize?
Normalize values in data fields to standard values:
- First Name: MIKE -> MICHAEL
- Title: Doctor -> Dr
- Address: ST. Michael Street -> Saint Michael St.
- Color: BLK -> BLACK
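The two examples above amount to parse-then-normalize: split a field into tokens, then replace each token with its standard value. A minimal Python sketch of that idea, assuming a toy lookup table (the entries below are illustrative, not QualityStage's shipped classification or standardization tables):

```python
# Illustrative only: a toy standardizer that parses a free-form
# description into tokens and normalizes each token via a lookup table.

STANDARD_VALUES = {          # hypothetical standardization entries
    "BLK": "BLACK",
    "MIKE": "MICHAEL",
    "DOCTOR": "DR",
}

def standardize(text: str) -> list[str]:
    tokens = text.upper().split()          # parse into atomic tokens
    return [STANDARD_VALUES.get(t, t) for t in tokens]

print(standardize("BLK LATEX GLOVE"))      # ['BLACK', 'LATEX', 'GLOVE']
```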
QualityStage standardize
Uses a highly flexible pattern-recognition language
Can employ field- or domain-specific standardization (i.e., unique rules for names vs. addresses vs. dates)
Contains customizable classification and standardization tables
Utilizes results from data investigation
Match
Conditioned data and QualityStage's matching engine link the previously unlinkable.
Match construction: the reliability of the input data defines a match result.
Report generation: all applied business rules have an easy-to-understand report structure.
What is match?
Identifying all records on one file that correspond to similar records on another file
Identifying duplicate records in one file
Building relationships between records in multiple files
Performing statistical and probabilistic matching
Calculating a score based on the probability of a match
Why match?
Identify duplicate entities within one or more files
Perform householding
Create a consolidated view of the customer
Establish cross-reference linkage
How to match
Single-file (unduplication) or two-file (reference) matching
Different match comparisons for different types of data (e.g., exact character, uncertainty/fuzzy match, keystroke errors, multiple-word comparison)
Generation of composite weights from multiple fields
Use of probabilistic or statistical algorithms
Application of match cutoffs (thresholds) to identify automatic and clerical match levels
Incorporation of override weights to assess particular data conditions (e.g., default values, discriminatory elements)
QualityStage match
A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
Statistically based method for determining matches (probabilistic record linkage theory)
Field-by-field comparisons for agreement or disagreement
Assignment of weights or penalties
Overrides for unique data conditions
Scored results to determine the probability of matched records
Thresholds for final match determination
Ability to measure the informational content of data
What is survive?
Creation of best-of-breed surviving data based on record- or field-level information
Development of a cross-reference file of related keys
Creating output formats:
- Relational table with primary and foreign keys
- Transactions to update databases
- Cross-reference files
Why survive?
Provide a consolidated view of the data
Provide a consolidated view containing the best-of-breed data
Resolve conflicting values and fill missing values
Cross-populate the best available data
Implement business rules
Create cross-reference keys
How to survive
Highly flexible rules
Record- or field-level survivorship decisions
Rules can be based on data frequency, data recency (i.e., date), data source, and value presence or length
Rules can incorporate multiple tests
QualityStage features:
- Point-and-click (GUI-based) creation of business rules to determine best-of-breed surviving data
- Performed at record or field level
Example 1: survive the longest populated Middle and Last Name.
Matched records: (MARI, -, LEMELSON) and (MARI, S, LAPPNER)
Survived record: First = MARI, Middle = S, Last = LEMELSON

Example 2: survive the longest populated Middle Name, Date of Birth, and SSN.
Matched records: (DENISE, -, TRIANO) and (DENISE, F, TRIANO), with DOB 19580211 and SSN 98524173
Survived record: First = DENISE, Middle = F, Last = TRIANO, DOB = 19580211, SSN = 98524173
Figure: data quality methodology. Standardize Country, select US data for further processing (policy), investigate to assess the data quality condition, standardize Name, Address, and Area, then identify duplicate customer records.
Checkpoint
1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.
Checkpoint solutions
1. (T/F) Data quality investigation cleans the source data.
   Answer: False
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
   Answer: False
3. (T/F) Survivorship data can be either record based or field based.
   Answer: True
Unit summary
Having completed this unit, you should be able to:
List the five common data quality contaminants:
- Different standards
- Missing and default values
- Spillover and buried information
- Anomalies
- No consolidated view
Describe each of the following processes: Investigation, Standardization, Match, Survivorship
QualityStage 8 Architecture
Unit objectives
After completing this unit, you should be able to:
Describe the Data Quality architecture
Identify server and client components
Figure: Information Server architecture. The services tier (metadata access services, client logon access, logging, security, and other services) connects the Windows-based DataStage clients to the engine tier; the engine and its projects can run on UNIX or Windows.
DataStage clients

Administrator:
- Add and delete projects
- Set project defaults
- Set project environment parameters

Designer:
- Maintain data definitions
- Add, modify, and delete jobs
- Add, modify, and delete match specifications
- Manage rule sets
- Compile jobs
- Run jobs
- Provision rule sets and match specifications

Director:
- Run jobs
- Review the job log
- Schedule jobs
DataStage Administrator
Administrator:
- Create or delete projects
- Set project defaults
- Apply security
(Screenshot: project list)
DataStage Designer
Designer
Client GUI for designing jobs (Windows 2000+, XP)
Build metadata
Build jobs
Modify standardization rules
Build match specifications
(Screenshot: Designer repository database)
DataStage Director
Director
Client GUI for managing job execution (Windows 2000+, XP)
Run jobs; set job options and parameters
View the job log
Schedule job execution
Checkpoint
1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?
Checkpoint solutions
1. (T/F) DataStage Administrator executes jobs.
   Answer: False
2. (T/F) DataStage Designer configures projects.
   Answer: False
3. Which DataStage component displays objects in the designer database?
   Answer: Designer; its repository view shows project components
Unit summary
Having completed this unit, you should be able to:
Describe the Data Quality architecture Identify server and client components
Unit objectives
After completing this unit, you should be able to:
Import metadata
Build DataStage/QualityStage jobs
Run jobs
Review results
DataStage/QualityStage project
Components
- Jobs
- Stages within jobs
- Table definitions
The Designer repository view shows project components.
Job definition
A job is an executable DataStage/QualityStage program, created by job compilation
Jobs can be run in batch or in real time
The server runs the job; use the job log to view its execution
Stages
Data definitions:
- Entered or loaded via DataStage import mechanisms: sequential file, ODBC, native database connection
- New and redefined columns can be added on the data flow via the Transformer stage (e.g., to append ISO country codes)
- Rule sets are stored in the repository and provisioned to the job execution area
Execution environment
Checkpoint
1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?
Checkpoint solutions
1. (T/F) The job monitor displays link statistics.
   Answer: True
2. (T/F) The job log is viewed in DataStage Designer.
   Answer: False; the job log is viewed in DataStage Director
3. What protocol is used for communication between the DataStage clients and server?
   Answer: TCP/IP
Unit summary
Having completed this unit, you should be able to:
Import metadata
Build DataStage/QualityStage jobs
Run jobs
Review results
Investigation
Unit objectives
After completing this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results
Investigation
Verify the domain:
- Review each field of interest and verify that the data matches the metadata
- Identify data formats, and missing and default values
- Identify data anomalies in format, structure, and content
Investigate stage
Features:
- Analyze free-form and single-domain fields
- Provide frequency distributions of distinct values and patterns
Investigate methods:
- Character Discrete
- Character Concatenate
- Word
Investigate methods

Method                 Why
Character Discrete     Analyzing field values, formats, and domains
Character Concatenate  Cross-field correlation, checking logical relationships between fields
Word                   Identifying free-form fields that may require parsing, and discovering key words for classification
Investigate terminology

Field masks are options that represent the data, applied per character position:

Mask         Usage
C (Character) View the actual character values of the data
T (Type)      View the pattern of the data
X (Skipped)   Ignore characters

Token examples:

Token           Mask            Result
02116           CCCCC           02116
02116           CCCXX           021
01832-4480      TTTTTTTTTT      nnnnn-nnnn
XJ2 6EM         TTTTTTT         aanbnaa
(617) 338-0300  CCCCCCCCCCCCCC  (617) 338-0300
617-338-0300    TTTTTTTTTTTT    nnn-nnn-nnnn
6173380300      CCCXXXXXXXXX    617
(617)3380300    CCCXXXXXXXXX    (61

A sketch of mask application follows.
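A small Python sketch of how the C/T/X masks in the table above could be applied; the `apply_mask` function and its type-pattern rules (n = numeric, a = alpha, b = blank) are assumptions for illustration, not QualityStage internals:

```python
# Illustrative mask application: C keeps the character, T replaces it
# with its type (n = numeric, a = alpha, b = blank), X drops it.

def char_type(ch: str) -> str:
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "a"
    if ch == " ":
        return "b"
    return ch                          # punctuation passes through

def apply_mask(token: str, mask: str) -> str:
    out = []
    for ch, m in zip(token, mask):
        if m == "C":
            out.append(ch)             # keep the actual character
        elif m == "T":
            out.append(char_type(ch))  # keep only the pattern
        # 'X' skips the character entirely
    return "".join(out)

print(apply_mask("01832-4480", "TTTTTTTTTT"))  # nnnnn-nnnn
print(apply_mask("XJ2 6EM", "TTTTTTT"))        # aanbnaa
print(apply_mask("02116", "CCCXX"))            # 021
```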
Investigation implementation (screenshots): double-click the Investigate stage; on the Investigation - Character tabs, select a column, add it, and select the mask.
Character concatenate
Identify field relationships:
- Investigate one or more fields to uncover any relationship between the field values
- Uses combinations of character masks
- Generates frequency reports
Word investigate
Usage: free-form field pattern analysis; view the pattern of the data within a free-form text field and parse it into individual tokens.
QualityStage process:
- Apply rule sets to free-form fields
- Discover parsing requirements
- Discover patterns in data
- Generate reports for pattern frequency distributions and a token report
How to use the pattern report: look at the most frequently occurring patterns; use them to estimate how much work is needed to modify a rule set for a customer.
How to use the token report: review tokens with a subject-matter expert (SME) to verify that tokens are properly classified; identify the most frequently occurring unclassified tokens and add them to the rule set.
Rule sets
Rules for parsing, classifying, and organizing data.
Rule set domains:
- Country processing
- Pre-processing
- Domain processing: Name (business and personal), Street Address, and Area (locality, city, state, and ZIP/postal codes)
Parsing
Parse free-form data with a SEPLIST and a STRIPLIST:
- SEPLIST: any character in the SEPLIST separates tokens and becomes a token itself
- STRIPLIST: any character in the STRIPLIST is ignored in the resulting pattern
Parsing example

Input string: 120 Main St. NW

SEPLIST = space and period, STRIPLIST = empty: 8 tokens (120, space, Main, space, St, period, space, NW)
SEPLIST = space, STRIPLIST = space and period: 4 tokens (120, Main, St, NW)
SEPLIST = space and period, STRIPLIST = space: 5 tokens (120, Main, St, period, NW)

Classification example: the input 120 Main Street Apt 6C parses and classifies token by token as 120 -> ^, Main -> ?, Street -> T, Apt -> U, 6C -> >.

A tokenizer sketch follows.
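A minimal Python sketch of SEPLIST/STRIPLIST tokenization as described above; the function name and edge-case behavior are assumptions for illustration:

```python
# Illustrative SEPLIST/STRIPLIST tokenizer: a character in seplist ends
# the current token and becomes a token itself; characters in striplist
# are then dropped from the result.

def parse(text: str, seplist: str, striplist: str) -> list[str]:
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)  # close the token in progress
                current = ""
            tokens.append(ch)           # the separator is itself a token
        else:
            current += ch
    if current:
        tokens.append(current)
    # Characters in the STRIPLIST are ignored in the resulting pattern.
    cleaned = ["".join(c for c in t if c not in striplist) for t in tokens]
    return [t for t in cleaned if t]

print(parse("120 Main St. NW", " .", ""))    # 8 tokens, incl. spaces and '.'
print(parse("120 Main St. NW", " ", " ."))   # ['120', 'Main', 'St', 'NW']
print(parse("120 Main St. NW", " .", " "))   # ['120', 'Main', 'St', '.', 'NW']
```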
Token report and pattern report: the token report lists individual tokens with their classes (e.g., 10 -> ^, MAPLE -> ?); the pattern report lists the frequency of each parsed pattern (e.g., ^?).
(Screenshots: Investigation - Word configuration; link ordering)
Checkpoint
1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.
Checkpoint solutions
1. (T/F) Character discrete investigation examines a single domain.
   Answer: True
2. (T/F) Word investigation examines a single domain.
   Answer: False; word investigation examines free-form fields
3. Name the three character masks.
   Answer: C (character), T (type), and X (skipped)
Unit summary
Having completed this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results
Standardize
Unit objectives
After completing this unit, you should be able to:
Describe the Standardize stage
Identify rule sets
Build jobs using the Standardize stage
Interpret standardization results
Investigate unhandled data and patterns
Standardize
Transformation:
- Parsing free-form fields
- Comparison threshold for classifying like words
- Bucketing data tokens
Standardization:
- Applying standard values and standard formats
Standardize example
Input file (Address Line 1, Address Line 2):
1721 W ELFINDALE ST
1721 W ELFINDALE ST # 20
16200 VENTURA BOULEVARD, C/O JOSEPH C REIFF
1705 W St
1655 PONCE DE LEON AVENUE

Result file: each line is parsed into House #, Direction, Street Name, Street Type, Unit Type, Unit Value, Floor Type, and Floor Value. For example, 1721 W ELFINDALE ST # 20 yields House # = 1721, Dir = W, Street Name = ELFINDALE, Type = ST, Unit Type = UNIT, Unit Value = 20.
Standardize process

Running example: the input 10 MAPLE ST APT 222 is parsed into tokens and classified using the key below, producing the pattern ^?TU^; the tokens are then bucketed into output fields: House Number = 10, Street Name = MAPLE, Street Type = ST, Unit Type = APT, Unit = 222.

Key: ^ = single numeric; ? = one or more unknown alphas; T = street type; U = unit type
Standardize stage
Uses rule sets for:
- Country processing
- Pre-domain processing (USPREP)
- Domain processing (USADDR, USAREA, USNAME)
Standardize Country example (input record -> ISO code; flag Y = explicit decision, N = default):
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111 -> US (Y)
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0 -> CA (Y)
28 GROSVENOR STREET LONDON W1X 9FE -> GB (Y)
123 MAIN STREET -> US (N)

Mixed-domain example. Input record: Field 1 = JIM HARRIS (781) 322-2426, Field 2 = 92 DEVIR STREET MALDEN MA 02148. Output record: Name domain = JIM HARRIS; Address domain = 92 DEVIR STREET; Area domain = MALDEN MA 02148; Other domain = (781) 322-2426.

USADDR output example for 100 SUMMER STREET 15TH FLOOR: House Number = 100; Street Name = SUMMER; Street Suffix Type = ST; Floor Type = FL; Floor Value = 15; Address Type = S; NYSIIS of Street Name = SANAR; Reverse Soundex of Street Name = R520; Input Pattern = ^+T>U.
Rule sets
Rule sets contain logic for:
- Parsing
- Classifying
- Processing data by pattern and bucketing data
Optional files:
- Lookup tables
- Override tables
Classification table
Contains the words used for classification, the standardized versions of those words, and a data class
A data class (data tag) is assigned to each data token
Default classes are the same across all rule sets
User-defined classes are assigned in the classification table; users may modify, add, or delete these classes; user-defined classes are a single letter
Default classes
Class     Description
^         A single numeric
+         A single unclassified alpha (word)
?         One or more consecutive unclassified alphas
@         Complex mixed token, e.g., C3PO
>         Leading numeric, e.g., 6A
<         Trailing numeric, e.g., A6
0 (zero)  Null class
User-defined classes
Rule set  Class  Description
USNAME    G      Generational, e.g., Senior, I, II
USNAME    P      Prefix, e.g., Dr., Mr., Miss
USADDR    T      Street Type
USADDR    D      Directional
USADDR    B      Box Type
USAREA    S      State Abbreviation
; USADDR Classification Table
;------------------------------------------------------------------
; Classification Legend
;   B - Box Types
;   D - Directionals
;   F - Floor Types
;   H - Highway Modifiers
;   R - Rural Route, Highway Contract, Star Route
;   T - Street Types
;   U - Unit Types
;------------------------------------------------------------------
; Table sort order: 51-51 ascending, 26-50 ascending, 1-25 ascending
;------------------------------------------------------------------
; Token      Standard form  Class
DRAW         "PO BOX"       B
DRAWER       "PO BOX"       B
PO           "PO BOX"       B
POB          "PO BOX"       B
POBOX        "PO BOX"       B
POBX         "PO BOX"       B
PODRAWER     "PO BOX"       B
Comparison threshold
May be used in the classification table
Used to efficiently make entries in the classification table
Helps overcome spelling and data-entry errors
Not required
The threshold uses a logical string comparator

Threshold level  Meaning
900              Exact match
850              Almost certainly the same
800              Most likely equivalent
750              Most likely not the same
700              Almost certainly not the same

A comparator sketch follows.
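A hedged illustration of a string comparator scaled to the 700-900 range above. QualityStage's actual comparator algorithm is not documented here, so this sketch substitutes Python's standard-library difflib as a stand-in:

```python
# Illustrative only: score two words on a 0-900 scale, mimicking the
# comparison-threshold idea (900 = exact match). The real QualityStage
# comparator may differ; difflib is a stand-in.
from difflib import SequenceMatcher

def similarity_score(a: str, b: str) -> int:
    ratio = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return round(ratio * 900)

print(similarity_score("CAROLYNNE", "CAROLYNNE"))  # 900: exact match
print(similarity_score("CAROLYNNE", "CAROLYNN"))   # high: likely the same word
print(similarity_score("MAPLE", "MAIN"))           # low: different words
```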
Dictionary file
Defines the field definitions for the output file
Moving data into these output fields is called "bucketing" the data
The order in which fields are listed in the dictionary file defines the order in which they appear in the output file
Dictionary file entries are similar to field definitions
Pattern-Action file
Contains the rules for standardization, that is, the actions to execute for a given pattern of tokens
Records are processed from the top down
Written in Pattern-Action Language (PAL)
Complex parsing can be coded in this file
A sketch of the pattern-action idea follows.
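A toy Python sketch of the pattern-action idea: match the classified token pattern top-down, then bucket each token into a dictionary field. The rule table and field names below are illustrative, not actual PAL syntax:

```python
# Illustrative pattern-action processing: the classified pattern of the
# input tokens is looked up top-down; the first matching rule buckets
# each token into a dictionary (output) field.

RULES = [  # (pattern, output field per token) -- hypothetical rules
    ("^?TU^", ["HouseNumber", "StreetName", "StreetType",
               "UnitType", "UnitValue"]),
    ("^?T",   ["HouseNumber", "StreetName", "StreetType"]),
]

def standardize(tokens: list[str], pattern: str) -> dict[str, str]:
    for rule_pattern, fields in RULES:       # processed from the top down
        if pattern == rule_pattern:
            return dict(zip(fields, tokens))
    return {"UnhandledData": " ".join(tokens)}  # data exception

# "10 MAPLE ST APT 222" classifies as ^?TU^ (see the running example)
print(standardize(["10", "MAPLE", "ST", "APT", "222"], "^?TU^"))
```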
Standardize process (continued)

Figure: pattern-action processing buckets each token of the running example into its dictionary field: 10 -> {HN} (House Number), MAPLE -> {SN} (Street Name), ST -> {ST} (Street Type), plus Unit Type = APT and Unit = 222. The dictionary file defines these output fields.
Standardize Country
Output:
Two-byte ISO country code
Flag identifying an explicit or default decision
Standardization implementation (screenshots): standardization jobs include Standardize Country followed by domain standardization.
Selecting US data
The DataStage Filter stage provides the capability of selecting and/or rejecting records based on a set of values for a field. Selecting or splitting data that requires compound or complex logic may call for the Transformer stage.
Standardize Country
The pre-processor rule set prepares the data for processing by domain-specific rule sets.
Standardize results
Business intelligence fields: parsed from the original data; they may be used in matching, and generally they are moved to the target system.
Matching fields: generally created to help during the match process and dropped after successful matching.
Reporting fields: specifically created to help review the results of Standardize and recognize handled and unhandled data.
USADDR examples: House Number, Directional, Street Name, Unit Types, Box Types, Unit Values, Building Names
USAREA examples
Matching-field examples:
- Hash keys: the first two characters of the first five words
- Packed keys: data concatenated, or "packed"
Reporting fields:
- Unhandled Data: the remaining tokens not processed by the selected rule set
- Input Pattern: the pattern generated for the stream of input tokens, based on the parsing rules and token classifications
- Exception Data: the tokens not processed by the rule set because they represent a data exception
- User Override Flag: indicates what kind of user overrides were applied to the record
Standard practice: investigate handled and unhandled data.
- Review the business intelligence fields to ensure accurate bucketing of the data: build a Character Discrete investigation for each field and review the contents and the format
- Build an investigation to review: Unhandled Patterns, Unhandled Data, Input Pattern, Input Fields
User overrides
Provide the ability to modify rule sets.
The following types of rule sets can be modified using user overrides:
- Domain pre-processor rule sets
- Domain rule sets
There are five types of user overrides, relating to classifications, patterns, and text strings. User overrides are GUI driven.
Classification override example
Original data: HOCHREITER , CAROLYNNE
Override: add CAROLYNNE as a valid first name (class F) to the classification table.
Corrected input pattern: +,F
Text overrides
Allow the user to specify overrides based on an entire text string. Use this override for special cases and specific handling of a string of text. Input text overrides are applied to the original text string.
Example: the input text REIFF FUNERAL has the unhandled pattern ++; the override moves the text string to the Primary Name field.
Pattern overrides
Allow the user to specify overrides based on an entire pattern. Use this override when most or all records should be processed with identical logic. Input pattern overrides are applied to the original text string.
Example: the input texts HAYWARD, WINSLOW; ESHAGHIAN , JOUBI; and BOULDER, CORONA all produce the unhandled pattern +, +, where the comma provides context. The override moves the + before the comma to Primary Name and the + after the comma to First Name.
Override scopes: modify logic based on the input pattern, the unhandled data string, or the unhandled pattern.
Lab: identify the unhandled patterns for the Address and Area fields. In the report, include the unhandled data, input pattern, original data, and the record key.
1. Build a Character Concatenate investigation using the following fields:
   Field              Mask type
   Unhandled Pattern  C
   Unhandled Data     X
   Input Pattern      X
   Address Domain     X
2. Increase the number of samples to 5.
Overrides
Purpose: correct problems found during standardization.
Override types:
- Classification
- Input pattern
- Input text
- Unhandled pattern
- Unhandled text
(Screenshots: Standardize Country; overrides screen)
Checkpoint
1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.
Checkpoint solutions
1. (T/F) WAVES can standardize name fields.
   Answer: False
2. (T/F) Rule sets are used in standardization processing.
   Answer: True
3. Name the components of rule sets.
   Answer: the classification table, the dictionary file, and the pattern-action file (plus optional lookup and override tables)
Unit summary
Having completed this unit, you should be able to:
Describe the Standardize stage
Identify rule sets
Build jobs using the Standardize stage
Interpret standardization results
Investigate unhandled data and patterns
Standardize Country: this rule set is found in the Other folder and adds the ISO country code to records.
Match
Unit objectives
After completing this unit, you should be able to:
Build a QualityStage job to identify matching records
Apply multiple match passes to increase efficiency and efficacy
Interpret and improve match results
Match stage
Statistically based method for determining matches
25 match comparison algorithms providing a full spectrum of fuzzy matching functions
Ability to measure the informational content of data
Identify duplicate entities within one or more files
Match specification built with the Match Designer
Critical field questions:
- Do you compare all the shared or common fields?
- Do you give partial credit?
- Are some fields (or some values) more important to you than others? Why?
- Do more fields increase your confidence? By how much? What is enough?
Information content also measures the significance of one value in a field over another (frequency). In a first-name field, JOHN contributes less information than DWEZEL. Significance is determined by a value's reliability and its ability to discriminate; both can be calculated from your data.
Distribution of weights
Figure: comparing the pair WILLIAM J HOLDEN, 128 MAIN ST, 02111, 12/8/62 against WILLAIM JOHN HOLDEN, 128 MAINE AVE, 02110, 12/8/62, the field comparisons contribute weights of +1, +1, +17, +2, +4, -1, +7, and +9 for a composite weight of 40. A histogram of composite weights (number of pairs vs. weight) shows non-matches clustered at low weights (less confidence) and matches at high weights (more confidence), with a grey area between them.
Weights
Measures the information content of a data value
Each field contributes to the confidence (probability) of a match
Types of weights:
- If a field matches, the agreement weight is used; the agreement weight is a positive value
- Partial weight is assigned for non-exact or fuzzy matches
- Missing values have a default weight of zero
- Weights for all field comparisons are summed to form a composite weight
Matching terminology
Term                   Definition
Informational Content  Measures the significance of one field value over another
Weight                 Measures the informational content of a data value
Composite Weight       Measures the confidence of a match
Match Cutoffs          Distinguish matches from non-matches
False Positives        Records with a score above the high cutoff that really aren't a match
False Negatives        Records below the low cutoff that really are a match
Reliability (m-probability)
Approximated as 1 - (error rate) for the given field. The higher the m-probability, the higher the disagreement weight (penalty) for the field not matching, since the data is considered reliable. A sketch of the weight formulas follows.
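The course names probabilistic record linkage theory (Fellegi-Sunter); in that framework, agreement and disagreement weights derive from the m-probability (reliability) and the u-probability (chance agreement). A short Python sketch of those standard formulas; the field probabilities below are made-up values:

```python
# Standard Fellegi-Sunter weights: agreement = log2(m/u),
# disagreement = log2((1-m)/(1-u)), where m = P(fields agree | match)
# and u = P(fields agree | non-match). Probabilities are illustrative.
from math import log2

def weights(m: float, u: float) -> tuple[float, float]:
    return log2(m / u), log2((1 - m) / (1 - u))

# A reliable field (high m) earns a large penalty when it disagrees.
agree, disagree = weights(m=0.95, u=0.01)
print(f"last name: agreement {agree:+.1f}, disagreement {disagree:+.1f}")

# Composite weight: sum each compared field's agreement or disagreement.
composite = sum([agree, weights(0.9, 0.1)[0], weights(0.8, 0.3)[1]])
print(f"composite weight: {composite:+.1f}")
```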
Blocking
Grouping together like records that have a high probability of producing matches
Only like records are compared to each other, making the match more efficient and computationally feasible
Records in a block match exactly on one to several blocking fields
Example records:

NAME                ADDRESS             ZIP
YUNG , WAYNE D      9000 SHEPARD DRIVE  78753
YOUNG , JONATHAN A  4220 BELLE PARK DR  77072
YOUNG THERESA C     1767 TOBEY ROAD     30341

Blocking on ZIP 02111 groups candidate records for comparison: GEROSA, FRAN X; GEROSA, FRANCIS XAVIER; GEROSA, MARY; GARISA, FRANCIS; MARCUS MATIC; JANCAN; RENEE JENKINS.
Blocking strategy
Choose fields with reliable data
Choose fields with a good distribution of values
Combinations of fields may be used
Blocking summary
Blocking groups together like records
Matching is more efficient for small block sizes: blocks should have fewer than 1,000 records (a guideline, not a hard-and-fast rule)
Blocking fields must match exactly for a candidate set to be created and evaluated
Beware of block overflow: the run can exhaust computational resources, comparisons are not completed, and every record in the block becomes an automatic residual
A sketch of blocking follows.
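A small Python sketch of blocking as described above: group candidates by exact blocking-key values so only records within a block are compared. The record layout and key choice are illustrative:

```python
# Illustrative blocking: records are grouped by the exact value of the
# blocking fields (here ZIP + first letter of name); only records that
# share a block are compared, shrinking the comparison space.
from collections import defaultdict
from itertools import combinations

records = [  # made-up sample data
    {"name": "GEROSA, FRAN X",         "zip": "02111"},
    {"name": "GEROSA, FRANCIS XAVIER", "zip": "02111"},
    {"name": "YOUNG, JONATHAN A",      "zip": "77072"},
]

blocks = defaultdict(list)
for rec in records:
    key = (rec["zip"], rec["name"][0])      # blocking fields match exactly
    blocks[key].append(rec)

for key, members in blocks.items():
    for a, b in combinations(members, 2):   # compare within block only
        print(f"block {key}: compare {a['name']!r} vs {b['name']!r}")
```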
Match types
Unduplication: identifies duplicate candidates in one file
Reference match: two files, with many-to-one correspondence, where more than one record on the stream link can match the same record on the reference link
Match Implementation
1. Run a pass
2. Review the results
3. Tune the match: add cutoffs, set overrides, add more passes
4. Repeat until the match results are acceptable
Match Designer
Used to build a match specification that is then referenced in a match job.
Features:
- Design control center
- Data-centric
- Graphical representation of statistics
- Independent of job design
- Iterative development
Screens: map fields, data viewer, decision rules, cutoff tuning, blocking, match commands.
Setup questions: Where is the standardized data? Where is the frequency report? Which ODBC-accessed database will store test results?
Screenshots: set up the test results area; select the column (e.g., Business Name); set the compare type and data column; assign frequencies; select parameters; click Apply or OK.
Grouping options for results:
- Match Sets: see all matches and duplicates together
- Match Pairs + Sort: see the master record repeated
Statistics tab (screenshots): shows how each change to the match specification is reflected in the pass statistics.
Critical fields
Used to identify fields that must agree in order for records to be linked:
- Critical: field values must agree exactly or the records cannot be linked (considered a match)
- Critical, Missing OK: field values must agree exactly on values not considered missing
Weight overrides
Allow you to adjust the agreement and/or disagreement weights for specific situations:
- Add to the calculated weight
- Replace the weight
Set on the Match Commands screen.
Cutoffs
There are two cutoffs:
- Match cutoff (high cutoff)
- Clerical cutoff (low cutoff)
Records with a weight equal to or above the match cutoff are considered matches. Records with a weight below the clerical cutoff are not matches. Records with a weight greater than or equal to the clerical cutoff and less than the match cutoff are clerical records, held for manual review. The cutoffs can be set to the same value, eliminating clerical records. A sketch follows.
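A minimal Python sketch of cutoff classification; the composite weights and threshold values are illustrative:

```python
# Illustrative cutoff logic: classify a record pair by its composite
# weight against the clerical (low) and match (high) cutoffs.

MATCH_CUTOFF = 30.0      # high cutoff (illustrative value)
CLERICAL_CUTOFF = 15.0   # low cutoff (illustrative value)

def classify(composite_weight: float) -> str:
    if composite_weight >= MATCH_CUTOFF:
        return "match"
    if composite_weight >= CLERICAL_CUTOFF:
        return "clerical"            # held for manual review
    return "non-match"

for w in (40.0, 20.0, 5.0):
    print(w, "->", classify(w))      # match, clerical, non-match
```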
Figure: multi-pass matching. Pass 1, blocked on street name, matched records at 1350 WALTON AVE and 2047 PRINCE (30604). Pass 2 found additional matched records (including one addressed P.O. BOX 123) in which the street name was different but the names were the same.
Unduplication implementation (screenshots): double-click the stage to configure it.
Checkpoint
1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?
Checkpoint solutions
1. (T/F) Match specifications are created using Designer.
   Answer: True
2. (T/F) An unduplicate match can be used against two files.
   Answer: False; unduplication works within one file
3. Which match specification component determines the extent of the clerical review records?
   Answer: the cutoff values
Unit summary
Having completed this unit, you should be able to:
Build a QualityStage job to identify matching records
Apply multiple match passes to increase efficiency and efficacy
Interpret and improve match results
Survive
Unit objectives
After completing this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job
Survive stage
Point-and-click creation of business rules to determine surviving data: the user decides how to survive data
Performed at record or field level: very flexible
Creates a single, consolidated record containing the best-of-breed data
Provides a consolidated view of the data
Survive example

Survive input (match output):

Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Bob              Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert   A       Dickson  1500        ROSS CLARK  CIR
23     D689    William          Obrian   5901  SW    74TH        ST    STE 202
23     A436    Billy    Alex    OBrian   5901        74TH        ST
23     D352    William          Obrian   5901  SW    74          ST    #202

For group 23 the survived record carries First = William and Middle = Alex. A cross-reference file relates each legacy key (D150, A1367, D689, A436, D352) to its group's survived record.
Survive rules
A rule contains a condition and a set of target fields
All records in a group are tested against the condition. When the condition is met, the record's field value becomes a candidate for the best; the best value populates the target fields.
Survive rules
Custom rule: build your own logical expression using
- Comparison operators: =, !=, <, >, <=, >=
- Logical operators: and, or, not
- Notation: c.field indicates the current record; b.field indicates the best record so far
- Parentheses ( ) for grouping complex conditions
- String literals enclosed in double quotation marks, such as "MARS"
- A semicolon (;) to terminate the rule
An illustrative rule in this notation (the field names are examples only): (c.Middle != "") and (c.Source = "MASTER") ;
Survive techniques
Pre-defined techniques:
- Source
- Recency
- Frequency
- Most complete (longest string)
User-specified logic is also available.
Target fields
Fields you want to write to the output file
Populated based on meeting the conditions of the survive rule(s)
Fields not listed as targets are excluded from the output file
A rule may have multiple targets
Each rule pairs a TARGET list with a CONDITION; a sketch of field-level survivorship follows.
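A compact Python sketch of field-level survivorship using the "most complete (longest string)" technique from above; the record layout and group data are illustrative:

```python
# Illustrative field-level survivorship: within each match group, each
# target field survives from whichever record has the longest
# populated value (the "most complete" technique).

def survive(group: list[dict], targets: list[str]) -> dict:
    best = {}
    for field in targets:
        # Longest populated value wins; empty values never survive.
        best[field] = max((r.get(field, "") for r in group), key=len)
    return best

group_23 = [  # made-up match-group records
    {"First": "William", "Middle": "",     "Last": "Obrian"},
    {"First": "Billy",   "Middle": "Alex", "Last": "OBrian"},
    {"First": "William", "Middle": "",     "Last": "Obrian"},
]
print(survive(group_23, ["First", "Middle", "Last"]))
# {'First': 'William', 'Middle': 'Alex', 'Last': 'Obrian'}
```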
Survive implementation (screenshots): double-click the Survive stage; choose the output column and the technique; a complex (custom expression) option is available.
Checkpoint
1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.
Checkpoint solutions
1. (T/F) Survivorship can allow more than one record to survive.
   Answer: False
2. (T/F) Survivorship rules deal with the complete record only.
   Answer: False; rules can operate at the record or field level
3. Name three survive rules.
   Answer: any three of source, recency, frequency, and most complete (longest string)
Unit summary
Having completed this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job
Special Topics
Full Run (screenshot: double-click the migrated job to run it)

Expanded Mode:
- Use when you need to add QS8 operators within a migrated process
- May require some manual tuning to run
Note: regardless of the migration mode, all migrated rules will have the new naming convention of: QS-7.5-Ruleset-Name_QS-7.5-Project-Name
File I/O to external files is performed by the Information Server Sequential File stages.
Tables: stage-by-stage migration mappings, each with a condition: always; Delimited text used in the 7.5 stage; ODBC used in the 7.5 stage; Merge used in the 7.5 stage; Split, Accept, or Reject used in 7.5; target columns overlap; target columns do not overlap.
Once the above tasks are completed, compile and run as you would any other job.
Globalization
Objectives
After completing this module you will be able to:
Build jobs that read and write Japanese data
Modify client settings to display Japanese data with correct characters
Terminology
Character set: an ordered list of characters used for text. Examples: Latin, Cyrillic, Unicode.
Character encoding: how each character in a character set is represented as bits. Examples: UTF-8, UTF-16BE, and GB18030 are encodings of Unicode.
Codepage: the Microsoft Windows term for an encoding, often used in other contexts too. Examples: 1252 is Windows Latin-1, a superset of ISO 8859-1; 932 is another name for Shift-JIS.
Character Sets
Latin: Italian, Spanish, French, and English alphabets
Cyrillic alphabet: subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)
ASCII: represents 128 characters; extended 8-bit codepages represent 256
Unicode: the standard for representing the characters of all languages, including Chinese, Japanese, and Korean; its Basic Multilingual Plane alone represents 65,536 unique characters
Character encoding
Definition: a system that pairs each character from a character set with something else, such as a number. Two common computer encodings for Unicode:
- UTF-8: variable-length encoding for Unicode; encodes each character in one to four bytes
- UTF-16: variable-length encoding for Unicode; allows either endian representation but mandates that the byte order be explicitly indicated by a byte order mark (BOM)
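A quick Python illustration of the encodings above: byte lengths under UTF-8, and the BOM that UTF-16 prepends when the byte order is not named explicitly:

```python
# UTF-8 uses one to four bytes per character; UTF-16 marks byte order
# with a BOM (FF FE = little-endian, FE FF = big-endian).

for ch in ("A", "é", "日"):
    print(ch, "-> UTF-8:", ch.encode("utf-8").hex(" "))
# A  -> UTF-8: 41        (1 byte)
# é  -> UTF-8: c3 a9     (2 bytes)
# 日 -> UTF-8: e6 97 a5  (3 bytes)

print("日".encode("utf-16").hex(" "))     # ff fe e5 65 (BOM + code unit)
print("日".encode("utf-16-be").hex(" "))  # 65 e5 (explicit order, no BOM)
```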
NLS
NLS (National Language Support) = globalization + localization/translation
NLS map
What DataStage uses to convert between external and internal encodings. The internal encoding is UTF-8 for the Server engine and UTF-16 for the Parallel engine.
Figure: maps convert between the engine's internal Unicode (UTF-16), the Windows code page used by the client (logs, job monitor), and external data.
Various stages have an NLS Map tab (e.g., Sequential File, External Source, External Target, File Set):
- Define character set mappings (ustring <-> external file)
- Applied at the stage or individual field level
Unit summary
Having completed this unit, you should be able to:
Build a QualityStage investigation job for non-English data
View correctly formatted results in the DataStage/QualityStage data viewer
Objectives
After completing this module you will be able to:
Build jobs using the AV stage to parse and verify address data
AVI Stage
Provides:
- Transliteration (e.g., Japanese to Latin)
- Parsing
- Address validation
Supports real-time operation.
Components:
- AV stage
- Reference data: 16 geographies, purchased via the Passport system
- API libraries (Address Doctor)
Reference Data
- Required for the validation function only
- Requires an annual license agreement
- Location is pointed to by the AV stage
- Some databases are memory intensive
Load options:
- Partial preload: indexes loaded into memory
- Full preload: data loaded into memory; fast access, but adequate memory is required
- No preload: data accessed from disk; the slowest method
Job components
Screenshots: the AVI stage and its stage properties (function, navigation).
Transliterate mode: map input columns to address elements; multiple input columns can be mapped to one address element.
Parsing mode: input and output samples.
Validation mode: uses reference data from a database; map input columns to address elements; can activate an error link; creates a validation summary report. The sample output shows only two of the validation columns.
Unit summary
Having completed this unit, you should be able to:
Build jobs using the AV stage to parse and verify address data
Review the prebuilt job that validated USPREP data from the earlier lab