Copyright IBM Corporation 2009 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an as-is basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.
Course Contents
- Data Quality Issues
- QualityStage 8 Architecture
- Developing with QualityStage
- Investigation
- Standardize
- Match
- Survivorship
- Special Topics
- Globalization (NLS)
- Address Verification Stage
Course contents
- Data quality issues
- Information Server purpose and architecture
- Introduction to DataStage and QualityStage
- Investigation
- Standardization
- Match
- Survivorship
- Special Topics: data quality methodology, QualityStage Migration Tool
Unit objectives
After completing this unit, you should be able to:
List the five common data quality contaminants
Describe each of the following processes:
- Investigation
- Standardization
- Match
- Survivorship
Figure: consolidating three sources. Source 1, Source 2, and Source 3 are aligned, harmonized, and consolidated; each source records the same locations (BOSTON, HARTFORD, CHICAGO) under a different standard, such as the codes MA93, CT15, IL21 in one source versus 6793, 0215, 8721 in another.
Buried information

Legacy record values with information buried in free-form name fields:
- Denise Mario DBA Marc Di Lorenzo ETAL
- Tom & Mary Roberts
- First Natl Provident
- Astorial Fedrl Savings
- Kevin Cooke, Receiver
- Robert A. Jones TTE Robert Jones Jr.
- First Natl Provident FBO Elaine & Michael Lincoln UTA DTD 3-30-89
- 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345
No common key

CUSNUM    NAME                   ADDRESS                          SALES $
90328574  IBM                    187 N.Pk. Str. Salem NH 01456      8,494.00
90328575  I.B.M. Inc.            187 N.Pk. St. Sarem NH 01456       3,432.00
90238495  International Bus. M.  187 No. Park St Salem NH 04156     2,243.00
90233479  Int. Bus. Machines     187 Park Ave Salem NH 04156        5,900.00
90233489  Inter-Nation Consults  15 Main St. Andover MA 02341       6,800.00
90234889  Int. Bus. Consultants  PO Box 9 Boston MA 02210          10,243.00
90345672  I.B. Manufacturing     Park Blvd. Boston MA 04106        15,999.00
Anomalies

Five records for one household show name anomalies and field spillover:
Names: Peter J. Lalonde; LaLonde, Peter; Lalonde, Sofie; Pete & Soph Lalond; P. Lalonde FBO
City/Zip spillover: City values include "Melrose,", "MA", and "MASS"; Zip values include "02176", "MA 02", and "176" (the city, state, and ZIP spill across the City and Zip fields).
Why investigate?
Discover potential anomalies in the data
Examine single-domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules
Verify the reliability of the data in the fields to be used as matching criteria
Gain a complete understanding of the data
Figure: frequency distribution report showing, for each value, its frequency count and percentage of the total.
What is standardize?
Applying business logic to data chaos.
Pattern manipulation
How to standardize
Parse specific data fields into smaller, atomic data elements; atomic data elements are called tokens
Categorize the identified elements
Examples: separate Name, Address, and Area from free-form name and address lines; identify distinct material categories (e.g., sutures vs. orthopedic equipment)
Example 2
Part Description = BLK LATEX GLOVE becomes:
- Color = BLACK
- Type = LATEX
- Part = GLOVE
Why standardize?
Normalize values in data fields to standard values:
- First Name: MIKE -> MICHAEL
- Title: Doctor -> Dr
- Address: ST. Michael Street -> Saint Michael St.
- Color: BLK -> BLACK
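The two examples above amount to parse-then-normalize: split a field into tokens, then replace each token with its standard value. A minimal Python sketch of that idea, assuming a toy lookup table (the entries below are illustrative, not QualityStage's shipped classification or standardization tables):

```python
# Illustrative only: a toy standardizer that parses a free-form
# description into tokens and normalizes each token via a lookup table.

STANDARD_VALUES = {          # hypothetical standardization entries
    "BLK": "BLACK",
    "MIKE": "MICHAEL",
    "DOCTOR": "DR",
}

def standardize(text: str) -> list[str]:
    tokens = text.upper().split()          # parse into atomic tokens
    return [STANDARD_VALUES.get(t, t) for t in tokens]

print(standardize("BLK LATEX GLOVE"))      # ['BLACK', 'LATEX', 'GLOVE']
```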
QualityStage standardize
Uses a highly flexible pattern-recognition language
Can employ field- or domain-specific standardization (i.e., unique rules for names vs. addresses vs. dates)
Contains customizable classification and standardization tables
Utilizes results from data investigation
Match
Conditioned data and QualityStage's matching engine link the previously unlinkable.
Match construction: the reliability of the input data defines a match result.
Report generation: all applied business rules have an easy-to-understand report structure.
What is match?
Identifying all records on one file that correspond to similar records on another file
Identifying duplicate records in one file
Building relationships between records in multiple files
Performing statistical and probabilistic matching
Calculating a score based on the probability of a match
Why match?
Identify duplicate entities within one or more files
Perform householding
Create a consolidated view of the customer
Establish cross-reference linkage
How to match
Single-file (unduplication) or two-file (reference) matching
Different match comparisons for different types of data (e.g., exact character, uncertainty/fuzzy match, keystroke errors, multiple-word comparison)
Generation of composite weights from multiple fields
Use of probabilistic or statistical algorithms
Application of match cutoffs (thresholds) to identify automatic and clerical match levels
Incorporation of override weights to assess particular data conditions (e.g., default values, discriminatory elements)
QualityStage match
A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
Statistically based method for determining matches (probabilistic record linkage theory)
Field-by-field comparisons for agreement or disagreement
Assignment of weights or penalties
Overrides for unique data conditions
Scored results to determine the probability of matched records
Thresholds for final match determination
Ability to measure the informational content of data
What is survive?
Creation of best-of-breed surviving data based on record- or field-level information
Development of a cross-reference file of related keys
Creating output formats:
- Relational table with primary and foreign keys
- Transactions to update databases
- Cross-reference files
Why survive?
Provide a consolidated view of the data
Provide a consolidated view containing the best-of-breed data
Resolve conflicting values and fill missing values
Cross-populate the best available data
Implement business rules
Create cross-reference keys
How to survive
Highly flexible rules
Record- or field-level survivorship decisions
Rules can be based on data frequency, data recency (i.e., date), data source, and value presence or length
Rules can incorporate multiple tests
QualityStage features:
- Point-and-click (GUI-based) creation of business rules to determine best-of-breed surviving data
- Performed at record or field level
Example 1: survive the longest populated Middle and Last Name.
Matched records: (MARI, -, LEMELSON) and (MARI, S, LAPPNER)
Survived record: First = MARI, Middle = S, Last = LEMELSON

Example 2: survive the longest populated Middle Name, Date of Birth, and SSN.
Matched records: (DENISE, -, TRIANO) and (DENISE, F, TRIANO), with DOB 19580211 and SSN 98524173
Survived record: First = DENISE, Middle = F, Last = TRIANO, DOB = 19580211, SSN = 98524173
Figure: data quality methodology. Standardize Country, select US data for further processing (policy), investigate to assess the data quality condition, standardize Name, Address, and Area, then identify duplicate customer records.
Checkpoint
1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.
Checkpoint solutions
1. (T/F) Data quality investigation cleans the source data.
   Answer: False
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
   Answer: False
3. (T/F) Survivorship data can be either record based or field based.
   Answer: True
Unit summary
Having completed this unit, you should be able to:
List the five common data quality contaminants:
- Different standards
- Missing and default values
- Spillover and buried information
- Anomalies
- No consolidated view
Describe each of the following processes: Investigation, Standardization, Match, Survivorship
QualityStage 8 Architecture
Unit objectives
After completing this unit, you should be able to:
Describe the Data Quality architecture
Identify server and client components
Figure: Information Server architecture. The services tier (metadata access services, client logon access, logging, security, and other services) connects the Windows-based DataStage clients to the engine tier; the engine and its projects can run on UNIX or Windows.
DataStage clients

Administrator:
- Add and delete projects
- Set project defaults
- Set project environment parameters

Designer:
- Maintain data definitions
- Add, modify, and delete jobs
- Add, modify, and delete match specifications
- Manage rule sets
- Compile jobs
- Run jobs
- Provision rule sets and match specifications

Director:
- Run jobs
- Review the job log
- Schedule jobs
DataStage Administrator
Administrator:
- Create or delete projects
- Set project defaults
- Apply security
(Screenshot: project list)
DataStage Designer
Designer
Client GUI for designing jobs (Windows 2000+, XP)
Build metadata
Build jobs
Modify standardization rules
Build match specifications
(Screenshot: Designer repository database)
DataStage Director
Director
Client GUI for managing job execution (Windows 2000+, XP)
Run jobs; set job options and parameters
View the job log
Schedule job execution
Checkpoint
1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?
Checkpoint solutions
1. (T/F) DataStage Administrator executes jobs.
   Answer: False
2. (T/F) DataStage Designer configures projects.
   Answer: False
3. Which DataStage component displays objects in the designer database?
   Answer: Designer; its repository view shows project components
Unit summary
Having completed this unit, you should be able to:
Describe the Data Quality architecture Identify server and client components
Unit objectives
After completing this unit, you should be able to:
Import metadata
Build DataStage/QualityStage jobs
Run jobs
Review results
DataStage/QualityStage project
Components
- Jobs
- Stages within jobs
- Table definitions
The Designer repository view shows project components.
Job definition
A job is an executable DataStage/QualityStage program, created by job compilation
Jobs can be run in batch or in real time
The server runs the job; use the job log to view its execution
Stages
Data definitions:
- Entered or loaded via DataStage import mechanisms: sequential file, ODBC, native database connection
- New and redefined columns can be added on the data flow via the Transformer stage (e.g., to append ISO country codes)
- Rule sets are stored in the repository and provisioned to the job execution area
Execution environment
Checkpoint
1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?
Checkpoint solutions
1. (T/F) The job monitor displays link statistics.
   Answer: True
2. (T/F) The job log is viewed in DataStage Designer.
   Answer: False; the job log is viewed in DataStage Director
3. What protocol is used for communication between the DataStage clients and server?
   Answer: TCP/IP
Unit summary
Having completed this unit, you should be able to:
Import metadata
Build DataStage/QualityStage jobs
Run jobs
Review results
Investigation
Unit objectives
After completing this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results
Investigation
Verify the domain:
- Review each field of interest and verify that the data matches the metadata
- Identify data formats, and missing and default values
- Identify data anomalies in format, structure, and content
Investigate stage
Features:
- Analyze free-form and single-domain fields
- Provide frequency distributions of distinct values and patterns
Investigate methods:
- Character Discrete
- Character Concatenate
- Word
Investigate methods

Method                 Why
Character Discrete     Analyzing field values, formats, and domains
Character Concatenate  Cross-field correlation, checking logical relationships between fields
Word                   Identifying free-form fields that may require parsing, and discovering key words for classification
Investigate terminology

Field masks are options that represent the data, applied per character position:

Mask         Usage
C (Character) View the actual character values of the data
T (Type)      View the pattern of the data
X (Skipped)   Ignore characters

Token examples:

Token           Mask            Result
02116           CCCCC           02116
02116           CCCXX           021
01832-4480      TTTTTTTTTT      nnnnn-nnnn
XJ2 6EM         TTTTTTT         aanbnaa
(617) 338-0300  CCCCCCCCCCCCCC  (617) 338-0300
617-338-0300    TTTTTTTTTTTT    nnn-nnn-nnnn
6173380300      CCCXXXXXXXXX    617
(617)3380300    CCCXXXXXXXXX    (61

A sketch of mask application follows.
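A small Python sketch of how the C/T/X masks in the table above could be applied; the `apply_mask` function and its type-pattern rules (n = numeric, a = alpha, b = blank) are assumptions for illustration, not QualityStage internals:

```python
# Illustrative mask application: C keeps the character, T replaces it
# with its type (n = numeric, a = alpha, b = blank), X drops it.

def char_type(ch: str) -> str:
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "a"
    if ch == " ":
        return "b"
    return ch                          # punctuation passes through

def apply_mask(token: str, mask: str) -> str:
    out = []
    for ch, m in zip(token, mask):
        if m == "C":
            out.append(ch)             # keep the actual character
        elif m == "T":
            out.append(char_type(ch))  # keep only the pattern
        # 'X' skips the character entirely
    return "".join(out)

print(apply_mask("01832-4480", "TTTTTTTTTT"))  # nnnnn-nnnn
print(apply_mask("XJ2 6EM", "TTTTTTT"))        # aanbnaa
print(apply_mask("02116", "CCCXX"))            # 021
```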
Investigation implementation (screenshots): double-click the Investigate stage; on the Investigation - Character tabs, select a column, add it, and select the mask.
Character concatenate
Identify field relationships:
- Investigate one or more fields to uncover any relationship between the field values
- Uses combinations of character masks
- Generates frequency reports
Word investigate
Usage: free-form field pattern analysis; view the pattern of the data within a free-form text field and parse it into individual tokens.
QualityStage process:
- Apply rule sets to free-form fields
- Discover parsing requirements
- Discover patterns in data
- Generate reports for pattern frequency distributions and a token report
How to use the pattern report: look at the most frequently occurring patterns; use them to estimate how much work is needed to modify a rule set for a customer.
How to use the token report: review tokens with a subject-matter expert (SME) to verify that tokens are properly classified; identify the most frequently occurring unclassified tokens and add them to the rule set.
Rule sets
Rules for parsing, classifying, and organizing data.
Rule set domains:
- Country processing
- Pre-processing
- Domain processing: Name (business and personal), Street Address, and Area (locality, city, state, and ZIP/postal codes)
Parsing
Parse free-form data with a SEPLIST and a STRIPLIST:
- SEPLIST: any character in the SEPLIST separates tokens and becomes a token itself
- STRIPLIST: any character in the STRIPLIST is ignored in the resulting pattern
Parsing example

Input string: 120 Main St. NW

SEPLIST = space and period, STRIPLIST = empty: 8 tokens (120, space, Main, space, St, period, space, NW)
SEPLIST = space, STRIPLIST = space and period: 4 tokens (120, Main, St, NW)
SEPLIST = space and period, STRIPLIST = space: 5 tokens (120, Main, St, period, NW)

Classification example: the input 120 Main Street Apt 6C parses and classifies token by token as 120 -> ^, Main -> ?, Street -> T, Apt -> U, 6C -> >.

A tokenizer sketch follows.
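A minimal Python sketch of SEPLIST/STRIPLIST tokenization as described above; the function name and edge-case behavior are assumptions for illustration:

```python
# Illustrative SEPLIST/STRIPLIST tokenizer: a character in seplist ends
# the current token and becomes a token itself; characters in striplist
# are then dropped from the result.

def parse(text: str, seplist: str, striplist: str) -> list[str]:
    tokens, current = [], ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)  # close the token in progress
                current = ""
            tokens.append(ch)           # the separator is itself a token
        else:
            current += ch
    if current:
        tokens.append(current)
    # Characters in the STRIPLIST are ignored in the resulting pattern.
    cleaned = ["".join(c for c in t if c not in striplist) for t in tokens]
    return [t for t in cleaned if t]

print(parse("120 Main St. NW", " .", ""))    # 8 tokens, incl. spaces and '.'
print(parse("120 Main St. NW", " ", " ."))   # ['120', 'Main', 'St', 'NW']
print(parse("120 Main St. NW", " .", " "))   # ['120', 'Main', 'St', '.', 'NW']
```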
Token report and pattern report: the token report lists individual tokens with their classes (e.g., 10 -> ^, MAPLE -> ?); the pattern report lists the frequency of each parsed pattern (e.g., ^?).
(Screenshots: Investigation - Word configuration; link ordering)
Checkpoint
1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.
Checkpoint solutions
1. (T/F) Character discrete investigation examines a single domain.
   Answer: True
2. (T/F) Word investigation examines a single domain.
   Answer: False; word investigation examines free-form fields
3. Name the three character masks.
   Answer: C (character), T (type), and X (skipped)
Unit summary
Having completed this unit, you should be able to:
Build Investigate jobs
Use character discrete, concatenate, and word investigations to analyze data fields
Review results
Standardize
Unit objectives
After completing this unit, you should be able to:
Describe the Standardize stage
Identify rule sets
Build jobs using the Standardize stage
Interpret standardization results
Investigate unhandled data and patterns
Standardize
Transformation:
- Parsing free-form fields
- Comparison threshold for classifying like words
- Bucketing data tokens
Standardization:
- Applying standard values and standard formats
Standardize example
Input file (Address Line 1, Address Line 2):
1721 W ELFINDALE ST
1721 W ELFINDALE ST # 20
16200 VENTURA BOULEVARD, C/O JOSEPH C REIFF
1705 W St
1655 PONCE DE LEON AVENUE

Result file: each line is parsed into House #, Direction, Street Name, Street Type, Unit Type, Unit Value, Floor Type, and Floor Value. For example, 1721 W ELFINDALE ST # 20 yields House # = 1721, Dir = W, Street Name = ELFINDALE, Type = ST, Unit Type = UNIT, Unit Value = 20.
Standardize process

Running example: the input 10 MAPLE ST APT 222 is parsed into tokens and classified using the key below, producing the pattern ^?TU^; the tokens are then bucketed into output fields: House Number = 10, Street Name = MAPLE, Street Type = ST, Unit Type = APT, Unit = 222.

Key: ^ = single numeric; ? = one or more unknown alphas; T = street type; U = unit type
Standardize stage
Uses rule sets for:
- Country processing
- Pre-domain processing (USPREP)
- Domain processing (USADDR, USAREA, USNAME)
Standardize Country example (input record -> ISO code; flag Y = explicit decision, N = default):
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111 -> US (Y)
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0 -> CA (Y)
28 GROSVENOR STREET LONDON W1X 9FE -> GB (Y)
123 MAIN STREET -> US (N)

Mixed-domain example. Input record: Field 1 = JIM HARRIS (781) 322-2426, Field 2 = 92 DEVIR STREET MALDEN MA 02148. Output record: Name domain = JIM HARRIS; Address domain = 92 DEVIR STREET; Area domain = MALDEN MA 02148; Other domain = (781) 322-2426.

USADDR output example for 100 SUMMER STREET 15TH FLOOR: House Number = 100; Street Name = SUMMER; Street Suffix Type = ST; Floor Type = FL; Floor Value = 15; Address Type = S; NYSIIS of Street Name = SANAR; Reverse Soundex of Street Name = R520; Input Pattern = ^+T>U.
Rule sets
Rule sets contain logic for:
- Parsing
- Classifying
- Processing data by pattern and bucketing data
Optional files:
- Lookup tables
- Override tables
Classification table
Contains the words used for classification, the standardized versions of those words, and a data class
A data class (data tag) is assigned to each data token
Default classes are the same across all rule sets
User-defined classes are assigned in the classification table; users may modify, add, or delete these classes; user-defined classes are a single letter
Default classes
Class     Description
^         A single numeric
+         A single unclassified alpha (word)
?         One or more consecutive unclassified alphas
@         Complex mixed token, e.g., C3PO
>         Leading numeric, e.g., 6A
<         Trailing numeric, e.g., A6
0 (zero)  Null class
User-defined classes
Rule set  Class  Description
USNAME    G      Generational, e.g., Senior, I, II
USNAME    P      Prefix, e.g., Dr., Mr., Miss
USADDR    T      Street Type
USADDR    D      Directional
USADDR    B      Box Type
USAREA    S      State Abbreviation
; USADDR Classification Table
;------------------------------------------------------------------
; Classification Legend
;   B - Box Types
;   D - Directionals
;   F - Floor Types
;   H - Highway Modifiers
;   R - Rural Route, Highway Contract, Star Route
;   T - Street Types
;   U - Unit Types
;------------------------------------------------------------------
; Table sort order: 51-51 ascending, 26-50 ascending, 1-25 ascending
;------------------------------------------------------------------
; Token      Standard form  Class
DRAW         "PO BOX"       B
DRAWER       "PO BOX"       B
PO           "PO BOX"       B
POB          "PO BOX"       B
POBOX        "PO BOX"       B
POBX         "PO BOX"       B
PODRAWER     "PO BOX"       B
Comparison threshold
May be used in the classification table
Used to efficiently make entries in the classification table
Helps overcome spelling and data-entry errors
Not required
The threshold uses a logical string comparator

Threshold level  Meaning
900              Exact match
850              Almost certainly the same
800              Most likely equivalent
750              Most likely not the same
700              Almost certainly not the same

A comparator sketch follows.
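A hedged illustration of a string comparator scaled to the 700-900 range above. QualityStage's actual comparator algorithm is not documented here, so this sketch substitutes Python's standard-library difflib as a stand-in:

```python
# Illustrative only: score two words on a 0-900 scale, mimicking the
# comparison-threshold idea (900 = exact match). The real QualityStage
# comparator may differ; difflib is a stand-in.
from difflib import SequenceMatcher

def similarity_score(a: str, b: str) -> int:
    ratio = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return round(ratio * 900)

print(similarity_score("CAROLYNNE", "CAROLYNNE"))  # 900: exact match
print(similarity_score("CAROLYNNE", "CAROLYNN"))   # high: likely the same word
print(similarity_score("MAPLE", "MAIN"))           # low: different words
```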
Dictionary file
Defines the field definitions for the output file
Moving data into these output fields is called "bucketing" the data
The order in which fields are listed in the dictionary file defines the order in which they appear in the output file
Dictionary file entries are similar to field definitions
Pattern-Action file
Contains the rules for standardization, that is, the actions to execute for a given pattern of tokens
Records are processed from the top down
Written in Pattern-Action Language (PAL)
Complex parsing can be coded in this file
A sketch of the pattern-action idea follows.
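A toy Python sketch of the pattern-action idea: match the classified token pattern top-down, then bucket each token into a dictionary field. The rule table and field names below are illustrative, not actual PAL syntax:

```python
# Illustrative pattern-action processing: the classified pattern of the
# input tokens is looked up top-down; the first matching rule buckets
# each token into a dictionary (output) field.

RULES = [  # (pattern, output field per token) -- hypothetical rules
    ("^?TU^", ["HouseNumber", "StreetName", "StreetType",
               "UnitType", "UnitValue"]),
    ("^?T",   ["HouseNumber", "StreetName", "StreetType"]),
]

def standardize(tokens: list[str], pattern: str) -> dict[str, str]:
    for rule_pattern, fields in RULES:       # processed from the top down
        if pattern == rule_pattern:
            return dict(zip(fields, tokens))
    return {"UnhandledData": " ".join(tokens)}  # data exception

# "10 MAPLE ST APT 222" classifies as ^?TU^ (see the running example)
print(standardize(["10", "MAPLE", "ST", "APT", "222"], "^?TU^"))
```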
Standardize process (continued)

Figure: pattern-action processing buckets each token of the running example into its dictionary field: 10 -> {HN} (House Number), MAPLE -> {SN} (Street Name), ST -> {ST} (Street Type), plus Unit Type = APT and Unit = 222. The dictionary file defines these output fields.
Standardize Country
Output:
Two-byte ISO country code
Flag identifying an explicit or default decision
Standardization implementation (screenshots): standardization jobs include Standardize Country followed by domain standardization.
Selecting US data
The DataStage Filter stage provides the capability of selecting and/or rejecting records based on a set of values for a field. Selecting or splitting data that requires compound or complex logic may call for the Transformer stage.
Standardize Country
The pre-processor rule set prepares the data for processing by domain-specific rule sets.
Standardize results
Business intelligence fields: parsed from the original data; they may be used in matching, and generally they are moved to the target system.
Matching fields: generally created to help during the match process and dropped after successful matching.
Reporting fields: specifically created to help review the results of Standardize and recognize handled and unhandled data.
USADDR examples: House Number, Directional, Street Name, Unit Types, Box Types, Unit Values, Building Names
USAREA examples
Matching-field examples:
- Hash keys: the first two characters of the first five words
- Packed keys: data concatenated, or "packed"
Reporting fields:
- Unhandled Data: the remaining tokens not processed by the selected rule set
- Input Pattern: the pattern generated for the stream of input tokens, based on the parsing rules and token classifications
- Exception Data: the tokens not processed by the rule set because they represent a data exception
- User Override Flag: indicates what kind of user overrides were applied to the record
Standard practice: investigate handled and unhandled data.
- Review the business intelligence fields to ensure accurate bucketing of the data: build a Character Discrete investigation for each field and review the contents and the format
- Build an investigation to review: Unhandled Patterns, Unhandled Data, Input Pattern, Input Fields
User overrides
Provide the ability to modify rule sets.
The following types of rule sets can be modified using user overrides:
- Domain pre-processor rule sets
- Domain rule sets
There are five types of user overrides, relating to classifications, patterns, and text strings. User overrides are GUI driven.
Classification override example
Original data: HOCHREITER , CAROLYNNE
Override: add CAROLYNNE as a valid first name (class F) to the classification table.
Corrected input pattern: +,F
Text overrides
Allow the user to specify overrides based on an entire text string. Use this override for special cases and specific handling of a string of text. Input text overrides are applied to the original text string.
Example: the input text REIFF FUNERAL has the unhandled pattern ++; the override moves the text string to the Primary Name field.
Pattern overrides
Allow the user to specify overrides based on an entire pattern. Use this override when most or all records should be processed with identical logic. Input pattern overrides are applied to the original text string.
Example: the input texts HAYWARD, WINSLOW; ESHAGHIAN , JOUBI; and BOULDER, CORONA all produce the unhandled pattern +, +, where the comma provides context. The override moves the + before the comma to Primary Name and the + after the comma to First Name.
Override scopes: modify logic based on the input pattern, the unhandled data string, or the unhandled pattern.
Lab: identify the unhandled patterns for the Address and Area fields. In the report, include the unhandled data, input pattern, original data, and the record key.
1. Build a Character Concatenate investigation using the following fields:
   Field              Mask type
   Unhandled Pattern  C
   Unhandled Data     X
   Input Pattern      X
   Address Domain     X
2. Increase the number of samples to 5.
Overrides
Purpose: correct problems found during standardization.
Override types:
- Classification
- Input pattern
- Input text
- Unhandled pattern
- Unhandled text
(Screenshots: Standardize Country; overrides screen)
Checkpoint
1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.
Checkpoint solutions
1. (T/F) WAVES can standardize name fields.
   Answer: False
2. (T/F) Rule sets are used in standardization processing.
   Answer: True
3. Name the components of rule sets.
   Answer: the classification table, the dictionary file, and the pattern-action file (plus optional lookup and override tables)
Unit summary
Having completed this unit, you should be able to:
Describe the Standardize stage
Identify rule sets
Build jobs using the Standardize stage
Interpret standardization results
Investigate unhandled data and patterns
Standardize Country: this rule set is found in the Other folder and adds the ISO country code to records.
Match
Unit objectives
After completing this unit, you should be able to:
Build a QualityStage job to identify matching records
Apply multiple match passes to increase efficiency and efficacy
Interpret and improve match results
Match stage
Statistically based method for determining matches
25 match comparison algorithms providing a full spectrum of fuzzy matching functions
Ability to measure the informational content of data
Identify duplicate entities within one or more files
Match specification built with the Match Designer
Critical field questions:
- Do you compare all the shared or common fields?
- Do you give partial credit?
- Are some fields (or some values) more important to you than others? Why?
- Do more fields increase your confidence? By how much? What is enough?
Information content also measures the significance of one value in a field over another (frequency). In a first-name field, JOHN contributes less information than DWEZEL. Significance is determined by a value's reliability and its ability to discriminate; both can be calculated from your data.
Distribution of weights
Figure: comparing the pair WILLIAM J HOLDEN, 128 MAIN ST, 02111, 12/8/62 against WILLAIM JOHN HOLDEN, 128 MAINE AVE, 02110, 12/8/62, the field comparisons contribute weights of +1, +1, +17, +2, +4, -1, +7, and +9 for a composite weight of 40. A histogram of composite weights (number of pairs vs. weight) shows non-matches clustered at low weights (less confidence) and matches at high weights (more confidence), with a grey area between them.
Weights
Measures the information content of a data value
Each field contributes to the confidence (probability) of a match
Types of weights:
- If a field matches, the agreement weight is used; the agreement weight is a positive value
- Partial weight is assigned for non-exact or fuzzy matches
- Missing values have a default weight of zero
- Weights for all field comparisons are summed to form a composite weight
Matching terminology
Term                   Definition
Informational Content  Measures the significance of one field value over another
Weight                 Measures the informational content of a data value
Composite Weight       Measures the confidence of a match
Match Cutoffs          Distinguish matches from non-matches
False Positives        Records with a score above the high cutoff that really aren't a match
False Negatives        Records below the low cutoff that really are a match
Reliability (m-probability)
Approximated as 1 - (error rate) for the given field. The higher the m-probability, the higher the disagreement weight (penalty) for the field not matching, since the data is considered reliable. A sketch of the weight formulas follows.
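The course names probabilistic record linkage theory (Fellegi-Sunter); in that framework, agreement and disagreement weights derive from the m-probability (reliability) and the u-probability (chance agreement). A short Python sketch of those standard formulas; the field probabilities below are made-up values:

```python
# Standard Fellegi-Sunter weights: agreement = log2(m/u),
# disagreement = log2((1-m)/(1-u)), where m = P(fields agree | match)
# and u = P(fields agree | non-match). Probabilities are illustrative.
from math import log2

def weights(m: float, u: float) -> tuple[float, float]:
    return log2(m / u), log2((1 - m) / (1 - u))

# A reliable field (high m) earns a large penalty when it disagrees.
agree, disagree = weights(m=0.95, u=0.01)
print(f"last name: agreement {agree:+.1f}, disagreement {disagree:+.1f}")

# Composite weight: sum each compared field's agreement or disagreement.
composite = sum([agree, weights(0.9, 0.1)[0], weights(0.8, 0.3)[1]])
print(f"composite weight: {composite:+.1f}")
```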
Blocking
Grouping together like records that have a high probability of producing matches
Only like records are compared to each other, making the match more efficient and computationally feasible
Records in a block match exactly on one to several blocking fields
Example records:

NAME                ADDRESS             ZIP
YUNG , WAYNE D      9000 SHEPARD DRIVE  78753
YOUNG , JONATHAN A  4220 BELLE PARK DR  77072
YOUNG THERESA C     1767 TOBEY ROAD     30341

Blocking on ZIP 02111 groups candidate records for comparison: GEROSA, FRAN X; GEROSA, FRANCIS XAVIER; GEROSA, MARY; GARISA, FRANCIS; MARCUS MATIC; JANCAN; RENEE JENKINS.
Blocking strategy
Choose fields with reliable data
Choose fields with a good distribution of values
Combinations of fields may be used
Blocking summary
Blocking groups together like records
Matching is more efficient for small block sizes: blocks should have fewer than 1,000 records (a guideline, not a hard-and-fast rule)
Blocking fields must match exactly for a candidate set to be created and evaluated
Beware of block overflow: the run can exhaust computational resources, comparisons are not completed, and every record in the block becomes an automatic residual
A sketch of blocking follows.
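A small Python sketch of blocking as described above: group candidates by exact blocking-key values so only records within a block are compared. The record layout and key choice are illustrative:

```python
# Illustrative blocking: records are grouped by the exact value of the
# blocking fields (here ZIP + first letter of name); only records that
# share a block are compared, shrinking the comparison space.
from collections import defaultdict
from itertools import combinations

records = [  # made-up sample data
    {"name": "GEROSA, FRAN X",         "zip": "02111"},
    {"name": "GEROSA, FRANCIS XAVIER", "zip": "02111"},
    {"name": "YOUNG, JONATHAN A",      "zip": "77072"},
]

blocks = defaultdict(list)
for rec in records:
    key = (rec["zip"], rec["name"][0])      # blocking fields match exactly
    blocks[key].append(rec)

for key, members in blocks.items():
    for a, b in combinations(members, 2):   # compare within block only
        print(f"block {key}: compare {a['name']!r} vs {b['name']!r}")
```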
Match types
Unduplication: identifies duplicate candidates in one file
Reference match: two files, with many-to-one correspondence, where more than one record on the stream link can match the same record on the reference link
Match Implementation
1. Run a pass
2. Review the results
3. Tune the match: add cutoffs, set overrides, add more passes
4. Repeat until the match results are acceptable
Match Designer
Used to build a match specification that is then referenced in a match job.
Features:
- Design control center
- Data-centric
- Graphical representation of statistics
- Independent of job design
- Iterative development
Screens: map fields, data viewer, decision rules, cutoff tuning, blocking, match commands.
Setup questions: Where is the standardized data? Where is the frequency report? Which ODBC-accessed database will store test results?
Screenshots: set up the test results area; select the column (e.g., Business Name); set the compare type and data column; assign frequencies; select parameters; click Apply or OK.
Grouping options for results:
- Match Sets: see all matches and duplicates together
- Match Pairs + Sort: see the master record repeated
Statistics tab (screenshots): shows how each change to the match specification is reflected in the pass statistics.
Critical fields
Used to identify fields that must agree in order for records to be linked:
- Critical: field values must agree exactly or the records cannot be linked (considered a match)
- Critical, Missing OK: field values must agree exactly on values not considered missing
Weight overrides
Allow you to adjust the agreement and/or disagreement weights for specific situations:
- Add to the calculated weight
- Replace the weight
Set on the Match Commands screen.
Cutoffs
There are two cutoffs:
- Match cutoff (high cutoff)
- Clerical cutoff (low cutoff)
Records with a weight equal to or above the match cutoff are considered matches. Records with a weight below the clerical cutoff are not matches. Records with a weight greater than or equal to the clerical cutoff and less than the match cutoff are clerical records, held for manual review. The cutoffs can be set to the same value, eliminating clerical records. A sketch follows.
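A minimal Python sketch of cutoff classification; the composite weights and threshold values are illustrative:

```python
# Illustrative cutoff logic: classify a record pair by its composite
# weight against the clerical (low) and match (high) cutoffs.

MATCH_CUTOFF = 30.0      # high cutoff (illustrative value)
CLERICAL_CUTOFF = 15.0   # low cutoff (illustrative value)

def classify(composite_weight: float) -> str:
    if composite_weight >= MATCH_CUTOFF:
        return "match"
    if composite_weight >= CLERICAL_CUTOFF:
        return "clerical"            # held for manual review
    return "non-match"

for w in (40.0, 20.0, 5.0):
    print(w, "->", classify(w))      # match, clerical, non-match
```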
Figure: multi-pass matching. Pass 1, blocked on street name, matched records at 1350 WALTON AVE and 2047 PRINCE (30604). Pass 2 found additional matched records (including one addressed P.O. BOX 123) in which the street name was different but the names were the same.
Unduplication implementation (screenshots): double-click the stage to configure it.
Checkpoint
1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?
Checkpoint solutions
1. (T/F) Match specifications are created using Designer.
   Answer: True
2. (T/F) An unduplicate match can be used against two files.
   Answer: False; unduplication works within one file
3. Which match specification component determines the extent of the clerical review records?
   Answer: the cutoff values
Unit summary
Having completed this unit, you should be able to:
Build a QualityStage job to identify matching records
Apply multiple match passes to increase efficiency and efficacy
Interpret and improve match results
Survive
Unit objectives
After completing this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job
Survive stage
Point-and-click creation of business rules to determine surviving data: the user decides how to survive data
Performed at record or field level: very flexible
Creates a single, consolidated record containing the best-of-breed data
Provides a consolidated view of the data
Survive example

Survive input (match output):

Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Bob              Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert   A       Dickson  1500        ROSS CLARK  CIR
23     D689    William          Obrian   5901  SW    74TH        ST    STE 202
23     A436    Billy    Alex    OBrian   5901        74TH        ST
23     D352    William          Obrian   5901  SW    74          ST    #202

For group 23 the survived record carries First = William and Middle = Alex. A cross-reference file relates each legacy key (D150, A1367, D689, A436, D352) to its group's survived record.
Survive rules
A rule contains a condition and a set of target fields
All records in a group are tested against the condition. When the condition is met, the record's field value becomes a candidate for the best; the best value populates the target fields.
Survive rules
Custom rule: build your own logical expression using
- Comparison operators: =, !=, <, >, <=, >=
- Logical operators: and, or, not
- Notation: c.field indicates the current record; b.field indicates the best record so far
- Parentheses ( ) for grouping complex conditions
- String literals enclosed in double quotation marks, such as "MARS"
- A semicolon (;) to terminate the rule
An illustrative rule in this notation (the field names are examples only): (c.Middle != "") and (c.Source = "MASTER") ;
Survive techniques
Pre-defined techniques:
- Source
- Recency
- Frequency
- Most complete (longest string)
User-specified logic is also available.
Target fields
Fields you want to write to the output file
Populated based on meeting the conditions of the survive rule(s)
Fields not listed as targets are excluded from the output file
A rule may have multiple targets
Each rule pairs a TARGET list with a CONDITION; a sketch of field-level survivorship follows.
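A compact Python sketch of field-level survivorship using the "most complete (longest string)" technique from above; the record layout and group data are illustrative:

```python
# Illustrative field-level survivorship: within each match group, each
# target field survives from whichever record has the longest
# populated value (the "most complete" technique).

def survive(group: list[dict], targets: list[str]) -> dict:
    best = {}
    for field in targets:
        # Longest populated value wins; empty values never survive.
        best[field] = max((r.get(field, "") for r in group), key=len)
    return best

group_23 = [  # made-up match-group records
    {"First": "William", "Middle": "",     "Last": "Obrian"},
    {"First": "Billy",   "Middle": "Alex", "Last": "OBrian"},
    {"First": "William", "Middle": "",     "Last": "Obrian"},
]
print(survive(group_23, ["First", "Middle", "Last"]))
# {'First': 'William', 'Middle': 'Alex', 'Last': 'Obrian'}
```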
Survive implementation (screenshots): double-click the Survive stage; choose the output column and the technique; a complex (custom expression) option is available.
Checkpoint
1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.
Checkpoint solutions
1. (T/F) Survivorship can allow more than one record to survive.
   Answer: False
2. (T/F) Survivorship rules deal with the complete record only.
   Answer: False; rules can operate at the record or field level
3. Name three survive rules.
   Answer: any three of source, recency, frequency, and most complete (longest string)
Unit summary
Having completed this unit, you should be able to:
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive job
Special Topics
Full Run (screenshot: double-click the migrated job to run it)

Expanded Mode:
- Use when you need to add QS8 operators within a migrated process
- May require some manual tuning to run
Note: regardless of the migration mode, all migrated rules will have the new naming convention of: QS-7.5-Ruleset-Name_QS-7.5-Project-Name
File I/O to external files is performed by the Information Server Sequential File stages.
Tables: stage-by-stage migration mappings, each with a condition: always; Delimited text used in the 7.5 stage; ODBC used in the 7.5 stage; Merge used in the 7.5 stage; Split, Accept, or Reject used in 7.5; target columns overlap; target columns do not overlap.
Once the above tasks are completed, compile and run as you would any other job.
Globalization
Objectives
After completing this module you will be able to:
Build jobs that read and write Japanese data
Modify client settings to display Japanese data with correct characters
Terminology
Character set: an ordered list of characters used for text. Examples: Latin, Cyrillic, Unicode.
Character encoding: how each character in a character set is represented as bits. Examples: UTF-8, UTF-16BE, and GB18030 are encodings of Unicode.
Codepage: the Microsoft Windows term for an encoding, often used in other contexts too. Examples: 1252 is Windows Latin-1, a superset of ISO 8859-1; 932 is another name for Shift-JIS.
Character Sets
Latin: Italian, Spanish, French, and English alphabets
Cyrillic alphabet: subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)
ASCII: represents 128 characters; extended 8-bit codepages represent 256
Unicode: the standard for representing the characters of all languages, including Chinese, Japanese, and Korean; its Basic Multilingual Plane alone represents 65,536 unique characters
Character encoding
Definition: a system that pairs each character from a character set with something else, such as a number. Two common computer encodings for Unicode:
- UTF-8: variable-length encoding for Unicode; encodes each character in one to four bytes
- UTF-16: variable-length encoding for Unicode; allows either endian representation but mandates that the byte order be explicitly indicated by a byte order mark (BOM)
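A quick Python illustration of the encodings above: byte lengths under UTF-8, and the BOM that UTF-16 prepends when the byte order is not named explicitly:

```python
# UTF-8 uses one to four bytes per character; UTF-16 marks byte order
# with a BOM (FF FE = little-endian, FE FF = big-endian).

for ch in ("A", "é", "日"):
    print(ch, "-> UTF-8:", ch.encode("utf-8").hex(" "))
# A  -> UTF-8: 41        (1 byte)
# é  -> UTF-8: c3 a9     (2 bytes)
# 日 -> UTF-8: e6 97 a5  (3 bytes)

print("日".encode("utf-16").hex(" "))     # ff fe e5 65 (BOM + code unit)
print("日".encode("utf-16-be").hex(" "))  # 65 e5 (explicit order, no BOM)
```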
NLS
NLS (National Language Support) = globalization + localization/translation
NLS map
What DataStage uses to convert between external and internal encodings. The internal encoding is UTF-8 for the Server engine and UTF-16 for the Parallel engine.
Figure: maps convert between the engine's internal Unicode (UTF-16), the Windows code page used by the client (logs, job monitor), and external data.
Various stages have an NLS Map tab (e.g., Sequential File, External Source, External Target, File Set):
- Define character set mappings (ustring <-> external file)
- Applied at the stage or individual field level
Unit summary
Having completed this unit, you should be able to:
Build a QualityStage investigation job for non-English data
View correctly formatted results in the DataStage/QualityStage data viewer
Objectives
After completing this module you will be able to:
Build jobs using the AV stage to parse and verify address data
AVI Stage
Provides:
- Transliteration (e.g., Japanese to Latin)
- Parsing
- Address validation
Supports real-time operation.
Components:
- AV stage
- Reference data: 16 geographies, purchased via the Passport system
- API libraries (Address Doctor)
Reference Data
- Required for the validation function only
- Requires an annual license agreement
- Location is pointed to by the AV stage
- Some databases are memory intensive
Load options:
- Partial preload: indexes loaded into memory
- Full preload: data loaded into memory; fast access, but adequate memory is required
- No preload: data accessed from disk; the slowest method
Job components
Screenshots: the AVI stage and its stage properties (function, navigation).
Transliterate mode: map input columns to address elements; multiple input columns can be mapped to one address element.
Parsing mode: input and output samples.
Validation mode: uses reference data from a database; map input columns to address elements; can activate an error link; creates a validation summary report. The sample output shows only two of the validation columns.
Unit summary
Having completed this unit, you should be able to:
Build jobs using the AV stage to parse and verify address data
Review the prebuilt job that validated USPREP data from the earlier lab