Data Cleansing

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION

Why is Legacy Data “Dirty”?

• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Inappropriate Use of Address Lines
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems

Steps in Data Cleansing


• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Parsing

Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL

Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
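As a rough illustration of the parsing step, the sketch below splits the sample input record into labeled fields. The function name `parse_record` and the fixed five-line layout are assumptions made for this example only; real source files would need far more tolerant rules.

```python
def parse_record(lines):
    """Split a five-line name-and-address block into labeled fields.

    Assumes (for illustration only) the layout of the sample record:
    name + title, firm, location, number + street, city + state.
    """
    name_part, title = [s.strip() for s in lines[0].split(",", 1)]
    first, middle, last = name_part.split()
    number, street = lines[3].split(" ", 1)
    city, state = [s.strip() for s in lines[4].split(",", 1)]
    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": lines[1].strip(),
        "Location": lines[2].strip(),
        "Number": number,
        "Street": street,
        "City": city,
        "State": state,
    }


source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]

parsed = parse_record(source)
for field, value in parsed.items():
    print(f"{field}: {value}")
```

Each element ends up in its own named field, which is what lets the later steps work on one element at a time.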

Correcting

Correcting fixes the parsed individual data components using sophisticated data algorithms and secondary data sources.

Correcting
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL

Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
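One way to picture correction against a secondary data source is a lookup in a verified address table. The sketch below is only an assumption about how such a lookup might work: `ADDRESS_REFERENCE` and `correct_address` are hypothetical names, and the single reference entry simply reproduces the example record rather than any real postal database.

```python
# Hypothetical secondary data source: maps an address as parsed to its
# verified form. The single entry reproduces the example record.
ADDRESS_REFERENCE = {
    ("12800", "LAKE CALUMET", "HEDGEWISCH", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}


def correct_address(record, reference=ADDRESS_REFERENCE):
    """Return a copy of the record with address fields corrected.

    Records with no entry in the reference table are returned unchanged.
    """
    key = tuple(record[f].upper() for f in ("Number", "Street", "City", "State"))
    fix = reference.get(key)
    if fix is None:
        return dict(record)
    return {**record, **fix}


parsed = {
    "Number": "12800",
    "Street": "Lake Calumet",
    "City": "Hedgewisch",
    "State": "IL",
}
corrected = correct_address(parsed)
print(corrected)
```

The input record is left untouched, so the parsed and corrected versions can be compared or audited side by side.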

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Standardizing
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
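The conversion routines can be pictured as simple lookup tables applied field by field. The sketch below is a minimal illustration under that assumption; `TITLE_STANDARDS`, `STREET_WORD_STANDARDS` and `standardize` are hypothetical names, and the tables contain only the conversions visible in the example record.

```python
# Hypothetical conversion tables holding only the rules seen in the example.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_WORD_STANDARDS = {"South": "S.", "Drive": "Dr."}


def standardize(record):
    """Return a copy of the record with title and street in preferred form."""
    out = dict(record)
    # Whole-value lookup for titles; unknown titles pass through unchanged.
    out["Title"] = TITLE_STANDARDS.get(record["Title"], record["Title"])
    # Word-by-word lookup for street names.
    out["Street"] = " ".join(
        STREET_WORD_STANDARDS.get(word, word) for word in record["Street"].split()
    )
    return out


corrected = {"Title": "SLS MGR", "Street": "South Butler Drive", "City": "Chicago"}
standardized = standardize(corrected)
print(standardized)
```

Custom business rules would extend these tables; values without a rule are passed through as they are.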

Parsing, Correcting, Standardizing


Where the slide shows two values for a field, the second is given in parentheses.

NAME LINE
TITLE: Mr.
FIRST: William (Bill)
CONC.:
LAST: St. John
GENER.: III

STREET LINE
HSNO: 101
ST-DIR: S.
ST-NM: Main
ST-TYPE: St. (Strete)

GEOG. LINE
CITY: St. Louis (Sant.)
STATE: MO
POST: 63181 (63118)

Matching

Matching searches for and links records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplicates.

Match Patterns

Business  Street  Branch  Customer   City   Vendor  Pattern  Pattern
Name              Type    #/Tax ID          Code             I.D.
Exact     Exact   Exact   Exact      Exact  Exact   AAAAAA   P110
Exact     VClose  Exact   VClose     Exact  Blanks  ABAAA-   P115
Exact     VClose  Exact   Blanks     Exact  Exact   ABA-AA   P120
Exact     VClose  Close   Close      Exact  Exact   ABCCAA   S300
VClose    VClose  Exact   Close      Exact  Exact   BBACAA   S310
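A match pattern can be read as one grade per compared field, concatenated into a string and looked up in a rule table. The sketch below illustrates that idea only. The slides do not say how closeness is measured, so Python's `difflib.SequenceMatcher` is swapped in as a stand-in; the 0.9 and 0.75 thresholds, the extra grade `X` for "no match", and the names `grade` and `match_pattern` are all assumptions. The pattern IDs are copied from the table as shown.

```python
from difflib import SequenceMatcher

FIELDS = [
    "Business Name", "Street", "Branch Type",
    "Customer #/Tax ID", "City", "Vendor Code",
]

# Pattern string -> pattern I.D., copied from the match pattern table.
PATTERN_IDS = {
    "AAAAAA": "P110",
    "ABAAA-": "P115",
    "ABA-AA": "P120",
    "ABCCAA": "S300",
    "BBACAA": "S310",
}


def grade(a, b):
    """Grade one field pair: A exact, B very close, C close, - blank, X none."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return "-"
    if a == b:
        return "A"
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= 0.9:   # assumed threshold for "VClose"
        return "B"
    if ratio >= 0.75:  # assumed threshold for "Close"
        return "C"
    return "X"         # assumed grade; not part of the table


def match_pattern(rec1, rec2):
    """Concatenate the per-field grades into a pattern string."""
    return "".join(grade(rec1.get(f, ""), rec2.get(f, "")) for f in FIELDS)


rec1 = {
    "Business Name": "Regional Port Authority",
    "Street": "12800 S BUTLER DR",
    "Branch Type": "HQ",
    "Customer #/Tax ID": "83451234",
    "City": "Chicago",
    "Vendor Code": "V01",
}
# Same business with a slightly different street and no customer number.
rec2 = dict(rec1, **{"Street": "12800 S BUTLER DRV", "Customer #/Tax ID": ""})

pattern = match_pattern(rec1, rec2)
print(pattern, PATTERN_IDS.get(pattern))
```

Patterns that are not listed in the rule table return `None` from the lookup, which a real process would route to manual review or treat as a non-match.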

Matching
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398

Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
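The first-name match standards are what allow "Beth" in one source to be matched with "Elizabeth" in the other. The sketch below shows one possible way to use them; the function names and the hyphenated-surname rule are assumptions for illustration, not rules stated in the slides.

```python
def name_variants(record):
    """Return the first name plus its match standards, lower-cased."""
    return {record["First Name"].lower()} | {s.lower() for s in record["Standards"]}


def first_names_match(rec1, rec2):
    """True when either first name appears among the other's variants."""
    return (
        rec1["First Name"].lower() in name_variants(rec2)
        or rec2["First Name"].lower() in name_variants(rec1)
    )


def surnames_match(rec1, rec2):
    """Assumed rule: surnames match if they share any hyphen-separated part."""
    parts1 = set(rec1["Last Name"].lower().split("-"))
    parts2 = set(rec2["Last Name"].lower().split("-"))
    return bool(parts1 & parts2)


source1 = {
    "First Name": "Beth",
    "Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker",
}
source2 = {
    "First Name": "Elizabeth",
    "Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis",
}

is_match = first_names_match(source1, source2) and surnames_match(source1, source2)
print(is_match)
```

In practice these name checks would be combined with the address and firm comparisons before two records are treated as the same person.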

Consolidating

Consolidating analyzes and identifies relationships between matched records and consolidates/merges them into ONE representation.

Consolidating

Corrected Data (Data Source #1) + Corrected Data (Data Source #2)

Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
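Merging two matched records needs a rule for which value survives in each field. The sketch below uses deliberately simple, assumed rules: keep the longer (more complete) value, and keep both first names with the alternate in parentheses, as in the consolidated record. The function name `consolidate` and these survivorship rules are hypothetical; real rules would come from the business.

```python
def consolidate(primary, secondary):
    """Merge two matched records into one representation.

    Assumed rules: the longer value wins, ties go to the primary record,
    and differing first names are kept as "primary (secondary)".
    """
    fields = list(primary) + [f for f in secondary if f not in primary]
    merged = {}
    for field in fields:
        a = primary.get(field, "")
        b = secondary.get(field, "")
        if field == "First Name" and a and b and a != b:
            merged[field] = f"{a} ({b})"
        else:
            merged[field] = b if len(b) > len(a) else a
    return merged


source1 = {
    "Pre-name": "Ms.",
    "First Name": "Beth",
    "Middle Name": "Christine",
    "Last Name": "Parker",
    "Title": "Sales Mgr.",
    "Street": "S. Butler Dr.",
}
source2 = {
    "Pre-name": "Ms.",
    "First Name": "Elizabeth",
    "Middle Name": "Christine",
    "Last Name": "Parker-Lewis",
    "Title": "",
    "Street": "S. Butler Dr., Suite 2",
    "Phone": "708-555-1234",
    "Fax": "708-555-5678",
}

merged = consolidate(source1, source2)
for field, value in merged.items():
    print(f"{field}: {value}")
```

Fields present in only one source, such as the phone and fax numbers, carry over unchanged, so the merged record is at least as complete as either input.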

Consolidating

William Jones, Janet Jones, Karen Jones, William Jones Jr.

Legacy Systems View (3 Clients)

Account No. 83451234
Policy No. ME309451-2
Transaction B498/97

The Reality – ONE Client

Account No. 83451234
Policy No. ME309451-2
Transaction B498/97