20463D 09

Module 9
Enforcing Data Quality

Module Overview
• Introduction to Data Quality
• Using Data Quality Services to Cleanse Data
• Using Data Quality Services to Match Data
Lesson 1: Introduction to Data Quality
• What Is Data Quality, and Why Do You Need It?
• Data Quality Services Overview
• What Is a Knowledge Base?
• What Is a Domain?
• What Is a Reference Data Service?
• Creating a Knowledge Base
• Demonstration: Creating a Knowledge Base
What Is Data Quality, and Why Do You Need It?
• Business decisions should be made on trusted data
• Data quality issues in sources can be propagated to the data warehouse:
• Invalid data values
• Inconsistencies
• Duplicate business entities
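The three issue types listed above can be made concrete with a small check over staged rows. This is a minimal sketch only: the sample rows, the synonym map, and the function names are all invented for illustration and are not part of Data Quality Services.

```python
from collections import Counter

# Hypothetical staged customer rows; all values are invented for illustration.
rows = [
    {"id": 1, "name": "Ann Lee", "country": "United States", "email": "ann@example.com"},
    {"id": 2, "name": "Bob Roy", "country": "USA", "email": "bob@example"},
    {"id": 3, "name": "Ann Lee", "country": "United States", "email": "ann@example.com"},
]

# Assumed synonym map: non-standard spellings and the value they should become.
COUNTRY_SYNONYMS = {"USA": "United States"}


def invalid_emails(rows):
    """Invalid data values: e-mail addresses with no dot after the '@'."""
    return [r["id"] for r in rows if "." not in r["email"].partition("@")[2]]


def inconsistent_countries(rows):
    """Inconsistencies: a known synonym used instead of the standard value."""
    return [r["id"] for r in rows if r["country"] in COUNTRY_SYNONYMS]


def duplicate_entities(rows):
    """Duplicate business entities: rows sharing the same name and e-mail."""
    counts = Counter((r["name"], r["email"]) for r in rows)
    return [r["id"] for r in rows if counts[(r["name"], r["email"])] > 1]


print(invalid_emails(rows))          # ids with an invalid e-mail value
print(inconsistent_countries(rows))  # ids with a non-standard country spelling
print(duplicate_entities(rows))      # ids that appear to describe the same entity
```

Hand-written checks like these do not scale across many fields, which is the motivation for keeping the rules in a reusable knowledge base.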
Data Quality Services Overview
• DQS is a knowledge-based solution for:
• Data Cleansing
• Data Matching
• DQS Components:
• Server
• Client
• Data Cleansing SSIS Transformation
What Is a Knowledge Base?
• Repository of knowledge about data:
• Domains define values and rules for each field
• Matching policies define rules for identifying duplicate records
What Is a Domain?
• Domains are specific to a data field
• Domains contain the rules for the data
• Domains can be individual or composite
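One way to picture the domain concept is as a small object that holds the known values and rules for a field, with a composite domain grouping several individual ones. The classes below are a hypothetical model for illustration only; they do not reflect how DQS stores or evaluates domains.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical single-field domain: known values, synonyms, one length rule."""
    name: str
    valid_values: set = field(default_factory=set)   # empty set means "any value"
    synonyms: dict = field(default_factory=dict)     # synonym -> standard value
    max_length: int = 50                             # example domain rule

    def check(self, value):
        """Return (standardized value, whether it satisfies the domain)."""
        value = value.strip()
        value = self.synonyms.get(value, value)
        in_values = not self.valid_values or value in self.valid_values
        return value, in_values and len(value) <= self.max_length


@dataclass
class CompositeDomain:
    """Hypothetical composite domain: several individual domains checked together."""
    name: str
    parts: list

    def check(self, values):
        results = [d.check(v) for d, v in zip(self.parts, values)]
        return [v for v, _ in results], all(ok for _, ok in results)


city = Domain("City")
state = Domain("State", valid_values={"WA", "OR"}, synonyms={"Washington": "WA"})
address = CompositeDomain("Address", [city, state])

print(address.check(["Seattle", "Washington"]))
print(address.check(["Seattle", "Texas"]))
```

The first call standardizes "Washington" to "WA" and passes; the second fails because "Texas" is not among the values this assumed State domain knows.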
What Is a Reference Data Service?
• The Azure Marketplace hosts specialist data cleansing providers
• Set up an account
• Subscribe to a reference service
• Map your domain to the reference service
Creating a Knowledge Base
Creating a knowledge base is an iterative process:
1. Knowledge discovery
2. Domain management
Demonstration: Creating a Knowledge Base
In this demonstration, you will see how to:
• Create a Knowledge Base
• Perform Knowledge Discovery
• Perform Domain Management
Lesson 2: Using Data Quality Services to Cleanse Data
• Creating a Data Cleansing Project
• Viewing Cleansed Data
• Demonstration: Cleansing Data
• Using the Data Cleansing Data Flow Transformation
Creating a Data Cleansing Project
1. Select a knowledge base
2. Map columns to domains
3. Review suggestions and corrections
4. Export results
Viewing Cleansed Data
• Output – The values for all fields after data cleansing
• Source – The original value for fields that were mapped to domains and cleansed
• Reason – The reason the output value was selected by the cleansing operation
• Confidence – An indication of the confidence Data Quality Services estimates for corrected values
• Status – The status of the output column (correct or corrected)
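The column set above can be illustrated with a toy cleansing function that returns one row per input value. This is a sketch under stated assumptions: the known values, the reason strings, the fixed confidence numbers, and the "New" status for unrecognized values are placeholders chosen for the example, not output captured from DQS.

```python
# Assumed knowledge for a single Country domain; values are invented.
KNOWN_VALUES = {"United States", "Canada"}
SYNONYMS = {"USA": "United States", "US": "United States"}


def cleanse(source):
    """Return a dict shaped like the cleansing output columns described above."""
    value = source.strip()
    if value in KNOWN_VALUES:
        # Value is already a known domain value: nothing to change.
        return {"Source": source, "Output": value, "Reason": "Domain value",
                "Confidence": 1.0, "Status": "Correct"}
    if value in SYNONYMS:
        # Value is a known synonym: replace it with the standard value.
        return {"Source": source, "Output": SYNONYMS[value],
                "Reason": "Corrected to standard value",
                "Confidence": 1.0, "Status": "Corrected"}
    # Value is unknown to this toy knowledge base (status label is an assumption).
    return {"Source": source, "Output": value, "Reason": "Unknown value",
            "Confidence": 0.0, "Status": "New"}


for raw in ["Canada", "USA", "Atlantis"]:
    print(cleanse(raw))
```

Keeping Source alongside Output is what lets a data steward review each correction before the results are exported.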
Demonstration: Cleansing Data
In this demonstration, you will see how to:
• Create a Data Cleansing Project
• View Cleansed Data
Using the Data Cleansing Data Flow Transformation
• Input data to be cleansed
• Select knowledge base and map columns to domains
• Output cleansed columns
Lab A: Cleansing Data
• Exercise 1: Creating a DQS Knowledge Base
• Exercise 2: Using a DQS Project to Cleanse Data
• Exercise 3: Using DQS in an SSIS Package
Logon Information
Virtual machine: 20463C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa$$w0rd
Estimated Time: 30 Minutes

Lab Scenario
You have created an ETL solution for the
Adventure Works data warehouse, and invited
some data stewards to validate the process before
putting it into production. The data stewards have
noticed some data quality issues in the staged
customer data, and requested that you provide a
way for them to cleanse data so that the data
warehouse is based on consistent and reliable
data. The data stewards have provided you with
an Excel workbook containing some examples of
the issues found in the data.
Lesson 3: Using Data Quality Services to Match Data
• Creating a Matching Policy
• Creating a Data Matching Project
• Viewing Data Matching Results
• Demonstration: Matching Data
Creating a Matching Policy
• Define matching rules for business entities
• Rules match entities based on domains:
• Similarity: Similar or exact match
• Weight: Percentage to apply if match succeeds
• Prerequisite: Mandatory domain match for rule to succeed
• If the combined weight of all matches meets or exceeds the rule’s minimum matching score, the entities are duplicates
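The scoring logic described above can be sketched as follows. This is a simplified model, not the DQS matching algorithm: the rule layout, the 0.8 similarity threshold, the sample records, and the use of Python's difflib for "similar" comparisons are all assumptions made for the example, and each domain contributes either its full weight or nothing.

```python
from difflib import SequenceMatcher

# Hypothetical rule: one prerequisite domain plus weighted criteria.
RULE = {
    "min_score": 80,
    "prerequisites": ["Country"],
    "criteria": [("Email", "exact", 60), ("Name", "similar", 40)],
    "similar_threshold": 0.8,   # assumed cut-off for a "similar" match
}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, rule=RULE):
    """Sum the weights of the criteria that match; 0 if a prerequisite fails."""
    for domain in rule["prerequisites"]:
        if rec_a[domain] != rec_b[domain]:
            return 0
    score = 0
    for domain, kind, weight in rule["criteria"]:
        if kind == "exact":
            matched = rec_a[domain] == rec_b[domain]
        else:
            matched = similarity(rec_a[domain], rec_b[domain]) >= rule["similar_threshold"]
        if matched:
            score += weight
    return score


def is_duplicate(rec_a, rec_b, rule=RULE):
    """Entities are duplicates when the score meets or exceeds the minimum."""
    return match_score(rec_a, rec_b, rule) >= rule["min_score"]


a = {"Country": "US", "Email": "jsmith@example.com", "Name": "Jon Smith"}
b = {"Country": "US", "Email": "jsmith@example.com", "Name": "John Smith"}
c = {"Country": "CA", "Email": "jsmith@example.com", "Name": "Jon Smith"}

print(match_score(a, b), is_duplicate(a, b))
print(match_score(a, c), is_duplicate(a, c))
```

Records a and b share a country and e-mail and have similar names, so they reach the minimum score. Records a and c fail the prerequisite, so the other criteria are never considered.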
Creating a Data Matching Project
1. Select a knowledge base
2. Map columns to domains
3. Review match clusters
4. Export matches and survivors
• Select survivorship rule:
• Pivot record
• Most complete and longest record
• Most complete record
• Longest record
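The four survivorship options above can be sketched as selection functions over a cluster of matched records. The definitions used here are assumptions for illustration: "complete" is taken to mean the number of non-empty fields and "longest" the total character count, which may differ from how DQS measures them. The cluster data is invented.

```python
def completeness(fields):
    """Number of populated (non-empty) fields in a record."""
    return sum(1 for v in fields.values() if v not in (None, ""))


def total_length(fields):
    """Total number of characters across all field values."""
    return sum(len(str(v)) for v in fields.values() if v is not None)


def pick_survivor(cluster, rule):
    """Choose the surviving record of a cluster under the named rule."""
    if rule == "pivot":
        return next(m for m in cluster if m["pivot"])
    if rule == "most_complete":
        return max(cluster, key=lambda m: completeness(m["fields"]))
    if rule == "longest":
        return max(cluster, key=lambda m: total_length(m["fields"]))
    if rule == "most_complete_and_longest":
        # Completeness decides first; length breaks ties.
        return max(cluster, key=lambda m: (completeness(m["fields"]),
                                           total_length(m["fields"])))
    raise ValueError(f"unknown survivorship rule: {rule}")


# Hypothetical cluster of three matched records.
cluster = [
    {"pivot": True,  "fields": {"name": "A. Lee", "city": "", "email": "a@x.com"}},
    {"pivot": False, "fields": {"name": "Annabelle Lee-Smith", "city": "Rome", "email": ""}},
    {"pivot": False, "fields": {"name": "Ann", "city": "Rome", "email": "a@x.com"}},
]

for rule in ["pivot", "most_complete", "longest", "most_complete_and_longest"]:
    print(rule, "->", pick_survivor(cluster, rule)["fields"]["name"])
```

With this sample cluster, the rules do not all agree on the survivor, which is why the choice of rule matters when exporting.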
Viewing Data Matching Results
• Cluster ID – Identifier for a cluster of matched records
• Record ID – Identifier for a matched record
• Matching Rule – The rule that produced the match
• Score – Combined weighting of match criteria
• Pivot Mark – A matched record chosen arbitrarily by Data Quality Services as the pivot record for a cluster
Demonstration: Matching Data
In this demonstration, you will see how to:
• Create a Matching Policy
• Create a Data Matching Project
• View Data Matching Results
Lab B: Deduplicating Data
• Exercise 1: Creating a Matching Policy
• Exercise 2: Using a DQS Project to Match Data
Logon Information
Virtual machine: 20463C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa$$w0rd
Estimated Time: 30 Minutes

Lab Scenario
You have created a DQS knowledge base and used it to cleanse
customer data. However, data stewards are concerned that the
staged customer data may include duplicate entries. For records
to be considered a match, the following criteria must be true:
• The Country/Region column must be an exact match.
• A total matching score of 80 or higher based on the following weightings must be achieved:
• An exact match of the Gender column has a weighting of 10.
• An exact match of the City column has a weighting of 20.
• An exact match of the EmailAddress column has a weighting of 30.
• A similar FirstName column value has a weighting of 10.
• A similar LastName column value has a weighting of 10.
• A similar AddressLine1 column value has a weighting of 20.
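The arithmetic behind these criteria can be checked with a short calculation. The weights and the threshold of 80 come from the scenario above; the function name and the all-or-nothing treatment of each column are simplifying assumptions, and the score DQS reports for partially similar values may differ.

```python
# Weights and threshold as stated in the lab scenario.
WEIGHTS = {
    "Gender": 10,
    "City": 20,
    "EmailAddress": 30,
    "FirstName": 10,
    "LastName": 10,
    "AddressLine1": 20,
}
MIN_SCORE = 80


def lab_score(country_matches, matched_columns):
    """Simplified score: 0 unless Country/Region matches, else the sum of weights."""
    if not country_matches:
        return 0
    return sum(WEIGHTS[column] for column in matched_columns)


everything = list(WEIGHTS)
all_but_city = [c for c in WEIGHTS if c != "City"]
all_but_email = [c for c in WEIGHTS if c != "EmailAddress"]

print(lab_score(True, everything))      # every column matches
print(lab_score(True, all_but_city))    # City differs
print(lab_score(True, all_but_email))   # EmailAddress differs
print(lab_score(False, everything))     # Country/Region differs
```

The weights total 100, so under this simplified model a pair with a different City still reaches 80, while a pair with a different EmailAddress can score at most 70 and is not treated as a match.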
Module Review and Takeaways
• Review Question(s)
