
Module 9

Enforcing Data Quality


Module Overview

• Introduction to Data Quality
• Using Data Quality Services to Cleanse Data
• Using Data Quality Services to Match Data
Lesson 1: Introduction to Data Quality

• What Is Data Quality, and Why Do You Need It?
• Data Quality Services Overview
• What Is a Knowledge Base?
• What Is a Domain?
• What Is a Reference Data Service?
• Creating a Knowledge Base
• Demonstration: Creating a Knowledge Base
What Is Data Quality, and Why Do You Need It?

• Business decisions should be made on trusted data
• Data quality issues in sources can be propagated
to the data warehouse:
  • Invalid data values
  • Inconsistencies
  • Duplicate business entities
Data Quality Services Overview

• DQS is a knowledge-based solution for:
  • Data Cleansing
  • Data Matching
• DQS Components:
  • Server
  • Client
  • Data Cleansing SSIS Transformation
What Is a Knowledge Base?

• Repository of knowledge about data:
  • Domains define values and rules for each field
  • Matching policies define rules for identifying duplicate
    records
What Is a Domain?

• Domains are specific to a data field
• Domains contain the rules for the data
• Domains can be individual or composite
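
DQS domains are defined in the Data Quality Client, not in code, so the following is only a conceptual sketch of what "a domain contains the rules for one data field" can mean. The `Domain` class, its attributes, and the "unknown" status label are hypothetical names chosen for illustration; they are not part of any DQS API.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical model of a single-field domain (not a DQS API)."""
    name: str
    valid_values: set = field(default_factory=set)
    # Maps a known synonym or misspelling to its preferred value.
    synonyms: dict = field(default_factory=dict)

    def cleanse(self, value):
        """Return (output_value, status) for one source value."""
        candidate = value.strip()
        if candidate in self.valid_values:
            return candidate, "correct"
        if candidate in self.synonyms:
            return self.synonyms[candidate], "corrected"
        # Illustrative label for values the domain knows nothing about.
        return candidate, "unknown"


country = Domain(
    name="Country/Region",
    valid_values={"United States", "United Kingdom"},
    synonyms={"USA": "United States", "UK": "United Kingdom"},
)

print(country.cleanse("USA"))
print(country.cleanse("United Kingdom"))
print(country.cleanse("Atlantis"))
```

A composite domain could be modeled along the same lines as a group of such objects evaluated together, for example the separate parts of an address.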
What Is a Reference Data Service?

• The Azure Marketplace hosts specialist data
cleansing providers
• Set up an account
• Subscribe to a reference service
• Map your domain to the reference service
Creating a Knowledge Base

Creating a knowledge base is an iterative process:
1. Knowledge discovery
2. Domain management
Demonstration: Creating a Knowledge Base

In this demonstration, you will see how to:
• Create a Knowledge Base
• Perform Knowledge Discovery
• Perform Domain Management
Lesson 2: Using Data Quality Services to Cleanse Data

• Creating a Data Cleansing Project
• Viewing Cleansed Data
• Demonstration: Cleansing Data
• Using the Data Cleansing Data Flow Transformation
Creating a Data Cleansing Project

1. Select a knowledge base
2. Map columns to domains
3. Review suggestions and corrections
4. Export results
Viewing Cleansed Data

• Output – The values for all fields after data cleansing
• Source – The original value for fields that were
mapped to domains and cleansed
• Reason – The reason the output value was
selected by the cleansing operation
• Confidence – The level of confidence that
Data Quality Services has in a corrected value
• Status – The status of the output column (correct
or corrected)
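
The columns above lend themselves to a simple review filter once results are exported. The sketch below assumes, for illustration only, that the export is a CSV file in which each cleansed field appears as `<Field>_Source`, `<Field>_Output`, `<Field>_Reason`, `<Field>_Confidence`, and `<Field>_Status`, and that confidence is a decimal between 0 and 1. The sample rows, the reason text, and the helper name `corrections_to_review` are all made up.

```python
import csv
import io

# Made-up sample standing in for an exported cleansing result.
SAMPLE = """City_Source,City_Output,City_Reason,City_Confidence,City_Status
New York,New York,Domain value,1.0,Correct
Nw York,New York,DQS cleansing,0.85,Corrected
Seatle,Seattle,DQS cleansing,0.72,Corrected
"""


def corrections_to_review(rows, field, max_confidence):
    """Yield (source, output, confidence) for corrected values
    whose confidence is at or below max_confidence."""
    for row in rows:
        status = row[f"{field}_Status"].strip().lower()
        confidence = float(row[f"{field}_Confidence"])
        if status == "corrected" and confidence <= max_confidence:
            yield row[f"{field}_Source"], row[f"{field}_Output"], confidence


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = list(corrections_to_review(rows, "City", max_confidence=0.8))
print(flagged)
```

A data steward could use a filter like this to concentrate manual review on the corrections DQS was least sure about.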
Demonstration: Cleansing Data

In this demonstration, you will see how to:
• Create a Data Cleansing Project
• View Cleansed Data
Using the Data Cleansing Data Flow Transformation

• Input data to be cleansed
• Select knowledge base and map columns to
domains
• Output cleansed columns
Lab A: Cleansing Data

• Exercise 1: Creating a DQS Knowledge Base
• Exercise 2: Using a DQS Project to Cleanse Data
• Exercise 3: Using DQS in an SSIS Package

Logon Information
Virtual machine: 20463C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa$$w0rd

Estimated Time: 30 Minutes


Lab Scenario

You have created an ETL solution for the
Adventure Works data warehouse, and invited
some data stewards to validate the process before
putting it into production. The data stewards have
noticed some data quality issues in the staged
customer data, and requested that you provide a
way for them to cleanse data so that the data
warehouse is based on consistent and reliable
data. The data stewards have provided you with
an Excel workbook containing some examples of
the issues found in the data.
Lesson 3: Using Data Quality Services to Match Data

• Creating a Matching Policy
• Creating a Data Matching Project
• Viewing Data Matching Results
• Demonstration: Matching Data
Creating a Matching Policy

• Define matching rules for business entities
• Rules match entities based on domains:
  • Similarity: Similar or exact match
  • Weight: Percentage to apply if match succeeds
  • Prerequisite: Mandatory domain match for rule to
    succeed
• If the combined weight of all matches meets or
exceeds the rule’s minimum matching score,
the entities are duplicates
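
The rule logic described above can be sketched in a few lines of Python. This is a simplified illustration, not how DQS is implemented: the similarity algorithm DQS uses is internal, so `difflib.SequenceMatcher` with an arbitrary 0.8 threshold stands in for it, each domain contributes its weight on an all-or-nothing basis, and the names `DomainRule`, `matching_score`, and `is_duplicate` are hypothetical. The sample rule and records are made up.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class DomainRule:
    domain: str
    similarity: str            # "exact" or "similar"
    weight: int = 0            # percentage added when the domain matches
    prerequisite: bool = False


def domain_matches(rule, a, b, similar_threshold=0.8):
    """Compare one domain of two records (stand-in similarity test)."""
    left = a[rule.domain].strip().lower()
    right = b[rule.domain].strip().lower()
    if rule.similarity == "exact":
        return left == right
    return SequenceMatcher(None, left, right).ratio() >= similar_threshold


def matching_score(rules, a, b):
    """Return the combined weight, or 0 if any prerequisite fails."""
    score = 0
    for rule in rules:
        matched = domain_matches(rule, a, b)
        if rule.prerequisite and not matched:
            return 0
        if matched:
            score += rule.weight
    return score


def is_duplicate(rules, a, b, min_score):
    return matching_score(rules, a, b) >= min_score


# Made-up rule: Country is mandatory, Email and LastName carry the weight.
rules = [
    DomainRule("Country", "exact", prerequisite=True),
    DomainRule("Email", "exact", weight=60),
    DomainRule("LastName", "similar", weight=40),
]
a = {"Country": "US", "Email": "kim@example.com", "LastName": "Abercrombie"}
b = {"Country": "US", "Email": "kim@example.com", "LastName": "Abercrombe"}
c = {"Country": "CA", "Email": "kim@example.com", "LastName": "Abercrombie"}

print(matching_score(rules, a, b), is_duplicate(rules, a, b, min_score=80))
print(matching_score(rules, a, c), is_duplicate(rules, a, c, min_score=80))
```

Record `c` agrees with `a` on every weighted domain, yet it is not a duplicate because the prerequisite domain differs.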
Creating a Data Matching Project

1. Select a knowledge base
2. Map columns to domains
3. Review match clusters
4. Export matches and survivors
  • Select survivorship rule:
    • Pivot record
    • Most complete and longest record
    • Most complete record
    • Longest record
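
To make the four survivorship rules concrete, here is a minimal sketch that picks one survivor from a cluster of matched records. It rests on several assumptions that the slide does not confirm: "complete" is taken to mean the number of populated fields, "longest" the total character count of populated fields, the combined rule ranks completeness first and length second, and ties go to the first record encountered. The record layout and function names are hypothetical.

```python
def completeness(record):
    """Count the fields that hold a non-empty value."""
    return sum(1 for v in record["fields"].values() if v not in (None, ""))


def total_length(record):
    """Sum the string lengths of all populated fields."""
    return sum(len(str(v)) for v in record["fields"].values()
               if v not in (None, ""))


# Each rule maps to a ranking key; the record with the highest key survives.
SURVIVORSHIP_RULES = {
    "pivot": lambda r: r["is_pivot"],
    "most_complete_and_longest": lambda r: (completeness(r), total_length(r)),
    "most_complete": completeness,
    "longest": total_length,
}


def pick_survivor(cluster, rule):
    """Return the surviving record of a cluster under the named rule."""
    return max(cluster, key=SURVIVORSHIP_RULES[rule])


# Made-up cluster of three matched records.
cluster = [
    {"id": 1, "is_pivot": True,
     "fields": {"Name": "J. Smith", "City": "", "Email": "js@example.com"}},
    {"id": 2, "is_pivot": False,
     "fields": {"Name": "Jonathan Q. Smith-Abernathy",
                "City": "Seattle, Washington", "Email": ""}},
    {"id": 3, "is_pivot": False,
     "fields": {"Name": "Jon Smith", "City": "Seattle",
                "Email": "js@example.com"}},
]

for rule in SURVIVORSHIP_RULES:
    print(rule, pick_survivor(cluster, rule)["id"])
```

With this sample data, different rules keep different records, which is why the choice of survivorship rule matters when exporting.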
Viewing Data Matching Results

• Cluster ID – Identifier for a cluster of matched records
• Record ID – Identifier for a matched record
• Matching Rule – The rule that produced the match
• Score – Combined weighting of match criteria
• Pivot Mark – A matched record chosen arbitrarily
by Data Quality Services as the pivot record for a
cluster
Demonstration: Matching Data

In this demonstration, you will see how to:
• Create a Matching Policy
• Create a Data Matching Project
• View Data Matching Results
Lab B: Deduplicating Data

• Exercise 1: Creating a Matching Policy
• Exercise 2: Using a DQS Project to Match Data

Logon Information
Virtual machine: 20463C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa$$w0rd

Estimated Time: 30 Minutes


Lab Scenario

You have created a DQS knowledge base and used it to cleanse
customer data. However, data stewards are concerned that the
staged customer data may include duplicate entries. For records
to be considered a match, the following criteria must be true:
• The Country/Region column must be an exact match.
• A total matching score of 80 or higher based on the following
weightings must be achieved:
  • An exact match of the Gender column has a weighting of 10.
  • An exact match of the City column has a weighting of 20.
  • An exact match of the EmailAddress column has a weighting
    of 30.
  • A similar FirstName column value has a weighting of 10.
  • A similar LastName column value has a weighting of 10.
  • A similar AddressLine1 column value has a weighting of 20.
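
It can help to work through the arithmetic of the weightings above before building the policy. The sketch below treats each column as either matching in full or not at all, which is a simplification made for illustration, and then lists which combinations of matching columns reach the minimum score of 80. The Country/Region prerequisite is left out because it carries no weight.

```python
from itertools import combinations

# Weightings taken from the lab scenario.
WEIGHTS = {
    "Gender": 10,
    "City": 20,
    "EmailAddress": 30,
    "FirstName": 10,
    "LastName": 10,
    "AddressLine1": 20,
}
MIN_SCORE = 80

total = sum(WEIGHTS.values())

# Every subset of columns whose combined weight reaches the minimum score.
passing = [
    set(combo)
    for size in range(1, len(WEIGHTS) + 1)
    for combo in combinations(WEIGHTS, size)
    if sum(WEIGHTS[column] for column in combo) >= MIN_SCORE
]

# Columns present in every passing subset are effectively mandatory.
required = set.intersection(*passing)

print("Total weight:", total)
print("Passing combinations:", len(passing))
print("Effectively required:", required)
```

Under this all-or-nothing simplification the weights sum to 100, so a pair of records can miss columns worth at most 20 and still match. Because EmailAddress is worth 30, two records with different email addresses can never reach 80, even though that column is not declared as a prerequisite.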
Module Review and Takeaways

• Review Question(s)
