
Data Quality and

Master Data
Management with
Microsoft SQL Server
2008 R2
Dejan Sarka, Davide Mauri

Advanced Techniques & Tools for Storing,


Maintaining, Cleansing, and Merging Master Data

Table of Contents
Table of Contents ............................................................................................................................. 1
Foreword .......................................................................................................................................... 6
Acknowledgements .......................................................................................................................... 8
About the Authors .......................................................................................................................... 11
Chapter 1: Master Data Management ........................................................................................... 12
Types of Data .............................................................................................................................. 13
What Is Master Data?............................................................................................................. 14
Master Data Management ..................................................................................................... 16
MDM Challenges ................................................................................................................ 21
Data Models ............................................................................................................................... 23
Relational Model .................................................................................................................... 23
Dimensional Model ................................................................................................................ 31
Other Data Formats and Storages .......................................................................................... 33
Data Quality ................................................................................................................................ 36
Data Quality Dimensions ........................................................................................................ 36
Completeness ..................................................................................................................... 37
Accuracy ............................................................................................................................. 38
Information......................................................................................................................... 38
Consistency......................................................................................................................... 39
Data Quality Activities ............................................................................................................ 42
Master Data Services and Other SQL Server Tools .................................................................... 47

Master Data Services .............................................................................................................. 47
Entities ................................................................................................................................ 49
Attributes............................................................................................................................ 49
Members ............................................................................................................................ 49
Hierarchies.......................................................................................................................... 49
Collections .......................................................................................................................... 50
Versions .............................................................................................................................. 50
Other SQL Server Tools for MDM........................................................................................... 51
Summary .................................................................................................................................... 54
References .................................................................................................................................. 55
Chapter 2: Master Data Services Concepts and Architecture ........................................................ 56
Master Data Services Setup ....................................................................................................... 57
Installation of MDS Components and Tools ........................................................................... 57
Setup of MDS Database.......................................................................................................... 58
Setup of MDS Web Application .............................................................................................. 59
Master Data Manager Web Application .................................................................................... 62
Explorer .................................................................................................................................. 63
Version Management ............................................................................................................. 63
Integration Management ....................................................................................................... 63
System Administration ........................................................................................................... 64
User and Groups Permissions ................................................................................................ 64
Models ........................................................................................................................................ 65
Models .................................................................................................................................... 65

Entities and Attributes ........................................................................................................... 66
Hierarchies.............................................................................................................................. 77
Derived Hierarchy............................................................................................................... 78
Explicit Hierarchy ................................................................................................................ 79
Collections .............................................................................................................................. 84
Business Rules ........................................................................................................................ 85
Importing, Exporting and Managing Data .................................................................................. 89
Import Data ............................................................................................................................ 89
Managing Data ....................................................................................................................... 91
Export Data ............................................................................................................................. 93
Multiple Versions of Data ........................................................................................................... 95
MDS Database Schema .............................................................................................................. 98
Staging Tables ....................................................................................................................... 101
Summary ..................................................................................................................................103
References ................................................................................................................................ 104
Chapter 3: Data Quality and SQL Server 2008 R2 Tools ............................................................... 105
Measuring the Completeness ..................................................................................................106
Attribute Completeness .......................................................................................................107
XML Data Type Attribute Completeness .............................................................................. 109
Simple Associations among NULLs ....................................................................................... 112
Tuple and Relation Completeness........................................................................................ 118
Multivariate Associations among NULLs .............................................................................. 120
Profiling the Accuracy............................................................................................................... 126

Numeric, Date and Discrete Attributes Profiling .................................................................128
Strings Profiling .................................................................................................................... 131
Other Simple Profiling ..........................................................................................................134
Multivariate Accuracy Analysis ............................................................................................ 136
Measuring Information ............................................................................................................142
Using Other SQL Server Tools for Data Profiling ......................................................................145
SSAS Cubes ........................................................................................................................... 145
PowerPivot for Excel 2010 ...................................................................................................149
SSIS Data Profiling Task ........................................................................................................153
Excel Data Mining Add-Ins....................................................................................................156
Clean-Up ............................................................................................................................... 160
Summary ..................................................................................................................................161
References ................................................................................................................................ 162
Chapter 4: Identity Mapping and De-Duplicating ........................................................................163
Identity Mapping ...................................................................................................................... 165
Problems............................................................................................................................... 166
T-SQL and MDS String Similarity Functions ..........................................................................166
Preparing the Data ...........................................................................................................169
Testing the String Similarity Functions .............................................................................174
Optimizing Mapping with Partitioning .............................................................................176
Optimizing Mapping with nGrams Filtering .....................................................................180
Comparing nGrams Filtering with Partitioning ................................................................ 188
Microsoft Fuzzy Components ...................................................................................................190

Fuzzy Algorithm Description ................................................................................................ 190
Configuring SSIS Fuzzy Lookup ............................................................................................. 192
Testing SSIS Fuzzy Lookup ....................................................................................................200
Fuzzy Lookup Add-In for Excel.............................................................................................. 201
De-Duplicating .......................................................................................................................... 203
Preparing for Fuzzy Grouping ............................................................................................... 204
SSIS Fuzzy Grouping Transformation ................................................................................... 206
Testing SSIS Fuzzy Grouping .................................................................................................211
Clean-Up ............................................................................................................................... 213
Summary ..................................................................................................................................215
References ................................................................................................................................ 216
Index .............................................................................................................................................217


Foreword
Dejan Sarka
"If all men were just, there would be no need of valor," said Agesilaus, Spartan king (444-360 BC). Just from this quote we can realize that Agesilaus was not too keen on fighting. Actually, Agesilaus never hurt his enemies without just cause, and he never took any unjust advantages.
Nevertheless, the ancient world was just as imperfect as the contemporary world is, and
Agesilaus had to fight his share of battles.
If everyone always inserted correct data into a system, there would be no need for
proactive constraints or for reactive data cleansing. We could store our data in text files, and
maybe the only application we would need would be Notepad. Unfortunately, in real life, things
go wrong. People are prone to make errors. Sometimes our customers do not provide us with
accurate and timely data. Sometimes an application has a bug and makes errors in the data.
Sometimes end users unintentionally make a transposition of letters or numbers. Sometimes we
have more than one application in an enterprise, and in each application we have slightly
different definitions of the data. (We could continue listing data problems forever.)
A good and suitable data model, like the Relational Model, enforces data integrity through the
schema and through constraints. Unfortunately, many developers still do not understand the
importance of a good data model. Nevertheless, even with an ideal model, we cannot enforce
data quality. Data integrity means that the data is in accordance with our business rules; it does
not mean that our data is correct.
Not all data is equally important. In an enterprise, we can always find the key data, such as
customer data. This key data is the most important asset of a company. We call this kind of data
master data.

This book deals with master data. It explains how we can recognize our master data. It stresses
the importance of a good data model for data integrity. It shows how we can find areas of bad
or suspicious data. It shows how we can proactively enforce better data quality and make an
authoritative master data source through a specialized Master Data Management application. It
also shows how we can tackle the problems with duplicate master data and the problems with
identity mapping from different databases in order to create a unique representation of the
master data.
For all the tasks mentioned in this book, we use the tools that are available in the Microsoft SQL
Server 2008 R2 suite. In order to achieve our goal, good quality of our data, nearly any part of the suite turns out to be useful. This is not a beginners' book. We, the authors, suppose that you, the readers, have quite good knowledge of the SQL Server Database Engine, .NET, and other tools
from the SQL Server suite.
Achieving good quality of your master data is not an easy task. We hope this book will help you
with this task and serve you as a guide for practical work and as a reference manual whenever
you have problems with master data.


Acknowledgements
Dejan Sarka
This book would never have been finished without the help and support of several people. I
need to thank them for their direct and indirect contributions, their advice, their
encouragement, and other kinds of help.
In the first place, I have to mention my coauthor and colleague from SolidQ, Davide Mauri. As an
older guy, I have followed his career over several years. I am amazed by the amount of
knowledge he gained in the past few years. He has become a top speaker and recognized author
of several books and articles. Nevertheless, he retained all the vigor of youth and is still full of
ideas. Davide, I am so proud of you. I always enjoy working with you, and I am looking forward
to our further cooperation.
Together with three other colleagues from SolidQ, Itzik Ben-Gan, Herbert Albert, and Gianluca Hotz, we form a gang of four inside the company, called the Quartet. It is not just an unofficial group; our
official duty in the company is to certify and confirm places for company parties. Our endless
discussions during conferences, hikes, or time spent in pubs are an invaluable source of insight
and enlightenment. Besides general help through our socializing, all three of them have made a
concrete contribution to this book.
Herbert helped with the technical review. Gianluca didn't review the book officially; nevertheless, he read it. His old man's grumbling was always a sign to me that I had written something inaccurate or even wrong. Itzik was not directly involved in this book. However, this book would never have been written without him. This is the ninth book I am contributing to so far. I would never even have started writing if Itzik hadn't pushed me and involved me in coauthoring his book seven years
ago. Itzik has invited me to contribute to four of his books so far, and we are already starting the
fifth one together. To my friends from the Quartet, thank you for all of the precious time we are
spending together!
SolidQ is not just a regular company. First of all, we are all friends. Even more, we have the best
CEO in the world, Fernando Guerrero. Fernando, thank you for inviting me to become a part of
this wonderful group of people, and thank you for all of your patience with me, and for all of
your not only technical but also life advice! And thanks to all the other members of this company; because together we number more than 150 worldwide experts in SQL Server and related technologies joining our efforts, I simply cannot list every single colleague.
Besides concrete work with SQL Server, I am also involved in theoretical research in The Data
Quality Institute. Dr. Uroš Godnov helped me with my first steps in data quality, and he is continuing to educate me. Although he has some problems with his health, he is always available to me. Uroš, forget what Agesilaus said! We need courage not just because of our
enemies; sometimes we need courage because of ourselves. Stay as brave as you are forever!
I cannot express enough how much I appreciate being a member of the Microsoft Most
Valuable Professional (MVP) program. Through this program, MVPs have direct contact with the
SQL Server team. The help of the team in general, and of the Master Data Services part of the team in particular, is extraordinary. No matter how busy they are with developing a new version of SQL
Server, they always take time to respond to our questions.
Finally, I have to thank my family and friends. Thank you for understanding the reduced time I
could afford to spend with you! However, to be really honest, I did not miss too many parties
and beers.


Davide Mauri
Dejan already wrote a lot about our friendship, our company, and everything that allows us to
continue to enjoy our work every day. But even at the cost of seeming repetitious, I'd also like
to thank all my SolidQ colleagues, who are not only colleagues but friends above all. I learn a lot
from you, and each discussion is mind-opening for me. I cannot thank you enough for this!
I would like to say a big thank you to Dejan, Itzik, Gianluca, and Fernando for being examples, truly mentors, not only through words but also with facts: You really give me inspiration and you
show me each day, with your excellent work, your determination, your honesty, and your
integrity, the path one has to follow to be someone who can make a difference, from a
professional and ethical point of view. I couldn't have found better colleagues, friends, and
partners. Thanks!
I'd also like to thank my Italian team specifically. Guys, we're really creating something new,
setting new standards, finding clever ways to solve business problems, making customers
happy, and giving them the maximum quality the market can offer, while being happy ourselves
at the same time. This is not easy, but we're able to work together as a well-trained team,
enjoying what we do each day. Thanks!
Last but not least, of course, a big thanks also to my wife Olga and my newborn son Riccardo:
You really are the power behind me. Olga, thanks for your patience and for your support!
Riccardo, thanks for your smiles that allow me to see the world with different eyes! Thank you!


About the Authors


Dejan Sarka
Dejan Sarka focuses on development of database and business intelligence
applications. Besides projects, he spends about half of his time on training and mentoring. He
is the founder of the Slovenian SQL Server and .NET Users Group. Dejan is the main author or
coauthor of nine books about databases and SQL Server so far. Dejan also developed three
courses for SolidQ: Data Modeling Essentials, Data Mining, and Master Data Management. As an
MCT, Dejan speaks at many international events, including conferences such as PASS, TechEd,
and DevWeek. He is also indispensable at regional Microsoft events. In addition, he is a co-organizer of a top-level technical conference named Bleeding Edge.

Davide Mauri
Davide Mauri is a SolidQ Mentor and a member of the Board of Directors of SolidQ Italia. A well-known Microsoft SQL Server MVP, MCP, MCAD, MCDBA, and MCT, as well as an acclaimed
speaker at international SQL Server conferences, Davide enjoys working with T-SQL and
Relational Modeling and studying the theory behind these technologies. In addition, he is well-grounded in Reporting Services, .NET, and object-oriented principles, and he has a deep
knowledge of Integration Services and Analysis Services, giving him a well-rounded area of
expertise around the Microsoft Data Platform, allowing him to have the correct vision and
experience to handle development of complex business intelligence solutions. He is a course
author for SolidQ, including seminars about Upgrading to SQL Server 2008, co-author of the
book Smart Business Intelligence Solutions with SQL Server 2008, and author of the well-known
DTExec replacement tool DTLoggedExec (http://dtloggedexec.davidemauri.it).


Chapter 1: Master Data Management


Dejan Sarka
Master Data Management (MDM), the process of creating and maintaining master data, is one
of the most challenging tasks in IT for an enterprise. In this chapter we will define what exactly
the term master data means and what the most important challenges are, including data quality
problems and ways to improve them.
When talking about data quality, we cannot skip data integrity and data models. The Relational Model is still the best model currently in use by line of business applications for enforcing data integrity. Nevertheless, it is not very suitable for analyzing data. Additional models for analytical systems have evolved; the most widely used is the Dimensional Model. We are going to briefly introduce both the Relational and the Dimensional Model, and show where the master data is in these models. In addition to data integrity, we are also going to deal with other data quality issues.
After these introductions, we are going to put MDS in the picture. We are going to see that MDS is an integral part of MDM and that there are also other SQL Server tools that can help us with MDM. SQL Server Integration Services (SSIS) and SQL Server Analysis Services (SSAS) can play an important role in MDM activity. We are also going to define how MDS can help improve data in different kinds of applications, including On-Line Transactional Processing (OLTP), On-Line Analytical Processing (OLAP), and Customer Relationship Management (CRM) applications.
In this chapter, we are going to introduce the following:

Master data and Master Data Management;

Data models;

Data quality;

Master Data Services and other SQL Server tools.


Types of Data
In an average company, many different types of data appear. These types include:

Metadata - data about data. Metadata includes database schemas for transactional and analytical applications, XML document schemas, report definitions, additional database table and column descriptions stored by using SQL Server-provided extended properties or custom tables, application configuration data, and similar.

Transactional data - maintained by line of business OLTP applications. In this context, we use the term transactional to mean business transactions, not database management system transactions. This data includes, for example, customer orders, invoices, insurance claims, data about manufacturing stages, deliveries, monetary transactions, and similar. In short, this is OLTP data about events.

Hierarchical data - typically appears in analytical applications. Relationships between data are represented in hierarchies. Some hierarchies represent the intrinsic structure of the data; they are natural for the data. An example is a product taxonomy. Products have subcategories, which are in categories. We can have multiple levels of hierarchies. Such hierarchies are especially useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems. This is a very common way of analyzing data in OLAP applications.

Semi-structured data - typically in XML form. XML data can appear in standalone files, or as a part (a column in a table) of a database. Semi-structured data is useful where the metadata, i.e. the schema, changes frequently, or when you do not need a detailed relational schema. In addition, XML is widely used for data exchange. (A short sketch of XML stored in a table column follows this list.)

Unstructured data - involves all kinds of documents with instructions, company portals, magazine articles, e-mails, and similar. This data can appear in a database, in a file, or even in printed material.
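As a minimal sketch of semi-structured data in SQL Server (the table and column names here are purely illustrative, not from any sample database used in this book), the xml data type lets the internal structure vary while the value() XQuery method reads data out of it:

-- Hypothetical table with a semi-structured specification column
CREATE TABLE dbo.ProductSpecifications
(
    ProductId     INT NOT NULL PRIMARY KEY,
    Specification XML NULL   -- the internal structure can change without altering the table
);

INSERT INTO dbo.ProductSpecifications (ProductId, Specification)
VALUES (1, N'<spec><color>Red</color><weightKg>1.2</weightKg></spec>');

-- Reading a value from the XML with the value() XQuery method
SELECT ProductId,
       Specification.value('(/spec/color)[1]', 'NVARCHAR(20)') AS Color
FROM dbo.ProductSpecifications;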


Finally, we have master data.

What Is Master Data?


Intuitively, everybody has an idea of what kind of data master data would be. Let us give you
some more precise definitions and examples.
In the Relational Model, a table represents a predicate, and a row represents a proposition
about a subject, object or event from the real world. We start building a relational database by
describing the business problems in sentences. Nouns in the sentences describe real-world entities that are of interest for the business, like customers, products, employees, and similar. Verbs in the sentences describe relationships between nouns, more precisely the roles the nouns play in the relationships. Nouns define our master data. Verbs lead to transactional data. From a
description of a relational database, you can easily find nouns. These critical nouns typically
define one of the following:

People, including customers, suppliers, employees, and sales representatives

Things, including products, equipment, assets, and stores

Concepts, including contracts, licenses, and bills of material

Places, including company locations, and customer geographic divisions

Any of these entity sets can be further divided into specialized subsets. For example, a company can segment its customers based on previous sales into premier and other customers, or based on customer type, such as persons and companies.
For analytical applications, data is many times organized in a Dimensional Model. A popular name for the Dimensional Model is Star Schema (although, to be precise, a dimensionally modeled database can include multiple star and snowflake schemas). The name comes from the central fact table and the surrounding dimension tables, or dimensions. Fact tables hold the data we are measuring, namely measures. Dimension attributes are used for pivoting fact data; they give measures some meaning. Dimensions give context to measures. Fact tables are populated from



transactional data. Dimensions are populated from entities that represent nouns in a relational
database description. Therefore, in a Dimensional Model, dimensions are the master data.
As we already mentioned, OLAP analyses typically involve drilling down over hierarchies.
Dimensions can include multiple hierarchies.
Master data appears in probably every single application in an enterprise. ERP applications
include products, bills of material, customers, suppliers, contracts and similar. Customer
Relationship Management (CRM) applications deal, of course, with customers. Human
Resources Management (HRM) applications are about employees. Analytical applications many
times include all master data that appears in an enterprise. We can easily imagine that master
data is a very important part of data. It is crucial that this master data is known and correct.
Data quality issues are mostly about master data. It is easy to imagine that having the same master data in multiple sources can immediately lead to problems with conflicting definitions, conflicting identifications, and duplication. Master data typically changes at a much slower rate than transactional data. Customers, for example, do not change addresses frequently; however, they interact with your company through orders, services, or even complaints, probably on a daily basis. Nevertheless, although less volatile than transactional data, the master data life cycle is still a classical CRUD cycle: Create, Read, Update, and Destroy. The question that arises is whether any
data about people, things, places or concepts is really master data for any company.
If a company sells only five different products, then products data is not master data for this
company. Although technically it is master data, the company does not need any specific
management of this data. Every single attribute of these five products has a correct value in the
system, and it is unlikely that an inaccurate value will appear. Therefore, for this company,
products data is not considered master data. Cardinality thus influences whether we consider specific data as master data or not.
If a company does not collect many attributes for product categories, if the categories entity is
quite simple in terms of the number of attributes (for example, it could have only two



attributes, category id and category name), then again that company probably does not
consider this as master data. Complexity of data is another criterion that helps us decide which
data needs special treatment and which data is master data.
Another factor to consider when defining master data is volatility. We already mentioned that
master data tends to change less frequently than transactional data. Now imagine that some
data does not change at all. For example, let us use geographic locations. Data about geographic
locations changes quite infrequently; in some systems, we can even consider this data as static.
As soon as we have cleansed this data, it does not need any further treatment, and therefore, again, we do not consider it master data.
For some data, we need to maintain history. In transactional applications, we need to know how
the data came to its current state. Governments or other authorities prescribe formal auditing for some business areas. In analytical applications, we frequently analyze over time; for
example, we compare sales of this year with sales of the previous year in a geographic region. In
order to make proper analyses, we need to take into account possible movements of customers
from region to region; therefore, we need to maintain history again. Data that needs versioning,
auditing, or any other kind of maintaining of history, is typically master data.
Finally, we give more attention to data that we reuse repeatedly. Reuse increases the value of data for us. The value of the data can increase because of other factors as well. For example, in the pharmaceutical industry, an error in a bill of materials can lead to huge damage. Therefore, the more valuable the data is for us, the more likely we are to define it as master data.

Master Data Management


We discussed that data that does not need special treatment is not master data. Clearly, we can
conclude that master data needs some special treatment, which is called Master Data
Management. In a more formal definition, Master Data Management (MDM) is a set of coordinated processes, policies, tools, and technologies used to create and maintain accurate
master data.



Even the formal definition of MDM is still very broad. We do not have a single tool for MDM.
We can consider anything we use to improve the quality of master data as an MDM tool. Any activity, whether formal, scheduled, repeated, or ad hoc, that improves master data quality is an MDM process. Any technology, like a Relational Database Management System (RDBMS) that enforces data integrity, is part of MDM technology. However, depending on the approach we use to maintain master data, we might consider using a specialized tool that clearly defines the process for managing master data.
Some of the most important goals of MDM include:

Unifying, or at least harmonizing, master data between different transactional or operational systems

Maintaining multiple versions of master data for different needs in different operational systems

Integrating master data for analytical and CRM systems

Maintaining history for analytical systems

Capturing information about hierarchies in master data, which is especially useful for analytical applications

Supporting compliance with government prescriptions (e.g., Sarbanes-Oxley) through auditing and versioning

Having a clear CRUD process through a prescribed workflow

Maximizing Return on Investment (ROI) through reuse of master data

Please note the last bullet. Master Data Management can be quite costly, and very intensive in
terms of resources used, including man hours. For a small company with a single operational
system, probably no specific MDM tool is needed. Such a company can maintain master data in
the operational system. The more we reuse master data (in multiple operational, CRM, and analytical applications), the bigger the ROI we get.



An RDBMS and an application can and should enforce data integrity. Data integrity means that the data conforms to business rules. Business rules can be quite simple, such as: order numbers must be unique, there should be no order without a known customer, and the quantity ordered must be greater than zero. Business rules can also be more complicated. For example, a company can define that no customer should order, in a single Web order, a product in a quantity that is more than half of the quantity of the product in stock. If the database and the application do not enforce data integrity, we can expect dirty data.
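As a minimal sketch (the table and constraint names are illustrative only, not from any sample database used in this book), the simple business rules above translate directly into declarative constraints:

-- Hypothetical Orders table; dbo.Customers is assumed to exist
CREATE TABLE dbo.Orders
(
    OrderId     INT          NOT NULL,
    OrderNumber NVARCHAR(20) NOT NULL,
    CustomerId  INT          NOT NULL,          -- no order without a known customer
    Quantity    INT          NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderId),
    CONSTRAINT UQ_Orders_OrderNumber UNIQUE (OrderNumber),  -- order numbers must be unique
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerId)
        REFERENCES dbo.Customers (CustomerId),
    CONSTRAINT CK_Orders_Quantity CHECK (Quantity > 0)      -- quantity must be greater than zero
);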
Nevertheless, even if the database and the application do enforce data integrity, we still should
not take data accuracy for granted. How can we prevent typos? For example, an operator could write "42 Hudson Avenue" instead of "24 Hudson Avenue"; both addresses are valid from a data integrity perspective. Another issue arises if we have multiple systems. Do all operators enter data in a consistent way? Some operators could write the correct address, but in a slightly different form, like "24 Hudson Ave.".
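No constraint catches such problems; we have to profile the data. As a small, purely illustrative sketch (a hypothetical Customers table), a simple query can at least flag customers whose name appears with more than one address string, a typical symptom of inconsistent entry; more powerful string-similarity and fuzzy-matching techniques are the topic of Chapter 4:

-- Customers stored with the same name but different address strings
SELECT CustomerName,
       COUNT(DISTINCT Address) AS DistinctAddresses
FROM dbo.Customers
GROUP BY CustomerName
HAVING COUNT(DISTINCT Address) > 1;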
We could resolve data quality issues with occasional data cleansing. With cleansing, data quality
rises; however, over time, the quality falls again. This is a reactive approach. A proactive approach, which prevents entering low-quality data, is even better. What we need is explicit data governance. We must know who is responsible for the data, and we must have a clearly defined process for maintaining the data. Data governance sets the policies, the rules for
master data. Data governance rules can prescribe data requirements, such as which information
is required, how we derive values, data retention periods and similar. Data governance can also
prescribe security policies, like which master data needs encryption, and which part has to be
audited. It can prescribe versioning and workflows. It defines how to address data
inconsistencies between different source systems; for example, it can define the authoritative
source. Data governance policies should also define how to bring new systems online without
breaking existing master data processes and quality. Finally, data governance can define the need for explicit roles in an enterprise: roles responsible for maintaining master data and implementing data governance.



In MDM terminology, we have to define the Data stewards, the people responsible for their
part of master data. Data stewards are the governors. They should work independently of any
specific source or destination system for master data, in an objective way. Data stewards must
have deep knowledge about the data they govern. Commonly, one data steward covers one
business area of master data; we have one steward for customers, one for products, and so on. We should define data stewardship roles and designate data stewards early in the process of implementing an MDM solution.
From what we have seen so far, we can conclude that there are different approaches to master
data management. Here is a list of possible approaches:

No central master data management - we have systems that do not communicate at all. When we need any kind of cross-system interaction, like doing analysis over data from multiple systems, we do ad-hoc merging and cleansing. This approach is very cheap at the beginning; however, it turns out to be the most expensive over time. From this book's perspective, we really should not treat this approach as a real MDM approach.

Central metadata storage - with this approach, we have at least unified, centrally maintained definitions for master data. Different systems should follow and implement these central definitions. Ad-hoc merging and cleansing becomes somewhat simpler. In this scenario, we typically do not use a specialized solution for the central metadata storage. This central storage of metadata is probably in an unstructured form, in documents, worksheets, or even on paper only.

Central metadata storage with identity mapping - besides unified, centrally maintained definitions for master data, we also store key mapping tables in our MDM solution. Data integration applications can be developed much faster and more easily. Although this solution seems quite appealing, it has many problems with maintaining master data over time. We have only keys from different systems in our MDM database; we do not have any other attributes. All attributes in source systems change over time, and we have no versioning or auditing in place to follow the changes. This approach is viable for a limited time only. It is useful, for example, during upgrading, testing, and the initial usage of a new ERP system, to provide mapping back to the old ERP system.

Central metadata storage and central data that is continuously merged - we have metadata as well as master data stored in a dedicated MDM system. However, we do not insert or update master data here; we merge (and cleanse) master data from source systems continuously, on a daily basis. There are multiple issues with this approach, and continuous merging can become expensive. The only viable possibility for this approach is that we can find out what has changed in source systems since the last merge, enabling us to merge only the delta, only the new and updated data. This approach is frequently used for analytical systems. We prepare a Data Warehouse (DW) as a central storage for analytical data (which includes transactional and master data). We populate the DW overnight, and during population, we merge data and resolve cleansing issues. Although we typically do not create the DW with MDM as the main goal, we can treat the DW as an authoritative source of master data.

Central MDM, single copy - with this approach, we have a specialized MDM application, where we maintain master data, together with its metadata, in a central location. All existing applications are consumers of this master data. This approach seems preferable at first glance. However, it has its own drawbacks. We have to upgrade all existing applications to consume master data from the central storage instead of maintaining their own copies. This can be quite costly, and maybe even impossible with some legacy systems. In addition, our central master metadata should be the union of all metadata from all source systems. Finally, the process of creating and updating master data could simply be too slow. It could happen, for example, that a new customer would have to wait a couple of days before submitting the first order, because the process of inserting customer data with all possible attributes involves contacting all source systems and simply takes too long.


Central MDM, multiple copies - in this approach, we have a central storage of master data and its metadata. However, the metadata here includes only the intersection of common metadata from the source systems. Each source system maintains its own copy of master data, with additional attributes that pertain to this system only. After we insert master data in the central MDM system, we replicate it (preferably automatically) to the source systems, where we update the source-specific attributes. This approach gives us a good compromise between cost, data quality, and effectiveness of the CRUD process. Still, there is no free lunch. As different systems can also update the common data, we can have update conflicts. Therefore, this approach involves continuous merging as well. However, as at least part of the data is updated centrally, this approach means less work with continuous merging than the central metadata storage and central data that is continuously merged approach.

For the last two approaches, we need a special MDM application. A specialized MDM solution could also be useful for the central metadata storage with identity mapping and the central metadata storage and central data that is continuously merged approaches. SQL Server 2008 R2
Master Data Services (MDS) is a specialized MDM application. We could also write our own
application. Other SQL Server tools, like SSIS and SSAS, are helpful in the MDM process as well.
However, for the last two approaches to MDM, MDS is the most efficient solution.
MDM Challenges
For a successful MDM project, we have to tackle all challenges we meet. These challenges
include:

Different definitions of master metadata in source systems


We can have different coding schemes, data types, collations and similar. We
have to unify the metadata definitions.

Data quality issues



This is something we always have to expect. In short, if we do not have data
quality issues, then we probably do not need a specialized MDM solution
anyway. We have to improve the data quality; otherwise, the MDM project fails
to accomplish its most important goal.

Authority
Who is responsible for master data? Different departments want to be
authoritative for their part of master data, and the authority for master data can
overlap in an enterprise. We have to define policies for master data, with an explicit data stewardship process prescribed. We also define data ownership as part of resolving the authority issue.

Data conflicts
When we prepare the central master data database, we have to merge data from
our sources. We have to resolve data conflicts during the project, and, depending
on the MDM approach we take, replicate the resolved data back to the source
systems.

Domain knowledge
We should include domain experts in an MDM project.

Documentation
We have to take care that we properly document our master data and metadata.

No matter which approach we take, MDM projects are always challenging. However, tools like
MDS can efficiently help us resolve possible issues.


Data Models
It is crucial that we have a basic understanding of data models used in an enterprise before we
start an MDM project. Details of data modeling are out of scope for this book; only the minimum needed is covered here. We are going to introduce the Relational Model, the Dimensional Model, and, briefly, other models and storage formats.

Relational Model
The relational model was conceived in the 1960s by Edgar F. Codd, who worked for IBM. It is a
simple, yet rigorously defined conceptualization of how users perceive and work with data. The
most important definition is the Information Principle, which states that all information in a
relational database is expressed in one (and only one) way, as explicit values in columns within
rows of a table. In the relational model, a table is called a relation, and a row is called a tuple,
which consists of attributes.
Each relation represents some real-world entity, such as a person, place, thing, or event. An
entity is a thing that can be distinctly identified and is of business interest. Relationships are
associations between entities.
A row in a relation is a proposition, like "an employee with identification equal to 17, with full name Davide Mauri, lives in Milan."
The relation header, or the schema of the relation, is the predicate for its propositions. A predicate is a generalized form of a proposition, like "an employee with identification EmployeeId(int), with full name EmployeeName(string), lives in City(CitiesCollection)".
Note the name/domain pairs of placeholders for concrete values. The domain, or the data type, is the first point where an RDBMS can start enforcing data integrity. In the previous example, we
cannot insert an EmployeeId that is not an integral number. We cannot insert a city that is not in



our collection of allowed cities. Please also note that without a good naming convention, it is
hard to reconstruct predicates and propositions from a database schema.
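As a minimal sketch (table and column names are purely illustrative), the predicate above could be declared like this; the data types and the foreign key act as the domains that the RDBMS enforces:

-- The collection of allowed cities
CREATE TABLE dbo.Cities
(
    CityId   INT          NOT NULL PRIMARY KEY,
    CityName NVARCHAR(50) NOT NULL
);

-- "Employee with identification EmployeeId(int), full name EmployeeName(string), lives in City"
CREATE TABLE dbo.Employees
(
    EmployeeId   INT           NOT NULL PRIMARY KEY,  -- only integral identifications allowed
    EmployeeName NVARCHAR(100) NOT NULL,
    CityId       INT           NOT NULL
        CONSTRAINT FK_Employees_Cities REFERENCES dbo.Cities (CityId)  -- only allowed cities
);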
With different source systems, we typically do not have influence on the source schema and naming conventions; however, we should be aware that a worse schema and worse naming mean more problems for our MDM project, and worse data quality. One of the most important tools for enforcing data integrity in a relational model is normalization. As mentioned, we are going to introduce only the basics here; for more details on the relational model, please refer to data modeling books, for example An Introduction to Database Systems by C. J. Date.
Tables are normalized when they represent propositions about entities of one type - in other
words, when they represent a single set. This means that entities do not overlap in tables, and
that tables are orthogonal, or normal in mathematical terms. When a table meets a certain prescribed set of conditions, it is in a particular normal form. A database is normalized when all tables are normalized.
Normalization is a redesign process to unbundle the entities. The process involves
decomposition. The normalization is achieved by applying a sequence of rules to create what
are called normal forms. The goal is to eliminate redundancy and incompleteness.
Many normal forms are defined. The most important ones are first, second, and third. If a
database is in third normal form, it is usually already fully normalized. We are going to introduce
the first three normal forms only. If a database is not normalized, you can experience data
manipulation anomalies. Let us start with the following example of a model for customers' orders, shown in figure 1.


[Figure 1: Orders model before 1st NF - a single Orders table with the key OrderId and the columns OrderDate, CustomerId, CustomerName, Address, City, Country, and OrderDetails{ProductId, ProductName, Quantity}.]


Note that the last column (OrderDetails) is actually a collection of items ordered. This model
leads to many CRUD anomalies. How do you insert a potential customer, a customer without an
order? How do you insert a product that is not ordered yet? If you delete the last order for a
customer, you lose all information about that customer. If you update the name of a product, you have to take care to update it in all rows where this product appears on an order. Even reading this data is hard; how do you find the total quantity ordered for a single product, for example?
The first normal form says that a table is in first normal form if all columns are atomic, or
indivisible. No multivalued columns are allowed. Decomposition has to start with the
OrderDetails column. We need a single row per item in an order, and every atomic piece of data
of a single item (ProductId, ProductName, Quantity) must have its own column. However, after
the decomposition, we have multiple rows for a single order. OrderId by itself cannot be the key
anymore. The new key is composed of the OrderId and ItemId columns. Figure 2 shows our
Orders table in first normal form.


[Figure 2: Orders model in 1st NF - the Orders table with the composite key (OrderId, ItemId) and the columns OrderDate, CustomerId, CustomerName, Address, City, Country, ProductId, ProductName, and Quantity.]


Queries are now simplified; it is easy to find totals for products. However, update anomalies are
still possible. You still cannot insert a customer without an order or a product that is not ordered yet, and you have to change a single piece of information, like the order date, which now repeats for the same order, in multiple rows, in multiple places.
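As an illustration of the simpler queries mentioned above, the total quantity per product is now a straightforward aggregate (a sketch using the column names from figure 2):

-- Total quantity ordered per product on the 1NF Orders table
SELECT ProductId, ProductName, SUM(Quantity) AS TotalQuantity
FROM dbo.Orders
GROUP BY ProductId, ProductName;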
To achieve the second normal form, a table must be in first normal form, and every non-key
column must be functionally dependent on the entire key. This means that no non-key column
can depend on a part of the key only. In the Orders model in first normal form, we need only the OrderId to get the customer data and the order date; we don't need ItemId, which is also part of
the key. To achieve the second normal form, we need to decompose the table into two tables,
shown in figure 3.


[Figure 3: Orders model in 2nd NF - an Orders table (key OrderId; OrderDate, CustomerId, CustomerName, Address, City, Country) and an OrderDetails table (key OrderId + ItemId, with OrderId referencing Orders; ProductId, ProductName, Quantity).]
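A minimal DDL sketch of this decomposition (data types are illustrative assumptions) could look like this:

-- Order header: columns that depend on OrderId alone
CREATE TABLE dbo.Orders
(
    OrderId      INT           NOT NULL PRIMARY KEY,
    OrderDate    DATE          NOT NULL,
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,
    Address      NVARCHAR(200) NULL,
    City         NVARCHAR(50)  NULL,
    Country      NVARCHAR(50)  NULL
);

-- Order lines: columns that depend on the whole (OrderId, ItemId) key
CREATE TABLE dbo.OrderDetails
(
    OrderId     INT           NOT NULL
        CONSTRAINT FK_OrderDetails_Orders REFERENCES dbo.Orders (OrderId),
    ItemId      INT           NOT NULL,
    ProductId   INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL,
    Quantity    INT           NOT NULL,
    CONSTRAINT PK_OrderDetails PRIMARY KEY (OrderId, ItemId)
);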


With the model in the second normal form, we resolved the problem of updating the order date. For a single order, we now maintain it in a single place. However, some update anomalies are still possible. There is still some redundancy in the model. Customer address, city, and country repeat over orders; product name repeats over order details.
To achieve the third normal form, a table must be in the second normal form, and every non-key column must be non-transitively dependent on the key. In other words, non-key columns
must be mutually independent. In our previous model in second normal form, we can find
CustomerId from OrderId, and from CustomerId we can transitively find the customer name and
address. From address, we can find city, and from city, we can find country. In addition, in the
OrderDetails table, ProductId and ProductName columns are not mutually independent. To
achieve the third normal form, we must create new tables for dependencies between non-key
columns, as shown in figure 4.


[Figure 4: Orders model in 3rd NF - Customers (CustomerId; CustomerName, Address, CityId), Cities (CityId; City, CountryId), Countries (CountryId; Country), Orders (OrderId; CustomerId, OrderDate), OrderDetails (OrderId + ItemId; ProductId, Quantity), and Products (ProductId; ProductName), connected through foreign keys.]


After a database is in third normal form, there are usually no update anomalies. However, queries for retrieving data are more complicated, as they involve multiple joins.
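For example, reconstructing a readable order line from the model in figure 4 already takes several joins (a sketch using the table and column names from the figure):

-- One order line with customer, city, country, and product names resolved
SELECT o.OrderId, o.OrderDate, c.CustomerName, ci.City, co.Country,
       p.ProductName, od.Quantity
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
JOIN dbo.Cities AS ci ON ci.CityId = c.CityId
JOIN dbo.Countries AS co ON co.CountryId = ci.CountryId
JOIN dbo.OrderDetails AS od ON od.OrderId = o.OrderId
JOIN dbo.Products AS p ON p.ProductId = od.ProductId;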
A properly implemented Relational Model helps maintain data integrity and data quality in general. Besides data types and normalization, we can use many additional tools provided by a modern RDBMS like SQL Server. We have declarative constraints, for example the Check constraint, which further narrows down the possible values of an attribute. We can define whether all values have to be known; we can forbid NULLs (NULL is a standard placeholder for an unknown or not applicable value). We can implement integrity rules programmatically, with triggers and stored procedures. Of course, we can implement them programmatically in the middle tier or client tier of an application as well. The important thing for an MDM project is that we understand our source data. Realizing whether the design of the operational databases, the sources where data is introduced into an enterprise for the first time, is proper or not helps us a lot when evaluating the cost of possible approaches to an MDM solution.



Before explaining the Dimensional Model, we have to understand another relational design technique: specialization and generalization. We can have NULLs in our data because the value is unknown, or because an attribute is not applicable to a subset of rows. In our example, we could have both persons and companies in the Customers table. For persons, the birth date makes sense; for companies, the number of employees might be valuable information. In order to prevent NULLs, we can introduce subtype tables, or subtypes for short.
Two entities are of distinct, or primitive, types if they have no attributes in common. Some
relations can have both common and distinct attributes. If they have a common identifier, we
can talk about a special supertype-subtype relationship. Supertypes and subtypes are helpful for
representing different levels of generalization or specialization. In a business problem description, the verb "is" (or explicitly "is a kind of") leads to a supertype-subtype relationship.
Specialization leads to additional decomposition of tables, as we can see in the example in
figure 5.


[Figure 5: Orders model in 3rd NF with specialization - the same model as in figure 4, with two subtype tables added that share the Customers key: Persons (CustomerId; BirthDate) and Companies (CustomerId; NumberOfEmployees).]
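A minimal sketch of the supertype and subtype tables from the figure (columns reduced to the ones discussed, data types are illustrative assumptions):

-- Customers is the supertype; Persons and Companies are subtypes sharing its key
CREATE TABLE dbo.Customers
(
    CustomerId   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Persons
(
    CustomerId INT  NOT NULL PRIMARY KEY
        CONSTRAINT FK_Persons_Customers REFERENCES dbo.Customers (CustomerId),
    BirthDate  DATE NOT NULL             -- applicable to persons only
);

CREATE TABLE dbo.Companies
(
    CustomerId        INT NOT NULL PRIMARY KEY
        CONSTRAINT FK_Companies_Customers REFERENCES dbo.Customers (CustomerId),
    NumberOfEmployees INT NOT NULL       -- applicable to companies only
);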


As you can see, even in this small example, our relational design became quite complicated.
Queries for analyses involve multiple joins; they are hard to write, and many times they do not perform well enough for on-line, real-time analysis. Analyzing on-line means that we can change the structure of the aggregations we are looking for in real time. For example, instead of aggregating companies' sales over countries and years, we could decide to aggregate over products and the age of individual people. Because of these problems with analyses, the Dimensional Model evolved. The Dimensional Model is the base building block of OLAP systems.


Dimensional Model
The Dimensional Model of a database has a more deterministic schema than the Relational Model. We use it for Data Warehouses (DW). In a DW, we store merged and cleansed data from different source systems, with historical data included, in one or more star schemas. A single star schema covers one business area, like sales, inventory, production, or human resources. As we already know, in a star schema we have one central (fact) table and multiple surrounding (dimension) tables. Multiple star schemas are connected through shared dimensions. An explicit Time (or Date, depending on the level of granularity we need) dimension is always present, as we always include historical data in a DW. The star schema was introduced by Ralph Kimball in his famous book The Data Warehouse Toolkit.
The star schema is deliberately denormalized. Lookup tables, like Cities and Countries in our example, are flattened back into the original table, and attributes from those lookup tables form
natural hierarchies. In addition, we can also flatten specialization tables. Finally, we can add
multiple derived attributes and custom-defined hierarchies. An example of a star schema
created from our normalized and specialized model is illustrated in figure 6.


[Figure 6 is a diagram of the star schema: the FactOrders fact table surrounded by the DimCustomers, DimProducts and DimDates dimension tables.]
Figure 6: ORDERS IN A STAR SCHEMA MODEL


With more dimension tables, the schema would resemble a star even more. We can see derived
columns (PersonAge and CompanySize) in the DimCustomers table. This table is also flattened from the
original Customers table, the Cities and Countries lookup tables, and the Persons and Companies
subtype tables. In DimCustomers, we can see a natural hierarchy, Country → City →
Customer; in the DimDates table, we can see a natural hierarchy, Year → Quarter → Month →
Date. For analytical purposes, we could also define ad-hoc hierarchies, for example Country →
CompanySize.
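For illustration, an analytical query over such a star schema can roll the fact table up along these natural hierarchies; the following sketch assumes the table and column names shown in figure 6.

-- Total ordered quantity by country and year, rolling up the FactOrders fact table.
SELECT c.Country,
       d.[Year],
       SUM(f.Quantity) AS TotalQuantity
FROM dbo.FactOrders   AS f
JOIN dbo.DimCustomers AS c ON c.CustomerId = f.CustomerId
JOIN dbo.DimDates     AS d ON d.DateId = f.DateId
GROUP BY c.Country, d.[Year]
ORDER BY c.Country, d.[Year];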
While relational databases are typically sources of master data, dimensional databases are
usually destinations of master data. As with the Relational Model, there is much more to say about
the Dimensional Model. However, this brief introduction should be enough for understanding
the two most commonly used models, and how they impact our master data management solution.


Other Data Formats and Storages


Besides relational and dimensional databases, data can appear in an enterprise in many
other formats and storages.
Some companies implement an Operational Data Store (ODS). Bill Inmon, who introduced this
concept, defines an ODS as a:
Subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in
support of an organization's need for up-to-the-second, operational, integrated, collective
information.
Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)

The data in an ODS has limited history, and is updated more frequently than the data in a DW.
However, the data is already merged from multiple sources. An ODS is often part of a CRM
application; typically, it holds data about customers. An MDM solution can actually replace, or
integrate with, an existing ODS.
Some data is semi-structured. Either the structure is not prescribed in full detail, or the
structure itself is volatile. Nowadays semi-structured data usually appears in XML format. We
can have XML data in files in a file system or in a database. Modern relational systems support
XML data types.
XML instances can have a schema. For defining XML schemas, we use XML Schema Definition
(XSD) documents. An XSD is an XML instance with defined namespaces, elements and attributes
that expresses a set of rules to which an XML instance must conform. If an XML instance conforms
to its XSD, we say it is schema validated. Here is an example of an XML schema:
<xsd:schema targetNamespace="ResumeSchema" xmlns:schema="ResumeSchema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
elementFormDefault="qualified">
<xsd:import
namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
schemaLocation="http://schemas.microsoft.com/sqlserver/2004/sqltypes/sqltypes.xsd" />
<xsd:element name="Resume">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="skills" minOccurs="0" maxOccurs="1">
<xsd:simpleType>
<xsd:restriction base="sqltypes:nvarchar"
sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase
IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
<xsd:maxLength value="1000" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="previousemployment" minOccurs="0">
<xsd:simpleType>
<xsd:restriction base="sqltypes:nvarchar"
sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase
IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
<xsd:maxLength value="100" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

Reading raw XML is not much fun. Nevertheless, we can extract some useful information. From the
highlighted parts of the code, we can conclude that this is a schema for resumes, probably for
job candidates. It allows two elements which, judging by their names, describe skills and
previous employment. Both elements are optional, and each can appear at most once.
In order to maintain data quality for XML data, we should enforce validation of XML instances
against XML schemas. SQL Server supports XML schema validation for columns of the XML data
type.
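For illustration, the following Transact-SQL sketch registers a simplified stand-in for the resume schema as an XML schema collection and binds it to a typed XML column; the object names are illustrative only.

-- Enforcing schema validation for XML data: register a schema collection and type the column with it.
CREATE XML SCHEMA COLLECTION dbo.ResumeSchemaCollection AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Resume">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="skills" type="xsd:string" minOccurs="0" />
          <xsd:element name="previousemployment" type="xsd:string" minOccurs="0" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>';
GO
CREATE TABLE dbo.Candidates
(
    CandidateId INT IDENTITY(1,1) PRIMARY KEY,
    Resume      XML (dbo.ResumeSchemaCollection) NULL   -- instances are validated on insert and update
);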
Every company also has to deal with unstructured data. This data is in documents,
spreadsheets, other computer formats, or even on paper only. If this data is important, if it is
part of master data, we should include it in our MDM solution. Of course, the problem is how to
do it.
One possibility is simply to store all kinds of files in a database. SQL Server supports a couple of
large object (LOB) data types. In addition, SQL Server supports FILESTREAM storage for binary large
objects of the varbinary(max) data type. FILESTREAM integrates the SQL Server Database Engine with
the NTFS file system by storing varbinary(max) object data as files on the file system. This way,
we get practically unlimited storage for unstructured data inside our relational database.
Usually we can find interesting properties of the unstructured data. If we store unstructured
data in our database, it makes a lot of sense to store these interesting properties in additional,
regular non-LOB attributes. We can classify documents before storing them in a database. We
can do the classification manually, or with the help of a tool. Text mining tools, for example, can
extract terms from text. We can use the extracted terms to classify texts.
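As an illustration, a table combining FILESTREAM storage with regular property columns might be sketched as follows; this assumes FILESTREAM is enabled on the instance and that the database has a FILESTREAM filegroup, and all names are illustrative.

CREATE TABLE dbo.Documents
(
    DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    Title      NVARCHAR(200)  NOT NULL,        -- descriptive, searchable properties
    Category   NVARCHAR(50)   NULL,            -- e.g. a classification produced by a text mining tool
    Content    VARBINARY(MAX) FILESTREAM NULL  -- the document itself, stored on the NTFS file system
);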


Data Quality
Data quality is indivisibly interleaved with master data management. The most important goal
of an MDM solution is to raise the quality of master data. We should tackle data quality issues in
any MDM project. Nevertheless, data quality activities, such as data profiling, finding the root cause
of poor data, and improving quality, can be independent of an MDM project as well. An
enterprise can define data quality policies and processes through existing applications only.
However, a specialized MDM solution can ease the implementation of those policies a great
deal.
Before we describe data quality activities, we have to decide which aspects of our data we are going
to measure and improve. Data quality dimensions each capture a specific aspect of the
general data quality term. Measuring data quality, also known as data profiling,
should be an integral part of the implementation of an MDM solution. We should always get a
thorough comprehension of the source data before we start merging it. We should also measure
improvements of data quality over time to understand and explain the impact of the
MDM solution. Let us start with data quality dimensions, to show what and how we can
measure data quality.

Data Quality Dimensions


Data quality dimensions can refer to data values or to their schema. We introduced the most
popular schemas earlier in this chapter. We are going to start with pure data quality dimensions,
and return to schema quality dimensions later in this section.
There is no exact definition of which data quality dimensions we should inspect. Different tools
and different books list different sets of dimensions. Nevertheless, some dimensions are
analyzed more frequently than others. We are going to focus on the most important, i.e. the
most frequent, ones.



We can measure some data quality dimensions with tools, for example with Transact-SQL queries. The
measurable dimensions are also called hard dimensions. For other dimensions, we depend on
the perception of the users of the data. These are soft dimensions. We cannot measure soft
dimensions directly; we can measure them indirectly through interviews with users of the data, or
through any other kind of communication with users. Note that this communication can
unfortunately include unpleasant events we want to prevent, like customer complaints. Let us
start with hard dimensions.
Completeness
Completeness is the dimension that can be measured most easily. We can start measuring it
at the population level. Under the closed world assumption, we can state that no values other than the
values actually present in a relational table represent facts in the real world. If the relation does
not have unknown values (NULLs), the relation is complete from the population perspective. Under
the open world assumption, we cannot state population completeness, even if our relation
does not contain NULLs. In order to evaluate the completeness of a relation under the open
world assumption, we need a reference relation that contains the entire population. With
a reference relation, we can define completeness as the ratio of the number of tuples
present in our relation to the number of tuples in the reference relation. Because of privacy
and legal constraints, it is commonly not possible to acquire the reference relation itself. However,
we can usually get at least the number of tuples in a reference relation. For example, we can
easily get the number of citizens of a country. From a technical point of view, it is very easy to
measure the completeness of our relation once we have the number of tuples in the reference
relation.
Under the closed world assumption, in a relational database, the presence of NULLs is what
defines the completeness. We can measure attribute completeness, i.e. the number of NULLs
in a specific attribute, tuple completeness, i.e. the number of unknown attribute values
in a tuple, and relation completeness, i.e. the number of tuples with unknown attribute values
in the relation. Finally, we can also measure value completeness, which makes sense for



complex, semi-structured columns, namely for XML data type columns. In an XML instance, an
entire element or an attribute can be missing. In addition, the XML standards define a special
xsi:nil attribute as a placeholder for missing values; this is similar to relational NULLs.
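For example, a simple Transact-SQL sketch can measure attribute completeness and, given a known reference count, population completeness; the names and the reference number below are illustrative.

DECLARE @ReferencePopulation INT = 2000000;  -- e.g. the known number of citizens of a country

SELECT COUNT(*)                                           AS NumRows,
       SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END) AS NumUnknownBirthDates,
       100.0 * SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)
                                                          AS PctUnknownBirthDates,
       100.0 * COUNT(*) / @ReferencePopulation            AS PctPopulationCompleteness
FROM dbo.Customers;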
Accuracy
Accuracy is a complicated dimension. Firstly, we have to determine what is inaccurate.
Accuracy is stricter than just conforming to business rules; the latter should be enforced with
data integrity. For data that should be unique, duplicate values are inaccurate. Finding duplicate
values might be quite easy, with simple queries, or very hard, if we have to find duplicates
across different systems. Finding other inaccurate data might involve some manual work. With
different algorithms, we can only extract data that is potentially inaccurate.
Here is some advice on how to isolate inaccurate data. For discrete data values, we can use the
frequency distribution of values. A value with a very low frequency is probably incorrect. For
strings, we can look at the string length distribution. A string with a very untypical length is
potentially incorrect. For strings, we can also try to find patterns, and then create a pattern
distribution. Patterns with a low frequency probably denote wrong values. For continuous
attributes, we can use descriptive statistics. Just by looking at the minimal and maximal values, we
can easily spot potentially problematic data. No matter how we find inaccurate data, we can
flag it, and then measure the level of accuracy. We can measure this for columns and tables.
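The following illustrative queries sketch two of these profiling techniques, the frequency distribution of a discrete column and the string length distribution; the table and column names are assumptions.

-- Frequency distribution of a discrete column: values with a very low frequency are suspicious.
SELECT Gender, COUNT(*) AS Frequency
FROM dbo.Customers
GROUP BY Gender
ORDER BY Frequency;

-- String length distribution: strings with an untypical length are potentially incorrect.
SELECT LEN(CustomerName) AS NameLength, COUNT(*) AS Frequency
FROM dbo.Customers
GROUP BY LEN(CustomerName)
ORDER BY Frequency;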
Information
Another measurable dimension is information. Information Theory, a branch of applied mathematics,
defines entropy as the quantification of information in a system. We can measure
entropy at the column and table level. The more dispersed the values are, that is, the more equally the
frequency distribution of a discrete column is spread among its values, the more information we
have in the column. Information is not a direct data quality dimension; however, it can tell us
whether our data is suitable for analyses or not.
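As a sketch, the entropy of a discrete column can be computed directly in Transact-SQL; the table and column names below are illustrative.

-- Shannon entropy, H = -SUM(p * log2(p)), where p is the relative frequency of each distinct value
-- (log2(p) is computed here as LOG(p) / LOG(2)).
WITH Freq AS
(
    SELECT Gender,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS p
    FROM dbo.Customers
    GROUP BY Gender
)
SELECT -SUM(p * LOG(p) / LOG(2.0)) AS EntropyInBits
FROM Freq;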



Consistency
Consistency measures the equivalence of information stored in various databases. We can find a
lot of inconsistent data by comparing values with a predefined set of possible values; we can
find some inconsistencies by comparing data among systems; and we can find some inconsistencies
manually. No matter how we find inconsistencies, we can flag them, and then again measure
the level of inconsistency at the column or table level.
We can measure soft dimensions indirectly, through interaction with users. Questionnaires,
quick polls, user complaints, or any other communication with data users are our tools for
measuring the quality of soft data dimensions. The following list includes some typical soft
dimensions:

Timeliness tells us the degree to which data is current and available when
needed. There is always some delay between a change in the real world and the
moment when this change is entered into a system. Although stale data can
appear in any system, this dimension is especially important for Web applications
and sites. A common problem on the Web is that owners do not update sites in a
timely manner; we can find a lot of obsolete information on the Web.

Ease of use is a very typical dimension that relies on user perception. This
dimension depends on the application and its user interface. In addition, users of data
can also perceive usage as complex because they are insufficiently trained.

Intention: is the data the right data for the intended usage? Sometimes we do not
have the exact data we need; however, we can substitute the data needed with
data carrying similar information. For example, we can use phone area codes instead
of ZIP codes in order to locate customers approximately. Although phone
numbers were not intended for analyses, they can give us reasonable results.
Another, worse example of unintended usage is the usage of a column in a table for
storing unrelated information, like using the product name to store product



taxonomy in it. This is unintended usage of the schema, which leads to many
problems with data cleansing and integration.

Trust: we have to ask users whether they trust the data. This is a very important
dimension. If users do not trust the data in operational systems, they will create their
own little, probably unstructured, databases. Integration of master data from
unstructured sources is very challenging. If users do not trust data from analytical
applications, they will simply stop using them.

Presentation quality is another dimension that depends on user perception.
When presenting data, the format and appearance should support the appropriate use of
the information. In operational systems, this dimension is closely related to the ease of
use dimension, and depends a lot on the user interface. For example, an application
can force users to enter dates manually, or guide them through a calendar control.
In analytical systems, presentation is probably even more important. Do we show
data in graphs or in tables? How much interactivity should we put in our reports?
Questions like these, and the answers to them, can have a big influence on the
success or failure of analytical systems.

Finally, we can describe some schema quality dimensions. A common perception is that schema
quality cannot be measured automatically. Well, this is true for some dimensions; we cannot
measure them without digging into the business problem. Nevertheless, it is possible to find
algorithms and create procedures that help us measure some parts of schema quality. The
following list shows the most important schema quality dimensions, with a brief description of
how to measure them when applicable.

Completeness tells us to what extent the schema covers the business problem. We
cannot measure this dimension without an in-depth analysis of the business problem
and its needs.

Correctness of the model concerns the correct representation of real-world
objects in the schema and the correct representation of requirements. For
example, the first name could be modeled as an entity with a one-to-one



relationship to the Customers entity, or as an attribute of the Customers entity. Of
course, the latter is correct. The first name cannot be uniquely identified, and is
therefore not an entity. This is the problem of correctly representing real-world
objects. An example of incorrectness with respect to requirements is a model of
departments and managers. If a requirement says that a manager can manage a
single department, and that each department has a single manager, then the
relationship between managers and departments should be one-to-one.
Modeling this relationship as many-to-many means an incorrect schema. We can
measure this dimension manually, by investigating business requirements and
object definitions and their representation in the schema.

Documentation tells us whether the schema is properly documented. Schema
diagrams, like Entity-Relationship diagrams, should always be part of the
documentation. In addition, all business rules that cannot be represented in a
diagram should be documented in textual form. In short, we should have
complete documentation of the conceptual schema. We can check the quality of this
dimension manually, with an overview of the schema documentation.

Compliance with theoretical models: is the database schema for transactional
applications a properly normalized and specialized relational schema, and does
the schema for the data warehouse consist of star schemas? This dimension has to be
partially measured manually, with an overview of the data model. However, we
can find some problematic areas in the schema procedurally. For example, we
can check correlations between non-key attributes; attributes with a very high
correlation in a table lead to the conclusion that the schema is not normalized, or at
least that it is not in third normal form. In addition, many NULLs lead to the conclusion
that there is not enough specialization, i.e. not enough subtypes in the schema. This is
especially true if some values of an attribute always lead to NULLs in another
attribute. We can measure this with database queries, as shown in the sketch after this list.

Minimalization: abstraction is a very important modeling technique. It means
that we should have only those objects in a model that are important for the business



problems that we are solving. The schema should be minimal, without objects that
are not pertinent to the problem it is solving. Again, we can measure this
dimension with a manual overview and comparison to business requirements.
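The sketch referred to in the compliance dimension above could look like the following; the table and column names are illustrative.

-- A procedural check for missing specialization: does one value of an attribute always lead to
-- NULLs in another attribute?
SELECT CustomerType,
       COUNT(*)                                           AS NumRows,
       SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END) AS NumNullBirthDates
FROM dbo.Customers
GROUP BY CustomerType;
-- If NumNullBirthDates equals NumRows for CustomerType = 'Company', BirthDate never applies to
-- companies, and a Persons/Companies subtype design would remove those NULLs.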
Many more data and schema quality dimensions are defined in the literature that deals
specifically with data quality problems. Nevertheless, we have mentioned the most important ones;
now it is time to describe the activities needed to understand and improve data quality.

Data Quality Activities


Data quality projects are very intensive in terms of the resources needed. In order to execute a
successful project, we have to show the possible benefits to key stakeholders. Firstly, we need
to understand the business needs. We can use interviews, overviews of organizational charts,
analyses of existing practices in the enterprise, etc. We have to prioritize the business issues and
make a clear project plan. For a successful project, it is important to start with a business area
that is either very painful for the stakeholders or quite simple to solve. We should always
implement data quality projects step by step.
Before we start any MDM project, we have to understand the sources and destinations of master
data. Therefore, data quality activities must include overviews. We have to make an extensive
overview of all schemas of the databases that pertain to master data. We should interview domain
experts and users of the data. This is especially important for getting a comprehension of the
schema quality dimensions. In addition, after this step, we should also have a clear understanding of
the technology that the enterprise is using. If needed, we have to plan to include appropriate
technology experts in the project. During the overview of the data, we should also focus on the data
life cycle, to understand retention periods and the like.
The next important step is the data quality assessment. We can assess hard dimensions with
a procedural analysis of the data, i.e. data profiling. There are many different tools for data
profiling. We should exploit all the knowledge we have and all the tools available for this task. We
should measure soft data quality dimensions in this step as well. We can get a lot of insight into



the state of the company by comparing hard and soft dimensions. If we evaluate hard
dimensions as bad, but get good evaluations for soft dimensions, then the company does not
realize it has problems with data. In such a case, an additional assessment of the potential damage
caused by bad data can help key stakeholders understand the data quality problems. If both soft
and hard dimensions are evaluated low, the company is ready for a data quality and/or MDM
project. If hard dimensions get good evaluations while soft ones get bad evaluations, then for
some reason domain experts and users do not trust the data. Usually the reason lies in previous
systems, in previous versions of a system, or in insufficient user education. If both hard and soft
dimensions get good evaluations, then the company does not need a special data quality
project; however, the company could still decide on an MDM project, in order to minimize the
expenses of master data maintenance.
After finishing the data assessment, we can re-assess the business impact of low-quality data.
We should meet again with key stakeholders in order to review the priorities and elaborate the
improvements part of the project plan in detail.
Before we start improving data quality, we have to find the root causes of the bad data. Finding root
causes can substantially narrow down the amount of work needed for data quality improvements. For
finding root causes, we can use the five whys method, introduced by Sakichi Toyoda and
first used at the Toyota Motor Company. With this method, we simply ask "why" five times. Imagine the
problem is duplicate records for customers.

We can ask why there are duplicate records. An answer might be: because
operators frequently insert a new customer record instead of using an existing one.

We should ask the second why: why are operators creating new records for
existing customers? The answer might be: because operators do not search for
existing records of a customer.

We then ask the third why: why don't they search? The answer might be:
because the search would take too long.


The next why is, of course: why does it take so long? The answer might be: because it
is very clumsy to search for existing customers.

Now we ask the final, fifth why: why is searching so clumsy? The answer
might be: because one can search only for exact values, not for approximate
strings, and an operator does not have the exact name, address, or phone number of
a customer in memory. We have found the root cause of the duplication: in this
example, it is the application, specifically the user interface. Now we know where to
put effort in order to lower the number of duplicates.

Of course, five whys is not the only technique for finding root causes. We can also just track a
piece of information through its life cycle. By tracking it, we can easily spot the moment when it
becomes inaccurate. We can find some root causes for some problems procedurally as well. For
example, we can find with quite simple queries that NULLs are in the system because of a lack of
subtypes. No matter how we find root causes, we have to use this information to prepare a
detailed improvements plan.
An improvement plan should include two parts: correcting existing data and, even more
importantly, preventing future errors. If we focus on correcting only, we will have to repeat the
correcting part of the data quality activities regularly. Of course, we have to spend some time
correcting existing data; however, we should not forget the prevention part. When we have
both parts of the improvement plan, we start implementing it.
Implementation of corrective measures involves automatic and manual cleansing methods.
Automatic cleansing methods can include our own procedures and queries. If we know the
logic for correcting the data, we should use it. We can solve consistency problems by defining a
single way of representing the data in all systems, and then replacing inconsistent representations
with the newly defined ones. For example, if gender is represented in some system with the
numbers 1 and 2, while we define that it should be represented with the letters F and M, we can
replace the numbers with letters in a single UPDATE statement. For de-duplication and merging from
different sources, we can use string-matching algorithms. For correcting addresses, we can use



validating and cleansing tools that already exist on the market and use registries of valid
addresses. However, we should always be prepared for the fact that part of the data cleansing has to be
done manually.
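For illustration, the gender example could be handled with a single UPDATE statement like the following sketch, assuming the column currently stores the characters '1' and '2'; the table and column names are illustrative.

UPDATE dbo.Customers
SET Gender = CASE Gender WHEN '1' THEN 'F' WHEN '2' THEN 'M' END
WHERE Gender IN ('1', '2');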
Preventing new inserts of inaccurate data involves different techniques as well. The most
important one is to implement an MDM solution. For an MDM solution, we need an MDM
application; SQL Server Master Data Services is an example of a tool we could use.
Besides enabling central storage for master data and its metadata, the MDM application has to
support explicit data stewardship, versioning, auditing and data governance workflows. In
addition to the MDM solution, we should also focus on the source systems. It is very unlikely that our
MDM solution will ever cover all possible master entities with all of their attributes. As we
already mentioned, an approach with a central MDM and a single copy of data is impractical most
of the time. Therefore, part of the master data is still going to be maintained in the source,
operational applications. We have to enforce proper data models, constraints, and good user
interfaces wherever possible.
After we have implemented data quality corrective and preventive solutions, we should
measure how they perform. Even better, we should prepare the measurement infrastructure in
advance, before we start implementing the solutions. By measuring improvements, we can easily
show the value of the solutions to key stakeholders. In addition, we can control our own work as well,
since our improvements can fail and even lead to worse data. We can measure soft dimensions
with ongoing interviews with end users and domain experts. We can store the results of the
interviews in a special data quality data warehouse, to track data quality over time. Hard
dimensions can be measured automatically on a predefined schedule, with the results again stored
in the data quality data warehouse. Figure 7 shows a potential schema for a data quality
data warehouse, for measuring completeness and accuracy.


[Figure 7 is a diagram of the data quality data warehouse: the FactTables and FactColumns fact tables (with measures such as NumRows, NumUnknownRows and NumErroneousRows, and NumValues, NumUnknownValues and NumErroneousValues) surrounded by the DimTables, DimColumns, DimDates and DimStewards dimension tables.]
Figure 7: DATA QUALITY DATA WAREHOUSE


Although not mentioned as an explicit data quality activity step, it is very important to
communicate actions and results throughout the project. The more communication we have,
the better. Domain experts can always help us with their knowledge. IT professionals and end
users put much more effort into the project's success if they are involved, if they feel the project is their
project. Key stakeholders should always know how the project is progressing and have to be
involved in all decision milestones.
involved in all decision milestones.

Page 46

Chapter 1: Master Data Management

Master Data Services and Other SQL Server Tools
With SQL Server 2008 R2, Microsoft released an MDM solution called Master Data Services as
part of the suite. Besides this specialized solution, other parts of SQL Server are extremely useful in
all data quality and master data management activities. In this section, we are going to
briefly introduce Master Data Services and its architecture, as well as other useful SQL Server
tools. Of course, a detailed presentation of Master Data Services and instructions on
how to work with it follow in the rest of this book.

Master Data Services


SQL Server Master Data Services is a platform for master data management. It provides a
location for master data and metadata, enables processes for data stewardship and workflow,
and has connection points we can use to exchange data with sources and destinations, directly
or through application interfaces, and even to extend the functionality of MDS. It consists of a
database, a Web application, a configuration application, a Web service as an interface to
applications, and .NET class libraries that can be used to develop and maintain MDS applications
programmatically.
The MDS database, called MDS Hub, is central to any MDS deployment. The MDS Hub stores:

Schema and database objects for master data and metadata

Versioning settings and business rules

Information for starting workflow and e-mail notifications

Database objects, settings and data for MDS application

Staging tables for importing and processing data from source systems

Subscription views for systems that can retrieve master data directly from the
Hub



Figure 8 graphically represents the MDS Hub, the processes in the Hub, and the connections to
outer systems, i.e. to data sources and destinations.

[Figure 8 is a diagram of the MDS Hub: entities, hierarchies, metadata, business rules, versioning, data stewardship, workflow and synchronization inside the Hub, connected through the Web service to outer systems such as ERP, CRM, SharePoint, DW and others.]
Figure 8: MDS HUB


In an MDS database, models are the topmost level of master data organization. Models consist of:

Entities, which map to real-world entities or to DW dimensions

Attributes, which are part of entities and are containers for entity properties


Members, which are part of entities and which we can consider as rows in an entity table

Hierarchies, also defined as entities, which group members in a tree structure

Collections, which give users the possibility to create ad-hoc groups of hierarchies and members.

Entities
Entities are central objects in the model structure. Typical entities include customers, products,
employees and similar. An entity is a container for its members, and members are defined by
attributes. We can think of entities as tables. In a model, we can have multiple entities.
Attributes
Attributes are objects within entities; they are containers for values. The values describe the properties
of entity members. Attributes can be combined into attribute groups. We can have domain-based
attributes; this means that the pool of possible values for the attribute comes from a lookup
table, i.e. from another entity related to the attribute.
Members
Members are the master data. We can think of members as rows of master data entities. A
member is a product, a customer, an employee and similar.
Hierarchies
Another key concept in MDS is Hierarchy. Hierarchies are tree structures that either group
similar members for organizational purposes or consolidate and summarize members for
analyses. Hierarchies are extremely useful for data warehouse dimensions, because typical
analytical processing involves drilling down through hierarchies.
Derived hierarchies are based on domain-based attributes, i.e. on the relationships that exist in the
model. In addition, we can create explicit hierarchies, which we can use for consolidating
members in any way we need.



Collections
We can also consolidate members in collections. Collections are not as fixed as hierarchies; we
can create them ad-hoc for grouping members and hierarchies for reporting or other purposes.
Versions
MDS supports versioning. We can have multiple versions of master data within a model.
Versions contain members, attributes and attribute values, hierarchies and hierarchy
relationships, and collections of a model.
Master Data Manager is an ASP.NET application that serves as the data stewardship portal. In this
application, we can create models, manage users and permissions, create MDS packages for
deployment on different MDS servers, manage versions, and integrate master data with source
and destination systems. Figure 9 shows the Master Data Manager home page.


Figure 9: MASTER DATA MANAGER HOME PAGE

Other SQL Server Tools for MDM


The SQL Server Database Engine, the RDBMS itself, is the system that hosts the MDS database, the MDS
Hub. Of course, as an RDBMS, it can host many transactional and analytical databases, and we can
use it for many other master data management and data quality activities. Transact-SQL
is a very powerful language for manipulating and querying data that can be used to discover and
correct incorrect data. From version 2005 onwards, we can enhance Transact-SQL capabilities
with CLR functions and procedures. For example, Transact-SQL does not support validating
strings against regular expressions out of the box. However, we can write Visual Basic or Visual
C# functions for this task, and use them inside SQL Server.
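As a sketch only, the Transact-SQL side of such a CLR function could be registered and used as shown below; the assembly path, class and method names are placeholders for an implementation you would write yourself.

-- CREATE ASSEMBLY plus CREATE FUNCTION ... EXTERNAL NAME is how CLR code is surfaced in T-SQL.
CREATE ASSEMBLY RegexUtilities
FROM 'C:\Assemblies\RegexUtilities.dll'
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.RegexIsMatch (@input NVARCHAR(MAX), @pattern NVARCHAR(4000))
RETURNS BIT
AS EXTERNAL NAME RegexUtilities.[RegexUtilities.Functions].RegexIsMatch;
GO
-- Example usage: flag e-mail addresses that do not match a simple pattern.
SELECT EmailAddress
FROM dbo.Customers
WHERE dbo.RegexIsMatch(EmailAddress, N'^[^@\s]+@[^@\s]+\.[^@\s]+$') = 0;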



SQL Server Integration Services (SSIS) is a tool for developing Extract Transform Load (ETL)
applications. An ETL process is always part of an analytical solution; we have to populate data
warehouses somehow. Of course, we can use the ETL process and ETL tools for preparing
master data for our MDS application as well. SSIS includes many procedures, called tasks
and transformations, that are useful for analyzing, cleansing and merging data. For example, the
Data Profiling task quickly gives us a good overview of the quality of our data. The Data Profiling
task is shown in figure 10.

Figure 10: THE DATA PROFILING TASK


Besides the Data Profiling task, there are quite a few additional tasks and transformations that help us
with the ETL process. For example, the Fuzzy Lookup transformation can join data from different
sources based on string similarities. The Fuzzy Grouping transformation helps find duplicates, again
based on string similarities.



SQL Server Analysis Services (SSAS) is an analytical tool. It enables OLAP and Data Mining
analyses. As an analytical tool, it does not seem useful for master data management at first
glance. However, we can use OLAP cubes efficiently for data overview. Data Mining comprises
a set of advanced algorithms that try to find patterns and rules in a data set. Some of these
algorithms are very useful for finding bad data. For example, the Clustering algorithm tries to
group rows into groups, or clusters, based on the similarity of their attribute values. After we find
clusters, we can analyze how well a row fits into its cluster. A row that does not fit well into any
cluster is a row with suspicious attribute values.
Finally, we have SQL Server Reporting Services (SSRS) in the SQL Server suite. SSRS is useful for
master data management indirectly. We can present our activities and data quality
improvements through reports. We can publish reports on the default SSRS portal site, or on any
other intranet portal site we have in our enterprise, including SharePoint sites.


Summary
At the beginning of this book about SQL Server Master Data Services, we started with some
theoretical introductions. We defined what master data is. Then we described master data
management in general. We mentioned explicit actions around data governance, and the operators
who manage the master data; those operators are called data stewards. We also
introduced different approaches to master data management.
Then we switched to master data sources and destinations. For sources, it is crucial that they
take care of data integrity. We discussed briefly the Relational Model, the most important
model for transactional databases. We also mentioned normalization, the process of
unbundling relations: a formal process that leads to the desired state of a database, in which each table
represents exactly one entity. In this second part of the first chapter, we also introduced the
Dimensional Model, a model used for analytical systems.
Master data management always has to deal with data quality. Investments in an MDM
solution make no sense if the quality of the data in our company does not improve. In order to
measure data quality, we need to understand which data quality dimensions we have. In
addition, we introduced the most important activities dedicated to data quality improvements.
In the last part of the chapter, we introduced SQL Server Master Data Services. We mentioned the
MDS key concepts. We also explained how we can use other elements of the SQL Server suite for
master data management and data quality activities. It is time now to start working with MDS.


References

Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)

C. J. Date: An Introduction to Database Systems, Eighth Edition (Pearson Education, Inc., 2004)

Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling, Second Edition (John Wiley & Sons, 2002)

Danette McGilvray: Executing Data Quality Projects (Morgan Kaufmann, 2008)

5 Whys on Wikipedia


Chapter 2: Master Data Services Concepts and Architecture
Davide Mauri
We introduced master data, master data management and Master Data Services in the first
chapter of the book; now it is time to show how we can work with this new product in
the SQL Server suite. This chapter shows how we can create an MDS model with entities, attributes,
hierarchies and collections. In addition, it shows how we can import and export data to and
from the MDS Hub. We will also learn how to implement business rules for our attributes, and the
concept of versioning master data.
The chapter also starts with a quick introduction to Master Data Services setup, so that it can
serve you as a walk-through guide if you want to get your hands on MDS but do not have it
installed yet. Topics covered in this chapter include:

Installation of Master Data Services (MDS);

Using MDS Web application;

Defining models, entities, attributes and hierarchies;

Constraining inputs with business rules;

Importing, exporting and managing data;

Managing versions.

Page 56

Chapter 2: Master Data Services Concepts and Architecture

Master Data Services Setup


This section provides an overview of installing Master Data Services. The aim is to give you just a
short description of the steps necessary to get a machine with MDS up and running, without
going into too much detail. Complete coverage of an MDS installation is provided in SQL
Server 2008 R2 Books Online.
The installation and setup of Master Data Services is divided into three main steps:

Installation of MDS components and tools

Setup of MDS database

Setup of MDS Web Application

Installation of MDS Components and Tools


Installing the MDS components and tools is a straightforward operation. It's done using a typical
Windows installer package, and you can simply accept all default options.
MDS Setup is included only in the 64-bit versions of the SQL Server 2008 R2 Developer, Enterprise and
Datacenter editions. Operating system support is determined mainly by the edition of SQL
Server you choose. For example, the Developer edition can be installed starting with Windows
Vista (Business, Enterprise and Ultimate) and Windows 7 (Professional, Enterprise and
Ultimate).
MDS components and tools require .NET Framework 3.5 Service Pack 1 to be installed. If it's not
already installed, MDS Setup will install it before proceeding with the main installation.
Furthermore, the Windows user account of the logged-on user executing MDS Setup needs
to be a member of the local Administrators Windows group.



Although not strictly necessary to set up the components and tools, the PowerShell feature
needs to be installed before proceeding with the remaining configuration steps.
The other two installation steps require a specialized configuration program called Master
Data Services Configuration Manager, which is installed along with the other components and
tools in the first step.

Figure 11: MASTER DATA SERVICES CONFIGURATION MANAGER

Setup of MDS Database


Setting up a database for an MDS solution requires an already installed SQL Server instance and
the provisioning of two Windows users that will be granted the appropriate permissions to log in
to the instance and access the database.



The first Windows user will be used to provide the identity of the web application pool, and the
second one will be used as the MDS System Administrator. The MDS System Administrator can
access and update all models and all data in all functional areas, for all web applications that use
the MDS database. In this book, we assume that the MDS database created is named MDSBook.
MDS Configuration Manager can then be used to connect to the instance and create the
database, as shown in Figure 2.

Figure 2: DATABASE SETUP WITH MDS CONFIGURATION MANAGER

Setup of MDS Web Application


Setting up an MDS web application requires the installation of the Web Server (IIS) role along
with some specific role services. As a quick guideline, the following role services are required:
Common HTTP Features



o Static Content
o Default Document
o Directory Browsing
o HTTP Errors

Application Development
o ASP.NET
o .NET Extensibility
o ISAPI Extensions
o ISAPI Filters

Health and Diagnostics


o HTTP Logging
o Request Monitor

Security
o Windows Authentication
o Request Filtering

Performance
o Static Content Compression

Management Tools
o IIS Management Console

Furthermore, the following features are required:

.NET Framework Features


o WCF Activation

HTTP Activation

Non-HTTP Activation

Windows PowerShell

Windows Process Activation Service


o Process Model
o .NET Environment



o Configuration APIs
MDS Configuration Manager can then be used to create and configure the web application as
shown in Figure 3.

Figure 3: CONFIGURING WEB APPLICATION WITH MDS CONFIGURATION MANAGER


Master Data Manager Web Application


The entry point for any kind of human interaction with Master Data Services is the Master Data
Manager Web Application.
The web application is installed during the configuration of Master Data Services as part of the
initial system deployment, and it allows managing everything related to a Master Data
Management solution. From data model definition to system and security management,
passing through data verification and manipulation, this is the central point from which Master
Data can be managed.
Figure 4 shows the five main functional areas, which are the entry point to access all available
features.



Figure 4: MASTER DATA MANAGER WEB APPLICATION

Explorer
Explorer is where all the data stored inside Master Data Services can be managed. Models, entities,
hierarchies and collections can be navigated, and their data can be validated with the help of
defined business rules.
Direct updates or additions to the Members, the actual data stored in entities, can be done
here. It's also possible to annotate such data to explain why a change has been made, so that
everything can be tracked and kept safe for future reference. It's also possible to reverse
transactions if some changes made to the data have to be undone.

Version Management
Once the process of validating Master Data is finished, external applications can start to use the
verified data stored in Master Data Services. Since applications will rely on that reference data,
it's vital that no modification at all can be made to it; but the data will still need to change in order
to satisfy business requirements. By creating different versions of the data, it's possible to manage
all these different situations, keeping track of changes and creating stable and immutable
versions of Master Data.

Integration Management
As the name implies, Integration Management allows managing the integration of the Master
Data with the already existing data ecosystem. With Integration Management, it's possible to
batch import data that has been put into staging tables and monitor the import results. It is also
possible to define how data can be exposed to external applications via defined Subscription
Views.


System Administration
System Administration is the central place where the work of defining a Master Data Model
takes place. With System Administration it is possible to create models, entities, attributes,
hierarchies, business rules and everything offered by Master Data Services in terms of data
modeling.

User and Groups Permissions


Security plays a key role when a system manages all the data that represents a core asset of any
business. With User and Groups Permissions, all topics regarding security can be administered
and managed, configuring all the security clearances needed to access and modify Master Data.


Models
As explained in Chapter 1, Master Data Services organizes its data with the aid of different
concepts that deal with different aspects of the data. All these concepts allow the definition of a Data
Model. To start getting confident with Master Data Services concepts and technology, in the next
paragraphs we'll create a model to hold customer data. By following a walk-through of all the
steps needed, we will get an example of a functional Master Data Services model.

Models
The highest level of data organization in Master Data Services is the Model. A model is the
container of all other Master Data Services objects and can be thought of as the equivalent of a
database in the Relational Model.
As a result, the first step in creating any solution based on Master Data Services is to create a
Model. From the System Administration section of the Master Data Manager portal, select the
Model element from the Manage menu.

Figure 5: OPENING THE MODEL MAINTENANCE PAGE



From the Model Maintenance page, it's possible to create a new Model simply by clicking the
Add Model button (the green plus icon). This opens the Add Model page, where it is
possible to specify the name and other options that will automatically create some default
objects for us. These will be useful once we have a solid background on how Master Data Services
works, in order to save some work, so for now we'll uncheck them.

Figure 6: ADDING A MODEL


Since a Model is just a simple container, all that's needed to create a Model is its name. For our
sample we're going to create customers' Master Data, so we call our Model Customer.

Entities and Attributes


After having created the Customer Model, the Entities of that model need to be defined. Entities
are the objects of which a customer is made, and they are conceptually very similar to the
tables of a standard Relational Database.
In our sample Customer Model, each customer will obviously have a specific address. An
address exists in a City, which has a specific Province (or State), within a Country.



Cities, Provinces and Countries will then be entities of their own, since they represent reference
data too. In this way, Master Data will contain one and only one definition of their member
values, so that it will be possible to have a single version of the truth, avoiding human errors
such as misspelled or incorrect data that hinder the reach of high data quality standards.
The process of deciding which object has to be put into a separate Entity and which does not is
similar to the Normalization process for a relational database. The aim is to avoid redundancy
and data duplication, which lead to data anomalies and inconsistencies.
Back to our sample: to hold the customers' reference data, we'll create the following entities:

Customer

City

StateProvince

CountryRegion

This sample is based on the usage of the AdventureWorksLT2008R2 sample database,
which can be downloaded from CodePlex.
Entities are created through the Model -> Entities menu item of the System Administration
page. After having defined which model an Entity belongs to, only the name has to be specified
in order to create it.


Figure 7: ADDING AN ENTITY


We haven't defined what hierarchies and collections are yet, so we don't want to have them in our
model for now. It will be possible to change this option later on, when we'll have a deeper
understanding of these objects.
After creating all the aforementioned Entities, you'll see a page like the following:


Figure 8: ENTITY MAINTENANCE PAGE AFTER ADDING ALL ENTITIES


From here it's possible to enrich and complete the entities' definitions with the aid of Attributes:
entities are made of Attributes, which are the analog of columns in a relational database.
Any Entity is made of at least two attributes, which are mandatory and cannot be removed or
renamed:

Name

Code

Attributes can be of three different types. The type defines which kind of values an Attribute
can handle:

Free-Form: allows the input of free-form text, numbers, dates and links.

File: allows you to store any generic file, for example documents, images or any
kind of binary object.

Domain-Based: the values of an attribute that has a Domain-Based attribute type
are the values stored in another Entity. It allows the creation of relationships



between Entities, by imposing the rule that each value used here must be a value
that exists in another specific Entity. As a result, this forces the user not to enter
free-form data but to choose from a list of defined values, which helps a lot in
avoiding errors and redundancy. The entity that stores the domain of possible
values for the attribute is similar to a lookup table in the classical Relational Model.
To manage attributes for an entity, we first need to select that entity from the Entity
Maintenance Page, so that a set of icons will become visible:

Figure 9: SELECTING AN ENTITY IN THE ENTITY MAINTENANCE PAGE


Clicking on the pencil icon will open the Attribute Management page for the selected Entity.


Figure 10: OPENING THE ATTRIBUTE MAINTENANCE PAGE


In the Leaf Attributes section it's possible to add as many attributes as we need. Attributes are
called Leaf to distinguish them from the Consolidated Attributes used in Explicit Hierarchies. In
brief, a Leaf Attribute represents the lowest level of detail used to define the Master Data
model.
Moving forward in our Customer sample, we need to create an attribute for the StateProvince
entity: a State or a Province is contained within a Country, and so we need to model this
situation in our solution.
This kind of relationship between the StateProvince and Country entities can be represented
using a Domain-Based Attribute, as shown in Figure 11.


Figure 11: CREATING A DOMAIN-BASED ATTRIBUTE


Now it will probably be obvious that a Domain-Based Attribute has to be defined also for the City
Entity, so that it will be related to the state or province in which the city is located.
With the Customer Entity, we will need to define several attributes. Besides the mandatory Code
and Name, we'll create the following:

Firstname

Lastname

EmailAddress

Phone

Address

City

Page 72

Chapter 2: Master Data Services Concepts and Architecture


For simplicity we'll create them all, with the exception of the City attribute, as Free-Form
attributes with a maximum length of 100 characters. City has to be a Domain-Based attribute
whose values will be taken from the City entity.
When a Free-Form attribute is created, it's possible to define the length of the data and the
length in pixels of the textbox that will allow a user to see and modify the attribute's data.

Figure 12: CREATING A FREE-FORM ATTRIBUTE


As the previous image shows, it's also possible to flag the option Enable Change Tracking. All
kinds of attributes support this option. As the name suggests, if Change Tracking is enabled,
Master Data Services will monitor the attribute's data for changes. If a change happens, a Business
Rule can be configured to fire so that the necessary validation of the data can take place.


An attribute's data type determines which constraints can be placed on the attribute itself to
assure data consistency. For the Number data type, it's possible to define how many decimal places
have to be used and the input mask.

Figure 13: NUMBER DATA TYPE OPTIONS


The DateTime data type also needs to have an input mask defined. A File attribute can constrain
the type of handled files by specifying a file extension. Just be aware that this is not meant to be
a security mechanism; for example, nothing will check that the actual file content is what its
extension says it should be.
After having created all entities and attributes, we have defined the model in which we'll store
customer data. It should be clear at this point that creating Entities and Attributes is the key
part of modeling Master Data. As it happens for relational database modeling, it's vital for the
success of a project to have a clear idea of all the entities that are needed, along with their
attributes, so that correct relationships can be created. Changing them once data has
been imported, though still possible, is a time and resource consuming task.
Some similarities with a relational database emerge here. This is not unexpected:
Master Data Services uses the power of the SQL Server Relational Database Engine to
enforce constraints that assure data quality. Each time an Entity is created, a
corresponding table gets created too, and each time a Domain-Based attribute puts two



Entities in a relationship, a corresponding Foreign Key constraint gets created between
the two underneath tables. All entities and attributes are stored into mdm.tblEntity and
mdm.tblAttribute tables of the configure Master Data Services database. Members of an
entity are stored in a specific table created ad-hoc. Two tables are created per entity.
These tables following a specific naming convention:
mdm_tbl_<model_id>_<entity_id>_<table_type>. EN table type holds entity members,
while MS table type holds security information. The mapping between these tables and
the Entity that uses them can be found in the mdm.tblEntity table.
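For example, to see which physical tables were generated for a specific entity, we could query the mapping table directly. This is just a sketch run against the MDS database; the same columns appear again in the query shown in the MDS Database Schema section later in this chapter.
-- Look up the generated entity and security tables for the Customer entity.
SELECT ID, Model_ID, Name, EntityTable, SecurityTable
FROM mdm.tblEntity
WHERE Name = N'Customer';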
As can easily be supposed, in a real-world solution an entity can have dozens of attributes, each
one (or each group) describing a specific aspect of the entity. In this case it may make sense to
group logically connected attributes together, so that it is easier for the end user to focus
only on a specific group of data.
The possibility to group attributes is supplied by the Attribute Group feature. In our example it
makes sense to group the Address and City attributes together, since they hold address
information. The management page of Attribute Groups can be opened from the Model -> Attribute
Groups menu. For each entity, Attribute Groups can be created and populated with the desired
attributes. The Attribute Groups Maintenance page is shown in Figure 14.


Figure 14: ATTRIBUTE GROUPS MAINTENANCE PAGE


The Name and Code attributes are not visible since they are always available in every attribute
group, so they don't need to be added manually.
Another Attribute Group we may want to create here is one that groups contact information
together (FirstName, LastName, Phone, EmailAddress) to isolate it from address data.
Besides helping users to manage Master Data more easily, Attribute Groups are also a powerful
security feature. Please refer to Books Online for more details on how effective permissions are
determined from the permissions assigned to users, groups and MDS Model elements.


Hierarchies
Hierarchies are tree-based structures that allow you to organize master data in a coherent way,
which also facilitates the navigation and aggregation of the data itself.

Figure 15: HIERARCHY EXAMPLE


A hierarchy tells the user, or the consumer system that receives Master Data from Master Data
Services, that it makes sense to aggregate data at all levels except the leaf level, which represents
the actual un-aggregated data.
For example, Customers may be the leaf level. Since customers live and work in cities, it makes
sense to aggregate all the customers on a per-city basis, so that we can analyze how strong the
presence of our company is in different cities. Cities are located in countries, so aggregating data
by country is interesting for analyzing worldwide business.
Having a common place where hierarchies are defined, like in Master Data Services models, is
key to having a standard way to aggregate data across all the applications used in a
company.



For all those who have already developed a Business Intelligence solution, the concept of a
hierarchy in a SQL Server Analysis Services Dimension is very close to the one used by Master
Data Services. In fact, it is possible to create and populate an Analysis Services Dimension using
just the data stored in a Master Data Services hierarchy.
Derived Hierarchy
A Derived Hierarchy is a hierarchy created using the relationships that exist between entities
that use Domain-Based attributes. In our Customer sample, we have a relationship between
City, StateProvince and CountryRegion as follows:

Figure 16: DERIVED HIERARCHY EXAMPLE


A Derived Hierarchy can be created using the Derived Hierarchy Maintenance page, reachable
via the Manage menu. After having specified the name, Geography in our case, composing the
hierarchy is all a matter of dragging the available entities from the left to the Current Level
section in the middle of the page. After an entity is dropped, Master Data Services
automatically detects at which level of the hierarchy it has to be placed, by
analyzing the relationships between it and the already placed entities.


Figure 17: DERIVED HIERARCHY MAINTENANCE PAGE


The Preview pane shows how Members are organized based on the created hierarchy. This
feature is only useful once data has already been imported into Master Data Services, to
check that the hierarchy organizes values in the expected way.
In a Derived Hierarchy, all members of all used entities take part in the hierarchy. When a
member of an entity is added, deleted or updated, Master Data Services automatically updates
the related hierarchies so that data is always consistent. An entity member can be used only
once in a hierarchy, and an entity cannot be placed in more than one level. Of course, an entity
can be used in more than one hierarchy.
Special cases for Derived Hierarchies are Recursive Hierarchies. This is the case when you have
an Entity that has a Domain-Based attribute that uses values coming from the entity itself. A
typical scenario here is the organizational chart of a company where an employee can also be
the manager of other employees. There is no particular difference between this kind of
hierarchy and the standard Derived Hierarchies, but there are a few limitations:

Only one Recursive Hierarchy can exist per Entity

Member Permission cannot be assigned

Circular References are not allowed

Explicit Hierarchy
It is also possible to define a hierarchy without using the relationship between entities: maybe
we need to create a hierarchy using some additional data not available in any system, by adding
new data manually directly into the Master Data Services database.



These special hierarchies are called Explicit Hierarchies. They can organize data coming from
one entity only (as opposed to what happens with Derived Hierarchies) and allow the creation
of Consolidated Members, which define the levels of the hierarchy itself.
In our Customer example, we have CountryRegion data, which contains information about the
nation where a customer resides. For business purposes we may decide that it makes sense to
have the data grouped into Continents and Geographic Zones.

Figure 18: EXPLICIT HIERARCHY EXAMPLE


Explicit Hierarchies can be ragged, meaning that not all branches have to be equally deep. For
example, we may decide to have a higher level of detail under North America, adding the levels
Country and State, while omitting them for other elements.
To create an Explicit Hierarchy, we first have to enable this feature at the Entity level. This can be
done from the Entity Maintenance page, together with the definition of the hierarchy name.


Figure 19: Enabling EXPLICIT HIERARCHIES


If the Include all leaf members in a mandatory hierarchy flag is checked, the newly created
Explicit Hierarchy will contain all the entity's members. By default they will be placed under the
Root node, so that they can be moved into other levels later on. If the flag is not checked, not
all of the entity's members have to be used in the hierarchy; the unused members will be placed in a
special node named Unused.
After having enabled the Explicit Hierarchy, the Entity Maintenance page will look like the one in
Figure 20.



Figure 20: ENTITY MAINTENANCE PAGE AFTER ENABLING EXPLICIT HIERARCHIES
Consolidated Attributes are the attributes used by Consolidated Members. As can be seen, it's
possible to have more than one Explicit Hierarchy per Entity.
Collection Attributes are attributes used by Collections, which will be discussed later in this
chapter.
After having defined an Explicit Hierarchy, the Explorer functional area of Master Data Services
has to be used to add Consolidated Members to that hierarchy and build its levels.
From here, through the Hierarchies menu item, it's possible to access the Maintenance page and
choose the Explicit Hierarchy. The page is divided into two main sections. On the left, the
hierarchy is visible, and it's possible to navigate its levels or move items from one level to
another (along with their sublevels) by dragging and dropping them or by using the cut-and-paste
feature.

Figure 21: MANAGING EXPLICIT HIERARCHIES LEFT SIDE


On the right side of the page is the area where it's possible to create or delete items
at each level. In the following sample, under the Europe level we need to add the Western,
Central and Eastern zones. After having selected the Europe item on the left side of the page, it
is possible to add Consolidated Members to that hierarchy level.

Figure 22: MANAGING EXPLICIT HIERARCHIES RIGHT SIDE


It is also possible to add Leaf Members from this page. Remember that since a Leaf Member is
considered the last item of a hierarchy, a leaf member cannot have any child elements.
After having clicked on the Add button (again the green plus icon), a new page will pop up.
Here, all attributes for the chosen item can be entered, and it's also possible to select under
which level the new member will be placed by clicking on the Find Parent Node button
(indicated by a magnifying glass icon).


Figure 23: ADDING A CONSOLIDATED MEMBER


The manual addition of Consolidated Members is just one option. As we'll see next, it's possible
to batch-import such members, just as for Leaf Members.

Collections
Collections are nothing more than a group of Explicit Hierarchies and Collection Members. They
are useful to create groups of members that are not necessarily grouped from a business
perspective, but are convenient to group to make life easier for the user. For example,
if someone is in charge of managing Customers from Spain, Portugal and South America, (s)he
can do a more efficient job if (s)he can find all those customers in a single group. Here's where
Collections come into play.
Collections can be created on a per-Entity basis and, just like Explicit Hierarchies, they can also
have their own Consolidated Members. Again, member management is done in the Explorer
functional area. After having created a Consolidated Member to represent the collection, it's
possible to add members by clicking on the Edit Members menu item, visible after clicking on
the drop-down arrow near the collection we want to manage.


Figure 24: EDITING A COLLECTION


A new page will be displayed where, using the same drag-and-drop or cut-and-paste techniques used
to manage Hierarchies, members of the collection can be added, removed or changed.

Business Rules
Business Rules are one of the most powerful features of Master Data Services. They allow the
definition of rules that assure the quality of data. A business rule is basically an If...Then
sentence that allows specifying what has to be done if certain conditions are met.
For example, let's say that an email address, to be considered valid, needs to contain at least the
at (@) character.
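To make the intent of such a pattern concrete, here is a plain T-SQL illustration of the same formal check. This is not how MDS evaluates the rule internally, and the sample addresses are made up; it only shows what the wildcard pattern means.
-- The % wildcards match any sequence of characters, so the value must contain at least one @.
SELECT EmailAddress,
       CASE WHEN EmailAddress LIKE '%@%' THEN 'passes' ELSE 'fails' END AS PatternCheck
FROM (VALUES (N'jon.doe@example.com'), (N'not-an-email')) AS t(EmailAddress);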
Business Rules are created in the System Administration functional area, by accessing the
Business Rule Maintenance page through the Manage menu item. New rules can be created
by clicking on the usual Add button, and the rule name can be set by double-clicking on the
existing name that has to be changed. After having selected for which entity the Business Rule
has to be created, the rule can be defined by clicking on the third icon from the left:


Figure 25: BUSINESS RULE MAINTENANCE PAGE


The Edit Business Rule page will be displayed. This page has four main areas.
In the top left section there are the Components of the business rule. Each component
represents a piece of a sentence that can be put together with others to form a business rule. They
are of three types: conditions, used for the If part of the rule sentence; actions, used for the Then
part; and logical operators, which connect two or more conditions. As said before, in our case we just
want to be sure that the EmailAddress attribute values contain the @ sign. Under the Validation
Actions there is the must contain the pattern action. We don't need any conditions, since we
always want to verify that an email address is formally correct.
To put the desired elements under the proper section, we just have to drag and drop the element we
want to use into the right place: Conditions or Actions.


Figure 25: EDITING A BUSINESS RULE


As soon as we drop the item, the Edit section of the page, on the bottom right, will require us to
configure the chosen Condition or Action. Depending on the selected item, we may be asked to specify
which attribute we want to use and some other configuration options. In our example, we'll
need to specify that the rule will be applied to the EmailAddress attribute and that the pattern
we're looking for is %@%.

Figure 26: SPECIFYING DETAILS OF THE MUST CONTAIN THE PATTERN ACTION



Again, to select the attribute we want to use, we just have to drag and drop it under the Edit
Action item. Once the rule is completed, it needs to be published before users can start to use
it to validate data. From the Business Rules Maintenance page, select the rule that needs to be
published and click on the Publish Business Rules button (second from left).

Figure 27: PUBLISHING A BUSINESS RULE


After the Business Rule has been published, it can be used to validate Master Data.


Importing, Exporting and Managing Data


Master Data Services needs to collect data from other systems in order to consolidate and
validate it and make it Master Data. Though data can be managed directly using the Master
Data Management portal, it will be a common scenario to import and export data from and to
an external system that produces and consumes reference data.

Import Data
Importing data into Master Data Services is a batch process based on three main tables:

mdm.tblStgMember

mdm.tblStgMemberAttribute

mdm.tblStgRelationship

The first table, mdm.tblStgMember, is used to import members and their Code and
Name system attributes. All other user-defined attributes have to be imported using the
mdm.tblStgMemberAttribute table. If we need to move members into an Explicit
Hierarchy or add members to a Collection, we'll use the mdm.tblStgRelationship table too.
Without going into deep technical details, to import member data we first need to create
the members and then populate all the user-defined attributes, in that sequence.
For the first attempt, we'll load data for the CountryRegion Entity, taking the values from the
AdventureWorksLT2008R2 sample database.
To extract and load the data into mdm.tblStgMember we'll use the following T-SQL
code:
USE AdventureWorksLT2008R2;
GO
WITH cte_source AS
(
SELECT DISTINCT
CountryRegion
FROM
SalesLT.[Address]
)
INSERT INTO MDSBook.mdm.tblStgMember
(ModelName, HierarchyName, EntityName,
MemberType_ID, MemberName, MemberCode)
SELECT
ModelName = 'Customer',
HierarchyName = NULL,
EntityName = 'CountryRegion',
MemberType_ID = 1,
MemberName = CountryRegion,
MemberCode = ROW_NUMBER() OVER (ORDER BY CountryRegion)
FROM
cte_source;
GO
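If we want to double-check what ended up in the staging table before processing it, a quick sanity-check query (our own addition, using only the columns populated above) could look like this:
SELECT ModelName, EntityName, MemberType_ID, MemberName, MemberCode
FROM MDSBook.mdm.tblStgMember;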

After executing the T-SQL batch from SQL Server Management Studio, we'll have three rows
imported into the member staging table of our MDSBook sample database. To notify Master Data
Services that we want to start the import process, we have to use the Import feature available
in the Integration Management functional area.

Figure 28: IMPORTING CUSTOMER DATA


By clicking on the Process button (the only one above the Model label) we'll start the batch
process, which will queue the rows to be imported. Batches are processed every 60 seconds by
default.



After that time the process will be completed and we'll be able to see the result on the same
page:

Figure 29: AFTER CUSTOMER DATA IS IMPORTED


If errors occurred, detailed information about them is shown on a separate page, reachable
simply by clicking on the Batch Details icon, the first on the left.

Managing Data
Now that we have our data inside Master Data Services, we can use the Explorer functional area
to explore and manage the data. Here we can select the Entity we want to manage, simply by
choosing the desired one from the Entities menu.
Since we've just imported CountryRegion data, we may want to check it. The CountryRegion
Entity page will be as shown in Figure 30.


Figure 30: THE COUNTRYREGION ENTITY PAGE


Here it is possible to add or remove members, or change existing ones simply by double-clicking
on the attribute that we want to change.
The yellow question mark visible on the left by the member's name indicates that the data has not
yet been validated, which means that we have to execute business rules in order to validate
it. To do this, we simply have to select the members we want to validate by checking the
relative checkbox, and then click on the Apply Business Rules button, the third from the
right, just below the CountryRegion title.
We haven't defined any business rules for CountryRegion, so the data will surely pass validation,
changing the yellow question mark to a green tick mark.
Since CountryRegion has an Explicit Hierarchy defined, we may also want to check the Consolidated
Members. As the image shows, we can do this by selecting the Consolidated option, or we can
use the Hierarchies menu to access the Explicit Hierarchy and put the imported Leaf Members
under the correct levels:


Figure 31: THE COUNTRYREGION EXPLICIT HIERARCHY PAGE


By using the drag-and-drop or cut-and-paste features, we can move the Canada and United
States leaf members under the North America Consolidated Member, and the United Kingdom
leaf member under the Western Europe Consolidated Member.

Export Data
Now that we have imported, validated and consolidated our data, we may need to make it
available to external applications. This can be done by creating a Subscription View from the
Integration Management functional area.

Figure 32: CREATING A SUBSCRIPTION VIEW


A Subscription View needs to be defined with a name, a version of the data that has to be
exported (we'll introduce versioning in the following section of this chapter) and, of course, the
Entity or the Derived Hierarchy we want to export. The format allows you to define which kind
of data we want to export: Leaf Members, Consolidated Members, levels of an Explicit Hierarchy
and so on.
We want to export the CountryRegion Master Data, along with the information of its Explicit
Hierarchy: that's why we have to choose Explicit Levels in the Format combo-box and specify
how many levels we want to export. Three levels are enough, since our Explicit Hierarchy is
made up of Continent, Zone and Country.
After having saved the Subscription View, Master Data Services creates a SQL Server View for
us. In the MDSBook database, we created the Customer_CountryRegion view which is ready to
be used by external applications to get that Master Data. Here is an example of Transact-SQL
code used to query the view:
SELECT *
FROM [MDSBook].[mdm].[Customer_CountryRegion];

The view, along with other columns, has flattened the hierarchy using three levels, as
requested:

Figure 33: DATA FROM A SUBSCRIPTION VIEW


Multiple Versions of Data


Now that Master Data has been made available to other systems via Subscription Views, we must
guarantee those systems that Master Data won't change suddenly and without notice. To
freeze Master Data so that its values won't change, we have to introduce the concept of
versions.
Versions provide a way to create logical snapshots of the Master Data; this way, we can still
make changes to Master Data while keeping track of the changes over time and providing to the
subscribing systems the version they expect to get.
Versioning is completely integrated into the system and is always active. At the very
beginning, only one version of the data exists and its status is set to Open. This first version is
automatically created by Master Data Services as soon as the Model is created.
Open means that everyone (who has the required permission) can make changes to the
model and to its data.

Figure 34: THE FIRST VERSION OF DATA IS OPEN


Once the model is completed and has been filled with data, the administrator needs to start the
validation process. Of course, to validate data it has to be stable. To be sure that no-one can
change data while someone is validating it, the version must be set to Locked. This operation is
done in the Version Management functional area, shown in Figure 35.


Figure 35: LOCKING A VERSION


Once the version has been locked, validation can take place. To validate data all at once, the
Validation Version page, reachable via the menu with the same name, runs all the defined
Business Rules on all data:

Figure 36: VALIDATING A VERSION


After the data has been validated, it's finally possible to set the version to the Committed status,
using the initially greyed-out button (the second one in the top left of the image above).



Once the version has reached this status, its data cannot be changed anymore. If we need to
change Master Data again, we'll also need to create a new version of it. New versions are created
from the Manage Versions page, by selecting which committed version of the data we want to start
working from.

Figure 37: THE MANAGE VERSIONS PAGE


Now that we have a new version, we can start to work on it freely, being sure that all reference
data stored in the previous version will remain untouched, allowing the systems that rely upon it
to work correctly and without problems.


MDS Database Schema


In the last part of this chapter, we are going to take a look at the MDS database schema. This
should help us understand how it's made and how it works, since an intimate knowledge of
the final store of our Master Data can help us create better solutions to handle it.
Each time a new MDS solution is created using the MDS Configuration Manager tool, a new
database is also created.
This database will contain all the models we'll define in this specific MDS solution, besides all the
system objects needed by MDS for its own functioning. All these objects are located in the mdm
schema. Among all the tables contained here, several are particularly interesting and deserve
more attention.
The first of these tables is mdm.tblModel. Each time a new model is created within the selected
MDS Application, a new row is also inserted here. This action also generates an ID value that
uniquely identifies the model and that is used throughout the database. From now on this ID will
be referred to as model_id.
A Model's Entities also reside in tables; the main table where the list of all Entities is stored is the
mdm.tblEntity table. Obviously, in this case too each entity has its own ID value, which will be
called entity_id. The table contains all the Entities defined in the MDS Application, and each Entity
is related to the Model in which it has been defined via the model_id value.
Each time a new Entity is added to the model, at least two dedicated tables are also created.
The names of these tables are generated using this rule:
mdm.tbl_<model_id>_<entity_id>_<member_type>



The first two placeholders, model_id and entity_id, have already been defined; member_type
can have one of six different values:

EN: table that holds entity data

MS: table that stores security data

HR: table that stores Explicit Hierarchy data

HP: table that stores relationships between entities in the Explicit Hierarchy

CN: table that holds Collections defined on the Entity

CM: table that keeps track of which Entity Member is in which Collection

Allowed member types are stored in the mdm.tblEntityMemberType table.

As soon as an entity is created, only the EN and MS tables get created. All the other tables will be
created only if the functionality they support is used. The following query shows an example of
how to find all entities in an MDS database by querying the mdm.tblEntity table, with the results
shown in the figure right after the query.
SELECT ID, Model_ID, Name, EntityTable, SecurityTable
FROM mdm.tblEntity;

Figure 38: ENTITIES IN THE MDM.TBLENTITY TABLE


As we know, each Entity can have its own set of Attributes. These attributes are stored as
columns in the EN entity table. Each time a new attribute is added to an entity, a new column is
added to the related table. The name of the created column follows a naming convention
similar to the one already seen for table names:



uda_<model_id>_<attribute_id>
The only exception to this rule is for the Code and Name attributes. Since they are mandatory
for any entity, they are mapped directly to columns with the same name. In the following figure,
we can see sample results of a query over an example of the EN entity table.

Figure 39: ATTRIBUTES OF AN EN ENTITY TABLE


The attribute_id, like the other ids mentioned so far, is generated automatically by the
system each time a new row is inserted. The table that keeps track of all attributes in the
database is the mdm.tblAttribute table. This table stores all the metadata needed by attributes,
like the DisplayName and the mapping to the column name used in the entity table:

Figure 40: PARTIAL CONTENT OF THE MDM.TBLATTRIBUTE TABLE


When an entity has a Domain-Based attribute, a Foreign Key constraint is automatically created
between the referenced and the referencing entities in order to support the relationship.
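These generated constraints can be inspected with the standard SQL Server catalog views. Nothing here is MDS-specific; it is just a convenient way to see the relationships that the Domain-Based attributes created in the MDS database.
-- List foreign keys created under the mdm schema.
SELECT fk.name AS ForeignKeyName,
       OBJECT_NAME(fk.parent_object_id)     AS ReferencingTable,
       OBJECT_NAME(fk.referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys AS fk
WHERE OBJECT_SCHEMA_NAME(fk.parent_object_id) = N'mdm';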


Staging Tables
As we learned in the previous section, each time a new entity is created, an associated
database table gets created as well. In such a dynamic and ever-changing environment, creating
a standard solution to import data into the Master Data Services database, so that it fits into
the defined model, could be a nightmare.
Luckily, Master Data Services comes to the rescue, giving us three standard tables, all in the
mdm schema, that have to be used to import data into entities, to populate their attributes
and to define the relationships of hierarchies.
These three tables are called staging tables and are the following:

mdm.tblStgMember

mdm.tblStgMemberAttribute

mdm.tblStgRelationship

By using them, it's possible to import data into any model we have in our Master Data Services
database. A fourth staging table exists, but it isn't used actively in the import process, since it
just reports the status and the result of the batch process that moves data from the
aforementioned staging tables into entities, attributes and hierarchies. This table is the
mdm.tblStgBatch table.
You can populate the staging tables with regular T-SQL inserts and bulk inserts. After the staging
tables are populated, you can invoke the batch staging process from the Master Data Manager
Web application. After the batch process is finished, you should check for possible errors.
Please refer to Books Online for more details about populating your MDS database through the
staging tables.
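As an illustrative sketch only (the column list follows the staging-table layout described in Books Online, and the member code and e-mail value are made up), staging a single attribute value for an existing member could look like this:
-- Hypothetical example: stage the EmailAddress attribute value for leaf member '1001'.
INSERT INTO MDSBook.mdm.tblStgMemberAttribute
    (ModelName, EntityName, MemberType_ID, MemberCode, AttributeName, AttributeValue)
VALUES ('Customer', 'Customer', 1, '1001', 'EmailAddress', N'jon.doe@example.com');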
Following this process, and with the aid of the three staging tables, it is possible not only to
populate entities with fresh data (which means creating new members), but also to:

Update member values

Delete and reactivate members

Create collections

Add members to collections

Delete and reactivate collections

Update attribute values

Designate relationships in explicit hierarchies


Summary
In this chapter, the quick walk-through of the Master Data Manager Web application gave us an
overview of MDS capabilities. We also learned how to work with Master Data Manager. Of
course, everything is not as simple as we have shown in this chapter. Nevertheless, this
chapter helps us match elements of a master data management solution with a practical
implementation using Master Data Services.
The first part of this chapter is just a quick guide to installing MDS. This should help readers
start testing it and use this chapter as a quick walk-through of MDS. Then we defined MDS Models
and the elements of a Model. We have also shown how to import and export data. Finally, we
discussed versioning of our master data.
We have to mention that for a real-life solution, we should expect much more work with
importing data. Before importing data into an MDS model, we have to profile it, check it for
quality, and cleanse it. Typical actions before importing data also include merging the data from
multiple sources and de-duplicating it.
In the next chapter, we are going to explain how we can check data quality in our existing
systems by exploiting tools and programming languages included in and supported by the SQL
Server suite. The last chapter of this book is dedicated to merging and de-duplicating data.
We are not going to spend more time on the Master Data Services application in this
book. This is a general MDM book, and MDS is just a part of a general MDM solution. In
addition, the MDS that comes with SQL Server 2008 R2 is just a first version and is, in our
opinion, not yet suitable for production usage in an enterprise. We suggest waiting for the next
version of MDS for real-life scenarios. Nevertheless, the understanding of how MDS works in the
current version that we gained in this chapter should definitely help us with successful deployment
and usage of the next version.


References

Installing and Configuring Master Data Services on MSDN

MSFT database samples on CodePlex

How Permissions Are Determined (Master Data Services) on MSDN

Master Data Services Team Blog

Master Data Services Database (Master Data Services) on MSDN


Chapter 3: Data Quality and SQL Server 2008 R2 Tools
Dejan Sarka
Before we start implementing any centralized MDM solution, we should understand the quality
of our existing data. If we transfer and merge bad data into our MDM solution, we will have bad
data from the start, and over time the situation can get worse. Therefore, intensive data
profiling should always start an MDM project.
Besides getting an impression of our data quality, we should also try to find the root cause for
bad data. In chapter one of this book, we have already seen that the five whys is a known
technique used to find the root cause. However, the five whys technique works well for soft
data quality dimensions.
In this chapter, we are going to focus on hard data quality dimensions, i.e. on measurable
dimensions. Namely, we are going to deal with completeness, accuracy and information. We are
going to show how we can use SQL Server 2008 R2 tools for measuring data quality. In addition,
we are going to show how we can use these tools to find the root cause for bad data. We will
also see that through data profiling we can make some conclusions about schema quality.
In this chapter, we are going to introduce the following:

Measuring the completeness;

Profiling the accuracy;

Measuring information;

Using other SQL Server tools for data quality.


Measuring the Completeness


In order to get some value out of the data, our data has to be complete. This means that the
data should be deep enough to cover all of the business needs, and scoped for the business
tasks.
Completeness starts with schema completeness. Of course, if our schema does not have a place
for a specific piece of data provided, we cannot insert that piece of data. Normalization of a
relational schema is a formal mathematical process that guarantees the completeness of the
schema. In addition, normalization eliminates redundancy. Nevertheless, we are not going to
talk about data modeling here. We are going to show how we can measure the completeness
data quality dimension, and find the root cause for incomplete data.
Population completeness is the first kind of completeness that can be easily measured. We can use two
different assumptions here:

Closed world assumption: all tuples that satisfy the relation predicates are in the relation;

Open world assumption: population completeness is defined against a reference relation.

If we have the reference relation, we can measure population completeness by simply comparing
the number of rows in our relation and in the reference relation. Of course, just having the number
of rows in the reference relation is sufficient information to measure our population completeness.
In a relational database, the presence of NULLs is what defines completeness. NULL is the
standard placeholder for the unknown. We can measure attribute completeness, i.e. the number
of NULL values in a specific attribute; tuple completeness, i.e. the number of unknown values of
the attributes in a tuple; and relation completeness, i.e. the number of tuples with unknown
attribute values in the relation.



In the relational model, we have only a single placeholder for unknown attribute values, the NULL.
However, we can have NULLs because attributes are not applicable for a subset of rows, or
because the values are really unknown. Not-applicable values can also be a signal that the
schema is not of the best possible quality; namely, they can tell us that some subtypes should
be introduced in the schema. Inadequate schema quality is one possible root cause for NULLs in
our database.
In XML data, we do not have NULLs. XML has a special xsi:nil placeholder for unknown values. In
addition, an entire element or attribute can be missing from an XML instance, and this is missing
data as well. We have to use the XQuery language inside Transact-SQL (T-SQL), through the XML
data type methods, in order to find incomplete data inside XML.
When we start data profiling, we meet a lot of suspicious or even bad data right away. In order
to find the root cause, we have to narrow down our data profiling activities and searches. The
techniques shown in this chapter help us not only with data profiling, but also with narrowing down
the problem in order to find the root cause.

Attribute Completeness
Let us start with a simple example: finding the number of NULLs in an attribute. We are going
to analyze attributes of the Production.Product table from the AdventureWorks2008R2 demo
database.
The samples are based on the SQL Server 2008 R2 sample databases, which can be
downloaded from CodePlex.
You can execute the queries in SQL Server Management Studio (SSMS), the tool shipped with SQL
Server. If you are not familiar with this tool yet, please refer to Books Online.
First, let's find which columns are nullable, i.e. allow null values, in the Production.Product table
with the following query that uses the INFORMATION_SCHEMA.COLUMNS view:



SELECT COLUMN_NAME, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = N'Production'
AND TABLE_NAME = N'Product';

The nullable columns are Color, Size, SizeUnitMeasureCode, WeightUnitMeasureCode, Weight,
ProductLine, Class, Style, ProductSubcategoryID, ProductModelID, SellEndDate and
DiscontinuedDate. The next query shows an overview of the first 100 rows, with only the nullable
columns and the columns that help you identify rows, namely ProductId and Name:
SELECT TOP 100
ProductId
,Name
,Color
,Size
,SizeUnitMeasureCode
,WeightUnitMeasureCode
,Weight
,ProductLine
,Class
,Style
,ProductSubcategoryID
,ProductModelID
,SellEndDate
,DiscontinuedDate
FROM Production.Product;

Partial results, showing only a couple of columns and rows, are here:
ProductId  Name                   Color  Size  SizeUnitMeasureCode
1          Adjustable Race        NULL   NULL  NULL
2          Bearing Ball           NULL   NULL  NULL
3          BB Ball Bearing        NULL   NULL  NULL
4          Headset Ball Bearings  NULL   NULL  NULL
316        Blade                  NULL   NULL  NULL

You can easily see that there are many unknown values in this table. With a simple GROUP BY
query, we can find the number of nulls, for example in the Size column. By dividing the number
of NULLs by the total number of rows in the Production.Product table, we can also get the
proportion of NULLs.



The following queries show the absolute number of NULLs in the Size column, and the
proportion of NULLs:
SELECT

Size
,COUNT(*) AS cnt
FROM Production.Product
WHERE Size IS NULL
GROUP BY Size;
SELECT
100.0 *
(SELECT COUNT(*) FROM Production.Product
WHERE Size IS NULL
GROUP BY Size)
/
(SELECT COUNT(*) FROM Production.Product)
AS PctNullsOfSize;

By running these two queries, we find out that there are 293 NULLs in the Size column, which is
58% of all values. This is a huge proportion; just from this percentage, we can conclude that size
is not applicable to all products. For products, size is a common attribute, so we would expect
mostly known values. We could continue checking the other nullable attributes with similar
queries; however, as we will see later, the SQL Server Integration Services (SSIS) Data Profiling
task is well suited for finding NULLs in a column.

XML Data Type Attribute Completeness


So far we have shown how to check whether an entire value is missing. However, if we have
a column of the XML data type, just a single element could be missing inside the XML value.
Of course, we can then consider the XML value incomplete. The question is how to
find rows with a specific element missing. In SQL Server, the XML data type supports some useful
methods. These methods are:

the query() method, which returns part of the XML data in XML format;

the value() method, which returns a scalar value of an element or an attribute of an element;

the exist() method, which returns 1 if an element or an attribute exists in an XML instance, 0 if it
does not exist, and NULL if the XML data type instance contains NULL;

the modify() method, which allows you to insert an element or an attribute, delete an element or
an attribute, or update the value of an element or an attribute;

the nodes() method, which allows you to shred an XML data type instance into relational data.

All of the XML data type methods accept XQuery as an argument. XQuery expressions allow us
to traverse the nodes of an XML instance to find a specific element or attribute. The value()
method accepts an additional parameter, the target SQL Server scalar data type. For the modify()
method, XQuery is extended to allow modifications, and is called XML Data Modification
Language, or XML DML. You can learn more about XQuery expressions and XML DML in Books
Online.
For checking whether an element is present in an XML instance, the exist() method is the right one.
First, let us create an XML instance in a query from the Production.Product table by using the FOR
XML clause:
SELECT

p1.ProductID
,p1.Name
,p1.Color
,(SELECT p2.Color
FROM Production.Product AS p2
WHERE p2.ProductID = p1.ProductID
FOR XML AUTO, ELEMENTS, TYPE)
AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319
ORDER BY p1.ProductID;

The subquery in the SELECT clause generates an XML data type column from the Color attribute.
For the sake of brevity, the query is limited to return seven rows only. The first five have NULLs
in the Color attribute, and the XML column returned does not include the Color element. The
last two rows include the Color element.
We are going to use the previous query inside a CTE in order to simulate a table with an XML
column where the Color element is missing for some rows. The outer query uses the exist()
method to check for the presence of the Color element.



WITH TempProducts AS
(SELECT p1.ProductID
,p1.Name
,p1.Color
,(SELECT p2.Color
FROM Production.Product AS p2
WHERE p2.ProductID = p1.ProductID
FOR XML AUTO, ELEMENTS, TYPE)
AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319)
SELECT ProductID
,Name
,Color
,ColorXml.value('(//Color)[1]','nvarchar(15)')
AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color') = 0;

The outer query correctly finds the first five rows from the CTE. The XML specification also allows
an element to be present but have no value. In such a case, a special attribute, xsi:nil, should
appear inside the nillable (the XML term for nullable) element. Therefore, in order to find all
incomplete XML instances, we also have to check for the xsi:nil attribute. In order to check for
the xsi:nil attribute, we have to create it in some rows first. We are going to slightly change the
last query. In the CTE, we are going to include the XSINIL keyword in the FOR XML clause of the
subquery. This will generate the Color element for every row; however, when the color is
missing, this element will have an additional xsi:nil attribute. Then, with the outer query, we
have to check whether this attribute appears in the Color element:
WITH TempProducts AS
(SELECT p1.ProductID
,p1.Name
,p1.Color
,(SELECT p2.Color
FROM Production.Product AS p2
WHERE p2.ProductID = p1.ProductID
FOR XML AUTO, ELEMENTS XSINIL, TYPE)
AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319)
SELECT ProductID
,Name
,Color
,ColorXml.value('(//Color)[1]','nvarchar(15)')
AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color[@xsi:nil]') = 1;

Of course, the CTE query returns the same small sample (seven) rows of products with
ProductID lower than 319. The outer query correctly finds the first five rows from the CTE.


Simple Associations among NULLs


It would be too easy if we could finish our research by just calculating the proportion of
unknown values. In order to improve the quality of our data, we have to discover the cause of
these NULLs. The first thing we can check is whether NULLs in one column lead to NULLs in another
one. A relationship between columns could indicate two things about the schema: the schema might
not be in third normal form, i.e. there is a functional dependency between two non-key columns,
or the values of a column are not applicable for all rows, and thus we should probably
introduce subtypes. Let's first check whether NULLs in one column lead to NULLs in another.
We need to decide which columns to check. The answer depends on the information
we can get from the names of the columns and from the business experts. If the naming
convention in the database is descriptive, we can easily find the potential candidate columns for
checking. If the naming convention is bad, we have to rely on business experts to narrow down
our checking. If we do not have a business expert at hand and we cannot get a clue from the
column names, then we have a problem: we have to check all pairs of columns.
Fortunately, in the AdventureWorks2008R2 demo database, the naming convention is quite
good. If we take a look at the Size and SizeUnitMeasureCode columns, their names tell us they
are somehow related. Therefore, let's check whether NULLs are related in these two columns
with the following query:
SELECT

1 AS ord
,Size
,SizeUnitMeasureCode
,COUNT(*) AS cnt
FROM Production.Product
GROUP BY Size, SizeUnitMeasureCode
UNION ALL
SELECT 2 AS ord
,Size
,SizeUnitMeasureCode
,COUNT(*) AS cnt
FROM Production.Product
GROUP BY SizeUnitMeasureCode, Size
ORDER BY ord, Size, SizeUnitMeasureCode;

Before showing the results of this query, let us add a comment on it. The result set
unions two result sets: the first SELECT aggregates rows on Size and then on SizeUnitMeasureCode,
while the second SELECT does the opposite and aggregates rows on SizeUnitMeasureCode first and
then on Size.
The first column of the result is just a constant used to get a guaranteed order in the combined
result set.
In SQL Server 2008 and 2008 R2, we can rewrite this query in a shorter way using the new GROUPING
SETS subclause of the GROUP BY clause. It is not just the query that is shorter; SQL Server is
usually able to find a more efficient execution plan as well, and thus execute the query faster.
However, in order to guarantee the order of the result, the query with GROUPING SETS
becomes more complicated, and also less efficient. As this is not a T-SQL programming book, we
are going to use the query above, written without GROUPING SETS. Although the query is not as
efficient as it could be, it is more readable, and thus more suitable for explaining the concepts of
narrowing down the search for the reasons for NULLs. The abbreviated results are as follows:
ord  Size  SizeUnitMeasureCode  cnt
1    NULL  NULL                 293
1    38    CM                   12
...
1    62    CM                   11
1    70    NULL                 ...
...
2    NULL  NULL                 293
2    38    CM                   12
...
2    62    CM                   11
2    70    NULL                 ...
...

Note the rows where Size is NULL and the row for size 70. Whenever Size is NULL,
SizeUnitMeasureCode is also NULL; however, the opposite is not true. In addition,
SizeUnitMeasureCode is NULL for all sizes expressed as character codes, like L and M, and not
NULL for numeric sizes, except for size 70. We can conclude that there is a strong relation
between NULLs in these two columns. Of course, the size unit measure code tells us in which unit
the size is measured if the size is numeric; for example, in the second row of the result set we can
see that size 38 is measured in centimeters. When the size is expressed as a character code, the
measure unit makes no sense. However, we can see that something is wrong with size 70; this one
should have a measure unit. Either the measure unit is missing, or size 70 should not be in the
relation, as it is potentially erroneous. By researching unknown values, we can find potential
errors. In addition, when looking for the root cause of the unknown values, we can omit the
SizeUnitMeasureCode column; we already know where the NULLs in this column come from.
Therefore, from this pair we can limit our research to the Size column only.
If we do the same analysis for the Weight and WeightUnitMeasureCode columns, we will find
that we can omit the WeightUnitMeasureCode column from further research as well. Finally,
we can do the same thing for the ProductSubcategoryID and ProductModelID columns, and will
find out that whenever ProductSubcategoryID is NULL, ProductModelID is NULL as well. Therefore,
we can also omit ProductModelID from further completeness checking.
How can we prevent missing size measure units? The answer lies in the schema. We can
introduce a check constraint on the SizeUnitMeasureCode column that would not accept NULL
values for numeric sizes, or we can create a trigger on the Production.Product table that
checks whether the measure unit is known for numeric sizes. Finally, we can further normalize
the schema. It is obvious that the SizeUnitMeasureCode and Size columns are functionally
dependent. Figure 1 shows the normalized version of the design for size and size unit measure.

Figure 1: SIZE AFTER NORMALIZATION (the Products table references a Sizes lookup table via SizeId, and Sizes references a SizeTypes table via SizeTypeId, where the MeasureUnit is stored)


In a design like the one in the image above, once we insert the correct measure units for the
appropriate size types, we cannot have erroneous or missing size measure units anymore. Of
course, if the values are incorrect in the lookup tables, Sizes and SizeTypes, the measure units would
be wrong for all products of a specific size. However, it is easier to maintain small lookup tables
and have only correct values in them. The question here is whether it is worth changing the
database design because of the problem with a missing size measure unit. Changing the
database design could lead to upgrading one or more applications, which could be very
expensive. We could prevent new incorrect values with a check constraint or a trigger instead.
Even with a check constraint or a trigger we could end up with problems in our applications:
if our applications do not implement exception handling, we might end up with a crashed
application when an erroneous size measure unit is inserted. Preventing errors should
be our ultimate goal; however, in many cases, especially when we have to deal with legacy
applications, it might be cheapest to simply find the errors and correct them on a
regular schedule.
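As a minimal sketch of the check-constraint approach mentioned above (the constraint name is our own, and existing violations such as the size 70 row would have to be corrected first, or the constraint created WITH NOCHECK):
-- Reject new rows where Size is numeric but no measure unit is supplied.
ALTER TABLE Production.Product ADD CONSTRAINT CK_Product_SizeUnitMeasureCode
CHECK (NOT (ISNUMERIC(Size) = 1 AND SizeUnitMeasureCode IS NULL));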
Here, we just checked whether a NULL in one column leads to a NULL in another column. What if a
known value in a column leads to a NULL in another column? This would mean that the second
attribute is not applicable for all values of the first attribute, and subtype tables would probably be
needed. In the Production.Product table, from the name of the ProductSubcategoryId column
we can suspect that this column might be a potential candidate for introducing subtypes based on
its values. Let's check whether the Class attribute is applicable for all product subcategories
using the following query:
WITH
Classes AS
(SELECT ProductSubcategoryID
,Class
,COUNT(*) AS NRowsInClass
FROM Production.Product
GROUP BY ProductSubcategoryId, Class)
SELECT c.ProductSubcategoryID
,c.Class
,c.NRowsInClass
,CAST(ROUND(100.0 * c.NRowsInClass /
SUM(c.NRowsInClass)
OVER(PARTITION BY c.ProductSubcategoryId),0)
AS decimal(3,0))
AS PctOfSubCategory
FROM Classes c
ORDER BY PctOfSubCategory DESC;

This query uses a Common Table Expression (CTE) to calculate the number of rows in
subcategories and classes. The outer query uses the CTE to calculate, with the OVER clause, the
percentage of rows of a class in a specific subcategory out of the total number of rows of the same
subcategory. The result set is ordered by descending percentage. If the percentage is
close to 100, it means that one class is prevalent in one subcategory. If the value of the class is
NULL, it means that the class is probably not applicable for the whole subcategory. Here are
partial results of this query.
ProductSubcategoryID  Class  NRowsInClass  PctOfSubCategory
...                   NULL   ...           100
...                   NULL   ...           100
...                   NULL   ...           100
18                    NULL   ...           100

We can easily see that the Class attribute is not applicable for all subcategories. In further
research of the reasons for NULLs in the Class attribute, we can exclude the rows where the values
are not applicable. We can also check which other attributes are not applicable for some
subcategories. Note that the subcategories leading to NULLs in Class are not necessarily the same
as the subcategories leading to NULLs in other attributes. For example, subcategory 4
leads to NULLs in Color, but not in Class. Therefore, in order to continue the
research for the reasons for NULLs on a narrowed set of rows, excluding the rows where
the values are not applicable, we should perform separate research for each nullable attribute.
We will focus on the Class attribute and continue with the rows where Class is
applicable.
The following query limits the rows to those where the Class attribute is applicable, and limits the
column list to the interesting columns only.
SELECT

ProductId
,Name
,Color
,Size
,Weight
,ProductLine
,Class
,Style
FROM Production.Product
WHERE (Color IS NULL OR
Size IS NULL OR
Weight IS NULL OR
ProductLine IS NULL OR
Class IS NULL OR
Style IS NULL)
AND
(ProductSubcategoryId NOT IN
(SELECT ProductSubcategoryId
FROM Production.Product
WHERE ProductSubcategoryId IS NOT NULL
GROUP BY ProductSubcategoryId
HAVING COUNT(DISTINCT Class) = 0));

Excluded columns are ProductModelId, because it is NULL whenever ProductSubcategoryId is
NULL, and WeightUnitMeasureCode and SizeUnitMeasureCode, which are not needed as they are NULL
whenever Weight or Size are NULL.
In addition, SellEndDate and DiscontinuedDate are not interesting for completeness checking, as
from their names we can expect that most of the rows should have NULLs in these two columns,
and we can easily imagine there is a business reason behind those NULLs. Of
course, these two columns could be interesting from the accuracy perspective. Six interesting
columns are left from the completeness perspective.
The rows are limited by the WHERE clause to those which have an unknown value in any of the six
interesting columns, excluding rows where the Class column is not applicable. The final subquery
in the WHERE clause is used to find all subcategories for which Class is not applicable, i.e.
subcategories without a single distinct (non-NULL) class value. Note that the COUNT(DISTINCT Class)
aggregate function, unlike COUNT(*), eliminates NULLs from the aggregation. Therefore, if the
count of distinct classes in a subcategory is zero, it means that the subcategory contains only rows
with NULL Class.
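A quick way to see this NULL-eliminating behavior in isolation (just a didactic query of our own, not part of the profiling workflow):
-- COUNT(*) counts every row, while COUNT(Class) and COUNT(DISTINCT Class) skip the NULLs.
SELECT COUNT(*) AS AllRows,
       COUNT(Class) AS NonNullClassValues,
       COUNT(DISTINCT Class) AS DistinctNonNullClasses
FROM Production.Product;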

Tuple and Relation Completeness


For tuple completeness, we can count how many columns have unknown values. For
relation completeness, we can use the proportion of rows with NULLs to the total number of rows.
Before we start measuring, let's create a scalar user-defined function that accepts a parameter
of the sql_variant data type and returns one if the parameter is NULL and zero if it is not NULL.
CREATE FUNCTION dbo.ValueIsNULL
(@checkval sql_variant)
RETURNS tinyint
AS
BEGIN
DECLARE @retval int
IF (@checkval IS NULL) SET @retval = 1
ELSE SET @retval = 0;
RETURN(@retval)
END;
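A quick check of the function's return values (1 for a NULL input, 0 otherwise):
SELECT dbo.ValueIsNULL(NULL) AS NullInput,
       dbo.ValueIsNULL(N'ABC') AS NonNullInput;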

With this function, it is easy to write a query that calculates the number of NULLs in each
interesting row. Note that the following query refers to the interesting columns only, and limits the
result set to rows with an applicable Class attribute only.
SELECT

ProductId
,Name
,dbo.ValueIsNULL(Color) +
dbo.ValueIsNULL(Size) +
dbo.ValueIsNULL(Weight) +
dbo.ValueIsNULL(ProductLine) +
dbo.ValueIsNULL(Class) +
dbo.ValueIsNULL(Style)

AS NumberOfNULLsInRow
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
(SELECT ProductSubcategoryId
FROM Production.Product
WHERE ProductSubcategoryId IS NOT NULL
GROUP BY ProductSubcategoryId
HAVING COUNT(DISTINCT Class) = 0))
ORDER BY NumberOfNULLsInRow DESC;

The abbreviated result from the query is as follows:


ProductId  Name     NumberOfNULLsInRow
802        LL Fork  ...
803        ML Fork  ...
804        HL Fork  ...
...

We can say that tuples with more NULLs are less complete than tuples with fewer NULLs. We
can even export this data to a staging table, repeat the tuple completeness measurement on a
schedule, and compare the measurements to track improvement. Of course, this makes
sense only if we can join the measurements on some common identification; we need something that
uniquely identifies each row.
In the Production.Product table, there is a primary key on the ProductId column. In a relational
database, every table should have a primary key, and the key should not change if you want to
make comparisons over time.
For relation completeness, we can use two measures: the total number of NULLs in the relation
and the number of rows with a NULL in any of the columns. The following query does both
calculations for the Production.Product table, limited to rows with an applicable Class attribute
only.
only.
SELECT

'Production' AS SchemaName
,'Product' AS TableName
,COUNT(*) AS NumberOfRowsMeasured
,SUM(
dbo.ValueIsNULL(Color) +
dbo.ValueIsNULL(Size) +

dbo.ValueIsNULL(Weight) +
dbo.ValueIsNULL(ProductLine) +
dbo.ValueIsNULL(Class) +
dbo.ValueIsNULL(Style)
) AS TotalNumberOfNULLs
,SUM(
CASE WHEN (
dbo.ValueIsNULL(Color) +
dbo.ValueIsNULL(Size) +
dbo.ValueIsNULL(Weight) +
dbo.ValueIsNULL(ProductLine) +
dbo.ValueIsNULL(Class) +
dbo.ValueIsNULL(Style)
) > 0 THEN 1 ELSE 0
END
) AS NumberOfRowsWithNULLs
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
(SELECT ProductSubcategoryId
FROM Production.Product
WHERE ProductSubcategoryId IS NOT NULL
GROUP BY ProductSubcategoryId
HAVING COUNT(DISTINCT Class) = 0));

The result of the relation completeness query is as follows:


SchemaName   TableName   NumberOfRowsMeasured   TotalNumberOfNULLs   NumberOfRowsWithNULLs
Production   Product     237                    222                  61

We can continue with such measurements for each table in the database. We can also store the
results in a data quality data warehouse, as proposed in chapter 1 of this book; a minimal sketch
of such a table follows. This way, we can measure improvements over time. After all, one of the
most important goals when implementing an MDM solution is data quality improvement.
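A minimal sketch of a data quality data warehouse table for relation-level completeness (the table and column names here are illustrative, not taken from chapter 1) could be:
CREATE TABLE dbo.RelationCompletenessHistory
(MeasureDate date NOT NULL,
 SchemaName sysname NOT NULL,
 TableName sysname NOT NULL,
 NumberOfRowsMeasured int NOT NULL,
 TotalNumberOfNULLs int NOT NULL,
 NumberOfRowsWithNULLs int NOT NULL,
 PRIMARY KEY (MeasureDate, SchemaName, TableName));
Inserting the output of the previous query into such a table on a schedule then allows us to report completeness trends per table over time.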

Multivariate Associations among NULLs


We have presented a way to find whether there is some association between missing
values in two columns. A slightly more advanced question is whether there is an association
among missing values in multiple columns; for example, if a value is missing in all or any of the
Color, Class and Style columns of the Production.Product table, does this fact lead to a missing
value in the Weight column?

There are many multivariate analyses known from statistics and data mining. SQL Server
Analysis Services (SSAS) brings very strong data mining support; all of the most popular
algorithms are implemented. For the problem mentioned, Naïve Bayes is an appropriate
algorithm. It checks associations of pairs of variables, and then calculates joint associations,
i.e. associations between multiple input variables and a target variable. The target variable is
called the predictable variable, as we can use the model to predict the values of the target
variable in new cases when we have values for the input variables.
In SSAS, we can declare all variables as both input and predictable, thus getting one Naïve Bayes
analysis per variable. Microsoft also ships data mining viewers, controls that show us the
knowledge learned by the algorithm graphically. The Naïve Bayes Dependency Network viewer
shows all links between all variables on a single screen; it shows all models simultaneously.
This way we can easily find the most important links and thus understand which target variable
has the strongest associations with which input variables.
Let's start working. First, we need a view in the AdventureWorks2008R2 database that selects
only the interesting columns for applicable values of the Class column of the Production.Product
table, as in the following code:
CREATE VIEW dbo.ProductsMining
AS
SELECT ProductId
,Name
,dbo.ValueIsNULL(Color) AS Color
,dbo.ValueIsNULL(Size) AS Size
,dbo.ValueIsNULL(Weight) AS Weight
,dbo.ValueIsNULL(ProductLine) AS ProductLine
,dbo.ValueIsNULL(Class) AS Class
,dbo.ValueIsNULL(Style) AS Style
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
(SELECT ProductSubcategoryId
FROM Production.Product
WHERE ProductSubcategoryId IS NOT NULL
GROUP BY ProductSubcategoryId
HAVING COUNT(DISTINCT Class) = 0));

Next, we need to create an Analysis Services project in Business Intelligence Development
Studio (BIDS). We can do this by completing the following steps:

1. Open the BIDS.
2. In BIDS, create a new SSAS project. Name the solution MDSBook_Ch03, the
project MDSBook_Ch03_SSAS, and save the solution to any folder you want;
we suggest the C:\MDSBook\Chapter03 folder.
3. In Solution Explorer, right-click on the Data Sources folder. Create a new Data
Source for the AdventureWorks2008R2 database. Connect to the instance where
you have deployed the demo databases. Select the AdventureWorks2008R2
database. Use the service account impersonation. Use the default name for the
data source (Adventure Works2008R2).
4. In Solution Explorer, right-click on the Data Source Views folder. Create a new
Data Source View based on the data source from the previous step. From the
available objects, select only the ProductsMining (dbo) view you just created. Use
the default name for the data source view (Adventure Works2008R2).
5. In Solution Explorer, right-click on the Mining Structures folder and select New
Mining Structure.
6. Use the existing relational database or data warehouse.
7. Select the Microsoft Naïve Bayes technique.
8. Use Adventure Works2008R2 data source view.
9. Specify ProductsMining as a case table.
10. Use ProductId as a key column (selected by default), and Class, Color,
ProductLine, Size, Style and Weight as input and predictable columns, like in
figure 2.

Figure 2: TRAINING DATA COLUMNS


11. In the Specify Columns Content and Data Type screen, make sure the content
type is discrete for all input and target variables. If needed, change the content to
Discrete (in SQL 2008 R2 RTM, the wizard usually detects the content of these
columns inappropriately as Discretized).
12. Use the default percentage (30%) of data for the test set (this is not important for
the usage you are performing, as you have a single model only; this is important
if you want to compare the accuracy of predictions of multiple models).
13. Name the structure Products Mining Completeness, and the model Products
Mining Completeness Naïve Bayes.
14. Click Finish.
15. Save, deploy and process the project by right-clicking on the project in the
Solution Explorer window and selecting Deploy.

16. When the processing is finished, click on the Mining Model Viewer tab. The
default viewer is the Dependency Network viewer.
17. Lower the slider on the left to keep only the more important links. For example, put the
slider on the left side of the viewer on the seventh position from the bottom, and
then click on the Size attribute. You should see something like what figure 3 shows.

Figure 3: DEPENDENCY NETWORK VIEWER


18. You can easily see that ProductLine, Style and Weight predict Size, while Size also
predicts Class. Color has no strong associations with other attributes.
19. Check the other viewers, for example the Attribute Discrimination viewer. Select
the Size attribute, and compare value 0 with value 1, like in figure 4.

Figure 4: ATTRIBUTE DISCRIMINATION VIEWER


20. You can see that the Style attribute is strongly linked with Size; i.e., if Style is
NULL, it is highly probable that Size is NULL as well (value 1 in the picture, as the
dbo.ValueIsNULL function returns 1 for NULLs).
As we can see from this example, we can easily find associations between missing values
by using data mining. Data mining can help a lot in searching for the root cause. Before
finishing with this example, we have to add a quick note: we can use the Data Source
View (DSV) for a quick overview of our data as well, by completing the following steps:
21. In the DSV, right-click on the ProductsMining view, and select Explore Data.
22. Use the Chart tab and select Class, Color, ProductLine, Size, Style and Weight
columns.
23. Check the other views as well. When finished, close the Explore Data view. Do
not close BIDS yet, if you want to follow further examples in this chapter.

Profiling the Accuracy


Similarly to completeness, accuracy can also be defined on the attribute, tuple or relation
level. However, measuring accuracy is more complicated than measuring completeness.
Sometimes we can easily define which data is inaccurate. The data can be duplicated, out of a
predefined range, or inaccurate in some other way that can be easily spotted and discovered
with simple queries, as in the sketch below. For example, it is clear that something is wrong if
the birth date is later than the employment date for an employee. However, is there something
wrong if a twenty-year-old employee has a Ph.D. in medicine? This may be an error;
nevertheless, we cannot say so without checking further.
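For instance, a check of the birth date against the hire date could be as simple as the following query (a minimal sketch against the HumanResources.Employee table of the AdventureWorks2008R2 database, which contains both BirthDate and HireDate columns):
-- Employees hired on or before their birth date are clearly erroneous
SELECT BusinessEntityID, BirthDate, HireDate
FROM HumanResources.Employee
WHERE HireDate <= BirthDate;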
With code, we cannot always find inaccurate data. Many times we can find suspicious data only,
and then check this data manually. Therefore, it makes a lot of sense to find suspicious data
programmatically. If we can narrow down manual checking from ten million rows to ten
thousand suspicious rows only, we have made fantastic progress.
Finding inaccurate data differs if we are dealing with numbers, strings, or other data types. For
completeness, we showed examples dealing with products.
Another aspect of master data is customers. For profiling accuracy we are going to deal with
customers, where we typically have many different types of data gathered. We are going to
base our examples on the dbo.vTargetMail view from the AdventureWorksDW2008R2 demo
database. This database, also part of the product samples, represents an analytical data
warehouse. It is already denormalized, using a star schema. The dbo.vTargetMail view joins
customers' demographic data from multiple tables of the AdventureWorks2008R2 database,
and is thus more suitable for demonstrating techniques for finding inaccurate data than the
source tables, where we would have to deal with multiple joins.
Because the sample data is accurate, or at least intended to be accurate, we are going to create
a view based on the mentioned vTargetMail view. Our view will select a subset of rows from the

original view, and add some erroneous rows. We have worked with the AdventureWorks2008R2
database so far; we can continue working in this context, in order to have less work with cleanup code after we finish with this chapter. The following code creates the view we are going to
use in the AdventureWorks2008R2 database.
CREATE VIEW dbo.vTargetMailDirty
AS
SELECT TOP 500 CustomerKey, FirstName, MiddleName, LastName,
BirthDate, MaritalStatus, Gender, EmailAddress,
YearlyIncome, TotalChildren, NumberChildrenAtHome,
EnglishEducation AS Education,
EnglishOccupation AS Occupation,
HouseOwnerFlag, NumberCarsOwned,
AddressLine1 AS Address, Phone, DateFirstPurchase,
CommuteDistance, Region, Age, BikeBuyer
FROM AdventureWorksDW2008R2.dbo.vTargetMail
WHERE CustomerKey < 12000
UNION
-- misspelled MiddleName, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'VeryLongMiddleName', N'Yang',
'19660408', N'M', N'M',
N'jon24@adventure-works.com', 90000, 2, 0,
N'Bachelors',
N'Professional', 1, 0, N'3761 N. 14th St',
N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
N'Pacific', 42, 1
UNION
-- duplicate PK, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'V', N'Yang',
'19660408', N'M', N'M',
N'jon24@adventure-works.com', 90000, 2, 0,
N'Bachelors',
N'Professional', 1, 0, N'3761 N. 14th St',
N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
N'Pacific', 42, 1
UNION
-- wrong EmailAddress, repeated CustomerKey 11001
SELECT 99001, N'Eugene', N'L', N'Huang',
'19650514', N'S', N'M',
N'eugene10#adventure-works.com', 60000, 3, 3,
N'Bachelors',
N'Professional', 0, 1, N'2243 W St.',
N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
N'Pacific', 42, 1
UNION
-- BirthDate out of range, repeated CustomerKey 11001
SELECT 99002, N'Eugene', N'L', N'Huang',
'18650514', N'S', N'M',
N'eugene10@adventure-works.com', 60000, 3, 3,
N'Bachelors',
N'Professional', 0, 1, N'2243 W St.',
N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
N'Pacific', DATEDIFF(YY, '18650514', GETDATE()), 1
UNION
-- misspelled Occupation, repeated CustomerKey 11002
SELECT 99003, N'Ruben', NULL, N'Torres',
'19650812', N'M', N'M',
N'ruben35@adventure-works.com', 60000, 3, 3,
N'Bachelors',
N'Profesional', 1, 1, N'5844 Linden Land',
N'1 (11) 500 555-0184', '20010710', N'2-5 Miles',
N'Pacific', 42, 1
UNION
-- Phone written as 'Phone: ' + number

SELECT CustomerKey, FirstName, MiddleName, LastName,
BirthDate, MaritalStatus, Gender, EmailAddress,
YearlyIncome, TotalChildren, NumberChildrenAtHome,
EnglishEducation AS Education,
EnglishOccupation AS Occupation,
HouseOwnerFlag, NumberCarsOwned,
AddressLine1 AS Address,
N'Phone: ' + Phone, DateFirstPurchase,
CommuteDistance, Region, Age, BikeBuyer
FROM AdventureWorksDW2008R2.dbo.vTargetMail
WHERE CustomerKey > 12000
AND CustomerKey % 500 = 3;

Numeric, Date and Discrete Attributes Profiling


We are going to start by finding inaccuracies in a single attribute. As we already mentioned,
the techniques differ for different data types. Let's begin with dates. For finding
potentially inaccurate dates, the MIN and MAX T-SQL aggregate functions are very useful. The
following query finds the oldest and the youngest person among our customers.
SELECT

CustomerKey
,FirstName
,LastName
,BirthDate
FROM dbo.vTargetMailDirty
WHERE BirthDate =
(SELECT MIN(BirthDate) FROM dbo.vTargetMailDirty)
OR
BirthDate =
(SELECT MAX(BirthDate) FROM dbo.vTargetMailDirty);

The results reveal suspicious data: the oldest person was born in the year 1865.
CustomerKey   FirstName   LastName     BirthDate
99002         Eugene      Huang        1865-05-14
11132         Melissa     Richardson   1984-10-26

Finding suspicious data mostly translates to finding outliers, i.e. rare values that lie far out of
bounds. We can use a similar technique for continuous numeric values.
With a couple of standard T-SQL aggregate functions, we can easily get an idea of the distribution
of values, and then compare the minimal and maximal values with the average. In addition, the
standard deviation tells us how spread out the distribution is in general. The less spread out it is, the

more likely it is that the outliers are inaccurate. The following query calculates these basic
descriptive statistics for the Age column, which is calculated from the BirthDate column.
SELECT

MIN(Age) AS AgeMin
,MAX(Age) AS AgeMax
,AVG(CAST(Age AS float)) AS AgeAvg
,STDEV(Age) AS AgeStDev
FROM dbo.vtargetMailDirty;

The result shows us something we already knew: the oldest person in the data probably has a
wrong Age, and since Age is calculated, the BirthDate has to be wrong as well.
AgeMin   AgeMax   AgeAvg             AgeStDev
26       146      50.1018518518519   13.1247332792511

Before we move to discrete attributes, let us mention how to interpret these basic descriptive
statistics. We should expect a normal, Gaussian distribution of ages around the average age. In a
normal distribution, around 68% of the data should lie within one standard deviation on either
side of the mean, about 95% of the data should lie within two standard deviations on either side
of the mean, and about 99.7% of the data should lie within three standard deviations on either
side of the mean. The minimal age is less than two standard deviations from the mean (50 - 2 *
13 = 24 years), while the maximal age is more than seven standard deviations from the mean.
There is a very, very low probability for data to lie more than seven standard deviations from
the average value; we already have less than a 0.5% probability of finding data more than three
standard deviations away from the mean. Thus, we can conclude there is something
wrong with the maximal value in the Age column.
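A query that flags such numeric outliers directly, using the three-standard-deviations rule of thumb (a sketch; the threshold of three is a common convention, not a strict cutoff), could look like this:
WITH AgeStats AS
(
SELECT AVG(CAST(Age AS float)) AS AgeAvg
 ,STDEV(Age) AS AgeStDev
FROM dbo.vTargetMailDirty
)
SELECT d.CustomerKey
 ,d.FirstName
 ,d.LastName
 ,d.Age
FROM dbo.vTargetMailDirty AS d
 CROSS JOIN AgeStats AS s
WHERE d.Age < s.AgeAvg - 3 * s.AgeStDev
 OR d.Age > s.AgeAvg + 3 * s.AgeStDev;
With the demo data, the suspicious 146-year-old customer should be among the flagged rows.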
How do we find outliers in discrete columns? No matter whether they are numeric, dates or
strings, they can take a value from a discrete pool of possible values only. Let's say we do not
know the pool in advance. We can still try to find suspicious values by measuring the frequency
distribution of all values of an attribute. A value with a very low frequency is potentially an

erroneous value. The following query finds the frequency distribution for the Occupation
column.
WITH freqCTE AS
(
SELECT v.Occupation
,AbsFreq=COUNT(v.Occupation)
,AbsPerc=CAST(ROUND(100.*
(COUNT(v.Occupation)) /
(SELECT COUNT(*) FROM vTargetMailDirty)
,0) AS int)
FROM dbo.vTargetMailDirty v
GROUP BY v.Occupation
)
SELECT c1.Occupation
,c1.AbsFreq
,CumFreq=(SELECT SUM(c2.AbsFreq)
FROM freqCTE c2
WHERE c2.Occupation
<= c1.Occupation)
,c1.AbsPerc
,CumPerc=(SELECT SUM(c2.AbsPerc)
FROM freqCTE c2
WHERE c2.Occupation
<= c1.Occupation)
,Histogram=CAST(REPLICATE('*',c1.AbsPerc) AS varchar(50))
FROM freqCTE c1
ORDER BY c1.Occupation;

The query uses a CTE to calculate the absolute frequency and absolute percentage, and calculates
the cumulative values in the outer query with correlated subqueries. Note again that this is not the
most efficient query; however, we want to show the techniques we can use, and
performance is not our main goal here. In addition, we typically do not execute these queries
very frequently.
In the result of the previous query, we can see the suspicious occupation. The 'Profesional'
value is present in a single row only. As it is very similar to the 'Professional' value, we can
conclude this is an error.
Occupation       AbsFreq   CumFreq   AbsPerc   CumPerc   Histogram
Clerical         68        68        13        13        *************
Management       134       202       25        38        *************************
Manual           64        266       12        50        ************
Profesional      1         267       0         50
Professional     157       424       29        79        *****************************
Skilled Manual   116       540       21        100       *********************

Strings Profiling
Catching errors in unconstrained strings, like names and addresses, is one of the most
challenging data profiling tasks. Because there are no constraints, it is not possible to say in
advance what is correct and what is incorrect. Still, the situation is not hopeless. We are going
to show a couple of queries we can use to find string inconsistencies.
In a character column, strings have different lengths in different rows. However, the lengths
typically follow either a nearly uniform or a normal distribution. In both cases, strings that are
extremely long or short might be errors. Therefore, we are going to start our profiling by
calculating the distribution of string lengths. The following example checks the lengths of middle
names.
SELECT

LEN(MiddleName) AS MNLength
,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY LEN(MiddleName)
ORDER BY Number;

The vast majority of middle names are either unknown or one character long.
MNLength   Number
18         1
2          1
NULL       235
1          303

Of course, it is easy to find middle names that are more than one character long with the next
query.

SELECT CustomerKey
,FirstName
,LastName
,MiddleName
FROM dbo.vTargetMailDirty
WHERE LEN(MiddleName) > 1;

We see that one middle name is definitely wrong (of course, this is the one we added
intentionally). In addition, the middle name that is two characters long might be written
inconsistently as well. It looks like middle names should be written with a single letter, without a
dot after it.
CustomerKey   FirstName   LastName   MiddleName
11377         David       Robinett   R.
99000         Jon         Yang       VeryLongMiddleName

Sometimes we know what strings should look like; this means we have patterns for strings. We
can check for basic patterns with the LIKE T-SQL operator. For example, we would not expect any
letters in phone numbers. Let's check them with the following query.
SELECT

CustomerKey
,FirstName
,LastName
,Phone
FROM dbo.vTargetMailDirty
WHERE Phone LIKE '%[A-Z]%';

From the abbreviated results, we can see that there are some phone numbers that include
letters. It seems like some operator consistently uses the prefix 'Phone:' when entering phone
numbers.
CustomerKey   FirstName   LastName    Phone
12003         Audrey      Munoz       Phone: 1 (11) 500 555-0124
12503         Casey       Shen        Phone: 1 (11) 500 555-0148
13003         Jill        Hernandez   Phone: 1 (11) 500 555-0197
13503         Theodore    Gomez       Phone: 1 (11) 500 555-0167
14003         Angel       Ramirez     Phone: 488-555-0166

More advanced pattern matching can be done with regular expressions. Regular expressions
can be treated as the LIKE operator on steroids. T-SQL does not support regular expressions out of
the box; however, we are not powerless. From version 2005 on, SQL Server supports CLR objects,
including functions, stored procedures, triggers, user-defined types and user-defined
aggregates. A simple .NET function, written in either Visual C# or Visual Basic, could do the work.
Going into the details of using CLR code inside SQL Server is out of the scope of this book.
Nevertheless, the CLR project with this function is included in the accompanying code, and we can
use it. Here is the Visual C# code for the function.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class CLRUtilities


{
[SqlFunction(DataAccess = DataAccessKind.None)]
public static SqlBoolean IsRegExMatch(
SqlString inpStr, SqlString regExStr)
{
return (SqlBoolean)Regex.IsMatch(inpStr.Value,
regExStr.Value,
RegexOptions.CultureInvariant);
}
};

The Boolean function accepts two parameters: the string to check and the regular expression. It
returns true if the string matches the pattern and false otherwise. Before we can use the function,
we have to import the assembly into the SQL Server database, and create and register the
function. The CREATE ASSEMBLY command imports an assembly. The CREATE FUNCTION
command registers the CLR function. After the function is registered, we can use it like any
built-in T-SQL function. The T-SQL code for importing the assembly and registering the function is as follows:
CREATE ASSEMBLY MDSBook_Ch03_CLR
FROM 'C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\MDSBook_Ch03_CLR.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.IsRegExMatch
(@inpstr AS nvarchar(max), @regexstr AS nvarchar(max))
RETURNS BIT
WITH RETURNS NULL ON NULL INPUT
EXTERNAL NAME MDSBook_Ch03_CLR.CLRUtilities.IsRegExMatch;

Note that this code expects the assembly to be in the
C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\ folder. If you
have it in some other folder, please change the path accordingly.
After the function is registered in SQL Server, we can use it like any other T-SQL function. We
are going to use it for checking e-mail addresses, as in the following code.
SELECT

CustomerKey
,FirstName
,LastName
,EmailAddress
,dbo.IsRegExMatch(EmailAddress,
N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})')
AS IsEmailValid
FROM dbo.vTargetMailDirty
WHERE dbo.IsRegExMatch(EmailAddress,
N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})')
= CAST(0 AS bit);

In the results, we can see that we found an incorrect e-mail address.


CustomerKey   FirstName   LastName   EmailAddress                   IsEmailValid
99001         Eugene      Huang      eugene10#adventure-works.com   0

You can learn more about regular expressions on the MSDN. In addition, there are many sites
on the Web where developers freely exchange regular expressions.

Other Simple Profiling


It is very simple to check for the uniqueness of rows in a table as well. If we do not use Primary
Key and Unique constraints in our database, we can get duplicate rows. We have to check which
columns can potentially form a key, i.e. we have to check candidate keys. If they are unique,
they are useful as keys. Please note that keys should not only be unique, but also known;
therefore, checking for completeness should be a part of candidate key checks as well, as in the
short sketch that follows.

In the following example we are checking the uniqueness of rows over a single column, the
CustomerKey column, only. Nevertheless, the query would not be much more complicated if
we were to check the uniqueness of a composite candidate key; we would only have to use all
of the columns that form the candidate key in every place we use CustomerKey (see the sketch
after the result below).
SELECT

CustomerKey
,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY CustomerKey
HAVING COUNT(*) > 1
ORDER BY Number DESC;

Of course, we get the duplicated CustomerKey in the result:


CustomerKey   Number
99000         2

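For a composite candidate key, the same pattern applies. A minimal sketch, using the hypothetical composite candidate key (FirstName, LastName, BirthDate) purely for illustration:
SELECT FirstName
 ,LastName
 ,BirthDate
 ,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY FirstName, LastName, BirthDate
HAVING COUNT(*) > 1
ORDER BY Number DESC;
In our dirty view, the intentionally repeated Jon Yang rows should appear in this result as well.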
Let us only briefly mention how we can profile XML data. In the completeness part of this chapter, we
mentioned the XML data type methods and showed how we can use them in T-SQL
queries. For accuracy, we used the .value() method of the XML data type to extract element
and attribute values and represent them as scalar values of the built-in SQL Server data types.
After that, we used the same methods as we used for finding inaccuracies in attributes of
built-in scalar data types. Therefore, dealing with XML data type attributes does not differ much from
what we have seen so far in the accuracy part of this chapter.
Finally, we have to mention that SQL Server supports validating a complete XML instance
against an XML schema collection. In an XML schema collection, we can have multiple XML
schemas in XSD format. An XML instance has to validate against one of the schemas in the
collection; otherwise, SQL Server rejects the instance. Similarly to what we mentioned for check
constraints, we should also try to use XML schema collection validation in our databases, in
order to enforce data integrity rules, as in the brief sketch below.
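A minimal, purely illustrative sketch of such a validation (the schema collection, namespace and table names here are hypothetical, not part of the sample databases):
CREATE XML SCHEMA COLLECTION dbo.CustomerInfoSchemaCollection AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:customerinfo"
            xmlns="urn:customerinfo"
            elementFormDefault="qualified">
  <xs:element name="CustomerInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="NumberOfCars" type="xs:int" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';
GO
-- A typed XML column rejects any instance that does not validate against the collection
CREATE TABLE dbo.CustomerInfo
(CustomerInfoId int PRIMARY KEY,
 Info xml(dbo.CustomerInfoSchemaCollection));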

Multivariate Accuracy Analysis


As with completeness, we also want to find the root cause of inaccurate data. We should ask
ourselves whether specific values in some attributes lead to errors in other attributes. If we can
find the attributes that influence errors in other attributes, we can focus just on them, and
make huge improvements in data quality with minimal effort. Again, we are going to use data
mining for this task.
We can use the Naïve Bayes algorithm for finding the root cause of inaccurate data, like we used it
for finding causes of incompleteness. However, this time we are going to introduce another
useful algorithm, the Decision Trees algorithm.
Decision Trees work on discrete data. Note that SQL Server Decision Trees accept continuous
attributes; however, the algorithm then changes to Regression Trees. In the following example, we
are going to use discrete attributes only, in order to make sure we get Decision Trees. Before
we start exploring the data, let's explain briefly how Decision Trees work.
Decision Trees is a directed technique. A target variable is the one that holds information about
a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially
liked / disliked; etc.).
We are trying to explain this decision using other gleaned information saved in other variables
(demographic data, purchasing habits, etc.). The process of building the tree is recursive
partitioning. The data is split into partitions using a certain value of one of the explaining
variables; the partitions are then split again and again. Initially the data is in one big box. The
algorithm tries all possible splits of all input (explaining) variables for the initial split. The goal
is to get purer partitions in terms of the target variable. The purity is related to the distribution
of the target variable: we want the distribution of the target variable in each branch of the
tree to be as pure as possible, i.e., dominated by a single value. The tree continues to grow using
the new partitions as separate starting points and splitting them further.

We then have to stop the process somewhere; otherwise we could get a completely fitting tree
having only one case in each leaf node. Such a leaf would be, of course, absolutely pure. This would
not make any sense; the results could not be used for any meaningful prediction. This
phenomenon is called over-fitting. We are not going into details here; let's just mention that
we can control the growth of the tree with algorithm parameters.
We are now going to analyze what leads to an incorrect phone number. The following view filters
out the rows with a duplicate primary key (data mining models require a key in the source data) and
adds a flag showing whether the phone number is valid or not.
CREATE VIEW dbo.TargetMailMining
AS
SELECT CustomerKey
,FirstName
,LastName
,MaritalStatus
,Gender
,YearlyIncome
,TotalChildren
,NumberChildrenAtHome
,Education
,Occupation
,HouseOwnerFlag
,NumberCarsOwned
,CommuteDistance
,Region
,Age
,BikeBuyer
,CASE
WHEN Phone LIKE '%[A-Z]%' THEN N'False'
ELSE N'True'
END
AS PhoneValid
FROM dbo.vTargetMailDirty
WHERE CustomerKey <> 99000;

To create the Decision Trees model, complete the following steps:


1. If you have already closed the project with the Naïve Bayes model, reopen it in BIDS.
2. In Solution Explorer, double-click on the Adventure Works2008R2 data source
view in the Data Source Views folder to open the Data Source View designer.
3. Right-click in the empty space in the designer window and select the Add/Remove Tables
option. Find the TargetMailMining view and add it to the Data Source View.

4. In Solution Explorer, right-click on the Mining Structures folder, and select New
Mining Structure.
5. Use the existing relational database or data warehouse.
6. Select the Microsoft Decision Trees technique.
7. Use Adventure Works2008R2 data source view.
8. Specify TargetMailMining as a case table.
9. Use CustomerKey as a key column and PhoneValid as predictable column. Use
Age, BikeBuyer, CommuteDistance, Education, Gender, HouseOwnerFlag,
MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Occupation,
Region, TotalChildren and YearlyIncome as input columns. Check whether your
selection matches figure 5.

Figure 5: COLUMNS USAGE FOR THE DECISION TREES ANALYSIS
10. In the Specify Columns Content and Data Type screen, click on the Detect
button.
11. Make sure that the content of all attributes, including Bike Buyer, is discrete, except
Age and YearlyIncome, which should be continuous, and CustomerKey, which is the key.
12. In the Create Testing Set window, accept the default split option to use 30% of
data for testing.
13. Name the structure Target Mail Mining Accuracy, and model Target Mail Mining
Accuracy Decision Trees. Click Finish.
14. In a real-life project, you should discretize continuous columns in advance, and
define the discretization buckets from a business perspective. Nevertheless, you can
discretize in SSAS as well. Discretize the Age and Yearly Income mining structure
columns into five groups using the EqualAreas method. In the Properties window of a
column on the Mining Structure tab of the Data Mining Designer, change the Content
property to Discretized, the DiscretizationBucketCount property to 5, and the
DiscretizationMethod property to EqualAreas.
15. In the demo data, there is no strong rule for invalid phones based on other
attributes. Although the absence of a pattern is a piece of information as well, for the
following example let's force a deeper tree through the algorithm parameters. In the
Data Mining Designer window, click on the Mining Models tab.
16. Right-click on the Decision Trees model and select Set Algorithm Parameters.
Change the COMPLEXITY_PENALTY parameter to 0.1, and the
MINIMUM_SUPPORT parameter to 5, as in figure 6.

Figure 6: SETTING THE ALGORITHM PARAMETERS FOR THE DECISION TREES ANALYSIS
17. Save, deploy and process the model (or the whole project).
18. Click on the Mining Model Viewer tab. Change the Background drop-down list to
False. Now you can easily spot what leads to invalid phone numbers. Darker blue
nodes are the nodes with more invalid phone numbers. In the example in figure 7,
it seems that having House Owner Flag set to 0, Yearly Income between 75,650
and 98,550, and Commute Distance not between zero and one mile, is a good
way to have an invalid phone number.

Figure 7: DECISION TREE MINING MODEL VIEWER


19. Check the Dependency Network viewer as well.
20. Save the project. Do not close BIDS if you want to try the next exercise.
Although for this example we needed to tweak the Decision Trees parameters in order to force
the growth of the tree and get at least some results, the example is valid for showing the process
of finding the root causes of inaccurate data.

Measuring Information
The next hard data quality dimension is information. The amount of information in data is not
important from a correctness point of view; however, it makes sense to know it if we want to
use our data for business intelligence, for analyses.
In mathematics, specifically in Information Theory, the measure for the amount of information is
Entropy. Before quoting the formula, let's explain the idea behind Entropy.
In real life, information is the same thing as surprise. If a friend tells you something, and you are
not surprised, this means you already knew it. Now, let's say we have a discrete attribute.
How many times can we get surprised by its value, i.e. its state? For a start, let's say we have
two possible states, with an equal distribution. We would expect one state; in 50% of cases we
would be surprised, because we would get the other state. Now imagine we have a skewed
distribution, and one state has a 90% probability. We would expect this state, and be surprised in
10% of cases only. With a completely constant distribution, where all rows hold the same single
state, we would never have a chance to be surprised.
Now let's say the attribute has three possible states, with an equal distribution. We still expect one
specific state. However, in this case, we would be surprised in 67% of cases. With four possible
states and an equal distribution, we would be surprised in 75% of cases. We can conclude the
following: the more possible states a discrete variable has, the higher the maximal possible
information in the attribute. And the more equally the actual distribution is spread across the
possible values, the more actual information there is in the attribute. The Entropy formula reflects this:
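In text form (matching the calculation performed by the query below), for a discrete attribute X with n possible states x1, ..., xn, the Shannon Entropy is

H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i)

and its maximal possible value for n states is log2(n), which the query uses to normalize the actual Entropy.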

We do not have out-of-the-box support for Entropy calculation in T-SQL. However, we can
calculate the probabilities, i.e. the frequencies of the states, and from them the Entropy. The
following query calculates the Entropy for the Education, Region and Gender columns:
WITH
DistinctValues AS
(
SELECT COUNT(DISTINCT Education) AS DistinctEducation
,COUNT(DISTINCT Region) AS DistinctRegion
,COUNT(DISTINCT Gender) AS DistinctGender
FROM dbo.vTargetMailDirty
)
SELECT N'H(Education)' AS Measure
,(SELECT DistinctEducation FROM DistinctValues) AS DistinctVals
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
,(1.0*LOG((SELECT DistinctEducation
FROM DistinctValues))/LOG(2)) AS Maximal
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
(1.0*LOG((SELECT DistinctEducation
FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Education
,1.0*COUNT(Education) /
(SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
FROM dbo.vTargetMailDirty
GROUP BY Education) AS Prob
UNION
SELECT N'H(Region)' AS Measure
,(SELECT DistinctRegion FROM DistinctValues) AS DistinctVals
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
,(1.0*LOG((SELECT DistinctRegion
FROM DistinctValues))/LOG(2)) AS Maximal
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
(1.0*LOG((SELECT DistinctRegion
FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Region
,1.0*COUNT(Region) /
(SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
FROM dbo.vTargetMailDirty
GROUP BY Region) AS Prob
UNION
SELECT N'H(Gender)' AS Measure
,(SELECT DistinctGender FROM DistinctValues) AS DistinctVals
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
,(1.0*LOG((SELECT DistinctGender
FROM DistinctValues))/LOG(2)) AS Maximal
,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
(1.0*LOG((SELECT DistinctGender
FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Gender
,1.0*COUNT(Gender) /
(SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
FROM dbo.vTargetMailDirty
GROUP BY Gender) AS Prob
ORDER BY Normalized;

The query calculates the number of distinct states in a CTE. This information is used to calculate
the maximal possible Entropy. Then the query calculates the actual distribution for each of the
analyzed columns separately, in derived tables in the FROM part of the outer queries. In the
SELECT part of the outer query for each analyzed column, the actual Entropy is calculated. In
addition, the normalized Entropy is calculated as the actual Entropy divided by the maximal
Entropy, to show which variable has relatively more information, or, better said, which variable
has a more equal distribution. Finally, all outer queries are combined into a single query with
UNION operators. The result is as follows:
Measure        DistinctVals   Actual              Maximal             Normalized
H(Education)   5              2.14581759656851    2.32192809488736    0.924153336743447
H(Region)      3              1.54525421415102    1.58496250072116    0.974946860539558
H(Gender)      2              0.999841673765157   1                   0.999841673765157

From the results, we can see that we have the most information in the Education column.
However, the Gender column has the most equal distribution relative to its number of states.
For continuous attributes, we can calculate the Entropy in the same way if we discretize them.
We should discretize them into classes (or bins) of equal width, in order to preserve the shape of
the original distribution function as much as possible, and we can try different numbers of bins,
as in the sketch below.
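A minimal sketch of such an equal-width binning for the Age column, with an assumed five bins (the resulting relative frequencies Px can then be plugged into the Entropy calculation exactly as in the previous query):
WITH AgeBins AS
(
SELECT FLOOR((Age - MIN(Age) OVER ()) /
       (((MAX(Age) OVER () - MIN(Age) OVER ()) + 1) / 5.0)) AS AgeBin
FROM dbo.vTargetMailDirty
)
SELECT AgeBin
 ,COUNT(*) AS AbsFreq
 ,1.0 * COUNT(*) / (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
FROM AgeBins
GROUP BY AgeBin
ORDER BY AgeBin;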

Using Other SQL Server Tools for Data Profiling
We have used T-SQL queries, enhanced with a CLR function, and data mining models in our data
profiling activities so far. There are more useful tools in the SQL Server suite for this task. SSAS
OLAP cubes are very useful for an overview of discrete columns. SSIS has a specialized Data
Profiling task. Finally, we can also use the free Office Data Mining Add-Ins. We are now going to
show how we can use all of the tools mentioned for data profiling.

SSAS Cubes
SSAS cubes are typically built on a star schema, using multiple dimensions and fact tables, with
business measures like quantities and amounts. Nevertheless, we can create an SSAS cube based
on a single table, and define part of the columns for a dimension and part for a fact table. In fact,
this way we are creating a star schema inside the Analysis Services cube, although we have a
flattened (single-table) model in our data warehouse.
For data profiling, the only measure we need is the count of rows. We can then use all other
attributes for a dimension, and use them for analyzing the counts, i.e. frequencies, of the different
states of discrete attributes.
We are going to create a cube based on the TargetMailMining view we used for the Decision Trees
analysis. In order to do this, you will need to complete the following steps:
1. If you closed the BI project from this chapter, reopen it in BIDS.
2. In Solution Explorer, right-click on the Cubes folder, and select New Cube.
3. Use the existing tables in the Select Creation Method window.
4. Use the Adventure Works2008R2 data source view. Select TargetMailMining for the measure
group table.

5. In the Select Measures window, clear all selected measures except the Target Mail
Mining Count measure. The wizard uses all numeric columns for measures, and
automatically creates an additional measure with the COUNT aggregate function. This is
exactly the measure we need.
6. Use the dimension selected by default, i.e. Target Mail Mining (it is, of course,
the only dimension you can use).
7. Specify CustomerKey as the dimension key.
8. Name the cube Target Mail Cube and click Finish. The Cube Designer window
should open.
9. In Solution Explorer, expand the Dimensions folder. Double-click on the Target
Mail Mining dimension to open the Dimension Designer. The Cube wizard did not
add any attributes to the dimension.
10. In Dimension Designer, drag and drop columns from the rightmost of the three
panes (the Data Source View pane) to the leftmost (the Attributes) pane. Drag all
columns we used as input columns for the Decision Trees model: Age, BikeBuyer,
CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus,
NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren
and YearlyIncome. Your dimension should be like the one in figure 8.

Figure 8: TARGET MAIL MINING DIMENSION


11. You have to discretize the Age and Yearly Income attributes again. Use the
Properties window to change the DiscretizationBucketCount property to 5 and the
DiscretizationMethod property to Automatic for both attributes.
12. Save the project and deploy it.
13. You can browse the cube directly from BIDS using the Browser tab in the Cube
Designer. However, it is much nicer to use Microsoft Office Excel 2007 or 2010 as
an OLAP client.
14. If you have access to Excel 2007 or 2010, open it.
15. Click on the Data tab to open the Data ribbon.
16. In the From Other Sources drop-down list, select From Analysis Services.
17. In the Data Connection Wizard, write the name of your Analysis Services instance in the
Name textbox. Use Windows authentication. Select the MDSBook_Ch03_SSAS database
and the Target Mail Cube cube. Finish the Data Connection Wizard. Click OK to import
the data into a PivotTable report.

18. In the PivotTable Field List window, select the Target Mail Mining Count measure
and the Occupation attribute. You should already see the distribution of
Occupation in the pivot table. Click somewhere in the pivot table to make sure
the PivotTable Tools ribbon is accessible.
19. Click on the Options tab of the PivotTable Tools ribbon. On the right side of the
ribbon, click on PivotChart. Click OK to select the default chart options.
20. Your Excel worksheet should look similar to the one in figure 9.

Figure 9: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM SSAS CUBE
21. Play with other distributions. Try also to group over columns.
22. When finished, close Excel. Do not close BIDS yet.
As you can see, SSAS together with Excel is a powerful data profiling tool. This was an example
of using a tool for something other than its intended purpose. Later in this chapter, we are going to
introduce a tool that is intended for data profiling: the SSIS Data Profiling task.

Page 148

Chapter 3: Data Quality and SQL Server 2008 R2 Tools

PowerPivot for Excel 2010


Microsoft SQL Server PowerPivot for Excel is a free add-in that became available at the same
time as SQL Server 2008 R2. Besides the Excel version, a SharePoint version of PowerPivot exists as
well. This tool provides so-called self-service BI. From Excel, advanced end users can create
their own analytical solutions without the intervention of database specialists.
You can download PowerPivot for Excel 2010 from the PowerPivot site.
PowerPivot is a completely new technology for data warehousing. In this version, as an add-in
for Excel 2010, it is an in-memory relational database with column-oriented storage. Because
it is relational, we work with tables that represent entity sets. Because it is in-memory, it is fast.
And because the data, although we work with rows, is stored in sorted, compressed,
column-by-column storage, it is even faster. Because the interface for creating the database, importing the
data, and creating PivotTable and PivotChart reports is available in Excel, it is simple to use.
We can use PowerPivot instead of SSAS cubes for our data profiling tasks. Assuming you have already
downloaded and installed PowerPivot for Excel 2010, complete the following steps to prepare
an analysis of the TargetMailMining view similar to the one we did with SSAS:
1. Open Excel 2010. Click on the PowerPivot tab (this tab appears after the
PowerPivot is installed).
2. From the Launch group on the left side of the ribbon, click on the PowerPivot
Window button. You just launched your database development and
administration interface for PowerPivot databases.
3. Using the Home tab, from the Get External data group, click on the From
Database drop-down list, and select From SQL Server.
4. In the Table Import Wizard, Connect to a Microsoft SQL Server Database page,
connect to your SQL Server instance with the AdventureWorks2008R2 sample
database where the TargetMailMining view is created. Click Next.

5. In the Choose How to Import the Data page, use the default option Select from
a list of tables and views to choose the data to import. Click Next.
6. In the Select Tables and Views page, choose the TargetMailMining view. Do not
click Finish yet; click on the Preview & Filter button first.
7. The Preview Selected Table window opens. Now you can select only the columns
we need for analysis and the key: CustomerKey, Age, BikeBuyer,
CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus,
NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren
and YearlyIncome. Actually, you have to de-select the other columns (FirstName,
LastName and PhoneValid), because by default all columns are selected. Click OK.
After you are returned to the Select Tables and Views page, click Finish.

Figure 10: THE PREVIEW SELECTED TABLE WINDOW


8. After the data is imported, on the Importing page, click Close. Your data is now in
the PowerPivot database, in memory.
9. Discretizing continuous columns is not as simple in PowerPivot as it is in
Analysis Services. You can create derived columns using expressions in the Data
Analysis Expressions (DAX) language, the language that PowerPivot supports.
Please refer to Books Online or to the PowerPivot book mentioned in the References
section of this chapter for details about PowerPivot and DAX. For this brief
overview, we are going to simply skip the Age and YearlyIncome continuous
columns.
10. Let's start analyzing the data. In the Reports group of the PowerPivot Home
window, click on the PivotTable drop-down list, and select PivotTable.
11. In the Create PivotTable pop-up window, select the Existing Worksheet option,
and click OK.
12. In the PowerPivot Field List, select the CustomerKey and Occupation columns.
CustomerKey should appear in the Values area, and Occupation in the Axis Fields
(Categories) area. Note that the default aggregation function is Sum; therefore,
your report shows the sum of CustomerKey over Occupation. Of course, this makes no
sense.
13. In the Values area, click on the down arrow near the Sum of CustomerKey field.
Select the Edit Measure option.
14. In the Measure Settings window, select the Count aggregate function. Click OK.

Figure 11: SETTING THE COUNT AGGREGATE FUNCTION
15. We can add a chart to the worksheet, like we did when we analyzed SSAS data.
However, this chart is not automatically connected to the PivotTable we already
have in the sheet. We will have to select fields for analysis again.
16. In the Reports group, click on the PivotTable drop-down list, and select
PivotChart. In the Create PivotChart pop-up window, select Existing Worksheet,
and change the Location to 'Sheet1'!$A$10.
17. In the PowerPivot Field List, select the CustomerKey and Occupation columns.
18. In the Values area, click on the down arrow near the Sum of CustomerKey field.
Select the Edit Measure option.
19. In the Measure Settings window, select the Count aggregate function. Click OK.
20. Now you should have the same analysis of the distribution of the Occupation column
as you had when analyzing the SSAS cube.

Figure 12: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM POWERPIVOT
21. Play with other distributions. Try also to group over columns.

22. When finished, close Excel.

SSIS Data Profiling Task


With the SSIS Data Profiling task available, a question arises: why did we develop all of the queries,
mining models and cubes in this chapter so far? Well, everything we learned so far is still valid and
useful. The Data Profiling task can do limited profiling only, and we have already developed many
more advanced profiling techniques. In addition, the Data Profiling task saves the result in XML
form, which is not directly useful; we would have to write custom code in the SSIS Script task in
order to consume the results in the same package. For an overview of the results, an application
called Data Profile Viewer is included within the SQL Server suite. Let's explore what the Data
Profiling task can do by completing the following steps.
1. If you closed BIDS, reopen the solution we created so far.
2. Use the File > Add > New Project menu option to add a project to the solution. Use the
Integration Services project template. Name the project MDSBook_Ch03_SSIS.
3. In Solution Explorer, in the SSIS Packages folder, right-click on the default package called
Package.dtsx and rename it to SSISProfiling.dtsx. Confirm the renaming of the package
object.
4. From the Toolbox, drag and drop the Data Profiling Task to the control flow area.
Right-click on it and select Edit.
5. In the General tab, use the Destination property drop-down list, and select New File
Connection.
6. In the File Connection Manager Editor window, change the usage type to Create file. In
the File text box, type the file name you want to create (you can select the folder graphically
with the Browse button). In order to follow the placement and naming we have used so far, type
C:\MDSBook\Chapter03\SSISProfiling.xml.
7. When you are back in the Data Profiling Task Editor, General tab, change the
OverwriteDestination property to True, in order to make it possible to re-execute the
package multiple times (otherwise, you would get an error saying that the
destination file already exists on the next execution of the package).
8. In the bottom-right corner of the Data Profiling Task Editor, General tab, click on the
Quick Profile button.
9. In the Simple Table Quick Profiling Form, click on the New button to create a new
ADO.NET connection. The Data Profiling Task accepts only ADO.NET connections.
10. Connect to your SQL Server and select the AdventureWorks2008R2 database. Click
OK to return to the Simple Table Quick Profiling Form.
11. Select the dbo.TargetMailMining view in the Table or View drop-down list. Leave
all check boxes that are selected by default selected. Also select the Column Pattern
Profile and Functional Dependency Profile check boxes. Click OK.
12. In the Data Profiling Task Editor window, in the Profile Type list on the right, select
the Functional Dependency Profile Request. In the Request Properties window,
select BikeBuyer as the dependent column, and CommuteDistance, Education and
NumberCarsOwned as the determinant columns. Deselect the star (all columns) entry
for the determinant columns. Your Data Profiling Task Editor should look like the one in
figure 13. When finished, click OK to close the Data Profiling Task Editor window.

Figure 13: THE DATA PROFILING TASK CONFIGURED


13. Save the project. Right-click on the Data Profiling task and select the Execute Task
option. Wait until the execution finishes. The task color should turn green after
a successful execution.
14. From the Debug menu, select the Stop Debugging option. Check whether the
XML file appeared in the folder.
15. From the All Programs, SQL Server 2008 R2, Integration Services menu, start the Data
Profile Viewer application.
16. Click on the Open button. Navigate to the SSISProfiling.xml file and open it. Now
you can start harvesting the results.
17. In the Profiles pane on the left, select, for example, the Column Value Distribution
Profiles. In the top-right pane, select the Occupation column. In the middle-right
window, you should see the distribution for the Occupation attribute. Click on the
value that has a very low frequency (the Profesional value). At the top-right corner
of the middle-right window, you will find the drill-down button. Click on it. In the
bottom-right window, you get the row with this value in the Occupation column,
as in figure 14.

Figure 14: THE OCCUPATION COLUMN VALUE DISTRIBUTION PROFILE


18. Check the other profiles that were generated.
The SSIS Data Profiling Task can be a nice, automated, basic tool for quick profiling of our data.
Besides some simple profiling, it has some more sophisticated options. For example, the
Functional Dependency profile uses the Naïve Bayes algorithm internally to find dependencies
between a single target variable and multiple determinant variables. Data mining algorithms are
really useful for data profiling.

Excel Data Mining Add-Ins


The Microsoft Office 2007 Data Mining Add-Ins are free to download from the Microsoft SQL Server 2008
R2 Feature Pack site
(http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=ceb4346f-657f-4d28-83f5-aae0c5c83d52).
The same add-ins work for Office 2010 as well. However, please
note that these add-ins currently exist in a 32-bit version only. After you install them, you should
get a new Data Mining tab in Excel.
With the data mining add-ins, Excel does not become a data mining engine. It can utilize, or
leverage, an existing Analysis Services instance. Excel sends data from its worksheets or tables
to the SSAS instance, and then gets the result back and displays it. If the data is formatted as an
Excel table, then data mining is simpler, through more business-oriented options that appear in
the Table Tools ribbon. If the data is organized in worksheets only, then we can still utilize the
Data Mining ribbon, and mine it from a more technical perspective.
The Clustering data mining algorithm is especially useful for finding complex outliers. By
complex outliers we mean rows that are outliers not because of the value of a single column; we are
searching for outliers based on a combination of values from multiple columns. Therefore, with
Clustering, we can perform a multivariate analysis for outliers.
The Clustering algorithm tries to group rows into clusters. In the Expectation-Maximization
variant of the algorithm, which is the default in SSAS, each case belongs to each cluster with some
probability. A case can belong to one cluster with a probability of 0.3, and to another with 0.05.
However, the highest probability could be very low as well. For example, a case can belong to
one cluster with a probability of 0.001, to another with 0.00003, and to other clusters with even
lower probabilities. This means that the case does not fit well into any cluster, or, said differently,
that this case is an exception. This is how we can find outliers with Clustering: we find the cases for
which even the highest probability of belonging to a cluster is still very low.
With the data mining add-ins, Excel becomes a very powerful data mining tool. We can use it for
nearly any data mining activity Analysis Services can perform. Excel can serve as a data mining
client, because with the data mining add-ins we also get, inside Excel, the data mining viewers
that are built into BIDS and SSMS. In addition, the add-ins bring a couple of new templates for Visio.
Nevertheless, this introduction to the Office Data Mining Add-Ins should be enough for a Master
Data Services book. Rather than continuing with theory and descriptions, let's use the add-ins!
To get started, you will need to complete the following steps:
1. In order to use Excel as a data mining tool, your SSAS instance has to allow session
mining models. In SSMS, in Object Explorer, connect to your SSAS instance. Right-click
on it and select Properties. Change the Data Mining \ Allow Session Mining
Models property to True, as shown in figure 15. Click OK.

Figure 15: ALLOW SSAS SESSION MINING MODELS


2. Open Excel. In the Data ribbon, click on the From Other Sources drop-down list. Select
the From SQL Server option.
3. Connect to your SQL Server and select the AdventureWorks2008R2 database. In the
Select Database and Table window of the Data Connection Wizard, make sure that
the Connect to a specific table check box is checked. Select the
dbo.TargetMailMining view and finish the wizard.

4. In the Import Data window, click OK. You should get all of the data in tabular format
in your worksheet.
5. If part of your worksheet is formatted as a table in Excel, you get the additional Table Tools
Analyze ribbon with the data mining add-ins.
6. As Excel needs Analysis Services to perform data mining tasks, you need to click on
the Connection button (by default it should show No connection). Connect to your
Analysis Services instance, to the MDSBook_Ch03_SSAS database. Your screen should look
like the one in figure 16.

Figure 16: TABLE ANALYSIS TOOLS AFTER CONNECTION TO SSAS


7. Click on the Highlight Exceptions button. In the pop-up window, click Run.
8. After the wizard finishes, you should get an additional sheet with summary information about exceptions. If you select your original worksheet, the worksheet with data, you will notice that some rows are highlighted with a dark yellow color, as shown in figure 17. In addition, some columns of the highlighted rows are highlighted with a lighter yellow color.

Figure 17: EXCEPTIONS HIGHLIGHTED IN THE ORIGINAL SHEET


9. These highlights show you suspicious rows. In addition, you can see which column
makes a row an exception. Now it is easy to find errors.
10. After you finish, close Excel and BIDS.

Clean-Up
As this was the last profiling option introduced in this chapter, we can clean up the
AdventureWorks2008R2 database with the following code:
USE AdventureWorks2008R2;
IF OBJECT_ID(N'dbo.ValueIsNULL', N'FN') IS NOT NULL
DROP FUNCTION dbo.ValueIsNULL;
IF OBJECT_ID(N'dbo.ProductsMining',N'V') IS NOT NULL
DROP VIEW dbo.ProductsMining;
IF OBJECT_ID(N'dbo.vTargetMailDirty',N'V') IS NOT NULL
DROP VIEW dbo.vTargetMailDirty;
IF OBJECT_ID(N'dbo.IsRegExMatch', N'FS') IS NOT NULL
DROP FUNCTION dbo.IsRegExMatch;
DROP ASSEMBLY MDSBook_Ch03_CLR;
IF OBJECT_ID(N'dbo.TargetMailMining',N'V') IS NOT NULL
DROP VIEW dbo.TargetMailMining;
GO

Summary
The old rule "garbage in, garbage out" is absolutely valid for MDM projects as well. Before we start implementing a centralized MDM solution, such as a SQL Server Master Data Services solution, we should have an in-depth comprehension of the quality of our data. In addition, we should find the root causes for bad data.
We have shown in this chapter how we can use tools from the SQL Server 2008 R2 suite for data profiling and for finding the root cause. We have used Transact-SQL queries. We have used XQuery and CLR code for controlling the data quality of XML data and strings. We used SQL Server Analysis Services intensively. The Unified Dimensional Model, or, if we prefer this expression, the OLAP cube, is a nice way to get a quick, graphical overview of the data. In addition, PowerPivot for Excel 2010 gives us the opportunity to achieve the same graphical overview even without Analysis Services. Data Mining helps us find interesting patterns, and is thus a useful tool for finding root causes for bad data. With the Office Data Mining Add-Ins, Excel 2007 and 2010 became powerful data mining tools as well. The SQL Server Integration Services Data Profiling task is another quick tool for finding bad or suspicious data.
One of the most challenging tasks in preparing and maintaining master data is merging it from
multiple sources when we do not have the same identifier in all sources. This means we have to
do the merging based on similarities of column values, typically on similarity of string columns
like names and addresses. Even if we have a single source of master data, we can have duplicate
rows for the same entity, like duplicate rows for the same customer. In the next chapter, we are
going to tackle these two problems, merging and de-duplicating.

References

- Erik Veerman, Teo Lachev, Dejan Sarka: MCTS Self-Paced Training Kit (Exam 70-448): Microsoft SQL Server 2008 - Business Intelligence Development and Maintenance (Microsoft Press, 2009)
- Thomas C. Redman: Data Quality - The Field Guide (Digital Press, 2001)
- Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning (John Wiley & Sons, 2003)
- Arkady Maydanchik: Data Quality Assessment (Technics Publications, 2007)
- Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009)
- Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside Microsoft SQL Server 2008: T-SQL Programming (Microsoft Press, 2010)
- Marco Russo, Alberto Ferrari: PowerPivot for Excel 2010 (Microsoft Press, 2011)

Chapter 4: Identity Mapping and De-Duplicating


Dejan Sarka
Two of the most challenging problems with maintaining master data are identity mapping and
de-duplication.
In an enterprise, we frequently have more than one source of master data. Sometimes we have to import master data from external sources. Different sources can include legacy applications;
relational databases used by OLTP applications; analytical databases; semi-structured data, such
as XML data from some Web service; and even non-structured data in Microsoft Excel
worksheets and Microsoft Word documents. Typically, we do not have unique identification of
entities in all these different sources. However, we would like to get a complete picture of an
entity (e.g., of a customer) in order to properly support our applications, such as a Customer
Relationship Management (CRM) application. Therefore, we have to identify which objects (e.g.,
rows in a table, XML instances) represent the same entity. We need to match the identities
based on similarities of entity attributes.
Even if we have a single source of data, we are not without problems. Master data is often duplicated. For example, multiple operators could insert the same customer multiple times, each time with a slightly differently written name or address. Again, we would like to identify such duplicates and get rid of them.
In this chapter, we introduce details of both problems and show possible solutions. We focus on identity mapping, and we show that de-duplicating can actually be translated to the same problem. We discuss the following topics:

- Problems with identity mapping
- String similarity functions
- Reducing the search space
- Performing identity mapping with T-SQL
- Using the SSIS Fuzzy Lookup transformation for identity mapping
- De-duplicating
- Using the SSIS Fuzzy Grouping transformation for de-duplicating

Identity Mapping
The problem with identity mapping arises when data is merged from multiple sources that can
update data independently. Each source has its own way of entity identification, or its own keys.
There is no common key to make simple joins. Data merging has to be done based on
similarities of strings, using names, addresses, e-mail addresses, and similar attributes. Figure 1
shows the problem: we have three rows in the first table in the top left corner, and two rows in
the second table in the top right corner. Keys of the rows from the left table (Id1 column) are
different from keys of the rows in the right table (Id2 column). The big table in the bottom
shows the result of approximate string matching. Note that each row from the left table is
matched to each row from the right table; similarities are different for different pairs of rows.

Figure 1: IDENTITY MAPPING PROBLEM

Problems
Many problems arise with identity mapping. First, there is no way to get a 100 percent accurate
match programmatically; if you need a 100 percent accurate match, you must match entities
manually. But even with manual matching, you cannot guarantee 100 percent accurate matches
at all times, such as when you are matching people. For example, in a database table, you might
have two rows for people named John Smith, living at the same address; we cannot know
whether this is a single person or two people, maybe a father and son. Nevertheless, when you
perform the merging programmatically, you would like to get the best matching possible. You
must learn which method to use and how to use it in order to get the best possible results for
your data. In addition, you might even decide to use manual matching on the remaining
unmatched rows after programmatic matching is done. Later in this chapter, we compare a
couple of public algorithms that are shipped in SQL Server 2008 R2, in the Master Data Services
(MDS) database; we also add SSIS Fuzzy Lookup transformation to the analysis.
The next problem is performance. For approximate merging, any row from one side, from one
source table, can be matched to any row from the other side. This creates a cross join of two
tables. Even small data sets can produce huge performance problems, because a cross join is an algorithm with quadratic complexity. For example, a cross join of the 18,484 rows of the AdventureWorksDW2008R2 vTargetMail view to itself (we will use this view for testing) means dealing with 341,658,256 rows! We discuss techniques for optimizing
this matching (i.e., search space reduction techniques) later in this chapter.
Another problem related to identity mapping is de-duplicating. We deal briefly with this
problem at the end of the chapter.

T-SQL and MDS String Similarity Functions


The T-SQL SOUNDEX() and DIFFERENCE() functions allow searches on similar sounding character strings. SOUNDEX() takes a word, such as a person's name, as input and produces a character string that identifies a set of words that are (roughly) phonetically alike. Phonetically alike in this case means phonetically alike in US English. DIFFERENCE() returns an integer indicating the
difference between the SOUNDEX() values. DIFFERENCE() returns a value between 0 and 4. The
integer returned is the number of characters in the SOUNDEX() values that are the same.
Therefore, value 0 indicates weak or no similarity, and value 4 indicates strong similarity or the
same values. SOUNDEX() is a very basic function, not suitable for solving real-life problems.
There are many better algorithms that compare strings phonetically. In addition, all phonetic
algorithms are language dependent. We focus on language-independent algorithms in this chapter, because we cannot know in advance in which language we will have to do the matching.
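Before moving on, here is a quick, hedged illustration of how these two functions behave (the sample names are ours, not from the book's data sets):

SELECT
   SOUNDEX(N'Smith')  AS SoundexSmith     -- S530
  ,SOUNDEX(N'Smythe') AS SoundexSmythe    -- also S530, so the names are treated as phonetically alike
  ,DIFFERENCE(N'Smith', N'Smythe')   AS StrongSimilarity  -- 4: the SOUNDEX() codes are identical
  ,DIFFERENCE(N'Smith', N'Harrison') AS WeakSimilarity;   -- a low value: the codes have little in common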
If we install MDS in SQL Server 2008 R2, we get the following string matching algorithms:

- Levenshtein distance (also called edit distance)
- Jaccard index
- Jaro-Winkler distance
- Simil (longest common substring; also called Ratcliff/Obershelp)

All of these algorithms are well-known and publicly documented (e.g., on Wikipedia). They are implemented through a CLR function. Note that MDS comes only with SQL Server 2008 R2 Enterprise and Datacenter 64-bit editions. If you are not running either of these editions, you can use any edition of SQL Server 2005 or later and implement these algorithms in CLR functions. Anastasios Yalanopoulos's "Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server" (http://anastasiosyal.com/archive/2009/01/11/18.aspx#soundex) provides a link to a publicly available library of CLR string matching functions for SQL Server.
Levenshtein (edit) distance measures the minimum number of edits needed to transform one
string into the other. It is the total number of character insertions, deletions, or substitutions
that it takes to convert one string to another. For example, the distance between kitten and
sitting is 3:

- kitten → sitten (substitution of 's' for 'k')
- sitten → sittin (substitution of 'i' for 'e')
- sittin → sitting (insertion of 'g' at the end)

Similarity is then normalized between 0 and 1.
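A minimal sketch of one common normalization (an assumption for illustration only; the exact formula that mdq.Similarity applies internally may differ) divides the edit distance by the length of the longer string and subtracts the result from 1:

-- Three edits turn 'kitten' into 'sitting'; the longer string has 7 characters
DECLARE @EditDistance int = 3,
        @MaxLength    int = 7;
SELECT 1.0 - 1.0 * @EditDistance / @MaxLength AS NormalizedSimilarity;  -- roughly 0.571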


The Jaccard index (similarity coefficient) measures similarity between sample sets. When used for strings, it defines strings as sets of characters. The size of the intersection is divided by the size of the union of the sample sets, as the following formula shows:

J(A, B) = |A ∩ B| / |A ∪ B|

Note that the coefficient is normalized between 0 and 1 by its definition.

The Jaro-Winkler distance is a variant of the Jaro string similarity metric. The Jaro distance combines matches and transpositions for two strings s1 and s2:

d_j = (1/3) * ( m / |s1| + m / |s2| + (m - t) / m )

The symbol m means the number of matching characters and the symbol t is the number of transpositions, while |s| denotes the length of a string. In order to define characters as matching, their positions in the strings must be close together, which is defined by the character position distance (CPD); they should be no farther apart than the following formula calculates:

CPD = floor( max(|s1|, |s2|) / 2 ) - 1

The Jaro-Winkler distance uses a prefix scale p, which gives more favorable ratings to strings that match from the beginning:

d_w = d_j + l * p * (1 - d_j)

In this formula, the meaning of the signs is:

- d_j is the Jaro distance
- l is the common-prefix length at the start of the string, up to a maximum of 4 characters
- p is a scaling factor for common prefixes; p should not exceed 0.25, otherwise the distance can become larger than 1 (usually p is equal to 0.1)

Finally, the Simil algorithm looks for the longest common substring in two strings. Then it
removes this substring from the original strings. After that, it searches for the next longest
common substring in remainders of the two original strings from the left and the right. It
continues this process recursively until no common substrings of a defined length (e.g., two
characters) are found. Finally, the algorithm calculates a coefficient between 0 and 1 by dividing
the sum of the lengths of the substrings by the lengths of the strings themselves.
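As a hedged sketch of how the four algorithms can be compared on a single pair of strings (assuming the MDS database with the mdq.Similarity function is installed; the algorithm codes and parameters follow the usage later in Listing 5, and the string pair is one we will meet again later in this chapter):

-- Run in the database that hosts the mdq schema (the MDS database)
SELECT
   mdq.Similarity(N'Zoe Rogers', N'rZoeRogers', 0, 0.85, 0.00) AS Levenshtein
  ,mdq.Similarity(N'Zoe Rogers', N'rZoeRogers', 1, 0.85, 0.00) AS Jaccard
  ,mdq.Similarity(N'Zoe Rogers', N'rZoeRogers', 2, 0.85, 0.00) AS JaroWinkler
  ,mdq.Similarity(N'Zoe Rogers', N'rZoeRogers', 3, 0.85, 0.00) AS Simil;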
Preparing the Data
In order to compare the functions' efficiency, we need some data. The code in Listing 1 prepares
sample data. First, it prepares two tables. The CustomersMaster table will be the master table,
which is a table with keys that we want to keep and transfer to the target table. The
CustomersTarget table is the target of identity mapping; it will receive the keys from the
CustomersMaster table. We fill both tables from the
AdventureWorksDW2008R2.dbo.vTargetMail view. In the target table, we keep the original
keys, multiplied by -1, in order to control the efficiency of merging. The target table also has an
empty column to store the key from the master table (MasterCustomerId) and one additional
column (Updated) that will be used only in the code that produces errors in the data. Besides
common columns (Fullname, StreetAddress, and CityRegion), the two tables also have some
different columns, to show that identical structures are not necessary. All we need is at least
one character column in common, a column that is used to compare values.
Listing 1: Code to Prepare Sample Data
-- Assuming that MDSBook database exists
USE MDSBook;
GO

-- Creating sample data


-- Customers primary (master) table
IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
DROP TABLE dbo.CustomersMaster;
GO
CREATE TABLE dbo.CustomersMaster
(
CustomerId int NOT NULL PRIMARY KEY,
FullName nvarchar(200) NULL,
StreetAddress nvarchar(200) NULL,
CityRegion nvarchar(200) NULL,
NumberCarsOwned tinyint NULL
);
GO
INSERT INTO dbo.CustomersMaster
SELECT c.CustomerKey AS CustomerId
,c.FirstName + ' ' + c.LastName AS FullName
,c.AddressLine1 AS StreetAddress
,g.City + ', ' + g.StateProvinceName
+ ', ' + g.EnglishCountryRegionName AS CityRegion
,c.NumberCarsOwned
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
ON c.GeographyKey = g.GeographyKey;
GO
-- Customers secondary (merge target) table
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
DROP TABLE dbo.CustomersTarget;
GO
CREATE TABLE dbo.CustomersTarget
(
CustomerId int NOT NULL PRIMARY KEY,
FullName nvarchar(200) NULL,
StreetAddress nvarchar(200) NULL,
CityRegion nvarchar(200) NULL,
MaritalStatus nchar(1) NULL,
Updated int NULL,
MasterCustomerId int NULL
);
GO
INSERT INTO dbo.CustomersTarget
SELECT c.CustomerKey * (-1) AS CustomerId
,c.FirstName + ' ' + c.LastName AS FullName
,c.AddressLine1 AS StreetAddress
,g.City + ', ' + g.StateProvinceName
+ ', ' + g.EnglishCountryRegionName AS CityRegion
,c.MaritalStatus
,0 AS Updated
,NULL AS MasterCustomerId
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
ON c.GeographyKey = g.GeographyKey;
GO

We can perform matching based on FullName, StreetAddress, and CityRegion and get 100
percent accurate results. Of course, we have the same data in both tables. In order to test the
functions, we have to perform some updates in the target table and produce some errors in the

common columns. The code in Listing 2 performs these changes. This code makes changes with
a slightly controlled randomness, in a loop that repeats three times. In every loop, the code
makes three updates on 17 percent, 17 percent, and 16 percent (combined 50 percent) rows.
Rows to be updated are selected with Bernoulli's formula for sampling, which provides
statistically good randomness. In the first two passes, the 50 percent of rows are selected from
all rows; in the third pass, we select 50 percent of rows from the rows that were updated in the
previous two passes only. This way, we get more errors on fewer rows; if we selected rows
randomly from all rows in each pass, we would get fewer errors on more rows. From
experience, we know that errors tend to pile up in the same rows and are not spread completely
randomly in a table. For example, we nearly always get an error with the last name Sarka in
Slovenian (it is a Hungarian last name, uncommon in Slovenia), written as Sraka (which is a
Slovenian word), whereas we nearly never get an error with the last name Novak (the most
common last name in Slovenian).
Listing 2: Code to Update the Target Table
DECLARE @i AS int = 0, @j AS int = 0;
WHILE (@i < 3)
-- loop more times for more changes
BEGIN
SET @i += 1;
SET @j = @i - 2;
-- control here in which step you want to update
-- only already updated rows
WITH RandomNumbersCTE AS
(
SELECT CustomerId
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
,FullName
,StreetAddress
,CityRegion
,MaritalStatus
,Updated
FROM dbo.CustomersTarget
)
UPDATE RandomNumbersCTE SET
FullName =
STUFF(FullName,
CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int),
1,
CHAR(CEILING(RandomNumber2 * 26) + 96))
,StreetAddress =
STUFF(StreetAddress,
CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int),
2, '')
,CityRegion =

STUFF(CityRegion,
CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int),
0,
CHAR(CEILING(RandomNumber2 * 26) + 96) +
CHAR(CEILING(RandomNumber3 * 26) + 96))
,Updated = Updated + 1
WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
AND Updated > @j;
WITH RandomNumbersCTE AS
(
SELECT CustomerId
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
,FullName
,StreetAddress
,CityRegion
,MaritalStatus
,Updated
FROM dbo.CustomersTarget
)
UPDATE RandomNumbersCTE SET
FullName =
STUFF(FullName, CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int),
0,
CHAR(CEILING(RandomNumber2 * 26) + 96))
,StreetAddress =
STUFF(StreetAddress,
CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int),
2,
CHAR(CEILING(RandomNumber2 * 26) + 96) +
CHAR(CEILING(RandomNumber3 * 26) + 96))
,CityRegion =
STUFF(CityRegion,
CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int),
2, '')
,Updated = Updated + 1
WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
AND Updated > @j;
WITH RandomNumbersCTE AS
(
SELECT CustomerId
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
,FullName
,StreetAddress
,CityRegion
,MaritalStatus
,Updated
FROM dbo.CustomersTarget
)
UPDATE RandomNumbersCTE SET
FullName =
STUFF(FullName,
CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int),
1, '')
,StreetAddress =
STUFF(StreetAddress,
CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int),
0,
CHAR(CEILING(RandomNumber2 * 26) + 96) +
CHAR(CEILING(RandomNumber3 * 26) + 96))
,CityRegion =
STUFF(CityRegion,
CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int),
2,
CHAR(CEILING(RandomNumber2 * 26) + 96) +
CHAR(CEILING(RandomNumber3 * 26) + 96))
,Updated = Updated + 1
WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.16
AND Updated > @j;
END;

After the update, we can check how many rows are different in the target table for the original
rows, which are still available in the master table. The maximum number of updates per row is
9; the more times a row was updated, the more common attribute values differ from the
original (and correct) ones. The probability that a row is updated many times drops quite quickly
with higher numbers of updates. The query in Listing 3 compares full names and addresses after
the update, selecting only rows with some changes and sorting them by the number of updates
in descending order, so we get the rows with maximal number of updates on the top.
Listing 3: Query to Compare Full Names and Addresses after Update
SELECT

m.FullName
,t.FullName
,m.StreetAddress
,t.StreetAddress
,m.CityRegion
,t.CityRegion
,t.Updated
FROM dbo.CustomersMaster AS m
INNER JOIN dbo.CustomersTarget AS t
ON m.CustomerId = t.CustomerId * (-1)
WHERE m.FullName <> t.FullName
OR m.StreetAddress <> t.StreetAddress
OR m.CityRegion <> t.CityRegion
ORDER BY t.Updated DESC;

The partial result in Figure 2 shows that three rows were updated 7 times (note that in order to
fit the number of updates of a row into the figure, CityRegion columns before and after the
update are omitted from the figure). Altogether, 7,790 rows were updated in this test. You
should get different results every time you run this test, because the updates are done
randomly (or better, with a controlled randomness, no matter how paradoxical this sounds).
You can also see that the values in the rows that were updated many times differ quite a lot
from the original values; therefore, our string matching algorithms are going to have a hard time
finding similarities.

Figure 2: RESULTS OF NAME AND ADDRESS COMPARISON


Testing the String Similarity Functions
Although we mentioned that we are not going to use T-SQL SOUNDEX() and DIFFERENCE()
functions because they are language dependent, it makes sense to start with a quick test of
these functions. This test shows another reason for not using the functions: They are very
inefficient. Let's look at how they perform on full names. Because we retained the original key
(although multiplied by -1) in the target table, we can make an exact join and compare the
original (master) and changed (target) names, as Listing 4 shows.
Listing 4: SOUNDEX() and DIFFERENCE() Test on Full Names
SELECT

m.CustomerId
,m.FullName
,t.FullName
,DIFFERENCE(m.FullName, t.Fullname) AS SoundexDifference
,SOUNDEX(m.FullName) AS SoundexMaster
,SOUNDEX(t.FullName) AS SoundexTarget
FROM dbo.CustomersMaster AS m
INNER JOIN dbo.CustomersTarget AS t
ON m.CustomerId = t.CustomerId * (-1)
ORDER BY SoundexDifference;

The results in Figure 3 show that the DIFFERENCE() function based on the SOUNDEX() code did
not find any similarity for quite a few full names. However, as you can see from the highlighted
row, the name Zoe Rogers was not changed much; there should be some similarity in the strings
Zoe Rogers and rZoeRogers. This proves that the two functions included in T-SQL are not
efficient enough for a successful identity mapping.

Figure 3: RESULTS OF TESTING THE SOUNDEX AND DIFFERENCE FUNCTIONS ON FULL NAMES
You could continue with checking the two T-SQL functions using the street address and city and
region strings.
All of the four algorithms implemented in the mdq.Similarity function in the MDS database
return similarity as a number between zero and one. A higher number means better similarity.
The query in Listing 5 checks how algorithms perform on full names. Again, because we retained
the original key (although multiplied by -1) in the target table, we can make an exact join,
compare the original (master) and changed (target) names, and visually evaluate which
algorithm gives the highest score.
Listing 5: Code to Check Algorithm Performance on Full Names
SELECT
m.CustomerId
,m.FullName
,t.FullName
,mdq.Similarity(m.FullName, t.Fullname, 0, 0.85, 0.00) AS Levenshtein
,mdq.Similarity(m.FullName, t.Fullname, 1, 0.85, 0.00) AS Jaccard
,mdq.Similarity(m.FullName, t.Fullname, 2, 0.85, 0.00) AS JaroWinkler
,mdq.Similarity(m.FullName, t.Fullname, 3, 0.85, 0.00) AS Simil
FROM dbo.CustomersMaster AS m
INNER JOIN dbo.CustomersTarget AS t
ON m.CustomerId = t.CustomerId * (-1)
ORDER BY Levenshtein;

For more information about the mdq.Similarity parameters, see SQL Server Books Online.


The results (in Figure 4) are sorted by Levenshtein coefficient, in ascending order (we could
choose any of the four), in order to get the rows with the maximal dissimilarity on the top.

Figure 4: RESULTS OF CHECKING ALGORITHM PERFORMANCE ON FULL NAMES FOR THE MDQ.SIMILARITY FUNCTION
From the results, we can see that the Jaro-Winkler algorithm gives the highest scores in our
example. You should check the algorithms on street address and city and region strings as well.
Although we cannot say that the Jaro-Winkler algorithm would always perform the best, and
although you should always check how algorithms perform on your data, we can say that this is
not a surprise. Jaro-Winkler is one of the most advanced public algorithms for string matching.
In addition, it forces higher scores for strings with the same characters in the beginning of the
string. From experience, we noticed that errors in data, produced by humans, typically do not
appear in the first few characters. Therefore, it seems that the Jaro-Winkler algorithm is the
winner in this case, and we will use it to do the real matching. However, before we do the
match, we need to discuss optimizing the matching by trying to avoid a full cross join.
Optimizing Mapping with Partitioning
As mentioned, identity matching is a quadratic problem. The search space dimension is equal to the cardinality of the Cartesian product A x B of the sets included in the match.
There are multiple search space reduction techniques.
A partitioning or blocking technique partitions at least one set involved in matching in blocks.
For example, take the target rows in batches and match each batch with a full master table.

With 10,000 rows in each table, we could have 10 iterations of joins with 1,000 x 10,000 = 10,000,000 rows each, instead of one big join with 10,000 x 10,000 = 100,000,000 rows. The important thing here is that we do not bloat the memory and go over our hardware resources.
In order to prevent this, we can change batch sizes appropriately.
A sorting neighborhood technique sorts both sets and then moves a window of a fixed size on
both sets. We do the matching on that window only, one window after another. The problem
with such a technique lies in the fact that we can have objects from one set in a window that is
not compared to the window from another set where the objects that should actually be
matched reside. The same problem could arise with blocking techniques implemented on both
sets. We do not want to lose the matches we might find by optimization of the mapping
methods.
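As a minimal, hedged sketch of the sorting neighborhood idea (illustrative only; the rest of this chapter uses the partitioning and nGrams approaches, and the window of 5 positions is an arbitrary assumption), the two tables created in Listing 1 could be compared like this:

WITH MasterSorted AS
(
    SELECT CustomerId, FullName,
           ROW_NUMBER() OVER (ORDER BY FullName) AS rn
    FROM dbo.CustomersMaster
),
TargetSorted AS
(
    SELECT CustomerId, FullName,
           ROW_NUMBER() OVER (ORDER BY FullName) AS rn
    FROM dbo.CustomersTarget
)
-- Compare only pairs whose positions in the two sorted sets are close together
SELECT m.CustomerId AS MCid, t.CustomerId AS TCid,
       m.FullName AS MasterName, t.FullName AS TargetName
FROM MasterSorted AS m
  INNER JOIN TargetSorted AS t
    ON ABS(m.rn - t.rn) <= 5;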
Let's test the partitioning method. The first partition is, of course, the partition with exact
matches. The query in Listing 6 updates the MasterCustomerId column in the target table with
CustomerId from the master table based on exact matches on all three common string columns.
Listing 6: Query to Update MasterCustomerId Column
WITH Matches AS
(
SELECT t.MasterCustomerId
,m.CustomerId AS MCid
,t.CustomerId AS TCid
FROM dbo.CustomersTarget AS t
INNER JOIN dbo.CustomersMaster AS m
ON t.FullName = m.FullName
AND t.StreetAddress = m.StreetAddress
AND t.CityRegion = m.CityRegion
)
UPDATE Matches
SET MasterCustomerId = MCid;

In the next step, we select (nearly) randomly 1,000 rows from the target table and perform a
match with all rows from the master table. We measure the efficiency of the partitioning
technique and of the Jaro-Winkler string similarity algorithm. We select rows from the target
(nearly) randomly in order to prevent potential bias that we might get by selecting, for example,
rows based on CustomerId. We use the NEWID() function to simulate randomness. The values

generated by NEWID() actually do not satisfy statistical randomness, which is why we used the term "(nearly) randomly". In a real-life example, we would iterate through all rows block by block
anyway, so it is not that important to use real statistical randomness. NEWID() should give us
enough randomness to simulate one iteration (i.e., one block from the target table matched
with the complete master table).
We measure the efficiency of the matching on the fly, while we are doing the match. The first
part of the code simply declares two variables to store the number of rows updated in this pass
and the start time of the UPDATE query. The actual UPDATE query starts with a CTE that selects
1,000 rows from the target table, as we said, nearly randomly, using the TOP operator on
NEWID() values, as Listing 7 shows.
Listing 7: UPDATE Query
-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Select (nearly) randomly 1000 rows from the target table
WITH CustomersTarget1000 AS
(
SELECT TOP 1000
t.CustomerId
,t.MasterCustomerId
,t.FullName
,t.StreetAddress
,t.CityRegion
FROM CustomersTarget AS t
WHERE t.MasterCustomerId IS NULL
ORDER BY NEWID()
),

The next CTE, which Listing 8 shows, performs a cross join between the 1,000-row block from the target table and the full master table. It also compares the strings and adds the Jaro-Winkler similarity coefficient to the output. In addition, it adds the row number, sorted by the Jaro-Winkler similarity coefficient in descending order and partitioned by the target table key, to the output. A row number equal to 1 will mean the highest Jaro-Winkler similarity coefficient for a specific target table key. This way, we mark the master table row that, according to the Jaro-Winkler algorithm, is the most similar to the target row. The last part of the UPDATE

statement, also shown in Listing 8, is the actual update. It updates target table rows with the
key from the master table row with the highest similarity.
Listing 8: Code to Perform Cross Join and Actual Update
-- Full cross join
-- Adding Jaro-Winkler coefficient and row number
MasterTargetCross AS
(
SELECT t.CustomerId AS TCid
,m.CustomerId AS MCid
,t.MasterCustomerId
,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
t.FullName + t.StreetAddress + t.CityRegion,
2, 0.85, 0.00) AS JaroWinkler
,ROW_NUMBER() OVER (PARTITION BY t.CustomerId ORDER BY
mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
t.FullName + t.StreetAddress + t.CityRegion,
2, 0.85, 0.00) DESC) AS RowNo
FROM CustomersMaster AS m
CROSS JOIN CustomersTarget1000 AS t
)
-- Actual update
UPDATE MasterTargetCross
SET MasterCustomerId = MCid
WHERE RowNo = 1;

Finally, in the last statements of the batch, we use the fact that we know what key from the
master table we should receive in the target table. We are just counting how many rows got a
wrong key and comparing this number to the number of rows we updated in this pass. In
addition, we are also measuring the time needed to execute this update, as Listing 9 shows.
Listing 9: Code to Measure Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
,100.0 * @RowsUpdated / 7790 AS PctUpdated
,COUNT(*) AS NumErrors
,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

After we execute the update, we can harvest the results of the efficiency measurement, as
Figure 5 shows.

Figure 5: RESULTS OF EFFICIENCY MEASUREMENT


The results are not too promising in this case. We updated only 1,000 of the 7,790 rows we have to update, which is approximately 12.84 percent. With more than 6 percent of errors on that small subset, it seems that a pure Jaro-Winkler algorithm does not satisfy our needs, and it doesn't make sense to apply it to the full set. In addition, 133 seconds for only a small subset of rows is not too encouraging. This is the price of the cross join. Apparently, we need to improve both the accuracy of the matching and the performance. Before developing a better procedure, let's reset the MasterCustomerId column in the target table:
UPDATE dbo.CustomersTarget
SET MasterCustomerId = NULL;

Optimizing Mapping with nGrams Filtering


In this section, we focus on a pruning (or filtering) method. The question is whether we can intelligently pre-select a subset of rows from the master table for a match with a single pre-selected batch from the target table. With intelligent selection, we would like to improve the accuracy and the performance of the matching procedure.
The problem here is how to pre-select these rows. We need to be sure that we do not exclude a row that should be matched with one or more target rows. Besides that, we need reasonably sized subsets; if we pre-select a subset of 9,000 rows from a set of 10,000 rows, we have gained nearly nothing, because we would get a nearly full cross join again. In addition, we would like to build some intelligence into selecting batches from the target table; let's select the rows that have more chances to match well with master rows first, and then rows with a little fewer chances, and then rows with even fewer chances, and so on. What we need is, of course, a procedure that accepts parameters. Finally, we could leave some rows for manual matching. Of course, we would start with exact matching.

Without any further hesitation, let's introduce the pre-selecting algorithm we will use. We will tokenize the strings from the master table into substrings of length n. These tokens are called nGrams. In the MDS database, there is already a function, mdq.NGrams, that does this for us. If we do not use MDS, we can write our own CLR or T-SQL function. A T-SQL solution is already written and can be found in SQL Server MVP Deep Dives.
After we tokenize the strings from the master table and store these tokens together with keys in a new table, we calculate the overall (absolute) frequency of the tokens and store this in a token statistics table. With these statistics, we can select for comparison, in each pass, rows with strings that have at least m common nGrams with less than p frequency. The m, n, and p are the parameters we use to select rows to be matched in each pass. We can start by comparing the rows that have at least 3 (m) common 4(n)Grams with less than 20 (p) absolute frequency. The p parameter helps us to start comparing strings with rare tokens first, with tokens that have low absolute frequency. Then we can use lower values for m and n and higher values for p in the next pass, and so on.
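As a small, hedged illustration of the tokenization itself (assuming the MDS mdq.NGrams function, called with the same signature as in Listing 10 below), the following query lists the 4Grams of a short string:

-- Run in the database that hosts the mdq schema (the MDS database)
SELECT Token, Sequence
FROM mdq.NGrams(UPPER(N'Ruben Torres'), 4, 0)
ORDER BY Sequence;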
Of course, this solution comes at a price. We have to create nGrams for both the source and target tables. In our example, we will store nGrams for the master (source) table and create nGrams for a batch of target rows on the fly. If we need to do the identity mapping continuously (e.g., from day to day), we could store nGrams for both tables permanently and update them with the help of triggers on the master and target tables. We also need to keep the token statistics table updated. This could be done either with another trigger on the master table or with the help of an indexed view on the tokens table (i.e., the token statistics table could be just an indexed view); a sketch of such a view follows this paragraph. Triggers and indexes would slow down updates on the master and target tables. Nevertheless, because both tables hold master data, we do not expect too frequent changes there. In addition, the tokens and token statistics tables occupy some space. However, we do not consider this a huge problem. We need only a couple of tables like this, because we usually merge only a couple of entity sets using string matching (typically customers).
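A minimal sketch of the indexed-view idea, assuming the dbo.CustomersMasterNGrams table that Listing 10 creates later in this chapter (the view and index names here are illustrative, not part of the original solution):

-- Token statistics maintained automatically by the engine as an indexed view
CREATE VIEW dbo.CustomersMasterNGramsStats
WITH SCHEMABINDING
AS
SELECT Token, COUNT_BIG(*) AS Cnt
FROM dbo.CustomersMasterNGrams
GROUP BY Token;
GO
-- The unique clustered index materializes the view, so the counts stay
-- current as rows are inserted into or deleted from the nGrams table
CREATE UNIQUE CLUSTERED INDEX IX_CustomersMasterNGramsStats
ON dbo.CustomersMasterNGramsStats (Token);
GO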

In summary, pre-selecting comes with a price. The price is quite low if we do the merging
continuously. For a single merge, it might be simpler to use a blocking technique to reduce the
search space. But we don't need to guess, because we have developed a whole infrastructure for measuring the efficiency.
As mentioned, in our example we will show a solution that stores nGrams for the master table only and calculates nGrams for the target table on the fly. In addition, we will store the nGrams statistics in a table as well.
The code in Listing 10 creates a table to store nGrams from the master table and then populates the new table. Note that the n parameter in this case is 4: we are using 4Grams. In addition, before actually extracting the nGrams, we standardize all strings to upper case.
Listing 10: Code to Create Table to Store nGrams from the Master Table
CREATE TABLE dbo.CustomersMasterNGrams
(
CustomerId int NOT NULL,
Token char(4) NOT NULL
PRIMARY KEY (Token, CustomerId)
);
GO
INSERT INTO dbo.CustomersMasterNGrams
SELECT DISTINCT
-- Store one NGram for one customer only once
m.CustomerId
,g.Token
FROM dbo.CustomersMaster AS m
CROSS APPLY (SELECT Token, Sequence
FROM mdq.NGrams(UPPER(m.FullName +
m.StreetAddress +
m.CityRegion), 4, 0)) AS g
WHERE CHARINDEX('', g.Token) = 0
AND CHARINDEX(' ', g.Token) = 0;

We do not show the code and results of checking the content of this table here. Let's directly create the nGrams frequency table, and store the frequency there, with the code in Listing 11.
Listing 11: Code to Create the nGrams Frequency Table
CREATE TABLE dbo.CustomersMasterNGramsFrequency
(
Token char(4) NOT NULL,
Cnt int NOT NULL,

RowNo int NOT NULL
PRIMARY KEY (Token)
);
GO
INSERT INTO dbo.CustomersMasterNGramsFrequency
SELECT Token
,COUNT(*) AS Cnt
,ROW_NUMBER() OVER(ORDER BY COUNT(*)) AS RowNo
FROM dbo.CustomersMasterNGrams
GROUP BY Token;

The following query's results display which nGrams values are the most frequent, as Figure 6 shows.
SELECT *
FROM dbo.CustomersMasterNGramsFrequency
ORDER BY RowNo DESC;

Figure 6: NGRAMS FREQUENCY


In order to completely comprehend the impact of storing the tokens and their frequency, let's check the space used by the original and tokens tables. The following code checks the space used; Figure 7 shows the results.
EXEC sp_spaceused 'dbo.CustomersMaster';
EXEC sp_spaceused 'dbo.CustomersTarget';
EXEC sp_spaceused 'dbo.CustomersMasterNGrams';
EXEC sp_spaceused 'dbo.CustomersMasterNGramsFrequency';

Figure 7: SPACE USED


From the results, we can see that the nGrams table actually occupies much more space,
approximately ten times more space, than the original table. Of course, our original
CustomersMaster table does not have many columns. In a real-world situation, the tokens table
would probably use approximately the same or even less space than the source table.
Now we have the entire infrastructure we need to start doing the actual matching.
Again, we are starting with exact matches. The UPDATE code in Listing 12 performs the exact
matching on all three common columns; this is the same query we used in the partitioning
procedure already.
Listing 12: Code to Perform Matching on Common Columns
WITH Matches AS
(
SELECT t.MasterCustomerId
,m.CustomerId AS MCid
,t.CustomerId AS TCid
FROM dbo.CustomersTarget AS t
INNER JOIN dbo.CustomersMaster AS m
ON t.FullName = m.FullName
AND t.StreetAddress = m.StreetAddress
AND t.CityRegion = m.CityRegion
)
UPDATE Matches

SET MasterCustomerId = MCid;

Now we can filter out exact matches in all of the following steps. If we have a large number of
rows, we can also use only a batch of still unmatched rows of the target table combined with
nGrams pre-selecting. This means we could actually combine the nGrams filtering with the
partitioning technique. However, because the number of rows to match is already small enough
in this proof of concept project, we do not split the unmatched rows in the target table in
batches here. We will measure the efficiency of the matching on the fly, while we are doing the
match. The first part of the code, as in the partitioning technique, declares two variables to
store the number of rows updated in this pass and the start time.
The actual merging query uses four CTEs. In the first one, we tokenize the target table rows into 4Grams on the fly. Because we are using 4Grams, the n parameter is equal to 4 in this pass. This part of the code is shown in Listing 13.
Listing 13: Variable Declaration and the First CTE
-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Tokenize target table rows
WITH CustomersTargetNGrams AS
(
SELECT t.CustomerId AS TCid
,g.Token AS TToken
,t.MasterCustomerId
FROM dbo.CustomersTarget AS t
CROSS APPLY (SELECT Token, Sequence
FROM mdq.NGrams(UPPER(t.FullName +
t.StreetAddress +
t.CityRegion), 4, 0)) AS g
WHERE t.MasterCustomerId IS NULL
AND CHARINDEX(' ', g.Token) = 0
AND CHARINDEX('', g.Token) = 0
),

The next CTE selects only target rows with 4Grams, with absolute frequency less than or equal
to 20. Thus, the p parameter is 20 in this case. The code is shown in Listing 14.
Listing 14: The Second CTE

-- Target rows with 4Grams, with absolute frequency less than or equal to 20
NGramsMatch1 AS
(
SELECT tg.TCid
,tg.MasterCustomerId
,mg.CustomerId AS MCid
,tg.TToken
,f.Cnt
FROM CustomersTargetNGrams AS tg
INNER JOIN dbo.CustomersMasterNGramsFrequency AS f
ON tg.TToken = f.Token
AND f.Cnt <= 20
INNER JOIN dbo.CustomersMasterNGrams AS mg
ON f.Token = mg.Token
),

In the third CTE, we are selecting only matches that have in common at least three less frequent
4Grams. Thus, the m parameter is 3 in this example, as shown in Listing 15.
Listing 15: The Third CTE
-- Matches that have in common at least three less frequent 4Grams
NGramsMatch2 AS
(
SELECT TCid
,MCid
,COUNT(*) AS NMatches
FROM NGramsMatch1
GROUP BY TCid, MCid
HAVING COUNT(*) >= 3
),

The last CTE then compares the strings and adds the Jaro-Winkler similarity coefficient to the
output. In addition, it adds the row number sorted by Jaro-Winkler similarity coefficient, in
descending order and partitioned by the target table key, to the output. A row number equal to 1 means the row with the highest Jaro-Winkler similarity coefficient for a specific target table key. This way, we mark the master table row that, according to the Jaro-Winkler algorithm, is the most similar to the target row. The code for the fourth CTE that calculates the Jaro-Winkler
coefficient and the row number is shown in Listing 16. Fortunately, this is the last CTE in this
long query; after this CTE, the only thing left to do is the actual UPDATE. It updates the target
table rows with the key from the master table row with the highest similarity. The UPDATE
statement is also shown in Listing 16.
Listing 16: The Fourth CTE and the UPDATE Statement

-- Adding Jaro-Winkler coefficient and row number
NGramsMatch3 AS
(
SELECT t.CustomerId AS TCid
,m.CustomerId AS MCid
,t.MasterCustomerId
,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
t.FullName + t.StreetAddress + t.CityRegion,
2, 0.85, 0.00) AS JaroWinkler
,ROW_NUMBER() OVER (PARTITION BY t.CustomerId ORDER BY
mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
t.FullName + t.StreetAddress + t.CityRegion,
2, 0.85, 0.00) DESC) AS RowNo
,ngm2.NMatches
FROM NGramsMatch2 AS ngm2
INNER JOIN dbo.CustomersTarget AS t
ON ngm2.TCid = t.CustomerId
INNER JOIN dbo.CustomersMaster AS m
ON ngm2.MCid = m.CustomerId
)
-- Actual update
UPDATE NGramsMatch3
SET MasterCustomerId = MCid
WHERE RowNo = 1;

We use the code in Listing 17 to measure the efficiency, and show the results in Figure 8.
Listing 17: Measuring the Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
,100.0 * @RowsUpdated / 7790 AS PctUpdated
,COUNT(*) AS NumErrors
,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

Figure 8: RESULTS OF MEASURING THE EFFICIENCY


The results are much better in this case. We updated 6,847 of the 7,790 rows we have to update, which is approximately 87.89 percent. With only 0.48 percent of errors, we can see now that the complication with intelligent pre-selecting makes sense. In addition, the performance is orders of magnitude better than we achieved with the partitioning procedure. We needed only 7 seconds to match 6,847 rows, whereas we needed 133 seconds to match only 1,000 rows with the partitioning procedure. But is the nGrams filtering method always that efficient?
Comparing nGrams Filtering with Partitioning
In order to make a fairer comparison, we need to play with different parameters of the nGrams filtering method. We admit that in the section where we described the method, we used parameters that give us, according to many tests, very good results. Some of the most important results of multiple tests are aggregated in Table 1.
Table 1: Results of nGrams Filtering

Method             p    Rows to match   Rows matched   Percent matched   Number of errors   Percent of errors   Elapsed time (s)
Partitions         NA   7790            1000           12.83             63                 6.30                133
Partitions         NA   7790            2000           25.67             93                 4.65                262
nGrams filtering   20   7790            7391           94.88             232                3.14                223
nGrams filtering   20   7790            7717           99.06             322                4.17                264
nGrams filtering   50   7790            7705           96.39             48                 0.64                11
nGrams filtering   50   7790            7716           99.05             106                1.37                14
nGrams filtering   20   7790            6847           87.89             33                 0.48                7
nGrams filtering   20   7790            7442           95.53             145                1.95
nGrams filtering   10   7790            6575           84.40             188                2.86
nGrams filtering   10   7790            5284           67.83             28                 0.53

As you can see from the table, it is important to choose the correct values for the parameters of
the nGrams filtering method. For example, when using two common 3Grams with absolute
frequency less than or equal to 20, the performance (elapsed time 264s) and the accuracy (4.17
percent of errors) were not really shining. Nevertheless, the table proves that the nGrams
filtering method works. In all cases in the table, it is more efficient than the partitioning method.
With proper values for the parameters, it gives quite astonishing results.
In addition, note that in a real example, not as many errors would be present in the target table
rows, and the results could be even better. We should also try matching with other algorithms,
to check which one is the most suitable for our data. In addition, we could start with more strict
values for the three parameters and have even fewer errors in the first pass. So, did we find the
best possible method for identity mapping? Before making such a strong conclusion, lets try the
last option for matching we have in SQL Server out of the box, the SSIS Fuzzy Lookup data flow
transformation.

Microsoft Fuzzy Components


The Fuzzy Lookup and Fuzzy Grouping transformations, which are included in Enterprise,
Datacenter and Developer editions of SQL Server only, use a proprietary Microsoft matching
algorithm. It is partially documented in SQL Server Books Online; more information can be found
in MSDN and TechNet articles. In addition, Microsoft recently published a Fuzzy Lookup add-in
for Excel 2010. This add-in's documentation includes a good explanation of the Fuzzy algorithm.
Even more details can be found in the documentation for the Microsoft Research Data Cleaning
project.

Fuzzy Algorithm Description


Fuzzy transformations use a quite advanced algorithm for approximate string matching. It
actually comprises some other algorithms that we already know. It starts by using the Jaccard
similarity coefficient. However, the Fuzzy transformations' version is much more advanced: It is actually a weighted Jaccard similarity over tokens.
For example, the sets {a, b, c} and {a, c, d} have a Jaccard similarity of 2/4 = 0.5 because the
intersection is {a, c} and the union is {a, b, c, d}. You can assign weights to each item in a set and
define the weighted Jaccard similarity as the total weight of the intersection divided by the total
weight of the union. For example, we added arbitrary weights to the elements of the sets from
the previous example to get the weighted sets {(a, 2), (b, 5), (c, 3)}, {(a, 2), (c, 3), (d, 7)}. For
these two sets, the weighted Jaccard similarity is (2 + 3) / (2 + 3 + 5 + 7) = 5/17 = 0.294. Tokens are substrings of the original strings. Fuzzy transformations convert strings to sets of tokens before they calculate the weighted Jaccard similarity. The internal component used for this conversion is called a tokenizer. For example, the row {"Ruben Torres", "5844 Linden Land"} might be tokenized into the set {"Ruben", "Torres", "5844", "Linden", "Land"}. The default tokenizer is for English text. You can change the LocaleId property in the component properties.
Note that this is an advanced property, and you need to open the Advanced Editor for the

transformation in order to get to this property (right-click on either Fuzzy Lookup or Fuzzy
Grouping transformation in SSIS Designer and select the Show Advanced Editor option).
Fuzzy transformations assign weights to tokens. Tokens get higher weights if they occur
infrequently and lower weights if they occur frequently. In database texts, for example,
frequent words such as database might be given a lower weight, whereas less frequent words
such as broker might be given a higher weight. In the Excel version of Fuzzy Lookup, you can
even override the default token weights by supplying your own table of token weights.
Fuzzy components are additionally enhanced with token transformations. Tokens are converted
from one string to another. There are many classes of such transformations that Fuzzy
components handle automatically, such as spelling mistakes, string prefixes, and string
merge/split operations. In the Excel version, you can also define a custom transformation table
to specify conversions from a token to a token. For example, you can specify that Inc token
has to be converted to Incorporated token.
The Jaccard coefficient is further enhanced to Jaccard similarity under transformations. This is
the maximum Jaccard similarity between any two transformations of each set. With a given set
of transformation rules (either from your table in Excel, or only built-in rules in SSIS), all possible
transformations of the set are considered. For example, for the sets {a, b, c} and {a, c, d} and the
transformation rules {b=>d, d=>e}, the Jaccard similarity is computed:

- Variations of {a, b, c}: {a, b, c}, {a, d, c}
- Variations of {a, c, d}: {a, c, d}, {a, c, e}
- Maximum Jaccard similarity between all pairs:
  - J({a, b, c}, {a, c, d}) = 2/4 = 0.5
  - J({a, b, c}, {a, c, e}) = 2/4 = 0.5
  - J({a, d, c}, {a, c, d}) = 3/3 = 1.0
  - J({a, d, c}, {a, c, e}) = 2/4 = 0.5
- The maximum is 1.0.

Fuzzy components also use the Edit (Levenshtein) distance. The algorithm for Edit distance was
described earlier in this chapter. Fuzzy components include an additional internal
transformation provider called EditTransformationProvider, which generates specific
transformations for each string and creates a transformation from the token to all words in its
dictionary that are within a given edit distance. The normalized edit distance is the edit distance
divided by the length of the input string.
As you can see, Fuzzy transformations use quite an advanced algorithm, which combines many
public algorithms and some internal components. But can we expect better results than we obtained with the public algorithms provided in the MDS database? Let's run some tests, starting with the SSIS Fuzzy Lookup transformation.

Configuring SSIS Fuzzy Lookup


To test the Fuzzy lookup transformation, we need to create a new empty target table, as shown
in Listing 18.
Listing 18: A New Table for Fuzzy Lookup Matching
CREATE TABLE dbo.FuzzyLookupMatches
(
CustomerId int,
FullName nvarchar(200),
StreetAddress nvarchar(200),
CityRegion nvarchar(200),
MaritalStatus nvarchar(1),
Updated int,
MasterCustomerId int,
MCid int
);

Of course, even Fuzzy Lookup cannot perform magic. If we try to match every row from the left
table with every row from the right table, we still get a full cross join. Therefore, it is wise to do
some optimization here as well.
In the SSIS package, we start with exact matches again, using the Lookup transformation. The
target table is the source, and the master table is the lookup table for the Lookup
transformation. The rows that did not match (i.e., the Lookup no Match Output) are then
driven to the Fuzzy Lookup transformation. After exact and approximate matches are done, we
make a union of all rows and send them to the output. If you want to test Fuzzy Lookup by
yourself, follow these steps.
1. Open Business Intelligence Development Studio (BIDS).
2. In BIDS, create a new SSIS project. Name the solution MDSBook_Ch04, the
project MDSBook_Ch04_SSIS, and save the solution to any folder you want (we
suggest C:\MDSBook\Chapter04).
3. In Solution Explorer, right-click on the default package with the name
Package.dtsx. Rename it to FuzzyLookup.dtsx. Click OK in the message box
window to rename the package object.
4. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In
SSIS Designer, click on the Data Flow tab.
5. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-click on it, then select the Rename option, and rename it to CustomersTarget.
6. Double-click on the source to open the OLE DB Source Editor. In the Connection
Manager tab, create a new connection to your SQL Server system with the
MDSBook database. Select the dbo.CustomersTarget in the Name of the table or
the view drop-down list. Your data source should be similar to the one shown in
Figure 9.

Figure 9: CONFIGURING THE DATA SOURCE


7. Click on the Columns tab to get the column mapping. Click OK to close the OLE DB Source Editor.
8. Drag the Lookup transformation to the canvas. Rename it to Exact Matches.
Connect it with the source using the green arrow from the CustomersTarget data
source. Double-click on it to open the editor.
9. In the Lookup Transformation Editor window, in the General tab, use the Specify
how to handle rows with no matching entries drop-down list to select the option
Redirect rows to no match output.
10. Use the Connection tab to specify the lookup table. Use the same connection
manager as you created for the data source. Using the Use a table or view option,
select the dbo.CustomersMaster table.

11. In the Columns tab, drag and drop the FullName column from the left table to the
FullName column in the right table. Repeat this for StreetAddress and CityRegion
columns. These three columns will be used for matching rows. Click on the check
box near the CustomerId column in the right table. This is the column we are
adding to our target rows. In the Output Alias cell, rename the
CustomerId column to MCid. Your Columns tab of the Lookup Transformation
Editor should be similar to the one in Figure 10. Click OK to close the editor.

Figure 10: CONFIGURING THE LOOKUP TRANSFORMATION


12. Drag and drop the Fuzzy Lookup transformation to the canvas. Rename it to
Approximate Matches. Connect it with the Lookup transformation using Lookup
no Match Output. Double-click on it to open the Fuzzy Lookup Transformation
Editor.


13. In the Reference Table tab, use the same connection manager as for the Lookup
transformation reference table, and again select the dbo.CustomersMaster table
for the reference table name.
14. Check the Store new index check box. By storing the index, you can optimize
subsequent executions (see Books Online and the sidebar Storing an Index).
Your Reference Table should resemble the one in Figure 11.

Figure 11: CONFIGURING THE REFERENCE TABLE TAB


Storing an Index
When the package first runs the transformation, the transformation copies the reference
table, adds a key with an integer data type to the new table, and builds an index on the
key column. Next, the transformation builds an index, called a match index, on the copy
of the reference table. The match index stores the results of tokenizing the values in the
transformation input columns, and the transformation then uses the tokens in the
lookup operation. The match index is a table in a SQL Server 2000 or later database.
When the package runs again, the transformation can either use an existing match index
or create a new index. If the reference table is static, the package can avoid the
potentially expensive process of rebuilding the index for repeat sessions of data
cleaning.
15. Click on the Columns tab. Delete the connection between the left and right
CustomerId columns. Check the box near the right CustomerId column. Rename
the output column to MCid. If connections between the FullName,
StreetAddress, and CityRegion columns from left and right do not exist, create
them. These settings are the same as the column settings for the Lookup
transformation.
16. Click on the Advanced tab. Set the similarity threshold to 0.50 (see the sidebar
Similarity Threshold). Check that Maximum number of matches to output per
lookup is set to 1. We are selecting the best match according to the Fuzzy Lookup
only, like we did when we matched rows using the Jaro-Winkler algorithm. Your
Advanced tab should be similar to the one in Figure 12. Click OK.

Figure 12: THE ADVANCED TAB OF THE FUZZY LOOKUP TRANSFORMATION EDITOR
Similarity Threshold
The closer the value of the similarity threshold is to 1, the closer the resemblance of the
lookup value to the source value must be to qualify as a match. Increasing the threshold
can improve the speed of matching because fewer candidate records need to be
considered. You can also optimize the matching by running multiple Fuzzy Lookups: first you
match rows with a high similarity threshold, then you lower the threshold, then lower it
again, and so on. Because each pass has fewer rows left to match, you can control the
performance.
17. Drag and drop the Union All transformation to the working area. Connect it with
the Lookup Match Output from the Lookup transformation and with the default
output from the Fuzzy Lookup transformation.

18. Drag and drop OLE DB Destination to the working area. Rename it to
FuzzyLookupMatches. Connect it with the green arrow from the Union All
transformation. Double-click on it to open the editor.
19. In the Connection Manager tab, use the same connection manager to the
MDSBook database as in all sources and transformations in this exercise. Select
the dbo.FuzzyLookupMatches table in the Name of the table or the view drop-down list.
20. Click on the Mappings tab to check whether the column mappings are correct.
Because we used the same column names throughout the package and we used
the same names in the destination table, the automatic mapping should be
correct. Click OK to close the OLE DB Destination Editor. Your data flow should be
similar to the one that Figure 13 shows.

Figure 13: THE COMPLETE DATA FLOW FOR FUZZY LOOKUP
21. Save the project. Do not exit BIDS.

Testing SSIS Fuzzy Lookup


Now it is time to test how the Fuzzy Lookup transformation works. At this point, we need to
execute the package. After execution, we need to measure the results. Let's start by calculating
the number of matched rows or, more simply, the number of unmatched rows:
SELECT COUNT(*)
FROM dbo.FuzzyLookupMatches
WHERE MCid IS NULL;
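If you prefer to let SQL Server do the arithmetic for you, a query along these lines (a sketch against the same destination table) returns the percentage of matched rows directly:
SELECT 100.0 * SUM(CASE WHEN MCid IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*)
 AS PercentMatched
FROM dbo.FuzzyLookupMatches;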

From the count of unmatched rows, it is easy to calculate the number of matched rows. In my case, the number
of unmatched rows was 608. Because I had 7,790 rows to match, this means that Fuzzy Lookup
actually matched 92.20 percent of the rows. With the following query, we can measure the
number of errors:
SELECT *
FROM dbo.FuzzyLookupMatches
WHERE CustomerId <> MCid * (-1);

In my case, the result was fantastic: no errors at all. At first glance, it looked too good to be
true. Therefore, I ran many additional tests and got an error only here and there. Out of
more than 20 tests, I got two errors only once!
What about the execution time? After you execute the package, you can read execution time
using the Execution Results tab of the SSIS Designer. In my case, execution (or elapsed) time was
15 seconds. However, this time included the time needed to read the data, to perform exact
matches with the Lookup transformation, to run the Fuzzy Lookup transformation, union the
data, and write the data to the destination.

Let's compare the Fuzzy Lookup results with the best results we got using public algorithms.
With public algorithms, we matched 87.89 percent of the rows with 0.48 percent of errors
in 7 seconds (approximate match only). With Fuzzy Lookup, we matched 92.20 percent of the rows
with 0.00 percent of errors in 15 seconds (exact and approximate match).
As we could expect, Microsoft's Fuzzy algorithm really works. However, you should still test
other algorithms on your data. It is impossible to predict that Fuzzy Lookup will outperform all
other algorithms on any kind of data. That's why we developed different methods using
different algorithms for identity mapping.

Fuzzy Lookup Add-In for Excel


After you download and install the Fuzzy Lookup add-in for Excel, you get this power of identity
mapping for Excel tables as well. There is not much to add from the algorithm perspective:
this is the same Fuzzy Lookup algorithm we described earlier. Exhaustive
instructions on how to use the add-in are provided in a PDF document and in a demo Excel file.
Figure 14 shows a quick overview of the Excel demo file with Fuzzy Lookup prepared for
configuration.

Figure 14: EXCEL FUZZY LOOKUP

De-Duplicating
De-duplicating is a very similar problem to identity mapping; more accurately, it is actually the same problem.
SSIS has a separate Fuzzy Grouping transformation for this task. It groups
rows based on string similarities. Let's see how we can use either of the two Fuzzy transformations for both
problems.
Let's start with using Fuzzy Grouping for identity mapping. This is a very simple task. We can just
make a non-distinct union (in T-SQL, use UNION ALL rather than the UNION operator) of all rows
from both the master and target tables. Then we can perform the grouping on this union. The
result is the same as we would get with identity mapping.
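A minimal sketch of such a union, using the chapter's demo tables and only the columns involved in the matching, could look like this:
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersMaster
UNION ALL
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersTarget;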
To turn the problem around: How could we perform de-duplication with Fuzzy Lookup or
another string-similarity merge? We would use the same table twice, once as the master and
once as the target table. We would immediately exclude exact matches on all the character
columns that also have the same key (i.e., matches of a row with itself). In the next step, we would
perform the identity mapping of exact matches of the character columns for rows with different
keys. Then we would perform approximate matching on the rest of the rows. Finally, we could
delete all the rows that got a match (i.e., the same identification) except one. The Excel Fuzzy
Lookup add-in can actually be used for de-duplicating as well: for the left and the right table of
the match, you can define the same Excel table.
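For illustration, the exact-match step of this self-matching approach could look like the following sketch: a self-join that pairs rows with identical character columns but different keys. Keeping the higher key as the survivor is an assumption made just for the example.
SELECT a.CustomerId AS KeepId,
 b.CustomerId AS DuplicateId
FROM dbo.CustomersMaster AS a
INNER JOIN dbo.CustomersMaster AS b
 ON a.FullName = b.FullName
 AND a.StreetAddress = b.StreetAddress
 AND a.CityRegion = b.CityRegion
 AND a.CustomerId > b.CustomerId; -- also excludes matches of a row with itself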
Nevertheless, you might find it easier to perform the de-duplicating by using the SSIS Fuzzy
Grouping transformation. Therefore, let's test it. Before testing, let's prepare some demo data.
We will add all the rows from the CustomersTarget table to the CustomersMaster table and
then try to de-duplicate them.
We can add the rows with an INSERT query, as shown in Listing 19.
Listing 19: Adding Duplicate Rows to the CustomersMaster Table

INSERT INTO dbo.CustomersMaster
(CustomerId
,FullName
,StreetAddress
,CityRegion
,NumberCarsOwned)
SELECT t.CustomerId
,t.FullName
,t.StreetAddress
,t.CityRegion
,NULL
FROM dbo.CustomersTarget AS t

Preparing for Fuzzy Grouping


Before we start using the Fuzzy Grouping transformation, we have to think about the algorithm.
We need to optimize it similarly to how we optimized the identity mapping process. We have to
perform exact matches first. We will use the Aggregate transformation for this task. We will
group by the FullName, StreetAddress, and CityRegion columns. However, we need to decide which
CustomerId to keep. For this example, let's say we keep the higher Id. This is quite a plausible
scenario, because a higher Id could mean that the row came into our system later than the one
with the lower Id and might therefore be cleaner. Because we have
negative CustomerId values in the CustomersTarget table, this means we will retain the Id from
the CustomersMaster table.
After we have the Id, we need to find the correct value for the NumberCarsOwned column. We
will use the Lookup transformation and perform a lookup on the same table we are
de-duplicating (i.e., the CustomersMaster table). Because we will have only the higher Ids in the
aggregated table, we will get the correct match and read the value of the NumberCarsOwned
attribute from the correct row.
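In T-SQL terms, the Aggregate and Lookup steps together are roughly equivalent to the following sketch; the SSIS package implements the same logic with transformations.
SELECT g.FullName, g.StreetAddress, g.CityRegion,
 g.CustomerId, m.NumberCarsOwned
FROM
 (SELECT FullName, StreetAddress, CityRegion,
   MAX(CustomerId) AS CustomerId
  FROM dbo.CustomersMaster
  GROUP BY FullName, StreetAddress, CityRegion) AS g
INNER JOIN dbo.CustomersMaster AS m
 ON g.CustomerId = m.CustomerId;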
First, we have to prepare the destination table. Besides the original columns, we are adding
the columns that the Fuzzy Grouping transformation provides. We are using the default Fuzzy
Grouping names, as shown in Listing 20.
Listing 20: Table for Fuzzy Grouping Matches
CREATE TABLE dbo.FuzzyGroupingMatches
(
_key_in int NULL,
_key_out int NULL,
_score real NULL,
CustomerId int NULL,
FullName nvarchar(200) NULL,
StreetAddress nvarchar(200) NULL,
CityRegion nvarchar(200) NULL,
NumberCarsOwned tinyint NULL,
FullName_clean nvarchar(200) NULL,
StreetAddress_clean nvarchar(200) NULL,
CityRegion_clean nvarchar(200) NULL,
_Similarity_FullName real NULL,
_Similarity_StreetAddress real NULL,
_Similarity_CityRegion real NULL
);

There will be some additional work after Fuzzy Grouping finishes. The transformation adds the
following columns to the output:
_key_in, a column that uniquely identifies each row for the transformation.
_key_out, a column that identifies a group of duplicate rows. The _key_out column has the value
of the _key_in column of the canonical data row. The canonical row is the row that Fuzzy Grouping
identified as the most plausible correct row and used for comparison (i.e., the row used for
standardizing data). Rows with the same value in _key_out are part of the same group. We could
decide to keep only the canonical row from each group.
_score, a value between 0 and 1 that indicates the similarity of the input row to the canonical
row. For the canonical row, the _score has a value of 1.
In addition, Fuzzy Grouping adds columns used for approximate string comparison with clean
values. Clean values are the values from the canonical row. In our example, these columns are
the FullName_clean, StreetAddress_clean, and CityRegion_clean columns. Finally, the
transformation adds columns with similarity scores for each character column used for
approximate string comparison. In our example, these are the _Similarity_FullName,
_Similarity_StreetAddress, and _Similarity_CityRegion columns.
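Once the package we build in the next section has populated the destination table, keeping only the canonical rows comes down to a simple filter, as in this sketch:
SELECT *
FROM dbo.FuzzyGroupingMatches
WHERE _key_in = _key_out; -- canonical rows only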

SSIS Fuzzy Grouping Transformation


We will create a new SSIS package for the de-duplicating problem. Use the following steps:
1. If you closed it, open BIDS and re-open the MDSBook_Ch04_SSIS project.
2. In Solution Explorer, right-click on the Packages folder and add a new package.
Rename it to FuzzyGrouping.dtsx. Click OK in the message box window to rename
the package object as well.
3. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In
SSIS Designer, click on the Data Flow tab.
4. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-click on it, then select the Rename option, and rename it to CustomersMaster.
5. Double-click on the source to open the OLE DB Source Editor. In the Connection
Manager tab, create a new connection to your SQL Server system with the
MDSBook database. Select dbo.CustomersMaster in the Name of the table or the
view drop-down list.
6. Click on the Columns tab to get the columns mapping. Click OK.
7. Drag the Aggregate transformation to the working area. Rename it to Exact
Matches. Connect it with the source using the green arrow from the
CustomersMaster data source. Double-click on it to open the editor.
8. Configure the FullName, StreetAddress, and CityRegion columns for the Group by
operation. Use the Maximum operation for the CustomerId column. Click OK. Figure
15 shows the Aggregate transformation configuration.

Figure 15: CONFIGURING THE AGGREGATE TRANSFORMATION


9. Drag the Lookup transformation to the canvas. Rename it to Adding
NumberCarsOwned. Connect it with the Aggregate transformation, using the
green arrow from the Exact Matches (Aggregate) transformation. Double-click on it to open
the editor.
10. In the Lookup Transformation Editor window, in the General tab, use the default
settings.
11. Use the Connection tab to specify the lookup table. Use the same connection
manager as you created for the data source. Using the Use a table or view option,
select the dbo.CustomersMaster table.
12. Click on the Columns tab. Make sure that the only connection is between the
CustomerId columns from the left and right. Use NumberCarsOwned as the
lookup column, and make sure you use the same name for the output alias.
Figure 16 shows the correct configuration.

Figure 16: COLUMN USAGE FOR THE LOOKUP TRANSFORMATION


13. Drag the Fuzzy Grouping transformation to the working area. Rename it to
Approximate Matches. Connect it to the Lookup transformation using the Lookup
Match Output. Double-click on it to open the Fuzzy Grouping Transformation
Editor.
14. In the Connection Manager tab, use the same connection manager as in previous
steps.
15. In the Columns tab, use FullName, StreetAddress, and CityRegion as the grouping
columns, and CustomerId and NumberCarsOwned as pass-through columns, as
Figure 17 shows.

Figure 17: COLUMN USAGE FOR THE FUZZY GROUPING TRANSFORMATION


16. Click on the Advanced tab and check the advanced settings, especially the Input,
Output, and Similarity Score column names. Make sure you are using the same
names as in the dbo.FuzzyGroupingMatches table. Set the similarity threshold to
0.50 (as we did for the Fuzzy Lookup transformation when we did identity
mapping). Click OK to close the Fuzzy Grouping Editor.
17. Drag and drop OLE DB Destination to the working area. Rename it to
FuzzyGroupingMatches. Connect it with the green arrow from the Fuzzy
Grouping transformation. Double-click on it to open the editor.
18. In the Connection Manager tab, use the same connection manager to the
MDSBook database as in all sources and transformations in this exercise. Select
the dbo.FuzzyGroupingMatches table in the Name of the table or the view drop-down list.
19. Click on the Mappings tab to check whether the column mappings are correct.
Because we used the same column names throughout the package and the same
names in the destination table, the automatic mapping should be correct. Click
OK to close the OLE DB Destination Editor. Your data flow should be similar to
the one in Figure 18.

Figure 18: THE COMPLETE DATA FLOW FOR FUZZY GROUPING


20. Save the project. Do not exit BIDS.

Testing SSIS Fuzzy Grouping


Now it's time to execute the package! After the execution, we have to measure the results. Let's
start with execution time. In my test, it was about 57 seconds. Therefore, it seems that Fuzzy
Grouping was less efficient than Fuzzy Lookup. In addition, we are not finished yet. We have to
check the content of the destination table. A very simple query gives us an overview of the
results, which Figure 19 shows.
SELECT *
FROM dbo.FuzzyGroupingMatches;

Figure 19: RESULTS OF THE FUZZYGROUPINGMATCHES QUERY


At first glance, the results are quite satisfying. Rows with _key_out (the second column)
equal to 1 (the first two rows) were correctly identified as duplicates, and the row with the
positive CustomerId (the first row in the output, with CustomerId 14348) was identified as the
canonical row (_key_in equal to 1). However, let's run some additional tests.
In our next query, let's count the number of duplicate rows in each group (i.e., the number of
rows with the same _key_out value):
SELECT _key_out
,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
ORDER BY NumberOfDuplicates DESC;

From our data, we would expect that the maximum number of rows in a group would be 2.
However, the results in Figure 20 show a different picture.

Figure 20: RESULTS OF QUERYING THE NUMBER OF _KEY_OUT DUPLICATES


In this test, we got as many as 19 rows identified as duplicates in a single group! Of course, we could set a higher
similarity threshold and get more accurate matches for duplicates. However, we set it to 0.50 to
have a direct comparison of the efficiency of Fuzzy Grouping with the Fuzzy Lookup
transformation. Let's visually check the rows in the group with the highest number of identified
duplicates, with the query shown in Listing 21.
Listing 21: Checking Rows with a High Number of Duplicates
WITH NumberOfDuplicatesCTE AS
(
SELECT _key_out
,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
)
SELECT *
FROM dbo.FuzzyGroupingMatches
WHERE _key_out =
(SELECT TOP 1 _key_out
FROM NumberOfDuplicatesCTE
ORDER BY NumberOfDuplicates DESC);

Figure 21 shows the results.

Figure 21: RESULTS OF CHECKING THE ROWS WITH HIGH NUMBER OF DUPLICATES
From the results, we can see that the canonical row was not identified properly for all the rows.
For example, the canonical row for the sixth row (i.e., the row with CustomerId -25321) should
be the second row (i.e., the row with CustomerId 25321). Many correct rows with positive
CustomerId values were identified incorrectly as duplicates of the first row (the row with
CustomerId equal to 25320), which was identified as the canonical row for this group.
Apparently, we would have to perform more manual work to finish the de-duplicating by
using Fuzzy Grouping than by using the Fuzzy Lookup transformation. Of course, we should
experiment more with Fuzzy Grouping, using different similarity threshold settings. We could
perform a consecutive procedure, de-duplicating with a high similarity threshold first, then
lowering it a bit, then lowering it more, and so on. Nevertheless, it seems that the Fuzzy Lookup
transformation could be more suitable for de-duplicating than Fuzzy Grouping. Not only did it
give us better results, it also easily outperformed Fuzzy Grouping in execution time.
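To quantify this, a quick sketch counts the groups whose canonical row was taken from the target data, recognizable in our demo by a negative CustomerId; such a group signals a wrongly chosen canonical row:
SELECT COUNT(*) AS SuspiciousCanonicalRows
FROM dbo.FuzzyGroupingMatches
WHERE _key_in = _key_out -- canonical rows
 AND CustomerId < 0; -- canonical row came from the (dirty) target data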

Clean-Up
To clean up the MDSBook database, use the code from Listing 22.
Listing 22: Clean-Up Code
USE MDSBook;

IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
DROP TABLE dbo.CustomersMaster;
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
DROP TABLE dbo.CustomersTarget;
IF OBJECT_ID(N'dbo.CustomersMasterNGrams', N'U') IS NOT NULL
DROP TABLE dbo.CustomersMasterNGrams;
IF OBJECT_ID(N'dbo.CustomersMasterNGramsFrequency', N'U') IS NOT NULL
DROP TABLE dbo.CustomersMasterNGramsFrequency;
IF OBJECT_ID(N'dbo.FuzzyLookupMatches', N'U') IS NOT NULL
DROP TABLE dbo.FuzzyLookupMatches;
IF OBJECT_ID(N'dbo.FuzzyGroupingMatches', N'U') IS NOT NULL
DROP TABLE dbo.FuzzyGroupingMatches;
GO

Summary
As you have seen in this chapter, identity mapping and de-duplicating are not simple. We
developed a custom algorithm for identity mapping using the functions from the MDS database.
We tested it on data from the AdventureWorksDW2008R2 demo database. We made errors in
the data in a controlled way, so we could measure the results of the tests throughout the
chapter. Through the tests, we realized that the quality of the results and the performance of
our algorithm are highly dependent on the proper selection of parameters.
After testing the manual procedure, we used the SSIS Fuzzy Lookup transformation for identity
mapping. We also introduced the Fuzzy Lookup add-in for Excel, which brings the power of
this transformation to advanced users on their desktops. We showed that de-duplicating is
actually the same problem as identity mapping. Nevertheless, we also tested the SSIS Fuzzy
Grouping transformation.
According to the tests in this chapter, the Fuzzy Lookup transformation is a clear winner. It gave
us better results than any other option, including Fuzzy Grouping, with quite
astonishing performance. Nevertheless, this does not mean you should always use Fuzzy Lookup
for identity mapping and de-duplicating. You should test other possibilities on your data as well.
The tests presented in this chapter are quite exhaustive, so they should mitigate much of the
heavy work involved in identity mapping and de-duplicating.
This is the last chapter in this version of the book. However, stay tuned; there are many exciting
new features coming with the next release of SQL Server, code-named Denali. Microsoft has already
announced rewritten Master Data Services, improved Integration Services, and a completely new
application called Data Quality Services. We will update this book to incorporate these new
technologies when they become available.

References

53 Microsoft SQL Server MVPs: SQL Server MVP Deep Dives (Manning, 2010)
Carlo Batini, Monica Scannapieco: Data Quality: Concepts, Methodologies and Techniques (Springer-Verlag, 2006)
Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server
Levenshtein distance on Wikipedia
Jaccard index on Wikipedia
Jaro-Winkler distance on Wikipedia
Ratcliff/Obershelp pattern recognition on the National Institute of Standards and Technology site
Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005, an MSDN article describing more details about the Fuzzy transformations than Books Online
Fuzzy Lookup Add-In for Excel
Microsoft Research Data Cleaning Project
