
ENSURING DATA QUALITY

Dr. Ing. Alsayed Algeragwy

Roadmap

Introduction
Data Quality (DQ) problem
DQ dimensions
DQ models
Prominent DQ approaches
Open problems
Data quality in BExIS
Summary

Introduction

Data are of high quality if they are fit for their intended uses in operations, decision making, and planning
Data are of high quality if they correctly represent the real-world construct to which they refer

DQ problem: Data conflicts


Data conflicts are deviations between data values. Data with conflicts are called dirty data and can mislead any analysis performed on them. To improve data quality and avoid wrong analyses, data cleaning is needed (a minimal conflict-detection sketch follows below).
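To make the notion of a data conflict concrete, the following Python sketch flags attribute values that disagree between records describing the same real-world entity; such records are candidates for data cleaning. The records, entity ids, and field names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical records: the two entries with id 1 are meant to describe the same
# real-world entity but disagree on the "name" attribute.
records = [
    {"id": 1, "name": "J. Smith",   "city": "Jena"},
    {"id": 1, "name": "John Smith", "city": "Jena"},
    {"id": 2, "name": "A. Brown",   "city": "Leipzig"},
]

# Group records by the entity they are supposed to describe.
by_entity = defaultdict(list)
for r in records:
    by_entity[r["id"]].append(r)

# Report every attribute on which the records of one entity disagree.
for entity_id, group in by_entity.items():
    for attr in ("name", "city"):
        values = {r[attr] for r in group}
        if len(values) > 1:
            print(f"Conflict on entity {entity_id}, attribute '{attr}': {sorted(values)}")
```

Running the sketch reports the conflicting "name" values for entity 1; a cleaning step would then have to decide which value to keep.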

Classification of data conflicts

Encyclopedia of Database Systems, 2009: Data Conflicts. Hong-Hai Do

Dirty data are costly



Poor data cost US businesses $611 billion annually


Erroneously priced data in retail databases cost US customers $2.5 billion each year (2000)
1/3 of system development projects were forced to delay or cancel due to poor data quality (2001)
30%-80% of the development time and budget for data warehousing are for data cleaning (1998)

CIA dirty data about WMD in Iraq!

Data quality: The No. 1 problem for data management

Data quality: Theory and practice. Wenfei Fan, Talk at VLDB 2011

Is data quality important?

DQ dimensions

DQ is usually understood as a multidimensional concept. The dimensions represent views, criteria, or measurement attributes for DQ problems that can be assessed, interpreted, and possibly improved individually.

Classification of DQ dimensions

Redmann: based on DQ conflicts, considering the different levels at which they occur
Naumann: content-related, technical-related, intellectual, and instantiation-related dimensions
Liu: based on hierarchical views of DQ following the steps of the data life cycle (collection, organization, presentation, application)
The most important dimensions in many application scenarios are completeness, accuracy, consistency, and timeliness

DQ dimensions
1. Completeness: Missing or incomplete data is one of the most important DQ problems. Completeness has several meanings; the most frequently used definition is the absence of null values.

2. Accuracy: The extent to which data are correct, reliable, and certified free of errors; usually divided into syntactic accuracy and semantic accuracy.

3. Consistency: The degree to which data managed in a system satisfy specified constraints or business rules.

4. Timeliness (currency): The degree to which the provided data are up-to-date.

A minimal sketch of simple metrics for these dimensions follows below.
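As a rough illustration of how these four dimensions can be turned into numbers, the following Python sketch computes one simple score per dimension for a tiny tabular dataset. The table, the reference domain, the business rule, and the one-year freshness threshold are all assumptions chosen for the example, not definitions taken from the slides or from any particular DQ model.

```python
from datetime import date

# Hypothetical table: one row has a missing age, one an invalid country code,
# one a rule-violating negative age, and one a stale timestamp.
rows = [
    {"id": 1, "age": 34,   "country": "DE", "updated": date(2024, 5, 1)},
    {"id": 2, "age": None, "country": "XX", "updated": date(2020, 1, 1)},
    {"id": 3, "age": -5,   "country": "IT", "updated": date(2024, 6, 1)},
]

valid_countries = {"DE", "IT", "FR"}   # assumed reference domain for syntactic accuracy

# Completeness: share of non-null values over all cells.
cells = [v for r in rows for v in r.values()]
completeness = sum(v is not None for v in cells) / len(cells)

# Accuracy (syntactic): share of country codes drawn from the reference domain.
accuracy = sum(r["country"] in valid_countries for r in rows) / len(rows)

# Consistency: share of rows satisfying the business rule "age is non-negative".
consistency = sum(r["age"] is not None and r["age"] >= 0 for r in rows) / len(rows)

# Timeliness: share of rows updated within the last year (relative to a fixed date).
today = date(2024, 7, 1)
timeliness = sum((today - r["updated"]).days <= 365 for r in rows) / len(rows)

print(f"completeness={completeness:.2f}, accuracy={accuracy:.2f}, "
      f"consistency={consistency:.2f}, timeliness={timeliness:.2f}")
```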

DQ models

Extending traditional database models to represent DQ dimensions and to associate such dimensions with the data (a minimal sketch follows below)
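One possible way to do this, sketched below in Python purely as an illustration, is to attach a per-attribute quality annotation to every value of a tuple, so that dimensions such as accuracy or currency travel with the data rather than being recorded only at the relation level. The class name, dimension names, and scores are assumptions made for this example, not part of any model described in the slides.

```python
from dataclasses import dataclass, field

@dataclass
class QualifiedValue:
    """An attribute value together with its quality annotations."""
    value: object
    quality: dict = field(default_factory=dict)   # e.g. {"accuracy": 0.9, "currency": "2024-05-01"}

# A tuple in which quality is tracked per attribute rather than per relation.
measurement = {
    "site": QualifiedValue("Plot-12", {"accuracy": 1.0}),
    "temp": QualifiedValue(18.4, {"accuracy": 0.8, "currency": "2024-05-01"}),
}

for attr, qv in measurement.items():
    print(attr, qv.value, qv.quality)
```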

Approaches & Prototypes


ULDB: databases with uncertainty and lineage
Trio: a system for integrated management of data, accuracy, and lineage; ULDB forms the basis for the Trio system
CerFix: a data cleaning system that finds certain fixes for tuples at the point of data entry (VLDB 2011)
GDR: a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement (VLDB 2011)
Improving data quality: using dynamic forms (ICDE 2010); using conditional functional dependencies (VLDB 2007), see the CFD sketch below
Commercial tools: Data Quality (Informatica), DataFlux (SAS), Quality Stage (IBM)
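As a concrete example of the conditional-functional-dependency idea mentioned above, the Python sketch below checks a single CFD of the form (country = 'UK', zip) → city, i.e. among the UK tuples the zip code must determine the city. The relation, attribute names, and the CFD itself are hypothetical and only illustrate how such a constraint can be evaluated; real CFD-based cleaning systems also repair the violations they find.

```python
from collections import defaultdict

# Hypothetical relation; the second tuple violates the CFD, the third one falls
# outside the pattern (country != 'UK') and is therefore not constrained.
tuples = [
    {"country": "UK", "zip": "EH4 1DT", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4 1DT", "city": "London"},
    {"country": "NL", "zip": "EH4 1DT", "city": "Eindhoven"},
]

# Collect, for UK tuples only (the pattern condition), the cities seen per zip.
city_by_zip = defaultdict(set)
for t in tuples:
    if t["country"] == "UK":
        city_by_zip[t["zip"]].add(t["city"])

# A zip mapped to more than one city violates the dependency.
for zip_code, cities in city_by_zip.items():
    if len(cities) > 1:
        print(f"CFD violation for zip {zip_code}: {sorted(cities)}")
```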

Open challenges in DQ

Investigating the relationship between data quality and process quality
Which DQ dimensions should be considered in specific application domains?
Associating quality with data in open-environment systems (semi-structured data)

Summary

Data quality: the No. 1 problem for data management
Real-life data are dirty, and dirty data are costly

The quest for a principled approach:
Effective algorithms for certain fixes (minimum user interaction)
Efficient algorithms for determining information completeness
Efficient algorithms for deciding data currency
Data accuracy
Putting it all together: the interaction between the central issues of data quality

Many challenges remain


Further slides: Managing a team


I have the following:

1. Creative innovation: I have developed and implemented new approaches for managing XML-based data.

2. Communication skills: I have cooperated with several scientists, such as Prof. Z. Bella (France), Marco Mesiti (Italy), and Richi Nayak (Australia).

3. Control: I have supervised undergraduate student projects; each group consists of at least five students.
