Roadmap
Introduction
Data Quality (DQ) problem
DQ dimensions
DQ models
Prominent DQ approaches
Open problems
Data quality in BExIS
Summary
Introduction
Data are of high quality if they are fit for their intended uses in operations, decision making, and planning
Data are of high quality if they correctly represent the real-world construct to which they refer
Deviations between data
Data with conflicts are called dirty data and can mislead any analysis performed on them
To improve data quality and to avoid wrong analyses, data cleaning is needed
1/3 of system development projects were forced to delay or cancel due to poor data quality (2001)
30%-80% of the development time and budget for data warehousing are spent on data cleaning (1998)
Data quality: Theory and practice. Wenfei Fan, Talk at VLDB 2011
DQ dimensions
DQ is usually understood as a multidimensional concept
The dimensions represent views, criteria, or measurement attributes for DQ problems that can be assessed, interpreted, and possibly improved individually
Classification of DQ dimensions
Redmann: based on DQ conflicts, considering the different levels at which they occur
Naumann: content-related, technical-related, intellectual, and instantiation-related
Liu: based on hierarchical views of DQ that follow the steps of the data life cycle (collection, organization, presentation, application)
The most important dimensions in many application scenarios are completeness, accuracy, consistency, and timeliness
DQ dimensions
1. Completeness: Missing or incomplete data is one of the most important DQ problems. There are different meanings of completeness. The most commonly used definition is the absence of null values.
2. Accuracy: The extent to which data are correct, reliable, and certified free of errors. Two kinds are distinguished: syntactic accuracy and semantic accuracy.
3. Consistency: The degree to which data managed in a system satisfy specified constraints or business rules.
4. Timeliness (currency): The degree to which provided data is up-to-date.
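The four dimensions can each be turned into a simple ratio over a data set. The following is a minimal sketch, not a standard metric definition: the sample records, the field names (`age`, `country`, `updated`), the allowed country codes, the age rule, and the 365-day freshness threshold are all assumptions made for illustration. Only syntactic accuracy is checked here (membership in an allowed domain); semantic accuracy would need a trusted reference to compare against.

```python
from datetime import date

# Hypothetical sample records; None marks a missing (null) value.
records = [
    {"id": 1, "age": 34, "country": "DE", "updated": date(2024, 1, 10)},
    {"id": 2, "age": None, "country": "FR", "updated": date(2023, 6, 1)},
    {"id": 3, "age": -5, "country": "de", "updated": date(2024, 2, 20)},
    {"id": 4, "age": 51, "country": None, "updated": date(2021, 3, 15)},
]


def completeness(rows, fields):
    """Share of field values that are not null."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r[f] is not None)
    return filled / total


def syntactic_accuracy(rows, field, domain):
    """Share of non-null values of `field` that belong to the allowed domain."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(1 for v in values if v in domain) / len(values)


def consistency(rows, rule):
    """Share of rows that satisfy a constraint (business rule)."""
    return sum(1 for r in rows if rule(r)) / len(rows)


def timeliness(rows, today, max_age_days):
    """Share of rows last updated within `max_age_days` of `today`."""
    fresh = sum(1 for r in rows if (today - r["updated"]).days <= max_age_days)
    return fresh / len(rows)


# Assumed business rule: an age, when present, lies between 0 and 120.
def age_rule(row):
    return row["age"] is None or 0 <= row["age"] <= 120


scores = {
    "completeness": completeness(records, ["age", "country"]),
    "accuracy": syntactic_accuracy(records, "country", {"DE", "FR"}),
    "consistency": consistency(records, age_rule),
    "timeliness": timeliness(records, date(2024, 3, 1), 365),
}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping each dimension as a separate function mirrors the idea that the dimensions can be assessed and improved individually.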
DQ models
Extending traditional database models so that they can represent DQ dimensions and associate these dimensions with data
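One possible reading of such an extension is to attach quality scores to individual attribute values, so that every stored value carries its own DQ annotations. The sketch below assumes this attribute-level tagging approach; the class names `QualityValue` and `QualityRecord`, the score range [0, 1], and the sample scores are hypothetical and are not taken from a specific DQ model.

```python
from dataclasses import dataclass, field


@dataclass
class QualityValue:
    """A data value together with its quality annotations."""
    value: object
    # Maps a DQ dimension name to an assumed score in [0, 1].
    quality: dict = field(default_factory=dict)


@dataclass
class QualityRecord:
    """A record whose attributes are quality-annotated values."""
    attributes: dict  # attribute name -> QualityValue

    def min_quality(self, dimension):
        """Lowest score for `dimension` over all annotated attributes, or None."""
        scores = [
            qv.quality[dimension]
            for qv in self.attributes.values()
            if dimension in qv.quality
        ]
        return min(scores) if scores else None


# Hypothetical record with per-attribute accuracy and timeliness scores.
record = QualityRecord(
    attributes={
        "name": QualityValue("Alice", {"accuracy": 0.9, "timeliness": 1.0}),
        "city": QualityValue("Jena", {"accuracy": 0.6}),
    }
)

print(record.min_quality("accuracy"))
print(record.min_quality("timeliness"))
print(record.min_quality("completeness"))
```

Taking the minimum is only one way to aggregate attribute scores into a record-level score; an average or a weighted sum would fit the same structure.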
Open challenges in DQ
Investigating the relationship between data quality and process quality
Which DQ dimensions should be considered in specific application domains?
Associating quality with data in open-environment systems (semi-structured data)
Summary
Data quality: The No. 1 problem for data management
Real-life data are dirty, and dirty data are costly
The quest for a principled approach
Effective algorithms for certain fixes (minimum user interaction)
Efficient algorithms for determining information completeness
Efficient algorithms for deciding data currency
Data accuracy
Putting it all together: interaction between the central issues of data quality
I have the following:
1. Creative innovation: I have developed and implemented new approaches for managing XML-based data.
2. Communication skills: I have cooperated with several scientists, such as Prof. Z. Bella (France), Marco Mesiti (Italy), and Richi Nayak (Australia).
3. Control: I have supervised undergraduate student projects. Each group consists of at least five students.