UNIVERSITI TEKNIKAL MALAYSIA MELAKA

FACULTY OF INFORMATION AND COMMUNICATION TECHNOLOGY (FTMK)

BIG DATA QUALITY

NOR HIDAYAH BINTI ABDUL MANAF, AIDA HAZIRAH ABDUL HAMID, NUR
SUHADA NUZAIFA ISMAIL, SITI AYUNAZURA JAMALI@JAMANI, NURUL IDA
FARHANA ABDULL HADI

Contents

Topic 2: Big Data Quality

Introduction

1.1 Introduction to Big Data Quality

1.2 Importance of Big Data

1.3 Tools for big data cleaning

1.4 Challenges of big data cleaning

Conclusion

References
Topic 2: Big Data Quality

Introduction

1.1 Introduction to Big Data Quality

Data quality refers to the overall utility of a dataset (or datasets) as a function of its ability to
be easily processed and analyzed for other uses, usually by a database, data warehouse, or data
analytics system. Quality data is useful data. To be of high quality, data must be consistent and
unambiguous. Data quality issues are often the result of database merges or systems/cloud
integration processes in which data fields that should be compatible are not, owing to schema or
format inconsistencies. Data that is not of high quality can undergo data cleansing to raise its
quality.

There are four dimensions of big data quality: accuracy, timeliness, consistency, and
completeness. Accuracy refers to the degree to which data are equivalent to their corresponding
“real” values (Ballou and Pazer, 1985). This dimension can be assessed by comparing values
with external values that are known to be (or considered to be) correct (Redman, 1996). A
simple example would be a data record in a customer relationship management system, where
the street address for a customer in the system matches the street address where the customer
currently resides. In this case, the accuracy of the street address value in the system could be
assessed by validating it against the shipping address on the most recent customer order. No
problem context or value-judgment of the data is needed: a value is either accurate or not, and
its accuracy depends only on the data value itself.
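
As an illustration only (not from the source; the field names below are hypothetical), a simple
accuracy measure can be computed by comparing system values against a reference source that
is taken to be correct:

    # Sketch: share of CRM records whose street address matches a trusted
    # reference, e.g. the shipping address on the most recent order.
    # Field names ("customer_id", "street_address") are illustrative only.
    def accuracy_ratio(crm_records, reference_records):
        reference = {r["customer_id"]: r["street_address"] for r in reference_records}
        checked = [r for r in crm_records if r["customer_id"] in reference]
        if not checked:
            return None  # nothing to compare against
        matches = sum(
            1 for r in checked
            if r["street_address"].strip().lower()
            == reference[r["customer_id"]].strip().lower()
        )
        return matches / len(checked)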

Timeliness refers to the degree to which data are up-to-date. Research suggests that
timeliness can be further decomposed into two dimensions: (1) currency, or the length of time
since the record’s last update, and (2) volatility, which describes the frequency of updates (Blake
and Mangiameli, 2011; Pipino et al., 2002; Wand and Wang, 1996). Data that are correct when
assessed, but updated very infrequently, may still hamper efforts at effective managerial
decision making (e.g., errors that occur in the data may be missed more often than not with
infrequent record updating, preventing operational issues in the business from being detected
early). A convenient example measure for calculating timeliness using values for currency and
volatility can be found in Ballou et al. (1998), p. 468, where currency is calculated using the
time of data delivery, the time it was entered into the system, and the age of the data at delivery
(which can differ from input time). Together, currency and volatility measures are used to
calculate timeliness.
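
One way to make this concrete (a sketch assuming the commonly cited formulation from Ballou
et al. (1998), not a verbatim reproduction of it) is to compute currency from delivery time, input
time, and age, and then derive timeliness from currency and volatility:

    # Sketch of a timeliness measure in the spirit of Ballou et al. (1998).
    # Assumed formulation: currency = (delivery_time - input_time) + age_at_input;
    # timeliness = max(1 - currency / volatility, 0) ** s, where volatility is the
    # length of time the data remain valid and s is a sensitivity exponent.
    def timeliness(delivery_time, input_time, age_at_input, volatility, s=1.0):
        currency = (delivery_time - input_time) + age_at_input
        return max(1.0 - currency / volatility, 0.0) ** s

    # Example: delivered 2 days after entry, 1 day old at entry, valid for 30 days.
    score = timeliness(delivery_time=10.0, input_time=8.0, age_at_input=1.0, volatility=30.0)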

Consistency refers to the degree to which related data records match in terms of format
and structure. Ballou and Pazer (1985) define consistency as when the “representation of the
data value is the same in all cases” (p. 153). Batini et al. (2009) develop the notion of both
intra-relation and inter-relation constraints on the consistency of data. Intra-relation
consistency assesses the adherence of the data to a range of possible values (Coronel et al.,
2011), whereas inter-relation assesses how well data are presented using the same structure.
An example of this would be that a person, currently alive, would have for “year of birth” a
possible value range of 1900–2013 (intra-relation constraint), while that person’s record in two
different datasets would, in both cases, have a field for birth year, and both fields would
intentionally represent the person’s year of birth in the same format (inter-relation constraint).
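
As a rough illustration (an assumption-laden sketch, not taken from the source), an intra-relation
constraint can be checked as a value range and an inter-relation constraint as agreement on
format across datasets:

    import re

    # Sketch: intra-relation constraint = the value falls in an allowed range;
    # inter-relation constraint = the birth year is stored in the same 4-digit
    # format in two different datasets. Field semantics are illustrative only.
    def intra_relation_ok(year_of_birth, low=1900, high=2013):
        return low <= year_of_birth <= high

    def inter_relation_ok(value_a, value_b, pattern=r"\d{4}"):
        return bool(re.fullmatch(pattern, str(value_a))) and bool(re.fullmatch(pattern, str(value_b)))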

Completeness refers to the degree to which data are full and complete in content, with
no missing data. This dimension can describe a data record that captures the minimally required
amount of information needed (Wand and Wang, 1996), or data that have had all values
captured (Gomes et al., 2007). Several types of completeness have been reported by Emran
(2015), and methods to measure completeness have also been proposed by the author (Emran et
al., 2013a; Emran et al., 2013b; Emran et al., 2014). Every field in the data record is needed to
paint the complete picture of what the record is attempting to represent in the “real world.” For
example, if a customer record includes a name and street address, but no city, state, or zip code,
then that record is considered incomplete. The minimum amount of data needed
for a correct address record is not present. A simple ratio of complete versus incomplete records
can then form a potential measure of completeness.
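
A minimal sketch of such a ratio (the required field names below are hypothetical) might look
like this:

    # Sketch: completeness as the share of records containing every required field.
    REQUIRED_FIELDS = ("name", "street_address", "city", "state", "zip_code")  # illustrative

    def completeness_ratio(records, required=REQUIRED_FIELDS):
        if not records:
            return None
        complete = sum(
            1 for r in records
            if all(r.get(field) not in (None, "") for field in required)
        )
        return complete / len(records)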

A summary of the dimensions of data quality is presented in Table 1. Once data quality
measures are understood, these quality measures can be monitored for improvement or
adherence to standards. For example, data can be tagged as either accurate or not. Once tagged,
there should be a method in place to monitor the long-term accuracy of the data. Combined
with measuring and monitoring the other three data quality dimensions, this helps to ensure
that the records in the dataset are as accurate, timely, complete, and consistent as is practical.

Data quality dimension | Description | Supply chain example
Accuracy | Are the data free of errors? | Customer shipping address in a customer relationship management system matches the address on the most recent customer order
Timeliness | Are the data up-to-date? | Inventory management system reflects real-time inventory levels at each retail location
Consistency | Are the data presented in the same format? | All requested delivery dates are entered in a DD/MM/YY format
Completeness | Are necessary data missing? | Customer shipping address includes all data points necessary to complete a shipment (i.e. name, street address, city, state, and zip code)

Table 1: Data quality dimensions

1.2 Importance of Big Data

When data is of excellent quality, it can be easily processed and analyzed, leading to
insights that help the organization make better decisions. High-quality data is essential to
business intelligence efforts and other types of data analytics, as well as to better operational
efficiency.

1.3 Tools for big data cleaning

● Drake

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command
execution around data and its dependencies. Data processing steps are defined along with their
inputs and outputs, and Drake automatically resolves their dependencies and calculates:

- which commands to execute (based on file timestamps)
- in what order to execute the commands (based on dependencies)

Drake is like GNU Make, but designed especially for data workflow management. It has HDFS
support, allows multiple inputs and outputs, and includes a host of features designed to help you
bring sanity to your otherwise chaotic data processing workflows.

● OpenRefine

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning
it, transforming it from one format into another, and extending it with web services and
external data.

● DataWrangler

Wrangler is an interactive tool for data cleaning and transformation, designed to let you spend
less time formatting and more time analyzing your data. Wrangler allows interactive
transformation of messy, real-world data into the data tables that analysis tools expect. Data
can be exported for use in Excel, R, Tableau, and Protovis.

● DataCleaner

The heart of DataCleaner is a strong data profiling engine for discovering and analyzing the
quality of your data: finding the patterns, missing values, character sets, and other
characteristics of your data values. Profiling is an essential activity of any Data Quality, Master
Data Management, or Data Governance program; if you don't know what you're up against, you
have a poor chance of fixing it. (A generic sketch of this kind of profiling, not specific to
DataCleaner, appears after this list.)

● Winpure Data Cleaning Tool

Data quality is an important contributor to the overall success of a project or campaign.
Inaccurate data leads to wrong assumptions and analysis and, consequently, to the failure of the
project or campaign. Duplicate data can also cause all sorts of hassles, such as slow load-ups and
accidental deletions. A superior data cleaning tool tackles these problems and cleans your
database of duplicate data, bad entries, and incorrect information. (A generic sketch of
de-duplication, not specific to Winpure, appears after this list.)
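
As promised above, the following sketch illustrates the general ideas of profiling and
de-duplication only; it uses the pandas library with made-up column names and values, and it is
not the API or behaviour of DataCleaner, Winpure, or any other product mentioned here:

    import pandas as pd

    # Illustrative only: generic profiling and de-duplication with pandas.
    # Column names and values are made up for the example.
    df = pd.DataFrame({
        "name": ["Ann Lee", "Ann Lee", "Bob Tan", None],
        "zip_code": ["75450", "75450", "7545O", "76100"],  # "7545O" contains a letter O
    })

    # Profiling: missing values per column and a simple value-pattern check.
    print(df.isna().sum())                         # missing-value counts
    print(df["zip_code"].str.match(r"^\d{5}$"))    # which zip codes fit a 5-digit pattern

    # De-duplication: keep a single copy of exact duplicate rows.
    deduped = df.drop_duplicates()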

1.4 Challenges of big data cleaning

Because big data has the 4V characteristics, extracting high-quality, genuine data from the
massive, variable, and complicated data sets that enterprises use and process has become an
urgent issue. At present, big data quality faces the following challenges:

● The diversity of data sources brings abundant data types and complex data
structures and increases the difficulty of data integration.

In the past, enterprises only used the data generated from their own business systems, such as
sales and inventory data. But now, the data collected and analyzed by enterprises have surpassed
this scope. Big data sources are very wide, including:

1) data sets from the internet and mobile internet (Li & Liu, 2013);
2) data from the Internet of Things;
3) data collected by various industries;
4) scientific experimental and observational data (Demchenko, Grosso & Laat, 2013), such as
high-energy physics experimental data, biological data, and space observation data.

These sources produce rich data types. One data type is unstructured data, for example,
documents, video, audio, etc. The second type is semi-structured data, including software
packages/modules, spreadsheets, and financial reports. The third is structured data. The
quantity of unstructured data accounts for more than 80% of the total amount of data in
existence.

For enterprises, obtaining big data with complex structure from different sources and
effectively integrating them is a daunting task (McGilvray, 2008). Data from different sources
often conflict with or contradict one another. When data volumes are small, the data can be
checked by manual search or programming, or by ETL (Extract, Transform, Load) or ELT
(Extract, Load, Transform) processes; a small sketch of such a programmatic cross-source
check is given after this list of challenges. However, these methods break down when
processing PB-level or even EB-level data volumes.

● Data volume is tremendous, and it is difficult to judge data quality within a reasonable
amount of time.

After the industrial revolution, the amount of information, dominated by text, doubled every
ten years. After 1970, the amount of information doubled every three years. Today, the global
amount of information doubles every two years; in 2011, the amount of data created and copied
worldwide reached 1.8 ZB. It is difficult to collect, clean, integrate, and finally obtain the
necessary high-quality data within a reasonable time frame. Because the proportion of
unstructured data in big data is very high, a lot of time is needed to transform unstructured
types into structured types and to process the data further. This is a great challenge for
existing data quality processing techniques.

● Data change very fast and the “timeliness” of data is very short, which places higher
demands on processing technology.

Due to the rapid changes in big data, the “timeliness” of some data is very short. If companies
cannot collect the required data in real time, or if processing the data takes too long, they may
obtain outdated and invalid information. Processing and analysis based on such data will
produce useless or misleading conclusions, eventually leading to decision-making mistakes by
governments or enterprises. At present, real-time processing and analysis software for big data
is still in the development or improvement phase; truly effective commercial products are few.

● No unified and approved data quality standards have been formed in China and
abroad, and research on the data quality of big data has just begun.

To guarantee product quality and improve benefits to enterprises, the International
Organization for Standardization (ISO) published the ISO 9000 standards in 1987. Nowadays,
more than 100 countries and regions around the world actively implement these standards.
This implementation promotes mutual understanding among enterprises in domestic and
international trade and brings the benefit of eliminating trade barriers. By contrast, the study of
data quality standards began in the 1990s, but not until 2011 did ISO publish the ISO 8000 data
quality standards (Wang, Li, & Wang, 2010). At present, more than 20 countries participate in
this standard, but there are many disputes about it, and the standards still need to mature. At
the same time, research on big data quality in China and abroad has just begun, and there are,
as yet, few results.
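
Returning to the first challenge above, the promised sketch of a programmatic cross-source
consistency check is shown below. It is illustrative only: the pandas library, the field names, and
the sample values are all assumptions, and, as noted earlier, such single-machine checks do not
scale to PB- or EB-level volumes.

    import pandas as pd

    # Sketch: flag records whose values conflict between two data sources.
    # "customer_id" and "city" are hypothetical field names; data are made up.
    sales = pd.DataFrame({"customer_id": [1, 2], "city": ["Melaka", "Johor Bahru"]})
    crm = pd.DataFrame({"customer_id": [1, 2], "city": ["Melaka", "Kuala Lumpur"]})

    merged = sales.merge(crm, on="customer_id", suffixes=("_sales", "_crm"))
    conflicts = merged[merged["city_sales"] != merged["city_crm"]]
    print(conflicts)  # rows where the two sources disagree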

Conclusion

Within an organization, acceptable data quality is crucial to operational and transactional
processes and to the reliability of business analytics (BA) / business intelligence (BI) reporting.
Data quality is affected by the way data is entered, stored, and managed. Data quality
assurance (DQA) is the process of verifying the reliability and effectiveness of data.

Maintaining data quality requires going through the data periodically and scrubbing it.
Typically, this involves updating it, standardizing it, and de-duplicating records to create a
single view of the data, even if it is stored in multiple disparate systems. There are many
vendor applications on the market to make this job easier.

References

http://searchdatamanagement.techtarget.com/definition/data-quality

https://www.informatica.com/services-and-training/glossary-of-terms/data-quality-definition.html#fbid=PAAEY1tABpg

http://datascience.codata.org/articles/10.5334/dsj-2015-002/

http://www.sciencedirect.com/science/article/pii/S0925527314001339

Emran, N.A. et al., 2013a. Measuring Data Completeness for Microbial Genomics Database.
In ACIIDS 2013, Part 1. Lecture Notes in Computer Science. Springer.

Emran, N.A. et al., 2013b. Reference Architectures to Measure Data Completeness across
Integrated Databases. In ACIIDS 2013, Part 1. Springer-Verlag Berlin Heidelberg, pp.
216–225.

Emran, N.A., Embury, S. & Missier, P., 2014. Measuring Population-Based Completeness
for Single Nucleotide Polymorphism (SNP) Databases. In J. Sobecki, V. Boonjing, & S.
Chittayasothorn, eds. Advanced Approaches to Intelligent Information and Database
Systems. Cham: Springer International Publishing, pp. 173–182.
