UCS551
DR. AZLIN BINTI AHMAD
Table of contents
Data Quality
Data Cleaning
Data Transformation
Data Sampling
Data preprocessing is a set of techniques used to improve the quality of the data before mining is applied, so that the data lead to high-quality mining results.
Data preprocessing techniques can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction.
https://www.electronicsmedia.info/2017/12/20/what-is-data-preprocessing/
Data Preprocessing: Why it is needed?
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
Incomplete data can occur for a number of reasons:
Attributes of interest may not always be available; for example, customer information for sales transactions may not have been considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions.
Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the data history or of modifications to the data may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is needed.
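These data-quality problems are easy to see on a small table. A minimal sketch in Python with pandas (the `sales` table, its column names, and the domain constraint `age >= 0` are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical sales records showing the problems described above:
# a missing attribute value, and an entry inconsistent with the rest.
sales = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age":         [34, np.nan, 29, -5],    # NaN: not recorded; -5: inconsistent
    "amount":      [120.0, 85.5, np.nan, 60.0],
})

# Incomplete data: count missing values per attribute.
missing_per_column = sales.isna().sum()

# Inconsistent data: flag rows violating a simple domain constraint (age >= 0).
inconsistent = sales[sales["age"] < 0]
```

Here `missing_per_column` reports one missing value each for `age` and `amount`, and `inconsistent` flags customer 104, whose negative age contradicts the other records.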
Data preprocessing: Where it exists?
Data preprocessing takes place during the data preparation stage of the data mining process, before mining results are generated.
http://socialcomputing.kaist.ac.kr/placeness/
Steps of Data Preprocessing
Data Cleaning: Data cleaning can be applied to remove noise and correct inconsistencies in the data.
Data integration: Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
Data transformation: Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.
Data reduction: Data Reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering, for instance.
These techniques are not mutually exclusive. They may work together.
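The four steps can be sketched working together on a toy example in Python with pandas (the two sources and all column names are illustrative assumptions, not from the lecture):

```python
import pandas as pd
import numpy as np

# Two hypothetical sources of customer spending data.
store_a = pd.DataFrame({"cust": [1, 2], "spend": [100.0, np.nan]})
store_b = pd.DataFrame({"cust": [3, 4], "spend": [250.0, 400.0]})

# Data integration: merge the sources into one coherent data store.
data = pd.concat([store_a, store_b], ignore_index=True)

# Data cleaning: fill the missing spend with the column mean.
data["spend"] = data["spend"].fillna(data["spend"].mean())

# Data transformation: min-max normalization to [0, 1], which helps
# mining algorithms that rely on distance measurements.
data["spend_norm"] = (data["spend"] - data["spend"].min()) / (
    data["spend"].max() - data["spend"].min())

# Data reduction: aggregate down to a single summary statistic.
summary = data["spend_norm"].mean()
```

The steps cooperate: the cleaned value (the mean, 250.0) flows into the normalization, and the reduced summary is computed from the transformed column.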
https://www.slideshare.net/TonyNguyen197/data-preprocessing-61425971
https://www.electronicsmedia.info/2017/12/20/what-is-data-preprocessing/
Data Mining Processes
Introduction
The whole process of data mining cannot be completed in a single step. In other words, you cannot extract the required information from large volumes of data just like that. It is a far more complex process than we might think, involving a number of steps. These steps, including data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation, are to be completed in the given order.
The data mining process requires large volumes of historical data for analysis.
So, the data repository with integrated data usually contains much more data than is actually required. From the available data, the data of interest need to be selected and stored.
Data selection is the process where the data relevant to the analysis is retrieved from the database.
d) Data Transformation
Data transformation is the process of transforming and consolidating the data into different forms
that are suitable for mining.
Data transformation normally involves normalization, aggregation, generalization etc.
For example, a data set available as "-5, 37, 100, 89, 78" can be transformed as "-0.05, 0.37, 1.00, 0.89,
0.78". Here data becomes more suitable for data mining.
After data transformation, the available data is ready for data mining.
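The transformation in the example above is a decimal-scaling normalization: each value is divided by 10^j, where j is the smallest integer that brings every absolute value to at most 1. A small sketch in plain Python, using the same figures as the text:

```python
# Decimal-scaling normalization of the example data set.
values = [-5, 37, 100, 89, 78]

# Find the smallest j such that max(|v|) / 10^j <= 1.
max_abs = max(abs(v) for v in values)
j = 0
while max_abs / (10 ** j) > 1:
    j += 1

# Divide every value by 10^j (here j = 2, i.e. divide by 100).
scaled = [v / (10 ** j) for v in values]
```

With j = 2, `scaled` reproduces exactly the "-0.05, 0.37, 1.00, 0.89, 0.78" quoted above.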
e) Data Mining
Data mining is the core process where a number of complex and intelligent methods are applied to
extract patterns from data.
Data mining process includes a number of tasks such as association, classification, prediction,
clustering, time series analysis and so on.
f) Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns representing knowledge, based on different types of interestingness measures. A pattern is considered interesting if it is potentially useful, easily understandable by humans, validates a hypothesis that someone wants to confirm, or is valid on new data with some degree of certainty.
g) Knowledge Representation
The information mined from the data needs to be presented to the user in an appealing way.
Different knowledge representation and visualization techniques are applied to provide the output of data mining
to the users.
Summary of Data Preprocessing
The data preparation methods, together with the data mining tasks, make up the data mining process as a whole.
The data mining process is not as simple as explained here.
Each data mining process faces a number of challenges and issues in real-life scenarios before potentially useful information is extracted.
Data Sampling
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of
data points to identify patterns and trends in the larger data set being examined. It enables data scientists,
predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical
population to build and run analytical models more quickly, while still producing accurate findings.
https://searchbusinessanalytics.techtarget.com/definition/data-sampling
Data Sampling: Advantages and challenges
Sampling can be particularly useful with data sets that are too large to efficiently analyze in full -- for example, in big data analytics applications or surveys.
Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
An important consideration, though, is the size of the required data sample and the possibility of introducing a
sampling error.
In some cases, a small sample can reveal the most important information about a data set.
In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even
though the increased size of the sample may impede ease of manipulation and interpretation.
Types of data sampling methods: Probability sampling
There are many different methods for drawing samples from data; the ideal one depends on the data set
and situation. Sampling can be based on probability, an approach that uses random numbers that
correspond to points in the data set to ensure that there is no correlation between points chosen for the
sample.
Further variations in probability sampling include:
Simple random sampling: Software is used to randomly select subjects from the whole population.
Stratified sampling: Subsets of the data sets or population are created based on a common factor, and
samples are randomly collected from each subgroup.
Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a
random sampling of clusters is analyzed.
Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the
larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary
factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are
identified, clustered and analyzed.
Systematic sampling: A sample is created by setting an interval at which to extract data from the larger
population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of
20 rows to analyze.
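Three of the probability-sampling schemes above can be sketched with pandas (the 200-row table, the `region` stratification factor, and the fixed random seeds are illustrative assumptions):

```python
import pandas as pd

# A toy population of 200 rows with a factor to stratify on.
pop = pd.DataFrame({
    "row": range(200),
    "region": ["north", "south"] * 100,
})

# Simple random sampling: 20 rows chosen uniformly at random.
simple = pop.sample(n=20, random_state=0)

# Stratified sampling: 10 rows drawn at random from each region subgroup.
stratified = pop.groupby("region").sample(n=10, random_state=0)

# Systematic sampling: every 10th row, as in the spreadsheet example above,
# giving a sample of 20 rows out of 200.
systematic = pop.iloc[::10]
```

Each scheme yields 20 rows, but with different guarantees: simple random sampling avoids any selection bias, stratification guarantees both regions appear equally, and systematic sampling gives evenly spaced coverage.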
Types of data sampling methods: Non-probability sampling
Sampling can also be based on nonprobability, an approach in which a data sample is determined and extracted based on the judgment of the analyst.
As inclusion is determined by the analyst, it can be more difficult to extrapolate whether the sample accurately represents the larger population than when probability sampling is used.
Once generated, a sample can be used for predictive analytics. For example, a retail business might use data sampling
to uncover patterns about customer behavior and predictive modeling to create more effective sales strategies.
Data Sub-setting and manipulating
What is sub-setting?
Subsetting is the process of retrieving just the parts of large files that are of interest for a specific purpose. This usually occurs in a client-server setting, where the extraction of the parts of interest happens on the server before the data are sent to the client over a network.
The main purpose of subsetting is to save bandwidth on the network and storage space on the client computer.
Subsetting may be favorable for the following reasons:
restrict or divide the time range
select cross sections of data
select particular kinds of time series
exclude particular observations
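Those subsetting operations can be sketched with pandas (the time-series table and its field names are invented for illustration):

```python
import pandas as pd

# A toy time-series file with two kinds of series interleaved.
ts = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "series": ["temp", "rain"] * 3,
    "value": [1.0, 0.0, 2.0, 5.0, 3.0, 0.0],
})

# Restrict the time range: keep only the first three days.
in_range = ts[ts["date"] < "2020-01-04"]

# Select a particular kind of time series.
temp_only = ts[ts["series"] == "temp"]

# Exclude particular observations (here, zero rainfall readings).
nonzero = ts[~((ts["series"] == "rain") & (ts["value"] == 0.0))]
```

In a real client-server deployment these filters would run on the server, so only the reduced subset crosses the network.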
Wikipedia