UCS551
DR. AZLIN BINTI AHMAD
Table of contents
Data Quality
Data Cleaning
Data Transformation
Data Sampling
Data preprocessing is a set of techniques used to improve the quality of the data before mining is applied, so that the data lead to high-quality mining results.
Data preprocessing techniques can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction.
https://www.electronicsmedia.info/2017/12/20/what-is-data-preprocessing/
Data Preprocessing: Why it is needed?
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
Incomplete data can occur for a number of reasons:
Attributes of interest may not always be available; for example, customer information for sales transactions may not have been considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions.
Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the data history or of modifications to the data may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is needed.
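These data-quality problems are easy to see on a small table. A minimal sketch in Python with pandas (the `sales` table, its column names, and the domain constraint `age >= 0` are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical sales records showing the problems described above:
# a missing attribute value, and an entry inconsistent with the rest.
sales = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age":         [34, np.nan, 29, -5],    # NaN: not recorded; -5: inconsistent
    "amount":      [120.0, 85.5, np.nan, 60.0],
})

# Incomplete data: count missing values per attribute.
missing_per_column = sales.isna().sum()

# Inconsistent data: flag rows violating a simple domain constraint (age >= 0).
inconsistent = sales[sales["age"] < 0]
```

Here `missing_per_column` reports one missing value each for `age` and `amount`, and `inconsistent` flags customer 104, whose negative age contradicts the other records.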
Data preprocessing: Where it exists?
Data preprocessing takes place during the data preparation stage of the data mining process, before mining results are generated.
http://socialcomputing.kaist.ac.kr/placeness/
Steps of Data Preprocessing
Data Cleaning: Data cleaning can be applied to remove noise and correct inconsistencies in the data.
Data integration: Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
Data transformation: Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.
Data reduction: Data Reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering, for instance.
These techniques are not mutually exclusive. They may work together.
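The four steps can be sketched working together on a toy example in Python with pandas (the two sources and all column names are illustrative assumptions, not from the lecture):

```python
import pandas as pd
import numpy as np

# Two hypothetical sources of customer spending data.
store_a = pd.DataFrame({"cust": [1, 2], "spend": [100.0, np.nan]})
store_b = pd.DataFrame({"cust": [3, 4], "spend": [250.0, 400.0]})

# Data integration: merge the sources into one coherent data store.
data = pd.concat([store_a, store_b], ignore_index=True)

# Data cleaning: fill the missing spend with the column mean.
data["spend"] = data["spend"].fillna(data["spend"].mean())

# Data transformation: min-max normalization to [0, 1], which helps
# mining algorithms that rely on distance measurements.
data["spend_norm"] = (data["spend"] - data["spend"].min()) / (
    data["spend"].max() - data["spend"].min())

# Data reduction: aggregate down to a single summary statistic.
summary = data["spend_norm"].mean()
```

The steps cooperate: the cleaned value (the mean, 250.0) flows into the normalization, and the reduced summary is computed from the transformed column.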
https://www.slideshare.net/TonyNguyen197/data-preprocessing-61425971
https://www.electronicsmedia.info/2017/12/20/what-is-data-preprocessing/
Data Mining Processes
Introduction
The whole process of data mining cannot be completed in a single step. In other words, you cannot extract the required information from large volumes of data just like that. It is a far more complex process than we might think, involving a number of steps. These steps, including data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation, are to be completed in the given order.
The data mining process requires large volumes of historical data for analysis.
So, the data repository with integrated data usually contains much more data than is actually required. From the available data, the data of interest need to be selected and stored.
Data selection is the process where the data relevant to the analysis is retrieved from the database.
d) Data Transformation
Data transformation is the process of transforming and consolidating the data into different forms
that are suitable for mining.
Data transformation normally involves normalization, aggregation, generalization etc.
For example, a data set available as "-5, 37, 100, 89, 78" can be transformed as "-0.05, 0.37, 1.00, 0.89,
0.78". Here data becomes more suitable for data mining.
After data transformation, the available data is ready for data mining.
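The transformation in the example above is a decimal-scaling normalization: each value is divided by 10^j, where j is the smallest integer that brings every absolute value to at most 1. A small sketch in plain Python, using the same figures as the text:

```python
# Decimal-scaling normalization of the example data set.
values = [-5, 37, 100, 89, 78]

# Find the smallest j such that max(|v|) / 10^j <= 1.
max_abs = max(abs(v) for v in values)
j = 0
while max_abs / (10 ** j) > 1:
    j += 1

# Divide every value by 10^j (here j = 2, i.e. divide by 100).
scaled = [v / (10 ** j) for v in values]
```

With j = 2, `scaled` reproduces exactly the "-0.05, 0.37, 1.00, 0.89, 0.78" quoted above.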
e) Data Mining
Data mining is the core process where a number of complex and intelligent methods are applied to
extract patterns from data.
Data mining process includes a number of tasks such as association, classification, prediction,
clustering, time series analysis and so on.
f) Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns representing knowledge, based on different types of interestingness measures. A pattern is considered interesting if it is potentially useful, easily understandable by humans, validates a hypothesis that someone wants to confirm, or is valid on new data with some degree of certainty.
g) Knowledge Representation
The information mined from the data needs to be presented to the user in an appealing way.
Different knowledge representation and visualization techniques are applied to provide the output of data mining
to the users.
Summary of Data Preprocessing
The data preparation methods, together with the data mining tasks, make up the data mining process as a whole.
The data mining process is not as simple as explained here.
Each data mining process faces a number of challenges and issues in real-life scenarios before potentially useful information is extracted.
Data Sampling
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of
data points to identify patterns and trends in the larger data set being examined. It enables data scientists,
predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical
population to build and run analytical models more quickly, while still producing accurate findings.
https://searchbusinessanalytics.techtarget.com/definition/data-sampling
Data Sampling: Advantages and challenges
Sampling can be particularly useful with data sets that are too large to efficiently analyze in full -- for example, in big data analytics applications or surveys.
Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
An important consideration, though, is the size of the required data sample and the possibility of introducing a
sampling error.
In some cases, a small sample can reveal the most important information about a data set.
In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even
though the increased size of the sample may impede ease of manipulation and interpretation.
Types of data sampling methods: Probability sampling
There are many different methods for drawing samples from data; the ideal one depends on the data set
and situation. Sampling can be based on probability, an approach that uses random numbers that
correspond to points in the data set to ensure that there is no correlation between points chosen for the
sample.
Further variations in probability sampling include:
Simple random sampling: Software is used to randomly select subjects from the whole population.
Stratified sampling: Subsets of the data sets or population are created based on a common factor, and
samples are randomly collected from each subgroup.
Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a
random sampling of clusters is analyzed.
Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the
larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary
factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are
identified, clustered and analyzed.
Systematic sampling: A sample is created by setting an interval at which to extract data from the larger
population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of
20 rows to analyze.
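Three of the probability-sampling schemes above can be sketched with pandas (the 200-row table, the `region` stratification factor, and the fixed random seeds are illustrative assumptions):

```python
import pandas as pd

# A toy population of 200 rows with a factor to stratify on.
pop = pd.DataFrame({
    "row": range(200),
    "region": ["north", "south"] * 100,
})

# Simple random sampling: 20 rows chosen uniformly at random.
simple = pop.sample(n=20, random_state=0)

# Stratified sampling: 10 rows drawn at random from each region subgroup.
stratified = pop.groupby("region").sample(n=10, random_state=0)

# Systematic sampling: every 10th row, as in the spreadsheet example above,
# giving a sample of 20 rows out of 200.
systematic = pop.iloc[::10]
```

Each scheme yields 20 rows, but with different guarantees: simple random sampling avoids any selection bias, stratification guarantees both regions appear equally, and systematic sampling gives evenly spaced coverage.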
Types of data sampling methods: Non-probability sampling
Sampling can also be based on nonprobability, an approach in which a data sample is determined and extracted based on the judgment of the analyst.
As inclusion is determined by the analyst, it can be more difficult to extrapolate whether the sample accurately represents the larger population than when probability sampling is used.
Once generated, a sample can be used for predictive analytics. For example, a retail business might use data sampling
to uncover patterns about customer behavior and predictive modeling to create more effective sales strategies.
Data Sub-setting and manipulating
What is sub-setting?
Subsetting is the process of retrieving just the parts of large files that are of interest for a specific purpose. This usually occurs in a client-server setting, where the extraction of the parts of interest happens on the server before the data are sent to the client over a network.
The main purpose of subsetting is to save bandwidth on the network and storage space on the client computer.
Subsetting may be favorable for the following reasons:
restrict or divide the time range
select cross sections of data
select particular kinds of time series
exclude particular observations
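Those subsetting operations can be sketched with pandas (the time-series table and its field names are invented for illustration):

```python
import pandas as pd

# A toy time-series file with two kinds of series interleaved.
ts = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "series": ["temp", "rain"] * 3,
    "value": [1.0, 0.0, 2.0, 5.0, 3.0, 0.0],
})

# Restrict the time range: keep only the first three days.
in_range = ts[ts["date"] < "2020-01-04"]

# Select a particular kind of time series.
temp_only = ts[ts["series"] == "temp"]

# Exclude particular observations (here, zero rainfall readings).
nonzero = ts[~((ts["series"] == "rain") & (ts["value"] == 0.0))]
```

In a real client-server deployment these filters would run on the server, so only the reduced subset crosses the network.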
Wikipedia