Data Preprocessing
Data quality is critical: duplicate or missing data, for example, may cause incorrect or even misleading statistics.
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
An important issue for data warehousing and data mining: real-world data tend to be incomplete, noisy, and inconsistent.
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation = "")
- noisy: containing errors or outliers (e.g., Salary = "-10")
- inconsistent: containing discrepancies (e.g., Age = "42" but Birthday = "03/07/1997"; ratings recorded as "1, 2, 3" in one place and "A, B, C" in another; discrepancies between duplicate records)
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 − Q1
- Five-number summary: min, Q1, median, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend outward, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
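The five-number summary and the 1.5 × IQR outlier rule above can be sketched in a few lines. This is a minimal illustration (quartile conventions vary; the "median of halves" method is used here):

```python
# Five-number summary (min, Q1, median, Q3, max) and the 1.5*IQR rule.
def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        m = len(seq)
        mid = m // 2
        return seq[mid] if m % 2 else (seq[mid - 1] + seq[mid]) / 2

    lower = xs[:n // 2]            # values below the median
    upper = xs[(n + 1) // 2:]      # values above the median
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def iqr_outliers(values):
    lo, q1, med, q3, hi = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

data = [2, 3, 4, 5, 6, 7, 8, 9, 50]      # 50 is an obvious outlier
print(five_number_summary(data))          # -> (2, 3.5, 6, 8.5, 50)
print(iqr_outliers(data))                 # -> [50]
```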
Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$

The standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$).
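A quick numerical check that the two sample-variance forms above agree, the definitional form and the computational shortcut:

```python
# Sample variance computed two equivalent ways.
def sample_variance_definition(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
a, b = sample_variance_definition(xs), sample_variance_shortcut(xs)
print(a, b)                 # both equal 32/7
assert abs(a - b) < 1e-9
```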
July 28, 2009
Motivation
- To better understand the data: central tendency, variation, and spread (median, max, min, quantiles, outliers, variance, etc.)
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
- Folding measures into numerical dimensions; boxplot or quantile analysis on the transformed cube
Data Mining: Concepts and Techniques 10
The normal (distribution) curve (μ: mean, σ: standard deviation):
- From μ−σ to μ+σ: contains about 68% of the measurements
- From μ−2σ to μ+2σ: contains about 95%
- From μ−3σ to μ+3σ: contains about 99.7%
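The 68/95/99.7 figures above can be verified numerically: for a normal distribution, the probability mass within k standard deviations of the mean is erf(k/√2).

```python
import math

# Mass of a normal distribution within k standard deviations of the mean.
# Prints approximately 0.6827, 0.9545, 0.9973 for k = 1, 2, 3.
for k in (1, 2, 3):
    mass = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {mass:.4f}")
```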
Boxplot Analysis
Measuring the Central Tendency

Mean (sample and population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$; weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Median: the middle value if there is an odd number of values, or the average of the middle two values otherwise; for grouped data, estimated by interpolation.

Mode: the value that occurs most frequently in the data; distributions may be unimodal, bimodal, or trimodal. Empirical formula: mean − mode ≈ 3 × (mean − median).
Data is represented with a box:
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to the minimum and maximum
Median for grouped data (estimated by interpolation):

$$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$$

where $L_1$ is the lower boundary of the median class, $(\sum f)_l$ is the sum of the frequencies of the classes below it, $f_{\text{median}}$ is the frequency of the median class, and $c$ is the class width.

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
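The interpolated median formula above can be sketched directly. The `bins` list here (lower boundary, frequency, with common class width `c`) is an illustrative assumption:

```python
# Interpolated median for grouped data:
# median = L1 + (n/2 - F_below) / f_median * c
def grouped_median(bins, c):
    n = sum(f for _, f in bins)
    cum = 0
    for lower, f in bins:
        if cum + f >= n / 2:               # this is the median class
            return lower + (n / 2 - cum) / f * c
        cum += f

bins = [(0, 5), (10, 15), (20, 20), (30, 10)]   # e.g. ages in width-10 classes
print(grouped_median(bins, 10))                  # -> 22.5
```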
Histogram Analysis
Loess Curve
A univariate graphical method: a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
Adds a smooth curve to a scatter plot to give a better perception of the pattern of dependence. A loess curve is fitted by setting two parameters: a smoothing parameter and the degree of the polynomials fitted by the regression.
Quantile Plot
- Displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences
- Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
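Computing the $(f_i, x_i)$ pairs for a quantile plot is straightforward. This sketch uses the common convention $f_i = (i - 0.5)/n$ for 1-based $i$ (written with 0-based indices below):

```python
# Quantile-plot points: f_i = (i - 0.5)/n for the i-th smallest value
# (1-based i); equivalently (i + 0.5)/n with 0-based enumerate().
def quantile_points(values):
    xs = sorted(values)
    n = len(xs)
    return [((i + 0.5) / n, x) for i, x in enumerate(xs)]

for f, x in quantile_points([7, 3, 9, 5]):
    print(f"f = {f:.3f}  x = {x}")
```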
Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to the other
Scatter plot
- Provides a first look at bivariate data, to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane
Data Reduction
Data reduction is used to obtain a reduced representation of the data while minimizing the loss of information content. Techniques include data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization.
[Figure: data reduction shrinks a table of tuples T1 … T2000 with attributes A1 … A126 to tuples T1, T4, …, T1456 with attributes A1 … A115]
DATA CLEANING
Data Integration
Data integration:
- combines data from multiple sources into a coherent data store, e.g., a data warehouse
- sources may include multiple databases, data cubes, or flat files

Issues in data integration:
- schema integration
- redundancy
- detection and resolution of data value conflicts
Data Integration
Schema integration:
- integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Noise: random error or variance in a measured variable; smooth out the data to remove the noise.

Data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales (e.g., metric vs. British units).
Data Integration
Handling redundancy:
- The same attribute may have different names in different databases
- One attribute may be a derived attribute in another table (e.g., annual revenue)
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources can reduce or avoid redundancies and inconsistencies and improve mining speed and quality
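Correlation analysis for redundancy detection can be sketched with Pearson's r: attributes whose |r| is close to 1 are likely redundant. The monthly/annual revenue data below is a hypothetical example, not from the slides:

```python
import math

# Pearson correlation coefficient between two numeric attributes.
def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

monthly_revenue = [10.0, 12.0, 9.0, 15.0, 14.0]
annual_revenue = [12 * v for v in monthly_revenue]   # a derived attribute
print(pearson_r(monthly_revenue, annual_revenue))    # ~1.0: redundant
```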
Regression
[Figure: regression used for smoothing — data points fitted by the line y = x + 1 in the (x, y) plane]
Cluster Analysis
Clustering

Outliers may be detected by clustering, where similar values are organized into groups (clusters); values that fall outside of the clusters may be considered outliers.
DATA REDUCTION
Dimensionality Reduction
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Dimensionality reduction reduces the data set size by removing such attributes.
How can we find a good subset of the original attributes? The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
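One standard heuristic for this search is stepwise forward selection, sketched below. The `score` function is a stand-in for any measure of how well a subset preserves the class distribution; the toy score and attribute names are purely illustrative:

```python
# Greedy stepwise forward selection: repeatedly add the attribute that
# most improves the score until no remaining attribute helps.
def forward_selection(attributes, score):
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best:
                best, best_attr, improved = s, a, True
        if improved:
            selected.append(best_attr)
    return selected

# Toy score: pretend only A1 and A4 carry class information.
toy_score = lambda subset: len({"A1", "A4"} & set(subset))
print(forward_selection(["A1", "A2", "A3", "A4"], toy_score))  # -> ['A1', 'A4']
```

The mirror-image heuristics (stepwise backward elimination, or a combination of both) follow the same pattern with removal instead of addition.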
Dimensionality Reduction

[Figure: dimension-reduction workflow with stages labeled "standard form", "data preparation", "dimension reduction", "data subset", "prediction methods", and "evaluation"]
[Figure: decision-tree induction of attribute subsets, with leaves labeled Class 1 and Class 2]
Rough set theory
- Selection of attributes is identified by the concept of discernibility relations of classes in the data set
- Will be discussed in the next class
Data Compression
Apply data encoding or transformations to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed exactly, the compression is lossless; if only an approximation can be reconstructed, it is lossy.
Sampling
[Figure: sampling illustrated on raw data — a simple random sample and a cluster/stratified sample]
Histograms
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values with the largest differences

[Figure: example histogram of counts over values from 10,000 to 90,000]
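Before the more elaborate rules above, the two simplest bucketing schemes are equi-width (equal value ranges) and equi-depth (roughly equal counts). A minimal sketch, with illustrative price data:

```python
# Equi-width buckets: equal value ranges; store a count per bucket.
def equi_width_counts(values, nbuckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        i = min(int((v - lo) / width), nbuckets - 1)  # clamp the max value
        counts[i] += 1
    return counts

# Equi-depth buckets: roughly the same number of samples per bucket.
def equi_depth_buckets(values, nbuckets):
    xs = sorted(values)
    size = len(xs) // nbuckets
    return [xs[i * size:(i + 1) * size] for i in range(nbuckets - 1)] + \
           [xs[(nbuckets - 1) * size:]]

prices = [5, 7, 8, 12, 14, 14, 15, 18, 21, 30]
print(equi_width_counts(prices, 5))    # -> [3, 3, 2, 1, 1]
print(equi_depth_buckets(prices, 5))   # -> [[5, 7], [8, 12], [14, 14], [15, 18], [21, 30]]
```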
Clustering
- Partition the data set into clusters, and store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and store the result in multidimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed, but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation:
  - An index tree hierarchically divides a data set into partitions by the value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram
Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Chooses a representative subset of the data:
  - Simple random sampling may have very poor performance in the presence of skew
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data
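Stratified sampling as described above can be sketched as follows: sample each class (stratum) in proportion to its share of the data, so a skewed class keeps roughly its original percentage. The record layout and names here are illustrative:

```python
import random

# Stratified sampling: group records by class, then sample each stratum
# in proportion to its size (keeping at least one record per stratum).
def stratified_sample(records, get_class, fraction, seed=0):
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(get_class(r), []).append(r)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

# 5 "rare" tuples out of 100: simple random sampling could easily miss them.
data = [("t%d" % i, "rare" if i < 5 else "common") for i in range(100)]
s = stratified_sample(data, get_class=lambda r: r[1], fraction=0.1)
print(len(s), sum(1 for r in s if r[1] == "rare"))   # -> 11 1
```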
Discretization
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
$$E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
Entropy-Based Discretization
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., splitting continues only while the information gain exceeds a threshold δ: $\mathrm{Ent}(S) - E(T,S) > \delta$. Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.
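The binary split above can be sketched directly: try every candidate boundary T, compute E(S, T), and keep the boundary that minimizes it. The sample data is illustrative:

```python
import math
from collections import Counter

# Class entropy, written as sum p * log2(1/p).
def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# Best binary split: minimize E(S,T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)
# over midpoint boundaries; samples are (value, class_label) pairs.
def best_split(samples):
    xs = sorted(samples)
    best_t, best_e = None, float("inf")
    for i in range(1, len(xs)):
        t = (xs[i - 1][0] + xs[i][0]) / 2
        s1 = [c for v, c in xs if v <= t]
        s2 = [c for v, c in xs if v > t]
        e = len(s1) / len(xs) * entropy(s1) + len(s2) / len(xs) * entropy(s2)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
print(best_split(data))    # -> (6.5, 0.0): a perfect class separation
```

Recursing on each side of the chosen boundary (while the gain exceeds δ) yields the full discretization.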
Segmentation by natural partitioning (the 3-4-5 rule):
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Discretization methods for numeric data:
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning
[Figure: Step 4 of the 3-4-5 rule example — the range (−$400 … $5,000) partitioned into intervals such as (−$400 … −$300), (−$300 … −$200), (−$200 … −$100), (−$100 … 0), and ($600 … $800)]
Many techniques can be applied recursively to provide a hierarchical partitioning of the attribute, i.e., a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.
Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set (but the semantics of the attributes and the relations among them must be considered):
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- Note the exceptions: weekday, month, quarter, year
min-max normalization:

$$v' = \frac{v - \min_A}{\max_A - \min_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

z-score normalization:

$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$

normalization by decimal scaling:

$$v' = \frac{v}{10^j}, \text{ where } j \text{ is the smallest integer such that } \max(|v'|) < 1$$

[Figure: automatic hierarchy generation example with attributes of 3,567 and 674,339 distinct values]
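The three normalizations above are one-liners. The example numbers (an income of 73,600 in a min-max range of 12,000–98,000; mean 54,000 and standard deviation 16,000 for z-score) are illustrative assumptions:

```python
# min-max: map v from [mn, mx] onto [new_mn, new_mx].
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

# z-score: center on the mean, scale by the standard deviation.
def z_score(v, mean, std):
    return (v - mean) / std

# decimal scaling: j is the smallest integer such that max(|v'|) < 1.
def decimal_scaling(v, j):
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))   # -> 0.716
print(z_score(73600, 54000, 16000))             # -> 1.225
print(decimal_scaling(-986, 3))                 # -> -0.986
```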
Manual discretization: the information needed to convert continuous values into discrete values is obtained from an expert in the domain area.

Assignment 1
Assignment 1
Data Discretization
Table 6: Discretization of the mathematical symbols
Feature | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2
h02     |  1   0  |  0   0  |  2   1  |  0   0  |  2   1  |  2   2  |  0   0
h03     |  2   1  |  1   0  |  2   1  |  1   2  |  0   1  |  2   2  |  0   0
h11     |  1   0  |  1   1  |  0   0  |  1   1  |  0   0  |  1   1  |  0   0
h12     |  2   1  |  0   1  |  2   1  |  1   0  |  2   1  |  2   2  |  0   0
h13     |  2   1  |  2   1  |  1   1  |  2   2  |  0   1  |  0   2  |  0   0
h21     |  2   1  |  1   0  |  2   0  |  2   2  |  1   0  |  1   2  |  0   1
h22     |  2   1  |  2   0  |  2   1  |  1   0  |  1   1  |  2   2  |  0   0
h30     |  1   0  |  1   1  |  1   1  |  0   0  |  0   0  |  1   1  |  1   1
h31     |  2   0  |  2   1  |  0   1  |  2   1  |  1   0  |  2   2  |  0   1

(Each symbol has two columns: Orientation #1 and Orientation #2.)

Results
- Identify incomplete, noisy, and inconsistent data
- Use statistical techniques such as frequency counts and boxplots to detect such data
- Record the number of missing values and noisy data
- Also record which tuples are involved (if not too many)
Manual discretization
Binning, entropy,
Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but data preparation is still an active area of research
Data Discretization
Table 5: The invariance features for mathematical symbols
Feature | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2
h02     | 0.86711  0.54536 | 0.58806  0.61814 | 0.88477  0.80491 | 0.73293  0.66253 | 0.91948  0.82281 | 2.213    2.15402 | 0.15565  0.16081
h03     | 0.18849  0.02198 | 0.05518  0.00880 | 0.14812  0.05006 | 0.05052  0.08034 | 0.02059  0.06182 | 0.71402  0.18761 | 0.00002  0.01299
h11     | 0.08184  0.02583 | 0.08122  0.05408 | 0.01660  0.03593 | 0.16291  0.03918 | 0.01081  0.02135 | 0.059    0.08548 | 0.00662  0.01091
h12     | 0.16839  0.0241  | 0.00895  0.01927 | 0.13137  0.01596 | 0.05135  0.01415 | 0.06653  0.03221 | 0.22918  0.33771 | 0.00547  0.00812
h13     | 0.12728  0.01231 | 0.07504  0.05894 | 0.06236  0.04019 | 0.11263  0.10883 | 0.00924  0.03237 | 0.00903  0.81689 | 0.00182  0.00205
h21     | 0.01923  0.01844 | 0.01626  0.00178 | 0.02861  0.00195 | 0.02107  0.01978 | 0.01543  0.01006 | 0.01181  0.11741 | 0.00775  0.01267
h22     | 0.24873  0.1193  | 0.18318  0.07934 | 0.21195  0.12116 | 0.1385   0.11662 | 0.15602  0.12365 | 0.63556  0.70659 | 0.03896  0.04902
h30     | 0.12638  0.00087 | 0.03664  0.01363 | 0.04551  0.01324 | 0.00799  0.0049  | 0.00388  0.00398 | 0.05279  0.03468 | 0.02263  0.04908
h31     | 0.04125  0.00535 | 0.05776  0.02165 | 0.00528  0.01841 | 0.07375  0.01161 | 0.00697  0.00606 | 0.08960  0.13071 | 0.00017  0.01069

(Each symbol has two columns: Orientation #1 and Orientation #2.)