Data Preprocessing
Data quality is critical: duplicate or missing data, for example, may cause incorrect or even misleading statistics.
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
An important issue for data warehousing and data mining: real-world data tend to be incomplete, noisy, and inconsistent.
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation = "")
- noisy: containing errors or outliers (e.g., Salary = "-10")
- inconsistent: containing discrepancies (e.g., Age = "42" but Birthday = "03/07/1997"; ratings recorded as "1, 2, 3" in one place and "A, B, C" in another; discrepancies between duplicate records)
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 − Q1
- Five-number summary: min, Q1, median, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend outward, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
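The five-number summary and the 1.5 × IQR outlier rule above can be sketched in a few lines. This is a minimal illustration (quartile conventions vary; the "median of halves" method is used here):

```python
# Five-number summary (min, Q1, median, Q3, max) and the 1.5*IQR rule.
def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        m = len(seq)
        mid = m // 2
        return seq[mid] if m % 2 else (seq[mid - 1] + seq[mid]) / 2

    lower = xs[:n // 2]            # values below the median
    upper = xs[(n + 1) // 2:]      # values above the median
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def iqr_outliers(values):
    lo, q1, med, q3, hi = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

data = [2, 3, 4, 5, 6, 7, 8, 9, 50]      # 50 is an obvious outlier
print(five_number_summary(data))          # -> (2, 3.5, 6, 8.5, 50)
print(iqr_outliers(data))                 # -> [50]
```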
Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$

The standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$).
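A quick numerical check that the two sample-variance forms above agree, the definitional form and the computational shortcut:

```python
# Sample variance computed two equivalent ways.
def sample_variance_definition(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
a, b = sample_variance_definition(xs), sample_variance_shortcut(xs)
print(a, b)                 # both equal 32/7
assert abs(a - b) < 1e-9
```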
July 28, 2009
Motivation
- To better understand the data: central tendency, variation, and spread (median, max, min, quantiles, outliers, variance, etc.)
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
- Folding measures into numerical dimensions; boxplot or quantile analysis on the transformed cube
Data Mining: Concepts and Techniques 10
The normal (distribution) curve (μ: mean, σ: standard deviation):
- From μ−σ to μ+σ: contains about 68% of the measurements
- From μ−2σ to μ+2σ: contains about 95%
- From μ−3σ to μ+3σ: contains about 99.7%
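The 68/95/99.7 figures above can be verified numerically: for a normal distribution, the probability mass within k standard deviations of the mean is erf(k/√2).

```python
import math

# Mass of a normal distribution within k standard deviations of the mean.
# Prints approximately 0.6827, 0.9545, 0.9973 for k = 1, 2, 3.
for k in (1, 2, 3):
    mass = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {mass:.4f}")
```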
Boxplot Analysis
Measuring the Central Tendency

Mean (sample and population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$; weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Median: the middle value if there is an odd number of values, or the average of the middle two values otherwise; for grouped data, estimated by interpolation.

Mode: the value that occurs most frequently in the data; distributions may be unimodal, bimodal, or trimodal. Empirical formula: mean − mode ≈ 3 × (mean − median).
Data is represented with a box:
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to the minimum and maximum
Median for grouped data (estimated by interpolation):

$$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$$

where $L_1$ is the lower boundary of the median class, $(\sum f)_l$ is the sum of the frequencies of the classes below it, $f_{\text{median}}$ is the frequency of the median class, and $c$ is the class width.

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
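The interpolated median formula above can be sketched directly. The `bins` list here (lower boundary, frequency, with common class width `c`) is an illustrative assumption:

```python
# Interpolated median for grouped data:
# median = L1 + (n/2 - F_below) / f_median * c
def grouped_median(bins, c):
    n = sum(f for _, f in bins)
    cum = 0
    for lower, f in bins:
        if cum + f >= n / 2:               # this is the median class
            return lower + (n / 2 - cum) / f * c
        cum += f

bins = [(0, 5), (10, 15), (20, 20), (30, 10)]   # e.g. ages in width-10 classes
print(grouped_median(bins, 10))                  # -> 22.5
```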
Histogram Analysis
Loess Curve
A univariate graphical method: a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
Adds a smooth curve to a scatter plot to give a better perception of the pattern of dependence. A loess curve is fitted by setting two parameters: a smoothing parameter and the degree of the polynomials fitted by the regression.
Quantile Plot
- Displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences
- Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
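Computing the $(f_i, x_i)$ pairs for a quantile plot is straightforward. This sketch uses the common convention $f_i = (i - 0.5)/n$ for 1-based $i$ (written with 0-based indices below):

```python
# Quantile-plot points: f_i = (i - 0.5)/n for the i-th smallest value
# (1-based i); equivalently (i + 0.5)/n with 0-based enumerate().
def quantile_points(values):
    xs = sorted(values)
    n = len(xs)
    return [((i + 0.5) / n, x) for i, x in enumerate(xs)]

for f, x in quantile_points([7, 3, 9, 5]):
    print(f"f = {f:.3f}  x = {x}")
```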
Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to the other
Scatter plot
- Provides a first look at bivariate data, to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane
Data Reduction
Data reduction is used to obtain a reduced representation of the data while minimizing the loss of information content. Techniques include data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization.
[Figure: data reduction shrinks a table of tuples T1 … T2000 with attributes A1 … A126 to tuples T1, T4, …, T1456 with attributes A1 … A115]
DATA CLEANING
Data Integration
Data integration:
- combines data from multiple sources into a coherent data store, e.g., a data warehouse
- sources may include multiple databases, data cubes, or flat files

Issues in data integration:
- schema integration
- redundancy
- detection and resolution of data value conflicts
Data Integration
Schema integration:
- integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Noise: random error or variance in a measured variable; smooth out the data to remove the noise.

Data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales (e.g., metric vs. British units).
Data Integration
Handling redundancy:
- The same attribute may have different names in different databases
- One attribute may be a derived attribute in another table (e.g., annual revenue)
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources can reduce or avoid redundancies and inconsistencies and improve mining speed and quality
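Correlation analysis for redundancy detection can be sketched with Pearson's r: attributes whose |r| is close to 1 are likely redundant. The monthly/annual revenue data below is a hypothetical example, not from the slides:

```python
import math

# Pearson correlation coefficient between two numeric attributes.
def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

monthly_revenue = [10.0, 12.0, 9.0, 15.0, 14.0]
annual_revenue = [12 * v for v in monthly_revenue]   # a derived attribute
print(pearson_r(monthly_revenue, annual_revenue))    # ~1.0: redundant
```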
Regression
[Figure: regression used for smoothing — data points fitted by the line y = x + 1 in the (x, y) plane]
Cluster Analysis
Clustering

Outliers may be detected by clustering, where similar values are organized into groups (clusters); values that fall outside of the clusters may be considered outliers.
DATA REDUCTION
Dimensionality Reduction
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Dimensionality reduction reduces the data set size by removing such attributes.
How can we find a good subset of the original attributes? The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
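One standard heuristic for this search is stepwise forward selection, sketched below. The `score` function is a stand-in for any measure of how well a subset preserves the class distribution; the toy score and attribute names are purely illustrative:

```python
# Greedy stepwise forward selection: repeatedly add the attribute that
# most improves the score until no remaining attribute helps.
def forward_selection(attributes, score):
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best:
                best, best_attr, improved = s, a, True
        if improved:
            selected.append(best_attr)
    return selected

# Toy score: pretend only A1 and A4 carry class information.
toy_score = lambda subset: len({"A1", "A4"} & set(subset))
print(forward_selection(["A1", "A2", "A3", "A4"], toy_score))  # -> ['A1', 'A4']
```

The mirror-image heuristics (stepwise backward elimination, or a combination of both) follow the same pattern with removal instead of addition.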
Dimensionality Reduction

[Figure: dimension-reduction workflow with stages labeled "standard form", "data preparation", "dimension reduction", "data subset", "prediction methods", and "evaluation"]
[Figure: decision-tree induction of attribute subsets, with leaves labeled Class 1 and Class 2]
Rough set theory
- Selection of attributes is identified by the concept of discernibility relations of classes in the data set
- Will be discussed in the next class
Data Compression
Apply data encoding or transformations to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed exactly, the compression is lossless; if only an approximation can be reconstructed, it is lossy.
Sampling
[Figure: sampling illustrated on raw data — a simple random sample and a cluster/stratified sample]
Histograms
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values with the largest differences

[Figure: example histogram of counts over values from 10,000 to 90,000]
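Before the more elaborate rules above, the two simplest bucketing schemes are equi-width (equal value ranges) and equi-depth (roughly equal counts). A minimal sketch, with illustrative price data:

```python
# Equi-width buckets: equal value ranges; store a count per bucket.
def equi_width_counts(values, nbuckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        i = min(int((v - lo) / width), nbuckets - 1)  # clamp the max value
        counts[i] += 1
    return counts

# Equi-depth buckets: roughly the same number of samples per bucket.
def equi_depth_buckets(values, nbuckets):
    xs = sorted(values)
    size = len(xs) // nbuckets
    return [xs[i * size:(i + 1) * size] for i in range(nbuckets - 1)] + \
           [xs[(nbuckets - 1) * size:]]

prices = [5, 7, 8, 12, 14, 14, 15, 18, 21, 30]
print(equi_width_counts(prices, 5))    # -> [3, 3, 2, 1, 1]
print(equi_depth_buckets(prices, 5))   # -> [[5, 7], [8, 12], [14, 14], [15, 18], [21, 30]]
```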
Clustering
- Partition the data set into clusters, and store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and store the result in multidimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed, but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation:
  - An index tree hierarchically divides a data set into partitions by the value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram
Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Chooses a representative subset of the data:
  - Simple random sampling may have very poor performance in the presence of skew
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data
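Stratified sampling as described above can be sketched as follows: sample each class (stratum) in proportion to its share of the data, so a skewed class keeps roughly its original percentage. The record layout and names here are illustrative:

```python
import random

# Stratified sampling: group records by class, then sample each stratum
# in proportion to its size (keeping at least one record per stratum).
def stratified_sample(records, get_class, fraction, seed=0):
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(get_class(r), []).append(r)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

# 5 "rare" tuples out of 100: simple random sampling could easily miss them.
data = [("t%d" % i, "rare" if i < 5 else "common") for i in range(100)]
s = stratified_sample(data, get_class=lambda r: r[1], fraction=0.1)
print(len(s), sum(1 for r in s if r[1] == "rare"))   # -> 11 1
```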
Discretization
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
$$E(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
Entropy-Based Discretization
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., splitting continues only while the information gain exceeds a threshold δ: $\mathrm{Ent}(S) - E(T,S) > \delta$. Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.
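The binary split above can be sketched directly: try every candidate boundary T, compute E(S, T), and keep the boundary that minimizes it. The sample data is illustrative:

```python
import math
from collections import Counter

# Class entropy, written as sum p * log2(1/p).
def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# Best binary split: minimize E(S,T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)
# over midpoint boundaries; samples are (value, class_label) pairs.
def best_split(samples):
    xs = sorted(samples)
    best_t, best_e = None, float("inf")
    for i in range(1, len(xs)):
        t = (xs[i - 1][0] + xs[i][0]) / 2
        s1 = [c for v, c in xs if v <= t]
        s2 = [c for v, c in xs if v > t]
        e = len(s1) / len(xs) * entropy(s1) + len(s2) / len(xs) * entropy(s2)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
print(best_split(data))    # -> (6.5, 0.0): a perfect class separation
```

Recursing on each side of the chosen boundary (while the gain exceeds δ) yields the full discretization.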
Segmentation by natural partitioning (the 3-4-5 rule):
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Discretization methods for numeric data:
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning
[Figure: Step 4 of the 3-4-5 rule example — the range (−$400 … $5,000) partitioned into intervals such as (−$400 … −$300), (−$300 … −$200), (−$200 … −$100), (−$100 … 0), and ($600 … $800)]
Many techniques can be applied recursively to provide a hierarchical partitioning of the attribute, i.e., a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.
Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set (but the semantics of the attributes and the relations among them must be considered):
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- Note the exceptions: weekday, month, quarter, year
min-max normalization:

$$v' = \frac{v - \min_A}{\max_A - \min_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

z-score normalization:

$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$

normalization by decimal scaling:

$$v' = \frac{v}{10^j}, \text{ where } j \text{ is the smallest integer such that } \max(|v'|) < 1$$

[Figure: automatic hierarchy generation example with attributes of 3,567 and 674,339 distinct values]
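The three normalizations above are one-liners. The example numbers (an income of 73,600 in a min-max range of 12,000–98,000; mean 54,000 and standard deviation 16,000 for z-score) are illustrative assumptions:

```python
# min-max: map v from [mn, mx] onto [new_mn, new_mx].
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

# z-score: center on the mean, scale by the standard deviation.
def z_score(v, mean, std):
    return (v - mean) / std

# decimal scaling: j is the smallest integer such that max(|v'|) < 1.
def decimal_scaling(v, j):
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))   # -> 0.716
print(z_score(73600, 54000, 16000))             # -> 1.225
print(decimal_scaling(-986, 3))                 # -> -0.986
```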
Manual discretization: the information needed to convert continuous values into discrete values is obtained from an expert in the domain area.

Assignment 1
Assignment 1
Data Discretization
Table 6: Discretization of the mathematical symbols
Feature | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2 | O#1 O#2
h02     |  1   0  |  0   0  |  2   1  |  0   0  |  2   1  |  2   2  |  0   0
h03     |  2   1  |  1   0  |  2   1  |  1   2  |  0   1  |  2   2  |  0   0
h11     |  1   0  |  1   1  |  0   0  |  1   1  |  0   0  |  1   1  |  0   0
h12     |  2   1  |  0   1  |  2   1  |  1   0  |  2   1  |  2   2  |  0   0
h13     |  2   1  |  2   1  |  1   1  |  2   2  |  0   1  |  0   2  |  0   0
h21     |  2   1  |  1   0  |  2   0  |  2   2  |  1   0  |  1   2  |  0   1
h22     |  2   1  |  2   0  |  2   1  |  1   0  |  1   1  |  2   2  |  0   0
h30     |  1   0  |  1   1  |  1   1  |  0   0  |  0   0  |  1   1  |  1   1
h31     |  2   0  |  2   1  |  0   1  |  2   1  |  1   0  |  2   2  |  0   1

(Each symbol has two columns: Orientation #1 and Orientation #2.)

Results
- Identify incomplete, noisy, and inconsistent data
- Use statistical techniques such as frequency counts and boxplots to detect such data
- Record the number of missing values and noisy data
- Also record which tuples are involved (if not too many)
Manual discretization
Binning, entropy,
Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but data preparation is still an active area of research
Data Discretization
Table 5: The invariance features for mathematical symbols
Feature | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2     | O#1      O#2
h02     | 0.86711  0.54536 | 0.58806  0.61814 | 0.88477  0.80491 | 0.73293  0.66253 | 0.91948  0.82281 | 2.213    2.15402 | 0.15565  0.16081
h03     | 0.18849  0.02198 | 0.05518  0.00880 | 0.14812  0.05006 | 0.05052  0.08034 | 0.02059  0.06182 | 0.71402  0.18761 | 0.00002  0.01299
h11     | 0.08184  0.02583 | 0.08122  0.05408 | 0.01660  0.03593 | 0.16291  0.03918 | 0.01081  0.02135 | 0.059    0.08548 | 0.00662  0.01091
h12     | 0.16839  0.0241  | 0.00895  0.01927 | 0.13137  0.01596 | 0.05135  0.01415 | 0.06653  0.03221 | 0.22918  0.33771 | 0.00547  0.00812
h13     | 0.12728  0.01231 | 0.07504  0.05894 | 0.06236  0.04019 | 0.11263  0.10883 | 0.00924  0.03237 | 0.00903  0.81689 | 0.00182  0.00205
h21     | 0.01923  0.01844 | 0.01626  0.00178 | 0.02861  0.00195 | 0.02107  0.01978 | 0.01543  0.01006 | 0.01181  0.11741 | 0.00775  0.01267
h22     | 0.24873  0.1193  | 0.18318  0.07934 | 0.21195  0.12116 | 0.1385   0.11662 | 0.15602  0.12365 | 0.63556  0.70659 | 0.03896  0.04902
h30     | 0.12638  0.00087 | 0.03664  0.01363 | 0.04551  0.01324 | 0.00799  0.0049  | 0.00388  0.00398 | 0.05279  0.03468 | 0.02263  0.04908
h31     | 0.04125  0.00535 | 0.05776  0.02165 | 0.00528  0.01841 | 0.07375  0.01161 | 0.00697  0.00606 | 0.08960  0.13071 | 0.00017  0.01069

(Each symbol has two columns: Orientation #1 and Orientation #2.)