
INST 766:

DATA MINING

LESSON 2: DATA PREPROCESSING

Lesson 2: Data Preprocessing INST 766: Data Mining


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 2


INST 766: Data Mining
What is Data?
Collection of data objects and their attributes

An attribute is a property or characteristic of an object
  Examples: eye color of a person, temperature, etc.
  An attribute is also known as a variable, field, characteristic, or feature

A collection of attributes describes an object
  An object is also known as a record, point, case, sample, entity, or instance

Concept: thing to be learned
Instance: individual example of a concept
Attributes: measuring aspects of an instance

Example record data:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Lesson 2: Data Preprocessing 3


INST 766: Data Mining
What is Data(2)?
Attribute Values:
Attribute values are numbers or symbols assigned to an
attribute

Distinction between attributes and attribute values


Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters

Different attributes can be mapped to the same set of values


Example: Attribute values for ID and age are integers
But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum
value
Lesson 2: Data Preprocessing 4
INST 766: Data Mining
What is Data(3)?
Types of Attributes:
There are different types of attributes:
Categorical : having discrete classes (e.g., red, blue, green)
(Categorical can be either nominal (unordered) or ordinal (ordered).)
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-
10), grades, height in {tall, medium, short}
Continuous: having any numerical value (e.g., quantity sold)
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts

Lesson 2: Data Preprocessing 5


INST 766: Data Mining
What is Data(4)?
Properties of Attribute Values:
The type of an attribute depends on which of the following
properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal attribute: distinctness


Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Lesson 2: Data Preprocessing 6
INST 766: Data Mining
Attribute Type / Description / Examples / Operations

Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ2 test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Lesson 2: Data Preprocessing 7


INST 766: Data Mining
Attribute Level / Transformation / Comments

Nominal
  Transformation: any permutation of values
  Comment: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
  Transformation: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comment: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
  Transformation: new_value = a * old_value
  Comment: Length can be measured in meters or feet.
Lesson 2: Data Preprocessing 8
INST 766: Data Mining
What is Data(5)?

Discrete and Continuous Attributes:


Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes

Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Lesson 2: Data Preprocessing 9
INST 766: Data Mining
What is Data(6)?
Types of data sets:
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Lesson 2: Data Preprocessing 10
INST 766: Data Mining
What is Data(7)?

Important Characteristics of Structured Data:


Dimensionality
Curse of Dimensionality

Sparsity
Only presence counts

Resolution
Patterns depend on the scale

Lesson 2: Data Preprocessing 11


INST 766: Data Mining
What is Data(8)?
Record Data:
Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Lesson 2: Data Preprocessing 12


INST 766: Data Mining
What is Data(9)?
Data Matrix:
If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute

Such data set can be represented by an m by n matrix, where there are


m rows, one for each object, and n columns, one for each attribute

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1

Lesson 2: Data Preprocessing 13


INST 766: Data Mining
What is Data(10)?
Document Data:
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term
occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
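As an illustration (not part of the original slides), here is a minimal sketch of building such term-count vectors in Python; the documents and the fixed vocabulary below are hypothetical.

```python
from collections import Counter

# Hypothetical documents and vocabulary, for illustration only.
documents = [
    "team coach play ball score game lost team season team play play play score game game",
    "coach coach score game lost lost lost coach coach coach coach ball ball",
]
vocabulary = ["team", "coach", "play", "ball", "score", "game",
              "win", "lost", "timeout", "season"]

def term_vector(doc, vocab):
    """Return the term-count vector of a document over a fixed vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

for i, doc in enumerate(documents, start=1):
    print(f"Document {i}:", term_vector(doc, vocabulary))
```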

Lesson 2: Data Preprocessing 14


INST 766: Data Mining
What is Data(11)?
Transaction Data:
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitute a transaction, while the individual
products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Lesson 2: Data Preprocessing 15


INST 766: Data Mining
What is Data(12)?
Graph Data:
Examples: Generic graph and HTML Links

<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>

[Figure: a generic graph with labeled edges]

Lesson 2: Data Preprocessing 16


INST 766: Data Mining
What is Data(13)?
Chemical Data:
Benzene Molecule: C6H6

Lesson 2: Data Preprocessing 17


INST 766: Data Mining
What is Data(14)?
Ordered Data:
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Lesson 2: Data Preprocessing 18


INST 766: Data Mining
What is Data(15)?
Ordered Data:
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Lesson 2: Data Preprocessing 19


INST 766: Data Mining
What is Data(16)?
Ordered Data:
Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]

Lesson 2: Data Preprocessing 20


INST 766: Data Mining
What is Data(17)?
Data Quality:

What kinds of data quality problems?


How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:


Noise and outliers
missing values
duplicate data

Lesson 2: Data Preprocessing 21


INST 766: Data Mining
What is Data(18)?
Noise:
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

[Figure: two sine waves (left); the same two sine waves with noise added (right)]

Lesson 2: Data Preprocessing 22


INST 766: Data Mining
What is Data(19)?
Outliers:
Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set

Lesson 2: Data Preprocessing 23


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 24


INST 766: Data Mining
Knowledge Discovery Process
flow, according to CRISP-DM

[Figure: CRISP-DM process cycle, including a Monitoring step]
See www.crisp-dm.org for more information.

Lesson 2: Data Preprocessing 25


INST 766: Data Mining
Knowledge Discovery Process,
in practice

[Figure: the knowledge discovery process in practice, with Data Preparation and Monitoring highlighted]
Data Preparation is estimated to take 70-80% of the time and effort.

Lesson 2: Data Preprocessing 26


INST 766: Data Mining
Why Data Preprocessing?
Data in the real world is dirty (partly because of its huge size, i.e.,
several gigabytes or more)
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!


Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data

Lesson 2: Data Preprocessing 27


INST 766: Data Mining
Major Tasks in Data Preprocessing
HOW CAN THE DATA BE PREPROCESSED SO AS TO IMPROVE
THE EFFICIENCY AND EASE OF THE MINING PROCESS?

Data cleaning (techniques applied to remove noise and
correct inconsistencies in the data)
  Fill in missing values, smooth noisy data, identify or remove
  outliers, and resolve inconsistencies

Data integration (techniques that merge data from
multiple sources into a coherent data store)
  Integration of multiple databases, data cubes, or files

Lesson 2: Data Preprocessing 28


INST 766: Data Mining
Major Tasks in Data Preprocessing

Data transformation (e.g., normalization, which may improve the
accuracy and efficiency of data mining algorithms)
  Normalization and aggregation
Data reduction (reduces the data size, e.g., by aggregating or
eliminating redundant data)
  Obtains a reduced representation in volume that
  produces the same or similar analytical results
Data discretization
  Part of data reduction, but of particular
  importance, especially for numerical data

Lesson 2: Data Preprocessing 29


INST 766: Data Mining
Forms of data preprocessing

Lesson 2: Data Preprocessing 30


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 31


INST 766: Data Mining
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data

Lesson 2: Data Preprocessing 32


INST 766: Data Mining
Reasons for Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
  equipment malfunction
  data inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
  certain data not considered important at the time of entry
  history or changes of the data not registered
Missing data may need to be inferred.
Lesson 2: Data Preprocessing 33
INST 766: Data Mining
How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of missing
values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a
new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill
in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or decision tree
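A minimal sketch of the mean-based and class-conditional mean strategies above, assuming the data sits in a hypothetical pandas DataFrame with an income attribute and a class label (column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical data; column names are assumptions for illustration.
df = pd.DataFrame({
    "income": [125.0, 100.0, np.nan, 120.0, np.nan, 60.0],
    "class":  ["no", "no", "no", "no", "yes", "yes"],
})

# Use the attribute mean to fill in the missing value.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Smarter: use the attribute mean of all samples belonging to the same class.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```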
Lesson 2: Data Preprocessing 34
INST 766: Data Mining
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data

Lesson 2: Data Preprocessing 35


INST 766: Data Mining
How to Handle Noisy Data?
Binning method
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.

Clustering
detect and remove outliers

Combined computer and human inspection


detect suspicious values and check by human

Regression
smooth by fitting the data into regression functions

Lesson 2: Data Preprocessing 36


INST 766: Data Mining
Binning: Simple Discretization Methods
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.

The most straightforward


But outliers may dominate presentation
Skewed data is not handled well.

Equal-depth (frequency) partitioning:


It divides the range into N intervals, each containing
approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky.

Lesson 2: Data Preprocessing 37


INST 766: Data Mining
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
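A small sketch reproducing the equi-depth binning and the two smoothing variants shown above; the helper function names are mine, not from the slides.

```python
# Sorted data for price (in dollars), as in the example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Partition sorted values into bins holding the same number of samples."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value by its bin mean (rounded to the nearest integer).
    return [[int(sum(b) / len(b) + 0.5)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the two bin boundaries (min or max).
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print("Bins:          ", bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print("By bin means:  ", smooth_by_means(bins))
print("By boundaries: ", smooth_by_boundaries(bins))
```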

Lesson 2: Data Preprocessing 38


INST 766: Data Mining
Cluster Analysis

Lesson 2: Data Preprocessing 39


INST 766: Data Mining
Regression

[Figure: regression line y = x + 1 fitted to the data; the observed value Y1 at X1 is smoothed to the value Y1' on the line]

Lesson 2: Data Preprocessing 40


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 41


INST 766: Data Mining
Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
  E.g., how can a computer be sure that customer_id in one database refers to the
  same entity as customer_no in another database?
  Databases and data warehouses carry metadata that can be used to avoid
  errors in schema integration.
Detecting and resolving data value conflicts
  for the same real world entity, attribute values from different sources may be
  different
  possible reasons: different representations, different scales, e.g., metric vs.
  British units
  E.g., customer weight might be stored in pounds in one database and in
  kilograms in another; similarly, a price might be recorded in Dollars, Rupees
  or Birr, and may differ depending on included services such as breakfast and taxes.
Lesson 2: Data Preprocessing 42
INST 766: Data Mining
Handling Redundant Data in Data Integration

Redundant data often occur when multiple databases are integrated
  The same attribute may have different names in different databases
  One attribute may be a "derived" attribute in another table, e.g.,
  annual revenue
Redundant data may be detected by correlation analysis
  E.g., given two attributes, such analysis can measure how strongly one attribute
  implies the other, based on the available data set.
The correlation between attributes A and B can be measured by

  r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean
values of A and B, and σ_A and σ_B are the respective standard deviations
of A and B.
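A minimal sketch of this correlation check using numpy; the attribute values below are hypothetical.

```python
import numpy as np

# Hypothetical attribute values from an integrated data set.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 2.1, 2.9, 4.2, 5.0])

n = len(A)
# Sample correlation coefficient, matching the formula on the slide
# (sum of deviation products divided by (n - 1) and the standard deviations).
r_AB = np.sum((A - A.mean()) * (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(f"r_AB = {r_AB:.3f}")  # values near +1 suggest A and B may be redundant
```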
Lesson 2: Data Preprocessing 43
INST 766: Data Mining
Handling Redundant Data in Data Integration

If the resulting correlation coefficient is greater than 0, then A and B are
positively correlated
  i.e., the values of A increase as the values of B increase
  The higher the value, the more strongly each attribute implies the other
  A high value indicates that A (or B) may be redundant and could be removed

If the resulting value is 0 (zero), then A and B are independent and there is no
correlation between them.

If the resulting value is less than 0 (zero), i.e., negative, then A and B are
negatively correlated.
  That is, the values of one attribute increase as the values of the other decrease.
  Each attribute discourages the other.

Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Lesson 2: Data Preprocessing 44


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 45


INST 766: Data Mining
Data Transformation
Smoothing: remove noise from data, e.g., using techniques such as
binning, clustering and regression.
Aggregation: summarization, data cube construction. E.g., daily
sales data may be aggregated to compute weekly, monthly and annual
sales.
  This technique is typically used when constructing a data cube for
  analysis of the data at multiple granularities.
Generalization: concept hierarchy climbing
  Low level or "primitive" data are replaced by higher level
  concepts through the use of concept hierarchies.
  E.g., an attribute like street can be generalized to higher level
  concepts like city or country. Similarly, age can be mapped to
  higher level concepts such as young, middle-aged and senior,
  as sketched below.
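A small sketch of generalizing a numeric age attribute to higher-level concepts with pandas; the cut points and labels are assumptions, not part of the slides.

```python
import pandas as pd

ages = pd.Series([18, 23, 37, 45, 52, 61, 70])

# Map raw ages to higher-level concepts via an assumed concept hierarchy.
age_concept = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle_aged", "senior"],
)
print(pd.DataFrame({"age": ages, "age_concept": age_concept}))
```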
Lesson 2: Data Preprocessing 46
INST 766: Data Mining
Data Transformation
Normalization: attribute values are scaled to fall within a small, specified range
  e.g., -1.0 to 1.0 or 0.0 to 1.0
min-max normalization
z-score normalization
normalization by decimal scaling

Attribute/feature construction (Derived Attributes)


New attributes constructed from the given ones

Lesson 2: Data Preprocessing 47


INST 766: Data Mining
Data Transformation: Normalization
min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) * (new_max_A − new_min_A) + new_min_A
  (see the employee salary attribute example)
z-score normalization:
  v' = (v − mean_A) / stand_dev_A
  (see how it transforms the employee salary attribute)
normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
  (a sketch of all three transformations follows below)
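A minimal sketch of the three normalization methods using numpy on a hypothetical salary attribute; the values and the target range are assumptions for illustration.

```python
import numpy as np

salary = np.array([12000.0, 35000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (salary - salary.min()) / (salary.max() - salary.min()) * (new_max - new_min) + new_min

# Z-score normalization.
zscore = (salary - salary.mean()) / salary.std()

# Normalization by decimal scaling: divide by 10^j so that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(salary).max()))) + 1
decimal_scaled = salary / 10 ** j

print(minmax, zscore, decimal_scaled, sep="\n")
```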

Lesson 2: Data Preprocessing 48


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 49


INST 766: Data Mining
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set

Data reduction
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results

Data reduction strategies


Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 50


INST 766: Data Mining
Data Reduction Strategies :
Data Cube Aggregation(1)
The lowest level of a data cube
the aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.

Assume we have collected data from an electronics shop consisting of sales
per quarter for the years 2000, 2001 and 2002.

If we are interested in annual sales (i.e., the total per year) rather than
the total per quarter,

the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter (see the sketch below).
The resulting data are smaller in size, without loss of any information
necessary for the analysis task.
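A sketch of this quarterly-to-annual aggregation with pandas; the sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for an electronics shop.
sales = pd.DataFrame({
    "year":    [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224.0, 408.0, 350.0, 586.0, 301.0, 454.0, 390.0, 610.0],
})

# Aggregate the quarterly figures up to annual totals: the reduced
# representation is smaller but loses nothing needed for a yearly analysis.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```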
Lesson 2: Data Preprocessing 51
INST 766: Data Mining
Figure: Sales data for a given branch of AllElectronics for the years 1997
to 1999. On the left, the sales are shown per quarter. On the right, the
data are aggregated to provide the annual sales.
Lesson 2: Data Preprocessing 52
INST 766: Data Mining
Figure: A data cube for sales at AllElectronics.

Lesson 2: Data Preprocessing 53


INST 766: Data Mining
Data Reduction Strategies:
Data Cube Aggregation(2)
Multiple levels of aggregation in data cubes
  Further reduce the size of the data to deal with
  Each cell holds an aggregate value corresponding to a data point in multidimensional
  space.
  A concept hierarchy may exist for each attribute, allowing data to be analyzed at
  multiple levels of abstraction.
  E.g., a hierarchy for branch may allow branches to be grouped into regions based on
  their address.
Data cubes provide fast access to precomputed, summarized data, which
helps

Reference appropriate levels


Use the smallest representation which is enough to solve the task

Queries regarding aggregated information should be


answered using data cube, when possible

Lesson 2: Data Preprocessing 54


INST 766: Data Mining
Data Reduction Strategies: Dimensionality
Reduction(1)
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
E.g., a data set may consist of hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
  E.g., when deciding whether a customer is likely to
  purchase a popular new CD from a video library,
  attributes such as the customer's telephone_no are likely to be irrelevant,
  compared with attributes such as age and music_taste.

Attribute selection can be difficult and time consuming when the behavior of the data
is unknown.
Keeping irrelevant attributes in the data set slows down the mining process.
Removing them reduces the number of attributes appearing in the discovered patterns,
making the patterns easier to understand.

Lesson 2: Data Preprocessing 55


INST 766: Data Mining
Data Reduction Strategies: Dimensionality
Reduction(2)
It reduces the data size by removing irrelevant attributes.
Typically, methods of attribute subset selection are applied.
The goal of attribute subset selection is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the distribution obtained using
all attributes.

Benefits of a reduced set of attributes
  Reduces the number of attributes appearing in the discovered patterns
  Helps make the patterns easier to understand.
HOW CAN WE FIND A GOOD SUBSET OF THE ORIGINAL
ATTRIBUTES?
Heuristic methods (due to the exponential number of choices):
  Heuristic methods that explore a reduced search space are commonly used for attribute subset
  selection.
  These methods are typically greedy: at each step they make the locally best choice, in the hope
  of reaching an optimal solution.
  Such (statistical) methods often assume that attributes are independent of one another.
  In addition, many attribute evaluation measures can be used, such as the information gain
  measure used to build decision trees for classification.
Lesson 2: Data Preprocessing 56
INST 766: Data Mining
Heuristic Feature Selection Methods
There are 2^d possible attribute subsets of d features
Several heuristic feature selection methods:
Best single features under the feature independence
assumption: choose by significance tests.
Best step-wise feature selection:
The best single-feature is picked first
Then next best feature condition to the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination:
Optimal branch and bound:
Use feature elimination and backtracking
Lesson 2: Data Preprocessing 57
INST 766: Data Mining
Basic heuristic techniques of attribute subset selection
Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward


elimination

Decision-tree induction

Lesson 2: Data Preprocessing 58


INST 766: Data Mining
STEPWISE FORWARD SELECTION

1. The procedure starts with an empty set of attributes.

2. The best of the original attributes is determined and added to the set.

3. At each subsequent step/iteration, the best of the remaining original
   attributes is added to the set (a code sketch follows below).

E.g., forward selection:

  Initial attribute set: {A1, A2, A3, A4, A5, A6}

  Start with the empty set: { }

  The best original attributes are determined and added to the set:
    {A1}
    {A1, A4}
  Reduced attribute set:
    {A1, A4, A6}
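A sketch of greedy stepwise forward selection; the scoring function here is cross-validated accuracy of a decision tree, which is one possible evaluation measure rather than the specific method prescribed by the slides, and the data set is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 6 attributes A1..A6, only some of which are informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=1, random_state=0)
names = [f"A{i}" for i in range(1, 7)]

def score(columns):
    """Evaluate a candidate attribute subset by cross-validated accuracy."""
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, columns], y, cv=5).mean()

selected, remaining = [], list(range(6))
best_so_far = 0.0
while remaining:
    # At each iteration, try adding each remaining attribute and keep the best.
    gains = [(score(selected + [c]), c) for c in remaining]
    best_score, best_col = max(gains)
    if best_score <= best_so_far:        # stop when no attribute improves the score
        break
    selected.append(best_col)
    remaining.remove(best_col)
    best_so_far = best_score

print("Reduced attribute set:", [names[c] for c in selected])
```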
Lesson 2: Data Preprocessing 59
INST 766: Data Mining
STEPWISE BACKWARD ELIMINATION
1. This procedure starts with the full set of attributes.

2. At each step, it removes the worst attribute remaining in the set.

Eg. Initial attributes {A1,A2,A3,A4,A5,A6}


{A1,A3,A4,A5,A6}
{A1,A4,A5,A6}
Reduced attribute set
{A1,A4,A6}

Lesson 2: Data Preprocessing 60


INST 766: Data Mining
Figure: Greedy (heuristic) methods for attribute subset selection

Lesson 2: Data Preprocessing 61


INST 766: Data Mining
Data Reduction Strategies :
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion

Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole

Lesson 2: Data Preprocessing 62


INST 766: Data Mining
Data Reduction Strategies:
Data Compression

[Figure: lossless compression maps the Original Data to Compressed Data and back exactly;
lossy compression recovers only an approximation of the Original Data]

Lesson 2: Data Preprocessing 63


INST 766: Data Mining
Data Reduction Strategies: Wavelets
Discrete wavelet transform (DWT): linear signal
processing, multiresolutional analysis

Compressed approximation: store only a small fraction of


the strongest of the wavelet coefficients

Similar to discrete Fourier transform (DFT), but better


lossy compression, localized in space

DWT uses a hierarchical pyramid algorithm that halves the


data at each iteration, resulting in fast computational speed.

Lesson 2: Data Preprocessing 64


INST 766: Data Mining
Data Reduction Strategies: Wavelets

Method:
1) The length, L, of the input data vector must be an integer power of 2. This
   condition can be met by padding the data vector with zeros as necessary.

2) Each transform involves applying two functions. The first applies some data
   smoothing, such as a sum or weighted average. The second performs a
   weighted difference, which acts to bring out the detailed features of the data.

3) The two functions are applied to pairs of the input data, resulting in two sets of
   data of length L/2. In general, these represent a smoothed or low-frequency
   version of the input data and its high-frequency content, respectively.

4) The two functions are recursively applied to the data sets obtained in the
   previous loop, until the resulting data sets are of length 2.

5) A selection of values from the data sets obtained in the above iterations is
   designated the wavelet coefficients of the transformed data.

Note: The matrix must be orthonormal, so that the matrix inverse is just its
transpose.
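A minimal numpy sketch of one level of the Haar wavelet transform, illustrating the smoothing (pairwise average) and weighted-difference functions applied to pairs of the input; the input vector is hypothetical and already has a power-of-two length.

```python
import numpy as np

def haar_step(x):
    """One level of the (orthonormal) Haar DWT: pairwise averages and differences."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # low-frequency content, length L/2
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # high-frequency content, length L/2
    return smooth, detail

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])  # length already a power of 2
smooth, detail = haar_step(data)
print("smoothed:", smooth)
print("detail:  ", detail)
# Recursively applying haar_step to `smooth` yields the full set of wavelet coefficients.
```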

Lesson 2: Data Preprocessing 65


INST 766: Data Mining
Figure: Examples of wavelet families. The number next to a wavelet name
is the number of vanishing moments of the wavelet. This is a set of mathematical
relationships that the coefficients must satisfy and is related to the number of
coefficients.

Lesson 2: Data Preprocessing 66


INST 766: Data Mining
Data Reduction Strategies: PCA
Principal Component Analysis (PCA): also known as the
Karhunen-Loève, or K-L, method.

Given N data vectors from k-dimensions, find c <= k orthogonal vectors


that can be best used to represent data
The original data set is reduced to one consisting of N data vectors on c
principal components (reduced dimensions)
Each data vector is a linear combination of the c principal component
vectors
Works for numeric data only
Used when the number of dimensions is large

Lesson 2: Data Preprocessing 67


INST 766: Data Mining
Data Reduction Strategies: PCA
PCA Procedure:
The input data are normalized, so that each attribute falls within the
same range. This step helps ensure that attributes with large domains
will not dominate attributes with smaller domains.
PCA computes c orthonormal vectors that provide a basis for the
normalized input data. These are unit vectors that each point in a
direction perpendicular to the others. These vectors are referred to as
the principal components. The input data are linear combination of the
principal components.
The principal components are sorted in order of decreasing
"significance" or strength. The principal components essentially serve
as a new set of axes for the data, providing important information
about the variance in the data: the first axis shows the most variance,
the second axis shows the next highest variance, and so on
(see the figure on the next page).
Since the components are sorted in decreasing order of
"significance", the size of the data can be reduced by eliminating the
weaker components, that is, those with low variance. Using the
strongest principal components, it should be possible to reconstruct a
good approximation of the original data (see the sketch below).
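A short sketch of this procedure using numpy only (normalize, compute orthonormal components from the covariance matrix, keep the strongest ones); the data matrix is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                          # hypothetical data: 100 objects, 4 attributes
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)    # one nearly redundant attribute

# 1. Normalize so no attribute dominates because of its range.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Orthonormal principal components = eigenvectors of the covariance matrix.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing "significance" (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the c strongest components, giving a reduced representation.
c = 2
X_reduced = Z @ eigvecs[:, :c]
print("variance explained:", eigvals[:c] / eigvals.sum())
print("reduced shape:", X_reduced.shape)
```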

Lesson 2: Data Preprocessing 68


INST 766: Data Mining
Figure: Principal component analysis. Y1 and Y2 are the first two principal
components of the given data, which was originally mapped to the axes X1 and X2.

Lesson 2: Data Preprocessing 69


INST 766: Data Mining
Data Reduction Strategies:
Numerosity Reduction
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Log-linear models: obtain value at a point in m-D space as
the product on appropriate marginal subspaces

Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling

Lesson 2: Data Preprocessing 70


INST 766: Data Mining
Data Reduction Strategies:
Numerosity Reduction
Linear regression: Y = α + β X
Two parameters , α and β specify the line and are to be
estimated by using the data at hand.
using the least squares criterion to the known values of Y1,
Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated
by a product of lower-order tables.
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
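A small sketch of fitting the parametric regression models above by least squares with numpy; only the fitted parameters (not the data) would need to be stored. The data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 3.0 + 2.0 * X + rng.normal(scale=0.5, size=50)    # hypothetical data near a line

# Linear regression Y = alpha + beta * X via the least squares criterion.
beta, alpha = np.polyfit(X, Y, deg=1)
print(f"Y ≈ {alpha:.2f} + {beta:.2f} X")

# Multiple regression Y = b0 + b1*X1 + b2*X2 using least squares on a design matrix.
X1, X2 = X, rng.uniform(0, 5, size=50)
design = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("b0, b1, b2 =", np.round(b, 2))
```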
Lesson 2: Data Preprocessing 71
INST 766: Data Mining
Histograms

A popular data reduction technique
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.

[Figure: histogram of prices with buckets at 10000, 30000, 50000, 70000 and 90000]

Lesson 2: Data Preprocessing 72
INST 766: Data Mining
Discretization: Equal-Width

Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Count per bin:

Bin:    [64,67)  [67,70)  [70,73)  [73,76)  [76,79)  [79,82)  [82,85]
Count:     2        2        4        2        0        2        2


Equal Width, bins Low <= value < High

Lesson 2: Data Preprocessing 73


INST 766: Data Mining
Discretization: Equal-Width
may produce clumping

[Figure: equal-width histogram of salary in a corporation, with bins from [0 – 200,000) up to
[1,800,000 – 2,000,000]; the highest bin contains a count of only 1, illustrating clumping]

Lesson 2: Data Preprocessing 74


INST 766: Data Mining
Discretization: Equal-Height

Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Count per bin:

Bin:    [64 .. 69]  [70 .. 72]  [73 .. 81]  [83 .. 85]
Count:      4           4           4           2


Equal Height = 4, except for the last bin

Lesson 2: Data Preprocessing 75


INST 766: Data Mining
Discretization: Equal-height advantages

Generally preferred because avoids clumping


In practice, “almost-equal” height binning is used which
avoids clumping and gives more intuitive breakpoints
Additional considerations:
don’t split frequent values across bins
create separate bins for special values (e.g. 0)
readable breakpoints (e.g. round breakpoints)

Lesson 2: Data Preprocessing 76


INST 766: Data Mining
Histogram
V-Optimal Histogram: the histogram with the least variance.
  Histogram variance is a weighted sum of the original values that each
  bucket represents, where a bucket's weight is equal to the number of values
  in the bucket.

MaxDiff Histogram: considers the difference between each pair of adjacent
values. A bucket boundary is established between each pair for the pairs
having the β − 1 largest differences, where β is user-specified.

V-Optimal and MaxDiff histograms tend to be the most accurate and practical.

Lesson 2: Data Preprocessing 77


INST 766: Data Mining
Clustering

Partition data set into clusters, and one can store cluster
representation only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
Lesson 2: Data Preprocessing 78
INST 766: Data Mining
Sampling
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
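A sketch of simple random sampling with/without replacement and stratified sampling, assuming a hypothetical pandas DataFrame with a skewed class column; names and sizes are assumptions for illustration.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "class": rng.choice(["rare", "common"], size=1000, p=[0.05, 0.95]),
})

srswor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srswr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement

# Stratified sample: keep roughly the same class proportions as the database.
stratified = df.groupby("class", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(srswor["class"].value_counts(), stratified["class"].value_counts(), sep="\n")
```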

Lesson 2: Data Preprocessing 79


INST 766: Data Mining
Sampling

W O R
SRS le random
i m p h o u t
( s e wi t
l
samp ment)
p l a c e
re

SRSW
R

Raw Data
Lesson 2: Data Preprocessing 80
INST 766: Data Mining
Sampling

[Figure: Raw Data (left) and a Cluster/Stratified Sample of it (right)]

Lesson 2: Data Preprocessing 81


INST 766: Data Mining
Figure: Sampling can be used for data reduction

Lesson 2: Data Preprocessing 82


INST 766: Data Mining
Hierarchical Reduction
Use multi-resolution structure with different degrees of
reduction
Hierarchical clustering is often performed but tends to
define partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to
hierarchical representation
Hierarchical aggregation
An index tree hierarchically divides a data set into partitions
by value range of some attributes
Each partition can be considered as a bucket
Thus an index tree with aggregates stored at each node is a
hierarchical histogram

Lesson 2: Data Preprocessing 83


INST 766: Data Mining
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
☛ divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Example: raw values 10, 56, 68, 12, 45, 41, 23, 43, 13 discretized into equal-width intervals:

Range     Freq      Range     Freq
10 – 20    3        40 – 50    3
20 – 30    1        50 – 60    1
30 – 40    0        60 – 70    1

Lesson 2: Data Preprocessing 84


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 85


INST 766: Data Mining
Discretization and Concept
hierarchy

Discretization
reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to replace
actual data values.

Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).

Lesson 2: Data Preprocessing 86


INST 766: Data Mining
Figure: A concept hierarchy for the attribute price.

Lesson 2: Data Preprocessing 87


INST 766: Data Mining
Discretization and concept hierarchy
generation for numeric data
Binning (see sections before)

Histogram analysis (see sections before)

Clustering analysis (see sections before)

Entropy-based discretization

Segmentation by natural partitioning

Lesson 2: Data Preprocessing 88


INST 766: Data Mining
Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some
stopping criterion is met, e.g.,

  Ent(S) − E(T, S) > δ
Experiments show that it may reduce data size and improve
classification accuracy
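A minimal sketch of one binary split step of entropy-based discretization on a hypothetical labeled numeric attribute (the class labels below are invented for illustration).

```python
import numpy as np

def entropy(labels):
    """Ent(S): class entropy of a set of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Pick the boundary T that minimizes E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_T, best_E = None, np.inf
    for T in np.unique(values)[1:]:                  # candidate boundaries
        left, right = labels[values < T], labels[values >= T]
        E = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if E < best_E:
            best_T, best_E = T, E
    return best_T, best_E

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85], dtype=float)
play = np.array(list("nnyyynyynyyyyn"))              # hypothetical class labels
T, E = best_split(temps, play)
print(f"best boundary T = {T}, E(S, T) = {E:.3f}, information gain = {entropy(play) - E:.3f}")
```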

Lesson 2: Data Preprocessing 89


INST 766: Data Mining
Figure: Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.
Lesson 2: Data Preprocessing 90
INST 766: Data Mining
Concept hierarchy generation for
categorical data

Specification of a partial ordering of attributes explicitly at


the schema level by users or experts
Specification of a portion of a hierarchy by explicit data
grouping
Specification of a set of attributes, but not of their partial
ordering
Specification of only a partial set of attributes

Lesson 2: Data Preprocessing 91


INST 766: Data Mining
Specification of a set of attributes
Concept hierarchy can be automatically generated based on the
number of distinct values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest
level of the hierarchy.

country              15 distinct values

province_or_state    65 distinct values

city                 3,567 distinct values

street               674,339 distinct values
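A small sketch of this automatic ordering by distinct-value counts, assuming a hypothetical location DataFrame; real counts would of course come from the full data set.

```python
import pandas as pd

# Hypothetical location data for illustration only.
df = pd.DataFrame({
    "country":           ["Canada", "Canada", "USA", "USA", "USA"],
    "province_or_state": ["BC", "ON", "NY", "NY", "CA"],
    "city":              ["Vancouver", "Toronto", "Albany", "Buffalo", "Fresno"],
    "street":            ["Main St", "King St", "Elm St", "Oak St", "Pine St"],
})

# Order attributes by number of distinct values: the attribute with the fewest
# distinct values goes to the top of the hierarchy, the one with the most to the bottom.
hierarchy = df.nunique().sort_values()
print(hierarchy)
```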


Lesson 2: Data Preprocessing 92
INST 766: Data Mining
