
INST 766:

DATA MINING

LESSON 2: DATA PREPROCESSING

Lesson 2: Data Preprocessing INST 766: Data Mining


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 2


INST 766: Data Mining
What is Data?
Collection of data objects and their attributes

An attribute is a property or characteristic of an object
  Examples: eye color of a person, temperature, etc.
  An attribute is also known as a variable, field, characteristic, or feature

A collection of attributes describes an object
  An object is also known as a record, point, case, sample, entity, or instance

Concept: thing to be learned
Instance: individual example of a concept
Attributes: measuring aspects of an instance

Example record data:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Lesson 2: Data Preprocessing 3


INST 766: Data Mining
What is Data(2)?
Attribute Values:
Attribute values are numbers or symbols assigned to an
attribute

Distinction between attributes and attribute values


Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters

Different attributes can be mapped to the same set of values


Example: Attribute values for ID and age are integers
But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum
value
Lesson 2: Data Preprocessing 4
INST 766: Data Mining
What is Data(3)?
Types of Attributes:
There are different types of attributes:
Categorical : having discrete classes (e.g., red, blue, green)
(Categorical can be either nominal (unordered) or ordinal (ordered).)
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-
10), grades, height in {tall, medium, short}
Continuous: having any numerical value (e.g., quantity sold)
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts

Lesson 2: Data Preprocessing 5


INST 766: Data Mining
What is Data(4)?
Properties of Attribute Values:
The type of an attribute depends on which of the following
properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal attribute: distinctness


Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Lesson 2: Data Preprocessing 6
INST 766: Data Mining
Attribute Type / Description / Examples / Operations

Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ2 test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Lesson 2: Data Preprocessing 7


INST 766: Data Mining
Attribute Level / Transformation / Comments

Nominal
  Transformation: any permutation of values
  Comment: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
  Transformation: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comment: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
  Transformation: new_value = a * old_value
  Comment: Length can be measured in meters or feet.
Lesson 2: Data Preprocessing 8
INST 766: Data Mining
What is Data(5)?

Discrete and Continuous Attributes:


Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes

Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Lesson 2: Data Preprocessing 9
INST 766: Data Mining
What is Data(6)?
Types of data sets:
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Lesson 2: Data Preprocessing 10
INST 766: Data Mining
What is Data(7)?

Important Characteristics of Structured Data:


Dimensionality
Curse of Dimensionality

Sparsity
Only presence counts

Resolution
Patterns depend on the scale

Lesson 2: Data Preprocessing 11


INST 766: Data Mining
What is Data(8)?
Record Data:
Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Lesson 2: Data Preprocessing 12


INST 766: Data Mining
What is Data(9)?
Data Matrix:
If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute

Such data set can be represented by an m by n matrix, where there are


m rows, one for each object, and n columns, one for each attribute

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1

Lesson 2: Data Preprocessing 13


INST 766: Data Mining
What is Data(10)?
Document Data:
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term
occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
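As an illustration (not part of the original slides), here is a minimal sketch of building such term-count vectors in Python; the documents and the fixed vocabulary below are hypothetical.

```python
from collections import Counter

# Hypothetical documents and vocabulary, for illustration only.
documents = [
    "team coach play ball score game lost team season team play play play score game game",
    "coach coach score game lost lost lost coach coach coach coach ball ball",
]
vocabulary = ["team", "coach", "play", "ball", "score", "game",
              "win", "lost", "timeout", "season"]

def term_vector(doc, vocab):
    """Return the term-count vector of a document over a fixed vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

for i, doc in enumerate(documents, start=1):
    print(f"Document {i}:", term_vector(doc, vocabulary))
```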

Lesson 2: Data Preprocessing 14


INST 766: Data Mining
What is Data(11)?
Transaction Data:
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitute a transaction, while the individual
products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Lesson 2: Data Preprocessing 15


INST 766: Data Mining
What is Data(12)?
Graph Data:
Examples: Generic graph and HTML Links

<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>

[Figure: a generic graph with labeled edges]

Lesson 2: Data Preprocessing 16


INST 766: Data Mining
What is Data(13)?
Chemical Data:
Benzene Molecule: C6H6

Lesson 2: Data Preprocessing 17


INST 766: Data Mining
What is Data(14)?
Ordered Data:
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Lesson 2: Data Preprocessing 18


INST 766: Data Mining
What is Data(15)?
Ordered Data:
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Lesson 2: Data Preprocessing 19


INST 766: Data Mining
What is Data(16)?
Ordered Data:
Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]

Lesson 2: Data Preprocessing 20


INST 766: Data Mining
What is Data(17)?
Data Quality:

What kinds of data quality problems?


How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:


Noise and outliers
missing values
duplicate data

Lesson 2: Data Preprocessing 21


INST 766: Data Mining
What is Data(18)?
Noise:
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

[Figure: two sine waves (left); the same two sine waves with noise added (right)]

Lesson 2: Data Preprocessing 22


INST 766: Data Mining
What is Data(19)?
Outliers:
Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set

Lesson 2: Data Preprocessing 23


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 24


INST 766: Data Mining
Knowledge Discovery Process
flow, according to CRISP-DM

[Figure: CRISP-DM process cycle, including a Monitoring step]
See www.crisp-dm.org for more information.

Lesson 2: Data Preprocessing 25


INST 766: Data Mining
Knowledge Discovery Process,
in practice

[Figure: the knowledge discovery process in practice, with Data Preparation and Monitoring highlighted]
Data Preparation is estimated to take 70-80% of the time and effort.

Lesson 2: Data Preprocessing 26


INST 766: Data Mining
Why Data Preprocessing?
Data in the real world is dirty (partly because of its huge size, i.e.,
several gigabytes or more)
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!


Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data

Lesson 2: Data Preprocessing 27


INST 766: Data Mining
Major Tasks in Data Preprocessing
HOW CAN THE DATA BE PREPROCESSED SO AS TO IMPROVE
THE EFFICIENCY AND EASE OF THE MINING PROCESS?

Data cleaning (techniques applied to remove noise and
correct inconsistencies in the data)
  Fill in missing values, smooth noisy data, identify or remove
  outliers, and resolve inconsistencies

Data integration (techniques that merge data from
multiple sources into a coherent data store)
  Integration of multiple databases, data cubes, or files

Lesson 2: Data Preprocessing 28


INST 766: Data Mining
Major Tasks in Data Preprocessing

Data transformation (e.g., normalization, which may improve the
accuracy and efficiency of data mining algorithms)
  Normalization and aggregation
Data reduction (reduces the data size, e.g., by aggregating or
eliminating redundant data)
  Obtains a reduced representation in volume that
  produces the same or similar analytical results
Data discretization
  Part of data reduction, but of particular
  importance, especially for numerical data

Lesson 2: Data Preprocessing 29


INST 766: Data Mining
Forms of data preprocessing

Lesson 2: Data Preprocessing 30


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 31


INST 766: Data Mining
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data

Lesson 2: Data Preprocessing 32


INST 766: Data Mining
Reasons for Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
  equipment malfunction
  data inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
  certain data not considered important at the time of entry
  history or changes of the data not registered
Missing data may need to be inferred.
Lesson 2: Data Preprocessing 33
INST 766: Data Mining
How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of missing
values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a
new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill
in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or decision tree
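A minimal sketch of the mean-based and class-conditional mean strategies above, assuming the data sits in a hypothetical pandas DataFrame with an income attribute and a class label (column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical data; column names are assumptions for illustration.
df = pd.DataFrame({
    "income": [125.0, 100.0, np.nan, 120.0, np.nan, 60.0],
    "class":  ["no", "no", "no", "no", "yes", "yes"],
})

# Use the attribute mean to fill in the missing value.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Smarter: use the attribute mean of all samples belonging to the same class.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```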
Lesson 2: Data Preprocessing 34
INST 766: Data Mining
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data

Lesson 2: Data Preprocessing 35


INST 766: Data Mining
How to Handle Noisy Data?
Binning method
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.

Clustering
detect and remove outliers

Combined computer and human inspection


detect suspicious values and check by human

Regression
smooth by fitting the data into regression functions

Lesson 2: Data Preprocessing 36


INST 766: Data Mining
Binning: Simple Discretization Methods
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.

The most straightforward


But outliers may dominate presentation
Skewed data is not handled well.

Equal-depth (frequency) partitioning:


It divides the range into N intervals, each containing
approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky.

Lesson 2: Data Preprocessing 37


INST 766: Data Mining
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
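A small sketch reproducing the equi-depth binning and the two smoothing variants shown above; the helper function names are mine, not from the slides.

```python
# Sorted data for price (in dollars), as in the example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Partition sorted values into bins holding the same number of samples."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value by its bin mean (rounded to the nearest integer).
    return [[int(sum(b) / len(b) + 0.5)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the two bin boundaries (min or max).
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print("Bins:          ", bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print("By bin means:  ", smooth_by_means(bins))
print("By boundaries: ", smooth_by_boundaries(bins))
```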

Lesson 2: Data Preprocessing 38


INST 766: Data Mining
Cluster Analysis

Lesson 2: Data Preprocessing 39


INST 766: Data Mining
Regression

[Figure: regression line y = x + 1 fitted to the data; the observed value Y1 at X1 is smoothed to the value Y1' on the line]

Lesson 2: Data Preprocessing 40


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 41


INST 766: Data Mining
Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
  E.g., how can a computer be sure that customer_id in one database refers to the
  same entity as customer_no in another database?
  Databases and data warehouses carry metadata that can be used to avoid
  errors in schema integration.
Detecting and resolving data value conflicts
  for the same real world entity, attribute values from different sources may be
  different
  possible reasons: different representations, different scales, e.g., metric vs.
  British units
  E.g., customer weight might be stored in pounds in one database and in
  kilograms in another; similarly, a price might be recorded in Dollars, Rupees
  or Birr, and may differ depending on included services such as breakfast and taxes.
Lesson 2: Data Preprocessing 42
INST 766: Data Mining
Handling Redundant Data in Data Integration

Redundant data often occur when multiple databases are integrated
  The same attribute may have different names in different databases
  One attribute may be a "derived" attribute in another table, e.g.,
  annual revenue
Redundant data may be detected by correlation analysis
  E.g., given two attributes, such analysis can measure how strongly one attribute
  implies the other, based on the available data set.
The correlation between attributes A and B can be measured by

  r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean
values of A and B, and σ_A and σ_B are the respective standard deviations
of A and B.
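A minimal sketch of this correlation check using numpy; the attribute values below are hypothetical.

```python
import numpy as np

# Hypothetical attribute values from an integrated data set.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 2.1, 2.9, 4.2, 5.0])

n = len(A)
# Sample correlation coefficient, matching the formula on the slide
# (sum of deviation products divided by (n - 1) and the standard deviations).
r_AB = np.sum((A - A.mean()) * (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(f"r_AB = {r_AB:.3f}")  # values near +1 suggest A and B may be redundant
```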
Lesson 2: Data Preprocessing 43
INST 766: Data Mining
Handling Redundant Data in Data Integration

If the resulting correlation coefficient is greater than 0, then A and B are
positively correlated
  i.e., the values of A increase as the values of B increase
  The higher the value, the more strongly each attribute implies the other
  A high value indicates that A (or B) may be redundant and could be removed

If the resulting value is 0 (zero), then A and B are independent and there is no
correlation between them.

If the resulting value is less than 0 (zero), i.e., negative, then A and B are
negatively correlated.
  That is, the values of one attribute increase as the values of the other decrease.
  Each attribute discourages the other.

Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Lesson 2: Data Preprocessing 44


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 45


INST 766: Data Mining
Data Transformation
Smoothing: remove noise from data, e.g., using techniques such as
binning, clustering and regression.
Aggregation: summarization, data cube construction. E.g., daily
sales data may be aggregated to compute weekly, monthly and annual
sales.
  This technique is typically used when constructing a data cube for
  analysis of the data at multiple granularities.
Generalization: concept hierarchy climbing
  Low level or "primitive" data are replaced by higher level
  concepts through the use of concept hierarchies.
  E.g., an attribute like street can be generalized to higher level
  concepts like city or country. Similarly, age can be mapped to
  higher level concepts such as young, middle-aged and senior,
  as sketched below.
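A small sketch of generalizing a numeric age attribute to higher-level concepts with pandas; the cut points and labels are assumptions, not part of the slides.

```python
import pandas as pd

ages = pd.Series([18, 23, 37, 45, 52, 61, 70])

# Map raw ages to higher-level concepts via an assumed concept hierarchy.
age_concept = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle_aged", "senior"],
)
print(pd.DataFrame({"age": ages, "age_concept": age_concept}))
```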
Lesson 2: Data Preprocessing 46
INST 766: Data Mining
Data Transformation
Normalization: attribute values are scaled to fall within a small, specified range
  e.g., -1.0 to 1.0 or 0.0 to 1.0
min-max normalization
z-score normalization
normalization by decimal scaling

Attribute/feature construction (Derived Attributes)


New attributes constructed from the given ones

Lesson 2: Data Preprocessing 47


INST 766: Data Mining
Data Transformation: Normalization
min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) * (new_max_A − new_min_A) + new_min_A
  (see the employee salary attribute example)
z-score normalization:
  v' = (v − mean_A) / stand_dev_A
  (see how it transforms the employee salary attribute)
normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
  (a sketch of all three transformations follows below)
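A minimal sketch of the three normalization methods using numpy on a hypothetical salary attribute; the values and the target range are assumptions for illustration.

```python
import numpy as np

salary = np.array([12000.0, 35000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (salary - salary.min()) / (salary.max() - salary.min()) * (new_max - new_min) + new_min

# Z-score normalization.
zscore = (salary - salary.mean()) / salary.std()

# Normalization by decimal scaling: divide by 10^j so that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(salary).max()))) + 1
decimal_scaled = salary / 10 ** j

print(minmax, zscore, decimal_scaled, sep="\n")
```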

Lesson 2: Data Preprocessing 48


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 49


INST 766: Data Mining
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set

Data reduction
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results

Data reduction strategies


Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 50


INST 766: Data Mining
Data Reduction Strategies :
Data Cube Aggregation(1)
The lowest level of a data cube
the aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.

Assume we have collected data from an electronics shop consisting of sales
per quarter for the years 2000, 2001 and 2002.

If we are interested in annual sales (i.e., the total per year) rather than
the total per quarter,

the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter (see the sketch below).
The resulting data are smaller in size, without loss of any information
necessary for the analysis task.
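A sketch of this quarterly-to-annual aggregation with pandas; the sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for an electronics shop.
sales = pd.DataFrame({
    "year":    [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224.0, 408.0, 350.0, 586.0, 301.0, 454.0, 390.0, 610.0],
})

# Aggregate the quarterly figures up to annual totals: the reduced
# representation is smaller but loses nothing needed for a yearly analysis.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```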
Lesson 2: Data Preprocessing 51
INST 766: Data Mining
Figure: Sales data for a given branch of AllElectronics for the years 1997
to 1999. On the left, the sales are shown per quarter. On the right, the
data are aggregated to provide the annual sales.
Lesson 2: Data Preprocessing 52
INST 766: Data Mining
Figure: A data cube for sales at AllElectronics.

Lesson 2: Data Preprocessing 53


INST 766: Data Mining
Data Reduction Strategies:
Data Cube Aggregation(2)
Multiple levels of aggregation in data cubes
  Further reduce the size of the data to deal with
  Each cell holds an aggregate value corresponding to a data point in multidimensional
  space.
  A concept hierarchy may exist for each attribute, allowing data to be analyzed at
  multiple levels of abstraction.
  E.g., a hierarchy for branch may allow branches to be grouped into regions based on
  their address.
Data cubes provide fast access to precomputed, summarized data, which
helps

Reference appropriate levels


Use the smallest representation which is enough to solve the task

Queries regarding aggregated information should be


answered using data cube, when possible

Lesson 2: Data Preprocessing 54


INST 766: Data Mining
Data Reduction Strategies: Dimensionality
Reduction(1)
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
E.g., a data set may consist of hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
  E.g., when deciding whether a customer is likely to
  purchase a popular new CD from a video library,
  attributes such as the customer's telephone_no are likely to be irrelevant,
  compared with attributes such as age and music_taste.

Attribute selection can be difficult and time consuming when the behavior of the data
is unknown.
Keeping irrelevant attributes in the data set slows down the mining process.
Removing them reduces the number of attributes appearing in the discovered patterns,
making the patterns easier to understand.

Lesson 2: Data Preprocessing 55


INST 766: Data Mining
Data Reduction Strategies: Dimensionality
Reduction(2)
It reduces the data size by removing irrelevant attributes.
Typically, methods of attribute subset selection are applied.
The goal of attribute subset selection is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the distribution obtained using
all attributes.

Benefits of a reduced set of attributes
  Reduces the number of attributes appearing in the discovered patterns
  Helps make the patterns easier to understand.
HOW CAN WE FIND A GOOD SUBSET OF THE ORIGINAL
ATTRIBUTES?
Heuristic methods (due to the exponential number of choices):
  Heuristic methods that explore a reduced search space are commonly used for attribute subset
  selection.
  These methods are typically greedy: at each step they make the locally best choice, in the hope
  of reaching an optimal solution.
  Such (statistical) methods often assume that attributes are independent of one another.
  In addition, many attribute evaluation measures can be used, such as the information gain
  measure used to build decision trees for classification.
Lesson 2: Data Preprocessing 56
INST 766: Data Mining
Heuristic Feature Selection Methods
There are 2^d possible attribute subsets of d features
Several heuristic feature selection methods:
Best single features under the feature independence
assumption: choose by significance tests.
Best step-wise feature selection:
The best single-feature is picked first
Then next best feature condition to the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination:
Optimal branch and bound:
Use feature elimination and backtracking
Lesson 2: Data Preprocessing 57
INST 766: Data Mining
Basic heuristic techniques of attribute subset selection
Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward


elimination

Decision-tree induction

Lesson 2: Data Preprocessing 58


INST 766: Data Mining
STEPWISE FORWARD SELECTION

1. The procedure starts with an empty set of attributes.

2. The best of the original attributes is determined and added to the set.

3. At each subsequent step/iteration, the best of the remaining original
   attributes is added to the set (a code sketch follows below).

E.g., forward selection:

  Initial attribute set: {A1, A2, A3, A4, A5, A6}

  Start with the empty set: { }

  The best original attributes are determined and added to the set:
    {A1}
    {A1, A4}
  Reduced attribute set:
    {A1, A4, A6}
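A sketch of greedy stepwise forward selection; the scoring function here is cross-validated accuracy of a decision tree, which is one possible evaluation measure rather than the specific method prescribed by the slides, and the data set is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 6 attributes A1..A6, only some of which are informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=1, random_state=0)
names = [f"A{i}" for i in range(1, 7)]

def score(columns):
    """Evaluate a candidate attribute subset by cross-validated accuracy."""
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, columns], y, cv=5).mean()

selected, remaining = [], list(range(6))
best_so_far = 0.0
while remaining:
    # At each iteration, try adding each remaining attribute and keep the best.
    gains = [(score(selected + [c]), c) for c in remaining]
    best_score, best_col = max(gains)
    if best_score <= best_so_far:        # stop when no attribute improves the score
        break
    selected.append(best_col)
    remaining.remove(best_col)
    best_so_far = best_score

print("Reduced attribute set:", [names[c] for c in selected])
```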
Lesson 2: Data Preprocessing 59
INST 766: Data Mining
STEPWISE BACKWARD ELIMINATION
1. This procedure starts with the full set of attributes.

2. At each step, it removes the worst attribute remaining in the set.

Eg. Initial attributes {A1,A2,A3,A4,A5,A6}


{A1,A3,A4,A5,A6}
{A1,A4,A5,A6}
Reduced attribute set
{A1,A4,A6}

Lesson 2: Data Preprocessing 60


INST 766: Data Mining
Figure: Greedy (heuristic) methods for attribute subset selection

Lesson 2: Data Preprocessing 61


INST 766: Data Mining
Data Reduction Strategies :
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion

Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole

Lesson 2: Data Preprocessing 62


INST 766: Data Mining
Data Reduction Strategies:
Data Compression

[Figure: lossless compression maps the Original Data to Compressed Data and back exactly;
lossy compression recovers only an approximation of the Original Data]

Lesson 2: Data Preprocessing 63


INST 766: Data Mining
Data Reduction Strategies: Wavelets
Discrete wavelet transform (DWT): linear signal
processing, multiresolutional analysis

Compressed approximation: store only a small fraction of


the strongest of the wavelet coefficients

Similar to discrete Fourier transform (DFT), but better


lossy compression, localized in space

DWT uses a hierarchical pyramid algorithm that halves the


data at each iteration, resulting in fast computational speed.

Lesson 2: Data Preprocessing 64


INST 766: Data Mining
Data Reduction Strategies: Wavelets

Method:
1) The length, L, of the input data vector must be an integer power of 2. This
   condition can be met by padding the data vector with zeros as necessary.

2) Each transform involves applying two functions. The first applies some data
   smoothing, such as a sum or weighted average. The second performs a
   weighted difference, which acts to bring out the detailed features of the data.

3) The two functions are applied to pairs of the input data, resulting in two sets of
   data of length L/2. In general, these represent a smoothed or low-frequency
   version of the input data and its high-frequency content, respectively.

4) The two functions are recursively applied to the data sets obtained in the
   previous loop, until the resulting data sets are of length 2.

5) A selection of values from the data sets obtained in the above iterations is
   designated the wavelet coefficients of the transformed data.

Note: The matrix must be orthonormal, so that the matrix inverse is just its
transpose.
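A minimal numpy sketch of one level of the Haar wavelet transform, illustrating the smoothing (pairwise average) and weighted-difference functions applied to pairs of the input; the input vector is hypothetical and already has a power-of-two length.

```python
import numpy as np

def haar_step(x):
    """One level of the (orthonormal) Haar DWT: pairwise averages and differences."""
    x = np.asarray(x, dtype=float)
    pairs = x.reshape(-1, 2)
    smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # low-frequency content, length L/2
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # high-frequency content, length L/2
    return smooth, detail

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])  # length already a power of 2
smooth, detail = haar_step(data)
print("smoothed:", smooth)
print("detail:  ", detail)
# Recursively applying haar_step to `smooth` yields the full set of wavelet coefficients.
```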

Lesson 2: Data Preprocessing 65


INST 766: Data Mining
Figure: Examples of wavelet families. The number next to a wavelet name
is the number of vanishing moments of the wavelet. This is a set of mathematical
relationships that the coefficients must satisfy and is related to the number of
coefficients.

Lesson 2: Data Preprocessing 66


INST 766: Data Mining
Data Reduction Strategies: PCA
Principal Component Analysis (PCA): also known as the
Karhunen-Loève, or K-L, method.

Given N data vectors from k-dimensions, find c <= k orthogonal vectors


that can be best used to represent data
The original data set is reduced to one consisting of N data vectors on c
principal components (reduced dimensions)
Each data vector is a linear combination of the c principal component
vectors
Works for numeric data only
Used when the number of dimensions is large

Lesson 2: Data Preprocessing 67


INST 766: Data Mining
Data Reduction Strategies: PCA
PCA Procedure:
The input data are normalized, so that each attribute falls within the
same range. This step helps ensure that attributes with large domains
will not dominate attributes with smaller domains.
PCA computes c orthonormal vectors that provide a basis for the
normalized input data. These are unit vectors that each point in a
direction perpendicular to the others. These vectors are referred to as
the principal components. The input data are linear combination of the
principal components.
The principal components are sorted in order of decreasing
"significance" or strength. The principal components essentially serve
as a new set of axes for the data, providing important information
about the variance in the data: the first axis shows the most variance,
the second axis shows the next highest variance, and so on
(see the figure on the next page).
Since the components are sorted in decreasing order of
"significance", the size of the data can be reduced by eliminating the
weaker components, that is, those with low variance. Using the
strongest principal components, it should be possible to reconstruct a
good approximation of the original data (see the sketch below).
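A short sketch of this procedure using numpy only (normalize, compute orthonormal components from the covariance matrix, keep the strongest ones); the data matrix is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                          # hypothetical data: 100 objects, 4 attributes
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)    # one nearly redundant attribute

# 1. Normalize so no attribute dominates because of its range.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Orthonormal principal components = eigenvectors of the covariance matrix.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing "significance" (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the c strongest components, giving a reduced representation.
c = 2
X_reduced = Z @ eigvecs[:, :c]
print("variance explained:", eigvals[:c] / eigvals.sum())
print("reduced shape:", X_reduced.shape)
```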

Lesson 2: Data Preprocessing 68


INST 766: Data Mining
Figure: Principal component analysis. Y1 and Y2 are the first two principal
components of the given data, which was originally mapped to the axes X1 and X2.

Lesson 2: Data Preprocessing 69


INST 766: Data Mining
Data Reduction Strategies:
Numerosity Reduction
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Log-linear models: obtain value at a point in m-D space as
the product on appropriate marginal subspaces

Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling

Lesson 2: Data Preprocessing 70


INST 766: Data Mining
Data Reduction Strategies:
Numerosity Reduction
Linear regression: Y = α + β X
Two parameters , α and β specify the line and are to be
estimated by using the data at hand.
using the least squares criterion to the known values of Y1,
Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated
by a product of lower-order tables.
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
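A small sketch of fitting the parametric regression models above by least squares with numpy; only the fitted parameters (not the data) would need to be stored. The data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 3.0 + 2.0 * X + rng.normal(scale=0.5, size=50)    # hypothetical data near a line

# Linear regression Y = alpha + beta * X via the least squares criterion.
beta, alpha = np.polyfit(X, Y, deg=1)
print(f"Y ≈ {alpha:.2f} + {beta:.2f} X")

# Multiple regression Y = b0 + b1*X1 + b2*X2 using least squares on a design matrix.
X1, X2 = X, rng.uniform(0, 5, size=50)
design = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("b0, b1, b2 =", np.round(b, 2))
```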
Lesson 2: Data Preprocessing 71
INST 766: Data Mining
Histograms

A popular data reduction technique
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.

[Figure: histogram of prices with buckets at 10000, 30000, 50000, 70000 and 90000]

Lesson 2: Data Preprocessing 72
INST 766: Data Mining
Discretization: Equal-Width

Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Count per bin:

Bin:    [64,67)  [67,70)  [70,73)  [73,76)  [76,79)  [79,82)  [82,85]
Count:     2        2        4        2        0        2        2


Equal Width, bins Low <= value < High

Lesson 2: Data Preprocessing 73


INST 766: Data Mining
Discretization: Equal-Width
may produce clumping

[Figure: equal-width histogram of salary in a corporation, with bins from [0 – 200,000) up to
[1,800,000 – 2,000,000]; the highest bin contains a count of only 1, illustrating clumping]

Lesson 2: Data Preprocessing 74


INST 766: Data Mining
Discretization: Equal-Height

Temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85

Count per bin:

Bin:    [64 .. 69]  [70 .. 72]  [73 .. 81]  [83 .. 85]
Count:      4           4           4           2


Equal Height = 4, except for the last bin

Lesson 2: Data Preprocessing 75


INST 766: Data Mining
Discretization: Equal-height advantages

Generally preferred because avoids clumping


In practice, “almost-equal” height binning is used which
avoids clumping and gives more intuitive breakpoints
Additional considerations:
don’t split frequent values across bins
create separate bins for special values (e.g. 0)
readable breakpoints (e.g. round breakpoints)

Lesson 2: Data Preprocessing 76


INST 766: Data Mining
Histogram
V-Optimal Histogram: the histogram with the least variance.
  Histogram variance is a weighted sum of the original values that each
  bucket represents, where a bucket's weight is equal to the number of values
  in the bucket.

MaxDiff Histogram: considers the difference between each pair of adjacent
values. A bucket boundary is established between each pair for the pairs
having the β − 1 largest differences, where β is user-specified.

V-Optimal and MaxDiff histograms tend to be the most accurate and practical.

Lesson 2: Data Preprocessing 77


INST 766: Data Mining
Clustering

Partition data set into clusters, and one can store cluster
representation only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
Lesson 2: Data Preprocessing 78
INST 766: Data Mining
Sampling
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
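A sketch of simple random sampling with/without replacement and stratified sampling, assuming a hypothetical pandas DataFrame with a skewed class column; names and sizes are assumptions for illustration.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "class": rng.choice(["rare", "common"], size=1000, p=[0.05, 0.95]),
})

srswor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srswr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement

# Stratified sample: keep roughly the same class proportions as the database.
stratified = df.groupby("class", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(srswor["class"].value_counts(), stratified["class"].value_counts(), sep="\n")
```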

Lesson 2: Data Preprocessing 79


INST 766: Data Mining
Sampling

W O R
SRS le random
i m p h o u t
( s e wi t
l
samp ment)
p l a c e
re

SRSW
R

Raw Data
Lesson 2: Data Preprocessing 80
INST 766: Data Mining
Sampling

[Figure: Raw Data (left) and a Cluster/Stratified Sample of it (right)]

Lesson 2: Data Preprocessing 81


INST 766: Data Mining
Figure: Sampling can be used for data reduction

Lesson 2: Data Preprocessing 82


INST 766: Data Mining
Hierarchical Reduction
Use multi-resolution structure with different degrees of
reduction
Hierarchical clustering is often performed but tends to
define partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to
hierarchical representation
Hierarchical aggregation
An index tree hierarchically divides a data set into partitions
by value range of some attributes
Each partition can be considered as a bucket
Thus an index tree with aggregates stored at each node is a
hierarchical histogram

Lesson 2: Data Preprocessing 83


INST 766: Data Mining
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
☛ divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Example: raw values 10, 56, 68, 12, 45, 41, 23, 43, 13 discretized into equal-width intervals:

Range     Freq      Range     Freq
10 – 20    3        40 – 50    3
20 – 30    1        50 – 60    1
30 – 40    0        60 – 70    1

Lesson 2: Data Preprocessing 84


INST 766: Data Mining
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation

Lesson 2: Data Preprocessing 85


INST 766: Data Mining
Discretization and Concept
hierarchy

Discretization
reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to replace
actual data values.

Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).

Lesson 2: Data Preprocessing 86


INST 766: Data Mining
Figure: A concept hierarchy for the attribute price.

Lesson 2: Data Preprocessing 87


INST 766: Data Mining
Discretization and concept hierarchy
generation for numeric data
Binning (see sections before)

Histogram analysis (see sections before)

Clustering analysis (see sections before)

Entropy-based discretization

Segmentation by natural partitioning

Lesson 2: Data Preprocessing 88


INST 766: Data Mining
Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some
stopping criterion is met, e.g.,

  Ent(S) − E(T, S) > δ
Experiments show that it may reduce data size and improve
classification accuracy
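A minimal sketch of one binary split step of entropy-based discretization on a hypothetical labeled numeric attribute (the class labels below are invented for illustration).

```python
import numpy as np

def entropy(labels):
    """Ent(S): class entropy of a set of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Pick the boundary T that minimizes E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_T, best_E = None, np.inf
    for T in np.unique(values)[1:]:                  # candidate boundaries
        left, right = labels[values < T], labels[values >= T]
        E = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if E < best_E:
            best_T, best_E = T, E
    return best_T, best_E

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85], dtype=float)
play = np.array(list("nnyyynyynyyyyn"))              # hypothetical class labels
T, E = best_split(temps, play)
print(f"best boundary T = {T}, E(S, T) = {E:.3f}, information gain = {entropy(play) - E:.3f}")
```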

Lesson 2: Data Preprocessing 89


INST 766: Data Mining
Figure: Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.
Lesson 2: Data Preprocessing 90
INST 766: Data Mining
Concept hierarchy generation for
categorical data

Specification of a partial ordering of attributes explicitly at


the schema level by users or experts
Specification of a portion of a hierarchy by explicit data
grouping
Specification of a set of attributes, but not of their partial
ordering
Specification of only a partial set of attributes

Lesson 2: Data Preprocessing 91


INST 766: Data Mining
Specification of a set of attributes
Concept hierarchy can be automatically generated based on the
number of distinct values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest
level of the hierarchy.

country              15 distinct values

province_or_state    65 distinct values

city                 3,567 distinct values

street               674,339 distinct values
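A small sketch of this automatic ordering by distinct-value counts, assuming a hypothetical location DataFrame; real counts would of course come from the full data set.

```python
import pandas as pd

# Hypothetical location data for illustration only.
df = pd.DataFrame({
    "country":           ["Canada", "Canada", "USA", "USA", "USA"],
    "province_or_state": ["BC", "ON", "NY", "NY", "CA"],
    "city":              ["Vancouver", "Toronto", "Albany", "Buffalo", "Fresno"],
    "street":            ["Main St", "King St", "Elm St", "Oak St", "Pine St"],
})

# Order attributes by number of distinct values: the attribute with the fewest
# distinct values goes to the top of the hierarchy, the one with the most to the bottom.
hierarchy = df.nunique().sort_values()
print(hierarchy)
```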


Lesson 2: Data Preprocessing 92
INST 766: Data Mining
