
Data Pre-Processing and Feature Selection
Week 6
Why Pre-Process Data
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Where Does the Noise Come From?
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of
entry
– history or changes of the data were not recorded
• Missing data may need to be inferred
Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:


– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation of the data that is reduced in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Cleaning – Missing Values
• Ignore the tuple: usually done when the class label is missing
(assuming the task is classification; not effective when the
percentage of missing values per attribute varies considerably)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or a decision tree (a small imputation sketch follows this slide)
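A minimal sketch (in Python, with pandas) of two of the fill-in strategies above: a global constant for a categorical attribute and the attribute mean for a numeric one. The column names and values are made up for illustration.

import numpy as np
import pandas as pd

# Toy sales data with missing customer income and region (illustrative only)
sales = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [52000, np.nan, 61000, np.nan],
    "region":   ["north", None, "south", "east"],
})

# Global constant for a categorical attribute: a new "unknown" class
sales["region"] = sales["region"].fillna("unknown")

# Attribute mean for a numeric attribute
sales["income"] = sales["income"].fillna(sales["income"].mean())

print(sales)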
Cleaning - Noise
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from
different sources are different
– possible reasons: different representations, different
scales, e.g., metric vs. British units
Data Integration – Redundant Data
• Redundant data often occurs when integrating multiple databases
– The same attribute may have different names in different databases
– An attribute may be derivable from another attribute in a different source, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis (sketched below)
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
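As a rough illustration of correlation analysis for redundancy detection, the sketch below builds a tiny integrated table in pandas and computes the Pearson correlation between two attributes; the attribute names and the near-linear relation between price and sales tax are assumptions made for the example.

import pandas as pd

# Tiny integrated table; sales_tax is derived from price, so it is redundant
df = pd.DataFrame({
    "price":     [10.0, 25.0, 40.0, 55.0],
    "sales_tax": [0.8, 2.0, 3.2, 4.4],
    "quantity":  [3, 1, 4, 2],
})

corr = df.corr()                       # Pearson correlation matrix
print(corr.loc["price", "sales_tax"])  # a value close to 1.0 signals redundancy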
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Data Normalization
• min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
• z-score normalization:
v' = (v - mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
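The sketch below applies the three normalization formulas above to a single numeric attribute using NumPy; the sample values and the target range [0, 1] are arbitrary choices for illustration.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to the range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: smallest j such that max(|v'|) < 1
j = 0
while (np.abs(v) / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j

print(v_minmax)
print(v_zscore)
print(v_decimal)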
Data Reduction
• Warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time
to run on the complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data Compression
– Dimensionality reduction – Feature Selection
– Numerosity reduction - Sampling
– Discretization and concept hierarchy generation
Data Reduction – Feature Selection
• Dimensionality Reduction
• Redundant features
– duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of
sales tax paid
– Irrelevant features
• Some features contain no information that is useful for the
data-mining task at hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
Data Reduction – Feature Selection
• Select a minimum set of features such that the
probability distribution of different classes
given the values for those features is as close
as possible to the original distribution given
the values of all features
• Reduces the number of attributes in the discovered patterns, making them
easier to understand
Data Reduction – Feature Selection
• Why?
– removing irrelevant data.
– increasing predictive accuracy of learned models.
– reducing the cost of the data.
– improving learning efficiency, such as reducing
storage requirements and computational cost.
– reducing the complexity of the resulting model
description, improving the understanding of the
data and the model.
Data Reduction – Feature Selection
• Techniques:
– Filter approaches
• Features are selected before the data mining algorithm is run
• Univariate Feature Selection
• Multi-variate Feature Selection
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
– Wrapper approaches
• Use the data mining algorithm as a black box to find the best subset of attributes
– Principal Component Analysis (PCA): constructs new combined features rather than selecting a subset of the original ones (a brief sketch follows)
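For contrast with explicit feature selection, here is a brief, hedged sketch of PCA with scikit-learn: it projects the data onto a few principal components instead of keeping a subset of the original attributes. The random data and the choice of 3 components are assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)        # 100 samples, 10 original features (synthetic)

pca = PCA(n_components=3)          # keep the 3 components with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)             # (100, 3)
print(pca.explained_variance_ratio_)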
Feature Selection - Filter
• Statistical tests between each feature and the response are used to select the best features
• Filter Methods
– information gain
– chi-square test
– Fisher score
– correlation coefficient
– variance threshold
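A small sketch of two of the filter methods listed above, using scikit-learn: a variance threshold and a chi-square test. The iris dataset, the threshold of 0.2, and k=2 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_iris(return_X_y=True)

# Drop near-constant features
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Keep the 2 features with the highest chi-square score against the class label
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X.shape, X_var.shape, X_chi2.shape)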
Feature Selection – Wrapper Method

• Use a subset of features and train a model using them


• Based on the inferences drawn from the previous
model, decide whether to add or remove features from the subset.
• The problem is essentially reduced to a search
problem.
• These methods are usually computationally very
expensive.
Feature Selection – Wrapper Method
• Brute Force
• 2^d possible subsets of features exist (where d is the number of features)
• Try each one out to see the performance
• Select the one that gives the best output
• Exponential time complexity (a toy enumeration sketch follows)
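A toy sketch of the brute-force search above: enumerate every non-empty subset of the d features and keep the one with the best cross-validated accuracy. The decision tree classifier and the iris data are assumptions; the approach is only feasible for small d.

from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
d = X.shape[1]

best_score, best_subset = 0.0, None
for k in range(1, d + 1):
    for subset in combinations(range(d), k):   # 2^d - 1 non-empty subsets in total
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)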
Feature Selection – Wrapper Method
• Other Wrapper Methods
• Forward Selection: an iterative method in which we start with no features in
the model. In each iteration we add the feature that best improves the model,
until adding a new variable no longer improves its performance.
• Backward Elimination: we start with all the features and remove the least
significant feature at each iteration, as long as this improves the performance
of the model. We repeat this until no improvement is observed on removing a feature.
• Recursive Feature Elimination: a greedy optimization algorithm that aims to
find the best-performing feature subset (sketched below). It repeatedly builds
models, setting aside the best- or worst-performing feature at each iteration,
and constructs the next model with the remaining features until all features
are exhausted. It then ranks the features by the order of their elimination.
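A minimal sketch of recursive feature elimination with scikit-learn's RFE; the logistic regression estimator, the dataset, and keeping 5 features are assumptions for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask over the original features: True = kept
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier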
Feature Selection – Embedded Methods
• Embedded methods combine the qualities of filter and wrapper methods.
• They are implemented by algorithms that have their own built-in feature
selection methods.
Feature Selection – Embedded Methods
• Embedded Methods
– Lasso regression performs L1 regularization, which adds a penalty equal to
the sum of the absolute values of the coefficients; coefficients of weak
features are driven to zero, discarding those features (sketched below).
– Ridge regression performs L2 regularization, which adds a penalty equal to
the sum of the squares of the coefficients.
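A sketch of embedded selection with Lasso in scikit-learn: because of the L1 penalty, the coefficients of weak features are driven to exactly zero, which amounts to discarding them. The diabetes dataset and alpha=1.0 are arbitrary choices for the example.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)     # put features on a comparable scale

lasso = Lasso(alpha=1.0).fit(X, y)

kept = np.flatnonzero(lasso.coef_)        # indices with non-zero coefficients
print(lasso.coef_)
print("features kept by Lasso:", kept)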
Feature Selection – Embedded Methods
• Estimator characteristics
– Bias
– Variance
Feature Selection
• Filter Vs. Wrapper
– Filter methods measure the relevance of features by their correlation with
the dependent variable, while wrapper methods measure the usefulness of a
subset of features by actually training a model on it.
– Filter methods are much faster than wrapper methods, as they do not involve
training models; wrapper methods are computationally very expensive.
– Filter methods use statistical measures to evaluate a subset of features,
while wrapper methods use cross-validation.
– Filter methods may fail to find the best subset of features on many
occasions, but wrapper methods can usually find a better-performing subset.
– Using the subset of features from wrapper methods makes the model more
prone to overfitting than using a subset of features from filter methods.
Data Discretization
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual
data values.
• Concept hierarchies
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
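A short sketch of discretization combined with a simple concept hierarchy for an age attribute, using pandas: numeric values are binned into intervals and then replaced by higher-level labels. The bin edges and labels are illustrative assumptions.

import pandas as pd

ages = pd.Series([19, 23, 31, 38, 45, 52, 63, 71])

# Binning into intervals whose labels act as higher-level concepts
groups = pd.cut(ages, bins=[0, 30, 50, 100],
                labels=["young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "age_group": groups}))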
