# Paper Title: An Introduction to Variable and Feature Selection

## Venue: Journal of Machine Learning Research (JMLR)

Review

In this paper, the authors cover various aspects of variable and feature selection methods. These
include providing a better definition of the objective function, feature construction, feature ranking,
and improving the performance of predictors. The authors use two examples for illustration throughout
the paper: gene selection from microarray data and text categorization.

## 1. Variable Ranking

Researchers use variable ranking as an auxiliary selection mechanism for variable selection because of
its simplicity and scalability. In microarray analysis, for example, it is used to discover a set of drug
leads: it finds genes that discriminate between healthy and diseased patients. The authors illustrate
three methods for variable ranking, i.e. correlation criteria, single variable classifiers, and information
theoretic ranking criteria, which are discussed below.

Consider a set of m examples {x_k, y_k} (k = 1, …, m), each consisting of n input variables x_{k,i}
(i = 1, …, n) and one output variable y_k. To rank the variables, a scoring function S(i) is computed for
each variable and the scores are sorted. A higher score indicates a more valuable variable. Under the
correlation criterion, the score of the i-th variable x_i is the correlation between that variable and the
output y. Mathematically, it is written as:
$$
R(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}
$$

To rank the variables, R(i)^2 is used, as it enforces a ranking according to the goodness of linear fit of
individual variables.
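
The correlation criterion above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `correlation_scores` and the synthetic data are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def correlation_scores(X, y):
    """Return R(i)^2 for every column i of X (shape m x n) against y (length m)."""
    Xc = X - X.mean(axis=0)          # x_{k,i} minus the mean of variable i
    yc = y - y.mean()                # y_k minus the mean of y
    num = Xc.T @ yc                  # numerator of R(i), one value per variable
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (num / den) ** 2

# Synthetic data (assumed for illustration): three variables of decreasing relevance.
rng = np.random.default_rng(0)
m = 200
y = rng.normal(size=m)
X = np.column_stack([
    y + 0.1 * rng.normal(size=m),    # strongly related to y
    y + 2.0 * rng.normal(size=m),    # weakly related to y
    rng.normal(size=m),              # unrelated to y
])
scores = correlation_scores(X, y)
ranking = np.argsort(-scores)        # variable indices, highest score first
print(scores, ranking)
```

Sorting by R(i)^2 rather than R(i) means that strongly anti-correlated variables rank as high as strongly correlated ones.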

## 1.1. Single Variable Classifiers

This method computes the predictive power of each variable taken individually. It applies only to
binary classification problems. To measure predictive power, a threshold θ is set on the value of the
variable to build a single variable classifier, whose success rate serves as the ranking criterion. The
method can be used when a large number of variables is present, a setting in which ranking criteria can fail.
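
A minimal sketch of a single variable classifier follows, assuming labels in {0, 1} and a brute-force scan over candidate thresholds. The helper name `best_threshold_accuracy` and the toy data are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def best_threshold_accuracy(x, y):
    """Best success rate of the rule 'predict 1 if x > theta' or its mirror image.

    x: values of one variable; y: binary labels in {0, 1}.
    """
    best = 0.0
    for theta in np.unique(x):               # every observed value is a candidate threshold
        pred = (x > theta).astype(int)
        acc = np.mean(pred == y)
        best = max(best, acc, 1.0 - acc)     # 1 - acc is the accuracy of the flipped rule
    return best

# Toy data (assumed): one informative variable and one pure-noise variable.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.column_stack([
    y + 0.5 * rng.normal(size=300),          # class means differ by 1
    rng.normal(size=300),                    # carries no class information
])
accuracies = [best_threshold_accuracy(X[:, i], y) for i in range(X.shape[1])]
print(accuracies)
```

Variables are then ranked by their best success rate. The rate here is measured on the training data itself, so it is optimistic.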

## 1.2. Information Theoretic Ranking Criteria

This method computes the mutual information between each variable and target. Mathematically it is
computed as:
$$
I(i) = \int_{x_i} \int_{y} p(x_i, y) \, \log \frac{p(x_i, y)}{p(x_i)\, p(y)} \, dx \, dy
$$
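
For discrete variables the integrals become sums and the probabilities can be estimated from frequency counts. The sketch below uses that plug-in estimate. It assumes discrete inputs, and the function name `mutual_information` is chosen for this example only.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in nats for two discrete 1-D arrays."""
    x_vals, x_idx = np.unique(np.ravel(x), return_inverse=True)
    y_vals, y_idx = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1.0)    # contingency table of counts
    joint /= joint.sum()                     # empirical p(x, y)
    px = joint.sum(axis=1, keepdims=True)    # empirical p(x), column vector
    py = joint.sum(axis=0, keepdims=True)    # empirical p(y), row vector
    nz = joint > 0                           # empty cells contribute 0 to the sum
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy data (assumed): a variable identical to the target and an independent one.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
mi_copy = mutual_information(y.copy(), y)                         # close to log(2)
mi_noise = mutual_information(rng.integers(0, 2, size=1000), y)   # close to 0
print(mi_copy, mi_noise)
```

For continuous variables the densities are not available from counts and would have to be estimated first, for example by discretizing the variables, which this sketch does not attempt.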

## 2. Small but Revealing Examples

In this section, the authors answer several questions regarding redundant variables and outline the
usefulness and limitations of ranking techniques. The questions and their corresponding answers are
given below:

## Q1: Can Presumably Redundant Variables Help Each Other?

Noise reduction and consequently better class separation may be obtained by adding variables
that are presumably redundant.

## Q2: How Does Correlation Impact Variable Redundancy?

Perfectly correlated variables are truly redundant in the sense that no additional information is
gained by adding them. Very high variable correlation does not mean absence of variable
complementarity.

## Q3: Can a Variable that is Useless by Itself be Useful with Others?

Yes. Two variables that are useless by themselves can be useful together.
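
The answer to Q3 can be illustrated with an XOR-style toy problem. This is a sketch under assumed data: the binary variables and the lookup-table classifier `lookup_accuracy` are constructed for this example and are not the paper's own experiment.

```python
import numpy as np

def lookup_accuracy(F, y):
    """Training accuracy of a majority-vote lookup table over the distinct rows of F."""
    _, cell = np.unique(F, axis=0, return_inverse=True)
    cell = np.ravel(cell)
    correct = 0
    for c in np.unique(cell):
        labels = y[cell == c]
        # Predict the majority label within each cell of the table.
        correct += max(np.sum(labels == 0), np.sum(labels == 1))
    return correct / len(y)

rng = np.random.default_rng(0)
m = 2000
x1 = rng.integers(0, 2, size=m)
x2 = rng.integers(0, 2, size=m)
y = x1 ^ x2                                   # target is the XOR of the two variables

acc_x1 = lookup_accuracy(x1[:, None], y)      # x1 alone: close to chance
acc_x2 = lookup_accuracy(x2[:, None], y)      # x2 alone: close to chance
acc_both = lookup_accuracy(np.column_stack([x1, x2]), y)   # both: perfect by construction
print(acc_x1, acc_x2, acc_both)
```

A ranking criterion that scores each variable individually would discard both variables here, even though together they determine the target.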

## 3. Variable Subset Selection

To select a subset of variables, researchers have proposed several methods, i.e. 1) wrappers and
embedded methods, 2) nested subset methods, and 3) direct objective optimization, which are discussed
below.

## 3.1. Wrappers and Embedded Methods

The wrapper methodology addresses the problem of variable selection regardless of the chosen learning
algorithm. It uses the prediction performance of the learning algorithm to assess the relative usefulness
of subsets of variables. If the number of variables is large, the search space is also large and exhaustive
search becomes intractable; the problem is NP-hard. Efficient search techniques have been applied to
address this problem, such as greedy search, which is robust against overfitting.
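
A greedy forward search of the wrapper kind might look like the sketch below. It assumes a least-squares learner and a single held-out validation set as the performance estimate. Both choices, and the function names, are assumptions for illustration rather than the procedure of the paper.

```python
import numpy as np

def validation_error(X_tr, y_tr, X_va, y_va, subset):
    """Mean squared validation error of least squares restricted to `subset`."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, subset]])   # add intercept
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, subset]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ w - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, max_vars):
    """Greedily add the variable that most reduces the validation error."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_err = np.inf
    while remaining and len(selected) < max_vars:
        errs = {j: validation_error(X_tr, y_tr, X_va, y_va, selected + [j])
                for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:          # stop when no candidate improves the error
            break
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_err

# Synthetic data (assumed): only variables 1 and 4 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=300)
selected, err = forward_selection(X[:200], y[:200], X[200:], y[200:], max_vars=4)
print(selected, err)
```

Each step evaluates only the remaining candidates, so the search costs on the order of n evaluations per added variable instead of one evaluation per possible subset.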

## 3.2. Nested Subset Methods

To find the optimal variable subset, researchers have proposed various methods that select subsets
according to the change in the objective function J. These methods are discussed below:

1) Finite difference calculation: With s denoting the number of variables currently selected, the
difference between J(s) and J(s+1) or J(s-1) is computed for the variables that are candidates
for addition or removal.
2) Quadratic approximation of the cost function: This method is used to prune the weights of the
variables in backward elimination. A second-order Taylor expansion of J is computed and, at the
optimum of J, the first-order terms are neglected. For variable i, this yields the variation

$$
DJ_i = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (Dw_i)^2
$$

The change in weight Dw_i = w_i corresponds to the removal of variable i.
3) Sensitivity of the objective function calculation: In this method, the square of the derivative of J
with respect to x_i is used.
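
The quadratic approximation in item 2 can be sketched for a linear least-squares model, where the second derivative of J with respect to w_i has a closed form. The choice of model, the use of only the diagonal of the Hessian, and the function name `obd_backward_elimination` are assumptions made for this example.

```python
import numpy as np

def obd_backward_elimination(X, y, n_keep):
    """Backward elimination ranked by DJ_i = 0.5 * d2J/dw_i2 * w_i**2.

    Assumes J(w) = 0.5 * ||X w - y||^2, so d2J/dw_i2 is the sum of x_{k,i}^2 over k.
    """
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        Xa = X[:, active]
        w, *_ = np.linalg.lstsq(Xa, y, rcond=None)   # refit on the remaining variables
        hess_diag = (Xa ** 2).sum(axis=0)            # diagonal of the Hessian of J
        saliency = 0.5 * hess_diag * w ** 2          # estimated rise in J if w_i is set to 0
        active.pop(int(np.argmin(saliency)))         # drop the least salient variable
    return active

# Synthetic data (assumed): only variables 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=200)
kept = obd_backward_elimination(X, y, n_keep=2)
print(kept)
```

Because only diagonal second derivatives are used, interactions between correlated variables are ignored. Refitting after every removal partly compensates for this.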

## 3.3. Direct Objective Optimization

This section addresses the problem of formalizing the objective function of variable selection and finding
algorithms to optimize it. The objective function consists of two terms, i.e. goodness of fit and regularization.
Researchers showed that the l0-norm formulation of SVMs can be solved approximately with a simple
modification of the vanilla SVM algorithm. Other researchers showed that l1-norm minimization in
SVMs suffices to drive enough weights to zero. To the best of the authors' knowledge, no algorithm has
been proposed that directly minimizes the number of variables for non-linear predictors.
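
The way an l1 penalty drives weights exactly to zero can be seen in a small example. The sketch below deliberately swaps the SVM setting for l1-penalized least squares solved by proximal gradient descent (ISTA), since that is shorter to write. It is not the SVM formulation discussed in the paper, and the function name and penalty value are assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (0.5 / m) * ||X w - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    m, n = X.shape
    step = m / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the smooth term
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / m         # gradient of the goodness-of-fit term
        z = w - step * grad
        # Soft thresholding: the proximal step of the l1 penalty, which produces exact zeros.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

# Synthetic data (assumed): only variables 2 and 7 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 2] - 3.0 * X[:, 7] + 0.1 * rng.normal(size=200)
w = lasso_ista(X, y, lam=0.1)
nonzero = set(np.flatnonzero(np.abs(w) > 1e-8).tolist())
print(w, nonzero)
```

The goodness-of-fit term and the penalty compete: a larger `lam` removes more variables at the price of a worse fit.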

## 4. Feature Construction and Space Dimensionality Reduction

Reducing the dimensionality of the input data is advantageous for storing and processing it. In
addition, better performance can often be achieved using features derived from the original input.
Several techniques have been proposed to reduce the dimensionality of the features, such as principal
component analysis (PCA) and linear discriminant analysis (LDA). Feature construction may pursue two
distinct goals: achieving the best reconstruction of the data, or being most efficient for making
predictions. Applying unsupervised algorithms can be advantageous even when the problem is supervised,
because in many applications most of the data is unlabeled. In text categorization, for example, most
of the data is unlabeled.

## 4.1. Clustering

In this method, a group of similar variables is replaced by the cluster centroid, which becomes a
feature. K-means and hierarchical clustering are the most popular algorithms. In distributional
clustering, if X̂ is the random variable representing the constructed features, the information bottleneck
(IB) method seeks to minimize the mutual information I(X, X̂) while preserving the mutual information
I(X̂, Y). It thus searches for the largest possible compression that retains the information about the target.

## 4.2. Matrix Factorization

This method uses singular value decomposition (SVD) for feature construction. It forms a set of features
that are linear combinations of the original variables and that give the best reconstruction of the
original data in the least-squares sense.
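
A short sketch of SVD-based feature construction follows. Centering the data before the decomposition (as in PCA) is an assumption of this example, as are the function name `svd_features` and the synthetic low-rank data.

```python
import numpy as np

def svd_features(X, k):
    """Return k constructed features per example and the rank-k reconstruction of X."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = U[:, :k] * s[:k]              # features: linear combinations of the centered variables
    X_hat = Z @ Vt[:k] + mean         # best rank-k reconstruction in the least-squares sense
    return Z, X_hat

# Synthetic data (assumed): 5 observed variables generated from 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))
Z, X_hat = svd_features(X, k=2)
reconstruction_mse = float(np.mean((X - X_hat) ** 2))
print(Z.shape, reconstruction_mse)
```

The construction is unsupervised: the target y plays no part in it, so the retained features are those that best reconstruct the data, not necessarily those that best predict.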

## 4.3. Supervised Feature Selection

The authors review three approaches for selecting features, which are discussed below:

- Nested subset methods: Neural networks use internal nodes to extract features from the input data.
Node selection can therefore be regarded as a feature selection process.
- Filters: In this method, the mutual information between the features and the output is maximized.
Gradient descent is used to optimize the parameters.
- Direct objective optimization: Kernel methods possess an implicit feature space revealed by the
kernel expansion. It has been shown that selecting these implicit features improves only the
generalization of the model.

## 5. Validation Methods

In this section, the authors address two problems, i.e. out-of-sample performance prediction and model
selection. For model selection, only the training and validation data are used, and various methods
are applied. A major question that precedes model selection is how many samples to reserve for training.
Several researchers use the leave-one-out technique, but it often leads to overfitting. In metric-based
methods, unlabeled data is used: the discrepancy between the predictions of models trained on different
subsets of the data is measured on it.
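
Model selection with training and validation splits can be sketched with k-fold cross-validation. The least-squares learner, the function name `kfold_mse` and the toy comparison of two candidate variables are assumptions for illustration only.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average validation mean squared error of least squares over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                  # all indices outside the fold
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errors))

# Toy comparison (assumed): variable 0 drives the target, variable 1 does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] + 0.5 * rng.normal(size=120)
err_var0 = kfold_mse(X[:, [0]], y)
err_var1 = kfold_mse(X[:, [1]], y)
print(err_var0, err_var1)
```

Setting `k = len(y)` gives the leave-one-out estimate mentioned above. When many candidate subsets are compared on the same folds, the lowest validation error is itself optimistically biased, which is one way selection can overfit.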

## 6. Advanced Topics and Open Problems

This section discusses seven open problems, which are summarized below:

- Variance of Variable Subset Selection: Small perturbations of the experimental data can change the
selected subset and lead to poor performance. To stabilize variable selection, several bootstraps are
used: variable selection is repeated on several sub-samples of the training data.
- Variable Ranking in the Context of Others: Various variable ranking methods were discussed above.
Another one, the Relief algorithm, is based on nearest neighbors and is used for feature selection.
For each example, the closest example of the same class (nearest hit) and the closest example of a
different class (nearest miss) are selected. The score of the i-th variable is computed from the
difference between the distance to the nearest miss and the distance to the nearest hit along that variable.
- Unsupervised Variable Selection: Often, features must be selected in the absence of the target
label y. For this purpose, a number of variable ranking criteria are used, i.e. saliency,
entropy, smoothness, density and reliability.
- Forward vs. Backward Selection: Forward selection is said to be computationally more efficient
than backward selection. It is argued, however, that forward selection finds weaker subsets.
The authors illustrate this with an example in the paper.
- Multi-class Problems: Some variable selection methods treat the multi-class problem directly rather
than decomposing it into several two-class problems. Methods based on mutual information extend
directly to the multi-class case. The multi-class setting is considered advantageous for variable
selection, since it is less likely that random features give good accuracy.
- Selection of Examples: Mislabeled examples can lead to the wrong choice of features, whereas
reliably labeled data leads to better performance and avoids the selection of wrong variables.
- Inverse Problems: The authors consider this one of the most challenging problems. Often it is
necessary to find the underlying distribution, i.e. the mechanism that generated the data. In
diagnosis, for example, it helps to identify the factors that cause the disease.
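
The Relief scoring described in the list above can be sketched as follows. This sketch follows the difference form given in this review, uses the L1 distance to find neighbors, and averages over all examples. The distance choice, the averaging and the function name `relief_scores` are assumptions of the example.

```python
import numpy as np

def relief_scores(X, y):
    """Relief-style scores: mean of |x - nearest miss| - |x - nearest hit| per variable."""
    m, n = X.shape
    scores = np.zeros(n)
    for k in range(m):
        dist = np.abs(X - X[k]).sum(axis=1)    # L1 distance from example k to all examples
        dist[k] = np.inf                       # an example is not its own neighbor
        same = (y == y[k])
        hit = np.argmin(np.where(same, dist, np.inf))     # closest example of the same class
        miss = np.argmin(np.where(~same, dist, np.inf))   # closest example of another class
        scores += np.abs(X[k] - X[miss]) - np.abs(X[k] - X[hit])
    return scores / m

# Toy data (assumed): variable 0 separates the classes, variables 1 and 2 are noise.
rng = np.random.default_rng(0)
m = 200
y = rng.integers(0, 2, size=m)
X = np.column_stack([
    2.0 * y + 0.3 * rng.normal(size=m),
    rng.normal(size=m),
    rng.normal(size=m),
])
scores = relief_scores(X, y)
print(scores)
```

Because the neighbors are found using all variables at once, each variable is scored in the context of the others, unlike the individual ranking criteria of Section 1.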

Finally, the authors recommend using a linear predictor and selecting variables in two alternative ways:
(1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a
nested subset selection method.