Paper Review

Apr 17, 2019

Review

In this paper, the authors review various aspects of variable and feature selection methods, including better definition of the objective function, feature construction, feature ranking, and improving the performance of predictors. Two examples are used for illustration throughout the paper: gene selection from microarray data and text categorization.

1. Variable Ranking

Researchers use variable ranking as an auxiliary selection mechanism because of its simplicity and scalability. In microarray analysis, for example, it is used to discover sets of drug candidates by finding genes that discriminate between healthy and diseased patients. The authors illustrate three criteria for variable ranking, discussed below: correlation criteria, single variable classifiers, and information theoretic ranking criteria.

Consider a set of m examples {x_k, y_k} consisting of n input variables x_{k,i} (i = 1,…,n) and one output y_k. To rank the variables, a score function S(i) is applied to each variable and the scores are sorted; a higher score indicates a more relevant variable. Under the correlation criterion, the score of the i-th variable x_i is the Pearson correlation between that variable and the output y. Mathematically, it is written as:

$$R(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}$$

To rank the variables, R(i)^2 is used, as it enforces a ranking according to the goodness of linear fit of the individual variables.
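The correlation-criterion ranking above can be sketched in a few lines of Python; the toy data and function name are illustrative, not from the paper.

```python
# Sketch of correlation-based variable ranking using R(i)^2,
# on a small synthetic dataset.

def rank_by_correlation(X, y):
    """Rank variables by squared Pearson correlation with the output y.

    X: list of m examples, each a list of n variable values.
    y: list of m output values.
    Returns variable indices sorted from most to least correlated.
    """
    m, n = len(X), len(X[0])
    y_mean = sum(y) / m
    var_y = sum((v - y_mean) ** 2 for v in y)
    scores = []
    for i in range(n):
        col = [X[k][i] for k in range(m)]
        x_mean = sum(col) / m
        cov = sum((col[k] - x_mean) * (y[k] - y_mean) for k in range(m))
        var_x = sum((v - x_mean) ** 2 for v in col)
        r2 = cov ** 2 / (var_x * var_y) if var_x and var_y else 0.0
        scores.append((r2, i))
    return [i for _, i in sorted(scores, reverse=True)]

# Toy data: variable 0 tracks y closely, variable 1 is noise-like.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [1.1, 2.0, 2.9, 4.2]
print(rank_by_correlation(X, y))  # [0, 1]: variable 0 ranks first
```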

The second criterion, single variable classifiers, computes the predictive power of each variable individually; as presented, it applies to binary classification problems. A threshold θ is used to turn each variable into a single variable classifier, whose predictive power serves as the ranking score. This can be used when a large number of variables is present and correlation-based ranking criteria may fail.
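A minimal sketch of the single-variable threshold classifier, assuming binary labels in {0, 1}; the threshold search and data below are illustrative.

```python
# Rank one variable by the best accuracy a threshold classifier achieves on it.

def best_threshold_accuracy(values, labels):
    """Return the best accuracy achievable by thresholding one variable."""
    best = 0.0
    for theta in sorted(set(values)):
        # Predict 1 when value >= theta; also consider the flipped rule.
        preds = [1 if v >= theta else 0 for v in values]
        acc = sum(p == t for p, t in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)
    return best

values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 0, 1]
print(best_threshold_accuracy(values, labels))  # 1.0: theta = 0.8 separates perfectly
```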

The third criterion computes the mutual information between each variable and the target. Mathematically, it is computed as:

$$I(i) = \int_{x_i} \int_{y} p(x_i, y) \, \log \frac{p(x_i, y)}{p(x_i)\, p(y)} \, dx \, dy$$
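For discrete variables the integrals become sums over frequency counts. A minimal sketch, with synthetic data and illustrative names:

```python
# Estimate the mutual-information ranking criterion for discrete variables,
# with probabilities estimated from frequency counts.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples of discrete values."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

# A variable identical to the target carries maximal information (1 bit here);
# an independent variable carries none.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```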

2. Small but Revealing Examples

In this section, the authors answer several questions regarding redundant variables and outline the usefulness and limitations of ranking techniques. The key observations are:

Noise reduction and consequently better class separation may be obtained by adding variables

that are presumably redundant.

Perfectly correlated variables are truly redundant in the sense that no additional information is

gained by adding them. Very high variable correlation does not mean absence of variable

complementarity.

Can variables that are useless by themselves be useful together? Yes: two variables that are useless by themselves can be useful together.

To select a subset of variables, several methods have been proposed: 1) wrappers and embedded methods, 2) nested subset methods, and 3) direct objective optimization, which are discussed below.

Wrappers address the problem of variable selection regardless of the choice of learning algorithm: they use the prediction performance of a learning algorithm to assess the relative usefulness of subsets of variables. When the number of variables is large, the search space of subsets grows very quickly and the variable selection problem becomes NP-hard. Efficient search techniques have been applied to address this problem, such as greedy search, which is robust against overfitting.
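The greedy wrapper search can be sketched as follows; the `evaluate` callback and the toy scorer are illustrative stand-ins, not from the paper.

```python
# Greedy forward-selection wrapper: repeatedly add the variable that most
# improves a caller-supplied validation score (higher is better).

def forward_selection(n_vars, evaluate, max_vars=None):
    """Greedily grow a subset of variable indices while the score improves."""
    selected, remaining = [], list(range(n_vars))
    best_score = evaluate(selected)
    while remaining and (max_vars is None or len(selected) < max_vars):
        scored = [(evaluate(selected + [i]), i) for i in remaining]
        score, best_i = max(scored)
        if score <= best_score:
            break  # no single addition improves performance
        best_score = score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy evaluator: performance peaks when variables 0 and 2 are both chosen,
# with a small penalty per extra variable.
def toy_evaluate(subset):
    return len(set(subset) & {0, 2}) - 0.1 * len(subset)

print(forward_selection(4, toy_evaluate))  # selects variables 0 and 2
```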

To find the optimal variable subset, nested subset methods select subsets according to the change in an objective function J. Three ways of estimating this change are discussed below:

1) Finite difference calculation: the difference between J(s) and J(s+1) or J(s-1) is computed for the addition or removal of a variable.

2) Quadratic approximation of the cost function: this method is used to prune the weights of variables in backward elimination. A second-order Taylor expansion of J is computed and the first-order terms are neglected, yielding the variation

$$DJ_i = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (Dw_i)^2$$

for variable i, where the change in weight Dw_i reflects the removal of the variable.

3) Sensitivity of the objective function: the square of the derivative of J with respect to x_i is used.
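As a concrete sketch of criterion 2), assume a separable quadratic cost J(w) = Σ h_i w_i², so the second derivatives h_i are known exactly; the values below are illustrative.

```python
# Quadratic-approximation pruning criterion: rank weights by the saliency
# DJ_i = 0.5 * (d^2 J / dw_i^2) * w_i^2, and prune the least salient first.

def pruning_saliencies(weights, second_derivs):
    """Saliency of removing each weight (setting Dw_i = w_i)."""
    return [0.5 * h * w * w for w, h in zip(weights, second_derivs)]

weights = [2.0, 0.1, 1.0]
second_derivs = [0.5, 4.0, 0.1]
saliencies = pruning_saliencies(weights, second_derivs)
# Remove the variable whose removal changes J the least.
print(min(range(len(saliencies)), key=saliencies.__getitem__))  # 1
```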

Direct objective optimization addresses the problem of formulating the objective function of variable selection and finding an algorithm to optimize it. The objective function consists of two terms: goodness of fit and regularization. Researchers showed that the l0-norm formulation of SVMs can be solved approximately with a simple modification of the vanilla SVM algorithm. Others showed that l1-norm minimization in SVMs is sufficient to drive enough weights to zero. To the best of the authors' knowledge, there is no algorithm that directly minimizes the number of variables for non-linear predictors.
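The sparsity effect of the l1 norm can be illustrated with the soft-thresholding operator, which is the exact minimizer for a separable quadratic loss; the loss form and the λ value here are assumptions for illustration, not the SVM objective itself.

```python
# Why l1 regularization drives weights to zero: for the penalized problem
# min_v 0.5*(v - w)^2 + lam*|v|, the solution is soft-thresholding of w,
# which maps small weights exactly to 0.

def soft_threshold(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*abs(v)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

unregularized = [1.5, -0.2, 0.05, -2.0]
sparse = [soft_threshold(w, lam=0.3) for w in unregularized]
print(sparse)  # small weights collapse exactly to 0.0
```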

Dimensionality reduction of the input data is always advantageous for storing and processing the data. On the other hand, better performance is often achieved using features extracted from the original input data. Several techniques have been proposed to reduce feature dimensionality, such as PCA and LDA. For feature construction, achieving the best reconstruction of the data is an effective surrogate objective for making predictions. Developing or applying unsupervised algorithms is advantageous even when the task is supervised, since most of the data is typically unlabeled; in text categorization, for example, most documents are unlabeled.

4.1. Clustering

In this method, a group of similar data points is replaced by the centroid of their cluster, which becomes a feature. K-means and hierarchical clustering are the most popular algorithms. In distributional clustering, if X̂ is the random variable representing the constructed features, the information bottleneck (IB) method minimizes the mutual information I(X, X̂) while preserving the mutual information I(X̂, Y): it searches for the largest possible compression that retains information about the target.
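A minimal sketch of centroid-based feature construction, using a hand-rolled 1-D k-means; the data and cluster count are illustrative.

```python
# Cluster-centroid feature construction: each example is re-represented by
# its distances to the k cluster centroids (a k-dimensional feature vector).

def kmeans_1d(values, centroids, iters=10):
    """Lloyd's algorithm on scalars; returns the refined centroids."""
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            j = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            buckets[j].append(v)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

values = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
centroids = kmeans_1d(values, centroids=[0.0, 1.0])
features = [[abs(v - c) for c in centroids] for v in values]
print(centroids)  # two centroids, near the two groups of points
```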

Matrix factorization uses singular value decomposition (SVD) for feature construction: it forms a set of features that are linear combinations of the original input variables and gives the best reconstruction of the original data in the least-squares sense.
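A sketch of the idea, assuming 2-D data and using power iteration on X^T X to recover the leading right-singular vector; this is illustrative, not the paper's implementation.

```python
# SVD-style feature construction: the top right-singular vector of X gives
# the single linear feature with the best least-squares reconstruction.

def top_component(X, iters=100):
    """Power iteration for the leading eigenvector of X^T X (2-D data)."""
    v = [1.0, 1.0]
    for _ in range(iters):
        Xv = [row[0] * v[0] + row[1] * v[1] for row in X]   # X v
        w = [sum(row[j] * s for row, s in zip(X, Xv)) for j in range(2)]  # X^T X v
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = [w[0] / norm, w[1] / norm]
    return v

# Points lie near the line y = x, so the top component is ~[0.707, 0.707].
X = [[1.0, 1.1], [2.0, 1.9], [-1.0, -0.9], [-2.0, -2.1]]
v = top_component(X)
features = [row[0] * v[0] + row[1] * v[1] for row in X]  # 1-D projections
print(v)
```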

The authors review three approaches to selecting constructed features, discussed below:

Nested subset methods: neural networks use internal nodes to extract features from the input data, so node selection can serve as feature selection.

Filters: the mutual information between the features and the output is maximized, and gradient descent is used to optimize the feature parameters.

Direct objective optimization: kernel methods possess an implicit feature space revealed by the kernel expansion; it has been shown that selecting these implicit features can improve the generalization of the model.

5. Validation Methods

In this section, the authors address two problems: out-of-sample performance prediction and model selection. For model selection, only the training and validation data are used, and various methods are applied. Determining the number of samples required for training is one of the major issues. Many researchers follow the leave-one-out technique, but it often leads to overfitting. Metric-based methods use unlabeled data and measure the discrepancy between models trained on different subsets of the data.
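As one standard protocol here (k-fold cross-validation, of which leave-one-out is the k = m special case), a sketch with an assumed `train_and_score` callback; the splitting logic is the point, the scorer below is a stand-in.

```python
# k-fold cross-validation for model selection: hold out each fold in turn,
# train on the rest, and average the validation scores.

def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, train_and_score):
    scores = []
    for fold in kfold_indices(n, k):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        scores.append(train_and_score(train, fold))
    return sum(scores) / k

# Stand-in scorer: just checks each split is disjoint and covers all examples.
score = cross_validate(10, 3, lambda tr, va: len(tr) + len(va) == 10)
print(score)  # 1.0
```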

The final section discusses seven open problems, covered below.

Variance of Variable Subset Selection: small perturbations of the experimental data can lead to poor performance. To stabilize selection, several bootstraps are used, in which subsets of the training data are drawn several times and variable selection is repeated.

Variable Ranking in the Context of Others: various methods for variable ranking have been discussed above. Another algorithm, the Relief algorithm, is based on nearest-neighbor selection: for each example, the closest example from the same class (nearest hit) and the closest from a different class (nearest miss) are selected, and the score of the i-th variable is computed from the difference between the distances to the nearest miss and the nearest hit.
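The Relief criterion described above can be sketched as follows, assuming Euclidean distance, binary labels, and toy data; this is simplified relative to the original algorithm.

```python
# Relief-style scoring: variables that are close for same-class neighbors
# and far for different-class neighbors get high scores.

def relief_scores(X, y):
    """Score each variable by (nearest-miss - nearest-hit) distance, summed."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    n = len(X[0])
    scores = [0.0] * n
    for k, (xk, yk) in enumerate(zip(X, y)):
        hits = [X[j] for j in range(len(X)) if j != k and y[j] == yk]
        misses = [X[j] for j in range(len(X)) if y[j] != yk]
        near_hit = min(hits, key=lambda h: dist(xk, h))
        near_miss = min(misses, key=lambda m: dist(xk, m))
        for i in range(n):
            scores[i] += abs(xk[i] - near_miss[i]) - abs(xk[i] - near_hit[i])
    return scores

# Variable 0 separates the classes; variable 1 is noise.
X = [[0.0, 0.5], [0.1, 0.9], [1.0, 0.6], [1.1, 0.8]]
y = [0, 0, 1, 1]
s = relief_scores(X, y)
print(s[0] > s[1])  # True: the informative variable scores higher
```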

Unsupervised Variable Selection: it is sometimes necessary to select variables in the absence of a target label y. For this purpose, a number of variable ranking criteria have been proposed, e.g. saliency, entropy, smoothness, density, and reliability.

Forward vs. Backward Selection: forward selection is computationally more efficient than backward selection, but it is argued that weaker subsets are found by forward selection, since the usefulness of a variable is not assessed in the context of variables not yet included. The authors illustrate this with an example in the paper.

Multi-class Problem: some variable selection methods treat the multi-class problem directly rather than decomposing it into several two-class problems; criteria based on mutual information, for instance, extend directly to the multi-class case. The multi-class setting can be advantageous for variable selection, since it is less likely that random features will give good accuracy across all classes.

Selection of Examples: mislabeled data leads to wrong choices of features, whereas reliably labeled data leads to better performance and avoids the selection of wrong variables.

Inverse Problems: the authors consider this one of the most important problems. Many times it is necessary to find the underlying distribution, i.e. the process generating the data; for instance, identifying the sources causing a disease helps in diagnosis.

In the end, the authors recommend, as a baseline, using a linear predictor and selecting variables in two alternative ways: with a variable ranking method (correlation coefficient or mutual information) or with a nested subset selection method.
