
HYBRID SEQUENTIAL FEATURE SELECTION

Assignment No. 3

Submitted By

Name Riaz Ahmad


Registration No. L1R08MSCS0006

University of Central Punjab


Department of Information & Technology
Lahore.
HYBRID SEQUENTIAL FEATURE SELECTION
Riaz Ahmad
Faculty of Information and Technology
University of Central Punjab, Lahore, Pakistan.
{riaz.ahmad@ucp.edu.pk}

ABSTRACT: The data collected for data mining may contain many irrelevant as well as redundant features. There is a need to remove these irrelevant and redundant features, because they neither add to nor affect the target concept [1]. In many applications such data must be removed so that learning can work. Removing these features improves the efficiency of the learning algorithm and makes the resulting model simpler and more general. Many algorithms have been introduced for feature selection, e.g. [5,8,9,16,17,19,20]. The focus of this paper is the Forward Sequential Selection (FSS) and Backward Sequential Selection (BSS) feature selection techniques. In these techniques a feature is removed or added by comparing accuracy rates, but there is still no selection criterion for the case in which two candidate features have the same accuracy rate. This work adds a selection criterion for breaking a tie between two candidate features. It also proposes a hybrid technique, BF Sequential Feature Selection, which incorporates this selection criterion in both techniques and shows better results than either technique alone. Let F be the set of features, S1 = {F1, F3, F4} be the result of applying FSS and S2 = {F1, F3, F4, F5} the result of applying BSS with the incorporated selection criterion. Then the final result of the proposed hybrid technique is the union of both resultant feature sets, S = {F1, F3, F4, F5}.
1. INTRODUCTION
Data mining is an emerging technology used to find hidden, interesting and previously unknown patterns in data [1]. The data to be mined can be in any form, e.g. relations, text or images. The question arises why we need dedicated algorithms or tools to mine the data or, in simple words, whether we can use the powerful features provided by SQL to achieve this goal. The answer is simply 'no', because SQL queries only present the data in different forms, either detailed or summarized, and mostly yield patterns that we already know. Data mining algorithms give us patterns that we do not already know (unusual, but of interest to us).
To elaborate further: as coal miners dig the soil and come up with the coal or other minerals that are their main objective, in data mining our objective is to dig into the data and come up with interesting, previously unknown results for our stakeholders. If we come up with the result that everyone who became pregnant is female, it is not an unknown result; there is no need to present such results because they are already known or of no interest to our stakeholders. Data mining is also known as Knowledge Discovery.
Nowadays, amid constant new technology developments, companies have massive data in the form of databases, data marts and data warehouses to run their day-to-day business tasks as well as to support decision making. Data mining, as discussed above, helps to uncover the hidden patterns lying under this data. In this era data mining is practical because computing power is affordable and data mining tools are readily available in the market [2].
1.1 Prerequisites for Data Mining
Data mining draws on different fields of science, including machine learning, neural networks, statistics, database systems and data warehousing. To learn about data mining we must have some knowledge of these areas [1].
1.2 Importance/Need of Data Mining
Organizations have been storing their day-to-day data for years; they have terabytes of data and bear the cost of keeping it, and they were forced to think about what to do with it. The concept of data warehousing arose for efficient reporting, but organizations still could not use their data effectively. Data warehousing helps them to view their data efficiently and, to some extent, supports decision making. Then the concept of data mining came into existence. Data mining helps to make decisions in a validated way. With its help we can find relationships or associations between products; for example, we may want to know which other products a customer will buy when he buys eggs, and vice versa [2]. In this way the stakeholder can keep the most commonly co-purchased products together or offer promotions on the other products. In the same way, data mining can help the stakeholder with forecasting, that is, prediction. Data mining is used both to reduce cost and to increase revenue.
1.3 Data Mining Process Model
The following steps are involved in data mining [5]:
• Data cleaning (removes or transforms noise and inconsistent data)
• Data integration (combines data from multiple data sources)
• Data selection (data relevant to the analysis task are retrieved from the database)
• Data transformation (data are transformed or consolidated into forms appropriate for mining)
• Data mining (model construction; algorithms are applied to uncover patterns)
• Pattern evaluation (checking the results)
• Knowledge presentation (using visualization and knowledge representation techniques)
1.4 Types of Data Mining
We can divide data mining into two categories. The first concerns the type of data to be mined, e.g. text mining, web mining and graph mining; the second comprises the different ways of mining the data [3,4]. The latter consists of the following types:
• Association Rule Mining
• Classification
• Clustering
• Prediction
• Regression
Let us discuss these types of data mining algorithms.
1.4.1 Association Rule Mining
This technique is used to find interesting relationships among data items. The most famous application of association rule mining is market basket analysis, in which we try to observe the different buying trends of customers. It helps the stakeholder to offer promotions as well as to decide the placement of items; for example, items such as eggs and butter that are purchased together can be kept together. Association rule mining is unsupervised.
1.4.2 Classification
Classification is used to divide data items on the basis of a class attribute. A famous example of classification is dividing given sales data into the class of customers who buy and the class of those who do not. Classification is supervised, as the class attribute on whose basis the data are divided into classes is already known. The best-known example of classification is decision trees.
1.4.3 Clustering
Clustering is used to group items having similar characteristics. An example of clustering is grouping the students of a university into different groups according to the similarities found among them, much as we informally divide students into stronger and weaker groups. Unlike classification, in which the class attribute is given, clustering is unsupervised.
1.4.4 Prediction
Prediction means telling about the future, and it is mostly used for predicting product sales. This technique observes the previous history of the data items. For example, one might want to predict whether it will rain next week; the climate conditions of the previous days, and of the same days last year, will then be observed to make the prediction. Similarly, one might want to predict the passing percentage of the students next year.
1.5 Applications of Data Mining
Applications of data mining are found in almost every field of life. The main areas of application are:
• Finance (loan application processing, credit card analysis)
• Insurance (claims, fraud analysis)
• Telecommunications (call history analysis, fraud detection, promotions)
• Transport
• Marketing & Sales
• Electricity supply forecasting
• Medical
Let us discuss examples from each area of application. In finance and banking, data mining is used in loan application processing: it helps to decide whether to accept or reject a loan application by analyzing different attributes of the applicant. An algorithm can use attributes such as region, salary, years of service with the current employer, years of living at the current address, family background and previous loan history, and this helps to make the decision accurately; further analysis might show, for example, that applicants from a particular region have historically been more likely to default. The same is the case with credit card fraud detection, where the customer's history is analyzed to detect fraud. For example, a credit card holder's history may show that he mostly uses the card at the very start of the month and never makes transactions of a huge amount; if someone else tries to use his or her card illegally to make an unusual transaction, it can be detected easily.
Insurance companies use data mining for claims and fraud analysis. For example, it might help an insurance company decide whether or not to insure a particular person by analyzing different attributes of the customer's profile. Besides the data provided by the customer, data about the customer from external sources, that is, from other companies, can also be used.
In telecommunications, a company might want to offer different packages; data mining helps it decide which promotions to offer, to which age groups, during which hours and in which regions, by analyzing the traffic. A major use of data mining in telecommunications is also fraud detection, by observing the behavior of customers (voucher recharging, call durations, etc.).
In transport, a bus or airline company can apply data mining to make a decision about a new route by analyzing the travel trends of passengers and their likes or dislikes during the journey, for example whether to increase or decrease the number of buses or airplanes on a specific route. This is only possible through the application of data mining.
From a marketing and sales point of view, as discussed earlier under market basket analysis, a shopkeeper might offer promotions on different items based on analysis of consumers' buying trends at different places. Similarly, the placement of items is decided with the help of data mining results.
Power supply forecasting is also a main area of application of data mining. Power supply companies use data mining for power usage analysis of domestic and commercial areas during different hours and months, in order to forecast the power requirements for the next month, quarter or year. In Europe, for instance, electricity is charged at different rates at different hours of the day; such decisions are only possible with the use of data mining.
There are many other areas of application as well; in short, data mining has become a need of industry. In Europe data mining has been used for years, whereas in Pakistan industries have only recently started using it after recognizing its importance.
Section 2 discusses the FSS and BSS feature selection methods from the material studied for supervised learning. Section 3 introduces and explains the criterion added to FSS and BSS for selecting a feature among candidate features. Section 4 discusses the proposed hybrid technique. Section 5 concludes this work with the key findings.
2. MATERIAL AND METHOD
Data for mining is collected from different sources, and cleansing and pre-processing are performed on it. During pre-processing some features/attributes are added and removed. Before adding an attribute, it must be established whether the attribute has worth; similarly, when removing an attribute or feature, the accuracy rate of the classifier must not be compromised. The reason for removing attributes is to simplify the data as much as possible, so that the efficiency of the algorithm can be increased and the model can be kept simpler [5]. Attributes can be divided into two types, relevant and irrelevant [6]. Feature selection has
been an active field of research and development for decades in statistical pattern recognition [10], machine learning [11,12], data mining [13,14] and statistics [15]. It has proven, in both theory and practice, effective in enhancing learning efficiency, increasing accuracy and improving the simplicity of the learned results.
2.1 Feature Selection
'Let G be some subset of F and fG be the value vector of G. In general, the goal of feature selection can be formalized as selecting a minimum subset G such that P(C | G = fG) is equal or as close as possible to P(C | F = f), where P(C | G = fG) is the probability distribution of different classes given the feature values in G and P(C | F = f) is the original distribution given the feature values in F' [5].
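Restated in display form (this is only a compact rewriting of the definition quoted from [5]; C denotes the class variable, F the full feature set, and f, fG the corresponding value vectors):

```latex
% Goal of feature selection: the smallest subset G of F whose class
% distribution matches, as closely as possible, the one given all features.
\[
  G^{*} \;=\; \arg\min_{G \subseteq F} \, |G|
  \qquad \text{subject to} \qquad
  P(C \mid G = f_{G}) \;\approx\; P(C \mid F = f).
\]
```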
The method of selecting only the relevant attributes is called feature selection. Feature selection attempts to select the smallest subset of features such that:
• the classification accuracy does not significantly decrease; and
• the resulting class distribution is as close as possible to the original class distribution given all features [16].
The following figure shows the general feature selection process [5].
[Fig. 1, schematic: Original Set → Subset Generation → Candidate Subset → Subset Evaluation → Best Current Subset → Stop Condition (No: generate the next subset; Yes: Selected Subset)]
Fig. 1. Framework of feature selection.
The diagram in Figure 1 exhibits the traditional feature selection framework. Subset generation produces the possible subsets; at the second stage each candidate subset is evaluated and compared with the previous best subset with respect to some measuring criterion. If the newly evaluated subset is better than the previous one, it replaces it. This process is repeated until the stopping condition is reached. In this traditional approach only the relevancy of the attributes is considered.
[9] proposed a new framework of feature selection, claiming that relevance among the attributes alone is not enough and that feature redundancy is another necessary metric.
[Fig. 2, schematic: Original Set → Relevance Analysis → Relevant Subset → Redundancy Analysis → Selected Subset]
Fig. 2. New framework of feature selection.
The diagram in Figure 2 is an enhancement of the traditional feature selection framework and is composed of two steps: the first step removes the irrelevant features and the second step removes the redundant features. Its advantage is that it gives an optimal subset solution compared with the traditional framework.
Feature selection algorithms are typically composed of the following three components [17]:
• a search algorithm, which searches the space of feature subsets, of size 2^d, where d is the number of features;
• an evaluation function, which takes a feature subset as input and outputs a numeric evaluation; the search algorithm's goal is to maximize this function;
• a performance function, i.e. the classifier.
The most common sequential search algorithms for feature selection are forward sequential selection (FSS) and backward sequential selection (BSS). Let us discuss these techniques one by one in depth.
2.2 Forward Sequential Selection
Forward Sequential Selection, FSS for short, starts with an empty set, evaluates all feature subsets with exactly one feature and selects the one with the best performance. It then adds to this subset the feature yielding the best performance, and the process repeats until adding further attributes brings no improvement. This means that FSS selects the locally best feature at each step, which constitutes one of its drawbacks: the algorithm cannot correct a previously added feature [16,17].
2.3 Backward Sequential Selection
Backward Sequential Selection, BSS for short, is the reciprocal of FSS. It begins with all features and removes a feature if its removal increases the performance of the classifier. This cycle continues until the performance of the classifier remains the same or begins to decrease.
BSS has frequently outperformed FSS, perhaps because BSS evaluates the contribution of a given feature in the context of all other features. In contrast, FSS can evaluate the utility of a single feature only in the limited context of the previously selected features. [16,17] note this problem with FSS, but their results favor neither algorithm. Based on these observations, it is not clear whether FSS will outperform BSS on a given dataset with an unknown number of feature interactions.
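To make the two procedures concrete, the following is a minimal wrapper-style sketch of FSS and BSS in Python. It assumes a caller-supplied evaluate(subset) function that trains the classifier on the given feature subset and returns its estimated accuracy (for example by cross-validation); the function names, type aliases and stopping rules are illustrative assumptions rather than part of the original algorithm descriptions.

```python
from typing import Callable, FrozenSet, Iterable

Score = Callable[[FrozenSet[int]], float]  # evaluate(subset) -> estimated accuracy


def forward_sequential_selection(features: Iterable[int], evaluate: Score) -> FrozenSet[int]:
    """FSS: start empty and greedily add the feature that most improves accuracy."""
    remaining = set(features)
    selected: FrozenSet[int] = frozenset()
    best_score = float("-inf")
    while remaining:
        # Score every candidate subset obtained by adding exactly one more feature.
        score, feature = max((evaluate(selected | {f}), f) for f in remaining)
        if score <= best_score:              # no addition improves accuracy -> stop
            break
        selected, best_score = selected | {feature}, score
        remaining.remove(feature)
    return selected


def backward_sequential_selection(features: Iterable[int], evaluate: Score) -> FrozenSet[int]:
    """BSS: start with all features and greedily drop the feature whose removal helps most."""
    selected = frozenset(features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        score, feature = max((evaluate(selected - {f}), f) for f in selected)
        if score <= best_score:              # accuracy stays the same or drops -> stop
            break
        selected, best_score = selected - {feature}, score
    return selected
```

With a concrete evaluate function, both routines can be run on the same feature list and their outputs compared, which is exactly the comparison the hybrid technique of Section 4 builds on.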
2.4 Plus l Take Away r Method
The plus l take away r method, (l, r) for short, makes use of both Forward Sequential Selection and Backward Sequential Selection: for every l iterations of forward sequential search, r iterations of backward sequential search are performed, and this cycle is repeated until the required number of features is reached [10]. This method overcomes the problem stated above for FSS and BSS, but only partially, and a further problem arises, namely what the optimal values of l and r are for a moderate computational cost.
2.5 Sequential Floating Forward Selection
Sequential Floating Forward Selection, SFFS for short, also makes use of both the forward and the backward sequential selection techniques. Each forward sequential selection step is followed by a number of backward sequential selection steps until the optimal subset of features is selected. Thus backtracking becomes possible in this algorithm, and it requires no parameters, unlike the plus l take away r method [16].
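As an illustration of how the alternation in the (l, r) method can be organized, here is a minimal sketch under the same assumptions as the previous listing (a caller-supplied evaluate(subset) accuracy function); l, r and the target subset size k are parameters the user must choose, which is precisely the difficulty noted above.

```python
from typing import Callable, FrozenSet, Iterable

Score = Callable[[FrozenSet[int]], float]  # evaluate(subset) -> estimated accuracy


def best_addition(selected: FrozenSet[int], pool: FrozenSet[int], evaluate: Score) -> FrozenSet[int]:
    # One forward (FSS) step: add the single feature that gives the best accuracy.
    return max((selected | {f} for f in pool - selected), key=evaluate)


def best_removal(selected: FrozenSet[int], evaluate: Score) -> FrozenSet[int]:
    # One backward (BSS) step: drop the single feature whose removal gives the best accuracy.
    return max((selected - {f} for f in selected), key=evaluate)


def plus_l_take_away_r(features: Iterable[int], evaluate: Score,
                       l: int, r: int, k: int) -> FrozenSet[int]:
    """Alternate l forward steps with r backward steps (l > r) until k features are selected."""
    pool = frozenset(features)
    assert l > r and 1 <= k <= len(pool), "need a positive net gain per cycle and a feasible k"
    selected: FrozenSet[int] = frozenset()
    while len(selected) < k:
        for _ in range(l):                   # l forward steps
            if len(selected) < len(pool):
                selected = best_addition(selected, pool, evaluate)
        if len(selected) >= k:               # target size reached during the forward phase
            break
        for _ in range(r):                   # r backward steps
            if len(selected) > 1:
                selected = best_removal(selected, evaluate)
    while len(selected) > k:                 # trim any overshoot from the last forward phase
        selected = best_removal(selected, evaluate)
    return selected
```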
3. PROPOSED CRITERION
One drawback common to both FSS and BSS is that neither algorithm handles the case in which a tie occurs between candidate features, either for addition in the case of FSS or for deletion in the case of BSS. Let us discuss the problem in detail. Suppose that in FSS we start with the empty set S = {}. We then try the remaining features one by one and calculate the accuracy rate obtained by adding each feature. Suppose feature F1 yields the highest accuracy; we add this feature to the set S, so S becomes S = {F1}. Next we repeat the process for the remaining features, and it may happen that features F3 and F4 yield the same accuracy rate. Which of them should be added? Here I propose that a history be maintained for each feature during each cycle. When such a tie occurs, the decision is taken with the help of the maintained history; that is, the attribute having the greater accumulated average among the candidate attributes is chosen. The same criterion can be added to BSS.
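The following is a minimal sketch of how this tie-breaking criterion could be incorporated into the forward search. It again assumes a caller-supplied evaluate(subset) accuracy function; the history dictionary of past accuracy rates and the way the accumulated average is computed are illustrative choices introduced here, not code from the paper.

```python
from collections import defaultdict
from typing import Callable, FrozenSet, Iterable

Score = Callable[[FrozenSet[int]], float]  # evaluate(subset) -> estimated accuracy


def fss_with_tie_breaking(features: Iterable[int], evaluate: Score) -> FrozenSet[int]:
    """FSS in which ties between candidate features are broken by each feature's
    accumulated average accuracy over the previous cycles (the proposed criterion)."""
    remaining = set(features)
    selected: FrozenSet[int] = frozenset()
    best_score = float("-inf")
    history = defaultdict(list)                  # feature -> accuracy rates seen so far

    while remaining:
        scores = {f: evaluate(selected | {f}) for f in remaining}
        for f, acc in scores.items():            # record this cycle's rates
            history[f].append(acc)

        top = max(scores.values())
        if top <= best_score:                    # no candidate improves accuracy -> stop
            break

        tied = [f for f, acc in scores.items() if acc == top]
        if len(tied) > 1:
            # Tie: prefer the candidate with the greater accumulated average.
            winner = max(tied, key=lambda f: sum(history[f]) / len(history[f]))
        else:
            winner = tied[0]

        selected, best_score = selected | {winner}, top
        remaining.remove(winner)

    return selected
```

In the illustration below this is exactly what happens in Iteration 2: F1 and F2 both reach an accuracy of 88, and F2 is chosen because its accumulated average (79) exceeds F1's (about 77).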
3.1 Illustration
Let us take an example to elaborate the proposed selection criterion for choosing the best feature among the candidate features. Suppose we have a set F of features and we want to obtain the optimal subset of features using Forward Sequential Selection (FSS).

Iteration 1. S = {}
Table 1
Feature   Accuracy Rate   Acc. Avg. Rate
F1        65
F2        70
F3        87
F4        85
F5        80
F6        75
As feature F3 in Table 1 has the highest accuracy rate, it is selected, and S becomes S = {F3}.

Iteration 2. S = {F3}
Table 2
Feature   Accuracy Rate   Acc. Avg. Rate
F1        88              77
F2        88              79
F4        85
F5        80
F6        75
As features F1 and F2 in Table 2 are tied, we compute their accumulated averages and select the feature with the larger value, so our set becomes S = {F3, F2}.

Iteration 3. S = {F3, F2}
Table 3
Feature   Accuracy Rate   Acc. Avg. Rate
F1        88
F4        86
F5        89
F6        78
As feature F5 in Table 3 raises the accuracy rate, we add F5 and the set becomes S = {F3, F2, F5}.

Iteration 4. S = {F3, F2, F5}
Table 4
Feature   Accuracy Rate   Acc. Avg. Rate
F1        89.5
F4        85
F6        75
As feature F1 in Table 4 raises the accuracy rate, we add F1 and the set becomes S = {F3, F2, F5, F1}.

Iteration 5 causes no increase in the accuracy rate, so we stop the process. The final optimal subset is S = {F3, F2, F5, F1}.
Note that we have calculated the accumulated averages only for those attributes causing a tie, thus saving computational cost.
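The tie in Iteration 2 can be checked directly from Tables 1 and 2: a feature's accumulated average is simply the mean of the accuracy rates it has produced so far. The snippet below reproduces that arithmetic; the numbers come from the tables above and the variable names are illustrative.

```python
# Accuracy rates observed for F1 and F2 in Iterations 1 and 2 (Tables 1 and 2).
f1_history = [65, 88]
f2_history = [70, 88]

f1_avg = sum(f1_history) / len(f1_history)   # (65 + 88) / 2 = 76.5, reported as 77 in Table 2
f2_avg = sum(f2_history) / len(f2_history)   # (70 + 88) / 2 = 79.0

# Both candidates reach accuracy 88 in Iteration 2; the tie is broken in favour of F2
# because its accumulated average (79) exceeds F1's (76.5).
winner = "F2" if f2_avg > f1_avg else "F1"
print(winner, f1_avg, f2_avg)
```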
4. HYBRID NATURE
Here we propose another technique that makes use of the most common and popular sequential feature selection algorithms, FSS and BSS. As noted above, it is not clear whether FSS will outperform BSS on a given dataset with an unknown amount of feature interaction. On the basis of this observation we propose taking the union of the feature subsets produced by both algorithms; then we need not worry about the nature of the data, since whatever kind of dataset it is, our selection of features will be better than performing only one of these algorithms.
Suppose we have a dataset S with ten features, S = {F1, F2, F3, …, F10}. When we perform both algorithms on this dataset, the resulting subsets are S1 = {F2, F4, F5, F7} and S2 = {F2, F3, F4, F5, F7}. If we take the union of the resulting subsets S1 and S2, the final result is S3 = S1 ∪ S2 = {F2, F3, F4, F5, F7}. In this way we obtain all the required relevant features and reduce the chance of error, since, as mentioned above, the results also depend on which algorithm is used. By taking the union of the resultant subsets of features, we end up with all the features that are best for the classifier.
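A minimal sketch of the proposed hybrid step is given below. It takes the two sequential selectors as parameters (for instance the forward_sequential_selection and backward_sequential_selection sketches given after Section 2.3) together with the same assumed evaluate accuracy function; all names are illustrative.

```python
from typing import Callable, FrozenSet, Iterable

Score = Callable[[FrozenSet[int]], float]              # evaluate(subset) -> accuracy
Selector = Callable[[Iterable[int], Score], FrozenSet[int]]


def hybrid_union_selection(features: Iterable[int], evaluate: Score,
                           fss: Selector, bss: Selector) -> FrozenSet[int]:
    """Run both sequential selectors and return the union of their results,
    as proposed in Section 4."""
    features = list(features)
    s1 = fss(features, evaluate)                       # e.g. forward sequential selection
    s2 = bss(features, evaluate)                       # e.g. backward sequential selection
    return s1 | s2                                     # S3 = S1 U S2


# Worked example from the text, with features named by their indices:
# S1 = {F2, F4, F5, F7}, S2 = {F2, F3, F4, F5, F7} -> S3 = {F2, F3, F4, F5, F7}.
s3 = frozenset({2, 4, 5, 7}) | frozenset({2, 3, 4, 5, 7})
assert s3 == frozenset({2, 3, 4, 5, 7})
```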
5. CONCLUSIONS
In this paper it was identified that sequential selection algorithms need an additional selection criterion for the case in which a tie occurs between candidate attributes. The proposed criterion selects the feature having the greater accumulated average, thereby ensuring that the chosen attribute is the best among the candidates. Similarly, we proposed another way of selecting the optimal subset of features, namely taking the union of the subsets produced by Forward Sequential Selection and Backward Sequential Selection, which ensures that, whatever the nature of the dataset, the resulting subset is the optimal subset without missing any feature that can play a role during the construction of the final classifier.
REFERENCES
[1]. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 2005.
[2]. P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, 2nd edition, 2006.
[3]. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2006.
[4]. M. Hegland, Data Mining – Challenges, Models, Methods and Algorithms, 2003.
[5]. M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1 (1997) 131–156.
[6]. H. Almuallim and T. G. Dietterich, Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, MIT Press, Cambridge, Massachusetts, 547–552, 1992.
[7]. H. Almuallim and T. G. Dietterich, Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial Intelligence, 69(1–2):279–305, November 1994.
[8]. W. Siedlecki and J. Sklansky, On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197–220, 1988.
[9]. L. Yu and H. Liu, Efficient Feature Selection via Analysis of Relevance and Redundancy, Journal of Machine Learning Research 5 (2004) 1205–1224.
[10]. P. Mitra, C. A. Murthy and S. K. Pal, Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):301–312, 2002.
[11]. H. Liu, H. Motoda and L. Yu, Feature selection with selective sampling. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 395–402, 2002.
[12]. M. Robnik-Sikonja and I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
[13]. Y. Kim, W. Street and F. Menczer, Feature selection for unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–369, 2000.
[14]. M. Dash, K. Choi, P. Scheuermann and H. Liu, Feature selection for clustering – a filter solution. In Proceedings of the Second International Conference on Data Mining, pages 115–122, 2002.
[15].
[16]. A. Miller, Subset Selection in Regression. Chapman & Hall/CRC, 2nd edition, 2002.
[17]. D. Koller and M. Sahami, Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[18]. D. W. Aha and R. L. Bankert, A Comparative Evaluation of Sequential Feature Selection Algorithms, Springer-Verlag, 1996.
[19]. G. John, R. Kohavi and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," in Proceedings of the Eleventh International Conference on Machine Learning, Rutgers, NJ, July 1994.
[20]. S. Salzberg, "Improving Classification Methods via Feature Selection," Johns Hopkins Technical Report, 1992.
[21]. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio and V. Vapnik, Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, edited by S. A. Solla, T. K. Leen and K.-R. Muller, MIT Press, Cambridge, MA, 2001.