I. RESEARCH QUESTIONS
1. Identify the set of metrics that most significantly predicts the change required to improve the
structural quality of software.
2. Identify the best Feature Subset Selection technique under various circumstances.
3. Identify the best Feature Subset Selection technique and Machine learning algorithm combination
under various circumstances.
II. INTRODUCTION
Software maintenance, which starts after delivery of the project, is needed either to meet functional requirements or to improve the structure of the software. In this paper we focus on the change needed to improve the structural quality of the software. Software metrics are quantitative measures that describe the functional and structural properties of a software system or process. They are used to predict various characteristics of the software, such as the need for change. To make such predictions automatically, we need machine learning algorithms that use general inductive processes to predict the change needed on the basis of previously identified changes. A large set of software metrics describing structural properties can be used as input to these algorithms, but processing all of them can overrun time and budget constraints. To shorten this large set, a feature subset selection (FSS) technique can be used to find the metrics that can be ignored in the forecasting process without reducing the precision of the prediction. FSS is a data preprocessing technique in which redundant, irrelevant, erroneous and missing data are removed. In this study we use five FSS algorithms and four machine learning algorithms to identify the smallest but most effective set of metrics.
To perform this study, two consecutive versions of ORDrumbox, a medium-sized open source software system downloaded from http://sourceforge.net/, are used.
13 | 2015, IJAFRC All Rights Reserved
www.ijafrc.org
III. RELATED WORK
Machine learning is a field of computer science that builds on computational learning theory and pattern recognition and explores the creation and study of algorithms that learn from data. Various empirical studies have been conducted to demonstrate the importance and application of machine learning algorithms in various fields. Witten, Frank and Hall [10] describe various machine learning tools and techniques in their book. Andrieu et al. [11] present an introduction to Markov chain Monte Carlo (MCMC), a large class of sampling algorithms, for machine learning. Kubat et al. [12] used machine learning algorithms to identify oil spills in satellite radar images. Similarly, Freitag [13] studied how learning can be used to extract information from domains where linguistic processing can be a problem. For systems whose performance is affected by a large set of attributes, machine learning algorithms alone might not give significant results within the time constraints faced by analysts. To overcome this problem, FSS techniques can be combined with machine learning algorithms to improve performance.
FSS helps human readers comprehend a learnt model and can substantially reduce the search space for a learner. Various studies have shown that a learner can ignore many attributes with little or no loss of accuracy. Kohavi and John [3] discuss the strengths and weaknesses of the wrapper methodology and present a series of improved designs. Zhang [4] studies the feature selection problem using a greedy least squares regression algorithm and shows that, under an irrepresentable condition on the design matrix, the greedy algorithm selects features consistently as the sample size tends to infinity. Dash and Liu [5] perform feature subset selection on the basis of consistency; they compare the inconsistency measure with other measures and study various search techniques such as exhaustive, complete, heuristic and random search. Bittencourt and Clarke [6] use the Classification and Regression Trees (CART) approach for feature selection. Gunnalan et al. [7] present a new method for feature subset selection using the TAR2 treatment learner. Yang and Honavar [8] present an approach to the multi-criteria optimization problem of feature subset selection using a genetic algorithm, and demonstrate its feasibility for FSS in the automated design of neural networks for pattern classification and knowledge discovery. Inza et al. [9] propose another method, FSS-EBNA (Feature Subset Selection by Estimation of Bayesian Network Algorithm), which uses a wrapper approach over Naive-Bayes and ID3 learning algorithms to estimate the goodness of each candidate solution.
In this study we select subsets of software metrics using various feature subset selection techniques and then evaluate the results using various machine learning algorithms.
IV. RESEARCH METHODOLOGY
This section describes the steps followed in this empirical study, as shown in figure 1. To carry out this research, two consecutive versions of ORDrumbox, a medium-sized open source software system (version 0.9.082 and version 0.9.07) downloaded from http://sourceforge.net/, are used and various software metrics are calculated. Five FSS algorithms are then applied to the obtained metrics and their results are recorded. Afterwards, four machine learning algorithms are applied to evaluate the error in predicting change using the sets of metrics obtained by the FSS algorithms. After collecting all the results, analysis is done to identify the metrics that constitute the smallest possible set needed to determine and control the need for new versions to improve the structural quality of the software. All metrics, FSS algorithms and machine learning algorithms used in this study are described in the following subsections.
Software metrics are measures that provide information about the physical and functional properties of a system, component or process. Software project managers ordinarily use several software metrics to support the design and implementation of large software systems [1, 2, 14, 15]. The metrics used in this project are listed in table 1.
Table 1. Software Metrics
S. No. | Software Metrics | Definition
1. | WMC - Weighted methods per class | The weighted methods per class (WMC) metric is defined as the sum of the complexities of all the methods of a class.
2. | DIT - Depth of Inheritance Tree | This metric is a measure of the inheritance levels of each class in the object hierarchy.
3. | NOC - Number of Children | This metric is defined as the number of immediate descendants of the class.
4. | CBO - Coupling between object classes | This metric counts the number of classes coupled by various means such as inheritance, method calls, field accesses, return types, arguments, etc.
5. | RFC - Response for a Class | This metric counts the number of different methods that can be invoked by a class.
6. | LCOM - Lack of cohesion in methods | A class's LCOM metric counts the sets of methods that are not linked with each other by sharing the class's fields.
7. | Ca - Afferent couplings | Afferent coupling counts the number of classes that use a specific class.
8. | Ce - Efferent couplings | Efferent coupling counts the number of classes that are used by a specific class.
9. | NPM - Number of Public Methods | The NPM metric simply measures the number of methods in a class that are declared public.
10. | LCOM3 - Lack of cohesion in methods | It gives a measure of cohesion in the range of 0-2 and is calculated as:
11. to 29. | [entries not preserved] | [entries not preserved]
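For LCOM3, a commonly cited form is the Henderson-Sellers definition, which is also the one given in the CKJM documentation: LCOM3 = ((1/a) * sum of mu(A_j) over all fields - m) / (1 - m), where m is the number of methods, a the number of fields, and mu(A_j) the number of methods that access field A_j. Assuming that definition applies here, a minimal Python sketch could look as follows; the function name, the input layout and the choice to return None for undefined cases are illustrative assumptions, not part of the study.

```python
from typing import Dict, Optional, Set


def lcom3(method_fields: Dict[str, Set[str]], fields: Set[str]) -> Optional[float]:
    """Henderson-Sellers LCOM3 for one class (assumed definition).

    method_fields maps each method name to the set of class fields it accesses.
    Returns None when the value is undefined (fewer than two methods or no fields).
    """
    m = len(method_fields)
    a = len(fields)
    if m < 2 or a == 0:
        return None
    # mu(A_j): number of methods that access field A_j, summed over all fields.
    total_accesses = sum(
        sum(1 for used in method_fields.values() if field in used)
        for field in fields
    )
    mean_accesses = total_accesses / a
    return (mean_accesses - m) / (1 - m)
```

Under this definition a class in which every method touches every field scores 0 (fully cohesive), while a class whose methods touch no fields scores m / (m - 1), which is 2 for two methods and matches the 0-2 range stated in the table.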
To carry out this study we calculated the metrics mentioned earlier for the ORDrumbox software using the CKJM tool (footnote 1) and the CCCC tool (footnote 2). After calculating all these metrics, the numeric change metric (CHANGE) and the binary change metric (BCHANGE) were calculated for each class by analyzing the two consecutive versions of ORDrumbox. Various properties of the metrics thus obtained are shown in table 2.
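The per-metric properties reported in table 2 (mean, median, standard deviation, variance, minimum, maximum and range) can be computed with Python's standard library, as in the sketch below. It assumes the sample (n - 1) forms of variance and standard deviation, which the paper does not state, and the input values are hypothetical rather than taken from ORDrumbox.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return the summary statistics used as columns in table 2."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),      # sample standard deviation (assumed)
        "variance": statistics.variance(values),  # sample variance (assumed)
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
    }


# Hypothetical DIT values for five classes (not taken from ORDrumbox).
dit_values = [1, 1, 2, 3, 3]
summary = summarize(dit_values)
```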
Table 2. Properties of Metrics calculated for ORDrumbox version 0.9.082
Software Metric | Mean | Median | Standard Deviation | Variance | Minimum | Maximum | Range
WMC | 9.7051 | 4 | 13.60643 | 185.135 | 1 | 87 | 86
DIT | 1.6959 | 1 | 1.8408 | 3.389 | 0 | 6 | 6
NOC | 0.3825 | 0 | 2.36805 | 5.608 | 0 | 23 | 23
CBO | 10.2627 | 5 | 13.97154 | 195.204 | 0 | 75 | 75
RFC | 28.1705 | 15 | 31.59676 | 998.355 | 1 | 226 | 225
LCOM | 111.1705 | 3 | 410.6652 | 168645.9 | 0 | 3705 | 3705
Ca | 6.4931 | 2 | 12.18876 | 148.566 | 0 | 68 | 68
Ce | 4.6129 | 2 | 5.94943 | 35.396 | 0 | 31 | 31
NPM | 7.6866 | 3 | 12.58613 | 158.411 | 0 | 85 | 85
LCOM3 | 1.1103 | 0.9656 | 0.62388 | 0.389 | 0 | 2 | 2
LOC | 231.6452 | 85 | 349.4495 | 122114.9 | 1 | 2558 | 2557
DAM | 0.6132 | 0.6047 | 0.05714 | 0.003 | 0.5 | 1 | 0.5
MOA | 0.7143 | 0 | 2.68939 | 7.233 | 0 | 33 | 33
MFA | 0.2395 | 0 | 0.41434 | 0.172 | 0 | 1 | 1
CAM | 0.5663 | 0.5556 | 0.27528 | 0.076 | 0 | 1 | 1
IC | 0.1705 | 0 | 0.44457 | 0.198 | 0 | 3 | 3
CBM | 0.3226 | 0 | 1.23869 | 1.534 | 0 | 15 | 15
AMC | 19.3956 | 14.625 | 22.16497 | 491.286 | 0 | 175.17 | 175.17
MVG | 9.2212 | 2 | 16.35874 | 267.608 | 0 | 132 | 132
COM | 12.8387 | 4 | 22.76097 | 518.062 | 0 | 148 | 148
NOM | 9.5069 | 4 | 21.55916 | 464.797 | -17 | 196 | 213
L_C | 12.8772 | 7.0275 | 22.88966 | 523.937 | 0 | 196 | 196
M_C | 3.3813 | 0.3 | 26.50979 | 702.769 | 0 | 333 | 333
IF4 | 9.7512 | 0 | 143.6434 | 20633.44 | 0 | 2116 | 2116
IF4V | 9.7512 | 0 | 143.6434 | 20633.44 | 0 | 2116 | 2116
IF4C | 0 | 0 | 0 | 0 | 0 | 0 | 0
REJ | 10.4101 | 8 | 8.48214 | 71.947 | 1 | 52 | 51
ALOC | 17.6959 | 12.571 | 22.03114 | 485.371 | 1.33 | 257 | 255.67
AMVG | 1.6348 | 0.375 | 3.59524 | 12.926 | 0 | 39.5 | 39.5
ACOM | 2.3241 | 0.75 | 5.27281 | 27.802 | 0 | 49.33 | 49.33
CHANGE | 14.2829 | 1 | 34.87173 | 1216.037 | 0 | 319 | 319
BCHANGE | 0.5073 | 1 | 0.50117 | 0.251 | 0 | 1 | 1
Footnote 1: http://www.spinellis.gr/sw/ckjm/
Footnote 2: http://sourceforge.net/projects/cccc/
C.
Feature subset selection (FSS) is the process of choosing a subset of relevant features. There are various algorithms for applying FSS to a set of features. In this project WEKA (Waikato Environment for Knowledge Analysis, footnote 3) [16, 17] has been used to apply FSS to the set of metrics discussed in the previous subsection. The algorithms used in this study are described in table 3.
Table 3. FSS Algorithms and their Description
S. No. | FSS Algorithm | Description
1. | Correlation based Algorithm | The CfsSubsetEval function in WEKA provides this technique. It calculates the worth of a subset of attributes on the basis of the individual predictive capability of every feature and the degree of redundancy among them. With this algorithm, subsets of features that are highly correlated with the class but have low intercorrelation are chosen. [18]
2. | Gain Ratio | The GainRatioAttributeEval function in WEKA implements this technique. It calculates the worth of an attribute by determining the gain ratio with respect to the class: $\mathrm{GainR}(\mathrm{Class}, \mathrm{Attribute}) = \frac{H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Attribute})}{H(\mathrm{Attribute})}$. [10]
3. | Information Gain | The InfoGainAttributeEval function in WEKA implements this technique. It evaluates the worth of an attribute by determining the information gain with respect to the class: $\mathrm{InfoGain}(\mathrm{Class}, \mathrm{Attribute}) = H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Attribute})$. [10]
4. | OneR | The OneRAttributeEval function in WEKA evaluates the worth of an attribute by using the OneR classifier. [10]
5. | Symmetrical Uncertainty | The SymmetricalUncertAttributeEval function in WEKA implements this technique. It evaluates the worth of an attribute by determining the symmetrical uncertainty with respect to the class: $\mathrm{SymmU}(\mathrm{Class}, \mathrm{Attribute}) = \frac{2\,(H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Attribute}))}{H(\mathrm{Class}) + H(\mathrm{Attribute})}$. [10]
Of these FSS algorithms, the Correlation based Algorithm was applied with the CHANGE metric as the class (target) attribute, and the remaining four were applied with the BCHANGE metric.
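The three entropy-based scores in table 3 can be illustrated with a few lines of Python. This is a minimal sketch for discrete values only and is not WEKA's implementation; numeric metrics would first have to be discretized, and the function names are chosen here for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy H(X) in bits of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(cls: Sequence[Hashable], attr: Sequence[Hashable]) -> float:
    """H(Class | Attribute): class entropy within each attribute value, weighted by frequency."""
    n = len(cls)
    groups = {}
    for c, a in zip(cls, attr):
        groups.setdefault(a, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def info_gain(cls, attr) -> float:
    return entropy(cls) - conditional_entropy(cls, attr)


def gain_ratio(cls, attr) -> float:
    h_attr = entropy(attr)
    return info_gain(cls, attr) / h_attr if h_attr > 0 else 0.0


def symmetrical_uncertainty(cls, attr) -> float:
    denom = entropy(cls) + entropy(attr)
    return 2 * info_gain(cls, attr) / denom if denom > 0 else 0.0


# Hypothetical example: a binned metric that separates changed from unchanged classes perfectly.
bchange = [1, 1, 0, 0]
binned_metric = ["high", "high", "low", "low"]
```

For this toy input all three scores equal 1, their maximum for a binary class, because knowing the attribute removes all uncertainty about the class. An attribute with a single constant value scores 0 on all three.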
D. Machine Learning Algorithm
Machine learning deals with the creation and study of algorithms that can learn from data. Such algorithms build a model from the input data provided and use that model to make decisions or predictions. We have used WEKA [16, 17] to apply four machine learning algorithms, each one on every subset selected by the five FSS techniques, thereby giving us (5*4) twenty different combinations and thus twenty outputs. The learning algorithms used in this study are described in table 4.
Footnote 3: http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Table 4. Machine Learning Algorithms and their Description
S. No. | Machine Learning Algorithm | Description
1. | MultiLayer Perceptron | This is a classifier that uses backpropagation to classify instances. The network can be constructed by hand, produced by an algorithm, or both. The network can also be observed and changed during training. [10]
2. | K* | K* or KStar [20] is an instance-based classifier, i.e. the class of a test instance is based upon the class of the training instances similar to it, where similarity is determined by some similarity function. It differs from other instance-based learning algorithms in that it uses an entropy-based distance function.
3. | Nearest Neighbor Algorithm | For this algorithm we have used the IBk function, which is a K-nearest neighbor classifier that can select an appropriate value of K based on cross-validation. It can also do distance weighting. [19]
4. | LWL | The LWL (Locally Weighted Learning) technique uses an instance-based algorithm to assign instance weights, which are then used by a specified WeightedInstancesHandler [21].
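To make the idea behind the instance-based learners in table 4 concrete, the sketch below shows a K-nearest neighbor classifier with inverse-distance weighting. It is a simplified illustration and not WEKA's IBk; the Euclidean distance, the weighting formula and the sample data are assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Sequence, Tuple


def knn_predict(
    train: Sequence[Tuple[Sequence[float], int]],
    query: Sequence[float],
    k: int = 3,
) -> int:
    """Predict a class label by an inverse-distance-weighted vote of the k nearest instances."""
    # Sort training instances by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        votes[label] += 1.0 / (distance + 1e-9)  # closer neighbours count more
    return max(votes, key=votes.get)


# Hypothetical two-metric feature vectors with a binary change label.
train = [([1.0, 2.0], 0), ([1.5, 1.8], 0), ([8.0, 9.0], 1), ([9.0, 8.5], 1)]
```

In practice the metrics would be normalized before computing distances, since a metric with a large range such as LOC would otherwise dominate the distance.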
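S. No. | Measure to calculate Error | Description
1. | Mean Absolute Error (MAE) | It measures the average of the absolute errors and is calculated as: $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert p_i - a_i \rvert$
2. | Root Mean Square Error (RMS) | $\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}$
3. | Relative Absolute Error (RAE) | The relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor. It is calculated as: $\mathrm{RAE} = \frac{\sum_{i=1}^{n} \lvert p_i - a_i \rvert}{\sum_{i=1}^{n} \lvert a_i - \bar{a} \rvert}$
4. | Root Relative Square Error (RRS) | It takes the total squared error, normalizes it by dividing by the total squared error of the simple predictor, and takes the square root. It is calculated as: $\mathrm{RRS} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$
Here $p_i$ is the predicted value and $a_i$ the actual value for instance $i$, $n$ is the number of instances, and $\bar{a}$ is the mean of the actual values, which is the output of the simple predictor.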
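The four error measures can be computed directly from paired actual and predicted values. The sketch below assumes the standard definitions, with the simple predictor taken to be the mean of the actual values; the function name and the sample numbers are hypothetical.

```python
import math
from typing import Dict, Sequence


def error_measures(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMS, RAE and RRS for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the simple predictor that always outputs the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMS": math.sqrt(sq_err / n),
        "RAE": abs_err / base_abs,
        "RRS": math.sqrt(sq_err / base_sq),
    }


# Hypothetical actual and predicted change values for four classes.
measures = error_measures([1, 2, 3, 4], [2, 2, 3, 3])
```

WEKA prints RAE and RRS as percentages, so the ratios returned here would be multiplied by 100 to compare with such output. The relative measures are undefined when all actual values are identical, because the simple predictor then has zero error.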
In this study all the reported results are based on 10-fold cross-validation for each algorithm.
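As a reminder of what 10-fold cross-validation involves, the sketch below partitions instance indices into ten folds so that every instance is used for testing exactly once and for training in the other nine rounds. It is a plain, unstratified illustration; the function name, the seed and the instance count are assumptions.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Hypothetical data set of 25 classes.
splits = k_fold_indices(25, k=10)
```

The reported error is then the error accumulated over the ten test folds, so every prediction is made by a model that never saw that instance during training.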
RQ1: Identify the set of metrics that can predict change required to improve structural quality of
software most significantly.
As shown in table 6, the Correlation based algorithm finds RFC, Ce, COM, NOM, REJ and AMVG to be the most significant metrics, whereas gain ratio based FSS finds REJ, COM, RFC, L_C, ACOM, M_C, MVG and AMVG to be more significant than the others. Similarly, the information gain method selects REJ, NOM, Ce, NOM, RFC, L_C, ACOM, M_C and MVG; the OneR method selects REJ, ACOM, RFC, M_C, LCOM3, LOC, COM, NOM, AMVG, MVG and L_C; and the symmetrical uncertainty method selects REJ, COM, RFC, L_C, ACOM, M_C, Ce, MVG and NOM.
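One simple way to compare these subsets is to count how many of the five FSS techniques select each metric. The sketch below does this with the lists exactly as given in the text; the tallying approach is an illustration and is not stated to be the procedure used in the study.

```python
from collections import Counter

# Metric subsets as listed in the text for each FSS technique.
subsets = {
    "Correlation based": ["RFC", "Ce", "COM", "NOM", "REJ", "AMVG"],
    "Gain Ratio": ["REJ", "COM", "RFC", "L_C", "ACOM", "M_C", "MVG", "AMVG"],
    "Information Gain": ["REJ", "NOM", "Ce", "NOM", "RFC", "L_C", "ACOM", "M_C", "MVG"],
    "OneR": ["REJ", "ACOM", "RFC", "M_C", "LCOM3", "LOC", "COM", "NOM", "AMVG", "MVG", "L_C"],
    "Symmetrical Uncertainty": ["REJ", "COM", "RFC", "L_C", "ACOM", "M_C", "Ce", "MVG", "NOM"],
}

# Count in how many subsets each metric appears (set() ignores repeats within one list).
selection_counts = Counter(m for metrics in subsets.values() for m in set(metrics))

# Metrics chosen by every technique.
selected_by_all = sorted(m for m, c in selection_counts.items() if c == len(subsets))
```

With these lists, RFC and REJ are the only metrics selected by all five techniques, while COM, NOM, L_C, ACOM, M_C and MVG are each selected by four.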
Table 6. Metrics subset generated by all the Feature Subset Selection Algorithms
Table 7. Mean Absolute Error (MAE)
FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron
Original Dataset(CHANGE) | 17.8681 | 16.7613 | 18.964 | 41.2443
Original Dataset(BChange) | 0.2467 | 0.2312 | 0.2434 | 0.2393
Correlation based Algorithm(CHANGE) | 0.044 | 0.397 | 15.1934 | 13.7241
Gain Ratio(BCHANGE) | 0.0453 | 0.0598 | 0.2365 | 0.1639
Information Gain(BCHANGE) | 0.027 | 0.0442 | 0.2362 | 0.1522
OneR(BCHANGE) | 0.0282 | 0.0319 | 0.2379 | 0.1436
Symmetrical Uncertainty(BCHANGE) | 0.027 | 0.0442 | 0.2362 | 0.1647
Table 8. Root Mean Square Error (RMS)
FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron
Original Dataset(CHANGE) | 47.2241 | 42.8919 | 38.2315 | 87.502
Original Dataset(BChange) | 0.4893 | 0.449 | 0.3512 | 0.4484
Correlation based Algorithm(CHANGE) | 0.1984 | 1.1772 | 29.3743 | 27.3932
Gain Ratio(BCHANGE) | 0.141 | 0.1638 | 0.3405 | 0.2996
Information Gain(BCHANGE) | 0.1031 | 0.1366 | 0.3424 | 0.2918
OneR(BCHANGE) | 0.1058 | 0.1191 | 0.3409 | 0.2654
Symmetrical Uncertainty(BCHANGE) | 0.1031 | 0.1366 | 0.3424 | 0.2865
Table 9. Relative Absolute Error (RAE)
FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron
Original Dataset(CHANGE) | 83.3185 | 78.1576 | 88.4288 | 192.3218
Original Dataset(BChange) | 49.4126 | 46.299 | 48.743 | 47.9335
Correlation based Algorithm(CHANGE) | 0.2059 | 1.8587 | 71.1378 | 64.2586
Gain Ratio(BCHANGE) | 9.0656 | 11.9788 | 47.386 | 32.8358
Information Gain(BCHANGE) | 5.4185 | 8.8508 | 47.3143 | 30.4991
OneR(BCHANGE) | 5.6517 | 6.3845 | 47.6559 | 28.76
Symmetrical Uncertainty(BCHANGE) | 5.4185 | 8.8508 | 47.3143 | 32.9908
Table 10. Root Relative Square Error (RRS)
FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron
Original Dataset(CHANGE) | 121.5123 | 110.3649 | 98.3735 | 225.151
Original Dataset(BChange) | 75.6944 | 89.8598 | 70.2922 | 89.7419
Correlation based Algorithm(CHANGE) | 0.5135 | 3.0473 | 76.0392 | 70.9109
Gain Ratio(BCHANGE) | 28.2226 | 32.7936 | 68.1604 | 59.976
Information Gain(BCHANGE) | 20.6298 | 27.346 | 68.5395 | 58.4075
OneR(BCHANGE) | 21.1845 | 23.8333 | 68.2302 | 53.133
Symmetrical Uncertainty(BCHANGE) | 20.6298 | 27.346 | 68.5395 | 57.3414
RQ3: Identify the best Feature Subset Selection technique and Machine learning algorithm combination
under various circumstances.
As table 7 shows, when MAE is used as the error measure, the OneR FSS method combined with K* learning gives the minimum error. When RMS is used, as in table 8, nearest neighbor based learning combined with either symmetrical uncertainty based FSS or information gain based FSS gives the minimum prediction error. When RAE is used, according to table 9, correlation based FSS combined with nearest neighbor based learning gives the best results. When RRS is used, according to table 10, correlation based FSS combined with nearest neighbor based learning again gives the most accurate results.
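Selecting the best combination for a given error measure amounts to finding the smallest entry in the corresponding results table. The sketch below does this for the RMS values reported for the four BCHANGE-based FSS techniques; the correlation based row is left out because it was run against CHANGE, whose errors are on a different scale. The variable names are illustrative.

```python
# Learners in the column order used by the results tables.
learners = ["Nearest Neighbor", "K*", "LWL", "Multilayer Perceptron"]

# RMS error per FSS technique (BCHANGE target), one value per learner.
rms_rows = {
    "Gain Ratio": [0.141, 0.1638, 0.3405, 0.2996],
    "Information Gain": [0.1031, 0.1366, 0.3424, 0.2918],
    "OneR": [0.1058, 0.1191, 0.3409, 0.2654],
    "Symmetrical Uncertainty": [0.1031, 0.1366, 0.3424, 0.2865],
}

# Flatten to {(FSS technique, learner): error}.
errors = {
    (fss, learner): value
    for fss, row in rms_rows.items()
    for learner, value in zip(learners, row)
}

# Keep every combination that reaches the minimum, so ties are reported rather than hidden.
lowest = min(errors.values())
best = sorted(pair for pair, value in errors.items() if value == lowest)
```

For these values the minimum RMS of 0.1031 is shared by information gain and symmetrical uncertainty, each combined with nearest neighbor learning, which agrees with the observation above.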
VI. CONCLUSION AND FUTURE SCOPE
As we have already seen, the error in change prediction is reduced after applying various FSS techniques. Besides reducing the chance of wrong prediction, a smaller set of metrics requires much less effort to analyze than the full set when estimating change for a large scale project.
As per the results collected in this study we can conclude that:
1. RFC, REJ, COM, NOM, L_C, ACOM and MVG are the most significant metrics for predicting the
structural quality of software.
2. After applying FSS using OneR we observed that prediction accuracy is improved significantly
with every machine learning algorithm when MAE, RAE and RMS prediction accuracy measures
are used.
3. Correlation based technique for FSS gives more accurate results if RRS is used as measure for
detection of error.
4. If MAE is used as a measure to calculate error then OneR based FSS used along with K* learning
gives more accurate results.
5. If RMS is used as a measure to calculate error then nearest neighbor based learning used either
with symmetric uncertainty based FSS or information gain based FSS method gives more accurate
results.
6. We observed that correlation based FSS used along with nearest neighbor based learning gives
more accurate results when RAE and RRS are used as measure to calculate error.
This study finds that the correlation based technique, i.e. CFS, is the best technique for FSS when machine learning is applied for prediction. However, more study needs to be carried out to verify the results. In this paper we have taken 30 metrics for analysis, and their impact is analyzed by an automated tool. One direction for future work is to include a wider range of metrics. To obtain more generalized results, more open source software systems can be included in the analysis. Moreover, more FSS and machine learning algorithms can be used for better analysis. Genetic algorithms can also be used to obtain a wider range of results.
VII. REFERENCES
[1] Kan S.H., Metrics and Models in Software Quality Engineering, Addison-Wesley Publishing Company, Reading, Massachusetts, USA, 1995.
[2] Fenton N.E., Pfleeger S.L., Software Metrics: A Rigorous and Practical Approach, 2nd Edition, PWS Publishing Company, Boston, USA, 1997.
[3] Kohavi R., John G. H., Wrappers for feature subset selection, Artificial Intelligence 97, pp.273-324, 1997.
[4] Zhang T., On the Consistency of Feature Selection using Greedy Least Squares Regression, Journal of Machine Learning Research, 2008.
[5] Dash M., Liu H., Consistency-based search in feature selection, Artificial Intelligence 151, pp.155-176, 2003.
[6]
[7] Gunnalan R., Menzies T., Appukutty K., Srinivasan A., Hu Y., Feature Subset Selection with TAR2less, 2003.
[8] Yang J., Honavar V., Feature Subset Selection using a Genetic Algorithm, Feature Extraction, Construction and Selection, Springer US, pp.117-136, 1998.
[9] Inza I., Larranaga P., Etxeberria R., Sierra B., Feature Subset Selection by Bayesian network-based optimization, Artificial Intelligence 123, pp.157-184, 2000.
[10] Witten I. H., Frank E., Hall M. A., Data Mining: Practical Machine Learning Tools and Techniques, Third Edition.
[11] Andrieu C., Freitas N. D., Doucet A., Jordan M. I., An Introduction to MCMC for Machine Learning, Machine Learning, 50, pp.5-43, 2003.
[12] Kubat M., Holte R., Matwin S., Machine Learning for the Detection of Oil Spills in Satellite Radar Images, Machine Learning, 30, pp.195-215, 1998.
[13] Freitag D., Machine Learning for Information Extraction in Informal Domains, Machine Learning, 39, pp.169-202, 2000.
[14] Gupta V., Chhabra J. K., Measurement of Dynamic Metrics Using Dynamic Analysis of Programs, Applied Computing Conference (ACC '08), Istanbul, Turkey, 2008.
[15] Aggarwal K. K., Singh Y., Kaur A., Malhotra R., Empirical Study of Object-Oriented Metrics, Journal of Object Technology, 2006.
[16] Singhal S., Jena M., A Study on WEKA Tool for Data Preprocessing, Classification and Clustering, International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 2, Issue-6, 2013.
[17] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H., The WEKA Data Mining Software: An Update, SIGKDD Explorations, vol. 11, Issue 1, 2009.
[18] Hall M. A., Correlation-based Feature Subset Selection for Machine Learning, Hamilton, New Zealand, 1998.
[19] Aha D., Kibler D., Instance-based learning algorithms, Machine Learning, vol. 6, pp.37-66, 1991.
[20] Cleary J. G., Trigg L. E., K*: An Instance-based Learner Using an Entropic Distance Measure, 12th International Conference on Machine Learning, pp.108-114, 1995.
[21] Frank E., Hall M. A., Pfahringer B., Locally Weighted Naive Bayes, 19th Conference in Uncertainty in Artificial Intelligence, pp.249-256, 2003.