Neurocomputing 138 (2014) 106–113


Letters

A new online data imputation method based on general regression auto associative neural network

Vadlamani Ravi*, Mannepalli Krishna

Centre of Excellence in CRM and Analytics, Institute for Development and Research in Banking Technology (IDRBT), Castle Hills Road #1, Masab Tank, Hyderabad 500 057, AP, India

* Corresponding author. Tel.: +91 402 329 4042; fax: +91 402 353 5157.
E-mail addresses: padmarav@gmail.com (V. Ravi), krishnamannepalli@gmail.com (M. Krishna).

http://dx.doi.org/10.1016/j.neucom.2014.02.037

Article info

Article history:
Received 4 August 2013
Received in revised form 25 January 2014
Accepted 7 February 2014
Communicated by Dr. Swagatam Das
Available online 8 April 2014

Keywords:
Data imputation
Auto associative neural network
General regression auto associative neural network
Radial basis function auto associative neural network
Particle swarm optimization

Abstract

In this paper we propose online, offline and semi-online data imputation models based on four auto associative neural networks. The online model employs mean imputation followed by a general regression auto associative neural network (GRAANN). The offline methods comprise mean imputation followed by a particle swarm optimization based auto associative neural network (PSOAANN), and mean imputation followed by a particle swarm optimization based auto associative wavelet neural network (PSOAAWNN); the semi-online method involves mean imputation followed by a radial basis function auto associative neural network (RBFAANN). We compared the performance of these hybrid models with that of mean imputation and a hybrid imputation method, viz, the K-means and multi-layer perceptron (MLP) of Ankaiah and Ravi (2011) [65]. We tested the effectiveness of these models on four benchmark classification and four benchmark regression datasets, three bankruptcy prediction datasets and one credit scoring dataset under 10-fold cross-validation testing. From the experiments, we observed that GRAANN yielded better imputation for the missing values than the rest of the models. We confirmed this by performing the Wilcoxon signed rank test to test the statistical significance of the differences between the proposed methods. It turned out that GRAANN outperformed the other models on most of the datasets.

1. Introduction

Missing or incomplete data is a very common problem in many real world datasets. Missing data occurs for several reasons: non-response to some fields in the data collection process by respondents because of negligence or privacy concerns, data entry errors, system failures, ambiguity of survey questions, cultural issues in updating the databases, and various other reasons. Imputation is defined as the substitution of a missing data point, or a missing component of a data point, by some suitable value. Missing or incomplete values result in less efficient estimates in classification or regression problems because of sample bias and reduced sample size. Missing value imputation became mandatory because most data mining algorithms cannot work with incomplete datasets. The completeness and quality of the data play a major role in analyzing the available data, because the inferences made from complete data are more accurate than those made from incomplete data [1]. Data imputation has found applications in automatic speech recognition, financial and business applications, traffic monitoring, industrial processes, telecommunications and computer networks, and medical diagnosis, among others [2].

Little and Rubin [3] categorized missing data into (i) missing completely at random (MCAR), (ii) missing at random (MAR), and (iii) not missing at random (NMAR). The MCAR situation occurs if the probability of a missing value for variable X depends neither on the value of X itself nor on any other variable in the dataset. MAR occurs if the probability of missing data on a particular variable X depends on other variables, but not on the variable X itself. NMAR occurs if the probability of a missing value of a particular variable X depends on the variable X itself. Missing data in the MCAR and MAR categories are recoverable, whereas missing data in the NMAR category are irrecoverable. Several techniques for imputing missing data based on statistical analysis [3] are mean substitution methods [4], hot deck imputation [5,6], regression methods [3], expectation maximization [7], and multiple imputation methods [8]. The methods based on machine learning techniques include the multi-layer perceptron [9], K-nearest neighbor [10], fuzzy-neural networks [11], AANN imputation with genetic algorithms (GA) [1], and self-organizing maps [12], among others.

Marseguerra and Zoia [13] were the first to propose the AANN (also known as auto encoder) for reconstruction of a missing signal in simulated nuclear reactor data. They used a Robust AANN (RAANN) for reconstruction of a missing time series when it is linearly or nonlinearly correlated with the other measured data.

Abdella and Marwala [1] proposed an AANN trained by the back propagation algorithm for imputation. They employed the AANN to train on the samples with complete values and then used the trained network and a genetic algorithm (GA) to impute the missing values in incomplete samples. The GA was used to approximate the missing input values by optimizing an objective function driven by the trained AANN. Some other works involving AANN for imputation are reviewed in the literature review section.

In this paper, we exploit the strength of GRNN in solving multi-input–multi-output (MIMO) problems. Accordingly, we propose a variant of GRNN called GRAANN, where the output layer contains the input variables themselves during training, thereby achieving auto association. We propose GRAANN for imputation as follows: we first employ GRAANN to train on the samples with complete values, then use mean imputation to obtain initial estimates of the missing values in incomplete samples, followed by the trained GRAANN to get the final imputed values. Therefore, our paper differs from Abdella and Marwala [1] in that we achieve imputation in one go, because GRAANN requires a single iteration only, unlike the AANN reported in Abdella and Marwala [1]. Consequently, our method can be used for imputing missing values in streaming data, i.e. data arriving online. We also propose other variants of AANN, viz, PSOAANN, PSOAAWNN and RBFAANN, for data imputation. In PSOAANN and PSOAAWNN, PSO is used to update the weight values in the training process. In PSOAANN, PSOAAWNN and RBFAANN, we imputed the missing values in the test set first with mean values and then input these test records to the trained PSOAANN, PSOAAWNN and RBFAANN for final imputation. However, since PSOAANN and PSOAAWNN require many iterations to get trained, they cannot be used for online applications.

The remainder of this paper is organized as follows: a brief review of the literature on imputation of missing data is presented in Section 2. The proposed method is explained in Section 3. The description of the datasets is presented in Section 4. The experimental design is described in Section 5. Results and discussion are presented in Section 6, followed by the conclusions in Section 7.

2. Literature review

There are several methods to handle missing data of numerical attributes. According to Kline [14], the methods for handling missing data can be classified into four categories: (1) deletion, (2) imputation, (3) modeling the distribution of the missing data and then estimating them based on certain parameters, and (4) machine learning methods.

2.1. Deletion procedures

The deletion techniques simply delete the cases that contain missing data. It is a brute force technique which is too simple to be effective. There are two forms of this approach: (i) List wise deletion, which ignores the cases or records containing missing values. The drawback of this method is that the dataset may lose a large number of observations, which may result in large error [15]. (ii) Pair wise deletion, which considers each feature separately: all recorded values are considered, and missing data are ignored for each feature. If the overall sample size is small or the missing data cases are numerous, this approach is suitable [15].

2.2. Imputation procedures

The imputation techniques include regression imputation, hot and cold deck imputation, multiple imputation and mean imputation. Schafer [16] points out that the disadvantage of these methods is that they ignore the correlations between the various components. If the variables are correlated, then we can perform the data imputation using regression imputation, where the regression equations are computed each time by considering the attribute containing the incomplete value as the target variable. The regression method preserves the variance and covariance of the missing data with the other variables. Hot and cold deck imputation is another type of data imputation, where the missing values are replaced by the closest components that are present in both vectors for each case with a missing value. Mean imputation is the earliest method of imputation, where the missing values of a variable are replaced by the average value of all the remaining cases of that variable. In the multiple imputation method, each missing value is replaced with valid and reasonable values, so we get M complete datasets by replacing each value M times; all the datasets are analyzed, after which we can make combined inferences [2].
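To make the imputation procedures above concrete, the following Python sketch (our illustration, not code from the paper; the toy array is hypothetical) shows mean imputation and regression imputation on a numeric matrix in which NaN marks a missing entry.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def regression_impute(X, target_col):
    """Impute NaNs in one column by least squares regression on the
    remaining columns, treating the incomplete attribute as the target."""
    X = X.copy()
    missing = np.isnan(X[:, target_col])
    predictors = np.delete(X, target_col, axis=1)
    ok = ~np.isnan(predictors).any(axis=1)   # rows fully observed in predictors
    train = ok & ~missing
    A = np.c_[np.ones(train.sum()), predictors[train]]
    coef, *_ = np.linalg.lstsq(A, X[train, target_col], rcond=None)
    fill = ok & missing
    X[fill, target_col] = np.c_[np.ones(fill.sum()), predictors[fill]] @ coef
    return X

# Toy example: the third value of the second column is missing.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.2]])
print(mean_impute(X))
print(regression_impute(X, target_col=1))
```

Note that mean imputation ignores the correlation between the two columns (it fills about 4.77), while regression imputation exploits it (it fills roughly 6.1), which is exactly Schafer's point above.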
2.3. Model-based procedures

The methods under model-based procedures are the maximum likelihood method and expectation maximization. According to DeSarbo and Rao [17], the maximum likelihood approach to analyzing missing data assumes that the observed data are a sample drawn from a multivariate normal distribution. The parameters are estimated from the available data, and based on these parameters the missing values are determined. According to Laird [18], the expectation maximization algorithm is an iterative process where, in the first step, it estimates the missing data and the parameters using maximum likelihood, while in the second step it re-estimates the missing data based on the new parameters and then recalculates the new parameter estimates based on the actual and re-estimated missing data [2].
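The estimate/re-estimate cycle described above can be sketched as follows. This is a simplified, EM-flavoured loop of our own (conditional means under a multivariate normal fitted to the current completed data), not the authors' implementation; the ridge term and the iteration count are arbitrary choices.

```python
import numpy as np

def em_like_impute(X, n_iter=20):
    """Iteratively re-impute missing entries by their conditional means
    under a multivariate normal fitted to the current completed data."""
    X = X.copy()
    miss = np.isnan(X)
    # Step 0: start from column-mean imputation.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)                                   # parameter update
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any() or m.all():
                continue
            o = ~m
            # Conditional mean of the missing block given the observed block.
            beta = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + beta @ (X[i, o] - mu[o])
    return X
```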
2.4. Machine learning methods

Samad and Harp [19] proposed an SOM approach for handling missing data. In the feed forward neural network approach, an MLP is trained as a nonlinear regression model using the complete cases, choosing one variable as the target each time. Several researchers, such as Sharpe and Solly [20], Nordbotten [21], Gupta and Lam [9], and Yoon and Lee [22], used MLP for missing data imputation. In the K-nearest neighbor (K-nn) [23] approach, the missing values are replaced using the nearest neighbors, selected from the complete cases so as to minimize the distance function. The K-nn approach has found applications in breast cancer prognosis [10,22,24].
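A minimal K-nn imputation in the spirit of [10,23] might look like the sketch below (ours, not the cited implementations); it measures distance only on the coordinates observed in the incomplete row and averages the k nearest complete rows.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each incomplete row from its k nearest complete rows,
    measuring distance only on that row's observed coordinates."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        X[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return X
```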

We now present the work done using AANN for data imputation. Mohamed and Marwala [25] proposed three techniques based on neural networks to impute the missing data in a medical database. The first technique consists of an AANN combined with a GA; it is extended to an agent-based system, which is the second technique; and the third technique views the missing data problem from a pattern classification perspective. The agent-based system provides the best accuracy on average. Marwala and Chakraverty [26] used an AANN and GA for fault classification in mechanical systems with missing data.

In Nelwamondo and Marwala [27], fuzzy ARTMAPs are used as an ensemble of neural networks to perform both classification and regression with missing data. Marivate et al. [28] used an AANN, a principal component analysis neural network (PCANN) and Support Vector Regression for prediction, and combined each of them with a GA to impute missing variables. The use of PCA improves the overall performance of the ANN. Nelwamondo et al. [29] separately employed expectation maximization (EM) and the AANN combined with GA. Results show that EM performs better when there is no interdependence between the input variables, whereas the AANN–GA is better when there is an inherent nonlinear relationship between some of the variables. Mohamed et al. [30] proposed a combination of a first MLP, an AANN and a second MLP, in that order, to achieve imputation. Ssali and Marwala [31] introduced a hybrid decision tree–AANN model combined with GA and a decision tree–PCA-neural network (PCANN) model combined with GA. The results indicate that the addition of a decision tree improves the results for both models. Mistry et al. [32] proposed the combination of AANN with PCA, where imputation was performed by the trained network and a GA. Chen [33] proposed the hybrid AANN–GA to impute missing variables for predicting business failure.

3. Proposed methodology

The common thread in all the methods presented in [29–33] is that they impute missing values in a dataset by considering one variable at a time. Further, some of them invoke a GA in the second stage to complete the job of imputation. Therefore, in this paper, we pursue a different line of research by devising a few new auto associative neural network architectures, which are extended, auto associative versions of MLP, RBF, WNN and GRNN, and which can finish imputation without having to invoke a GA. The reason for choosing these architectures is that they have proved to be powerful not only in solving nonlinear regression problems arising in the multi-input–single-output (MISO) framework but also in solving MIMO problems. Further, we intended to reduce the computational burden by simply imputing all variables (where missing values are present) in one go, instead of imputing the missing values in the data by considering one variable at a time. Of course, this necessitated the availability of initial imputed estimates, which are provided by mean imputation, which is extremely simple. In the case of MLP, we trained its auto associative counterpart, viz, AANN, by PSO to come out with PSOAANN, because of the inherent defects of the back propagation algorithm. Similarly, in the case of WNN, we did not want to be entangled by the defects of gradient based weight updating schemes, and hence employed PSO to update the weights and the dilation and translation parameters.

Accordingly, in this paper, we propose four new AANN architectures for data imputation. These are (i) the general regression auto associative neural network (GRAANN); (ii) the particle swarm optimization (PSO) trained AANN (PSOAANN); (iii) the PSO trained auto associative wavelet neural network (PSOAAWNN); and (iv) the semi-online radial basis function auto associative neural network (RBFAANN). All four auto associative neural nets are trained to predict their own input variables, so the output variables in the output layer are approximately equal to the input variables. We now briefly describe them as follows:

3.1. General regression auto associative neural network (GRAANN)

Since GRAANN is a variant of GRNN, we briefly describe GRNN first. The GRNN was proposed by Specht [34]. GRNN has the unique features of quick learning, a simple training algorithm, and being discriminative against infrequent outliers and erroneous observations. GRNN is capable of approximating any arbitrary function from historical data. Each training sample in GRNN operates as a kernel during the training process. A Parzen window estimator is used to establish the regression surface. In GRNN, estimation is based on non-parametric regression analysis in order to get the best fit for the observed data. GRNN consists of four layers: the input layer, pattern layer, summation layer and output layer. The input layer contains the input variables, connected to all the neurons in the pattern layer. The pattern layer contains the pattern nodes, which store the input records; the number of pattern nodes is equal to the number of input records. The outputs of the pattern units are passed on to the summation units. The summation unit includes a numerator summation unit and a denominator summation unit. The denominator summation unit adds up the weight values coming from each of the hidden neurons. The numerator summation unit adds up the weight values multiplied by the actual target value for each hidden neuron. The output node generates the estimated output value by dividing the value of the numerator summation unit by that of the denominator summation unit, and uses the result as the final estimated value.

Because GRNN is capable of solving multi-input and multi-output (MIMO) problems, we extended GRNN to GRAANN by taking the input variables in the output nodes. Fig. 1 depicts the architecture of the GRAANN. In the output layer, after the training is completed, we obtain the modified or predicted input variables (X1', X2', ..., Xn') for the input variables (X1, X2, ..., Xn). The process of GRAANN based imputation is as follows (a code sketch is given at the end of this subsection):

1. Divide the dataset into two sets: a set of complete records and another set of records which contain missing values, known as missing records.
2. Train the GRAANN with the complete records, using the same training algorithm as GRNN [34].
3. Impute the missing values first with the mean values of the corresponding variables in the missing records. This completes the first stage of imputation. Then, input the modified records obtained in the first stage to the trained GRAANN.
4. The quality of the imputation is measured using the mean absolute percentage error (MAPE) [35]:

\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i}{x_i} \right|

where n is the number of missing values in a given dataset, \hat{x}_i is the value predicted by the GRAANN for the i-th missing value, and x_i is the actual value.

The same methodology is followed in the case of the other three AANNs also.

Fig. 1. Architecture of GRAANN (input layer, pattern layer, summation layer and output layer).
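To make steps 1–4 concrete, here is a compact Python sketch of the pipeline: a textbook GRNN (a Gaussian-kernel, Nadaraya–Watson style estimator, as in [34]) made auto associative by using the stored training patterns as their own targets, followed by the two-stage imputation and the MAPE measure. This is our illustration, not the Neuroshell implementation used in Section 6, and the smoothing parameter sigma merely stands in for the setting in Table 1.

```python
import numpy as np

def graann_predict(X_train, X_query, sigma=0.3):
    """GRAANN = GRNN whose targets are its own inputs: each output record
    is a kernel-weighted average of the stored training patterns.
    Training is a single pass (just storing patterns); no weight updates."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))          # pattern-layer activations
    denom = w.sum(axis=1, keepdims=True) + 1e-12  # denominator summation unit
    return (w @ X_train) / denom                  # numerator / denominator units

def impute(X_complete, X_missing, sigma=0.3):
    """Two-stage imputation: mean-impute first, then refine with GRAANN.
    Only the originally missing cells are overwritten."""
    miss = np.isnan(X_missing)
    X0 = X_missing.copy()
    rows, cols = np.where(miss)
    X0[rows, cols] = X_complete.mean(axis=0)[cols]   # stage 1: mean imputation
    X1 = graann_predict(X_complete, X0, sigma)       # stage 2: trained GRAANN
    out = X_missing.copy()
    out[miss] = X1[miss]
    return out

def mape(actual, predicted):
    """Mean absolute percentage error over the n missing cells (Section 3.1)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 / len(actual) * np.abs((actual - predicted) / actual).sum()
```

Because the "training" here is a single pass that simply stores the complete records, new records can be imputed as they arrive, which is what makes the method usable online.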
3.2. Particle swarm optimization (PSO) based auto associative neural network (PSOAANN)

The PSO algorithm was introduced by Kennedy and Eberhart [36]; it is a population-based optimization technique that imitates the behavior of bird flocking and fish schooling. A PSOAANN containing one input layer, one hidden layer and one output layer was proposed by Paramjeet et al. [37] for privacy preservation, by employing PSO for training the AANN.

However, in this paper, we propose PSOAANN for data imputation, recognizing the versatility of the network. (In [37], for preserving privacy, the output layer contains the input variables.) The number of nodes in the hidden layer is a user defined parameter. The sigmoid function is used as the activation function in the hidden and output layers. Fig. 2 depicts the architecture of PSOAANN. The three layered AANN is used for data imputation instead of the five layered AANN [38–40] because it reduces the computational complexity and is simpler to understand and implement. We imputed the missing values in the test set first with mean values and then input these test records to the trained network for final imputation. A sketch of the PSO weight search follows.

Fig. 2. Architecture of PSOAANN (input, hidden and output layers).
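The PSO search itself reduces to the classic velocity/position updates of Kennedy and Eberhart [36]. The sketch below is a generic minimizer of our own; for PSOAANN or PSOAAWNN, the objective f would be the network's reconstruction error as a function of its flattened weight (and, for the WNN, dilation/translation) vector, with c1, c2, the particle count and the bounds taken from Table 1.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, c1=1.0, c2=3.0, iters=1000,
                 bounds=(-1.0, 1.0)):
    """Plain PSO per Kennedy and Eberhart [36]: particles track their own
    best position (pbest) and the swarm's best (g), and velocities pull
    each particle toward both."""
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))              # velocities
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()            # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g

# Example: minimize a sphere function in 5 dimensions.
print(pso_minimize(lambda w: (w ** 2).sum(), dim=5, iters=200))
```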
3.3. PSO based auto associative wavelet neural network (PSOAAWNN)

The auto associative wavelet neural network (AAWNN) uses particle swarm optimization to optimize the weights of the network in the training phase. In the test phase, the mean imputed test records are supplied to the trained AAWNN for final imputation. Wavelets [41] are functions used to localize a given function in both space and scale [42]. In analyzing physical situations where the signal contains discontinuities and sharp spikes, wavelets have advantages over traditional Fourier methods. A class of neural networks called WNN, which originates from wavelet decomposition in signal processing, has become more popular, based on locally supported basis functions such as those of Radial Basis Function Networks (RBFN) [43,44]. A family of wavelets can be constructed from a mother wavelet \psi(x), which is confined to a finite interval. Daughter wavelets \psi_{a,b}(x) are then formed using the translation (b) and dilation (a) parameters. An individual wavelet is defined as

\psi_{a,b}(x) = |a|^{-1/2}\, \psi\!\left(\frac{x-b}{a}\right)

The WNN was proposed as a universal tool for functional approximation. The WNN shows surprising effectiveness in solving the conventional problem of poor convergence, or even divergence, encountered in other kinds of neural networks. Compared to other networks, the WNN converges faster [45]. The popularity of WNN can be seen from its applications [46–50].
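The daughter-wavelet construction is easy to state in code. In the sketch below (ours), we use the Mexican-hat function as the mother wavelet purely for illustration; the paper does not fix a particular mother wavelet.

```python
import numpy as np

def mother(x):
    """Mexican-hat mother wavelet (our choice for illustration only)."""
    return (1 - x ** 2) * np.exp(-x ** 2 / 2)

def daughter(x, a, b):
    """Daughter wavelet psi_{a,b}(x) = |a|^(-1/2) * psi((x - b) / a),
    with dilation a and translation b, as in Section 3.3."""
    return abs(a) ** -0.5 * mother((x - b) / a)

x = np.linspace(-5, 5, 11)
print(daughter(x, a=2.0, b=1.0))
```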
3.4. Semi-online radial basis function auto associative neural network (RBFAANN)

The RBFAANN is an extension of the semi-online radial basis function neural network (SORBFNN) [51], wherein auto association is introduced by taking the input variables in the output layer also during training. The training algorithm for RBFAANN works in two steps. Unsupervised learning takes place in the first step on the input data, where the clusters are determined in just one pass using the evolving clustering method (ECM) algorithm of Kasabov and Song [52]. Supervised learning is involved in the second step, where the ordinary least squares technique (LSE) is employed. The ordinary least squares technique is used instead of iterative gradient based supervised training in order to justify the online character of the training algorithm in both phases. Ravi et al. [51] applied a semi-online training algorithm for radial basis function neural networks to predict bankruptcy in banks. The auto associative version of the same architecture is now proposed for data imputation. The architecture of the resulting semi-online RBFN is depicted in Fig. 3.

The online ECM is a fast, one-pass algorithm for dynamic estimation of the number of clusters in a dataset, and it does not involve any optimization. The online ECM is a distance based clustering method. The threshold value (Dthr) is set as a clustering parameter; in any cluster, the maximum distance, MaxDist, between a sample point and the cluster center is less than the threshold value Dthr. This parameter affects the number of clusters to be estimated.

Fig. 3. Architecture of RBFAANN (unsupervised learning by ECM in the hidden layer, supervised learning by LSE at the output layer).
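A simplified version of this one-pass, distance-threshold clustering can be sketched as follows. This is our abbreviation of ECM [52] (the full algorithm is given by Kasabov and Song); centers and radii evolve as samples arrive, with Dthr playing the role described above.

```python
import numpy as np

def ecm_like(X, dthr=0.2):
    """One-pass, distance-based clustering in the spirit of ECM [52]:
    no optimization, no revisiting of past samples. Simplified."""
    centers, radii = [X[0].copy()], [0.0]
    for x in X[1:]:
        d = np.array([np.linalg.norm(x - c) for c in centers])
        if (d <= np.array(radii)).any():
            continue                      # falls inside an existing cluster
        j = int(np.argmin(d + np.array(radii)))
        if d[j] + radii[j] > 2 * dthr:
            centers.append(x.copy())      # too far from everything: new cluster
            radii.append(0.0)
        else:
            # Enlarge cluster j and move its center toward the new sample.
            new_r = (d[j] + radii[j]) / 2
            centers[j] = x + (centers[j] - x) * (new_r / d[j])
            radii[j] = new_r
    return np.array(centers), np.array(radii)
```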



4. Dataset description

In this paper, we analyzed several regression, classification and banking datasets. While the regression datasets are Boston housing, forest fires, auto mpg and body fat, the classification datasets include wine, Pima Indians, iris and spectf. The banking bankruptcy datasets are Spanish, Turkish and UK bankruptcy, apart from a UK credit dataset. All the benchmark regression datasets (Boston housing, forest fires, auto mpg and body fat) and classification datasets (wine, Pima Indians, iris and spectf) are taken from the UCI machine learning repository [53] and the KEEL datasets [54]. The Turkish banks dataset is obtained from Canbas et al. [55] and is available at [56]. The Spanish banks dataset is obtained from Olmeda and Fernandez [57]. The UK dataset is obtained from Beynon and Peel [58]. The UK credit dataset is obtained from Thomas et al. [59]. The number of records and the number of attributes in these datasets are presented in Table 3.

5. Experimental design

We divided the total records in a dataset into a set of complete records and another set of missing records, i.e. those containing missing values. Complete records are the records without missing values, and these complete records are used in the training process. However, since none of the datasets analyzed here contains missing values, we simulated the scenario as follows: missing values in some records are created randomly by deleting some feature values, and these missing records are used in the testing process. In all, 10% of the total records are subjected to this process, and this set of records with missing values is taken as the test set. We trained the network with the set of complete records, imputed the missing values in the test set first by mean values, and then fed these test records to the trained auto associative network for final imputation. For all datasets, and for all proposed methods, we performed 10-fold cross-validation (10-FCV) by changing the composition of the training and test sets randomly. We calculated the average MAPE value over the 10 folds. This average MAPE serves as the measure of accuracy of the imputation process; the lower the average MAPE value, the better the method is said to be.

We state that, while we minimize the MAPE value of the training set during every single run of the algorithms in each fold, one must recognize that the MAPE value of the test set (where imputation takes place) is influenced by the composition of the training set. In order to alleviate this influence, one must carry out 10-fold cross-validation. Since MAPE is a dimensionless quantity expressed as a percentage, its average over 10 folds indicates the strength of the proposed imputation algorithms across 10 different test settings. Further, 10-fold cross-validation is also used to fine tune the parameters of the algorithms. Standard deviations of MAPE for all datasets, presented in Table 3, are computed as an additional summary measure. A sketch of this simulation protocol is given below.
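The protocol (delete feature values in a random 10% of the records, impute, score by MAPE, repeat over 10 random compositions) can be sketched as follows. The stand-in dataset, seeds and the placeholder mean imputer are our own; the impute() function from the GRAANN sketch in Section 3.1 can be dropped in as the imputer instead.

```python
import numpy as np

def make_missing(X, frac=0.10, seed=0):
    """Delete one randomly chosen feature value in a random 10% of the
    records, mirroring the simulation described above."""
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    rows = rng.choice(len(Xm), size=max(1, int(frac * len(Xm))), replace=False)
    cols = rng.integers(0, Xm.shape[1], size=len(rows))
    actual = Xm[rows, cols].copy()
    Xm[rows, cols] = np.nan
    return Xm, rows, cols, actual

def mean_impute_fn(complete, incomplete):
    """Placeholder imputer; substitute the GRAANN impute() from Section 3.1."""
    out = incomplete.copy()
    miss = np.isnan(out)
    out[miss] = np.take(complete.mean(axis=0), np.where(miss)[1])
    return out

X = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=(200, 5))  # stand-in data
scores = []
for fold in range(10):                 # 10 random train/test compositions
    Xm, rows, cols, actual = make_missing(X, seed=fold)
    test = np.isnan(Xm).any(axis=1)
    filled = mean_impute_fn(Xm[~test], Xm[test])
    Xhat = Xm.copy()
    Xhat[test] = filled
    pred = Xhat[rows, cols]
    scores.append(100.0 / len(actual) * np.abs((actual - pred) / actual).sum())
print(f"average MAPE over 10 folds: {np.mean(scores):.2f}% (sd {np.std(scores):.2f})")
```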
Table 1
Parameter settings for GRAANN, PSOAAWNN, PSOAANN and SORBFAANN.

GRAANN:
  Smoothing parameter: 0.3
  Genetic breeding pool size: 20/50/200/300
  Calibration method: genetic adaptive learning
  Distance metric: vanilla (Euclidean) or city block

PSOAAWNN/PSOAANN:
  Particles: 30
  C1 and C2: 1 and 3
  Tolerance: 0.0001
  Lower and upper bounds on weights and dilation/translation parameters: -1 and 1
  Max iterations: 1000
  Hidden nodes: 3 to the number of input nodes

SORBFAANN:
  Distance threshold (Dthr) of ECM: between 0.1 and 0.3
  Weight factor: 1
Table 2
Parameter settings for K-means MLP.

Dataset        | Clusters | Iterations | Learning rate                                            | Momentum rate                                                | Hidden layers                       | Hidden nodes                      | Training epochs
Iris           | 3        | 1000       | 0.8, 0.81, 0.85 and 1.0                                  | 0.12, 0.15, 0.16, 0.2 and 0.3                                | 2                                   | 15 (6 folds); 5, 10 and 20 (rest) | 1000
Body fat       | 2        | 300        | 0.9 and 0.45 (8 folds), 0.7 (rest)                       | 0.1                                                          | 2                                   | 10                                | 1000
Pima Indian    | 3        | 300        | 0.1                                                      | 0.1                                                          | 2                                   | 10                                | 200
Spanish banks  | 2        | 300        | 0.9 (3 folds), 0.5 (3 folds), 0.24, 0.38 and 0.38 (rest) | 0.4, 0.6                                                     | 1 (5 folds), 2 (4 folds), 3 (rest)  | 10 (8 folds), 5 (2 folds)         | 1000 (9 folds), 200 (1 fold)
Turkish banks  | 2        | 500        | 0.9 (6 folds), 0.95 (3 folds), 0.6 (1 fold)              | 0.2 (4 folds), 0.1 (2 folds), 0.39 (2 folds), 0.45 (2 folds) | 2 (7 folds), 1 (3 folds)            | 10                                | 500
UK banks       | 3        | 1500       | 0.5 and 0.95                                             | 0.1 and 0.52                                                 | 2                                   | 10                                | 1000
UK credit      | 3        | 1500       | 0.5 and 0.95                                             | 0.1 and 0.52                                                 | 2                                   | 10                                | 1000
Wine           | 3        | 500        | 0.01, 0.2 and 0.8                                        | 0.001                                                        | 1                                   | either 7 or 8                     | 1000
Forest fires   | 3        | 500        | 0.2 (9 folds), 0.8 (1 fold)                              | 0.0001                                                       | 1                                   | either 7 or 8                     | 1000
Boston housing | 3        | 500        | 0.2, 0.7 and 0.8                                         | 0.0001                                                       | 1                                   | either 7 or 8                     | 1000
Auto MPG       | 2        | 100        | 0.1                                                      | 0.01                                                         | 1                                   | 26                                | 1000
Spectf         | 2        | 100        | 0.1                                                      | 0.01                                                         | 2                                   | 30                                | 1000

6. Results and discussion

First, we present the tools employed and the parameters fixed for the techniques in the study. We used Neuroshell [60] to implement the GRAANN. Since GRAANN is not readily available in the tool, we implemented a MIMO model in GRNN and fed the original input variables as the output variables. This trick, in effect, transforms the GRNN into a GRAANN. The parameter settings chosen for training the GRAANN in Neuroshell are presented in Table 1. Further, we executed the code of PSOAANN developed in [37], while we implemented PSOAAWNN in C. Finally, we extended the SORBFNN [51], written in MATLAB, to its auto associative version, viz, RBFAANN, just as for the GRAANN. For PSOAANN and PSOAAWNN, the parameter settings used for all the datasets are presented in Table 1. The parameter settings of the hybrid K-means MLP for the various datasets and folds of the 10-fold cross-validation are presented in Table 2. Thus, we conducted exhaustive experiments by changing various parameter settings in all models in order to get the best possible imputation on all datasets.

The average and standard deviation of the MAPE values obtained from 10-fold cross-validation (10-FCV) by GRAANN and the other methods on the different datasets are presented in Table 3. Across the regression, classification and banking datasets, GRAANN produced the best results except on Pima Indian. The best average MAPE values are highlighted in bold font in the original article.

Table 3
Mean and standard deviation of MAPE values (%) over 10 folds.

Dataset        | Records | Attributes | GRAANN        | PSOAANN       | PSOAAWNN      | RBFAANN         | Mean imputation | K-means MLP
Regression datasets
Boston housing | 506     | 13         | 15.38 (2.45)  | 24.61 (5.95)  | 30.94 (7.54)  | 98.87 (32.76)   | 37.77 (10.37)   | 21.01 (4.16)
Forest fires   | 516     | 10         | 18.47 (2.08)  | 22.69 (5.98)  | 26.62 (5.41)  | 59.24 (19.17)   | 24.72 (6.84)    | 26.61 (5.23)
Auto mpg       | 392     | 7          | 15.54 (3.70)  | 37.59 (10.13) | 38.16 (13.59) | 62.53 (15.28)   | 59.70 (14.36)   | 23.75 (4.52)
Body fat       | 252     | 14         | 4.61 (2.03)   | 7.61 (4.53)   | 9.21 (4.01)   | 25.40 (14.78)   | 11.61 (7.18)    | 7.83 (1.64)
Classification datasets
Wine           | 178     | 13         | 12.87 (2.47)  | 22.16 (4.52)  | 23.64 (3.94)  | 39.11 (8.38)    | 29.99 (4.51)    | 21.58 (3.87)
Pima Indian    | 768     | 8          | 23.89 (2.86)  | 21.72 (3.21)  | 23.68 (3.01)  | 32.28 (4.66)    | 24.02 (3.82)    | 29.70 (3.39)
Iris           | 150     | 4          | 5.75 (2.41)   | 15.84 (9.03)  | 12.83 (6.41)  | 26.93 (13.88)   | 23.57 (14.46)   | 9.41 (1.97)
Spectf         | 267     | 44         | 8.41 (1.59)   | 16.69 (4.35)  | 43.30 (4.48)  | 21.12 (5.48)    | 14.85 (4.74)    | 12.14 (2.68)
Banking datasets
UK credit      | 1225    | 12         | 20.47 (5.32)  | 33.94 (10.35) | 38.64 (9.26)  | 45.53 (17.47)   | 28.43 (1.83)    | 32.17 (11.56)
Spanish        | 66      | 9          | 23.28 (11.40) | 60.95 (22.20) | 48.81 (15.25) | 847.02 (1043.3) | 55.53 (45.23)   | 39.91 (13.06)
Turkish        | 40      | 12         | 17.25 (10.14) | 53.56 (23.05) | 33.45 (7.86)  | 188.85 (122.6)  | 66.00 (26.01)   | 33.01 (21.34)
UK bankruptcy  | 60      | 10         | 26.85 (12.36) | 33.47 (9.21)  | 31.48 (5.86)  | 141.61 (42.68)  | 37.07 (11.64)   | 30.96 (10.58)
Table 4 presents the computed Wilcoxon signed rank test values for the different datasets. Under the Wilcoxon signed rank test, if the computed value exceeds the critical value, then the difference between the methods under comparison is statistically significant. We conducted the non-directional Wilcoxon signed rank test [61–64] at the 1% level of significance, in pairs, to test whether GRAANN is statistically significantly better than the other methods proposed here and elsewhere.

The critical value for the two-tailed Wilcoxon signed rank test at the 1% level of significance is 2.576. From Table 4, we observe that there is no statistically significant difference between GRAANN and PSOAANN on the Forest fires, Body fat, Pima Indian and UK bankruptcy datasets, out of the 12 datasets. Also, there is no statistically significant difference between GRAANN and PSOAAWNN on the Pima Indian, Turkish and UK bankruptcy datasets. However, there is a statistically significant difference between GRAANN and RBFAANN on all datasets. Further, there is no statistically significant difference between GRAANN and mean imputation on the Forest fires, Pima Indian and UK bankruptcy datasets. Moreover, there is no statistically significant difference between GRAANN and K-means MLP [65] on the Iris, UK credit, Spanish, Turkish and UK bankruptcy datasets. Observing the overall results, we conclude that GRAANN performed better imputation on most of the datasets. Most importantly, the fact that GRAANN requires just a single iteration to train makes it very attractive compared to all the other models. This feature makes GRAANN suitable for applications where data imputation must be performed in real time or online. Therefore, we conclude that GRAANN should be preferred for imputation within the class of AANN architectures. (A sketch of how such a paired test can be computed follows Table 4.)

Table 4
Wilcoxon signed rank test values of GRAANN versus the other methods.

Dataset        | PSOAANN | PSOAAWNN | RBFAANN | Mean imputation | K-means MLP
Regression
Boston housing | 2.67    | 2.77     | 2.77    | 2.77            | 2.67
Forest fires   | 1.86    | 2.67     | 2.77    | 2.36            | 2.77
Auto mpg       | 2.77    | 2.77     | 2.77    | 2.77            | 2.77
Body fat       | 1.96    | 2.77     | 2.77    | 2.67            | 2.67
Classification
Wine           | 2.77    | 2.77     | 2.77    | 2.77            | 2.77
Pima Indian    | 2.31    | 0.73     | 2.77    | 0.02            | 2.77
Iris           | 2.77    | 2.77     | 2.77    | 2.77            | 2.26
Spectf         | 2.77    | 2.77     | 2.77    | 2.77            | 2.77
Banking
UK credit      | 2.77    | 2.77     | 2.77    | 2.77            | 2.57
Spanish        | 2.77    | 2.77     | 2.77    | 2.77            | 2.57
Turkish        | 2.67    | 2.26     | 2.77    | 2.77            | 2.36
UK bankruptcy  | 1.65    | 1.14     | 2.77    | 1.75            | 0.63
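A paired comparison of the kind reported in Table 4 can be reproduced along the following lines, using scipy.stats.wilcoxon on fold-wise MAPE pairs. The numbers below are hypothetical; the paper reports a normal-approximation statistic compared against 2.576, which corresponds to a two-sided test at the 1% level.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired, fold-wise MAPE values for two methods (hypothetical numbers).
mape_a = np.array([15.1, 16.2, 14.8, 15.9, 15.3, 16.0, 14.9, 15.5, 15.7, 15.2])
mape_b = np.array([24.0, 25.1, 23.8, 24.9, 24.3, 25.0, 23.9, 24.5, 24.7, 24.2])

# Two-sided Wilcoxon signed rank test at alpha = 0.01.
stat, p = wilcoxon(mape_a, mape_b, alternative="two-sided")
print(stat, p, "significant" if p < 0.01 else "not significant")
```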

7. Conclusions

In this paper, based on the concept of the AANN, we proposed four new algorithms for data imputation, viz, GRAANN, PSOAANN, PSOAAWNN and RBFAANN. The effectiveness of these proposed methods was tested on four benchmark classification datasets, four benchmark regression datasets, three bankruptcy prediction datasets and one credit scoring dataset under 10-fold cross-validation testing. Based on the results, we conclude that GRAANN should be preferred for imputation within the class of AANN architectures, owing to its consistently better imputation on most of the datasets, as evidenced by the Wilcoxon signed rank test. Further, we conclude that GRAANN can be used in online data imputation applications, even though we did not test any online data here, because of its simple architecture, its fast, one-pass training algorithm, and its immunity to outliers. Moreover, this study demonstrates that we do not need computationally complex evolutionary algorithms to fine tune the imputations yielded by an AANN. We could achieve highly accurate imputations with just mean imputation followed by GRAANN. This feature obviates the necessity of invoking evolutionary algorithms in the whole process. This is a significant outcome of the study.

References

[1] M. Abdella, D. Marwala, The use of genetic algorithms and neural networks to approximate missing data in database, in: Proceedings of the IEEE 3rd International Conference on Computational Cybernetics (ICCC), 2005, pp. 207–212.
[2] P.J. García-Laencina, J.L. Sancho-Gomez, A.R. Figueiras-Vidal, Pattern classification with missing data: a review, Neural Comput. Appl. 19 (2010) 263–282.
[3] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, 2nd ed., Wiley-Interscience, Hoboken, NJ, USA, 2002.
[4] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1987.
[5] I.G. Sande, Hot-deck imputation procedures, in: Incomplete Data in Sample Surveys, vol. 3, Academic Press, New York, 1983, pp. 339–349.
[6] B.M. Ford, An overview of hot-deck procedures, in: Incomplete Data in Sample Surveys, vol. 2, Academic Press, New York, 1983, pp. 185–207.
[7] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1) (1977) 1–38.
[8] D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley, New York, 1987.
[9] A. Gupta, M.S. Lam, Estimating missing values using neural networks, J. Oper. Res. Soc. 47 (2) (1996) 229–238.
[10] G. Batista, M.C. Monard, A study of K-nearest neighbor as an imputation method, in: A. Abraham et al. (Eds.), Hybrid Intelligent Systems, Ser. Frontiers in Artificial Intelligence and Applications, IOS Press, 2002, pp. 251–260.

[11] B. Gabrys, Neuro-fuzzy approach to processing inputs with missing values in pattern recognition problems, Int. J. Approx. Reasoning 30 (2002) 149–179.
[12] P. Merlin, A. Sorjamaa, B. Maillet, A. Lendasse, X-SOM and L-SOM: a double classification approach for missing value imputation, Neurocomputing 73 (2010) 1103–1108.
[13] M. Marseguerra, A. Zoia, The auto-associative neural network in signal analysis II, application to on-line monitoring of a simulated BWR component, Ann. Nucl. Energy 32 (11) (2002) 1207–1223.
[14] R.B. Kline, Principles and Practice of Structural Equation Modeling, Guilford Press, New York, 1988.
[15] Q. Song, M. Shepperd, A new imputation method for small software project data sets, J. Syst. Software 80 (1) (2007) 51–62.
[16] J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, Florida, USA, 1997.
[17] W.S. DeSarbo, V.R. Rao, A constrained unfolding methodology for product positioning, Market. Sci. 5 (1) (1986) 1–19.
[18] N.M. Laird, Missing data in longitudinal studies, Stat. Med. 7 (1988) 305–315.
[19] T. Samad, S.A. Harp, Self-organization with partial data, Network: Comput. Neural Syst. 3 (1992) 205–212.
[20] P.K. Sharpe, R.J. Solly, Dealing with missing values in neural network based diagnostic systems, Neural Comput. Appl. 3 (2) (1995) 73–77.
[21] S. Nordbotten, Neural network imputation applied to the Norwegian 1990 population census data, J. Off. Stat. 12 (1996) 385–401.
[22] S.Y. Yoon, S.Y. Lee, Training algorithm with incomplete data for feed-forward neural networks, Neural Process. Lett. 10 (1999) 171–179.
[23] G. Batista, M.C. Monard, Experimental Comparison of K-nearest Neighbor and Mean or Mode Imputation Methods with the Internal Strategies Used by C4.5 and CN2 to Treat Missing Data, Technical Report, University of Sao Paulo, 2003.
[24] J. Jerez, I. Molina, J. Subirates, L. Franco, Missing data imputation in breast cancer prognosis, in: Proceedings of the 24th IASTED International Conference on Biomedical Engineering (BioMed06), Anaheim, CA, USA, 2006.
[25] S. Mohamed, T. Marwala, Neural network based techniques for estimating missing data in databases, in: The 16th Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 2005, pp. 27–32.
[26] T. Marwala, S. Chakraverty, Fault classification in structures with incomplete measured data using auto associative neural networks and genetic algorithm, Curr. Sci. India 90 (4) (2006) 542–548.
[27] F.V. Nelwamondo, T. Marwala, Fuzzy ARTMAP and neural network approach to online processing of inputs with missing values, in: Proceedings of the 17th Symposium of the Pattern Recognition Association of South Africa, 2006, pp. 177–182.
[28] V.N. Marivate, F.V. Nelwamondo, T. Marwala, Autoencoder, Principal Component Analysis and Support Vector Regression for Data Imputation, CoRR, 2007.
[29] F.V. Nelwamondo, S. Mohamed, T. Marwala, Missing data: a comparison of neural network and expectation maximization techniques, Curr. Sci. 93 (12) (2007).
[30] A.K. Mohamed, F.V. Nelwamondo, T. Marwala, Estimating missing data using neural network techniques, principal component analysis and genetic algorithms, in: Proceedings of the 18th Symposium of the Pattern Recognition Association of South Africa, 2007.
[31] G. Ssali, T. Marwala, Computational intelligence and decision trees for missing data estimation, in: IJCNN, 2008, pp. 201–207.
[32] J. Mistry, F.V. Nelwamondo, T. Marwala, Using principal component analysis and auto associative neural networks to estimate missing data in a database, in: Proceedings of the 12th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI, 2008.
[33] M.H. Chen, Pattern recognition of business failure by auto associative neural networks in considering the missing values, in: Computer Symposium, 2010, pp. 711–715.
[34] D.F. Specht, A general regression neural network, IEEE Trans. Neural Networks 2 (6) (1991) 568–576.
[35] B.E. Flores, A pragmatic view of accuracy measurement in forecasting, Omega 14 (2) (1986) 93–98.
[36] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, Piscataway, NJ, USA, 1995, pp. 1942–1948.
[37] Paramjeet, V. Ravi, N. Nekuri, C. Raghavendra Rao, Privacy preserving data mining using particle swarm optimization trained auto-associative neural network: an application to bankruptcy prediction in banks, Int. J. Data Min. Model. Manage. 4 (1) (2012) 39–56.
[38] M.A. Kramer, Nonlinear principal component analysis using auto associative neural networks, AIChE J. 37 (2) (1991) 233–243.
[39] C. Pramodh, V. Ravi, Modified great deluge algorithm based auto associative neural network for bankruptcy prediction in banks, Int. J. Comput. Intell. Res. 3 (4) (2007) 363–370.
[40] V. Ravi, C. Pramodh, Non-linear principal component analysis-based hybrid classifiers: an application to bankruptcy prediction in banks, Int. J. Inf. Decis. Sci. 2 (1) (2010) 50–67.
[41] A. Grossmann, J. Morlet, Decomposition of Hardy functions into square integrable wavelets of constant shape, SIAM J. Math. Anal. 15 (1984) 723–736.
[42] http://mathworld.wolfram.com/Wavelet.html, last retrieved in 2013.
[43] Q. Zhang, A. Benveniste, Wavelet networks, IEEE Trans. Neural Networks 3 (6) (1992) 889–898.
[44] Q. Zhang, Using wavelet network in nonparametric estimation, IEEE Trans. Neural Networks 8 (2) (1997) 227–236.
[45] X. Zhang, J. Qi, R. Zhang, M. Liu, Z. Hu, H. Xue, Prediction of programmed-temperature retention values of naphthas by wavelet neural networks, Comput. Chem. 25 (2) (2001) 125–133.
[46] E. Avci, An expert system based on wavelet neural network-adaptive norm entropy for scale invariant texture classification, Expert Syst. Appl. 32 (3) (2007) 919–926.
[47] C. Dimoulas, G. Kalliris, G. Papanikolaou, V. Petridis, A. Kalampakas, Bowel-sound pattern analysis using wavelets and neural networks with application to long-term, unsupervised, gastrointestinal motility monitoring, Expert Syst. Appl. 34 (1) (2008) 26–41.
[48] L. Dong, D. Xiao, Y. Liang, Y. Liu, Rough set and fuzzy wavelet neural network integrated with least square weighted fusion algorithm based fault diagnosis research for power transformers, Electric Power Syst. Res. 78 (1) (2008) 129–136.
[49] K. Vinaykumar, V. Ravi, M. Carr, N. Rajkiran, Software cost estimation using wavelet neural networks, J. Syst. Software 81 (11) (2008) 1853–1867.
[50] N. Rajkiran, V. Ravi, Software reliability prediction using wavelet neural networks, in: International Conference on Computational Intelligence and Multimedia Applications, Sivakasi, Tamilnadu, India, 2007.
[51] V. Ravi, P. Ravikumar, E. Ravisrinivas, N.K. Kasabov, A semi-online training algorithm for the radial basis function neural networks: applications to bankruptcy prediction in banks, in: Advances in Banking Technology and Management: Impacts of ICT and CRM, IGI Global, USA, 2007, pp. 243–260.
[52] N.K. Kasabov, Q. Song, DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, IEEE Trans. Fuzzy Syst. 10 (2) (2002) 144–154.
[53] http://archive.ics.uci.edu/ml/datasets.html, last retrieved in 2013.
[54] http://sci2s.ugr.es/keel/datasets.php, last retrieved in 2013.
[55] S. Canbas, A. Caubak, S.B. Kilic, Prediction of commercial bank failure via multivariate statistical analysis of financial structures: the Turkish case, Eur. J. Oper. Res. 166 (2005) 528–546.
[56] http://www.tbb.org.tr/english/bulten/yillik/2000/ratios.xls, last retrieved in 2013.
[57] I. Olmeda, E. Fernandez, Hybrid classifiers for financial multicriteria decision making: the case of bankruptcy prediction, Comput. Econ. 10 (1997) 317–335.
[58] M.J. Beynon, M.J. Peel, Variable precision rough set theory and data discretization: an application to corporate failure prediction, Omega 29 (2001) 561–576.
[59] L.C. Thomas, D.B. Edelman, J.N. Crook, Credit Scoring and Its Applications, SIAM, Philadelphia, USA, 2002.
[60] http://www.neuroshell.com, last retrieved in 2013.
[61] https://www.msu.edu/user/sw/statrev/strv50.htm?Q, last retrieved in 2013.
[62] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (1945) 80–83.
[63] S. Siegel, Non-parametric Statistics for the Behavioral Sciences, McGraw Hill, New York, 1956, pp. 75–83.
[64] R. Lowry, Concepts and Applications of Inferential Statistics, 2013. Retrieved from http://vassarstats.net/textbook/ch12a.html.
[65] N. Ankaiah, V. Ravi, A novel soft computing hybrid for data imputation, in: Proceedings of the 7th International Conference on Data Mining (DMIN), Las Vegas, USA, 2011.
Vadlamani Ravi is an associate professor in the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, since February 2010. He obtained his Ph.D. in the area of soft computing from Osmania University, Hyderabad and RWTH Aachen, Germany (2001); M.S. (science and technology) from BITS, Pilani (1991); and M.Sc. (statistics and operations research) from IIT, Bombay (1987). At IDRBT, he spearheads the CRM Lab, the first of its kind in India, and evangelizes CRM by conducting customized training programmes for bankers on CRM subsuming OCRM and ACRM, data warehousing and data mining, and conducting POCs for banks.

He has 130 papers to his credit, of which 70 are in refereed international journals. His papers have appeared in Applied Soft Computing, Soft Computing, Asia-Pacific Journal of Operational Research, Decision Support Systems, European Journal of Operational Research, Expert Systems with Applications, Fuzzy Sets and Systems, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Reliability, Information Sciences, Journal of Systems and Software, Knowledge Based Systems, IJUFKS, IJCIA, IJAEC, IJDMMM, IJIDS, IJDATS, IJISSS, IJCIR, IJCISIM, IJBIC, Computers and Chemical Engineering, Canadian Geotechnical Journal, Biochemical Engineering Journal, Bioinformation, Journal of Services Research, etc. He also edited a book entitled Advances in Banking Technology and Management: Impacts of ICT and CRM (http://www.igi-global.com/reference/details.asp?id=6995), published by IGI Global, USA, 2007. Some of his research papers are listed in the Top 25 Hottest Articles by Elsevier and World Scientific. He has an H-index of 24 with 1935 citations for his papers (http://scholar.google.co.in/). He is recognized as a Ph.D. supervisor at the Department of Computer and Information Sciences, University of Hyderabad and the Department of Computer Sciences, Berhampur University, Orissa. He is an invited member in Marquis Who's Who in the World, USA, in 2009 and 2010. He is also an invited member in 2000 Outstanding Intellectuals of the 21st Century 2009/2010 published by the International Biographical Center, Cambridge, England, and an invited member of Top 100 Educators in 2009 published by the International Biographical Centre, Cambridge, England. Three Ph.D. students graduated under his supervision. So far, he has advised 50 M.Tech./M.C.A./M.Sc. projects and at least a dozen summer interns from various IITs. He currently supervises three Ph.D. students and 5 M.Tech. students. He is on the Steering Committee of Canara Bank for their DWH and CRM project; IT advisor for Indian Bank for their DWH and CRM projects; principal consultant for Bank of India for their CRM project; and an expert committee member for IRDA for their business analytics and fraud analytics projects. He is a referee for 36 international journals of repute. Moreover, he is a member of the Editorial Review Board for the International Journal of Information Systems in the Service Sector, IGI Global, USA; the International Journal of Data Analysis Techniques and Strategies, Interscience, Switzerland; the International Journal of Information and Decision Sciences (IJIDS), Interscience, Switzerland; the International Journal of Strategic Decision Sciences (IJSDS), IGI Global, USA; and the International Journal of Information Technology Project Management (IJITPM), IGI Global, USA. He is on the PC of several international conferences and has chaired many sessions at international conferences in India and abroad. His research interests include fuzzy computing, neuro computing, soft computing, data mining, web mining, privacy preserving data mining, global/multi-criteria/combinatorial optimization, bankruptcy prediction, risk measurement, text mining, customer relationship management (CRM), churn prediction in banks and firms, and asset liability management through optimization. In a career spanning 25 years, he has worked in several cross-disciplinary areas such as financial engineering, software engineering, reliability engineering, chemical engineering, environmental engineering, chemistry, medical entomology, bioinformatics and geotechnical engineering. At IDRBT, he has held various administrative positions such as coordinator, IDRBT-industry relations (2005–2006), M.Tech. (IT) coordinator (2006–2009), and convener, IDRBT working group on CRM (2010–2011). As the convener of the IDRBT working group on CRM, he co-authored a Handbook on Holistic CRM and Analytics (http://www.idrbt.ac.in/PDFs/Holistic%20CRM%20Booklet_Nov2011.pdf), in which a new framework for CRM, best practices and new organization structures, apart from HR issues, are all suggested for the Indian banking industry. He has 25 years of research experience and 12 years of teaching experience. He designed and developed a number of courses in Singapore and India at the M.Tech. level in soft computing, data warehousing and data mining, fuzzy computing, neuro computing, quantitative methods in finance, soft computing in finance, etc. Furthermore, he designed and developed a number of short courses for Executive Development Programmes (EDPs), in the form of 2-week long CRM for senior and junior executives, data mining, big data and its relevance to banking, fraud analytics, etc. He conducted ACRM proof of concept (POC) exercises for six banks on their real data. He has established research collaborations with the University of Hong Kong; the University of Ghent, Belgium; IISc, Bangalore; and IIT Kanpur. He coordinated the first international EDP in IDRBT on ACRM for banking executives, jointly with Prof. Dr. Dirk Van den Poel, University of Ghent, at Ghent, Belgium in 2011; this programme was successfully repeated in 2012. As part of academic outreach, he is an invited resource person in various national workshops and faculty development programmes on soft computing and data mining funded by AICTE and organized by engineering colleges in India.

Prior to joining IDRBT as an assistant professor in April 2005, he worked as a faculty member at the Institute of Systems Science (ISS), National University of Singapore (April 2002–March 2005). At ISS, he was involved in teaching M.Tech. (knowledge engineering) and in research in the areas of fuzzy systems, neural networks, soft computing systems, data mining and machine learning. Furthermore, he consulted for Seagate Technologies, Singapore and Knowledge Dynamics Pvt. Ltd., Singapore, on data mining projects. Before leaving for Singapore, he worked as an assistant director (Scientist E1) from 1996 to 2002 and as Scientist C from 1993 to 1996 at the Indian Institute of Chemical Technology (IICT), Hyderabad. He was deputed to RWTH Aachen (Aachen University of Technology), Germany, under the DAAD Long Term Fellowship to carry out advanced research during 1997–1999. He earlier worked as Scientist B and Scientist C at the Central Building Research Institute, Roorkee.

Mannepalli Krishna holds an M.Tech. (IT) from the University of Hyderabad and IDRBT, Hyderabad. His research interests include data imputation and machine learning. He now works as an officer in Andhra Bank, Mumbai, India.
