DOI 10.1007/s00521-017-3096-3
ORIGINAL ARTICLE
Abstract  In order to curb the model expansion of kernel learning methods and adapt to the nonlinear dynamics in the process of nonstationary time series online prediction, a new online sequential learning algorithm with sparse update and an adaptive regularization scheme is proposed based on the kernel-based incremental extreme learning machine (KB-IELM). For online sparsification, a new method is presented to select the sparse dictionary based on an instantaneous information measure. This method utilizes a pruning strategy, which can prune the least "significant" centers and preserve the important ones by online minimizing the redundancy of the dictionary. For the adaptive regularization scheme, a new objective function is constructed based on the basic ELM model. The new model has different structural risks in different nonlinear regions. At each training step, the newly added sample can be assigned an optimal regularization factor by an optimization procedure. Performance comparisons of the proposed method with other existing online sequential learning methods are presented using artificial and real-world nonstationary time series data. The results indicate that the proposed method can achieve higher prediction accuracy, better generalization performance and stability.

Keywords  Time series prediction · Extreme learning machine · Online modeling · Fixed budget · Sparsity measures · Adaptive regularization

Corresponding author: Aiqiang Xu (hjhy1989@njfu.edu.cn)
Wei Zhang (linguo@njfu.edu.cn) · Dianfa Ping (mingyunjiang@njfu.edu.cn) · Mingzhe Gao (1601111088@pku.edu.cn)

1 Office of Research and Development, Naval Aeronautical and Astronautical University, Yantai 264001, People's Republic of China
2 Department of Electronic and Information Engineering, Naval Aeronautical and Astronautical University, Yantai 264001, People's Republic of China

1 Introduction

Nonstationary time series prediction (TSP) plays an important role in scientific and engineering fields such as fault-tolerant analysis, state prediction, condition monitoring and fault diagnosis [1]. The main task of TSP is to find a proper model with appropriate structure and parameters to characterize the dynamic behavior of real systems.

As a famous nonlinear modeling method, neural networks (NNs) have been extensively used to address TSP issues over the years [2]. NNs are proven to be universal approximators under suitable conditions, thus providing the means to capture information in data that is difficult to identify using other approaches. It is, however, well known that traditional NN algorithms suffer from problems such as being easily trapped in local minima, slow convergence and huge computational costs [3]. In order to overcome these issues, Huang et al. [4] proposed the extreme learning machine (ELM), a novel learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). Its salient advantage is that the input weights and hidden biases are randomly chosen instead of being exhaustively tuned. It has been reported to provide better generalization performance with much faster learning speed [5–9].
Neural Comput & Applic
During the last few years, there has been increasing attention on kernel learning methods [10]. The fundamental idea of kernel methods is that a Mercer kernel is applied to transform the low-dimensional feature vector into a high-dimensional reproducing kernel Hilbert space (RKHS), in which many nonlinear problems become linearly solvable. Recently, Huang et al. [11] extended ELM by using kernel functions; we denote the derived result as ELM with kernel (KELM). KELM is a special batch variant of ELM [12, 13]. Experimental and theoretical analysis has shown that KELM tends to have better scalability and can achieve similar (for regression and binary classification cases) or much better (for multiclass cases) generalization performance at much faster learning speed than basic SVM and LS-SVM [11].

However, in many actual applications, online, adaptive and real-time operation is required. Such requirements pose a serious challenge, since the aforementioned NNs, ELM and KELM operate in batch mode. In other words, whenever a new sample arrives, these algorithms need to gather the old and new data together and retrain in order to incorporate the new information [14]. This results in extra storage consumption and makes the learning time grow longer and longer. Fortunately, some online sequential learning algorithms have been proposed to meet these demands, such as online sequential ELM (OSELM) [15] and regularized online sequential ELM (ReOS-ELM) [16]. Moreover, KB-IELM was presented in Ref. [17], a beneficial attempt to extend KELM to online application.

There is no doubt that researching the online application of KELM is significant work. Although KB-IELM can handle online tasks, some issues still need to be solved. For example, KB-IELM's model order is equal to the number of training samples, which leads to kernel matrix expansion as learning goes on [10, 18]. On the one hand, the algorithm is in danger of over-fitting; on the other hand, the computational complexity and storage requirement grow superlinearly. In a word, two key problems must be solved when KB-IELM is applied to online prediction of nonstationary time series: (1) how to curb the model expansion; (2) how to track or adapt to the nonlinear dynamics in a time-varying and nonstationary environment.

Commonly, online sparsification strategies are employed to solve the first issue. They help to curb the growing number of kernel functions as training samples sequentially arrive [19]. Ref. [20] proposed KELM with forgetting mechanism (FOKELM) based on a traditional sliding time window. On the basis of FOKELM, Ref. [21] proposed CF-FOKELM by use of Cholesky factorization. These methods obtain a compact dictionary, but the dictionary largely depends on the latest k observed samples. Generally speaking, the inner structure hidden in the time series data determines the samples' significance, especially in a nonstationary setting [22], so these methods cannot guarantee that the newly added sample is the most valuable for prediction [23]. Ref. [24] achieved online sparsification by deleting the old samples with the highest similarity to new samples; this approach cannot effectively track the system dynamics. Ref. [25] proposed ALD-KOS-ELM by use of the approximate linear dependence (ALD) criterion. A major criticism of the ALD criterion is that it leads to costly operations with quadratic complexity in the cardinality m of the dictionary. Ref. [26] achieved model sparsification by using fast leave-one-out cross-validation (FLOO-CV).

The model expansion has been solved effectively by the aforementioned methods, but the second problem is still not worked out fundamentally. In the process of nonlinear dynamic system modeling, empirical risk and structural risk should be considered simultaneously. Generally, KELM controls the structural risk by Tikhonov regularization [16, 27]. There is no doubt that a time-varying system should have different structural risks in different nonlinear regions [28]. But the current KELM-based methods employ a constant regularization factor at all times, greatly limiting their effectiveness when modeling unknown nonlinear time-varying systems. So, in order to improve the capability of tracking the time-varying dynamics, the regularization factor should vary over time.

The aim of this paper is to seek a new online sequential learning strategy of KB-IELM for nonstationary time series prediction which is computationally simple and able to simultaneously solve the aforementioned two key problems. The fixed-budget method is regarded as the basic modeling strategy, KB-IELM is regarded as the basic modeling algorithm, and a unified framework is then presented by associating the proposed sparsification rule with the adaptive regularization scheme.

For online sparsification, a new method is presented to select the sparse dictionary based on an instantaneous information measure. The proposed sparsification method utilizes a pruning strategy. By online minimizing the redundancy of the dictionary, it decides whether to replace one of the old dictionary members with the new kernel function. In the end, a compact dictionary with predefined size can be selected adaptively. The proposed method does not need any a priori knowledge about the data, and its computational complexity is linear in the number of kernel centers.

For the adaptive regularization scheme, a new objective function is constructed based on the basic ELM model. The new model has different regularization factors in different nonlinear regions. At each training iteration, in order to assign an optimal regularization factor to the newly added sample,
the LOO-CV generalization error is adopted to construct a loss function, and the optimal regularization factor is then derived by minimizing this loss function with the gradient descent (GD) method. Moreover, a dynamic learning rate is adopted to ensure algorithmic convergence.

Finally, we associate the proposed sparsification rule with the adaptive regularization scheme based on the KB-IELM algorithm and derive a new online sequential learning algorithm for KELM (denoted NOS-KELM for simplicity). Performance comparisons of the proposed method with other existing algorithms are presented using artificial and real-life time series data. The simulation results testify that NOS-KELM is an effective way to predict nonstationary time series.

The rest of this paper is organized as follows. In Sect. 2, a brief description is presented of KB-IELM and several sparsity measure criteria. The proposed algorithm is given in Sect. 3, including the online selection of key nodes, the real-time update of the kernel weight coefficients and the online optimization of the regularization factor. In Sect. 4, the computational complexity is discussed. In Sect. 5, the proposed algorithm is evaluated on both simulated and real-world data. The conclusion is drawn in Sect. 6.

2 Preliminaries

Mercer's conditions can be applied to ELM. The kernel matrix is defined as G = H H^T, where G(i, j) = h(x_i) h(x_j)^T = k(x_i, x_j) and k(·, ·) is a predefined kernel function. Then the output of the SLFN by kernel-based ELM can be given as Eq. (3):

    f(·) = h(·) H^T (c^{-1} I + H H^T)^{-1} Y
         = [k(x_1, ·), …, k(x_N, ·)] (c^{-1} I + G)^{-1} Y        (3)

Let k = [k(x_1, ·), …, k(x_N, ·)] and θ = (c^{-1} I + G)^{-1} Y, where θ denotes the kernel weight coefficient vector. With θ = [θ_1, …, θ_N]^T, we have f(·) = k θ = Σ_{i=1}^N θ_i k(x_i, ·).

In Ref. [17], KB-IELM is proposed to realize adaptive update of the kernel weight coefficients when new samples arrive sequentially. Suppose the training set at time t is {(x_i, y_i) | i = 1, …, t} and let A_t = c^{-1} I_t + G_t. When the sample (x_{t+1}, y_{t+1}) arrives at time t + 1, we have

    A_{t+1} = [ A_t        V_{t+1}
                V_{t+1}^T  v_{t+1} ]        (4)

where V_{t+1} = [k_{1,t+1}, …, k_{t,t+1}]^T and v_{t+1} = c^{-1} + k_{t+1,t+1}. Using the block matrix inverse lemma, A_{t+1}^{-1} can be calculated from A_t^{-1}:

    A_{t+1}^{-1} = [ A_t^{-1} + A_t^{-1} V_{t+1} ρ_{t+1}^{-1} V_{t+1}^T A_t^{-1}    −A_t^{-1} V_{t+1} ρ_{t+1}^{-1}
                     −ρ_{t+1}^{-1} V_{t+1}^T A_t^{-1}                                ρ_{t+1}^{-1} ]        (5)

where ρ_{t+1} = v_{t+1} − V_{t+1}^T A_t^{-1} V_{t+1} is the Schur complement of A_t in A_{t+1}.
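The recursive update of Eqs. (4)–(5) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Gaussian kernel, the regularization value and all function names are assumptions made for the example.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian kernel; stands in for the paper's generic k(., .)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kbielm_update(A_inv, X, x_new, c=100.0, sigma=1.0):
    """One KB-IELM step (Eqs. 4-5): grow A^{-1} via the block-inverse lemma.

    A_inv : inverse of A_t = c^{-1} I + G_t for the t samples in X.
    Returns the inverse of A_{t+1} after appending x_new.
    """
    V = np.array([rbf(xi, x_new, sigma) for xi in X])   # V_{t+1}
    v = 1.0 / c + rbf(x_new, x_new, sigma)              # v_{t+1}
    rho = v - V @ A_inv @ V                             # Schur complement rho_{t+1}
    AinvV = A_inv @ V
    return np.block([
        [A_inv + np.outer(AinvV, AinvV) / rho, -AinvV[:, None] / rho],
        [-AinvV[None, :] / rho,                np.array([[1.0 / rho]])],
    ])
```

The incremental inverse agrees with inverting the grown matrix from scratch, while costing only O(t^2) per step instead of O(t^3).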
measure [34], significance measure [10, 19] and surprise measure (SC) [35]. The main idea of these methods is to construct a sparse dictionary by accepting only the important input samples.

2.2.1 Novelty criterion

The novelty criterion (NC) [31] first computes the distance of x_{t+1} to the current dictionary: if dis = min_{c_i^t ∈ D_t} ||x_{t+1} − c_i^t|| < δ_1, then x_{t+1} is not added to the dictionary. Otherwise, it further computes the prediction error e_{t+1} = y_{t+1} − f̂_t(x_{t+1}); only if |e_{t+1}| > δ_2 is x_{t+1} accepted as a new center. Here δ_1 and δ_2 are two user-specified parameters.

2.2.2 Approximate linear dependence

The ALD criterion is used to select the most linearly independent atoms in a kernel algorithm; only those atoms that cannot be absorbed well by the existing dictionary are selected [32]. The kernel function k(x_{t+1}, ·) is added to the dictionary if

    min_{ξ_1, …, ξ_{m_t}} || k(x_{t+1}, ·) − Σ_{i=1}^{m_t} ξ_i k(c_i^t, ·) ||_H^2 ≥ δ        (6)

where δ is a positive threshold parameter that controls the level of sparseness.

2.2.3 Coherence measure

The coherence corresponds to the largest correlation between atoms of a given dictionary [33]. k(x_{t+1}, ·) is added to the dictionary if

    max_{i=1,…,m_t} |k(x_{t+1}, c_i^t)| / sqrt(k(x_{t+1}, x_{t+1}) k(c_i^t, c_i^t)) ≤ δ        (7)

where δ is a given threshold and δ ∈ (0, 1]. When a unit-norm kernel is applied, Eq. (7) becomes

    max_{i=1,…,m_t} |k(x_{t+1}, c_i^t)| ≤ δ

2.2.4 Cumulative coherence measure

The cumulative coherence measure can be viewed as an extension of the coherence criterion and provides a deeper description of a dictionary [34, 36]. The cumulative coherence of a dictionary with Gram matrix G_t is defined as μ(G_t) = max_{i=1,…,m_t} Σ_{1≤j≤m_t, j≠i} |k(c_i^t, c_j^t)|. A candidate kernel function k(x_{t+1}, ·) is included in the dictionary if the resulting cumulative coherence does not exceed δ, where δ is a given positive threshold.

2.2.5 Significance measure

Ref. [10] measures the significance of a center based on a weighted average contribution over all quantized input data. This method utilizes a pruning strategy: the center with the smallest influence on the whole system is discarded when a new sample is included in the dictionary. Ref. [19] continuously examines the significance of the new training sample based on the Hessian matrix of the system loss function. This method utilizes a constructive strategy: samples with small significance are discarded and those with relatively large significance are selected as dictionary members.

2.2.6 Surprise measure

To determine useful data to be learned and remove redundant data, a subjective information measure called surprise is introduced in Ref. [35]. Surprise is defined as the negative log likelihood of the samples given the learning system's hypothesis on the data distribution, i.e.,

    S_{T_t}(x_{t+1}, y_{t+1}) = −ln p(x_{t+1}, y_{t+1} | T_t)

where p(x_{t+1}, y_{t+1} | T_t) is the posterior probability of (x_{t+1}, y_{t+1}) hypothesized by T_t.

3 KB-IELM with sparse updates and adaptive regularization scheme

3.1 Problem statement and formulation

In order to avoid the issues described in Sect. 1, and according to Eq. (1), a new objective function is defined as Eq. (8):

    Min:  L_2 = (1/2) ||β||^2 + (1/m) Σ_{i=1}^m γ_i ξ_i^2
    s.t.  y_i = h(c_i) β + ξ_i,   i = 1, 2, …, m        (8)

where the c_i are the key nodes obtained by online selection; m is the number of key nodes; all key nodes compose the sparse dictionary D = {k(c_i, ·)}_{i=1}^m; γ_i is the regularization factor of the i-th key node; and the factor 1/m is used to avoid the influence of the accumulated error. Compared with Eq. (1), there are three main improvements in Eq. (8):

1. The fixed-budget method is employed, which ensures that the computational complexity is bounded.
2. The online sparsification strategy is used: only key nodes are accepted to update the current model.
3. The adaptive regularization scheme is constructed, which gives the model different regularization factors in different nonlinear regions.

The Karush–Kuhn–Tucker (KKT) optimality conditions are employed to solve the above objective function, giving β_t = H^T (Λ_t + H H^T)^{-1} Y_t. If the regularization factor vector at time t is defined as γ_t = [γ_1^t, γ_2^t, …, γ_m^t], then Λ_t = (m/2) [diag(γ_t)]^{-1}, where diag(·) denotes a diagonal matrix.

By using the kernel function, we obtain

    f̂_t(·) = k_t θ_t = Σ_{i=1}^m θ_i^t k(c_i^t, ·)        (9)

where k_t = [k(c_1^t, ·), …, k(c_m^t, ·)] denotes the current kernel vector and θ_t = [θ_1^t, …, θ_m^t]^T is the current kernel weight coefficient vector, with θ_t = (Λ_t + G_t)^{-1} Y_t.

According to Eq. (9), the improved KB-IELM has to deal with several important problems in the process of online application: the selection of D_t, and the updates of γ_t and θ_t. In order to solve these problems, we present a new online learning method with sparse update and an adaptive regularization scheme based on KB-IELM. The framework of this paper is shown in Fig. 1.

Fig. 1 Framework of this paper
Fig. 2 Interpretation of the sparsification procedure

3.2 Sparse dictionary selection based on instantaneous information measure

In this section, a novel sparsification rule is proposed based on an instantaneous information measure. The method is based on a pruning strategy, which prunes the least "significant" centers and preserves the important ones, as described in Fig. 2.

Suppose the learning system obtained at training step t is f_t = f(D_t, θ_t, γ_t). At training step t + 1, when a new training sample (x_{t+1}, y_{t+1}) arrives, we obtain a new kernel function k(x_{t+1}, ·). The potential dictionary is defined as D̄_t = {D_t, k(x_{t+1}, ·)}. In order to determine whether k(x_{t+1}, ·) can be inserted into the dictionary, we first give two definitions based on information theory.

Definition 1  Hypothesize that the current learning system is f_t and that the instantaneous posterior probability of the observed sample x_{t+1} is p_t(x_{t+1} | f_t). Then the information contained in x_{t+1} that can be transferred to the current learning system is defined as the instantaneous conditional self-information of x_{t+1} at time t, namely I(x_{t+1} | f_t) = −log p_t(x_{t+1} | f_t).

Definition 2  Hypothesize that the current learning system is f_t, that the number of atoms in the dictionary D_t is m, and that the instantaneous posterior probability of kernel center c_i^t (1 ≤ i ≤ m) is p_t(c_i^t | f_t). Then the average self-information of D_t at time t is defined as the instantaneous conditional entropy of D_t, namely

    H(D_t | f_t) = −Σ_{i=1}^m p_t(c_i^t | f_t) log p_t(c_i^t | f_t)

In actual applications, the probability density function (PDF) of the data is hard to obtain without any a priori knowledge or hypothesis. The kernel density estimator (KDE) is a reasonable method to estimate the PDF. Given D_t = {k(c_1^t, ·), …, k(c_m^t, ·)}, by use of the KDE the instantaneous conditional PDF of a kernel center can be represented as Eq. (10):

    p_t(c | σ, f_t) = (1/m) Σ_{i=1}^m k_σ(c, c_i^t)        (10)

where σ is the kernel width.
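Definitions 1 and 2 together with the KDE of Eq. (10) can be sketched as below. The Gaussian kernel and all names are assumptions for illustration; the paper's kernel and width would be substituted in practice.

```python
import numpy as np

def gauss(u, v, sigma=1.0):
    """Unit-norm Gaussian kernel, so k(x, x) = 1 as assumed in the paper."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def self_information(x_new, centers, sigma=1.0):
    """I(x_{t+1} | sigma, f_t) = -log( (1/m) * sum_i k_sigma(x_{t+1}, c_i) )."""
    p = np.mean([gauss(x_new, c, sigma) for c in centers])
    return -np.log(p)

def dict_entropy(centers, sigma=1.0):
    """H(D_t | sigma, f_t): entropy of the KDE probabilities of the centers,
    with p_t(c_i) = S_t(i)/m and S_t = G_t e_t (Eq. 11)."""
    m = len(centers)
    G = np.array([[gauss(a, b, sigma) for b in centers] for a in centers])
    p = G.sum(axis=1) / m
    return float(-(p * np.log(p)).sum())
```

For well-separated centers each KDE probability tends to 1/m, so the entropy approaches log m, its maximum; overlapping (redundant) centers lower it.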
According to Eq. (10), the instantaneous conditional self-information of x_{t+1} and the instantaneous conditional entropy of D_t can be denoted, respectively, as

    I(x_{t+1} | σ, f_t) = −log[ (1/m) Σ_{i=1}^m k_σ(x_{t+1}, c_i^t) ]

    H(D_t | σ, f_t) = −Σ_{i=1}^m [ (1/m) Σ_{j=1}^m k_σ(c_i^t, c_j^t) ] log[ (1/m) Σ_{j=1}^m k_σ(c_i^t, c_j^t) ]

Without loss of generality, all kernel functions mentioned in this paper are unit-norm kernels, i.e., ∀x ∈ X, k(x, x) = 1; if k(x, ·) is not unit-norm, replace k(x, ·) with k(x, ·)/sqrt(k(x, x)).

Let e_t = [1, …, 1]^T ∈ R^{m×1} and denote the Gram matrix of dictionary D_t by G_t. Multiplying G_t by e_t gives S_t = G_t e_t, i.e.,

    S_t = [ Σ_{j=1}^m k_σ(c_1^t, c_j^t),  Σ_{j=1}^m k_σ(c_2^t, c_j^t),  …,  Σ_{j=1}^m k_σ(c_m^t, c_j^t) ]^T

The instantaneous conditional probability of the i-th kernel center in the dictionary D_t under system f_t is p_t(c_i^t | σ, f_t) = S_t(i)/m. So the instantaneous conditional entropy can be obtained by Eq. (11):

    H(D_t | σ, f_t) = −(S_t/m)^T log(S_t/m)        (11)

At training step t + 1, let x_{t+1} = c_{m+1}. The Gram matrix of the dictionary with all potential kernel functions is denoted Ḡ_t:

    Ḡ_t = [ G_t    K_t
            K_t^T  1  ]        (12)

where K_t = [k_σ(c_1^t, c_{m+1}), …, k_σ(c_m^t, c_{m+1})]^T ∈ R^{m×1}. Let ē_t = [1, …, 1]^T ∈ R^{(m+1)×1} and compute S̄_t = Ḡ_t ē_t; thus we obtain Eq. (13):

    S̄_t = [ G_t e_t + K_t ;  K_t^T e_t + 1 ] = [ S_t + K_t ;  1 + K_t^T e_t ]        (13)

Define F̄_t = S̄_t ē_t^T − Ḡ_t and let F_t = F̄_t − diag[diag(F̄_t)] + m̄ I_{m+1}, where m̄ = m + 1. Substituting Ḡ_t and S̄_t into F_t, its entries are

    F_t(l, i) = Σ_{1≤j≤m+1, j≠i} k_σ(c_l^t, c_j^t)  for l ≠ i,   and   F_t(i, i) = m̄

After the i-th (1 ≤ i ≤ m + 1) kernel function in the potential dictionary D̄_t is deleted, the new dictionary and the new learning system are denoted D̄_t^i and f_t^i, respectively. The instantaneous conditional probability of the l-th (l ≠ i) kernel center is then

    p_t(c_l^t | σ, f_t^i) = (1/m) Σ_{1≤j≤m+1, j≠i} k_σ(c_l^t, c_j^t) = (1/m) F_t(l, i)

According to Eq. (11), the instantaneous conditional entropy of D̄_t^i is written as Eq. (14):

    H(D̄_t^i | σ, f_t^i) = −(F_t(:, i)/m)^T log(F_t(:, i)/m)        (14)

The redundancy of D̄_t^i is defined as Eq. (15):

    R_t^i = 1 − H(D̄_t^i | σ, f_t^i) / log|D̄_t^i| = 1 − H(D̄_t^i | σ, f_t^i) / log m        (15)

We aim to minimize the redundancy of the dictionary online: the less redundancy the dictionary has, the more information it contains. So the index of the kernel function removed from the old dictionary is determined by Eq. (16):

    i* = arg min_{1≤i≤m+1} R_t^i        (16)
The generalization error for each sample in D_{t+1} can be expressed as Eq. (24):

    ξ_loo^(k)(t+1) = (A_{t+1}^{-1} Y_{t+1})_k / diag(A_{t+1}^{-1})_k ,   k = 1, …, m        (24)

As a result, the generalization error vector between the estimated values and the real values can be denoted E_{t+1} = [ξ_loo^(1)(t+1), …, ξ_loo^(m)(t+1)].

The loss function can then be defined as Eq. (25):

    J(γ_new^{t+1}) = (1/2) ⟨E(t+1), E(t+1)⟩ = (1/2) Σ_{k=1}^m [ξ_loo^(k)(t+1)]^2        (25)

The optimal regularization factor γ*^{t+1} can be obtained by minimizing the loss function J(γ_new^{t+1}), i.e.,

    γ*^{t+1} = arg min_{γ_new^{t+1} ∈ Π} J(γ_new^{t+1})        (26)

where Π is the range of γ_new^{t+1}. The GD algorithm, also known as the steepest descent method, is one of the most common methods for solving unconstrained optimization problems.

    N_{t+1} = [ (A_t^i)^{-1}  O
                O             0 ]

It is easy to see that S_{t+1}, M_{t+1} and N_{t+1} do not depend on γ_new^{t+1}. However, (A_t^i)^{-1} is unknown in the process of calculating S_{t+1}, M_{t+1} and N_{t+1}. In order to avoid the computational burden of recalculating the matrix (A_t^i)^{-1}, and to improve efficiency when deleting the old sample from the dictionary, we present an effective method.

For convenience of description, A_t can be rewritten in the form shown in Fig. 3, where i is the removable index found by Eq. (16): A_t has diagonal entries m/(2γ_k^t) + 1 (k = 1, …, m) and off-diagonal entries k_{l,j}.

Fig. 3 The form of matrix A_t
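The leave-one-out expressions of Eqs. (24)–(25) can be sketched as below. The formula is the standard PRESS identity for a kernel ridge model f = G (λI + G)^{-1} Y; the Gaussian kernel and scalar λ used here are illustrative assumptions standing in for Λ_t + G_t.

```python
import numpy as np

def loo_errors(A, Y):
    """LOO errors of the regularized kernel model (Eq. 24):
    xi_k = (A^{-1} Y)_k / (A^{-1})_{kk}, with A = Lambda + G."""
    A_inv = np.linalg.inv(A)
    return (A_inv @ Y) / np.diag(A_inv)

def loo_loss(A, Y):
    """J = 0.5 * sum_k xi_k^2 (Eq. 25)."""
    xi = loo_errors(A, Y)
    return 0.5 * float(xi @ xi)
```

The identity is exact: the k-th entry equals the residual of a model retrained without sample k, which the brute-force check below confirms without m extra matrix inversions.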
Moving the i-th row and i-th column of A_t to the first row and first column, respectively, can be mathematically formulated as

    Ã_t = P_t A_t Q_t        (31)

where P_t and Q_t are two m-order elementary matrices whose structures are shown in Figs. 4 and 5.

It is easy to see that P_t P_t^T = I_m and Q_t Q_t^T = I_m, where I_m is the m-order identity matrix, so P_t and Q_t are orthogonal matrices. Based on the properties of orthogonal matrices, we have P_t^{-1} = P_t^T and Q_t^{-1} = Q_t^T. Furthermore, P_t = Q_t^T, so Q_t^{-1} = P_t and P_t^{-1} = Q_t.

According to Eq. (31), we get

    Ã_t^{-1} = (P_t A_t Q_t)^{-1} = P_t A_t^{-1} Q_t        (32)

Ã_t^{-1} can be written in block form:

    Ã_t^{-1} = [ (Ã_t^{-1})^{(1,1)}        (Ã_t^{-1})^{(1, 2:end)}
                 (Ã_t^{-1})^{(2:end, 1)}   (Ã_t^{-1})^{(2:end, 2:end)} ]

We can also obtain the block matrix form of Ã_t:

    Ã_t = [ v_t    V_t
            V_t^T  A_t^i ]

where V_t = [k_{i,1}, …, k_{i,i−1}, k_{i,i+1}, …, k_{i,m}] and v_t = m/(2γ_i^t) + 1. According to the conclusion in Ref. [20], we obtain

    (A_t^i)^{-1} = (Ã_t^{-1})^{(2:end, 2:end)} − (Ã_t^{-1})^{(2:end, 1)} (Ã_t^{-1})^{(1, 2:end)} / (Ã_t^{-1})^{(1,1)}        (33)

According to Eq. (33), we can obtain S_{t+1}, M_{t+1} and N_{t+1}. Substituting them into (19), Eq. (19) can be rewritten as Eq. (34):

    A_{t+1}^{-1} = ( 2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1}) ) M_{t+1} + N_{t+1}        (34)

According to Eq. (34), ∇_γ A_{t+1}^{-1} can be obtained:

    ∇_γ A_{t+1}^{-1} = ∂/∂γ_new^{t+1} [ (2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1})) M_{t+1} + N_{t+1} ]
                     = ( 2m / (m + 2γ_new^{t+1} S_{t+1})^2 ) M_{t+1}        (35)

Substituting Eq. (29) into (28), Eq. (28) can be rewritten as Eq. (36):

    ∇_γ J(γ_new^{t+1}) = Σ_{k=1}^m ( (D1_k − D2_k) / [diag(A_{t+1}^{-1})_k]^3 )        (36)

where D1_k = (A_{t+1}^{-1} Y_{t+1})_k (∇_γ A_{t+1}^{-1} Y_{t+1})_k diag(A_{t+1}^{-1})_k and D2_k = (A_{t+1}^{-1} Y_{t+1})_k^2 diag(∇_γ A_{t+1}^{-1})_k.

According to Eq. (30), ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1} are both functions of γ_new^{t+1}:

    ρ_{t+1}^{-1} = 2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1})
    ∇_γ ρ_{t+1}^{-1} = 2m / (m + 2γ_new^{t+1} S_{t+1})^2        (37)

Combining Eqs. (34), (35) and (37), we further obtain Eq. (38):

    diag(A_{t+1}^{-1}) = ρ_{t+1}^{-1} diag(M_{t+1}) + diag(N_{t+1})
    diag(∇_γ A_{t+1}^{-1}) = ∇_γ ρ_{t+1}^{-1} diag(M_{t+1})
    A_{t+1}^{-1} Y_{t+1} = ρ_{t+1}^{-1} M_{t+1} Y_{t+1} + N_{t+1} Y_{t+1}
    ∇_γ A_{t+1}^{-1} Y_{t+1} = ∇_γ ρ_{t+1}^{-1} M_{t+1} Y_{t+1}        (38)

Substituting (38) into (36) yields ∇_γ J(γ_new^{t+1}). At each iterative step of Eq. (27), the change of γ_new^{t+1} does not affect diag(M_{t+1}), diag(N_{t+1}), M_{t+1} Y_{t+1}, N_{t+1} Y_{t+1} or S_{t+1}; it only influences ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1}. So, at each training iteration, we only need to update ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1}.

For the iteration equation shown in Eq. (27), a dynamic learning rate is adopted to ensure the convergence of the algorithm:

    η(j) = η̄ (1 − δ/ē_{t+1}(j))   if ē_{t+1}(j) > δ
    η(j) = 0                       if ē_{t+1}(j) ≤ δ        (39)

where η̄ is a constant with 0 < η̄ ≤ 1; ē_{t+1}(j) is the mean value of the generalization errors in the j-th iteration at time t + 1, i.e., ē_{t+1}(j) = (1/m) Σ_{k=1}^m |ξ_loo^(k)(t+1)|; and δ represents the algorithm termination threshold.

Substituting the results of Eqs. (36) and (39) into (27), the optimization problem can be solved. In the end, the regularization factor vector is updated by Eq. (40):

    γ_k^{t+1} = γ_k^t          for k < i
    γ_k^{t+1} = γ_{k+1}^t      for i ≤ k < m
    γ_k^{t+1} = γ_new^{t+1}    for k = m        (40)

4 Complexity analysis

A brief computational framework of NOS-KELM is described in Fig. 6. The details are summarized in Algorithm 2, where the initial dictionary D_0 is composed of the first m samples {(x_i, y_i)}_{i=1}^m, and the kernel width σ and initial regularization factor γ_0 are determined by the grid search method. The initial regularization factor vector is defined as γ_0 = [γ_1^0, …, γ_m^0]; then Λ_0 = (m/2)[diag(γ_0)]^{-1}. Moreover, G_0 = [k(x_i, x_j)]_{m×m} and Y_0 = [y_1, y_2, …, y_m]^T.
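The optimization loop of Eqs. (26)–(27) with the dynamic learning rate of Eq. (39) can be sketched as follows. This is a schematic only: the paper evaluates the gradient analytically through Eqs. (36)–(38), whereas a central finite difference stands in for it here, and the function and parameter names are assumptions.

```python
def optimize_factor(loss, loo_mean, gamma0, eta_bar=0.8, delta=1e-3, iters=50):
    """Minimize the LOO loss J(gamma) over the new regularization factor.

    loss     : J(gamma), the LOO loss of Eq. (25)
    loo_mean : mean absolute LOO error at gamma (the e_bar of Eq. 39)
    """
    gamma = gamma0
    for _ in range(iters):
        e_bar = loo_mean(gamma)
        if e_bar <= delta:                      # Eq. (39): stop when error is small
            break
        eta = eta_bar * (1.0 - delta / e_bar)   # dynamic learning rate
        h = 1e-6 * max(1.0, abs(gamma))         # finite-difference stand-in for Eq. (36)
        grad = (loss(gamma + h) - loss(gamma - h)) / (2.0 * h)
        gamma = gamma - eta * grad              # gradient step, Eq. (27)
    return gamma
```

The rate shrinks toward zero as the mean LOO error approaches the threshold δ, which damps the final steps and stops the iteration instead of oscillating around the optimum.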
    MRPE = (1/n) Σ_{i=1}^n |ŷ(i) − y(i)| / y(i)

5.1 Nonstationary Mackey–Glass chaotic time series

This example is an artificial nonstationary time series generated by mixing the Mackey–Glass chaotic time series with a sinusoid. The Mackey–Glass chaotic time series is generated by the following time-delay differential equation:

    dx(t)/dt = a x(t−τ) / (1 + x(t−τ)^10) − b x(t)

where x(t) is the value of the time series at time t. The initial conditions are set as a = 0.2, b = 0.1, τ = 17, x(0) = 1.2 and x(t) = 0 for t < 0. We apply the fourth-order Runge–Kutta method with time step size Δ = 0.1 to obtain the numerical solution of the differential equation. Then a sinusoid 0.3 sin(2πt/3000) is added to the series to create the nonstationary chaotic time series. The sampling interval is set as T_s = 10Δ. In this example, the first 800 points are used for training and the last 400 points for testing; all points are shown in Fig. 7 (the nonstationary Mackey–Glass data together with the added sinusoid). The time embedding dimension is set to 10, i.e., the input is u(t) = (x(t−T_s), …, x(t−10T_s)).

The selected parameters are listed in Table 1. All methods learn the training samples one by one. Let Z denote the predictive step. When Z is equal to 200 and 400, respectively, the prediction results of the different methods are shown in Table 2, where the bold values are the optimal values for every evaluation index.

From Table 2, we can see that, compared with KB-IELM, FOKELM and ALD-KOS-ELM, when the predictive step is equal to 200 the RMSE is reduced by 39.2, 24.4 and 14.7%, respectively; when the predictive step is equal to 400, the RMSE is reduced by 42.2, 70.8 and 27.8%. So the proposed method has higher modeling accuracy.

When the predictive step is equal to 400, the prediction results of the proposed method are shown in Fig. 8. It is clear that the prediction curve expresses the trend of the actual curve effectively, and the prediction errors are at a relatively low level. Besides, Fig. 9 shows the distribution of the model regularization factors when the training process is finished. Obviously, the obtained model has different regularization factors in different nonlinear regions.

Figure 10 shows the learning curves of the different methods, where the Y-axis denotes the mean square error (MSE) and the X-axis denotes the training sample. The learning curve of the proposed method is smoother and converges to a more accurate stage, so the proposed method has better performance than the others.

5.2 Lorenz chaotic time series

The Lorenz chaotic time series is given by the following equations:

    dx/dt = σ(y − x)
    dy/dt = rx − y − xz
    dz/dt = −bz + xy

The initial values are set as σ = 10, r = 28, b = 8/3, x(0) = 1, y(0) = 2 and z(0) = 9. The fourth-order Runge–Kutta method is used to generate a sample set, with the first 800 points for training and the last 400 points for testing. At the same time, Gaussian white noise with SNR = 5 dB is added to the time series. All samples are shown in Fig. 11. In this example, the x(t), y(t) and z(t) series are used together
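The nonstationary Mackey–Glass series of Sect. 5.1 can be generated as sketched below. The handling of the delay term is an assumption: within each RK4 step the delayed value x(t − τ) is frozen at the grid point, a common simplification for this benchmark.

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, tau=17.0, x0=1.2, dt=0.1):
    """Nonstationary Mackey-Glass series: RK4 on
    dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t), x(t) = 0 for t < 0,
    plus the sinusoid 0.3*sin(2*pi*t/3000)."""
    d = int(round(tau / dt))                 # delay in grid steps
    x = np.zeros(n + d + 1)
    x[d] = x0                                # x(0) = 1.2, zero history before t = 0
    for k in range(d, n + d):
        xd = x[k - d]                        # delayed term, frozen over the step
        f = lambda xv: a * xd / (1.0 + xd ** 10) - b * xv
        k1 = f(x[k]); k2 = f(x[k] + 0.5 * dt * k1)
        k3 = f(x[k] + 0.5 * dt * k2); k4 = f(x[k] + dt * k3)
        x[k + 1] = x[k] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    series = x[d:]                           # values from t = 0 on the dt grid
    t = np.arange(len(series)) * dt
    return series + 0.3 * np.sin(2 * np.pi * t / 3000.0)
```

Downsampling every 10th point of the returned series reproduces the paper's sampling interval T_s = 10Δ.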
Table 2  Prediction results of different methods in example 1 ([17] = KB-IELM, [20] = FOKELM, [25] = ALD-KOS-ELM)

Z = 200
  [17]        38.438   0.0174   0.0293   0.0153   0.0420   0.0108
  [20]        0.2100   0.0268   0.0009   0.0123   0.0317   0.0083
  [25]        0.2327   0.0112   0.0009   0.0109   0.0249   0.0081
  NOS-KELM    0.9793   0.0104   0.0009   0.0093   0.0263   0.0063
Z = 400
  [17]        16.799   0.0190   0.0436   0.0166   0.0438   0.0112
  [20]        0.1840   0.0526   0.0012   0.0329   0.0842   0.0241
  [25]        0.1637   0.0351   0.0013   0.0133   0.0384   0.0088
  NOS-KELM    0.8215   0.0112   0.0031   0.0096   0.0258   0.0066

Fig. 8  Prediction results of NOS-KELM in example 1: (a) original figure, (b) enlarged figure (predicted vs. actual), and prediction error
Fig. 9  Distribution of model regularization factors in example 1
Fig. 10 Learning curves of different methods in example 1 (testing MSE vs. sample sequence): KB-IELM, FOKELM, ALD-KOS-ELM, NOS-KELM
Fig. 11 Sample set of the Lorenz series (z(t) vs. sample)
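The Lorenz data of Sect. 5.2, including the additive noise at a prescribed SNR, can be produced as sketched below. The integration step dt and the noise-scaling helper are assumptions for illustration.

```python
import numpy as np

def lorenz(n, dt=0.01, sigma=10.0, r=28.0, b=8.0 / 3.0, s0=(1.0, 2.0, 9.0)):
    """RK4 integration of dx/dt = sigma(y-x), dy/dt = rx - y - xz, dz/dt = -bz + xy."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), r * x - y - x * z, -b * z + x * y])
    out = np.empty((n, 3))
    s = np.array(s0, dtype=float)
    for i in range(n):
        out[i] = s
        k1 = f(s); k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2); k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return out

def add_noise_snr(series, snr_db=5.0, rng=None):
    """Additive Gaussian white noise scaled to the requested SNR in dB."""
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(series ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return series + rng.normal(scale=np.sqrt(p_noise), size=series.shape)
```

Scaling the noise power to p_signal / 10^(SNR/10) is what "SNR = 5 dB" means here: the signal carries roughly 3.16 times the noise power.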
Table 3  Selected parameters in example 2
  Method      γ        σ        Other
  NOS-KELM    5e+4     1e+6     m = 80, δ = 0.2, η̄ = 0.8

As shown in Table 4, KB-IELM spends much more learning time than the other methods because it has no online sparsification procedure. The training time of the proposed method is slightly longer than that of FOKELM and ALD-KOS-ELM, but it is still at a relatively low level.

When the predictive step is equal to 400, the prediction results of the proposed method are shown in Fig. 12. It is clear that the prediction curve fits the actual curve well, and the prediction errors are at a relatively low level. Besides that, Fig. 13 shows the distribution of the model regularization factors when the training process has ended. Obviously, compared with the other methods, the obtained model has different regularization factors in different nonlinear regions.

Figure 14 shows the learning curves of the different methods. Compared with Fig. 10, the same conclusion can be obtained.

Fig. 12 Comparison of real values and prediction values obtained by NOS-KELM in example 2: (a) original figure, (b) enlarged figure (predicted vs. actual), (c) prediction error
Fig. 13 Distribution of model regularization factors in example 2

Prediction results of different methods in example 2:
Z = 200
  [17]        40.789   0.6517   0.0395   0.6035   1.5822   0.4889
  [20]        0.3603   0.5168   0.0011   0.4795   1.5484   0.2034
  [25]        0.1556   0.3622   0.0007   0.3157   1.1968   0.1704
  NOS-KELM    0.7155   0.3499   0.0008   0.2472   0.7831   0.1278
Z = 400
  [17]        17.279   0.5652   0.0485   0.5469   1.5114   0.2929
  [20]        0.2766   0.4124   0.0018   0.3366   1.1129   0.5131
  [25]        0.1153   0.3744   0.0011   0.3403   1.1962   0.7162
  NOS-KELM    0.5755   0.3646   0.0013   0.2851   0.9441   0.6454
Fig. 14 Learning curves of different methods in example 2 (testing MSE vs. sample sequence): KB-IELM, FOKELM, ALD-KOS-ELM, NOS-KELM

Prediction comparison for example 3: original figure, enlarged figure (predicted vs. actual) and prediction error

Prediction results of different methods in example 3:
Z = 100
  [17]        0.1190   13.5935   0.0012   18.5265   66.7773   0.5508
  [20]        0.0633   14.5808   0.0006   21.1775   88.1223   0.4867
  [25]        0.2447   12.9152   0.0012   19.6364   65.4295   0.4827
  NOS-KELM    0.3410   12.4182   0.0012   17.9586   59.7291   0.4774
Z = 200
  [17]        0.0650   14.0217   0.0026   16.3969   66.0040   0.6895
  [20]        0.0280   14.0356   0.0026   16.5558   65.1638   0.7141
  [25]        0.0314   13.3749   0.0032   15.9481   59.1508   0.5219
  NOS-KELM    0.1395   13.2193   0.0015   15.8586   61.8346   0.5238
The proposed method has the following advantages: (1) A novel sparsification rule is proposed, which can prune the least "significant" samples and preserves the important ones.
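The fixed-budget pruning idea can be sketched as follows. This is not the paper's instantaneous information measure; the significance criterion used here (negative maximum similarity to the other centers, so the most redundant center is pruned first) and the class name `BudgetDictionary` are our own stand-ins for illustration:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian kernel between two vectors
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

class BudgetDictionary:
    """Fixed-budget kernel dictionary: when the budget is exceeded,
    prune the center deemed least significant. Here significance is
    the negative of the maximum similarity to the other centers, an
    illustrative redundancy criterion only."""
    def __init__(self, budget, gamma=1.0):
        self.budget, self.gamma = budget, gamma
        self.centers = []

    def significance(self, i):
        # low value = highly redundant = good pruning candidate
        others = [c for j, c in enumerate(self.centers) if j != i]
        return -max(rbf(self.centers[i], c, self.gamma) for c in others)

    def add(self, x):
        self.centers.append(x)
        if len(self.centers) > self.budget:
            worst = min(range(len(self.centers)), key=self.significance)
            self.centers.pop(worst)

d = BudgetDictionary(budget=3, gamma=1.0)
for x in [np.array([0.0]), np.array([1.0]), np.array([0.05]), np.array([3.0])]:
    d.add(x)
print([float(c[0]) for c in d.centers])
```

When the fourth point arrives, one of the near-duplicate centers at 0.0 and 0.05 is pruned, while the isolated (informative) center at 3.0 is kept, which is the qualitative behavior the sparsification rule aims for.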
Compliance with ethical standards
References
13. Deng WY, Ong YS, Tan PS, Zheng QH (2016) Online sequential reduced kernel extreme learning machine. Neurocomputing 174:72–84
14. Wong SY, Yap KS, Yap HJ, Tan SC (2015) A truly online learning algorithm using hybrid fuzzy ARTMAP and online extreme learning machine for pattern classification. Neural Process Lett 42:585–602
15. Liang NY, Huang GB, Saratchandran P, Sundararajan N (2006) A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans Neural Netw 17(6):1411–1423
16. Huynh HT, Won Y (2011) Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks. Pattern Recogn Lett 32:1930–1935
17. Guo L, Hao JH, Liu M (2014) An incremental extreme learning machine for online sequential learning problems. Neurocomputing 128:50–58
18. Fan HJ, Song Q, Yang XL, Xu Z (2015) Kernel online learning algorithm with state feedbacks. Knowl Based Syst 89:173–180
19. Fan HJ, Song Q (2013) A sparse kernel algorithm for online time series data prediction. Expert Syst Appl 40:2174–2181
20. Zhou XR, Liu ZJ, Zhu CX (2014) Online regularized and kernelized extreme learning machines with forgetting mechanism. Math Probl Eng. doi:10.1155/2014/938548
21. Zhou XR, Wang CS (2016) Cholesky factorization based online regularized and kernelized extreme learning machines with forgetting mechanism. Neurocomputing 174:1147–1155
22. Gu Y, Liu JF, Chen YQ, Jiang XL, Yu HC (2014) TOSELM: timeliness online sequential extreme learning machine. Neurocomputing 128:119–127
23. Lim J, Lee S, Pang HS (2013) Low complexity adaptive forgetting factor for online sequential extreme learning machine (OS-ELM) for application to nonstationary system estimation. Neural Comput Appl 22:569–576
24. He X, Wang HL, Lu JH, Jiang W (2015) Online fault diagnosis of analog circuit based on limited-samples sequence extreme learning machine. Control Decis 30(3):455–460
25. Scardapane S, Comminiello D, Scarpiniti M, Uncini A (2015) Online sequential extreme learning machine with kernel. IEEE Trans Neural Netw Learn Syst 26(9):2214–2220
26. Zhang YT, Ma C, Li ZN, Fan HB (2014) Online modeling of kernel extreme learning machine based on fast leave-one-out cross-validation. J Shanghai Jiaotong Univ 48(5):641–646
27. Shao ZF, Meng JE (2016) An online sequential learning algorithm for regularized extreme learning machine. Neurocomputing 173:778–788
28. Lu XJ, Zhou C, Huang MH, Lv WB (2016) Regularized online sequential extreme learning machine with adaptive regulation factor for time-varying nonlinear system. Neurocomputing 174:617–626
29. Lin M, Zhang LJ, Jin R, Weng SF, Zhang CS (2016) Online kernel learning with nearly constant support vectors. Neurocomputing 179:26–36
30. Honeine P (2015) Analyzing sparse dictionaries for online learning with kernels. IEEE Trans Signal Process 63(23):6343–6353
31. Platt J (1991) A resource-allocating network for function interpolation. Neural Comput 3(2):213–225
32. Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285
33. Richard C, Bermudez JCM, Honeine P (2009) Online prediction of time series data with kernels. IEEE Trans Signal Process 57(3):1058–1067
34. Fan HJ, Song Q, Xu Z (2014) Online learning with kernel regularized least mean square algorithms. Expert Syst Appl 41:4349–4359
35. Liu WF, Park I, Principe JC (2009) An information theoretic approach of designing sparse kernel adaptive filters. IEEE Trans Neural Netw 20(12):1950–1961
36. Fan HJ, Song Q, Shrestha SB (2016) Kernel online learning with adaptive kernel width. Neurocomputing 175:233–242
37. Zhao YP, Wang KK (2014) Fast cross validation for regularized extreme learning machine. J Syst Eng Electron 25(5):895–900