DOI 10.1007/s00521-017-3096-3
ORIGINAL ARTICLE
Abstract  In order to curb the model expansion of kernel learning methods and adapt to the nonlinear dynamics in the process of nonstationary time series online prediction, a new online sequential learning algorithm with sparse update and an adaptive regularization scheme is proposed based on the kernel-based incremental extreme learning machine (KB-IELM). For online sparsification, a new method is presented to select the sparse dictionary based on an instantaneous information measure. This method utilizes a pruning strategy, which can prune the least "significant" centers and preserve the important ones by online minimizing the redundancy of the dictionary. For the adaptive regularization scheme, a new objective function is constructed based on the basic ELM model. The new model has different structural risks in different nonlinear regions. At each training step, the newly added sample can be assigned an optimal regularization factor by an optimization procedure. Performance comparisons of the proposed method with other existing online sequential learning methods are presented using artificial and real-world nonstationary time series data. The results indicate that the proposed method can achieve higher prediction accuracy, better generalization performance and stability.

Keywords  Time series prediction · Extreme learning machine · Online modeling · Fixed budget · Sparsity measures · Adaptive regularization

Corresponding author: Aiqiang Xu (hjhy1989@njfu.edu.cn)
Wei Zhang (linguo@njfu.edu.cn) · Dianfa Ping (mingyunjiang@njfu.edu.cn) · Mingzhe Gao (1601111088@pku.edu.cn)

1 Office of Research and Development, Naval Aeronautical and Astronautical University, Yantai 264001, People's Republic of China
2 Department of Electronic and Information Engineering, Naval Aeronautical and Astronautical University, Yantai 264001, People's Republic of China

1 Introduction

Nonstationary time series prediction (TSP) plays an important role in scientific and engineering fields such as fault-tolerant analysis, state prediction, condition monitoring and fault diagnosis [1]. The main task of TSP is to find a proper model with appropriate structure and parameters to characterize the dynamic behavior of real systems.

As a famous nonlinear modeling method, neural networks (NNs) have been extensively used to address TSP issues over the years [2]. NNs are proven to be universal approximators under suitable conditions, thus providing the means to capture information in data that is difficult to identify using other approaches. It is, however, well known that traditional NN algorithms suffer from problems such as being easily trapped in local minima, slow convergence and huge computational costs [3]. In order to overcome these issues, Huang et al. [4] proposed the extreme learning machine (ELM), a novel learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). Its salient advantage is that the input weights and hidden biases are randomly chosen instead of being exhaustively tuned. It has been reported to provide better generalization performance with much faster learning speed [5–9].
Neural Comput & Applic
During the last few years, there has been increasing attention on kernel learning methods [10]. The fundamental idea of kernel methods is that a Mercer kernel is applied to transform the low-dimensional feature vector into a high-dimensional reproducing kernel Hilbert space (RKHS), in which many nonlinear problems become linearly solvable. Recently, Huang et al. [11] extended ELM by using kernel functions; we denote the derived result as ELM with kernel (KELM). KELM is a special batch variant of ELM [12, 13]. Experimental and theoretical analysis has shown that KELM tends to have better scalability and can achieve similar (for regression and binary classification cases) or much better (for multiclass cases) generalization performance at much faster learning speed than basic SVM and LS-SVM [11].

However, in many actual applications, online, adaptive and real-time operation is required. Such requirements pose a serious challenge, since the aforementioned NNs, ELM and KELM operate in batch mode. In other words, whenever a new sample arrives, these algorithms need to gather the old and new data together and retrain in order to incorporate the new information [14]. This results in extra storage consumption and makes the learning time grow longer and longer. Fortunately, some online sequential learning algorithms have been proposed to meet these demands, such as online sequential ELM (OSELM) [15] and regularized online sequential ELM (ReOS-ELM) [16]. Moreover, KB-IELM was presented in Ref. [17], a beneficial attempt to extend KELM to online application.

There is no doubt that researching the online application of KELM is significant work. Although KB-IELM can handle online tasks, some issues still need to be solved. For example, KB-IELM's model order is equal to the number of training samples, which leads to kernel matrix expansion as learning goes on [10, 18]. On the one hand, the algorithm is in danger of over-fitting; on the other hand, the computational complexity and storage requirement grow superlinearly. In a word, two key problems must be solved when KB-IELM is applied to online prediction of nonstationary time series: (1) how to curb the model expansion; (2) how to track or adapt to the nonlinear dynamics in a time-varying and nonstationary environment.

Commonly, online sparsification strategies are employed to solve the first issue. They help to curb the growing number of kernel functions as training samples sequentially arrive [19]. Ref. [20] proposed KELM with forgetting mechanism (FOKELM) based on a traditional sliding time window. On the basis of FOKELM, Ref. [21] proposed CF-FOKELM by use of Cholesky factorization. These methods obtain a compact dictionary, but the dictionary largely depends on the latest k observed samples. Generally speaking, the inner structure hidden in the time series data determines the samples' significance, especially in a nonstationary setting [22], so these methods cannot guarantee that the newly added sample is the most valuable for prediction [23]. Ref. [24] achieved online sparsification by deleting the old samples with the highest similarity to new samples; this approach cannot effectively track the system dynamics. Ref. [25] proposed ALD-KOS-ELM by use of the approximate linear dependence (ALD) criterion. A major criticism of the ALD criterion is that it leads to costly operations with quadratic complexity in the cardinality m of the dictionary. Ref. [26] achieved model sparsification by using fast leave-one-out cross-validation (FLOO-CV).

The model expansion has been solved effectively by the aforementioned methods, but the second problem is still not worked out fundamentally. In the process of nonlinear dynamic system modeling, empirical risk and structural risk should be considered simultaneously. Generally, KELM controls the structural risk by Tikhonov regularization [16, 27]. There is no doubt that a time-varying system should have different structural risks in different nonlinear regions [28]. But the current KELM-based methods employ a constant regularization factor at all times, greatly limiting their effectiveness when modeling unknown nonlinear time-varying systems. So, in order to improve the capability of tracking the time-varying dynamics, the regularization factor should vary over time.

The aim of this paper is to seek a new online sequential learning strategy of KB-IELM for nonstationary time series prediction which is computationally simple and able to simultaneously solve the aforementioned two key problems. The fixed-budget method is regarded as the basic modeling strategy, KB-IELM is regarded as the basic modeling algorithm, and a unified framework is then presented by associating the proposed sparsification rule with the adaptive regularization scheme.

For online sparsification, a new method is presented to select the sparse dictionary based on an instantaneous information measure. The proposed sparsification method utilizes a pruning strategy. By online minimizing the redundancy of the dictionary, it decides whether to replace one of the old dictionary members with the new kernel function. In the end, a compact dictionary with predefined size can be selected adaptively. The proposed method does not need any a priori knowledge about the data, and its computational complexity is linear in the number of kernel centers.

For the adaptive regularization scheme, a new objective function is constructed based on the basic ELM model. The new model has different regularization factors in different nonlinear regions. At each training iteration, in order to assign an optimal regularization factor to the newly added sample,
the LOO-CV generalization error is adopted to construct a loss function, and the optimal regularization factor is then derived by minimizing this loss function with the gradient descent (GD) method. Moreover, a dynamic learning rate is adopted to ensure algorithmic convergence.

Finally, we associate the proposed sparsification rule with the adaptive regularization scheme based on the KB-IELM algorithm and derive a new online sequential learning algorithm for KELM (denoted NOS-KELM for simplicity). Performance comparisons of the proposed method with other existing algorithms are presented using artificial and real-life time series data. The simulation results testify that NOS-KELM is an effective way to predict nonstationary time series.

The rest of this paper is organized as follows. In Sect. 2, a brief description is presented of KB-IELM and several sparsity measure criteria. The proposed algorithm is given in Sect. 3, including the online selection of key nodes, the real-time update of the kernel weight coefficients and the online optimization of the regularization factor. In Sect. 4, the computational complexity is discussed. In Sect. 5, the proposed algorithm is evaluated on both simulated and real-world data. The conclusion is drawn in Sect. 6.

2 Preliminaries

Mercer's conditions can be applied to ELM. The kernel matrix is defined as G = H H^T, where G(i, j) = h(x_i) h(x_j)^T = k(x_i, x_j) and k(·, ·) is a predefined kernel function. Then the output of the SLFN by kernel-based ELM can be given as Eq. (3):

    f(·) = h(·) H^T (c^{-1} I + H H^T)^{-1} Y
         = [k(x_1, ·), …, k(x_N, ·)] (c^{-1} I + G)^{-1} Y        (3)

Let k = [k(x_1, ·), …, k(x_N, ·)] and θ = (c^{-1} I + G)^{-1} Y, where θ denotes the kernel weight coefficient vector. With θ = [θ_1, …, θ_N]^T, we have f(·) = k θ = Σ_{i=1}^N θ_i k(x_i, ·).

In Ref. [17], KB-IELM is proposed to realize adaptive update of the kernel weight coefficients when new samples arrive sequentially. Suppose the training set at time t is {(x_i, y_i) | i = 1, …, t} and let A_t = c^{-1} I_t + G_t. When the sample (x_{t+1}, y_{t+1}) arrives at time t + 1, we have

    A_{t+1} = [ A_t        V_{t+1}
                V_{t+1}^T  v_{t+1} ]        (4)

where V_{t+1} = [k_{1,t+1}, …, k_{t,t+1}]^T and v_{t+1} = c^{-1} + k_{t+1,t+1}. Using the block matrix inverse lemma, A_{t+1}^{-1} can be calculated from A_t^{-1}:

    A_{t+1}^{-1} = [ A_t^{-1} + A_t^{-1} V_{t+1} ρ_{t+1}^{-1} V_{t+1}^T A_t^{-1}    −A_t^{-1} V_{t+1} ρ_{t+1}^{-1}
                     −ρ_{t+1}^{-1} V_{t+1}^T A_t^{-1}                                ρ_{t+1}^{-1} ]        (5)

where ρ_{t+1} = v_{t+1} − V_{t+1}^T A_t^{-1} V_{t+1} is the Schur complement of A_t in A_{t+1}.
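The recursive update of Eqs. (4)–(5) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Gaussian kernel, the regularization value and all function names are assumptions made for the example.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian kernel; stands in for the paper's generic k(., .)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kbielm_update(A_inv, X, x_new, c=100.0, sigma=1.0):
    """One KB-IELM step (Eqs. 4-5): grow A^{-1} via the block-inverse lemma.

    A_inv : inverse of A_t = c^{-1} I + G_t for the t samples in X.
    Returns the inverse of A_{t+1} after appending x_new.
    """
    V = np.array([rbf(xi, x_new, sigma) for xi in X])   # V_{t+1}
    v = 1.0 / c + rbf(x_new, x_new, sigma)              # v_{t+1}
    rho = v - V @ A_inv @ V                             # Schur complement rho_{t+1}
    AinvV = A_inv @ V
    return np.block([
        [A_inv + np.outer(AinvV, AinvV) / rho, -AinvV[:, None] / rho],
        [-AinvV[None, :] / rho,                np.array([[1.0 / rho]])],
    ])
```

The incremental inverse agrees with inverting the grown matrix from scratch, while costing only O(t^2) per step instead of O(t^3).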
measure [34], significance measure [10, 19] and surprise measure (SC) [35]. The main idea of these methods is to construct a sparse dictionary by accepting only the important input samples.

2.2.1 Novelty criterion

The novelty criterion (NC) [31] first computes the distance of x_{t+1} to the current dictionary: if dis = min_{c_i^t ∈ D_t} ||x_{t+1} − c_i^t|| < δ_1, then x_{t+1} is not added to the dictionary. Otherwise, it further computes the prediction error e_{t+1} = y_{t+1} − f̂_t(x_{t+1}); only if |e_{t+1}| > δ_2 is x_{t+1} accepted as a new center. Here δ_1 and δ_2 are two user-specified parameters.

2.2.2 Approximate linear dependence

The ALD criterion is used to select the most linearly independent atoms in a kernel algorithm; only those atoms that cannot be absorbed well by the existing dictionary are selected [32]. The kernel function k(x_{t+1}, ·) is added to the dictionary if

    min_{ξ_1, …, ξ_{m_t}} || k(x_{t+1}, ·) − Σ_{i=1}^{m_t} ξ_i k(c_i^t, ·) ||_H^2 ≥ δ        (6)

where δ is a positive threshold parameter that controls the level of sparseness.

2.2.3 Coherence measure

The coherence corresponds to the largest correlation between atoms of a given dictionary [33]. k(x_{t+1}, ·) is added to the dictionary if

    max_{i=1,…,m_t} |k(x_{t+1}, c_i^t)| / sqrt(k(x_{t+1}, x_{t+1}) k(c_i^t, c_i^t)) ≤ δ        (7)

where δ is a given threshold and δ ∈ (0, 1]. When a unit-norm kernel is applied, Eq. (7) becomes

    max_{i=1,…,m_t} |k(x_{t+1}, c_i^t)| ≤ δ

2.2.4 Cumulative coherence measure

The cumulative coherence measure can be viewed as an extension of the coherence criterion and provides a deeper description of a dictionary [34, 36]. The cumulative coherence of a dictionary with Gram matrix G_t is defined as μ(G_t) = max_{i=1,…,m_t} Σ_{1≤j≤m_t, j≠i} |k(c_i^t, c_j^t)|. A candidate kernel function k(x_{t+1}, ·) is included in the dictionary if the resulting cumulative coherence does not exceed δ, where δ is a given positive threshold.

2.2.5 Significance measure

Ref. [10] measures the significance of a center based on a weighted average contribution over all quantized input data. This method utilizes a pruning strategy: the center with the smallest influence on the whole system is discarded when a new sample is included in the dictionary. Ref. [19] continuously examines the significance of the new training sample based on the Hessian matrix of the system loss function. This method utilizes a constructive strategy: samples with small significance are discarded and those with relatively large significance are selected as dictionary members.

2.2.6 Surprise measure

To determine useful data to be learned and remove redundant data, a subjective information measure called surprise is introduced in Ref. [35]. Surprise is defined as the negative log likelihood of the samples given the learning system's hypothesis on the data distribution, i.e.,

    S_{T_t}(x_{t+1}, y_{t+1}) = −ln p(x_{t+1}, y_{t+1} | T_t)

where p(x_{t+1}, y_{t+1} | T_t) is the posterior probability of (x_{t+1}, y_{t+1}) hypothesized by T_t.

3 KB-IELM with sparse updates and adaptive regularization scheme

3.1 Problem statement and formulation

In order to avoid the issues described in Sect. 1, and according to Eq. (1), a new objective function is defined as Eq. (8):

    Min:  L_2 = (1/2) ||β||^2 + (1/m) Σ_{i=1}^m γ_i ξ_i^2
    s.t.  y_i = h(c_i) β + ξ_i,   i = 1, 2, …, m        (8)

where the c_i are the key nodes obtained by online selection; m is the number of key nodes; all key nodes compose the sparse dictionary D = {k(c_i, ·)}_{i=1}^m; γ_i is the regularization factor of the i-th key node; and the factor 1/m is used to avoid the influence of the accumulated error. Compared with Eq. (1), there are three main improvements in Eq. (8):

1. The fixed-budget method is employed, which ensures that the computational complexity is bounded.
2. The online sparsification strategy is used: only key nodes are accepted to update the current model.
3. The adaptive regularization scheme is constructed, which gives the model different regularization factors in different nonlinear regions.

The Karush–Kuhn–Tucker (KKT) optimality conditions are employed to solve the above objective function, giving β_t = H^T (Λ_t + H H^T)^{-1} Y_t. If the regularization factor vector at time t is defined as γ_t = [γ_1^t, γ_2^t, …, γ_m^t], then Λ_t = (m/2) [diag(γ_t)]^{-1}, where diag(·) denotes a diagonal matrix.

By using the kernel function, we obtain

    f̂_t(·) = k_t θ_t = Σ_{i=1}^m θ_i^t k(c_i^t, ·)        (9)

where k_t = [k(c_1^t, ·), …, k(c_m^t, ·)] denotes the current kernel vector and θ_t = [θ_1^t, …, θ_m^t]^T is the current kernel weight coefficient vector, with θ_t = (Λ_t + G_t)^{-1} Y_t.

According to Eq. (9), the improved KB-IELM has to deal with several important problems in the process of online application: the selection of D_t, and the updates of γ_t and θ_t. In order to solve these problems, we present a new online learning method with sparse update and an adaptive regularization scheme based on KB-IELM. The framework of this paper is shown in Fig. 1.

Fig. 1 Framework of this paper
Fig. 2 Interpretation of the sparsification procedure

3.2 Sparse dictionary selection based on instantaneous information measure

In this section, a novel sparsification rule is proposed based on an instantaneous information measure. The method is based on a pruning strategy, which prunes the least "significant" centers and preserves the important ones, as described in Fig. 2.

Suppose the learning system obtained at training step t is f_t = f(D_t, θ_t, γ_t). At training step t + 1, when a new training sample (x_{t+1}, y_{t+1}) arrives, we obtain a new kernel function k(x_{t+1}, ·). The potential dictionary is defined as D̄_t = {D_t, k(x_{t+1}, ·)}. In order to determine whether k(x_{t+1}, ·) can be inserted into the dictionary, we first give two definitions based on information theory.

Definition 1  Hypothesize that the current learning system is f_t and that the instantaneous posterior probability of the observed sample x_{t+1} is p_t(x_{t+1} | f_t). Then the information contained in x_{t+1} that can be transferred to the current learning system is defined as the instantaneous conditional self-information of x_{t+1} at time t, namely I(x_{t+1} | f_t) = −log p_t(x_{t+1} | f_t).

Definition 2  Hypothesize that the current learning system is f_t, that the number of atoms in the dictionary D_t is m, and that the instantaneous posterior probability of kernel center c_i^t (1 ≤ i ≤ m) is p_t(c_i^t | f_t). Then the average self-information of D_t at time t is defined as the instantaneous conditional entropy of D_t, namely

    H(D_t | f_t) = −Σ_{i=1}^m p_t(c_i^t | f_t) log p_t(c_i^t | f_t)

In actual applications, the probability density function (PDF) of the data is hard to obtain without any a priori knowledge or hypothesis. The kernel density estimator (KDE) is a reasonable method to estimate the PDF. Given D_t = {k(c_1^t, ·), …, k(c_m^t, ·)}, by use of the KDE the instantaneous conditional PDF of a kernel center can be represented as Eq. (10):

    p_t(c | σ, f_t) = (1/m) Σ_{i=1}^m k_σ(c, c_i^t)        (10)

where σ is the kernel width.
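Definitions 1 and 2 together with the KDE of Eq. (10) can be sketched as below. The Gaussian kernel and all names are assumptions for illustration; the paper's kernel and width would be substituted in practice.

```python
import numpy as np

def gauss(u, v, sigma=1.0):
    """Unit-norm Gaussian kernel, so k(x, x) = 1 as assumed in the paper."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def self_information(x_new, centers, sigma=1.0):
    """I(x_{t+1} | sigma, f_t) = -log( (1/m) * sum_i k_sigma(x_{t+1}, c_i) )."""
    p = np.mean([gauss(x_new, c, sigma) for c in centers])
    return -np.log(p)

def dict_entropy(centers, sigma=1.0):
    """H(D_t | sigma, f_t): entropy of the KDE probabilities of the centers,
    with p_t(c_i) = S_t(i)/m and S_t = G_t e_t (Eq. 11)."""
    m = len(centers)
    G = np.array([[gauss(a, b, sigma) for b in centers] for a in centers])
    p = G.sum(axis=1) / m
    return float(-(p * np.log(p)).sum())
```

For well-separated centers each KDE probability tends to 1/m, so the entropy approaches log m, its maximum; overlapping (redundant) centers lower it.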
According to Eq. (10), the instantaneous conditional self-information of x_{t+1} and the instantaneous conditional entropy of D_t can be denoted, respectively, as

    I(x_{t+1} | σ, f_t) = −log[ (1/m) Σ_{i=1}^m k_σ(x_{t+1}, c_i^t) ]

    H(D_t | σ, f_t) = −Σ_{i=1}^m [ (1/m) Σ_{j=1}^m k_σ(c_i^t, c_j^t) ] log[ (1/m) Σ_{j=1}^m k_σ(c_i^t, c_j^t) ]

Without loss of generality, all kernel functions mentioned in this paper are unit-norm kernels, i.e., ∀x ∈ X, k(x, x) = 1; if k(x, ·) is not unit-norm, replace k(x, ·) with k(x, ·)/sqrt(k(x, x)).

Let e_t = [1, …, 1]^T ∈ R^{m×1} and denote the Gram matrix of dictionary D_t by G_t. Multiplying G_t by e_t gives S_t = G_t e_t, i.e.,

    S_t = [ Σ_{j=1}^m k_σ(c_1^t, c_j^t),  Σ_{j=1}^m k_σ(c_2^t, c_j^t),  …,  Σ_{j=1}^m k_σ(c_m^t, c_j^t) ]^T

The instantaneous conditional probability of the i-th kernel center in the dictionary D_t under system f_t is p_t(c_i^t | σ, f_t) = S_t(i)/m. So the instantaneous conditional entropy can be obtained by Eq. (11):

    H(D_t | σ, f_t) = −(S_t/m)^T log(S_t/m)        (11)

At training step t + 1, let x_{t+1} = c_{m+1}. The Gram matrix of the dictionary with all potential kernel functions is denoted Ḡ_t:

    Ḡ_t = [ G_t    K_t
            K_t^T  1  ]        (12)

where K_t = [k_σ(c_1^t, c_{m+1}), …, k_σ(c_m^t, c_{m+1})]^T ∈ R^{m×1}. Let ē_t = [1, …, 1]^T ∈ R^{(m+1)×1} and compute S̄_t = Ḡ_t ē_t; thus we obtain Eq. (13):

    S̄_t = [ G_t e_t + K_t ;  K_t^T e_t + 1 ] = [ S_t + K_t ;  1 + K_t^T e_t ]        (13)

Define F̄_t = S̄_t ē_t^T − Ḡ_t and let F_t = F̄_t − diag[diag(F̄_t)] + m̄ I_{m+1}, where m̄ = m + 1. Substituting Ḡ_t and S̄_t into F_t, its entries are

    F_t(l, i) = Σ_{1≤j≤m+1, j≠i} k_σ(c_l^t, c_j^t)  for l ≠ i,   and   F_t(i, i) = m̄

After the i-th (1 ≤ i ≤ m + 1) kernel function in the potential dictionary D̄_t is deleted, the new dictionary and the new learning system are denoted D̄_t^i and f_t^i, respectively. The instantaneous conditional probability of the l-th (l ≠ i) kernel center is then

    p_t(c_l^t | σ, f_t^i) = (1/m) Σ_{1≤j≤m+1, j≠i} k_σ(c_l^t, c_j^t) = (1/m) F_t(l, i)

According to Eq. (11), the instantaneous conditional entropy of D̄_t^i is written as Eq. (14):

    H(D̄_t^i | σ, f_t^i) = −(F_t(:, i)/m)^T log(F_t(:, i)/m)        (14)

The redundancy of D̄_t^i is defined as Eq. (15):

    R_t^i = 1 − H(D̄_t^i | σ, f_t^i) / log|D̄_t^i| = 1 − H(D̄_t^i | σ, f_t^i) / log m        (15)

We aim to minimize the redundancy of the dictionary online: the less redundancy the dictionary has, the more information it contains. So the index of the kernel function removed from the old dictionary is determined by Eq. (16):

    i* = arg min_{1≤i≤m+1} R_t^i        (16)
The generalization error for each sample in D_{t+1} can be expressed as Eq. (24):

    ξ_loo^(k)(t+1) = (A_{t+1}^{-1} Y_{t+1})_k / diag(A_{t+1}^{-1})_k ,   k = 1, …, m        (24)

As a result, the generalization error vector between the estimated values and the real values can be denoted E_{t+1} = [ξ_loo^(1)(t+1), …, ξ_loo^(m)(t+1)].

The loss function can then be defined as Eq. (25):

    J(γ_new^{t+1}) = (1/2) ⟨E(t+1), E(t+1)⟩ = (1/2) Σ_{k=1}^m [ξ_loo^(k)(t+1)]^2        (25)

The optimal regularization factor γ*^{t+1} can be obtained by minimizing the loss function J(γ_new^{t+1}), i.e.,

    γ*^{t+1} = arg min_{γ_new^{t+1} ∈ Π} J(γ_new^{t+1})        (26)

where Π is the range of γ_new^{t+1}. The GD algorithm, also known as the steepest descent method, is one of the most common methods for solving unconstrained optimization problems.

    N_{t+1} = [ (A_t^i)^{-1}  O
                O             0 ]

It is easy to see that S_{t+1}, M_{t+1} and N_{t+1} do not depend on γ_new^{t+1}. However, (A_t^i)^{-1} is unknown in the process of calculating S_{t+1}, M_{t+1} and N_{t+1}. In order to avoid the computational burden of recalculating the matrix (A_t^i)^{-1}, and to improve efficiency when deleting the old sample from the dictionary, we present an effective method.

For convenience of description, A_t can be rewritten in the form shown in Fig. 3, where i is the removable index found by Eq. (16): A_t has diagonal entries m/(2γ_k^t) + 1 (k = 1, …, m) and off-diagonal entries k_{l,j}.

Fig. 3 The form of matrix A_t
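The leave-one-out expressions of Eqs. (24)–(25) can be sketched as below. The formula is the standard PRESS identity for a kernel ridge model f = G (λI + G)^{-1} Y; the Gaussian kernel and scalar λ used here are illustrative assumptions standing in for Λ_t + G_t.

```python
import numpy as np

def loo_errors(A, Y):
    """LOO errors of the regularized kernel model (Eq. 24):
    xi_k = (A^{-1} Y)_k / (A^{-1})_{kk}, with A = Lambda + G."""
    A_inv = np.linalg.inv(A)
    return (A_inv @ Y) / np.diag(A_inv)

def loo_loss(A, Y):
    """J = 0.5 * sum_k xi_k^2 (Eq. 25)."""
    xi = loo_errors(A, Y)
    return 0.5 * float(xi @ xi)
```

The identity is exact: the k-th entry equals the residual of a model retrained without sample k, which the brute-force check below confirms without m extra matrix inversions.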
Moving the i-th row and i-th column of A_t to the first row and first column, respectively, can be mathematically formulated as

    Ã_t = P_t A_t Q_t        (31)

where P_t and Q_t are two m-order elementary matrices whose structures are shown in Figs. 4 and 5.

It is easy to see that P_t P_t^T = I_m and Q_t Q_t^T = I_m, where I_m is the m-order identity matrix, so P_t and Q_t are orthogonal matrices. Based on the properties of orthogonal matrices, we have P_t^{-1} = P_t^T and Q_t^{-1} = Q_t^T. Furthermore, P_t = Q_t^T, so Q_t^{-1} = P_t and P_t^{-1} = Q_t.

According to Eq. (31), we get

    Ã_t^{-1} = (P_t A_t Q_t)^{-1} = P_t A_t^{-1} Q_t        (32)

Ã_t^{-1} can be written in block form:

    Ã_t^{-1} = [ (Ã_t^{-1})^{(1,1)}        (Ã_t^{-1})^{(1, 2:end)}
                 (Ã_t^{-1})^{(2:end, 1)}   (Ã_t^{-1})^{(2:end, 2:end)} ]

We can also obtain the block matrix form of Ã_t:

    Ã_t = [ v_t    V_t
            V_t^T  A_t^i ]

where V_t = [k_{i,1}, …, k_{i,i−1}, k_{i,i+1}, …, k_{i,m}] and v_t = m/(2γ_i^t) + 1. According to the conclusion in Ref. [20], we obtain

    (A_t^i)^{-1} = (Ã_t^{-1})^{(2:end, 2:end)} − (Ã_t^{-1})^{(2:end, 1)} (Ã_t^{-1})^{(1, 2:end)} / (Ã_t^{-1})^{(1,1)}        (33)

According to Eq. (33), we can obtain S_{t+1}, M_{t+1} and N_{t+1}. Substituting them into (19), Eq. (19) can be rewritten as Eq. (34):

    A_{t+1}^{-1} = ( 2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1}) ) M_{t+1} + N_{t+1}        (34)

According to Eq. (34), ∇_γ A_{t+1}^{-1} can be obtained:

    ∇_γ A_{t+1}^{-1} = ∂/∂γ_new^{t+1} [ (2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1})) M_{t+1} + N_{t+1} ]
                     = ( 2m / (m + 2γ_new^{t+1} S_{t+1})^2 ) M_{t+1}        (35)

Substituting Eq. (29) into (28), Eq. (28) can be rewritten as Eq. (36):

    ∇_γ J(γ_new^{t+1}) = Σ_{k=1}^m ( (D1_k − D2_k) / [diag(A_{t+1}^{-1})_k]^3 )        (36)

where D1_k = (A_{t+1}^{-1} Y_{t+1})_k (∇_γ A_{t+1}^{-1} Y_{t+1})_k diag(A_{t+1}^{-1})_k and D2_k = (A_{t+1}^{-1} Y_{t+1})_k^2 diag(∇_γ A_{t+1}^{-1})_k.

According to Eq. (30), ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1} are both functions of γ_new^{t+1}:

    ρ_{t+1}^{-1} = 2γ_new^{t+1} / (m + 2γ_new^{t+1} S_{t+1})
    ∇_γ ρ_{t+1}^{-1} = 2m / (m + 2γ_new^{t+1} S_{t+1})^2        (37)

Combining Eqs. (34), (35) and (37), we further obtain Eq. (38):

    diag(A_{t+1}^{-1}) = ρ_{t+1}^{-1} diag(M_{t+1}) + diag(N_{t+1})
    diag(∇_γ A_{t+1}^{-1}) = ∇_γ ρ_{t+1}^{-1} diag(M_{t+1})
    A_{t+1}^{-1} Y_{t+1} = ρ_{t+1}^{-1} M_{t+1} Y_{t+1} + N_{t+1} Y_{t+1}
    ∇_γ A_{t+1}^{-1} Y_{t+1} = ∇_γ ρ_{t+1}^{-1} M_{t+1} Y_{t+1}        (38)

Substituting (38) into (36) yields ∇_γ J(γ_new^{t+1}). At each iterative step of Eq. (27), the change of γ_new^{t+1} does not affect diag(M_{t+1}), diag(N_{t+1}), M_{t+1} Y_{t+1}, N_{t+1} Y_{t+1} or S_{t+1}; it only influences ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1}. So, at each training iteration, we only need to update ρ_{t+1}^{-1} and ∇_γ ρ_{t+1}^{-1}.

For the iteration equation shown in Eq. (27), a dynamic learning rate is adopted to ensure the convergence of the algorithm:

    η(j) = η̄ (1 − δ/ē_{t+1}(j))   if ē_{t+1}(j) > δ
    η(j) = 0                       if ē_{t+1}(j) ≤ δ        (39)

where η̄ is a constant with 0 < η̄ ≤ 1; ē_{t+1}(j) is the mean value of the generalization errors in the j-th iteration at time t + 1, i.e., ē_{t+1}(j) = (1/m) Σ_{k=1}^m |ξ_loo^(k)(t+1)|; and δ represents the algorithm termination threshold.

Substituting the results of Eqs. (36) and (39) into (27), the optimization problem can be solved. In the end, the regularization factor vector is updated by Eq. (40):

    γ_k^{t+1} = γ_k^t          for k < i
    γ_k^{t+1} = γ_{k+1}^t      for i ≤ k < m
    γ_k^{t+1} = γ_new^{t+1}    for k = m        (40)

4 Complexity analysis

A brief computational framework of NOS-KELM is described in Fig. 6. The details are summarized in Algorithm 2, where the initial dictionary D_0 is composed of the first m samples {(x_i, y_i)}_{i=1}^m, and the kernel width σ and initial regularization factor γ_0 are determined by the grid search method. The initial regularization factor vector is defined as γ_0 = [γ_1^0, …, γ_m^0]; then Λ_0 = (m/2)[diag(γ_0)]^{-1}. Moreover, G_0 = [k(x_i, x_j)]_{m×m} and Y_0 = [y_1, y_2, …, y_m]^T.
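The optimization loop of Eqs. (26)–(27) with the dynamic learning rate of Eq. (39) can be sketched as follows. This is a schematic only: the paper evaluates the gradient analytically through Eqs. (36)–(38), whereas a central finite difference stands in for it here, and the function and parameter names are assumptions.

```python
def optimize_factor(loss, loo_mean, gamma0, eta_bar=0.8, delta=1e-3, iters=50):
    """Minimize the LOO loss J(gamma) over the new regularization factor.

    loss     : J(gamma), the LOO loss of Eq. (25)
    loo_mean : mean absolute LOO error at gamma (the e_bar of Eq. 39)
    """
    gamma = gamma0
    for _ in range(iters):
        e_bar = loo_mean(gamma)
        if e_bar <= delta:                      # Eq. (39): stop when error is small
            break
        eta = eta_bar * (1.0 - delta / e_bar)   # dynamic learning rate
        h = 1e-6 * max(1.0, abs(gamma))         # finite-difference stand-in for Eq. (36)
        grad = (loss(gamma + h) - loss(gamma - h)) / (2.0 * h)
        gamma = gamma - eta * grad              # gradient step, Eq. (27)
    return gamma
```

The rate shrinks toward zero as the mean LOO error approaches the threshold δ, which damps the final steps and stops the iteration instead of oscillating around the optimum.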
    MRPE = (1/n) Σ_{i=1}^n |ŷ(i) − y(i)| / y(i)

5.1 Nonstationary Mackey–Glass chaotic time series

This example is an artificial nonstationary time series generated by mixing the Mackey–Glass chaotic time series with a sinusoid. The Mackey–Glass chaotic time series is generated by the following time-delay differential equation:

    dx(t)/dt = a x(t−τ) / (1 + x(t−τ)^10) − b x(t)

where x(t) is the value of the time series at time t. The initial conditions are set as a = 0.2, b = 0.1, τ = 17, x(0) = 1.2 and x(t) = 0 for t < 0. We apply the fourth-order Runge–Kutta method with time step size Δ = 0.1 to obtain the numerical solution of the differential equation. Then a sinusoid 0.3 sin(2πt/3000) is added to the series to create the nonstationary chaotic time series. The sampling interval is set as T_s = 10Δ. In this example, the first 800 points are used for training and the last 400 points for testing; all points are shown in Fig. 7 (the nonstationary Mackey–Glass data together with the added sinusoid). The time embedding dimension is set to 10, i.e., the input is u(t) = (x(t−T_s), …, x(t−10T_s)).

The selected parameters are listed in Table 1. All methods learn the training samples one by one. Let Z denote the predictive step. When Z is equal to 200 and 400, respectively, the prediction results of the different methods are shown in Table 2, where the bold values are the optimal values for every evaluation index.

From Table 2, we can see that, compared with KB-IELM, FOKELM and ALD-KOS-ELM, when the predictive step is equal to 200 the RMSE is reduced by 39.2, 24.4 and 14.7%, respectively; when the predictive step is equal to 400, the RMSE is reduced by 42.2, 70.8 and 27.8%. So the proposed method has higher modeling accuracy.

When the predictive step is equal to 400, the prediction results of the proposed method are shown in Fig. 8. It is clear that the prediction curve expresses the trend of the actual curve effectively, and the prediction errors are at a relatively low level. Besides, Fig. 9 shows the distribution of the model regularization factors when the training process is finished. Obviously, the obtained model has different regularization factors in different nonlinear regions.

Figure 10 shows the learning curves of the different methods, where the Y-axis denotes the mean square error (MSE) and the X-axis denotes the training sample. The learning curve of the proposed method is smoother and converges to a more accurate stage, so the proposed method has better performance than the others.

5.2 Lorenz chaotic time series

The Lorenz chaotic time series is given by the following equations:

    dx/dt = σ(y − x)
    dy/dt = rx − y − xz
    dz/dt = −bz + xy

The initial values are set as σ = 10, r = 28, b = 8/3, x(0) = 1, y(0) = 2 and z(0) = 9. The fourth-order Runge–Kutta method is used to generate a sample set, with the first 800 points for training and the last 400 points for testing. At the same time, Gaussian white noise with SNR = 5 dB is added to the time series. All samples are shown in Fig. 11. In this example, the x(t), y(t) and z(t) series are used together
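The nonstationary Mackey–Glass series of Sect. 5.1 can be generated as sketched below. The handling of the delay term is an assumption: within each RK4 step the delayed value x(t − τ) is frozen at the grid point, a common simplification for this benchmark.

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, tau=17.0, x0=1.2, dt=0.1):
    """Nonstationary Mackey-Glass series: RK4 on
    dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t), x(t) = 0 for t < 0,
    plus the sinusoid 0.3*sin(2*pi*t/3000)."""
    d = int(round(tau / dt))                 # delay in grid steps
    x = np.zeros(n + d + 1)
    x[d] = x0                                # x(0) = 1.2, zero history before t = 0
    for k in range(d, n + d):
        xd = x[k - d]                        # delayed term, frozen over the step
        f = lambda xv: a * xd / (1.0 + xd ** 10) - b * xv
        k1 = f(x[k]); k2 = f(x[k] + 0.5 * dt * k1)
        k3 = f(x[k] + 0.5 * dt * k2); k4 = f(x[k] + dt * k3)
        x[k + 1] = x[k] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    series = x[d:]                           # values from t = 0 on the dt grid
    t = np.arange(len(series)) * dt
    return series + 0.3 * np.sin(2 * np.pi * t / 3000.0)
```

Downsampling every 10th point of the returned series reproduces the paper's sampling interval T_s = 10Δ.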
Table 2  Prediction results of different methods in example 1 ([17] = KB-IELM, [20] = FOKELM, [25] = ALD-KOS-ELM)

Z = 200
  [17]        38.438   0.0174   0.0293   0.0153   0.0420   0.0108
  [20]        0.2100   0.0268   0.0009   0.0123   0.0317   0.0083
  [25]        0.2327   0.0112   0.0009   0.0109   0.0249   0.0081
  NOS-KELM    0.9793   0.0104   0.0009   0.0093   0.0263   0.0063
Z = 400
  [17]        16.799   0.0190   0.0436   0.0166   0.0438   0.0112
  [20]        0.1840   0.0526   0.0012   0.0329   0.0842   0.0241
  [25]        0.1637   0.0351   0.0013   0.0133   0.0384   0.0088
  NOS-KELM    0.8215   0.0112   0.0031   0.0096   0.0258   0.0066

Fig. 8  Prediction results of NOS-KELM in example 1: (a) original figure, (b) enlarged figure (predicted vs. actual), and prediction error
Fig. 9  Distribution of model regularization factors in example 1
Fig. 10 Learning curves of different methods in example 1 (testing MSE vs. sample sequence): KB-IELM, FOKELM, ALD-KOS-ELM, NOS-KELM
Fig. 11 Sample set of the Lorenz series (z(t) vs. sample)
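The Lorenz data of Sect. 5.2, including the additive noise at a prescribed SNR, can be produced as sketched below. The integration step dt and the noise-scaling helper are assumptions for illustration.

```python
import numpy as np

def lorenz(n, dt=0.01, sigma=10.0, r=28.0, b=8.0 / 3.0, s0=(1.0, 2.0, 9.0)):
    """RK4 integration of dx/dt = sigma(y-x), dy/dt = rx - y - xz, dz/dt = -bz + xy."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), r * x - y - x * z, -b * z + x * y])
    out = np.empty((n, 3))
    s = np.array(s0, dtype=float)
    for i in range(n):
        out[i] = s
        k1 = f(s); k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2); k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return out

def add_noise_snr(series, snr_db=5.0, rng=None):
    """Additive Gaussian white noise scaled to the requested SNR in dB."""
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(series ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return series + rng.normal(scale=np.sqrt(p_noise), size=series.shape)
```

Scaling the noise power to p_signal / 10^(SNR/10) is what "SNR = 5 dB" means here: the signal carries roughly 3.16 times the noise power.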
Table 3  Selected parameters in example 2
  Method      γ        σ        Other
  NOS-KELM    5e+4     1e+6     m = 80, δ = 0.2, η̄ = 0.8

As shown in Table 4, KB-IELM spends much more learning time than the other methods because it has no online sparsification procedure. The training time of the proposed method is slightly longer than that of FOKELM and ALD-KOS-ELM, but it is still at a relatively low level.

When the predictive step is equal to 400, the prediction results of the proposed method are shown in Fig. 12. It is clear that the prediction curve fits the actual curve well, and the prediction errors are at a relatively low level. Besides that, Fig. 13 shows the distribution of the model regularization factors when the training process has ended. Obviously, compared with the other methods, the obtained model has different regularization factors in different nonlinear regions.

Figure 14 shows the learning curves of the different methods. Compared with Fig. 10, the same conclusion can be obtained.

Fig. 12 Comparison of real values and prediction values obtained by NOS-KELM in example 2: (a) original figure, (b) enlarged figure (predicted vs. actual), (c) prediction error
Fig. 13 Distribution of model regularization factors in example 2

Prediction results of different methods in example 2:
Z = 200
  [17]        40.789   0.6517   0.0395   0.6035   1.5822   0.4889
  [20]        0.3603   0.5168   0.0011   0.4795   1.5484   0.2034
  [25]        0.1556   0.3622   0.0007   0.3157   1.1968   0.1704
  NOS-KELM    0.7155   0.3499   0.0008   0.2472   0.7831   0.1278
Z = 400
  [17]        17.279   0.5652   0.0485   0.5469   1.5114   0.2929
  [20]        0.2766   0.4124   0.0018   0.3366   1.1129   0.5131
  [25]        0.1153   0.3744   0.0011   0.3403   1.1962   0.7162
  NOS-KELM    0.5755   0.3646   0.0013   0.2851   0.9441   0.6454
Fig. 14 Learning curves of different methods in example 2 (testing MSE vs. sample sequence): KB-IELM, FOKELM, ALD-KOS-ELM, NOS-KELM

Prediction comparison for example 3: original figure, enlarged figure (predicted vs. actual) and prediction error

Prediction results of different methods in example 3:
Z = 100
  [17]        0.1190   13.5935   0.0012   18.5265   66.7773   0.5508
  [20]        0.0633   14.5808   0.0006   21.1775   88.1223   0.4867
  [25]        0.2447   12.9152   0.0012   19.6364   65.4295   0.4827
  NOS-KELM    0.3410   12.4182   0.0012   17.9586   59.7291   0.4774
Z = 200
  [17]        0.0650   14.0217   0.0026   16.3969   66.0040   0.6895
  [20]        0.0280   14.0356   0.0026   16.5558   65.1638   0.7141
  [25]        0.0314   13.3749   0.0032   15.9481   59.1508   0.5219
  NOS-KELM    0.1395   13.2193   0.0015   15.8586   61.8346   0.5238
The proposed method has the following advantages: (1) A novel sparsification rule is proposed, which can prune the least "significant" samples and preserves the important ones.
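The fixed-budget pruning idea can be sketched as follows. This is not the paper's instantaneous information measure; the significance criterion used here (negative maximum similarity to the other centers, so the most redundant center is pruned first) and the class name `BudgetDictionary` are our own stand-ins for illustration:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian kernel between two vectors
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

class BudgetDictionary:
    """Fixed-budget kernel dictionary: when the budget is exceeded,
    prune the center deemed least significant. Here significance is
    the negative of the maximum similarity to the other centers, an
    illustrative redundancy criterion only."""
    def __init__(self, budget, gamma=1.0):
        self.budget, self.gamma = budget, gamma
        self.centers = []

    def significance(self, i):
        # low value = highly redundant = good pruning candidate
        others = [c for j, c in enumerate(self.centers) if j != i]
        return -max(rbf(self.centers[i], c, self.gamma) for c in others)

    def add(self, x):
        self.centers.append(x)
        if len(self.centers) > self.budget:
            worst = min(range(len(self.centers)), key=self.significance)
            self.centers.pop(worst)

d = BudgetDictionary(budget=3, gamma=1.0)
for x in [np.array([0.0]), np.array([1.0]), np.array([0.05]), np.array([3.0])]:
    d.add(x)
print([float(c[0]) for c in d.centers])
```

When the fourth point arrives, one of the near-duplicate centers at 0.0 and 0.05 is pruned, while the isolated (informative) center at 3.0 is kept, which is the qualitative behavior the sparsification rule aims for.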
Compliance with ethical standards
References
13. Deng WY, Ong YS, Tan PS, Zheng QH (2016) Online sequential reduced kernel extreme learning machine. Neurocomputing 174:72–84
14. Wong SY, Yap KS, Yap HJ, Tan SC (2015) A truly online learning algorithm using hybrid fuzzy ARTMAP and online extreme learning machine for pattern classification. Neural Process Lett 42:585–602
15. Liang NY, Huang GB, Saratchandran P, Sundararajan N (2006) A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans Neural Netw 17(6):1411–1423
16. Huynh HT, Won Y (2011) Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks. Pattern Recogn Lett 32:1930–1935
17. Guo L, Hao JH, Liu M (2014) An incremental extreme learning machine for online sequential learning problems. Neurocomputing 128:50–58
18. Fan HJ, Song Q, Yang XL, Xu Z (2015) Kernel online learning algorithm with state feedbacks. Knowl Based Syst 89:173–180
19. Fan HJ, Song Q (2013) A sparse kernel algorithm for online time series data prediction. Expert Syst Appl 40:2174–2181
20. Zhou XR, Liu ZJ, Zhu CX (2014) Online regularized and kernelized extreme learning machines with forgetting mechanism. Math Probl Eng. doi:10.1155/2014/938548
21. Zhou XR, Wang CS (2016) Cholesky factorization based online regularized and kernelized extreme learning machines with forgetting mechanism. Neurocomputing 174:1147–1155
22. Gu Y, Liu JF, Chen YQ, Jiang XL, Yu HC (2014) TOSELM: timeliness online sequential extreme learning machine. Neurocomputing 128:119–127
23. Lim J, Lee S, Pang HS (2013) Low complexity adaptive forgetting factor for online sequential extreme learning machine (OS-ELM) for application to nonstationary system estimation. Neural Comput Appl 22:569–576
24. He X, Wang HL, Lu JH, Jiang W (2015) Online fault diagnosis of analog circuit based on limited-samples sequence extreme learning machine. Control Decis 30(3):455–460
25. Scardapane S, Comminiello D, Scarpiniti M, Uncini A (2015) Online sequential extreme learning machine with kernel. IEEE Trans Neural Netw Learn Syst 26(9):2214–2220
26. Zhang YT, Ma C, Li ZN, Fan HB (2014) Online modeling of kernel extreme learning machine based on fast leave-one-out cross-validation. J Shanghai Jiaotong Univ 48(5):641–646
27. Shao ZF, Meng JE (2016) An online sequential learning algorithm for regularized extreme learning machine. Neurocomputing 173:778–788
28. Lu XJ, Zhou C, Huang MH, Lv WB (2016) Regularized online sequential extreme learning machine with adaptive regulation factor for time-varying nonlinear system. Neurocomputing 174:617–626
29. Lin M, Zhang LJ, Jin R, Weng SF, Zhang CS (2016) Online kernel learning with nearly constant support vectors. Neurocomputing 179:26–36
30. Honeine P (2015) Analyzing sparse dictionaries for online learning with kernels. IEEE Trans Signal Process 63(23):6343–6353
31. Platt J (1991) A resource-allocating network for function interpolation. Neural Comput 3(2):213–225
32. Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285
33. Richard C, Bermudez JCM, Honeine P (2009) Online prediction of time series data with kernels. IEEE Trans Signal Process 57(3):1058–1067
34. Fan HJ, Song Q, Xu Z (2014) Online learning with kernel regularized least mean square algorithms. Expert Syst Appl 41:4349–4359
35. Liu WF, Park I, Principe JC (2009) An information theoretic approach of designing sparse kernel adaptive filters. IEEE Trans Neural Netw 20(12):1950–1961
36. Fan HJ, Song Q, Shrestha SB (2016) Kernel online learning with adaptive kernel width. Neurocomputing 175:233–242
37. Zhao YP, Wang KK (2014) Fast cross validation for regularized extreme learning machine. J Syst Eng Electron 25(5):895–900