Multiclass Filters by A Weighted Pairwise Criterion

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011
Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification
Haixian Wang, Member, IEEE
Abstract: The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns (CSP) and its variants, is commonly used in two-class brain-computer interfaces (BCI). For a multiclass problem, optimizing certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed the weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing an upper bound of the Bayesian error that is formulated specifically for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs, which are more likely to be misclassified, than to far-apart class pairs, which are already well separated. Moreover, we extend WPC by integrating temporal information of the EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. Experiments on multiclass classification using datasets from BCI competitions demonstrate the efficacy of the proposed method.
Index Terms: Bayesian classification error, brain-computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).
I. INTRODUCTION
Accurate classification of electroencephalogram (EEG) signals is the core problem in the brain-computer interface (BCI) community [22]. A large number of modern signal processing and machine learning techniques have been used and developed for it [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP uses only spatial information, spatio-temporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. Many variants of CSP are reviewed in [4].
Manuscript received June 20, 2010; revised October 14, 2010 and
December 30, 2010; accepted January 3, 2011. Date of publication January 13,
2011; date of current version April 20, 2011. This work was supported in part
by the National Natural Science Foundation of China under Grants 61075009
and 60803059, in part by the Qing Lan Project, and in part by the Fund for the
Program of Excellent Young Teachers at Southeast University.
The author is with the Key Laboratory of Child Development and Learn-
ing Science of Ministry of Education, Research Center for Learning Science,
Southeast University, Nanjing 210096, China (e-mail: hxwang@seu.edu.cn).
Digital Object Identifier 10.1109/TBME.2011.2105869
The CSP approach was originally proposed for the two-class paradigm. Multiclass extensions have been investigated in the literature. One simple extension divides the multiclass problem into several two-class problems and applies CSP to each [4], [7]. Another conventional extension is the joint approximate diagonalization (JAD) of $M$ covariance matrices, where $M$ is the number of classes [7]. It is based on the observation that CSP simultaneously diagonalizes two covariance matrices. JAD was further investigated from the perspectives of mutual information and brain source separation [9], [10]. The JAD approach, however, is a decomposition technique rather than a classification method.
Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. Their discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, where $x$ is an EEG signal recorded at a specific time point and $\omega$ is a filter. While this is a reasonable criterion for optimizing spatial filters, a more direct approach uses the same features for optimizing the spatial filters as for classification. Denoting a multivariate time series of band-pass filtered single-trial EEG by $X$, these features are, in CSP, $\omega^T X X^T \omega$ (cf. [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)-(31)]). The quantity $\omega^T X X^T \omega$ is the variance of the band-pass filtered EEG signals and is equal to band power. So, the band power $\omega^T X X^T \omega$ with an appropriate spatial filter $\omega$ in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for classifying brain activities [4]. Specifically, the so-called idle rhythms, around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated during motor activity. This physiological phenomenon is termed the ERD effect because of the loss of synchrony in the neural population. By contrast, the rebound of the rhythmic activity is termed ERS.
In this paper, we directly target the classification of $\omega^T X X^T \omega$, for which the upper bound of the Bayesian classification error is estimated. By minimizing this upper bound, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a weighted sum over class pairs and is referred to as the weighted pairwise criterion (WPC). The WPC approach puts heavier weights on close class pairs, which are more likely to be misclassified, and de-emphasizes the influence of far-apart class pairs, whose classes are already well separated. This weighting strategy makes the criterion better suited to producing separability in the output space, as has been observed in pattern classification problems [13], [14]. Computationally,
the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and hence the discriminant criteria obtained by minimizing the upper bound of the Bayesian classification error are also different. We also extend WPC by integrating temporal information of the EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets from BCI competitions.
The remainder of this paper is organized as follows. Section II briefly reviews CSP and its multiclass extension via Bayesian classification error estimation based on EEG sampling points. In Section III, we derive the upper bound of the Bayesian error formulated specifically for classifying EEG single-trial segments, propose the WPC to minimize this upper bound, and give the optimization procedure using the rank-one update and power iteration technique. The comparison with [23] and the extension that integrates temporal information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes the paper.
II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION
Let $x \in \mathbb{R}^K$ be an EEG signal at a specific time point recorded with $K$ electrodes. We view $x$, recorded while a certain mental task is performed, as a $K$-dimensional random variable generated from a Gaussian distribution. Suppose that $\mathcal{C} = \{c_1, \ldots, c_M\}$ is the set of mental conditions to be investigated. We consider the multiclass ($M > 2$) classification problem that assigns EEG single-trial segments to the $M$ predefined brain states. Given class $c_l$ ($l \in \{1, 2, \ldots, M\}$), the random variable $x$ is assumed to be Gaussian distributed according to $x \mid c_l \sim N(0, \Sigma_l)$, where $\Sigma_l$ is the covariance matrix. The Gaussian assumption does not sacrifice generality when studying linear filters and statistics up to second order [10]. For the purpose of classification, we wish to learn $G$ ($G < K$) filters (linear transformation vectors) $\omega_g \in \mathbb{R}^K$ from the finite training data such that the filtered features are more discriminative for predicting class labels than the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.
A. CSP: Two-Class Paradigm
The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing this variance ratio characterizes the ERD/ERS effects. Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{K \times N}$ be a segment of the EEG series during one trial, where $x_i$ is the multichannel EEG signal at time point $i$, and $N$ denotes the number of sampled time points.
The CSP approach solves for spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as
$$\Sigma_l = \frac{1}{N |\mathcal{I}_l|} \sum_{t \in \mathcal{I}_l} X_t X_t^T \tag{1}$$
where $\mathcal{I}_l$ ($l \in \{1, 2\}$) denotes the set of indices of trials belonging to class $c_l$, and $|\mathcal{I}_l|$ is the cardinality of the set $\mathcal{I}_l$. The spatial filters of CSP can be alternatively formulated as an optimization problem [4], [18]
$$\omega = \arg\{\max, \min\}_{\omega \in \mathbb{R}^K} \frac{\omega^T \Sigma_1 \omega}{\omega^T \Sigma_2 \omega} \tag{2}$$
where the notation $\{\max, \min\}$ means that maximizing and minimizing the Rayleigh quotient are of equal interest. The spatial filters are thus obtained by solving the generalized eigenvalue equation
$$\Sigma_1 \omega = \lambda \Sigma_2 \omega. \tag{3}$$
The eigenvalue $\lambda$ measures the ratio of variances between the two classes. For classification, the filters are chosen as several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
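As a minimal sketch of the two-class CSP computation in (1)-(3), assuming NumPy and synthetic zero-mean data, the generalized eigenproblem can be solved by whitening with respect to the second covariance matrix. The function names `class_covariance` and `csp_filters`, the parameter `n_pairs`, and all numerical values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def class_covariance(trials):
    """Average of X_t X_t^T over trials, divided by N, as in (1).

    Each trial is a K x N array (channels x time points).
    """
    n_samples = trials[0].shape[1]
    return sum(X @ X.T for X in trials) / (n_samples * len(trials))

def csp_filters(sigma1, sigma2, n_pairs=1):
    """Solve Sigma_1 w = lambda Sigma_2 w; keep eigenvectors from both ends."""
    # Whiten with respect to Sigma_2 so that the generalized problem
    # becomes a standard symmetric eigenproblem.
    evals2, evecs2 = np.linalg.eigh(sigma2)
    whitener = evecs2 @ np.diag(evals2 ** -0.5) @ evecs2.T
    lam, V = np.linalg.eigh(whitener @ sigma1 @ whitener)  # ascending order
    W = whitener @ V                     # columns: generalized eigenvectors
    keep = np.r_[np.arange(n_pairs), np.arange(len(lam) - n_pairs, len(lam))]
    return lam[keep], W[:, keep]

# Synthetic example (hypothetical data): class 1 has rescaled channels.
rng = np.random.default_rng(0)
K, N = 4, 200
scale1 = np.diag([3.0, 1.0, 1.0, 0.3])
trials1 = [scale1 @ rng.standard_normal((K, N)) for _ in range(20)]
trials2 = [rng.standard_normal((K, N)) for _ in range(20)]
sigma1, sigma2 = class_covariance(trials1), class_covariance(trials2)
lam, W = csp_filters(sigma1, sigma2)
```

The eigenvalues at the two ends of the spectrum correspond to directions where the variance ratio between the classes is smallest and largest, which is why both ends are kept.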
B. Multiclass Situation by Bayesian Error Estimation
CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. Under the assumption that the distribution of the EEG sampling point $x$ conditioned on class $c_l$ is Gaussian, i.e., $p_l = N(x; 0, \Sigma_l)$, the filtered EEG signal $y = \omega^T x$ is also Gaussian distributed according to $N(y; 0, \omega^T \Sigma_l \omega)$. Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying $\omega^T x$, given by
$$\varepsilon(\omega) \le \frac{qM(M-1)}{2} - \frac{q^3}{32} \left( \sum_{l=1}^{M} \left| \omega^T (\Sigma_l - \bar{\Sigma}) \omega \right| \right)^2 \tag{4}$$
where $q$ is the common a priori probability of the $M$ classes, and $\bar{\Sigma} = \sum_{m=1}^{M} q_m \Sigma_m$. Minimizing the upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion
$$J(\omega) = \sum_{l=1}^{M} \left| \omega^T (\Sigma_l - \bar{\Sigma}) \omega \right|. \tag{5}$$
So, the $G$ filters are defined as [23]
$$\omega_1 = \arg\max_{\omega} J(\omega) \tag{6}$$
$$\vdots$$
$$\omega_G = \arg\max_{\omega^T \bar{\Sigma} \omega_g = 0,\; g = 1, \ldots, G-1} J(\omega). \tag{7}$$
Using suitable estimates of $\Sigma_l$ and $\bar{\Sigma}$, the filters so defined can be determined one by one via the rank-one update and power iteration procedure.
III. WPC
For the specific task of classifying a multiclass EEG single-trial segment $X$, a more direct approach is to optimize the feature $\omega^T X X^T \omega$, rather than $\omega^T x$, that is used for classification. Ideally, one would obtain the optimal criterion based on the Bayesian classification error of $\omega^T X X^T \omega$. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, an upper bound of the Bayesian classification error that is also easy to optimize in practice is usually used as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. Note that we take $\omega^T X X^T \omega$, rather than $\omega^T x$, as the target quantity in deriving the upper bound of the multiclass Bayesian classification error.
A. Upper Bound of Multiclass Bayesian Error
Recalling $X = [x_1, \ldots, x_N]$, we have $\omega^T X X^T \omega = \sum_{i=1}^{N} (\omega^T x_i)^2$, where $N$ is the number of sample points in one trial. Under the assumption that the $x_i$ conditioned on class $c_l$ are independent and Gaussian distributed as $N(0, \Sigma_l)$, the ratio $(\omega^T X X^T \omega)/(\omega^T \Sigma_l \omega)$ follows the $\chi^2$ distribution with $N$ degrees of freedom. Usually, the number of sampling points $N$ is large, say $N > 30$. By the central limit theorem, $\omega^T X X^T \omega$ conditioned on $c_l$ approaches the Gaussian distribution with mean $N \omega^T \Sigma_l \omega$ and variance $2N (\omega^T \Sigma_l \omega)^2$. For the time being, we assume that $2N (\omega^T \Sigma_l \omega)^2$ is less than one; the general case is addressed later.
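The stated mean and variance of the filtered band power can be checked with a small simulation. The sketch below assumes NumPy, a randomly generated (hypothetical) class covariance, and independent Gaussian samples, and compares the empirical moments of the filtered power with $N$ times the filtered variance and $2N$ times its square.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, n_trials = 3, 500, 4000

# Hypothetical class covariance and filter, for illustration only.
A = rng.standard_normal((K, K))
sigma = A @ A.T + K * np.eye(K)
omega = rng.standard_normal(K)
var_y = omega @ sigma @ omega            # filtered variance omega^T Sigma omega

# Each filtered sample omega^T x_i is N(0, var_y), so simulate it directly.
y = rng.standard_normal((n_trials, N)) * np.sqrt(var_y)
power = (y ** 2).sum(axis=1)             # filtered power, one value per trial

emp_mean, emp_var = power.mean(), power.var()
pred_mean, pred_var = N * var_y, 2 * N * var_y ** 2
print(emp_mean / pred_mean, emp_var / pred_var)   # both ratios should be near 1
```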
Denote by $\varepsilon_{lm}(\omega)$ the Bayesian error between classes $c_l$ and $c_m$, i.e.,
$$\varepsilon_{lm}(\omega) = q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \tag{8}$$
where $P(f(X) \ne c_l \mid c_l)$ is the probability that samples belonging to class $c_l$ are misclassified into class $c_m$, $f(\cdot)$ denotes the Bayesian classifier, and $q_l$ and $q_m$ are the a priori probabilities of classes $c_l$ and $c_m$, respectively. Since the data $\omega^T X X^T \omega$ conditioned on each class are (approximately) Gaussian distributed with variance less than one and with mean, for example, $N \omega^T \Sigma_l \omega$ if conditioned on class $c_l$, it follows that
$$q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \le \int_{D_m} q_l\, p_l(x)\, dx + \int_{D_l} q_m\, p_m(x)\, dx \tag{9}$$
where $p_l(x)$ and $p_m(x)$ are the probability density functions of the Gaussian distributions $N(N \omega^T \Sigma_l \omega, 1)$ and $N(N \omega^T \Sigma_m \omega, 1)$, respectively, and $D_m$ and $D_l$ are defined as
$$D_m = \{x : q_m p_m(x) \ge q_l p_l(x)\} \tag{10}$$
$$D_l = \{x : q_l p_l(x) > q_m p_m(x)\}. \tag{11}$$
Suppose $q_l = q_m = q$. Then, we have [21]
$$\int_{D_m} p_l(x)\, dx + \int_{D_l} p_m(x)\, dx = 1 - \mathrm{erf}\left( \frac{N \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right|}{2\sqrt{2}} \right) \tag{12}$$
where the error function (erf) is defined as $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x e^{-u^2}\, du$. By (8), (9), and (12), the Bayesian error between classes $c_l$ and $c_m$ in the 1-D feature space after projection onto $\omega$ satisfies
$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \mathrm{erf}\left( \frac{N \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right|}{2\sqrt{2}} \right) \right]. \tag{13}$$
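It is still difficult to optimize $\omega$ via (13), since $\omega$ is embedded in the error function. We would like to isolate $\omega$ from the error function, for which we use the following inequality. For $0 \le x \le a$, we have
$$\mathrm{erf}(x) \ge \frac{1}{a}\, \mathrm{erf}(a)\, x. \tag{14}$$
Equality holds when $x = 0$ or $x = a$. The proof is given in Appendix A. Let
$$\Delta_{lm}(\omega) = \left| \omega^T \Sigma_l \omega - \omega^T \Sigma_m \omega \right| \tag{15}$$
be the absolute distance between classes $c_l$ and $c_m$ in the reduced 1-D feature space. By (14), we have
$$\mathrm{erf}\left( \frac{N \Delta_{lm}(\omega)}{2\sqrt{2}} \right) \ge \frac{1}{\Delta_{lm}}\, \mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \tag{16}$$
where $\Delta_{lm}$ (without argument) is the maximum value of $\Delta_{lm}(\omega)$. Recall that, at the beginning of this section, we required the magnitude of $\omega$ to satisfy the constraint $2N (\omega^T \Sigma_l \omega)^2 < 1$. The left and right sides of (16) are not equal for all directions of $\omega$; they are equal when $\Delta_{lm}(\omega)$ attains its maximum or minimum value. Combining (13) and (16), we have
$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \frac{1}{\Delta_{lm}}\, \mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \right]. \tag{17}$$
For the $M$-class problem, the upper bound of the Bayesian error is calculated as [5]
$$\varepsilon(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \varepsilon_{lm}(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} q \left[ 1 - \frac{1}{\Delta_{lm}}\, \mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \right]. \tag{18}$$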
B. Discriminant Criterion Based on Upper Bound of Multiclass Bayesian Error
To minimize the Bayesian error, we minimize its upper bound, which reduces to maximizing the following discriminant criterion:
$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta_{lm}}\, \mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega). \tag{19}$$
Let
$$\alpha_{lm} = \frac{1}{\Delta_{lm}}\, \mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right). \tag{20}$$
Then, $J_P(\omega)$ can be rewritten as
$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \alpha_{lm} \Delta_{lm}(\omega). \tag{21}$$
Fig. 1. Weighting function $\phi(u) = (1/u)\, \mathrm{erf}(u)$.
The $\alpha_{lm}$ can be viewed as the weight imposed on the class pair $c_l$ and $c_m$. Since $N \omega^T \Sigma_l \omega$ and $N \omega^T \Sigma_m \omega$ are the distribution means of classes $c_l$ and $c_m$ in the reduced 1-D feature space, respectively, the quantity $N \Delta_{lm}$ in (20) is the maximum distance (with respect to $\omega$) between the two class means. It reflects the separability of the two classes. Note that $\phi(u) = (1/u)\, \mathrm{erf}(u)$ is a monotonically decreasing function of $u$, as shown in Fig. 1. So, the pairwise class weighting function $\alpha_{lm}$ in (20) is monotonically decreasing with respect to $N \Delta_{lm}$. That is, in (21), we impose heavier weights on close class pairs, which are more likely to be misclassified. This emphasis on close class pairs makes the criterion better suited to producing separability in the output space.
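Two properties used in this derivation are easy to check numerically: the chord inequality (14), which follows from the concavity of erf on the nonnegative axis, and the monotonic decrease of the weighting function $(1/u)\,\mathrm{erf}(u)$. The sketch below uses only the Python standard library; the grid of test points and the name `weighting` are arbitrary illustrative choices.

```python
import math

def weighting(u):
    """Weighting function (1/u) * erf(u) for u > 0."""
    return math.erf(u) / u

# Monotonic decrease on a grid of positive arguments.
us = [0.05 * k for k in range(1, 101)]
values = [weighting(u) for u in us]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))

# Chord inequality (14): erf(x) >= erf(a) * x / a for 0 <= x <= a.
chord_holds = all(
    math.erf(a * t / 50.0) >= math.erf(a) * (t / 50.0) - 1e-12
    for a in (0.1, 0.5, 1.0, 2.0, 5.0)
    for t in range(51)
)

# As u tends to 0 the weight tends to 2 / sqrt(pi), its largest value.
limit_at_zero = weighting(1e-6)
print(is_decreasing, chord_holds, limit_at_zero)
```

The finite limit at zero means that even nearly indistinguishable class pairs receive a bounded weight, while well-separated pairs receive a weight that decays roughly like $1/u$.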
C. Discriminant Criterion: General Case of $\omega$
In the earlier derivation, we required the variance $2N (\omega^T \Sigma_l \omega)^2$ to be less than 1. This requirement is satisfied by restricting the length of $\omega$. It suffices to restrict $\omega$ such that $\omega^T (\sqrt{2N} M \bar{\Sigma}) \omega = 1$. In fact, for any class $c_l$, we have $2N (\omega^T \Sigma_l \omega)^2 < 2N (\omega^T M \bar{\Sigma} \omega)^2 = 1$. So, we can normalize $\omega$ as $\omega / \sqrt{\omega^T (\sqrt{2N} M \bar{\Sigma}) \omega}$. Accordingly, the term $\omega^T (\Sigma_l - \Sigma_m) \omega$ is normalized as
$$\omega^T (\Sigma_l - \Sigma_m) \omega \;\to\; \frac{\omega^T (\Sigma_l - \Sigma_m) \omega}{\omega^T (\sqrt{2N} M \bar{\Sigma}) \omega}.$$
As a result, in the weighting function (20), $\Delta_{lm}$ becomes the largest absolute generalized eigenvalue of $\Sigma_l - \Sigma_m$ with respect to $\sqrt{2N} M \bar{\Sigma}$.
For the general case of $\omega$, based on criterion (19), we present the multiclass discriminant criterion
$$\tilde{J}_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta'_{lm}}\, \mathrm{erf}\left( \frac{\sqrt{N}\, \Delta'_{lm}}{4M} \right) \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right| \tag{22}$$
where $\Delta'_{lm}$ denotes the largest absolute generalized eigenvalue of $\Sigma_l - \Sigma_m$ with respect to $\bar{\Sigma}$. The value $\Delta'_{lm}$ measures the closeness between classes $c_l$ and $c_m$ in the input space. Based on criterion (22), we define a set of filters as follows:
$$\omega_1 = \arg\max_{\omega} \tilde{J}_P(\omega) \tag{23}$$
$$\vdots$$
$$\omega_G = \arg\max_{\omega^T \bar{\Sigma} \omega_g = 0,\; g = 1, \ldots, G-1} \tilde{J}_P(\omega). \tag{24}$$
This discriminant criterion produces a set of discriminant vectors that minimize the upper bound of the Bayesian error. The $g$th discriminant vector $\omega_g$ is determined such that $\tilde{J}_P$ is maximized in the $(K - g + 1)$-dimensional space that is perpendicular (with respect to $\bar{\Sigma}$) to the space spanned by the previously obtained discriminant vectors $\omega_1$ through $\omega_{g-1}$. We refer to $\tilde{J}_P$ as the WPC, which minimizes the upper bound of the Bayesian error for multiclass EEG single-trial classification.
D. Implementation of WPC
The proposed WPC method can be implemented, as in [23], by using the rank-one update and power iteration technique without resorting to complex optimization algorithms. Let $\nu = \bar{\Sigma}^{1/2} \omega$ and
$$\tilde{J}_P(\nu) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta'_{lm}}\, \mathrm{erf}\left( \frac{\sqrt{N}\, \Delta'_{lm}}{4M} \right) \left| \nu^T (\hat{\Sigma}_l - \hat{\Sigma}_m) \nu \right| \tag{25}$$
where $\hat{\Sigma}_l = \bar{\Sigma}^{-1/2} \Sigma_l \bar{\Sigma}^{-1/2}$ and $\Delta'_{lm}$ is the largest absolute eigenvalue of $\hat{\Sigma}_l - \hat{\Sigma}_m$. Then, solving for $\omega$ is converted into the optimization of $\nu$, formulated as
$$\nu_1 = \arg\max_{\nu} \tilde{J}_P(\nu) \tag{26}$$
$$\vdots$$
$$\nu_G = \arg\max_{\nu^T \nu_g = 0,\; g = 1, \ldots, G-1} \tilde{J}_P(\nu). \tag{27}$$
Let $s_{lm} \in \{-1, 1\}$ be the sign of $\nu^T (\hat{\Sigma}_l - \hat{\Sigma}_m) \nu$. Then, $|\nu^T (\hat{\Sigma}_l - \hat{\Sigma}_m) \nu| = s_{lm}\, \nu^T (\hat{\Sigma}_l - \hat{\Sigma}_m) \nu$. Let
$$H(s) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta'_{lm}}\, \mathrm{erf}\left( \frac{\sqrt{N}\, \Delta'_{lm}}{4M} \right) s_{lm} (\hat{\Sigma}_l - \hat{\Sigma}_m) \tag{28}$$
where $s$ has the entries $\{s_{lm}\}_{l,m=1,\ldots,M}$. The optimization of $\nu$ can be further formulated as
$$\nu_1 = \arg\max_{s \in S} \max_{\nu}\; \nu^T H(s) \nu \tag{29}$$
$$\vdots$$
$$\nu_G = \arg\max_{s \in S} \max_{\nu^T \nu_g = 0,\; g = 1, \ldots, G-1} \nu^T H(s) \nu \tag{30}$$
where $S$ is the sign space of all possible $s$.
Clearly, the first vector $\nu_1$ is the first principal eigenvector (which can be obtained by power iteration) of $H(s^*)$, where $s^*$ is the sign that results in the largest first principal eigenvalue over all possible $s$. Once the first vector $\nu_1$ is obtained, we proceed to find the second vector $\nu_2$ in the orthogonal complement of $\nu_1$, i.e., in the space spanned by $I_K - \nu_1 \nu_1^T$, where $I_K$ denotes the $K$-dimensional identity matrix. Therefore, $\nu_2$ is solved as the first principal eigenvector of the deflated matrix $(I_K - \nu_1 \nu_1^T) H(s^*) (I_K - \nu_1 \nu_1^T)$. Note that we use the same symbol $s^*$, but it is not necessarily the same as the one producing $\nu_1$. Generally, suppose the first $g$ vectors $\nu_1, \ldots, \nu_g$ have been obtained. The $(g+1)$th vector is determined in the orthogonal complement of the space spanned by $\nu_1, \ldots, \nu_g$, i.e., in the space spanned by $I_K - U_g U_g^T$, where $U_g$ is the matrix whose columns form an orthonormal basis of $\nu_1, \ldots, \nu_g$, which can be obtained, for example, by the Schmidt orthogonalization procedure. So, $\nu_{g+1}$ is solved as the first principal eigenvector of $(I_K - U_g U_g^T) H(s^*) (I_K - U_g U_g^T)$. Given the obtained $\nu_{g+1}$, according to the Schmidt orthogonalization procedure, the basis matrix $U_{g+1}$ is formed by padding $U_g$ as $U_{g+1} = [U_g, u_{g+1}]$, where $u_{g+1} = [\nu_{g+1} - U_g (U_g^T \nu_{g+1})] / \| \nu_{g+1} - U_g (U_g^T \nu_{g+1}) \|$. In theory, $\nu_{g+1}$ is orthogonal to $U_g$, i.e., $U_g^T \nu_{g+1} = 0$, which implies that $u_{g+1}$ is simply the normalized $\nu_{g+1}$. We keep the Schmidt orthogonalization step for numerical precision in practice. Note that $I_K - U_{g+1} U_{g+1}^T = (I_K - u_{g+1} u_{g+1}^T)(I_K - U_g U_g^T)$, which makes it feasible to compute $(I_K - U_{g+1} U_{g+1}^T) H(s^*) (I_K - U_{g+1} U_{g+1}^T)$ for the next step by updating $(I_K - U_g U_g^T) H(s^*) (I_K - U_g U_g^T)$ through multiplication by $I_K - u_{g+1} u_{g+1}^T$ from both sides.
In practice, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are usually unknown and thus need to be estimated. The estimate defined in (1) provides one way of doing so. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
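The procedure of this subsection can be sketched as follows, under several assumptions that are not from the paper: equal class priors, NumPy, a direct symmetric eigendecomposition (`numpy.linalg.eigh`) standing in for power iteration, and an exhaustive search over all sign vectors, which is only practical for a small number of classes. Table I may organize the sign search differently; the function name `wpc_filters` and the synthetic covariances are hypothetical.

```python
import itertools
import math
import numpy as np

def wpc_filters(sigmas, n_samples, n_filters):
    """Sketch of WPC; returns the filters as columns of a K x n_filters array."""
    M = len(sigmas)
    sigma_bar = sum(sigmas) / M                    # equal priors q = 1/M
    evals, evecs = np.linalg.eigh(sigma_bar)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma_bar^(-1/2)
    hats = [inv_sqrt @ S @ inv_sqrt for S in sigmas]      # whitened covariances

    # One weighted difference matrix per class pair (l < m), as in (28).
    # Assumes no two class covariances are identical (delta > 0).
    terms = []
    for l, m in itertools.combinations(range(M), 2):
        diff = hats[l] - hats[m]
        delta = np.abs(np.linalg.eigvalsh(diff)).max()    # largest |eigenvalue|
        weight = math.erf(math.sqrt(n_samples) * delta / (4 * M)) / delta
        terms.append(weight * diff)

    K = sigma_bar.shape[0]
    deflate = np.eye(K)                            # I_K - U_g U_g^T
    nus = []
    for _ in range(n_filters):
        best_val, best_vec = -np.inf, None
        # Exhaustive sign search (an assumption of this sketch).
        for signs in itertools.product((-1.0, 1.0), repeat=len(terms)):
            H = sum(s * T for s, T in zip(signs, terms))
            vals, vecs = np.linalg.eigh(deflate @ H @ deflate)
            if vals[-1] > best_val:
                best_val, best_vec = vals[-1], vecs[:, -1]
        u = deflate @ best_vec                     # Schmidt step for precision
        u /= np.linalg.norm(u)
        nus.append(u)
        deflate = (np.eye(K) - np.outer(u, u)) @ deflate  # rank-one update
    return inv_sqrt @ np.column_stack(nus)         # omega = Sigma_bar^(-1/2) nu

# Synthetic example with three hypothetical class covariances.
rng = np.random.default_rng(0)
sigmas = []
for _ in range(3):
    B = rng.standard_normal((4, 4))
    sigmas.append(B @ B.T + np.eye(4))
W = wpc_filters(sigmas, n_samples=250, n_filters=2)
```

Because the whitened vectors are orthonormal, the returned filters are orthonormal with respect to the average covariance matrix, which matches the perpendicularity constraint in (24).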
E. Classification
Suppose $\omega_g$ ($g = 1, \ldots, G$) are the $G$ filters obtained by WPC. For any EEG data segment $X_t$, we extract the features as
$$\sigma_g^2 = \omega_g^T X_t X_t^T \omega_g, \quad g = 1, \ldots, G. \tag{31}$$
The features extracted from the training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and input into the trained classifier to predict its class label.
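The feature extraction in (31) amounts to filtering the trial and summing the squared outputs of each filter. A minimal NumPy sketch follows; the function name `band_power_features` and the random filters and trial are placeholders rather than data from the paper.

```python
import numpy as np

def band_power_features(X, W):
    """Return omega_g^T X X^T omega_g for each column omega_g of W.

    X is a K x N trial and W is a K x G matrix of filters.
    """
    Y = W.T @ X                   # G x N filtered signals
    return (Y ** 2).sum(axis=1)   # one power value per filter

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))   # hypothetical trial: 5 channels, 100 samples
W = rng.standard_normal((5, 2))     # hypothetical filters
feats = band_power_features(X, W)
```

Filtering first avoids forming the $K \times K$ matrix $X_t X_t^T$ for every trial.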
IV. COMPARISON AND EXTENSION
In this section, we compare the proposed WPC approach with [23]. The starting points and formulations of the two methods are different. We also extend WPC by integrating temporal information.
A. Comparison With [23]
Both the proposed WPC approach and [23] are based on Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, while WPC directly takes the feature $\omega^T X X^T \omega$ used in EEG single-trial classification as its target.
TABLE I
OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH
TABLE II
COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]
It is noted that
$$q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right| \le \sum_{l=1}^{M} \left| \omega^T (\Sigma_l - \bar{\Sigma}) \omega \right| \le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right|. \tag{32}$$
The proof is given in Appendix B. Let
$$\tilde{J}(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \omega^T (\Sigma_l - \Sigma_m) \omega \right|. \tag{33}$$
Then, we have $q \tilde{J}(\omega) \le J(\omega) \le 2q \tilde{J}(\omega)$. So, maximizing $J(\omega)$ can be roughly performed by maximizing $\tilde{J}(\omega)$. Using $\nu = \bar{\Sigma}^{1/2} \omega$, maximizing $\tilde{J}(\omega)$ is equivalent to maximizing
$$\tilde{J}(\nu) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \nu^T (\hat{\Sigma}_l - \hat{\Sigma}_m) \nu \right| \tag{34}$$
which is the sum of the $M(M-1)/2$ pairwise absolute distances between $\nu^T \hat{\Sigma}_l \nu$ and $\nu^T \hat{\Sigma}_m \nu$, subject to $\|\nu\| = 1$.
Maximizing (34), however, may not be very appropriate for classifying multiclass EEG single trials in some cases. For example, consider the situation in which one class differs greatly (in terms of $\hat{\Sigma}$) from the other classes. To maximize the criterion $\tilde{J}(\nu)$, the class pairs (say $c_l$ and $c_m$) that have large differences (between $\hat{\Sigma}_l$ and $\hat{\Sigma}_m$) heavily control the selection of the direction of $\nu$. Note that $\nu$ is of unit length. As a result, the remote class is projected as far as possible from the other classes, while close classes are more likely to be merged.
By contrast, with the weight $(1/\Delta'_{lm})\, \mathrm{erf}(\sqrt{N}\, \Delta'_{lm} / (4M))$, the criterion $\tilde{J}_P(\nu)$ in (25) or $\tilde{J}_P(\omega)$ in (22) de-emphasizes the influence of large class differences, where the classes are already well separated, and gives greater emphasis to small class differences, where the close classes are more likely to be confused.
An interesting connection between WPC and [23] is that, when applied in the two-class paradigm, criterion $\tilde{J}_P(\omega)$ and criterion $J(\omega)$ produce the same solution for $\omega$. This, however, does not imply that the upper bounds of the Bayesian errors of the two methods are equal. Moreover, comparing the upper bounds derived by the two methods is not meaningful, since they are derived for classifying different objects.
B. Extension to WPC: Integrating Temporal Information
The filters of WPC are obtained by considering the classification of the projected variance $\omega^T X X^T \omega$. Note that $X$ is a segment of the EEG single-trial time course. The covariance formulation $XX^T$, however, is globally independent of time: the temporal information is completely ignored. Neurophysiological studies indicate that EEG signals are usually nonstationary. It is therefore useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that $XX^T = (1/2N) \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T$, we use the temporally local covariance matrix
$$C = \frac{1}{2N} \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T A(i, j) \tag{35}$$
for covariance modeling instead of the time-independent $XX^T$. The time-dependent adjacency value $A(i, j)$ is defined such that only temporally close sample pairs, say $\{x_i, x_j : |i - j| < \tau\}$ with $\tau$ being a temporal range parameter, contribute to the summation in (35). The value $A(i, j)$ is monotonically decreasing with respect to the temporal distance between the selected sample pairs. In this paper, the adjacency matrix $A$ is defined using Tukey's tricube weighting function [6]
$$A(i, j) = \begin{cases} \left( 1 - \left( \dfrac{|i - j|}{\tau} \right)^3 \right)^3, & |i - j| < \tau \\ 0, & \text{else.} \end{cases} \tag{36}$$
With some algebraic derivations, $C$ is compactly expressed as $C = (1/N) X E X^T$, where $E = D - A$ is the Laplacian matrix, $A = (A(i, j))_{i,j=1,\ldots,N}$, and $D$ is the diagonal matrix whose diagonal entries are the row sums of $A$. Let $L = (1/N) E$. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form $\omega^T X L X^T \omega$ conditioned on $c_l$ has mean $\mathrm{tr}(L)\, \omega^T \Sigma_l \omega$ and variance $2\, \mathrm{tr}(L^2) (\omega^T \Sigma_l \omega)^2$, where $\mathrm{tr}(\cdot)$ is the trace operator. If $\omega^T X L X^T \omega$ is treated as the target feature for classification, then (28) is accordingly modified by integrating temporal information as
$$H_{TI}(s) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta'_{lm}}\, \mathrm{erf}\left( \frac{\mathrm{tr}(L)\, \Delta'_{lm}}{4M \sqrt{\mathrm{tr}(L^2)}} \right) s_{lm} (\hat{\Sigma}_l - \hat{\Sigma}_m). \tag{37}$$
Note that, in this case, the difference between the means of classes $c_l$ and $c_m$ becomes $\mathrm{tr}(L) (\omega^T (\Sigma_l - \Sigma_m) \omega)$, and $\omega$ is normalized as $\omega / \sqrt{\omega^T (\sqrt{2\, \mathrm{tr}(L^2)}\, M \bar{\Sigma}) \omega}$.
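The compact Laplacian form of the temporally local covariance can be checked against the double sum in (35). The sketch below assumes NumPy and a small random trial; `tricube_adjacency` and the variable `tau` (the temporal range parameter) are illustrative names, and the sizes are chosen only to keep the double loop fast.

```python
import numpy as np

def tricube_adjacency(n, tau):
    """Tukey tricube weights on |i - j| < tau, zero elsewhere, as in (36)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.where(dist < tau, (1.0 - (dist / tau) ** 3) ** 3, 0.0)

rng = np.random.default_rng(0)
K, N, tau = 3, 40, 5
X = rng.standard_normal((K, N))          # hypothetical trial
A = tricube_adjacency(N, tau)
E = np.diag(A.sum(axis=1)) - A           # Laplacian matrix D - A

# Direct evaluation of the double sum in (35).
C_direct = np.zeros((K, K))
for i in range(N):
    for j in range(N):
        d = X[:, i] - X[:, j]
        C_direct += A[i, j] * np.outer(d, d)
C_direct /= 2 * N

# Compact form (1/N) X E X^T.
C_compact = X @ E @ X.T / N
print(np.abs(C_direct - C_compact).max())   # should be at rounding level
```

The compact form replaces an $O(N^2)$ loop over sample pairs with two matrix products, and because the adjacency matrix is banded, a sparse representation could reduce the cost further.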
In the implementation of the temporal extension of WPC, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are estimated as
$$\Sigma_l = \frac{1}{N |\mathcal{I}_l|} \sum_{t \in \mathcal{I}_l} X_t L X_t^T. \tag{38}$$
The optimization procedure can be carried out in the same way as for WPC. The features are extracted as
$$\tilde{\sigma}_g^2 = \omega_g^T X_t L X_t^T \omega_g, \quad g = 1, \ldots, G \tag{39}$$
where $\omega_g$ are the filters of the temporal extension of WPC.
1) Choice of $\tau$: The additional parameter $\tau$ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into $T_1$ folds, one of which is treated as the testing set. The testing samples are used to estimate the generalization ability and are not involved in solving for the filters or selecting the parameter. In the inner loop, the remaining $T_1 - 1$ folds are further divided into $T_2$ folds, one of which is treated as the validation set while the remaining $T_2 - 1$ folds are treated as the training set. For each $\tau$, the filters are solved on the training set, and then the recognition rate is calculated on the validation set. This procedure is repeated $T_2$ times with a different validation set each time. The average recognition rate is recorded as the recognition accuracy across the $T_2$ folds. We select the $\tau$ that results in the maximum recognition accuracy. We then solve the filters using all the $T_2$ folds with the optimal $\tau$ selected earlier. With the filters obtained, we calculate the recognition rate on the testing set specified in the outer loop. The whole procedure is repeated $T_1$ times with a different fold as the testing set each time. The average recognition rate is computed as the final recognition accuracy across the $T_1$ folds.
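The nested procedure can be sketched generically in plain Python. In this sketch, `fit` and `score` are hypothetical callables supplied by the user (for WPC they would solve the filters and compute the recognition rate), the interleaved fold assignment is an arbitrary choice, and the toy threshold model at the end exists only to make the example runnable.

```python
def k_fold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds interleaved folds."""
    return [list(range(f, n_items, n_folds)) for f in range(n_folds)]

def nested_cv(samples, candidates, fit, score, t1=3, t2=3):
    """Outer-loop mean score, with the parameter chosen in the inner loop."""
    outer_scores = []
    for test_idx in k_fold_indices(len(samples), t1):
        held_out = set(test_idx)
        rest = [s for i, s in enumerate(samples) if i not in held_out]
        test = [samples[i] for i in test_idx]

        def inner_score(param):
            # Average validation score over the t2 inner folds.
            scores = []
            for val_idx in k_fold_indices(len(rest), t2):
                val_set = set(val_idx)
                train = [s for i, s in enumerate(rest) if i not in val_set]
                val = [rest[i] for i in val_idx]
                scores.append(score(fit(train, param), val))
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_score)
        # Refit on all inner folds with the selected parameter, then test.
        outer_scores.append(score(fit(rest, best), test))
    return sum(outer_scores) / len(outer_scores)

# Toy usage: the "model" is just a threshold on a scalar feature.
samples = [(i / 29, int(i / 29 > 0.5)) for i in range(30)]
candidates = [0.2, 0.5, 0.8]

def fit(train, threshold):
    return threshold

def score(threshold, data):
    return sum(int(x > threshold) == y for x, y in data) / len(data)

accuracy = nested_cv(samples, candidates, fit, score)
```

The test fold never influences the parameter choice, which is the property the text relies on for an unbiased estimate of generalization ability.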
V. EXPERIMENTS
We evaluate the effectiveness of the proposed multiclass methods on two publicly available datasets from BCI competitions. Both datasets consist of four-class motor imagery EEG signals. We compare the classification performance of the proposed multiclass methods with multiclass CSP using one-versus-rest and using JAD [7], the multiclass information theoretic feature extraction [10], and the multiclass CSP presented in [23].
A. EEG Datasets Used for Evaluation
1) Dataset IIIa of BCI Competition III: This dataset follows a four-class motor imagery paradigm and was recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different motor imagery tasks (i.e., left hand, right hand, one foot, and tongue) according to cues presented in a randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task. For subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded using 60 sensors with a 64-channel Neuroscan system. The left and right mastoids were used as reference and ground, respectively. The EEG signals were sampled at 250 Hz and filtered with cutoff frequencies of 1 and 50 Hz with the notch filter on.
2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded from nine subjects during a cue-based four-class motor imagery task [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subject to carry out the desired motor imagery task (i.e., imagining movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and filtered with cutoff frequencies of 0.5 and 100 Hz with the notch filter on.
B. Experimental Settings and Results
The data are band-pass filtered between 5 and 35 Hz using a fifth-order Butterworth filter, as in [9] and [10]. The EEG segments recorded during the motor imagery period, i.e., from the third second to the seventh second in dataset IIIa of BCI competition III and from the third second to the sixth second in dataset IIa of BCI competition IV, are used in the experiment. We use a three-fold cross-validation strategy to evaluate the classification accuracy. That is, we partition all the trials of each class per subject into three divisions, in which each division is used in turn as testing data while the remaining two divisions are used as training data. This procedure is repeated three times until each division has been used once as testing data. In each repetition, for each filter obtained on the training data, features are obtained by projection on the 15 frequency bands of 2-Hz width in the range 5-35 Hz [9], [10]. Consequently, we obtain a $(15 \times G)$-dimensional feature vector for each trial, where $G$ is the number of filters selected on the training data. That is, we use the three-way cross-validation procedure with $T_1 = T_2 = 3$ to determine the value of $G$, where $G$ varies from 2 to 10 in steps of 2. The $(15 \times G)$-dimensional feature vectors are further reduced to 3-D vectors (see footnote 1) by using Fisher discriminant analysis (FDA) [21]. It should be noted that the spatial filters, the value of $G$, and the FDA weights are calculated on the basis of the training data and then applied to the testing data. The conventional nearest-class-mean classifier with Euclidean distance [21] is adopted to predict the class labels of the testing samples.
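The final classification step, a nearest-class-mean rule with Euclidean distance, can be sketched in a few lines of NumPy. The class name `NearestClassMean`, its scikit-learn-style `fit`/`predict` interface, and the toy 2-D features are assumptions made for illustration; in the experiments the inputs would be the FDA-reduced feature vectors.

```python
import numpy as np

class NearestClassMean:
    """Assign each sample to the class whose training mean is closest."""

    def fit(self, features, labels):
        self.classes_ = np.unique(labels)
        self.means_ = np.stack(
            [features[labels == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, features):
        # Pairwise Euclidean distances, shape (n_samples, n_classes).
        diffs = features[:, None, :] - self.means_[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Toy data (hypothetical): two well-separated classes in 2-D.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 1, 1])
clf = NearestClassMean().fit(train_x, train_y)
pred = clf.predict(np.array([[0.2, 0.4], [4.8, 5.1]]))
```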
Table III reports the classification accuracies obtained with the multiclass filters solved by the various methods. We also evaluate the classification accuracy of WPC integrating temporal information (WPC/TI), where the parameter $\tau$ is determined by the three-way cross-validation procedure with $T_1 = T_2 = 3$. Here, $\tau$ is varied logarithmically from 1 to 5 in steps of 1. The proposed WPC method achieves much better classification accuracy than the existing multiclass methods, and WPC/TI further improves the results in most cases. The improvement of WPC/TI is attributed to the local temporal modeling. The few cases in which WPC/TI yields lower classification accuracy than WPC may be due to overfitting.
C. Comparison With BCI Competition IV
For dataset IIa of BCI competition IV, to compare with the results
of the winners, we evaluate the session-to-session
transfer from session one to session two in terms of the kappa score,
simulating the competition scenario. The procedure of the
session-to-session transfer is much simpler than the cross validation.
Specifically, we use the first session as training data and the
second session as testing data. All the experimental settings are
the same as those described in the previous section except that the
training data and the testing data are now fixed. The classification
performance is summarized in Table IV. It can be seen that our
proposed methods achieve fairly good classification performance
compared with the results obtained by the best two competitors.
Note that the results obtained by the multiclass CSP using
one-versus-rest and using JAD are slightly different from those
reported in [9], since different classifiers and time segments are
used.
¹ Since the number of classes is four, we can obtain at most three dimensions
of features by FDA, which is known as the rank-limit problem.
TABLE III
COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH
SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD,
MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY
TABLE IV
KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON
DATASET IIA OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER,
WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND
M1–M6 ARE THE SAME AS THOSE OF TABLE III
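As a reminder of how a kappa score relates to a confusion matrix, the following sketch assumes the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the agreement expected by chance. The function name `kappa_score` is chosen here for illustration, and any matrix passed to it in examples is made up rather than taken from the paper's results.

```python
def kappa_score(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of trials on the diagonal.
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    p_chance = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For four classes with equally many trials each, p_e equals 0.25, so kappa reduces to (accuracy - 0.25) / 0.75.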
In our experiment, a simple classification procedure is
employed to reveal the effectiveness of the multiclass filters
obtained by WPC and WPC/TI. The classification performance
may be improved further by solving filters in narrower frequency bands,
tuning the optimal time segment for each trial, and/or using other
sophisticated classifiers. The goal of this paper is to demonstrate
the effectiveness of the weighted scheme for solving multiclass
filters: with the same experimental settings for all the
methods, the weighted pairwise design produces a much higher
classification accuracy.
VI. CONCLUSION
In this paper, we propose a new discriminant criterion, called
WPC, for optimizing multiclass filters. The approach is
established by minimizing the upper bound of the Bayesian error of
classifying EEG single-trial segments, resulting in a
sum of weights imposed on individual class pairs according
to their closeness. We place special emphasis on the effect
of closer classes, which are more likely to cause misclassification.
In other words, the contributions of different class pairs to the
discriminant criterion are weighted unequally. Computationally, the WPC
algorithm is conveniently solved by the rank-one update and
power iteration technique.
The proposed WPC approach is intentionally formulated for
classifying EEG single-trial data. It takes into account
classification errors of EEG trials between pairs of classes. While
the criterion derived from the Bayesian error of classifying
EEG sampling points is reasonable, the large pairwise class
differences may play an overwhelming role in the optimization.
By contrast, WPC directly uses the same features for optimizing
spatial filters as for classification. Moreover, we extend WPC
by integrating the temporal information of EEG series into the
covariance matrix formulation. The effectiveness of the proposed
WPC method is demonstrated by the classification of four motor
imagery tasks on two datasets of BCI competitions.
Finally, we point out that the Bayesian error estimation relies
heavily on the assumption of independent Gaussian distributions.
This assumption, however, does not hold strictly in
applications, since EEG data usually have an autocorrelation
structure. One possible remedy is to use a Gaussian mixture
model instead of a single Gaussian distribution. We are studying
this issue theoretically and practically.
APPENDIX A
PROOF OF (14)
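For $0 \le x \le a$, in the error function
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\,du \tag{40}$$
we use the variable substitution $v = \frac{a}{x}u$. Then, we have
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2 (x/a)^2}\,\frac{x}{a}\,dv. \tag{41}$$
Since $0 \le \frac{x}{a} \le 1$, it follows that
$$\operatorname{erf}(x) \ge \frac{x}{a}\,\frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2}\,dv = \frac{1}{a}\operatorname{erf}(a)\,x \tag{42}$$
which completes the proof.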
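The bound erf(x) >= erf(a) x / a for 0 <= x <= a can also be spot-checked numerically. The sketch below uses only `math.erf` from the Python standard library; the helper names `linear_lower_bound` and `bound_holds`, the grid size, and the tolerance are assumptions made for this illustration, and a grid check is of course no substitute for the proof.

```python
import math


def linear_lower_bound(x, a):
    """Right-hand side of the bound: erf(a) * x / a."""
    return math.erf(a) * x / a


def bound_holds(a, steps=1000, tol=1e-12):
    """Check erf(x) >= erf(a) * x / a on an even grid of x in [0, a].

    The small tolerance absorbs floating-point rounding at the end points,
    where the two sides are equal.
    """
    return all(
        math.erf(a * i / steps) + tol >= linear_lower_bound(a * i / steps, a)
        for i in range(steps + 1)
    )
```

Equality holds at x = 0 and x = a, which is consistent with the line through the origin and (a, erf(a)) lying below the concave curve erf on [0, a].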
APPENDIX B
PROOF OF (32)
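We have
$$\begin{aligned}
\sum_{l=1}^{M} \left|\omega^T(\Sigma_l - \bar{\Sigma})\omega\right|
&= \sum_{l=1}^{M} \left|\omega^T\Big(\Sigma_l - \sum_{m=1}^{M} q\,\Sigma_m\Big)\omega\right| \\
&= \sum_{l=1}^{M} \left|\sum_{m=1}^{M} q\,\omega^T(\Sigma_l - \Sigma_m)\omega\right| \\
&\le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left|\omega^T(\Sigma_l - \Sigma_m)\omega\right|.
\end{aligned} \tag{43}$$
On the other hand, we have
$$\begin{aligned}
q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left|\omega^T(\Sigma_l - \Sigma_m)\omega\right|
&= \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T(\Sigma_l - \bar{\Sigma})\omega - \omega^T(\Sigma_m - \bar{\Sigma})\omega\right| \\
&\le \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T(\Sigma_l - \bar{\Sigma})\omega\right| + \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T(\Sigma_m - \bar{\Sigma})\omega\right| \\
&= \frac{1}{2} \sum_{l=1}^{M} \left|\omega^T(\Sigma_l - \bar{\Sigma})\omega\right| + \frac{1}{2} \sum_{m=1}^{M} \left|\omega^T(\Sigma_m - \bar{\Sigma})\omega\right| \\
&= \sum_{l=1}^{M} \left|\omega^T(\Sigma_l - \bar{\Sigma})\omega\right|.
\end{aligned} \tag{44}$$
The proof is, thus, established.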
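Both chains of inequalities rest only on the triangle inequality for scalars and on the class weights summing to one, so they can be spot-checked with arbitrary numbers standing in for the per-class projected quantities. In the sketch below, the list `a`, the function name `bound_terms`, and the choice q = 1/M (equal class weights) are assumptions introduced for illustration.

```python
def bound_terms(a):
    """Return (S, P) for a list of per-class scalars a_1, ..., a_M.

    S = sum_l |a_l - mean| is the sum of deviations from the weighted mean.
    P = q * sum_{l<m} |a_l - a_m| is the weighted sum of pairwise differences.
    """
    m = len(a)
    q = 1.0 / m  # assumption: equal class weights, so that M * q = 1
    mean = q * sum(a)
    s = sum(abs(x - mean) for x in a)
    p = q * sum(abs(a[i] - a[j]) for i in range(m) for j in range(i + 1, m))
    return s, p
```

With these definitions, the first chain states S <= 2P and the second states P <= S, so the two sums differ by at most a factor of two.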
ACKNOWLEDGMENT
The author would like to thank the anonymous referees and
the editors for their constructive recommendations, which improved
the paper substantially.
REFERENCES
[1] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, "A survey
of signal processing algorithms in brain–computer interfaces based on
electrical brain signals," J. Neural Eng., vol. 4, no. 2, pp. R32–R57, Jun.
2007.
[2] B. Blankertz, K.-R. Müller, G. Curio, T. M. Vaughan, G. Schalk, J. R.
Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger,
M. Schröder, and N. Birbaumer, "The BCI competition 2003: Progress
and perspectives in detection and discrimination of EEG single trials,"
IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1044–1051, Jun. 2004.
[3] B. Blankertz, K.-R. Müller, D. J. Krusienski, G. Schalk, J. R. Wolpaw,
A. Schlögl, G. Pfurtscheller, J. R. Millán, M. Schröder, and N. Birbaumer,
"The BCI competition III: Validating alternative approaches to actual BCI
problems," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2,
pp. 153–159, Jun. 2006.
[4] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller,
"Optimizing spatial filters for robust EEG single-trial analysis," IEEE
Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.
[5] J. T. Chu and J. C. Chuen, "Error probability in decision functions for
character recognition," J. Assoc. Comput. Mach., vol. 14, no. 2,
pp. 273–280, 1967.
[6] W. S. Cleveland, "Robust locally weighted regression and smoothing
scatterplots," J. Amer. Stat. Assoc., vol. 74, pp. 829–836, 1979.
[7] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, "Boosting bit rates
in noninvasive EEG single-trial classifications by feature combination and
multi-class paradigms," IEEE Trans. Biomed. Eng., vol. 51, no. 6,
pp. 993–1002, Jun. 2004.
[8] G. Dornhege, M. Krauledat, K.-R. Müller, and B. Blankertz, General
Signal Processing and Machine Learning Tools for BCI. Cambridge,
MA: MIT Press, 2007, pp. 207–233.
[9] C. Gouy-Pailler, M. Congedo, C. Brunner, C. Jutten, and G. Pfurtscheller,
"Nonstationary brain source separation for multiclass motor imagery,"
IEEE Trans. Biomed. Eng., vol. 57, no. 2, pp. 469–478, Feb. 2010.
[10] M. Grosse-Wentrup and M. Buss, "Multiclass common spatial patterns
and information theoretic feature extraction," IEEE Trans. Biomed. Eng.,
vol. 55, no. 8, pp. 1991–2000, Aug. 2008.
[11] S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller, "Spatio-spectral filters
for improved classification of single trial EEG," IEEE Trans. Biomed.
Eng., vol. 52, no. 9, pp. 1541–1548, Sep. 2005.
[12] Y. Li, X. Gao, and S. Gao, "Classification of single-trial
electroencephalogram during finger movement," IEEE Trans. Biomed. Eng.,
vol. 51, no. 6, pp. 1019–1025, Jun. 2004.
[13] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear
dimension reduction by weighted pairwise Fisher criteria," IEEE Trans.
Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001.
[14] R. Lotlikar and R. Kothari, "Fractional-step dimensionality reduction,"
IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 6, pp. 623–627, Jun.
2000.
[15] D. J. McFarland, C. W. Anderson, K.-R. Müller, A. Schlögl, and D. J.
Krusienski, "BCI meeting 2005–workshop on BCI signal processing:
Feature extraction and translation," IEEE Trans. Neural Syst. Rehabil.
Eng., vol. 14, no. 2, pp. 135–138, Jun. 2006.
[16] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, "Designing optimal
spatial filters for single-trial EEG classification in a movement task," Clin.
Neurophys., vol. 110, no. 5, pp. 787–798, May 1999.
[17] M. Naeem, C. Brunner, R. Leeb, B. Graimann, and G. Pfurtscheller,
"Seperability of four-class motor imagery data using independent
components analysis," J. Neural Eng., vol. 3, pp. 208–216, 2006.
[18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, "Recipes for linear
analysis of EEG," NeuroImage, vol. 28, no. 2, pp. 326–341, Nov. 2005.
[19] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, "Optimal spatial
filtering of single trial EEG during imagined hand movement," IEEE
Trans. Rehabil. Eng., vol. 8, no. 4, pp. 441–446, Dec. 2000.
[20] H. Wang and W. Zheng, "Local temporal common spatial patterns for
robust single-trial EEG classification," IEEE Trans. Neural Syst. Rehabil.
Eng., vol. 16, no. 2, pp. 131–139, Apr. 2008.
[21] A. R. Webb, Statistical Pattern Recognition. London, U.K.: Oxford
Univ. Press, 1999.
[22] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and
T. M. Vaughan, "Brain–computer interfaces for communication and
control," Clin. Neurophysiol., vol. 113, no. 6, pp. 767–791, Jun. 2002.
[23] W. Zheng and Z. Lin, "Optimizing multi-class spatio-spectral filters via
Bayes error estimation for EEG classification," in Proc. Neural Informat.
Process. Syst. (NIPS), 2009, pp. 1–9.
Haixian Wang (M'09) received the B.S. and M.S.
degrees in statistics and the Ph.D. degree in computer
science from Anhui University, Anhui, China,
in 1999, 2002, and 2005, respectively.
During 2002–2005, he was with the Key Laboratory
of Intelligent Computing and Signal Processing
of the Ministry of Education of China. He is currently
with the Key Laboratory of Child Development and
Learning Science of the Ministry of Education, Research
Center for Learning Science, Southeast University,
Nanjing, Jiangsu, China. His research interests
include EEG signal processing, statistical pattern recognition, and machine
learning.
