
Machine Learning manuscript No.

(will be inserted by the editor)


Online Variance Minimization
Manfred K. Warmuth · Dima Kuzmin
Received: date / Accepted: date
Journal submission: September 25, 2008
Abstract We design algorithms for two online variance minimization problems. Specifically, in every trial t our algorithms get a covariance matrix $C^t$ and try to select a parameter vector $w^{t-1}$ such that the total variance over a sequence of trials $\sum_t (w^{t-1})^\top C^t w^{t-1}$ is not much larger than the total variance of the best parameter vector $u$ chosen in hindsight. Two parameter spaces are considered - the probability simplex and the unit sphere. The first space is associated with the problem of minimizing risk in stock portfolios and the second space leads to an online calculation of the eigenvector with minimum eigenvalue. For the first parameter space we apply the Exponentiated Gradient algorithm which is motivated with a relative entropy. In the second case the algorithm maintains a mixture of unit vectors which is represented as a density matrix. The motivating divergence for density matrices is the quantum version of the relative entropy and the resulting algorithm is a special case of the Matrix Exponentiated Gradient algorithm. In each case we prove bounds on the additional total variance incurred by the online algorithm over the best offline parameter.
Supported by NSF grant IIS 0325363. Some of this work was done while visiting National ICT Australia in Canberra. An earlier version of this appeared as a conference paper (Warmuth and Kuzmin, 2006b).

M. K. Warmuth
University of California, Santa Cruz
E-mail: manfred@cse.ucsc.edu

D. Kuzmin
University of California, Santa Cruz
E-mail: dima@cse.ucsc.edu

1 Introduction

In one of the simplest models of online learning in the expert setting (Freund and Schapire, 1997), the learner allocates a probability vector $w$ over the experts at the beginning of each trial. It then receives a loss vector $l$ and incurs loss $w \cdot l = \sum_i w_i l_i$. The goal is to design online algorithms with small regret, which is the total loss of the online algorithm minus the total loss of the best expert chosen in hindsight: $\sum_t w^{t-1} \cdot l^t - \inf_i \sum_t l^t_i$, where $t$ is the trial index.
In this paper we investigate online algorithms for minimizing the total variance over a sequence of trials. Instead of receiving a loss vector $l$ in each trial, we now receive a covariance matrix $C$ of a random loss vector $l$, where $C(i,j)$ is the covariance between $l_i$ and $l_j$ at the current trial. Intuitively the loss vector provides first-order information (means), whereas covariance matrices give second-order information. The variance/risk of the loss for probability vector $w$ when the covariance matrix is $C$ can be expressed as $w^\top C w = \mathrm{Var}(w \cdot l)$. We want online algorithms with small regret in the variance, which is the total variance of the algorithm minus the total variance of the best probability vector $u$ chosen in hindsight: $\sum_t (w^{t-1})^\top C^t w^{t-1} - \inf_u u^\top \big(\sum_t C^t\big)\, u$ (where the inf is over the probability simplex).

In a more general setting one wants to optimize tradeoffs between first-order and second-order terms, i.e. the loss in each trial has the form $\gamma\, w \cdot l + w^\top C w$, where $\gamma \ge 0$ is a tradeoff parameter. Such problems arise in Markowitz portfolio optimization (see e.g. the discussion in Boyd and Vandenberghe (2004), Section 4.4).
We develop an algorithm for this online variance minimization problem. The parameter space is the probability simplex. We use the Exponentiated Gradient algorithm for solving this problem since it maintains a probability vector. The latter algorithm is motivated and analyzed using the relative entropy between probability vectors as a measure of progress (Kivinen and Warmuth, 1997). The bounds we obtain are similar to the bounds of the Exponentiated Gradient algorithm when applied to linear regression.

In the second part of the paper we focus on a similar online variance minimization problem, but now the parameter space that we compare against is the unit sphere of direction vectors instead of the probability simplex and the total loss of the algorithm is to be close to $\inf_u u^\top\big(\sum_t C^t\big) u$, where the minimization is over unit vectors. The solution of the offline problem is an eigenvector that corresponds to a minimum eigenvalue of the total covariance $\sum_t C^t$.
Note that the variance $u^\top C u$ can be rewritten using the trace operator: $u^\top C u = \mathrm{tr}(u^\top C u) = \mathrm{tr}(u u^\top C)$. The outer product $u u^\top$ for unit $u$ is called a dyad and the offline problem can be reformulated as minimizing the trace of a product of a dyad with the total covariance matrix: $\inf_u \mathrm{tr}\big(u u^\top (\sum_t C^t)\big)$ (where $u$ is unit length).^1

In the original experts setting, the offline problem involved a minimum over $n$ experts. Now this becomes a minimum over infinitely many dyads of dimension $n$ and the best dyads correspond to eigenvectors with minimum eigenvalue. The algorithm for the original expert setting is the Hedge algorithm of Freund and Schapire (1997). It maintains its uncertainty over which expert is best as a probability vector $w$, i.e. $w_i$ is the current belief that expert $i$ is best:
$$w^{t-1}_i = \frac{e^{-\eta \sum_{q=1}^{t-1} l^q_i}}{Z^{t-1}}, \qquad (1.1)$$
where $Z^{t-1} = \sum_j e^{-\eta \sum_{q=1}^{t-1} l^q_j}$ normalizes the total weight to one.
^1 In this paper we upper bound the total variance of our algorithm, whereas the generalized Bayes rule of Warmuth (2005); Warmuth and Kuzmin (2006a) is an algorithm for which the sum of the negative logs of the variances is upper bounded.

Note that except for the normalization, the $i$-th weight $w^{t-1}_i$ is exponentially decaying in the total loss $\sum_{q=1}^{t-1} l^q_i$ of expert $i$.
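For concreteness, here is a small Python sketch of the Hedge weights (1.1); it is only an illustration (the expert losses below are made-up numbers, not data from the paper), but it shows both the exponential decay in the total loss and the softmin behavior discussed below.

```python
import numpy as np

def hedge_weights(total_loss, eta):
    """Hedge weights (1.1): w_i proportional to exp(-eta * total loss of expert i)."""
    shifted = total_loss - total_loss.min()   # shifting does not change the normalized weights
    w = np.exp(-eta * shifted)
    return w / w.sum()

total_loss = np.array([2.0, 0.5, 1.0])        # cumulative losses of 3 experts (made up)
print(hedge_weights(total_loss, eta=1.0))     # most weight on the expert with smallest loss
print(hedge_weights(total_loss, eta=100.0))   # as eta grows, the weights approach the argmin
```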
In the generalized setting we need to maintain uncertainty over infinitely many dyads. The natural parameter space is therefore mixtures of dyads which are called density matrices in statistical physics (symmetric positive definite matrices of trace one). Note that the vector of eigenvalues of such matrices is a probability vector. Using the methodology of Tsuda et al (2005); Warmuth and Kuzmin (2006a) we develop a matrix version of the Hedge algorithm for solving our second online variance minimization problem. Now the density matrix parameter has the form
$$W^{t-1} = \frac{\exp\big(-\eta\sum_{q=1}^{t-1} C^q\big)}{Z^{t-1}}, \qquad (1.2)$$
where $Z^{t-1} = \mathrm{tr}\big(\exp\big(-\eta\sum_{q=1}^{t-1} C^q\big)\big)$ normalizes the trace to one. Here exp denotes the matrix exponential. This update was also independently developed in (Arora and Kale, 2007) for convex optimization algorithms. When the covariance matrices $C^q$ are the diagonal matrices $\mathrm{diag}(l^q)$ then the density matrix parameter is a diagonalized probability vector and the matrix update (1.2) becomes the original expert update (1.1). In other words the original Hedge update may be seen as a special case of the new matrix update when the eigenvectors are fixed to the standard basis vectors and are not updated.
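This special case is easy to verify numerically. The following sketch (illustrative only; the loss vectors are arbitrary) checks that update (1.2) with $C^q = \mathrm{diag}(l^q)$ produces a diagonal density matrix whose diagonal is exactly the Hedge weight vector (1.1).

```python
import numpy as np

def mexp(S):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    lam, V = np.linalg.eigh(S)
    return (V * np.exp(lam)) @ V.T

eta = 0.7
losses = [np.array([0.3, 1.0, 0.2]), np.array([0.8, 0.1, 0.5])]   # arbitrary loss vectors

# Vector Hedge update (1.1).
w = np.exp(-eta * sum(losses))
w /= w.sum()

# Matrix update (1.2) with diagonal covariance matrices C^q = diag(l^q).
W = mexp(-eta * sum(np.diag(l) for l in losses))
W /= np.trace(W)

print(np.allclose(np.diag(W), w))   # True: the matrix update reduces to the expert update
```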
The original Hedge update (1.1) for a finite set of $n$ experts may be seen as a softmin calculation of the total loss vector $\sum_{q=1}^{t-1} l^q$, because as $\eta \to \infty$, the parameter vector $w^{t-1}$ becomes the uniform distribution over the set $\arg\min_i \sum_{q=1}^{t-1} l^q_i$. Similarly, the generalized update (1.2) is a soft minimum eigenvector calculation of the total covariance matrix $\sum_{q=1}^{t-1} C^q$, i.e. as $\eta \to \infty$, the density matrix $W^{t-1}$ becomes the uniform density matrix over the subspace spanned by the eigenvectors of $\sum_{q=1}^{t-1} C^q$ with minimum eigenvalue.
What replaces the loss $w \cdot l$ of the algorithm in the more general context? The dot product for matrices is a trace and we use the generalized loss $\mathrm{tr}(WC)$. If the eigendecomposition of the parameter matrix $W$ consists of the eigenvectors $w_i$ and the associated eigenvalues $\omega_i$, then this loss can be rewritten as
$$\mathrm{tr}(WC) = \mathrm{tr}\Big(\Big(\sum_i \omega_i\, w_i w_i^\top\Big)\, C\Big) = \sum_i \omega_i\, w_i^\top C w_i.$$
In other words it may be seen as an expected variance along the eigenvectors $w_i$ that is weighted by the eigenvalues $\omega_i$. Curiously enough, this trace is also a quantum measurement, where $W$ represents a mixture state of a particle and $C$ the instrument (see Warmuth (2005); Warmuth and Kuzmin (2006a) for additional discussion). Again the dot product $w \cdot l$ is the special case when the eigenvectors are the standard basis vectors, i.e.
$$\mathrm{tr}\big(\mathrm{diag}(w)\,\mathrm{diag}(l)\big) = \mathrm{tr}\Big(\Big(\sum_i w_i\, e_i e_i^\top\Big)\,\mathrm{diag}(l)\Big) = \sum_i w_i\, e_i^\top\mathrm{diag}(l)\, e_i = \sum_i w_i l_i.$$
In this paper we motivate and analyze the update of our generalization of the Hedge algorithm (1.2) to density matrices using the quantum relative entropy (see e.g. Nielsen and Chuang (2000)) as a measure of progress in place of the standard relative entropy that can be used for the original Hedge update. We are able to prove an analogous bound for the density matrix update by employing the Golden-Thompson inequality and some lemmas developed in Tsuda et al (2005).

The quantum relative entropy is a Bregman divergence and in a different paper (Warmuth and Kuzmin, 2006c) we showed that by employing the Bregman projection methods developed in Herbster and Warmuth (2001) our new update leads to an on-line update for Principal Component Analysis.

The paper is organized as follows. The first half deals with variance minimization over the probability simplex. Section 2.1 gives the necessary definitions. Section 2.2 describes the natural interpretation of variance as risk in stock portfolios. Section 2.3 motivates the Exponentiated Gradient algorithm for this setting and describes the derivation of the algorithm. After this, in Section 2.4, we give a proof of relative loss bounds for this algorithm. Next, motivated by the stock portfolios interpretation, we analyze a version of the algorithm that trades off the mean/return and the variance/risk. The more involved proof of the relative loss bounds for the tradeoff case is given in Section 2.5. In the second half of the paper we deal with the problem of variance minimization over the unit sphere. Section 3.1 introduces the relevant matrix mathematics and other definitions. In Section 3.2 we introduce the online algorithm for this problem which maintains a density matrix as its parameter and is motivated by a quantum relative entropy as the parameter divergence. After this, in Section 3.3, we set the stage by re-deriving the relative loss bound for the Hedge algorithm using relative entropy as a measure of progress. Our new algorithm can be seen as a matrix generalization of the Hedge algorithm. An analogous bound for the matrix case is given in Section 3.4. For the sake of completeness we sketch in Appendix A a number of other alternate proofs of the same basic loss bound that are based on potentials and Bregman projections. The proof methods for the original Hedge algorithm go back to the Continuous Weighted Majority algorithm and the original bounds were proven for a range of update factors. Analogously in Appendix B we show that the basic loss bound still holds if a range of matrix factors are used in the density matrix update. Concluding thoughts and observations are given in Section 4.
2 Variance Minimization over the Probability Simplex
2.1 Definitions

In this paper we only consider symmetric matrices. Such matrices always have an eigendecomposition of the form $W = \mathcal{W}\operatorname{diag}(\omega)\,\mathcal{W}^\top$, where $\mathcal{W}$ is an orthogonal matrix of eigenvectors and $\operatorname{diag}(\omega)$ is a diagonal matrix of the corresponding eigenvalues. Alternatively, the decomposition can be written as $W = \sum_i \omega_i\, w_i w_i^\top$, with the $\omega_i$ being the eigenvalues and the $w_i$ the eigenvectors. Note that the dyads $w_i w_i^\top$ are square matrices of rank one.

Matrix $A$ is called positive definite (denoted as $A \succeq 0$) if for all vectors $w$ we have $w^\top A w \ge 0$. In terms of eigenvalues, positive definiteness means that all eigenvalues are non-negative. A matrix is strictly positive definite if all eigenvalues are positive.

Let $l$ be a random vector, then $C = \mathbb{E}\big[(l - \mathbb{E}(l))(l - \mathbb{E}(l))^\top\big]$ is its covariance matrix. Such matrices are symmetric and positive definite.
Fig. 2.1 An ellipse $A$ in $\mathbb{R}^2$: The eigenvectors are the directions of the axes and the eigenvalues their lengths from the origin. Ellipses are weighted combinations of the one-dimensional degenerate ellipses (dyads) corresponding to the axes. (For unit direction $w$, the dyad $ww^\top$ is a degenerate one-dimensional ellipse which is a line between $-w$ and $w$ on the green dash-dotted unit circle). The solid red curve of the ellipse is a plot of direction vector $Aw$ and the blue dashed figure eight is unit $w$ on the green dash-dotted unit circle times the variance $w^\top A w$. At the eigenvectors, this variance equals the eigenvalues and the dashed curve touches the ellipse.
For any other vector $w$ we can compute the variance of the dot product $w \cdot l$ as follows:
$$\mathrm{Var}(w \cdot l) = \mathbb{E}\big[(l^\top w - \mathbb{E}(l^\top w))^2\big] = \mathbb{E}\big[\big((l - \mathbb{E}(l))^\top w\big)^\top\big((l - \mathbb{E}(l))^\top w\big)\big] = w^\top\,\mathbb{E}\big[(l - \mathbb{E}(l))(l - \mathbb{E}(l))^\top\big]\, w = w^\top C w.$$

A covariance matrix can be depicted as an ellipse $\{Cw : \|w\|_2 = 1\}$ centered at the origin. The eigenvectors of $C$ form the axes of the ellipse and the eigenvalues are the lengths of the axes from the origin (see Figure 2.1).
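The identity $\mathrm{Var}(w\cdot l) = w^\top C w$ is also easy to confirm by simulation. The sketch below is purely illustrative (the covariance matrix and weight vector are arbitrary choices, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])              # an arbitrary covariance matrix
w = np.array([0.7, 0.3])                # a probability vector

l = rng.multivariate_normal(mean=np.zeros(2), cov=C, size=200_000)
print(w @ C @ w, np.var(l @ w))         # closed form vs. sample estimate; they agree up to noise
```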
For two probability vectors $u$ and $w$ (i.e. vectors whose entries are nonnegative and sum to one) their relative entropy (or Kullback-Leibler divergence) is given by:
$$d(u, w) = \sum_{i=1}^n u_i \log\frac{u_i}{w_i}. \qquad (2.1)$$
We call this a divergence (and not a distance) since it is not symmetric and does not satisfy the triangle inequality. It is however nonnegative and convex in either argument. In the second part of the paper we deal with density matrix parameters and there we use the matrix generalization of the relative entropy (see Section 3.1).
2.2 Risk Minimization
The problem of minimizing the variance when the direction $w$ lies in the probability simplex is connected to risk minimization in stock portfolios. In Markowitz portfolio theory, the vector $p$ denotes the relative price change of all assets in a given trading period. Let $w$ be a probability vector that specifies the proportion of our capital invested into each asset (assuming short positions are not allowed). Then the relative capital change after a trading period is the dot product $w \cdot p$. If $p$ is a random vector with known or estimated covariance matrix $C$, then the variance of the relative capital change $w \cdot p$ for our portfolio $w$ is $w^\top C w$. This variance is clearly associated with the risk of our investment. Our problem is then to track the performance of the minimum risk portfolio over a sequence of trading periods.
2.3 Algorithm and Motivation
Let us reiterate the setup and the goal for our algorithm. On every trial $t$ it must produce a probability vector $w^{t-1}$. It then gets a covariance matrix $C^t$ and incurs a loss equal to the variance:
$$L^t(w^{t-1}) := (w^{t-1})^\top C^t w^{t-1}.$$
Thus for a sequence of $T$ trials, the total loss of the algorithm will be
$$L_{alg} := \sum_{t=1}^T L^t(w^{t-1}).$$
We want this loss to be comparable to the total loss of the best probability vector chosen in hindsight, where the loss of a probability vector $u$ is
$$L_u := \sum_{t=1}^T L^t(u).$$
This offline problem is a quadratic optimization problem with non-negativity constraints which does not have a closed form solution. However we can still prove bounds for the online algorithm.
A natural choice for an online algorithm for this problem is the Exponentiated Gradient algorithm of Kivinen and Warmuth (1997) since it maintains a probability vector as its parameter. Recall that for a general loss function $L^t(w^{t-1})$, the probability vector of the Exponentiated Gradient algorithm is updated as
$$w^t_i = \frac{w^{t-1}_i\, e^{-\eta(\nabla L^t(w^{t-1}))_i}}{\sum_{j} w^{t-1}_j\, e^{-\eta(\nabla L^t(w^{t-1}))_j}}. \qquad (2.2)$$
This update is motivated by considering the tradeoff between the relative entropy divergence to the old probability vector and the current loss, where $\eta > 0$ is the tradeoff parameter (which will become the learning rate of the algorithm):
$$w^t = \arg\min_{\sum_{i=1}^n w_i = 1}\ d(w, w^{t-1}) + \eta\, L^t(w). \qquad (2.3)$$
The exact solution to the above equation has the form (2.2) except that the gradient $(\nabla L^t(w^t))_i$ at the new parameter $w^t$ appears in the exponents instead of the gradient $(\nabla L^t(w^{t-1}))_i$ at the old parameter $w^{t-1}$. Since this exact update has $w^t$ appearing on both sides of the update equation, it is called the implicit Exponentiated Gradient update (Kivinen et al, 2005). In practice, the approximation (2.2) is often used, which is called the (explicit) Exponentiated Gradient update, and this is the default here as well. Alternatively the explicit update results from approximating the loss $L^t(w)$ by its first-order linear expansion via the Taylor formula (see Kivinen and Warmuth (1997) for more discussion).
In our application, $L^t(w^{t-1}) = \frac{1}{2}(w^{t-1})^\top C^t w^{t-1}$ and $\nabla L^t(w^{t-1}) = C^t w^{t-1}$, leading to the following (explicit) Exponentiated Gradient update:
$$w^t_i = \frac{w^{t-1}_i\, e^{-\eta(C^t w^{t-1})_i}}{\sum_{j=1}^n w^{t-1}_j\, e^{-\eta(C^t w^{t-1})_j}}.$$
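A minimal Python sketch of this algorithm is given below. It is only an illustration: the covariance matrices are random placeholders and the learning rate is an arbitrary constant, whereas the paper tunes $\eta$ as in Theorem 2.1 below.

```python
import numpy as np

def eg_variance_update(w, C, eta):
    """Explicit EG step for the variance loss: w_i <- w_i * exp(-eta * (C w)_i), renormalized."""
    g = C @ w                                  # gradient of the loss at the old parameter
    w_new = w * np.exp(-eta * (g - g.min()))   # the shift only affects the normalization
    return w_new / w_new.sum()

def run_eg(covariances, eta):
    n = covariances[0].shape[0]
    w = np.full(n, 1.0 / n)                    # uniform start vector
    total_variance = 0.0
    for C in covariances:
        total_variance += w @ C @ w            # variance incurred in this trial
        w = eg_variance_update(w, C, eta)
    return total_variance, w

rng = np.random.default_rng(1)
covariances = []
for _ in range(50):                            # random placeholder covariance matrices
    A = rng.normal(size=(4, 4))
    covariances.append(A @ A.T / 4.0)
print(run_eg(covariances, eta=0.5))
```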
2.4 Proof of Relative Loss Bounds
We now use the divergence $d(u, w)$ that motivated the update as a measure of progress in the analysis. The proof proceeds by bounding the progress towards the comparison vector $u$ on two successive trials, as measured by the change of the divergence: $d(u, w^{t-1}) - d(u, w^t)$. A somewhat different commonly used proof style uses a potential function, and bounds the increase of the potential on two successive trials. Many choices for the potential are possible, one common one is the value of optimization problem (2.3). We contrast the two proof styles in Appendix A when dealing with variance minimization over the sphere. Also, Helmbold and Warmuth (2007) compare the two proof styles for an online algorithm for learning permutations.
Lemma 2.1 Let $w^t$ be the weight vector of the Exponentiated Gradient algorithm computed at trial $t$ and let $u$ be an arbitrary probability vector that serves as a comparator. Also, let $r$ be the bound on the range of elements in covariance matrix $C^t$, specifically let $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$. For any constants $a$ and $b$ such that $0 < a \le \frac{b}{1 + rb}$ and a learning rate $\eta = \frac{2b}{1 + rb}$ we have:
$$a\,(w^{t-1})^\top C^t w^{t-1} - b\, u^\top C^t u \le d(u, w^{t-1}) - d(u, w^t).$$
Proof The proof follows the same outline as Lemma 5.8 of Kivinen and Warmuth (1997) which gives an inequality for the Exponentiated Gradient algorithm when applied to linear regression with the square loss. We begin by analyzing the progress towards the comparison vector $u$:
$$d(u, w^{t-1}) - d(u, w^t) = \sum_i u_i\log\frac{u_i}{w^{t-1}_i} - \sum_i u_i\log\frac{u_i}{w^t_i} = \sum_i u_i\log w^t_i - \sum_i u_i\log w^{t-1}_i$$
$$= \sum_i u_i\log\frac{w^{t-1}_i\, e^{-\eta(C^t w^{t-1})_i}}{\sum_j w^{t-1}_j\, e^{-\eta(C^t w^{t-1})_j}} - \sum_i u_i\log w^{t-1}_i = -\eta\sum_i u_i(C^t w^{t-1})_i - \log\Big(\sum_i w^{t-1}_i\, e^{-\eta(C^t w^{t-1})_i}\Big).$$
Thus, our bound is equivalent to showing $F \le 0$ with $F$ given as:
$$F = a\,(w^{t-1})^\top C^t w^{t-1} - b\, u^\top C^t u + \eta\, u^\top C^t w^{t-1} + \log\Big(\sum_i w^{t-1}_i\, e^{-\eta(C^t w^{t-1})_i}\Big).$$
We proceed by bounding the log term. The assumption on the range of elements of $C^t$ and the fact that $w^{t-1}$ is a probability vector allows us to conclude that $\max_i (C^t w^{t-1})_i - \min_i (C^t w^{t-1})_i \le r$, since $(C^t w^{t-1})_i = \sum_j C^t(i,j)\, w^{t-1}_j$. Now, assume that $l$ is a lower bound for $(C^t w^{t-1})_i$; then we have that $l \le (C^t w^{t-1})_i \le l + r$, or $0 \le \frac{(C^t w^{t-1})_i - l}{r} \le 1$. This allows us to use the inequality $a^x \le 1 - x(1-a)$ for $a \ge 0$ and $0 \le x \le 1$. Let $a = e^{-\eta r}$:
$$e^{-\eta(C^t w^{t-1})_i} = e^{-\eta l}\,\big(e^{-\eta r}\big)^{\frac{(C^t w^{t-1})_i - l}{r}} \le e^{-\eta l}\Big(1 - \frac{(C^t w^{t-1})_i - l}{r}\,(1 - e^{-\eta r})\Big).$$
Using this inequality we obtain:
$$\log\Big(\sum_i w^{t-1}_i\, e^{-\eta(C^t w^{t-1})_i}\Big) \le -\eta l + \log\Big(1 - \frac{(w^{t-1})^\top C^t w^{t-1} - l}{r}\,(1 - e^{-\eta r})\Big).$$
This gives us $F \le G$, with $G$ given as:
$$G = a\,(w^{t-1})^\top C^t w^{t-1} - b\, u^\top C^t u + \eta\, u^\top C^t w^{t-1} - \eta l + \log\Big(1 - \frac{(w^{t-1})^\top C^t w^{t-1} - l}{r}\,(1 - e^{-\eta r})\Big).$$
It is sufficient to show that $G \le 0$. Let $z = \sqrt{C^t}\, u$. Then $G(z)$ becomes:
$$G(z) = -b\, z^\top z + \eta\, z^\top\sqrt{C^t}\, w^{t-1} + \text{constant}.$$
The function $G(z)$ is concave quadratic and is maximized at:
$$\frac{\partial G}{\partial z} = -2bz + \eta\sqrt{C^t}\, w^{t-1} = 0, \qquad z = \frac{\eta}{2b}\sqrt{C^t}\, w^{t-1}.$$
We substitute this value of $z$ into $G$ and get $G \le H$, where $H$ is given by:
$$H = a\,(w^{t-1})^\top C^t w^{t-1} + \frac{\eta^2}{4b}(w^{t-1})^\top C^t w^{t-1} - \eta l + \log\Big(1 - \frac{(w^{t-1})^\top C^t w^{t-1} - l}{r}\,(1 - e^{-\eta r})\Big).$$
Since $l \le (C^t w^{t-1})_i \le l + r$, then obviously so is $(w^{t-1})^\top C^t w^{t-1}$, since the weighted average stays within the bounds. Now we can use the inequality $\log(1 - p(1 - e^{q})) \le pq + \frac{q^2}{8}$, for $0 \le p \le 1$ and $q \in \mathbb{R}$ (Helmbold et al, 1997):
$$\log\Big(1 - \frac{(w^{t-1})^\top C^t w^{t-1} - l}{r}\,(1 - e^{-\eta r})\Big) \le -\eta\,(w^{t-1})^\top C^t w^{t-1} + \eta l + \frac{\eta^2 r^2}{8}.$$
We get $H \le S$, where $S$ is given as:
$$S = a\,(w^{t-1})^\top C^t w^{t-1} + \frac{\eta^2}{4b}(w^{t-1})^\top C^t w^{t-1} - \eta\,(w^{t-1})^\top C^t w^{t-1} + \frac{\eta^2 r^2}{8} = \frac{(w^{t-1})^\top C^t w^{t-1}}{4b}\,\big(4ab + \eta^2 - 4b\eta\big) + \frac{\eta^2 r^2}{8}.$$
By our assumptions $(w^{t-1})^\top C^t w^{t-1} \le \frac{r}{2}$, and therefore:
$$S \le Q = \eta^2\Big(\frac{r^2}{8} + \frac{r}{8b}\Big) - \frac{\eta r}{2} + \frac{a r}{2}.$$
We want to make this expression as small as possible, so that it stays below zero. To do so we minimize it over $\eta$:
$$2\eta\Big(\frac{r^2}{8} + \frac{r}{8b}\Big) - \frac{r}{2} = 0, \qquad \eta = \frac{2b}{rb + 1}.$$
Finally we substitute this value of $\eta$ into $Q$ and obtain the condition on $a$ that guarantees $Q \le 0$:
$$a \le \frac{b}{rb + 1}.$$
This concludes the proof.
The following intermediate lemma gives a bound in terms of a tradeoff parameter $c$. Note that the learning rate depends on this parameter.

Lemma 2.2 Let $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$ as before. Then for arbitrary positive $c$ and learning rate $\eta = \frac{2c}{r(c+1)}$, the following bound holds:
$$L_{alg} \le (1 + c)\, L_u + \Big(1 + \frac{1}{c}\Big)\, r\, d(u, w^0).$$

Proof Let $b = \frac{c}{r}$; then for $a = \frac{b}{rb+1} = \frac{c}{r(c+1)}$ and $\eta = 2a = \frac{2c}{r(c+1)}$, we can use the inequality of Lemma 2.1 and obtain:
$$\frac{c}{c+1}\,(w^{t-1})^\top C^t w^{t-1} - c\, u^\top C^t u \le r\,\big(d(u, w^{t-1}) - d(u, w^t)\big).$$
Summing over the trials $t$ results in:
$$\frac{c}{c+1}\, L_{alg} - c\, L_u \le r\,\big(d(u, w^0) - d(u, w^T)\big) \le r\, d(u, w^0).$$
Now the statement of the lemma immediately follows.
The following theorem describes how to choose the learning rate for the purpose of minimizing the upper bound:

Theorem 2.1 Let $C^1, \ldots, C^T$ be an arbitrary sequence of covariance matrices such that $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$ and assume that $u^\top\big(\sum_{t=1}^T C^t\big) u \le L$. Then the Exponentiated Gradient algorithm with uniform start vector $w^0 = (\frac{1}{n}, \ldots, \frac{1}{n})$ and learning rate $\eta = \frac{2\sqrt{\log n}}{r\sqrt{\log n} + \sqrt{rL}}$ has the following bound:
$$L_{alg} \le L_u + 2\sqrt{rL\log n} + r\log n.$$

Proof By Lemma 2.2 and since $d(u, w^0) \le \log n$:
$$L_{alg} \le L_u + cL + \frac{r\log n}{c} + r\log n.$$
By differentiating we see that $c = \sqrt{\frac{r\log n}{L}}$ minimizes the r.h.s., and substituting this choice of $c$ gives the bound of the theorem.
2.5 Proof of Bound for Variance-Loss Tradeoff

In each trial the algorithm now receives both a vector of losses $l^t$ and a covariance matrix $C^t$. The loss of probability vector $w^{t-1}$ becomes an additive tradeoff between first-order and second-order components, i.e. we want to make our loss small, but we want to control the variance as well ($\gamma \ge 0$ is a tradeoff parameter - the bigger it is, the less attention we pay to the variance):
$$L^{t,\gamma}(w^{t-1}) = \gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1}. \qquad (2.4)$$
Similarly, we define $L^\gamma_{alg}$ and $L^\gamma_u$ by summing this loss over a sequence of trials. Using the EG strategy for this loss results in the following update:
$$w^t_i = \frac{w^{t-1}_i\, e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)}}{\sum_{j=1}^n w^{t-1}_j\, e^{-\eta(\gamma l^t_j + (C^t w^{t-1})_j)}}. \qquad (2.5)$$
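A one-step sketch of update (2.5) follows; it is illustrative only (the loss vector, covariance matrix, learning rate and tradeoff parameter are all placeholders) and is meant to show how the tradeoff parameter enters the exponent.

```python
import numpy as np

def eg_tradeoff_update(w, l, C, eta, gamma):
    """EG step for the tradeoff loss (2.4): exponent is -eta * (gamma * l_i + (C w)_i)."""
    g = gamma * l + C @ w                      # gradient of the tradeoff loss at w
    w_new = w * np.exp(-eta * (g - g.min()))   # the shift does not change the normalized update
    return w_new / w_new.sum()

w = np.full(4, 0.25)                           # placeholder data for a single trial
l = np.array([0.2, 0.8, 0.1, 0.5])
C = np.diag([1.0, 0.5, 2.0, 0.3])
print(eg_tradeoff_update(w, l, C, eta=0.4, gamma=2.0))
```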
The proof follows the same outline as before. Here we give the main necessary lemma.

Lemma 2.3 Let $w^t$ be the weight vector of the Exponentiated Gradient algorithm computed at trial $t$ and let $u$ be an arbitrary probability vector that serves as a comparator. Also, let $r$ be the bound on the range of elements of covariance matrix $C^t$, that is $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$, and let the loss vector $l^t$ be bounded as: $l^t_i \in [0, q]$. For $\gamma \ge 0$, any constants $a$ and $b$ such that $a \le \frac{b(r-2\gamma q)^2}{(r + b(r+\gamma q)^2)(r+2\gamma q)}$ and learning rate $\eta = \frac{2b(r-2\gamma q)}{r + b(r+\gamma q)^2}$ the following inequality holds:
$$a\big(\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1}\big) - b\big(\gamma\, u\cdot l^t + u^\top C^t u\big) \le d(u, w^{t-1}) - d(u, w^t).$$
Proof We begin by analyzing progress towards the comparison vector $u$:
$$d(u, w^{t-1}) - d(u, w^t) = \sum_i u_i\log w^t_i - \sum_i u_i\log w^{t-1}_i$$
$$\overset{(2.5)}{=} \sum_i u_i\log\frac{w^{t-1}_i\, e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)}}{\sum_j w^{t-1}_j\, e^{-\eta(\gamma l^t_j + (C^t w^{t-1})_j)}} - \sum_i u_i\log w^{t-1}_i$$
$$= -\eta\sum_i u_i\big(\gamma l^t_i + (C^t w^{t-1})_i\big) - \log\Big(\sum_i w^{t-1}_i\, e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)}\Big).$$
Thus, our bound is equivalent to showing $F \le 0$ with $F$ given as:
$$F = a\big(\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1}\big) - b\big(\gamma\, u\cdot l^t + u^\top C^t u\big) + \eta\sum_i u_i\big(\gamma l^t_i + (C^t w^{t-1})_i\big) + \log\Big(\sum_i w^{t-1}_i\, e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)}\Big).$$
We proceed by bounding the log term. As before, let $p$ be a lower bound on $(C^t w^{t-1})_i$. Now, from the assumption on the range of elements in $C^t$ we get that $p \le (C^t w^{t-1})_i \le p + r$. Adding the $\gamma l^t_i$ term and using the assumption on the range of losses we get: $p \le \gamma l^t_i + (C^t w^{t-1})_i \le p + r + \gamma q$, or equivalently:
$$0 \le \frac{\gamma l^t_i + (C^t w^{t-1})_i - p}{r + \gamma q} \le 1.$$
Now we apply the inequality $a^x \le 1 - x(1-a)$ for $a \ge 0$ and $0 \le x \le 1$. Let $a = e^{-\eta(r+\gamma q)}$:
$$e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)} = e^{-\eta p}\,\big(e^{-\eta(r+\gamma q)}\big)^{\frac{\gamma l^t_i + (C^t w^{t-1})_i - p}{r+\gamma q}} \le e^{-\eta p}\Big(1 - \frac{\gamma l^t_i + (C^t w^{t-1})_i - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big).$$
Using this inequality we obtain:
$$\log\Big(\sum_i w^{t-1}_i\, e^{-\eta(\gamma l^t_i + (C^t w^{t-1})_i)}\Big) \le -\eta p + \log\Big(1 - \frac{\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1} - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big).$$
This gives us $F \le G$ with $G$ given as:
$$G = a\big(\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1}\big) - b\big(\gamma\, u\cdot l^t + u^\top C^t u\big) + \eta\sum_i u_i\big(\gamma l^t_i + (C^t w^{t-1})_i\big) - \eta p + \log\Big(1 - \frac{\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1} - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big).$$
We will split $G$ into two parts $G_1 + G_2$ as follows:
$$G_1 = a\,(w^{t-1})^\top C^t w^{t-1} - b\, u^\top C^t u + \eta\, u^\top C^t w^{t-1} - \eta p + \log\Big(1 - \frac{\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1} - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big),$$
$$G_2 = a\gamma\, w^{t-1}\cdot l^t - b\gamma\, u\cdot l^t + \eta\gamma\, u\cdot l^t.$$
For $G_1$ we use the variable substitution $z = \sqrt{C^t}\, u$:
$$G_1(z) = -b\, z^\top z + \eta\, z^\top\sqrt{C^t}\, w^{t-1} + \text{constant}.$$
The function $G_1(z)$ is concave quadratic and is maximized at:
$$\frac{\partial G_1}{\partial z} = -2bz + \eta\sqrt{C^t}\, w^{t-1} = 0, \qquad z = \frac{\eta}{2b}\sqrt{C^t}\, w^{t-1}.$$
We substitute this value of $z$ into $G_1$ and get $G_1 \le H$, where $H$ is given by:
$$H = a\,(w^{t-1})^\top C^t w^{t-1} + \frac{\eta^2}{4b}(w^{t-1})^\top C^t w^{t-1} - \eta p + \log\Big(1 - \frac{\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1} - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big).$$
Since $p \le \gamma l^t_i + (C^t w^{t-1})_i \le p + r + \gamma q$, then so is $\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1}$, since a convex combination stays within the bounds. We now apply the inequality $\log(1 - x(1 - e^{y})) \le xy + \frac{y^2}{8}$ for $0 \le x \le 1$ and $y \in \mathbb{R}$ and obtain:
$$\log\Big(1 - \frac{\gamma\, w^{t-1}\cdot l^t + (w^{t-1})^\top C^t w^{t-1} - p}{r+\gamma q}\,\big(1 - e^{-\eta(r+\gamma q)}\big)\Big) \le -\eta\gamma\, w^{t-1}\cdot l^t - \eta\,(w^{t-1})^\top C^t w^{t-1} + \eta p + \frac{\eta^2(r+\gamma q)^2}{8}.$$
Substituting and reshuffling the terms a little bit we obtain $G = G_1 + G_2 \le S_1 + S_2$, where
$$S_1 = \frac{(w^{t-1})^\top C^t w^{t-1}}{4b}\,\big(\eta^2 + 4ab - 4b\eta\big) + \frac{\eta^2(r+\gamma q)^2}{8},$$
$$S_2 = a\gamma\, w^{t-1}\cdot l^t - \eta\gamma\, w^{t-1}\cdot l^t - b\gamma\, u\cdot l^t + \eta\gamma\, u\cdot l^t.$$
By our assumptions $(w^{t-1})^\top C^t w^{t-1} \le \frac{r}{2}$ and we use this to upper bound $S_1$. For $S_2$ we drop the negative terms $-\eta\gamma\, w^{t-1}\cdot l^t - b\gamma\, u\cdot l^t$ and use the loss upper bounds $w^{t-1}\cdot l^t \le q$ and $u\cdot l^t \le q$. Overall we get $G \le Q$ with $Q$ given as:
$$Q = \eta^2\Big(\frac{r}{8b} + \frac{(r+\gamma q)^2}{8}\Big) - \eta\Big(\frac{r}{2} - \gamma q\Big) + \frac{a r}{2} + a\gamma q.$$
We want to make $Q$ as small as possible because we need the condition $Q \le 0$, which assures $F \le 0$. Minimizing over $\eta$ gives us:
$$2\eta\Big(\frac{r}{8b} + \frac{(r+\gamma q)^2}{8}\Big) - \frac{r}{2} + \gamma q = 0, \qquad \eta = \frac{2rb - 4\gamma q b}{r + b(r+\gamma q)^2}.$$
We substitute this value of $\eta$ into $Q$:
$$Q = \frac{\big(2b(r - 2\gamma q)\big)^2}{\big(r + b(r+\gamma q)^2\big)^2}\cdot\frac{r + b(r+\gamma q)^2}{8b} - \frac{2b(r - 2\gamma q)}{r + b(r+\gamma q)^2}\cdot\frac{r - 2\gamma q}{2} + a\Big(\frac{r}{2} + \gamma q\Big)$$
$$= \frac{b(r - 2\gamma q)^2}{2\big(r + b(r+\gamma q)^2\big)} - \frac{2b(r - 2\gamma q)^2}{2\big(r + b(r+\gamma q)^2\big)} + a\Big(\frac{r}{2} + \gamma q\Big).$$
Now $Q \le 0$, whenever
$$a \le \frac{b(r - 2\gamma q)^2}{\big(r + b(r+\gamma q)^2\big)(r + 2\gamma q)}.$$
□
The following intermediate lemma helps in tuning the learning rate.

Lemma 2.4 Let $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$ and $l^t_i \in [0, q]$ as before. Let us use $R$ and $Q$ to denote the following quantities:
$$R = \frac{(r - 2\gamma q)^2}{r(r + 2\gamma q)} \qquad\text{and}\qquad Q = \frac{(r + \gamma q)^2}{r}.$$
Then for arbitrary positive $b$ and learning rate $\eta = \frac{2b(r - 2\gamma q)}{r + b(r+\gamma q)^2}$, the following bound holds:
$$L^\gamma_{alg} \le \Big(\frac{1}{R} + \frac{bQ}{R}\Big)\, L^\gamma_u + \frac{1}{bR}\, d(u, w^0) + \frac{Q}{R}\, d(u, w^0).$$
Proof We use the inequality from Lemma 2.3 and let $a$ equal its upper bound:
$$a = \frac{b(r - 2\gamma q)^2}{\big(r + b(r+\gamma q)^2\big)(r + 2\gamma q)} = \frac{b}{1 + \frac{b}{r}(r+\gamma q)^2}\cdot\frac{(r - 2\gamma q)^2}{r(r + 2\gamma q)} = \frac{bR}{1 + bQ}.$$
Summing the per-trial inequalities and dropping the negative term $-d(u, w^T)$ we get the bound:
$$\frac{bR}{1 + bQ}\, L^\gamma_{alg} \le b\, L^\gamma_u + d(u, w^0).$$
Rearranging the terms we get:
$$L^\gamma_{alg} \le \Big(\frac{1}{R} + \frac{bQ}{R}\Big)\, L^\gamma_u + \frac{1}{bR}\, d(u, w^0) + \frac{Q}{R}\, d(u, w^0).$$
□
Theorem 2.2 Let $C^1, \ldots, C^T$ be an arbitrary sequence of covariance matrices such that $\max_{i,j}|C^t(i,j)| \le \frac{r}{2}$ and $l^1, \ldots, l^T$ be an arbitrary sequence of loss vectors such that $l^t_i \in [0, q]$. Additionally assume that $L^\gamma_u = \gamma\, u\cdot\sum_{t=1}^T l^t + u^\top\big(\sum_{t=1}^T C^t\big) u \le L$. Then the Exponentiated Gradient algorithm with uniform start vector $w^0 = (\frac{1}{n}, \ldots, \frac{1}{n})$ and learning rate $\eta = \frac{2\sqrt{\log n/(QL)}\,(r - 2\gamma q)}{r + \sqrt{\log n/(QL)}\,(r + \gamma q)^2}$ has the following bound:
$$L^\gamma_{alg} \le L^\gamma_u + \frac{2}{R}\sqrt{QL\log n} + \frac{Q}{R}\log n + PL,$$
where $Q$ and $R$ are as in Lemma 2.4 and
$$P = \frac{2\gamma q(3r - 2\gamma q)}{(r - 2\gamma q)^2}.$$
Proof We start with the bound obtained in Lemma 2.4 and use our assumptions $L^\gamma_u \le L$ and $d(u, w^0) \le \log n$:
$$L^\gamma_{alg} \le \frac{1}{R}\, L^\gamma_u + \frac{bQ}{R}\, L + \frac{1}{bR}\log n + \frac{Q}{R}\log n.$$
We can now minimize the right hand side as a function of $b$. Setting the derivative to zero we get:
$$\frac{\log n}{R}\cdot\frac{1}{b^2} = \frac{QL}{R}, \qquad b = \sqrt{\frac{\log n}{QL}}.$$
Substituting this value of $b$ back into the bound, we get:
$$L^\gamma_{alg} \le \frac{1}{R}\, L^\gamma_u + \frac{2}{R}\sqrt{QL\log n} + \frac{Q}{R}\log n.$$
Finally we split the $\frac{1}{R}L^\gamma_u$ term using $L^\gamma_u \le L$ and
$$\frac{1}{R} - 1 = \frac{r^2 + 2\gamma qr - r^2 + 4\gamma qr - 4\gamma^2 q^2}{r^2 - 4\gamma qr + 4\gamma^2 q^2} = \frac{6\gamma qr - 4\gamma^2 q^2}{r^2 - 4\gamma qr + 4\gamma^2 q^2} = \frac{2\gamma q(3r - 2\gamma q)}{(r - 2\gamma q)^2} = P.$$
□
Note that when $\gamma = 0$, then the tradeoff loss (2.4) used in this section becomes the covariance loss used in the previous section, and the bound in the above theorem coincides with the bound of Theorem 2.1 obtained for the covariance loss. When $\gamma \to \infty$ then the covariance term in the tradeoff loss becomes negligible and after dividing by $\gamma$ the tradeoff loss becomes $w^{t-1}\cdot l^t$. Relative loss bounds for this commonly analyzed dot loss will be reviewed in Section 3.3. Unfortunately the known bounds for that loss are stronger than the bounds obtainable from the above theorem by letting $\gamma$ approach $\infty$. Thus for large $\gamma$ the loss bounds of this section are not competitive.
3 Variance Minimization over the Unit Sphere
3.1 Definitions

The trace $\mathrm{tr}(A)$ of a square matrix $A$ is the sum of its diagonal elements. It is invariant under a change of basis transformation and thus it is also equal to the sum of eigenvalues of the matrix. The trace generalizes the normal dot product between vectors to the space of matrices, i.e. $\mathrm{tr}(AB) = \mathrm{tr}(BA) = \sum_{i,j} A(i,j)B(i,j)$. The trace is also a linear operator, that is $\mathrm{tr}(aA + bB) = a\,\mathrm{tr}(A) + b\,\mathrm{tr}(B)$. Another useful property of the trace is its invariance wrt cycling, i.e. $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$. A particular instance of this is the following manipulation: $u^\top A u = \mathrm{tr}(u^\top A u) = \mathrm{tr}(A u u^\top)$.

Dyads have trace one because $\mathrm{tr}(u u^\top) = u^\top u = 1$. We generalize mixtures or probability vectors to density matrices. Such matrices are mixtures of any number of dyads, i.e. $W = \sum_i \omega_i\, u_i u_i^\top$, where $\omega_i \ge 0$ and $\sum_i \omega_i = 1$. Equivalently, density matrices are arbitrary symmetric positive definite matrices of trace one. Any density matrix $W$ can be decomposed into a sum of exactly $n$ dyads corresponding to the orthogonal set of its eigenvectors $w_i$, i.e. $W = \sum_{i=1}^n \omega_i\, w_i w_i^\top$, where the vector of the $n$ eigenvalues is a probability vector. In quantum physics density matrices over the field of complex numbers (instead of the reals) represent the mixed state of a physical system.
We also need the matrix generalizations of the exponential and logarithm operations. Given the eigendecomposition of a symmetric matrix $A = \sum_i \lambda_i\, a_i a_i^\top$, the matrix exponential and logarithm denoted as exp and log are computed as follows:
$$\exp(A) = \sum_i e^{\lambda_i}\, a_i a_i^\top, \qquad \log(A) = \sum_i (\log\lambda_i)\, a_i a_i^\top.$$
In other words, the exponential and the logarithm are applied to the eigenvalues and the eigenvectors remain unchanged. Obviously, the matrix logarithm is only defined when the matrix is strictly positive definite. In analogy with the exponential for numbers, one would expect the following equality to hold for symmetric matrices $A$ and $B$: $\exp(A + B) = \exp(A)\exp(B)$. However this is only true when the symmetric matrices $A$ and $B$ commute, i.e. $AB = BA$, which occurs iff both matrices share the same eigensystem. On the other hand, the following trace inequality, called the Golden-Thompson inequality, holds for arbitrary symmetric matrices (Bhatia, 1997):
$$\mathrm{tr}(\exp(A + B)) \le \mathrm{tr}(\exp(A)\exp(B)). \qquad (3.1)$$
The following quantum relative entropy is a generalization of the classical relative entropy (2.1) to density matrices due to Umegaki (see e.g. Nielsen and Chuang (2000)): For two density matrices $U$ and $W$,
$$\Delta(U, W) := \mathrm{tr}\big(U(\log U - \log W)\big).$$
We will also use generalized inequalities for the cone of positive definite matrices: For two symmetric matrices $A$ and $B$, $A \preceq B$ if $B - A$ is positive definite.
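All of these matrix operations reduce to ordinary scalar operations on the eigenvalues. The sketch below (illustrative only; the matrices are randomly generated) computes the matrix exponential and logarithm from an eigendecomposition, evaluates the quantum relative entropy, and checks the Golden-Thompson inequality (3.1) on one random pair of symmetric matrices.

```python
import numpy as np

def sym_fun(A, f):
    """Apply a scalar function f to the eigenvalues of a symmetric matrix A."""
    lam, V = np.linalg.eigh(A)
    return (V * f(lam)) @ V.T

def mexp(A):
    return sym_fun(A, np.exp)

def mlog(A):
    return sym_fun(A, np.log)        # only defined for strictly positive definite A

def quantum_rel_entropy(U, W):
    """Delta(U, W) = tr(U (log U - log W)) for density matrices U and W."""
    return np.trace(U @ (mlog(U) - mlog(W)))

rng = np.random.default_rng(2)

def rand_sym(n):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

def rand_density(n):
    M = rng.normal(size=(n, n))
    W = M @ M.T + 1e-3 * np.eye(n)   # strictly positive definite
    return W / np.trace(W)           # trace one

A, B = rand_sym(3), rand_sym(3)
print(np.trace(mexp(A + B)) <= np.trace(mexp(A) @ mexp(B)) + 1e-10)   # Golden-Thompson: True

U, W = rand_density(3), rand_density(3)
print(quantum_rel_entropy(U, W) >= -1e-10)                            # divergence is nonnegative
```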
Fig. 3.1 The figure depicts a sequence of updates for the density matrix algorithm when the dimension is 2 with $\eta = 1$. All 2-by-2 matrices are represented as ellipses. The top row shows the covariance matrices $C^t$ received in that trial. The middle row shows the density matrices $W^t$ chosen by the algorithm. Finally, the bottom row is the average of all covariance matrices so far, i.e. $C^{\le t}/t$, where $C^{\le t} = \sum_{q=1}^t C^q$. By the update (3.3), $W^t = \exp(-\eta C^{\le t})/Z^t$, where $Z^t$ is a normalization. Therefore, $C^{\le t}/t$ in the third row has the same eigensystem as the density matrix $W^t$ in the middle row. Note the tendency of the algorithm to try to place more weight on the minimal eigenvalue of the covariance average. Since the algorithm is not sure about the future, it does not place the full weight onto that eigenvalue but hedges its bets instead and places some weight onto the other eigenvalues as well. This can be seen as a soft minimum calculation and as $\eta \to \infty$ it becomes the normal minimum.
3.2 Algorithm and Motivation
As before we briefly review our setup. On each trial $t$ our algorithm chooses a density matrix $W^{t-1}$ described as a mixture $\sum_i \omega^{t-1}_i\, w^{t-1}_i (w^{t-1}_i)^\top$. It then receives a covariance matrix $C^t$ and incurs a loss equal to the expected variance of its mixture:
$$\mathrm{tr}(W^{t-1}C^t) = \mathrm{tr}\Big(\Big(\sum_i \omega^{t-1}_i\, w^{t-1}_i (w^{t-1}_i)^\top\Big)\, C^t\Big) = \sum_i \omega^{t-1}_i\, (w^{t-1}_i)^\top C^t w^{t-1}_i.$$
On a sequence of $T$ trials the total loss of the algorithm will be $L_{alg} = \sum_{t=1}^T \mathrm{tr}(W^{t-1}C^t)$. We want this loss to be not too much larger than the total variance of the best unit vector $u$ chosen in hindsight, i.e. $L_u = \mathrm{tr}\big(u u^\top\sum_t C^t\big) = u^\top\big(\sum_t C^t\big) u$. The set of dyads is not a convex set. We therefore close it by using convex combinations of dyads (i.e. density matrices) as our parameter space. The best offline parameter is still a single dyad, i.e. for any symmetric matrix $C$,
$$\min_{\mathrm{tr}(U)=1}\mathrm{tr}(UC) = \min_{u:\ \|u\|_2 = 1} u^\top C u.$$
Curiously enough, our loss $\mathrm{tr}(WC)$ has the interpretation in quantum mechanics as the expected outcome of measuring a physical system in mixture state $W$ with instrument $C$. Let $C$ be decomposed as $\sum_i \lambda_i\, c_i c_i^\top$. The eigenvalues $\lambda_i$ are the possible numerical outcomes of the measurement. When measuring a pure state specified by unit vector $u$, the probability of obtaining outcome $\lambda_i$ is given as $(u\cdot c_i)^2$ and the expected outcome is $\mathrm{tr}(u u^\top C) = \sum_i (u\cdot c_i)^2\lambda_i$. For a mixed state $W$ we have the following double expectation:
$$\mathrm{tr}(WC) = \mathrm{tr}\Big(\Big(\sum_i \omega_i\, w_i w_i^\top\Big)\Big(\sum_j \lambda_j\, c_j c_j^\top\Big)\Big) = \sum_{i,j}(w_i\cdot c_j)^2\,\omega_i\lambda_j,$$
where the matrix of measurement probabilities $(w_i\cdot c_j)^2$ is a doubly stochastic matrix. Note also, that for the measurement interpretation, the matrix $C$ does not have to be positive definite, but only symmetric. The algorithm and the proof of bounds in fact work fine for this case, as long as we have some assumption on the range of eigenvalues of $C$, but the meaning of the algorithm when $C$ is not a covariance matrix is less clear, since despite all these connections to quantum physics our algorithm does not seem to have the obvious quantum-mechanical interpretation. Our update clearly is not a unitary evolution of the mixture state and a measurement does not cause a collapse of the state as is the case in quantum physics. The question of whether this type of algorithm is still doing something quantum-mechanically meaningful remains intriguing. See also Warmuth (2005); Warmuth and Kuzmin (2006a) for additional discussion.
To derive our algorithm we use the trace expression for expected variance as our loss and replace the relative entropy with its matrix generalization. The following optimization problem produces the update:
$$W^t = \arg\min_{\mathrm{tr}(W)=1}\ \Delta(W, W^{t-1}) + \eta\,\mathrm{tr}(WC^t). \qquad (3.2)$$
Using a Lagrangian that enforces the trace constraint (Tsuda et al, 2005), it is easy to solve this constrained minimization problem:
$$W^t = \frac{\exp(\log W^{t-1} - \eta C^t)}{\mathrm{tr}\big(\exp(\log W^{t-1} - \eta C^t)\big)} = \frac{\exp(-\eta C^{\le t})}{\mathrm{tr}\big(\exp(-\eta C^{\le t})\big)}. \qquad (3.3)$$
Note that the updated matrix $W^t$ is again symmetric and positive definite. In particular, the quantum relative entropy function serves as a barrier for the implicit non-negativity constraints $W \succeq 0$. For the second form of the update above, we unraveled the first, assumed that $W^0 = \frac{1}{n} I$ and used the shorthand $C^{\le t} = \sum_{q=1}^{t} C^q$. The above update is a special case of the Matrix Exponentiated Gradient update of Tsuda et al (2005) with the linear loss $\mathrm{tr}(WC^t)$. Recall that when we minimized the variance over the probability simplex, there was a difference between the implicit and explicit version of the Exponentiated Gradient update (Section 2.3). However since the loss $\mathrm{tr}(WC^t)$ is now linear in the parameter $W$, the implicit and explicit updates coincide. Also the following batch motivation produces the same update:
$$W^t = \arg\min_{W\ \mathrm{s.t.}\ \mathrm{tr}(W)=1}\ \Delta(W, W^0) + \eta\sum_{q=1}^{t}\mathrm{tr}(WC^q). \qquad (3.4)$$
When inequality constraints are added to the arg min, then (3.3) and (3.4) may have different solutions (see Kuzmin and Warmuth (2007) for a discussion).
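As an illustration of the resulting algorithm (not code from the paper; the covariance matrices are random placeholders and the learning rate is an arbitrary constant rather than the tuned value of Section 3.4), a minimal sketch of the density matrix update (3.3) over a sequence of trials might look as follows.

```python
import numpy as np

def density_matrix_update(C_sum, eta):
    """Update (3.3): W = exp(-eta * C_sum) / tr(exp(-eta * C_sum)), via an eigendecomposition."""
    lam, V = np.linalg.eigh(-eta * C_sum)
    lam -= lam.max()                       # shifting the exponent leaves the normalized W unchanged
    W = (V * np.exp(lam)) @ V.T
    return W / np.trace(W)

rng = np.random.default_rng(3)
n, T, eta = 4, 30, 1.0
W = np.eye(n) / n                          # uniform start density matrix W^0 = I/n
C_sum = np.zeros((n, n))
loss_alg = 0.0
for _ in range(T):
    A = rng.normal(size=(n, n))
    C = A @ A.T / n                        # placeholder covariance matrix for this trial
    loss_alg += np.trace(W @ C)            # expected variance tr(W^{t-1} C^t)
    C_sum += C
    W = density_matrix_update(C_sum, eta)

best_offline = np.linalg.eigvalsh(C_sum)[0]   # min_u u^T (sum_t C^t) u = smallest eigenvalue
print(loss_alg, best_offline)                 # compare the online loss with the offline optimum
```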
3.3 Bregman Divergence Based Proof Methodology for Expert Setting

For the sake of clarity, we begin by reproving the worst-case loss bound for the Hedge algorithm since the later proof for the density matrix generalization will follow the same outline. In doing so we also clarify the dependence of the algorithm on the range of the losses. The update of that algorithm is given by:^2
$$w^t_i = \frac{w^{t-1}_i\, e^{-\eta l^t_i}}{\sum_j w^{t-1}_j\, e^{-\eta l^t_j}}. \qquad (3.5)$$
This type of proof (Kivinen and Warmuth, 1997) always starts by considering the progress made during the update towards any probability vector $u$ that we compare against, where the progress is measured in terms of the motivating divergence for the algorithm (here the relative entropy):
$$d(u, w^{t-1}) - d(u, w^t) = \sum_i u_i\log\frac{w^t_i}{w^{t-1}_i} = -\eta\, u\cdot l^t - \log\sum_i w^{t-1}_i\, e^{-\eta l^t_i}.$$
We now lower bound the $-\log\sum_i w^{t-1}_i e^{-\eta l^t_i}$ term (sometimes called the potential) as in Littlestone and Warmuth (1994), Lemma 5.2. This is done by assuming that $l^t_i \in [0, r]$, for $r > 0$, and applying the inequality $e^{-\eta x} \le 1 - (1 - e^{-\eta r})\frac{x}{r}$, for $x \in [0, r]$ and $\eta \in \mathbb{R}$:
$$d(u, w^{t-1}) - d(u, w^t) \ge -\eta\, u\cdot l^t - \log\Big(1 - \frac{w^{t-1}\cdot l^t}{r}\,(1 - e^{-\eta r})\Big).$$
We now apply $\log(1 - x) \le -x$ and obtain:
$$d(u, w^{t-1}) - d(u, w^t) \ge -\eta\, u\cdot l^t + \frac{w^{t-1}\cdot l^t}{r}\,(1 - e^{-\eta r}).$$
Finally we rewrite the above to
$$w^{t-1}\cdot l^t \le \frac{r\big(d(u, w^{t-1}) - d(u, w^t)\big) + \eta r\, u\cdot l^t}{1 - e^{-\eta r}}.$$
Here $w^{t-1}\cdot l^t$ is the loss of the algorithm at trial $t$ and $u\cdot l^t$ is the loss of the probability vector $u$ which serves as a comparator.

So far we assumed that $l^t_i \in [0, r]$. However, it suffices to assume that $\max_i l^t_i - \min_i l^t_i \le r$. In other words, the individual losses can be positive or negative, as long as their range is bounded by $r$. See Cesa-Bianchi et al (2005) for further discussion regarding the issues that arise when losses have different signs. As we shall observe below, the requirement on the range of losses will become a requirement on the range of eigenvalues of the covariance matrices.

Define $\tilde l^t_i := l^t_i - \min_i l^t_i$. The update remains unchanged when the shifted losses $\tilde l^t_i$ are used in place of the original losses $l^t_i$ and we immediately get the inequality
$$w^{t-1}\cdot\tilde l^t \le \frac{r\big(d(u, w^{t-1}) - d(u, w^t)\big) + \eta r\, u\cdot\tilde l^t}{1 - e^{-\eta r}}.$$
^2 In the earlier Continuous Weighted Majority algorithm the loss $l^t_i$ is the absolute loss $|y^t - x^t_i|$ between the label $y^t$ and the prediction of the $i$th expert $x^t_i$.
Summing over $t$ results in a bound that holds for any probability vector $u$:
$$\sum_{t=1}^T w^{t-1}\cdot\tilde l^t \le \frac{r\big(d(u, w^0) - d(u, w^T)\big) + \eta r\, u\cdot\tilde l^{\le T}}{1 - e^{-\eta r}},$$
where $\tilde l^{\le T}$ is shorthand for $\sum_{q=1}^T \tilde l^q$. We can now simplify the bound by dropping the term $d(u, w^T) \ge 0$ and tune the learning rate following Freund and Schapire (1997): if $\sum_t u\cdot\tilde l^t \le \tilde L$ and $d(u, w^0) \le D \le \ln n$, then with $\eta = \frac{\log(1 + \sqrt{2D/\tilde L})}{r}$ we get the inequality
$$\sum_t w^{t-1}\cdot\tilde l^t \le \sum_t u\cdot\tilde l^t + \sqrt{2r\tilde L D} + r\, d(u, w^0).$$
By adding $\sum_{t=1}^T \min_i l^t_i$ to both sides we obtain the bound
$$\underbrace{\sum_t w^{t-1}\cdot l^t}_{L_{alg}} \le \underbrace{\sum_t u\cdot l^t}_{L_u} + \sqrt{2r\tilde L D} + r\, d(u, w^0).$$
Note that $\tilde L$ is defined wrt the tilde versions of the losses and the update as well as the above bound is invariant under shifting the loss vectors $l^t$ by arbitrary constants. If the loss vectors $l^t$ are replaced by gain vectors, then the minus sign in the exponent of the update becomes a plus sign. In this case the inequality above is reversed and the last two terms are subtracted instead of added.
3.4 Relative Loss Bounds for Matrix Setting
In addition to the Golden-Thompson inequality we will need Lemmas 2.1 and 2.2 from Tsuda et al (2005):

Lemma 3.1 For any symmetric $A$, such that $0 \preceq A \preceq I$, and any $\rho_1, \rho_2 \in \mathbb{R}$ the following holds:
$$\exp(A\rho_1 + (I - A)\rho_2) \preceq A e^{\rho_1} + (I - A)e^{\rho_2}.$$

Lemma 3.2 For any positive semidefinite $A$ and any symmetric $B$, $C$, $B \preceq C$ implies $\mathrm{tr}(AB) \le \mathrm{tr}(AC)$.

We are now ready to generalize the Hedge bound to matrices:

Theorem 3.1 For any sequence of covariance matrices $C^1, \ldots, C^T$ such that $0 \preceq C^t \preceq rI$ and for any learning rate $\eta$, the following bound holds for arbitrary density matrix $U$:
$$\sum_{t=1}^T \mathrm{tr}(W^{t-1}C^t) \le \frac{r\big(\Delta(U, W^0) - \Delta(U, W^T)\big) + \eta r\,\mathrm{tr}(U C^{\le T})}{1 - e^{-\eta r}}.$$
Proof We start by analyzing the progress made towards the comparison density matrix $U$ in terms of the quantum relative entropy:
$$\Delta(U, W^{t-1}) - \Delta(U, W^t) = \mathrm{tr}\big(U(-\log W^{t-1} + \log W^t)\big).$$
By plugging in the r.h.s. form of update (3.3) we get
$$\Delta(U, W^{t-1}) - \Delta(U, W^t) = -\eta\,\mathrm{tr}(UC^t) + P^t - P^{t-1}, \qquad (3.6)$$
where $P^t$ is called the potential and is defined as $-\log\mathrm{tr}\big(\exp(-\eta C^{\le t})\big)$. We first rewrite the drop of the potential and then apply the Golden-Thompson inequality (3.1):
$$P^t - P^{t-1} = -\log\mathrm{tr}\big(\exp(\log W^{t-1} - \eta C^t)\big) \ge -\log\mathrm{tr}\big(W^{t-1}\exp(-\eta C^t)\big). \qquad (3.7)$$
Since $0 \preceq \frac{C^t}{r} \preceq I$, we can use Lemma 3.1 with $\rho_1 = -\eta r$, $\rho_2 = 0$:
$$\exp(-\eta C^t) \preceq I - \frac{C^t}{r}\,(1 - e^{-\eta r}).$$
Now multiply both sides on the left with $W^{t-1}$ and take a trace. The inequality is preserved according to Lemma 3.2:
$$\mathrm{tr}\big(W^{t-1}\exp(-\eta C^t)\big) \le 1 - \frac{\mathrm{tr}(W^{t-1}C^t)}{r}\,(1 - e^{-\eta r}). \qquad (3.8)$$
Taking logs of both sides we have:
$$\log\mathrm{tr}\big(W^{t-1}\exp(-\eta C^t)\big) \le \log\Big(1 - \frac{\mathrm{tr}(W^{t-1}C^t)}{r}\,(1 - e^{-\eta r})\Big). \qquad (3.9)$$
To bound the log expression on the right we use the inequality $\log(1 - x) \le -x$:
$$\log\Big(1 - \frac{\mathrm{tr}(W^{t-1}C^t)}{r}\,(1 - e^{-\eta r})\Big) \le -\frac{\mathrm{tr}(W^{t-1}C^t)}{r}\,(1 - e^{-\eta r}). \qquad (3.10)$$
By combining inequalities (3.7)-(3.10), we obtain the following bound on the drop of the potential:
$$P^t - P^{t-1} \ge \frac{\mathrm{tr}(W^{t-1}C^t)}{r}\,(1 - e^{-\eta r}). \qquad (3.11)$$
Plugging this into equation (3.6) we obtain
$$r\big(\Delta(U, W^{t-1}) - \Delta(U, W^t)\big) + \eta r\,\mathrm{tr}(UC^t) \ge \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta r}).$$
Summing over trials and rearranging gives the inequality of the theorem.
Note that our density matrix update (3.3) is invariant wrt the variable change $\tilde C^t = C^t - \lambda_{\min}(C^t)\, I$. Therefore by the above theorem, the following inequality holds whenever $\lambda_{\max}(C^t) - \lambda_{\min}(C^t) \le r$:
$$\mathrm{tr}(W^{t-1}\tilde C^t) \le \frac{r\big(\Delta(U, W^{t-1}) - \Delta(U, W^t)\big) + \eta r\,\mathrm{tr}(U\tilde C^t)}{1 - e^{-\eta r}}.$$
We can now drop the $\Delta(U, W^T) \ge 0$ term from the bound of the theorem and tune the learning rate as done at the end of Section 3.3. If $\sum_t \mathrm{tr}(U\tilde C^t) \le \tilde L$ and $\Delta(U, W^0) \le D$, then with $\eta = \frac{\log(1 + \sqrt{2D/\tilde L})}{r}$ and by adding $\sum_{t=1}^T \lambda_{\min}(C^t)$ to both sides of the inequality we get the bound:
$$\underbrace{\sum_{t=1}^T \mathrm{tr}(W^{t-1}C^t)}_{L_{alg}} \le \underbrace{\mathrm{tr}(U C^{\le T})}_{L_U} + \sqrt{2r\tilde L D} + r\,\Delta(U, W^0).$$
The bound of the above theorem can be proven in a number of different ways. For the sake of completeness we sketch a number of these alternate proof methods based on potentials and Bregman projections in Appendix A. We also show in Appendix B that the same bound holds if a range of matrix factors are used in the update.
4 Conclusions
We presented two algorithms for online variance minimization problems. For the first problem, the variance was measured along a probability vector. It would be interesting to combine this work with the online algorithms considered in Helmbold et al (1998); Cover (1991) that maximize the return of the portfolio. Note that it is easy to extend the portfolio vector to maintain short positions: simply keep two weights $w^+_i$ and $w^-_i$ per component as is done in the EG± algorithm of Kivinen and Warmuth (1997).

In our second problem the variance was measured along an arbitrary direction. We gave a natural generalization of the Hedge algorithm to the case when the parameters are density matrices. Note that in this paper we upper bounded the sum of the expected variances over trials, whereas in (Warmuth, 2005; Warmuth and Kuzmin, 2006a) a Bayes rule for density matrices was given for which a lower bound was provided on the product of the expected variances over trials.^3

^3 This amounts to an upper bound on the sum of the negative logarithms of the expected variances.

Much work has been done on using exponential update factors in the expert setting. In particular, algorithms have been developed for shifting experts by combining the exponential updates with an additive sharing update (Herbster and Warmuth, 1998). In preliminary work we showed that these techniques easily carry over to the density matrix updates. This includes the more recent work on the sharing to the past average update, which introduces a long-term memory (Bousquet and Warmuth, 2002). (See Warmuth and Kuzmin (2008) for some experimental evidence that the sharing to the past average update carries over to the matrix setting.)

The first use of the exponential update factors was in the algorithm Winnow (Littlestone, 1989, 1988) for learning disjunctions. Curiously enough this algorithm also nicely carries over to the matrix case (Warmuth, 2007) using the same methods applied in the second half of this paper.
A Sketch of other Proof Techniques for Variance Minimization over the Unit Sphere

In this section we briefly discuss some slight variations of the proof of the above relative loss bound. For the sake of brevity we assume $r = 1$ for the rest of the paper. The first variation stresses the potential
$$P^t := -\log\mathrm{tr}\big(\exp(\log W^0 - \eta C^{\le t})\big),$$
which is the value of the motivating minimization problem (3.4) when plugging in the solution $W^t$ (easily seen by using the right form of update (3.3)).^4 As part of the previous proof we already bounded the drop of the potential (this method is the core of many papers in the expert advice setting (Littlestone and Warmuth, 1994; Kivinen and Warmuth, 1999; Vovk, 1990)):
$$P^t - P^{t-1} \overset{(3.7\text{-}3.11)}{\ge} \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta}).$$
Summing over trials we get (since $P^0 = 0$):
$$P^T \ge \sum_{t=1}^T \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta}).$$
Finally, since $P^T$ is the value of (3.4) at $W^T$:
$$P^T = \Delta(W^T, W^0) + \eta\,\mathrm{tr}(W^T C^{\le T}) = \min_{U\ \mathrm{s.t.}\ \mathrm{tr}(U)=1}\ \Delta(U, W^0) - \Delta(U, W^T) + \eta\,\mathrm{tr}(U C^{\le T}).$$
Piecing the above inequality and equality for $P^T$ together, again gives the bound of the theorem.

^4 The value of the on-line minimization problem (3.2) at the optimum $W^t$ is the drop of potential $P^t - P^{t-1}$.
The second variation of the proof expresses the normalization in the update as a Bregman projection and uses the Generalized Pythagorean Theorem for Bregman divergences. See (Helmbold and Warmuth, 2008) for an analogous discussion in the context of learning permutations. For this we start with the following unnormalized update:
$$\widetilde W^t = \exp(\log W^{t-1} - \eta C^t),$$
which is the solution to the minimization problem
$$\widetilde W^t = \arg\min_{W}\ \widetilde\Delta(W, W^{t-1}) + \eta\,\mathrm{tr}(WC^t), \qquad (A.1)$$
where $\widetilde\Delta(U, W)$ is now the unnormalized quantum relative entropy defined as $\mathrm{tr}(U(\log U - \log W)) + \mathrm{tr}(W) - \mathrm{tr}(U)$. The unnormalized version coincides with the standard relative entropy when both matrices have trace one. It is easy to see that the basic inequality in the proof of Theorem 3.1 holds for this unnormalized update as well:
$$\widetilde\Delta(U, W^{t-1}) - \widetilde\Delta(U, \widetilde W^t) = \mathrm{tr}(U\log\widetilde W^t) - \mathrm{tr}(U\log W^{t-1}) + \mathrm{tr}(W^{t-1}) - \mathrm{tr}(\widetilde W^t)$$
$$= -\eta\,\mathrm{tr}(UC^t) + \mathrm{tr}(W^{t-1}) - \mathrm{tr}\big(\exp(\log W^{t-1} - \eta C^t)\big)$$
$$\overset{(3.1)}{\ge} -\eta\,\mathrm{tr}(UC^t) + \mathrm{tr}\big(W^{t-1}(I - \exp(-\eta C^t))\big)$$
$$\ge -\eta\,\mathrm{tr}(UC^t) + (1 - e^{-\eta})\,\mathrm{tr}(W^{t-1}C^t).$$
The last inequality follows from upper bounding $\exp(-\eta C^t)$ and applying Lemma 3.2 as done in (3.8).
Now we introduce the concept of Bregman projections. A Bregman projection onto a convex set is defined as simply the closest point in that set in terms of the Bregman divergence of interest to us. Take the convex set to be the set of trace one matrices. It is easy to see that the normalized update (3.3) is the Bregman projection wrt the quantum relative entropy of $\widetilde W^t$ onto the set of trace one matrices:
$$W^t = \arg\inf_{W\ \mathrm{s.t.}\ \mathrm{tr}(W)=1}\ \widetilde\Delta(W, \widetilde W^t) = \frac{\widetilde W^t}{\mathrm{tr}(\widetilde W^t)}.$$
Applying the Generalized Pythagorean Theorem for Bregman projections (see e.g. (Herbster and Warmuth, 2001)) then gets us the following inequality:
$$\widetilde\Delta(U, \widetilde W^t) \ge \widetilde\Delta(U, W^t) + \widetilde\Delta(W^t, \widetilde W^t).$$
Since divergences are non-negative, we can drop the term $\widetilde\Delta(W^t, \widetilde W^t)$ to get the inequality:
$$\widetilde\Delta(U, \widetilde W^t) - \widetilde\Delta(U, W^t) \ge 0. \qquad (A.2)$$
Now we add (A.2) to our inequality for the unnormalized update to obtain the desired inequality for the normalized update:
$$\Delta(U, W^{t-1}) - \Delta(U, W^t) \ge -\eta\,\mathrm{tr}(UC^t) + (1 - e^{-\eta})\,\mathrm{tr}(W^{t-1}C^t).$$
The proof now completes as in Theorem 3.1.

In the above proof of the bound via the unnormalized update, the Bregman projection methods can also be replaced with a potential based argument. To do this the unnormalized potential is defined as $\widetilde P^t := -\mathrm{tr}(\widetilde W^t)$. For example, by plugging the optimum unnormalized solution $\widetilde W^t$ into (A.1) we obtain the drop of potentials $\mathrm{tr}(W^{t-1}) - \mathrm{tr}(\widetilde W^t)$. We leave the details to the reader.
B Proof for Range of Matrix Factors for Variance Minimization over the Unit Sphere

For the regular Weighted Majority update (Littlestone and Warmuth, 1994) it is well-known that if the losses satisfy the range restriction $l^t_i \in [0, r]$, then the multiplicative factors $e^{-\eta l^t_i}$ can be replaced by a whole range of factors $f$ satisfying $e^{-\eta l^t_i} \le f \le 1 - \frac{l^t_i}{r}(1 - e^{-\eta r})$. This means that if any such factors are used in place of the Hedge update (3.5) then the basic loss bound of Theorem 3.1 still holds. Here we generalize the notion of such factors to the matrix case. As we shall see, the key is to use the right matrix product. For the sake of brevity, we assume $r = 1$ for the rest of the paper.

In Section 3.4 and Appendix A we proved a bound for the following version of the matrix update:
$$W^t = \frac{\exp(\log W^{t-1} - \eta C^t)}{\mathrm{tr}\big(\exp(\log W^{t-1} - \eta C^t)\big)}.$$
For our generalization of the update that uses a range of matrix factors we need the following binary operation defined for symmetric strictly positive definite matrices:^5
$$A \odot B = \exp(\log A + \log B).$$
This operation plays a crucial role in the generalized probability calculus for density matrices that we develop in Warmuth and Kuzmin (2006a). See that paper for a discussion of many properties of the $\odot$ operation.

^5 This definition can be extended to all symmetric positive definite matrices via the limit formula: $A\odot B = \lim_{n\to\infty}(A^{1/n}B^{1/n})^n$. See Warmuth and Kuzmin (2006a) for a discussion.
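For strictly positive definite matrices the $\odot$ operation can be computed directly from eigendecompositions. The sketch below is illustrative only (the matrices are random); it also confirms that $\odot$ is commutative, unlike the ordinary matrix product.

```python
import numpy as np

def sym_fun(A, f):
    """Apply a scalar function f to the eigenvalues of a symmetric matrix A."""
    lam, V = np.linalg.eigh(A)
    return (V * f(lam)) @ V.T

def circ(A, B):
    """A (.) B = exp(log A + log B), for strictly positive definite A and B."""
    return sym_fun(sym_fun(A, np.log) + sym_fun(B, np.log), np.exp)

rng = np.random.default_rng(4)

def rand_spd(n):
    M = rng.normal(size=(n, n))
    return M @ M.T + 0.1 * np.eye(n)

A, B = rand_spd(3), rand_spd(3)
print(np.allclose(circ(A, B), circ(B, A)))   # True: the operation is commutative
print(np.allclose(A @ B, B @ A))             # generally False for the ordinary matrix product
```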
Now we can rewrite the update (3.3) of our algorithm as follows:
$$W^t = \frac{W^{t-1}\odot\exp(-\eta C^t)}{\mathrm{tr}\big(W^{t-1}\odot\exp(-\eta C^t)\big)}.$$
The generalization has the following form:
$$W^t = \frac{W^{t-1}\odot F^t}{\mathrm{tr}(W^{t-1}\odot F^t)}, \qquad (B.1)$$
where the matrix factor $F^t$ is any symmetric matrix satisfying the following generalized inequalities:
$$\exp(-\eta C^t) \preceq F^t \preceq I - C^t(1 - e^{-\eta}). \qquad (B.2)$$
Note that both the lhs and rhs of the above chain of inequalities are symmetric positive definite and lhs $\preceq$ rhs by Lemma 3.1.
We now prove that the same bound holds for the update in (B.1).

Theorem B.1 For any sequence of symmetric loss matrices $0 \preceq C^t \preceq I$, for any learning rate $\eta > 0$ and for any comparator density matrix $U$ the following bound holds on the loss of the algorithm with update (B.1) when the $F^t$ satisfy the condition (B.2):
$$\sum_{t=1}^T \mathrm{tr}(W^{t-1}C^t) \le \frac{\Delta(U, W^0) - \Delta(U, W^T) + \eta\,\mathrm{tr}(U C^{\le T})}{1 - e^{-\eta}}.$$

Proof The proof starts by analyzing progress towards $U$ in terms of the quantum relative entropy. To simplify the analysis we assume that $W^{t-1}$ and $F^t$ are full rank matrices. In that case $\log(W^{t-1}\odot F^t) = \log W^{t-1} + \log F^t$. This is a reasonable assumption if $W^0$ has full rank.
$$\Delta(U, W^{t-1}) - \Delta(U, W^t) = \mathrm{tr}\Big(U\Big(-\log W^{t-1} + \log\frac{W^{t-1}\odot F^t}{\mathrm{tr}(W^{t-1}\odot F^t)}\Big)\Big) = \mathrm{tr}(U\log F^t) - \log\mathrm{tr}(W^{t-1}\odot F^t). \qquad (B.3)$$
Now we proceed to bound the last two terms. First, note that the matrix log is operator monotone, i.e. if $A \preceq B$ then $\log A \preceq \log B$ (Fact 8.8.29 of Bernstein (2005)). Thus the left side of condition (B.2) gives us:
$$-\eta C^t \preceq \log F^t.$$
Which, in turn, by application of Lemma 3.2 gives:
$$-\eta\,\mathrm{tr}(UC^t) \le \mathrm{tr}(U\log F^t).$$
Now we will bound the term $-\log\mathrm{tr}(W^{t-1}\odot F^t)$. First, applying the Golden-Thompson inequality to the $\odot$ operation gives:
$$\mathrm{tr}(W^{t-1}\odot F^t) \le \mathrm{tr}(W^{t-1}F^t).$$
Next, we use the condition that $F^t \preceq I - C^t(1 - e^{-\eta})$:
$$\mathrm{tr}(W^{t-1}F^t) \le 1 - \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta}).$$
All this together gives:
$$\log\mathrm{tr}(W^{t-1}\odot F^t) \le \log\big(1 - \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta})\big).$$
Now, using $\log(1 - x) \le -x$ we get:
$$\log\mathrm{tr}(W^{t-1}\odot F^t) \le -\mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta}).$$
Putting all the pieces into the progress equality (B.3), we arrive at:
$$\Delta(U, W^{t-1}) - \Delta(U, W^t) \ge -\eta\,\mathrm{tr}(UC^t) + \mathrm{tr}(W^{t-1}C^t)\,(1 - e^{-\eta}).$$
Summing over trials and rearranging terms gives the desired inequality.
References

Arora S, Kale S (2007) A combinatorial, primal-dual approach to semidefinite programs. In: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (STOC 07), ACM, New York, NY, USA, pp 227-236
Bernstein DS (2005) Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear Systems Theory. Princeton University Press
Bhatia R (1997) Matrix Analysis. Springer-Verlag, Berlin
Bousquet O, Warmuth MK (2002) Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research 3:363-396
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Cesa-Bianchi N, Mansour Y, Stoltz G (2005) Improved second-order bounds for prediction with expert advice. In: Proceedings of the 18th Annual Conference on Learning Theory (COLT 05), Springer-Verlag, pp 217-232
Cover TM (1991) Universal portfolios. Mathematical Finance 1(1):1-29
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to Boosting. J Comput Syst Sci 55(1):119-139
Helmbold D, Warmuth MK (2007) Learning permutations with exponential weights. In: Proceedings of the 20th Annual Conference on Learning Theory (COLT 07), Springer-Verlag, pp 469-483
Helmbold D, Warmuth MK (2008) Learning permutations with exponential weights, submitted journal version: http://www.soe.ucsc.edu/~manfred/last/
Helmbold D, Schapire RE, Singer Y, Warmuth MK (1997) A comparison of new and old algorithms for a mixture estimation problem. Machine Learning 27(1):97-119
Helmbold D, Schapire RE, Singer Y, Warmuth MK (1998) On-line portfolio selection using multiplicative updates. Mathematical Finance 8(4):325-347
Herbster M, Warmuth MK (1998) Tracking the best expert. Machine Learning 32(2):151-178, earlier version in 12th ICML, 1995
Herbster M, Warmuth MK (2001) Tracking the best linear predictor. Journal of Machine Learning Research 1:281-309
Kivinen J, Warmuth MK (1997) Additive versus exponentiated gradient updates for linear prediction. Information and Computation 132(1):1-64
Kivinen J, Warmuth MK (1999) Averaging expert predictions. In: Computational Learning Theory, 4th European Conference (EuroCOLT 99), Nordkirchen, Germany, March 29-31, 1999, Proceedings, Springer, Lecture Notes in Artificial Intelligence, vol 1572, pp 153-167
Kivinen J, Warmuth MK, Hassibi B (2005) The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing 54(5):1782-1793
Kuzmin D, Warmuth MK (2007) Online Kernel PCA with entropic matrix updates. In: Proceedings of the 24th international conference on Machine learning (ICML 07), ACM International Conference Proceedings Series, pp 465-471
Littlestone N (1988) Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2(4):285-318
Littlestone N (1989) Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz
Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inform Comput 108(2):212-261, preliminary version in FOCS 89
Nielsen MA, Chuang IL (2000) Quantum Computation and Quantum Information. Cambridge University Press
Tsuda K, Rätsch G, Warmuth MK (2005) Matrix exponentiated gradient updates for on-line learning and Bregman projections. Journal of Machine Learning Research 6:995-1018
Vovk V (1990) Aggregating strategies. In: Proceedings of the Third Annual Workshop on Computational Learning Theory, Morgan Kaufmann, pp 371-383
Warmuth MK (2005) Bayes rule for density matrices. In: Advances in Neural Information Processing Systems 18 (NIPS 05), MIT Press
Warmuth MK (2007) Winnowing subspaces. In: Proceedings of the 24th international conference on Machine learning (ICML 07), ACM Press
Warmuth MK, Kuzmin D (2006a) A Bayesian probability calculus for density matrices. In: Proc. 22nd Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp 503-511
Warmuth MK, Kuzmin D (2006b) Online variance minimization. In: Proceedings of the 19th Annual Conference on Learning Theory (COLT 06), Springer-Verlag, pp 514-528
Warmuth MK, Kuzmin D (2006c) Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In: Advances in Neural Information Processing Systems 19 (NIPS 06), MIT Press
Warmuth MK, Kuzmin D (2008) Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, to appear
