
Machine Learning Course - CS-433

Matrix Factorizations

Nov 19, 2019

changes by Martin Jaggi 2019, 2018, 2017
© Martin Jaggi and Mohammad Emtiyaz Khan 2016
Last updated on: November 19, 2019
Motivation
In the Netflix Prize, the goal was to predict the ratings of users for movies, given the existing ratings of those users for other movies. We are going to study the method that achieved the best error (for a single method).

The Movie Ratings Data
Given movies d = 1, 2, . . . , D and users n = 1, 2, . . . , N, we define X to be the D × N matrix containing all rating entries. That is, x_dn is the rating of the n-th user for the d-th movie.

Note that most ratings x_dn are missing, and our task is to predict those missing ratings accurately.
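To make the setup concrete, here is a minimal NumPy sketch of such a rating matrix; the toy ratings and the names X, mask, and omega are our own illustrative choices, not part of the notes:

```python
import numpy as np

# Hypothetical toy instance (not from the notes): D = 4 movies,
# N = 5 users, star ratings 1..5, with np.nan marking missing entries.
X = np.array([[5.0,    np.nan, 3.0,    1.0,    np.nan],
              [4.0,    np.nan, np.nan, 1.0,    2.0],
              [1.0,    1.0,    np.nan, 5.0,    4.0],
              [np.nan, 1.0,    5.0,    4.0,    np.nan]])

mask = ~np.isnan(X)                    # True where a rating is observed
omega = list(zip(*np.nonzero(mask)))   # observed (d, n) index pairs
```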
Prediction Using a Matrix Factorization
We will aim to find W, Z s.t.

$$X \approx W Z^\top.$$

So we hope to 'explain' each rating x_dn by a numerical representation of the corresponding movie and user; in fact, by the inner product of a movie feature vector with the user feature vector:

$$\min_{W, Z} \; \mathcal{L}(W, Z) := \frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2$$

where W ∈ ℝ^{D×K} and Z ∈ ℝ^{N×K} are tall matrices, having only K ≪ D, N columns. The set Ω ⊆ [D] × [N] collects the indices of the observed ratings of the input matrix X.

Each row of those matrices is the feature representation of a movie (rows of W) or a user (rows of Z), respectively.

Is this cost jointly convex w.r.t. W and Z? Is the model identifiable?
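As a sanity check on the notation, a minimal NumPy sketch of this cost could look as follows (the helper name mf_loss and the omega index list are our own hypothetical choices):

```python
import numpy as np

def mf_loss(X, W, Z, omega):
    """Cost L(W, Z): squared error over the observed entries Omega.

    X: (D, N) ratings, W: (D, K) movie features, Z: (N, K) user features,
    omega: iterable of observed (d, n) index pairs.
    """
    pred = W @ Z.T  # dense (D, N) prediction matrix W Z^T
    return 0.5 * sum((X[d, n] - pred[d, n]) ** 2 for (d, n) in omega)
```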
Choosing K
K is the number of latent features.
Recall that for K-means, K was the number of clusters. (Similarly, for GMMs, K was the number of latent variable dimensions.)

We can identify the first few most important dimensions from a matrix decomposition and explore the movies' location in this new space. Figure 3 of the article by Y. Koren, R. Bell, and C. Volinsky, 'Matrix Factorization Techniques for Recommender Systems' (IEEE Computer, August 2009), shows the first two factors from a matrix decomposition of the Netflix Prize data: selected movies are placed according to their factor vectors in two dimensions, and the plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.

[Figure: reproduced page from the cited article, including its Figure 2 (a simplified illustration of the latent factor approach, characterizing users and movies along two axes: male versus female, and serious versus escapist) and Figure 3 (selected Netflix movies plotted against factor vectors 1 and 2).]

Large K facilitates overfitting.

Regularization
We can add a regularizer and minimize the following cost:

$$\frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2 + \frac{\lambda_w}{2} \|W\|_{\text{Frob}}^2 + \frac{\lambda_z}{2} \|Z\|_{\text{Frob}}^2$$

where λ_w, λ_z > 0 are scalars.
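In code, the regularizer only adds two Frobenius-norm penalty terms to the data term; a hedged sketch (mf_loss_reg is our own name, with the data term written out inline):

```python
import numpy as np

def mf_loss_reg(X, W, Z, omega, lam_w, lam_z):
    """Regularized cost: data term plus Frobenius-norm penalties."""
    pred = W @ Z.T
    data = 0.5 * sum((X[d, n] - pred[d, n]) ** 2 for (d, n) in omega)
    return (data
            + 0.5 * lam_w * np.sum(W ** 2)   # (lambda_w / 2) ||W||_Frob^2
            + 0.5 * lam_z * np.sum(Z ** 2))  # (lambda_z / 2) ||Z||_Frob^2
```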


Stochastic Gradient Descent (SGD)
The training objective is a sum over |Ω| terms (one per rating):

$$\sum_{(d,n) \in \Omega} \underbrace{\tfrac{1}{2} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2}_{=: f_{d,n}(W, Z)}$$

Derive the stochastic gradient for W, Z, given one observed rating (d, n) ∈ Ω.

For one fixed element (d, n) of the sum, we derive the gradient entry (d′, k) for W, that is ∂f_{d,n}(W, Z)/∂w_{d′,k}, and analogously entry (n′, k) of the Z part:

$$\frac{\partial f_{d,n}(W, Z)}{\partial w_{d',k}} = \begin{cases} -\big[ x_{dn} - (W Z^\top)_{dn} \big] \, z_{n,k} & \text{if } d' = d \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial f_{d,n}(W, Z)}{\partial z_{n',k}} = \begin{cases} -\big[ x_{dn} - (W Z^\top)_{dn} \big] \, w_{d,k} & \text{if } n' = n \\ 0 & \text{otherwise} \end{cases}$$
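Since the gradient of f_{d,n} is non-zero only in row d of W and row n of Z, one SGD step is very cheap. A sketch following these derivations (the function name sgd_step and the step size gamma are our own choices):

```python
import numpy as np

def sgd_step(X, W, Z, d, n, gamma):
    """One stochastic gradient step on the single term f_{d,n}.

    Only row d of W and row n of Z receive a non-zero gradient;
    gamma is a step size (a hyperparameter, not fixed in the notes).
    """
    e = X[d, n] - W[d] @ Z[n]    # error x_dn - (W Z^T)_dn
    w_d_old = W[d].copy()        # keep the old row for the Z update
    W[d] += gamma * e * Z[n]     # minus the gradient w.r.t. w_{d,:}
    Z[n] += gamma * e * w_d_old  # minus the gradient w.r.t. z_{n,:}
    return W, Z
```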
Alternating Least-Squares (ALS)
For simplicity, let us first assume that there are no missing ratings, that is, Ω = [D] × [N]. Then

$$\frac{1}{2} \sum_{d=1}^{D} \sum_{n=1}^{N} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2 = \frac{1}{2} \|X - W Z^\top\|_{\text{Frob}}^2.$$

We can use coordinate descent to minimize the cost plus regularizer: we first minimize w.r.t. Z for fixed W, and then minimize w.r.t. W for fixed Z.

$$Z^\top := (W^\top W + \lambda_z I_K)^{-1} W^\top X$$

$$W^\top := (Z^\top Z + \lambda_w I_K)^{-1} Z^\top X^\top$$
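A hedged NumPy sketch of these two alternating updates (the initialization scale, iteration count, and the use of np.linalg.solve instead of an explicit inverse are our own choices):

```python
import numpy as np

def als(X, K, lam_w, lam_z, n_iters=20, seed=0):
    """ALS for the fully observed case Omega = [D] x [N]."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(D, K))
    Z = rng.normal(scale=0.1, size=(N, K))
    I_K = np.eye(K)
    for _ in range(n_iters):
        # Z^T := (W^T W + lambda_z I_K)^{-1} W^T X
        Z = np.linalg.solve(W.T @ W + lam_z * I_K, W.T @ X).T
        # W^T := (Z^T Z + lambda_w I_K)^{-1} Z^T X^T
        W = np.linalg.solve(Z.T @ Z + lam_w * I_K, Z.T @ X.T).T
    return W, Z
```

Note that each half-step only solves a K × K linear system (with many right-hand sides), which is part of what makes ALS attractive when K ≪ D, N.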

What is the computational complexity? How can you decrease the cost when N and D are large?
ALS with Missing Entries
Can you derive the ALS updates for the more general setting, when only the ratings (d, n) ∈ Ω contribute to the cost, i.e.

$$\frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2$$

Hint: Compute the gradient with respect to each group of variables, and set it to zero.
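For reference, a sketch of one possible answer for the Z half-step (this partially spoils the exercise; the helper name als_update_Z, the boolean mask, and the optional ridge term lam_z are our own choices):

```python
import numpy as np

def als_update_Z(X, W, Z, mask, lam_z=0.0):
    """One ALS half-step for Z when only masked entries are observed.

    Setting the gradient w.r.t. each user's row z_n to zero yields a
    separate K x K linear system per user, built only from the movies
    that user rated. lam_z = 0.0 matches the unregularized cost above
    (but the system may then be singular if user n rated < K movies).
    """
    K = W.shape[1]
    for n in range(Z.shape[0]):
        obs = mask[:, n]                  # movies rated by user n
        W_n = W[obs]                      # their feature rows, (|Omega_n|, K)
        A = W_n.T @ W_n + lam_z * np.eye(K)
        b = W_n.T @ X[obs, n]
        Z[n] = np.linalg.solve(A, b)
    return Z
```

The W half-step is symmetric: loop over movies d, using only the users in Ω who rated movie d.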
