
Machine Learning

CS 4780/5780 - Fall 2012 Cornell University Department of Computer Science

Recommendation Systems


Antonio Bahamonde
Centro de Inteligencia Artificial, Universidad de Oviedo at Gijón, Spain. Spanish Association for Artificial Intelligence. From 9/2012 till 7/2013 Visiting Professor at Cornell University.


Contents

Definitions
Netflix Prize
Similarity-based methods
Matrix factorization
References


RS: Definitions

RS help to match users with items

Ease information overload
Sales assistance (guidance, advisory, persuasion, ...)

RS are software agents that elicit the interests and preferences of individual consumers [...] and make recommendations accordingly.

Different system designs / paradigms:
based on availability of exploitable data
implicit and explicit user feedback
domain characteristics


Purpose and success criteria


When does a RS do its job well?

Retrieval perspective
Reduce search costs
Provide "correct" proposals
Users know in advance what they want

Recommendation perspective
Recommend items from the long tail
Serendipity: identify items from the Long Tail that users did not know existed

Dietmar Jannach, Markus Zanker and Gerhard Friedrich


Purpose and success criteria


Prediction perspective

Predict to what degree users like an item
Most popular evaluation scenario in research

Interaction perspective
Give users a "good feeling"
Educate users about the product domain
Convince/persuade users: explain

Finally, commercial perspective
Increase "hit", "clickthrough", "lookers to bookers" rates
Optimize sales margins and profit


Popular RS

Google
Genius (Apple)
last.fm
Amazon
Netflix
TiVo


Popular RS: Waze

Now in the front line of the news, since Apple suggested its use instead of the failed new Maps in iOS 6.

Popular RS (in everyday life)

Bestseller lists
Top 40 music lists
The recent returns shelf at the library
Unmarked but well-used paths through the woods
Most popular lists in newspapers
Many weblogs
Read any good books lately?

Collaborative Filtering
The most prominent approach to RS

used by large, commercial e-commerce sites
well-understood; various algorithms and variations exist
applicable in many domains (books, movies, DVDs, ...)

Approach
use the "wisdom of the crowd" to recommend items (crowdsourcing)

Basic assumption and idea

users give ratings to catalog items (implicitly or explicitly)
customers who had similar tastes in the past will have similar tastes in the future


Collaborative Filtering

Relate two fundamentally different entities: users and items


explicit feedback (ratings)
implicit feedback (purchase or browsing history, search patterns, ...)
sometimes item descriptions by features (content-based)

Approaches:
neighborhood
latent factor


Naïve Neighborhood approach

user-user / item-item

        item1  item2  item3  item4  item5
alice     5      3      4      4      ?
user1     3      1      2      3      3
user2     4      3      4      3      5
user3     3      3      1      5      4
user4     1      5      5      2      1

Compute similarity and then prediction

Naïve Neighborhood approach

user-user: compare the rows of the same ratings matrix (alice against user1, ..., user4).

Naïve Neighborhood approach

item-item: compare the columns of the same ratings matrix (item5 against item1, ..., item4).

Naïve Neighborhood approach

How to measure similarities ρ(a, b)?

correlation (Pearson)

cosine: $\cos(a, b) = \frac{\langle a, b \rangle}{\|a\| \, \|b\|}$

Prediction (user-user):

$b(u, i) = \bar{r}(u, \cdot) + \frac{\sum_{u'} \mathrm{sim}(u, u') \, (r(u', i) - \bar{r}(u', \cdot))}{\sum_{u'} \mathrm{sim}(u, u')}$

A similar formula holds for item-item.

Naïve Neighborhood approach

user-user using 2 neighbors

        item1  item2  item3  item4  item5   mean    ρ(alice, u)
alice     5      3      4      4      ?     4.00
user1     3      1      2      3      3     2.25     0.853
user2     4      3      4      3      5     3.50     0.707
user3     3      3      1      5      4     3.00     0
user4     1      5      5      2      1     3.25    -0.792

(means are taken over the items co-rated with alice)

$b(\mathrm{alice}, \mathrm{item5}) = 4 + \frac{(3 - 2.25) \cdot 0.853 + (5 - 3.5) \cdot 0.707}{0.853 + 0.707} \approx 5.09$
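The worked example above can be reproduced in plain Python. This is a sketch of the slide's user-user method (the function names are mine, not the lecture's): Pearson correlation over co-rated items, then a prediction from the 2 most similar neighbors.

```python
# Toy ratings matrix from the slide; alice has not rated item5 yet.
R = {
    "alice": {"item1": 5, "item2": 3, "item3": 4, "item4": 4},
    "user1": {"item1": 3, "item2": 1, "item3": 2, "item4": 3, "item5": 3},
    "user2": {"item1": 4, "item2": 3, "item3": 4, "item4": 3, "item5": 5},
    "user3": {"item1": 3, "item2": 3, "item3": 1, "item4": 5, "item5": 4},
    "user4": {"item1": 1, "item2": 5, "item3": 5, "item4": 2, "item5": 1},
}

def pearson(u, v):
    """Pearson correlation between users u and v over their co-rated items."""
    common = sorted(set(R[u]) & set(R[v]))
    mu_u = sum(R[u][i] for i in common) / len(common)
    mu_v = sum(R[v][i] for i in common) / len(common)
    num = sum((R[u][i] - mu_u) * (R[v][i] - mu_v) for i in common)
    den = (sum((R[u][i] - mu_u) ** 2 for i in common) ** 0.5 *
           sum((R[v][i] - mu_v) ** 2 for i in common) ** 0.5)
    return num / den if den else 0.0

def predict(u, item, k=2):
    """b(u, item): u's mean plus similarity-weighted neighbor deviations."""
    mean_u = sum(R[u].values()) / len(R[u])
    cands = []
    for v in R:
        if v == u or item not in R[v]:
            continue
        common = set(R[u]) & set(R[v])
        mean_v = sum(R[v][i] for i in common) / len(common)
        cands.append((pearson(u, v), mean_v, R[v][item]))
    top = sorted(cands, reverse=True)[:k]     # the k most similar neighbors
    num = sum(s * (r - m) for s, m, r in top)
    return mean_u + num / sum(s for s, _, _ in top)

print(round(predict("alice", "item5"), 2))    # matches the slide: about 5.09
```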

Naïve Neighborhood approach

item-item using 2 neighbors

              item1   item2   item3   item4   item5
alice           5       3       4       4       ?
user1           3       1       2       3       3
user2           4       3       4       3       5
user3           3       3       1       5       4
user4           1       5       5       2       1
ρ(item5, i)   0.969  -0.478  -0.428   0.582
mean r(·, i)   3.2     3.0     3.2     3.4     3.25

$b(\mathrm{alice}, \mathrm{item5}) = 3.25 + \frac{(5 - 3.2) \cdot 0.969 + (4 - 3.4) \cdot 0.582}{0.969 + 0.582} \approx 4.6$
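The same slide arithmetic, sketched in Python (my naming, not the lecture's): item-item Pearson similarities are computed over item5's raters (user1 to user4), while the item means use all known ratings, as in the table above.

```python
# Columns of the toy matrix for user1..user4 (the users who rated item5);
# alice's own ratings are kept separately.
others = {"item1": [3, 4, 3, 1], "item2": [1, 3, 3, 5], "item3": [2, 4, 1, 5],
          "item4": [3, 3, 5, 2], "item5": [3, 5, 4, 1]}
alice = {"item1": 5, "item2": 3, "item3": 4, "item4": 4}

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) ** 0.5 *
           sum((y - mb) ** 2 for y in b) ** 0.5)
    return num / den if den else 0.0

def item_mean(j):
    """Mean rating of item j over all users who rated it, alice included."""
    vals = others[j] + ([alice[j]] if j in alice else [])
    return sum(vals) / len(vals)

def predict_item(target, k=2):
    sims = sorted(((pearson(others[target], others[j]), j) for j in alice),
                  reverse=True)[:k]          # item1 (0.969) and item4 (0.582)
    num = sum(s * (alice[j] - item_mean(j)) for s, j in sims)
    return item_mean(target) + num / sum(s for s, _ in sims)

print(round(predict_item("item5"), 1))       # matches the slide: about 4.6
```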

Nave Neighborhood approach

Scalability

user-user is a memory-based method
with millions of users it does not scale for most real-world scenarios


Nave Neighborhood approach

item-item

is model-based
models learned offline are stored for predictions at run-time
allows explanations
no cold start problem


Nave Neighborhood approach

However in all cases

not all neighbors should be taken into account (similarity thresholds)
not all items are rated (only co-rated items can be compared)
the loss function is not involved


Netflix Prize
(Sep 21, 2009): "Netflix Awards $1 Million Prize and Starts a New Contest" [...] try to predict what movies particular customers would prefer. "Accurately predicting the movies Netflix members will love is a key component of our service," said Neil Hunt, chief product officer (Netflix).


Netflix Prize
The Netflix dataset
more than 100 million movie ratings (1-5 stars)
collected between Nov 11, 1999 and Dec 31, 2005
about 480,189 users and n = 17,770 movies
99% of the possible ratings are missing
each movie averages 5600 ratings
each user rates 208 movies on average
training and quiz (test-prize) data


Netflix Prize
The loss function: root mean squared error (RMSE)

$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathrm{Quiz}|} \sum_{(u,i) \in \mathrm{Quiz}} \big( r(u,i) - b(u,i) \big)^2}$

Netflix had its own system, Cinematch, which achieved 0.9514. The prize was awarded to a system that reached an RMSE below 0.8563 (10% improvement).
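As a minimal illustration (my sketch, not the contest code), RMSE over a quiz set of (true rating, predicted rating) pairs:

```python
def rmse(pairs):
    """Root mean squared error over (true rating, predicted rating) pairs."""
    return (sum((r - b) ** 2 for r, b in pairs) / len(pairs)) ** 0.5

# A predictor that is always exactly one star off has RMSE 1.0
quiz = [(4, 3), (2, 3), (5, 4), (1, 2)]
print(rmse(quiz))   # 1.0
```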


Netflix Prize

Leaderboard

   Team                                   RMSE    Date      Hour
1  BellKor's Pragmatic Chaos              0.8567  26/07/09  18:18:28
2  The Ensemble                           0.8567  26/07/09  18:38:22
3  Grand Prize Team                       0.8582  10/07/09  21:24:40
4  Opera Solutions and Vandelay United    0.8588  10/07/09  01:12:31


Netflix Prize winners

Yehuda Koren, Robert M. Bell: Advances in Collaborative Filtering. Recommender Systems Handbook 2011: 145-186.

Yehuda Koren (Yahoo! Research), Robert Bell (AT&T Labs Research)

They discuss similarity and matrix factorization approaches.


Similarity approach revisited

3 major components:

data normalization
neighbor selection
determination of interpolation weights


Baseline approach

Example: Titanic and Joe

Average rating in Netflix: 3.7
Joe is critical: rates 0.3 less than the average
Titanic: rated 0.5 more than the average (all users)

b(Joe, Titanic) = 3.7 - 0.3 + 0.5 = 3.9

$b(u, i) = \mu + b_u + b_i$

Baseline approach

The parameters b_u and b_i indicate the observed deviations of user u and item i, respectively, from the average. For example, suppose that the average rating over all movies, μ, is 3.7 stars, that Titanic tends to be rated 0.5 stars above the average, and that Joe is a critic who tends to rate 0.3 stars lower than the average. Then the baseline prediction for Titanic's rating by Joe is 3.7 - 0.3 + 0.5 = 3.9 stars.

The equations sound appealing, but Koren and Bell propose to learn b_u and b_i by solving a least squares problem:

$\min_{b} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 \Big( \sum_u b_u^2 + \sum_i b_i^2 \Big)$

where K = {(u, i) : r_ui is known} (in the Netflix data, 99% of the possible ratings are missing) and the last term is a regularization term to avoid overfitting.
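A runnable sketch of this least-squares fit on the lecture's toy matrix (the loop, step size, and epoch count are my illustrative choices, not the lecture's): stochastic gradient steps on the regularized objective.

```python
rows = {"alice": [5, 3, 4, 4, None], "user1": [3, 1, 2, 3, 3],
        "user2": [4, 3, 4, 3, 5], "user3": [3, 3, 1, 5, 4],
        "user4": [1, 5, 5, 2, 1]}
# K = the (user, item) pairs whose rating is known
K = [(u, i, r) for u, rs in rows.items() for i, r in enumerate(rs) if r is not None]

mu = sum(r for _, _, r in K) / len(K)
bu = {u: 0.0 for u in rows}
bi = {i: 0.0 for i in range(5)}
lam1, step = 0.02, 0.01

def objective():
    """Regularized squared error of the baseline b_ui = mu + b_u + b_i."""
    sq = sum((r - mu - bu[u] - bi[i]) ** 2 for u, i, r in K)
    return sq + lam1 * (sum(b * b for b in bu.values()) +
                        sum(b * b for b in bi.values()))

start = objective()
for _ in range(200):
    for u, i, r in K:
        e = r - (mu + bu[u] + bi[i])
        bu[u] += step * (e - lam1 * bu[u])
        bi[i] += step * (e - lam1 * bi[i])

print(objective() < start, bu["alice"] > 0)   # alice rates above the average
```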

Baseline approach

In the Netflix data:
the average rating μ = 3.6
the learned user bias b_u has average 0.044; the average |b_u| is 0.32
the learned item (movie) bias b_i has average -0.26 (s.d. 0.48); the average |b_i| is 0.43

RMSE: baseline 0.9799 | Cinematch 0.9514 | Prize 0.8567


Improvements

Koren & Bell report significant improvements from adding:
temporal dynamics (temporal drift of preferences)
implicit feedback (movie rental history was not available; the set of rated movies is used instead)

        baseline  + temporal  + temporal (spline)  Cinematch  Prize
RMSE    0.9799    0.9771      0.9603               0.9514     0.8567

Matrix factorization

Tries to capture relationships between users and items
Based on the well-known algebraic decomposition of matrices
Used before in Information Retrieval (LSI)
Intended idea: consider latent variables
As implemented by Koren and Bell, this approach won the Netflix Prize


Matrix factorization

Transform both items and users into a feature space of lower dimensionality (k), the latent space
Tries to explain ratings by characterizing both items and users on factors automatically learned from data
Factors might measure aspects such as comedy, drama, amount of action, ...
Efficient offline implementation
Admits improvements for temporal drift and implicit feedback

Matrix factorization (SVD)

SVD (singular value decomposition): a matrix M with rows = users and columns = items factors as

$M = U \, S \, \mathrm{Items}^T \quad \text{where} \quad U^T U = \mathrm{Items}^T \mathrm{Items} = I$

$S = \mathrm{diag}(s_1, \ldots, s_n), \qquad s_1 \ge s_2 \ge \cdots \ge s_n \ge 0$

$\sqrt{S} = \mathrm{diag}(\sqrt{s_1}, \ldots, \sqrt{s_n})$

and then

$M = (U \sqrt{S}) \, (\sqrt{S} \, \mathrm{Items}^T)$

Matrix factorization (SVD)

If we use only k dimensions:

$M \approx M_k = U_k \, S_k \, (\mathrm{Items}_k)^T$

M_k is a #Users × #Items matrix of rank k: the most similar matrix to M with this rank.

$M_k = (U_k \sqrt{S_k}) \, (\sqrt{S_k} \, (\mathrm{Items}_k)^T)$

Matrix factorization

Algorithm:

M = r_ui - mu - b_u - b_i;         % fill missing values with 0
[U S Items] = svd(M);              % fix k <= rank(M)
U_rep = Uk * sqrt(Sk);             % call p_u the u-th row of U_rep
Items_rep = Itemsk * sqrt(Sk);     % call q_i the i-th row of Items_rep

Adding in the aforementioned baseline predictors, a rating is predicted by the rule

$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u$

Matrix factorization

In this case M:

        item1  item2  item3  item4  item5    b_u
alice     5      3      4      4      ?      0.80
user1     3      1      2      3      3     -0.80
user2     4      3      4      3      5      0.60
user3     3      3      1      5      4      0.00
user4     1      5      5      2      1     -0.40
b_i      0.00  -0.20   0.00   0.20   0.00    μ = 3.20

Matrix factorization

M = r_ui - mu - b_u - b_i;   % fill missing values with 0

        item1  item2  item3  item4  item5
alice    1.0   -0.8    0.0   -0.2    0.0
user1    0.6   -1.2   -0.4    0.4    0.6
user2    0.2   -0.6    0.2   -1.0    1.2
user3   -0.2    0.0   -2.2    1.6    0.8
user4   -1.8    2.4    2.2   -1.0   -1.8

Matrix factorization

[U S Items] = svd(M);

U =
 -0.143   0.348   0.445  -0.500   0.640
 -0.290   0.185   0.148   0.842   0.389
 -0.097   0.478  -0.838  -0.096   0.226
 -0.406  -0.762  -0.263  -0.118   0.414
  0.849  -0.188  -0.096   0.136   0.465

S = diag(4.971, 2.591, 1.258, 0.440, 0.000)

Matrix factorization

[U S Items] = svd(M);

Items =
 -0.359   0.403   0.471  -0.536  -0.447
  0.515  -0.478  -0.208  -0.513  -0.447
  0.575   0.496   0.111   0.459  -0.447
 -0.300  -0.581   0.385   0.474  -0.447
 -0.431   0.159  -0.758   0.116  -0.447

Matrix factorization

For k = 2:

Uk =
 -0.143   0.348
 -0.290   0.185
 -0.097   0.478
 -0.406  -0.762
  0.849  -0.188

Sk = diag(4.971, 2.591)

Itemsk =
 -0.359   0.403
  0.515  -0.478
  0.575   0.496
 -0.300  -0.581
 -0.431   0.159

Matrix factorization

k = 2

[Plot: the users (alice, U1-U4) and items (It1-It5) placed in the 2-dimensional latent space.]

Matrix factorization

For k = 2: M2 = U2*S2*(Items2)'
prediction2 = μ + b_u + b_i + M2(1,5) = 4.4602
mean_error = mean(mean(abs(M - M2))) = 0.1933

For k = 3: M3 = U3*S3*(Items3)'
prediction3 = μ + b_u + b_i + M3(1,5) = 4.0356
mean_error = mean(mean(abs(M - M3))) = 0.0624
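The k = 2 prediction can be checked directly from the truncated factors printed on the earlier slides (a sketch of mine; the rounding in the printed factors shifts the result by about 0.01):

```python
# Rows of U2: alice, user1..user4; rows of Items2: item1..item5 (slide values)
U2 = [[-0.143, 0.348], [-0.290, 0.185], [-0.097, 0.478],
      [-0.406, -0.762], [0.849, -0.188]]
S2 = [4.971, 2.591]
Items2 = [[-0.359, 0.403], [0.515, -0.478], [0.575, 0.496],
          [-0.300, -0.581], [-0.431, 0.159]]

def m2(u, i):
    """Entry (u, i) of M2 = U2 * diag(S2) * Items2'."""
    return sum(U2[u][f] * S2[f] * Items2[i][f] for f in range(2))

mu, b_alice, b_item5 = 3.2, 0.8, 0.0      # rounded baseline values from the slide
prediction2 = mu + b_alice + b_item5 + m2(0, 4)
print(round(prediction2, 2))              # about 4.45, vs. 4.4602 unrounded
```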


Matrix factorization

However, svd needs full matrices.

Earlier works relied on imputation:
it increases enormously the amount of data to be handled
data is distorted due to inaccurate imputations


Matrix factorization

To compute all estimators, Koren and Bell set up an optimization problem that admits an efficient solution and avoids the problem of missing values. Adding in the aforementioned baseline predictors, a rating is predicted by the rule

$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u$

In order to learn the model parameters (b_u, b_i, p_u and q_i) we minimize the regularized squared error

$\min_{b, q, p} \sum_{(u,i) \in K} (r_{ui} - \mu - b_i - b_u - q_i^T p_u)^2 + \lambda_4 \big( b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2 \big)$

The constant λ4, which controls the extent of regularization, is usually determined by cross-validation. Minimization is typically performed by either stochastic gradient descent or alternating least squares. q_i and p_u are vectors of k components, inspired by the columns and rows of the full-matrix SVD approach.

Matrix factorization

The optimization problem can be solved by:

Alternating least squares: rotate between fixing the p_u's to solve for the q_i's, and fixing the q_i's to solve for the p_u's (each is a quadratic problem that can be optimally solved)

Stochastic Gradient Descent
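A minimal single-factor (k = 1) ALS sketch on the residual matrix shown earlier; the initialization, λ, and iteration count are my illustrative choices. With one factor per user and per item, each alternating solve is an exact one-dimensional ridge regression.

```python
# Residual matrix M (slide values); None marks alice's missing item5 rating.
M = [[1.0, -0.8, 0.0, -0.2, None],
     [0.6, -1.2, -0.4, 0.4, 0.6],
     [0.2, -0.6, 0.2, -1.0, 1.2],
     [-0.2, 0.0, -2.2, 1.6, 0.8],
     [-1.8, 2.4, 2.2, -1.0, -1.8]]
p = [0.0] * 5                     # user factors
q = [1.0, -1.0, 1.0, -1.0, 1.0]  # item factors (arbitrary nonzero start)
lam = 0.05

def sq_error():
    return sum((M[u][i] - p[u] * q[i]) ** 2
               for u in range(5) for i in range(5) if M[u][i] is not None)

before = sq_error()
for _ in range(20):
    for u in range(5):  # fix q, minimize over p[u]: exact quadratic solution
        known = [i for i in range(5) if M[u][i] is not None]
        p[u] = (sum(q[i] * M[u][i] for i in known) /
                (lam + sum(q[i] ** 2 for i in known)))
    for i in range(5):  # fix p, minimize over q[i]
        known = [u for u in range(5) if M[u][i] is not None]
        q[i] = (sum(p[u] * M[u][i] for u in known) /
                (lam + sum(p[u] ** 2 for u in known)))
after = sq_error()
print(after < before)   # each alternating solve lowers the objective
```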


Matrix factorization

The optimization problem can be solved by Stochastic Gradient Descent:

γ = 0.005; λ4 = 0.02;
for r_ui ∈ K do
    r̂_ui = μ + b_i + b_u + q_i^T p_u
    e_ui = r_ui - r̂_ui
    b_u ← b_u + γ (e_ui - λ4 b_u)
    b_i ← b_i + γ (e_ui - λ4 b_i)
    q_i ← q_i + γ (e_ui p_u - λ4 q_i)
    p_u ← p_u + γ (e_ui q_i - λ4 p_u)
end for
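The SGD loop, as a runnable sketch (mine) on the lecture's toy ratings, with γ and λ4 from the slide; the initialization and epoch count are illustrative assumptions.

```python
import random
random.seed(0)

rows = {"alice": [5, 3, 4, 4, None], "user1": [3, 1, 2, 3, 3],
        "user2": [4, 3, 4, 3, 5], "user3": [3, 3, 1, 5, 4],
        "user4": [1, 5, 5, 2, 1]}
K = [(u, i, r) for u, rs in rows.items() for i, r in enumerate(rs) if r is not None]

k = 2
mu = sum(r for _, _, r in K) / len(K)
bu = {u: 0.0 for u in rows}
bi = {i: 0.0 for i in range(5)}
p = {u: [random.uniform(-0.1, 0.1) for _ in range(k)] for u in rows}
q = {i: [random.uniform(-0.1, 0.1) for _ in range(k)] for i in range(5)}
gamma, lam4 = 0.005, 0.02

def predict(u, i):
    return mu + bu[u] + bi[i] + sum(qf * pf for qf, pf in zip(q[i], p[u]))

def rmse():
    return (sum((r - predict(u, i)) ** 2 for u, i, r in K) / len(K)) ** 0.5

before = rmse()
for _ in range(500):                      # epochs over the tiny training set
    for u, i, r in K:
        e = r - predict(u, i)
        bu[u] += gamma * (e - lam4 * bu[u])
        bi[i] += gamma * (e - lam4 * bi[i])
        q[i] = [qf + gamma * (e * pf - lam4 * qf) for qf, pf in zip(q[i], p[u])]
        p[u] = [pf + gamma * (e * qf - lam4 * pf) for pf, qf in zip(p[u], q[i])]
after = rmse()
print(after < before)                     # training error drops
```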

Matrix factorization with implicit feedback

Considering the set R(u) of items that each user u has rated as implicit feedback:

γ = 0.007; λ5 = 0.005; λ6 = 0.015;
repeat
    for r_ui ∈ K do
        r̂_ui = μ + b_i + b_u + q_i^T (p_u + |R(u)|^(-1/2) Σ_{j∈R(u)} y_j)
        e_ui = r_ui - r̂_ui
        b_u ← b_u + γ (e_ui - λ5 b_u)
        b_i ← b_i + γ (e_ui - λ5 b_i)
        q_i ← q_i + γ (e_ui (p_u + |R(u)|^(-1/2) Σ_{j∈R(u)} y_j) - λ6 q_i)
        p_u ← p_u + γ (e_ui q_i - λ6 p_u)
        for j ∈ R(u) do
            y_j ← y_j + γ (e_ui |R(u)|^(-1/2) q_i - λ6 y_j)
        end for
    end for
    γ ← 0.9 γ
until convergence    % around 30 iterations
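The implicit-feedback variant, sketched likewise (my pure-Python rendering, not the lecture's code; on this toy data R(u) is simply the set of items u rated, and the initialization is an illustrative assumption):

```python
import random
random.seed(0)

rows = {"alice": [5, 3, 4, 4, None], "user1": [3, 1, 2, 3, 3],
        "user2": [4, 3, 4, 3, 5], "user3": [3, 3, 1, 5, 4],
        "user4": [1, 5, 5, 2, 1]}
K = [(u, i, r) for u, rs in rows.items() for i, r in enumerate(rs) if r is not None]
Ru = {u: [i for i, r in enumerate(rs) if r is not None] for u, rs in rows.items()}

k = 2
mu = sum(r for _, _, r in K) / len(K)
bu = {u: 0.0 for u in rows}
bi = {i: 0.0 for i in range(5)}
p = {u: [random.uniform(-0.1, 0.1) for _ in range(k)] for u in rows}
q = {i: [random.uniform(-0.1, 0.1) for _ in range(k)] for i in range(5)}
y = {i: [0.0] * k for i in range(5)}
gamma, lam5, lam6 = 0.007, 0.005, 0.015

def user_vec(u):
    """p_u plus |R(u)|^(-1/2) times the sum of y_j over items u has rated."""
    w = len(Ru[u]) ** -0.5
    return [p[u][f] + w * sum(y[j][f] for j in Ru[u]) for f in range(k)]

def rmse():
    se = 0.0
    for u, i, r in K:
        z = user_vec(u)
        se += (r - (mu + bu[u] + bi[i] + sum(q[i][f] * z[f] for f in range(k)))) ** 2
    return (se / len(K)) ** 0.5

before = rmse()
g = gamma
for _ in range(30):                       # "around 30 iterations"
    for u, i, r in K:
        z = user_vec(u)
        e = r - (mu + bu[u] + bi[i] + sum(q[i][f] * z[f] for f in range(k)))
        w = len(Ru[u]) ** -0.5
        q_old = q[i][:]
        bu[u] += g * (e - lam5 * bu[u])
        bi[i] += g * (e - lam5 * bi[i])
        q[i] = [q[i][f] + g * (e * z[f] - lam6 * q[i][f]) for f in range(k)]
        p[u] = [p[u][f] + g * (e * q_old[f] - lam6 * p[u][f]) for f in range(k)]
        for j in Ru[u]:
            y[j] = [y[j][f] + g * (e * w * q_old[f] - lam6 * y[j][f])
                    for f in range(k)]
    g *= 0.9                              # decay the learning rate each pass
after = rmse()
print(after < before)
```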


Matrix factorization scores

The SVD method, with improvements for temporal drift and implicit feedback, increases its performance. The value of the rank k is also significant.

            k=10    k=20    k=50    k=100   k=200
SVD         0.9140  0.9074  0.9046  0.9025  0.9009
SVD++       0.9131  0.9032  0.8952  0.8924  0.8911
timeSVD++   0.8971  0.8891  0.8824  0.8805  0.8799

Matrix factorization scores


Finally, with some extra improvements in the algorithms used to solve the optimization problems, the team of Koren and Bell won the Netflix Prize with

RMSE = 0.8567

Remember that the second team (The Ensemble) reached the same score. The victory was awarded to Koren and Bell because their results were submitted 20 minutes earlier.


Other Recommender Systems


Playlists:
Shuo Chen, Joshua Moore, Douglas Turnbull, Thorsten Joachims, Playlist Prediction via Metric Embedding, ACM Conference on Knowledge Discovery and Data Mining (KDD), 2012.


References

Yehuda Koren, Robert M. Bell: Advances in Collaborative Filtering. Recommender Systems Handbook 2011: 145-186.

William W. Cohen: Collaborative Filtering: A Tutorial. Carnegie Mellon University.

Dietmar Jannach, Gerhard Friedrich: Tutorial: Recommender Systems. International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, July 17, 2011. (TU Dortmund, Alpen-Adria Universität Klagenfurt)

