September 2014
Abstract
Memory Based Collaborative Filtering Recommender Systems have been around for the best part of the last twenty years. They are a mature technology, implemented in numerous commercial applications. However, recent years have seen a departure from Memory Based systems in favour of Model Based systems.
The Netflix.com competition of 2006 brought the Model Based paradigm to the spotlight, with plenty of research following. Still, these matrix factorization based algorithms are expensive to compute and cumbersome to update. Memory Based approaches, on the other hand, are simple, fast, and self-explanatory. We posit that there are still uncomplicated approaches that can be applied to improve this family of Recommender Systems further.
Four strategies aimed at improving the Accuracy of Memory Based Collaborative Filtering Recommender Systems have been proposed and extensively tested. The strategies put forward include an Average Item Voting approach to infer missing ratings, an Indirect Estimation algorithm which pre-estimates the missing ratings before computing the overall recommendation, a Class Type Grouping strategy to filter out items of a class different from the target one, and a Weighted Ensemble that averages an estimate computed with all samples with one obtained via the Class Type Grouping approach.
This work will show that there is still ample space to improve Memory Based Systems and raise their Accuracy to the point where they can compete with state-of-the-art Model Based approaches such as Matrix Factorization or Singular Value Decomposition techniques, which require considerable processing power and generate models that become obsolete as soon as users add new ratings into the system.
Acknowledgements
Artificial Intelligence is a fascinating topic, which will certainly touch our lives in the years to come. But out of the many branches of this rich discipline, Recommender Systems attracted me particularly. This I owe to the teachings of Mara Salamo Llorente, who introduced me to the topic, patiently answered all my numerous questions, and, after I completed the course, was kind enough to agree to supervise my Thesis. I admire her patience and her focus. Without her, this work would be half as interesting and half as useful.
No man is an island. And I would never have made it this far, this advanced
in life, without the support of my wife and my son. They are my pillars. They are
my ground. They are my light. Little makes sense without them. Thank you both.
Thank you for pushing me further and higher.
Contents

1 Introduction 7
1.1 Definition of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Objectives of this Work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Reader's Guide . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.2 Threshold Filtering . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Rating Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.1 Recommender Algorithm . . . . . . . . . . . . . . . . . . . . . 30
2.7 Assessment Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Improvement Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8.1 Significance Weighting . . . . . . . . . . . . . . . . . . . . . . 33
2.8.2 Default Voting . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8.3 Context Aware . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9 Typical Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9.1 Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.9.2 Cold Start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Proposals 38
3.1 Description of Proposals . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.1 Default Item Voting . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.2 Indirect Estimation . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.3 Class Type Grouping . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.4 Weighted Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Item Based Formulations . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Experiments 47
4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 Conclusions and Future Work 70
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bibliography 73
Appendices 79
B Summary of Results 83
List of Figures
List of Tables
4.25 Item Based results of Friedman Test . . . . . . . . . . . . . . . . . . 59
4.26 ANOVA Test of Similarity Functions . . . . . . . . . . . . . . . . . . 60
4.27 ANOVA Test of Significance Weighting Algorithm . . . . . . . . . . . 61
4.28 ANOVA Test of User vs Item Based Approaches . . . . . . . . . . . . 62
4.29 ANOVA Test of Plain, User, Item and Indirect Algorithms . . . . . . 62
4.30 ANOVA Test of Grouping and Weighting Algorithms . . . . . . . . . 63
4.31 Comparative of Thesis Accuracy results against others published . . . 64
Chapter 1
Introduction
Things changed significantly with the inception of Web 2.0 around the year 2004. Everything and everybody moved to the internet, shops and shoppers alike, and radical changes to the GUI of web pages, particularly those enabled by the responsive AJAX technology, permitted a much richer and more immersive browsing experience. From then on, there was a never-ending exponential increase in information flow, which even the physical network layer had trouble carrying at first. But soon enough the speed limit imposed by early modems gave way to high speed ADSL or cable, and from that moment onward, web pages competed with one another to display more graphics, more images, and more text. The information age was born, and the concept of information overload thus became common knowledge.
How should an online shopper filter the massive number of options now on offer? The task truly became akin to finding a needle in a haystack. Using the old-fashioned pre-internet method of trusting one's own taste, or that of a close friend, did not seem to cut it any longer. The visibility of information became greatly reduced. How to discover new hidden things? New products? New ideas? How to select, among literally millions of images, the one sought?
Googling a term and expecting to find a reasonable handful of products is no longer a sensible expectation. What should the query "I want a shirt like the one I bought last week" return? Search engines are good at returning matches to keywords, but the results are not tailored to the user. Arguably, very particular queries do yield valuable results; however, the lists are so vast and so devoid of any personal targeting that in many cases they are rather meaningless.
Luckily, online shops faced the question of targeted recommendation rather early, perhaps realizing that information would only scale with time, and that people would need help to find the items they want. Their quest led them to devise algorithms that could either learn one's taste, or measure the similarity between our taste and that of others, and present us with an educated recommendation as to what we might like to browse next.
The underlying premise is rather simple but powerful: quantifying and classifying taste can serve as a platform from which other items that shoppers might like can be offered to them. Clearly, assisting shoppers in browsing experiences where the number of items is just too vast to cover physically aims at increasing the probability of converting a casual look into a sale.
The family of algorithms developed was termed Recommender Systems, and rapidly moved from research projects to full-blown commercial products. Amazon.com was one of the first websites to integrate a Recommender System to aid in the discovery of its vast catalogue of books. Today we all expect to be presented with the familiar "customers who bought this item also bought such and such", or "these items are generally bought together".
Recommender Systems are a product of research carried out in the field of Information Retrieval, particularly into Information Filtering techniques developed to better cope with the exponential growth of information in the computer age. This places Recommender Systems within the field of Data Mining and Machine Learning. In fact, Recommender Systems are said to learn the preferences of a particular user, with the intention of suggesting relevant not-yet-seen items to this target user.
Modern Recommender Systems draw from Machine Learning and Statistics, and will employ Natural Language Processing techniques when analysing the contents of items [4]. They will also leverage inference of user interest, which is a classification problem, in order to build a rich user profile that represents the target user as faithfully as possible.
One of the earliest research projects on Recommender Systems was GroupLens from 1994 [47], which presented UseNet users with similar news stories to follow and read. The same research group produced MovieLens in 1997, which strove to recommend movies to users based on the ratings of a community of film aficionados.
Another noteworthy project, known as the Music Genome Project, was launched by Pandora.com in 2000, consisting of a Recommender System that learnt the similarities between music genres, artists, and songs, and offered tunes matching a user's taste. From the same year is Netflix.com, which started providing movie rentals online and offered movie recommendations via its in-house Recommender System, appropriately called Cinematch.
Fast forward to the present date, and one would hardly find an online shop or an online community without a Recommender System running in a background process: learning the users' tastes, and working hard to match them to the available list of products on stock or stories on record.
Recommender Systems [48, 26] are today a ubiquitous technology which has proven invaluable when dealing with the exponential growth of information. News, products, music, posts: all can be adapted to take advantage of the algorithms' matching strength and personalized profiles. The success of the technology can easily be seen by studying one of Google's most important improvements to its search engine: browsing history tracking. It learns a user's browsing habits and leverages them to recommend ads for potential website visits.
Much research, both academic and industrial, has been carried out in the field of Recommender Systems. And while the topic is now twenty years old, the techniques developed remain rather simple. Among the various algorithms developed, two broad categories stand out: one which matches the users' tastes and is known as Collaborative Filtering [52, 18, 26], and another which matches the items' specifications and is known as Content Based [43, 48, 26].
In a Collaborative Filtering Recommender System the main goal is to learn the user's taste by means of some metric, and then match this particular target user against a large database of other shoppers whose preferences have been obtained in the past. One simple way of quantifying a user's like or dislike for an item is to use a rating. This can take a binary form, as in the case of like vs. dislike, or have more shades, as found in a 1-to-5 star rating scale.
Regardless of how the metric is obtained, the underlying assumption is that a rating is a good enough representation of the true feeling of a user towards a particular item. And once enough ratings have been collected in a community of users, a Collaborative Filtering algorithm can start working to recommend target users other items that might be of interest to them.
Most forms of Collaborative Filtering Recommender Systems work by matching a
target user to a large list of candidates, by using a similarity function. This approach
is termed User Based [52, 18, 48, 26] since the primary focus of the algorithm is finding
similar candidate users to the target user. In contrast to the User Based approach,
Recommender Systems can search for similar items to the target item, in which case
the algorithm is termed Item Based [52, 18, 48, 26].
The algorithm takes the list of rated items from the target user, matches them to the rated items from a candidate user, and computes the squared error difference or another suitable measure function, where the smaller the result, the closer the candidate's taste to the target user's. Any product in the candidate's list that is not found in the target user's list can readily be offered as a recommendation, under the assumption that similar tastes like similar items.
Many different similarity functions have been investigated to match a target user
to their fellow candidates. Examples of these functions are the Euclidean Distance,
the Pearson Correlation, and the Cosine Distance [44]. They all strive to quantify the
separation between two samples of rated items, so that a neighbourhood of similar
candidates can be identified for the target user.
In the context of Collaborative Filtering, the Recommender System is said to be
Memory Based [44] when the algorithm computes the similarities in-memory without
the need to produce a model first. It is also generally termed Neighbourhood Based
[48] because it searches for the best candidates within a set neighbourhood.
Collaborative Filtering algorithms saw a powerful alternative formulation emerge when Netflix.com proposed in 2006 a one million dollar prize for the individual or group that could improve Cinematch by 10% or better [7, 31, 32]. It took three years for the prize to finally be awarded, something that in itself speaks to the immense difficulty of improving these systems. The feat, in fact, was only achieved when the top contenders joined forces in a collaborative effort, close to the finish line, to finally grab the prize that had kept slipping away year after year.
This new line of Recommender Algorithms is termed Model Based [18, 48, 26], and makes use of matrix factorization techniques to guess at the values of latent factors, which, loosely stated, could be thought of as underlying descriptors of the domain. For example, in a movie recommender system, latent factors could describe drama vs. comedy, age appropriateness, or amount of violence, and are computed by factoring the sparse matrix of user-item ratings.
The idea of using a factoring strategy came up during the Netflix.com competition when one of the contenders described it in a blog post. Other participants quickly saw the potential and put the proposal to work. In the end, the winning team confessed that it relied heavily on matrix factorization to achieve the 10% improvement goal set by the judges.
While matrix factorization currently has the upper hand in Recommender Systems, its benefits don't come for free. For starters, the algorithm is based on factoring an inherently sparse matrix, a problem which can easily be proven not to have a unique solution. The factorization is a time-consuming and computationally expensive operation, needing to be recomputed every time a new rating is entered.
On the other hand, the complex model produced by this technique can be prepared offline, and made ready for almost instantaneous recommendations. The computation required to produce the model, however, is several orders of magnitude higher than the one needed when using Memory Based Collaborative Filtering, where the penalty is paid online.
Among Collaborative Filtering Recommender Systems, an alternative to the typical User Based strategy is to look for similar items instead of similar users. This approach was first taken by Amazon.com, which holds a patent on the technology from 2003 [36]. The algorithm is termed Item Based [52, 18, 48, 26], and its major advantage is that products change much more slowly than users' interactions with the system, and therefore computed similarities hold true for longer, allowing some form of caching to be used.
The second broad category of Recommender Systems in widespread use is termed Content Based [43, 48, 26]. Under this paradigm the recommender is built to look for items that are similar to those that the user has purchased or liked in the past. The tool of choice in this domain is TF-IDF, or Term Frequency-Inverse Document Frequency, which, loosely defined, is a measure of how important a term is inside a collection of documents.
For example, if we had five different books, one being El Quijote de la Mancha,
the term Quijote would be highly discriminative, because it will likely only appear in
one of the five books. In other words, in order to identify this book, looking for the
word Quijote would suffice. However, if all the books discussed the Quijote, then the
term would stop being discriminative and other terms would need to be used instead.
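To make the measure concrete, the following minimal sketch computes TF-IDF over a toy corpus; the corpus and the tf_idf function are our own illustrative constructions, not part of any cited system:

import math

def tf_idf(term, doc, corpus):
    # Term Frequency: how often the term appears in this document
    tf = doc.count(term) / len(doc)
    # Inverse Document Frequency: discount terms found in many documents
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / max(1, docs_with_term))
    return tf * idf

# Toy corpus of five "books", each reduced to a bag of words
corpus = [
    ["quijote", "mancha", "caballero"],
    ["drama", "love", "war"],
    ["comedy", "love", "laugh"],
    ["war", "history", "battle"],
    ["caballero", "history", "honour"],
]
print(tf_idf("quijote", corpus[0], corpus))    # high: appears in one book only
print(tf_idf("caballero", corpus[0], corpus))  # lower: appears in two books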
The TF-IDF technique is used in Content Based Recommender Systems as a way to identify important trends in the user's taste. For example, if a user likes English drama books, then a TF-IDF analysis would isolate from the database the relevant entries reflecting these specifications. In general, the user's profile is assembled as a weighted vector of item features, where the weight represents the relevance of a given feature for a given user.
Pandora.com is one good example of a Content Based Recommender System, as it catalogues songs by features, and then strives to present the listener with melodies that are similar to those previously liked. It does so not by matching a user profile to neighbour candidates, but by searching for features within songs.
One important drawback of a Content Based Recommender System is that it tends to recommend the same type of item to the user, over and over. In order to be recommended a different type of item, the user would have to have rated, or shown interest in, that new type of item. This is a problem that a Collaborative Filtering algorithm does not have, since the match there is done between neighbouring users with similar tastes but different items rated.
This particular limitation of the Content Based strategy gave rise to a Hybrid class of Recommender Systems [26, 11, 12], which were researched as a way to get the best of both worlds. In these algorithms, recommendations are produced by a weighted function of both strategies, or by feeding the second strategy the output of the first.
In this thesis, we concentrate on Memory Based Collaborative Filtering Recommender Systems, and propose ideas and algorithms to improve the similarity search, and thus the quality of the output recommendation. It is believed that there is still plenty of space to improve Memory Based approaches, so that they can compete with their heavyweight Model Based Collaborative Filtering counterparts.
1.2 Objectives of this Work
While the last decade has seen a concentrated effort in exploring Model Based approaches via Matrix Factorization techniques, Memory Based approaches still remain the principal technique employed in commercial implementations, and it is strongly believed that there is ample space to improve this technology further. For this reason, we have chosen to concentrate on learning and improving Memory Based Collaborative Filtering Recommender Systems.
Clearly, as stated in the introduction, where the Netflix.com million dollar prize competition was described, improving the error metric of a robust Recommender System is not an easy task. In that particular competition, it took the winning team three years to achieve the milestone, making use of over 100 different strategies combined to defeat Cinematch, Netflix.com's own recommender system [31]. But the prize did teach a valuable lesson: there are still more than a handful of simple strategies that can be used to better the recommendation results.
In this work we set two major objectives: (1) to fully understand and correctly employ User and Item Memory Based Collaborative Filtering Recommender Systems, and (2) to search for strategies aimed at improving the performance of the system. Four approaches of varying complexity resulted from our research. They have been developed and tested in this Thesis, and will be reported in the following sections.
1.3 Summary
In this section we have presented a brief history of the evolution of Recommender Systems, stating their raison d'être and their commercial relevance. Since their incorporation into commercial products, recommender systems have matured to be ubiquitous in front-end websites where products are offered to shoppers. The whole of the online shopping experience would be nowhere near what it is today without their contribution in matching people's tastes to back-end catalogues.
The different types of Recommender Systems currently developed have been presented, namely the Collaborative Filtering and the Content Based approach, and within the former, the two Memory Based types have been introduced, namely the User Based and the Item Based. The rest of this thesis will focus on the Memory Based Collaborative Filtering formulation.
Lastly, the objectives sought in this work have been stated: learn to use Memory Based Collaborative Filtering Recommender Systems, test them, and offer novel ways to improve their performance as measured by the MAE and the RMSE. The following sections of this thesis will present four different approaches developed and tested in an attempt to achieve this objective.
1.4 Reader's Guide
The following sections of this thesis are arranged as follows:
Chapter (2) will present the State of the Art in Memory Based Collaborative Filtering algorithms. Relevant formulations will be enumerated, and their advantages and disadvantages discussed. Chapter (3) will present four different algorithms that have been developed and tested as a means to improve the performance of the type of Recommender System studied. Chapter (4) will present our results when testing basic formulations of User Based and Item Based algorithms and our four variations, applied to the well-known MovieLens dataset. Descriptions of the experiments, results and statistical analysis will be included therein. Lastly, Chapter (5) will present a discussion of the results, our conclusions, and some ideas for future work.
Chapter 2
State of the Art
Recommender Systems received a big push forward with the adoption of the technology by Amazon.com at the end of the 1990s. Their own implementation is based on the similarities of items rather than users, and is described in [36]. This allows the company to make claims such as: "users who bought this item also bought these other items".
Another significant move forward occurred during the Netflix.com million dollar prize of 2006. The challenge there was to improve Cinematch, the proprietary Recommender System of Netflix.com, by 10% or more when measured by the Root Mean Square Error (RMSE). The prize was only awarded three years later, to a team resulting from the collaboration of some of the top contenders. An important contribution of the competition was the introduction of Model Based approaches to the problem of Collaborative Filtering.
One major drawback of the Content Based Filtering algorithm is that it tends to over-specialize the search. Once a user shows interest in a particular item, the Recommender System will work hard to find other items with similar content to offer. It will not find items unrelated to the one singled out, a feature called Serendipity [44] and deemed one of the strongest assets of Collaborative Filtering.
In a Collaborative Filtering algorithm, the match is done at the user or the item level, not with respect to content but with respect to ratings. That is, if two users show interest in the same disconnected items, say a particular book, a particular CD, and a particular video, and one of these users also shows interest in a given news article, that article can be recommended to the other user.
Clearly, with disjoint item classes the assumption is stretched well beyond what was intended, but the point is made: the second user will be presented with a completely unexpected item, which, if liked, will give rise to Serendipity, or lucky coincidence, effectively broadening the scope of the recommendation.
Schafer et al. [52] clearly summarize some of the key questions with which a Collaborative Filtering algorithm tries to assist: "Help me find new items I might like", "Advise me on a particular item", "Help me find a user (or some users) I might like", "Help our group find something new that we might like", "Help me find a mixture of new and old items", and "Help me with tasks that are specific to this domain".
To be successful at this task, a Collaborative Filtering algorithm has to tackle two
issues: identify similar users or items to the target one, and average their (potentially
weighted) ratings, in order to predict unrated items, so that the ones with the highest
score can be recommended to the user.
Both methods need to identify the N-Nearest Neighbours to the target sample, be it a user or an item. However, databases tend to have many more users than items, and so the search is much harder along the dimension of the users. In other words, in real life applications, users tend to interact (and therefore, change) at a much higher rate than items do, the latter tending to be rather fixed for a given online shop or community [18].
The scalability drawback of a User Based approach is not always the central issue at the time of choosing one algorithm over the other. In fact, the decision has much more to do with the nature of the domain. For example, [44] argues that in the context of news, the item dimension changes much faster than the user base, and so the Recommender System should favour a User Based approach.
The author also makes an important point when stating that an Item Based algorithm can be helpful in offering an explanation as to why an item was recommended. Arguing that similar users have purchased an item is less compelling than arguing that a given item is recommended because it is similar to other items purchased in the past.
The obvious drawback of the Model Based method is that the model not only takes substantially longer to compute, but needs to be computed anew if the matrix of data changes, which happens every time a new user enters a rating. Generally, small changes are left unprocessed, but when they become substantial, the model needs to be re-trained. Memory Based approaches do not suffer from this problem because the similarity is computed every single time a recommendation is sought.
As a result, a rating of "excellent" for a particular film watched a month ago could very well not stand up to the competition of a new film watched now, yet be equally rated "excellent". And since users tend not to change an old rating, both items would now hold the same level of appreciation.

But even if a user does remember an old item well, taste changes over time [44]. And if ratings are not adjusted to reflect the new preferences, the old ones will continue to affect any recommendation of the system, rendering it less effective.
Implicit ratings try to minimize the impact of human psychology by not involving the user in the derivation of the appraisal. Inferred preferences are less biased since they are not affected by what the user thinks he should rate [44]. However, an inferred system is not devoid of problems. One common strategy is to guess at the user's taste by studying how long he stayed browsing a particular item; but how is an algorithm to know whether the user is truly studying an item in agreement or disagreement? Or worse yet, what if he left the page open while going to prepare a cup of coffee?
Pages viewed and items purchased have also been used as a means to infer taste [18]. The rationale is that an item purchased is an item liked, and a page visited is a page of interest. Naturally, one could buy an item for a friend, or end up on a page by merely clicking a link by mistake. Also, in some cases, more than one person browses from the same account, effectively creating a potpourri of tastes with little targeting value.
The main advantage of implicit ratings over explicit ones is that the user is not troubled by having to answer questions in order to train the algorithm. However, both strategies can easily be combined to enhance the knowledge of a particular user's taste.
The Unary scale is used to signal that a user has shown interest in a particular item. It can be implicit, as in the case of a page view or a purchase, or explicit, by adding an appropriate button like the Google "+1" or the Facebook "Like". It tells nothing about the items the user does not like, but it is non-intrusive, and in many cases enough to learn about a user.
The Unary scale is typically inferred by the system by looking at the user's actions, instead of asking for feedback. While the user is not expressly stating a preference, his preference is still learned by way of the choices he makes.
The Binary scale is explicit, and takes the form of two buttons, one conveying a positive feeling and one conveying a negative one. An example of this scale is found on YouTube.com, where users can click on a "thumbs up" or "thumbs down" icon. Such ratings convey general feelings about an item without much gamut; however, they are good polarizers with minimal intrusiveness.
Lastly, the Integer scale is akin to a hotel's stars or a restaurant's forks rating, and is also explicit. It gives the largest spectrum to express taste, as it lets the user express himself with more than a general direction. This last rating is customarily found when rating films or books.
2.3.3 Normalizations
One of the problems faced when using ratings as a means of representing taste is that each user has a personal interpretation of the scale. While one rater might tend to give high marks to films he likes, another rater might keep the highest grades for exceptional movies. The question then is how similar both users are, and how to use their ratings appropriately.
There are two widely used systems to compensate for variations in the rating approach by different users: one is called mean-centering, and the other is called Z-score [48].
The mean-centering algorithm re-maps a user's rating by subtracting the mean value of all his ratings, effectively signalling whether the particular rating is positive or negative when compared to the mean. Positive values represent above-average ratings, negative results represent below-average ratings, and zero represents an average rating. The formulation is given by [1]:

h(r_{ui}) = r_{ui} - \bar{r}_u    (2.1)

where h(r_{ui}) represents the mean-centered rating of user u for item i, r_{ui} represents the actual rating of user u for item i, and \bar{r}_u represents the mean rating of user u across all items rated.
When adopting the Item Based approach, the equation simply becomes:

h(r_{ui}) = r_{ui} - \bar{r}_i    (2.2)

where \bar{r}_i represents the mean rating of item i across all users that rated the item.
The Z-score approach is very similar to mean-centering, but it is normalized by the standard deviation \sigma_u or \sigma_i, depending on whether we apply the User or the Item Based approach [1]:

h(r_{ui}) = \frac{r_{ui} - \bar{r}_u}{\sigma_u}    (2.3)
Normalization of ratings does come with a word of caution. If a user gave high ratings to all items reviewed, the mean-centered approach would render them average, and any rating below the highest would become a negative one, even if it is fairly high on the scale. Likewise, if a user has rated all items with the exact same grade, the standard deviation will be zero, and the Z-score won't be computable [1].
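A minimal sketch of both normalizations follows (our own illustrative code; note how the zero-deviation caveat is handled explicitly):

import statistics

def mean_center(ratings):
    # Positive results are above the user's average, negative below (eq. 2.1)
    mean = statistics.mean(ratings)
    return [r - mean for r in ratings]

def z_score(ratings):
    # Mean-centering further normalized by the standard deviation (eq. 2.3)
    mean = statistics.mean(ratings)
    std = statistics.pstdev(ratings)
    if std == 0:
        return None  # all ratings identical: the Z-score is not computable [1]
    return [(r - mean) / std for r in ratings]

print(mean_center([5, 4, 5, 3]))  # [0.75, -0.25, 0.75, -1.25]
print(z_score([5, 4, 5, 3]))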
2.4.1 Pearson Correlation
The Pearson Correlation similarity function is perhaps the best known algorithm for
User Based Recommender Systems. It was first introduced by the GroupLens project
in 1994, and used ever since as baseline for comparisons.
When using the Pearson Correlation, the similarity is represented on a scale of -1 to +1, where a high positive value suggests a strong correlation, a high negative value suggests a strong inverse correlation (when one says True, the other says False), and a correlation of zero indicates uncorrelated samples.
The User Based Pearson Correlation similarity equation is defined as:
s(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2 \sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}    (2.4)
where s(u, v) represents the similarity of users u and v, I_{uv} represents the set of items rated by both users u and v, r_{ui} and r_{vi} represent the rating of user u or v for item i, and \bar{r}_u and \bar{r}_v represent the mean rating of user u or v across all items rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2 \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}    (2.5)
where s(i, j) represents the similarity of items i and j, U_{ij} represents the set of common users who rated both items i and j, and \bar{r}_i and \bar{r}_j represent the mean rating of item i or j across all users that rated the item.
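A direct translation of equation (2.4) into code might read as follows; the dictionary-of-dictionaries layout for the ratings is an assumption of this sketch:

import math

def pearson(ratings, u, v):
    common = set(ratings[u]) & set(ratings[v])  # I_uv: items rated by both
    if not common:
        return 0.0
    # Means are taken across all items each user rated, as in equation (2.4)
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    mean_v = sum(ratings[v].values()) / len(ratings[v])
    num = sum((ratings[u][i] - mean_u) * (ratings[v][i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings[v][i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)

ratings = {"u": {"a": 5, "b": 3, "c": 4}, "v": {"a": 4, "b": 2, "d": 5}}
print(pearson(ratings, "u", "v"))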
The Cosine Distance similarity between two rating vectors \vec{a} and \vec{b} is defined as:

\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|_2 \, \|\vec{b}\|_2}    (2.6)
The User Based Cosine similarity equation, defined as a summation, becomes:
s(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui} \, r_{vi}}{\sqrt{\sum_{i \in I_u} r_{ui}^2 \sum_{i \in I_v} r_{vi}^2}}    (2.7)
where s(u, v) represents the similarity of users u and v, across all items commonly
rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \frac{\sum_{u \in U_{ij}} r_{iu} \, r_{ju}}{\sqrt{\sum_{u \in U_i} r_{iu}^2 \sum_{u \in U_j} r_{ju}^2}}    (2.8)
where s(i, j) represents the similarity of items i and j, across all users that rated
such items.
The Adjusted Cosine similarity compensates for rating biases by subtracting the appropriate mean rating. The User Based formulation is:

s(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_i)(r_{vi} - \bar{r}_i)}{\sqrt{\sum_{i \in I_u} (r_{ui} - \bar{r}_i)^2 \sum_{i \in I_v} (r_{vi} - \bar{r}_i)^2}}    (2.9)

where s(u, v) represents the similarity of users u and v, across all items commonly rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \frac{\sum_{u \in U_{ij}} (r_{iu} - \bar{r}_u)(r_{ju} - \bar{r}_u)}{\sqrt{\sum_{u \in U_i} (r_{iu} - \bar{r}_u)^2 \sum_{u \in U_j} (r_{ju} - \bar{r}_u)^2}}    (2.10)
where s(i, j) represents the similarity of items i and j, across all users that rated
such items.
2.4.4 Mean Squared Distance
The Mean Squared Distance similarity is defined as:

s(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2}{|I_{uv}|}    (2.11)
where s(u, v) represents the similarity of users u and v, across all items commonly
rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \frac{\sum_{u \in U_{ij}} (r_{iu} - r_{ju})^2}{|U_{ij}|}    (2.12)
where s(i, j) represents the similarity of items i and j, across all users that rated
such items.
Some texts [48] suggest using the inverse of the Mean Squared Distance, as in:

s'(a, b) = \frac{1}{s(a, b)}    (2.13)

where s'(a, b) represents the Mean Squared Similarity, and s(a, b) represents the Mean Squared Distance.
The Euclidean Distance similarity is defined as:

s(u, v) = \sqrt{\frac{\sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2}{|I_{uv}|}}    (2.14)

where s(u, v) represents the similarity of users u and v, across all items commonly rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \sqrt{\frac{\sum_{u \in U_{ij}} (r_{iu} - r_{ju})^2}{|U_{ij}|}}    (2.15)
where s(i, j) represents the similarity of items i and j, across all users that rated
such items.
This equation can readily be seen to be the Mean Squared Distance with an added
square root.
In some Recommender System implementations [6], the similarity is defined as:
s'(a, b) = \frac{1}{1 + s(a, b)}    (2.16)

where s'(a, b) represents the Euclidean Similarity, and s(a, b) represents the Euclidean Distance.

This has the effect of limiting the similarity to the range (0..1].
The Spearman Correlation follows the form of the Pearson Correlation, but operates on rating ranks k rather than on the ratings themselves. The User Based formulation is:

s(u, v) = \frac{\sum_{i \in I_{uv}} (k_{ui} - \bar{k}_u)(k_{vi} - \bar{k}_v)}{\sqrt{\sum_{i \in I_{uv}} (k_{ui} - \bar{k}_u)^2 \sum_{i \in I_{uv}} (k_{vi} - \bar{k}_v)^2}}    (2.17)

where s(u, v) represents the similarity of users u and v, across all items commonly rated.
In the case of Item Based Recommender Systems, the formulation becomes:
s(i, j) = \frac{\sum_{u \in U_{ij}} (k_{ui} - \bar{k}_i)(k_{uj} - \bar{k}_j)}{\sqrt{\sum_{u \in U_{ij}} (k_{ui} - \bar{k}_i)^2 \sum_{u \in U_{ij}} (k_{uj} - \bar{k}_j)^2}}    (2.18)
where s(i, j) represents the similarity of items i and j, across all users that rated
such items.
The benefit of the Spearman Correlation over the Pearson Correlation is that it
avoids the need to normalize the ratings [48].
2.5 Neighbourhood
In theory, the similarity values of all the candidate users could be used at the time of recommending an item to a target user; however, not only is this impractical with large user databases, but including users with low correlation to the target actually hinders the recommendation, as it increases its error [23].
To avoid the inclusion of users uncorrelated to the target, a neighbourhood selection step is generally included in the overall algorithm, in order to filter out those unwanted candidates. Two main methods are discussed in the literature, namely selecting the Top N-Neighbours and using a Threshold Filtering, described in Sections (2.5.1) and (2.5.2).
2.5.2 Threshold Filtering
The Threshold Filtering approach looks to meet a minimum satisfactory level of
similarity for a neighbour to be included in the output set. This strategy fixes the
problem of using a single neighbourhood size for all candidates, but is not devoid of
its own problems.
Setting too high a threshold will result in very well correlated neighbours, but for some users that are not easily correlated to those in the database, the result might be too few neighbours, and very low quality recommendations. Setting a lower threshold will increase the number of neighbours accepted, which defeats the purpose of this strategy [22].
In general, as the threshold is set tighter, the overall recommendation error drops, but fewer users can reliably be recommended items. As the threshold is loosened, the recommendation error increases, but more users can be recommended items. As always, there is a trade-off that needs to be carefully considered for each Recommender System in question.
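Both selection strategies reduce to a simple filter over the list of candidate similarities; the sketch below is our own illustrative code:

def top_n_neighbours(similarities, n):
    # Keep the n candidates with the highest similarity to the target
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def threshold_filter(similarities, threshold):
    # Keep only candidates whose similarity meets the minimum level
    return {c: s for c, s in similarities.items() if s >= threshold}

sims = {"v1": 0.9, "v2": 0.4, "v3": 0.75, "v4": 0.1}
print(top_n_neighbours(sims, 2))    # {'v1': 0.9, 'v3': 0.75}
print(threshold_filter(sims, 0.5))  # {'v1': 0.9, 'v3': 0.75}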
The Rating Prediction for a User Based system combines the ratings of the selected neighbours into a single estimate:

\hat{r}_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv} \, r_{vi}}{\sum_{v \in N_i(u)} |w_{uv}|}    (2.19)

where N_i(u) represents the set of neighbours that have item i in common with user u, of which v is a particular neighbour user, and w_{uv} is the similarity between the user u and one of its neighbours v.
For an Item Based system, the corresponding equation would be:

\hat{r}_{ui} = \frac{\sum_{j \in N_u(i)} w_{ij} \, r_{ju}}{\sum_{j \in N_u(i)} |w_{ij}|}    (2.20)

where w_{ij} is the similarity between the item i and one of its neighbours j.
The Rating Prediction can take into consideration the Rating Normalizations discussed earlier in Section (2.3.3). When using the Mean Centering approach, equation (2.19) becomes:

\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv} (r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{uv}|}    (2.21)
If instead of the Mean Centering approach we were to use the Z-score approach, equation (2.19) would instead be stated as:

\hat{r}_{ui} = \bar{r}_u + \sigma_u \frac{\sum_{v \in N_i(u)} w_{uv} (r_{vi} - \bar{r}_v)/\sigma_v}{\sum_{v \in N_i(u)} |w_{uv}|}    (2.23)
In [1], the author states that Z-score normalization, despite the added benefit of taking the variance of the ratings into consideration, is more sensitive than Mean Centering and oftentimes predicts values outside the rating scale.
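A sketch of the mean-centered prediction of equation (2.21), assuming the neighbour similarities, ratings and means have already been gathered (the tuple layout is illustrative):

def predict_mean_centered(mean_u, neighbours):
    # neighbours: list of (similarity w_uv, rating r_vi, neighbour mean r_v)
    num = sum(w * (r - mean_v) for w, r, mean_v in neighbours)
    den = sum(abs(w) for w, _, _ in neighbours)
    if den == 0:
        return None  # no usable neighbours: leave the item unestimated
    return mean_u + num / den

# Target user averages 3.5; both neighbours rated the item above their means
print(predict_mean_centered(3.5, [(0.9, 4.5, 4.0), (0.6, 4.0, 3.0)]))  # 4.2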
2.6.1 Recommender Algorithm
A pseudo-code description of the overall Recommender Algorithm is included for clarity. It comprises the following steps [1]:
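The listing below is a sketch consistent with the line references in the description that follows; it reuses the pearson function sketched in Section (2.4.1), and the names recommend, R, l and k are our own:

def recommend(R, l, k):
    # Line 1: input, the sparse User-Item matrix of ratings R (a dict of dicts)
    # Lines 2-3: output, a recommendation list of size l for each user
    # Line 4: k, the neighbourhood size, generally set by cross-validation
    out = {}
    for u in R:  # Line 5: loop through the list of users
        # Line 6: the k most similar users to the target u
        neigh = sorted((v for v in R if v != u),
                       key=lambda v: pearson(R, u, v), reverse=True)[:k]
        estimates = {}
        unrated = {i for v in neigh for i in R[v]} - set(R[u])
        for i in unrated:  # Line 7: each item on stock not rated by u
            # Line 8: combine neighbour ratings into a single estimate
            pairs = [(pearson(R, u, v), R[v][i]) for v in neigh if i in R[v]]
            den = sum(abs(w) for w, _ in pairs)
            if den:
                estimates[i] = sum(w * r for w, r in pairs) / den
        # Line 10: recommend the top-l items with the highest estimate
        out[u] = sorted(estimates, key=estimates.get, reverse=True)[:l]
    return out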
The algorithm takes as input (Line 1) the sparse User-Item matrix of Ratings R and outputs (Line 2) a recommendation list of size l (Line 3), which is set according to the needs of the system. The parameter k (Line 4), which is generally set by cross-validation, dictates the size of the neighbourhood of users used to find similar candidates to the target.
The bulk of the algorithm loops through the list of users (Line 5), and for each one of them it finds the k users most similar to this target (Line 6). The similar users are obtained by using a similarity function as described in Section (2.4), with the Euclidean Distance, the Pearson Correlation, and the Cosine Distance being the most used ones. The selection of the most similar users is done by using a neighbourhood selection algorithm as described in Section (2.5), with the Top N-Neighbours or the Threshold Filtering algorithms being the usual choices.
Then, for each item on stock that the target user has not rated (Line 7), we combine the ratings given by the candidate neighbours (Line 8) to produce a single estimate of each missing rating. Finally (Line 10), we recommend to the target user the top-l items that received the highest rating (Lines 7-9), assuming that the items not rated by the user are items that the user is not aware of.
2.7 Assessment Metrics
In this section we present the usual metrics used to assess the performance of Recommender Systems. The first metric is Coverage, which represents the percentage of the users' unrated items that the algorithm can estimate. Because of a lack of similar users to a particular target user, or too few ratings collected for the target user, not all unrated items will be estimated.
The second metric is Accuracy, and represents the error encountered when estimating an unrated item of a given target user. When assessing a Recommender System it is customary to perform a leave-one-out cross validation, where a known rating is left out of the User-Item Ratings matrix and estimated by the algorithm. Doing this for each and every known rating yields the overall Accuracy of the system.
2.7.1 Coverage
Coverage is a measure of the percentage of the estimates that the Recommender System algorithm was able to complete. Generally, because of the choice of Neighbourhood, Coverage is not 100%, but a lower value.
As stated in Section (2.5), the selection of the type of Neighbourhood strategy has an important impact on the number of ratings that the algorithm can produce. For example, when using the Threshold Filtering technique, the higher the threshold, the more similar the samples, but the smaller the neighbourhood, which in turn results in some samples not finding similar candidates, and thus their unrated items not being estimated.
A high coverage means that the Recommender System is able to recommend items
to many users, regardless of their peculiarities (some users are hard to match). This,
naturally, comes at a cost, which is measured by the performance of the algorithm.
The larger the neighbourhood, the more outliers are accepted, and the worse the
recommendations become.
2.7.2 Accuracy
Accuracy is a measure of the performance of the algorithm. It quantifies how the
estimations produced by the strategy deviate from their known values. The smaller
the error, the better the Recommender System works. Accuracy is closely related
to Coverage in the sense that once one is high, the other one tends to be low. A
compromise must be found between these two parameters for each Recommender
System to work at it best.
Two types of Accuracy measures are in widespread use in the literature; they are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Their equations follow:

MAE(r, \hat{r}) = \frac{\sum_{a} |r_a - \hat{r}_a|}{\|r\|}    (2.25)

RMSE(r, \hat{r}) = \sqrt{\frac{\sum_{a} (r_a - \hat{r}_a)^2}{\|r\|}}    (2.26)

where r_a is a known rating, \hat{r}_a its estimate, and \|r\| the number of ratings estimated.
MAE and RMSE do present some drawbacks when used to measure the Accuracy of a Recommender System [39]. For example, if a rating is not available, then that rating is not reflected in the error. Also, in the case of MAE, the error is linear, meaning that an error of a given size counts the same at the high and the low ends of the scale. The RMSE penalizes larger errors more, which might make sense in the context of Recommender Systems.
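Both metrics are a few lines of code; the sketch below (our own) assumes two aligned lists of known and estimated ratings:

import math

def mae(actual, predicted):
    # Mean of the absolute differences (equation 2.25)
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def rmse(actual, predicted):
    # Square root of the mean squared differences (equation 2.26)
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(errors) / len(errors))

actual, predicted = [4, 3, 5, 2], [3.5, 3, 4, 3]
print(mae(actual, predicted))   # 0.625
print(rmse(actual, predicted))  # 0.75, penalizing the larger errors more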
2.8.1 Significance Weighting
Both Neighbouring strategies presented in Section (2.5), when fine-tuned properly, keep poorly correlated samples from entering the recommendation and driving the performance of the algorithm down. However, what the Neighbouring strategies do not address is the fact that not all accepted samples are equally correlated to the target. Some samples share more common items than others, rendering them better at the time of computing the recommendation.
Significance Weighting proposes to weight a candidate according to the number of common items it shares with the target in question [22]. The idea is that despite a high similarity computation, if a candidate and a target share only very few items, the candidate's contribution should be weighted down, because the two are not broadly similar, only apparently similar. Neighbours with few common samples generally prove to be bad predictors for a target [23].
At the heart of this strategy lies a measure of confidence in the similarity obtained. The more items two samples share in common, the more reliable they are for producing a recommendation, and the more they should weigh as part of the overall recommendation. Likewise, samples with few shared items should be heavily weighted down so that they don't influence the recommendation as much.
The formulation of this strategy on User Based approaches is expressed as follows:

w'_{uv} = \frac{|I_{uv}|}{|I_{uv}| + \beta} \, w_{uv}    (2.27)

where w'_{uv} is the new weighting after taking into consideration the number of commonly rated items, |I_{uv}| represents the number of co-rated items, and \beta > 0 is a parameter obtained by cross-validation.
On Item Based approaches, the equation becomes:

w'_{ij} = \frac{|U_{ij}|}{|U_{ij}| + \beta} \, w_{ij}    (2.28)
An alternative formulation caps the correction at a threshold:

w'_{uv} = \frac{\min(|I_{uv}|, \gamma)}{\gamma} \, w_{uv}    (2.29)

where |I_{uv}| represents the number of items that users u and v have co-rated, and \gamma > 0 represents a threshold value of co-rated items.
And for Item Based approaches:

w'_{ij} = \frac{\min(|U_{ij}|, \gamma)}{\gamma} \, w_{ij}    (2.30)

where |U_{ij}| represents the number of users who rated both items i and j.
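In code, both corrections are one-liners; the sketch below is illustrative, and the default values for beta and gamma are placeholders that would be set by cross-validation:

def significance_weight(w, n_common, beta=50):
    # Equation (2.27): shrink w when few items are co-rated
    return w * n_common / (n_common + beta)

def capped_weight(w, n_common, gamma=50):
    # Equation (2.29): full weight only once n_common reaches gamma
    return w * min(n_common, gamma) / gamma

print(significance_weight(0.9, 3))    # ~0.05: few co-rated items, discounted
print(significance_weight(0.9, 200))  # 0.72: broad overlap, nearly unchanged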
2.8.2 Default Voting

Default Voting replaces the missing ratings of the target and the candidate with a default value, so that the similarity can be computed over a fuller set of items rather than only the co-rated ones. With Default Voting, the User Based Pearson Correlation of equation (2.4) becomes:

s'(u, v) = \frac{\sum_{i \in I_u \cup I_v} (r'_{ui} - \bar{r}_u)(r'_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_u \cup I_v} (r'_{ui} - \bar{r}_u)^2 \sum_{i \in I_u \cup I_v} (r'_{vi} - \bar{r}_v)^2}}    (2.31)

where s'(u, v) represents the similarity of users u and v, across all items commonly rated, with Default Voting enabled.
In the case of Item Based Recommender Systems, equation (2.5) becomes:

s'(i, j) = \frac{\sum_{u \in U_{ij}} (r'_{ui} - \bar{r}_i)(r'_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r'_{ui} - \bar{r}_i)^2 \sum_{u \in U_{ij}} (r'_{uj} - \bar{r}_j)^2}}    (2.32)

where s'(i, j) represents the similarity of items i and j, across all users that rated such items, with Default Voting enabled.
In both expressions, the term r'_{ab} is defined as follows:

r'_{ab} = \begin{cases} r_{ab} & \text{if } r_{ab} \text{ is defined} \\ d & \text{otherwise} \end{cases}    (2.33)

where d is the chosen default vote.
Alternative expressions for other similarity functions can readily be derived to use
Default Voting.
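A sketch of the substitution of equation (2.33); the scale midpoint used as the default vote d is an assumption of this example:

def with_default_votes(user_ratings, items, default=2.5):
    # Every item in the set receives either the actual rating or the default
    return {i: user_ratings.get(i, default) for i in items}

u = {"a": 5, "b": 3}
v = {"b": 2, "c": 4}
items = set(u) | set(v)
# Both users now rate every item, so any similarity function applies unchanged
print(with_default_votes(u, items))
print(with_default_votes(v, items))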
2.8.3 Context Aware
Collaborative Filtering Recommender Systems tend to limit themselves to processing the ratings matrix formed by users and items, which is inherently 2D. However, in many instances more information is available that could be used to improve the results of the recommendation. Context Aware Recommender Systems [48] take advantage of the added dimensions, using 3D or higher order matrices as input to the system.
For example, information about movies might include more than just the title: the genre, the year of release, the director, the actors, and so on. When searching for similar movies, a Context Aware system could filter out particular unwanted directors while including genres of interest, and thus focus the recommendation better.
Another example of the application of Context Aware Recommender Systems is recommending items for a particular time, like a weekday, a weekend, or a holiday. In a trip recommender system, the time dimension could play a prominent role, since a user might rate a tourist attraction as very good on a weekend, but very poor on a weekday. In a flight recommender system, taking into consideration the price changes occurring at various time frames, such as holidays or peak season, would account for a performance increase when recommending flights.
Clustering has also been proposed [42] as a means to compartmentalize the information available. Under this strategy, smaller dimensionality spaces are created, reducing the large search space the algorithm is usually presented with. With the smaller sub-spaces created, a conventional Collaborative Filtering algorithm can be applied on each partition, reducing substantially the computational cost, and sometimes improving the results due to specialization of the different clusters.
But clearly, without ratings the system cannot start functioning, and this problem, termed Cold Start, affects every newly deployed Recommender System. How to incite new users to share their preferences with the rest is an important facet of the success of an installation.
Like all things open to the general public, Recommender Systems are not immune to manipulation. If a group of users organise and start rating particular items high, this will affect the whole of the system. Giving positive recommendations to one's own items, and bad ones to competitors', is termed a Shilling Attack [58].
2.9.1 Sparsity
One of the hardest problems to deal with at the time of making recommendations is the sparsity of the information. A user will tend to rate only a few items, and generally the system will hold on file a large number of users with few item ratings recorded for each. In other words, the input matrix of users and items is inherently very sparse.
This has profound effects on the Recommender System. There might be lots of truly similar candidate users to a target user, but all similarity algorithms work by matching the ratings that those users shared, and without ratings, similar users will not be identified.
A common strategy to deal with the sparsity problem was presented in Section (2.8.2), where unknown rating values are replaced by averages of the known ratings entered by a user. But this strategy assumes that the missing ratings are efficiently modelled by an average function, when in fact they might well lie far below or far above the known ratings' mean.
Sparsity complicates the consolidation of neighbourhoods of similar users, and thus
affects the coverage of the algorithm, since some targets will likely not be matched
to a sufficient number of candidates from which to derive a recommendation. In fact,
the denser the dataset, the better the performance [61].
One way of breaking this pattern is to force new users to rate a number of items before allowing them to use the system. This strategy ensures there are always new ratings coming into the system, while also solving a second problem that affects every newcomer: a new user that has yet to enter his own preferences cannot be recommended anything at all by the system.
While it is possible to offer generalized untargeted recommendations to a new user, it is not a satisfactory solution, since the power of the Recommender System is completely wasted. Offering a rating that works for the majority is the exact opposite of seeking novelty and targeting the recommendation to a particular user profile, which is what Recommender Systems excel at.
Lack of new user ratings is not the only problem in a Collaborative Filtering
Recommender System. New items added will also lack any ratings, and will thus not
be recommended to anybody. This situation is termed Early Rater [44].
Here, the strategy of forcing users to rate the new item could help; however, the problem is generally viewed as less critical than the new user scenario, because novel items tend to be found fast, and ratings follow suit [52]. A good strategy with new items is to immediately put them on display so users can readily find and rate them.
2.10 Summary
In this section we have covered much of the technology behind Collaborative Filtering Recommender Systems. We have reviewed the use of Ratings, and presented some strategies employed to compensate for the differences in rating style used by each user. We have then presented the various Similarity Functions in common use in Recommender Systems, which are based on comparing candidates to a target and measuring quantitatively the difference between them. This step was said to be critical to establish a weight representing the similitude between two users. As argued, Collaborative Filtering algorithms are built on the assumption that similar users like similar items.
The choice of Neighbourhood was discussed, pointing out that the different strategies proposed in the literature attempt to separate the useful candidates from those that would lower the quality of the final recommendation. The process of making a Recommendation was presented next, with equations to incorporate corrections for how users rate items. Coverage and metrics to assess performance were commented on, as well as several improvement strategies used to deal with sparsity and unmatched numbers of co-rated items. Lastly, typical problems encountered in deployed Recommender Systems were listed.
Chapter 3
Proposals
A better idea to fill in the sparse matrix is required, and two such strategies
will be presented in the following sections. Neighbourhood selection is arguably the
most important step of the Recommender Algorithm, since choosing similar users to
a target is what enacts our assumption that similar users like similar items, and
therefore supports our subsequent recommendation.
Context Aware is another approach that is believed to be under-utilized. Most commercial Recommender Systems hold much more than user and item preferences. They have item descriptions, user profiles, demographics, statistics; and all this information can be appended to the system as an extra dimension to be taken advantage of at the time of finding similar candidates.
For example, if one was interested in finding appropriate music to recommend to a user, knowing that he likes rock but hates rap is a good start, since all CDs involving rock would be filtered in, and all those involving rap would be filtered out, raising the number of potential similar candidates: the pool would now include not only users liking rock but hating rap, but also users liking rock and liking rap.
Such a strategy has been researched in this Thesis, and will be presented in the
coming Section, with an alternative weighted formulation.
3.1.1 Default Item Voting
How to select the neighbourhood of users for which we should compute the item
average voting is an open question with no definite answer. In this Thesis we chose to
run the unflavoured formulation of the Collaborative Filtering algorithm to produce
a neighbourhood of similar candidates.
By unflavoured formulation we refer to the basic Collaborative Filtering algorithm to which no enhancement has been added. Under this algorithm, only items rated by both target and candidate count towards the similarity computation. All other rated elements are simply discarded.
The item average algorithm was subsequently carried out as a second step, over the
neighbourhood of similar elements. The underlying assumption here is that users or
items belonging to the similar neighbourhood would rate the unrated items similarly.
While not strictly true, it should hold better than averaging regardless of similarity.
The algorithm resulting from this approach looks as follows:
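A sketch of the resulting two step procedure follows; find_neighbourhood stands in for the unflavoured algorithm of Section (2.6.1), and all names are illustrative:

def default_item_voting(R, u, find_neighbourhood, k):
    # Step 1: neighbourhood of similar users from the plain, unflavoured run
    neigh = find_neighbourhood(R, u, k)
    # Step 2: fill each of u's missing ratings with the average vote that
    # the item received inside that neighbourhood of similar users
    filled = dict(R[u])
    items = {i for v in neigh for i in R[v]}
    for i in items - set(R[u]):
        votes = [R[v][i] for v in neigh if i in R[v]]
        if votes:
            filled[i] = sum(votes) / len(votes)
    return filled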
3.1.2 Indirect Estimation
Setting all unrated items to a default value, while fixing the difficult sparsity problem, does so by flattening out the variance in a user's ratings. After filling all unrated preferences with the same value, the distribution becomes quite uniform. A better strategy would attempt to predict each of the missing ratings individually.
The idea proposed here is to approach the recommendation as a two step process. We commence as usual, matching a target and a candidate in order to compute their similarity. However, instead of filtering out the pairs where one of the two ratings is missing, we estimate the missing values before proceeding. With the estimates computed, and the sparsity problem fixed, we continue the calculation of the similarity, but now with a full set of ratings.
The underlying paradigm used by the Indirect Estimation is borrowed from computer science, where many problems are solved by applying one level of indirection: for an item missing a rating, we propose to halt the process, estimate the missing rating using a standard formulation (akin to an indirection, since it kicks off a separate estimation outside the main workflow), and only then proceed as normal.
In algorithm form, the idea proposed takes the following form:
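A sketch of the indirect step follows; estimate stands in for the standard prediction of Section (2.6), similarity for any function of Section (2.4), and all names are illustrative. The comments mark the extra step referenced as Lines 6-8 below:

def indirect_similarity(R, u, v, estimate, similarity):
    items = set(R[u]) | set(R[v])
    # Lines 6-8: estimate every rating one of the two users is missing,
    # instead of discarding the pair, so the similarity computation that
    # follows sees a full set of ratings
    ru = {i: R[u][i] if i in R[u] else estimate(R, u, i) for i in items}
    rv = {i: R[v][i] if i in R[v] else estimate(R, v, i) for i in items}
    return similarity(ru, rv)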
When contrasted with the standard Collaborative Filtering Algorithm from Section (2.6.1), it is seen that the Indirect Estimation Algorithm adds an extra computational step (Lines 6-8). In this portion of the algorithm, any unrated item is first estimated before being used to find similar users. This ensures that the sparse matrix of User-Item Ratings is dense, and the values used to fill it are not mere averages but the best estimates available.
Observing that all of the candidate's unrated entries need to be estimated for
every given prediction suggests that this strategy could be split into a two-step
approach, where one of the steps could be computed offline, as a preprocessing step,
to be used during the online estimation.
This first step would involve estimating all unrated entries of the sparse matrix,
generating a dense matrix as a starting point for the Recommender Algorithm. One
major advantage of this split is that the estimation, an expensive process in itself,
now takes place well before the online recommendation, so it does not impact
the critical moment when the algorithm is used. However, since the User-Item Ratings
matrix is now densely populated, with a substantial increase in samples, the algorithm
requires a longer Processing Time to complete.
The full estimation of the User-Item Ratings matrix has one drawback, though,
and it involves changes to its data by the addition of new ratings. In principle,
these new ratings could simply replace any estimated ones and leave the others as-is.
However, after several new ratings have been added, it would be prudent to recompute
all estimated values using the newly added ratings.
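A sketch of the offline split just described, with the same illustrative Estimator interface as above; the cell-tracking set is our own assumption, so that newly added real ratings can overwrite their estimates:

import java.util.*;

public class PrepareIndirect {
    interface Estimator { double estimate(int user, int item); }

    // Cells filled offline; a real rating arriving later simply overwrites its estimate.
    static final Set<Long> estimatedCells = new HashSet<>();

    static void densify(Map<Integer, Map<Integer, Double>> ratings,
                        List<Integer> allItems, Estimator est) {
        for (Map.Entry<Integer, Map<Integer, Double>> e : ratings.entrySet()) {
            for (int item : allItems) {
                if (!e.getValue().containsKey(item)) {
                    e.getValue().put(item, est.estimate(e.getKey(), item));
                    estimatedCells.add(((long) e.getKey() << 32) | item);
                }
            }
        }
        // After enough real ratings accumulate, densify can be re-run to
        // refresh the remaining estimates, as discussed above.
    }
}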
3.1.3 Class Type Grouping
The Class Type Grouping strategy attempts to compare action to action, drama
to drama, and comedy to comedy, ensuring that only the target class sought is
matched for similarity, and disregarding the other items, which might be agreed upon
or not.
One notable drawback of this strategy is that it excludes preferences from the User-
Item Ratings matrix, effectively reducing the data available for processing. However,
since the algorithm focuses the similarity search on samples belonging to the same
class, it is expected to compensate for this, as it filters out the noise introduced by
the other, possibly conflicting, classes.
In algorithmic form, this approach proceeds as follows.
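A minimal Java sketch of the class filter, assuming item classes are encoded as the bit masks described in Section (4.1); the names are illustrative:

import java.util.*;

public class ClassTypeGrouping {
    // Keep only ratings on items whose category mask overlaps the target class.
    static Map<Integer, Double> filterByClass(Map<Integer, Double> userRatings,
                                              Map<Integer, Integer> itemCategory,
                                              int targetClassMask) {
        Map<Integer, Double> filtered = new HashMap<>();
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            int mask = itemCategory.getOrDefault(e.getKey(), 0);
            if ((mask & targetClassMask) != 0) { // shares at least one class
                filtered.put(e.getKey(), e.getValue());
            }
        }
        return filtered;
    }
    // Similarity and prediction then proceed exactly as in the unflavoured
    // algorithm, but over the filtered rating maps only.
}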
3.1.4 Weighted Ensemble
An ensemble of algorithms for Recommender Systems is nothing new. In fact, the
Netflix.com competition of 2006 was won by a team employing an ensemble of over
100 different strategies to produce recommendations [31].
The proposal put forward here is to consider a weighted ensemble of a standard
Collaborative Filtering algorithm (with or without improvements) and a Class Type
Grouping algorithm. The immediate drawback is that every estimation needs to be
processed by two separate algorithms; however, the two can be computed in parallel,
since they are completely independent.
In algorithmic form, this approach proceeds as follows.
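A minimal Java sketch of the combination; the fall-backs for the case where one of the two algorithms produces no estimate are our own assumption, and alpha is the weighting parameter discussed below:

public class WeightedEnsemble {
    // Combine the all-samples estimate with the class-grouped estimate.
    // The two estimates are independent and can be computed in parallel.
    static double ensemble(double defaultEstimate, double groupedEstimate, double alpha) {
        if (Double.isNaN(groupedEstimate)) return defaultEstimate; // grouping found no neighbours
        if (Double.isNaN(defaultEstimate)) return groupedEstimate;
        return alpha * defaultEstimate + (1 - alpha) * groupedEstimate;
    }
}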
However, if some proportion of users do agree on more than one class, the Class
Type Grouping approach would not find them, and these are arguably truly similar
users, since they agree on more than one front. Having an algorithm that weights
both approaches could therefore prove very sensible.
Mathematically, the Weighted Ensemble strategy is formulated as follows:
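In a plausible notation (the weighting parameter $\alpha$ being the one set to 0.5 in Chapter 4):

\hat{r}_{u,i} = \alpha\,\hat{r}^{\mathrm{default}}_{u,i} + (1-\alpha)\,\hat{r}^{\mathrm{grouped}}_{u,i}, \qquad 0 \le \alpha \le 1 \tag{3.1}

where \hat{r}^{\mathrm{default}}_{u,i} is the estimate computed with all samples, and \hat{r}^{\mathrm{grouped}}_{u,i} the one computed via the Class Type Grouping.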
All Item Based formulations of the algorithms proposed in this Thesis follow the
same outlines as their User Based counterparts, with the relevant additions and
modifications in place.
3.3 Summary
In this section we have described four different proposals for improving Collaborative
Filtering Recommender System algorithms. The proposals put forward concentrated
on two things: the main weakness of the general strategy, namely the sparsity of the
User-Item Ratings matrix, brought about by the fact that users tend to rate a small
proportion of the items; and the under-utilized item class information, which, we
argued, is generally available in most datasets but very seldom used.
The first strategy, termed Default Item Voting, modifies the Default User Voting
approach, which sets all unrated items of a given user to the average of that user's
rated items. Our proposal deviates from this approach in that each unrated item is set
to the average rating given to that item by all similar user candidates. The assumption
underlying this strategy is that similar user candidates tend to rate a particular
item similarly, and thus the average rating is a good default value with which to
fix the sparsity problem.
The second strategy, termed Indirect Estimation, proposed to actually estimate
every single unrated item encountered. Instead of setting all missing preferences to
a default value, the argument put forward was that estimating them should yield
better results. It was noted that estimating all unrated preferences is a computa-
tionally intensive task, but in our formulation it can be done offline, resulting in an
online recommendation of comparable complexity to any other Collaborative Filtering
approach.
The third strategy, termed Class Type Grouping, sought to make use of extra
information generally available in modern datasets, namely the class type of the
item. In cases where the system aims to recommend a particular type of item, this
approach might work well, since it filters out all items of other classes and concentrates
on finding similar elements within the class in question.
The fourth and last strategy, termed Weighted Ensemble, formulated an equation
that incorporates the standard Collaborative Filtering algorithm and the Class Type
Grouping. The rationale behind this approach was that in some cases the Class Type
Grouping might prove too restrictive, and blending its result with a proportion of a
standard Collaborative Filtering estimate might be beneficial.
The results of carrying out experiments with each of the four strategies, together
with more standard formulations will be reported in the next section of this Thesis.
Chapter 4
Experiments
4.1 Data
In order to test the proposals put forward in this Thesis, we have chosen to use
the MovieLens 100k [40] dataset, which is a very popular dataset in the context of
Recommender Systems. It has been widely used in research, and baseline figures are
readily available for comparative purposes [41].
The dataset comprises a total of 100,000 ratings from a pool of 943 users, voicing
preferences on 1682 movies. The rating scale is a typical 1-5 integer, 1 being a low
rating, and 5 a high one. An advantage of this dataset over others is that it also
includes an extra file with tag information about the movies, which allows us to
incorporate it into our Class Type Grouping proposal of Section (3.1.3).
GroupLens, the Social Computing research group at the University of Minnesota,
offers two other complementary datasets to the MovieLens 100k, namely the Movie-
Lens 1M, and the MovieLens 10M. They have not been used in this thesis due to their
size, but would be an interesting exercise for future work, particularly leveraging the
Hadoop [5] algorithms for parallel computations.
The dataset comes as a single list of 100k entries of triple values, separated by
commas (CSV file format), representing a user Id, an item Id, and the rating. It also
comes split into 5 sets of base and test files, with 4/5 and 1/5 partitions on each pair.
A separate file provided with the pack includes the categories of each movie, which
can be any one of, or a combination from, the following list:
0. unknown
1. Action
2. Adventure
3. Animation
4. Childrens
5. Comedy
6. Crime
7. Documentary
8. Drama
9. Fantasy
10. Film-Noir
11. Horror
12. Musical
13. Mystery
14. Romance
15. Sci-Fi
16. Thriller
17. War
18. Western
Each movie is tagged with one or more of these categories; Toy Story, for instance,
corresponds to the genres Animation, Childrens and Comedy.
An efficient way to code the multi-category nature of a film is to use bit-coding.
Under this schema, unknown would be 1, Action would be 2, Adventure would be 4,
Animation would be 8, and so forth in powers of 2. For example, Toy Story would
have a category of 56, since Animation + Childrens + Comedy = 8 + 16 + 32 = 56.
This format is very convenient, since determining whether two movies share a category
reduces to a simple bitwise AND operation.
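A small, self-contained illustration of the bit-coding and the AND test; the genre constants follow the powers-of-two schema above, and the second movie is a made-up example:

public class GenreMask {
    static final int ANIMATION = 1 << 3, CHILDRENS = 1 << 4,
                     COMEDY = 1 << 5, ROMANCE = 1 << 14;

    public static void main(String[] args) {
        int toyStory = ANIMATION | CHILDRENS | COMEDY;         // 8 + 16 + 32 = 56
        int romanticComedy = COMEDY | ROMANCE;                 // hypothetical second movie
        System.out.println(toyStory);                          // prints 56
        // Two movies share at least one category iff the AND of their masks is non-zero.
        System.out.println((toyStory & romanticComedy) != 0);  // prints true
    }
}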
4.2 Methodology
The state-of-the-art Apache Mahout Collaborative Filtering Recommender System
[6] was chosen as the basis for the implementation of our proposals. Mahout is not
only a robust Recommender System, but it is also part of the Apache foundation,
which makes it available as open source, meaning that the code is freely distributable.
Having Mahout as a basis for our algorithms meant that we could leverage the
machinery of this Recommender System (reading a file, selecting suitable neigh-
bourhoods, computing similarities, producing an estimate) and concentrate entirely
on the additions we intended to implement and test, namely the four proposals laid
out in Chapter (3).
The Mahout implementation already comes with the Euclidean Distance, Pearson
Correlation, and Cosine Distance similarity functions, among others; Top N-Neighbours
and Threshold Filtering functions for neighbourhood selection; Default User Voting and
Significance Weighting for missing rating inference; and User and Item Based strate-
gies. With the exception of Threshold Filtering, we have used all the options listed,
to provide a clear baseline metric for comparison.
The Top N-Neighbours requires a suitable maximum neighbourhood size to be
specified. This parameter was obtained by cross-validating the dataset for a range of
neighbourhood sizes, on User and Item Based, and on each of the similarity functions
separately, under a basic scenario (no missing rating inference or other improvement).
Suitable maximum neighbourhoods were also derived for our Class Type Grouping
proposal.
Maximum neighbourhood sizes for each basic setup were chosen to minimize the
global Accuracy (MAE) error, rather than to maximize the global Coverage. An
illustration of the effect of the neighbourhood choice on Accuracy and Coverage is
shown in Figure (4.1), where we see two important regions. In the first, both Accuracy
and Coverage improve as the neighbourhood grows, until Accuracy reaches its minimum
point. From there onwards, Coverage continues to improve, but Accuracy actually
worsens. The neighbourhood size at this minimum was chosen as the maximum
neighbourhood value.
Figure 4.1: User Based Coverage vs Accuracy for the Pearson, Cosine, and Euclidean Distance similarity functions
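The sweep just described amounts to evaluating each candidate neighbourhood size and keeping the one at the Accuracy minimum. A minimal sketch, where evaluateMae is an illustrative callback assumed to run one cross-validated pass for a given size (not a Mahout API):

import java.util.function.IntToDoubleFunction;

public class NeighbourhoodSweep {
    // Return the candidate size with the lowest cross-validated MAE,
    // i.e. the Accuracy minimum of Figure (4.1).
    static int bestNeighbourhoodSize(int[] candidates, IntToDoubleFunction evaluateMae) {
        int bestN = candidates[0];
        double bestMae = Double.MAX_VALUE;
        for (int n : candidates) {
            double mae = evaluateMae.applyAsDouble(n);
            if (mae < bestMae) { bestMae = mae; bestN = n; }
        }
        return bestN;
    }
}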
4.3 Description
The basic formulations included in the Mahout Recommender System, and used
herein, are the Euclidean Distance, the Pearson Correlation, and the Cosine Dis-
tance similarity functions. All three similarities can be used with a User Based
or an Item Based approach. Neighbourhood selection was confined to the Top
N-Neighbours, where the maximum N was established by cross-validation.
Missing ratings can be inferred by means of the Default User Voting, and all these
algorithms can be weighted by applying a Significance Weighting, which renders can-
didates sharing more items in common more important than those sharing fewer.
All these Mahout out-of-the-box strategies were run as baselines.
In Chapter (3) we put forward four different proposals for improving the Accuracy
of Memory Based Collaborative Filtering algorithms. The first proposal involved
replacing missing item ratings with the selected neighbourhood's average rating for
the item in question; we called it Default Item Voting.
The second proposal consisted of estimating missing values while estimating a
given preference, effectively introducing an extra, separate computation step; we
called it Indirect Estimation. During the implementation of this algorithm it was dis-
covered that part of it could be coded as an offline processing step, which was termed
Prepare Indirect. The Indirect Estimation then makes use of the prepared values
whenever a missing rating is found.
The third proposal involved using the extra information available in the dataset,
in the form of item categories, augmenting the typical user-item matrix with a third
dimension of labels. We termed this strategy Class Type Grouping.
Lastly, we proposed using a weighted equation of a Default Estimation (non-
class discriminative) and a Class Type Grouping formulation; we called this strategy
Weighted Ensemble. In this case, there is a proportionality parameter that needs to
be chosen. A preliminary feel for the parameter suggested that a simple weighting
with no cut-off would fare well, and this was the strategy chosen. There is ample
room to explore different settings in future work.
Among all the proposals, the first two strategies run alongside the basic for-
mulations. Strategy three, however, is really a different way to run all the previous
strategies, and hence doubles the number of algorithms. Lastly, strategy four is a
combination of the first two strategies with the third. In all, we have evaluated
a total of 144 different strategies, which we will report in the next section of this
Thesis.
As stated earlier, Apache Mahout was used as the basis for our own implemen-
tations. Mahout is written in Java, and since the source code is openly available, it
can be changed at will to add any new algorithms. All four strategies required new
functionality that was not present in the application, and it was added as necessary.
While Mahout does provide an Item Average Inferrer, it does not match our own
formulation, and thus ours was added to the software. Likewise, the Indirect Estimation
algorithm, and the Class Type Grouping are novel, and were added to Mahout. The
Weighted Ensemble, being a weighted average of previous results, was computed
entirely in Microsoft Excel.
4.4 Results
The results of the various algorithms tested in the thesis are tabulated next. They
include those that were run using out-of-the-box Mahout functionality, and those that
were added, following our proposals. Baseline values are the ones obtained by using
the Plain strategy with the different similarity functions, under User and Item Based
approaches. This Plain strategy makes no inference of missing preferences.
The first set of tables shows four basic strategies: Plain, Plain with Significance
Weighting (Plain SW), Default User Voting (User), Default User Voting with Sig-
nificance Weighting (User SW); and the first two of our proposed strategies, with
and without Significance Weighting: Default Item Voting (Item), Default Item Vot-
ing with Significance Weighting (Item SW), Indirect Estimation (Ind), and Indirect
Estimation with Significance Weighting (Ind SW). Time is in minutes.
Table 4.2: User Based with Euclidean Distance similarity, Default Estimation
Table 4.3: Item Based with Euclidean Distance similarity, Default Estimation
Table 4.4: User Based with Pearson Correlation similarity, Default Estimation
Table 4.5: Item Based with Pearson Correlation similarity, Default Estimation
          Plain    Plain SW  User     User SW  Item     Item SW  Ind      Ind SW
MAE       0.7891   0.7880    0.8162   0.8066   0.7893   0.7881   0.7994   0.7963
RMSE      1.0043   1.0029    1.0399   1.0255   1.0045   1.0030   1.0183   1.0120
Coverage  0.9712   0.9748    0.9710   0.9850   0.9712   0.9748   0.9692   0.9818
Time      3.8      3.7       15.4     15.9     5.7      5.8      65.7     66.0
Table 4.6: User Based with Cosine Distance similarity, Default Estimation
Table 4.7: Item Based with Cosine Distance similarity, Default Estimation
Tables (4.2) to (4.7) illustrate the results obtained by the different algorithms
under a Default Estimation, which refers to the estimation carried out with all the
samples, regardless of which class they belong to. This is the typical estimation
employed in most Recommender Systems. Bold measures represent the best MAE
error readings obtained in each case. Table (4.8) shows the Time required to process
the offline step of the Indirect Estimation algorithm.
We tabulate next the results obtained by using our proposed third algorithm, the
Class Type Grouping.
Table 4.9: User Based with Euclidean Distance similarity, Class Type Grouping
          Plain    Plain SW  User     User SW  Item     Item SW  Ind      Ind SW
MAE       0.7756   0.7615    0.7351   0.8199   0.7627   0.7738   0.6592   0.7512
RMSE      0.9990   0.9799    0.9450   1.0487   0.9811   0.9968   0.8677   0.9749
Coverage  0.6874   0.8000    0.8837   0.9883   0.8000   0.6874   0.6919   0.8663
Time      27.1     28.9      204.1    214.4    28.0     29.3     71.4     62.2
Table 4.10: Item Based with Euclidean Distance similarity, Class Type Grouping
Table 4.11: User Based with Pearson Correlation similarity, Class Type Grouping
Table 4.12: Item Based with Pearson Correlation similarity, Class Type Grouping
Table 4.13: User Based with Cosine Distance similarity, Class Type Grouping
Table 4.14: Item Based with Cosine Distance similarity, Class Type Grouping
        User Based                      Item Based
        Euclidean  Pearson  Cosine     Euclidean  Pearson  Cosine
Time    59.4       63.6     54.0       38.4       42.2     40.9
Table 4.15: Offline preparation Time (minutes) for the Indirect Estimation
Tables (4.9) to (4.14) illustrate the results obtained by the Class Type Grouping
algorithm under the different scenarios tested, with Table (4.15) reflecting the Time
required to process the offline step of the Indirect Estimation algorithm.
Lastly, we tabulate the results obtained by our fourth algorithm, the Weighted
Ensemble. No Processing Time is given because the values were computed with Excel,
and are readily available.
Table 4.16: User Based with Euclidean Distance similarity, Weighted Ensemble
Table 4.17: Item Based with Euclidean Distance similarity, Weighted Ensemble
Table 4.18: User Based with Pearson Correlation similarity, Weighted Ensemble
Table 4.19: Item Based with Pearson Correlation similarity, Weighted Ensemble
          Plain    Plain SW  User     User SW  Item     Item SW  Ind      Ind SW
MAE       0.7855   0.7859    0.8037   0.7969   0.7875   0.7860   0.7943   0.7931
RMSE      0.9969   0.9979    1.0226   1.0125   1.0001   0.9980   1.0099   1.0070
Coverage  0.9610   0.9711    0.9575   0.9670   0.9712   0.9711   0.9594   0.9696
Table 4.20: User Based with Cosine Distance similarity, Weighted Ensemble
Table 4.21: Item Based with Cosine Distance similarity, Weighted Ensemble
Lastly, tables (4.16) to (4.21) illustrate the results obtained by the Weighted
Ensemble algorithm. The parameter in Equation (3.1) was set to 0.5 in all cases.
4.5 Analysis
A plot of the results follows, where we show User Based results as triangles, Item
Based results as circles, and the Baseline result as a square, for all similarity functions
studied. Good results tend to be located toward the lower right-hand corner of each
graph.
Figure 4.2: User and Item with Euclidean Distance similarity Coverage vs Accuracy
Figure 4.3: User and Item with Pearson Correlation similarity Coverage vs Accuracy
Figure 4.4: User and Item with Cosine Distance similarity Coverage vs Accuracy
In order to carry out a Friedman Test [17] of the results, we re-tabulated the data,
capturing only the Accuracies and disregarding the Coverage obtained by each
algorithm. We defined the null hypothesis to be that there is no difference between
the algorithms, the alternative hypothesis to be that a difference does exist, and set
an alpha cutoff value of 0.05.
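For reference, the Friedman statistic used here, as given in [17], over N measurements and k algorithms, with R_j the average rank of the j-th algorithm, is

\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]

which is distributed according to a chi-square with k-1 degrees of freedom when N and k are large enough.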
For the User Based approach, our Accuracy values and Friedman Test results are:
Table 4.22: User Based MAE Accuracy results for all Algorithms
For the User Based case, we reject the null hypothesis, since the resulting p-value
is lower than 0.05, meaning that we do find a difference between the algorithms.
However, a plot of the intervals reveals that no algorithm has a mean rank significantly
different from our baseline, as shown in the User Based panel of Figure (4.5) of the
Friedman Test.
For the Item Based approach, our Accuracy values and Friedman Test results are:
Table 4.24: Item Based MAE Accuracy results for all Algorithms
For the Item Based case, we fail to reject the null hypothesis, since the resulting
p-value is higher than 0.05, meaning that we do not find a difference between the
algorithms. Also, no algorithm has a mean rank significantly different from our
baseline, as shown in the Item Based panel of Figure (4.5) of the Friedman Test.
Figure 4.5: Friedman Test mean rank intervals; left panel: the 24 User Based algorithms of Table (4.22); right panel: the 24 Item Based algorithms of Table (4.24)
Next, we compare the main parameters of the strategies studied by means of
ANOVA tests.
We start by looking at the three similarity functions used. In this case, we obtained
the following results:
Source   SS       df    MS       F         Prob>F
Groups   0.2511     2   0.1255   58.5402   0.0000
Error    0.3024   141   0.0021
Total    0.5535   143
Figure 4.6: ANOVA Test of the Euclidean, Pearson and Cosine similarity functions (MAE)
Next, we compared the Significance Weighting strategy against the default Equal
Weighting, obtaining:

Source   SS       df   MS       F        Prob>F
Groups   0.0049    1   0.0049   4.4158   0.0411
Error    0.0511   46   0.0011
Total    0.0560   47
Figure 4.7: ANOVA Test of Significance Weighting vs Equal Weighting (MAE)
Clearly, Significance Weighting performed worse than the default Equal Weighting
strategy, and we readily discard it.
We look at User vs Item Based strategies:
Source   SS       df   MS       F        Prob>F
Groups   0.0004    1   0.0004   0.3155   0.5800
Error    0.0293   22   0.0013
Total    0.0297   23
Figure 4.8: ANOVA Test of User vs Item Based approaches (MAE)
In the case of User vs Item Based, the results are not significantly different.
We look at the broad algorithms tested, namely, Plain, Default User Voting,
Default Item Voting, and Indirect Estimation:
Source   SS       df   MS       F         Prob>F
Groups   0.0180    3   0.0060   10.2013   0.0003
Error    0.0117   20   0.0006
Total    0.0297   23
Table 4.29: ANOVA Test of Plain, User, Item and Indirect Algorithms
Figure 4.9: ANOVA Test of Plain, User, Item and Indirect Algorithms
Lastly, we compared the Default Estimation, the Class Type Grouping, and the
Weighted Ensemble formulations:

Source   SS       df   MS       F        Prob>F
Groups   0.0015    2   0.0007   0.5510   0.5845
Error    0.0282   21   0.0013
Total    0.0297   23
Figure 4.10: ANOVA Test of the Default Estimation, Class Type Grouping and Weighted Ensemble strategies (MAE)
We do not see a significant difference among the three strategies tested.
Lastly, we include a table illustrating how a selection of the algorithms proposed
in this Thesis fared when compared to others published [41]:
Table (4.31) shows the results of some of the algorithms developed in this Thesis,
together with others published in the literature. The MyMediaLite results were
obtained using a 5-fold cross validation, as opposed to the leave-one-out cross
validation strategy used in our case. They also do not include the Coverage obtained,
which reduces the merit of the comparison. They are included here as a reference
only, as the experiments are clearly not compatible enough to be fairly compared.
4.6 Discussion
This Thesis proposed four different strategies to improve the performance of Memory
Based Collaborative Filtering Recommender Systems. The algorithms proposed and
studied here were the Default Item Voting, the Indirect Estimation, the Class Type
Grouping, and the Weighted Ensemble. Our premise in this work was that Memory
Based approaches still had ample place for improvement, and could compete with the
Model Based algorithms, popularized by the 2006 Netflix.com competition.
Tables (4.2) to (4.7) show the results of running the MovieLens 100k dataset on
a default Mahout algorithm (Plain), on the Default User Voting inferrer, and on
our proposed strategies, the Default Item Voting, and the Indirect Estimation. The
tables also show the results of applying a Significance Weighting to each of these four
algorithms, which gives a higher importance to samples sharing more points with the
target we are trying to estimate. In this set of tables, both User and Item Based
approaches have been used, under all three similarity functions: the Euclidean
Distance, the Pearson Correlation, and the Cosine Distance.
The baseline value shown in the tables is the result of the column labeled Plain,
where we readily see that the Euclidean Distance User Based approach gave a MAE
reading of 0.7433 with a Coverage of 0.9019. Good results would be those that
produce a smaller MAE while keeping the same Coverage, or improving it. Processing
Time would ideally not increase by much but, naturally, more complex algorithms
will invariably require more computing power.
In table (4.2) we see that the Default Item Voting proposal yielded a marginal
improvement over the Plain algorithm, while keeping the same Coverage, and increas-
ing Processing Time by close to 50%. However, the Default Item Voting algorithm
(Item) was substantially better than the Default User Voting (User).
The best result of the table is the one obtained by our Indirect Estimation strategy,
labeled Ind, which gave a MAE of 0.7251 with a Coverage of 0.9389; that is an
Accuracy improvement of 2.5% over the unaltered Mahout Default algorithm, with a
further 4% improvement in Coverage. The noticeable drawback was Processing Time,
which was more than an order of magnitude higher.
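The improvement figures quoted here and in the remainder of this chapter are computed relative to the baseline MAE; for instance,

\frac{0.7433 - 0.7251}{0.7433} \approx 2.5\%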
Significance Weighting (SW) can be seen not to improve the Accuracy results of
any algorithm, but to improve the Coverage noticeably. In this case, the Default Item
Voting performed equally as well as the Plain algorithm, with an increase in Processing
Time. The Indirect Estimation combined with the Significance Weighting did not yield
a good result; however, it produced a very good Coverage, comparable only to the
one obtained by the Default User Voting, albeit with a better MAE.
The Euclidean Distance similarity results obtained by the Item Based approach,
tabulated in table (4.3), are also significant. Here we see that the Plain algorithm
gave a higher MAE than the User Based counterpart of table (4.2), with a worse Cov-
erage. However, at a similar Coverage, the Item Based Indirect Estimation
produced a significantly better MAE of 0.6760, which is in fact the best result in this
cross-section of results. Comparing this MAE to the one obtained by the User Based
Indirect Estimation yields an improvement of 7%. On the other hand, the Coverage
was nearly 10% lower, and Processing Time was considerably higher.
In general, we would expect lower Coverages to give better Accuracies, since
difficult samples, likely carrying higher estimation errors, would not be considered.
This is why good results, arguably, are those pairing a low MAE with a high Coverage,
like the one obtained in table (4.2) by the Indirect Estimation, which was noticeably
better than the Plain algorithm tested.
Pearson Correlation and Cosine Distance are two other alternatives to the Eu-
clidean Distance, regularly quoted in the literature. For our dataset, these two simi-
larity functions resulted in worse MAE Accuracies, but significantly better Coverages,
when compared to the Euclidean Distance. Under these two similarity functions,
neither our Default Item Voting nor our Indirect Estimation proposals yielded better
results than the Plain or the Default User Voting strategies.
The next set of tables, (4.9) to (4.14), tabulates the results obtained by our third
proposal, the Class Type Grouping. This algorithm made use of the class information
available in the dataset to filter out samples when computing the estimations. Our
initial expectation was that this algorithm would improve the Accuracy of the
Recommender System.
The rationale behind it was that two users could share a taste for a particular class
(for example, the action genre, in movies), but not for other classes (for example,
comedy). If we are trying to estimate an action film, looking only at action films
while filtering out all other genres ensures that we look for similar tastes across this
particular genre only, unaffected by discrepancies in other genres.
However, the results, overall, gave a worse MAE than their counterparts without
grouping (Default Estimation). It is important to note, though, that in most cases
the Coverage improved; and as argued before, a higher Coverage is expected to come
with a worse Accuracy.
A result worth noting is the one obtained by the Item Based Indirect Estimation
of table (4.10). The MAE in this case was 0.6592, which is about 11% better than
our baseline of 0.7433. However, this was obtained at the rather low Coverage of
0.6919.
The last set of tables, (4.16) to (4.21), shows the results obtained by our fourth and
last strategy, the Weighted Ensemble. This algorithm averaged the results of our first
set of tables with the ones obtained by our third proposal. The idea behind it was
that a good compromise could perhaps be achieved by using a weighted measure
of the result obtained by estimating with all samples and the one obtained by
estimating with samples of the same class.
Table (4.16) in fact shows that using this strategy produced an improvement in
almost all MAE Accuracies. For example, the Plain strategy went down from 0.7433 to
0.7342, and the Indirect Estimation went down from 0.7251 to 0.7169. Noteworthy
is the fact that the Indirect Estimation resulted in a higher Coverage than our baseline,
with an Accuracy improvement of about 3.6%.
Table (4.17), which tabulates the results of the Euclidean Distance similarity under
an Item Based approach, also shows a significant result, namely the one obtained by
our Indirect Estimation proposal, with a MAE of 0.6395. That is an improvement in
Accuracy of about 14% compared to our baseline. However, the result came at the
expense of a significant reduction in Coverage, which reached only 0.6543.
In general, we see that the Euclidean Distance similarity was better than the Pear-
son Correlation and the Cosine Distance, and that the User Based Indirect Estimation
proposal improved the Accuracy results of the Plain strategy without a reduction in
Coverage, but with a significant increase in Processing Time.
A plot of the results of the various algorithms studied is seen in Figures (4.2)
to (4.4). Concentrating our attention on the Euclidean Distance, which gave the best
overall results, we see that most User Based values are in the vicinity of our baseline.
The further they lie toward the lower-right corner of the graph, the better they are,
since that means the MAE is lower and the Coverage is higher.
Item Based values tend to be less clustered, and while a few examples display a
rather good Accuracy (a low MAE), they do so with a low Coverage. One exception
is the Item Based Default User Voting algorithm, which improved the Accuracy of our
baseline while also improving the Coverage. We see this sample in Figure (4.2) as the
rightmost circle.
Figure (4.3) shows the results of the algorithms when using the Pearson Corre-
lation similarity function. Interestingly, all User and Item Based results with this
similarity function tend to be clustered about the baseline, with the User Based
estimations much closer to the baseline, and the Item Based estimations further away.
In the case of the Cosine Distance, depicted in Figure (4.4), only User Based results
are in the vicinity of the baseline, with all Item Based results being of too poor a
quality to consider further.
In order to assess whether the proposed algorithms were significantly better than
the basic ones, we subjected the results to a Friedman Test, and to various ANOVA
tests. The results of the Friedman Test can be seen in Tables (4.23) and (4.25), for
the User and the Item Based approaches, respectively. For the User Based case, the
test found a statistically significant difference between the algorithms; this was not
the case for the Item Based approach.
Figure (4.5) shows the mean rank ranges of each algorithm. While the User Based
test was significant by calculation, the graph does not appear to show any algorithm
being statistically different from any other.
Since the Friedman Test was not conclusive enough, we performed several ANOVA
tests to compare different parameters of the strategies studied. First, we looked at the
similarity functions used, namely, the Euclidean Distance, the Pearson Correlation,
and the Cosine Distance. In this case, Figure (4.6) clearly shows that the three
strategies are statistically different, with the Euclidean Distance being the best of the
three, and the Cosine Distance the worst.
An ANOVA test comparing the Significance Weighting algorithm to the default
equal sample weighting shows them, in Figure (4.7), to be statistically different, with
the default approach being superior to the Significance Weighting formulation in
terms of MAE.
ANOVA tests on User vs Item Based approaches did not show a statistical dif-
ference between the two, as depicted in Figure (4.8). This, despite the Item Based
approach with the Euclidean Distance similarity giving the lowest MAE of all
algorithms tested.
Among the four broad algorithm classes used, namely the Plain, the Default
User Voting, the Default Item Voting, and the Indirect Estimation, the Indirect
Estimation was found to be significantly different, as shown in Figure (4.9).
Lastly, we tested the Default Estimation against the Class Type Grouping and
the Weighted Ensemble proposals. In this case, Figure (4.10) shows that there does
not seem to be a statistical difference among the three algorithms. This, despite the
fact that the Weighted Ensemble under the Euclidean Distance similarity yielded the
best overall result.
Table (4.31) attempts to put the algorithms proposed in this Thesis in context.
The table shows published results of other algorithms, performing on the MovieLens
dataset used herein. MAE Accuracies obtained by approaches proposed in this Thesis
are marked appropriately in the Source column of the table.
While our baseline Plain User Based Euclidean occupies position four from the
start of the table, our best overall strategy, the Indirect Weighted User Based
Euclidean, can be found two-thirds down the table. This strategy packed together
three of our proposals, namely the Indirect Estimation, the Class Type Grouping,
and the Weighted Ensemble. The approach lowered the MAE from the baseline of
0.7433 to 0.7169, a 3.6% improvement, placing it in direct competition with Model
Based approaches based on Matrix Factorization techniques. We also point out that,
were we to disregard Coverage, the Indirect Weighted Item Based Euclidean
outperformed all strategies tabulated.
4.7 Summary
In this section of the Thesis we have presented the methodology used to test the
MovieLens dataset chosen, and tabulated the results obtained by the various algo-
rithms studied. In total, we employed the dataset to test 144 different strategies,
tabulating the Coverage, Accuracy and Processing Time obtained by each one of
them (where applicable).
The results of the algorithms were then analyzed using a Friedman Test, which
looks for statistically significant differences in the data. Finer granularity was also
sought by means of ANOVA tests on the main parameters of the algorithms tested.
Results of the statistical examinations were tabulated, with plots illustrating the
differences among the strategies examined.
Chapter 5
Conclusions and Future Work
5.1 Conclusion
This Thesis proposed four strategies aimed at improving the Accuracy of Memory
Based Collaborative Filtering Recommender Systems. The four approaches were
tested extensively, together with various basic approaches regularly employed in com-
mercial implementations. A total of 144 algorithms were studied, composed of various
combinations of our four proposals, and other known algorithms.
The well researched MovieLens 100k dataset was used in all tests. The dataset
comprises 100,000 ratings from 943 users on 1682 movies, and was first compiled
by GroupLens, the Social Computing research group at the University of
Minnesota [40].
In order to test existing approaches, and implement our proposals, the Apache
Mahout Collaborative Filtering Recommender System was used [6]. This is an
Apache-maintained open source project that has become a standard tool in
Recommender Systems research.
The four different proposals set forward in this Thesis were tested under three
different similarity functions, namely the Euclidean Distance, the Pearson Correla-
tion, and the Cosine Distance. Neighbourhood selection was fixed to the Top
N-Neighbours, deriving the number of neighbours by cross validation. Estimations of
preferences were carried out by leave-one-out cross validation, which removes the
testing sample from the set, and uses the remaining samples to estimate it.
User and Item Based approaches were both studied. Other algorithms employed
included the Default User Voting and the Significance Weighting; this last one has
been applied to all algorithms, to test the effect of the proportionality weighting on
the proposals put forward.
Our first algorithm, the Default Item Voting, performed better than its counter-
part, the Default User Voting, from which it was derived. However, this proposal was
found to be only marginally better than the baseline.
Our second proposal, the Indirect Estimation, was shown by ANOVA testing to
be statistically different from the other approaches, rendering the best results in this
Thesis. Used alone, this algorithm improved the baseline by 2.5%. Based on the MAE
Accuracy readings obtained, we found that this algorithm learned the users' preferences
significantly better than the other algorithms tested.
Our third strategy, the Class Type Grouping was expected to perform better than
it did. This approach was based on clustering the samples, which is a standard prac-
tice in Machine Learning. However, in our case, the clustering was done with complete
knowledge of the classes. When applied to the Indirect Estimation it produced results
better than our baseline, but worse than Indirect Estimation alone.
Lastly, our fourth strategy, the Weighted Ensemble improved almost all previous
Accuracies obtained by the Default Estimation. When applied to the Plain algorithm,
it improved the baseline by 1%, and when applied to the Indirect Estimation, it
improved the baseline by 3.6% without a loss in Coverage. In the context of this
Thesis, the User Based Indirect Estimation Weighted Ensemble algorithm was the
best formulation to learn a target user profile, and recommend relevant items.
Overall, our premise that Memory Based Collaborative Filtering Recommender
Systems can still be improved, and made to compete with Model Based approaches
deriving from Matrix Factorization seems to be supported by this work. We believe
that this class of Machine Learning algorithms, which have been in commercial use
for the past twenty years, still merit further research.
5.2 Future Work
The Class Type Grouping algorithm did not perform as expected. While it did
not perform worse than our baseline, it was only marginally better. Filtering out non-
relevant classes was expected to clean up our data and produce higher Accuracies. It
is believed that a closer look at this algorithm is warranted.
Lastly, the Weighted Ensemble, while producing the best result in this work,
was not tested with a wide range of parameters. The formulation shown in Equation
(3.1) has a weighting parameter that needs to be chosen. We set it to 0.5 in
this Thesis, after running several tests and noticing that the results were satisfactory.
Further studies of this parameter should be carried out. Also, one could choose
to drop one of the two terms of the equation when the number of common items falls
below a threshold. Preliminary testing of this idea yielded nothing of value, but
further testing is needed.
One aspect that was not investigated deeply in this Thesis is the effect of Coverage
on Accuracy. Figure (4.1) shows that there is a minimum to the graph. A question
that remains open is what would happen if one could recognize difficult candidates
and eliminate them from the global estimation.
This would reduce Coverage, but it is theorized that it might increase Accuracy
through the removal of error-prone estimations. We believe this idea could be
explored further, as it would lead to a Recommender System that produces better
recommendations, at the expense of breadth.
Bibliography
[4] D. Agarwal, B. C. Chen. Machine Learning for Large Scale Recommender Systems. http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
[11] R. Burke. Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction, Volume 12, Issue 4, (2002).
[12] R. Burke. Hybrid Web Recommender Systems. The Adaptive Web, pp. 377-408, Springer Berlin Heidelberg, (2007).
[17] J. Demsar. Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research, Volume 7, (2006).
[25] Y. Huang. An Item Based Collaborative Filtering Using Item Clustering Prediction. ISECS International Colloquium on Computing, Communication, Control, and Management, Volume 4, (2009).
[30] Y. Koren. Factor in the Neighbors: Scalable and Accurate Collaborative Filtering. ACM Transactions on Knowledge Discovery from Data, Volume 4, Issue 1, (2010).
[32] Y. Koren. The BellKor Solution to the Netflix Grand Prize. Netflix prize documentation, http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf, (2009).
[61] G. Xue, C. Lin, Q. Yang, W. Xi, H. Zeng, Y. Yu, Z. Chen. Scalable Collaborative Filtering Using Cluster-based Smoothing. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 114-121, (2005).
Appendix A
Fast Algorithms
An inspection of Tables (4.2) to (4.7) for the Default Estimation, and of Tables (4.9)
to (4.14) for the Class Type Grouping, reveals that the Default User Voting and the
Indirect Estimation take considerably more time to process than the Plain and the
Default Item Voting algorithms.
The reason for this excess Processing Time lies in the way the Top N-Neighbourhood
samples are computed. The Default User Voting strategy of the Apache Mahout [6]
implementation of the Collaborative Filtering algorithm applies the Default User
Voting equations from Section (2.8.2) to the neighbourhood search, and again to the
recommendation estimate.
When selecting the neighbourhood, a target needs to be compared for similarity
with every candidate in the system. In practice, when using the Plain algorithm the
computation is light because the User-Item Ratings matrix is inherently very sparse.
However, when applying the Default User Voting or the Indirect Estimation strategies
to the neighbourhood selection step, the User-Item Ratings matrix becomes dense,
and the number of computations increases significantly.
One possibility for reducing the Processing Time is to avoid applying these strategies
during the neighbourhood selection, and to apply them only to the recommendation
estimate. This achieves a fast neighbourhood selection at the cost of a penalized
recommendation estimate; the estimate computation, however, will generally still be
fast, because the target is not compared against every instance there, but only against
the selected neighbourhood set, which is at most of size N.
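A minimal sketch of this "fast" variant: the plain, sparse similarity drives the neighbourhood search, and the enhanced inference is applied only when forming the final weighted estimate, touching at most N neighbours. The interfaces and the weighted-average form are illustrative assumptions, not Mahout's API:

public class FastVariant {
    interface Similarity { double of(int a, int b); }
    interface Inference  { double infer(int user, int item); }

    // neighbourhood: at most N users, already selected with the plain similarity.
    static double recommend(int target, int item, int[] neighbourhood,
                            Similarity plain, Inference enhanced) {
        double num = 0, den = 0;
        for (int n : neighbourhood) {
            double w = plain.of(target, n);
            num += w * enhanced.infer(n, item); // enhanced step runs N times only
            den += Math.abs(w);
        }
        return den == 0 ? Double.NaN : num / den;
    }
}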
The main drawback of this strategy is that the neighbourhood selection step will
now be done using the Plain algorithm, which is deemed inferior to the strategies
we put forward. The reason for this claim is that the similarity computation between a
target and a candidate is then only carried out over the samples they both have in
common, while the other strategies attempt to bring the unrated samples into the
calculation.
We argue that neighbourhood selection is the most important step of a Collabora-
tive Filtering Recommender System, as it is the step in charge of finding those samples
similar to the target, from which to derive the recommendation. If this step is of a
lower quality, the recommendation will degrade.
Tables (A.1) to (A.9) show the results of applying the Default User Voting and
the Indirect Estimation only to the recommendation step, but not during the neigh-
bourhood selection. When comparing these results to the ones tabulated in Chapter
(4), it can be seen that Accuracy is not as good as the one achieved by also using the
algorithms during neighbourhood selection.
Two facts are worth pointing out: Processing Time is now much closer to the
Plain algorithm, and the strategies do yield a better MAE, which suggests that both
formulations, even when not used for neighbourhood selection, are still superior to the
Plain algorithm.
Table A.4: Fast Algorithms, Euclidean Distance similarity, Class Type Grouping
Table A.5: Fast Algorithms, Pearson Correlation similarity, Class Type Grouping
Table A.6: Fast Algorithms, Cosine Distance similarity, Class Type Grouping
Appendix B
Summary of Results
A summary of the results obtained by the different algorithms is tabulated here for
convenience, in Tables (B.1) to (B.3). They were used to compute the statistical sig-
nificance of the algorithms proposed, found in Section (4.5). We include the baseline
as computed with the Mahout Collaborative Filtering Recommender System (Plain),
and the various algorithms proposed, under User Based and Item Based approaches,
for the three similarity functions studied.
The upper part of each table corresponds to results obtained by the User Based
approach, while the lower part corresponds to results obtained by the Item
Based approach. Algorithms marked as "FA" refer to the fast algorithm variant
of the same strategy, in which the neighbourhood selection samples were chosen
using a Plain strategy, and the algorithm in question was applied only to compute
the estimation.
For example, the Indirect FA algorithm represents the strategy of using the Plain
algorithm during neighbourhood selection, and the Indirect Estimation algorithm
to compute the final estimation value. In contrast to the "FA" formulation, the
algorithm reported in Chapter (4) employed the Indirect Estimation in both steps:
during neighbourhood selection, and during the computation of the estimation value.
                        Uniform Weighting       Significance Weighting
Algorithm               Coverage    MAE         Coverage    MAE

User Based
Plain                   0.9019      0.7433      0.9536      0.7487
User                    0.9422      0.7709      0.9907      0.7867
User FA                 0.9019      0.7418      0.9538      0.7478
Item                    0.9019      0.7429      0.9536      0.7487
Indirect                0.9389      0.7251      0.9850      0.7662
Indirect FA             0.9019      0.7390      0.9538      0.7457
Plain Grouped           0.9325      0.7511      0.9510      0.7510
User Grouped            0.9497      0.7551      0.9712      0.7559
User Grouped FA         0.9325      0.7482      0.9510      0.7485
Item Grouped            0.9325      0.7501      0.9510      0.7502
Indirect Grouped        0.9295      0.7278      0.9572      0.7347
Indirect Grouped FA     0.9325      0.7476      0.9510      0.7479
Plain Weighted          0.8858      0.7342      0.9014      0.7325
User Weighted           0.8813      0.7440      0.9014      0.7483
User Weighted FA        0.8858      0.7322      0.9315      0.7374
Item Weighted           0.9019      0.7358      0.9014      0.7322
Indirect Weighted       0.9102      0.7169      0.9545      0.7421
Indirect Weighted FA    0.8858      0.7305      0.9315      0.7361

Item Based
Plain                   0.8551      0.7870      0.9416      0.7596
User                    0.9584      0.7373      0.9985      0.8185
User FA                 0.8551      0.7753      0.9416      0.7543
Item                    0.8551      0.7814      0.9416      0.7574
Indirect                0.8493      0.6760      0.8493      0.6760
Indirect FA             0.8553      0.7745      0.9416      0.7527
Plain Grouped           0.6874      0.7756      0.8000      0.7615
User Grouped            0.8837      0.7351      0.9883      0.8199
User Grouped FA         0.6874      0.7689      0.8000      0.7601
Item Grouped            0.8000      0.7627      0.6874      0.7738
Indirect Grouped        0.6919      0.6592      0.8663      0.7512
Indirect Grouped FA     0.6874      0.7686      0.8000      0.7587
Plain Weighted          0.6535      0.7612      0.7864      0.7439
User Weighted           0.8786      0.7239      0.9883      0.8134
User Weighted FA        0.6535      0.7520      0.7864      0.7404
Item Weighted           0.6535      0.7579      0.7864      0.7432
Indirect Weighted       0.6543      0.6395      0.8659      0.7638
Indirect Weighted FA    0.6535      0.7515      0.7864      0.7390

Table B.1: Summary of results obtained with the Euclidean Distance similarity