
Addressing Scalability Issues in

Semantics-Driven Recommender Systems


Master Thesis Econometrics and Management Science
July 2017

Supervisor: Flavius Frasincar


Co-reader: Michel van de Velden

Mounir M. Bendouch
Erasmus Universiteit Rotterdam
Abstract

Content-based semantics-driven recommender systems are often applied to the
small-scale news recommendation domain. These recommender systems improve
over TF-IDF by taking into account (domain) semantics through semantic lexicons
or domain ontologies. This work explores their application to other recommendation
domains, using the case of large-scale movie recommendations. We propose
methods to extract semantic features from various item descriptions, including
images. We scale up the semantics-driven approach with pre-computation of the
cosine similarities and gradient learning of the model. Results on a large-scale
dataset of user ratings demonstrate that semantics-driven recommenders can be
extended to more complex domains and outperform TF-IDF on ROC, PR, and
F1 metrics.
Contents

1 Introduction
2 Literature Review and Research Question
  2.1 Literature review
    2.1.1 TF-IDF
    2.1.2 CF-IDF+
    2.1.3 SF-IDF+
    2.1.4 Bing-SF-IDF+
  2.2 Research Question
3 Data
4 Methodology
  4.1 Feature Extraction
    4.1.1 Natural Language Processing
    4.1.2 Computer Vision
  4.2 Domain Ontology
  4.3 Scaling
  4.4 Preparation
  4.5 Similarity Model
    4.5.1 Definition
    4.5.2 Related Features
    4.5.3 Logistic Transformation
    4.5.4 Gradient
  4.6 Experimental Set-up
    4.6.1 Sampling
    4.6.2 Optimization
    4.6.3 Evaluation
5 Results
  5.1 Optimization
  5.2 Evaluation
6 Conclusion
Acknowledgements

I thank Flavius for his excellent supervision and effort to always review my work
carefully. Thanks to my friend Sarunas for his help and advice. I especially
thank my father for sparking my interest in mathematics, and my mother for her
amazing support.

1 Introduction
Vast amounts of information have become available through the emergence of the
Web [35]. The total amount of data has experienced an accelerating increase ever
since. For example, in just the two years leading up to 2013, the combined size of
all data on the Web grew by a factor of 10 [25]. The vastness of the Web enables
users to explore an immense variety of content, and virtually every niche and
taste for articles, movies, music, and so on has become just mouse-clicks away.
However, this abundance of choice comes at the price of information overload,
and finding the right content has become exceedingly time consuming.
Recommender systems [21] have emerged as a solution to this problem, filling
the need to filter and deliver relevant content to the user by sorting through large
amounts of information and presenting the most interesting selection in the form
of recommendations. This goes beyond plain information retrieval systems such
as search engines because recommender systems incorporate the user's prefer-
ences, interests, and needs into the filtering process. Based on this information,
they attempt to predict the rating or preference the user would give to each of
the unseen items under consideration, and recommend those for which this pre-
diction is highest. As these systems are now widely used in areas such as movies,
news, articles, and e-commerce, they have become increasingly relevant.
High performance recommender systems can be invaluable to an on-line con-
tent provider by increasing user satisfaction, because content that better matches
individual user preferences can be recommended. For advertisement driven or
pay-per-view businesses, this can boost revenues substantially by increasing view-
ing time and clicks. For subscription based businesses, the increased satisfaction
can lead to higher popularity and loyalty.
Two different approaches to recommender systems [21] can be distinguished:
collaborative filtering and content-based filtering. These two types of systems
differ in the data and underlying assumptions they use for their predictions. Col-
laborative and content-based filters can also be integrated into a single system;
these are called hybrid systems. Collaborative filtering approaches use informa-
tion on which items were selected or rated by the user in the past, and compare
this to the history of other users. The main assumption of collaborative filtering
is that if two users have similar opinions on a particular issue, they will likely
have a similar opinion on another issue. In other words, collaborative filtering
matches users with similar tastes and uses this information to make predictions.
Content-based filtering, on the other hand, uses similarities between the content
of items as opposed to similarities between different users. In other words, it uses
information about the items themselves, and assumes that users like items that
have similar contents, independent of the opinions of others. The unseen items
that are most similar to the items in the user profile are therefore recommended
to the user. The information about the item contents is typically represented in
the form of a vector of item features.


Content-based recommender systems [20] vary in the features they consider
and how they use these to calculate similarities among items. The features that
are available to the recommender depend on the item type and dataset. For
example, the director, cast, runtime, release date, genre, and plot are typically
available to movie recommender systems. On the other hand, items such as news
articles often include the author, publication date, subject, and the full article
content. Text is a common form of information about the contents of many types
of on-line items, from which features can be extracted. Texts such as descriptions,
lyrics, summaries, reviews, etc. provide rich information about the items and can
be valuable to content-based recommenders.
Term Frequency - Inverse Document Frequency (TF-IDF) [12] is a widely
used technique to estimate similarity between texts or documents. This technique
constructs for each document a feature vector out of the frequency counts of terms
in the document, and multiplies these frequencies by the inverse of the frequency
with which these terms occur across all documents. The resulting vectors can
then be directly
compared using measures such as cosine similarity [11].
Building upon TF-IDF, several other recommenders have emerged, originally
designed for the recommendation of news articles. One such recommender is
called Concept Frequency - Inverse Document Frequency (CF-IDF) [11] and uses
concepts from domain ontologies for features instead of terms. This is extended to
include related concepts by [7]. Another recommender called Synset Frequency -
Inverse Document Frequency (SF-IDF) [2] exploits semantics by counting synsets
from semantic lexicons, and is therefore said to be semantics-driven. SF-IDF has
been extended to SF-IDF+ [18] and later to Bing-SF-IDF+ [3] by adding seman-
tically related synsets to the vector and incorporating named-entity similarities
using Bing page counts. A vector of weights is used to optimize the relative
importance of the different features in the calculation of the similarities. These
developments in semantics-driven recommender systems potentially lead to sub-
stantial improvements in performance.
The promising results of the above semantics-driven recommenders lead us
to explore their application to another domain, the large-scale recommendation
of movies. Although the data that is available for movies can contain substantial
semantic information, this information is encoded differently and the items are
more complex than just collections of texts. Combined with the scale of this
problem, it can pose substantial challenges to traditional recommender systems.
We propose methods to scale up the semantics-driven approach and its optimiza-
tion procedure, extract semantic features from publicly available movie-level in-
formation, and find related concepts without the need for an external domain
ontology. This research demonstrates that these semantics-driven recommenders
can be extended to more complex domains and are able to make high-quality
recommendations on an extremely large scale.


2 Literature Review and Research Question


We provide a review of the semantics-driven recommenders mentioned in the
previous sections, and how they have evolved to provide accurate news article
recommendations. Subsequently, we explain in what areas the literature regard-
ing these recommenders can be extended, leading to our main research question.

2.1 Literature review


The recommenders that we choose to focus on were originally designed for news
recommendations and extract their features from the texts of the articles, but
they can be used to predict similarity between any two texts. As they were
applied to news articles, also called documents or items, the performance evalua-
tions in the literature are reported on a dataset of news items. The dataset used
to compare the recommenders consists of 8 user profiles that indicate whether
each of 100 news items is interesting or non-interesting. The F1-measure is
chosen as the main measure of performance because it is a balanced metric that
is widely used. Unseen items for which the normalized similarity prediction is
higher than a predetermined threshold value are recommended, after which the
F1 can be calculated. The parameters (if any) of the recommenders are opti-
mized separately for each threshold value, and the full range of threshold values
is tested.

2.1.1 TF-IDF
The most frequently used feature vector is called Term Frequency - Inverse Doc-
ument Frequency (TF-IDF) and such vectors are generally compared through
cosine distance to obtain similarities. To construct the feature vector for a doc-
ument, for each term in the text it calculates the term frequency (TF) and
multiplies this by its inverse document frequency (IDF) in all documents. TF is
a count of each term in the document, and this count is then normalized for each
document by considering the total number of terms in that document. Therefore,
the TF-value for a specific term is an indicator of the relative importance of this
term in the document. The other value, IDF, is the inverse of the count of each
term in the total collection of documents, indicating the overall uniqueness of the
term. This value is constant over all documents and can be seen as a weight that
gives relative importance to rare terms. A pre-processing step before counting
terms is performed to remove noise and increase performance. Stop words are
removed and all other words are lemmatized, so that all words with the same
root are considered to be the same term [11].
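To make this construction concrete, the following is a minimal Python sketch of TF-IDF vectors compared by cosine similarity, assuming the documents are already tokenized, stop-word-filtered, and lemmatized lists of terms; it illustrates the computation above and is not the implementation used in the cited works.

import math
from collections import Counter

def tf_idf_vectors(docs):
    # docs: list of pre-processed token lists (stop words removed, lemmatized).
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(float(n) / df[term]) for term in df}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = float(len(doc))
        # Relative term frequency multiplied by inverse document frequency.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["toy", "boy", "toy"], ["boy", "bank"], ["bank", "money"]]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))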


2.1.2 CF-IDF+
The Concept Frequency - Inverse Document Frequency (CF-IDF) is comparable
to TF-IDF, but it uses concepts instead of terms. The text is processed by a
natural language processing (NLP) engine that performs word sense disambigua-
tion, part-of-speech (POS) tagging, and tokenization to transform the text into
a collection of concept candidates. A domain ontology containing concepts and
their relationships is checked for each candidate, and if a match is found, a count
is added to that concept. Using concepts instead of terms represents the domain
semantics better because it only considers words that are relevant to the specific
domain, and it subsequently results in an observed performance improvement
over TF-IDF [11]. CF-IDF+ extends this method further by, for each concept
found in the text, adding the concepts to the vector that are directly related
in the domain ontology [7, 31]. Each type of relationship (superclass, subclass,
or instance) is given a weight to vary the overall importance of the found con-
cepts and their related concepts. These three weights are then optimized by grid
search. Including the related concepts can add more domain semantics to the
feature vector.

2.1.3 SF-IDF+
The same principle of CF-IDF is applied with Synset Frequency - Inverse Docu-
ment Frequency (SF-IDF) [2], by matching terms to synsets in a semantic lexicon,
in this case WordNet. This generally results in a longer vector than CF-IDF be-
cause more matches are found as WordNet is much larger than a typical domain
ontology. As CF-IDF+ adds related concepts, so does SF-IDF+ add synsets that
are directly related in WordNet [18]. There are 27 types of semantic relationships
in WordNet, and as in CF-IDF+, every type has a weight that is optimized, but
by a genetic algorithm due to the much larger search space. SF-IDF+ outper-
forms SF-IDF in [18].

2.1.4 Bing-SF-IDF+
While SF-IDF+ retrieves many synsets from the texts, semantic lexicons do not
contain named entities, leading it to consistently discard this information. Bing-
SF-IDF+ recognizes the named entities from the text and uses them in a separate
similarity measure between each pair of entities, called Bing distance [3]. This
measure is a function of three search result page counts: two counts for each
entity separately and one for a combination of the two. An optimized weight is
used to combine the Bing distance and the SF-IDF+ cosine similarity. This leads
to improved performance over SF-IDF+ [3].
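The precise page-count formula of [3] is not reproduced in this review. As an illustration of how a distance can be built from the three counts, one well-known measure of this family is the Normalized Web Distance:

$$d(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$$

where f(x) and f(y) are the page counts for each entity separately, f(x, y) the count for the combined query, and N the total number of indexed pages. This generic form is given only for intuition; Bing-SF-IDF+ defines its own variant in [3].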


2.2 Research Question


Considering that the line of recommenders mentioned in the previous section
has previously been used on small-scale news article recommendations, some
questions remain when applying these to other domains. Articles have a special
property in that the item is exactly the text itself. For many other types of items,
texts might only describe (some aspect of) the item. Music songs, for example,
might include a description of the artist, the album, or the lyrics. Movies can have
their plots, story-lines, posters, or other descriptions available. The extraction of
semantic features from information of a different nature, and their contribution
to semantics-driven recommender systems is therefore worth exploring.
Another aspect is the size of datasets of user ratings and items. Along with
the rapid expansion of the Web and the centralization of information into vast
datasets, recommendation problems have also increased in scale. Datasets con-
taining millions of ratings are no longer rare, especially for popular items such
as music or movies. To keep up with these developments, semantics-driven rec-
ommender systems have to be applicable on a considerably larger scale.
For this research, we choose to specifically focus on movie recommendations;
one of the domains that is substantially different from news articles. Information
on thousands of movies and millions of user ratings is available, so this field is one
that can especially benefit from scalable recommender systems. We therefore aim
to extend semantics-driven recommenders to the problem of large-scale movie
recommendations, and contribute to the literature by answering the following
main research question:

How can semantics-driven recommenders be applied to a large-scale
movie recommendation problem?

Different information is available for movies compared to news articles, so we
also want to answer:

1. How can synsets and concepts be extracted from item descriptions?

For semantics-driven recommenders that are based on concepts and their rela-
tionships, a domain ontology is needed. We will therefore also answer:

2. How to devise a domain ontology for the selected dataset?

Furthermore, it remains unclear which named entities can be used:

3. To which named entities can the Bing distance measure be applied?

And since these recommenders have previously been applied only to small datasets:

4. How can the approach be scaled to large datasets?


3 Data
The problem that we focus on in this research is the large-scale recommendation
of movies, and the GroupLens Research Project at the University of Minnesota1
makes datasets freely available for this purpose. As scaling up the approach is
part of our research question, we gather user ratings from the largest dataset
available, the MovieLens 20M2 dataset, describing 5-star user rating activity
from MovieLens3 , an on-line movie recommendation service. It contains a total
of 20,000,000 user ratings on a scale of 1 to 5 across 27,278 movies. It was
collected in the period from 9 January 1995 until 31 March 2015 from the
rating activity of 138,493 users, who had all rated at least 20 movies.
As semantics-driven recommender systems are content-based, they require, in
addition to the user ratings, item-level information as input for feature extraction.
The MovieLens data contains the title, year of release, and genre labels for each
movie. Identification numbers for The Internet Movie Database (IMDB)4 are also
attached, which allows us to augment this dataset with movie-level information
from other sources.
We use two other sources to collect movie information, and the first is the
Open Movie Database (OMDb)5 , which freely provides an API in the form of a
RESTful Web service. As IMDB ids can be used as query keys, we can match the
information to the MovieLens movies. OMDb makes movie posters available only
to patrons, so we choose to query The Movie Database (TMDb)6 API, our second
source, for this purpose instead. The posters are freely available to anyone with
a free user account, and can be requested by their IMDB ids.
The combined data contains many movie-level variables, but for this research
we choose to retain only those containing substantial semantic information, as
those could be valuable for semantics-driven recommendations. We believe the
bulk of semantic information to be represented by the names of persons involved
in the movie, the genre(s), the plot, and the poster. The involved persons are the
actor(s), director(s), and writer(s). Table 1 shows the variables we use in this
research together with their descriptive statistics.
OMDb provides us with a plot for 96.51% of the movies in the MovieLens
data, containing 63 words on average. These are the only texts in our data from
which semantic features can be extracted. We therefore discard the 3.49% of
movies for which no plot is available, which leaves us with 26,327 movies. We
notice that the plots are substantially shorter than typical news articles, and this
might reduce the amount of semantic information that can be extracted from
the texts.

1 https://grouplens.org
2 http://grouplens.org/datasets/movielens/20m/
3 https://movielens.org/
4 http://www.imdb.com/
5 http://www.omdbapi.com/
6 https://www.themoviedb.org/


Table 1: Movie information, descriptive statistics.

                      N        Missing %   Mean    Min   Max
Title (MovieLens)     27,278
Genres (MovieLens)*   27,278               1.99    1     10
Genres (OMDb)*        27,207   0.26        2.21    1     5
Directors (OMDb)*     27,003   1.01        1.11    1     41
Plot (OMDb)**         26,327   3.49        63.49   3     1471
Writers (OMDb)*       25,831   5.30        2.41    1     35
Actors (OMDb)*        26,925   1.29        3.93    1     4
Poster (TMDb)         26,827   1.65

* Multi-class variable, statistics reported for number of classes.
** Full text, statistics reported for number of words.

The plots describe the storyline without revealing any spoilers and
are similar in their intent to a movie trailer or the description one might find on
the back of a novel. An example of a plot for the first movie in our dataset, the
animation film Toy Story (1995), illustrates this:
A little boy named Andy loves to be in his room, playing with his toys,
especially his doll named Woody. But, what do the toys do when Andy
is not with them, they come to life. Woody believes that he has life (as
a toy) good. However, he must worry about Andy's family moving,
and what Woody does not know is about Andy's birthday party. Woody
does not realize that Andy's mother gave him an action figure known
as Buzz Lightyear, who does not believe that he is a toy, and quickly
becomes Andy's new favorite toy. Woody, who is now consumed with
jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are
now lost. They must find a way to get back to Andy before he moves
without them, but they will have to pass through a ruthless toy killer,
Sid Phillips.
In this specific 146-word plot, terms such as boy, toys, doll, birthday, and play
clearly contain semantic information pointing towards children's movies and
could be predictive of user preferences. Synsets can be extracted by matching
these terms to a semantic lexicon, and their related synsets would likely provide
additional value. Whether these or other terms can be matched to concepts in
a domain ontology for use in CF-IDF and related recommenders is doubtful,
as any ontology containing these concepts would likely have to be too large to
feasibly construct manually. We further notice that named-entities present in
the plots are generally names of fictional characters and would only rarely pro-
vide substantial information in Bing distance evaluations. For example, the Bing
distance between named entities Microsoft and Google would be predictive of
similarity between two news articles, but a comparison between character names
John and Mary is unlikely to have any value. When two movies are parts of
a sequel, there could be similarity between character names, and under those
conditions Bing distance could be helpful in the rare case that character names
are somehow related but not an exact match.
[Figure 1: A picture is worth a thousand words — the Toy Story poster]

TMDb provides us with a movie poster for 98.35% of the movies in the dataset,
with sufficient resolution to distinguish many semantic elements. We discard
the 1.65% of movies for which no poster is available. The posters are generally
created for the purpose of advertisement and usually show the characters and
setting of the movie. Depending on the extraction method, we can extract higher
or lower level elements from the images. Figure 1 shows an example of a poster,
again from the movie Toy Story. The poster provides a comprehensive view of
the contents of the movie and it shows toys, a cowboy, and an astronaut. It
clearly gives the impression of a movie targeted to young boys, and it contains
fewer irrelevant elements compared to the plot. The smiling characters,
cartoon-like font, and colourfulness are further indicative of a family movie.
Whether informative features
can indeed be extracted depends on the extraction method. The features could
either be represented as some hidden vector or as interpretable terms or synsets.
Variables such as genres, directors, writers, and actors provide additional
(domain-specific) information, as we intuitively expect user preferences to depend
on them. Their labels do not have to be extracted from text, so we
do not have to depend on word sense disambiguation or other natural language
processing techniques to use them in semantics-driven recommenders. As rela-
tionships between actors, directors, and other persons could be extracted from
the available information, it could potentially be used to construct a domain on-
tology. Bing distance could also be calculated between them as they are named
entities. Each movie is additionally labelled with one or more genres, which
are obtained from both MovieLens and OMDb. Genre labels vary between the
two sources, so both are retained to ensure no valuable information is lost. We
discard any movie without at least one director, actor, writer, and genre.
We conclude that the combined dataset contains substantial semantic infor-
mation, and after discarding all movies that have one or more missing value(s) in
any of the variables, we retain 24,477 movies to be used for this research. We can
reasonably expect these variables to be available to a recommender system and
removing said movies therefore does not impact the general applicability of our
approach. Furthermore, this affects only 0.83% of the user ratings, so it seems
that the discarded movies had few ratings to begin with. We believe that Bing
distance metrics can be applied to the persons involved with the movie, such as
the directors, actors, or writers as they are named entities and similarities among
them are relevant to user preferences, in contrast to character names from the
plots. The titles of the movies contain a negligible number of terms or synsets,
but could also be used as named entities to calculate Bing distances.
In this research, we already represent all these entities as concepts in a domain
ontology, and we also note that the large number of named entities makes the
pair-wise search queries infeasible. We therefore decide to exclude Bing distances
from our recommender system and focus on semantics from terms, concepts, and
synsets, in line with TF-IDF [11], CF-IDF(+) [7, 11, 31], and SF-IDF(+) [3].


4 Methodology
This section covers our research methods, starting with the extraction of semantic
features from both plots and posters. We then describe how we can find related
concepts without the need for an external domain ontology. Subsequently we
define the similarity model, and show how a small modification allows massive
scalability. We derive the gradient and explain its use in stochastic gradient
descent (SGD). The section is then completed with a description of our procedure
for sampling observations, training, validation, and evaluation.

4.1 Feature Extraction


We extract several types of features from our dataset, in addition to the concepts
which are readily obtainable from the directors, actors, writers, and genre labels
for each movie. We explain how natural language processing (NLP) is used to
extract both terms and synsets from the plots. We further describe how we
use computer vision techniques to extract a vector of synset probabilities and a
visual-semantic embedding vector from each movie poster.

4.1.1 Natural Language Processing


To extract terms and synsets from the plots, we use natural language processing
(NLP) techniques that can filter out noise from the plots and exploit known
regularities in natural language. These techniques are implemented through the
NLTK7 package in Python 2.78.
As the used techniques only need sentence-level information, we can split up each
plot into a set of sentences and process them separately. This is done by NLTK
through known properties of sentences, such as that they generally end with a
period. Each sentence is further split up into a list of words, also called tokens, in
a process called tokenization. This is done using known properties of words, such
as that they usually occur in the English dictionary and are separated by spaces or
commas. Using part-of-speech (POS) tagging we add to each word the POS, such
as noun, verb, adjective, or adverb as a property. We remove stop words, which
are extremely common words that contain negligible semantic information: the,
is, at, etc.
For extracting the terms, we apply the Porter [29] stemming algorithm to
each word to reduce the words to their roots. For example, fisher, fishing, and
fished are reduced to the same root fish. This way words with a similar basic
meaning are considered to be the same term.
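A minimal sketch of this stemming step using the PorterStemmer shipped with the NLTK package named above; the exact outputs are determined by the Porter rules.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Reduce inflected forms to their (approximate) common root.
for word in ["fishing", "fished", "fisher"]:
    print(word, "->", stemmer.stem(word))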
For the extraction of synsets, we apply the Adapted Lesk [1] word sense
disambiguation (WSD) algorithm, an improved [19] version of the classic Lesk
[16] algorithm, to each word.

7 http://www.nltk.org/
8 https://www.python.org/

WSD addresses the problem of identifying the sense of
a word - the meaning in its context. For example the noun bank has multiple
meanings that are very different and the meaning in a text has to be judged from
the context in which it is used. Adapted Lesk does this by calculating a similarity
between the context (sentence) of the word in the text and the definition of each
sense of the word from the dictionary (in our case WordNet). The sense with the
highest similarity is then identified. We consider only senses that have the same
POS tag as the word from the text. Only if no sense can be found, we consider
all senses with any POS. The synset containing the identified sense of the word
is extracted.
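A minimal sketch of the disambiguation step; note that NLTK's built-in nltk.wsd.lesk is a simplified Lesk, so this only illustrates the interface, not the Adapted Lesk variant used in this work.

from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

tokens = word_tokenize("He deposited the money in the bank")
# Restrict candidate senses to the word's POS (here: noun), as described above.
synset = lesk(tokens, "bank", pos=wn.NOUN)
if synset is not None:
    print(synset, ":", synset.definition())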

4.1.2 Computer Vision


We also extract semantic features from the movie posters, for which we use
techniques from computer vision, a field in which algorithms are used to gain
high-level understanding from visual information, such as digital images.
Convolutional neural networks (CNNs) [9] are state-of-the-art models in com-
puter vision and dominate the field in terms of performance on a variety of com-
puter vision tasks, such as optical character recognition (OCR) [4,5], facial recog-
nition [15, 17], and face detection [10]. On some object classification tasks [27] they
can even rival human performance [23]. CNNs are types of feed-forward artifi-
cial neural networks (ANNs) [32], which are models inspired by the connectivity
patterns of neurons (called nodes in ANNs) in the animal and human brains.

Artificial Neural Networks. We briefly explain feed-forward ANNs. Let us
consider one node in such a network. It is connected to one or more nodes from
which it receives signals. It combines these signals into one value, called an
activation. This activation is then sent as a signal to one or more other nodes.
The specific connection pattern of the model determines for each node to which
other nodes it sends a signal, and the weights (optimizable parameters) of these
connections determine how large these signals are. Usually the activation of a
node is the sum of incoming signals, each weighed by the connection weight.
Subsequently an activation function is applied to calculate the output activation
that is propagated to the next nodes. The input nodes are activated by input
information, their input activations are thus input values from an observation in
the data. The weights of the connections can be trained such that the output
activation of the output nodes achieve the target values of the observations. The
network is called feed-forward because the connections between the nodes are not
recurrent. In other words, the signals always move in one direction, from
input to output. The nodes in between the input and output nodes are called
hidden nodes. [32]
The most common feed-forward ANN is the multilayer perceptron (MLP)
model [32], which feeds the input data through multiple layers of nodes, where
each node is fully connected to each node of the next layer. In other words, the
signals are fed to an input layer, propagated through multiple hidden layers, to an
output layer. The number of hidden layers is often referred to as the depth of the
model and ANNs with many hidden layers are also called deep neural networks
(DNNs). Because of this regular connection pattern, the weights connecting one
layer to the next can be represented in a matrix, which allows for efficient parallel
computations. We explain this by considering one layer, say layer_1, of d_1 nodes
and assuming we know its vector a_1 of d_1 output activations. We further assume
that layer_2 has d_2 nodes. We denote F_2(·) as the activation function of the nodes
in layer_2. We introduce a d_2 × d_1 weights matrix W_2 and a bias vector b_2 of
length d_2, both consisting of learnable parameters (weights). The output
activations a_2 of layer_2 are then calculated as follows:

$$\vec{a}_2 = F_2(W_2 \vec{a}_1 + \vec{b}_2) \qquad (1)$$
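A minimal NumPy sketch of Eq. 1 for one fully connected layer; the tanh activation is an illustrative assumption.

import numpy as np

def layer_forward(a1, W2, b2, F2=np.tanh):
    # Eq. 1: output activations of layer 2 from those of layer 1.
    return F2(W2.dot(a1) + b2)

rng = np.random.RandomState(0)
d1, d2 = 4, 3
a1 = rng.rand(d1)               # output activations of layer 1
W2 = 0.1 * rng.randn(d2, d1)    # d2 x d1 weights matrix
b2 = np.zeros(d2)               # bias vector of length d2
print(layer_forward(a1, W2, b2))   # vector of d2 activations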

The full network can then be described as a composition of multiple functions
like Eq. 1. MLPs are often trained through the backpropagation [22] method,
which is a two step process of propagation and weight optimization. The first
step propagates the input values through the network by repeatedly applying
Eq. 1 to obtain the output values, which are then compared with the target
values to obtain an error value. In other words, the model is evaluated on the
training data. In the second step, the error is propagated back up the network
(backpropagation) and used to calculate a gradient with respect to each weight.
This gradient is then used in a gradient descent algorithm to minimize the loss
by optimizing the weights. These two steps are repeated to iteratively optimize
the weights and reduce the error on the training data.

Convolutional Neural Networks. By explaining how we would apply
a neural network model to computer vision using images as inputs, we show
the advantages of convolutional neural networks (CNNs) for computer vision.
Each pixel in a digital image is represented by 3 colour values for red, green,
and blue (RGB). If the input image is w pixels wide and h pixels high, it can be
represented as a matrix of 3 × h × w values. Portable Network Graphics (PNG),
the most common lossless digital image compression format on the Web [26],
encodes an image's pixels in a 24-bit RGB palette (8 bits per colour). Computer
vision libraries such as OpenCV9 convert this to a 3 × h × w matrix of unsigned
8-bit integer values ranging from 0 to 2^8 − 1 = 255. As most neural network
libraries such as Theano10 take floating-point numbers as inputs, usually single-
precision floats (32 bits), the matrix is normalized by multiplying with 1/255 to
obtain a matrix of values in the range [0, 1].
9 http://opencv.org/
10 http://deeplearning.net/software/theano/
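A minimal sketch of this conversion, for a hypothetical file poster.png; note that OpenCV loads an h × w × 3 uint8 array with channels in BGR order.

import cv2
import numpy as np

img = cv2.imread("poster.png")            # h x w x 3, uint8, values 0..255 (BGR)
img = img.astype(np.float32) / 255.0      # single-precision floats in [0, 1]
img = np.transpose(img, (2, 0, 1))        # reorder to 3 x h x w for the network
print(img.shape, img.dtype)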


In order to use the image matrix as input for an MLP, it needs to be flat-
tened to a vector of length 3hw, because the layers arrange the nodes in only one
dimension. This also means that the model does not scale well to high-resolution
images as the number of connections between the input layer and the fully con-
nected hidden layer explodes. Furthermore, the spatial information of the pixels
is lost and has to be implicitly learned through training.
Convolutional neural networks (CNNs) are types of feed-forward ANNs that
can make use of spatial ordering by arranging their layers of nodes in three
dimensions. Each layer can be seen as a collection of different learnable filter
kernels that are convolved across the height and width of the entire image and
extend through the full depth, which is 3 for RGB images. Each filter is applied
to the full image with the same weights, so the number of weights is independent
of the image size. A convolutional layer contains multiple of these filters,
and each filter outputs a 2-dimensional map of the results of its convolutions over
the image. If the filters are of shape k × k, we say that the receptive field is k
pixels, and usually k ∈ {3, 5, 7}. The resulting feature maps from the first layer
are slightly reduced to spatial size (h − k + 1) × (w − k + 1), as the filters can
only be applied to full k × k patches. If we have n filters in a layer, we can
represent the feature maps in an n × (h − k + 1) × (w − k + 1) matrix to be used
for the next layer, where again different filters can be applied, this time with
depth n. This gradually reduces the size of the spatial activation matrix as it is
propagated through the layers. Using a stride of s in the convolutions means the
filters are applied only to each s-th pixel, so the resulting maps are reduced more
rapidly to ((h − k)/s + 1) × ((w − k)/s + 1). Another operation called
zero-padding adds a border of p pixels to each of the four sides of the spatial
input. If zero-padding is applied with p = (k − 1)/2 pixels on each side, the
output map remains the same size as the input map. In between convolutional
layers, max-pooling layers can be used, which slice an n × h × w spatial input
into tiles of size m × m (commonly m = 2), over which only the maximum value
is propagated. The output of a max-pooling layer is therefore of size
n × (h/m) × (w/m). These layers have the benefit of reducing the size of the maps
and providing some form of spatial invariance. After a number of hidden
convolutional and max-pooling layers, the output is flattened to an nhw-length
visual feature vector, where h and w are substantially smaller than the original
input image size. The flattened layer connects, usually through additional fully
connected layers, to the vector-valued output layer. This output layer can for
example have a softmax activation function in the case of image classification.
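The size bookkeeping above can be summarized in a small helper that simply evaluates the output-size formulas.

def conv_output_size(h, w, k, s=1, p=0):
    # Spatial size after a k x k convolution with stride s and zero-padding p.
    return ((h + 2 * p - k) // s + 1, (w + 2 * p - k) // s + 1)

print(conv_output_size(224, 224, k=3, p=1))   # (224, 224): output kept same size
print(conv_output_size(224, 224, k=3))        # (222, 222): no padding
print(conv_output_size(224, 224, k=2, s=2))   # (112, 112): same as 2x2 max-pooling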

Synset Vectors. The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC)11 is a competition where algorithms compete on object detection and
image classification. The classification competition challenges algorithms to
classify an image into 1,000 categories that are each represented by a synset.

11 http://www.image-net.org/challenges/LSVRC/


Figure 2: Crop three windows of 224 × 224, predict class (synset) probabilities
for each; feature values are the maximum probabilities of each synset.
[Poster with three overlapping windows (Window 1, Window 2, Window 3);
top-10 predicted synsets: Comic book 0.8770, Book jacket 0.0336,
Toyshop 0.0294, Shoe shop 0.0147, Jigsaw puzzle 0.0098, Bookshop 0.0042,
Puck 0.0031, Tray 0.0030, Confectionery 0.0016, Tobacco shop 0.0013]

In 2014,
the highest-performing submission was a 19-layer deep convolutional neural net-
work from the Visual Geometry Group of the University of Oxford [24] called
VGG19. On the test set, for 81.1% of the images the top-5 predictions included
the correct class. Human performance on this metric is estimated to be around
88-95% [23]. The trained parameters for this model are publicly available on the
web page12. VGG19's convolutional layers each have a filter size of 3 × 3, and the
input to each of those layers is zero-padded with p = 1 such that the outputs are
of equal spatial dimensions. Down-sampling occurs only through max-pooling
layers. Two fully connected layers are added and connected to a 1,000-dimen-
sional softmax output layer. As substantial semantic content of the posters can
be described by the objects that can be recognized in them, we can use VGG19
to extract meaningful synset vectors. The model takes a 224 × 224 colour image
as input, represented as a 224 × 224 × 3 matrix of RGB pixel values. It outputs
a vector of 1,000 probabilities, one for each synset. The posters are larger than
224 × 224 pixels and are higher than they are wide, so we first downscale them
while keeping the aspect ratio such that the width becomes 224. The height is
then still larger than 224 but never larger than 3 × 224, so we can take 3 vertically
overlapping 224 × 224 windows of the poster as inputs to ensure every part of the
image is covered (Fig. 2). We evaluate the model on each window, after which
we take the maximum of the 3 output values for each class (synset). We apply
this procedure to the posters to obtain feature vectors of 1,000 synset values.
12 http://www.robots.ox.ac.uk/vgg/research/
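A sketch of the three-window procedure, assuming the poster is already rescaled to width 224 and represented as a 3 × h × 224 array; predict_fn stands for a hypothetical wrapper that maps one 3 × 224 × 224 window to the 1,000 VGG19 synset probabilities.

import numpy as np

def poster_synset_vector(poster, predict_fn):
    # poster: 3 x h x 224 array with 224 <= h <= 3 * 224.
    h = poster.shape[1]
    tops = [0, (h - 224) // 2, h - 224]          # top, middle, bottom windows
    windows = [poster[:, t:t + 224, :] for t in tops]
    probs = np.stack([predict_fn(w) for w in windows])
    return probs.max(axis=0)                     # element-wise max over 3 windows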


Visual-Semantic Embeddings. The 1,000 class predictions for the ImageNet
challenge map to a vocabulary of 1,000 synsets, and as such they can be consid-
ered a semantic feature vector. These values are however intended to classify an
image and do not necessarily describe a poster fully. We therefore also consider
another approach called visual-semantic embedding (VSE) [14] that has been
used for the challenge of image captioning [30], where the aim is to generate
a natural language caption that best describes the content of an input image.
This is done by mapping the image and the sequence of words of a caption to
a common feature space, called a visual-semantic space, in which semantic dis-
tances between an image and a caption can be calculated. From this distance
metric the semantic similarity between an image and a caption can be estimated
and the nearest-neighbour caption can be returned. Our goal to represent the
posters in a semantic space can be considered equivalent to mapping them to a
visual-semantic embedding.
The embeddings can be learned with knowledge of pairs of images and their
captions. In visual-semantic space, an image and its caption should be close. Let
us define this closeness as the cosine similarity between the image's embedding
m ∈ R^n and the embedding of the caption c ∈ R^n. In a properly constructed
visual-semantic space, for the image and its caption, cos(m, c) should be relatively
high. Conversely, a non-descriptive caption c_r should lead to a relatively low
cos(m, c_r). As the image and caption are mapped to the same visual-semantic
space, we can also expect that the more semantically similar poster_1 and poster_2
are, the higher their cos(m_1, m_2), which is exactly the aim of our semantics-
driven recommender.
Mapping an image to a visual-semantic space is done in [14] by a form of
transfer learning [33], where the 4,096 visual features from the second-to-last
layer of the pre-trained VGG19 model are transferred to a new model in which
they are multiplied by a matrix of trainable weights θ_m, resulting in an embedding
vector m ∈ R^n. Transfer learning simplifies the problem from learning the visual-
semantic embedding from raw pixels to learning it from high-level visual features
trained on the ImageNet Challenge.

Another trainable neural network with weights θ_c transforms the text of the
caption into an embedding vector c ∈ R^n. We denote a non-matching caption
for image embedding m as c_r and a non-matching image for caption embedding
c as m_r. All weights θ = {θ_m, θ_c} are trained simultaneously to minimize the
following pairwise ranking loss:

$$\sum_{m}\sum_{k} \max\{0,\ \alpha - s(m, c) + s(m, c_r)\} + \sum_{c}\sum_{k} \max\{0,\ \alpha - s(c, m) + s(c, m_r)\} \qquad (2)$$

where s(m, c) = m · c is the scoring function and α a margin. As [14] first scale the
embedding vectors m and c to unit norm, s(m, c) = cos(m, c). For the purpose of
extracting semantic features from the movie posters, we are interested in the
visual-semantic embeddings m of the images. The authors of [14] have made an
embedding matrix
to generate 1,024-dimensional visual-semantic embeddings publicly available13 .
This matrix was trained to optimize Eq. 2 on public image captioning datasets.
Our procedure consists of using this pre-trained embedding matrix on the 4,096-
dimensional VGG19 visual feature vectors of the movie posters to obtain their
visual-semantic embeddings.
The VSE vectors have a more solid theoretical foundation compared to the
synset vectors, being derived from a state-of-the-art method whose purpose is
to translate images to text. This is a more direct way of achieving our goal of
extracting semantic features, and we expect this to improve recommender per-
formance compared to VGG19 synset vectors. The VSE method, however, also
introduces a disadvantage for this application. The features are hidden, meaning
that they have no natural interpretation, so linking them to an ontology or seman-
tic lexicon would be complicated. Only significant and substantial performance
gains could make us prefer VSE over the VGG19 synset vectors.

4.2 Domain Ontology


Domain ontologies are considered resources that are external to the dataset from
which the concepts are derived, and consequently have to either be obtained
through external sources or manually constructed specifically for the purpose of
the recommender system. We now propose a general method as an alternative to
external domain ontologies, solely based on the dataset itself. This allows us to
find concepts related through a common item by a series of matrix multiplications
of binary matrices.
We have 25,138 movies in our dataset, covering 12,231 directors, 45,393 actors,
27,415 writers, all 19 MovieLens genres (mlg), and all 27 OMDb genres (omg).
Based on the average number of these concepts per movie (Table 1) we have an
estimated 292,857 bidirectional movie-concept relations that implicitly form a
virtual domain ontology, which can be used to find related concepts.
We summarize the notation for this section in Table 2. Let classes be the set
of existing concept classes, so in our case classes = {director, actor, writer,
mlg, omg}. We denote the number of classes (or equivalently the size of the
set classes) as k and the number of concepts (instances) in class c ∈ classes
as n_c. Let M_c be a z × n_c binary matrix that encodes the occurrences of
concepts of class c in each of the z items, with its (i, j)-th element denoted
as M_c(i,j). The occurrences are encoded such that M_c(i,j) = 1 if concept
j is present in item i, and M_c(i,j) = 0 otherwise. In other words, M_c(i,j) =
I(item i contains concept j), with I(·) the indicator function.
13 https://github.com/ryankiros/visual-semantic-embedding


Table 2: Notation.

k                      Number of concept classes.
z                      Number of items (movies).
n_c                    Number of unique concepts (instances) in class c.
concept_c(j)           j-th concept of class c, with j ∈ [1, n_c].
item_i                 i-th item, with i ∈ [1, z].
M_c ∈ R^{z × n_c}      Binary; encodes occurrences of class c.
M_c(i,j) ∈ {0, 1}      (i, j)-th element: whether item i has concept j.
relation c             Connects items that have shared concepts of class c.
R_c ∈ R^{z × z}        Encodes all occurrences of relation c.
R_c(u,v) ∈ {0, 1}      (u, v)-th element: whether item u is related to item v
                       through relation c.
M_b^a ∈ R^{z × n_b}    Binary; encodes related occurrences of class b found
                       through relation a.
M_b^a(i,j) ∈ {0, 1}    (i, j)-th element: whether any item related to item i
                       by relation a has concept j.

This also means the sum of row i is the number of concepts found in item i, and
the sum of column j is the number of items in which concept j is found. All these
sums are at least 1 because all concepts occur at least once and every movie has
at least one concept of each class. An example of a matrix M_c in our case is
M_director, a binary 25,138 × 12,231 matrix that encodes the directors found in
each of the movies.
We further denote with concept_c(j) the j-th concept of class c and the i-th
item with item_i. We show that the k matrices M_c can be used to find related
concepts for an item through relations of the form:

$$item_u \xrightarrow{\text{contains}} concept_{a(i)} \xrightarrow{\text{occurs in}} item_v \xrightarrow{\text{contains}} concept_{b(j)} \qquad (3)$$
$$\forall\, a, b \in classes, \quad u, v \in [1, z], \quad i \in [1, n_a], \quad j \in [1, n_b]$$
Suppose item_u contains a concept of class c. If we find another item_v that
contains that same concept, we say that item_v is related to item_u through relation
c. The number of possible relations is therefore equal to the number of classes
k. For example, if we find that an actor from movie u also plays in movie v, we
call these movies related through the relation actor. Note that the relations are
bidirectional and that a movie is always related to itself.
The existence of a relation c between item_u and item_v is equivalent to the
existence of a j ∈ [1, n_c] for which M_c(u,j) M_c(v,j) = 1. This is the case if and
only if Σ_{j=1}^{n_c} M_c(u,j) M_c(v,j) ≥ 1. This expression is also the definition
of the dot-product between the u-th and v-th rows of M_c. If we calculate the
z × z symmetric matrix R_c = M_c M_c^⊤, by definition of the matrix multiplication
the element R_c(u,v) is the dot-product of the u-th row of M_c and the v-th column
of M_c^⊤. The v-th column of M_c^⊤ is also the v-th row of M_c. Therefore
R_c(u,v) ≥ 1 if and only if item_u and item_v are related through relation c.

Consider that M_b(v,j) = 1 if and only if item_v contains concept_b(j). We further
know that R_a(u,v) ≥ 1 if and only if item_v is related to item_u through relation
a. Therefore item_u has related concept_b(j) through relation a with item_v if
and only if M_b(v,j) R_a(u,v) ≥ 1. Hence item_u has related concept_b(j) through
relation a with any item if and only if Σ_{v=1}^{z} R_a(u,v) M_b(v,j) ≥ 1. By the
definition of matrix multiplication, this is the (u, j)-th element of the z × n_b
matrix (R_a M_b). We denote this matrix M_b^a and call it the matrix of related
concepts of class b through relation a. It encodes the related concepts of all items
since M_b^a(u,j) ≥ 1 whenever item_u contains related concept j of class b
(concept_b(j)). As we are only interested in occurrences of related concepts and
not in counts, we make M_b^a binary by clipping its values to 1. We can now use
the following equation to directly obtain related concepts:

$$M_b^a = R_a M_b = M_a M_a^\top M_b \qquad \forall\, a, b \in classes \qquad (4)$$

We can also express Eq. 4 as a function of two arbitrary matrices with the same
number of rows:

$$getRelated(A, B) = A A^\top B \qquad (5)$$
Using the function from Eq. 5 in a nested way, we can obtain related concepts
through longer paths. For example, we can obtain concepts through the path
(notation simplified):

$$item \rightarrow concept_a \rightarrow item \rightarrow concept_b \rightarrow item \rightarrow concept_c \qquad (6)$$

This can be calculated as getRelated(getRelated(M_a, M_b), M_c) and denoted
as M_c^{ba}. The function can be nested any number of times to obtain con-
cepts through any path length. This shows that we can find related concepts
through relations of the form expressed in Eq. 3 using only matrix multipli-
cations of the k matrices M_c. With this we propose an alternative to external
ontologies that can be used to find related concepts using only item-concept oc-
currences from the dataset itself. We can for example calculate M_director^actor =
M_actor M_actor^⊤ M_director to find the related directors of the movies through the
movies' actors. In other words, each row of M_director^actor contains all directors
of movies that share at least one actor with the movie corresponding to that row.
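A minimal NumPy sketch of Eq. 5 and the clipping step, on a toy example of 3 movies, 2 actors, and 2 directors; for the full 25,138-item matrices, sparse matrices (e.g., scipy.sparse) would be advisable.

import numpy as np

def get_related(A, B):
    # Eq. 5: A, B are z x n_a and z x n_b binary occurrence matrices.
    R = A.dot(A.T)                   # R(u,v) >= 1 iff items u and v share a concept
    M = R.dot(B)                     # counts of related concepts per item
    return (M >= 1).astype(np.int8)  # clip counts to binary occurrences

M_actor = np.array([[1, 0], [1, 1], [0, 1]])
M_director = np.array([[1, 0], [0, 1], [0, 1]])
# Directors of movies sharing at least one actor with each movie:
print(get_related(M_actor, M_director))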

4.3 Scaling
When all features are extracted from the descriptions, we have to consider how to
scale them. The traditional method of scaling in TF-IDF is the Inverse Document
Frequency (IDF), as explained in Section 2.1.1, and we apply the same scaling to
terms and synsets from the plots, in line with SF-IDF(+).
The scaling of concepts is slightly different because, in contrast to CF-IDF+,
these are not extracted from texts. Frequencies are always in {0, 1} as we only
extract occurrences of concepts. We do not apply IDF scaling, as we believe that
the interpretation of the feature values deviates too much from their original
meaning in TF/SF/CF-IDF+, where they are relative frequencies from texts.
The 1,000 synset values (VGG19) and the 1,024 visual-semantic feature values
(VSE) extracted from the posters could benefit from scaling as we expect that
some features are more relevant to the content of the movies. These features
should play a larger role in the cosine distance and therefore be scaled higher. We
have little information about the relevance of each of the 1,000 synsets, and even
less so about the 1,024 visual-semantic features, which are hidden and do not have
a natural interpretation. To investigate the upper limit of benefit from scaling, we
learn 1,000 scales for the synsets and 1,024 scales for the visual-semantic features
simultaneously with optimizing the model through stochastic gradient descent.
Following the notation of the similarity model in Table 4, we denote the scale
as c_i if it applies to the i-th feature type t_i. So c_i ∈ R^1,000 if t_i = VGG19 and
c_i ∈ R^1,024 if t_i = VSE. The user-profile vector u_i and the unseen item vector
v_i are then scaled through c_i ⊙ u_i and c_i ⊙ v_i respectively, with ⊙ the
element-wise product. These resulting scaled vectors are used in the cosine. We
restrict c_i > 0 and Σ c_i = 1 to avoid the over-parametrization caused by
cos(λu, λv) = cos(u, v) for all λ ≠ 0. Next to the scaled vectors, we also use the
unscaled original vectors in the model for comparison.
The results of the learned visual scaling could also indicate the benefit of
better scaling of the other feature types, for which this learning procedure is too
computationally expensive. Another advantage of applying this experiment only
to the visual features is that the number of features is fixed, so the resulting
learned scaling can be reused for other movie recommendation problems.
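A minimal sketch of the scaled cosine used here; renormalizing the absolute values of the scale vector is one simple way (an assumption in this sketch) to satisfy the positivity and sum-to-one restrictions.

import numpy as np

def scaled_cosine(u, v, c):
    c = np.abs(c) / np.abs(c).sum()   # keep the scales positive, summing to 1
    su, sv = c * u, c * v             # element-wise scaling of both vectors
    denom = np.linalg.norm(su) * np.linalg.norm(sv)
    return su.dot(sv) / denom if denom > 0 else 0.0

rng = np.random.RandomState(0)
u, v = rng.rand(1000), rng.rand(1000)      # e.g., VGG19 synset vectors
print(scaled_cosine(u, v, np.ones(1000)))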

4.4 Preparation
After the features are extracted from the descriptions and appropriately scaled,
we take some steps to prepare them for use in the similarity model. For each
movie, we have a feature vector for each type of feature. We increase the para-
metric freedom compared to CF-IDF+ by placing concepts of each class into
separate vectors, which allows their relative importance to be learned. So the
concepts (features) of each class are considered a different type of feature. All
feature types used in this research are summarized in Table 3.
We use the notation (Table 4) introduced for the similarity model in the next
section. Let us denote the number of feature types as k = 9 and the set of feature
types as t, where t_i is the i-th type, for example t_1 = director. We represent the
feature values of type t_i for item g ∈ [1, z] in a matrix V_i^g ∈ R^{m_i × n_i},
with z the total


Table 3: Feature types used in this research.

i  Feature type t_i    Extracted from   Dataset     n_i*     m_i**
1  Directors           Variable         OMDb        12,231   4
2  Actors              Variable         OMDb        45,393   4
3  Writers             Variable         OMDb        27,415   4
4  MovieLens genres    Variable         MovieLens   19       1
5  OMDb genres         Variable         OMDb        27       1
6  Terms               Plot             OMDb        48,083   1
7  Synsets             Plot             OMDb        69,977   19
8  VGG19               Poster           TMDb        1,000    1
9  VSE                 Poster           TMDb        1,024    1

* #Features, i.e., length of feature vectors. ** #Relations.

number of items, m_i the number of relations, and n_i the number of features. To
simplify the notation, our definition of the set of relations r_i includes directly
found features as a relation, so the first relation is r_i^1 = direct for every feature
type. Since we retrieve 18 relations from WordNet, and only for the plot synsets,
it follows that m_7 = 1 + 18 = 19 and m_8 = 1. For finding related concepts we
limit ourselves to single-step paths of directors, actors, and writers. This means
m_1 = m_2 = m_3 = 1 + 3 = 4, and for the genres m_4 = m_5 = 1. As there are no
relations among terms or visual-semantic features, we have m_6 = m_9 = 1.
The feature matrices V_i^g are constructed such that the rows contain feature
values from the same relation, and the columns contain feature values of the
same feature. So for the g-th item the scalar element V_i^g(a,b) is the value of the
b-th feature of type t_i found through relation r_i^a. We define the block matrices
$F_i = [(V_i^1)^\top (V_i^2)^\top \cdots (V_i^z)^\top]^\top$ for $i \in [1, k]$ and
calculate the matrix multiplications $X_i = F_i F_i^\top$. It follows that the block
X_i(a,b) is equal to the matrix V_i^a (V_i^b)^⊤. This concludes the preparation of
the feature data.

4.5 Similarity Model


As we aim to scale the semantics-driven approach to large datasets, we have to
be able to optimize the recommender's parameters on a large number of user
profiles. We notice that for each of the 138,493 users in our dataset, one or
multiple combinations of liked items can be used to construct a user profile. For
example, if we construct user profiles out of a fixed number of p liked items, the
information from a user who has n items in his training set can be used to
construct $\binom{n}{p}$ user profiles. This means that the number of
observations available to optimize the parameters can be much larger than the
number of users in the dataset. We can scale up semantics-driven
recommendations to millions of user profiles and thousands of features by
rewriting the similarity model as a function of the dot-products between the
feature vectors of individual items, which can be pre-computed before
optimization. To express the proposed similarity model, we update the previous
notation in Table 4.

Table 4: Notation, updated.

$\mathrm{sim}$                              Similarity between user-profile and unseen item, output of the similarity model.
$k$                                         Number of feature types (part-similarities).
$i \in [1, k]$                              Index of feature type (part-similarity).
$t_i$                                       $i$-th feature type.
$s_i$                                       $i$-th part-similarity, here a cosine similarity between user-profile vector $\vec{u}_i$ and unseen-item vector $\vec{v}_i$ of feature type $t_i$.
$w_i$                                       Weight for $s_i$.
$\vec{s}$                                   Vector containing all part-similarities $s_i$.
$\vec{w}$                                   Vector containing all weights $w_i$.
$p$                                         Number of (liked) items in user-profile.
$n_i$                                       Number of features of feature type $t_i$.
$m_i$                                       Number of relations between features of type $t_i$, with the first relation always defined as direct.
$V_i \in \mathbb{R}^{m_i \times n_i}$       Feature matrix of type $t_i$ for the unseen item; the $l$-th row contains feature values found through the $l$-th relation.
$V_i^g \in \mathbb{R}^{m_i \times n_i}$     Same for the $g$-th item in the user-profile, with $g \in [1, p]$.
$U_i \in \mathbb{R}^{m_i \times n_i}$       User feature matrix, function of $\{V_i^1 .. V_i^p\}$.
$\vec{q}_i \in \mathbb{R}^{m_i}$            Vector of relation weights for the $i$-th feature type $t_i$.
$\vec{v}_i \in \mathbb{R}^{n_i}$            Unseen-item feature vector, function of $V_i$, $\vec{q}_i$.
$\vec{u}_i \in \mathbb{R}^{n_i}$            User-profile feature vector, function of $U_i$, $\vec{q}_i$.

4.5.1 Definition
Semantics-driven recommenders such as CF-IDF+, SF-IDF+, and Bing-SF-IDF+ have in common that all three methods construct one vector of features from the user profile and one from the unseen item under consideration. To then calculate the similarity between the user profile and the item, the cosine similarity between these two vectors is used. For CF-IDF+, these features are computed as a function of the frequency of concepts in the text and of concepts related to them in the domain ontology. SF-IDF+ does the same with synsets and retrieves the relations from a lexical ontology. Bing-SF-IDF+ calculates a weighted average of Bing-similarity and synset cosine similarity.


Let us now look at a more general model of similarity that can describe any of these semantics-driven recommenders. The target variable is the similarity, here called sim, between the user profile and a certain unseen item. sim is a weighted average of a set of $k$ similarity measures, which we call part-similarities. We denote the weights as a parameter vector $\vec{w}$ and the part-similarities as elements of a vector $\vec{s}$, both of length $k$ (the number of part-similarities). In line with previous recommenders, we restrict all $w_i \geq 0$ so that sim cannot be negatively related to any $s_i$. We assume $0 \leq s_i \leq 1$, which has been the case in previous recommenders, and can therefore achieve the desired $0 \leq \mathrm{sim} \leq 1$ by enforcing $\sum_{i=1}^{k} w_i = 1$. Now we can write the similarity as:

$$\mathrm{sim} = \sum_{i=1}^{k} w_i s_i = \vec{w} \cdot \vec{s} \qquad (7)$$

For CF-IDF+ and SF-IDF+, we can simply define $k = 1$, $\vec{w} = 1$, and $\vec{s} = s_1$ with $s_1$ a cosine similarity. Bing-SF-IDF+ can also be expressed through Eq. 7 with $k = 2$, $\vec{w} = [\alpha, 1 - \alpha]$, $s_1$ a Bing-similarity, and $s_2$ a cosine similarity. Since we do not use Bing-similarity for our model, any part-similarity $s_i$ in our model is a cosine similarity, written as $\cos(\vec{u}_i, \vec{v}_i)$, between some feature vector $\vec{u}_i$ of length $n_i$ representing the user profile and another vector $\vec{v}_i$ of length $n_i$ representing the unseen item. We construct these vectors such that $\vec{u}_i, \vec{v}_i \geq 0$ to achieve $0 \leq s_i = \cos(\vec{u}_i, \vec{v}_i) \leq 1$, which holds for the cosine similarity between any two non-negative vectors $\vec{a}$ and $\vec{b}$:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \qquad (8)$$

As in our model $s_i = \cos(\vec{u}_i, \vec{v}_i)$, we substitute in Eq. 7 the definition of cosine similarity from Eq. 8 to get:

$$\mathrm{sim} = \sum_{i=1}^{k} w_i s_i = \sum_{i=1}^{k} w_i \cos(\vec{u}_i, \vec{v}_i) \qquad (9)$$

4.5.2 Related Features


Consider how a feature vector $\vec{v}_i \in \mathbb{R}^{n_i}$ for an unseen item is constructed. It can be seen as a function $f_i(\cdot)$ of the feature matrix $V_i \in \mathbb{R}^{m_i \times n_i}$ of the item's related feature values and a trainable relation weights vector $\vec{q}_i \in \mathbb{R}^{m_i}$ that is shared across all items. We take SF-IDF+ as an example; the method for CF-IDF+ is analogous. In SF-IDF+ there exists only $k = 1$ part-similarity $s_1 = \cos(\vec{u}_1, \vec{v}_1)$ in Eq. 7 and therefore one type $t_1$ of feature, in this case synsets. The feature values in $\vec{v}_1$ are the SF-IDF+ values of the unseen item, so $v_{1j}$ is the SF-IDF+ value of the $j$-th synset and $n_1$ is the number of unique synsets. So here $t_1 = \textit{synset}$ and if we assume for example that toy is the 5-th synset, $v_{15}$ is the unseen item's SF-IDF+ value of the synset toy. This value is not only calculated from the occurrences of toy in the unseen item's text, but also from occurrences of synsets that are related to toy, for example from the occurrence of doll. As we consider directly found features as a specific case of relation, we can define the first relation as direct, and in the SF-IDF+ model we restrict $q_{11} = 1$. This makes the number of relations $m_1 = 28$, because the remaining relations are then the 27 semantic relations found in WordNet. The weights $q_{1l}$ for $l \in [2, 28]$ are then the 27 optimizable weights restricted to $[0, 1]$. We set our restrictions less strict than those of the original SF-IDF+, to $q_{il} \geq 0$ and $\sum_{l=1}^{m_i} q_{il} = 1$, as we want to allow any weight to be the highest. The weights of SF-IDF+ can be expressed in our model if we divide the weights vector by its sum; the similarities are unaffected by this rescaling because $\cos(\lambda\vec{u}, \lambda\vec{v}) = \cos(\vec{u}, \vec{v})$ for any positive scalar $\lambda$.

We can define $V_1$ and $f_1$ by noting that the SF-IDF+ value $v_{1j}$ is the maximum of the direct SF-IDF value and the SF-IDF values of synsets related to synset $j$, multiplied by their corresponding relation weights from $\vec{q}_1$. In other words, $V_{1(l,j)}$ is the maximum of the SF-IDF values of all synsets related to synset $j$ by the $l$-th relation. Therefore the $j$-th column of $V_1$, denoted $V_{1(\cdot,j)}$, consists of these 28 SF-IDF values from related synsets, one for each type of relation. Note that the first value in the $j$-th column, $V_{1(1,j)}$, is the SF-IDF value of synset $j$ itself. Now to replicate SF-IDF+ within our framework we also need to take a maximum over the 28 related SF-IDF values after they have been multiplied by their corresponding relation weights from $\vec{q}_1$. We therefore define $f_1$ as follows:

$$\text{SF-IDF+}_j = v_{1j} = f_1(\vec{q}_1, V_{1(\cdot,j)}) = \max_{1 \leq l \leq 28} q_{1l} V_{1(l,j)} \quad \forall j \in [1, n_1] \qquad (10)$$

Note that this is equivalent to taking the largest element of the element-wise product of $\vec{q}_1$ and $V_{1(\cdot,j)}$, denoted $(\vec{q}_1 \odot V_{1(\cdot,j)}) \in \mathbb{R}^{28}$. This obtains the SF-IDF+ value for feature $j$, which is the $j$-th element $v_{1j}$ of the feature vector $\vec{v}_1$.

The item vector $\vec{v}_1$ in the cosine similarity of CF-IDF+ can be expressed similarly, but uses concepts instead of synsets and ontology relations instead of WordNet relations; therefore $V_1$ would consist of CF-IDF values and $\vec{r}_1$ of the $m_1$ defined relations in the domain ontology. If we would follow this CF-IDF+/SF-IDF+ method for all features in our research, we could define $f_i$ more generally for any feature type $t_i$ as:

$$v_{ij} = f_i(\vec{q}_i, V_{i(\cdot,j)}) = \max_{1 \leq l \leq m_i} q_{il} V_{i(l,j)} \quad \forall j \in [1, n_i] \qquad (11)$$

However, Eq. 11 reveals the computational bottleneck in the SF-IDF+ and CF-IDF+ models, because the item vector $\vec{v}_i$ that is used in the cosine similarity consists of maxima that can be known only by using the full matrix $V_i \in \mathbb{R}^{m_i \times n_i}$ and the parameter vector of weights $\vec{q}_i \in \mathbb{R}^{m_i}$. As we want to separate the dot-products from the parameters, we change $f_i$ in Eq. 11 to calculate $v_{ij}$ by taking the sum instead of the maximum:

$$v_{ij} = f_i(\vec{q}_i, V_{i(\cdot,j)}) = \sum_{l=1}^{m_i} q_{il} V_{i(l,j)} = \vec{q}_i V_{i(\cdot,j)}^\top \quad \forall j \in [1, n_i] \qquad (12)$$

We can now see from Eq. 12 that the $j$-th element of $\vec{v}_i$ has become the matrix multiplication of the row-vector $\vec{q}_i$ and the transpose of column $j$ of $V_i$. This means that if we represent the vector $\vec{q}_i$ as a $1 \times m_i$ matrix, we no longer need $f_i$, as we can perform the $n_i$ multiplications that form the elements of $\vec{v}_i$ with one matrix multiplication, using $\vec{q}_i$ and the full $m_i \times n_i$ matrix $V_i$:

$$\vec{v}_i = \vec{q}_i V_i \qquad (13)$$

Notice that the matrix $V_i$ consists of (related) feature values for an unseen item, but can be calculated for any item, including those in the user profiles. We denote $V_i^g$ as the item feature matrix $V_i$ corresponding to the $g$-th item in a user profile, which is a set of $p$ liked items. As the user profile's features are defined as sums of the $p$ items' features, and the weights $\vec{q}_i$ are the same for each item, the user-profile feature vector $\vec{u}_i$ can be represented as:

$$\vec{u}_i = \sum_{g=1}^{p} \vec{v}_i^g = \sum_{g=1}^{p} \vec{q}_i V_i^g = \vec{q}_i \sum_{g=1}^{p} V_i^g = \vec{q}_i U_i \qquad (14)$$

Here we have defined $U_i = \sum_{g=1}^{p} V_i^g$, which we call the user feature matrix. Using Eq. 13 and Eq. 14 for the feature vectors $\vec{u}_i$ and $\vec{v}_i$, we can rewrite the cosine similarity for feature type $i$ as:

$$s_i = \cos(\vec{q}_i U_i, \vec{q}_i V_i) = \frac{(\vec{q}_i U_i) \cdot (\vec{q}_i V_i)}{\|\vec{q}_i U_i\|_2 \|\vec{q}_i V_i\|_2} \qquad (15)$$

As the dot-product $\vec{a} \cdot \vec{b}$ is equivalent to the matrix multiplication $\vec{a}\vec{b}^\top$ and the Euclidean norm $\|\vec{a}\|_2$ to $\sqrt{\vec{a}\vec{a}^\top}$:

$$s_i = \frac{\vec{q}_i U_i (\vec{q}_i V_i)^\top}{\sqrt{\vec{q}_i U_i (\vec{q}_i U_i)^\top} \sqrt{\vec{q}_i V_i (\vec{q}_i V_i)^\top}} = \frac{\vec{q}_i (U_i V_i^\top) \vec{q}_i^\top}{\sqrt{\vec{q}_i (U_i U_i^\top) \vec{q}_i^\top} \sqrt{\vec{q}_i (V_i V_i^\top) \vec{q}_i^\top}} \qquad (16)$$

We can now rewrite Eq. 9 for our full model's similarity as:

$$\mathrm{sim} = \sum_{i=1}^{k} w_i s_i = \sum_{i=1}^{k} w_i \frac{\vec{q}_i (U_i V_i^\top) \vec{q}_i^\top}{\sqrt{\vec{q}_i (U_i U_i^\top) \vec{q}_i^\top} \sqrt{\vec{q}_i (V_i V_i^\top) \vec{q}_i^\top}} \qquad (17)$$


Since we have defined $U_i = \sum_{g=1}^{p} V_i^g$, we can show that $(U_i V_i^\top)$ and $(U_i U_i^\top)$ can both be written as sums of matrix multiplications of the $p$ feature matrices $V_i^g$ of liked items and the feature matrix $V_i$ of the unseen item:

$$U_i V_i^\top = \Big(\sum_{g=1}^{p} V_i^g\Big) V_i^\top = \sum_{g=1}^{p} (V_i^g V_i^\top) \qquad (18)$$

$$U_i U_i^\top = \Big(\sum_{g=1}^{p} V_i^g\Big)\Big(\sum_{g=1}^{p} V_i^g\Big)^\top = \Big(\sum_{a=1}^{p} V_i^a\Big)\Big(\sum_{b=1}^{p} (V_i^b)^\top\Big) = \sum_{a=1}^{p} \sum_{b=1}^{p} V_i^a (V_i^b)^\top \qquad (19)$$

Equation 17 shows how sim is a function of $(U_i V_i^\top), (U_i U_i^\top), (V_i V_i^\top) \in \mathbb{R}^{m_i \times m_i}$ and the parameter vectors $\vec{q}_i \in \mathbb{R}^{m_i}$, $\vec{w} \in \mathbb{R}^{k}$. We know from Eq. 18 and Eq. 19 that $U_i U_i^\top$ and $U_i V_i^\top$ are simply sums of $V_i^a (V_i^b)^\top$ for some $a, b \in [1, z]$ and, as described in Section 4.4, we can retrieve these by accessing $X_{i(a,b)}$. This means we no longer need the large-dimensional $V_i$ as input, and none of the $n_i$-dimensional dot-products $(\vec{u}_i \cdot \vec{v}_i)$ from the original model have to be computed while sampling observations or while training the weights. The data matrix $X_i$ has to be calculated only once for each feature type $t_i$, $\forall i \in [1, k]$, in the preparation stage.
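A minimal sketch of Eq. 16 under this scheme, assuming the matrices $A = U_i V_i^\top$, $B = U_i U_i^\top$, and $C = V_i V_i^\top$ have already been retrieved from the blocks of $X_i$; the sanity check against the direct $n_i$-dimensional cosine similarity is our own illustration:

```python
import numpy as np

def part_similarity(q, A, B, C):
    """Eq. 16: s_i from the pre-computed m_i x m_i matrices and relation weights q_i."""
    num = q @ A @ q                              # q_i (U_i V_i^T) q_i^T
    den = np.sqrt(q @ B @ q) * np.sqrt(q @ C @ q)
    return num / den

# Sanity check against the direct cosine similarity cos(q U, q V).
rng = np.random.default_rng(1)
m, n = 4, 1000
U, V = rng.random((m, n)), rng.random((m, n))
q = rng.random(m)
u, v = q @ U, q @ V
direct = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
fast = part_similarity(q, U @ V.T, U @ U.T, V @ V.T)
assert np.isclose(direct, fast)
```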

Learned scaling. Note that when we learn the scaling for the visual feature types VGG19/VSE as described in Section 4.3, we insert $\vec{u}_i \rightarrow (\vec{c}_i \odot \vec{u}_i)$ and $\vec{v}_i \rightarrow (\vec{c}_i \odot \vec{v}_i)$ into the similarity model, where $\vec{c}_i \in \mathbb{R}^{n_i}$ is the learnable scaling. Note also that $\vec{u}_i = U_i$ and $\vec{v}_i = V_i$, because the number of relations $m_i = 1$ for these feature types. Therefore the part-similarity model of Eq. 16 changes to:

$$s_i = \frac{\vec{q}_i \big((\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{v}_i)^\top\big) \vec{q}_i^\top}{\sqrt{\vec{q}_i \big((\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{u}_i)^\top\big) \vec{q}_i^\top} \sqrt{\vec{q}_i \big((\vec{c}_i \odot \vec{v}_i)(\vec{c}_i \odot \vec{v}_i)^\top\big) \vec{q}_i^\top}} \qquad (20)$$

We restrict $\sum_{l=1}^{m_i} q_{il} = 1$ and here $m_i = 1$, making $\vec{q}_i = 1$ and redundant:

$$s_i = \frac{(\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{v}_i)^\top}{\sqrt{(\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{u}_i)^\top} \sqrt{(\vec{c}_i \odot \vec{v}_i)(\vec{c}_i \odot \vec{v}_i)^\top}} \qquad (21)$$

The scaling $\vec{c}_i$ has $n_i$ optimizable parameters, and therefore by definition the model is at least $n_i$-dimensional; this is irreducible. However, when we want to re-use the learned scaling, we can pre-compute $\vec{c}_i \odot \vec{u}_i$ and $\vec{c}_i \odot \vec{v}_i$, because the scaling is known and fixed in that case. Then we can redefine $\vec{u}_i \rightarrow \vec{c}_i \odot \vec{u}_i$ and $\vec{v}_i \rightarrow \vec{c}_i \odot \vec{v}_i$ and use the efficient model with pre-computed $U_i U_i^\top$, $U_i V_i^\top$, $V_i V_i^\top$.
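A small sketch of this re-use, assuming $\vec{c}_i$ is fixed: fold the scaling into the feature vectors once, after which the three scalars of Eq. 21 (the $m_i = 1$ analogues of $U_i V_i^\top$, $U_i U_i^\top$, $V_i V_i^\top$) can be pre-computed. The names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
c = rng.random(n)                              # learned scaling, now fixed
u, v = rng.standard_normal(n), rng.standard_normal(n)
cu, cv = c * u, c * v                          # feature vectors with scaling baked in

# Pre-computable dot-products; the similarity no longer touches n_i-dim vectors.
uv, uu, vv = cu @ cv, cu @ cu, cv @ cv
s_fast = uv / (np.sqrt(uu) * np.sqrt(vv))
s_direct = (cu @ cv) / (np.linalg.norm(cu) * np.linalg.norm(cv))
assert np.isclose(s_fast, s_direct)
```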


Example. To see how the proposed efficiency improvement works out in practice, we revisit the SF-IDF+ model. In our case, we retrieve 18 relations from WordNet, and the total number of unique synsets (including the related ones) found in the plots is 69,977. Every movie therefore has a feature matrix $V_1$ of size $19 \times 69{,}977$ containing the (related) SF-IDF values for that movie. In the traditional model, we would have to multiply each row of the feature matrix by its corresponding relation weight and take the maximum over the columns to obtain the SF-IDF+ vector $\vec{v}_1$ of length 69,977. We use $p = 5$ items per user-profile, so we would have to repeat the above computation 5 times and calculate the sum of the 5 vectors to obtain the SF-IDF+ user-profile vector $\vec{u}_1$. To then calculate the part-similarity $s_1$ we would calculate the 69,977-dimensional cosine similarity $\cos(\vec{u}_1, \vec{v}_1)$. This entire operation would have to be repeated for each observation while optimizing, and we have 1,406,976 observations (Section 4.6.1) in our training set.

With the approach proposed in this section, we prepare $V_1^a (V_1^b)^\top$ for all combinations of movies $a, b \in [1, z]$ through a single matrix multiplication, as described in Section 4.4. Due to the efficiency of performing this in a single operation, we obtain $X_1$ within minutes. Given $X_1$, the dimensionality is reduced from 69,977 to 19, because we now only require the $V_1^a (V_1^b)^\top$, which are blocks of size $19 \times 19$ in $X_1$. For each observation in the training set, we can obtain $V_1 V_1^\top$ for the unseen item directly as a $19 \times 19$ block on the diagonal of $X_1$. For $U_1 V_1^\top$ and $U_1 U_1^\top$ we need to sum $p = 5$ and $p^2 = 25$ different blocks respectively. These operations can be performed before optimization as well, as explained in Section 4.6.1. Using $(U_1 V_1^\top), (U_1 U_1^\top), (V_1 V_1^\top)$ as observation data, we can optimize the model of Eq. 16, which consists of several 19-dimensional matrix multiplications.

4.5.3 Logistic Transformation

We show in Section 4.5.1 how the semantics-driven recommenders can be generalized to a single similarity model. In the previous section we increase the scalability of this model by reducing its dimensionality from $n_i$ to $m_i$, leading to the model of Eq. 17. Previous recommenders restricted $w_i \geq 0$ and $\sum_{i=1}^{k} w_i = 1$ to achieve $0 \leq \mathrm{sim} \leq 1$. This, however, leads to undesirable properties that we avoid by adjusting the specification.

The model is meant to predict the similarity sim between a user-profile and an unseen item, which is given as a score in $[0, 1]$. We assume that a user likes an item whenever the true similarity is higher than some unknown threshold. Since we cannot directly observe the true similarity, we have to optimize the model based on observed likes/dislikes $y \in \{0, 1\}$. The optimal parameters are learned with some loss function over the deviation of sim from the observed $y$. Since the loss increases with $|y - \mathbb{E}[\mathrm{sim}]|$, it depends on the $k$ different $\mathbb{E}[s_i]$, with $k$ the number of part-similarities. This happens because the restriction $\sum_{i=1}^{k} w_i = 1$ makes independent mean-shifts impossible.

The different $s_i$ cannot be scaled freely either, which can lead to problems, again even with high-quality predictors. As an obvious example, consider that we have only one part-similarity, but it is perfectly predictive by correlation: $s_1 = \frac{y}{100}$. With the restriction on scaling, it can only change the similarity by $\frac{1}{100}$ units, even if mean-shifts are allowed.

To solve the above problems, we have to remove the restriction $\sum_{i=1}^{k} w_i = 1$ and add a learnable bias $\beta$ to the model. Now sim is still linear and increasing, but less restricted. To then bound the model to $[0, 1]$ we apply a logistic function to the output of the model in Eq. 17:

$$\mathrm{sim} = \frac{1}{1 + e^{-(\beta + \sum_{i=1}^{k} w_i s_i)}} \qquad (22)$$

We can now relax the previously (Section 4.5.1) imposed $s_i \geq 0$ restriction, because $0 \leq \mathrm{sim} \leq 1 \; \forall s_i \in \mathbb{R}$. Since we no longer need $s_i \geq 0$, we can also relax $\vec{u}_i, \vec{v}_i \geq 0$ for the feature vectors, allowing us to use the visual-semantic embeddings, which are not non-negative. We nevertheless keep $w_i \geq 0$ because we still desire $\frac{\partial\, \mathrm{sim}}{\partial s_i} \geq 0$. Note that sim remains an increasing function of $w_i$ and $s_i$ after the transformation of Eq. 22.

If the $s_i$ were data inputs, the model would be a linear logistic regression, but each $s_i$ is itself a non-linear model, following Eq. 16 or Eq. 21. We can however, in line with classic logistic regression, use cross-entropy, also called logloss, as a loss function over an item's observed like/dislike $y \in \{0, 1\}$ and the predicted similarity $\mathrm{sim} \in [0, 1]$ between the item and the user-profile:

$$L = -y \log(\mathrm{sim}) - (1 - y) \log(1 - \mathrm{sim}) \qquad (23)$$

The similarity can therefore also be interpreted as the probability of a like given the input data $(U_i V_i^\top), (U_i U_i^\top), (V_i V_i^\top)\ \forall i \in [1, k]$.
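A hedged sketch of Eqs. 22 and 23, with the bias $\beta$ kept explicit; the function and variable names are our own illustration:

```python
import numpy as np

def similarity(w, beta, s):
    """Logistic transformation of the weighted part-similarities (Eq. 22)."""
    return 1.0 / (1.0 + np.exp(-(beta + w @ s)))

def logloss(y, sim):
    """Cross-entropy between observed like/dislike y and predicted sim (Eq. 23)."""
    return -(y * np.log(sim) + (1 - y) * np.log(1 - sim))

# Toy usage with k = 3 part-similarities.
w, beta, s = np.array([0.4, 1.2, 0.1]), -0.5, np.array([0.2, 0.7, 0.9])
print(logloss(1, similarity(w, beta, s)))
```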

4.5.4 Gradient

Given the model definition of Eq. 22 and the cross-entropy loss of Eq. 23, we can define the objective function $J(\theta)$ to be minimized as $-y \log(\mathrm{sim}(\theta)) - (1 - y) \log(1 - \mathrm{sim}(\theta))$, where $\theta$ is the vector of parameters. To simplify the notation in this section, we add the bias parameter to the $k$ weights in $\vec{w}$, so $w_{k+1} = \beta$ and $s_{k+1} = 1$. The parameters $\theta$ then contain the relation weights vectors $\vec{q}_i\ \forall i \in [1, k]$, the weights vector $\vec{w}$, and if applicable the scaling $\vec{c}_i$. We now derive the gradient of the objective function:

$$\nabla J(\theta) = -\frac{y\, \nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta)} + \frac{(y - 1)\, \nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta) - 1} = \frac{(y - \mathrm{sim}(\theta))\, \nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta)(\mathrm{sim}(\theta) - 1)} \qquad (24)$$

With $S(\theta) = \sum_{i=1}^{k+1} w_i s_i$ we can derive the gradient of sim using the quotient and chain rules:

$$\nabla \mathrm{sim}(\theta) = \nabla \frac{1}{1 + e^{-S(\theta)}} = \frac{e^{-S(\theta)}\, \nabla S(\theta)}{(e^{-S(\theta)} + 1)^2} \qquad (25)$$

Now we need the gradient of $S$:

$$\nabla S(\theta) = \nabla \sum_{i=1}^{k+1} w_i s_i = \sum_{i=1}^{k+1} \nabla (w_i s_i) \qquad (26)$$

The part of the gradient with respect to the weights vector $\vec{w}$ is here called $\nabla_w$, the parts with respect to each of the relation weights vectors $\vec{q}_i$ are called $\nabla_{q_i}\ \forall i \in [1, k]$, and the part with respect to the scaling $\vec{c}_i$ is called $\nabla_{c_i}$. We first derive $\nabla_w$ by expressing it for each element $w_i$ of $\vec{w}$:

$$\nabla_{w_i} S(\theta) = \nabla_{w_i} \sum_{j=1}^{k+1} w_j s_j = 0 + \nabla_{w_i} w_i s_i = s_i \qquad (27)$$

Before we derive $\nabla_{q_i}$, we simplify our notation for the data matrices: $A = U_i V_i^\top$, $B = U_i U_i^\top$, $C = V_i V_i^\top$. Note that $\vec{q}_i$ is only learnable for feature types that use the model for $s_i$ of Eq. 16 without a learned scaling $\vec{c}_i$. We can now write:

$$\nabla_{q_i} S(\theta) = w_i \nabla s_i = w_i \nabla \frac{\vec{q}_i A \vec{q}_i^\top}{\sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}} \qquad (28)$$

If we define the functions $f = \vec{q}_i A \vec{q}_i^\top$ and $g = \sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}$, we can use the quotient rule:

$$\nabla_{q_i} S(\theta) = w_i \frac{g \nabla f - (\nabla g) f}{g^2} \qquad (29)$$

We derive the intermediate result for $\nabla f$, noting that $A$ is in general not symmetric:

$$\nabla f = \nabla \vec{q}_i A \vec{q}_i^\top = (A + A^\top)\, \vec{q}_i^\top \qquad (30)$$

And for $\nabla g$ using the chain rule, where the last step uses $\nabla \vec{q}_i B \vec{q}_i^\top = 2 B \vec{q}_i^\top$ and $\nabla \vec{q}_i C \vec{q}_i^\top = 2 C \vec{q}_i^\top$ for the symmetric matrices $B$ and $C$:

$$\nabla g = \nabla \sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top} = \frac{\vec{q}_i C \vec{q}_i^\top\, \nabla \vec{q}_i B \vec{q}_i^\top + \vec{q}_i B \vec{q}_i^\top\, \nabla \vec{q}_i C \vec{q}_i^\top}{2 \sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}} = \frac{\vec{q}_i C \vec{q}_i^\top\, B \vec{q}_i^\top + \vec{q}_i B \vec{q}_i^\top\, C \vec{q}_i^\top}{\sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}} \qquad (31)$$

Substituting the results from Eq. 30 and Eq. 31 for $\nabla f$ and $\nabla g$ in Eq. 29, we obtain:

$$\nabla_{q_i} S(\theta) = w_i \frac{\sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}\, (A + A^\top)\, \vec{q}_i^\top - \dfrac{\vec{q}_i C \vec{q}_i^\top\, B \vec{q}_i^\top + \vec{q}_i B \vec{q}_i^\top\, C \vec{q}_i^\top}{\sqrt{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top}}\, \vec{q}_i A \vec{q}_i^\top}{\vec{q}_i B \vec{q}_i^\top\, \vec{q}_i C \vec{q}_i^\top} \qquad (32)$$


This concludes the derivation for the part-similarities $s_i$ that use the model of Eq. 16. For the $s_i$ model of Eq. 21 with learned scaling $\vec{c}_i$, which does not have relation weights $\vec{q}_i$, we need $\nabla_{c_i}$ instead of $\nabla_{q_i}$:

$$\nabla_{c_i} S(\theta) = w_i \nabla s_i = w_i \nabla \frac{(\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{v}_i)^\top}{\sqrt{(\vec{c}_i \odot \vec{u}_i)(\vec{c}_i \odot \vec{u}_i)^\top} \sqrt{(\vec{c}_i \odot \vec{v}_i)(\vec{c}_i \odot \vec{v}_i)^\top}} \qquad (33)$$

For clarity we temporarily denote the $j$-th elements of the vectors $\vec{c}_i, \vec{u}_i, \vec{v}_i$ as $c_j, u_j, v_j$ respectively, with $j \in [1, n_i]$ and $n_i$ the number of features. For $b \in [1, n_i]$ we derive the $b$-th element $\nabla_{c_b} s_i$ of $\nabla_{c_i} s_i$:

$$\nabla_{c_b} s_i = \nabla_{c_b} \frac{\sum_{j=1}^{n_i} c_j u_j c_j v_j}{\sqrt{\sum_{j=1}^{n_i} c_j^2 u_j^2} \sqrt{\sum_{j=1}^{n_i} c_j^2 v_j^2}} = \nabla_{c_b} \frac{\sum_{j=1}^{n_i} c_j^2 u_j v_j}{\sqrt{\sum_{j=1}^{n_i} c_j^2 u_j^2} \sqrt{\sum_{j=1}^{n_i} c_j^2 v_j^2}} \qquad (34)$$

A sum over all indices $[1, n_i]$ except the $b$-th is denoted $\sum_{j \neq b}$. We define the functions $f$ and $g$:

$$f = \sum_{j=1}^{n_i} c_j^2 u_j v_j = c_b^2 u_b v_b + \sum_{j \neq b} c_j^2 u_j v_j$$
$$g = \Big(\sum_{j=1}^{n_i} c_j^2 u_j^2\Big)\Big(\sum_{j=1}^{n_i} c_j^2 v_j^2\Big) = \Big(c_b^2 u_b^2 + \sum_{j \neq b} c_j^2 u_j^2\Big)\Big(c_b^2 v_b^2 + \sum_{j \neq b} c_j^2 v_j^2\Big) \qquad (35)$$

And their partial derivatives:

$$\nabla_{c_b} f = 2 c_b u_b v_b + 0 = 2 c_b u_b v_b$$
$$\begin{aligned}
\nabla_{c_b} g &= 4 c_b^3 u_b^2 v_b^2 + 2 c_b u_b^2 \sum_{j \neq b} c_j^2 v_j^2 + 2 c_b v_b^2 \sum_{j \neq b} c_j^2 u_j^2 \\
&= 2 c_b \Big( 2 c_b^2 u_b^2 v_b^2 + u_b^2 \sum_{j \neq b} c_j^2 v_j^2 + v_b^2 \sum_{j \neq b} c_j^2 u_j^2 \Big) \\
&= 2 c_b \Big( u_b^2 \Big( c_b^2 v_b^2 + \sum_{j \neq b} c_j^2 v_j^2 \Big) + v_b^2 \Big( c_b^2 u_b^2 + \sum_{j \neq b} c_j^2 u_j^2 \Big) \Big) \\
&= 2 c_b \Big( u_b^2 \sum_{j=1}^{n_i} c_j^2 v_j^2 + v_b^2 \sum_{j=1}^{n_i} c_j^2 u_j^2 \Big)
\end{aligned} \qquad (36)$$

Rewriting $\nabla_{c_b} s_i$ of Eq. 34 with $f$ and $g$, we obtain (notation $\sum_{j=1}^{n_i}$ simplified to $\sum$):

$$\begin{aligned}
\nabla_{c_b} s_i &= \nabla_{c_b} \frac{f}{\sqrt{g}} = \frac{2 g \nabla_{c_b} f - f \nabla_{c_b} g}{2 \sqrt{g^3}} \\
&= \frac{2 \big(\sum c_j^2 u_j^2\big)\big(\sum c_j^2 v_j^2\big)\, 2 c_b u_b v_b - \big(\sum c_j^2 u_j v_j\big)\, 2 c_b \Big( u_b^2 \sum c_j^2 v_j^2 + v_b^2 \sum c_j^2 u_j^2 \Big)}{2 \sqrt{\Big( \big(\sum c_j^2 u_j^2\big)\big(\sum c_j^2 v_j^2\big) \Big)^3}} \\
&= \frac{c_b \Big( 2 \big(\sum c_j^2 u_j^2\big)\big(\sum c_j^2 v_j^2\big)\, u_b v_b - \big(\sum c_j^2 u_j v_j\big) \Big( u_b^2 \sum c_j^2 v_j^2 + v_b^2 \sum c_j^2 u_j^2 \Big) \Big)}{\sqrt{\Big( \big(\sum c_j^2 u_j^2\big)\big(\sum c_j^2 v_j^2\big) \Big)^3}}
\end{aligned} \qquad (37)$$
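The derivations above can be verified numerically. The following sketch (our own, under the assumption that $B$ and $C$ are symmetric while $A$ is not) checks the analytic gradient of $s_i$ with respect to $\vec{q}_i$ against central finite differences:

```python
import numpy as np

def s_of_q(q, A, B, C):
    """Part-similarity s_i of Eq. 16 as a function of the relation weights q_i."""
    return (q @ A @ q) / np.sqrt((q @ B @ q) * (q @ C @ q))

def grad_s_analytic(q, A, B, C):
    """Gradient of s_i w.r.t. q_i, following Eqs. 29-31."""
    f = q @ A @ q
    g = np.sqrt((q @ B @ q) * (q @ C @ q))
    df = (A + A.T) @ q                                            # Eq. 30
    dg = ((q @ C @ q) * (B @ q) + (q @ B @ q) * (C @ q)) / g      # Eq. 31
    return (g * df - f * dg) / g**2                               # quotient rule

rng = np.random.default_rng(2)
m = 5
U, V = rng.random((m, 50)), rng.random((m, 50))
A, B, C = U @ V.T, U @ U.T, V @ V.T                               # B, C symmetric
q = rng.random(m)
num = np.array([(s_of_q(q + eps, A, B, C) - s_of_q(q - eps, A, B, C)) / 2e-6
                for eps in 1e-6 * np.eye(m)])                     # central differences
assert np.allclose(num, grad_s_analytic(q, A, B, C), atol=1e-5)
```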


4.6 Experimental Set-up


This section describes the procedure for sampling user-profiles and corresponding unseen items, for training, and for subsequent evaluation. As the $F_1$-measure and other classification metrics are not differentiable and we desire the efficiency of gradient descent, we instead directly train the similarity model on pairs of a user-profile and a corresponding unseen item. The trained similarity model is then used to recommend items for which the predicted similarity is above a certain threshold value. Evaluation consists of calculating various classification metrics between the test users' actual likes/dislikes and the recommendations.

4.6.1 Sampling

To optimize (train) the part-similarity weights $\vec{w}$ and the relation weights $\vec{q}_i$, we apply stochastic gradient descent (SGD) on the gradient of the similarity model. The target similarity $y \in \{0, 1\}$ is defined as $y = I\{\text{user likes item}\}$. An item is considered liked by a user if that user rates it with a score of 4.5 or higher, and disliked otherwise. This results in an average proportion of 19.12% liked items, and 20.9 liked items per user ($n_{liked}$). We shuffle the order of users in our dataset and subsequently take the first 1,000 as the test (hold-out) set, the following 1,000 as the validation set, and the remaining 136,493 as the training set. The test set is used for evaluation, the train set to optimize the similarity model, and the validation set to validate the similarity model for early stopping while training.

An observation is a pair of a user-profile and an unseen item. User-profiles are constructed by sampling $p = 5$ liked items from a user. For each observation the matrices $U_i V_i^\top$, $U_i U_i^\top$, and $V_i V_i^\top$ are constructed from the pre-computed data $X_i$. The $V_i V_i^\top$ are retrieved as blocks of $X_i$, while $U_i V_i^\top$ and $U_i U_i^\top$ are constructed from sums of $p$ and $p^2$ blocks respectively, as explained in Section 4.5.2.

For the train and validation sets, the unseen items are defined as all items not in the user-profile, which is sampled from a random user. For each user-profile, we sample a liked item or a disliked item with equal probability, such that we obtain balanced train and validation sets with $E(y) = 0.5$. Each observation is therefore a random user-profile and item, sampled from a random user. We sample 100 batches of 1,024 validation observations and 1,374 training batches of 1,024 observations, for totals of 102,400 and 1,406,976 respectively.

The test set should reflect a realistic recommendation setting, so we sample the $p = 5$ user-profile items by shuffling all rated items and then iteratively discarding the first item, adding it to the user-profile if it is liked. We stop as soon as we have obtained $p = 5$ liked items. All discarded liked and disliked items are then considered to be seen. In other words, we simulate the situation in time when a recommender detects that a user has liked $p = 5$ items. We require the unseen items to contain at least one liked and one disliked item to be able to measure performance, which leaves us with 809 eligible user-profiles from the 1,000 test users. We then construct observations for the user-profile with each unseen item. For each user, we save these in a separate batch. The test data is therefore composed of 809 batches of varying sizes, namely the number of unseen items.
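A sketch of this test-profile construction, assuming a user's ratings are available as (item, liked) pairs; the helper and its inputs are our own illustration, not the actual experiment code:

```python
import random

def sample_test_profile(rated_items, p=5, seed=0):
    """Shuffle rated items, collect the first p liked items as the profile;
    everything visited counts as seen, the remainder is unseen."""
    items = list(rated_items)
    random.Random(seed).shuffle(items)
    profile, seen = [], set()
    for item_id, liked in items:
        seen.add(item_id)
        if liked:
            profile.append(item_id)
            if len(profile) == p:
                break
    unseen = [(i, l) for i, l in items if i not in seen]
    # Eligibility: p likes found, and at least one liked and one disliked unseen item.
    if len(profile) < p or not any(l for _, l in unseen) or all(l for _, l in unseen):
        return None
    return profile, unseen
```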

4.6.2 Optimization

Stochastic gradient descent (SGD) is a stochastic approximation of gradient descent, a first-order iterative optimization algorithm. Gradient descent finds a local minimum of an objective function $J(\theta)$ by following the steepest descent given by the gradient $\nabla J(\theta)$. In each update iteration, the gradient is first estimated from all observations in the training data, after which a step proportional to the gradient at the current point is taken to update the parameter vector $\theta$. The size of these steps is determined by the learning rate $\eta$, which has to be determined by the researcher. The update rule for gradient descent is:

$$\theta \leftarrow \theta - \eta \nabla J(\theta) \qquad (38)$$

Instead of calculating the gradient over all observations before each update, SGD approximates the gradient by calculating it over a single observation. It passes over the observations in the training set one by one and performs a parameter update after each visited observation. One full pass over the training set is called an epoch. To avoid cyclical patterns that can hinder convergence, the observations are shuffled before each epoch. Mini-batch stochastic gradient descent is a version of this method that divides the training set into batches of fixed size and applies the update step to each mini-batch. We hereafter refer to these mini-batches simply as batches. With a batch size of 1 the method is therefore equivalent to SGD, and with a batch size equal to the training set's size it is equivalent to standard gradient descent.

Using batches provides the benefit of allowing parallel computation of the gradients from each observation, which can lead to substantial speed improvements when implemented on parallel computing hardware such as a Graphics Processing Unit (GPU). This requires a single batch to fit in memory, and it is further common to use a batch size that is a multiple of 16, as this can lead to efficiency improvements on some hardware. We choose to use batches of 1,024 observations each. As our training set is expected to be too large to load in memory, we store the batches separately on disk and load them into memory while training, inspired by [34].

The learning rate $\eta$ has to be chosen by the researcher and there are many factors involved, so it requires substantial tuning to obtain desirable convergence behaviour. If $\eta$ is too high, the algorithm keeps oscillating and overshooting the minima. On the other hand, if $\eta$ is too low, it will take too long to descend and could insufficiently explore the error surface. To alleviate this, learning rate annealing can be applied, where the learning rate is initialized at a relatively high value and is decreased at every epoch. This encourages fast exploration of the solution space in the earlier epochs, while the decreasing learning rate allows for finer exploitation of the found minimum later on.

Momentum can also be added to prevent oscillations and accelerate convergence [22]. This method remembers the parameter update $\Delta\theta$ from the previous iteration and adds it, multiplied by some factor $0 < \gamma < 1$, to the update of the next iteration. The updates therefore have a property similar to physical momentum, as the method creates a tendency to keep the updates moving in the same direction:

$$\theta \leftarrow \theta - \eta \nabla J(\theta) + \gamma \Delta\theta \qquad (39)$$

Other improvements are methods for adaptive learning rates, such as RmsProp [28] and its updated version Adam [13], which leads to faster convergence in [13]. Adaptive means that the learning rates are scaled differently (adapted) for each parameter by taking into account the previously observed magnitudes of the gradient for that parameter. This helps the update steps to adjust for the different magnitudes of the partial gradients.

The similarity model is trained with SGD on the 1,374 training batches, and after each epoch the validation error is measured on the 100 validation batches. If this validation error has improved compared to the last epoch, the current parameters are saved. Early stopping is activated as soon as the validation error has not improved over the last 5 epochs. The order of the training batches is shuffled at every epoch. We anneal the learning rate at every epoch by dividing the initial learning rate by the number of elapsed epochs. The stored parameters, those that led to the lowest validation error while training, are subsequently used for evaluation. The pseudo-code for this procedure is given in Algorithm 1.

Algorithm 1 Optimization
 1: procedure Optimize($\eta$, trainBatches, valBatches)
 2:   Generate $\beta \sim$ Normal                                          ▷ Random unrestricted
 3:   Generate $w_i, \vec{q}_i \sim$ Exp $\forall i \in [1, k]$             ▷ Random non-negative
 4:   $\vec{q}_i \leftarrow \vec{q}_i / \sum \vec{q}_i\ \forall i \in [1, k]$   ▷ Sum of $\vec{q}_i$ restriction
 5:   $\theta \leftarrow \{\beta, \vec{w}, \vec{q}_1 .. \vec{q}_k\}$        ▷ Parameters
 6:   epoch $\leftarrow$ 1                                                  ▷ Epoch counter
 7:   stop $\leftarrow$ 1                                                   ▷ Stopping counter
 8:   while stop $\leq$ 5 do                                                ▷ Early stopping criterion
 9:     Shuffle trainBatches
10:     $\eta_t = \eta / \text{epoch}$                                      ▷ Learning rate annealing
11:     for $B \in$ trainBatches do                                         ▷ Loop over batches
12:       $\theta \leftarrow \theta - \frac{\eta_t}{1{,}024} \sum_{obs=1}^{1{,}024} \nabla L(\theta, B_{obs})$   ▷ Gradient descent update
13:     end for
14:     $\ell \leftarrow L(\theta, \text{valBatches})$                      ▷ Get average validation loss
15:     if epoch = 1 or $\ell < \ell^*$ then
16:       $\ell^* \leftarrow \ell$                                          ▷ Update best validation loss
17:       $\theta^* \leftarrow \theta$                                      ▷ Update best parameters
18:       stop $\leftarrow$ 1                                               ▷ Reset stop counter
19:     else
20:       stop $\leftarrow$ stop + 1                                        ▷ Increment stop counter
21:     end if
22:     epoch $\leftarrow$ epoch + 1                                        ▷ Increment epoch counter
23:   end while
24:   return $\theta^*$                                                     ▷ Return best parameters

4.6.3 Evaluation

We evaluate our method on the 809 test user-profiles sampled according to the method of Section 4.6.1. We use the trained model to predict the similarity score for each unseen item in a batch. The comparison between the predicted scores and the actual likes forms the basis of performance measurement.

For each threshold $\tau \in \{\frac{i}{500} \mid i \in (0, 500)\}$ the unseen items for which $\mathrm{sim} > \tau$ are recommended. We define $\hat{y} = I[\mathrm{sim} > \tau] \in \{0, 1\}$ to indicate this. From these recommendations, we can obtain the number of true positives $TP = \sum y\hat{y}$, false positives $FP = \sum \hat{y}(1 - y)$, true negatives $TN = \sum (1 - y)(1 - \hat{y})$, and false negatives $FN = \sum y(1 - \hat{y})$ for each user. We can then calculate

$$Precision = \frac{TP}{TP + FP} \qquad (40)$$

as a measure of what fraction of recommended items were actually liked, whereas

$$Recall = \frac{TP}{TP + FN} \qquad (41)$$

measures how many liked items were recommended. We can further calculate:

$$Specificity = \frac{TN}{TN + FP} \qquad (42)$$

to measure how many disliked items were not recommended. Another widely used measure for classification performance is accuracy:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (43)$$

which measures the fraction of correct judgements, i.e., an estimate of $E[\hat{y} = y]$.


Note that precision can be increased by recommending as few items as possible, in this case by using a high threshold (until it is undefined for $TP + FP = 0$). On the other hand, recall can be increased by lowering the threshold. In fact, a perfect score of $Recall = 1$ can be obtained by simply recommending all items. Therefore, given some similarity predictions, there is a trade-off between precision and recall. This trade-off can be explored by varying the threshold, and plotted in a precision-recall (PR) curve. The area under this curve, $AUC(PR) \in [0, 1]$, is a measure of recommender quality that takes this trade-off into account. Another metric that uses both precision and recall is the $F_1$-measure:

$$F_1 = 2\, \frac{Precision \cdot Recall}{Precision + Recall} \qquad (44)$$

and there is some trade-off, in other words some threshold, for which this measure is maximized. If the goal is to maximize $F_1 \in [0, 1]$, a recommender system has to estimate through some model $\tau = r(...)$ the optimal threshold to determine how many items to recommend. These estimations are beyond the scope of our research, so we report $F_1$ under two assumed models for $r$. If $r$ can only find an optimal threshold that is static across users, we consider this the worst case, denoted by $\min_r(F_1)$. As an upper limit of $F_1$, we additionally measure $\max_r(F_1)$, which assumes $r$ can estimate the optimal threshold for each user perfectly.

As with precision and recall, there also exists a trade-off between recall and specificity, which is illustrated by the receiver operating characteristic (ROC) curve, a plot of $Recall$ versus $(1 - Specificity)$. Similar to $AUC(PR)$, we calculate $AUC(ROC)$ as a measure of recommender quality.

The accuracy measure $Accuracy \in [0, 1]$ does not take into account that even uninformative recommenders can achieve a high accuracy, depending on the distribution of $y$, so we correct for this by calculating Cohen's kappa coefficient $\kappa$ [6]. This takes the expected accuracy into account and provides a more meaningful measure of classification power. It can be calculated as:

$$\kappa = \frac{Accuracy - ExpectedAccuracy}{1 - ExpectedAccuracy} \qquad (45)$$

with:

$$ExpectedAccuracy = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{(TP + FP + TN + FN)^2} \qquad (46)$$

Like the $F_1$-measure, $\kappa$ also varies by threshold, so we again calculate $\min_r(\kappa)$ and $\max_r(\kappa)$ under assumptions about the ability of the model $\tau = r(...)$ to predict the optimal threshold that maximizes $\kappa$.
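For illustration, the confusion counts and the derived metrics of Eqs. 40 to 46 at a given threshold $\tau$ could be computed as follows (a sketch with our own naming; undefined cases such as $TP + FP = 0$ are returned as NaN):

```python
import numpy as np

def metrics(sim, y, tau):
    """Per-user evaluation metrics at threshold tau; sim and y are per unseen item."""
    yhat = (sim > tau).astype(int)
    tp = np.sum(y * yhat); fp = np.sum((1 - y) * yhat)
    tn = np.sum((1 - y) * (1 - yhat)); fn = np.sum(y * (1 - yhat))
    precision = tp / (tp + fp) if tp + fp else float("nan")       # Eq. 40
    recall = tp / (tp + fn)                                       # Eq. 41
    specificity = tn / (tn + fp)                                  # Eq. 42
    accuracy = (tp + tn) / (tp + fp + tn + fn)                    # Eq. 43
    f1 = 2 * precision * recall / (precision + recall)           # Eq. 44
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (tp + fp + tn + fn) ** 2
    kappa = (accuracy - expected) / (1 - expected)                # Eqs. 45-46
    return precision, recall, specificity, accuracy, f1, kappa
```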
For each of the performance metrics, we calculate the average over the 809 test users. The experiments are repeated with various feature types included or excluded to measure their contribution to the model's performance. A comparison with the traditional TF-IDF recommender can demonstrate the value of semantics-driven recommendations in this domain.

We perform one-sided pairwise permutation hypothesis tests with $n = 10^5$ permutations for the mean AUC(ROC) at an $\alpha = 1\%$ significance level, and specifically test the proposed model against the TF-IDF benchmark and against all alternative models. The permutation test procedure obviates assumptions about normality and needs only independence, which holds in our case as each of the 809 pairs is from a different user. If we can reject all the hypotheses, we can generalize our conclusion, because it provides a basis for asserting that the proposed model outperforms all models with a certain confidence. We therefore have to minimize the probability of any hypothesis rejection being a false positive. This probability increases with the number of tests, but the Bonferroni correction [8] guarantees that testing each hypothesis at an $\alpha/n$ level leads to a worst-case overall $\alpha$ for $n$ comparisons. We apply this to our 10 hypotheses by lowering the significance level to 0.1% to achieve $\alpha = 1\%$. If all hypotheses are rejected with this correction, we obtain stronger evidence about the model's performance.
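A sketch of such a one-sided paired permutation test: under the null hypothesis the paired AUC(ROC) differences are exchangeable in sign, so we randomly flip signs and compare the permuted means with the observed mean. The function below is our own illustration:

```python
import numpy as np

def paired_permutation_pvalue(auc_a, auc_b, n_perm=100_000, seed=0):
    """One-sided p-value for H1: model A has the higher mean AUC(ROC)."""
    rng = np.random.default_rng(seed)
    d = np.asarray(auc_a) - np.asarray(auc_b)   # one paired difference per user
    observed = d.mean()
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=d.size)  # random sign flips
        count += (signs * d).mean() >= observed
    return count / n_perm
```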
Next to the performance metrics and the statistical hypothesis tests, we also investigate the estimates of the relation weights $\vec{q}_i$, as they can indicate the relative importance of the different relations. We additionally take into account the training time required for the optimization procedure, as an important factor in the practical scalability of our approach.


5 Results
We compare the results for models that use various combinations of feature types in Eq. 17. For the sake of brevity, we refer to the combination of the 5 concept feature types as C. TF-IDF is the baseline model with terms from plots, denoted as T. Our version of SF-IDF+ based on synsets from plots is called S. VGG19 and VSE are referred to as VG and VS respectively. These abbreviations, referring to one or more feature types, are the model components. Each model can consist of one or more of these components. An overview of the feature types belonging to each component can be found in Table 5.

Table 5: Model components.

i  Feature type t_i    Acronym
1  Directors
2  Actors
3  Writers             C
4  MovieLens genres
5  OMDb genres
6  Terms               T
7  Synsets             S
8  VGG19               VG
9  VSE                 VS

When the visual feature scaling of VG or VS is learned (optimized) together with the rest of the parameters, the component is denoted VGL or VSL respectively. When the VG scaling is pre-trained in another model and transferred to this model, we denote the component VGR or VGA. VGR means that each of the 10 restarts uses a pre-trained scaling from a different restart of VGL. VGA means that each of the 10 restarts uses the same pre-trained scaling, namely the average scaling over all 10 restarts of VGL. All models under consideration are presented in Table 6. Our proposed semantics-driven model is called C+S+VGA, which combines the concepts (C) with synsets from plots (S) and posters (VG), where the scaling for the VGG19 synsets is transferred from the average of the 10 optimized VGL models.
We start by describing the results and computational load of the optimization
procedure, and continue by calculating the performance on the test users. We
calculate the p-values for comparisons between the average AUC(ROC) for each
model and test the hypotheses. The parameter estimates, performance numbers,
and value of the various feature types are also investigated. Finally, the proposed
model is tested against the others and the TF-IDF benchmark.


Table 6: All models under consideration.

Model      k*  Params**  Description
T          1   2         TF-IDF, benchmark.
C          5   18        Modified CF-IDF+.
S          1   21        SF-IDF+, synsets from plots.
C+S        6   38        C and S combined.
VG         1   2         VGG19 synsets, unscaled.
VS         1   2         Visual-Semantic Embeddings, unscaled.
VGL        1   1,002     VGG19 synsets, learned scaling.
VSL        1   1,026     Visual-Semantic Embeddings, learned scaling.
C+S+VG     7   39        C, S, and VG, unscaled.
C+S+VGL    7   1,039     C, S, and VGL, learned scaling.
C+S+VGR    7   39        C, S, and VGR, scaling transferred from VGL.
C+S+VGA    7   39        C, S, and VGA, scaling transferred from VGL.
* Number of feature types (part-similarities). ** Number of parameters.

5.1 Optimization

We implement the optimization procedure in the Python 2.7 (https://www.python.org/download/releases/2.7/) library Keras (https://keras.io/), which uses a Theano (http://deeplearning.net/software/theano/) back-end. The calculations are performed on a regular desktop PC equipped with an NVIDIA GTX1060 GPU, which enables efficient parallel computation of the gradient updates in batches of 1,024 observations. The gradients are calculated automatically by Theano using backpropagation. We minimize the overhead of on-line loading by using a solid-state drive (SSD) and simultaneously loading the next batch while SGD is applied to the current batch. Batches are reshuffled before every epoch. For 10 random restarts, each model is initialized with random weights and is trained with the optimization and early stopping procedure of Section 4.6.2 with Adam updates, using an initial learning rate of $\eta = 10^{-2}$ and the other parameters as recommended by the authors [13]. Together with the $1/\text{epoch}$ annealing (Algorithm 1), this shows stable convergence in our experiments.

To optimize C+S+VGA and C+S+VGR, we have to first optimize the VGL model, extract the visual scaling from the 10 restarts, and pre-compute the VGG19 dot-products with this scaling. The optimization results are presented in Table 7, and we find that training time is within reasonable limits, taking fewer than 70 minutes for even the heaviest model, C+S+VGL. The impact of our

Table 7: Optimization results, averages over 10 random restarts.
n = 102,400 validation and n = 1,406,976 train observations.

             Logloss*            Training time**
             Valid.   Train      Epochs  Secs/Epoch  Minutes
C+S+VGA      0.6671   0.6680     10.4    23.0        4.0
C+S+VGL      0.6681   0.6694     35.7    117.0       69.7
C+S+VGR      0.6708   0.6716     9.4     23.8        3.8
C+S+VG       0.6810   0.6820     11.7    23.3        4.5
VSL          0.6779   0.6777     39.3    64.4        42.2
VGL          0.6797   0.6797     26.4    87.3        38.4
VG           0.6924   0.6925     9.4     6.4         1.0
VS           0.6930   0.6931     8.1     6.3         0.9
C+S          0.6812   0.6822     11.0    22.7        4.2
C            0.6815   0.6826     11.9    10.3        2.0
T (TF-IDF)   0.6896   0.6900     10.0    6.4         1.1
S (SF-IDF+)  0.6912   0.6914     11.0    14.7        2.7
* Minimum over all epochs. ** Until early stopping.

scalability improvement as described in Section 4.3 is reflected in a roughly 14x reduction in seconds per epoch of the VG model (from 87.3 to 6.4), which uses pre-computed dot-products, compared to its VGL counterpart using the traditional approach. We do not test the other models using the traditional approach, but given their longer feature vectors we expect that the efficiency gain would be at least as high.

The models with learned visual scalings tend to overfit to the training data, as witnessed by the lower training loss compared to the validation loss. This is consistent with expectations, as the number of parameters is several orders of magnitude larger than for the other models. This does not prevent them from outperforming their unscaled counterparts on the validation set, however. Although the VSL model with visual-semantic embeddings has 1,024 features compared to 1,000 synset features for the VGL model, it takes about 1.5x as many epochs to converge and results in a slightly better logloss. The sparsity of the VGG19 vectors compared to the VSE vectors could have been a factor in this. For the unscaled visual vectors we see the opposite, as VG needs slightly more epochs and results in a lower loss (Table 7).

5.2 Evaluation

Using the optimized models, we can calculate the performance metrics on our set of 809 test users. We present the results in Table 8, and it is clear that even though we do not directly optimize for these metrics, a lower logloss results in higher test performance. The p-values of pairwise permutation tests on the average AUC(ROC) are presented in Table 9.

Table 8: Performance on test set, n = 809 users, averages over 10 random restarts.

             AUC             F1                κ
             ROC    PR       min_r  max_r     min_r  max_r
C+S+VGA      0.634  0.391    0.435  0.537     0.137  0.298
C+S+VGL      0.624  0.385    0.431  0.531     0.131  0.289
C+S+VGR      0.624  0.386    0.432  0.532     0.128  0.286
C+S+VG       0.574  0.362    0.419  0.510     0.087  0.253
VGL          0.605  0.347    0.429  0.519     0.110  0.262
VSL          0.605  0.370    0.422  0.517     0.115  0.268
VG           0.525  0.308    0.415  0.476     0.036  0.189
VS           0.508  0.299    0.415  0.472     0.018  0.176
C+S          0.570  0.361    0.419  0.509     0.083  0.251
C            0.567  0.358    0.419  0.507     0.081  0.249
T (TF-IDF)   0.535  0.324    0.413  0.479     0.041  0.200
S (SF-IDF+)  0.531  0.319    0.411  0.477     0.038  0.198

Table 9: Paired permutation tests on AUC(ROC), $10^5$ random permutations of 809 paired observations, for $H_0: \mu_{row} = \mu_{column}$, $H_1: \mu_{row} > \mu_{column}$.

              VS   VG   S    T    C    C+S  C+S  VSL  VGL  C+S   C+S
                                            +VG             +VGL  +VGR
C+S+VGA *     .00  .00  .00  .00  .00  .00  .00  .00  .00  .00   .00
C+S+VGR *     .00  .00  .00  .00  .00  .00  .00  .00  .00  .50   -
C+S+VGL       .00  .00  .00  .00  .00  .00  .00  .00  .00  -
VGL           .00  .00  .00  .00  .00  .00  .00  .49  -
VSL           .00  .00  .00  .00  .00  .00  .00  -
C+S+VG        .00  .00  .00  .00  .00  .00  -
C+S           .00  .00  .00  .00  .10  -
C             .00  .00  .00  .00  -
T (TF-IDF)    .00  .07  .12  -
S (SF-IDF+)   .00  .21  -
VG            .00  -
VS            -
* Bonferroni: all comparisons with p < 0.01 also significant at $\alpha/10 = 0.1\%$.

We obtain an unexpected result, as SF-IDF+ (S) does not manage to outperform the benchmark TF-IDF on any metric. The performance difference is not statistically significant (Table 9), so we cannot say whether SF-IDF+ would perform worse in the limit, but given its theoretical foundations these weak results are unexpected. The word sense disambiguation (WSD) and the additional information in the form of synsets, combined with more parametric freedom, apparently do not translate into higher performance. The optimization procedure could be a factor, but as mentioned, the loss function works to increase test performance. Regardless, SF-IDF+ also shows a higher train and validation logloss. We retrieve the optimized relation weights (Table 10) and calculate the correlations between the weights across the random restarts. We find that all correlations are above 0.96 (n = 45), which suggests that parameter instability is a minor factor. The quality of the synset features could play a role, and to check this we revisit the plot for the animation film Toy Story (1995) mentioned in Section 3:

A little boy named Andy loves to be in his room, playing with his toys, especially his doll named Woody. (...)

Let us consider this first sentence above and investigate the extracted synsets by looking up their definitions:

Small: limited or below average in number or quantity or magnitude or extent;


Son: a male human offspring;
Name: mention and identify by name;
Sexual love: sexual activities (often including sexual intercourse);
Room: the people who are present in a room;
Playing: the action of taking part in a game or sport or other recreation;
Plaything: an artifact designed to be played with;
Doll: a small replica of a person; used as a toy.
Name: mention and identify by name;

The synsets small, son, and name are correctly identified from the words little, boy, and named respectively. The verb loves is misidentified as a noun even though it is clearly a verb, but far more damaging is its mapping to the synset sexual love. This is not what the author had in mind and completely misses the point. We further see that the word room is mapped to the wrong synset, but the rest of the words are processed correctly. This example points to low-quality WSD as the most likely cause of the unexpected under-performance. Although disambiguation theoretically leads to more accurate semantics, mistakes are also amplified. In the above case the ambiguous meanings of the raw terms love and room would have been preferable, which might explain why TF-IDF performed slightly better. Nonetheless, the consistently high weights of WordNet relations shown in Table 10 do demonstrate that related synsets add value.

Table 10: Optimized synset relation weights, averages over 10 random restarts.

                     S     +C    +VG   +VGA  +VGR  +VGL
Direct               .191  .113  .109  .115  .134  .151
Instance hypernym    .174  .156  .172  .162  .187  .178
Topic domain         .157  .190  .198  .209  .234  .238
Member holonym       .146  .125  .131  .102  .147  .132
Region domain        .109  .058  .048  .016  .015  .006
Part holonym         .094  .119  .119  .150  .147  .155
Also see             .069  .088  .087  .061  .063  .059
Usage domain         .013  .070  .083  .047  .008  .002
Cause                .012  .004  .004  .002  .001  .002
Instance hyponym     .010  .022  .012  .017  .015  .023
Substance holonym    .008  .024  .005  .083  .021  .008
Substance meronym    .006  .011  .014  .001  .004  .000
Part meronym         .005  .002  .004  .003  .003  .005
Verb group           .003  .010  .009  .021  .016  .037
Member meronym       .002  .001  .003  .001  .002  .001
Similar to           .000  .005  .000  .000  .000  .000
Hyponym              .000  .001  .000  .000  .000  .000
Hypernym             .000  .001  .000  .001  .000  .000
Entailment           .000  .001  .000  .008  .002  .002

Concepts alone (C) are more informative than both synsets and terms, as model C substantially improves over the baseline TF-IDF on all metrics (Table 8). When we look at the relation weights for concepts in Table 11, we see that for all models the directly found concepts have the highest average optimal weight, as we would expect. But we can infer from the maximum weights in Table 11 that there are a few optimal solutions in which one of the other relations holds most of the weight. The different minima obtain a similar loss but can vary substantially in parameter estimates. For indirect concepts of the actors class, the most dominant relation is writers. This suggests that users tend to value movies with actors who have played a role in movies that involve writers from their user-profile. Related directors, on the other hand, receive the highest weight when they are found through other directors. Related writers receive little weight overall, regardless of the relation through which they are found.

Table 11: Optimized concept relation weights, means and maxima over 10 random restarts. →relation denotes found through relation.

                     C     C+S   C+S   C+S   C+S   C+S
                                 +VG   +VGL  +VGA  +VGR
Mean
Actors               .268  .774  .772  .271  .932  .360
  →Actors            .000  .000  .000  .272  .000  .174
  →Directors         .000  .000  .000  .272  .000  .174
  →Writers           .700  .011  .152  .417  .003  .398
Directors            .697  .643  .666  .741  .602  .620
  →Actors            .002  .000  .002  .005  .001  .005
  →Directors         .272  .318  .287  .239  .338  .330
  →Writers           .030  .039  .046  .016  .059  .045
Writers              .880  .886  .764  .888  .898  .888
  →Actors            .048  .037  .030  .045  .023  .045
  →Directors         .065  .054  .060  .054  .045  .056
  →Writers           .008  .023  .147  .013  .034  .012
Maximum
Actors               .911  .909  .907  .923  .942  .936
  →Actors            .000  .000  .001  .965  .001  .654
  →Directors         .096  .969  .094  .079  .095  .225
  →Writers           .985  .019  .971  .977  .005  .980
Directors            .710  .685  .721  .762  .704  .834
  →Actors            .004  .001  .014  .008  .006  .016
  →Directors         .298  .394  .390  .335  .578  .712
  →Writers           .107  .101  .164  .102  .384  .276
Writers              .912  .897  .920  .923  .908  .903
  →Actors            .057  .056  .055  .054  .026  .081
  →Directors         .079  .060  .077  .067  .073  .081
  →Writers           .031  .038  .907  .029  .044  .043

Comparing the visual feature models, we see that the unscaled VG outperforms VS, which indicates that the 1,000 synset feature values that we extract from the posters are more suitable for recommendation than the 1,024-dimensional visual-semantic embeddings. Optimized scaling results in a large performance increase: from an AUC(ROC) of 0.508 to 0.605 for VSL and from 0.525 to 0.605 for VGL. Under learned scaling, VSL rivals VGL on some metrics and closes the gap on AUC(ROC). These results indicate that the visual-semantic embeddings do not improve recommender performance over the synset vectors. In contrast to the visual-semantic features, the synsets are interpretable, and we can compare the 1,000 learned scales. We optimize the same scaling in a full C+S+VGL model as a benchmark to compare C+S+VGA and C+S+VGR against, and to estimate the impact of re-using or transferring learned scales across models. The 10 synsets with the highest scale, on average over the 10 restarts (Table 12), exhibit some consistency. We find an average correlation of 0.268 (n = 45) of the visual scaling over the 10 restarts for VGL, and a higher 0.486 (n = 45) for C+S+VGL. Due to the much higher dimensional space compared to the relation weights, less stability can be expected. Nevertheless, the correlations indicate that there is some stability across solutions, and this increases when concepts and synsets are added. Although the number of parameters does not increase substantially from this addition, there could be interactions or information overlap between the feature types that lead to less variation between local minima. We find an average correlation of 0.360 (n = 55) between the 10 restarts of the two models, which increases to 0.768 (n = 1) when we compare the two mean scalings.

Table 12: Learned visual feature scalings, n = 809 users. Top 10 highest scales, mean over restarts.

VGL                         C+S+VGL
Fur coat        .01198      Book jacket       .01587
Stage           .00991      Toyshop           .01144
Web site        .00975      Bow tie           .01142
Balloon         .00935      Web site          .01142
Jigsaw puzzle   .00930      Jigsaw puzzle     .01000
Volcano         .00923      Volcano           .00973
Pick            .00907      Cinema            .00911
Toyshop         .00869      Sweatshirt        .00887
Cinema          .00861      Fountain          .00886
Bow tie         .00855      Military uniform  .00847

When the optimized scales of VGL are transferred per restart to the C+S+VGR model, it strongly outperforms its unscaled version C+S+VG and all other recommenders without learned scaling. The performance is indistinguishable from that of C+S+VGL, which can be considered a good benchmark for the learned scaling because this model optimized the scaling together with the rest of the model. In our test environment, therefore, re-using the visual feature scales in a different model does not seem to decrease performance. When we collect the average VG scale over the 10 random restarts of VGL and transfer this to C+S+VGA, we see that it strongly outperforms all other models (Table 8), and we can reject the null hypothesis that one or more of the other models have an equal average AUC(ROC) at the 1% level. C+S+VGA outperforms the traditional benchmark TF-IDF by a large margin on all metrics. Average AUC(ROC) improves from 0.535 to 0.634, and AUC(PR) from 0.324 to 0.391. We improve $\min_r(F_1)$ from 0.413 to 0.435, and $\max_r(F_1)$ from 0.479 to 0.537. The kappa metrics are improved from 0.041 to 0.137 and from 0.200 to 0.298 for $\min_r(\kappa)$ and $\max_r(\kappa)$ respectively. Given the separately pre-trained visual scaling, we can optimize the model with the scalable approach using pre-computed dot-products in 4-5 minutes. It is neither necessary to train the scaling together with the model as a whole, nor to directly optimize on the final performance metrics.


6 Conclusion

In this thesis we proposed an extension to previous work on semantics-driven recommenders. We demonstrated that these systems are broadly applicable beyond news recommendation, the complex domain of movies being one example. We found that rich semantic information can be extracted not just from articles, but from item descriptions in a much broader sense, even raw images. When a suitable domain ontology is unavailable or incompatible with the recommendation system, our virtual ontology method for finding related concepts can be applied directly and only requires the dataset itself. Compared to state-of-the-art visual-semantic methods, synset-based visual feature extraction turned out to be more interpretable and achieved similar performance. In situations where the proper scaling method for feature vectors is unknown, we showed that effective scales can be found through direct optimization of the logloss. We further provided evidence that these learned scales can be transferred to other models without having to re-optimize them. Through a reformulation of how related features are combined, we were able to pre-compute the computationally expensive operations of the cosine similarities and reduced the dimensionality of the similarity model by several orders of magnitude. The semantics-driven recommender we presented strongly outperformed the benchmark TF-IDF on ROC, PR, $F_1$, and $\kappa$, even though it was not directly optimized on these metrics but on a cross-entropy loss function that allowed for efficient gradient-based methods. The proposed scaling-up of the semantics-driven approach has allowed us to optimize these models within minutes on consumer-grade commodity hardware.
This research highlights that semantics-driven recommenders have many unexplored applications and can be utilized effectively with the proposed approach, opening the door to further extensions to other domains. There is potential for synsets to contribute more information with an improved word sense disambiguation method. The visual synsets extracted from the posters do not have to be disambiguated, but can perhaps be augmented with related synsets from WordNet. The convincing success of learned feature scaling introduces the possibility of models with greater degrees of freedom, especially since the short training time on commodity hardware means that still larger datasets can be utilized. From our dataset, many more user-profiles and training samples could have been sampled. More semantic information in the form of biographies, a wider variety of images/posters, reviews, or external ontologies such as DBpedia (http://wiki.dbpedia.org/) could be integrated within the large capacity of the proposed framework. And although we have used the most important semantic variables, we do not rule out the potential (domain) semantics left in some of the discarded data, such as the country of origin. The capacity to train more parameters could be spent on learning separate weights or scalings for lexical categories such as nouns/verbs/adverbs/adjectives for the synsets. Related synsets and concepts are currently found through direct connections only, but multi-step paths in either WordNet or the domain ontology fit in the proposed framework and can add additional (domain) semantics. When concepts are extracted as variables instead of from text, IDF scaling is not an obvious choice. Relevant questions about scaling remain, as we demonstrated the large impact of optimal scaling, which has not been tested for concepts or synsets from texts. The benefits of allowing a concept to be in multiple classes, such as a person who is a director of one movie and an actor in another, are also left to explore. When the number of parameters is increased to the point of overfitting, tools such as regularization, noise injection, or random sub-sampling of features can be employed to further increase model capacity.


References
[1] Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet, pages 136-145. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

[2] Michel Capelle, Flavius Frasincar, Marnix Moerland, and Frederik Hogenboom. Semantics-based News Recommendation. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, WIMS 2012, New York, NY, USA, 2012. ACM.

[3] Michel Capelle, Marnix Moerland, Frederik Hogenboom, Flavius Frasincar, and Damir Vandic. Bing-SF-IDF+: A Hybrid Semantics-Driven News Recommender. In Proceedings of the 2015 ACM Symposium on Applied Computing, SAC 2015, New York, NY, USA, 2015. ACM.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011.

[5] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3642-3649, June 2012.

[6] Jacob Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37-46, 1960.

[7] Emma de Koning. News Recommendation with CF-IDF+. B.S. Thesis, Erasmus University Rotterdam, July 2015.

[8] Olive Jean Dunn. Multiple Comparisons Among Means. Journal of the American Statistical Association, 56(293):52-64, 1961.

[9] M. Egmont-Petersen, D. de Ridder, and H. Handels. Image Processing with Neural Networks - a Review. Pattern Recognition, 35(10):2279-2301, 2002.

[10] Sachin Sudhakar Farfade, Mohammad J. Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. CoRR, abs/1502.02766, 2015.

[11] Frank Goossen, Wouter IJntema, Flavius Frasincar, Frederik Hogenboom, and Uzay Kaymak. News Personalization Using the CF-IDF Semantic Recommender. In Proceedings of the 1st International Conference on Web Intelligence, Mining and Semantics, WIMS 2011, New York, NY, USA, 2011. ACM.

[12] Karen Sparck Jones. A Statistical Interpretation of Term Specificity and its
Application in Retrieval. Journal of Documentation, 28(1):1121, 1972.

[13] Diederik P. Kingma and Jimmy Ba. Adam: a Method for Stochastic Opti-
mization. CoRR, abs/1412.6980, 2014.

[14] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying Visual-
Semantic Embeddings with Multimodal Neural Language models. CoRR,
abs/1411.2539, 2014.

[15] S. Lawrence, C. L. Giles, Ah Chung Tsoi, and A. D. Back. Face Recognition:


a Convolutional Neural-Network Approach. IEEE Transactions on Neural
Networks, 8(1):98113, Jan 1997.

[16] Michael Lesk. Automatic Sense Disambiguation Using Machine Readable


Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceed-
ings of the 5th Annual International Conference on Systems Documentation,
SIGDOC 86, pages 2426, New York, NY, USA, 1986. ACM.

[17] Masakazu Matsugu, Katsuhiko Mori, Yusuke Mitari, and Yuji Kaneda. Sub-
ject Independent Facial Expression Recognition with Robust Face Detection
using a Convolutional Neural Network. Neural Networks, 16(56):555 559,
2003. Advances in Neural Networks Research: {IJCNN} 03.

[18] Marnix Moerland, Frederik Hogenboom, Michel Capelle, and Flavius Fras-
incar. Semantics-based News Recommendation with SF-IDF+. In Proceed-
ings of the 3rd International Conference on Web Intelligence, Mining and
Semantics, WIMS 2013, New York, NY, USA, 2013. ACM.

[19] Roberto Navigli. Word Sense Disambiguation: A Survey. ACM Comput.


Surv., 41(2):10:110:69, February 2009.

[20] Michael Pazzani and Daniel Billsus. Content-based Recommendation Sys-


tems. The Adaptive Web, pages 325341, 2007.

[21] Amir Hossein Nabizadeh Rafsanjani, Naomie Salim, Atae Rezaei Aghdam,
and Karamollah Bagheri Fard. Recommendation Systems: A Review. Inter-
national Journal of Computational Engineering Research, 3(5):4752, 2013.

[22] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Neuro-


computing: Foundations of Research. chapter Learning Representations by
Back-propagating Errors, pages 696699. MIT Press, Cambridge, MA, USA,
1988.

[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S.

Page 49 of 50
M.M. Bendouch Master Thesis

Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet Large Scale Vi-
sual Recognition Challenge. CoRR, abs/1409.0575, 2014.
[24] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks
for Large-scale Image Recognition. CoRR, abs/1409.1556, 2014.
[25] SINTEF. Big Data, for Better or Worse: 90% of worlds data gen-
erated over last two years. www.sciencedaily.com/releases/2013/05/
130522085217.htm.
[26] SINTEF. The PNG image file format is now more popular than
GIF, 2013-01. https://w3techs.com/blog/entry/the_png_image_file_
format_is_now_more_popular_than_gif.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Ra-
binovich. Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
[28] T. Tieleman and G. Hinton. Lecture 6.5RmsProp: Divide the gradient by
a running average of its recent magnitude. COURSERA: Neural Networks
for Machine Learning, 2012.
[29] C.J. Van Rijsbergen, S.E. Robertson, and M.F. Porter. New Models in
Probabilistic Information Retrieval. British Library research & development
report. Computer Laboratory, University of Cambridge, 1980.
[30] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show
and Tell: A Neural Image Caption Generator. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015.
[31] Tim Vos. On the Recommendation of News Using the CF-IDF+ Recom-
mender. B.S. Thesis, Erasmus University Rotterdam, July 2015.
[32] P. D. Wasserman and T. Schwartz. Neural networks. II. What are they and
why is everybody so interested in them now? IEEE Expert, 3(1):1015,
Spring 1988.
[33] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A Survey of
Transfer Learning. Journal of Big Data, 3(1):9, 2016.
[34] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large
linear classification when data cannot fit in memory. In Proceedings of the
16th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD 10, pages 833842, New York, NY, USA, 2010. ACM.
[35] Guo-Qing Zhang, Guo-Qiang Zhang, Qing-Feng Yang, Su-Qi Cheng, and
Tao Zhou. Evolution of the Internet and its Cores. New Journal of Physics,
10(12):123027, 2008.
