
A Data-Driven Computational Semiotics:
The Semantic Vector Space of Magritte’s Artworks
Semiotica, special issue Meaningful Data, 2018

Jean-François Chartier1, Davide Pulizzotto1, Louis Chartrand1, Jean-Guy Meunier1


1 Laboratory of Cognitive Analysis of Information, UQAM, Montreal, Canada

Abstract: The rise of big digital data is changing the framework within which linguists,
sociologists, anthropologists and other researchers are working. Semiotics is not spared by
this paradigm shift. A data-driven computational semiotics is the study of patterns in
human-created content related to semiotic phenomena through an intensive use of
computational methods. One of the most promising frameworks in this research program is the Semantic
Vector Space (SVS) models and their methods. The objective of this article is to contribute
to the exploration of the SVS for a computational semiotics by showing what types of
semiotic analysis can be accomplished within this framework. The study is applied to a
unique body of digitized artworks. We conducted three short experiments in which we
explore three types of semiotic analysis: paradigmatic analysis, componential analysis and
topic modelling analysis. The results reported show that the SVS constitutes a powerful
framework within which various types of semiotic analysis can be carried out.
Keywords: Computational Semiotics; Semantic Vector Space; Data-Driven; Paradigmatic
Analysis; Componential Analysis; Topic Analysis; Artwork Mining; René Magritte.

1. Introduction

The rise of big digital data brings new opportunities in various research areas of humanities
and social sciences. Today, thanks to corpora such as the Google N-Gram, a multi-million
book corpus covering a period of more than 500 years, anthropologists can study the
evolution of human culture at an observational scale that was previously
inaccessible (Michel et al. 2011). Linguists can track linguistic shifts, as they evolve in a
language over centuries (Hamilton, Leskovec, and Jurafsky 2016). With so many classical
scholarship digitization projects, historiography is also becoming a large-scale empirical
analysis enterprise (Mimno 2012). The social world is flooded with digital footprints from
the Web that represent invaluable big data sets for sociological analysis (Evans and Aceves
2016). The rise of big digital data also brings new challenges, in particular those that
come with the computational frameworks needed to study these data.

We hear more and more about the "data revolution" and what is now called the new "data-
intensive" or "data-driven" approach (Kell and Oliver 2004; Kelling et al. 2009; Kitchin
2014; Pankratius et al. 2016). A data-driven approach is a computer-based exploratory
framework that uses data mining algorithms to discover regularities or patterns of interest
in large datasets. In this framework, algorithms are computational enhancements for the
discovery of these patterns.
As noted by many, in just a decade, digital (big) data and algorithms have changed the
framework within which linguists, sociologists, anthropologists, semioticians and other
researchers do their work. This entails a new relationship between researchers and the
empirical evidence they uncover (Rastier 2011:11). A computational framework based on
an intensive use of data mining algorithms is more than ever a critical issue in the humanities
and social sciences. Semiotics is not spared by this paradigm shift. A data-driven
computational semiotics has been emerging for some years.

1.1. A Data-Driven Computational Semiotics

Semiotics is defined as the study of signs and of all systems and processes concerning
semiosis. Computational semiotics may be considered as a new branch of semiotics.
However, its definition is not easy to grasp. One can find at least three research programs
that claim to belong to computational semiotics (Meunier 2017; Tanaka-Ishii 2015).
In one of these programs, computational semiotics means that computation is a kind of
semiotic theory itself. In this view, one would say for instance that a computer is a semiotic
system:
“[…] in order to be meaningful, computers ought to be semiotic machines. […] To say
that the computer is a semiotic machine means to realize that what counts in the
functioning of such machines are not electrons (and in the future, light or quanta or organic
matter), but information and meaning expressed in semiotic forms, in programs, in
particular, or, more recently, in apps.” (Nadin 2011:172)
This definition can be found in several research programs of artificial intelligence (Floridi
1999; Meunier 1989).
In a second research program, computational semiotics is about human-machine interaction
and design (De Souza 2005). This research program includes many subfields like the
semiotics of programming language and software engineering (Tanaka-Ishii 2010).
A third research program refers to computer-based methodologies. Here, computational
semiotics means studying semiotic phenomena in a human-created content — all kinds of
content, from text to artwork — with computational models and methods.
In this research program, we develop computational models and methods that can be
manipulated by a computer and applied to datasets in order to mine and extract important
mathematical patterns correlated with phenomena of semiosis. This kind of work remains,
however, very marginal in the semiotic community. The relation between those mathematical
patterns in data and semiosis remains a controversial open issue. Several semioticians
will reject the hypothesis that semiotic processes could be modelled in a computational way
(Meunier 2014).
Nevertheless, some of the most promising works in this program are those about the Semantic
Vector Space (SVS) models and their methods (Leopold 2005; Mehler 2003; Neuman,
Cohen, and Assaf 2015; Rieger 1999; Sahlgren 2006). SVS models provide a mathematical
modelling of the combinatorial behavior of signs correlated with different semiotic phenomena,
together with a collection of algorithm-based methods with which one seeks to discover such
combinatorial patterns in large corpora.
Currently, the main leaders in this research program of computational semiotics are not
semioticians but computer scientists specialized in natural language processing, data mining
and information retrieval (Turney and Pantel 2010; Van Rijsbergen 2004; Widdows 2004).
The benefits of this work are, however, important for computational semiotics. They have
shown that the SVS allows one to discover combinatorial patterns from which one can infer
complex meaning structures like paradigmatic relations, componential relations, metaphors,
analogies, topics and so on.

1.2. Aim of This Research

The aim of this article is to explore the contribution of SVS for a computational semiotics
research program. How can SVS constitute a theoretical and methodological framework from
which to build a data-driven computational semiotics? What is the productivity of this
framework for semiotics, in other words, what types of semiotic analysis can we conduct
based on that framework?
In order to provide answers to these questions, we first present the theoretical framework of
the SVS, its main parameters and the main operators of its semantic inference engine. The
SVS is usually presented from a linguistic or information retrieval perspective and applied
to text analysis. In this study, we try to detach the framework from this particular
perspective. To achieve this, we apply the SVS and its algorithms to an unprecedented corpus
composed of the surrealist painter René Magritte’s artworks. To our knowledge, this is the
first time that the SVS is applied as we have applied it to the analysis of the type of sign we
find in this corpus, that is to say, iconic visual signs.
We conduct three short experiments on this corpus. Each aims to explore a kind of semiotic
analysis applied to the corpus of Magritte’s artworks with the algorithms of the SVS. The
first experiment is a paradigmatic analysis. It is based on a semantic comparison algorithm
and aims to discover silences in the studied corpus. The second experiment is a componential
analysis. It is based on a semantic decomposition algorithm and aims to discover how some
iconic visual signs are composed in the studied corpus. Finally, the third experiment is a topic
modelling analysis. It is based on a partitioning algorithm and aims to discover the topics
around which Magritte’s artworks are organized.
These three experiments are not intended to show all the semiotic analyses one can
accomplish with the help of the SVS framework. We rather wish to give a partial overview
of what we think constitutes a very promising new branch of semiotics. As a matter of fact,
those experiments not only help to answer our questions; they also raise several issues
for semiotics.

2. Theoretical Framework: The Semantic Vector Space Model

The semantic vector space model (SVS) is both a computational model and a set of algorithm-
based methods for the inductive discovery of meaning structures from combinatorial patterns
of signs in a corpus (usually textual, but not exclusively). The model is based on a spatial
metaphor, whereas the methods are based on a distributional hypothesis (Gärdenfors 2014;
Landauer and Dumais 1997; Lemaire and Denhière 2006; Sahlgren 2006; Turney and Pantel
2010; Widdows 2004).
Its theoretical sources are numerous and multidisciplinary, including computer sciences,
psychology and linguistics. Originally, it is a synthesis between, firstly, the spatial metaphor
operationalized in the differential semantics of Osgood (Osgood 1952, 1964; Osgood, Suci,
and Tannenbaum 1957) and, secondly, the distributional hypothesis of structural linguistics
developed by Harris (Harris 1951, 1954).

2.1. The Vector Space Metaphor

The spatial metaphor of the SVS model was introduced in Osgood’s psycholinguistic work,
but some have suggested that this metaphor goes back as far as Saussure’s classic writings
and the concept of value (Sahlgren 2008). Osgood wanted to take up a Saussurean
principle, but by operationalizing it in a more formal framework. He sought to measure and
represent mathematically the difference of linguistic sign connotations observed among
different subjects.
To do so, these connotations were collected through a conceptual priming task. For example,
subjects were invited to evaluate on a seven-level scale the connotation of the word
"FATHER" according to the three following dimensions "happy vs sad", "hard vs soft" and
"slow vs fast" (see Figure 1) (Osgood et al. 1957:26). The responses to this priming were
then encoded in vectors and projected into a vector space called metaphorically by Osgood a
"semantic space". Osgood’s differential semantic model consisted in analyzing the distance
between word vectors in this semantic space. According to Osgood, this model allowed an
objective measure of sign connotative meaning.

Figure 1: Example used in Osgood’s experiments to measure the connotative judgments of the word
"FATHER".
This spatial metaphor is coherent with a structural approach to meaning, as opposed to a
referential one. The assumption underlying this spatial metaphor was to conjecture an
isomorphism between the vector space and the semantic space:
« Since the positions checked on the scales constitute the coordinates of the [sign]’s
location in semantic space, we assume that the coordinates in the measurement space are
functionally equivalent with the components of the representational mediation process
associated with this [sign]. This, then, is one rationale by which the semantic differential,

as a technique of measurement, can be considered as an index of meaning. » (Osgood et
al. 1957:30)1
In other words, there is a correlation between some measurable mathematical properties of
the vector space and some non-directly observable properties of meaning structure of signs
encoded in that space. Therefore, the analysis of the first – with algebraic, topological and
geometrical calculi – should enable the inference of the second. This assumption is at the
root of contemporary SVS models. See for instance Gärdenfors, for whom:
« [the meaning] can be described as organized in abstract spatial structures that are
expressed in terms of dimensions, distances, regions, and other geometric notions […] »
(Gärdenfors 2014:22)
The fundamental theoretical elements that constitute the conceptual framework of SVS were
almost all already present in Osgood’s work. What will differentiate the contemporary SVS
from Osgood’s classical work is the integration of the distributional hypothesis in that
framework. The priming technique used by Osgood to construct a semantic space is replaced,
in the contemporary SVS framework, by statistical analysis techniques that enable the
induction of the vector space from sign combinatorial patterns computed in a corpus.

2.2. The Distributional Method

The core concepts of Harris’s distributional method are those of environment and
distribution. The concept of environment defines the analysis unit of sign context or
neighborhood in a corpus and determines the combinatorial relations that a sign has with
other signs of a corpus. This analysis unit can take several forms depending on the corpus.
On a textual corpus, it may correspond to a grammatical unit such as collocation, sentence
and paragraph, but it can also be a more arbitrary unit like a window of words or be linked
to pragmatic criteria like speaking turns.
The concept of distribution is derived from the concept of environment. For Harris, the set
of all the environments of a sign observed in a corpus forms its distribution. According to
the distributional hypothesis, signs characterized by equivalent distributions would
themselves be "functionally" equivalent (Harris 1951:16).
Although Harris focused on syntax, this distributional hypothesis could, according to him,
also be applied to meaning in a corpus. The meaning of a word in a corpus is correlated with
its combinatorial patterns with other words in the corpus. Words with similar co-occurrence
distributions would also have a similar meaning:
« The fact that, for example, not every adjective occurs with every noun can be used as a
measure of meaning difference. For it is not merely that different members of the one
class have different selections of members of the other class with which they are actually
found. More than that: if we consider words or morphemes A and B to be more different

than A and C, then we will often find that the distributions of A and B are more different
than the distributions of A and C. In other words, difference in meaning correlates with
difference in distribution. » (Harris 1954:156)

1 Osgood uses the words "word", "concept" and "sign" in an undifferentiated way. In order to avoid
theoretical ambiguities, the English word "concept" has been replaced by the word "sign" in this quotation.
This does not change the meaning of the sentence.
This quotation echoes another famous one from the linguist Firth, who used to say that "you
shall know a word by the company it keeps" (Firth 1957:179), which means that word
meaning comparison can be reduced to word co-occurrence pattern comparison. This
distributional hypothesis represents the cornerstone of the contemporary SVS methodology.
Its principle is surprisingly simple: words, morphemes, syntagms, and potentially other signs
expressing similar meaning are generally used in similar environments; hence one can induce
which signs express similar meaning by comparing their combinatorial patterns in a corpus.
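As an illustration of this principle, the following toy sketch (our own example, not from the authors) counts the environments shared by words in a small invented corpus: words used in similar environments end up with overlapping distributions.

```python
from collections import Counter

# Toy corpus: each sentence is one environment in Harris's sense.
# The corpus is invented for illustration.
corpus = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat drank milk",
    "the dog drank milk",
    "the car needs fuel",
]

def distribution(word):
    """The distribution of a sign: the words co-occurring with it,
    counted over all the environments in which it appears."""
    dist = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            dist.update(t for t in tokens if t != word)
    return dist

def overlap(w1, w2):
    """Size of the intersection of two distributions."""
    return sum((distribution(w1) & distribution(w2)).values())

# "cat" and "dog" occur in near-identical environments; "car" does not.
print(overlap("cat", "dog"))  # 7
print(overlap("cat", "car"))  # 1
```

Here the shared distribution of "cat" and "dog" is much larger than that of "cat" and "car", which is the kind of pattern from which the SVS methods induce semantic similarity.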

2.3. The Semantic Vector Space Parameters

There are several types of SVS, including explicit semantic spaces (Lund and Burgess 1996;
Rieger 1981), latent semantic spaces (Landauer, Foltz, and Laham 1998), spaces constructed
from random indexing (Sahlgren 2005), word embedding based semantic spaces (Mikolov,
Yih, and Zweig 2013), tensor-based spaces (Baroni and Lenci 2010) and we can also add
several probabilistic models (Blei and Lafferty 2009; Griffiths, Steyvers, and Tenenbaum
2007). All are based on the same fundamental assumptions mentioned above, but
operationalized differently.
The model that we present is a three-parameter model ⟨𝕍, ℂ, 𝕎⟩, with a vocabulary 𝕍, a
set of contexts ℂ and a matrix of combinatorial relations 𝕎. The first parameter is the
vocabulary of the studied corpus. It is denoted by 𝕍 = {v1, …, vm} and is the set of signs
selected to represent the content of the studied corpus, for instance the set of words
present in a textual corpus. The vocabulary forms the vector base of the SVS and determines
its dimensionality.
The second parameter, denoted by ℂ = {c1, …, cn}, represents the set of contexts (or, to use
Harris’s terms, the set of environments) that make up a corpus. A context ck = {vi : vi ∈ 𝕍}
corresponds to a set of signs co-present in the same segment of a corpus. For instance, in a
textual corpus, a context could be the set of words co-present in the same sentence. The
segmentation of a corpus into its different contexts is an operation that will determine what
type of combinatorial regularity between signs the SVS will model.
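To make these first two parameters concrete, here is a minimal sketch assuming sentence segmentation on an invented three-sentence corpus (the corpus and all names are hypothetical, not from the authors' data):

```python
# A toy textual corpus, segmented into contexts at sentence boundaries.
corpus = "The pipe rests on a table. A pipe floats above the sea. The sea meets the sky."

# The set of contexts C: each context c_k is the set of signs
# co-present in one sentence.
contexts = [
    {w.lower().strip(".") for w in sentence.split()}
    for sentence in corpus.split(". ")
    if sentence
]

# The vocabulary V: the set of signs selected to represent the corpus.
# Here we keep every word; in practice stop words would be filtered.
vocabulary = sorted(set().union(*contexts))

print(len(contexts))  # 3
print("pipe" in vocabulary)  # True
```

The segmentation choice (sentence, paragraph, window of words) directly determines which co-presences the model will count.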
The third parameter, denoted 𝕎 = [wij] ∈ ℝ^(m×m), is the matrix of combinatorial relations
between the signs of the vocabulary 𝕍, in which wij corresponds to the value of a statistical
association coefficient between two signs vi and vj. This association coefficient expresses
quantitatively the strength of the combinatorial relation between two signs. Figure 2
illustrates this matrix schematically.

         sign1   sign2   …   signm
sign1    w11     w12     …   w1m
sign2    w21     w22     …   w2m
  ⋮       ⋮       ⋮      ⋱    ⋮
signm    wm1     wm2     …   wmm

Figure 2: Matrix of combinatorial relations between signs.


A row vector v = (w1, …, wm) of the matrix encodes the combinatorial pattern of a sign in
the studied corpus or, to use Harris’s terms, this vector models the “distribution” of this sign
in the corpus. These vectors are the coordinates of the signs in the SVS, in a manner analogous
to what Osgood did, except that rather than representing the results of a semantic priming
task, the vectors v ∈ 𝕎 represent combinatorial patterns in a corpus.
According to Rieger, this matrix of combinatorial patterns can be interpreted as a
“syntagmatic abstraction” of the relations between signs (Rieger 1992). In this matrix, the
value of wij can be seen as the quantification of the strength of the syntagmatic relation
between the signs vi and vj in a corpus. The vector 𝐯𝐢 can be seen as the syntagmatic signature
vector of this sign. The strength of the syntagmatic relation between two signs must be
understood in a very reductive sense: it is a methodological reduction of syntagmatic relations
(collocation, phrasal verbs, etc.) to co-occurrence patterns (Bordag and Heyer 2007; Sahlgren
2008; Schütze and Pedersen 1993). Nevertheless, this methodological reduction represents
one of the most reliable theoretical foundations of distributional linguistics (Dunning 1993;
Manning and Schütze 1999:5).
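A toy construction of such a matrix is sketched below, using raw co-occurrence counts on an invented set of contexts (our own illustration; the paper weights the cells with a statistical association coefficient instead):

```python
# Hypothetical vocabulary and contexts, for illustration only.
vocabulary = ["pipe", "sea", "sky", "table"]
contexts = [
    {"pipe", "table"},
    {"pipe", "sea"},
    {"sea", "sky"},
    {"pipe", "sea", "sky"},
]

# Build the m x m matrix W of combinatorial relations, here filled
# with raw co-occurrence counts across contexts.
m = len(vocabulary)
W = [[0.0] * m for _ in range(m)]
for c in contexts:
    for i, vi in enumerate(vocabulary):
        for j, vj in enumerate(vocabulary):
            if i != j and vi in c and vj in c:
                W[i][j] += 1.0

# The row W[i] is the syntagmatic signature vector of vocabulary[i].
print(W[0])  # signature of "pipe": [0.0, 2.0, 1.0, 1.0]
```

Each row is the syntagmatic signature of one sign; comparing rows is what the inference engine of Section 2.4 operates on.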
The rationale for calculating sign combinatorial patterns is co-occurrence frequency, that is,
the number of times two signs are co-present within the same contexts. The assumption is
that the more frequently two signs are co-present in the corpus, the stronger the relation
between them. In our model, combinatorial patterns between signs are calculated using the
information gain (IG) coefficient2. It is calculated as follows:
$$IG(v_i, v_j) = w_{ij} = \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}} \frac{n_{ij}}{N} \log_2\left(\frac{N \times n_{ij}}{(n_{i1} + n_{i0}) \times (n_{1j} + n_{0j})}\right),$$

where n11 is the number of contexts in which the signs vi and vj are co-present, n10 is the
number of contexts in which the sign vi appears but not the sign vj, n01 is the number of
contexts in which vi does not appear but vj does, n00 is the number of contexts in which
neither vi nor vj appears, and N = |ℂ| is the total number of contexts in the corpus.

2 It is sometimes called “mutual information”. Note that it is very different from the pointwise mutual
information commonly used in the construction of a semantic vector space, which also comes with many
issues; see (Bouma 2009).
IG is a critical feature of our SVS model. It is a probabilistic measure of the statistical
dependence between two elements. It is also one of the most suited coefficients for co-
occurrence analysis since it has many advantages (like a smoothing effect) over a raw
frequency calculation (Bouma 2009; Manning and Schütze 1999:178). In our framework,
this coefficient can be interpreted as follows. A zero information gain between two signs
means that they have statistically independent combinatorial behaviors. Conversely, an IG of
1.0 between two signs means totally dependent combinatorial behaviors in a corpus: the
presence of the former in a context always implies the presence of the latter, and the absence
of one always implies the absence of the other (and vice versa). Empirically, most
combinatorial patterns lie somewhere between these two limits.
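The IG coefficient defined above can be sketched directly from the four contingency counts. This is our own minimal illustration, not the authors' implementation; cells with a zero count contribute nothing to the sum:

```python
import math

def information_gain(n11, n10, n01, n00):
    """Information gain between two signs, computed from the 2x2
    contingency table of their presence/absence over N contexts."""
    N = n11 + n10 + n01 + n00
    ig = 0.0
    # Each tuple: (cell count, row marginal, column marginal).
    for nij, row, col in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nij > 0:
            ig += (nij / N) * math.log2(N * nij / (row * col))
    return ig

# Statistically independent combinatorial behaviors yield IG = 0...
print(round(information_gain(25, 25, 25, 25), 6))  # 0.0
# ...while totally dependent (balanced) behaviors yield IG = 1.0.
print(information_gain(50, 0, 0, 50))  # 1.0
```

The two printed cases reproduce the two limits described in the text: independence and total dependence.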

2.4. The Semantic Inference Engine

The ability of an SVS model to infer certain meaning structures is grounded in the algebraic
formalism presented above. A strong assumption of this model is that several inferences
about meaning structures can be based on algebraic calculi applied to the elements of an
SVS.
These vector calculi are currently the subject of several research programs, notably under the
names “vector symbolic architecture” (Widdows and Cohen 2014) and “conceptual space
architecture” (Gärdenfors 2014). Three vector calculi are particularly important. The first one
is the cosine, which allows semantic comparison of signs. The second one is vector addition,
which allows semantic composition of signs. The third one is the subtraction of the orthogonal
complement, which allows semantic negation of signs. These operations are defined in Table 1
(the definitions assume unit-normed vectors).
Table 1: Basic inferences on a semantic vector space.

Inference              Operation                 Vector calculus                                         Output

Semantic comparison    Cosine                    cos(u, v) = u ∙ v = (u1v1 + … + umvm)                   θ ∈ ℝ

Semantic composition   Addition                  ADD(u, v) = u + v = (u1 + v1, …, um + vm)               w ∈ ℝm

Semantic negation      Orthogonal complement     DIFF(u, v) = u − (u ∙ v)v = (u1 − λv1, …, um − λvm),    w ∈ ℝm
                       subtraction               with λ = u ∙ v

2.4.1. The Cosine Comparator

A first basic operation on an SVS, which we denote by cos(u, v) ∈ ℝ, is the calculation of
the cosine metric between two vectors. It takes two vectors as arguments and produces a
scalar representing the cosine of the angle between them. The cosine is the reduction of a
semantic similarity comparative inference between two signs to the computation of a metric

in the SVS. A 𝐜𝐨𝐬(𝐮, 𝐯) = 1 means perfectly collinear vectors and 𝐜𝐨𝐬(𝐮, 𝐯) = 0 means that
the vectors are orthogonal.
In an SVS, semantic similarity is correlated with spatial proximity: the closer two signs are
to each other in the SVS, the more semantically similar they are. This inference is easily
explained by the distributional hypothesis: the spatial distance between two signs is small
when the signs have similar combinatorial patterns in a corpus, and signs having similar
patterns tend to share meaning.
Other metrics can calculate this semantic similarity inference (Kiela and Clark 2014), but
beyond technical differences they are all based on the same methodological reduction, that
is, they reduce the semantic similarity to a very particular kind of mathematical similarity.
These metrics are measurements of the degree of substitutability or permutability of two signs
in a corpus: two signs are semantically similar if they have similar combinatorial patterns,
that is, if they are permutable with each other in a corpus (Burgess 2000; Burgess, Livesay,
and Lund 1998; Lund and Burgess 1996). This mathematical concept of permutation is
interpreted in the SVS framework as a paradigmatic relation between signs in a corpus, in a
sense very similar to Saussurean structural linguistics (Bordag and Heyer 2007; Sahlgren
2008). Rieger used to call the function of this metric a “paradigmatic abstraction” of relations
between signs (Rieger 1989, 1992). This is a fundamental inference one can make with the
SVS framework.
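A minimal sketch of the cosine comparator follows; the signature vectors are invented for illustration only:

```python
import math

def cos(u, v):
    """Cosine between two signature vectors: the SVS reduction of
    semantic similarity to spatial proximity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical signature vectors: "pipe" and "cigar" combine with
# similar signs in this toy setting; "sky" does not.
pipe = [4.0, 3.0, 0.0, 1.0]
cigar = [3.0, 4.0, 0.0, 1.0]
sky = [0.0, 1.0, 5.0, 0.0]

print(round(cos(pipe, pipe), 6))          # 1.0: collinear vectors
print(cos(pipe, cigar) > cos(pipe, sky))  # True: paradigmatic neighbours
```

A cosine of 1 signals collinear (maximally similar) signatures, while a cosine near 0 signals near-orthogonal, unrelated ones.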

2.4.2. The Vector Additive Composition

A second basic operation of the SVS framework, denoted by ADD(u, v) ∈ ℝm, is vector
addition. It takes as arguments two vectors u and v and produces a third one, w, which is the
element-wise addition w = (u1 + v1, …, um + vm). The operation ADD(u, v) is a
componential inference reduced to a linear addition of vectors. The representation in an SVS
of the componential meaning of two signs u and v corresponds to the addition of their
respective vectors. In other words, the vector representation of a componential structure w
corresponds to the sum of the syntagmatic signature vectors of its components u and v.
The semantic composition is a complex semiotic phenomenon and several algebraic
calculations have been proposed to infer it, including projection operators, tensor
multiplication and others (Mitchell and Lapata 2010; Widdows 2008). 𝐀𝐃𝐃(𝐮, 𝐯) is its
simplest algebraic operationalization, but it comes with important consequences. For
instance, the operation is commutative, which means that u + v = v + u. This can be an
issue for the inference of componential relations in which element order in the structure is
important.
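A sketch of additive composition on invented vectors, showing the commutativity just mentioned:

```python
# Additive composition: the componential meaning of two signs is
# represented by the element-wise sum of their signature vectors.
def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Toy signature vectors, for illustration only.
u = [1.0, 0.0, 2.0]
v = [0.0, 3.0, 1.0]

w = add(u, v)
print(w)  # [1.0, 3.0, 3.0]

# The operation is commutative, so any information carried by the
# order of the components is lost.
print(add(u, v) == add(v, u))  # True
```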

2.4.3. The Subtraction of the Orthogonal Complement Vector

Semantic negation, denoted DIFF(u, v) ∈ ℝm, is another fundamental operator of the
SVS framework. It is another form of componential inference but, unlike ADD(u, v), which
enables componential inference from components to structure, DIFF(u, v) allows

componential inference from structure to its components. Widdows introduced this
inference operator and summarizes its rationale in the following way:
« Vector negation is based on the intuition that unrelated meanings should be orthogonal
to one another, which is to say that they should have no features in common at all. Thus
vector negation generates a ‘meaning vector’ which is completely orthogonal to the
negated term. » (Widdows 2003:137)
The semantic negation operator DIFF(u, v) is a projection of u onto the subspace orthogonal
to v, obtained by subtracting from u its component along v, in order to obtain a new vector
w = u − (u ∙ v)v representing the meaning of u independent of v. This means that
cos(w, v) = 0, while cos(w, u) will tend to be high. In other words, to decompose the
syntagmatic signature vector of a sign u by another one v, one must subtract from u the co-
occurrence relations it shares with v.
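A sketch of the DIFF operator on invented unit-normed vectors, verifying that the result is orthogonal to the negated vector while remaining close to the original:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

def diff(u, v):
    """DIFF(u, v): subtract from u its component along the unit-normed
    vector v, projecting u onto the subspace orthogonal to v."""
    lam = dot(u, v)
    return [a - lam * b for a, b in zip(u, v)]

# Hypothetical unit-normed signature vectors, for illustration.
u = normalize([3.0, 1.0, 0.0])
v = normalize([1.0, 1.0, 1.0])

w = diff(u, v)
# w is orthogonal to the negated sign v...
print(abs(dot(w, v)) < 1e-9)  # True
# ...but its cosine with the original sign u remains high.
print(dot(w, u) / math.sqrt(dot(w, w)) > 0.5)  # True
```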

2.4.4. Complex Inferences

The cosine, the vector addition and the subtraction of the orthogonal complement form the
basic vector operations of a semantic inference engine. These basic operations can also be
combined to form complex algorithms that can produce different semiotic analyses of signs
encoded in an SVS. Many of those complex inferences have been studied recently, especially
metaphorical relation (Shutova 2010), analogy (Mikolov et al. 2013), hypernymy (Santus,
Lu, et al. 2014), hyponymy (Erk 2009) and antonymy (Lu 2015).
This inference engine is particularly interesting for a data-driven approach such as the one
we want to propose for a computational semiotics. As Widdows and Cohen pointed out,
vector calculi on SVS “support fast, approximate but robust inference and hypothesis
generation” (Widdows and Cohen 2014:141). In fact, these algebraic calculi are all based on
very reductive inductive biases such as linearity and commutativity. However, this
methodological reduction enables large-scale empirical analysis of very complex semiotic
phenomena that would otherwise be very difficult to study on that scale.
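For instance, the analogy inference cited above can be sketched by chaining addition and subtraction of signature vectors; the four-dimensional vectors below are hand-built for illustration only, not induced from a corpus:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy signature vectors built so that the "royalty" and "gender"
# components are separable (illustration only).
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],   # royalty + male
    "queen": [1.0, 0.0, 1.0, 0.0],   # royalty + female
    "man":   [0.0, 1.0, 0.0, 1.0],   # male + person
    "woman": [0.0, 0.0, 1.0, 1.0],   # female + person
}

# Analogy as chained basic operations: king - man + woman ~ queen.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max(vectors, key=lambda s: cos(vectors[s], target))
print(best)  # queen
```

Even this toy case illustrates the "fast, approximate but robust" character of the inference: a purely linear calculation retrieves the analogical term.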

3. Experiments on the Semantic Indexing of René Magritte’s Artworks

In order to show how the semantic inference engine of the SVS framework functions, we
have conducted three short exploratory experiments in which we have implemented
various algorithms based on the vector calculi defined earlier. As noted in the introduction,
our aim is to demonstrate the relevance of this framework for a data-driven computational
semiotics.
Our experiments are conducted on a unique corpus constituted by the semantic indexing of
René Magritte’s artworks. We adopt a genuinely data-driven exploratory research design here,
in which we let our data mining algorithms recognize and retrieve specific combinatorial
patterns from this corpus. We seek to show what kind of semiotic analysis these algorithms
lead to, and perhaps even to challenge the scholars of Magritte’s artworks. Before presenting
those three experiments, we first present in detail the characteristics of the studied corpus
as well as its pre-processing.

3.1. The Semantic Indexing of Magritte’s Artworks

In digital humanities, several kinds of large digital corpora are created: textual corpora, social
network corpora, geolocation corpora, etc. The creation of large digitized corpora of artworks
is in line with this ongoing effort. In recent years, several unique corpora composed of
thousands of digitized artworks have been created for analytical purposes (Arora 2012;
Carneiro et al. 2012; Liu et al. 2007; Zhang, Islam, and Lu 2012). A reference corpus is, for
instance, the PAINTING-91 (Khan et al. 2014). This corpus consists of 4,266 digitized
images of paintings made by 91 different artists. There are 49 artworks by Francisco de Goya,
39 by Raphael and 50 by Picasso. Another corpus, recently created and much more voluminous,
is Wikiart3. It contains 81,449 digitized images of artworks by 1,119 different artists,
classified into 27 different styles (Saleh and Elgammal 2015). There are more and more
initiatives of this kind, see for example (Arora 2012; Carneiro et al. 2012; Crowley and
Zisserman 2014; Graham et al. 2010; Johnson Jr et al. 2008; Khan et al. 2014; Li and Wang
2004; Lombardi 2005; Shamir 2012, 2015; Shamir and Tarakhovsky 2012; Shen 2009; Stork
2009; Zujovic et al. 2009).
However, the algorithm-based analysis of those digitized artworks corpora remains a very
confined research area. It is mainly computer scientists who explore these corpora using data
mining, machine learning and pattern recognition algorithms in order to predict authors and
artistic styles. Khan and his colleagues, for example, used different algorithms from the
artificial vision field to automatically recognize the authors and styles of artworks in their
analysis of the PAINTING-91 corpus. Saleh and Elgammal analyzed which metrics optimize
an automatic recommendation system based on the similarity of plastic features of artworks
(Saleh and Elgammal 2015). Li and Wang reconstructed the stroke signature of classical
Chinese painters using machine learning algorithms (a hidden Markov model). They also
showed that these signatures are sufficiently specific to allow author prediction of artworks
(Li and Wang 2004).
In semiotics, there is still very little research conducted on the large-scale algorithmic-based
analysis of these corpora. This lack of interest from semioticians in the computational study
of these corpora can be explained in several ways. A major issue hindering research in this
area is what is called the “semantic gap” of the computational analysis of digitized image
corpora, especially digitized pictorial artwork corpora (Hare et al. 2006; Liu et al. 2007;
Smeulders et al. 2000). Indeed, image mining, in contrast to text mining, comes with a
particular problem: unlike texts, where one can induce meaning structure from word
combination patterns (co-occurrences), there are no such patterns in images that an algorithm
can recognize and extract in order to induce meaning structures. In fact, in most
cases, algorithmic-based image analysis is achieved only with the analysis of low-level
plastic features of artworks such as texture, pigmentation, geometrical forms, paint strokes,
color tones, spaces, lines and others. Mathematical encoding of artworks with these plastic
features allow some similarity analyzes (Smeulders et al. 2000), but it does not allow an
3. https://www.wikiart.org/
algorithm to infer from these low-level features a so-called “high-level” analysis of the
meaning structures in artworks. For semioticians, however, plastic analysis, while
complementary, is generally insufficient; they are also interested in the iconic analysis of
these artworks.
In fact, in some experimental settings, such as the LSVRC scientific competition4, state-of-
the-art machine learning algorithms have already crossed this semantic gap and allowed
complex iconic image analysis. This is the case of several artificial deep
convolutional neural networks (Krizhevsky, Sutskever, and Hinton 2012). Some of these
models are even better than humans at tasks like image semantic categorization (He et al.
2016). However, the generalization of these models to the iconic analysis of digitized
pictorial artworks is still an open research topic (Bar, Levy, and Wolf 2014). So far, the
algorithm-based iconic analysis of digitized pictorial artworks has been carried out thanks to
the introduction of an intermediate (human) task of semantic indexing.
Artwork semantic indexing involves attributing semantic annotations, that is, lists of
semantic descriptors, to artworks. The purpose of these descriptors is to encode or categorize,
with lexemes and syntagms, the iconic visual content of artworks. Artwork iconic
analysis is therefore achieved through the analysis of these semantic annotations.
A recent, innovative initiative in this approach was the creation by Hébert and Trudel of a
new digitized corpus of the Belgian painter René Magritte’s artworks (Hébert 2013; Hébert
and Trudel 2013). This corpus consists of 1,780 artworks5, almost all the artworks of the
painter, mainly oils, gouaches, drawings, sketches and posters, whose dates of creation range
from 1916 to 1967. The corpus is based on David Sylvester’s catalogue raisonné (Sylvester
1997).
Table 2 illustrates some examples of Magritte’s artworks and their semantic indexing
produced by Hébert and Trudel. Each piece of art is associated with a title, an identifier, a
date and a semantic annotation. Each annotation is composed of several descriptors delimited
by "< >" tags. The semantic indexing was done in French, but we translated the descriptors
to the best of our knowledge. The original titles of Magritte’s artworks have been left
unchanged even when they were in French.
4. The Large Scale Visual Recognition Challenge (www.image-net.org/challenges/LSVRC/), a prestigious competition about a semantic indexing task of images from the ImageNet dataset.
5. In the database built by Hébert and his team, 1,957 artworks are listed, but only 1,780 have been indexed so far.
Table 2: Three examples of Magritte’s artworks and their semantic indexing. (Images omitted; each entry gives the id, title and date of the artwork, followed by its semantic annotation.)

Id 383, "Golconde", 1953: <chimney>, <multiplicity>, <levitation>, <building>, <facade>, <roof>, <sky>, <man>, <bowler_hat>, <pants>, <shirt>, <collar>, <shoe>, <face>, <window>, <curtain>, <small_wood>, <doorframe>, <corniche>, <gutter>, <pigeon>, <twilight?>, <aurora?>, <duplication>

Id 27, "Baigneuses", 1921: <water>, <reflection>, <nudity>, <woman>, <leg>, <waist>, <head>, <head>, <standing_position>, <pubis>, <breast>, <shoulder>, <arm>, <shape_geometry>, <head>, <hair>, <position_profil>, <buttocks>, <inclinaison>, <shade>, <curve_geometry>, <cercle>, <face>, <belly>, <line>

Id 304, "La trahison des images", 1929: <"Ceci_n_'est_pas_une_pipe.">, <"Ceci">, <"n">, <"est">, <"pas">, <"une">, <"pipe">, <".">, <shade>, <pipe>, <word>, <light>, <background_white>, <reflection>, <shank_(pipe)>, <stem_(pipe)>, <lip_(pipe)>, <bowl_(pipe)>, <head_(pipe)>, <heel_(pipe)>

Hébert and Trudel’s semantic indexing of Magritte’s artworks is based on a set of criteria
derived from Groupe μ’s typology of visual semiotics. This typology, illustrated in
Figure 3, distinguishes between the iconic and plastic dimensions of a visual sign
(Groupe Mu 1992).
All the digitized corpora of artworks cited above index only the plastic dimensions of visual
signs, especially the plastic signifier of artworks (e.g. texture, pigmentation, strokes, color
tones, etc.). Hébert and Trudel did something different. They indexed one of the iconic
dimensions of Magritte’s artworks. More specifically, they indexed the iconic signified
(sometimes called the “type”) of Magritte’s artworks, that is to say, they indexed what
the iconic signs in an artwork evoke for them.6 For simplicity of language, we will say that a
descriptor indexes or encodes an iconic sign present in an artwork.
[Figure 3 is a tree diagram: the visual sign divides into the iconic sign and the plastic sign, each with its signifier, its referent and its signified; the iconic signified is also called the “type”.]
Figure 3: Visual sign typology (Groupe µ).


A descriptor usually corresponds to a simple lexeme like <light>, but it can also sometimes
form a syntagm like <shooting_star>. Bracketed descriptors help to disambiguate the
meaning of another descriptor. For instance, in the descriptor <heel_(pipe)>, "(pipe)" helps to
disambiguate the heel of the pipe from, for example, the heel of the foot. Descriptors in
quotation marks, such as the descriptor <"Ceci_n_’est_pas_une_pipe.">, refer to text painted
in an artwork. Finally, some descriptors are associated with a question mark, like <twilight?>
and <aurora?> in the annotation of Golconde. This kind of mark encodes the indexers’
uncertainty concerning the relevance of a particular descriptor for the iconic sign indexing
of an artwork.
On average, about 20 descriptors were used by Hébert and Trudel to index a single artwork.
In all, 43,325 descriptors, including 3,032 different forms, were used to index the 1,780
artworks of the corpus.
To our knowledge, this is the first large digitized corpus of its kind in the field of art. This
corpus makes it possible for semioticians to analyze a specific dimension of semiotic
processes in various artworks. The analyses conducted in our experiments are therefore
unique compared with those usually done by the computer scientists mentioned above:
they cross the semantic gap. We conjecture that algorithms based on the algebraic operators
of the SVS framework can be applied to Hébert and Trudel’s semantic indexing and can
therefore induce, from the combinatorial iconic sign patterns encoded within those indexings,
different meaning structures in Magritte’s artworks.
6. Hébert and Trudel may not always have been systematic in their indexing process. Other dimensions of the triadic definition of the iconic sign sometimes seem to be indexed with special descriptors, generating confusion about the semiotic criteria used. Since these are very rare in the corpus, and in order to base our experiments on homogeneous descriptors, we have decided to omit these other kinds of descriptors when possible.
3.2. Corpus Pre-processing

The corpus pre-processing involves preparing and standardizing Hébert and Trudel’s
semantic indexing output.
Firstly, descriptors in quotation marks that refer to painted text in artworks, for example the
descriptor <"Ceci_n_’est_pas_une_pipe"> in "La trahison des images", have been
filtered out. Since these descriptors do not index the signified dimension (type) of iconic
signs, they bias the SVS construction. These descriptors are relatively rare: Hébert and
Trudel’s corpus contains only 1,706 descriptors of this kind.
Secondly, descriptors associated with a question mark have also been filtered out. Hébert and
Trudel’s corpus contains 3,902 of these descriptors. The impact of this filtering is also very
small, but it is nevertheless important in order to standardize the epistemic value of the
semantic indexing. Further research might allow us to determine the validity of these
uncertain descriptors, but in the context of this study it seemed more prudent to trust
Hébert and Trudel’s work and to retain only univocal descriptors.
Finally, descriptors present in only a single artwork indexation have also been removed.
These hapaxes play an important role in encoding the iconic content of Magritte’s artworks,
but because they are too rare, they are not relevant in a mathematical modelling based on
statistical regularities. In all, 1,467 hapaxes have been filtered out.
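The three filtering steps just described can be sketched in code. This is a minimal illustration, not the authors' actual pipeline; the raw annotation strings and the `min_artworks` threshold are invented for the example:

```python
import re
from collections import Counter

def parse_annotation(raw):
    """Split a raw annotation string into its <...> descriptors."""
    return re.findall(r"<([^<>]+)>", raw)

def filter_corpus(annotations, min_artworks=2):
    """Apply the three filters described above:
    1. drop quoted descriptors (painted text),
    2. drop descriptors marked uncertain with '?',
    3. drop hapaxes (descriptors kept in fewer than min_artworks annotations)."""
    cleaned = []
    for raw in annotations:
        kept = [d for d in parse_annotation(raw)
                if '"' not in d and not d.endswith("?")]
        cleaned.append(kept)
    # Count in how many artworks each surviving descriptor appears.
    doc_freq = Counter(d for ann in cleaned for d in set(ann))
    return [[d for d in ann if doc_freq[d] >= min_artworks] for ann in cleaned]

annotations = [
    '<pipe>, <shade>, <"Ceci_n_est_pas_une_pipe.">, <twilight?>',
    '<pipe>, <word>, <light>',
]
print(filter_corpus(annotations))  # -> [['pipe'], ['pipe']]
```

In this toy run, the quoted and uncertain descriptors are dropped first, and then <shade>, <word> and <light> are removed as hapaxes.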
Following these pre-processing steps, the filtered corpus is composed of 37,395
descriptors, 1,023 different forms and 1,780 artwork semantic annotations. Applied to this
pre-processed corpus, the parameters ⟨𝕍, ℂ, 𝕎⟩ of the SVS are instantiated in the
following way: the vocabulary 𝕍 corresponds to the 1,023 different descriptors, the
contexts ℂ correspond to the set of 1,780 artwork semantic annotations, and each row
vector v_i = (w_i1, …, w_im) of the matrix 𝕎 corresponds to a descriptor combinatorial
pattern, which we also call its syntagmatic signature vector.
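The instantiation of ⟨𝕍, ℂ, 𝕎⟩ can be sketched as follows. Note that this toy version fills 𝕎 with simple binary occurrence weights, whereas the paper uses the information-gain weighting defined earlier; the descriptor names are invented for the example:

```python
import numpy as np

def build_svs(annotations):
    """Instantiate <V, C, W>: the vocabulary V, the contexts C (artwork
    annotations) and a |V| x |C| matrix W whose row i is the syntagmatic
    signature vector of descriptor i (binary occurrence weights here)."""
    vocab = sorted({d for ann in annotations for d in ann})
    index = {d: i for i, d in enumerate(vocab)}
    W = np.zeros((len(vocab), len(annotations)))
    for j, ann in enumerate(annotations):
        for d in ann:
            W[index[d], j] = 1.0
    return vocab, W

def cos(u, v):
    """Cosine semantic comparison operator between two signature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab, W = build_svs([["man", "jacket"], ["man", "jacket", "pipe"], ["pipe"]])
i, j = vocab.index("man"), vocab.index("jacket")
print(round(cos(W[i], W[j]), 3))  # identical toy signatures -> 1.0
```

Two descriptors that occur in exactly the same annotations get identical rows, hence a cosine of 1.0, which is what "permutable" means in this framework.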

4. Experimentation #1: The Paradigmatic Analysis

The first experiment aims to show how one can use the cosine semantic comparison operator
to produce a paradigmatic analysis between descriptors and Magritte’s artworks. We conduct
this analysis in a particular evaluative task context where we seek to evaluate the reliability
of Hébert and Trudel’s semantic indexing of Magritte’s artworks. The results of this
experiment suggest that the cosine operator effectively allows an algorithm to infer
paradigmatic relations and consequently to find what we will call "silences" in Hébert and
Trudel’s annotations. These results also reveal that the cosine seems to allow the analysis of
another complex phenomenon of semiosis.

4.1. Problem Introduction

As mentioned above, Hébert and Trudel’s semantic indexing of Magritte’s artworks resulted
in one of the first digitized corpora of its kind. The constitution of this corpus needed a
meticulous and very demanding work of iconic sign indexing for all of Magritte’s artworks.
For example, every time a tree was represented in a painting, Hébert and Trudel had to add
the <tree> descriptor to its indexation, and so on. Their indexing needed to be systematic for
all 1,780 artworks of the corpus. It is a considerable amount of work that is prone to error.
In principle, this indexing has been evaluated by Hébert, Trudel and their team, but the vector
modelling of the artwork annotations allows us to validate more systematically the reliability
of Hébert and Trudel’s semantic indexing. We conjecture that by analyzing the SVS of these
annotations, we are likely to detect errors that have passed unnoticed. In this first
experiment, we illustrate how a cosine-based paradigmatic analysis of descriptors and
artwork annotations allows us to discover a particular type of error called a “silence”. These
silences are false negative errors in Hébert and Trudel’s semantic indexing of Magritte’s
artworks. A silence is an “absence” in an artwork annotation, that is, an iconic sign present
in the artwork but not encoded by Hébert and Trudel with the corresponding descriptor.
Let us take as an illustration the descriptor <man> in the studied corpus. This descriptor has
a specific combinatorial pattern with other descriptors in the corpus, which reflects the
combinatorial pattern of the iconic sign it encodes. Table 3 illustrates the strongest
co-occurrences of this descriptor, calculated with the information gain (IG) coefficient
defined above.
Table 3: The most important co-occurrences in the syntagmatic signature of <man>.

Co-occurrence of <man>    IG
<man>                     0.893
<jacket>                  0.208
<bowler_hat>              0.168
<pants>                   0.145
<shirt>                   0.127
<ear>                     0.120
<collar>                  0.116
<shoe>                    0.088
<tie>                     0.086
<head>                    0.083
<face>                    0.074
<back_position>           0.061
<hair>                    0.060
<mustache>                0.053
<nape>                    0.051

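The IG coefficient itself is defined earlier in the article. As a rough sketch, a standard mutual-information formulation of the gain between the occurrence variables of two descriptors could be computed like this (toy annotations, not the real corpus):

```python
import math

def information_gain(corpus, a, b):
    """Mutual-information-style gain I(A;B) between the occurrences of
    descriptors a and b across artwork annotations. One standard
    formulation; the paper's exact IG definition may differ in detail."""
    n = len(corpus)
    pa = sum(a in ann for ann in corpus) / n
    pb = sum(b in ann for ann in corpus) / n
    ig = 0.0
    for va in (True, False):
        for vb in (True, False):
            # Joint probability of the (present/absent) configuration.
            pab = sum((a in ann) == va and (b in ann) == vb for ann in corpus) / n
            if pab > 0:
                pva = pa if va else 1 - pa
                pvb = pb if vb else 1 - pb
                ig += pab * math.log2(pab / (pva * pvb))
    return ig

corpus = [{"man", "jacket"}, {"man", "jacket"}, {"pipe"}, {"tree"}]
print(round(information_gain(corpus, "man", "jacket"), 3))  # -> 1.0
```

Here <man> and <jacket> always occur together and always co-vary, so the gain is maximal (one bit) in this toy corpus.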
Reading these results suggests an unambiguous hypothesis concerning the syntagmatic
signature of the man in Magritte’s artworks. The IG coefficient indicates that the most
significant co-occurrence of <man> is the descriptor <jacket>. The rest of the results
corroborate this hypothesis: most descriptors strongly linked to <man> confirm that, in the
annotations of Magritte’s artworks, <man> is dressed with <pants>, <bowler_hat>, <shirt>,
<shoe> and so on.
These preliminary results will not surprise many specialists of Magritte’s artworks, or even
art lovers. All know quite well that in Magritte’s artworks the man comes in a three-piece
suit. But this truism is useful here in order to introduce our problem: knowing that <man>
has a specific syntagmatic signature, one can use the cosine operator and compare this
syntagmatic signature to Hébert and Trudel’s indexing of Magritte’s artworks. Comparing
the syntagmatic signatures of descriptors and artwork annotations is what we call a
paradigmatic analysis in our SVS framework. This should enable us to find where in
Magritte’s artworks a specific sign’s syntagmatic signature is present even when this sign has
not been indexed by Hébert and Trudel with the corresponding descriptor. This is what we
call a “silence” kind of error.

4.2. Method

The method involves two steps: the construction of the vector representation of artwork
annotations and the comparison of these vectors with the syntagmatic signature vector of a
descriptor.
In the SVS of the studied corpus, an artwork corresponds to the vector addition of the
descriptors that compose its indexation. Let us denote an artwork by a context
c_j = {v_1, …, v_n} and its vector representation by the addition of the syntagmatic signature
vectors of all its descriptors:

    o_j = Σ_{v_i ∈ c_j} v_i

Let us then denote by ℂ⁻ = {c_j ∈ ℂ : v_i ∉ c_j} all of Magritte’s artworks that have not been
indexed with the descriptor v_i. Finding the silences associated with a descriptor v_i then
consists in maximizing the following quantity:

    SILENCE(v_i) = {c_1, …, c_k} = argmax_{{c_1, …, c_k} ∈ [ℂ⁻]^k} Σ_{j=1}^{k} cos(v_i, o_j)

In other words, this function aims to find the subset of k artworks not indexed with the target
descriptor v_i, but in which the syntagmatic signature of the descriptor v_i is very likely
present.
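A minimal sketch of this silence-detection procedure, assuming toy signature vectors rather than the real IG-weighted SVS:

```python
import numpy as np

def silence(v_target, target_name, annotations, signatures, k=3):
    """Rank the k artworks NOT indexed with the target descriptor whose
    annotation vector (sum of its descriptors' syntagmatic signatures)
    is closest, by cosine, to the target's signature vector."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scored = []
    for j, ann in enumerate(annotations):
        if target_name in ann:
            continue  # keep only C-, the artworks lacking the descriptor
        o_j = np.sum([signatures[d] for d in ann], axis=0)
        scored.append((cos(v_target, o_j), j))
    return sorted(scored, reverse=True)[:k]

# Invented 3-dimensional signatures for the illustration.
signatures = {
    "man":    np.array([1.0, 1.0, 0.0]),
    "jacket": np.array([1.0, 0.9, 0.1]),
    "pipe":   np.array([0.0, 0.1, 1.0]),
}
annotations = [["man", "jacket"], ["jacket"], ["pipe"]]
best = silence(signatures["man"], "man", annotations, signatures)
print([j for _, j in best])  # -> [1, 2]
```

Artwork 1, indexed only with <jacket>, nevertheless carries most of the syntagmatic signature of <man>, so it surfaces as the most likely silence.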

4.3. Results

We performed this analysis on the descriptor <man>. Table 4 shows, in descending order, the
top 15 artworks not indexed with the descriptor <man> that maximize the function
SILENCE(<man>).
Table 4: The top 15 nearest Magritte’s artworks in the SVS to the syntagmatic signature vector of <man> that
are not indexed with the descriptor <man>. (Images omitted.)

Id      Artwork title                                     cos
#137    sens de la nuit, Le                               0.524
#1170   idée, L'                                          0.442
#1648   terre d'Osiris, La                                0.401
#1546   maître d'école, Le                                0.390
#1920   [Publicity design for Norine: "Lord Lister"]      0.367
#1178   art de vivre, L'                                  0.359
#1137   pan de nuit, Le                                   0.350
#880    philtre, Le                                       0.346
#189    barbare, Le                                       0.345
#709    ellipse, L'                                       0.343
#1158   bouchon d'épouvante, Le                           0.343
#114    conquérant, Le                                    0.339
#78     grande nouvelle, La                               0.317
#1184   fatigue de vivre, La                              0.315
#141    [Man seated at a table]                           0.307
Recall that the descriptor <man> has a very specific syntagmatic signature: its most
(statistically) important co-occurrences are <jacket>, <bowler_hat>, <pants>, <shirt> and so
on. A cosine-based paradigmatic analysis allows us to predict that we should observe a man
(or something permutable with it) in an artwork whenever its indexation is characterized by
that particular syntagmatic signature or combinatorial pattern.

4.4. Discussion

The results illustrated in Table 4, although limited to the silences of a single descriptor, are
very revealing, to the point that we may wonder whether Hébert and Trudel have not
committed significant indexing errors. For example, why are the artworks "Le sens de la nuit"
(# 137) and "Le maître d’école" (# 1546) not indexed with the descriptor <man>, despite the
fact that we find in these artworks all the features of its syntagmatic signature? These
artworks clearly contain the iconic sign of a man in Magritte’s artworks and with an obvious
figurativeness7. Hence, these artworks should have been indexed with the descriptor <man>.
According to our method, these absences are silences in Hébert and Trudel’s indexing.
Furthermore, other artworks listed in Table 4 are not necessarily characterized by silences.
These artworks rather illustrate another kind of relation with <man>. As we can see in many
artworks of Table 4, for instance # 1170, # 1648, # 1178, # 1137 and # 880, there is a semiotic
phenomenon that we had not expected to find. These artworks do not contain any iconic sign
designating with obvious figurativeness the presence of a man. We are rather confronted with
an elliptic phenomenon evoked by the full or partial presence of the syntagmatic signature of
the man (important features of its combinatorial pattern) despite the absence of the signifier.
What this cosine-based paradigmatic analysis allows us to discover here is not indexing
errors, but rather a complex rhetorical phenomenon that opens up an ontological distinction
of the iconic visual sign: the figurative iconic sign and the abstract iconic sign (with low
figurativeness). This distinction recalls the opposition between iconization and abstraction of
visual signs (Greimas, Collins, and Perron 1989:634). This is a very interesting unexpected
result. It provides new insights for future works about a computational semiotics such as the
one we are trying to develop in this paper.

7. We use the concept of figurativeness in the Greimassian sense: signs that “permit us to interpret it as representing an object of the natural world” (Greimas, Collins, and Perron 1989: 634).
5. Experimentation # 2: The Componential Analysis

The second experiment studies another kind of semiotic phenomenon. The objective of this
experiment is to show how the previously defined orthogonal complement subtraction
operator can be used by an algorithm to produce a componential analysis of the descriptors
used to index Magritte’s artwork iconic signs. The results of the experiment suggest that this
subtraction operator enables the syntagmatic signature decomposition into its various
semantic components.

5.1. Problem Introduction

Some descriptors have complex semantics; they seem to index different iconic signs in
Magritte’s artworks. Let us take as an illustration the descriptor <woman> in the
studied corpus. The syntagmatic signature of this descriptor is projected into a very specific
region of the SVS. Table 5 illustrates its nearest neighbors in the SVS, as calculated with the
cosine.
Table 5: The top 15 nearest descriptors of <woman> in the semantic vector space of the annotations of
Magritte’s artworks.

Descriptor        cos(<woman>, v_j)
<woman>           1.000
<nudity>          0.824
<breast>          0.795
<thigh>           0.783
<belly>           0.765
<nipple>          0.762
<pubis>           0.759
<navel>           0.759
<arm>             0.732
<neck>            0.707
<hair>            0.652
<shoulder>        0.590
<pubic_hair>      0.588
<face>            0.576
<human_body>      0.562

The descriptors that populate the region where <woman> is projected suggest that its
syntagmatic signature is very similar to the combinatorial patterns of other descriptors such
as <nudity>, <breast>, <thigh>, <belly>, <nipple> and <pubis>. These results will not
surprise many experts of Magritte’s artworks either. While the man usually comes in a three-
piece suit (as seen in the previous experiment), the woman is usually naked in Magritte’s
artworks. However, if we look closely at all of the nearest neighbors of <woman> sorted in
Table 5, we see other descriptors, such as <hair> and <face>, that express something different
from feminine nudity. The syntagmatic signature of <woman> appears to be projected at
the edge of two regions in the SVS. In one of these regions, we find descriptors that index
feminine nudity; in the second, more peripheral one, we find descriptors that index the
woman’s face. This suggests a new research question: could the syntagmatic signature of a
descriptor be componential? In other words, could the syntagmatic signature of a sign hide
several components under a dominant combinatorial pattern?
The aim of the second experiment is to develop a method of componential analysis that
decomposes the syntagmatic signature of a descriptor into its different combinatorial sub-
patterns, called its semantic components.

5.2. Method

By definition, any vector can be decomposed into a set of orthogonal vector components.
Formally, this is what our method aims to accomplish, applied to the decomposition of
the syntagmatic signature vector of a target descriptor v_i into a subset of "vector components"
u_1 + u_2 + ⋯ + u_m corresponding to the syntagmatic vectors of other descriptors present in
Hébert and Trudel’s annotations of Magritte’s artworks. For example, could the syntagmatic
signature of <woman> be decomposed into the following vector components: <woman> =
(<nudity> + <face>)?
This is the kind of question our method is designed to answer. To do so, let us denote the
composition of the syntagmatic signature of a descriptor in the following way:

    COMPOSITION(v_i) ≈ (u_1 + u_2 + ⋯ + u_m),

where v_i is the syntagmatic signature vector of the target descriptor i we want to decompose
and u_1 + u_2 + ⋯ + u_m are vector components corresponding to sub-combinatorial patterns
of the target. To find COMPOSITION(v_i), we use the negation operator introduced above,
applied recursively with the Gram-Schmidt algorithm until the decomposition
process finds no new component.
The method proceeds as follows. In a first iteration, it searches with the cosine operator
cos(v_i, u_j) for the descriptor u_j whose syntagmatic signature vector is the most similar to the
target v_i we want to decompose. The vector u_j then becomes the first component of v_i. We
decompose v_i using the negation operator DIFF(v_i, u_j) = ε_t, where ε_t represents the residual
of v_i, that is to say, the part of the syntagmatic signature of the target v_i orthogonal to u_j. In
a second iteration, the method searches for a new descriptor whose syntagmatic signature
vector u_k is the most similar to the residual ε_t obtained at the previous iteration. The vector
u_k becomes the second component of v_i. The negation operator is applied again, but this time
to the residual, DIFF(ε_t, u_k) = ε_{t+1}, which gives a new residual corresponding to the
syntagmatic signature of v_i orthogonal to both u_j and u_k. As noted by Widdows, in order to
respect the commutativity of the negation operator at each iteration, it is important to apply
the Gram-Schmidt orthogonalization algorithm to every new vector component retrieved
(Widdows 2004: 230). We repeat the iteration until cos(ε_t, ε_{t+1}) ≈ 1.0, that is, until the
decomposition process stabilizes. This occurs when the residual ε_{t+1} can no longer be
decomposed using the syntagmatic signature vector of another descriptor of the corpus. At
the end of the process, the composition of the syntagmatic signature of the target descriptor
corresponds to the vector addition v_i ≈ u_j + u_k + ⋯ + ε_{t+1}.
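A minimal sketch of this decomposition procedure, with the DIFF negation implemented as vector rejection and a toy two-dimensional space in which <woman> is, by construction, a mixture of <nudity> and <hair> components:

```python
import numpy as np

def diff(v, u):
    """Orthogonal complement ('negation') operator: the component of v
    orthogonal to u."""
    u_hat = u / np.linalg.norm(u)
    return v - (v @ u_hat) * u_hat

def decompose(v_target, signatures, tol=0.999):
    """Iteratively strip off the nearest descriptor signature until the
    residual stops changing; Gram-Schmidt keeps components orthogonal."""
    def cos(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return 0.0 if nu < 1e-12 or nv < 1e-12 else float(u @ v / (nu * nv))
    residual, components, basis = v_target.astype(float), [], []
    while np.linalg.norm(residual) > 1e-12:
        # Descriptor whose signature is closest to the current residual.
        name, u = max(signatures.items(), key=lambda kv: cos(residual, kv[1]))
        for b in basis:                    # Gram-Schmidt re-orthogonalization
            u = diff(u, b)
        if np.linalg.norm(u) < 1e-12:
            break
        new_residual = diff(residual, u)
        if cos(residual, new_residual) >= tol:  # decomposition has stabilized
            break
        components.append(name)
        basis.append(u / np.linalg.norm(u))
        residual = new_residual
    return components

signatures = {"nudity": np.array([1.0, 0.0]), "hair": np.array([0.0, 1.0])}
woman = np.array([1.0, 0.5])
print(decompose(woman, signatures))  # -> ['nudity', 'hair']
```

The dominant component (<nudity>) is extracted first, then the method finds the weaker <hair> component in the residual, after which the residual vanishes and the process stops.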

5.3. Results

We performed this analysis on the descriptor <woman>. As mentioned earlier, according to
the cosine operator, the most similar descriptor to <woman> is <nudity>, with cos(<woman>,
<nudity>) = 0.82. This cosine indicates that the two descriptors are not perfectly equivalent
(i.e. permutable) in the corpus; hence the syntagmatic signature vector of <woman> cannot
be reduced to that of <nudity>, although the latter is an important component of the former.
By retrieving in a second iteration the negation DIFF(<woman>, <nudity>), we obtain
a new syntagmatic signature vector for <woman> which is very different from that of
<nudity>. In the region of the SVS where the vector DIFF(<woman>, <nudity>) lies, we
find descriptors such as <hair>, <face>, <eyes>, <neck> and <mouth> that have nothing to
do with nudity (see Table 6). These descriptors all express another semantic component of
<woman>: as we suspected, one related to the expression of the woman’s face. Note,
however, that it is not the descriptor <face> that is selected by our method, but <hair>, a
descriptor highly correlated with <face>.
The most intuitive way to illustrate the results of our method is to retrieve the artworks whose
annotations are located in the regions of <woman>, of DIFF(<woman>, <nudity>) and
of DIFF(<woman>, <hair>). These artworks are sorted in Table 7:
Table 6: Top 10 nearest descriptors in the semantic vector space of the negation of <woman> by <nudity> and
of the double negation of <woman> by <nudity> and <hair>.

DIFF(<woman>, <nudity>) = ε_t
Descriptor         cos(v_j, ε_t)
<woman>            0.566
<hair>             0.449
<face>             0.440
<eyes>             0.403
<neck>             0.371
<mouth>            0.343
<head>             0.335
<parting_(hair)>   0.335
<neckline>         0.333
<blouse>           0.321

DIFF(DIFF(<woman>, <nudity>), <hair>) = ε_{t+1}
Descriptor         cos(v_j, ε_{t+1})
<woman>            0.486
<neckline>         0.207
<dress>            0.203
<necklace>         0.192
<neck>             0.173
<skirt>            0.165
<sign>             0.165
<mutilation>       0.152
<blouse>           0.147
<pearl>            0.145

Table 7: Top 10 nearest artworks in the semantic vector space of the two main semantic components of
<woman>. (Images omitted; each entry gives the artwork id, its title and its cosine to the column’s vector.)

Nearest to <woman>, cos(<woman>, o_i):
#738   menottes de cuivre, Les           cos=0.880
#743   menottes de cuivre, Les           cos=0.877
#742   menottes de cuivre, Les           cos=0.875
#303   femme cachée, La                  cos=0.875
#756   Femme-bouteille                   cos=0.872
#758   Femme-bouteille                   cos=0.872
#755   Femme-bouteille                   cos=0.868
#418   viol, Le                          cos=0.866
#770   Femme-bouteille                   cos=0.865
#761   Femme-bouteille                   cos=0.864

Nearest to DIFF(<woman>, <nudity>), cos(DIFF(<woman>, <nudity>), o_i):
#22    [Portrait of Emma Bouillon]       cos=0.490
#635   [Portrait of Jacqueline Nonkels]  cos=0.483
#634   [Portrait of Georgette Magritte]  cos=0.481
#760   ‹mémoire, La›                     cos=0.475
#1198  [Head]                            cos=0.467
#1197  [Head]                            cos=0.463
#767   mémoire, La                       cos=0.461
#637   [Portrait of Eliane Peeters]      cos=0.461
#61    Portrait d'Evelyne Bourland       cos=0.460
#4     [Plaster bust and fruit]          cos=0.458

Nearest to DIFF(<woman>, <hair>), cos(DIFF(<woman>, <hair>), o_i):
#768   peinture, La                      cos=0.707
#497   représentation, La                cos=0.695
#1862  statue volante, La                cos=0.659
#1205  folie des grandeurs, La           cos=0.658
#91    ‹Souvenir de voyage›              cos=0.657
#408   belle de nuit, La                 cos=0.651
#1057  folie des grandeurs, La           cos=0.650
#197   éloge de l'espace, L'             cos=0.650
#409   Quand l'heure sonnera             cos=0.647
#798   leçon des choses, La              cos=0.647

5.4. Discussion

The artworks in Table 7 constitute a very convincing demonstration of the ability of the
orthogonal complement subtraction operator to infer the semantic composition of a
descriptor. The first column of Table 7 lists the top 10 nearest artworks to <woman> in the
SVS; the middle column, the top 10 nearest artworks to the negation of <woman> by
<nudity>; and the third column, the top 10 nearest artworks to the negation of <woman> by
<hair>. The artworks in the first column are a mixture of the two semantic components of
the syntagmatic signature of <woman>. In the second column, we find not only artworks
expressing the faces of women, but also artworks without any feminine nudity, whereas the
opposite is true in the artworks of the third column, where women are naked but faceless.
Does the syntagmatic signature vector of <woman> have a third component? Our method
indicates that it does not, that is, cos(ε_t, ε_{t+1}) ≈ 1.0. This means that the region of the SVS
where the residual of the double negation DIFF(DIFF(<woman>, <nudity>), <hair>) is
projected clusters descriptors whose syntagmatic signatures differ very little from those of
<nudity> or <face> (see Table 6). The residual seems to represent an atypical variant of the
preceding components rather than a new component orthogonal to them. There is no
artwork in the studied corpus in which the woman is represented without a syntagmatic
signature correlated with <nudity> or with <hair>.
The results of the semantic decomposition of <woman>, although limited as in the first
experiment to the analysis of a single descriptor, are very convincing. They enable us to reach
our second objective, which was to show how an algorithm based on the orthogonal
complement subtraction operator can produce a componential analysis of the descriptors used
to index the iconic content of Magritte’s artworks.
However, the definition given to the componential relation has its limits. On the one hand, it
is very unlikely that in Hébert and Trudel’s corpus the syntagmatic signature vector of a
descriptor v_i is perfectly decomposable into a subset of other descriptor vectors
u_1 + u_2 + ⋯ + u_m, and, on the other hand, it is also highly unlikely that these components
are orthogonal to one another. This is the case of <woman>, for instance, which is not
perfectly decomposable into the <nudity> and <hair> components. There is no perfect
equivalence between the two, since cos(<woman>, (<nudity> + <hair>)) = 0.86 (an
equivalence would assume a cosine of 1.0), and there is no orthogonality between its
components, since cos(<nudity>, <hair>) = 0.48 (orthogonality would assume a cosine of
0.0). The components retrieved by our method are only quasi-orthogonal. Nevertheless, the
conjecture that the syntagmatic signature vector of <woman> has two main semantic
components, the combinatorial pattern of <nudity> and the combinatorial pattern of <hair>,
represents according to our method a very likely hypothesis about the componential structure
of <woman>. It invites specialists of Magritte’s artworks to consider it as an informed
conjecture.

6. Experimentation # 3: The Topic Analysis

The semiotic phenomenon studied in this last experiment is the topic. The aim of this
experiment is to show how a clustering algorithm based on the comparison and composition
operators can produce a topic analysis of Magritte’s artworks. The results obtained show that
what explains why a clustering algorithm can produce this kind of semiotic analysis is the
isotopic process that underlies topics, a process that the clustering algorithm seems to
reconstruct.

6.1. Problem Introduction

Descriptors and artwork annotations from Hébert and Trudel’s semantic indexing are not
uniformly distributed in the SVS. There are groups of descriptors and groups of artwork
annotations that, because they share similar combinatorial patterns in the corpus, are
projected into the same regions of the SVS. The SVS is therefore structured by various high-
density regions.
Figure 4 partially illustrates this phenomenon. Small triangles represent Magritte’s artwork
annotations and circles represent descriptors. The coordinates of these triangles and circles
encode their vector representations. Although it is a two-dimensional reduction of the total
SVS, we can nevertheless see some high-density regions where artwork annotations related
to a common topic are clustered together. For example, we see at the left end a region where
artwork annotations are clustered around a small number of descriptors such as <woman>,
<breast>, <thigh>, <nipple>, <navel>, <nudity>, <pubic_hair> and <pubis>. We also see in
the lower right corner another region where artwork annotations are grouped around
descriptors such as <crop>, <gallop>, <jockey>, <bit>, <rein>, <saddle_(horse)>, <stirrup>
and <flange>. The first region includes artworks that share a specific topic one could
summarize with the predicate "FEMALE NUDITY", while the second region groups
artworks that share another specific topic one could summarize with the predicate "RIDING
HORSE".

[Figure 4 is a scatter plot; its two labelled high-density regions read FEMALE NUDITY and RIDING HORSE.]
Figure 4: Multidimensional scaling of the semantic vector space of the annotations of Magritte’s artworks.
What other regions does the SVS of Magritte's artworks annotated by Hébert and Trudel
contain? Do these regions also cluster together artworks that share a common topic? The
visualization in Figure 4 does not allow us to go much further in this analysis, because the
two-dimensional reduction is too distorted. To discover other high-density regions of the
SVS, we must use a clustering algorithm.
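The kind of two-dimensional reduction shown in Figure 4 can be obtained with multidimensional scaling (MDS) over cosine distances. The following is a minimal Python sketch with scikit-learn, in which random vectors stand in for the real descriptor and annotation vectors of the SVS:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical toy data: rows stand in for artwork-annotation vectors in the SVS.
rng = np.random.default_rng(0)
vectors = rng.random((30, 50))

# Cosine distance matrix (1 - cosine similarity) between all pairs of vectors.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
distance = 1.0 - unit @ unit.T
distance = (distance + distance.T) / 2.0  # enforce exact symmetry
np.fill_diagonal(distance, 0.0)

# Project onto two dimensions, as in Figure 4.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distance)
print(coords.shape)  # one (x, y) point per artwork annotation
```

The resulting coordinates can then be plotted, with one marker per artwork annotation, to reveal the high-density regions discussed above.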

6.2. Method

Let us denote a partition of the space into $k$ regions by $\mathbb{P} = \{R_1, \dots, R_k\}$,
the vector representation of an artwork by the sum of its descriptor vectors,
$\mathbf{o}_j = \sum_{\mathbf{v}_i \in c_j} \mathbf{v}_i$, and let us define a region $R_i$ as
a convex surface that satisfies the following condition:

$$R_i = \{\mathbf{o} : \cos(\mathbf{o}, \mathbf{c}_i) > \cos(\mathbf{o}, \mathbf{c}_j) \quad \forall R_j \in \mathbb{P},\ j \neq i\}$$

$$\mathbf{c}_i = \frac{1}{|R_i|} \sum_{\mathbf{o} \in R_i} \mathbf{o}$$

where the vector $\mathbf{c}_i$ is the geometric center of the region $R_i$ (its centroid). A vector
space partition into convex regions produces a Voronoi structure which has very interesting
properties for a topic modelling analysis. Firstly, the convexity properties of this partition
guarantee that the geometric center of a region is its most representative vector.
Consequently, it is a very natural way to represent the topic associated to a region: a topic is
represented by the vector averaging a group of very nearby artworks in the SVS, a group of
artworks that share a very similar syntagmatic signature vector. Secondly, because of the
convexity property of these regions, the cosine operator allows what Gärdenfors called
“prototype inferences” (Gärdenfors 2000): in convex regions, the distance between the
syntagmatic signature vector of a descriptor and the center of a region is correlated to its
semantic representativeness. In other words, the closer a descriptor is to the center of a region,
the more it is representative of the topic common to the artworks clustered in the concerned
region of the SVS.
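The membership condition defining these convex regions amounts to a nearest-centroid assignment under cosine similarity. A minimal sketch (the toy vectors and function names are ours):

```python
import numpy as np

def assign_regions(vectors, centroids):
    """Assign each vector to the region whose centroid is most
    cosine-similar to it (the convex-region condition above)."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    # Rows of v @ c.T hold cosine similarities to every centroid.
    return np.argmax(v @ c.T, axis=1)

# Toy usage: two regions, two artwork vectors.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
vectors = np.array([[0.9, 0.1], [0.2, 0.8]])
print(assign_regions(vectors, centroids))  # [0 1]
```

Because the regions are convex, ranking any vector by its cosine to a centroid directly yields the "prototype inferences" described above.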

The K-means clustering algorithm generates exactly this kind of structure. Its objective
function is the following:

$$\operatorname{argmax}_{\mathbb{P}} \sum_{R_j \in \mathbb{P}} \sum_{\mathbf{o}_i \in R_j} \cos(\mathbf{o}_i, \mathbf{c}_j)$$

The algorithm is an iterative mobile-centers procedure that seeks to move the geometric
center $\mathbf{c}_j$ of each region $R_j$ in the partition $\mathbb{P}$ so as to iteratively
maximize $\cos(\mathbf{o}_i, \mathbf{c}_j)$. For a partition into $k$ regions and a given seed,
the objective function of K-means converges to a locally optimal partition.8
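A common way to optimize a cosine objective of this kind is to run standard Euclidean K-means on L2-normalized vectors, since for unit vectors squared Euclidean distance is a monotone function of cosine similarity (‖a − b‖² = 2 − 2 cos(a, b)). The sketch below uses scikit-learn with k-means++ seeding; it is an approximation of spherical K-means, and the random data stands in for the real artwork vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(1)
artworks = rng.random((100, 40))   # hypothetical artwork vectors

# L2-normalize so Euclidean K-means approximates the cosine objective.
unit = normalize(artworks)
km = KMeans(n_clusters=20, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(unit)      # region index of each artwork
centroids = km.cluster_centers_    # geometric centers (topic vectors)
print(labels.shape, centroids.shape)
```

Each row of `cluster_centers_` is the centroid of one region, i.e., the vector representation of one topic in the sense defined above.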

6.3. Results

We applied the K-means algorithm to the artwork annotations produced by Hébert and Trudel
and obtained a partition of the SVS into 20 regions.9 According to our method, this indicates
that Magritte's artworks are organized around 20 main topics. Each topic is represented in
Table 8 as a list of its 10 most representative descriptors, that is, the 10 descriptors nearest
(by cosine) to the center of each region of the SVS. We also added a predicate in capital
letters that summarizes the meaning structure of each topic. These topics are: (0)
MASONRY, (1) BOISERIE, (2) FACE, (3) LEAF, (4) EYE, (5) RIDING HORSE, (6) SEA,
(7) PIPE, (8) FIRE, (9) TREE, (10) SHEET MUSIC, (11) LION, (12) RIPPLE, (13)
FEMALE NUDITY, (14) DRESSED MAN, (15) HOUSE, (16) PIANO, (17) WORD
INSCRIPTIONS, (18) SKY, and (19) GEOMETRICAL SHAPE.
Finally, we also retrieved the five most representative artworks for each topic, that is, the
five artworks nearest to the center of each region in the SVS. These results are available in
Appendix A.
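Retrieving the most representative descriptors (or artworks) of a region amounts to ranking vectors by cosine similarity to the region's centroid. A minimal sketch with hypothetical toy vectors:

```python
import numpy as np

def top_k_by_cosine(items, centroid, k=10):
    """Return the indices and cosines of the k item vectors nearest
    (by cosine) to a region center."""
    items_u = items / np.linalg.norm(items, axis=1, keepdims=True)
    c_u = centroid / np.linalg.norm(centroid)
    sims = items_u @ c_u
    order = np.argsort(-sims)[:k]   # descending similarity
    return order, sims[order]

# Toy usage: rank three hypothetical descriptor vectors against one centroid.
descriptors = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
idx, sims = top_k_by_cosine(descriptors, np.array([1.0, 0.2]), k=2)
print(idx)  # [0 1]
```

Applied once with the descriptor vectors and once with the artwork vectors, this ranking yields both Table 8 and the lists of Appendix A.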
Table 8: Topics in the annotations of Magritte’s artworks and the top 10 nearest descriptors of each region
center.

Topic #0 MASONRY: <cut_stone> 0.51, <joint_(masonry)> 0.50, <masonry> 0.49, <stone_(material)> 0.46, <low_wall> 0.40, <embrasure> 0.39, <porosity> 0.36, <boy> 0.32, <sky> 0.31, <drink> 0.31

Topic #1 BOISERIE: <plinth> 0.48, <parquet> 0.48, <slate_(wood)> 0.47, <moulding> 0.45, <wood_(material)> 0.43, <paneling> 0.40, <boiserie> 0.39, <paneling> 0.39, <grain_(wood)> 0.37, <finished_lumber> 0.34

Topic #2 FACE: <face> 0.71, <eyes> 0.69, <hair> 0.67, <head> 0.65, <nose> 0.63, <mouth> 0.63, <eyebrow> 0.59, <neck> 0.59, <front> 0.58, <women> 0.51

Topic #3 LEAF: <blade_(leaf)> 0.51, <vein_(leaf)> 0.51, <leaf> 0.50, <tip_(leaf)> 0.49, <edge_(leaf)> 0.48, <petiole_(leaf)> 0.46, <head_(bird)> 0.41, <beak_(bird)> 0.41, <plumage> 0.41, <wing_(bird)> 0.41

Topic #4 EYE: <iris_(eye)> 0.61, <pupil> 0.60, <white_of_the_eye> 0.60, <eyelash> 0.53, <pearl> 0.51, <lacrymal_caruncle> 0.51, <eye_ball> 0.49, <necklace> 0.47, <support> 0.47, <lips> 0.47

Topic #5 RIDING HORSE: <horse> 0.59, <horse_riding> 0.56, <rein> 0.54, <flange> 0.54, <bit> 0.54, <tail_(horse)> 0.51, <crop> 0.51, <hoof> 0.49, <saddle_(horse)> 0.49, <mane_(horse)> 0.49

Topic #6 SEA: <sea> 0.62, <wave> 0.60, <water> 0.59, <scum> 0.55, <skyline> 0.47, <sky> 0.47, <beach> 0.44, <swell> 0.40, <surge> 0.37, <cloud> 0.37

Topic #7 PIPE: <steam_(pipe)> 0.81, <shank_(pipe)> 0.81, <head_(pipe)> 0.81, <heel_(pipe)> 0.81, <bowl_(pipe)> 0.81, <lop_(pipe)> 0.80, <pipe> 0.77, <plate> 0.67, <inscription> 0.24, <tobacco> 0.22

Topic #8 FIRE: <fire> 0.39, <flame> 0.39, <candle> 0.33, <burn> 0.33, <wick_(candle)> 0.28, <slide_(candle)> 0.28, <copper> 0.27, <candlestick> 0.27, <hammer> 0.26, <halo> 0.25

Topic #9 TREE: <trunk> 0.57, <tree> 0.56, <branch> 0.50, <foliage> 0.48, <canal_(water_channel)> 0.40, <bark> 0.36, <soil> 0.36, <vein_(leaf)> 0.36, <leaf> 0.35, <blade_(leaf)> 0.35

Topic #10 SHEET MUSIC: <sheet_(music)> 0.64, <note_(music)> 0.64, <collage> 0.64, <staff_(music)> 0.64, <illegible> 0.59, <song> 0.52, <lyrics_(music)> 0.52, <treble_clef> 0.50, <paper> 0.48, <marble> 0.37

Topic #11 LION: <lion> 0.50, <barrel> 0.44, <hoop_(barrel)> 0.43, <tripod> 0.42, <mane_(lion)> 0.41, <billiard_balls> 0.40, <billiard_cue> 0.40, <billiard_table> 0.40, <tail_(lion)> 0.40, <paw_(lion)> 0.40

Topic #12 RIPPLE: <curtain> 0.34, <ripple> 0.33, <sky> 0.32, <tissue> 0.31, <ring_(curtain)> 0.27, <drawstring> 0.26, <piece_(curtain)> 0.26, <cloud> 0.25, <serpentine> 0.25, <lectern> 0.25

Topic #13 FEMALE NUDITY: <nudity> 0.77, <breast> 0.76, <woman> 0.75, <thigh> 0.74, <nipple> 0.74, <belly> 0.73, <pubis> 0.73, <navel> 0.73, <arm> 0.71, <neck> 0.65

Topic #14 DRESSED MAN: <man> 0.43, <jacket> 0.41, <collar> 0.40, <shirt> 0.40, <tie> 0.37, <hair> 0.36, <face> 0.35, <head> 0.35, <knot_(tie)> 0.33, <pants> 0.32

Topic #15 HOUSE: <window> 0.59, <roof> 0.58, <house> 0.57, <chimney> 0.55, <façade> 0.54, <street> 0.49, <street_lamp> 0.48, <tree> 0.44, <shutter> 0.44, <sidewalk> 0.44

Topic #16 PIANO: <body_(piano)> 0.94, <grand_piano> 0.94, <support_(piano)> 0.94, <keyboard_tray_(piano)> 0.94, <pedal_(piano)> 0.94, <white_key_(piano)> 0.93, <black_key_(piano)> 0.93, <key_(piano)> 0.93, <keyboard_(piano)> 0.91, <bracelet> 0.89

Topic #17 WORD INSCRIPTION: <word> 0.33, <illegibility> 0.19, <shadow> 0.19, <illegible> 0.18, <white_background> 0.18, <black_background> 0.18, <inscription> 0.18, <hammer> 0.17, <reel> 0.16, <arrow> 0.16

Topic #18 SKY: <sky> 0.57, <cloud> 0.49, <skyline> 0.37, <day> 0.36, <sea> 0.26, <water> 0.24, <soil> 0.23, <wave> 0.23, <scum> 0.21, <knotting> 0.20

Topic #19 GEOMETRICAL SHAPE: <curve_(geometry)> 0.72, <line_(geometry)> 0.72, <line> 0.69, <shape_(geometry)> 0.68, <steel_bar> 0.46, <pagoda_sleeve> 0.43, <square_(geometry)> 0.38, <beacon> 0.37, <saucer> 0.26, <cup> 0.26

8
Centroid seeding was estimated using the algorithm of Arthur and Vassilvitskii (2007).
9
The k parameter was estimated by maximizing the Silhouette coefficient (Rousseeuw 1987).

6.4. Discussion

The results in Table 8 and the artworks listed in Appendix A suggest that a clustering
algorithm like K-means applied to the SVS makes it possible to infer the main topic structures
that organize Magritte’s artworks. All lists of descriptors in Table 8 and artworks in Appendix
A are characterized by remarkable topic coherence.
Many have noticed this correlation between regions of an SVS and topic structures, but only
in the context of textual corpus analysis (Burgess et al. 1998; Griffiths et al. 2007; Larsen
and Monarchi 2004; Rieger 1983; Widdows 2004). Some researchers have conjectured that
this correlation is explained by the isotopic process that supports topic modelling (Mayaffre
2008; Pincemin 1999). A clustering algorithm can produce such a semiotic analysis because
it partially reconstructs this isotopic process.
Isotopy is the process by which similar combinations of semantic features (known as
"semes") are repeated across several contexts (Rastier 1996). A topic is the meaning structure
supported by this process. The relations between these semantic features can be very
complex, and this complexity is not encoded in an SVS model, because all relations are
reduced to co-occurrence. Nevertheless, the previous results demonstrate that an algorithm
such as K-means can approximate this isotopic process in a very convincing way.
An algorithm like K-means clusters together artwork annotations (the contexts) characterized
by similar patterns of descriptor co-occurrence. Each artwork cluster's centroid is a vector
structure, a geometric center, generated by this clustering process. This is why we consider
these centroids to be the vector representations of the topics in Magritte's artworks.
For example, the region corresponding to topic #4 clusters together several artworks in which
a repeated combinatorial pattern, with some variations, is present. This pattern consists of
variant combinations of the descriptors <iris_(eye)>, <pupil>, <white_of_the_eye>,
<eyelash>, <pearl>, <lacrymal_caruncle> and so on. The meaning structure supported by
these combinatorial patterns is not encoded by the vector of any particular descriptor in the
corpus, but we can easily recognize it and index it with a predicate. We have used the
predicate EYE for this purpose.
These results are interesting for specialists of Magritte's artworks and allow us to formulate
several research questions. For example, which topics in Magritte's artworks are the most
important? In other words, which regions of the SVS contain the largest number of
Magritte's artworks? The graph below (Figure 5) shows that the most important topic in
Magritte's artworks is FEMALE NUDITY. It is the main topic of approximately 13% of
Magritte's artworks. The second most important is the FACE topic, which is the main
topic of 11% of Magritte's artworks. Together they are the main topics of a quarter of
Magritte's artworks. The topics LEAF, SEA, TREE, HOUSE, MASONRY, SKY,
DRESSED MAN and BOISERIE each represent between 6% and 8% of Magritte's artworks.
The rest are relatively rare, each being the main topic of 1% to 4% of Magritte's artworks.
Unfortunately, we cannot go much deeper into the analysis of these results in this article. The
primary objective was methodological: to demonstrate that one can produce a topic analysis
with a clustering algorithm applied to the SVS.
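The proportions plotted in Figure 5 can be computed from the cluster labels by simple counting. A sketch with simulated labels standing in for the real partition of Magritte's artworks:

```python
import numpy as np

# Hypothetical cluster labels for 200 artworks over 20 topics.
rng = np.random.default_rng(2)
labels = rng.integers(0, 20, size=200)

counts = np.bincount(labels, minlength=20)   # artworks per topic
proportion = counts / counts.sum()           # share of artworks per topic
order = np.argsort(-proportion)              # most frequent topics first
cumulative = np.cumsum(proportion[order])    # cumulative curve of Figure 5
print(order[:3], np.round(proportion[order][:3], 3))
```

Sorting the proportions in descending order and accumulating them reproduces the two series of Figure 5 (per-topic bars and cumulative curve).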

[Bar chart: topic proportion per topic (left axis, 0-14%) with the cumulative proportion curve (right axis, 0-120%).]
Figure 5: The proportion of each topic in Magritte’s artworks.

7. Conclusion

The objective of this article was to explore the contribution of the SVS model and its
algorithms to a data-driven computational semiotics. We wanted to explore what kind of
semiotic analysis one can produce with this framework.
In order to provide some answers to this question, we first presented the SVS theoretical
model, its main parameters and the main operators of its semantic inference engine. The SVS
framework has generally been presented from a perspective limited to linguistics or
information retrieval and in the context of textual corpus analysis. We have tried to detach
the SVS from this particular context. To achieve this, we applied the SVS and its algorithms
to a new corpus composed of the surrealist painter René Magritte's artworks. To our
knowledge, this is the first time the SVS has been applied to the analysis of iconic signs in
artworks. This is an important contribution because it shows that the SVS constitutes a
framework that can be generalized to the analysis of other types of signs that are not strictly
language-based.
We conducted three short experiments on this corpus. Each was designed to explore a type
of semiotic analysis on the corpus of Magritte’s artworks with algorithms based on the basic
algebraic operators of the SVS framework. The first experiment was a cosine-based
paradigmatic analysis. It showed how one can discover silences in Hébert and Trudel’s
indexing of Magritte’s artworks. The second experiment was a componential analysis. It was
based on a semantic decomposition algorithm. This experiment showed how the
componential structure of the syntagmatic signature vector of a descriptor could be inferred
with the orthogonal complement subtraction operator. Finally, the third experiment was a
topic modelling analysis. It was based on a clustering algorithm. This experiment showed
that the regions of the SVS induced with a clustering algorithm allow us to discover the
topic structures organizing Magritte's artworks.
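The orthogonal complement subtraction operator used in the second experiment can be sketched as subtracting from a vector its projection onto another, in the spirit of Widdows' (2003) orthogonal negation (the function name and toy vectors are ours):

```python
import numpy as np

def orthogonal_complement(a, b):
    """Remove from a its projection onto b: the result is orthogonal to b,
    which models subtracting one semantic component from another."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a - (np.dot(a, b) / np.dot(b, b)) * b

# Toy usage: strip the <b>-direction component out of vector a.
a = np.array([2.0, 1.0])
b = np.array([1.0, 0.0])
print(orthogonal_complement(a, b))  # [0. 1.]
```

Applied to a descriptor's syntagmatic signature vector, this operator peels off one semantic component at a time, which is what the componential analysis of the second experiment relies on.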
These experiments only represent a fraction of the semiotic analyses one can produce within
this framework. For example, the results of the first experiment suggested that we could also
use the SVS framework to analyze elliptic structures. Other studies have also suggested
that the SVS framework allows the analysis of metaphor (Shutova 2010), analogy
(Mikolov et al. 2013), hypernymy (Santus, Lenci, et al. 2014), hyponymy (Erk 2009) and
more.
This is very promising for the development of a data-driven computational semiotics. The
results are both theoretical and methodological. Theoretically, the SVS framework provides
semiotics with a very powerful mathematical model: linear algebra. Methodologically, the
algorithms of the SVS framework allow a data-driven approach within which large-scale
semiotic analysis is possible.
Encouraged by the results of our experiments, we believe that future work should pursue
the investigation of other kinds of semiotic analysis one can produce with the SVS
framework.

Appendix
Appendix A: Five Nearest Artworks of Each Topic in the Semantic Vector Space.

0-masonry
id: 1291, 848, 1653, 944, 1479
title: témoin, Le; sac à malice, Le; confort de l'esprit, Le; leçon des ténèbres, La; entrée en scène, L'
cos: 0.630, 0.621, 0.619, 0.619, 0.612

1-boiserie
id: 430, 1751, 1739, 298, 1298
title: vengeance, La; vengeance, La; palais de rideaux, Le; palais de rideaux, Le; [Interior with standing object]
cos: 0.638, 0.624, 0.622, 0.622, 0.621

2-face
id: 219, 254, 1141, 1595, 637
title: amants, Les; porte ouverte, La; paysage fantôme, Le; [Portrait of Eliane Peeters]; Georgette
cos: 0.784, 0.782, 0.775, 0.773, 0.773

3-leaf
id: 1323, 580, 1415, 736, 1418
title: île au trésor, L'; grâces naturelles, Les; île au trésor, L'; prince charmant, Le; île au trésor, L'
cos: 0.690, 0.689, 0.686, 0.684, 0.682

4-eye
id: 1385, 1380, 1387, 1382, 750
title: Shéhérazade; Shéhérazade; Shéhérazade; Shéhérazade; Objet peint : œil
cos: 0.732, 0.719, 0.718, 0.710, 0.708

5-riding horse
id: 567, 1771, 1134, 647, 1039
title: jockey perdu, Le; blanc-seing, Le; blanc-seing, Le; tour d'ivoire, La; enfance d'Icare, L'
cos: 0.699, 0.686, 0.668, 0.654, 0.650

6-sea
id: 1001, 1548, 938, 1632, 989
title: idées claires, Les; origines du langage, Les; origine du langage, L'; monde familier, Le; rappel à l'ordre, Le
cos: 0.730, 0.729, 0.723, 0.721, 0.719

7-pipe
id: 304, 1492, 1639, 1555, 1954
title: trahison des images, La; trahison des images, La; trahison des images, La; trahison des images, La; [Poster for a shop window]
cos: 0.839, 0.839, 0.835, 0.835, 0.834

8-fire
id: 1460, 235, 119, 1600, 449
title: belle captive, La; paysage en feu, Le; aube à Cayenne, L'; condition humaine, La; aube à Cayenne, L'
cos: 0.479, 0.464, 0.453, 0.453, 0.451

9-tree
id: 1732, 543, 1680, 1745, 617
title: clairvoyance, La; chœur des sphinges, Le; grandes espérances, Les; recherche de l'absolu, La; vie heureuse, La
cos: 0.754, 0.740, 0.728, 0.724, 0.723

10-sheet music
id: 1827, 1787, 1826, 1808, 1810
title: Sans titre; misanthropes, Les; premiers amours, Les; reconnaissance infinie, La; trio magique, Le
cos: 0.711, 0.707, 0.706, 0.699, 0.695

11-lion
id: 492, 488, 1274, 1296, 1275
title: messagère, La; jeunesse illustrée, La; jeunesse illustrée, La; jeunesse illustrée, La; jeunesse illustrée, La
cos: 0.621, 0.612, 0.595, 0.594, 0.589

12-ripple
id: 109, 92, 1211, 177, 1105
title: Joconde, La; catapulte du désert, La; À la suite de l'eau, les nuages; déesse des environs, La; soir qui tombe, Le
cos: 0.572, 0.545, 0.541, 0.540, 0.537

13-female nudity
id: 556, 644, 653, 652, 435
title: aimant, L'; viol, Le; rêve, Le; aimant, L'; viol, Le
cos: 0.849, 0.846, 0.846, 0.845, 0.844

14-dressed man
id: 137, 1251, 868, 876, 1829
title: sens de la nuit, Le; Journal intime; [Design for a mural]; chant de la violette, Le; [Title unknown]
cos: 0.618, 0.606, 0.604, 0.597, 0.596

15-house
id: 838, 974, 958, 1554, 1702
title: empire des lumières, L'; empire des lumières, L'; empire des lumières, L'; parabole, La; ombre monumentale, L'
cos: 0.778, 0.765, 0.763, 0.757, 0.756

16-piano
id: 1547, 1506, 1522, 910, 1936
title: main heureuse, La; main heureuse, La; main heureuse, La; main heureuse, La; [Pleyel]
cos: 0.963, 0.962, 0.962, 0.961, 0.868

17-word inscription
id: 318, 211, 436, 1789, 256
title: bonne nouvelle, La; miroir vivant, Le; arbre de la science, L'; Sans titre; monde perdu, Le
cos: 0.400, 0.377, 0.368, 0.367, 0.366

18-sky
id: 149, 1136, 156, 511, 203
title: Campagne; Parmi les bosquets légers; palais de rideaux, Le; chant de l'orage, Le; paysage secret, Le
cos: 0.642, 0.622, 0.622, 0.616, 0.611

19-geometrical shape
id: 1239, 83, 16, 1243, 53
title: [Landscape]; forgerons, Les; [Flowers in front of a window]; [Abstract composition]; Retraite militaire
cos: 0.758, 0.758, 0.758, 0.758, 0.750
References

Arora, Ravneet Singh. 2012. “Towards Automated Classification of Fine-Art Painting Style:
A Comparative Study.” Rutgers University-Graduate School-New Brunswick.
Bar, Yaniv, Noga Levy, and Lior Wolf. 2014. “Classification of Artistic Styles Using
Binarized Features Derived from a Deep Neural Network.” Pp. 71–84 in Workshop
at the European Conference on Computer Vision. Springer.
Baroni, Marco and Alessandro Lenci. 2010. “Distributional Memory: A General Framework
for Corpus-Based Semantics.” Computational Linguistics 36(4):673–721.
Blei, David M. and John D. Lafferty. 2009. “Topic Models.” Text Mining: Classification,
Clustering, and Applications 10(71):34.
Bordag, Stefan and Gerhard Heyer. 2007. “A Structuralist Framework for Quantitative
Linguistics.” Pp. 171–189 in Aspects of Automatic Text Analysis. Springer.
Bouma, Gerlof. 2009. “Normalized (Pointwise) Mutual Information in Collocation
Extraction.” Proceedings of GSCL 31–40.
Burgess, Curt. 2000. “Theory and Operational Definitions in Computational Memory
Models: A Response to Glenberg and Robertson.” Journal of Memory and Language
43(3):402–8.
Burgess, Curt, Kay Livesay, and Kevin Lund. 1998. “Explorations in Context Space: Words,
Sentences, Discourse.” Discourse Processes 25(2–3):211–57.
Carneiro, Gustavo, Nuno Pinho da Silva, Alessio Del Bue, and João Paulo Costeira. 2012.
“Artistic Image Classification: An Analysis on the Printart Database.” Pp. 143–57 in
Computer Vision–ECCV 2012. Springer.

Crowley, Elliot J. and Andrew Zisserman. 2014. “The State of the Art: Object Retrieval in
Paintings Using Discriminative Regions.” in British Machine Vision Conference.
De Souza, Clarisse Sieckenius. 2005. The Semiotic Engineering of Human-Computer
Interaction. MIT press.
Dunning, Ted. 1993. “Accurate Methods for the Statistics of Surprise and Coincidence.”
Journal Computational Linguistics 19(1):61–74.
Erk, Katrin. 2009. “Supporting Inferences in Semantic Space: Representing Words as
Regions.” Pp. 104–15 in Proceedings of the Eighth International Conference on
Computational Semantics. Association for Computational Linguistics.
Evans, James A. and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social
Theory.” Annual Review of Sociology 42:21–50.
Firth, J. R. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic
Analysis Special Volume:1–32.
Floridi, Luciano. 1999. Philosophy and Computing: An Introduction. Psychology Press.
Gärdenfors, P. 2000. Conceptual Spaces: The Geometry of Thought. Cambridge: MIT Press.
Gärdenfors, P. 2014. The Geometry of Meaning: Semantics Based on Conceptual Spaces.
MIT Press.
Graham, Daniel J., Jay D. Friedenberg, Daniel N. Rockmore, and David J. Field. 2010.
“Mapping the Similarity Space of Paintings: Image Statistics and Visual Perception.”
Visual Cognition 18(4):559–73.
Greimas, Algirdas Julien, Frank Collins, and Paul Perron. 1989. “Figurative Semiotics and
the Semiotics of the Plastic Arts.” New Literary History 20(3):627–649.
Griffiths, Thomas L., Mark Steyvers, and Joshua B. Tenenbaum. 2007. “Topics in Semantic
Representation.” Psychological Review 114(2):211.
Groupe Mu. 1992. Traité du signe visuel : Pour une rhétorique de l’image. Paris: Seuil.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. “Diachronic Word
Embeddings Reveal Statistical Laws of Semantic Change.” Pp. 1489–1501 in
Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics. Berlin, Germany.
Hare, Jonathon S., Paul H. Lewis, Peter GB Enser, and Christine J. Sandom. 2006. “Mind
the Gap: Another Look at the Problem of the Semantic Gap in Image Retrieval.” Pp.
607309-607309–12 in Electronic Imaging 2006. International Society for Optics and
Photonics.
Harris, Zellig S. 1951. Methods in Structural Linguistics. Chicago & London: The University
of Chicago Press.
Harris, Zellig S. 1954. “Distributional Structure.” Word 10(23):146–62.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning
for Image Recognition.” Pp. 770–778 in Proceedings of the IEEE conference on
computer vision and pattern recognition.
Hébert, L. 2013. “Magritte. Toutes Les Œuvres, Tous Les Thèmes.” Retrieved March 16,
2016 (www.magrittedb.com).
Hébert, L. and Éric Trudel. 2013. “Analyse Des Images”. Magritte. Toutes Les Œuvres, Tous
Les Thèmes. Rimouski (Québec): Université du Québec à Rimouski. Retrieved March
15, 2016 (http://magrittedb.com).
Johnson Jr, C.Richard et al. 2008. “Image Processing for Artist Identification.” Signal
Processing Magazine, IEEE 25(4):37–48.
Kell, Douglas B. and Stephen G. Oliver. 2004. “Here Is the Evidence, Now What Is the
Hypothesis? The Complementary Roles of Inductive and Hypothesis‐driven Science
in the Post‐genomic Era.” Bioessays 26(1):99–105.
Kelling, Steve et al. 2009. “Data-Intensive Science: A New Paradigm for Biodiversity
Studies.” BioScience 59(7):613–20.
Khan, Fahad Shahbaz, Shida Beigpour, Joost van de Weijer, and Michael Felsberg. 2014.
“Painting-91: A Large Scale Database for Computational Painting Categorization.”
Machine Vision and Applications 25(6):1385–97.
Kiela, Douwe and Stephen Clark. 2014. “A Systematic Study of Semantic Vector Space
Model Parameters.” Pp. 21–30 in Proceedings of the 2nd Workshop on Continuous
Vector Space Models and their Compositionality (CVSC) at EACL.
Kitchin, Rob. 2014. “Big Data, New Epistemologies and Paradigm Shifts.” Big Data &
Society 1(1).
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “Imagenet Classification
with Deep Convolutional Neural Networks.” Pp. 1097–1105 in Advances in neural
information processing systems.
Landauer, T. K. and S. T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent
Semantic Analysis Theory of Acquisition, Induction, and Representation of
Knowledge.” Psychological Review; Psychological Review 104(2):211.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. 1998. “An Introduction to Latent
Semantic Analysis.” Discourse Processes 25(2–3):259–84.
Larsen, Kai R. and David E. Monarchi. 2004. “A Mathematical Approach to Categorization
and Labeling of Qualitative Data: The Latent Categorization Method.” Sociological
Methodology 34(1):349–92.
Lemaire, Benoît and Guy Denhière. 2006. “Effects of High-Order Co-Occurrences on Word
Semantic Similarity.” Current Psychology Letters. Behaviour, Brain & Cognition
(18, Vol. 1, 2006).
Leopold, Edda. 2005. “On Semantic Spaces.” Pp. 63–86 in LDV Forum, vol. 20.

Li, Jia and James Z. Wang. 2004. “Studying Digital Imagery of Ancient Paintings by
Mixtures of Stochastic Models.” Image Processing, IEEE Transactions on
13(3):340–53.
Liu, Ying, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. 2007. “A Survey of Content-
Based Image Retrieval with High-Level Semantics.” Pattern Recognition 40(1):262–
82.
Lombardi, Thomas Edward. 2005. “The Classification of Style in Fine-Art Painting.”
Citeseer.
Lu, Qin. 2015. “When Similarity Becomes Opposition: Synonyms and Antonyms
Discrimination in DSMs.” P. 41 in IJCoL vol. 1, n. 1 december 2015: Emerging
Topics at the First Italian Conference on Computational Linguistics. Accademia
University Press.
Lund, Kevin and Curt Burgess. 1996. “Producing High-Dimensional Semantic Spaces from
Lexical Co-Occurrence.” Behavior Research Methods, Instruments, & Computers
28(2):203–8.
Manning, C. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing.
Cambridge: MIT Press.
Mayaffre, Damon. 2008. “De L’occurrence À L’isotopie. Les Co-Occurrences En
Lexicométrie.” Syntaxe & Sémantique (9):53–72.
Mehler, Alexander. 2003. “Methodological Aspects of Computational Semiotics.” SEED
Journal (Semiotics, Evolution, Energy, and Development) 3(3):71–80.
Meunier, J. G. 2014. “Humanités Numériques Ou Computationnelles : Enjeux
Herméneutiques.” Sens Public 2014.
Meunier, J. G. 2017. “Vers Une Sémiotique Computationnelle” edited by S. Badir, I.
Darrault, L. Hébert, P. Michelucci, and É. Trudel. Applied Semiotics / Sémiotique
Appliquée (16).
Meunier, Jean-Guy. 1989. “Artificial Intelligence and Sign Theory.” Semiotica 77(1–3):43–
64.
Michel, J. B. et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized
Books.” Science 331(6014):176–182.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in
Continuous Space Word Representations.” Pp. 746–751 in Hlt-naacl, vol. 13.
Mimno, David. 2012. “Computational Historiography: Data Mining in a Century of Classics
Journals.” Journal on Computing and Cultural Heritage (JOCCH) 5(1):3.
Mitchell, Jeff and Mirella Lapata. 2010. “Composition in Distributional Models of
Semantics.” Cognitive Science 34(8):1388–1429.
Nadin, Mihai. 2011. “Information and Semiotic Processes The Semiotics of Computation.”
Cybernetics & Human Knowing 18(1–2):153–175.

46
Neuman, Yair, Yochai Cohen, and Dan Assaf. 2015. “How Do We Understand the Meaning
of Connotations? A Cognitive Computational Model.” Semiotica 2015(205):1–16.
Osgood, Charles E. 1952. “The Nature and Measurement of Meaning.” Psychological
Bulletin 49(3):197.
Osgood, Charles E. 1964. “Semantic Differential Technique in the Comparative Study of
Cultures1.” American Anthropologist 66(3):171–200.
Osgood, Charles Egerton, George John Suci, and Percy H. Tannenbaum. 1957. The
Measurement of Meaning. University of Illinois Press.
Pankratius, Victor et al. 2016. “Computer-Aided Discovery: Toward Scientific Insight
Generation with Machine Support.” IEEE Intelligent Systems 31(4):3–10.
Pincemin, Bénédicte. 1999. “Sémantique Interprétative et Analyses Automatiques de Textes:
Que Deviennent Les Sèmes?” Sémiotiques (17):71–120.
Rastier, F. 1996. “La Sémantique Des Textes: Concepts et Applications.” Hermes 9(16):15–
37.
Rastier, F. 2011. La Mesure et Le Grain. Sémantique de Corpus. Paris: Honoré Champion.
Rieger, Burghard B. 1981. “Feasible Fuzzy Semantics. On Some Problems of How to Handle
Word Meaning Empirically.” Words, Worlds, and Contexts. New Approaches in
Word Semantics (Research in Text Theory 6) 193–209.
Rieger, Burghard B. 1983. “Clusters in Semantic Space.” Actes Du Congrès International
Informatique et Science Humaines 805–814.
Rieger, Burghard B. 1989. “Distributed Semantic Representation of Word Meanings.” Pp.
243–273 in Workshop on Parallel Processing: Logic, Organization, and Technology.
Springer.
Rieger, Burghard B. 1992. “Fuzzy Computational Semantics.” Pp. 197–217 in Fuzzy
Systems. Proceedings of the Japanese-German-Center Symposium, Series, vol. 3.
Rieger, Burghard B. 1999. “Semiotics and Computational Linguistics.” Pp. 93–118 in
Computing with Words in Information/Intelligent Systems 1. Springer.
Sahlgren, M. 2006. “The Word-Space Model: Using Distributional Analysis to Represent
Syntagmatic and Paradigmatic Relations between Words in High-Dimensional
Vector Spaces.” Stockholm, Stockholm.
Sahlgren, Magnus. 2005. “An Introduction to Random Indexing.” in Methods and
applications of semantic indexing workshop at the 7th international conference on
terminology and knowledge engineering, TKE, vol. 5.
Sahlgren, Magnus. 2008. “The Distributional Hypothesis.” Italian Journal of Linguistics
20(1):33–54.
Saleh, Babak and Ahmed Elgammal. 2015. “Large-Scale Classification of Fine-Art
Paintings: Learning The Right Metric on The Right Feature.” arXiv Preprint
arXiv:1505.00855.
Santus, Enrico, Alessandro Lenci, Qin Lu, and Sabine Schulte Im Walde. 2014. “Chasing
Hypernyms in Vector Spaces with Entropy.” Pp. 38–42 in Proceedings of the 14th
Conference of the European Chapter of the Association for Computational
Linguistics, vol. 2.
Santus, Enrico, Qin Lu, Alessandro Lenci, and Churen Huang. 2014. “Unsupervised
Antonym-Synonym Discrimination in Vector Space.”
Schütze, Hinrich and Jan Pedersen. 1993. “A Vector Model for Syntagmatic and
Paradigmatic Relatedness.” Pp. 104–13 in Proceedings of the 9th Annual Conference
of the UW Centre for the New OED and Text Research. Citeseer.
Shamir, Lior. 2012. “Computer Analysis Reveals Similarities between the Artistic Styles of
Van Gogh and Pollock.” Leonardo 45(2):149–54.
Shamir, Lior. 2015. “What Makes a Pollock Pollock: A Machine Vision Approach.”
International Journal of Arts and Technology 8(1):1–10.
Shamir, Lior and Jane A. Tarakhovsky. 2012. “Computer Analysis of Art.” Journal on
Computing and Cultural Heritage (JOCCH) 5(2):7.
Shen, Jialie. 2009. “Stochastic Modeling Western Paintings for Effective Classification.”
Pattern Recognition 42(2):293–301.
Shutova, Ekaterina. 2010. “Models of Metaphor in NLP.” Pp. 688–697 in Proceedings of the
48th annual meeting of the association for computational linguistics. Association for
Computational Linguistics.
Smeulders, Arnold WM, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh
Jain. 2000. “Content-Based Image Retrieval at the End of the Early Years.” Pattern
Analysis and Machine Intelligence, IEEE Transactions on 22(12):1349–80.
Stork, David G. 2009. “Computer Vision and Computer Graphics Analysis of Paintings and
Drawings: An Introduction to the Literature.” Pp. 9–24 in Computer Analysis of
Images and Patterns. Springer.
Sylvestre, David, ed. 1997. René Magritte. Catalogue Raisonné. Anvers: Fonds Mercator.
Tanaka-Ishii, Kumiko. 2010. Semiotics of Programming. Cambridge University Press.
Tanaka-Ishii, Kumiko. 2015. “Semiotics of Computing: Filling the Gap Between Humanity
and Mechanical Inhumanity.” Pp. 981–1002 in International Handbook of Semiotics.
Springer.
Turney, P. D. and P. Pantel. 2010. “From Frequency to Meaning: Vector Space Models of
Semantics.” Journal of Artificial Intelligence Research 37(1):141–188.
Van Rijsbergen, Cornelis Joost. 2004. The Geometry of Information Retrieval. Cambridge
University Press Cambridge.
Widdows, Dominic. 2003. “Orthogonal Negation in Vector Spaces for Modelling Word-
Meanings and Document Retrieval.” Pp. 136–43 in Proceedings of the 41st Annual

Meeting on Association for Computational Linguistics-Volume 1. Association for
Computational Linguistics.
Widdows, Dominic. 2004. Geometry and Meaning. Standford: CSLI Publications.
Widdows, Dominic. 2008. “Semantic Vector Products: Some Initial Investigations.”
Retrieved September 8, 2016 (https://research.google.com/pubs/pub33477.html).
Widdows, Dominic and Trevor Cohen. 2014. “Reasoning with Vectors: A Continuous Model
for Fast Robust Inference.” Logic Journal of IGPL jzu028.
Zhang, Dengsheng, Md Monirul Islam, and Guojun Lu. 2012. “A Review on Automatic
Image Annotation Techniques.” Pattern Recognition 45(1):346–62.
Zujovic, Jana, Lisa Gandy, Scott Friedman, Bryan Pardo, and Thrasyvoulos N. Pappas. 2009.
“Classifying Paintings by Artistic Genre: An Analysis of Features & Classifiers.” Pp.
1–5 in Multimedia Signal Processing, 2009. MMSP’09. IEEE International
Workshop on. IEEE.
