Networks
Tomas Mikolov
Google Research
Overview
Representations of Text
The representation of text is very important for the performance of many real-world applications. The most common techniques are:
Local representations
  - N-grams
  - Bag-of-words
  - 1-of-N coding
Continuous representations
  - Latent Semantic Analysis
  - Latent Dirichlet Allocation
  - Distributed Representations
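The local representations listed above can be illustrated with a minimal sketch. The toy vocabulary, the example sentence, and the helper names (`one_of_n`, `bag_of_words`, `ngrams`) are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter

# Toy vocabulary; the words are illustrative and not taken from the slides.
vocabulary = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocabulary)}


def one_of_n(word):
    """Return the 1-of-N (one-hot) vector for a word in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector


def bag_of_words(tokens):
    """Return per-word counts over the vocabulary, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the cat sat on the mat".split()
print(one_of_n("cat"))          # [0, 1, 0, 0, 0]
print(bag_of_words(sentence))   # [2, 1, 1, 1, 1]
print(ngrams(sentence, 2)[:2])  # [('the', 'cat'), ('cat', 'sat')]
```

In all three cases every word is an independent symbol: nothing in these vectors says that two different words are related, which is the gap the continuous representations aim to close.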
Distributed Representations
[Figure: network diagram with projection, hidden, and output layers; node labels w(t-2), w(t-1), w(t); weight matrix U]
Efficient Learning
Using this model just for obtaining the word vectors is very
inefficient
Skip-gram Architecture
[Figure: skip-gram architecture with input, projection, and output layers; node labels w(t-2), w(t-1), w(t), w(t+1), w(t+2)]
[Figure: architecture with a SUM node, a projection layer, and an output layer; node labels w(t-2), w(t-1), w(t), w(t+1), w(t+2)]
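A rough sketch of how the two diagrams are usually read, under stated assumptions: in the skip-gram architecture the current word w(t) is the input and each of w(t-2) to w(t+2) is a prediction target, while in the diagram with the SUM node the vectors of the surrounding words are added together to predict w(t). The function names, the window of 2, and the toy vectors are hypothetical, and the sketch only builds the training examples; it does not train a model.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs: w(t) paired with each neighbour."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_projection(tokens, t, vectors, window=2):
    """Sum the vectors of the context words around position t (the SUM node)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
        if j != t:
            total = [a + b for a, b in zip(total, vectors[tokens[j]])]
    return total


tokens = ["a", "b", "c", "d", "e"]
# Hypothetical 2-dimensional word vectors, chosen only for illustration.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [5.0, 5.0],
           "d": [2.0, 0.0], "e": [0.0, 2.0]}

print(skipgram_pairs(tokens)[:2])           # [('a', 'b'), ('a', 'c')]
print(cbow_projection(tokens, 2, vectors))  # [3.0, 3.0]
```

Note that the summed projection ignores the order of the context words, which is consistent with the "bag-of-words" part of the CBOW name used later in the deck.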
Model                 Vector Dimensionality   Training Words   Training Time   Accuracy [%]
Collobert NNLM        50                      660M             2 months        11
Turian NNLM           200                     37M              few weeks
Mnih NNLM             100                     37M              7 days
Mikolov RNNLM         640                     320M             weeks           25
Huang NNLM            50                      990M             weeks           13
Our NNLM              100                     6B               2.5 days        51
Skip-gram (hier.s.)   1000                    6B               hours           66
CBOW (negative)       300                     1.5B             minutes         72
Expression            Nearest token
                      Rome
                      colder
                      bratwurst
Cu - copper + gold    Au
                      Android
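An expression such as Cu - copper + gold can be evaluated by doing the arithmetic on the word vectors and then looking up the closest remaining token. The sketch below assumes cosine similarity and assumes the words in the expression are excluded from the candidates; the 3-dimensional vectors, the extra token "zinc", and the function names are hypothetical and hand-picked so that the example works.

```python
import math

# Hypothetical 3-dimensional vectors; real word vectors are learned from text.
vectors = {
    "Cu":     [1.0, 0.0, 1.0],
    "copper": [1.0, 0.0, 0.0],
    "gold":   [0.0, 1.0, 0.0],
    "Au":     [0.0, 1.0, 1.0],
    "zinc":   [0.7, 0.7, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_token(positive, negative, vectors):
    """Return the token closest (by cosine) to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Assumption: words that appear in the expression are not valid answers.
    exclude = set(positive) | set(negative)
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(nearest_token(["Cu", "gold"], ["copper"], vectors))  # Au
```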
Model
Collobert NNLM
0.28
0.34
CBOW (100B)
0.50
Model                 Redmond              Havel               graffiti       capitulate
Collobert NNLM        conyers              plauen              cheesecake     abdicate
                      lubbock              dzerzhinsky         gossip         accede
                      keene                osterreich          dioramas       rearm
Turian NNLM           McCarthy             Jewell              gunfire
                      Alston               Arzu                emotion
                      Cousins              Ovitz               impunity
Mnih NNLM             Podhurst             Pontiff             anaesthetics   Mavericks
                      Harlang              Pinochet            monkeys        planning
                      Agarwal              Rodionov            Jews           hesitated
Skip-gram (phrases)   Redmond Wash.        Vaclav Havel        spray paint    capitulation
                      Redmond Washington                       grafitti       capitulated
                      Microsoft            Velvet Revolution   taggers        capitulating
Example query:
restaurants in mountain view that are not very good
Forming the phrases:
restaurants in (mountain view) that are (not very good)
Adding the vectors:
restaurants + in + (mountain view) + that + are + (not very good)
Very simple and efficient
Will not work well for long sentences or documents
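The two steps of the example query can be sketched as follows. This is a minimal illustration under stated assumptions: phrases are found with a fixed lookup table (`PHRASES`), which stands in for whatever phrase detection the author actually used, and the 2-dimensional vectors are made-up numbers rather than trained vectors.

```python
# Hypothetical phrase list, for illustration only.
PHRASES = {("mountain", "view"), ("not", "very", "good")}
MAX_PHRASE_LEN = 3


def form_phrases(tokens):
    """Greedily merge known multi-word phrases into single tokens."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first, then shorter ones.
        for n in range(MAX_PHRASE_LEN, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PHRASES:
                result.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts here; keep the single word.
            result.append(tokens[i])
            i += 1
    return result


def add_vectors(tokens, vectors):
    """Represent the query as the element-wise sum of its token vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        total = [a + b for a, b in zip(total, vectors[token])]
    return total


query = "restaurants in mountain view that are not very good".split()
tokens = form_phrases(query)
print(tokens)
# ['restaurants', 'in', 'mountain view', 'that', 'are', 'not very good']

# Made-up vectors: token number i gets the vector [i, 1.0].
vectors = {token: [float(i), 1.0] for i, token in enumerate(tokens)}
print(add_vectors(tokens, vectors))  # [15.0, 6.0]
```

Because the sum discards word order and grows with the number of tokens, the sketch also hints at why this approach is not expected to work well for long sentences or documents.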
Expression
Nearest tokens
Czech + currency
Vietnam + capital
German + airlines
Russian + river
French + actress
[Figure: 2-D plot of word vectors for male and female word pairs: king/queen, prince/princess, hero/heroine, actor/actress, landlord/landlady, he/she, male/female, bull/cow, cock/hen]
[Figure: 2-D plot of word vectors for verb forms: fall/fell/fallen, draw/drew/drawn, give/gave/given, take/took/taken]
[Figure: 2-D plot of word vectors for countries and their capitals: Russia/Moscow, Japan/Tokyo, Turkey/Ankara, Poland/Warsaw, Germany/Berlin, France/Paris, Italy/Rome, Greece/Athens, Spain/Madrid, Portugal/Lisbon]
Machine Translation
[Figure: 2-D plots of word vectors for English words (horse, cow, dog, pig, cat) and Spanish words (caballo (horse), vaca (cow), perro (dog), cerdo (pig), gato (cat))]
Machine Translation
[Figure: plot of Accuracy (0 to 60) against Number of training words (powers of ten), with series Precision@1 and Precision@5]
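The Precision@1 and Precision@5 labels in the plot can be read, assuming the usual definition, as the fraction of test words whose correct translation appears among the top 1 or top 5 ranked candidates. The sketch below shows that computation only; the ranked candidate lists are invented for illustration (reusing word pairs that appear elsewhere in the deck) and are not results from the talk.

```python
def precision_at_k(ranked_candidates, reference, k):
    """Fraction of source words whose reference translation is in the top k."""
    hits = sum(
        1 for word, candidates in ranked_candidates.items()
        if reference[word] in candidates[:k]
    )
    return hits / len(ranked_candidates)


# Hypothetical ranked candidates, best first; not real system output.
ranked = {
    "caballo": ["horse", "cow", "pig"],
    "vaca": ["pig", "cow", "horse"],
    "perro": ["cat", "pig", "cow"],
    "gato": ["cat", "dog", "pig"],
}
reference = {"caballo": "horse", "vaca": "cow", "perro": "dog", "gato": "cat"}

print(precision_at_k(ranked, reference, 1))  # 0.5
print(precision_at_k(ranked, reference, 5))  # 0.75
```

Precision@5 can never be lower than Precision@1 on the same data, since every top-1 hit is also a top-5 hit.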
Available Resources