
Learning Representations of Text using Neural Networks

Tomáš Mikolov

Joint work with Ilya Sutskever, Kai Chen, Greg Corrado, Jeff Dean, Quoc Le, Thomas Strohmann

Google Research

NIPS Deep Learning Workshop 2013

1 / 31

Overview

Distributed Representations of Text


Efficient learning
Linguistic regularities
Examples
Translation of words and phrases
Available resources

2 / 31

Representations of Text
The representation of text is very important for the performance of many real-world applications. The most common techniques are:
Local representations
N-grams
Bag-of-words
1-of-N coding

Continuous representations
Latent Semantic Analysis
Latent Dirichlet Allocation
Distributed Representations

3 / 31

Distributed Representations

Distributed representations of words can be obtained from various neural network based language models:
Feedforward neural net language model
Recurrent neural net language model

4 / 31

Feedforward Neural Net Language Model


[Figure: four-gram feedforward NNLM architecture. The input words w(t-3), w(t-2), w(t-1) are mapped through the projection matrix U, a hidden layer, and a softmax output layer that predicts w(t)]

Four-gram neural net language model architecture (Bengio 2001)
The training is done using stochastic gradient descent and backpropagation
The word vectors are in the matrix U
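As a concrete illustration of the forward pass, a minimal numpy sketch follows; the sizes, initialization, and function name are placeholders rather than the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, D, H, N = 10000, 100, 500, 3      # vocabulary, embedding dim, hidden units, context words

U   = rng.normal(scale=0.1, size=(V_size, D))   # word vectors (projection matrix)
W_h = rng.normal(scale=0.1, size=(N * D, H))    # projection -> hidden weights
W_o = rng.normal(scale=0.1, size=(H, V_size))   # hidden -> output weights

def forward(context_ids):
    """Return P(w(t) | w(t-3), w(t-2), w(t-1)) for one training example."""
    x = U[context_ids].reshape(-1)        # concatenate the N context word vectors
    h = np.tanh(x @ W_h)                  # hidden layer
    scores = h @ W_o                      # one score per vocabulary word
    e = np.exp(scores - scores.max())     # softmax over the full vocabulary
    return e / e.sum()

probs = forward(np.array([42, 7, 1999]))  # hypothetical word indices for w(t-3..t-1)
```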

5 / 31

Efficient Learning

The training complexity of the feedforward NNLM is high:


Propagation from the projection layer to the hidden layer
Softmax in the output layer

Using this model just for obtaining the word vectors is very
inefficient
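Concretely, the companion paper (Efficient Estimation of Word Representations in Vector Space) writes the per-example training complexity as, for N context words, embedding dimensionality D, hidden layer size H, and vocabulary size V:

```latex
Q = N \times D + N \times D \times H + H \times V
```

The last two terms dominate, and these are exactly what the techniques on the next slide address.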

6 / 31

Efficient Learning

The full softmax can be replaced by:
Hierarchical softmax (Morin and Bengio)
Hinge loss (Collobert and Weston)
Noise contrastive estimation (Mnih et al.)
Negative sampling (our work; a sketch follows below)

We can further remove the hidden layer: for large models, this can provide an additional speedup of up to 1000x
Continuous bag-of-words model
Continuous skip-gram model
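A minimal sketch of one skip-gram training step with negative sampling, in numpy. This is an illustrative re-implementation, not the word2vec code: the vocabulary size, dimensionality, and learning rate are placeholders, and negatives are drawn uniformly rather than from the unigram distribution raised to the 3/4 power used by the released tool.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10000, 300                              # vocabulary size, vector dimensionality (illustrative)
W_in  = rng.normal(scale=0.01, size=(V, D))    # "input" word vectors
W_out = np.zeros((V, D))                       # "output" (context) vectors
lr, k = 0.025, 5                               # learning rate, negative samples per pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """One SGD step for a (center, context) pair with k negative samples."""
    targets = np.concatenate(([context], rng.integers(0, V, size=k)))
    labels  = np.concatenate(([1.0], np.zeros(k)))   # 1 for the true context, 0 for noise
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for t, label in zip(targets, labels):
        pred = sigmoid(v @ W_out[t])
        g = lr * (label - pred)              # gradient of the log-sigmoid objective
        grad_v   += g * W_out[t]
        W_out[t] += g * v
    W_in[center] += grad_v

train_pair(center=42, context=7)               # hypothetical word indices
```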

7 / 31

Skip-gram Architecture

[Figure: skip-gram architecture. The current word w(t) is projected and used to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2)]
Predicts the surrounding words given the current word
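A small sketch of how (center, context) training pairs can be generated from a token stream; the fixed window is a simplification, since the original tool samples a random window size up to the maximum for each center word:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) index pairs for the skip-gram model."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# e.g. list(skipgram_pairs([10, 4, 7, 4, 2])) gives pairs that can be fed to a
# training step such as the train_pair() sketch above
```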


8 / 31

Continuous Bag-of-words Architecture


[Figure: continuous bag-of-words architecture. The context word vectors for w(t-2), w(t-1), w(t+1), w(t+2) are summed (SUM) in the projection layer and used to predict the current word w(t)]
Predicts the current word given the context
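A minimal sketch of the CBOW projection step, assuming an embedding matrix like the W_in used in the earlier sketch (names are illustrative):

```python
import numpy as np

def cbow_context_vector(context_ids, W_in):
    """CBOW projection: sum the context word vectors to form the input
    representation that is used to predict the current word."""
    return W_in[np.asarray(context_ids)].sum(axis=0)

# The resulting vector plays the same role as v in the negative-sampling sketch
# above, except that the prediction target is the current word w(t).
```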


9 / 31

Efficient Learning - Summary

Efficient multi-threaded implementation of the new models greatly reduces the training complexity
The training speed is on the order of 100K - 5M words per second
Quality of word representations improves significantly with more training data

10 / 31

Linguistic Regularities in Word Vector Space

The word vector space implicitly encodes many regularities among words

11 / 31

Linguistic Regularities in Word Vector Space

The resulting distributed representations of words contain a surprising amount of syntactic and semantic information
There are multiple degrees of similarity among words:
KING is similar to QUEEN as MAN is similar to WOMAN
KING is similar to KINGS as MAN is similar to MEN

Simple vector operations with the word vectors provide very intuitive results (see the sketch below)
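For example, the KING - MAN + WOMAN regularity can be probed with plain vector arithmetic and cosine similarity. A minimal sketch, assuming a dictionary vec that maps words to numpy vectors (the dictionary and the example words are illustrative):

```python
import numpy as np

def analogy(a, b, c, vec, topn=1):
    """Return the words whose vectors are closest to vec[b] - vec[a] + vec[c],
    excluding the input words (the TOP1 evaluation also excludes them)."""
    target = vec[b] - vec[a] + vec[c]
    target /= np.linalg.norm(target)
    scored = [(np.dot(target, v / np.linalg.norm(v)), w)
              for w, v in vec.items() if w not in (a, b, c)]
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# analogy("man", "king", "woman", vec) is expected to return ["queen"]
```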

12 / 31

Linguistic Regularities - Results

The regularity of the learned word vector space is evaluated using a test set with about 20K questions
The test set contains both syntactic and semantic questions
We measure TOP1 accuracy (input words are removed during search)
We compare our models to previously published word vectors

13 / 31

Linguistic Regularities - Results

Model               | Vector Dimensionality | Training Words | Training Time | Accuracy [%]
Collobert NNLM      | 50                    | 660M           | 2 months      | 11
Turian NNLM         | 200                   | 37M            | few weeks     |
Mnih NNLM           | 100                   | 37M            | 7 days        |
Mikolov RNNLM       | 640                   | 320M           | weeks         | 25
Huang NNLM          | 50                    | 990M           | weeks         | 13
Our NNLM            | 100                   | 6B             | 2.5 days      | 51
Skip-gram (hier.s.) | 1000                  | 6B             | hours         | 66
CBOW (negative)     | 300                   | 1.5B           | minutes       | 72

14 / 31

Linguistic Regularities in Word Vector Space

Expression                              | Nearest token
Paris - France + Italy                  | Rome
bigger - big + cold                     | colder
sushi - Japan + Germany                 | bratwurst
Cu - copper + gold                      | Au
Windows - Microsoft + Google            | Android
Montreal Canadiens - Montreal + Toronto | Toronto Maple Leafs

15 / 31

Performance on Rare Words

Word vectors from neural networks were previously criticized for their poor performance on rare words
Scaling up the training data set size helps to improve performance on rare words
For evaluating progress, we used the data set from Luong et al.: Better word representations with recursive neural networks for morphology, CoNLL 2013

16 / 31

Performance on Rare Words - Results

Model                                | Correlation with Human Ratings (Spearman's rank correlation)
Collobert NNLM                       | 0.28
Collobert NNLM + Morphology features | 0.34
CBOW (100B)                          | 0.50

17 / 31

Rare Words - Examples of Nearest Neighbours

Three nearest neighbours are shown for each model and query word:

Model               | Redmond            | Havel                  | graffiti     | capitulate
Collobert NNLM      | conyers            | plauen                 | cheesecake   | abdicate
                    | lubbock            | dzerzhinsky            | gossip       | accede
                    | keene              | osterreich             | dioramas     | rearm
Turian NNLM         | McCarthy           | Jewell                 | gunfire      |
                    | Alston             | Arzu                   | emotion      |
                    | Cousins            | Ovitz                  | impunity     |
Mnih NNLM           | Podhurst           | Pontiff                | anaesthetics | Mavericks
                    | Harlang            | Pinochet               | monkeys      | planning
                    | Agarwal            | Rodionov               | Jews         | hesitated
Skip-gram (phrases) | Redmond Wash.      | Vaclav Havel           | spray paint  | capitulation
                    | Redmond Washington | president Vaclav Havel | grafitti     | capitulated
                    | Microsoft          | Velvet Revolution      | taggers      | capitulating

18 / 31

From Words to Phrases and Beyond

Often we want to represent more than just individual words: phrases, queries, sentences
The vector representation of a query can be obtained by:
Forming the phrases (a simple scoring sketch follows below)
Adding the vectors together
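One simple way to form the phrases, following the bigram scoring idea from the accompanying phrase paper, is sketched below; delta and threshold are illustrative values, not the settings used in the released tool:

```python
from collections import Counter

def form_phrases(tokens, delta=5, threshold=1e-4):
    """Join adjacent words into a phrase when their co-occurrence score
    (count(a, b) - delta) / (count(a) * count(b)) exceeds a threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                out.append(a + "_" + b)   # e.g. "mountain_view"
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out
```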

19 / 31

From Words to Phrases and Beyond

Example query:
restaurants in mountain view that are not very good
Forming the phrases:
restaurants in (mountain view) that are (not very good)
Adding the vectors:
restaurants + in + (mountain view) + that + are + (not very good)
Very simple and efficient (see the sketch below)
Will not work well for long sentences or documents
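A minimal sketch of the addition step, assuming a dictionary vec of word and phrase vectors and tokens that have already been joined into phrases (names are illustrative):

```python
import numpy as np

def query_vector(query_tokens, vec):
    """Represent a query as the sum of its word/phrase vectors; vec maps
    tokens such as 'restaurants' or 'mountain_view' to numpy arrays."""
    return np.sum([vec[t] for t in query_tokens if t in vec], axis=0)

# query_vector(["restaurants", "in", "mountain_view", "that", "are", "not_very_good"], vec)
```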

20 / 31

Compositionality by Vector Addition

Expression        | Nearest tokens
Czech + currency  | koruna, Czech crown, Polish zloty, CTK
Vietnam + capital | Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines | airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river   | Moscow, Volga River, upriver, Russia
French + actress  | Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg

21 / 31

Visualization of Regularities in Word Vector Space

We can visualize the word vectors by projecting them to 2D space
PCA can be used for dimensionality reduction (a minimal sketch follows below)
Although a lot of information is lost, the regular structure is often visible
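A minimal sketch of the 2D projection, assuming the word vectors are stacked into a numpy matrix; the function name and usage line are illustrative:

```python
import numpy as np

def pca_2d(vectors):
    """Project word vectors (one per row) onto their top two principal components."""
    X = vectors - vectors.mean(axis=0)
    # SVD of the centered matrix gives the principal directions in Vt
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                     # (n_words, 2) coordinates for plotting

# coords = pca_2d(np.stack([vec[w] for w in ["king", "queen", "man", "woman"]]))
```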

22 / 31

Visualization of Regularities in Word Vector Space


[Figure: 2D PCA projection of word vectors showing male/female pairs such as king-queen, prince-princess, hero-heroine, actor-actress, landlord-landlady, he-she, male-female, bull-cow, cock-hen]

23 / 31

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection showing verb tense regularities for fall/fell/fallen, draw/drew/drawn, give/gave/given, take/took/taken]

24 / 31

Visualization of Regularities in Word Vector Space


[Figure: 2D PCA projection showing countries and their capital cities: China-Beijing, Russia-Moscow, Japan-Tokyo, Turkey-Ankara, Poland-Warsaw, Germany-Berlin, France-Paris, Italy-Rome, Greece-Athens, Spain-Madrid, Portugal-Lisbon]

25 / 31

Machine Translation

Word vectors should have similar structure when trained on comparable corpora
This should hold even for corpora in different languages

26 / 31

Machine Translation - English to Spanish

[Figure: 2D projections of English word vectors (horse, cow, dog, pig, cat) and the corresponding Spanish word vectors (caballo, vaca, perro, cerdo, gato); the two spaces show a similar geometric arrangement]

The figures were manually rotated and scaled

27 / 31

Machine Translation

For translation from one vector space to another, we need to learn a linear projection (which will perform rotation and scaling)
A small starting dictionary can be used to train the linear projection (see the sketch below)
Then, we can translate any word that was seen in the monolingual data
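A minimal sketch of learning and applying the projection, here fitted in closed form with least squares rather than the stochastic gradient descent used in the paper; the array names and vocabulary structures are assumptions:

```python
import numpy as np

def fit_translation_matrix(src_vecs, tgt_vecs):
    """Least-squares fit of W such that W @ x (source-language vector) approximates
    the target-language vector, trained on a small seed dictionary of word pairs.
    src_vecs, tgt_vecs: (n_pairs, d_src) and (n_pairs, d_tgt) arrays."""
    X, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return X.T                                 # shape (d_tgt, d_src)

def translate(word_vec, W, tgt_vocab, tgt_matrix):
    """Project a source word vector and return the nearest target-language word."""
    z = W @ word_vec
    sims = tgt_matrix @ z / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(z))
    return tgt_vocab[int(np.argmax(sims))]
```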

28 / 31

MT - Accuracy of English to Spanish translation


[Figure: Precision@1 and Precision@5 of English to Spanish word translation plotted against the number of training words (roughly 10^7 to 10^9); accuracy axis from 0 to 70%]

29 / 31

Machine Translation

When applied to English to Spanish word translation, the accuracy is above 90% for the most confident translations
Can work for any language pair (we tried English to Vietnamese)
More details in the paper: Exploiting Similarities among Languages for Machine Translation

30 / 31

Available Resources

The project webpage is code.google.com/p/word2vec
open-source code
pretrained word vectors (model for common words and phrases will be uploaded soon; a loading sketch follows below)
links to the papers
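For example, vectors saved in the word2vec binary format can be loaded with the third-party gensim library; the file name below is a placeholder, not a released artifact:

```python
# A minimal sketch for loading vectors in the word2vec binary format, assuming the
# third-party gensim library is installed; "vectors.bin" is a hypothetical path.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```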

31 / 31
