
Learning Representations of Text using Neural Networks

Tomáš Mikolov

Joint work with Ilya Sutskever, Kai Chen, Greg Corrado, Jeff Dean, Quoc Le, Thomas Strohmann

Google Research

NIPS Deep Learning Workshop 2013

1 / 31

Overview

Distributed Representations of Text


Efficient learning
Linguistic regularities
Examples
Translation of words and phrases
Available resources

2 / 31

Representations of Text
The representation of text is very important for the performance of many real-world applications. The most common techniques are:
Local representations
N-grams
Bag-of-words
1-of-N coding

Continuous representations
Latent Semantic Analysis
Latent Dirichlet Allocation
Distributed Representations

3 / 31

Distributed Representations

Distributed representations of words can be obtained from various neural network based language models:
Feedforward neural net language model
Recurrent neural net language model

4 / 31

Feedforward Neural Net Language Model


[Figure: four-gram feedforward NNLM architecture. The input words w(t-3), w(t-2), w(t-1) are mapped through the projection matrix U, a hidden layer, and a softmax output layer that predicts w(t)]

Four-gram neural net language model architecture (Bengio 2001)
The training is done using stochastic gradient descent and backpropagation
The word vectors are in the matrix U
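As a concrete illustration of the forward pass, a minimal numpy sketch follows; the sizes, initialization, and function name are placeholders rather than the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, D, H, N = 10000, 100, 500, 3      # vocabulary, embedding dim, hidden units, context words

U   = rng.normal(scale=0.1, size=(V_size, D))   # word vectors (projection matrix)
W_h = rng.normal(scale=0.1, size=(N * D, H))    # projection -> hidden weights
W_o = rng.normal(scale=0.1, size=(H, V_size))   # hidden -> output weights

def forward(context_ids):
    """Return P(w(t) | w(t-3), w(t-2), w(t-1)) for one training example."""
    x = U[context_ids].reshape(-1)        # concatenate the N context word vectors
    h = np.tanh(x @ W_h)                  # hidden layer
    scores = h @ W_o                      # one score per vocabulary word
    e = np.exp(scores - scores.max())     # softmax over the full vocabulary
    return e / e.sum()

probs = forward(np.array([42, 7, 1999]))  # hypothetical word indices for w(t-3..t-1)
```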

5 / 31

Efficient Learning

The training complexity of the feedforward NNLM is high:


Propagation from the projection layer to the hidden layer
Softmax in the output layer

Using this model just for obtaining the word vectors is very
inefficient
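Concretely, the companion paper (Efficient Estimation of Word Representations in Vector Space) writes the per-example training complexity as, for N context words, embedding dimensionality D, hidden layer size H, and vocabulary size V:

```latex
Q = N \times D + N \times D \times H + H \times V
```

The last two terms dominate, and these are exactly what the techniques on the next slide address.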

6 / 31

Efficient Learning

The full softmax can be replaced by:
Hierarchical softmax (Morin and Bengio)
Hinge loss (Collobert and Weston)
Noise contrastive estimation (Mnih et al.)
Negative sampling (our work; a sketch follows below)

We can further remove the hidden layer: for large models, this can provide an additional speedup of up to 1000x
Continuous bag-of-words model
Continuous skip-gram model
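A minimal sketch of one skip-gram training step with negative sampling, in numpy. This is an illustrative re-implementation, not the word2vec code: the vocabulary size, dimensionality, and learning rate are placeholders, and negatives are drawn uniformly rather than from the unigram distribution raised to the 3/4 power used by the released tool.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10000, 300                              # vocabulary size, vector dimensionality (illustrative)
W_in  = rng.normal(scale=0.01, size=(V, D))    # "input" word vectors
W_out = np.zeros((V, D))                       # "output" (context) vectors
lr, k = 0.025, 5                               # learning rate, negative samples per pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """One SGD step for a (center, context) pair with k negative samples."""
    targets = np.concatenate(([context], rng.integers(0, V, size=k)))
    labels  = np.concatenate(([1.0], np.zeros(k)))   # 1 for the true context, 0 for noise
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for t, label in zip(targets, labels):
        pred = sigmoid(v @ W_out[t])
        g = lr * (label - pred)              # gradient of the log-sigmoid objective
        grad_v   += g * W_out[t]
        W_out[t] += g * v
    W_in[center] += grad_v

train_pair(center=42, context=7)               # hypothetical word indices
```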

7 / 31

Skip-gram Architecture

[Figure: skip-gram architecture. The current word w(t) is projected and used to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2)]
Predicts the surrounding words given the current word
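A small sketch of how (center, context) training pairs can be generated from a token stream; the fixed window is a simplification, since the original tool samples a random window size up to the maximum for each center word:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) index pairs for the skip-gram model."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# e.g. list(skipgram_pairs([10, 4, 7, 4, 2])) gives pairs that can be fed to a
# training step such as the train_pair() sketch above
```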


8 / 31

Continuous Bag-of-words Architecture


[Figure: continuous bag-of-words architecture. The context word vectors for w(t-2), w(t-1), w(t+1), w(t+2) are summed (SUM) in the projection layer and used to predict the current word w(t)]
Predicts the current word given the context
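A minimal sketch of the CBOW projection step, assuming an embedding matrix like the W_in used in the earlier sketch (names are illustrative):

```python
import numpy as np

def cbow_context_vector(context_ids, W_in):
    """CBOW projection: sum the context word vectors to form the input
    representation that is used to predict the current word."""
    return W_in[np.asarray(context_ids)].sum(axis=0)

# The resulting vector plays the same role as v in the negative-sampling sketch
# above, except that the prediction target is the current word w(t).
```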


9 / 31

Efficient Learning - Summary

Efficient multi-threaded implementation of the new models greatly reduces the training complexity
The training speed is on the order of 100K - 5M words per second
Quality of word representations improves significantly with more training data

10 / 31

Linguistic Regularities in Word Vector Space

The word vector space implicitly encodes many regularities among words

11 / 31

Linguistic Regularities in Word Vector Space

The resulting distributed representations of words contain a surprising amount of syntactic and semantic information
There are multiple degrees of similarity among words:
KING is similar to QUEEN as MAN is similar to WOMAN
KING is similar to KINGS as MAN is similar to MEN

Simple vector operations with the word vectors provide very intuitive results (see the sketch below)
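For example, the KING - MAN + WOMAN regularity can be probed with plain vector arithmetic and cosine similarity. A minimal sketch, assuming a dictionary vec that maps words to numpy vectors (the dictionary and the example words are illustrative):

```python
import numpy as np

def analogy(a, b, c, vec, topn=1):
    """Return the words whose vectors are closest to vec[b] - vec[a] + vec[c],
    excluding the input words (the TOP1 evaluation also excludes them)."""
    target = vec[b] - vec[a] + vec[c]
    target /= np.linalg.norm(target)
    scored = [(np.dot(target, v / np.linalg.norm(v)), w)
              for w, v in vec.items() if w not in (a, b, c)]
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# analogy("man", "king", "woman", vec) is expected to return ["queen"]
```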

12 / 31

Linguistic Regularities - Results

The regularity of the learned word vector space is evaluated using a test set with about 20K questions
The test set contains both syntactic and semantic questions
We measure TOP1 accuracy (input words are removed during search)
We compare our models to previously published word vectors

13 / 31

Linguistic Regularities - Results

Model               | Vector Dimensionality | Training Words | Training Time | Accuracy [%]
Collobert NNLM      | 50                    | 660M           | 2 months      | 11
Turian NNLM         | 200                   | 37M            | few weeks     |
Mnih NNLM           | 100                   | 37M            | 7 days        |
Mikolov RNNLM       | 640                   | 320M           | weeks         | 25
Huang NNLM          | 50                    | 990M           | weeks         | 13
Our NNLM            | 100                   | 6B             | 2.5 days      | 51
Skip-gram (hier.s.) | 1000                  | 6B             | hours         | 66
CBOW (negative)     | 300                   | 1.5B           | minutes       | 72

14 / 31

Linguistic Regularities in Word Vector Space

Expression                              | Nearest token
Paris - France + Italy                  | Rome
bigger - big + cold                     | colder
sushi - Japan + Germany                 | bratwurst
Cu - copper + gold                      | Au
Windows - Microsoft + Google            | Android
Montreal Canadiens - Montreal + Toronto | Toronto Maple Leafs

15 / 31

Performance on Rare Words

Word vectors from neural networks were previously criticized for their poor performance on rare words
Scaling up the training data set size helps to improve performance on rare words
For evaluating progress, we used the data set from Luong et al.: Better word representations with recursive neural networks for morphology, CoNLL 2013

16 / 31

Performance on Rare Words - Results

Model                                | Correlation with Human Ratings (Spearman's rank correlation)
Collobert NNLM                       | 0.28
Collobert NNLM + Morphology features | 0.34
CBOW (100B)                          | 0.50

17 / 31

Rare Words - Examples of Nearest Neighbours

Three nearest neighbours are shown for each model and query word:

Model               | Redmond            | Havel                  | graffiti     | capitulate
Collobert NNLM      | conyers            | plauen                 | cheesecake   | abdicate
                    | lubbock            | dzerzhinsky            | gossip       | accede
                    | keene              | osterreich             | dioramas     | rearm
Turian NNLM         | McCarthy           | Jewell                 | gunfire      |
                    | Alston             | Arzu                   | emotion      |
                    | Cousins            | Ovitz                  | impunity     |
Mnih NNLM           | Podhurst           | Pontiff                | anaesthetics | Mavericks
                    | Harlang            | Pinochet               | monkeys      | planning
                    | Agarwal            | Rodionov               | Jews         | hesitated
Skip-gram (phrases) | Redmond Wash.      | Vaclav Havel           | spray paint  | capitulation
                    | Redmond Washington | president Vaclav Havel | grafitti     | capitulated
                    | Microsoft          | Velvet Revolution      | taggers      | capitulating

18 / 31

From Words to Phrases and Beyond

Often we want to represent more than just individual words: phrases, queries, sentences
The vector representation of a query can be obtained by:
Forming the phrases (a simple scoring sketch follows below)
Adding the vectors together
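One simple way to form the phrases, following the bigram scoring idea from the accompanying phrase paper, is sketched below; delta and threshold are illustrative values, not the settings used in the released tool:

```python
from collections import Counter

def form_phrases(tokens, delta=5, threshold=1e-4):
    """Join adjacent words into a phrase when their co-occurrence score
    (count(a, b) - delta) / (count(a) * count(b)) exceeds a threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                out.append(a + "_" + b)   # e.g. "mountain_view"
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out
```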

19 / 31

From Words to Phrases and Beyond

Example query:
restaurants in mountain view that are not very good
Forming the phrases:
restaurants in (mountain view) that are (not very good)
Adding the vectors:
restaurants + in + (mountain view) + that + are + (not very good)
Very simple and efficient (see the sketch below)
Will not work well for long sentences or documents
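A minimal sketch of the addition step, assuming a dictionary vec of word and phrase vectors and tokens that have already been joined into phrases (names are illustrative):

```python
import numpy as np

def query_vector(query_tokens, vec):
    """Represent a query as the sum of its word/phrase vectors; vec maps
    tokens such as 'restaurants' or 'mountain_view' to numpy arrays."""
    return np.sum([vec[t] for t in query_tokens if t in vec], axis=0)

# query_vector(["restaurants", "in", "mountain_view", "that", "are", "not_very_good"], vec)
```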

20 / 31

Compositionality by Vector Addition

Expression        | Nearest tokens
Czech + currency  | koruna, Czech crown, Polish zloty, CTK
Vietnam + capital | Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines | airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river   | Moscow, Volga River, upriver, Russia
French + actress  | Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg

21 / 31

Visualization of Regularities in Word Vector Space

We can visualize the word vectors by projecting them to 2D space
PCA can be used for dimensionality reduction (a minimal sketch follows below)
Although a lot of information is lost, the regular structure is often visible
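A minimal sketch of the 2D projection, assuming the word vectors are stacked into a numpy matrix; the function name and usage line are illustrative:

```python
import numpy as np

def pca_2d(vectors):
    """Project word vectors (one per row) onto their top two principal components."""
    X = vectors - vectors.mean(axis=0)
    # SVD of the centered matrix gives the principal directions in Vt
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                     # (n_words, 2) coordinates for plotting

# coords = pca_2d(np.stack([vec[w] for w in ["king", "queen", "man", "woman"]]))
```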

22 / 31

Visualization of Regularities in Word Vector Space


[Figure: 2D PCA projection of word vectors showing male/female pairs such as king-queen, prince-princess, hero-heroine, actor-actress, landlord-landlady, he-she, male-female, bull-cow, cock-hen]

23 / 31

Visualization of Regularities in Word Vector Space

[Figure: 2D PCA projection showing verb tense regularities for fall/fell/fallen, draw/drew/drawn, give/gave/given, take/took/taken]

24 / 31

Visualization of Regularities in Word Vector Space


[Figure: 2D PCA projection showing countries and their capital cities: China-Beijing, Russia-Moscow, Japan-Tokyo, Turkey-Ankara, Poland-Warsaw, Germany-Berlin, France-Paris, Italy-Rome, Greece-Athens, Spain-Madrid, Portugal-Lisbon]

25 / 31

Machine Translation

Word vectors should have similar structure when trained on comparable corpora
This should hold even for corpora in different languages

26 / 31

Machine Translation - English to Spanish

[Figure: 2D projections of English word vectors (horse, cow, dog, pig, cat) and the corresponding Spanish word vectors (caballo, vaca, perro, cerdo, gato); the two spaces show a similar geometric arrangement]

The figures were manually rotated and scaled

27 / 31

Machine Translation

For translation from one vector space to another, we need to learn a linear projection (which will perform rotation and scaling)
A small starting dictionary can be used to train the linear projection (see the sketch below)
Then, we can translate any word that was seen in the monolingual data
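A minimal sketch of learning and applying the projection, here fitted in closed form with least squares rather than the stochastic gradient descent used in the paper; the array names and vocabulary structures are assumptions:

```python
import numpy as np

def fit_translation_matrix(src_vecs, tgt_vecs):
    """Least-squares fit of W such that W @ x (source-language vector) approximates
    the target-language vector, trained on a small seed dictionary of word pairs.
    src_vecs, tgt_vecs: (n_pairs, d_src) and (n_pairs, d_tgt) arrays."""
    X, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return X.T                                 # shape (d_tgt, d_src)

def translate(word_vec, W, tgt_vocab, tgt_matrix):
    """Project a source word vector and return the nearest target-language word."""
    z = W @ word_vec
    sims = tgt_matrix @ z / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(z))
    return tgt_vocab[int(np.argmax(sims))]
```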

28 / 31

MT - Accuracy of English to Spanish translation


[Figure: Precision@1 and Precision@5 of English to Spanish word translation plotted against the number of training words (roughly 10^7 to 10^9); accuracy axis from 0 to 70%]

29 / 31

Machine Translation

When applied to English to Spanish word translation, the accuracy is above 90% for the most confident translations
Can work for any language pair (we tried English to Vietnamese)
More details in the paper: Exploiting Similarities among Languages for Machine Translation

30 / 31

Available Resources

The project webpage is code.google.com/p/word2vec
open-source code
pretrained word vectors (model for common words and phrases will be uploaded soon; a loading sketch follows below)
links to the papers
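For example, vectors saved in the word2vec binary format can be loaded with the third-party gensim library; the file name below is a placeholder, not a released artifact:

```python
# A minimal sketch for loading vectors in the word2vec binary format, assuming the
# third-party gensim library is installed; "vectors.bin" is a hypothetical path.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```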

31 / 31
