Google 小姐 ("Miss Google", Google's text-to-speech voice)
Yann LeCun's comment:
https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-unsupervised-learning
Yann LeCun's comment:
https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning
Deep Learning in One Slide (Review)
Many kinds of network structures:
➢ Fully connected feedforward network
➢ Convolutional neural network (CNN)
➢ Recurrent neural network (RNN)
Different networks can take different kinds of input/output.
[Figure: a network maps input 𝑥 to output 𝑦; the input/output can be a vector, a matrix, or a sequence (speech, video, sentence). Could the output be a drawing?]
何之源's Zhihu article: https://zhuanlan.zhihu.com/p/24767059
DCGAN: https://github.com/carpedm20/DCGAN-tensorflow
Powered by: http://mattya.github.io/chainer-DCGAN/
[Figure: feeding different input vectors to the Generator. Each dimension of the input vector represents some characteristics: changing one dimension gives longer hair, another gives blue hair, another an open mouth.]
Basic Idea of GAN
The discriminator is a neural network (NN), or a function: it takes an image as input and outputs a scalar. A larger value means the image is real, a smaller value means it is fake.
[Figure: the Discriminator outputs 1.0 for realistic anime faces and 0.1 for unrealistic ones.]
Basic Idea of GAN
[Figure: the Generator and Discriminator co-evolve, like prey evolving brown veins to mimic leaves while its predator learns to see through the disguise.]
This is where the term "adversarial" comes from.
Basic Idea of GAN
You can explain the process in different ways ……
[Figure: NN Generator v1 → v2 → v3, trained against real images; the Generator is like a student, the Discriminator like a teacher.]
Basic Idea of GAN
[Figure: Generator v1's outputs are rejected by Discriminator v1 (no eyes); Generator v2 fixes that but is rejected by Discriminator v2 (no mouth); Generator v3 improves further, and so on.]
Questions
Anime Face Generation
[Figure: generated faces after 100, 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000 updates.]
[Figure: images generated by G from interpolated input vectors, with interpolation weights 0.0, 0.1, …, 0.9.]
Thanks to 陳柏文 for providing the experimental results.
Outline
Generation by GAN
Improving GAN
Structured Learning
Machine learning is to find a function f: X → Y
Regression: output a scalar
Classification: output a "class" (one-hot vector), e.g. Class 1: [1 0 0], Class 2: [0 1 0], Class 3: [0 0 1]
Structured Learning/Prediction: output a sequence, a matrix, a graph, a tree ……
The output is composed of components with dependency.
[Figure: structured learning lies beyond regression and classification.]
Output Sequence: f: X → Y
Machine Translation
X: "機器學習及其深層與結構化" (sentence of language 1) → Y: "Machine learning and having it deep and structured" (sentence of language 2)
Speech Recognition
X: (speech) → Y: "感謝大家來上課" ("Thanks everyone for coming to class") (transcription)
Chat-bot
X: (what the user says) → Y: (the machine's response)
Ref: https://arxiv.org/pdf/1611.07004v1.pdf
Text to Image
Sentence Generation: 這個婆娘不是人，九天玄女下凡塵 ("This woman is not of the mortal world; she is a goddess descended from heaven")
Structured Learning Approach
➢ Generator: learns to generate the object at the component level (Bottom Up)
➢ Discriminator: evaluates the whole object, and finds the best one (Top Down)
Combining both: Generative Adversarial Network (GAN)
Lecture I
Generation by GAN
Improving GAN
Generation: we will control what to generate later. → Conditional Generation
Image Generation
[Figure: random vectors (each dimension in a specific range) are fed to an NN Generator, which outputs face images.]
Sentence Generation
[Figure: random vectors are fed to an NN Generator, which outputs sentences such as "How are you?", "Good morning.", "Good afternoon."]
So many questions ……
Generator
Can we train the generator with supervised learning? Pair each code vector with a target image and train the NN Generator so that its output is as close as possible to the target.
code: e.g. [0.1, −0.5], [0.1, 0.9], [0.2, −0.1], [0.3, 0.2] (where do they come from?)
[Figure: code → NN Generator → image, trained to be as close as possible to the target image.]
c.f. an NN Classifier, whose outputs (𝑦₁, 𝑦₂, ⋯) are trained to be as close as possible to the one-hot targets (1, 0, ⋯).
Where does the code for the NN Generator come from? The encoder in an auto-encoder provides the code ☺
[Figure: image → NN Encoder → code 𝑐.]
Auto-encoder
➢ NN Encoder: maps the input object (e.g. a 28 × 28 = 784-pixel image) to a code, a compact, low-dimensional representation of the input object.
➢ NN Decoder: can reconstruct the original object from the code.
Encoder and decoder are learned together; both are trainable.
[Figure: input → NN Encoder → code 𝑐 → NN Decoder → reconstruction.]
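To make the picture concrete, here is a minimal auto-encoder sketch in PyTorch (not from the slides; the layer sizes, code dimension, and random data are illustrative placeholders):

import torch
import torch.nn as nn

# Encoder: 28 x 28 = 784-dim image -> compact 32-dim code c
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
# Decoder: code c -> reconstructed 784-dim image
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())

# Encoder and decoder are learned together to minimize reconstruction error.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(64, 784)                    # a batch of images (placeholder data)
c = encoder(x)                             # code: compact representation of the input
x_hat = decoder(c)                         # reconstruction of the original object
loss = nn.functional.mse_loss(x_hat, x)    # "can reconstruct the original object"
opt.zero_grad(); loss.backward(); opt.step()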
Auto-encoder - Example
[Figure: image → NN Encoder → code 𝑐; c.f. PCA reducing the image to 32 dimensions.]
Unsupervised learning: no annotation is needed.
Auto-encoder
[Figure: input → NN Encoder → code → NN Decoder → output, trained to be as close as possible to the input.]
The NN Decoder = Generator: randomly generate a vector as code, feed it to the NN Decoder, and we get an image?
[Figure: random code → NN Decoder (= Generator) → image? (real examples shown for comparison).]
Auto-encoder
[Figure: with a 2D code, feeding vectors such as (1.5, 0) and (−1.5, 0) to the NN Decoder produces different images; decoding codes sampled over the grid [−1.5, 1.5] × [−1.5, 1.5] visualizes the learned code space (real examples shown for comparison).]
Auto-encoder
[Figure: image → NN Encoder → code; the code drives the NN Generator. But where do new codes for generation come from?]
Problem: the NN Generator produces image a from code a and image b from code b, but what does it produce from 0.5 × a + 0.5 × b? The code space between training codes may not decode to anything sensible.
Auto-encoder
[Figure: input → NN Encoder → code → NN Decoder (= Generator) → output.]
Variational Auto-encoder (VAE)
The NN Encoder outputs (m₁, m₂, m₃) and (σ₁, σ₂, σ₃); sample (e₁, e₂, e₃) from a normal distribution and compute the code

    cᵢ = exp(σᵢ) × eᵢ + mᵢ

The NN Decoder reconstructs the output from (c₁, c₂, c₃). Training minimizes the reconstruction error plus

    Σᵢ₌₁³ ( exp(σᵢ) − (1 + σᵢ) + (mᵢ)² )
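A minimal sketch of this sampling step and regularizer in PyTorch (shapes follow the 3-dimensional code on the slide; the encoder and decoder networks themselves are omitted):

import torch

m = torch.zeros(3, requires_grad=True)      # (m1, m2, m3) from the NN Encoder
sigma = torch.zeros(3, requires_grad=True)  # (σ1, σ2, σ3) from the NN Encoder
e = torch.randn(3)                          # (e1, e2, e3) from a normal distribution
c = torch.exp(sigma) * e + m                # c_i = exp(σ_i) × e_i + m_i
# the regularizer from the slide: Σ_i exp(σ_i) − (1 + σ_i) + (m_i)²
reg = (torch.exp(sigma) - (1 + sigma) + m ** 2).sum()
# total loss = reconstruction error of decoder(c) + reg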
Why VAE?
[Figure: adding noise when encoding fills in the code space, so a code between two encoded points still decodes to a reasonable image.]
VAE - Writing Poetry
[Figure: sentences decoded from codes interpolated between two sentences:]
"come with me," she said.
"talk to me," she said.
"don't worry about it," she said.
Ref: http://www.wired.co.uk/article/google-artificial-intelligence-poetry
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio, "Generating Sentences from a Continuous Space", arXiv preprint, 2015
What do we miss?
[Figure: Generated Image vs. Target; G is trained so they are as close as possible.]
It will be fine if the generator can truly copy the target image. But what if the generator makes some mistakes? Some mistakes are serious, while some are fine.
What do we miss?
[Figure: compared with the target digit, one generated image with a single stray pixel looks wrong ("I think this is no good"), while another with extra pixels in the right place looks fine ("actually it's OK"). Whether a mistake is serious depends on the surrounding pixels.]
Each neuron in the output layer corresponds to a pixel.
[Figure: in layers L−1 and L, each output neuron decides "empty" or "ink" independently ("who cares about you"); a neuron cannot tell its neighbor "let's produce the same color". The dependency between components is what a pixel-wise loss misses.]
(Variational) Auto-encoder
[Figure: G maps code (z₁, z₂) to output (x₁, x₂); a component-wise loss treats each output dimension independently, so the generated points need not follow the correlation seen in the data.]
So many questions ……
D: X → R
• Input x: an object x (e.g. an image)
• Output D(x): a scalar which represents how "good" an object x is
[Figure: D outputs 1.0 for a realistic image and 0.1 for an unrealistic one.]
[Figure: train D(x) to be high on real examples; in practice, you cannot decrease D(x) for all the x other than the real examples.]
Discriminator - Training
• Negative examples are critical.
[Figure: D is trained to output 1 (real) on real examples and 0 (fake) on negative examples. With weak negative examples, D still gives 0.9 ("pretty real") to a bad image.]
How to generate realistic negative examples?
Discriminator - Training
• General Algorithm
• Given a set of positive examples, randomly generate a set of negative examples.
• In each iteration:
• Learn a discriminator D that can discriminate positive and negative examples.
• Generate new negative examples with the discriminator: x̃ = arg max_{x∈X} D(x)
[Figure: positive v.s. negative examples fed to D; in the end, D(x) is high exactly on the real data.]
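A toy, runnable sketch of this loop (NumPy; the nearest-mean "discriminator" and the sampled candidate set are stand-ins for a real model and for the arg max step):

import numpy as np

def train_discriminator(pos, neg):
    # toy "discriminator": score rises near positives, falls near negatives
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    return lambda x: (np.linalg.norm(x - mu_neg, axis=-1)
                      - np.linalg.norm(x - mu_pos, axis=-1))

pos = np.random.randn(100, 2) + 3.0        # positive (real) examples
neg = np.random.randn(100, 2)              # randomly generated negative examples
candidates = np.random.randn(5000, 2) * 4  # a sampled stand-in for the space X

for it in range(10):
    D = train_discriminator(pos, neg)      # learn D to separate pos / neg
    scores = D(candidates)
    neg = candidates[np.argsort(scores)[-100:]]  # x̃ = arg max_x D(x) -> new negatives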
Structured Learning
➢ Structured Perceptron
➢ Structured SVM
➢ Gibbs sampling
➢ Hidden information
➢ Application: sequence labelling, summarization
Graphical Model: Bayesian Network (Directed Graph), Markov Random Field (Undirected Graph); e.g. Conditional Random Field, Markov Logic Network, Boltzmann Machine
Energy-based Model: http://www.cs.nyu.edu/~yann/research/ebm/
Generator
• Pros: easy to generate, even with a deep model
• Cons: it only imitates the appearance; hard to learn the correlation between components
Discriminator
• Pros: considering the big picture
• Cons: generation is not always feasible, especially when your model is deep; how to do negative sampling?
So many questions ……
x̃ = arg max_{x∈X} D(x)
Generating Negative Examples
Solve x̃ = arg max_{x∈X} D(x) with a generator G.
[Figure: code → NN Generator → image (which can be seen as a hidden layer of the combined network) → Discriminator → 0.13; fixing the discriminator, update the generator by gradient ascent so the score rises to 0.9.]
Algorithm
• Initialize generator G and discriminator D
• In each training iteration:
• Learning D: sample some real objects (target 1); generate some fake objects from sampled codes with G (target 0); update D to discriminate them.
• Learning G: fix D, and update G so that D(G(code)) moves towards 1.
[Figure: codes → G → images → D → scalar; when learning G, the discriminator is fixed and only the generator is updated.]
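A minimal runnable sketch of this training loop in PyTorch (network sizes, code dimension, learning rates, and the random "real" data are illustrative placeholders):

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.rand(32, 784)              # sample some real objects (placeholder)
    fake = G(torch.randn(32, 10)).detach()  # generate some fake objects; G fixed here
    # learning D: real -> 1, fake -> 0
    loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # learning G: fix D, push D(G(code)) towards 1
    loss_G = bce(D(G(torch.randn(32, 10))), torch.ones(32, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()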
Benefit of GAN
• From the discriminator's point of view: use the generator to generate negative samples, x̃ = G(z) ≈ arg max_{x∈X} D(x), which is efficient.
• From the generator's point of view: it still generates the object component-by-component, but it is learned from a discriminator with a global view.
Thanks to 段逸林 for providing the results.
[Figure: faces generated by VAE vs. GAN; the GAN samples are sharper. https://arxiv.org/abs/1512.09300]
Lecture I
Generation by GAN
Improving GAN
Discriminator
[Figure: a GAN discriminator's D(x) over real and generated examples.]
Binary Classifier as Discriminator
[Figure: the binary classifier maps real examples to "Real" and generated examples to "Fake".]
A typical binary classifier uses a sigmoid function at the output layer: 1 is the largest output, 0 is the smallest.
Problem: when a strong classifier perfectly separates the real from the weak generated examples, the generated examples sit on the flat region of the sigmoid, so they don't move.
Binary Classifier as Discriminator
• Don't let the discriminator perfectly separate real and generated data
• Add noise to the input or to the labels?
[Figure: with noise, the real and generated distributions overlap, so the discriminator's output is smoother between them.]
Least Square GAN (LSGAN)
• Replace the sigmoid with a linear output (replace classification with regression)
[Figure: the classifier regresses real examples to 1 (Real) and generated examples to 0 (Fake); with a linear output there is no flat region, so the generated examples no longer stop moving.]
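As a sketch, only the discriminator's output layer and loss change compared with the usual binary classifier (PyTorch; dimensions and the placeholder batches are illustrative):

import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))  # linear output, no sigmoid
mse = nn.MSELoss()
real = torch.rand(32, 784)                 # placeholder real batch
fake = torch.rand(32, 784)                 # placeholder generated batch
# regression instead of classification: real scores -> 1, generated scores -> 0
loss_D = mse(D(real), torch.ones(32, 1)) + mse(D(fake), torch.zeros(32, 1))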
WGAN
• We want the scores of the real examples to be as large as possible, and those of the generated examples to be as small as possible.
[Figure: the classifier outputs raw scalar scores for real and generated examples.]
Any problem? Without a constraint, training never converges: the discriminator can keep pushing the generated scores towards −∞.
WGAN
[Figure: the generated examples move towards where D is maximal.]
The discriminator should be a 1-Lipschitz function: it should be smooth. How to realize this?
Lipschitz Function
‖D(x₁) − D(x₂)‖ ≤ K‖x₁ − x₂‖ (output change ≤ K × input change); K = 1 gives a 1-Lipschitz function.
[Figure: a smooth D(x) keeps its values bounded near the data, so the generated examples are easy to move towards the real ones.]
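In the original WGAN, the 1-Lipschitz constraint is approximated by weight clipping; a sketch (the clip value 0.01 and RMSprop follow the WGAN paper's defaults; everything else is an illustrative placeholder):

import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))  # raw score, no sigmoid
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)
real, fake = torch.rand(32, 784), torch.rand(32, 784)  # placeholder batches

# maximize  mean D(real) - mean D(fake): minimize the negative
loss_D = -(D(real).mean() - D(fake).mean())
opt_D.zero_grad(); loss_D.backward(); opt_D.step()
for p in D.parameters():                   # clipping keeps D roughly K-Lipschitz
    p.data.clamp_(-0.01, 0.01)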
[Figure: generated samples from DCGAN, LSGAN, original WGAN, and improved WGAN; G: CNN, D: CNN.]
1-of-N Encoding (each column is a time step, each row a character; more rows elided)
  …
  d        0 0 0 1 1
  …
  g        1 0 0 0 0
  …
  o        0 1 1 0 0
  …
  <space>  0 0 0 0 1
Generated classical Chinese poems:
• 升雲白遲丹齋取，此酒新巷市入頭。黃道故海歸中後，不驚入得韻子門。
• 據口容章蕃翎翎，邦貸無遊隔將毬。外蕭曾臺遶出畧，此計推上呂天夢。
• 新來寳伎泉，手雪泓臺蓑。曾子花路魏，不謀散薦船。
• 功持牧度機邈爭，不躚官嬉牧涼散。不迎白旅今掩冬，盡蘸金祇可停。
• 玉十洪沄爭春風，溪子風佛挺橫鞋。盤盤稅焰先花齋，誰過飄鶴一丞幢。
• 海人依野庇，為阻例沉迴。座花不佐樹，弟闌十名儂。
• 入維當興日世瀕，不評皺。頭醉空其杯，駸園凋送頭。
• 鉢笙動春枝，寶叅潔長知。官爲宻爛去，絆粒薛一靜。
• 吾涼腕不楚，縱先待旅知。楚人縱酒待，一蔓飄聖猜。
• 折幕故癘應韻子，徑頭霜瓊老徑徑。尚錯春鏘熊悽梅，去吹依能九將香。
• 通可矯目鷃須浄，丹迤挈花一抵嫖。外子當目中前醒，迎日幽筆鈎弧前。
• 庭愛四樹人庭好，無衣服仍繡秋州。更怯風流欲鴂雲，帛陽舊據畆婷儻。
Loss-sensitive GAN (LSGAN)
[Figure: in WGAN, D(x) is simply pushed up on the real x and down on the generated x′, x′′; in loss-sensitive GAN, D(x) must exceed D(x′) by a margin Δ(x, x′) that depends on how different x′ is from the real x.]
Energy-based GAN (EBGAN)
• Using an autoencoder as discriminator D
• An image is good = it can be reconstructed by the autoencoder.
• The generator is the same as before.
[Figure: Discriminator = encoder (EN) + decoder (DE); the score is the negative reconstruction error, e.g. −0.1 for a reconstructable image.]
The scores of generated images do not have to be very negative: reconstruction is hard to achieve but easy to destroy.
Mode Collapse
[Figure: the generator's distribution collapses onto a small region of the data distribution; the same face appears again and again.]
Missing Mode?
Mode collapse is easy to detect, but a missing mode is not: the generator simply never produces certain modes that exist in the data.
• E.g. BEGAN on CelebA
Experimental results provided by 陳柏文; image source: 柯達方
Ensemble
[Figure: train multiple generators (Generator 1, Generator 2, …) and sample from all of them, so the ensemble covers more modes.]
Lecture II:
Variants of GAN
Lecture II
Conditional Generation
Sequence Generation
Conditional Generation
[Figure: text conditions such as "Girl with red hair and red eyes" or "Girl with yellow ribbon" are fed to an NN Generator, which outputs matching images.]
Conditional Generation
• We don't want to simply generate some random stuff.
• Generate objects based on conditions:
Caption Generation: given a condition (an image), generate its caption.
Chat-bot: given the condition "Hello", generate "Hello. Nice to see you."
Conditional Generation
Modifying input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain Independent Features
• Improving Auto-encoder (VAE-GAN, BiGAN)
Modifying Input Code
[Figure (repeated from above): each dimension of the generator's input vector represents some characteristics, e.g. longer hair, blue hair, open mouth.]
InfoGAN
[Figure: the input code z is split into c and z′; x = Generator(c, z′), and a Discriminator outputs a scalar. A Classifier predicts the code c that generated x; classifier and discriminator share parameters (only the last layer is different). Generator and classifier form an "auto-encoder" in code space: the generator is the "decoder", the classifier the "encoder".]
What is InfoGAN? c must have a clear influence on x, so that the classifier can recover c from x.
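A sketch of this wiring in PyTorch (a continuous 5-dim c recovered with an MSE loss; z below plays the role of z′ on the slide, all sizes are illustrative, and a full InfoGAN adds the usual GAN losses on top):

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(5 + 10, 64), nn.ReLU(), nn.Linear(64, 784))  # input (c, z)
body = nn.Sequential(nn.Linear(784, 64), nn.ReLU())  # shared layers
D_head = nn.Linear(64, 1)                            # discriminator head: real/fake scalar
Q_head = nn.Linear(64, 5)                            # classifier head: predict the code c

c, z = torch.randn(32, 5), torch.randn(32, 10)
x = G(torch.cat([c, z], dim=1))
h = body(x)                                # only the last layer differs between D and Q
d_score = D_head(h)                        # used in the usual GAN losses
info_loss = nn.functional.mse_loss(Q_head(h), c)  # c must be recoverable from x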
https://arxiv.org/abs/1606.03657
Conditional Generation
Modifying input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain Independent Features
• Improving Auto-encoder (VAE-GAN, BiGAN)
Connecting Code and Attribute
CelebA
GAN + Autoencoder
• We have a generator (input z, output x)
• However, given x, how can we find z? Learn an encoder (input x, output z).
[Figure: x → Encoder → z → Generator (Decoder) → x̂, trained so x̂ is as close as possible to x; the generator is fixed, and only the encoder is trained. The encoder plays a role similar to the discriminator's, and the whole pipeline forms an autoencoder.]
Attribute Representation
z_long = (1/N₁) Σ_{x∈long} En(x) − (1/N₂) Σ_{x′∉long} En(x′)
[Figure: given a short-hair photo x, compute En(x) + z_long = z′; then Gen(z′) gives the long-hair version.]
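In code, the attribute vector is just a difference of average codes; a sketch (En and Gen stand for the trained encoder/generator, here replaced by fixed random linear maps; all names and data are illustrative):

import torch

W_en, W_gen = torch.randn(784, 32), torch.randn(32, 784)
En = lambda x: x @ W_en                    # stand-in for the trained encoder
Gen = lambda z: z @ W_gen                  # stand-in for the trained generator

x_long = torch.rand(100, 784)              # images with the attribute (long hair)
x_rest = torch.rand(200, 784)              # images without it
z_long = En(x_long).mean(dim=0) - En(x_rest).mean(dim=0)  # attribute vector

x = torch.rand(1, 784)                     # a short-hair photo
x_edited = Gen(En(x) + z_long)             # En(x) + z_long = z', then Gen(z')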
Photo
Editing
https://www.youtube.com/watch?v=kPEIJJsQr7U
Conditional Generation
Modifying input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain Independent Features
• Improving Auto-encoder (VAE-GAN, BiGAN)
Conditional GAN
• Generating images based on text description
[Figure: sentence ("a dog is sleeping") → NN Encoder → code; code → NN Generator → image?]
In the training data, one kind of description pairs with many images (c₁: "a dog is running" → x̂¹; c₂: "a bird is flying" → x̂²). A traditional supervised NN averages over them, so the target of the NN output is a blurry image!
Conditional GAN
[Figure: condition c: "train"; z sampled from a prior distribution; the generator outputs image x = G(c, z). It defines a distribution, which should approximate the distribution of real images matching the text "train".]
Discriminator: type 1 takes only x and outputs a scalar; type 2 takes the pair (c, x) and outputs a scalar, so it can also punish realistic images that do not match the condition.
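A sketch of the type-2 discriminator, which scores the (condition, image) pair rather than the image alone (PyTorch; dimensions are illustrative):

import torch
import torch.nn as nn

class CondD(nn.Module):
    def __init__(self, c_dim=128, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_dim + x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, c, x):
        # a pair scores high only if x is realistic AND matches the condition c
        return self.net(torch.cat([c, x], dim=1))

D = CondD()
c = torch.randn(32, 128)                   # text code from the NN Encoder
x = torch.rand(32, 784)                    # real or generated image
score = D(c, x)                            # one scalar per (c, x) pair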
• Draw anime character portraits from text descriptions
[Figure: faces generated from tags such as "red hair, long hair", "blue eyes", "red hair", "short hair"; 96 × 96 images.]
(Image source: http://konachan.net/post/show/239400/aikatsu-clouds-flowers-hikami_sumire-hiten_goane_r)
Released Training Data
• Data download link: https://drive.google.com/open?id=0BwJmB7alR-AvMHEtczZZN0EtdzQ
• Anime Dataset:
• training data: 33.4k (image, tags) pairs
• Training tags file format (tags.csv):
  img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> …
Image-to-image
[Figure: condition c (an input image) and noise z → G → x = G(c, z).] (https://arxiv.org/pdf/1611.07004)
Image-to-image
• Traditional supervised approach: train an NN to output an image as close as possible to the target.
[Figure: testing: given a new input, the supervised output is close to the training images on average, hence blurry.]
Image-to-image
• Experimental results
[Figure: z and the input image go into G; D scores the output image with a scalar, while training also minimizes the distance between G's output and the target. Testing: GAN plus "minimize distance" gives clearer images than the supervised baseline.]
[Figure: the generator uses a CNN; G's output is compared with the training data.]
Speech Enhancement
[Figure: noisy spectrogram → clean spectrogram.]
• Conditional GAN
[Figure: noisy input → G → output; D takes the (noisy, output) pair and outputs a scalar judging whether it is a real (noisy, clean) pair or a fake pair.]
Speech Enhancement
[Figure: spectrograms of noisy speech vs. enhanced speech.]
Thanks to 廖峴峰 for providing the experimental results (co-advised with Dr. 曹昱 of Academia Sinica).
• Voice Conversion
• Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang, "Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks", Interspeech 2017
• Speech Enhancement
• Santiago Pascual, Antonio Bonafonte, Joan Serrà, "SEGAN: Speech Enhancement Generative Adversarial Network", Interspeech 2017
Conditional Generation
Modifying input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain Independent Features
• Improving Auto-encoder (VAE-GAN, BiGAN)
Paired data is not always available; Cycle GAN works with unpaired data.
Cycle GAN (https://arxiv.org/abs/1703.10593, https://junyanz.github.io/CycleGAN/)
[Figure: G_{X→Y} transforms images from Domain X to become similar to Domain Y; D_Y outputs a scalar judging whether the input image belongs to domain Y or not. Problem: the generator could satisfy D_Y while ignoring its input.]
Cycle GAN
[Figure: x → G_{X→Y} → y′ → G_{Y→X} → x̂, trained so x̂ is as close as possible to x (cycle consistency). Without this constraint, the transformed image may lack the information needed for reconstruction. D_Y still judges whether y′ belongs to domain Y.]
The same is done in the reverse direction: y → G_{Y→X} → G_{X→Y} → ŷ, as close as possible. c.f. Dual Learning
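The cycle-consistency term in code; a sketch (both generators are placeholder linear maps; a real Cycle GAN adds the adversarial losses from D_X and D_Y):

import torch
import torch.nn as nn

G_XY = nn.Linear(784, 784)                 # placeholder for the domain X -> Y generator
G_YX = nn.Linear(784, 784)                 # placeholder for the domain Y -> X generator

x = torch.rand(32, 784)                    # batch from domain X
y_fake = G_XY(x)                           # should fool D_Y
x_rec = G_YX(y_fake)                       # try to reconstruct the input
cycle_loss = nn.functional.l1_loss(x_rec, x)  # "as close as possible"
# total G loss = adversarial loss from D_Y + λ * cycle_loss (plus the Y -> X -> Y cycle)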
An anime-fied world
[Figure: input photo → output image in the anime domain.]
• Using the code: https://github.com/Hi-king/kawaii_creator
• It is not Cycle GAN; it is Disco GAN.
Conditional Generation
Modifying input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain Independent Features
• Improving Auto-encoder (VAE-GAN, BiGAN)
https://www.youtube.com/watch?v=9c4z6YsBGQ0
Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, Alexei A. Efros, "Generative Visual Manipulation on the Natural Image Manifold", ECCV, 2016
Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston, "Neural Photo Editing with Introspective Adversarial Networks", arXiv preprint, 2017
Basic Idea
[Figure: search the space of code z for a code whose decoded image fulfills the user's editing constraint.]
[Figure: x → Encoder → z → Generator (Decoder) → x.]
• Method 3: z* = arg min_z U(G(z)) + λ₁‖z − z₀‖² − λ₂ D(G(z))
[Figure: training data and testing data should follow the same distribution in the generator feature space.]
Domain Independent Features
[Figure: a feature extractor maps images from different domains into a common feature space; a domain classifier tries to predict the domain from the feature. If the only goal is to fool the domain classifier, it is too easy for the feature extractor (e.g. it can output zeros for everything).]
Domain-adversarial training
[Figure: the feature extractor feeds both a label predictor and a domain classifier. The feature extractor maximizes label classification accuracy while minimizing domain classification accuracy; the label predictor maximizes label classification accuracy; the domain classifier maximizes domain classification accuracy.]
The extractor must not only cheat the domain classifier, but satisfy the label classifier at the same time.
This is a big network, but different parts have different goals.
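One standard way to give the parts different goals within a single backward pass is a gradient reversal layer between the feature extractor and the domain classifier (the trick from Ganin et al.'s domain-adversarial training); a minimal sketch:

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reversed: the extractor learns to FOOL the domain classifier

feat = torch.randn(32, 64, requires_grad=True)  # output of the feature extractor
W_dom = torch.randn(64, 2, requires_grad=True)  # toy domain-classifier weights
domain_loss = (GradReverse.apply(feat) @ W_dom).sum()  # toy stand-in for the domain loss
domain_loss.backward()
# W_dom.grad minimizes the domain loss; feat.grad (negated) pushes the extractor to maximize it.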
Domain-adversarial training
[Figure: results when training on one domain and testing on another.]
Improving Auto-encoder (VAE-GAN)
[Figure: x → Encoder → z → Generator (Decoder) → x̂, as close as possible to x, while a Discriminator scores the output with a scalar (GAN).]
BiGAN: Jeff Donahue, Philipp Krähenbühl, Trevor Darrell, "Adversarial Feature Learning", ICLR, 2017
[Figure: the discriminator sees (x, code z) pairs and judges whether they come from the encoder or from the decoder (whose code z is drawn from the prior distribution).]
Conditional Generation
Sequence Generation
[Figure: an RNN generator: starting from <BOS>, each step outputs a distribution over tokens (A, B, …), and the chosen token is fed in as the next input.]
Sentence Generation
• Generation (testing)
[Figure: the same RNN unrolled at test time, sampling a token (e.g. B, A, A) at each step and feeding it back.]
Sentence Generation
[Figure: Generator → sentence x → Discriminator (original GAN setup).]
Sentence Generation
• Initialize generator G and discriminator D
• In each iteration: …
Problem: the output tokens are obtained by sampling, so tuning the generator a little bit will not change the output, and no gradient flows back from D. In the paper of improved WGAN, the discriminator is fed the output distributions directly (ignoring the sampling process).
Sentence Generation - SeqGAN
Using Policy Gradient
[Figure: Generator → sentence (e.g. "I'm fine.") → Discriminator → scalar; the scalar is used as the reward to update the generator, and it carries information about the whole sentence.]
Chatbot
[Figure: input sentence/history h → En(coder)–De(coder) → response sentence x; the Discriminator takes h together with the response x and judges real or fake.]
Conditional Generation
Sequence Generation
[Figure: in the space of all images, real images concentrate in a small high-probability region; most of the image space has low probability.]
Theory behind GAN
• A generator G is a network. The network defines a probability distribution.
[Figure: z ~ normal distribution → generator G → x = G(z); the induced distribution P_G(x) should be as close as possible to the data distribution P_data(x).]
It is difficult to compute P_G(x): we do not know what the distribution looks like.
https://blog.openai.com/generative-models/
When the discriminator is well trained, its loss represents a specific divergence measure between P_G and P_data: training the discriminator evaluates the divergence, and updating the generator minimizes it.
[Figure: z ~ normal distribution → generator G → x = G(z); P_G(x) as close as possible to P_data(x).]
A binary classifier as discriminator evaluates the JS divergence; you can design the discriminator to evaluate other divergences.
http://www.guokr.com/post/773890/
[Figure: P_{G_0}(x), P_{G_50}(x), …, P_{G_100}(x) vs. P_data(x). While P_G and P_data do not overlap, JS(P_G‖P_data) = log 2 no matter how close they are, so P_{G_50} is not really "better" than P_{G_0}; only when they coincide does JS(P_{G_100}‖P_data) = 0.]
Evaluating JS divergence (https://arxiv.org/abs/1701.07875)
Earth Mover's Distance: W(P, Q) = d, the amount of "earth" times the distance d it must be moved to turn distribution P into Q.
Why Earth Mover's Distance? The f-divergence D_f(P_data‖P_G) stays constant while the supports are disjoint, but W(P_data, P_G) shrinks (d₀ → d₅₀) as the distributions approach each other, so it provides a useful training signal.
To Learn more ……
• GAN Zoo
• https://github.com/hindupuravinash/the-gan-zoo
• Tricks: https://github.com/soumith/ganhacks
• Tutorial of Ian Goodfellow:
https://arxiv.org/abs/1701.00160
Lecture III:
Decision Making
and Control
Decision Making and Control
• The machine observes some inputs, takes an action, and finally achieves the target.
• E.g. playing Go
[Figure: observation (a red light) → Object Recognition → Decision Making → Action: Stop! Could this all be one giant network?]
Decision Making and Control
• E.g. dialogue system
[Figure: input → a giant network? → Decision Making (decide to ask for the departure location) → Natural Language Generation → "Where would you like to depart from?"]
How to solve this problem?
• Treat the network as a function and learn it as a typical supervised task
[Figure: training targets: red light → Target: Stop!; opening board → Target: 3-3; "我想訂11月5日到台北的機票" ("I want to book a flight to Taipei on November 5") → Target: "請問您要從哪裡出發呢?" ("Where would you like to depart from?").]
Behavior Cloning
The machine does not know which behaviors must be copied and which can be ignored.
https://www.youtube.com/watch?v=j2FSB3bseek
Properties of Decision Making and Control
What do we miss with behavior cloning? The machine does not know the influence of each action.
• The agent's actions affect the subsequent data it receives
• Reward delay
• In Space Invaders, only "fire" obtains a reward, although the moving before "fire" is important
• In Go, sacrificing immediate reward can gain more long-term reward
Better Way ……
[Figure: a sequence of decisions; for Go, the network outputs the next move (19 × 19 positions) as a classification problem.]
Two Learning Scenarios
• Scenario 1: Reinforcement Learning
• Machine interacts with the environment.
• Machine obtains the reward from the environment, so it knows whether its performance is good or bad.
• Scenario 2: Learning by demonstration
• Also known as imitation learning or apprenticeship learning
• An expert demonstrates how to solve the task, and the machine learns from the demonstration.
Lecture III
Reinforcement Learning
Inverse Reinforcement Learning
[Figure: RL approaches: model-free (policy-based, value-based) vs. model-based. Alpha Go: policy-based + value-based + model-based.]
Basic Components
Actor, Environment (Env), and Reward Function. You cannot control the environment or the reward function.
E.g. video game: killing a monster gives 20 points …; Go: the rules of Go define the reward.
Scenario
Actor/Policy: a function from observation (the function input) to action (the function output): Action = π(Observation). The environment provides the observation.
Learning to play Go
[Figure: Observation (the board) → Actor → Action (next move) → Environment, which returns the next observation and a reward.]
The agent learns to take actions maximizing expected reward.
Reward: reward = 0 in most cases; if win, reward = 1; if lose, reward = −1.
Actor, Environment, Reward
[Figure: the actor sees s₁, takes a₁, receives reward r₁, sees s₂, takes a₂, receives r₂, ….]
Trajectory: τ = {s₁, a₁, s₂, a₂, ⋯, s_T, a_T}
Total reward: R(τ) = Σ_{t=1}^{T} r_t
Example: Playing Video Game
Start with observation s₁, then observation s₂, observation s₃, …; after many turns the game is over (the spaceship is destroyed) and the agent obtains the final reward r_T and takes the final action a_T. This whole play is an episode.
Total reward: R = Σ_{t=1}^{T} r_t. We want the total reward to be maximized.
Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix
• Output of the neural network: each action corresponds to a neuron in the output layer
[Figure: NN as actor: the screen is the input; the outputs are action probabilities, e.g. left 0.7; take the action based on the probability.]
[Figure: s₁ → a₁ → s₂ → a₂ → s₃ → ⋯ "If it is not differentiable, train it with policy gradient."]
• τ = {s₁, a₁, r₁, s₂, a₂, r₂, ⋯, s_T, a_T, r_T}
P(τ|θ) = p(s₁) p(a₁|s₁, θ) p(r₁, s₂|s₁, a₁) p(a₂|s₂, θ) p(r₂, s₃|s₂, a₂) ⋯
       = p(s₁) ∏_{t=1}^{T} p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)
p(s₁) and p(r_t, s_{t+1}|s_t, a_t) are not related to your actor; p(a_t|s_t, θ) is controlled by your actor π_θ.
E.g. given s_t, the actor π_θ outputs left 0.1, right 0.2, fire 0.7, so p(a_t = "fire"|s_t, θ) = 0.7.
Goodness of Actor
• An episode is considered as a trajectory τ
• τ = {s₁, a₁, r₁, s₂, a₂, r₂, ⋯, s_T, a_T, r_T}
• R(τ) = Σ_{t=1}^{T} r_t
• If you use an actor to play the game, each τ has a probability of being sampled
• The probability depends on the actor parameter θ: P(τ|θ)
R̄_θ = Σ_τ R(τ) P(τ|θ) ≈ (1/N) Σ_{n=1}^{N} R(τⁿ)
The sum over all possible trajectories is approximated by sampling from P(τ|θ): use π_θ to play the game N times, obtaining τ¹, τ², ⋯, τᴺ.
Three Steps for Deep Learning
• Gradient ascent on θ = {w₁, w₂, ⋯, b₁, ⋯}
• Start with θ⁰
• θ¹ ← θ⁰ + η∇R̄_{θ⁰}
• θ² ← θ¹ + η∇R̄_{θ¹}
• ……
∇R̄_θ = [∂R̄_θ/∂w₁, ∂R̄_θ/∂w₂, ⋯, ∂R̄_θ/∂b₁, ⋯]ᵀ
Policy Gradient
R̄_θ = Σ_τ R(τ) P(τ|θ), so what is ∇R̄_θ?
∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇P(τ|θ)/P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ)
With τ = {s₁, a₁, r₁, s₂, a₂, r₂, ⋯, s_T, a_T, r_T}:
∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} R(τⁿ) ∇log P(τⁿ|θ) = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τⁿ) ∇log p(a_tⁿ|s_tⁿ, θ)
What if we replace R(τⁿ) with r_tⁿ ……
Data Collection
[Figure: play N games with π_θ and record each trajectory, e.g. τ²: (s₁², a₁²), (s₂², a₂²), ⋯, together with its total reward R(τ²); this data is used to compute the gradient above.]
Policy Gradient
Considered as a Classification Problem: minimize −Σ_{i=1}^{3} ŷᵢ log yᵢ
[Figure: s → NN → (left, right, fire) with target (1, 0, 0), i.e. maximize log yᵢ = log P("left"|s).]
θ ← θ + η∇log P("left"|s) corresponds to θ ← θ + η∇R̄_θ.
Policy Gradient
∇R̄_θ = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τⁿ) ∇log p(a_tⁿ|s_tⁿ, θ)
Given the actor parameter θ, each sampled pair becomes a classification example: e.g. (s₁¹, a₁¹ = left) has target (left 1, right 0, fire 0); (s₁², a₁² = fire) has target (left 0, right 0, fire 1).
θ ← θ + η∇R̄_θ
Policy Gradient
Each training example is weighted by R(τⁿ): e.g. if R(τ¹) = 2, the examples (s₁¹, a₁¹), (s₂¹, a₂¹), ⋯ from τ¹ are each counted twice in the objective.
[Figure: the weighted examples are fed to the NN for the update.]
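A minimal sketch of this weighted-classification view of policy gradient (PyTorch; a random toy "environment" stands in for the game, and the rewards are placeholders):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 3))  # left/right/fire
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(100):
    log_probs, rewards = [], []
    s = torch.randn(4)                     # toy initial observation
    for t in range(10):
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()                  # take the action based on the probability
        log_probs.append(dist.log_prob(a))
        rewards.append(torch.randn(()))    # toy r_t; a real environment goes here
        s = torch.randn(4)                 # toy next observation
    R = torch.stack(rewards).sum()         # R(τ) = Σ_t r_t
    # cross-entropy of the taken actions, weighted by the total reward R(τ)
    loss = -(R * torch.stack(log_probs)).sum()
    opt.zero_grad(); loss.backward(); opt.step()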
End of Warning
Critic
• A critic does not determine the action.
• Given an actor π, it evaluates how good the actor is.
An actor can be found from a critic, e.g. Q-learning.
Critic
• State value function V^π(s)
• When using actor π, the cumulated reward expected to be obtained after seeing observation (state) s
[Figure: s → V^π → scalar V^π(s); for one game state V^π(s) is large, for another it is smaller.]
Critic
The value depends on the actor: V^{以前的阿光 (Hikaru before)}(大馬步飛, a large knight's move) = bad, but V^{變強的阿光 (Hikaru after becoming stronger)}(the same move) = good.
How to estimate V^π(s)
• Monte-Carlo based approach
• The critic watches π playing the game
After seeing s_a, the cumulated reward until the end of the episode is G_a: train V^π(s_a) → G_a.
After seeing s_b, the cumulated reward until the end of the episode is G_b: train V^π(s_b) → G_b.
How to estimate V^π(s)
• Temporal-difference approach
Given ⋯ s_t, a_t, r_t, s_{t+1} ⋯ we have V^π(s_t) = r_t + V^π(s_{t+1}), so train V^π(s_t) − V^π(s_{t+1}) → r_t.
Comparison:
• Monte-Carlo: V^π(s_a) → G_a — larger variance, unbiased
• Temporal-difference: V^π(s_t) → r_t + V^π(s_{t+1}) — smaller variance, but V^π(s_{t+1}) may be biased
MC v.s. TD [Sutton, v2, Example 6.4]
(The actions are ignored here.) Suppose the episodes give V^π(s_b) = 3/4, and the one episode through s_a yields r = 0 followed by s_b: MC estimates V^π(s_a) = 0, while TD estimates V^π(s_a) = r + V^π(s_b) = 3/4.
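A sketch of the tabular TD update corresponding to V^π(s_t) = r_t + V^π(s_{t+1}) (toy states and rewards; with r₂ random in {0, 1}, V("b") converges to about 1/2 and, via bootstrapping, V("a") to r₁ + V("b")):

import random

V = {"a": 0.0, "b": 0.0, "end": 0.0}
alpha = 0.1                                # learning rate

def td_update(s, r, s_next):
    # push V(s) towards r + V(s_next), i.e. V(s_t) = r_t + V(s_{t+1})
    V[s] += alpha * (r + V[s_next] - V[s])

# toy episodes: s_a -> s_b -> end (actions ignored, as in the example)
for _ in range(1000):
    r1, r2 = 0.0, random.choice([0.0, 1.0])
    td_update("a", r1, "b")
    td_update("b", r2, "end")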
Actor-Critic
Actor = Generator? Critic = Discriminator? (https://arxiv.org/abs/1610.01945)
[Figure: π interacts with the environment; the critic is updated by TD or MC; a new actor π′ is found from the critic, and π ← π′.]
Demo by 劉廷緯 and 溫明浩: https://www.youtube.com/watch?v=8iRD1w73fDo&feature=youtu.be
Lecture III
Reinforcement Learning
Inverse Reinforcement
Learning
Imitation Learning
[Figure: the actor interacts with the environment (s₁, a₁, s₂, a₂, ⋯), but the reward function is not available.]
Hand-crafting a reward can backfire, e.g. a machine reasoning: "Therefore, in order to protect humanity as a whole, controlling human freedom is necessary."
Inverse Reinforcement Learning
[Figure: the expert provides demonstrations τ̂₁, τ̂₂, ⋯, τ̂_N in the environment; the actor produces trajectories τ₁, τ₂, ⋯, τ_N. Learn a reward function R such that Σ_{n=1}^{N} R(τ̂ⁿ) > Σ_{n=1}^{N} R(τⁿ); then find an actor π based on the reward function R by reinforcement learning, and iterate.]
Actor = Generator; Reward function = Discriminator.
This mirrors GAN: high score for real (expert) trajectories, low score for generated ones; the reward function scores τ₁, τ₂, ⋯, τ_N.
Ref: https://pdfs.semanticscholar.org/27c3/6fd8e6aeadd50ec1e180399df70c46dede00.pdf
Path Planning: http://martin.zinkevich.org/publications/maximummarginplanning.pdf
Third Person Imitation Learning
• Ref: Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever, "Third-Person Imitation Learning", arXiv preprint, 2017
[Image sources: http://lasa.epfl.ch/research_new/ML/index.php, https://kknews.cc/sports/q5kbb8.html]
Third Person Imitation Learning
Concluding Remarks