
Introduction of Generative

Adversarial Network (GAN)


李宏毅
Hung-yi Lee
Generative Adversarial Network
(GAN)
• How to pronounce “GAN”?

Google 小姐 (the Google Translate text-to-speech voice)
Yann LeCun’s comment

https://www.quora.com/What-are-some-recent-and-
potentially-upcoming-breakthroughs-in-unsupervised-learning
Yann LeCun’s comment

……

https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-
in-deep-learning
Deep Learning in One Slide (Review)
Many kinds of network structures:
➢ Fully connected feedforward network
➢ Convolutional neural network (CNN)
➢ Recurrent neural network (RNN)
Different networks can take different kinds of input/output: a vector, a matrix, or a sequence (speech, video, sentence) goes into the network as x, and y comes out.
How to find the function? Given example inputs/outputs as training data: {(x1,y1),(x2,y2), ……, (x1000,y1000)}
Creation
Anime Face
Generation

Drawing?

何之源's Zhihu article: https://zhuanlan.zhihu.com/p/24767059
DCGAN: https://github.com/carpedm20/DCGAN-tensorflow
Powered by: http://mattya.github.io/chainer-DCGAN/

Basic Idea of GAN
The generator is a neural network (NN), or a function: it takes a vector as input and outputs an image (a matrix).
Each dimension of the input vector represents some characteristic of the output: changing one dimension gives, for example, longer hair, blue hair, or an open mouth.
Basic Idea of GAN
The discriminator is also a neural network (NN), or a function: it takes an image as input and outputs a scalar. A larger value means the image looks real; a smaller value means it looks fake (e.g. a realistic face gets 1.0, a poor one gets 0.1).
Basic Idea of GAN
Analogy: the generator is like a species of butterfly evolving, and the discriminator is like its predator.
The predator first learns “butterflies are not brown”, then “butterflies do not have veins”, ……; the butterflies keep evolving to fool it.
This is where the term “adversarial” comes from.
Basic Idea of GAN
You can explain the process in different ways ……
NN Generator v1 → v2 → v3 evolves against Discriminator v1 → v2 → v3, which is trained with real images.
Another view: the generator is the student and the discriminator is the teacher.
Basic Idea of GAN
Generator v1 → Discriminator v1 points out “no eyes” → Generator v2 → Discriminator v2 points out “no mouth” → Generator v3 → ……
Questions

Q1: Why can't the generator learn by itself?

Q2: Why doesn't the discriminator generate objects itself?

Q3: How do the discriminator and generator interact?
Anime Face Generation

100 updates
Anime Face Generation

1000 updates
Anime Face Generation

2000 updates
Anime Face Generation

5000 updates
Anime Face Generation

10,000 updates
Anime Face Generation

20,000 updates
Anime Face Generation

50,000 updates
(Figure: images produced by G while interpolating the input vector with mixing weights 0.0, 0.1, ……, 0.9.)
Thanks to 陳柏文 for providing the experimental results.
Outline

Lecture 1: Introduction of GAN

Lecture 2: Variants of GAN

Lecture 3: Making Decision and Control


Lecture I:
Introduction of GAN
To learn more theory:
https://www.youtube.com/watch?v=0CKeqXl5IY0&lc=z13zuxbgl
pvsgbgpo04cg1bxuoraejdpapo0k
https://www.youtube.com/watch?v=KSN4QYgAtao&lc=z13kz1nq
vuqsipqfn23phthasre4evrdo
Lecture I

When can I use GAN?

Generation by GAN

Improving GAN
Structured Learning
Machine learning is to find a function f

f : X → Y
Regression: output a scalar
Classification: output a “class” (one-hot vector)
  Class 1: [1 0 0]   Class 2: [0 1 0]   Class 3: [0 0 1]
Structured Learning/Prediction: output a sequence, a matrix, a graph, a tree ……
The output is composed of components with dependency, which goes beyond regression and classification.
Output Sequence f : X → Y
Machine Translation
X : “機器學習及其深層與 Y : “Machine learning and
結構化” having it deep and structured”
(sentence of language 1) (sentence of language 2)
Speech Recognition

X : (speech)   Y : “感謝大家來上課” (transcription)
Chat-bot

X: “How are you?” Y: “I’m fine.”


(what a user says) (response of machine)
Output Matrix f : X → Y
Image to Image Colorization:

X: Y:

Ref: https://arxiv.org/pdf/1611.07004v1.pdf

Text to Image

X : “this white and yellow flower Y:


have thin white petals and a
round yellow stamen”
ref: https://arxiv.org/pdf/1605.05396.pdf
Decision Making and Control

Actions: “right”, “fire”, “left”

Playing Go is the same.


Why Structured Learning
Interesting?
• One-shot/Zero-shot Learning:
• In classification, each class has some examples.
• In structured learning,
• If you consider each possible output as a
“class” ……
• Since the output space is huge, most “classes”
do not have any training data.
• Machine has to create new stuff during
testing.
• Need more intelligence
Why Structured Learning
Interesting?
• Machine has to learn to plan
• Machine can generate objects component-by-
component, but it should have a big picture in its mind.
• Because the output components have dependency,
they should be considered globally.
Image Generation

Sentence Generation: 這個婆娘不是人,九天玄女下凡塵
(“This woman is not a human being; she is a goddess descended from heaven” — the second line completes the first.)
Structured Learning Approach
Generator (Bottom Up): learn to generate the object at the component level.
Discriminator (Top Down): evaluate the whole object, and find the best one.
Generative Adversarial Network (GAN) combines both.
Lecture I

When can I use GAN?

Generation by GAN

Improving GAN
Generation
We will control what to generate later → Conditional Generation

Image Generation: random vectors (sampled from a specific range) → NN Generator → images

Sentence Generation: random vectors (sampled from a specific range) → NN Generator → “How are you?”, “Good morning.”, “Good afternoon.”
So many questions ……

Q1: Why can't the generator learn by itself?

Q2: Why doesn't the discriminator generate objects itself?

Q3: How do the discriminator and generator interact?
Q1: Can the generator learn by itself?
Collect (code, image) pairs: each image is paired with a code vector (e.g. [0.1, −0.5, ……], [0.1, 0.9, ……]). Where do the codes come from?
Train the NN Generator so that, given a code, its output image is as close as possible to the paired target image.
c.f. training an NN Classifier, where the output [y1, y2, ……] should be as close as possible to the one-hot target [1, 0, ……].
Where do the code vectors paired with each image come from? The encoder in an auto-encoder provides the code ☺

Auto-encoder
• NN Encoder: input object (e.g. a 28 × 28 = 784-pixel image) → a low-dimensional code c, a compact representation of the input object.
• NN Decoder: code c → reconstruction of the original object.
• The encoder and the decoder are trainable and learned together: input → NN Encoder → c → NN Decoder → output.
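A minimal sketch of the auto-encoder described above, assuming PyTorch; the 784-dim input and 32-dim code are illustrative choices, not values fixed by the slides.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # NN Encoder: object -> compact code c
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        # NN Decoder: code c -> reconstruction of the object
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        c = self.encoder(x)
        return self.decoder(c), c

# Encoder and decoder are learned together by minimizing reconstruction error.
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                  # stand-in batch of flattened images
x_hat, c = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # "as close as possible" to the input
loss.backward()
opt.step()
```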
Auto-encoder - Example
NN Encoder → code c (compared with reducing the pixels to 32-dim by PCA; pixels visualized with t-SNE).
Unsupervised Learning: no annotation needed.
Auto-encoder
Input → NN Encoder → code → NN Decoder → output, trained so the output is as close as possible to the input.
The NN Decoder can then be used as a generator: randomly generate a vector as the code, feed it to the decoder, and get an image?
Example with a 2D code: sweep the code over a grid (e.g. from −1.5 to 1.5 in each dimension), feed each code to the NN Decoder, and inspect the generated images; compare with where the real examples lie in the code space.
Problem with using the auto-encoder decoder as a generator:
Recall the structure: input → NN Encoder → code → NN Decoder (= Generator) → output.
Given code a the NN Generator produces one image and given code b another, but what does it produce given 0.5 × a + 0.5 × b? There is no guarantee the interpolated code yields a sensible image, because such codes never appear during training. This motivates the Variational Auto-encoder.
Variational Auto-encoder (VAE)
The NN Encoder outputs two vectors, $(m_1, m_2, m_3)$ and $(\sigma_1, \sigma_2, \sigma_3)$. Sample $(e_1, e_2, e_3)$ from a normal distribution and form the code
$c_i = \exp(\sigma_i) \times e_i + m_i$,
which is fed to the NN Decoder to produce the output.
Training minimizes the reconstruction error plus the regularizer
$\sum_{i=1}^{3} \left( \exp(\sigma_i) - (1 + \sigma_i) + (m_i)^2 \right)$.
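A minimal sketch of this VAE step, assuming PyTorch; the toy encoder/decoder layers and the 3-dim code are placeholders, and the loss follows the two terms on the slide.

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 3)   # toy encoder producing m (3-dim) and sigma (3-dim)
dec = nn.Linear(3, 784)       # toy decoder mapping the 3-dim code back to an image

x = torch.rand(16, 784)
m, sigma = enc(x).chunk(2, dim=1)
e = torch.randn_like(sigma)                 # e sampled from a normal distribution
c = torch.exp(sigma) * e + m                # c_i = exp(sigma_i) * e_i + m_i
x_hat = torch.sigmoid(dec(c))

recon = nn.functional.mse_loss(x_hat, x, reduction="sum")   # reconstruction error
reg = (torch.exp(sigma) - (1 + sigma) + m ** 2).sum()       # regularizer from the slide
loss = recon + reg
loss.backward()
```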
Why VAE?
(Figure: because noise is added around each encoded point, codes between the training codes are also trained to decode to sensible outputs.)
VAE - Writing Poetry

sentence RNN RNN


sentence
Encoder Decoder
code
Code Space
i went to the store to buy some groceries.
i store to buy some groceries.
i were to buy any groceries.

……
"come with me," she said.
"talk to me," she said.
"don’t worry about it," she said.

Ref: http://www.wired.co.uk/article/google-artificial-intelligence-poetry
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio, “Generating Sentences from a Continuous Space”, arXiv preprint, 2015
What do we miss?
The generator G produces an image that should be as close as possible to the target.
It would be fine if the generator could truly copy the target image.
What if the generator makes some mistakes ……
Some mistakes are serious, while some are fine.
Example (comparing generated digits with the target):
• 1 pixel error, badly placed → “I think this is not OK” (我覺得不行)
• 6 pixel errors, consistently placed → “I think this is actually OK” (我覺得其實 OK)
What do we miss?
Each neuron in the output layer corresponds to a pixel (empty or ink).
A neuron in layer L decides its pixel independently: it cannot tell its neighbor “let's produce the same color” — as far as it is concerned, “who cares about you”.
The relation between the components is critical.
The last layer generates each component independently.
We need a deep structure to catch the relation between components.
Thanks to 黃淞楓 for providing the results.

(Variational) Auto-encoder
(Figure: samples from a (Variational) Auto-encoder with G mapping $(z_1, z_2)$ to $(x_1, x_2)$, plotted in the $x_1$–$x_2$ plane.)
So many questions ……

Q1: Why can't the generator learn by itself?

Q2: Why doesn't the discriminator generate objects itself?

Q3: How do the discriminator and generator interact?
Discriminator
Also called: evaluation function, potential function, ……

• The discriminator is a function D (a network, which can be deep)

D : X → R
• Input x: an object (e.g. an image)
• Output D(x): a scalar which represents how “good” the object x is
  (e.g. D gives 1.0 to a realistic image and 0.1 to a poor one)

Can we use the discriminator to generate objects? Yes.
Discriminator
• It is easier to catch the relation between the components by top-down evaluation.
  (For the “not OK” v.s. “actually OK” examples above, a single CNN filter is good enough.)
Discriminator
• Suppose we already have a good discriminator D(x) …
  Generate by enumerating all possible x and picking the one with the largest D(x) !!! Is that feasible ???
How to learn the discriminator?
Discriminator - Training
• I have some real images: feeding each to D should give the scalar 1 (real).
• With only real images, the discriminator simply learns to output “1” (real) for everything.
• Discriminator training needs some negative examples.
Discriminator - Training
Object x → Discriminator D → scalar D(x).
We want D(x) to be large on the real examples and small elsewhere, but in practice you cannot decrease D(x) for all x other than the real examples.
Discriminator - Training
• Negative examples are critical.
  Real images → D → 1 (real); fake images → D → 0 (fake).
  With weak negative examples, D may still give 0.9 (“pretty real”) to an unrealistic image.
  How to generate realistic negative examples?
Discriminator - Training
• General Algorithm
  • Given a set of positive examples, randomly generate a set of negative examples.
  • In each iteration
    • Learn a discriminator D that can discriminate positive and negative examples.
    • Generate new negative examples with the discriminator D:
      $\tilde{x} = \arg\max_{x \in X} D(x)$
(Figure: in each iteration D(x) is pushed up on the real examples and down on the generated ones; in the end the generated examples match the real distribution.)
Structured Learning — this idea has a long history:
➢ Structured Perceptron
➢ Structured SVM
➢ Gibbs sampling
➢ Hidden information
➢ Applications: sequence labelling, summarization
Graphical Models: Bayesian Network (directed graph) and Markov Random Field (undirected graph), e.g. Conditional Random Field, Segmental CRF, Markov Logic Network, Boltzmann Machine, Restricted Boltzmann Machine.
(Only some of the approaches are listed.)
Energy-based Model: http://www.cs.nyu.edu/~yann/research/ebm/
Generator v.s. Discriminator
• Generator
  • Pros: easy to generate, even with a deep model
  • Cons: imitates the appearance; hard to learn the correlation between components
• Discriminator
  • Pros: considers the big picture
  • Cons: generation is not always feasible, especially when the model is deep; how to do negative sampling?
So many questions ……

Q1: Why can't the generator learn by itself?

Q2: Why doesn't the discriminator generate objects itself?

Q3: How do the discriminator and generator interact?
Discriminator - Training
• General Algorithm
  • Given a set of positive examples, randomly generate a set of negative examples.
  • In each iteration
    • Learn a discriminator D that can discriminate positive and negative examples.
    • Generate negative examples with the discriminator D, now using a generator G:
      $\tilde{x} = G(z) \approx \arg\max_{x \in X} D(x)$
Generating Negative Examples
code → NN Generator → image → Discriminator → scalar (e.g. 0.13 → 0.9).
Update the generator by gradient ascent on the discriminator's output, keeping the discriminator fixed.
Algorithm
• Initialize the generator G and the discriminator D
• In each training iteration:
  • Learning D: sample some real objects (labelled 1) and generate some fake objects with G from random codes (labelled 0); update D to separate them.
  • Learning G: feed codes to G to produce images, pass them through D, and update G so that D outputs 1 on them, while keeping D fixed.
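A minimal sketch of one iteration of this algorithm, assuming PyTorch; `G` and `D` are placeholder MLPs over flattened images, not the architectures used in the slides.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784)          # stand-in for a batch of real images

# --- Learning D: real objects -> 1, fake objects -> 0 (G is fixed) ---
z = torch.randn(64, 100)
fake = G(z).detach()
loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# --- Learning G: update G so that D outputs 1 on generated images (D is fixed) ---
z = torch.randn(64, 100)
loss_G = bce(D(G(z)), torch.ones(64, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```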
Benefit of GAN
• From the Discriminator’s point of view
  • Use the generator to produce negative samples: $\tilde{x} = G(z) \approx \arg\max_{x \in X} D(x)$, which is efficient.
• From the Generator’s point of view
  • It still generates the object component-by-component,
  • but it is learned from the discriminator, which has a global view.
Thanks to 段逸林 for providing the results.

(Figure: images generated by GAN v.s. VAE for comparison.)
https://arxiv.org/abs/1512.09300
Lecture I

When can I use GAN?

Generation by GAN

Improving GAN
GAN
• The discriminator leads the generator: D(x) is high around the real data and low around the generated data, and the generated distribution moves toward the real one.
Binary Classifier as Discriminator
A typical binary classifier uses a sigmoid at the output layer: 1 is the largest value (real), 0 is the smallest (fake).
Problem: once the sigmoid saturates, the generated samples “don’t move” — you cannot train your classifier too well ……


Binary Classifier as Discriminator
• Don’t let the discriminator perfectly separate the real and generated data
  • Weaken your discriminator? (but it is unclear how strong is "strong" and how weak is "weak")
  • Add noise to the inputs or labels? (so the real and generated distributions overlap)
Least Square GAN (LSGAN)
• Replace the sigmoid with a linear output (replace classification with regression): regress real examples to the scalar 1 and fake examples to 0.
• With a saturated sigmoid the generated examples “don’t move”; with a linear output they always receive a gradient toward the real ones.
WGAN
• We want the scores of the real examples to be as large as possible and the scores of the generated examples to be as small as possible.
• Any problem? Without a constraint, the discriminator can push the real scores towards +∞ and the generated scores towards −∞.
WGAN
• The discriminator should be a 1-Lipschitz function: it should be smooth. How to realize this?

Lipschitz Function
$\| D(x_1) - D(x_2) \| \le K \| x_1 - x_2 \|$
(output change ≤ K × input change)
K = 1 for “1-Lipschitz”: the function does not change fast.
WGAN — the discriminator should be smooth enough.
• Original WGAN → Weight Clipping
  Force the parameters w to stay between −c and c: after each parameter update, if w > c set w = c; if w < −c set w = −c.
  This does not truly maximize (minimize) D on the real (generated) examples.
• Improved WGAN → Gradient Penalty
  Keep the gradient norm of D close to 1, so D(x) slopes gently from the generated data towards the real data and the generated examples are easy to move.
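A minimal sketch of the improved WGAN critic loss with gradient penalty, assuming PyTorch; `D` is a critic with a linear (unbounded) output and `real`/`fake` are batches of flattened images.

```python
import torch

def wgan_gp_d_loss(D, real, fake, lam=10.0):
    # Critic loss: make D(real) large and D(fake) small (scores are unbounded).
    loss = D(fake).mean() - D(real).mean()
    # Gradient penalty: keep the gradient norm of D close to 1 on points
    # interpolated between real and generated samples.
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grad.norm(2, dim=1) - 1) ** 2).mean()
    return loss + lam * penalty
```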
(Figure: generated samples comparing DCGAN, LSGAN, original WGAN, and improved WGAN under different architectures: G: CNN, D: CNN; G/D: CNN without normalization; G/D: CNN with tanh; G: MLP, D: CNN; G: CNN with a bad structure, D: CNN; G and D with 101 layers.)


Sentence Generation (I will talk about RNNs later.)
• Example sentence: “good bye.”
1-of-N encoding of the characters “g o o d <space> ……”:

            g  o  o  d  <space>
  d         0  0  0  1  0
  g         1  0  0  0  0
  o         0  1  1  0  0
  <space>   0  0  0  0  1

Consider this matrix as an “image”.


Sentence Generation (I will talk about RNNs later.)
• A real sentence is a matrix of 1-of-N columns, e.g.
  1 0 0 0 0
  0 1 0 0 0
  0 0 1 0 0
  0 0 0 1 0
  0 0 0 0 1
• A generated matrix can never be exactly 1-of-N, e.g.
  0.9 0.1 0.1 0   0
  0.1 0.9 0.1 0   0
  0   0   0.7 0.1 0
  0   0   0.1 0.8 0.1
  0   0   0   0.1 0.9
Because the two kinds of matrices have no overlap, a binary classifier can immediately find the difference. WGAN is helpful here.
Improved WGAN successfully generates sentences: https://arxiv.org/abs/1704.00028
Thanks to 李仲翊 for providing the experimental results.
W-GAN – Generating Tang Poetry (each output is 32 characters, including punctuation)

• 升雲白遲丹齋取,此酒新巷市入頭。黃道故海歸中後,不驚入得韻子門。
• 據口容章蕃翎翎,邦貸無遊隔將毬。外蕭曾臺遶出畧,此計推上呂天夢。
• 新來寳伎泉,手雪泓臺蓑。曾子花路魏,不謀散薦船。
• 功持牧度機邈爭,不躚官嬉牧涼散。不迎白旅今掩冬,盡蘸金祇可停。
• 玉十洪沄爭春風,溪子風佛挺橫鞋。盤盤稅焰先花齋,誰過飄鶴一丞幢。
• 海人依野庇,為阻例沉迴。座花不佐樹,弟闌十名儂。
• 入維當興日世瀕,不評皺。頭醉空其杯,駸園凋送頭。
• 鉢笙動春枝,寶叅潔長知。官爲宻爛去,絆粒薛一靜。
• 吾涼腕不楚,縱先待旅知。楚人縱酒待,一蔓飄聖猜。
• 折幕故癘應韻子,徑頭霜瓊老徑徑。尚錯春鏘熊悽梅,去吹依能九將香。
• 通可矯目鷃須浄,丹迤挈花一抵嫖。外子當目中前醒,迎日幽筆鈎弧前。
• 庭愛四樹人庭好,無衣服仍繡秋州。更怯風流欲鴂雲,帛陽舊據畆婷儻。
Loss-sensitive GAN (LSGAN)
(Figure: in WGAN, D(x) is simply pushed up on real x and down on generated x′, x′′; in loss-sensitive GAN, the required margin between D(x) and D(x′) depends on the distance Δ(x, x′), so generated samples that are already close to the real data are pushed down less.)
Energy-based GAN (EBGAN)
• Use an autoencoder as the discriminator D.
  “An image is good” = “it can be reconstructed by the autoencoder”.
  The generator is the same as before; the discriminator is an encoder–decoder (EN → DE) whose score is the negative reconstruction error, so 0 is the best possible value.
EBGAN
• An autoencoder-based discriminator only gives a large value to a limited region: images are hard to reconstruct but easy to destroy.
• Since 0 is the best score, the scores of the generated examples do not have to be pushed very negative.
Mode Collapse
(Figure: the generated distribution concentrates on a small part of the real data distribution.)
Missing Mode?
• Mode collapse is easy to detect, but a missing mode is not: the generator may simply never produce some kinds of real data.
• E.g. BEGAN on CelebA
  (Experimental results provided by 陳柏文; images from 柯達方.)
One remedy: Ensemble — train several generators (Generator 1, Generator 2, ……) and sample from all of them.
Lecture II:
Variants of GAN
Lecture II

Conditional Generation

Sequence Generation

A Little Bit of Theory (optional)


Conditional Generation
Generation: random vectors (sampled from a specific range) → NN Generator → images.
Conditional Generation: a condition such as “Girl with red hair and red eyes” or “Girl with yellow ribbon” → NN Generator → an image matching the condition.
Conditional Generation
• We don’t want to simply generate some random stuff.
• Generate objects based on conditions:
  Caption Generation — given an image as the condition, output “A young girl is dancing.”
  Chat-bot — given the condition “Hello”, output “Hello. Nice to see you.”
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
Modifying Input Code
Each dimension of the input vector represents some characteristic of the output (e.g. longer hair, blue hair, open mouth).
➢ The input code determines the generator output.
➢ Understand the meaning of each dimension to control the output.
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
InfoGAN
In a regular GAN, modifying a specific dimension of the code often has no clear meaning. (In the figure, the colors represent the characteristics: what we expect v.s. what actually happens.)
What is InfoGAN?
• Split the input into z = (c, z′). The Generator maps z to x; the Discriminator still outputs a scalar.
• A Classifier predicts the code c that generated x; it shares parameters with the discriminator (only the last layer is different).
• The generator and the classifier form an “auto-encoder”: the generator acts as the decoder and the classifier as the encoder.
• Because the classifier must be able to recover c from x, c must have a clear influence on x.
https://arxiv.org/abs/1606.03657
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
Connecting Code and Attribute (dataset: CelebA)
GAN + Autoencoder
• We have a generator (input z, output x).
• However, given x, how can we find z? Learn an encoder (input x, output z).
• x → Encoder → z → Generator (Decoder) → x, trained so the reconstruction is as close as possible to the input; the generator is fixed, and the encoder can be initialized from the discriminator (although the two may have different structures).
Attribute Representation
$z_{long} = \frac{1}{N_1} \sum_{x \in long} En(x) - \frac{1}{N_2} \sum_{x' \notin long} En(x')$
To give a short-hair face x long hair: $z = En(x) + z_{long}$, then $Gen(z)$ is the long-hair version.
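A minimal sketch of this attribute-vector trick, assuming PyTorch and that `En` (the learned encoder) and `Gen` (the generator/decoder) are existing modules; the tensor names are hypothetical.

```python
import torch

def attribute_vector(En, long_hair_imgs, other_imgs):
    # z_long = mean code of images with the attribute - mean code of images without it
    return En(long_hair_imgs).mean(0) - En(other_imgs).mean(0)

def add_attribute(En, Gen, x, z_long):
    # z = En(x) + z_long, then Gen(z) is the edited image
    return Gen(En(x) + z_long)
```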
Photo
Editing

https://www.youtube.com/watch?v=kPEIJJsQr7U
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
Conditional GAN
• Generating images based on a text description:
  sentence (“a dog is sleeping”) → NN Encoder → code → NN Generator → image?
• Training data: pairs such as c1: “a dog is running” with image x̂1, and c2: “a bird is flying” with image x̂2.
• Text to image by traditional supervised learning: train the NN so that its output image is as close as possible to the target x̂.
  Problem: for the text “train” there are many different target images, so the NN output is their average — a blurry image!
Conditional GAN
• Generator: condition c (“train”) plus z sampled from a prior distribution → image x = G(c, z).
  The output is a distribution, which approximates the distribution of real images for that condition instead of averaging them into a blurry image.
• Discriminator:
  Type 1: D(x) → scalar, checking whether x is realistic or not.
  Type 2: D(c, x) → scalar, checking whether x is realistic AND whether c and x are matched.
  For type 2, a positive example is a matched pair such as (“train”, a real train image); negative examples include (“train”, a generated image) and a mismatched pair such as (“cat”, a real train image).
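A minimal sketch of a "type 2" conditional discriminator, assuming PyTorch; the condition c is assumed to be already encoded as a 128-dim vector and the image flattened to 784 dims — the dimensions are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class ConditionalD(nn.Module):
    def __init__(self, c_dim=128, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_dim + x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, c, x):
        # Scores whether x is realistic AND matched with the condition c.
        return self.net(torch.cat([c, x], dim=1))

# Training uses positive pairs (c, real matched image) and negative pairs
# (c, generated image) as well as (mismatched c, real image).
```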
Text to Image - Results (“red flower with black center”)
Text to Image - Results
Conditional GAN — MLDS Assignment 3 (TA in charge: 曾柏翔)
• Draw anime character portraits from text descriptions, e.g.
  Red hair, long hair
  Black hair, blue eyes
  Blue hair, green eyes
Data Collection (thanks to TAs 曾柏翔 and 樊恩宇 for collecting the data)
http://konachan.net/post/show/239400/aikatsu-clouds-flowers-hikami_sumire-hiten_goane_r
Released Training Data (example tags: blue eyes, red hair, short hair; images are 96 x 96)
• Data download link:
  https://drive.google.com/open?id=0BwJmB7alR-AvMHEtczZZN0EtdzQ
• Anime Dataset:
  • training data: 33.4k (image, tags) pairs
  • Training tags file format (tags.csv):
    img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> …
Image-to-image (https://arxiv.org/pdf/1611.07004)
• Condition c (an input image) plus z → G → x = G(c, z).
• Traditional supervised approach: train an NN so its output image is as close as possible to the target.
  Testing: the output is blurry because it is the average of several possible images.
• GAN approach: G produces the image and D scores it.
  Testing: experimental results compare “close only”, “GAN”, and “GAN + close”.


Image super resolution
• Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew
Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes
Totz, Zehan Wang, Wenzhe Shi, “Photo-Realistic Single Image Super-Resolution
Using a Generative Adversarial Network”, CVPR, 2016
Video Generation
• The generator predicts the next frame: minimize the distance to the target frame, and also make the discriminator think the generated frame is real.
• The discriminator judges whether the last frame of the sequence is real or generated.
https://github.com/dyelax/Adversarial_Video_Generation
Speech Enhancement
• Typical deep learning approach: noisy speech → (e.g. a CNN) → clean speech, trained on (noisy, clean) pairs.
• Conditional GAN: G maps noisy speech to an enhanced output; D takes the (noisy, output) pair and outputs a scalar judging whether it is a fake pair or not.
• (Spectrogram examples: noisy speech v.s. enhanced speech — which enhanced speech is better?)
Thanks to 廖峴峰 for providing the experimental results (co-advised with Dr. Yu Tsao, Academia Sinica).


More about Speech Processing
• Speech synthesis
• Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru
Hiramatsu, Kunio Kashino, “Generative Adversarial Network-based
Postfiltering for Statistical Parametric Speech Synthesis”, ICASSP 2017
• Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Training
algorithm to deceive anti-spoofing verification for DNN-based speech
synthesis, ”, ICASSP 2017

• Voice Conversion
• Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang,
Voice Conversion from Unaligned Corpora using Variational Autoencoding
Wasserstein Generative Adversarial Networks, Interspeech 2017

• Speech Enhancement
• Santiago Pascual, Antonio Bonafonte, Joan Serrà, SEGAN: Speech
Enhancement Generative Adversarial Network, Interspeech 2017
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
Cycle GAN, Disco GAN
Transform an object from one domain to another without paired data
(e.g. Domain X: photos, Domain Y: van Gogh paintings)

Cycle GAN
https://arxiv.org/abs/1703.10593
https://junyanz.github.io/CycleGAN/
• With only $G_{X \to Y}$ and a discriminator $D_Y$ (which judges whether the input image belongs to domain Y or not), the output becomes similar to domain Y but may ignore the input — not what we want.
• Cycle GAN adds $G_{Y \to X}$: the image is mapped X → Y → X and the reconstruction should be as close as possible to the original, so $G_{X \to Y}$ cannot throw the input's content away (otherwise there is a lack of information for reconstruction).
Cycle GAN (c.f. Dual Learning)
• Train in both directions: X → $G_{X \to Y}$ → Y → $G_{Y \to X}$ → X, and Y → $G_{Y \to X}$ → X → $G_{X \to Y}$ → Y, each reconstruction as close as possible to its original.
• $D_X$ outputs a scalar: does the image belong to domain X or not; $D_Y$ does the same for domain Y.
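A minimal sketch of the two cycle-consistency terms described above, assuming PyTorch modules `G_XY`, `G_YX` (the two generators) and batches `x`, `y` from the two domains.

```python
import torch

def cycle_loss(G_XY, G_YX, x, y):
    # X -> Y -> X and Y -> X -> Y reconstructions should match the originals.
    loss_x = torch.nn.functional.l1_loss(G_YX(G_XY(x)), x)
    loss_y = torch.nn.functional.l1_loss(G_XY(G_YX(y)), y)
    return loss_x + loss_y

# The full objective also adds the adversarial terms from D_X and D_Y.
```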
An anime-style world (input photo → output in the anime domain)
• Using the code: https://github.com/Hi-king/kawaii_creator
• It is not Cycle GAN or Disco GAN.
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
https://www.youtube.com/watch?v=9c4z6YsBGQ0

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman and Alexei A. Efros. "Generative
Visual Manipulation on the Natural Image Manifold", ECCV, 2016.
Andrew Brock, Theodore Lim, J.M. Ritchie, Nick
Weston, Neural Photo Editing with Introspective
Adversarial Networks, arXiv preprint, 2017
Basic Idea
• Editing is done in the space of the code z: find a point that fulfills the user's constraint.
• Why move on the code space instead of directly on the pixels? Moving in the code space keeps the output realistic.


Back to z
Given an image $x^T$, how do we find its code $z^*$?
• Method 1: $z^* = \arg\min_z L(G(z), x^T)$, solved by gradient descent; the difference L between $G(z)$ and $x^T$ can be pixel-wise or measured by another network.
• Method 2: learn an encoder, as in the auto-encoder setup: x → Encoder → z → Generator (Decoder) → x, trained to reconstruct as closely as possible.
• Method 3: use the result of method 2 as the initialization of method 1.


Editing Photos
• z0 is the code of the input image. Find
  $z^* = \arg\min_z \; U(G(z)) + \lambda_1 \| z - z_0 \|^2 - \lambda_2 D(G(z))$
  where $U(G(z))$ measures whether the image fulfills the editing constraint, $\| z - z_0 \|^2$ keeps it not too far away from the original image, and the discriminator term $D(G(z))$ checks that the image is realistic.
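A minimal sketch of optimizing this editing objective, assuming PyTorch modules `G` (generator) and `D` (discriminator), a user-defined constraint term `U` (e.g. matching brush strokes), and `z0`, the code of the input image; all names are hypothetical.

```python
import torch

def edit(G, D, U, z0, lam1=0.1, lam2=0.1, steps=100, lr=0.05):
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        # U(G(z)) + lam1 * ||z - z0||^2 - lam2 * D(G(z))
        loss = U(G(z)) + lam1 * ((z - z0) ** 2).sum() - lam2 * D(G(z)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return G(z)   # the edited image: realistic, near the original, fulfilling the constraint
```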


Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
Domain Independent Features
• Training and testing data are in different domains: feed both through the same generator (feature extractor) and require the extracted features to have the same distribution.
• If the feature extractor only has to fool a domain classifier, that is too easy for it (it can produce degenerate features).
Domain-adversarial training
• Feature extractor: maximize the label classification accuracy AND minimize the domain classification accuracy — not only cheat the domain classifier, but satisfy the label classifier at the same time.
• Label predictor: maximize the label classification accuracy.
• Domain classifier: maximize the domain classification accuracy.
• This is a big network, but different parts have different goals.
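A minimal sketch of domain-adversarial training with a gradient reversal layer, assuming PyTorch; the feature extractor F, label predictor C and domain classifier DC are placeholders, and x, y, d are inputs, class labels and domain labels.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        # Flip the gradient: the extractor is updated to fool the domain classifier.
        return -g

F = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # feature extractor
C = nn.Linear(128, 10)                               # label predictor
DC = nn.Linear(128, 2)                               # domain classifier

x = torch.rand(32, 784)
y = torch.randint(0, 10, (32,))
d = torch.randint(0, 2, (32,))

feat = F(x)
loss = nn.functional.cross_entropy(C(feat), y) \
     + nn.functional.cross_entropy(DC(GradReverse.apply(feat)), d)
loss.backward()   # one backward pass pushes the three parts toward their different goals
```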
Domain-adversarial training

Domain classifier fails in the end


It should struggle ……
Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation,
ICML, 2015
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand,
Domain-Adversarial Training of Neural Networks, JMLR, 2016
Domain-adversarial training

Train:
Test:

Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation,


ICML, 2015
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand,
Domain-Adversarial Training of Neural Networks, JMLR, 2016
Conditional Generation
Modifying the input code
• Making the code have influence (InfoGAN)
• Connecting the code space with attributes
Controlling by input objects
• Paired data
• Unpaired data
• Unsupervised
Feature extraction
• Domain-independent features
• Improving the auto-encoder (VAE-GAN, BiGAN)
VAE-GAN
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther, “Autoencoding beyond pixels using a learned similarity metric”, ICML, 2016
Pipeline: x → Encoder → z → Generator (Decoder) → reconstruction → Discriminator → scalar, with the reconstruction as close as possible to the input.
➢ Encoder: minimize the reconstruction error; keep z close to a normal distribution.
➢ Generator (Decoder): minimize the reconstruction error; cheat the discriminator.
➢ Discriminator: discriminate real, generated, and reconstructed images.
VAE + GAN: the discriminator provides the similarity measure for the VAE.
BiGAN
Jeff Donahue, Philipp Krähenbühl, Trevor Darrell, “Adversarial Feature Learning”, ICLR, 2017
Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, Aaron Courville, “Adversarially Learned Inference”, ICLR, 2017
• Encoder: real image x → code z.
• Decoder: code z (from the prior distribution) → generated image x.
• Discriminator: given a pair (image x, code z), decide whether it comes from the encoder or from the decoder.
Lecture II

Conditional Generation

Sequence Generation

A Little Bit of Theory (optional)


Sentence Generation (with an RNN)
• Training: the reference is “A B B”; at each step the RNN is fed <BOS> and then the previous reference token, and is trained to output A, B, B.
• Generation (Testing): at each step the RNN is fed its own previous output, so it may produce a different sequence, e.g. “B A A”.
Sentence Generation
• Original GAN setup: Generator → sentence x; Discriminator(sentence x) → real or fake.
• Initialize generator G and discriminator D.
• In each iteration:
  • Sample real sentences x from the database.
  • Generate sentences x̃ with G.
  • Update D to increase D(x) and decrease D(x̃).
  • Update G so that the discriminator's scalar output on the generated sentence increases.
• Can we do backpropagation through the generator? NO! The output words are discrete, so tuning the generator a little bit will not change the output sentence. (The improved WGAN paper discusses this, ignoring the sampling process.)
Sentence Generation - SeqGAN
• Use policy gradient: Generator → Discriminator → scalar, and the scalar is used to update the generator.
• Using reinforcement learning:
  • Consider the discriminator as the reward function.
  • Consider the output of the discriminator as the total reward.
  • Updating the generator to increase the discriminator's output = getting the maximum total reward.
Ref: Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient”, AAAI, 2017
Conditional Generation — sequence-to-sequence learning
• Represent the input condition as a vector, and use that vector as the input of the RNN generator.
• E.g. machine translation / chat-bot: the encoder reads “How are you ?” into a vector carrying the information of the whole sentence, and the decoder outputs “I’m fine .”; the encoder and decoder are jointly trained.


Chat-bot with GAN
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation”, arXiv preprint, 2017
• Chatbot (En–De): input sentence/history h → response sentence x.
• Discriminator: given the input sentence/history h and the response sentence x, output real or fake, trained with human dialogues (a conditional GAN).
Thanks to 段逸林 for providing the experimental results.
Example Results (thanks to 王耀賢 for providing the experimental results)
Cycle GAN
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a crappy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
i am so hungry -> i am so
my little doggy is sick -> my little doggy is my little doggy
Lecture II

Conditional Generation

Sequence Generation

A Little Bit of Theory (optional)


Theory behind GAN
• The data we want to generate has a distribution $P_{data}(x)$: some regions of the image space have high probability, others low.
• A generator G is a network, and the network defines a probability distribution $P_G(x)$: sample z from a normal distribution and set $x = G(z)$.
• We want $P_G(x)$ to be as close as possible to $P_{data}(x)$, but it is difficult to compute $P_G(x)$ and we do not know what the distribution looks like.
https://blog.openai.com/generative-models/
When the discriminator is well trained, its loss represents a specific divergence measure between $P_G$ and $P_{data}$.
Updating the generator then minimizes this divergence measure, pulling $P_G(x)$ (defined by z ~ normal distribution, x = G(z)) as close as possible to $P_{data}(x)$.
A binary-classifier discriminator evaluates the JS divergence; you can design the discriminator to evaluate other divergences.
https://blog.openai.com/generative-models/
http://www.guokr.com/post/773890/
Why is GAN hard to train?
(Figure: evolution proceeds through intermediate stages, each slightly “better” than the last.)
Why is GAN hard to train?
When $P_{G_0}(x)$ and $P_{data}(x)$ do not overlap, $JS(P_{G_0} \| P_{data}) = \log 2$.
After 50 updates, $P_{G_{50}}(x)$ may be closer to $P_{data}(x)$ but still not overlapping, so $JS(P_{G_{50}} \| P_{data}) = \log 2$ — according to the measure it is not really better ……
Only when $P_{G_{100}}(x)$ matches $P_{data}(x)$ does $JS(P_{G_{100}} \| P_{data}) = 0$.
Evaluating JS divergence (https://arxiv.org/abs/1701.07875)
• The JS divergence estimated by the discriminator tells little information: it stays at log 2 whether the generator is weak or strong, as long as the distributions do not overlap.
Earth Mover’s Distance
• Consider one distribution P as a pile of earth and another distribution Q as the target.
• The earth mover’s distance is the average distance the earth mover has to move the earth; if P is a point mass at distance d from Q, then $W(P, Q) = d$.
Why Earth Mover’s Distance?
Compare the f-divergence $D_f(P_{data} \| P_G)$ with the earth mover’s distance $W(P_{data}, P_G)$ as the generator improves:
• $P_{G_0}$ at distance $d_0$ from $P_{data}$: $JS(P_{G_0}, P_{data}) = \log 2$, $W(P_{G_0}, P_{data}) = d_0$
• $P_{G_{50}}$ at distance $d_{50}$: $JS(P_{G_{50}}, P_{data}) = \log 2$, $W(P_{G_{50}}, P_{data}) = d_{50}$
• $P_{G_{100}}$ overlapping $P_{data}$: $JS(P_{G_{100}}, P_{data}) = 0$, $W(P_{G_{100}}, P_{data}) = 0$
The JS divergence does not change until the distributions overlap, while the earth mover’s distance decreases smoothly.
https://arxiv.org/abs/1701.07875
To Learn more ……

• GAN Zoo
• https://github.com/hindupuravinash/the-gan-zoo
• Tricks: https://github.com/soumith/ganhacks
• Tutorial of Ian Goodfellow:
https://arxiv.org/abs/1701.00160
Lecture III:
Decision Making
and Control
Decision Making and Control
• The machine observes some inputs, takes an action, and finally achieves the target.
• E.g. playing Go

Target: win the game


observation action
State: summarization of observation
Decision Making and Control
• Machine plays video games
• Widely studied:
• Gym: https://gym.openai.com/
• Universe: https://openai.com/blog/universe/

Machine learns to play video


games as human players
➢ What machine observes are pixels
➢ Machine learns to take
proper action itself
Decision Making and Control
• E.g. self-driving car: observation → Object Recognition → “red light” (紅燈) → Decision Making → Action: Stop!
• Could this be one giant network?
Decision Making and Control
• E.g. dialogue system (could it be one giant network?)
  User: “我想訂11月5日到台北的機票” (I want to book a flight to Taipei on November 5)
  → Slot Filling: arrival time: November 5; destination: Taipei
  → Decision Making: ask for the departure city
  → Natural Language Generation: “請問您要從哪裡出發呢?” (Where will you be departing from?)
How to solve this problem?
• Use the network as a function and learn it as a typical supervised task:
  observation → Network → target output, e.g. target “Stop!” for the self-driving car, target move “3-3” for Go, target “請問您要從哪裡出發呢?” for the dialogue input “我想訂11月5日到台北的機票”.
• This is Behavior Cloning. The machine does not know which behaviors must be copied and which can be ignored.
https://www.youtube.com/watch?v=j2FSB3bseek
Properties of Decision Making and Control
What do we miss with supervised learning? The machine does not know the influence of each action.
• The agent’s actions affect the subsequent data it receives.
• Reward delay:
  • In Space Invaders, only “fire” obtains a reward, although the moves before “fire” are important.
  • In Go, the machine may sacrifice immediate reward to gain more long-term reward.
Better way: treat it as a sequence of decisions, not a single classification (e.g. Network → next move among the 19 x 19 positions).
Two Learning Scenarios
• Scenario 1: Reinforcement Learning
• Machine interacts with the environment.
• Machine obtains the reward from the environment, so it
knows its performance is good or bad.
• Scenario 2: Learning by demonstration
• Also known as imitation learning, apprenticeship
learning
• An expert demonstrates how to solve the task, and
machine learns from the demonstration.
Lecture III

Reinforcement Learning

Inverse Reinforcement
Learning
RL approaches:
• Model-free approach
  • Policy-based: learning an Actor
  • Value-based: learning a Critic
  • Actor + Critic
• Model-based approach
Alpha Go: policy-based + value-based + model-based.
Basic Components
• Actor, Environment, Reward Function. You cannot control the environment or the reward function.
  • Video game: the reward function is defined by the game, e.g. killing a monster gives 20 points.
  • Go: the reward function is the rules of Go.
Scenario
• The Actor/Policy is a function π: Observation (input) → Action (output), i.e. Action = π(Observation).
• The Reward is used to pick the best function.
• The Environment produces the observations.
Learning to play Go
• Observation: the board; Action: the next move; the Environment provides the observations and the reward.
• The agent learns to take actions maximizing the expected reward.
• Reward: 0 in most cases; if win, reward = 1; if lose, reward = −1.
Actor, Environment, Reward
The environment produces state $s_1$; the actor takes action $a_1$; the environment returns $s_2$ and reward $r_1$; the actor takes $a_2$; and so on.
Trajectory: $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$
Total reward: $R(\tau) = \sum_{t=1}^{T} r_t$
Example: Playing a Video Game
• Start with observation $s_1$; take action $a_1$: “right”; obtain reward $r_1 = 0$; see observation $s_2$; take action $a_2$: “fire” (kill an alien); obtain reward $r_2 = 5$; see observation $s_3$; ……
• Usually there is some randomness in the environment.
• After many turns the game is over (the spaceship is destroyed); taking action $a_T$ and obtaining reward $r_T$ ends the episode.
• Total reward of the episode: $R = \sum_{t=1}^{T} r_t$. We want the total reward to be maximized.
Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. pixels).
• Output of the neural network: each action corresponds to a neuron in the output layer, giving a score for that action (e.g. left 0.7, right 0.2, fire 0.1). Take the action based on the probability.
• What is the benefit of using a network instead of a lookup table? Generalization.
Actor, Environment, Reward
• The actor is a network (the action can be taken by argmax); the environment and the reward function are not networks, so $R(\tau) = \sum_{t=1}^{T} r_t$ cannot be maximized by ordinary backpropagation.
• “If you find it is not differentiable, just train it with policy gradient anyway.” (如果發現不能微分就用 policy gradient 硬 train 一發)
Ref: https://www.youtube.com/watch?v=W8XF3ME8G2I
Warning of Math
Policy Gradient
Three Steps for Deep Learning
• Step 1: define a set of functions — a Neural Network as the Actor
• Step 2: define the goodness of a function
• Step 3: pick the best function
Deep Learning is so simple ……
Goodness of Actor
• Given an actor $\pi_\theta(s)$ with network parameter $\theta$, use it to play the video game:
  • Start with observation $s_1$; the machine decides to take $a_1$ and obtains reward $r_1$; sees $s_2$, takes $a_2$, obtains $r_2$; ……; takes $a_T$, obtains $r_T$; END.
• Total reward: $R_\theta = \sum_{t=1}^{T} r_t$. Even with the same actor, $R_\theta$ is different each time because of randomness in the actor and the game.
• We define $\bar{R}_\theta$ as the expected value of $R_\theta$; $\bar{R}_\theta$ evaluates the goodness of the actor $\pi_\theta(s)$.


Goodness of Actor — we define $\bar{R}_\theta$ as the expected value of $R_\theta$.
• $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T\}$
$P(\tau|\theta) = p(s_1)\, p(a_1|s_1, \theta)\, p(r_1, s_2|s_1, a_1)\, p(a_2|s_2, \theta)\, p(r_2, s_3|s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t|s_t, \theta)\, p(r_t, s_{t+1}|s_t, a_t)$
The terms $p(r_t, s_{t+1}|s_t, a_t)$ are not related to your actor; the terms $p(a_t|s_t, \theta)$ are controlled by your actor $\pi_\theta$, e.g. $p(a_t = \text{"fire"}|s_t, \theta) = 0.7$ when the actor outputs left 0.1, right 0.2, fire 0.7.
Goodness of Actor
• An episode is considered as a trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T\}$ with $R(\tau) = \sum_{t=1}^{T} r_t$.
• If you use an actor to play the game, each $\tau$ has a probability to be sampled; the probability depends on the actor parameter $\theta$: $P(\tau|\theta)$.
$\bar{R}_\theta = \sum_{\tau} R(\tau) P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$
(Sum over all possible trajectories; approximate by using $\pi_\theta$ to play the game N times, obtaining $\tau^1, \tau^2, \cdots, \tau^N$ sampled from $P(\tau|\theta)$.)
Three Steps for Deep Learning
• Step 3: pick the best function
Gradient Ascent
• Problem statement: $\theta^* = \arg\max_\theta \bar{R}_\theta$, where $\theta = \{w_1, w_2, \cdots, b_1, \cdots\}$.
• Gradient ascent:
  • Start with $\theta^0$
  • $\theta^1 \leftarrow \theta^0 + \eta \nabla \bar{R}_{\theta^0}$
  • $\theta^2 \leftarrow \theta^1 + \eta \nabla \bar{R}_{\theta^1}$
  • ……
  where $\nabla \bar{R}_\theta = \left[ \partial \bar{R}_\theta / \partial w_1, \; \partial \bar{R}_\theta / \partial w_2, \; \cdots, \; \partial \bar{R}_\theta / \partial b_1, \; \cdots \right]^\top$.

Policy Gradient
$\bar{R}_\theta = \sum_{\tau} R(\tau) P(\tau|\theta)$, so what is $\nabla \bar{R}_\theta$?
$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau) \nabla P(\tau|\theta) = \sum_{\tau} R(\tau) P(\tau|\theta) \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)}$
$R(\tau)$ does not have to be differentiable; it can even be a black box.
Using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$:
$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau) P(\tau|\theta) \nabla \log P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta)$
(Use $\pi_\theta$ to play the game N times and obtain $\tau^1, \tau^2, \cdots, \tau^N$.)
Policy Gradient — what is $\nabla \log P(\tau|\theta)$?
• $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T\}$
$P(\tau|\theta) = p(s_1) \prod_{t=1}^{T} p(a_t|s_t, \theta)\, p(r_t, s_{t+1}|s_t, a_t)$
$\log P(\tau|\theta) = \log p(s_1) + \sum_{t=1}^{T} \left[ \log p(a_t|s_t, \theta) + \log p(r_t, s_{t+1}|s_t, a_t) \right]$
Ignoring the terms not related to $\theta$:
$\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t, \theta)$
Policy Gradient
$\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \sum_{t=1}^{T_n} \nabla \log p(a_t^n|s_t^n, \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
If in $\tau^n$ the machine takes $a_t^n$ when seeing $s_t^n$:
• $R(\tau^n)$ is positive → tune $\theta$ to increase $p(a_t^n|s_t^n)$
• $R(\tau^n)$ is negative → tune $\theta$ to decrease $p(a_t^n|s_t^n)$
What if we replace $R(\tau^n)$ with $r_t^n$? It is very important to consider the cumulative reward $R(\tau^n)$ of the whole trajectory $\tau^n$ instead of the immediate reward $r_t^n$.
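A minimal sketch of this policy gradient update, assuming PyTorch; `policy` is a toy actor and `trajectories` is a hypothetical list of (states, actions, R) tuples where R is the total reward $R(\tau^n)$ of that episode.

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 3)            # toy actor: 4-dim state -> 3 action scores
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def update(trajectories):
    loss = 0.0
    for states, actions, R in trajectories:
        logp = torch.log_softmax(policy(states), dim=1)                 # log p(a_t | s_t, theta)
        loss = loss - R * logp.gather(1, actions.unsqueeze(1)).sum()    # weighted by R(tau^n)
    loss = loss / len(trajectories)
    opt.zero_grad(); loss.backward(); opt.step()   # minimizing -R_bar = gradient ascent on R_bar

# Example with one fake trajectory of length 5:
update([(torch.rand(5, 4), torch.randint(0, 3, (5,)), 1.0)])
```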
Policy Gradient — alternate between data collection and model update:
• Data collection: given the actor parameter $\theta$, play the game and record $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \cdots$ with $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \cdots$ with $R(\tau^2)$; ……
• Model update: $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$ with $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$.
Policy Gradient — considered as a classification problem:
Given state s, treat the taken action (e.g. “left”) as the class label $\hat{y}$ and minimize the cross-entropy $-\sum_{i=1}^{3} \hat{y}_i \log y_i$, i.e. maximize $\log y_i = \log P(\text{"left"}|s)$; the update $\theta \leftarrow \theta + \eta \nabla \log P(\text{"left"}|s)$ is one term of $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$.
Policy Gradient — implementation
$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
• Each state-action pair is a classification training example: e.g. $(s_1^1, a_1^1 = \text{left})$ has target (left 1, right 0, fire 0); $(s_1^2, a_1^2 = \text{fire})$ has target (left 0, right 0, fire 1).
• Each training example is weighted by the total reward $R(\tau^n)$ of the trajectory it comes from: e.g. if $R(\tau^1) = 2$ and $R(\tau^2) = 1$, the pairs from $\tau^1$ appear with weight 2 and those from $\tau^2$ with weight 1.
End of Warning
Critic
• A critic does not determine the action.
• Given an actor π, it evaluates how good the actor is.
• An actor can be found from a critic, e.g. Q-learning.
http://combiboilersleeds.com/picaso/critics/critics-4.html
Critic
• State value function $V^\pi(s)$: when using actor π, the cumulated reward expected to be obtained after seeing observation (state) s.
• s → $V^\pi$ → scalar $V^\pi(s)$; it is large for good positions and smaller for bad ones.
• Example (Hikaru no Go): the same move (大馬步飛) is bad for the earlier, weaker Hikaru (以前的阿光) but good for the stronger Hikaru (變強的阿光) — the value depends on the actor.
How to estimate $V^\pi(s)$
• Monte-Carlo based approach: the critic watches π playing the game.
  • After seeing $s_a$, the cumulated reward until the end of the episode is $G_a$: train $V^\pi(s_a) \to G_a$.
  • After seeing $s_b$, the cumulated reward until the end of the episode is $G_b$: train $V^\pi(s_b) \to G_b$.
• Temporal-difference approach: given $\cdots s_t, a_t, r_t, s_{t+1} \cdots$, use $V^\pi(s_t) = r_t + V^\pi(s_{t+1})$, i.e. train so that $V^\pi(s_t) - V^\pi(s_{t+1}) \to r_t$.
  • Some applications have very long episodes, so delaying all learning until an episode's end is too slow.
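A minimal sketch of the MC and TD targets for $V^\pi$, assuming PyTorch; `V` is a toy value network and the states, reward and cumulated reward are placeholder values standing in for data collected while watching π play.

```python
import torch
import torch.nn as nn

V = nn.Linear(4, 1)                        # toy critic: 4-dim state -> scalar V(s)
s_t, s_next = torch.rand(1, 4), torch.rand(1, 4)
r_t, G = 1.0, 5.0                           # immediate reward and cumulated reward

mc_loss = (V(s_t) - G) ** 2                                   # Monte-Carlo: V(s_a) -> G_a
td_loss = (V(s_t) - (r_t + V(s_next).detach())) ** 2          # TD: V(s_t) -> r_t + V(s_t+1)
```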
MC v.s. TD
• Monte-Carlo: $V^\pi(s_a) \to G_a$ — larger variance, unbiased.
• Temporal-difference: $V^\pi(s_t) \to r + V^\pi(s_{t+1})$ — smaller variance, but $V^\pi(s_{t+1})$ may be biased.
MC v.s. TD — Example [Sutton, v2, Example 6.4]
• The critic has observed the following 8 episodes (the actions are ignored here):
  • $s_a, r = 0, s_b, r = 0$, END
  • $s_b, r = 1$, END (×6)
  • $s_b, r = 0$, END
• $V^\pi(s_b) = 3/4$. What is $V^\pi(s_a)$? 0? 3/4?
  • Monte-Carlo: $V^\pi(s_a) = 0$.
  • Temporal-difference: $V^\pi(s_a) = r + V^\pi(s_b) = 0 + 3/4 = 3/4$.
Actor-Critic
Actor = Generator? Critic = Discriminator? (https://arxiv.org/abs/1610.01945)
• π interacts with the environment → learn the critic with TD or MC → update the actor from π to π′ based on the critic → set π = π′ and repeat.
Playing an On-line Game (by 劉廷緯 and 溫明浩)
https://www.youtube.com/watch?v=8iRD1w73fDo&feature=youtu.be
Lecture III

Reinforcement Learning

Inverse Reinforcement
Learning
Imitation Learning
• The actor interacts with the environment as before, but the reward function is not available.
• We have demonstrations of the expert: $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$, where each $\hat{\tau}$ is a trajectory of the expert.
  • Self-driving: record human drivers.
  • Robot: grab the arm of the robot to demonstrate.
Motivation
• It is hard to define the reward in some tasks.
• Hand-crafted rewards can lead to uncontrolled behavior.
The Three Laws of Robotics:
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
“Flawless” logic: therefore, to protect humanity as a whole, it is necessary to control human freedom.
Inverse Reinforcement Learning
demonstration
of the expert
Environment
𝜏Ƹ1 , 𝜏Ƹ 2 , ⋯ , 𝜏Ƹ 𝑁

Reward Inverse Reinforcement


Reinforcement Optimal
Function
Expert
Actor
Learning

➢ Using the reward function to find the optimal actor.


➢ Modeling reward can be easier. Simple reward
function can lead to complex policy.
Inverse Reinforcement Learning
• Principle: The teacher is always the best.
• Basic idea:
• Initialize an actor
• In each iteration
• The actor interacts with the environment to obtain some trajectories
• Define a reward function, which makes the
trajectories of the teacher better than the actor
• The actor learns to maximize the reward based on
the new reward function.
• Output the reward function and the actor learned from
the reward function
Framework of IRL
• The expert $\hat{\pi}$ provides demonstrations $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_N$; the actor π produces trajectories $\tau_1, \tau_2, \cdots, \tau_N$.
• Obtain a Reward Function R such that $\sum_{n=1}^{N} R(\hat{\tau}_n) > \sum_{n=1}^{N} R(\tau_n)$.
• Find an actor based on the reward function R by reinforcement learning, and repeat.
• Analogy with GAN: the actor = the generator, the reward function = the discriminator.
  • GAN: D gives a high score to real samples and a low score to generated ones; find a G whose output obtains a large score from D.
  • IRL: the reward function gives a larger reward to $\hat{\tau}_n$ and a lower reward to τ; find an actor that obtains a large reward.
Teaching a Robot
• In the past …… https://www.youtube.com/watch?v=DEGbtjTOIB0
Robot: Chelsea Finn, Sergey Levine, Pieter Abbeel, “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”, ICML, 2016
http://rll.berkeley.edu/gcl/
Parking a Car
https://pdfs.semanticscholar.org/27c3/6fd8e6aeadd50ec1e180399df70c46dede00.pdf
Path Planning
http://martin.zinkevich.org/publications/maximummarginplanning.pdf
Third Person Imitation Learning
• Ref: Bradly C. Stadie, Pieter Abbeel, Ilya Sutskever, “Third-Person Imitation
Learning”, arXiv preprint, 2017

First Person Third Person

http://lasa.epfl.ch/research_new/ML/index.php https://kknews.cc/sports/q5kbb8.html

http://sc.chinaz.com/Files/pic/icons/1913/%E6%9C%BA%E5%99%A8%E4%BA%BA%E5%9B
%BE%E6%A0%87%E4%B8%8B%E8%BD%BD34.png
Third Person Imitation Learning
Concluding Remarks

Lecture 1: Introduction of GAN

Lecture 2: Variants of GAN

Lecture 3: Making Decision and Control


To Learn More …
• Machine Learning
  • Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
  • Video: https://www.youtube.com/watch?v=fegAeph9UaA&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Machine Learning and Having it Deep and Structured
  • Slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS17.html
  • Video: https://www.youtube.com/watch?v=IzHoNwlCGnE&list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9
