NN Theory

Statistical Machine Learning
Methods for Bioinformatics

III. Neural Network & Deep
Learning Theory
Jianlin Cheng, PhD
Department of Computer Science
University of Missouri
2016
Free for Academic Use. Copyright @ Jianlin Cheng & original sources of some materials.
Classification Problem
Input Output
Legs weight size …. Feature m Category / Label
4 100 Mammal
80 0.1 Bug
Question: How to automatically predict output given input?

Idea: Learn from known examples and generalize to unknown ones.
Data Driven Machine Learning
Approach
Training
Prediction
Training
Data Model: Map New
Data with Split Input to Output Data
Labels
Test Data
Test
Input: words of news Training: Build a model (classifier)

Output: politics, sports, entertainment, Test: Test the model
…
Key idea: Learn from known data and Generalize to unseen data
Outline
• Introduction
• Linear regression
• Linear Discriminant function (classification)
• One layer neural network / perceptron
• Multi-layer network
• Recurrent neural network
• Prevent overfitting
• Speedup learning
• Deep learning
Machine Learning
• Supervised learning (training with labeled data),
un-supervised learning (clustering un-labeled
data), and semi-supervised learning (use both
labeled and unlabeled data)
• Supervised learning: classification and regression
• Classification: output is discrete value
• Regression: output is real value
Learning Example: Recognize
Handwriting
Classification: recognize each number

Clustering: cluster the same numbers together
Regression: predict the index of Dow-Jones
Neural Network
• Neural Network can do both supervised
learning and un-supervised learning
• Neural Network can do both regression and
classification
• Neural Network has both statistical and
artificial intelligence roots
Roots of Neural Network
• Artificial intelligence root (neuron science)
• Statistical root (linear regression,
generalized linear regression, discriminant
analysis. This is our focus.)
A Typical Cortical Neuron
1011 neurons
Dentritic tree Junction
between
neurons
Synapse: control
Axon: generate release chemical
Collect chemical signals Potentials (Fire/not Fire) transmitters.
A Neural Model
weight
Activation function
Input Activation
Adapted from http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Statistics Root: Linear Regression
Example
Fish length vs. weight?

X: input or predictor
Y: output or response
Goal: learn a linear function E[y|x] = wx + b.
Adapted from A. Moore, 2003
Linear Regression
Definition of a linear model:
• y = wx + b + noise.
• noise ~ N (0, σ2), assume σ is a constant.
• y ~ N(wx + b, σ2)
• Estimate expected value of y given x (E[y|x]
= wx +b) .
• Given a set of data (x1, y1), (x2, y2), …, (xn,
yn), to find the optimal parameters w and b.
Objective Function
N
• Least square error:  i   2

( y wxi b )
i 1 N
• Maximum Likelihood:  P( y
i 1
i | xi , w, b)
• Minimizing square
error is equivalent to
maximizing likelihood
Maximize Likelihood
N ( yi  wxi b ) 2
1 

N
 P( yi | xi , w, b)
2 2
= e
i 1
i 1 2 2
Minimize negative log-likelihood:

( yi  wxi b ) 2
N N
1  N
( yi  wxi  b) 2
 log(  P( yi | xi , w, b)) =  log(  2
)   ( log( 2 ) 
2 2
e )
i 1 i 1 2 2 i 1 2 2
N
( yi  wxi  b) 2
=  (log( 2 )  2
)
i 1 2 2
Note: σ is a constant.
1-Variable Linear Regression
N
Minimize E =  i   2
( y wxi b )
i 1 Error
E N N
  2( yi  wxi  b) * ( xi )   2( yi xi  wxi2  bxi )  0
W i 1 i 1
E N N
  2( yi  wxi  b) * (1)   2( yi  wxi  b)  0
b i 1 i 1
w
N
x y
N
 N xy
w i 1
i i
(y i  wxi )
N
b i 1
x
i 1
2
i  N xx N
Multivariate Linear Regression
• How about multiple predictors: (x1, x2, …,
xd).
• y = w0 + w1x1 + w2x2 + … + wdxd + ε
• For multiple data points, each data point is
represented as (yi, xi), xi consists of d
predictors (xi1, xi2, …, xid).
• yi = w0 + w1xi1 + w2xi2 + … + wdxid + ε
A Motivating Example
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and beer.
– You get several portions of each
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out the price of
each portion.
• Each meal price gives a linear constraint on the prices of the
portions:
price  x fishw fish  xchipswchips  xbeer wbeer

G. Hinton, 2006
Matrix Representation
n data points, d dimension
 y1   1 x11 ... x1d   w0 

     
 y2   1 x21 ... x2 d   w1 
 ...    ... .     
... . ...
     
y  1 x ... xnd   wd 
 n  n1
n*1 n*(d+1) (d+1)*1
Matrix Representation: Y = XW + ε
Multivariate Linear Regression
• Goal: minimize square error = (Y-XW)T(Y-
XW) = YTY -2XTWY + WTXTXW
• Derivative: -2XTY + 2XTXW = 0
• W = (XTX)-1XTY
• Thus, we can solve linear regression using
matrix inversion, transpose, and
multiplication.
Difficulty and Generalization
• Numerical computation issue. (a lot data
points. Matrix inversion is impossible.)
• Singular matrix (determinant is zero) : no
inversion
• How to handle non-linear data?
• Turns out neural network and its iterative
learning algorithm can address this
problem.
Graphical Representation:
One Layer Neural Network for Regression
Target: y
o
Activation function f is used
to convert a to output. Here
Output Unit f it is a linear function. o = a.
a =Σwixi Activation
w0 w1 wd
Input Unit 1 x1 …… xd
Gradient Descent Algorithm
• For a data x = (x1,x2,…xd), error E = (y – o)2 =
(y – w0x0 - w1x1 - … - wdxd)2
• Partial derivative: E |  E  2( y  o) o  2( y  o)( x )  2( y  o) x
wi wi
wi i i
Minima
Error
Update rule:
E
w
0
wi(t 1)  wi(t )   ( y  o) xi
E
0
w
Famous Delta Rule

w
Algorithm of One-Layer Regression
Neural Network
• Initialize weights w (small random numbers)
• Repeat
Present a data point x = (x1,x2,…,xd) to the network
and compute output o.
if y > o, add ηxi to wi.
if y < o, add -ηxi to wi.
• Until Σ(yk-ok)2 is zero or below a threshold or
reaches the predefined number of iterations.
Comments: online learning: update weight for every x. batch learning:
update weight every batch of x (i.e. Σηxi ).
Graphical Representation:
One Layer Neural Network for Regression
Output O Target: y
Output Unit out O = f(Σwixi), f is activation

function.
a =Σwixi Activation
w0 w1 wd
What about Hyperbolic Tanh
Function for Output Unit
• Can we use activation function
other than linear function?
• For instance, if we want to
limit the output to be in [-1,
+1], we can use hyperbolic
Tanh function:
e x  e x
e x  ex
• The only thing to change is to
use the new gradient.
Two-Category Classification
• Two classes: C1 and C2.
• Input feature vector: x.
• Define a discriminant function y(x) such
that x is assigned to C1 if y(x) > 0 and to
class C2 if y(x) < 0.
• Linear discriminant function: y(x) = wTx +
w0 = wTx, where x = (1, x).
• w: weight vector, w0: bias.
A Linear Decision Boundary in 2-D
Input Space
x2 w: orientation of decision boundary
w0: defines the position of the plan
in terms of its perpendicular distance
from the origin.
w
x1
y(x) = wTx + w0 = 0
y(x) = wTx = 0
l = |wTx| / ||w|| = w0 / ||w||
Graphical Representation: Perceptron, One-
Layer Classification Neural Network
wTx > 0: +1, class 1
Activation / y=g(wTx) wTx < 0: -1, class 2
Transfer function (threshold function)
out
Activation wTx = Σwixi

w0 w1 wd
Perceptron Criterion
• Minimize classification error
• Input data (vector): x1, x2, …, xN and
corresponding target value t1, t2, …, tN.
• Goal: for all x in C1 (t = 1), wTx > 0, for all x in
C2 (t = -1), wTx < 0. Or for all x: wTxt > 0.
• Error: E (w) =
perc   x t. M is the set of
wT n n
x M
n
misclassified data points.

Gradient Descent
Minima
Error
E
0
w
E
0
w
W
For each misclassified data point, adjust weight as follows:
E
w=w- × η = w + η xn t n
w
Perceptron Algorithm
• Initialize weight w
• Repeat
For each data point (xn, tn)
Classify each data point using current w.
If wTxntn > 0 (correct), do nothing
If wTxntn < 0 (wrong), wnew = w + ηxntn
w = wnew
• Until w is not changed (all the data will be
separated correctly, if data is linearly separable) or
error is below a threshold.
Rosenblatt, 1962
Perceptron Convergence Theorem
• For any data set which is linearly separable,

the algorithm is guaranteed to find a
solution in a finite number of steps
(Rosenblatt, 1962; Block 1962; Nilsson,
1965; Minsky and Papert 1969; Duda and
Hart, 1973; Hand, 1981; Arbib, 1987; Hertz
et al., 1991)
Perceptron Demo
• https://www.youtube.com/watch?v=vGwemZ
hPlsA
Multi-Class Linear Discriminant
Function
• c classes. Use one discriminant function
yk(x) = wkTx + wk0 for each class Ck.
• A new data point x is assigned to class Ck if
yk(x) > yj(x) for all j ≠ k.
One-Layer Multi-Class Perceptron
y1 yc
……
wc0
w10 w11 wcd
wc1 w1d
……
x0 = 1 x1 xd
How to learn it?

Muti-Threshold Perceptron
Algorithm
• Initialize weight w
• Repeat
Present data point x to the network, if
classification is correct, do nothing.
if x is wrongly classified to Ci instead of true class
Cj, adjust weights connected to Ci and Cj as
follows.
Add –ηxk to wik. Add ηxk to wjk
• Until misclassification is zero or below a
threshold.
Note: may also Add –ηxk to wlk for any l, yl > yj.
Limitation of the Perceptron
• Can’t not separate non-linear data
completely.
• Or can’t not fit non-linear data well.
• Two directions to attack the problem: (1)
extend to multi-layer neural network (2)
map data into high dimension (SVM
approach)
Exclusive OR Problem
C1 C2
Perceptron (or one-layer
(0,1) (1,1) neural network) can not
learn a function to separate
the two classes perfectly.
C2 C1
(0,0) (1,0)
Logistic Regression
• Estimate posterior distribution: P(C1|x)
• Dose – response estimation: in bioassay, the
relation between dose level and death rate
P(death | x).
• We can not use 0/1 hard classification.
• We can not use unconstrained linear
regression because P(death | x) must be in
[0,1]?
Logistic Regression and One Layer
Neural Network With Sigmoid
Function. y
Target: t (0 or 1)
1
P( death | x) = Activation
1  e  wx 1 Function:
1  ez
(Sigmoid function) sigmoid
Activation z = Σwixi
……
1 x1 xd
How to Adjust Weights?
• Minimize error E=(t-y)2. For simplicity, we derive
the formula for one data point. For multiple data
points, just add the gradients together.
E E y z
  2(t  y ) y (1  y ) xi
wi y z wi
1
( )
y z
 1 e
1 1
Notice:  z
(1  z
)  y (1  y )
z z 1 e 1 e
Error Function and Learning
• Least Square
• Maximum likelihood: output y is the probability of
being in C1 (t=1). 1- y is the probability of being in
C2. So what is probability of P(t|x) = yt(1-y)1-t.
• Maximum likelihood is equivalent to minimize
negative log likelihood:
E = -log P(t|x) = -tlogy - (1-t)log(1-y). (cross /
relative entropy)
How to Adjust Weights?
• Minimize error E= -tlogy - (1-t)log(1-y). For
simplicity, we derive the formula for one data
point. For multiple data points, just add the
gradients together.
E t 1 t t t 1 y t
  (1)    
y y 1 y y 1  y y (1  y )
E E y z y t
  y (1  y ) xi  ( y  t ) xi
wi y z wi y (1  y )
Update rule: w ( t 1)

i  w   (t  y ) xi
t
i
Multi-Class Logistic Regression
• Transfer (or activation) function is normalized ai
exponentials (or soft max) e
y1 yc yi  c
e
aj
……
j 1
Activation Function
wc0
w10 w11 w1d
wc1 w1d
Activation to Node Oi
……
x0 x1 xd
How to learn this network? Once again, gradient descent.

Questions?
• Is logistic regression a linear regression?
• Can logistic regression handle non-linearly
separable data?
• How to introduce non-linearity?
Support Vector Machine Approach
x2 Map data point into high dimension, e.g.
adding some non-linear features.
How about we augument feature

into three dimension
C1 (x1, x2, x12+x22).
C2
x1
All data points in class C2 have a
larger value for the third feature
x12+x22 = 10 Than data points in C1. Now
data is linearly separable.
Neural Network Approach
• Multi-Layer Perceptrons
• In addition to input nodes and output nodes, some
hidden nodes between input / output nodes are
introduced.
• Use hidden units to learn internal features to
represent data. Hidden nodes can learn internal
representation of data that are not explicit in the
input features.
• Transfer function of hidden units are non-linear
function
Multi-Layer Perceptron
• Connections go from lower layer to higher layer.
(usually from input layer to hidden layer, to output layer)
• Connection between input/hidden nodes, input/output
nodes, hidden/hidden nodes, hidden/output nodes are
arbitrary as long as there is no loop (must be feed-
forward).
• However, for simplicity, we usually only allow
connection from input nodes to hidden nodes and from
hidden nodes to output nodes. The connections with a
layer are disallowed.
Multi-Layer Perceptron
• Two-layer neural network (one hidden and one
output) with non-linear activation function is a
universal function approximator (see Baldi and
Brunak 2001 or Bishop 96 for the proof), i.e. it can
approximate any numeric function with arbitrary
precision given a set of appropriate weights and
hidden units.
• In early days, people usually used two-layer (or
three-layer if you count the input as one layer)
neural network. Increasing the number of layers was
occasionally helpful.
• Later expanded into deep learning with many
layers!!!
Two-Layer Neural Network
y1 yk yc Output
Activation function: f (linear,sigmoid, softmax)
… M
Activation of unit ak:
wkj w kj zj
z1 zM j 0
zj
Z0=1 … Activation function: g (linear, tanh, sigmoid)
Activation of unit aj: d
w11 w1i wji w

i 0
x
ji i
1 … M d
xd yk  f ( wkj  g ( w ji xi ))
x0 x1 xi j 0 i 0
Adjust Weights by Training
• How to adjust weights?
• Adjust weights using known examples
(training data) (x1,x2,x3,…,xd,t).
• Try to adjust weights so that the difference
between the output of the neural network y
and t (target) becomes smaller and smaller.
• Goal is to minimize Error (difference) as we
did for one layer neural network
Adjust Weights using Gradient
Descent (Back-Propagation)
Known: Minima
Error
Data: (x1,x2,x3,…,xn) target t.
Unknown weights w:
w11, w12,…..
Randomly initialize weights
Repeat
for each example, compute output y W
calculate error E = (y-t)2
compute the derivative of E over w: dw= E
w
wnew = wprev – η * dw
Until error doesn’t decrease or max num of iterations
Note: η is learning rate or step size.
Insights
• We know how to compute the derivative of one
layer neural network? How to change weights
between input layer and hidden layer?
• Should we compute the derivative of each w
separately or we can reuse intermediate results?
We will have an efficient back-propagation
algorithm.
• We will derive learning for one data example. For
multiple examples, we can simply add the
derivatives from them for a weight parameter
together.
Neural Network Learning: Two
Processes
• Forward propagation: present an example
(data) into neural network. Compute
activation into units and output from units.
• Backward propagation: propagate error
back from output layer to the input layer
and compute derivatives (or gradients).
Forward Propagation
Output yk
y1 yk yc
Activation function: f (linear,sigmoid, softmax)
… M
Activation of unit ak:
w
j 1
kj zj
wkj
z1 zj zM zj
… Activation function: g (linear, tanh, sigmoid)
Activation of unit aj: d
w11 w1i wji w

i 1
ji ix
x1 xi xd
Time complexity?
O(dM + MC) = O(W)
1 C
E   ( yk  t k ) 2
Backward Propagation 2 k 1
E
 yk  t k
y1 yk yc yk
f
E E yk
… M   ( y k  t k ) f ' ( ak )   k
ak yk ak
ak:  wkj z j
j 1
wkj E E ak
  k z j
z1 zj zM wkj ak wkj
… g E c
E yk ak z j C
    k wkj g ' (a j )   j
a j k 1 yk ak z j a j k 1
d
w11 w1i wji aj: w x

ji i
i 1
E E a j
   j xi
w ji a j wji
…
x1 xi xd
Time complexity?
If no back-propagation, time O(CM+Md) = O(W)
complexity is: (MdC+CM)
Example 1
E  ( y  t )2
2
y
E E y
f linear function
   ( y  t)
ak: ak y ak
wj
z1 E
zj zM g is sigmoid:  z j
M
w j
… aj: w
i 1
x
ji i
 j  w j g ' (a j )  ( y  t ) w j z j (1  z j )
w11 w1i wji
E
  j xi  ( y  t ) w j z j (1  z j ) xi
… w ji
x1 xi xd
Algorithm
• Initialize weights w
• Repeat
For each data point x, do the following:
Forward propagation: compute outputs and activations
Backward propagation: compute errors for each output units
and hidden units. Compute gradient for each weight.
Update weight w = w - η (∂E / ∂w)
• Until a number of iterations or errors drops below a

threshold.
Implementation Issue
• What should we store?
• An input vector x of d dimensions
• A M*d matrix {wji} for weights between input and hidden
units
• An activation vector of M dimensions for hidden units
• An output vector of M dimensions for hidden units
• A C*M matrix {wkj} for weights between hidden and
output units
• An activation vector of C dimensions for output units
• An output vector of C dimensions for output units
• An error vector of C dimensions for output units
• An error vector of M dimensions for hidden units
Recurrent Network
y1 yk yc
wkj
z1 zj zM
…
w w wji
11 w1i
x1 xi xd
Forward: Backward:
At time 1: present X1, 0 Time t: back-propagate
At time 2: present X2, y1 Time t-1: back-propagate with
…… Output errors and errors from previous step
Recurrent Neural Network
1. Recurrent network is essentially a series of feed-forward

neural networks sharing the same weights
2. Recurrent network is good for time series data and sequence

data such as biological sequences and stock series
Overfitting
• The training data contains information about the
regularities in the mapping from input to output.
But it also contains noise
– The target values may be unreliable.
– There is sampling error. There will be accidental
regularities just because of the particular training cases
that were chosen.
• When we fit the model, it cannot tell which
regularities are real and which are caused by
sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the sampling
error really well. This is a disaster.
G. Hinton, 2006
Example of Overfitting and Good Fitting
Overfitting
Good fitting
Overfitting function can not generalize well to unseen data.

Preventing Overfitting
• Use a model that has the right capacity:
– enough to model the true regularities
– not enough to also model the spurious regularities
(assuming they are weaker).
• Standard ways to limit the capacity of a neural
net:
– Limit the number of hidden units.
– Limit the size of the weights.
– Stop the learning before it has time to overfit.
G. Hinton, 2006
Limiting the Size of the Weights

• Weight-decay involves C E 
2
wi
adding an extra term to 2
i
the cost function that C E
penalizes the squared   wi
weights.
wi wi
– Keeps weights small
unless they have big error
C 1 E
derivatives. when  0, wi  
wi  wi
C
w G. Hinton, 2006
The Effect of Weight-Decay
• It prevents the network from using weights that it
does not need.
– This can often improve generalization a lot.
– It helps to stop it from fitting the sampling error.
– It makes a smoother model in which the output changes
more slowly as the input changes.
• If the network has two very similar inputs it
prefers to put half the weight on each rather than
all the weight on one.
w/2 w/2 w 0
G. Hinton, 2006
Deciding How Much to Restrict the
Capacity
• How do we decide which limit to use and how
strong to make the limit?
– If we use the test data we get an unfair prediction
of the error rate we would get on new test data.
– Suppose we compared a set of models that gave
random results, the best one on a particular dataset
would do better than chance. But it wont do better
than chance on another test set.
• So use a separate validation set to do model
selection.
G. Hinton, 2006
Using a Validation Set
• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of
the model.
– Validation data is not used of learning but is used
for deciding what type of model and what amount
of regularization works best.
– Test data is used to get a final, unbiased estimate
of how well the network works. We expect this
estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get
another unbiased estimate of the true error rate.
G. Hinton, 2006
Preventing Overfitting by Early
Stopping
• If we have lots of data and a big model, its very expensive
to keep re-training it with different amounts of weight
decay.
• It is much cheaper to start with very small weights and let
them grow until the performance on the validation set
starts getting worse (but don’t get fooled by noise!)
• The capacity of the model is limited because the weights
have not had time to grow big.
G. Hinton, 2006
Why Early Stopping Works
• When the weights are very
small, every hidden unit is
in its linear range.
outputs
– So a net with a large layer
of hidden units is linear.
– It has no more capacity than
a linear net in which the
inputs are directly
connected to the outputs!
• As the weights grow, the
hidden units start using
their non-linear ranges so inputs
the capacity grows.
G. Hinton, 2006
Combining Networks
• When the amount of training data is limited, we
need to avoid overfitting.
– Averaging the predictions of many different networks is
a good way to do this.
– It works best if the networks are as different as
possible.
– Combining networks reduces variance
• If the data is really a mixture of several different
“regimes” it is helpful to identify these regimes
and use a separate, simple model for each regime.
– We want to use the desired outputs to help cluster cases
into regimes. Just clustering the inputs is not as
efficient.
G. Hinton, 2006
How the Combined Predictor
Compares with the Individual
Predictors
• On any one test case, some individual predictors will
be better than the combined predictor.
– But different individuals will be better on different cases.
• If the individual predictors disagree a lot, the
combined predictor is typically better than all of the
individual predictors when we average over test
cases.
– So how do we make the individual predictors disagree?
(without making them much worse individually).
G. Hinton, 2006
Ways to Make Predictors Differ
• Rely on the learning algorithm getting stuck in a
different local optimum on each run.
– A dubious hack unworthy of a true computer scientist (but
definitely worth a try).
• Use lots of different kinds of models:
– Different architectures
– Different learning algorithms.
• Use different training data for each model:
– Bagging: Resample (with replacement) from the training
set: a,b,c,d,e -> a c c d d
– Boosting: Fit models one at a time. Re-weight each training
case by how badly it is predicted by the models already
fitted.
• This makes efficient use of computer time because it does not
bother to “back-fit” models that were fitted earlier.
G. Hinton, 2006
How to Speedup Learning?
The Error Surface for a Linear Neuron
• The error surface lies in a space with a horizontal axis
for each weight and one vertical axis for the error.
– It is a quadratic bowl.
• i.e. the height can be expressed as a function of the weights without
using powers higher than 2. Quadratics have constant curvature
(because the second derivative must be a constant)
– Vertical cross-sections are parabolas. G. Hinton, 2006
– Horizontal cross-sections are ellipses.
E w1
w w2
Convergence Speed
• The direction of steepest – The gradient is small in the
descent does not point at direction in which we want
the minimum unless the to travel a large distance.
ellipse is a circle. E
– The gradient is big in the wi   
direction in which we wi
only want to travel a This equation is sick.
small distance.
G. Hinton, 2006
How the Learning Goes Wrong
• If the learning rate is big,
it sloshes to and fro across
the ravine. If the rate is too
big, this oscillation
diverges. E
• How can we move quickly
in directions with small
gradients without getting
w
divergent oscillations in
directions with big
gradients?
G. Hinton, 2006
Five Ways to Speed up Learning
• Use an adaptive global learning rate
– Increase the rate slowly if its not diverging
– Decrease the rate quickly if it starts diverging
• Use separate adaptive learning rate on each connection
– Adjust using consistency of gradient on that weight axis
• Use momentum
– Instead of using the gradient to change the position of the weight
“particle”, use it to change the velocity.
• Use a stochastic estimate of the gradient from a few cases
– This works very well on large, redundant datasets.
G. Hinton, 2006
The Momentum Method
Imagine a ball on the error
surface with velocity v.
– It starts off by following the
gradient, but once it has Dw(t) = v(t)
velocity, it no longer does ¶E
steepest descent. = a Dw(t -1) - e (t)
¶w
• It damps oscillations by
combining gradients with
opposite signs.
• It builds up speed in directions
with a gentle but consistent
gradient.
G. Hinton, 2006
How to Initialize weights? Saturated
• Use small random

numbers. For instance
small numbers
between [-0.2, 0.2].
• Some numbers are
positive and some are
negative.
• Why are the initial
weights should be
small? 1
1  e  wx
Neural Network Software
• Weka (Java):
http://www.cs.waikato.ac.nz/ml/weka/
• NNClass and NNRank (C++): J. Cheng, Z.

Wang, G. Pollastri. A Neural Network Approach to
Ordinal Regression. IJCNN, 2008
NNClass Demo
• Abalone data:
http://archive.ics.uci.edu/ml/datasets/Abalone
Abalone (from Spanish Abulón) are a group of shellfish (mollusks) in the

family Haliotidae and the Haliotis genus. They are marine snails
http://sysbio.rnet.missouri.edu/multico
m_toolbox/tools.html
Problem of Neural Network
• Vanishing gradients
• Cannot use unlabeled data
• Hard to understand the relationship between
input and output
• Cannot generate data
Deep Learning Revolution
2012: Is deep learning a revolution in artificial
intelligence?
Accomplishments
Apple’s Siri virtual personal assistant
Google’s Street View & Self-Driving Car
Google/Facebook/Tweeter/Yahoo Deep
Learning Acquisition
Hinton’s Hand Writing Recognition
CASP10 protein contact map prediction

hidden layer cj
• A model for a distribution
over binary vectors
h
• Probability of a vector, v,
under the model is defined wij
via an “energy” v
bi
visible layer
Instead of attempting to sample from joint
distribution p(v,h) (i.e. p∞), sample from
p1(v,h).
Hinton, Neural Computation(2002)

Faster and lower variance in sample.
Partials of E(v, h) easy to calculate.
j j
i i
t=0 t=1

Gradient of the likelihood with respect to wij ≈ the
difference between interaction of vi and hj at time 0 and
at time 1.
Hidden
j j
Layer
Visible i i
Layer
t=0 t=1

at time 1.
Hidden
j j
Layer
Visible i i
Layer
t=0 t=1

at time 1.
Hidden
j j
Layer
Visible i i
Layer
t=0 t=1
Δwi,j = <vi pj0> - <pi1pj1> Hinton, Neural Computation(2002)

ɛ is the learning rate, η is the weight cost, and υ the momentum.
Gradient Smaller Weights Avoid Local Minima

Face or
not ?
……
Lines,
circles,
squares
Image
Brain Learning
pixels
Objective of Iterative Gradient
Unsupervised Descent Approach:
Learning:
Find wi,j to maximize the Adjust wi,j to increase the
likelihood p(v) of visible data likelihood according to gradient
~350 nodes
~500 nodes
~500 nodes
wi,j
~400 input nodes
A Vector of ~400 Features (numbers between 0 and 1)

1. Weights are learned
[0,1]
layer by layer via
unsupervised learning.
…
2. Final layer is learned as a
…
supervised neural
… network.
… 3. All weights are fine-
…
tuned using supervised
back propagation.
…
Hinton and Salakhutdinov, Science, 2006

1. Weights are learned
[0,1]
layer by layer via
unsupervised learning.
…
2. Final layer is learned as a
…
supervised neural
… network.
… 3. All weights are fine-
…
tuned using supervised
back propagation.
…
Hinton and Salakhutdinov, Science, 2006

LSDEKIINVDF KPSEERVREII
Speed up training by
CUDAMat and GPUs …
…
…
…
…
Train DNs with over 1M …
…
parameters in about an
[0,1]
hour
Demo:
http://www.cs.toronto.edu/~hinton/digits.html
Various Deep Learning
Architectures
• Deep belief network
• Deep neural networks
• Deep autoencoder
• Deep convolution networks
• Deep residual network
• Deep recurrent network
Deep Belief Network
Deep
AutoEncoder
Deep Convolution Neural
Network
Deep Recurrent Neural Network
An Example
Deep Residual Network
the rectifier is an activation function defined as

A unit employing the rectifier is also called a rectified linear unit
(ReLU)
• Prevent from over-fitting
• Prevent units from co-
adapting
• Training: remove randomly
selected units according to a
rate (0.5)
• Testing: multiply all the units
with dropout rate
Deep Learning Tools
• Pylearn2
• Theano
• Caffe
• Torch
• Cuda-convnet
• Deeplearning4j
March 29, 2019 Data Mining: Concepts and 136

Techniques
Googles’s
TensorFlow
TensorFlow™ is an open source software library for
numerical computation using data flow graphs. Nodes in the
graph represent mathematical operations, while the graph
edges represent the multidimensional data arrays (tensors)
communicated between them. The flexible architecture
allows you to deploy computation to one or more CPUs or
GPUs in a desktop, server, or mobile device with a single
API. TensorFlow was originally developed by researchers
and engineers working on the Google Brain Team within
Google's Machine Intelligence research organization for the
purposes of conducting machine learning and deep neural
networks research, but the system is general enough to be
applicable in a wide variety of other domains as well.
https://www.tensorflow.org/
Acknowledgements
• Geoffery Hinton’s slides
• Jesse Eickholt’s slides
• Images.google.com

NN Theory

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

NN Theory

Caricato da

Copyright:

Formati disponibili

Statistical Machine Learning

Methods for Bioinformatics

Question: How to automatically predict output given input?

Input: words of news Training: Build a model (classifier)

Classification: recognize each number

Adapted from http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Fish length vs. weight?

• Least square error:  i   2

Minimize negative log-likelihood:

price  x fishw fish  xchipswchips  xbeer wbeer

 y1   1 x11 ... x1d   w0 

n*1 n*(d+1) (d+1)*1

Famous Delta Rule

Output Unit out O = f(Σwixi), f is activation

Activation wTx = Σwixi

misclassified data points.

• For any data set which is linearly separable,

How to learn it?

Update rule: w ( t 1)

How to learn this network? Once again, gradient descent.

How about we augument feature

w11 w1i wji w

w11 w1i wji w

w11 w1i wji aj: w x

• Until a number of iterations or errors drops below a

1. Recurrent network is essentially a series of feed-forward

2. Recurrent network is good for time series data and sequence

Overfitting function can not generalize well to unseen data.

• Use small random

• NNClass and NNRank (C++): J. Cheng, Z.

Abalone (from Spanish Abulón) are a group of shellfish (mollusks) in the

Hinton’s Hand Writing Recognition

CASP10 protein contact map prediction

Hinton, Neural Computation(2002)

Hinton, Neural Computation(2002)

Hinton, Neural Computation(2002)

Hinton, Neural Computation(2002)

Δwi,j = <vi pj0> - <pi1pj1> Hinton, Neural Computation(2002)

Gradient Smaller Weights Avoid Local Minima

~400 input nodes

A Vector of ~400 Features (numbers between 0 and 1)

Hinton and Salakhutdinov, Science, 2006

Hinton and Salakhutdinov, Science, 2006

the rectifier is an activation function defined as

March 29, 2019 Data Mining: Concepts and 136

Potrebbero piacerti anche

n1 n(d+1) (d+1)*1