
NEURAL NETWORKS STUDY NOTES.

Chapter 1

This chapter covers the principal concepts of pattern recognition. It deals with:
- Polynomial curve fitting
- Parameter optimization
- Generalization
- Model complexity
- Pattern recognition -> probability, decision criteria, Bayes' theorem
1.1 Character recognition
Pattern recognition covers a broad range of information processing problems. An approach will be developed based on sound theoretical concepts. Information is normally probabilistic in nature, and a statistical framework provides the means both for operating on the data and for representing the results.
Consider image data represented by a vector x = (x_1, ..., x_d)^T, where d is the total number of variables. The image has to be assigned to one of k classes C_1, C_2, ..., C_k. The aim is to develop an algorithm or system to classify such data. To do this, a collection of known data is gathered for each class (a data set or sample) and used in the development of a classification algorithm. This collection of available data is known as a training set. It is normally much smaller than the total number of possible variations within a class. The algorithm should be able to correctly classify data which was not used in the training set; this is called generalization.
e.g. for a d-dimensional space with d = 20 and a four-bit representation per variable, there are 2^(20x4) = 2^80 possible inputs. Too big a number to enumerate in a sample.

To reduce this, the raw data can be combined into features, which reduces the size of the data set considerably. Features could consist of ratios, averages, etc., depending on the problem. A feature can then be used to determine a decision boundary or threshold, provided there is a marked distinction between the classes for that particular feature.
e.g. elements of class C2 tend to have larger values of feature x than elements of class C1, although the two classes may overlap (see Fig 1.1 and Fig 1.2).
Misclassifications are then inevitable. However, they can be reduced by increasing the number of features, which in turn should not exceed a certain limit; better decision boundaries can often be obtained with more features.
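As an illustration, here is a minimal Python sketch of the single-feature threshold idea; the class means, spread, sample sizes and threshold are made-up values, not taken from the notes.

# A minimal sketch (synthetic data, illustrative only): classifying on a single
# feature x with a simple threshold, as in Fig 1.2 where class C2 tends to have
# larger feature values than class C1 but the two classes overlap.
import numpy as np

rng = np.random.default_rng(0)
x_c1 = rng.normal(loc=2.0, scale=1.0, size=100)   # feature values for class C1
x_c2 = rng.normal(loc=4.0, scale=1.0, size=100)   # feature values for class C2 (larger on average)

threshold = 3.0                                   # decision boundary on the feature axis

def classify(x, threshold=threshold):
    """Assign C2 if the feature exceeds the threshold, otherwise C1."""
    return np.where(x > threshold, "C2", "C1")

# Because the class distributions overlap, some points are inevitably misclassified.
errors_c1 = np.sum(classify(x_c1) == "C2")
errors_c2 = np.sum(classify(x_c2) == "C1")
print("misclassified:", errors_c1 + errors_c2, "out of", x_c1.size + x_c2.size)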

1.2 Classification and regression


The overall system maps a set of input variables (from which features are extracted) to an output variable y, or to several outputs y_k where k = 1, ..., c and c is the number of outputs. The exact mapping is not known in advance; it is learned from a set of training data, which is fed into a mapping function with adjustable weights, y_k = y_k(x; w), where w is a vector or matrix of parameters (weights).
A neural network is a particular choice of the set of functions (or functional forms) y_k(x; w), together with procedures for optimizing the mapping.
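As a minimal sketch of what "a particular choice of functions y_k(x; w)" can look like, the following Python snippet defines a tiny one-hidden-layer mapping; the layer sizes and the tanh hidden units are illustrative assumptions, not a prescription from these notes.

# A minimal sketch of a parameterized mapping y_k(x; w): a tiny one-hidden-layer
# network whose adjustable parameters w are two weight matrices and two bias vectors.
import numpy as np

def y(x, w):
    """Map an input vector x to c outputs, given parameters w = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = w
    hidden = np.tanh(W1 @ x + b1)     # non-linear hidden functions of the input
    return W2 @ hidden + b2           # c output values y_k(x; w)

rng = np.random.default_rng(1)
d, m, c = 3, 5, 2                      # d inputs, m hidden units, c outputs (arbitrary sizes)
w = (rng.normal(size=(m, d)), np.zeros(m),
     rng.normal(size=(c, m)), np.zeros(c))
print(y(rng.normal(size=d), w))        # two outputs for one d-dimensional input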
In classification problems we seek to approximate the probabilities of membership of the different classes, expressed as functions of the input variables.
Apart from classification, another pattern recognition task is regression. In regression the output represents the value of a continuous variable, and the aim is to predict unknown or future values given past values. In other words, given a data set, the relationship underlying the data, i.e. the function connecting inputs to outputs, is determined, and this approximated function is then used to determine values not present in the data set. In regression, it is the regression function which is approximated (see example).

A classical definition of regression


A term coined by Galton for the tendency of the quantitative traits of
offspring to be closer to the population mean than are their parents' traits.
It arises from a combination of factors - dominance, gene interactions, and
environmental influences on traits.

1.3 Pre-processing and feature extraction


The sequence of operations on the data is:

raw data -> pre-processing / feature extraction (a fixed transformation of the variables) -> classification network (contains the adaptive parameters) -> post-processing (converting the outputs to the required form)

Some definitions
Translational invariance - a classification system whose decisions are insensitive to the location of the object (data) in the image (space).
Scale invariance - a feature does not depend on the size of the object.
Prior knowledge - information known about the desired form of the solution.

1.4 The curse of dimensionality


Increasing the number of features can improve the performance of the classification system; however, beyond a critical point, adding more features reduces performance.
Imagine two input variables x1 and x2 (d = 2). If each variable is divided into M divisions, the number of cells in the input space is M^d (Fig 1.3):
for M = 2, M^d = 2^2 = 4; for M = 3, M^d = 3^2 = 9.
The number of cells grows exponentially with the dimensionality of the input space, and the amount of training data must grow at least as fast. That is the curse.
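A short Python sketch of this growth (the particular values of M and d are arbitrary):

# The number of cells M**d needed to grid the input space, for M divisions per
# variable and d input variables.
for d in (1, 2, 3, 10, 20):
    for M in (2, 3, 10):
        print(f"d={d:2d}, M={M:2d} -> {M**d} cells")
# For M=10 and d=20 this is already 10**20 cells, far more than any realistic
# training set could populate.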
- Since real data is somewhat correlated, it tends to fill only a subspace of lower dimensionality: the intrinsic dimensionality.
- Outputs do not change arbitrarily from one region of input space to another, so output values at intermediate points can be inferred, as in interpolation.
These properties make it possible to reduce the number of inputs, with the aim of reducing the dimensionality.

Polynomial curve fitting


We are trying to fit a polynomial of order M to N data points:

y(x; w) = w_0 x^0 + w_1 x^1 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j

The parameters w_j are determined by minimizing the error function

E = \frac{1}{2} \sum_{n=1}^{N} [ y(x_n; w) - t_n ]^2 ,

where x_n is the nth data point and t_n is its desired (target) output, a value known in advance.
Supervised learning - learning which involves a known target value for each input.
Unsupervised learning - no targets; the goal is not an input-output mapping but to model the probability distribution of the data or some other inherent structure.
Reinforcement learning - information is supplied about the quality of the output, but no actual target values are given.
A system must be able to generalize well in order to cope with noise. Test data is used to determine a system's generalization capabilities; it is data generated in the same way as the training data (including noise) but not used for training.
The order of the polynomial gives the number of degrees of freedom. The fit improves as the degrees of freedom increase, but only up to a certain point: the curve should not fit the data exactly. That would be over-fitting, which gives a poor representation of the original underlying function.
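A minimal Python sketch of this fitting procedure, using synthetic noisy samples of the function h(x) = 0.5 + 0.4 sin(2πx) from the example in the next section; the sample size, noise level and order M are assumptions for illustration.

# Fit a polynomial of order M to N noisy samples by minimizing the
# sum-of-squares error E = 0.5 * sum_n (y(x_n; w) - t_n)**2.
import numpy as np

rng = np.random.default_rng(42)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=N)  # targets with noise

A = np.vander(x, M + 1, increasing=True)          # design matrix: columns x**0 ... x**M
w, *_ = np.linalg.lstsq(A, t, rcond=None)         # least-squares solution minimizes E

def y(x_new, w=w):
    """Evaluate the fitted polynomial y(x; w) = sum_j w_j * x**j."""
    return np.vander(np.atleast_1d(x_new), len(w), increasing=True) @ w

E = 0.5 * np.sum((A @ w - t) ** 2)                # training error for the fitted weights
print("fitted weights:", w, "training error:", E)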

1.5 Generalization
E_{RMS} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} [ y(x_n; w^*) - t_n ]^2 }

This is known as the root-mean-square (RMS) error, where w^* denotes the best set of weights (the minimum of the error function).
Example.

For a given function

h(x) = 0.5 + 0.4 \sin(2\pi x)

[Fig 1.6: M = 1, a poor fit. Fig 1.7: M = 3, a good fit. Fig 1.8: M = 10, over-fitting. Fig 1.9: the error on a test set can be used to determine the best order; M = 3 gives the minimum.]
For 2D data:

[Fig 1.11: a good decision boundary. Fig 1.12: an over-fitted boundary.]

A model with little flexibility has a high bias.

A model with too much flexibility has a high variance.
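A minimal Python sketch of using a test set to choose the order M: training and test data are generated in the same way (synthetic, with an assumed noise level), and E_RMS is compared for a range of orders.

# Compare E_RMS on a training set and an independently generated test set for
# polynomial orders M = 1..10, to illustrate picking the order with the lowest
# test error.
import numpy as np

rng = np.random.default_rng(7)

def make_data(N, noise=0.05):
    x = rng.uniform(0.0, 1.0, N)
    t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=N)
    return x, t

def fit(x, t, M):
    A = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def e_rms(x, t, w):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(15)
x_test, t_test = make_data(100)
for M in range(1, 11):
    w = fit(x_train, t_train, M)
    print(f"M={M:2d}  train E_RMS={e_rms(x_train, t_train, w):.4f}  "
          f"test E_RMS={e_rms(x_test, t_test, w):.4f}")
# Training error keeps falling with M, while test error typically reaches a
# minimum at a moderate order and then rises again as the fit starts to follow
# the noise (over-fitting): high bias at small M, high variance at large M.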

1.6 Model complexity


As can be seen above (and from Occam's razor), simple models are better than complex ones, but they should not be too simple (e.g. M = 1).
Apart from limiting the order M, another approach to controlling a model's effective complexity is to add a penalty term \Omega (a regularization term) to the error function:

\tilde{E} = E + \nu \Omega ,

where \nu determines how much \Omega influences the solution. \Omega increases with complexity (curvature):

\Omega = \frac{1}{2} \int \left( \frac{d^2 y}{dx^2} \right)^2 dx ,

therefore, after substitution,

\tilde{E} = \frac{1}{2} \sum_{n=1}^{N} [ y(x_n; w) - t_n ]^2 + \nu \frac{1}{2} \int \left( \frac{d^2 y}{dx^2} \right)^2 dx .

This helps reduce both bias and variance; the limit is set by the amount of noise in the data.
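A minimal Python sketch of this regularized fit: the curvature penalty Ω is approximated by summing (d²y/dx²)² over a grid, and the regularized error is minimized in closed form via the normal equations. The data, the grid and the value of ν are illustrative assumptions.

# Regularized polynomial fitting: minimize 0.5*||A w - t||^2 + nu * Omega,
# where Omega ~ 0.5 * integral of (d2y/dx2)**2 dx approximated on a grid.
import numpy as np

rng = np.random.default_rng(3)
N, M, nu = 15, 9, 1e-4
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=N)

A = np.vander(x, M + 1, increasing=True)                  # y(x_n; w) = A @ w

# Second derivative of sum_j w_j x**j is sum_{j>=2} w_j j (j-1) x**(j-2):
xg = np.linspace(0.0, 1.0, 200)                           # grid for the integral
dx = xg[1] - xg[0]
D = np.zeros((xg.size, M + 1))
for j in range(2, M + 1):
    D[:, j] = j * (j - 1) * xg ** (j - 2)

# Normal equations for the regularized objective:
w = np.linalg.solve(A.T @ A + nu * dx * (D.T @ D), A.T @ t)
print("regularized weights:", w)
# Larger nu penalizes curvature more strongly, smoothing the fitted curve and
# trading a little extra bias for lower variance.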
1.7 Multivariate non-linear functions
Polynomial mappings can be extended to higher dimensions. For example, with d input variables and one output variable, the mapping could be represented (here up to third order, M = 3) as

y = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1} + \sum_{i_1=1}^{d}\sum_{i_2=1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2} + \sum_{i_1=1}^{d}\sum_{i_2=1}^{d}\sum_{i_3=1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3}

However, the number of independent parameters grows as d^M, which would need a great deal of training data. The importance of neural networks lies in how they deal with this problem of scaling and dimensionality. These models represent non-linear functions of many variables in terms of superpositions of non-linear functions of a single variable, known as hidden functions, which are adapted to the data. We shall consider the multi-layer perceptron and the radial basis function network. For such networks the error falls as O(1/M), where M is the number of hidden units, whereas for polynomials it falls only as O(1/M^{2/d}). The price is that training is computationally intensive and the error function has multiple minima.
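A rough Python sketch of the scaling argument; the network parameter count M*(d + 2) + 1 assumes a single hidden layer with M units, biases and one output, which is an illustrative assumption rather than anything specified in these notes.

# Compare how parameter counts scale with input dimension d: the d**M growth
# quoted in the text for an order-M polynomial expansion, versus a roughly
# linear-in-d count for a network with a fixed number M of hidden units.
for d in (2, 5, 10, 20):
    M = 3
    poly_params = d ** M                 # grows as d**M
    mlp_params = M * (d + 2) + 1         # grows only linearly in d
    print(f"d={d:2d}: polynomial ~{poly_params:6d} params, network ~{mlp_params:4d} params")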

1.8 Bayes’ Theorem


Even without knowing anything about the features or characteristics of the data, prior probabilities can be estimated from a data set. For classification, the fraction of a class in the total number of samples is taken as its prior probability.

e.g. if a sample of letters contains 3 'A's and 4 'B's, we can say that the prior probabilities are P(A) = 3/7 and P(B) = 4/7. Using priors alone, every new character would be assigned to 'B', since it has the higher prior probability. If more characters were sampled so that a third class appeared, with priors P(A) = 3/11, P(B) = 4/11 and P(C) = 4/11, then each new character would be tied between B and C. Prior probabilities alone are therefore an insufficient basis for classification, although they must still be taken into account. Additional classification criteria are needed, and this is where the features of the data come in.
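A minimal Python sketch of classification by prior probabilities alone, using the letter counts above:

# Priors estimated as class fractions of the sample; every new character is
# assigned to the class with the largest prior, which is clearly too crude.
counts = {"A": 3, "B": 4, "C": 4}                 # characters seen so far
total = sum(counts.values())
priors = {k: v / total for k, v in counts.items()}
print("priors:", priors)
best = max(priors, key=priors.get)                # argmax over the priors
print("every new character would be labelled:", best,
      "(the tie between B and C makes even this ambiguous)")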

Some definitions

Joint probability P(C_k, X_l) - the probability that an object has feature value X_l and belongs to class C_k, where k = 1, 2, ..., n (n is the total number of classes) and l indexes the feature values.

Conditional probability P(X_l | C_k) - the probability that an observation has feature value X_l given that it belongs to class C_k. It is a measure of how strongly that feature value predominates in, or is associated with, a particular class.

Example.
Consider two classes C1 and C2 and a feature X taking values X1 and X2:

          X1     X2     Total
  C1      19     41      60
  C2      12     28      40
  Total   31     69     100

Row totals: C1 = 60, C2 = 40. Total samples = 60 + 40 = 100.

P(C1) = 60 / Total samples = 60/100  {the probability that a randomly selected sample belongs to C1}

P(X2) = 69/100  {the probability that a randomly selected sample has feature value X2}

P(C1|X2) = 41/69, directly from the definition. Similarly, P(X2|C1) = 41/60.

P(X2|C1) is the probability that a sample has feature value X2 given that it belongs to class C1. It is also known as the class-conditional probability of X2 for class C1.

P(C1, X2) = P(X2|C1) P(C1) = (41/60) x (60/100) = 0.41, or

P(C1, X2) = P(C1|X2) P(X2) = (41/69) x (69/100) = 0.41.

It can also be seen directly from the table that the (joint) probability of a sample having feature X2 and belonging to class C1 is 41 out of 100 samples = 0.41.

Since P(X2|C1) P(C1) = P(C1|X2) P(X2),

P(C1|X2) = \frac{P(X2|C1) P(C1)}{P(X2)} .

The conditional probability on the left is known as the posterior probability, and the relation above is known as Bayes' theorem.
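A minimal Python sketch that reproduces these numbers from the table and checks that both factorizations of the joint probability, and Bayes' theorem, give consistent results:

# Joint, marginal, conditional and posterior probabilities computed from the
# counts in the 2x2 table above.
counts = {("C1", "X1"): 19, ("C1", "X2"): 41,
          ("C2", "X1"): 12, ("C2", "X2"): 28}
total = sum(counts.values())                               # 100 samples

P_C1 = (counts[("C1", "X1")] + counts[("C1", "X2")]) / total       # 0.60
P_X2 = (counts[("C1", "X2")] + counts[("C2", "X2")]) / total       # 0.69
P_X2_given_C1 = counts[("C1", "X2")] / (P_C1 * total)              # 41/60
P_C1_given_X2 = counts[("C1", "X2")] / (P_X2 * total)              # 41/69

joint_a = P_X2_given_C1 * P_C1                 # P(X2|C1) P(C1) = 0.41
joint_b = P_C1_given_X2 * P_X2                 # P(C1|X2) P(X2) = 0.41
bayes = P_X2_given_C1 * P_C1 / P_X2            # posterior via Bayes' theorem

print(joint_a, joint_b)                        # both 0.41
print(bayes, P_C1_given_X2)                    # both 41/69, about 0.594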

An Excerpt from the web

Let's use the same example, but shorten each event to its one-letter initial, i.e. A, B, C and D instead of Aberations, Brochmailians, Chompieliens and Defective.
P(D|B) is not a Bayes problem; it is given in the problem. Bayes' formula finds the reverse conditional probability, P(B|D).
It is based on the fact that the given event D is made up of three parts: the part of D in A, the part of D in B, and the part of D in C.

                      P(B and D)
P(B|D) = -----------------------------------------
          P(A and D) + P(B and D) + P(C and D)

Inserting the multiplication rule for each of these joint probabilities gives

                      P(D|B)*P(B)
P(B|D) = -----------------------------------------
          P(D|A)*P(A) + P(D|B)*P(B) + P(D|C)*P(C)

However, and I hope you agree, it is much easier to take the joint probability divided by the marginal probability. The table does the adding for you and makes the problems doable without having to memorize the formulas.

  Company              Good                    Defective            Total
  (A) Aberations       0.50 - 0.025 = 0.475    0.05(0.50) = 0.025   0.50
  (B) Brochmailians    0.30 - 0.021 = 0.279    0.07(0.30) = 0.021   0.30
  (C) Chompieliens     0.20 - 0.020 = 0.180    0.10(0.20) = 0.020   0.20
  Total                0.934                   0.066                1.00

N.B. The marginal probability here is P(D) = 0.066. P(C1) and P(C2) in the earlier example are marginal probabilities. Note that the marginal probabilities in the Total column add up to 1.00.

Whereas,
P(A and D) + P(B and D) + P(C and D) = 0.025 + 0.021 + 0.020 = 0.066 = P(D).
Thus, the denominator acts as a normalizing factor: the conditional probabilities all add up to unity. Therefore,

P(C1|X2) + P(C2|X2) = 1.

From the excerpt,

P(B|D) + P(C|D) + P(A|D) = \frac{P(B,D)}{P(D)} + \frac{P(C,D)}{P(D)} + \frac{P(A,D)}{P(D)} = \frac{0.021}{0.066} + \frac{0.020}{0.066} + \frac{0.025}{0.066} = 1.

N.B. Dividing by P(D) ensures that the conditional probabilities sum to unity; without it, the sum would be 0.066. P(D) is a NORMALIZING FACTOR.
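A minimal Python sketch of the excerpt's calculation, showing P(D) = 0.066 acting as the normalizing factor:

# P(B|D) and the other posteriors from the priors and the class-conditional
# defect rates; dividing by P(D) makes the three posteriors sum to one.
priors = {"A": 0.50, "B": 0.30, "C": 0.20}          # P(company)
defect_rate = {"A": 0.05, "B": 0.07, "C": 0.10}     # P(D | company)

joint = {k: defect_rate[k] * priors[k] for k in priors}        # P(company and D)
P_D = sum(joint.values())                                      # marginal P(D) = 0.066

posterior = {k: joint[k] / P_D for k in priors}                # P(company | D)
print("P(D) =", round(P_D, 3))
print(posterior, "sum =", round(sum(posterior.values()), 6))   # posteriors sum to 1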

From the example, it is clear that P(C1, X2) + P(C2, X2) = P(X2); these are 41% + 28% = 69%.

The relation above can also be written as

P(X2) = P(X2|C1) P(C1) + P(X2|C2) P(C2) ,

so P(X2) has been expressed in terms of the prior probabilities and the class-conditional probabilities.

1.8.1 Inference and decision

The importance of Bayes' theorem is that posterior probabilities can be calculated and expressed in terms of easily obtainable quantities; in the examples, these are the values in the tables, which are obtained directly from the data.

Classification of an object with a particular feature value X_l is done by assigning the object to the class C_k for which the posterior probability P(C_k|X_l) is greatest. This is Bayes' rule.

In practice, care must be taken in data collection to avoid large disparities between our expectations (the calculated posterior probabilities) and reality; the prior probabilities must be correctly accounted for.

Implementing Bayes' theorem means evaluating the class-conditional and prior probabilities separately, then using Bayes' theorem to evaluate the posterior probabilities.

The outputs of neural networks can be interpreted as posterior probabilities, provided that the error function is chosen appropriately.

1.8.2 Bayesian versus frequentist statistics


So far we have assumed that the number of samples or observations tends to infinity; this is the frequentist view of probabilities.
More generally, probabilities express our degree of belief that a particular event will occur, and Bayes' theorem gives a precise quantitative way to update these beliefs when new data is presented.

1.8.3. Probability densities

Feature variables can also be regarded as continuous. The notation is as follows: upper case P denotes a probability, while lower case p denotes a probability density.

P(x \in [a, b]) = \int_a^b p(x)\, dx                                 (1.14)

p(x) is normalized so that the total probability over the whole range of x is 1.

More generally, for a region R of the space,

P(x \in R) = \int_R p(x)\, dx                                        (1.15)

The average or expectation of a function Q(x) is

\mathcal{E}[Q] = \int Q(x)\, p(x)\, dx                               (1.16)

For a finite set of data points drawn from p(x),

\mathcal{E}[Q] = \int Q(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{n=1}^{N} Q(x_n)      (1.17)
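A minimal Python sketch of equation (1.17): the expectation is approximated by a sample average. The choice of p(x) as a standard normal and Q(x) = x^2 is purely illustrative, so the exact expectation is 1.

# Monte Carlo estimate of E[Q] = integral of Q(x) p(x) dx via samples from p(x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # x_n drawn from p(x)
Q = x ** 2
print("sample average:", Q.mean())    # close to the exact value E[Q] = 1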

1.8.4 Baye’s theory in general

This is to incorporate the probability densities.

PCkx=pxCkP(Ck)p(x)
1.21

px=k=1cpxCkPk 1.22

k=1cPxCk=1
1.23

When class-conditional OR posterior probabilities are viewed as


parameterized

functions, they are referred to as likelihood functions.

Baye’s theorem can be minimized to:


posterior=likelihood X priornormalization factor
1.24
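A minimal Python sketch of equations (1.21) and (1.22) with continuous features: the class-conditional densities are assumed Gaussian (an illustrative choice, not something fixed by these notes), and the posteriors are obtained by normalizing likelihood x prior by p(x).

# Posteriors P(Ck|x) = p(x|Ck) P(Ck) / p(x) for two classes with Gaussian
# class-conditional densities; p(x) is the sum over classes, eq. (1.22).
import numpy as np

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

priors = {"C1": 0.6, "C2": 0.4}
params = {"C1": (2.0, 1.0), "C2": (4.0, 1.0)}       # (mean, std) per class, illustrative values

def posteriors(x):
    likelihood_times_prior = {k: gaussian(x, *params[k]) * priors[k] for k in priors}
    p_x = sum(likelihood_times_prior.values())      # normalizing factor p(x)
    return {k: v / p_x for k, v in likelihood_times_prior.items()}

print(posteriors(2.5))   # posteriors P(Ck|x) for x = 2.5; they sum to one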

1.9 Decision Boundaries

1.10 Minimizing risk
