Chapter 1
This chapter covers the principal concepts of pattern recognition. It deals with:
Polynomial curve fitting
Parameter optimization
Generalization
Model complexity
Pattern recognition: probability, decision criteria, Bayes' theorem
1.1 Character recognition
Pattern recognition covers a broad class of information processing problems. An approach will be developed based on sound theoretical concepts. The information involved is normally probabilistic in nature, and a statistical framework provides the means both to operate on the data and to represent the results.
Consider image data represented by a vector $\mathbf{x} = (x_1, \dots, x_d)^T$, where d is the total number of variables. The image has to be assigned to one of k classes $C_1, C_2, \dots, C_k$. The aim is to develop an algorithm or system that classifies such data. To do this, a collection of known data is gathered for each class (a data set or sample), which is then used in the development of a classification algorithm. The collection of available data is known as the training set; it will normally contain far fewer examples than the total number of possible variations within a class. The algorithm should nevertheless be able to correctly classify data that was not used in the training set; this ability is called generalization.
e.g. in d-dimensional space with d = 20 and a four-bit representation of each variable, there are $2^{20 \times 4} = 2^{80}$ possible distinct images: far too big a number to cover with a sample.
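To get a feel for the size of this number, a quick computation (values taken from the example above):

```python
# Count the distinct images representable with d variables of b bits
# each: every variable takes 2**b values, so there are
# (2**b)**d = 2**(b*d) possible images.
d, b = 20, 4

n_images = 2 ** (b * d)
print(f"{n_images:.3e} possible images")  # ~1.209e+24, i.e. 2**80
```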
To reduce this, the raw data can be combined into features, which reduces the 'data set' considerably. Features could consist of ratios, averages, etc., depending on the problem. A feature can then be used to determine a decision boundary or threshold, provided there is a marked distinction between the classes for that particular feature.
e.g. elements of class C2 may have larger values of feature x than elements of class C1, although the two classes may overlap. See Fig. 1.2.
[Fig. 1.1 and Fig. 1.2]
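A minimal sketch of such a threshold rule (the feature values and the midpoint threshold are invented for illustration):

```python
# Classify a sample by thresholding a single scalar feature: values
# above the threshold are assigned to C2, the rest to C1.

def classify(x: float, threshold: float) -> str:
    return "C2" if x > threshold else "C1"

c1_values = [0.8, 1.1, 1.3, 0.9]   # hypothetical training features, class C1
c2_values = [2.1, 2.4, 1.9, 2.6]   # hypothetical training features, class C2

mean_c1 = sum(c1_values) / len(c1_values)
mean_c2 = sum(c2_values) / len(c2_values)
threshold = (mean_c1 + mean_c2) / 2  # simple midpoint decision boundary

print(classify(1.0, threshold))  # C1
print(classify(2.2, threshold))  # C2
```

With overlapping classes, any single threshold will misclassify some samples; the choice of boundary trades one kind of error against the other.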
Misclassifications are inevitable. However, they can be reduced by increasing the number of features, since better decision boundaries can be obtained with more features; the number of features should nevertheless not exceed a certain limit.
Some definitions
Translational invariance – a classification system whose decisions are insensitive to the location of the object (data) in an image (space).
Scale invariance – a feature does not depend on the size of the object.
Prior knowledge – information known about the desired form of the solution.
The mapping can be modelled as a polynomial,

$$y(x; \mathbf{w}) = w_0 x^0 + w_1 x^1 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j,$$

whose weights are chosen to minimize the sum-of-squares error

$$E = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n; \mathbf{w}) - t_n \right\}^2,$$

where $t_n$ is the desired output (it can be $x_n$ or some other 'prior' value) and $x_n$ is the n-th data point.
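A least-squares fit of this form can be sketched with NumPy (a minimal example; the data points are invented):

```python
import numpy as np

# Hypothetical 1-D training data: inputs x_n and targets t_n.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.array([0.48, 0.95, 0.52, 0.07, 0.45])

M = 3  # polynomial order

# Design matrix with columns x**0, x**1, ..., x**M, so that
# X @ w computes y(x_n; w) = sum_j w_j * x_n**j for every n.
X = np.vander(x, M + 1, increasing=True)

# The least-squares solution minimizes sum_n (y(x_n; w) - t_n)**2,
# i.e. the same minimum as the sum-of-squares error E above.
w, *_ = np.linalg.lstsq(X, t, rcond=None)

y = X @ w                        # fitted values y(x_n; w*)
E = 0.5 * np.sum((y - t) ** 2)   # sum-of-squares error
print("weights:", w, " E =", E)
```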
Supervised learning – learning which involves some known target value.
Unsupervised learning – no target is given. The goal is not an input-output mapping but to model the probability distribution of the data or some other inherent structure.
Reinforcement learning – information is supplied to determine the quality of the output, but no actual target values are given.
A system must be able to generalize well in order to cope with noise. Test data (data not used in training) is used to determine the generalization performance of a system.
1.5 Generalization
$$E_{\mathrm{RMS}} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left\{ y(x_n; \mathbf{w}^*) - t_n \right\}^2}$$

This is known as the root-mean-square error, where $\mathbf{w}^*$ denotes the best set of weights (minimum error).
Example.

$$h(x) = 0.5 + 0.4 \sin(2\pi x)$$
[Fig. 1.6: M = 1, bad fit. Fig. 1.7: M = 3, good fit. Fig. 1.8: M = 10, over-fitting. Fig. 1.9: the test-set error against M can be used to determine the best order; the minimum is at M = 3.]
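The experiment behind these figures can be reproduced with a short script (a sketch; the sample sizes, noise level and seed are assumed, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

def rms(x, t, w):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

# Noisy samples of h(x) for training and testing (assumed noise level).
x_train = rng.uniform(0, 1, 30)
t_train = h(x_train) + rng.normal(0, 0.05, x_train.size)
x_test = rng.uniform(0, 1, 100)
t_test = h(x_test) + rng.normal(0, 0.05, x_test.size)

for M in (1, 3, 10):
    X = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t_train, rcond=None)
    print(f"M={M:2d}  train E_RMS={rms(x_train, t_train, w):.3f}  "
          f"test E_RMS={rms(x_test, t_test, w):.3f}")

# Typically the training error keeps falling as M grows, while the
# test error is smallest at a moderate order and rises again when the
# model starts to over-fit the noise.
```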
For 2-D data the same procedure applies. Selecting the model order in this way helps reduce both bias and variance; the limit is set by the amount of noise.
1.7 Multivariate non-linear functions
Mapping polynomials can be extended to higher dimensions. E.g. for d input variables and one output variable, the mapping could be chosen as

$$y = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1} + \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2} + \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} \sum_{i_3=1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3}$$
However, the number of independent parameters would grow as $d^M$ for an M-th order polynomial, which would require a very large amount of training data. The importance of neural networks is how they deal with this problem of scaling and dimensionality. These models represent non-linear functions of many variables in terms of superpositions of non-linear functions of a single variable, known as hidden functions, which are adapted to the data. We shall consider the multi-layer perceptron and the radial basis function network. For such networks the error falls as $O(1/M)$, where M is the number of hidden units, whereas for polynomials it decreases as $O(1/M^{2/d})$. However, this approach is computationally intensive and has the problem of multiple minima in the error function.
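The scaling argument can be made concrete with a small parameter count (a rough sketch: $d^M$ is used as the naive count of M-th order terms, and the network count assumes one hidden layer with M units, d inputs and one output):

```python
# Compare how the number of adjustable parameters grows with the input
# dimension d for a fixed model "size" M.

def poly_params(d: int, M: int) -> int:
    return d ** M  # naive count of M-th order polynomial terms

def mlp_params(d: int, M: int) -> int:
    # M hidden units, each with d weights + 1 bias, plus an output
    # unit with M weights + 1 bias.
    return M * (d + 1) + (M + 1)

for d in (2, 10, 50):
    print(f"d={d:3d}  poly(M=3)={poly_params(d, 3):7d}  "
          f"mlp(M=3)={mlp_params(d, 3):4d}")

# The polynomial count explodes with d; the network count grows linearly.
```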
e.g. if a sample of letters contains 3 'A's and 4 'B's, we can say that the prior probabilities of A and B are P(A) = 3/7 and P(B) = 4/7 respectively. Using priors alone, every new character would be assigned to 'B', since it has the higher prior probability. If four more characters were sampled and the prior probabilities became P(A) = 3/11, P(B) = 4/11 and P(C) = 4/11, then each new character would be tied between B and C. This means that prior probabilities are an insufficient means of classification. However, they have to be taken into consideration, and additional classification criteria are a necessity. Here is where features of the data come in.
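A tiny sketch of classification by priors alone, using the counts from this example:

```python
from collections import Counter

def priors(sample: str) -> dict:
    counts = Counter(sample)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

p = priors("AAABBBB")        # 3 'A's and 4 'B's
print(p)                     # A: 3/7 ~ 0.429, B: 4/7 ~ 0.571
print(max(p, key=p.get))     # 'B': priors alone always pick B

p = priors("AAABBBBCCCC")    # after sampling four more characters
print(p)                     # A: 3/11, B: 4/11, C: 4/11 (B and C tied)
```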
Some definitions
Joint probability P(Ck, Xl) – the probability that an object has feature value Xl and belongs to class Ck.
Example. Consider two classes C1 and C2 and a feature X taking the values X1 and X2.
         X1    X2    Total
C1       19    41     60
C2       12    28     40
Total    31    69    100

Row totals: C1 = 60 and C2 = 40, so the total number of samples is 60 + 40 = 100.
From the definition, P(C1|X2) = 41/69. It can also be said that P(X2|C1) = 41/60.

P(X2|C1) is the probability that a sample has feature value X2 given that it belongs to class C1. It is known as the class-conditional probability of X2 for class C1.

The joint probability can then be obtained in two equivalent ways:

P(C1, X2) = P(X2|C1) P(C1) = (41/60) × (60/100) = 0.41, OR
P(C1, X2) = P(C1|X2) P(X2) = (41/69) × (69/100) = 0.41.
It can also be clearly seen from the table that the (joint) probability of a sample having feature X2 and belonging to class C1 is 41 out of 100 samples = 0.41.
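These relationships can be checked in a few lines (counts taken from the table above):

```python
# Contingency-table counts: rows are classes, columns are X1 and X2.
counts = {"C1": (19, 41), "C2": (12, 28)}
total = sum(sum(row) for row in counts.values())   # 100

n_c1_x2 = counts["C1"][1]                          # 41
n_x2 = counts["C1"][1] + counts["C2"][1]           # 69
n_c1 = sum(counts["C1"])                           # 60

print(n_c1_x2 / total)   # joint P(C1, X2)              = 0.41
print(n_c1_x2 / n_x2)    # conditional P(C1 | X2)       = 41/69
print(n_c1_x2 / n_c1)    # class-conditional P(X2 | C1) = 41/60

# Product rule: both factorizations recover the joint probability.
print((n_c1_x2 / n_c1) * (n_c1 / total))   # 0.41
print((n_c1_x2 / n_x2) * (n_x2 / total))   # 0.41
```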
Let's use the same example, but shorten each event to its one-letter initial, i.e. A, B, C and D instead of Aberations, Brochmailians, Chompieliens and Defective.

P(D|B) is not a Bayes problem; it is given in the problem. Bayes' formula finds the reverse conditional probability, P(B|D). The idea is that the given event D is made up of three parts: the part of D in A, the part of D in B, and the part of D in C.
$$P(B|D) = \frac{P(B \text{ and } D)}{P(A \text{ and } D) + P(B \text{ and } D) + P(C \text{ and } D)}$$
Inserting the multiplication rule for each of these joint probabilities gives
$$P(B|D) = \frac{P(D|B)\,P(B)}{P(D|A)\,P(A) + P(D|B)\,P(B) + P(D|C)\,P(C)}$$
However, and I hope you agree, it is much easier to take the joint probability
divided by the marginal probability. The table does the adding for you and
makes the problems doable without having to memorize the formulas.
Company              Good                   Defective            Total
(A) Aberations       0.50 - 0.025 = 0.475   0.05(0.50) = 0.025   0.50
(B) Brochmailians    0.30 - 0.021 = 0.279   0.07(0.30) = 0.021   0.30
(C) Chompieliens     0.20 - 0.020 = 0.180   0.10(0.20) = 0.020   0.20
Total                0.934                  0.066                1.00
N.B. The marginal probability in this case is P(D) = 0.066. P(C1) and P(C2) are likewise marginal probabilities. Note that the marginal probabilities add up to 1.00 (here 0.934 + 0.066 = 1.00). Indeed,

P(A and D) + P(B and D) + P(C and D) = 0.025 + 0.021 + 0.020 = 0.066 = P(D)
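A short numeric check of this calculation (probabilities taken from the table above):

```python
# Priors P(company) and defect rates P(D | company) from the example.
prior = {"A": 0.50, "B": 0.30, "C": 0.20}
defect_rate = {"A": 0.05, "B": 0.07, "C": 0.10}

# Multiplication rule: joint probabilities P(company and D).
joint = {c: defect_rate[c] * prior[c] for c in prior}
p_d = sum(joint.values())                  # marginal P(D) ~ 0.066

# Bayes' theorem: posterior P(company | D) = joint / marginal.
posterior = {c: joint[c] / p_d for c in joint}
print(p_d)                      # ~ 0.066
print(posterior["B"])           # P(B|D) = 0.021 / 0.066 ~ 0.318
print(sum(posterior.values()))  # posteriors sum to 1 (normalization)
```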
Thus it can be said that the denominator acts as a normalizing factor: the conditional probabilities satisfy

$$P(C_1|X_2) + P(C_2|X_2) = 1.$$

Here P(X2) has been expressed in terms of the prior probabilities and the class-conditional probabilities.
For a continuous variable x with probability density p(x),

$$P(x \in [a, b]) = \int_a^b p(x)\,dx \tag{1.14}$$

$$P(x \in R) = \int_R p(x)\,dx \tag{1.15}$$

The expectation of a function Q(x) is

$$\mathcal{E}[Q] = \int Q(x)\,p(x)\,dx \tag{1.16}$$

Bayes' theorem for a continuous feature variable becomes

$$P(C_k|x) = \frac{p(x|C_k)\,P(C_k)}{p(x)} \tag{1.21}$$

where the unconditional density is

$$p(x) = \sum_{k=1}^{c} p(x|C_k)\,P(C_k) \tag{1.22}$$

and the posterior probabilities sum to one:

$$\sum_{k=1}^{c} P(C_k|x) = 1 \tag{1.23}$$
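These formulas can be verified numerically; the sketch below assumes two classes with invented Gaussian class-conditional densities and priors:

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Gaussian density used as an assumed class-conditional p(x|Ck).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = [0.6, 0.4]                 # assumed P(C1), P(C2)
params = [(0.0, 1.0), (2.0, 0.5)]   # assumed (mean, std) for C1, C2

x = np.linspace(-6, 8, 2001)
cond = [gaussian(x, mu, s) for mu, s in params]

# Eq. (1.22): unconditional density p(x) as a prior-weighted mixture.
p_x = sum(P * p for P, p in zip(priors, cond))

# Eq. (1.21): posteriors P(Ck|x) via Bayes' theorem.
post = [P * p / p_x for P, p in zip(priors, cond)]

print(np.trapz(p_x, x))                      # ~ 1, as in eq. (1.15)
print(np.allclose(post[0] + post[1], 1.0))   # eq. (1.23) holds at every x
```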