
PATTERN RECOGNITION ASSIGNMENT-I

Submitted by
Shabeeb Ali O.
M.Tech. Signal Processing
Semester -II
Roll No.15

March 6, 2018
Contents

0.1 Question
0.2 Theory
    0.2.1 Bayesian Decision Theory
    0.2.2 Discriminant Functions and Decision Surfaces
    0.2.3 Normal density
    0.2.4 Discriminant Functions For The Normal Density
    0.2.5 Confusion Matrix
0.3 Procedure
    0.3.1 Linearly Separable Data Set
    0.3.2 Non-Linearly Separable Data Set, Real-World Data Set
0.4 Results
    0.4.1 Linearly Separable Data Set
    0.4.2 Non Linearly Separable Data Set - Overlapping Data
    0.4.3 Non Linearly Separable Data Set - Spiral Data
    0.4.4 Non Linearly Separable Data Set - Helix Data
    0.4.5 Real World Data - Glass
    0.4.6 Real World Data - Jaffe
0.5 Observation
0.1 Question
The data set given to you contains 3 folders:
1. linearly separable dataset
2. Nonlinearly separable dataset
3. Real world dataset

Implement classifiers using the Bayes decision rule for Cases I, II and III for all data sets.

0.2 Theory
0.2.1 Bayesian Decision Theory
Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It
considers the ideal case in which the probability structure underlying the categories is known perfectly,
although this sort of situation rarely occurs in practice.

Bayes formula

posterior = (likelihood × prior) / evidence

Mathematically:

P(ωj|x) = p(x|ωj) P(ωj) / p(x)

where

p(x) = Σ_{j=1}^{c} p(x|ωj) P(ωj)

c : number of classes

Posterior
Bayes formula shows that by observing the value of x we can convert the prior probability P(ωj ) to a
posteriori probability (or posterior) P(ωj |x) - the probability of the state of nature being ωj given that
feature value x has been measured.

Likelihood
We call p(x|ωj) the likelihood of ωj with respect to x, a term chosen to indicate that, other things being
equal, the category ωj for which p(x|ωj) is large is more "likely" to be the true category.

Evidence
The evidence factor p(x), can be viewed as merely a scale factor that guarantees that the posterior proba-
bilities sum to one, as all good probabilities must.
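As an illustration, a minimal MATLAB sketch of Bayes formula for a single observation; the likelihood and prior values below are hypothetical and chosen only for demonstration:

% Hypothetical two-class example: class-conditional densities p(x|w_j)
% evaluated at one observed x, together with the prior probabilities P(w_j).
likelihood = [0.6 0.2];
prior      = [0.3 0.7];

evidence  = sum(likelihood .* prior);          % p(x), the scale factor
posterior = (likelihood .* prior) / evidence   % P(w_1|x), P(w_2|x); sums to 1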

Bayes Decision Rule
For a two-class case the Bayes decision rule can be stated as:
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2.

Feature space
Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where
x lies in a d-dimensional Euclidean space R^d called the feature space.
Suppose that we observe a particular x and that we contemplate taking action αi. If the true state of
nature is ωj, then by definition we will incur the loss λ(αi|ωj). Because P(ωj|x) is the probability that the true
state of nature is ωj, the expected loss associated with taking action αi is

R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)

An expected loss is called a risk, and R(αi |x) is called the conditional risk. Whenever we encounter
a particular observation x, we can minimize our expected loss by selecting the action that minimizes the
conditional risk.
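A minimal MATLAB sketch of this computation, using a hypothetical loss matrix and posterior vector (both the names and the values are illustrative only):

% lambda(i,j) = loss incurred by taking action alpha_i when the true class is w_j
lambda = [0 1; 2 0];
post   = [0.4; 0.6];          % posterior probabilities P(w_j|x) for the observed x

R = lambda * post;            % conditional risks R(alpha_i|x), one per action
[~, best] = min(R)            % Bayes decision: the action with minimum conditional risk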

0.2.2 Discriminant Functions and Decision Surfaces


There are many different ways to represent pattern classifiers. One of the most useful is in terms of a set of
discriminant functions gi(x), i = 1, ..., c. The classifier is said to assign a feature vector x to class ωi if
gi(x) > gj(x) for all j ≠ i.
A Bayes classifier is easily and naturally represented in this way. For the general case with risks, we can
let gi(x) = −R(αi|x), because the maximum discriminant function will then correspond to the minimum
conditional risk. For the minimum error-rate case, we can simplify things further by taking gi(x) = P(ωi|x),
so that the maximum discriminant function corresponds to the maximum posterior probability. Clearly, the
choice of discriminant functions is not unique. In particular, for minimum-error-rate classification, any of
the following choices gives identical classification results, but some can be much simpler to understand or to
compute than others:

gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / Σ_{j=1}^{c} p(x|ωj) P(ωj)

gi(x) = p(x|ωi) P(ωi)

gi(x) = ln p(x|ωi) + ln P(ωi)


where ln denotes natural logarithm.

Decision Region
Even though the discriminant functions can be written in a variety of forms, the decision rules are equivalent.
The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc. If gi(x) >
gj(x) for all j ≠ i, then x is in Ri, and the decision rule calls for us to assign x to ωi. The regions are
separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant
functions.

0.2.3 Normal density
Expectation
The definition of the expected value of a scalar function f (x) defined for some density p(x) is given by

E[f(x)] ≡ ∫_{−∞}^{∞} f(x) p(x) dx

Univariate density
The continuous univariate normal density is given by

p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)^2 ]

where mean µ (expected value, average) is given by

µ ≡ E[x] = ∫_{−∞}^{∞} x p(x) dx

and the variance σ^2 (the expected squared deviation) is given by

σ^2 ≡ E[(x − µ)^2] = ∫_{−∞}^{∞} (x − µ)^2 p(x) dx
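As a sanity check, a small MATLAB sketch that evaluates this density directly and compares it with the built-in normpdf (which requires the Statistics and Machine Learning Toolbox); the parameter values are arbitrary:

mu = 0; sigma = 1;                                    % illustrative parameters
x  = linspace(-4, 4, 9);
p_manual  = 1./(sqrt(2*pi)*sigma) .* exp(-0.5*((x - mu)/sigma).^2);
p_builtin = normpdf(x, mu, sigma);                    % toolbox implementation
max(abs(p_manual - p_builtin))                        % should be close to zero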

Multivariate density
The multivariate normal density in d dimensions is written as

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − µ)^t Σ^(-1) (x − µ) ]

where x is a d-component column vector, µ is the d-component mean vector, Σ is the d × d covariance matrix,
|Σ| is its determinant and Σ^(-1) is its inverse. The covariance matrix Σ is defined as the square
matrix whose ij-th element σij is the covariance of xi and xj. The covariance of two features measures their
tendency to vary together, i.e., to co-vary.

We can use the outer product (x − µ)(x − µ)^t to write the covariance matrix as Σ = E[(x − µ)(x − µ)^t].

Thus Σ is symmetric, and its diagonal elements are the variances of the respective individual elements
xi of x (i.e., σii = σi^2), which can never be negative; the off-diagonal elements are the covariances of xi and xj,
which can be positive or negative. If the variables xi and xj are statistically independent, the covariances
σij are zero, and the covariance matrix is diagonal. If all the off-diagonal elements are zero, p(x) reduces to
the product of the univariate normal densities for the components of x. The analog to the Cauchy-Schwarz
inequality comes from recognizing that if w is any d-dimensional vector, then the variance of w^t x can never
be negative. This leads to the requirement that the quadratic form w^t Σ w never be negative. Matrices for
which this is true are said to be positive semidefinite; thus, the covariance matrix is positive semidefinite.
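To illustrate, a small MATLAB sketch that estimates a covariance matrix from synthetic 2-D data and checks its positive semidefiniteness through its eigenvalues (the data are randomly generated for demonstration only):

X = randn(500, 2) * [1 0.5; 0 1];   % synthetic samples with correlated features
Sigma = cov(X);                     % sample covariance matrix (symmetric)
eig(Sigma)                          % eigenvalues of a positive semidefinite matrix are >= 0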

0.2.4 Discriminant Functions For The Normal Density

gi(x) = ln p(x|ωi) + ln P(ωi)

This expression can be evaluated easily if the densities are multivariate normal, i.e.,

gi(x) = −(1/2) (x − µi)^t Σi^(-1) (x − µi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)    (1)

Case I: Σi = σ^2 I
The simplest case occurs when the features that are measured are independent of each other, that is, sta-
tistically independent, and when each feature has the same variance, σ^2. For example, if we were trying
to recognize an apple from an orange, and we measured the colour and the weight as our feature vector,
then chances are that there is no relationship between these two properties. The off-diagonal elements of
the covariance matrix are the covariances of the two features x1 = colour and x2 = weight. But because these
features are independent, their covariances would be 0. Therefore, the covariance matrix for both classes
would be diagonal, being merely σ^2 times the identity matrix I.
As a second simplification, assume that the variance of colours is the same as the variance of weights.
This means that there is the same degree of spreading out from the mean of colours as there is from the
mean of weights. If this is true for some class i, then the covariance matrix for that class will have identical
diagonal elements. Finally, suppose that the variance for the colour and weight features is the same in both
classes. This means that the degree of spreading for these two features is independent of the class from
which you draw your samples. If this is true, then the covariance matrices will be identical. When normal
distributions that have a diagonal covariance matrix equal to a constant multiplied by the identity matrix
are plotted, their clusters of points about the mean are spherical in shape.
Geometrically, this corresponds to the situation in which the samples fall in equal-size hyperspherical
clusters, the cluster for the ith class being centered about the mean vector µi (see Figure). The computation
of the determinant and the inverse of Σi is particularly easy:
|Σi| = σ^(2d)  and  Σi^(-1) = (1/σ^2) I
Because both |Σi | and the (d/2) ln 2π term in equation are independent of i, they are unimportant additive
constants that can be ignored. Thus, we obtain the simple discriminant functions
gi(x) = −||x − µi||^2 / (2σ^2) + ln P(ωi)

where || · || denotes the Euclidean norm, that is,

||x − µi||^2 = (x − µi)^t (x − µi)

If the prior probabilities are not equal, the above equation shows that the squared distance ||x − µi||^2 must
be normalized by the variance σ^2 and offset by adding ln P(ωi); thus, if x is equally near two different mean
vectors, the optimal decision will favor the a priori more likely category.
Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute
distances. Expansion of the quadratic form (x − µi )t (x − µi ) yields

gi(x) = −(1/(2σ^2)) [x^t x − 2 µi^t x + µi^t µi] + ln P(ωi)

which appears to be a quadratic function of x. However, the quadratic term xT x is the same for all i, making
it an ignorable additive constant. Thus, we obtain the equivalent linear discriminant functions
gi(x) = wi^t x + wi0

where

wi = (1/σ^2) µi

and

wi0 = −(1/(2σ^2)) µi^t µi + ln P(ωi)
We call wi0 the threshold or bias for the ith category.
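A minimal MATLAB sketch of this Case I decision rule; the means, shared variance, priors and test sample below are illustrative values, not taken from the assignment data:

% Illustrative setup: two 2-D classes with a shared variance sigma^2.
mu     = [0 0; 3 3];            % row i = estimated mean of class i
sigma2 = 1.0;                   % shared variance sigma^2
prior  = [0.5 0.5];             % prior probabilities P(w_i)
x      = [2.5 2.8];             % one test sample

g = zeros(1, size(mu, 1));
for i = 1:size(mu, 1)
    % g_i(x) = -||x - mu_i||^2 / (2*sigma^2) + ln P(w_i)
    g(i) = -sum((x - mu(i,:)).^2) / (2*sigma2) + log(prior(i));
end
[~, label] = max(g)             % assign x to the class with the largest discriminant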

Case II : Σi = Σ
Another simple case arises when the covariance matrices for all of the classes are identical but otherwise
arbitrary. Since it is quite likely that we may not be able to measure features that are independent, this
section allows for any arbitrary covariance matrix for the density of each class. In order to keep things
simple, assume also that this arbitrary covariance matrix is the same for each class ωi. This means that we
allow for the situation where the colour of the fruit may covary with the weight, but the way in which it does
is exactly the same for apples as it is for oranges. Instead of having spherically shaped clusters about our
means, the shapes may be any type of hyperellipsoid, depending on how the features we measure relate to
each other. However, the clusters of each class are of equal size and shape and are still centered about the
mean for that class.

Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of
equal size and shape, the cluster for the ith class being centered about the mean vector µi . Because both
the (1/2) ln|Σi| and (d/2) ln 2π terms in eq. (1) are independent of i, they can be ignored as superfluous additive
constants. Using the general discriminant function for the normal density, the constant terms are removed.
This simplification leaves the discriminant functions of the form:

gi(x) = −(1/2) (x − µi)^t Σ^(-1) (x − µi) + ln P(ωi)
Note that the covariance matrix no longer has a subscript i, since it is the same matrix for all classes.
If the prior probabilities P (ωi ) are the same for all c classes, then the ln P (ωi ) term can be ignored. In
this case, the optimal decision rule can once again be stated very simply: To classify a feature vector x,
measure the squared Mahalanobis distance (x − µi )T Σ−1 (x − µi ) from x to each of the c mean vectors, and
assign x to the category of the nearest mean. As before, unequal prior probabilities bias the decision in favor
of the a priori more likely category.
Expansion of the quadratic form (x−µi )T Σ−1 (x−µi ) results in a sum involving a quadratic term xT Σ−1 x
which here is independent of i. After this term is dropped from above equation, the resulting discriminant
functions are again linear.

gi(x) = wi^t x + wi0

where

wi = Σ^(-1) µi

and

wi0 = −(1/2) µi^t Σ^(-1) µi + ln P(ωi)
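A minimal MATLAB sketch of this Case II rule, with an illustrative shared covariance matrix (the values below are placeholders, not the assignment data):

% Illustrative setup: two 2-D classes sharing one covariance matrix Sigma.
mu    = [0 0; 3 3];             % row i = estimated mean of class i
Sigma = [1 0.3; 0.3 1];         % common covariance matrix
prior = [0.5 0.5];
x     = [2.5 2.8];              % one test sample

g = zeros(1, size(mu, 1));
for i = 1:size(mu, 1)
    d = (x - mu(i,:))';                         % column difference vector
    % g_i(x) = -(1/2)(x - mu_i)^t Sigma^-1 (x - mu_i) + ln P(w_i)
    g(i) = -0.5 * (d' / Sigma) * d + log(prior(i));
end
[~, label] = max(g)             % with equal priors, the nearest class in Mahalanobis distance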

Case III: Σi = arbitrary


In the general multivariate normal case, the covariance matrices are different for each category. This case
assumes that the covariance matrix for each class is arbitrary. The discriminant functions cannot be simplified
further; the only term that can be dropped from equation (1) is the (d/2) ln 2π term, and the resulting
discriminant functions are inherently quadratic.

gi(x) = x^t Wi x + wi^t x + wi0

where

Wi = −(1/2) Σi^(-1),

wi = Σi^(-1) µi

and

wi0 = −(1/2) µi^t Σi^(-1) µi − (1/2) ln|Σi| + ln P(ωi)
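A minimal MATLAB sketch of the Case III quadratic rule, with illustrative per-class covariance matrices (placeholders, not the assignment data):

% Illustrative setup: each class keeps its own covariance matrix.
mu    = {[0 0], [3 3]};                         % class means
Sigma = {[1 0.2; 0.2 1], [2 -0.4; -0.4 0.5]};   % per-class covariance matrices
prior = [0.5 0.5];
x     = [2.5 2.8];                              % one test sample

g = zeros(1, numel(mu));
for i = 1:numel(mu)
    d = (x - mu{i})';
    % g_i(x) = -(1/2)(x - mu_i)^t Sigma_i^-1 (x - mu_i) - (1/2) ln|Sigma_i| + ln P(w_i)
    g(i) = -0.5 * (d' / Sigma{i}) * d - 0.5 * log(det(Sigma{i})) + log(prior(i));
end
[~, label] = max(g)                             % class with the largest discriminant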

0.2.5 Confusion Matrix


In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix,
also known as an error matrix,[4] is a specific table layout that allows visualization of the performance of an
algorithm.
Each row of the matrix represents the instances in a predicted class while each column represents the instances
in an actual class (or vice versa).[2] The name stems from the fact that it makes it easy to see if the system
is confusing two classes (i.e. commonly mislabelling one as another).

Example:
If a classification system has been trained to distinguish between class 1, class 2 and class 3, a confusion
matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27
test cases in which 8 cases belong to class 1, 6 cases belong to class 2, and 13 cases belong to class 3, the resulting
confusion matrix could look like the table below (rows: predicted class, columns: actual class):
Class 1 Class 2 Class 3
Class 1 5 2 0
Class 2 3 3 2
Class 3 0 1 11
In this confusion matrix, of the 8 actual cases that belong to class 1, the system predicted that three belong
to class 2; and of the six actual cases that belong to class 2, it predicted that two belong to class 1, three to
class 2 and one to class 3. We can see from the matrix that the system in question has trouble distinguishing
between class 1 and class 2. All correct predictions are located on the diagonal of the table, so it is easy to
visually inspect the table for prediction errors, as they will be represented by values outside the diagonal.

Classification Accuracy:
classification accuracy (%) = (sum of all diagonal elements of the confusion matrix / sum of all elements of the confusion matrix) × 100
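A short MATLAB sketch of how a confusion matrix and the accuracy can be computed from label vectors (the label values here are illustrative; confusionmat from the Statistics and Machine Learning Toolbox produces the same matrix):

% Illustrative true and predicted class labels for a few test samples.
true_labels = [1 1 1 2 2 3 3 3 3];
pred_labels = [1 2 1 2 2 3 3 1 3];

c  = max(true_labels);
CM = zeros(c);                            % CM(i,j): actual class i predicted as class j
for k = 1:numel(true_labels)
    CM(true_labels(k), pred_labels(k)) = CM(true_labels(k), pred_labels(k)) + 1;
end
accuracy = 100 * trace(CM) / sum(CM(:))   % diagonal sum over total, in percent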

0.3 Procedure
0.3.1 Linearly Separable Data Set
1. Load the train data set and the test data set into separate train and test arrays.
2. Compute the covariance matrix Σ for the training data using the cov() function of MATLAB.
3. Compute the mean of each training class and store it in µi, where i denotes the ith class.
4. Determine the discriminant function for each case (Case I, Case II and Case III).
5. For each case and each test sample, evaluate the discriminant function against every training class and
select the highest discriminant value among them.
6. The class for which the highest discriminant value is obtained is the predicted class for that test sample
(a MATLAB sketch of this procedure is given after this list).
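A minimal MATLAB sketch of the above procedure using the Case III discriminant (Cases I and II only change the discriminant line); the file names and the use of load() are placeholders, since the actual storage format of the data sets may differ:

% Placeholder file names; each file is assumed to hold one class, rows = samples.
train = {load('class1_train.txt'), load('class2_train.txt'), load('class3_train.txt')};
test  = load('test.txt');

c = numel(train);
mu = cell(1, c); Sigma = cell(1, c); prior = zeros(1, c);
for i = 1:c
    mu{i}    = mean(train{i});      % step 3: class mean
    Sigma{i} = cov(train{i});       % step 2: class covariance
    prior(i) = size(train{i}, 1);   % prior taken proportional to class size
end
prior = prior / sum(prior);

labels = zeros(size(test, 1), 1);
for n = 1:size(test, 1)
    g = zeros(1, c);
    for i = 1:c                     % steps 4-5: evaluate the discriminants
        d = (test(n,:) - mu{i})';
        g(i) = -0.5*(d'/Sigma{i})*d - 0.5*log(det(Sigma{i})) + log(prior(i));
    end
    [~, labels(n)] = max(g);        % step 6: assign the winning class
end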

0.3.2 Non-Linearly Separable Data Set, Real-World Data Set


1. Load all classes into an array of matrices.
2. Split each class into train and test sets by taking 70% of the data as the train set and the remaining
30% as the test set (see the split sketch after this list).
3. Compute the covariance matrix Σ for each training class using the cov() function of MATLAB.
4. Compute the mean of each training class and store it in µi, where i denotes the ith class.
5. Determine the discriminant function for each case (Case I, Case II and Case III).
6. For each case and each test sample, evaluate the discriminant function against every training class and
select the highest discriminant value among them.
7. The class for which the highest discriminant value is obtained is the predicted class for that test
sample.
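A small MATLAB sketch of the 70/30 split for one class matrix X (X here is random illustrative data; randperm shuffles the rows so that the split is not order-dependent):

X      = rand(100, 4);              % illustrative class data, rows = samples
n      = size(X, 1);
idx    = randperm(n);               % random permutation of sample indices
nTrain = round(0.7 * n);
trainX = X(idx(1:nTrain), :);       % 70% for training
testX  = X(idx(nTrain+1:end), :);   % remaining 30% for testing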

0.4 Results
0.4.1 Linearly Separable Data Set
Scatter Plot for Case I

Figure 1: Test Figure 2: Train

Confusion Matrix
Class 1 Class 2 Class 3
Class 1 199 0 1
Class 2 0 200 0
Class 3 0 0 200

Classification Accuracy : 100%

Scatter Plot for Case II

Figure 3: Test Figure 4: Train

Confusion Matrix
Class 1 Class 2 Class 3
Class 1 199 0 1
Class 2 0 200 0
Class 3 0 0 200

Classification Accuracy : 99.8333%

Scatter Plot for Case III

Figure 5: Test Figure 6: Train

Confusion Matrix
Class 1 Class 2 Class 3
Class 1 200 0 0
Class 2 0 200 0
Class 3 0 0 200

Classification Accuracy : 99.83333%

0.4.2 Non Linearly Separable Data Set - Overlapping Data
Scatter Plot for Case I

Figure 7: Test Figure 8: Train

Confusion Matrix
Class 1 Class 2 Class 3 Class 4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 11 3 0 77

Classification Accuracy : 85.4396%

Scatter Plot for Case II

Figure 9: Test Figure 10: Train

Confusion Matrix
Class 1 Class 2 Class 3 Class 4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 10 3 1 77

Classification Accuracy : 85.4396%

Scatter Plot for Case III

Figure 11: Test Figure 12: Train

Confusion Matrix
Class 1 Class 2 Class 3 Class 4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 10 3 1 77

Classification Accuracy : 85.4396%

0.4.3 Non Linearly Separable Data Set - Spiral Data
Scatter Plot for Case I

Figure 13: Test Figure 14: Train

Confusion Matrix
Class 1 Class 2
Class 1 151 120
Class 2 120 151

Classification Accuracy : 55.7196%

Scatter Plot for Case II

Figure 15: Test Figure 16: Train

Confusion Matrix
Class 1 Class 2
Class 1 157 114
Class 2 114 157

Classification Accuracy : 57.9336%

Scatter Plot for Case III

Figure 17: Test Figure 18: Train

Confusion Matrix
Class 1 Class 2
Class 1 157 114
Class 2 114 157

Classification Accuracy : 57.9336%

0.4.4 Non Linearly Separable Data Set - Helix Data

Figure 19: Test Figure 20: Train

Case I
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151

Classification Accuracy (Case I) : 50%

Case II
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151

Classification Accuracy (Case II) : 50%

Case III
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151

Classification Accuracy (Case III) : 50%

0.4.5 Real World Data - Glass
Case I
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 4 11 7 0 0 0
Class 2 13 1 5 3 1 0
Class 3 0 1 5 0 0 0
Class 4 0 0 0 4 0 0
Class 5 0 0 0 1 1 1
Class 6 0 0 0 0 0 9
Classification Accuracy (Case I) : 35.8209%

Case II
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 17 3 2 0 0 0
Class 2 11 7 2 1 2 0
Class 3 1 1 4 0 0 0
Class 4 0 2 1 1 0 0
Class 5 0 0 0 0 2 1
Class 6 0 0 0 0 1 8
Classification Accuracy (Case II) : 52.2090%

Case III
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 17 3 2 0 0 0
Class 2 11 7 2 1 2 0
Class 3 1 1 4 0 0 0
Class 4 0 2 1 1 0 0
Class 5 0 0 0 0 2 1
Class 6 0 0 0 0 1 8
Classification Accuracy (Case III) : 52.2090%

0.4.6 Real World Data - Jaffe


Case I
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
Class 1 72 65 601 66 1173 0 0
Class 2 0 1062 312 310 0 0 0
Class 3 0 0 1418 486 0 0 0
Class 4 0 132 472 1462 0 204 0
Class 5 71 65 603 65 1173 0 0
Class 6 71 65 603 63 1175 0 0
Class 7 0 204 0 935 0 0 1058

Classification Accuracy (Case I) : 44.6518%

Case II
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
Class 1 802 0 3 0 541 535 96
Class 2 1338 25 0 0 321 0 0
Class 3 48 0 78 0 1406 86 286
Class 4 140 0 21 143 1084 882 0
Class 5 816 0 4 0 483 578 96
Class 6 805 0 3 0 465 608 96
Class 7 821 0 0 0 1062 0 314

Classification Accuracy (Case II) : 17.5390%

Case III
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
Class 1 802 0 3 0 541 535 96
Class 2 1338 25 0 0 321 0 0
Class 3 48 0 78 0 1406 86 286
Class 4 140 0 21 143 1084 882 0
Class 5 816 0 4 0 483 578 96
Class 6 805 0 3 0 465 608 96
Class 7 821 0 0 0 1062 0 314

Classification Accuracy (Case III) : 17.5390%

0.5 Observation
For complex data such as the non-linearly separable data sets, the Bayesian classifier gives poor classification
accuracy. In comparison, the linearly separable data set is handled efficiently by the classifier in all three cases.
Among the cases, Case I (Σi = σ^2 I) exhibits good classification accuracy compared to the other cases.

