
International Journal of Scientific Research Engineering & Technology (IJSRET)

Volume 2 Issue 7 pp 393-396 October 2013 www.ijsret.org ISSN 2278 0882


IJSRET @ 2013
Novel Approach for Cluster Analysis of Similar Binary Variables using Normal Approximation of Binomial Probability Distribution
¹Makwana Jay, ²Makwana Pratik
¹,²(MCA student, Gujarat Technological University, ISTAR, Vallabh Vidhyanagar, Anand)
ABSTRACT
One approach to determining similarity or dissimilarity between discrete random binary variables is the contingency table [3]. A cluster is a set of meaningful subclasses that have similar characteristics. Clustering is unsupervised classification [7]. This paper presents a novel approach to enhance computation of the binomial probability distribution by treating the variable as a continuous random variable and using the normal approximation.
Keywords - binomial distribution, continuity correction factor, continuous variable, discrete variable, normal approximation of binomial probability distribution.
1 INTRODUCTION
A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Rather than labeling a large set of objects first, as classification requires, it is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups [1]. Clustering is a multivariate technique for grouping together rows that share similar values. The goal of clustering is to organize data by finding some sensible grouping of the data items. Clustering is unsupervised learning because it does not use predefined category labels associated with data items [4]. Clustering algorithms are engineered to detect characteristics in the data.
2 TYPES OF DATA IN CLUSTER ANALYSIS
The following are the different types of data/variables [3]:
1. Interval-Scaled Variables
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
4. Variables of Mixed Types
5. Vector Objects
2.1 Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent and 1 means that it is present. Treating binary variables as if they were interval-scaled can lead to misleading clustering results [1]. Therefore, methods specific to binary data are necessary for computing similarities or dissimilarities.
A binary variable is either symmetric or asymmetric [3]. In statistics, binary data is a statistical data type described by binary variables, which can take only two possible values. Binary data is used to represent the outcomes of Bernoulli trials, statistical experiments with only two possible outcomes. In modern computers, almost all data is ultimately represented in binary form. Although the binary numeral system is usually cited as the main reason for this, many (if not most) data in modern computers are not numbers [5].
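The contingency-table approach of [3] and the symmetric/asymmetric distinction mentioned above can be sketched in a few lines. This is a minimal illustration and not the method proposed in this paper: the function names and the two example vectors are hypothetical, and the formulas are the usual simple-matching (symmetric) and asymmetric dissimilarities for binary variables.

```python
def contingency_counts(u, v):
    """Count the four cells of the 2x2 contingency table for two
    binary vectors of equal, non-zero length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    q = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)  # 1 in u only
    s = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)  # 1 in v only
    t = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t


def symmetric_dissimilarity(u, v):
    """Both states carry equal weight, so 0-0 matches count as agreement."""
    q, r, s, t = contingency_counts(u, v)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(u, v):
    """Only the state 1 is considered informative, so 0-0 matches are ignored."""
    q, r, s, _ = contingency_counts(u, v)
    total = q + r + s
    return (r + s) / total if total else 0.0


# Hypothetical objects described by six binary variables.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(symmetric_dissimilarity(obj_i, obj_j))
print(asymmetric_dissimilarity(obj_i, obj_j))
```

The two measures differ only in whether the both-absent count t enters the denominator, which is why the choice between them depends on whether the variable is symmetric or asymmetric.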
3 BINOMIAL PROBABILITY DISTRIBUTION [1] [6]
Binomial probability typically deals with
the probability of several successive decisions,
each of which has two possible outcomes. In
probability theory and statistics, the binomial
distribution is the discrete probability
distribution of the number of successes in a
sequence of n independent yes/no experiments,
each of which yields success with probability p.
The binomial distribution is frequently used to
model the number of successes in a sample of
size n drawn with replacement from a population
of size N. If the sampling is carried out without
replacement, the draws are not independent and
so the resulting distribution is a hypergeometric
distribution, not a binomial one. However, for N
much larger than n, the binomial distribution is a
good approximation and is widely used.
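The statement that the binomial distribution approximates the hypergeometric distribution well when N is much larger than n can be checked numerically. The sketch below is illustrative only: the function names, the sample size n = 10, the count x = 3 and the population sizes are assumptions chosen for the demonstration, with half of each population counted as successes.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent trials with success probability p
    (sampling with replacement)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def hypergeometric_pmf(x: int, n: int, K: int, N: int) -> float:
    """P(X = x) when n items are drawn without replacement from a
    population of N items, K of which are successes."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


n, x = 10, 3  # assumed sample size and number of successes
for N in (20, 200, 20000):
    K = N // 2  # assumed: half of the population are successes
    hyper = hypergeometric_pmf(x, n, K, N)
    binom = binomial_pmf(x, n, K / N)
    print(f"N={N:>6}  hypergeometric={hyper:.4f}  binomial={binom:.4f}")
```

As N grows relative to n, the two printed columns move closer together, which is the sense in which the binomial is used as an approximation.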
An experiment has a binomial probability
distribution if three conditions are satisfied.
a. There is a fixed number of trials. The
number of trials is denoted by n.
b. The trials are independent.
c. The outcome of each trial can only be
classified as "succeed" or "fail" (equivalently
"yes" or "no"). Furthermore, the probability of
success is fixed. The probability of success is
denoted by p.
Consider the following table [1]:
Table 1
NAME GENDER
A M
B F
C M
D M


If we compare two objects gender-wise (symmetric), we can apply the binomial probability mass function [3] as follows:
$P(x) = \binom{n}{x} p^{x} q^{n-x}$
From Table 1, coding 1 for male and 0 for female:
p = 1/2, q = 1 - p = 1/2, n = 2
The binomial probability distribution for n = 2 is given in Table 3.
From Table 2 and Table 3:
Measurement (Gender of A=1 & B=0) = 0.5     (a)
Measurement (Gender of A=1 & C=1) = 0.25    (b)
Measurement (Gender of A=1 & D=1) = 0.25    (c)
Measurement (Gender of B=0 & C=1) = 0.5     (d)
Measurement (Gender of B=0 & D=1) = 0.5     (e)
Measurement (Gender of C=1 & D=1) = 0.25    (f)
These results are compared with $1/2^n$ (n = number of pairs / number of records; here n = 2, so $1/2^n = 0.25$). If a result equals $1/2^n$, the pair has the same value.
Results (b), (c) and (f) are equal to $1/2^n$ [1].
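The pairwise comparison above can be reproduced with a short script. This is a sketch under the stated values (p = q = 1/2, n = 2, 1 for male and 0 for female), taking x as the number of males in the pair; the names `gender`, `pair_measurement` and `results` are illustrative and do not come from [1].

```python
from itertools import combinations
from math import comb, isclose

# Gender coding from the text: 1 for male, 0 for female.
gender = {"A": 1, "B": 0, "C": 1, "D": 1}


def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def pair_measurement(a, b, p=0.5):
    """Binomial probability of the number of males observed in the pair."""
    x = gender[a] + gender[b]
    return binomial_pmf(x, 2, p)


n = 2
threshold = 1 / 2**n  # 1/2^n, the value a matching pair is compared with

results = {}
for a, b in combinations(gender, 2):
    m = pair_measurement(a, b)
    results[(a, b)] = (m, isclose(m, threshold))
    print(a, b, m, "same" if results[(a, b)][1] else "different")
```

Pairs whose measurement equals the threshold are reported as having the same value, matching results (b), (c) and (f).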
4 NORMAL APPROXIMATION OF BINOMIAL PROBABILITY DISTRIBUTION [2]
There is a problem with approximating the
binomial with the normal. That problem arises
because the binomial distribution is a discrete
distribution while the normal distribution is a
continuous distribution.
Table 2
Name  Gender
A     1
B     0
C     1
D     1

Table 3
Discrete random variable x (success = getting male) | Probability of x | Similarity
0 (both zeros, same)           | $\binom{2}{0}(1/2)^{0}(1/2)^{2} = 0.25$ | Similar
1 (one 0 and one 1, different) | $\binom{2}{1}(1/2)^{1}(1/2)^{1} = 0.5$  | Not similar
1 (one 1 and one 0, different) | $\binom{2}{1}(1/2)^{1}(1/2)^{1} = 0.5$  | Not similar
2 (both ones, same)            | $\binom{2}{2}(1/2)^{2}(1/2)^{0} = 0.25$ | Similar

4.1 Continuity Correction Factor
1. There is a problem with approximating
the binomial with the normal. That
problem arises because the binomial
distribution is a discrete distribution
while the normal distribution is a
continuous distribution.
2. The basic difference here is that with
discrete values, we are talking about
heights but no widths, and with the
continuous distribution we are talking
about both heights and widths.
3. The correction is to either add or
subtract 0.5 of a unit from each discrete
x-value.
4. This fills in the gaps to make the
distribution continuous. It is very similar to
expanding class limits to form class
boundaries in grouped frequency
distributions.
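The correction described in the list above can be written as a small lookup. This is a sketch only: the function name `continuity_interval` and the relation strings are hypothetical, and the mapping follows the usual convention of widening each discrete value k to the interval from k - 0.5 to k + 0.5.

```python
INF = float("inf")


def continuity_interval(relation: str, k: int) -> tuple[float, float]:
    """Return the continuous interval (low, high) that replaces a
    discrete event about X, such as "X == k" or "X <= k"."""
    intervals = {
        "==": (k - 0.5, k + 0.5),  # exactly k
        "<=": (-INF, k + 0.5),     # at most k (k is included)
        "<": (-INF, k - 0.5),      # fewer than k (k is excluded)
        ">=": (k - 0.5, INF),      # at least k (k is included)
        ">": (k + 0.5, INF),       # more than k (k is excluded)
    }
    if relation not in intervals:
        raise ValueError(f"unsupported relation: {relation!r}")
    return intervals[relation]


print(continuity_interval("==", 7))
print(continuity_interval("<", 7))
```

Whether 0.5 is added or subtracted depends on whether the boundary value itself belongs to the event, which is the choice the third item in the list refers to.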
Steps for working a normal approximation to the binomial distribution [7]:
1. Identify success, the probability of success, the number of trials, and the desired number of successes. Since this is a binomial problem, these are the same things that are identified when working a binomial problem.
2. Convert the discrete x to a continuous x. Some people would argue that step 3 should be done before this step, but converting x first avoids forgetting the correction later.
3. Find the smaller of np or nq. If the smaller one is at least five, then the larger must also be, so the approximation will be considered good. When you find np, you are actually finding the mean, mu, so denote it as such.
4. Find the standard deviation, sigma = sqrt(npq). It might be easier to find the variance and take the square root in the final calculation, so that you do not have to work with all of the decimal places.
5. Compute the z-score using the standard formula for an individual score (not the one for a sample mean).
6. Calculate the probability desired.
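The steps above can be sketched as a single function. This is a minimal illustration rather than a reference implementation: the function name and its parameters `lo` and `hi` (the smallest and largest number of successes of interest) are assumptions, and the standard normal CDF from Python's `statistics.NormalDist` stands in for a printed Z-table.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_binomial(n: int, p: float, lo: int, hi: int) -> float:
    """Approximate P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    q = 1 - p

    # Convert the discrete bounds to continuous ones (continuity correction).
    a, b = lo - 0.5, hi + 0.5

    # Check the smaller of np and nq before trusting the approximation.
    if min(n * p, n * q) < 5:
        raise ValueError("np and nq should both be at least 5")

    mu = n * p               # mean
    sigma = sqrt(n * p * q)  # standard deviation

    # z-scores for an individual score.
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma

    # Area under the standard normal curve between the two z-scores.
    standard_normal = NormalDist()
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)
```

A call such as `normal_approx_binomial(100, 0.5, 45, 55)` approximates the probability of between 45 and 55 successes in 100 trials. Because the function does not round the z-scores to two decimals, its output can differ slightly in the last digits from a table lookup.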
5 PROPOSED WORK
When the number of trials becomes large, evaluating the binomial probability function by hand or with a calculator is difficult. Hence, when we encounter a binomial distribution problem with a large number of trials, we may want to approximate the binomial distribution. In cases where the number of trials is greater than 20, $np \ge 5$, and $n(1-p) \ge 5$, the normal distribution provides an easy-to-use approximation of binomial probabilities.
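The applicability conditions stated above can be expressed as a small predicate. The function name and its default thresholds are illustrative, and simply restate the three conditions in the text.

```python
def normal_approximation_applicable(n: int, p: float,
                                    min_trials: int = 20,
                                    min_expected: float = 5.0) -> bool:
    """True when n > 20, np >= 5 and n(1 - p) >= 5, as stated in the text."""
    return (
        n > min_trials
        and n * p >= min_expected
        and n * (1 - p) >= min_expected
    )


print(normal_approximation_applicable(100, 0.5))
print(normal_approximation_applicable(100, 0.01))
```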
When using the normal approximation to the binomial, we set $\mu = np$ and $\sigma = \sqrt{np(1-p)}$ in the definition of the normal curve. Let us illustrate the normal approximation to the binomial.
Example:
Toss a coin 12 times. What is the probability of getting exactly 7 heads?
Solution using the normal approximation to the binomial distribution:
Formulas:
$\mu = np$
$\sigma = \sqrt{np(1-p)}$
$z = (x - \mu) / \sigma$
The values are:
n = 12, p = 0.5, and with the continuity correction the required probability is $P(6.5 < X < 7.5)$.
$\mu = np = 12 \times 0.5 = 6$
$\sigma = \sqrt{np(1-p)} = \sqrt{12 \times 0.5 \times (1 - 0.5)} = 1.73$
For x = 6.5: $z = (6.5 - 6) / 1.73 = 0.29$
For x = 7.5: $z = (7.5 - 6) / 1.73 = 0.87$
Looking up z = 0.29 and z = 0.87 in the Z-table gives 0.6141 and 0.8078, respectively.
Therefore,
0.8078 - 0.6141 = 0.1937
The probability of getting exactly 7 heads is approximately 0.19.
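For comparison, the exact binomial probability of 7 heads in 12 tosses can be computed directly. This check is an addition for illustration and is not part of the original solution; the variable names are arbitrary.

```python
from math import comb

n, x, p = 12, 7, 0.5

# Exact binomial probability: C(12, 7) * 0.5^7 * 0.5^5 = 792 / 4096.
exact = comb(n, x) * p**x * (1 - p) ** (n - x)

# Value obtained above from the Z-table.
approx = 0.8078 - 0.6141

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
# prints: exact = 0.1934, normal approximation = 0.1937
```

The two values agree to three decimal places, so rounding either one gives the 0.19 reported above.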
6 CONCLUSION
In this paper we describe the normal approximation to binomial probabilities as a way to find similarity or dissimilarity between two binary variables by treating them as a continuous random variable. This novel technique may improve computation and may be applied to N records without any constraints.
REFERENCES:
Journal Papers:
[1] Parag M. Moteria and Dr. Y R Ghodasara, Application of Binomial Probability Distribution in Cluster Analysis of Similar Categorical Variables (ISSN 2250-2459, Volume 2, Issue 6, June 2012) (133-135)
Books:
[2] Anderson, Sweeney, Williams, Statistics for business and economics, 9th edition, Thompson Publication
[3] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques - Second Edition, ELSEVIER Morgan Kaufmann Publisher, 2011, (page: 383, 389, 390)
Others:
[4] http://www.jmp.com/support/help/Introdction_to_Clustering_Methods.shtml
[5] http://en.wikipedia.org/wiki/Binary_data
[6] http://en.wikipedia.org/wiki/Binomial_distribution
[7] https://people.richland.edu/james/lecture/m170/ch07-bin.html
[8] http://webdocs.cs.ualberta.ca