


What is Naïve Bayes?

Agenda: Introduction · What is Naïve Bayes · How do we use it? · Disadvantages · LIVE Demo · Conclusion

Classification = separation or ordering of objects into classes.

A classification algorithm has 2 phases:

1. The algorithm tries to find a model for the class attribute as a function of the other variables of the dataset.

2. Next, it applies the previously built model to new, unseen data to determine the class of each record.

If you need to classify thousands of observations quickly, and you have already evaluated the importance of the predictors, you can go for Naïve Bayes, one of the fastest classifiers.
What is Naïve Bayes?
Two approaches to probability

Frequentist:
- Population parameters are constant
- Randomness lies in the data
- Long-run behavior defines probability

Bayesian:
- Parameters are considered to be random variables
- Data are considered to be known
- Requires elicitation of a prior
The Bayes theorem

Bayes' theorem provides a way of calculating the posterior probability, P(A|B), from P(A), P(B), and P(B|A):

P(A|B) = P(A) * P(B|A) / P(B)

P(A|B) = posterior probability of the class given the predictor
P(A) = prior probability of the class
P(B|A) = likelihood, i.e. the probability of the predictor given the class
P(B) = prior probability of the predictor
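The formula can be sketched as a one-line Python function (the input values below are illustrative, not taken from the deck):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

# Illustrative values: P(A) = 0.6, P(B|A) = 0.5, P(B) = 0.5
print(posterior(0.6, 0.5, 0.5))  # -> 0.6
```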
How does the algorithm work?
Fair/Unfair Head
Fair Yes
Fair Yes
Unfair Yes
Unfair Yes
Fair No
Fair No
Unfair Yes
Unfair No
Fair Yes
Fair Yes
Fair No
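The eleven observations above can be tallied with the Python standard library; a quick sketch that produces the frequency table of the next step:

```python
from collections import Counter

# The eleven (coin, head?) observations from the table above
observations = [
    ("Fair", "Yes"), ("Fair", "Yes"), ("Unfair", "Yes"), ("Unfair", "Yes"),
    ("Fair", "No"), ("Fair", "No"), ("Unfair", "Yes"), ("Unfair", "No"),
    ("Fair", "Yes"), ("Fair", "Yes"), ("Fair", "No"),
]

freq = Counter(observations)
print(freq[("Fair", "Yes")], freq[("Fair", "No")])      # 4 3
print(freq[("Unfair", "Yes")], freq[("Unfair", "No")])  # 3 1
```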
How does the algorithm work?

1. Convert the data into a frequency table

Frequency Table
Head No Yes
Fair 3 4
Unfair 1 3
Grand Total 4 7
How does the algorithm work?

2. Then create the likelihood table by computing the probabilities

Likelihood Table
Head          No            Yes           Total
Fair          3             4             7/11 ≈ 0.64
Unfair        1             3             4/11 ≈ 0.36
Grand Total   4             7             11
              4/11 ≈ 0.36   7/11 ≈ 0.64
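The marginal probabilities in the likelihood table can be recomputed from the raw counts; a minimal sketch:

```python
# Counts from the frequency table
counts = {("Fair", "Yes"): 4, ("Fair", "No"): 3,
          ("Unfair", "Yes"): 3, ("Unfair", "No"): 1}
total = sum(counts.values())  # 11

# Row and column marginals give the prior probabilities
p_fair = sum(n for (coin, _), n in counts.items() if coin == "Fair") / total
p_yes = sum(n for (_, head), n in counts.items() if head == "Yes") / total

print(round(p_fair, 2), round(p_yes, 2))  # 0.64 0.64
```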
How does the algorithm work?

3. Use the Bayes formula to compute the posterior probability

P(Yes|Fair) = P(Yes) * P(Fair|Yes) / P(Fair)

P(Yes|Fair) = 0.64 * 0.57 / 0.64 ≈ 0.57
The class with the largest posterior probability is the class assigned to the observation (see the Bayes theorem slide for the definition of the posterior probability).
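The whole worked example as Python, using the exact counts rather than the rounded values:

```python
# Counts from the frequency table: 4 of the 7 "Yes" tosses were Fair,
# 7 of the 11 coins are Fair, and 7 of the 11 tosses came up "Yes"
p_yes = 7 / 11            # prior P(Yes) ≈ 0.64
p_fair_given_yes = 4 / 7  # likelihood P(Fair|Yes) ≈ 0.57
p_fair = 7 / 11           # evidence P(Fair) ≈ 0.64

p_yes_given_fair = p_yes * p_fair_given_yes / p_fair
print(round(p_yes_given_fair, 2))  # 0.57 -> "Yes" is the predicted class
```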
MAP: Maximum A Posteriori

The posterior probability is a distribution, not a single value. The MAP rule chooses the class that maximizes the posterior.

Without further assumptions, computing the posterior requires modeling the joint distribution of the predictors. With k = 5 (number of classes of the variable to predict) and m = 10 (number of predictors), the number of probabilities to compute would be:

k^m = 5^10 = 9,765,625
The solution: Naïve Bayes

In machine learning, Bayes' theorem is applied with the "naïve" assumption that all variables are independent (no correlation between variables).

The number of probabilities to compute becomes:

k * m = 5 * 10 = 50
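A quick check of the two counts (reading the slide's 9,765,625 as k to the power m):

```python
k, m = 5, 10  # classes of the target variable, number of predictors

joint_probs = k ** m  # without the independence assumption
naive_probs = k * m   # with the "naïve" independence assumption

print(joint_probs, naive_probs)  # 9765625 50
```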
What is Naïve Bayes?

- It is a supervised learning method
- It is considered a classification technique
- In machine learning, the technique applies Bayes' Theorem to classify

Key Example


A Naive Bayes classifier considers each of a fruit's "features" (red, round, 3" in diameter) to contribute independently to the probability that the fruit is an apple, regardless of any correlations between the features.
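A minimal pure-Python sketch of this idea on a hypothetical toy fruit dataset (the feature names and rows are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical training data: each fruit is a dict of categorical features
data = [
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"}, "banana"),
    ({"color": "yellow", "shape": "long"}, "banana"),
]

label_counts = Counter(label for _, label in data)
value_counts = defaultdict(Counter)  # (label, feature) -> value counts
for features, label in data:
    for name, value in features.items():
        value_counts[(label, name)][value] += 1

def predict(features):
    # Each feature contributes independently: prior times the likelihoods
    scores = {}
    for label, n in label_counts.items():
        score = n / len(data)  # prior P(label)
        for name, value in features.items():
            score *= value_counts[(label, name)][value] / n  # P(value|label)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"color": "red", "shape": "round"}))  # apple
```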
Why do we need it?

Naïve Bayes has many use cases:

1. Fast predictions: Naïve Bayes is a very quick algorithm and could therefore be used for real-time predictions.
2. Multi-class predictions.
3. Text classification, e.g. spam filtering: the large number of features works well with the relative simplicity and the independent-features assumption of Naïve Bayes.
4. Sentiment analysis.
5. Recommendation systems, linked to a collaborative filtering algorithm.

Some examples in our day-to-day life: weather predictions, medical diagnosis.
Why and how is it better or worse than other techniques that provide a similar function?
Types of algorithms

Discriminative:
- Learns explicit boundaries between classes
- Maximizes the distance between samples of two classes
- More "complex"

Generative:
- Models the distribution of individual classes
- Rich representations of the independence relations in the dataset
- Ex: Naïve Bayes
1. Small size of dataset required

Naïve Bayes: a good choice if you want something fast and easy but still with good performance (used a lot in robotics, computer vision…).

Other techniques: regression will most likely overfit with a small dataset. Decision trees, for example in predicting cancer: because cancer does not occur a lot, the corresponding branches will get pruned out more easily (even if this can be handled by using weights, Naive Bayes would be better).
2. Great speed

Naïve Bayes: it is easy and fast to predict the class of a test data set, and it also performs well in multi-class prediction.

Other techniques: other algorithms take more time (see algorithms with iterations).
3. Good with categorical variables

Naïve Bayes: it performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal (Gaussian) distribution is assumed, which is a strong assumption.

Other techniques: when all inputs are numerical (or continuous), logistic regression and random forests are good techniques to use.
Limitations of the Naïve Bayes

1. Interdependence of features

Naïve Bayes: when the assumption of independence holds, a Naive Bayes classifier performs better than other models like logistic regression, and you need less training data. BUT if the assumption does not hold, it won't be able to learn interactions between the features. Example: if you like carrots and potatoes but you don't like them together, it won't be able to tell the difference. In real life, it is almost impossible to get a set of predictors which are completely independent.

Other techniques: with logistic regression, you don't have to worry that much about your features being correlated.
2. Building the model – choosing the variables

Naïve Bayes: you need to build the classification by hand, picking yourself the variables used to classify the observations. Therefore you might need other statistical techniques to guide your choice of features.

Other techniques: decision trees will pick the best features, and they will also help you find out the relationships between inputs and output and how strong these are. With regression techniques, you can also rely on R-squared and significance tests.
3. Interpretation

Naïve Bayes: less intuitive, and it is harder to see how a variable truly impacts the classification (one option is to look at the log of the posterior odds ratio).

Other techniques: logistic regression and decision trees are very useful in terms of interpreting the results (coefficients, rules, etc.).
How to improve the algorithm?

1. If continuous features do not have a normal distribution, we should transform them.
2. If the test data set has a zero-frequency issue, apply smoothing techniques such as the "Laplace correction" (e.g. in spam classifying).
3. Remove correlated features, as they would be given an exaggerated importance.
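A sketch of the Laplace correction from point 2 (the parameter names are mine; alpha = 1 is the classic correction):

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1):
    """Laplace-corrected P(value|class). A feature value never seen with a
    class gets a small non-zero probability instead of zeroing the product."""
    return (count + alpha) / (class_total + alpha * n_values)

# Unsmoothed, a zero count would wipe out the whole posterior product:
print(smoothed_likelihood(0, 7, 2))  # 1/9 instead of 0
```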
How to choose the best algorithm?

1st option: cross-validation (train & test)

2nd option: confusion matrix, AUC, recall & F1 score
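The 2nd-option metrics can all be derived from confusion-matrix counts; a sketch with hypothetical numbers:

```python
# Hypothetical confusion-matrix counts: true/false positives and negatives
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                          # 40/50 = 0.8
recall = tp / (tp + fn)                             # 40/45 ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 85/100 = 0.85

print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)
```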