
Email Classification Using Naive Bayes Classifier

Domain: Machine Learning
Algorithms: Naive Bayes Algorithm
Framework: Python
Platform: Live Google Cloud Deployment

Abstract

This research investigates a comparison between two different approaches for classifying
emails into categories. Naive Bayes and the Hidden Markov Model (HMM), two different
machine learning algorithms, have both been used to detect whether an email is important
or spam. The Naive Bayes classifier is based on conditional probabilities; it is fast, works well
with small datasets, and treats individual words as independent features. The HMM is a
generative, probabilistic model that provides a distribution over sequences of observations.
HMMs can handle inputs of variable length and help programs reach the most likely decision
based on both previous decisions and current data. Various combinations of NLP techniques
(stopword removal, stemming, and lemmatization) have been tried with both algorithms to
inspect the differences in accuracy and to find the best method among them.

Background

This paper focuses on distinguishing important emails from spam emails. One major factor in
the categorization is how to represent the messages. Specifically, one must decide which
features to use and how to apply those features to the categorization. M. Aery et al. [1] gave
an approach based on the premise that patterns can be extracted from a pre-classified email
folder and then used effectively to classify incoming emails. Since emails have a format
consisting of headers and a body, the correlation between different terms can be shown in
the form of a graph, and they chose graph mining as a viable technique for pattern extraction
and classification. R. Islam et al. [2] proposed a multi-stage classification technique using
different popular learning algorithms such as SVM, Naive Bayes, and boosting, with an
analyzer that substantially reduces false positives and increases classification accuracy
compared to similar existing techniques. B. Klimt et al. [3] introduced the Enron corpus as a
new dataset for this domain. V. Bhat et al. [4] derived a spam filter called Beaks, which
classifies emails into spam and non-spam; their pre-processing technique is designed to
identify tag-of-spam words relevant to the dataset. X. Wang et al. [5] reviewed recent
approaches to filter out spam email, to categorize email into a hierarchy of folders, and to
automatically determine the tasks required in response to an email. According to
E. Yitagesul et al. [6], in sender-based detection the email sender information, such as the
writing style and the sender's user name, is used as the major feature.

The research paper by S. Teli [7] describes a three-phase system for spam detection. In the
first phase, the user creates the rules for classification; the rules are simply the keywords or
phrases that occur in legitimate or spam mails. The second phase is the training phase, in
which the classifier is trained on spam and legitimate emails labeled manually by the user,
and the algorithm then extracts keywords from the classified emails. Once the first and
second phases are completed, classification by the given algorithm starts: using this
knowledge of tokens, the filter classifies every new incoming email. The probability of the
maximum keyword match is calculated, and the status of a new email is confirmed as spam
or important.

Two main methods for detecting spam email are widely used: sender-based spam detection
and content-based spam detection, which considers only the content of an email. This paper
addresses content-based spam detection.
Existing Systems and their Drawbacks

Mohammed et al. [2] [2013] proposed an approach for classifying Unsolicited Bulk Email
(UBE) using Python machine learning techniques, in which spam filtering is performed by
creating a spam-ham dictionary from the given training data and applying a data mining
algorithm to filter the training and testing data. After applying various classifiers to a dataset
of 1431 samples, the approach concludes that Naïve Bayes and SVM are the most prominent
classifiers for spam filtering and classification.

Subramaniam et al. [23] [2012] implemented a Naïve Bayesian anti-spam filtering technique
for the Malay language to investigate the utilization of the Naïve Bayesian procedure to
combat the spam issue. An experiment was conducted using the Naïve Bayesian method for
filtering Malay-language spam, and the results show that the proposed approach achieved
69% accuracy.

Sharma et al. [24] [2013] described an adaptive approach for spam detection. The article uses
the SPAMBASE dataset, and various machine learning techniques such as BayesNet,
LogitBoost, Random Tree, JRip, J48, Multilayer Perceptron, KStar, Random Forest, and
Random Committee are applied to classify spam. Accuracy is measured by grouping the
spam/non-spam e-mails from the labeled emails of a single account. The paper reports a
total accuracy of 95.32%, which indicates the quality of the proposed approach.

Banday et al. [25] [2008] discuss the design of statistical spam filters incorporating Naïve
Bayes, KNN, SVM, and Bayesian Additive Regression Trees, and evaluate these procedures in
terms of accuracy, recall, precision, etc. Although all of the machine learning classifiers are
effective, according to this approach the CBART and NB classifiers have better capability for
spam filtering. The approach also notes that in spam filtering, false positives are more costly
than false negatives.

Awad et al. [1] [2011] proposed an ML-based approach for spam e-mail classification. The
article presents the most prominent machine learning strategies and their effectiveness for
spam email classification, and reports the performance of these algorithms on the
SpamAssassin corpus. The results show that Naïve Bayes and rough set methods are the most
promising algorithms for email classification. As future work, they plan to improve Naïve
Bayes and the artificial immune system through a hybrid system or by resolving the feature
dependence issue.

Chhabra et al. [26] [2010] developed spam filtering using a Support Vector Machine,
considering a nonlinear SVM classifier with different kernel functions over the Enron dataset.
Six datasets with diverse spam-to-ham ratios were analyzed, yielding satisfactory recall and
precision values.

Drawbacks

For the last few decades, researchers have been trying to make email a secure medium, and
spam filtering is one of the core features of a secure email platform. Several lines of research
have reportedly made progress in this regard, but there is still untapped potential. Even now,
e-mail spam classification remains one of the major areas of research needed to bridge these
gaps.

Proposed System

• Pre-Processing Dataset

5500 emails (1500 important and 4000 spam) were retrieved from the Enron email
dataset [10]. These emails were stored in a Python dictionary named documents. For every
email, a series of methods was run until the end of the dictionary was reached. First, the
multipart email problem was handled: the current payload is a list of Message objects if the
email is multipart, and a string otherwise. HTML was then stripped from the email so that
only the email text remained. Words were tokenized from these 5500 emails. At this point,
various combinations of NLP techniques (stemming, lemmatization, and stopword removal)
were applied.
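
A minimal sketch of the pre-processing pipeline described above is given here. It assumes
the raw messages are available as email.message.Message objects; NLTK (with the punkt,
stopwords, and wordnet resources) and BeautifulSoup are illustrative choices for
tokenization, stopword removal, stemming/lemmatization, and HTML stripping, not
necessarily the exact tools of the original implementation.

```python
# Sketch: multipart handling, HTML stripping, tokenization, stopword removal,
# stemming and lemmatization for the Enron emails (illustrative choices).
import nltk
from bs4 import BeautifulSoup                      # assumed HTML stripper
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def extract_text(msg):
    """Flatten a (possibly multipart) email.message.Message into plain text."""
    if msg.is_multipart():
        # The payload is a list of Message objects when the email is multipart
        return " ".join(extract_text(part) for part in msg.get_payload())
    payload = msg.get_payload(decode=True) or b""
    text = payload.decode("utf-8", errors="ignore")
    return BeautifulSoup(text, "html.parser").get_text()   # strip HTML, keep text

def preprocess(msg):
    """Tokenize, then apply stopword removal, lemmatization and stemming."""
    tokens = nltk.word_tokenize(extract_text(msg).lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

# documents: {name: (Message, label)} built from the 5500 Enron emails
# tokenized = {name: (preprocess(msg), label) for name, (msg, label) in documents.items()}
```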

• Naive Bayes Classification

Bayesian classification is used as a probabilistic learning method in which every feature
being classified is assumed to be independent of the value of any other feature. Bayes'
theorem [11] provides a way of calculating the posterior probability P(c|x) from P(c), P(x),
and P(x|c).
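
For reference, this is the standard textbook formulation of Bayes' theorem together with the
naive conditional-independence assumption over the word features x = (x_1, ..., x_n); it is
the generic form, not quoted verbatim from [11]:

```latex
% Bayes' theorem: posterior probability of class c given feature vector x
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

% Naive independence assumption over word features, and prediction as the
% class with the highest posterior probability
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c), \qquad
\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```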

• Spam detection using Naive Bayes Classifier

The goal is to build a classifier that automatically tags new emails with appropriate category
labels. The classifier starts with a list of documents: emails labeled with the appropriate
categories. The first step in creating a classifier is deciding which features of the input are
relevant and how to encode those features [12]. A feature extractor for documents was
therefore defined so that the classifier knows which aspects of the data it should pay
attention to. Duplicate words were removed from the emails, which made the checking
faster. Then, for every word in word_features, if that word existed in a given email, it was
tagged with the category (important or spam) of that email. In this way, words labeled as
'important' and 'spam' were found, and these word:label pairs were used as the feature set
for the Naive Bayes classifier.
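
A minimal sketch of this feature extraction and training step is shown below, assuming
NLTK's NaiveBayesClassifier and tokenized, labeled emails as produced by the pre-processing
step above. The word_features vocabulary cutoff, the toy examples, and the train/test split
are illustrative assumptions, not values from the original experiment.

```python
# Sketch: binary bag-of-words feature extraction and Naive Bayes training (NLTK).
import random
import nltk

# labeled_emails: (token_list, label) pairs produced by the pre-processing step;
# the entries below are toy placeholders (the experiment used 5500 Enron emails).
labeled_emails = [
    (["meeting", "tomorrow", "budget", "report"], "important"),
    (["project", "schedule", "attached", "review"], "important"),
    (["winner", "free", "click", "prize"], "spam"),
    (["cheap", "offer", "free", "click"], "spam"),
]
random.shuffle(labeled_emails)

# word_features: the vocabulary the classifier pays attention to
# (here the 2000 most frequent words; the cutoff is an assumption).
all_words = nltk.FreqDist(w for tokens, _ in labeled_emails for w in tokens)
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(tokens):
    """Binary features: does the email contain each vocabulary word?"""
    token_set = set(tokens)          # de-duplicate words to speed up the lookup
    return {f"contains({w})": (w in token_set) for w in word_features}

featuresets = [(document_features(tokens), label) for tokens, label in labeled_emails]
split = int(0.8 * len(featuresets))
train_set, test_set = featuresets[:split], featuresets[split:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```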

Advantages of the proposed system

Advantages of Naive Bayes: the model is very simple, since training essentially reduces to
counting word occurrences per class. If the conditional independence assumption actually
holds, a Naive Bayes classifier converges more quickly than discriminative models such as
logistic regression, so less training data is needed. The class with the highest posterior
probability is the outcome of the prediction.
System Architecture

Hardware Requirements

Processor: Intel Core i5 or AMD FX 8-core series with a clock speed of 2.4 GHz or above
RAM: 2 GB or above
Hard disk: 40 GB or above
Input device: Keyboard, mouse, or compatible pointing devices
Display: XGA (1024x768 pixels) or higher resolution monitor with 32-bit color settings
Miscellaneous: USB interface, power adapter, etc.
Software Requirements

Operating System: Windows XP or above
Programming Language (Backend): Python, Flask
Programming Language (Frontend): Bootstrap Framework, HTML, CSS, Python
Development environment: Flask IDE
Application Server: Flask localhost
Database: SQLite
