
Survey on Spam Filtering in Text Analysis

Saksham Sharma, Rabi Raj Yadav


Computer Science Engineering, Vellore Institute of Technology
saksham.sharma2016@vitstudent.ac.in
rabiraj.yadav2017@vitstudent.ac.in

Abstract

The Internet has become a powerful platform for worldwide communication and collaboration. Its misuse by fraudsters (for example, through spam or viruses) makes it difficult to build systems that guarantee a reasonable and secure communication experience. Data is exploited by individuals and organisations to gain competitive advantage, and a substantial amount of it is generated by spammers or fake users. Attackers increasingly rely on robust attack vectors such as email: victims are infected through malicious attachments or through links that lead to malicious sites, for instance in Facebook and Twitter posts and in Gmail messages. Efficient filtering and blocking methods for spam messages are therefore required. Unfortunately, most spam filtering solutions proposed so far are reactive: they need a large number of both ham and spam messages to generate effective rules that distinguish between the two. In this survey paper, we study a more proactive approach that allows spam messages to be collected directly. We look at how spam text is increasing in emails, fake social media posts, tweets, URLs, etc. Current spam campaigns can be observed and a copy of the latest spam messages obtained quickly and efficiently. From the collected data, templates can be generated that give a concise summary of a spam campaign. The collected data can then be used to improve current spam filtering techniques and to develop new ways of filtering mail efficiently.

Keywords
Text Analysis, Spam Filters, Spam Controlling Algorithms, Naive Bayes, Classifier

I. INTRODUCTION

Electronic mail (email) is critical for group communication and is widely used by many individuals and organisations. At the same time, unsolicited email, known as spam, is one of the fastest growing and most costly problems associated with the Internet today. The first spam email was sent on May 1, 1978 to many users on ARPANET. It was an advertisement for a presentation by Digital Equipment Corporation of their DECSYSTEM-20 products, sent by Gary Thuerk, one of the company's marketers [7]. Email spam has grown steadily since then; by 2019, spam messages were estimated to account for 56 percent of email traffic [8]. Spam filtering is the process of detecting spam text and unsolicited or unwanted emails and preventing those messages from reaching a user's inbox. Essentially, it is a software routine that removes incoming spam or redirects it to a "junk" mailbox (spam folder). Also called "spam blockers," spam filters are built into a user's email program. They are also built into or added onto a mail server, in which case the spam never reaches the user's mailbox.

To block spam emails, fake or spam posts on social media, and similar content, authors have proposed various techniques that address common problems in spam filtering using natural language processing (NLP). NLP is a branch of artificial intelligence that deals with the interaction between computers and humans. The objective of NLP is to read, interpret, and understand human languages in a way that is useful. Nowadays, efficient spam filtering is done by combining various classifiers supported by advanced filtering platforms such as SpamAssassin (The Apache SpamAssassin Project, 2011) or Wirebrush4SPAM (Wirebrush4SPAM, 2011). A common filter usually combines four different kinds of techniques: (i) domain authentication schemes, (ii) collaborative approaches, (iii) content-based classifiers, (iv) characteristics-based filters [9].

One of the surveyed papers presents a novel approach for distinguishing spam from non-spam social media posts and offers more insight into the behaviour of spam users on Twitter. The approach proposes an optimized set of features that is independent of historical tweets, which are available only for a short time on Twitter. Other work performs data collection and data pre-processing and then applies algorithms such as Naive Bayes and SMO to the filters.

II. PROBLEM DESCRIPTION

The most common problem we have seen across the papers lies in the spam filters themselves. A spam filter misclassifying a legitimate e-mail as spam is generally more harmful than misclassifying a spam e-mail as legitimate.

Online spamming activities take different forms, for example malware distribution, posting of commercial URLs, fake news or abusive content, automated generation of large volumes of content, and following or mentioning random users. Spam filters are in high demand to automatically filter spam and keep mailboxes clean. Early spam filters required a user to create rules, mainly by looking at patterns in typical junk email, such as the presence of specific words, combinations of words, or phrases. To tackle this problem, authors have proposed a hybrid SMS classification system to detect spam or ham, using the Naive Bayes algorithm and the Apriori algorithm. On social media, spammers use URLs in tweets to redirect users to malicious sites that host malware. They also use URLs for phishing, to obtain users' personal details.

III. CRITICAL ANALYSIS

We now look at how each of these papers addresses the problem.

A. Incremental personalized E-mail spam filter using novel TFDCR feature selection with dynamic feature update [1]

To address the issue of uneven class distribution, the authors of this paper propose a novel feature selection function based on term frequency difference and category ratio, named TFDCR, for generating features with strong discriminating power from the training data. They use an incremental learning model based on Support Vector Machines (SVM) so that the learned decision model is updated to adapt to the changed distribution of data in the presence of concept drift. Because the content of e-mails changes frequently, the relevance of the representative features also varies over time. They propose a novel selectionRankWeight heuristic function, based on a feature's category ratio difference, to identify new features from an incoming set of e-mails. The existing feature set is updated by adding these newly selected features before incremental training of the classifier is triggered.

Architecture: [figure omitted]

Working:

Proposed algorithms:

Feature selection using the novel TFDCR function.
- This algorithm describes the TFDCR feature selection method.

Incremental personalized e-mail spam filter with dynamic feature update.
- This algorithm describes the proposed system for incremental personalized e-mail spam filtering with dynamic feature update.

The first experiment analyzes how effectively the TFDCR feature selection function identifies the most discriminating features from the given training data. Five basic and well-known feature selection functions are used for comparison with the results achieved by the proposed TFDCR. This first experiment uses conventional batch training of the classifier followed by a testing phase.

In the second experiment, an incremental learning model using SVM with an updated feature set is integrated to relearn the changed distribution of the data. TFDCR feature selection is used in both the conventional batch learning and the incremental learning models. Features are updated using the selectionRankWeight heuristic function.

Three datasets are used. The first is the public ENRON dataset, which contains pre-processed e-mail messages with attachments removed. The second dataset used for filter evaluation is ECML: the ECML task-A and task-B datasets were made publicly available during the 2006 ECML-PKDD Discovery Challenge. The third is the PU dataset, which contains four folders (PU1, PU2, PU3 and PUA) that hold e-mails received by particular users.
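The dynamic feature update described in this subsection can be illustrated with a small Python sketch. It is not the TFDCR or selectionRankWeight function from [1], whose exact formulas are not reproduced in this survey. As a simplified stand-in, it scores each term by the absolute difference between the fraction of spam messages and the fraction of ham messages that contain it, then adds the highest-scoring unseen terms to the feature set. The function names, the top_k parameter and the toy messages are assumptions made for illustration only.

```python
from collections import Counter


def category_ratio_scores(messages, labels):
    """Score each term by |share of spam docs with it - share of ham docs with it|.

    Simplified stand-in for a category-ratio-difference score; not the
    formula used in the surveyed paper.
    """
    spam_docs = Counter()
    ham_docs = Counter()
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = len(labels) - n_spam
    for text, y in zip(messages, labels):
        terms = set(text.lower().split())  # document frequency: count each term once
        (spam_docs if y == "spam" else ham_docs).update(terms)
    scores = {}
    for term in set(spam_docs) | set(ham_docs):
        p_spam = spam_docs[term] / n_spam if n_spam else 0.0
        p_ham = ham_docs[term] / n_ham if n_ham else 0.0
        scores[term] = abs(p_spam - p_ham)
    return scores


def update_feature_set(features, messages, labels, top_k=2):
    """Return the feature list extended with the top_k best-scoring new terms."""
    scores = category_ratio_scores(messages, labels)
    candidates = [t for t in scores if t not in features]
    # Highest score first; ties broken alphabetically so the result is deterministic.
    candidates.sort(key=lambda t: (-scores[t], t))
    return list(features) + candidates[:top_k]


# Hypothetical existing feature set and a newly arrived batch of labelled e-mails.
features = ["free", "offer"]
batch = [
    "win crypto prize now",
    "crypto prize inside",
    "meeting notes for monday",
    "lunch on monday",
]
batch_labels = ["spam", "spam", "ham", "ham"]
features = update_feature_set(features, batch, batch_labels)
print(features)
```

In the setting of [1], the updated feature set would then feed the incremental SVM training step, which this sketch does not attempt to show.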

B. Detection of spam-posting accounts on Twitter [2]

This paper presents a novel approach for distinguishing spam from non-spam social media posts and offers more insight into the behaviour of spam users on Twitter. The approach proposes an optimized set of features that is independent of historical tweets, which are available only for a short time on Twitter. It takes into account features related to Twitter users, their accounts, and their pairwise engagement with each other. The authors experimentally demonstrate the efficacy and robustness of the approach and compare it with a typical feature set for spam detection from the literature, achieving a significant improvement in performance. In contrast to earlier research findings, they observe that an average automated spam account posted at least 12 tweets per day at well-defined periods.
Working:

The following approaches are proposed:

Parameter tuning and classification models
- An effective classifier should be able to correctly classify previously unseen data by leveraging the experience gained from training on n labelled samples, i.e. data instances and their corresponding classes.

Feature importance and correlation
- During an initial exploration stage, a large number of features were used for training, and some were discarded because of their relatively low contribution to overall performance.

Tools that are required:
The scikit-learn toolkit is used. For evaluation, different metrics are used in order to avoid a bias towards the majority class, especially when the dataset is imbalanced.

The Honeypot dataset is publicly available and useful for studying spam activity on Twitter. It was used both as a dataset in its own right and for collecting the SPD datasets using keywords. The SPD datasets are intended to avoid the potential risk of a high false positive rate.

Architecture: [figure omitted]

C. Spam Mail detection using data mining [3]

In this paper the authors propose a solution based on data mining techniques, carried out in the following steps.

Data Collection: The data were donated by George Forman, and the collected dataset is available in the UCI Machine Learning Repository.

Data Pre-processing: Real-world datasets contain many errors, which are cleaned and removed in order to obtain accurate results. In this step the dataset is transformed and integrated into an appropriate format before the classifiers are applied to it.

Applying Algorithm: Once the pre-processed file is available, all the classification algorithms, namely Naive Bayes, SMO, J48 decision tree, and random forest, are applied in order to discover the features on which spam is identified.

Performance Evaluation: After all classifiers were applied, each of them was evaluated on performance metrics in order to determine the best classifier.

The authors state that classification is used because other data mining techniques, such as clustering and association, are not capable of solving prediction-related problems. Clustering partitions related data elements into the same set, while association is generally used to establish relationships between attributes that exist in a dataset.

Working:
1. Naive Bayes algorithm
2. Apriori algorithm
3. J48 decision tree
4. Random forest algorithm

The dataset used in this work is tested and analyzed with four different classification techniques that use cross-validation: (i) Naive Bayes, (ii) SMO, (iii) J48, (iv) random forest. After each classifier is applied, its performance is analyzed using the following metrics: true-positive rate, false-positive rate, precision, recall, F-measure, ROC area, and time taken to build the classifier model.

Tools and Technology: To implement this task, the authors used WEKA, an open-source tool for data mining. The classifiers used in this work are: (i) Naive Bayes, (ii) SMO, (iii) J48, (iv) random forest.

Architecture: [figure omitted]
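Several of the performance metrics listed for this paper (true-positive rate, false-positive rate, precision, recall and F-measure) can be derived from four confusion counts. The Python sketch below shows the arithmetic, treating spam as the positive class. It is an illustration only and not WEKA's implementation; the function name and the example counts are assumptions, and ROC area and model build time are not covered.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion counts (spam = positive class)."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)   # share of flagged messages that really are spam
    recall = ratio(tp, tp + fn)      # share of spam that was caught (true-positive rate)
    fpr = ratio(fp, fp + tn)         # share of legitimate mail wrongly flagged
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "tpr": recall,
        "fpr": fpr,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for one classifier on a 100-message test set.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the false-positive rate alongside precision and recall matters here because, as noted in the problem description, flagging legitimate mail as spam is usually the costlier error.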
D. A Neural Network-Based Ensemble Approach for Spam Detection in Twitter [4]

In this paper the authors introduce deep learning models based on convolutional neural networks (CNNs). Five CNNs and one feature-based model are used in the ensemble. Each CNN uses different word embeddings (GloVe, Word2vec) to train the model. The feature-based model uses content-based, user-based, and n-gram features. A neural network acts as the meta-classifier. The difference between the proposed method and existing methods is that it combines handcrafted features and word embedding features to capture more information about spam and non-spam tweets.

Working:

User-Based Features: These check whether the user profile is verified, measure the length of the user profile description, check whether the user has provided location information, count the user's followers and friends, compute the reputation score (#followers/(#followers + #friends)), and count the tweets posted and the lists the user has subscribed to.

Content-Based Features: These count the words, capitalized words, exclamation and question marks, URLs, hashtags, and mentions in the tweet.

Tools that are required:
They used the HSpam and 1KS10KN datasets. Along with the feature-based model they used four methods (CNN + Twitter GloVe, CNN + Google News, CNN + Edinburgh, CNN + HSpam). They used accuracy, precision, and F-measure to compare the word embeddings.

E. A new semantic-based feature selection method for spam filtering [5]

In this paper the authors introduce a feature selection technique for the spam-filtering domain that takes advantage of semantic information. They group word-based features into semantic topics to generate feature vectors. The study involves three feature selection methods: Information Gain, Latent Dirichlet Allocation, and Semantic-based Feature Selection.

Working:

They discuss four basic methods:
(i) content-based filtering: a set of methods that perform a detailed analysis of message content (text, images, attached documents) to determine a class for the message.
(ii) collaborative schemes: sharing detailed information about received spam messages.
(iii) domain authorization methods: defining trusted servers (identified by their IP addresses) that may send messages for a certain domain.
(iv) characteristics-based filters: for example, the number of recipients receiving the same e-mail.

Topic guessing methodology: They used the noun semantic relations included in WordNet to design it:
(i) hyponym (X is a kind of Y)
(ii) hypernym (X includes the notion of Y among others)
(iii) meronym (X is a part of Y)
(iv) holonym (X contains Y among others).
They find e-mail topics by selecting a hierarchical level (h) in order to semantically group terms (synsets) into more generic topics. A topic is present in a message if the message contains a term t belonging to one of the representative synsets for the topic.

Tools that are required:
They used the WordNet lexical database. The hierarchical WordNet database groups words into synsets.

F. Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection [6]

This paper focuses on increasing the accuracy of the existing Naive Bayes spam filter. The proposed algorithm deals with text modifications that hinder the classification of email. Spam senders are commonly able to bypass spam detectors by using leetspeak and diacritics. Leetspeak is an alternative alphabet used primarily on the Internet. Diacritics are accents placed on letters to modify their appearance. Leetspeak allows spam senders to change letters into a symbol or a series of symbols. For example, "A" can be written as "/-\". When a word is modified using leetspeak, spam detectors are not able to identify the email as spam, which creates a false negative.
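The leetspeak and diacritic substitutions described above can be reversed before classification. The following Python sketch shows one possible way to do this and is not the algorithm from [6]. The substitution table is a small hypothetical sample, the function names are assumptions, and Unicode decomposition via the standard unicodedata module is our choice for removing accents.

```python
import unicodedata

# Hypothetical substitution table covering a few common leetspeak spellings.
# Multi-character patterns come first so "/-\" is handled before single symbols.
LEET_MAP = [
    ("/-\\", "a"),
    ("|<", "k"),
    ("@", "a"),
    ("4", "a"),
    ("3", "e"),
    ("1", "i"),
    ("0", "o"),
    ("$", "s"),
    ("5", "s"),
    ("7", "t"),
]


def strip_diacritics(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """Map a possibly obfuscated token back to plain lowercase letters."""
    token = strip_diacritics(token).lower()
    # Purely alphabetic tokens need no further work; purely numeric tokens
    # (years, amounts) are left alone so they are not turned into letters.
    if token.isalpha() or token.isdigit():
        return token
    for pattern, letter in LEET_MAP:
        token = token.replace(pattern, letter)
    return token


for raw in ["V1/-\\GR/-\\", "Fr\u00e9e", "c@$h", "2019"]:
    print(raw, "->", normalize_token(raw))
```

A fixed table like this will also rewrite legitimate mixed tokens such as product codes, so a real filter would need a more careful rule for deciding when a token is obfuscated.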
Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering.

Working:

They first apply a preprocessing step to remove the diacritics and leetspeak. In Python, the isalpha() function is used with the algorithm below, where c denotes the replacement character:
    if a is leetspeak: replace(a, c)
    if b is a diacritic: replace(b, c)

After this they apply the multinomial Naive Bayes algorithm along with other machine learning algorithms.

Multinomial Naive Bayes (MNB) models the probability of the words (t_k) within a message d given the class c of the message, spam or ham. It assumes that the message is a bag of tokens or words, such that the order of the tokens is irrelevant. Multinomial Naive Bayes essentially counts the relative occurrences of a particular token within the message to determine the conditional probability:

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

where n_d is the number of tokens in message d.

Tools that are required:
They used the SpamAssassin spam server for the datasets.

IV. CONCLUSION AND FUTURE WORK

This study critically analyses the different approaches taken in different papers, ranging from basic text preprocessing, through the Naive Bayes classifier, to the more advanced concepts of neural networks and feature selection. While studying the papers we found that Naive Bayes classification is the oldest approach. After that, feature selection came into focus. In the present age of machine learning and deep learning, one paper introduced the use of neural networks. Different papers have used different datasets, and for evaluation they have used different measures such as precision, recall and the confusion matrix. The confusion matrix consists of (i) false positive errors (FP, legitimate messages classified as spam), (ii) false negative errors (FN, undetected spam e-mails), (iii) true positive hits (TP, number of spam messages detected) and (iv) true negative hits (TN, number of legitimate messages correctly classified).

Although the work done by the above authors seems promising, further work can be done. Among all the above papers, the topic guessing model achieves the best results when compared with the other alternatives. Improvements can be made to reduce obfuscation (the destruction of the intended meaning of a communication by making the message difficult to understand). This will help in guessing the topic precisely in order to classify an email as ham or spam. Further, this work can be extended to spam detection in image and video content ("instaspam"), as the world moves towards a digital market.

REFERENCES
[1] Sanghani, G., & Kotecha, K. (2019). Incremental personalized E-mail spam filter using novel TFDCR feature selection with dynamic feature update. Expert Systems with Applications, 115, 287–299.
[2] Inuwa-Dutse, I., Liptrott, M., & Korkontzelos, I. (2018). Detection of spam-posting accounts on Twitter.
[3] Satapathy, S. C., Bhateja, V., & Das, S. (Eds.). (2019). Smart Intelligent Computing and Applications. Smart Innovation, Systems and Technologies. doi:10.1007/978-981-13-1921-1
[4] Madisetty, S., & Desarkar, M. S. (2018). A Neural Network-Based Ensemble Approach for Spam Detection in Twitter. IEEE Transactions on Computational Social Systems, 1–12.
[5] Méndez, J. R., Cotos-Yañez, T. R., & Ruano-Ordás, D. (2019). A new semantic-based feature selection method for spam filtering.
[6] Peng, W., Huang, L., Jia, J., & Ingram, E. (2018). Enhancing the Naive Bayes Spam Filter Through Intelligent Text Modification Detection. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE).
[7] "History of Spam". Mailmsg.com. Archived from the original on 26 March 2006. Retrieved 11 July 2006. https://web.archive.org/web/20060326032433/http://www.mailmsg.com/SPAM_history.htm
[8] Global spam volume as percentage of total e-mail traffic by month. https://www.statista.com/statistics/420391/spam-email-traffic-share/
[9] Pérez-Diaz, N., Ruano-Ordás, D., Fdez-Riverola, F., & Méndez, J. R. (2012). SDAI: An integral evaluation methodology for content-based spam filtering models. Expert Syst. Appl. 39, 12487–12500. http://dx.doi.org/10.1016/j.eswa.2012.04.064
