To address the issue of skewed class distribution, the authors of this paper propose a novel term frequency difference and category ratio based feature selection function, named TFDCR, for generating features with strong discriminating capability from the training data. The authors use an incremental learning model based on Support Vector Machines (SVM) so that the learned decision model is updated to adapt to the changed distribution of data in the presence of drifting concepts. Because the content of e-mails changes frequently, the relevance of the representative features also varies over time. They propose a novel selectionRankWeight heuristic function, based on the feature's category ratio difference, to identify new features from an incoming set of e-mails. The existing feature set is updated by including these newly selected features before initiating incremental training of the classifier.

In the second experiment, an incremental learning model using SVM is integrated with an updated feature set to relearn the changed distribution of data. In this experiment, TFDCR feature selection is used in both the conventional batch learning and the incremental learning models. Features are updated using the selectionRankWeight heuristic function.

Three datasets are used. The first is the open ENRON dataset, which contains pre-processed e-mail messages with attachments removed. The second dataset used for filter evaluation is ECML; the ECML task-A and task-B datasets were made publicly available during the 2006 ECML-PKDD Discovery Challenge. The third is the PU dataset, which contains four folders, PU1, PU2, PU3 and PUA, that contain e-mails received by particular users.
Architecture:
C. Spam Mail detection using data mining [3]
Working:
User-Based Features: These check whether the user profile is verified, find the length of the user profile description, check whether location information is given by the user, count the number of followers and friends of the user, compute the reputation score (#followers/(#followers + #friends)), and find the number of tweets posted and the number of lists that the user has subscribed to.

Content-Based Features: These count the number of words, capitalized words, exclamation and question mark symbols, URLs, hashtags, and mentions in the tweet.

Tools that are required:
They used the HSpam and 1KS10KN datasets. Along with the feature model they used four methods (CNN + Twitter GloVe, CNN + Google News, CNN + Edinburgh, CNN + HSpam). They compared the word embeddings using accuracy, precision, and F-measure.

Tools that are required:
They have used the WordNet lexical database. The hierarchical WordNet database groups words into synsets.

F. Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection [6].

This paper focuses mainly on increasing the accuracy of the existing Naive Bayes spam filter. The algorithm targets text modifications that hinder the classification of e-mail. Spam senders are commonly able to bypass spam detectors by using leetspeak and diacritics. Leetspeak is an alternative alphabet that is primarily used on the Internet. Diacritics are the accents placed on words to modify their appearance. Leetspeak allows spam senders to change letters into symbols or a series of symbols. For example, "A" can be written as "/-\". When a word is modified using leetspeak, spam detectors are not able to identify the e-mail as spam, which creates a false negative (a spam message that slips through the filter).
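The effect of leetspeak and diacritics described above can be illustrated with a small normalization pass that maps disguised tokens back to plain letters before classification. The following is only a minimal sketch, not the method of [6]: the substitution table LEET_MAP, the function names, and the way isalpha() is used to skip tokens that are already purely alphabetic are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny substitution table (an assumption for this
# sketch); a practical filter would need a far larger mapping.
LEET_MAP = {
    "/-\\": "a",
    "@": "a",
    "4": "a",
    "3": "e",
    "1": "i",
    "0": "o",
    "$": "s",
    "5": "s",
    "7": "t",
}


def strip_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def undo_leetspeak(token: str) -> str:
    """Replace known leetspeak symbols in a token with plain letters."""
    if token.isalpha():
        # Purely alphabetic tokens cannot contain leetspeak symbols.
        return token
    # Replace multi-character sequences such as "/-\" before single symbols.
    for symbol in sorted(LEET_MAP, key=len, reverse=True):
        token = token.replace(symbol, LEET_MAP[symbol])
    return token


def normalize(message: str) -> str:
    """Lower-case a message, strip diacritics, and undo leetspeak per token."""
    tokens = strip_diacritics(message).lower().split()
    return " ".join(undo_leetspeak(t) for t in tokens)


if __name__ == "__main__":
    print(normalize("Fr3e m0n3y"))   # free money
    print(normalize("Fréé mönéy"))   # free money
```

A blunt table like this also rewrites legitimate digits (a year such as 2006 would be altered), so a real preprocessing step would need some way to decide which tokens are actually disguised words.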
Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering.

Working:
They first applied a preprocessing technique to remove the problem of diacritics and leetspeak. In Python, the isalpha() function is used with the algorithm below:
If a = leetspeak: then replace(a, c)
If b = diacritics: then replace(b, c)

After this, they applied the multinomial Naive Bayes algorithm along with different machine learning algorithms.

Multinomial Naive Bayes (MNB) models the probability of the words (t_k) within a message d given the class c of the message, spam or ham. It assumes that the message is a bag of tokens or words, such that the order of the tokens is irrelevant. Multinomial Naive Bayes essentially counts the relative occurrences of a particular token within the message to determine the conditional probability:

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

where n_d is the number of tokens in d.

Tools that are required:
They have used the SpamAssassin spam server for the datasets.

CONCLUSION AND FUTURE WORK
... spam detection of image and video content as instaspam, as the world is growing into a digital market.

REFERENCES
[1] Sanghani, G., & Kotecha, K. (2019). Incremental personalized E-mail spam filter using novel TFDCR feature selection with dynamic feature update. Expert Systems with Applications, 115, 287–299.
[2] Inuwa-Dutse, I., Liptrott, M., & Korkontzelos, I. (2018). Detection of spam-posting accounts on Twitter.
[3] Satapathy, S. C., Bhateja, V., & Das, S. (Eds.). (2019). Smart Intelligent Computing and Applications. Smart Innovation, Systems and Technologies. doi:10.1007/978-981-13-1921-1
[4] Madisetty, S., & Desarkar, M. S. (2018). A Neural Network-Based Ensemble Approach for Spam Detection in Twitter. IEEE Transactions on Computational Social Systems, 1–12.
[5] Méndez, J. R., Cotos-Yañez, T. R., & Ruano-Ordás, D. (2019). A new semantic-based feature selection method for spam filtering.
[6] Peng, W., Huang, L., Jia, J., & Ingram, E. (2018). Enhancing the Naive Bayes Spam Filter Through Intelligent Text Modification Detection. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE).
[7] "History of Spam". Mailmsg.com. Archived from the original on 26 March 2006. Retrieved 11 July 2006. https://web.archive.org/web/20060326032433/http://www.mailmsg.com/SPAM_history.htm
[8] Global spam volume as percentage of total e-mail traffic by month. https://www.statista.com/statistics/420391/spam-email-traffic-share/
[9] Pérez-Diaz, N., Ruano-Ordás, D., Fdez-Riverola, F., & Méndez, J. R. (2012). SDAI: An integral evaluation methodology for content-based spam filtering models. Expert Systems with Applications, 39, 12487–12500. http://dx.doi.org/10.1016/j.eswa.2012.04.064