

International Journal of Research and Reviews in Computer Science (IJRRCS) Vol. 2, No. 2, April 2011

Phishing E-Mail Detection Using Ontology Concept and Naïve Bayes Algorithm
Mahdi Bazarganigilani
Charles Sturt University, School of Business and Economics, Melbourne, Australia

mahdi62b@yahoo.com

Abstract: The emergence of huge numbers of phishing e-mails compels individuals to protect their valuable credentials. Invaders use social engineering to create phishing e-mails. In this paper, we present an algorithm for text classification of phishing e-mails using the semantic ontology concept. We use the Naïve Bayes algorithm for classification.

Keywords: Phishing E-Mail Detection; Ontology Concept; Spam Classification; Naïve Bayes Algorithm; Term Frequency Variance.

1. Introduction

Phishing e-mails are e-mails used to steal people's important credentials. Usually, invaders build a website similar to a popular website and persuade victims to enter their credentials. Such e-mails are sent to many users and waste valuable bandwidth. Many approaches have been proposed for spam identification, based either on text classification or on information in the e-mail headers. One approach uses header information such as the sender address and the number of receivers [1, 2]. Such approaches are successful, but invaders then choose other ways. In [3], an algorithm based on hashing and summarizing the spam is presented. In [4], the authors use a method of distributing keys and digital signatures to identify phishing e-mails. Another usual approach to identifying phishing e-mails is learning and data mining methods [5, 6]: phishing e-mails share some special words which are more frequent in the context of the text, and algorithms such as Naïve Bayes [6] can classify the text into the desired categories. We follow this approach. In the following sections, we discuss the ontology concept and the Naïve Bayes classifier. We use a heuristic, two-layer approach to detect phishing e-mails: first we try to identify a phishing e-mail by some common features; if it is not recognized, it is classified by our proposed classifier. In the last section, we present our experiments and show our results.

2. Naïve Bayes Algorithm

The Naïve Bayes algorithm is a classifier widely used in text classification. It estimates new texts according to previously learned parameters and assigns a probability to every category; the category with the highest probability is assigned to the new text. The problem consists of many samples. Every sample has a target value v_j ∈ V and a set of attributes a_1, a_2, ..., a_n. The learning algorithm computes the target value for the sample and assigns the best one. This maximum value is obtained according to the following formula:

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)

Here v_NB represents the output of the classifier, v_j denotes class j, P(v_j) is the prior probability of class j, and P(a_i | v_j) is the conditional probability of attribute i in class j. The classifier selects the v_j with the maximum value as the output. To utilize Naïve Bayes in our spam classification, we must decide how to represent the texts and how to compute the probabilities in the formula. We consider every word as an attribute and its frequency as the value; therefore, we can easily apply Naïve Bayes to our spam classification. We must also compute the conditional probabilities and the probability of every category, which is equal in our case, since we use the same number of training e-mails for the spam and normal classes. To find the conditional probabilities, we define a set named vocabulary, containing all the distinct words in the


training set. The conditional probability for every word is computed as follows:

P(a_i = w_k | v_j) = (n_k + 1) / (n + |vocabulary|)

where n_k is the number of occurrences of word w_k in all e-mails of class v_j, |vocabulary| is the number of distinct words in all training sets, and n is the total number of words in class v_j.
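As a minimal illustration of these two formulas (a Python sketch, not the paper's C# implementation; the toy training data and function names are our own), the Laplace-smoothed word probabilities and the argmax classification can be written as:

```python
import math
from collections import Counter

def train(emails_by_class):
    """emails_by_class: dict mapping class name -> list of token lists."""
    vocab = {w for docs in emails_by_class.values() for doc in docs for w in doc}
    counts = {c: Counter(w for doc in docs for w in doc)
              for c, docs in emails_by_class.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    return vocab, counts, totals

def word_prob(w, c, vocab, counts, totals):
    # P(a_i = w_k | v_j) = (n_k + 1) / (n + |vocabulary|)
    return (counts[c][w] + 1) / (totals[c] + len(vocab))

def classify(tokens, vocab, counts, totals):
    # Equal priors (same number of training e-mails per class), so
    # v_NB = argmax_j prod_i P(a_i | v_j); log-space avoids underflow.
    def score(c):
        return sum(math.log(word_prob(w, c, vocab, counts, totals))
                   for w in tokens if w in vocab)
    return max(counts, key=score)

emails = {
    "phishing": [["verify", "account", "click", "link"],
                 ["urgent", "account", "suspended", "click"]],
    "normal":   [["meeting", "agenda", "attached", "report"],
                 ["lunch", "friday", "report", "meeting"]],
}
model = train(emails)
print(classify(["click", "account", "now"], *model))   # → phishing
```

Words unseen in training (like "now") are simply skipped, which matches restricting the product to words in the vocabulary.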

Now, by considering the conditional probabilities and eliminating the (equal) prior probability of every class, a new e-mail can be classified:

v_NB = argmax_{v_j ∈ V} ∏_{i ∈ words} P(a_i | v_j)

In the above formula, words are all the distinct words in the new e-mail. The classifier assigns the category with the highest probability to the new e-mail.

3. Compiling Background Context

The main idea of compiling the background context of a text is to make the classifier capable of using the knowledge in the text. The drawback of this method is that it complicates preprocessing steps such as stemming and stop-word elimination. An ontology is a notation system O = (L, F, C, H, ROOT), declared as follows [7]:
- A lexicon, L, including all the words.
- A set of concepts, C.
- A reference function, F = Ref(t), which assigns a set of concepts to every word t.
- A hierarchy, H, which represents the hierarchy of the concepts. For example, H(Hotel, Accommodation) denotes that Hotel is a subconcept of Accommodation.
- A high-level concept named ROOT, which is the root of every concept.

Adding the concepts of a text to its text vector has many advantages. Firstly, it solves the problem of synonym words. Secondly, adding the upper concepts brings the text vector closer to the main gist of the text. For example, a clustering algorithm cannot find any relation between Pork and Beef, while an ontology-based algorithm adds an upper concept such as Meat and the two texts become correlated.

For adding the concepts of every word, we could follow different strategies [8]. We could add the concepts in addition to the words; we could replace the words with their concepts; or we could eliminate the words which do not have any meaning in our lexicon. We use the Add strategy and eliminate the words with no concept.

4. Disambiguating Strategies

Many words have several meanings and synonyms, and adding all of them diminishes the quality of classification. There are different strategies for adding synonyms. The first is adding all synonyms; obviously, this does not help much with disambiguating different meanings. Another is the First synonym strategy: in ontologies and dictionaries, meanings and synonyms are usually sorted according to their importance and frequency, so we select the first synonym the ontology offers. The last is the context-based strategy: according to the context of the text, we select the best synonym for the mapping function [8].
1. Firstly, we declare the semantic vicinity of a concept c as the set of its super- and sub-concepts: V(c) = {b ∈ C | c ≺ b or c ≻ b}.
2. We obtain the set U(c) = ∪_{b ∈ V(c)} Ref⁻¹(b), the words that can refer to any concept in the vicinity of c.
3. We define the distance function as follows: Dis(d, t) = first{c ∈ Ref(t) | c maximizes tf(d, U(c))}, where tf(d, U(c)) is the number of occurrences of words from the set U(c) in text d. Therefore, among the different meanings, we select the one which has the maximum synonyms in the underlying text.

5. Hierarchy of the Concepts

The main idea of this consideration is to add the super-concepts of a concept in the same way as the sub-concepts. For example, words like vehicle and motor could additionally denote the concept car.
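To make the Add strategy concrete, here is a sketch (in Python rather than the paper's C#; the tiny concept map and hierarchy are invented for illustration) of augmenting a token list with concepts and their super-concepts:

```python
# Toy ontology fragments: word -> concept, concept -> super-concept.
REF = {"pork": "Pork", "beef": "Beef", "hotel": "Hotel"}
SUPER = {"Pork": "Meat", "Beef": "Meat", "Hotel": "Accommodation"}

def add_concepts(tokens, depth=1):
    """Add strategy: keep each word that has a concept and append its
    concept plus up to `depth` levels of super-concepts; words with no
    concept are eliminated, as the paper describes."""
    out = []
    for w in tokens:
        c = REF.get(w)
        if c is None:
            continue  # eliminate words with no concept
        out.append(w)
        for _ in range(depth + 1):
            if c is None:
                break
            out.append(c)
            c = SUPER.get(c)
    return out

print(add_concepts(["pork", "dinner"]))   # → ['pork', 'Pork', 'Meat']
```

With depth 1, "pork" gains both Pork and the upper concept Meat, so a text mentioning pork and one mentioning beef share the Meat feature, which is exactly the correlation effect described above.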


Figure 1. Word hierarchy in an ontology.

By using the above strategies, we enter the ontology-based concepts into the text vector for the Naïve Bayes classifier. Entering the super-concepts makes the text richer; on the other hand, entering lower concepts with the context strategy disambiguates concepts with different synonyms [8].

6. Term Frequency Variance

The feature vector is one of the most important parts of a text mining algorithm. Eliminating repetitive and redundant words increases the efficiency of the classifier. There are many algorithms for evaluating the importance of words, such as Term Frequency Variance (TFV) and Information Gain (IG). In this paper, we use the TFV approach, which depends on the clusters. Considering the clusters C = {c_1, c_2, ..., c_k}, for every word, firstly the term frequency in every cluster and then the average of the cluster frequencies are computed. The TFV of feature f is then computed as follows:

TFV(f) = Σ_{i=1}^{k} [tf(f, c_i) − mean_tf(f)]²

where c_i represents the clusters. Document Frequency, by contrast, is independent of the clusters; it simply judges the importance of a word by its overall frequency. TFV solves this problem by favoring words whose distributions differ strongly across the clusters. We sort the words by their TFV scores and select the best ones, usually reducing the feature space to about 10%.

7. Proposed Method and Experimental Results

Most phishing e-mails have common attributes that can be used to detect them, including non-standard ports, IP addresses in hexadecimal format, and character misuse. Usually, phishing attacks originate from computers without fixed DNS addresses. Such attacks use hex-based URLs to distract users from the real address (like http://19.13.0.1:2234/auBank.cgi?cu_id). Most attacks target a limited number of famous companies such as PayPal, and such companies do not send any e-mail with the phishing features described earlier. Therefore, we suppose every e-mail forged from such a company and showing phishing features is a phishing e-mail [9]. If an e-mail claims to come from a popular company and, moreover, contains a link to an irrelevant place or is based on IP addresses, it is considered a phishing e-mail. If it does not have any phishing feature, it is sent to the next layer for text classification with the ontology-based Naïve Bayes classifier. We summarize the steps of our algorithm below.

Figure 2. Five steps in our proposed method.

Firstly, our algorithm performs some preprocessing on the raw text of the e-mails: it eliminates HTML tags, stems words, and removes stop words. In the next step, we extract the subject and sender address to check whether they match the phishing e-mail features. Afterwards, the text is fed to the ontology, and related concepts are added according to the strategies above. Then TFV is applied, which shrinks the vector space and eliminates redundant concepts. Finally, the Naïve Bayes algorithm classifies the e-mails into spam or normal e-mails.
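The TFV scoring and feature selection from Section 6 can be sketched as follows (a Python illustration over our own toy data, not the paper's C# code; the keep ratio is a parameter, with 10% as the paper's usual choice):

```python
from collections import Counter

def tfv_scores(clusters):
    """clusters: list of token lists, one per cluster (class).
    TFV(f) = sum_i (tf(f, c_i) - mean_tf(f))^2"""
    k = len(clusters)
    tfs = [Counter(c) for c in clusters]
    vocab = set().union(*tfs)
    scores = {}
    for f in vocab:
        mean = sum(tf[f] for tf in tfs) / k
        scores[f] = sum((tf[f] - mean) ** 2 for tf in tfs)
    return scores

def select_features(clusters, keep_ratio=0.10):
    """Keep the top `keep_ratio` share of words by TFV score."""
    scores = tfv_scores(clusters)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n]

spam   = ["click", "click", "account", "meeting"]
normal = ["meeting", "meeting", "report", "account"]
print(select_features([spam, normal], keep_ratio=0.25))   # → ['click']
```

A word like "account" that is spread evenly across both clusters scores zero and is discarded, while "click", concentrated in one cluster, ranks highest; this is the distribution-sensitivity that plain Document Frequency lacks.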

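As a rough sketch of the first-layer check described above (our own simplified rules in Python, not the paper's C# implementation; the regular expressions and the whitelist of company names are illustrative assumptions):

```python
import re

KNOWN_COMPANIES = {"paypal", "ebay"}  # assumed whitelist of targeted brands

IP_URL = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")       # raw IP address
HEX_URL = re.compile(r"https?://0x[0-9a-fA-F]+")               # hex-encoded host
PORT_URL = re.compile(r"https?://[^/\s:]+:(?!80\b|443\b)\d+")  # non-standard port

def first_layer_is_phishing(sender, body):
    """Layer 1: an e-mail claiming to come from a known company that
    contains IP-based, hex-based, or odd-port URLs is flagged directly;
    otherwise it falls through to the text classifier."""
    claims_company = any(c in sender.lower() for c in KNOWN_COMPANIES)
    suspicious_url = any(p.search(body) for p in (IP_URL, HEX_URL, PORT_URL))
    return claims_company and suspicious_url

print(first_layer_is_phishing(
    "security@paypal.example",
    "Verify now: http://19.13.0.1:2234/auBank.cgi?cu_id"))   # → True
```

An e-mail that claims no known brand, or that contains no suspicious URL, returns False and would be passed on to the ontology-based Naïve Bayes layer.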

For our experiments, we used the Enron-Spam dataset [10]; for phishing e-mails, we used the corpus in [11]. We considered 800 normal e-mails (spam and non-spam), 600 of which were used for training the Naïve Bayes classifier and the rest as the test dataset. We also used 200 e-mails from the phishing dataset, 150 of which were devoted to training. All algorithms were implemented in C#. To evaluate the efficiency of our approach, we use the precision and recall metrics stated in [12].

Table 1. Evaluating parameters.
                              Assigned to Phishing    Not assigned to Phishing
Belonging to Phishing         tp                      fn
Not belonging to Phishing     fp                      tn

Precision shows the accuracy of the algorithm, while recall represents the completeness of its suggestions. There may be instances for which the algorithm cannot determine any category, which reduces the recall.

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)

We also use another parameter, the F measure, computed as follows:

F = 2 · Precision · Recall / (Precision + Recall)

Table 2. Results.
Using Ontology Concept    Using TFV    Depth of Hierarchy    F (%)
False                     False        -                     81.35
False                     True         -                     86.73
True                      True         1                     90.32
True                      True         3                     94.87

In our experiments, as shown above, the best result is gained by using the context-based strategy for selecting among different synonyms. Moreover, entering the super- and sub-concepts up to 3 levels improves the results further. The results clearly show better accuracy when ontology compilation is applied to the text of the e-mails during classification. This process incurs more overhead for training the learning classifier. Including it for the test datasets would be an option, but ontology compilation for the test dataset in online classification slows down the system; thus we include text context compilation only when training the classifier.

Conclusions

In this paper, we introduced a new text mining approach for identifying phishing e-mails. Our approach helps the Naïve Bayes classifier by entering ontology-based concepts into the training and test datasets. We first pre-classified the e-mails with some common features of phishing e-mails, and then focused on text classification in our algorithm. We should be aware that using an ontology-based approach incurs overhead for our classifier; this is less important when the classification is used for offline purposes.

References

[1] Cook, D., Hartnett, J., Manderson, K., Scanlan, J.: Catching Spam Before it Arrives: Domain Specific Dynamic Blacklists. In: Australian Workshops on Grid Computing and e-Research, ACM Press, Tasmania, Australia (2006).
[2] Pfleeger, P., Bloom, S.: Canning Spam: Proposed Solutions to Unwanted Email. IEEE Security & Privacy, vol. 3, no. 2, pp. 40-47 (2005).
[3] Damiani, E., et al.: An Open Digest-based Technique for Spam Detection. In: 4th International Conference on Peer-to-Peer Computing, IEEE Press, Switzerland (2004).
[4] Adida, B., Hohenberger, S., Rivest, R.: Fighting Phishing Attacks: A Lightweight Trust Architecture for Detecting Spoofed Emails. In: DIMACS Workshop on Theft in E-Commerce: Content, Identity, and Service, Piscataway, NJ, USA, November (2005).
[5] Graham, P.: Better Bayesian Filtering. In: Spam Conference (2003).
[6] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-Mail. In: AAAI Workshop on Learning for Text Categorization, Madison, Wisconsin, USA (1998).
[7] Bloehdorn, S., Cimiano, P., Hotho, A., Staab, S.: An Ontology-based Framework for Text Mining. In: Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003 (2003).
[8] Asgarian, E., Habibi, J., Moaven, Sh., MoeinZadeh, H.: New Text Document Clustering Method based on Ontology. In: 3rd Conference on Information and Knowledge Technology, Mashhad, Iran, November 2007. (In Persian)
[9] Saberi, A., Vahidi, M., Minaei, B.: Learn to Detect Phishing Scams Using Learning and Ensemble Methods. In: IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Workshop on Web Security, Integrity, Privacy and Trust, pp. 311-314, Silicon Valley, USA (2007).
[10] Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naive Bayes: Which Naive Bayes? In: 3rd Conference on Email and Anti-Spam, Mountain View, California, USA, July (2006).
[11] phishingcorpus, http://www.monkey.org/~jose/wiki/doku.php?id=phishingcorpus, accessed 1 August 2010.
[12] Cleverdon, C. W., Mills, J.: The Testing of Index Language Devices. Aslib Proceedings, vol. 15, no. 4, pp. 106-130 (1963).
