Final NLC Paper PDF

A MACHINE LEARNING APPROACH TO PHISHING DETECTION
Monish Naidu, Shreya Bhagole, Prathamesh Bodake

Lokmanya Tilak College of Engineering, Department of Computer Science.
Navi Mumbai, India.
Abstract: The aim of this paper is to elucidate the role of Machine Learning in detecting the threat of Phishing. Machine learning can provide an efficient method for determining whether a website or pop-up is a phishing website. The term phishing comes from “fishing”, probably influenced by phreaking, and alludes to the use of increasingly sophisticated lures to "fish" for users' financial information and passwords. There are many different techniques to combat phishing, including legislation and technology created specifically to protect against phishing.
This paper presents various methods to detect phishing attacks using Machine Learning. Machine learning is widely used and is evolving at a rapid rate in today’s technological world. With the application of Machine Learning, phishing attacks can be detected efficiently.
Keywords: Machine Learning, Phishing attacks, Phishers, Domain, Websites, Decision Tree.

INTRODUCTION

Phishing is a fraudulent practice of sending messages or emails to prominent members of an organization to persuade them to reveal personal information, company-related information, and other sensitive information.
Generally, in Phishing, the victim receives a message via any communication medium which appears to have been sent by a known contact or organization. Such messages look authentic but contain malware to steal sensitive information.
Users who are not alert fall prey to such malware and compromise their private information. According to the 3rd Microsoft Computing Safety Index Report, released in February 2014, the annual worldwide impact of phishing could be as high as $5 billion [3].
The more sophisticated a phishing email becomes, the more difficult it is to detect. Fortunately, Machine Learning offers a sophisticated approach to detecting phishing.

I. PHISHING STRATEGIES

Phishing has established itself as a major security threat in today’s web-driven world. The people who carry out phishing, colloquially known as phishers, choose from a variety of ways to harm the data security of the millions of users who generate web traffic around the world.
The primary reason many web servers fall prey to phishing websites is their vulnerability. Weaknesses in a web server are exploited to host a phishing website without the knowledge of the owner. It is also possible for a phisher to run an independent server of their own just for carrying out phishing activities.
Zhang et al. (2012) claimed that the method used for phishing varies from region to region: Chinese phishers register a new domain to host phishing websites, while American phishers hack an existing domain to deploy the phishing website [1].

Figure 1: The Phishing Procedure

Every phisher follows the same general approach. The process is as follows:
1. Planning: In this step, the phishers decide which organization to target and what information to get hold of. They also decide the strategy for obtaining the private information.
2. Setup: After the victim has been decided, the phishers create the basic setup to attack the victim and persuade them to give up the relevant information. This often involves the creation of e-mails, websites, etc.
3. Attack: After the creation of the setup, the phishers deploy the website or send the e-mail to the victim.
4. Collection: If the victim falls into the trap, the phishers collect the information leaked by the victim.
5. Illicit use of information: Phishers use the information to commit fraud, identity theft and many other illicit activities.

Figure 2: Example of Phishing Email

Based on the mode of attack, Phishing can be classified into the following types:
1. Deceptive Phishing: This is the most common phishing approach, where the phishers pose as a legitimate organization in order to steal someone’s login credentials or other personal information. This also includes “Dropbox phishing” and “Google Docs phishing”.
2. Spear Phishing: This approach is an advancement on the deceptive approach. Here, the phishers lure the victim into giving up sensitive personal information by acting as a sender with whom the victim has a connection. This includes the “Whaling” attack.
3. Pharming: The Internet uses DNS to locate servers. DNS converts alphabetical website names into the numerical IP addresses of the servers. In Pharming, the phisher changes the IP address associated with the website name. Hence, the phisher can redirect the user to a malicious website to steal information.

II. TRADITIONAL METHODS FOR PHISHING DETECTION

Criminal hackers have long used phishing to obtain secret and sensitive information from users. Phishing websites, emails and ads are well disguised and closely replicate the ones that the user trusts enough to enter sensitive information.
But no matter how complex and difficult to detect they are, these can never be perfect. There are simple things a user can look out for when determining whether a program is legitimate or a phishing candidate.

1. Check the E-mail address [4]: Check the sender’s email address. If your bank or any other organisation sends you an email, it will always come from an address carrying the company name and never from a personal mail profile.

Figure 3: Malicious Email Format [11]

2. Strange Attachments [4]: If the user receives an email or pop-up with a strange attachment and is asked to open it by clicking on it, do not open such random links. This could potentially be a phishing attack.

3. Check the links [6]: If you encounter a link or website which seems suspicious, copy the link and check it on one of the websites that report whether a website or link is malicious.

4. Secure connection [4]: This is usually identified by a green area in the address bar, along with https in the URL.

5. Check the Grammar [4]: Usually, if a site, pop-up or link is a phishing attack, its text is poorly written. This is a clear giveaway to stay away from that site or link.

6. Keep software updated [5]: Phishing malware usually depends on system bugs to attack. If the system is updated regularly, such bugs are fixed, which reduces the chances of an attack.

7. Beware of offers [13]: Most malware lures the user with a “too good to be true” offer or some exciting deal. Never fall prey to such false schemes. When something seems too good to be true, most of the time it is not true.

Figure 4: Attractive offers to lure users

8. Browsers Phishing list [6]: Your browser maintains a phishing list. When you visit a malicious website that is present in your browser's list, the browser shows a warning and blocks your connection to that website.

Figure 5: Browser Intimation about Phishing [12]

9. Use a Good Antivirus [14]: This is a traditional and very convenient means of security. Use a good-quality paid anti-virus and anti-malware to keep phishing attacks at bay. The antivirus warns you if you are downloading a phishing candidate or visiting a malicious website.

10. Read online reviews: Reading reviews of a website before visiting it can help you judge whether the website is safe. If a website or link has mostly positive reviews, it is likely to be safe. Reviews can also give you an idea about the working of the website.

III. THE MACHINE LEARNING DIMENSION TO PHISHING DETECTION

Due to the rapid growth of the internet, users prefer electronic commerce over traditional shopping. Criminals try to find their victims in cyberspace with various tricks. Using the anonymous structure of the internet, attackers deceive victims with false websites to gather sensitive information such as account IDs, card numbers, usernames, passwords, etc. Determining whether a web page is genuine or phishing is a very challenging problem [7].
Many software companies have launched new anti-phishing products which use machine learning based approaches. First, there must be criteria to differentiate malicious websites from genuine ones. Some of them are mentioned below.

A. URL: A phishing URL is a URL that leads a user to a phishing web page. It is the most basic thing to investigate when deciding whether a website is genuine. Some key URL features of phishing websites are mentioned below.

i. Number of subdomains in the URL
ii. Total length of the URL
iii. Whether the DNS name is valid (amazon -> amazen)
iv. Number of digits in the URL

This approach, being the most basic one, comes with a variety of methods. All of these methods require running a Python script to detect malware [2].

1. Detecting phishing URLs using PhishTank: We write Python scripts to automatically download confirmed phishing website URLs from PhishTank. PhishTank is a collaborative clearing house for data and information about phishing on the internet.

Figure 6: Machine Learning based approach to Anti-Phishing by checking URLs

2. Using IP Address: When a URL contains an IP address instead of a domain name, such as http://125.34.6.123/fake.html, it indicates that someone is trying to steal the user’s personal information. If the domain part contains an IP address, the website is classified as a phishing site; otherwise it is classified as legitimate.

3. Long URL: Attackers use long URLs to hide the suspicious part, which makes phishing URLs difficult to detect.

For example:

http://federmacedoadv.com.br/3f/aze/ab51e2e319e51502f416dbe46b773a5e/?cmd=home&dispatch=11004d58f5b74f8dc1e7c2e8dd4105e811004d58f5b74f8dc1e7c2e8dd4105e8@phishing.website.html

We calculate the length of the URLs in the dataset and produce an average URL length. As per the result, if the URL length is greater than or equal to 54 characters, the URL is classified as phishing [6].

B. Domain: The lifetime of a phishing site is very short, and hence Domain Name System (DNS) information about a phishing site may not be available after some time. Hence, if the record is not available anywhere, the website is considered not genuine.

The methods to detect phishing on the basis of domains are:

i. Blacklist based technique: Most phishing websites are replicas of existing websites. Hence, well-known services maintain blacklists of the phishing websites that may replicate their identity. Blacklists are generated by identifying phishing websites. Each phishing website is identified by a unique hash which is generated from a set of proposed features. Algorithms like the Simhash algorithm are used to generate a hash from each website.

The Simhash algorithm focuses on the hash created from the data. If the hashes created from two pieces of data are identical, the data are treated as identical. By this approach, we can check if a website is legitimate by comparing the hash generated from some of its features with the original website’s hash [8].

ii. WHOIS Database: The WHOIS Database is a public database of registered domain names. If the domain name of the site does not match a WHOIS database record, the webpage is designated as phishing [9].

Figure 7: Algorithm to detect phishing domains [2]

C. Content: The page content can be processed to check the legitimacy of the website. The features to be scanned are:
i. Meta Tags
ii. Images
iii. Page title
These features give out information like:
i. Website Category
ii. Requirement of login through third party domains.
iii. Information about the traffic.
All the above three criteria, when scanned, give a clear picture of the website we’re using.

IV. DETECTION PROCESS

To detect a phishing domain using machine learning, the machine has to be trained. This becomes a classification problem [3]. So, we require labelled data consisting of both legitimate and phishing domains to train the machine. The samples used must be precisely classified as legitimate or phishing without any doubt. The system will produce erroneous results if we use samples about which we’re not sure. To collect such samples, we use trusted sources of such information, for example, collecting phishing domains from PhishTank and legitimate domains from the WHOIS database. The detailed machine learning approach to phishing detection is as follows:

Step 1: Collect raw data, i.e. samples of phishing as well as genuine websites.

Step 2: Process the collected data, i.e. create a new dataset to train our machine using algorithms. According to our requirements, specific features must be extracted and evaluated for every website.

Step 3: Implement a Machine Learning algorithm, for example a Decision Tree.

A detailed decision tree methodology is explained below:

The Decision Tree algorithm uses an Information Gain measure to determine how efficiently a feature differentiates the samples. In its standard form, the gain can be calculated as:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $S$ is the set of samples, $A$ is a feature, and $S_v$ is the subset of $S$ for which $A$ takes the value $v$. The higher the gain, the higher the ability of the feature to differentiate between samples.

Entropy is a measure of the randomness in the given sample. The higher the entropy, the more mixed the sample elements are. In its standard form, entropy can be calculated as:

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the fraction of samples in $S$ that belong to class $i$ (here, phishing or legitimate).

Let's say we select the length of the URL as the feature. The websites will be divided into two sets denoting long and short URLs. The entropy and the gain will be calculated. This calculation is repeated for every feature that is relevant to us. When all the calculations are done, a decision tree is created. As we traverse the tree downwards, the nodes become increasingly pure.
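
The entropy and gain calculation described above for the URL-length feature can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the eight (length, label) pairs are hypothetical toy data, and the names `entropy`, `information_gain` and `LONG_URL_THRESHOLD` are introduced here only for the example. The 54-character cut-off is the one quoted in Section III.

```python
import math
from collections import Counter

# Cut-off between "short" and "long" URLs, taken from the rule in Section III.
LONG_URL_THRESHOLD = 54


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the full set minus the size-weighted entropy of each subset."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy sample (assumption): URL length in characters and its label.
samples = [
    (120, "phishing"),
    (87, "phishing"),
    (61, "phishing"),
    (30, "phishing"),
    (22, "legitimate"),
    (35, "legitimate"),
    (41, "legitimate"),
    (70, "legitimate"),
]
labels = [label for _, label in samples]
is_long = [length >= LONG_URL_THRESHOLD for length, _ in samples]

base_entropy = entropy(labels)
gain_long_url = information_gain(is_long, labels)
print(f"entropy of the full sample: {base_entropy:.3f} bits")
print(f"information gain of the long/short split: {gain_long_url:.3f} bits")
```

Repeating `information_gain` for every candidate feature and splitting on the feature with the highest gain is the step a decision tree learner applies recursively at each node.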
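
Step 2 of the detection process calls for extracting features from every website. A minimal sketch of how the URL features listed in Section III (length, IP address in place of a domain name, number of subdomains, number of digits) might be computed with the Python standard library is given below. The function names and the example.com URL are hypothetical, and the subdomain count is a deliberate simplification that treats the last two host labels as the registered domain, so it over-counts for suffixes such as .com.br.

```python
import ipaddress
from urllib.parse import urlsplit

# Cut-off for a "long" URL, taken from the rule in Section III.
LONG_URL_THRESHOLD = 54


def host_is_ip_address(host):
    """Return True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url):
    """Extract simple lexical features from a URL string."""
    host = urlsplit(url).hostname or ""
    has_ip = host_is_ip_address(host)
    host_labels = host.split(".") if host and not has_ip else []
    return {
        "length": len(url),
        "is_long": len(url) >= LONG_URL_THRESHOLD,
        "host_is_ip": has_ip,
        # Simplification (assumption): labels left of the last two are subdomains.
        "num_subdomains": max(len(host_labels) - 2, 0),
        "num_digits": sum(ch.isdigit() for ch in url),
    }


# The first URL is the IP-address example from Section III; the second is hypothetical.
ip_example = url_features("http://125.34.6.123/fake.html")
domain_example = url_features("http://login.secure.example.com/account")
print(ip_example)
print(domain_example)
```

Each dictionary can serve as one row of the training dataset, with the label (phishing or legitimate) supplied by the source the sample was collected from.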

V. ADVANTAGES OVER TRADITIONAL APPROACH

A. The Machine Learning model can be trained to detect phishing sites on its own.
B. The Machine Learning algorithm explained above provides better performance than other traditional approaches to phishing detection [15].

CONCLUSION

Phishing crimes are increasing at an alarming rate. There are laws to combat phishing, and the best defence is self-awareness, but that might not work all the time. Hence Machine Learning can be used as an alternative to detect such phishing attacks in a more efficient and precise way. Techniques such as WHOIS database lookups, algorithmic content scanning, and machine learning algorithms such as decision trees can be used to detect these malicious attacks. With time, the system should be able to automatically detect a phishing candidate and block it before it even appears to the user. Applications like PhishChecker can also be used to detect phishing.

REFERENCES
[1] Oluwatobi Akanbi and Elahe Fazeldehkordi, A ML approach to phishing detection and defense, 2014 edition.
[2] Hemali Sampat, Manisha Saharkar, Ajay Pandey and Hezal Lopes, “Detection of Phishing Website using Machine Learning”, Mar 2018.
[3] https://towardsdatascience.com/phishing-domain-detection-with-ml-5be9c99293e5
[4] https://www.itgovernance.co.uk/blog/5-ways-to-detect-a-phishing-email
[5] https://ssd.eff.org/en/module/how-avoid-phishing-attacks
[6] https://www.makeuseof.com/tag/4-general-methods-detect-phishing-attacks/
[7] https://www.sciencedirect.com/science/article/pii/S0957417418306067
[8] https://link.springer.com/chapter/10.1007/978-3-319-72598-7_20
[9] https://www.icann.org/resources/pages/phishing-2013-05-03-en
[10] https://www.business.com/articles/machine-learning-spear-phishing/
[11] https://www.globalsign.com/en-in/blog/how-to-spot-a-fake-website/
[12] https://www.symmetritechnology.com/post/10-steps-avoid-google-reporting-your-site-hacked-search-results
[13] https://www.hdfcbank.com/personal/learning-center/secure/guide-on-how-to-avoid-phishing-attack
[14] https://searchenterprisedesktop.techtarget.com/answer/Can-an-antivirus-program-stop-phishing-attacks
[15] https://nevonprojects.com/detecting-phishing-websites-using-machine-learning/
