Phishing Website Detection Using Intelligent Data Mining Techniques

*1 Ms. Vedhanayagi M., *2 Mr. Varadarajan T., M.C.A., M.Phil.

*1 M.Phil Research Scholar, PG and Research Department of Computer Science, Government Thirumagal Mills College, Gudiyattam, Tamilnadu, India.
*2 Assistant Professor & Head of Department, PG and Research Department of Computer Science, Government Thirumagal Mills College, Gudiyattam, Tamilnadu, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract:

Recently, the Internet has become a very important medium of communication. Many people go online and conduct a wide range of business. They can sell and buy goods, perform different banking activities and even participate in political and social elections by casting a vote online. The parties involved in a transaction never need to meet, and a buyer can sometimes be dealing with a fraudulent business that does not actually exist. Security for conducting business online is therefore vital. All security-critical applications (e.g. online banking login pages) that are accessed over the Internet are at risk of fraud. A common risk comes from so-called phishing websites, which have become a problem for online banking and e-commerce users. Phishing websites attempt to trick people into revealing their sensitive personal and security information so that the fraudster can access their accounts. They use websites that look similar to those of legitimate organizations and exploit the end-user's lack of knowledge of web browser clues and security indicators.

This thesis addresses the effectiveness of phishing website detection. It reviews existing anti-phishing approaches and then makes the following contributions.

First of all, the research in this thesis evaluates the effectiveness of the most common current user tips for detecting phishing websites. A novel effectiveness criterion is proposed and used to examine every tip and rank it by its effectiveness score, thus revealing the most effective tips for enabling users to detect phishing attacks. The most effective tips can then be used by anti-phishing classification and training approaches.

Secondly, this thesis proposes a novel Anti-Phishing Approach that uses Training Intervention for Phishing Websites' Detection (APTIPWD) and shows that it can be easily implemented.

Thirdly, the effectiveness of the New Approach (APTIPWD) is evaluated using a set of user experiments showing that it is more effective in helping users distinguish between legitimate and phishing websites than the Old Approach of sending anti-phishing tips by email. The experiments also address the effects of technical ability and phishing knowledge on phishing websites' detection. The results of the investigation show that technical ability has no effect, whereas phishing knowledge has a positive effect on phishing website detection. Thus, there is a need to ensure that, regardless of their technical ability level (expert or non-expert), the participants do not know about phishing before they evaluate the effectiveness of a new anti-phishing approach. This thesis then evaluates the anti-phishing knowledge retention of the New Approach users and compares it with the knowledge retention of users who are sent anti-phishing tips by email.

INTRODUCTION:

E-banking phishing websites are forged websites created by malicious people to mimic real e-banking websites. Most of these Web pages have high visual similarity to their targets in order to scam their victims. Some of these Web pages look exactly like the real ones. Unwary Internet users may be easily deceived by this kind of scam. Victims of e-banking phishing websites may expose their bank account, password, credit card number, or other important information to the phishing Web page owners. The impact is the breach of information security through the compromise of confidential data, and the
victims may finally suffer losses of money or of other kinds. Phishing is a relatively new Internet crime in comparison with other forms, e.g., viruses and hacking. More and more phishing Web pages have been found in recent years, at an accelerating rate [7]. The word phishing, from the phrase “website phishing”, is a variation on the word “fishing.” The idea is that bait is thrown out in the hope that a user will grab it and bite into it, just like a fish. In most cases, the bait is either an e-mail or an instant messaging site, which takes the user to hostile phishing websites [10].

E-banking phishing is a very complex issue to understand and to analyze, since it joins technical and social problems for which there is no known single silver bullet. The motivation behind this study is to create a resilient and effective method that uses Fuzzy Data Mining algorithms and tools to detect e-banking phishing websites in an automated manner. DM approaches such as neural networks, rule induction, and decision trees can be a useful addition to the fuzzy logic model. Such a model can deliver answers to business questions that traditionally were too time-consuming to resolve, such as "Which are the most important e-banking phishing website characteristic indicators and why?", by analyzing massive databases and historical data for training purposes.

A. LITERATURE REVIEW

Phishing websites are a recent problem. Nevertheless, because of their huge impact on the financial and online retailing sectors, and because preventing such attacks is an important step towards defending against e-banking phishing website attacks, several promising defensive approaches to this problem have been reported.

In this section, we briefly survey existing anti-phishing solutions and list the related works. One approach is to stop phishing at the email level [3], since most current phishing attacks use broadcast email (spam) to lure victims to a phishing website [21]. Another approach is to use security toolbars. The phishing filter in IE7 [19] is a toolbar approach with additional features such as blocking the user's activity on a detected phishing site. A third approach is to visually differentiate the phishing sites from the spoofed legitimate sites. Dynamic Security Skins [5] proposes to use a randomly generated visual hash to customize the browser window or web form elements to indicate successfully authenticated sites.

A fourth approach is two-factor authentication, which ensures that the user not only knows a secret but also presents a security token [6]. However, this approach is a server-side solution. Phishing can still happen at sites that do not support two-factor authentication. Sensitive information that is not related to a specific site, e.g., credit card information and SSN (Social Security Number), cannot be protected by this approach either [22].

Many industrial anti-phishing products use toolbars in Web browsers, but some researchers have shown that security toolbars do not effectively prevent phishing attacks. The authors of [4], [5] proposed a scheme that uses a cryptographic identity-verification method that lets remote Web servers prove their identities. However, this proposal requires changes to the entire Web infrastructure (both servers and clients), so it can succeed only if the entire industry supports it.

B. Main Characteristics Of E-Banking Phishing Websites.

As anti-phishing techniques evolve, phishers use various phishing techniques and increasingly complicated, hard-to-detect methods. The most straightforward way for a phisher to defraud people is to make the phishing Web pages similar to their targets. There are, however, many characteristics and factors that can distinguish the original legitimate website from the forged e-banking phishing website, such as spelling errors, a long URL address and an abnormal DNS record.

C. Why Using Fuzzy Logic And Data Mining?

FL has been used for decades in the engineering sciences to embed expert input into computer models for a broad range of applications. It offers a promising alternative for measuring operational risks [18]. The FL approach provides more information to help risk managers assess and rank e-banking phishing website risks than the current qualitative approaches, as the risks are quantified based on a combination of historical data and expert input. The advantage of the fuzzy approach is that it enables processing of vaguely defined variables, and of variables whose relationships cannot be defined mathematically. FL can incorporate expert human judgment to define those variables and their relationships.

DM is the process of searching through large amounts of data and picking out relevant information. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from large data sets" [30], [31]. It is a powerful new technology with great potential to help researchers focus on the most important information in their data archive. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions [32].
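To illustrate how fuzzy logic handles a vaguely defined variable, the following is a minimal sketch of a linguistic variable for URL length with three fuzzy sets. The function names and the breakpoints 30 and 75 are assumptions made for illustration; only the value 54 appears in this study, as the reported normal URL length. It is not the membership design used by the authors.

```python
def left_shoulder(x, a, b):
    """Membership is 1 up to a, then falls linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def right_shoulder(x, a, b):
    """Membership is 0 up to a, then rises linearly to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def triangular(x, a, b, c):
    """Membership is 0 at a and c and peaks at 1 when x equals b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def url_length_memberships(length):
    """Degrees to which a URL length is 'low', 'moderate' or 'high' risk.

    Breakpoints 30 and 75 are hypothetical; 54 is the length cited in the text.
    """
    return {
        "low": left_shoulder(length, 30, 54),
        "moderate": triangular(length, 30, 54, 75),
        "high": right_shoulder(length, 54, 75),
    }
```

A length of 42 characters belongs partly to "low" and partly to "moderate" under these assumed breakpoints, which is the kind of graded judgment a crisp threshold cannot express.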
II. The Proposed Fuzzy Based Data Mining Approach

The approach described here is to apply fuzzy logic and data mining algorithms to assess e-banking phishing website risk on the 27 characteristics and factors which stamp the forged website. The essential advantage offered by fuzzy logic techniques is the use of linguistic variables to represent key phishing characteristic indicators and to relate them to the e-banking phishing website probability.

Figure: The broad idea of the proposed anti-phishing approach

The broad idea is to check whether a user is phishing aware when they surf the Internet and visit a phishing website. If the user tries to submit their sensitive information to the phishing website, they are shown an intervening message to help them understand what phishing websites are and how to detect them.

The New Approach also keeps anti-phishing training an ongoing process. This means that whenever users try to submit information to a phishing website, they will be trained.

In the case where the user is phishing aware, the approach does nothing and lets the user keep surfing the Internet.

Figure: The architecture of the New Approach (APTIPWD)

The components are the Proxy, the URL Agent (UA) and the Knowledge Base (KB). The intervention takes place between the Internet and Users. Any URL request made by a user goes through the Proxy. The Proxy communicates with the URL Agent (UA). When the user browses the URL page and clicks to submit information, the UA verifies whether the URL is blacklisted by checking the blacklists. If the URL is not blacklisted, the Proxy allows the submission process to proceed. If the URL is blacklisted, the Proxy prevents the information from being submitted. Then, the UA shows an intervening message to the user in order to help them understand what phishing is and how to detect it in the future.

There are many anti-phishing tips that can be used in the intervening message. The tip used in the intervening message is the most effective anti-phishing tip evaluated, as follows: "a fake website's address is different from what you are used to, perhaps there are extra characters or words in it or it uses a completely different name or no name at all, just numbers. Check the True URL (Web Address).

The true URL of the website can be seen in the page 'Properties' or 'Page Info': While you are on the website and using the mouse Go Right Click then Go 'Properties' or 'Page Info'. If you don't know the real web address for the legitimate organization, you can find it by using a search engine such as Google". The New Approach presents the intervening message to users who access phishing websites and try to submit their information.
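The interaction between the Proxy and the URL Agent described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the dictionary returned by `handle_submission`, the shortened message text and the blacklisted hostname are all hypothetical, and a real deployment would sit inside an actual HTTP proxy rather than a plain function call.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Shortened, illustrative version of the tip quoted in the text.
INTERVENTION_MESSAGE = (
    "A fake website's address is different from what you are used to. "
    "Check the true URL (web address)."
)


@dataclass
class UrlAgent:
    """Checks requested URLs against a blacklist of hostnames (hypothetical data)."""
    blacklist: set = field(default_factory=set)

    def is_blacklisted(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        return host.lower() in self.blacklist


@dataclass
class Proxy:
    """Stands between the user and the Internet and consults the URL Agent."""
    agent: UrlAgent

    def handle_submission(self, url: str) -> dict:
        # Blacklisted: block the submission and return the intervening message.
        if self.agent.is_blacklisted(url):
            return {"allowed": False, "message": INTERVENTION_MESSAGE}
        # Not blacklisted: let the submission proceed untouched.
        return {"allowed": True, "message": None}
```

With `Proxy(UrlAgent({"phish.example.test"}))`, a submission to `http://phish.example.test/login` is blocked and carries the message, while a submission to any other host is allowed.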
Also, by using this approach, users do not need to attend training courses or access online training materials. This is because the approach brings information to end-users and helps them immediately after they have made a mistake, so that they can detect phishing websites by themselves.

The New Approach helps users make correct decisions in distinguishing phishing from legitimate websites during their normal use of the Internet. This approach will only work if intervention is shown to be an effective method for training people in the detection of phishing websites. In order to evaluate the New Approach effectively, a series of experiments needs to be carried out.

The design of the New Approach system consists of four components. They are:

• Server,
• Proxy (Gateway),
• Administrator, and
• Client (User).

The Administrator is a person who is in charge of sending phishing emails to any User in a network. The Proxy is in place between the Internet and Users. The Proxy acts as a gateway for all requests made in the network by its Users. Any URL request made by a User goes through the Proxy.

The Proxy then communicates with the Server. The Server contains three sub-components. They are a Fixed List of Anti-Phishing Training Websites (FLAPTW), a URL Agent (UA) and the Intervention message. The FLAPTW contains a fixed number of fake websites that are designed to look the same as the original ones and to be used for anti-phishing training only, whereas the UA is responsible for checking whether the requested URL passed by the Proxy is in the FLAPTW.

The Intervention message is stored in the Server. It is shown to the User in order to help them understand what phishing is and how to detect it in the future. The Administrator sends the anti-phishing training email to one or more specific Users. The email contains a link (URL) to one of the FLAPTW websites.

If the User goes to the URL, the UA verifies whether or not the URL is listed by checking the FLAPTW. If the URL is listed, the Proxy redirects the User to a simulated phishing page (i.e. not a real phishing page) to browse it. The page submission button is linked with an intervention message so that if the User clicks the button to submit information, the intervention message is presented to them. If the URL is not listed, the Proxy allows the User to browse the Internet as normal.

ASSUMPTION

There is an assumption that the Administrator is given the privilege in the network email system to send an anti-phishing training email that bypasses any anti-phishing filters applied in the network email system. This means that the anti-phishing training email should have the following characteristics: the domain of the sender's email should be the same as the domain of a legitimate website, and the email content should look like a legitimate email.

IMPLEMENTATION

In this section, the implementation of the components of the APTIPWD is presented. Each component's implementation is described separately.

SERVER

The Server component was implemented using Apache HTTP Server. Apache HTTP Server is an open-source web server for popular operating systems such as UNIX and Windows. A 1.40GHz Toshiba laptop running Microsoft Windows XP Home Edition was used to run the Apache HTTP Server.

The Server's sub-components, the URL Agent (UA), the Fixed List of Anti-Phishing Training Websites (FLAPTW) and the Intervention message, were linked to each other. The UA received any URL from the Proxy and directed it to either the local server (i.e. the prototype's Server) or the requested website on the Internet. This was accomplished by the virtual hosts directives in Apache HTTP Server.

The virtual hosts' container is a configuration file that contains all the web addresses that were served locally by the Server when requested. However, this container had to be pointed to by the main Apache HTTP Server configuration file.
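The routing decision made by the UA, as described in the SERVER section above, can be sketched in a few lines. In the prototype this decision was carried out by Apache virtual hosts rather than by application code, so the sketch below only mirrors the logic. The hostnames in `FLAPTW`, the function name and the return format are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical training hostnames; the prototype's real list is not given in the text.
FLAPTW = {
    "www.bank.example.test.login.example.net",
    "secure.shop.example.test.verify.example.net",
}

LOCAL_SERVER = "127.0.0.1"  # the prototype served training pages locally


def route_request(url, training_hosts=FLAPTW):
    """Return ('local', address) for training websites, else ('internet', host)."""
    host = (urlsplit(url).hostname or "").lower()
    if host in training_hosts:
        # Listed in the FLAPTW: serve the simulated page from the local server.
        return ("local", LOCAL_SERVER)
    # Not listed: the request goes out to the real website as normal.
    return ("internet", host)
```

Keeping the list as a set of hostnames makes the lookup independent of the path and query string, so every page of a training website is routed the same way.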
Figure: Examples of virtual hosts' directives in their container

Figure: Pointing to the virtual hosts' container in the Apache configuration file

In addition, the DNS hosts files in the Windows operating system were modified so that web browsers displayed the URL of the actual phishing websites. The web addresses listed were pointed to the local machine IP address (127.0.0.1) so that any request to one of the addresses that arrived at the Apache HTTP Server was directed to and served by the local server. Thus, the users were not actually at risk, since they used local web pages.

Table: The fixed list of anti-phishing training websites used in the prototype

The training websites used URLs with large host names that contained a part of well-known web addresses. Each of the websites was linked to the intervention message by modifying the submission button so that it transferred the traffic to the intervention message. The intervention message was a simple HTML page adjusted by JavaScript to appear as a pop-up window located in the middle of the screen.

Phishing Features Checking

One of the challenges faced in this research is the unavailability of a complete dataset to be used as a standard for phishing website features. According to [14], a few selected features can be used to differentiate between legitimate and spoofed web pages. These selected features include URLs, domain identity, security & encryption, source code, page style & contents, web address bar and social human factor. This study focuses only on URL and domain name features. Features of URLs and domain names are checked using several criteria: an IP address in the URL, a long URL address, an added prefix or suffix, redirection using the symbol “//”, and URLs containing the symbol “@”. These features are inspected using a set of rules in order to distinguish the URLs of phishing web pages from the URLs of legitimate websites. Below is a description of these rules.
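The five URL and domain-name criteria listed above can be sketched in Python as follows. The threshold of 54 characters and the expected position of “//” are the values this study states in its rule descriptions; the function names, the use of the hostname for the prefix/suffix check, and the choice to flag a URL when any single check fires are assumptions made for illustration, not the authors' exact rules.

```python
import ipaddress
from urllib.parse import urlsplit

MAX_NORMAL_LENGTH = 54  # normal URL length reported in the study's source [15]


def has_ip_host(url):
    """Rule (a): the host part of the URL is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_long(url):
    """Rule (b): the URL is longer than 54 characters."""
    return len(url) > MAX_NORMAL_LENGTH


def has_dash_in_host(url):
    """Rule (c): a prefix or suffix is attached to the host with '-'."""
    return "-" in (urlsplit(url).hostname or "")


def has_extra_double_slash(url):
    """Rule (d): the last '//' appears after the 6th (HTTP) or 7th (HTTPS) position."""
    expected = 7 if url.lower().startswith("https") else 6
    return url.rfind("//") + 1 > expected  # +1 converts to a 1-based position


def has_at_symbol(url):
    """Rule (e): the URL contains the '@' symbol."""
    return "@" in url


def check_url(url):
    """Run every rule and report which ones fired."""
    return {
        "ip_address": has_ip_host(url),
        "long_url": is_long(url),
        "prefix_suffix": has_dash_in_host(url),
        "double_slash": has_extra_double_slash(url),
        "at_symbol": has_at_symbol(url),
    }


def looks_like_phishing(url):
    """Assumed combination: suspicious if any rule fires."""
    return any(check_url(url).values())
```

For the study's own example `http://192.100.3.124//fake.html`, the IP address and “//” checks both fire, while a plain address such as `https://www.example.com/` passes all five checks.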
a) The IP address feature checks whether an IP address appears in the URL. For instance, a URL such as “http://192.100.3.124//fake.html” indicates that someone is trying to steal information from the user. In this study, this URL is checked using the following rule:

[Figure: rule for the IP address feature]

b) Long URLs are usually used by phishers to hide the suspicious part. There is no exact length that indicates a phishing site; however, the authors in [15] reported that the normal length of a URL does not exceed 54 characters. Thus, in this study a URL longer than 54 characters is treated as a suspicious link to a phishing web page. This study checks such URLs using the following rule.

[Figure: rule for the long URL feature]

c) Phishers tend to add prefixes or suffixes separated by a dash (-) so that the user will trust the URL as that of a legitimate web page. Below is the rule which can be used to check this feature.

[Figure: rule for the prefix or suffix feature]

d) Some phishing web page URLs have an addition at the front of the real URL. An example of this addition is http://www.legitimate.com//http://www.phishing.com. This feature checks the location of the symbol “//” in the URL. If the URL starts with “HTTP”, the symbol “//” should appear in the sixth position. However, if the URL employs “HTTPS”, the symbol “//” should appear in the seventh position. This study checks this feature using the following rule.

[Figure: rule for the “//” redirection feature]

e) The use of the “@” symbol leads the browser to ignore everything preceding the “@” symbol, and the real address often follows the “@” symbol. Thus, this study classifies any URL that includes the “@” symbol as a phishing URL using the following rule.

[Figure: rule for the “@” symbol feature]

Phishing Attack Checking Algorithm

In this study, the Uniform Resource Locator (URL) is used as an indicator to distinguish phishing web pages from legitimate ones. By using the URL, it can be determined whether the URL comes from a phishing site or a legitimate site.

CONCLUSION

The blacklist-based approach is unable to detect fraud websites on zero day, that is, before the fraud website is blacklisted. The use of a heuristic-based detection approach in the fraud website detection application enables it to detect fraud websites before they are blacklisted.

The application checks one or more characteristics like URL, HTML source code or the page
content. The classification module of the application, which consists of the data mining algorithm “RIPPER”, classifies any given website as Fraud or Legal. The application also takes corrective measures against a fraud website by reporting the high possibility that the website in question is fraudulent to the respective authority.

Thus, the application will prove useful in reducing the risk of phishing attacks by preventing users from entering confidential information in fraud websites. An AI-based hybrid system has been proposed for phishing website detection. Fuzzy logic has been combined with association classification data mining algorithms to provide efficient techniques for building intelligent models to detect phishing websites.

Empirical phishing case studies have been implemented to gather and analyze a range of different phishing website features and patterns, with all their relations. Our experimental case studies point to the need for extensive educational campaigns about phishing and other security threats. People can become less vulnerable with heightened awareness of the dangers of phishing.

Our experimental case studies also suggest that a new approach is needed to design a usable model for detecting e-banking phishing websites, taking into consideration the user's knowledge, understanding, awareness and consideration of the phishing pointers located outside the user's centre of interest.

FUTURE WORK

A fuzzy-based classification mining technique has been introduced for building an intelligent phishing website detection system, using a layered structure for collecting and analyzing all phishing website features and patterns.

This kind of supervised machine learning technique, which combines the fuzzy logic model with the association classification technique for detecting phishing websites, showed considerable potential for validity and usability throughout our research investigation. It will warn the user about a fraud website in real time while browsing the Internet, so the application will become more user-friendly.

The work to be done next is to extract a large data set from a wide range of samples and to apply different cross-validation schemes to it. The results motivate future work to explore the inclusion of additional variables in the data set, which might improve the predictive accuracy of the classifiers and decrease the misclassification rate of rule classification.

In addition, future work will explore developing an automated mechanism to extract new potential phishing risk features from raw phishing websites in order to keep up with new trends in phishing attacks.

BIBLIOGRAPHY

Books:

1. Web Spam Detection Using Data Mining, Ann Magdacy Garges, 2015.
2. E-mail Forensics: Eliminating Spam, Scams and Phishing, Les Hatton, 2010.
3. Google Gmail, 1st Edition, Steve Schwartz, 2009.

Web sites:

• http://www.webmining-detector.com
• http://www.researchgate.net
• http://tool.motoricerca.info/spam-detector
• https://www.sciencedirect.com/science

JOURNAL:

Email Spam Detection Using Association Rules and Extraction Techniques, International Engineering Research Journal (IERJ), Volume 2, Issue 2, Pages 736-739, 2016, ISSN 2395-1621.

Content-Based Spam Filtering and Detection Algorithms-An Efficient Analysis & Comparison, R.Malarvizhi, K.Saraswath, International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013.

SMS Spam Filtering Using Machine Learning Techniques: A Survey, Machine Learning Research, Volume 1, Issue 1, December 2016, Pages: 1-14. Received: Sep. 28, 2016; Accepted: Nov. 5, 2016; Published: Dec. 5, 2016.
