CMPS-MC24 Literature Survey: Successful Applications of Data Mining


Matthew Sparkes
January, 2004

Abstract
This literature survey will focus on both the successful applications
of data mining and those areas where it is proposed that it could be
successful, and what techniques have been used in those instances.

1 Introduction
Data mining is a technique of data analysis that allows the user to discover
previously unknown relationships amongst data that can be usefully applied to
a problem. It has become very important in many industries, because their
ability to analyse data has become inadequate for the vast amount of data that
they store. It allows organisations to become more profitable and efficient by
furnishing them with information that could not ordinarily be discovered. This
paper analyses different fields where data mining is currently being used and
to what extent it has been successful.

2 Successful Applications of Data Mining


2.1 Data Mining in Criminal Investigations
Data mining has been used in many fields of criminal investigation, as
outlined here. The scopes of the different types of investigation vary greatly
and are discussed separately in this section.

2.1.1 Data Mining in Forensic Investigations


There are many different databases used in forensic investigations. The
police can make use of vast records of fingerprints, DNA, and even records
as varied as shoe prints. These records all take one of three forms: images
taken from a crime scene, images taken from suspects, or images to be used
as control samples (e.g. a shoe print from every popular shoe type). [Geradts]
Various analysis techniques are used, depending on the sort of image in
question. For fingerprint analysis, measures such as Euclidean distance and
the Henry classification scheme are used. Searches based upon one type of
data are simple enough; although some advanced analysis techniques have
been developed for matching, they are all structured similarly. Data mining,
it has been suggested, can take this analysis to another level by looking for
patterns across all of these different databases. This has not been
implemented in the Geradts paper [Geradts], but the paper states that it
would be of huge benefit. It would enable police to discover links that would
be impossible to find manually, and it could conceivably solve currently
unsolved crimes, as the DNA database did when it was brought online.

Another paper describes an implemented system, called Coplink, that is in use
in the American legal system. It uses a 'concept space', which is a database
of weighted associations, that is, an evaluated relevance between all records.
This allows users to discover other appropriate records: for example, if a
shoeprint is taken at a crime scene, then other crimes of a similar nature
where a similar shoeprint was found would be flagged. Likewise, known
criminals with a record of that crime who are known to own that type of shoe
would also be highlighted. [Hauck 02]
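
The idea of weighted associations between records can be illustrated with a
minimal sketch. It assumes that the weight between two descriptors is simply
the fraction of records containing the first that also contain the second;
the weighting that Coplink itself uses is not described in this survey. The
records, descriptor names and function names below are all invented for
illustration.

```python
from collections import defaultdict
from itertools import permutations

def build_concept_space(records):
    """Return weights[a][b]: fraction of records containing a that also contain b."""
    term_count = defaultdict(int)
    pair_count = defaultdict(int)
    for terms in records:
        unique = set(terms)
        for term in unique:
            term_count[term] += 1
        # Ordered pairs, because the association is not symmetric.
        for a, b in permutations(unique, 2):
            pair_count[(a, b)] += 1
    weights = defaultdict(dict)
    for (a, b), n in pair_count.items():
        weights[a][b] = n / term_count[a]
    return weights

def related(weights, term, top=3):
    """Descriptors most strongly associated with `term`, strongest first."""
    ranked = sorted(weights.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]

# Invented case records: each is the set of descriptors attached to one crime.
RECORDS = [
    {"burglary", "shoeprint:trainer-A", "suspect:S1"},
    {"burglary", "shoeprint:trainer-A", "suspect:S1"},
    {"burglary", "shoeprint:boot-B", "suspect:S2"},
    {"vandalism", "shoeprint:trainer-A", "suspect:S3"},
]

concept_space = build_concept_space(RECORDS)
matches = related(concept_space, "shoeprint:trainer-A")
for descriptor, weight in matches:
    print(f"{descriptor}: {weight:.2f}")
```

Querying on a shoeprint found at a new scene returns the crime types and
suspects that most often appear alongside it, which is the kind of flagging
described above.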

2.1.2 Data Mining in Credit Card Fraud Detection


The spending pattern of most card users is surprisingly predictable: the
weekly shop, the average monthly disposable income spent and where it is
spent do not vary drastically. Data mining can be used to find anomalous
data such as a larger than usual payment, or a payment to an unlikely
recipient. This is often a sign that credit card fraud is occurring. One
suggested method of finding such transactions is to use several algorithms to
analyse each set of transactions and then to use a metaclassifier to collate
these analyses. The metaclassifier then flags certain transactions as
illegitimate far more accurately than any one algorithm used alone. [?] One
limitation of this approach is that the algorithms do not learn in any dynamic
sense; the paper states that adaptive algorithms must be used, as spending
patterns change and criminals also adapt and learn. This multi-algorithm
approach proved to be a very effective technique in experiments and has also
lent itself well to finding illegal intrusions into networks, as the two data
sets (network and transaction logs) are inherently similar. [Chan 99]
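
The combination of several analyses by a metaclassifier can be sketched as
follows. This is only an illustration under strong assumptions: the three
base detectors are hand-written rules rather than trained algorithms, the
metaclassifier is a simple lookup table of how often each combination of
base verdicts was fraudulent in labelled data, and every transaction, field
name and threshold is invented. It is not the method of the cited paper.

```python
from collections import defaultdict

# Hypothetical base detectors: each looks at one aspect of a transaction and
# returns 1 (suspicious) or 0 (normal). Real systems would use trained models.
def large_amount(t):
    return int(t["amount"] > 3 * t["usual_amount"])

def unusual_merchant(t):
    return int(t["merchant"] not in t["usual_merchants"])

def odd_hour(t):
    return int(t["hour"] < 6)

BASE_DETECTORS = (large_amount, unusual_merchant, odd_hour)

def base_votes(t):
    return tuple(detector(t) for detector in BASE_DETECTORS)

class MetaClassifier:
    """Learns how often each combination of base votes turned out to be fraud."""

    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0])  # votes -> [fraud, total]

    def fit(self, transactions, labels):
        for t, label in zip(transactions, labels):
            entry = self.stats[base_votes(t)]
            entry[0] += label
            entry[1] += 1
        return self

    def predict(self, t):
        votes = base_votes(t)
        if votes in self.stats:
            fraud, total = self.stats[votes]
            return int(fraud / total >= 0.5)
        # Combination never seen in training: fall back to a majority vote.
        return int(2 * sum(votes) > len(votes))

def tx(amount, merchant, hour):
    """Build an invented transaction for a customer with a fixed profile."""
    return {"amount": amount, "merchant": merchant, "hour": hour,
            "usual_amount": 50, "usual_merchants": {"grocery", "fuel"}}

# Invented labelled history: 1 means the transaction was fraudulent.
TRAINING = [
    (tx(40, "grocery", 12), 0),
    (tx(55, "fuel", 18), 0),
    (tx(30, "fuel", 2), 0),
    (tx(60, "electronics", 14), 0),
    (tx(400, "jewellery", 3), 1),
    (tx(500, "electronics", 15), 1),
    (tx(600, "grocery", 12), 1),
]

model = MetaClassifier().fit([t for t, _ in TRAINING], [y for _, y in TRAINING])
print(model.predict(tx(700, "fuel", 10)))
```

In this toy data a very large payment at a familiar merchant is flagged even
though only one of the three detectors fires, because the meta level has
learned that this combination was fraudulent before; a plain majority vote
would have let it through.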

2.1.3 Data Mining in Computer Network Security
The main aim of mining data for computer security is to examine logs to
find unusual patterns such as an irregular log-in time. This can often help to
discover illegal intrusions into the network. This is a successful application
of data mining and, helpfully, the target data is generated by a computer, so
no data cleansing needs to be performed. Research into an entirely visual
representation of network activity has been conducted, based on the premise
that humans can take in visual data at 150 Mb/s. [Yurcik et al] The idea is
that one screen can show the state of many thousands of computers connected
to the network, representing each as a two-pixel square of varying colour.
The information for this display is taken from network logs that have been
data mined for any anomalous data that should be highlighted. Another visual
system, called Mining Alarming Incidents from Data Streams (MAIDS), has been
developed; it applies clustering techniques to a stream of network
information (currently synthesised logs) and uses pie charts to represent
traffic and the classification assigned to it by the system (illegal, legal,
etc.). [Dora et al] This is a particularly interesting package, as tests have
shown it to be very well suited not just to detecting illegal intrusions but
also to monitoring any constant stream of structured data. This would enable
it to be implemented in a real-time monitoring scenario.
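
As a minimal sketch of spotting an irregular log-in time, the example below
builds a per-user profile (mean and spread of past log-in hours) and flags
new log-ins that fall far outside it. The log format, user names and
thresholds are assumptions made for illustration, and the sketch ignores the
fact that hours wrap around midnight. It is unrelated to the internals of the
tools cited above.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def parse_logins(log_text):
    """Yield (user, hour) for every LOGIN line of a hypothetical log format."""
    for line in log_text.strip().splitlines():
        date, time, user, event = line.split()
        if event == "LOGIN":
            stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
            yield user, stamp.hour

def build_profiles(log_text):
    """Map each user to (mean hour, standard deviation) of past log-ins."""
    hours = defaultdict(list)
    for user, hour in parse_logins(log_text):
        hours[user].append(hour)
    return {user: (mean(h), pstdev(h)) for user, h in hours.items()}

def irregular_logins(log_text, profiles, threshold=3.0, min_spread=1.0):
    """Return log-ins from unknown users or far from the user's usual hour."""
    flagged = []
    for user, hour in parse_logins(log_text):
        if user not in profiles:
            flagged.append((user, hour))
            continue
        mu, sigma = profiles[user]
        # min_spread stops a very regular user from triggering on tiny shifts.
        if abs(hour - mu) / max(sigma, min_spread) > threshold:
            flagged.append((user, hour))
    return flagged

HISTORY = """
2004-01-05 09:02:11 alice LOGIN
2004-01-06 08:57:40 alice LOGIN
2004-01-07 09:15:03 alice LOGIN
2004-01-08 10:01:29 alice LOGIN
2004-01-05 14:30:00 bob LOGIN
2004-01-06 15:10:45 bob LOGIN
2004-01-07 13:55:12 bob LOGIN
"""

NEW_EVENTS = """
2004-01-09 09:05:00 alice LOGIN
2004-01-09 03:12:44 alice LOGIN
2004-01-09 14:20:00 bob LOGIN
2004-01-09 11:00:00 mallory LOGIN
"""

profiles = build_profiles(HISTORY)
flagged = irregular_logins(NEW_EVENTS, profiles)
print(flagged)
```

Here the 03:12 log-in for a user who normally starts around 09:00 is
flagged, as is the log-in from an account with no history.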

2.2 Data Mining in Tax Administration


There are many areas of tax administration where data mining has proved
useful. Perhaps the most successful application has been in selecting audit
targets. As governments have only a limited amount of resources with which
to enforce tax payment, they must choose carefully whom to audit in order
to maximize their return. [Micci 04] This paper outlines a technique for
achieving this, although it merely uses Clementine, an existing commercial
data mining package, and breaks no new ground with its approach. It is,
however, a perfect example of a simple use for data mining: taking existing
data about tax offenders and trying to match that profile to current targets.
A database of tax offenders is built up and patterns are found in that data;
these patterns are then searched for in the data space that includes all
citizens. It has shown itself to be a useful tool for using an existing budget
far more effectively and efficiently. [Micci 04]
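
The two steps described above (find patterns among known offenders, then
search for them across everyone) can be sketched in a few lines. This is a
deliberately simple stand-in, not the Clementine models used in the cited
paper: it assumes a "pattern" is any attribute value that is clearly
over-represented among offenders, and it ranks the remaining taxpayers by how
many such traits they share. All taxpayer records and attribute names are
invented.

```python
from collections import Counter

def overrepresented_traits(population, offender_ids, min_lift=1.5):
    """Attribute values at least `min_lift` times more common among offenders."""
    everyone = Counter(item for rec in population.values() for item in rec.items())
    offenders = Counter(
        item for tid in offender_ids for item in population[tid].items()
    )
    traits = {}
    for item, count in offenders.items():
        share_offenders = count / len(offender_ids)
        share_everyone = everyone[item] / len(population)
        lift = share_offenders / share_everyone
        if lift >= min_lift:
            traits[item] = lift
    return traits

def audit_ranking(population, offender_ids, traits):
    """Rank taxpayers not yet known as offenders by total lift of shared traits."""
    scores = []
    for tid, rec in population.items():
        if tid in offender_ids:
            continue
        score = sum(lift for item, lift in traits.items() if item in rec.items())
        if score > 0:
            scores.append((tid, score))
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

# Invented taxpayer records keyed by a hypothetical taxpayer id.
POPULATION = {
    "T1": {"sector": "construction", "cash_business": True, "late_filer": True},
    "T2": {"sector": "construction", "cash_business": True, "late_filer": False},
    "T3": {"sector": "retail", "cash_business": False, "late_filer": False},
    "T4": {"sector": "software", "cash_business": False, "late_filer": False},
    "T5": {"sector": "retail", "cash_business": True, "late_filer": True},
    "T6": {"sector": "software", "cash_business": False, "late_filer": True},
    "T7": {"sector": "retail", "cash_business": False, "late_filer": False},
    "T8": {"sector": "construction", "cash_business": True, "late_filer": True},
}
KNOWN_OFFENDERS = ["T1", "T5"]

traits = overrepresented_traits(POPULATION, KNOWN_OFFENDERS)
ranking = audit_ranking(POPULATION, KNOWN_OFFENDERS, traits)
print(ranking)
```

With a fixed audit budget, the authority would work down such a ranking from
the top rather than choosing targets at random.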

2.3 Data Mining in Insurance Risk Assessment
In insurance risk assessment it can be beneficial to find previously unknown
patterns in claim data. For example, if it is generally assumed that young
(under 25) male drivers are a very high risk, then it would be very profitable
to discover that members of this demographic who also drive classic cars are
a very low risk. This is found to be the case, often because of the amount
of care taken over such cars. It would be highly profitable for an insurance
company to discover this before its competitors, as it would enable the
company to offer a specialist package for this demographic. It is often the
case that the profit made from this previously unknown sector can easily
cover the cost of the IT infrastructure needed to unearth it, with the
additional benefit that the company has increased its market share.
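
A minimal sketch of how such a sub-segment might be found: compute the claim
rate of each broad segment, split each segment by one further attribute, and
report any sub-segment whose claim rate is much lower than that of its
parent. The policy records below are invented purely to exercise the code
and are not real claim statistics; the field names and the 0.5 ratio are
assumptions.

```python
from collections import defaultdict

def claim_rates(policies, keys):
    """Claim rate (claims / policies) for each combination of `keys`."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [claims, policies]
    for p in policies:
        segment = tuple(p[k] for k in keys)
        totals[segment][0] += p["claimed"]
        totals[segment][1] += 1
    return {segment: claims / n for segment, (claims, n) in totals.items()}

def low_risk_subsegments(policies, parent_keys, extra_key, max_ratio=0.5):
    """Sub-segments whose claim rate is at most max_ratio times the parent's."""
    parent = claim_rates(policies, parent_keys)
    child = claim_rates(policies, parent_keys + [extra_key])
    found = []
    for segment, rate in child.items():
        parent_rate = parent[segment[:-1]]
        if parent_rate > 0 and rate <= max_ratio * parent_rate:
            found.append((segment, rate, parent_rate))
    return sorted(found)

def policy(age_band, sex, car, claimed):
    return {"age_band": age_band, "sex": sex, "car": car, "claimed": claimed}

# Invented portfolio: claimed is 1 if the policy had a claim, else 0.
POLICIES = (
    [policy("under25", "M", "modern", 1)] * 4
    + [policy("under25", "M", "modern", 0)] * 2
    + [policy("under25", "M", "classic", 0)] * 4
    + [policy("25plus", "F", "modern", 1)] * 1
    + [policy("25plus", "F", "modern", 0)] * 3
)

found = low_risk_subsegments(POLICIES, ["age_band", "sex"], "car")
print(found)
```

In practice the sub-segment would also need enough policies behind it for the
difference to be trusted; the sketch leaves out any such significance check.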

2.4 Data Mining in Aviation Safety


Mining data in aviation safety records is an important exercise for
discovering trends in accidents. This information may be used to ensure that
the same type of event does not happen in the future. One paper discusses the
problems inherent in mining both structured and unstructured data. [Bloedorn]
Queries could be made on both types but would only be useful if exactly the
same string was found in two areas; this would obviously not flag all
relevant data. The strategy proposed in this paper was a hybrid of strict
boolean, ordinal and vector-based matching. [Bloedorn] The approach also
makes use of stemming, whereby variations of the same word (e.g. engineering,
engineered and engineer) are treated the same in order to increase the number
of relevant results. Stop words are also taken into account, as in search
engines like Google, meaning that words such as "in", "of" and "and" are
omitted due to their frequency and irrelevance in normal text. Another
technique that uses clustering on streams of data, MAIDS, could also be very
well suited to this field, although it is only really appropriate in real
time because it processes streams of data. It certainly could not work on
unstructured text records, but it is easy to see it being used on black box
data from aircraft, either in real time or in simulated real time after the
event. It has proved itself on network intrusion problems and could
conceivably extract anomalous data from the enormous number of metrics that
are regularly recorded on aircraft. [Dora et al]
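
The text-matching ideas above (stop words, stemming and vector-based
matching) can be sketched as follows. This covers only the vector part of the
hybrid approach, uses a crude suffix-stripping rule as a stand-in for a real
stemmer such as Porter's, and ranks reports by cosine similarity of word
counts. The stop-word list, the suffix rule and the incident reports are all
invented for illustration.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"in", "of", "and", "the", "a", "to", "was", "on", "by", "after"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a proper stemming algorithm

def stem(word):
    """Strip one common suffix, keeping at least four letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def vectorise(text):
    """Turn text into counts of stemmed words, with stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, reports):
    """Return reports sharing at least one stem with the query, best first."""
    q = vectorise(query)
    scored = [(cosine(q, vectorise(r)), r) for r in reports]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [report for score, report in ranked if score > 0]

# Invented free-text incident reports.
R1 = "Engineer inspected the landing gear after hydraulic failure"
R2 = "Bird strike on take-off, engine shut down"
R3 = "Engineering report: gears jammed"
REPORTS = [R1, R2, R3]

results = search("engineered gear", REPORTS)
for report in results:
    print(report)
```

An exact string search for "engineered gear" would match none of these
reports, whereas the stemmed vector match retrieves both relevant ones.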

2.5 Data Mining in Retail
The use of data mining in retail is becoming far more extensive. In recent
years many large retail companies have increased the amount of data they
store about customers, and with that comes a greater need for the ability
to analyse that data. By offering store discount cards such as the Tesco
Clubcard it is possible to gather data about an individual and link that to
what they buy. By mining this data many useful connections can be made
between products, e.g. a customer buying Product A will often buy Product
B in the same trip. These are said to be complementary products and may
include pasta and pasta sauce, for example. Finding such connections is often
called market basket analysis. The connections can be used in designing the
layout of the store: placing the two products amongst impulse buy items and
in separate aisles may greatly increase the average spend of a customer.
[Apte 97]
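
A minimal sketch of market basket analysis on pairs of products is given
below. It uses the standard measures of support (the fraction of baskets
containing both products) and confidence (of the baskets containing A, the
fraction that also contain B). The baskets and the two thresholds are
invented for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (A, B, support, confidence) for rules 'buyers of A also buy B'."""
    n = len(baskets)
    item_count = Counter(item for basket in baskets for item in set(basket))
    pair_count = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions, since confidence is not symmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Invented shopping baskets, one set of products per trip.
BASKETS = [
    {"pasta", "pasta sauce", "milk"},
    {"pasta", "pasta sauce"},
    {"pasta", "pasta sauce", "bread"},
    {"milk", "bread"},
    {"pasta", "milk"},
]

rules = pair_rules(BASKETS)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support {support:.2f}, confidence {confidence:.2f}")
```

On real loyalty-card data the number of candidate pairs is very large, which
is why dedicated frequent-pattern algorithms are used instead of counting
every pair as this sketch does.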

2.6 Data Mining in Targeted Marketing


As previously mentioned, retail companies retain data about their customers,
linking purchase information with customer information by using store loyalty
cards. This data is also used to decide whom to send certain advertising
material to, in order to maximise the likelihood that the advert will be
received well. For example, if a customer regularly buys petrol on a store
card it is likely that they own a car, so it would be relevant to send
information about car loans to that customer. [Apte 97] This type of data
represents a powerful marketing tool and enables the company both to cut down
on the cost of marketing and to increase the positive results of that
marketing. Before data mining this would have been a paradox.

3 Conclusion
Data mining has found many different applications through necessity,
because almost every organisation in the world is undergoing a data storage
crisis. The amount of data in the world doubles every year and our capacity
to analyse that data is simply not keeping up. Data warehousing provides
a solution to the storage problem, but it is useless without analysis. Data
mining can solve this issue, making the data warehouse not only justifiable
but indispensable to the large modern corporation. Data mining is still a
very young technology. There are three main types of data mining: clustering,
predictive modeling and frequent pattern extraction, and these models are
basically the same whatever use they are put to. The innovation and
advancement is coming from programmers and analysts who are finding ever more
intelligent ways to use them to extract meaningful information from data.
[Apte 97] Companies are making use of this new technology and its use is
increasing. Worldwide spending on data mining is currently estimated
at $539 million and, if current growth continues, it will reach $1.85 billion
by 2006, which suggests that there is a demand for fast and efficient analysis
of vast amounts of data. Companies and organisations have been collecting
information for decades and now it seems that they can finally start to use it
productively rather than file it all away for occasional reference. [Leavitt 02]
It may, however, have negative connotations for the consumer: it raises
privacy issues surrounding the maintenance of records of customer and citizen
activity, and companies must ensure that in trying to become more efficient
they do not cross ethical and moral boundaries about what data they collect
and how they use that information. [Nascio 04]

References
[Yurcik et al] “Two Visual Computer Network Security Monitoring Tools
Incorporating Operator Interface Requirements”, William Yurcik,
James Barlow, Kiran Lakkaraju, Mike Haberman, National Center
for Supercomputing Applications (NCSA)

[Clare] “Data mining the yeast genome in a lazy functional language”,
Amanda Clare, Ross D. King, Computational Biology Group,
Department of Computer Science, University of Wales Aberystwyth

[Micci 04] “Improving Tax Administration with Data Mining”, Daniele
Micci-Barreca, PhD, Satheesh Ramachandran, PhD, White Paper,
Elite Analytics, LLC, 2004

[Bloedorn] “Mining Aviation Safety Data: A Hybrid Approach”, Eric
Bloedorn, The MITRE Corporation

[Nascio 04] “Think Before You Dig: Privacy Implications of Data Mining
Aggregation”, National Association of State Chief Information
Officers, USA, September 2004

[Geradts] “Data Mining in Forensic Image Databases”, Zeno Geradts,
Jurrien Bijhold, Netherlands Forensic Institute

[Leavitt 02] “Data Mining for the Corporate Masses?”, Neal Leavitt,
Computer (Magazine), May 2002

[Apte 97] “Data Mining - An Industrial Research Perspective”, C. Apte,
IEEE Computational Science and Engineering, April-June 1997

[Chan 99] “Distributed Data Mining in Credit Card Fraud Detection”,
Philip K. Chan (Florida Institute of Technology), Wei Fan,
Andreas L. Prodromidis and Salvatore J. Stolfo (Columbia
University), IEEE Intelligent Systems, November/December 1999

[Hauck 02] “Using Coplink to Analyze Criminal-Justice Data”, Roslin
V. Hauck, Homa Atabakhsh, Pichai Ongvasith, Harsh Gupta,
Hsinchun Chen, University of Arizona, IEEE Computing
Practices, 2002

[Dora et al] “MAIDS: Mining Alarming Incidents from Data Streams”, Y.
Dora Cai, David Clutter, Greg Pape, Jiawei Han, Michael
Welge, Loretta Auvil, University of Illinois at Urbana-Champaign,
U.S.A., Demonstration Proposal
