
Data Mining


Domain 3: Security Engineering (Engineering and Management of Security)

Eric Conrad, ... Joshua Feldman, in CISSP Study Guide (Third Edition), 2016

Data Mining
Data mining searches large amounts of data to determine patterns that would otherwise get “lost in the noise.” Credit card issuers have become experts in data mining, searching millions of credit card transactions stored in their databases to discover signs of fraud. Simple data mining rules, such as “X or more purchases, in Y time, in Z places” can be used to discover credit cards that have been stolen and used fraudulently.
Data mining raises privacy concerns: imagine if life insurance companies used data mining to track purchases such as cigarettes and alcohol, and denied claims based on those purchases.

Data Mining

Colleen McCue, in Data Mining and Predictive Analysis, 2007

Is Data Mining Evil?
Further confounding the question of whether to acquire data mining technology is the heated debate regarding not only its value in the public safety community but also whether data mining reflects an ethical, or even legal, approach to the analysis of crime and intelligence data. The discipline of data mining came under fire in the Data Mining Moratorium Act of 2003. Unfortunately, much of the debate that followed has been based on misinformation and a lack of knowledge regarding these very important tools. Like many of the devices used in public safety, data mining and predictive analytics can confer great benefit and enhanced public safety through their judicious deployment and use. Similarly, these same assets also can be misused or employed for unethical or illegal purposes.
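The credit card rule quoted in the CISSP Study Guide excerpt above (“X or more purchases, in Y time, in Z places”) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_suspicious`, the tuple layout of a transaction, and the sample data are all hypothetical, and real issuers presumably use far richer rules.

```python
from collections import deque
from datetime import datetime, timedelta


def flag_suspicious(transactions, min_purchases, window, min_places):
    """Return the set of card IDs that trip the rule.

    transactions: iterable of (card_id, timestamp, place) tuples (assumed layout).
    A card is flagged when some time span of length `window` contains at least
    `min_purchases` purchases made in at least `min_places` distinct places.
    """
    recent = {}      # card_id -> deque of (timestamp, place) still inside the window
    flagged = set()
    for card_id, ts, place in sorted(transactions, key=lambda t: t[1]):
        queue = recent.setdefault(card_id, deque())
        queue.append((ts, place))
        # Drop purchases that have fallen out of the time window.
        while ts - queue[0][0] > window:
            queue.popleft()
        places = {p for _, p in queue}
        if len(queue) >= min_purchases and len(places) >= min_places:
            flagged.add(card_id)
    return flagged


# Invented sample data: card-A is used in three cities within 40 minutes,
# card-B is used three times over four days.
sample = [
    ("card-A", datetime(2020, 1, 1, 9, 0), "Boston"),
    ("card-A", datetime(2020, 1, 1, 9, 20), "Chicago"),
    ("card-A", datetime(2020, 1, 1, 9, 40), "Denver"),
    ("card-B", datetime(2020, 1, 1, 9, 0), "Boston"),
    ("card-B", datetime(2020, 1, 3, 9, 0), "Boston"),
    ("card-B", datetime(2020, 1, 5, 9, 0), "Chicago"),
]
suspicious = flag_suspicious(sample, min_purchases=3,
                             window=timedelta(hours=1), min_places=3)
print(suspicious)
```

With X = 3, Y = one hour, and Z = 3, only the card used in three cities within the hour is flagged; loosening the thresholds flags more cards, which is the trade-off between missed fraud and false alarms.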
One of the harshest criticisms has addressed
important privacy issues. It has been suggested
that data mining tools threaten to invade the
privacy of unknowing citizens and unfairly target Introduction
them for invasive investigative procedures that
Vijay Kotu, Bala Deshpande PhD, in Predictive
are associated with a high risk of false allegations
Analytics and Data Mining, 2015
and unethical labeling of certain groups. The
concern regarding an individual's right to privacy 1.1 What Data Mining Is
versus the need to enhance public safety Data mining, in simple terms, is finding useful
represents a long-standing tension within the law patterns in the data. Being a buzzword, there are a
enforcement and intelligence communities that is wide variety of definitions and criteria for data
not unique to data mining. In fact, this concern is mining. Data mining is also referred to as knowledge
misplaced in many ways because data mining in discovery, machine learning, and predictive analytics.
and of itself has a limited ability, if any, to However, each term has a slightly different
compromise privacy. Privacy is maintained connotation depending upon the context. In this
through restricting access to data and chapter, we attempt to provide a general overview of
information. Data mining and predictive analytics data mining and point out its important features,
merely analyze the data that is made available; purpose, taxonomy, and common methods.
they may be extremely powerful tools, but they
Data mining starts with data, which can range from a
are tools nonetheless. With data mining,
simple array of a few numeric observations to a
ensuring privacy should be no different than with
complex matrix of millions of observations with
any other technique or analytical approach.
thousands of variables. The act of data mining uses
Unfortunately, many of these fears were based on some specialized computational methods to discover
a misunderstanding of the Total Information meaningful and useful structures in the data. These
Awareness system (TIA, later changed to the computational methods have been derived from the
Terrorism Information Awareness system), which fields of statistics, machine learning, and artificial
promised to combine and integrate wide-ranging intelligence. The discipline of data mining coexists
data and information systems from both the and is closely associated with a number of related
public and private sectors in an effort to identify areas such as database systems, data cleansing,
possible terrorists. Originally developed by the visualization, exploratory data analysis, and
Defense Advanced Research Projects Agency performance evaluation. We can further define data
(DARPA), this program was ultimately dismantled, mining by investigating some of its key features and
due at least in part to the public outcry and motivation.
concern regarding potential abuses of private
1.1.1 Extracting Meaningful Patterns
information. Subsequent review of the program, Knowledge discovery in databases is the nontrivial
however, determined that its main shortcoming process of identifying valid, novel, potentially useful,
was related to the failure to conduct a privacy impact and ultimately understandable patterns or
study in an effort to ensure the maintenance of relationships in the data to make important decisions
individual privacy; this is something that (Fayyad et al., 1996). The term “nontrivial process”
organizations considering these approaches distinguishes data mining from straightforward
should include in their deployment strategies and statistical computations such as calculating the mean
use of data-mining tools. or standard deviation. Data mining involves inference
On the other hand, some have suggested that and iteration of many different hypotheses. One of
incorporation of data mining and predictive the key aspects of data mining is the process of
analytics might result in a waste of resources. generalization of patterns from the data set. The
This underscores a lack of information regarding generalization should be valid not just for the data set
these analytical tools. Blindly deploying resources used to observe the pattern, but also for the new
based on gut feelings, public pressure, historical unknown data. Data mining is also a process with
precedent, or some other vague notion of crime defined steps, each with a set of tasks. The term
prevention represents a true waste of resources. “novel” indicates that data mining is usually involved
One of the greatest potential strengths of data in finding previously unknown patterns in the data.
mining is that it gives public safety organizations The ultimate objective of data mining is to find
the ability to allocate increasingly scarce law potentially useful conclusions that can be acted upon
enforcement and intelligence resources in a more by the users of the analysis.
efficient manner while accommodating a 1.1.2 Building Representative Models
concomitant explosion in the available In statistics, a model is the representation of a
information—the so-called “volume challenge” relationship between variables in the data. It
that has been cited repeatedly during describes how one or more variables in the data are
investigations into law enforcement and related to other variables. Modeling is a process in
intelligence failures associated with 9/11. Data which a representative abstraction is built from the
mining and predictive analytics give law observed data set. For example, we can develop a
enforcement and intelligence professionals the model based on credit score, income level, and
ability to put more evidence-based input into requested loan amount, to determine the interest
operational decisions and the deployment of rate of the loan. For this task, we need previously
scarce resources, thereby limiting the potential known observational data with the credit score,
waste of resources in a way not available income level, loan amount, and interest rate. Figure
previously. 1.1 shows the inputs and output of the model. Once
Regarding the suggestion that data mining has the representative model is created, we can use it to
been associated with false leads and law predict the value of the interest rate, based on all the
enforcement mistakes, it is important to note that input values (credit score, income level, and loan
these errors happen already, without data mining. amount).
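The loan example from Section 1.1.2 can be made concrete with a small sketch: fit a model on previously known records, then use it to predict the interest rate for a new applicant. Everything specific here is an assumption for illustration, not the book's method: ordinary least squares is used as the modeling technique, the inputs are rescaled (credit score in hundreds, income in tens of thousands, loan amount in hundreds of thousands), and the records come from an invented rule called `hypothetical_rate`.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]          # prepend an intercept column
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply the fitted model: intercept plus weighted sum of the inputs."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def hypothetical_rate(score, income, loan):
    # Invented generating rule, used only to make the toy data self-consistent.
    return 12.0 - 1.0 * score - 0.05 * income + 0.5 * loan


# Previously known observations: (credit score, income level, loan amount).
history = [
    (6.0, 4.0, 1.0),
    (6.5, 5.5, 2.0),
    (7.0, 7.0, 1.5),
    (7.5, 6.0, 3.0),
    (8.0, 9.0, 2.5),
    (7.2, 12.0, 4.0),
]
rates = [hypothetical_rate(*row) for row in history]

coefs = fit_linear_model(history, rates)
predicted = predict(coefs, (7.0, 8.0, 2.0))      # a new applicant
print([round(c, 3) for c in coefs], round(predicted, 2))
```

The fitted coefficients serve both purposes described in the text: `predict` gives the output for new inputs, and the sign and size of each coefficient indicate how strongly each input is related to the interest rate in this toy data.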
This is why there are so many checks and
balances in the system—to protect the innocent.
We do not need data mining or technology to
make errors; we have been able to do that without
the assistance of technology for many years.
There is no reason to believe that these same
checks and balances would not continue to
protect the innocent were data mining to be used
extensively. On the other hand, basing our Figure 1.1. Representative model for Predictive Analytics.

activities on real evidence can only increase the In the context of predictive analytics, data mining is
likelihood that we will correctly identify the bad the process of building the representative model that
guys while helping to protect the innocent by fits the observational data. This model serves two
casting a more targeted net. Like the difference purposes: on the one hand it predicts the output
between a shotgun and a laser-sighted 9mm, there (interest rate) based on the input variables (credit
is always the possibility of an error, but there is score, income level, and loan amount), and on the
much less collateral damage with the more other hand we can use it to understand the
accurate weapon. relationship between the output variable and all the
Again, the real issue in the debate comes back to input variables. For example, does income level really
privacy concerns. People do not like law matter in determining the loan interest rate? Does
enforcement knowing their business, which is a income level matter more than credit score? What
very reasonable concern, particularly when viewed happens when income levels double or if credit score
in light of past abuses. Unfortunately, this drops by 10 points? Model building in the context of
attitude confuses process with input issues and data mining can be used in both predictive and
places the blame on the tool rather than on the explanatory applications.
data resources tapped. Data mining can only be 1.1.3 Combination of Statistics, Machine Learning,
used on the data that are made available to it. and Computing
Data mining is not a vast repository designed to In the pursuit of extracting useful and relevant
maintain extensive files containing both public information from large data sets, data mining derives
and private records on each and every American, computational techniques from the disciplines of
as has been suggested by some. It is an analytical statistics, artificial intelligence, machine learning,
tool. If people are concerned about privacy issues, database theories, and pattern recognition.
then they should focus on the availability of and Algorithms used in data mining originated from
access to sensitive data resources, not the these disciplines, but have since evolved to adopt
analytical tools. Banning an analytical tool more diverse techniques such as parallel computing,
because of fear that it will be misused is similar to evolutionary computing, linguistics, and behavioral
banning pocket calculators because some people studies. One of the key ingredients of successful data
use them to cheat on their taxes. mining is substantial prior knowledge about the data
As with any powerful weapon used in the war on and the business processes that generate the data,
terrorism, the war on drugs, or the war on crime, known as subject matter expertise. Like many
safety starts with informed public safety quantitative frameworks, data mining is an iterative
consumers and well-trained personnel. As is process in which the practitioner gains more
emphasized throughout this text, domain information about the patterns and relationships
expertise frequently is the most important from data in each cycle. The art of data mining
component of a well-informed, professional combines the knowledge of statistics, subject matter
program of data mining and predictive analytics. expertise, database technologies, and machine
As such, it should be seen as an essential learning techniques to extract meaningful and useful
responsibility of each agency to ensure active information from the data. Data mining also typically
participation on the part of those in the know; operates on large data sets that need to be stored,
those professionals from within each organization processed, and computed. This is where database
that know where the data came from and how it techniques along with parallel and distributed
will be used. To relinquish the responsibility for computing techniques play an important role in data
analysis to outside organizations or consultants mining.
should be viewed in the same way as a suggestion 1.1.4 Algorithms
to entirely contract patrol services to a private We can also define data mining as a process of
security corporation: an unacceptable abdication discovering previously unknown patterns in the data
of an essential responsibility. using automatic iterative methods. Algorithms are
Unfortunately, serious misinformation regarding iterative step-by-step procedures to transform inputs
this very important tool might limit or somehow to output. The application of sophisticated algorithms
curtail its future use when we most need it in our for extracting useful patterns from the data
fight against terrorism. As such, it is incumbent differentiates data mining from traditional data
upon each organization to ensure absolute analysis techniques. Most of these algorithms were
integrity and an informed decision-making developed in recent decades and have been borrowed
process regarding the use of these tools and their from the fields of machine learning and artificial
output in an effort to ensure their ongoing intelligence. However, some of the algorithms are
availability and access for public safety based on the foundations of Bayesian probabilistic
applications. theories and regression analysis, which originated hundreds
of years ago. These iterative algorithms automate the
process of searching for an optimal solution for a
given data problem. Based on the data problem, data
mining is classified into tasks such as classification,
association analysis, clustering, and regression. Each
data mining task uses specific algorithms like
Multivariate Analysis: Overview decision trees, neural networks, k-nearest neighbors,
and k-means clustering, among others. With increased
I. Olkin, A.R. Sampson, in International Encyclopedia research on data mining, the number of such
of the Social & Behavioral Sciences, 2001 algorithms is increasing, but a few classic algorithms
remain foundational to many data mining
6.7 Data Mining
Data mining refers to a set of approaches and applications.
techniques that permit ‘nuggets’ of valuable
information to be extracted from vast and loosely

structured multiple data bases. For example, a


consumer products manufacturer might use data
mining to better understand the relationship of a
specific product's sales to promotional strategies, Data Mining Trends and
selling store's characteristics, and regional Research Frontiers
demographics. Techniques from a variety of different
disciplines are used in data mining. For instance, Jiawei Han, ... Jian Pei, in Data Mining (Third
computer science and information science provide Edition), 2012
methods for handling the problems inherent in
Mining Multimedia Data
focusing and merging the requisite data from Multimedia data mining is the discovery of
multiple and differently structured data bases. interesting patterns from multimedia databases that
Engineering and economics can provide methods for store and manage large collections of multimedia
pattern recognition and predictive modeling. objects, including image data, video data, audio data,
Multivariate statistical techniques, in particular, as well as sequence data and hypertext data
clearly play a major role in data mining. containing text, text markups, and linkages.
Multivariate notions developed to study relationships Multimedia data mining is an interdisciplinary field
provide approaches to identify variables or sets of that integrates image processing and understanding,
variables that are possibly connected. Regression computer vision, data mining, and pattern
techniques are useful for prediction. Classification recognition. Issues in multimedia data mining
and discrimination methods provide a tool to identify include content-based retrieval and similarity search,
functions of the data that discriminate among and generalization and multidimensional analysis.
categorizations of an individual that might be of Multimedia data cubes contain additional dimensions
interest. Another very useful technique for data and measures for multimedia information. Other
mining is cluster analysis that groups experimental topics in multimedia mining include classification and
units which respond similarly. Structural methods prediction analysis, mining associations, and video and
such as principal components, factor analysis, and audio data mining (Section 13.2.3).
path analysis are methodologies that can allow
simplification of the data structure into fewer
important variables. Multivariate graphical methods
can be employed to both explore databases and then
as a means for presentation of the data mining
results. Theoretical Considerations for
A reference to broad issues in data mining is given by Data Mining
Fayyad et al. (1996). Also see Exploratory Data Analysis:
Multivariate Approaches (Nonparametric Regression). Robert Nisbet, ... Gary Miner, in Handbook of
Statistical Analysis and Data Mining Applications,
2009
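Cluster analysis, described in the multivariate overview above as grouping units that respond similarly, can be illustrated with k-means, one of the classic algorithms named earlier on this page. The sketch below is a minimal version under simplifying assumptions: the starting centroids are simply the first k points, distance is squared Euclidean, and the two-dimensional sample points are invented.

```python
def kmeans(points, k, iterations=20):
    """Minimal k-means: returns (centroids, clusters) for a list of tuples."""
    centroids = [points[i] for i in range(k)]    # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:           # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Invented sample: two loose groups of units measured on two variables.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Taking the first k points as starting centroids keeps the example deterministic; practical implementations usually try several random starts because the result depends on initialization.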

What Is Data Mining?


Data mining can be defined in several ways, which
differ primarily in their focus on different aspects of
Introduction data mining. One of the earliest definitions is

Jiawei Han, ... Jian Pei, in Data Mining (Third The non-trivial extraction of implicit, previously
Edition), 2012 unknown, and potentially useful information from
data (Frawley et al., 1991).
1.7.5 Data Mining and Society
As data mining developed as a professional activity, it
How does data mining impact society? What steps
was necessary to distinguish it from the previous
can data mining take to preserve the privacy of
activity of statistical modeling and the broader activity
individuals? Do we use data mining in our daily lives
of knowledge discovery. For the purposes of this
without even knowing that we do? These questions
handbook, we will use the following working
raise the following issues:
definitions:
■ Social impacts of data mining: With data mining
• Statistical modeling: The use of parametric
penetrating our everyday lives, it is important to
statistical algorithms to group or predict an
study the impact of data mining on society. How
outcome or event, based on predictor variables.
can we use data mining technology to benefit
society? How can we guard against its misuse? • Data mining: The use of machine learning
The improper disclosure or use of data and the algorithms to find faint patterns of relationship
potential violation of individual privacy and data between data elements in large, noisy, and messy
protection rights are areas of concern that need to data sets, which can lead to actions to increase
be addressed. benefit in some form (diagnosis, profit, detection,
etc.).
■ Privacy-preserving data mining: Data mining will
help scientific discovery, business management, • Knowledge discovery: The entire process of data
economy recovery, and security protection (e.g., access, data exploration, data preparation,
the real-time discovery of intruders and modeling, model deployment, and model
cyberattacks). However, it poses the risk of monitoring. This broad process includes data
disclosing an individual's personal information. mining activities, as shown in Figure 2.1.
Studies on privacy-preserving data publishing and
data mining are ongoing. The philosophy is to
observe data sensitivity and preserve people's
privacy while performing successful data mining.
■ Invisible data mining: We cannot expect everyone
in society to learn and master data mining
techniques. More and more systems should have
data mining functions built within so that people
can perform data mining or use data mining
results simply by mouse clicking, without any
knowledge of data mining algorithms. Intelligent
search engines and Internet-based stores perform
Figure 2.1. The relationship between data mining and knowledge discovery.
such invisible data mining by incorporating data
mining into their components to improve their As the practice of data mining developed further, the
functionality and performance. This is done often focus of the definitions shifted to specific aspects of
unbeknownst to the user. For example, when the information and its sources. In 1996, Fayyad et al.
purchasing items online, users may be unaware proposed the following:
that the store is likely collecting data on the Knowledge discovery in databases is the non-trivial
buying patterns of its customers, which may be process of identifying valid, novel, potentially useful,
used to recommend other items for purchase in and ultimately understandable patterns in data.
the future.
The second definition focuses on the patterns in the
These issues and many additional ones relating to the data rather than just information in a generic sense.
research, development, and application of data These patterns are faint and hard to distinguish, and
mining are discussed throughout the book. they can only be sensed by analysis algorithms that
can evaluate nonlinear relationships between
predictor variables and their targets and themselves.
This form of the definition of data mining developed
along with the rise of machine learning tools for use
in data mining. Tools like decision trees and neural
Process Models for Data Mining nets permit the analysis of nonlinear patterns in data
and Predictive Analysis more easily than is possible in parametric statistical
algorithms. The reason is that machine learning
Colleen McCue, in Data Mining and Predictive algorithms learn the way humans do—by example,
Analysis (Second Edition), 2015 not by calculation of metrics based on averages and
data distributions.
Abstract
Data mining and predictive analytics can best be The definition of data mining was confined originally
understood as a process, rather than a specific to just the process of model building. But as the
technology, tool, or tradecraft. Chapter 4 includes an practice matured, data mining tool packages (e.g.,
overview of four complementary approaches to SPSS-Clementine) included other necessary tools to
analysis: the Central Intelligence Agency (CIA) facilitate the building of models and for evaluating
Intelligence Process, the CRoss Industry Standard and displaying models. Soon, the definition of data
Process for Data Mining (CRISP-DM), SEMMA, and mining expanded to include those operations in
the Actionable Mining and Predictive Analysis process Figure 2.1 (and some include model visualization
developed specifically for the operational public safety also).
and security environment. The Actionable Mining The modern Knowledge Discovery in Databases
and Predictive Analysis process addresses unique (KDD) process combines the mathematics used to
requirements and constraints associated with the discover interesting patterns in data with the entire
applied setting, including data access and availability, process of extracting data and using resulting models
public safety-specific evaluation, and the requirement to apply to other data sets to leverage the information
for operationally relevant and actionable output. Data for some purpose. This process blends business
privacy and security also are addressed. systems engineering, elegant statistical methods, and
industrial-strength computing power to find structure
(connections, patterns, associations, and basis
functions) rather than statistical parameters (means,
weights, thresholds, knots). In Chapter 3, we will
expand this rather linear organization of data mining
processes to describe the iterative, closed-loop
system with feedbacks that comprise the modern
approach to the practice of data mining.

Copyright © 2020 Elsevier B.V. or its licensors or contributors. ScienceDirect ® is a registered trademark of Elsevier B.V.
