
Ralf Güting and Markus Schneider

Joe Celko's SQL Programming Style

Joe Celko

Data Mining: Practical Machine Learning Tools and Techniques

Ian Witten and Eibe Frank

Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration

Earl Cox

Data Modeling Essentials, Third Edition

Graeme C. Simsion and Graham C. Witt

Location-Based Services
Jochen Schiller and Agnès Voisard

Database Modeling with Microsoft Visio for Enterprise Architects

Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean

Designing Data-Intensive Web Applications

Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla

Mining the Web: Discovering Knowledge from Hypertext Data

Soumen Chakrabarti

Advanced SQL:1999: Understanding Object-Relational and Other Advanced Features

Jim Melton

Database Tuning: Principles, Experiments, and Troubleshooting Techniques

Dennis Shasha and Philippe Bonnet

SQL:1999: Understanding Relational Language Components

Jim Melton and Alan R. Simon

Information Visualization in Data Mining and Knowledge Discovery

Edited by Usama Fayyad, Georges G. Grinstein, and Andreas Wierse

Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery
Gerhard Weikum and Gottfried Vossen

Spatial Databases: With Application to GIS

Philippe Rigaux, Michel Scholl, and Agnès Voisard

Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design

Terry Halpin

Component Database Systems

Edited by Klaus R. Dittrich and Andreas Geppert

Joe Celko's SQL for Smarties: Advanced SQL Programming

Joe Celko

Joe Celko's Data and Databases: Concepts in Practice

Joe Celko

Developing Time-Oriented Database Applications in SQL

Richard T. Snodgrass

Web Farming for the Data Warehouse

Richard D. Hackathorn

Management of Heterogeneous and Autonomous Database Systems

Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit Sheth

Object-Relational DBMSs: Tracking the Next Great Wave

Michael Stonebraker and Paul Brown, with Dorothy Moore

A Complete Guide to DB2 Universal Database

Don Chamberlin

Universal Database Management: A Guide to Object/Relational Technology

Cynthia Maro Saracco

Readings in Database Systems, Third Edition

Edited by Michael Stonebraker and Joseph M. Hellerstein

Understanding SQL's Stored Procedures: A Complete Guide to SQL/PSM

Jim Melton

Principles of Multimedia Database Systems

V. S. Subrahmanian

Principles of Database Query Processing for Advanced Applications

Clement T. Yu and Weiyi Meng

Advanced Database Systems

Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari

Principles of Transaction Processing

Philip A. Bernstein and Eric Newcomer

Using the New DB2: IBM's Object-Relational Database System

Don Chamberlin

Distributed Algorithms
Nancy A. Lynch

A Guide to Developing Client/Server SQL Applications

Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K. T. Wong

The Benchmark Handbook for Database and Transaction Processing Systems

Edited by Jim Gray

Camelot and Avalon: A Distributed Transaction Facility

Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector

Readings in Object-Oriented Database Systems

Edited by Stanley B. Zdonik and David Maier

University of Illinois at Urbana-Champaign

Amsterdam · Boston · Heidelberg · London · New York · Oxford
Paris · San Diego · San Francisco · Singapore · Sydney · Tokyo

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853…, e-mail: permissions@elsevier.co.uk. You may also complete your request online via the Elsevier home page (http://elsevier.com) by selecting "Customer Support" and then "Obtaining Permissions."

Library of Congress Cataloging-in-Publication Data

Application submitted

ISBN 13: 978-1-55860-901-3

ISBN 10: 1-55860-901-6

For information on all Morgan Kaufmann publications,


visit www.mkp.com or www.books.elsevier.com

Printed in the United States of America
06 07 08 09 10    5 4 3 2 1

To Erik, Kevan, Kian,

Preface xxi

Chapter 1 Introduction 1

What Motivated Data Mining?
So, What Is Data Mining? 5
Data Mining: On What Kind of Data?
Relational Databases 10
Data Warehouses 12
Transactional Databases 14
Advanced Data and Information Systems and Advanced Applications 15
Data Mining Functionalities: What Kinds of Patterns Can Be Mined? 21
Concept/Class Description: Characterization and Discrimination 21
Mining Frequent Patterns, Associations, and Correlations
Classification and Prediction
Cluster Analysis 25
Outlier Analysis 26
Evolution Analysis 27
Are All of the Patterns Interesting?
Classification of Data Mining Systems
1.7 Data Mining Task Primitives
1.8 Integration of a Data Mining System with a Database or Data Warehouse System
1.9 Major Issues in Data Mining 36

2.3.2 Noisy Data 62
2.3.3 Data Cleaning as a Process 65
Data Integration and Transformation
Data Integration 67
Data Transformation 70
Data Reduction 72
2.5.1 Data Cube Aggregation 73
2.5.2 Attribute Subset Selection 75
Dimensionality Reduction 77
Numerosity Reduction 80
Data Discretization and Concept Hierarchy Generation
Discretization and Concept Hierarchy Generation for Numerical Data 88
Concept Hierarchy Generation for Categorical Data
Summary 97
Exercises 97
Bibliographic Notes 101

Chapter 3 Data Warehouse and OLAP Technology: An Overview

What Is a Data Warehouse? 105
3.1.1 Differences between Operational Database Systems and Data Warehouses 108
3.1.2 But, Why Have a Separate Data Warehouse?
3.2 A Multidimensional Data Model 110
From Tables and Spreadsheets to Data Cubes
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
Examples for Defining Star, Snowflake, and Fact Constellation Schemas 117
Data Warehouse Implementation
3.4.1 Efficient Computation of Data Cubes
3.4.2 Indexing OLAP Data 141
3.4.3 Efficient Processing of OLAP Queries
From Data Warehousing to Data Mining
3.5.1 Data Warehouse Usage 146
From On-Line Analytical Processing to On-Line Analytical Mining
Summary 150
Exercises 152
Bibliographic Notes 154

Chapter 4 Data Cube Computation and Data Generalization

Efficient Methods for Data Cube Computation
A Road Map for the Materialization of Different Kinds of Cubes 158
Multiway Array Aggregation for Full Cube Computation
BUC: Computing Iceberg Cubes from the Apex Cuboid Downward 168
Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
Precomputing Shell Fragments for Fast High-Dimensional OLAP 178
Computing Cubes with Complex Iceberg Conditions
Further Development of Data Cube and OLAP Technology 189
Discovery-Driven Exploration of Data Cubes
Complex Aggregation at Multiple Granularity: Multifeature Cubes 192
Constrained Gradient Analysis

Chapter 5 Mining Frequent Patterns, Associations, and Correlations

5.1 Basic Concepts and a Road Map 227
Market Basket Analysis: A Motivating Example
Frequent Itemsets, Closed Itemsets, and Association Rules
Frequent Pattern Mining: A Road Map
Efficient and Scalable Frequent Itemset Mining Methods
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation 234
Generating Association Rules from Frequent Itemsets
5.2.3 Improving the Efficiency of Apriori
Mining Frequent Itemsets without Candidate Generation
Mining Frequent Itemsets Using Vertical Data Format
5.2.6 Mining Closed Frequent Itemsets
Mining Various Kinds of Association Rules
5.3.1 Mining Multilevel Association Rules
Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
From Association Mining to Correlation Analysis
Strong Rules Are Not Necessarily Interesting: An Example
From Association Analysis to Correlation Analysis
Constraint-Based Association Mining
Metarule-Guided Mining of Association Rules
Constraint Pushing: Mining Guided by Rule Constraints
Summary 272
Exercises 274
Bibliographic Notes 280

Bayes' Theorem 310
Naïve Bayesian Classification
6.4.3 Bayesian Belief Networks
Training Bayesian Belief Networks
Rule-Based Classification 318
Using IF-THEN Rules for Classification
Rule Extraction from a Decision Tree
Rule Induction Using a Sequential Covering Algorithm
Classification by Backpropagation
A Multilayer Feed-Forward Neural Network
Defining a Network Topology
Backpropagation 329
Inside the Black Box: Backpropagation and Interpretability
6.7 Support Vector Machines 337
The Case When the Data Are Linearly Separable
The Case When the Data Are Linearly Inseparable
Associative Classification: Classification by Association Rule Analysis 344
Lazy Learners (or Learning from Your Neighbors)
k-Nearest-Neighbor Classifiers
Case-Based Reasoning 350
Other Classification Methods
Genetic Algorithms 351
6.10.2 Rough Set Approach 351
6.10.3 Fuzzy Set Approaches 352
Prediction 354
Linear Regression 355
Nonlinear Regression 357
Other Regression-Based Methods
6.15.2 ROC Curves 372
Summary 373
Exercises 375
Bibliographic Notes 378

Chapter 7 Cluster Analysis 383

What Is Cluster Analysis? 383
7.2 Types of Data in Cluster Analysis 386
Interval-Scaled Variables 387
Binary Variables 389
Categorical, Ordinal, and Ratio-Scaled Variables
7.2.4 Variables of Mixed Types 395
Vector Objects 397
A Categorization of Major Clustering Methods
Partitioning Methods 401
Classical Partitioning Methods: k-Means and k-Medoids
Partitioning Methods in Large Databases: From k-Medoids to CLARANS 407
Hierarchical Methods 408
Agglomerative and Divisive Hierarchical Clustering
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies 412
ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes 414
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling 416
Density-Based Methods 418
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
CLIQUE: A Dimension-Growth Subspace Clustering Method
PROCLUS: A Dimension-Reduction Subspace Clustering Method 439
Frequent Pattern-Based Clustering Methods
Constraint-Based Cluster Analysis
Clustering with Obstacle Objects
User-Constrained Cluster Analysis
Semi-Supervised Cluster Analysis
Outlier Analysis 451
Statistical Distribution-Based Outlier Detection
Distance-Based Outlier Detection
Density-Based Local Outlier Detection
Deviation-Based Outlier Detection
Summary 460
Exercises 461
Bibliographic Notes 464

Chapter 8 Mining Stream, Time-Series, and Sequence Data

8.1 Mining Data Streams 468
8.1.1 Methodologies for Stream Data Processing and Stream Data Systems 469
Stream OLAP and Stream Data Cubes
Frequent-Pattern Mining in Data Streams
Classification of Dynamic Data Streams
Clustering Evolving Data Streams
8.2 Mining Time-Series Data 489
Trend Analysis 490
Similarity Search in Time-Series Analysis

Chapter 9 Graph Mining, Social Network Analysis, and Multirelational Data Mining 535

Graph Mining 535
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications: Graph Indexing, Similarity Search, Classification, and Clustering 551
9.2 Social Network Analysis 556
9.2.1 What Is a Social Network? 556
9.2.2 Characteristics of Social Networks
9.2.3 Link Mining: Tasks and Challenges
9.2.4 Mining on Social Networks 565
9.3 Multirelational Data Mining 571
What Is Multirelational Data Mining?
ILP Approach to Multirelational Classification
9.3.3 Tuple ID Propagation 575
Multirelational Classification Using Tuple ID Propagation
Multirelational Clustering with User Guidance
Summary 584
Exercises 586
Bibliographic Notes 587

Chapter 10 Mining Object, Spatial, Multimedia, Text, and Web Data

Multidimensional Analysis and Descriptive Mining of Complex Data Objects 591
10.1.1 Generalization of Structured Data
Aggregation and Approximation in Spatial and Multimedia Data Generalization 593
10.3 Multimedia Data Mining 607
Similarity Search in Multimedia Data
Multidimensional Analysis of Multimedia Data
Classification and Prediction Analysis of Multimedia Data
Mining Associations in Multimedia Data
Audio and Video Data Mining
Text Mining 614
Text Data Analysis and Information Retrieval
Dimensionality Reduction for Text
10.4.3 Text Mining Approaches 624
Mining the World Wide Web
Mining the Web Page Layout Structure
Mining the Web's Link Structures to Identify Authoritative Web Pages
Mining Multimedia Data on the Web
Automatic Classification of Web Documents
10.5.5 Web Usage Mining 640
Summary 641
Exercises 642
Bibliographic Notes 645

Chapter 11 Applications and Trends in Data Mining

11.1 Data Mining Applications 649
Data Mining for Financial Data Analysis
Data Mining for the Retail Industry
Data Mining for the Telecommunication Industry
Data Mining for Biological Data Analysis
Data Mining in Other Scientific Applications
Data Mining for Intrusion Detection
Summary 684
Exercises 685
Bibliographic Notes 687

Appendix An Introduction to Microsoft's OLE DB for Data Mining 691

A.1 Model Creation 693
A.2 Model Training 695
A.3 Model Prediction and Browsing 697

Bibliography 703

Index 745

We are deluged by data: scientific data, medica


and marketing data. People have no time to l
become the precious resource. So, we must find
to automatically classify it, to automatically sum
characterize trends in it, and to automatically
active and exciting areas of the database
research ing statistics, visualization, artificial
intelligence to this field. The breadth of the field
makes it dif over the last few decades.

Six years ago, Jiawei Han's and Micheline K


presented Data Mining. It heralded a golden ag of
their book reflects that progress; more than h are
to recent work. The field has matured with has
broadened to include many more datatypes
geospatial, audio, images, and video. We are cer
indeed research and commercial interest in dat
all fortunate to have this modern compendium

The book gives quick introductions to da


particular emphasis on data analysis. It then
cov cepts and techniques that underlie
classification These topics are presented with
examples, a to lem class, and with pragmatic
rules of thumb a Socratic presentation style is

both very readable a lot from reading the first


edition and got re-ed edition.

Jiawei Han and Micheline Kamber have be


research. This is the text they use with their stu

Contributing factors include the computerizatio


transactions; the widespread use of digital
cam most commercial products; and advances
in da text and image platforms to satellite
remote se of the World Wide Web as a global
informatio dous amount of data and
information. This exp has generated an urgent
need for new techniqu gently assist us in
transforming the vast amou knowledge.

This book explores the concepts and techn


flourishing frontier in data and information
syst also popularly referred to as knowledge
discover convenient extraction of patterns
representing in large databases, data
warehouses, the Web, ot data streams.

Data mining is a multidisciplinary field, dra


technology, machine learning, statistics, patt
neural networks, knowledge-based systems, a
computing, and data visualization. We present
hidden in large data sets, focusing on issues rela
tiveness, and scalability. As a result, this book
database systems, machine learning, statistics,
vide the background necessary in these areas in
hension of their respective roles in data minin
introduction to data mining, presented with eff It

should be useful for computing science stude


professionals, as well as researchers involved in

Data mining emerged during the late 1980s,


continues to flourish into the new millennium of
the field, introducing interesting data minin

numerous enhancements and a reorganization of the


book. In addition, several new chapters are included to
mining complex types of data, including stream data,
data, social network data, and multirelational data.

The chapters are described briefly as follows, with e


Chapter 1 provides an introduction to the multid It
discusses the evolutionary path of database techno

for data mining, and the importance of its applicatio to be


mined, including relational, transactional, and complex
types of data such as data streams, time-serie works,
multirelational data, spatiotemporal data, mult data. The
chapter presents a general classification of different kinds
of knowledge to be mined. In compa new sections are
introduced: Section 1.7 is on data users to interactively
communicate with data mining mining process, and
Section 1.8 discusses the issues r mining system with a
database or data warehouse sy sent the condensed
materials of Chapter 4, Data M Architectures, in the first
edition. Finally, major chall

Chapter 2 introduces techniques for preprocessin


corresponds to Chapter 3 of the first edition. Because
construction of data warehouses, we address this topi
introduction to data warehouses in the subsequent cha
ious statistical methods for descriptive data summariz

central tendency and dispersion of data. The descripti


been enhanced. Methods for data integration and transf
discussed, including the use of concept hierarchies for d
The automatic generation of concept hierarchies is also

Chapters 3 and 4 provide a solid introduction to d


Analytical Processing), and data generalization. The
Chapters 2 and 5 of the first edition, but with substan

quent itemset mining are presented in an org


Apriori algorithm and its variations to more a
ciency, including the frequent-pattern growth
vertical data format, and mining closed frequent
niques for mining multilevel association rules,
quantitative association rules. In comparison w
placed greater emphasis on the generation of
rules. Strategies for constraint-based mining an
focus the rule search are also described.

Chapter 6 describes methods for data classifi tree


induction, Bayesian classification, rule-base nique of
backpropagation, support vector mac neighbor
classifiers, case-based reasoning, genet set
approaches. Methods of regression are introd to choose
the best classifier or predictor are di sponding chapter in
the first edition, the section vector machines are new,
and the discussion of and prediction accuracy has been
greatly expan

Cluster analysis forms the topic of Chapter 7. are


presented, including partitioning method
methods, grid-based methods, and model-base
introduce techniques for clustering high-dime
based cluster analysis. Outlier analysis is also d

Chapters 8 to 10 treat advanced topics in


materials on recent progress in this frontier. Th
vious single chapter on advanced topics. Cha

data, time-series data, and sequence data (cov


biological sequences). The basic data mining te
ing, classification, clustering, and constraint-ba of
data. Chapter 9 discusses methods for grap
network analysis and multirelational data mi

This book has several strong features that set it apa ing.
It presents a very broad yet in-depth coverage fro
especially regarding several recent research topics on
ing, social network analysis, and multirelational data the
advanced topics are written to be as self-contained in
order of interest by the reader. All of the major m sented.
Because we take a database point of view to dat many
important topics in data mining, such as scalable a OLAP
analysis, that are often overlooked or minimally

To the Instructor

This book is designed to give a broad, yet detailed


overv can be used to teach an introductory course
on data mini level or at the first-year graduate level.
In addition, it can course on data mining.

If you plan to use the book to teach an introducto


materials in Chapters 1 to 7 are essential, among which
do not plan to cover the implementation methods for d
processing in depth. Alternatively, you may omit some
use Chapter 11 as the final coverage of applications and

If you plan to use the book to teach an advanced cou


Chapters 8 through 11. Moreover, additional materials
may supplement selected themes from among the adva
Individual chapters in this book can also be used f in
related courses, such as database systems, machine le

intelligent data analysis.

Each chapter ends with a set of exercises, suitable as a


are either short questions that test basic mastery of the m
that require analytical thinking, or implementation pro

make the book more enjoyable and reader-frien a


textbook, we have tried to organize it so that i
book or handbook, should you later decide to
fields or pursue a career in data mining.

What do you need to know in order to read

You should have some knowledge of the co


database systems, statistics, and machine le
enough background of the basics in these fie
these fields or your memory is a bit rusty,
discussions in the book.

You should have some programming


experi read pseudo-code and understand
simple d arrays.

To the Professional

This book was designed to cover a wide range o


result, it is an excellent handbook on the subjec
as stand-alone as possible, you can focus on the
can be used by application programmers and in
learn about the key ideas of data mining on thei
technical data analysis staff in banking, insuranc
are interested in applying data mining solution
may serve as a comprehensive survey of the da
researchers who would like to advance the sta
the scope of data mining applications.

The techniques and algorithms presented ar


ing algorithms that perform well on small t in
the book are geared for the discovery of p

plemental materials for readers of this book or


anyo mining. The resources include:

Slide presentations per chapter. Lecture notes


in available for each chapter.

Artwork of the book. This may help you to


make room teaching.

Instructor's manual. This complete set of answers


available only to instructors from the publisher's

Course syllabi and lecture plan. These are


given f versions of introductory and advanced
courses on and slides.

Supplemental reading lists with hyperlinks.


Semin ing are organized per chapter.

Links to data mining data sets and software. We


w mining data sets and sites containing interestin
ages, such as IlliMine from the University of
(http://illimine.cs.uiuc.edu).

Sample assignments, exams, course


projects. A se and course projects will be made
available to i website.

Table of contents of the book in PDF.

Errata on the different printings of the book. We


errors in the book. Once the error is confirmed, w
include acknowledgment of your contribution.

Comments or suggestions can be sent to


hanj@cs. hear from you.

Itskevitch, Wen Jin, Tiko Kameda, Hiroyuki K

Kim, Krzysztof Koperski, Hans-Peter Kriegel,


Joyce Man Lam, James Lau, Deyi Li, George (
Liao, Gang Liu, Junqiang Liu, Ling Liu, Alan (Y
Xuebin Lu, Wo-Shun Luk, Heikki Mannila, Ru
Alberto Mendelzon, Tim Merrett, Harvey Mill
Richard Muntz, Raymond T. Ng, Vicent Ng,
Ozsu, Jian Pei, Gregory Piatetsky-Shapiro, H
hamed Rajan, Peter Scheuermann, Shashi Sh
Evangelos Simoudis, Nebojsa Stefanovic, Yin
J Dick Tsur, Anthony K. H. Tung, Ke Wang, Wei
Winstone, Ju Wu, Betty (Bin) Xia, Cindy M. X
Clement Yu, Jeffrey Yu, Philip S. Yu, Osmar R
Zhong Zhang, Yvonne Zheng, Xiaofang Zhou
Jean Hou, Helen Pinto, Lara Winstone, and H
original figures in this book, and to Eugene
each chapter.

We also wish to thank Diane Cerra, our


Publishers, for her enthusiasm, patience, and s
as well as Howard Severson, our Production
tious efforts regarding production. We are in
invaluable feedback. Finally, we thank our fa
throughout this project.

Acknowledgments for the Second Edition

We would like to express our grateful thanks


t bers of the Data Mining Group at UIUC, th
Information Systems (DAIS) Laboratory in t
the University of Illinois at Urbana-Champa

Wu, Tianyi Wu, Dong Xin, Xifeng Yan, Jiong Yang, X.


Yu, Philip S. Yu, Maria Zemankova, ChengXiang Zou.
Deng Cai and ChengXiang Zhai have contribut mining
sections, Xifeng Yan to the graph mining secti
tirelational data mining section. Hong Cheng, Chario
David J. Hill, Chulyun Kim, Sangkyum Kim, Chao Li
Tianyi Wu, Xifeng Yan, and Xiaoxin Yin have contrib
individual chapters of the manuscript.

We also wish to thank Diane Cerra, our Publis


lishers, for her constant enthusiasm, patience, and su
book. We are indebted to Alan Rose, the book Prod
tireless and ever prompt communications with us to
duction process. We are grateful for the invaluable fee
Finally, we thank our families for their wholehearted s

This book is an introduction to a young and promising


discovery from data. The material in this book i where
emphasis is placed on basic data mining interesting
data patterns hidden in large data cussed are
particularly oriented toward the de mining tools. In this
chapter, you will learn evolution of database technology,
why data mi You will learn about the general
architecture insight into the kinds of data on which
mining that can be found, and how to tell which pa will
study data mining primitives, from whic designed. Issues
regarding how to integrate a data warehouse are also
discussed. In addition ing systems, you will read about
challenging r tools of the future.

1.1 What Motivated Data Mining?

Necessity is the mother of invention.

Data mining has attracted a great deal of atten


society as a whole in recent years, due to the w and
the imminent need for turning such data The
information and knowledge gained can be ket

analysis, fraud detection, and customer rete


exploration.

Data mining can be viewed as a result o technology.


The database system industry has development of the
following functionalities ( creation, data management
(including data

Figure 1.1 The evolution of database system technology.

faces, optimized query processing, and


trans for on-line transaction processing
(OLTP), w transaction, have contributed
substantially to relational technology as a
major tool for effici of large amounts of data.

Database technology since the mid-1980s


adoption of relational technology and an
activities on new and powerful database syste
advanced data models such as extended-relati
and deductive models. Application-oriented d
poral, multimedia, active, stream, and sensor,
a knowledge bases, and office information
base distribution, diversification, and sharing
of dat geneous database systems and
Internet-based World Wide Web (WWW) have
also emerged industry.

The steady and amazing progress of com


three decades has led to large supplies of
po collection equipment, and storage media.
Thi the database and information industry,
and information repositories available for
transacti and data analysis.

Data can now be stored in many differen


repositories. One data repository architecture
(Section 1.3.2), a repository of multiple hetero
unified schema at a single site in order to facili
warehouse technology includes data cleaning,
processing (OLAP), that is, analysis techniqu
rization, consolidation, and aggregation as wel
different angles. Although OLAP tools suppo
sion making, additional data analysis tools are

Figure 1.2 We are data rich, but information poor.

data classification, clustering, and the characterizatio addition,


huge volumes of data can be accumulated b houses. Typical
examples include the World Wide W flow in and out like streams,
as in applications like vid tion, and sensor networks. The effective
and efficient forms becomes a challenging task.

The abundance of data, coupled with the need for been described as a
data rich but information poor situ dous amount of data, collected and
stored in large and far exceeded our human ability for comprehension
wit As a result, data collected in large data repositories beco that are

seldom visited. Consequently, important decisi the information-rich


data stored in data repositories, intuition, simply because the decision
maker does not able knowledge embedded in the vast amounts of da
system technologies, which typically rely on users or d knowledge into
knowledge bases. Unfortunately, this errors, and is extremely timeconsuming and costly. analysis and may uncover important data
patterns,

emphasis on mining from large amounts of da


characterizing the process that finds a small set
raw material (Figure 1.3). Thus, such a misno
ing became a popular choice. Many other te
meaning to data mining, such as knowledge m
data/pattern analysis, data archaeology, and d

Many people treat data mining as a synonym


edge Discovery from Data, or KDD. Alternativ

Knowledge

Figure 1.3 Data mining: searching for knowledge (interesting patterns) in your data.

Figure 1.4 Data mining as a step in the process of knowledge discovery.

based on some interestingness measures; Se

Knowledge presentation (where visualizati


niques are used to present the mined knowl

Steps 1 to 4 are different forms of data pre for


mining. The data mining step may interact
interesting patterns are presented to the user a
the knowledge base. Note that according to this
entire process, albeit an essential one because it

We agree that data mining is a step in the kn


industry, in media, and in the database research
more popular than the longer term of knowledg
book, we choose to use the term data mining.
functionality: data mining is the process of disc
amounts of data stored in databases, data wareh

Based on this view, the architecture of a ty


following major components (Figure 1.5):

Database, data warehouse, World Wide


We is one or a set of databases, data

warehouses, tion repositories. Data


cleaning and data in on the data.

Database or data warehouse server: The dat sible for


fetching the relevant data, based on

A popular trend in the information industry is to pe


preprocessing step, where the resulting data are stored i
2

Sometimes data transformation and consolidation ar


particularly in the case of data warehousing. Data reduc
representation of the original data without sacrificing it

(Figure 1.5 components: data cleaning, integration, and selection applied to a database, data warehouse, World Wide Web, and other information repositories.)

Figure 1.5 Architecture of a typical data mining system.

Knowledge base: This is the domain knowledge th evaluate the


interestingness of resulting patterns. S cept hierarchies, used to
organize attributes or attrib abstraction. Knowledge such as user

beliefs, which interestingness based on its unexpectedness, may a of


domain knowledge are additional interestingnes metadata (e.g.,
describing data from multiple hetero

Data mining engine: This is essential to the data min a set


of functional modules for tasks such as characte tion
analysis, classification, prediction, cluster analys analysis.

Pattern evaluation module: This component typica sures (Section


1.5) and interacts with the data mi search toward interesting patterns.
It may use int out discovered patterns. Alternatively, the pattern grated
with the mining module, depending on t mining method used. For
efficient data mining, it

incorporating more advanced techniques for d

Although there are many data mining syst perform true


data mining. A data analysis syste data should be more
appropriately categorized data analysis tool, or an
experimental system form data or information retrieval,
including fi deductive query answering in large databases s
as a database system, an information retrieval s

Data mining involves an integration of tech database and


data warehouse technology, statist computing, pattern
recognition, neural net retrieval, image and signal
processing, and spa a database perspective in our
presentation of d sis is placed on efficient and scalable data
min scalable, its running time should grow approxi of the
data, given the available system resource By performing
data mining, interesting knowle tion can be extracted from
databases and viewe discovered knowledge can be applied
to decisio management, and query processing. Therefore,
important frontiers in database and informatio
interdisciplinary developments in the informat

1.3 Data Mining: On What Kind of Data?

In this section, we examine a number of diffe can be


performed. In principle, data mining s repository, as well as
to transient data, such examination of data repositories will
include transactional databases, advanced database sy

manage and access the data. The software programs in


tion of database structures; for data storage; for concu
access; and for ensuring the consistency and security o
system crashes or attempts at unauthorized access.
A relational database is a collection of tables, each of Each
table consists of a set of attributes (columns or fie of tuples
(records or rows). Each tuple in a relational tab by a unique key
and described by a set of attribute valu as an entity-relationship
(ER) data model, is often con An ER data model represents the
database as a set of en

Consider the following example.

Example 1.1 A relational database for AllElectronics.


The AllElectro following relation tables: customer, item,
employee, and described here are shown in Figure 1.6.

The relation customer consists of a set of attribute


identity number (cust ID), customer name, address,
credit information, category, and so on.

Similarly, each of the relations item, employee,


and br describing their properties.

Tables can also be used to represent the relationsh


relation tables. For our example, these include purc
creating a sales transaction that is handled by an
items sold in a given transaction), and works at (

AllElectronics).

Relational data can be accessed by database queri


language, such as SQL, or with the assistance of graph
the user may employ a menu, for example, to specify
query, and the constraints on these attributes. A given

Figure 1.6 Fragments of relations from a relational database

relational operations, such as join, selection, a efficient


processing. A query allows retrieval of s your job is to analyze
the AllElectronics data. T can ask things like Show me a list of
all items tional languages also include aggregate function
(maximum), and min (minimum). These allow sales of the last
month, grouped by branch, or in the month of December? or
Which sales pe

Suppose that AllElectronics is a successful international the world.


Each branch has its own set of databases. T asked you to provide an
analysis of the companys sales third quarter. This is a difficult task,
particularly since over several databases, physically located at
numerous

If AllElectronics had a data warehouse, this task house is a


repository of information collected from a unified schema, and
that usually resides at a single structed via a process of data
cleaning, data integra loading, and periodic data refreshing.
This process is Figure 1.7 shows the typical framework for
constructi for AllElectronics.

(Figure 1.7 depicts data sources in Chicago, New York, Toronto, and Vancouver feeding clean, integrate, transform, load, and refresh steps into the data warehouse.)

Figure 1.7 Typical framework of a data warehouse for AllElectronics.
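As a rough, hypothetical illustration of the construction process behind Figure 1.7, the short Python sketch below chains cleaning, integration, transformation, and loading over a couple of toy source records; every field name, cleaning rule, and exchange rate in it is an assumption made for the example.

# Hypothetical sketch of the warehouse-construction steps named in the text:
# data cleaning, data integration, data transformation, and loading.
from typing import Dict, List

def clean(record: Dict) -> Dict:
    # Cleaning: strip whitespace and normalize missing amounts to 0.0.
    return {
        "branch": record["branch"].strip(),
        "amount": float(record.get("amount") or 0.0),
        "currency": record.get("currency", "USD"),
    }

def transform(record: Dict) -> Dict:
    # Transformation: express every amount in one currency (assumed rates).
    rate = {"USD": 1.0, "CAD": 0.75}[record["currency"]]
    return {"branch": record["branch"], "amount_usd": record["amount"] * rate}

def load(warehouse: List[Dict], records: List[Dict]) -> None:
    # Loading: append the transformed records into the warehouse store.
    warehouse.extend(records)

# Integration: records pulled from several branch databases (toy data).
sources = [
    [{"branch": " Vancouver ", "amount": "350", "currency": "CAD"}],
    [{"branch": "New York", "amount": "799", "currency": "USD"}],
]

warehouse: List[Dict] = []
for source in sources:
    load(warehouse, [transform(clean(r)) for r in source])

print(warehouse)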

Example 1.2 A data cube for AllElectronics.


A data cube for is presented in Figure 1.8(a).
The cube has thr

Chicago, New York, Toronto, Vancouver), time (w item


(with item type values home entertainment, value stored
in each cell of the cube is sales amou sales for the first
quarter, Q1, for items relating to as stored in cell
hVancouver, Q1, securityi. Additi sums over each
dimension, corresponding to the SQL group-bys (e.g.,
the total sales amount per per quarter and item, or per
each individual dim

"I have also heard about data marts. What is the difference between a data warehouse and a data mart?" you may ask. A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, on the other hand, is a department subset of a data warehouse, and thus its scope is department-wide.

By providing multidimensional data views data,


data warehouse systems are well suited
OLAP. OLAP operations use background kn
data being studied in order to allow the pr
abstraction. Such operations accommodate d
OLAP operations include drill-down and rol
data at differing degrees of summarization, as
illustrated in Figure 1.8. For example, we can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country.

Although data warehouse tools help


suppor mining are required to allow more
in-depth a data warehouse and OLAP
technology is provid ing data warehouse
and OLAP implementation Chapter 4.

Figure 1.8 A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for AllElectronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube in (a). (Dimensions shown: item type, time, and address/city; for improved readability, only some of the cube cell values are shown.)

1.3.3 Transactional Databases

In general, a transactional database consists of a file whe


action. A transaction typically includes a unique transa and a
list of the items making up the transaction (suc

and so on.

Example 1.3 A transactional database for


AllElectronics. Tr one record per transaction. A
fragment of a is shown in Figure 1.9. From the relational
d Figure 1.9 is a nested relation because the attrib
Because most relational database systems do not
transactional database is usually either stored i the table
in Figure 1.9 or unfolded into a stand the items sold table
in Figure 1.6.

As an analyst of the AllElectronics database purchased


by Sandy Smith or How many Answering such queries
may require a scan of t

Suppose you would like to dig deeper into th together?


This kind of market basket data analy items together as a
strategy for maximizing sale printers are commonly
purchased together with model of printers at a discount
to customers bu selling more of the expensive printers. A
regular queries like the one above. However, data mini
so by identifying frequent itemsets, that is, sets o The
mining of such frequent patterns for trans
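A minimal way to see what mining frequent itemsets involves is to count how often pairs of items co-occur across transactions and keep the pairs that meet a minimum support count. The transactions and the threshold below are toy assumptions, not AllElectronics data.

# Hypothetical sketch: counting co-occurring item pairs in market-basket
# transactions and keeping those that meet a minimum support count.
from itertools import combinations
from collections import Counter

transactions = [
    {"computer", "printer", "software"},
    {"computer", "software"},
    {"printer", "paper"},
    {"computer", "printer"},
]

min_support_count = 2
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support_count}
print(frequent_pairs)   # e.g. ('computer', 'printer') and ('computer', 'software')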

Advanced Data and Information Systems and Advanced Applications

Relational database systems have been widely


progress of database technology, various kinds
tems have emerged and are undergoing
develop applications.

text and multimedia database systems, heterogeneous a


stream management systems, and Web-based global inf
While such databases or information repositories r
efficiently store, retrieve, and update large amounts of
fertile grounds and raise many challenging research an
mining. In this section, we describe each of the advanc

Object-Relational Databases

Object-relational databases are constructed based on


This model extends the relational model by providing a
plex objects and object orientation. Because most sop
need to handle complex objects and structures, object ing
increasingly popular in industry and applications.

Conceptually, the object-relational data model in


object-oriented databases, where, in general terms,
object. Following the AllElectronics example, objects
c tomers, or items. Data and code relating to an objec
unit. Each object has associated with it the following:

A set of variables that describe the objects. These


entity-relationship and relational models.

A set of messages that the object can use to


commu the rest of the database system.

A set of methods, where each method holds the


cod receiving a message, the method returns a
value in re for the message get photo(employee)
will retrieve a employee object.

Objects that share a common set of properties can Each object is


an instance of its class. Object classes can

A temporal database typically stores relational These


attributes may involve several timesta A sequence
database stores sequences of orde notion of time.
Examples include customer sho biological sequences. A
time-series database sto over repeated measurements
of time (e.g., hour collected from the stock exchange,
inventory phenomena (like temperature and wind).

Data mining techniques can be used to find t the


trend of changes for objects in the database. sion
making and strategy planning. For instanc the
scheduling of bank tellers according to the vo
data can be mined to uncover trends that could
when is the best time to purchase AllElectronic
defining multiple granularity of time. For exam to
fiscal years, academic years, or calendar year
quarters or months.

Spatial Databases and Spatiote

Spatial databases contain spatial-related info


(map) databases, very large-scale integration (V
and medical and satellite image databases. Spat
mat, consisting of n-dimensional bit maps or
image may be represented as raster data, where
area. Maps can be represented in vector form
lakes are represented as unions or overlays of ba
lines, polygons, and the partitions and network

Geographic databases have numerous appli ogy


planning to providing public service inform and
electric cables, pipes, and sewage systems

more, spatial data cubes may be constructed to


orga structures and hierarchies, on which OLAP
operations can be performed.

A spatial database that stores spatial objects tha


spatiotemporal database, from which interesting infor
ple, we may be able to group the trends of moving obj
moving vehicles, or distinguish a bioterrorist attack fr
based on the geographic spread of a disease with time.

Text Databases and Multimedia Dat

Text databases are databases that contain word


descr descriptions are usually not simple keywords but
rath such as product specifications, error or bug
reports, war notes, or other documents. Text
databases may be hig Web pages on the World Wide
Web). Some text databa that is, semistructured (such
as e-mail messages and whereas others are relatively
well structured (such as li databases with highly
regular structures typically can database systems.

What can data mining on text databases uncover?


uncover general and concise descriptions of the text

associations, as well as the clustering behavior of text


o mining methods need to be integrated with informat
construction or use of hierarchies specifically for text d
sauruses), as well as discipline-oriented term
classificati stry, medicine, law, or economics).
Multimedia databases store image, audio, and vid
cations such as picture content-based retrieval, voicesystems, the World Wide Web, and speech-based user
commands. Multimedia databases must support large o

queries. Objects in one component


database component databases, making it
difficult to as heterogeneous database.

Many enterprises acquire legacy databases


mation technology development (including th
operating systems). A legacy database is a
gro bines different kinds of data systems, such
as hierarchical databases, network databases,
spr systems. The heterogeneous databases in
a lega or inter-computer networks.

Information exchange across such databas


precise transformation rules from one represe
semantics. Consider, for example, the proble
student academic performance among differen
computer system and use its own curriculum a
adopt a quarter system, offer three courses on d
A+ to F, whereas another may adopt a semester
and assign grades from 1 to 10. It is very diffic
transformation rules between the two universi
ficult. Data mining techniques may provide an
exchange problem by performing statistical da
and transforming the given data into higher, m as
fair, good, or excellent for student grades), fro
more easily be performed.

Data Streams

Many applications involve the generation and an


data, where data flow in and out of an observa
Such data streams have the following unique f
dynamically changing, flowing in and out in a

changes within stream data. For example, we may like t


network based on the anomaly of message flow, which
data streams, dynamic construction of stream models, o
patterns with that at a certain previous time. Most strea of
abstraction, whereas analysts are often more interes of
abstraction. Thus, multilevel, multidimensional on-be
performed on stream data as well.

The World Wide Web

The World Wide Web and its associated distribute


Yahoo!, Google, America Online, and AltaVista, provid
mation services, where data objects are linked togeth
Users seeking information of interest traverse from
Such systems provide ample opportunities and challe
ple, understanding user access patterns will not only
providing efficient access between highly correlated o
marketing decisions (e.g., by placing advertisements i
or by providing better customer/user classification an
user access patterns in such distributed information e
mining (or Weblog mining).

Although Web pages may appear fancy and informat


highly unstructured and lack a predefined schema, type
computers to understand the semantic meaning of diver
in an organized way for systematic information retriev that

provide keyword-based searches without understan


pages can only offer limited help to users. For example
keyword may return hundreds of Web page pointers c of
the pointers will be very weakly related to what the can
often provide additional help here than Web searc tative
Web page analysis based on linkages among We

1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

We have observed various types of databases


an mining can be performed. Let us now examin
mined.

Data mining functionalities are used to spe data


mining tasks. In general, data mining tas descriptive
and predictive. Descriptive mining of the data in the
database. Predictive mining tas in order to make
predictions.

In some cases, users may have no idea regar may be


interesting, and hence may like to search parallel. Thus it
is important to have a data mini patterns to
accommodate different user expecta mining systems
should be able to discover patte levels of abstraction).
Data mining systems sh guide or focus the search for
interesting patter for all of the data in the database, a
measure of associated with each discovered pattern.

Data mining functionalities, and the kinds of


below.

Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts classes of


items for sale include computers and pr bigSpenders and
budgetSpenders. It can be usefu cepts in summarized, concise,
and yet precise a concept are called class/concept description

(1) data characterization, by summarizing the d

attribute-oriented induction technique can be


used to characterization without step-by-step
user interaction Chapter 4.

The output of data characterization can be presen include


pie charts, bar charts, curves, multidimension sional
tables, including crosstabs. The resulting descr
generalized relations or in rule form (called characteris
forms and their transformations are discussed in Chap

Example 1.4 Data characterization. A data mining


system should summarizing the characteristics of
customers who sp AllElectronics. The result could be
a general profile of 4050 years old, employed, and
have excellent credit r users to drill down on any
dimension, such as on oc customers according to
their type of employment.
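Characterization of the kind sketched in Example 1.4 boils down to summarizing the target class at a higher level of abstraction. The fragment below counts big spenders per generalized (age group, credit rating) combination; the customer records and the 10-year age bands are invented for illustration.

# Hypothetical sketch: summarizing a target class of customers (those who
# spent more than $1,000) by generalized age group and credit rating.
from collections import Counter

customers = [
    {"age": 42, "credit": "excellent", "spent": 5200},
    {"age": 47, "credit": "excellent", "spent": 1800},
    {"age": 35, "credit": "fair",      "spent": 300},
    {"age": 50, "credit": "good",      "spent": 2500},
]

def age_group(age: int) -> str:
    # Generalize raw ages up a concept hierarchy (assumed 10-year bands).
    return f"{(age // 10) * 10}-{(age // 10) * 10 + 9}"

profile = Counter(
    (age_group(c["age"]), c["credit"])
    for c in customers if c["spent"] > 1000      # target class only
)
for (group, credit), count in profile.items():
    print(group, credit, count)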

Data discrimination is a comparison of the general f


with the general features of objects from one or a set and
contrasting classes can be specified by the user, an
retrieved through database queries. For example, the u
eral features of software products whose sales increased
whose sales decreased by at least 30% during the same
p discrimination are similar to those used for data charac

How are discrimination descriptions output? The f


similar to those for characteristic descriptions, altho
should include comparative measures that help
disting trasting classes. Discrimination descriptions
expressed discriminant rules.

Example 1.5 Data discrimination. A data mining system


should be AllElectronics customers, such as those who
shop for co

Frequent patterns, as the name suggests, are pat are many


kinds of frequent patterns, includin tures. A frequent itemset
typically refers to a set in a transactional data set, such as milk
and bre such as the pattern that customers tend to purc era, and
then a memory card, is a (frequent) seq to different structural
forms, such as graphs, tr with itemsets or subsequences. If a
substructure structured pattern. Mining frequent patterns lea
ations and correlations within data.

Example 1.6 Association analysis. Suppose, as a marketing m


determine which items are frequently purchase An example of
such a rule, mined from the AllE

buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]

where X is a variable representing a customer. A that if a customer


buys a computer, there is a as well. A 1% support means that 1%
of all of that computer and software were purchased t single
attribute or predicate (i.e., buys) that repe predicate are referred
to as single-dimensional a notation, the above rule can be
written simply

Suppose, instead, that we are given the AllE purchases. A data


mining system may find asso

age(X, "20...29") ∧ income(X, "20K...29K") ⇒ buys(X, "CD player") [support = 2%, confidence = 60%]

The rule indicates that of the AllElectronics 29 years of age


with an income of 20,000 to 2
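The support and confidence attached to such rules can be computed directly from transaction counts: support is the fraction of all transactions containing both sides of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The toy transactions below are assumptions used only to show the arithmetic.

# Hypothetical sketch: measuring support and confidence of the rule
# buys(computer) => buys(software) over toy transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
]

antecedent, consequent = {"computer"}, {"software"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total            # P(computer and software)
confidence = n_both / n_antecedent    # P(software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")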

pattern mining and structured pattern mining are cons


discussed in Chapters 8 and 9, respectively.

1.4.3 Classification and Prediction

Classification is the process of finding a model (or fun guishes data


classes or concepts, for the purpose of bein the class of objects whose
class label is unknown. The de ysis of a set of training data (i.e., data
objects whose cla

How is the derived model presented? The derived m ous forms,


such as classification (IF-THEN) rules, decis or neural networks
(Figure 1.10). A decision tree is a flo each node denotes a test
on an attribute value, each bra test, and tree leaves represent
classes or class distributi converted to classification rules. A
neural network, wh cally a collection of neuron-like processing
units with w units. There are many other methods for constructing
cl Bayesian classification, support vector machines, and k

Whereas classification predicts categorical (discret models


continuous-valued functions. That is, it is use able numerical
data values rather than class labels. Al refer to both numeric
prediction and class label predicti primarily to numeric prediction.
Regression analysis is most often used for numeric prediction,

although other also encompasses the identification of distribution


tren

Classification and prediction may need to be prece attempts


to identify attributes that do not contribute t process. These
attributes can then be excluded.

Example 1.7 Classification and prediction. Suppose, as sales


mana like to classify a large set of items in the store, based

Figure 1.10 A classification model can be represented in various forms, such as (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.

sales campaign: good response, mild response, a a model for


each of these three classes based such as price, brand, place
made, type, and cate maximally distinguish each class from the
other data set. Suppose that the resulting classificatio tree. The
decision tree, for instance, may identif distinguishes the three
classes. The tree may re help further distinguish objects of each
class fro Such a decision tree may help you understand th
design a more effective campaign for the future Suppose
instead, that rather than predicting item, you would like to
predict the amount of re an upcoming sale at AllElectronics,
based on pr (numeric) prediction because the model const

function, or ordered value.

Chapter 6 discusses classification and prediction.
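As a purely illustrative reading of the IF-THEN rule form mentioned above, the tiny classifier below assigns one of the three response classes of Example 1.7 from a couple of item attributes. The rules and thresholds are invented; they are not learned from any training data.

# Hypothetical sketch: a classification model expressed as IF-THEN rules,
# assigning items to "good", "mild", or "no" response classes.
def classify(item: dict) -> str:
    # Rules of the form IF condition THEN class (invented for illustration).
    if item["price"] < 100 and item["category"] == "accessory":
        return "good response"
    if item["price"] < 500:
        return "mild response"
    return "no response"

items = [
    {"name": "memory card",  "price": 45,  "category": "accessory"},
    {"name": "printer",      "price": 299, "category": "peripheral"},
    {"name": "home theater", "price": 999, "category": "entertainment"},
]
for item in items:
    print(item["name"], "->", classify(item))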

1.4.4 Cluster Analysis

What is cluster analysis? Unlike classification an data objects,


clustering analyzes data objects

Figure 1.11 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster center is marked with a "+".

In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Example 1.8 Cluster analysis. Cluster analysis can be


performed o order to identify homogeneous subpopulations of
cust sent individual target groups for marketing. Figure 1.1 with
respect to customer locations in a city. Three clust

Cluster analysis forms the topic of Chapter 7.
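For 2-D customer locations like those in Figure 1.11, a partitioning method such as k-means alternately assigns each point to its nearest cluster center and recomputes the centers. The sketch below shows that loop on invented points; it is a minimal illustration, not the book's algorithm description.

# Hypothetical sketch of k-means on 2-D customer locations (toy points),
# the kind of partitioning-based clustering mentioned in the text.
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return centers, clusters

points = [(1, 1), (1.5, 2), (1, 1.5),      # one neighborhood
          (8, 8), (8.5, 8), (9, 7.5),      # another
          (5, 0.5), (5.5, 1), (4.5, 0)]    # and a third
centers, clusters = kmeans(points, k=3)
print(centers)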

1.4.5 Outlier Analysis


A database may contain data objects that do not com model of
the data. These data objects are outliers. Mo

Outlier analysis is also discussed in Chapter 7.
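One simple statistical-distribution-based check, in the spirit of the outlier methods treated later in the book, flags values that lie far from the mean in standard-deviation terms. The purchase amounts and the two-standard-deviation threshold below are assumptions for illustration.

# Hypothetical sketch: flagging outliers as values more than two standard
# deviations from the mean (one simple statistical criterion).
from statistics import mean, pstdev

amounts = [120, 95, 130, 110, 105, 98, 10000]   # toy purchase amounts

mu, sigma = mean(amounts), pstdev(amounts)
outliers = [x for x in amounts if abs(x - mu) > 2 * sigma]
print(outliers)   # the extreme 10000 is flagged; typical amounts are not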

1.4.6 Evolution Analysis


Data evolution analysis describes and models behavior changes
over time. Although this m tion, association and correlation
analysis, classi related data, distinct features of such an ana
sequence or periodicity pattern matching, and

Example 1.10 Evolution analysis. Suppose that you have th of


the last several years available from the Ne like to invest in shares
of high-tech industrial c exchange data may identify stock evolution
re stocks of particular companies. Such regularitie market prices,
contributing to your decision m

Data evolution analysis is discussed in Chapter 8.
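A very small piece of such trend analysis is smoothing a series with a moving average so that the long-term movement stands out from short-term fluctuation. The price series and window size below are invented.

# Hypothetical sketch: a 3-point moving average as a minimal trend-analysis
# step over a toy sequence of daily closing prices.
prices = [21.0, 21.4, 20.9, 21.8, 22.5, 22.1, 23.0, 23.6]

window = 3
trend = [round(sum(prices[i:i + window]) / window, 2)
         for i in range(len(prices) - window + 1)]
print(trend)   # smoothed series; a rising trend shows up as increasing values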

1.5 Are All of the Patterns Interesting?

A data mining system has the potential to gen terns, or


rules.

So, you may ask, are all of the patterns inter tion of the patterns
potentially generated would

This raises some serious questions for data pattern interesting? Can a data
mining system ge a data mining system generate only interesting pa

To answer the first question, a pattern is in humans, (2) valid on


new or test data with some

confidence(X ⇒ Y) = P(Y | X)

In general, each interestingness measure is associate


controlled by the user. For example, rules that do not say,
50% can be considered uninteresting. Rules below
exceptions, or minority cases and are probably of less v
Although objective measures help identify interesti unless
combined with subjective measures that reflect ticular
user. For example, patterns describing the chara
frequently at AllElectronics should interest the marketi
interest to analysts studying the same database for pat
Furthermore, many patterns that are interesting by o
common knowledge and, therefore, are actually unint
ness measures are based on user beliefs in the data. Th
esting if they are unexpected (contradicting a user's beli
on which the user can act. In the latter case, such patte
Patterns that are expected can be interesting if they co

wished to validate, or resemble a user's hunch.

The second question ("Can a data mining system generate all of the interesting patterns?") refers to the completeness of a data mining algorithm. It is often unre-
alistic and inefficient for data mining systems to gene
Instead, user-provided constraints and interestingness
the search. For some mining tasks, such as association
the completeness of the algorithm. Association rule mi of
constraints and interestingness measures can ensure
methods involved are examined in detail in Chapter 5.

Finally, the third question ("Can a data mining system generate only interesting patterns?") is an optimization problem in data mining. It is highly desirable for data min-
systems to generate only interesting patterns. This w
users and data mining systems, because neither would
terns generated in order to identify the truly interesting
this direction; however, such optimization remains a ch

1.6 Classification of Data Mining Systems


Data mining is an interdisciplinary field, the c ing database
systems, statistics, machine learnin (Figure 1.12).
Moreover, depending on the data other disciplines may be
applied, such as neural knowledge representation,
inductive logic progr ing. Depending on the kinds of data to
be mined the data mining system may also integrate techn
tion retrieval, pattern recognition, image analys Web
technology, economics, business, bioinfor

Because of the diversity of disciplines contribu is expected


to generate a large variety of data mi provide a clear
classification of data mining syst tinguish between such
systems and identify those systems can be categorized
according to various

Figure 1.12 Data mining as a confluence of multiple disciplines, including database technology, statistics, machine learning, information science, visualization, and other disciplines.

relation analysis, classification, prediction, clusterin


analysis. A comprehensive data mining system
usual grated data mining functionalities.

Moreover, data mining systems can be distingui levels of


abstraction of the knowledge mined, inclu high level of
abstraction), primitive-level knowledge at multiple levels
(considering several levels of abstra system should
facilitate the discovery of knowledge

Data mining systems can also be categorized as


(commonly occurring patterns) versus those that
exceptions, or outliers). In general, concept descrip
analysis, classification, prediction, and clustering mi liers
as noise. These methods may also help detect o

Classification according to the kinds of techniques ut be


categorized according to the underlying data min
techniques can be described according to the degree
autonomous systems, interactive exploratory syste
methods of data analysis employed (e.g., databas
oriented techniques, machine learning, statistics, vi
neural networks, and so on). A sophisticated data multiple
data mining techniques or work out an eff combines the
merits of a few individual approaches

Classification according to the applications adapted:


categorized according to the applications they ad
systems may be tailored specifically for finance, te
markets, e-mail, and so on. Different applications
application-specific methods. Therefore, a generic,
may not fit domain-specific mining tasks.

In general, Chapters 4 to 7 of this book are organize of


knowledge mined. In Chapters 8 to 10, we discuss

the mining process, or examine the findings fr


mining primitives specify the following, as illus

The set of task-relevant data to be mined:


T or the set of data in which the user is
interes or data warehouse dimensions of
interest dimensions).

The kind of knowledge to be mined: This spe


formed, such as characterization, discrimina
classification, prediction, clustering, outlier

The background knowledge to be used in the


the domain to be mined is useful for guidi for
evaluating the patterns found. Concept ground
knowledge, which allow data to be An example
of a concept hierarchy for the a Figure 1.14.
User beliefs regarding relationsh ground
knowledge.

The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence; rules whose support and confidence values are below user-specified thresholds are considered uninteresting (a small computational sketch follows this list).

The expected representation for visualizing t form


in which discovered patterns are to be d charts,
graphs, decision trees, and cubes.
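To make the interestingness-measure primitive concrete, the following Python sketch (the transactions, rule, and thresholds are invented for illustration and are not from this book) computes the support and confidence of one candidate association rule and tests them against user-specified thresholds:

# Hypothetical illustration: support and confidence of the rule {bread} => {butter}
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "beer"},
    {"bread", "milk"},
    {"bread", "butter", "beer"},
]

antecedent, consequent = {"bread"}, {"butter"}

n = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n                 # fraction of transactions containing both sides
confidence = n_both / n_antecedent   # fraction of antecedent transactions that also contain the consequent

# User-specified interestingness thresholds (assumed values)
min_support, min_confidence = 0.4, 0.6
interesting = support >= min_support and confidence >= min_confidence
print(f"support={support:.2f}, confidence={confidence:.2f}, interesting={interesting}")

Rules failing either threshold would simply be discarded rather than reported to the user.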

A data mining query language can be de


allowing users to flexibly interact with data min
language provides a foundation on which user-

Figure 1.13 Primitives for specifying a data mining task: background knowledge (concept hierarchies, user beliefs about relationships); pattern interestingness measures (simplicity, certainty such as confidence, utility such as support, novelty); and the visualization of discovered patterns (rules, tables, reports, charts, and cubes, with drill-down and roll-up).

This facilitates a data mining systems communication and its


integration with the overall information proces Designing a
comprehensive data mining language is covers a wide spectrum of

tasks, from data characteriz task has different requirements. The design
of an effec requires a deep understanding of the power, limitation

the various kinds of data mining tasks.

general abstraction level, denoted as all.

There are several proposals on data mining we


use a data mining query language known as
which was designed as a teaching tool, based o
use to specify data mining queries appear thro an
SQL-like syntax, so that it can easily be integ
SQL. Lets look at how it can be used to specify

Example 1.11 Mining classification rules.


Suppose, as a m would like to classify
customers based on th interested in those
customers whose salary i bought more than
$1,000 worth of items, ea $100. In particular,
you are interested in the cu purchased, the
purchase location, and where to view the
resulting classification in the form of rules. This is expressed in DMQL as follows, where each line of the query aids in our discussion.

use database AllElectronics_db

use hierarchy location_hierarchy for T.branch

mine classification as promising_customers

in relevance to C.age, C.income, I.type

from customer C, item I, transaction T

where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID
and C.income >= 40,000 and I.price >= 100

group by T.cust_ID

Note that in this book, query language keywords are

However, in this example, the two classes are implicit: the tuples retrieved by the query are considered examples of promising customers, while the remaining customers in the customer table are considered negative examples. Classification is performed based on this training set, and the results are to be displayed as a set of rules. Several details of this process are introduced in Chapter 6.
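To make the implicit training set concrete, the following Python sketch (the customer records and thresholds are invented and are not part of the AllElectronics schema) labels customers in the spirit of the query's conditions, producing the kind of two-class training set on which a classifier would then be built:

# Hypothetical per-customer summaries: income and total amount purchased
customers = [
    {"cust_ID": 1, "age": 34, "income": 55000, "total_purchased": 1800},
    {"cust_ID": 2, "age": 45, "income": 30000, "total_purchased": 2500},
    {"cust_ID": 3, "age": 29, "income": 72000, "total_purchased": 400},
]

def label(c):
    # Conditions analogous to those in the DMQL query (assumed thresholds)
    promising = c["income"] >= 40000 and c["total_purchased"] >= 1000
    return "promising_customer" if promising else "other"

training_set = [(c, label(c)) for c in customers]
for c, cls in training_set:
    print(c["cust_ID"], cls)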

There is no standard data mining query language


as industry have been making good progress in
this dir Data Mining (described in the appendix of
this book) data mining language. Other
standardization efforts inc Model Markup
Language) and CRISP-DM (CRoss-Ind Mining).

1.8 Integration of a Data Mining System with a Database or Data Warehouse System

Section 1.2 outlined the major components of the arch


system (Figure 1.5). A good system architecture will fac
make best use of the software environment, accomplish

and timely manner, interoperate and exchange informa


tems, be adaptable to users diverse requirements, and

A critical question in the design of a data mining ( or


couple the DM system with a database (DB) system
system. If a DM system works as a stand-alone system
program, there are no DB or DW systems with which it
scheme is called no coupling, where the main focus of th
effective and efficient algorithms for mining the availabl
system works in an environment that requires it to com
system components, such as DB and DW systems, poss

ond, there are many tested, scalable algorith


DB and DW systems. It is feasible to realize
such systems. Moreover, most data have be
Without any coupling of such systems, a D
extract data, making it difficult to integrate
cessing environment. Thus, no coupling rep

Loose coupling: Loose coupling means that a DB


or DW system, fetching data from a dat performing
data mining, and then storing t designated place in a
database or data wareh Loose coupling is better
than no coupling stored in databases or data
warehouses by usi system facilities. It incurs some
advantages o tures provided by such systems.
However, m main memory-based. Because mining
does

optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets (a minimal sketch of this scheme follows this list).

Semitight coupling: Semitight coupling mea


a DB/DW system, efficient implementation
itives (identified by the analysis of frequen can
be provided in the DB/DW system. The ing,
aggregation, histogram analysis, multi
essential statistical measures, such as sum, c
so on. Moreover, some frequently used inte
puted and stored in the DB/DW system. Be

are either precomputed or can be computed


performance of a DM system.

Tight coupling: Tight coupling means tha


into the DB/DW system. The data mining

desirable, but its implementation is nontrivial and mo Semitight


coupling is a compromise between loose and identify commonly
used data mining primitives and p of such primitives in DB or
DW systems.
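As a minimal sketch of the loose coupling scheme described above (not a reference implementation; the table and column names are invented), a data mining program can fetch task-relevant data through the database's SQL facilities, mine it in its own main memory, and store the result back in a designated table:

import sqlite3
from collections import Counter

# Loose coupling: use the DB system only to fetch data and to store results.
con = sqlite3.connect(":memory:")   # stands in for a real DB/DW system
con.execute("CREATE TABLE customer (cust_ID INTEGER, age INTEGER, income REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, 23, 28000), (2, 41, 52000), (3, 37, 61000), (4, 55, 46000)])

# 1. Fetch the task-relevant data through the database's query facilities.
rows = con.execute("SELECT age, income FROM customer WHERE income >= 40000").fetchall()

# 2. Perform the mining step in the DM system's own memory (here: a trivial age-group count).
age_groups = Counter("senior" if age >= 50 else "middle_aged" for age, _ in rows)

# 3. Store the discovered pattern in a designated place in the database.
con.execute("CREATE TABLE mining_result (age_group TEXT, cnt INTEGER)")
con.executemany("INSERT INTO mining_result VALUES (?, ?)", age_groups.items())
print(con.execute("SELECT * FROM mining_result").fetchall())

Note that the mining itself happens outside the database engine, which is exactly why loose coupling does not scale as well as semitight or tight coupling.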

1.9 Major Issues in Data Mining

The scope of this book addresses major issues in data mi


logy, user interaction, performance, and diverse data ty
below:

Mining methodology and user interaction issues: The mined,


the ability to mine knowledge at multiple knowledge, ad hoc mining,
and knowledge visualiza

Mining different kinds of knowledge in databas be interested in


different kinds of knowledge, d spectrum of data analysis and
knowledge disco acterization, discrimination, association and co
prediction, clustering, outlier analysis, and evo trend and

similarity analysis). These tasks may ent ways and require the
development of nume

Interactive mining of knowledge at multiple lev difficult to know


exactly what can be discove mining process should be
interactive. For datab of data, appropriate sampling techniques
can fi active data exploration. Interactive mining al for patterns,
providing and refining data min results. Specifically, knowledge
should be mine

vein, high-level data mining query langu to


describe ad hoc data mining tasks by vant
sets of data for analysis, the domai be
mined, and the conditions and const
patterns. Such a language should be integ
query language and optimized for efficie

Presentation and visualization of data min


be expressed in high-level languages, vis
forms so that the knowledge can be ea
humans. This is especially crucial if the
This requires the system to adopt expressi
such as trees, tables, rules, graphs, charts

Handling noisy or incomplete data: The da


exceptional cases, or incomplete data obje
objects may confuse the process, causin
overfit the data. As a result, the accuracy
Data cleaning methods and data analy
required, as well as outlier mining met
exceptional cases.

Pattern evaluationthe interestingness


pro thousands of patterns. Many of the patt
the given user, either because they repr

elty. Several challenges remain regarding


the interestingness of discovered pattern
measures that estimate the value of pat
based on user beliefs or expectations.
user-specified constraints to guide the d
space is another active area of research.

some data mining methods are factors motivatin


distributed data mining algorithms. Such algo
titions, which are processed in parallel. The res
merged. Moreover, the high cost of some data mi for
incremental data mining algorithms that inc out
having to mine the entire data again from sc
knowledge modification incrementally to amend ously
discovered.

Issues relating to the diversity of database types:

Handling of relational and complex types of data:


data warehouses are widely used, the developme
mining systems for such data is important. Howe
complex data objects, hypertext and multimedia or
transaction data. It is unrealistic to expect data, given
the diversity of data types and differe data mining
systems should be constructed fo Therefore, one
may expect to have different da kinds of data.

Mining information from heterogeneous database

Local- and wide-area computer networks (such


sources of data, forming huge, distributed, and h
covery of knowledge from different sources of

unstructured data with diverse data semantics


mining. Data mining may help disclose high-le
heterogeneous databases that are unlikely to be
tems and may improve information exchange a
neous databases. Web mining, which uncovers i
contents, Web structures, Web usage, and Web
lenging and fast-evolving field in data mining.

database management systems with quer


progress has led to the increasing demand
analysis tools. This need is a result of the expl
cations, including business and manageme
and engineering, and environmental contro

Data mining is the task of discovering interes


where the data can be stored in databases,
data itories. It is a young interdisciplinary field,
d tems, data warehousing, statistics, machine
retrieval, and high-performance computing.
networks, pattern recognition, spatial data an
and many application fields, such as busines

A knowledge discovery process


includes da tion, data transformation,
data mining, presentation.

The architecture of a typical data mining


warehouse and their appropriate servers, a
tion module (both of which interact with a
interface. Integration of the data mining c or
data warehouse system can involve eithe
coupling, or tight coupling. A well-designed
semitight coupling with a database and/or d

Data patterns can be mined from many differ


databases, data warehouses, and transaction
esting data patterns can also be extracted fro
ries, including spatial, time-series, sequence
data streams, and the World Wide Web.

A data warehouse is a repository for long-ter


organized so as to facilitate management dec

Data mining systems can be classified according to kinds of


knowledge mined, the techniques used, or

We have studied five primitives for specifying a data


mining query. These primitives are the specificatio
data set to be mined), the kind of knowledge to be
(typically in the form of concept hierarchies), inter edge
presentation and visualization techniques to b ered
patterns.

Data mining query languages can be designed to su


mining. A data mining query language, such as DM for
specifying each of the data mining primitives. based
and may eventually form a standard on which mining
can be based.

Efficient and effective data mining in large databas and great


challenges to researchers and developers. mining
methodology, user interaction, performance ing of a large
variety of data types. Other issues inclu applications and their
social impacts.

Exercises

What is data mining? In your answer, address the follo

Is it another hype?

Is it a simple transformation of technology develop


machine learning?

Explain how the evolution of database technology

Describe the steps involved in data mining when v


discovery.

Define each of the following data mining func tion,


association and correlation analysis, classi lution
analysis. Give examples of each data mini with
which you are familiar.

What is the difference between discrimination zation


and clustering? Between classification a tasks, how
are they similar?

Based on your observation, describe another po


discovered by data mining methods but has not a
mining methodology that is quite different fr

List and describe the five primitives for specifyi

Describe why concept hierarchies are useful in d

Outliers are often discarded as noise. However,


treasure. For example, exceptions in credit card ulent
use of credit cards. Taking fraudulence det ods that can
be used to detect outliers and disc

Recent applications pay special attention to sp poral


data stream contains spatial information t of stream
data (i.e., the data flow in and out lik

Present three application examples of spati

Discuss what kind of interesting knowledg with


limited time and resources.

Identify and discuss the major challenges i

Using one application example, sketch a from


such stream data efficiently.

Describe the differences between the following mining


system with a database or data wareh

Bibliographic Notes

The book Knowledge Discovery in Databases, edited b


[PSF91], is an early collection of research papers on kno
book Advances in Knowledge Discovery and Data Mini

Shapiro, Smyth, and Uthurusamy [FPSSe96], is a colle


knowledge discovery and data mining. There have bee
lished in recent years, including Predictive Data Mining
Data Mining Solutions: Methods and Tools for Solving R
and Blaxton [WB98], Mastering Data Mining: The Ar
tionship Management by Berry and Linoff [BL99], Build
CRM by Berson, Smith, and Thearling [BST99], Data M
Tools and Techniques with Java Implementations by Witt
of Data Mining (Adaptive Computation and Machine L
Smyth [HMS01], The Elements of Statistical Learning
man [HTF01], Data Mining: Introductory and Advanc
Data Mining: Multimedia, Soft Computing, and Bioin
[MA03], and Introduction to Data Mining by Tan, Stein are
also books containing collections of papers on discovery,
such as Machine Learning and Data Mining: by Michalski,
Brakto, and Kubat [MBK98], and Re Dzeroski and Lavrac
[De01], as well as many tutorial database, data mining,
and machine learning conferen

KDnuggets News, moderated by Piatetsky-Shapiro s


tronic newsletter containing information relevant to da

ery. The KDnuggets website, located at


www.kdnuggets.c information relating to data mining.

The data mining community started its first interna


discovery and data mining in 1995 [Fe95]. The confere
national workshops on knowledge discovery in databas
PS91a, FUe93, Fe94]. ACM-SIGKDD, a Special Interest

nals on databases, statistics, machine


learning, a sources are listed below.

Popular textbooks on database systems inclu by


Garcia-Molina, Ullman, and Widom [GMU
Ramakrishnan and Gehrke [RG03], Database
and Sudarshan [SKS02], and Fundamentals of D
[EN03]. For an edited collection of seminal ar in
Database Systems by Hellerstein and Stonebr
house technology, systems, and applications hav
such as The Data Warehouse Toolkit: The Com

Kimball and M. Ross [KR02], The Data Wareh


Designing, Developing, and Deploying Data Wa
[KRRT98], Mastering Data Warehouse Design: R

Imhoff, Galemmo, and Geiger [IGG03], Buildin


and OLAP Solutions: Building Multidimensio

[Tho97]. A set of research papers on materialize


tions were collected in Materialized Views: Techn
by Gupta and Mumick [GM99]. Chaudhuri and
overview of data warehouse technology.

Research results relating to data mining and the


proceedings of many international data

SIGMOD International Conference on Managem


Conference on Very Large Data Bases (VLDB), th
posium on Principles of Database Systems (POD
Engineering (ICDE), the International Confere (EDBT),
the International Conference on Databa ference on
Information and Knowledge Managem on Database and
Expert Systems Applications (D on Database Systems
for Advanced Applications also published in major
database journals, such

Data Engineering (TKDE), ACM Transactions

Research in statistics is published in the proceedings


ences, including Joint Statistical Meetings, International
Society, and Symposium on the Interface: Computing Sci
of publication include the Journal of the Royal Statistica
Journal of American Statistical Association, Technometri

Textbooks and reference books on machine learnin

Artificial Intelligence Approach, Vols. 14, edited by Mi


KM90, MT94], C4.5: Programs for Machine Learning
Machine Learning by Langley [Lan96], and Machine
Le book Computer Systems That Learn: Classification
and P Neural Nets, Machine Learning, and Expert
Systems by compares classification and prediction
methods from edited collection of seminal articles on
machine learnin ing by Shavlik and Dietterich [SD90].

Machine learning research is published in the proc


learning and artificial intelligence conferences, includin

Machine Learning (ML), the ACM Conference on Compu


the International Joint Conference on Artificial Intelligenc
ciation of Artificial Intelligence Conference (AAAI). Oth
major machine learning, artificial intelligence, patte
system journals, some of which have been mentioned

Learning (ML), Artificial Intelligence Journal (AI), IEEE


and Machine Intelligence (PAMI), and Cognitive Science

Pioneering work on data visualization techniques is of


Quantitative Information [Tuf83], Envisioning Inform
nations: Images and Quantities, Evidence and Narrative
to Graphics and Graphic Information Processing by Ber
Cleveland [Cle93], and Information Visualization in D
covery edited by Fayyad, Grinstein, and Wierse [FGW0
posiums on visualization include ACM Human Facto
Visualization, and the International Symposium on Inf

at www.dmg.org, and CRISP-DM (CRossIndu described at www.crisp-dm.org.

Architectures of data mining systems have be


ference panels and meetings. The recent
design IV99, Cor00, Ras04], the proposal of
on-line a the study of optimization of data
mining queri can be viewed as steps toward
the tight integrati systems and data warehouse
systems. For rela mining primitives as
proposed by Sarawagi, Th as building blocks
for the efficient implemen systems.

Todays real-world databases are highly


susceptible to n to their typically huge size (often
several gigaby multiple, heterogenous sources.
Low-quality da

How can the data be preprocessed in order to


consequently, of the mining results? How can the
efficiency and ease of the mining process?

There are a number of data preprocessing tec remove


noise and correct inconsistencies in the multiple
sources into a coherent data store, suc tions, such as
normalization, may be applied. F the accuracy and
efficiency of mining algorithm reduction can reduce the
data size by aggregating tering, for instance. These
techniques are not mu For example, data cleaning can
involve transfo by transforming all entries for a date field
to a niques, when applied before mining, can substa
patterns mined and/or the time required for th

In this chapter, we introduce the basic conce


Section 2.2 presents descriptive data
summariz data preprocessing. Descriptive
data summariz teristics of the data and identify
the presence successful data cleaning and
data integration. organized into the following

categories: data cle transformation (Section


2.4), and data reductio be used in an
alternative form of data reductio as raw values
for age) with higher-level concept This form of
data reduction is the topic of Secti eneration of
concept hierarchies from num techniques. The
automatic generation of concep described.

niques are incomplete (lacking attribute values or cert


taining only aggregate data), noisy (containing errors, o
the expected), and inconsistent (e.g., containing discre
used to categorize items). Welcome to the real world!

Incomplete, noisy, and inconsistent data are comm world


databases and data warehouses. Incomplete dat sons.
Attributes of interest may not always be availabl for sales
transaction data. Other data may not be inc considered
important at the time of entry. Relevant da
misunderstanding, or because of equipment malfuncti
with other recorded data may have been deleted. Furth
tory or modifications to the data may have been overl for
tuples with missing values for some attributes, may

There are many possible reasons for noisy data (havin


data collection instruments used may be faulty. There m
errors occurring at data entry. Errors in data transmiss
technology limitations, such as limited buffer size for
transfer and consumption. Incorrect data may also resu
conventions or data codes used, or inconsistent form
Duplicate tuples also require data cleaning.

Data cleaning routines work to clean the data by ing


noisy data, identifying or removing outliers, and r believe
the data are dirty, they are unlikely to trust th has been
applied to it. Furthermore, dirty data can cau cedure,

resulting in unreliable output. Although most cedures for


dealing with incomplete or noisy data, the they may
concentrate on avoiding overfitting the data Therefore, a
useful preprocessing step is to run your d routines.
Section 2.3 discusses methods for cleaning up

Getting back to your task at AllElectronics, suppos


data from multiple sources in your analysis. This wo

Getting back to your data, you have decided,


based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0]. Your
customer data, for example, contain the attrib
salary attribute usually takes much larger values
left unnormalized, the distance measurements t
weigh distance measurements taken on age. F
analysis to obtain aggregate information as to th
that is not part of any precomputed data cube i
that data transformation operations, such as n
tional data preprocessing procedures that wou
mining process. Data integration and data tran

Hmmm, you wonder, as you consider you


selected for analysis is HUGE, which is sure to sl
way I can reduce the size of my data set, witho

Data reduction obtains a reduced representati in


volume, yet produces the same (or almost t
number of strategies for data reduction. These
data cube), attribute subset selection (e.g., remov
tion analysis), dimensionality reduction (e.g., us
length encoding or wavelets), and numerosity
alternative, smaller representations such as clus
tion is the topic of Section 2.5. Data can also use
of concept hierarchies, where low-level con are

replaced with higher-level concepts, such a


hierarchy organizes the concepts into varying le

Neural networks and nearest-neighbor classifiers are described in Chapter 7.

(Figure: data transformation, for example scaling the values 22, 32, 100, 59, 48 to a smaller range, and data reduction applied to a set of transactions T1, T2, ..., T2000 over attributes A1, A2, A3, ..., A126.)

Figure 2.1 Forms of data preprocessing.

a form of data reduction that is very useful for


the auto archies from numerical data. This is
described in Secti generation of concept
hierarchies for categorical data.

Figure 2.1 summarizes the data preprocessing


step above categorization is not mutually
exclusive. For exa data may be seen as a form of
data cleaning, as well as

In summary, real-world data tend to be dirty, inc


preprocessing techniques can improve the quality
of the the accuracy and efficiency of the
subsequent mining

For many data preprocessing tasks, users w istics


regarding both central tendency and dis tendency include
mean, median, mode, and mid include quartiles,
interquartile range (IQR), and of great help in
understanding the distributio studied extensively in the
statistical literature. need to examine how they can be
computed ef it is necessary to introduce the notions of
dist holistic measure. Knowing what kind of measur an
efficient implementation for it.

2.2.1 Measuring the Central Tendency

In this section, we look at various ways to me most


common and most effective numerical m the (arithmetic)
mean. Let x1;x2;:::;xN be a set some attribute, like salary.
The mean of this set

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}

This corresponds to the built-in aggregate func


relational database systems.

A distributive measure is a measure (i.e., given data set


by partitioning the data into s for each subset, and then
merging the results i for the original (entire) data set.
Both sum() because they can be computed in this mann
min(). An algebraic measure is a measure that braic
function to one or more distributive m an algebraic
measure because it can be compu

Although the mean is the single most useful quantity


always the best way of measuring the center of the data.
is its sensitivity to extreme (e.g., outlier) values. Even a
can corrupt the mean. For example, the mean salary at
pushed up by that of a few highly paid managers. Simi in
an exam could be pulled down quite a bit by a few ve
caused by a small number of extreme values, we can
which is the mean obtained after chopping off values a
example, we can sort the values observed for salary and
before computing the mean. We should avoid trimm 20%)
at both ends as this can result in the loss of valua

For skewed (asymmetric) data, a better measure of


Suppose that a given data set of N distinct values is sorte
then the median is the middle value of the ordered set;
median is the average of the middle two values.

A holistic measure is a measure that must be com


whole. It cannot be computed by partitioning the give
the values obtained for the measure in each subset.
The tic measure. Holistic measures are much more
expens measures such as those listed above.

We can, however, easily approximate the median value


grouped in intervals according to their xi data values an of
data values) of each interval is known. For example, p to
their annual salary in intervals such as 1020K, 2030

contains the median frequency be the median interval. of the


entire data set (e.g., the median salary) by interp

\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}

Data cube computation is described in detail in Chapters 3 and 4.

where L_1 is the lower boundary of the median interval, N is the number of values in the entire data set, (sum freq)_l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
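A short Python sketch of this interpolation, using invented salary intervals and frequencies:

# Approximate the median from grouped frequencies (hypothetical salary intervals).
# Each interval is (lower_boundary, upper_boundary, frequency).
intervals = [(10000, 20000, 200), (20000, 30000, 450), (30000, 40000, 300), (40000, 50000, 50)]

N = sum(freq for _, _, freq in intervals)
cumulative = 0
for lower, upper, freq in intervals:
    if cumulative + freq >= N / 2:       # this is the median interval
        L1 = lower
        width = upper - lower
        sum_freq_lower = cumulative      # sum of frequencies below the median interval
        freq_median = freq
        break
    cumulative += freq

median = L1 + (N / 2 - sum_freq_lower) / freq_median * width
print(median)   # 20000 + (500 - 200) / 450 * 10000 = 26666.67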

Another measure of central tendency is the value that


occurs most frequently in the set. It correspond to several
different values, which r with one, two, or three modes
are respectively c In general, a data set with two or more
modes each data value occurs only once, then there is

For unimodal frequency curves that are mo the


following empirical relation:

\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})

This implies that the mode for unimodal frequ can easily
be computed if the mean and median In a unimodal
frequency curve with perfect median, and mode are all at
the same center va data in most real applications are not
symmetr skewed, where the mode occurs at a value that
is or negatively skewed, where the mode occu

(Figure 2.2(c)).

The midrange can also be used to assess the central tendency of a numerical data set. It is the average of the largest and smallest values in the set. It is an algebraic measure that is easy to compute using the SQL aggregate functions, max() and min().
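The following Python sketch computes all four central tendency measures discussed above for a small, invented salary sample:

from statistics import mean, median, multimode

# Hypothetical salary observations (in thousands of dollars)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean    :", mean(salaries))                       # arithmetic mean
print("median  :", median(salaries))                     # middle value (average of the two middle values here)
print("mode(s) :", multimode(salaries))                  # most frequent value(s); this sample is bimodal
print("midrange:", (min(salaries) + max(salaries)) / 2)  # average of the largest and smallest values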

2.2.2 Measuring the Dispersion of Data

The degree to which numerical data tend to spre the


data. The most common measures of data d mary
(based on quartiles), the interquartile ran

75th percentile. The quartiles, including the median, gi


spread, and shape of a distribution. The distance betwe a
simple measure of spread that gives the range covere
This distance is called the interquartile range (IQR) an

IQR = Q_3 - Q_1.

Based on reasoning similar to that in our analysis of


the conclude that Q1 and Q3 are holistic measures, as
is IQ No single numerical measure of spread, such as
I skewed distributions. The spreads of two sides of a s
(Figure 2.2). Therefore, it is more informative to also p
Q3, along with the median. A common rule of thumb f
is to single out values falling at least 1:5 IQR above th

quartile.

Because Q1, the median, and Q3 together contain no


(e.g., tails) of the data, a fuller summary of the shape by
providing the lowest and highest data values as well.
summary. The five-number summary of a distribution
tiles Q1 and Q3, and the smallest and largest individual

Minimum, Q_1, Median, Q_3, Maximum.

Boxplots are a popular way of visualizing a


distribu five-number summary as follows:

Typically, the ends of the box are at the quartiles,


so th tile range, IQR.

The median is marked by a line within the box.

Two lines (called whiskers) outside the box


extend largest (Maximum) observations.

When dealing with a moderate number of observ


potential outliers individually. To do this in a boxplo

Figure 2.3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period.

the extreme low and high observations only beyond the


quartiles. Otherwise, the whiskers vations occurring within
1:5 IQR of the qu individually. Boxplots can be used in the
co data. Figure 2.3 shows boxplots for unit price
AllElectronics during a given time period. For of items sold
is $80, Q1 is $60, Q3 is $100. N this branch were plotted
individually, as their 1.5 times the IQR here of 40. The
efficient comp boxplots (based on approximates of the
challenging issue for the mining of large data
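A Python sketch of the five-number summary and the 1.5 x IQR outlier rule, using invented unit price values:

from statistics import median, quantiles

# Hypothetical unit prices for items sold at one branch
prices = [38, 41, 45, 52, 58, 60, 63, 70, 74, 80, 85, 92, 100, 104, 215]

q1, q2, q3 = quantiles(prices, n=4)          # quartiles (Q2 is the median)
iqr = q3 - q1
five_number_summary = (min(prices), q1, median(prices), q3, max(prices))
print("five-number summary:", five_number_summary)

# Common rule of thumb: flag values at least 1.5 x IQR beyond the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < low or p > high]
print("suspected outliers:", outliers)       # the extreme price 215 is flagged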

Variance and Standard Deviation

The variance of N observations, x_1, x_2, ..., x_N, is

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2

where \bar{x} is the mean value of the observations, as defined above. The standard deviation, \sigma, of the observations is the square root of the variance, \sigma^2.
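In Python, the population variance and standard deviation defined above can be computed directly (the observations are invented):

from statistics import pvariance, pstdev, mean

observations = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

x_bar = mean(observations)
sigma2 = pvariance(observations, mu=x_bar)   # (1/N) * sum((x_i - mean)^2)
sigma = pstdev(observations, mu=x_bar)       # square root of the variance

print(f"mean={x_bar}, variance={sigma2:.2f}, std dev={sigma:.2f}")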

2.2.3 Graphic Displays of Basic Descriptive Data Summaries

Aside from the bar charts, pie charts, and line graphs u
ical data presentation software packages, there are ot the
display of data summaries and distributions. Th plots, q-q
plots, scatter plots, and loess curves. Such grap
inspection of your data.

Plotting histograms, or frequency histograms, is a ing the


distribution of a given attribute. A histogram for distribution of A
into disjoint subsets, or buckets. Typi uniform. Each bucket is
represented by a rectangle who relative frequency of the values
at the bucket. If A is cat or item type, then one rectangle is drawn
for each kno graph is more commonly referred to as a bar chart.
If is preferred. Partitioning rules for constructing histogr
discussed in Section 2.5.4. In an equal-width histogra resents an
equal-width range of numerical attribute A.

Figure 2.4 shows a histogram for the data set of Table equalwidth ranges representing $20 increments and th sold.
Histograms are at least a century old and are a method.
However, they may not be as effective as the qu methods for
comparing groups of univariate observati

A quantile plot is a simple and effective way to data distribution.


First, it displays all of the data for t user to assess both the
overall behavior and unusua quantile information. The
mechanism used in this st percentile computation discussed in
Section 2.2.2. Le sorted in increasing order so that x1 is the
smallest o Each observation, xi, is paired with a percentage, fi,
wh 100fi%of the data are below or equal to the value,xi.

Figure 2.4 A histogram for the data set of Table 2.1.

Table 2.1 A set of unit price data for items sold at a branch of AllElectronics.

Unit price ($): 40, 43, 47, ..., 74, 75, 78, ..., 115, 117, 120
there may not be a value with exactly a fraction f_i of the data below or equal to it. Note that the 0.25 quantile corresponds to quartile Q_1, the 0.50 quantile is the median, and the 0.75 quantile is Q_3.

Let

f_i = \frac{i - 0.5}{N}.

These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above zero) to 1 - 1/(2N) (which is slightly below one). On a quantile plot, x_i is graphed against f_i. This allows us to
compare different For example, given the quantile plots of
sales da

Figure 2.5 A quantile plot for the unit price data of Table 2.1.

compare their Q1, median, Q3, and other fi values at a g plot for
the unit price data of Table 2.1.
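The (f_i, x_i) pairs that make up such a quantile plot are straightforward to compute; here is a Python sketch with invented unit price values (the plotting itself is omitted):

# Compute the (f_i, x_i) pairs used in a quantile plot
prices = [120, 40, 75, 43, 78, 47, 74, 115, 117]

xs = sorted(prices)                      # x_1 <= x_2 <= ... <= x_N
N = len(xs)
f = [(i - 0.5) / N for i in range(1, N + 1)]

for fi, xi in zip(f, xs):
    print(f"f={fi:.3f}  unit price={xi}")
# f runs from 1/(2N) to 1 - 1/(2N) in steps of 1/N;
# roughly 100*f_i percent of the data lie at or below x_i.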

A quantile-quantile plot, or q-q plot, graphs t distribution


against the corresponding quantiles of ano tool in that it
allows the user to view whether there is a sh to another.

Suppose that we have two sets of observations for t two


different branch locations. Let x1;:::;xN be the y1;:::;yMbe the
data from the second, where each dataIf M = N (i.e., the
number of points in each set is th against xi, where yi and xi
are both (i 0:5)=N quant If M < N (i.e., the second branch has
fewer observati only M points on the q-q plot. Here, yi is the (i
0:5) is plotted against the (i 0:5)=M quantile of the x d involves
interpolation.

Figure 2.6 shows a quantile-quantile plot for unit pr ferent


branches of AllElectronics during a given time p the same quantile
for each data set and shows the uni versus branch 2 for that

quantile. For example, here th corresponds to the 0.03 quantile.


(To aid in comparison represents the case of when, for each given
quantile, th same. In addition, the darker points correspond to the
respectively.) We see that at this quantile, the unit pri slightly less
than that at branch 2. In other words, 3% o than or equal to $40,
while 3% of items at branch 2 wer highest quantile, we see that
the unit price of items at b at branch 1. In general, we note that
there is a shift in t respect to branch 2 in that the unit prices of
items sold those at branch 2.

Figure 2.6 A quantile-quantile plot for unit price data from two different branches of AllElectronics during a given time period.

(Figure: scatter plot with the number of items sold on the vertical axis and unit price in dollars on the horizontal axis.)

Figure 2.7 A scatter plot for the data set of Table 2.1.

A scatter plot is one of the most effective gra appears to be a


relationship, pattern, or trend construct a scatter plot, each
pair of values is tr braic sense and plotted as points in the
plane. Fi data in Table 2.1. The scatter plot is a useful met
3

data to see clusters of points and outliers, or to e tionships. In


Figure 2.8, we see examples of po

A statistical test for correlation is given in Section 2.4.1

Figure 2.9 Three cases where there is no observed correlation between the two attributes in each of the given data sets.

two attributes in two different data sets. Figure 2.9 sho no


correlation relationship between the two attributes When
dealing with several attributes, the scatter-pl the scatter
plot. Given n attributes, a scatter-plot matri that provides
a visualization of each attribute (or dime The scatter-plot
matrix becomes less effective as the n grows. In this
case, user interactions such as zooming a

help interpret the individual scatter plots effectively.

A loess curve is another important exploratory


grap to a scatter plot in order to provide better
perception o word loess is short for local
regression. Figure 2.10 s data in Table 2.1.

To fit a loess curve, values need to be set for two para


eter, and , the degree of the polynomials that are fitted
any positive number (typical values are between 1=4 an
choosing is to produce a fit that is as smooth as possi
underlying pattern in the data. The curve becomes smoo
some lack of fit, however, indicating possible missing d
underlying pattern is tracked, yet overfitting of the data in
the curve may not be supported by the data. If the un

Figure 2.10 A loess curve for the data set of Table 2.1.

gentle curvature with no local maxima and m


sufficient ( = 1). However, if there are local fitting ( =
2) typically does a better job of foll taining local
smoothness.
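As one concrete (and entirely optional) way to fit such a curve, the statsmodels package provides a LOWESS routine whose frac argument plays the role of the smoothing parameter and whose fits are locally weighted linear regressions. The data below are invented:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical (unit price, items sold) observations
x = np.array([40, 43, 47, 52, 58, 64, 70, 74, 78, 85, 92, 100, 110, 115, 120], dtype=float)
y = np.array([130, 150, 240, 300, 360, 410, 470, 500, 540, 580, 600, 620, 610, 600, 590], dtype=float)

# frac controls how much of the data is used for each local fit:
# a larger frac gives a smoother curve, a smaller frac tracks local structure more closely.
smoothed = lowess(y, x, frac=0.5, return_sorted=True)

for xi, yi in smoothed[:5]:
    print(f"unit price={xi:.0f}, smoothed items sold={yi:.1f}")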

In conclusion, descriptive data summaries behavior of


your data. By helping to identify no for data cleaning.

2.3 Data Cleaning

Real-world data tend to be incomplete, noisy, cleansing)


routines attempt to fill in missing v ing outliers, and correct
inconsistencies in the methods for data cleaning. Section
2.3.1 loo Section 2.3.2 explains data smoothing techniqu
data cleaning as a process.

2.3.1 Missing Values

Imagine that you need to analyze AllElectronics many tuples


have no recorded value for several a can you go about filling
in the missing values for methods:

Ignore the tuple: This is usually done when mining task


involves classification). This met contains several attributes with
missing valu age of missing values per attribute varies co

For example, if classifying customers according to cr


with the average income value for customers in the of
the given tuple.

Use the most probable value to fill in the missing with


regression, inference-based tools using a Baye induction. For
example, using the other customer may construct a decision tree
to predict the miss trees, regression, and Bayesian inference are
descr

Methods 3 to 6 bias the data. The filled-in value however, is a


popular strategy. In comparison to the information from the
present data to predict missing of the other attributes in its
estimation of the missing v chance that the relationships
between income and the o It is important to note that, in
some cases, a missi in the data! For example, when applying
for a credit c

supply their drivers license number. Candidates who d


naturally leave this field blank. Forms should allow resp not
applicable. Software routines may also be used t as dont
know, ?, or none. Ideally, each attribut regarding the null
condition. The rules may specify w and/or how such values
should be handled or transfo tionally left blank if they are to
be provided in a later ste although we can try our best to
clean the data after it is and of data entry procedures should
help minimize t errors in the first place.
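To make two of these filling strategies concrete, the following Python sketch (the customer records are invented) fills a missing income value either with the overall attribute mean or with the mean income of tuples belonging to the same class, here taken to be the credit risk category:

from statistics import mean

# Hypothetical customer records; None marks a missing income value
records = [
    {"cust_ID": 1, "credit_risk": "low",  "income": 52000},
    {"cust_ID": 2, "credit_risk": "low",  "income": 61000},
    {"cust_ID": 3, "credit_risk": "high", "income": 23000},
    {"cust_ID": 4, "credit_risk": "high", "income": None},
]

# Strategy 1: fill with the global attribute mean
global_mean = mean(r["income"] for r in records if r["income"] is not None)

# Strategy 2: fill with the mean income of tuples in the same class (same credit_risk)
def class_mean(cls):
    vals = [r["income"] for r in records if r["credit_risk"] == cls and r["income"] is not None]
    return mean(vals)

for r in records:
    if r["income"] is None:
        r["income"] = class_mean(r["credit_risk"])   # or global_mean
print(records)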

2.3.2 Noisy Data

What is noise? Noise is a random error or variance numerical


attribute such as, say, price, how can we sm noise? Lets look at
the following data smoothing techn

Smoothing by bin boundaries:

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Figure 2.11 Binning methods for data smoothing.

Binning: Binning methods smooth a sorted hood, that is, the


values around it. The sor of buckets, or bins. Because binning
meth they perform local smoothing. Figure 2.11 ill example, the
data for price are first sorted a bins of size 3 (i.e., each bin
contains three va value in a bin is replaced by the mean value
values 4, 8, and 15 in Bin 1 is 9. Therefore, e by the value 9.
Similarly, smoothing by bin bin value is replaced by the bin
median. In s mum and maximum values in a given bin a bin

value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 2.6 (a minimal sketch appears after this list).

Regression: Data can be smoothed by fitti regression. Linear


regression involves findin variables), so that one attribute can
be us regression is an extension of linear regressi involved and
the data are fit to a multidim described in Section 2.5.4, as well
as in Ch

Figure 2.12 A 2-D plot of customer data with respect to customer data clusters. Each cluster centroid is marked with a +, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.

Clustering: Outliers may be detected by clustering, w


into groups, or clusters. Intuitively, values that fall be
considered outliers (Figure 2.12). Chapter 7 is de and
outlier analysis.
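Here is a minimal Python sketch of the binning approach described above, showing equal-depth partitioning followed by smoothing by bin means and by bin boundaries. The sorted price values are assumed to match the example of Figure 2.11:

# Equal-depth (equal-frequency) binning and two smoothing variants.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

smoothed_by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

def closest_boundary(value, low, high):
    return low if value - low <= high - value else high

smoothed_by_boundaries = [[closest_boundary(v, b[0], b[-1]) for v in b] for b in bins]

print("bins:                   ", bins)
print("smoothing by means:     ", smoothed_by_means)       # e.g., Bin 1 -> [9.0, 9.0, 9.0]
print("smoothing by boundaries:", smoothed_by_boundaries)  # e.g., Bin 1 -> [4, 4, 15]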

Many methods for data smoothing are also meth


ing discretization. For example, the binning techniq
number of distinct values per attribute. This acts a
logic-based data mining methods, such as decision

t make value comparisons on sorted data. Concept


hie cretization that can also be used for data
smoothing. A example, may map real price values
into inexpensive, thereby reducing the number of
data values to be h Data discretization is discussed
in Section 2.6. Some as neural networks, have
built-in data smoothing m topic of Chapter 6.

purposes other than originally intended. There


integration (e.g., where a given attribute can hav

So, how can we proceed with discrepancy detec edge


you may already have regarding properti about data is
referred to as metadata. For exam each attribute? What
are the acceptable values fo length of values? Do all
values fall within the exp dencies between attributes?
The descriptive dat useful here for grasping data trends
and identif are more than two standard deviations away
f be flagged as potential outliers. In this step, yo some
of the tools that we discuss further below. and unusual
values that need investigation.

As a data analyst, you should be on the lookou


inconsistent data representations (such as 2004
overloading is another source of errors that typic
attribute definitions into unused (bit) portions an unused
bit of an attribute whose value range

The data should also be examined regarding


rules. A unique rule says that each value of th all
other values for that attribute. A consecuti ing
values between the lowest and highest val must
also be unique (e.g., as in check numbers
question marks, special characters, or other st
tion (e.g., where a value for a given attribute
should be handled. As mentioned in Section 2.3.

(1) the person originally asked to provide a val

Data integration and the removal of redundant


data t described in Section 2.4.1.

or clustering to identify outliers. They may also


use the were described in Section 2.2.

Some data inconsistencies may be corrected manua


example, errors made at data entry may be corrected b
errors, however, will require data transformations. This i
as a process. That is, once we find discrepancies, we ty
(a series of) transformations to correct them.

Commercial tools can assist in the data transform


allow simple transformations to be specified, such as

sex. ETL (extraction/transformation/loading)


tools through a graphical user interface (GUI).
These tools t set of transforms so that, often, we
may also choose to of the data cleaning process.

The two-step process of discrepancy detection and da


crepancies) iterates. This process, however, is error-pr
transformations may introduce more discrepancies. So be
detected after others have been fixed. For example, a t may only
surface once all date values have been converte mations are
often done as a batch process while the us after the
transformation is complete can the user go bac lies have been
created by mistake. Typically, numerous user is satisfied. Any
tuples that cannot be automatically h are typically written to a file
without any explanation reg failure. As a result, the entire data
cleaning process also s

New approaches to data cleaning emphasize increase


example, is a publicly available data cleaning tool (see
that integrates discrepancy detection and transformatio
transformations by composing and debugging individu
time, on a spreadsheet-like interface. The transformatio
by providing examples. Results are shown immediately
the screen. The user can choose to undo the transform

2.4 Data Integration and Transformation


Data mining often requires data integration stores.
The data may also need to be transfor This section
describes both data integration an

2.4.1 Data Integration


It is likely that your data analysis task will invol from multiple
sources into a coherent data stor may include multiple
databases, data cubes, or There are a number of issues to
consider dur and object matching can be tricky. How can
equ

data sources be matched up? This is referred For


example, how can the data analyst or the c database
and cust number in another refer to th for each attribute
include the name, meaning, for the attribute, and null
rules for handling b Such metadata can be used to help
avoid erro may also be used to help transform the data
(e database may be H and S, and 1 and 2 in data
cleaning, as described earlier.

Redundancy is another important issue. An instance)


may be redundant if it can be der attributes.
Inconsistencies in attribute or dime cies in the
resulting data set.

Some redundancies can be detected by correlation analysis. Such analysis can measure how strongly one attribute implies the other, based on the available data. For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson).

resulting value is equal to 0, then A and B are indepe


between them. If the resulting value is less than 0, then
where the values of one attribute increase as the value
This means that each attribute discourages the other.
view correlations between attributes (Section 2.2.3).

Note that correlation does not imply causality. That does


not necessarily imply that A causes B or that B cau
demographic database, we may find that attributes repr
and the number of car thefts in a region are correlate
causes the other. Both are actually causally linked to a th
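To illustrate, the following Python sketch computes the correlation coefficient for two numerical attributes from invented paired observations:

from statistics import mean, pstdev

# Hypothetical paired observations of two numerical attributes A and B
A = [2, 4, 6, 8, 10, 12]
B = [30, 34, 41, 44, 52, 55]

n = len(A)
mean_a, mean_b = mean(A), mean(B)
std_a, std_b = pstdev(A), pstdev(B)

# Pearson correlation coefficient: average product of deviations, scaled by the std devs
r_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / (n * std_a * std_b)
print(f"r(A, B) = {r_ab:.3f}")   # close to +1: A and B increase together

# A value near 0 would suggest no linear correlation;
# a negative value would mean one attribute increases as the other decreases.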

For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a chi-square test. Suppose A has c distinct values, namely a_1, a_2, ..., a_c, and B has r distinct values, namely b_1, b_2, ..., b_r. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (A_i, B_j) denote the joint event that attribute A takes on value a_i and attribute B takes on value b_j. Each and every possible (A_i, B_j) joint event has its own cell in the table. The chi-square value (also known as the Pearson chi-square statistic) is computed as:

\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

where N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Equation (2.9) is computed over all of the r x c cells. Note that the cells that contribute the most to the chi-square value are those whose actual count is very different from that expected.

Let's look at a concrete example.

Example 2.1 Correlation analysis of


categorical attributes u people was surveyed.
The gender of each person whether their
preferred type of reading material two attributes,
gender and preferred reading. Th possible joint
event is summarized in the cont the numbers in
parentheses are the expected fr distribution for
both attributes using Equation

Using Equation (2.10), we can verify the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is

e_{11} = \frac{count(\text{male}) \times count(\text{fiction})}{N}

and so on. Notice that in any row, the sum of the expected frequencies must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column. Using Equation (2.9) for chi-square computation, we get

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \cdots = 284.44 + 121.90 + 71.11 + 30\ldots

For this 2 x 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the chi-square value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the chi-square distribution, available from any textbook on statistics). Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
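The chi-square computation itself is easy to script. In the Python sketch below, the observed counts 250 and 50 appear in the example above; the other two cells (200 and 1000) are assumed here purely for illustration:

# Chi-square test of correlation between two categorical attributes.
# Rows = preferred reading (fiction, non-fiction); columns = gender (male, female).
# 250 and 50 appear in the text above; 200 and 1000 are assumed for illustration.
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / N     # expected frequency
        chi2 += (o_ij - e_ij) ** 2 / e_ij

print(f"chi-square = {chi2:.2f}")
# For a 2 x 2 table there is 1 degree of freedom; a value above 10.828
# rejects the independence hypothesis at the 0.001 significance level.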

In addition to detecting redundancies betw be


detected at the tuple level (e.g., where ther

(such as free breakfast) and taxes. An attribute in o a


lower level of abstraction than the same attribu total
sales in one database may refer to one branch of of the
same name in another database may refer to stores in
a given region.

When matching attributes from one database to an


attention must be paid to the structure of the data. Th
functional dependencies and referential constraints in t the
target system. For example, in one system, a disco
whereas in another system it is applied to each indivi If
this is not caught before integration, items in the t
discounted.

The semantic heterogeneity and structure of data p gration.


Careful integration of the data from multiple s redundancies and
inconsistencies in the resulting dat accuracy and speed of the
subsequent mining process.

2.4.2 Data Transformation

In data transformation, the data are transformed or con for


mining. Data transformation can involve the follow

Smoothing, which works to remove noise from th


binning, regression, and clustering.

Aggregation, where summary or aggregation opera example,


the daily sales data may be aggregated so as total amounts. This
step is typically used in constru the data at multiple granularities.

Generalization of the data, where low-level or prim higher-level


concepts through the use of concept hier

therefore discuss normalization and attribute c An attribute is


normalized by scaling its value range, such as 0.0 to 1.0.
Normalization is partic involving neural networks, or distance
measure cation and clustering. If using the neural netw sification
mining (Chapter 6), normalizing the in the training tuples will help
speed up the lear normalization helps prevent attributes with init
weighing attributes with initially smaller ranges methods for data
normalization. We study thr malization, and normalization by
decimal scalin

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v, of A to v' in the range [new_min_A, new_max_A] by computing

v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.

Example 2.2 Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to

\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing

v' = \frac{v - \bar{A}}{\sigma_A}

where \bar{A} and \sigma_A are the mean and standard deviation, respectively, of attribute A.
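To make both transformations concrete, here is a small Python sketch (the min-max example values come from Example 2.2 above; the z-score mean and standard deviation are assumed for illustration):

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linear transformation of v from [min_a, max_a] into [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    # Number of standard deviations v lies from the attribute mean
    return (v - mean_a) / std_a

# Min-max example from the text: income 73,600 with min 12,000 and max 98,000
print(round(min_max_normalize(73600, 12000, 98000), 3))      # 0.716

# Z-score example with an assumed mean of 54,000 and standard deviation of 16,000
print(round(z_score_normalize(73600, 54000, 16000), 3))      # 1.225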
