
TOP-10 DATA MINING CASE STUDIES

GABOR MELLI
PredictionWorks Inc., Seattle, WA 98126, USA
gmelli@predictionworks.com
XINDONG WU
Department of Computer Science
University of Vermont
Burlington, VT 05405, USA
xwu@cems.uvm.edu
PAUL BEINAT
NeuronWorks International, Hurstville,
NSW 2220, Australia
PBeinat@neuronworks.com
FRANCESCO BONCHI
Yahoo! Research, Barcelona, Spain
bonchi@yahoo-inc.com
LONGBING CAO
University of Technology, Sydney, Australia
lbcao@it.uts.edu.au
RONG DUAN
AT&T Labs, Research, Florham Park, NJ, USA
rongduan@research.att.com
CHRISTOS FALOUTSOS
Department of Computing Science
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
christos@cs.cmu.edu
RAYID GHANI
Accenture Technology Labs
161 N.Clark St, Chicago, IL 60601, USA
rayid.ghani@gmail.com
BRENDAN KITTS
Lucid Commerce, Seattle, WA 98104, USA
bkitts@lucidcommerce.com
International Journal of Information Technology & Decision Making
Vol. 11, No. 2 (2012) 389-400
© World Scientific Publishing Company
DOI: 10.1142/S021962201240007X
Int. J. Info. Tech. Dec. Mak. 2012.11:389-400. Downloaded from www.worldscientific.com by NANYANG TECHNOLOGICAL UNIVERSITY on 09/22/14. For personal use only.
BART GOETHALS
Department of Mathematics and Computer Science
University of Antwerp, Belgium
bart.goethals@ua.ac.be
GEOFF MCLACHLAN
Department of Mathematics, University of Queensland
St. Lucia, Brisbane, Australia
gjm@maths.uq.edu.au
JIAN PEI
School of Computing Science
Simon Fraser University, Canada
jpei@sfu.ca
ASHOK SRIVASTAVA
NASA, USA
ashok.srivastava@nasa.gov
OSMAR ZAÏANE
Department of Computing Science, University of Alberta
Alberta, Canada T6G 2E8
zaiane@cs.ualberta.ca
We report on the panel discussion held at the ICDM'10 conference on the top 10 data mining
case studies in order to provide a snapshot of where and how data mining techniques have made
significant real-world impact. The tasks covered by the 10 case studies range from the detection of
anomalies such as cancer, fraud, and system failures to the optimization of organizational
operations, and include the automated extraction of information from unstructured sources.
From the 10 cases we find that supervised methods prevail while unsupervised techniques play a
supporting role. Further, significant domain knowledge is generally required to achieve a
completed solution. Finally, we find that successful applications are more commonly associated
with continual improvement than with single "aha moments" of knowledge ("nugget")
discovery.
Keywords: Data mining; cost-benefit analysis; case study.
MSC 2011: 68T05, 68U30, 68-01.
1. Introduction
Following the successes of the 10 Challenging Problems in Data Mining
Research at ICDM'05 [a] and the Top 10 Algorithms in Data Mining at ICDM'06 [b],
and as part of the 10th anniversary celebration of the IEEE International
Conference on Data Mining series (ICDM), the Top-10 Data Mining Case Studies
panel at ICDM'10 presented the top 10 data mining case study [c] submissions
as selected by 12 experienced data miners. The moderated panel's objective
was to present contemporary exemplars of successful data mining applications
in order to help the community set a baseline for the success criteria of a
contemporary data mining application, one that practitioners can use to
improve their deployments and that data mining researchers can use to push
the state of the art.
[a] http://www.cs.uvm.edu/%7Eicdm/10Problems/index.shtml.
[b] http://www.cs.uvm.edu/%7Eicdm/algorithms/index.shtml.
[c] http://www.gabormelli.com/RKB/Data Mining Case Study.
Prior to the event the panel's program committee [d] designed a questionnaire to be
completed by each case study candidate, advertised for submissions, and proceeded
to rank the candidates. During the panel each panelist was assigned to act as advocate
for one of the case studies that they had ranked highly.
The panel began with a brief summary of each case study by its advocate, and
concluded with a topic-centered discussion.
The remainder of the paper is structured as follows. In Sec. 2 we describe the
selection process. In Sec. 3 we summarize the 10 selected case studies. In Sec. 4 we
summarize the main topics discussed during the second half of the panel.
2. Selection Process
The process to select the top 10 case studies involved three main tasks: questionnaire
design, an open call for case studies, and the ranking step. Table 1 presents the
questions that each candidate was required to answer.
We received 16 high-quality case studies, each accompanied by a completed
questionnaire. Each program committee member ranked all submissions for which there
was no conflict of interest. The top-10 submissions were selected based on the
average normalized ranking provided by the 12 committee members. Figure 1
summarizes the ranking of all submissions.
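The report does not spell out how the rankings were normalized. As a minimal sketch of one plausible reading, the Python below divides each member's rank positions by the number of submissions that member actually ranked (conflicts of interest mean the lists differ in length) and then averages the normalized ranks per submission. The function names, member names, and example data are hypothetical and are not taken from the panel's actual procedure.

```python
from collections import defaultdict


def average_normalized_rank(rankings):
    """rankings: dict mapping member -> list of submission ids, best first.

    A member simply omits any submission with which they have a conflict
    of interest. Returns dict mapping submission -> mean normalized rank
    in (0, 1], where lower is better.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ordered in rankings.values():
        n = len(ordered)
        for position, submission in enumerate(ordered, start=1):
            totals[submission] += position / n  # normalize by list length
            counts[submission] += 1
    return {s: totals[s] / counts[s] for s in totals}


def top_k(rankings, k):
    """Return the k submissions with the lowest average normalized rank."""
    scores = average_normalized_rank(rankings)
    return sorted(scores, key=lambda s: (scores[s], s))[:k]


# Hypothetical data: three members, four submissions.
# Member m3 has a conflict of interest with submission "B" and skips it.
example = {
    "m1": ["A", "B", "C", "D"],
    "m2": ["B", "A", "D", "C"],
    "m3": ["A", "C", "D"],
}
scores = average_normalized_rank(example)
selected = top_k(example, 2)
```

Averaging only over the members who ranked a submission keeps a conflict of interest from counting for or against it, which is one reason to normalize before averaging.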
3. The Top-10 Case Studies
This section summarizes the top-10 case studies in randomized order. For each case
study we briefly describe the task, solution, challenges, and results.
[d] The case study selection committee did not include Francesco Bonchi and Osmar Zaïane.
Table 1. Questions required from candidate case studies.
1 What is the data mining case study about? What problem was solved?
2 What were the quantitative and qualitative measures used to evaluate success of the data mining initiative? What were the actual results achieved in these measures?
3 What data mining techniques and algorithms were used? How much did they contribute to the project's success?
4 What novel data mining techniques were developed? What was the impact of these techniques?
5 What makes the case study most noteworthy?
6 What time period was the application in operation?
7 What organization benefited from the application? Whom can we contact at this organization?
8 What organizations were involved in implementing and delivering the application?
9 In what way could other situations benefit from the case study?
3.1. US Department of Agriculture Risk Management Agency's crop
insurance data mining program
Case-Study Members: Bert Little, Michael Schucking, and the CAE Research
Center Team at Tarleton State University for US Department of Agriculture Risk
Management Agency.
Topic: Predict improper insurance indemnity payments (waste, fraud, and
abuse).
Solution: Produce a short-list for human inspection twice during the growing
season, and track using satellite data in an automated process.
Techniques: Anomaly detection; link analysis; cluster analysis; regression; factor
analysis; geo-referenced methods.
Results: Saved $1.5 billion between 2001 and 2007. Identified more than $188
million of anomalous claims in 2009.
Reference: Little et al. [8]
Panelist: Ashok Srivastava, NASA.
3.2. Click fraud attack detection at massive scale
over 5 years at Microsoft
Case-Study Members: Brendan Kitts, Jingying Zhang, Gang Wu, Julien Beasley,
Kieran Morrill, John Ettedgui, Eric Jorgensen, Sid Siddhartha, Hong Yuan, Peter Azo,
Feng Gao, Baiju Nair, Haitao Song, Dinesh Chahlia, Tudor Trunescu, Narayanan
Madhu, Raj Mahato, Wesley Brandi, Sasha Berger, Jigar Mody, Dennis Minium,
Albert Roux, Ron Mills, Kamran Kanany, Brandon Sabottka, Matthew Rice.
Fig. 1. Average rank given to each submission, sorted by the score (papers are anonymized [e]).
[e] We do not release the specific rank that each selected case study was assigned.
Topic: Combating click-fraud at Microsoft's online services.
Challenges:
• Adversarial attackers that evolve their response, e.g., blend their attacks to appear
as statistical noise.
• Approximately 1 billion events per hour need to be scored.
• The Ad network must intercept and contain attacks in real time.
Solution:
• Real-time components on minimum memory systems.
• Offline, grid computing system.
• Infrastructure to experiment on multiple models.
• Rule reporting systems.
Reference: Kitts et al. [6]
Panelist: Bart Goethals, University of Antwerp.
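To give a feel for the scale quoted in the challenges, the following back-of-the-envelope Python sketch converts the stated figure of roughly 1 billion events per hour into a per-second rate and an average per-event time budget. The number of parallel scorers (100) and the assumption of evenly spread load are hypothetical illustrations, not details of the Microsoft system.

```python
EVENTS_PER_HOUR = 1_000_000_000  # approximate figure quoted in the case study
SECONDS_PER_HOUR = 3600


def events_per_second(events_per_hour=EVENTS_PER_HOUR):
    """Average event rate implied by an hourly volume."""
    return events_per_hour / SECONDS_PER_HOUR


def per_event_budget_us(workers, events_per_hour=EVENTS_PER_HOUR):
    """Average time in microseconds that each worker may spend per event,
    assuming load is spread evenly over `workers` parallel scorers
    (an assumption made only for this illustration)."""
    return workers * 1_000_000 / events_per_second(events_per_hour)


rate = events_per_second()                 # roughly 278 thousand events/s
budget = per_event_budget_us(workers=100)  # roughly 360 microseconds/event
```

Under these assumptions each scorer has well under a millisecond per event on average, which is consistent with the case study's emphasis on real-time components with small memory footprints.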
3.3. Discovery of precursors to aviation safety incidents
with data mining
Case-Study Members: Ashok Srivastava and Irving Statler for NASA's Aviation
System Monitoring and Modeling (ASMM) Project.
Task: Discover accident precursors to aviation safety incidents to enable proactive
management of the safety risk of the national air transportation system.
Challenges:
• Extract and fuse reliable and useful information from very large, heterogeneous
(numerical and textual) data sources with minimal human labor.
• Integrate information from domain experts.
Solution: k-means for clustering, Linear Discriminant Analysis and natural
language processing for classification.
Outcome: Accepted by the aviation industry and the FAA with the potential to
benefit billions of passengers.
Lessons: Diverse techniques and groups of people need to be brought together to
solve real-world problems.
Reference: Ferryman et al. [2]
Panelist: Rayid Ghani, Accenture Technology Labs.
3.4. Analyzing system logs for online failure prediction
on supercomputers
Case-Study Members: Zhiling Lan, Ziming Zheng, Jiexing Gu, Susan Coghlan,
and Rajeev Thakur with Argonne National Laboratory.
Task: Online failure prediction in supercomputing systems via log analysis.
Challenges: Effectively capture failure patterns from the redundant, unformatted,
dynamically changing, and overwhelming volume of events in system logs.
Solution: Extract unique system events via a three-step log preprocessing and boost
prediction accuracy via dynamic meta-learning.
Techniques: Association rule; Statistical learning; Probability-based method;
Genetic learning; Decision tree; Ensemble learning.
Reference: Lan et al. [7]
Panelist: Christos Faloutsos, Carnegie Mellon University.
3.5. Forecasting skewed biased stochastic ozone days:
Analyses, solutions and beyond
Case-Study Members: Wei Fan, Kun Zhang, and Xiaojing Yuan for Texas
Commission on Environmental Quality (TCEQ).
Task: Ozone level alarm forecasting models.
Challenges: Sparse dataset (2% or 5% positives depending on the criteria of "ozone
days"); evolving phenomena; large number of irrelevant features; sample selection bias.
Solution: Used ensemble-based probability trees, bagging probabilistic decision
trees, and random decision trees.
Impact: 20% higher in recall (correctly detects 1 to 3 more ozone days, depending on
the year) and 10% higher in precision (15 to 30 fewer false alarm days per year).
Lessons: For huge datasets, human cognition is limited to the crafting of manually
created models and hypotheses.
Reference: Zhang and Fan [14]
Panelist: Longbing Cao, University of Technology, Sydney.
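The impact figures above are stated in terms of recall and precision. The Python sketch below shows how those two measures relate to counts of detected ozone days and false alarm days. All counts are invented for illustration and are chosen only to be of the same order of magnitude as the quoted gains; the report does not say whether the 20% and 10% improvements are absolute or relative, and the example treats them as percentage points.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical year with 10 actual ozone days (invented numbers).
baseline = precision_recall(true_positives=5, false_positives=45, false_negatives=5)
improved = precision_recall(true_positives=7, false_positives=28, false_negatives=3)

extra_days_detected = 7 - 5     # two more ozone days caught
fewer_false_alarms = 45 - 28    # seventeen fewer false alarm days
```

With so few positive days per year, a handful of extra detections moves recall by tens of percentage points, which is why the case study reports both the percentages and the underlying day counts.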
3.6. Mining medical images
Case-Study Members: Balaji Krishnapuram and R. Bharat Rao for Siemens Inc.
Task: Computer-aided medical image analysis to help radiologists identify early
stage cancers and other medical conditions.
Challenges: Large amount of data but not i.i.d.; User adoption.
Techniques: SVM, CRF, etc., with significant research into modeling the nature
of the data.
Outcome:
• Increased radiologist performance of lung nodule detection from 80% sensitivity
to 90-95%.
• Widely licensed, with patients diagnosed daily.
Lessons:
• Improve user (radiologist) outcomes, not just accuracy.
• Simple application of existing methods is insufficient.
• First-principles research innovation specific to application domain intricacies was
required.
• Use an interdisciplinary team of doctors and scientists with expertise in image
processing, data mining, and biostatistics.
• Secure buy-in and leadership from key clinical domain experts.
Reference: Fung et al. [3]
Panelist: Osmar Zaïane, University of Alberta.
3.7. MineFleet®: A distributed vehicle performance
data stream mining system
Case-Study Members: Hillol Kargupta, Kakali Sarkar, Michael Gilligan, Parag
Namjoshi, Sai Subhash Paruchuru, Thiraphat Pongsudhiraks, and Robert Gilligan
with Agnik Inc. http://www.agnik.com/.
Task: Commercial fleet performance monitoring, such as of driver behavior, fuel
economy, and emissions.
Challenges:
• Data stream mining in embedded systems. Distributed data and high-cost data
centralization.
Solution: MineFleet® distributed data mining system:
• Reduce fuel cost using fuel consumption analytics.
• Advanced predictive vehicle health monitoring.
• Optimize driver behavior by quantifying the effect on vehicle performance.
• Advanced fleet analytics for comparing and contrasting vehicles.
Techniques: Statistical aggregates; Outlier detection; Principal component
analysis; Clustering; and Predictive modeling.
Reference: Kargupta et al. [5]
Panelist: Francesco Bonchi, Yahoo! Research.
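The techniques listed for this case study include statistical aggregates and outlier detection on data streams in embedded systems. The report does not describe MineFleet's algorithms, so the Python below is only a generic sketch of the idea, not the product's implementation: it keeps a constant-memory running mean and variance using Welford's online algorithm and flags readings that lie far from the mean of the readings seen so far. The class name, threshold, warm-up length, and fuel readings are all hypothetical.

```python
import math


class RunningStats:
    """Constant-memory mean and variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until two readings have been seen.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_outliers(readings, threshold=3.0, warmup=5):
    """Yield (index, value) for readings more than `threshold` standard
    deviations from the mean of the readings that preceded them."""
    stats = RunningStats()
    for i, x in enumerate(readings):
        far = abs(x - stats.mean) > threshold * stats.std
        if stats.n >= warmup and stats.std > 0 and far:
            yield i, x
        stats.update(x)  # flagged readings still update the statistics


# Hypothetical fuel-consumption readings (litres per 100 km) from one vehicle.
fuel = [8.1, 7.9, 8.0, 8.2, 8.0, 7.8, 8.1, 14.5, 8.0]
outliers = list(flag_outliers(fuel))
```

Because only three numbers are stored per monitored signal, this kind of aggregate can run on the vehicle itself and send summaries or alerts rather than raw data, which matches the case study's concern about the cost of centralizing data.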
3.8. Enhancing the chemical safety of agricultural products
Case-Study Members: Walter C. P. Magalhães Junior and Marilde Terezinha
Prado Santos for Ministry of Agriculture, Livestock and Supply (MAPA), under
the coordination of the Brazilian Agricultural Research Corporation.
Task: Improve data quality for the chemical safety of agricultural products, generated
from laboratory tests on samples of animal and plant matrices.
Challenges:
• Input data complexity.
• Integration with business logic/rules and business meaning.
Solution: Risk-on approach to risk assessment.
Techniques: Used fuzzy logic and ontologies during the data processing stage.
Measure: Greater accuracy, comprehensiveness, and reliability.
Outcome: Greater chemical safety of agricultural food.
Goal: To enable experts and authorities to better quantify and qualify the impact
of practices and government interventions for control along the production chain.
Lessons: Importance of explanations whose validity domain experts can
easily judge.
Reference: Magalhães et al. [9]
Panelist: Rong Duan, AT&T Research.
3.9. Chemical and biological entity extraction techniques
for scientific literature
Case-Study Members: Su Yan, Stephen Boyer, Ying Chen, Alfredo Alba, Thomas
D. Griffin, W. Scott Spangler, Ana Lelescu, and Jeffrey T. Kreulen.
Task: Extract and link mentions of chemical and biomedical named entities in
patents and medical journals.
Challenges: Heterogeneous data; poor data quality; complex, diverse, inconsistent,
and changing nomenclatures.
Techniques:
• Conditional Random Field (CRF) modeling.
• Custom filters that embed domain specific knowledge.
Outcome:
• Commercially successful with several pharmaceutical customers.
• Improved retrieval over text-based searching.
Lessons:
• Chemical name extraction from real-world data sources such as patents and
scientific literature is nontrivial.
• MapReduce on cloud implementation ensures the scalability and efficiency for
real-world use.
• Domain knowledge is critical to the solution's success.
Reference: Yan et al. [13]
Panelist: Paul Beinat, NeuronWorks International.
3.10. Social security data mining for public services
Case-Study Members: Longbing Cao, Hans M. Bohlscheid, Yanchang Zhao,
Huaifeng Zhang, Peter Newbigin, Brett Clark, Yuming Ou, Jinjiu Li, Yong Yang,
Chengqi Zhang, and Yanshan Xiao with the University of Technology Sydney and
the Australian Federal Government Department of Human Services.
Tasks: (A) Over-payment-centric analysis, (B) customer-centric analysis,
(C) policy-centric analysis, (D) process-centric analysis, and (E) fraud-centric
analysis.
Goals: Debt prevention; debt recovery; fraud detection; risk-rating with regard to
debt occurrence and incorrect payment, income declaration, customer-office
interaction analysis, and change detection.
Challenges:
• Complexities in the increasing size of data, heterogeneity of data types matching
to many relevant business lines such as taxation, immigration, banking and
superannuation, and the mixture of heterogeneous data with ever-increasing
online business transactions and documents.
• Change and dynamics of policies, business processes and workflows, as well as
client demographics and behaviors, and the mixture of diversified changes with
underlying business objects and targets.
• The need to involve and integrate factors and aspects from human, domain,
organizational, social, and governmental perspectives towards actionable
knowledge discovery and delivery connecting to the business operations and decision
systems.
Techniques: Behavior analysis and mining; combined pattern mining; change
detection; association rule; frequent pattern mining; decision tree; classification;
clustering; regression; anomaly detection; sequence analysis.
Lessons:
• Patterns, indicators and factors identified are used for advising/informing business
objectives such as debt prevention and recovery, and for improving payment
accuracy.
• Complex behavior analysis and complex data understanding can benefit from the
outputs of behavior mining and combined mining for any areas and regions with
similar business.
Reference: Zhao et al. [16]
Panelist: Geoff McLachlan, University of Queensland.
4. Topic Discussions
This section briefly describes three of the topics discussed during the panel.
Topic 1: Critical Role of Domain Knowledge
One of the patterns seen in the case studies was the critical role of domain adaptation.
The adaptation ranged from the integration of background information from domain
experts in "Discovery of Precursors to Aviation Safety Incidents" [2] and the definition of
features in "entity extraction from text" [13] to subjective issues such as the focus in
"Mining Medical Images" on interviews with the users of the data mining results [3].
Topic 2: Continually Improved Predictive Models Versus Discovered
Nuggets
Another topic addressed by the panel was the dominance of continually improving
supervised predictive models over solutions that strictly applied unsupervised
knowledge discovery [8]. Several panelists supported the point that solution success was defined
more by continuous improvement than by the discovery of a single knowledge
"nugget" that led to a major breakthrough. An example of the supporting role that
unsupervised techniques play is the use of frequent pattern mining to create predictor features.
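As a minimal sketch of that last point, the Python below mines frequent item pairs without using any labels (the unsupervised step) and then encodes each record as binary indicator features that a supervised model could consume. It is a deliberately simplified stand-in for a full frequent pattern miner: only pairs are considered, and the function names, support threshold, and example records are hypothetical rather than drawn from any of the case studies.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return sorted(pair for pair, c in counts.items() if c >= min_support)


def pattern_features(transactions, patterns):
    """One binary column per pattern: 1 if the transaction contains it."""
    return [[int(set(p) <= set(items)) for p in patterns] for items in transactions]


# Hypothetical event sets, one per customer record.
records = [
    {"late_filing", "address_change", "new_account"},
    {"late_filing", "address_change"},
    {"new_account"},
    {"late_filing", "address_change", "income_drop"},
]
patterns = frequent_pairs(records, min_support=3)
features = pattern_features(records, patterns)
```

The resulting feature matrix can be joined with the original attributes and a label column, so the discovered patterns contribute to a predictive model that is then refined continually, rather than being reported as stand-alone findings.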
Topic 3: Connection to other Impactful Application Areas
Since the first KDD workshop at IJCAI-89, data mining has been attracting
extensive interest from various application domains. The 10 case studies identified at
ICDM'10 are certainly among the most successful applications. A final topic
addressed by the panel was to compare data mining case studies with other
influential real-world application cases, such as PageRank's deployment and evolution
in Google search. Panelists suggested that the top-10 case studies are comparably
significant in that Google's search techniques have already moved well beyond the
simple application of PageRank.
5. Conclusion
In this report we summarized the panel at ICDM'10 on Top-10 Data Mining Case
Studies. The panel served to capture a snapshot of the topics to which data mining is
currently being successfully applied and the techniques that support this success. For
a field as young and challenging as data mining, we see ongoing value in identifying and
publicizing successful and influential applications.
Acknowledgments
Xindong Wu is supported by the US National Science Foundation (NSF) under
grant CCF-0905337.
References
1. G. J. Deng and Y. Zeng, R&D investment decision on emerging technology, International
Journal of Information Technology and Decision Making 10(3) (2010).
2. T. A. Ferryman, C. Posse, L. J. Rosenthal, A. N. Srivastava and I. C. Statler, What
happened, and why: Toward an understanding of human error based on automated
analyses of incident reports, Vol. II, NASA/TP-2006-213490 (2006).
3. G. Fung, M. Dundar, B. Krishnapuram and R. B. Rao, Multiple instance algorithms for
computer aided diagnosis, Neural Information Processing Systems (2006).
4. U. Kang, C. E. Tsourakakis and C. Faloutsos, PEGASUS: Mining peta-scale graphs,
Knowledge and Information Systems 27(2) (2011).
5. H. Kargupta, K. Sarkar and M. Gilligan, MineFleet®: An overview of a widely
adopted distributed vehicle performance data mining system, in Proc. KDD 2010
(2010).
6. B. Kitts, J. Zhang, G. Wu, W. Brandi, J. Beasley, K. Morrill, J. Ettedgui, S. Siddhartha,
H. Yuan, F. Gao and P. Azo, Click fraud detection: Adversarial attacker pattern
recognition over vast amounts of data over 5 years at Microsoft, Unpublished Manuscript
(2010).
7. Z. Lan, J. Gu, Z. Zheng, R. Thakur and S. Coghlan, A study of dynamic meta-learning for
failure prediction in large-scale systems, Journal of Parallel and Distributed Computing
70 (2010).
8. B. B. Little, W. L. Johnston, A. C. Lovell, R. M. Rejesus and S. A. Steed, Collusion in the
US crop insurance program: Applied data mining, in Proc. KDD-2002 (2002).
9. W. C. de Magalhães Junior, M. Bonnet, L. Diamantino Feijo and M. T. P. Santos,
Risk-off method: Improving data quality generated by chemical risk analysis of milk, in SMEs
and Open Innovation: Global Cases and Initiatives (IGI Global, 2010).
10. G. Melli, O. R. Zaïane and B. Kitts, Introduction to the special issue on successful
real-world data mining applications, ACM SIGKDD Explorations 8(1) (2006).
11. E. Štrumbelj, Z. Bosnić, I. Kononenko, B. Zakotnik and C. G. Kuhar, Explanation and
reliability of prediction models: The case of breast cancer recurrence, Knowledge and
Information Systems 24(2) (2010).
12. F. Wang, N. Shi and B. Chen, A comprehensive survey of the reviewer assignment
problem, International Journal of Information Technology and Decision Making 9(4)
(2010).
13. S. Yan, Y. Chen and S. Spangler, Cross media entity extraction and linkage for chemical
documents, in Proc. AAAI-2011 (2011).
14. K. Zhang and W. Fan, Forecasting skewed biased stochastic ozone days: Analyses,
solutions and beyond, Knowledge and Information Systems 14(3) (2008).
15. L. Zhao and J. Zhu, Internet marketing budget allocation: From practitioner's
perspective, International Journal of Information Technology and Decision Making 9(5)
(2010).
16. L. Cao, Social security and social welfare data mining: An overview, IEEE Transactions
on Systems, Man and Cybernetics, Part C: Applications and Reviews,
doi:10.1109/TSMCC.2011.2177258.