Lapsed donors
Amount of Information
Data Preparation
T-Code
Code   Title
16     DEAN
17     JUDGE
1001   MESSRS.
17002  JUDGE & MRS.
1002   MR. & MRS.
2      MRS.
18     MAJOR
18002  MAJOR & MRS.
2002   MESDAMES
3      MISS
19     SENATOR
109    LIC.
50     ELDER
111    SA.
56     MAYOR
114    DA.
59002  LIEUTENANT & MRS.
62     LORD
63     CARDINAL
116    SR.
117    SRA.
118    SRTA.
64     FRIEND
120    YOUR MAJESTY
3003   MISSES
21002  SERGEANT & MRS.
65     FRIENDS
122    HIS HIGHNESS
4      DR.
4002   DR. & MRS.
22002  COLNEL & MRS.
24     LIEUTENANT
68     ARCHDEACON
69     CANON
123    HER HIGHNESS
124    COUNT
70     BISHOP
125    LADY
4004   DOCTORS
5      MADAME
6      SERGEANT
9      RABBI
10     PROFESSOR
10002  PROFESSOR & MRS.
10010  PROFESSORS
11     ADMIRAL
11002  ADMIRAL & MRS.
12     GENERAL
20     GOVERNOR
48     CORPORAL
26     MONSIGNOR
27     REVEREND
28     MS.
28028  MSS.
127    PRINCESS
128    CHIEF
29     BISHOP
85     SPECIALIST
129    BARON
31     AMBASSADOR
87     PRIVATE
130    SHEIK
89     SEAMAN
90     AIRMAN
91     JUSTICE
135    M. ET MME.
92     MR. JUSTICE
210    PROF.
12002  GENERAL & MRS.
38     COMMODORE
100    M.
13     COLONEL
13002  COLONEL & MRS.
40     FATHER
42     SISTER
103    MLLE.
104    CHANCELLOR
43     PRESIDENT
44     MASTER
107    SECRETARY
15     COMMANDER
15002  COMMANDER & MRS.
46     MOTHER
47     CHAPLAIN
108    LT. GOVERNOR
14     CAPTAIN
126    PRINCE
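A coded field like T-Code is usually decoded with a lookup table before modeling. The sketch below is only an illustration: it copies a handful of codes from the table above, and the names `TCODE_TITLES`, `decode_tcode` and `is_couple` are hypothetical, as is the idea of deriving a "couple" flag from the title.

```python
# A few entries copied from the T-Code table above; the full table is longer.
TCODE_TITLES = {
    2: "MRS.",
    3: "MISS",
    4: "DR.",
    13: "COLONEL",
    28: "MS.",
    1002: "MR. & MRS.",
    4002: "DR. & MRS.",
    13002: "COLONEL & MRS.",
}


def decode_tcode(code, default="UNKNOWN"):
    """Return the title for a T-Code, or `default` when the code is not listed."""
    return TCODE_TITLES.get(int(code), default)


def is_couple(code):
    """Hypothetical derived feature: True when the title addresses a couple."""
    return decode_tcode(code).endswith("& MRS.")


if __name__ == "__main__":
    for raw in ["4002", "28", "999"]:
        print(raw, "->", decode_tcode(raw), "| couple:", is_couple(raw))
```

Codes missing from the lookup fall back to a default rather than raising an error, which makes unexpected values easy to count during data preparation.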
Data Mining Is
the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. --- Fayyad
finding interesting structure (patterns, statistical models, relationships) in data bases. --- Fayyad, Chaudhuri and Bradley
a knowledge discovery process of extracting previously unknown, actionable information from very large data bases. --- Zornes
a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. --- Edelstein
Case Study I
Ingot Cracking
Water composition
Metal composition
Process variables
Other environmental variables
Case Study II
Car Insurance
Mammograms used as screening instrument
Can we do better?
Queries Describe
[Figure: chart by class, with categories Crew, First, Second, Third]
Mosaic Plot
Models
Tree Diagram
[Figure: classification tree with splits labelled F, Adult / Child, and class (1, 2, 3, Crew); node percentages 46%, 93%, 14%, 27%, 23%, 100%, 33%]
Why Models?
What's interesting?
Beer and Diapers
Toy Problem
[Figure: scatterplots of train2$y (roughly 5 to 25) against each predictor train2[, i] (0.0 to 1.0)]
Familiar Models
Linear Regression
Logistic Regression
Linear Regression
Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept     0.806      0.427    1.890     0.059
x1            7.269      0.273   26.590    <.0001
x2            7.289      0.281   25.940    <.0001
x3           -0.719      0.287   -2.500     0.012
x4            9.769      0.273   35.810    <.0001
x5            4.834      0.275   17.590    <.0001
x6           -0.456      0.280   -1.630     0.104
x7            0.123      0.270    0.460     0.647
x8           -0.349      0.276   -1.270     0.206
x9           -0.578      0.285   -2.030     0.043
x10           0.080      0.280    0.280     0.777
73.3% Test
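Each t ratio in the table is the estimate divided by its standard error. The sketch below recomputes the ratios from the rounded table values, so they can differ from the printed ones in the last digit or two. The |t| > 2 cutoff is a common rule of thumb assumed here for illustration, not something stated on the slide.

```python
# (term, estimate, standard error) copied from the regression table above.
COEFFICIENTS = [
    ("Intercept", 0.806, 0.427),
    ("x1", 7.269, 0.273),
    ("x2", 7.289, 0.281),
    ("x3", -0.719, 0.287),
    ("x4", 9.769, 0.273),
    ("x5", 4.834, 0.275),
    ("x6", -0.456, 0.280),
    ("x7", 0.123, 0.270),
    ("x8", -0.349, 0.276),
    ("x9", -0.578, 0.285),
    ("x10", 0.080, 0.280),
]


def t_ratio(estimate, std_error):
    """t ratio = estimate / standard error."""
    return estimate / std_error


def screen_terms(rows, threshold=2.0):
    """Names of terms whose |t| exceeds `threshold` (an assumed rule of thumb)."""
    return [name for name, est, se in rows if abs(t_ratio(est, se)) > threshold]


if __name__ == "__main__":
    for name, est, se in COEFFICIENTS:
        print(f"{name:10s} t = {t_ratio(est, se):7.2f}")
    print("Terms with |t| > 2:", screen_terms(COEFFICIENTS))
```

With these values the screen keeps x1, x2, x3, x4, x5 and x9 and drops the intercept, whose ratio is just under 2.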
Stepwise Regression
Terms: Intercept, x1, x2, x3, x4, x5, x9
73.4% Test
88.5% Test
Next Steps
When to stop?
Transformations?
Tree Model
[Figure: regression tree with splits on x1, x2, x3, x4, x5 and x8; leaf values from 4.400 to 25.320]
67.2% Test
Feature Creation
Often linear:
$z_i = b_0 + b_1 x_1 + \dots + b_p x_p$
Principal components
Factor analysis
Multidimensional scaling
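As one illustration of a linear feature, the first principal component is the weighted sum of the centered predictors with the largest variance. The sketch below finds its weights by power iteration on the covariance matrix, using only the standard library. The tiny data set and all function names are made up for the example.

```python
import math


def center(rows):
    """Subtract each column's mean; returns (centered rows, column means)."""
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows], means


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Weights b of the first principal component, found by power iteration."""
    p = len(cov)
    b = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * b[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        b = [v / norm for v in nxt]
    return b


def scores(centered, b):
    """New feature z = b1*x1 + ... + bp*xp for each centered row."""
    return [sum(bj * xj for bj, xj in zip(b, row)) for row in centered]


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]  # made-up rows
    centered, _ = center(data)
    weights = first_component(covariance(centered))
    print("weights:", weights)
    print("scores:", scores(centered, weights))
```

Because the second column here is exactly twice the first, a single feature captures all the variation, and the weights come out proportional to (1, 2).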
Neural Nets
A Single Neuron
[Figure: a single neuron with constant input x0 (weight 0.8) and inputs x1 to x5 (weights 0.3, 0.7, -0.2, 0.4, -0.5); input z1, output s(z1)]
$z_1 = 0.8 + 0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 - 0.5x_5$
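The neuron's computation can be sketched in a few lines: a weighted sum of the inputs followed by a squashing function s. The weights are the ones on the slide. Taking s to be the logistic sigmoid is an assumption, since the slide does not say which function it uses.

```python
import math

BIAS = 0.8                              # weight on the constant input x0
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]   # weights on x1..x5


def net_input(x):
    """z1 = 0.8 + 0.3*x1 + 0.7*x2 - 0.2*x3 + 0.4*x4 - 0.5*x5."""
    return BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))


def sigmoid(z):
    """Logistic function, assumed here for the activation s."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x):
    """Output s(z1) for one input vector x = [x1, ..., x5]."""
    return sigmoid(net_input(x))


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0, 1.0, 1.0]
    print("z1 =", net_input(x), " output =", neuron(x))
```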
[Figure: network diagram with input layer, hidden layer and output layer; one hidden unit labelled z3]
90.6% Test
Flexibility to fit anything
Including noise
Interpretation?
Spatial Analysis
MARS
Multivariate Adaptive Regression Splines
What do they do?
[Figure: plots against x, with axes running from about -0.2 to 1.2]
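MARS builds its fit from hinge functions of the form max(0, x - t) and max(0, t - x), each zero on one side of a knot t and linear on the other. The sketch below evaluates such a sum for one predictor. The knot and coefficients are invented for illustration and are not taken from the slides.

```python
def hinge(x, knot):
    """max(0, x - knot): zero to the left of the knot, linear to the right."""
    return max(0.0, x - knot)


def mirrored_hinge(x, knot):
    """max(0, knot - x): linear to the left of the knot, zero to the right."""
    return max(0.0, knot - x)


def mars_like(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge.

    `terms` is a list of (coefficient, knot, direction) with direction +1 for
    max(0, x - knot) and -1 for max(0, knot - x).
    """
    total = intercept
    for coefficient, knot, direction in terms:
        basis = hinge(x, knot) if direction > 0 else mirrored_hinge(x, knot)
        total += coefficient * basis
    return total


if __name__ == "__main__":
    # Hypothetical model: slope 1 below the knot at 0.4, slope 2 above it.
    terms = [(2.0, 0.4, +1), (-1.0, 0.4, -1)]
    for x in (0.1, 0.4, 0.9):
        print(x, "->", mars_like(x, 1.0, terms))
```

The fitted curve is piecewise linear, with the slope changing only at the knots.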
94.3% Test
(95.8%)
Collaborative Filtering
Andre, Starwars
Andre, Starwars, Coeur en Hiver
Starwars, Batman
Starwars, Batman, Rambo
Coeur en Hiver, Chocolat
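One simple way to turn lists like these into recommendations is to score unseen films by how similar each list is to what a person already likes. The sketch below treats the five lists above as viewing histories (an assumption) and uses Jaccard similarity between sets. It shows one possible scheme, not necessarily the method behind the slide.

```python
from collections import Counter

# The five film lists from the slide, treated as viewing histories.
HISTORIES = [
    {"Andre", "Starwars"},
    {"Andre", "Starwars", "Coeur en Hiver"},
    {"Starwars", "Batman"},
    {"Starwars", "Batman", "Rambo"},
    {"Coeur en Hiver", "Chocolat"},
]


def jaccard(a, b):
    """Shared films divided by all films appearing in either set."""
    return len(a & b) / len(a | b)


def recommend(liked, histories, top=3):
    """Rank unseen films by the summed similarity of the histories containing them."""
    scores = Counter()
    for history in histories:
        similarity = jaccard(liked, history)
        if similarity == 0:
            continue
        for film in history - liked:
            scores[film] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [film for film, _ in ranked[:top]]


if __name__ == "__main__":
    print(recommend({"Starwars", "Batman"}, HISTORIES))
    print(recommend({"Chocolat"}, HISTORIES))
```

Someone who liked Starwars and Batman is pointed to Rambo first, because the history containing Rambo overlaps most with their own.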
Data Base
[Table: people (Lyle, Ellen, Fred, Dean, Jason, Karen) by films (Starwars, Batman, Rambo, Coeur, Chocolat); a y marks each person-film pair]
Where to Start?
EDM
Use a tree to find a smaller subset of variables to investigate
Explore this set graphically
Build model
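The first step can be sketched without a full tree: rank each variable by the error reduction of its best single split, which is the criterion a regression tree applies at its root. A real tree would then recurse within each branch. The data and names below are invented for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_gain(xs, ys):
    """Largest drop in SSE achievable with one split of the form x < c."""
    total = sse(ys)
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        best = max(best, total - sse(left) - sse(right))
    return best


def rank_variables(columns, ys):
    """Variable names sorted from most to least useful single split."""
    gains = {name: best_split_gain(xs, ys) for name, xs in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


if __name__ == "__main__":
    n = 20
    columns = {
        "x1": [i / 19 for i in range(n)],
        "x2": [(7 * i % 20) / 19 for i in range(n)],
        "x3": [(3 * i % 20) / 19 for i in range(n)],
    }
    # Made-up response: a jump driven by x1 plus a small x2 effect.
    ys = [(10.0 if a > 0.5 else 0.0) + 0.5 * b
          for a, b in zip(columns["x1"], columns["x2"])]
    print(rank_variables(columns, ys))
```

The top few variables from such a ranking are the ones to explore graphically before building a model.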
a Tree:
[Figure: regression tree with splits on x1, x2, x4 and x5; leaf values from -2.560 to 6.050]
Automatic Models
KXEN
Bagging (Bootstrap Aggregation)
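Bagging fits the same kind of model to many bootstrap resamples of the training data and averages their predictions. The sketch below uses a one-split regression tree (a stump) on a single predictor as the base model. The data, the choice of stump, and the number of models are illustrative assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns a function x -> prediction."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = None  # (error, threshold, left mean, right mean)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, (pairs[k - 1][0] + pairs[k][0]) / 2, lm, rm)
    if best is None:  # every x identical: fall back to the mean
        return lambda x: overall
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def bagged_predictor(xs, ys, n_models=25, seed=0):
    """Average the predictions of stumps fit to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


if __name__ == "__main__":
    xs = [i / 19 for i in range(20)]
    ys = [0.0 if x < 0.5 else 10.0 for x in xs]  # made-up step function
    predict = bagged_predictor(xs, ys)
    for x in (0.1, 0.5, 0.9):
        print(x, "->", predict(x))
```

Averaging smooths the prediction near the split, where the individual stumps disagree about the threshold.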
                     Tree    Boosted Trees  Random Forest  Neural Network  Radiologists
False Positive Rate  32.20%  24.90%         19.30%         25.50%          22.40%
False Negative Rate  33.70%  32.50%         28.80%         31.70%          35.80%
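For reference, the two rates in the table are computed from counts of misclassified cases: the false positive rate is the share of actual negatives flagged as positive, and the false negative rate is the share of actual positives that are missed. The sketch below uses invented labels and is not the study data.

```python
def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate).

    `actual` and `predicted` are equal-length sequences of booleans,
    with True meaning a positive case.
    """
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    return false_pos / negatives, false_neg / positives


if __name__ == "__main__":
    # Hypothetical labels: 4 positives and 6 negatives.
    actual = [True, True, True, True, False, False, False, False, False, False]
    predicted = [True, True, True, False, True, True, False, False, False, False]
    fpr, fnr = error_rates(actual, predicted)
    print(f"false positive rate = {fpr:.1%}, false negative rate = {fnr:.1%}")
```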
Ingot cracking
Model building
Train
Test
Evaluate
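The train / test / evaluate cycle can be sketched as follows: hold out part of the data, fit on the rest, and score the fit only on the held-out part. The straight-line model, the R-squared score, and the noise-free data are assumptions chosen to keep the example short. The slides do not say which measure their "Test" percentages report.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Least-squares intercept and slope for (x, y) pairs."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(test, intercept, slope):
    """Share of held-out variation explained by the fitted line."""
    my = sum(y for _, y in test) / len(test)
    ss_tot = sum((y - my) ** 2 for _, y in test)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in test)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    rows = [(i / 19, 1 + 2 * i / 19) for i in range(20)]  # made-up, noise-free
    train, test = train_test_split(rows)
    intercept, slope = fit_line(train)
    print("intercept:", intercept, "slope:", slope)
    print("test R^2:", r_squared(test, intercept, slope))
```

On real data the test score is typically lower than the training score, and that gap is what the held-out set is meant to reveal.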
Alloy
Yttrium
We know that
Selenium!
OH!
65 potential predictors
Fast Fail
PVA Recap
Neural network?
Tree?
Student #1 $15,024
Student #2 $14,695
Student #3 $14,345
Research is exciting
Thank you!
deveaux@williams.edu