Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo
Profound Questions
Which basic properties make up the formula for a good wine?
Winemaking is believed to be an art, but is there a formula for a quality wine?
The provider of the data set published a paper, "Modeling wine preferences by data mining." How do my results compare with the paper's?
Procedure
Follow a data mining process
Use SAS and SAS Enterprise Miner to
execute the process
SAS Enterprise Miner is modeled on SEMMA, the data mining process defined by SAS Institute: Sample, Explore, Modify, Model, Assess
SEMMA is similar to the CRISP-DM process
Sample
1,599 records
Set up a data partition
Training 40%
Validation 30%
Test 30%
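The 40/30/30 split above is configured in the Enterprise Miner Data Partition node. Outside Enterprise Miner, a comparable split could be sketched in a plain DATA step as below. The seed, the output data set names (train, validate, test), and the use of simple random assignment are assumptions for illustration; the libref wino is assumed to be assigned already, and the resulting proportions are approximate rather than exact.

```sas
/* Sketch only: approximate 40/30/30 random partition of wino.red. */
/* Seed and output names are hypothetical, not from the project.   */
data train validate test;
   set wino.red;
   u = ranuni(12345);                 /* uniform random number in (0,1) */
   if u < 0.40 then output train;     /* about 40% of records */
   else if u < 0.70 then output validate; /* about 30% */
   else output test;                  /* remaining, about 30% */
   drop u;
run;
```

With 1,599 records, this yields roughly 640 training records, in line with the training counts reported later in the Event Classification Table.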
Explore: Target=Quality
Quality
People rated the quality of different wines on a scale of 0-10. The ratings in the data actually range from 3 to 8.
The target is ordinal
Explore: Inputs
Correlation Analysis
Some inputs are correlated, but not strongly enough to discard any
ods graphics on;
ods select MatrixPlot;
proc corr data=wino.red
   plots(maxpoints=100000)=matrix(histogram nvar=all);
run;
Modify
No modifications are made at this stage
Model: Selection
Because I want to list the important
elements in what is considered a
quality wine, I choose a Decision Tree
Configuration
The Splitting Rule is Entropy
Maximum Branch is set to 5
With an entropy criterion and multiway splits, this amounts to a C4.5-type algorithm
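As a small illustration of the entropy criterion, the DATA step below computes the entropy of a single node from its class counts. The counts used (87 good wines, 553 others) are taken from the training totals implied by the Event Classification Table for the binary isgood target (34 + 53 and 539 + 14); treating them as the root node of the tree is an assumption. A split is preferred when the weighted entropy of its branches falls well below this value.

```sas
/* Sketch only: entropy (in bits) of a node with two classes.       */
/* Counts come from the TRAIN event classification totals;          */
/* using them as the root node is an illustrative assumption.       */
data _null_;
   n_good = 87;                       /* isgood = 1 */
   n_bad  = 553;                      /* isgood = 0 */
   p = n_good / (n_good + n_bad);     /* share of good wines */
   entropy = -(p * log2(p) + (1 - p) * log2(1 - p));
   put p= 6.4 entropy= 6.4;           /* written to the SAS log */
run;
```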
Modify: Target
Change the target to a binary variable.
A new variable, isGood, is added to the model. Any rating over 6 (that is, 7 or higher) is categorized as isGood.
SAS Code:
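/* Derive the binary target from the ordinal quality rating */
data wino.xx;
   set wino.red;
   if (quality > 6) then isgood = 1;
   else isgood = 0;
run;

/* List the recoded data set to check the result */
proc print data=wino.xx;
   title 'xx';
run;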
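A cross-tabulation is a more compact check of the recode than printing every record. The sketch below assumes the data set wino.xx created by the code above; the choice of suppressed statistics is only for readability.

```sas
/* Sketch only: confirm that ratings 7 and above map to isgood = 1 */
proc freq data=wino.xx;
   tables quality*isgood / norow nocol nopercent;
run;
```

Each quality level should appear in exactly one isgood column, and the column totals show how unbalanced the new binary target is.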
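Leaf Statistics
[Variable importance table. The alignment between variable names and values was lost in extraction; column headings are inferred from the values, where the third column equals validation divided by training.]
Variables listed: density, volatile_acidity, sulphates, fixed_acidity, citric_acid, free_sulfur_dioxide, pH, chlorides, total_sulfur_dioxide, residual_sugar
Importance     Validation Importance   Ratio
1              1                       1
0.77055175     0.77055175              1
0.728868987    0.728868987             1
0.671675628    0.477710505             0.711222032
0.553719729    0.393817671             0.711222032
0.549750361    0.390994569             0.711222032
Remaining variables: importance 0, ratio NaN

Event Classification Table
Data Role=TRAIN Target=isgood
False Negative   True Negative   False Positive   True Positive
53               539             14               34
Data Role=VALIDATE Target=isgood
False Negative   True Negative   False Positive   True Positive
[values not preserved]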
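The training counts in the Event Classification Table (53 false negatives, 539 true negatives, 14 false positives, 34 true positives) can be turned into the usual summary rates. The DATA step below is a minimal sketch of that arithmetic; the variable names are illustrative and nothing here comes from Enterprise Miner output beyond the four counts.

```sas
/* Sketch only: summary rates from the TRAIN event classification counts */
data _null_;
   fn = 53; tn = 539; fp = 14; tp = 34;
   n = fn + tn + fp + tp;             /* total training records */
   accuracy    = (tp + tn) / n;       /* share classified correctly */
   sensitivity = tp / (tp + fn);      /* good wines found */
   specificity = tn / (tn + fp);      /* other wines correctly excluded */
   precision   = tp / (tp + fp);      /* predicted good that are good */
   put n= accuracy= 6.4 sensitivity= 6.4 specificity= 6.4 precision= 6.4;
run;
```

On these counts the accuracy is close to 0.90 while the sensitivity is only about 0.39, so the tree identifies fewer than half of the good wines in the training data.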
Modify: R² Filter
Variable               Role       Measurement Level   Reasons for Rejection
alcohol                INPUT      INTERVAL
chlorides              INPUT      INTERVAL
citric_acid            REJECTED   INTERVAL            Varsel:Small R-square value
density                INPUT      INTERVAL
fixed_acidity          INPUT      INTERVAL
free_sulfur_dioxide    INPUT      INTERVAL
pH                     REJECTED   INTERVAL            Varsel:Small R-square value
residual_sugar         REJECTED   INTERVAL            Varsel:Small R-square value
sulphates              INPUT      INTERVAL
total_sulfur_dioxide   REJECTED   INTERVAL            Varsel:Small R-square value
volatile_acidity       INPUT      INTERVAL
Model: NN
Specify 3 Hidden Units in the Hidden
Layer
Assess: NN Results
The results are hard to interpret as a recipe
The NEURAL Procedure
Optimization Results
Parameter Estimates
N   Parameter                  Estimate     Gradient Objective Function
1   alcohol_H11                3.679818     -0.001411
2   chlorides_H11              0.520190     -0.000479
3   density_H11                -2.171623    0.000883
4   fixed_acidity_H11          -0.055929    0.000179
5   free_sulfur_dioxide_H11    0.403412     0.000139
6   sulphates_H11              -4.954290    -0.000224
7   volatile_acidity_H11       2.686209     0.000205
8   alcohol_H12                -0.313005    0.001209
9   chlorides_H12              0.200973     0.000759
References
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Wine
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine5.pdf