
Knowledge Discovery in Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo

Profound Questions
What basic properties make up the formula for a good wine?
Wine making is believed to be an art, but is there a formula for a quality wine?
The provider of the data set published a paper, "Modeling wine preferences by data mining from physicochemical properties." How do my results compare with the paper's?

Procedure
Follow a data mining process.
Use SAS and SAS Enterprise Miner to execute the process.
SAS Enterprise Miner is modeled on the data mining process defined by SAS Institute, SEMMA: Sample, Explore, Modify, Model, Assess.
SEMMA is similar to the CRISP-DM process.

Sample
1,599 records
Set up a data partition
Training 40%
Validation 30%
Test 30%
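The partition above can be sketched outside SAS Enterprise Miner as follows. This is a minimal Python illustration, not the Data Partition node itself: it assumes a simple random split (the node may stratify on the target), and the function name `partition_indices` and the seed are hypothetical.

```python
import random

def partition_indices(n_records, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Shuffle row indices and cut them into train/validation/test lists.

    Assumption: simple random sampling; the seed is arbitrary.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = round(n_records * fractions[0])
    n_valid = round(n_records * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

train, valid, test = partition_indices(1599)
print(len(train), len(valid), len(test))  # 640 480 479
```

With 1,599 records the 40/30/30 split cannot be exact, so the remainder after rounding is assigned to the test set.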

Explore: Data Background
Data source: UCI Machine Learning Repository, Wine Quality Data Set.
There are two data sets, one for red wine and one for white. I focused on the red wine set only.
There are 11 input variables and one target variable:
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
Output variable (based on sensory data): quality (score between 0 and 10)

Explore: Target=Quality
Quality: people rated the quality of different wines on a scale of 0-10. The actual range in the data is 3-8.
This is an ordinal target.

Explore: Inputs
Correlation Analysis
Some correlation, but not enough to discard inputs.
ods graphics on;
ods select MatrixPlot;
/* Scatter-plot matrix, with histograms on the diagonal, for the target and selected inputs */
proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
  var quality alcohol ph fixed_acidity density
      volatile_acidity sulphates citric_acid;
run;
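For readers without SAS, the coefficient that PROC CORR reports for each pair of variables is the Pearson correlation. The sketch below computes it in plain Python; the helper name `pearson_r` and the five sample rows are hypothetical and are not taken from the wine data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, for illustration only.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
r = pearson_r(alcohol, quality)
print(f"r = {r:.3f}")
```

Inputs that correlate very strongly with each other would be candidates for removal; the slide's conclusion is that none of the observed correlations were strong enough for that.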

Explore: Correlation Graphs

Explore: Chi-Square Statistics of Inputs

Explore: Worth of Inputs

Explore: Worth Graph
The worth tracks closely with the chi-square statistic.

Modify
At this stage, no modifications are made.

Model: Selection
Because I want to list the important elements of what is considered a quality wine, I choose a Decision Tree.
Configuration:
The splitting rule is Entropy.
Maximum Branch is set to 5.
Therefore a C4.5-type algorithm is being implemented.
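An entropy splitting rule scores a candidate split by how much it reduces the entropy of the target. The following Python sketch shows that calculation for a multiway split; it is an illustration of the general idea, not the Enterprise Miner implementation, and the function names and sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of its branches."""
    total = len(parent)
    weighted = sum(len(b) / total * entropy(b) for b in branches if b)
    return entropy(parent) - weighted

# Hypothetical quality scores at a node and a three-way split of them.
node = [5, 5, 5, 6, 6, 6, 7, 7]
split = [[5, 5, 5], [6, 6, 6], [7, 7]]
print(f"gain = {information_gain(node, split):.3f}")
```

C4.5 itself normalizes this gain by the split's own entropy (the gain ratio), which penalizes splits with many branches; that refinement is omitted here.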

Assess: Initial Results
A bushy tree. The resulting tree is too intricate for a simple recommendation.
Over 20 leaf nodes.

Modify: Target
Change the target so that it becomes binary.
A new variable called isGood is added to the model. Any rating over 6 is categorized as isGood.
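SAS Code:
/* Derive the binary target: 1 when quality is above 6, otherwise 0 */
data wino.xx;
  set wino.red;
  if quality > 6 then isgood = 1;
  else isgood = 0;
run;

/* Print the derived data set to check the new column */
proc print data=wino.xx;
  title 'xx';
run;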

Explore: Target = isGood

Model Strategy for isGood
Model with a Decision Tree in the hope of more descriptive results.
Also model with a Neural Network to aid in assessment and for comparison.

Model: Decision Tree
ProbF splitting criterion at significance level 0.2
Maximum Branch size = 5

Assess: Decision Tree
Results: a much simpler tree

Assess: Decision Tree Results 2

Leaf Statistics

Assess: Variable Importance

Variable Name          Importance    Validation Importance   Ratio of Validation to Training Importance
alcohol                1             1                       1
density                0.77055175    0.77055175              1
volatile_acidity       0.728868987   0.728868987             1
sulphates              0.671675628   0.477710505             0.711222032
fixed_acidity          0.553719729   0.393817671             0.711222032
citric_acid            0.549750361   0.390994569             0.711222032
free_sulfur_dioxide    0             0                       NaN
pH                     0             0                       NaN
chlorides              0             0                       NaN
total_sulfur_dioxide   0             0                       NaN
residual_sugar         0             0                       NaN

Event Classification Table

Data Role   Target   False Negative   True Negative   False Positive   True Positive
TRAIN       isgood   53               539             14               34
VALIDATE    isgood
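The training counts in the event classification table (53 false negatives, 539 true negatives, 14 false positives, 34 true positives) can be turned into the usual summary rates. The sketch below is a plain Python calculation; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(fn, tn, fp, tp):
    """Summary rates derived from the four cells of a binary confusion matrix."""
    total = fn + tn + fp + tp
    return {
        "accuracy": (tp + tn) / total,      # share of all wines classified correctly
        "sensitivity": tp / (tp + fn),      # share of good wines that were found
        "specificity": tn / (tn + fp),      # share of other wines correctly rejected
        "precision": tp / (tp + fp),        # share of predicted-good wines that are good
    }

# Counts from the TRAIN row of the event classification table.
train_metrics = classification_metrics(fn=53, tn=539, fp=14, tp=34)
for name, value in train_metrics.items():
    print(f"{name}: {value:.3f}")
```

On the training partition this gives an accuracy of about 0.895 but a sensitivity of only about 0.391, so the tree is right most of the time largely because most wines are not rated above 6.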

Model: Neural Network
Positive: better at predicting.
Negative: the model is hard to interpret.
Configured with 3 hidden nodes.

Modify: Input Variables to NN
Because of the complexity of the NN, it is recommended to prune variables prior to running the network.

Modify: R² Filter

Variable Name          Role       Measurement Level   Reasons for Rejection
alcohol                INPUT      INTERVAL
chlorides              INPUT      INTERVAL
citric_acid            REJECTED   INTERVAL            Varsel:Small R-square value
density                INPUT      INTERVAL
fixed_acidity          INPUT      INTERVAL
free_sulfur_dioxide    INPUT      INTERVAL
pH                     REJECTED   INTERVAL            Varsel:Small R-square value
residual_sugar         REJECTED   INTERVAL            Varsel:Small R-square value
sulphates              INPUT      INTERVAL
total_sulfur_dioxide   REJECTED   INTERVAL            Varsel:Small R-square value
volatile_acidity       INPUT      INTERVAL
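The idea behind rejecting inputs for a "small R-square value" can be sketched as a simple screen: compute the squared correlation between each input and the target, and reject inputs that fall below a cutoff. The Python below is a simplified illustration only. The cutoff of 0.005 is an assumption, the names `r_squared` and `r_square_filter` are hypothetical, the sample values are invented, and the actual Variable Selection node does more than this single screening step.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between an input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy * sxy / (sxx * syy)

def r_square_filter(inputs, target, min_r_square=0.005):
    """Assign INPUT or REJECTED to each variable (cutoff is an assumed value)."""
    return {
        name: "INPUT" if r_squared(values, target) >= min_r_square else "REJECTED"
        for name, values in inputs.items()
    }

# Hypothetical toy data, not taken from the wine data set.
isgood = [0, 0, 1, 1, 0, 1]
candidates = {
    "informative_input": [9.5, 9.8, 12.1, 11.9, 10.0, 12.5],
    "constant_input": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
}
print(r_square_filter(candidates, isgood))
```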

Model: NN
Specify 3 hidden units in the hidden layer.

Assess: NN Results
Hard to interpret the results in order to formulate a recipe.

The NEURAL Procedure
Optimization Results
Parameter Estimates

N   Parameter                 Estimate    Gradient Objective Function
1   alcohol_H11               3.679818    -0.001411
2   chlorides_H11             0.520190    -0.000479
3   density_H11               -2.171623   0.000883
4   fixed_acidity_H11         -0.055929   0.000179
5   free_sulfur_dioxide_H11   0.403412    0.000139
6   sulphates_H11             -4.954290   -0.000224
7   volatile_acidity_H11      2.686209    0.000205
8   alcohol_H12               -0.313005   0.001209
9   chlorides_H12             0.200973    0.000759
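Each estimate in the parameter table is a weight on the connection from one input to one hidden unit, so alcohol_H11 is the weight from alcohol to hidden unit H11. The Python sketch below shows how such weights combine into a single hidden-unit activation, which is part of why a recipe is hard to read off the model. Several things here are assumptions rather than facts from the output: the tanh activation, standardized inputs, and a bias of 0 (the bias estimates are not in the listing shown). The function name `hidden_unit_activation` is hypothetical.

```python
import math

# H11 weights copied from the parameter estimates listing.
H11_WEIGHTS = {
    "alcohol": 3.679818,
    "chlorides": 0.520190,
    "density": -2.171623,
    "fixed_acidity": -0.055929,
    "free_sulfur_dioxide": 0.403412,
    "sulphates": -4.954290,
    "volatile_acidity": 2.686209,
}

def hidden_unit_activation(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through tanh.

    Assumptions: tanh activation, standardized inputs, bias of 0.
    Inputs missing from the dict are treated as 0 (the mean, if standardized).
    """
    pre_activation = bias + sum(
        weight * inputs.get(name, 0.0) for name, weight in weights.items()
    )
    return math.tanh(pre_activation)

# Hypothetical wine: alcohol one standard deviation above the mean,
# sulphates half a standard deviation above, everything else average.
wine = {"alcohol": 1.0, "sulphates": 0.5}
print(f"H11 activation = {hidden_unit_activation(wine, H11_WEIGHTS):.3f}")
```

The final prediction mixes three such units through a further set of weights, so no single input's effect can be read directly from one estimate.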

Assess: Comparative Results

Receiver Operating Characteristic (ROC) chart for NN vs Decision Tree

Assess: Comparative Results
Cumulative lift for NN vs Decision Tree
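Cumulative lift at a given depth is the rate of good wines among the top-scored fraction of cases divided by the overall rate of good wines. The Python sketch below computes one point of such a curve; the function name `cumulative_lift` and the scores and labels are hypothetical and do not come from either model.

```python
def cumulative_lift(scores, labels, depth):
    """Lift of the top `depth` fraction of cases, ranked by predicted score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * depth))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical predicted probabilities of isGood and the actual outcomes.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for depth in (0.2, 0.5, 1.0):
    print(f"depth {depth:.0%}: lift = {cumulative_lift(scores, labels, depth):.2f}")
```

A model with higher lift at small depths concentrates more of the good wines among its most confident predictions; at a depth of 100% the lift is always 1.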

Assess: Comparison with Reference Paper
The paper used RMiner.
Support Vector Machine (SVM) and Neural Network models were used.
The authors applied techniques to extract the relative importance of variables.
They attempted to predict every quality level.
They noted the importance of alcohol and sulphates. An increase in sulphates might be related to the fermenting nutrition, which is very important for improving the wine aroma.

Assess: Paper Variable Importance

Overall Project in SAS EM

References
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Wine
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine5.pdf
