Red Wine Mine

Knowledge Discovery in Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo
Profound Questions
What basic properties make up the formula for a good wine?
Wine making is believed to be an art. But is there a formula for a quality wine?
The providers of the data set published a paper, "Modeling wine preferences by data mining from physicochemical properties." How do my results compare with the paper's?
Procedure
Follow a data mining process
Use SAS and SAS Enterprise Miner to execute the process
The SAS Enterprise Miner tool is modeled on SEMMA, the data mining process defined by the SAS Institute: Sample, Explore, Modify, Model, Assess
SEMMA is similar to the CRISP-DM process
Sample
1,599 records
Set up a data partition
Training 40%
Validation 30%
Test 30%
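
The partition above is set up in Enterprise Miner, so the slides show no code for it. As a rough sketch only, a similar 40/30/30 random split could be written in a DATA step. It assumes the WINO libref used elsewhere in these slides is already assigned; the output table red_part, the helper variable u, the column role, and the seed are all made-up names and values, and the realized shares will only be approximately 40/30/30.

```sas
/* Sketch: random 40/30/30 split of the red wine table.            */
/* Assumes the WINO libref is already assigned. red_part, u, role  */
/* and the seed 637 are illustrative choices, not from the slides. */
data wino.red_part;
   set wino.red;
   length role $8;                          /* wide enough for 'VALIDATE' */
   if _n_ = 1 then call streaminit(637);    /* fixed seed for repeatability */
   u = rand('uniform');                     /* uniform draw on (0,1) per row */
   if u < 0.40 then role = 'TRAIN';
   else if u < 0.70 then role = 'VALIDATE';
   else role = 'TEST';
   drop u;
run;

/* Check how close the realized shares are to 40/30/30 */
proc freq data=wino.red_part;
   tables role;
run;
```

This is not necessarily how the Enterprise Miner Data Partition node draws its sample; it only illustrates the idea of assigning each record to one of three roles.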
Explore: Data Background

Data source
UCI Machine Learning Repository.
Wine Quality Data Set.
There are red and white wine data sets. I focused on the red wine set only.
There are 11 input variables and one target variable.
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
Output variable (based on sensory data): quality (score between 0 and 10)
Explore: Target=Quality
Quality
People rated the quality of different wines on a scale of 0-10. The actual range is 3-8.
An ordinal target
Explore: Inputs
Correlation Analysis
Some correlation, but not enough to discard inputs
/* Scatter-plot matrix with histograms for the target and selected inputs */
ods graphics on;
ods select MatrixPlot;
proc corr data=wino.red
          plots(maxpoints=100000)=matrix(histogram nvar=all);
   var quality alcohol pH fixed_acidity density
       volatile_acidity sulphates citric_acid;
run;
Explore: Correlation Graphs
Explore: Chi2 Statistics of Inputs
Explore: Worth of Inputs
Explore: Worth Graph

The worth tracks closely with the chi-square statistic
Modify
At this stage, no modifications are made
Model: Selection
Because I want to list the important elements of what is considered a quality wine, I chose a Decision Tree
Configuration
The Splitting Rule is Entropy
Maximum Branch is set to 5
Therefore a C4.5 type of algorithm is being implemented
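
For reference, a sketch of what an entropy splitting rule measures. The notation is introduced here and is not from the slides: S is the set of records at a node, p_k is the share of records in S with quality level k, and S_1, ..., S_m are the branches of a candidate split, with m at most 5 under the Maximum Branch setting.

```latex
H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k,
\qquad
\Delta H = H(S) - \sum_{j=1}^{m} \frac{|S_j|}{|S|}\, H(S_j),
\quad m \le 5
```

The split with the largest entropy reduction is preferred. C4.5 as published additionally divides this reduction by the split information (the gain ratio); the slides do not say whether that normalization applies here.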
Assess: Initial Results

A bushy tree. The resulting tree is too intricate for a simple recommendation.
Over 20 leaf nodes.
Modify: Target
Change the target so that it becomes binary.
A new variable called isGood is added to the model. Any rating over 6 is categorized as isGood.
SAS Code:
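/* Derive the binary target: 1 when quality is above 6, otherwise 0 */
data wino.xx;
   set wino.red;
   if (quality > 6) then isgood = 1;
   else isgood = 0;
run;

/* List the new data set to check the derived column */
proc print data=wino.xx;
   title 'xx';
run;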
Explore: Target = isGood
Model Strategy for isGood

Model with a Decision Tree in the hope of more descriptive results.
Also model with a Neural Network to aid in assessment and to provide a comparison.
Model: Decision Tree

ProbF splitting criterion at Significance Level 0.2
Maximum Branch size = 5
Assess: Decision Tree

Results
Much simpler Tree
Assess: Decision Tree Results 2
Leaf Statistics
Assess: Variable Importance

Variable Name          Importance    Validation Importance   Ratio of Validation to Training Importance
alcohol                1             1                       1
density                0.77055175    0.77055175              1
volatile_acidity       0.728868987   0.728868987             1
sulphates              0.671675628   0.477710505             0.711222032
fixed_acidity          0.553719729   0.393817671             0.711222032
citric_acid            0.549750361   0.390994569             0.711222032
free_sulfur_dioxide    0             0                       NaN
pH                     0             0                       NaN
chlorides              0             0                       NaN
total_sulfur_dioxide   0             0                       NaN
residual_sugar         0             0                       NaN

Event Classification Table
Data Role   Target   False Negative   True Negative   False Positive   True Positive
TRAIN       isgood   53               539             14               34
VALIDATE    isgood
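
As a worked illustration, the training counts in the event classification table (53 false negatives, 539 true negatives, 14 false positives, 34 true positives, 640 records in total) translate into the usual summary rates. These derived figures are computed here and do not appear in the slides.

```latex
\text{accuracy} = \frac{TP + TN}{N} = \frac{34 + 539}{640} = \frac{573}{640} \approx 0.895
\qquad
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{34}{34 + 53} = \frac{34}{87} \approx 0.391
\qquad
\text{precision} = \frac{TP}{TP + FP} = \frac{34}{34 + 14} = \frac{34}{48} \approx 0.708
```

On the training data, then, the tree is right about most wines overall but finds well under half of the wines rated isGood.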
Model: Neural Network

Positive: better at predicting
Negative: the model is hard to interpret
Configured with 3 Hidden Nodes
Modify: Input Variables to NN
Because of the complexity of the NN, it is recommended to prune variables prior to running the network.
Modify: R2 Filter
Variable Name          Role       Measurement Level   Reasons for Rejection
alcohol                INPUT      INTERVAL
chlorides              INPUT      INTERVAL
citric_acid            REJECTED   INTERVAL            Varsel:Small R-square value
density                INPUT      INTERVAL
fixed_acidity          INPUT      INTERVAL
free_sulfur_dioxide    INPUT      INTERVAL
pH                     REJECTED   INTERVAL            Varsel:Small R-square value
residual_sugar         REJECTED   INTERVAL            Varsel:Small R-square value
sulphates              INPUT      INTERVAL
total_sulfur_dioxide   REJECTED   INTERVAL            Varsel:Small R-square value
volatile_acidity       INPUT      INTERVAL
Model: NN
Specify 3 Hidden Units in the Hidden Layer
Assess: NN Results
Hard to interpret results to formulate a recipe
The NEURAL Procedure
Optimization Results
Parameter Estimates
N   Parameter                 Estimate     Gradient Objective Function
1   alcohol_H11                3.679818    -0.001411
2   chlorides_H11              0.520190    -0.000479
3   density_H11               -2.171623     0.000883
4   fixed_acidity_H11         -0.055929     0.000179
5   free_sulfur_dioxide_H11    0.403412     0.000139
6   sulphates_H11             -4.954290    -0.000224
7   volatile_acidity_H11       2.686209     0.000205
8   alcohol_H12               -0.313005     0.001209
9   chlorides_H12              0.200973     0.000759
Assess: Comparative Results
Receiver Operating Characteristics (ROC) Chart for NN vs Decision Tree
Assess: Comparative Results

Cumulative Lift for NN vs Decision Tree
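
For readers unfamiliar with the chart, cumulative lift at depth d can be sketched as follows. The wording inside the fraction is introduced here for illustration; cases are assumed to be sorted by each model's predicted probability of isGood = 1.

```latex
\text{Cumulative lift}(d) =
\frac{\text{share of cases with isGood} = 1 \text{ among the top } d\% \text{ of scored cases}}
     {\text{share of cases with isGood} = 1 \text{ among all cases}}
```

A value of 1 corresponds to a random ordering, so the model whose curve sits higher at a given depth concentrates more of the good wines among its top-scored cases.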
Assess: Comparison with Reference Paper
Used R-Miner
Support Vector Machine (SVM) and Neural Network models were used
The authors applied techniques to extract the relative importance of variables
They attempted to predict every quality level
They noted the importance of alcohol and sulphates. An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.
Assess: Paper Variable Importance
Overall Project in SAS EM
References
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Wine
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine5.pdf
