towardsdatascience.com/opening-black-boxes-how-to-leverage-explainable-machine-learning-dd4ab439998e
In this article, I will demonstrate some methods for creating explainable predictions and
guide you into opening these black-box models.
The data used in this article is the US Adult Income data set, which is typically used to predict whether somebody makes less than 50K or more than 50K, a simple binary classification task. You can get the data here or you can follow along with the notebook here.
The data is relatively straightforward, with information on an individual's Relationship, Occupation, Race, Gender, etc.
Modeling
The categorical variables are one-hot encoded and the target is set to either 0 (≤50K) or 1 (>50K). Now, let's say that we would like to use a model that is known for its great performance on classification tasks but is highly complex, with output that is difficult to interpret. This model is LightGBM, which, together with CatBoost and XGBoost, is often used in both classification and regression tasks.
I should note that, preferably, you would use a train/test split with an additional holdout set in order to prevent overfitting.
Next, I quickly check the performance of the model using 10-fold cross-validation:
Now that we have created the model, next would be explaining what it exactly has done.
Since LightGBM works with highly efficient gradient boosting decision trees,
interpretation of the output can be difficult.
Assumptions
This importance calculation rests on an important assumption, namely that the feature of interest is not correlated with the other features (except for the target). If it is, the plot will show data points that are in reality impossible. For example, weight and height are correlated, yet the PDP might show the effect of a large weight combined with a very small height on the target, even though that combination is highly unlikely. This can be partially mitigated by showing a rug of observed data points at the bottom of your PDP.
Correlation
Thus, we check the correlation between features in order to make sure that there are no
problems there:
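A quick way to run this check is to compute the pairwise correlation matrix and flag any strong off-diagonal entries. A minimal sketch, using a small synthetic frame in place of the article's `df` (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for the Adult Income features
rng = np.random.default_rng(0)
df = pd.DataFrame({'Age': rng.integers(18, 70, 500),
                   'Hours/Week': rng.integers(10, 80, 500),
                   'Education-Num': rng.integers(1, 16, 500)})

corr = df.corr()
# Flag off-diagonal correlations above 0.8 in absolute value
strong = (corr.abs() > 0.8) & ~np.eye(len(corr), dtype=bool)
print(corr.round(2))
print("Strongly correlated pairs present:", bool(strong.any().any()))
```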
We can see that there is no strong correlation present among the features. However, I will do some one-hot encoding later on to prepare the data for modeling, which could lead to the creation of correlated features.
Continuous variables
PDP plots can be used to visualize the impact of a variable on the output across all data
points. Let’s first start with an obvious one, the effect of a continuous variable, namely
capital_gain, on the target:
The x-axis shows the values capital_gain can take and the y-axis indicates the effect on the probability of the positive class. It is clear that as one's capital_gain increases, the chance of making >50K increases with it. Note that the rug of data points at the bottom helps identify values that do not appear often.
4/11
from pdpbox import pdp

# PDP across the one-hot encoded Occupation columns
pdp_occupation = pdp.pdp_isolate(
    model=clf, dataset=df[df.columns[1:]],
    model_features=df.columns[1:],
    feature=[i for i in df.columns if 'O_' in i and i not in
             ['O_ Adm-clerical',
              'O_ Armed-Forces',
              'O_ Protective-serv',
              'O_ Sales',
              'O_ Handlers-cleaners']])
fig, axes = pdp.pdp_plot(pdp_occupation, 'Occupation', center=True,
                         plot_lines=True, frac_to_plot=100,
                         plot_pts_dist=True)
Here, you can clearly see that the likelihood of making more is positively affected by working in a managerial or technology-related position, while it decreases if you work in the fishing industry.
inter1 = pdp.pdp_interact(
model=clf, dataset=df[df.columns[1:]],
model_features=df.columns[1:], features=['Age', 'Hours/Week'])
fig, axes = pdp.pdp_interact_plot(
pdp_interact_out=inter1, feature_names=['Age', 'Hours/Week'],
plot_type='grid', x_quantile=True, plot_pdp=True)
This matrix tells you that an individual is likely to make more if they are around 49 years
old and work roughly 50 hours a week. I should note that it is important to keep in mind
the actual distribution of all interactions in your data set. There is a chance that this plot
will show you interesting interactions that will rarely or never happen.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X.values, feature_names=list(X.columns),
                                 mode='classification')
i = 304
exp = explainer.explain_instance(X.values[i], clf.predict_proba,
                                 num_features=5)
exp.show_in_notebook(show_table=True, show_all=False)
The output shows the effect of the top five variables on the prediction probability. This helps in understanding why your model makes a certain prediction and also lets you explain that prediction to users.
Disadvantage
Note that the neighborhood (kernel width) in which LIME searches for perturbed versions of the initial row is, to an extent, a hyperparameter that can be optimized. At times you will want a larger neighborhood, depending on the data. Finding the right kernel width is a bit of trial and error, and a poor choice might hurt the interpretability of the explanations.
The package I used in these analyses is shap; its documentation gives a more in-depth explanation of the theoretical background of SHAP. Shapley values satisfy a few desirable properties:
Any feature that has no effect on the predicted value should have a Shapley value
of 0 (Dummy)
If two features add the same value to the prediction, then their Shapley values
should be the same (Substitutability)
If you have two or more predictions that you would want to merge, you should be able to simply add the Shapley values that were calculated on the individual predictions (Additivity)
Binary Classification
Let’s see what the result would be if we were to calculate the Shapley values for a single
row:
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[1,:],
X.iloc[1,:])
This plot shows a base value that indicates the direction of the prediction. Seeing as most of the targets are 0, it is not strange that the base value is negative.
The red bar shows how much the probability that the target is 1 (>50K) increases when Education_num is 13; higher education typically leads to higher income.
The blue bars show that these variables decrease the probability, with Age having the
biggest effect. This makes sense as younger people typically make less.
Regression
Shapley values are intuitively a bit easier to grasp for regression (a continuous target) than for binary classification. Just to show an example, let's train a model to predict Age from the same data set:
Shapley values for a single data point
Here, you can quickly observe that being never-married lowers the predicted Age by roughly 8 years. This makes the prediction easier to explain than in a classification task, since you are talking directly about the value of the target instead of its probability.
However, you can immediately see a problem with using Shapley values for one-hot encoded features: the values are shown for each one-hot encoded column instead of for the feature it originally represented.
Fortunately, the Additivity axiom allows the Shapley values for each one-hot encoded
generated feature to be summed as a representation of the Shapley value for the entire
feature.
First, we need to sum up all the Shapley values for the one-hot encoded features:
summary_df = pd.DataFrame([X.columns,
abs(shap_values).mean(axis=0)]).T
summary_df.columns = ['Feature', 'mean_SHAP']
# Map each one-hot encoded column back to its original feature
# (this assumes columns are named like "Occupation_Sales")
mapping = {}
for feature in summary_df.Feature.values:
    mapping[feature] = feature.split('_')[0]
summary_df['Feature'] = summary_df.Feature.map(mapping)
shap_df = (summary_df.groupby('Feature')
.sum()
.sort_values("mean_SHAP", ascending=False)
.reset_index())
Now that the Shapley values are averaged across all samples and summed over the one-hot encoded columns, we can plot the resulting feature importance:
We can now see that Occupation is far more important than the original Shapley summary plot suggested. Thus, make sure to use the Additivity axiom to your advantage when explaining the importance of features to your users. They are likely to be more interested in how important Occupation is overall than in one specific occupation.
Conclusion
Although this is definitely not the first article to talk about interpretable and explainable ML, I hope it helped you understand how these techniques can be used when developing your model.
There has been a significant buzz around SHAP and I hope that demonstrating the
Additivity axiom using one-hot encoded features gives more intuition on how to use such
a method.
This is the first post as part of an (at least) monthly series by Emset in which we show
new and exciting methods for applying and developing Machine Learning techniques.