
Interpretable machine learning: definitions, methods, and applications

W. James Murdoch (a,1), Chandan Singh (b,1), Karl Kumbier (a,2), Reza Abbasi-Asl (b,c,2), and Bin Yu (a,b)

(a) UC Berkeley Statistics Dept.; (b) UC Berkeley EECS Dept.; (c) Allen Institute for Brain Science

arXiv:1901.04592v1 [stat.ML] 14 Jan 2019

Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related, and what common concepts can be used to evaluate them.

We aim to address these concerns by defining interpretability in the context of machine learning and introducing the Predictive, Descriptive, Relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post-hoc categories, with sub-groups including sparsity, modularity and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often under-appreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.

1. Introduction

Machine learning (ML) has recently received considerable attention for its ability to accurately predict a wide variety of complex phenomena. However, there is a growing realization that, in addition to predictions, ML models are capable of producing knowledge about domain relationships contained in data, often referred to as interpretations. These interpretations have found uses both in their own right, e.g. medicine (1), policy-making (2), and science (3, 4), as well as in auditing the predictions themselves in response to issues such as regulatory pressure (5) and fairness (6).

In the absence of a well-formed definition of interpretability, a broad range of methods with a correspondingly broad range of outputs (e.g. visualizations, natural language, mathematical equations) have been labeled as interpretation. This has led to considerable confusion about the notion of interpretability. In particular, it is unclear what it means to interpret something, what common threads exist among disparate methods, and how to select an interpretation method for a particular problem/audience.

In this paper, we attempt to address these concerns. To do so, we first define interpretability in the context of machine learning and place it within a generic data science life cycle. This allows us to distinguish between two main classes of interpretation methods: model-based∗ and post hoc. We then introduce the Predictive, Descriptive, Relevant (PDR) framework, consisting of three desiderata for evaluating and constructing interpretations: predictive accuracy, descriptive accuracy, and relevancy, where relevancy is judged by a human audience. Using these terms, we categorize a broad range of existing methods, all grounded in real-world examples†. In doing so, we provide a common vocabulary for researchers and practitioners to use in evaluating and selecting interpretation methods. We then show how our work enables a clearer discussion of open problems for future research.

A. Defining interpretable machine learning. On its own, interpretability is a broad, poorly defined concept. Taken to its full generality, to interpret data means to extract information (of some form) from it. The set of methods falling under this umbrella spans everything from designing an initial experiment to visualizing final results. In this overly general form, interpretability is not substantially different from the established concepts of data science and applied statistics.

Instead of general interpretability, we focus on the use of interpretations in the context of ML as part of the larger data-science life cycle. We define interpretable machine learning as the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data. Here, we view knowledge as being relevant if it provides insight for a particular audience into a chosen domain problem. These insights are often used to guide communication, actions, and discovery. Interpretation methods use ML models to produce relevant knowledge about domain relationships contained in data. This knowledge can be produced in formats such as visualizations, natural language or mathematical equations, depending on the context and audience. For instance, a doctor who must diagnose a single patient will want qualitatively different information than an engineer determining if an image classifier is discriminating by race.

∗ For clarity, throughout the paper we use the term to refer to both machine-learning models and algorithms.
† Examples were selected through a non-exhaustive search of related work.

W.M., C.S., K.K., R.A., and B.Y. helped identify important concepts and provided feedback on the paper. W.M. and C.S. wrote the paper with contributions from K.K., R.A., and B.Y. The authors declare no conflict of interest.
1 W.M. and C.S. contributed equally to this work.
2 K.K. and R.A. contributed equally to this work.
3 To whom correspondence should be addressed: binyu@berkeley.edu


B. Background. Interpretability is a quickly growing field in machine learning, and there have been multiple works examining various aspects of interpretations (sometimes under the heading explainable AI). One line of work focuses on providing an overview of different interpretation methods, with a strong emphasis on post hoc interpretations of deep learning models (7, 8), sometimes pointing out similarities between various methods (9, 10). Other work has focused on the narrower problem of how interpretations should be evaluated (11, 12) and what properties they should satisfy (13). These previous works touch on different subsets of interpretability, but do not address interpretable machine learning as a whole, and give limited guidance on how interpretability can actually be used in data-science life cycles. We aim to do so by providing a framework and vocabulary to fully capture interpretable machine learning, its benefits, and its applications to concrete data problems.

Interpretability also plays a role in other research areas. For example, interpretability is a major topic when considering bias and fairness in ML models (14, 15), with examples given throughout the paper (16). In psychology, the general notions of interpretability and explanations have been studied at a more abstract level (17, 18), providing relevant conceptual perspectives. Additionally, we comment on two areas that are distinct from, but closely related to, interpretability: causal inference and stability.

Causal inference. Causal inference (19) is a subject from statistics which is related to, but distinct from, interpretable machine learning. Causal inference methods focus solely on extracting causal relationships from data, i.e. statements that altering one variable will cause a change in another. In contrast, interpretable ML, and most other statistical techniques, are generally used to describe non-causal relationships, or relationships in observational studies.

In some instances, researchers use both interpretable machine learning and causal inference in a single analysis (20). One form of this is where the non-causal relationships extracted by interpretable ML are used to suggest potential causal relationships. These relationships can then be further analyzed using causal inference methods, and fully validated through experimental studies.

Stability. Stability, as a generalization of robustness in statistics, is a concept that applies throughout the entire data-science life cycle, including interpretable ML. The stability principle requires that each step in the life cycle is stable with respect to appropriate perturbations, such as small changes in the model or data. Recently, stability has been shown to be important in applied statistical problems, for example when trying to make conclusions about a scientific problem (21) and in more general settings (22). Stability can be helpful in evaluating interpretation methods and is a prerequisite for trustworthy interpretations. That is, one should not interpret parts of a model which are not stable to appropriate perturbations to the model and data. This is demonstrated through examples in the text (20, 23, 24).

2. Interpretation in the data science life cycle

Before discussing interpretation methods, we first place the process of interpretable ML within the broader data-science life cycle. Fig 1 presents a deliberately general description of this process, intended to capture most data-science problems. What is generally referred to as interpretation largely occurs in the modeling and post hoc analysis stages, with the problem, data and audience providing the context required to choose appropriate methods.

[Figure 1 schematic: Problem, Data, & Audience → Model (predictive accuracy) → Post hoc analysis (descriptive accuracy), with an Iterate loop.]
Fig. 1. Overview of different stages (black text) in a data-science life cycle where interpretability is important. Main stages are discussed in Sec 2 and accuracy (blue text) is described in Sec 3.

Problem, data, and audience. At the beginning of the cycle, a data-science practitioner defines a domain problem that they would like to understand using data. This problem can take many forms. In a scientific setting, the practitioner may be interested in relationships contained in the data, such as how brain cells in a particular area of the visual system relate to visual stimuli (25). In industrial settings, the problem often concerns the predictive performance or other qualities of a model, such as how to assign credit scores with high accuracy (26), or do so fairly with respect to gender and race (16). The nature of the problem plays a role in interpretability, as the relevant context and audience are essential in determining what methods to use.

After choosing a domain problem, the practitioner collects data to study it. Aspects of the data-collection process can affect the interpretation pipeline. Notably, biases in the data (i.e. mismatches between the collected data and the population of interest) will manifest themselves in the model, restricting one's ability to make interpretations regarding the problem of interest.

Model. Based on the chosen problem and collected data, the practitioner then constructs a predictive model. At this stage, the practitioner processes, cleans, and visualizes data, extracts features, selects a model (or several models) and fits it. Interpretability considerations often come into play in this step, related to the choice between simpler, easier-to-interpret models and more complex, black-box models, which may fit the data better. The model's ability to fit the data is measured through predictive accuracy.

Post hoc analysis. Having fit a model (or models), the practitioner then analyzes it for answers to the original question. The process of analyzing the model often involves using interpretability methods to extract various (stable) forms of information from the model. The extracted information can then be analyzed and displayed using standard data analysis methods, such as scatter plots and histograms. The ability of the interpretations to properly describe what the model has learned is denoted by descriptive accuracy.


Iterate. If sufficient answers are uncovered after the post hoc analysis stage, the practitioner finishes. Otherwise, they update something in the chain (problem, data, and/or model) and iterate, potentially multiple times (27). Note that they can terminate the loop at any stage, depending on the context of the problem.

A. Interpretation methods within the PDR framework. In the framework described above, most interpretation methods fall either in the modeling or post hoc analysis stages. We call interpretability in the modeling stage model-based interpretability (Sec 4). This part of interpretability is focused upon constraining the form of ML models so that they readily provide useful information about the uncovered relationships. As a result of these constraints, the space of potential models is smaller, which can result in lower predictive accuracy. Consequently, model-based interpretability is best used when the underlying relationship is relatively simple.

We call interpretability in the post hoc analysis stage post hoc interpretability (Sec 5). These interpretation methods take a trained model as input, and extract information about what relationships the model has learned. They are most helpful when the underlying relationship is especially complex, and practitioners need to train an intricate, black-box model in order to achieve a reasonable predictive accuracy.

After discussing desiderata for interpretation methods, we investigate these two forms of interpretations in detail and discuss associated methods.

3. The PDR desiderata for interpretations

In general, it is unclear how to select and evaluate interpretation methods for a particular problem and audience. To help guide this process, we introduce the PDR framework, consisting of three desiderata that should be used to select interpretation methods for a particular problem: predictive accuracy, descriptive accuracy, and relevancy.

A. Accuracy. The information produced by an interpretation method should be faithful to the underlying process the practitioner is trying to understand. In the context of ML, there are two areas where errors can arise: when approximating the underlying data relationships with a model (predictive accuracy) and when approximating what the model has learned using an interpretation method (descriptive accuracy). For an interpretation to be trustworthy, one should try to maximize both of the accuracies. In cases where the accuracy is not very high, the resulting interpretations may still be useful. However, it is especially important to check their trustworthiness through external validation, such as running an additional experiment.

A.1. Predictive accuracy. The first source of error occurs during the model stage, when an ML model is constructed. If the model learns a poor approximation of the underlying relationships in the data, any information extracted from the model is unlikely to be accurate. Evaluating the quality of a model's fit has been well-studied in standard supervised ML frameworks, through measures such as test-set accuracy. In the context of interpretation, we describe this error as predictive accuracy. Note that in problems involving interpretability, one often requires a notion of predictive accuracy that goes beyond just average accuracy. The distribution of predictions matters. For instance, it could be problematic if the prediction error is much higher for a particular class. Moreover, the predictive accuracy should be stable with respect to reasonable data and model perturbations. For instance, one should not trust interpretations from a model which changes dramatically when trained on a slightly smaller subset of the data.
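To make this concrete, here is a minimal sketch (using scikit-learn on a synthetic dataset; not a method from this paper) of such a stability check: the same model is refit on random subsamples of the training data, and the spread of the test accuracy and of the fitted coefficients indicates whether the fit, and hence any interpretation of it, can be trusted.

```python
# Minimal sketch of a stability check for predictive accuracy:
# refit the same model on random subsamples of the training data
# and inspect how much test accuracy (and the fit itself) varies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
accuracies, coefs = [], []
for _ in range(20):
    # perturb the data: drop a random 20% of the training set
    idx = rng.choice(len(X_train), size=int(0.8 * len(X_train)), replace=False)
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    accuracies.append(model.score(X_test, y_test))
    coefs.append(model.coef_.ravel())

print("test accuracy: %.3f +/- %.3f" % (np.mean(accuracies), np.std(accuracies)))
print("per-coefficient std across refits:", np.std(coefs, axis=0).round(3))
# Large variation in either quantity is a warning sign that
# interpretations of this model may not be trustworthy.
```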
A.2. Descriptive accuracy. The second source of error occurs during the post hoc analysis stage, when interpretation methods are used to analyze a fitted model. Oftentimes, interpretation methods provide an imperfect representation of the relationships learned by a model. This is especially challenging for complex black-box models such as deep neural networks, which store nonlinear relationships between variables in non-obvious forms.

Definition. We define descriptive accuracy, in the context of interpretation, as the degree to which an interpretation method objectively captures the relationships learned by machine learning models.

A.3. A common conflict: predictive vs descriptive accuracy. In selecting what model to use, practitioners are often faced with a trade-off between predictive and descriptive accuracy. On the one hand, the simplicity of model-based interpretation methods yields consistently high descriptive accuracy, but can sometimes result in lower predictive accuracy on complex datasets. On the other hand, in complex settings such as image analysis, complicated models generally provide high predictive accuracy, but are harder to analyze, resulting in a lower descriptive accuracy.

B. Relevancy. When selecting an interpretation method, it is not enough for the method to have high accuracy; the extracted information must also be relevant. For example, in the context of genomics, a patient, doctor, biologist, and statistician may each want different (yet consistent) interpretations from the same model. The context provided by the problem and data stages in Fig 1 guides what kinds of relationships a practitioner is interested in learning about, and by extension the methods that should be used.

Definition. We define an interpretation to be relevant if it provides insight for a particular audience into a chosen domain problem.

Relevancy often plays a key role in determining the trade-off between predictive and descriptive accuracy. Depending on the context of the problem at hand, a practitioner may choose to focus on one over the other. For instance, when interpretability is used to audit a model's predictions, such as to enforce fairness, descriptive accuracy can be more important. In contrast, interpretability can also be used solely as a tool to increase the predictive accuracy of a model, for instance, through improved feature engineering.

Having outlined the main desiderata for interpretation methods, we now discuss how they link to interpretation in the modeling and post hoc analysis stages in the data-science life cycle. Fig 2 draws parallels between our desiderata for interpretation techniques introduced in Sec 3 and our categorization of methods in Sec 4 and Sec 5. In particular, both post hoc and model-based methods aim to increase descriptive accuracy, but only model-based affects the predictive accuracy. Not shown is relevancy, which determines what type of output is helpful for a particular problem and audience.


[Figure 2 schematic: model-based interpretability generally decreases predictive accuracy (data-dependent) and increases descriptive accuracy; post hoc interpretability has no effect on predictive accuracy and increases descriptive accuracy.]
Fig. 2. Impact of interpretability methods on descriptive and predictive accuracies. Model-based interpretability (Sec 4) involves using a simpler model to fit the data, which can negatively affect predictive accuracy, but yields higher descriptive accuracy. Post hoc interpretability (Sec 5) involves using methods to extract information from a trained model (with no effect on predictive accuracy). These correspond to the model and post hoc stages in Fig 1.

4. Model-based interpretability

We now discuss how interpretability considerations come into play in the modeling stage of the data science life cycle (see Fig 1). At this stage, the practitioner constructs an ML model from the collected data. We define model-based interpretability as the construction of models that readily provide insight into the relationships they have learned. Different model-based interpretability methods provide different ways of increasing descriptive accuracy by constructing models which are easier to understand, sometimes resulting in lower predictive accuracy. The main challenge of model-based interpretability is to come up with models that are simple enough to be easily understood by the audience, yet sophisticated enough to properly fit the underlying data.

In selecting a model to solve a domain problem, the practitioner must consider the entirety of the PDR framework. The first desideratum to consider is predictive accuracy. If the constructed model does not accurately represent the underlying problem, any subsequent analysis will be suspect (28, 29). Second, the main purpose of model-based interpretation methods is to increase descriptive accuracy. Finally, the relevancy of a model's output must be considered, and is determined by the context of the problem, data, and audience. We now discuss some widely useful types of model-based interpretability methods.

A. Sparsity. When the practitioner believes that the underlying relationship in question is based upon a sparse set of signals, they can impose sparsity on their model by limiting the number of non-zero parameters. In this section, we focus on linear models, but sparsity can be helpful more generally. When the number of non-zero parameters is sufficiently small, a practitioner can interpret the variables corresponding to those parameters as being meaningfully related to the outcome in question, and can also interpret the magnitude and direction of the parameters. However, before one can interpret a sparse parameter set, one should check for stability of the parameters. For example, if the set of sparse parameters changes due to small perturbations in the data set, the coefficients should not be interpreted (30).

When the practitioner is able to correctly incorporate sparsity into their model, it can improve all three interpretation desiderata. By reducing the number of parameters to analyze, sparse models can be easier to understand, yielding higher descriptive accuracy. Moreover, incorporating prior information in the form of sparsity into a sparse problem can help a model achieve higher predictive accuracy and yield more relevant insights. Note that incorporating sparsity can often be quite difficult, as it requires understanding the data-specific structure of the sparsity and how it can be modelled.

Methods for obtaining sparsity often utilize a penalty on a loss function, such as LASSO (31) and sparse coding (32), or on a model selection criterion such as AIC or BIC (33, 34). Many search-based methods have been developed to find sparse solutions. These methods search through the space of non-zero coefficients using classical subset-selection methods (e.g. orthogonal matching pursuit (35)). Model sparsity is often useful for high-dimensional problems, where the goal is to identify key features for further analysis. As a result, sparsity penalties have been incorporated into complex models such as random forests to identify a sparse subset of important features (36).
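As a hedged illustration of this idea (a generic scikit-learn sketch on simulated data, not the genomics analysis described next), the following fits an l1-penalized (LASSO) regression and checks whether the selected support is stable across subsamples before the non-zero coefficients are interpreted.

```python
# Sketch: sparse linear regression with an L1 (LASSO) penalty,
# plus a check that the selected support is stable across subsamples.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y, true_coef = make_regression(n_samples=300, n_features=100, n_informative=5,
                                  coef=True, noise=5.0, random_state=0)

rng = np.random.RandomState(0)
selection_counts = np.zeros(X.shape[1])
n_repeats = 30
for _ in range(n_repeats):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    lasso = LassoCV(cv=5).fit(X[idx], y[idx])
    selection_counts += (np.abs(lasso.coef_) > 1e-8)

stable_support = np.where(selection_counts / n_repeats > 0.9)[0]
print("features selected in >90% of refits:", stable_support)
print("truly informative features:", np.where(true_coef != 0)[0])
# Only coefficients whose selection is stable under such perturbations
# should be interpreted as meaningfully related to the outcome.
```

Here the ground-truth support is known because the data are simulated; on real data, the stability of the selected set across subsamples is itself the diagnostic.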
In the following example from genomics, sparsity is used to increase the relevancy of the produced interpretations by reducing the number of potential interactions to a manageable level.

Ex. Identifying interactions among regulatory factors or biomolecules is an important question in genomics. Typical genomic datasets include thousands or even millions of features, many of which are active in specific cellular or developmental contexts. The massive scale of such datasets makes interpretation a considerable challenge. Sparsity penalties are frequently used to make the data manageable for statisticians and their collaborating biologists to discuss and identify promising candidates for further experiments.

For instance, one recent study (23) uses a biclustering approach based on sparse canonical correlation analysis (SCCA) to identify interactions among genomic expression features in Drosophila melanogaster (fruit flies) and Caenorhabditis elegans (roundworms). Sparsity penalties enable key interactions among features to be summarized in heatmaps which contain few enough variables for a human to analyze. Moreover, this study performs stability analysis on their model, finding it to be robust to different initializations and perturbations to hyperparameters.

B. Simulatability. A model is said to be simulatable if a human (for whom the interpretation is intended) is able to internally simulate and reason about its entire decision-making process (i.e. how a trained model produces an output for an arbitrary input). This is a very strong constraint to place on a model, and can generally only be done when the number of features is low, and the underlying relationship is simple. Decision trees (37) are often cited as a simulatable model, due to their hierarchical decision-making process. Another example is lists of rules (38, 39), which can easily be simulated. Due to their simplicity, simulatable models have very high descriptive accuracy. When they can also provide reasonable predictive accuracy, they can be very effective. In the following example, a novel simulatable model is able to produce high predictive accuracy, while maintaining the high levels of descriptive accuracy and relevancy normally attained by rules-based models.


Ex. In medical practice, when a patient has been diagnosed with atrial fibrillation, caregivers often want to predict the risk that the particular patient will have a stroke in the next year. Moreover, given the potential ramifications of medical decisions, it is important that these predictions are not only accurate, but interpretable to both the caregivers and patients.

To make the prediction, (39) uses data from 12,586 patients detailing their age, gender, history of drugs and conditions preceding their diagnosis, and whether they had a stroke within a year of diagnosis. In order to construct a model that has high predictive and descriptive accuracy, (39) introduce a method for learning lists of if-then rules that are predictive of one-year stroke risk. The resulting classifier, displayed in Fig 3, requires only seven if-then statements to achieve competitive accuracy, and is easy for even non-technical practitioners to quickly understand.

if hemiplegia and age > 60 then stroke risk 58.9% (53.8%–63.8%)
else if cerebrovascular disorder then stroke risk 47.8% (44.8%–50.7%)
else if transient ischaemic attack then stroke risk 23.8% (19.5%–28.4%)
else if occlusion and stenosis of carotid artery without infarction then stroke risk 15.8% (12.2%–19.6%)
else if altered state of consciousness and age > 60 then stroke risk 16.0% (12.2%–20.2%)
else if age ≤ 70 then stroke risk 4.6% (3.9%–5.4%)
else stroke risk 8.7% (7.9%–9.6%)

Fig. 3. Rule list for classifying stroke risk from patient data (replicated Fig 5 from (39)). One can easily simulate and understand the effect of different variables, such as age, on stroke risk. Reprinted with permission from the authors.
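For a generic sense of what simulatability looks like in code (a shallow scikit-learn decision tree, not the Bayesian rule-list method of (39)), the entire decision process of the fitted model below can be printed and stepped through by hand.

```python
# Sketch: a shallow decision tree is simulatable because its full
# decision-making process can be printed and followed by hand.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the learned rules as nested if-then statements.
print(export_text(tree, feature_names=list(data.feature_names)))
print("training accuracy: %.3f" % tree.score(data.data, data.target))
```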
C. Modularity. We define an ML model to be modular if a meaningful portion(s) of its prediction-making process can be interpreted independently. While modular models are not as easy to understand as sparse or simulatable models, they can still be useful in increasing descriptive accuracy to provide insights into the relationships the model has learned.

A wide array of models satisfy modularity to different degrees. Generalized additive models (40) force the relationship between variables in the model to be additive. In deep learning, specific methods such as attention (41) and modular network architectures (42) provide limited insight into a network's inner workings. Probabilistic models can enforce modularity by specifying a conditional independence structure which makes it easier to reason about different parts of a model independently (43).

The following example uses modularity to produce relevant interpretations for use in diagnosing biases in training data.

Ex. When prioritizing patient care for pneumonia patients in a hospital, one possible method is to predict the likelihood of death within 60 days, and focus on the patients with a higher mortality risk. Given the potential life and death consequences, being able to explain the reasons for hospitalizing a patient or not is very important.

A recent study (44) uses a dataset of 14,199 pneumonia patients, with 46 features including demographics (e.g. age and gender), simple physical measurements (e.g. heart rate, blood pressure) and lab tests (e.g. white blood cell count, blood urea nitrogen). To predict mortality risk, they use a generalized additive model with pairwise interactions, displayed below. The univariate and pairwise terms (f_j(x_j) and f_{ij}(x_i, x_j)) can be individually interpreted in the form of curves and heatmaps, respectively.

g(\mathbb{E}[y]) = \beta_0 + \sum_j f_j(x_j) + \sum_{i \neq j} f_{ij}(x_i, x_j)    [1]

By inspecting the individual modules, the researchers found a number of counterintuitive properties of their model. For instance, the fitted model learned that having asthma is associated with a lower risk of dying from pneumonia. In reality, the opposite is true: patients with asthma are known to have a higher risk of death from pneumonia. Because of this, in the collected data all patients with asthma received aggressive care, which was fortunately effective at reducing their risk of mortality relative to the general population.

In this instance, if the model were used without having been interpreted, pneumonia patients with asthma would have been de-prioritized for hospitalization. Consequently, the use of ML would increase their likelihood of dying. Fortunately, the use of an interpretable model enabled the researchers to identify and correct errors like this one, better ensuring that the model could be trusted in the real world.
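The GA2M model of (44) is not reproduced here; as a loose analogue, the sketch below fits a gradient-boosting model to synthetic data and inspects univariate and pairwise effects with scikit-learn's partial-dependence utilities, which play a role similar to the f_j curves and f_{ij} heatmaps in Eq. [1].

```python
# Sketch: inspecting univariate and pairwise effects of a fitted model
# with partial dependence (a loose analogue of the f_j curves and
# f_ij heatmaps of a generalized additive model with interactions).
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_friedman1(n_samples=1000, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Univariate effect of feature 0 (a curve) and the joint effect of
# features 0 and 1 (a 2-D grid that can be shown as a heatmap).
uni = partial_dependence(model, X, features=[0])
pair = partial_dependence(model, X, features=[(0, 1)])
print("f_0 evaluated on a grid:", uni["average"].shape)
print("f_01 evaluated on a 2-D grid:", pair["average"].shape)
```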
D. Domain-based feature engineering. While the type of model is important in producing a useful interpretation, so are the features that are used as inputs to the model. Having more informative features makes the relationship that needs to be learned by the model simpler, allowing one to use other model-based interpretability methods. Moreover, when the features have more meaning to a particular audience, they become easier to interpret.

In many individual domains, expert knowledge can be useful in constructing feature sets that are useful for building predictive models. The particular algorithms used to extract features are generally domain-specific, relying both on the practitioner's existing domain expertise and insights drawn from the data through exploratory data analysis. For example, in natural language processing, documents are embedded into vectors using tf-idf (45) and in computer vision mathematical transformations have been developed to produce useful representations of images (46). In the example below, domain knowledge about cloud coverage is exploited to design three simple features that increase the predictive accuracy of a model while maintaining the high descriptive accuracy of a simple predictive model.

Ex. When modelling global climate patterns, an important quantity is the amount and location of arctic cloud coverage. Due to the complex, layered nature of climate models, it is beneficial to have simple, easily auditable, cloud coverage models for use by down-stream climate scientists.

In (47), the authors use an unlabeled dataset of arctic satellite imagery to build a model predicting whether each pixel in an image contains clouds or not. Given the qualitative similarity between ice and clouds, this is a challenging prediction problem. By conducting exploratory data analysis and utilizing domain knowledge through interactions with climate scientists, the authors identify three simple features that are sufficient to cluster whether or not images contain clouds. Using these three features as input to quadratic discriminant analysis, they achieve both high predictive accuracy and transparency when compared with expert labels (which were not used in developing the features and the QDA clustering method).
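As a small, generic illustration of domain-based feature engineering in another domain mentioned above (text), the sketch below builds tf-idf features, whose dimensions correspond to recognizable words, and fits a simple linear classifier on top of them; the toy corpus and labels are hypothetical and unrelated to the cloud-detection study (47).

```python
# Sketch: tf-idf turns raw documents into features (word weights)
# that are individually meaningful to the practitioner, so a simple
# linear model on top of them stays easy to interpret.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the patient reports chest pain and shortness of breath",
        "routine checkup, no symptoms reported",
        "severe chest pain radiating to the left arm",
        "annual physical, patient feels well"]
labels = [1, 0, 1, 0]  # hypothetical toy labels: 1 = urgent, 0 = routine

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Each coefficient is attached to a human-readable word.
for word, weight in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
    print("%-12s %+.3f" % (word, weight))
```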


E. Model-based feature engineering. There are a variety of automatic approaches for constructing interpretable features. Two examples are unsupervised learning and dimensionality reduction. Unsupervised methods, such as clustering, matrix factorization, and dictionary learning, aim to process unlabelled data and output a description of their structure. These structures often shed insight into relationships contained within the data and can be useful in building predictive models. Dimensionality reduction focuses on finding a representation of the data which is lower-dimensional than the original data. Methods such as principal components analysis (48), independent components analysis (49), and canonical correlation analysis (50) can often identify a few interpretable dimensions, which can then be used as input to a model or to provide insights in their own right. Using fewer inputs can not only improve descriptive accuracy, but can increase predictive accuracy by reducing the number of parameters to fit. In the following example, unsupervised learning (non-negative matrix factorization) is used to represent images in a low-dimensional, genetically meaningful, space.

Ex. Heterogeneity is an important consideration in genomic problems and associated data. In many cases, regulatory factors or biomolecules can play a specific role in one context, such as a particular cell type or developmental stage, and have a very different role in other contexts. Thus, it is important to understand the "local" behavior of regulatory factors or biomolecules.

A recent study (51) uses unsupervised learning to learn spatial patterns of gene expression in Drosophila (fruit fly) embryos. In particular, they use stability-driven nonnegative matrix factorization to decompose images of complex spatial gene expression patterns into a library of 21 "principal patterns", which can be viewed as pre-organ regions. This decomposition, which is interpretable to biologists, allows the study of gene-gene interactions in pre-organ regions of the developing embryo.
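As a generic sketch of this kind of model-based feature engineering (plain scikit-learn NMF on a bundled digits dataset, not the stability-driven NMF of (51)), an unsupervised factorization yields a small dictionary of non-negative parts that can then serve as interpretable model inputs.

```python
# Sketch: non-negative matrix factorization learns a small library of
# non-negative "parts"; each sample is then represented by a few
# interpretable part activations instead of raw pixels.
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nmf = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
H_train = nmf.fit_transform(X_train)   # part activations per image
H_test = nmf.transform(X_test)
# nmf.components_ holds the 16 learned parts (each an 8x8 image).

clf = LogisticRegression(max_iter=2000).fit(H_train, y_train)
print("test accuracy on 16 NMF features: %.3f" % clf.score(H_test, y_test))
```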
5. Post hoc interpretability

We now discuss how interpretability considerations come into play in the post hoc analysis stage of the data-science life cycle. At this stage, the practitioner analyzes a trained model in order to provide insights into the learned relationships. This is particularly challenging when the model's parameters do not clearly show what relationships the model has learned. To aid in this process, a variety of post hoc interpretability methods have been developed to provide insight into what a trained model has learned, without changing the underlying model. These methods are particularly important for settings where the collected data is high-dimensional and complex, such as with image data. Once the information has been extracted from the fitted model, it can be analyzed using standard, exploratory data analysis techniques, such as scatter plots and histograms.

When conducting post hoc analysis, the model has already been trained, so its predictive accuracy is fixed. Thus, under the PDR framework, a researcher must only consider descriptive accuracy and relevancy (relative to a particular audience). Improving on each of these criteria is an area of active research.

Most widely useful post hoc interpretation methods fall into two main categories: prediction-level and dataset-level interpretations, which are sometimes referred to as local and global interpretations, respectively. Prediction-level interpretation methods focus on explaining individual predictions made by models, such as what features and/or interactions led to the particular prediction. Dataset-level approaches focus on the global relationships the model has learned, such as what visual patterns are associated with a predicted response. These two categories have much in common (in fact, dataset-level approaches often yield information at the prediction level), but we discuss them separately, as methods at different levels are meaningfully different.

A. Dataset-level interpretation. When a practitioner is interested in more general relationships learned by a model, e.g. relationships that are relevant for a particular class of responses or subpopulation, they use dataset-level interpretations.

A.1. Interaction and feature importances. Feature importance scores, at the dataset level, try to capture how much individual features contribute, across a dataset, to a prediction. These scores can provide insights into what features the model has identified as important for which outcomes, and their relative importance. Methods have been developed to score individual features in many models, including neural networks (52), random forests (53, 54), and generic classifiers (55).

In addition to feature importances, methods have been developed to extract important interactions between features. Interactions are important as ML models are often highly nonlinear and learn complex interactions between features. Methods exist to extract interactions from a variety of ML models including random forests (20, 56) and neural networks (57, 58). In the following example, the descriptive accuracy of random forests is increased by extracting Boolean interactions (a problem-relevant form of interpretation) from a trained model.

Ex. High-order interactions among regulatory factors or genes play an important role in defining cell-type-specific behavior in biological systems. As a result, extracting such interactions from genomic data is an important problem in biology.

A previous line of work considers the problem of searching for biological interactions associated with important biological processes (20, 56). To identify candidate biological interactions, the authors train a series of iteratively re-weighted RFs and search for stable combinations of features that frequently co-occur along the predictive RF decision paths. This approach takes a step beyond evaluating the importance of individual features in an RF, providing a more complete description of how features influence predicted responses. By interpreting the interactions used in RFs, the researchers identified gene-gene interactions with 80% accuracy in the Drosophila embryo and identified candidate targets for higher-order interactions.
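A minimal sketch of one generic dataset-level importance score, scikit-learn's permutation importance (which is not the iterative random forest approach described above), on simulated data:

```python
# Sketch: dataset-level feature importance via permutation importance.
# A feature's score is the drop in held-out accuracy when that feature
# is randomly shuffled, averaged over repeats.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

result = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print("feature %d: %.3f +/- %.3f"
          % (i, result.importances_mean[i], result.importances_std[i]))
```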


A.2. Statistical feature importances. In some instances, in addition to the raw value, we can compute statistical measures of confidence as feature importance scores, a standard technique taught in introductory statistics classes. By making assumptions about the underlying data-generating process, models like linear and logistic regression can compute confidence intervals and hypothesis tests for the values, and linear combinations, of their coefficients. These statistics can be helpful in determining the degree to which the observed coefficients are statistically significant. It is important to note that the assumptions of the underlying probabilistic model must be fully verified before using this form of interpretation. Below we present a cautionary example where different assumptions lead to opposing conclusions being drawn from the same dataset.

Ex. Here, we consider the lawsuit Students for Fair Admissions, Inc. v. Harvard regarding the use of race in undergraduate admissions to Harvard University. Initial reports by Harvard's Office of Institutional Research used logistic regression to model the probability of admission using different features of an applicant's profile, including their race (59). This analysis found that being Asian (and not low income) was associated with a coefficient of -0.418 with a significant p-value (<0.001). This negative coefficient suggested that being Asian had a significant negative association with admission probability.

Subsequent analysis from both sides in the lawsuit attempted to analyze the modeling and assumptions to decide on the significance of race in the model's decision. The plaintiff's expert report (60) suggested that race was being unfairly used, building on the original report from Harvard's Office of Institutional Research. It also incorporated analysis of more subjective factors such as "personal ratings", which seem to hurt Asian students' admission. In contrast, the expert report supporting Harvard University (61) finds that by accounting for certain other variables, the effect of race on Asian students' acceptance is no longer significant. Significances derived from statistical tests in regression or logistic regression models at best establish association, but not causation. Hence the analyses from both sides are flawed. This example demonstrates the practical and misleading consequences of statistical feature importances when used inappropriately.
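For completeness, a minimal sketch (on simulated data, entirely unrelated to the admissions case above) of how such statistical feature importances are typically computed for a logistic regression using statsmodels; the intervals and p-values below are only meaningful if the model's assumptions actually hold.

```python
# Sketch: coefficients of a logistic regression with standard errors,
# confidence intervals and p-values. These are only meaningful if the
# assumptions of the underlying probabilistic model are verified.
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
n = 1000
X = rng.normal(size=(n, 3))
logits = 1.0 * X[:, 0] - 0.5 * X[:, 1]          # feature 2 is irrelevant
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.params)      # fitted coefficients
print(model.conf_int())  # 95% confidence intervals
print(model.pvalues)     # Wald p-values
```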
A.3. Visualizations. When dealing with high-dimensional datasets, it can be challenging to quickly understand the complex relationships that a model has learned, making the presentation of the results particularly important. To help deal with this, researchers have developed a number of different visualizations which help to understand what a model has learned. For linear models with regularization, plots of regression coefficient paths show how varying a regularization parameter affects the fitted coefficients. When visualizing convolutional neural networks trained on image data, work has been done on visualizing filters (62, 63), maximally activating responses of individual neurons or classes (64), understanding intra-class variation (65), and grouping different neurons (66). For Long Short Term Memory networks (LSTMs), researchers have focused on analyzing the state vector, identifying individual dimensions that correspond to meaningful features (e.g. position in line, within quotes) (67), and building tools to track the model's decision process over the course of a sequence (68).

In the following example, relevant interpretations are produced by using maximal activation images for identifying patterns that drive the response of brain cells.

Ex. A recent study visualizes learned information from deep neural networks to understand individual brain cells (24). In this study, macaque monkeys were shown images while the responses of brain cells in their visual system (area V4) were recorded. Neural networks were trained to predict the responses of brain cells to the images. These neural networks produce accurate fits, but provide little insight into what patterns in the images increase a brain cell's response without further analysis. To remedy this, the authors introduce DeepTune, a method which provides a visualization, accessible to neuroscientists and others, of the patterns which activate a brain cell. The main intuition behind the method is to optimize the input of a network to maximize the response of a neural network model (which represents a brain cell).

The authors go on to analyze the major problem of instability. When post hoc visualizations attempt to answer scientific questions, the visualizations must be stable to reasonable perturbations (e.g. the choice of model); if there are changes in the visualization due to the choice of a model, it is likely not meaningful. The authors address this explicitly by fitting eighteen different models to the data and using a stable optimization over all the models to produce a final consensus DeepTune visualization.

A.4. Analyzing trends and outliers in predictions. When interpreting the performance of an ML model, it can be helpful to look not just at the average accuracy, but also at the distribution of predictions and errors. For example, residual plots can identify heterogeneity in predictions, and suggest particular data points to analyze, such as outliers in the predictions, or examples which had the largest prediction errors. Moreover, these plots can be used to analyze trends across the predictions. For instance, in the example below, influence functions are able to efficiently identify mislabelled data points.

Ex. This kind of analysis can also be used to identify mislabeled training data. A recently introduced method (69) uses the classical statistical concept of influence functions to identify points in the training data which contribute to predictions made by ML models. By searching for the training data points which contribute the most to individual predictions, they were able to find mislabelled data points without having to look at too much data. Correcting these mislabeled training points subsequently improved the test accuracy.
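A small, generic sketch of this kind of error analysis (plain residual inspection with scikit-learn, not the influence-function method of (69)): compute residuals on held-out data, summarize their spread, and pull out the examples with the largest errors for closer inspection.

```python
# Sketch: examine the distribution of prediction errors rather than a
# single average score, and flag the examples with the largest errors.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

residuals = y_test - model.predict(X_test)
print("residuals: mean %.2f, std %.2f" % (residuals.mean(), residuals.std()))

# The largest absolute residuals point to examples worth inspecting
# by hand (possible outliers, mislabeled points, or model blind spots).
worst = np.argsort(np.abs(residuals))[-5:]
print("indices of the 5 worst-predicted test examples:", worst)
```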
B. Prediction-level interpretation. Prediction-level approaches are useful when a practitioner is interested in understanding how individual predictions are made by a model. Note that prediction-level approaches can sometimes be aggregated to yield dataset-level insights.


B.1. Feature importance scores. The most popular approach to prediction-level interpretation has involved assigning importance scores to individual features. Intuitively, a variable with a large positive (negative) score made a highly positive (negative) contribution to a particular prediction. In the deep learning literature, a number of different approaches have been proposed to address this problem (70–79), with some methods for other models as well (80). These are often displayed in the form of a heat map highlighting important features. Note that feature importance scores at the prediction level can offer much more information than feature importance scores at the dataset level. This is a result of heterogeneity in a nonlinear model: the importance of a feature can vary for different examples as a result of interactions with other features. In the following example, feature importance scores are used to increase the descriptive accuracy of black-box models in order to validate their fairness.

Ex. When using ML models to predict sensitive outcomes, such as whether a person should receive a loan or a criminal sentence, it is important to verify that the algorithm is not discriminating against people based on protected attributes, such as race or gender. This problem is often described as ensuring ML models are "fair". In (16), the authors introduce a variable importance measure designed to isolate the contributions of individual variables, such as gender, among a set of correlated variables.

Based on these variable importance scores, the authors construct transparency reports, such as the one displayed in Fig 4. This figure displays the importance of features used to predict that "Mr. Z" is likely to be arrested in the future (an outcome which is often used in predictive policing), with each bar corresponding to a feature provided to the classifier, and the y axis displaying the importance score for that feature. In this instance, the race feature has the largest value, indicating that the classifier is indeed discriminating based on race. Thus, prediction-level feature importance scores are able to identify that a model is unfairly discriminating based on race.

Fig. 4. Importance of different predictors in predicting the likelihood of arrest for a particular person. Reprinted with permission from the authors.
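As a deliberately simple, generic sketch of a prediction-level importance score (an occlusion-style heuristic on a bundled dataset, not the measure of (16) nor any of the cited attribution methods), each feature of a single example is replaced by its training-set mean and the change in the predicted probability is recorded.

```python
# Sketch: a crude prediction-level importance score for one example.
# Replace each feature in turn by its training-set mean and record how
# much the predicted probability for the positive class changes.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

x = X_test[0].copy()
baseline = model.predict_proba([x])[0, 1]
scores = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    x_perturbed = x.copy()
    x_perturbed[j] = X_train[:, j].mean()       # "remove" feature j
    scores[j] = baseline - model.predict_proba([x_perturbed])[0, 1]

top = np.argsort(np.abs(scores))[::-1][:5]
print("most influential features for this prediction:", top, scores[top].round(3))
```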
B.2. Alternatives to feature importances. While feature importance scores can provide useful insights, they also have a number of limitations. For instance, they are unable to capture when algorithms learn interactions between variables. There is currently an evolving body of work centered around uncovering and addressing these limitations. These methods focus on explicitly capturing and displaying the interactions learned by a neural network (81, 82), alternative forms of interpretations such as textual explanations (83), identifying influential data points (69), and analyzing nearest neighbors (84, 85).
6. Future work

Having introduced the PDR framework for defining and discussing interpretable machine learning, we now leverage it to frame what we feel are the field's most important challenges moving forward. Below, we present open problems tied to each of the paper's three main sections: interpretation desiderata (Sec 3), model-based interpretability (Sec 4), and post hoc interpretability (Sec 5).

A. Measuring interpretation desiderata. Currently, there is no clear consensus in the community around how to evaluate interpretation methods, although some recent work has begun to address it (11–13). As a result, the standard of evaluation varies considerably across different work, making it challenging both for researchers in the field to measure progress, and for prospective users to select suitable methods. Within the PDR framework, to constitute an improvement, a new interpretation method must improve at least one desideratum (predictive accuracy, descriptive accuracy, or relevancy) without unduly harming the others. While improvements in predictive accuracy are easy to measure, measuring improvements in descriptive accuracy and relevancy remains a challenge. Less important areas for improvement include computational cost and ease of implementation.

A.1. Measuring descriptive accuracy. One way to measure an improvement to an interpretation method is to demonstrate that its output better captures what the ML model has learned, i.e. its descriptive accuracy. However, unlike predictive accuracy, descriptive accuracy is generally very challenging to measure or quantify. As a fall-back, researchers often show individual, cherry-picked interpretations which seem "reasonable". These kinds of evaluations are limited and unfalsifiable. In particular, these results are limited to the few examples shown, and not generally applicable to the entire dataset.

While the community has not settled on a standard evaluation protocol, there are some promising directions. In particular, the use of simulation studies presents a partial solution. In this setting, a researcher defines a simple generative process, generates a large amount of data from that process, and trains their ML model on that data. Assuming a proper simulation setup, a sufficiently powerful model to recover the generative process, and sufficiently large training data, the trained model should achieve near-perfect generalization accuracy. To compute an evaluation metric, they can then check whether their interpretations recover aspects of the original generative process. For example, (57, 86) train neural networks on a suite of generative models with certain built-in interactions, and test whether their method successfully recovers them. Here, due to the ML model's near-perfect generalization accuracy, we know that the model is likely to have recovered some aspects of the generative process, thus providing a ground truth against which to evaluate interpretations. In a related approach, when an underlying scientific problem has been previously studied, prior experimental findings can serve as a partial ground truth to retrospectively validate interpretations (20).
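A toy sketch of such a simulation study (a hypothetical setup, not the experiments of (57, 86)): data are generated from a known process in which only two features matter, a flexible model is trained on it, and an interpretation method, here permutation importance, is judged by whether it ranks those two features at the top.

```python
# Sketch: evaluating an interpretation method on simulated data with a
# known generative process. Only features 0 and 1 affect the response,
# so a faithful interpretation should rank them highest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(3000, 8))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=3000)   # known interaction

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
print("R^2 on held-out data: %.3f" % model.score(X_test, y_test))

imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("features ranked by importance:", ranking)
print("interpretation recovers the generative process:", set(ranking[:2]) == {0, 1})
```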


to propose novel methods which do not actually solve any learning, which is often used as a tool for automatically find-
real-world problems. ing relevant structure in data. Improvements in unsupervised
There have been two dominant approaches for demonstrat- techniques such as clustering and matrix factorization could
ing improved relevancy. The first, and strongest, is to directly lead to more useful features.
use the introduced method in solving a domain problem. For
instance, in one example discussed above (20), the authors eval- C. Post hoc. In contrast to model-based interpretability, much
uated a new interpretation method (iterative random forests) of post hoc interpretability is relatively new, with many foun-
by demonstrating that it could be used to identify meaningful dational concepts still unclear. In particular, we feel that two
biological Boolean interactions for use in experiments. In in- of the most important questions to be answered are what an
stances like this, where the interpretations are used directly interpretation of an ML model should look like, and how post
to solve a domain problem, their relevancy is indisputable. hoc interpretations can be used. One of the most promising
A second, less direct, approach is the use of human studies, potential uses of post hoc interpretations is to increase the
often through services like Amazon’s Mechanical Turk. Here, predictive accuracy of a model. In related work, it has been
humans are asked to perform certain tasks, such as evaluat- pointed out that in high stakes decisions practitioners should
ing how much they trust a model’s predictions (82). While be very careful when applying post hoc methods with unknown
challenging to properly construct and perform, these studies descriptive accuracy (88).
are vital to demonstrate that new interpretation methods are,
C.1. What should an interpretation of a black-box look like. Given a
in fact, relevant to any potential practitioners. However, one
shortcoming of this approach is that it is only possible to use black-box predictor and real-world problem, it is generally
a general audience of AMT crowdsourced workers, rather than unclear what format, or combination of formats, is best to
a more relevant, domain-specific audience. fully capture a model’s behavior. Researchers have proposed
a variety of interpretation forms, including feature heatmaps
(71), feature hierarchies (82) and identifying important ele-
B. Model-based. Now that we have discussed the general prob-
ments in the training set (69). However, in all instances there
lem of evaluating interpretations, we highlight important chal-
is a gap between the relatively simple information provided by
lenges for the two main sub-fields of interpretable machine
these interpretations and what the complex model has actually
learning: model-based and post hoc interpretability. Whenever
learned. Moreover, it is unclear if any of the current inter-
model-based interpretability can achieve reasonable predictive
pretation forms can fully capture a model’s behaviour, or if a
accuracy and relevancy, by virtue of its high descriptive ac-
new format altogether is needed. How to close that gap, while
curacy it is preferable to fitting a more complex model, and
producing outputs relevant to a particular audience/problem,
relying upon post hoc interpretability. Thus, the main focus
is an open problem.
for model-based interpretability is increasing its range of pos-
sible use cases by increasing its predictive accuracy through C.2. Using interpretations to improve predictive accuracy. In some
more accurate models and transparent feature engineering. It instances, post hoc interpretations uncover that a model has
is worth noting that sometimes a combination of model-based learned relationships a practitioner knows to be incorrect. For
and post hoc interpretations is ideal. instance, prior interpretation work has shown that a binary
B.1. Building accurate and interpretable models. In many instances, model-based interpretability methods fail to achieve a reasonable predictive accuracy. In these cases, practitioners are forced to abandon model-based interpretations in search of more accurate models. Thus, an effective way of increasing the potential uses of model-based interpretability is to devise new modeling methods that produce higher predictive accuracy while maintaining high descriptive accuracy and relevance. Promising examples of this work include the previously discussed examples on estimating pneumonia risk from patient data (44) and Bayesian models for generating rule lists to estimate a patient's risk of stroke (39). Detailed directions for this work are suggested in (88).
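As a hypothetical illustration of this trade-off, the sketch below compares a small, rule-like model against a more complex black box on predictive accuracy; the synthetic dataset and the particular model choices are assumptions made purely for illustration, not an experiment from any of the cited works.

# Hypothetical sketch: compare an interpretable model against a black box on
# predictive accuracy. Dataset and models are illustrative, not from the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# A shallow tree yields human-readable decision rules (high descriptive accuracy).
interpretable = DecisionTreeClassifier(max_depth=3, random_state=0)
# A boosted ensemble stands in for a more accurate but opaque alternative.
black_box = GradientBoostingClassifier(random_state=0)

acc_simple = cross_val_score(interpretable, X, y, cv=5).mean()
acc_complex = cross_val_score(black_box, X, y, cv=5).mean()
print(f"interpretable: {acc_simple:.3f}   black box: {acc_complex:.3f}")

When the two accuracies are comparable, the simpler, model-based route is preferable, since its descriptive accuracy comes essentially for free.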
B.2. Tools for feature engineering. When we have more informative and meaningful features, we can use simpler modeling methods to achieve comparable predictive accuracy. Thus, methods that produce more useful features broaden the potential uses of model-based interpretations. The first main category of work lies in improved tools for exploratory data analysis. By better enabling researchers to interact with and understand their data, these tools (combined with domain knowledge) provide increased opportunities to identify helpful features. Examples include interactive environments (89-91), tools for visualization (92-94), and data exploration tools (95, 96). The second category falls under unsupervised learning.
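As a sketch of the first category, the snippet below walks through a hypothetical exploratory session using pandas (95) and seaborn (94); the toy data, column names, and engineered feature are illustrative assumptions rather than an example drawn from the cited works.

# Hypothetical exploratory data analysis; the data and columns are made up.
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "height_cm": [160, 172, 181, 158, 190],
    "weight_kg": [55, 80, 95, 72, 88],
    "risk_score": [0.2, 0.5, 0.7, 0.6, 0.4],
})
print(df.describe())  # quick numeric summary of each column (95)

# Visual exploration (92-94): risk may track weight relative to height rather
# than either measurement alone.
sns.scatterplot(data=df, x="weight_kg", y="risk_score", hue="height_cm")

# Domain knowledge then suggests a single, more informative engineered feature
# that a simple, model-based interpretable model can use directly.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2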
C. Post hoc. For post hoc interpretability, the central open challenges concern what an interpretation of an ML model should look like, and how post hoc interpretations can be used. One of the most promising potential uses of post hoc interpretations is to increase the predictive accuracy of a model. In related work, it has been pointed out that in high-stakes decisions practitioners should be very careful when applying post hoc methods with unknown descriptive accuracy (88).

C.1. What should an interpretation of a black-box look like. Given a black-box predictor and a real-world problem, it is generally unclear what format, or combination of formats, is best suited to fully capture the model's behavior. Researchers have proposed a variety of interpretation forms, including feature heatmaps (71), feature hierarchies (82), and the identification of important elements in the training set (69). However, in all instances there is a gap between the relatively simple information provided by these interpretations and what the complex model has actually learned. Moreover, it is unclear whether any of the current interpretation forms can fully capture a model's behavior, or whether a new format altogether is needed. How to close that gap, while producing outputs relevant to a particular audience and problem, remains an open problem.
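To make the notion of a feature heatmap concrete, the following is a minimal sketch of a simple input-times-gradient heuristic, related to but much simpler than methods such as (71); the model interface and input shape are assumptions, and the function name is hypothetical.

import torch

def input_gradient_heatmap(model, x, target_class):
    # Saliency-style feature heatmap: gradient of the class score with respect
    # to the input, multiplied elementwise by the input itself.
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return (x.grad * x).detach()

# Assumed usage: `model` is any differentiable classifier returning class
# scores of shape (1, num_classes); `image` has shape (1, channels, H, W).
# heatmap = input_gradient_heatmap(model, image, target_class=0)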
C.2. Using interpretations to improve predictive accuracy. In some instances, post hoc interpretations uncover that a model has learned relationships a practitioner knows to be incorrect. For instance, prior interpretation work has shown that a binary husky vs. wolf classifier simply learns to identify whether there is snow in the image, ignoring the animals themselves (78). A natural question is whether it is possible for the practitioner to correct these relationships learned by the model, and consequently increase its predictive accuracy. Given the challenges surrounding simply generating post hoc interpretations, research on their uses has been limited (97, 98). However, as the field of post hoc interpretations continues to mature, this could be an exciting avenue for researchers to increase the predictive accuracy of their models by exploiting prior knowledge, independently of any other benefits of interpretations.
ACKNOWLEDGMENTS. This research was supported in part by grants ARO W911NF1710005, ONR N00014-16-1-2664, NSF DMS-1613002, and NSF IIS 1741340, an NSERC PGS D fellowship, and an Adobe research award. We thank the Center for Science of Information (CSoI), a US NSF Science and Technology Center, under grant agreement CCF-0939370. Reza Abbasi-Asl would like to thank the Allen Institute founder, Paul G. Allen, for his vision, encouragement and support.

1. Litjens G, et al. (2017) A survey on deep learning in medical image analysis. Medical image analysis 42:60–88.
2. Brennan T, Oliver WL (2013) The emergence of machine learning techniques in criminology. Criminology & Public Policy 12(3):551–562.
3. Angermueller C, Pärnamaa T, Parts L, Stegle O (2016) Deep learning for computational biology. Molecular systems biology 12(7):878.
4. Vu MAT, et al. (2018) A shared vision for machine learning in neuroscience. Journal of Neuroscience pp. 0508–17.
5. Goodman B, Flaxman S (2016) European union regulations on algorithmic decision-making and a "right to explanation". arXiv preprint arXiv:1606.08813.
6. Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness in Proceedings of the 3rd innovations in theoretical computer science conference. (ACM), pp. 214–226.
7. Chakraborty S, et al. (2017) Interpretability of deep learning models: a survey of results.
8. Guidotti R, Monreale A, Turini F, Pedreschi D, Giannotti F (2018) A survey of methods for explaining black box models. arXiv preprint arXiv:1802.01933.
9. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions in Advances in Neural Information Processing Systems. pp. 4768–4777.
10. Ancona M, Ceolini E, Oztireli C, Gross M (2018) Towards better understanding of gradient-based attribution methods for deep neural networks in 6th International Conference on Learning Representations (ICLR 2018).
11. Doshi-Velez F, Kim B (2017) A roadmap for a rigorous science of interpretability. arXiv preprint arXiv:1702.08608.
12. Gilpin LH, et al. (2018) Explaining explanations: An approach to evaluating interpretability of machine learning. arXiv preprint arXiv:1806.00069.
13. Lipton ZC (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
14. Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning in Advances in neural information processing systems. pp. 3315–3323.
15. Boyd D, Crawford K (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, communication & society 15(5):662–679.
16. Datta A, Sen S, Zick Y (2016) Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems in Security and Privacy (SP), 2016 IEEE Symposium on. (IEEE), pp. 598–617.
17. Keil FC (2006) Explanation and understanding. Annu. Rev. Psychol. 57:227–254.
18. Lombrozo T (2006) The structure and function of explanations. Trends in cognitive sciences 10(10):464–470.
19. Imbens GW, Rubin DB (2015) Causal inference in statistics, social, and biomedical sciences. (Cambridge University Press).
20. Basu S, Kumbier K, Brown JB, Yu B (2018) Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences p. 201711236.
21. Yu B (2013) Stability. Bernoulli 19(4):1484–1500.
22. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (2011) Robust statistics: the approach based on influence functions. (John Wiley & Sons) Vol. 196.
23. Pimentel H, Hu Z, Huang H (2018) Biclustering by sparse canonical correlation analysis. Quantitative Biology 6(1):56–67.
24. Abbasi-Asl R, et al. (2018) The deeptune framework for modeling and characterizing neurons in visual cortex area v4. bioRxiv p. 465534.
25. Roe AW, et al. (2012) Toward a unified theory of visual area v4. Neuron 74(1):12–29.
26. Huang CL, Chen MC, Wang CJ (2007) Credit scoring with a data mining approach based on support vector machines. Expert systems with applications 33(4):847–856.
27. Box GE (1976) Science and statistics. Journal of the American Statistical Association 71(356):791–799.
28. Breiman L (2001) Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science 16(3):199–231.
29. Freedman DA (1991) Statistical models and shoe leather. Sociological methodology pp. 291–313.
30. Lim C, Yu B (2016) Estimation stability with cross-validation (escv). Journal of Computational and Graphical Statistics 25(2):464–492.
31. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) pp. 267–288.
32. Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research 37(23):3311–3325.
33. Akaike H (1987) Factor analysis and aic in Selected Papers of Hirotugu Akaike. (Springer), pp. 371–386.
34. Burnham KP, Anderson DR (2004) Multimodel inference: understanding aic and bic in model selection. Sociological methods & research 33(2):261–304.
35. Pati YC, Rezaiifar R, Krishnaprasad PS (1993) Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition in Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on. (IEEE), pp. 40–44.
36. Amaratunga D, Cabrera J, Lee YS (2008) Enriched random forests. Bioinformatics 24(18):2010–2014.
37. Breiman L, Friedman J, Olshen R, Stone CJ (1984) Classification and regression trees.
38. Friedman JH, Popescu BE (2008) Predictive learning via rule ensembles. The Annals of Applied Statistics 2(3):916–954.
39. Letham B, Rudin C, McCormick TH, Madigan D (2015) Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9(3):1350–1371.
40. Hastie T, Tibshirani R (1986) Generalized additive models. Statistical Science 1(3):297–318.
41. Kim J, Canny JF (2017) Interpretable learning for self-driving cars by visualizing causal attention. in ICCV. pp. 2961–2969.
42. Andreas J, Rohrbach M, Darrell T, Klein D (2016) Neural module networks in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 39–48.
43. Koller D, Friedman N, Bach F (2009) Probabilistic graphical models: principles and techniques. (MIT press).
44. Caruana R, et al. (2015) Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ACM), pp. 1721–1730.
45. Ramos J, et al. (2003) Using tf-idf to determine word relevance in document queries in Proceedings of the first instructional conference on machine learning. Vol. 242, pp. 133–142.
46. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. (IEEE), Vol. 1, pp. 886–893.
47. Shi T, Yu B, Clothiaux EE, Braverman AJ (2008) Daytime arctic cloud detection based on multi-angle satellite data with case studies. Journal of the American Statistical Association 103(482):584–593.
48. Jolliffe I (1986) Principal component analysis.
49. Bell AJ, Sejnowski TJ (1995) An information-maximization approach to blind separation and blind deconvolution. Neural computation 7(6):1129–1159.
50. Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321–377.
51. Wu S, et al. (2016) Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences 113(16):4290–4295.
52. Olden JD, Joy MK, Death RG (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling 178(3-4):389–397.
53. Breiman L (2001) Random forests. Machine learning 45(1):5–32.
54. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC bioinformatics 9(1):307.
55. Altmann A, Toloşi L, Sander O, Lengauer T (2010) Permutation importance: a corrected feature importance measure. Bioinformatics 26(10):1340–1347.
56. Kumbier K, Basu S, Brown JB, Celniker S, Yu B (2018) Refining interaction search through signed iterative random forests. arXiv preprint arXiv:1810.07287.
57. Tsang M, Cheng D, Liu Y (2017) Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977.
58. Abbasi-Asl R, Yu B (2017) Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356.
59. Office of Institutional Research HU (2018) Exhibit 157: Demographics of Harvard college applicants. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-421-157-May-30-2013-Report.pdf pp. 8–9.
60. Arcidiacono PS (2018) Exhibit A: Expert report of Peter S. Arcidiacono. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-415-1-Arcidiacono-Expert-Report.pdf.
61. Card D (2018) Exhibit 33: Report of David Card. https://projects.iq.harvard.edu/files/diverse-education/files/legal_-_card_report_revised_filing.pdf.
62. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks in European conference on computer vision. (Springer), pp. 818–833.
63. Olah C, Mordvintsev A, Schubert L (2017) Feature visualization. Distill 2(11):e7.
64. Mordvintsev A, Olah C, Tyka M (2015) Deepdream - a code example for visualizing neural networks. Google Research 2:5.
65. Wei D, Zhou B, Torralba A, Freeman W (2015) Understanding intra-class knowledge inside cnn. arXiv preprint arXiv:1507.02379.
66. Zhang Q, Cao R, Shi F, Wu YN, Zhu SC (2017) Interpreting cnn knowledge via an explanatory graph. arXiv preprint arXiv:1708.01785.
67. Karpathy A, Johnson J, Fei-Fei L (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
68. Strobelt H, et al. (2016) Visual analysis of hidden state dynamics in recurrent neural networks. CoRR, abs/1606.07461.
69. Koh PW, Liang P (2017) Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.
70. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
71. Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks. ICML.
72. Selvaraju RR, et al. (2016) Grad-cam: Visual explanations from deep networks via gradient-based localization. See https://arxiv.org/abs/1610.02391v3 7(8).
73. Baehrens D, et al. (2010) How to explain individual classification decisions. Journal of Machine Learning Research 11(Jun):1803–1831.
74. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
75. Shrikumar A, Greenside P, Shcherbina A, Kundaje A (2016) Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
76. Murdoch WJ, Szlam A (2017) Automatic rule extraction from long short term memory networks. arXiv preprint arXiv:1702.02540.
77. Dabkowski P, Gal Y (2017) Real time image saliency for black box classifiers. arXiv preprint arXiv:1705.07857.
78. Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you?: Explaining the predictions of any classifier in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ACM), pp. 1135–1144.
79. Zintgraf LM, Cohen TS, Adel T, Welling M (2017) Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595.
80. Lundberg SM, Erion GG, Lee SI (2018) Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
81. Murdoch WJ, Liu PJ, Yu B (2018) Beyond word importance: Contextual decomposition to extract interactions from lstms. arXiv preprint arXiv:1801.05453.
82. Singh C, Murdoch WJ, Yu B (2018) Hierarchical interpretations for neural network predictions. arXiv preprint arXiv:1806.05337.
83. Rohrbach A, Rohrbach M, Hu R, Darrell T, Schiele B (2016) Grounding of textual phrases in images by reconstruction in European Conference on Computer Vision. (Springer), pp. 817–834.
84. Caruana R, Kangarloo H, Dionisio J, Sinha U, Johnson D (1999) Case-based explanation of non-case-based learning methods. in Proceedings of the AMIA Symposium. (American Medical Informatics Association), p. 212.
85. Papernot N, McDaniel P (2018) Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765.
86. Tsang M, Sun Y, Ren D, Liu Y (2018) Can I trust you more? model-agnostic hierarchical explanations. arXiv preprint arXiv:1812.04801.
87. Lei T, Barzilay R, Jaakkola T (2016) Rationalizing neural predictions. arXiv preprint arXiv:1606.04155.
88. Rudin C (2018) Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154.
89. Kluyver T, et al. (2016) Jupyter notebooks - a publishing format for reproducible computational workflows. in ELPUB. pp. 87–90.
90. Pérez F, Granger BE (2007) Ipython: a system for interactive scientific computing. Computing in Science & Engineering 9(3).
91. RStudio Team (2016) RStudio: Integrated Development Environment for R (RStudio, Inc., Boston, MA).
92. Barter R, Yu B (2015) Superheat: Supervised heatmaps for visualizing complex data. arXiv preprint arXiv:1512.01524.
93. Wickham H (2016) ggplot2: elegant graphics for data analysis. (Springer).
94. Waskom M, et al. (2014) Seaborn: statistical data visualization. URL: https://seaborn.pydata.org/ (visited on 2017-05-15).
95. McKinney W, et al. (2010) Data structures for statistical computing in python in Proceedings of the 9th Python in Science Conference. (Austin, TX), Vol. 445, pp. 51–56.
96. Wickham H (2017) tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1.
97. Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717.
98. Zaidan O, Eisner J, Piatko C (2007) Using “annotator rationales” to improve machine learning for text categorization in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. pp. 260–267.