
Accepted Manuscript

A high-bias, low-variance introduction to Machine Learning for physicists

Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre G.R. Day,


Clint Richardson, Charles K. Fisher, David J. Schwab

PII: S0370-1573(19)30076-6
DOI: https://doi.org/10.1016/j.physrep.2019.03.001
Reference: PLREP 2056

To appear in: Physics Reports

Received date: 26 March 2018
Accepted date: 2 February 2019

Please cite this article as: P. Mehta, M. Bukov, C.-H. Wang et al., A high-bias, low-variance
introduction to Machine Learning for physicists, Physics Reports (2019),
https://doi.org/10.1016/j.physrep.2019.03.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form.
Please note that during the production process errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.
A high-bias, low-variance introduction to Machine Learning for physicists

Pankaj Mehta, Ching-Hao Wang, Alexandre G. R. Day, and Clint Richardson
Department of Physics, Boston University, Boston, MA 02215, USA∗

Marin Bukov
Department of Physics, University of California, Berkeley, CA 94720, USA†

Charles K. Fisher
Unlearn.AI, San Francisco, CA 94108

David J. Schwab
Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, 365 Fifth Ave., New York, NY 10016

(Dated: March 1, 2019)

∗ pankajm@bu.edu
† mgbukov@berkeley.edu

Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python Jupyter notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute.

CONTENTS

I. Introduction 3
   A. What is Machine Learning? 3
   B. Why study Machine Learning? 4
   C. Scope and structure of the review 4

II. Why is Machine Learning difficult? 6
   A. Setting up a problem in ML and data science 6
   B. Polynomial Regression 6

III. Basics of Statistical Learning Theory 10
   A. Three simple schematics that summarize the basic intuitions from Statistical Learning Theory 10
   B. Bias-Variance Decomposition 12

IV. Gradient Descent and its Generalizations 13
   A. Gradient Descent and Newton's method 13
   B. Limitations of the simplest gradient descent algorithm 15
   C. Stochastic Gradient Descent (SGD) with mini-batches 16
   D. Adding Momentum 17
   E. Methods that use the second moment of the gradient 17
   F. Comparison of various methods 18
   G. Gradient descent in practice: practical tips 19

V. Overview of Bayesian Inference 19
   A. Bayes Rule 20
   B. Bayesian Decisions 20
   C. Hyperparameters 20

VI. Linear Regression 21
   A. Least-square regression 21
   B. Ridge-Regression 22
   C. LASSO and Sparse Regression 23
   D. Using Linear Regression to Learn the Ising Hamiltonian 24
   E. Convexity of regularizer 25
   F. Bayesian formulation of linear regression 27
   G. Recap and a general perspective on regularizers 28

VII. Logistic Regression 29
   A. The cross-entropy as a cost function for logistic regression 29
   B. Minimizing the cross entropy 30
   C. Examples of binary classification 30
      1. Identifying the phases of the 2D Ising model 30
      2. SUSY 32
   D. Softmax Regression 34
   E. An Example of SoftMax Classification: MNIST Digit Classification 34

VIII. Combining Models 35
   A. Revisiting the Bias-Variance Tradeoff for Ensembles 35
      1. Bias-Variance Decomposition for Ensembles 35
      2. Summarizing the Theory and Intuitions behind Ensembles 38
   B. Bagging 39
   C. Boosting 40
   D. Random Forests 41
   E. Gradient Boosted Trees and XGBoost 42
   F. Applications to the Ising model and Supersymmetry Datasets 44

IX. An Introduction to Feed-Forward Deep Neural Networks (DNNs) 45
   A. Neural Network Basics 46
      1. The basic building block: neurons 46
      2. Layering neurons to build deep networks: network architecture 47
   B. Training deep networks 48
   C. The Backpropagation algorithm 49
      1. Deriving and implementing the backpropagation equations 50
      2. Computing gradients in deep networks: what can go wrong with backprop? 51
   D. Regularizing neural networks and other practical considerations 52
      1. Implicit regularization using SGD: initialization, hyper-parameter tuning, and Early Stopping 52
      2. Dropout 52
      3. Batch Normalization 52
   E. Deep neural networks in practice: examples 53
      1. Deep learning packages 53
      2. Approaching the learning problem 54
      3. SUSY dataset 55
      4. Phases of the 2D Ising model 55

X. Convolutional Neural Networks (CNNs) 56
   A. The structure of convolutional neural networks 56
   B. Example: CNNs for the 2D Ising model 58
   C. Pre-trained CNNs and transfer learning 58

XI. High-level Concepts in Deep Neural Networks 60
   A. Organizing deep learning workflows using the bias-variance tradeoff 60
   B. Why neural networks are so successful: three high-level perspectives on neural networks 61
      1. Neural networks as representation learning 61
      2. Neural networks can exploit large amounts of data 61
      3. Neural networks scale up well computationally 62
   C. Limitations of supervised learning with deep networks 62

XII. Dimensional Reduction and Data Visualization 63
   A. Some of the challenges of high-dimensional data 63
   B. Principal component analysis (PCA) 64
   C. Multidimensional scaling 66
   D. t-SNE 66

XIII. Clustering 68
   A. Practical clustering methods 69
      1. K-means 69
      2. Hierarchical clustering: Agglomerative methods 70
      3. Density-based (DB) clustering 71
   B. Clustering and Latent Variables via the Gaussian Mixture Models 72
   C. Clustering in high dimensions 74

XIV. Variational Methods and Mean-Field Theory (MFT) 75
   A. Variational mean-field theory for the Ising model 76
   B. Expectation Maximization (EM) 78

XV. Energy Based Models: Maximum Entropy (MaxEnt) Principle, Generative models, and Boltzmann Learning 80
   A. An overview of energy-based generative models 80
   B. Maximum entropy models: the simplest energy-based generative models 81
      1. MaxEnt models in statistical mechanics 81
      2. From statistical mechanics to machine learning 82
      3. Generalized Ising Models from MaxEnt 83
   C. Cost functions for training energy-based models 83
      1. Maximum likelihood 84
      2. Regularization 84
   D. Computing gradients 85
   E. Summary of the training procedure 86

XVI. Deep Generative Models: Hidden Variables and Restricted Boltzmann Machines (RBMs) 86
   A. Why hidden (latent) variables? 86
   B. Restricted Boltzmann Machines (RBMs) 87
   C. Training RBMs 89
      1. Gibbs sampling and contrastive divergence (CD) 89
      2. Practical Considerations 90
   D. Deep Boltzmann Machine 90
   E. Generative models in practice: examples 91
      1. MNIST 91
      2. Example: 2D Ising Model 92
   F. Generative models in physics 92

XVII. Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs) 94
   A. The limitations of maximizing Likelihood 94
   B. Generative models and adversarial learning 96
   C. Variational Autoencoders (VAEs) 97
      1. VAEs as variational models 97
      2. Training via the reparametrization trick 98
      3. Connection to the information bottleneck 99
   D. VAE with Gaussian latent variables and Gaussian encoder 100
      1. Implementing the Gaussian VAE 100
      2. VAEs for the MNIST dataset 101
      3. VAEs for the 2D Ising model 101

XVIII. Outlook 103
   A. Research at the intersection of physics and ML 103
   B. Topics not covered in review 104
   C. Rebranding Machine Learning as "Artificial Intelligence" 105
   D. Social Implications of Machine Learning 105

XIX. Acknowledgments 106

A. Overview of the Datasets used in the Review 106
   1. Ising dataset 106
   2. SUSY dataset 106
   3. MNIST Dataset 107

References 107

I. INTRODUCTION

Machine Learning (ML), data science, and statistics are fields that describe how to learn from, and make predictions about, data. The availability of big datasets is a hallmark of modern science, including physics, where data analysis has become an important component of diverse areas, such as experimental particle physics, observational astronomy and cosmology, condensed matter physics, biophysics, and quantum computing. Moreover, ML and data science are playing increasingly important roles in many aspects of modern technology, ranging from biotechnology to the engineering of self-driving cars and smart devices. Therefore, having a thorough grasp of the concepts and tools used in ML is an important skill that is increasingly relevant in the physical sciences.

The purpose of this review is to serve as an introduction to foundational and state-of-the-art techniques in ML and data science for physicists. The review seeks to find a middle ground between a short overview and a full-length textbook. While there exist many wonderful ML textbooks (Abu-Mostafa et al., 2012; Bishop, 2006; Friedman et al., 2001; Murphy, 2012), they are lengthy and use specialized language that is often unfamiliar to physicists. This review builds upon the considerable knowledge most physicists already possess in statistical physics in order to introduce many of the major ideas and techniques used in modern ML. We take a physics-inspired pedagogical approach, emphasizing simple examples (e.g., regression and clustering), before delving into more advanced topics. The intention of this review and the accompanying Jupyter notebooks (available at https://physics.bu.edu/~pankajm/MLnotebooks.html) is to give the reader the requisite background knowledge to follow and apply these techniques to their own areas of interest.

While this review is written with a physics background in mind, we aim for it to be useful to anyone with some background in statistical physics, and it is suitable for both graduate students and researchers as well as advanced undergraduates. The review is based on an advanced topics graduate course taught at Boston University in Fall of 2016. As such, it assumes a level of familiarity with several topics found in graduate physics curricula (partition functions, statistical mechanics) and a fluency in mathematical techniques such as linear algebra, multivariate calculus, variational methods, probability theory, and Monte-Carlo methods. It also assumes a familiarity with basic computer programming and algorithmic design.

A. What is Machine Learning?

Most physicists learn the basics of classical statistics early on in undergraduate laboratory courses. Classical statistics is primarily concerned with how to use data to estimate the value of an unknown quantity. For instance, estimating the speed of light using measurements obtained with an interferometer is one such example that relies heavily on techniques from statistics.

Machine Learning is a subfield of artificial intelligence with the goal of developing algorithms capable of learning from data automatically. In particular, an artificially intelligent agent needs to be able to recognize objects in its surroundings and predict the behavior of its environment in order to make informed choices. Therefore, techniques in ML tend to be more focused on prediction rather than estimation. For example, how do we use data from the interferometry experiment to predict what interference pattern would be observed under a different experimental setup? In addition, methods from ML tend to be applied to more complex high-dimensional problems than those typically encountered in a classical statistics course.

Despite these differences, estimation and prediction problems can be cast into a common conceptual framework. In both cases, we choose some observable quantity x of the system we are studying (e.g., an interference pattern) that is related to some parameters θ (e.g., the speed of light) of a model p(x|θ) that describes the probability of observing x given θ. Now, we perform an experiment to obtain a dataset X and use these data to fit the model. Typically, "fitting" the model involves finding θ̂ that provides the best explanation for the data. In the case when "fitting" refers to the method of least squares, the estimated parameters maximize the probability of observing the data (i.e., θ̂ = argmax_θ {p(X|θ)}). Estimation problems are concerned with the accuracy of θ̂, whereas prediction problems are concerned with the ability of the model to predict new observations (i.e., the accuracy of p(x|θ̂)). Although the goals of estimation and prediction are related, they often lead to different approaches. As this review is aimed as an introduction to the concepts of ML, we will focus on prediction problems and refer the reader to one of many excellent textbooks on classical statistics for more information on estimation (Lehmann and Casella, 2006; Lehmann and Romano, 2006; Wasserman, 2013; Witte and Witte, 2013).

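To make the distinction between estimation and prediction concrete, the short sketch below (our own illustrative code, not part of the review's notebooks) fits a one-parameter Gaussian model p(x|θ) to simulated measurements by maximizing the likelihood and then uses the fitted model p(x|θ̂) to score new observations. The "measurement" setup, the noise level, and all variable names are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical repeated measurements: x ~ N(theta_true, sigma^2),
# where theta is the single model parameter we wish to estimate.
theta_true, sigma = 3.0, 0.5
X = rng.normal(theta_true, sigma, size=100)   # observed dataset

# Estimation: maximize log p(X|theta). For a Gaussian with known sigma,
# the maximum-likelihood estimate is simply the sample mean.
theta_hat = X.mean()
print(f"theta_hat = {theta_hat:.3f}  (true value {theta_true})")

# Prediction: use the fitted model p(x|theta_hat) to describe *new* observations.
X_new = rng.normal(theta_true, sigma, size=100)
log_pred = -0.5 * ((X_new - theta_hat) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
print(f"average predictive log-likelihood on new data: {log_pred.mean():.3f}")
```

Estimation asks how close theta_hat is to theta_true; prediction asks how well the last line scores data the model has never seen.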
B. Why study Machine Learning?

The last three decades have seen an unprecedented increase in our ability to generate and analyze large data sets. This "big data" revolution has been spurred by an exponential increase in computing power and memory commonly known as Moore's law. Computations that were unthinkable a few decades ago can now be routinely performed on laptops. Specialized computing machines (such as GPU-based machines) are continuing this trend towards cheap, large-scale computation, suggesting that the "big data" revolution is here to stay.

This increase in our computational ability has been accompanied by new techniques for analyzing and learning from large datasets. These techniques draw heavily from ideas in statistics, computational neuroscience, computer science, and physics. Similar to physics, modern ML places a premium on empirical results and intuition over the more formal treatments common in statistics, computer science, and mathematics. This is not to say that proofs are not important or undesirable. Rather, many of the advances of the last two decades – especially in fields like deep learning – do not have formal justifications (much like there still exists no mathematically well-defined concept of the Feynman path-integral in d > 1).

Physicists are uniquely situated to benefit from and contribute to ML. Many of the core concepts and techniques used in ML – such as Monte-Carlo methods, simulated annealing, and variational methods – have their origins in physics. Moreover, "energy-based models" inspired by statistical physics are the backbone of many deep learning methods. For these reasons, there is much in modern ML that will be familiar to physicists.

Physicists and astronomers have also been at the forefront of using "big data". For example, experiments such as CMS and ATLAS at the LHC generate petabytes of data per year. In astronomy, projects such as the Sloan Digital Sky Survey (SDSS) routinely analyze and release hundreds of terabytes of data measuring the properties of nearly a billion stars and galaxies. Researchers in these fields are increasingly incorporating recent advances in ML and data science, and this trend is likely to accelerate in the future.

Besides applications to physics, part of the goal of this review is to serve as an introductory resource for those looking to transition to more industry-oriented projects. Physicists have already made many important contributions to modern big data applications in an industrial setting (Metz, 2017). Data scientists and ML engineers in industry use concepts and tools developed for ML to gain insight from large datasets. A familiarity with ML is a prerequisite for many of the most exciting employment opportunities in the field, and we hope this review will serve as a useful introduction to ML for physicists beyond an academic setting.

C. Scope and structure of the review

Any review on ML must simultaneously accomplish two related but distinct goals. First, it must convey the rich theoretical foundations underlying modern ML. This task is made especially difficult because ML is very broad and interdisciplinary, drawing on ideas and intuitions from many fields including statistics, computational neuroscience, and physics. Unfortunately, this means making choices about what theoretical ideas to include in the review. This review emphasizes connections with statistical physics, physics-inspired Bayesian inference, and computational neuroscience models. Thus, certain ideas (e.g., gradient descent, expectation maximization, variational methods, and deep learning and neural networks) are covered extensively, while other important ideas are given less attention or even omitted entirely (e.g., statistical learning, support vector machines, kernel methods, Gaussian processes). Second, any ML review must give the reader the practical know-how to start using the tools and concepts of ML for practical problems. To accomplish this, we have written a series of Jupyter notebooks to accompany this review. These Python notebooks introduce the nuts-and-bolts of how to use, code, and implement the methods introduced in the main text. Luckily, there are numerous great ML software packages available in Python (scikit-learn, TensorFlow, PyTorch, Keras) and we have made extensive use of them. We have also made use of a new package, Paysage, for energy-based generative models, which has been co-developed by one of the authors (CKF) and maintained by Unlearn.AI (a company affiliated with two of the authors: CKF and PM). The purpose of the notebooks is to both familiarize physicists with these resources and to serve as a starting point for experimenting and playing with ideas.

ML can be divided into three broad categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning concerns learning from labeled data (for example, a collection of pictures labeled as containing a cat or not containing a cat). Common supervised learning tasks include classification and regression. Unsupervised learning is concerned with finding patterns and structure in unlabeled data. Examples of unsupervised learning include clustering, dimensionality reduction, and generative modeling. Finally, in reinforcement learning an agent learns by interacting with an environment and changing its behavior to maximize its reward. For example, a robot can be trained to navigate in a complex environment by assigning a high reward to actions that help the robot reach a desired destination. We refer the interested reader to the classic book by Sutton and Barto, Reinforcement Learning: An Introduction (Sutton and Barto, 1998). While useful, the distinction between the three types of ML is sometimes fuzzy and fluid, and many applications often combine them in novel and interesting ways. For example, the recent success
of Google DeepMind in developing ML algorithms that excel at tasks such as playing Go and video games employs deep reinforcement learning, combining reinforcement learning with supervised learning methods based on deep neural networks.

Here, we limit our focus to supervised and unsupervised learning. The literature on reinforcement learning is extensive and uses ideas and concepts that, to a large degree, are distinct from supervised and unsupervised learning tasks. For this reason, to ensure cohesiveness and limit the length of this review, we have chosen not to discuss reinforcement learning. However, this omission should not be mistaken for a value judgement on the utility of reinforcement learning for solving physical problems. For example, some of the authors have used inspiration from reinforcement learning to tackle difficult problems in quantum control (Bukov, 2018; Bukov et al., 2018).

In writing this review, we have tried to adopt a style that reflects what we consider to be the best of the physics tradition. Physicists understand the importance of well-chosen examples for furthering our understanding. It is hard to imagine a graduate course in statistical physics without the Ising model. Each new concept that is introduced in statistical physics (mean-field theory, transfer matrix techniques, high- and low-temperature expansions, the renormalization group, etc.) is applied to the Ising model. This allows for the progressive building of intuition and ultimately a coherent picture of statistical physics. We have tried to replicate this pedagogical approach in this review by focusing on a few well-chosen techniques – linear and logistic regression in the case of supervised learning and clustering in the case of unsupervised learning – to introduce the major theoretical concepts.

In this same spirit, we have chosen three interesting datasets with which to illustrate the various algorithms discussed here. (i) The SUSY data set consists of 5,000,000 Monte-Carlo samples of proton-proton collisions decaying to either signal or background processes, which are both parametrized with 18 features. The signal process is the production of electrically-charged supersymmetric particles, which decay to W bosons and an electrically-neutral supersymmetric particle, invisible to the detector, while the background processes are various decays involving only Standard Model particles (Baldi et al., 2014). (ii) The Ising data set consists of 10^4 states of the 2D Ising model on a 40 × 40 square lattice, obtained using Monte-Carlo (MC) sampling at a few fixed temperatures T. (iii) The MNIST dataset comprises 70000 handwritten digits, each of which comes in a square image, divided into a 28 × 28 pixel grid. The first two datasets were chosen to reflect the various sub-disciplines of physics (high-energy experiment, condensed matter) where we foresee techniques from ML becoming an increasingly important tool for research. The MNIST dataset, on the other hand, introduces the flavor of present-day ML problems. By re-analyzing the same datasets with multiple techniques, we hope readers will be able to get a sense of the various, inevitable trade-offs involved in choosing how to analyze data. Certain techniques work better when data is limited while others may be better suited to large data sets with many features. A short description of these datasets is given in the Appendix.

This review draws generously on many wonderful textbooks on ML and we encourage the reader to consult them for further information. They include Abu Mostafa's masterful Learning from Data, which introduces the basic concepts of statistical learning theory (Abu-Mostafa et al., 2012), the more advanced but equally good The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (Friedman et al., 2001), Michael Nielsen's indispensable Neural Networks and Deep Learning, which serves as a wonderful introduction to neural networks and deep learning (Nielsen, 2015), and David MacKay's outstanding Information Theory, Inference, and Learning Algorithms, which introduced Bayesian inference and information theory to a whole generation of physicists (MacKay, 2003). More comprehensive (and much longer) books on modern ML techniques include Christopher Bishop's classic Pattern Recognition and Machine Learning (Bishop, 2006) and the more recently published Machine Learning: A Probabilistic Perspective by Kevin Murphy (Murphy, 2012). Finally, one of the great successes of modern ML is deep learning, and some of the pioneers of this field have written a textbook for students and researchers entitled Deep Learning (Goodfellow et al., 2016). In addition to these textbooks, we have consulted numerous research papers, reviews, and web resources. Whenever possible, we have tried to point the reader to key papers and other references that we have found useful in preparing this review. However, we are neither capable of nor have we made any effort to make a comprehensive review of the literature.

The review is organized as follows. We begin by introducing polynomial regression as a simple example that highlights many of the core ideas of ML. The next few chapters introduce the language and major concepts needed to make these ideas more precise, including tools from statistical learning theory such as overfitting, the bias-variance tradeoff, regularization, and the basics of Bayesian inference. The next chapter builds on these examples to discuss stochastic gradient descent and its generalizations. We then apply these concepts to linear and logistic regression, followed by a detour to discuss how we can combine multiple statistical techniques to improve supervised learning, introducing bagging, boosting, random forests, and XGBoost. These ideas, though fairly technical, lie at the root of many of the advances in ML over the last decade. The review continues with a thorough discussion of supervised deep learning and
neural networks, as well as convolutional nets. We then turn our focus to unsupervised learning. We start with data visualization and dimensionality reduction before proceeding to a detailed treatment of clustering. Our discussion of clustering naturally leads to an examination of variational methods and their close relationship with mean-field theory. The review continues with a discussion of deep unsupervised learning, focusing on energy-based models, such as Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs). Then we discuss two new and extremely popular modeling frameworks for unsupervised learning, generative adversarial networks (GANs) and variational autoencoders (VAEs). We conclude the review with an outlook and discussion of promising research directions at the intersection of physics and ML.

II. WHY IS MACHINE LEARNING DIFFICULT?

A. Setting up a problem in ML and data science

Many problems in ML and data science start with the same ingredients. The first ingredient is the dataset D = (X, y), where X is a matrix of independent variables and y is a vector of dependent variables. The second is the model f(x; θ), which is a function f : x → y of the parameters θ. That is, f is a function used to predict an output from a vector of input variables. The final ingredient is the cost function C(y, f(X; θ)) that allows us to judge how well the model performs on the observations y. The model is fit by finding the value of θ that minimizes the cost function. For example, one commonly used cost function is the squared error. Minimizing the squared error cost function is known as the method of least squares, and is typically appropriate for experiments with Gaussian measurement errors.

ML researchers and data scientists follow a standard recipe to obtain models that are useful for prediction problems. We will see why this is necessary in the following sections, but it is useful to present the recipe up front to provide context. The first step in the analysis is to randomly divide the dataset D into two mutually exclusive groups Dtrain and Dtest called the training and test sets. The fact that this must be the first step should be heavily emphasized – performing some analysis (such as using the data to select important variables) before partitioning the data is a common pitfall that can lead to incorrect conclusions. Typically, the majority of the data are partitioned into the training set (e.g., 90%) with the remainder going into the test set. The model is fit by minimizing the cost function using only the data in the training set, θ̂ = argmin_θ {C(ytrain, f(Xtrain; θ))}. Finally, the performance of the model is evaluated by computing the cost function using the test set, C(ytest, f(Xtest; θ̂)). The value of the cost function for the best fit model on the training set is called the in-sample error, Ein = C(ytrain, f(Xtrain; θ̂)), and the value of the cost function on the test set is called the out-of-sample error, Eout = C(ytest, f(Xtest; θ̂)).

One of the most important observations we can make is that the out-of-sample error is almost always greater than the in-sample error, i.e. Eout ≥ Ein. We explore this point further in Sec. VI and its accompanying notebook. Splitting the data into mutually exclusive training and test sets provides an unbiased estimate for the predictive performance of the model – this is known as cross-validation in the ML and statistics literature. In many applications of classical statistics, we start with a mathematical model that we assume to be true (e.g., we may assume that Hooke's law is true if we are observing a mass-spring system) and our goal is to estimate the value of some unknown model parameters (e.g., we do not know the value of the spring stiffness). Problems in ML, by contrast, typically involve inference about complex systems where we do not know the exact form of the mathematical model that describes the system. Therefore, it is not uncommon for ML researchers to have multiple candidate models that need to be compared. This comparison is usually done using Eout; the model that minimizes this out-of-sample error is chosen as the best model (i.e. model selection). Note that once we select the best model on the basis of its performance on Eout, the real-world performance of the winning model should be expected to be slightly worse because the test data was now used in the fitting procedure.
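The recipe above maps directly onto a few lines of scikit-learn, one of the Python packages used throughout the accompanying notebooks. The sketch below is our own minimal illustration (the toy linear dataset and the 90/10 split are assumptions, not choices made in the review): split the data first, fit by least squares on the training set only, then report Ein and Eout.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Toy dataset D = (X, y): y depends linearly on x plus Gaussian noise.
X = rng.uniform(0, 1, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, size=200)

# Step 1: randomly split D into mutually exclusive training and test sets
# *before* doing any other analysis (here 90% / 10%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Step 2: fit the model by minimizing the squared-error cost on the training set only.
model = LinearRegression().fit(X_train, y_train)

# Step 3: evaluate the in-sample and out-of-sample errors.
E_in = mean_squared_error(y_train, model.predict(X_train))
E_out = mean_squared_error(y_test, model.predict(X_test))
print(f"E_in = {E_in:.4f},  E_out = {E_out:.4f}")   # typically E_out >= E_in
```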
B. Polynomial Regression

In the previous section, we mentioned that multiple candidate models are typically compared using the out-of-sample error Eout. It may be at first surprising that the model that has the lowest out-of-sample error Eout usually does not have the lowest in-sample error Ein. Therefore, if our goal is to obtain a model that is useful for prediction, we may not want to choose the model that provides the best explanation for the current observations. At first glance, the observation that the model providing the best explanation for the current dataset probably will not provide the best explanation for future datasets is very counter-intuitive.

Moreover, the discrepancy between Ein and Eout becomes more and more important as the complexity of our data, and of the models we use to make predictions, grows. As the number of parameters in the model increases, we are forced to work in high-dimensional spaces. The "curse of dimensionality" ensures that many phenomena that are absent or rare in low-dimensional spaces become generic. For example, the nature of distance changes in high dimensions, as evidenced in the derivation of the Maxwell distribution in statistical physics where the fact
[Figure 1 (four panels): fits to the training data (left, y vs. x) and predictions on the test data (right) for linear, third-order, and tenth-order polynomial fits, for data generated without noise from a linear model (top) and a tenth-order polynomial (bottom).]

FIG. 1 Fitting versus predicting for noiseless data. Ntrain = 10 points in the range x ∈ [0, 1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on Ntest = 20 new data points with xtest ∈ [0, 1.2] (shown on right). Notice that in the absence of noise (σ = 0), given enough data points, fitting and predicting are identical.

that all the volume of a d-dimensional sphere of radius r is contained in a small spherical shell around r is exploited. Almost all critical points of a function (i.e., the points where all derivatives vanish) are saddles rather than maxima or minima (an observation first made in physics in the context of the p-spin spherical spin glass). For all these reasons, it turns out that for complicated models studied in ML, predicting and fitting are very different things (Bickel et al., 2006).

To develop some intuition about why we need to pay close attention to out-of-sample performance, we will consider a simple one-dimensional problem – polynomial regression. Our task is a simple one: fitting data with polynomials of different order. We will explore how our ability to predict depends on the number of data points we have, the "noise" in the data generation process, and our prior knowledge about the system. The goal is to build intuition about why prediction is difficult in preparation for introducing general strategies that overcome these difficulties.

Before reading the rest of the section, we strongly encourage the reader to read Notebook 1 and complete the accompanying exercises.

Consider a probabilistic process that assigns a label y_i to an observation x_i. The data are generated by drawing samples from the equation

y_i = f(x_i) + η_i,    (1)

where f(x_i) is some fixed (but possibly unknown) function, and η_i is a Gaussian, uncorrelated noise variable,
[Figure 2 (four panels): fits to the noisy training data (left, y vs. x) and predictions on the test data (right) for linear, third-order, and tenth-order polynomial fits, for data generated from a linear model (top) and a tenth-order polynomial (bottom).]

FIG. 2 Fitting versus predicting for noisy data. Ntrain = 100 noisy data points (σ = 1) in the range x ∈ [0, 1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on Ntest = 20 new data points with xtest ∈ [0, 1.2] (shown on right). Notice that even when the data was generated using a tenth-order polynomial, the linear and third-order polynomials give better out-of-sample predictions, especially beyond the x range over which the model was trained.

such that

⟨η_i⟩ = 0,
⟨η_i η_j⟩ = δ_ij σ².

We will refer to the f(x_i) as the function used to generate the data, and σ as the noise strength. The larger σ is, the noisier the data; σ = 0 corresponds to the noiseless case.

To make predictions, we will consider a family of functions f_α(x; θ_α) that depend on some parameters θ_α. These functions represent the model class that we are using to model the data and make predictions. Note that we choose the model class without knowing the function f(x). The f_α(x; θ_α) encode the features we choose to represent the data. In the case of polynomial regression we will consider three different model classes: (i) all polynomials of order 1, which we denote by f_1(x; θ_1), (ii) all polynomials up to order 3, which we denote by f_3(x; θ_3), and (iii) all polynomials of order 10, f_10(x; θ_10). Notice that these three model classes contain different numbers of parameters. Whereas f_1(x; θ_1) has only two parameters (the coefficients of the zeroth and first order terms in the polynomial), f_3(x; θ_3) and f_10(x; θ_10) have four and eleven parameters, respectively. This reflects the fact that these three models have different model complexities. If we think of each term in the polynomial as a "feature" in our model, then increasing the order of the polynomial we fit increases the number of features. Using a more complex model class may give us better predic-
tive power, but only if we have a large enough sample size to accurately learn the model parameters associated with these extra features from the training dataset.

To learn the parameters θ_α, we will train our models on a training dataset and then test the effectiveness of the model on a different dataset, the test dataset. Since we are interested only in gaining intuition, we will simply plot the fitted polynomials and compare the predictions of our fits for the test data with the true values. As we will see below, the models that give the best fit to existing data do not necessarily make the best predictions even for a simple task like polynomial regression.

To illustrate these ideas, we encourage the reader to experiment with the accompanying notebook to generate data using a linear function f(x) = 2x and a tenth-order polynomial f(x) = 2x − 10x^5 + 15x^10 and ask how the size of the training dataset Ntrain and the noise strength σ affect the ability to make predictions. Obviously, more data and less noise lead to better predictions. To train the models (linear, third-order, tenth-order), we uniformly sampled the interval x ∈ [0, 1] and constructed Ntrain training examples using (1). We then fit the models on these training samples using standard least-squares regression. To visualize the performance of the three models, we plot the predictions using the best fit parameters for a test set where x are drawn uniformly from the interval x ∈ [0, 1.2]. Notice that the test interval is slightly larger than the training interval.

Figure 1 shows the results of this procedure for the noiseless case, σ = 0. Even using a small training set with Ntrain = 10 examples, we find that the model class that generated the data also provides the best fit and the most accurate out-of-sample predictions. That is, the linear model performs the best for data generated from a linear polynomial (the third and tenth order polynomials perform similarly), and the tenth-order model performs the best for data generated from a tenth-order polynomial. While this may be expected, the results are quite different for larger noise strengths.

Figure 2 shows the results of the same procedure for noisy data, σ = 1, and a larger training set, Ntrain = 100. As in the noiseless case, the tenth-order model provides the best fit to the data (i.e., the lowest Ein). In contrast, the tenth-order model now makes the worst out-of-sample predictions (i.e., the highest Eout). Remarkably, this is true even if the data were generated using a tenth-order polynomial.

At small sample sizes, noise can create fluctuations in the data that look like genuine patterns. Simple models (like a linear function) cannot represent complicated patterns in the data, so they are forced to ignore the fluctuations and to focus on the larger trends. Complex models with many parameters, such as the tenth-order polynomial in our example, can capture both the global trends and noise-generated patterns at the same time. In this case, the model can be tricked into thinking that the noise encodes real information. This problem is called "overfitting" and leads to a steep drop-off in predictive performance.

We can guard against overfitting in two ways: we can use less expressive models with fewer parameters, or we can collect more data so that the likelihood that the noise appears patterned decreases. Indeed, when we increase the size of the training data set by two orders of magnitude to Ntrain = 10^4 (see Figure 3), the tenth-order polynomial clearly gives both the best fits and the most predictive power over the entire training range x ∈ [0, 1], and even slightly beyond to approximately x ≈ 1.05. This is our first experience with what is known as the bias-variance tradeoff, c.f. Sec. III.B. When the amount of training data is limited as it is when Ntrain = 100, one can often get better predictive performance by using a less expressive model (e.g., a lower-order polynomial) rather than the more complex model (e.g., the tenth-order polynomial). The simpler model has more "bias" but is less dependent on the particular realization of the training dataset, i.e. less "variance". Finally, we note that even with ten thousand data points, the model's performance quickly degrades beyond the original training data range. This demonstrates the difficulty of predicting beyond the training data we mentioned earlier.

This simple example highlights why ML is so difficult and holds some universal lessons that we will encounter repeatedly in this review:

• Fitting is not predicting. Fitting existing data well is fundamentally different from making predictions about new data.

• Using a complex model can result in overfitting. Increasing a model's complexity (i.e., the number of fitting parameters) will usually yield better results on the training data. However, when the training data size is small and the data are noisy, this results in overfitting and can substantially degrade the predictive performance of the model.

• For complex datasets and small training sets, simple models can be better at prediction than complex models due to the bias-variance tradeoff. It takes less data to train a simple model than a complex one. Therefore, even though the correct model is guaranteed to have better predictive performance for an infinite amount of training data (less bias), the training errors stemming from finite-size sampling (variance) can cause simpler models to outperform the more complex model when sampling is limited.

• It is difficult to generalize beyond the situations encountered in the training data set.
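A compact version of the experiment described in this section can be written with NumPy's polynomial least-squares routines. The sketch below is our own (the review's Notebook 1 should be consulted for the full analysis); the particular seed, Ntrain, and σ values are assumptions the reader is encouraged to vary, mirroring Figs. 1-3.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Tenth-order polynomial used to generate the data in the experiment above
    return 2 * x - 10 * x**5 + 15 * x**10

N_train, sigma = 100, 1.0                      # try 10, 100, 10**4 and sigma = 0 or 1
x_train = rng.uniform(0, 1, N_train)
y_train = f(x_train) + sigma * rng.normal(size=N_train)

x_test = rng.uniform(0, 1.2, 20)               # test interval slightly larger than [0, 1]
y_test = f(x_test) + sigma * rng.normal(size=20)

for order in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=order)          # least-squares fit
    E_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    E_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}:  E_in = {E_in:8.3f}   E_out = {E_out:8.3f}")
```

With σ = 1 and Ntrain = 100 the tenth-order fit typically has the lowest E_in but the highest E_out, reproducing the overfitting behavior of Fig. 2.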
[Figure 3 (two panels): fits to the noisy training data (left, y vs. x) and predictions on the test data (right) for linear, third-order, and tenth-order polynomial fits.]

FIG. 3 Fitting versus predicting for noisy data. Ntrain = 10^4 noisy data points (σ = 1) in the range x ∈ [0, 1] were generated from a tenth-order polynomial. This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on Ntest = 100 new data points with xtest ∈ [0, 1.2] (shown on right). The tenth-order polynomial gives good predictions but the model's predictive power quickly degrades beyond the training data range.

III. BASICS OF STATISTICAL LEARNING THEORY

In this section, we briefly summarize and discuss the sense in which learning is possible, with a focus on supervised learning. We begin with an unknown function y = f(x) and fix a hypothesis set H consisting of all functions we are willing to consider, defined also on the domain of f. This set may be uncountably infinite (e.g. if there are real-valued parameters to fit). The choice of which functions to include in H usually depends on our intuition about the problem of interest. The function f(x) produces a set of pairs (x_i, y_i), i = 1 . . . N, which serve as the observable data. Our goal is to select a function from the hypothesis set h ∈ H that approximates f(x) as best as possible, namely, we would like to find h ∈ H such that h ≈ f in some strict mathematical sense which we specify below. If this is possible, we say that we learned f(x). But if the function f(x) can, in principle, take any value on unobserved inputs, how is it possible to learn in any meaningful sense?

The answer is that learning is possible in the restricted sense that the fitted model will probably perform approximately as well on new data as it did on the training data. Once an appropriate error function E is chosen for the problem under consideration (e.g. the sum of squared errors in linear regression), we can define two distinct performance measures of interest: the in-sample error, Ein, and the out-of-sample or generalization error, Eout. Recall from Sec. II that both metrics are required due to the distinction between fitting and predicting.

This raises a natural question: Can we say something general about the relationship between Ein and Eout? Surprisingly, the answer is 'Yes'. We can in fact say quite a bit. This is the domain of statistical learning theory, and we give a brief overview of the main results in this section. Our goal is to briefly introduce some of the major ideas from statistical learning theory because of the important role they have played in shaping how we think about machine learning. However, this is a highly technical and theoretical field, so we will just skim over some introductory topics. A more thorough introduction to statistical learning theory can be found in the introductory textbook by Abu Mostafa (Abu-Mostafa et al., 2012).

A. Three simple schematics that summarize the basic intuitions from Statistical Learning Theory

The basic intuitions of statistical learning can be summarized in three simple schematics. The first schematic, shown in Figure 4, shows the typical out-of-sample error, Eout, and in-sample error, Ein, as a function of the amount of training data. In making this graph, we have assumed that the true data is drawn from a sufficiently complicated distribution, so that we cannot exactly learn the function f(x). Hence, after a quick initial drop (not shown in figure), the in-sample error will increase with the number of data points, because our models are not powerful enough to learn the true function we are seeking to approximate. In contrast, the out-of-sample error will decrease with the number of data points. As the number of data points gets large, the sampling noise decreases and the training data set becomes more representative of the true distribution from which the data is drawn. For this reason, in the infinite data limit, the in-sample and out-of-sample errors must approach the same value, which is called the "bias" of our model.

[Figure 4: schematic plot of error versus number of data points, with Eout decreasing and Ein increasing toward a common asymptote (the bias); the gap between them is the variance.]

FIG. 4 Schematic of typical in-sample and out-of-sample error as a function of training set size. The typical in-sample or training error, Ein, out-of-sample or generalization error, Eout, bias, variance, and difference of errors as a function of the number of training data points. The schematic assumes that the number of data points is large (in particular, the schematic does not show the initial drop in Ein for small amounts of data), and that our model cannot exactly fit the true function f(x).

The bias represents the best our model could do if we had an infinite amount of training data to beat down sampling noise. The bias is a property of the kind of functions, or model class, we are using to approximate f(x). In general, the more complex the model class we use, the smaller the bias. However, we do not generally have an infinite amount of data. For this reason, to get the best predictive power it is better to minimize the out-of-sample error, Eout, rather than the bias. As shown in Figure 4, Eout can be naturally decomposed into a bias, which measures how well we can hypothetically do in the infinite data limit, and a variance, which measures the typical errors introduced in training our model due to sampling noise from having a finite training set.

The final quantity shown in Figure 4 is the difference between the generalization and training error. It measures how well our in-sample error reflects the out-of-sample error, and measures how much worse we would do on a new data set compared to our training data. For this reason, the difference between these errors is precisely the quantity that measures the difference between fitting and predicting. Models with a large difference between the in-sample and out-of-sample errors are said to "overfit" the data. One of the lessons of statistical learning theory is that it is not enough to simply minimize the training error, because the out-of-sample error can still be large. As we will see in our discussion of regression in Sec. VI, this insight naturally leads to the idea of "regularization".

[Figure 5: schematic plot of Eout versus model complexity, decomposed into a bias that decreases and a variance that increases with complexity, with the optimum at intermediate complexity.]

FIG. 5 Bias-Variance tradeoff and model complexity. This schematic shows the typical out-of-sample error Eout as a function of the model complexity for a training dataset of fixed size. Notice how the bias always decreases with model complexity, but the variance, i.e. the fluctuation in performance due to finite size sampling effects, increases with model complexity. Thus, optimal performance is achieved at intermediate levels of model complexity.

The second schematic, shown in Figure 5, shows the out-of-sample, or test, error Eout as a function of "model complexity". Model complexity is a very subtle idea and defining it precisely is one of the great achievements of statistical learning theory. In many cases, model complexity is related to the number of parameters we are using to approximate the true function f(x) (there are, of course, exceptions; one neat example in the context of one-dimensional regression is given in (Friedman et al., 2001), Figure 7.5). In the example of polynomial regression discussed above, higher-order polynomials are more complex than the linear model. If we consider a training dataset of a fixed size, Eout will be a non-monotonic function of the model complexity, and is generally minimized for models with intermediate complexity. The underlying reason for this is that, even though using a more complicated model always reduces the bias, at some point the model becomes too complex for the amount of training data and the generalization error becomes large due to high variance. Thus, to minimize Eout and maximize our predictive power, it may be more suitable to use a more bi-
ased model with small variance than a less-biased model with large variance. This important concept is commonly called the bias-variance tradeoff and gets at the heart of why machine learning is difficult.

Another way to visualize the bias-variance tradeoff is shown in Figure 6. In this figure, we imagine training a complex model (shown in green) and a simpler model (shown in black) many times on different training sets of a fixed size N. Due to the sampling noise from having finite size data sets, the learned models will differ for each choice of training sets. In general, more complex models need a larger amount of training data. For this reason, the fluctuations in the learned models (variance) will be much larger for the more complex model than the simpler model. However, if we consider the asymptotic performance as we increase the size of the training set (the bias), it is clear that the complex model will eventually perform better than the simpler model. Thus, depending on the amount of training data, it may be more favorable to use a less complex, high-bias model to make predictions.

[Figure 6: schematic scatter of many trained models around the true model, contrasting a high-variance, low-bias model with a low-variance, high-bias model.]

FIG. 6 Bias-Variance tradeoff. Another useful depiction of the bias-variance tradeoff is to think about how Eout varies as we consider different training data sets of a fixed size. A more complex model (green) will exhibit larger fluctuations (variance) due to finite size sampling effects than the simpler model (black). However, the average over all the trained models (bias) is closer to the true model for the more complex model.

B. Bias-Variance Decomposition

In this section, we dig further into the central principle that underlies much of machine learning: the bias-variance tradeoff. We will discuss the bias-variance tradeoff in the context of continuous predictions such as regression. However, many of the intuitions and ideas discussed here also carry over to classification tasks. Consider a dataset D = (X, y) consisting of the N pairs of independent and dependent variables. Let us assume that the true data is generated from a noisy model

y = f(x) + ε,    (2)

where ε is normally distributed with mean zero and standard deviation σ_ε.

Assume that we have a statistical procedure (e.g. least-squares regression) for forming a predictor f(x; θ̂) that gives the prediction of our model for a new data point x. This estimator is chosen by minimizing a cost function which we take to be the squared error

C(y, f(X; θ)) = Σ_i (y_i − f(x_i; θ))².    (3)

Therefore, the estimates for the parameters,

θ̂_D = argmin_θ C(y, f(X; θ)),    (4)

are a function of the dataset, D. We would obtain a different error C(y_j, f(X_j; θ̂_{D_j})) for each dataset D_j = (y_j, X_j) in a universe of possible datasets obtained by drawing N samples from the true data distribution. We denote an expectation value over all of these datasets as E_D.

We would also like to average over different instances of the "noise" ε and we denote the expectation value over the noise by E_ε. Thus, we can decompose the expected generalization error as

E_{D,ε}[C(y, f(X; θ̂_D))] = E_{D,ε}[ Σ_i (y_i − f(x_i; θ̂_D))² ]
    = E_{D,ε}[ Σ_i (y_i − f(x_i) + f(x_i) − f(x_i; θ̂_D))² ]
    = Σ_i { E_ε[(y_i − f(x_i))²] + E_{D,ε}[(f(x_i) − f(x_i; θ̂_D))²] + 2 E_ε[y_i − f(x_i)] E_D[f(x_i) − f(x_i; θ̂_D)] }
    = Σ_i ( σ_ε² + E_D[(f(x_i) − f(x_i; θ̂_D))²] ),    (5)

where in the last line we used the fact that our noise has zero mean and variance σ_ε² and the sum over i applies to all
terms. It is also helpful to further decompose the second term as follows:

E_D[(f(x_i) − f(x_i; θ̂_D))²] = E_D[{f(x_i) − E_D[f(x_i; θ̂_D)] + E_D[f(x_i; θ̂_D)] − f(x_i; θ̂_D)}²]
    = E_D[{f(x_i) − E_D[f(x_i; θ̂_D)]}²] + E_D[{f(x_i; θ̂_D) − E_D[f(x_i; θ̂_D)]}²]
      + 2 E_D[{f(x_i) − E_D[f(x_i; θ̂_D)]}{f(x_i; θ̂_D) − E_D[f(x_i; θ̂_D)]}]
    = (f(x_i) − E_D[f(x_i; θ̂_D)])² + E_D[{f(x_i; θ̂_D) − E_D[f(x_i; θ̂_D)]}²].    (6)

The first term is called the bias

Bias² = Σ_i (f(x_i) − E_D[f(x_i; θ̂_D)])²    (7)

and measures the deviation of the expectation value of our estimator (i.e. the asymptotic value of our estimator in the infinite data limit) from the true value. The second term is called the variance

Var = Σ_i E_D[(f(x_i; θ̂_D) − E_D[f(x_i; θ̂_D)])²],    (8)

and measures how much our estimator fluctuates due to finite-sample effects. Combining these expressions, we see that the expected out-of-sample error, Eout := E_{D,ε}[C(y, f(X; θ̂_D))], can be decomposed as

Eout = Bias² + Var + Noise,    (9)

with Noise = Σ_i σ_ε².

The bias-variance tradeoff summarizes the fundamental tension in machine learning, particularly supervised learning, between the complexity of a model and the amount of training data needed to train it. Since data is often limited, in practice it is often useful to use a less-complex model with higher bias – a model whose asymptotic performance is worse than another model – because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (smaller variance). This is the basic intuition behind the schematics in Figs. 4, 5, and 6.
45 IV. GRADIENT DESCENT AND ITS GENERALIZATIONS optimizing-gradient-descent/.
46
47 Almost every problem in ML and data science starts
48 with the same ingredients: a dataset X, a model g(θ), A. Gradient Descent and Newton’s method
49 which is a function of the parameters θ, and a cost func-
50 tion C(X, g(θ)) that allows us to judge how well the We begin by introducing a simple first-order gradient
51 model g(θ) explains the observations X. The model is fit descent method and comparing and contrasting it with
52 by finding the values of θ that minimize the cost function. another algorithm, Newton’s method. Newton’s method
53 In this section, we discuss one of the most powerful is intimately related to many algorithms (conjugate gra-
54 and widely used classes of methods for performing this dient, quasi-Newton methods) commonly used in physics
55
minimization – gradient descent and its generalizations. for optimization problems. Denote the function we wish
56
The basic idea behind these methods is straightforward: to minimize by E(θ).
57
58 iteratively adjust the parametersθ in the direction where In the context of machine learning, E(θ) is just the
59 the gradient of the cost function is large and negative. cost function E(θ) = C(X, g(θ)). As we shall see for
60 In this way, the training procedure ensures the parame- linear and logistic regression in Secs. VI, VII, this energy
61 ters flow towards a local minimum of the cost function. function can almost always be written as a sum over n
62
63
64
65
data points,

E(θ) = ∑_{i=1}^{n} e_i(x_i, θ).   (10)

For example, for linear regression e_i is just the mean square error for data point i; for logistic regression, it is the cross-entropy. To make analogy with physical systems, we will often refer to this function as the "energy".

[Figure 7 about here. Axes: x versus y; trajectories for η = 0.1, 0.5, 1, 1.01 on the surface z = x² + y² − 1.]
FIG. 7 Gradient descent exhibits three qualitatively different regimes as a function of the learning rate. Result of gradient descent on the surface z = x² + y² − 1 for learning rates η = 0.1, 0.5, 1.01. Notice that the trajectory converges to the global minimum in multiple steps for small learning rates (η = 0.1). Increasing the learning rate further (η = 0.5) causes the trajectory to oscillate around the global minimum before converging. For even larger learning rates (η = 1.01) the trajectory diverges from the minimum. See corresponding notebook for details.

In the simplest gradient descent (GD) algorithm, we update the parameters as follows. Initialize the parameters to some value θ₀ and iteratively update the parameters according to the equation

v_t = η_t ∇_θ E(θ_t),
θ_{t+1} = θ_t − v_t,   (11)

where ∇_θ E(θ) is the gradient of E(θ) w.r.t. θ and we have introduced a learning rate, η_t, that controls how big a step we should take in the direction of the gradient at time step t. It is clear that for a sufficiently small choice of the learning rate η_t this method will converge to a local minimum (in all directions) of the cost function. However, choosing a small η_t comes at a huge computational cost. The smaller η_t, the more steps we have to take to reach the local minimum. In contrast, if η_t is too large, we can overshoot the minimum and the algorithm becomes unstable (it either oscillates or even moves away from the minimum). This is shown in Figure 7. In practice, one usually specifies a "schedule" that decreases η_t at long times. Common schedules include power law and exponential decay in time.

To better understand this behavior and highlight some of the shortcomings of GD, it is useful to contrast GD with Newton's method, which is the inspiration for many widely employed optimization methods. In Newton's method, we choose the step v for the parameters in such a way as to minimize a second-order Taylor expansion to the energy function,

E(θ + v) ≈ E(θ) + ∇_θ E(θ) v + ½ vᵀ H(θ) v,

where H(θ) is the Hessian matrix of second derivatives. Differentiating this equation with respect to v and noting that for the optimal value v_opt we expect ∇_θ E(θ + v_opt) = 0, yields the following equation

0 = ∇_θ E(θ) + H(θ) v_opt.   (12)

Rearranging this expression results in the desired update rules for Newton's method

v_t = H⁻¹(θ_t) ∇_θ E(θ_t)   (13)
θ_{t+1} = θ_t − v_t.   (14)

Since we have no guarantee that the Hessian is well conditioned, in almost all applications of Newton's method, one replaces the inverse of the Hessian H⁻¹(θ_t) by some suitably regularized pseudo-inverse such as [H(θ_t) + εI]⁻¹ with ε a small parameter (Battiti, 1992).

For the purposes of machine learning, Newton's method is not practical for two interrelated reasons. First, calculating a Hessian is an extremely expensive numerical computation. Second, even if we employ first-order approximation methods to approximate the Hessian (commonly called quasi-Newton methods), we must store and invert a matrix with n² entries, where n is the number of parameters. For models with millions of parameters such as those commonly employed in the neural network literature, this is close to impossible with present-day computational power. Despite these practical shortcomings, Newton's method gives many important intuitions about how to modify GD algorithms to improve their performance. Notice that, unlike in GD where the learning rate is the same for all parameters, Newton's method automatically "adapts" the learning rate of different parameters depending on the Hessian matrix. Since the Hessian encodes the curvature of the surface we are trying to find the minimum of – more specifically, the singular values of the Hessian are inversely proportional to the squares of the local curvatures of the surface – Newton's method automatically adjusts the step size so that one takes larger steps in flat directions with small curvature and smaller steps in steep directions with large curvature.

Our derivation of Newton's method also allows us to develop intuition about the role of the learning rate in GD. Let us first consider the special case of using GD to find the minimum of a quadratic energy function of a single parameter θ (LeCun et al., 1998b). Given the current value of our parameter θ, we can ask what is
the optimal choice of the learning rate η_opt, where η_opt is defined as the value of η that allows us to reach the minimum of the quadratic energy function in a single step (see Figure 8). To find η_opt, we expand the energy function to second order around the current value

E(θ + v) = E(θ) + ∂_θ E(θ) v + ½ ∂²_θ E(θ) v².   (15)

Differentiating with respect to v and setting θ_min = θ − v yields

θ_min = θ − [∂²_θ E(θ)]⁻¹ ∂_θ E(θ).   (16)

Comparing with (11) gives

η_opt = [∂²_θ E(θ)]⁻¹.   (17)

One can show that there are four qualitatively different regimes possible (see Fig. 8) (LeCun et al., 1998b). If η < η_opt, then GD will take multiple small steps to reach the bottom of the potential. For η = η_opt, GD reaches the bottom of the potential in a single step. If η_opt < η < 2η_opt, then the GD algorithm will oscillate across both sides of the potential before eventually converging to the minimum. However, when η > 2η_opt, the algorithm actually diverges!

[Figure 8 about here. Four panels, (A)–(D), each showing E(θ) versus θ with minimum at θ_min, for η < η_opt, η = η_opt, η > η_opt, and η > 2η_opt.]
FIG. 8 Effect of learning rate on convergence. For a one-dimensional quadratic potential, one can show that there exist four different qualitative behaviors for gradient descent (GD) as a function of the learning rate η, depending on the relationship between η and η_opt = [∂²_θ E(θ)]⁻¹. (a) For η < η_opt, GD converges to the minimum. (b) For η = η_opt, GD converges in a single step. (c) For η_opt < η < 2η_opt, GD oscillates around the minimum and eventually converges. (d) For η > 2η_opt, GD moves away from the minimum. This figure is adapted from (LeCun et al., 1998b).

It is straightforward to generalize this to the multidimensional case. The natural multidimensional generalization of the second derivative is the Hessian H(θ). We can always perform a singular value decomposition (i.e. a rotation by an orthogonal matrix for quadratic minima where the Hessian is symmetric, see Sec. VI.B for a brief introduction to SVD) and consider the singular values {λ} of the Hessian. If we use a single learning rate for all parameters, in analogy with (17), convergence requires that

η < 2/λ_max,   (18)

where λ_max is the largest singular value of the Hessian. If the minimum eigenvalue λ_min differs significantly from the largest value λ_max, then convergence in the λ_min-direction will be extremely slow! One can actually show that the convergence time scales with the condition number κ = λ_max/λ_min (LeCun et al., 1998b).

B. Limitations of the simplest gradient descent algorithm

The last section hints at some of the major shortcomings of the simple GD algorithm described in (11). Before proceeding, we briefly summarize these limitations and discuss general strategies for modifying GD to overcome these deficiencies.

• GD finds local minima of the cost function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our energy function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance. A similar problem is encountered in physics. To overcome this, physicists often use methods like simulated annealing that introduce a fictitious "temperature" which is eventually taken to zero. The "temperature" term introduces stochasticity in the form of thermal fluctuations that allow the algorithm to thermally tunnel over energy barriers. This suggests that, in the context of ML, we should modify GD to include stochasticity.

• Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the energy function is a sum of terms, with one term for each data point. For example, in linear regression, E ∝ ∑_{i=1}^{n} (y_i − wᵀ·x_i)²; for logistic regression, the square error is replaced by the cross entropy, see Secs. VI, VII. Thus, to calculate the gradient we have to sum over all n data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this, discussed below, is to calculate the gradients using small subsets of the data called "mini-batches". This has the added benefit of introducing stochasticity into our algorithm.
• GD is very sensitive to choices of the learning rates. really experiment with different methods in landscapes
3
4 As discussed above, GD is extremely sensitive to of varying complexity using the accompanying notebook.
5 the choice of learning rates. If the learning rate is
6 very small, the training process takes an extremely
long time. For larger learning rates, GD can di- C. Stochastic Gradient Descent (SGD) with mini-batches
7
8 verge and give poor results. Furthermore, depend-
ing on what the local landscape looks like, we have One of the most widely-applied variants of the gra-
9
10 to modify the learning rates to ensure convergence. dient descent algorithm is stochastic gradient descent
11 Ideally, we would “adaptively” choose the learning (SGD)(Bottou, 2012; Williams and Hinton, 1986). As
12 rates to match the landscape. the name suggests, unlike ordinary GD, the algorithm
13 is stochastic. Stochasticity is incorporated by approx-
14 • GD treats all directions in parameter space uni- imating the gradient on a subset of the data called a
15 formly. Another major drawback of GD is that minibatch 2 . The size of the minibatches is almost al-
16 unlike Newton’s method, the learning rate for GD ways much smaller than the total number of data points
17 is the same in all directions in parameter space. For n, with typical minibatch sizes ranging from ten to a
18 this reason, the maximum learning rate is set by the few hundred data points. If there are n points in total,
19 behavior of the steepest direction and this can sig- and the mini-batch size is M , there will be n/M mini-
20 nificantly slow down training. Ideally, we would like batches. Let us denote these minibatches by Bk where
21 to take large steps in flat directions and small steps k = 1, . . . , n/M . Thus, in SGD, at each gradient descent
22 in steep directions. Since we are exploring rugged step we approximate the gradient using a single mini-
23 landscapes where curvatures change, this requires batch Bk ,
24
us to keep track of not only the gradient but second n
X X
25
derivatives of the energy function (note as discussed ∇θ E(θ) = ∇θ ei (xi , θ) −→ ∇θ ei (xi , θ). (19)
26
27 above, the ideal scenario would be to calculate the i=1 i∈Bk

28 Hessian but this proves to be too computationally We then cycle over all k = 1, . . . , n/M minibatches one
29 expensive). at a time, and use the mini-batch approximation to the
30 gradient to update the parameters θ at every step k. A
31 • GD is sensitive to initial conditions. One conse- full iteration over all n data points – in other words using
32 quence of the local nature of GD is that initial con- all n/M minibatches – is called an epoch. For notational
33 ditions matter. Depending on where one starts, one convenience, we will denote the mini-batch approxima-
34 will end up at a different local minimum. There- tion to the gradient by
35 fore, it is very important to think about how one X
36 initializes the training process. This is true for GD ∇θ E M B (θ) = ∇θ ei (xi , θ). (20)
37 as well as more complicated variants of GD intro- i∈Bk
38 duced below. With this notation, we can rewrite the SGD algorithm as
39
40 • GD can take exponential time to escape saddle vt = ηt ∇θ E M B (θ),
41 points, even with random initialization. As we men- θt+1 = θt − vt . (21)
42 tioned, GD is extremely sensitive to the initial con- Thus, in SGD, we replace the actual gradient over the
43 dition since it determines the particular local min- full data at each gradient descent step by an approxima-
44 imum GD would eventually reach. However, even tion to the gradient computed using a minibatch. This
45 with a good initialization scheme, through random- has two important benefits. First, it introduces stochas-
46 ness (to be introduced later), GD can still take ex- ticity and decreases the chance that our fitting algorithm
47 ponential time to escape saddle points, which are
48 gets stuck in isolated local minima. Second, it signifi-
prevalent in high-dimensional spaces, even for non- cantly speeds up the calculation as one does not have
49 pathological objective functions (Du et al., 2017).
50 to use all n data points to approximate the gradient.
Indeed, there are modified GD methods developed Empirical and theoretical work suggests that SGD has
51
recently to accelerate the escape. The details of additional benefits. Chief among these is that introduc-
52
53 these boosted method are beyond the scope of this ing stochasticity is thought to act as a natural regular-
54 review, and we refer avid readers to (Jin et al., izer that prevents overfitting in deep, isolated minima
55 2017) for details. (Bishop, 1995a; Keskar et al., 2016).
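To make the learning-rate discussion around Eqs. (11) and (17) and Fig. 8 concrete, here is a minimal, self-contained Python sketch (our own illustrative code, not the accompanying notebook) that runs plain gradient descent on a one-dimensional quadratic energy and prints how the final iterate depends on the ratio η/η_opt.

```python
import numpy as np

def grad_descent_1d(kappa=2.0, theta0=1.0, eta=0.1, n_steps=50):
    """Plain gradient descent, Eq. (11), on E(theta) = 0.5 * kappa * theta**2."""
    theta = theta0
    traj = [theta]
    for _ in range(n_steps):
        grad = kappa * theta          # dE/dtheta for the quadratic energy
        theta = theta - eta * grad    # GD update
        traj.append(theta)
    return np.array(traj)

kappa = 2.0
eta_opt = 1.0 / kappa  # [d^2 E/dtheta^2]^{-1}, cf. Eq. (17)
for eta in [0.5 * eta_opt, eta_opt, 1.5 * eta_opt, 2.5 * eta_opt]:
    traj = grad_descent_1d(kappa=kappa, eta=eta)
    print(f"eta/eta_opt = {eta/eta_opt:.1f}: final |theta| = {abs(traj[-1]):.2e}")
```

Running this reproduces the four regimes of Fig. 8: monotone convergence, one-step convergence, damped oscillation, and divergence once η > 2η_opt.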
56
In the next few subsections, we will introduce variants
57
58 of GD that address many of these shortcomings. These 2 Traditionally, SGD was reserved for the case where you train on
59 generalized gradient descent methods form the backbone a single example – in other words minibatches of size 1. However,
60 of much of modern deep learning and neural networks, we will use SGD to mean any approximation to the gradient on
see Sec IX. For this reason, the reader is encouraged to a subset of the data.
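The mini-batch approximation of Eqs. (19)–(21) can be sketched in a few lines. The example below, with assumed synthetic data, batch size, and function names of our own choosing, performs SGD on a least-squares cost and reshuffles the data at every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd(X, y, eta=0.01, batch_size=32, n_epochs=20):
    """Minimal SGD on E(w) = sum_i (y_i - w.x_i)^2 using mini-batch gradients, Eq. (21)."""
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(n)              # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -2 * Xb.T @ (yb - Xb @ w)   # gradient estimated on the mini-batch only
            w -= eta * grad / len(idx)
        # one full pass over all n/M mini-batches is one epoch
    return w

w_sgd = sgd(X, y)
print("max |w_sgd - w_true| =", np.abs(w_sgd - w_true).max())
```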
D. Adding Momentum Thus, as the name suggests, the momentum parameter
3
4 is proportional to the mass of the particle and effec-
5 In practice, SGD is almost always used with a “mo- tively provides inertia. Furthermore, in the large vis-
6 mentum” or inertia term that serves as a memory of the cosity/small learning rate limit, our memory time scales
7 direction we are moving in parameter space. This is typ- as (1 − γ)−1 ≈ m/(µ∆t).
8 ically implemented as follows Why is momentum useful? SGD momentum helps
9 the gradient descent algorithm gain speed in directions
10 vt = γvt−1 + ηt ∇θ E(θt ) with persistent but small gradients even in the presence
11 θt+1 = θt − vt , (22) of stochasticity, while suppressing oscillations in high-
12 curvature directions. This becomes especially important
13 where we have introduced a momentum parameter γ, in situations where the landscape is shallow and flat in
14 with 0 ≤ γ ≤ 1, and for brevity we dropped the ex- some directions and narrow and steep in others. It has
15 plicit notation to indicate the gradient is to be taken been argued that first-order methods (with appropriate
16 over a different mini-batch at each step. We call this al- initial conditions) can perform comparable to more ex-
17 gorithm gradient descent with momentum (GDM). From pensive second order methods, especially in the context
18 these equations, it is clear that vt is a running average of complex deep learning models (Sutskever et al., 2013).
19 of recently encountered gradients and (1 − γ)−1 sets the Empirical studies suggest that the benefits of including
20
characteristic time scale for the memory used in the av- momentum are especially pronounced in complex models
21
eraging procedure. Consistent with this, when γ = 0, in the initial “transient phase” of training, rather than
22
23 this just reduces down to ordinary SGD as described in during a subsequent fine-tuning of a coarse minimum.
24 Eq. (21). An equivalent way of writing the updates is The reason for this is that, in this transient phase, corre-
25 lations in the gradient persist across many gradient de-
26 ∆θt+1 = γ∆θt − ηt ∇θ E(θt ), (23) scent steps, accentuating the role of inertia and memory.
27 These beneficial properties of momentum can some-
28 where we have defined ∆θt = θt − θt−1 . In what should times become even more pronounced by using a slight
29 be a familiar scenario to many physicists, momentum modification of the classical momentum algorithm called
30 based methods were first introduced in old, largely for- Nesterov Accelerated Gradient (NAG) (Nesterov, 1983;
31 gotten (until recently) Soviet papers (Nesterov, 1983; Sutskever et al., 2013). In the NAG algorithm, rather
32 Polyak, 1964). than calculating the gradient at the current parameters,
33 Before proceeding further, let us try to get more in- ∇θ E(θt ), one calculates the gradient at the expected
34 tuition from these equations. It is helpful to consider a value of the parameters given our current momentum,
35 simple physical analogy with a particle of mass m moving ∇θ E(θt + γvt−1 ). This yields the NAG update rule
36 in a viscous medium with viscous damping coefficient µ
37 and potential E(w) (Qian, 1999). If we denote the par- vt = γvt−1 + ηt ∇θ E(θt + γvt−1 )
38 ticle’s position by w, then its motion is described by θt+1 = θt − vt . (28)
39
40 One of the major advantages of NAG is that it allows for
d2 w dw
41 m +µ = −∇w E(w). (24) the use of a larger learning rate than GDM for the same
dt2 dt
42 choice of γ.
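A minimal sketch of the momentum update, Eq. (22), with an optional Nesterov-style look-ahead in the spirit of Eq. (28), is given below; the quadratic test energy, step sizes, and function names are illustrative assumptions on our part, not the notebook's code.

```python
import numpy as np

def gdm_step(theta, v, grad_fn, eta=0.05, gamma=0.9, nesterov=False):
    """One update of gradient descent with momentum (Eq. 22) or its Nesterov variant."""
    if nesterov:
        # gradient at the look-ahead point implied by the update theta_{t+1} = theta_t - v_t
        g = grad_fn(theta - gamma * v)
    else:
        g = grad_fn(theta)
    v_new = gamma * v + eta * g
    return theta - v_new, v_new

# toy example: quadratic energy with one steep and one shallow direction
A = np.diag([10.0, 1.0])
grad_fn = lambda th: A @ th
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    theta, v = gdm_step(theta, v, grad_fn, nesterov=True)
print("theta after 100 momentum steps:", theta)
```

Setting gamma = 0 recovers plain (S)GD, while increasing gamma lets the running average of gradients build up speed along the shallow direction.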
43 We can discretize this equation in the usual way to get
44
45 wt+∆t − 2wt + wt−∆t wt+∆t − wt E. Methods that use the second moment of the gradient
46 m +µ = −∇w E(w).
(∆t)2 ∆t
47 (25) In stochastic gradient descent, with and without mo-
48 Rearranging this equation, we can rewrite this as mentum, we still have to specify a “schedule” for tuning
49 the learning rate ηt as a function of time. As discussed in
50 (∆t)2 m the context of Newton’s method, this presents a number
51 ∆wt+∆t = − ∇w E(w) + ∆wt . (26) of dilemmas. The learning rate is limited by the steepest
52 m + µ∆t m + µ∆t
direction which can change depending on the current po-
53 sition in the landscape. To circumvent this problem, ide-
Notice that this equation is identical to Eq. (23) if we
54 ally our algorithm would keep track of curvature and take
55 identify the position of the particle, w, with the parame-
ters θ. This allows us to identify the momentum param- large steps in shallow, flat directions and small steps in
56
eter and learning rate with the mass of the particle and steep, narrow directions. Second-order methods accom-
57
58 the viscous damping as: plish this by calculating or approximating the Hessian
59 and normalizing the learning rate by the curvature. How-
60 m (∆t)2 ever, this is very computationally expensive for models
γ= , η= . (27) with extremely large number of parameters. Ideally, we
61 m + µ∆t m + µ∆t
would like to be able to adaptively change the step size respectively, and (βj )t denotes βj to the power t. The
3
4 to match the landscape without paying the steep compu- parameters η and  have the same role as in RMSprop.
5 tational price of calculating or approximating Hessians. Like in RMSprop, the effective step size of a parameter
6 Recently, a number of methods have been introduced depends on the magnitude of its gradient squared. To
7 that accomplish this by tracking not only the gradient, understand this better, let us rewrite this expression in
8 but also the second moment of the gradient. These terms of the variance σt2 = ŝt − (m̂t )2 . Consider a single
9 methods include AdaGrad (Duchi et al., 2011), AdaDelta parameter θt . The update rule for this parameter is given
10 (Zeiler, 2012), RMSprop (Tieleman and Hinton, 2012), by
11 and ADAM (Kingma and Ba, 2014). Here, we discuss
12 the last two as representatives of this class of algorithms. m̂t
∆θt+1 = −ηt p . (32)
13 In RMSprop, in addition to keeping a running average σt2 + m̂2t + 
14 of the first moment of the gradient, we also keep track of
15 the second moment denoted by st = E[gt2 ]. The update We now examine different limiting cases of this expres-
16 rule for RMSprop is given by sion. Assume that our gradient estimates are consistent
17 so that the variance is small. In this case our update
18 gt = ∇θ E(θ) (29) rule tends to ∆θt+1 → −ηt (here we have assumed that
19 st = βst−1 + (1 − β)gt2 m̂t  ). This is equivalent to cutting off large persis-
20 gt tent gradients at 1 and limiting the maximum step size
θt+1 = θt − ηt √ ,
21 st +  in steep directions. On the other hand, imagine that the
22 gradient is widely fluctuating between gradient descent
23 where β controls the averaging time of the second mo-
steps. In this case σ 2  m̂2t so that our update becomes
24 ment and is typically taken to be about β = 0.9, ηt is a
∆θt+1 → −ηt m̂t /σt . In other words, we adapt our learn-
25 learning rate typically chosen to be 10−3 , and  ∼ 10−8
ing rate so that it is proportional to the signal-to-noise
26 is a small regularization constant to prevent divergences.
ratio (i.e. the mean in units of the standard deviation).
27 Multiplication and division by vectors is understood as
From a physics standpoint, this is extremely desirable:
28 an element-wise operation. It is clear from this formula
the standard deviation serves as a natural adaptive scale
29 that the learning rate is reduced in directions where the
for deciding whether a gradient is large or small. Thus,
30 gradient is consistently large. This greatly speeds up the
ADAM has the beneficial effects of (i) adapting our step
31 convergence by allowing us to use a larger learning rate
size so that we cut off large gradient directions (and hence
32 for flat directions.
33 prevent oscillations and divergences), and (ii) measuring
A related algorithm is the ADAM optimizer. In
34 gradients in terms of a natural length scale, the stan-
ADAM, we keep a running average of both the first and
35 dard deviation σt . The discussion above also explains
second moment of the gradient and use this information
36 empirical observations showing that the performance of
to adaptively change the learning rate for different pa-
37 both ADAM and RMSprop is drastically reduced if the
rameters. In addition to keeping a running average of the
38 square root is omitted in the update rule. It is also
first and second moments of the gradient (i.e. mt = E[gt ]
39 worth noting that recent studies have shown adaptive
and st = E[gt2 ], respectively), ADAM performs an addi-
40 methods like RMSProp, ADAM, and AdaGrad to gener-
41 tional bias correction to account for the fact that we are
alize worse than SGD in classification tasks, though they
42 estimating the first two moments of the gradient using a
achieve smaller training error. Such discussion is beyond
43 running average (denoted by the hats in the update rule
the scope of this review so we refer readers to (Wilson
44 below). The update rule for ADAM is given by (where
et al., 2017) for more details.
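The RMSprop and ADAM updates of Eqs. (29)–(31) translate almost line by line into code. The sketch below is a schematic, stand-alone implementation using the typical hyper-parameter values quoted in the text; the function names and toy usage are our own assumptions, not the accompanying notebook.

```python
import numpy as np

def rmsprop_step(theta, s, grad, eta=1e-3, beta=0.9, eps=1e-8):
    """One RMSprop update, Eq. (29): running average s of the squared gradient."""
    s = beta * s + (1 - beta) * grad**2
    theta = theta - eta * grad / (np.sqrt(s) + eps)
    return theta, s

def adam_step(theta, m, s, t, grad, eta=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One ADAM update, Eqs. (30)-(31), with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction, t = 1, 2, ...
    s_hat = s / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

# toy usage on E(theta) = 0.5 * |theta|^2, whose gradient is simply theta
theta = np.array([1.0, -2.0])
m, s = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    theta, m, s = adam_step(theta, m, s, t, grad=theta)
print("theta after 2000 ADAM steps:", theta)
```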
45 multiplication and division are once again understood to
46 be element-wise operations)
47
gt = ∇θ E(θ) (30) F. Comparison of various methods
48
49 mt = β1 mt−1 + (1 − β1 )gt
To better understand these methods, it is helpful to
50 st = β2 st−1 + (1 − β2 )gt2 visualize the performance of the five methods discussed
51 mt
52 m̂t = above – gradient descent (GD), gradient descent with
1 − (β1 )t momentum (GDM), NAG, ADAM, and RMSprop. To
53 st
54 ŝt = do so, we will use Beale’s function:
1 − (β2 )t
55
m̂t f (x, y) = (1.5 − x + xy)2 (33)
56 θt+1 = θt − ηt √ , 2 2 3 2
57 ŝt +  +(2.25 − x + xy ) + (2.625 − x + xy ) .
58 (31)
59 This function has a global minimum at (x, y) = (3, 0.5)
60 where β1 and β2 set the memory lifetime of the first and and an interesting structure that can be seen in Fig. 9.
61 second moment and are typically taken to be 0.9 and 0.99 The figure shows the results of using all five methods
4 is always important to randomly shuffle the data
3
4 when forming mini-batches. Otherwise, the gra-
5 2 dient descent method can fit spurious correlations
6 resulting from the order in which data is presented.
GD
7 0
y

8
GDM • Transform your inputs. As we discussed above,
9 NAG learning becomes difficult when our landscape has
−2 a mixture of steep and flat directions. One simple
10 RMS
11 ADAMS trick for minimizing these situations is to standard-
12 −4 ize the data by subtracting the mean and normaliz-
13 −4 −2 0 2 4 ing the variance of input variables. Whenever pos-
14 x sible, also decorrelate the inputs. To understand
15 why this is helpful, consider the case of linear re-
FIG. 9 Comparison of GD and its generalization for
16 gression. It is easy to show that for the squared
Beale’s function. Trajectories from gradient descent (GD;
17 black line), gradient descent with momentum (GDM; magenta error cost function, the Hessian of the energy ma-
18 line), NAG (cyan-dashed line), RMSprop (blue dash-dot line), trix is just the correlation matrix between the in-
19 and ADAM (red line) for Nsteps = 104 . The learning rate puts. Thus, by standardizing the inputs, we are
20 for GD, GDM, NAG is η = 10−6 and η = 10−3 for ADAM ensuring that the landscape looks homogeneous in
21 and RMSprop. β = 0.9 for RMSprop, β1 = 0.9 and β2 = all directions in parameter space. Since most deep
22 0.99 for ADAM, and  = 10−8 for both methods. Please see networks can be viewed as linear transformations
23 corresponding notebook for details.
followed by a non-linearity at each layer, we expect
24
this intuition to hold beyond the linear case.
25
26 for Nsteps = 104 steps for three different initial condi- • Monitor the out-of-sample performance. Always
27 tions. In the figure, the learning rate for GD, GDM, and monitor the performance of your model on a valida-
28 NAG are set to η = 10−6 whereas RMSprop and ADAM tion set (a small portion of the training data that is
29 have a learning rate of η = 10−3 . The learning rates held out of the training process to serve as a proxy
30 for RMSprop and ADAM can be set significantly higher for the test set – see Sec. XI for more on validation
31 than the other methods due to their adaptive step sizes. sets). If the validation error starts increasing, then
32 For this reason, ADAM and RMSprop tend to be much
33 the model is beginning to overfit. Terminate the
quicker at navigating the landscape than simple momen- learning process. This early stopping significantly
34 tum based methods (see Fig. 9). Notice that in some
35 improves performance in many settings.
cases (e.g. initial condition of (−1, 4)), the trajectories
36
do not find the global minimum but instead follow the • Adaptive optimization methods do not always have
37
38 deep, narrow ravine that occurs along y = 1. This kind of good generalization. As we mentioned, recent stud-
39 landscape structure is generic in high-dimensional spaces ies have shown that adaptive methods such as
40 where saddle points proliferate. Once again, the adaptive ADAM, RMSprop, and AdaGrad tend to have poor
41 step size and momentum of ADAM and RMSprop allows generalization compared to SGD or SGD with mo-
42 these methods to traverse the landscape faster than the mentum, particularly in the high-dimensional limit
43 simpler first-order methods. The reader is encouraged to (i.e. the number of parameters exceeds the number
44 consult the corresponding Jupyter notebook and experi- of data points) (Wilson et al., 2017). Although it is
45 ment with changing initial conditions, the cost function not clear at this stage why sophisticated methods,
46 surface being minimized, and hyper-parameters to gain such as ADAM, RMSprop, and AdaGrad, perform
47 more intuition about all these methods. so well in training deep neural networks such as
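To experiment with these optimizers outside the notebook, one only needs Beale's function itself. A minimal sketch (our own code, with a simple finite-difference gradient as an assumed stand-in for an analytic one) defining Eq. (33) and checking its global minimum is:

```python
import numpy as np

def beale(x, y):
    """Beale's function, Eq. (33); global minimum at (x, y) = (3, 0.5)."""
    return ((1.5 - x + x * y)**2
            + (2.25 - x + x * y**2)**2
            + (2.625 - x + x * y**3)**2)

def beale_grad(x, y, h=1e-6):
    """Central-difference gradient, good enough for quick experimentation."""
    gx = (beale(x + h, y) - beale(x - h, y)) / (2 * h)
    gy = (beale(x, y + h) - beale(x, y - h)) / (2 * h)
    return np.array([gx, gy])

print("f(3, 0.5) =", beale(3.0, 0.5))              # should print 0.0
print("grad at (3, 0.5):", beale_grad(3.0, 0.5))   # should be numerically ~0
```

Feeding beale_grad into any of the update rules sketched above lets one reproduce trajectories qualitatively similar to those in Fig. 9.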
48 generative adversarial networks (GANs) (Goodfel-
49 low et al., 2014) [see Sec. XVII], simpler procedures
50 G. Gradient descent in practice: practical tips like properly-tuned plain SGD may work equally
51 well or better in some applications.
52 We conclude this chapter by compiling some practical
53 tips from experts for getting the best performance from
54
gradient descent based algorithms, especially in the con- V. OVERVIEW OF BAYESIAN INFERENCE
55
text of deep neural networks discussed later in the review,
56
see Secs. IX, XVI.B, IX. This section draws heavily on Statistical modeling usually revolves around estima-
57
58 best practices laid out in (Bottou, 2012; LeCun et al., tion or prediction (Jaynes, 1996). Bayesian methods are
59 1998b; Tieleman and Hinton, 2012). based on the fairly simple premise that probability can
60 be used as a mathematical framework for describing un-
61 • Randomize the data when making mini-batches. It certainty. This is not that different in spirit from the
main idea of statistical mechanics in physics, where we look at the data then we would like to select an unin-
3
4 use probability to describe the behavior of large systems formative prior that reflects our ignorance, otherwise we
5 where we cannot know the positions and momenta of all should select an informative prior that accurately reflects
6 the particles even if the system itself is fully determinis- the knowledge we have about θ. This review will focus
7 tic (at least classically). In practice, Bayesian inference on informative priors that are commonly used for ML
8 provides a set of principles and procedures for learning applications. However, there is a large literature on un-
9 from data and for describing uncertainty. In this section, informative priors, including reparameterization invari-
10 we give a gentle introduction to Bayesian inference, with ant priors, that would be of interest to physicists and
11 special emphasis on its logic (i.e. Bayesian reasoning) and we refer the interested reader to (Berger and Bernardo,
12 provide a connection to ML discussed in Sec. II and III. 1992; Gelman et al., 2014; Jaynes, 1996; Jeffreys, 1946;
13 For a technical account of Bayesian inference in general, Mattingly et al., 2018).
14 we refer readers to (Barber, 2012; Gelman et al., 2014). Using an informative prior tends to decrease the vari-
15 ance of the posterior distribution while, potentially, in-
16 creasing its bias. This is beneficial if the decrease in
17 A. Bayes Rule variance is larger than the increase in bias. In high-
18 dimensional problems, it is reasonable to assume that
19 To solve a problem using Bayesian methods, we have many of the parameters will not be strongly relevant.
20
to specify two functions: the likelihood function p(X|θ), Therefore, many of the parameters of the model will
21
which describes the probability of observing a dataset be zero or close to zero. We can express this belief
22
X for a given value of the unknown parameters θ, and using two commonly used priors: the Gaussian prior
23
the prior distribution p(θ), which describes any knowl- Q q λ −λθ2
24 p(θ|λ) = j 2π e j is used to express the assump-
edge we have about the parameters before we collect the
25 tion that many of the parameters
Q will be small, and the
data. Note that the likelihood should be considered as
26 Laplace prior p(θ|λ) = j λ2 e−λ|θj | is used to express the
27 a function of the parameters θ with the data X held
assumption that many of the parameters will be zero.
28 fixed. The prior distribution and the likelihood function
We’ll come back to this point later in Sec. VI.F.
29 are used to compute the posterior distribution p(θ|X)
30 via Bayes’ rule:
31 B. Bayesian Decisions
32 p(X|θ)p(θ)
p(θ|X) = R . (34)
33 dθ 0 p(X|θ 0 )p(θ 0 ) The above section presents the tools for computing the
34 posterior distribution p(θ|X), which uses probability as
35 The posterior distribution describes our knowledge about
a framework for expressing our knowledge about the pa-
36 the unknown parameter θ after observing the data X.
rameters θ. In most cases, however, we need to summa-
37 In many cases, it will not be possible to analytically
rize our knowledge and pick a single “best” value for the
38 compute the normalizing constant in the R denominator of parameters. In principle, the specific value of the param-
39 the posterior distribution, i.e. p(X) = dθ p(X|θ)p(θ),
eters should be chosen to maximize a utility function.
40 and Markov Chain Monte Carlo (MCMC) methods are
41 In practice, however, we usually
R use one of two choices:
needed to draw random samples from p(θ|X).
42 the posterior mean hθi = dθ θp(θ|X), or the poste-
The likelihood function p(X|θ) is a common feature
43 rior mode θ̂MAP = arg maxθ p(θ|X). Often, hθi is called
of both classical statistics and Bayesian inference, and
44
is determined by the model and the measurement noise. the Bayes estimate and θ̂MAP is called the maximum-a-
45
Many common statistical procedures such as least-square posteriori or MAP estimate. While the Bayes estimate
46
fitting can be cast as Maximum Likelihood Estimation minimizes the mean-squared error, the MAP estimate is
47 often used instead because it is easier to compute.
48 (MLE). In MLE, one chooses the parameters θ̂ that max-
49 imize the likelihood (or equivalently the log-likelihood
50 since log is a monotonic function) of the observed data:
C. Hyperparameters
51
52 θ̂ = arg max log p(X|θ). (35) The Gaussian and Laplace prior distributions, used to
θ
53 express the assumption that many of the model parame-
54 In other words, in MLE we choose the parameters that ters will be small or zero, both have an extra parameter
55
maximize the probability of seeing the observed data λ. This hyperparameter or nuisance variable has to be
56
given our generative model. MLE is an important con- chosen somehow. One standard Bayesian approach is to
57
cept in both frequentist and Bayesian statistics. define another prior distribution for λ – usually using an
58
59 The prior distribution, by contrast, is uniquely uninformative prior – and to average the posterior distri-
60 Bayesian. There are two general classes of priors: if we bution over all choices of λ. This is called a hierarchical
61 do not have any specialized knowledge about θ before we prior. Computing averages, however, often requires long
Markov Chain Monte Carlo simulations that are compu- the columns X:,j ∈ Rn , j = 1, · · · p being measured fea-
3
4 tationally intensive. Therefore, it is simpler if we can tures. Bear in mind that this function f is never known
5 find a good value of λ using an optimization procedure to us explicitly, though in practice we usually presume
6 instead. We will discuss how this is done in practice when its functional form. For example, in linear regression, we
T
7 discussing linear regression in Sec. VI. assume yi = f (x(i) ; wtrue ) + i = wtrue x(i) + i for some
8 unknown but fixed wtrue ∈ R . p

9 We want to find a function g with parameters w fit


10 VI. LINEAR REGRESSION to the data D that can best approximate f . When this
11 is done, meaning we have found a ŵ such that g(x; ŵ)
12 In Section II, we performed our first numerical ML yields our best estimate of f , we can use this g to make
13 experiments by fitting datasets generated by polynomi- predictions about the response y0 for a new data point
14 als in the presence of different levels of additive noise. x0 , as we did in Section II.
15 We used the fitted parameters to make predictions on It will be helpful for our discussion of linear regres-
16 ‘unseen’ observations, allowing us to gauge the perfor- sion to define one last piece of notation. For any real
17 mance of our model on new data. These experiments number p ≥ 1, we define the Lp norm of a vector
18 highlighted the fundamental tension common to all ML x = (x1 , · · · , xd ) ∈ Rd to be
19
models between how well we fit the training dataset and
20 1
predictions on new data. The optimal choice of predictor ||x||p = (|x1 |p + · · · + |xd |p ) p (36)
21
22 depended on, among many other things, the functions
23 used to fit the data and the underlying noise level. In
Section III, we formalized this by introducing the notion A. Least-square regression
24
25 of model complexity and the bias-variance decomposi-
tion, and discussed the statistical meaning of learning. Ordinary least squares linear regression (OLS) is de-
26
In this section, we take a closer look at these ideas in the fined as the minimization of the L2 norm of the difference
27
simple setting of linear regression. between the response yi and the predictor g(x(i) ; w) =
28
wT x(i) :
29 As in Section II, fitting a given set of samples (yi , xi )
30 means relating the independent variables xi to their re- n
X
31 sponses yi . For example, suppose we want to see how minp ||Xw − y||22 = minp (wT x(i) − yi )2 . (37)
w∈R w∈R
32 the voltage across two sides of a metal slab V changes i=1
33 in response to the applied electric current I. Normally
34 In other words, we are looking to find the w which mini-
we would first make a bunch of measurements labeled mizes the L2 error. Geometrically speaking, the predictor
35 by i and plot them on a two-dimensional scatterplot,
36 function g(x(i) ; w) = wT x(i) defines a hyperplane in Rp .
(Vi , Ii ). The next step is to assume, either from an oracle Minimizing the least squares error is therefore equivalent
37
or from theoretical reasoning, some models that might to minimizing the sum of all projections (i.e. residuals)
38
explain the measurements and measuring their perfor- for all points x(i) to this hyperplane (see Fig. 10). For-
39
40 mance. Mathematically, this amounts to finding some mally, we denote the solution to this problem as ŵLS :
41 function f such that Vi = f (Ii ; w), where w is some pa-
42 rameter (e.g. the electrical resistance R of the metal slab
43 in the case of Ohm’s law). We then try to minimize the ŵLS = arg min ||Xw − y||22 , (38)
44 errors made in explaining the given set of measurements w∈Rp

45 based on our model f by tuning the parameter w. To


which, after straightforward differentiation, leads to
46 do so, we need to first define the error function (formally
47 called the loss function) that characterizes the deviation ŵLS = (X T X)−1 X T y. (39)
48 of our prediction from the actual response.
49 Before formulating the problem, let us set up the no- Note that we have assumed that X T X is invertible,
50 tation. Suppose we are given a dataset with n sam- which is often the case when n  p. Formally speak-
51 ples D = {(yi , x(i) )}ni=1 , where x(i) is the i-th obser- ing, if rank(X) = p, namely, the predictors X:,1 , . . . , X:,p
52 vation vector while yi is its corresponding (scalar) re- (i.e. columns of X) are linearly independent, then ŵLS is
53 sponse. We assume that every sample has p features, unique. In the case of rank(X) < p, which happens when
54 namely, x(i) ∈ Rp . Let f be the true function/model p > n, X T X is singular, implying there are infinitely
55
that generated these samples via yi = f (x(i) ; wtrue ) + i , many solutions to the least squares problem, Eq. (38).
56
where wtrue ∈ Rp is a parameter vector and i is some In this case, one can easily show that if w0 is a solution,
57
58 i.i.d. white noise with zero mean and finite variance. w0 +η is also a solution for any η which satisfies Xη = 0
59 Conventionally, we cast all samples into an n × p ma- (i.e. η ∈ null(X)). Having determined the least squares
60 trix, X ∈ Rn×p , called the design matrix, with the rows solution, we can calculate ŷ, the best fit of our data X,
61 Xi,: = x(i) ∈ Rp , , i = 1, · · · , n being observations and as ŷ = X ŵLS = PX y, where PX = X(X T X)−1 X T ,
c.f. Eq. (37). Geometrically, PX is the projection matrix regularization: the first one employs an L2 penalty and
3
4 which acts on y and projects it onto the column space is called Ridge regression, while the second uses an L1
5 of X, which is spanned by the predictors X:,1 , · · · , X:,p penalty and is called LASSO.
6 (see FIG. 11). Notice that we found the optimal solu-
7 tion ŵLS in one shot, without doing any sort of iterative
8 optimization like that discussed in Section IV. y
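As a concrete illustration of this "one-shot" solution, the following sketch (the synthetic data and variable names are our own assumptions) solves the normal equations of Eq. (39) directly and compares the result with NumPy's built-in least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # design matrix: n samples, p features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)   # responses with additive noise

# ordinary least squares, Eq. (39): solve (X^T X) w = X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print("w_LS    =", w_ls)

# numerically preferable equivalent (uses an SVD internally)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("w_lstsq =", w_lstsq)
```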
span( { X 1 , · · · , X p } )
9
10 y − ŷ
11
12
13
14 ŷ
15
16
17
Rp FIG. 11 The projection matrix PX projects the response
vector y onto the column space spanned by the columns of
18 X, span({X:,1 , · · · , X:,p }) (purple area), thus forming a fitted
19 vector ŷ. The residuals in Eq. (37) are illustrated by the red
20 vector y − ŷ.
21 FIG. 10 Geometric interpretation of least squares regression.
22 The regression function g defines a hyperplane in Rp (green
23 solid line, here we have p = 2) while the residual of data point
24 x(i) (hollow circles) is its projection onto this hyperplane (bar- B. Ridge-Regression
25 ended dashed line).
26 In this section, we study the effect of adding to the
27 In Section III we explained that the difference between least squares loss function a regularizer defined as the L2
28 learning and fitting lies in the prediction on “unseen" norm of the parameter vector we wish to optimize over.
29 data. It is therefore necessary to examine the out-of- In other words, we want to solve the following penalized
30 sample error. For a more refined argument on the role regression problem called Ridge regression:
31 of out-of-sample errors in linear regression, we encour- 
32 age the reader to do the exercises in the corresponding ŵRidge (λ) = arg min ||Xw − y||22 + λ||w||22 . (43)
w∈Rp
33 Jupyter notebooks. The upshot is, following our defini-
34 tion of Ēin and Ēout in Section III, the average in-sample
35 This problem is equivalent to the following constrained
and out-of-sample error can be shown to be optimization problem
36 
37 p
Ēin = σ 2 1 − (40) ŵRidge (t) = arg min ||Xw − y||22 . (44)
38  n
p w∈Rp : ||w||22 ≤t
39 Ēout = σ 2 1 + , (41)
40 n This means that for any t ≥ 0 and solution ŵRidge in
41 provided we obtain the least squares solution ŵLS from Eq. (44), there exists a value λ ≥ 0 such that ŵRidge
42 i.i.d. samples X and y generated through y = Xwtrue + solves Eq. (43), and vice versa4 . With this equivalence, it
43  3 . Therefore, we can calculate the average generaliza- is obvious that by adding a regularization term, λ||w||22 ,
44 tion error explicitly: to our least squares loss function, we are effectively con-
45
46 p straining the magnitude of the parameter vector learned
|Ēin − Ēout | = 2σ 2 . (42) from the data.
47 n
48 To see this, let us solve Eq. (43) explicitly. Differenti-
49 This imparts an important message: if we have p  n ating w.r.t. w, we obtain,
50 (i.e. high-dimensional data), the generalization error is
51 extremely large, meaning the model is not learning. Even ŵRidge (λ) = (X T X + λIp×p )−1 X T y. (45)
52 when we have p ≈ n, we might still not learn well due
53 to the intrinsic noise σ 2 . One way to ameliorate this
54 is, as we shall see in the following few sections, to use
4 Note that the equivalence between the penalized and the con-
55 regularization. We will mainly focus on two forms of
strained (regularized) form of least square optimization does not
56 always hold. It holds for Ridge and LASSO (introduced later),
57 but not for best subset selection which is defined by choosing a
58 L0 norm: λ||w||0 . In this case, for every λ > 0 and any ŵBS
3 This requires that  is a noise vector whose elements are i.i.d. of that solves the penalized form of best subset selection, there is a
59
60 zero mean and variance σ 2 , and is independent of the samples value t ≥ 0 such that ŵBS also solves that constrained form of
X. best subset selection, but the converse is not true.
61
In fact, when X is orthogonal, one can simplify this ex- As in Ridge regression, there is another formulation for
3
4 pression further: LASSO based on constrained optimization, namely,
5 ŵLS
ŵRidge (λ) = , for orthogonal X, (46) ŵLASSO (t) = arg min ||Xw − y||22 . (53)
6 1+λ w∈Rp : ||w||1 ≤t
7
8 where ŵLS is the least squares solution given by Eq. (39). The equivalence interpretation is the same as in Ridge
9 This implies that the ridge estimate is merely the least regression, namely, for any t ≥ 0 and solution ŵLASSO in
10 squares estimate scaled by a factor (1 + λ)−1 . Eq. (53), there is a value λ ≥ 0 such that ŵLASSO solves
11 Can we derive a similar relation between the fitted Eq. (52), and vice versa. However, to get the analytic
12 vector ŷ = X ŵRidge and the prediction made by least solution of LASSO, we cannot simply take the gradient
13 squares linear regression? To answer this, let us do a sin- of Eq. (52) with respect to w, since the L1 -regularizer is
14 gular value decomposition (SVD) on X. Recall that the not everywhere differentiable, in particular at any point
15 SVD of an n × p matrix X has the form where wj = 0 (see Fig. 13). Nonetheless, LASSO is a
16 convex problem. Therefore, we can invoke the so-called
17 X = U DV T , (47)
“subgradient optimality condition" (Boyd and Vanden-
18
where U ∈ Rn×p and V ∈ Rp×p are orthogonal matrices berghe, 2004; Rockafellar, 2015) in optimization theory
19
20 such that the columns of U span the column space of to obtain the solution. To keep the notation simple, we
21 X while the columns of V span the row space of X. only show the solution assuming X is orthogonal:
22 D ∈ Rp×p =diag(d1 , d2 , · · · , dp ) is a diagonal matrix with
ŵjLASSO (λ) = sign(ŵjLS )(|ŵjLS | − λ)+ , for orthogonal X,
23 entries d1 ≥ d2 ≥ · · · dp ≥ 0 called the singular values of
(54)
24 X. Note that X is singular if there is at least one dj = 0.
where (x)+ denotes the positive part of x and ŵjLS is
25 By writing X in terms of its SVD, one can recast the
the j-th component of least squares solution. In Fig. 12,
26 Ridge estimator Eq. (45) as
27 we compare the Ridge solution Eq. (46) with LASSO
28 ŵRidge = V (D 2 + λI)−1 DU T y, (48) solution Eq. (54). As we mentioned above, the Ridge
29 solution is the least squares solution scaled by a factor
30 which implies that the Ridge predictor satisfies of (1 + λ). Here LASSO does something conventionally
31 ŷRidge = X ŵRidge called “soft-thresholding" (see Fig. 12). We encourage
32 interested readers to work out the exercises in Notebook
= U D(D 2 + λI)−1 DU T y
33 p
3 to explore what this function does.
34 X d2j
= U:,j 2 U Ty (49)
35
j=1
dj + λ :j LASSO Ridge
36
37 ≤ U U Ty (50)
scaled by
38 = X ŷ ≡ ŷLS , (51) λ
λ 1+ λ
39
40 where U:,j are the columns of U . Note that in the in-
λ
41 equality step we assumed λ ≥ 0 and used SVD to sim-
42 plify Eq. (39). By comparing Eq. (49) with Eq. (51), it is
43 clear that in order to compute the fitted vector ŷ, both
44 Ridge and least squares linear regression have to project FIG. 12 [Adapted from (Friedman et al., 2001)] Comparing
45 y to the column space of X. The only difference is that LASSO and Ridge regression. The black 45 degree line is
46 Ridge regression further shrinks each basis component j the unconstrained estimate for reference. The estimators are
47 by a factor d2j /(d2j + λ). We encourage the reader to do shown by red dashed lines. For LASSO, this corresponds to
48 the exercises in Notebook 3 to develop further intuition the soft-thresholding function Eq. (54) while for Ridge regres-
49 about how Ridge regression works. sion the solution is given by Eq. (46)
50
51 How different are the solutions found using LASSO
52 C. LASSO and Sparse Regression and Ridge regression? In general, LASSO tends to give
53 sparse solutions, meaning many components of ŵLASSO
54 In this section, we study the effects of adding an L1 reg- are zero. An intuitive justification for this result is pro-
55 ularization penalty, conventionally called LASSO, which vided in Fig. 13. In short, to solve a constrained op-
56 stands for “least absolute shrinkage and selection opera- timization problem with a fixed regularization strength
57
tor”. Concretely, LASSO in the penalized form is defined t ≥ 0, for example, Eq. (44) and Eq. (53), one first carves
58
by the following regularized regression problem: out the “feasible region" specified by the regularizer in the
59
{w1 , · · · , wd } space. This means that a solution ŵ0 is le-
60 ŵLASSO (λ) = arg min ||Xw − y||22 + λ||w||1 . (52)
61 w∈Rp gitimate only if it falls in this region. Then one proceeds
by plotting the contours of the least squares regressors in
3
4 an increasing manner until the contour touches the fea- 1.0
Train (Ridge)
5 sible region. The point where this occurs is the solution Test (Ridge)
6 to our optimization problem (see Fig. 13 for illustration). 0.8 Train (LASSO)
Test (LASSO)
7 Loosely speaking, since the L1 regularizer of LASSO has
sharp protrusions (i.e. vertices) along the axes, and be-

Performance
8 0.6
9 cause the regressor contours are in the shape of ovals (it
10 is quadratic in w), their intersection tends to occur at 0.4
11 the vertex of the feasibility region, implying the solution
12 vector will be sparse. 0.2
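For an orthogonal design, the contrast between the Ridge rescaling of Eq. (46) and the LASSO soft-thresholding of Eq. (54) fits in a few lines. The sketch below uses our own function names and an arbitrary example weight vector purely for illustration.

```python
import numpy as np

def ridge_shrink(w_ls, lam):
    """Ridge solution for orthogonal X, Eq. (46): uniform rescaling of w_LS."""
    return w_ls / (1.0 + lam)

def lasso_soft_threshold(w_ls, lam):
    """LASSO solution for orthogonal X, Eq. (54): soft-thresholding of w_LS."""
    return np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam, 0.0)

w_ls = np.array([3.0, -0.4, 0.05, -2.0])
print("Ridge :", ridge_shrink(w_ls, lam=0.5))
print("LASSO :", lasso_soft_threshold(w_ls, lam=0.5))  # small weights set exactly to zero
```

The printout makes the sparsity mechanism explicit: Ridge shrinks every component by the same factor, while LASSO zeroes out any component whose least-squares value falls below λ.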
13
14
0.0
15 ŵ 10
2
10
1
10
0
10
1
10
2

16 ŵ
w2 w2
17
FIG. 14 Performance of LASSO and ridge regression on the
18
t
diabetes dataset measured by the R2 coefficient of determina-
19 t tion. The best possible performance is R2 = 1. See Notebook
20 3.
w1 w1
21
22
23 LASSO Ridge Ridge LASSO
24
600
25 FIG. 13 [Adapted from (Friedman et al., 2001)] Illustra- 600
26 tion of LASSO (left) and Ridge regression (right). The blue 500
500
27 concentric ovals are the contours of the regression function
400
28 while the red shaded regions represent the constraint func- 400
|wi|

29 tions: (left) |w1 | + |w2 | ≤ t and (right) w12 + w22 ≤ t. In- 300 300
30 tuitively, since the constraint function of LASSO has more
31 protrusions, the ovals tend to intersect the constraint at the 200 200
32 vertex, as shown on the left. Since the vertices correspond to
parameter vectors w with only one non-vanishing component, 100 100
33
34 LASSO tends to give sparse solution. 0 0
35 10
2
10
1
10
0
10
1
10
2
10
2
10
1
10
0
10
1
10
2

36 In Notebook 3, we analyze a Diabetes dataset using


37 both LASSO and Ridge regression to predict the dia-
38 betes outcome one year forward (Efron et al., 2004). In FIG. 15 Regularization parameter λ affects the weights (fea-
39 Figs. 14, 15, we show the performance of both methods tures) we learned in both Ridge regression (left) and LASSO
regression (right) on the Diabetes dataset. Curves with dif-
40 and the solutions ŵLASSO (λ), ŵRidge (λ) explicitly. More ferent colors correspond to different wi ’s (features). Notice
41 details of this dataset and our regression implementation LASSO, unlike Ridge, sets feature weights to zero leading to
42 can be found in Notebook 3. sparsity. See Notebook 3.
43
44
45 D. Using Linear Regression to Learn the Ising Hamiltonian numbers Ej mean, and the configuration {Sj }L j=1 can be
46
interpreted in many ways: the outcome of coin tosses,
47 To gain deeper intuition about what kind of physics black-and-white pixels of an image, the binary represen-
48 problems linear regression allows us to tackle, consider tation of integers, etc. Your goal is to learn a model that
49 the following problem of learning the Hamiltonian for predicts Ej from the spin configurations.
50 the Ising model. Imagine you are given an ensemble of
51 Without any prior knowledge about the origin of the
random spin configurations, and assigned to each state data set, physics intuition may suggest to look for a spin
52 its energy, generated from the 1D Ising model:
53 model with pairwise interactions between every pair of
54 L
X variables. That is, we choose the following model class:
55 H = −J Sj Sj+1 (55)
L X
X L
56 j=1
57 Hmodel [S i ] = − Jj,k Sji Ski , (56)
58 where J is the nearest-neighbor spin interaction, and j=1 k=1
59 Sj ∈ {±1} is a spin variable. Let’s assume the data
60 was generated with J = 1. You are handed the data The goal is to determine the interaction matrix Jj,k by
61 set D = ({Sj }L
j=1 , Ej ) without knowledge of what the applying linear regression on the data set D. This is a
well-defined problem, since the unknown J_{j,k} enters linearly into the definition of the Hamiltonian. To this end,
we cast the above ansatz into the more familiar linear-regression form:

H_model[S^i] = X^i · J.    (57)

The vectors X^i represent all two-body interactions {S_j^i S_k^i}_{j,k=1}^{L}, and the index i runs over the samples in
the dataset. To make the analogy complete, we can also represent the dot product by a single index p = {j, k},
i.e. X^i · J = X_p^i J_p. Note that the regression model does not include the minus sign. In the following, we apply
ordinary least squares, Ridge, and LASSO regression to the problem, and compare their performance.

FIG. 16 Performance of OLS, Ridge and LASSO regression on the Ising model as measured by the R² coefficient of
determination. Optimal performance is R² = 1. See Notebook 4.

Figure 16 shows the R² of the three regression models,

R² = 1 − [Σ_{i=1}^{n} (y_i^true − y_i^pred)²] / [Σ_{i=1}^{n} (y_i^true − (1/n) Σ_{i=1}^{n} y_i^pred)²].    (58)

Let us make a few remarks: (i) the regularization parameter λ affects the Ridge and LASSO regressions at
scales separated by a few orders of magnitude. Notice that this is different for the data considered in the
diabetes dataset, cf. Fig. 14. Therefore, it is considered good practice to always check the performance for the
given model and data as a function of λ. (ii) While the OLS and Ridge regression test curves are monotonic,
the LASSO test curve is not – suggesting an optimal LASSO regularization parameter is λ ≈ 10⁻². At this
sweet spot, the Ising interaction weights J contains only nearest-neighbor terms (as did the model the data was
generated from).

Choosing whether to use Ridge or LASSO regression in this case turns out to be similar to fixing gauge degrees of
freedom. Recall that the uniform nearest-neighbor interaction strength J_{j,k} = J which we used to generate the
data, was set to unity, J = 1. Moreover, J_{j,k} was NOT defined to be symmetric (we only used the J_{j,j+1} but
never the J_{j,j−1} elements). Figure 17 shows the matrix representation of the learned weights J_{j,k}. Interestingly,
OLS and Ridge regression learn nearly symmetric weights J ≈ −0.5. This is not surprising, since it amounts to taking
into account both the J_{j,j+1} and the J_{j,j−1} terms, and the weights are distributed symmetrically between them.
LASSO, on the other hand, tends to break this symmetry (see matrix elements plots for λ = 0.01)⁵. Thus,
we see how different regularization schemes can lead to learning equivalent models but in different ‘gauges’. Any
information we have about the symmetry of the unknown model that generated the data should be reflected in the
definition of the model and the choice of regularization. In addition to the diabetes dataset in Notebook 3, we
encourage the reader to work out Notebook 4 in which linear regression is applied to the one-dimensional Ising
model.

E. Convexity of regularizer

In the previous section, we mentioned that the analytical solution of LASSO can be found by invoking its convexity.
In this section, we provide a gentle introduction to convexity theory and highlight a few properties which
can help us understand the differences between LASSO and Ridge regression. First, recall that a set C ⊆ R^n is
called convex if for any x, y ∈ C and t ∈ [0, 1],

tx + (1 − t)y ∈ C.    (59)

In other words, every line segment joining x, y lies entirely in C. A function f : R^n → R is called convex
if its domain, dom(f), is a convex set, and for any x, y ∈ dom(f) and t ∈ [0, 1] we have

f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y),    (60)

That is, the function lies on or below the line segment joining its evaluation at x and y. This function f is
called strictly convex if this inequality holds strictly for x ≠ y and t ∈ (0, 1). Now, it turns out that for convex
functions, any local minimizer is a global minimizer. Algorithmically, this means that in the optimization procedure,
as long as we are “going down the hill” and agree to stop when we reach a minimum, then we have hit
the global minimum. In addition to this, there is an abundance of rich theory regarding convex duality and
optimality, which allow us to understand the solutions even before solving the problem itself. We refer interested

⁵ Look closer, and you will see that LASSO actually splits the weights rather equally for the periodic boundary condition element at the edges of the anti-diagonal.
FIG. 17 Learned interaction matrix J_{ij} for the Ising model ansatz in Eq. (56) for ordinary least squares (OLS) regression
(left), Ridge regression (middle) and LASSO (right) at different regularization strengths λ. OLS is λ-independent but is shown
for comparison throughout. See Notebook 4.
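The following is a minimal sketch in the spirit of Notebook 4, showing how a matrix like the one in Fig. 17 can be obtained: random 1D Ising states and their energies are generated, the design matrix of all pairwise products S_j S_k is built, and OLS, Ridge, and LASSO are fit. The system size, number of samples, and regularization strengths below are illustrative and smaller than in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

np.random.seed(0)
L, n_samples = 40, 2000  # smaller sample than the notebook, for speed

# Random spin configurations and their 1D Ising energies (J = 1, periodic boundaries)
states = np.random.choice([-1, 1], size=(n_samples, L))
energies = -np.sum(states * np.roll(states, -1, axis=1), axis=1)

# Design matrix of all two-body products S_j S_k, flattened to L*L features per sample
X = np.einsum('ij,ik->ijk', states, states).reshape(n_samples, L * L)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.01, max_iter=10000))]:
    model.fit(X, energies)
    J_learned = model.coef_.reshape(L, L)  # learned interaction matrix J_{j,k}
    print(name, "train R2 =", round(model.score(X, energies), 3))
```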
readers to (Boyd and Vandenberghe, 2004; Rockafellar, Using the assumption that samples are i.i.d., we can write
3
4 2015). the log-likelihood as
5 Now let us examine the two regularizers we introduced n
earlier. A close inspection reveals that LASSO and Ridge X
6 l(θ) ≡ log p(D|θ) = log p(yi |x(i) , θ). (63)
7 regressions are both convex problems but only Ridge re- i=1
8 gression is a strictly convex problem (assuming λ > 0).
9 From convexity theory, this means that we always have a Note that the conditional dependence of the response
10 unique solution for Ridge but not necessary for LASSO. variable yi on the independent variable x(i) in the likeli-
11 In fact, it was recently shown that under mild conditions, hood function is made explicit since in regression the ob-
12 such as demanding general position for columns of X, served value of data, yi , is predicted based on x(i) using
13 the LASSO solution is indeed unique (Tibshirani et al., a model that is assumed to be a probability distribution
14 2013). Apart from this theoretical characterization, (Zou that depends on unknown parameter θ. This distribu-
15 and Hastie, 2005) introduced the notion of Elastic Net to tion, when endowed with θ, can, as we hope, potentially
16 retain the desirable properties of both LASSO and Ridge explain our prediction on yi . By definition, such distri-
17 regression, which is now one of the standard tools for bution is the likelihood function we discussed in Sec. V.
18 regression analysis and machine learning. We refer to Note that this is consistent with the formal statistical
19 treatment of regression where the goal is to estimate the
reader to explore this in Notebook 2.
20 conditional expectation of the dependent variable given
21
the value of the independent variable (sometimes called
22
F. Bayesian formulation of linear regression the covariate) (Wasserman, 2013). We stress that this
23
24 notation does not imply x(i) is unknown– it is still part
25 In Section V, we gave an overview of Bayesian inference of the observed data!
26 and phrased it in the context of learning and uncertainty Using Eq. (61), we get
l(θ) = −(1/(2σ²)) Σ_{i=1}^{n} (y_i − w^T x^{(i)})² − (n/2) log(2πσ²)
     = −(1/(2σ²)) ||Xw − y||²_2 + const.    (64)

quantification. In this section we formulate least squares
regression from a Bayesian point of view. We shall see
that regularization in learning will emerge naturally as
part of the Bayesian inference procedure.
From the setup of linear regression, the data D used to
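The equivalence between maximizing the Gaussian log-likelihood of Eq. (64) and minimizing the squared error can be checked numerically. The sketch below is only an illustration under assumed synthetic data; the variable names and noise level are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, sigma = 200, 3, 0.5
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=n)

# Negative Gaussian log-likelihood of Eq. (64), up to w-independent constants
def neg_log_likelihood(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_mle, w_ols, atol=1e-3))  # True: the MLE coincides with least squares
```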
32 fit the regression model is generated through y = xT w +
33 . We often assume that  is a Gaussian noise with mean By comparing Eq. (38) and Eq. (64), it is clear that per-
34 zero and variance σ 2 . To connect linear regression to the forming least squares is the same as maximizing the log-
35 Bayesian framework, we often write the model as likelihood of this model.
36 What about adding regularization? In Section V, we
37 p(y|x, θ) = N (y|µ(x), σ 2 (x)). (61) introduced the maximum a posteriori probability (MAP)
38 estimate. Here we show that it actually corresponds to
39 In other words, our regression model is defined by a con- regularized linear regression, where the choice of prior
40 ditional probability that depends not only on data x but determines the type of regularization. Recall Bayes’ rule
41
on some model parameters θ. For example, if the mean
42
is a linear function of x given by µ = xT w, and the p(θ|D) ∝ p(D|θ)p(θ). (65)
43
44 variance is fixed σ 2 (x) = σ 2 , then θ = (w, σ 2 ). Now instead of maximizing the log-likelihood, l(θ) =
45 In statistics, many problems rely on estimation of some log p(D|θ), let us maximize the log posterior, log p(θ|D).
46 parameters of interest. For example, suppose we are Invoking Eq. (65), the MAP estimator becomes
47 given the height data of 20 junior students from a regional
48 high school, but what we are interested in is the average θ̂MAP ≡ arg max log p(D|θ) + log p(θ). (66)
49 height of all high school juniors in the whole county. It θ

50 is conceivable that the data we are given are not rep- In Sec. V.C, we discussed that a common choice for the
51 resentative of the student population as a whole. It is prior is a Gaussian distribution. Consider the Gaussian
52 therefore necessary to devise a systematic way to pre- prior6 with zero mean and variance τ 2 , namely, p(w) =
53 form reliable estimation. Here we present the maximum
54 likelihood estimation (MLE), and show that MLE for θ
55 is the one that minimizes the mean squared error (MSE)
56 used in OLS, see Sec.VI.A. 6 Indeed, a Gaussian prior is the conjugate prior that gives a Gaus-
57 MLE is defined by maximizing the log-likelihood sian posterior. For a given likelihood, conjugacy guarantees the
58 preservation of prior distribution at the posterior level. For ex-
w.r.t. the parameters θ: ample, for a Gaussian (Geometric) likelihood with a Gaussian
59
60 (Beta) prior, the posterior distribution is still Gaussian (Beta)
θ̂ ≡ arg max log p(D|θ). (62) distribution.
61 θ
∏_j N(w_j|0, τ²). Then, we can recast the MAP estimator into

θ̂_MAP ≡ arg max_θ [ −(1/(2σ²)) Σ_{i=1}^{n} (y_i − w^T x^{(i)})² − (1/(2τ²)) Σ_{j=1}^{n} w_j² ]
       = arg max_θ [ −(1/(2σ²)) ||Xw − y||²_2 − (1/(2τ²)) ||w||²_2 ].    (67)

probably aware of the time of day, that it is a Wednesday, your friend Alice being in attendance, your friend
Bob being absent with a cold, the country in which you are doing the experiment, and the planet you are on, but
you almost assuredly haven’t written these down in your notebook. Why not? The reason is because you entered
the classroom with strongly held prior beliefs that none of those things affect the physics which takes place in that
11 room. Even of the things you did write down in an effort
Note that we dropped constant terms that do not de-
12 to be a careful scientist you probably hold some doubt
pend on the maximization parameters θ. The equiva-
13 as to their importance to your result and what is serving
lence between MAP estimation with a Gaussian prior
14 you here is the intuition that probably only a few things
and Ridge regression is established by comparing Eq. (67)
15 matter in the physics of pendula. Hence again you are
and Eq. (44) with λ ≡ σ 2 /τ 2 . We relegate the analogous
16 approaching the experiment with prior beliefs about how
17 derivation for LASSO to an exercise in Notebook 3.
many features you will need to pay attention to in order
18 to predict what will happen when you swing an unknown
19 pendulum. This example might seem a bit contrived, but
20 G. Recap and a general perspective on regularizers
the point is that we live in a high-dimensional world of
21
In this section, we explored least squares linear regres- information and while we have good intuition about what
22
23 sion with and without regularization. We motivated the to write down in our notebook for well-known problems,
24 need for regularization due to poor generalization, in par- often in the field of ML we cannot say with any confi-
25 ticular in the “high-dimensional limit" (p  n). Instead dence a priori what the small list of things to write down
26 of showing the average in-sample and out-of-sample er- will be, but we can at least use regularization to help us
27 rors for the regularized problem explicitly, we conducted enforce that the list not be too long so that we don’t end
28 numerical experiments in Notebook 3 on the diabetes up predicting that the period of a pendulum depends on
29 dataset and showed that regularization typically leads Bob having a cold on Wednesdays.
30 to better generalization. Due to the equivalence between Of course, in both LASSO and Ridge regression there
31 the constrained and penalized form of regularized regres- is a parameter λ involved. In principle, this hyper-
32 sion (in LASSO and Ridge, but not generally true in cases parameter is usually predetermined, which means that
33 such as L0 penalization), we can regard the regularized it is not part of the regression process. As we saw in
34 regression problem as an un-regularized problem but on Fig. 15, our learning performance and solution depends
35 a constrained set of parameters. Since the size of the al- strongly on λ, thus it is vital to choose it properly. As
36 we discussed in Sec. V.C, one approach is to assume an
lowed parameter space (e.g. w ∈ Rp when un-regularized
37 uninformative prior on the hyper-parameters, p(λ), and
vs. w ∈ C ⊂ Rp when regularized) is roughly a proxy for
38
model complexity, solving the regularized problem is in average the posterior over all choices of λ following this
39
effect solving the un-regularized problem with a smaller distribution. However, this comes with a large computa-
40
41 model complexity class. This implies that we’re less likely tional cost. Therefore, it is simpler to choose the regular-
42 to overfit. ization parameter through some optimization procedure.
43 We also showed the connection between using a reg- We’d like to emphasize that linear regression can be
44 ularization function and the use of priors in Bayesian applied to model non-linear relationship between input
45 inference. This connection can be used to develop more and response. This can be done by replacing the input
46 intuition about why regularization implies we are less x with some nonlinear function φ(x). Note that doing
47 likely to overfit the data: Let’s say you are a young so preserves the linearity as a function of the parame-
48 Physics student taking a laboratory class where the goal ters w, since model is defined by the their inner product
49 of the experiment is to measure the behavior of several φT (x)w. This method is known as basis function expan-
50 different pendula and use that to predict the formula sion (Bishop, 2006; Murphy, 2012).
51 (i.e. model) that determines the period of oscillation. Recent years have also seen a surge of interest in un-
52 In your investigation you would probably record many derstanding generalized linear regression models from a
53 things (hopefully including the length and mass!) in an statistical physics perspective. Much of this research has
54 effort to give yourself the best possible chance of deter- focused on understanding high-dimensional linear regres-
55
mining the unknown relationship, perhaps writing down sion and compressed sensing (Donoho, 2006) (see (Ad-
56
the temperature of the room, any air currents, if the ta- vani et al., 2013; Zdeborová and Krzakala, 2016) for ac-
57
58 ble were vibrating, etc. What you have done is create a cessible reviews for physicists). On a technical level,
59 high-dimensional dataset for yourself. However you actu- this research imports and extends the machinery of spin
60 ally possess an even higher-dimensional dataset than you glass physics (replica method, cavity method, and mes-
61 probably would admit to yourself. For example you are sage passing) to analyze high-dimensional linear models
FIG. 18 Pictorial representation of four data categories labeled by the integers 0 through 3 (above), or by one-hot vectors
with binary inputs (below): (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1).
13
14
15 (Advani and Ganguli, 2016; Fisher and Mehta, 2015a,b;
16 Krzakala et al., 2014, 2012a,b; Ramezanali et al., 2015; FIG. 19 Classifying data in the simplest case of only two
17 Zdeborová and Krzakala, 2016). This is a rich area categories, labeled “noise” and “signal” (or “cats” and “dogs”),
18 of activity at the intersection of physics, computer sci- is the subject of Logistic Regression.
19 ence, information theory, and machine learning and in-
20 terested readers are encouraged to consult the literature
21 for further information (see also (Mezard and Montanari, output classes from the design matrix X ∈ Rn×p made of
22 2009)). n samples, each of which bears p features. The primary
23 goal is to identify the classes to which new unseen samples
24 belong.
25 Before delving into the details of logistic regression, it
26 VII. LOGISTIC REGRESSION
is helpful to consider a slightly simpler classifier: a lin-
27 ear classifier that categorizes examples using a weighted
28 So far we have focused on learning from datasets for
linear-combination of the features and an additive offset:
29 which there is a “continuous” output. For example, in
30 linear regression we were concerned with learning the co- si = xTi w + b0 ≡ xTi w, (68)
31 efficients of a polynomial to predict the response of a
32 continuous variable yi on unseen data based on its inde- where we use the short-hand notation xi = (1, xi ) and
33 pendent variables xi . However, a wide variety of prob- w = (b0 , w). This function takes values on the entire real
34 lems, such as classification, are concerned with outcomes axis. In the case of logistic regression, however, the labels
35 taking the form of discrete variables (i.e. categories). yi are discrete variables. One simple way to get a discrete
36 For example, we may want to detect if there is a cat or output is to have sign functions that map the output of
37 a dog in an image. Or given a spin configuration of, say, a linear regressor to {0, 1}, σ(si ) = sign(si ) = 1 if si ≥ 0
38
the 2D Ising model, we would like to identify its phase and 0 otherwise. Indeed, this is commonly known as the
39
(e.g. ordered/disordered). In this section, we introduce “perceptron” in the machine learning literature.
40
logistic regression which deals with binary, dichotomous
41
outcomes (e.g. True or False, Success or Failure, etc.).
42 A. The cross-entropy as a cost function for logistic regression
43 We encourage the reader to use the opportunity to build
44 their intuition about the inner workings of logistic regres- The perceptron is an example of a “hard classification”:
45 sion, as this will prove valuable later on in the study of each datapoint is assigned to a category (i.e. yi = 0 or
46 modern supervised Deep Learning models (see Sec. IX). yi = 1). Even though the perceptron is an extremely
47 This section is structured as follows: first, we define simple model, it is favorable in many cases (e.g. when
48 logistic regression and derive its corresponding cost func- dealing with noisy data) to have a “soft” classifier that
49 tion (the cross entropy) using a Bayesian approach, and outputs the probability of a given category. For example,
50 discuss its minimization. Then, we generalize logistic re- given xi , the classifier returns the probability of being in
51 gression to the case of multiple categories which is called category m. One such function is the logistic (or sigmoid)
52 SoftMax regression. We demonstrate how to apply logis- function:
53 tic regression using three different problems: (i) classify-
54 1
ing phases of the 2D Ising model, (ii) learning features σ(s) = . (69)
55 in the SUSY dataset, and (iii) MNIST handwritten digit 1 + e−s
56
classification. Note that 1 − σ(s) = σ(−s), which will be useful shortly.
57
58 Throughout this section, we consider the case where In many cases, it is favorable to work with a “soft” clas-
59 the dependent variables yi ∈ Z are discrete and only sifier.
60 take values from m = 0, . . . , M − 1 (which enumerate Logistic regression is the canonical example of a soft
61 the M classes), see Fig. 18. The goal is to predict the classifier. In logistic regression, the probability that a
data point xi belongs to a category yi = {0, 1} is given Since the cost (error) function is just the negative log-
3
4 by likelihood, for logistic regression we find
5 1 C(w) = −l(w) (76)
6 P (yi = 1|xi , θ) = T , n
7 1 + e−xi θ X  
P (yi = 0|xi , θ) = 1 − P (yi = 1|xi , θ), (70) = −yi log σ(xTi w) − (1 − yi ) log 1 − σ(xTi w) .
8
i=1
9
10 where θ = w are the weights we wish to learn from the The right-hand side in Eq. (76) is known in statistics as
11 data. To gain some intuition for these equations, consider the cross entropy.
12 a collection of non-interacting two-state systems coupled Having specified the cost function for logistic regres-
13 to a thermal bath (e.g. a collection of atoms that can be sion, we note that, just as in linear regression, in practice
14 in two states). Furthermore, denote the state of system we usually supplement the cross-entropy with additional
15 i by a binary variable: yi ∈ {0, 1}. From elementary regularization terms, usually L1 and L2 regularization
16 statistical mechanics, we know that if the two states have (see Sec. VI for discussion of these regularizers).
energies ε0 and ε1, the probability for finding the system
in a state yi is:

P(yi = 1) = e^{−βε0}/(e^{−βε0} + e^{−βε1}) = 1/(1 + e^{−β∆}),
P(yi = 0) = 1 − P(yi = 1).    (71)

B. Minimizing the cross entropy
Minimizing this cost function leads to the following equa-
24
Notice that in these expressions, as is often the case in tion
25
26 physics, only energy differences are observable. If the n
difference in energies between two states is given by ∆ = X  
27 0 = ∇C(w) = σ(xTi w) − yi xi , (77)
28 xTi w, we recover the expressions for logistic regression. i=1
29 We shall use this mapping between partition functions
30 and classification to generalize the logistic regressor to where we made use of the logistic function identity
SoftMax regression in Sec. VII.D. Notice that in terms ∂s σ(s) = σ(s)[1 − σ(s)]. Equation (77) defines a transcen-
32 of the logistic function, we can write dental equation for w, the solution of which, unlike linear
33 regression, cannot be written in a closed form. For this
34 P (yi = 1) = σ(xTi w) = 1 − P (yi = 0). (72) reason, one must use numerical methods such as those
35 introduced in Sec. IV to solve this optimization problem.
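Since Eq. (77) has no closed-form solution, it is natural to solve it with the numerical methods of Sec. IV. The snippet below is a hedged sketch of plain gradient descent on the cross entropy for synthetic data; the learning rate, number of steps, and data-generating weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])  # first column plays the role of the bias
w_true = np.array([0.5, 2.0, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w = np.zeros(p)
eta = 0.1  # learning rate
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w) - y)  # gradient of the cross entropy, cf. Eq. (77)
    w -= eta * grad / n
print("learned weights:", np.round(w, 2))  # should be close to w_true
```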
36 We now define the cost function for logistic regression
37 using Maximum Likelihood Estimation (MLE). Recall,
38 that in MLE we choose parameters to maximize the prob- C. Examples of binary classification
39 ability of seeing the observed data. Consider a dataset
40 D = {(yi , xi )} with binary labels yi ∈ {0, 1} from which Let us now show how to use logistic regression in
41
the data points are drawn independently. The likelihood practice. In this section, we showcase two pedagogi-
42
of observing the data under our model is just: cal examples to train a logistic regressor to classify bi-
43 nary data. Each example comes with a corresponding
44 n
Jupyter notebook, see https://physics.bu.edu/~pankajm/MLnotebooks.html.

P(D|w) = ∏_{i=1}^{n} [σ(x_i^T w)]^{y_i} [1 − σ(x_i^T w)]^{1−y_i}    (73)
48 1. Identifying the phases of the 2D Ising model
from which we can readily compute the log-likelihood:

l(w) = Σ_{i=1}^{n} [ y_i log σ(x_i^T w) + (1 − y_i) log(1 − σ(x_i^T w)) ].    (74)

The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood:

ŵ = arg max_θ Σ_{i=1}^{n} [ y_i log σ(x_i^T w) + (1 − y_i) log(1 − σ(x_i^T w)) ].    (75)

The goal of this example is to show how one can employ logistic regression to classify the states of the 2D Ising
model according to their phase of matter.
The Hamiltonian for the classical Ising model is given by

H = −J Σ_{⟨ij⟩} S_i S_j,    S_j ∈ {±1},    (78)

where the lattice site indices i, j run over all nearest neighbors of a 2D square lattice, and J is an interaction
FIG. 20 Examples of typical states of the 2D Ising model for three different temperatures in the ordered phase (T/J = 0.75,
left), the critical region (T/J = 2.25, middle) and the disordered phase (T/J = 4.0, right). The linear system dimension is
L = 40 sites.
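As a rough illustration of the classification workflow (not the actual Monte Carlo data used in the notebooks), the sketch below generates crude stand-ins for ordered and disordered configurations, flattens them to 1D feature vectors, and trains scikit-learn's logistic regression. All parameters here are toy choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
L, n_per_class = 40, 1000

# Toy stand-ins for MC samples: "ordered" states are ~95% aligned with a random overall sign,
# "disordered" states are independent random spins. Labels: 1 = ordered, 0 = disordered.
base = rng.choice([-1, 1], p=[0.05, 0.95], size=(n_per_class, L * L))
signs = rng.choice([-1, 1], size=(n_per_class, 1))
ordered = base * signs
disordered = rng.choice([-1, 1], size=(n_per_class, L * L))

X = np.vstack([ordered, disordered]).astype(float)
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(C=1.0, solver="liblinear")  # C = 1/lambda, L2 penalty by default
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```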
23
24
25 energy scale. We adopt periodic boundary conditions. to, among other things, critical slowing down of the MC
26 Onsager proved that this model undergoes a phase tran- algorithm. Perhaps identifying the phases is also harder
27 sition in the thermodynamic limit from an ordered fer- in the critical region. With this in mind, consider the
28 romagnet with all spins aligned to a disordered √ phase at following three types of states: ordered (T /J < 2.0),
29 the critical temperature Tc /J = 2/ log(1 + 2) ≈ 2.26. near-critical (2.0 ≤ T /J ≤ 2.5) and disordered (T /J >
30 For any finite system size, this critical point is smeared 2.5). We use both ordered and disordered states to train
31 out to a critical region around Tc . the logistic regressor and, once the supervised training
32 An interesting question to ask is whether one can procedure is complete, we will evaluate the performance
33 train a statistical classifier to distinguish between the two of our classification model on unseen ordered, disordered,
34 phases of the Ising model. If successful, this can be used and near-critical states.
35 to locate the position of the critical point in more compli- Here, we deploy the liblinear routine (the default for
36 cated models where an exact analytical solution has so far Scikit’s logistic regression) and stochastic gradient de-
37 remained elusive (Morningstar and Melko, 2017; Zhang scent (SGD, see Sec. IV for details) to optimize the logis-
38
et al., 2017b). In other words, given an Ising state, we tic regression cost function with L2 regularization. We
39
would like to classify whether it belongs to the ordered or define the accuracy of the classifier as the percentage of
40
41 the disordered phase, without any additional information correctly classified data points. Comparing the accuracy
42 other than the spin configuration itself. This categorical on the training and test data, we can study the degree
43 machine learning problem is well suited for logistic re- of overfitting. The first thing to notice in Fig. 21 is the
44 gression, and will thus consist of recognizing whether a small degree of overfitting, as suggested by the training
45 given state is ordered by looking at its bit configurations. (blue) and test (red) accuracy curves being very close to
46 Notice that, for the purposes of applying logistic regres- each other. Interestingly, the liblinear minimizer outper-
47 sion, the 2D spin state of the Ising model will be flattened forms SGD on the training and test data, but not on the
48 out to a 1D array, so it will not be possible to learn in- near-critical data for certain values of the regularization
49 formation about the structure of the contiguous ordered strength λ. Moreover, similar to the linear regression
50 2D domains [see Fig. 20]. Such information can be incor- examples, we find that there exists a sweet spot for the
51 porated using deep convolutional neural networks, see SGD regularization strength λ that results in optimal
52 Section IX. performance of the logistic regressor, at about λ ∼ 10−1 .
53 To this end, we consider the 2D Ising model on a 40×40 We might expect that the difficulty of the phase recogni-
54 square lattice, and use Monte-Carlo (MC) sampling to tion problem depends on the temperature of the queried
55 prepare 104 states at every fixed temperature T out of sample. Looking at the states in the near-critical region,
56
a pre-defined set. We furthermore assign a label to each c.f. Fig. 20, it is no longer easy for a trained human eye
57
state according to its phase: 0 if the state is disordered, to distinguish between the ferromagnetic and the disor-
58
59 and 1 if it is ordered. dered phases close to Tc . Therefore, it is interesting to
60 It is well-known that near the critical temperature Tc , also compare the training and test accuracies to the ac-
61 the ferromagnetic correlation length diverges, which leads curacy of the near-critical state predictions. (Recall that
FIG. 21 Accuracy as a function of the regularization parameter λ in classifying the phases of the 2D Ising model on the
training (blue), test (red), and critical (green) data. The solid and dashed lines compare the ‘liblinear’ and ‘SGD’ solvers,
respectively.

nal state particles directly, we will use the output of our logistic regression to define a part of phase space that is
enriched in signal events (see Notebook 5).
The dataset we are using comes from the UC Irvine ML repository and has been produced using Monte Carlo
simulations to contain events with two leptons (electrons or muons) (Baldi et al., 2014). Each event has the value
of 18 kinematic variables (“features”). The first 8 features are direct measurements of final state particles, in
this case the pT, pseudo-rapidity η, and azimuthal angle φ of two leptons in the event and the amount of
missing transverse momentum (MET) together with its azimuthal angle. The last ten features are higher order
functions of the first 8 features; these features are derived by physicists to help discriminate between the
two classes. These high-level features can be thought of as the physicists’ attempt to use non-linear functions to
classify signal and background events, having been developed with formidable theoretical effort. Here, we will
24 use only logistic regression to attempt to classify events
25 as either signal (that is, coming from a SUSY process)
26 the model is not trained on near-critical states.) Indeed, or background (events from some already observed Stan-
27 the liblinear accuracy is about 7% smaller for the criti- dard Model process). Later on in the review, in Sec. IX,
28 cal states (green curves) compared to the test data (red we shall revisit the same problem with the tools of Deep
29 line). Learning.
30 Finally, it is important to note that all of Scikit’s logis- As stated before, we never know the true underlying
31 tic regression solvers have in-built regularizers. We did process, and hence the goal in these types of analyses
32 not emphasize the role of the regularizers in this section, is to find regions enriched in signal events. If we find
33 but they are crucial in order to prevent overfitting. We an excess of events above what is expected, we can have
34 encourage the interested reader to play with the different confidence that they are coming from the type of sig-
35 regularization types and numerical solvers in Notebook 6 nal we are searching for. Therefore, the two metrics of
36 and compare model performances. import are the efficiency of signal selection, and the back-
37
ground rejection achieved (also called detection/rejection
38
rates and similar to recall/precision). Oftentimes, rather
39
40 2. SUSY than thinking about just a single working point, perfor-
41 mance is characterized by Receiver Operator Charecter-
42 In high energy physics experiments, such as the AT- istic curves (ROC curves). These ROC curves plot signal
43 LAS and CMS detectors at the CERN LHC, one major efficiency versus background rejection at various thresh-
44 hope is the discovery of new particles. To accomplish this olds of some discriminating variable. Here that variable
45 task, physicists attempt to sift through events and clas- will be the output signal probability of our logistic re-
46 sify them as either a signal of some new physical process gression. Figure 22 shows examples of these outputs for
47 or particle, or as a background event from already un- true signal events (left) and background events (right)
48 derstood Standard Model processes. Unfortunately, we using L2 regularization with a regularization parameter
49 don’t know for sure what underlying physical process oc- of 10−5 .
50 curred (the only information we have access to are the Notice that while the majority of signal events receive
51 final state particles). However, we can attempt to de- high probabilities of passing our discriminator and the
52 fine parts of phase space that will have a high percentage majority of background events receive low probabilities,
53 of signal events. Typically this is done by using a se- some signal events look background-like, and some back-
54 ries of simple requirements on the kinematic quantities ground events look signal-like to our discriminator. This
55
of the final state particles, for example having one or is further reason to characterize performance of our selec-
56
more leptons with large amounts of momentum that are tion in terms of ROC curves. Figure 23 shows examples
57
58 transverse to the beam line (pT ). Instead, here we will of these curves using L2 regularization for many different
59 use logistic regression in an attempt to find the relative regularization parameters using two different ML python
60 probability that an event is from a signal or a background packages, either TensorFlow (top) or Sci-Kit Learn (bot-
61 event. Rather than using the kinematic quantities of fi- tom), when using the full set of 18 input variables. Notice
FIG. 22 The probability of an event being classified as a signal event for true signal events (left, blue) and background events
(right, red).
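Distributions like those of Fig. 22 can be produced from a trained classifier's output probabilities. Below is a self-contained sketch that uses a synthetic stand-in for the 18 SUSY kinematic features (the real sample comes from the UCI repository and is used in Notebook 5); the dataset size and regularization are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SUSY features: 18 inputs, two classes (signal vs background)
X, y = make_classification(n_samples=20000, n_features=18, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(C=1e5, max_iter=1000).fit(X_tr, y_tr)  # weak L2 regularization
probs = clf.predict_proba(X_te)[:, 1]  # probability of being a signal event

plt.hist(probs[y_te == 1], bins=50, alpha=0.6, label="true signal")
plt.hist(probs[y_te == 0], bins=50, alpha=0.6, label="true background")
plt.xlabel("classifier signal probability"); plt.ylabel("events"); plt.legend(); plt.show()
```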
18
19
20 there is minimal overfitting, in part because we trained we’ve already come up with a set of discriminating vari-
21 on such a large dataset (4.5 million events). More impor- ables, including higher order ones derived from theories
22 tantly, however, is the underlying data we are working about SUSY particles, it’s worth reflecting on whether
23 with: each input variable is an important feature. there is utility to the increased sophistication of ML. To
24
show why we would want to use such a technique, recall
25
that, even to the learning algorithm, some signal events
26
27 and background events look similar. We can illustrate
28 this directly by looking at a plot comparing the pT spec-
29 trum of the two highest pT leptons (often referred to as
30 the leading and sub-leading leptons) for both signal and
31 background events. Figure 24 shows these two distribu-
32 tions, and one can see that while some signal events are
33 easily distinguished, many live in the same part of phase
34 space as the background. This effect can also be seen by
35 looking at figure 22 where you can see that some signal
36 events look like background events and vice-versa.
37
FIG. 23 ROC curves for a variety of regularization parameters with L2 regularization using TensorFlow (top) or Sci-Kit
Learn (bottom).

FIG. 24 Comparison of leading vs. sub-leading lepton pT for signal (blue) and background events (red). Recall that these
variables have been scaled to have a mean of one.
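ROC curves like those in Fig. 23 can be generated with scikit-learn's roc_curve for several regularization strengths. The sketch below again uses a synthetic stand-in dataset rather than the actual SUSY sample, and the λ grid is only illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=18, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for lam in [1e-5, 1e-3, 1e-1, 10]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)  # C = 1/lambda
    fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
    # signal efficiency = true-positive rate; background rejection = 1 - false-positive rate
    plt.plot(tpr, 1.0 - fpr, label=f"lambda={lam:g}, AUC={auc(fpr, tpr):.3f}")
plt.xlabel("signal efficiency"); plt.ylabel("background rejection"); plt.legend(); plt.show()
```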
57
58 One could then ask how much discrimination power
59 While figure 23 shows nice discrimination power be- is obtained by simply putting different requirements on
60 tween signal and background events, the adoption of ML the input variables rather than using ML techniques. In
61 techniques adds complication to any analysis. Given that order to compare this strategy (often referred to as cut-
based in the field of HEP) to our regression results, different ROC curves have been made for each of the following
cases: logistic regression with just the simple kinematic variables, logistic regression with the full set of variables,
and simply putting a requirement on the leading lepton pT. Figure 25 shows that there is a clear performance
benefit from using logistic regression. Note also that in the cut-based approach we have only used one variable
where we could have put requirements on all of them. While putting more requirements would indeed increase
background rejection, it would also decrease signal efficiency. Hence, the cut-based approach will never yield as
strong discrimination as the logistic regression we have performed. One other interesting point about these results
is that the higher-order variables noticeably help the ML techniques. In later sections, we will return to this
point to see if more sophisticated techniques can provide further improvement.

FIG. 26 An example of an input datapoint from the MNIST data set. Each datapoint is a 28 × 28-pixel image of a hand-
written digit, with its corresponding label belonging to one of the 10 digits. Each pixel contains a greyscale value repre-
sented by an integer between 0 and 255.
29
30
31 the probability of xi being in class m0 is given by
32
T
33 −1 e−xi wm0
34 P (yim0 = 1|xi , {wk }M
k=0 ) = PM −1 −xT wm , (79)
m=0 e
i
35
36 where yim0 ≡ [yi ]m0 refers to the m0 -th component of vec-
37
FIG. 25 A comparison of discrimination power from using lo- tor yi . This is known as the SoftMax function. There-
38 fore, the likelihood of this M -class classifier is simply
gistic regression with only simple kinematic variables (green),
39 logistic regression using both simple and higher-order kine- (cf. Sec. VII.A):
40 matic variables (purple), and a cut-based approach that varies
41 n M
Y Y −1
the requirements on the leading lepton pT . −1 yim
42 P (D|{wk }M
k=0 ) = [P (yim = 1|xi , wm )]
43 i=1 m=0
1−yim
44 × [1 − P (yim = 1|xi , wm )] (80)
45 D. Softmax Regression
46 from which we can define the cost function in a similar
47 fashion:
So far we have focused only on binary classification,
48 n M
X X −1
in which the labels are dichotomous variables. Here we
49 C(w) = − yim log P (yim = 1|xi , wm )
generalize logistic regression to multi-class classification.
50 i=1 m=0
One approach is to treat the label as a vector yi ∈ ZM 2 ,
51 + (1 − yim ) log (1 − P (yim = 1|xi , wm )) . (81)
52 namely a binary string of length M with only one com-
53 ponent of yi being 1 and the rest zero. For example, As expected, for M = 1, we recover the cross entropy for
54 yi = (1, 0, · · · , 0) means data the sample xi belongs to logistic regression, cf. Eq. (76).
55 class 17 , cf. Fig. 18. Following the notation in Sec. VII.A,
56
57 E. An Example of SoftMax Classification: MNIST Digit
58 Classification
7 For an alternative mathematical description of the categories,
59
60 which labels the classes by integers, see http://ufldl.stanford. A paradigmatic example of SoftMax regression is to
edu/wiki/index.php/Softmax_Regression. classify handwritten digits from the MNIST dataset.
61
62
63
64
65
35
1
2
Yann LeCun and collaborators first collected and pro- (Freund et al., 1999; Freund and Schapire, 1995; Schapire
3
4 cessed 70000 handwritten digits, each of which is laid and Freund, 2012), random forests (Breiman, 2001),
5 out on a 28 × 28-pixel grid. Every pixel assumes one and gradient boosted trees such as XGBoost (Chen and
6 of 256 grayscale values, interpolating between white and Guestrin, 2016).
7 black. A representative input sample is show in Fig. 26.
8 Since there are 10 categories for the digits 0 through 9,
this corresponds to SoftMax regression with M = 10. We A. Revisiting the Bias-Variance Tradeoff for Ensembles
9
10 encourage readers to experiment with Notebook 7 to ex-
11 plore SoftMax regression applied to MNIST. We include The bias-variance tradeoff summarizes the fundamen-
12 in Fig. 27 the learned weights wk , where k corresponds to tal tension in machine learning between the complexity
13 class labels (i.e. digits). We shall come back to SoftMax of a model and the amount of training data needed to fit
14 regression in Sec. IX. it (see Sec. III). Since data is often limited, in practice
15 it is frequently useful to use a less complex model with
16 higher bias – a model whose asymptotic performance is
17 VIII. COMBINING MODELS worse than another model – because it is easier to train
18 and less sensitive to sampling noise arising from having a
19 One of the most powerful and widely-applied ideas in finite-sized training dataset (i.e. smaller variance). Here,
20 modern machine learning is the use of ensemble methods we will revisit the bias-variance tradeoff in the context
21 that combine predictions from multiple, often weak, sta- of ensembles, drawing upon the beautiful discussion in
22 tistical models to improve predictive performance (Diet- Ref. (Louppe, 2014).
23 terich et al., 2000). Ensemble methods, such as random A key property that will emerge from this analysis is
24 forests (Breiman, 2001; Geurts et al., 2006; Ho, 1998), the correlation between models that constitute the en-
25 semble. The degree of correlation between models9 is
and boosted gradient trees, such as XGBoost (Chen and
26
Guestrin, 2016; Friedman, 2001), undergird many of the important for two distinct reasons. First, holding the en-
27
winning entries in data science competitions such as Kag- semble size fixed, averaging the predictions of correlated
28
29 gle, especially on structured datasets 8 . Even in the con- models reduces the variance less than averaging uncor-
30 text of neural networks, see Sec. IX, it is common to related models. Second, in some cases, correlations be-
31 combine predictions from multiple neural networks to in- tween models within an ensemble can result in an in-
32 crease performance on tough image classification tasks crease in bias, offsetting any potential reduction in vari-
33 (He et al., 2015; Ioffe and Szegedy, 2015). ance gained from ensemble averaging. We will discuss
34 In this section, we give an overview of ensemble meth- this in the context of bagging below. One of the most
35 ods and provide rules of thumb for when and why they dramatic examples of increased bias from correlations is
36 work. On one hand, the idea of training multiple models the catastrophic predictive failure of almost all derivative
37 and then using a weighted sum of the predictions of the models used by Wall Street during the 2008 financial cri-
38 all these models is very natural. After all, the idea of the sis.
39 “wisdom of the crowds” can be traced back, at least, to
40 the writings of Aristotle in Politics. On the other hand,
41 one can also imagine that the ensemble predictions can 1. Bias-Variance Decomposition for Ensembles
42 be much worse than the predictions from each of the in-
43 dividual models that constitute the ensemble, especially We will discuss the bias-variance tradeoff in the con-
44 when pooling reinforces weak but correlated deficiencies text of continuous predictions such as regression. How-
45
in each of the individual predictors. Thus, it is impor- ever, many of the intuitions and ideas discussed here also
46
tant to understand when we expect ensemble methods to carry over to classification tasks. Before discussing en-
47
work. sembles, let us briefly review the bias-variance tradeoff in
48 the context of a single model. Consider a data set consist-
49 In order to do this, we will revisit the bias-variance
trade-off, discussed in Sec. III, and generalize it to con- ing of data XL = {(yj , xj ), j = 1 . . . N }. Let us assume
50 that the true data is generated from a noisy model
51 sider an ensemble of classifiers. We will show that the
52 key to determining when ensemble methods work is the y = f (x) + , (82)
53 degree of correlation between the models in the ensemble
54 (Louppe, 2014). Armed with this intuition, we will intro- where  is a normally distributed with mean zero and
55 duce some of the most widely-used and powerful ensem- standard deviation σ .
56 ble methods including bagging (Breiman, 1996), boosting
57
58 9 For example, the correlation coefficient between the predictions
59 made by two randomized models based on the same training set
8 Neural networks generally perform better than ensemble meth- but with different random seeds, see Sec. VIII.A.1 for precise
60
ods on unstructured data, images, and audio. definition.
61
62
63
64
65
36
1
2
FIG. 27 Visualization of the weights wj after training a SoftMax Regression model on the MNIST dataset (see Notebook
7); each panel shows the classification weights vector wj for digit class j = 0, . . . , 9. We emphasize that SoftMax Regression
does not have explicit 2D spatial knowledge; the model learns from data points flattened out in a one-dimensional array.
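A figure in the spirit of Fig. 27 can be produced with scikit-learn by training a multinomial logistic regression and reshaping each learned weight vector back into an image. The sketch below uses the bundled 8×8 digits rather than full MNIST, so the panels are coarser than in the notebook; colormap and hyperparameters are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)      # 8x8 digits, flattened to 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=5000)  # multinomial (SoftMax) fit
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for k, ax in enumerate(axes.ravel()):
    ax.imshow(clf.coef_[k].reshape(8, 8), cmap="seismic")  # weight vector w_k shown as an image
    ax.set_title(f"Class {k}")
    ax.axis("off")
plt.show()
```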
28
29
30 Assume that we have a statistical procedure (e.g. least- where the bias,
31 squares regression) for forming a predictor ĝL (x) that X
32 gives the prediction of our model for a new data point x Bias2 = (f (xi ) − EL [ĝL (xi )])2 , (85)
33 given that we trained the model using a dataset L. This i
34
estimator is chosen by minimizing a cost function which,
35
for the sake of concreteness, we take to be the squared measures the deviation of the expectation value of our es-
36
error timator (i.e. the asymptotic value of our estimator in the
37 limit of infinite data) from the true value. The variance
38 X
39 C(X, g(x)) = (yi − ĝL (xi ))2 . (83) X
i
V ar = EL [(ĝL (xi ) − EL [ĝL (xi )])2 ], (86)
40 i
41
The dataset L is drawn from some underlying distribu-
42 measures how much our estimator fluctuates due to
tion that describes the data. If we imagine drawing many
43 finite-sample effects. The noise term
44 datasets {Lj } of the same size as L from this distribu-
tion, we know that the corresponding estimators ĝLj (x) X
45 N oise = σ2i (87)
46 will differ from each other due to stochastic effects aris-
i
47 ing from sampling noise. For this reason, we can view
48 our estimator ĝL (x) as a random variable and define an is the part of the error due to intrinsic noise in the data
49 expectation value EL in the usual way. Note that the sub- generation process that no statistical estimator can over-
50 script denotes that the expectation is taken over L. In come.
51 practice, EL is computed by by drawing infinitely many Let us now generalize this to ensembles of estimators.
52 different datasets {Lj } of the same size, fitting the corre- Given a dataset XL and hyper-parameters θ that param-
53 sponding estimator, and then averaging the results. We eterize members of our ensemble, we will consider a pro-
54 will also average over different instances of the “noise” . cedure that deterministically generates a model ĝL (xi , θ)
55 The expectation value over the noise will be denoted by given XL and θ. We assume that the θ includes some
56 E . random parameters that introduce stochasticity into our
57
58 As discussed in Sec. III, we can decompose the ex- ensemble (e.g. an initial condition for stochastic gradient
59 pected generalization error as descent or a random subset of features or data points
60 used for training.) Concretely, with a giving dataset L,
61 EL, [C(X, g(x))] = Bias2 + V ar + N oise. (84) one has a learning algorithm A that generates a model
62
63
64
65
37
1
2
A(θ, L) based on a deterministic procedure which intro- Note that the expectation EL,θm [·] is computed over the
3
4 duced stochasticity through θ in its execution on dataset joint distribution of L and θm . Also, by definition, we
5 L. We will be concerned with the expected prediction assume m 6= m0 in CL,θm ,θm0 .
6 error of the aggregate ensemble predictor
7 M
8 A 1 X We can now ask about the expected generalization
ĝL (xi , {θ}) = ĝL (xi , θm ). (88) (out-of-sample) error for the ensemble
9 M m=1
10
11 For future reference, let us define the mean, variance,
12 and covariance (i.e. the connected correlation function in
13 the language of physics), and the normalized correlation
14 coefficient of a single randomized model ĝL (x, θm ) as: " #
15   X
A A 2
16 EL,θm [ĝL (x, θm )] = µL,θm (x) EL,,θ C(X, ĝL (x)) = EL,,θ (yi − ĝL (xi , {θ})) .
17 i
EL,θm [ĝL (x, θm )2 ] − EL,θm [ĝL (x, θ)]2 = σL,θ
2
m
(x) (90)
18
19 EL,θm [ĝL (x, θm )ĝL (x, θm0 )] − Eθ [ĝL (x, θm )]2 = CL,θm ,θm0 (x) As in the single estimator case, we decompose the error
20 CL,θm ,θm0 (x) into a noise term, a bias-term, and a variance term. To
21 ρ(x) = 2 . (89) see this, note that
σL,θ
22
23
24 " #
X
A A
25 EL,,θ [C(X, ĝL (x))] = EL,,θ (yi − f (xi ) + f (xi ) − ĝL (xi , {θ}))2
26 i
27 X
28 = EL,,θ [(yi − f (xi ))2 + (f (xi ) − ĝL
A
(xi , {θ}))2 + 2(yi − f (xi ))(f (xi ) − ĝL
A
(xi , {θ})]
i
29 X X
30 = σ2i + A
EL,θ [(f (xi ) − ĝL (xi , {θ}))2 ], (91)
31 i i
32
33 where in the last line we have used the fact that E [yi ] = f (xi ) to eliminate the last term. We can further decompose
34 the second term as
35 A
EL,θ [(f (xi ) − ĝL (xi , {θ}))2 ] = EL,θ [(f (xi ) − EL,θ [ĝL
A A
(xi , {θ})] + EL,θ [ĝL A
(xi , {θ})] − ĝL (xi , {θ}))2 ]
36
A
37 = EL,θ [(f (xi ) − EL,θ [ĝL (xi , {θ})])2 ] + EL,θ [(EL,θ [ĝL
A A
(xi , {θ})] − ĝL (xi , {θ}))2 ]
38 A A A
+ 2EL,θ [(EL,θ [ĝL (xi , {θ})] − ĝL (xi , {θ}))(f (xi ) − EL,θ [ĝL (xi , {θ})])
39 A
= (f (xi ) − EL,θ [ĝL (xi , {θ})])2 + EL,θ [(ĝL
A A
(xi , {θ}) − EL,θ [ĝL (xi , {θ})])2 ]
40
41 ≡ Bias2 (xi ) + V ar(xi ), (92)
42
43
44
45 where we have defined the bias of an aggregate predictor So far the calculation for ensembles is almost iden-
46 as tical to that of a single estimator. However, since the
47 aggregate estimator is a sum of estimators, its variance
48 Bias2 (x) ≡ (f (x) − EL,θ [ĝL
A
(x, {θ})])2 (93) implicitly depends on the correlations between the indi-
49 vidual estimators in the ensemble. Using the definition
50 and the variance as of the aggregate estimator Eq. (88) and the definitions in
51 A A Eq. (89), we see that
V ar(x) ≡ EL,θ [(ĝL (x, {θ}) − EL,θ [ĝL (x, {θ})])2 ]. (94)
52
53
54
A A
55 V ar(x) = EL,θ [(ĝL (x, {θ}) − EL,θ [ĝL (x, {θ})])2 ]
56  
57 1 X X
= 2 EL,θ [ĝL (x, θm )ĝL (x, θm0 )] − M 2 [µL,θ (x)]2 
58 M 0 m,m i
59
60 2 1 − ρ(x) 2
= ρ(x)σL,θ + σL,θ . (95)
61 M
62
63
64
65
38
1
2
3 The first reason is statistical. When the
4 learning set is too small, a learning algorithm
5 can typically find several models in the hy-
6 pothesis space H that all give the same per-
7 formance on the training data. Provided their
8 predictions are uncorrelated, averaging sev-
9 eral models reduces the risk of choosing the
10 wrong hypothesis. The second reason is com-
11 putational. Many learning algorithms rely on
12 Aggregating different linear Linear perceptron hypothesis some greedy assumption or local search that
13 hypotheses
may get stuck in local optima. As such, an en-
14 semble made of individual models built from
15 FIG. 28 Why combining models? On the left we show that
by combining simple linear hypotheses (grey lines) one can many different starting points may provide
16
achieve better and more flexible classifications (dark line), a better approximation of the true unknown
17 which is in stark contrast to the case in which one only uses function than any of the single models. Fi-
18 a single perceptron hypothesis as shown on the right. nally, the third reason is representational. In
19
20 most cases, for a learning set of finite size, the
21 true function cannot be represented by any
This last formula is the key to understanding the power of the candidate models in H. By combin-
22 of random ensembles. Notice that by using large ensem-
23 ing several models in an ensemble, it may be
bles (M → ∞), we can significantly reduce the variance, possible to expand the space of representable
24
and for completely random ensembles where the mod- functions and to better model the true func-
25
26 els are uncorrelated (ρ(x) = 0), maximally suppresses tion.
27 the variance! Thus, using the aggregate predictor beats
28 down fluctuations due to finite-sample effects. The key, The increase in representational power of ensembles
29 as the formula indicates, is to decorrelate the models as can be simply visualized. For example, the classification
30 much as possible while still using a very large ensemble. task shown in Fig. 28 reveals that it is more advanta-
31 One can be worried that this comes at the expense of a geous to combine a group of simple hypotheses (verti-
32 very large bias. This turns out not to be the case. When cal or horizontal lines) than to utilize a single arbitrary
33 models in the ensemble are completely random, the bias linear classifier. This of course comes with the price of
34 of the aggregate predictor is just the expected bias of a introducing more parameters to our learning procedure.
35 single model But if the problem itself can never be learned through a
36
simple hypothesis, then there is no reason to avoid ap-
37 Bias2 (x) = (f (x) − EL,θ [ĝL
A
(x, {θ}])2
plying a more complex model. Since ensemble methods
38 M
1 X reduce the variance and are often easier to train than a
39 = (f (x) − EL,θ [ĝL (x, θm ])2 (96)
40 M m=1 single complex model, they are a powerful way of increas-
41 ing representational power (also called expressivity in the
= (f (x) − µL,θ )2 . (97) ML literature).
42
43 Thus, for a random ensemble one can always add more Our analysis also gives several intuitions for how we
44 models without increasing the bias. This observation lies should construct ensembles. First, we should try to ran-
45 behind the immense power of random forest methods dis- domize ensemble construction as much as possible to re-
46 cussed below. For other methods, such as bagging, we duce the correlations between predictors in the ensemble.
47 will see that the bootstrapping procedure actually does This ensures that our variance will be reduced while min-
48 increase the bias. But in many cases, this increase in bias imizing an increase in bias due to correlated errors. Sec-
49
is negligible compared to the reduction in variance. ond, the ensembles will work best for procedures where
50 the error of the predictor is dominated by the variance
51 and not the bias. Thus, these methods are especially well
52 suited for unstable procedures whose results are sensitive
2. Summarizing the Theory and Intuitions behind Ensembles
53
to small changes in the training dataset.
54
55 Before discussing specific methods, let us briefly sum- Finally, we note that although the discussion above
56 marize why ensembles have proven so successful in many was derived in the context of continuous predictors such
57 ML applications. Dietterich (Dietterich et al., 2000) as regression, the basic intuition behind using ensembles
58 identifies three distinct shortcomings that are fixed by applies equally well to classification tasks. Using an en-
59 ensemble methods: statistical, computational, and rep- semble allows one to reduce the variance by averaging the
60 resentational. These are explained in the following dis- result of many independent classifiers. As with regres-
61 cussion from Ref. (Louppe, 2014): sion, this procedure works best for unstable predictors
62
63
64
65
39
1
2
for which errors are dominated by variance due to finite This bootstrapping procedure allows us to construct an
3
4 sampling rather than bias. approximate ensemble and thus reduce the variance. For
5 unstable predictors, this can significantly improve the
6 predictive performance. The price we pay for using boot-
B. Bagging strapped training datasets, as opposed to really partition-
7
8 ing the dataset, is an increase in the bias of our bagged es-
BAGGing, or Bootstrap AGGregation, first introduced timators. To see this, note that as the number of datasets
9
10 by Leo Breiman, is one of the most widely employed and M goes to infinity, the expectation with respect to the
11 simplest ensemble-inspired methods (Breiman, 1996). bootstrapped samples converges to the empirical distri-
12 Imagine we have a very large dataset L that we could bution describing the training data set pL (x) (e.g. a delta
13 partition into M smaller data sets which we label function at each datapoint in L) which in general is dif-
14 {L1 , . . . , LM }. If each partition is sufficiently large to ferent from the true generative distribution for the data
15 learn a predictor, we can create an ensemble aggregate p(x).
16 predictor composed of predictors trained on each subset
17 of the data. For continuous predictors like regression,
18 this is just the average of all the individual predictors:
\[
\hat{g}^{A}_{\mathcal{L}}(x) = \frac{1}{M}\sum_{i=1}^{M} g_{\mathcal{L}_i}(x). \qquad (98)
\]
23 For classification tasks where each predictor predicts a
24 class label j ∈ {1, . . . , J}, this is just a majority vote of
25 all the predictors,
\[
\hat{g}^{A}_{\mathcal{L}}(x) = \arg\max_{j} \sum_{i=1}^{M} I[g_{\mathcal{L}_i}(x) = j], \qquad (99)
\]
where I[gLi (x) = j] is an indicator function that is equal
31
to one if gLi (x) = j and zero otherwise. From the the-
32
33 oretical discussion above, we know that this can signifi-
34 cantly reduce the variance without increasing the bias.
35 While simple and intuitive, this form of aggregation
36 clearly works only when we have enough data in each par-
37 titioned set Li . To see this, one can consider the extreme
38 limit where Li contains exactly one point. In this case,
39 the base hypothesis gLi (x) (e.g. linear regressor) becomes
40 extremely poor and the procedure above fails. One way
41 to circumvent this shortcoming is to resort to empir-
42 ical bootstrapping, a resampling technique in statis-
43 tics introduced by Efron (Efron, 1979) (see accompany- In Fig. 30 we demonstrate bagging with a perceptron
44 ing box and Fig. 29). The idea of empirical bootstrap- (linear classifier) as the base classifier that constitutes
45 ping is to use sampling with replacement to create new the elements of the ensemble. It is clear that, although
“bootstrapped” datasets $\{\mathcal{L}^{BS}_1, \ldots, \mathcal{L}^{BS}_M\}$ from our origi- each individual classifier in the ensemble performs poorly
47 nal dataset L. These bootstrapped datasets share many at classification, bagging these estimators yields reason-
48 points, but due to the sampling with replacement, are ably good predictive performance. This raises questions
49
all somewhat different from each other. In the bagging like why bagging works and how many bootstrap samples
50
procedure, we create an aggregate estimator by replac- are needed. As mentioned in the discussion above, bag-
51 ging is effective on “unstable” learning algorithms where
52 ing the M independent datasets by the M bootstrapped
estimators: small changes in the training set result in large changes
53 in predictions (Breiman, 1996). When the procedure is
\[
\hat{g}^{BS}_{\mathcal{L}}(x) = \frac{1}{M}\sum_{i=1}^{M} g_{\mathcal{L}^{BS}_i}(x). \qquad (100)
\]
unstable, the prediction error is dominated by the variance and one can exploit the aggregation component of
56
bagging to reduce the prediction error. In contrast, for
57
58 and a stable procedure the accuracy is limited by the bias
introduced by using bootstrapped datasets. This means
\[
\hat{g}^{BS}_{\mathcal{L}}(x) = \arg\max_{j} \sum_{i=1}^{M} I[g_{\mathcal{L}^{BS}_i}(x) = j]. \qquad (101)
\]
that there is an instability-to-stability transition point
beyond which bagging stops improving our prediction.
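Before turning to the bootstrap itself, the bagging procedure of Eqs. (98)-(101) can be sketched in a few lines with scikit-learn. The snippet below is our own minimal illustration (it is not taken from the accompanying notebooks): it bags perceptrons in the spirit of Fig. 30, and the synthetic dataset and hyperparameters (25 bootstrap sets of 50 points each) are placeholder choices.

```python
# Minimal sketch: bagging a perceptron (linear classifier) with scikit-learn.
# The dataset is a synthetic placeholder; hyperparameters loosely mirror Fig. 30.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single perceptron: a high-variance, "unstable" base classifier
single = Perceptron(max_iter=1000).fit(X_train, y_train)

# Bagging: train 25 perceptrons on bootstrapped subsets of 50 points each
# and aggregate their predictions by majority vote, cf. Eq. (99).
bagged = BaggingClassifier(Perceptron(max_iter=1000), n_estimators=25,
                           max_samples=50, bootstrap=True,
                           random_state=0).fit(X_train, y_train)

print("single perceptron accuracy:", single.score(X_test, y_test))
print("bagged perceptrons accuracy:", bagged.score(X_test, y_test))
```

On an unstable base classifier such as the perceptron, the bagged aggregate typically (though not always) improves on any single member of the ensemble.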
FIG. 29 Shown here is the procedure of empirical bootstrapping. The goal is to assess the accuracy of a statistical quantity of interest, which in the main text is illustrated as the sample median $\hat{M}_n(\mathcal{D})$. We start from a given dataset $\mathcal{D}$ and bootstrap $B$ size-$n$ datasets $\mathcal{D}^{\star(1)}, \cdots, \mathcal{D}^{\star(B)}$ called the bootstrap samples. Then we compute the statistical quantity of interest on these bootstrap samples to get the median $M_n^{\star(k)}$, for $k = 1, \cdots, B$. These are then used to evaluate the accuracy of $\hat{M}_n(\mathcal{D})$ (see also box on Bootstrapping in main text). It can be shown that in the $n \to \infty$ limit the distribution of $M_n^{\star(k)}$ is a Gaussian centered around $\hat{M}_n(\mathcal{D})$ whose variance $\sigma^2$, defined by Eq. (102), scales as $1/n$.

Brief Introduction to Bootstrapping

Suppose we are given a finite set of $n$ data points $\mathcal{D} = \{X_1, \cdots, X_n\}$ as training samples and our job is to construct measures of confidence for our sample estimates (e.g. the confidence interval or mean-squared error of the sample median estimator). To do so, one first samples $n$ points with replacement from $\mathcal{D}$ to get a new set $\mathcal{D}^{\star(1)} = \{X_1^{\star(1)}, \cdots, X_n^{\star(1)}\}$, called a bootstrap sample, which possibly contains repetitive elements. Then we repeat the same procedure to get in total $B$ such sets: $\mathcal{D}^{\star(1)}, \cdots, \mathcal{D}^{\star(B)}$. The next step is to use these $B$ bootstrap sets to get the bootstrap estimate of the quantity of interest. For example, let $M_n^{\star(k)} = \mathrm{Median}(\mathcal{D}^{\star(k)})$ be the sample median of bootstrap data $\mathcal{D}^{\star(k)}$. Then we can construct the variance of the distribution of bootstrap medians as:
\[
\widehat{\mathrm{Var}}_B(M_n) = \frac{1}{B-1}\sum_{k=1}^{B}\left(M_n^{\star(k)} - \bar{M}_n^{\star}\right)^2, \qquad (102)
\]
where
\[
\bar{M}_n^{\star} = \frac{1}{B}\sum_{k=1}^{B} M_n^{\star(k)} \qquad (103)
\]
is the mean of the medians of all bootstrap samples. Specifically, Bickel and Freedman (Bickel and Freedman, 1981) and Singh (Singh, 1981) showed that in the $n \to \infty$ limit, the distribution of the bootstrap estimate will be a Gaussian centered around $\hat{M}_n(\mathcal{D}) = \mathrm{Median}(X_1, \cdots, X_n)$ with standard deviation proportional to $1/\sqrt{n}$. This means that the bootstrap distribution $\hat{M}_n^{\star} - \hat{M}_n$ approximates fairly well the sampling distribution $\hat{M}_n - M$ from which we obtain the training data $\mathcal{D}$. Note that $M$ is the median of the true distribution from which the training data $\mathcal{D}$ are generated. In other words, if we plot the histogram of $\{M_n^{\star(k)}\}_{k=1}^{B}$, we will see that in the large-$n$ limit it is well fitted by a Gaussian that is sharply peaked at $\hat{M}_n(\mathcal{D})$, with a vanishing variance given by Eq. (102) (see Fig. 29).
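To make the box above concrete, here is a minimal numpy sketch (ours, not taken from the accompanying notebooks) of the bootstrap estimate in Eqs. (102) and (103) for the sample median; the Gaussian toy data and the choice of B = 1000 bootstrap sets are purely illustrative.

```python
# Minimal sketch of the empirical bootstrap for the sample median, Eqs. (102)-(103).
import numpy as np

rng = np.random.default_rng(0)
n = 100
D = rng.standard_normal(n)     # training samples X_1, ..., X_n
B = 1000                       # number of bootstrap sets

# Draw B bootstrap samples (n points with replacement) and record their medians
boot_medians = np.array([np.median(rng.choice(D, size=n, replace=True))
                         for _ in range(B)])

M_bar = boot_medians.mean()                              # Eq. (103)
var_B = np.sum((boot_medians - M_bar) ** 2) / (B - 1)    # Eq. (102)

print("sample median:", np.median(D))
print("bootstrap variance of the median:", var_B)
```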
52 C. Boosting
53
54 Another powerful and widely used ensemble method is
55
Boosting. In bagging, the contribution of all predictors
56
is weighted equally in the bagged (aggregate) predictor.
57
58 However, in principle, there are myriad ways to combine
59 different predictors. In some problems one might prefer
60 to use an autocratic approach that emphasizes the best
61 predictors, while in others it might be better to opt for
FIG. 30 Bagging applied to the perceptron learning algorithm (PLA). Training data size n = 500, number of bootstrap datasets B = 25, each containing 50 points. Colors correspond to different classes while the marker indicates how these points are labelled: cross for the true label and circle for that obtained by bagging. Each gray dashed line indicates the prediction made based on a single bootstrap set, while the dark dashed black line is the average of these.

more ‘democratic’ ways as is done in bagging. In all cases, the idea is to build a strong predictor by combining many weaker classifiers.
In boosting, an ensemble of weak classifiers $\{g_k(x)\}$ is combined into an aggregate, boosted classifier. However, unlike bagging, each classifier is associated with a weight $\alpha_k$ that indicates how much it contributes to the aggregate classifier
\[
g_A(x) = \sum_{k=1}^{M} \alpha_k\, g_k(x), \qquad (104)
\]
where $\sum_k \alpha_k = 1$. For the reasons outlined above, boosting, like all ensemble methods, works best when we combine simple, high-variance classifiers into a more complex whole.
Here, we focus on “adaptive boosting” or AdaBoost, first proposed by Freund and Schapire in the mid 1990s (Freund et al., 1999; Freund and Schapire, 1995; Schapire and Freund, 2012). The basic idea behind AdaBoost is to form the aggregate classifier in an iterative process. Importantly, at each iteration we reweight the error function to “highlight” data points where the aggregate classifier performs poorly (so that in the next round the procedure puts more emphasis on getting those points right). In this way, we can successively ensure that our classifier has good performance over the whole dataset.
We now discuss the AdaBoost procedure in greater detail. Suppose that we are given a data set $\mathcal{L} = \{(x_i, y_i),\, i = 1, \cdots, N\}$ where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y} = \{+1, -1\}$. Our objective is to find an optimal hypothesis/classifier $g : \mathcal{X} \to \mathcal{Y}$ to classify the data. Let $\mathcal{H} = \{g : \mathcal{X} \to \mathcal{Y}\}$ be the family of classifiers available in our ensemble. In the AdaBoost setting, we are concerned with classifiers that perform somewhat better than “tossing a fair coin”. This means that each classifier in the family $\mathcal{H}$ can predict $y_i$ correctly at least half of the time.
We construct the boosted classifier as follows:

• Initialize $w_{t=1}(x_n) = 1/N,\; n = 1, \cdots, N$.

• For $t = 1, \cdots, T$ (desired termination step), do:

  1. Select a hypothesis $g_t \in \mathcal{H}$ that minimizes the weighted error
\[
\epsilon_t = \sum_{i=1}^{N} w_t(x_i)\, \mathbb{1}\!\left(g_t(x_i) \neq y_i\right) \qquad (105)
\]
  2. Let $\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and update the weight of each data point $x_n$ by
\[
w_{t+1}(x_n) \leftarrow w_t(x_n)\, \frac{\exp[-\alpha_t y_n g_t(x_n)]}{Z_t},
\]
     where $Z_t = \sum_{n=1}^{N} w_t(x_n)\, e^{-\alpha_t y_n g_t(x_n)}$ ensures all weights add up to unity.

• Output $g_A(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t g_t(x)\right)$

There are many theoretical and empirical studies on the performance of AdaBoost but they are beyond the scope of this review. We refer interested readers to the extensive literature on boosting (Freund et al., 1999).


D. Random Forests

FIG. 31 Example of a decision tree. For an input observation x, its label y is predicted by traversing it from the root all the way down to the leaves, following the branches it satisfies.

We now briefly review one of the most widely used and versatile algorithms in data science and machine learning, Random Forests (RF). Random Forests is an ensemble
method widely deployed for complex classification tasks.
3
A random forest is composed of a family of (randomized) tree-based classifiers called decision trees (discussed below). De-
6 cision trees are high-variance, weak classifiers that can
7 be easily randomized, and as such, are ideally suited for
8 ensemble-based methods. Below, we give a brief high-
9 level introduction to these ideas.
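The comparison shown in Fig. 32 and discussed below can be reproduced, in simplified form, with a few scikit-learn calls. The sketch below is ours (not part of the accompanying notebooks); it reports 10-fold cross-validated accuracies instead of decision surfaces, and the hyperparameters (30 trees, 30 base hypotheses) simply echo the figure caption.

```python
# Minimal sketch: a single decision tree vs. a random forest vs. AdaBoost on Iris.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_estimators=30),
    "AdaBoost": AdaBoostClassifier(n_estimators=30),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```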
10 A decision tree uses a series of questions to hierarchi-
11 cally partition the data. Each branch of the decision tree
12 consists of a question that splits the data into smaller
13 subsets (e.g. is some feature larger than a given num-
14 ber? See Fig. 31), with the leaves (end points) of the
15 tree corresponding to the ultimate partitions of the data.
16 When using decision trees for classification, the goal is
17 to construct trees such that the partitions are informa-
18 tive about the class label (see Fig. 31). It is clear that FIG. 32 Classifying Iris dataset with aggregation models for
19 scikit learn tutorial. This dataset seeks to classify iris flow-
more complex decision trees lead to finer partitions that ers into three types (labeled in red, blue, or yellow) based on
20
give improved performance on the training set. How- a measurement of four features: septal length septal width,
21
22 ever, this generally leads to over-fitting10 , limiting the petal length, and petal width. To visualize the decision sur-
23 out-of-sample performance. For this reason, in practice face, we trained classifiers using only two of the four potential
almost all decision trees use some form of regularization features (e..g septal length, septal width). Each row corre-
24
(e.g. maximum depth for the tree) to control complex- sponds to a different subset of two features and the columns
25 to a Decision Tree with 10-fold CV (first column), Random
26 ity and reduce overfitting. Decision trees also have ex-
Forest with 30 trees and 10-fold CV (second column) and
27 tremely high variance, and are often extremely sensitive AdaBoost with 30 base hypotheses (third column). Decision
28 to many details of the training data. This is not surpris- surface learned is highlighted by color shades. See the corre-
29 ing since decision trees are learned by partitioning the sponding tutorial for more details (Pedregosa et al., 2011)
30 training data. Therefore, individual decision trees are
31 weak classifiers. However, these same properties make
32 them ideal for incorporation in an ensemble method. though this reduces the predictive power of each indi-
33 In order to create an ensemble of decision trees, we vidual decision tree, it still often improves the predictive
34 must introduce a randomization procedure. As discussed power of the ensemble because it dramatically reduces
35 above, the power of ensembles to reduce variance only correlations between members and prevents overfitting.
36 manifests when randomness reduces correlations between Examples of the kind of decision surfaces found by de-
37
the classifiers within the ensemble. Randomness is usu- cision trees, random forests, and Adaboost are shown in
38
ally introduced into random forests in one of three dis- Fig. 32. We invite the reader to check out the correspond-
39
40 tinct ways. The first is to use bagging and simply “bag” ing scikit-learn tutorial for more details of how these are
41 the decision trees by training each decision tree on a dif- implemented in python (Pedregosa et al., 2011).
42 ferent bootstrapped dataset (Breiman, 2001). Strictly There are many different types of decision trees and
43 speaking, this procedure does not constitute a random training procedures. A full discussion of decision trees
44 forest but rather a bagged decision tree. The second (and random forests) lies beyond the scope of this review
45 procedure is to only use a different random subset of and we refer readers to the extensive literature on these
46 the features at each split in the tree. This “feature topics (Lim et al., 2000; Loh, 2011; Louppe, 2014). Re-
47 bagging” is the distinguishing characteristic of random cently, decision trees were applied in high-energy physics
forests (Breiman, 2001; Ho, 1998). Using feature bag- to learn non-Higgsable gauge groups (Wang and
49 ging reduces correlations between decision trees that can Zhang, 2018).
50 arise when only a few features are strongly predictive
51 of the class label. Finally, extremized random forests
52 (ERFs) combine ordinary and feature bagging with an E. Gradient Boosted Trees and XGBoost
53 extreme randomization procedure where splitting is done
54 randomly instead of using optimality criteria (see for de- Before we turn to applications of these techniques, we
55 tails Refs. (Geurts et al., 2006; Louppe, 2014)). Even briefly discuss one final class of ensemble methods that
56
has become increasingly popular in the last few years:
57
58 Gradient-Boosted Trees (Chen and Guestrin, 2016; Fried-
59 man, 2001). The basic idea of gradient-boosted trees is
60
10 One extreme limit is an n−node tree, with n being the number to use intuition from boosting and gradient descent (in
of data point in the dataset given. particular Newton’s method, see Sec. IV) to construct
1
2
ensembles of decision trees. Like in boosting, the ensem- nalizes both large weights on the leaves (similar to L2 -
3
4 bles are created by iteratively adding new decision trees regularization) and having large partitions with many
5 to the ensemble. In gradient boosted trees, one critical leaves.
component is a cost function that measures the per- As in boosting, we form the ensemble iteratively. For
formance of our ensemble. At each step, we compute this reason, we define a family of predictors $\hat{y}_i^{(t)}$ as
the gradient of the cost function with respect to the pre-
dicted value of the ensemble and add trees that move us
in the direction of the gradient. Of course, this requires
\[
\hat{y}_i^{(t)} = \sum_{j=1}^{t} g_j(x_i) = \hat{y}_i^{(t-1)} + g_t(x_i). \qquad (109)
\]
11 a clever way of mapping gradients to decision trees. We
give a brief overview of how this is done within XGBoost Note that by definition $\hat{y}_i^{(M)} = g_A(x_i)$. The central idea
13 (Extreme Gradient Boosting), which has recently been is that for large t, each decision tree is a small pertur-
applied to classify and rank transcription factor binding bation to the predictor (of order 1/T ) and hence we can
15 in DNA sequences (Li et al., 2018). Below, we follow perform a Taylor expansion on our loss function to second
16 closely the XGboost tutorial. order:
Our starting point is a clever parametrization of de-
cision trees. Here, we use notation where the deci-
sion tree makes continuous predictions (regression trees),
though this can also easily be generalized to classifica-
\[
C_t = \sum_{i=1}^{N} l\!\left(y_i,\, \hat{y}_i^{(t-1)} + g_t(x_i)\right) + \Omega(g_t) \approx C_{t-1} + \Delta C_t, \qquad (110)
\]
tion tasks. We parametrize a decision tree j, denoted
22
as $g_j(x)$, with $T$ leaves by two quantities: a function
$q(x)$ that maps each data point to one of the leaves of
the tree, $q : x \in \mathbb{R}^d \to \{1, 2, \ldots, T\}$, and a weight vector
$w \in \mathbb{R}^T$ that assigns a predicted value to each leaf. In
other words, the $j$-th decision tree’s prediction for the
datapoint $x_i$ is simply: $g_j(x_i) = w_{q(x_i)}$.
In addition to a parametrization of decision trees, we
also have to specify a cost function which measures pre-
dictions. The prediction of our ensemble for a datapoint
$(y_i, x_i)$ is given by
\[
\hat{y}_i = g_A(x_i) = \sum_{j=1}^{M} g_j(x_i), \qquad g_j \in \mathcal{F} \qquad (106)
\]
with
\[
\Delta C_t = \sum_{i=1}^{N}\left[a_i\, g_t(x_i) + \frac{1}{2}\, b_i\, g_t(x_i)^2\right] + \Omega(g_t), \qquad (111)
\]
where
\[
a_i = \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}), \qquad (112)
\]
\[
b_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}). \qquad (113)
\]
We then choose the $t$-th decision tree $g_t$ to minimize $\Delta C_t$. This is almost identical to how we derived the Newton method update in the section on gradient descent, see Sec. IV.
where $g_j(x_i)$ is the prediction of the $j$-th decision tree on datapoint $x_i$, $M$ is the number of members of the ensemble, and $\mathcal{F} = \{g(x) = w_{q(x)}\}$ is the space of trees. As discussed in the context of random trees above, without regularization, decision trees tend to overfit the data by dividing it into smaller and smaller partitions. For this reason, our cost function is generally composed of two terms, a term that measures the goodness of predictions on each datapoint, $l_i(y_i, \hat{y}_i)$, which is assumed to be differentiable and convex, and for each tree in the ensemble, a regularization term $\Omega(g_j)$ that does not depend on the data:
\[
C(\mathbf{X}, g_A) = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{j=1}^{M} \Omega(g_j), \qquad (107)
\]
where the index $i$ runs over data points and the index $j$ runs over decision trees in our ensemble. In XGBoost, the regularization function is chosen to be
\[
\Omega(g) = \gamma T + \frac{\lambda}{2}\,\|w\|_2^2, \qquad (108)
\]
with $\gamma$ and $\lambda$ regularization parameters that must be chosen appropriately. Notice that this regularization pe-
We can actually derive an expression for the parameters of $g_t$ that minimize $\Delta C_t$ analytically. To simplify notation, it is useful to define the set of points $x_i$ that get mapped to leaf $j$: $I_j = \{i : q_t(x_i) = j\}$ and the functions $B_j = \sum_{i\in I_j} b_i$ and $A_j = \sum_{i\in I_j} a_i$. Notice that in terms of these quantities, we can write
\[
\Delta C_t = \sum_{j=1}^{T}\left[A_j w_j + \frac{1}{2}(B_j + \lambda)\, w_j^2\right] + \gamma T, \qquad (114)
\]
where we made the $t$-dependence of all parameters implicit. Note that $\lambda$ comes from the regularization term, $\Omega(g_t)$, through Eq. (108). To find the optimal $w_j$, just as in Newton’s method we take the gradient of the above expression with respect to $w_j$ and set this equal to zero, to get
\[
w_j^{\rm opt} = -\frac{A_j}{B_j + \lambda}. \qquad (115)
\]
Plugging this expression into $\Delta C_t$ gives
\[
\Delta C_t^{\rm opt} = -\frac{1}{2}\sum_{j=1}^{T}\frac{A_j^2}{B_j + \lambda} + \gamma T. \qquad (116)
\]
chosen appropriately. Notice that this regularization pe- 2 j=1 Bj + λ
61
62
63
64
65
44
1
2 1.00
It is clear that ∆Ctopt measures the in-sample performance Train (coarse)
3 Test (coarse)
of gt and we should find the decision tree that minimizes 0.95
4 Critical (coarse)
5 this value. In principle, one could enumerate all possi- 0.90 Train (fine)
ble trees over the data and find the tree that minimizes Test (fine)
6 0.85 Critical (coarse)

Accuracy
7 ∆Ctopt . However, in practice this is impossible. Instead,
0.80
8 an approximate greedy algorithm is run that optimizes
one level of the tree at a time by trying to find optimal 0.75
9
10 splits of the data. This leads to a tree that is a good 0.70
11 local minimum of ∆Ctopt which is then added to the en- 0.65
10 20 30 40 50 60 70 80 90 100
12 semble. We emphasize that this is only a very high level Nestimators
13 sketch of how the algorithm works. In practice, addi-
14 tional regularization such as shrinkage(Friedman, 2002) 90
15 and feature subsampling(Breiman, 2001; Friedman et al., 80 Fine
16 2003) is also used. In addition, there are many numerical
Coarse
70
17 and technical tricks used for the approximate algorithm
18 60
and how to find splits of the data that give good decision

Run time (s)


19 trees (Chen and Guestrin, 2016). 50
20 40
21
30
22
23 F. Applications to the Ising model and Supersymmetry 20
24 Datasets 10
25 0
26 We now illustrate some of these ideas using two exam- 10 20 30 40 50 60 70 80 90 100
Nestimators
27 ples drawn from physics: (i) classifying the phases of the
28 spin configurations of the 2D-Ising model above and be-
FIG. 33 Using Random Forests (RFs) to classify Ising Phases.
29 low the critical temperature using random forests and (ii)
(Top) Accuracy of RFs for classifying the phase of samples
classifying Monte-Carlo simulations of collision events in from the Ising model for the training set (blue), test set (red),
31 the SUSY dataset as supersymmetric or standard using and critical region (green) using coarse trees with a few leaves
32 an XGBoost implementation of gradient-boosted trees. (triangles) and fine decision trees with many leaves (filled cir-
33 Both examples were analyzed in Sec. VII.C using logistic cles). RFs were trained on samples from ordered and disor-
34 regression. Here we show that on the Ising dataset, the dered phases but were not trained on samples from the critical
35 RFs perform significantly better than logistic regression region. (Bottom) The time it takes to train RFs scales lin-
36 early with the number of estimators in the ensemble. For the
models whereas gradient boosted trees seem to yield an
37 upper panel, note that the train case (blue) overlaps with the
accuracy of about 80%, comparable to published results. test case (red). Here ‘fine’ and ‘coarse’ refer to trees with 2
38
The two accompanying Jupyter notebooks discuss practi- and 10,000 leaves, respectively. For implementation details,
39
cal details of implementing these examples and the read- see Jupyter notebooks 9
40
41 ers are encouraged to experiment with the notebooks.
42 The Ising dataset used for classification by RFs here
43 is identical to that used to study logistic regression in shown in Figure 33. We used two types of RF classifiers,
44 Sec. VII.C. We assign a label to each state according to one where the ensemble consists of coarse decision trees
45 its phase: 0 if the state is disordered, and 1 if it is ordered. with a few leaves and another with finer decision trees
46 We divide the dataset into three categories according to with many leaves (see corresponding notebook). RFs
47 the temperature at which samples are drawn: ordered have extremely high accuracy on the training and test
48 (T /J < 2.0), near-critical (2.0 ≤ T /J ≤ 2.5) and disor- sets (over 99%) for both coarse and fine trees. How-
49 dered (T /J > 2.5) (see Figure 20). We use the ordered ever, notice that the RF consisting of coarse trees per-
50 and disordered states to train a random forest and eval- form extremely poorly on samples from the critical region
51 uate our learned model on a test set of unseen ordered whereas the RF with fine trees classifies critical samples
52 and disordered states (test sets). We also ask how well with an accuracy of nearly 85%. Interestingly, and unlike
53 our RF can predict the phase of samples drawn in the with logistic regression, this performance in the critical
54 critical region (i.e. predict whether the temperature of region requires almost no parameter tuning. This is be-
55 a critical sample is above or below the critical tempera- cause, as discussed above, RFs are largely immune to
56
ture). Since our model is never trained on samples in the overfitting problems even as the number of estimators in
57
critical region, prediction in this region is a test of the the ensemble becomes large. Increasing the number of es-
58
algorithm’s ability to generalize to new regions in phase timators in the ensemble does increase performance but
59
60 space. at a large cost in computational time (Fig. 33 bottom).
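For readers who want to reproduce a scaled-down version of this experiment, the sketch below shows the scikit-learn calls involved; it is ours, not the code of the accompanying notebooks. The spin configurations and phase labels are random placeholders so that the snippet runs on its own (and therefore yields meaningless accuracies); substitute the Ising dataset described in the text. The leaf counts mirror the “coarse” and “fine” trees of Fig. 33.

```python
# Sketch of the Random-Forest Ising-phase classifier discussed in the text.
# X and y below are random placeholders; replace them with the flattened Ising
# spin configurations and their ordered/disordered labels from the notebooks.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(1000, 40 * 40))   # placeholder "spin configurations"
y = rng.integers(0, 2, size=1000)               # placeholder phase labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "coarse" trees: 2 leaves per tree; "fine" trees: up to 10,000 leaves per tree
for name, leaves in [("coarse", 2), ("fine", 10000)]:
    rf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=leaves)
    rf.fit(X_train, y_train)
    print(name, "test accuracy:", rf.score(X_test, y_test))
```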
61 The results of fits using RFs to predict phases are In the second application of ensemble methods to
1
2
Feature importance Learning” in the mid 2000s (Hinton et al., 2006; Hinton
3
4 axial MET 517 and Salakhutdinov, 2006). DNNs truly caught the atten-
missing energy magnitude 505 tion of the wider machine learning community and indus-
5 MET_rel 461
dPhi_r_b 443 try in 2012 when Alex Krizhevsky, Ilya Sutskever, and
6 lepton 2 eta 416
7 lepton 1 eta 396 Geoff Hinton used a GPU-based DNN model (AlexNet)
missing energy phi 361
8 cos(theta_r1) 352 to lower the error rate on the ImageNet Large Scale
Features

lepton 1 pT 351 Visual Recognition Challenge (ILSVRC) by an incred-


9 lepton 2 pT 348
10 S_R 346 ible twelve percent from 28% to 16% (Krizhevsky et al.,
M_R 331
11 R 319 2012). Just three years later, a machine learning group
lepton 1 phi 314
12 MT2 314 from Microsoft achieved an error of 3.57% using an ultra-
lepton 2 phi 290
13 M_TR_2 283 deep residual neural network (ResNet) with 152 layers
14 M_Delta_R 273 (He et al., 2016)! Since then, DNNs have become the
15 0 100 200 300 400 500 workhorse technique for many image and speech recogni-
F score
16 tion based machine learning tasks. The large-scale indus-
17 trial deployment of DNNs has given rise to a number of
FIG. 34 Feature Importance Scores in SUSY dataset from
18 high-level libraries and packages (Caffe, Keras, Pytorch,
applying XGBoost to 100, 000 samples. See Notebook 10 for
19 TensorFlow, etc.) that make it easy to quickly code and
more details.
20
deploy DNNs.
21
22 Conceptually, it is helpful to divide neural networks
physics-related datasets, we used the XGBoost imple- into four categories: (i) general purpose neural net-
23
mentation of gradient boosted trees to classify Monte- works for supervised learning, (ii) neural networks de-
24
25 Carlo collisions from the SUSY dataset. With default signed specifically for image processing, the most promi-
26 parameters using a small subset of the data (100, 000 out nent example of this class being Convolutional Neural
27 of the full 5, 000, 000 samples), we were able to achieve Networks (CNNs), (iii) neural networks for sequential
28 a classification accuracy of about 79%, which could be data such as Recurrent Neural Networks (RNNs), and
29 improved to nearly 80% after some fine-tuning (see ac- (iv) neural networks for unsupervised learning such as
30 companying notebook). This is comparable to published Deep Boltzmann Machines. Here, we will limit our dis-
31 results (Baldi et al., 2014) and those obtained using lo- cussions to the first two categories (unsupervised learn-
32 gistic regression in earlier chapters. One nice feature of ing is discussed later in the review). Though increas-
33 ensemble methods such as XGBoost is that they auto- ingly important for many applications such as audio and
34 matically allow us to calculate feature scores (Fscores) speech recognition, for the sake of brevity, we omit a dis-
35 that rank the importance of various features for clas- cussion of sequential data and RNNs from this review.
36 sification. The higher the Fscore, the more important For an introduction to RNNs and LSTM networks see
37 the feature for classification. Figure 34 shows the feature Chris Olah’s blog, https://colah.github.io/posts/2015-
38 scores from our XGBoost algorithm for the production of 08-Understanding-LSTMs/, and Chapter 13 of (Bishop,
39 electrically-charged supersymmetric particles (χ±) which 2006) as well as the introduction to RNNs in Chapter 10
40 decay to W bosons and an electrically neutral supersym-
41 of (Goodfellow et al., 2016) for sequential data.
metric particle χ0 , which is invisible to the detector. The Due to the number of recent books on deep learning
42
features are a mix of eight directly measurable quanti- (see for example Michael Nielsen’s introductory online
43
44 ties from the detector, as well as ten hand crafted fea- book (Nielsen, 2015) and the more advanced (Goodfel-
45 tures chosen using physics knowledge. Consistent with low et al., 2016)), the goal of this section is to give a
46 the physics of these supersymmetric decays in the lepton high-level introduction to the basic ideas behind DNNs,
47 channel, we find that the most informative features for and provide some practical knowledge for coding simple
48 classification are the missing transverse energy along the neural nets for supervised learning tasks (see the accom-
49 vector defined by the charged leptons (Axial MET) and panying Notebooks). This section assumes the reader
50 the missing energy magnitude due to χ0 . is familiar with the basic concepts introduced in earlier
51 sections on logistic and linear regression. Throughout,
52 we strive to provide intuition behind the inner workings
53 of DNNs, as well as highlight limitations of present-day
IX. AN INTRODUCTION TO FEED-FORWARD DEEP
54 NEURAL NETWORKS (DNNS) algorithms.
55
The influx of corporate and industrial interests has
56
Over the last decade, neural networks have emerged rapidly transformed the field in the last few years. This
57
as one of the most powerful and widely-used supervised massive influx of money and researchers has given rise to
59 learning techniques. Deep Neural Networks (DNNs) have new dogmas and best practices that change rapidly. As
60 a long history (Bishop, 1995b; Schmidhuber, 2015), but with most intellectual fields experiencing rapid expan-
re-emerged to prominence after a rebranding as “Deep sion, many commonly accepted heuristics may turn out
1
2
not to be as powerful as thought (Wilson et al., 2017), et al., 2018). Representing quantum states as DNNs (Gao
3
4 and widely held beliefs not as universal as once imagined et al., 2017; Gao and Duan, 2017; Levine et al., 2017;
5 (Lee et al., 2017; Zhang et al., 2016). This is especially Saito and Kato, 2017) and quantum state tomogra-
6 true in modern neural networks where results are largely phy (Torlai et al., 2018) are among some of the impres-
7 empirical and heuristic and lack the firm footing of many sive achievements to reveal the potential of deep learn-
8 earlier machine learning methods. For this reason, in ing to facilitate the study of quantum systems. Machine
9 this review we have chosen to emphasize tried and true learning techniques involving neural networks were also
10 fundamentals, while pointing out what, from our current used to study quantum and fault-tolerant error correc-
11 vantage point, seem like promising new techniques. The tion (Baireuther et al., 2017; Breuckmann and Ni, 2017;
12 field is rapidly evolving and readers are urged to read Chamberland and Ronagh, 2018; Davaasuren et al., 2018;
13 papers and to implement these algorithms themselves in Krastanov and Jiang, 2017; Maskara et al., 2018), es-
14 order to gain a deep appreciation for the incredible power timate rates of coherent and incoherent quantum pro-
15 of modern neural networks, especially in the context of cesses (Greplova et al., 2017), to obtain spectra of 1/f -
16 image, speech, and natural language processing, as well noise in spin-qubit devices (Zhang and Wang, 2018), and
17 as limitations of the current methods. the recognition of state and charge configurations and
18 auto-tuning in quantum dots (Kalantre et al., 2017). In
In physics, DNNs and CNNs have already found
19 quantum information theory, it has been shown that one
numerous applications. In statistical physics, they
20
have been applied to detect phase transitions in 2D can perform gate decompositions with the help of neural
21
Ising (Tanaka and Tomiya, 2017a) and Potts (Li nets (Swaddle et al., 2017). In lattice quantum chromo-
22
23 et al., 2017) models, lattice gauge theories (Wetzel and dynamics, DNNs have been used to learn action param-
24 Scherzer, 2017), and different phases of polymers (Wei eters in regions of parameter space where principal com-
25 et al., 2017). It has also been shown that deep neural net- ponent analysis fails (Shanahan et al., 2018). CNNs were
26 works can be used to learn free-energy landscapes (Sidky applied to data from a high-energy experiment to iden-
27 and Whitmer, 2017). At the same time, methods from tify particle interactions in sampling calorimeters used
28 statistical physics have been applied to the field of deep commonly in neutrino physics (Aurisano et al., 2016).
29 learning to study the thermodynamic efficiency of learn- Last but not least, DNNs also found place in the study
30 ing rules (Goldt and Seifert, 2017), to explore the hy- of quantum control (Yang et al., 2017), and in scattering
31 pothesis space that DNNs span, make analogies between theory to learn the s-wave scattering length (Wu et al.,
32 training DNNs and spin glasses (Baity-Jesi et al., 2018; 2018) of potentials.
33 Baldassi et al., 2017), and to characterize phase transi-
34 tions with respect to network topology in terms of er-
35 rors (Li and Saad, 2017). In relativistic hydrodynam- A. Neural Network Basics
36 ics, deep learning has been shown to capture features
37 Neural networks (also called neural nets) are neural-
of non-linear evolution and has the potential to accel-
38
erate numerical simulations (Huang et al., 2018), while inspired nonlinear models for supervised learning. As
39
in mechanics CNNs have been used to predict eigenval- we will see, neural nets can be viewed as natural, more
40
41 ues of photonic crystals (Finol et al., 2018). Deep CNNs powerful extensions of supervised learning methods such
42 were employed in lensing reconstruction of the cosmic as linear and logistic regression and soft-max methods.
43 microwave background (Caldeira et al., 2018). Recently,
44 DNNs have been used to improve the efficiency of Monte-
45 Carlo algorithms (Shen et al., 2018). 1. The basic building block: neurons
46 Deep learning has also found interesting applications
47 in quantum physics. Various quantum phase transi- The basic unit of a neural net is a stylized “neu-
48 tions (Arai et al., 2017; Broecker et al., 2017; Iakovlev ron” i that takes a vector of d input features x =
49 et al., 2018; van Nieuwenburg et al., 2017b; Suchs- (x1 , x2 , . . . , xd ) and produces a scalar output ai (x). A
50 land and Wessel, 2018) can be detected and studied neural network consists of many such neurons stacked
51 using DNNs and CNNs, including the transverse-field into layers, with the output of one layer serving as the
52 Ising model (Ohtsuki and Ohtsuki, 2017), topological input for the next (see Figure 35). The first layer in the
53 phases (Yoshioka et al., 2017; Zhang et al., 2017a,b) neural net is called the input layer, the middle layers are
54 and non-invasive topological quality control (Caio et al., often called “hidden layers”, and the final layer is called
55 the output layer.
2019). DNNs found applications even in non-equilibrium
56
many-body localization (van Nieuwenburg et al., 2017a,b; The exact function ai varies depending on the type of
57
58 Schindler et al., 2017; Venderley et al., 2017), and the non-linearity used in the neural network. However, in
59 characterization of photoexcited quantum states (Shinjo essentially all cases ai can be decomposed into a linear
60 et al., 2019). Experimentally, DNNs were recently em- operation that weights the relative importance of the var-
61 ployed in cold atoms to identify critical points (Rem ious inputs, and a non-linear transformation σi (z) which
62
63
64
65
47
1
2
A descent. For this reason, until recently the most pop-
3 linear nonlinearity
4
x1 w1 ular choice of non-linearity was the tanh function or a
5 w2
sigmoid/Fermi function. However, this choice of non-
input x2 w. x + b output linearity has a major drawback. When the input weights
6 w3
7 become large, as they often do in training, the activation
8 x3 function saturates and the derivative of the output with
9 hidden respect to the weights tends to zero since ∂σ/∂z → 0
B
layers for z ≫ 1. Such “vanishing gradients” are a feature of
{
11 any saturating activation function (top row of Fig. 36),
12 output making it harder to train deep networks. In contrast, for
layer
13 a non-saturating activation function such as ReLUs or

{
{
14
input
ELUs, the gradients stay finite even for large inputs.
15 layer
16
17
2. Layering neurons to build deep networks: network
18 architecture.
19
FIG. 35 Basic architecture of neural networks. (A)
20 The basic components of a neural network are stylized neu-
21 The basic idea of all neural networks is to layer neurons
rons consisting of a linear transformation that weights the in a hierarchical fashion, the general structure of which is
22 importance of various inputs, followed by a non-linear activa-
known as the network architecture (see Fig. 35). In the
23 tion function. (b) Neurons are arranged into layers with the
24 output of one layer serving as the input to the next layer. simplest feed-forward networks, each neuron in the in-
25 put layer of the neurons takes the inputs x and produces
26 an output ai (x) that depends on its current weights, see
27 is usually the same for all neurons. The linear trans- Eq. (118). The outputs of the input layer are then treated
28 formation in almost all neural networks takes the form as the inputs to the next hidden layer. This is usually
29 of a dot product with a set of neuron-specific weights repeated several times until one reaches the top or output
30 (i) (i) (i)
w(i) = (w1 , w2 , . . . , wd ) followed by re-centering with layer. The output layer is almost always a simple clas-
31
a neuron-specific bias b : (i) sifier of the form discussed in earlier sections: a logistic
32 regression or soft-max function in the case of categorical
33
z (i) = w(i) · x + b(i) = xT · w(i) , (117) data (i.e. discrete labels) or a linear regression layer in
34 the case of continuous outputs. Thus, the whole neural
35 network can be thought of as a complicated nonlinear
where x = (1, x) and w(i) = (b(i) , w(i) ). In terms of z (i)
36 transformation of the inputs x into an output ŷ that de-
37 and the non-linear function σi (z), we can write the full
input-output function as pends on the weights and biases of all the neurons in the
38
input, hidden, and output layers.
39
40 ai (x) = σi (z (i) ), (118) The use of hidden layers greatly expands the represen-
41 tational power of a neural net when compared with a sim-
42 see Figure 35. ple soft-max or linear regression network. Perhaps, the
43 Historically in the neural network literature, common most formal expression of the increased representational
44 choices of nonlinearities included step-functions (percep- power of neural networks (also called the expressivity) is
45 trons), sigmoids (i.e. Fermi functions), and the hyper- the universal approximation theorem which states that a
46 bolic tangent. More recently, it has become more com- neural network with a single hidden layer can approxi-
47 mon to use rectified linear units (ReLUs), leaky recti- mate any continuous, multi-input/multi-output function
48 fied linear units (leaky ReLUs), and exponential linear with arbitrary accuracy. The reader is strongly urged
49 units (ELUs) (see Figure 36). Different choices of non- to read the beautiful graphical proof of the theorem in
50 linearities lead to different computational and training Chapter 4 of Nielsen’s free online book (Nielsen, 2015).
51 properties for neurons. The underlying reason for this is The basic idea behind the proof is that hidden neurons
52 that we train neural nets using gradient descent based allow neural networks to generate step functions with ar-
53 methods, see Sec. IV, that require us to take derivatives bitrary offsets and heights. These can then be added
54 of the neural input-output function with respect to the together to approximate arbitrary functions. The proof
55
weights w(i) and the bias b(i) . also makes clear that the more complicated a function,
56
Notice that the derivatives of the aforementioned non- the more hidden units (and free parameters) are needed
57
58 linearities σ(z) have very different properties. The to approximate it. Hence, the applicability of the ap-
59 derivative of the perceptron is zero everywhere except proximation theorem to practical situations should not
60 where the input is zero. This discontinuous behavior be overemphasized. In condensed matter physics, a good
61 makes it impossible to train perceptrons using gradient analogy are matrix product states, which can approxi-
[Figure 36 panels: Perceptron, Sigmoid, Tanh (top row); ReLU, Leaky ReLU, ELU (bottom row), each plotted as a function of the input z.]
21
22
23
24 FIG. 36 Possible non-linear activation functions for neurons. In modern DNNs, it has become common to use non-linear
functions that do not saturate for large inputs (bottom row) rather than saturating functions (top row).
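For reference, the activation functions plotted in Fig. 36 can be written in a few lines of numpy; the leaky-ReLU slope and the ELU parameter α below are common but arbitrary choices.

```python
# The non-linearities of Fig. 36, written with numpy.
import numpy as np

def perceptron(z):             # step function
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):                # logistic / Fermi function
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                   # hyperbolic tangent
    return np.tanh(z)

def relu(z):                   # rectified linear unit
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.1):  # small non-zero slope for z < 0
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):         # exponential linear unit
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))
```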
25
26
27
28 mate any quantum many-body state to an arbitrary ac- that works best. However, a general rule of thumb that
29 curacy, provided the bond dimension can be increased seems to be emerging is that the number of parameters in
30 arbitrarily – a severe requirement not met in any useful the neural net should be large enough to prevent under-
31 practical implementation of the theory. fitting (see theoretical discussion in (Advani and Saxe,
32 Modern neural networks generally contain multiple 2017)).
33 hidden layers (hence the “deep” in deep learning). There Empirically, the best architecture for a problem de-
34 are many ideas of why such deep architectures are fa- pends on the task, the amount and type of data that is
35 vorable for learning. Increasing the number of layers in- available, and the computational resources at one’s dis-
36 creases the number of parameters and hence the represen- posal. Certain architectures are easier to train, while
37 tational power of neural networks. Indeed, recent numer- others might be better at capturing complicated depen-
38 ical experiments suggests that as long as the number of dencies in the data and learning relevant input features.
39 Finally, there have been numerous works that move be-
parameters is larger than the number of data points, cer-
40 yond the simple deep, feed-forward neural network ar-
tain classes of neural networks can fit arbitrarily labeled
41 chitectures discussed here. For example, modern neural
random noise samples (Zhang et al., 2016). This suggests
42
that large neural networks of the kind used in practice can networks for image segmentation often incorporate “skip
43
express highly complex functions. Adding hidden layers connections” that skip layers of the neural network (He
44
45 is also thought to allow neural nets to learn more complex et al., 2016). This allows information to directly propa-
46 features from the data. Work on convolutional networks gate to a hidden or output layer, bypassing intermediate
47 suggests that the first few layers of a neural network learn layers and often improving performance.
48 simple, “low-level” features that are then combined into
49 higher-level, more abstract features in the deeper layers.
50 Other works suggest that it is computationally and al- B. Training deep networks
51 gorithmically easier to train deep networks rather than
52 shallow, wider nets, though this is still an area of major In the previous section, we introduced the basic ar-
53 controversy and active research (Mhaskar et al., 2016). chitecture for neural networks. Here we discuss how to
54 Choosing the exact network architecture for a neural efficiently train large neural networks. Luckily, the basic
55
network remains an art that requires extensive numer- procedure for training neural nets is the same as we used
56
ical experimentation and intuition, and is often times for training simpler supervised learning algorithms, such
57
58 problem-specific. Both the number of hidden layers and as logistic and linear regression: construct a cost/loss
59 the number of neurons in each layer can affect the per- function and then use gradient descent to minimize the
60 formance of a neural network. There seems to be no cost function and find the optimal weights and biases.
61 single recipe for the right architecture for a neural net Neural networks differ from these simpler supervised pro-
1
2
cedures in that generally they contain multiple hidden m|xi ; w). Then, the categorical cross-entropy is defined
3
4 layers that make taking the gradient computationally as
more difficult. We will return to this in Sec. IX.C which
discusses the “backpropagation” algorithm for computing
gradients.
\[
E(\mathbf{w}) = -\sum_{i=1}^{n}\sum_{m=0}^{M-1} y_{im}\log \hat{y}_{im}(\mathbf{w}) + (1 - y_{im})\log\left[1 - \hat{y}_{im}(\mathbf{w})\right]. \qquad (122)
\]
Like all supervised learning procedures, the first thing
9 one must do to train a neural network is to specify a
10 loss function. Given a data point (xi , yi ), xi ∈ Rd+1 , As in linear and logistic regression, this loss function is
11 the neural network makes a prediction ŷi (w), where w often supplemented by additional terms that implement
12 are the parameters of the neural network. Recall that regularization.
13 in most cases, the top output layer of our neural net Having defined an architecture and a cost function, we
14 is either a continuous predictor or a classifier that makes must now train the model. Similar to other supervised
15 discrete (categorical) predictions. Depending on whether learning methods, we make use of gradient descent-based
16 one wants to make continuous or categorical predictions, methods to optimize the cost function. Recall that the
17 one must utilize a different kind of loss function. basic idea of gradient descent is to update the parame-
18 For continuous data, the loss functions that are com-
19 ters w to move in the direction of the gradient of the cost
monly used to train neural networks are identical to those function ∇w E(w). In Sec. IV, we discussed numerous
20
used in linear regression, and include the mean squared optimizers that implement variations of stochastic gra-
21
error dient descent (SGD, Nesterov, RMSProp, Adam, etc.)
22
\[
E(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i(\mathbf{w})\right)^2, \qquad (119)
\]
Most modern neural network packages, such as Keras, allow the user to specify which of these optimizers they
25 would like to use in order to train the neural network.
26 where n is the number of data points, and the mean- Depending on the architecture, data, and computational
27 absolute error (i.e. L1 norm) resources, different optimizers may work better on the
28 problem, though vanilla SGD is a good first choice.
\[
E(\mathbf{w}) = \frac{1}{n}\sum_{i}\left|y_i - \hat{y}_i(\mathbf{w})\right|. \qquad (120)
\]
Finally, we note that unlike in linear and logistic regression, calculating the gradients for a neural network
31 requires a specialized algorithm, called Backpropagation
32 The full cost function often includes additional terms that (often abbreviated backprop) which forms the heart of
33 implement regularization (e.g. L1 or L2 regularizers), see any neural network training procedure. Backpropaga-
34 Sec. VI. tion has been discovered multiple times independently
35 For categorical data, the most commonly used loss but was popularized for modern neural networks in 1985
36 function is the cross-entropy (Eq. (76) and Eq. (81)), (Rumelhart and Zipser, 1985). We will turn to the back-
37 since the output layer is often taken to be a logistic clas- propagation algorithm in the next section. Before read-
38 sifier for binary data with two types of labels, or a soft-
39 ing it, the reader is strongly encouraged to play with
max classifier if there are more than two types of labels. Notebook 11 in order to gain some intuition about how
40 The cross-entropy was already discussed extensively in
41 to build a DNN in practice using the high-level Keras
earlier sections on logistic regression and soft-max classi- Python package. Notebook 11 discusses a simple exam-
42
fiers, see Sec. VII. Recall that for classification of binary ple where we build a feed-forward deep neural network for
43
44 data, the output of the top layer of the neural network is classifying hand-written digits from the MNIST dataset.
45 the probability ŷi (w) = p(yi = 1|xi ; w) that data point Figures 37 and 38 show the accuracy and the loss as a
46 i is predicted to be in category 1. The cross-entropy be- function of the training episodes.
47 tween the true labels yi ∈ {0, 1} and the predictions is
48 given by
49 n
X C. The Backpropagation algorithm
50 E(w) = − yi log ŷi (w) + (1 − yi ) log [1 − ŷi (w)] .
51 i=1 In the last section, we saw how to deploy a high-level
52 package, Keras, to design and train a simple neural net-
53 More generally, for categorical data, y can take on M
values so that y ∈ {0, 1, . . . , M − 1}. For each datapoint work. This training procedure requires us to be able to
54 calculate the derivative of the cost function with respect
55 i, define a vector yim called a ‘one-hot’ vector, such that
( to all the parameters of the neural network (the weights
56
1, if yi = m and biases of all the neurons in the input, hidden, and
57 yim = (121) visible layers). A brute force calculation is out of the
58 0, otherwise.
59 question since it requires us to calculate as many gradi-
60 We can also define the probability that the neural net- ents as parameters at each step of the gradient descent.
61 work assigns a datapoint to category m: ŷim (w) = p(yi = The backpropagation algorithm (Rumelhart and Zipser,
1
2
nection from the k-th neuron in layer l − 1 to the j-th
3
4
0.975 train neuron in layer l. We denote the bias of this neuron by
5
test blj . By construction, in a feed-forward neural network the
6 0.950 activation alj of the j-th neuron in the l-th layer can be
related to the activities of the neurons in the layer l − 1
model accuracy

7
8 0.925 by the equation
9 !
0.900 X
10 l
aj = σ l l−1
wjk ak + bj = σ(zjl ),
l
(123)
11 k
12 0.875
13 where we have defined the linear weighted sum
0.850 X
14 zjl = l
wjk al−1 + blj . (124)
k
15 k
16 0 2 4 6 8
17 epoch By definition, the cost function E depends directly on
18 j . It of course also indi-
the activities of the output layer aL
19 rectly depends on all the activities of neurons in lower lay-
FIG. 37 Model accuracy of a DNN to study the MNIST prob-
20 lem as a function of the training epochs (see Notebook 11).
ers in the neural network through iteration of Eq. (123).
21 Besides the input and output layers, the DNN has four layers Let us define the error ∆L j of the j-th neuron in the L-th
22 of size (100,400,400,50) with different nonlinearities σ(z). layer as the change in cost function with respect to the
23 weighted input zjL
24
∂E
25 train ∆L
j = . (125)
26 0.5 ∂zjL
test
27
This definition is the first of the four backpropagation
28
0.4 equations.
29
We can analogously define the error of neuron j in
model loss

30
31 0.3
layer l, ∆lj , as the change in the cost function w.r.t. the
32 weighted input zjl :
33 ∂E ∂E 0 l
0.2
34 ∆lj = l
= σ (zj ), (I)
35 ∂zj ∂alj
36 0.1 where σ 0 (x) denotes the derivative of the non-linearity
37 σ(·) with respect to its input evaluated at x. Notice
38 0 2 4 6 8 that the error function ∆lj can also be interpreted as the
39 epoch
partial derivative of the cost function with respect to the
40
bias blj , since
41 FIG. 38 Model loss of a DNN to study the MNIST problem as
42 a function of the training epochs (see Notebook 11). Besides ∂E ∂E ∂blj ∂E
43 the input and output layers, the DNN has four layers of size ∆lj = l
= l = l, (II)
∂zj ∂bj ∂zjl ∂bj
44 (100,400,400,50) with different nonlinearities σ(z).
45 where in the last line we have used the fact that
46 ∂blj /∂zjl = 1, cf. Eq. (124). This is the second of the
47 1985) is a clever procedure that exploits the layered struc- four backpropagation equations.
48 ture of neural networks to more efficiently compute gra- We now derive the final two backpropagation equations
49 dients (for a more detailed discussion with Python code using the chain rule. Since the error depends on neurons
50 examples see Chapter 2 of (Nielsen, 2015)). in layer l only through the activation of neurons in the
51
subsequent layer l + 1, we can use the chain rule to write
52
53 ∂E X ∂E ∂z l+1
1. Deriving and implementing the backpropagation equations ∆lj = = k
54 l
∂zj ∂z l+1 ∂z l
k k j
55
At its core, backpropagation is simply the ordinary X
56 ∂zkl+1
chain rule for partial differentiation, and can be summa- = ∆l+1
k
57 ∂zjl
rized using four equations. In order to see this, we must k
58 !
59 first establish some useful notation. We will assume that X
60 there are L layers in our network with l = 1, . . . , L in- = ∆l+1 l+1
k wkj σ 0 (zjl ). (III)
61 dexing the layer. Denote by wjk l
the weight for the con- k

62
63
64
65
51
1
2
This is the third backpropagation equation. The final network. However, until fairly recently it was widely be-
3
4 equation can be derived by differentiating of the cost lieved that training deep networks was an extremely dif-
function with respect to the weight $w^l_{jk}$ as
$$\frac{\partial E}{\partial w^l_{jk}} = \frac{\partial E}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial w^l_{jk}} = \Delta^l_j\, a^{l-1}_k. \qquad \text{(IV)}$$

Together, Eqs. (I), (II), (III), and (IV) define the four backpropagation equations relating the gradients of the activations of various neurons $a^l_j$, the weighted inputs $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, and the errors $\Delta^l_j$. These equations can be combined into a simple, computationally efficient algorithm to calculate the gradient with respect to all parameters (Nielsen, 2015).

The Backpropagation Algorithm

1. Activation at input layer: calculate the activations $a^1_j$ of all the neurons in the input layer.

2. Feedforward: starting with the first layer, exploit the feed-forward architecture through Eq. (123) to compute $z^l$ and $a^l$ for each subsequent layer.

3. Error at top layer: calculate the error of the top layer using Eq. (I). This requires knowing the expression for the derivative of both the cost function $E(\mathbf{w}) = E(\mathbf{a}^L)$ and the activation function $\sigma(z)$.

4. "Backpropagate" the error: use Eq. (III) to propagate the error backwards and calculate $\Delta^l_j$ for all layers.

5. Calculate gradient: use Eqs. (II) and (IV) to calculate $\frac{\partial E}{\partial b^l_j}$ and $\frac{\partial E}{\partial w^l_{jk}}$.

We can now see where the name backpropagation comes from. The algorithm consists of a forward pass from the bottom layer to the top layer, where one calculates the weighted inputs and activations of all the neurons. One then backpropagates the error, starting with the top layer down to the input layer, and uses these errors to calculate the desired gradients. This description makes clear the incredible utility and computational efficiency of the backpropagation algorithm: we can calculate all the derivatives using a single "forward" and "backward" pass of the neural network. This computational efficiency is crucial since we must calculate the gradient with respect to all parameters of the neural net at each step of gradient descent. These basic ideas also underlie almost all modern automatic differentiation packages such as Autograd (Pytorch).

2. Computing gradients in deep networks: what can go wrong with backprop?

Armed with backpropagation and gradient descent, it seems like it should be straightforward to train any neural network. In practice, however, this long remained a difficult task. One reason for this was that even with backpropagation, gradient descent on large networks is extremely computationally expensive. However, the great advances in computational hardware (and the widespread use of GPUs) have made this a much less vexing problem than even a decade ago. It is hard to overstate the impact these advances in computing have had on the practical utility of neural networks.

On a more technical and mathematical note, another problem that occurs in deep networks, which transmit information through many layers, is that gradients can vanish or explode. This is, appropriately, known as the problem of vanishing or exploding gradients. This problem is especially pronounced in neural networks that try to capture long-range dependencies, such as Recurrent Neural Networks for sequential data. We can illustrate this problem by considering a simple network with one neuron in each layer. We further assume that all weights are equal, and denote them by $w$. The behavior of the backpropagation equations for such a network can be inferred from repeatedly using Eq. (III):
$$\Delta^1_j = \Delta^L_j \prod_{j=0}^{L-1} w\,\sigma'(z_j) = \Delta^L_j\, w^L \prod_{j=0}^{L-1} \sigma'(z_j), \qquad (126)$$
where $\Delta^L_j$ is the error in the $L$-th (topmost) layer, and $w^L$ is the weight to the power $L$. Let us now also assume that the magnitude $\sigma'(z_j)$ is fairly constant and we can approximate $\sigma'(z_j) \approx \sigma'_0$. In this case, notice that for large $L$, the error $\Delta^1_j$ has very different behavior depending on the value of $w\sigma'_0$. If $w\sigma'_0 > 1$, the errors and the gradient blow up. On the other hand, if $w\sigma'_0 < 1$, the errors and gradients vanish. Only when the weights satisfy $w\sigma'_0 \approx 1$ and the neurons are not saturated will the gradient stay well behaved for deep networks.

This basic behavior holds true even in more complicated networks. Rather than considering a single weight, we can ask about the eigenvalues (or singular values) of the weight matrices $w^l_{jk}$. In order for the gradients to be finite for deep networks, we need these eigenvalues to stay near unity even after many gradient descent steps. In modern feedforward and ReLU neural networks, this is achieved by initializing the weights for the gradient descent in clever ways and using non-linearities that do not saturate, such as ReLUs (recall that for saturating functions, $\sigma' \to 0$, which will cause the gradient to vanish). Proper initialization and regularization schemes such as gradient clipping (cutting off gradients with very large values) and batch normalization also help mitigate the vanishing and exploding gradient problem.
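To get a quantitative feel for Eq. (126), the following short NumPy sketch (not taken from the accompanying notebooks; the function name and the chosen values of $w$ and $L$ are purely illustrative) iterates the single-neuron recursion of Eq. (III) and prints how the backpropagated error scales with depth for a sigmoid non-linearity, for which $\sigma'(0)=1/4$:

```python
import numpy as np

def backprop_chain_error(w, L, z=0.0, delta_L=1.0):
    """Propagate the top-layer error delta_L back through L identical
    single-neuron layers with shared weight w, i.e. iterate Eq. (III)."""
    s = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
    sigma_prime = s * (1.0 - s)        # its derivative, = 1/4 at z = 0
    delta = delta_L
    for _ in range(L):
        delta *= w * sigma_prime
    return delta

for w in [2.0, 4.0, 8.0]:              # w * sigma'_0 = 0.5, 1.0, 2.0
    print(w, backprop_chain_error(w, L=50))
# The error vanishes for w*sigma'_0 < 1, stays O(1) at w*sigma'_0 = 1,
# and explodes for w*sigma'_0 > 1, exactly as argued above.
```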
D. Regularizing neural networks and other practical considerations

DNNs, like all supervised learning algorithms, must navigate the bias-variance tradeoff. Regularization techniques play an important role in ensuring that DNNs generalize well to new data. The last five years have seen a wealth of new specialized regularization techniques for DNNs beyond the simple $L^1$ and $L^2$ penalties discussed in the context of linear and logistic regression, see Secs. VI and VII. These new techniques include Dropout and Batch Normalization. In addition to these specialized regularization techniques, large DNNs seem especially well-suited to the implicit regularization that already takes place in Stochastic Gradient Descent (SGD) (Wilson et al., 2017), cf. Sec. IV. The implicit stochasticity and local nature of SGD often prevent overfitting of spurious correlations in the training data, especially when combined with techniques such as Early Stopping. In this section, we give a brief overview of these regularization techniques.

1. Implicit regularization using SGD: initialization, hyper-parameter tuning, and Early Stopping

The most commonly employed and effective optimizer for training neural networks is SGD (see Sec. IV for other alternatives). SGD acts as an implicit regularizer by introducing stochasticity (from the use of mini-batches) that prevents overfitting. In order to achieve good performance, it is important that the weight initialization is chosen randomly, in order to break any leftover symmetries. One common choice is drawing the weights from a Gaussian centered around zero with a variance that scales inversely with the number of inputs to the neuron (He et al., 2015; Sutskever et al., 2013). Since SGD is a local procedure, as networks get deeper, choosing a good weight initialization becomes increasingly important to ensure that the gradients are well behaved. Choosing an initialization with a variance that is too large or too small will cause gradients to vanish and the network to train poorly – even a factor of 2 can make a huge difference in practice (He et al., 2015). For this reason, it is important to experiment with different variances.

The second important hyper-parameter is the learning rate or step-size, which is commonly chosen by searching over five logarithmic grid points (Wilson et al., 2017). If the best performance occurs at the edge of the grid, the procedure is repeated until the optimal learning rate sits in the middle of the grid. Finally, it is common to center or whiten the input data (just as we did for linear and logistic regression).

Another important form of regularization that is often employed in practice is Early Stopping. The idea of Early Stopping is to divide the training data into two portions: the dataset we train on, and a smaller validation set that serves as a proxy for out-of-sample performance on the test set. As we train the model, we plot both the training error and the validation error. We expect the training error to decrease continuously during training. However, the validation error will eventually increase due to overfitting. The basic idea of Early Stopping is to halt the training procedure when the validation error starts to rise. This ensures that we stop the training before the model fits sample-specific features in the data. Early Stopping is a widely used, essential tool in the deep learning regularization toolbox.
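As an illustration, here is a minimal, framework-agnostic sketch of Early Stopping; the function and argument names are our own, and `train_one_epoch` and `validation_loss` stand in for whatever training and evaluation routines one is actually using:

```python
import numpy as np

def train_with_early_stopping(train_one_epoch, validation_loss,
                              patience=5, max_epochs=200):
    """Train until the validation error stops improving for `patience` epochs."""
    best_val, best_params, stale_epochs = np.inf, None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()        # one pass of SGD over the training set
        val = validation_loss(params)     # proxy for out-of-sample error
        if val < best_val:
            best_val, best_params, stale_epochs = val, params, 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:      # validation error keeps rising: stop
            break
    return best_params
```

Most deep learning packages provide equivalent functionality out of the box, for instance the EarlyStopping callback in Keras.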
2. Dropout

Another important regularization scheme that has been widely adopted in the neural networks literature is Dropout (Srivastava et al., 2014). The basic idea of Dropout is to prevent overfitting by reducing spurious correlations between neurons within the network through a randomization procedure similar to that underlying ensemble models such as Bagging. Recall that the basic idea behind ensemble methods is to train an ensemble of models that are created using a randomization procedure to ensure that the members of the ensemble are uncorrelated, see Sec. VIII. This reduces the variance of statistical predictions without creating too much additional bias.

In the context of neural networks, it is extremely costly to train an ensemble of networks, both in terms of the amount of data needed and the computational resources and parameter tuning required. Dropout circumnavigates these problems by randomly dropping out neurons (along with their connections) from the neural network during each step of the training (see Figure 39). Typically, for each mini-batch in the gradient descent step, a neuron is dropped from the neural network with a probability $p$. The gradient descent step is then performed only on the weights of the "thinned" network of individual predictors.

Since during training the weights are, on average, only present a fraction $p$ of the time, predictions are made by reweighing the weights by $p$: $w_\text{test} = p\, w_\text{train}$. The learned weights can be viewed as some "average" weight over all possible thinned neural networks. This averaging of weights is similar in spirit to the Bagging procedure discussed in the context of ensemble models, see Sec. VIII.

FIG. 39 Dropout (panels: "Standard Neural Net" and "After applying Dropout"). During the training procedure neurons are randomly "dropped out" of the neural network with some probability $p$, giving rise to a thinned network. This prevents overfitting by reducing correlations among neurons and reducing the variance in a manner similar in spirit to ensemble methods.
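A minimal NumPy sketch of the Dropout procedure described above may help fix ideas (the function and variable names are our own; here $p$ denotes the probability that a neuron is kept, consistent with the test-time rescaling $w_\text{test} = p\,w_\text{train}$):

```python
import numpy as np
rng = np.random.default_rng(seed=1)

def dropout(a, p, training=True):
    """Apply dropout to a layer's activations `a` with keep-probability p."""
    if training:
        mask = rng.random(a.shape) < p    # drop each neuron independently
        return a * mask                   # gradient only flows through kept neurons
    return p * a                          # at prediction time, rescale by p

activations = rng.standard_normal(8)
print(dropout(activations, p=0.5, training=True))    # thinned network
print(dropout(activations, p=0.5, training=False))   # averaged prediction
```

Many packages instead implement "inverted dropout", where the kept activations are divided by $p$ during training so that no rescaling is needed at prediction time.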
3. Batch Normalization

Batch Normalization is a regularization scheme that has been quickly adopted by the neural network community since its introduction in 2015 (Ioffe and Szegedy, 2015). The basic inspiration behind Batch Normalization is the long-known observation that training in neural networks works best when the inputs are centered around zero with respect to the bias. The reason for this is that it prevents neurons from saturating and gradients from vanishing in deep nets. In the absence of such centering, changes in parameters in lower layers can give rise to saturation effects in higher layers, and vanishing gradients. The idea of Batch Normalization is to introduce additional new "BatchNorm" layers that standardize the inputs by the mean and variance of the mini-batch.

Consider a layer $l$ with $d$ neurons whose inputs are $(z^l_1, \ldots, z^l_d)$. We standardize each dimension so that
$$z^l_k \longrightarrow \hat{z}^l_k = \frac{z^l_k - \mathbb{E}[z^l_k]}{\sqrt{\mathrm{Var}[z^l_k]}}, \qquad (127)$$
where the mean and variance are taken over all samples in the mini-batch. One problem with this procedure is that it may change the representational power of the neural network. For example, for tanh non-linearities, it may force the network to live purely in the linear regime around $z = 0$. Since non-linearities are crucial to the representational power of DNNs, this could dramatically alter the power of the DNN.

For this reason, one introduces two new parameters $\gamma^l_k$ and $\beta^l_k$ for each neuron that can additionally shift and scale the normalized input
$$\hat{z}^l_k \longrightarrow \tilde{z}^l_k = \gamma^l_k\, \hat{z}^l_k + \beta^l_k. \qquad (128)$$
One can think of Eqs. (127) and (128) as adding new extra layers $\tilde{z}^l_k$ in the deep net architecture. Hence, the new parameters $\gamma^l_k$ and $\beta^l_k$ can be learned just like the weights and biases using backpropagation (since this is just an extra layer for the chain rule). We initialize the neural network so that at the beginning of training the inputs are being standardized. Backpropagation then adjusts $\gamma$ and $\beta$ during training.

In practice, Batch Normalization considerably improves the learning speed by preventing gradients from vanishing. However, it also seems to serve as a powerful regularizer for reasons that are not fully understood. One plausible explanation is that in batch normalization, the gradient for a sample depends not only on the sample itself but also on all the properties of the mini-batch. Since a single sample can occur in different mini-batches, this introduces additional randomness into the training procedure, which seems to help regularize training.
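The standardization of Eqs. (127) and (128) amounts to only a few lines of code. The following NumPy sketch (our own illustrative code, acting on a mini-batch `z` of shape (batch size, $d$)) shows the forward pass of such a BatchNorm layer; the small constant `eps`, added to avoid division by zero, is an implementation detail not present in Eq. (127):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """Standardize each of the d inputs over the mini-batch, Eq. (127),
    then shift and rescale by the learnable parameters, Eq. (128)."""
    mean = z.mean(axis=0)                  # E[z_k] over the mini-batch
    var = z.var(axis=0)                    # Var[z_k] over the mini-batch
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta            # gamma, beta are learned by backprop
```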
E. Deep neural networks in practice: examples

Now that we have gained sufficient high-level background knowledge about deep neural nets, let us discuss how to use them in practice.

1. Deep learning packages

In Notebook 11, we demonstrated that the numerical implementation of DNNs is greatly facilitated by open source Python packages, such as Keras, TensorFlow, Pytorch, and others. The complexity and learning curves for these packages differ, depending on the user's level of familiarity with Python. The reader should keep in mind that there are DNN packages written in other languages, such as Caffe which uses C++, but we do not discuss them in this review for brevity.

Keras is a high-level framework which does not require any knowledge about the inner workings of the underlying deep learning algorithms. Coding DNNs in Keras is particularly simple, see Notebook 11, and allows one to quickly grasp the big picture behind the theoretical concepts which we introduced above. However, for advanced applications, which may require more direct control over the operations in between the layers, Keras' high-level design may turn out to be insufficient.

If one opens up the Keras black box, one will find that it wraps the functionality of another package – TensorFlow [11]. Over the last years, TensorFlow, which is supported by Google, has been gaining popularity and has become the preferred library for deep learning. It is frequently used in Kaggle competitions, university classes, and industry. In TensorFlow one constructs data flow graphs, the nodes of which represent mathematical operations, while the edges encode multidimensional tensors (data arrays). A deep neural net can then be thought of as a graph with a particular architecture and connectivity. One needs to understand this concept well before one can truly unleash TensorFlow's full potential. The learning curve can sometimes be rather steep for TensorFlow beginners, and it requires a certain degree of perseverance and devoted time to internalize the underlying ideas.

[11] While Keras can also be used with a Theano backend, we do not discuss this here since Theano support has been discontinued.

There are, however, many other open source packages which allow for control over the inter- and intra-layer operations, without the need to introduce computational graphs. Such an example is Pytorch, which offers libraries for automatic differentiation of tensors at GPU speed. As we discussed above, manipulating neural nets boils down to fast array multiplication and contraction operations and, therefore, the torch.nn library often does the job of providing enough access and controllability to manipulate the linear algebra operations underlying deep neural nets.

For the benefit of the reader, we have prepared Jupyter notebooks for DNNs using all three packages for the deep learning problems we discuss below. We invite the reader to carefully examine the differences in the code, which should help them decide on which package they prefer to use.
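To give a flavor of how compact Keras code can be, here is a minimal, self-contained sketch of defining and training a small fully-connected classifier on randomly generated stand-in data (a generic illustration, not the code of Notebook 11; the layer sizes and data dimensions are arbitrary choices):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in data: 1000 samples, 20 features, 2 one-hot encoded classes.
X_train = np.random.randn(1000, 20)
y_train = keras.utils.to_categorical(np.random.randint(2, size=1000), 2)

model = keras.Sequential([
    layers.Dense(100, activation='relu', input_shape=(20,)),
    layers.Dense(100, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=100,
          validation_split=0.2)   # hold out 20% of the training data
```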
2. Approaching the learning problem

Let us now analyze a typical procedure for using neural networks to solve supervised learning problems. As can be seen already from the code snippets in Notebook 11, constructing a deep neural network to solve ML problems is a multiple-stage process. Generally, one can identify a set of key steps:

1. Collect and pre-process the data.

2. Define the model and its architecture.

3. Choose the cost function and the optimizer.

4. Train the model.

5. Evaluate and study the model performance on the test data.

6. Use the validation data to adjust the hyperparameters (and, if necessary, the network architecture) to optimize performance for the specific dataset.

At this point a few remarks are in order. While we treat Step 1 above as consisting mainly of loading and reshaping a dataset prepared ahead of time, we emphasize that obtaining a sufficient amount of data is a typical challenge in many applications. Oftentimes insufficient data serves as a major bottleneck on the ultimate performance of DNNs. In such cases one can consider data augmentation, i.e. distorting data samples from the existing dataset in some way to enhance the size of the dataset. Obviously, if one knows how to do this, one already has partial information about the important features in the data.

One of the first questions we are typically faced with is how to determine the sizes of the training and test data sets. The MNIST dataset, which has 10 classification categories, uses 80% of the available data for training and 20% for testing. On the other hand, the ImageNet data, which has 1000 categories, is split 50%-50%. As a rule of thumb, the more classification categories there are in the task, the closer the sizes of the training and test datasets should be in order to prevent overfitting. Once the size of the training set is fixed, it is common to reserve 20% of it for validation, which is used to fine-tune the hyperparameters of the model.

Also related to data preprocessing is the standardization of the dataset. It has been found empirically that if the original values of the data differ by orders of magnitude, training can be slowed down or impeded. This can be traced back to the vanishing and exploding gradient problem in backprop discussed in Sec. IX.C. To avoid such unwanted effects, one often resorts to two tricks: (i) all data should be mean-centered, i.e. from every data point we subtract the mean of the entire dataset, and (ii) the data should be rescaled, for which there are two ways: if the data is approximately normally distributed, one can rescale by the standard deviation; otherwise, it is typically rescaled by the maximum absolute value so that the rescaled data lies within the interval [−1, 1]. Rescaling ensures that the weights of the DNN are of a similar order of magnitude (notice the similarity of this idea to Batch Normalization, cf. Sec. IX.D.3).
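The two tricks above amount to a few lines of NumPy; the following sketch (our own, with `X` assumed to be an array of shape (number of samples, number of features)) implements both options:

```python
import numpy as np

def standardize(X, gaussian_like=True):
    """(i) Mean-center each feature; (ii) rescale by the standard deviation
    if the data is roughly normally distributed, otherwise by the maximum
    absolute value so that the rescaled data lies in [-1, 1]."""
    Xc = X - X.mean(axis=0)
    if gaussian_like:
        return Xc / Xc.std(axis=0)
    return Xc / np.abs(Xc).max(axis=0)
```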
The next issue is how to choose the right hyperparameters to start training the model. According to Bengio, the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss (Bengio, 2012). One should also keep in mind that, depending on how ambitious a problem one is dealing with, training the model can take a considerable amount of time. This can severely slow down any progress on improving the model in Step 6. Therefore, it is usually a good idea to play with a small enough fraction of the training data to get a rough feeling about the correct hyperparameter regimes, the usefulness of the DNN architecture, and to debug the code. The size of this small 'play set' should be such that training on it can be done fast and in real time, to allow one to quickly adjust the hyperparameters. A typical strategy for exploring the hyperparameter landscape is to use grid searches.

Whereas it is always possible to view Steps 1-5 as generic and independent of the particular problem we are trying to solve, it is only when these steps are put together in Step 6 that the real benefit of deep learning is revealed, compared to less sophisticated methods such as regression or bagging, see Secs. VI, VII, and VIII. The optimal choice of network architecture, cost function, and optimizer is determined by the properties of the training and test datasets, which are usually revealed only when we try to improve the model.

While there is no "one-size-fits-them-all" recipe to approach ML problems, we believe that the above list gives a good overview and can be a useful guideline to the layman. Furthermore, as it becomes clear, this 'recipe' can be applied to generic supervised learning tasks, not just DNNs. We refer the reader to Sec. XI for more useful hints and tips on how to use the validation data during the training process.

3. SUSY dataset

As a first example from physics, we discuss a DNN approach to the SUSY dataset already introduced in the context of logistic regression in Sec. VII.C.2, and Bagging in Sec. VIII.F. For a detailed description of the SUSY dataset and the corresponding classification problem, we refer the reader to Sec. VII.C.2. There is an interest in using deep learning methods to automate the discovery of collision features from data. Benchmark results using Bayesian Decision Trees from a standard physics package, and five-layer neural networks using Dropout, were presented in the original paper (Baldi et al., 2014); they demonstrate the ability of deep learning to bypass the need of using hand-crafted high-level features. Our goal here is to study systematically the accuracy of a DNN classifier as a function of the learning rate and the dataset size.

Unlike the MNIST example where we used Keras, here we use the opportunity to introduce the Pytorch package, see the corresponding notebook. We leave the discussion of the code-specific details for the accompanying notebook.

To classify the SUSY collision events, we construct a DNN with two dense hidden layers of 200 and 100 neurons, respectively. We use ReLU activations between the input and the hidden layers, and a softmax output layer. We apply dropout regularization on the weights of the DNN. Similar to MNIST, we use the cross-entropy as a cost function and minimize it using SGD with batches of size 10% of the training data size. We train the DNN for 10 epochs.
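A minimal Pytorch sketch of the architecture just described is given below (a simplified illustration rather than the code of the accompanying notebook; the input dimension of 18 features and the dropout probability of 0.5 are our assumptions, and, following standard Pytorch practice, the softmax is absorbed into the cross-entropy loss):

```python
import torch
import torch.nn as nn

# Two dense hidden layers (200, 100) with ReLU and dropout, and two output
# logits (signal vs. background); softmax is folded into CrossEntropyLoss.
model = nn.Sequential(
    nn.Linear(18, 200), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(200, 100), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(100, 2),
)
loss_fn = nn.CrossEntropyLoss()                    # cross-entropy cost function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative SGD step on a random stand-in mini-batch.
X = torch.randn(64, 18)
y = torch.randint(0, 2, (64,))
optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()
optimizer.step()
```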
Figure 40 shows the accuracy of our DNN on the test data as a function of the learning rate and the size of the dataset. It is considered good practice to start with a logarithmic scale to search through the hyperparameters, to get an overall idea for the order of magnitude of the optimal values. In this example, the performance peaks at the largest size of the dataset and a learning rate of 0.1, and is of the order of 80%. Since the optimal performance is obtained at the edge of the grid, we encourage the reader to extend the grid size to beat our result. For comparison, in the original study (Baldi et al., 2014), the authors achieved ≈ 89% by using the entire dataset with 5,000,000 points and a more sophisticated network architecture, trained using GPUs.

FIG. 40 Grid search results for the test set accuracy of the DNN for the SUSY problem as a function of the learning rate and the size of the dataset. The data used includes all high- and low-level features.

4. Phases of the 2D Ising model

As a second example from physics, we discuss a DNN approach to the Ising dataset introduced in Sec. VII.C.1. We study the problem of classifying the states of the 2D Ising model with a DNN (Tanaka and Tomiya, 2017a), focusing on the model performance as a function of both the number of hidden neurons and the learning rate. The discussion is accompanied by a notebook written in TensorFlow. As in the previous example, the interested reader can find the discussion of the code-specific details in the notebook.

To classify whether a given spin configuration is in the ordered or disordered phase, we construct a minimalistic model for a DNN with a single hidden layer containing a number of hidden neurons. The network architecture thus includes a ReLU-activated input layer, the hidden layer, and the softmax output layer. We pick the categorical cross-entropy as a cost function and minimize it using SGD with mini-batches of size 100. We train the DNN for 100 epochs.

Figure 41 shows the outcome of a grid search over a log-spaced learning rate and the number of neurons in the hidden layer. We see that about 10 neurons are enough at a learning rate of 0.1 to get to a very high accuracy on the test set. However, if we aim at capturing the physics close to criticality, clearly more neurons are required to reliably learn the more complex correlations in the Ising configurations.

FIG. 41 Grid search results for the test set accuracy (top) and the critical set accuracy (bottom) of the DNN for the Ising classification problem as a function of the learning rate and the number of hidden neurons.
X. CONVOLUTIONAL NEURAL NETWORKS (CNNS)

One of the core lessons of physics is that we should exploit symmetries and invariances when analyzing physical systems. Properties such as locality and translational invariance are often built directly into the physical laws. Our statistical physics models often directly incorporate everything we know about the physical system being analyzed. For example, we know that in many cases it is sufficient to consider only local couplings in our Hamiltonians, or to work directly in momentum space if the system is translationally invariant. This basic idea, tailoring our analysis to exploit additional structure, is a key feature of modern physical theories from general relativity, through gauge theories, to critical phenomena.

Like physical systems, many datasets and supervised learning tasks also possess additional symmetries and structure. For instance, consider a supervised learning task where we want to label images from some dataset as being pictures of cats or not. Our statistical procedure must first learn features associated with cats. Because a cat is a physical object, we know that these features are likely to be local (groups of neighboring pixels in the two-dimensional image corresponding to whiskers, tails, eyes, etc.). We also know that the cat can be anywhere in the image. Thus, it does not really matter where in the picture these features occur (though the relative positions of features likely do matter). This is a manifestation of the translational invariance that is built into our supervised learning task. This example makes clear that, like many physical systems, many ML tasks (especially in the context of computer vision and image processing) also possess additional structure, such as locality and translation invariance.

The all-to-all coupled neural networks in the previous section fail to exploit this additional structure. For example, consider the image of the digit 'four' from the MNIST dataset shown in Fig. 26. In the all-to-all coupled neural networks used there, the 28 × 28 image was treated as a one-dimensional vector of size 28² = 784. This clearly throws away lots of the spatial information contained in the image. Not surprisingly, the neural networks community realized these problems and designed a class of neural network architectures, convolutional neural networks or CNNs, that take advantage of this additional structure (locality and translational invariance) (LeCun et al., 1995). Furthermore, what is especially interesting from a physics perspective is the recent finding that these CNN architectures are intimately related to models such as tensor networks (Stoudenmire, 2018; Stoudenmire and Schwab, 2016) and, in particular, MERA-like architectures that are commonly used in physical models for quantum condensed matter systems (Levine et al., 2017).

A. The structure of convolutional neural networks

A convolutional neural network is a translationally invariant neural network that respects locality of the input data. CNNs are the backbone of many modern deep learning applications and here we just give a high-level overview of CNNs that should allow the reader to delve directly into the specialized literature. The reader is also strongly encouraged to consult the excellent, succinct notes for the Stanford CS231n Convolutional Neural Networks class developed by Andrej Karpathy and Fei-Fei Li (https://cs231n.github.io/). We have drawn heavily on the pedagogical style of these notes in crafting this section.

FIG. 42 Architecture of a Convolutional Neural Network (CNN). The neurons in a CNN are arranged in three dimensions: height (H), width (W), and depth (D). For the input layer, the depth corresponds to the number of channels (in this case 3 for RGB images). Neurons in the convolutional layers calculate the convolution of the image with a local spatial filter (e.g. 3 × 3 pixel grid, times 3 channels for the first layer) at a given location in the spatial (W, H)-plane. The depth D of the convolutional layer corresponds to the number of filters used in the convolutional layer. Neurons at the same depth correspond to the same filter. Neurons in the convolutional layer mix inputs at different depths but preserve the spatial location. Pooling layers perform a spatial coarse graining (pooling step) at each depth to give a smaller height and width while preserving the depth. The convolutional and pooling layers are followed by a fully connected layer and classifier (not shown).

There are two kinds of basic layers that make up a CNN: a convolution layer that computes the convolution of the input with a bank of filters (as a mathematical operation, see this practical guide to image kernels: http://setosa.io/ev/image-kernels/), and pooling layers that coarse-grain the input while maintaining locality and spatial structure, see Fig. 42. For two-dimensional data, a layer $l$ is characterized by three numbers: height $H_l$, width $W_l$, and depth $D_l$ [12]. The height and width correspond to the sizes of the two-dimensional spatial $(W_l, H_l)$-plane (in neurons), and the depth $D_l$ (marked by the different colors in Fig. 42) to the number of filters in that layer. All neurons corresponding to a particular filter have the same parameters (i.e. shared weights and bias).

[12] The depth $D_l$ is often called the "number of channels", to distinguish it from the depth of the neural network itself, i.e. the total number of layers (which can be convolutional, pooling or fully-connected), cf. Fig. 42.

In general, we will be concerned with local spatial filters (often called a receptive field in analogy with neuroscience) that take as inputs a small spatial patch of the previous layer at all depths. For instance, a square filter of size $F$ is a three-dimensional array of size $F \times F \times D_{l-1}$. The convolution consists of running this filter over all locations in the spatial plane. To demonstrate how this works in practice, let us consider the simple example consisting of a one-dimensional input of depth 1, shown in Fig. 43. In this case, a filter of size $F \times 1 \times 1$ can be specified by a vector of weights $\mathbf{w}$ of length $F$. The stride, $S$, encodes by how many neurons we translate the filter when performing the convolution. In addition, it is common to pad the input with $P$ zeros (see Fig. 43). For an input of width $W$, the number of neurons (outputs) in the layer is given by $(W - F + 2P)/S + 1$. We invite the reader to check out this visualization of the convolution procedure, https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md, for a square input of unit depth. After computing the filter, the output is passed through a non-linearity, a ReLU in Fig. 43. In practice, one often inserts a BatchNorm layer before the non-linearity, cf. Sec. IX.D.3.

FIG. 43 Two examples to illustrate a one-dimensional convolutional layer with ReLU nonlinearity. Convolutional layer for a spatial filter of size F for a one-dimensional input of width W with stride S and padding P, followed by a ReLU non-linearity.

These convolutional layers are interspersed with pooling layers that coarse-grain spatial information by performing a subsampling at each depth. One common pooling operation is the max pool. In a max pool, the spatial dimensions are coarse-grained by replacing a small region (say 2 × 2 neurons) by a single neuron whose output is the maximum value of the output in the region. In physics, this pooling step is very similar to the decimation step of RG (Iso et al., 2018; Koch-Janusz and Ringel, 2017; Lin et al., 2017; Mehta and Schwab, 2014). This generally reduces the dimension of the outputs. For example, if the region we pool over is 2 × 2, then both the height and the width of the output layer will be halved. Generally, pooling operations do not reduce the depth of the convolutional layers because pooling is performed separately at each depth. A simple example of a max-pooling operation is shown in Fig. 44. There are some studies suggesting that pooling might be unnecessary (Springenberg et al., 2014), but pooling layers remain a staple of most CNNs.

FIG. 44 Illustration of Max Pooling. Illustration of max-pooling over a 2 × 2 region. Notice that pooling is done at each depth (vertical axis) separately. The number of outputs is halved along each dimension due to this coarse-graining.
In a CNN, the convolution and max-pool layers are
57
58 12 The depth Dl is often called “number of channels”, to distin-
generally followed by an all-to-all connected layer and a
59 guish it from the depth of the neural network itself, i.e. the total high-level classifier such as a soft-max. This allows us
60 number of layers (which can be convolutional, pooling or fully- to train CNNs as usual using the backprop algorithm,
connected), cf. Fig. 42. cf. Sec. IX.C. From a backprop perspective, CNNs are
61
62
63
64
65
58
1
2
3
almost identical to fully connected neural network archi- (a) F=3
tectures except with tied parameters. weight=[1,-1,1]
4
Apart from introducing additional structure, such as S=1
5
(units to shift bias=-2 ReLU
6 translational invariance and locality, this convolutional 0 (unit slope)
7 structure also has important practical and computational filter by)
8 benefits. All neurons at a given layer represent the same 1 1 -1 0
9 filter, and hence can all be described by a single set of
10 weights and biases. This reduces the number of free pa- 2 1 -1 0
11 rameters by a factor of H ×W at each layer. For example,
12 for a layer with D = 102 and H = W = 102 , this gives P=1 2 -1 -3 0
13 a reduction in parameters of nearly 106 ! This allows for
14 the training of much larger models than would otherwise -1 6 4 4
15 be possible with fully connected layers. We are familiar
16
17
with similar phenomena in physics: e.g. in translation- 3 -4 -6 0
ally invariant systems we can parametrize all eigenmodes
18
19
by specifying only their momentum (wave number) and 0 output
functional form (sin, cos, etc.), while without translation
20
invariance much more information is required.
input
21 W=5
22
23
B. Example: CNNs for the 2D Ising model
(b) F=4
24
25 S=2 weight=[1,-1,2,1]
26 The inclusion of spatial structure in CNNs is an im- (units to shift
27 portant feature that can be exploited when designing
neural networks for studying physical systems. In the
filter by) bias=-1 ReLU
1
28
29 accompanying notebook, we used Pytorch to implement (unit slope)
a simple CNN composed of a single convolutional layer
2
30
31
32
followed by a soft-max layer. Every input data point
(i.e. Ising configuration) is shaped as a two-dimensional
2 1 1
33 array. We varied the output depth (i.e. the number of 2
34 P=0
output channels) of the convolutional layer from unity –
35 a single set of weights and one bias – to an output depth
-1 1 0 0
36 of 50 distinct weights and biases. The CNN was then
37
trained using SGD for five epochs using a training set
0 output
38
consisting of samples from far in the paramagnetic and
39
ordered phases. The results are shown in Fig. 45. The
-2
40
CNN achieved a 100% accuracy on the test set for all
41 input
architectures, even for a CNN with depth one. We also
42
checked the performance of the CNN on samples drawn W=6
43
44 from the near-critical region for temperatures T slightly FIG. 43 Two examples to illustrate a one-dimensional
45 above and below the critical temperature Tc . The CNN convolutional layer with ReLU nonlinearity. Convolu-
46 performed admirably even on these critical samples with tional layer for a spatial filter of size F for a one-dimensional
47 an accuracy of between 80% and 90%. As is the case input of width W with stride S and padding P followed by a
48 with all ML and neural networks, the performance on ReLU non-linearity.
49 parts of the data that are missing from the training set
50 is considerably worse than on test data that is similar
51 to the training data. This highlights the importance of Regarding the SUSY dataset, we stress that the absence
52 properly constructing an accurate training dataset and of spatial locality in the collision features renders apply-
53 the considerable obstacles of generalizing to novel situ- ing CNNs to that problem inadequate.
54 ations. We encourage the interested reader to explore
55 the corresponding notebook and design better CNN ar-
56
chitectures with improved generalization performance on C. Pre-trained CNNs and transfer learning
57
the near-critical set.
58
59 The reader may wish to check out the second part of The immense success of CNNs for image recognition
60 the MNIST notebook for a discussion of CNNs applied to has resulted in the training of huge networks on enormous
61 the digit recognition using the high-level Keras package. datasets, often by large industrial research teams from
62
63
64
65
59
1

110
2
3
4 3 0 1 0
5
6 0 1 1 1 3 1
100

Accuracy
7
8
4 2
90
9 2 3 2 1
10 80
11 4 1 0 1 test
12
70 critical
. .
13

.. .. 60 1 5 10 20
14
15
50
Depth of hidden layer
16
17
18
19
20 2 5 2 1 FIG. 45 Single-layer convolutional network for classi-
fying phases in the Ising mode. Accuracy on test set and
21 critical samples for a convolutional neural network with sin-
22 1 1 0 1 5 2 gle layer of varying depth with filters of size 2, max-pool layer
23 with receptive field of size 2, followed by soft-max classifier.
24 Notice that the test accuracy is 100% even for a CNN of depth
25 1 0 1 0 3 1 one with a single set of weights. Accuracy on the near-critical
26 dataset is significantly below that for the test set.
27
28
3 1 0 1
29 CS231n mentioned in the introduction to this section.
30
• Use CNN as fixed feature detector at top
31 FIG. 44 Illustration of Max Pooling. Illustration of max-
32 pooling over a 2 × 2 region. Notice that pooling is done at layer. If the new dataset we want to train on is
33 each depth (vertical axis) separately. The number of outputs small and similar to the original dataset, we can
34 is halved along each dimension due to this coarse-graining. simply use the CNN as a fixed feature detector
35 and retrain our classifier. In other words, we re-
36 move the classifier (soft-max) layer at the top of the
37 Google, Microsoft, Amazon, etc. Many of these mod- CNN and replace it with a new classifier (linear sup-
38 els are known by name: AlexNet, GoogLeNet, ResNet, port vector machine (SVM) or soft-max) relevant
39 InceptionNet, VGGNet, etc. Most researchers and prac- to our supervised learning problem. In this proce-
40 titioners do not have the resources, data, or time to train dure, the CNN serves as a fixed map from images
41
networks on this scale. Luckily, the trained models have to relevant features (the outputs of the top fully-
42 connected layer right before the original classifier).
been released and are now available in standard packages
43 This procedure prevents overfitting on small, simi-
44 such as the Torch Vision library in Pytorch or the Caffe
framework. These models can be used directly as a basis lar datasets and is often a useful starting point for
45
for fine-tuning in different supervised image recognition transfer learning.
46
47 tasks through a process called transfer learning.
• Use CNN as fixed feature detector at inter-
48 The basic idea behind transfer learning is that the fil- mediate layer. If the dataset is small and quite
49 ters (receptive fields) learned by the convolution layers different from the dataset used to train the origi-
50 of these networks should be informative for most im- nal image, the features at the top level might not
51 age recognition based tasks, not just the ones they were be suitable for our dataset. In this case, one may
52 originally trained for. In other words, we expect that, want to instead use features in the middle of the
53 since images reflect the natural world, the filters learned CNN to train our new classifier. These features are
54 by these CNNs should transfer over to new tasks with
55 thought to be less fine-tuned and more universal
only slight modifications and fine-tuning. In practice, (e.g. edge detectors). This is motivated by the idea
56 this turns out to be true for many tasks one might be
57 that CNNs learn increasingly complex features the
interested in. deeper one goes in the network (see discussion on
58
59 There are three distinct ways one can take a pre- representational learning in next section).
60 trained CNN and repurpose it for a new task. The follow-
61 ing discussion draws heavily on the notes from the course • Fine-tune the CNN. If the dataset is large, in
62
63
64
65
60
1
2
addition to replacing and retraining the classifier proxy for the test error in order to make tweaks to our
3
4 in the top layer, we can also fine-tune the weights model. It is crucial that we do not use any of the test
5 of the original CNN using backpropagation. One data to train the algorithm. This is a cardinal sin in
6 may choose to freeze some of the weights in the ML. We thus suggest the following workflow:
7 CNN during the procedure or retrain all of them
8 simultaneously. Estimate optimal error rate (Bayes rate).—The
9 first thing one should establish is the difficulty of the
10 All these procedures can be carried out easily by us- task and the best performance one can expect to achieve.
11 ing packages such as Caffe or the Torch Vision library No algorithm can do better than the “signal” in the
12 in PyTorch. PyTorch provides a nice python notebook dataset. For example, it is likely much easier to classify
13 that serves as tutorial on transfer learning. The reader objects in high-resolution images than in very blurry,
14 is strongly urged to read the Pytorch tutorials carefully low-resolution images. Thus, one needs to establish
15 if interested in this topic. a proxy or baseline for the optimal performance that
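The sketch below (our own illustration, using resnet18 from the Torch Vision library and a hypothetical two-class problem) loads a network pre-trained on ImageNet, freezes its filters, and replaces the top classifier with a new trainable layer; note that the `pretrained` keyword is being superseded by a `weights` argument in recent torchvision versions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet and use it as a fixed feature detector.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False              # freeze the pre-trained filters

# Replace the top classifier with a new trainable layer for two classes.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new classifier's parameters are passed to the optimizer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)
```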
XI. HIGH-LEVEL CONCEPTS IN DEEP NEURAL NETWORKS

In the previous sections, we introduced deep neural networks and discussed how we can use these networks to perform supervised learning. Here, we take a step back and discuss some high-level questions about the practice and performance of neural networks. The first part of this section presents a deep learning workflow inspired by the bias-variance tradeoff. This workflow is especially relevant to industrial applications where one is often trying to employ neural networks to solve a particular problem. In the second part of this section, we shift gears and ask the question, why have neural networks been so successful? We provide three different high-level explanations that reflect current dogmas. Finally, we end the section by discussing the limitations of supervised learning methods and current neural network architectures.

A. Organizing deep learning workflows using the bias-variance tradeoff

Imagine that you are given some data and asked to design a neural network for learning how to perform a supervised learning task. What are the best practices for organizing a systematic workflow that allows us to efficiently do this? Here, we present a simple deep learning workflow inspired by thinking about the bias-variance tradeoff (see Figure 46). This section draws heavily on Andrew Ng's tutorial at the Deep Learning School (available online at https://www.youtube.com/watch?v=F1ka6a13S9I), which readers are strongly encouraged to watch.

The first thing we would like to do is divide the data into three parts: a training set, a validation or dev (development) set, and a test set. The test set is the data on which we want to make predictions. The dev set is a subset of the training data we use to check how well we are doing out-of-sample, after training the model on the training dataset. We use the validation error as a proxy for the test error in order to make tweaks to our model. It is crucial that we do not use any of the test data to train the algorithm. This is a cardinal sin in ML. We thus suggest the following workflow:

Estimate optimal error rate (Bayes rate).—The first thing one should establish is the difficulty of the task and the best performance one can expect to achieve. No algorithm can do better than the "signal" in the dataset. For example, it is likely much easier to classify objects in high-resolution images than in very blurry, low-resolution images. Thus, one needs to establish a proxy or baseline for the optimal performance that can be expected from any algorithm. In the context of Bayesian statistics, this is often called the Bayes rate. Since we do not know this a priori, we must get an estimate of it. For many tasks such as speech or object recognition, we can approximate this by the performance of humans on the task. For a more specialized task, we would like to ask how well experts, trained at the task, perform. This expert performance then serves as a proxy for our Bayes rate.

Minimize underfitting (bias) on the training data set.—After we have established the Bayes rate, we want to make sure that we are using a sufficiently complex model to avoid underfitting on the training dataset. In practice, this means comparing the training error rate to the Bayes rate. Since the training error does not care about generalization (variance), our model should approach the Bayes rate on the training set. If it does not, the bias of the DNN model is too large and one should try training the model longer and/or using a larger model. Finally, if none of these techniques work, it is likely that the model architecture is not well suited to the dataset, and one should modify the neural architecture in some way to better reflect the underlying structure of the data (symmetries, locality, etc.).

Make sure you are not overfitting.—Next, we run our algorithm on the validation or dev set. If the error is similar to the training error rate and the Bayes rate, we are done. If it is not, then we are overfitting the training data. Possible solutions include regularization and, importantly, collecting more data. Finally, if none of these work, one likely has to change the DNN architecture.
If the validation and test sets are drawn from the same distributions, then good performance on the validation set should lead to similarly good performance on the test set. (Of course, performance will typically be slightly worse on the test set because the hyperparameters were fit to the validation set.) However, sometimes the training data and test data differ in subtle ways because, for example, they are collected using slightly different methods, or because it is cheaper to collect data in one way versus another. In this case, there can be a mismatch between the training and test data. This can lead to the neural network overfitting these small differences between the test and training sets, and a poor performance on the test set despite having a good performance on the validation set. To rectify this, Andrew Ng suggests making two validation or dev sets, one constructed from the training data and one constructed from the test data. The difference between the performance of the algorithm on these two validation sets quantifies the train-test mismatch. This can serve as another important diagnostic when using DNNs for supervised learning.

FIG. 46 Organizing a workflow for Deep Learning. Schematic illustrating a deep learning workflow inspired by navigating the bias-variance tradeoff (figure based on Andrew Ng's talk at the 2016 Deep Learning School available at https://www.youtube.com/watch?v=F1ka6a13S9I). In this diagram, we have assumed that there is no mismatch between the distributions the training and test sets are drawn from.

B. Why neural networks are so successful: three high-level perspectives on neural networks

Having discussed the basics of neural networks, we conclude by giving three complementary perspectives on the success of DNNs and Deep Learning. This high-level discussion reflects various dogmas and intuitions about the success of DNNs and is in no way definitive or conclusive. As the reader was already warned in the introduction to DNNs, the field is rapidly expanding and many of these perspectives may turn out to be only partially true or even false. Nonetheless, we include them here as a guidepost for readers.

1. Neural networks as representation learning

One important and powerful aspect of the deep learning paradigm is the ability to learn relevant features of the data with relatively little domain knowledge, i.e. with minimal hand-crafting. Often, the power of deep learning stems from its ability to act like a black box that can take in a large stream of data and find good features that capture properties of the data we are interested in. This ability to learn good representations with very little hand-tuning is one of the most attractive properties of DNNs. Many of the other supervised learning algorithms discussed here (regression-based models, ensemble methods such as random forests or gradient-boosted trees) perform comparably to, or even better than, neural networks when using hand-crafted features on small-to-intermediate sized datasets.

The hierarchical structure of deep learning models is thought to be crucial to their ability to represent complex, abstract features. For example, consider the use of CNNs for image classification tasks. The analysis of CNNs suggests that the lower levels of the neural networks learn elementary features, such as edge detectors, which are then combined into higher levels of the networks into more abstract, higher-level features (e.g. the famous example of a neuron that "learned to respond to cats") (Le, 2013). More recently, it has been shown that CNNs can be thought of as performing tensor decompositions on the data similar to those commonly used in numerical methods in modern quantum condensed matter (Cohen et al., 2016).

One of the interesting consequences of this line of thinking is the idea that one can train a CNN on one large dataset and the features it learns should also be useful for other supervised tasks. This results in the ability to learn important and salient features directly from the data and then transfer this knowledge to a new task. Indeed, this ability to learn important, higher-level, coarse-grained features is reminiscent of ideas like the renormalization group (RG) in physics, where the RG flows separate out relevant and irrelevant directions, and certain unsupervised deep learning architectures have a natural interpretation in terms of variational RG schemes (Mehta and Schwab, 2014).

2. Neural networks can exploit large amounts of data

With the advent of smartphones and the internet, there has been an explosion in the amount of data being generated. This data-rich environment favors supervised learning methods that can fully exploit this rich data world. One important reason for the success of DNNs is that they are able to exploit the additional signal in large datasets for difficult supervised learning tasks.
Fundamentally, modern DNNs are unique in that they contain millions of parameters, yet can still be trained on existing hardware. The complexity of DNNs (in terms of parameters) combined with their simple architecture (layer-wise connections) hits a sweet spot between expressivity (the ability to represent very complicated functions) and trainability (the ability to learn millions of parameters).

Indeed, the ability of large DNNs to exploit huge datasets is thought to differ from many other commonly employed supervised learning methods such as Support Vector Machines (SVMs). Figure 47 shows a schematic depicting the expected performance of DNNs of different sizes with the number of data samples, and compares them to supervised learning algorithms such as SVMs or ensemble methods. When the amount of data is small, DNNs offer no substantial benefit over these other methods and often perform worse. However, large DNNs seem to be able to exploit additional data in a way other methods cannot. The fact that one does not have to hand-engineer features makes the DNN even better suited for handling large datasets. Recent theoretical results suggest that as long as a DNN is large enough, it should generalize well and not overfit (Advani and Saxe, 2017). In the data-rich world we live in (at least in the context of images, videos, and natural language), this is a recipe for success. In other areas where data is more limited, deep learning architectures have (at least so far) been less successful.

FIG. 47 Large neural networks can exploit the vast amount of data now available. Schematic of how neural network performance depends on the amount of available data (figure based on Andrew Ng's talk at the 2016 Deep Learning School available at https://www.youtube.com/watch?v=F1ka6a13S9I).

3. Neural networks scale up well computationally

A final feature that is thought to underlie the success of modern neural networks is that they can harness the immense increase in computational capability that has occurred over the last few decades. The architecture of neural networks naturally lends itself to parallelization and the exploitation of fast but specialized processors such as graphical processing units (GPUs). Google and NVIDIA have set out to develop TPUs (tensor processing units), which are specifically designed for the mathematical operations underlying deep learning architectures. The layered architecture of neural networks also makes it easy to use modern techniques such as automatic differentiation that make it easy to quickly deploy them. Algorithms such as stochastic gradient descent and the use of mini-batches make it easy to parallelize code and train much larger DNNs than was thought possible fifteen years ago. Furthermore, many of these computational gains are quickly incorporated into modern packages with industrial resources. This makes it easy to perform numerical experiments on large datasets, leading to further engineering gains.

C. Limitations of supervised learning with deep networks

Like all statistical methods, supervised learning using neural networks has important limitations. This is especially important when one seeks to apply these methods to physics problems. Like all tools, DNNs are not a universal solution. Often, the same or better performance on a task can be achieved by using a few hand-engineered features (or even a collection of random features). This is especially important for hard physics problems where data (or Monte-Carlo samples) may be hard to come by.

Here we list some of the important limitations of supervised neural network based models.

• Need labeled data.—Like all supervised learning methods, DNNs for supervised learning require labeled data. Often, labeled data is harder to acquire than unlabeled data (e.g. one must pay for human experts to label images).

• Supervised neural networks are extremely data intensive.—DNNs are data hungry. They perform best when data is plentiful. This is doubly so for supervised methods where the data must also be labeled. The utility of DNNs is extremely limited if data is hard to acquire or the datasets are small (hundreds to a few thousand samples). In this case, the performance of other methods that utilize hand-engineered features can exceed that of DNNs.

• Homogeneous data.—Almost all DNNs deal with homogeneous data of one type. It is very hard to design architectures that mix and match data types (i.e. some continuous variables, some discrete variables, some time series). In applications beyond images, video, and language, this is often what is required. In contrast, ensemble models like random forests or gradient-boosted trees have no difficulty handling mixed data types.

• Many physics problems are not about prediction.—In physics, we are often not interested in solving prediction tasks such as classification. Instead, we want to learn something about the underlying distribution that generates the data. In this case, it is often difficult to cast these ideas in a supervised learning setting. While the problems are related, it is possible to make good predictions with a "wrong" model. The model might or might not be useful for understanding the underlying physics.

Some of these remarks are particular to DNNs, others are shared by all supervised learning methods. This motivates the use of unsupervised methods, which in part circumnavigate these problems.

XII. DIMENSIONAL REDUCTION AND DATA VISUALIZATION

Unsupervised learning is concerned with discovering structure in unlabeled data. In this section, we will begin our foray into unsupervised learning by way of data visualization. Data visualization methods are important for modeling as they can be used to identify correlated or redundant features, along with irrelevant features (noise), in raw or processed data. Conceivably, being able to identify and capture such characteristics in a dataset can help in designing better predictive models. For data involving a relatively small number of features, studying pair-wise correlations (i.e. pairwise scatter plots of all features) may suffice for a complete analysis. This rapidly becomes impractical for datasets involving a large number of measured features (such as images). Thus, in practice, we often have to perform dimensional reduction, namely, project or embed the data onto a lower-dimensional space, which we refer to as the latent space. As we will discuss, part of the complication of dimensional reduction lies in the fact that low-dimensional representations of high-dimensional data necessarily incur a loss of information. Below, we introduce some common linear and nonlinear methods for performing dimensional reduction, with applications in data visualization of high-dimensional data.

A. Some of the challenges of high-dimensional data

Before we begin exploring some specific dimensional reduction techniques, it is useful to highlight some of the generic difficulties encountered when dealing with high-dimensional data.

a. High-dimensional data lives near the edge of sample space. Geometry in high-dimensional space can be counterintuitive. One example that is pertinent to machine learning is the following. Consider data distributed uniformly at random in a $D$-dimensional hypercube $C = [-e/2, e/2]^D$, where $e$ is the edge length. Consider also a $D$-dimensional hypersphere $S$ of radius $e/2$ centered at the origin and contained within $C$. The probability that a data point $\mathbf{x}$ drawn uniformly at random in $C$ is contained within $S$ is well approximated by the ratio of the volume of $S$ to that of $C$: $p(\|\mathbf{x}\|_2 < e/2) \sim (1/2)^D$. Thus, as the dimension of the feature space $D$ increases, $p$ goes to zero exponentially fast. In other words, most of the data will concentrate outside the hypersphere, in the corners of the hypercube. In physics, this basic observation underlies many properties of ideal gases, such as the Maxwell distribution and the equipartition theorem (see Chapter 3 of (Sethna, 2006) for instance).
30
31
32 b. Real-world data vs. uniform distribution. Fortunately,
33 XII. DIMENSIONAL REDUCTION AND DATA real-world data is not random or uniformly distributed!
34 VISUALIZATION In fact, real data usually lives in a much lower dimen-
35 sional space than the original space in which the fea-
36 tures are being measured. This is sometimes referred to
Unsupervised learning is concerned with discovering
37
structure in unlabeled data. In this section, we will be- as the “blessing of non-uniformity” (in opposition to the
38
gin our foray into unsupervised learning by way of data curse of dimensionality). Data will typically be locally
39
40 visualization. Data visualization methods are important smooth, meaning that a local variation of the data will
41 for modeling as they can be used to identify correlated or not incur a change in the target variable (Bishop, 2006).
42 redundant features along with irrelevant features (noise) This idea is central to statistical physics and field the-
43 from raw or processed data. Conceivably, being able to ories, where properties of systems with an astronomical
44 identify and capture such characteristics in a dataset can number of degrees of freedom can be well characterized
45 help in designing better predictive models. For data in- by low-dimensional “order parameters” or effective de-
46 volving a relatively small number of features, studying grees of freedom. Another instantiation of this idea is
47 pair-wise correlations (i.e. pairwise scatter plots of all manifest in the description of the bulk properties of a
48 features) may suffice in performing a complete analysis. gas of weakly interacting particles, which can be sim-
49 This rapidly becomes impractical for datasets involving ply described by the thermodynamic variables (temper-
50 a large number of measured featured (such as images). ature, pressure, etc.) that enter the equation of state
51 Thus, in practice, we often have to perform dimensional rather than the enormous number of dynamical variables
52 reduction, namely, project or embed the data onto a lower (i.e. position and momentum) of each particle in the gas.
53 dimensional space, which we refer to as the latent space.
54 As we will discuss, part of the complication of dimen-
55
sional reduction lies in the fact that low-dimensional rep- c. Intrinsic dimensionality and the crowding problem. A re-
56
resentations of high-dimensional data necessarily incurs current objective of dimensional reduction techniques is
57
58 information lost. Below, we introduce some common lin- to preserve the relative pairwise distances (or defined sim-
59 ear and nonlinear methods for performing dimensional ilarities) between data points from the original space to
60 reduction with applications in data visualization of high- the latent space. This is a natural requirement, since
61 dimensional data. we would like for nearby data points (as measured in the
62
63
64
65
64
1
2
3
a) b)
4 2D 1D
5
6
7
8
9
10
11
12
13
14 FIG. 49 Illustration of the crowding problem. (Left) A two-
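This concentration of measure is easy to check numerically. The following minimal sketch (our own illustration, not part of the accompanying notebooks; the edge length and sample size are arbitrary choices) estimates the fraction of uniformly drawn points that land inside the inscribed hypersphere:

import numpy as np

rng = np.random.default_rng(0)
e = 1.0  # edge length of the hypercube C = [-e/2, e/2]^D

for D in [2, 5, 10, 20]:
    # draw points uniformly at random in the hypercube C
    x = rng.uniform(-e / 2, e / 2, size=(100_000, D))
    # fraction of points falling inside the inscribed hypersphere of radius e/2
    inside = np.linalg.norm(x, axis=1) < e / 2
    print(D, inside.mean())

The estimated fraction drops rapidly with D (roughly of order 10^-3 already at D = 10), illustrating that in high dimensions essentially all of the volume sits in the corners of the hypercube.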
b. Real-world data vs. uniform distribution. Fortunately, real-world data is not random or uniformly distributed! In fact, real data usually lives in a much lower dimensional space than the original space in which the features are being measured. This is sometimes referred to as the "blessing of non-uniformity" (in opposition to the curse of dimensionality). Data will typically be locally smooth, meaning that a local variation of the data will not incur a change in the target variable (Bishop, 2006). This idea is central to statistical physics and field theories, where properties of systems with an astronomical number of degrees of freedom can be well characterized by low-dimensional "order parameters" or effective degrees of freedom. Another instantiation of this idea is manifest in the description of the bulk properties of a gas of weakly interacting particles, which can be simply described by the thermodynamic variables (temperature, pressure, etc.) that enter the equation of state rather than the enormous number of dynamical variables (i.e. position and momentum) of each particle in the gas.

c. Intrinsic dimensionality and the crowding problem. A recurrent objective of dimensional reduction techniques is to preserve the relative pairwise distances (or defined similarities) between data points from the original space to the latent space. This is a natural requirement, since we would like for nearby data points (as measured in the original space) to remain close-by after the corresponding mapping to the latent space.

FIG. 48 The "Swiss roll". Data distributed in a three-dimensional space (a) that can effectively be described on a two-dimensional surface (b). A common goal of dimensional reduction techniques is to preserve ordination in the data: points that are close-by in the original space are also near-by in the mapped (latent) space. This is true of the mapping (a) to (b) as can be seen by inspecting the color gradient.

FIG. 49 Illustration of the crowding problem. (Left) A two-dimensional dataset X consisting of 3 equidistant points. (Right) Mapping X to a one-dimensional space while trying to preserve relative distances leads to a collapse of the mapped data points.

Consider the example of the "Swiss roll" presented in FIG. 48a. There, the relevant structure of the data corresponds to nearby points with similar colors and is encoded in the "unrolled" data in the latent space, see FIG. 48b. Clearly, in this example a two-dimensional space is sufficient to capture almost the entirety of the information in the data. A concept which stems from signal processing that is relevant to our current exposition is that of the intrinsic dimensionality of the data. Qualitatively, it refers to the minimum number of dimensions required to capture the signal in the data. In the case of the Swiss roll, it is 2 since the Swiss roll can effectively be parametrized using only two parameters, i.e. X ∈ {(x1 sin(x1), x1 cos(x1), x2)}. The minimum number of parameters required for such a parametrization is the intrinsic dimensionality of the data (Bennett, 1969). Attempting to represent data in a space of dimensionality lower than its intrinsic dimensionality can lead to a "crowding" problem (Maaten and Hinton, 2008) (see schematic, FIG. 49). In short, because we are attempting to satisfy too many constraints (e.g. preserve all relative distances of the original space), this results in a trivial solution for the latent space where all mapped data points collapse to the center of the map.

To alleviate this, one needs to weaken the constraints imposed on the visualization scheme. Powerful methods such as t-distributed stochastic neighbor embedding (Maaten and Hinton, 2008) (in short, t-SNE, see section XII.D) and uniform manifold approximation and projection (UMAP) (McInnes et al., 2018) have been devised to circumvent this issue in various ways.
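A convenient way to experiment with these ideas is scikit-learn's built-in Swiss roll generator. The sketch below (parameter values are illustrative, and the use of t-SNE here is just one possible choice of non-linear embedding) produces data analogous to FIG. 48 and embeds it in a two-dimensional latent space:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE

# three-dimensional "Swiss roll" whose intrinsic dimensionality is 2 (cf. FIG. 48)
X, color = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)

# non-linear embedding into a two-dimensional latent space
Y = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)
# plotting Y colored by `color` reveals to what extent ordination in the data is preserved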
B. Principal component analysis (PCA)

A ubiquitous method for dimensional reduction, data visualization and analysis is Principal Component Analysis (PCA). The goal of PCA is to perform an orthogonal transformation of the data in order to find high-variance directions. PCA is inspired by the observation that in many cases, the relevant information in a signal is contained in the directions with largest variance13 (see FIG. 50). Directions with small variance are ascribed to "noise" and can potentially be removed or ignored.

13 This assumes that the features are measured and compared using the same units.

FIG. 50 PCA seeks to find the set of orthogonal directions with largest variance. This can be seen as "fitting" an ellipse to the data with the major axis corresponding to the first principal component (direction of largest variance). PCA assumes that directions with large variance correspond to the true signal in the data while directions with low variance correspond to noise.

FIG. 51 (a) The first two principal components of the Ising dataset with temperature indicated by the coloring. PCA was performed on a joined dataset of 1000 samples taken at each temperature T = 0.25, 0.5, · · · , 4.0. Almost all the variance is explained in the first component, which corresponds to the magnetization order parameter (linear combination of the features with weights all roughly equal). The paramagnetic phase corresponds to the middle cluster and the left and right clusters correspond to the symmetry-related ferromagnetic phases. (b) Log of the spectrum of the covariance matrix versus rank ordering. Only one dimension has high variance.

Surprisingly, such PCA-based projections often capture a lot of the large scale structure of many datasets. For example, Figure 51 shows the projection of samples drawn from the 2D Ising model at various temperatures on the first two principal components. Despite living in a 1600 dimensional space (the samples are 40 × 40 spin configurations), a single principal component (i.e. a single direction in this 1600 dimensional space) can capture 50% of the variability contained in our samples. In fact, one can verify that this direction weights all 1600 spins nearly equally and thus corresponds to the magnetization order parameter. Thus, even without any prior physical knowledge, one can extract relevant order parameters using a simple PCA-based projection. Recently, a correspondence between PCA and Renormalization Group flows across the phase transition in the 2D Ising model (Foreman et al., 2017) and in a more general setting (Bradde and Bialek, 2017) has been proposed. In statistical physics, PCA has also found application in detecting phase transitions (Wetzel, 2017), e.g. in the XY model on frustrated triangular and union jack lattices (Wang and Zhai, 2017). PCA was also used to classify dislocation patterns in crystals (Papanikolaou et al., 2017; Wang and Zhai, 2018), and to find correlations in the shear flow of athermal amorphous solids (Ruscher and Rottler, 2018). PCA is widely employed in biological physics when working with high-dimensional data. Physics has also inspired PCA-based algorithms to infer relevant features in unlabelled data (Bény, 2018).

Concretely, consider N data points, {x1, . . . , xN}, that live in a p-dimensional feature space R^p. Without loss of generality, we assume that the empirical mean x̄ = N^{-1} Σ_i xi of these data points is zero14. Denote the N × p design matrix as X = [x1, x2, . . . , xN]^T whose rows are the data points and whose columns correspond to different features. The p × p (symmetric) covariance matrix is therefore

Σ(X) = X^T X / (N − 1).    (129)

Notice that the j-th diagonal entry of Σ(X) corresponds to the variance of the j-th feature and Σ(X)_ij measures the covariance (i.e. connected correlation in the language of physics) between feature i and feature j.

14 We can always center around the mean: xi ← xi − x̄.

We are interested in finding a new basis for the data that emphasizes highly variable directions while reducing redundancy between basis vectors. In particular, we will look for a linear transformation that reduces the covariance between different features. To do so, we first perform singular value decomposition (SVD) on the design matrix X, namely, X = U S V^T, where S is a diagonal matrix of singular values si, the orthogonal matrix U contains (as its columns) the left singular vectors of X, and similarly V contains (as its columns) the right singular vectors of X. With this, one can rewrite the covariance matrix as

Σ(X) = V S U^T U S V^T / (N − 1) = V [S²/(N − 1)] V^T ≡ V Λ V^T,    (130)

where Λ is a diagonal matrix with eigenvalues λi in decreasing order along the diagonal (i.e. the eigendecomposition). It is clear that the right singular vectors of X (i.e. the columns of V) are principal directions of Σ(X), and the singular values of X are related to the eigenvalues of the covariance matrix Σ(X) via λi = si²/(N − 1). To reduce the dimensionality of the data from p to p̃ < p, we first construct the p × p̃ projection matrix Ṽp̃ by selecting the singular components with the p̃ largest singular values. The projection of the data from p to a p̃ dimensional space is simply Ỹ = X Ṽp̃. The same idea is central to matrix-product-state-like techniques used to compress the number of components in quantum wavefunctions in studies of low-dimensional many-body lattice systems.

The singular vector with the largest singular value (i.e. the largest variance) is referred to as the first principal component; the singular vector with the second largest singular value as the second principal component, and so on. An important quantity is the ratio λi / Σ_{i=1}^p λi, which is referred to as the percentage of the explained variance contained in a principal component (see FIG. 51.b).

It is common in data visualization to present the data projected on the first few principal components. This is valid as long as a large part of the variance is explained in those components. Low values of explained variance may imply that the intrinsic dimensionality of the data is high or simply that it cannot be captured by a linear representation. For a detailed introduction to PCA, see the tutorials by Shlens (Shlens, 2014) and Bishop (Bishop, 2006).
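The derivation above maps directly onto a few lines of code. The following minimal sketch is our own illustration (in practice one would typically use scikit-learn's sklearn.decomposition.PCA, which wraps the same computation); it obtains the principal directions, the explained-variance ratios, and the projection Ỹ = X Ṽp̃ from the SVD of the centered design matrix:

import numpy as np

def pca(X, n_components=2):
    # center the data, x_i <- x_i - mean (footnote 14)
    Xc = X - X.mean(axis=0)
    # SVD of the design matrix, X = U S V^T
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # eigenvalues of the covariance matrix, lambda_i = s_i^2 / (N - 1)
    lam = s**2 / (Xc.shape[0] - 1)
    explained_variance_ratio = lam / lam.sum()
    # project onto the leading principal directions, Y = X V_ptilde
    Y = Xc @ Vt[:n_components].T
    return Y, explained_variance_ratio

# usage on a random dataset (Ising configurations would be used for FIG. 51)
X = np.random.randn(1000, 40)
Y, ratios = pca(X, n_components=2)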
C. Multidimensional scaling

Multidimensional scaling (MDS) is a non-linear dimensional reduction technique which preserves the pairwise distance or dissimilarity dij between data points (Cox and Cox, 2000). Moving forward, we use the terms "distance" and "dissimilarity" interchangeably. There are two types of MDS: metric and non-metric. In metric MDS, the distance is computed under a pre-defined metric and the latent coordinates Ỹ are obtained by minimizing the difference between the distance measured in the original space (dij(X)) and that in the latent space (dij(Y)):

Ỹ = arg min_Y Σ_{i<j} wij |dij(X) − dij(Y)|,    (131)

where wij ≥ 0 are weight values. The weight matrix wij is a set of free parameters that specify the level of confidence (or precision) in the value of dij(X). If the Euclidean metric is used, MDS gives the same result as PCA and is usually referred to as classical scaling (Torgerson, 1958). Thus MDS is often considered as a generalization of PCA. In non-metric MDS, dij can be any distance matrix. The objective function is then to preserve the ordination in the data, i.e. if d12(X) < d13(X) in the original space, then in the latent space we should have d12(Y) < d13(Y).

Both MDS and PCA can be implemented using standard Python packages such as Scikit. MDS algorithms typically have a scaling of O(N³), where N corresponds to the number of data points, and are thus very limited in their application to large datasets. However, sample-based methods have been introduced to reduce this scaling to O(N log N) (Yang et al., 2006). In the case of PCA, a complete decomposition has a scaling of O(N p² + p³), where p is the number of features. Note that the first term, N p², is due to the computation of the covariance matrix Eq. (129) while the second, p³, stems from its eigenvalue decomposition. Note also that PCA can be improved to bear complexity O(N p² + p) if only the first few principal components are desired (using iterative approaches). PCA and MDS are often among the first data visualization techniques one resorts to.
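As a concrete starting point, metric MDS is available in scikit-learn. A minimal sketch (the dataset, subsample size and parameter values here are arbitrary illustrative choices) reads:

from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)

# metric MDS: preserve pairwise Euclidean distances in a 2D latent space, cf. Eq. (131)
mds = MDS(n_components=2, metric=True, random_state=0)
Y = mds.fit_transform(X[:500])   # subsample, since MDS scales as O(N^3)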
D. t-SNE

It is often desirable to preserve local structures in high-dimensional datasets. However, when dealing with datasets having clusters delimited by complicated surfaces or datasets with a large number of clusters, preserving local structures becomes difficult using linear techniques such as PCA. Many non-linear techniques such as non-classical MDS (Cox and Cox, 2000), self-organizing maps (Kohonen, 1998), Isomap (Tenenbaum et al., 2000) and Locally Linear Embedding (Roweis and Saul, 2000) have been proposed to address this class of problems. These techniques are generally good at preserving local structures in the data but typically fail to capture structures at the larger scale such as the clusters in which the data is organized (Maaten and Hinton, 2008).

Recently, t-stochastic neighbor embedding (t-SNE) has emerged as one of the go-to methods for visualizing high-dimensional data. It has been shown to offer insightful visualizations for many benchmark high-dimensional datasets (Maaten and Hinton, 2008). t-SNE is a non-parametric15 method that constructs non-linear embeddings. Each high-dimensional training point is mapped to low-dimensional embedding coordinates, which are optimized in a way to preserve the local structure in the data.

15 It does not explicitly parametrize the feature extraction required to compute the embedding coordinates. Thus it cannot be applied to find the coordinates of new data points.

When used appropriately, t-SNE is a powerful technique for unraveling the hidden structure of high-dimensional datasets while at the same time preserving locality. In physics, t-SNE has recently been used to reduce the dimensionality and classify spin configurations, generated with the help of Monte Carlo simulations, for the Ising (Carrasquilla and Melko, 2017) and Fermi-Hubbard models at finite temperatures (Ch'ng et al., 2017). It was also applied to study clustering transitions in glass-like problems in the context of quantum control (Day et al., 2019).

The idea of stochastic neighbor embedding is to associate a probability distribution to the neighborhood of each data point (note x ∈ R^p, where p is the number of features):

p_{i|j} = exp(−‖xi − xj‖²/2σi²) / Σ_{k≠i} exp(−‖xi − xk‖²/2σi²),    (132)

where p_{i|j} can be interpreted as the likelihood that xj is xi's neighbor (thus we take p_{i|i} = 0). The σi are free bandwidth parameters that are usually determined by fixing the local entropy H(pi) of each data point:

H(pi) ≡ − Σ_j p_{j|i} log2 p_{j|i}.    (133)

The local entropy is then set equal to a constant across all data points, Σ = 2^{H(pi)}, where Σ is called the perplexity. The perplexity constraint determines σi ∀ i and implies that points in regions of high density will have smaller σi.

Using Gaussian likelihoods in p_{i|j} implies that only points that are nearby xi contribute to its probability distribution. While this ensures that the similarity for nearby points is well represented, this can be a problem for points that are far away from xi (i.e. outliers): they have exponentially vanishing contributions to the distribution, which in turn means that their embedding coordinates are ambiguous (Maaten and Hinton, 2008). One way around this is to define a symmetrized distribution p_{ij} ≡ (p_{i|j} + p_{j|i})/(2N). This guarantees that Σ_j p_{ij} > 1/(2N) for all data points xi, resulting in each data point xi making a significant contribution to the cost function to be defined below.

t-SNE constructs a similar probability distribution q_{ij} in a low dimensional latent space (with coordinates Y = {yi}, yi ∈ R^{p'}, where p' < p is the dimension of the latent space):

q_{ij} = (1 + ‖yi − yj‖²)^{-1} / Σ_{k≠i} (1 + ‖yi − yk‖²)^{-1}.    (134)

The crucial point to note is that q_{ij} is chosen to be a long-tail distribution. This preserves short distance information (relative neighborhoods) while strongly repelling two points that are far apart in the original space (see FIG. 52). In order to find the latent space coordinates yi, t-SNE minimizes the Kullback-Leibler divergence between q_{ij} and p_{ij}:

C(Y) = D_KL(p‖q) ≡ Σ_{ij} p_{ij} log (p_{ij}/q_{ij}).    (135)

This minimization is done via gradient descent (see section IV). We can gain further insights on what the embedding cost-function C is capturing by computing the gradient of (135) with respect to yi explicitly:

∂_{yi} C = 4 Σ_{j≠i} p_{ij} q_{ij} Zi (yi − yj) − 4 Σ_{j≠i} q_{ij}² Zi (yi − yj) = F_{attractive,i} − F_{repulsive,i},    (136)

where Zi = 1/(Σ_{k≠i} (1 + ‖yk − yi‖²)^{-1}). We have separated the gradient of point yi into an attractive term F_{attractive} and a repulsive term F_{repulsive}. Notice that F_{attractive,i} induces a significant attractive force only between points that are nearby point i in the original space since it involves the p_{ij} term. Finding the embedding coordinates yi is thus equivalent to finding the equilibrium configuration of particles interacting through the forces in (136).

FIG. 52 Illustration of the t-SNE embedding. The xi points correspond to the original high-dimensional points while the yi points are the corresponding low-dimensional map points produced by t-SNE. Here we consider two points, x1, x2, that are respectively "close" and "far" from x0. The high-dimensional Gaussian (short-tail) distribution p(x) of x0's neighbors is shown in blue. The low-dimensional Cauchy (fat-tail) distribution q(y) of x0's neighbors is shown in red. The map points yi are obtained by minimizing the difference |q(y) − p(xi)| (similar to minimizing the KL divergence). We see that point x1 is mapped to short distances |y1 − y0|. In contrast, far-away points such as x2 are mapped to relatively large distances |y2 − y0|.

Below, we list some important properties that one should bear in mind when analyzing t-SNE plots.

• t-SNE can rotate data. The KL divergence is invariant under rotations in the latent space, since it only depends on the distance between points. For this reason, t-SNE plots that are rotations of each other should be considered equivalent.

• t-SNE results are stochastic. In applying gradient descent the solution will depend on the initial seed. Thus, the map obtained may vary depending on the seed used and different t-SNE runs will give slightly different results.
• t-SNE generally preserves short distance information. As a rule of thumb, one should expect that nearby points on the t-SNE map are also close-by in the original space, i.e. t-SNE tends to preserve ordination (but not actual distances). For a pictorial explanation of this, we refer the reader to Figure 52.

• Scales are deformed in t-SNE. Since a scale-free distribution is used in the latent space, one should not put too much emphasis on the meaning of the size of any clusters observed in the latent space.

• t-SNE is computationally intensive. Finally, a direct implementation of t-SNE has an algorithmic complexity of O(N²), which is only applicable to small to medium data sets. Improved scaling of the form O(N log N) can be achieved at the cost of approximating Eq. (135) by using the Barnes-Hut method (Van Der Maaten, 2014) for N-body simulations (Barnes and Hut, 1986). More recently, an extremely efficient t-SNE implementation making use of fast Fourier transforms for the kernel summations in (136) has been made available on https://github.com/KlugerLab/FIt-SNE (Linderman et al., 2017).

As an illustration, in Figure 53 we applied t-SNE to a Gaussian mixture model consisting of thirty Gaussians, whose means are uniformly distributed in forty-dimensional space. We compared the results to a random two-dimensional projection and PCA. It is clear that unlike more naïve dimensional reduction techniques, both PCA and t-SNE can identify the presence of well-formed clusters. The t-SNE visualization cleanly separates all the clusters while certain clusters blend together in the PCA plot. This is a direct consequence of the fact that t-SNE keeps nearby points close together while repelling points that are far apart.

FIG. 53 Different visualizations of a Gaussian mixture formed of K = 30 mixtures in a D = 40 dimensional space. The Gaussians have the same covariance but have means drawn uniformly at random in the space [−10, 10]^40. (a) Plot of the first two coordinates. The labels of the different Gaussians are indicated by the different colors. Note that in a realistic setting, label information is of course not available, thus making it very hard to distinguish the different clusters. (b) Random projection of the data onto a 2 dimensional space. (c) Projection onto the first 2 principal components. Only a small fraction of the variance is explained by those components (the ratio is indicated along the axis). (d) t-SNE embedding (perplexity = 60, # iterations = 1000) in a 2 dimensional latent space. t-SNE correctly captures the local structure of the data.

Figure 54 shows t-SNE and PCA plots for the MNIST dataset of ten handwritten numerical digits (0-9). It is clear that the non-linear nature of t-SNE makes it much better at capturing and visualizing the complicated correlations between digits, compared to PCA.

FIG. 54 Visualization of the MNIST handwritten digits training dataset (here N = 60000). (a) First two principal components. (b) t-SNE applied with a perplexity of 30, a Barnes-Hut angle of 0.5 and 1000 gradient descent iterations. In order to reduce the noise and speed up computation, PCA was first applied to the dataset to project it down to 40 dimensions. We used an open-source implementation to produce the results (Linderman et al., 2017), see https://github.com/KlugerLab/FIt-SNE.
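The workflow behind FIG. 54 (PCA preprocessing followed by a non-linear embedding) can be reproduced in a few lines with scikit-learn. The sketch below uses the small digits dataset bundled with scikit-learn rather than the full MNIST set, and the parameter values are illustrative:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# denoise and speed up the computation by projecting onto the leading principal components
X_reduced = PCA(n_components=40).fit_transform(X)

# non-linear t-SNE embedding into a two-dimensional latent space
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)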
XIII. CLUSTERING

In this section, we continue our discussion of unsupervised learning methods. Unsupervised learning is concerned with discovering structure in unlabeled data (for instance learning local structures for data visualization, see section XII). The lack of labels makes unsupervised learning much more difficult and subtle than its supervised counterpart. What is somewhat surprising is that even without labels it is still possible to uncover and exploit the hidden structure in the data. Perhaps the simplest example of unsupervised learning is clustering. The aim of clustering is to group unlabelled data into clusters according to some similarity or distance measure. Informally, a cluster is thought of as a set of points sharing some pattern or structure.

Clustering finds many applications throughout data mining (Larsen and Aone, 1999), data compression and signal processing (Gersho and Gray, 2012; MacKay, 2003). Clustering can be used to identify coarse features or high level structures in an unlabelled dataset. The technique also finds many applications in the physical sciences, ranging from detecting celestial emission sources in astronomical surveys (Sander et al., 1998) to inferring groups of genes and proteins with similar functions in biology (Eisen et al., 1998), and building entanglement classifiers (Lu et al., 2017). Clustering is perhaps the simplest way to look for hidden structure in a dataset and, for this reason, is among the most widely used and employed data analysis and machine learning techniques.

The field of clustering is vast and there exists a flurry of clustering methods suited for different purposes. Some common considerations one has to take into account when choosing a particular method are the distribution of the clusters (overlapping/noisy clusters vs. well-separated clusters), the geometry of the data (flat vs. non-flat), the cluster size distribution (multiple sizes vs. uniform sizes), the dimensionality of the data (low vs. high dimensional) and the computational efficiency of the desired method (small vs. large dataset).

We begin section XIII.A with a focus on popular practical clustering methods such as K-means clustering, hierarchical clustering and density clustering. Our goal is to highlight the strengths, weaknesses and differences between these techniques, while laying out some of the theoretical framework required for clustering analysis. There exist many more clustering methods beyond those discussed in this section16. The methods we discuss were chosen for their pedagogical value and/or their applicability to problems in physics.

16 Our complementary Python notebook introduces some of these other methods.

In section XIII.B we discuss Gaussian mixture models and the formulation of clustering through latent variable models. This section introduces many of the methods we will encounter when discussing other unsupervised learning methods later in the review. Finally, in section XIII.C we discuss the problem of clustering in high-dimensional data and possible ways to tackle this difficult problem. The reader is also urged to experiment with various clustering methods using Notebook 15.

A. Practical clustering methods

Throughout this section we focus on the Euclidean distance as a similarity measure. Other measures may be better suited for specific problems. We refer the enthusiastic reader to (Rokach and Maimon, 2005) for a more in-depth discussion of the different possible similarity measures.

1. K-means

We begin our discussion with K-means clustering since this method is simple to implement and understand, and covers the core concepts of clustering. Consider a set of N unlabelled observations {xn}_{n=1}^N, where xn ∈ R^p and where p is the number of features. Also consider a set of K cluster centers called the cluster means, {µk}_{k=1}^K with µk ∈ R^p, which we will compute "empirically" in the clustering procedure. The cluster means can be thought of as the representatives of each cluster, to which data points are assigned (see FIG. 55). K-means clustering can be formulated as follows: given a fixed integer K, find the cluster means {µ} and the data point assignments in order to minimize the following objective function:

C({x, µ}) = Σ_{k=1}^K Σ_{n=1}^N r_{nk} (xn − µk)²,    (137)

where r_{nk} ∈ {0, 1} is a binary variable called the assignment. The assignment r_{nk} is 1 if xn is assigned to cluster k and 0 otherwise. Notice that Σ_k r_{nk} = 1 ∀ n and Σ_n r_{nk} ≡ Nk, where Nk is the number of points assigned to cluster k. The minimization of this objective function can be understood as trying to find the best cluster means such that the variance within each cluster is minimized. In physical terms, C is equivalent to the sum of the moments of inertia of every cluster. Indeed, as we will see below, the cluster means µk correspond to the centers of mass of their respective cluster.

K-means algorithm. The K-means algorithm alternates between two steps:

1. Expectation: Given a set of assignments {r_{nk}}, minimize C with respect to µk. Taking a simple derivative and setting it to zero yields the following update rule:

µk = (1/Nk) Σ_n r_{nk} xn.    (138)

2. Maximization: Given a set of cluster means {µk}, find the assignments {r_{nk}} which minimize C. Clearly, this is achieved by assigning each data point to its nearest cluster mean:

r_{nk} = 1 if k = arg min_{k'} (xn − µ_{k'})², and r_{nk} = 0 otherwise.    (139)

K-means clustering consists in alternating between these two steps until some convergence criterion is met. Practically, the algorithm should terminate when the change in the objective function from one iteration to another becomes smaller than a pre-specified threshold. A simple example of K-means is presented in FIG. 55.

FIG. 55 K-means with K = 3 applied to an artificial two-dimensional dataset. The cluster means at each iteration are indicated by cyan star markers. t indicates the iteration number and C the value of the objective function. (a) The algorithm is initialized by randomly partitioning the space into 3 sectors to generate an initial assignment. (b)-(c) For well separated clusters, the algorithm converges rapidly to the true clusters. (d) The objective function as a function of the iteration. C converges after t = 18 iterations for this choice of random seed (for center initialization).

A nice property of the K-means algorithm is that it is guaranteed to converge. To see this, one can verify explicitly (by taking second-order derivatives) that the expectation step always decreases C. This is also true for the assignment step. Thus, since C is bounded from below, the two-step iteration of K-means always converges to a local minimum of C. Since C is generally a non-convex function, in practice one usually needs to run the algorithm with different initial random cluster center initializations and post-select the best local minimum. A simple implementation of K-means has an average computational complexity which scales linearly in the size of the data set (more specifically, the complexity is O(KN) per iteration) and is thus scalable to very large datasets.

As we will see in section XIII.B, K-means is a hard-assignment limit of the Gaussian mixture model where all cluster variances are assumed to be the same. This highlights a common drawback of K-means: if the true clusters have very different variances (spreads), K-means can lead to spurious results since the underlying assumption is that the latent model has uniform variances.
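The two-step iteration of Eqs. (138)-(139) takes only a few lines of NumPy. The sketch below is a bare-bones illustration of the algorithm rather than a production implementation (scikit-learn's sklearn.cluster.KMeans handles restarts, empty clusters, and convergence checks); in particular, empty clusters are not handled here:

import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # initialize the cluster means with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # maximization step: assign each point to its nearest cluster mean, Eq. (139)
        r = np.argmin(((X[:, None, :] - mu[None, :, :])**2).sum(axis=-1), axis=1)
        # expectation step: each mean becomes the center of mass of its cluster, Eq. (138)
        mu = np.array([X[r == k].mean(axis=0) for k in range(K)])
    return mu, r

# usage: three well-separated 2D blobs
X = np.concatenate([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
mu, labels = kmeans(X, K=3)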
2. Hierarchical clustering: Agglomerative methods

Agglomerative clustering is a bottom-up approach that starts from small initial clusters which are then progressively merged to form larger clusters. The merging process generates a hierarchy of clusters that can be visualized in the form of a dendrogram (see FIG. 56). This hierarchy can be useful to analyze the relation between clusters and the subcomponents of individual clusters. Agglomerative methods are usually specified by defining a distance measure between clusters17. We denote the distance between clusters X and Y by d(X, Y) ∈ R. Different choices of distance result in different clustering algorithms. At each step, the two clusters that are closest with respect to the distance measure are merged, until a single cluster is left.

17 Note that this measure need not be a metric.

Agglomerative clustering algorithm. Agglomerative clustering algorithms can thus be summarized as follows:

1. Initialize each point to its own cluster.

2. Given a set of K clusters X1, X2, · · · , XK, merge clusters until one cluster is left (K = 1):

(a) Find the closest pair of clusters (Xi, Xj): (i, j) = arg min_{(i',j')} d(X_{i'}, X_{j'}).

(b) Merge the pair. Update: K ← K − 1.

Here we list a few of the most popular distances used in agglomerative methods, often called linkage methods in the clustering literature.

1. Single linkage: the distance between clusters i and j is defined as the minimum distance between two elements of the different clusters,

d(Xi, Xj) = min_{xi∈Xi, xj∈Xj} ‖xi − xj‖2.    (140)

2. Complete linkage: the distance between clusters i and j is defined as the maximum distance between two elements of the different clusters,

d(Xi, Xj) = max_{xi∈Xi, xj∈Xj} ‖xi − xj‖2.    (141)

3. Average linkage: the average distance between points of the different clusters,

d(Xi, Xj) = (1/(|Xi| · |Xj|)) Σ_{xi∈Xi, xj∈Xj} ‖xi − xj‖2.    (142)

4. Ward's linkage: this distance measure is analogous to the K-means method as it seeks to minimize the total inertia. The distance measure is the "error squared" before and after merging, which simplifies to

d(Xi, Xj) = (|Xi||Xj| / |Xi ∪ Xj|) (µi − µj)²,    (143)

where µj is the center of cluster j.

A common drawback of hierarchical methods is that they do not scale well: at every step, a distance matrix between all clusters must be updated/computed. Efficient implementations achieve a typical computational complexity of O(N²) (Müllner, 2011), making the method suitable for small to medium-size datasets. A simple but major speed-up for the method is to initialize the clusters with K-means using a large K (but still a small fraction of N) and then proceed with hierarchical clustering. This has the advantage of preserving the large-scale structure of the hierarchy while making use of the linear scaling of K-means. In this way, hierarchical clustering may be applied to very large datasets.

FIG. 56 Hierarchical clustering example with single linkage. (a) The data points are successively grouped as denoted by the colored dotted lines. (b) Dendrogram representation of the hierarchical decomposition. Each node of the tree represents a cluster. One has to specify a scale cut-off for the distance measure d(X, Y) (corresponding to a horizontal cut in the dendrogram) in order to obtain a set of clusters.
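In practice, the hierarchy and the dendrogram of FIG. 56 can be built with SciPy. The following minimal sketch (toy data and an arbitrary cut-off scale) uses Ward's linkage, but 'single', 'complete' or 'average' can be substituted directly:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(50, 2)                     # toy dataset

# build the merge hierarchy; the linkage method sets the cluster distance d(X, Y)
Z = linkage(X, method='ward')

# cut the dendrogram at a chosen distance scale to obtain a flat set of clusters
labels = fcluster(Z, t=5.0, criterion='distance')

# dendrogram(Z) draws the tree, as in FIG. 56(b)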
3. Density-based (DB) clustering

Density clustering makes the intuitive assumption that clusters are defined by regions of space with higher density of data points. Data points that constitute noise or that are outliers are expected to form regions of low density. Density clustering has the advantage of being able to consider clusters of multiple shapes and sizes while identifying outliers. The method is also suitable for large-scale applications.

The core assumption of DB clustering is that a relative local density estimation of the data is possible. In other words, it is possible to order points according to their densities. Density estimates are usually accurate for low-dimensional data but become unreliable for high-dimensional data due to large sampling noise. Here, for brevity, we confine our discussion to one of the most widely used density clustering algorithms, DBSCAN. We have also had great success with another recently introduced variant of DB clustering (Rodriguez and Laio, 2014) that is similar in spirit, which the reader is urged to consult. One of the authors (A. D.) has also created a Python package, https://pypi.org/project/fdc/, which makes use of accurate density estimates via kernel methods combined with agglomerative clustering to produce fast and accurate density clustering (see also the GitHub repository).

DBSCAN algorithm. Here we describe the most prominent DB clustering algorithm: DBSCAN, or density-based spatial clustering of applications with noise (Ester et al., 1996). Consider once again a set of N data points X ≡ {xn}_{n=1}^N.

We start by defining the ε-neighborhood of point xn as follows:

Nε(xn) = {x ∈ X | d(x, xn) < ε}.    (144)

Nε(xn) are the data points that are at a distance smaller than ε from xn. As before, we consider d(·, ·) to be the Euclidean metric (which yields spherical neighborhoods, see Figure 57), but other metrics may be better suited depending on the specific data. Nε(xn) can be seen as a crude estimate of the local density. xn is considered to be a core-point if at least minPts points are in its ε-neighborhood. minPts is a free parameter of the algorithm that sets the scale of the size of the smallest cluster one should expect. Finally, a point xi is said to be density-reachable if it is in the ε-neighborhood of a core-point. From these definitions, the algorithm can be simply formulated (see also Figure 57):

→ Until all points in X have been visited, do:

− Pick a point xi that has not been visited.

− Mark xi as a visited point.

− If xi is a core point, then:

· Find the set C of all points that are density reachable from xi.

· C now forms a cluster. Mark all points within that cluster as being visited.

→ Return the cluster assignments C1, · · · , Ck, with k the number of clusters. Points that have not been assigned to a cluster are considered noise or outliers.

Note that DBSCAN does not require the user to specify the number of clusters but only ε and minPts. While it is common to heuristically fix these parameters, methods such as cross-validation can be used for their determination. Finally, we note that DBSCAN is very efficient, since efficient implementations have a computational cost of O(N log N).

FIG. 57 (a) Illustration of the DBSCAN algorithm with minPts = 4. Two ε-neighborhoods are represented as dashed circles of radius ε. Red points are the core points and blue points are density-reachable points that are not core points. Outliers are gray colored. (b) Application of DBSCAN (minPts = 40) to a noisy dataset with two non-convex clusters. The density profile is shown for clarity. Outliers are indicated by black crosses.
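A dataset in the spirit of FIG. 57(b), two noisy non-convex clusters, can be clustered with scikit-learn's DBSCAN implementation. The sketch below is illustrative, and the values of eps (i.e. ε) and min_samples (i.e. minPts) would need tuning for real data:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# two noisy, non-convex clusters
X, _ = make_moons(n_samples=1000, noise=0.1, random_state=0)

db = DBSCAN(eps=0.1, min_samples=10).fit(X)
labels = db.labels_      # cluster index for each point; outliers are labeled -1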
B. Clustering and Latent Variables via Gaussian Mixture Models

In the previous section, we introduced several practical methods for clustering. In this section, we will approach clustering from a more abstract vantage point, and in the process, introduce many of the core ideas underlying unsupervised learning. A central concept in many unsupervised learning techniques is the idea of a latent or hidden variable. Even though latent variables are not directly observable, they still influence the visible structure of the data. For example, in the context of clustering we can think of the cluster identity of each datapoint (i.e. which cluster does a datapoint belong to) as a latent variable. And even though we cannot see the cluster label explicitly, we know that points in the same cluster tend to be closer together. The latent variables in our data (cluster identity) are a way of representing and abstracting the correlations between datapoints.

In this language, we can think of clustering as an algorithm to learn the most probable value of a latent variable (cluster identity) associated with each datapoint. Calculating this latent variable requires additional assumptions about the structure of our dataset. Like all unsupervised learning algorithms, in clustering we must make an assumption about the underlying probability distribution from which the data was generated. Our model for how the data is generated is called the generative model. In clustering, we assume that data points are assigned a cluster, with each cluster characterized by some cluster-specific probability distribution (e.g. a Gaussian with some mean and variance that characterizes the cluster). We then specify a procedure for finding the value of the latent variable. This is often done by choosing the values of the latent variable that minimize some cost function.

One common choice for a class of cost functions for many unsupervised learning problems is Maximum Likelihood Estimation (MLE), see Secs. V and VI. In MLE, we choose the values of the latent variables that maximize the likelihood of the observed data under our generative model (i.e. maximize the probability of getting the observed dataset under our generative model). Such MLE equations often give rise to the kind of Expectation Maximization (EM) equations that we first encountered in the last section in the context of K-means clustering.

Gaussian Mixture models (GMM) are a generative model often used in the context of clustering. In GMM, points are drawn from one of K Gaussians, each with its own mean µk and covariance matrix Σk,

N(x|µ, Σ) ∼ exp(−(1/2)(x − µ) Σ^{-1} (x − µ)^T).    (145)

Let us denote the probability that a point is drawn from mixture k by πk. Then, the probability of generating a point x in a GMM is given by

p(x|{µk, Σk, πk}) = Σ_{k=1}^K N(x|µk, Σk) πk.    (146)

Given a dataset X = {x1, · · · , xN}, we can write the likelihood of the dataset as

p(X|{µk, Σk, πk}) = Π_{i=1}^N p(xi|{µk, Σk, πk}).    (147)

For future reference, let us denote the set of parameters (of the K Gaussians in the model) {µk, Σk, πk} by θ.

To see how we can use GMM and MLE to perform clustering, we introduce discrete binary K-dimensional latent variables z for each data point x whose k-th component is 1 if point x was generated from the k-th Gaussian and zero otherwise (these are often called "one-hot variables"). For instance, if we were considering a Gaussian mixture with K = 3, we would have three possible values for z ≡ (z1, z2, z3): (1, 0, 0), (0, 1, 0) and (0, 0, 1). We cannot directly observe the variable z. It is a latent variable that encodes the cluster identity of point x. Let us also denote all the N latent variables corresponding to a dataset X by Z.

Viewing the GMM as a generative model, we can write the probability p(x|z) of observing a data point x given z as

p(x|z; {µk, Σk}) = Π_{k=1}^K N(x|µk, Σk)^{zk},    (148)

as well as the probability of observing a given value of the latent variable,

p(z|{πk}) = Π_{k=1}^K πk^{zk}.    (149)

Using Bayes' rule, we can write the joint probability of a clustering assignment z and a data point x given the GMM parameters as

p(x, z; θ) = p(x|z; {µk, Σk}) p(z|{πk}).    (150)

We can also use Bayes' rule to rearrange this expression to give the conditional probability of the data point x being in the k-th cluster, γ(zk), given model parameters θ as

γ(zk) ≡ p(zk = 1|x; θ) = πk N(x|µk, Σk) / Σ_{j=1}^K πj N(x|µj, Σj).    (151)

The γ(zk) are often referred to as the "responsibility" that mixture k takes for explaining x. Just like in our discussion of soft-max classifiers, this can be made into a "hard assignment" by assigning each point to the cluster with the largest probability: arg max_k γ(zk) over the responsibilities.

The complication is of course that we do not know the parameters θ of the underlying GMM but instead must also learn them from the dataset X. As discussed above, ideally we could do this by choosing the parameters that maximize the likelihood (or equivalently the log-likelihood) of the data,

θ̂ = arg max_θ log p(X|θ),    (152)

where θ = {µk, Σk, πk}. Once we know the MLEs θ̂, we could use Eq. (151) to calculate the optimal hard cluster assignment arg max_k γ̂(zk), where γ̂(zk) = p(zk = 1|x; θ̂).

In practice, due to the complexity of Eq. (147), it is almost impossible to find the global maximum of the likelihood function. Instead, we must settle for a local maximum. One approach to finding a local maximum of the likelihood is to use a method like stochastic gradient descent on the negative log-likelihood, cf. Sec. IV. Here, we introduce an alternative, powerful approach for finding local minima in latent variable models using an iterative procedure called Expectation Maximization (EM). Given an initial guess for the parameters θ^(0), the EM algorithm iteratively generates new estimates for the parameters θ^(1), θ^(2), . . .. Importantly, the likelihood is guaranteed to be non-decreasing under these iterations and hence EM converges to a local maximum of the likelihood (Neal and Hinton, 1998).

The central observation underlying EM is that it is often much easier to calculate the conditional likelihoods of the latent variables p̃^(t)(Z) = p(Z|X; θ^(t)) given some choice of parameters, and the maximum of the expected log-likelihood given an assignment of the latent variables: θ^(t+1) = arg max_θ E_{p(Z|X;θ^(t))}[log p(X, Z; θ)]. To get an intuition for this latter quantity, notice that we can write

E_{p̃^(t)}[log p(X, Z; θ)] = Σ_{i=1}^N Σ_{k=1}^K γ_{ik}^{(t)} [log N(xi|µk, Σk) + log πk],    (153)

where we have used the shorthand γ_{ik}^{(t)} = p(z_{ik}|X; θ^(t)), with z_{ik} the k-th component of zi. Taking the derivative of this equation with respect to µk, Σk, and πk (subject to the constraint Σ_k πk = 1) and setting this to zero yields the intuitive equations

µk^{(t+1)} = Σ_{i=1}^N γ_{ik}^{(t)} xi / Σ_i γ_{ik}^{(t)},
Σk^{(t+1)} = Σ_{i=1}^N γ_{ik}^{(t)} (xi − µk)(xi − µk)^T / Σ_i γ_{ik}^{(t)},
πk^{(t+1)} = (1/N) Σ_i γ_{ik}^{(t)}.    (154)

These are just the usual estimates for the mean and variance, with each data point weighed according to our current best guess for the probability that it belongs to cluster k. We can then use our new estimate θ^(t+1) to calculate the new responsibilities γ_{ik}^{(t+1)} and repeat the process. This is essentially the K-means algorithm discussed in the first section.

This discussion of the Gaussian mixture model introduces several concepts that we will return to repeatedly in the context of unsupervised learning. First, it is often useful to think of the visible correlations between features in the data as resulting from hidden or latent variables. Second, we will often posit a generative model that encodes the structure we think exists in the data and then find parameters that maximize the likelihood of the observed data. Third, often we will not be able to directly estimate the MLE, and will have to instead look for a computationally efficient way to find a local minimum of the likelihood.

FIG. 58 (a) Application of Gaussian mixture modelling to the Ising dataset. The normalized histogram corresponds to the first principal component distribution of the dataset (or equivalently the magnetization in this case). The 1D data is fitted with a K = 3-component Gaussian mixture. The likelihood of the fitted Gaussian mixture is represented in red and is obtained via the expectation-maximization algorithm. (b) The Gaussian mixture model can be used to compute the posterior probabilities (responsibilities), i.e. the probability of being in one of the phases. Note that the point where γ(1) = γ(2) = γ(3) can be interpreted as the critical point. Indeed, the crossing occurs at T ≈ 2.26.
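A GMM fit like the one in FIG. 58 can be carried out with scikit-learn, whose GaussianMixture class implements exactly this EM iteration. The sketch below uses synthetic one-dimensional data as a stand-in for the projected Ising configurations:

import numpy as np
from sklearn.mixture import GaussianMixture

# toy 1D data drawn from three components (a stand-in for the projected Ising data)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 0.2, 500),
                    rng.normal(0.0, 0.3, 500),
                    rng.normal(1.0, 0.2, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3).fit(X)   # EM estimates of {mu_k, Sigma_k, pi_k}

gamma = gmm.predict_proba(X)        # responsibilities gamma(z_k), Eq. (151)
hard_labels = gamma.argmax(axis=1)  # hard assignment arg max_k gamma(z_k)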
C. Clustering in high dimensions

Clustering data in high dimensions can be very challenging. One major problem that is aggravated in high dimensions is the generic accumulation of noise due to random measurement error for each feature. This in turn leads to increased errors for pairwise similarity and distance measures and thus tends to "blur" distances between data points (Domingos, 2012; Kriegel et al., 2009; Zimek et al., 2012). Many clustering algorithms rely on the explicit use of a similarity measure or distance metric that weighs all features equally. For this reason, one must be careful when using an off-the-shelf method in high dimensions.

In order to perform clustering on high-dimensional data, it is often useful to denoise the data before proceeding with a standard clustering method such as K-means (Kriegel et al., 2009). Figure 54 illustrates an application of denoising to high-dimensional data. PCA (section XII.B) was used to denoise the MNIST dataset by projecting the 784 original dimensions onto the 40 dimensions with the largest principal components. The resulting features were then used to construct a Euclidean distance matrix which was used by t-SNE to compute the two-dimensional embedding that is presented. Using t-SNE directly on the original data leads to a "blurring" of the clusters (the reader is encouraged to test this for themselves).

However, simple feature selection or feature denoising (using PCA for instance) can sometimes be insufficient for learning clusters due to the presence of large variations in the signal and noise of the features that are relevant for identifying the underlying clusters (Kriegel et al., 2009). Recent promising work suggests that one way to overcome these limitations is to learn the latent space and the cluster labels at the same time (Xie et al., 2016).

Finally, we end the clustering section with a short discussion on clustering validation, which can be particularly difficult for high-dimensional data. Often clustering validation, i.e. verifying whether the obtained labels are "valid", is done by direct visual inspection. That is, the data is represented in a low-dimensional space and the cluster labels obtained are visually inspected to make sure that different labels organize into distinct "blobs". For high-dimensional data, this is done by performing dimensional reduction (section XII). However, this can lead to the appearance of spurious clusters since dimensional reduction inevitably loses information about the original data. Thus, these methods should be used with care when trying to validate clusters [see (Wattenberg et al., 2016) for an interactive discussion on how t-SNE can sometimes be misleading and how to effectively use it].

A lot of work has been done to devise ways of validating clusters based on various metrics and measures (Kriegel et al., 2009). Perhaps one of the most intuitive ways of defining a good clustering is by measuring how well clusters generalize. Clustering methods based on leveraging powerful classifiers to measure the generalization errors of the clusters have been developed by some of the authors (Day and Mehta, 2018), see https://pypi.org/project/hal-x/. We believe this represents an especially promising research direction in high-dimensional clustering. Finally, we emphasize that this discussion is far from exhaustive and we refer the reader to (Rokach and Maimon, 2005), Chapter 15, for an in-depth survey of the various validation techniques.

XIV. VARIATIONAL METHODS AND MEAN-FIELD THEORY (MFT)

A common thread in many unsupervised learning tasks is accurately representing the underlying probability distribution from which a dataset is drawn. Unsupervised learning of high-dimensional, complex distributions presents a new set of technical and computational challenges that are different from those we encountered in a supervised learning setting. When dealing with complicated probability distributions, it is often much easier to learn the relative weights of different states or data points (ratio of probabilities) than absolute probabilities. In physics, this is the familiar statement that the weights of a Boltzmann distribution are much easier to calculate than the partition function. The relative probability of two configurations, x1 and x2, is given by the ratio of their Boltzmann weights,

p(x1)/p(x2) = e^{−β(E(x1)−E(x2))},    (155)

where, as is usual in statistical mechanics, β is the inverse temperature and E(x; θ) is the energy of state x given some parameters (couplings) θ. However, calculating the absolute weight of a configuration requires knowledge of the partition function

Zp = Tr_x e^{−βE(x)}    (156)

(where the trace is taken over all possible configurations x), since

p(x) = e^{−βE(x)} / Zp.    (157)

In general, calculating the partition function Zp is analytically and computationally intractable.

For example, for the Ising model with N binary spins, the trace involves calculating a sum over 2^N terms, which is a difficult task for most energy functions. For this reason, physicists (and machine learning scientists) have developed various numerical and computational methods for evaluating such partition functions. One approach is to use Monte-Carlo based methods to draw samples from the underlying distribution (this can be done knowing only the relative probabilities) and then use these samples to numerically estimate the partition function. This is the philosophy behind powerful methods such as Markov Chain Monte Carlo (MCMC) (Andrieu et al., 2003) and annealed importance sampling (Neal and Hinton, 1998), which are widely used in both the statistical physics and machine learning communities. An alternative approach – which we focus on here – is to approximate the probability distribution p(x) and partition function using a "variational distribution" q(x; θq), whose partition function we can calculate exactly. The variational parameters θq are chosen to make the variational distribution as close to the true distribution as possible (how this is done is the focus of much of this section).
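To make the exponential cost of the trace concrete, the following sketch evaluates Zp by brute force for a small Ising-type energy function with random couplings. This is our own illustration, feasible only for a handful of spins and included purely to show why direct evaluation does not scale:

import numpy as np
from itertools import product

def energy(s, J, h):
    # Ising-type energy, E(s) = -1/2 sum_ij J_ij s_i s_j - sum_i h_i s_i
    return -0.5 * s @ J @ s - h @ s

N, beta = 10, 1.0
rng = np.random.default_rng(0)
J = rng.normal(size=(N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
h = rng.normal(size=N)

# brute-force trace over all 2^N spin configurations
Z = sum(np.exp(-beta * energy(np.array(s), J, h))
        for s in product([-1, 1], repeat=N))

Doubling N squares the number of terms in the trace, which is precisely the obstruction that Monte-Carlo and variational methods are designed to sidestep.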
powerful set of techniques that go under the name of Be- with {si =±1} denoting the sum over all possible con-
3
4 lief Propagation and Survey Propagation (MacKay, 2003; figurations of the spin variables. We write Zp to em-
5 Wainwright et al., 2008; Yedidia et al., 2003). phasize that this is the partition function corresponding
6 Variational methods are also widely used in ML to ap- to the probability distribution p(s|β, J ), which will be-
7 proximate complex probabilistic models. For example, come important later. For a fixed number of lattice sites
8 below we show how the Expectation Maximization (EM) N , there are 2N possible configurations, a number that
9 procedure, which we discussed in the context of Gaus- grows exponentially with the system size. Therefore, it
10 sian Mixture Models for clustering, is actually a general is not in general feasible to evaluate the partition func-
11 method that can be derived for any latent (hidden) vari- tion Zp (β, J ) in closed form. This represents a major
12 able model using a variational procedure (Neal and Hin- practical obstacle for extracting predictions from physi-
13 ton, 1998). This section serves as an introduction to this cal theories since the partition function is directly related
14 powerful class of variational techniques. For readers in- to the free-energy through the expression
15 terested in an in-depth discussion on variational infer-
16 ence for probabilistic graphical models, we recommend βFp (J ) = − log Zp (β, J ) = βhE(s, J )ip − Hp ,(160)
17 the great treatise written by Michael I. Jordan and oth-
ers (Jordan et al., 1999), the more physics-oriented discussion in (Yedidia, 2001; Yedidia et al., 2003), as well as David MacKay's outstanding book (MacKay, 2003).
with
H_p = − Σ_{s_i=±1} p(s|β, J) log p(s|β, J),    (161)
23 the entropy of the probability distribution p(s|β, J ).
A. Variational mean-field theory for the Ising model Even though the true probability distribution p(s|β, J )
25 may be a very complicated object, we can still make
Ising models are a major paradigm in statistical progress by approximating p(s|β, J ) by a variational dis-
physics. Historically introduced to study magnetism, it tribution q(s, θ) which captures the essential features of
was quickly realized that their predictive power applies interest, with θ some parameters that define our varia-
29 to a variety of interacting many-particle systems. Ising tional ansatz. The name variational distribution comes
30 models are now understood to serve as minimal models from the fact that we are going to vary the parame-
31 for complex phenomena such as certain classes of phase ters θ to make q(s, θ) as close to p(s|β, J ) as possible.
32 transitions. In the Ising model, degrees of freedom called The functional form of q(s, θ) is based on an “educated
33 spins assume discrete, binary values, e.g. si = ±1. Each guess”, which oftentimes comes from our intuition about
34 spin variable si lives on a lattice (or, in general, a graph), the problem. We can also define a variational free-energy
35 the sites of which are labeled by i = 1, 2 . . . , N . De-
36 spite the extreme simplicity relative to real-world sys- βFq (θ, J ) = βhE(s, J )iq − Hq , (162)
37 tems, Ising models exhibit a high level of intrinsic com-
38 plexity, and the degrees of freedom can become correlated where hE(s, J )iq is the expectation value of the energy
39 in sophisticated ways. Often, spins interact spatially lo- E(s, J ) with respect to the distribution q(s, θ), and Hq
40 cally, and respond to externally applied magnetic fields. is the entropy of q(s, θ).
41 A spin configuration s specifies the values si of the Before proceeding further, it is helpful to introduce
42 spins at every lattice site. We can assign an “energy” to a new quantity: the Kullback-Leibler divergence (KL-
43 every such configuration divergence or relative entropy) between two distributions
44 p(x) and q(x). The KL-divergence measures the dissim-
ilarity between the two distributions and is given by
D_KL(q‖p) = Tr_x q(x) log [ q(x)/p(x) ],    (163)
E(s, J) = − ½ Σ_{i,j} J_ij s_i s_j − Σ_i h_i s_i,    (158)
where h_i is a local magnetic field applied to the spin s_i,
49 and Jij is the interaction strength between the spins si
50 and sj . In textbook examples, the coupling parameters which is the expectation w.r.t. q of the logarithmic dif-
51 J = (J, h) are typically uniform or, in studies of disor- ference between the two distributions p and q. The trace
52 Trx denotes a sum over all possible configurations x.
dered systems, (Ji , hi ) are drawn from some probability
53 Two important properties of the KL-divergence are (i)
distribution (i.e. quenched disorder).
54 positivity: DKL (pkq) ≥ 0 with equality if and only if
55 The probability of finding the system in a given spin
configuration at temperature β −1 is given by p = q (in the sense of probability distributions), and (ii)
DKL (p‖q) ≠ DKL (q‖p), that is, the KL-divergence is not
symmetric in its arguments.
p(s|β, J) = e^{−βE(s,J)} / Z_p(J),   Z_p(β, J) = Σ_{s_i=±1} e^{−βE(s,J)},    (159)
Variational mean-field theory is a systematic way for constructing such an approximate distribution q(s, θ). The main idea is to choose parameters that minimize the
difference between the variational free-energy Fq (J , θ) The total variational free-energy is
4 and the true free-energy Fp (J |β). We will show in Sec-
tion XIV.B below that the difference between these two βFq (J , θ) = βhE(s, J )iq − Hq ,
6 free-energies is actually the KL-divergence: and minimizing with respect to the variational parame-
Fq (J , θ) = Fp (J , β) + DKL (qkp). (164) ters θ, we obtain
∂/∂θ_i [βF_q(J, θ)] = 2 (dq_i/dθ_i) [ −β ( Σ_j J_ij m_j + h_i ) + θ_i ].    (169)
This equality, when combined with the non-negativity of the KL-divergence, has important consequences. First, it shows that the variational free-energy is always larger than the true free-energy, F_q(J, θ) ≥ F_p(J), with equal-
13 ity if and only if q = p (the latter inequality is found Setting this equation to zero, we arrive at
14 X
in many physics textbooks and is known as the Gibbs
15 θi = β Jij mj (θj ) + hi . (170)
inequality). Second, finding the best variational free-
16 j
17 energy is equivalent to minimizing the KL divergence
18 DKL (qkp). For the special case of a uniform field hi = h and uni-
19 Armed with these observations, let us now derive a form nearest neighbor couplings Jij = J, by symmetry
20 MFT of the Ising model using variational methods. In the variational parameters for all the spins are identical,
21 the simplest MFT of the Ising model, the variational dis- with θi = θ for all i. Then, the mean-field equations
22 tribution is chosen so that all spins are independent: reduce to their familiar textbook form (Sethna, 2006),
23 ! m = tanh(θ) and θ = β(zJm(θ) + h), where z is the
1 X Y eθi si
24 q(s, θ) = exp θi si = . (165) coordination number of the lattice (i.e. the number of
25 Zq i i
2 cosh θi nearest neighbors).
26 Equations (167) and (170) form a closed system, known
27 In other words, we have chosen a distribution q which as the mean-field equations for the Ising model. To find
28 factorizes on every lattice site. An important property a solution to these equations, one method is to iterate
of this functional form is that we can analytically find a through and update each θi, one at a time, in an asyn-
closed-form expression for the variational partition func- chronous fashion. One can see the emerging relationship
tion Zq. This simplicity also comes at a cost: ignor- of this approach to solving the MFT equations to the Ex-
ing correlations between spins. These correlations be- pectation Maximization (EM) procedure first introduced in
33 come less and less important in higher dimensions and the context of the K-means algorithm in Sec. XIII.A. To
34 the MFT ansatz becomes more accurate. make this explicit, let us spell out the iterative procedure
35 To evaluate the variational free-energy, we make use
36 to find the solutions to Eq. (170). We start by initializing
of Eq. (162). First, we need the entropy Hq of the dis- our variational parameters to some θ (0) and repeat the
37 tribution q. Since q factorizes over the lattice sites, the
38 following two steps until convergence:
entropy separates into a sum of one-body terms
39 X 1. Expectation: Given a set of assignments at iteration
40 Hq (θ) = − q(s, θ) log q(s, θ) t, θ (t) , calculate the corresponding magnetizations
41 {si =±1} m(t) using Eq. (167)
42 X
= − Σ_i [ q_i log q_i + (1 − q_i) log(1 − q_i) ],    (166)
2. Maximization: Given a set of magnetizations m^(t), find new assignments θ^(t+1) which minimize the variational free energy F_q. From Eq. (170), this is just
where q_i = e^{θ_i}/(2 cosh θ_i) is the probability that spin s_i is in the
+1 state. Next, we need to evaluate the average of the
47
Ising energy E(s, J ) with respect to the variational dis- (t+1)
X (t)
48 θi =β Jij mj + hi . (171)
49 tribution q. Although the energy contains bilinear terms,
j
50 we can still evaluate this average easily, because the spins
51 are independent (uncorrelated) in the q distribution. The
From these equations, it is clear that we can think of the
52 mean value of spin si in the q distribution, also known
MFT of the Ising model as an EM-like procedure similar
53 as the on-site magnetization, is given by
to the one we used for K-means clustering and Gaussian
Mixture Models in Sec. XIII.
m_i = ⟨s_i⟩_q = Σ_{s_i=±1} s_i e^{θ_i s_i} / (2 cosh θ_i) = tanh(θ_i).    (167)
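The iterative scheme above is easy to try numerically. The following is a minimal sketch (our own code and variable names, not taken from the accompanying notebooks) that solves Eqs. (167) and (170) by asynchronous updates for a user-supplied coupling matrix J and fields h; for the uniform chain it reproduces the textbook self-consistency condition m = tanh(β(zJm + h)).

import numpy as np

def solve_mft(J, h, beta, max_iter=10000, tol=1e-10):
    """Asynchronously iterate theta_i = beta*(sum_j J_ij m_j + h_i), m_i = tanh(theta_i)."""
    N = len(h)
    m = np.zeros(N)
    theta = np.zeros(N)
    for _ in range(max_iter):
        m_old = m.copy()
        for i in range(N):                      # update one theta_i at a time
            theta[i] = beta * (J[i] @ m + h[i])
            m[i] = np.tanh(theta[i])
        if np.max(np.abs(m - m_old)) < tol:
            break
    return m, theta

# Quick check on a uniform periodic chain (coordination number z = 2):
N, Jc, hf, beta = 10, 1.0, 0.1, 0.7
J = np.zeros((N, N))
for i in range(N):
    J[i, (i + 1) % N] = J[i, (i - 1) % N] = Jc
m, _ = solve_mft(J, np.full(N, hf), beta)
print(m)   # every entry should satisfy m = tanh(beta*(2*Jc*m + hf))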
MFT is not exact, it can often yield qualitatively and
57
58 Since the spins are independent, we have even quantitatively precise predictions (especially in high
dimensions). The discrepancy between the true physics
59 1X X
and MFT predictions stems from the fact that the varia-
60 hE(s, J )iq = − Jij mi mj − hi mi . (168)
2 i,j tional distribution q we chose cannot capture correlations
61 i
between the spins. For instance, it predicts the wrong parameter θ (t) to be those with maximum likeli-
3
4 value for the critical temperature for the two-dimensional hood, assuming qt−1 (z) found in the previous step
5 Ising model. It even erroneously predicts the existence of is the true distribution of hidden variable z:
6 a phase transition in one dimension at a non-zero tem-
θt = arg maxhlog p(z, x|θ)iqt−1 (173)
7 perature. We refer the interested reader to standard θ
8 textbooks on statistical physics for a detailed analysis
9 of applicability of MFT to the Ising model. However, It was shown (Dempster et al., 1977) that each EM iter-
10 we emphasize that the failure of any particular varia- ation increases the true log-likelihood L(θ), or at worst
11 tional ansatz does not compromise the usefulness of the leaves it unchanged. In most models, this iteration pro-
12 approach. In some cases, one can consider changing the cedure converges to a local maximum of L(θ).
13 variational ansatz to improve the predictive properties
14 of the corresponding variational MFT (Yedidia, 2001;
15 Yedidia et al., 2003). The take-home message is that
variational MFT is a powerful tool but one that must be applied and interpreted with care.

B. Expectation Maximization (EM)

Ideas along the lines of variational MFT have been independently developed in statistics and imported into machine learning to perform maximum likelihood (ML) estimates. In this section, we explicitly derive the Expectation Maximization (EM) algorithm and demonstrate further its close relation to variational MFT (Neal and Hinton, 1998). We will focus on latent variable models where some of the variables are hidden and cannot be directly observed. This often makes maximum likeli-
32 hood estimation difficult to implement. EM gets around
33 this difficulty by using an iterative two-step procedure, FIG. 59 Convergence of EM algorithm. Starting from θ (t) ,
34 closely related to variational free-energy based approxi- E-step (blue) establishes −Fq (θ (t) ) which is always a lower
35 mation schemes in statistical physics. bound of −Fp := hlog p(x|θ)iPx (green). M-step (red) is then
36 To set the stage for the following discussion, let x be applied to update the parameter, yielding θ (t+1) . The up-
37 the set of visible variables we can directly observe and z dated parameter θ (t+1) is then used to construct −Fq (θ (t+1) )
in the subsequent E-step. M-step is performed again to up-
38 be the set of latent or hidden variables that we cannot di- date the parameter, etc.
39 rectly observe. Denote the underlying probability distri-
40 bution from which x and z are drawn by p(z, x|θ), with To see how EM is actually performed and related to
41 θ representing all relevant parameters. Given a dataset
42 variational MFT, we make use of KL-divergence between
x, we wish to find the maximum likelihood estimate of two distributions introduced in the last section. Recall
43 the parameters θ that maximizes the probability of the
44 that our goal is to maximize the log-likelihood L(θ).
observed data. With data z missing, we surely cannot just maximize
45
As in variational MFT, we view θ as variational pa- L(θ) directly since parameter θ might couple both z and
46
47 rameters chosen to maximize the log-likelihood L(θ) = x. EM circumvents this by optimizing another objective
48 hlog p(x|θ)iPx , where the expectation is taken with re- function, Fq (θ), constructed based on estimates of the
49 spect to the marginal distributions of x. Algorithmically, hidden variable distribution q(z|x). Indeed, the function
50 this can be done by iterating the variational parameters optimized is none other than the variational free energy
51 θ (t) in a series of steps (t = 1, 2, . . . ) starting from some we encountered in the previous section:
52 arbitrary initial value θ (0) :
53 Fq (θ) := −hlog p(z, x|θ)iq,Px − hHq iPx , (174)
1. Expectation step (E step): Given the known
54 where Hq is the Shannon entropy (defined in Eq. (161))
55 values of observed variable x and the current esti-
mate of parameter θt−1 , find the probability distri- of q(z|x). One can define the true free-energy Fp (θ) as
56 the negative log-likelihood of the observed data:
57 bution of the latent variable z:
58 − Fp (θ) = L(θ) = hlog p(x|θ)iPx . (175)
59 qt−1 (z) = p(z|θ (t−1) , x) (172)
60 In the language of statistical physics, Fp (θ) is the true
61 2. Maximization step (M step): Re-estimate the free-energy while Fq (θ) is the variational free-energy we
would like to minimize (see Table I). Note that we have 2. Maximization step: Fix q, update the variational
3
4 chosen to employ a physics sign convention here of defin- parameters:
5 ing the free-energy as minus log of the partition function.
In the ML literature, this minus sign is often omitted θ (t) = arg max −Fqt−1 (θ). (178)
6 θ
7 (Neal and Hinton, 1998) and this can lead to some con-
8 fusion. Our goal is to choose θ so that our variational To recapitulate, EM implements ML estimation even
9 free-energy Fq (θ) is as close to the true free-energy Fp (θ) with missing or hidden variables through optimizing a
10 as possible. The difference between these free-energies lower bound of the true log-likelihood. In statistical
11 can be written as physics, this is reminiscent of optimizing a variational
12 free-energy which is a lower bound of true free-energy
13 Fq (θ) − Fp (θ) = hfq (x, θ) − fp (x, θ)iPx , (176)
due to Gibbs inequality. In Fig. 59, we show pictorially
14 where how EM works. The E-step can be seen as representing
15 the unobserved variable z by a probability distribution
16 fq (x, θ) − fp (x, θ)
X q(z). This probability is used to construct an alterna-
17
= log p(x|θ) − q(z|x) log p(z, x|θ) tive objective function −Fq (θ), which is then maximized
18
z with respect to θ in the M-step. By construction, maxi-
19 X
20 + q(z|x) log q(z|x) mizing the negative variational free-energy is equivalent
21 z to doing ML estimation on the joint data (i.e. both ob-
X X served and unobserved). The name “M-step” is intuitive
22 = q(z|x) log p(x|θ) − q(z|x) log p(z, x|θ)
23 since the parameters θ are found by maximizing −Fq (θ).
z z
24 X The name “E-step” comes from the fact that one usually
25 + q(z|x) log q(z|x) doesn’t need to construct the probability of missing datas
26 z
explicitly, but rather need only compute the “expected"
X p(z, x|θ) X
27 =− q(z|x) log + q(z|x) log p̃(z) sufficient statistics over these data, cf. Fig. 59.
28 z
p(x|θ) z On the practical side, EM has been demonstrated to be
29 X q(z|x) extremely useful in parameter estimation, particularly in
30 = q(z|x) log hidden Markov models and Bayesian networks (see, for
p(z|x, θ)
31 z example, (Barber, 2012; Wainwright et al., 2008)). Some
32 = DKL (q(z|x)kp(z|x, θ)) ≥ 0 of the authors have used EM in biophysics, to design al-
33 gorithms which establish the equivalence of niche theory
34 where we have used Bayes’ theorem p(z|x, θ) =
p(z, x|θ)/p(x|θ). Since the KL-divergence is always pos- and the Minimum Environmental Perturbation Princi-
35 ple (Marsland III et al., 2019). One of the striking ad-
36 itive, this shows that the variational free-energy Fq is
always an upper bound of the true free-energy Fp . In vantages of EM is that it is conceptually simple and easy
37
physics, this result is known as Gibbs’ inequality. to implement (see Notebook 16). In many cases, imple-
38
39 From Eq. (174) and the fact that the the entropy term mentation of EM is guaranteed to increase the likelihood
40 in Eq. (174) does not depend on θ, we can immediately monotonically, which could be a perk during debugging.
41 see that the maximization step (M-step) in Eq. (173) For readers interested in an overview on applications of
42 is equivalent to minimizing the variational free-energy EM, we recommend (Do and Batzoglou, 2008).
43 Fq (θ). Surprisingly, the expectation step (E-step) can Finally for advanced readers familiar with the physics
44 also viewed as the optimization of this variational free- of disordered systems, we note that it is possible to
45 energy. Concretely, one can show that the distribution construct a one-to-one dictionary between EM for la-
46 of hidden variables z given the observed variable x and tent variable models and the MFT of spin systems with
47 the current estimate of parameter θ, Eq. (172), is the quenched disorder. In a disordered spin systems, the
48
unique probability q(z) that minimizes Fq (θ) (now seen Ising couplings J are commonly taken to be quenched
49
as a functional of q). This can be proved by taking the random variables drawn from some underlying probabil-
50
functional derivative P of Eq. (174), plus a Lagrange mul- ity distribution. In the EM procedure, the quenched dis-
51 order is provided by the observed data points x which
52 tiplier that encodes z q(z) = 1, with respect to q(z).
Summing things up, we can re-write EM in the following are drawn from some underlying probability distribution
53 that characterizes the data. The spins s are like the hid-
54 form (Neal and Hinton, 1998):
den or latent variables z. Similar analogues can be found
55 1. Expectation step: Construct the approximating for all the variational MFT quantities (see Table I). This
56 probability distribution of unobserved z given the striking correspondence offers a glimpse into the deep
57
values of observed variable x and parameter esti- connection between statistical mechanics and unsuper-
58
mate θ (t−1) : vised latent variable models – a connection that we will
59
repeatedly exploit to gain more intuition for the energy-
60 qt−1 (z) = arg min Fq (θ (t−1) ) (177)
61 q based unsupervised models considered in the next few
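As a concrete illustration of the E- and M-steps discussed above (and of the monotonic improvement depicted in Fig. 59), the following minimal script, our own sketch rather than the code of Notebook 16, runs EM for a two-component Gaussian mixture in one dimension. The E-step computes the responsibilities q(z|x), the M-step re-estimates the parameters in closed form, and the log-likelihood never decreases.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians; the hidden variable z is the component label.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

w = np.array([0.5, 0.5])          # initial mixture weights
mu = np.array([-1.0, 1.0])        # initial means
var = np.array([1.0, 1.0])        # initial variances

def log_likelihood(x, w, mu, var):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(comp.sum(axis=1)))

prev = -np.inf
for t in range(200):
    # E-step: responsibilities q(z|x) under the current parameters.
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    q = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize <log p(z, x|theta)>_q in closed form.
    Nk = q.sum(axis=0)
    w = Nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    ll = log_likelihood(x, w, mu, var)
    assert ll >= prev - 1e-8      # EM never decreases the log-likelihood
    prev = ll

print(w, mu, var)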
statistical physics Variational EM section starts with a brief overview of generative mod-
3
4 spins/d.o.f.: s hidden/latent variables z els, highlighting the similarities and differences with the
5 couplings /quenched disorder: data observations: x supervised learning methods encountered in earlier sec-
6 J tions. Next, we introduce perhaps the simplest kind of
7 Boltzmann factor e−βE(s,J) Complete probability: generative models – Maximum Entropy (MaxEnt) mod-
8 p(x, z|θ) els. MaxEnt models have no latent (or hidden) vari-
9 partition function: Z(J ) marginal likelihood p(x|θ) ables, making them ideal for introducing the key concepts
10 energy: βE(s, J ) negative log-complete data and tools that underlie energy-based generative models.
11 likelihood: − log p(x, z|θ, m) We then present an extended discussion of how to train
12 free energy: βFp (J |β) negative log-marginal likeli- energy-based models. Much of this discussion will also
13 hood: − log p(x|m) be applicable to more complicated energy-based models
14 variational distribution: q(s) variational distribution: such as Restricted Boltzmann Machines (RBMs) and the
15 q(z|x) deep models discussed in the next section.
16
variational free-energy: variational free-energy: Fq (θ)
17 Fq (J , θ)
18 A. An overview of energy-based generative models
19 TABLE I Analogy between quantities in statistical physics
20 and variational EM. Generative models are a machine learning technique
21 that allows to learn how to generate new examples sim-
22 ilar to those found in a training dataset. The core idea
23 chapters. of most generative models is to learn a parametric model
24
for the probability distribution from which the data was
25
drawn. Once we have learned a model, we can gener-
26 XV. ENERGY BASED MODELS: MAXIMUM ENTROPY ate new examples by sampling from the learned gen-
27 (MAXENT) PRINCIPLE, GENERATIVE MODELS, AND
28 BOLTZMANN LEARNING
erative model (see Fig. 60). As in statistical physics,
29 this sampling is often done using Markov Chain Monte
30 Most of the models discussed in the previous sections Carlo (MCMC) methods. A review of MCMC methods
31 (e.g. linear and logistic regression, ensemble models, and is beyond the scope of this discussion: for a concise and
32 supervised neural networks) are discriminative – they beautiful introduction to MCMC-inspired methods that
33 are designed to perceive differences between groups or bridges both statistical physics and ML the reader is en-
34 categories of data. For example, recognizing differences couraged to consult Chapters 29-32 of David MacKay’s
35 between images of cats and images of dogs allows a dis- book (MacKay, 2003) as well as the review by Michael
36
criminative model to label an image as “cat” or “dog”. I. Jordan and collaborators (Andrieu et al., 2003).
37
Discriminative models form the core techniques of most The added complexity of learning models directly
38
supervised learning methods. However, discriminative from samples introduces many of the same fundamental
39
methods have several limitations. First, like all super- tensions we encountered when discussing discriminative
40 models. The ability to generate new examples requires
41 vised learning methods, they require labeled data. Sec-
ond, there are tasks that discriminative approaches sim- models to be able to “generalize” beyond the examples
42 they have been trained on, that is to generate new sam-
43 ply cannot accomplish, such as drawing new examples
from an unknown probability distribution. A model that ples that are not samples of the training set. The models
44
can learn to represent and sample from a probability dis- must be expressive enough to capture the complex cor-
45
46 tribution is called generative. For example, a genera- relations present in the underlying data distribution, but
47 tive model for images would learn to draw new examples the amount of data we have is finite which can give rise
48 of cats and dogs given a dataset of images of cats and to overfitting.
49 dogs. Similarly, given samples generated from one phase In practice, most generative models that are used in
50 of an Ising model we may want to generate new sam- machine learning are flexible enough that, with a suffi-
51 ples from that phase. Such tasks are clearly beyond the cient number of parameters, they can approximate any
52 scope of discriminative models like the ensemble models probability distribution. For this reason, there are three
53 and DNNs discussed so far in the review. Instead, we axes on which we can differentiate classes of generative
54 must turn to a new class of machine learning methods. models:
55
The goal of this section is to introduce the reader to • The first axis is how easy the model is to train –
56
energy-based generative models. As we will see, energy- both in terms of computational time and the com-
57
58 based models are closely related to the kinds of models plexity of writing code for the algorithm.
59 commonly encountered in statistical physics. We will
60 draw upon many techniques that have their origin in • The second axis is how well the model generalizes
61 statistical mechanics (e.g. Monte-Carlo methods). The from the training set to the test set.
• The third axis is which characteristics of the data
3
4 distribution the model is capable of and focuses on
5 capturing.
6
7
8 All generative models must balance these competing re-
9 quirements and generative models differ in the tradeoffs
10 they choose. Simpler models capture less structure about
11 the underlying distributions but are often easier to train.
12
More complicated models can capture this structure but
13
may overfit to the training data.
14
15 One of the fundamental reasons that energy-based FIG. 60 Examples of handwritten digits (“reconstructions”)
16 models have been less widely-employed than their dis- generated using various energy-based models using the pow-
17 criminative counterparts is that the training procedure erful Paysage package for unsupervised learning. Examples
18 for these models differs significantly from those for su- from top to bottom are: the original MNIST database, an
19 pervised neural networks models. Though both employ RBM with Gaussian units which is equivalent to a Hopfield
20 Model, a Restricted Boltzmann Machine (RBM), a RBM with
gradient-descent based procedures for minimizing a cost an L1 penalty for regularization, and a Deep Boltzmann Ma-
21
function (one common choice for generative models is chine (DBM) with 3 layers. All models have 200 hidden units.
22
23 the negative log-likelihood function), energy-based mod- See Sec. XVI and corresponding notebook for details
24 els do not use backpropagation (see Sec. IX.C) and au-
25 tomatic differentiation for computing gradients. Rather,
26 one must turn to ideas inspired by MCMC based meth- B. Maximum entropy models: the simplest energy-based
27 ods in physics and statistics that sometimes go under generative models
28 the name “Boltzmann Learning” (discussed below). As a
29 result, training energy-based models requires additional Maximum Entropy (MaxEnt) models are one of the
30 tools that are not immediately available in packages such simplest classes of energy-based generative models. Max-
31 as PyTorch and TensorFlow. Ent models have their origin in a series of beautiful pa-
32 pers by Jaynes that reformulated statistical mechanics in
33 The open-source package – Paysage – that is built on information theoretic terms (Jaynes, 1957a,b). Recently,
34 top of PyTorch bridges this gap by providing the toolset the flood of new, large scale datasets has resulted in a
35 for training energy-based models (Paysage is maintained resurgence of interest in MaxEnt models in many fields
36 by Unlearn.AI – a company affiliated with two of the au- including physics (especially biological physics), compu-
37 thors (CKF and PM)). Paysage makes it easy to quickly tational neuroscience, and ecology (Elith et al., 2011;
38 code and deploy energy-based models such as Restricted Schneidman et al., 2006; Weigt et al., 2009). MaxEnt
39 Boltzmann Machines (RBMs) and Stacked RBMs – a models are often presented as the class of generative mod-
40 “deep” unsupervised model. The package includes un- els that make the least assumptions about the underlying
41 published training methods that significantly improve data. However, as we have tried to emphasize throughout
42 the training performance, can be applied with various the review, all ML and statistical models require assump-
43 datatypes, and can be employed on GPUs. We make tions, and MaxEnt models are no different. Overlooking
44 use of this package extensively in the next two sections
45 this can sometimes lead to misleading conclusions, and
and the accompanying Python notebooks. For example, it is important to be cognizant of these implicit assump-
46 Fig. 60 (and the accompanying Notebook 17) show how
47 tions (Aitchison et al., 2016; Schwab et al., 2014).
the Paysage package can be used to quickly code and
48
train a variety of energy-based models on the MNIST
49
50 handwritten digit dataset.
1. MaxEnt models in statistical mechanics
51 Finally, we note that generative models at their most
52 basic level are complex parametrizations of the probabil- MaxEnt models were introduced by E. T. Jaynes in a
53 ity distribution the data is drawn from. For this reason, two-part paper in 1957 entitled “Information theory and
54 generative models can do much more than just generate statistical mechanics” (Jaynes, 1957a,b). In these incred-
55
new examples. They can be used to perform a multi- ible papers, Jaynes showed that it was possible to re-
56
tude of other tasks that require sampling from a complex derive the Boltzmann distribution (and the idea of gen-
57
58 probability distribution including “de-noising”, filling in eralized ensembles) entirely from information theoretic
59 missing data, and even discrimination (Hinton, 2012). arguments. Quoting from the abstract, Jaynes consid-
60 The versatility of generative models is one of the major ered “statistical mechanics as a form of statistical infer-
61 appeals of these unsupervised learning methods. ence rather than as a physical theory” (portending the
62
63
64
65
82
1
2
close connection between statistical physics and machine The general form of the maximum entropy distribution
3
4 learning). Jaynes showed that the Boltzmann distribu- is then given by
5 tion could be viewed as resulting from a statistical in-
1 Pi λi fi (x)
6 ference procedure for learning probability distributions p(x) = e (181)
describing physical systems where one only has partial Z
7
information about the system (usually the average en- R P
8 where Z(λi ) = dx e i λi fi (x) is the partition function.
9 ergy). The maximum entropy distribution is clearly just
10 The key quantity in MaxEnt models is the informa- thePusual Boltzmann distribution with energy E(x) =
11 tion theoretic, or Shannon, entropy, a concept introduced − i λi fi (x). The values of the Lagrange multipliers are
12 by Shannon in his landmark treatise on information the- chosen to match the observed averages for the set of func-
13 ory (Shannon, 1949). The Shannon entropy quantifies tions {fi (x)} whose average value is being fixed:
14 the statistical uncertainty one has about the value of a Z
15 random variable x drawn from a probability distribution ∂ log Z
hfi imodel = dxp(x)fi (x) = = hfi iobs . (182)
16 p(x). The Shannon entropy of the distribution is defined ∂λi
17 as
18 In other words, the parameters of the distribution can be
19 Sp = −Trx p(x) log p(x) (179) chosen such that
20
where the trace is a sum/integral over all possible val-
21 ∂λi log Z = hfi idata . (183)
ues a variable can take. Jaynes showed that the Boltz-
22
23 mann distribution follows from the Principle of Maxi- To gain more intuition for the MaxEnt distribution, it
24 mum Entropy. A physical system should be described is helpful to relate the Lagrange multipliers to the famil-
25 by the probability distribution with the largest entropy iar thermodynamic quantities we use to describe physical
26 subject to certain constraints (often provided by measur- systems (Jaynes, 1957a). Our x denotes the microscopic
27 ing the average value of conserved, extensive quantities state of the system, i.e. the MaxEnt distribution is a
28 such as the energy, particle number, etc.) The princi- probability distribution over microscopic states. How-
29 ple uniquely specifies a procedure for parametrizing the ever, in thermodynamics we only have access to average
30 functional form of the probability distribution. Once we quantities. If we know only the average energy hE(x)iobs ,
31 have specified and learned this form we can, of course, the MaxEnt procedure tells us to maximize the entropy
32 generate new examples by sampling this distribution. subject to the average energy constraint. This yields
33 Let us illustrate how this works in more detail. Sup-
34 pose that we have chosen a set of functions {fi (x)} whose 1 −βE(x)
p(x) = e , (184)
35 average value we want to fix to some observed values Z
36 hfi iobs . The Principle of Maximum Entropy states that
37 where we have identified the Lagrange multiplier conju-
we should choose the distribution p(x) with the largest
38 gate to the energy λ1 = −β = 1/kB T with the (negative)
uncertainty (i.e. largest Shannon entropy Sp ), subject to
39 inverse temperature. Now, suppose we also constrain the
the constraints that the model averages match the ob-
40 particle number hN (x)iobs . Then, an almost identical
served averages:
41 Z calculation yields a MaxEnt distribution of the functional
42
hfi imodel = dxfi (x)p(x) = hfi iobs . (180) form
43
1 −β(E(x)−µN (x))
44
We can formulate the Principle of Maximum Entropy p(x) = e , (185)
45 Z
as an optimization problem using the method of Lagrange
46 where we have rewritten our Lagrange multipliers in
47 multipliers by minimizing:
the familiar thermodynamic notation λ1 = −β and
48 X  Z 
L[p] = −Sp + λi hfi iobs − dxfi (x)p(x) λ2 = µ/β. Since this is just the Boltzmann distribu-
49
i tion, we can also relate the partition function in our
50  Z  MaxEnt model to the thermodynamic free-energy via
51 +γ 1− dxp(x) ,
52 F = −β −1 log Z. The choice of which quantities to
53 constrain is equivalent to working in different thermo-
where the first set of constraints enforce the requirement dynamic ensembles.
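As a small numerical illustration of this procedure (a sketch of our own, with made-up energies and assuming SciPy is available), one can fix the average of a single function, here the energy E(x) over a toy discrete state space, and solve for the conjugate Lagrange multiplier so that the model average matches the observed one, recovering a Boltzmann-type distribution.

import numpy as np
from scipy.optimize import brentq

E = np.linspace(0.0, 3.0, 10)   # ten states with fixed "energies" E(x)
target = 1.0                    # the observed average <E>_obs we want to reproduce

def avg_E(beta):
    """Model average <E> under p(x) proportional to exp(-beta * E(x))."""
    w = np.exp(-beta * E)
    p = w / w.sum()
    return p @ E

# Choose the Lagrange multiplier (inverse temperature) so that <E>_model = <E>_obs.
beta = brentq(lambda b: avg_E(b) - target, -50, 50)
p = np.exp(-beta * E); p /= p.sum()
print(beta, p @ E)              # the model average should equal the target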
54
for the averages and the last constraint enforces the nor-
55
malization that the trace over the probability distribu-
56
57 tion equals one. We can solve for p(x) by taking the 2. From statistical mechanics to machine learning
58 functional derivative and setting it to zero
59 δL X The MaxEnt idea also provides a general procedure
60 0= = (log p(x)) + 1) − λi fi (x) − γ. for learning a generative model from data. The key dif-
δp
61 i ference between MaxEnt models in (theoretical) physics
and ML is that in ML we have no direct access to ob- consider two special cases where x has different support
3
4 served values hfi iobs . Instead, these averages must be (different kinds of data). First, consider the case that the
5 directly estimated from data (samples). To denote this random variables x ∈ Rn are real numbers. In this case
6 difference, we will call empirical averages calculated from we can compute the partition function directly:
7 data as hfi idata . We can think of MaxEnt as a statisti- Z p
8 cal inference procedure simply by replacing hfi iobs by T 1 T 1 T −1
Z = dx ea x+ 2 x Jx = (2π)n detJ −1 e− 2 a J a .
9 hfi idata above.
10 This subtle change has important implications for (187)
11 training MaxEnt models. First, since we do not know
12 these averages exactly, but must estimate them from the The resulting probability density function is,
13 data, our training procedures must be careful not to over-
p(x) = Z −1 e−E(x)
14 fit to the observations (our samples might not be reflec-
15 tive of the true values of these statistics). Second, the av- 1 1 T −1 T 1 T
=p e 2 a J a+a x+ 2 x Jx
16 erages of certain functions fi are easier to estimate from (2π) detJ
n −1
17 limited data than others. This is often an important con- 1 1 T −1
18 sideration when formulating which MaxEnt model to fit =p e− 2 (x−µ) Σ (x−µ) , (188)
19 (2π) detΣ
n
to the data. Finally, we note that unlike in physics where
20
conservation laws often suggest the functions fi whose where µ = −J −1 a and Σ = −J −1 . This, of course, is the
21
averages we hold fix, ML offers no comparable guide for normalized, multi-dimensional Gaussian distribution.
22
23 how to choose the fi we care about. For these reasons, Second, consider the case that the random variable x
24 choosing the {fi } is often far from straightforward. As a is binary with xi ∈ {−1, +1}. The energy function takes
25 final point, we note that here we have presented a physics- the same form as Eq. (186), but the partition function
26 based perspective for justifying the MaxEnt procedure. can no longer be computed in a closed form. This model
27 We mention in passing that the MaxEnt in ML is also is known as the Ising model in the physics literature, and
28 closely related to ideas from Bayesian inference (Jaynes, is often called a Markov Random Field in the machine
29 1968, 2003) and this latter point of view is more com- learning literature. It is well known to physicists that
30 mon in discussions of MaxEnt in the statistics and ML calculating the partition function for the Ising Model is
31 literature. intractable. For this reason, the best we can do is esti-
32 mate it using numerical techniques such MCMC methods
33 or approximate methods like variational MFT methods,
34 3. Generalized Ising Models from MaxEnt
see Sec. XIV. Finally, we note that in ML it is common to
35
use binary variables which take on values in xi ∈ {0, 1}
36 The form of a MaxEnt model is completely specified
rather than {±1}. This can sometimes be a source of con-
37 once we choose the averages {fi } we wish to constrain.
fusion when translating between ML and physics litera-
38 One common choice often used in MaxEnt modeling is to
tures and can lead to confusion when using ML packages
39 constrain the first two moments of a distribution. When
for physics problems.
40 our random variables x are continuous, the corresponding
41 MaxEnt distribution is a multi-dimensional Gaussian. If
42 the x are binary (discrete), then the corresponding Max-
43 C. Cost functions for training energy-based models
Ent distribution is a generalized Ising (Potts) model with
44 all-to-all couplings.
45 The MaxEnt procedure gives us a way of parametrizing
To see this, consider a random variable x with first
46 an energy-based generative model. For any energy-based
and second moments hxi idata and hxi xj idata , respectively.
47 generative model, the energy function E(x, {θi }) depends
According to the Principle of Maximum Entropy, we
48 on some parameters θi – couplings in the language of
should choose to model this variable using a Boltzmann
49 statistical physics – that must be inferred directly from
distribution with constraints on the first and second mo-
50 the data. For example, for the MaxEnt models the {θi }
ments. Let ai be the Lagrange multiplier associated with
51 are just the Lagrange multipliers {λi } introduced in the
52 hxi idata and Jij /2 be the Lagrange multiplier associated
last section. The goal of the training procedure is to use
53 with hxi xj idata . Using Eq. (182), it is easy to verify that
the available training data to fit these parameters.
54 the energy function
Like in many other ML techniques, we will fit these
55 X 1X couplings by minimizing a cost function using stochastic
56 E(x) = − a i xi − Jij xi xj (186)
2 ij gradient descent (cf. Sec. IV). Such a procedure naturally
57 i
58 separates into two parts: choosing an appropriate cost
59 satisfies the above constraints. function, and calculating the gradient of the cost func-
60 Partition functions for maximum entropy models are tion with respect to the model parameters. Formulating
61 often intractable to compute. Therefore, it is helpful to a cost function for generative models is a little bit trickier
than for supervised, discriminative models. The objec- ing the log-likelihood:
3
4 tive of discriminative models is straightforward – predict
the label from the features. However, what we mean by L({θi }) = hlog (pθ (x))idata
5
6 a “good” generative model is much harder to define using = −hE(x; {θi })idata − log Z({θi }), (189)
7 a cost function. We would like the model to generate
where we have set β = 1. In writing this expression we
8 examples similar to those we find in the training dataset.
made use of two facts: (i) our generative distribution is
9 However, we would also like the model to be able to gen-
of the Boltzmann form, and (ii) the partition function
10 eralize – we do not want the model to reproduce “spurious
does not depend on the data:
11 details” that are particular to the training dataset. Un-
12 like for discriminative models, there is no straightforward hlog Z({θi })idata = log Z({θi }). (190)
13 idea like cross-validation on the data labels that neatly
14 addresses this issue. For this reason, formulating cost
15 functions for generative models is subtle and represents 2. Regularization
16 an important and interesting open area of research.
17 Just as for discriminative models like linear and logistic
18 Calculating the gradients of energy-based models also
turns out to be different than for discriminative mod- regression, it is common to supplement the log-likelihood
19
els, such as deep neural networks. Rather than relying with additional regularization terms (see Secs. VI and
20
21 on automatic differentiation techniques and backpropa- VII). Instead of minimizing the negative log-likelihood,
22 gation (see Sec. IX.C), calculating the gradient requires one minimizes a cost function of the form
23 drawing on intuitions from MCMC-based methods. Be-
low, we provide an in-depth discussion of Boltzmann − L({θi }) + Ereg ({θi }), (191)
24
25 learning for energy-based generative models, focusing on where Ereg ({θi }) is an additional regularization term that
26 MaxEnt models. We put the emphasis on training pro- prevents overfitting. From a Bayesian perspective, this
27 cedures that generalize to more complicated generative new term can be viewed as encoding a (negative) log-
28 models with latent variables such as RBMs discussed in prior on model parameters and performing a maximum-
29 the next section. Therefore, we largely ignore the in- a-posteriori (MAP) estimate instead of a MLE (see cor-
30 credibly rich physics-based literature on fitting Ising-like
31 responding discussion in Sec. VI).
MaxEnt models (see the recent reviews (Baldassi et al., As we saw by studying linear regression, different forms
32 2018; Nguyen et al., 2017) and references therein).
33 of regularization give rise to different kinds of properties.
34 A common choice for the regularization function are the
35 sums of the L1 or L2 norms of the parameters
36 X
37 1. Maximum likelihood Ereg ({θi }) = Λ |θi |α , α = 1, 2 (192)
38 i
39 By far the most common approach used for training a
40 with Λ controlling the regularization strength. For Λ = 0,
generative model is to maximize the log-likelihood of the there is no regularization and we are simply performing
41
training data set. Recall, that the log-likelihood char- MLE. In contrast, a choice of large Λ will force many
42
acterizes the log-probability of generating the observed parameters to be close to or exactly zero. Just as in
43
44 data using our generative model. By choosing the nega- regression, an L1 penalty enforces sparsity, with many of
45 tive log-likelihood as the cost function, the learning pro- the θi set to zero, and L2 regularization shrinks the size
46 cedure tries to find parameters that maximize the proba- of the parameters towards zero.
47 bility of the data. This cost function is intuitive and has One challenge of generative models is that it is often
48 been the work-horse of most generative modeling. How- difficult to choose the regularization strength Λ. Recall
49 ever, we note that the Maximum Likelihood estimation that, for linear and logistic regression, Λ is chosen to
50 (MLE) procedure has some important limitations that maximize the out-of-sample performance on a validation
51 we will return to in Sec. XVII. dataset (i.e. cross-validation). However, for generative
52 In what follows, we employ a general notation that is models our data are usually unlabeled. Therefore, choos-
53 applicable to all energy-based models, not just the Max- ing a regularization strength is more subtle and there ex-
54 Ent models introduced above. The reason for this is that ists no universal procedure for choosing Λ. One common
55
much of this discussion does not rely on the specific form strategy is to divide the data into a training set and a val-
56
of the energy function but only on the fact that our gen- idation set and monitor a summary statistic such as the
57
58 erative model takes a Boltzmann form. We denote the log-likelihood, energy distance (Székely, 2003), or varia-
59 generative model by the probability distribution pθ (x) tional free-energy of the generative model on the training
60 and its corresponding partition function by log Z({θi }). and validation sets (the variational free-energy was dis-
61 In MLE, the parameters of the model are fit by maximiz- cussed extensively in Sec. XIV ) (Hinton, 2012). If the
gap between the training and validation datasets starts To use SGD, we must still calculate the expectation
3
4 growing, one is probably overfitting the model even if values that appear in Eq. (195). The positive phase of the
5 the log-likelihood of the training dataset is still increas- gradient – the expectation values with respect to the data
6 ing. This also gives a procedure for “early stopping” – – can be easily calculated using samples from the training
7 a regularization procedure we introduced in the context dataset. However, the negative phase – the expectation
8 of discriminative models. In practice, when using such values with respect to the model – is generally much more
9 regularizers it is important to try many different values difficult to compute. We will see that in almost all cases,
10 of Λ and then try to use a proxy statistic for overfitting we will have to resort to either numerical or approximate
11 to evaluate the optimal choice of Λ. methods. The fundamental reason for this is that it is
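For concreteness, a regularizer of the form of Eq. (192) and its contribution to the gradient can be implemented in a few lines. This is a sketch with our own function name; for the L1 case we use the subgradient sign(θ).

import numpy as np

def reg_penalty_and_grad(theta, lam, alpha=2):
    """E_reg = lam * sum_i |theta_i|^alpha and its (sub)gradient, for alpha = 1 or 2."""
    if alpha == 2:
        return lam * np.sum(theta ** 2), 2 * lam * theta
    return lam * np.sum(np.abs(theta)), lam * np.sign(theta)

# During training one minimizes -L(theta) + E_reg(theta), so the regularizer's
# gradient is simply added to the gradient of the negative log-likelihood:
theta = np.array([0.5, -1.2, 0.0])
penalty, grad = reg_penalty_and_grad(theta, lam=0.01, alpha=1)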
12 impossible to calculate the partition function exactly for
13 most interesting models in both physics and ML.
14 D. Computing gradients
15 There are exceptional cases in which we can calcu-
16 We still need to specify a procedure for minimizing the late expectation values analytically. When this hap-
17 cost function. One powerful and common choice that pens, the generative model is said to have a Tractable
18 is widely employed when training energy-based models Likelihood. One example of a generative model with a
19 is stochastic gradient descent (SGD) (see Sec. IV). Per- Tractable Likelihood is the Gaussian MaxEnt model for
20 forming MLE using SGD requires calculating the gradi- real valued data discussed in Eq. (188). The param-
21 ent of the log-likelihood Eq. (189) with respect to the eters/Lagrange multipliers for this model are the local
22 parameters θi . To simplify notation and gain intuition, fields a and the pairwise coupling matrix J. In this case,
23 it is helpful to define “operators” Oi (x), conjugate to the the usual manipulations involving Gaussian integrals al-
24 parameters θi low us to exactly find the parameters µ = −J −1 a and
25 Σ = −J −1 , yielding the familiar expressions µ = hxidata
26 ∂E(x; θi ) and Σ = h(x − hxidata )(x − hxidata )T idata . These are
Oi (x) = . (193)
27 ∂θi the standard estimates for the sample mean and covari-
28 ance matrix. Converting back to the Lagrange multipli-
29 Since the partition function is just the cumulant gener-
ers yields
30 ating function for the Boltzmann distribution, we know
31 that the usual statistical mechanics relationships between
32 expectation values and derivatives of the log-partition J = −h(x − hxidata )(x − hxidata )T i−1 (196)
data .
33 function hold:
34
∂ log Z({θi })
35 hOi (x)imodel = Trx pθ (x)Oi (x) = − . (194) Returning to the generic case where most energy-based
36 ∂θi
models have intractable likelihoods, we must estimate ex-
37 In terms of the operators {Oi (x)}, the gradient of pectation values numerically. One way to do this is draw
38 Eq. (189) takes the form (Ackley et al., 1987) samples Smodel = {x0i } from the model pθ (x) and evalu-
39
40 ∂L({θi }) D ∂E(x; θi ) E ∂ log Z({θi }) ate arbitrary expectation values using these samples:
− = +
41 ∂θi ∂θi data ∂θi
Z X
42 = hOi (x)idata − hOi (x)imodel . (195)
43 hf (x)imodel = dxpθ (x)f (x) ≈ f (x0i ). (197)
44 These equations have a simple and beautiful interpre- x0i ∈Smodel
45 tation. The gradient of the log-likelihood with respect to
46 a model parameter is a difference of moments – one calcu-
47 lated directly from the data and one calculated from our The samples from the model x0i ∈ Smodel are often re-
48 model using the current model parameters. The data- ferred to as fantasy particles in the ML literature and
49 dependent term is known as the positive phase of the can be generated using simple MCMC algorithms such
50 gradient and the model-dependent term is known as the as Metropolis-Hasting which are covered in most modern
51 negative phase of the gradient. This derivation also gives statistical physics classes. However, if the reader is unfa-
52 an intuitive explanation for likelihood-based training pro- miliar with MCMC methods or wants a quick refresher,
53 cedures. The gradient acts on the model to lower the en- we recommend the concise and beautiful discussion of
54 ergy of configurations that are near observed data points MCMC methods from both the physics and ML point-
55
while raising the energy of configurations that are far of-view in Chapters 29-32 of David MacKay’s masterful
56
from observed data points. Finally, we note that all infor- book (MacKay, 2003).
57
58 mation about the data only enters the training procedure Finally, we note that once we have the fantasy particles
59 through the expectations hOi (x)idata and our generative from the model, we can also easily calculate the gradient
60 model is blind to information beyond what is contained of any expectation value hf (x)imodel using what is com-
61 in these expectations. monly called the “log-derivative trick” in ML (Fu, 2006;
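To see Eq. (195) in action on a model whose partition function can still be enumerated exactly, the following toy script (our own illustration, not the notebook code) fits the fields and couplings of a small binary MaxEnt model of the form of Eq. (186) by following the difference between data and model moments.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 5                                        # small enough to enumerate all 2^N states
states = np.array(list(product([-1, 1], repeat=N)), dtype=float)

# "Data": samples from a teacher model with random fields and couplings.
a_true = rng.normal(0, 0.5, N)
J_true = rng.normal(0, 0.5, (N, N)); J_true = (J_true + J_true.T) / 2
np.fill_diagonal(J_true, 0)
E_true = -states @ a_true - 0.5 * np.einsum('si,ij,sj->s', states, J_true, states)
p_true = np.exp(-E_true); p_true /= p_true.sum()
data = states[rng.choice(len(states), size=5000, p=p_true)]

# Data moments (positive phase) for the operators O = {x_i, x_i x_j}.
m_data = data.mean(axis=0)
C_data = data.T @ data / len(data)

a, J = np.zeros(N), np.zeros((N, N))
eta = 0.1
for step in range(2000):
    E = -states @ a - 0.5 * np.einsum('si,ij,sj->s', states, J, states)
    p = np.exp(-E); p /= p.sum()             # exact model distribution
    m_model = p @ states                     # negative phase, <x_i>_model
    C_model = states.T @ (states * p[:, None])
    a += eta * (m_data - m_model)            # ascend the log-likelihood, Eq. (195)
    J += eta * (C_data - C_model); np.fill_diagonal(J, 0)

print(np.round(a - a_true, 2))               # should be small, limited by sampling noise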
Kleijnen and Rubinstein, 1996): XVI. DEEP GENERATIVE MODELS: HIDDEN VARIABLES
3
Z AND RESTRICTED BOLTZMANN MACHINES (RBMS)
4 ∂ ∂pθ (x)
5 hf (x)imodel = dx f (x)
∂θi ∂θi The last section introduced many of the core ideas be-
6 D ∂ log p (x) E
7 θ hind energy-based generative models. Here, we extend
= f (x)
8 ∂θi model this discussion to energy-based models that include la-
9 = hOi (x)f (x)imodel tent or hidden variables.
10 X Including latent variables in generative models greatly
≈ Oi (xj )f (x0j ). (198)
11 enhances their expressive power – allowing the model to
x0j ∈Smodel
12 represent sophisticated correlations between visible fea-
13 This expression allows us to take gradients of more com- tures without sacrificing trainability. By having multiple
14 plex cost functions beyond the MLE procedure discussed layers of latent variables, we can even construct powerful
15 here. deep generative models that possess many of the same
16 desirable properties as deep, discriminative neural net-
17 works.
18 E. Summary of the training procedure We begin with a discussion that tries to provide a sim-
19 ple intuition for why latent variables are such a pow-
20 We now summarize the discussion above and present a erful tool for generative models. Next, we introduce a
21 general procedure for training an energy based model us- powerful class of latent variable models called Restricted
22
ing SGD on the cost function (see Sec. IV). Our goal is to Boltzmann Machines (RBMs) and discuss techniques for
23
fit the parameters of a model pλ ({θi }) = Z −1 e−E(x,{θi }) . training these models. After that, we introduce Deep
24
25 Training the model involves the following steps: Boltzmann Machines (DBMs), which have multiple layers
26 of latent variables. We then introduce the new Paysage
1. Read a minibatch of data, {x}.
27 package for training energy-based models and demon-
28 2. Generate fantasy particles {x0 } ∼ pλ using an strate how to use it on the MNIST dataset and sam-
29 MCMC algorithm (e.g., Metropolis-Hastings). ples from the Ising model. We conclude by discussing
30 recent physics literature related to energy-based genera-
31 3. Compute the gradient of log-likelihood using these tive models.
32 samples and Eq. (195), where the averages are
33 taken over the minibatch of data and the fantasy
34 particles from the model, respectively. A. Why hidden (latent) variables?
35
36 4. Use the gradient as input to one of the gradient
based optimizers discussed in section Sec. IV. Latent or hidden variables are a powerful yet elegant
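A bare-bones version of steps 1-3 above for the pairwise MaxEnt model of Eq. (186), with the negative phase estimated from fantasy particles generated by single-spin-flip Metropolis updates, might look as follows. This is a plain NumPy sketch with our own function names, not the Paysage-based code used in the notebooks.

import numpy as np

rng = np.random.default_rng(2)

def metropolis_fantasy(a, J, n_samples=200, n_sweeps=50):
    """Fantasy particles x' ~ p(x) proportional to exp(-E(x)) for the pairwise model
    E(x) = -a.x - 0.5*x.J.x with x_i = +/-1, via single-spin-flip Metropolis."""
    N = len(a)
    chains = rng.choice([-1.0, 1.0], size=(n_samples, N))
    for _ in range(n_sweeps):
        for i in range(N):
            dE = 2 * chains[:, i] * (a[i] + chains @ J[i])   # energy change if spin i flips
            accept = rng.random(n_samples) < np.exp(np.minimum(-dE, 0.0))
            chains[accept, i] *= -1
    return chains

def grad_step(batch, a, J, eta=0.05):
    """One SGD step on the log-likelihood, Eq. (195): positive minus negative phase."""
    fantasy = metropolis_fantasy(a, J)
    da = batch.mean(axis=0) - fantasy.mean(axis=0)
    dJ = batch.T @ batch / len(batch) - fantasy.T @ fantasy / len(fantasy)
    a = a + eta * da
    J = J + eta * dJ
    np.fill_diagonal(J, 0)
    return a, J

# Toy usage: a random +/-1 "dataset" and a few parameter updates.
data = rng.choice([-1.0, 1.0], size=(500, 8))
a, J = np.zeros(8), np.zeros((8, 8))
for _ in range(10):
    a, J = grad_step(data, a, J)

In a full training loop one would feed successive minibatches to grad_step and, as discussed above, monitor summary statistics on a validation set to detect overfitting.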
37
way to encode sophisticated correlations between observ-
38
In practice, it is helpful to supplement this basic proce- able features. The underlying reason for this is that
39
40 dure with some tricks that help training. As with dis- marginalizing over a subset of variables – “integrating
41 criminative neural networks, it is important to initial- out” degrees of freedom in the language of physics – in-
42 ize the parameters properly and print summary statistics duces complex interactions between the remaining vari-
43 during the training procedure on the training and vali- ables. The idea that integrating out variables can lead
44 dation sets to prevent overfitting. These and many other to complex correlations is a familiar component of many
45 “cheap tricks” have been nicely summarized in a short physical theories. For example, when considering free
46 note from the Hinton group (Hinton, 2012). electrons living on a lattice, integrating out phonons gives
47 A major computational and practical limitation of rise to higher-order electron-electron interactions (e.g. su-
48 these methods is that it is often hard to draw samples perconducting or magnetic correlations). More generally,
49 from generative models. MCMC methods often have long in the Wilsonian renormalization group paradigm, all ef-
50 mixing-times (the time one has to run the Markov chain fective field theories can be thought of as arising from
51 to get uncorrelated samples) and this can result in bi- integrating out high-energy degrees of freedom (Wilson
52 ased sampling. Luckily, we often do not need to know and Kogut, 1974).
53 the gradients exactly for training ML models (recall that Generative models with latent variables run this logic
54 noisy gradient estimates often help the convergence of in reverse – encode complex interactions between visible
55
gradient descent algorithms), and we can significantly re- variables by introducing additional, hidden variables that
56
duce the computational expense by running MCMC for interact with visible degrees of freedom in a simple man-
57
58 a reasonable time window. We will exploit this observa- ner, yet still reproduce the complex correlations between
59 tion extensively in the next section when we discuss how visible degrees in the data once marginalized over (in-
60 to train more complex energy-based models with hidden tegrated out). This allows us to encode complex higher-
61 variables. order interactions between the visible variables using sim-
62
63
64
65
87
1
2
pler interactions at the cost of introducing new latent
3 Hidden Layer
4 variables/degrees of freedom. This trick is also widely bμ(hμ)
5 exploited in physics (e.g. in the Hubbard-Stratonovich
transformation (Hubbard, 1959; Stratonovich, 1957) or Interactions
6 Wiμvihμ
7 the introduction of ghost fields in gauge theory (Faddeev
and Popov, 1967)).
8
9 To make these ideas more concrete, let us revisit the
Visible Layer ai(vi)
10 pairwise Ising model introduced in the discussion of Max-
11 Ent models, see Eq. (186). The model is described by a
FIG. 61 A Restricted Boltzmann Machine (RBM) consists of
12 Boltzmann distribution with energy
13 visible units vi and hidden units hµ that interact with each
X other through interactions of the form Wiµ vi hµ . Importantly,
14 1X
E(v) = − ai vi − vi Jij vj , (199) there are no interactions between visible units themselves or
15 2 ij hidden units themselves.
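A quick numerical check (our own sketch, with randomly chosen parameters) confirms that writing J_ij = Σ_µ W_iµ W_jµ turns the pairwise energy of Eq. (199) into the generalized Hopfield form of Eq. (200) discussed below.

import numpy as np

rng = np.random.default_rng(3)
N, M = 6, 3
W = rng.normal(size=(N, M))          # columns play the role of the singular vectors W_imu
a = rng.normal(size=N)
J = W @ W.T                          # pairwise couplings J_ij = sum_mu W_imu W_jmu

def E_pairwise(v):                   # Eq. (199)
    return -a @ v - 0.5 * v @ J @ v

def E_hopfield(v):                   # Eq. (200)
    return -a @ v - 0.5 * np.sum((W.T @ v) ** 2)

v = rng.choice([-1.0, 1.0], size=N)
assert np.isclose(E_pairwise(v), E_hopfield(v))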
i
16
17
where Jij is a symmetric coupling matrix that encodes
18 where E(v, h) is a joint energy functional of both the
the pairwise constraints and ai enforce the single-variable
19 latent and visible variables of the form
20 constraint.
X 1X 2 X
21 Our goal is to replace the complicated interactions be- E(v, h) = − ai vi + h − vi Wiµ hµ . (202)
22 tween the visible variables vi encoded by Jij , by interac- i
2 µ µ iµ
23 tions with a new set of latent variables hµ . In order to
24 do this, it is helpful to rewrite the coupling matrix in a We can also use the energy function E(v, h) to define a
25 slightly different form. Using SVD, we canP always express new energy-based model p(v, h) on both the latent and
the coupling matrix in the form Jij =
N visible variables
µ=1 Wiµ Wjµ ,
26
27 where {Wiµ }i are appropriately normalized singular vec- e−E(v,h)
28 tors. In terms of Wiµ , the energy takes the form p(v, h) = . (203)
Z0
29
30 X 1X Marginalizing over latent variables of course gives us back
31 EHop (v) = − ai vi − vi Wiµ Wjµ vj . (200) the generalized Hopfield model (Barra et al., 2012)
i
2 ijµ
32 Z
33 e−EHop (v)
p(v) = dhp(v, h) = . (204)
34 We note that in the special case when both vi ∈ Z
35 {−1, +1} and Wiµ ∈ {−1, +1} are binary variables, a
Notice that E(v, h) contains no direct interactions be-
36 model with this form of the energy function is known as
tween visible degrees of freedom (or between hidden de-
37 the Hopfield model (Amit et al., 1985; Hopfield, 1982).
gree of freedom). Instead, the complex correlations be-
38 The Hopfield model has played an extremely important
tween the vi are encoded in the interaction between the
39 role in statistical physics, computational neuroscience,
visible vi and latent variables hµ . It turns out that the
40 and machine learning, and a full discussion of its prop-
model presented here is a special case of a more general
41 erties is well beyond the scope of this review [see (Amit,
class of powerful energy-based models called Restricted
42 1992) for a beautiful discussion that combines all these
43 Boltzmann Machines (RBMs).
perspectives]. Therefore, here we refer to all energy func-
44 tions of the form Eq. (200) as (generalized) Hopfield mod-
45 els, even for the case when the Wiµ are continuous vari-
46 B. Restricted Boltzmann Machines (RBMs)
ables.
47
48 We now “decouple” the visible variables vi by intro- A Restricted Boltzmann Machine (RBM) is an energy-
ducing a set of normally distributed continuous latent based model with both visible and hidden units where the
50 variables hµ (in condensed matter language we perform a visible and hidden units interact with each other but do
51 Hubbard-Stratonovich transformation). Using the usual not interact among themselves. The energy function of
52 identity for Gaussian integrals, we can rewrite the Boltz- an RBM takes the general functional form
mann distribution for the generalized Hopfield model as

    p(v) = \frac{e^{\sum_i a_i v_i + \frac{1}{2}\sum_{ij\mu} v_i W_{i\mu} W_{j\mu} v_j}}{Z}
         = \frac{e^{\sum_i a_i v_i}\,\prod_\mu \int dh_\mu\, e^{-\frac{1}{2} h_\mu^2 + \sum_i v_i W_{i\mu} h_\mu}}{Z}
         = \frac{\int dh\, e^{-E(v,h)}}{Z} .    (201)

    E(v,h) = -\sum_i a_i(v_i) - \sum_\mu b_\mu(h_\mu) - \sum_{i\mu} W_{i\mu} v_i h_\mu ,    (205)

where a_i(\cdot) and b_\mu(\cdot) are functions that we are free to choose. The most common choice is:

    a_i(v_i) = \begin{cases} a_i v_i, & \text{if } v_i \in \{0,1\} \text{ is binary} \\ \frac{v_i^2}{2\sigma_i^2}, & \text{if } v_i \in \mathbb{R} \text{ is continuous,} \end{cases}
and

    b_\mu(h_\mu) = \begin{cases} b_\mu h_\mu, & \text{if } h_\mu \in \{0,1\} \text{ is binary} \\ \frac{h_\mu^2}{2\sigma_\mu^2}, & \text{if } h_\mu \in \mathbb{R} \text{ is continuous.} \end{cases}

Combining these equations,

    E(v) = -\log \int dh\, e^{-E(v,h)}
         = -\sum_i a_i(v_i) - \sum_\mu \log \int dh_\mu\, e^{\,b_\mu(h_\mu) + \sum_i v_i W_{i\mu} h_\mu}

For this choice of a_i(\cdot) and b_\mu(\cdot), layers consisting of dis-
9 crete binary units are often called Bernoulli layers, and
layers consisting of continuous variables are often called Gaussian layers. The basic bipartite structure of an RBM – i.e., a visible and hidden layer that interact with each other but not among themselves – is often depicted using a graph of the form shown in Fig. 61.
An RBM can have different properties depending on whether the hidden and visible layers are taken to be Bernoulli or Gaussian. The most common choice is to have both the visible and hidden units be Bernoulli. This is what is typically meant by an RBM. However, other combinations are also possible and used in the ML literature. When all the units are continuous, the RBM

To understand what correlations are captured by p(v) it is helpful to introduce the distribution

    q_\mu(h_\mu) = \frac{e^{b_\mu(h_\mu)}}{Z}    (208)

of hidden units h_\mu, ignoring the interactions between v and h, and the cumulant generating function

    K_\mu(t) = \log \int dh_\mu\, q_\mu(h_\mu)\, e^{t h_\mu} = \sum_n \kappa_\mu^{(n)} \frac{t^n}{n!} .    (209)

K_\mu(t) is defined such that the nth cumulant is \kappa_\mu^{(n)} = \partial_t^n K_\mu|_{t=0}.
22 reduces to a multi-dimensional Gaussian with a very par- The cumulant generating function appears in the
marginal free-energy of the visible units, which can be rewritten (up to a constant term) as:

    E(v) = -\sum_i a_i(v_i) - \sum_\mu K_\mu\!\left(\sum_i W_{i\mu} v_i\right)
         = -\sum_i a_i(v_i) - \sum_\mu \sum_n \kappa_\mu^{(n)} \frac{\left(\sum_i W_{i\mu} v_i\right)^n}{n!}
         = -\sum_i a_i(v_i) - \sum_\mu \kappa_\mu^{(1)} \sum_i W_{i\mu} v_i - \frac{1}{2}\sum_{ij}\left(\sum_\mu \kappa_\mu^{(2)} W_{i\mu} W_{j\mu}\right) v_i v_j + \dots    (210)

ticular correlation structure. When the hidden units are continuous and the visible units are discrete, the RBM is equivalent to a generalized Hopfield model (see discussion above). When the visible units are continuous and the hidden units are discrete, the RBM is often called a Gaussian Bernoulli Restricted Boltzmann Machine (Dahl et al., 2010; Hinton and Salakhutdinov, 2006). It is even possible to perform multi-modal learning with a mixture of continuous and discrete variables. For all these architectures, the important point is that all interactions occur only between the visible and hidden units and there are no interactions between units within the hidden or visible layers, see Fig. 61. This is analogous to Quantum Electrodynamics, where a free fermion and a free photon
38 interact with one another but not among themselves. We see that the marginal energy includes all orders of in-
39 Specifying a generative model with this bipartite inter- teractions between the visible units, with the n-th order
40 action structure has two major advantages: (i) it enables cumulants of qµ (hµ ) weighting the n-th order interac-
41 capturing both pairwise and higher-order correlations be- tions between the visible units. In the case of the Hop-
42 tween the visible units and (ii) it makes it easier to sample field model we discussed previously, qµ (hµ ) is a standard
43
from the model using an MCMC method known as block (1)
Gaussian distribution where the mean is κµ = 0, the
44
Gibbs sampling, which in turn makes the model easier to (2)
variance is κµ = 1, and all higher-order cumulants are
45
46 train. zero. Plugging these cumulants into Eq. (210) recovers
47 Before discussing training, it is worth better under- Eq. (202).
48 standing the kind of correlations that can be captured These calculations make clear the underlying reason
49 using an RBM. To do so, we can marginalize over the hid- for the incredible representational power of RBMs with
50 den units and ask about the resulting distribution over a Bernoulli hidden layer. Each hidden unit can encode in-
teractions of arbitrarily high order. By combining many different hidden units, we can encode very complex interactions at all orders. Moreover, we can learn which order of correlations/interactions are important directly from the data instead of having to specify them ahead of time as we did in the MaxEnt models. This highlights the power of generative models with even the simplest interactions between visible and latent variables to encode, learn, and represent complex correlations present in the data.

just the visible units

    p(v) = \int dh\, p(v,h) = \int dh\, \frac{e^{-E(v,h)}}{Z}    (206)

where the integral should be replaced by a trace in all expressions for discrete units.
We can also define a marginal energy using the expression

    p(v) = \frac{e^{-E(v)}}{Z} .    (207)
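To make the claim that a single Bernoulli hidden unit generates interactions at every order concrete, here is a short worked example of our own, using only the definitions in Eqs. (208)–(210). For h_\mu \in \{0,1\},

    K_\mu(t) = \log\frac{1+e^{b_\mu + t}}{1+e^{b_\mu}}, \qquad
    \kappa_\mu^{(1)} = \sigma(b_\mu), \quad
    \kappa_\mu^{(2)} = \sigma(b_\mu)\,[1-\sigma(b_\mu)], \quad
    \kappa_\mu^{(3)} = \sigma(b_\mu)\,[1-\sigma(b_\mu)]\,[1-2\sigma(b_\mu)], \ \dots

with \sigma(z) = 1/(1+e^{-z}). Because the Bernoulli cumulants do not vanish at any order, every order of interaction between the visible units appears in Eq. (210), in contrast to the Gaussian hidden units of the Hopfield model, for which the series truncates at second order.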
C. Training RBMs A Alternating Gibbs Sampling
t=0 t=1 t=2 t = οο
4
5 RBMs are a special class of energy-based generative
6 models, which can be trained using the Maximum Like-
7 lihood Estimation (MLE) procedure described in detail
data
8 in Sec. XV. To briefly recap, first, we must choose a cost
9 function – for MLE this is just the negative log-likelihood B Contrastive Divergence (CD-n)
10 with or without an additional regularization term to pre- t=0 t=1 t=2 t=n
11 vent overfitting. We then minimize this cost function us-
12 ing one of the Stochastic Gradient Descent (SGD) meth-
13 ods described in Sec. IV.
14 The gradient itself can be calculated using Eq. (195). data
15 For example, for the Bernoulli-Bernoulli RBM in
16 Eq. (205) we have C Persistent Contrastive Divergence (PCD-n)
17 t=0 t=1 t=2 t=n
18 ∂L({Wiµ , ai , bµ })
= hvi hµ idata − hvi hµ imodel
19 ∂Wiµ
20 ∂L({Wiµ , ai , bµ }) fantasy particles
21 = hvi idata − hvi imodel from last SGD step
22 ∂ai
23 ∂L({Wiµ , ai , bµ })
= hhµ idata − hhµ imodel , (211) FIG. 62 (Top) To draw fantasy particles (samples from the
24 ∂bµ model) we can perform alternating (block) Gibbs sampling
25 between the visible and hidden layers starting with a sam-
26 where the positive expectation with respect to the data
ple from the data using the marginal distributions p(h|v)
is understood to mean sampling from the model while
27 and p(v|h). The “time” t corresponds to the time in the
28 clamping the visible units to their observed values in the Markov chain for the Monte Carlo and measures the num-
29 data. As before, calculating the negative phase of the ber of passes between the visible and hidden states. (Middle)
30 gradient (i.e. the expectation value with respect to the In Contrastive Divergence (CD), we approximately sample
31 model) requires that we draw samples from the model. the model by terminating the Gibbs sampling after n steps
32 Luckily, the bipartite form of the interactions in RBMs (CD-n) starting from the data. (C) In Persistent Contrastive
was specifically chosen with this in mind. Divergence (PCD), instead of restarting the Gibbs sampler
from the data, we initialize the sampler with the fantasy par-
34
ticles calculated from the model at the last SGD step.
35
36 1. Gibbs sampling and contrastive divergence (CD)
37
gradient descent is a minibatch of observed data. For
38 The bipartite interaction structure of an RBM makes it
each sample in the minibatch, we simply clamp the visi-
39 possible to calculate expectation values using a Markov
ble units to the observed values and apply Eq. (213) using
40 Chain Monte Carlo (MCMC) method known as Gibbs
the probability for the hidden variables. We then average
41 sampling. The key reason for this is that since there are
42 over all samples in the minibatch to calculate expectation
no interactions of visible units with themselves or hidden
43 values with respect to the data. To calculate expectation
units with themselves, the visible and hidden units of an
44 values with respect to the model, we use (block) Gibbs
RBM are conditionally independent:
45 sampling. The idea behind (block) Gibbs sampling is
46 Y to iteratively sample from the conditional distributions
p(v|h) = p(vi |h)
47 ht+1 ∼ p(h|vt ) and vt+1 ∼ p(v|ht+1 ) (see Figure 62,
i
48 Y top). Since the units are conditionally independent, each
49 p(h|v) = p(hµ |v), (212) step of this iteration can be performed by simply draw-
50 µ ing random numbers. The samples are guaranteed to
51 converge to the equilibrium distribution of the model in
52 with
X the limit that t → ∞. At the end of the Gibbs sampling
53 procedure, one ends up with a minibatch of samples (fan-
p(vi = 1|h) = σ(ai + Wiµ hµ ) (213)
54 tasy particles).
µ
55 X One drawback of Gibbs sampling is that it may take
56 p(hµ = 1|v) = σ(bµ + Wiµ vi )
many back and forth iterations to draw an independent
57 i
58 sample. For this reason, the Hinton group introduced
59 and where σ(z) = 1/(1 + e−z ) is the sigmoid function. an approximate Gibbs sampling technique called Con-
60 Using these expressions it is easy to compute expec- trastive Divergence (CD) (Hinton, 2002; Hinton et al.,
61 tation values with respect to the data. The input to 2006). In CD-n, we just perform n iterations of (block)
Gibbs sampling, with n often taken to be as small as 1 Deep Boltzmann Layerwise Fine-tuning with
3
4 (see Figure 62)! The price for this truncation is, of course, Machine (DBM) Pretraining PCD on full DBM

5 that we are not drawing samples from the true model dis-
6 tribution. But for our purpose – using the expectations
7 to estimate the gradient for SGD – CD-n has proven to
8 work reasonably well. As long as the approximate gra-
9 dients are reasonably correlated with the true gradient,
SGD will move in a reasonable direction. Still, CD-n
comes at a price. Truncating the Gibbs sampler pre-
12 vents sampling far away from the starting point, which
13 for CD-n are the data points in the minibatch. Therefore,
14 our generative model will be much more accurate around
regions of feature space close to our training data. Thus, FIG. 63 Deep Boltzmann Machines contain multiple hidden
16 as is often the case in ML, CD-n sacrifices the ability layers. To train deep networks, first we perform layerwise
17 to generalize to some extent in order to make the model training where each two layers are treated as a RBM. This
18 easier to train. can be followed by fine-tuning using gradient descent and per-
19
Some of these undesirable features can be tempered sistent contrastive divergence (PCD).
20
by using a slightly different variant of CD called Persis-
21
tent Contrastive Divergence (PCD) (Tieleman and Hin-
22
ton, 2009). In PCD, rather than restarting the Gibbs σ = 0.01 (Hinton, 2012). An alternative initial-
23 ization scheme proposed by Glorot and Bengio in-
24 sampler from the data at each gradient descent step, we
start the Gibbs sampling at the fantasy particles in the stead chooses the standard deviation
√ to scale with
25 the size of the layers: σ = 2/ Nv + Nh where Nv
26 last gradient descent step (see Fig. 62). Since parameters
change slowly compared to the Gibbs sampling, samples and Nh are number of visible and hidden units re-
27
that are high probability at one step of the SGD are also spectively (Glorot and Bengio, 2010). The bias of
28
29 likely to be high probability at the next step. This en- the hidden units is initialized to zero while the bias
30 sures that PCD does not introduce large errors in the of the visible units is typically taken to be inversely
31 estimation of the gradients. The advantage of using fan- proportional to the mean activation, ai = hvi i−1 data .
32 tasy particles to initialize the Gibbs sampler is to allow
• Regularization.—One can of course use an L1 or
33 PCD to explore parts of the feature space that are much
34 L2 penalty, typically only on the weight parame-
further from the training dataset than one could reach
35 ters, not the biases. Alternatively, Dropout has
with ordinary CD.
36 been shown to decrease overfitting when training
We note that, in applications using RBMs as a vari- with CD and PCD, which results in more inter-
37
ational ansatz for quantum states, Gibbs sampling is pretable learned features.
38
39 not necessarily the best option for training, and in prac-
40 tice parallel tempering or other Metropolis schemes can • Learning Rates.—Typically, it is helpful to re-
41 outperform Gibbs sampling. In fact, Gibbs sampling is duce the learning rate in later stages of training.
42 not even feasible with complex-valued weights required
for quantum wavefunctions, whereas Metropolis schemes • Updates for CD and PCD.—There are several
44 might be feasible (Carleo, 2018). computational tricks one can use for speeding up
45 the alternating updates in CD and PCD, see Sec-
46 tion 3 in (Hinton, 2012).
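To make the training recipe discussed in this section concrete, the following is a minimal NumPy sketch of our own (it is not the Paysage code used in the accompanying notebooks) for a Bernoulli–Bernoulli RBM: Glorot-style weight initialization, block Gibbs sampling via Eq. (213), and a CD-1 gradient step following Eq. (211). Variable names, the learning rate, and the use of the log-odds of the mean activation for the visible bias (one common variant of the prescription quoted above) are our own assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def init_rbm(n_v, n_h, v_data):
        """Glorot-style weights, zero hidden bias, visible bias from mean activations."""
        W = rng.normal(0.0, 2.0 / np.sqrt(n_v + n_h), size=(n_v, n_h))
        p = np.clip(v_data.mean(axis=0), 1e-3, 1 - 1e-3)
        a = np.log(p / (1 - p))           # log-odds of <v_i>_data
        b = np.zeros(n_h)
        return W, a, b

    def sample_h(v, W, b):
        p = sigmoid(b + v @ W)            # p(h_mu = 1 | v), Eq. (213)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(h, W, a):
        p = sigmoid(a + h @ W.T)          # p(v_i = 1 | h), Eq. (213)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1_step(v_data, W, a, b, lr=0.05):
        """One CD-1 update: clamp to the data for the positive phase, one Gibbs sweep for the negative phase."""
        ph_data, h = sample_h(v_data, W, b)
        _, v_model = sample_v(h, W, a)
        ph_model, _ = sample_h(v_model, W, b)
        W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / len(v_data)
        a += lr * (v_data - v_model).mean(axis=0)
        b += lr * (ph_data - ph_model).mean(axis=0)
        return W, a, b

    # toy usage on random binary "data"
    data = (rng.random((100, 16)) < 0.3).astype(float)
    W, a, b = init_rbm(16, 8, data)
    for minibatch in np.split(data, 10):
        W, a, b = cd1_step(minibatch, W, a, b)

Persistent Contrastive Divergence would simply reuse v_model from the previous call as the starting point of the Gibbs sweep instead of restarting from the data.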
47 2. Practical Considerations
48
49 The previous section gave an overview of how to train D. Deep Boltzmann Machine
50 RBMs. However, there are many “tricks of the trade”
51 that are missing from this discussion. Luckily, a succinct In this section, we introduce Deep Boltzmann Ma-
52 summary of these has been compiled by Geoff Hinton and chines (DBMs). Unlike RBMs, DBMs possess multi-
53 ple hidden layers and were the first models rebranded
published as a note that readers interested in training
54 as “deep learning” (Hinton et al., 2006; Hinton and
RBMs are urged to consult (Hinton, 2012).
55
For completeness, we briefly list some of the important Salakhutdinov, 2006) 18 . Many of the advantages that
56
57 points here:
58
59 • Initialization.—The model must be initialized.
60 Hinton suggests taking the weights Wiµ from a 18 Technically, these were Deep Belief Networks where only the top
Gaussian with mean zero and standard deviation layer was undirected
are thought to stem from having deep layers were al-
3
4 ready discussed in Sec. XI in the context of discrimina-
5 tive DNNs. Here, we revisit many of the same themes
6 with emphasis on energy-based models.
7 An RBM is composed of two layers of neurons that
8 are connected via an undirected graph, see Fig. 61. As
9 a result, it is possible to perform sampling v ∼ p(v|h)
10 and inference h ∼ p(h|v) with the same model. As with
11 the Hopfield model, we can view each of the hidden units
12 as representative of a pattern, or feature, that could be
13 present in the data. 19 The inference step involves assign-
14 ing a probability to each of these features that expresses FIG. 64 Fantasy particles (samples) generated using the in-
15 the degree to which each feature is present in a given dicated model trained on the MNIST dataset. Samples were
16 generated by running (alternating) layerwise Gibbs sampling
data sample. In an RBM, hidden units do not influence for 100 steps. This allows the final sample to be very far away
17 each other during the inference step, i.e. hidden units are
18 from the starting point in our feature space. Notice that the
conditionally independent given the visible units. There generated samples look much less like hand-written recon-
19
are a number of reasons why this is unsatisfactory. One structions than in Fig. 60 which uses a single max-probability
20 iteration of the Gibbs sampler, indicating that training is
reason is the desire for sparse distributed representations,
21 much less effective when exploring regions of probability space
22 where each observed visible vector will strongly activate
a few (i.e. more than one but only a very small frac- faraway from the training data. In the Sec. XVII, we will ar-
23 gue that this is likely a generic feature of Likelihood-based
24 tion) of the hidden units. In the brain, this is thought training.
25 to be achieved by inhibitory lateral connections between
26 neurons. However, adding lateral intra-layer connections
27 between the hidden units makes the distribution difficult samples as an input to the next RBM (consisting of the
28 to sample from, so we need to come up with another way first and second hidden layer – purple hexagons and green
29 of creating connections between the hidden units. squares in Fig. 63). This procedure can then be repeated
30 With the Hopfield model, we saw that pairwise linear to pretrain all layers of the DBM.
31 connections between neurons can be mediated through This pretraining initializes the weights so that SGD
32 another layer. Therefore, a simple way to allow for ef- can be used effectively when the network is trained in a
33 fective connections between the hidden units is to add supervised fashion. In particular, the pretraining helps
34 another layer of hidden units. Rather than just having
35 the gradients to stay well behaved rather than vanish
two layers, one visible and one hidden, we can add addi- or blow up – a problem that we discussed extensively
36 tional layers of latent variables to account for the corre-
37 in the earlier sections on DNNs. It is worth noting that
lations between hidden units. Ideally, as one adds more once pretrained, we can use the usual Boltzmann learning
38
and more layers, one might hope that the correlations rules in Eq. (195) to fine-tune the weights and improve
39
40 between hidden variables become smaller and smaller the performance of the DBM (Hinton et al., 2006; Hinton
41 deeper into the network. This basic logic is reminiscent of and Salakhutdinov, 2006). As we demonstrate in the
42 renormalization procedures that seek to decorrelate lay- next section, the Paysage package presented here can be
43 ers at each step (Li and Wang, 2018; Mehta and Schwab, used to both construct and train DBMs using such a
44 2014; Vidal, 2007). The price of adding additional layers pretraining procedure.
45 is that the models become harder to train.
46 Training DBMs is more subtle than RBMs due to the
47 difficulty of propagating information from visible to hid- E. Generative models in practice: examples
48 den units. However, Hinton and collaborators realized
49 that some of these problems could be alleviated via a lay- 1. MNIST
erwise procedure. Rather than attempting to train
51 the whole DBM at once, we can think of the DBM as a First, we apply the open source package Paysage
52 stack of RBMs (see Fig. 63). One first trains the bottom (French for landscape) for training unsupervised energy-
53 two layers of the DBM – treating it as if it is a stand- based models on the MNIST dataset.
54 alone RBM. Once this bottom RBM is trained, we can In Notebook 17, we explicitly demonstrate how to
55
generate “samples” from the hidden layer and use these build and train four different kinds of models: (i) a
56
“Hopfield” type RBM with Gaussian hidden units and
57
58 Bernoulli (binary) visible units, (ii) a conventional RBM
59 where both the visible and hidden units are Bernoulli,
60
19 In general, one should instead think of activity patterns of hidden (iii) a conventional RBM with an additional L1-penalty
units representing features in the data. that enforces sparsity, and (iv) a Deep Boltzmann Ma-
chine (DBM) with three Bernoulli layers with L1 penalty
3
4 each. We refer the reader to the Notebook for the details
5 of the code. In the following, we show and briefly discuss
6 the results.
7 After training the model, we compute reconstructions
8 and fantasy particles from the validation data set. Recall
9 that a reconstruction v0 of a given data point x is com-
10 puted in two steps: (i) we fix the visible layer v = x to
11 be the data, and use MCMC sampling to find the state
12 of the hidden layer h which maximizes the probability
13 distribution p(h|v). (ii) fixing the same obtained state
14 h, we find the reconstruction v0 of the original data point
15 which maximizes the probability p(v0 |h). In the case of
16 a DBM, the forward pass continues until we reach the
17 last of the hidden layers, and the backward pass goes in
18 reverse. Figure 60 shows the result.
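The reconstruction step described above can be written compactly. The following self-contained sketch is ours (the Paysage calls used in Notebook 17 differ) and performs the "deterministic" v → h → v′ pass for a trained Bernoulli RBM with weights W and biases a, b; this is also the operation used below to de-noise corrupted inputs. The variable names and shapes are assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def reconstruct(v, W, a, b):
        """Max-probability reconstruction v -> h -> v' for a Bernoulli-Bernoulli RBM."""
        h = (sigmoid(b + v @ W) > 0.5).astype(float)           # most probable hidden state given v
        v_prime = (sigmoid(a + h @ W.T) > 0.5).astype(float)   # most probable visible state given h
        return v_prime

    # example with random parameters; in practice W, a, b come from a trained model
    rng = np.random.default_rng(1)
    W = 0.1 * rng.standard_normal((784, 200)); a = np.zeros(784); b = np.zeros(200)
    noisy = (rng.random((5, 784)) < 0.5).astype(float)         # e.g. corrupted MNIST digits
    denoised = reconstruct(noisy, W, a, b)

For a DBM, the same upward pass is repeated through each hidden layer before the downward pass is applied in reverse.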
19 FIG. 65 Images from MNIST were randomly corrupted by
We also used MCMC to draw samples from the learned
20 adding noise. These noisy images were used as inputs to the
21 probability distributions, the so-called fantasy particles.
visible layer of the generative model. The denoised images
To this end, we did layer-wise Gibbs sampling for a total
22 are obtained by a single “deterministic” (max probability) it-
23 of a fixed number of equilibration steps. The result is eration v → h → v0 .
24 shown in Figure 64.
25 Finally, one can use generative models to reduce the
26 noise in images (de-noising). We randomly flipped a frac- We define a Deep Boltzmann machine with two hidden
27 tion of the black&white bits in the validation data, and layers of Nhidden and Nhidden /10 units, respectively, and
28 use the models defined above to reconstruct (de-noise) apply L1 regularization to all weights. As in the MNIST
29 the digit images. Figure 65 shows the result. problem above, we use layer-wise pre-training, and de-
30 The full Paysage code used to generate Figs. 60, 64 ploy Persistent Contrastive Divergence to train the DBM
31 and 65 is available in Notebook 17. The pack- using ADAM.
32 age was developed by one of the authors (CKF) One of the lessons from this problem is that this task is
33 along with his colleagues at Unlearn.AI and makes computationally intensive, see Notebook 17. The train-
34 it easy to build, train, and deploy energy-based ing time on present-day laptops easily exceeds that of pre-
35 generative models with different architectures. vious studies from this review. Thus, we encourage the
36
Paysage’s documentation is available on GitHub under interested reader to try GPU-based training and study
37
https://github.com/drckf/paysage/tree/master/docs. the resulting speed-up.
38
39 Figures 66, 67 and 68 show the results of the numerical
40 experiment at T /J = 1.75, 2.25, 2.75 respectively, for a
41 2. Example: 2D Ising Model DBM with Nhidden = 800. Looking at the reconstructions
42 and the fantasy particles, we see that our DBM works
43 We can also analyze the 2D Ising data set. In previous well in the disordered and critical regions. However, the
44 sections, we used our knowledge of the critical point at chosen layer architecture is not optimal for T = 1.75 in
45 Tc /J ≈ 2.26 (see Onsager’s solution) to label the spin the ordered phase, presumably due to effects related to
46 configurations and study the problem of classifying the symmetry breaking.
47 states according to their phase of matter. However, in
48 more complicated models, where the precise position of
49 Tc is not known, one cannot label the states with such F. Generative models in physics
50 an accuracy, if at all.
51 As we explained, generative models can be used to Generative models have been studied and used ex-
52 learn a variational approximation for the probability dis- tensively in the context of physics. For instance, in
53 tribution that generated the data points. By using only Biophysics, dynamic Boltzmann distributions have been
54 the 2D spin configurations, we now attempt to train a used as effective models in chemical kinetics (Ernst et al.,
55
Bernoulli RBM, the fantasy particles of which are ther- 2018). In Statistical Physics, they were used to identify
56
mal Ising configurations. Unlike in previous studies of criticality in the Ising model (Morningstar and Melko,
57
58 the Ising dataset, here we perform the analysis at a fixed 2017). In parallel, tools from Statistical Physics have
59 temperature T . We can then apply our model at three been applied to analyze the learning ability of RBMs (De-
60 different values T = 1.75, 2.25, 2.75 in the ordered, near- celle et al., 2018; Huang, 2017b), characterizing the spar-
61 critical and disordered regions, respectively. sity of the weights, the effective temperature, the non-
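Before feeding Ising configurations to a Bernoulli RBM, the spins s_i = ±1 must be mapped to binary units v_i ∈ {0, 1} and grouped by temperature. A minimal sketch of this preprocessing, ours rather than the Notebook 17 setup (the sample counts, lattice size, and temperature grid below are assumptions):

    import numpy as np

    def spins_to_binary(spins):
        """Map Ising spins {-1,+1} to Bernoulli visible units {0,1}."""
        return ((spins + 1) // 2).astype(float)

    # stand-in for the Monte Carlo data: (n_samples, L*L) arrays of +/-1 spins and their temperatures
    L = 40
    rng = np.random.default_rng(2)
    configs = rng.choice([-1, 1], size=(300, L * L))
    temps = rng.choice([1.75, 2.25, 2.75], size=300)

    datasets = {T: spins_to_binary(configs[temps == T]) for T in (1.75, 2.25, 2.75)}
    for T, data in datasets.items():
        print(T, data.shape)   # one training set per temperature, as in the text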
17 FIG. 66 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the ordered
phase of the 2D Ising data set at T/J = 1.75. We used two hidden layers of 1000 and 100 units, respectively.
35 FIG. 67 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the critical
regime of the 2D Ising data set at T/J = 2.25. We used two hidden layers of 1000 and 100 units, respectively.
39 linearities in the activation functions of hidden units, tions in the study of quantum systems too. Most no-
40 and the adaptation of fields maintaining the activity in tably, RBM-inspired variational ansatzes were used to
41 the visible layer (Tubiana and Monasson, 2017). Spin learn both complex-valued wavefunctions and the real-
42 glass theory motivated a deterministic framework for valued probability distribution associated with the abso-
43 the training, evaluation, and use of RBMs (Tramel lute square of a quantum state (Carleo et al., 2018; Car-
44 et al., 2017); it was demonstrated that the training pro- leo and Troyer, 2017; Freitas et al., 2018; Nomura et al.,
45 cess in RBMs itself exhibits phase transitions (Barra 2017; Torlai et al., 2018) and, in this context, RBMs are
46 et al., 2016, 2017); learning in RBMs was studied in the sometimes called Born machines (Cheng et al., 2017), in-
47
context of equilibrium (Cossu et al., 2018; Funai and cluding quantum state tomorgraphy (Carrasquilla et al.,
48
Giataganas, 2018) and nonequilibrium (Salazar, 2017) 2018; Torlai et al., 2018; Torlai and Melko, 2017). Fur-
49
50 thermodynamics, and spectral dynamics (Decelle et al., ther applications include the detection of order in low-
51 2017); mean-field theory found application in analyzing energy product states (Rao et al., 2017), and learning
52 DBMs (Huang, 2017a). Another interesting direction of Einstein-Podolsky-Rosen correlations on an RBM (Wein-
53 research is the use of generative models to improve Monte stein, 2017). Inspired by the success of tensor networks in
54 Carlo algorithms (Cristoforetti et al., 2017; Nagai et al., physics, the latter have been used as a basis for both gen-
55 2017; Tanaka and Tomiya, 2017b; Wang, 2017). Ideas erative and discriminative learning (Huggins et al., 2019):
56 from quantum mechanics have been put forward to in- RBMs (Chen et al., 2018) were used to extract the spa-
57 troduce improved speed-up in certain parts of the learn- tial geometry from entanglement (You et al., 2017), and
58 ing algorithms for Helmholtz machines (Benedetti et al., generative models based on matrix product states have
59 2016, 2017). been developed (Han et al., 2017). Last but not least,
60 Quantum entanglement was studied using RBM-encoded
61 At the same time, generative models have applica-
17 FIG. 68 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the disordered
phase of the 2D Ising data set at T/J = 2.75. We used two hidden layers of 1000 and 100 units, respectively.
21 states (Deng et al., 2017) and tensor product based gen- come some of these limitations, simultaneously highlight-
22 erative models have been used to understand MNIST and ing both the power of GANs and some of the difficulties.
23 other ML datasets (Stoudenmire and Schwab, 2016). We then show how VAEs integrate the variational meth-
24 ods introduced in Sec. XIV with deep, differentiable neu-
25 ral networks to build more powerful generative models
26 XVII. VARIATIONAL AUTOENCODERS (VAES) AND that move beyond the Expectation Maximization (EM).
27 GENERATIVE ADVERSARIAL NETWORKS (GANS)
We then briefly discuss VAEs from an information theo-
28
retic perspective, before discussing practical tips for im-
29 In the previous two sections, we considered energy-
plementing and training VAEs. We conclude by using
30 based generative models. Here, we extend our discus-
VAEs on examples using the Ising and MNIST datasets
31 sion to two new generative model frameworks that have
32 (see also Notebooks 19 and 20).
gained wide appeal in the the last few years: generative
33 adversarial networks (GANs) (Goodfellow, 2016; Good-
34 fellow et al., 2014; Radford et al., 2015) and variational A. The limitations of maximizing Likelihood
35 autoencoders (VAEs) (Kingma and Welling, 2013). Un-
36
like energy-based models, both these generative modeling The Kullback-Leibler (KL)-divergence plays a central
37
frameworks are based on differentiable neural networks role in many generative models. Developing an intuition
38
39 and consequently can be trained using backpropagation- about KL-divergences is one of the keys to understanding
40 based methods. VAEs, in particular, can be easily im- why adversarial learning has proved to be such a powerful
41 plemented and trained using high-level packages such as method for generative modeling. Here, we revisit the KL-
42 Keras making them an easy-to-deploy generative frame- divergence with an eye towards understanding GANs and
43 work. These models also differ from the energy-based motivate adversarial learning. The KL-divergence mea-
44 models in that they do not directly seek to maximize like- sures the similarity between two probability distributions
45 lihood. GANs, for example, employ a novel cost function p(x) and q(x). Strictly speaking, the KL divergence is
46 based on adversarial learning (a concept we motivate and not a metric because it is not symmetric and does not
47 explain below). Finally we note that VAEs and GANs are satisfy the triangle inequality.
48 already starting to make their way into physics (Heimel Given two distributions, there are two distinct KL-
49 et al., 2018; Liu et al., 2017; Rocchetto et al., 2018; Wet- divergences we can construct:
50 zel, 2017) and astronomy (Ravanbakhsh et al., 2017), and Z
p(x)
51 methods from physics may prove useful for furthering DKL (p||q) = dxp(x) log (214)
52 q(x)
our understanding of these methods (Alemi and Abbara, Z
53 2017). More generally, GANs have found important ap- q(x)
54 DKL (q||p) = dxq(x) log . (215)
plications in many artistic and image manipulation tasks p(x)
55
(see references in (Goodfellow, 2016)). A related quantity called the Jensen-Shannon divergence,
    D_{JS}(p, q) = \frac{1}{2}\left[ D_{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D_{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right) \right]

The section is organized as follows. We start by motivating adversarial learning by discussing the limitations of maximum likelihood based approaches. We then give
59
60 a high-level introduction to the main idea behind gen- does satisfy all of the properties of a squared metric (i.e.,
61 erative adversarial networks and discuss how they over- the square root of the Jensen-Shannon divergence is a
metric). An important property of the KL-divergence that we will make use of repeatedly is its positivity: D_{KL}(p||q) \geq 0, with equality if and only if p(x) = q(x) almost everywhere.

FIG. 69 KL-divergences between the data distribution p_data and the model p_\theta. Data is drawn from a bimodal Gaussian distribution with unit variances peaked at \pm\Delta with \Delta = 2.0, and the model p_\theta(x) is a Gaussian with mean zero and the same variance as p_data(x). (Top) p_data and p_\theta for \Delta = 2. (Bottom) D_{KL}(p_data||p_\theta) (Data-Model) and D_{KL}(p_\theta||p_data) (Model-Data) as a function of \Delta. Notice that D_{KL}(p_data||p_\theta) is insensitive to placing weight in the model distribution in regions where p_data \approx 0 whereas D_{KL}(p_\theta||p_data) punishes this harshly.

FIG. 70 KL-divergences between the data distribution p_data and the model p_\theta. Data is drawn from a Gaussian mixture of the form p_data = 0.25 N(-\Delta) + 0.25 N(\Delta) + 0.5 N(0), where N(a) is a normal distribution with unit variance centered at x = a. p_\theta(x) is a Gaussian with \sigma^2 = 2. (Top) p_data and p_\theta for \Delta = 5. (Middle) p_data and p_\theta for \Delta = 1. (Bottom) D_{KL}(p_data||p_\theta) [Data-Model] and D_{KL}(p_\theta||p_data) [Model-Data] as a function of \Delta. Notice that D_{KL}(p_\theta||p_data) is insensitive to placing weight in the model distribution in regions where p_\theta \approx 0 whereas D_{KL}(p_data||p_\theta) punishes this harshly.
52 tion pθ (x) and the data distribution pdata (x). We of
53 course would like these models to be as similar as possi-
54 ble. However, as we discuss below, there are many sub-
55
tleties about how we measure similarities that can have
56
large consequences for the behavior of training proce-
57
58 dures. Maximizing the log-likelihood of the data under
59 the model is the same as minimizing the KL divergence
60 between the data distribution and the model distribu-
61 tion DKL (pdata ||pθ ). To see this, we can rewrite the KL
divergence as:

    D_{KL}(p_data||p_\theta) = \int dx\, p_data(x) \log p_data(x) - \int dx\, p_data(x) \log p_\theta(x)
                             = -S[p_data] - \langle \log p_\theta(x) \rangle_{data}    (216)

Rearranging this equation, we have

    \langle \log p_\theta(v) \rangle_{data} = -S[p_data] - D_{KL}(p_data||p_\theta)    (217)
14 The equivalence follows from the positivity of KL-
15 divergence and the fact that the entropy of the data
16
distribution is constant. In contrast, the original formulation of GANs minimizes an upper bound on the Jensen-Shannon divergence between the model distribution p_\theta(x) and the data distribution p_data(x) (Goodfellow et al., 2014).
This difference in objectives underlies the difference in behavior between GANs and likelihood-based generative models. To see this, we can compare the behavior of the two KL-divergences D_{KL}(p_data||p_\theta) and D_{KL}(p_\theta||p_data). As is evident from Fig. 69 and Fig. 70, though both of
27 these KL-divergences measure similarities between the
28 two distributions, they are sensitive to very different
29 things. DKL (pθ ||pdata ) is insensitive to setting pθ ≈ 0 FIG. 71 A GAN consists of two differentiable functions (usu-
30 even when pdata 6= 0 whereas DKL (pdata ||pθ ) punishes ally represented as deep neural networks): a generator func-
31 this harshly. In contrast, DKL (pdata ||pθ ) is insensitive tion G(z; θG ) that takes as an input a z sampled from some
32 to placing weight in the model distribution in regions prior on the latent space and outputs a point x. The generator
33 function (neural network) has parameters θG . The discrim-
where pdata ≈ 0 whereas DKL (pθ ||pdata ) punishes this
34 inator function D(x; θD ) discriminates between x from the
harshly. In other words, DKL (pdata ||pθ ) prefers models data and samples from the model: x = G(z; θG ). The two net-
35 that have a high probability in regions with lots of train-
36 works are trained by “playing a game” where the discriminator
ing data points whereas DKL (pθ ||pdata ) punishes models is trained to distinguish between synthetic and real examples
37
for putting high probability where there is no data. while the generator is trained to try to fool the discriminator.
38 Importantly, the cost function for the discriminator depends
39 In the context of the above discussion, this suggests
that the way likelihood-based methods are most likely to on the generator parameters and vice versa.
40
41 fail, is by improperly “filling in” any low-probability den-
42 sity regions between peaks in the data distribution. In
contrast, at least in principle, the Jensen-Shannon distri- between real data points and samples generated from the
43
44 bution which underlies GANs is sensitive both to placing model. By punishing the model for generating points
45 weight where there is data since it has information about that can be easily discriminated from the data, adversar-
46 DKL (pdata ||pθ ) and to not placing weight where no data ial learning decreases the weight of regions in the model
47 has been observed (i.e. in low-probability density regions) space that are far away from data points – regions that
48 since it has information about DKL (pθ ||pdata ). inevitably arise when maximizing likelihood. This core
49 In practice, DKL (pdata ||pθ ) can be calculated easily intuition implicitly underlies many adversarial training
50 directly from the data using sampling. On the other algorithms (though it has been recently suggested that
51 hand, DKL (pθ ||pdata ) is impossible to compute since we this may not be the entire story (Goodfellow, 2016)).
52 do not know pdata (x). In particular, this integral cannot
53 be calculated using sampling since we cannot evaluate
54 pdata (x) at the locations of the fantasy particles. The B. Generative models and adversarial learning
55
idea of adversarial learning is to circumnavigate this dif-
56
ficulty by using an adversarial learning procedure. Re- Here, we give a brief high-level overview of the ba-
57
58 call, that DKL (pθ ||pdata ) is large when the model artifi- sic idea behind GANs. The mathematics and theory
59 cially over-weighs low-density regions near real peaks (see of GANs draws deeply from concepts in Game Theory
60 Fig. 69). Adversarial learning accomplishes this same such as Nash Equilibrium that are foreign to most physi-
61 task by teaching a discriminator network to distinguish cists. For this reason, a comprehensive discussion of
62
63
64
65
97
1
2
GANs is beyond the scope of the review. Readers inter-
3
ested in learning more are directed to the comprehensive Prior distribution: p (z)
4
tutorial by Goodfellow (Goodfellow, 2016). GANs are
θ
5
6 also notorious for being hard to train. For this reason,
7 readers wishing to play with GANs should also consider z-space
8 the very nice practical discussion entitled “How to train
9 a GAN” (affectionately labeled “ganhacks”) available at
10 https://github.com/soumith/ganhacks.
11 The central idea of GANs is to construct two differ-
12 entiable neural networks (see Fig. 71). The first neural
13 network, usually a (de)convolutional network based on
14 the DCGAN architecture (Radford et al., 2015), approx-
15 imates a generator function G(z; θG ) that takes as input Encoder: q (z|x) Decoder: p (x|z)
16 a z sampled from some prior on the latent space, and out- φ θ
puts an x from the model. The second network approxi-
18 mates a discriminator function D(x; θD ) that is designed
19 to distinguish between x from the data and samples gen-
20
erated by the model: x = G(z; θG ). The scalar D(x) rep-
21
22
resents the probability that x came from the data rather x-space
23 than the model pθG . We train D to distinguish actual
24 data points from synthetic examples and the generative
network to fool the discriminative network.
Dataset: D
25
26 To define the cost function for training, it is useful to
27 define the functional
28
V (D, G) = Ex∼pdata (log D(x)) FIG. 72 VAEs learn a joint distribution pθ (x, z) between
29 latent variables z with prior distribution p(z) and data x.
30 + Ez∼pprior (log [1 − D(G(z))]) . (218) The conditional distribution pθ (x|z) can be thought of as a
31 stochastic “decoder” that maps latent variables to new ex-
32 In the version of GANs most amenable to theoretical amples. The stochastic “encoder” qφ (z|x) approximates the
33 analysis – though not the version usually implemented true but intractable pθ (z|x) – much like mean-field theories
34 in practice – we take the cost function for the discrimi- in statistical physics approximate true distributions with ana-
35 nator and generators to be C (G) = −C (D) = 12 V (D, G). lytically tractable approximations. Figure based on Kingma’s
36 This choice of cost functions corresponds to what is called Ph.D. dissertation Chapter 2. (Kingma et al., 2017).
37 a zero-sum game. Since the discriminator is maximized,
38 we can write a cost function for the generator as
39 DNN. The use of latent variables is a common theme
40 C(G) = max V (G, D). (219) in many of the generative models we have encountered in
41 D unsupervised learning tasks from Gaussian Mixture Mod-
42
It turns out that this cost function is related to the els (see Sec. XIII) to Restricted Boltzmann Machines.
43
Jensen-Shannon Divergence in a simple manner (Good- However, in VAEs this mapping, p(x|z, θ) is much less
44
fellow, 2016; Goodfellow et al., 2014): restrictive and much more complicated since it takes the
45 form of a DNN. This added complexity means we can-
46 not use techniques such as Expectation Maximization to
C(G) = − log 4 + 2DJS (pdata , pθG ). (220)
47 train the model and instead must rely of methods based
48 This brings us back full circle to the discussion in the last on backpropagation.
49
section on KL-divergences.
50
51 1. VAEs as variational models
52 C. Variational Autoencoders (VAEs)
53 We start by discussing VAEs from a variational per-
54 We now turn our attention to another class of powerful spective. We will make extensive use of the concepts
55
latent-variable, generative models called Variational Au- introduced in Sec. XIV and the reader is strongly-
56
toencoders (VAEs). VAEs exploit the variational/mean- encouraged to refresh their memory of this section before
57
58 field theory ideas presented in Sec. XIV to build complex proceeding. A VAE is a latent-variable model pθ (x, z)
59 generative models using deep neural networks (DNNs). with a latent variables z and observed variables x. The
60 The central idea behind VAEs is to represent the map latent variables are drawn from some pre-specified prior
61 from latent variables to observable variables using a distribution p(z). In practice, p(z) is almost always taken
62
63
64
65
98
1
2
to be a multivariate Gaussian. The conditional distribu- The second term acts as a regularizer and encourages the
3
4 tion pθ (x|z) maps points in the latent space to new ex- posterior distributions to be close to p(z). By maximizing
5 amples (see Fig. 72). This is often called a “stochastic the ELBO, we minimize the KL-divergence between the
6 decoder” and defines the generative model for the data. approximate and true posterior. By choosing a tractable
7 The reverse mapping that gives the posterior over the qφ (z|x), we make this feasible (see Fig. 72).
8 latent variables pθ (z|x) is often called the “stochastic en-
9 coder”.
10 A central challenge in latent variable modeling is to in- 2. Training via the reparametrization trick
11 fer the posterior distribution of the latent variables given
12 a sample from the data. This can in principle be done VAEs train models by minimizing the variational free
energy (maximizing the ELBO). Training a VAE is some-
13 via Bayes’ rule: pθ (z|x) = p(z)p θ (x|z)
. For some models,
14
pθ (x)
what complicated because we must simultaneously learn
we can calculate this analytically. In this case, we can
15 two sets of parameters: the parameters θ that define our
use techniques like Expectation Maximization (EM) (see
16 generative model pθ (x, z) as well as the variational pa-
Sec. XIV). However, in general this is intractable since
17 rameters φ in qφ (z|x). The basic approach will be the
the denominator requires computing a sum R over all con-
18 same as for all DNN models: we will use gradient de-
19 Rfigurations of the latent variables, pθ (x) = pθ (x, z)dz = scent with the variational free energy as the objective
pθ (x|z)p(z)dz (i.e. a partition function in the language
20 (cost) function. For a dataset L, we can write our cost
of physics), which is often intractable for large models.
21 function as
22 In VAEs, where the pθ (x|z) is modeled using a DNN, this X
23 is impossible. Cθ,φ (L) = −Fqφ (x). (224)
24 A first attempt to address the issue of computing x∈L
25 p(x) could be through importance sampling (Neal, 2001). Taking the gradient with respect to θ is easy since only
26 That is, we choose a proposal distribution q(z|x) which the first term in Eq. (223) depends on θ,
27 is easy to sample from, and rewrite the sum as an expec-
28 tation with respect to this distribution: Cθ,φ (x) = Eqφ (z|x) [∇θ log pθ (x, z)]
29 Z ∼ ∇θ log pθ (x, z) (225)
p(z)
30 pθ (x) = pθ (x|z) qφ (z|x)dz. (221) where in the second line we have replaced the expec-
31 qφ (z|x)
32 tation value with a single Monte-Carlo sample z drawn
Thus, by sampling from qφ (z|x) we can get a Monte Carlo from qφ (z|x) (see Fig. XVII.C.2). When pθ (x|z) is ap-
33
estimate of p(x). However, this requires generating sam- proximated by a neural network, this can be calculated
34
35 ples and thus our estimates will be noisy. If our proposal using backpropagation with the reconstruction error as
36 distribution is poor, the variance in the estimate can be the objective function.
37 very high. On the other hand, calculating the gradient with re-
38 An alternative approach that avoids these sampling spect to the parameters φ is more complicated since φ
39 issues is to use the variational approach discussed in also appears in the expectation value Eqφ (z|x) . Ideally, we
40 Sec. XIV. We know from Eq. (162) that we can write would like to also use backpropagation to calculate this
41 the log-likelihood as as well. It turns out that this can be done by a simple
42
log p(x) = DKL (qφ (z|x)kpθ (z|x, θ)) − Fqφ (x), (222) change of variables that often goes under the name the
43 “reparameterization trick” (Kingma and Welling, 2013;
44 where the variational free energy is defined as Rezende et al., 2014). The basic idea is to change vari-
45 ables so that φ no longer appears in the distribution we
46 − Fqφ (x) ≡ Eqφ (z|x) [log pθ (x, z)] − DKL (qφ (z|x)|p(z)). are taking an expectation value with respect to. To do
47 (223) this, we express the random variable z ∼ qφ (z|x) as some
48 In writing this term, we have used Bayes rule and differentiable and invertible transformation of another
49 Eq. (174). Since the KL-divergence is strictly positive, random variable :
50 the (negative) variational free energy is a lower-bound on
51 the log-likelihood. For this reason, in the VAE literature, z = g(, φ, x), (226)
52 it is often called the Evidence Lower BOund or ELBO.
53 where the distribution of  is independent of x and φ.
Equation (223) has a beautiful interpretation. The
54 Then, we can replace expectation values over qφ (z|x) by
first term in this equation can be viewed as a “recon-
55 expectation values over p
struction error”, where we start with data x, encode it
56
into the latent representation using our approximate pos- Eqφ (z|x) [f (z)] = Ep [f (z)]. (227)
57
terior qφ (z|x), and then evaluate the log probability of
58 Evaluating the derivative then becomes quite straight for-
the original data given the inferred latents. For binary
59 ward since
60 variables, this is just the cross-entropy which we first en-
61 countered when studying logistic regression, cf. Sec. VII. ∇φ Eqφ (z|x) [f (z)] ∼ Ep [∇φ f (z)]. (228)
62
63
64
65
99
1
2
3 Datapoint
4
5 Inference Model Generative Model
6 Sample z
7 q (z|x) p (x|z)
φ θ
8
9
10
11
12 Negative
13
14
Variational Free Energy:
(ELBO)
E [log pθ(x|z) - KL(qφ(z|x)||p(z))]
q (z|x)
φ
15
16
17 FIG. 73 Schematic explaining the computational flow of VAEs. Figure based on Kingma’s Ph.D. dissertation Chapter 2.
18 (Kingma et al., 2017).
19
20
Of course, when we do this we still need to be able to These observations hint at the more general connec-
22 calculate the Jacobian of this change of variables tion between VAEs and information theory that we turn
23 to in the next section.
24 ∂z
25 dφ (x, φ) = Det (229)
∂
26
27 since 3. Connection to the information bottleneck
28
29 log qφ (z|x) = log p() − log dφ (x, φ). (230) There is a fundamental connection between the vari-
30 ational autoencoder objective and the information bot-
31 Since we can calculate gradients, we can now use back- tleneck (IB) for lossy compression (Tishby et al., 2000).
propagation on the full ELBO objective function (we The information bottleneck imagines we have input data
33 return to this below when we discuss concrete architec- x that is correlated with another variable of interest, y,
34 tures and implementations of VAE). and we are given access to the joint distribution, p(x, y).
35 One of the problems that commonly occurs when train- Our task is to take x as input and compress it in such a
36 ing VAEs by performing a stochastic optimization of the way as to retain as much information as possible about
37 ELBO (variational free energy) is that it often gets stuck the relevance variable, y. To do this, Tishby et al. pro-
38 in undesirable local minima, especially at the beginning pose to maximize the objective function
39 of the training procedure (Bowman et al., 2015; Kingma
40 et al., 2017; Sønderby et al., 2016). The underlying rea-
41 son for this is that the ELBO objective function can be
42 LIB = I(y; z) − βI(x; z) (232)
improved in two qualitatively different ways correspond-
43 ing to each of the two terms in Eq. (223): by minimizing
44 over a stochastic encoding distribution q(z|x), where z is
the reconstruction error or by making the posterior dis-
45 our compression of the input, and β is a tradeoff param-
tribution qφ (z|x) to be close to p(z) (Of course, the goal
46 eter that sets the relative preference of compression and
is to do both!). For complex datasets, at the beginning of
47 accuracy, and I(y; z) is the mutual information between y
training when the reconstruction error is extremely poor,
48 and z. Note that we choose a slightly different but equiv-
the model often quickly learns to make q(z|x) ≈ p(z) and
49 alent form of the objective relative to Tishby et al.. This
gets stuck in this local minimum. For this reason, in prac-
50 objective is only known to have a closed-form solution
51 tice it is found that it makes sense to modify the ELBO
when x and y are jointly Gaussian (Chechik et al., 2005).
52 objective to use an optimization schedule of the form
Otherwise, the optimization can be performed through a
53 Blahut-Arimoto type iterative update scheme (Arimoto,
54 Eqφ (z|x) [log pθ (x, z)] − βDKL (qφ (z|x)|p(z)) (231)
1972; Blahut, 1972). However, this is only guaranteed to
55
where β is slowly annealed from 0 to 1 (Bowman et al., converge to a local optimum. A significant difficulty in
56
2015; Sønderby et al., 2016). An alternative regulariza- implementing IB is that it requires knowledge of the joint
57
58 tion is the “method of free bits”: modifying the objective distribution p(x, y) and that we must be able to compute
59 function of ELBO to ensure that on average qφ (z|x) has the mutual information, a notoriously difficult quantity
60 at least λ natural units of information about p(z) (see to estimate from samples. Hence, IB has in recent years
61 Kingma Ph.D thesis (Kingma et al., 2017) for details) . been utilized less than it might otherwise.
62
63
64
65
100
1
2
To address these problems, variational approximations x Data
3
4 to the IB objective function have been developed (Alemi
et al., 2016; Chalk et al., 2016). These approximations, Hidden layers Neural network with
5 weights φ
6 when applied to a particular choice of p(x, y) give the
7 same objective as the variational autoencoder. Here
Zmean log σz Latent layer parameters
8 we follow the exposition from Alemi et al.(Alemi et al.,
9 2016). To see this, consider a dataset of N points, xi .
10 We set x = i and y = xi in the IB objective, similar Use analytic expression
KL(qφ(z|x)||p(z))
11 to (Slonim et al., 2005; Strouse and Schwab, 2017). We for Gaussians
12 choose p(i) = 1/N and p(x|i) = δ(x − xi ). That is, we
13 would like to find a compression of the data that preserves ε Standard normal variable
14 information about data point location while reducing in-
15 formation about data point identity. z Latent variable
16 Imagine that we are unable to directly work with the
17 decoder p(x|z). The first approximation replaces the ex-
Hidden layers Neural network with
18 act decoder inside the logarithm with an approximation, weights
19 q(x|z). Due to the positivity of KL-divergence, namely,
20 p
θ Reconstruction
21 DKL (p(x|z)||q(x|z)) ≥ 0
22 Z Z
Reconstruction Error
23 ⇒ dx p(x|z) log p(x|z) ≥ dx p(x|z) log q(x|z), E [log pθ(x|z)]
q (z|x) (i.e. cross-entropy)
24 φ

25 (233)
26 FIG. 74 Computational graph for a VAE with Gaussian hid-
we have den units (i.e. p(z) are standard normal variables N (0, 1) and
27 Z  
28 p(x|z) Gaussian variational encoder whose posterior takes the form
29 I(x; z) = dxdz p(x)p(z|x) log qφ (z|x) = N (µ(x), σ 2 (x)).
p(x)
30 Z
31 ≥ dxdz p(x)p(z|x) log q(x|z) + Hp (x)
x, which is irrelevant in the optimization procedure. In
32 Z
33 fact, this objective has been explored and is called a β-
≥ dxdz p(x)p(z|x) log q(x|z), (234)
34 VAE (Higgins et al., 2016). It’s interesting to note that in
35 the case of IB, the variational approximations are with
where Hp (x) ≥ 0 is the Shannon entropy of x. This respect to the decoder and prior, whereas in the VAE,
36
37 quantity can be estimated from data samples (i, xi ) af- the variational approximations are with respect to the
38 ter drawing from p(z|i) = p(z|xi ). Similarly, we can encoder.
39 replace
R the prior distribution of the encoding, p(z) =
40 dx p(x)q(z|x) which is typically intractable, with a
41 tractable q(z) to get D. VAE with Gaussian latent variables and Gaussian encoder
42 Z
1 X p(z|xi ) Our discussion of VAEs thus far has been quite ab-
43 I(i; z) ≤ dz p(z|xi ) log (235)
44 N i q(z) stract. In this section, we discuss one of the most widely
45 employed VAE architectures: a VAE with factorized
Putting these two bounds Eqs. (234) and (235) together Gaussian posteriors, qφ (z|x) = N (z, µ(x), diag(σ 2 (x)))
and noting that x = i and y = xi , we get an upper bound and standard normal latent variables p(z) = N (0, I).
48 for the IB objective that takes the same form as the VAE The training and implementation simplifies greatly
objective Eq. (231) we saw earlier: here because we can analytically work out the term
50 DKL (qφ (z|x)|p(z)).
51 LIB = I(x; z) − βI(y; z)
Z
52
53 ≤ dx p(x)Ep(z|x) [log q(x|z)] (236)
1. Implementing the Gaussian VAE
54 1 X
55 −β DKL (p(z|xi )|q(z)). (237)
N i We now show how we can combine analytic expressions
56
for the KL-divergence with backpropagation to efficiently
57
58 Note that in Eq. (236) we have a conditional distribution implement a Gaussian VAE. We start by first deriving an-
59 of x given z but not their joint distribution inside the alytic expressions for DKL (qφ (z|x)|p(z)) in terms of the
60 expectation, which was the case in Eq. (231). This is means µ(x) and variances σ 2 (x). This is just a simple ex-
61 due to that we dropped the entropy term pertaining to ercise in Gaussian integrals. For notational convenience,
62
63
64
65
101
1
2
For notational convenience, we drop the x-dependence of the means µ(x), variances σ²(x), and q_φ(x). A straightforward calculation gives

    ∫dz q_φ(z) log p(z) = ∫dz N(z; µ(x), diag(σ²(x))) log N(z; 0, I)
                        = −(J/2) log 2π − (1/2) Σ_{j=1}^J (µ_j² + σ_j²),    (238)

where J is the dimension of the latent space. An almost identical calculation yields

    ∫dz q_φ(z) log q_φ(z) = −(J/2) log 2π − (1/2) Σ_{j=1}^J (1 + log σ_j²).    (239)

Combining these equations gives

    −D_KL(q_φ(z|x)|p(z)) = (1/2) Σ_{j=1}^J [1 + log σ_j²(x) − µ_j²(x) − σ_j²(x)].    (240)

This analytic expression allows us to implement the Gaussian VAE in a straightforward way using neural networks. The computational graph for this implementation is shown in Fig. 74. Notice that since the parameters are all compositions of differentiable functions, we can use standard backpropagation algorithms to train VAEs.
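To make Eq. (240) concrete, the short NumPy sketch below (an illustrative snippet of our own, not code from the accompanying notebooks) evaluates the closed-form expression for −D_KL and checks it against a brute-force Monte-Carlo estimate obtained with the reparametrization z = µ + σ·ε:

```python
# Check the analytic KL term of Eq. (240) against a Monte-Carlo estimate.
import numpy as np

def neg_kl_closed_form(mu, log_var):
    """-D_KL(q(z|x)|p(z)) from Eq. (240) for q = N(mu, diag(sigma^2)) and p(z) = N(0, I)."""
    return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def neg_kl_monte_carlo(mu, log_var, n_samples=200_000, seed=0):
    """Estimate -D_KL = E_q[log p(z) - log q(z)] by sampling z = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps  # reparametrization trick
    log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + log_var, axis=1)
    return np.mean(log_p - log_q)

mu, log_var = np.array([0.5, -1.0]), np.array([0.1, -0.3])
print(neg_kl_closed_form(mu, log_var))   # exact value from Eq. (240)
print(neg_kl_monte_carlo(mu, log_var))   # agrees up to Monte-Carlo error
```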
2. VAEs for the MNIST dataset

In Notebook 19, we have implemented a VAE using Keras and trained it using the MNIST dataset. The basic architecture is the one described above. All figures were generated with a VAE that has a latent space of dimension 2. The architecture of both the encoder and decoder is a Multi-layer Perceptron (MLP) – a neural network with a single hidden layer. For this example, we take the dimension of the hidden layer for both neural networks to be 256. We trained the VAE using the RMSprop optimizer for 50 epochs.
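A minimal sketch of such an architecture is given below. This is not the exact code of Notebook 19; it is an illustrative implementation assuming TensorFlow 2.x's tf.keras API, with the reparametrization trick implemented as a custom layer and the loss given by the binary cross-entropy reconstruction error plus the analytic KL term of Eq. (240). The `encoder` and `decoder` models defined here are reused in the sketches that follow.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

original_dim, hidden_dim, latent_dim = 784, 256, 2

class Sampling(layers.Layer):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: x -> (mu, log sigma^2, z)
enc_in = keras.Input(shape=(original_dim,))
h = layers.Dense(hidden_dim, activation="relu")(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# Decoder: z -> Bernoulli means for the 784 pixels
dec_in = keras.Input(shape=(latent_dim,))
h_dec = layers.Dense(hidden_dim, activation="relu")(dec_in)
dec_out = layers.Dense(original_dim, activation="sigmoid")(h_dec)
decoder = keras.Model(dec_in, dec_out, name="decoder")

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            x_rec = self.decoder(z)
            # Reconstruction error: binary cross-entropy summed over pixels
            recon = tf.reduce_mean(
                original_dim * keras.losses.binary_crossentropy(data, x_rec))
            # Analytic KL term of Eq. (240), summed over latent dimensions
            kl = tf.reduce_mean(tf.reduce_sum(
                -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)), axis=-1))
            loss = recon + kl
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}

(x_train, _), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, original_dim).astype("float32") / 255.0
x_test = x_test.reshape(-1, original_dim).astype("float32") / 255.0

vae = VAE(encoder, decoder)
vae.compile(optimizer="rmsprop")
vae.fit(x_train, epochs=50, batch_size=100)
```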
We can visualize the embedding in the latent space by plotting z of the test set and coloring the points by digit identity [0-9] (see Fig. 75). Notice that in general, digits that are similar end up being closer to each other in the latent space. However, this is not always the case (see bright green points for example). This is a general feature of these low-dimensional embeddings and we saw a similar phenomenon when we examined t-SNE in Section XII.

FIG. 75 Embedding of the MNIST dataset into a two-dimensional latent space using a VAE with two latent dimensions (see Notebook 19 and main text for details). Data points are colored by their identity [0-9].
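A plot along the lines of Fig. 75 can be produced in a few lines, reusing the `encoder` and the test data (`x_test`, `y_test`) from the sketch above (again an illustrative snippet, not the notebook's code):

```python
import matplotlib.pyplot as plt

# 2D latent coordinates (posterior means) of the test set
z_mean_test, _, _ = encoder.predict(x_test)
plt.scatter(z_mean_test[:, 0], z_mean_test[:, 1], c=y_test, cmap="tab10", s=2)
plt.colorbar(label="digit identity")
plt.xlabel("$z_1$")
plt.ylabel("$z_2$")
plt.show()
```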
The real advantage that VAEs offer over embeddings such as t-SNE is that they are generative models. Given a set of examples, we can generate new examples – or fantasy particles as they are commonly called in ML – by sampling the latent space z and then using the decoder to map these latent variables to new examples. The results of this procedure are shown in Fig. 76. In the top figure, we sample the latent space uniformly in a 5 × 5 grid. Notice that this results in extremely similar examples through much of the latent space. The underlying reason for this is that uniform sampling does not respect the underlying Gaussian structure of the latent space z. In the bottom figure, we perform a uniform sampling on the probability p(z) and map this back to the latent space using the inverse Cumulative Distribution Function (CDF) of the Gaussian. We see that the diversity of the generated examples is much higher for this sampling procedure.

FIG. 76 (Top) Fantasy particles generated by uniform sampling of the latent space z. (Bottom) Fantasy particles generated by uniform sampling of probability p(z) mapped to the latent space using the inverse Cumulative Distribution Function (CDF) of the Gaussian.
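The two sampling schemes can be compared directly with a sketch like the following (it reuses the `decoder` from the earlier sketch; the grid ranges are illustrative choices, not the notebook's values):

```python
import numpy as np
from scipy.stats import norm

n = 5
# (a) Uniform grid directly in the latent space -- ignores the Gaussian structure of p(z)
grid = np.linspace(-1.5, 1.5, n)
z_uniform = np.array([[zx, zy] for zx in grid for zy in grid])

# (b) Uniform grid in probability, mapped back through the inverse CDF of N(0, 1)
quantiles = norm.ppf(np.linspace(0.05, 0.95, n))
z_gaussian = np.array([[zx, zy] for zx in quantiles for zy in quantiles])

fantasy_uniform = decoder.predict(z_uniform).reshape(n, n, 28, 28)
fantasy_gaussian = decoder.predict(z_gaussian).reshape(n, n, 28, 28)
```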
60 were generated with a VAE that has a latent space of We once again visualize the embedding learned by
61 dimension 2. The architecture of both the encoder and the VAE by plotting z and coloring the points by the
62
63
64
65
102
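One way the narrower prior can enter the implementation (a hedged sketch; Notebook 20 may organize this differently) is through the analytic KL term and through the prior used to draw fantasy particles. For a prior p(z) = N(0, σ_p² I), Eq. (240) generalizes as in the snippet below, with σ_p = 1 recovering the standard expression:

```python
import numpy as np

def neg_kl_scaled_prior(mu, log_var, sigma_p=0.2):
    """-D_KL(q(z|x)|p(z)) for q = N(mu, diag(sigma^2)) and p(z) = N(0, sigma_p^2 I).
    Setting sigma_p = 1 recovers Eq. (240)."""
    var = np.exp(log_var)
    return 0.5 * np.sum(1.0 + log_var - np.log(sigma_p**2) - (mu**2 + var) / sigma_p**2)

# Fantasy particles for the Ising VAE are then drawn from the matching prior:
rng = np.random.default_rng(1)
z_samples = 0.2 * rng.standard_normal((25, 2))
```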
We once again visualize the embedding learned by the VAE by plotting z and coloring the points by the temperature at which the sample was drawn (see Fig. 77, top). Notice that the latent space has learned a lot of the physics of the Ising model. For example, the first VAE dimension is just the magnetization (Fig. 77, bottom). This is not surprising, since we saw in Section XII that the first principal component of a PCA also corresponded to the magnetization.

FIG. 77 (Top) Embedding of the Ising dataset into a two-dimensional latent space using a VAE with two latent dimensions (see Notebook 20 and main text for details). Data points are colored by the temperature the sample was drawn at. (Bottom) Correlation between the latent dimensions and the magnetization for each sample. Notice the first principal component corresponds to the magnetization.
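The bottom panel of Fig. 77 can be reproduced schematically as follows. This is a hedged sketch: it assumes an `encoder` trained as in Notebook 20 and an array `ising_samples` of shape (n_samples, 1600) with spins encoded as 0/1, which may differ from the notebook's exact conventions:

```python
import numpy as np

z_mean_ising, _, _ = encoder.predict(ising_samples)
# Magnetization of each 40x40 sample, converting 0/1 pixels back to spins s = 2*pixel - 1
magnetization = np.mean(2.0 * ising_samples - 1.0, axis=1)
for j in range(z_mean_ising.shape[1]):
    corr = np.corrcoef(z_mean_ising[:, j], magnetization)[0, 1]
    print(f"correlation of latent dimension {j} with magnetization: {corr:.2f}")
```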
We now ask how well the VAE can generate new examples (see Fig. 78). We see that the examples look quite different from real Ising configurations – they lack the large scale patchiness seen in the critical region. They mostly turn out to be unstructured speckles that reflect only the average probability that a pixel is on in a region. This is not surprising since our VAE has no spatial structure, has only two latent dimensions, and the cost function does not know about “correlations between spins”: there is very little information about correlations in the binary cross-entropy which we use to measure reconstruction errors. The reader is encouraged to play with the corresponding notebook and generate examples as we change the latent dimension and/or choose modified architectures such as decoders based on CNNs instead of MLPs.

FIG. 78 Fantasy particles for the Ising model generated by uniform sampling of probability p(z) mapped to latent space using the inverse Cumulative Distribution Function (CDF) of the Gaussian.

This example also shows how much easier it is to discriminate between labeled data than it is to learn how to generate new examples from an unlabeled dataset. This is true in all spheres of machine learning. This is also one of the reasons that generative models are one of the cutting edge areas of modern Machine Learning research and there are likely to be a barrage of new techniques for generative modeling in the next few years.
XVIII. OUTLOOK

In this review, we have attempted to give the reader the intellectual and practical tools to engage with Machine Learning (ML), data science, and parts of modern statistics. We have tried to emphasize that ML differs from ordinary statistics in that the goal is to predict rather than to fit. For this reason, all the techniques discussed here have to navigate important tensions that lie at the heart of ML. The most prominent instantiation of these inherent tradeoffs is the bias-variance tradeoff, which is perhaps the only universal principle in ML. Identifying how these tradeoffs manifest in a particular algorithm is the key to constructing and training powerful ML methods.

The immense progress in computing power and the corresponding availability of large datasets ensure that ML will be an important part of the physics toolkit. In the future, we expect ML to be a core competency of physicists much like linear algebra, group theory, and differential equations. We hope that this review will play some small part toward this aspirational goal.

We wrote this review to provide a relatively concise introduction to ML using ideas and language familiar to physicists (though the review ended up being almost twice the planned length). In writing the review, we have tried to accomplish two somewhat disparate tasks. First, we have tried to highlight more abstract and theoretical considerations to show the unity of ML and statistical learning. Many ML techniques can be understood by starting with some key concepts from statistical learning (MLE, bias-variance tradeoff, regularization) and combining them with core concepts familiar from statistical physics (Monte-Carlo, gradient descent, variational methods and MFT). Despite the high-level similarities between all the methods presented here, the way that these concepts manifest in any given technique is often quite clever, and understanding these “hacks” is the key to understanding why some ML techniques turn out to be so powerful and others not so much. ML, in this sense, is as much an art as a science. Second, we have tried to give the reader the practical know-how to start using the tools and concepts from ML for immediately solving problems. We believe the accompanying Python notebooks and the emphasis on coding in Python have accomplished this task.

A. Research at the intersection of physics and ML

We hope the review catalyzes more research at the intersection of physics and machine learning. Here we briefly highlight a few promising research directions. We note that this list is far from comprehensive.

• Applying ML to solve physics problems. One theme that has reoccurred throughout the review is that ML is most effective in settings with well-defined objectives and lots of data. For this reason, we expect ML to become a core competency of data-rich fields such as high-energy experiments and astronomy. However, ML may also prove to be useful for helping further our physical understanding through data-driven approaches to other branches of physics that may not be immediately obvious, such as quantum physics (Dunjko and Briegel, 2017). For example, recent works have used ideas from ML to investigate disparate topics such as non-local correlations (Canabarro et al., 2018), disordered materials and glasses (Schoenholz, 2017), electronic structure calculations (Grisafi et al., 2017) and numerical analysis of ferromagnetic resonances in thin films (Tomczak and Puszkarski, 2018), designing and analyzing quantum materials by integrating ML with existing techniques such as Dynamical Mean Field Theory (DMFT) (Arsenault et al., 2014), in the study of inflation (Rudelius,
2018), and even for experimental learning of quan- biophysics toolkit in the future. Many of the au-
3
4 tum states by using ML to aid in quantum tomog- thors of this review were inspired to engage with
5 raphy (Rocchetto et al., 2017). For a comprehen- ML for this reason.
6 sive review of ML methods in seismology, see (Kong
et al., 2018). • Using ideas from physics to develop new
7
8 ML algorithms. Many of the core ideas of ML
9 • Machine Learning on quantum computers. from Monte-Carlo techniques to variational meth-
10 Another interesting area of research that is likely ods have their origin in physics. There has been a
11 to grow is asking if and how quantum comput- tremendous amount of recent work developing tools
12 ers can help improve state-of-the art ML algo- to understand physical systems that may be of po-
13 rithms (Arunachalam and de Wolf, 2017; Benedetti tential use to ML. For example, in quantum con-
14 et al., 2016, 2017; Bromley and Rebentrost, 2018; densed matter techniques such as DMRG, MERA,
15 Ciliberto et al., 2017; Daskin, 2018; Innocenti et al., etc. have enriched both our practical and con-
16 2018; Mitarai et al., 2018; Perdomo-Ortiz et al., ceptual understandings (Stoudenmire and White,
17 2017; Rebentrost et al., 2017; Schuld et al., 2017; 2012; Vidal, 2007; White, 1992). It will be inter-
18 Schuld and Killoran, 2018; Schuld et al., 2015). esting to figure how and if these numerical methods
19 Concrete examples that seek to extend some of can be translated from a physics to a ML setting.
20 the basic ideas and methods we introduced in There are tantalizing hints that this is likely to be
21 this review to the quantum computing realm in- a fruitful direction (Han et al., 2017; Stoudenmire,
22 clude: algorithms for quantum-assisted gradient 2018; Stoudenmire and Schwab, 2016).
23 descent (Kerenidis and Prakash, 2017; Rebentrost
24
et al., 2016), classification (Schuld and Petruccione,
25 B. Topics not covered in review
2017), and Ridge regression (Yu et al., 2017). Inter-
26
27 est in this field will undoubtedly grow once reliable
quantum computers become available (see also this Despite the considerable length of the review, we have
28
29 recent review (Dunjko and Briegel, 2017) ). had to make many omissions for the sake of brevity. It
30 is our hope and belief that after reading this review the
31 • Monte-Carlo Methods. An interesting area that reader will have the conceptual and practical knowledge
32 has seen a renewed interest with Bayesian modeling to quickly learn about these other topics. Among the
33 is the development of new Monte-Carlo methods for most prominent topics missing from this review are:
34 sampling complex probability distributions. Some
35 of the workhorses of modern Machine Learning – • Temporal/Sequential Data. We have not cov-
36 Annealed Importance Sampling (AIS) (Neal, 2001) ered techniques for dealing with temporal or se-
37 and Hamiltonian or Hybrid Monte-Carlo (HMC) quential data. Here, too there are many connec-
38 (Neal et al., 2011) – are intimately related to tions with statistical physics. A powerful class of
39 physics. As pointed out by Radford Neal, AIS is models for sequential data called Hidden Markov
40 just the Jarzynski inequality (Jarzynski, 1997) as Models (Rabiner, 1989) that utilize dynamical
41 a Monte-Carlo method and HMC was developed programming techniques have natural statistical
42 by physicists and exploits Hamiltonian dynamics physics interpretations in terms of transfer matri-
43 to improve proposal distributions. ces (see (Mehta et al., 2011) for explicit exam-
44 ple of this). Recently, Recurrent Neural Networks
45 • Statistical physics style theory of Deep (RNNs) have become an important and powerful
46 Learning. Many techniques in ML have origins tool for dealing with sequence data (Goodfellow
47 in statistical physics. Yet, a physics-style theory et al., 2016). RNNs generalize many of the ideas
48 of Deep Learning remains elusive. A key question discussed in the DNN section to deal with temporal
49 is to ask when and why these models manage to data.
50 generalize well. Physicists are only beginning to
51 • Reinforcement Learning. Many of the most ex-
ask these questions (Advani and Saxe, 2017; Mehta
52 citing developments in the last five years have come
and Schwab, 2014; Saxe et al., 2013; Shwartz-Ziv
53 from combining ideas from reinforcement learning
54 and Tishby, 2017). But right now, it is fair to say
that the insights remain scattered and a cohesive with deep neural networks (Mnih et al., 2015; Sut-
55
theoretical understanding is lacking. ton and Barto, 1998). RL traces its origins to be-
56
haviourist psychology, when it was conceived as
57
58 • Biological physics and ML. Biological physics a way to explain and study reward-based deci-
59 is generating ever more datasets in fields ranging sion making. RL was put on solid mathematical
60 from neuroscience to evolution and immunology. It grounds in the 50’s by Richard Bellman and col-
61 is likely that ML will be an important part of the laborators, and has by now become an inseparable
62
63
64
65
105
1
2
part of robotics and artificial intelligence. RL is a that form the basis of modern ML with the more com-
3
4 field of Machine Learning, in which an agent learns monplace notions about what humans and behavioral sci-
5 how to master performing a specific task through entists mean by intelligence (see (Lake et al., 2017) for
6 an interaction with its environment. Depending on an enlightening and important modern discussion of this
7 the reward it receives, the agent chooses to take an distinction from a quantitative cognitive science point of
8 action affecting the environment, which in turn de- view as well as (Dreyfus, 1965) for a surprisingly relevant
9 termines the value of the next received reward, and philosophy-based critique from 1965).
10 so on. The long-term goal of the agent is to max- Almost all the techniques discussed here rely on op-
11 imise the cumulative expected return, thus improv- timizing a pre-specified objective function on a given
12 ing its performance in the longer run. Shadowed by dataset. Yet, we know that for large, complex models
13 more traditional optimal control algorithms, Re- changing the data distribution or the goal can lead to an
14 inforcement Learning has only recently taken off immediate degradation of performance. Deep networks
15 in physics (Albarran-Arriagada et al., 2018; Au- have poor generalizations to even a slightly different con-
16 gust and Hernández-Lobato, 2018; Bukov, 2018; text (the infamous Validation-Test set mismatch). This
17 Bukov et al., 2018; Cárdenas-López et al., 2017; inability to abstract and generalize is a common criticism
18 Chen et al., 2014; Chen and Xue, 2019; Dunjko lobbied against branding modern ML techniques as AI
19 et al., 2017; Fösel et al., 2018; Lamata, 2017; Mel- (Lake et al., 2017). For all these reasons, we have chosen
20
nikov et al., 2017; Neukart et al., 2017; Niu et al., to use the term Machine Learning rather than artificial
21
2018; Ramezanpour, 2017; Reddy et al., 2016b; Sri- intelligence through out the review.
22
23 arunothai et al., 2017; Sweke et al., 2018; Zhang This is far from the first time we have seen the use of
24 et al., 2018). Of particular interest are biophysics the term artificial intelligence and the grandiose promises
25 inspired works that seek to use RL to understand that it implies. In fact, the early 1950’s and 1960’s as
26 navigation and sensing in turbulent environments well as the early 1980’s saw similar AI bubbles (see this
27 (Colabrese et al., 2017; Masson et al., 2009; Reddy interesting summary by Luke Muehlhauser for Open Phi-
28 et al., 2016a; Vergassola et al., 2007). lanthropy (Muehlhauser, 2016)). These AI bubbles have
29 been followed by what have been dubbed “AI Winters”
• Support Vector Machines (SVMs) and Ker-
30 (McDermott et al., 1985).
nel Methods. SVMs and kernel methods are a
31
powerful set of techniques that work well when the The “Singularity” may not be coming but the advances
32
amount of training data is limited (Burges, 1998). in computing and the availability of large data sets likely
33 ensure that the kind of statistical learning frameworks
34 The mathematics and theory of SVM are very dif-
ferent from statistical physics and for this reason discussed are here to stay. Rather than a general arti-
35 ficial intelligence, the kind of techniques presented here
36 we chose not to include them here. However, SVMs
and kernel methods have played an extremely im- seem to be best suited for three important tasks: (a) au-
37
portant role in ML and are worth understanding. tomating prediction from lots of labeled examples in a
38
39 narrowly-defined setting (b) learning how to parameter-
40 ize and capture the correlations of complex probability
41 C. Rebranding Machine Learning as “Artificial Intelligence” distributions, and (c) finding policies for tasks with well-
42 defined goals and clear rules. We hope that this review
43 The immense scientific progress in ML has also been has given the reader enough conceptual tools to start
44 accompanied by a massive public relations effort cen- forming their own opinions about reality and hype when
45 tered around Silicon Valley. Starting with the success it comes to modern ML research. As Michael I. Joran
46 of ImageNet (the most prominent early use of GPUs puts it, “...if the acronym “AI” continues to be used as
47 for training large models) and the widespread adoption placeholder nomenclature going forward, let’s be aware of
48 of Deep Learning based techniques by the Silicon Val- the very real limitations of this placeholder. Let’s broaden
49 ley companies, there has been a deliberate move to re- our scope, tone down the hype and recognize the serious
50 brand modern ML as “artificial intelligence” or AI (see challenges ahead "(Jordan, 2018).
51 graphs in (Katz, 2017)). Recently, computer scientist
52 Michael I. Jordan (who is famously known for his formal-
53 ization of variational inference, Bayesian network, and D. Social Implications of Machine Learning
54 expectation-maximization algorithm in machine learn-
55
ing research) cautioned that “This confluence of ideas The last decade has also seen a systematic increase in
56
and technology trends has been rebranded as “AI” over the use and deployment of Machine Learning techniques
57
58 the past few years. This rebranding is worthy of some into new areas of life and society. Some of the readers of
59 scrutiny”(Jordan, 2018). this review may currently be (or eventually be) employed
60 AI, by design, is an ambiguous term that mixes aspi- in industrial settings that seek to harness ML for practi-
61 rations with reality. It also conflates the statistical ideas cal purposes. However, caution is in order when applying
62
63
64
65
106
1
2
ML. Without foresight and accountability, the scale and was supported as a Simons Investigator in the MMLS and
3
4 scope of modern ML algorithms can lead to large scale by NIH K25 grant GM098875-02. PM and DJS would
5 unaccountable and undemocratic outcomes that can re- like to thank the NSF grant: PHYD1066293 for support-
6 inforce or even worsen existing inequality and inequities. ing the Aspen Center for Physics (ACP) for facilitating
7 Mathematician and data scientist turned social commen- discussions leading to this work. This research was sup-
8 tator Cathy O’Neil has dubbed the indiscriminate use of ported in part by the National Science Foundation under
9 these Big Data techniques “Weapons of Math Destruc- Grant No. NSF PHY-1748958. The authors are pleased
10 tion” (O’Neil, 2017). to acknowledge that the computational work reported on
11 When ML is used in a social context, abstract statis- in this paper was performed on the Shared Computing
12 tical relationships have real social consequences. False Cluster which is administered by Boston University’s Re-
13 positives can mean the difference between life and death search Computing Services.
14 (for example in the context of “signature drone strikes”)
15 (Mehta, 2015). ML algorithms, like all techniques, have
16 important limitations and should be employed with great Appendix A: Overview of the Datasets used in the Review
17 caution. It is our hope that ML practitioners keep this
18 in mind when working in social settings. 1. Ising dataset
19
All algorithms involve inherent tradeoffs in fairness, a
20
point formalized by computer scientist Jon Kleinberg and The Ising dataset we use throughout the review was
21
collaborators in a very interesting recent paper (Klein- generated using the standard Metropolis algorithm to
22
23 berg et al., 2016). It is far from clear how to make al- generate a Markov Chain. The full dataset consist of
24 gorithms fair for all people involved. This is even more 16 × 10000 samples of 40 × 40 spin configurations (i.e.
25 true with methods like Deep Learning that are hard to the design matrix has 160000 samples and 1600 features)
26 interpret. All ML algorithms have implicit assumptions drawn at temperatures 0.25, 0.5, · · · 4.0. The samples
27 and choices reflected in the datasets we use to the kind are drawn for the Boltzmann distribution of the two-
28 of functions we choose to optimize. It is important to dimensional ferromagnetic Ising model on a 40×40 square
29 remember that there is no “ view from nowhere” (Adam, lattice with periodic boundary conditions.
30 2006; Katz, 2017) – all ML algorithms reflect a point of
31 view and a set of assumptions about the world we live
32 in. For this reason, we hope that ML practitioners and 2. SUSY dataset
33 data scientists will take the time to consider the social
34 consequences of their actions. For example, developing a The SUSY dataset was generated by Baldi et al (Baldi
35 Hippocratic Oath for data scientists is now being consid- et al., 2014) to explore the efficacy of using Deep Learning
36 ered (Simonite, 2018). Doing no harm seems like a good for classifying collision events. The dataset is download-
37
start for making sure that we harness ML for the benefit able from the UCI Machine Learning Repository, a won-
38
of all members of society. derful resource for interesting datasets. Here we quote
39
40 directly from the paper:
41
42 XIX. ACKNOWLEDGMENTS The data has been produced using Monte
43 Carlo simulations and contains events with
44 PM and DJS would like to thank Anirvan Sengupta, two leptons (electrons or muons). In high
45 Justin Kinney, and Ilya Nemenman for useful conver- energy physics experiments, such as the AT-
46 sations during the ACP working group. The authors LAS and CMS detectors at the CERN LHC,
47 are also grateful to all readers who provided valuable one major hope is the discovery of new par-
48 feedback on this manuscript while it was under peer re- ticles. To accomplish this task, physicists at-
49 view. We encourage readers to help keep the Notebooks tempt to sift through data events and classify
50 which accompany the review up-to-date, by contribut- them as either a signal of some new physics
51 ing to them on Github at https://github.com/drckf/ process or particle, or instead a background
52 mlreview_notebooks. PM, CHW, and AD were sup- event from understood Standard Model pro-
53 ported by Simon’s Foundation in the form of a Simons In- cesses. Unfortunately we will never know for
54 vestigator in the MMLS and NIH MIRA program grant: sure what underlying physical process hap-
55
1R35GM119461. MB acknowledges support from the pened (the only information to which we have
56
Emergent Phenomena in Quantum Systems initiative of access are the final state particles). How-
57
58 the Gordon and Betty Moore Foundation, the ERC syn- ever, we can attempt to define parts of phase
59 ergy grant UQUAM, and the U.S. Department of Energy, space that will have a high percentage of sig-
60 Office of Science, Office of Advanced Scientific Comput- nal events. Typically this is done by using a
61 ing Research, Quantum Algorithm Teams Program. DJS series of simple requirements on the kinematic
62
63
64
65
107
1
2
quantities of the final state particles, for ex- 20x20 pixel box while preserving their aspect
3
4 ample having one or more leptons with large ratio. The resulting images contain grey lev-
5 amounts of momentum that is transverse to els as a result of the anti-aliasing technique
6 the beam line ( pT ). Here instead we will used by the normalization algorithm. the im-
7 use logistic regression in order to attempt to ages were centered in a 28x28 image by com-
8 find out the relative probability that an event puting the center of mass of the pixels, and
9 is from a signal or a background event and translating the image so as to position this
10 rather than using the kinematic quantities of point at the center of the 28x28 field.
11 final state particles directly we will use the
12 output of our logistic regression to define a The MNIST is often included by default in many modern
13 part of phase space that is enriched in sig- ML packages.
14 nal events. The dataset we are using has the
15 value of 18 kinematic variables ("features") of
16 the event. The first 8 features are direct mea- REFERENCES
17 surements of final state particles, in this case
18 the pT , pseudo-rapidity, and azimuthal angle Abu-Mostafa, Yaser S, Malik Magdon-Ismail, and Hsuan-
19 of two leptons in the event and the amount Tien Lin (2012), Learning from data, Vol. 4 (AMLBook
20
of missing transverse momentum (MET) to- New York, NY, USA:).
21 Ackley, David H, Geoffrey E Hinton, and Terrence J Se-
gether with its azimuthal angle. The last ten
22 jnowski (1987), “A learning algorithm for boltzmann ma-
23 features are functions of the first 8 features; chines,” in Readings in Computer Vision (Elsevier) pp.
24 these are high-level features derived by physi- 522–533.
25 cists to help discriminate between the two Adam, Alison (2006), Artificial knowing: Gender and the
26 classes. You can think of them as physicists thinking machine (Routledge).
attempt to use non-linear functions to classify Advani, Madhu, and Surya Ganguli (2016), “Statistical me-
27
signal and background events and they have chanics of optimal convex inference in high dimensions,”
28
Physical Review X 6 (3), 031034.
29 been developed with a lot of deep thinking Advani, Madhu, Subhaneil Lahiri, and Surya Ganguli (2013),
30 on the part of physicist. There is however, “Statistical mechanics of complex neural systems and high
31 an interest in using deep learning methods to dimensional data,” Journal of Statistical Mechanics: The-
32 obviate the need for physicists to manually ory and Experiment 2013 (03), P03014.
33 develop such features. Benchmark results us- Advani, Madhu S, and Andrew M Saxe (2017), “High-
34 ing Bayesian Decision Trees from a standard dimensional dynamics of generalization error in neural net-
35 physics package and 5-layer neural networks works,” arXiv preprint arXiv:1710.03667.
36 Aitchison, Laurence, Nicola Corradi, and Peter E Latham
and the dropout algorithm are presented in
37 (2016), “Zipfs law arises naturally when there are under-
the original paper to compare the ability of lying, unobserved variables,” PLoS computational biology
38 deep-learning to bypass the need of using such
39 12 (12), e1005110.
high level features. We will also explore this Albarran-Arriagada, F, J. C. Retamal, E. Solano, and
40
topic in later notebooks. The dataset con- L. Lamata (2018), “Measurement-based adapta-
41 tion protocol with quantum reinforcement learning,”
42 sists of 5 million events, the first 4,500,000 of
which we will use for training the model and arXiv:1803.05340 .
43 Alemi, Alexander A, Ian Fischer, Joshua V Dillon, and Kevin
the last 500,000 examples will be used as a
44 Murphy (2016), “Deep variational information bottleneck,”
45 test set. arXiv preprint arXiv:1612.00410.
46 Alemi, Alireza, and Alia Abbara (2017), “Exponential capac-
47 ity in an autoencoder neural network with a hidden layer,”
48 3. MNIST Dataset arXiv preprint arXiv:1705.07441 .
49 Amit, Daniel J (1992), Modeling brain function: The world of
50 The MNIST dataset is one of the simplest and most attractor neural networks (Cambridge university press).
51 widely used Machine Learning Datasets. The MNIST Amit, Daniel J, Hanoch Gutfreund, and Haim Sompolinsky
dataset consists of hand-written images of numerical (1985), “Spin-glass models of neural networks,” Physical
52
Review A 32 (2), 1007.
53 characters 0−9 and consists of a training set of 60,000 ex-
Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and
54 amples, and a test set of 10,000 examples (LeCun et al., Michael I Jordan (2003), “An introduction to mcmc for
55 1998a). Information about the MNIST database and its machine learning,” Machine learning 50 (1-2), 5–43.
56 historical importance can be found at Yann Lecun’s wed- Arai, Shunta, Masayuki Ohzeki, and Kazuyuki Tanaka
57 site: http://yann.lecun.com/exdb/mnist/. A brief (2017), “Deep neural network detects quantum phase tran-
58 description from the website: sition,” arXiv preprint arXiv:1712.00371 .
59 Arimoto, Suguru (1972), “An algorithm for computing the
60 The original black and white (bilevel) images capacity of arbitrary discrete memoryless channels,” IEEE
61 from NIST were size normalized to fit in a Transactions on Information Theory 18 (1), 14–20.
62
63
64
65
108
1
2
Arsenault, Louis-François, Alejandro Lopez-Bezanilla, Bengio, Yoshua (2012), “Practical recommendations for
3
O Anatole von Lilienfeld, and Andrew J Millis (2014), gradient-based training of deep architectures,” in Neural
4 “Machine learning for many-body physics: the case of the networks: Tricks of the trade (Springer) pp. 437–478.
5 anderson impurity model,” Physical Review B 90 (15), Bennett, Robert (1969), “The intrinsic dimensionality of sig-
6 155136. nal collections,” IEEE Transactions on Information Theory
7 Arunachalam, Srinivasan, and Ronald de Wolf (2017), 15 (5), 517–525.
8 “A survey of quantum learning theory,” arXiv preprint Bény, Cédric (2018), “Inferring relevant features: from qft to
9 arXiv:1701.06806 . pca,” arXiv preprint arXiv:1802.05756 .
10 August, Moritz, and José Miguel Hernández-Lobato (2018), Berger, James O, and José M Bernardo (1992), “On the devel-
11 “Taking gradients through experiments: Lstms and mem- opment of the reference prior method,” Bayesian statistics
12 ory proximal policy optimization for black-box quantum 4, 35–60.
13 control,” arXiv:1802.04063. Bickel, Peter J, and David A Freedman (1981), “Some asymp-
14 Aurisano, A, A Radovic, D Rocco, A Himmel, MD Messier, totic theory for the bootstrap,” The Annals of Statistics ,
E Niner, G Pawloski, F Psihas, A Sousa, and P Vahle 1196–1217.
15
(2016), “A convolutional neural network neutrino event Bickel, Peter J, Bo Li, Alexandre B Tsybakov, Sara A van de
16 classifier,” Journal of Instrumentation 11 (09), P09001. Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan,
17 Baireuther, P, TE O’Brien, B Tarasinski, and CWJ and Aad van der Vaart (2006), “Regularization in statis-
18 Beenakker (2017), “Machine-learning-assisted correction tics,” Test 15 (2), 271–344.
19 of correlated qubit errors in a topological code,” arXiv Bishop, C M (2006), Pattern recognition and machine learning
20 preprint arXiv:1705.07855 . (springer).
21 Baity-Jesi, M, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, Bishop, Chris M (1995a), “Training with noise is equivalent to
22 C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli (2018), tikhonov regularization,” Neural computation 7 (1), 108–
23 “Comparing dynamics: Deep neural networks versus glassy 116.
24 systems,” . Bishop, Christopher M (1995b), Neural networks for pattern
25 Baldassi, Carlo, Federica Gerace, Hilbert J Kappen, Carlo recognition (Oxford university press).
26 Lucibello, Luca Saglietti, Enzo Tartaglione, and Riccardo Blahut, Richard (1972), “Computation of channel capacity
Zecchina (2017), “On the role of synaptic stochasticity and rate-distortion functions,” IEEE transactions on Infor-
27
in training low-precision neural networks,” arXiv preprint mation Theory 18 (4), 460–473.
28 arXiv:1710.09825 . Bottou, Léon (2012), “Stochastic gradient descent tricks,” in
29 Baldassi, Carlo, Federica Gerace, Luca Saglietti, and Ric- Neural networks: Tricks of the trade (Springer) pp. 421–
30 cardo Zecchina (2018), “From inverse problems to learning: 436.
31 a statistical mechanics approach,” in Journal of Physics: Bowman, Samuel R, Luke Vilnis, Oriol Vinyals, Andrew M
32 Conference Series, Vol. 955 (IOP Publishing) p. 012001. Dai, Rafal Jozefowicz, and Samy Bengio (2015), “Gener-
33 Baldi, Pierre, Peter Sadowski, and Daniel Whiteson (2014), ating sentences from a continuous space,” arXiv preprint
34 “Searching for exotic particles in high-energy physics with arXiv:1511.06349.
35 deep learning,” Nature communications 5, 4308. Boyd, Stephen, and Lieven Vandenberghe (2004), Convex
36 Barber, David (2012), Bayesian reasoning and machine learn- optimization (Cambridge university press).
37 ing (Cambridge University Press). Bradde, Serena, and William Bialek (2017), “Pca meets rg,”
38 Barnes, Josh, and Piet Hut (1986), “A hierarchical o (n log Journal of Statistical Physics 167 (3-4), 462–475.
n) force-calculation algorithm,” nature 324 (6096), 446. Breiman, Leo (1996), “Bagging predictors,” Machine learning
39
Barra, Adriano, Alberto Bernacchia, Enrica Santucci, and 24 (2), 123–140.
40 Pierluigi Contucci (2012), “On the equivalence of hopfield Breiman, Leo (2001), “Random forests,” Machine learning
41 networks and boltzmann machines,” Neural Networks 34, 45 (1), 5–32.
42 1–9. Breuckmann, Nikolas P, and Xiaotong Ni (2017), “Scalable
43 Barra, Adriano, Giuseppe Genovese, Peter Sollich, and neural network decoders for higher dimensional quantum
44 Daniele Tantari (2016), “Phase transitions in restricted codes,” arXiv preprint arXiv:1710.09489 .
45 boltzmann machines with generic priors,” arXiv preprint Broecker, Peter, Fakher F Assaad, and Simon Trebst (2017),
46 arXiv:1612.03132 . “Quantum phase recognition via unsupervised machine
47 Barra, Adriano, Giuseppe Genovese, Daniele Tantari, and Pe- learning,” arXiv preprint arXiv:1707.00663 .
48 ter Sollich (2017), “Phase diagram of restricted boltzmann Bromley, Thomas R, and Patrick Rebentrost (2018),
49 machines and generalised hopfield networks with arbitrary “Batched quantum state exponentiation and quantum heb-
50 priors,” arXiv preprint arXiv:1702.05882 . bian learning,” arXiv:1803.07039 .
Battiti, Roberto (1992), “First-and second-order methods for Bukov, Marin (2018), “Reinforcement learning for au-
51
learning: between steepest descent and newton’s method,” tonomous preparation of floquet-engineered states: Invert-
52 Neural computation 4 (2), 141–166. ing the quantum kapitza oscillator,” Phys. Rev. B 98,
53 Benedetti, Marcello, John Realpe-Gómez, Rupak Biswas, 224305.
54 and Alejandro Perdomo-Ortiz (2016), “Quantum-assisted Bukov, Marin, Alexandre G. R. Day, Dries Sels, Phillip Wein-
55 learning of graphical models with arbitrary pairwise con- berg, Anatoli Polkovnikov, and Pankaj Mehta (2018), “Re-
56 nectivity,” arXiv preprint arXiv:1609.02542 . inforcement learning in different phases of quantum con-
57 Benedetti, Marcello, John Realpe-Gómez, and Alejandro trol,” Phys. Rev. X 8, 031086.
58 Perdomo-Ortiz (2017), “Quantum-assisted helmholtz ma- Burges, Christopher JC (1998), “A tutorial on support vector
59 chines: A quantum-classical deep learning framework for machines for pattern recognition,” Data mining and knowl-
60 industrial datasets in near-term devices,” arXiv preprint edge discovery 2 (2), 121–167.
61 arXiv:1708.09784 . Caio, Marcello D, Marco Caccin, Paul Baireuther, Timo
62
63
64
65
109
1
2
Hyart, and Michel Fruchart (2019), “Machine learning as- the expressive power of deep learning: A tensor analysis,”
3
sisted measurement of local topological invariants,” arXiv in Conference on Learning Theory, pp. 698–728.
4 preprint arXiv:1901.03346 . Colabrese, Simona, Kristian Gustavsson, Antonio Celani,
5 Caldeira, J, WLK Wu, B Nord, C Avestruz, S Trivedi, and and Luca Biferale (2017), “Flow navigation by smart mi-
6 KT Story (2018), “Deepcmb: Lensing reconstruction of the croswimmers via reinforcement learning,” Physical review
7 cosmic microwave background with deep neural networks,” letters 118 (15), 158004.
8 arXiv preprint arXiv:1810.01483 . Cossu, Guido, Luigi Del Debbio, Tommaso Giani, Ava Kham-
9 Canabarro, Askery, Samuraí Brito, and Rafael Chaves (2018), seh, and Michael Wilson (2018), “Machine learning deter-
10 “Machine learning non-local correlations,” arXiv preprint mination of dynamical parameters: The ising model case,”
11 arXiv:1808.07069 . arXiv preprint arXiv:1810.11503 .
12 Cárdenas-López, FA, L Lamata, JC Retamal, and E Solano Cox, Trevor F, and Michael AA Cox (2000), Multidimen-
13 (2017), “Generalized quantum reinforcement learning with sional scaling (CRC press).
14 quantum technologies,” arXiv preprint arXiv:1709.07848 . Cristoforetti, Marco, Giuseppe Jurman, Andrea I Nardelli,
Carleo, Giuseppe (2018), , Private Communication. and Cesare Furlanello (2017), “Towards meaningful physics
15
Carleo, Giuseppe, Yusuke Nomura, and Masatoshi Imada from generative models,” arXiv preprint arXiv:1705.09524
16 (2018), “Constructing exact representations of quantum .
17 many-body systems with deep neural networks,” arXiv Dahl, George, Abdel-rahman Mohamed, Geoffrey E Hinton,
18 preprint arXiv:1802.09558 . et al. (2010), “Phone recognition with the mean-covariance
19 Carleo, Giuseppe, and Matthias Troyer (2017), “Solving the restricted boltzmann machine,” in Advances in neural in-
20 quantum many-body problem with artificial neural net- formation processing systems, pp. 469–477.
21 works,” Science 355 (6325), 602–606. Daskin, Ammar (2018), “A quantum implementation model
22 Carrasquilla, Juan, and Roger G Melko (2017), “Machine for artificial neural networks,” Quanta , 7–18.
23 learning phases of matter,” Nature Physics 13 (5), 431. Davaasuren, Amarsanaa, Yasunari Suzuki, Keisuke Fujii,
24 Carrasquilla, Juan, Giacomo Torlai, Roger G Melko, and Le- and Masato Koashi (2018), “General framework for con-
25 andro Aolita (2018), “Reconstructing quantum states with structing fast and near-optimal machine-learning-based de-
26 generative models,” arXiv preprint arXiv:1810.10584 . coder of the topological stabilizer codes,” arXiv preprint
Chalk, Matthew, Olivier Marre, and Gasper Tkacik (2016), arXiv:1801.04377 .
27
“Relevant sparse codes with variational information bottle- Day, Alexandre GR, Marin Bukov, Phillip Weinberg, Pankaj
28 neck,” in Advances in Neural Information Processing Sys- Mehta, and Dries Sels (2019), “Glassy phase of opti-
29 tems, pp. 1957–1965. mal quantum control,” Physical Review Letters 122 (2),
30 Chamberland, Christopher, and Pooya Ronagh (2018), “Deep 020601.
31 neural decoders for near term fault-tolerant experiments,” Day, Alexandre GR, and Pankaj Mehta (2018), “Validated
32 arXiv preprint arXiv:1802.06441 . agglomerative clustering,” in preparation.
33 Chechik, Gal, Amir Globerson, Naftali Tishby, and Yair Decelle, Aurélien, Giancarlo Fissore, and Cyril Furtlehner
34 Weiss (2005), “Information bottleneck for gaussian vari- (2017), “Spectral learning of restricted boltzmann ma-
35 ables,” Journal of machine learning research 6 (Jan), 165– chines,” arXiv preprint arXiv:1708.02917 .
36 188. Decelle, Aurélien, Giancarlo Fissore, and Cyril Furtlehner
37 Chen, Chunlin, Daoyi Dong, Han-Xiong Li, Jian Chu, and (2018), “Thermodynamics of restricted boltzmann ma-
38 Tzyh-Jong Tarn (2014), “Fidelity-based probabilistic q- chines and related learning dynamics,” arXiv preprint
learning for control of quantum systems,” IEEE transac- arXiv:1803.01960 .
39
tions on neural networks and learning systems 25 (5), 920– Dempster, Arthur P, Nan M Laird, and Donald B Rubin
40 933. (1977), “Maximum likelihood from incomplete data via the
41 Chen, Jing, Song Cheng, Haidong Xie, Lei Wang, and Tao Xi- em algorithm,” Journal of the royal statistical society. Series
42 ang (2018), “Equivalence of restricted boltzmann machines B (methodological) , 1–38.
43 and tensor network states,” Phys. Rev. B 97, 085104. Deng, Dong-Ling, Xiaopeng Li, and S Das Sarma (2017),
44 Chen, Jun-Jie, and Ming Xue (2019), “Manipulation of spin “Quantum entanglement in neural network states,” Physi-
45 dynamics by deep reinforcement learning agent,” arXiv cal Review X 7 (2), 021021.
46 preprint arXiv:1901.08748 . Dietterich, Thomas G, et al. (2000), “Ensemble methods in
47 Chen, Tianqi, and Carlos Guestrin (2016), “Xgboost: A scal- machine learning,” Multiple classifier systems 1857, 1–15.
48 able tree boosting system,” in Proceedings of the 22nd acm Do, Chuong B, and Serafim Batzoglou (2008), “What is the
49 sigkdd international conference on knowledge discovery and expectation maximization algorithm?” Nature biotechnol-
50 data mining (ACM) pp. 785–794. ogy 26 (8), 897–899.
Cheng, Song, Jing Chen, and Lei Wang (2017), “Information Domingos, Pedro (2012), “A few useful things to know about
51
perspective to probabilistic modeling: Boltzmann machines machine learning,” Communications of the ACM 55 (10),
52 versus born machines,” arXiv preprint arXiv:1712.04144 . 78–87.
53 Ch’ng, Kelvin, Nick Vazquez, and Ehsan Khatami Donoho, David L (2006), “Compressed sensing,” IEEE Trans-
54 (2017), “Unsupervised machine learning account of mag- actions on information theory 52 (4), 1289–1306.
55 netic transitions in the hubbard model,” arXiv preprint Dreyfus, Hubert L (1965), “Alchemy and artificial intelli-
56 arXiv:1708.03350 . gence,”.
57 Ciliberto, Carlo, Mark Herbster, Alessandro Davide Ialongo, Du, Simon S, Chi Jin, Jason D Lee, Michael I Jordan, Aarti
58 Massimiliano Pontil, Andrea Rocchetto, Simone Severini, Singh, and Barnabas Poczos (2017), “Gradient descent can
59 and Leonard Wossnig (2017), “Quantum machine learning: take exponential time to escape saddle points,” in Advances
60 a classical perspective,” . in Neural Information Processing Systems, pp. 1067–1077.
61 Cohen, Nadav, Or Sharir, and Amnon Shashua (2016), “On Duchi, John, Elad Hazan, and Yoram Singer (2011), “Adap-
62
63
64
65
110
1
2
tive subgradient methods for online learning and stochas- (2001), The elements of statistical learning, Vol. 1 (Springer
3
tic optimization,” Journal of Machine Learning Research series in statistics New York).
4 12 (Jul), 2121–2159. Friedman, Jerome H (2001), “Greedy function approximation:
5 Dunjko, Vedran, and Hans J Briegel (2017), “Machine learn- a gradient boosting machine,” Annals of statistics , 1189–
6 ing and artificial intelligence in the quantum domain,” 1232.
7 arXiv preprint arXiv:1709.02779 . Friedman, Jerome H (2002), “Stochastic gradient boosting,”
8 Dunjko, Vedran, Yi-Kai Liu, Xingyao Wu, and Jacob M Computational Statistics & Data Analysis 38 (4), 367–378.
9 Taylor (2017), “Super-polynomial and exponential im- Friedman, Jerome H, Bogdan E Popescu, et al. (2003), “Im-
10 provements for quantum-enhanced reinforcement learning,” portance sampled learning ensembles,” Journal of Machine
11 arXiv preprint arXiv:1710.11160 . Learning Research 94305.
12 Efron, B (1979), “Bootstrap methods: another look at the Fu, Michael C (2006), “Gradient estimation,” Handbooks in
13 jackknife annals of statistics 7: 1–26,” View Article Pub- operations research and management science 13, 575–616.
14 Med/NCBI Google Scholar. Funai, Shotaro Shiba, and Dimitrios Giataganas (2018),
Efron, Bradley, Trevor Hastie, Iain Johnstone, Robert Tib- “Thermodynamics and feature extraction by machine learn-
15
shirani, et al. (2004), “Least angle regression,” The Annals ing,” arXiv preprint arXiv:1810.08179 .
16 of statistics 32 (2), 407–499. Gao, Jun, Lu-Feng Qiao, Zhi-Qiang Jiao, Yue-Chi Ma, Cheng-
17 Eisen, Michael B, Paul T Spellman, Patrick O Brown, and Qiu Hu, Ruo-Jing Ren, Ai-Lin Yang, Hao Tang, Man-Hong
18 David Botstein (1998), “Cluster analysis and display of Yung, and Xian-Min Jin (2017), “Experimental machine
19 genome-wide expression patterns,” Proceedings of the Na- learning of quantum states with partial information,” arXiv
20 tional Academy of Sciences 95 (25), 14863–14868. preprint arXiv:1712.00456 .
21 Elith, Jane, Steven J Phillips, Trevor Hastie, Miroslav Dudík, Gao, Xun, and Lu-Ming Duan (2017), “Efficient representa-
22 Yung En Chee, and Colin J Yates (2011), “A statistical tion of quantum many-body states with deep neural net-
23 explanation of maxent for ecologists,” Diversity and distri- works,” arXiv preprint arXiv:1701.05039 .
24 butions 17 (1), 43–57. Gelman, Andrew, John B Carlin, Hal S Stern, David B Dun-
25 Ernst, Oliver K, Thomas Bartol, Terrence Sejnowski, and son, Aki Vehtari, and Donald B Rubin (2014), Bayesian
26 Eric Mjolsness (2018), “Learning dynamic boltzmann dis- data analysis, Vol. 2 (CRC press Boca Raton, FL).
tributions as reduced models of spatial chemical kinetics,” Gersho, Allen, and Robert M Gray (2012), Vector quantiza-
27
arXiv preprint arXiv:1803.01063 . tion and signal compression, Vol. 159 (Springer Science &
28 Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, Business Media).
29 et al. (1996), “A density-based algorithm for discovering Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006),
30 clusters in large spatial databases with noise.” in Kdd, “Extremely randomized trees,” Machine learning 63 (1),
31 Vol. 96, pp. 226–231. 3–42.
32 Faddeev, Ludvig D, and Victor N Popov (1967), “Feynman Glorot, Xavier, and Yoshua Bengio (2010), “Understanding
33 diagrams for the yang-mills field,” Physics Letters B 25 (1), the difficulty of training deep feedforward neural networks,”
34 29–30. in Proceedings of the Thirteenth International Conference
35 Finol, David, Yan Lu, Vijay Mahadevan, and Ankit on Artificial Intelligence and Statistics, pp. 249–256.
36 Srivastava (2018), “Deep convolutional neural networks Goldt, Sebastian, and Udo Seifert (2017), “Thermodynamic
37 for eigenvalue problems in mechanics,” arXiv preprint efficiency of learning a rule in neural networks,” arXiv
38 arXiv:1801.05733 . preprint arXiv:1706.09713 .
Fisher, Charles K, and Pankaj Mehta (2015a), “Bayesian fea- Goodfellow, Ian (2016), “Nips 2016 tutorial: Generative ad-
39
ture selection for high-dimensional linear regression via the versarial networks,” arXiv preprint arXiv:1701.00160.
40 ising approximation with applications to genomics,” Bioin- Goodfellow, Ian, Yoshua Bengio, and Aaron
41 formatics 31 (11), 1754–1761. Courville (2016), Deep Learning (MIT Press)
42 Fisher, Charles K, and Pankaj Mehta (2015b), “Bayesian http://www.deeplearningbook.org.
43 feature selection with strongly regularizing priors maps to Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,
44 the ising model,” Neural computation 27 (11), 2411–2422. David Warde-Farley, Sherjil Ozair, Aaron Courville, and
45 Foreman, Sam, Joel Giedt, Yannick Meurice, and Judah Yoshua Bengio (2014), “Generative adversarial nets,” in Ad-
46 Unmuth-Yockey (2017), “Rg inspired machine learning for vances in neural information processing systems, pp. 2672–
47 lattice field theory,” arXiv preprint arXiv:1710.02079 . 2680.
48 Fösel, Thomas, Petru Tighineanu, Talitha Weiss, and Flo- Greplova, Eliska, Christian Kraglund Andersen, and Klaus
49 rian Marquardt (2018), “Reinforcement learning with neu- Mølmer (2017), “Quantum parameter estimation with a
50 ral networks for quantum feedback,” arXiv:1802.05267 . neural network,” arXiv preprint arXiv:1711.05238 .
Freitas, Nahuel, Giovanna Morigi, and Vedran Dun- Grisafi, Andrea, David M Wilkins, Gábor Csányi, and
51
jko (2018), “Neural network operations and susuki- Michele Ceriotti (2017), “Symmetry-adapted machine-
52 trotter evolution of neural network states,” arXiv preprint learning for tensorial properties of atomistic systems,”
53 arXiv:1803.02118 . arXiv preprint arXiv:1709.06757 .
54 Freund, Yoav, Robert Schapire, and Naoki Abe (1999), “A Han, Zhao-Yu, Jun Wang, Heng Fan, Lei Wang, and Pan
55 short introduction to boosting,” Journal-Japanese Society Zhang (2017), “Unsupervised generative modeling using
56 For Artificial Intelligence 14 (771-780), 1612. matrix product states,” arXiv preprint arXiv:1709.01662 .
57 Freund, Yoav, and Robert E Schapire (1995), “A desicion- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
58 theoretic generalization of on-line learning and an applica- (2015), “Delving deep into rectifiers: Surpassing human-
59 tion to boosting,” in European conference on computational level performance on imagenet classification,” in Proceed-
60 learning theory (Springer) pp. 23–37. ings of the IEEE international conference on computer vi-
61 Friedman, Jerome, Trevor Hastie, and Robert Tibshirani sion, pp. 1026–1034.
62
63
64
65
111
1
2
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun free energy differences,” Physical Review Letters 78 (14),
3
(2016), “Deep residual learning for image recognition,” in 2690.
4 Proceedings of the IEEE conference on computer vision and Jaynes, Edwin T (1957a), “Information theory and statistical
5 pattern recognition, pp. 770–778. mechanics,” Physical review 106 (4), 620.
6 Heimel, Theo, Gregor Kasieczka, Tilman Plehn, and Jen- Jaynes, Edwin T (1957b), “Information theory and statistical
7 nifer M Thompson (2018), “Qcd or what?” arXiv preprint mechanics. ii,” Physical review 108 (2), 171.
8 arXiv:1808.08979 . Jaynes, Edwin T (1968), “Prior probabilities,” IEEE Trans-
9 Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, actions on systems science and cybernetics 4 (3), 227–241.
10 Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Jaynes, Edwin T (1996), Probability theory: the logic of sci-
11 Alexander Lerchner (2016), “beta-vae: Learning basic vi- ence (Washington University St. Louis, MO).
12 sual concepts with a constrained variational framework,” Jaynes, Edwin T (2003), Probability theory: the logic of sci-
13 . ence (Cambridge university press).
14 Hinton, Geoffrey E (2002), “Training products of experts by Jeffreys, Harold (1946), “An invariant form for the prior prob-
minimizing contrastive divergence,” Neural computation ability in estimation problems,” Proceedings of the Royal
15
14 (8), 1771–1800. Society of London. Series A, Mathematical and Physical