
8/3/2019 Anyone Can Learn AI Using This Blog 072319.ipynb - Colaboratory

Anyone Can Learn Artificial Intelligence With This Blog.


Building Your First Deep Learning Neural Network:

A Simple, Illustrated Explanation That Skips No Steps, By David Code


If you want to learn AI but you're worried you don't have the math or the software background to master it, this blog is for you.
I just suffered through 18 months of hell as I taught myself AI with online courses and blogs, and it was like drinking from a fire hose--too much
information from too many experts with too many conflicting opinions, and my head was filled with self-doubt.
I didn't want any more experts to help me learn. I wanted one--just one--teacher. I wanted someone to take my hand and say, "OK Dave, here is what you
need to learn, in this order. I will teach you now, using pictures, funny stories, real-life examples, and plain English."
Now I would like to be that teacher for you.
Why me? Feel free to dismiss me because I don't have a background in math or programming, like the experts online. But I've done Yale and Princeton, I
was a travel writer for over 100 countries, I worked at "Saturday Night Live," and I'm an award-winning author. In other words, I know how to write to
communicate complex concepts and tell a story. I passionately love teaching, and I know a good teacher when I see one. Here are four great teachers,
thanks to whom I have mastered this deep learning material: Andrew Trask, Grant Sanderson, Kalid Azad, and my mentor Adam Koenig, a Stanford PhD
in aerospace engineering, and a gifted teacher.
Most importantly, I was in your exact shoes quite recently. Have you heard of "expert blindness"? It's when an expert teaches a subject like AI to a
rookie, but the expert has been an expert for so long that she forgets how a beginner sees the material. So the expert passes too quickly over complex
concepts that need to be broken down into bite-size, user-friendly pieces. Or the expert teaches without analogies, pictures or examples that would help
the rookie to grasp the concepts. The rookie ends up feeling frustrated and overwhelmed.
You see, every rookie thinks they want an expert to teach them AI. In fact, you actually don't want an expert. You want a teacher to teach you AI.
The best teacher is the person who just learned yesterday the stuff you are studying today, because he still remembers what he struggled with and how
he overcame it, and he can pass those shortcuts on to you. And that "he" would be me. I'm not an expert, but I have been told I am a good teacher--and
a passionate one.

How to Use This Blog to Learn Well


Reading this blog is not like reading a novel. You can't understand and retain everything with one reading. My math friends tell me they often have to
read math texts AT LEAST seven times before they begin to understand them. No joke.
https://colab.research.google.com/drive/1VdwQq8JJsonfT4SV0pfXKZ1vsoNvvxcH#scrollTo=X7jTTrBc4Fwx 1/77

Pro Tip: For a full Table of Contents, click on the nearby white ">" arrow on the little black square.
Wherever possible, I have used analogy, pictures, examples and geometric representations to teach the material. But this text is also mathematically
precise and rigorous. Please expect to read this material at least five times. Don't beat yourself up if you don't get it right away.
When I am studying difficult material, I always set an egg timer to beep every five minutes, to keep reminding me not to beat myself up in despair, but
rather to smile, be patient and persist. It really does get easier, please trust me. One of my favorite slogans is, "There's the task, and there's the drama
about the task." Let's endeavor to keep the drama about our task to a minimum!
Here's the big picture:

1. A neural network is a very popular, cutting-edge type of deep learning;
2. Deep learning is a subset of machine learning;
3. Machine learning is a subset of AI.

There are four major concepts that comprise deep learning. My goal in this blog is to equip you with a working grasp of these four concepts, which are
the fundamentals of deep learning:

1. Feed forward;
2. Gradient descent;
3. Global minimum;
4. Back propagation

You don't need any prior knowledge of the above four concepts. I will start by teaching them as simply as possible, using analogies and diagrams. Then
I will return to each of these four concepts over-and-over, fleshing out each concept in more detail with each pass I make at them. Think of this blog as
an "upward spiral" of learning as you gain more-and-more insight with each time we return to the concept. Here's an outline of what I'm going to teach in
five sections:
1) The Big Picture of Deep Learning: Example, Analogy, Diagram, Jokes
2) Create a Working Brain with 28 Lines of Code: Neurons and Synapses
3) Feed Forward: Making an Educated Guess, 60,000 Times
4) Learning from Trial-and-Error: Gradient Descent and the Global Minimum
5) Back Propagation: The Chain Rule
Let's begin with Section 1, a (gripping!) real-life example of using AI for marketing, to target the right ads to the right prospective buyers of (superlative!)
kitty litter, and we will weave this (charming) scenario through all of the concepts I teach below so you can see the practical application of the insights
you are learning.


Section 1 of 5: The Big Picture of Deep Learning:


1) Example, 2) Analogy, 3) Diagram, 4) Jokes

1.1) The Big Picture: An Example


Congratulations! Imagine you are the wealthy owner of a fabulous pet shop. One month ago, you launched a new kitty litter product, cleverly named,
"Litter Rip!" A big part of your success comes from your savvy use of AI. You have built a deep learning neural network to choose the right potential new
customers to whom you will send your targeted advertisements. You seek folks whose beloved felines would be pleased to, "let 'er rip," so to speak,
upon your new cat toilet.
Your secret weapon is your dataset. You have data from surveys of your pet shop customers in the past month since you started selling Litter Rip! in
your store. These customer surveys include their answers to four stunningly insightful questions:

1. Do you own a cat who poops?
2. Do you drink imported beer?
3. In the past month, have you visited our award-winning website, LitterRip.com?
4. In the past month, have you purchased Litter Rip! for your poopin' puss?

The answers to these four questions are known as "features" (characteristics) of your past customers.
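As a preview of the code we will build later, here is a minimal sketch (in Python with NumPy) of how such survey answers might be encoded as numbers. The particular rows below are hypothetical, invented just for illustration:

```python
import numpy as np

# Hypothetical responses from four past customers, one row per customer.
# Columns = the first three survey questions (1 = Yes, 0 = No):
#   [owns a pooping cat, drinks imported beer, visited LitterRip.com]
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [0, 1, 0]])

# Answers to question four ("Did you buy Litter Rip!?") -- The Actual Truth.
y = np.array([[1], [0], [1], [0]])

print(X.shape)  # (4, 3): four customers, three features each
```

Each row is one customer; each column is one feature.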
Key question: what makes an AI network so powerful? It can take the responses from your past customer surveys and use them to train itself to
accurately predict the buying behavior of future customers.
First, you will train your network by inputting the survey data from past, overjoyed customers and their responses to the first three of the above four
survey questions/features. The network uses this data to predict whether a given customer did indeed buy Litter Rip! The network then compares its
prediction to each of your past customers' answers to the fourth question. The survey responses to question four are your Test Set, acting as The
Actual Truth to which your network compares its predictions. For example, if your network predicts, "Yep, I bet this customer bought Litter Rip!" and that
customer's answer to Question Four was indeed Yes, then you have a successful neural network.
A network trains itself by a process of trial-and-error: the network will predict, then compare its predictions to the actual answers to Question Four, and
learn from its mistakes to improve over many iterations.
It's important to understand that a neural network always trains on one dataset in order to make predictions on another dataset. Once your network is
fabulous at predicting purchasers of Litter Rip! from the past customers' surveys, then you can turn it loose on your new database, a list of hot
prospects.
From your local veterinarian (who is secretly in love with you, you charmer...) you have obtained a fresh batch of surveys of people who have answered
the same first three questions, and your by-now-well-trained network will predict who best to send your targeted ad to. Pure genius! OMG, how do you
do it?
Let's move on to The Big Picture, Part 2: an analogy for how neural networks work:

1.2) The Big Picture: A Brain Analogy of Neurons and Synapses


Here is a diagram of the three-layer neural network we will build, which is laid out here in the common "neurons and synapses" format:


Let me help you get your bearings in this diagram. This is a three-layer, feed forward neural network. The input layer is on the left: the three circles
represent neurons (aka nodes or features. Important: "Features" is the general term, but I would translate "features" as characteristics; our network
happens to use our first three survey questions as features). So for now, when you look at that column of three questions in the diagram, think of it as
representing one customer's responses. Imagine the top circle on the left contains the answer to the question, "Do you own a cat who poops?" The
middle circle of the three contains the answer to, "Do you drink imported beer?" The bottom circle of the input layer is the feature, "Have you visited our
website, LitterRip.com?" So, if Customer One responded, "Yes/No/Yes" to the questions, the top circle would contain a 1, the middle circle would
contain a 0, and the bottom circle would contain a 1.
The synapses (all those lines connecting the circles) and the hidden layer are where the "brain" of the neural network does its "thinking." The single
circle on the right-hand side (the one still attached to four synapses) is the network's prediction, which says, "Based on the combination of features
input into me, here is the probability that this customer will indeed buy Litter Rip! (or not)."
The stand-alone circle on the right-hand side, labeled "y," contains The Actual Truth, which is each customer's response to the fourth survey question,
"Have you purchased Litter Rip! for your poopin' puss?" For the circle labeled "y" there are only two choices: "0" means No, they did not buy, and "1"
means Yes, they bought. Our neural network will output a prediction of probability, compare it to y to see how accurate it was, and learn from its
mistakes in the next iteration. Trial-and-error. Tens of thousands of times. In seconds.
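In code, that comparison is just a subtraction. A tiny sketch, using a hypothetical prediction of 0.13 for one customer:

```python
# One network guess vs. The Actual Truth for a single customer.
prediction = 0.13      # "13% chance this customer bought Litter Rip!"
actual_truth = 1       # question four said: Yes, they bought (y = 1)

error = actual_truth - prediction
print(error)           # 0.87 -- a big miss, so the next iteration will adjust
```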
To summarize, this diagram above is a very common way of depicting neural networks. In essence, it is an illustration of feed forward, which is the first
of our four major concepts. You might think the neurons are the most important part of this process, but that's where the analogy is a bit misleading. In
fact, it is the synapses that drive all four major concepts of deep learning. So the most important takeaway of this section by far is this: It is the
synapses that drive the creation of a prediction, which I represent below as a ping pong ball that rolls down into a bowl.
Below you will read about "pink ping pong prediction-balls." The synapses are the essential part of the feed forward that creates the pink prediction-ball
you will see below as I explain to you the second major concept, gradient descent.
However, before we get to our pink ping pong ball analogy, first I need to explain in more detail exactly what makes neural networks so powerful.


1.3) The Big Picture: The Analogy of Bowls and Balls


Wake up everybody! Here is a key point in the blog when I tell you what makes AI so powerful: A neural network uses probability to make incremental
improvements on its next prediction. This process takes learning by trial-and-error to a Whole. New. Level.
Here's how a human makes predictions: Imagine that you have a pile of past customer surveys on your desk, and next to it is another pile of surveys
from potential new customers, which your amorous veterinarian has given you. How can a human use the past customer surveys to predict the buying
behavior of future customers? You probably think, "Well, my fuzzy logic tells me there's probably no correlation between customers who drink imported
beer and customers who buy Litter Rip! And I also shuffled through the customer surveys, looking for recognizable patterns, and I would guess that the
customers who answered Yes to owning a cat and visiting the website probably did buy Litter Rip!"
Fine. Bravo. With only three survey questions, answered by only four customers, that is doable. But what if there are 40 questions? And 4,000 past
customers? How can a human decide which questions to prioritize as key factors in making accurate predictions? Humans are limited as to how many
numbers we can hold in our heads, and it's hard for us to quantify how probable it is that a given customer out of the 4,000 will buy Litter Rip! Is there
a 67% chance? 68%? Who the hell knows?
Here's how a neural network makes predictions: For its first prediction of whether the customer bought Litter Rip!, the network will not limit itself to a
blunt, black-and-white prediction such as, "1: Yes" or "0: No." Instead, the network predicts by assigning a number on a continuum between zero and
one--a probability. For example, 0.67 means, "There is a decent chance, 67%, that the customer bought Litter Rip!" Or 0.13 means, "There is a small
likelihood, 13%, that the customer bought Litter Rip!"
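How does a network produce a number between zero and one? One standard tool for this (and an assumption on my part about the code we build later) is the sigmoid function, which squashes any number onto the (0, 1) range. A quick sketch:

```python
import numpy as np

def sigmoid(x):
    """Squash any number onto the (0, 1) range -- a probability."""
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))             # 0.5 -- a coin flip
print(round(sigmoid(4), 2))   # 0.98 -- "very likely bought"
print(round(sigmoid(-4), 2))  # 0.02 -- "very likely did not"
```

Big positive inputs land near 1, big negative inputs near 0, and zero lands exactly in the middle.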
Here is why assigning a probability between zero and one is so brilliant: It doesn't matter if the computer's first prediction was a mile off. What matters
is that the network will compare its 0.13 to The Actual Truth (let's say for example that Yes, the customer did buy, which is expressed by the number
"1"). The network notes that its prediction was off by 0.87, which is a big error, so it makes adjustments. The network can keep all its numbers straight,
while juggling them all in the air at the same time, increasing some values and decreasing others, in order to find a slightly better balance of
combinations of questions that will yield a slightly more accurate prediction next time. Rinse-and-repeat tens of thousands of times until eventually, the
computer can confidently say, "I am pleased to announce that, now when I compare my prediction to The Actual Truth, the error is close to zero. I am
now predicting accurately."
Joy.
So now you know why a deep learning neural network is so powerful. It uses probability and trial-and-error learning to make incremental improvements
on its next prediction.
I have more good news. I can give you a simple, clear picture of what that process of trial-and-error learning looks like: The network's process of
learning by trial-and-error is like a ping pong ball rolling down the side of a cereal bowl and coming to rest at the bottom.
I explained above how a neural network specializes in a steady march of making a prediction, calculating its error, and improving its next prediction
until the error is reduced to almost zero. A network making predictions is analogous to a ping pong ball rolling down the side of a bowl, so pretend that
the bottom of the bowl is Utopia--a perfectly accurate prediction. The network's first prediction is like the starting location of the "prediction-ball." At the
moment of the second prediction, the ball has advanced a bit down the side of the bowl towards its eventual resting place at the (perfect) bottom. The
network's third prediction sees the ball yet further down the side of the bowl, and so on. See the diagram below. Each prediction of the network appears
as a new location for the ping pong ball as it rolls inexorably toward perfection at the bottom of the bowl.


The process of this prediction-ball rolling down and gaining in accuracy before coming to rest at perfection, the bottom of the bowl, is really a process
that consists of four steps. Let's take a quick-and-dirty overview of each step, but if you feel confused or overwhelmed, don't worry. I will teach you each
of these steps in detail below, but I want you to begin to get a vision of where we're headed.
Below I'll give you a picture in your mind for each step, followed by a simple explanation in plain English:

1. Feed Forward: Picture an old-fashioned 1960's IBM computer that fills a room, with punch cards being fed into one end and then fabulous
solutions spitting out the other end. The network takes the data from the survey's three questions and "feeds it through the computer" to arrive at
a prediction;


2. Global Minimum: Picture the above yellow bowl sitting on a table. The table's surface represents a near-perfect prediction with almost zero error,
so obviously the bottom of that bowl would be the closest point to that perfect prediction, with minimum error. Compared to the entire surface of
the bowl (the "global surface"), the bottom is closest to perfection. It has the "global minimum" of error.
Each time the network makes a better prediction, the pink prediction-ball rolls down the bowl's sides and approaches that global minimum of
error at the bottom. After each prediction, the network compares that prediction to survey question four, The Actual Truth. This is like measuring
how far the prediction-ball is from the bottom of the bowl at a given moment. The measure of this distance between the prediction and The Truth
is called finding the error. The network's goal with each prediction is to constantly reduce that error to a global minimum.
3. Back Propagation: Picture a circus juggler who can juggle 16 bowling pins of different sizes and weights. He is able to keep them all in the air at
the same time, even as he (magically) adjusts their size and weight on the fly. After making a prediction, the network then works backwards
through its previous prediction process to see what adjustments would lessen the error in the next prediction, thereby moving the ball down the
bowl; and
4. Gradient Descent: Picture the pink ping pong ball from the diagram above as it rolls down the side of a yellow bowl towards the bottom, the
perfect Global Minimum. The network is like a pink prediction-ball, and the surface of the yellow bowl is made up of every prediction the network
could possibly make. Gradient descent is like a prediction-ball rolling down the surface of the "bowl of predictions" to the bottom of the bowl
where the overall error in prediction is at a minimum (the Global Minimum).

Is that beginning to make sense? A little bit? Let me sum things up in different words:

1. Gradient Descent is the overall process of a network learning by trial-and-error until its predictions are accurate (i.e., with minimum error). It is
like a ping pong ball rolling to the bottom of a bowl;
2. Feed Forward means making a prediction. A prediction is like a freeze-frame photo of where the ball is located in the bowl at a given moment;
3. Global Minimum is the location of perfection, i.e., the bottom of the bowl where predictions have no error. Our goal is to get to the bowl's bottom.
The network compares its prediction to The Actual Truth to measure error, i.e., how far is the ball from the bowl's bottom?
4. Back Propagation means working backwards through our previous prediction to find out what went wrong and fix it. Back prop measures the
distance (i.e., error) from the table underneath the current location of the prediction-ball and figures out how to advance our current prediction a
little closer to the bottom of the bowl on the next try.
little closer to the bottom of the bowl on the next try.
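To make those four steps concrete, here is a hedged sketch of a tiny network in Python with NumPy. It is simplified to a single layer of weights (no hidden layer yet), and the data, seed, and loop count are hypothetical, but each of the four steps appears as one commented line:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical survey data: three answers per customer, question four as truth.
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0]])
y = np.array([[1], [0], [1], [0]])

np.random.seed(1)
weights = 2 * np.random.random((3, 1)) - 1   # "drop the ball" at a random spot

for _ in range(10000):
    prediction = sigmoid(X @ weights)             # 1. feed forward: make a guess
    error = y - prediction                        # 2. distance from the global minimum
    adjustment = X.T @ (error * prediction * (1 - prediction))  # 3. back propagation
    weights += adjustment                         # 4. gradient descent: roll downhill

print(prediction.round(2))   # close to [[1], [0], [1], [0]]
```

Trial-and-error, ten thousand times, in a fraction of a second.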

Remember: all we're trying to do for now is paint a rough picture with broad brush strokes, so don't worry at all if you haven't fully grasped things yet.
Before I start explaining things in detail, I want to take our bowl-with-ball analogy one step further. The above diagram of the pink ping pong ball in the
yellow bowl captures all four main steps of a neural network training process, but it's a little oversimplified. In the above diagram, notice that the
ball has one, simple and straight route down to the bottom of one simple bowl. That would be an accurate representation of a network if it contained
only one survey question, such as, "Do you have a cat?" and based its predictions entirely on that one question.
But, of course we want to find the best combination of our three survey questions in order to serve up our finest prediction. So, if our network is
experimenting with different combinations of survey questions in its trial-and-error iterations, what kind of a bowl would that look like? Answer: a lumpy
bowl--one with many bumps and dips. Something like this:

(taken with gratitude from Grant Sanderson Ch. 2)


You can see above that this red bowl has some serious bumps and dips. Why so lumpy?
Well, first we have to understand what that red bowl is made of. In the diagram it looks like it's made of plastic, but imagine instead that the red bowl is
made up of millions of red dots. Each red dot is a point in the 3D grid of X, Y and Z. And each point represents one possible combination of our survey
questions, as well as one possible hierarchy of which questions to prioritize in that combination to provide the network with a prediction. This lumpy
bowl gives us a lumpy landscape of all the possible permutations that could give us a prediction of The Actual Truth from survey question four.
Keep that image in your mind: the red bowl is a "landscape of all possible permutations," just like a mountain range has peaks and valleys and hills and
riverbeds. The network doesn't explore every possible permutation in the whole, wide universe--that would suck up a ton of computer time. Instead, it just
starts wherever you randomly "drop the prediction-ball" and works its way down to the "valley floor" of the global minimum. The network doesn't care
where it starts. It only cares about getting from its random starting point to the bottom of the bowl.

Next, before I explain the red bumps and dips it is important to focus on that dotted white line that begins at the top, right-hand corner of the red bowl.
The top of that dotted white line is our network's first prediction. Remember our pink ping pong prediction-ball from the yellow bowl above? Pretend it is
at the top of that dotted line, where our first prediction is located. The dotted line represents the path of predictions that our network is going to take to
get to the global minimum at the bottom of the bowl. Our pink prediction-ball is going to follow that white dotted path of many predictions down to the
place of minimum error and maximum accuracy in predicting.
But why will the path of our pink prediction-ball down this white dotted line of predictions be so crooked? The crooked path comes from the network's
constant experiments with which questions we should combine, and with how much each question should be weighted (emphasized), to produce the
best predictions with less-and-less error. The network's constant goal is to drive the error down as fast as possible, which means to drive the prediction
ball to the bottom of the red bowl as quickly as possible. So the network constantly takes the slope of the point on the dotted line right under the
prediction-ball in order to determine what direction from the ball's position has the steepest slope for the fastest path down. Since the slope changes
constantly, the path is crooked with turns. Here's a simple example:
So, at first glance a human would say that the questions, "Do you have a cat?" and "Have you visited our website?" are the obvious questions on which
to base our predictions. But which question should be a bigger factor in accurately predicting who bought our kitty litter? Is owning a cat a bigger
predictor? Is it 50/50 between the two questions?
The network wants to experiment with many different weightings of these two questions, to see what the best combination is for accurate predicting.
Each bump you see in the red bowl above represents a combination and a weighting of questions that is "on the wrong track" because each bump is
taking the network's "prediction ball" further from that global minimum at the bottom. Each dip in the bowl represents a balance of weighting among
the three questions that is "on the right track" because it's taking the prediction ball closer to the bottom.
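The "take the slope, roll downhill" idea is easiest to see with one weight and a perfectly smooth bowl. A toy sketch, where the bowl shape and step size are invented purely for illustration:

```python
# A toy error "bowl": error(w) = (w - 3)**2, whose bottom is at w = 3.
def slope(w):
    return 2 * (w - 3)       # the derivative tells us which way is downhill

w = 0.0                      # drop the ball anywhere; the start doesn't matter
for _ in range(50):
    w -= 0.1 * slope(w)      # roll a small step down the steepest slope

print(round(w, 3))           # 3.0 -- the ball has settled at the global minimum
```

With the lumpy red bowl the slope changes constantly, so the path twists and turns, but each step follows exactly this rule.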
Ah, but what if the network finds the perfect weighting of our three survey questions but its predictions are still around, say, 60% accurate? What do we
do then? Well, the network has another trick up its sleeve: inferred questions. Here is a silly example of a very important concept:
Let's go back to the question about drinking imported beer. Networks just keep trying different combos of questions. So, for example, perhaps only rich
people have money to burn on Heineken instead of Coors. And we all know that cat owners tend to be rich people with fabulous taste in pets, right? So
perhaps if the "Drink imported beer?" and "Own a cat?" questions are combined and weighted heavily in the network's calculations, the predictions will
improve. The inferred question here is, "Are rich people more likely to buy Litter Rip?" And we are making a (silly) inference that Heineken drinkers and
cat owners tend to be rich.
But, lo and behold, as the pink prediction-ball is stumbling around in the lumpy bowl, it rolls onto a bump just as the network is experimenting with this
inferred question of rich people loving Litter Rip! That suggests that this inferred question about rich customers is not helpful to improving prediction
accuracy. This may be because, for example, kitty litter is not a luxury good but rather a staple of life, like paper towels or socks. (Every citizen's cat has
a right to a clean bum, don't you think?) So in future iterations, the prediction ball will roll off the bump and away from the unhelpful inferred question
about rich customers. But it will also roll into dips of helpful inferred questions that bring it more quickly to the bottom of the bowl.
So, the above is just a silly example that illustrates how inferred questions are tested for their usefulness to a more accurate prediction. Humans use
fuzzy logic to explore inferred questions, but the network just tries out every possible permutation. If the experiment yields a slightly better prediction,
then it's a keeper. If it yields a worse prediction, the network discards it.

To sum up: each red dot on the surface of the lumpy red bowl is an example of the network experimenting with a particular combination of questions,
and with weighting that combination for a particular emphasis. Every dip suggests, "a step in the right direction," which the network will note and
preserve. Every bump suggests, "movement in the wrong direction," which the network will note and discard. The prediction-ball's path (the white dotted
line) staggers around a bit as the prediction-ball seeks dips and avoids bumps on its way to the bottom of the bowl, where perfect predictions live, and
error is almost zero. You might say the white dotted line is the most efficient path for improvement-after-improvement of the prediction-ball on its
journey down the red bowl's lumpy landscape of all possible permutations.
OK, now let's circle back once again and reexamine Feed Forward, Global Minimum, Back Propagation and Gradient Descent in further detail.
1.3.a) Feed Forward: punch cards being fed through a 1960's IBM computer

Remember: keep breathing! It's normal to feel overwhelmed at this point. Just read and re-read the material five times and don't beat yourself up in the
process. Boldly onward!

The goal of feed forward is to create a prediction. Think of each prediction as a new location for our prediction-ball as it descends the bowl towards the
error-free bottom. In the lumpy red bowl above, the first prediction our network makes about which customers bought kitty litter is represented by the
top-most point of the dotted, white line on the curvy, red surface, so let's pretend the prediction ball is sitting on that point. Feed forward is the process
that created the first prediction on which our prediction-ball now sits. The next dot of that dotted white line is the second prediction, so our prediction-
ball will then "roll" to that next point and stop until the third prediction moves it again.
What most AI courses and blogs forget to mention is that each location the prediction-ball sits on is defined by two coordinates on the white grid
underneath the curvy red bowl. This curvy red bowl is not drifting in space. It is sitting on a white grid with an X and a Y axis.
Do you remember the visual image of Global Minimum that I introduced above? Our original yellow bowl sitting on a tabletop? Well, that is the same
thing as this curvy red bowl sitting on this white grid. The white grid is our tabletop and the "place of perfect, error-free prediction" is the point on which
the warped red bowl actually sits. Note that the bowl's only point of contact with the tabletop is the very bottom-most dip in the bowl--the point where
the dotted white line ends--the global minimum.
So, each stop the prediction-ball makes in the red bowl--each prediction--is a place with 3 coordinates: the X and Y axes tell you the location on the grid
(tabletop), and the third, Z coordinate is the distance from the prediction-ball's location to the grid.
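In code, those three coordinates are just a function of two weights. Here is a hypothetical (and deliberately lumpy) error surface, invented purely for illustration:

```python
import numpy as np

# Two weights play the role of the X and Y grid coordinates;
# the returned value is the Z coordinate -- the error at that combination.
def error_surface(w1, w2):
    return (w1 - 1)**2 + (w2 + 2)**2 + 0.3 * np.sin(3 * w1)  # a bowl plus a ripple

# One "red dot": pick a particular combination/weighting of the questions...
w1, w2 = 0.5, 0.0
z = error_surface(w1, w2)   # ...and measure the bowl's height above the grid there
print(round(z, 3))
```

Every (w1, w2) pair on the grid has a height z above it, and all those heights together form the lumpy red bowl.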
But let's get out of the abstract as much as we can. What do these 3 coordinates in 3D space really represent in real life? I'm glad you asked, because
the answer is absolutely fascinating and super-important:
You may recall that each red dot on the surface of the lumpy red bowl is an example of the network experimenting with a particular combination of the
three survey questions, and weighting that combination with a particular emphasis. How does the network conduct that experiment? It uses the X and
Y coordinates to propose a particular location on the white grid which represents a particular combination of the survey questions. The X and Y
coordinates are essentially asking, "Hey, Z axis! How 'bout this combo with that emphasis?" The Z coordinate is telling us how much that particular
combination sucks--the Z coordinate is measuring error.
It makes sense when you think about it. We have already said that the bottom of the bowl, the part sitting right on the white grid, is a perfectly accurate
prediction, right? So every red dot in the bowl represents error, and the more distant that red point is from the white grid, the greater its error is.
This point is so important that it bears repeating: the white grid represents every combination of questions and emphasis that the network could try.
For each point on the white grid, the red point directly above it represents the amount of error that that particular combination has. In other words, each
red point on the lumpy red bowl represents the amount of error of the prediction directly below it. The red bowl is a "bowl of error." It is only at the
bottom of the bowl, where the red point touches the white grid, that there is no distance between red points and white points, and therefore no error.
For your convenience, here is the same diagram of the red bowl again:

Now, do you notice that yellow arrow with the cone-shaped pointer on top? That yellow arrow represents the Z axis, and the Z axis is a measure of error.
The yellow arrow is super-helpful, because it tells you the amount of your error so you can learn from it and try again to be more accurate next time.
From now on, I want you to imagine our pink prediction-ball on each dot of the white, dotted line, and the yellow arrow is always under the prediction
ball, and moves with it. The pink ball represents each prediction the network makes, and the yellow arrow under the ball measures the distance from
the prediction-ball to the white grid. It measures the error of each prediction. How does that work?
You may recall that the feed forward process takes the first three survey questions and combines them in different ways to make a prediction--one
number, a probability between zero and one. Imagine the prediction ball sits at the point that represents that number on the dotted white line of
predictions. We know the answer to the fourth survey question is "1," or "Yes," so the Actual Truth is that this customer did buy Litter Rip! The Actual
Truth is the global minimum, the target to strive for as we improve our predictions, and it is the place where the bottom of the red bowl meets the white
grid. So, the network subtracts that prediction probability number from one. In our complete numerical example below you will see that the network's
first prediction is 0.5, meaning there is a 50% chance the customer bought Litter Rip! The remainder is a measure of how much the prediction missed by--the error. The yellow arrow is the measure of that error. In a way you could say, "Actual Truth number minus prediction-ball number equals yellow arrow number." In our example, that's 1 - 0.5 = 0.5. The error (the length of the yellow arrow) is 0.5. (Error is always expressed as an absolute value--it cannot be negative.)
Does that make sense? Without the fourth survey question, you couldn't train a network, because it's The Actual Truth you need to test your predictions
against. The yellow arrow measured the error--how much the network missed The Actual Truth (which in this case is 1 or Yes). Our first prediction of
0.5 wasn't so accurate. A prediction of, say, 0.99 would have been very accurate.
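The arithmetic above is small enough to sketch in a couple of lines of Python (the numbers are the ones from the text):

```python
# "Actual Truth number minus prediction-ball number equals yellow arrow number"
actual_truth = 1.0   # Customer One really did buy Litter Rip!
prediction = 0.5     # the network's first feed forward guess
error = abs(actual_truth - prediction)  # absolute value: error cannot be negative
print(error)  # 0.5 -- the length of the yellow arrow
```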
Here's a super-key insight that will not make sense on your first read-through, so feel free to skim the following three paragraphs or skip ahead to section 1.3.b). But I include it here so you can have an "Aha!" moment in your subsequent read-throughs: the two orange arrows you see, representing the X and
Y axes on the grid, are actually the synapses syn0 and syn1 that you will learn about below. So for now, let's say syn0 is the X axis and equals 3 while
syn1 is the Y axis and equals 3.
Here's where it gets cool: remember Customer One's Yes/No/Yes responses to the rst three questions? The network takes that 1,0,1 of Customer One,
multiplies it by the numbers in the synapses of syn0 and syn1 (plus some other fancy jazz, which we'll get to later), and makes a prediction: we already
said it's 0.5. That's our first feed forward prediction, and it is where our prediction-ball sits on the white dotted line in the red bowl.
This is a perfect example of why the best teacher is someone who learned yesterday the material you are learning today. I only discovered the insight
that the red, gradient descent bowl was sitting on the darn grid of syn0 and syn1 after a year of studying gradient descent, because all the experts take
this point so for granted that they don't bother mentioning it. Here is a BIG chance for you to learn from my mistakes, and gain a super-key insight that
eluded me for over a year, even though it was right under my nose.
But wait! There's more...

1.3.b) Finding the Global Minimum


This brief section is mostly review, but I want to make sure you are clear on the global minimum before we tackle gradient descent in section 1.3.c).
You may recall that our goal is to train our network to find the fastest way to reduce error in our predictions. In other words, we want to make that yellow arrow shorter, which means we need the most efficient path for the prediction-ball to roll down from its starting point to the bottom-most dip of the warped, red bowl--the point that sits on the tabletop. That is Utopia, folks--the place where the yellow arrow, the Z axis, the error in our predictions, would equal (almost) zero in length (the global minimum), meaning our predictions have minimal error and therefore our network would be stunningly accurate. And the fastest, most efficient path for the prediction-ball to roll down the surface of the curvy red bowl is represented by that white dotted
line which hugs the surface of the red bowl. Once we have completed our training with the first dataset, then our network can accurately predict which
of the hot prospects our veterinarian gave us are most likely to buy Litter Rip!
So, consider our scenario now: our prediction-ball started at the top of that white, dotted line, where the yellow arrow, the amount of error in our first prediction, equals 0.5. How do we roll the ball down the bowl to get to the bottom, where the yellow arrow--the amount of error--is near zero? In other
words, how do we tweak our network so that the prediction-ball (currently at the X,Y coordinates of (3,3)) will move on the white grid to the point right
under the bowl's bottom, at (roughly) (3,0)? So far, we have only made one prediction based on Customer One's responses. How can we improve each
subsequent prediction, based on each customer's responses, until our prediction error is almost zero, the prediction-ball is at the bottom, and our
fabulous network is trained enough to make good use of our new dataset of hot prospects that were provided by the (smitten) veterinarian?
Finding that step-by-step path down the surface of the bowl--from the first, not-so-great prediction of 0.5 to the final, 60,000th prediction of (hopefully) 0.99999--is the process of Gradient Descent.

1.3.c) Gradient Descent and Back Propagation


My Dear Little Dears, How grateful am I that you are still with me? Extremely. Let us study together the Mother Of All Deep Learning, gradient descent.
Gradient descent is the fancy term for how a network learns by trial-and-error. Here is a definition for math nerds: gradient descent is the master plan to
change the weights of the synapses so as to maximize the reduction in error accomplished by each iteration.
Hmm. OK, well...here is a better explanation For Normal People: Roll the damn ball down the steepest slope to get to the bottom of the bowl as fast as
possible. Back propagation is the method used to compute the gradient, which is just a fancy word for slope. It is Back Propagation that tells you the
slope of the surface of the bowl under the prediction-ball.
Slope? Yes, my dears, it's ALL ABOUT SLOPE. All your fancy experts throw around the term, "gradient descent," but "gradient" just means slope. Finding
the slope of the bowl's surface, where our prediction-ball lies, tells us the direction the ball should roll to make the quickest descent to the bottom of the
bowl where error is closest to zero (i.e., where our ball would be touching the white grid plane).
But why slope? Why is good ol' rise-over-run the key to the Kingdom? Well, consider the process of gradient descent:
First the computer makes a feed forward prediction. It subtracts this prediction from The Actual Truth (where the global minimum lives, at the bottom
of the bowl) to get the error (the length of the yellow arrow). Then back propagation is used to compute the slope of the error. This slope number
determines what direction the prediction-ball should roll in, and how fast (i.e., how big an adjustment in numbers the network should make).
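Here is that loop in miniature. This is not our pet-shop network; it is a one-dimensional toy where the "bowl" is a simple parabola with its bottom at w = 3, just to show slope-following in action:

```python
# Toy gradient descent: error(w) = (w - 3)**2, global minimum at w = 3.
w = 0.0              # the prediction-ball's starting position
learning_rate = 0.1  # how big a step to take down the slope

for _ in range(100):
    slope = 2 * (w - 3)          # the gradient: direction and steepness of the bowl here
    w -= learning_rate * slope   # roll the ball a small step downhill

print(round(w, 4))  # 3.0 -- the ball has settled at the bottom of the bowl
```

Notice that the slope does double duty: its sign says which way to roll, and its size says how big a step to take.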
Finding slope, rise-over-run, is a key tool in back propagation. You might say that gradient descent is the master plan, and back propagation is the main
tool for achieving that plan. Here's what I mean:

Gradient Descent is the Master Plan


What, basically, does gradient descent do? It changes the weights to reduce the error in our prediction with each iteration.
So far I've been talking about how the network experiments with different combinations of survey questions, and different emphasis on those
questions, to produce the best possible prediction. But that's a little abstract, right? What is a weight in concrete terms?
I would say a weight is a number that quantifies how important a given question is to the prediction.
This is a very important concept. Weights are the values (the numbers) in our two groups of synapses (As you will see in Section 2 below, there are 12
in syn0 and 4 in syn1). It is these synapse numbers, the weights, that the network adjusts and plays with in its experiments with different combos of
questions and different weightings of emphasis within that combination of questions.
Each line you saw connecting neurons in the yellow diagram above is a synapse, aka a weight. If a weight is a big value, that means "this survey
question has a big impact on the network's prediction." In other words, this survey question is a big contributor to an accurate prediction. If a weight is
close to zero, that means this feature/question is largely irrelevant to making accurate predictions.
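To make "a weight is a number" concrete, here is a tiny illustration. The weight values below are invented for the example; the real network starts with random weights and adjusts them:

```python
import numpy as np

answers = np.array([1, 0, 1])        # Customer One: Yes / No / Yes
weights = np.array([0.9, 0.1, 0.4])  # made-up weights: question 1 matters most,
                                     # question 2 barely matters at all
weighted_sum = np.dot(answers, weights)  # 1*0.9 + 0*0.1 + 1*0.4 = 1.3
```

The big weight (0.9) lets its question dominate the sum, while the near-zero weight (0.1) makes its question nearly irrelevant--exactly the idea in the paragraph above.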

Back Propagation is the Main Tool for Achieving Gradient Descent


Back propagation is the method we use to compute the gradient, and that gradient value then tells us how much to re-prioritize each weight (i.e., each
relationship between feature questions) in the next iteration. Each point on the dotted line as the prediction-ball descends to the bottom of the bowl is
one iteration of the neural network.
So, what is an iteration? That's a key question. Think of it as one try, or "trial," in our trial-and-error learning process of training a network to predict
accurately. The first part of an iteration is our feed forward, which we saw above. It is the feed forward that sets our prediction-ball at its starting
position at the top of the white, dotted line, from whence it rolls down the curvy surface of our red bowl towards the bottom, which sits on the white
grid, where near-zero error lives (i.e., the global minimum). Each dot of that dotted white path represents a tweak, or update of the network's proposed
combination of survey questions, with a proposed balance of emphasis that yields a slightly better prediction.
Here's the key thing to visualize: the path of our prediction-ball as it rolls down the inside surface of our curvy red bowl is erratic. You can see that the
ball rolls over some bumps, rolls into some dips, changes direction suddenly, etc. To understand what's happening, let's start with our vertical axis first
and imagine that the yellow arrow (which is our vertical, Z coordinate) moves with the ball and always remains under the ball as it rolls erratically down
the side of the bowl. The yellow arrow therefore has many changes in length, because with each lurch the prediction-ball makes along the surface of
the red bowl, the yellow arrow lurches right along underneath the ball. So, as the ball gets closer to the bottom of the bowl, the yellow arrow gets
shorter and shorter until it is close to zero, where there is almost zero difference between our prediction and The Actual Truth. That means our
prediction is accurate. So when the yellow arrow equals zero, our horizontal coordinates on the X and Y axes must be right under the bottom of the
bowl (aka the global minimum).
The diagram at this link may also be helpful in envisioning the geometry of neural networks. It's essentially another version of the same, red bowl
above, but from a slightly different angle.
Let's conclude Section 1, The Big Picture with a summary:
Gradient Descent is the overall process of a network learning by trial-and-error until its predictions are accurate (i.e., with minimum error). It is like the
pink ping pong ball rolling down the side of the red, curvy bowl towards the bottom, the perfect Global Minimum. The network is like a pink prediction-
ball, and the surface of the red bowl is made up of every prediction the network could possibly make. Gradient descent is like the prediction-ball rolling down the surface of the "bowl of predictions" to the bottom of the bowl where the overall error in prediction is at a minimum (the Global Minimum).

Feed Forward reminds me of an old-fashioned 1960's IBM computer that lls a room, with punch cards being fed into one end and then a fabulous
prediction spitting out the other end. The network takes the data from the survey's three questions and "feeds it through the computer" to arrive at a
prediction. A prediction is like a freeze-frame photo of where the ball is located in the bowl at a given moment;
Global Minimum: Again, picture the red bowl sitting on a white table. The place where the bowl meets the table's surface represents a near-perfect
prediction with minimal error. Compared to the entire surface of the "bowl of error," (Think, "the global surface"), the bottom is closest to perfection. It
has the "global minimum" of error.
Each time the network makes a better prediction, the pink prediction-ball rolls down the bowl's sides and approaches that global minimum of error at
the bottom. After each prediction, the network compares that prediction to survey question four, The Actual Truth. This is like measuring how far the
prediction-ball is from the bottom of the bowl at a given moment. The measure of this distance between the prediction and The Truth is called finding
the error. The network's goal with each prediction is to constantly reduce that error to a global minimum.
Back Propagation: Picture a circus juggler who can juggle 16 bowling pins of different sizes and weights. He is able to keep them all in the air at the
same time, even as he (magically) adjusts their size and weight on the fly. After making a prediction, the network then works backwards through its previous prediction process to find out what went wrong and fix it. The network is asking the question, "What adjustments would lessen the error in the
next prediction, thereby moving the ball down the bowl to the global minimum?"
My goal so far was to give you a general understanding of how our neural network can train itself on past customer surveys from the pet shop. Next,
we're going to take a look under the hood and learn the code that makes our network learn by trial-and-error. Hopefully, now that you can see what a
neural network does in 3D, it will make it easier to understand why we do all the following abstract steps with math and with code.
Again: if you are still confused by the above diagrams and analogies, please don't worry at all. This is not a novel, and you are going to read it more than
once. With each step up the "upward spiral" staircase of knowledge, you will gain more and more insight. Godspeed, keep breathing deeply, and do not
beat yourself up!

Section 2 of 5: Creating a Working Brain with 28 Lines of Code


Now, let's get an overview of our code. I suggest you open two copies of this blog post in two side-by-side windows and show the code in the left
window while you scroll through my explanation of it in your right window. First, I'll show you the entire code we'll be studying today, and underneath
that is my detailed step-by-step explanation of what it does. As you see from the comments below (the lines beginning with a # that are interspersed
with the lines of actual code), I have broken this process of building a neural network down into 13 steps. Get ready for the wonder of watching a
computer learn from its mistakes and recognize patterns! We're about to give birth to our own little baby brain... :-)

But first, a word on the concept of a matrix and linear algebra...


You will see the word "matrix" or matrices in the code below. This is VERY important: the matrix is the engine of our car. Without matrices, a neural
network goes nowhere. At all.
A matrix is a set of rows of numbers. For a quick-and-dirty metaphor, think of a glorified Excel spreadsheet. Or a database with many rows and many columns of numbers. The first matrix you will now meet holds the data from our pet shop customer survey. It looks like this:

[1,0,1],
[0,1,1],
[0,0,1],
[1,1,1]

Think of each row as one customer, so there are four customers in the matrix above. Each row contains three numbers (1's and 0's for Yes and No)
between brackets, right? And you may recall in Section 1.2 above we already saw Customer One's responses of, "Yes/No/Yes" to the three survey
questions, hence the top row of 1, 0, 1 in the matrix above. And column one contains our four customers' responses to question one, "Do you own a cat?" (i.e., Customers Two and Three do not own a cat, hence the zeros). Now, to fill in a little more detail, here's the exact same matrix as above, but
with some labels and some color coding to emphasize how one customer's responses are a row and one feature/question is a column:

Hopefully, the above diagram clarifies a key point that really confused me at first: it is the relationship between rows of customers and columns of
features. You have to see a matrix as both. Let's break it down:
In our matrix, one customer's data is represented by a ROW of three numbers, right? And in a neural network diagram with all those circular neurons
connected by lines of synapses, the input layer is a column containing three circular "neurons," right? Well, it's important to notice that each neuron
does NOT represent a customer--a ROW of data. Rather, each neuron represents a FEATURE--a COLUMN of data. So, within a neuron, we have all the
customers' answers to the same question/feature, e.g., "Do you own a cat?" We're only charting four customers, so in my diagram above, you only see
four 1's and 0's that are responses to that question. But if we were charting 1,000,000 customers, that top neuron would contain one million 1's and 0's
representing each customer's Yes/No response to feature one, "Do you own a cat?"

So I hope it's becoming clear why we need matrices: Because we have more than one customer. In our toy neural network below, we describe four
customers, so we need four rows of numbers.
Our network also has more than one survey question. So we need one column per survey question (aka, per feature), thus we have three columns here
representing the responses to the first three questions of our survey (the fourth question appears in a different matrix that we'll see later).
So our matrix is tiny: 4 rows X 3 columns, known as a "4 by 3." But the matrices in real neural networks can have millions of customers and hundreds of
survey questions (features). Or the neural networks that do image recognition in photos or video can have billions of rows of "customers" and billions
of columns of features.
In sum, we need matrices to keep all our data straight while we do complex calculations on it, so a matrix organizes our data into nice, neat little rows
and columns (usually, not so little at all). Good enough for now? Let's move on.
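In NumPy (the math library the code below uses), our tiny 4-by-3 survey matrix looks like this:

```python
import numpy as np

# 4 customers (rows) x 3 survey questions (columns)
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

print(X.shape)   # (4, 3) -- a "4 by 3"
print(X[0])      # [1 0 1] -- Customer One's row of answers
print(X[:, 0])   # [1 0 0 1] -- every customer's answer to question one
```

Note how row indexing (`X[0]`) pulls out one customer, while column indexing (`X[:, 0]`) pulls out one feature--the row/column distinction from the diagram above.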
I am grateful for Andrew Trask's blog post from which the code below is taken (though the comments are mine). Display this in your left window:

IMPORTANT:
The comments interspersed below with the Python code are a good summary, but they are complex. Rookies should not feel intimidated if they don't
understand. All will be explained below in detail, so fear not-- later these comments in the code can serve as a nice reference if you need it in the future.
Also, note that I have inserted line numbers in front of the code, for easy reference. If you copy-and-paste this code, you'll have to delete the line numbers and adjust the spacing.
The Code:

#This is the "3 Layer Network" near the bottom of:


#http://iamtrask.github.io/2015/07/12/basic-python-network/

#First, housekeeping: import numpy, a powerful library of math tools.


5 import numpy as np

#1 Sigmoid Function: changes numbers to probabilities and computes confidence to use in gradient descent
8 def nonlin(x,deriv=False):
9 if(deriv==True):
10 return x*(1-x)
11
12 return 1/(1+np.exp(-x))

#2 The X Matrix: This is the responses to our survey from 4 of our customers,
#in language the computer understands. Row 1 is the first customer's set of
#Yes/No answers to the first 3 of our survey questions:
#"1" means Yes to, "Have cat who poops?" The "0" means No to "Drink imported beer?"
#The 1 for "Visited the LitterRip.com website?" means Yes. There are 3 more rows
#(i.e., 3 more customers and their responses) below that.
#Got it? That's 4 customers and their Yes/No responses
#to the first 3 questions (the 4th question is used in the next step below).
#These are the set of inputs that we will use to train our network.
23 X = np.array([[1,0,1],
24 [0,1,1],
25 [0,0,1],
26 [1,1,1]])
#3 The y Vector: Our testing set of 4 target values. These are our 4 customers' Yes/No answers
#to question four of the survey, "Actually purchased Litter Rip?" When our neural network
#outputs a prediction, we test it against their answer to this question 4, which
#is what really happened. When our network's
#predictions compare well with these 4 target values, that means the network is
#accurate and ready to take on our second dataset, i.e., predicting whether our
#hot prospects from the (hot) veterinarian will buy Litter Rip!
34 y = np.array([[1],
35 [1],
36 [0],
37 [0]])
#4 SEED: This is housekeeping. One has to seed the random numbers we will generate
#in the synapses during the training process, to make debugging easier.
40 np.random.seed(1)
#5 SYNAPSES: aka "Weights." These 2 matrices are the "brain" which predicts, learns
#from trial-and-error, then improves in the next iteration. If you remember the
#diagram of the curvy red bowl above, syn0 and syn1 are the
#X and Y axes on the white grid under the red bowl, so each time we tweak these
#values, we march the grid coordinates of Point A (think, "moving the yellow arrow")
#towards the red bowl's bottom, where error is near zero.
47 syn0 = 2*np.random.random((3,4)) - 1 # Synapse 0 has 12 weights, and connects l0 to l1.
48 syn1 = 2*np.random.random((4,1)) - 1 # Synapse 1 has 4 weights, and connects l1 to l2.

#6 FOR LOOP: this iterator takes our network through 60,000 predictions,
#tests, and improvements.
52 for j in range(60000):

#7 FEED FORWARD: Think of l0, l1 and l2 as 3 matrix layers of "neurons"


#that combine with the "synapses" matrices in #5 to predict, compare and improve.
# l0, or X, is the 3 features/questions of our survey, recorded for 4 customers.
57 l0 = X
58 l1 = nonlin(np.dot(l0,syn0))
59 l2 = nonlin(np.dot(l1,syn1))

#8 The TARGET values against which we test our prediction, l2, to see how much
#we missed it by. y is a 4x1 vector containing our 4 customer responses to question
#four, "Did you buy Litter Rip?" When we subtract the l2 vector (our first 4 predictions)
#from y (the Actual Truth about who bought), we get l2_error, which is how much
#our predictions missed the target by, on this particular iteration.
66 l2_error = y - l2

#9 PRINT ERROR--a parlor trick: in 60,000 iterations, j divided by 10,000 leaves


#a remainder of 0 only 6 times. We're going to check our data every 10,000 iterations
#to see if the l2_error (the yellow arrow of height under the white ball, Point A)
#is reducing, and whether we're missing our target y by less with each prediction.
72 if (j% 10000)==0:
73 print("Avg l2_error after 10,000 more iterations: "+str(np.mean(np.abs(l2_error))))

#10 This is the beginning of back propagation. All following steps share the goal of

# adjusting the weights in syn0 and syn1 to improve our prediction. To make our
# adjustments as efficient as possible, we want to address the biggest errors in our weights.
# To do this, we first calculate confidence levels of each l2 prediction by
# taking the slope of each l2 guess, and then multiplying it by the l2_error.
# In other words, we compute l2_delta by multiplying each error by the slope
# of the sigmoid at that value. Why? Well, the values of l2_error that correspond
# to high-confidence predictions (i.e., close to 0 or 1) should be multiplied by a
# small number (which represents low slope and high confidence) so they change little.
# This ensures that the network prioritizes changing our worst predictions first,
# (i.e., low-confidence predictions close to 0.5, therefore having steep slope).
88 l2_delta = l2_error*nonlin(l2,deriv=True)

#11 BACK PROPAGATION, continued: In Step 7, we fed forward our input, l0, through
#l1 into l2, our prediction. Now we work backwards to find what errors l1 had when
#we fed through it. l1_error is the difference between the most recent computed l1
#and the ideal l1 that would provide the ideal l2 we want. To find l1_error, we
#have to multiply l2_delta (i.e., what we want our l2 to be in the next iteration)
#by our last iteration of what we *thought* were the optimal weights (syn1).
# In other words, to update syn0, we need to account for the effects of
# syn1 (at current values) on the network's prediction. We do this by taking the
# product of the newly computed l2_delta and the current values of syn1 to give
# l1_error, which corresponds to the amount our update to syn0 should change l1 next time.
100 l1_error = l2_delta.dot(syn1.T)
#12 Similar to #10 above, we want to tweak this
#middle layer, l1, so it sends a better prediction to l2, so l2 will better
#predict target y. In other words, tweak the weights in order to produce large
#changes in low confidence values and small changes in high confidence values.

#To do this, just like in #10 we multiply l1_error by the slope of the
#sigmoid at the value of l1 to ensure that the network applies larger changes
#to synapse weights that affect low-confidence (e.g., close to 0.5) predictions for l1.
109 l1_delta = l1_error * nonlin(l1,deriv=True)

#13 UPDATE SYNAPSES: aka Gradient Descent. This step is where the synapses, the true
#"brain" of our network, learn from their mistakes, remember, and improve--learning!
# We multiply each delta by their corresponding layers to update each weight in both of our
#synapses so that our next prediction will be even better.
115 syn1 += l1.T.dot(l2_delta)
116 syn0 += l0.T.dot(l1_delta)

#Print results!
119 print("Our y-l2 error value after all 60,000 iterations of training: ")
120 print(l2)

2.1) The Sigmoid Function, Briefly Mentioned: lines 8-12


The sigmoid function plays a super-important role in making our network learn, but don't worry if you don't understand it all yet. This is only our first pass over the material. I'll explain it in detail below in Step 10. For now, just do your best:

"nonlin()" is a type of sigmoid function called a logistic function. Logistic functions are very commonly used in science, statistics, and probability. This
particular sigmoid function is written in a more complicated way than necessary here because it serves two functions:
Function #1 of 2 is to take a matrix (represented here by a small x) within its parentheses and convert each value to a number between 0 and 1 (aka a
statistical probability). This is done by line 12: return 1/(1+np.exp(-x))
Why do we need statistical probabilities? Well, remember that our network doesn't predict in just 0's and 1's, right? Our network's prediction doesn't
shout, "YES! Customer One WILL ABSOLUTELY buy Litter Rip! if she knows what's good for her!" Rather, our network predicts the probability: "There's a
74% chance Customer One will buy Litter Rip!"
This is an important distinction because if you predict in 0's and 1's, there's no way to improve. You're either right or wrong. Period. But with a
probability, there's room for improvement. You can tweak the system to increase or decrease that probability a few decimal points each time, so you
can improve your accuracy. It's a controlled, incremental process, rather than just blind guessing in the dark.
We will see below that this is very important, because this conversion to a number between zero and one gives us FOUR very big advantages. I will
discuss these four in detail below, but for now, just know that the sigmoid function converts every number in every matrix within its parentheses into a
number between 0 and 1 that falls somewhere on the S-curve illustrated here:

(taken with gratitude from: Andrew Trask)


So, Function #1 of the sigmoid function has converted each value in the matrix into a statistical probability.
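You can see Function #1 at work by calling nonlin() on its own. The function below is copied from the code above; the sample inputs are arbitrary numbers chosen to show the squashing:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Any input, however extreme, is squashed into a probability between 0 and 1:
probs = nonlin(np.array([-10.0, 0.0, 10.0]))
print(probs)  # roughly [0.0000454, 0.5, 0.9999546]
```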
Function #2 of 2 is found in lines 9 and 10:

if(deriv==True):
    return x*(1-x)


When called upon by deriv=True in line 9, this code takes each value in a given matrix and converts it into the slope at that value's point on the
sigmoid S-curve. This slope number is also known as a confidence measure. In other words, the number answers the question, "How confident are we
that this number correctly predicts an outcome?" You may wonder, so what? Well, our goal is a neural network that confidently makes accurate
predictions. The fastest way to achieve that goal is to fix the non-confident, wishy-washy, low-accuracy predictions, while leaving the accurate,
confident predictions alone. This concept of "confidence measures" is super-duper important, so we'll address it in depth below. For now, just
remember this image of wishy-washy, non-confident numbers.
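To see both jobs of the sigmoid in one place, here is the complete helper as runnable NumPy. The function name nonlin and the deriv flag match the code we've been quoting from lines 9 and 10:

```python
import numpy as np

def nonlin(x, deriv=False):
    # Job #2: when deriv=True, x is assumed to already be a sigmoid
    # output, so x*(1-x) is the slope of the S-curve at that point
    # (the "confidence measure" discussed above).
    if deriv == True:
        return x * (1 - x)
    # Job #1: squish any number down to a probability between 0 and 1.
    return 1 / (1 + np.exp(-x))

print(nonlin(0.0))             # 0.5: the exact middle of the S-curve
print(nonlin(0.5, deriv=True))  # 0.25: the steepest (least confident) slope
```

Notice that a sigmoid output of 0.5 (maximum wishy-washiness) produces the largest slope, 0.25, while confident outputs near 0 or 1 produce slopes near zero.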
Let's move on to Step 2:


2.2) Creating X input: Lines 23-26


Lines 23-26 create a 4x3 matrix of input values that we will use to train our network. X will become layer 0, or l0 of our network, so this is the beginning
of the "toy brain" we are creating!
This is our feature set from our customer surveys, in language the computer understands:

Line 23 creates the X input (which becomes l0, layer 0, in line 57)
X:
[1,0,1],
[0,1,1],
[0,0,1],
[1,1,1]

We have four customers who have answered our three questions. We have already discussed how Row 1 above is 101, which is Customer One's set of
Yes/No answers to our survey questions. Think of each row of this matrix as a training example we'll feed into our network, and each column is one
feature of our input. So our Matrix X can be visualized as the 4x3 matrix that is l0 in the diagram below:
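As runnable NumPy, line 23's matrix looks like this (the variable name X matches the script):

```python
import numpy as np

# The 4x3 input matrix: one row per customer (training example),
# one column per survey question (feature).
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

print(X.shape)  # (4, 3): 4 training examples, 3 features each
```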


2.3) Create y output: Lines 34-37

You may wonder, "How does Matrix X become layer 0 in the diagram above?" We'll get to that soon. Next, let's create our list of the four correct answers
we want our network to be able to predict.
These are our training targets: the Yes/No answers to the fourth question of our survey, namely, "Have you purchased Litter Rip!?" So, look at the column of
four numbers below, and you'll see that Customer One answered Yes, Customer Two answered Yes, and so on.

Line 34 creates the y vector, the set of target values we strive to predict.

y:
[1]
[1]
[0]
[0]

To use a metaphor, I also like to think of y as our "target" values, and I picture an archery target. As our network improves, its arrows hit closer-and-closer
to the bullseye. Once our network can correctly predict these 4 target values from the inputs provided by Matrix X above, it is then ready to make
predictions on the other database--the hot prospects from the loving veterinarian's survey.
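Here is line 34's target vector as runnable NumPy (the variable name y matches the script):

```python
import numpy as np

# Target values: did each of the four customers actually buy Litter Rip!?
# One answer per customer, stacked as a 4x1 column.
y = np.array([[1],
              [1],
              [0],
              [0]])

print(y.shape)  # (4, 1): one target per training example
```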

2.4) Seed your random numbers: Line 40


This step is housekeeping. We have to seed the random numbers we will generate in synapses/weights for the next step in our training process, to
make debugging easier. You don't have to understand how this code works, you just have to include it.
The reason we generate random numbers to populate the synapses is because one has to start somewhere. So we begin with a set of made-up
numbers, and then we tweak each of these numbers incrementally, over 60,000 iterations, until they produce predictions with the smallest possible
error. Seeding makes your tests repeatable (e.g. if you test with the same inputs multiple times, your results will be identical).
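A tiny runnable sketch of what seeding buys you (the seed value 1 here is my assumption; any fixed number gives the same effect):

```python
import numpy as np

np.random.seed(1)             # fix the starting point of the generator
a = np.random.random((3, 4))  # "random" 3x4 matrix

np.random.seed(1)             # reset to the same starting point...
b = np.random.random((3, 4))  # ...and you get the identical matrix

print(np.array_equal(a, b))  # True: seeded runs are repeatable
```

That repeatability is exactly what makes debugging bearable: if a run misbehaves, you can reproduce it precisely.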

2.5) Create the "Synapses" of your Brain--the Weights: Lines 47-48


When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain. In fact, the
real brain of a neural network, the part that actually learns and improves, is the synapses, those lines that connect the circles in this diagram. These 2
matrices, syn0 and syn1, are the brain of our network. They are the part of our network that learns by trial-and-error: making predictions, comparing them
to the target values in y, then improving the next prediction--learning!
Notice how this code, syn0 = 2*np.random.random((3,4)) - 1 creates a 3x4 matrix and seeds it with random values. This will be the first layer of
synapses, or weights, Synapse 0, which connects l0 to l1. It looks like the matrix below:

Line 47: syn0 = 2*np.random.random((3,4)) - 1


This line creates synapse 0, aka syn0:
[ 0.36 -0.28 0.32 -0.15]
[-0.48 0.35 0.25 -0.25]
[ 0.16 -0.66 -0.28 0.18]

Now, please learn from a mistake I made: I could not understand why syn0 should be a 3x4 matrix. I thought it should be a 4x3 matrix, because you
have to multiply the 4x3 l0 matrix by syn0, so don't we want all the numbers to line up nicely in the rows and columns?
But that was my mistake: thinking that multiplying a 4x3 by a 4x3 lined up the numbers nicely. Wrong. In fact, if we want our numbers to line up nicely,
we want to multiply a 4x3 by a 3x4. This is a fundamental and very important rule of matrix multiplication. Take a close look at the first neuron of the
now-familiar diagram below, "Do you own a cat who poops?" Now, consider:
Inside that neuron are the Yes/No responses from each of four customers. This is the first column of our 4x3 layer 0 matrix:

[1]
[0]
[0]
[1]

Got it? Now, notice there are four lines (synapses) that connect the "Cat who poops?" neuron with the four neurons of l1. That means that EACH of the
1,0,0,1 above has to be multiplied four times by the four different weights that connect "Cat who poops?" to l1. So, four numbers inside "Cat who
poops?" times four weights = 16 values, right? Yes, l1 is a matrix that is 4x4, so that makes sense.
And notice that we're going to do the exact same thing with the four numbers inside the second neuron, "Drink imported beer?" So that's also four
numbers times four weights = 16 values. And we add each of the 16 values to its corresponding value in the 4x4 we already created above.
Rinse and repeat a final time with the four numbers inside the third neuron, "Visited Litter Rip.com?" So our final 4x4 l1 matrix has 16 values, and each
of those values is the SUM of the three corresponding values from the three sets of multiplication we just completed.
Get it? 3 survey questions times 4 customers = 3 neurons times 4 synapses = 3 features times 4 weights = a 3x4 matrix.
Seems complicated? You'll get used to it. Besides, the computer does the multiplication for you. But I want you to understand what's going on beneath
the hood. When you look at a neural network diagram such as below, the lines don't lie (neither do Shakira's hips). Consider:
If there are four synapses connecting the "Cat who poops?" neuron to all four neurons of the next layer, that means you MUST multiply whatever is
inside "Cat who poops?" by four weights. In this kitty litter example, we know there are four numbers inside "Cat who poops?" Therefore, you know you'll
end up with a 4x4 matrix, and you know that to arrive there you have to multiply by a 3x4 matrix: i.e., the 3 nodes times the 4 synapses connecting each
node to the next layer of 4 neurons. Look this over for a while and study where each synapse begins and ends until you're clear on the pattern:


So, always remember that matrix multiplication requires the inner 2 "matrix size" numbers to match, e.g., a 4x3 matrix must be multiplied by a 3x_?_
matrix--in this case, a 3x4. See how those inner two numbers (in this case, 3) must be the same?
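You can verify that inner-numbers rule directly with np.dot. The ones matrices here are just stand-ins for shape-checking:

```python
import numpy as np

A = np.ones((4, 3))   # think of l0: 4 customers x 3 features
B = np.ones((3, 4))   # think of syn0: 3 features x 4 hidden neurons

C = np.dot(A, B)      # inner 3's match, so this works
print(C.shape)        # (4, 4): the outer sizes survive

# A 4x3 times another 4x3 fails, because the inner sizes (3 and 4) differ:
try:
    np.dot(A, np.ones((4, 3)))
except ValueError:
    print("shapes don't line up")
```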
And maybe you're wondering where the "2*" at the beginning of our equation, and the "-1" near the end come from? I wondered. Well, the function
np.random.random produces random numbers uniformly distributed between 0 and 1 (with a corresponding mean of 0.5). But we want this
initialization to have a mean of zero. Why? So that the initial weight numbers in this matrix do not have an a-priori bias towards values of 1 or 0,
because this would imply a confidence that we do not yet have (i.e. in the beginning, the network has no idea what is going on, so it should display no
confidence in its predictions until we update it after each iteration).
So, how do we convert a set of numbers with an average of 0.5 to a set with a mean of 0? We first double all the random numbers (resulting in a
distribution between 0 and 2 with mean 1), and then we subtract one (resulting in a distribution between -1 and 1 with mean 0). That's why you see 2*
at the beginning of our equation, and - 1 at the end. This changes the mean from 0.5 to 0. Nice: 2*np.random.random((3,4)) - 1
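You can check the shift yourself in a few lines (the seed here is arbitrary):

```python
import numpy as np

np.random.seed(1)
raw = np.random.random((3, 4))   # uniform on [0, 1), mean ~0.5
syn0 = 2 * raw - 1               # doubled then shifted: uniform on [-1, 1), mean ~0

print(raw.min() >= 0 and raw.max() < 1)      # True: raw values live in [0, 1)
print(syn0.min() >= -1 and syn0.max() < 1)   # True: shifted values live in [-1, 1)
```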
Moving on: this next line of code, syn1 = 2*np.random.random((4,1)) - 1 creates a 4x1 vector and seeds it with random values. This will be our
network's second layer of weights, Synapse 1, connecting l1 to l2. Meet syn1:

Line 48: syn1 = 2*np.random.random((4,1)) - 1


This line creates synapse 1, aka syn1:
[ 0.12]
[ 0.10]
[-0.63]
[-0.15]

It would be a good exercise for you to figure out what size the matrices have to be for this multiplication. We know l1 is 4x4. Why is syn1 a 4x1? Look at
the diagram: whatever numbers are in the top neuron of l1 have to be multiplied only once, right? Because there's only one line (weight) connecting
the top neuron of l1 to the single neuron of l2. And we know that the top neuron of l1 has four values in it, right? Therefore, 4 values x 1 weight = four
products, one per customer.
Rinse and repeat that process three more times for the other 3 neurons in l1, add the corresponding products across the four neurons, and presto: l2 is
a 4x1 matrix (known as a vector when there's only one column).
Again: always remember that those two inner "size numbers" have to match. A 4x3 matrix MUST be multiplied by a 3x_?_ matrix. A 4x4 matrix MUST be
multiplied by a 4x_?_ matrix, and so on.
2.6) For Loop: Line 52
You're doin' great. I gotta remind you again that you're a rock star.
This is a for loop that will take our network through 60,000 iterations. For each iteration, our network will take X, our input data of the customer survey
responses, and based on that data, give its best prediction of the probability that that customer purchased Litter Rip! It will then compare its prediction
to The Actual Truth, found in y, learn from its mistakes, and give a slightly better prediction on the next iteration. 60,000 times, until it has learned by
trial-and-error how to take the X input and predict accurately what the y target value is. Then our network will be ready to take any input data you give it
(such as the surveys from our loving veterinarian) and correctly predict which hot prospects should get a targeted ad!
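For readers who like to see the destination before the journey, here is a runnable sketch of that whole loop. The feed forward lines (l0, l1, l2) are the ones we'll walk through next; the update lines after l2_error are the back propagation step that Sections IV and V unpack. The seed value and variable names follow Andrew Trask's classic network, which this script is patterned on:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)          # slope of the S-curve (confidence measure)
    return 1 / (1 + np.exp(-x))     # squish to a probability between 0 and 1

np.random.seed(1)
X = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])  # survey answers
y = np.array([[1], [1], [0], [0]])                           # who bought?
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

for i in range(60000):
    # Feed forward: make this iteration's prediction.
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # Compare the prediction to The Actual Truth in y.
    l2_error = y - l2

    # Learn from the error (back propagation -- Sections IV and V).
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

print(np.mean(np.abs(l2_error)))  # shrinks toward 0 as the network learns
```

Don't worry about digesting the update lines yet; for now, just notice the predict-compare-improve rhythm repeating 60,000 times.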
We have now completed 2 of the 5 sections in this blog:
I. The Big Picture of Deep Learning: Example, Analogy, Diagram, Jokes
II. Create a Working Brain with 21 Lines of Code: Sigmoid Function, Input, Output, Layers, Synapses
In the remaining 3 sections, I will explain the above code in detail:
III. Feed Forward: Making an Educated Guess, 60,000 Times
IV. Learning from Trial-and-Error: Gradient Descent and Back Propagation
V. Back Propagation Revisited: The Chain Rule


Section 3 of 5: Feed Forward: Making an Educated Guess, 60,000 Times

(lines 57-59)
This is where our network makes its prediction. This is an exciting part of our deep learning process, so I'm going to teach this same concept from
three different perspectives:

1. First, I will tell you a spellbinding fairytale of feed forward;
2. Second, I will draw stunningly beautiful pictures of feed forward; and
3. Third, I will open up the hood and show you the matrix multiplication that is the engine of feed forward.

I'm Irish. Who doesn't love a good story? My mentor Adam Koenig suggested the following analogy, which I have ridiculously exaggerated into a
fairytale, because I am an artiste:

3.1) The Castle and The Meaning of Life: The Feed Forward Network
Imagine yourself as a neural network. You happen to be a neural network with a valid driver's license, and you're the type of neural network that enjoys
fast cars and a mystical spiritual journey. You eagerly wish to find The Meaning of Life. Well, Miracle of Miracles, you have just found out that if you
drive to a certain castle, your Mystical Oracle is waiting to meet you for the first time and tell you the meaning of life. Joy!
Needless to say, you're fairly motivated to find The Oracle's castle. After all, The Oracle represents The Actual Truth, the answer to the mystical
question, "Did you buy Litter Rip!?" How mystical! (In other words, if your prediction matched the Oracle's truth then you'd be at the castle, where there is
an l2 error of zero, and the yellow arrow with a height of zero, and the white ball arrives at the bottom of red bowl. They are all the same thing.)
Unfortunately, finding The Oracle's castle is going to require some patience and persistence, because you have already attempted to drive to the castle
thousands of times and you keep getting lost.
(Hint: thousands of trips = iterations. "keep getting lost" means you still have some error in your l2 prediction, which leaves you a distance from the
Castle of Truth, y.)
But there is fabulous news: you know that every day, with every driving trip, you're getting closer-and-closer to The Oracle (Ahh, the y vector in all its
loveliness--stunning...). The bad news is, alas, each time that you don't arrive at her castle, POOF! you wake up the next morning back at your house
(aka Layer 0, our input features, the 3 survey questions) and have to start again from there (a new iteration). It looks a bit like this:


Fortunately, this story has a happy ending, but you will have to keep attempting to arrive at The Oracle's house and correcting your route for, say,
another 58,000 trials (and errors!) before you find enlightenment. Don't worry, it'll be worth it.
Let's take a look at one of those drives (an iteration) from your house, X, to her castle, y: each trip you make is the feed forward pass of lines 57-59, and
each day you arrive at a new place, only to discover it is NOT the castle. Drat! Of course, you want to figure out how to arrive a little closer to your guru
on the next drive. I'll explain that in the steps below. Stay tuned for The Oracle and the Castle, Part Deux: Learning from Your Errors.
3.2) Stunningly Beautiful Pictures of Feed Forward
OK, so the Oracle Castle story above was an analogy for the feed forward process. Next we'll walk through a simplified example of feed forward:
We're going to walk through one example of one weight only, out of the 16. The weight we will study is the tippy-top line of the 12 lines of syn0, between
the top neurons of l0 and l1, and for simplicity's sake we will call it syn0,1 (technically it is syn0(1,1) for "row 1 column 1 of the matrix syn0"). Here's
what it looks like:


Why are the circles representing the neurons of l2 and l1 divided down the middle? The left half of each circle (marked in my variables with an "LH") is
the value used as input to the sigmoid function, and the right half is the sigmoid's output: l1 or l2. In this context, recall that the sigmoid simply takes
the product of the previous layer times the previous synapse and "squishes" it down to a value between 0 and 1. It's this part of the code (created in
line 10, but not called upon until line 58):

return 1/(1+np.exp(-x))

OK, here is Feed Forward using one of our training examples, row 1 of l0, aka "Customer One's responses to the 3 survey questions": [1,0,1]. So, we'll
begin by multiplying the first value of l0 by the first value of syn0. Imagine that our syn0 matrix has been through some training iterations since we
initialized it earlier (in Section 2.4), and now it looks like this:


syn0:
[ 3.66 -2.88 3.26 -1.53]
[-4.84 3.54 2.52 -2.55]
[ 0.16 -0.66 -2.82 1.87]

My dearest reader, perhaps you are thinking, "Gee, Dave, why are syn0's values so different from the syn0 you showed me when you created syn0 in
Section 2.4?" I'm glad you asked. You see, many of the matrices above have realistic initial values as if they were just created from randomized seeding
by the computer. But the matrices you see here and below are not initial values. They are matrices that have been through some training iterations, so
their values may have changed a lot during updating in the learning process.
Lovely. So, let's multiply the "1" of l0 by the "3.66" of syn0 and see where such shenanigans lead us:


Here's what Feed Forward looks like in pseudo-code, and you can follow the forward, left-to-right process in the diagram above. (Note that I add "LH,"
meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1LH" means, "the left half of
the circle representing l1," which means, "before the product has been passed through the nonlin() function.")

l1_LH = l0 x syn0 -> l1_LH = 1 x 3.66 (next, don't forget to add the products of the other l0 values x the
other syn0 values; for simplicity right now, just trust me that they add up to 0.16 total)
-> l1_LH = 3.66 + 0.16 = 3.82

l1 = nonlin(l1_LH) = nonlin(3.82) -> nonlin() = 1/(1+np.exp(-x)) = 1/(1+2.718^-3.82) = 0.98

l2_LH = l1 x syn1 = 0.98 x 12.21 = 11.97 (again, add the products of the other syn1
multiplications--trust me that they total -11.97) -> l2_LH = 11.97 + -11.97 = 0.00

l2 = nonlin(l2_LH) = nonlin(0.00) -> nonlin() = 1/(1+np.exp(-x)) = 1/(1+2.718^-0.00) = 0.5

l2 = 0.5 -> l2_error = y - l2 = 1 - 0.5 = 0.5
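The same walk-through as runnable NumPy. The syn0 column, the syn1 weight 12.21, and the -11.97 total are the hypothetical, partially-trained values from the text; tiny differences from the text's numbers come from rounding:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Customer One's survey answers (row 1 of l0) and the hypothetical
# partially-trained first column of syn0 from the text:
l0_row1 = np.array([1, 0, 1])
syn0_col1 = np.array([3.66, -4.84, 0.16])

l1_LH = l0_row1.dot(syn0_col1)  # 1*3.66 + 0*(-4.84) + 1*0.16 = 3.82
l1_val = nonlin(l1_LH)          # ~0.98

# The other three l1 neurons' contributions are assumed to total -11.97,
# as in the text, so l2_LH lands near 0:
l2_LH = l1_val * 12.21 + (-11.97)
l2 = nonlin(l2_LH)              # ~0.5, matching the text up to rounding
```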

Alrighty, then. Below is a view under-the-hood of the math that makes the code above happen:


3.3) Let's Walk Through the Math of Feed Forward Slowly:


l0 x syn0 = l1_LH, so in our example 1 x 3.66 = 3.66, but don't forget we have to add the other two products of l0 x the corresponding weights of syn0. In
our example, l0,2 x syn0,2 = 0 x something = 0, so it doesn't matter. But l0,3 x syn0,3 does matter because l0,3 = 1, and we know from our matrix
example in the last section that the value for syn0,3 is 0.16. Therefore, l0,3 x syn0,3 = 1 x 0.16 = 0.16. Our product of l0,1 x syn0,1 plus our product of
l0,3 x syn0,3 = 3.66 + 0.16 = 3.82, and 3.82 is l1_LH. Next, we have to run l1_LH through our nonlin() function to create a probability between 0 and 1.
Nonlin(l1_LH) uses the code return 1/(1+np.exp(-x)), so in our example that would be 1/(1+(2.718^-3.82)) = 0.98, so l1 (the RH side of the l1 node)
is 0.98.
So, what just happened with the equation 1/(1+np.exp(-x)) = 1/(1+(2.718^-3.82)) = 0.98 above? The computer used some fancy code,
return 1/(1+np.exp(-x)), to do what we could do manually with our eyeballs--it told us the corresponding y value of x = 3.82 on the sigmoid curve
as pictured in this diagram:


(taken with gratitude from: Andrew Trask)


Notice that, at 3.82 on the X axis, the corresponding point on the blue, curved line is about 0.98 on the y axis. Our code converted 3.82 into a statistical
probability between 0 and 1. It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here. The computer did
what we did: it used math to "eyeball what 3.82 on the X axis would be on the Y axis of our diagram." Nothing more.
Again: nonlin() is the part of the sigmoid function that renders any number as a value between 0 and 1. It is the code return 1/(1+np.exp(-x)). It does
not take slope. But in back prop, we're going to use the other part of the sigmoid function, the part that does take slope, i.e., return x*(1-x), because
you will notice that lines 57 and 71 specifically request the sigmoid to take slope with the code (deriv==True).
Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value. l1 x syn1 = l2_LH, which in our example would be 0.98 x 12.21 = 11.97. But again,
don't forget that to 11.97 we must add all the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake trust
me that they all added up to -11.97 (I am using the same matrix). So you end up with 11.97 + -11.97 = 0.00, which is l2_LH. Next we run l2_LH through
our fabulous nonlin() function, which would be 1/(1+2.718^-0) = 0.5, which is l2, our very first prediction of what the truth, y, might be!
Congratulations! You just completed your first feed forward!
Now, let's assemble all our variables in one place, for clarity:

l0=1
syn0,1=3.66
l1_LH=3.82
l1=0.98
syn1,1=12.21
l2_LH=0
l2=~0.5
y=1 (this is a "Yes" answer to survey Question 4, "Actually bought Litter Rip?")
l2_error = y-l2 = 1-0.5 = 0.5

OK, let's now take a look at the matrix multiplication that makes this all happen (for those of you who are rookies to matrix multiplication and linear
algebra, Grant Sanderson teaches it brilliantly, with lovely graphics, in 14 YouTube videos. Watch those first, if you wish, then return here).
First, on line 58 we multiply the 4x3 l0 and the 3x4 syn0 to create (hidden layer) l1, a 4x4 matrix:

Now we pass it through the nonlin() function in line 58, which is the fancy math expression I explained above that "squishes" all values down to values
between 0 and 1:

1/(1 + 2.71828^-x)

This creates layer 1, the hidden layer of our neural network:


l1:
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]

If you find yourself feeling faint at the mere sight of matrix multiplication, fear not. We're going to start simple, and break down our multiplication into
tiny pieces, so you can get a feel for how this works. Let's take one single training example from our input, row 1 (Customer One's survey answers):
[1,0,1], a 1x3 matrix. We're going to multiply that by syn0, which would still be a 3x4 matrix, and our new l1 would be a 1x4 matrix. Here's how that
simplified process can be visualized:

(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

row 1 of l0:   col 1 of syn0:

               [ 3.66]
[1 0 1]    X   [-4.84]   =   1 x 3.66 + 0 x (-4.84) + 1 x 0.16 = 3.82
               [ 0.16]

Multiplying row 1 by columns 2, 3, and 4 of syn0 completes row 1 of the product: [3.82 -3.54 0.44 0.34]. Rows 2, 3, and 4 of l0 fill out the remaining
rows of the 4x4 product the same way.

Then pass the above 4x4 product through "nonlin()" and you get the l1 values
l1:
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]

Note that, on line 58, we next take the sigmoid function of l1 because we need l1 to have values between 0 and 1, hence:
l1=nonlin(np.dot(l0,syn0))
It is on line 58 that we see Big Advantage #1 of the Four Big Advantages of the sigmoid function. When we pass the dot product matrix of l0 and syn0
through the nonlin() function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.
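Line 58 in runnable form. The syn0 here is freshly seeded, so its values won't match the partially-trained matrices shown above, but the shapes and the 0-to-1 squishing are exactly the same:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
l0 = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])  # 4x3 input
syn0 = 2 * np.random.random((3, 4)) - 1                      # 3x4 weights

l1 = nonlin(np.dot(l0, syn0))   # line 58: the 4x4 hidden layer

print(l1.shape)                     # (4, 4)
print(((l1 > 0) & (l1 < 1)).all())  # True: every value is now a probability
```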

3.4) Super-Key Point: Inferred Questions in the Hidden Layer

My loveliest and most astute readers will then pose a clever question: "Gee, Dave, 'statistical probability' of what?" Well, Friends, prepare yourself for
(yet another) stunning breakthrough in deep learning:
Ah yes, well, statistical probability indeed. Er, why should you care? Well, I guess only because statistical probability is one of the main factors that
makes a bunch of dumb matrices suddenly come alive and learn like a child's brain!!! Other than that, though, no reason to pay any attention here...
In layer one, when we multiply l0 by syn0, why do we fiddle with different values in the weights of syn0? It is because we are trying out various
combinations of our original three feature questions to see which combinations are the biggest help in predicting our overall query, i.e., "What are the
odds this customer bought Litter Rip!?" I'll use some silly examples to show you what I mean:


We have the customer responses to our original three survey questions. What can we infer from various combinations of those questions that might
improve our ability to predict? For example, if customers own a cat, it shows they have great taste in pets. If customers drink imported beer, it means
they have great taste in beer. Therefore, we could infer that these customers will buy Litter Rip! just to enjoy the beautiful, tasteful design and
ergonomic form of each stink-sucking granule of kitty litter, so they'll buy the product just to display it proudly on their mantle, even if their cat only
poops outdoors, right? Therefore, we might use our weights to strengthen the link between these two features if it appears to give us more accurate
predictions.
Another example: If some customers don't own a pooping kitty, but they drink imported beer and have visited Litter Rip.com online, we can infer that
they are very tech-savvy customers--they obviously drink imported beer because they admire the supply chain logistics it takes to get a Dutch bottle of
Heineken from the factory in Holland into their hot little Yankee hands in Kansas. And if they visit websites online, and everything, well gosh--then they
are obviously tech-savvy geniuses. So we can infer that technologically sophisticated customers will clamor to buy Litter Rip! just out of sheer
admiration for the bleeding-edge technology that goes into each granule of poop-absorber--even if the customer doesn't actually own a cat. So perhaps
we'll adjust the weights of syn0 to strengthen the link between these fine features in the hope of enjoying more accurate predictions...
Get it? When we multiply the responses to the survey questions, l0, times the weights in syn0 (each weight is our best guess at how important an
inferred question is to our prediction), we are trying out different combinations of our survey responses to see which combinations are most helpful in
predicting who will buy Litter Rip! Over 60,000 iterations, it will become clear that, for example, cat owners who visit Litter Rip! online are more likely to
buy Litter Rip! so the corresponding weights will get larger over iterations, meaning the statistical probability will be closer to 1 than 0. But imported
beer drinkers who don't own a cat are less likely to buy Litter Rip! so those weights will get smaller--meaning that statistical probability will be closer to
0 than to 1. Ain't that cool? It's like poetry in numbers. Matrices that reason and think!
Here's why this is SUCH a big deal: my wee lads and lassies, don't you EVER let some arrogant software engineer or some fear-mongering journalist tell
you about the "hidden layer" or the "Pandora's box" of neural networks. There is no "mysterious magic" going on under the hood. The math is clear,
elegant, and beautiful. And you can master it. It just takes patience and persistence, my lovelies.
Keep calm and carry on multiplying matrices.

3.5) Visualizing Matrix Multiplication in Terms of Neurons and Synapses


Here's where these l0 and syn0 values would appear if we were to plug them into in our picture analogy of neurons and synapses:


So, the above diagram gives you a picture of how row one of our input l0 feeds through its first step in the network. And you remember that row one =
Customer One and her three answers to the three survey questions. But when it comes time to multiply those three numbers by all 12 values of
syn0, and then do the same for the other three customers and their three values, how do you juggle all those numbers and keep them straight?
The key is to think of the four customers as stacked on top of each other in a "batch." So, our top stack is our top row is Customer One, right? As you
saw above, you multiply the three numbers of row one through the entire 12 numbers of syn0, sum them, and you end up with four values on the top
stack of l1. Perfect.
Now, what's your second stack in the batch? Well, 0,1,1 is the second row of answers is the Second Customer is the second stack. Multiply those three
numbers through all 12 of the syn0 values, sum them, and you end up with four values that become the second stack (or layer) in the batch of l1 values.


And so on. Two more times. The key is to just take things one stack at a time. Then you'll be fine, whether it's four stacks in your batch, or four million.
You might say that each feature has its own batch of values; in our kitty litter case, each survey question (or feature) has a batch of four answers
from four customers. But it could be four million. This concept of "full batch configuration" is a very common model, which is why I took the time
to explain it. I find it easiest to think of a given feature as having its own batch of values: when you see a feature, you know there's a stack (batch) of
values under it.

Exactly the same thing happens on line 59, as we take the dot product of 4x4 l1 and 4x1 syn1 and then run that product through the sigmoid function to
produce a 4x1 l2 with each value becoming a statistical probability from 0-1.


l1 (4x4):
[0.98 0.03 0.61 0.58] [ 12.21]
[0.01 0.95 0.43 0.34] X [ 10.24] =
[0.54 0.34 0.06 0.87] [ -6.31]
[0.27 0.50 0.95 0.10] [-14.52]

Then pass the above 4x1 product through "nonlin()" and you get l2, our prediction:
l2:
[ 0.50]
[ 0.90]
[ 0.05]
[ 0.70]

So, what do those four predictions tell us in the case of our superior feline hygiene product? They tell us: the closer the value is to 1, the more certain it
is the customer will buy Litter Rip!, whereas the closer the value is to 0, the more certain it is that Litter Rip! will remain untouched by the (ungrateful!
unfeeling!) customer. As I mentioned above, 0.2 means "Probably won't buy," 0.8 means "Probably will buy," and 0.999 means, "You're damn right they'll buy!"
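Line 59 as runnable NumPy, using the hypothetical trained l1 and syn1 values from the text (my rounding may differ from the text's in the last decimal):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Hypothetical trained-for-a-while values echoing the matrices above:
l1 = np.array([[0.98, 0.03, 0.61, 0.58],
               [0.01, 0.95, 0.43, 0.34],
               [0.54, 0.34, 0.06, 0.87],
               [0.27, 0.50, 0.95, 0.10]])
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])

l2 = nonlin(np.dot(l1, syn1))   # line 59: the 4x1 vector of predictions

# Read each probability as "how likely is this customer to buy?"
for p in l2.ravel():
    verdict = "probably will buy" if p > 0.5 else "probably won't buy"
    print(round(float(p), 2), "->", verdict)
```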
We have now completed the Feed Forward portion of our network. I hope you can visualize what we have done so far, namely:

1. the matrices involved;
2. rows as customers;
3. columns as features; and
4. each feature containing batches of values (i.e., responses to survey questions).

If you can visualize these four items, then you have done outstanding work. Bravo to you.
In closing, think of the above feed forward step as our first prediction, with 60,000 more to come. Our next step is to calculate where our first prediction
went wrong, and how to adjust our network's weights so that the next prediction is better. We'll do this same process of predicting and then improving
the next prediction over-and-over. 60,000 times. That is learning by trial-and-error. And it is good.

Section 4 of 5: Learning from Trial-and-Error: Gradient Descent

4.1) The Big Picture of Gradient Descent


What is the purpose of gradient descent? It is to find the best set of adjustments to our network of weights so that it gives a better prediction in the
next iteration. In other words, certain values in the synapse matrices of our network need to be increased or decreased in order to give a better
prediction next time. To adjust each of these values, we must answer two key questions:
1) In what direction do I adjust the number? Do I increase the value, or decrease it? Positive direction, or negative? and


2) By how much do I increase or decrease the number? A little, or a lot?


We will examine these two basic questions in great detail below. But if you want to visualize what gradient descent does, simply remember that
"gradient" is just a fancy word for "slope." If you remember our curvy red bowl from Section 1, The Big Picture, gradient descent simply means
calculating the optimal slope of the surface of that bowl to get that little white ball down to the bottom of the bowl as quickly and efficiently as
possible. So keep that curvy red bowl in your mind.
Our first step in gradient descent is to calculate how much our current prediction missed the actual truth, namely a 1/yes or a 0/no in y.

4.2) How Far Off is our Prediction when Compared to Survey Question #4?

Line 66

l2_error = y - l2

So, by how much did our first prediction miss the target of "Yes/1," the actual truth from survey question four that Customer One did indeed buy Litter
Rip!? Well, with Customer One (Row One of l0), we want to compare our l2 prediction to the y value of 1, since Customer One did indeed buy Litter Rip! When
I say, "compare our l2 prediction," I mean we subtract the l2 probability from the y value and the remainder is our l2_error, or "how much we missed the
target value y by."
So, big picture here again: our network took the input of each customer's response to the first three survey questions, and manipulated that data to
come up with a prediction of whether that customer bought Litter Rip! or not. Because we have four customers, our network made four predictions.
And you may recall that the 4x1 y vector contains the answers of four customers to question four, "Have you purchased Litter Rip!?" It contains four "0"
or "1" values to which we want to compare the four predictions our network came up with.
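In NumPy, that comparison is one line of elementwise subtraction. A tiny sketch, using the illustrative l2 values from this walkthrough:

```python
import numpy as np

# y holds the actual truth from survey question four (1 = bought, 0 = didn't).
y = np.array([[1], [1], [0], [0]])

# l2 holds the network's four predictions (illustrative values).
l2 = np.array([[0.50], [0.90], [0.05], [0.70]])

# Line 66: how much each prediction missed the truth by.
l2_error = y - l2
print(l2_error.T)  # approximately [0.5, 0.1, -0.05, -0.7]
```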
Once we know our l2_error (which of course is also a vector of four errors, one for each prediction), next we want to print that error, so we can eyeball
our process in real time:

Print Error: Lines 72-73


Line 72 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations. It's helpful for us to envision the learning the network
is doing if it "shows us its homework" every 10,000 times, so we can see its progress. The line, if (j % 10000) == 0:, means, "If your iterator is at a
number of iterations that, when divided by 10,000, leaves no remainder, then..." So, j % 10000 would have a remainder of 0 only six times: at 0 iterations,
10,000, 20,000, and so on up to 50,000 (the loop stops just short of 60,000). So this print-out gives us a nice report on the progress of our network's learning.
The code + str(np.mean(np.abs(l2_error)))) simplifies our print-out by taking the absolute value of each of the 4 values, then averaging all 4 into
one mean number and printing that.
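Both pieces are easy to check in isolation. A quick sketch (the l2_error values are the illustrative ones from this walkthrough, and a loop bound of 60,000 is assumed):

```python
import numpy as np

# The modulo test fires only when j divides evenly by 10,000:
hits = [j for j in range(60000) if (j % 10000) == 0]
print(hits)  # [0, 10000, 20000, 30000, 40000, 50000]

# np.abs flips the negative misses positive, then np.mean collapses
# the four misses into a single progress number for the print-out.
l2_error = np.array([[0.50], [0.10], [-0.05], [-0.70]])
print(round(float(np.mean(np.abs(l2_error))), 4))  # 0.3375
```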


OK, so we now know how much our predictions about four customers (l2) missed the Actual Truth about who purchased Litter Rip! (y). And we've
printed that.
But of course, any distance between us and The Oracle's castle is too much for our hearts to bear, so how can we reduce the current, unsatisfactory
prediction error of 0.5 to finally attain enlightenment?
One step at a time. First, let's get clear on what part of our network needs to change in order to improve our network's next prediction. After that, we'll
discuss how to adjust our network.

4.3) Exactly What Part of this Network Will We Adjust?


When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain. In fact, the
real brain of a neural network, the part that actually learns and remembers, is the synapses--those lines that connect the circles in this diagram. We
have control over 16 variables in our network: 12 variables in the 3x4 matrix syn0, and 4 variables in the 4x1 vector syn1. Look at this diagram and


understand that every line (aka, "edge" or synapse) you see represents one variable, containing one number, aka one weight.

These 16 weights are all we can control.


l0, our input, is fixed and unchanging. l1 is determined exclusively by the weights in syn0 by which you multiply the fixed values of l0. And l2 is
determined exclusively by the weights in syn1 by which you multiplied l1. Those 16 lines pictured above, the synapses, the weights, are the only
numbers you can tweak to achieve your goal, which is an l2_error that gets smaller and smaller until l2 almost equals y.
To summarize: so far we know how much our prediction (l2) missed the actual truth (y). We know the l2_error. How can we use those numbers to
calculate how much we need to adjust the 16 weights in our two synapse matrices, syn0 and syn1? The answer to that question is one of the most
miraculous aspects of AI: it is a concept of statistics and probability known as confidence measures.

4.4) Confidence Measures: the Genius that Brings Lifeless Numbers Alive To Think Like Humans


Line 88:

l2_delta = l2_error*nonlin(l2,deriv=True)

The term, "confidence measure," may sound abstract and nerdy to you, but it's actually something you use every day without realizing it. So now I'm
going to make you aware of how you use confidence measures by telling you a (super-compelling!) story:

The Oracle and the Castle, Part Deux: You Must Become Aware That You Use Confidence Measures ALL THE TIME!
You may recall, back in Section 3.1, you made a feed forward pass and drove to l2, your best guess as to where the castle y is located, but you arrived at
l2 only to discover you were closer to the castle, but not yet arrived. And you know that soon, you will (POOF!) disappear and wake up the next morning
back at your house, l0, and start over with another trip (iteration).
How can you improve your driving directions to get closer to the meaning of life tomorrow?
First, when you arrive at today's destination, you eagerly ask a local knight how far today's arrival place is from The Oracle's castle. This chivalrous
knight tells you the distance you are from Castle y (this is the l2_error, or "how much you missed The Oracle by"). Every day, at the end of each trip,
before you disappear for the day, you want to compute how much you want to change the weights that created today's failed l2 prediction such that
tomorrow your l2 prediction will be better and you can eventually nd your Oracle. This is the l2_delta (clearly marked in the diagram below). It is the
amount you want to change today's l2 so that tomorrow, that new-and-improved l2 will hopefully lead to the castle drawbridge!

Wake up, because here comes a super-duper key insight:


Note that the l2_delta is NOT the same as l2_error, because l2_error only tells you how many miles you are from your Oracle. l2_delta also factors in
how confident you were in the turn-by-turn directions by which you missed the castle today. You are measuring confidence. These confidence measure
numbers are the derivatives (hmm...actually, just forget calculus. You don't need it here, so let's just use the word "slope," as in Good Ol' rise-over-run), or
rather, the slope of each value of l2. Think of these slopes as the confidence levels you had in each of the turns in the set of directions we're using for
today's trip. Some of those turns you were super-confident of. With other turns, you weren't certain if they were right or not.
But wait: perhaps this concept of using confidence measures to compute where you want to arrive tomorrow seems a bit abstract and confusing?
Actually, you use confidence measures to navigate all the time--you just aren't conscious of it. Our next task is to make you aware.
Think about a time when you got lost. At first, you started out assuming you were on the right route, or you wouldn't have taken that route in the first
place. You started out confident. But slowly you realize that your trip seems to be taking longer than you expected, and you wonder, "Gee, did I miss a
turn?" You are less confident. Then, as time passes and you should have arrived by now, you become more certain you missed that turn. Low
confidence. And you know you are not at your destination, but you are not sure where your destination is from your current location.
So now, you decide to stop and ask a nice lady for directions, but she tells you more turns and landmarks than you can remember, so you can only
follow her directions part-way before you are again unsure how to proceed. So you ask directions again, but this time you are closer, so the directions
are simpler, and you follow them to the letter and arrive joyfully at your destination.


It's very important to notice a couple of things:


First, you just learned by trial-and-error, and you had varying confidence levels. A bit later below, I will explain in detail how those confidence levels allow
our network to learn by trial-and-error, and then I will explain how our beloved sigmoid function gives us those all-important confidence levels.
Second, notice that your trip had two segments--the first segment was your route up to where you asked the nice lady for directions (l1), and segment
two was your route from the nice lady to l2, the place you thought was your destination. But then you realized you arrived somewhere else, and had to
ask how far you were from your true destination.
You see how confidence plays a role in your navigation? At first, you were sure you were on the correct route. Then, you wondered if you missed a turn.
Then you were certain you missed a turn, and stopped to ask directions before proceeding further. Those 2 segments of your daily trip look like the dog
legs pictured here, and each day with your improvements, the dog leg gets a bit straighter.
This is like the process you go through ("you" being our spiritual-seeking neural network with a driver's license) on your way to the castle:


Every day, on every trip, you (our 3-layer network) start out with a set of directions to The Oracle (syn0). When those directions end, you stop at l1 and
ask for further directions (syn1). These take you to your final destination of the day, your prediction of where you thought The Oracle was. But alas, no
castle stands before you. So you ask the knight, "How far to the castle?" (i.e., "Hey, Bub: what's the l2_error?") And because you are a genius, you can
multiply the l2_error by how confident you were in each turn of your directions (the slope of l2) and come up with where you want to arrive tomorrow
(your l2_delta).
(True Confession: the place where my "driving directions" analogy is incorrect is when you stop at l1 to ask for further directions. I imply that the nice
lady gave you fresh directions. In fact, your directions (the values of syn1) were already chosen for you during the update of the last iteration. So, it
would be more accurate to say the nice lady doesn't speak to you, but rather holds up a sign with the directions that you gave her last night before you
disappeared (Poof!), because you knew you'd pass by her house tomorrow.)

So you must compute 3 facts before you can learn from them and re-attempt your quest for The Oracle's castle. You must know:

1. Where you arrived--your current location (l2);


2. How far you are from The Oracle's castle (l2_error); and
3. What changes you need to make in your set of turns to increase your certainty that your next driving attempt will get you closer to the castle
(l2_delta).

Once you possess these three facts, then you can compute the required changes in the navigation turns (i.e., the weights of the synapses). So let's
teach you how to compute the l2_delta using confidence measures we get from our use of the (stunning! fabulous!) sigmoid function.
4.5) How the Slope-Taking Feature of the Sigmoid Function Gives You Confidence Measures
Stand by to be absolutely fascinated.
Here is where you will see the beauty of the sigmoid function in four magical operations. To me, the genius of our neural network's ability to learn is
found largely in these four, very advantageous, operations:
We saw in Section 2 of 5 how nonlin() transformed each value of l2_LH into a statistical probability (i.e., a prediction) between 0 and 1 (aka l2).
This is Big Advantage #1 of the Four Big Advantages of the Sigmoid Function.
But I have yet to explain how Part 2 of the sigmoid, namely "nonlin(l2,deriv=True)", can transform those 4 values in l2 into confidence measures.
This is Big Advantage #2. If each of our network's predictions, the four values of l2, are both high-accuracy and high-confidence, then that's an
outstanding prediction, and we want to leave the syn0 and syn1 weights that contributed to that outstanding prediction alone. We don't want to mess
with what's working; we want to fix what's NOT working. Below I'll teach you how "nonlin(l2,deriv=True)" shows us which weights require our
attention.
Confidence measures help us to prioritize which weights we want to mess with first. How? Well, if "nonlin(l2)" produces a confidence measure of
0.999, that's the equivalent of the network saying, "I am extremely confident the customer will buy Litter Rip!" A number of 0.001 is the equivalent of, "I'm
confident there is no way in Hell the customer will buy Litter Rip!" But what about all those numbers in the middle? Low confidence numbers are in the
vicinity of 0.5. For example, a value of 0.6 would be similar to, "The customer might buy Litter Rip!, but I'm not sure." 0.5 means, "Hey--it could go either
way, I'm on the fence..."
That's why we focus our attention on the numbers in the middle: all numbers approaching 0.5 in the middle are wishy-washy, and lacking confidence.
So, how can we tweak our network to produce four l2 values that are both high-confidence and high-accuracy?
The key lies in the values, or weights of syn0 and syn1. As I mentioned above, syn0 and syn1 are the center, the absolute brains of our neural network.
We are going to take the four values of the l2_error and perform beautiful, elegant math on them to produce an l2_delta. l2_delta means, basically, "the
change we want to see in the output of the network (l2) so that it better resembles y (the truth about who bought Litter Rip!)." In other words, l2_delta is
the change you want to see in l2 in the next feed forward pass in the next iteration.
Get ready for beauty.
Here is Big Advantage #3 of the Four Big Advantages of the Sigmoid Function: Do you remember that diagram of the beautiful S-curve of the sigmoid
function that I showed you above? Well, lo-and-behold, each of the 4 probability values of l2 lies somewhere on the S curve of the sigmoid graph
(pictured again below, but this time with more detail). For example, let's say value 1 of l2 is 0.9. If we search for 0.9 on the Y axis of the graph below, we
can see that it corresponds with a point on the S curve roughly where you see the green dot:

(taken with gratitude from: Andrew Trask)


Did you notice not only the green dot but also the green line through the dot? That green line represents the tangent to the curve, whose slope is the slope of the curve at
the exact point where that dot is. You don't need to know calculus to take the slope of a curve at a particular point--the computer will do that for you.


But you do have to notice that the S curve above has a very shallow slope at both the upper extreme (near 1) and the lower extreme (near 0). Does that
sound familiar? Wonder of wonders, a shallow slope on the sigmoid curve coincides with high confidence in our predictions!
And you also need to know that a shallow slope on the S-curve comes out to a tiny number for slope. That's good news. Why?
Because, when we go to update our synapses, we basically want to leave our high-confidence, high-accuracy weights alone, since they are already good.
So we simply want to introduce a small change. The fact that the change (l2_delta) is constructed by multiplying by a tiny number (the slope) ensures
this will be the case.
And here comes Big Advantage #4 of the Four Big Advantages of the Sigmoid Function:
Miracle-of-miracles, our high-confidence numbers correspond to shallow slope on the S-curve, which corresponds to tiny slope numbers. Therefore,
multiplying the values of syn0 and syn1 by these teeny-tiny numbers has exactly the effect we want: the corresponding values in our synapses are left
virtually unchanged, so our confident, accurate, high-performing values in l2 remain so.
By the same token, our wishy-washy, indecisive, low-accuracy l2 values, which correspond to points in the middle of the S-curve, are the numbers that
have the biggest slope on our S-curve. What I mean is, the values around 0.5 can be traced on the Y axis of our graph above to the middle of the S-
curve, where the slope is steepest, and therefore the value of that slope is a big number. Those big numbers mean a big change when we multiply them
by the wishy-washy values in l2.
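You can verify this shallow-ends, steep-middle behavior with a couple of lines of Python. The slope of the sigmoid, expressed in terms of its output value v, is v*(1-v), which is exactly what nonlin(x, deriv=True) computes:

```python
def sigmoid_slope(v):
    # Slope of the sigmoid curve at a point whose output value is v;
    # this is what nonlin(x, deriv=True) computes.
    return v * (1 - v)

# Confident predictions (near 0 or 1) sit on the flat ends of the S-curve;
# wishy-washy predictions (near 0.5) sit on the steep middle.
for v in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(v, sigmoid_slope(v))
# sigmoid_slope(0.5) is 0.25, the steepest point;
# sigmoid_slope(0.999) is about 0.001, nearly flat.
```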

In detail now, how do we compute the l2_delta?


You may recall we already found l2_error, which measures how much our first prediction, l2, missed the target values of y, our truth, our future, and our
Oracle. We are particularly interested in the Big Misses.
In line 88, the first thing we do is use the second part of our beloved sigmoid function, "nonlin(l2,deriv=True)", to find the slope of each of the 4
values in our l2 prediction. This slope tells us which predictions were confident, and which were (wait for it...) Wishy-Washy. This is how we find and fix
the weakest links in our network, the low-confidence predictions. We then multiply those 4 slopes (or confidence measures) by the four misses in
l2_error, and the product of this multiplication will be l2_delta.
Oh, Lordy! This is an important step--did you notice that we are multiplying the Big Misses by the Wishy-Washy Predictions (i.e., the l2 predictions that
had big slopes)? That is a super-duper key point, as I'll explain below. But first, let's make sure you can visualize what I just said:

Below is the matrix multiplication of this line of code, in order of operations: l2_delta = l2_error*nonlin(l2,deriv=True)

Take l2 predictions, find their slopes, multiply them by the l2_error, and the product is l2_delta

y:        l2:          l2_error:
[1]       [0.50]       [ 0.50]
[1]   -   [0.90]   =   [ 0.10]
[0]       [0.05]       [-0.05]
[0]       [0.70]       [-0.70]

This equation below is SUPER-key to understanding how a neural network "learns:"

l2 slopes after nonlin(): l2_error: l2_delta:


[0.25] Not Confident [ 0.50] Big Miss [ 0.125] Big change
[0.09] Fairly Confident X [ 0.10] Small Miss = [ 0.009] Small-ish Change
[0.05] Confident [-0.05] Tiny miss [-0.003] Tiny change
[0.21] Not Confident [-0.70] Very Big Miss [-0.150] Huge Change
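Here is the same computation as runnable NumPy, using the illustrative l2 and l2_error values from the tables above:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)  # slope of the sigmoid at output value x
    return 1 / (1 + np.exp(-x))

l2 = np.array([[0.50], [0.90], [0.05], [0.70]])
l2_error = np.array([[0.50], [0.10], [-0.05], [-0.70]])

# Wishy-washy predictions (near 0.5) get the biggest slopes...
slopes = nonlin(l2, deriv=True)
print(slopes.T)  # approximately [0.25, 0.09, 0.0475, 0.21]

# ...so Big Misses x Big Slopes = Big Changes:
l2_delta = l2_error * slopes
print(l2_delta.T)  # approximately [0.125, 0.009, -0.0024, -0.147]
```

(The table above rounds -0.002375 to -0.003 and -0.147 to -0.150.)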

Notice that the Big Misses are (relatively speaking) the biggest numbers in l2_error. And the Wishy-Washys have the steepest slope, so they are the
biggest numbers in nonlin(l2,deriv=True). So, when we multiply the Big Misses X the Wishy-Washys, we are multiplying the biggest numbers by
the biggest numbers, which will give us--guess what?--the biggest numbers in our vector, l2_delta.
Why is that fabulous news? Think of l2_delta as "the change we want to see in l2 in the next iteration." The big l2_delta values are the big changes we
want to have in the l2 prediction of the next iteration, and we'll make those happen by making big tweaks in the corresponding values of syn1 and syn0
below. Those big tweak values will be added to the existing values of syn1 (that's what the "+=" operator does). The result means that the updated set
of weights will contribute to a better l2 prediction in the next iteration! Happy Happy! Joy Joy!
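To make that "+=" concrete, here is a sketch of the kind of update line 115 performs. I'm assuming the standard form syn1 += l1.T.dot(l2_delta) here (the transpose is what makes the shapes line up), and the l1, l2_delta, and syn1 numbers are the illustrative ones from this walkthrough:

```python
import numpy as np

l1 = np.array([[0.98, 0.03, 0.61, 0.58],
               [0.01, 0.95, 0.43, 0.34],
               [0.54, 0.34, 0.06, 0.87],
               [0.27, 0.50, 0.95, 0.10]])
l2_delta = np.array([[0.125], [0.009], [-0.003], [-0.150]])
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])

# l1.T.dot(l2_delta) turns the four desired output changes into one
# tweak per syn1 weight; "+=" then adds those tweaks in place.
tweak = l1.T.dot(l2_delta)   # shape (4, 1), matching syn1
syn1 += tweak
print(syn1.shape)  # (4, 1)
```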
OK, let's make sure we got that. So...why bother taking the slope of l2?
We take the slope of l2 to fix the most mistaken of our 16 weights faster. How? Well, you may recall from our discussion of the sigmoid function (and
the S-curve diagram) above that the slope of l2 is related to (and negatively correlated with) confidence. That is to say, big slope numbers equal low
confidence, and the smallest slope numbers indicate the highest confidence level. Therefore, multiplying the corresponding values of the l2_error by
these small numbers ain't gonna cause a big change in the l2_delta product, which is good. We don't want to change those weights much, because
we're already pretty confident in the job they're doing.
But the l2 prediction numbers that we are least confident in have the steepest slope, which yields a larger number. When we multiply that larger number
by the l2_error, then the resulting l2_delta has a bigger number. When we update syn1 later on, that bigger multiplier is going to mean a bigger product,
and therefore a bigger change, or tweak, in that value. This is as it should be, because we want to take the weights we have the least confidence in and
change them the most. That's where we will get the biggest "bang for our buck" when it comes to tweaking the 16 weights of our system. To
summarize, taking the slope of l2 gives us the confidence of each l2 prediction, which allows us to home in on the numbers that most need fixing, and
fix them the fastest.
But wait! The fabulousness doesn't end there, Folks!

A Return to our Super-Key Point about Inferred Questions


Remember our conversation in Section 3.4 above about how changing the weights of syn0 is similar to changing which inferred questions are most
helpful in predicting? For example, cat owners who visit Litter Rip.com are likely to buy, while non-cat owners who drink imported beer are less likely to
buy. Well, my dearies, that's what l2_delta is doing here. l2_delta is increasing the importance (the weight) of the most useful inferred questions in layer
two, and decreasing the importance of the less useful inferred questions. And below, we'll go through exactly the same process as we determine the
usefulness of each inferred question in "hidden layer" #1 ("Hidden"? Hah! WE know better--ain't no magic in that hidden layer...).
But first, let's make sure we are crystal-clear on how these confidence measures of l2 modify the weights in syn1. Take your time with the above points
and make sure you understand them. Do you see why the sigmoid function is a thing of beauty? It takes any random number in one of our matrices
and:

1. turns it into a statistical probability,
2. transforms that value into a confidence level as well,
3. which creates a big-or-small tweak of our synapses, and
4. that tweak, in a positive or negative direction, is (almost) always in the direction of greater confidence and accuracy, therefore lower l2_error.

The sigmoid function is the miracle by which mere numbers in this matrix can "learn." A single number, along with its many colleagues in a matrix, can
represent probability and confidence, which allows a matrix to "learn" by trial-and-error. How's that for elegance in math?
Next, let me show you how these confidence measures can also tell us what changes to make in syn0 so that we have a more accurate layer one that
will lead us to a more accurate layer two.


4.6) Applying the "Big Miss/Small Miss" Confidence Measures to Layer 1 as well

(Line 100)

l1_error = l2_delta.dot(syn1.T)

Well, our oracle/castle story has been mighty useful: when we look at the story's illustration in 3.1 (the picture of the blue castle and the dog-leg routes
to it that miss their destination), it shows how l2_error is the distance remaining between our l2 prediction and the castle. And we can see how l2_delta
is the distance between our current prediction and the prediction we will make in our next iteration. So learning this Big Miss/Small Miss idea just
above helps us understand how to tweak our current syn1 in line 115 of the code below--we just multiply our current syn1 by our new-found l2_delta
and it will make big adjustments where we had big misses, and make small adjustments where our prediction was pretty accurate and pretty confident.
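As a quick shape check, here is what line 100 computes, using the illustrative l2_delta values from the table above and the earlier syn1 column:

```python
import numpy as np

l2_delta = np.array([[0.125], [0.009], [-0.003], [-0.150]])
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])

# Line 100: spread each customer's l2_delta back across the four hidden
# features, in proportion to each feature's weight in syn1.
l1_error = l2_delta.dot(syn1.T)
print(l1_error.shape)  # (4, 4): one error per customer per hidden feature
```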
Now that you have done such a lovely job of preparing an l2_delta to update syn1, wouldn't you love to update syn0 as well? Of course you would!
But how can you do that? Back in Section 4.2 when we were finding l2_error, life was easy. We knew l2, and we knew "The Ideal l2" we were shooting for,
which was simply y. But with l1_error, things are different. We don't know what "The Ideal l1" is. There's no y to tell us that. y only tells us The Ideal l2.
So, how are we going to figure out what The Ideal l1 is? Well, let's begin by being very practical: before we dive into any fancy math, why are we even
doing math here in the first place?

Returning (Yet Again) to those Super-Key Inferred Questions


It comes back to the concept of inferred questions (features), and combinations of feature questions that we discussed in 3.4 and 4.5. When we
change the weights in syn1, what we're really doing is experimenting with the importance we want to attach to this question versus that question, or
some other combination of other questions. Likewise with syn0: we are looking for math that will help us find l1_error, because l1_error tells us which
questions we mistakenly prioritized or mistakenly combined. And when we next calculate l1_delta, the l1_delta will tell us a better set of priorities and
combinations of questions for our next iteration.
For example, perhaps syn0 started out with weights suggesting that the inferred question of "Is this person rich?" was a very important question. (For
example, perhaps this question was inferred from some combination of the two questions, 1) "Own a cat?" because cat owners of course tend to be
millionaires, and 2) "Drink imported beer?" because imported beer drinkers are willing and able to pay the extra $20 per bottle, rather than settling for
Coors Light.)
However, as training went on, the network realized that Litter Rip! isn't a luxury good, so being rich isn't important, and thus the network adjusted the
values of syn0 to downplay the inferred question, "Is this person rich?" How did the network downplay this inferred question? By multiplying the
question's corresponding value in syn0 by a large l1_delta value that has a negative sign in front of it. Next, perhaps the network upgraded an emerging
inferred question, "Does this person have allergies to stinky cat poop?" It would upgrade that question by multiplying the corresponding syn0 value by a
large and positive l1_delta value.
OK, so perhaps the above two questions evolve in priority. But perhaps another inferred question, "Does this person know how to buy things on the
internet?" has consistently seemed important the whole time, because your target audience is folks who feel comfortable making online purchases. So
the network multiplies this question's corresponding syn0 value by a tiny l1_delta value. Therefore, its relative priority remains unchanged. It ain't broke,
so don't fix it! It's already working well.
Another piece of the puzzle is how many inferred questions do you need? In this network, we can only hold four inferred questions at a time. This is
specified by the size of syn0 (though syn1 must be selected to match). If you don't have enough, it may not be possible for the network to make
accurate predictions. Too many can waste effort (e.g. the network might ask the same question five times).
Well then: we know that changing the values of syn0 will change which of our inferred questions increases or decreases as a factor contributing to l1's
overall contribution to accurate predictions in l2. And syn1 takes the combination of feature questions and further refines them to make a more
accurate l2.
And that, me lovelies, is why we are looking for math that will help us adjust syn0 to give us a better l1, which in turn leads to a better l2: the better we
sort our feature/question combinations in l1, the easier it is to refine them with syn1 to give us a fantastic l2.
...and isn't that just what you've always wanted? Just admit it now...

Why All the Fuss about Taking the Slope?


Here's a question I had when I had arrived at this stage: Why bother taking the slope of l2?
We take the slope of l2 to x the most mistaken of our 16 weights faster. How? Well, you may recall from our discussion of the sigmoid function (and
the S-curve diagram) above that the slope of l2 is the con dence level of l2. The smallest slope numbers indicate the highest con dence level.
Therefore, multiplying the corresponding values of the l2_error by these small numbers ain't gonna cause a big change in the l2_delta product, which is
good. We don't want to change those weights much, because we're already pretty confident in the job they're doing.
But the l2 prediction numbers that we are least confident in have the steepest slope, which yields a larger number. When we multiply that larger number
by the l2_error, the resulting l2_delta is a bigger number. When we update syn1 later on, that bigger multiplier is going to mean a bigger product,
and therefore a bigger change, or tweak, in that value. This is as it should be, because we want to take the weights we have the least confidence in and
change them the most. That's where we will get the biggest "bang for our buck" when it comes to tweaking the 16 weights of our system. To
summarize, taking the slope of l2 gives us the confidence of each l2 prediction, which allows us to home in on the numbers that most need fixing, and
fix them the fastest.
If you have survived to this point, congratulations! If you are bleary-eyed and confused, hey--so was I, the first 10 times. Take a break, practice your
salsa dancing, and then re-read this section.
OK, the math. Let's start with what we know: we know the change we want to see happen in our next l2 prediction--that's l2_delta. We also know the
syn1 values we used to arrive at our current prediction. Important question: What would happen if we just took the change we want to see next time,
l2_delta, and multiplied it by the old syn1 values we used this time?
It's kind of like being a Bigtime Hollywood Screenwriter: you write a movie script with a sad ending where our hero never arrives at The Oracle's castle
because he gets fried by the dragon's flaming breath and eaten like a savory snack from a food truck. You show your script to the director, who runs
screaming from the room, "Give me a happy damn ending! I want an ending where the hero slays the dragon, discovers the meaning of life and drives
off in a Tesla!"
"Well, I gotta eat..." you're thinking, so you agree to revise your old script as your boss wishes. You know the ending you have to create, so you go back
through your script to figure out where the action went wrong. What plot points are you going to have to change in your script to end up with a happy
hero rather than a charred dragon snack?
Our math does the same thing here: if you multiply l2_delta, which is the happy ending you want to change to, by syn1, which is the plot points that got
you to the wrong ending (i.e., dragon food), then you will end up with l1_error--the places where your plot went wrong. Change those less-than-happy
plot points and the next draft of your script gets happier.
Again: if you multiply where you want to go next time (l2_delta) by the mistaken set of directions you used this time during the second part of your
journey (your current syn1), the product will be "where you went wrong in the first part of your journey," i.e., the l1_error.
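In code, that "multiply the happy ending by the old plot points" step is a single matrix product (the numbers below are hypothetical, just to show the shapes):

```python
import numpy as np

# 4 customers' desired output changes, and the 4 hidden-to-output weights.
l2_delta = np.array([[0.125], [-0.002], [0.03], [0.01]])   # 4x1: where we want to go
syn1 = np.array([[0.5], [-1.2], [0.3], [2.0]])             # 4x1: the directions we used

# "Where did the plot go wrong?" -- each hidden value's share of the blame:
l1_error = l2_delta.dot(syn1.T)   # 4x4: customers x hidden values
```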
And once you know your l1_error, you might as well go ahead and collect your Nobel Prize in Math before you calculate what you have to change in your
next l1 to come up with a better prediction in l2. Um, that would be the l1_delta. Oui? Non? Claro que si!

(Line 109)

l1_delta = l1_error * nonlin(l1, deriv=True)

To calculate the l1_delta, we're doing the same thing we did to calculate l2_delta. We first find our confidence measures of our l1 values by taking the
slope, rise-over-run, of each l1 value. Where the errors were big and confidence was low, this will produce a big l1_delta value that will create a large
change when we update syn0. Where the errors were small and confidence was pretty good, the l1_delta value will be small, and the syn0 of the next
iteration will be mostly unchanged.
In the same way that we multiplied l2_error by the confidence in those predictions (the slope of l2) to produce a desired change in l2, we now multiply
the l1_error by the confidence in the l1 predictions (the slope of l1) to obtain our desired change in l1, aka l1_delta.
It's just like the confidence you have in the set of turns you're taking as you drive to (what you predict is) The Oracle's castle. You start out confident in
your road map, but when you get lost, you want a new set of directions you can be confident in. Thus, you have to look back, figure
out where you went wrong, and fix it for next time.

4.7) Updating the Synapses for Fun and Profit


Lines 115-116

115 syn1 += l1.T.dot(l2_delta)


116 syn0 += l0.T.dot(l1_delta)

This final piece of our code is the Glory Moment: all our work is complete, and we reverently carry our hard-earned l1_delta and l2_delta up the ancient,
stone steps of our podium to our hallowed leader, Emperor Synapse, the true brains of our operation.
The first step of computing the update to syn0 is to multiply l1_delta by the input l0. Then we add that product to the current syn0. This causes large
changes in the components of syn0 that have stronger effects on l1. In other words, we are downplaying questions, or combinations of questions, that we
mistakenly thought were important in this iteration. It turns out they were not as important as we thought, so the l1_delta helps us correct our mistakes
in the right direction, which will bring us closer to The Oracle's castle in our next iteration.
Consider the very important phrase, "learning by trial-and-error." This is exactly what our network does. Here's how it used trial-and-error to learn:

1. Trial: The network performs 60,000 feed forwards, which are 60,000 trials of this-or-that combination of feature questions that produce the best
prediction we can muster for each iteration;
2. Error: The network takes the current prediction and compares it to the actual truth to find the error;
3. Learning: The network uses confidence measures (gradient descent) to figure out the right amount of change in the right direction for the
next iteration, i.e., the delta. Then the network takes the new-and-improved delta (the change we want) and multiplies it by the old l1 or l2, which
were our best guess at the time of what the right combination of feature questions was. Then it adds this product to the old syn0 or syn1. Do you
see how the synapses are our network's brain? They have "learned" by using the new delta to improve the old "trial" (the old layers) and adding
the result to the old synapse. Our network has learned from its mistakes, just like humans do.
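The whole trial-and-error loop can be sketched in a few lines of NumPy (toy data and a random seed of my own choosing, so the exact weights won't match the blog's printed numbers):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)              # slope at a sigmoid output value
    return 1 / (1 + np.exp(-x))

# Toy data: 3 yes/no features per customer, 1 true label each.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

for _ in range(60000):
    # 1. Trial: feed forward to get this iteration's best prediction.
    l0 = X
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    # 2. Error: compare the prediction to the truth.
    l2_error = y - l2
    # 3. Learning: deltas via confidence measures, then update the synapses.
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
```

After the loop, the average error has shrunk to a tiny fraction of where it started: the network has learned from its mistakes.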

You might say, "The neurons that fire together, wire together!" (i.e., the combinations of features that contribute most to an accurate prediction end up
being prioritized in the network). And it is the synapses' weights that wire them together.
In math parlance, you could say it is efficient to change weights in the synapse that correspond to larger values of l1 (i.e., if a feature of l1 has a large
value, a small change in the weights that are multiplied by this value can have a large effect on l2). The multiplication ensures that the total change
applied to the synapse maximizes the impact on l2. In other words, it produces an increment in the direction of steepest descent.
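A tiny sketch of why the multiplication by l1 does this (hypothetical activation values):

```python
import numpy as np

# Hypothetical hidden activations and one desired output change.
l1 = np.array([[0.9, 0.1, 0.5, 0.05]])
l2_delta = np.array([[0.2]])

update = l1.T.dot(l2_delta)   # the 4x1 change added to syn1
# The weight fed by the 0.9 activation moves 9x as much as the one fed by
# 0.1, because that weight has 9x the leverage over l2.
```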
Yes, my dears, finding the delta and updating syn0 and syn1 by adding the product of the old layer and the new delta is gradient descent. It will lead our
little white ball one step further down the surface of the curvy, red bowl to that ideal bottom of our bowl, where error is smallest, predictions are most
accurate, and joy abounds! This is such an important concept that I want to further illustrate it below, to tie together all the fine threads of insight and
intuition we have joyously gathered! Alas, my heart is bursting...

4.8) Surprise! The Slope of Confidence Measures is the Same as the Slope of the Surface of the Curvy Red Bowl!
Let's review a bit and dazzle you by tying all that confidence measure stuff together with the curvy, red bowl diagram we saw in 1.3, The Big Picture.
Think of the curvy, red bowl as The Force. The curvy, red bowl is a visual, geometric representation of every shenanigan we have done in this entire
blog.
Respect. The. Bowl.
When we update our synapse matrix by adding the product of the old layer and the new delta, it's going to give that synapse's value a nudge in the right
direction towards a confident and accurate prediction. When I say, "nudge in the right direction," what I mean is this:
Some values of our l2_delta are going to be negative. l2_delta being negative means that the next l2 should be closer to zero, which can
manifest in multiple ways in the synapses. So it's important to notice that finding the slope is what gives us a sense of "direction." That sense of
direction, or slope, is key to rolling our white ball to the bottom of the bowl, where l2_error is near zero and predictions are fabulously accurate. The
network's job is to increment the weights such that the next l2 has moved by l2_delta.

(taken with gratitude from Grant Sanderson Ch. 2)


Above is a nice, simple 2D picture of the "rolling ball" of gradient descent (the curve, you may recall, is the cost function of our network, aka the total
l2_error, aka the height of the yellow arrow from the white grid to Point A).
Line 88 computes each value of l2 as a slope value. Of the 3 dips, or "bowls," pictured in the diagram above, it is clear that the true global minimum is
the deepest bowl on the far left. But, for simplicity's sake, let's pretend that the bowl in the middle is the global minimum. So, a steep slope downwards
to the right (i.e., a negative slope value, as depicted by the green line in the picture) means our ball will roll a lot in the negative direction (i.e., to the
right), causing a big, negative adjustment to the corresponding weights of syn1 that will be used in the next iteration to predict l2. In other words, some
of the four l2 values will approach zero and a probability prediction that the customer in question did NOT buy Litter Rip!.
However, if for example you have a shallow slope downwards to the left, that would mean the prediction value is already accurate and confident, which
produces a tiny, positive slope value, so the ball will roll very little to the left (i.e., in a positive direction), thus adjusting the corresponding
weight in syn1 by very little, so the next iteration's prediction of that value will remain largely unchanged. This makes sense because, as the network's
predictions become more accurate, the back-and-forth motion of the rolling ball becomes smaller and smaller before it soon comes to rest at the global
minimum, the bottom of the bowl, so there's no need to move much. It is already close to the ideal.
The above diagram is a tad oversimplified, in order to give you the gist of gradient descent. I don't mean to suggest that gradient descent is simply
finding the slope in one dimension. It's much more elegant than that. The gradient also gives you the relative importance of movement along different
axes. That is, you might know three things from the gradient in 3D. For example:

1. Moving south decreases my error;
2. Moving east decreases my error;
3. Moving east produces a much larger reduction in my error than moving south.

So, moving east should be our top priority. Again, think of that erratic, stumbling dotted line we saw wandering down the curvy surface of the red bowl
pictured at the beginning of this blog.
And here is a link to another useful picture of what gradient descent looks like. This image is similar to the curvy red bowl, sitting on the "tabletop"
plane of the syn0, syn1 grid, with an arrow showing Feed Forward and a tiny arrow showing slope/gradient descent:
https://docs.google.com/document/d/1_YXtTs4vPh33avOFi5xxTaHEzRWPx2Hmr4ehAxEdVhg/edit?usp=sharing
We have now arrived at the end of Section 4 of 5, "Learning from Trial-and-Error, Gradient Descent." It's time for me to confess that I tricked you: you
have already been doing calculus but we didn't call it calculus!
Here's the unvarnished truth:

Taking the confidence measure of each value of l2 = Taking the slope of l2 = Taking the derivative of l2!
Ain't that simple and lovely? This fabulous process gives us the amounts by which we need to adjust each of the 12 values of syn0 and the 4 values of
syn1. If you understand the wonder of confidence measures, then you're doin' just fine so far.
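If you want to see for yourself that the confidence measure really is the slope, here is a quick numerical check (the input value 0.7 is arbitrary): take the sigmoid's output a, compute a*(1 - a), and compare it with a tiny rise-over-run around the same point:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 0.7                                  # an arbitrary input point
a = sigmoid(x)
analytic = a * (1 - a)                   # the "confidence measure" slope

h = 1e-6                                 # a tiny run for rise-over-run
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# analytic and numeric agree to many decimal places: slope = derivative.
```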
But wait, Kids--there's more! When it comes to gradient descent and learning by trial-and-error, confidence measures are just the tip of the iceberg. They
work great in this simple neural network, where you only have binary features with Yes/No responses, but now you're ready to learn a strategy that you
can apply to even the most complicated neural networks, with many layers and features.

Section 5 of 5: Back Propagation


In the next section, why settle for just the tip? I'm going to give you the whole iceberg: back propagation. I'm going to teach you the chain rule of
calculus, which will give you the back propagation process that will allow us to rule the world! Prepare to unleash the power within...

5.1) Busting Myths About Back Prop


OK, me lovelies, in order to maintain your (sacred!) trust I must now confess that I played a little trick on you. The previous section was all about back
propagation using calculus, but I didn't want to use those terms because I thought you might get flustered and lose that perfect, "Zen" quality to your
learning that I have grown to so admire during our time together...
So, in Section 4 you just learned how confidence measures make it possible for a network to learn, and you dig that. But Baby, lemme tell ya, you ain't
seen nothin' yet. Learning confidence measures is like discovering that apples only fall down from trees--never up. But our next topic--back propagation
using the chain rule of calculus--is like mastering the concepts of gravity so well that you could launch a rocket into space.
Yeah, that's what I'm talkin' 'bout. So get frickin' psyched, because now I'm going to teach you The Holy Grail: The Chain Rule.
Now we have entered the brain of the beast; here is the secret sauce of deep learning. Back propagation is the key tool for performing gradient descent,
but back prop has a bad reputation for being intimidating to learn. Well! That's all vicious gossip, my friends. Back prop is eager to help us, and she only
has the best of intentions. Let not terror grip your heart, my Dears, Dave is here to hold you close and keep you safe!
Let's bust two myths, shall we?
Myth #1: Back Propagation is Super-Hard
False. Back propagation requires patience and persistence. If you read a text on back prop once and throw up your hands because you understand
nothing, you're done. But if you watch Grant Sanderson's video 3 on back prop five times at reduced speed, then watch video 4 on the math of back prop
five times, you'll be well on your way. Grit.
Myth #2: To Understand Back Prop, You Need Calculus
There are many online posts which state with great authority that you need multivariable calculus in order to understand AI. Not true. Deep learning
author and guru Andrew Trask says that, if you took three semesters of college-level calc, only a tiny subset of that material would be useful for
learning back propagation: the chain rule. But even if you studied those three semesters of college calculus, the chain rule is often presented very
differently in college from the way you would use it in back propagation. You'd be much better off reading Kalid Azad on the chain rule five times.
So, bottom line? Don't make the same mistake I did: I panicked every time I saw the word "derivative," and it was self-defeating. You must fight that
inner voice saying, "I don't have the background to master this." There are workarounds--simple ways to do calculus without calling it "calculus." But
there is no workaround for grit.
Here is my favorite saying: "There is the task, and there is the drama about the task."
Leave your drama here now. Please give me your grit, and your trust. Let's learn back prop.

5.2) Back Prop Defined

5.2.a) Gradient Descent is Back Prop is the Chain Rule


OK, me lovelies, off we go then...
Let's start with The Big Picture and some definitions: "Gradient" means slope, and "descent" means going down. Gradient descent is the name of a
process where we find the slope of our error function near our current, erroneous prediction, and we use that slope to make the current amount of our
error go down in the next iteration.
In Section 4 of 5 above, we did one form of gradient descent in several steps:

1. Compare our prediction to y to find the error of our prediction (4.2);
2. Find the slope (i.e., the confidence measure) of l1 and l2 (4.4);
3. Use that slope to find the delta, which is the change we want to see in our next iteration of l1 and l2 (4.5 and 4.6);
4. Use the delta to update syn0 and syn1 (i.e., to learn from the mistakes of this iteration and make our error go down next time) (4.7).

The gradient descent process we learned in Section 4 is a valuable step toward understanding the concept, but it has limitations. In this section,
5 of 5, we are going to learn gradient descent from a different perspective, one that can be applied to most neural networks.
Back propagation is absolutely essential to understanding neural networks. It is a very elegant piece of math that allows us to adjust every single one
of our weights at the same time, relative to each other, and relative to the prediction we will make in our next iteration. Back prop does this by applying
the chain rule.
What is the chain rule? First I'll give you the math definition, and then an analogy to help you picture it.
Mathy Definition (don't forget to breathe, and trust that Dave will teach you this): The essence of the chain rule is that the derivative of nested
functions is the product of the derivatives of the individual functions. In other words, the end derivative is simply the product of the derivatives (ratios of
change) of the component functions.
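A toy check of that definition (the function is my own example, not from the network): for f(x) = (3x + 1)**2, the outer function's slope times the inner function's slope matches a direct rise-over-run measurement:

```python
# f(x) = (3x + 1)**2, i.e. the outer function u**2 nested around the inner 3x + 1.
def f(x):
    return (3 * x + 1) ** 2

x = 2.0
# Chain rule: (slope of outer at the inner value) * (slope of inner) = 2*(3x+1) * 3.
chain_rule = 2 * (3 * x + 1) * 3         # 42 at x = 2

h = 1e-6                                 # direct rise-over-run check
numeric = (f(x + h) - f(x - h)) / (2 * h)
```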
An Analogy You Can Picture: Think of the chain rule as a juggler performing in the circus. In our network we have 16 weights, so think of them as 16
bowling pins this amazing juggler can keep up in the air all at the same time. Pretend that these bowling pins are not all the same size, so the juggler
has to be very skilled to throw each pin up with the correct amount of force and timing, while keeping the other 15 in the air as well. Now pretend that
this juggler is so good that, even if one bowling pin suddenly changes size in midair, the juggler can instantly adjust the other 15 pins to compensate for
the change in that one pin's size. This miracle juggler can adjust for any and all changes in the pins while still keeping all 16 in the air!
That's a pretty cool juggling talent, right? Well, that's what makes the chain rule so elegant: there are so many paths from even just one weight in the
synapses to the final prediction, and yet the chain rule can adjust them while keeping them all straight, relative to each other. And our network is tiny.
The chain rule can do this even with big networks containing millions or billions of weights!
Nice.

5.2.b) The Problem We Want to Solve: How to Tweak 16 Weights

My devoted students, whom I have grown to love with the passage of time, here is the challenge: every time you tweak one of your 16 weights in syn0
and syn1, it has a ripple effect throughout the whole network. How can you calculate the best adjustment to each of the 16 weights while taking into
consideration that adjustment's effect on the prediction error, which depends on the ripple effect from the 15 other weights? That sounds crazy-complex, right?
Can do. We've already done the necessary computations using our Python code and confidence measures, but in order to master this stuff, we're going
to do the same thing again from a different angle: the chain rule.
Let me show you the World's Greatest Parlor Trick. Those of you who know calculus will understand when I say we are going to use the chain rule to
take derivatives. But those of us who don't know calculus are not intimidated in the least, right? That's right, because we will use slope, Good Ol' rise-
over-run, to juggle 16 bowling pins in the air at the same time. The secret? To find the slope is to take the derivative. As I mentioned above, they are
exactly the same thing in this context. (Psst: Don't tell all the calculus teachers. They'll be out of a job...)

5.2.c) Intro to the Chain Rule and Ratios of Change


Here's the twist: this time, using the chain rule, we're going to find the slope of a ratio of change.
Ratio of change is a key concept to remember here. Do you remember from Section 3.2 that syn0,1 is 3.66? Well, here is the key overall question: "When
we adjust syn0,1 by nudging it up or down, how much does that increase or decrease the l2_error?" In other words, think of l2_error's reaction to a
change in syn0,1 as a "sensitivity," or a relationship, or a ratio of change: we know that if we wiggle syn0,1, up or down, then the l2_error will wiggle up or
down in proportion to that nudge. But how big is the proportion? Will it move a little? A lot? What is the ratio of l2_error's wiggle to syn0,1's wiggle? How
sensitive is l2_error with respect to syn0,1?
Remember: the goal of back prop is to figure out the amount to tweak each of our 16 weights so that we can reduce our l2_error as much as possible in
the next iteration. But the challenge is that each of these 16 weights affects many of the other 15 weights, so you might say that how much you adjust
Weight 1 depends on how much you adjust Weight 2, which depends on how much you adjust Weight 3, and so on. Keep this "depends on" image in
your mind because it will come in handy later in the math. How you juggle the 16 changing bowling pins depends on which pin changes by how much.
But alas, no. No, no, no. I must have a bigger analogy! Something sweeping, and bold, and beautiful! Yes, my friends, once again the artiste blooms in
me, and I must bring beauty and joy into your learning with an analogy of butterflies. Who doesn't love butterflies? I couldn't find a way to use both
butterflies and unicorns to explain the chain rule, so you're only getting butterflies. Deal with it.

5.3) Defining The Chain Rule: It's Like The Butterfly Effect
OK. Here we go. Remember that the key overall problem we are trying to solve is, "How much do I adjust the 16 weights of syn0 and syn1 in order to
synch each value perfectly with the other 15 bowling pins, to keep them all in the air and minimize the l2_error so I can arrive at the castle and the
meaning of life?"
Let's break that big question down into tiny little pieces. Instead of looking at 16 bowling pins--er, I mean weights--at a time, let's just look at 1 of the
16--syn0 (1,1), which we have nicknamed syn0,1 for simplicity.
There, that's better. Now my question is this: How do I adjust syn0,1 in an orderly, interdependent manner so that all 16 bowling pins don't fall out of my
hands while I am reducing the l2_error as much as possible?
The answer to that question is the chain rule, which is kind of like The Butterfly Effect, where a butterfly flapping its wings in New Mexico sets off a
chain of events that results in a hurricane in China. (Did you notice? "Chain of events" sounds a tad like "chain rule" in calculus, no? Oh Dave, so very
clever...). Let's look at a picture of how this analogy might apply to the series of ratios we must calculate in back propagation:

So: a butterfly flaps its wings once, setting off an escalating chain reaction. In New Mexico, the tiny air current from the wing flap combines with a
"perfect storm" of other events to create a gust of wind in neighboring Nevada, which combines with other weather to create heavy winds in L.A., which
creates a thunderstorm in Hawaii...you get the idea.
Now, let's walk through the math of our butterfly analogy: When we increase or decrease the value syn0,1, that's like the butterfly in New Mexico
flapping its wings. For simplicity's sake, let's say we just increased our syn0,1 to 3.66. This increase will now ripple through our chain of events--our
"perfect storm" of other weights combining perfectly to reduce the l2_error.
One can never have too much beauty in the world, so now let's combine our butterfly effect with another beautiful analogy, the ripple effect:

5.4) The Butterfly Effect Meets Our 5 Ratios of Change: the Ripple Effect
Loyal friends, you are no doubt aware that the butterfly effect is merely an example of a more general concept, the ripple effect. The diagram below
joyfully combines our butterfly effect with the ripple effect of how one ratio of change affects the next. This diagram looks complicated, but it is actually
made up of only three things:

1. The same white circles we saw above, which describe the butterfly effect from left-to-right;
2. At the bottom of the picture, a row of squares which contain the ratios of change in the back prop chain rule. These squares move
backwards from right-to-left, i.e., a change in the right-most ratio ripples through the other ratios to affect the left-most ratio;
3. Some artful colored arrows connecting the two in a rather pleasing manner.

Ripple 1, the first ripple effect of our tweak to syn0,1 (the butterfly wing flap), will cause l1_LH to increase by a certain proportion, aka a ratio of change.
That's the "gust of wind in Nevada." (see the grey line below)
Since l1_LH is the input of our sigmoid function, to calculate the ratio of change between l1_LH and l1 (aka, "to take the slope") is to measure Ripple 2,
the "heavy winds in L.A." (purple line below)
Then, l2_LH will obviously be affected in proportion to the change in l1 and its subsequent multiplication by syn1,1 (which does not change--we're
leaving syn1,1 and the other 14 weights unchanged for now, to simplify our example), so measuring that ratio of change will give us Ripple 3, the
"thunderstorm in Hawaii." (yellow line below)
You can probably guess that l2 will change in proportion to the change in l2_LH, so taking the slope of l2_LH will give us Ripple 4, the "storm over the
Pacific." (green line below)
Finally, when this new l2 is subtracted from our target y value, the remainder, which is l2_error, will change, and this is Ripple Effect #5, the "hurricane in
China." (light blue line below)

Our goal is to calculate the ratio by which each ripple ripples, in order to know the amount we want to increase or decrease syn0,1 to minimize
l2_error on our next iteration (the dark blue arc line points to both the input and the output of our chain rule function). When we say our neural network
"learns," we really mean it reduces l2_error with each iteration such that the network's predictions become more and more accurate each time. So,
tweaking syn0,1 is like tweaking the flap of the butterfly's wings, which ripples through a chain of events right up to the hurricane in China, which in our
example is the reduction of l2_error.
Since we're working backwards, we would say, "How much the hurricane l2_error changes depends on how much l2 changes, which depends on how
much l2_LH changes, which depends on how much l1 changes, which depends on how much l1_LH changes, which depends on how much our
butterfly, syn0,1, changes." The unspecified rule used in this example is that the size of the step is equal to the slope.

5.5) How the Code Synchs Up with the Math
That's all well and good, but what I really want you to understand is how our Python code synchs up with these ratios of change that ripple from a
change in syn0,1 all the way through to a change in l2_error. So now, let's take a look at which lines of code align with which ratios of change in our
chain rule function.

5.5.a) Removing the Intermediary Variables


My dear students, you will notice with great joy that lines 66 to 115 of our Python code appear above, with (fashionable) red lines connecting them to
our ratios of change. If those connections don't look consistent to you, that is only because our original code breaks the back propagation process
down into several intermediary steps with several extra, intermediary variables. Below I want you to take a look at the code after I remove these four
intermediary variables:

1. l2_error;
2. l2_delta;
3. l1_error; and
4. l1_delta

Start by studying the top pieces of code (with red arrows attached). Pretend the four intermediary variables have disappeared. What does that leave?
Now take a look at the bottom line of the diagram below, with the green arrows pointing upward. You'll see that the bottom line of code with green
arrows is what remains when we remove the intermediary variables from the red-arrowed chunks of code. All three rows now synch up perfectly. Below,
I'll explain what I mean:

5.5.b) How the Code Calculates the Same Variable that Each Ratio of Change Calculates
Mis amigos, you may recall that, with confidence measures, when we took the slope of our statistical probability, that number between 0 and 1, we
simply computed rise over run as x(1 - x): x is the rise, 1 - x is the run.
But here, we're computing something different--and more accurate. We want to know how much a given CHANGE in syn0,1 will cause a CHANGE in
l2_error. It's like saying, "A change in the run, which is syn0,1, will cause how much of a change in the rise, which is l2_error?" So, in order to figure out
the rate of change in d l2_error / d syn0,1, we are going to break that big ratio of change down into 5 little parts, 5 little ratios of change, 5 little cases of
"this change in run causes this much change in rise." In other words, we're going to examine each link of the chain rule separately.
Let's walk through each of these ratios of change, from right-to-left, to make sure you understand how the code synchs up with the math:
For d l1_LH / d syn0,1: We know that l0,1 is 1 (the "yes" answer of customer one to, "Do you own a cat?"). Therefore, the ratio of change will always be 1,
because no matter what value you make syn0,1 to be, l1_LH will always be that value times l0,1, which is one. This makes sense because l1_LH divided
by syn0 will always equal l0.
For d l1 / d l1_LH: In the top code, when you remove the intermediary variables l1_delta and l1_error, you are left with only finding the slope of l1. This is
exactly the same as when we took the confidence measure of l1 in Section 4.5. So, d l1 divided by d l1_LH will always equal the slope of l1.
For d l2_LH / d l1: In the top code, when you remove the intermediary variables l1_error and l2_delta, you are left with only syn1,1. This makes sense,
because d l2_LH divided by d l1 will always equal syn1.
For d l2 / d l2_LH: In the top code, when you remove the intermediary variables l2_delta and l2_error, you are left with only the slope of l2. This makes
sense, because d l2 divided by d l2_LH will always equal the slope of l2.


For d l2_error / d l2: The "-1" in the bottom green code makes sense because the relationship is a negative correlation. For example, take y - l2 = l2_error,
and for our first customer 1 - 0.5 = 0.5. If you increase l2 by 0.1, then 1 - 0.6 = 0.4. In other words, l2_error decreased by 0.1 when we increased l2 by
0.1. That's a 1-to-1 negative correlation.
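That negative correlation can be sketched in a few lines of Python, using customer #1's numbers (y = 1, l2 = 0.5):

```python
y = 1.0

def l2_error(l2):
    # l2_error is defined as y - l2
    return y - l2

# Nudge l2 from 0.5 up to 0.6 and watch l2_error fall by the same amount
rise = l2_error(0.6) - l2_error(0.5)  # -0.1
run = 0.6 - 0.5                       #  0.1
print(rise / run)  # -1.0
```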
It's key that you understand that all the slopes we are calculating above are being evaluated at the CURRENT STATE OF THE NETWORK (i.e. with
weights fixed at the values used in feed forward). To use our juggler's analogy, it's like he can magically stop time for a moment and take a snapshot of
all 16 pins in the air at that moment. This is like our prediction at the end of one forward feed. Since one bowling pin has changed in size, he now
magically adjusts the sizes of the other 15 pins, still in mid-air, before he magically starts time moving again (i.e., the next iteration).
I hope you are beginning to see the amazing power of the chain rule to juggle all the weights of a neural network while adjusting them relative to each
other. The chain rule is the guts of back prop, which is the guts of gradient descent.
Again: our goal is to calculate these 5 ratios and multiply them together in order to find the ultimate ratio of how much a change in our butterfly, syn0,1,
creates the change we want in our hurricane, the l2_error. How do we calculate those ratios? Next, let's take one example of one weight and walk
through all the steps of the math of the chain rule.

5.6) A Working Example: Walking Through the Math of Changing One Weight in syn0
For your convenience, here again are the variables we found in the Section 3 feed forward:

l0= 1
syn0,1= 3.66
l1_LH= 3.82
l1= 0.98
syn1,1= 12.21
l2_LH= 0.00
l2= 0.50
y= 1 (This is a "Yes" answer to survey question 4, "Ever bought Litter Rip?" which corresponds to training example
#1, i.e., row 1 of l0)
l2_error = y-l2 = 1-0.5 = 0.5
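Here is that snapshot written out as Python values, with the l2_error line computed rather than hard-coded (the names mirror the variables in the network's code; the _1 suffixes stand in for the "first element" subscripts):

```python
# Feed-forward snapshot from Section 3, for training example #1
l0_1 = 1.0      # customer 1 answered "yes" to "Do you own a cat?"
syn0_1 = 3.66
l1_LH = 3.82
l1 = 0.98
syn1_1 = 12.21
l2_LH = 0.00
l2 = 0.50
y = 1.0         # "yes" to survey question 4, "Ever bought Litter Rip?"

l2_error = y - l2
print(l2_error)  # 0.5
```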

5.6.a) Please Learn from the Mistake I Made


When you tackle this ratio of change, don't make the same mistake I made. Here's how I first attempted to calculate the ratios (i.e., the wrong way):


5.6.b) It's Not Just Calculating a Ratio--It's the CHANGE in the Ratio: Slope
For $250,000 cash and a trip to our bonus round: what's wrong with the picture above?
Answer: In my calculations above, I forgot that the goal is to calculate relative change. Do you see all those nice "d's" in front of each variable above? I
mistakenly ignored those, and just wrote down the current values for each variable.
But those d's stand for delta, folks. And delta is too important to ignore. Think of each delta as a wiggle to the input. To compute the slope (or
sensitivity), you have to apply a small wiggle to the input and observe how large the corresponding wiggle is in the output. The slope is not simply the
current output divided by the current input. (NB: This might be a good time to review Kalid Azad on the Chain Rule.)
I want to make sure this is clear for you. Consider for a moment: we want to predict the future (so to speak). We want to know how much a change in
syn0,1 will ripple through our network to cause a change for the better in our next, future iteration of l2_error. In other words, we're not just comparing
how a single number 3.66 in syn0,1 affects a single number 0.5 in l2_error. We want to compare how a change in the number 3.66 in syn0,1 will in turn
change the 5 rippling ratios to ultimately create a better, future l2_error. Specifically, we want to make that 0.5 value smaller, as quickly as possible. So,
we don't want a bunch of ratios of numbers; we want the ratio of CHANGE in the numbers, the numbers that change our future. The delta.
OK, so hopefully you now agree that we need two values for each of our variables. Well, to be precise: we wouldn't need two points if we were doing
cold, hard calculus, but I'm doing things this way to provide a clearer understanding of what's going on under the hood.
We already have a current value for each variable. We need to provide a second value for each variable and subtract one value from the other. The
difference will be an "amount of change," or delta, that we can then compare to the amounts of change, or deltas, of the other variables. It looks like this
formula:

5.6.c) The Formula for Calculating Ratios of Change: x_current and x_nearby


"Current" above refers to the current values we have for each variable. "Nearby" means we want to provide a number pretty close to our current number,
for convenience's sake. That way, when we subtract the current number from the nearby number, it will yield a small number that's easy to calculate in
our ratio of change. And it will give a more accurate slope if, for example, your two points are on a curvy graph line. Let's walk through our 5 ratios
together so we can practice finding the "nearby's" for each ratio. Once we've done a full example, you'll be a giant step closer to understanding back
propagation.
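Here is that formula as a small Python helper (the function name ratio_of_change is my own, for illustration, not part of the original code):

```python
def ratio_of_change(f, x_current, x_nearby):
    # Slope estimated as (y_nearby - y_current) / (x_nearby - x_current)
    y_current = f(x_current)
    y_nearby = f(x_nearby)
    return (y_nearby - y_current) / (x_nearby - x_current)

# Example: the slope of l2_error = y - l2 around l2 = 0.5, with y = 1
slope = ratio_of_change(lambda l2: 1 - l2, 0.5, 0.6)
print(slope)  # -1.0
```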


Take a look at the full picture of back prop now, including this newly-added bottom line (with blue arrows) which computes the ratios of change:

5.6.d) Crunching the Actual Numbers of an Example


OK. Now let's walk through these five ratios of change together:
Ratio 1, which is Ripple 5, the hurricane in China (We start in China since "back propagation" means we're working backwards through the 5 ripples of
our ripple effect, right?): d l2_error / d l2. Where did our "currents" and our "nearby's" come from?

1. x_current is the l2 we calculated from our feed forward, 0.5;


2. y_current is y-l2 = 1-0.5 = 0.5, our l2_error;
3. x_nearby is simply a convenient example we made up. We know that if l2 were 0.6, which is indeed nearby our x_current of 0.5, then y-0.6 would
be 0.4.
4. Hence, x_nearby = 0.6 and y_nearby = 0.4


5. Once you are clear on your 4 variables, the math is easy and the slope, aka the sensitivity = -1.
6. This means that for every 1 you increase l2, it decreases the l2_error by 1. A delta of 1 in our l2 produces a delta of -1 in our l2_error. Nice.

Ratio 2, which is Ripple 4, the storm over the Pacific. d l2 / d l2_LH.

1. We know x_current is l2_LH from our feed forward: 0.00.


2. y_current is our l2, 0.50.
3. But how do we find the nearby's? We eyeball the S-curve of our sigmoid function in the diagram above, to find a convenient ratio to plug in.
4. We notice that at 0.1 on the X axis, Y is 0.525. Nice, let's use that.
5. So, our x_nearby becomes 0.1 and our y_nearby becomes 0.525, and it's all over but the math: answer is 0.25.

Why, you might ask, do we eyeball the S-curve of our sigmoid function? Look at the corresponding code that the blue arrow points to. It's not asking us
to squish the LH side of l2 into a number between 0 and 1 with the return 1/(1+np.exp(-x)) portion of our sigmoid code. Rather, this time it's
asking us to take the slope (aka derivative) of l2, by calling the return x*(1-x) code with (deriv==True). So, input is the X axis (i.e., 0), and output is
the Y axis (i.e., 0.5).
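Here is a sketch of that two-mode sigmoid helper, following the nonlin pattern from Andrew Trask's code that our network uses. It checks the numbers above: l2_LH = 0 squashes to l2 = 0.5, and the slope of the S-curve at that point is 0.25:

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        # x here has ALREADY been squashed by the sigmoid,
        # so the slope of the S-curve at that point is x * (1 - x)
        return x * (1 - x)
    # Squash the left-hand total into a number between 0 and 1
    return 1 / (1 + np.exp(-x))

l2 = nonlin(0.0)                 # input 0 on the X axis gives output 0.5
slope = nonlin(l2, deriv=True)   # and the slope there is 0.25
print(l2, slope)                 # 0.5 0.25
```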
Next is Ratio 3, Ripple 3, the thunderstorm in Hawaii. d l2_LH / d l1.

1. We know x_current is l1, or 0.98.


2. y_current is l2_LH, the product when we multiply the entire first row of l1 by the full column of syn1. Answer is 0.0.
3. Note that, on this ratio, the code does not ask us to take the derivative (aka slope). So for our x_nearby and our y_nearby, we can choose any darn
number we please. Let's choose 1 for x_nearby and 0.2442 for y_nearby.
4. Subtract, divide, done, and notice that it equals our syn1 first value, 12.21. Hallelujah.

Ratio 4, Ripple 2, the heavy winds in L.A. Having waded far upriver, we are now nearing the source of our ripples, syn0,1! Exciting. Note that the code is
asking us to take slope, so you know we'll be eyeballing our S-curve again.

1. x_current is l1_LH, or 3.82, so look that up on our X axis.


2. y_current lines up at about 0.975, no? Done.
3. Now, since we're taking slope this time, for our x-and-y nearby's we have to find a nice pair of coordinates on our S-curve that make for
convenient math. How 'bout x_nearby as 4? That would make y_nearby about 0.982. Lovely.
4. Do the math and it's 0.04. Joy.

Final Ratio, #5, Ripple 1, the gust of wind in Nevada. We are nearing the "source of our mountain stream." x_current is syn0,1, or 3.66. y_current is l1_LH,
or 3.82. Our code is not asking for any slopes/derivatives because it is a linear function (which looks like a straight line on a graph), so the distance
between coordinates can be very large and the slope will still be exactly right. For curvy functions (parabolas, sigmoids, etc) the points should be close
together to minimize the effect of curvature on your estimate of the slope. Thus, for our nearby's on this linear function, we can use numbers that are,
well...convenient, rather than nearby. Hence: x_nearby is 4 and y_nearby is 4.16. And the math happens to work out (again, conveniently) to 1, which is
the l0.
OK. We now have our 5 ratios, so let's multiply them together to come up with an answer to our question, "How much will l2_error increase/decrease,
depending on how much I increase/decrease syn0,1?"

1 x 0.04 x 12.21 x 0.25 x -1 = -0.1221 = d l2_error / d syn0,1

In other words, for every 1 we increase syn0,1, the l2_error decreases by 0.1221.
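You can check that multiplication in a few lines of Python (the five values are the rounded ratios from the walkthrough above):

```python
# The five ratios of change, working backwards from China to Nevada:
# d l2_error/d l2, d l2/d l2_LH, d l2_LH/d l1, d l1/d l1_LH, d l1_LH/d syn0,1
ratios = [-1.0, 0.25, 12.21, 0.04, 1.0]

chain = 1.0
for r in ratios:
    chain *= r  # the chain rule: multiply each link's ratio of change

print(round(chain, 4))  # -0.1221 = d l2_error / d syn0,1
```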
To summarize: our goal in this section was to measure the butterfly effect. If syn0 is the flap of a butterfly's wings in New Mexico, how do we measure
its reduction of l2_error, the hurricane in China? When we say our neural network "learns," we really mean it reduces l2_error with each iteration such
that the network's predictions become more and more accurate each time. So, tweaking syn0 is like adjusting the size of the flap of the butterfly's
wings, which ripples through a chain of events right up to the hurricane in China, which in our example is the reduction of l2_error.
We measured this butter y effect by working backwards, i.e., "How much the hurricane l2_error changes depends on how much l2 changes, which
depends on how much l2_LH changes, which depends on how much l1 changes, which depends on how much l1_LH changes, which depends on how
much our butter y, syn0, changes."
Congrats! You just completed your first chain rule calculation of back propagation. If you have followed things thus far, then you are well on your way to
becoming a Back Propagation Rock Star. If not yet, hey--no problem! Just reread the above several more times and click on the helpful links of the
Super Teachers I have cited in Section 1.1 above.

In Closing...
Dearest Friends (may I call you a dear friend? I don't want to be presumptuous, but I feel we've already been through SO much together...)
Sigh...we have completed our final section, and I know how you are feeling. It's like you've been reading a book that is SO good you never want it to
end. But now you have mastered back prop, which is the key tool of gradient descent, and you're all grown up. It's time to send you off to a lucrative
career in AI, and this old house of mine will feel empty without you, but I pray you'll come back soon and support me in my old age...
However, until that day I will proudly send you on your way with a parting gift:
Reading Andrew Trask taught me that memorizing these lines of Python code leads to mastery, and I agree for two reasons:
1) When you try to write out this code from memory, you will nd that the places where you forget the code are the places where you don't understand
the code. Once you understand this code perfectly, every part of it will make sense to you and therefore you will remember it forever;
2) This code is the foundation on which (perhaps) all deep learning networks are built. If you master this code, every network you learn and every paper
you wade through will be clearer and easier because of your work in memorizing this code.


Memorizing this code was made easy for me by making up an absolutely ridiculous story that ties all the concepts together in a fairy-tale mnemonic.
You will remember better if you make your own, but here's mine, to give you an idea. I count the 13 steps on my fingers as I recite this story out loud:

1. Sigmund Freud (think: Sigmoid Function) absolutely treasured his neural network, and he buried it like a pirate's treasure,
2. "X" marks the spot (Creating X input that will become l0. The survey responses).
3. "Why," I asked him (Create the y vector of target values), "didn't you plant
4. Seeds instead?" (Seed your random number generator) "You could have grown a lovely garden of
5. Snapdragons," (Create Synapses: Weights) "which could be fertilized by the
6. Firm poop" (For loop) "of the deer that
7. Feed on the flowers" (Feed Forward Network)! Then suddenly, an archer
8. Missed his target (By How Much did l2 Miss the Target of y?) and killed a grazing deer. As punishment, he was forced to
9. Print his error (Print Error) 500 times on a blackboard facing the
10. Direction of his target (In What Direction is y? aka, Is l2_delta pos. or neg?). But he noticed behind the
11. BACK of his target two deer were mating and PROPAGATING their species (Back Propagation) and he shouted for them to stop but they wouldn't
take
12. Direction and ignored him (In what Direction is the ideal l1? aka, is l1_delta pos. or neg?). He got so angry that his mind
13. Snapped and he Descended into Gradient insanity (Gradient Descent, Update Synapses).

So, this is a very silly story, but I can tell you that it has burned those 13 steps into my brain. Once I can write down those 13 steps, and I understand
the code in each step, writing the code perfectly from memory becomes easy.
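For reference, here is a sketch of the 13 steps as code. It follows the shape of Andrew Trask's network that this blog builds on; the survey data below is made up for illustration, so treat the exact numbers as placeholders:

```python
import numpy as np

# 1. Sigmoid function
def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# 2. X input that will become l0 (made-up survey answers)
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]], dtype=float)
# 3. y vector of target values
y = np.array([[1], [0], [1], [0]], dtype=float)
# 4. Seed the random number generator so runs are repeatable
np.random.seed(1)
# 5. Create synapses (weights), centered on zero
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

# 6. For loop
for i in range(60000):
    # 7. Feed forward
    l0 = X
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    # 8. By how much did l2 miss the target y?
    l2_error = y - l2
    # 9. Print the error occasionally (it shrinks each time)
    if i % 10000 == 0:
        print("Error:", np.mean(np.abs(l2_error)))
    # 10. In what direction is y? (is l2_delta positive or negative?)
    l2_delta = l2_error * nonlin(l2, deriv=True)
    # 11. Back propagation: how much did each l1 node contribute?
    l1_error = l2_delta.dot(syn1.T)
    # 12. In what direction is the ideal l1?
    l1_delta = l1_error * nonlin(l1, deriv=True)
    # 13. Gradient descent: update the synapses
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
```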
I hope you can tell that I loved my journey into deep learning, and I wish you the same joy I found! Feel free to email me improvements to this article at:
DavidCode1@gmail.com
THE END

