
Understanding Neural Networks as Statistical Tools 1

Brad Warner
Department of Preventive Medicine and Biometrics
University of Colorado
Health Sciences Center
Denver, CO 80220
Ph. (303)-399-8020 ext. 2521
bw@rover.uchsc.edu

Manavendra Misra
Department of Mathematical and Computer Sciences
Colorado School of Mines
Golden, CO 80401
Ph. (303)-273-3873
Fax (303)-273-3875
mmisra@mines.edu

1 Starting June 30, 1996, Brad Warner will be an Assistant Professor at the United States Air Force Academy.
Manavendra Misra is an Assistant Professor, Colorado School of Mines, Golden, CO 80401. The authors thank
Guillermo Marshall, the referees, and editors for the helpful comments that have improved this paper. The authors
also thank Dr. Karl Hammermeister and the Department of Veterans Affairs for the use of their data. Please direct
all correspondence to the second author at the Colorado School of Mines address.
Abstract

Neural networks have received a great deal of attention over the last few years. They are being used in the
areas of prediction and classification, areas where regression models and other related statistical techniques
have traditionally been used. In this paper, we discuss neural networks and compare them to regression
models. We start by exploring the history of neural networks. This includes a review of relevant literature
on the topic of neural networks. Neural network nomenclature is then introduced and the backpropagation
algorithm, the most widely used learning algorithm, is derived and explained in detail. A comparison between
regression analysis and neural networks in terms of notation and implementation is conducted to aid
the reader in understanding neural networks. We compare the performance of regression analysis with that
of neural networks on two simulated examples and one example on a large data set. We show that neural
networks act as a type of nonparametric regression model, enabling us to model complex functional forms.
We discuss when it is advantageous to use this type of model in place of a parametric regression model and
also some of the difficulties in implementation.

Key-words: Nonparametric Regression, Artificial Intelligence, Backpropagation, Generalized Linear Model.
1 Introduction

Neural networks have recently received a great deal of attention in many fields of study. The excitement
stems from the fact that these networks are attempts to model the capabilities of the human brain. People
are naturally attracted by attempts to create human-like machines; a Frankenstein obsession, if you will.
On a practical level, the human brain has many features that are desirable in an electronic computer. The
human brain has the ability to generalize from abstract ideas, recognize patterns in the presence of noise,
quickly recall memories, and withstand localized damage. From a statistical perspective, neural networks
are interesting because of their potential use in prediction and classification problems.
Neural networks have been used for a wide variety of applications where statistical methods are traditionally
employed. They have been used in classification problems such as identifying underwater sonar
contacts (Gorman & Sejnowski, 1988), and predicting heart problems in patients (Baxt, 1990, 1991; Fujita,
Katafuchi, Uehara & Nishimura, 1992). They have also been used in such diverse areas as diagnosing hyper-
tension (Poli, Cagnoni, Livi, Coppini & Valli, 1991), playing backgammon (Tesauro, 1990), and recognizing
speech (Lippmann, 1989). In time series applications, they have been used in predicting stock market perfor-
mance (Hutchinson, 1994). Neural networks are currently the preferred tool in predicting protein secondary
structures (Qian & Sejnowski, 1988). As statisticians or users of statistics, we would normally solve these
problems through classical statistical models such as discriminant analysis (Flury & Riedwyl, 1990), logistic
regression (Studenmund, 1992), Bayes and other types of classifiers (Duda & Hart, 1973), multiple regression
(Neter, Wasserman & Kutner, 1990), and time series models such as ARIMA and other forecasting methods
(Studenmund, 1992). It is therefore time to recognize neural networks as a potential tool for data analysis.
Several authors have done comparison studies between statistical methods and neural networks (Hruschka,
1993; Wu & Yen, 1992). These works tend to focus on performance comparisons and use specific problems
as examples. There are a number of good introductory articles on neural networks usually located in various
trade journals. For instance, Lippmann (1987) provides an excellent overview of neural networks for the
signal processing community. There are also a number of good introductory books on neural networks

with Hertz, Krogh, and Palmer (1991) providing a good mathematical description, Smith (1993) explaining
backpropagation in an applied setting, and Freeman (1994) using examples and code to explain neural
networks. There have also been papers relating neural networks and statistical methods (Buntine & Weigend,
1991; Ripley, 1992; Sarle, 1994; Werbos, 1991). One of the best for a general overview is Ripley (1993).
This paper intends to provide a short, basic introduction of neural networks to scientists, statisticians,
engineers, and professionals with a mathematical and statistical background. We achieve this by contrasting
regression models with the most popular neural network tool, a feedforward multilayered network trained
using backpropagation. This paper provides an easy-to-understand introduction to neural networks, avoiding
the overwhelming complexities of many other papers comparing these techniques.
Section 2 discusses the history of neural networks. Section 3 explains the nomenclature unique to the
neural network community and provides a detailed derivation of the backpropagation learning algorithm.
Section 4 shows an equivalence between regression and neural networks. It demonstrates the methods on
three examples. Two examples are simulated data where the underlying functions are known and the third
is on data from the Department of Veterans Affairs Continuous Improvement in Cardiac Surgery Program
(Hammermeister, Johnson, Marshall & Grover, 1994). These examples demonstrate the ideas in the paper
and clarify when one method would be preferred over the other.

2 History

Computers are extremely fast at numerical computations, far exceeding human capabilities. However, the
human brain has many abilities that would be desirable in a computer. These include: the ability to quickly
identify features, even in the presence of noise; to understand, interpret, and act on probabilistic or fuzzy
notions (such as `Maybe it will rain tomorrow'); to make inferences and judgments based on past experiences
and relate them to situations that have never been encountered before; and to suffer localized damage without
losing complete functionality (fault tolerance). So even though the computer is faster than the human brain
in numeric computations, the brain far outperforms the computer in other tasks. This is the underlying

[Figures 1 and 2: a biological neuron (dendrites, cell body, cell nucleus, axon hillock, axon, axonal
arborization, synaptic endbulb) and an artificial neuron i with incoming weights w_i1, w_i2, w_i3, ..., w_iN
and threshold μ_i.]
dimensions, exists that can completely delineate the classes that the classifier attempts to identify. Problems
that are linearly separable are only a special case of all possible classification problems.) A major blow to the
early development of neural networks occurred when Minsky and Papert picked up on the linear separability
limitation of the simple perceptron and published results demonstrating this limitation (Minsky & Papert,
1969). Although Rosenblatt knew of these limitations, he had not yet found a way to train other models
to overcome this problem. As a result, interest and funding in neural networks waned. (It is interesting to
note that while Rosenblatt, a psychologist, was interested in modeling the brain, Widrow, an engineer, was
developing a similar model for signal processing applications called the Adaline (Widrow, 1962).)
In the 1970s, there was still a limited amount of research activity in the area of neural networks. Modeling
the memory was the common thread of most of this work. (Anderson (1970) and Willshaw, Buneman, and
Longuet-Higgins (1969) discuss some of this work.) Grossberg (1976) and von der Malsburg (1973) were
developing ideas on competitive learning while Kohonen (1982) was developing feature maps. Grossberg
(1983) was also developing his Adaptive Resonance Theory. Obviously, there was a great deal of work done
during this period with many important papers and ideas that are not presented in this paper. (For a more
detailed description of the history see Cowen and Sharp (1988).)
Interest in neural networks renewed with the Hopfield model (Hopfield, 1982) of a content-addressable
memory. In contrast to the human brain, a computer stores data as a look-up table. Access to this memory
is made using addresses. The human brain does not go through this look-up process; it "settles" to the
closest match based on the information content presented to it. This is the idea of a content-addressable
memory. The Hopfield model retrieves a stored pattern by `relaxing' to the closest match to an input pattern.
Hopfield, however, did not use the network as a memory as it is prone to getting stuck in local minima as
well as being limited in the number of stored patterns (the network could reliably store a total number
of patterns equal to approximately one tenth the number of inputs). Instead, Hopfield used his model for
solving optimization problems such as the traveling salesperson problem (Hopfield & Tank, 1985).
One of the most important developments during this period was the development of a method to train
multilayered networks. This new learning algorithm was called backpropagation (McClelland, Rumelhart

& the PDP Research Group, 1986). The idea was explored in earlier works (Werbos, 1974), but was not
fully appreciated at the time. Backpropagation overcame the earlier problems of the simple perceptron
and renewed interest in neural networks. A network trained using backpropagation can solve a problem
that is not linearly separable. Many of the current uses of neural networks in applied settings involve a
multilayered feedforward network trained using backpropagation or a modification of the algorithm. Details
of the backpropagation algorithm will be presented in Section 3.
Neural network research incorporates many other architectures besides the multilayered feedforward
network. Boltzmann machines have been developed based on stochastic units and have been used for tasks
such as pattern completion (Hinton & Sejnowski, 1986). Time series problems have been attacked with
recurrent networks such as the Elman network (Elman, 1990), the Jordan network (Jordan, 1989), and
real-time recurrent learning (Williams & Zipser, 1989), to mention a few. There are neural networks that
perform principal component analysis, prototyping, encoding and clustering. Examples of neural network
implementations include linear vector quantization (Kohonen, 1989), adaptive resonance theory (Moore,
1988), feature mapping (Willshaw & von der Malsburg, 1976), and counterpropagation networks (Hecht-
Nielsen, 1987). There are of course many more equally important contributions which have been omitted
here in the interest of time and space.

3 Neural Network Theory

The nomenclature used in the neural network literature is different from that used in statistical literature.
This section introduces the nomenclature and explains in detail the backpropagation algorithm (the algorithm
used for estimation of model coefficients).

3.1 Nomenclature
Although the original motivation for the development of neural networks was to model the human brain,
most neural networks as they are currently being used bear little resemblance to a biological brain. (It must

be pointed out that there is research in the areas of accurately modeling biological neurons and the processes
of the brain, but these areas will not be discussed further in this paper because this paper is concerned
with the use of neural networks in prediction and function estimation.) A neural network is a set of simple
computational units that are highly interconnected. The units are also called nodes and loosely represent
the biological neuron. The networks discussed in this paper resemble the network in Figure 3. The neurons
are represented by circles in Figure 3. The connections between units are uni-directional and are represented
as arrows in Figure 3. These connections model the synaptic connections in the brain. Each connection
has a weight, called the synaptic weight and denoted w_ij, associated with it. The synaptic weight w_ij is
interpreted as the strength of the connection from the jth unit to the ith unit.

The input into a node is a weighted sum of the outputs from nodes connected to it. Thus the net input
into node i is:

    netinput_i = Σ_j w_ij output_j + μ_i                                (3)

where w_ij are the weights connecting neuron j to neuron i, output_j is the output from unit j, and μ_i is a
threshold for neuron i. The threshold term is the baseline input to a node in the absence of any other inputs.
(The term threshold comes from the activation function used in the McCulloch-Pitts neuron, see Equation 2,
where the threshold term set the level that the other weighted inputs had to exceed for the neuron to fire.)
If a weight w_ij is negative, it is termed inhibitory since it decreases the net input. If the weight is positive,
the contribution is excitatory since it increases the net input.


Each unit takes its net input and applies an activation function to it. For example, the output of the
jth unit, also called the activation value of the unit, is g(Σ_i w_ji x_i), where g(·) is the activation function
and x_i is the output of the ith unit connected to unit j. A number of nonlinear functions have been used
by researchers as activation functions; the two common choices are the threshold function in Equation 2
(mentioned in Section 2) and sigmoid functions such as:

    g(netinput) = 1 / (1 + e^(−netinput))                               (4)
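Equations 3 and 4 are simple to compute. As a minimal sketch in Python (the helper names net_input and sigmoid are ours, not the paper's):

```python
import math

def net_input(weights, outputs, threshold):
    # Equation 3: weighted sum of incoming unit outputs plus the threshold term.
    return sum(w * o for w, o in zip(weights, outputs)) + threshold

def sigmoid(x):
    # Equation 4: g(x) = 1 / (1 + e^(-x)), mapping any net input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```

For example, a unit with weights 0.5 and −0.25, incoming outputs 1.0 and 2.0, and threshold 0.1 receives a net input of 0.1 and emits sigmoid(0.1) ≈ 0.525.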

[Figure 3: a multilayered feedforward network with inputs x1–x4, hidden units h1–h3, outputs y1 and y2,
and weighted links between layers.]
The weights in neural networks, similar to coefficients in regression models, are adjusted to solve the problem presented
to the network. Learning or training is the term used to describe the process of finding the values of these
weights. The two types of learning associated with neural networks are supervised and unsupervised learning.
Supervised learning, also called learning with a teacher, occurs when there is a known target value associated
with each input in the training set. The output of the network is compared with the target value and this
difference is used to train the network (alter the weights). There are many different algorithms for training
neural networks using supervised learning; backpropagation is one of the more common ones and will be
explored in detail in Section 3. A biological example of supervised learning is when you teach a child the
alphabet. You show him or her a letter and based on his or her response you provide feedback to the child.
This process is repeated for each letter until the child knows the alphabet.
Unsupervised learning is needed when the training data lacks target output values corresponding to input
patterns. The network must learn to group or cluster the input patterns based on some common features,
similar to factor analysis (Harman, 1976) and principal components (Morrison, 1976). This type of training
is also called learning without a teacher because there is no source of feedback in the training process. A
biological example would be when a child touches a hot heating coil. He or she soon learns, without any
external teaching, not to touch it. In fact, the child may associate a bright red glow with hot and learn to
avoid touching objects with this feature.
The networks discussed in this paper are constructed with layers of units and thus are termed multilayered
networks. A layer of units in a multilayer network is composed of units that perform similar tasks. A
feedforward network is one where units in one layer are connected only to units in the next layer and not
to units in a preceding layer or units in the same layer. Figure 3 shows a multilayered feedforward network.
Networks where the units are connected to other units in the same layer, to units in the preceding layer, or
even to themselves are termed recurrent networks. Feedforward networks can be viewed as a special case of
recurrent networks.
The first layer of a multilayer network consists of the input units, denoted by x_i. These units are
known as independent variables in statistical literature. The last layer contains the output units, denoted
by y_k. In statistical nomenclature these units are known as the dependent or response variables. (Note
that Figure 3 has more than one output unit. This configuration is common in neural network classification
applications where there are more than two classes. The outputs represent membership in one of the k
classes. The multiple outputs could represent a multivariate response function, but this is not common in
practice.) All other units in the model are called hidden units, h_j, and constitute the hidden layers. The
feedforward network can have any number of hidden layers with a variable number of hidden units per layer.
When counting layers, it is common practice not to count the input layer because it does not perform any
computation, but simply passes data onto the next layer. So a network with an input layer, one hidden
layer, and an output layer is termed a two-layer network.

3.2 Backpropagation Derivation


The backpropagation algorithm is a method to find weights for a multilayered feedforward network. The
development of the backpropagation algorithm is primarily responsible for the recent resurgence of interest
in neural networks. One of the reasons for this is that it has been shown that a two-layer feedforward neural
network with a sufficient number of hidden units can approximate any continuous function to any degree of
accuracy (Cybenko, 1989). This makes multilayered feedforward neural networks a powerful modeling tool.
As mentioned, Figure 3 shows a schematic of a feedforward, two-layered, neural network. Given a set
of input patterns (observations) with associated known outputs (responses), the objective is to train the
network, using supervised learning, to estimate the functional relationship between the inputs and outputs.
The network can then be used to model or predict a response corresponding to a new input pattern. This
is similar to the regression problem where we have a set of independent variables (inputs) and dependent
variables (output), and we want to find the relationship between the two.
To accomplish the learning, some form of an objective function or performance metric is required. The
goal is to use the objective function to optimize the weights. The most common performance metric used in
neural networks (although not the only one, see (Solla, Levin & Fleisher, 1988)) is the sum of squared errors

defined as:

    E = (1/2) Σ_{p=1}^{n} Σ_{k=1}^{O} (y_pk − ŷ_pk)²                    (5)

where the subscript p refers to the patterns (observations) with a total of n patterns, the subscript k to
the output unit with a total of O output units, y_pk is the observed response, and ŷ_pk is the model (predicted)
response. This is the sum of the squared differences between the predicted and observed responses
over all outputs and observations (patterns). In the simple case of predicting a single outcome,
k = 1 and Equation 5 reduces to

    E = (1/2) Σ_{p=1}^{n} (y_p − ŷ_p)²

the usual function to minimize in least-squares regression.
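In code, Equation 5 is a double sum over patterns and outputs; a small sketch (the function name sse is ours):

```python
def sse(targets, predictions):
    # Equation 5: half the squared error summed over all patterns p and outputs k.
    return 0.5 * sum(
        (y - y_hat) ** 2
        for target_row, pred_row in zip(targets, predictions)
        for y, y_hat in zip(target_row, pred_row)
    )
```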


To understand backpropagation learning, we will start by examining how information is first passed
forward through the network. The process starts with the input values being presented to the input layer.
The input units perform no operation on this information, but simply pass it onto the hidden units. Recalling
the simple computational structure of a unit expressed in Equation 3, the input into the jth hidden unit is:

    h_pj = Σ_{i=1}^{N} w_ji x_pi                                        (6)

Here, N is the total number of input nodes, w_ji is the weight from input unit i to hidden unit j, and x_pi is
the value of the ith input for pattern p. The jth hidden unit applies an activation function to its net input
and outputs:

    v_pj = g(h_pj) = 1 / (1 + e^(−h_pj))                                (7)

(Assuming g(·) is the sigmoid function defined in Equation 4.) Similarly, output unit k receives a net input
of:

    f_pk = Σ_{j=1}^{M} W_kj v_pj                                        (8)

where M is the number of hidden units, and W_kj represents the weight from hidden unit j to output k. The
unit then outputs the quantity:

    ŷ_pk = g(f_pk) = 1 / (1 + e^(−f_pk))                                (9)
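The forward pass of Equations 6–9 can be sketched for a single pattern; the nested-list weight layout (w[j][i] for input-to-hidden, W[k][j] for hidden-to-output) is our own convention:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, W):
    # Equations 6 and 7: net input and output of each hidden unit.
    v = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x))) for w_j in w]
    # Equations 8 and 9: net input and output of each output unit.
    y_hat = [sigmoid(sum(W_kj * v_j for W_kj, v_j in zip(W_k, v))) for W_k in W]
    return v, y_hat
```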

(Notice that the threshold value has been excluded from the equations. This is because the threshold can
be accounted for by adding an extra unit to the layer and fixing its value at 1. This is similar to adding a
column of ones to the design matrix in regression problems to account for the intercept.)
Recall that the goal is to find the set of weights w_ji, the weights connecting the input units to the hidden
units, and W_kj, the weights connecting the hidden units to the output units, that minimize our objective
function, the sum of squared errors in Equation 5. Equations 6–9 demonstrate that the objective function,
Equation 5, is a function of the unknown weights w_ji and W_kj. Therefore, the partial derivative of the
objective function with respect to a weight represents the rate of change of the objective function with
respect to that weight (it is the slope of the objective function). Moving the weights in a direction down
this slope will result in a decrease in the objective function. This intuitively suggests a method to iteratively
find values for the weights. We evaluate the partial derivative of the objective function with respect to the
weights and then move the weights in a direction down the slope, continuing until the error function no
longer decreases. Mathematically this is represented as:

    ΔW_kj = −η ∂E/∂W_kj                                                 (10)

(The η term is known as the learning rate and simply scales the step size. The common practice in neural
networks is to have the user enter a fixed value for the learning rate η at the beginning of the problem.)
We will first derive an expression for calculating the adjustment for the weights connecting the hidden
units to the outputs, W_kj. Substituting Equations 6 through 9 into Equation 5 yields:

    E = (1/2) Σ_{p=1}^{n} Σ_{k=1}^{O} ( y_pk − g( Σ_{j=1}^{M} W_kj g( Σ_{i=1}^{N} w_ji x_pi ) ) )²

and then expanding Equation 10 using the chain rule, we get:

    −η ∂E/∂W_kj = −η (∂E/∂ŷ_pk) (∂ŷ_pk/∂f_pk) (∂f_pk/∂W_kj)

but

    ∂E/∂ŷ_pk = −(y_pk − ŷ_pk)

    ∂ŷ_pk/∂f_pk = g′(f_pk) = ŷ_pk (1 − ŷ_pk)                            (11)

(for the sigmoid in Equation 4) and

    ∂f_pk/∂W_kj = v_pj

Substituting these results back into Equation 10, the changes in the weights from the hidden units to the
output units, ΔW_kj, are given by:

    ΔW_kj = −η [(−1)(y_pk − ŷ_pk)] ŷ_pk (1 − ŷ_pk) v_pj                 (12)

This gives us a formula to update the weights from the hidden units to the output units. The weights are
updated as:

    W_kj^(t+1) = W_kj^t + ΔW_kj

This equation implies that we take the weight adjustment in Equation 12 and add it to our current estimate
of the weight, W_kj^t, to obtain an updated weight estimate, W_kj^(t+1).
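Equation 12 and the update rule translate directly into code; a one-line sketch (the helper name delta_W is ours):

```python
def delta_W(eta, y, y_hat, v_j):
    # Equation 12: -eta * [(-1)(y - y_hat)] * y_hat * (1 - y_hat) * v_j,
    # i.e. the error times the sigmoid derivative times the hidden unit output.
    return eta * (y - y_hat) * y_hat * (1.0 - y_hat) * v_j
```

The update W_kj ← W_kj + ΔW_kj is then a single addition.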

Before moving on to the calculations for the weights from the inputs to the hidden units, there are
several interesting points to be made about Equation 12. 1) Given that the range of the sigmoid function
is 0 ≤ g(·) ≤ 1, from Equation 11 we can see that the maximum change of the weight will occur when the
output ŷ_pk is 0.5. In classification problems the objective is to have the output be either 1 or 0, so an output
of 0.5 represents an undecided case. If the output is at saturation (0 or 1) then the weight will not change.
Conceptually, this means that units which are undecided will receive the greatest change to their weights.
2) As in other function approximation problems, the (y_pk − ŷ_pk) term influences the weight change. If the
predicted response matches the desired response then no weight changes occur. 3) If k = 1, one outcome, then
Equation 12 simplifies to

    ΔW_j = −η [(−1)(y_p − ŷ_p)] ŷ_p (1 − ŷ_p) v_pj

To update the weights w_ji connecting the inputs to the hidden units, we follow logic similar to that
leading to Equation 12. Thus

    Δw_ji = −η ∂E/∂w_ji

Expanding using the chain rule:

    −η ∂E/∂w_ji = −η Σ_{k=1}^{O} (∂E/∂ŷ_pk) (∂ŷ_pk/∂f_pk) (∂f_pk/∂v_pj) (∂v_pj/∂h_pj) (∂h_pj/∂w_ji)    (13)

where ∂E/∂ŷ_pk and ∂ŷ_pk/∂f_pk are given in Equation 11. Also,

    ∂f_pk/∂v_pj = W_kj

    ∂v_pj/∂h_pj = g′(h_pj) = v_pj (1 − v_pj)

and

    ∂h_pj/∂w_ji = x_pi

Substituting back into Equation 13 reduces to:

    Δw_ji = η Σ_{k=1}^{O} (y_pk − ŷ_pk) ŷ_pk (1 − ŷ_pk) W_kj v_pj (1 − v_pj) x_pi                      (14)

Note that there is a summation over the number of output units. This is because each hidden unit is
connected to all the output units. So if the weight connecting an input unit to a hidden unit changes, it will
affect all the outputs. Again, notice that if the number of output units equals one, then

    Δw_ji = η (y_p − ŷ_p) ŷ_p (1 − ŷ_p) W_j v_pj (1 − v_pj) x_pi
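Equation 14 sums the backpropagated error over the output units before scaling by the hidden unit's derivative and the input; a sketch (names are ours; W_col_j holds W_kj for each output k):

```python
def delta_w(eta, targets, preds, W_col_j, v_j, x_i):
    # Equation 14: propagate each output error back through its weight W_kj,
    # then scale by the hidden unit's sigmoid derivative v_j(1 - v_j) and x_i.
    backprop = sum(
        (y - y_hat) * y_hat * (1.0 - y_hat) * W_kj
        for y, y_hat, W_kj in zip(targets, preds, W_col_j)
    )
    return eta * backprop * v_j * (1.0 - v_j) * x_i
```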

3.3 Backpropagation Algorithm


Given the above equations, we proceed to put down the processing steps needed to compute the change in
network weights using backpropagation learning. (Note: this algorithm is adapted from Hertz et al., 1991.)

1. Initialize the weights to small random values. This puts the output of each unit around 0.5.

2. Choose a pattern p and propagate it forward. This yields values for v_pj and ŷ_pk, the outputs from the
hidden layer and output layer.

3. Compute the output errors: δ_pk = (y_pk − ŷ_pk) g′(f_pk).

4. Compute the hidden layer errors: δ_pj = Σ_{k=1}^{O} δ_pk W_kj v_pj (1 − v_pj).

5. Compute

    ΔW_kj = η δ_pk v_pj

and

    Δw_ji = η δ_pj x_pi

to update the weights.

6. Repeat the steps for each pattern.

It is easy to see how this could be implemented in a computer program. (Note that there are many
commercial and shareware products that implement the multilayered feedforward neural network. Check the
frequently asked questions, FAQ, posted monthly to the comp.ai.neural-nets newsgroup for a listing
of these products. The web site http://wwwipd.ira.uka.de/~prechelt/FAQ/neural-net-faq.html also
maintains this list.)
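Steps 1–6 above can be assembled into a complete, if naive, training routine. The sketch below is a pure-Python illustration under the paper's setup (sigmoid units, per-pattern updates, thresholds handled as extra units fixed at 1); the function names, nested-list weight layout, and stopping rule (a fixed number of epochs) are our own choices:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, targets, n_hidden, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(patterns[0]) + 1    # +1: threshold unit fixed at 1
    n_out = len(targets[0])
    # Step 1: initialize the weights to small random values.
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        # Step 6: repeat steps 2-5 for each pattern.
        for x, t in zip(patterns, targets):
            x = list(x) + [1.0]
            # Step 2: propagate the pattern forward.
            v = [sigmoid(sum(wj[i] * x[i] for i in range(n_in))) for wj in w]
            v_b = v + [1.0]        # threshold unit for the output layer
            y_hat = [sigmoid(sum(Wk[j] * v_b[j] for j in range(n_hidden + 1)))
                     for Wk in W]
            # Step 3: output errors, delta_pk = (y - y_hat) g'(f).
            d_out = [(t[k] - y_hat[k]) * y_hat[k] * (1.0 - y_hat[k])
                     for k in range(n_out)]
            # Step 4: hidden errors, delta_pj = sum_k delta_pk W_kj v_pj (1 - v_pj).
            d_hid = [v[j] * (1.0 - v[j]) * sum(d_out[k] * W[k][j] for k in range(n_out))
                     for j in range(n_hidden)]
            # Step 5: update both weight layers.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W[k][j] += eta * d_out[k] * v_b[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w[j][i] += eta * d_hid[j] * x[i]
    return w, W

def predict(x, w, W):
    x = list(x) + [1.0]
    v = [sigmoid(sum(wj[i] * x[i] for i in range(len(x)))) for wj in w] + [1.0]
    return [sigmoid(sum(Wk[j] * v[j] for j in range(len(v)))) for Wk in W]
```

On a toy problem, the routine drives the outputs toward their targets; training on the two patterns [0] → 0 and [1] → 1, for instance, separates the two predicted outputs.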

4 Regression and Neural Networks

In this section we compare and contrast neural network models with regression models. Most scientists have
some experience using regression models and by explaining neural networks in relation to regression analysis
some deeper understanding can be achieved.

4.1 Discussion
Regression is used to model a relationship between variables. The covariates (independent variables, stimuli)
are denoted as x_i. These variables are either under the experimenter's control or are observed by the
experimenter. The response (outcome, dependent) variable is denoted as y. The objective of regression is to
predict or classify the response, y, from the covariates, x_i. Sometimes the investigator also uses regression
to test hypotheses about the functional relationship between the response and the stimulus.

The general form of the regression model is (this is adopted from McCullagh and Nelder (1989)):

    η = Σ_{i=0}^{N} β_i x_i

with

    μ = h(η)

    E(y) = μ

Here h(·) is the link function, the β_i are the coefficients, N is the number of covariate variables, and β_0 is the
intercept. This model has three components:

1. A random component of the response variable y, with mean μ and variance σ².

2. A systematic component that relates the stimuli x_i to a linear predictor η = Σ_{i=0}^{N} β_i x_i.

3. A link function that relates the mean to the linear predictor μ = h(η).

The generalized linear model reduces to the familiar multiple linear regression if we believe that the
random component has a normal distribution with mean zero and variance σ² and we specify the link
function h(·) as the identity function. The model is then:

    y_p = β_0 + Σ_{i=1}^{N} β_i x_pi + ε_p

where ε_p ∼ N(0, σ²). The objective of this regression problem is to find the coefficients β_i that minimize the
sum of squared errors,

    E = Σ_{p=1}^{n} ( y_p − β_0 − Σ_{i=1}^{N} β_i x_pi )².

To find the coefficients, we must have a data set that includes the independent variables and associated known
values of the dependent variable (akin to a training set in supervised learning in neural networks).
This problem is equivalent to a single-layer feedforward neural network (Figure 4). The independent
variables correspond to the inputs of the neural network and the response variable y to the output. The
coefficients, β_i, correspond to the weights in the neural network. The activation function is the identity
function.
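For the identity link, the coefficients minimizing E have a closed form. A small sketch for a single covariate (the helper fit_line is ours), which can be checked against an exact line such as y = 30 + 10x:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = b0 + b1 * x: the single-layer network
    # with identity activation whose weights are the regression coefficients.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1
```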

[Figure 4: a single-layer network with inputs x_1, ..., x_N and a constant input 1, whose weights are the
coefficients β_1, ..., β_N and the intercept β_0.]
not assume any functional relationship and let the data define the functional form; in a sense we let the
data speak for themselves. This is the basis of the power of neural networks. As mentioned in Section 3,
a two-layer, feedforward network with sigmoid activation functions is a universal approximator because it
can approximate any continuous function to any degree of accuracy. Thus, a neural network is extremely
useful when you do not have any idea of the functional relationship between the dependent and independent
variables. If you had an idea of the functional relationship, you would be better off using a regression model.
An advantage of assuming a functional form is that it allows different hypothesis tests. Regression, for
example, allows you to test the functional relationships by testing the individual coefficients for statistical
significance. Also, because regression models tend to be nested, two different models can be tested to
determine which one models the data better. A neural network never reveals the functional relationship; it is
buried in the summing of the sigmoidal functions.
Other difficulties with neural networks involve choosing the parameters, such as the number of hidden
units, the learning parameter η, the initial starting weights, the cost function, and deciding when to stop
training. The process of determining appropriate values for these variables is often an experimental process
where different values are used and evaluated. The problem with this is that it can be very time consuming,
especially when you consider the fact that neural networks typically have slow convergence rates.

4.2 Examples
To demonstrate the use of neural networks, two simulated examples and one real example are presented.
Both of the simulated examples involve one independent variable and a continuous-valued output variable.
The first example is a simple linear problem. The true relationship is:

    y = 30 + 10x.

Fifty samples were obtained by generating 50 random error terms from a normal distribution with mean
zero and standard deviation of 50 and adding these to the y values of the data set. The values of x were


Figure 5: This figure shows a comparison between a linear regression model and a neural network model
where the underlying functional relationship was linear. The filled dots represent the actual data, the solid
line is the predicted function from the neural network, the dotted line is the result from the regression model,
and the dashed line is the true function. This example demonstrates that the neural network approximated
the linear function without any assumption about the underlying functional form.

randomly selected from the range of 20 to 100. Linear regression was applied to the problem of the form

E(y | x) = α + βx

and yielded coefficients of α̂ = −10.69 and β̂ = 10.52. (Note the error in the intercept term; remember that
we were interested in modeling the data between x values of 20 and 100, so an accurate intercept was not
important.) The dotted line in Figure 5 shows the regression line.
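The regression fit above is easy to reproduce. The following Python sketch is our own simulation code, not the authors' original program: the helper names (`simulate_linear`, `fit_simple_ols`) and the seed are arbitrary choices; it generates fifty noisy points from y = 30 + 10x and recovers the least squares coefficients in closed form.

```python
import random

def simulate_linear(n=50, seed=0, noise_sd=50.0):
    """x uniform on [20, 100]; y = 30 + 10x plus N(0, 50^2) noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(20, 100) for _ in range(n)]
    ys = [30 + 10 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_simple_ols(xs, ys):
    """Closed-form least squares estimates for E(y | x) = alpha + beta * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    return my - beta * mx, beta

xs, ys = simulate_linear()
alpha_hat, beta_hat = fit_simple_ols(xs, ys)   # slope should land near 10
```

With this much noise the intercept estimate can wander far from 30, just as in the paper, while the slope stays close to 10 because the x values span a wide range.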
The problem was also solved with a two-layer, feedforward neural network with four hidden units and
sigmoid activation functions. The network was trained using backpropagation. This network configuration
is too general for this problem; however, we wanted to set up the problem with the notion that the underlying
functional relationship is unknown. This gives us an idea how the network will perform when we think the

problem is complex when in fact it is simple. The solid line in Figure 5 shows the results from the neural
network, the dashed line is the true functional relationship. Both the regression curve and the neural network
curve are close to the true curve. A separate validation set of 100 values was generated and applied to the
regression model and the neural network model. The sum of squared errors on this validation set was 0.290
for the regression model and 0.303 for the neural network model. So the predictive performance of
both models was essentially equal. The linear regression model was much faster and easier to develop and
is easy to interpret. The neural network did not assume any functional form for the relationship between
dependent and independent variable and was still able to derive an accurate curve.
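A minimal version of such a network can be written in a few dozen lines. The sketch below is our own pure-Python illustration, not the software used in the paper; the rescaling, learning rate, epoch count, and seed are assumptions we chose to make the toy problem converge.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_net(xs, ys, hidden=4, eta=0.2, epochs=2000, seed=1):
    """Train a one-input network with `hidden` sigmoid units and a linear
    output unit by online backpropagation on squared error.  Inputs and
    outputs are rescaled to [0, 1] first, since sigmoids saturate on raw
    values like x = 100."""
    rng = random.Random(seed)
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    sx = [(x - xmin) / (xmax - xmin) for x in xs]
    sy = [(y - ymin) / (ymax - ymin) for y in ys]
    # Hidden unit j has input weight w[j] and bias b[j]; the linear output
    # unit has weights v[j] and bias c.
    w = [rng.uniform(-1, 1) for _ in range(hidden)]
    b = [rng.uniform(-1, 1) for _ in range(hidden)]
    v = [rng.uniform(-1, 1) for _ in range(hidden)]
    c = 0.0
    for _ in range(epochs):
        for x, t in zip(sx, sy):
            h = [sigmoid(w[j] * x + b[j]) for j in range(hidden)]
            out = c + sum(v[j] * h[j] for j in range(hidden))
            err = out - t                        # derivative of 0.5 * err**2
            for j in range(hidden):
                delta = err * v[j] * h[j] * (1.0 - h[j])  # backpropagated error
                v[j] -= eta * err * h[j]
                w[j] -= eta * delta * x
                b[j] -= eta * delta
            c -= eta * err
    def predict(x):
        s = (x - xmin) / (xmax - xmin)
        h = [sigmoid(w[j] * s + b[j]) for j in range(hidden)]
        return ymin + (ymax - ymin) * (c + sum(v[j] * h[j] for j in range(hidden)))
    return predict

# Noiseless check on the linear example: the fit should track y = 30 + 10x.
xs = list(range(20, 101, 2))
predict = train_net(xs, [30 + 10 * x for x in xs])
```

Note that nothing in the code assumes linearity; the same routine with more hidden units fits the curved second example.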
As a second, more difficult example, consider the function:

y = 2.0 exp(−8.5 [ln(0.9x + 0.2) + 1.5] x).
Fifty random x values between 0 and 1 were used to generate data. A random noise component consisting
of normally distributed error terms with mean 0 and standard deviation 0.05 was added to each y value.
A neural network identical to the one used in the previous example, except that it had eight hidden units,
was trained on this data. The results are plotted in Figure 6. The true curve is dashed and the neural
network estimate is solid. This problem would be difficult to model using regression techniques and would
require some estimate of variable transformations or the use of a smoothing method (Hastie & Tibshirani,
1990). Most of these transformations assume a power or logarithmic transformation, while a combination
of both would have been more appropriate in this case.
The third example is from the Department of Veterans Affairs Continuous Improvement in Cardiac
Surgery Study (Hammermeister et al., 1994). The outcome variable is a binary variable indicating operative
death status 30 days after a coronary artery bypass grafting surgery. If a patient is still alive 30 days after
surgery, he or she is coded as a 0, otherwise as a 1. The objective is to obtain predictions of a patient's
probability of death given their individual risk factors. The twelve independent variables in the study are
patient risk factors and include variables such as age, priority of surgery, and history of prior heart surgery.
Both neural network and logistic regression models were built on 21,435 observations (2/3 learning sample) and

[Plot for Figure 6: Y from −1 to 2 against x from 0.0 to 1.0, showing the data points, the neural network fit (8 hidden units), and the true curve.]

Figure 6: Example of Modeling a Nonlinear Function. This figure demonstrates the ability of a neural
network to model a complex functional relationship. The filled dots represent the data, the dashed curve is
the true function, and the solid line is the predicted function from the neural network.

validated on the remaining 10,657 observations (1/3 testing sample). The neural network was a feedforward
network with one hidden layer comprised of four hidden units. It was trained using the backpropagation
algorithm.
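Splitting the data this way is easy to script. The helper below is our own sketch, not code from the study, and the seed is an arbitrary assumption; the study's exact sample sizes of 21,435 and 10,657 come from its own randomization.

```python
import random

def two_thirds_split(records, seed=0):
    """Randomly divide observations into a 2/3 learning sample and a 1/3
    testing sample, mirroring the design of the cardiac surgery study."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(2 * len(shuffled) / 3)
    return shuffled[:cut], shuffled[cut:]

# With 32,092 observations the split is roughly 21,395 / 10,697.
learn, test = two_thirds_split(range(32092))
```

Holding the testing sample out of all model fitting is what makes the discrimination and calibration comparisons that follow honest estimates of predictive performance.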
The discrimination and calibration of these two models were compared on the validation set. The c-index
was used to measure discrimination (how well the predicted binary outcomes can be separated (Hanley &
McNeil, 1982)). No statistically significant difference was found between the two c-indices at the 0.05 level
(the neural network c-index = 0.7168 and the logistic regression c-index = 0.7162). The Hosmer-Lemeshow test
(Hosmer Jr. & Lemeshow, 1989) was applied to the validation data to test calibration of the models
(calibration measures how close the predicted values are to the observed values). The p-value for the logistic
regression model was 0.34 indicating a good fit to the data, while the p-value for the neural network was

0.08 indicating a lack of fit. In summary, the logistic regression model had comparable predictive power and
better calibration in comparison to the neural network.
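Both comparison measures are straightforward to compute. The functions below are our own illustrative implementations, not the study's code: the c-index as the usual concordance probability, and the Hosmer-Lemeshow statistic over a user-chosen number of risk groups (the p-values quoted above would then come from a chi-squared distribution, conventionally with g − 2 degrees of freedom).

```python
def c_index(y_true, y_score):
    """Concordance index: the probability that a randomly chosen death
    (y = 1) received a higher predicted risk than a randomly chosen
    survivor (y = 0); ties count one half.  Equals the area under the
    ROC curve (Hanley & McNeil, 1982)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow(y_true, y_score, groups=10):
    """Hosmer-Lemeshow chi-squared statistic: sort observations into risk
    groups and compare observed deaths with expected deaths in each group.
    Small values indicate good calibration."""
    pairs = sorted(zip(y_score, y_true))
    size = len(pairs) / groups
    stat = 0.0
    for g in range(groups):
        chunk = pairs[int(g * size):int((g + 1) * size)]
        n = len(chunk)
        if n == 0:
            continue
        observed = sum(y for _, y in chunk)
        p_bar = sum(s for s, _ in chunk) / n
        expected = n * p_bar
        # Group contribution: (O - E)^2 / (n * p_bar * (1 - p_bar))
        stat += (observed - expected) ** 2 / (expected * (1 - p_bar))
    return stat

auc = c_index([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On 10,657 validation cases the nested loop in `c_index` would be slow; a rank-based formulation gives the same answer in O(n log n), but the quadratic version makes the definition explicit.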
The reason this occurs is that the majority of the independent variables are binary. This means that
their contribution to the model must be on a linear scale; trying to model them in a nonlinear manner will
not contribute to the predictive performance of the model. In addition, a simple check of all two-variable
interactions revealed nothing of significance. Thus, with no interactions or nonlinearities, the linear additive
structure of logistic regression is appropriate for this data. As indicated in Section 1, the literature is full of
examples illustrating the improved performance of neural networks over traditional techniques. But as this
last example illustrates, this is not always true, and the practitioner must be aware of the appropriate model
for his or her problem.
Neural networks can be valuable when we do not know the functional relationship between independent
and dependent variables. They use the data to determine the functional relationship between the dependent
and independent variables. Since they are data dependent, their performance improves with sample size.
Regression performs better when theory or experience indicates an underlying relationship. Regression may
also be a better alternative for extremely small sample sizes.

5 Conclusions

Neural networks originally developed out of an interest in modeling the human brain. They have, however,
found applications in many different fields of study. This paper has focused on the use of neural networks for
prediction and classification problems. We specifically restricted our discussion to multilayered feedforward
neural networks.
Parallels with statistical terminology used in regression models were developed to aid the reader in
understanding neural networks. Neural network notation is different from that of statistical regression
analysis, but most of the underlying ideas are the same. For example, instead of coefficients, the neural
network community uses the term weights, and instead of observations they use patterns.

Backpropagation is an algorithm that can be used to determine the weights of a network designed to
solve a given problem. It is an iterative procedure that uses a gradient descent method. The cost function
used is normally a squared error criterion, but functions based on maximum likelihood are also used. This
paper gives a detailed derivation of the backpropagation algorithm based on existing bodies of work and
gives an outline of how to implement it on a computer.
Two simple synthetic problems were presented to demonstrate the advantages and disadvantages of
multilayered feedforward neural networks. These networks do not impose a functional relationship between
the independent and dependent variables. Instead, the functional relationship is determined by the data in
the process of finding values for the weights. The advantage of this process is that the network is able to
approximate any continuous function and we do not have to guess the functional form. The disadvantage
is that it is difficult to interpret the network. In linear regression models, we can interpret the coefficients
in relation to the problem. Another disadvantage of the neural network is that convergence to a solution
can be slow and depends on the network's initial conditions. A third example on real data revealed that
traditional statistical tools still have a role in analysis and that the use of any tool must be thought about
carefully.
Neural networks can be viewed as a nonparametric regression method. A large number of claims have
been made about the modeling capabilities of neural networks, some exaggerated and some justified. As
statisticians, it is important to understand the capabilities and potential of neural networks. This article is
intended to build a bridge of understanding for the practitioner and interested reader.

References
Abu-Mostafa, Yasser S. (1986) Neural networks for computing pp. 1–6 of Proceedings of the American
institute of physics meeting.
Anderson, J.A. (1970) Two models for memory organization Mathematical biosciences, 8, 137–160.
Baxt, William G. (1990) Use of an artificial neural network for data analysis in clinical decision-making:
The diagnosis of acute coronary occlusion Neural computation, 2, 480–489.
Baxt, William G. (1991) Use of an artificial neural network for the diagnosis of myocardial infarction Annals
of internal medicine, 115, 843–848.
Buntine, W.L. & Weigend, A.S. (1991) Bayesian back-propagation Complex systems, 5, 603–643.
Cowan, J.D. & Sharp, D.H. (1988) Neural nets Quarterly reviews of biophysics, 21, 365–427.
Cybenko, G. (1989) Approximation by superpositions of a sigmoidal function Mathematics of control,
signals, and systems, 2, 303–314.
Duda, R.O. & Hart, P.E. (1973) Pattern classification and scene analysis New York, New York: Wiley.
Elman, J.L. (1990) Finding structure in time Cognitive science, 14, 179–211.
Flury, Bernhard & Riedwyl, Hans (1990) Multivariate statistics: A practical approach London: Chapman
Hall.
Freeman, James A. (1994) Simulating neural networks with Mathematica Reading, Massachusetts: Addison
Wesley Publishing Company.
Fujita, H., Katafuchi, T., Uehara, T. & Nishimura, T. (1992) Application of artificial neural network to
computer-aided diagnosis of coronary artery disease in myocardial SPECT bull's-eye images The journal
of nuclear medicine, 33(2), 272–276.
Gorman, R.P. & Sejnowski, T.J. (1988) Analysis of hidden units in a layered network to classify sonar targets
Neural networks, 1, 75–89.
Grossberg, S. (1976) Adaptive pattern classification and universal recoding, I: Parallel development and
coding of neural feature detectors Biological cybernetics, 23, 121–134.
Grossberg, S. & Carpenter, G.A. (1983) A massively parallel architecture for a self-organizing neural pattern
recognition machine Computer vision, graphics, and image processing, 37, 54–115.
Hammermeister, K.E., Johnson, R., Marshall, G. & Grover, F.L. (1994) Continuous assessment and improve-
ment in quality of care: A model from the Department of Veterans Affairs cardiac surgery Annals of
surgery, 219, 281–290.
Hanley, J.A. & McNeil, B.J. (1982) The meaning and use of the area under a receiver operating characteristic
(ROC) curve Radiology, 143, 29–36.
Harman, H.H. (1976) Modern factor analysis. 3rd edition. Chicago: University of Chicago Press.
Hastie, T. & Tibshirani, R. (1990) Generalized additive models London and New York: Chapman and Hall.
Hecht-Nielsen, R. (1987) Counterpropagation networks Applied optics, 26, 4979–4984.
Hertz, J., Krogh, A. & Palmer, R.G. (1991) Introduction to the theory of neural computation Santa Fe
Institute Studies in the Sciences of Complexity, Vol. 1. Redwood City, California: Addison Wesley
Publishing Company.
Hinton, G.E. & Sejnowski, T.J. (1986) Learning and relearning in Boltzmann machines chap. 7 of Parallel
distributed processing, Vol. 1.
Hopfield, J.J. (1982) Neural networks and physical systems with emergent collective computational abilities
Proceedings of the national academy of sciences, USA, 81, 2554–2558.
Hopfield, J.J. & Tank, D.W. (1985) 'Neural' computation of decisions in optimization problems Biological
cybernetics, 52, 141–152.
Hosmer Jr., D.W. & Lemeshow, S. (1989) Applied logistic regression New York: John Wiley & Sons, Inc.
Hruschka, Harald (1993) Determining market response functions by neural network modeling: A comparison
to econometric techniques European journal of operational research, 66, 27–35.
Hutchinson, James M. (1994) A radial basis function approach to financial time series analysis Ph.D. thesis,
Massachusetts Institute of Technology.
Jordan, M.I. (1989) Serial order: A parallel, distributed processing approach in: J. Elman & D. Rumelhart
(Eds.) Advances in connectionist theory: Speech Hillsdale: Erlbaum.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps Biological cybernetics,
43, 59–69.
Kohonen, T. (1989) Self-organization and associative memory. 3rd edition. Berlin: Springer-Verlag.
Lippmann, R.P. (1987) An introduction to computing with neural nets IEEE ASSP magazine, April, 4–22.
Lippmann, R.P. (1989) Review of neural networks for speech recognition Neural computation, 1, 1–38.
McClelland, J.L., Rumelhart, D.E. & the PDP Research Group (1986) Parallel distributed processing: Explo-
rations in the microstructure of cognition, volume 2: Psychological and biological models. Cambridge:
MIT Press.
McCullagh, P. & Nelder, J.A. (1989) Generalized linear models London: Chapman Hall.
McCulloch, W.S. & Pitts, W. (1943) A logical calculus of ideas immanent in nervous activity Bulletin of
mathematical biophysics, 5, 115–133.
Minsky, M.L. & Papert, S.A. (1969) Perceptrons Cambridge: MIT Press.
Moore, B. (1988) ART1 and pattern clustering in: D. Touretzky, G. Hinton, & T. Sejnowski (Eds.)
Proceedings of the 1988 connectionist models summer school San Mateo: Morgan Kaufmann.
Morrison, D.F. (1976) Multivariate statistical methods. 2nd edition. New York: McGraw-Hill.
Neter, J., Wasserman, W. & Kutner, M.H. (1990) Applied linear statistical models Homewood, IL: Richard
D. Irwin, Inc.
Poli, R., Cagnoni, S., Livi, R., Coppini, G. & Valli, G. (1991) A neural network expert system for diagnosing
and treating hypertension Computer, March, 64–71.
Qian, N. & Sejnowski, T.J. (1988) Predicting the secondary structure of globular proteins using neural
network models Journal of molecular biology, 202, 865–884.
Ripley, B.D. (1992) Neural networks and related methods for classification Submitted to the Royal Statistical
Society Research Section.
Ripley, B.D. (1993) Statistical aspects of neural networks pp. 40–123 of O. Barndorff-Nielsen, J. Jensen, &
W. Kendall (Eds.) Networks and chaos: Statistical and probabilistic aspects Chapman & Hall.
Rosenblatt, F. (1962) Principles of neurodynamics Washington D.C.: Spartan.
Sarle, W.S. (1994) Neural networks and statistical methods in: Proceedings of the 19th annual SAS users
group international conference.
Smith, Murray (1993) Neural networks for statistical modeling New York, New York: Van Nostrand Reinhold.
Solla, S.A., Levin, E. & Fleisher, M. (1988) Accelerated learning in layered neural networks Complex systems,
2, 625–639.
Studenmund, A. H. (1992) Using econometrics: A practical guide New York, New York: HarperCollins
Publishers.
Tesauro, G. (1990) Neurogammon wins computer olympiad Neural computation, 1, 321–323.
Thompson, Richard F. (1985) The brain: A neuroscience primer New York, New York: W.H. Freeman &
Company.
von der Malsburg, C. (1973) Self-organizing of orientation sensitive cells in the striate cortex Kybernetik,
14, 85–100.
Werbos, Paul J. (1991) Links between artificial neural networks (ANN) and statistical pattern recognition
pp. 11–31 of I. Sethi & A. Jain (Eds.) Artificial neural networks and statistical pattern recognition: Old
and new connections Elsevier Science Publishers.
Werbos, P.J. (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences
Ph.D. thesis, Harvard University.
Widrow, B. (1962) Generalization and information storage in networks of adaline neurons pp. 435–461 of
M. Yovitz, G. Jacobi, & G. Goldstein (Eds.) Self-organizing systems Washington D.C.: Spartan.
Williams, R.J. & Zipser, D. (1989) A learning algorithm for continually running fully recurrent neural
networks Neural computation, 1, 270–280.
Willshaw, D.J. & von der Malsburg, C. (1976) How patterned neural connections can be set up by self-
organization Proceedings of the royal society of london B, 194, 431–445.
Willshaw, D.J., Buneman, O.P. & Longuet-Higgins, H.C. (1969) Non-holographic associative memory Na-
ture, 222, 960–962.
Wu, Fred Y. & Yen, Kang K. (1992) Application of neural network in regression analysis in: Proceedings of
the 14th annual conference on computers and industrial engineering.
