I. INTRODUCTION
to find such a hyperplane and its orthogonal vectors. Furthermore, such a hyperplane-based grouping method cannot be extended to the case where more than two groups are required.
In contrast, we show that it is not necessary to find such a hyperplane and its orthogonal vectors; in fact, almost any vector can be used to separate the input vectors into multiple groups (a sketch of this idea is given below). Without this multiple-grouping approach, we could not obtain the markedly improved results reported in this paper. Second, we introduce the concept of neural network modularity. Although the constructive network designed in this paper is, overall, a strictly standard TLFN, it can be viewed as a combined network consisting of different neural subnetworks playing different roles. In essence, it includes multiple quantizers, each consisting of two sigmoid neurons. Logically, each neural quantizer can, by adjusting the weights and biases of its neurons, inhibit inputs within certain intervals. In contrast, the single inhibition neurons used by Tamura and Tateishi [1] are suitable only for the two-group case and cannot be extended to multiple groups.
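As an illustration of the grouping claim above, the following minimal sketch (our own illustrative code, not part of the original construction) projects N distinct input vectors onto a single randomly chosen vector and splits them into L groups by sorting the projections; since distinct vectors almost surely have distinct projections onto a random direction, this succeeds with probability one.

    import numpy as np

    def split_into_groups(X, L, seed=0):
        """Split the rows of X (N distinct input vectors) into L groups
        using their projections onto one random direction."""
        rng = np.random.default_rng(seed)
        w = rng.normal(size=X.shape[1])     # almost any vector works
        order = np.argsort(X @ w)           # distinct projections almost surely
        return np.array_split(order, L)     # index sets of the L groups

    # Example: 12 random samples in R^5 split into 3 groups of 4.
    X = np.random.default_rng(1).normal(size=(12, 5))
    groups = split_into_groups(X, L=3)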
This paper is organized as follows. The necessary preliminaries are given in Section II. Based on these preliminaries, we propose in Section III a new TLFN with at most 2√((m+2)N) hidden neurons, which can interpolate N distinct input samples with any arbitrarily small error. The generalization capability of the proposed TLFNs is studied in Section V. Conclusions are given in Section VI.
II. PRELIMINARIES
A. Standard SLFNs

For N arbitrary distinct samples (x_j, t_j), where x_j ∈ R^n and t_j ∈ R^m, a standard SLFN with Ñ hidden neurons and sigmoid activation function g can be modeled as

    Σ_{i=1}^{Ñ} β_i g(w_i · x_j + b_i) = t_j,   j = 1, ..., N,   (1)

where w_i = [w_{i1}, ..., w_{in}]^T is the weight vector connecting the ith hidden neuron and the input neurons, β_i = [β_{i1}, ..., β_{im}]^T is the weight vector connecting the ith hidden neuron and the output neurons, and b_i is the bias of the ith hidden neuron. w_i · x_j denotes the inner product of w_i and x_j. The output neurons are chosen to be linear. The above N equations can be written compactly as Hβ = T, where H is the N × Ñ matrix with entries h_{ji} = g(w_i · x_j + b_i), β = [β_1, ..., β_Ñ]^T, and T = [t_1, ..., t_N]^T. We call H the hidden layer output matrix¹ of the SLFN; the ith column of H is the output of the ith hidden neuron with respect to the inputs x_1, ..., x_N.
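For concreteness, the following small sketch (our own illustrative code, with arbitrary example dimensions) builds the hidden layer output matrix H defined above and evaluates the compact form Hβ = T of the SLFN.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    N, n, m, N_hidden = 8, 3, 2, 5         # samples, inputs, outputs, hidden neurons
    X = rng.normal(size=(N, n))            # rows are the input samples x_j
    W = rng.normal(size=(N_hidden, n))     # w_i: input-to-hidden weight vectors
    b = rng.normal(size=N_hidden)          # b_i: hidden biases
    B = rng.normal(size=(N_hidden, m))     # beta_i: hidden-to-output weight vectors

    H = sigmoid(X @ W.T + b)               # h_ji = g(w_i . x_j + b_i), an N x N_hidden matrix
    outputs = H @ B                        # compact form: row j is the network output for x_j
    # Column i of H is the ith hidden neuron's output over the inputs x_1, ..., x_N.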
B. Upper Bounds on the Number of Hidden Neurons of Standard SLFNs With Sigmoid Activation Functions

As stated in Lemma 2.3 of Huang and Babri's paper [2], we have the following lemma.

¹This matrix has been used, for example, in [1], [2], and [16].
the determinant of the corresponding hidden layer output matrix does not become zero. The same procedure can go on for the remaining biases in the sequence, and we get a bias value set. Therefore, the determinants of the hidden layer output matrices of all the groups are nonzero.

Fig. 1. Subnetwork with n linear input neurons, N/L hidden neurons, and L output neurons, where the activation function in both the hidden neurons and the output neurons is the sigmoid function. If the input vector x_j belongs to the pth group V_p, the output of the pth output neuron is t_j.
Define β_p = [β_{1p}, ..., β_{(N/L)p}]^T, where β_{ip} denotes the weight connecting the ith hidden neuron and the pth output neuron. According to Theorem 3.1, these weights can be chosen suitably such that, for any input vector x_j in the pth group, the expected output of the pth output neuron is t_j. Thus, such a subnetwork has the feature that, for any input vector within the pth input vector group, the output of the pth output neuron is the corresponding target (a numerical sketch of this per-group interpolation is given below).
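The following minimal sketch (our own illustration under simplified assumptions, not the exact construction of Theorem 3.1) shows the idea numerically: with randomly chosen input weights and biases, the square hidden layer output matrix of one group is almost surely invertible, so output weights reproducing the group's targets can be solved for exactly.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    K, n = 4, 3                            # K = N/L samples in one group, n inputs
    Xp = rng.normal(size=(K, n))           # the pth group of input vectors
    tp = rng.normal(size=K)                # their targets at the pth output neuron

    W = rng.normal(size=(K, n))            # K hidden neurons: random weights ...
    b = rng.normal(size=K)                 # ... and random biases
    H = sigmoid(Xp @ W.T + b)              # K x K hidden layer output matrix

    assert np.linalg.det(H) != 0           # nonzero determinant (almost surely)
    beta_p = np.linalg.solve(H, tp)        # output weights of the pth output neuron
    assert np.allclose(H @ beta_p, tp)     # exact interpolation within the group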
B. Construction of TLFNs With 2√((m+2)N) Hidden Neurons
The bias of each of the first L newly added hidden neurons can be chosen as in (3) such that, for any input vector belonging to the pth group, the neuron is driven far into one saturation region of the sigmoid function or the other, depending on how its index compares with p. Regarding each of the remaining L newly added hidden neurons, its bias can be chosen as in (4) so that the analogous saturation property holds. Thus, the quantizer weight factor can be set large enough so that, for any input vector in the pth group, the outputs of one set of the newly added neurons are almost zero and the outputs of the complementary sets of newly added neurons are almost one.
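As a simplified illustration of the neural quantizer idea (our own sketch with an assumed interval (a, b) and gain A, not the exact weights and biases prescribed by (3) and (4)), two sigmoid neurons sharing a large gain can be combined so that their difference is approximately one when the projected input falls inside the interval and approximately zero otherwise.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def quantizer(u, a, b, A=50.0):
        """Two sigmoid neurons with gain A; their difference is ~1 for
        a < u < b and ~0 otherwise (u is a scalar projection of the input)."""
        return sigmoid(A * (u - a)) - sigmoid(A * (u - b))

    u = np.linspace(-2.0, 2.0, 9)
    print(np.round(quantizer(u, a=-0.7, b=0.7), 3))
    # ~1 inside (-0.7, 0.7) and ~0 outside: the quantizer inhibits inputs
    # whose projection falls outside its interval.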
Fig. 2. 2L new hidden neurons are added to the subnetwork. For each output neuron p, p = 1, ..., L, the corresponding pair of newly added neurons, A_p and B_p, are linked to it. All the 2L newly added neurons are linked to all the n input neurons.
[Figure caption fragment: a newly constructed feedforward network with N/L + 3L hidden neurons can learn the N input samples (x_j, t_j), x_j ∈ R^n.]
Fig. 5. One newly constructed feedforward network with N/L + (m+2)L (≤ 2√((m+2)N)) hidden neurons can learn the N input samples (x_j, t_j) with any arbitrarily small error, where x_j ∈ R^n and t_j ∈ R^m.
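To make the hidden-neuron count in Fig. 5 concrete, the following illustrative calculation (our own, with arbitrary example values of N and m) picks the number of groups L that minimizes N/L + (m+2)L and compares the result with 2√((m+2)N); it uses ⌈N/L⌉ when L does not divide N, and the two quantities coincide when N/(m+2) is a perfect square.

    import math

    def hidden_neuron_count(N, m):
        """Total hidden neurons N/L + (m+2)L, minimized over integer L."""
        best = min(math.ceil(N / L) + (m + 2) * L for L in range(1, N + 1))
        bound = 2 * math.sqrt((m + 2) * N)
        return best, bound

    print(hidden_neuron_count(N=500, m=3))   # -> (100, 100.0)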
All the weights linking the quantizers to the second hidden layer are fixed by the construction described above.
Based on Bartlett's theorem [19, part 2 of Theorem 28], we can study how these parameters affect the generalization performance of the proposed network. Note that the sigmoid function g satisfies a Lipschitz condition: for all u and v, |g(u) − g(v)| ≤ |u − v|. Denote by w_i^(l) the ith weight of computation layer l (l = 1, 2, 3 for the first hidden, second hidden, and output layer, respectively), and denote by 𝒩 the network architecture corresponding to a set of weights. Thus, a slightly modified Bartlett theorem for sigmoid two-hidden-layer feedforward networks is as follows.
Theorem 5.1 [19]: Suppose P is a probability distribution over the labeled input patterns. Given a standard two-hidden-layer sigmoid feedforward network architecture 𝒩, if the weights of each computation layer l are suitably bounded, where n_l is the number of neurons in computation layer l, then with probability at least 1 − δ over a training sample chosen according to P, every f in 𝒩 has

(5)
where the factor appearing in (5) is positive and can be adjusted to achieve optimum generalization performance for the proposed classifier. The weights connecting the second hidden layer to the output neuron, and the bias of the output neuron, are fixed by the classifier construction.
As analyzed in the previous section, similarly for this classifier the weight and bias of the ith neuron in the first hidden layer (except those of the quantizers) can be chosen from arbitrarily small intervals, so the values of these parameters can be made as small as needed. The weights connecting the first-hidden-layer neurons (except the quantizer neural modules) to the ith neuron in the second hidden layer can be chosen suitably such that, for any input vector within the corresponding group, the expected output of that neuron is the associated target, and these values can also be made as small as needed by adjusting the weight size factor. The weights connecting the input neurons to the quantizers of the first hidden layer take one of two prescribed values, and the biases of the pth quantizer neurons are, according to equations (3) and (4),

(6)

for a suitable constant.
The constant in (5) depends only on the number of layers (excluding the input layer) of the network architecture; thus, it is fixed for any two-hidden-layer network architecture. The misclassification probability, that is, the probability that a new pattern randomly drawn from the same probability distribution as the training samples is mislabeled by the trained network, depends on two factors: the sample estimate error and the complexity penalty [20]. The estimate error⁴ is measured on the training samples (see the sketch below), while the complexity penalty depends on the size of the parameters in the network rather than on the number of parameters.
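The following minimal sketch (our own illustration; the margin threshold gamma stands in for the margin referred to in footnote 4) computes the margin-based estimate error on a training set, i.e., the proportion of training examples that the real-valued network output fails to classify with margin at least gamma.

    import numpy as np

    def margin_estimate_error(scores, labels, gamma):
        """Fraction of training samples not classified with margin >= gamma.
        scores: real-valued network outputs (before thresholding).
        labels: targets in {-1, +1}."""
        return float(np.mean(labels * scores < gamma))

    scores = np.array([0.9, 0.2, -0.7, -0.05, 1.3])
    labels = np.array([  1,   1,   -1,     1,  -1])
    print(margin_estimate_error(scores, labels, gamma=0.5))   # 0.6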
For the proposed network (classifier), the weight size factor can be adjusted to make the remaining weights small enough so that the bound on the weights of the proposed network architecture is affected only by the quantizer factors. In order to obtain a small estimate error, one can set the quantizer parameters large enough. However, making these parameters too large leads to a large value of the weight bound, and thus a large complexity penalty. On the other hand, small values of the quantizer parameters (and, consequently, a small value of the weight bound) give a small complexity penalty, but perhaps with some increase in the estimate error. Thus, there is a tradeoff between interpolation accuracy and generalization capability, and proper values of the quantizer parameters should be chosen in order to optimize generalization performance. As pointed out by Bartlett [21], for networks with many small weights but small squared error on the training examples, the Vapnik-Chervonenkis (VC) dimension (and, hence, the number of parameters) is irrelevant to the generalization performance; instead, the magnitude of the weights in the network is more important. We believe that this is true for our proposed network.

⁴Strictly speaking, the estimate error counts the proportion of training examples that are not correctly classified by the network (before thresholding the output) with a given margin. For the precise definition, readers can refer to Bartlett [19].
B. VC Dimension
The VC dimension is a measure of the flexibility of a network, indicating the number of data points in a space that can be classified arbitrarily. Roughly speaking, assume that 𝒩 is a neural-network architecture with n inputs and threshold outputs; it is actually a set of functions f: R^n → {0, 1}, each corresponding to a set of weights (including biases). A subset S of R^n is said to be shattered by 𝒩 if, for each Boolean function b: S → {0, 1}, there exists some weight set such that f(x) = b(x) for all x ∈ S. The VC dimension of 𝒩, denoted VCdim(𝒩), is the maximal size of a subset S of R^n that is shattered by 𝒩.
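As a concrete illustration of shattering (our own toy example, not from the paper), the sketch below verifies by perceptron training that a single linear threshold neuron with two inputs shatters three affinely independent points in R^2: every one of the 2^3 Boolean labelings is realized by some weight set.

    import itertools
    import numpy as np

    def realizes(points, labels, epochs=200):
        """Try to find a threshold neuron sign(w.x + b) matching the labels
        (in {0, 1}) via the perceptron rule; returns True on success."""
        w, b = np.zeros(points.shape[1]), 0.0
        y = 2 * np.array(labels) - 1                  # map {0, 1} -> {-1, +1}
        for _ in range(epochs):
            mistakes = 0
            for x, t in zip(points, y):
                if t * (w @ x + b) <= 0:              # misclassified or on the boundary
                    w, b = w + t * x, b + t
                    mistakes += 1
            if mistakes == 0:
                return True
        return False

    S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # three points in R^2
    print(all(realizes(S, lab) for lab in itertools.product([0, 1], repeat=3)))   # True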
Many researchers have investigated the VC dimension of various feedforward networks. Sakurai [22] gave lower and upper bounds on the VC dimension of feedforward neural networks with a single hidden layer and a single output neuron, where all the neurons have the threshold activation function and the input patterns are in general position. Sakurai [23] gave lower and upper bounds on the VC dimension of a set of neural networks with piecewise polynomial activation functions. Koiran and Sontag [24] showed that the square of the number of weights is, up to a constant, a lower bound on the VC dimension of sigmoid network architectures. It is shown in [20] and [25] that there is a feedforward network with a given number of layers and a given total number of parameters, in which every hidden neuron has the sigmoid activation function and the output neuron has the threshold activation function, whose VC dimension attains this quadratic lower bound.
(For brevity, we omit the details of VC theory in this paper, since they are not the aim of the current work; readers can refer to [20] for a thorough treatment.)
Baum and Haussler [26] showed that the VC dimension is at most 2W log₂(ek) for a feedforward threshold network with W weights and k computation units.⁵ Shawe-Taylor and Anthony [27] extended Baum and Haussler's [26] result from single-output threshold networks to multi-output threshold networks and showed that the same bound holds. It is known [20] that any set of input patterns shattered by a network of linear threshold neurons is also shattered by a network with the same structure but with the threshold activation functions replaced by a sigmoid activation function in all nonoutput computation units. Hence, these lower bound results also hold for standard sigmoid networks. It is shown [20], [28], [29] that for a feedforward network with W parameters and k computation units, in which each computation unit other than the output unit has the standard sigmoid activation function (the output unit being a linear threshold unit), the best known upper bound on the VC dimension is of order W²k².
⁵Computation units refer to all the neurons other than the input neurons, that is, the hidden and output neurons.
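As an illustrative calculation (our own; the values of W and k below are arbitrary placeholders, not the parameter counts of the proposed network), the bounds quoted above can be evaluated numerically: the Baum-Haussler bound 2W log₂(ek) for threshold networks and the order term W²k² for sigmoid networks.

    import math

    def vc_bounds(W, k):
        """W = number of weights/parameters, k = number of computation units."""
        baum_haussler = 2 * W * math.log2(math.e * k)   # threshold-network upper bound
        sigmoid_order = (W * k) ** 2                    # order of the sigmoid-network upper bound
        return baum_haussler, sigmoid_order

    print(vc_bounds(W=200, k=25))   # roughly (2.4e3, 2.5e7)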