The answer to this question is embodied in the universal approximation theorem for a
nonlinear input–output mapping, which may be stated as follows:
Let φ(·) be a nonconstant, bounded, and monotone-increasing continuous function. Let I_{m_0} denote the m_0-dimensional unit hypercube [0, 1]^{m_0}. The space of continuous functions on I_{m_0} is denoted by C(I_{m_0}). Then, given any function f ∈ C(I_{m_0}) and ε > 0, there exist an integer m_1 and sets of real constants α_i, b_i, and w_{ij}, where i = 1, ..., m_1 and j = 1, ..., m_0, such that we may define

F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i \, \varphi\left( \sum_{j=1}^{m_0} w_{ij} x_j + b_i \right)    (4.88)

as an approximate realization of the function f(·); that is,

|F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0})| < \varepsilon

for all x_1, ..., x_{m_0} that lie in the input space.

The universal approximation theorem is directly applicable to multilayer perceptrons. We first note, for example, that the hyperbolic tangent function used as the nonlinearity in a neural model for the construction of a multilayer perceptron is indeed a nonconstant, bounded, and monotone-increasing function; it therefore satisfies the conditions imposed on the function φ(·). Next, we note that Eq. (4.88) represents the output of a multilayer perceptron described as follows:

1. The network has m_0 input nodes and a single hidden layer consisting of m_1 neurons; the inputs are denoted by x_1, x_2, ..., x_{m_0}.
2. Hidden neuron i has synaptic weights w_{i1}, ..., w_{im_0} and bias b_i.
3. The network output is a linear combination of the outputs of the hidden neurons, with α_1, ..., α_{m_1} defining the synaptic weights of the output layer.

The universal approximation theorem is an existence theorem in the sense that it provides the mathematical justification for the approximation of an arbitrary continuous function as opposed to exact representation. Equation (4.88), which is the backbone of the theorem, merely generalizes approximations by finite Fourier series. In effect, the theorem states that a single hidden layer is sufficient for a multilayer perceptron to compute a uniform approximation to a given training set represented by the set of inputs x_1, ..., x_{m_0} and a desired (target) output f(x_1, ..., x_{m_0}). However, the theorem does not say that a single hidden layer is optimum in the sense of learning time, ease of implementation, or (more importantly) generalization.

Bounds on Approximation Errors

Barron (1993) has established the approximation properties of a multilayer perceptron, assuming that the network has a single layer of hidden neurons using sigmoid functions and a linear output neuron. The network is trained using the back-propagation algorithm and then tested with new data. During training, the network learns specific points of a target function f in accordance with the training data and thereby produces the approximating function F defined in Eq. (4.88). When the network is exposed to test data that have not been seen before, the network function F acts as an "estimator" of new points of the target function.

A smoothness property of the target function f is expressed in terms of its Fourier representation. In particular, the average of the norm of the frequency vector weighted by the Fourier magnitude distribution is used as a measure for the extent to which the function f oscillates. Let f̃(ω) denote the multidimensional Fourier transform of the function f(x); the m_0-by-1 vector ω is the frequency vector. The function f(x) is defined in terms of its Fourier transform f̃(ω) by the inverse formula

f(\mathbf{x}) = \int_{\mathbb{R}^{m_0}} \tilde{f}(\boldsymbol{\omega}) \exp(j \boldsymbol{\omega}^T \mathbf{x}) \, d\boldsymbol{\omega}    (4.89)
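As a concrete illustration of the functional form of Eq. (4.88), the sketch below builds F(x) = Σ_i α_i tanh(w_i x + b_i) for a one-dimensional input (m_0 = 1) and fits it to a smooth target on the unit interval. The theorem only asserts that suitable constants exist; the procedure used here — random hidden weights w_i, b_i with the output weights α_i obtained by least squares — is an illustrative assumption, not the book's construction.

```python
import numpy as np

# Sketch of Eq. (4.88): F(x) = sum_i alpha_i * phi(w_i * x + b_i), with
# phi = tanh.  Hidden weights are drawn at random (an assumption made for
# illustration); only the output-layer weights alpha_i are fitted.
rng = np.random.default_rng(0)

m1 = 50                                   # number of hidden neurons
x = np.linspace(0.0, 1.0, 200)            # samples from the unit interval (m0 = 1)
f = np.sin(2.0 * np.pi * x)               # target function f in C(I_1)

w = rng.normal(0.0, 8.0, m1)              # hidden-layer weights w_i
b = rng.uniform(-8.0, 8.0, m1)            # hidden-layer biases b_i
H = np.tanh(np.outer(x, w) + b)           # hidden activations, shape (200, m1)

# Output-layer weights alpha_i by least squares on the training points.
alpha, *_ = np.linalg.lstsq(H, f, rcond=None)
F = H @ alpha                             # the approximation F(x) of Eq. (4.88)

mse = np.mean((F - f) ** 2)
print(f"mean squared error: {mse:.2e}")
```

With 50 random tanh units, the least-squares fit drives the error on this smooth target very low, consistent with the theorem's promise that a single hidden layer suffices for uniform approximation.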
where j = √−1. For the complex-valued function f̃(ω) for which ωf̃(ω) is integrable, we define the first absolute moment of the Fourier magnitude distribution of the function f as

C_f = \int_{\mathbb{R}^{m_0}} \| \boldsymbol{\omega} \| \, |\tilde{f}(\boldsymbol{\omega})| \, d\boldsymbol{\omega}    (4.90)

where ‖ω‖ is the Euclidean norm of ω and |f̃(ω)| is the absolute value of f̃(ω). The first absolute moment C_f quantifies the smoothness of the function f.

The first absolute moment C_f provides the basis for a bound on the error that results from the use of a multilayer perceptron represented by the input–output mapping function F(x) of Eq. (4.88) to approximate f(x). The approximation error is measured by the integrated squared error with respect to an arbitrary probability measure on the ball B_r = \{ \mathbf{x} : \|\mathbf{x}\| \leq r \} of radius r > 0. On this basis, we may state the following proposition for a bound on the approximation error given by Barron (1993):

For every continuous function f(x) with finite first moment C_f and every m_1 ≥ 1, there exists a linear combination of sigmoid-based functions F(x) of the form defined in Eq. (4.88) such that when the function f(x) is observed at a set of values of the input vector x denoted by {x_i}, i = 1, ..., N, that are restricted to lie inside the prescribed ball of radius r, the result provides the following bound on the empirical risk:

\varepsilon_{\mathrm{av}}(N) = \frac{1}{N} \sum_{i=1}^{N} \bigl( f(\mathbf{x}_i) - F(\mathbf{x}_i) \bigr)^2 \leq \frac{C_f'}{m_1}    (4.91)

where C_f' = (2rC_f)^2.
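To make the first absolute moment concrete, the sketch below evaluates it numerically for a one-dimensional Gaussian target, f(x) = exp(−x²/2). Under the inverse-transform convention of Eq. (4.89), its transform is f̃(ω) = exp(−ω²/2)/√(2π), and the moment has the closed form √(2/π), which we use only to check the quadrature; the choice of target and the trapezoid integration are assumptions made for illustration. The resulting C_f is then plugged into the bound C_f′/m_1 of Eq. (4.91).

```python
import numpy as np

# First absolute moment C_f of Eq. (4.90) for the 1-D Gaussian
# f(x) = exp(-x^2 / 2).  With the convention of Eq. (4.89),
# f~(w) = exp(-w^2 / 2) / sqrt(2*pi), and C_f = sqrt(2/pi) in closed form
# (used below only as a check on the numerical integration).
w = np.linspace(-40.0, 40.0, 400_001)              # frequency grid (m0 = 1)
f_tilde = np.exp(-w**2 / 2.0) / np.sqrt(2.0 * np.pi)

y = np.abs(w) * np.abs(f_tilde)                    # integrand of Eq. (4.90)
C_f = np.sum((y[1:] + y[:-1]) / 2.0) * (w[1] - w[0])   # trapezoid rule

print(f"C_f (numerical) = {C_f:.6f}")              # close to sqrt(2/pi) = 0.7979...

# Barron's bound of Eq. (4.91): empirical risk <= C'_f / m1, C'_f = (2 r C_f)^2.
r, m1 = 1.0, 20                                    # illustrative radius and width
bound = (2.0 * r * C_f) ** 2 / m1
print(f"error bound for m1 = {m1} hidden neurons: {bound:.4f}")
```

Note how the bound shrinks as 1/m_1: doubling the number of hidden neurons halves the guaranteed empirical risk, while a more oscillatory target (larger C_f) loosens the bound quadratically.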