
Financial Time Series Forecasting by Neural Network Using Conjugate Gradient Learning Algorithm and Multiple Linear Regression Weight Initialization

CHAN Man-Chung, WONG Chi-Cheong, LAM Chi-Chung

Department of Computing
The Hong Kong Polytechnic University
Kowloon, Hong Kong

Multilayer neural networks have been successfully applied to time series forecasting. Steepest descent, a
popular learning algorithm for backpropagation networks, converges slowly and makes the network
parameters difficult to determine. In this paper, a conjugate gradient learning algorithm with a restart
procedure is introduced to overcome these problems. Also, the commonly used random weight initialization
is not guaranteed to generate a set of initial connection weights close to the optimal weights, which leads to
slow convergence. Multiple linear regression (MLR) provides a better alternative for weight initialization.

The daily trade data of listed companies from the Shanghai Stock Exchange is collected for technical analysis
by means of neural networks. Two learning algorithms and two weight initializations are compared.
The results show that neural networks can model the time series satisfactorily, whichever learning
algorithm and weight initialization are adopted. However, the proposed conjugate gradient with MLR weight
initialization requires a lower computation cost and learns better than steepest descent with random
initialization.

Keywords: time series forecasting, technical analysis, learning algorithm, conjugate gradient, multiple linear regression weight
initialization, backpropagation neural network

1. Introduction

Detecting trends in stock data is a decision support process. Although the Random Walk Theory claims that price changes are serially independent, traders and certain academics[4] have observed that no market is efficient. The movements of market prices are not random and are predictable.

Statistical methods and neural networks are commonly used for time series prediction. Empirical results have shown that neural networks outperform linear regression[1,18,32] since stock markets are complex, nonlinear, dynamic and chaotic[22]. Neural networks are reliable for modeling nonlinear, dynamic market signals[15]. A neural network makes very few assumptions, as opposed to the normality assumptions commonly found in statistical methods. A neural network can perform prediction after learning the underlying relationship between the input variables and the outputs. From a statistician's point of view, neural networks are analogous to nonparametric, nonlinear regression models.

The backpropagation neural network is commonly used for price prediction. Classical backpropagation adopts the first-order steepest descent technique as its learning algorithm. Weights are modified in a direction that corresponds to the negative gradient of the error surface. The gradient is an extremely local pointer and does not point to the global minimum. This hill-climbing search moves in a zigzag and may head in a wrong direction, getting stuck in a local minimum. Progress made along one direction may be spoiled by subsequent directions, leading to slow convergence.

In addition, classical backpropagation is sensitive to parameters such as the learning rate and the momentum rate. For example, the value of the learning rate is critical in the sense that too small a value makes convergence slow, while too large a value makes the search direction jump wildly and never converge. The optimal values of the parameters are difficult to find and are often obtained empirically.

Customary random weight initialization does not guarantee a good choice of initial weight values. The random weights may be far from a good solution, or near local minima or saddle points of the error surface, leading to slow learning.

To overcome the deficiencies of steepest descent learning and random weight initialization, some researchers[19,29] have investigated the use of Genetic Algorithms and Simulated Annealing to escape local minima. Some[5,20]
have attempted Orthogonal Least Squares. Some have adopted Newton-Raphson and Levenberg-Marquardt.

In this paper, the conjugate gradient learning algorithm and multiple linear regression weight initialization are attempted. In the next section, the conjugate gradient learning algorithm is introduced. Section 3 describes multiple linear regression weight initialization. The experiments on the performance of both learning algorithms and both weight initializations, and their results, are reported in Section 4. Finally, conclusions are drawn and further research is discussed in Section 5.

2. Conjugate Gradient Learning Algorithm

The training phase of a backpropagation network is an unconstrained nonlinear optimization problem. The goal of the training is to search for an optimal set of connection weights such that the errors of the network output are minimized.

Besides the popular steepest descent algorithm, the conjugate gradient algorithm is another search method that can be used to minimize the network output error, searching along conjugate directions. The conjugate gradient method uses orthogonal and linearly independent non-zero vectors. Two vectors $d_i$ and $d_j$ are mutually $G$-conjugate if

$$d_i^T G d_j = 0 \quad \text{for } i \neq j \tag{1}$$

The algorithm was first developed to minimize a quadratic function of n variables

$$f(w) = c - b^T w + \frac{1}{2} w^T G w \tag{2}$$

where $w$ is a vector with n elements and $G$ is an n × n symmetric and positive definite matrix. The algorithm was then extended to the minimization of general non-linear functions by interpreting (2) as a second order Taylor series expansion of the objective function. $G$ in (2) is regarded as the Hessian matrix of the function f.

A starting point $w_1$ is selected first. The first search direction $d_1$ is set to the negative of the gradient $g_1$ (i.e. $d_1 = -g_1$). The conjugate gradient method minimizes the differentiable function (2) by generating a sequence of approximations $w_{k+1}$ iteratively according to

$$w_{k+1} = w_k + \alpha_k d_k \tag{3}$$
$$d_{k+1} = -g_{k+1} + \beta_k d_k \tag{4}$$

$\alpha$ and $\beta$ are momentum terms to avoid oscillations.

Let $\mu = \frac{1}{1 + \beta_k}$. Equation (4) can be rewritten as:

$$d_{k+1} = \frac{1}{\mu}\left[\mu(-g_{k+1}) + (1 - \mu) d_k\right] \tag{5}$$

The search direction can be viewed as a convex combination of the current steepest descent direction and the direction used in the last move.

The search distance along each direction varies. The value of $\alpha_k$ can be determined by line search techniques, such as Golden Search and Brent's Algorithm, such that $f(w_k + \alpha_k d_k)$ is minimized along the direction $d_k$, given fixed $w_k$ and fixed $d_k$.

$\beta_k$ can be calculated by the following three formulae:

Hestenes and Stiefel's formula,

$$\beta_k = \frac{g_{k+1}^T [g_{k+1} - g_k]}{d_k^T [g_{k+1} - g_k]} \tag{6}$$

Polak and Ribiere's formula,

$$\beta_k = \frac{g_{k+1}^T [g_{k+1} - g_k]}{g_k^T g_k} \tag{7}$$

Fletcher and Reeves' formula,

$$\beta_k = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k} \tag{8}$$

Shanno's inexact line search[15] considers the conjugate gradient method as a memoryless quasi-Newton method. Shanno derives a formula for computing $d_{k+1}$:

$$d_{k+1} = -g_{k+1} - \left[\left(1 + \frac{y_k^T y_k}{p_k^T y_k}\right)\frac{p_k^T g_{k+1}}{p_k^T y_k} - \frac{y_k^T g_{k+1}}{p_k^T y_k}\right] p_k + \frac{p_k^T g_{k+1}}{p_k^T y_k}\, y_k \tag{9}$$

where $p_k = \alpha_k d_k$ and $y_k = g_{k+1} - g_k$.

The method performs an approximate line minimization in a descent direction in order to increase numerical stability.

For n-dimensional quadratic problems, the solution converges from $w_1$ to $w^*$ in n moves along different $G$-conjugate directions $d_1, d_2, \ldots, d_n$. However, for non-quadratic problems, the $G$-conjugacy of the direction vectors deteriorates. Therefore, the direction vector is reinitialized to the negative gradient of the current point after every n steps. That is,

$$d_k = -g_k \quad \text{where } k = mn + 1,\ m \in N \tag{10}$$

The conjugate gradient method has a second-order convergence property without the complex calculation of the Hessian matrix. Faster convergence is expected than with the first order steepest descent approach. The conjugate gradient approach finds the optimal weight vector $w$ along the current gradient by doing a line search. It computes the gradient at the new point and projects it onto the subspace defined by the complement of the space defined by all previously chosen gradients. The new direction is orthogonal to all previous
search directions. The method is simple. No parameter is involved. It requires little storage space and is expected to be efficient.

The conjugate gradient algorithm is summarized below:
1. Set k = 1. Initialize $w_1$.
2. Compute $g_1 = \nabla f(w_1)$.
3. Set $d_1 = -g_1$.
4. Compute $\alpha_k$ by line search, where $\alpha_k = \arg\min_{\alpha} f(w_k + \alpha d_k)$.
5. Update the weight vector by $w_{k+1} = w_k + \alpha_k d_k$.
6. If the network error is less than a pre-set minimum value or the maximum number of iterations has been reached, stop; else go to step 7.
7. If k+1 > n, then set $w_1 = w_{k+1}$, k = 1 and go to step 2;
   else a) set k = k+1
        b) compute $g_{k+1} = \nabla f(w_{k+1})$.
        c) compute $\beta_k$.
        d) compute the new direction: $d_{k+1} = -g_{k+1} + \beta_k d_k$.
        e) go to step 4

To compute the gradient in steps 2 and 7b, the objective function is first defined. The aim is to minimize the network error, which depends on the connection weights. The objective function is defined by the error function:

$$f(w) = \frac{1}{2N} \sum_n \sum_j \left(t_{nj} - y_{nj}(w)\right)^2 \tag{11}$$

where N is the number of patterns in the training set;
      $w$ is a one-dimensional weight vector in which weights are ordered by layer and then by neuron;
      $t_{nj}$ and $y_{nj}(w)$ are the desired and actual outputs of the j-th output neuron for the n-th pattern, respectively.

With the arguments in [34], the gradient is

$$g(w) = \frac{1}{N} \sum_n \delta_{nj}\, y_{ni}(w) \tag{12}$$

For output nodes,

$$\delta_{nj} = -\left(t_{nj} - y_{nj}(w)\right) s'_j(net_{nj}) \tag{13}$$

where $s'_j(net_{nj})$ is the derivative of the activation function with respect to the input $net_{nj}$ of the j-th neuron.

For the hidden nodes,

$$\delta_{nj} = s'_j(net_{nj}) \sum_k \delta_{nk} w_{jk} \tag{14}$$

where $w_{jk}$ is the weight from the j-th to the k-th neuron.

3. Multiple Linear Regression Weight Initialization

Backpropagation is a hill-climbing technique. It runs the risk of being trapped in a local optimum. The starting point of the connection weights therefore becomes an important issue in reducing the possibility of being trapped in a local optimum. Random weight initialization is not guaranteed to generate a good starting point. It can be enhanced by multiple linear regression. In this method, the weights between the input layer and the hidden layer are still initialized randomly, but the weights between the hidden layer and the output layer are obtained by multiple linear regression.

The weight $w_{ij}$ between the input node i and the hidden node j is initialized by uniform randomization. Once the input $x_i^s$ of sample s has been fed into the input node and the $w_{ij}$'s have been assigned values, the output value $R_j^s$ of the hidden node j can be calculated as

$$R_j^s = f\Big(\sum_i w_{ij} x_i^s\Big) \tag{15}$$

where f is a transfer function. The output value of the output node can be calculated as

$$y^s = f\Big(\sum_j v_j R_j^s\Big) \tag{16}$$

where $v_j$ is the weight between the hidden layer and the output layer.

Assume the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$ is used as the transfer function of the network. By Taylor's expansion,

$$f(x) \cong \frac{1}{2} + \frac{x}{4} \tag{17}$$

Applying the linear approximation in (17) to (16), we have the following approximated linear relationship between the output y and the $v_j$'s:

$$y^s = \frac{1}{2} + \frac{1}{4}\Big(\sum_{j}^{m} v_j R_j^s\Big) \tag{18}$$

or

$$4y^s - 2 = v_1 R_1^s + v_2 R_2^s + \ldots + v_m R_m^s, \quad s = 1, 2, \ldots, N \tag{19}$$

where m is the number of hidden nodes;
      N is the total number of training samples.

The set of equations in (19) is a typical multiple linear regression model. The $R_j^s$'s are considered as the regressors. The $v_j$'s can be estimated by a standard regression method.

Once the $v_j$'s have been obtained, the network initialization is completed and the training starts.
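
The initialization of Section 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `mlr_initialize`, the range of the uniform random weights (`scale`), the use of ordinary least squares as the "standard regression method", and the synthetic data are all assumptions made for illustration. As in (19), no bias term is included.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def mlr_initialize(X, y, n_hidden, rng, scale=0.5):
    """Hypothetical helper sketching MLR weight initialization.

    X: (N, n_inputs) training inputs, y: (N,) training targets.
    Returns input-to-hidden weights W and hidden-to-output weights v.
    """
    n_inputs = X.shape[1]
    # Input-to-hidden weights are still drawn uniformly at random
    # (the range [-scale, scale] is an assumption).
    W = rng.uniform(-scale, scale, size=(n_inputs, n_hidden))
    # Hidden node outputs R_j^s, equation (15).
    R = sigmoid(X @ W)
    # Linearized model 4*y - 2 = R v, equation (19), solved by least squares.
    v, _, _, _ = np.linalg.lstsq(R, 4.0 * y - 2.0, rcond=None)
    return W, v

# Toy usage on synthetic data (ten inputs, five hidden nodes).
rng = np.random.default_rng(0)
X = rng.uniform(0.05, 0.95, size=(200, 10))
y = sigmoid((X @ rng.normal(size=10)) * 0.3)
W, v = mlr_initialize(X, y, n_hidden=5, rng=rng)

# Compare the fit of the linearized model (19) against random output weights.
R = sigmoid(X @ W)
v_random = rng.uniform(-0.5, 0.5, size=5)
residual_mlr = np.sum((R @ v - (4.0 * y - 2.0)) ** 2)
residual_random = np.sum((R @ v_random - (4.0 * y - 2.0)) ** 2)
```

By construction the least-squares weights fit the linearized relationship (19) at least as well as any random choice of output weights; whether this also gives a better starting point for the full nonlinear network is the empirical question examined in Section 4.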
4. Experiment

The daily trading data of eleven listed companies in 1994-1996 was collected from the Shanghai Stock Exchange for technical analysis of stock price. The first 500 entries were used as training data. The remaining 150 were testing data. The raw data is preprocessed into various technical indicators to gain insight into the direction in which the stock market may be going. Ten technical indicators were selected as inputs of the neural network: the lagged changes in exponential moving average over the past 5 days (∆1EMA(t-1), ∆2EMA(t-1), ∆3EMA(t-1), ∆4EMA(t-1), ∆5EMA(t-1)), relative strength index on day t-1 (RSI(t-1)), moving average convergence-divergence on day t-1 (MACD(t-1)), MACD Signal Line on day t-1 (MACD Signal Line(t-1)), stochastic %K on day t-1 (%K(t-1)) and stochastic %D on day t-1 (%D(t-1)).

EMA is a trend-following tool that gives an average value of data with greater weight on the latest data. The difference of EMA can be considered as momentum. RSI is an oscillator which measures the strength of upward versus downward movements over a certain time interval (nine days were selected in our experiment). A high value of RSI indicates a strong market and a low value indicates a weak market. MACD, a trend-following momentum indicator, is the difference between two moving averages of price. In our experiment, the 12-day EMA and the 26-day EMA were used. The MACD signal line smooths MACD. The 9-day EMA of MACD was selected for the calculation of the MACD signal line. Stochastic is an oscillator that tracks the relationship of each closing price to the recent high-low range. It has two lines: %K and %D. %K is the "raw" Stochastic. In our experiment, the Stochastic's time window was set to five for the calculation of %K. %D smooths %K, over a 3-day period in our experiment.

A neural network cannot handle a wide range of values. In order to avoid difficulty in getting network outputs very close to the two endpoints, the indicators were normalized to the range [0.05, 0.95], instead of [0,1], before being input to the network.

Prediction of price change allows a larger error tolerance than prediction of the exact price value, resulting in a significant improvement in forecasting ability[8,10]. In order to smooth out the noise and the random component of the data, the exponential moving average of the closing price change at day t (∆EMA(t)) was selected as the output node of the network. ∆EMA(t) can then be transformed to the stock closing price $P_t$ by

$$P_t = P_{t-1} + \frac{1}{a}\left[\Delta EMA(t) - \Delta EMA(t-1)\right] + \Delta EMA(t-1)$$

where $a$ stands for the smoothing constant of the exponential moving average, $EMA(t) = EMA(t-1) + a\,[P_t - EMA(t-1)]$.

A three-layer network architecture was used. The required number of hidden nodes is estimated by

No. of hidden nodes = (M + N) / 2

where M and N are the numbers of input nodes and output nodes respectively. In our network, there were ten input nodes and one output node. Hence, five hidden nodes were used.

The following scenarios have been examined:
a. Conjugate gradient with random initialization (CG/RI)
b. Conjugate gradient with multiple linear regression initialization (CG/MLRI)
c. Steepest descent with random initialization (SD/RI)
d. Steepest descent with multiple linear regression initialization (SD/MLRI)

In the steepest descent algorithm, the learning rate and momentum rate were set to 0.1 and 0.5 respectively[27]. In conjugate gradient, Golden Search was used to perform the exact line search of α. According to Bazaraa's analysis[3], Polak and Ribiere's form was selected for the calculation of β. For all scenarios mentioned above, the training is terminated when the mean square error (MSE) is smaller than 0.5%.

Data from all eleven companies were used for each of the above scenarios. Each company data set was run 10 times. Figure 1 shows a sample result from the testing phase.
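
The conjugate gradient configuration used in the experiment (Golden Search for α, Polak and Ribiere's formula (7) for β, and the restart rule (10) after every n steps) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the bracket-doubling scheme in front of the golden-section search, the tolerances, and the small quadratic test problem of the form (2) are all hypothetical choices.

```python
import numpy as np

GOLDEN = (np.sqrt(5.0) - 1.0) / 2.0  # golden ratio conjugate, about 0.618

def golden_search(phi, step=1.0, tol=1e-10, max_iter=200):
    """Minimize a unimodal 1-D function phi(alpha) for alpha >= 0."""
    # Assumed bracketing scheme: double the step until phi stops decreasing.
    hi = step
    for _ in range(50):
        if phi(2.0 * hi) >= phi(hi):
            break
        hi *= 2.0
    a, b = 0.0, 2.0 * hi
    c = b - GOLDEN * (b - a)
    d = a + GOLDEN * (b - a)
    fc, fd = phi(c), phi(d)
    for _ in range(max_iter):
        if b - a <= tol:
            break
        if fc < fd:            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - GOLDEN * (b - a)
            fc = phi(c)
        else:                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + GOLDEN * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

def conjugate_gradient(f, grad, w, err_tol=1e-10, max_iter=1000):
    """Conjugate gradient with Polak-Ribiere beta and restart every n steps."""
    n = w.size
    g = grad(w)
    d = -g                                             # d_1 = -g_1
    for k in range(1, max_iter + 1):
        alpha = golden_search(lambda a: f(w + a * d))  # line search for alpha_k
        w = w + alpha * d                              # equation (3)
        g_new = grad(w)
        # Stop when the error is below the pre-set minimum (or gradient vanishes).
        if f(w) < err_tol or g_new @ g_new == 0.0:
            return w, k
        if k % n == 0:
            d = -g_new                                 # restart, equation (10)
        else:
            beta = g_new @ (g_new - g) / (g @ g)       # equation (7)
            d = -g_new + beta * d                      # equation (4)
        g = g_new
    return w, max_iter

# Toy usage: a 2-D quadratic with known minimizer w_star and minimum value 0.
G = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
w_star = np.array([1.0, -2.0])
f = lambda w: 0.5 * (w - w_star) @ G @ (w - w_star)
grad = lambda w: G @ (w - w_star)
w_min, steps = conjugate_gradient(f, grad, np.zeros(2))
```

For a network, `f` would be the error function (11) and `grad` the backpropagated gradient (12); the quadratic is used here only because its minimizer is known in advance.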

Fig 1a: Predicted ∆EMA(t) vs actual ∆EMA(t) Fig 1b: Predicted stock price vs actual stock price
Figure 1: A sample result from neural network
In Figure 1a, although the predicted ∆EMA(t) and the actual ∆EMA(t) deviate relatively widely in some regions, the network can still model the actual EMA reasonably well. On the other hand, after the transformation of ∆EMA(t) to the exact price value, the deviation between the actual price and the predicted price is small. The two curves in Figure 1b nearly coincide. This reflects that the selection of the network forecaster was appropriate.

The performance of the scenarios mentioned above is evaluated by the average number of iterations required for training, the average MSE in the testing phase and the percentage of correct direction prediction in the testing phase. The results are summarized in Table 1.

             Average number   Average MSE   % of correct
             of iterations                  direction prediction
CG / RI           56.636       0.001753          73.055
CG / MLRI         30.273       0.001768          73.545
SD / RI          497.818       0.001797          72.564
SD / MLRI        729.367       0.002580          69.303
Table 1: Performance evaluation for four scenarios

All scenarios, except for steepest descent with MLR initialization, achieve similar average MSE and percentage of correct direction prediction. All scenarios perform satisfactorily. The mean square error produced is on average below 0.258% and more than 69% correct direction prediction is reached.

Conjugate gradient learning on average requires significantly fewer iterations than steepest descent learning. Due to the complexity of the line search, conjugate gradient requires a longer computation time per iteration than steepest descent. However, the overall convergence of the conjugate gradient neural network is still faster than steepest descent.

In the conjugate gradient network, MLR initialization requires fewer training iterations than random initialization, while achieving MSE and direction prediction accuracy similar to those of random initialization. The positive result shows that regression provides a better starting point for the local quadratic approximation of the nonlinear network function performed by conjugate gradient.

However, in the steepest descent network, regression initialization does not improve performance. It requires more iterations for training, and produces a larger MSE and fewer correct direction predictions than random initialization. The phenomenon is opposite to the case in the conjugate gradient network. It is attributed to the characteristic of the gradient descent algorithm that it modifies the direction to the negative gradient of the error surface, so that the good starting point generated by MLR is spoiled by subsequent directions.

5. Conclusion & Discussion

The experimental results show that it is possible to model stock price based on historical trading data by using a three-layer neural network. In general, both the steepest descent network and the conjugate gradient network produce the same level of error and reach the same level of direction prediction accuracy.

The conjugate gradient approach has advantages over the steepest descent approach. It does not require empirical determination of network parameters. As opposed to the zigzag motion of the steepest descent approach, its orthogonal search prevents a good point from being spoiled. Theoretically, the convergence of the second-order conjugate gradient method is faster than that of the first order steepest descent approach. This is verified in our experiment.

In regard to the initial starting point, the experimental results show that the good starting point generated by multiple linear regression weight initialization is spoiled by subsequent directions in the steepest descent network. In contrast, regression initialization provides a good starting point for conjugate gradient learning and improves its convergence.

To sum up, the efficiency of backpropagation can be improved by conjugate gradient learning with multiple linear regression weight initialization.

It is believed that the computation time of conjugate gradient can be reduced by Shanno's approach[7]. The initialization scheme may be improved by estimating the weights between input nodes and hidden nodes, instead of initializing them randomly. Enrichment with more relevant inputs, such as fundamental data and data from derivative markets, may improve the predictability of the network. Finally, more sophisticated network architectures can be attempted for price prediction.
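
The two test-phase criteria reported in Table 1 can be illustrated with a short sketch. The paper does not spell out how a "correct direction prediction" is counted, so the definition below (the predicted and actual ∆EMA(t) have the same sign) is an assumption, as are the function names and the four sample values.

```python
import numpy as np

def mean_square_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

def direction_accuracy(actual_change, predicted_change):
    """Percentage of samples whose predicted change has the same sign as
    the actual change (assumed definition of a correct direction)."""
    actual_change = np.asarray(actual_change, dtype=float)
    predicted_change = np.asarray(predicted_change, dtype=float)
    hits = np.sign(actual_change) == np.sign(predicted_change)
    return float(100.0 * np.mean(hits))

# Made-up example values, for illustration only.
actual = [0.5, -0.2, 0.1, -0.4]
predicted = [0.4, 0.1, 0.2, -0.3]
mse_value = mean_square_error(actual, predicted)
hit_rate = direction_accuracy(actual, predicted)
```

In this made-up example three of the four predicted changes have the correct sign, so the hit rate is 75%, while the MSE averages the four squared deviations.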
References

1. Arun Bansal, Robert J. Kauffman, Rob R. Weitz. Comparing the Modeling Performance of Regression and Neural Networks as Data Quality Varies: A Business Value Approach, Journal of Management Information Systems, Vol 10, pp.11-32, 1993.
2. Baum E.B., Haussler D. What size net gives valid generalization?, Neural Computation 1, pp.151-160, 1988.
3. Bazaraa M.S. Nonlinear Programming, Theory and Algorithms, Ch 8, Wiley, 1993.
4. Burton G. Malkiel. A Random Walk Down Wall Street, Ch 5-8, W.W. Norton & Company, 1996.
5. Chen S., Cowan C.F.N., Grant P.M. Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks, IEEE Trans. Neural Network 2(2), pp.302-309, 1991.
6. Chris Bishop. Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation 4, pp.494-501, 1992.
7. David F. Shanno. Conjugate Gradient Methods with Inexact Searches, Mathematics of Operations Research, Vol. 3, no. 3, 1978.
8. David M. Skapura. Building Neural Networks, Ch 5, ACM Press, Addison-Wesley.
9. Davide Chinetti, Francesco Gardin, Cecilia Rossignoli. A Neural Network Model for Stock Market Prediction, Proc. Int'l Conference on Artificial Intelligence Applications on Wall Street, 1993.
10. Edward Gately. Neural Networks for Financial Forecasting, Wiley, 1996.
11. Eitan Michael Azoff. Reducing Error in Neural Network Time Series Forecasting, Neural Computing & Applications, pp.240-247, 1993.
12. Emad W. Saad, Danil V.P., Donald C.W. Advanced Neural Network Training Methods for Low False Alarm Stock Trend Prediction, Proc. of World Congress on Neural Networks, Washington D.C., June 1996.
13. G. Thimm and E. Fiesler. High Order and Multilayer Perceptron Initialization, IEEE Transactions on Neural Networks, Vol. 8, No. 2, pp.349-359, 1997.
14. Gia-Shuh Jang, Feipei Lai. Intelligent Stock Market Prediction System Using Dual Adaptive-Structure Neural Networks, Proc. Int'l Conference on Artificial Intelligence Applications on Wall Street, 1993.
15. Guido J. Deboeck. Trading on the Edge - Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, Wiley, 1994.
16. Hecht-Nielsen R. Kolmogorov's mapping neural network existence theorem, Proc. 1st IEEE Int'l Joint Conf. Neural Network, June 1987.
17. Hong Tan, Danil V. P., Donald C. W. Probabilistic and Time-Delay Neural Network Techniques for Conservative Short-Term Stock Trend Prediction, Proc. of World Congress on Neural Networks, Washington D.C., July 1995.
18. Leorey Marquez, Tim Hill, Reginald Worthley, William Remus. Neural Network Models as an Alternate to Regression, Proc. of IEEE 24th Annual Hawaii Int'l Conference on System Sciences, pp.129-135, Vol VI, 1991.
19. Michael McInerney, Atam P. Dhawan. Use of Genetic Algorithms with Back-Propagation in training of Feed-Forward Neural Networks, Proc. of IEEE Int'l Joint Conference on Neural Networks, Vol. 1, pp.203-208, 1993.
20. Mikko Lehtokangas. Initializing Weights of a Multilayer Perceptron Network by Using the Orthogonal Least Squares Algorithm, Neural Computation, 1995.
21. Refenes A-P., Zapranis A., Francis G. Stock Performance Modeling Using Neural Networks: A Comparative Study with Regression Models, Neural Networks, Vol. 7, no. 2, pp.375-388, 1994.
22. Robert R. Trippi, Jae L. Lee. Artificial Intelligence in Finance & Investing, Ch 10, IRWIN, 1996.
23. Robert R. Trippi. Neural Networks in Finance and Investing, Probus Publishing Company, 1993.
24. Roberto Battiti. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method, Neural Computation 4, pp.141-166, 1992.
25. Roger Stein. Preprocessing Data for Neural Networks, AI Expert, March, 1993.
26. Roger Stein. Selecting Data for Neural Networks, AI Expert, March, 1993.
27. Simon Haykin. Neural Networks, A Comprehensive Foundation, Ch 6, Macmillan, 1994.
28. Steven B. Achelis. Technical Analysis From A to Z, Probus Publishing, 1995.
29. Timothy Masters. Practical Neural Network Recipes in C++, Ch 4-10, Academic Press.
30. W. Press et al. Numerical Recipes in C, The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 1994.
31. Wasserman P.D. Advanced Methods in Neural Computing, Van Nostrand Reinhold.
32. White H. Economic Prediction Using Neural Networks: The Case of IBM Daily Stock Returns, Proc. of IEEE Int'l Conference on Neural Networks, 1988.
33. Wood D., Dasgupta B. Classifying Trend Movements in the MSCI U.S.A. Capital Market Index - A comparison of Regression, ARIMA and Neural Network Methods, Computer and Operations Research, Vol. 23, no. 6, pp.611, 1996.
34. Rumelhart, D.E., Hinton G.E., Williams R.J. Learning Internal Representations by Error Propagation, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, 1986.