0 valutazioniIl 0% ha trovato utile questo documento (0 voti)

89 visualizzazioni6 pagineNov 17, 2010

© Attribution Non-Commercial (BY-NC)

PDF, TXT o leggi online da Scribd

Attribution Non-Commercial (BY-NC)

0 valutazioniIl 0% ha trovato utile questo documento (0 voti)

89 visualizzazioni6 pagineAttribution Non-Commercial (BY-NC)

Sei sulla pagina 1di 6

Multiple Linear Regression Weight Initialization

Department of Computing

The Hong Kong Polytechnic University

Kowloon, Hong Kong

csmcchan@comp.polyu.edu.hk csccwong@comp.polyu.edu.hk

Abstract

Multilayer neural network has been successfully applied to the time series forecasting. Steepest descend, a

popular learning algorithm for backpropagation network, converges slowly and has the difficulty in

determining the network parameters. In this paper, conjugate gradient learning algorithm with restart

procedure is introduced to overcome these problems. Also, the commonly used random weight initialization

does not guarantee to generate a set of initial connection weights close to the optimal weights leading to slow

convergence. Multiple linear regression (MLR) provides a better alternative for weight initialization.

The daily trade data of the listed companies from Shanghai Stock Exchange is collected for technical analysis

with the means of neural networks. Two learning algorithms and two weight initializations are compared.

The results find that neural networks can model the time series satisfactorily, whatever which learning

algorithm and weight initialization are adopted. However, the proposed conjugate gradient with MLR weight

initialization requires a lower computation cost and learns better than steepest decent with random

initialization.

Keywords: time series forecasting, technical analysis, learning algorithm, conjugate gradient, multiple linear regression weight

initialization, backpropagation neural network

1. Introduction Weights are modified in a direction that corresponds to the

negative gradient of the error surface. Gradient is an

Detecting trends of stock data is a decision support extremely local pointer and does not point to global

process. Although the Random Walk Theory claims that minimum. This hill-climbing search is in zigzag motion and

price changes are serially independent, traders and certain may move towards a wrong direction, getting stuck in a

academics[4] have observed that there is no efficient local minimum. The direction may be spoiled by

market. The movements of market price are not random subsequent directions, leading to slow convergence.

and predictable.

In addition, classical backpropagation is sensitive to the

Statistical methods and neural networks are commonly used parameters such as learning rate and momentum rate. For

for time series prediction. Empirical results have shown examples, the value of learning rate is critical in the sense

that Neural Networks outperform linear regression[1,18,32] that too small value will make have slow convergence and

since stock markets are complex, nonlinear, dynamic and too large value will make the search direction jump wildly

chaotic[22]. Neural networks are reliable for modeling and never converge. The optimal values of the parameters

nonlinear, dynamic market signals[15]. Neural Network are difficult to find and often obtained empirically.

makes very few assumptions as opposed to normality

assumptions commonly found in statistical methods. Customary random weight initialization does not guarantee

Neural network can perform prediction after learning the a good choice of initial weight values. The random weights

underlying relationship between the input variables and may be far from a good solution or near local minima or

outputs. From a statistician’s point of view, neural saddle points of the error surface, leading to a slow learning.

networks are analogous to nonparametric, nonlinear

regression models. To overcome the deficiencies of steepest descent learning

and random weight initialization, some researches[19,29]

Backpropagation neural network is commonly used for have investigated the use of Genetic Algorithms and

price prediction. Classical backpropagation adopts first- Simulated Annealing to escape local minimum. Some[5,20]

have attempted Orthogonal Least Squares. Some have 1

d k +1 = [ µ( −g k +1 ) + ( 1 − µ )d k ] (5)

adopted Newton-Raphson and Levenberg-Marquardt. µ

The search direction can be viewed as a convex combination

In this paper, conjugate gradient learning algorithm and of the current steepest descent direction and the direction

multiple linear regression weight initialization are used in the last move.

attempted. In next section, conjugate gradient learning

algorithm is introduced. Section 3 mentions multiple linear The search distance of each direction is varied. The value

regression weight initialization. The descriptions and the

of α k can be determined by line search techniques, such as

results of experiments on the performance of both learning

algorithms and both weight initializations are reported in Golden Search and Brent’s Algorithm, in the way that

section 4. Finally, conclusion is drawn and further f ( w k + α k d k ) is minimized along the direction d k ,

research is discussed in section 5.

given fixed w k and fixed d k .

2. Conjugate Gradient β k can be calculated by the following three formulae:

Learning Algorithm Hestenes and Stiefel’s formula,

The training phase of a backpropagation network is an g Tk +1 [ g k +1 − g k ]

unconstrained nonlinear optimization problem. The goal of βk = (6)

the training is to search an optimal set of connection d Tk [ g k +1 − g k ]

weights in the manner that the errors of the network output Polak and Ribiere’s formula,

can be minimized. g Tk +1 [ g k +1 − g k ]

βk = (7)

g Tk g k

Besides popular steepest descent algorithm, conjugate

Fletcher and Reeves’ formula,

gradient algorithm is another search method that can be

g Tk +1 g k + 1

used to minimize network output error in conjugate βk = (8)

directions. Conjugate gradient method uses orthogonal and g Tk g k

linearly independent non-zero vectors. Two vectors di

Shanno’s inexact line search[15] considers the conjugate

and d j are mutually G -conjugate if method as a memoryless quasi-Newton method. Shanno

d Ti Gd j = 0 for i ≠ j (1) derives a formula for computing d k +1 :

yT y p T g yT g pT g

The algorithm was firstly developed to minimize a d k +1 = − g k +1 − 1 + kT k kT k − kT k pk + kT k y k

quadratic function of n variables pk y k pk y k pk y k pk y k

1 T where p k = α k d k and yk = gk +1 − gk . (9)

f ( w) = c − b T w + w Gw (2)

2 The method performs an approximate line minimization in

where w is a vector with n elements and G is an n × n a descent direction in order to increase numerical stability.

symmetric and positive definite matrix. The algorithm was

then extended to minimization of general non-linear For n-dimensional quadratic problems, the solution is

functions by interpreting (2) as a second order Taylor converged from w0 to w* by n step moves along

series expansion of the objective function. G in (2) is different G -conjugate directions d 1 , d 2 ,..., d n .

regarded as Hessian matrix of function f.

However, for non-quadratic problems, G -conjugacy of the

direction vectors deteriorates. Therefore, the direction

A starting point w1 is selected first. The first search vector is reinitialized to the negative gradient of the current

direction d1 is set to negative gradient g1 (i.e. point after every n steps. That is,

d1 =- g1 ). Conjugate gradient method is to minimize d k = − g k where k = mn +1, m ∈ N (10)

differentiable function (2) by generating a sequence of

approximation wk +1 iteratively according to Conjugate gradient method has a second-order convergence

property without complex calculation of the Hessian

w k +1 = w k + α k d k (3) matrix. A faster convergence is expected than first order

d k +1 = − gk +1 + βk d k (4) steepest descent approach. Conjugate-gradient approach

α and β are momentum terms to avoid oscillations. finds the optimal weight vector w along the current gradient

by doing a line-search. It computes the gradient at the new

point and projects it onto the subspace defined by the

Let µ = 1 . Equation (4) can be rewritten as: complement of the space defined by all previously chosen

1 + βk gradients. The new direction is orthogonal to all previous

search directions. The method is simple. No parameter is 3. Multiple Linear Regression

involved. It requires little storage space and expected to be

efficient. Weight Initialization

Backpropagation is a hill-climbing technique. It runs the

The summary of conjugate gradient algorithm is describe risk of being trapped in local optimum. The starting point

below: of the connection weights becomes an important issue to

1. Set k = 1. Initialize w1. reduce the possibility of being trapped in local optimum.

2. Compute g 1 = ∇f(w1). Random weight initialization does not guarantee to generate

3. Set d 1 = -g 1. a good starting point. It can be enhanced by multiple linear

4. Compute α k by line search, regression. In this method, weights between input layer

where α k = arg min α [ f ( w k + α k d k )] . and hidden layer are still initialized randomly but weights

between hidden layer and output layer is obtained by

5. Update weight vector by wk+1 = wk+ α k d k.

multiple linear regression.

6. If network error is less than a pre-set minimum value

or the maximum number of iterations has been

reached, stop; else go to step 7. The weight wij between the input node i and the hidden

7. If k+1 > n, then w1 = wk+1, k = 1 and go to step 2; node j is initialized by uniform randomization. Once input

Else a) set k= k+1 xis of sample s has been fed into the input node and wij ’s

b) compute g k+1 = ∇f(wk+1).

c) compute âk . have been assigned values, output value R sj of the hidden

d) compute new direction: d k+1=-g k+1+βk d k. node j can be calculated as

e) go to step 4 R sj = f (∑ wij xis ) , (15)

i

To compute gradient in step 2 and 7b, the objective where f is a transfer function. The output value of the

function is first defined. The aim is to minimize the output node can be calculated as

network error that is dependent of the independent ∑

y s = f ( v j R sj )

j

(16)

connection weights. The objective function is defined by

the error function: where v j is the weight between the hidden layer and the

∑∑( t

1 2

f (w ) = nj − y nj ( w )) (11) output layer.

2N n j

where N is the number of patterns in the training set; Assume sigmoid function f ( x ) = 1 is used as the

w is one-dimensional weight vector in which 1 + e −x

weights are ordered by layer and then by neuron; transfer function of the network. By Taylor’s expansion,

1 x

t nj and ynj ( w ) are the actual and desired f (x ) ≅ + (17)

2 4

outputs of the j-th output neuron for n-th pattern,

respectively. Applying the linear approximation in (17) to (16), we have

the following approximated linear relationship between the

With the arguments in [34], the gradient is output y and vj’s:

1

g( w ) = ∑δ y ( w )

N n nj ni

(12) 1 1 m

y s = + ( ∑ v j R sj ) (18)

2 4 j

For output nodes,

or 4 y s − 2 = v1 R1s + v2 R2s + ... + vm Rms

δ nj = −( t nj − y nj ( w ))s 'j ( netnj ) (13)

s = 1, 2,..N (19)

where s'j ( net nj ) is the derivative of the activation where m is the number of hidden nodes;

N is the total number of training samples.

function of the input of the j-th neuron net nj .

For the hidden node, The set of equations in (19) is a typical multiple linear

δ nj = s 'j ( net nj )∑ δ nk w jk (14)

regression model. R sj ’s are considered as the regressors.

k

where w jk is the weight from j-th to the k-th neuron. v j ’s can be estimated by standard regression method.

completed and the training starts.

4. Experiment

Prediction of price change allows a larger error tolerance

The daily trading data of eleven listing companies in 1994- than prediction of exact price value, resulting in a significant

1996 was collected from Shanghai Stock Exchange for improvement in the forecasting ability[8,10]. In order to

technical analysis of stock price. The first 500 entries were smooth out the noise and the random component of data,

used as training data. The rest 150 were testing data. The exponential moving average of the closing price change at

raw data is preprocessed into various technical indicators to day t (∆EMA(t)) was selected as the output node of the

gain insight into the direction that the stock market may be network. ∆EMA(t) can then be transformed to stock

going. Ten technical indicators was selected as inputs of closing price Pt by

the neural network: the lagging input of past 5 days’ change 1

in exponential moving average (∆1EMA(t-1), ∆2EMA(t-1), Pt = Pt−1 + [ ∆ EMA( t ) − ∆ EMA( t − 1 )] + ∆EMA( t − 1 )

L

∆3EMA(t-1), ∆4EMA(t-1), ∆5EMA(t-1)), relative

strength index on day t-1 (RSI(t-1)), moving average A three-layer network architecture was used. The required

convergence-divergence on day t-1 (MACD(t-1)), number of hidden nodes is estimated by

MACD Signal Line on day t-1 (MACD Signal No. of hidden nodes = (M + N) / 2

Line (t-1)), stochastic %K on day t-1 (%K(t-1)) and where M and N is the number of input nodes and output

stochastic %D on day t-1 (%D(t-1)). nodes respectively. In our network, there were ten input

nodes and one output node. Hence, five hidden nodes were

EMA is a trend-following tool that gives an average value used.

of data with greater weight to the latest data. Difference of

EMA can be considered as momentum. RSI is an oscillator The following scenarios have been examined:

which measures the strength of up versus down over a a. Conjugate gradient with random initialization (CG/RI)

certain time interval (nine days were selected in our b. Conjugate gradient with multiple linear regression

experiment). High value of RSI indicates a strong market initialization (CG/MLRI)

and low value indicates weak markets. MACD, a trend- c. Steepest descent with random initialization (SD/RI)

following momentum indicator, is the difference between d. Steepest descent with multiple linear regression

two moving average of price. In our experiment, 12-day initialization (SD/MLRI)

EMA and 26-day EMA were used. MACD signal line In steepest descent algorithm, the learning rate and

smoothes MACD. 9-day EMA of MACD was selected momentum rate was set to 0.1 and 0.5 respectively[27]. In

for the calculation of MACD signal line. Stochastic is an conjugate gradient, Golden Search was used to perform the

oscillator that tracks the relationship of each closing price exact the line search of α . According to Bazarra’s

to the recent high-low range. It has two lines: %K and %D. analysis[3], Polak and Ribiere’s form was selected for the

%K is the “raw” Stochastic. In our experiment, the calculation of β. For all scenarios mentioned above, the

Stochastic’s time window was set to five for calculation of training is terminated when mean square error (MSE) is

%K. %D smoothes %K – over a 3-day period in our smaller than 0.5%.

experiment.

All the eleven company data were used for each of the

Neural network cannot handle wide range of values. In above scenario. Each company data set ran 10 times.

order to avoid difficulty in getting network outputs very Figure 1 shows a sample result from testing phase.

close to the two endpoints, the indicators were normalized

to the range [0.05, 0.95], instead of [0,1], before being input

to the network.

Fig 1a: Predicted ∆EMA(t) vs actual ∆EMA(t) Fig 1b: Predicted stock price vs actual stock price

Figure 1: A sample result from neural network

In figure 1a, although predicted ∆EMA(t) and actual 5. Conclusion & Discussion

∆EMA(t) have a relative great deviation in some regions,

the network can still model the actual EMA reasonably The experimental results show that it is possible to model

well. On the other hand, after the transformation of stock price based on historical trading data by using a three-

∆EMA(t) to exact price value, the deviation between actual layer neural network. In general, both steepest descent

price and predicted price is small. Two curves in figure 1b network and conjugate gradient network produce the same

nearly coincide. This reflects the selection of the network level of error and reach the same level of direction

forecaster was appropriate. prediction accuracy.

The performance of scenarios mentioned above is evaluated Conjugate gradient approach has advantages of steepest

by average number of iterations required for training, descent approach. It does not require empirical

average MSE in testing phase and the percentage of correct determination of network. As opposed to zigzag motion in

direction prediction in testing phase. The results are steepest descent approach, its orthogonal search prevents a

summarized in Table 1. good point being spoiled. Theoretically, the convergence of

second-order conjugate gradient method is faster than first

Average % of correct order steepest descent approach. This is verified in our

Average experiment.

number of direction

MSE

iterations prediction

In regard to initial starting point, the experimental results

CG / RI 56.636 0.001753 73.055

show the good starting point generated by multiple linear

CG / MLRI 30.273 0.001768 73.545

regression weight initialization is spoiled by subsequent

SD / RI 497.818 0.001797 72.564

direction in steepest descent network. On the contrary,

SD / MLRI 729.367 0.002580 69.303 regression initialization provides a good starting point,

Table 1: Performance evaluation for four scenarios improving the convergence of conjugate gradient learning.

All scenarios, except for steepest descent with MLR To sum up, the efficiency of backpropagation can be

initialization, achieve similar average MSE and percentage improved by conjugate gradient learning with multiple

of correct direction prediction. All scenarios perform linear regression weight initialization.

satisfactory. The mean square error produced is on average

below 0.258% and more than 69% correct direction It is believed that the computation time of conjugate

prediction is reached. gradient can be reduced by Shanno’s approach[7]. The

initialization scheme may be improved by estimating

Conjugate gradient learning on average requires significant weights between input nodes and hidden nodes, instead of

less number of iterations than steepest descent learning. random initialization. Enrichment of more relevant inputs

Due to complexity of line search, conjugate gradient such as fundamental data and data from derivative markets

requires a longer computation time than steepest gradient may improve the predictability of the network. Finally,

per iteration. However, overall convergence of conjugate more sophisticated network architectures can be attempted

gradient neural network is still faster than steepest descent for price prediction.

network.

less number of iterations required for training than random

initialization, achieving similar MSE and direction

prediction accuracy with random initialization. The

positive result shows that regression provides a better

starting point for the local quadratic approximation of the

nonlinear network function performed by conjugate

gradient.

initialization does not improve performance. It requires

more number of iterations for training, produces a larger

MSE and fewer correct direction predictions than random

initialization. The phenomenon is opposite to the case in

conjugate gradient network. It is attributed to the

characteristics of the gradient descent algorithm that

modifies direction to negative gradient of error surface,

resulting in spoils of good starting point generated by MLP

by subsequent directions.

Proc. of World Congress on Neural Networks,

References Washington D.C., July 1995.

18. Leorey Marquez, Tim Hill, Reginald Worthley,

1. Arun Bansal, Robert J. Kauffman, Rob R. Weitz. William Remus. Neural Network Models as an

Comparing the Modeling Performance of Regression Alternate to Regression, Proc. of IEEE 24th Annual

and Neural Networks as Data Quality Varies: A Hawaii Int’l Conference on System Sciences,

Business Value Approach, Journal of Management pp.129-135, Vol VI, 1991.

Information Systems, Vol 10. pp.11-32, 1993. 19. Michael McInerney, Atam P. Dhawan. Use of

2. Baum E.B., Haussler D. What size net gives valid Genetic Algorithms with Back-Propagation in training

generalization?, Neural Computation 1, pp.151- of Feed-Forward Neural Networks, Proc. of IEEE

160, 1988. Int’l Joint Conference on Neural Networks, Vol.

3. Bazarra M.S. Nonlinear Programming, Theory and 1, pp. 203-208, 1993.

Algorithms, Ch 8, Wiley, 1993. 20. Mikko Lehtokangas. Initializing Weights of a

4. Burton G. Malkeil. A Random Walk Down Wall Multilayer Perceptron Network by Using the

Street, Ch 5-8, W.W. Norton & Company, 1996. Orthogonal Least Squares Algorithm, Neural

5. Chen S., Cowan C.F.N., Grant P.M. Orthogonal Computation, 1995.

Least Squares LearningAlgorithm for Radial Basis 21. Referenes A-P., Zapranis A., Francis G. Stock

Function Networks, IEEE Trans. Neural Network Performance Modeling Using Neural Networks: A

2(2), pp. 302-309, 1991. Comparative Study with Regression Models, Neural

6. Chris Bishop. Exact Calculation of the Hessian Networks, Vol. 7, no. 2, pp.375-388, 1994.

Matrix for the Multilayer Perceptron, Neural 22. Robert R. Trippi, Jae L. Lee. Artificial Intelligence in

Computation 4, pp 494-501, 1992. Finance & Investing, Ch 10, IRWIN, 1996.

7. David F. Shanno. Conjugate Gradient Methods with 23. Robert R. Trippi. Neural Networks in Finance and

Inexact Searches, Mathematics of Operations Investing, Probus Publishing Company, 1993.

Research, Vol. 3, no. 3, 1978. 24. Roberto Battiti. First- and Second-Order Methods for

8. David M. Skapura. Building Neural Networks, Ch 5, Learning: Between Steepest Descent and Newton’s

ACM Press, Addison-Wesley. Method, Neural Computation 4, pp. 141-166,

9. Davide Chinetti, Francesco Gardin, Cecilia Rossignoli. 1992.

A Neural Network Model for Stock Market 25. Roger Stein. Preprocessing Data for Neural Networks,

Prediction, Proc. Int’l Conference on Artificial AI Expert, March, 1993.

Intelligence Applications on Wall Street, 1993. 26. Roger Stein. Selecting Data for Neural Networks, AI

10. Edward Gately. Neural Networks for Financial Expert, March, 1993.

Forecasting, Wiley, 1996. 27. Simon Haykin. Neural Networks, A Comprehensive

11. Eitan Michael Azoff. Reducing Error in Neural Foundation, Ch 6, Macmillan, 1994.

Network Time Series Forecasting, Neural 28. Steven B. Achilis. Technical Analysis From A to Z,

Computing & Applications, pp.240-247, 1993. Probus Publishing, 1995.

12. Emad W. Saad, Danil V.P., Donald C.W. Advanced 29. Timothy Masters. Practical Neural Network Receipes

Neural Network Training Methods for Low False in C++, Ch 4-10, Academic Press.

Alarm Stock Trend Prediction, Proc. of World 30. W. Press et al. Numerical Recipes in C, The Art of

Congress on Neural Networks, Washing D.C., Scientific Computing, 2nd Edition, Cambridge

June 1996. University Press, 1994.

13. G. Thimm and E. Fiesler. High Order and Multilayer 31. Wasserman P.D. Advanced Methods in Neural

Perceptron Initialization, IEEE Transactions on Computing, Van Nostrand Reinhold.

Neural Networks, Vol. 8, No.2, pp.349-359, 1997. 32. White H. Economic Prediction Using Neural

14. Gia-Shuh Jang, Feipei Lai. Intelligent Stock Market Networks: The Case of IBM Daily Stock Returns,

Prediction System Using Dual Adaptive-Structure Proc. of IEEE Int’l Conference on Neural

Neural Networks, Proc. Int’l Conference on Networks, 1988.

Artificial Intelligence Applications on Wall 33. Wood D., Dasgupta B. Classifying Trend Movements

Street, 1993. in the MSCI U.S.A. Capital Market Index - A

15. Guido J. Deboeck. Trading on the Edge - Neural, comparison of Regression, ARIMA and Neural

Genetic, and Fuzzy Systems for Chaotic Financial Network Methods, Computer and Operations

Markets, Wiley, 1994. Research, Vol. 23, no. 6, pp.611, 1996.

16. Hecht-Nielsen R. Kolmogorov’s mapping neural 34. Rumelhart, D.E., Hinton G.E., Williams R.J. Learning

network existence theorem, Proc. 1st IEEE Int’l Internal Representations by Error Propagation,

Joint Conf. Neural Network, June 1987. Parallel Distributed Processing: Explorations

17. Hong Tan, Danil V. P., Donald C. W. Probabilistic in the Microstructure of Cognition, Vol. 1:

and Time-Delay Neural Network Techniques for Foundations. MIT Press, 1986.

Conservative Short-Term Stock Trend Prediction,

## Molto più che documenti.

Scopri tutto ciò che Scribd ha da offrire, inclusi libri e audiolibri dei maggiori editori.

Annulla in qualsiasi momento.