
Data Mining for Prediction.

Financial Series Case


Stefan Zemke

Doctoral Thesis
The Royal Institute of Technology
Department of Computer and Systems Sciences
December 2003

Doctoral Thesis
The Royal Institute of Technology, Sweden
ISBN 91-7283-613-X

Copyright c by Stefan Zemke


Contact: steze@kth.se
Printed by Akademitryck AB, Edsbruk, 2003

Abstract
Hard problems force innovative approaches and attention to detail, their exploration
often contributing beyond the area initially attempted. This thesis investigates
the data mining process resulting in a predictor for numerical series. The series
experimented with come from financial data – usually hard to forecast.
One approach to prediction is to spot patterns in the past, when we already know
what followed them, and to test on more recent data. If a pattern is followed by
the same outcome frequently enough, we can gain confidence that it is a genuine
relationship.
Because this approach does not assume any special knowledge or form of the regular-
ities, the method is quite general – applicable to other time series, not just financial.
However, the generality puts strong demands on the pattern detection, which must
notice regularities in any of the many possible forms.
The thesis’ quest for automated pattern-spotting involves numerous data mining
and optimization techniques: neural networks, decision trees, nearest neighbors,
regression, genetic algorithms and others. A comparison of their performance on
stock exchange index data is one of the contributions.
As no single technique performed sufficiently well, a number of predictors have been
put together, forming a voting ensemble. The vote is diversified not only by different
training data – as usually done – but also by the learning method and its parameters.
An approach to speeding up predictor fine-tuning is also proposed.
The algorithm development goes still further: a prediction can only be as good as
the training data, hence the need for good data preprocessing. In particular, new
multivariate discretization and attribute selection algorithms are presented.
The thesis also includes overviews of prediction pitfalls and possible solutions, as
well as of ensemble-building for series data with financial characteristics, such as
noise and many attributes.
The Ph.D. thesis consists of an extended background on financial prediction, 7
papers, and 2 appendices.

Acknowledgements
I would like to take the opportunity to express my gratitude to the many
people who helped me with the developments leading to the thesis. In
particular, I would like to thank Ryszard Kubiak for his tutoring and
support reaching back to my high-school days and the beginnings of my university
education, and also for his help in improving the thesis. I enjoyed and appreciated
the fruitful exchange of ideas and cooperation with Michal Rams, to whom
I am also grateful for comments on a part of the thesis. I am also grateful to
Miroslawa Kajko-Mattsson for words of encouragement in the final months
of the Ph.D. efforts and for her style-improving suggestions.
In the early days of my research Henrik Boström stimulated my interest
in machine learning and Pierre Wijkman in evolutionary computation. I
am thankful for that and for the many discussions I had with both of
them. And finally, I would like to thank Carl Gustaf Jansson for being
such a terrific supervisor.
I am indebted to Jozef Swiatycki for all forms of support during the
study years. Also, I would like to express my gratitude to the computer
support people, in particular, Ulf Edvardsson, Niklas Brunbäck and Jukka
Luukkonen at DMC, and to other staff at DSV, in particular to Birgitta
Olsson for her patience with the final formatting efforts.
I dedicate the thesis to my parents who always believed in me.

Gdynia, October 27, 2003.


Stefan Zemke

Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Questions in Financial Prediction . . . . . . . . . . . . . . 2
1.2.1 Questions Addressed by the Thesis . . . . . . . . . 4
1.3 Method of the Thesis Study . . . . . . . . . . . . . . . . . 4
1.3.1 Limitations of the Research . . . . . . . . . . . . . 4
1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . 6

2 Extended Background 9
2.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Time Series Glossary . . . . . . . . . . . . . . . . . 10
2.1.2 Financial Time Series Properties . . . . . . . . . . 13
2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Data Integration . . . . . . . . . . . . . . . . . . . 15
2.2.3 Data Transformation . . . . . . . . . . . . . . . . . 16
2.2.4 Data Reduction . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Data Discretization . . . . . . . . . . . . . . . . . . 17
2.2.6 Data Quality Assessment . . . . . . . . . . . . . . . 18
2.3 Basic Time Series Models . . . . . . . . . . . . . . . . . . 18
2.3.1 Linear Models . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Limits of Linear Models . . . . . . . . . . . . . . . 19
2.3.3 Nonlinear Methods . . . . . . . . . . . . . . . . . . 20
2.3.4 General Learning Issues . . . . . . . . . . . . . . . 21
2.4 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . 23
2.5 System Evaluation . . . . . . . . . . . . . . . . . . . . . . 24

2.5.1 Evaluation Data . . . . . . . . . . . . . . . . . . . 24
2.5.2 Evaluation Measures . . . . . . . . . . . . . . . . . 25
2.5.3 Evaluation Procedure . . . . . . . . . . . . . . . . . 25
2.5.4 Non/Parametric Tests . . . . . . . . . . . . . . . . 26

3 Development of the Thesis 27


3.1 First half – Exploration . . . . . . . . . . . . . . . . . . . 27
3.2 Second half – Synthesis . . . . . . . . . . . . . . . . . . . . 29

4 Contributions of Thesis Papers 33


4.1 Nonlinear Index Prediction . . . . . . . . . . . . . . . . . . 33
4.2 ILP via GA for Time Series Prediction . . . . . . . . . . . 34
4.3 Bagging Imperfect Predictors . . . . . . . . . . . . . . . . 35
4.4 Rapid Fine Tuning of Computationally Intensive Classifiers 36
4.5 On Developing Financial Prediction System: Pitfalls and
Possibilities . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 Ensembles in Practice: Prediction, Estimation, Multi-Feature
and Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Multivariate Feature Coupling and Discretization . . . . . 38

5 Bibliographical Notes 39

A Feasibility Study on Short-Term Stock Prediction 141

B Amalgamation of Genetic Selection and Boosting
Poster GECCO-99, US, 1999 147

List of Thesis Papers
Stefan Zemke. 45
Nonlinear Index Prediction.
Physica A 269 (1999)

Stefan Zemke. 57
ILP and GA for Time Series Prediction.
Dept. of Computer and Systems Sciences Report 99-006

Stefan Zemke. 71
Bagging Imperfect Predictors.
ANNIE’99, St. Louis, MO, US, 1999

Stefan Zemke. 81
Rapid Fine-Tuning of Computationally Intensive Classifiers.
MICAI’2000, Mexico, 2000. LNAI 1793

Stefan Zemke. 95
On Developing Financial Prediction System: Pitfalls and Possibilities.
DMLL Workshop at ICML-2002, Australia, 2002

Stefan Zemke. 113
Ensembles in Practice: Prediction, Estimation, Multi-Feature and
Noisy Data.
HIS-2002, Chile, 2002

Stefan Zemke and Michal Rams. 131
Multivariate Feature Coupling and Discretization.
FEA-2003, Cary, US, 2003

Chapter 1

Introduction

Predictions are hard, especially about the future. Niels Bohr and Yogi Berra

1.1 Background
As computers, sensors and information distribution channels proliferate,
there is an increasing flood of data. However, the data is of little use, unless
it is analyzed and exploited. There is indeed little use in just gathering the
tell-tale signals of a volcano eruption, heart attack, or a stock exchange
crash, unless they are recognized and acted upon in advance. This is where
prediction steps in.
To be effective, a prediction system requires good input data, good
pattern-spotting ability and good evaluation of the discovered patterns, among
other things. The input data needs to be preprocessed, perhaps enhanced by
domain expert knowledge. The prediction algorithms can be provided by methods
from statistics, machine learning, analysis of dynamical systems, together
known as data mining – concerned with extracting useful information from
raw data. And predictions need to be carefully evaluated to see if they fulfill
criteria of significance, novelty, usefulness etc. In other words, prediction is
not an ad hoc procedure. It is a process involving a number of premeditated
steps and domains, all of which influence the quality of the outcome.
The process is far from automatic. A particular prediction task requires
experimentation to assess what works best. Part of the assessment comes
from intelligent but to some extent artful exploratory data analysis. If the
task is poorly addressed by existing methods, the exploration might lead
to a new algorithm development.
The thesis research follows that progression, starting with the question of
days-ahead predictability of stock exchange index data. The thesis work
and contributions consist of three developments. First, exploration of
simple methods of prediction, exemplified by the initial thesis papers.
Second, higher-level analysis of the development process leading to a
successful predictor. The process supplements the simple methods with
domain specifics and advanced approaches such as elaborate preprocessing,
ensembles and chaos theory. Third, the thesis presents new algorithmic
solutions, such as bagging a Genetic Algorithm population, parallel
experiments for rapid fine-tuning, and multivariate discretization.
Time series are common. Road traffic in cars per minute, heart beats
per minute, number of applications to a school every year and a whole
range of scientific and industrial measurements, all represent time series
which can be analyzed and perhaps predicted. Many of the prediction
tasks face similar challenges, such as how to decide which input series will
enhance prediction, how to preprocess them, or how to efficiently tune various
parameters. Although the thesis refers to financial data, most of the
work is applicable to other domains – if not directly, then indirectly, by
pointing out possibilities and pitfalls in predictor development.

1.2 Questions in Financial Prediction


Some questions of scientific and practical interest concerning financial pre-
diction follow.

Prediction possibility. Is statistically significant prediction of financial


markets data possible? Is profitable prediction of such data possible?
The latter involves an answer to the former question, adjusted for
constraints imposed by real markets, such as commissions, liquidity limits
and the influence of one's own trades.

Methods. If prediction is possible, what methods are best at performing


it? What methods are best suited for what data characteristics – and
can this be said in advance?

Meta-methods. What are the ways to improve the methods? Can meta-
heuristics successful in other domains, such as ensembles or pruning,
improve financial prediction?

Data. Can the amount and type of data needed for prediction be character-
ized?

Data preprocessing. Can data transformations that facilitate prediction


be identified? In particular, what transformation formulae enhance
input data? Are the commonly used financial indicator formulae of
any use?

Evaluation. What are the features of a sound evaluation procedure, re-


specting the properties of financial data and the expectations of fi-
nancial prediction? How to handle rare but important data events,
such as crashes? What are the common evaluation pitfalls?

Predictor development. Are there any common features of successful


prediction systems? If so, what are they, and how could they be
advanced? Can common reasons of failure of financial prediction be
identified? Are they intrinsic and irreparable, or is there a way to
amend them?

Transfer to other domains. Can the methods developed for financial


prediction benefit other domains?

Predictability estimation. Can financial data be reasonably quickly es-


timated to be predictable or not, without the investment of building a
custom system? What are the methods, what do they actually say,
what are their limits?

Consequences of predictability. What are the theoretical and practical


consequences of demonstrated predictability of financial data, or the
impossibility of it? How would a successful prediction method translate
into economic models? What could be the social consequences of
financial prediction?

1.2.1 Questions Addressed by the Thesis
The thesis addresses many of the questions, in particular the prediction
possibility, methods, meta-methods, data preprocessing, and the predictor
development process. More details on the contributions are provided
in the chapter Contributions of Thesis Papers.

1.3 Method of the Thesis Study


The investigation behind the thesis has been mostly goal-driven. As prob-
lems appeared on the way to realizing financial prediction, they were con-
fronted by various means, including the following:

• Investigation of existing machine learning and data mining methods


and meta-heuristics.
• Reading of financial literature for properties and hints of regularities
in financial data which could be exploited.
• Analysis of existing financial prediction systems, for commonly work-
ing approaches.
• Implementation of and experimentation with my own machine learning meth-
ods and hybrid approaches involving a number of existing methods.
• Some theoretical considerations on mechanisms behind the generation
of financial data, e.g. deterministic chaotic systems, and on general
predictability demands and limits.
• Practical insights into the realm of trading, some contacts with pro-
fessional investors, courses on finance and economics.

1.3.1 Limitations of the Research


As with any completed work, this thesis research has its limitations. One
criticism of the thesis could be that the contributions do not directly tackle
the prominent question: whether financial prediction can be profitable. A
Ph.D. student concentrating efforts on this would make a heavy bet: either
s/he would end up with a Ph.D. and as a millionaire, or with nothing, should
the prediction attempts fail. This is too high a risk to take. This is why in my
research, after the initial head-on attempts, I took a more balanced path,
investigating prediction from the side: methods, data preprocessing etc.,
instead of prediction results per se.
Another criticism could address the omission or shallowness of experi-
ments involving some of the relevant methods. For instance, a researcher
devoted to Inductive Logic Programming could bring forward a new sys-
tem good at dealing with numerical/noisy series, or the econometrician
could point out the omission of linear methods. The reply could be: there
are too many possibilities for one person to explore, so it was necessary
to skip some. Even then, the interdisciplinary research demanded much
work, among other things, for:

• Studying ’how to’ in 3 areas: machine learning/data mining, finance


and mathematics; 2 years of graduate courses taken.
• Designing systems exploiting and efficiently implementing the result-
ing ideas.
• Collecting data for prospective experiments – initially quite a time-
consuming task of low visibility.
• Programming which, for new ideas not guaranteed to work, takes time
running into hundreds of hours.
• Evaluating the programs, adjusting parameters, evaluating again –
the loop possibly taking hundreds of hours. The truth here is that
most new approaches do not work, so the design, implementation and
initial evaluation efforts are not publishable.
• Writing papers, extended background study, for the successful at-
tempts.

Another limitation of the research concerns evaluation methods. The
Evaluation section stresses how careful the process should be, preferably
involving a trading model and commissions, whereas the evaluations in the
thesis papers do not have that. The reasons are several. First, as already
pointed out, the objective was not to prove there is a profit possibility in the
predictions. This would involve not only commissions, but also a trading
model. A simple model would not fit the bill, so there would be a need
to investigate how predictions, together with general knowledge, trader’s
experience etc. merge into successful trading – a subject for another Ph.D.
Second, after commissions, the above-random gains would be much thinner,
demanding better predictions, more data and more careful statistics to
spot the effect – perhaps too much for a pilot study.
The lack of experiments backing some of the thesis ideas is another
shortcoming. The research attempts to be practical, i.e. mostly experi-
mental, but there are tradeoffs. As ideas become more advanced, the path
from an idea to a reported evaluation becomes more involved. For instance,
to predict, one needs data preprocessing, often including discretization. So,
even having implemented an experimental predictor, it could not be
evaluated until the discretization was completed, which pressed for describing
just the prediction part – without a full evaluation. Also, computational demands
grow – a notebook computer is no longer enough.

1.4 Outline of the Thesis

The rest of the initial chapters – preceding the thesis papers – is meant to
provide the reader with the papers’ background, often only skimmed in the
papers for page-limit reasons. Thus, the Extended Background chapter goes
through the subsequent areas and issues involved in time series prediction
in the financial domain, one of the objectives being to introduce the vo-
cabulary. The intention is also to present the breadth of the prediction area
and of my study of it, which perhaps will allow one to appreciate the effort
and knowledge behind the developments in this domain.
Then comes the Development of the Thesis chapter which, more or
less chronologically, presents the research advancement. In this tale one
can also see the many attempts that proved to be dead-ends. As such, the
positive published results can be seen as the essence of a much bigger body of work.
The next chapter Contributions of Thesis Papers summarizes all the
thesis papers and their contributions. The summaries assume familiarity
with the vocabulary of the Extended Background chapter.
The rest of the thesis consists of the 7 thesis papers, formatted for a common
appearance but otherwise quoted the way they were published. The thesis
ends with a common bibliography, resolving references for the introduction
chapters and all the included papers.

Chapter 2

Extended Background

This chapter is organized as follows. Section 1 presents time series prelim-


inaries and characteristics of financial series, Section 2 summarizes data
preprocessing, Section 3 lists basic learning schemes, Section 4 ensemble
methods, and Section 5 discusses predictor evaluation.

2.1 Time Series

This section introduces properties of time series appearing in the context


of developing a prediction system in general, and in the thesis papers in
particular. The presentation is divided into generic series properties and
characteristics of financial time series. Most of the generic time series
definitions follow (Tsay, 2002).
A time series, series for short, is a sequence of numerical values indexed
by increasing time units, e.g. the price of a commodity, such as oranges in
a particular shop, indexed by the time when the price is checked. In the
sequel, the return values of a series s_t refer to r_t = log(s_{t+T}) − log(s_t),
with the return period T assumed to be 1 if not specified. Remarks about
series distribution refer to the distribution of the return series r_t. A
predictor forecasts a future value s_{t+T}, having access only to past values
s_i, i ≤ t, of this and usually other series. For the prediction to be of any
value it has to be better than random, which can be measured by various
metrics, such as accuracy, discussed in Section 2.5.2.
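As a concrete illustration of the return series defined above, a minimal Python
sketch (numpy assumed; the price values are made up for the example) computing
r_t = log(s_{t+T}) − log(s_t):

    import numpy as np

    def log_returns(s, T=1):
        """Return series r_t = log(s_{t+T}) - log(s_t) for a price series s."""
        s = np.asarray(s, dtype=float)
        return np.log(s[T:]) - np.log(s[:-T])

    prices = [100.0, 101.5, 100.8, 102.3]   # hypothetical daily closes
    print(log_returns(prices))              # the three 1-day log returns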

2.1.1 Time Series Glossary

Stationarity of a series indicates that its mean value and arbitrary au-
tocorrelations are time invariant. Finance literature commonly assumes
that asset returns are weakly stationary. This can be checked, provided a
sufficient number of values, e.g., one can divide data into subsamples and
check the consistency of mean and autocorrelations (Tsay, 2002). Determin-
ing whether a series has moved into a nonstationary regime is not trivial, let
alone deciding which of the series properties still hold. Therefore, most
prediction systems, which are based on past data, implicitly assume that
the predicted series is to a great extent stationary, at least with respect
to the invariants that the system may spot, which most likely go beyond
mean and autocorrelations.

Seasonality means periodic fluctuations. For example, retail sales peak


around Christmas season and decline after the holidays. So the time series
of retail sales will show increasing values from September through Decem-
ber and declining in January and February. Seasonality is common in
economic time series and less in engineering and scientific data. It can be
identified, e.g. by correlation or Fourier analysis, and removed, if desired.

Linearity and Nonlinearity are wide notions depending on the context in


which they appear. Usually, linearity signifies that an entity can be decom-
posed into sub-entities, properties of which, such as influence on the whole,
carry on to the whole entity in an easy to analyze additive way. Nonlin-
ear systems do not allow such a simple decomposition analysis since the
interactions do not need to be additive, often leading to complex emergent
phenomena not seen in the individual sub-entities (Bak, 1997).
In the much narrower context of prediction methods, nonlinear often
refers to the form of the dependence between the data and the predicted
variable. In nonlinear systems this dependence might be nonlinear, hence
linear approaches, such as correlation analysis and linear regression, are not
sufficient. One must use less orthodox tools to find and exploit nonlinear
dependencies, e.g. neural networks.

Deterministic and Nondeterministic Chaos. For a reader new to chaos, an
illustration of the theory applied to finances can be found in (Deboeck,
1994). A system is chaotic if its trajectory through state space is sensi-
tively dependent on the initial conditions, that is, if small differences are
magnified exponentially with time. This means that initially unobserv-
able fluctuations will eventually dominate the outcome. So, though the
process may be deterministic, it is unpredictable in the long run (Kantz
& Schreiber, 1999a; Gershenfeld & Weigend, 1993). Deterministic means
that given the same circumstances the transition from a state is always the
same.
Whether financial markets exhibit this kind of behavior is hotly
debated and there are numerous publications supporting each view. The
deterministic chaos notion involves a number of issues. First, whether
markets react deterministically to events influencing prices, versus a more
probabilistic reaction. Second, whether magnified small changes indeed
eventually take over, which does not need to be the case, e.g. self-correction
could step in if a value is too much off the mark – overpriced or underpriced.
Financial time series have been analyzed in those respects; however, the
mathematical theory behind chaos often deals poorly with the noise prevalent
in financial data, making the results dubious.
Even a chaotic system can be predicted up to the point where magnified
disturbances dominate. The time when this happens depends inversely
on the largest Lyapunov exponent, a measure of divergence. It is an
average statistic – at any time the process is likely to have a different
divergence/predictability, especially if nonstationary. Beyond that point,
prediction is possible only in statistical terms – which outcomes are more
likely, no matter what we start with. Weather – a chaotic system – is a good
illustration: despite global efforts in data collection, forecasts are precise up
to a few days and in the long run offer only statistical views such as the
average monthly temperature. However, chaos is not to be blamed for all
poor forecasts – it recently came to attention that the errors in weather
forecasts initially grow not exponentially but linearly, which points more to
imprecise weather models than to chaos at work.
Another exciting aspect of a chaotic system is its control. If at times the
system is so sensitive to disturbances, a small influence at that time can
profoundly alter the trajectory, provided that the system remains determin-
istic for a while thereafter. So, potentially, a government or a speculator
who knew the rules could control the markets without a vast investment.
Modern pacemakers for the human heart – another chaotic system – work
by this principle, providing a small electrical impulse only when needed,
without constantly overriding the heart’s electrical activity.
Still, it is unclear if the markets are stochastic or deterministic, let alone
chaotic. A mixed view is also possible: markets are deterministic only in
part – so even short-term prediction cannot be fully accurate – or there
are pockets of predictability – markets, or market conditions, in which
the moves are deterministic, while otherwise being stochastic.

Delay vector embedding converts a scalar series s_t into a vector series
v_t = (s_t, s_{t-delay}, ..., s_{t-(D-1)*delay}). This is a standard procedure in (non-
linear) time series analysis, and a way to present a series to a predictor
demanding an input of constant dimension D. More on how to fit the
delay embedding parameters can be found in (Kantz & Schreiber, 1999a).
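A minimal sketch of the delay vector embedding just described (Python with
numpy; the function name and defaults are mine, for illustration only):

    import numpy as np

    def delay_embed(s, D, delay=1):
        """Convert a scalar series s into delay vectors
        v_t = (s_t, s_{t-delay}, ..., s_{t-(D-1)*delay})."""
        s = np.asarray(s, dtype=float)
        start = (D - 1) * delay
        return np.array([s[t - np.arange(D) * delay] for t in range(start, len(s))])

    print(delay_embed(np.arange(10), D=3, delay=2))   # e.g. row [4., 2., 0.] for t=4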

Takens Theorem (Takens, 1981) states that we can reconstruct the dy-
namics of a deterministic system – possibly multidimensional, in which each
state is a vector – by long-enough observation of just one noise-free vari-
able of the system. Thus, given a series, we can answer questions about
the dynamics of the system that generated it by examining the dynamics
in a space defined by delayed values of just that series. From this, we can
compute features such as the number of degrees of freedom and linking of
trajectories and make predictions by interpolating in the delay embedding
space. However, Takens theorem holds for mathematical measurement
functions, not the ones seen in the laboratory or market: asset price is
not a noise-free function. Nevertheless, the theorem supports experiments
with a delay embedding, which might yield useful models. In fact, they
often do (Deboeck, 1994).

Prediction, modeling, characterization are three different goals of time se-
ries analysis (Gershenfeld & Weigend, 1993): ”The aim of prediction is
to accurately forecast the short-term evolution of the system; the goal of
modeling is to find description that accurately captures features of the
long-term behavior. These are not necessarily identical: finding governing
equations with proper long-term properties may not be the most reliable
way to determine parameters for short-term forecasts, and a model that
is useful for short-term forecasts may have incorrect long-term properties.
Characterization attempts with little or no a priori knowledge to deter-
mine fundamental properties, such as the number of degrees of freedom of
a system or the amount of randomness.”

2.1.2 Financial Time Series Properties


One may wonder if there are universal characteristics of the many series
coming from markets different in size, location, commodities, sophistica-
tion etc. The surprising fact is that there are (Cont, 1999). Moreover,
interacting systems in other fields, such as statistical mechanics, suggest
that the properties of financial time series loosely depend on the market
microstructure and are common to a range of interacting systems. Such
observations have stimulated new models of markets based on analogies
with particle systems and brought in new analysis techniques opening the
era of econophysics (Mantegna & Stanley, 2000).

Efficient Market Hypothesis (EMH) developed in 1965 (Fama, 1965) ini-


tially got wide acceptance in the financial community. It asserts, in weak
form, that the current price of an asset already reflects all information ob-
tainable from past prices and assumes that news is promptly incorporated
into prices. Since news is assumed unpredictable, so are prices.
However, real markets do not obey all the consequences of the hypoth-
esis, e.g., a price random walk implies a normal distribution of returns, which
is not the observed case, and there is a delay while the price stabilizes to a
new level after news. This, among other things, led to a more modern view
(Haughen, 1997): ”Overall, the
best evidence points to the following conclusion. The market isn’t efficient
with respect to any of the so-called levels of efficiency. The value invest-
ing phenomenon is inconsistent with semi-strong form efficiency, and the
January effect is inconsistent even with weak form efficiency. Overall, the
evidence indicates that a great deal of information available at all levels is,
at any given time, reflected in stock prices. The market may not be easily
beaten, but it appears to be beatable, at least if you are willing to work at
it.”

Distribution of financial series (Cont, 1999) tends to be non-normal, sharp-
peaked and heavy-tailed, these properties being more pronounced for in-
traday values. Such observations were pioneered in the 1960s (Mandelbrot,
1963), interestingly around the time the EMH was formulated.
Volatility – measured by the standard deviation – also has common char-
acteristics (Tsay, 2002). First, there exist volatility clusters, i.e. volatility
may be high for certain periods and low for others. Second, volatility evolves
over time in a continuous manner; volatility jumps are rare. Third, volatil-
ity does not diverge to infinity but varies within a fixed range, which means
that it is often stationary. Fourth, the volatility reaction to a big price increase
seems to differ from the reaction to a big price drop.
Extreme values appear more frequently in a financial series as compared
to a normally-distributed series of the same variance. This is important to
the practitioner since often the values cannot be disregarded as erroneous
outliers but must be actively anticipated, because of their magnitude which
can influence trading performance.

Scaling property of a time series indicates that the series is self-similar at


different time scales (Mantegna & Stanley, 2000). This is common in fi-
nancial time series, i.e. given a plot of returns with the axes unlabeled, it is
next to impossible to say if it represents hourly, daily or monthly changes,
since all the plots look similar, with differences appearing at minute res-
olution. Thus prediction methods developed for one resolution could, in
principle, be applied to others.

Data frequency refers to how often series values are collected: hourly,
daily, weekly etc. Usually, if a financial series provides values on daily,
or longer, basis, it is low frequency data; otherwise – when many intraday
quotes are included – it is high frequency. Tick-by-tick data includes all
individual transactions, and as such the event-driven time between data
points varies, creating a challenge even for such a simple calculation as corre-
lation. The minute market microstructure and massive data volume create
new problems and possibilities not dealt with by the thesis. The reader
interested in high frequency finance can start at (Dacorogna et al., 2001).

2.2 Data Preprocessing


Before data is scrutinized by a prediction algorithm, it must be collected,
inspected, cleaned and selected. Since even the best predictor will fail on
bad data, data quality and preparation is crucial. Also, since a predictor
can exploit only certain data features, it is important to detect which data
preprocessing/presentation works best.

2.2.1 Data Cleaning

Data cleaning fills in missing values, smoothes noisy data, handles or re-
moves outliers, resolves inconsistencies. Missing values can be handled by
a generic method (Han & Kamber, 2001). Methods include skipping the
whole instance with a missing value, filling the gap with the mean or a new
’unknown’ constant, or using inference, e.g. based on the most similar instances
or some Bayesian considerations.
Series data has another dimension – we do not want to spoil the temporal
relationship, thus data restoration is preferable to removal. The restora-
tion should also accommodate the time aspect – not use too time-distant
values. Noise is prevalent, and especially low-volume markets should be treated
with suspicion. Noise reduction usually involves some form of averaging, or
putting a range of values into one bin, i.e. discretization.
If data changes are numerous, a test of whether the predictor picks up the
inserted bias is advisable. This can be done by ’missing’ some values from a
random series – or better, permuted actual returns – and then restoring,
cleaning etc. the series as if genuine. If the predictor can subsequently predict
anything from this, after all random, series, too much structure has been
introduced (Gershenfeld & Weigend, 1993).
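To make the restoration idea concrete, a small sketch of one possible approach:
carry the last known value forward, but only over a limited gap, so that too
time-distant values are not used (the gap limit and function name are arbitrary
choices for illustration):

    import numpy as np

    def restore_missing(s, max_gap=3):
        """Fill NaN gaps with the last known value, but only up to max_gap steps back."""
        s = np.asarray(s, dtype=float).copy()
        last, age = np.nan, 0
        for i, v in enumerate(s):
            if np.isnan(v):
                age += 1
                if not np.isnan(last) and age <= max_gap:
                    s[i] = last
            else:
                last, age = v, 0
        return s

    print(restore_missing([1.0, np.nan, np.nan, 1.2, np.nan, 1.3]))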

2.2.2 Data Integration

Data integration combines data from multiple sources into a coherent store.
Time alignment can demand consideration in series from different sources,
e.g. different time zones. Series-to-instances conversion is required by most
of the learning algorithms, which expect a fixed-length vector as input. It
can be done by the delay vector embedding technique. The delay vectors
with the same time index t – coming from all input series – appended together
give an instance, data point or example, whose coordinates are referred to as
data features, attributes or variables.

2.2.3 Data Transformation

Data transformation changes the values of series to make them more suit-
able for prediction. Detrending is such a common transformation removing
the growth of a series, e.g. by working with subsequent value differentials,
or subtracting a trend (linear, quadratic etc.) interpolation. For stocks,
indexes and currencies, converting into the series of returns does the trick.
For volume, dividing it by the average of the last k quotes, e.g. a yearly
average, can scale it down.
Indicators are series derived from others, enhancing some features of
interest, such as trend reversal. Over the years traders and technical ana-
lysts trying to predict stock movements developed the formulae (Murphy,
1999), some later confirmed to carry useful information (Sullivan et al.,
1999). Indicators can also reduce noise, due to the averaging in many of the
formulae. Common indicators include: Moving Average (MA), Stochas-
tic Oscillator, Moving Average Convergence Divergence (MACD), Rate of
Change (ROC) and Relative Strength Index (RSI).
Normalization brings values into a certain range while minimally distorting
the initial data relationships, e.g. the SoftMax norm increasingly squeezes
extreme values while mapping the middle 95% of values roughly linearly.
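As an illustration of the transformations above, a sketch of a simple moving
average indicator and of a logistic squashing normalization in the SoftMax
spirit (the exact SoftMax formula may differ; the parameter lam is an assumption
of this sketch):

    import numpy as np

    def moving_average(s, k):
        """Simple k-period Moving Average (MA) indicator."""
        s = np.asarray(s, dtype=float)
        return np.convolve(s, np.ones(k) / k, mode='valid')

    def squash(x, lam=2.0):
        """Logistic squashing: roughly linear near the mean, compressing extremes."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / (lam * x.std())
        return 1.0 / (1.0 + np.exp(-z))

    prices = np.array([100.0, 102.0, 101.0, 105.0, 110.0, 90.0])
    print(moving_average(prices, 3))
    print(squash(prices))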

2.2.4 Data Reduction

Sampling – not using all the data available – might be worthwhile. In my


experiments with NYSE predictability, skipping half of training instances
with the lowest weight (i.e. weekly return) enhanced predictions, as similarly
reported in (Deboeck, 1994). The improvement could be due to skipping
noise-dominated small changes, and/or the dominant changes being ruled by a
mechanism whose learning is distracted by the numerous small changes.
Feature selection – choosing informative attributes – can make learn-
ing feasible: because of the curse of dimensionality (Mitchell, 1997), multi-
feature instances demand (exponentially more, w.r.t. the feature number)
data to train on. There are 2 approaches to the problem: in the filter approach
a purpose-made algorithm evaluates and selects features, whereas in the wrapper
approach the final learning algorithm is presented with different feature subsets,
selected on the quality of the resulting predictions.

2.2.5 Data Discretization

Discretization maps similar values into one discrete bin, with the idea that
it preserves important information, e.g. if all that matters is a real value’s
sign, it could be digitized to {0; 1}, 0 for negative, 1 otherwise. Some
prediction algorithms require discrete data, sometimes referred to as nom-
inal. Discretization can improve predictions by reducing the search space,
reducing noise, and by pointing to important data characteristics. Un-
supervised approaches work by dividing the original feature value range
into a few equal-length or equal-data-frequency intervals; supervised ones – by
maximizing a measure involving the predicted variable, e.g. entropy or the
chi-square statistics (Liu et al., 2002).
Since discretization is an information-losing transformation, it should
be approached with caution, especially as most algorithms perform uni-
variate discretization – they look at one feature at a time, disregarding
that it may have (additional) significance only in the context of other fea-
tures, which would be preserved in multivariate discretization. For example,
if the predicted class = sign(xy), only discretizing x and y in tandem can
discover their significance; taken alone, x and y can be inferred as unrelated
to the class and even disregarded! The multivariate approach is especially im-
portant in financial prediction, where no single variable can be expected
to bring significant predictability (Zemke & Rams, 2003).
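The sign(xy) example can be checked numerically; a minimal sketch (numpy
assumed) showing that x or y alone carries no information about the class,
while the pair does:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=10000)
    y = rng.normal(size=10000)
    cls = np.sign(x * y)                  # predicted class = sign(x*y), i.e. +1 or -1

    # Univariate view: the average class is ~0 whether x is positive or not,
    # so x alone looks unrelated to the class (and likewise for y).
    print(cls[x > 0].mean(), cls[x <= 0].mean())
    # Multivariate view: the (sign(x), sign(y)) pair determines the class exactly.
    print(cls[(x > 0) & (y > 0)].mean(), cls[(x > 0) & (y <= 0)].mean())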

2.2.6 Data Quality Assessment

Predictability assessment allows one to concentrate on feasible cases (Hawawini


& Keim, 1995). Some tests are simple non-parametric predictors – predic-
tion quality reflecting predictability. The tests may involve: 1) Linear
methods, e.g. to measure correlation between the predicted and feature
series. 2) Nearest Neighbor prediction method, to assess local model-free
predictability. 3) Entropy, to measure information content (Molgedey &
Ebeling, 2000). 4) Detrended Fluctuation Analysis (DFA), to reveal long
term self-similarity, even in nonstationary series (Vandewalle et al., 1997).
5) Chaos and Lyapunov exponent, to test short-term determinism. 6) Ran-
domness tests like chi-square, to assess the likelihood that the observed
sequence is random. 7) Nonstationarity tests.

2.3 Basic Time Series Models


This section presents basic prediction methods, starting with the linear
models well established in the financial literature and moving on to modern
nonlinear learning algorithms.

2.3.1 Linear Models

Most linear time series models descend from the AutoRegressive Mov-
ing Average (ARMA) and Generalized Autoregressive Conditional Het-
eroskedastic (GARCH) (Bollerslev, 1986) models, a summary of which follows
(Tsay, 2002).

ARMA models join the simpler AutoRegressive (AR) and Moving-Average
(MA) models. The concept is useful in volatility modelling, less so in return
prediction. A general ARMA(p, q) model has the form:

    r_t = φ_0 + Σ_{i=1}^{p} φ_i r_{t-i} + a_t − Σ_{j=1}^{q} θ_j a_{t-j}

where p is the order of the AR part, φ_i its parameters, q the order of
the MA part, θ_j its parameters, and a_t normally-distributed noise. Given
data series r_t, there are heuristics to specify the order and parameters,
e.g. either by the conditional or exact likelihood method. The Ljung-Box
statistics of residuals can check the fit (Tsay, 2002).
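A minimal simulation of the ARMA(p, q) recursion above, here with made-up
ARMA(1,1) parameters, can make the formula concrete (Python with numpy;
this is just the generating equation, not a fitting procedure):

    import numpy as np

    def simulate_arma(n, phi0=0.0, phi=(0.5,), theta=(0.3,), sigma=1.0, seed=0):
        """Simulate r_t = phi0 + sum_i phi_i r_{t-i} + a_t - sum_j theta_j a_{t-j}."""
        rng = np.random.default_rng(seed)
        p, q = len(phi), len(theta)
        a = rng.normal(0.0, sigma, size=n)          # the noise series a_t
        r = np.zeros(n)
        for t in range(n):
            ar = sum(phi[i] * r[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
            ma = sum(theta[j] * a[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
            r[t] = phi0 + ar + a[t] - ma
        return r

    print(simulate_arma(5))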

GARCH models volatility, which is influenced by time-dependent informa-
tion flows resulting in pronounced temporal volatility clustering. For a log
return series r_t, we assume its mean ARMA-modelled, and let a_t = r_t − µ_t
be the mean-corrected log return. Then a_t follows a GARCH(m, s) model
if:

    a_t = σ_t ε_t,   σ_t^2 = α_0 + Σ_{i=1}^{m} α_i a_{t-i}^2 + Σ_{j=1}^{s} β_j σ_{t-j}^2

where ε_t is a sequence of independent and identically distributed (iid) random
variables with mean 0 and variance 1, α_0 > 0, α_i ≥ 0, β_j ≥ 0, and
Σ_{i=1}^{max(m,s)} (α_i + β_i) < 1.
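Similarly, the GARCH recursion can be simulated directly; a sketch for
GARCH(1,1), with illustrative parameter values satisfying the constraints above:

    import numpy as np

    def simulate_garch(n, alpha0=0.1, alpha1=0.1, beta1=0.8, seed=0):
        """Simulate a_t = sigma_t*eps_t, sigma_t^2 = alpha0 + alpha1*a_{t-1}^2 + beta1*sigma_{t-1}^2."""
        rng = np.random.default_rng(seed)
        eps = rng.normal(size=n)                    # iid, mean 0, variance 1
        a = np.zeros(n)
        var = alpha0 / (1.0 - alpha1 - beta1)       # start at the unconditional variance
        for t in range(n):
            a[t] = np.sqrt(var) * eps[t]
            var = alpha0 + alpha1 * a[t] ** 2 + beta1 * var
        return a

    print(simulate_garch(5))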

Box-Jenkins AutoRegressive Integrated Moving Average (ARIMA) extend


the ARMA models, moreover coming with a detailed procedure for how to fit
and test such a model, not an easy task (Box et al., 1994). Because of their wide
applicability, extendability to nonstationary series, and the fitting procedure,
the models are commonly used. ARIMA assumes that a probability model
generates the series, with future values related to past values and errors.
Econometric models extend the notion of a series depending only on its
past values – they additionally use related series. This involves a regression
model in which the time series is forecast as the dependent variable; the
related time series as well as the past values of the time series are the
independent or predictor variables. This, in principle, is the approach of
the thesis papers.

2.3.2 Limits of Linear Models
Modern econometrics increasingly shifts towards nonlinear models of risk
and return. Bera – actively involved in (G)ARCH research – remarked
(Bera & Higgins, 1993): ”a major contribution of the ARCH literature is
the finding that apparent changes in the volatility of economic time series
may be predictable and result from a specific type of nonlinear depen-
dence rather than exogenous structural changes in variables”. Campbell
further argued (Campbell et al., 1997): ”it is both logically inconsistent
and statistically inefficient to use volatility measures that are based on
the assumption of constant volatility over some period when the resulting
series moves through time.”

2.3.3 Nonlinear Methods


Nonlinear methods are increasingly preferred for financial prediction, due
to the perceived nonlinear dependencies in financial data which cannot be
handled by purely linear models. A short overview of the methods follows
(Mitchell, 1997).

Artificial Neural Network (ANN) advances linear models by applying a


non-linear function to the linear combination of inputs to a network unit – a
perceptron. In an ANN, perceptrons are usually prearranged in layers, with
those in the the first layer having access to the inputs, and the perceptrons’
outputs forming the inputs to the next layer, the final one providing the
ANN output(s). Training a network involves adjusting the weights in each
unit’s linear combination as to minimize an objective, e.g. squared error.
Backpropagation – the classical training method – however, may miss an
optimal network due to falling into a local minimum, so other methods
might be preferred (Zemke, 2002b).

Inductive Logic Programming (ILP) and the decision tree (Mitchell, 1997)
learner C4.5 (Quinlan, 1993) generate if-conditions-then-outcome symbolic
rules, human-understandable if small. Since the search for such rules is ex-
pensive, the algorithms either employ greedy heuristics, e.g. C4.5 looking
at a single variable at a time, or perform exhaustive search, e.g. ILP Progol.
These limit the applicability, especially in an area where data is volumi-
nous and unlikely to be in the form of simple rules. Additionally, ensembles –
putting a number of different predictors to vote – obstruct the acclaimed
human comprehension of the rules. However, the approach could be of use
in more regular domains, such as customer rating and perhaps fraud de-
tection. Rules can be also extracted from an ANN, or used together with
probabilities making them more robust (Kovalerchuk & Vityaev, 2000).

Nearest Neighbor (kNN) does not create a general model; to predict,
it looks back for the k most similar cases. It can be distracted by noisy or
irrelevant features, but if this is ruled out, failure of kNN suggests that the
most that can be predicted are general regularities, e.g. based on the output
(conditional) distribution.
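A bare-bones sketch of such a kNN forecast on delay vectors (Python with
numpy; D, k and the use of a plain Euclidean distance are illustrative choices):

    import numpy as np

    def knn_forecast(s, D=3, k=5):
        """Predict the next value of s by averaging the successors of the k
        past delay vectors closest to the most recent one."""
        s = np.asarray(s, dtype=float)
        X = np.array([s[t - D + 1:t + 1] for t in range(D - 1, len(s) - 1)])
        y = s[D:]                                   # value following each delay vector
        query = s[-D:]                              # the most recent delay vector
        idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
        return y[idx].mean()

    series = np.sin(0.3 * np.arange(200))
    print(knn_forecast(series), np.sin(0.3 * 200))  # prediction vs. actual next value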

A Bayesian predictor first learns probabilities of how evidence supports out-
comes, which are then used to predict the outcome for new evidence. Although
the simple learning scheme is robust to violations of the ’naive’ independent-
evidence assumption, watching independence might pay off, especially as in
declining markets variables become more correlated than usual.

Support Vector Machines (SVM) offer a relatively new and powerful learner,
having attractive characteristics for time series prediction (Muller et al.,
1997). First, the model deals with multidimensional instances, actually the
more features the better – reducing the need for (wrong) feature selection.
Second, it has few parameters, thus finding optimal settings can be easier;
one parameter refers to the noise level the system can handle.

Genetic Algorithms (GAs) (Deboeck, 1994) mimic biological evolution
by mutation and cross-over of solutions, in order to maximize their fit-
ness. This is a general optimization technique, and thus can be applied to any
problem – a solution can encode data selection, preprocessing, or a predictor.
GAs explore novel possibilities, often not thought of by humans. There-
fore, it may be worth keeping some predictor settings as parameters that
can be (later) GA-optimized. Evolutionary systems – another example of
evolutionary computation – work in a similar way to GAs, except that the
solution is coded as a real-valued vector, and optimized not only with respect
to the values but also to the optimization rate.

2.3.4 General Learning Issues


Computational Learning Theory (COLT) theoretically analyzes prediction
algorithms, with respect to the learning process assumptions, data and
computation requirements.
Probably Approximately Correct (PAC) Learnability is a central notion
in the theory, meaning that we learn probably – with probability 1 − δ –
and approximately – within error ε – the correct predictor drawn from a
space H. The lower bound on the number of training examples m to find
such a predictor is an important result:

    m ≥ (1/ε)(ln |H| + ln(1/δ))

where |H| is the size of the space – the number of predictors in it. This is
usually an overly big bound – specifics about the learning process can lower it.
However, it provides some insights: m grows linearly in the error factor 1/ε
and logarithmically in 1/δ – the confidence that we find the hypothesis at all
(Mitchell, 1997).
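For illustration, the bound can be evaluated for some made-up values of |H|,
ε and δ:

    import math

    def pac_bound(h_size, eps, delta):
        """Lower bound m >= (1/eps) * (ln|H| + ln(1/delta)) on training examples."""
        return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

    # e.g. |H| = 2^20 hypotheses, error 5%, confidence 99%: roughly 370 examples
    print(pac_bound(h_size=2**20, eps=0.05, delta=0.01))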

Curse of dimensionality (Bellman, 1961) involves two related problems.


As the data dimension – the number of features in an instance – grows,
the predictor needs increasing resources to cover the growing instance space.
It also needs more instances to learn – exponentially with the dimension.
Some prediction algorithms, e.g. kNN, will not be able to generalize at all,
if the dimension is greater than ln(M ), M the number of instances. This
is why feature selection – reducing the data dimension – is so important.
The amount of data to train a predictor can be experimentally estimated
(Walczak, 2001).

Overfitting means that a predictor memorizes non-general aspects of the


training data, such as noise. This leads to poor prediction on new data.
This is a common problem due to a number of reasons. First, the training
and testing data are often not well separated, so memorizing the common
part will give the predictor a higher score. Second, multiple trials might
be performed on the same data (split), so in effect the resulting predictor
will be best suited for exactly that data. Third, the predictor com-
plexity – the number of internal parameters – might be too big for the number
of training instances, so the predictor learns even the unimportant data
characteristics.
Precautions against overfitting involve: good separation of training and
testing data, careful evaluation, use of ensembles averaging out the indi-
vidual overfitting, and application of Occam’s razor. In general,
overfitting is a difficult problem that must be approached individually. A
discussion of how to deal with it can be found in (Mitchell, 1997).

Occam’s razor – preferring a smaller solution, e.g. a predictor involving


fewer parameters, to a bigger one, other things equal – is not a specific
technique but general guidance. There are indeed arguments (Mitchell,
1997) that a smaller hypothesis has a bigger chance to generalize well on
new data. Speed is another motivation – a smaller predictor is likely to be
faster, which can be especially important in an ensemble.

Entropy (Shannon & Weaver, 1949) is an information measure useful at


many stages in a prediction system development. Entropy expresses the
number of bits of information brought in by an entity, be it the next train-
ing instance or checking another condition. Since the notion does not
assume any data model, it is well suited to dealing with nonlinear systems.
As such it is used in feature selection, predictability estimation and predictor
construction, e.g. in C4.5 as the information gain measure to decide which
feature to split on.
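A small sketch of the entropy computation, as used e.g. for the information
gain measure:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy, in bits, of a sequence of discrete outcomes."""
        counts = Counter(labels)
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    print(entropy(['up', 'down', 'up', 'up']))      # about 0.81 bits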

2.4 Ensemble Methods


An ensemble (Dietterich, 2000) is a number of predictors whose votes
are put together into the final prediction. The predictors, on average,
are expected to be better than random and to make independent errors. The
idea is that a correct majority offsets individual errors, thus the ensemble
will be correct more often than an individual predictor. The diversity of errors
is usually achieved by training a scheme, e.g. C4.5, on different instance
samples or features. Alternatively, different predictor types – like C4.5,
ANN, kNN – can be used. Common schemes include Bagging, Boosting,
Bayesian ensembles and their combinations (Dietterich, 2000).

Bagging produces an ensemble by training predictors on different boot-


strap samples – each the size of the original data, but sampled allowing
repetitions. The final prediction is the majority vote. This simple-to-
implement scheme is always worth trying, in order to reduce prediction variance.
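A sketch of bagging around a generic train(X, y) function returning a callable
predictor (the interface is a placeholder of this sketch, not a particular
library’s API):

    import numpy as np
    from collections import Counter

    def bag(train, X, y, n_predictors=25, seed=0):
        """Train predictors on bootstrap samples; predict by majority vote."""
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_predictors):
            idx = rng.integers(0, len(X), size=len(X))   # sample with repetitions
            models.append(train(X[idx], y[idx]))
        def predict(x):
            votes = [m(x) for m in models]
            return Counter(votes).most_common(1)[0][0]
        return predict

    # Tiny demo: a 'learner' that always predicts its training majority class.
    def train(X, y):
        majority = Counter(y).most_common(1)[0][0]
        return lambda x: majority

    X, y = np.arange(10).reshape(-1, 1), np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
    print(bag(train, X, y)(X[0]))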

Boosting initially assigns equal weights to all data instances and trains a
predictor, then it increases weights of the misclassified instances, trains
next predictor on the new distribution etc. The final prediction is a
weighted vote of predictors obtained in this way. Boosting increasingly
pays attention to misclassified instances, which may lead to overfitting if
the instances are noisy.

Bayesian ensemble, similarly to the Bayesian predictor, uses conditional


probabilities accumulated for the individual predictors, to arrive at the
most evidenced outcome. Given good estimates for predictors’ accuracy,
a Bayesian ensemble results in a better prediction than bagging.

2.5 System Evaluation


Proper evaluation is crucial to a prediction system development. First, it
has to measure exactly the interesting effect, e.g. trading return as opposed
to related, but not identical, prediction accuracy. Second, it has to be
sensitive enough to spot even minor gains. Third, it has to convince
that the gains are not merely a coincidence.

Usually prediction performance is compared against published results.
Although it has its problems, such as data overfitting and accidental suc-
cesses due to multiple (worldwide!) trials, this approach works well as long
as everyone uses the same data and evaluation procedure, so meaningful
comparisons are possible. However, when no agreed benchmark is avail-
able, as in the financial domain, another approach must be adopted. Since
the main question concerning financial data is whether prediction is at all
possible, it suffices to compare a predictor’s performance against the in-
trinsic growth of a series – also referred to as the buy and hold strategy.
Then a statistical test can judge if there is a significant improvement.

2.5.1 Evaluation Data

To reasonably test a prediction system, the data must include different


trends, assets for which the system is to perform, and be plentiful enough to
warrant significant conclusions. Overfitting a system to data is a real dan-
ger. Dividing data into three disjoint sets is the first precaution. The training
portion of the data is used to build the predictor. If the predictor in-
volves some parameters which need to be tuned, they can be adjusted so as
to maximize performance on the validation part. Then, with the system pa-
rameters frozen, its performance on an unseen test set provides the final
performance estimation. In multiple tests, the significance level should be
adjusted, e.g. if 10 tests are run and the best appears 99.9% significant, it
really is 99.9%10 = 99% (Zemke, 2000). If we want the system to predict
the future of a time series, it is important to maintain proper time relation
between the training, validation and test sets – basically training should
involve instances time-preceding any test data.
Bootstrap (Efron & Tibshirani, 1993) – with repetitions, sampling as
many elements as in the original – and deriving a predictor for each such
sample, is useful for collecting various statistics (LeBaron & Weigend,
1994), e.g. return and risk-variability. It can be also used for ensemble
creation or best predictor selection, however not without limits (Hastie
et al., 2001).

2.5.2 Evaluation Measures
Financial forecasts are often developed to support semi-automated trading
(profitability), whereas the algorithms used in those systems might have
originally different objectives. Accuracy – percentage of correct discrete
(e.g. up/down) predictions – is a common measure for discrete systems,
e.g. ILP/decision trees. Square error – sum of squared deviations from
actual outputs – is a common measure in numerical prediction, e.g. ANN.
Performance measure – incorporating both the predictor and the trading
model it is going to benefit – is preferable and ideally should measure
exactly what we are interested in, e.g. commission- and risk-adjusted return
(Hellström & Holmström, 1998), not just return. Actually, many systems’
’profitability’ disappears once the commissions are taken into account.

2.5.3 Evaluation Procedure


In data sets where instance order does not matter, N-fold cross validation –
data divided into N disjoint parts, N − 1 for training and 1 for testing, the error
averaged over all N (Mitchell, 1997) – is a standard approach. However, in
the case of time series data, it underestimates error, because in order to train
a predictor we sometimes use data that comes after the test instances –
unlike in real life, where a predictor knows only the past, not the future.
For series, a sliding window approach is more adept: a window/segment of
consecutive instances is used for training and a following segment for testing,
the windows sliding over all the data as statistics are collected.
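A sketch of the sliding window procedure, with a generic train_fn returning a
callable predictor (the interface and window sizes are placeholders of this
sketch):

    def sliding_window_eval(X, y, train_fn, train_size=200, test_size=20):
        """Train on a window of consecutive instances, test on the following
        segment, slide forward and average the test accuracies."""
        scores, start = [], 0
        while start + train_size + test_size <= len(X):
            tr = slice(start, start + train_size)
            te = slice(start + train_size, start + train_size + test_size)
            model = train_fn(X[tr], y[tr])          # model: callable x -> prediction
            hits = [model(x) == target for x, target in zip(X[te], y[te])]
            scores.append(sum(hits) / len(hits))
            start += test_size                      # slide the window forward
        return sum(scores) / len(scores)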

2.5.4 Non/Parametric Tests


Parametric statistical tests have assumptions, e.g. concerning the sample
independence and distribution, and as such allow stronger conclusions for
smaller data – the assumptions can be viewed as additional input informa-
tion, so they need to be demonstrated, which is often missed. Nonparametric
tests put much weaker requirements, so for equally numerous data they allow
weaker conclusions. Since financial data have a non-normal distribution, while
normality is required by many of the parametric tests, non-parametric
comparisons might be safer (Heiler, 1999).

Surrogate data is a useful concept in system evaluation (Kantz &
Schreiber, 1999a). The idea is to generate data sets sharing characteristics
of the original data – e.g. permutations of a series have the same mean,
variance etc. – and for each compute a statistic of interest, e.g. the return of a
strategy. If α is the acceptable risk of wrongly rejecting the null hypothesis
that the original series statistic is lower (higher) than that of any surrogate,
then 1/α − 1 surrogates are needed; if all give higher (lower) statistics than
the original series, then the hypothesis can be rejected. Thus, if a predictor’s
error was lower on the original series, as compared to 19 runs on surrogates,
we can be 95% sure it was not a fluke.
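A sketch of that surrogate test, using permutations of the series as surrogates
and 19 of them for the 95% level (the statistic argument, e.g. a predictor’s
error on the series, is left abstract):

    import numpy as np

    def surrogate_test(series, statistic, n_surrogates=19, seed=0):
        """True if the statistic on the original series is lower than on every
        shuffled surrogate - about 95% confidence for 19 surrogates."""
        rng = np.random.default_rng(seed)
        original = statistic(np.asarray(series, dtype=float))
        surrogate_values = [statistic(rng.permutation(series))
                            for _ in range(n_surrogates)]
        return all(original < s for s in surrogate_values)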

Chapter 3

Development of the Thesis

Be concerned with the ends not the means. Bruce Lee.

3.1 First half – Exploration


When introduced to the area of machine learning (ML) around 1996, I
noticed that many of the algorithms were developed on artificial ’toy prob-
lems’ and once done, the search started for more realistic problems ’suit-
able’ for the algorithm. As reasonable as such a strategy might initially
appear – knowledge of the optimal performance area of a learning algo-
rithm is what is often desired – such studies seldom yielded general area
insights, merely performance comparisons for the carefully chosen test do-
mains. This is in sharp contrast to the needs of a practitioner, who faces a
learning problem first and searches for the solution method later, not vice
versa. So, in my research, I adopted the practical approach: here is my
prediction problem, what can I do about it.
My starting point was that financial prediction is difficult, but is it im-
possible? Or perhaps the notion of unpredictability emerged due to the
nature of the method rather than the data – a case already known: with the
advent of chaotic analysis, many processes previously considered random
turned out to be deterministic, at least in the short run. Though I do not
believe that such a complex socio-economical process as the markets will any
time soon be found completely predictable, the question of a limited pre-
dictability remains open and challenging. And since challenging problems
often lead to profound discoveries, I considered the subject worthwhile.

The experiments started with Inductive Logic Programming (ILP) –
learning logic programs by combining provided background predicates sup-
posedly useful in the domain in question. I used the then (in 1997) state-
of-the-art system, Progol, reported successful in other domains, such as
toxicology and chemistry. I provided the system with various financial in-
dicators; however, despite many attempts, no compressed rules were ever
generated. This could be due to the noise present in financial data and to
the rules, if any, being far from the compact form sought by an ILP system.
The initial failure reiterated the question: is financial prediction at all
possible, and if so, which algorithm works best? The failure of an otherwise
successful learning paradigm, directed the search towards more original
methods. After many fruitless trials, some promising results started ap-
pearing, with the unorthodox method shortly presented in the Feasibility
Study on Short-Term Stock Prediction, Appendix A. This method
looked for invariants in the predicted time series – not just patterns with
high predictive accuracy, but patterns that have above-random accuracy
in a number of temporally distinct epochs, thus excluding those that perhaps
work well, but only for a time. The work went unpublished since the trials
were limited and, in the early stages of my research, I was encouraged to use
more established methods. However, it is interesting to note that the method
is similar to entropy-based compression schemes, as I discovered later.
So I went on to evaluate standard machine learning methods – to see which
of them warrants further investigation. I tried: Neural Network, Nearest
Neighbor, Naive Bayesian Classifier and Genetic Algorithms (GA) evolved
rules. That research, presented and published as Nonlinear Index Pre-
diction – thesis paper 1, concludes that Nearest Neighbor (kNN) works
best. Some of the details, not included in the paper, made it into the report
ILP and GA for Time Series Prediction, thesis paper 2.
The success of kNN suggested that delay embedding and local pre-
diction work for my data, so perhaps could be improved. However, when
I tried to GA-optimize the embedding parameters, the prediction results
were not better. If fine-tuning was not the way, perhaps averaging a num-
ber of rough predictors would be. The majority voting scheme has indeed

improved the prediction accuracy. The originating publication Bagging
Imperfect Predictors, thesis paper 3, presents bagging results from Non-
linear Index Prediction and an approach believed to be novel at that time –
bagging predictions from a number of classifiers evolved in one GA popu-
lation.
Another spin-off from the success of kNN in Nonlinear Index Prediction –
suggesting the implicit presence of determinism and perhaps a limited
dimension of the data – was a research proposal, Evolving Differential
Equations for Dynamical System Modeling. The idea behind this more
extensive project is to use a Genetic Programming-like approach but,
instead of evolving programs, to evolve differential equations, known as
the best descriptive and modeling tool for dynamical systems. This is what
the theory says, but finding equations fitting given data is not yet a solved
task. The project was stalled, awaiting financial support.
But coming back to the main thesis track: the GA experiments in Bagging
Imperfect Predictors were computationally intensive, as is often the case
while developing a new learning approach. This problem gave rise to an
idea of how to try a number of development variants at once, instead of
one-by-one, saving on computation time. Rapid Fine-Tuning of Compu-
tationally Intensive Classifiers, thesis paper 4, explains the technique,
together with some experimental guidelines.
The ensemble of GA individuals, as in Bagging Imperfect Predictors,
could further benefit from a more powerful classifier committee technique,
such as boosting. The published poster Amalgamation of Genetic Se-
lection and Boosting, Appendix B, highlights the idea.

3.2 Second half – Synthesis

At that point, I presented the mid-Ph.D. results and thought about what to
do next. Since ensembles, becoming mainstream in the machine
learning community, seemed the most promising way to go, I investigated
how different types of ensembles performed with my predictors, with the
Bayesian coming a bit ahead of Bagging and Boosting. However, the results
were not that startling and I found more extensive comparisons in the
literature, making me abandon that line of research.
While searching for the comparisons above, I had done quite an extensive
review. I selected the most practical and generally-applicable papers in
Ensembles in Practice: Prediction, Estimation, Multi-Feature and
Noisy Data, a publication which addresses the four data issues relevant to
financial prediction, thesis paper 5.
Besides the general algorithmic considerations, there are also tens of
little decisions that need to be taken while developing a prediction system,
many leading to pitfalls. While reviewing descriptions of many systems
’beating the odds’ I realized that, although widely different, the acclaimed
successful systems share common characteristics, while the naive systems –
quite often manipulative in presenting the results – share common mistakes.
This led to the thesis paper 6: On Developing Financial Prediction
System: Pitfalls and Possibilities, which is an attempt to highlight some
of the common solutions.
Financial data are generated in complex and interconnected ways. What
happens in Tokyo influences what happens in New York and vice versa.
For prediction this has several consequences. First, there are very many
data series to potentially take as inputs, creating data selection and curse
of dimensionality problems. Second, many of the series are interconnected,
in general, in nonlinear ways. Hence, an attempt to predict must identify
the important series and their interactions, having decided that the data
warrants predictability at all.
These considerations led me to a long investigation. Searching for a
predictability measure, I had the idea to use the common Zip compression
to estimate entropy in a constructive way – if the algorithm could compress
(many interleaved series), its internal working could provide the basis for a
prediction system. But reviewing references, I found similar work, more
mathematically grounded, so I abandoned mine. Then I shifted attention
to uncovering multivariate dependencies, along with a predictability measure,
by means of weighted and GA-optimized Nearest Neighbor, which failed (it
worked, but only up to 15 input data series, whereas I wanted the method
to work for more than 50 series).
Then came a multivariate discretization idea, initially based on Shannon
(conditional) entropy, later reformulated in terms of accuracy. After so
many false starts, the feat was quite spectacular, as the method was able
to spot multivariate regularities, involving only a fraction of the data, in
up to 100 series. To my knowledge, this is also the first (multivariate)
discretization having the maximization of an ensemble’s performance as an
objective. Multivariate Feature Coupling and Discretization is the
thesis paper number 7.
Throughout the second part of the thesis, I have steadily developed time
series prediction software incorporating my experiences and expertise. How-
ever, at the thesis print time the system is not yet operational, so its de-
scription is not included.

Chapter 4

Contributions of Thesis Papers

This section summarizes some of the contributions of the 7 papers included


in the thesis.

4.1 Nonlinear Index Prediction


This publication (Zemke, 1998) examines index predictability by means
of Neural Networks (ANN), Nearest Neighbor (kNN), Naive Bayesian and
Genetic Algorithms-optimized Inductive Logic Program (ILP) classifiers.
The results are interesting in many respects. First, they show that a lim-
ited prediction is indeed possible. This adds to the growing evidence that
an unqualified Efficient Market Hypothesis might one day be revised. Sec-
ond, Nearest Neighbor achieves the best accuracy among the commonly used
Machine Learning methods, which might encourage further exploration in
this area dominated by Neural Network and rule-based, ILP-like, systems.
Also, the success might hint at specific features of the data analyzed. Namely,
unlike the other approaches, Nearest Neighbor is a local, model-free tech-
nique that does not assume any form of the learnt hypothesis, as is done
by a Neural Network architecture or LP background predicates. Third, the
superior performance of Nearest Neighbor, as compared to the other meth-
ods, points to the problems in constructing global models for financial
data. If confirmed in more extensive experiments, it would highlight the
intrinsic difficulties of describing some economical dependencies in terms
of simple rules, as taught to economics students. And fourth, the failure of
the Naive Bayesian classifier can point out limitations of some statistical
techniques used to analyze complex preprocessed data – a common approach
in the earlier studies of financial data that contributed so much to the
Efficient Market Hypothesis view.

4.2 ILP via GA for Time Series Prediction

Since, due to publisher space limits, only the main results of the GA-
optimized ILP were included in the earlier paper, this report presents some
details of these computationally intensive experiments (Zemke, 1999c). Al-
though the overall accuracy of LP on the index data was not impressive,
the attempts still have practical value – in outlining the limits of otherwise
successful techniques. First, the initial experiments applying Progol – at that
time a ’state of the art’ Inductive Logic Programming system – show that
a learning system successful on some domains can fail on others. There
could be at least two reasons for this: a domain unsuitable for the learning
paradigm, or unskillful use of the system. Here, I only note that most of
the successful applications of Progol involve domains where a few rules hold
most of the time: chemistry, astronomy, (simple) grammars, whereas fi-
nancial prediction rules, if any, are softer. As for the unskillful use of
an otherwise capable system, the comment could be that such a system
would merely shift the burden from learning the theory implied by the
provided data to learning the system’s ’correct usage’ – instead of lessening
the burden altogether. As such, one should be aware that machine learning is
still more of an art – demanding experience and experimentation – than
engineering, which provides procedures for almost blindly solving a given
problem.
The second contribution of this paper exposes background predicate sen-
sitivity – exemplified by variants of equal. The predicate definitions can
have a substantial influence on the achieved results – again highlighting
the importance of an experimental approach and, possibly, a requirement
for nonlinear predicates. Third, since GA-evolved LP can be viewed as
an instance of Genetic Programming (GP), the results confirm that GP is
perhaps not the best vehicle for time series prediction. And fourth, a gen-
eral observation about GA-optimization and learning: while evolving LP of
varying size, the best (accuracy-wise) programs usually emerged in GA experi-
ments with only a secondary fitness bonus for smaller programs, as opposed
to runs in which programs would be penalized by their size. Actually, it
was interesting to note that the path to smaller and accurate programs
often led through much bigger programs which were subsequently
reduced – had the bigger programs not been allowed to appear in the
first place, the smaller ones would not have been found either. This observation,
together with the not so good generalization of the smallest programs, is-
sues a warning against blind application of Occam’s Razor in evolutionary
computation.

4.3 Bagging Imperfect Predictors

This publication (Zemke, 1999b), again due to publisher restrictions, com-


pactly presents a number of contributions both to the area of financial
prediction and machine learning. The key tool here is bagging – a scheme
involving majority voting of a number of different classifiers so as to increase
the ensemble’s accuracy. The contributions could be summarized as fol-
lows. First, instead of the usual bagging of the same classifier trained on
different (bootstrap) partitions of the data, classifiers based on different
data partitions as well as on different methods are bagged together – an idea
described as ’neat’ by one of the referees. This leads to higher accuracy than
that achieved by bagging each of the individual method classifiers or data
selections separately. Second, as applied to index data, prediction accuracy
seems highly correlated with returns, a relationship reported to break up at
higher accuracies. Third, since the above two points hold, bagging applied
to a variety of financial predictors has the potential to increase the ac-
curacy of prediction and, consequently, the returns, which is demonstrated.
Fourth, in the case of GA-optimized classifiers, it is advantageous to bag
all above-average classifiers present in the final GA population, instead of
the usual practice of taking the single best classifier. And fifth, somewhat
contrary to conventional wisdom, it turned out that on the data analyzed,
big index movements were more predictable than smaller ones – most likely
due to the smaller ones consisting of relatively more noise.
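A minimal sketch of the core idea – majority voting over classifiers that may
differ in both training data and learning method; the classifier interface here
is only an assumption for illustration, not the paper's implementation:

from collections import Counter

def bag_predict(classifiers, x):
    # 'classifiers' may mix models trained by different methods (e.g. kNN,
    # neural network, GA-evolved rules) on different data partitions;
    # the ensemble returns the majority class.
    votes = Counter(clf.predict(x) for clf in classifiers)
    return votes.most_common(1)[0][0]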

4.4 Rapid Fine Tuning of Computationally Intensive
Classifiers
This publication (Zemke, 2000), a spin-off of the experiments carried out
for the previous paper, elaborates on a practical aspect applicable to almost
any machine learning system development, namely, on a rapid fine-tuning
of parameters for optimal performance. The results could be summarized
as follows. First, working on a specific difficult problem, as in the case of
index prediction, can lead to a solution and insights into more general prob-
lems, and as such is of value beyond merely the domain of the primary
investigation. Second, the paper describes a strategy for simultaneous
exploration of many versions of a fine-tuned algorithm with different pa-
rameter choices. And third, a statistical analysis method for detection of
superior parameter settings is presented, which together with the earlier
point allows for rapid fine-tuning.

4.5 On Developing Financial Prediction System: Pit-


falls and Possibilities
The publication (Zemke, 2002b) is the result of my own experiments with
financial prediction system development and of a review of such systems in
the literature. The paper succinctly lists issues appearing in the development
process, pointing to some common pitfalls and solutions. The contributions
could be summarized as follows.
First, it makes the reader aware of the many steps involved in a suc-
cessful system implementation. The presentation tried to follow the devel-
opment progression – from data preparation, through predictor selection
and training, ’boosting’ the accuracy, to evaluation issues. Being aware of
the progression can help in a more structured development and pinpoint
some omissions.
Second, for each stage of the process, the paper lists some common
pitfalls. The importance of this cannot be overestimated. For instance,
many ’profit-making’ systems presented in the literature are tested only
in the decade-long bull market 1990-2000, and never tested in long-term
falling markets, which most likely would average the systems’ performance.
Such are some of the many pitfalls pointed out.
Third, the paper suggests some solutions to the pitfalls and to general
issues appearing in a prediction system development.

4.6 Ensembles in Practice: Prediction, Estimation,


Multi-Feature and Noisy Data
This publication (Zemke, 2002a) is the result of an extensive literature
search on ensembles applied to realistic data sets, with the 4 objectives in
mind: 1) time series prediction – how ensembles can specifically exploit the
serial nature of the data; 2) accuracy estimation – how ensembles can mea-
sure the maximal prediction accuracy for a given data set, in a better way
than any single method; 3) how ensembles can exploit multidimensional
data and 4) how to use ensembles in the case of noisy data.
The four issues appear in the context of financial time series predic-
tion, though the examples referred to are non-financial. Actually, this
cross-domain application of working solutions could bring new methods to
financial prediction. The contributions of the publication can be summa-
rized as follows.
First, after a general introduction to how and why ensembles work, and
to the different ways to build them, the paper diverges into the four title
areas. The message here is that although ensembles are generally-
applicable and robust techniques, a search for the ’ultimate ensemble’
should not overlook the characteristics and requirements of the problem
in question. A similar quest for the ’best’ machine learning technique a few
years ago failed with the realization that different techniques work best
in different circumstances. Similarly with ensembles: different problem
settings require individual approaches.
Second, the paper goes on to present some of the working approaches
addressing the four issues in question. This has a practical value. Usually
the ensemble literature is organized by ensemble method, whereas a prac-
titioner has data and a goal, e.g. to predict from noisy series data. The
paper points to possible solutions.

4.7 Multivariate Feature Coupling and Discretization
This paper (Zemke & Rams, 2003) presents a multivariate discretization
method based on Genetic Algorithms applied twice: first to identify im-
portant feature groupings, second to perform the discretization maximiz-
ing a desired function, e.g. the predictive accuracy of an ensemble built on
those groupings. The contributions could be summarized as follows.
First, as the title suggests, a multivariate discretization is provided,
presenting an alternative to the very few multivariate methods reported.
Second, feature grouping and ranking – the intermediate outcome of the
procedure – has a value in itself: it allows one to see which features are
interrelated and how much predictability they bring in, promoting feature
selection. Third, the second, global GA-optimization allows an arbitrary
objective to be maximized, unlike in other discretization schemes where
the objective is hard-coded into the algorithm. The objective exemplified
in the paper maximizes the goal of prediction, accuracy, whereas other
schemes often only indirectly attempt to maximize it via measures such as
entropy or the chi-square statistic. Fourth, to my knowledge, this is the
first discretization to allow explicit optimization for an ensemble. This
forces the discretization to act on a global basis, not merely searching for
maximal information gain per selected feature (grouping) but for all features
viewed together. Fifth, the global discretization can also yield a global
estimate of predictability for the data.

Chapter 5

Bibliographical Notes

This chapter is intended to provide a general bibliography introducing
newcomers to the interdisciplinary area of financial prediction. I list a few
books I have found to be both educational and interesting to read in my
study of the domain.

Machine Learning
Machine Learning (Mitchell, 1997). As for now, I would regard this book
as the textbook for machine learning. It not only presents the main learn-
ing paradigms – neural networks, decision trees, rule induction, nearest
neighbor, analytical and reinforcement learning – but also introduces to
hypothesis testing and computational learning theory. As such, it balances
the presentation of machine learning algorithms with practical issues of
using them, and some theoretical aspects of their function. Future editions
of this otherwise excellent book could also consider more novel
approaches: support vector machines and rough sets.
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations (Witten & Frank, 1999). Using this book, and the
software package Weka behind it, could save time, otherwise spent on im-
plementing the many learning algorithms. This book essentially provides
an extended user guide to the open-source code available online. The
Weka toolbox, in addition to more than 20 parameterized machine learning
methods, offers data preparation, hypothesis evaluation and some visual-
ization tools. A word of warning, though: most of the implementations are
straightforward and non-optimized – suitable for learning the nuts and bolts
of the algorithms rather than for big-scale data mining.
The Elements of Statistical Learning: Data Mining, Inference, and Pre-
diction (Hastie et al., 2001). This book, in wide scope similar to Machine
Learning (Mitchell, 1997), could be recommended for its more rigorous treat-
ment and some additional topics, such as ensembles.
Data Mining and Knowledge Discovery with Evolutionary Algorithms
(Alex, 2002). This could be a good introduction to practical applications
of evolutionary computations to various aspects of data mining.

Financial Prediction
Here, I present a selection of books introducing various aspects of non-
linear financial time series analysis.
Data Mining in Finance: Advances in Relational and Hybrid Methods
(Kovalerchuk & Vityaev, 2000). This is an overview of some of the methods
used for financial prediction and of features such a prediction system should
have. The authors also present their system, supposedly overcoming many
of the common pitfalls. However, the book is somewhat short on the details
that would allow re-evaluating some of the claims, though it is good as an overview.
Trading on the Edge (Deboeck, 1994). This is an excellent book of self-
contained chapters practically introducing the essence of neural net-
works, chaos analysis, genetic algorithms and fuzzy sets, as applied to
financial prediction.
Neural Networks in the Capital Markets (Refenes, 1995). This collection
on neural networks for economic prediction highlights some of the practical
considerations in developing a prediction system. Many of the hints are
applicable to prediction systems based on other paradigms, not just on
neural networks.
Fractal Market Analysis (Peters, 1994). In this book, I found the
chapters on various applications of Hurst or R/S analysis the most interesting.
Though this has not resulted in my immediately using that approach, it is
always good to know what self-similarity analysis can reveal about the
data at hand.

Nonlinear Analysis, Chaos
Nonlinear Time Series Analysis (Kantz & Schreiber, 1999a). As authors
can be divided into those who write what they know, and those who know
what they write about, this is definitely the latter case. I would recom-
mend this book, among other introductions to nonlinear time series, for
its readability, practical approach, examples (though mostly from physics),
formulae with clearly explained meaning. I could easily convert into code
many of the algorithms described in the text.
Time Series Prediction: Forecasting the Future and Understanding the
Past (Weigend & Gershenfeld, 1994). A primer on nonlinear prediction
methods. The book, finalizing the Santa Fe Institute prediction compe-
tition, introduces time series forecasting issues and discusses them in the
context of the competition entries.
Coping with Chaos (Ott, 1994). This book, by a contributor to the
chaos theory, is a worthwhile read providing insights into aspects of chaotic
data analysis, prediction, filtering, control, with the theoretical motivations
revealed.

Finance, General
Modern Investment Theory (Haughen, 1997). A relatively easy-to-read
book systematically introducing current views on investments, though mostly
from an academic point of view. This book also discusses the Efficient
Market Hypothesis.
Financial Engineering (Galitz, 1995). A basic text on what financial
engineering is about and what it can do.
Stock Index Futures (Sutcliffe, 1997). Mostly an overview work, providing
numerous references to research on index futures. I considered skimming
the book essential for insights into documented futures behavior, so as not to
reinvent the wheel.
A Random Walk down Wall Street (Malkiel, 1996) and Reminiscences
of a Stock Operator (Lefvre, 1994). An enjoyable, leisurely read about the me-
chanics of Wall Street. In some sense the books – presenting investment
activity in a wider historical and social context – also have great educa-
tional value. Namely, they show the influence of subjective, not always
rational, drives on the markets, which, as such, perhaps cannot be fully
analyzed by rational methods.

Finance, High Frequency


An Introduction to High-Frequency Finance (Dacorogna et al., 2001). A
good introduction to high-frequency finance, presenting facts about the
data and ways to process it, with simple prediction schemes included.
Financial Markets Tick by Tick (Lequeux, 1998). In high-frequency fi-
nance, where data is usually not equally time-spaced, certain mathematical
notions – such as correlation and volatility – require new, precise definitions.
This book attempts that.

Nonlinear Index Prediction
International Workshop on Econophysics and Statistical Finance, 1998.

Physica A 269 (1999)

Nonlinear Index Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Presented: International Workshop on Econophysics and Statistical Finance, Palermo,


1998.
Published: Physica A, volume 269, 1999

Abstract Neural Network, K-Nearest Neighbor, Naive Bayesian Classifier and Genetic
Algorithm evolving classification rules are compared for their prediction accuracies on
stock exchange index data. The method yielding the best result, Nearest Neighbor, is
then refined and incorporated into a simple trading system achieving returns above index
growth. The success of the method hints at the plausibility of nonlinearities present in the
index series and, as such, at the scope for nonlinear modeling/prediction.

Keywords: Stock Exchange Index Prediction, Machine Learning, Dynamics Reconstruc-


tion via delay vectors, Genetic Algorithms optimized Trading System

Introduction
Financial time series present a fruitful area for research. On one hand
there are economists claiming that profitable prediction is not possible, as
voiced by the Efficient Market Hypothesis; on the other, there is growing
evidence of exploitable features of these series. This work describes a
prediction effort involving 4 Machine Learning (ML) techniques. The ex-
periments use the same data and lack unduly specializing adjustments – the
goal being a relative comparison of the basic methods. Only subsequently
is the most promising technique scrutinized.
Machine Learning (Mitchell, 1997) has been extensively applied to fi-
nances (Deboeck, 1994; Refenes, 1995; Zirilli, 1997) and trading (Allen
& Karjalainen, 1993; Bauer, 1994; Dacorogna, 1993). Nonlinear time se-
ries (Kantz & Schreiber, 1999a) approaches have also become commonplace
(Trippi, 1995; Weigend & Gershenfeld, 1994). The controversial notion of
(deterministic) chaos in financial data is important since the presence of a
chaotic attractor warrants partial predictability of financial time series –
in contrast to the random walk and Efficient Market Hypothesis (Fama,
1965; Malkiel, 1996). Some of the results supporting deviation from the
log-normal theory (Mandelbrot, 1997) and a limited financial prediction
can be found in (LeBaron, 1993; LeBaron, 1994).

The Task
Some evidence suggests that markets with lower trading volume are eas-
ier to predict (Lerche, 1997). Since the task of the study is to compare
ML techniques, data from the relatively small and scientifically unexplored
Warsaw Stock Exchange (WSE) (Aurell & Zyczkowski, 1996) is used, with
the quotes, from the opening of the exchange in 1991, freely available on
the Internet. At the exchange, prices are set once a day (with intraday
trading introduced more recently). The main index, WIG, is a capital-
ization weighted average of all the stocks traded on the main floor, and
provides the time series used in this study.
The learning task involves predicting the relative index value 5 quotes
ahead, i.e., a binary decision whether the index value one trading week
ahead will be up or down in relation to the current value. The interpretation
of up and down is such that they are equally frequent in the data set,
with down also including small index gains. This facilitates detection of
above-random predictions – their accuracy, as measured by the proportion
of correctly predicted changes, is 0.5 + s, where s is the threshold for
the required significance level. For the data including 1200 index quotes,
the following table presents the s values for one-sided 95% significance,
assuming that 1200 − W indowSize data points are used for the accuracy
estimate.
Window size: 60 125 250 500 1000
Significant error: 0.025 0.025 0.027 0.031 0.06
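The tabulated values appear consistent with a normal approximation to a
random guesser's accuracy, s ≈ 1.645 · sqrt(0.25/N) with N = 1200 − WindowSize
test points; the following check is my own reconstruction, not part of the
original paper:

from math import sqrt

def one_sided_95_threshold(n_test):
    # Accuracy of random guessing has standard deviation sqrt(0.5*0.5/n);
    # 1.645 is the one-sided 95% z-value of the normal approximation.
    return 1.645 * sqrt(0.25 / n_test)

for window in (60, 125, 250, 500, 1000):
    print(window, round(one_sided_95_threshold(1200 - window), 3))
# prints roughly 0.024, 0.025, 0.027, 0.031, 0.058 -- close to the table above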

Learning involves WindowSize consecutive index values. Index daily
(relative) changes are digitized via monotonically mapping them into 8
integer values, 1..8, such that each is equally frequent in the resulting series.
This preprocessing is necessary since some of the ML methods require
bounded and/or discrete values. The digitized series is then used to create
delay vectors of 10 values, with lag one. Such a vector, (c_t, c_{t-1}, c_{t-2}, ..., c_{t-9}),
is the sole basis for prediction of the index up/down value at time t + 5
w.r.t. the value at time t. Only vectors, and their matching predictions,
derived from index values falling within the current window are used for
learning.
The best generated predictor – achieving the highest accuracy on the window
cases – is then applied to the vector next to the last one in the window,
yielding a prediction for the index value falling next to the window. With the
accuracy estimate accumulating and the window shifting over all available
data points, the resulting prediction accuracies are presented in the tables
as percentages.
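A sketch of this preprocessing – equal-frequency digitization of the daily
changes into 1..8 and construction of the lag-one delay vectors with the class
taken 5 steps ahead; the function names and the exact alignment of the class
labels are illustrative assumptions, not the paper's code:

def digitize_equal_freq(changes, n_bins=8):
    # Map each daily change to 1..n_bins so that every bin is (nearly)
    # equally frequent in the resulting series.
    ranked = sorted(range(len(changes)), key=lambda i: changes[i])
    codes = [0] * len(changes)
    for rank, i in enumerate(ranked):
        codes[i] = rank * n_bins // len(changes) + 1
    return codes

def delay_examples(codes, classes, length=10, horizon=5):
    # classes[t] is the (precomputed) up/down label comparing the index
    # 'horizon' quotes after t with the index at t.
    examples = []
    for t in range(length - 1, len(codes) - horizon):
        examples.append((codes[t - length + 1: t + 1], classes[t]))
    return examples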

Neural Network Prediction

Five layered network topologies have been tested. The topologies, as de-
scribed by the numbers of non-bias units in subsequent layers, are: G0:
10-1, G1: 10-5-1, G2: 10-5-3-1, G3: 10-8-5-1, G4: 10-20-5-1. Units in
the first layer represent the input values. The standard backpropagation (BP)
algorithm is used for learning the weights, with the change values 1..8 linearly
scaled down to the [0.2, 0.8] range required by the sigmoid BP, up
denoted by 0.8, and down by 0.2.
The window examples are randomly assigned into either a training or a
validation set, comprising 80% and 20% of the examples respectively.
The training set is used by BP to update weights, while the validation set –
to evaluate the network’s squared output error. The minimal error network
for the whole run is then applied to the example next to the window for
prediction. Prediction accuracies and some observations follow.

Window/Graph G0 G1 G2 G3 G4
60 56 - - - -
125 58 56 63 58 -
250 57 57 60 60 -
500 58 54 57 57 58
1000 - - - 61 61
• Prediction accuracy, without outliers, is in the significant 56 – 61%
range
• Accuracies seem to increase with window size, reaching above 60% for
bigger networks (G2 – G4), so the results could further improve
with more training data

Naive Bayesian Classifier


Here the basis for prediction consists of the probabilities P(class_j) and
P(evidence_i | class_j) for all recognized evidence/class pairs. The class_p
preferred by observed evidence_o1 ... evidence_on is given by maximizing the
expression P(class_p) ∗ P(evidence_o1 | class_p) ∗ ... ∗ P(evidence_on | class_p).
In the task at hand, evidence can take the form attribute_n = value_n,
where attribute_n, n = 1..10, denotes the n-th position in the delay vec-
tor, and value_n is a fixed value. If the position has this value, the evi-
dence is present. Class and conditional probabilities are computed by
counting the respective occurrences in the window, with missing conditionals
assigned the default 1/equivalentSampleSize probability of 1/80 (Mitchell,
1997). Some results and comments follow.
Window size: 60 125 250 500 1000
Accuracy: 54 52 51 47 50
• The classifier performs poorly – perhaps due to the preprocessing of the
dataset removing any major probability shifts – in the bigger window
cases no better than a guessing strategy
• The results show, however, some autocorrelation in the data: positive
for shorter periods (up to 250 data-points) and mildly negative for
longer (up to 1000 data-points), which is consistent with other studies
on stock returns (Haughen, 1997).
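A sketch of such a classifier on the digitized delay vectors, reconstructed from
the description above, with the default 1/80 probability for unseen evidence/class
pairs; the function names are mine:

from collections import defaultdict

def train_naive_bayes(examples):
    # examples: list of (vector, cls), vector values in 1..8, cls 'up'/'down'
    class_count, cond_count = defaultdict(int), defaultdict(int)
    for vector, cls in examples:
        class_count[cls] += 1
        for pos, val in enumerate(vector):
            cond_count[(pos, val, cls)] += 1     # evidence: position pos = val
    return class_count, cond_count, len(examples)

def nb_predict(model, vector, default=1.0 / 80):
    class_count, cond_count, n = model
    best_cls, best_p = None, -1.0
    for cls, c in class_count.items():
        p = c / n                                # P(class)
        for pos, val in enumerate(vector):
            seen = cond_count.get((pos, val, cls), 0)
            p *= seen / c if seen else default   # P(evidence | class), default 1/80
        if p > best_p:
            best_cls, best_p = cls, p
    return best_cls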

K-Nearest Neighbor
In this approach, the K window vectors most similar to the one being clas-
sified are found. The most frequent class among the K vectors is then
returned as the classification. The standard similarity metrics is the Euclidean
distance between the vectors. Some results and comments follow.
Window/K 1 11 125
125 56 - -
250 55 53 56
500 54 52 54
1000 64 61 56

• Peak of 64%

• Accuracy always at least 50% and significant in most cases

The above table has been generated for the Euclidean metrics. However,
the peak of 64% accuracy (though for other Window/K combinations) has
also been achieved for the Angle and Manhattan metrics, indicating that
the result is not merely an outlier due to some idiosyncrasies of the data
and parameters. (These results were obtained from a GA run over the space
MetricsType ∗ K ∗ WindowSize. For a pair of vectors, the Angle metrics
returns the angle between them, Maximal – the maximal absolute difference
coordinate-wise, whereas Manhattan – the sum of such differences.)
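For illustration, a minimal sketch of the K-Nearest Neighbor predictor over
the window examples, here with the Euclidean metrics (the Manhattan or
Angle metrics would only change the distance function); names are mine:

from collections import Counter
from math import sqrt

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(window_examples, query, k=1, distance=euclidean):
    # window_examples: (delay_vector, up_or_down) pairs from the current window;
    # the most frequent class among the k closest vectors is returned.
    neighbors = sorted(window_examples, key=lambda ex: distance(ex[0], query))[:k]
    return Counter(cls for _, cls in neighbors).most_common(1)[0][0]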

GA-evolved Logic Programs


The logic program is a list of clauses for the target up predicate. Each
clause is 10 literals long, with each literal drawn from the set of available
2-argument predicates: lessOrEqual, greaterOrEqual – with the implied
interpretation, as well as Equal(X, Y) if abs(X − Y) < 2 and nonEqual(X,
Y) if abs(X − Y) > 1. The first argument of each literal is a constant
among 1..8 – together with the predicate symbol – evolved through the
GA. The other genetic operator is a 2-point list crossover, applied to the
2 programs – lists of clauses.
The second argument of the N-th literal is the clause’s N-th head argument,
which is unified with the N-th value in a delay vector. Applying the up
predicate to a delay vector performs prediction. If the predicate succeeds
the classification is up, and down otherwise. Fitness of a program is mea-
sured as the proportion of window examples it correctly classifies. Upon
the GA termination, the fittest program from the run is used to classify
the example next to the current window. Programs in a population have
different lengths – number of up clauses – limited by a parameter, as shown
in the following table.

Window/Clauses 5 10 50 100 200


250 60 - - - –
500 44 47 53 50 –
1000 48 50 50 38 44

• Accuracy, in general, non-significant

• Bigger programs (number of clauses > 10) are very slow to converge
and result in erratic predictions

In subsequent trials individual program clauses are GA-optimized for


maximal up coverage, and one by one added to the initially empty program
until no uncovered up examples remain in the window. A clause covers an
example, if it succeeds on that example’s delay vector. The meaning of
values in the clause fitness formulas is the following: Neg is the count of
window down examples (wrongly!) covered by the (up!) clause, Pos is
the count of up examples yet-uncovered by clauses already added to the
program, but covered by the current clause, and AllPos is the total count
of all window up examples covered by that clause. The weights given
to individual counts mark their importance in the GA search trying to
maximize the fitness value. The results and some commentaries follow.

Clause fitness function/Window      60    125   250   500   1000
AllPos + Pos − 10^3 ∗ Neg           54.8  50.3  51.7  51.9  53.2
AllPos + 10^3 ∗ Pos − 10^6 ∗ Neg    57.1  51.7  52.8  53.0  48.9
as above & ordinary equality        53.6  51.9  53.0  52.5  58.8

• Accuracies, in general, not significant


• The accuracy increase after the introduction of ordinary equality (Equal(X, Y)
if X = Y) indicates the importance of the relations used
• The highest accuracy achieved, reaching 59% for the window of 1000,
indicates the possibility of further improvement should a bigger window
be available

K-nearest Neighbor Prediction Scrutinized


In the prediction accuracy measurements so far, no provision has been
made for the magnitude of the actual index changes. As such, it could
turn out that a highly accurate system is not profitable in real terms, e.g.
by making infrequent but big losses (Deboeck, 1994). To check this, a more
realistic prediction scheme is tested, in which prediction performance is
measured as the extra growth in returns in relation to the intrinsic growth of
the series. The series worked with is the sequence of logs of daily index
changes: log_n = ln(index_n) − ln(index_{n−1}). The log change delay vectors still
have length 10, but because of the high autocorrelation present (0.34) the delay
lag has been set to 2, instead of 1 as before (Kantz & Schreiber, 1999a).
Additional parameters follow.

Neighborhood Radius – maximal distance w.r.t. the chosen metrics, up to
which vectors are considered neighbors and used for prediction, in [0.0, 0.05)

Distance Metrics – between vectors, one of the Euclidean, Maximal,
Manhattan metrics

Window size – limit on how many past data-points are looked at while
searching for neighbors, in [60, 1000)

Kmin – minimal number of vectors required within a neighborhood to
warrant prediction, in [1, 20)

Predictions’ Variability – how much the neighborhood vectors’ predictions
can vary to justify a consistent common prediction, in [0.0, 1.0)

Prediction Variability Measure – how to compute the above measure
from the series of the individual predictions, as: standard deviation,
difference max − min between the maximal and minimal value, or the
same-sign proportion of predictions

Distance scaling – how contributory predictions are weighted in the
common prediction sum, as a function of neighbor distance, no-scaling: 1,
linear: 1/distance, exponential: exp(−distance)

The parameters are optimized via GA. The function maximized is the
relative gain of an investment strategy involving a long position in the index
when the aggregate prediction says it will go up, a short position when
down, and staying in cash if no prediction is warranted. The prediction
period is 5 days and the investment continues for that period, after which
a new prediction is made. An aggregate prediction is computed by adding
all the weighted contributory predictions associated with valid neighbors.
If some of the requirements fail, e.g. the minimal number of neighbors, no
prediction is issued.
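A sketch of how the maximized function – the strategy's gain relative to the
intrinsic index growth – might be computed for non-overlapping 5-day positions;
predict_5d stands for the neighborhood-based prediction under a given parameter
setting and is an assumed interface, not the paper's code:

def strategy_log_gain(log_changes, predict_5d, horizon=5):
    # log_changes[t] = ln(index_t) - ln(index_{t-1}); predict_5d(t) returns
    # +1 (long), -1 (short) or 0 (cash, no prediction warranted), using only
    # data up to time t. Returned is the strategy's log-return minus the
    # index's own log-return over the same period.
    strategy, index = 0.0, 0.0
    t = 0
    while t + horizon < len(log_changes):
        move = sum(log_changes[t + 1: t + 1 + horizon])   # 5-day index move
        strategy += predict_5d(t) * move
        index += move
        t += horizon                                      # non-overlapping positions
    return strategy - index
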
The following tests have been run. Test1 computed average annual gain
over index growth during 4 years of trading: 33%. Test2 computed minimal
(out of 5 runs shifted by 1 day each) gain during the last year (ending on
Sept. 1, 1998): 28%. Test3 involved generating 19 sets of surrogate data –
permuted logarithmic change series – and checking if the gain on the real
series exceeds those for the surrogate series; the test failed – in 6 cases the
gain on the permuted data was bigger. However, assuming normality of
distribution in the Test2 and Test3 samples, the two-sample t procedure
yielded a 95% significant result (t = 1.91, df = 14, P < 0.05) that the Test2
gains are indeed higher than those for Test3. (The logarithmic average for
Test3 was around 0, as opposed to the strictly positive results and averages
for Test1 and Test2 – this could be the basis for another surrogate test.)

Conclusion
The results show that some exploitable regularities do exist in the index
data and Nearest Neighbor is able to profit from them. All the other, def-
initely more elaborate techniques, fall short of the 64% accuracy achieved
via Nearest Neighbor. One of the reasons could involve the non-linearity of
the problem in question: with only linear relations available, logic program
classifier rules require a linear nature of the problem for good performance,
with the nonlinear Neural Network performing somewhat better. On the other
hand, the Nearest Neighbor approach can be viewed as generalizing only locally –
with no linear structure imposed/assumed – moreover with the granularity
set by the problem examples.
As further research, other data could be tested, independent tests for
nonlinearity performed (e.g. dimension and Lyapunov exponent estima-
tion) and the other Machine Learning methods refined as well.

ILP and GA for Time Series
Prediction
Dept. of Computer and Systems Sciences Report 99-006

ILP via GA for Time Series Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

June 1998. Published: DSV report 99-006

Abstract This report presents experiments using GA to optimize Logic Programs
for time series prediction. Both strategies – optimizing the whole program at once and
building it clause-by-clause – are investigated. The set of background predicates stays the
same during all the experiments, though the influence of some variations is also observed.
Despite extensive trials, none of the approaches exceeded 60% accuracy, with 50% for
a random strategy and 64% achieved by a Nearest Neighbor classifier on the same data.
Some reasons for the weak performance are speculated upon, including the non-linearity
of the problem and a too greedy approach.

Keywords: Inductive Logic Programming, Genetic Programming, Financial Applications,


Time Series Forecasting, Machine Learning, Genetic Algorithms

Introduction
Inductive Logic Programming
Inductive Logic Programming (ILP) (Muggleton & Feng, 1990) – the auto-
matic induction of logic programs, given a set of examples and background
predicates – has shown successful performance in several domains (Lavarac
& Dzeroski, 1994).
The usual setting for ILP involves providing positive examples of the re-
lationship to be learned, as well as negative examples for which the relation-
ship does not hold. The hypotheses are selected to maximize compression
or information gain, e.g., measured by the number of used literals/tests

or program clauses. The induced hypothesis, in the form of a logic pro-
gram (or easily converted to it), can be usually executed without further
modifications as a Prolog program.
The hypotheses are often found via covering, in which a clause succeed-
ing on, or covering, some positive examples is discovered (e.g. by greedy
local search) and added to the program; the covered positives are removed
from the example set and clauses are added until the set is empty. Each
clause should cover positive examples only, with all the negative examples
excluded, though this can be relaxed, e.g., because of different noise handling
schemes.
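A generic sketch of such a covering loop (find_best_clause stands for the
clause search, greedy or GA-based, and is an assumed interface):

def covering(positives, negatives, find_best_clause):
    # Build the program clause by clause; each clause is meant to cover some
    # still-uncovered positives (ideally no negatives).  The loop assumes
    # every returned clause covers at least one remaining positive example.
    program, uncovered = [], list(positives)
    while uncovered:
        clause = find_best_clause(uncovered, negatives)
        program.append(clause)
        uncovered = [ex for ex in uncovered if not clause.covers(ex)]
    return program
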
As such, ILP is well suited to domains where a compact representation of
the learnt concept is possible and likely. Without further elaboration it can
be seen that most of the ILP success areas belong to such domains, with a
concise mathematical-like description feasible and subsequently discovered.

The Task

The task attempted in this work consists in short term prediction of a time
series – a normalized version of stock exchange daily quotes. The normal-
ization involved monotonically mapping the daily changes to 8 values, 1..8,
ensuring that the frequency of those values is equal in the 1200 points con-
sidered. The value to be predicted is a binary up or down, referring to the
index value five steps ahead in the series. These classifications were again
made equally frequent (with down including also small index gains).
The normalization of the class data allows easy detection of above-
random predictions – their accuracy is above 50% + s, where s is the
threshold for the required significance level. If the level is one-sided 95% and
the predictions are tested on all 1200 − WindowSize examples, then the sig-
nificant deviations from 0.5 are as presented in the table.
Thus, predictions with accuracy above 0.56 are of interest, no matter
what the window size. For an impression of the predictability of the time
series: a Nearest Neighbor method yielded 64% accuracy (Zemke, 1998),
with a Neural Network reaching similar results.

Window Significant error
60 0.025
125 0.025
250 0.027
500 0.031
1000 0.06

Figure 5.1: One-sided 95% significance level errors for the tests

class_example(223, up, [3,6,5,8,8,8,4,8,8,7]).


class_example(224, up, [6,5,8,8,8,4,8,8,7,8]).
class_example(225, up, [5,8,8,8,4,8,8,7,8,8]).
class_example(226, down, [8,8,8,4,8,8,7,8,8,8]).
class_example(227, down, [8,8,4,8,8,7,8,8,8,1]).
class_example(228, up, [8,4,8,8,7,8,8,8,1,1]).
class_example(229, down, [4,8,8,7,8,8,8,1,1,8]).
class_example(230, down, [8,8,7,8,8,8,1,1,8,8]).

Figure 5.2: Data format sample: number, class and 10-changes vector

Data Format

The actual prediction involves looking at the pattern of 10 subsequent
changes and from them forecasting the class. To make it more convenient,
the sequence of the change and class tuples has been pre-computed. The
change tuples are generated at one step resolution from the original series,
so the next tuple’s initial 9 values overlap with the previous tuple’s last
9 values, with the most recent value concatenated as the 10th argument.
Such tuples constitute the learning task’s examples. A sample is presented.
The accuracy estimation for a prediction strategy consists in learning in
a window of consecutive examples, and then trying to predict the class of
the example next to the window by applying the best predictor for that
window to the next example’s change vector. The counts of correct and of all
predictions are accumulated as the window shifts one-by-one over all the
available examples. The final accuracy is the ratio of correct predictions
to all predictions made.

GA Program Learning
Common GA Settings
All the tests use the same GA module, with GA parameters constant for
all trials, unless indicated otherwise. Random individual generation, mu-
tation, crossover, fitness evaluation are provided as plug-ins to the module
and are described for each experiment setting.
The Genetic Algorithm uses a 2-member tournament selection strategy,
with the fitter individual having the lower numerical fitness value (which can
be negative or positive). The mutation rate is 0.1, with each individual mutated
at most once before applying other genetic operators; the crossover rate is 0.3
(so offspring constitute 0.6 of the next population) and the population size is
100. Two-point (uniform) crossover is applied only to the top-level list in
the individuals’ representation. The number of generations is at least 5, no
more than 30, and the run is additionally terminated if its best individual
has not improved in the last 5 generations.
A provision is made for the shifted window learning to benefit from the
already learned hypothesis, in an incremental learning fashion. This can
be conveniently done by using a few (mutated) copies of the previous window’s
best hypothesis while initializing a new population, instead of a totally
random initialization. This is done both to speed up convergence and
to increase GA exploitation.
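A skeleton of such a GA module with the stated settings – tournament size 2
with lower fitness preferred, mutation rate 0.1, crossover rate 0.3, population
100, 5 to 30 generations with early stopping; the individual-level operators are
plug-ins, and any detail beyond those stated in the text is my own assumption:

import random

def run_ga(random_individual, mutate, crossover, fitness,
           pop_size=100, mutation_rate=0.1, crossover_rate=0.3,
           min_gen=5, max_gen=30, patience=5, rng=random):
    # crossover(a, b) is assumed to return a pair of offspring;
    # lower fitness values are better, as in the text.
    population = [random_individual() for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(population, 2)
        return a if fitness(a) <= fitness(b) else b

    best, best_fit, stale = None, float('inf'), 0
    for gen in range(max_gen):
        offspring = []
        while len(offspring) < int(2 * crossover_rate * pop_size):   # 0.6 of population
            offspring.extend(crossover(tournament(), tournament()))
        while len(offspring) < pop_size:                             # rest: selected copies
            offspring.append(tournament())
        population = [mutate(ind) if rng.random() < mutation_rate else ind
                      for ind in offspring[:pop_size]]
        gen_best = min(population, key=fitness)
        if fitness(gen_best) < best_fit:
            best, best_fit, stale = gen_best, fitness(gen_best), 0
        else:
            stale += 1
        if gen + 1 >= min_gen and stale >= patience:                 # early stopping
            break
    return best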

Evolving Whole Logic Program


Representation The program is represented as a list of clauses for the tar-
get up predicate. Each clause is 10 literals long, with each literal drawn
from the set of available 2-argument predicates: lessOrEqual, greaterOrE-
qual – with the implied interpretation, as well as equal(X, Y) if abs(X −
Y) < 2 and nonEqual(X, Y) if abs(X − Y) > 1.
The first argument of each literal is an integer among 1..8 – together
with the predicate symbol – evolved through the GA. The second argument
of the clause’s N-th literal is the value of the N-th head argument, which is
unified with the N-th change value in an example’s tuple.

Evaluation Classification is performed by applying the up predicate to an


example’s change vector. If it succeeds the classification is up, and down
otherwise. Fitness of a program is measured as the (negative) number of
window examples it correctly classifies. Upon the GA termination, the
fittest program from the run is used to classify the example next to the
current window.

GA Operators Mutation changes a single predicate symbol or a constant in
one of the program clauses. The 2-point crossover is applied to the list holding
the program clauses and cuts in between them (not inside clauses), with the
resulting offspring programs of uncontrolled length.

Other parameters The initial program population consists of randomly gen-
erated programs of up to L clauses. The L parameter has been varied
from 1 to 300, with the more thoroughly tested cases reported. The L param-
eter is also used during fitness evaluation. If, as the result of crossover, a
longer program is generated, its fitness is set to 0 (practically setting its
tournament survival chance to nil).
Another approach to limiting program length has also been tried and given
up because of no performance improvement and more computational effort.
Namely, programs of up to 2*L clauses were evaluated and the actual fitness
value returned for those longer than L was multiplied by a factor linearly
declining from 1 to 0 as the length increased from L to 2*L.
When the learning window is shifted and the program population initialized,
the new population has a 0.02 chance of being seeded with a mutated version
of the previous window’s best classifier – the one used for prediction. This
is intended to promote incremental learning on the new window, differing
only by 2 examples (one added and one removed).

Window/Clauses 5 10 50 100 200
250 60 - - - –
500 44 47 53 50 –
1000 48 50 50 38 44

Figure 5.3: GA-evolved logic program prediction accuracy

Results The results for bigger program sizes and smaller windows are
missing, since the amount of information required to code the programs
would be comparable to that needed to memorize the examples, which could
easily lead to overfitting instead of generalization.
Observations from over 50 GA runs follow.

• Accuracy, in general, non-significant

• Up to a certain number of clauses, an increased clause number improves
accuracy, with more erratic results thereafter

• The influence of crossover seems to be limited to that of random
mutations, with offspring less fit than parents

• Programs evolved with different settings (e.g. maximal program length)
often fail to predict the same ’difficult’ examples

• For bigger programs allowed (clause count more than 50; with population
sizes tried up to 2000), convergence is very slow and the best program
is often (randomly) created in an initial population

• Window increase, as well as a bigger population, generally improves
prediction accuracy

• Window-cases accuracy (i.e. fitness) is not a good measure of prediction
accuracy, though both remain related (especially for bigger window
sizes)

Learning Individual Clauses via GA

To limit the search space explosion, perhaps responsible for the poor
performance of the previous trial, the next tests optimize individual clauses,
added one-by-one to the program. In this more traditional ILP setting, the
window up cases constitute the positive, and the down cases the negative examples.

Representation Clauses are represented as lists of literals – in the same
way as an individual clause in the whole-program learning. A GA popula-
tion maintains a set of clauses, all optimized, in a particular run of the GA,
by the same fitness function.
The classifier is built by running the GA search for an optimal clause,
adding the clause to the (initially empty) program, updating the set of
yet uncovered positives and initiating the GA procedure again. The pro-
cess terminates when there are no more remaining positive examples to be
covered.

Evaluation Details of the fitness function vary and will be described for
the individual tests. In general, the function promotes a single clause
covering maximal number of positive and no negative examples in the
current window. The variants include different sets of positives (all or yet
uncovered), different weights assigned to their counts and some changes in
the relations used.

GA Operators Crossover takes the 10-element lists encoding the body literals
of the 2 selected clauses and applies the 2-point crossover to them, with the
restriction that each of the offspring must also have exactly 10 literals.
Mutation changes an individual literal: its relation symbol or constant.

Unrestricted GA Clause Search

The first trial initializes the set of clauses randomly,


with no connection to the window example set.

Evaluation Fitness of a clause is defined as the difference Negatives – Pos-
itives, where Negatives is the count of all negatives covered by the clause,
and Positives is the count of positives, yet uncovered by previously added
clauses, that are covered by the current clause.

Termination The problem with this approach is that, however well it
initially seems to work, as soon as the set of remaining positives becomes sparse,
the GA search has difficulty finding any clause covering a positive example
at all, not to mention a number of positives and no negatives. The
search did not terminate in many cases.
However, in those cases in which a set of clauses covering all positives
has been found, the accuracy in classifying new examples looked promising,
which led to the subsequent trials.

More Specific Genetic Operators


GA Operators In this setting all genetic operators, including clause initial-
ization, have an invariant: a selected positive example must be covered.
This leads to changes in the implementation of clause initialization and
mutation. Crossover does not need to be specialized: a crossover of two
clauses, each covering the same positive example, still covers that example.

Evaluation The fitness function is defined by the formula 1000*Negatives –
Positives – AllPositives, where the additional AllPositives indicates the
count of all window positives covered by that clause; the other summands
are as already explained. Such a formula has shown better prediction accuracy
than just promoting a maximal count among the remaining positives. Here
there is a double premium for capturing the remaining positives: they are
included in both positive counts.
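As a sketch, the weights simply impose a priority order on the three counts
(all smaller than the window size), with covered negatives dominating; the
function below is only an illustration of that weighting:

def clause_fitness(negatives, positives, all_positives):
    # negatives: down examples covered (heavily penalized); positives: still
    # uncovered up examples covered; all_positives: all up examples covered.
    # Lower is better, as the GA minimizes fitness.
    return 1000 * negatives - positives - all_positives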

Termination and full positive coverage are ensured by iterating over all
positive examples, with each clause added covering at least one of them.
Some observations about the results follow.
• The only significant prediction is that for window size 60, but only
just

Window Accuracy
60 54.8
125 50.3
250 51.7
500 51.9
1000 53.2

Figure 5.4: More Specific Genetic Operators. Fitness: 1000*Negatives – Positives – AllPositives

Window Accuracy
60 57.1
125 51.7
250 52.8
500 53.0
1000 48.9

Figure 5.5: Refined Fitness Function. Fitness: 1000000*Negatives – 1000*Positives – AllPositives

• The rest of the results hint at no prediction, giving overall poor perfor-
mance

Refined Fitness Function


In this trial all the settings are as above, with a different fitness function.

Evaluation The new fitness formula, 1000000*Negatives – 1000*Positives –


AllPositives, gives priority to covering no negatives, then to maximizing the
coverage on yet-uncovered positives, and only then on all positives (with
all Negatives, Positives, AllPositives counts less than 1000 because of the
window size).
As compared to the previous fitness setting, this resulted in:
• Improved accuracy for all window sizes

Window Accuracy
60 53.6
125 51.9
250 53.0
500 52.5
1000 58.8

Figure 5.6: Ordinary Equality in the Relation Set. Fitness: 1000000*Negatives – 1000*Positives – AllPositives

• Predictions for window size 250 significant

Ordinary Equality in the Relation Set


Another possibility for changes involves redefining the relations employed.
Since, with the relations used so far, there was no possibility to select a
specific value – all the relations, including equality, involving intervals –
the definition of Equal has been restricted.

Representation The Equal predicate has been set to ordinary equality,
holding if its two arguments are the same numbers. Other settings stay as
previously.
The results are interesting, among other reasons because:
• The property that more data fed in (i.e. a bigger window) yields higher
accuracy allows one to expect further accuracy gains
• The accuracy achieved for window size 1000 is the highest for all
methods with individually evolved clauses

Other variations attempted led to no profound changes. In principle, all
the changes to the clause invention scheme and to parameter values could
be carried out by a meta-GA search in the appropriate space. However, due
to computational effort this is not yet feasible, e.g. achieving the above
result for window size 1000 involved more than 50h of computation
(UltraSparc, 248 MHz).

Decision Trees
The prediction task was also attempted via the Spectre (Bostrom & L.,
1999) system, a propositional learner with results equivalent to a decision
tree classifier, equipped with hypothesis pruning and noise handling. I am
grateful to Henrik Boström for the courtesy of actually running the test on
the provided data. The results follow.

I tried SPECTRE on learning from a random subset consisting
of 95% of the entire set, and testing on the remaining 5%.
The results were very poor (see below).
[...]

******************* Experimental Results ********************


Example file: zemke_ex
No. of runs: 10
****************** Summary of results ***********************
=============================================================
Method: S
Theory file: zemke
-------------------------------------------------------------
Training size: 95
Mean no. of clauses: 3.4 Std deviation: 1.65 Std error: 0.52
Mean accuracy: 50.33 Std deviation: 7.11 Std error: 2.25
Pos. accuracy: 28.64 Std deviation: 14.69 Std error: 4.64
Neg. accuracy: 71.58 Std deviation: 17.59 Std error: 5.56
Since the set of predicates employed by Spectre included only =, <, >,
excluding any notion of inequality, the last setting for GA clause induc-
tion was run for comparison, with nonEqual disabled. The result for the
window size 1000 is 51.5%, slightly better than that of Spectre but still
non-significant. The drop from 58.8% indicates the importance of the re-
lation set.
The results are similar to those of experiments involving Progol (Muggleton,
1995), unreported in the current study, which focuses on evolutionary
approaches to ILP. This system searches the space of logic programs covering
given positive and excluding negative examples by exhaustively (subject to
some restrictions) considering combinations of background predicates. In
the trials, the system returned either the trivially general clause – up for
any example – or the example set itself as the most compressed hypothesis,
thus effectively offering no learning.

Conclusion
The overall results are not impressive – none of the approaches has exceeded
the 60% accuracy level. The failure of the standard ILP systems (Progol
and the decision tree learner) can be indicative of the inappropriateness of
the locally greedy, compression/information-gain driven approach to this
type of problem. The failure of evolving whole programs once more shows the
difficulty of finding optima in very big search spaces.
Another factor is the set of predicates used. As compared with the
GA runs, Progol and Spectre tests missed the inequality relation. As the
introduction of ordinary equality or removal of inequality showed, even the
flexible GA search is very sensitive to the available predicates. This could
be an area for further exploration.
All the above, definitely more elaborate, techniques fall short of the
results achieved via the Nearest Neighbor method. One of the reasons could
be the non-linearity of the problem in question: with only linear relations
available, any generalization assumes a linear nature of the problem to
perform well. On the other hand, the Nearest Neighbor approach can be
viewed as generalizing only locally, moreover with the granularity set by
the problem examples themselves.

Bagging Imperfect Predictors
ANNIE’99, St. Louis, MO, US, 1999

Bagging Imperfect Predictors
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Presented: ANNIE’99.
Published: Smart Engineering System Design, ASME Press, 1999

Abstract Bagging – a majority voting scheme – has been applied to a population of
stock exchange index predictors, yielding returns higher than those of the best predictor.
The observation has been more thoroughly checked in a setting in which all above-average
predictors evolved in a Genetic Algorithm population have been bagged, and their trading
performance compared with that of the population's best, resulting in significant
improvement.

Keywords: Bagging, Financial Applications, Performance Analysis, Time Series Forecasting,
Machine Learning, Mixture of Experts, Neural Network Classifier, Genetic Algorithms,
Nearest Neighbor

Introduction
Financial time series prediction presents a difficult task with no single
method best in all respects, the foremost of which are accuracy (returns)
and variance (risk). In the Machine Learning area, ensembles of classifiers
have long been used as a way to boost accuracy and reduce variance. Fi-
nancial prediction could also benefit from this approach, however due to
the peculiarities of financial data the usability needs to be experimentally
confirmed.
This paper reports experiments applying bagging – a majority voting
scheme – to predictors for a stock exchange index. The predictors come

from efforts to obtain a single best predictor. In addition to observing bag-
ging induced changes in accuracies, the study also analyzes their influence
on potential monetary returns.
The following chapter provides an overview of bagging. Next, settings
for the base study generating index predictions are described, and how the
predictions are bagged in the current experiments. Finally, a more
realistic trading environment is presented together with the results.

Bagging

Bagging (Breiman, 1996) is a procedure involving a committee of different
classifiers. This is usually achieved by applying a single learning algorithm
to different bootstrap samples drawn from the training data – which should
destabilize the learning process resulting in non-identical classifiers. An-
other possibility is to use different learning algorithms trained on common
data, or a mix of both. When a new case is classified, each individual clas-
sifier issues its unweighted vote, and the class which obtains the biggest
number of votes is the bag outcome.
For bagging to increase accuracy, the main requirement is that the indi-
vidual classifiers make independent errors and are (mostly) above random.
By majority voting, bagging promotes the average bias of the classifiers
reducing the influence of individual variability. Experiments show (Webb,
1998), that indeed, bagging reduces variance while slightly increasing bias,
with bias measuring the contribution to classification error by classifiers’
central tendency, whereas variance – error by deviation from the central
tendency.
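As a minimal sketch of the voting step, assuming each ensemble member exposes a hypothetical predict() method returning a class label:

from collections import Counter

def bag_predict(classifiers, instance):
    # Unweighted majority vote: each classifier contributes one vote and
    # the most frequent class becomes the bagged outcome.
    votes = Counter(clf.predict(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]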

Bagging Predictors
Results in this study involve bagging outcomes of 55 experiments run for
earlier research comparing predictions via Neural Network (ANN, 10 pre-
dictors), Nearest Neighbor (kNN, 29), Evolved Logic Programs (ILP, 16)
and Bayesian Classifier (not used in this study). More detailed description
of the methods can be found in (Zemke, 1998).

Experimental Settings

Some evidence suggests that markets with lower trading volume are easier
to predict (Lerche, 1997). Since the task of the earlier research was to
compare Machine Learning techniques, data from the relatively small and
unexplored Warsaw Stock Exchange (WSE) was used, with the quotes
freely available on the Internet (WSE, 1995 onwards). At the exchange,
prices are set once a day (with intraday trading introduced more recently).
The main index, WIG, a capitalization weighted average of stocks traded
on the main floor, provided the time series used in this study, with 1250
quotes from the formation of the exchange in 1991 up to the time of the
comparative research.
Index daily (log) changes were digitized via monotonically mapping
them into 8 integer values, 1..8, such that each was equally frequent in
the resulting series. The digitized series, {c}, was then used to create de-
lay vectors of 10 values, with lag one. Such a vector (ct , ct−1 , ct−2 , ..., ct−9 ),
was the sole basis for prediction of the index up/down value at time t + 5
w.r.t. the value at time t. Changes up and down have been made equally
frequent (with down including small index gains) for easier detection of
above-random predictors. Only delay vectors and their matching 5-day
returns derived from consecutive index values within a learning window
were used for learning. Windows of half-year, 1-year (250 index quotes),
2-years and 4 years were tested.
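A rough sketch of the digitization and delay-vector construction, assuming NumPy; the quantile-based mapping only approximates the equal-frequency digitization, and in the experiments it would be derived within each learning window:

import numpy as np

def digitize_equifrequent(changes, n_bins=8):
    # Map log changes to integers 1..n_bins so that each value is
    # (roughly) equally frequent, using empirical quantiles as bin edges.
    edges = np.quantile(changes, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, changes) + 1

def delay_vectors(series, length=10, lag=1):
    # Vectors (c_t, c_{t-1}, ..., c_{t-(length-1)}) for every admissible t.
    return [series[t - (length - 1) * lag: t + 1: lag][::-1]
            for t in range((length - 1) * lag, len(series))]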
For each method, the predictor obtained for the window was then ap-
plied to the vector next to the last one in the window yielding up/down
prediction for the index value falling next to the window. With the coun-
ters for in/correct predictions accumulating as the window shifted over all
available data points, the resulting average accuracies for each method are
included in table 1, with accuracy shown as the percentage (%) of correctly
predicted up and down cases.

Estimating Returns

For estimating index returns induced by predictions, the 5-day index changes
have been divided into 8 equally frequent ranges, 1..8, with ranges 1..4
corresponding to down and 5..8 to up. Changes within each range obtained
values reflecting non-uniform distribution of index returns (Cizeau et al.,
1997). The near-zero changes 4 and 5 obtained value 1, changes 3 and 6 —
2, 2 and 7 — 4 and the extreme changes 1 and 8 — value 8.
Return is calculated as the sum of values corresponding to correct (up/down)
predictions minus the sum of values for incorrect predictions. To normalize,
it is divided by the total sum of all values involved, thus ranging between
−1 – for null – and 1 – for full predictability. It should be noted that
such a return is not equivalent to accuracy, which gives the same weight
to all correct predictions.
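A small sketch of this return measure, assuming the actual 5-day changes have already been mapped to their range numbers 1..8:

RANGE_VALUE = {1: 8, 2: 4, 3: 2, 4: 1, 5: 1, 6: 2, 7: 4, 8: 8}

def normalized_return(range_ids, predictions):
    # range_ids: actual change ranges (1..4 = down, 5..8 = up);
    # predictions: 'up' or 'down'. The result lies in [-1, 1].
    total, gained = 0, 0
    for r, pred in zip(range_ids, predictions):
        value = RANGE_VALUE[r]
        actual = 'up' if r >= 5 else 'down'
        total += value
        gained += value if pred == actual else -value
    return gained / total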
The different learning methods, ILP, kNN and ANN, involved in this
study offer the classification error independence required by bagging to
work. Within each method predictors, there is still a variety due to different
training windows and parameters, such as background predicates for ILP,
k values for kNN, and architectures for ANN.
In this context bagging is applied as follows: all selected predictors, e.g.
these trained on a window of half a year – as for the first row of bagged
results in table 1, issue their predictions for an instance, with the majority
class being the instance's bagged prediction. The predictor selections in
table 1 are according to the learning method (columns): ILP, kNN, ANN,
all of them, and according to training window size (rows), e.g. '4 & 2 & 1
year' – bagging predictions for all these window sizes.

Method              ILP, #16      kNN, #29      ANN, #10      all, #55
                    %    Return   %    Return   %    Return   %    Return
Individual methods – no bagging involved
Average             56   0.18     57   0.19     62   0.32     57   0.21
Deviation           .029 .068     .038 .094     .018 .043     .039 .095
Window-wise bagged results
Half year           55   0.20     63   0.32     -    -        61   0.30
1 year              56   0.19     57   0.20     61   0.30     59   0.27
2 years             55   0.14     60   0.28     65   0.38     60   0.26
4 years             60   0.22     66   0.41     62   0.32     64   0.34
4 & 2 years         62   0.28     63   0.34     63   0.35     64   0.35
  & 1 year          60   0.26     61   0.30     64   0.36     63   0.34
  & half year       61   0.28     61   0.30     63   0.34     64   0.37

Figure 1: Accuracies and returns for individual and bagged methods.

With up to 1000 (4 years) – of the 1250 index points used for training –
the presented accuracies for the last 250 require 6% increase for a signifi-
cant improvement (one-sided, 0.05 error). Looking at the results, a number
of observations can be attempted. First, increased accuracy – bagged accu-
racies exceeding the average for each method. Second, poorly performing
methods gaining most, e.g. ILP (significantly) going up from 56% average
to 62% bagged accuracy. Third, overall, bagged predictors incorporating
windows of 4 & 2 years achieve highest accuracy. And fourth, return per-
formance is positively correlated to bagged accuracy, with highest returns
for highest accuracies.

Bagging GA Population
This section describes a trading application of bagged GA-optimized Nearest
Neighbor classifiers. As compared to the previously used Nearest Neighbor
classifiers, those in this section have additional parameters specifying what
constitutes a neighbor and are optimized for maximizing the return implied
by their predictions; they also work on more extensive data, the choice of
which is also parameterized. Some of the parameters follow (Zemke, 1998).
Active features – binary vector indicating features/coordinates in delay
vector included in neighbor distance calculation, max. 7 active
Neighborhood Radius – maximal distance up to which vectors are con-
sidered neighbors and used for prediction, in [0.0, 0.05)
Window size – limit how many past data-points are looked at while
searching for neighbors, in [60, 1000)
Kmin – minimal number of vectors required within a neighborhood to
warrant prediction, in [1, 20)
Predictions’ Variability – how much neighborhood vector’s predictions
can vary to justify a consistent common prediction, in [0.0, 1.0)
Prediction Variability Measure – how to compute the above measure
from the series of the individual predictions: as the standard deviation,
or as the difference max − min between the maximal and minimal value

Distance scaling – how contributory predictions are weighted in the
common prediction sum, as a function of neighbor distance, no-scaling:
1, linear: 1/distance, exponential: exp(−distance)

The kNN parameters are optimized for the above-index gain of an investment
strategy involving a long index position for an up prediction, a short
position for down, and staying out of the index if no prediction is warranted. The
tion – for down, and staying out of index if no prediction warranted. The
prediction and investment period is 5 days, after which a new prediction
is executed. A kNN prediction is arrived at by adding all weighted past
5-day returns associated with valid neighbors. If some of the requirements,
e.g. minimal number of neighbors, fail – no overall prediction is issued.
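A sketch of such a parameterized neighbor-based prediction, assuming NumPy arrays of past delay vectors and their matching 5-day returns; the parameter names mirror the list above, but the data layout and dictionary keys are assumptions rather than the thesis code, and the variability check uses the standard-deviation variant:

import numpy as np

def knn_predict(query, past_vectors, past_returns, params):
    active = params['active_features']           # boolean mask, max 7 active
    window = past_vectors[-params['window_size']:]
    returns = past_returns[-params['window_size']:]
    dists = np.sqrt(((window[:, active] - query[active]) ** 2).sum(axis=1))
    near = dists < params['radius']
    if near.sum() < params['k_min']:
        return None                              # too few neighbors
    neigh_returns = returns[near]
    if np.std(neigh_returns) > params['max_variability']:
        return None                              # neighbors disagree too much
    weight = {'none': np.ones_like,
              'linear': lambda d: 1.0 / (d + 1e-9),
              'exp': lambda d: np.exp(-d)}[params['scaling']](dists[near])
    vote = (weight * neigh_returns).sum()
    return 'up' if vote > 0 else 'down'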
The trading is tested for a period of one year, split into 1.5 month
periods, for which new kNN parameters are GA-optimized. The delay
vectors are composed of daily logarithmic changes derived from series, with
number of delayed values (lag 1) indicated: WIG index (30), Dow Jones
Industrial Average (10), and Polish-American Pioneer Stock Investment
Fund (10). The results of the trading simulations are presented in table 2.

Method             No. of Trials   Mean    Deviation
Random strategy    10000           0.171   0.192
Best strategy      200             0.23    0.17
Bagged strategy    200             0.32    0.16
Figure 2: Returns for a random, GA-best and bagged strategy

Random strategy represents trading according to the up/down sign of a
randomly chosen 5-day index return from the past. Best strategy indicates
trading according to the GA-optimized strategy (fitness = return on pre-
ceding year). Bagged strategy, indicates trading according to a majority
vote of all above-random (i.e. positive fitness) predictors present in the
final generation.
Trading by the best predictor outperforms the random strategy with 99.9%
confidence (t-test); likewise, trading by the bagged predictor outperforms
the best strategy.

Conclusion
This study presents evidence that bagging multiple predictors can improve
prediction accuracy on stock exchange index data. With the observation
that returns are proportional to prediction accuracy, bagging makes an
interesting approach for increasing returns. This is confirmed by trading in
a more realistic setting with the returns of bagging significantly outper-
forming that of trading by a single best strategy.

Rapid Fine-Tuning of
Computationally Intensive Classifiers
MICAI’2000, Mexico, 2000. LNAI 1793

Rapid Fine-Tuning of Computationally
Intensive Classifiers
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Presented: MICAI’00. Published: LNAI 1793, Springer, 2000

Abstract This paper proposes a method for testing multiple parameter settings in one
experiment, thus saving on computation-time. This is possible by simultaneously tracing
processing for a number of parameters and, instead of one, generating many results –
for all the variants. The multiple data can then be analyzed in a number of ways, such
as by the binomial test used here for superior parameters detection. This experimental
approach might be of interest to practitioners developing classifiers and fine-tuning them
for particular applications, or in cases when testing is computationally intensive.

Keywords: Analysis and design, Classifier development and testing, Significance tests,
Parallel tests

Introduction
Evaluating a classifier and fine-tuning its parameters, especially when
performed with non-optimal prototype code, often require lengthy com-
putation. This paper addresses the issue of such experiments, propos-
ing a scheme speeding up the process in two ways: by allowing multiple
classifier-variants comparison in shorter time, and by speeding up detection
of superior parameter values.
The rest of the paper is organized as follows. First, a methodology of
comparing classifiers is described pointing out some pitfalls. Next, the
proposed method is outlined. And finally, an application of the scheme to
a real case is presented.

Basic Experimental Statistics
Comparing Outcomes
While testing 2 classifiers, one comes up with 2 sets of resulting accuracies.
The question then is: do the observed differences indicate actual superiority
of one approach, or could they have arisen randomly?
The standard statistical treatment for comparing 2 populations, the t-
test, came under criticism when applied in the machine learning settings
(Dietterich, 1996), or with multiple algorithms (Raftery, 1995). The test
assumes that the 2 samples are independent, whereas usually when two
algorithms are compared, this is done on the same data set so the inde-
pendence of the resulting accuracies is not strict. Another doubt can arise
when the quantities compared do not necessarily have normal distribution.
If one wants to compare two algorithms, A and B, then the binomial test
is more appropriate. The experiment is to run both algorithms N times
and to count the S times A was better than B. If the algorithms were
equal, i.e., P(A better than B in a single trial) = 0.5, then the probability
of obtaining the difference of S or more amounts to the sum of binomial
trials, P = 0.5, yielding between S and N successes. As S gets larger than
N/2, the error of wrongly declaring A as better than B decreases, allowing
one to achieve a desired confidence level. Table 1 provides the minimal
S differentials as a function of the number of trials N and the (I- or II-sided)
confidence level.
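A small sketch of the underlying computation – the exact binomial tail at p = 0.5:

from math import comb

def binomial_superiority_error(successes, trials):
    # Probability of `successes` or more wins for A out of `trials`
    # non-draw comparisons if A and B were really equal (p = 0.5);
    # 1 minus this value is the confidence in declaring A better.
    return sum(comb(trials, s)
               for s in range(successes, trials + 1)) / 2 ** trials

For instance, binomial_superiority_error(22, 32) comes out below 0.05, consistent with the 95% one-sided entry for 32 trials in the table.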
The weaknesses of binomial tests for accuracies include: non-quantitative
comparison – not showing how much one case is better than the other
(e.g., as presented by their means); somewhat ambivalent results in the
case of many draws – if the number of draws D >> S, should the relatively
small number of successes decide which sample is superior; and non-obvious
ways of comparing more than 2 samples, or samples of different
cardinality (Salzberg, 1997).

Significance Level
Performing many experiments increases the odds that one will find 'significant'
results where there are none. For example, an experiment at 95%

#Trials 95% I 95% II 99% I 99% II 99.9% I 99.9% II 99.99% I 99.99% II
5 5 - - - - - - -
6 6 6 - - - - - -
7 7 7 7 - - - - -
8 7 8 8 8 - - - -
16 12 13 13 14 15 15 16 16
32 22 22 23 24 25 26 27 28
64 40 41 42 43 45 46 47 48
128 74 76 78 79 82 83 86 87
256 142 145 147 149 153 155 158 160
512 275 279 283 286 292 294 299 301
1024 539 544 550 554 562 565 572 575

Figure 5.7: Minimal success differentials for desired confidence

                     Confidence desired
Confidence tested    95%    99%    99.9%
99%                  5      1      -
99.9%                51     10     1
99.99%               512    100    10

Figure 5.8: Required single-trial confidence for series of trials

confidence level draws a conclusion that is wrong with probability 0.05,
so in fact, for every 20 experiments, one is expected to pass an arbitrary
test with 95% confidence. The probability of not making such an error in
any of K (independent) experiments goes down to 0.95^K, which for K > 1
is clearly less than the 95% confidence level.
Thus in order to keep the overall confidence for a series of experiments,
the individual confidences must be more stringent. If c is the desired
confidence, then the product of the individual experiments' confidences
must be at least c. Table 2 presents, for a few desired levels, maximally
how many experiments at a higher level can be run for the series to still
be within the intended level. The approximate (conservative) formula is
quite simple: MaxNumberOfTrials = (1 − Confidence desired) / (1
− Confidence tested).
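A tiny helper computing the largest number of trials consistent with this product rule (the exact counterpart of the approximate formula above):

from math import floor, log

def max_trials(desired, per_trial):
    # Largest K with per_trial**K >= desired, i.e. how many independent
    # experiments at the stricter per-trial confidence keep the whole
    # series at the desired overall confidence.
    return floor(log(desired) / log(per_trial))

For example, max_trials(0.95, 0.999) gives 51 – matching the corresponding entry in Table 2 – while the approximate formula yields (1 − 0.95)/(1 − 0.999) = 50.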

To avoid spurious inferences, one is strongly advised to always aim at
significance higher than the bottom line 95% easily obtained in tens of
testing runs. However, more stringent tests also increase the possibility
that one will omit some genuine regularities. One solution to this trade-off
could be to first search for any results while accepting relatively low
significance and, once something interesting is spotted, to rerun the test
on more extensive data, aiming at a higher pass.

Tuning Parameters
A common practice involves multiple experiments in order to fine-tune
optimal parameters for the final trial. Such a practice increases the chances
of finding an illusory significance – in two ways. First, it involves the
discussed above effect of numerous tests on the same data. Second, it
specializes the algorithm to perform on the (type of) data on which it is
later tested.
To avoid this pitfall, first each fine-tuning experiment involving the
whole data should appropriately adjust the significance level of the whole
series – in a way discussed. The second possibility requires keeping part of
the data for testing and never using it at the fine-tuning stage, in which
case the significance level must only be adjusted according to the number
of trials on the test portion.

Proposed Method
Usually it is unclear without a trial how to set parameter values for optimal
performance. Finding the settings is often done in a change-and-test man-
ner, which is computationally intensive, both to check the many possible
settings, and to get enough results to be confident that any observed reg-
ularity is not merely accidental. The proposed approach to implementing
the change-and-test routine can speed up both.
The key idea is to run many experiments simultaneously. For example,
if the tuned algorithm has 3 binary parameters A, B and C taking values
-/+, in order to decide which setting among A- B- C-, A- B- C+, ..., A+
B+ C+ to choose, all could be tried at once. This can be done by keeping

2 copies of all the variables influenced by parameter A: one variable set
representing the setting A- and the other – A+. Those 2 variable sets
could be also used in 2 ways – each with respect to processing required by
B- and B+ resulting in 4 variable sets representing the choices A- B-, A-
B+, A+ B- and A+ B+. And in the same manner, the C choice would
generate 8 sets of affected variables. Finally, as the original algorithm
produces one result, the modified multiple-variable version would produce
8 values per iteration.
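A sketch of the idea, with a placeholder postprocess() standing for the cheap, parameter-dependent part that follows the expensive core computation (both names are illustrative, not taken from the tested code):

from itertools import product

PARAMS = ['A', 'B', 'C']                     # three binary choices

def postprocess(core_result, settings):
    # Placeholder for the inexpensive, parameter-dependent step (assumed).
    return (core_result, tuple(sorted(settings.items())))

def run_all_variants(core_result):
    # The expensive core is computed once; the post-core processing is
    # repeated for every combination, yielding 2**K outcomes per
    # iteration instead of one.
    return {combo: postprocess(core_result, dict(zip(PARAMS, combo)))
            for combo in product('-+', repeat=len(PARAMS))}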
The details of the procedure, namely which variables need to be traced in
multiple copies, depend on the algorithm in question. Though the process
might seem to change the structure of the algorithm – using a data structure
in place of a single variable – once this step is properly implemented, it
does not increase the conceptual complexity if 2 or 10 variables are traced.
Actually, with the use of any programming language allowing abstractions,
such as an object-oriented language, it is easy to reveal the internal nature
of variables only where necessary - without the need for any major code
changes where the modified variables are merely passed.
Handling the variable choices obviously increases the computational
complexity of the algorithm; however, as will be shown on an example, the
overhead can be negligible when the variable parameters concern choices
outside the computationally intensive core of the algorithm, as is usually
the case for fine-tuning.[3]

[3] It is a matter of terminology what constitutes parameter tuning and what
constitutes development of a new algorithm.

Superior Parameter Detection


Continuing the above case with 3 binary choices, for each classifier applica-
tion 8 outcomes would be generated instead of one, if all the 3 parameters
were fixed. Concentrating on just one parameter, say A, divides the 8
outcomes into 2 sets: this with A- and A+ – each including 4 elements
indexed by variants of the other parameters: B- C-, B- C+, B+ C-, B+
C+. The identical settings for the other parameters allow us to observe
the influence of value of A by comparing the corresponding outcomes.
The comparisons can be made according to the binomial test, as dis-
cussed (Salzberg, 1997). In order to collect the statistics, several itera-
tions – applications of the algorithm – will usually be required, depending
on the number of variable choices – so outcomes – at each iteration, and
the required confidence. With 3 variable choices, each application allows 4
comparisons – in general, tracing K choices allows 2^(K−1).
This analysis can reveal if a certain parameter setting results in signifi-
cantly better performance. The same procedure, and algorithm outcomes,
can be used for all the parameters, here including also B and C, which
equally divide the outcomes into B- and B+, etc. Any decisive results
obtained in such a way indicate a strong superiority of a given parameter
value – regardless of the combinations of the other parameters. However,
in many cases the results cannot be expected to be so crisp – with the
influence of parameter values inter-dependent, i.e. which given parameter
value is optimal may depend on the configuration of the other parameters.
In that case the procedure can be extended, namely the algorithm out-
comes can be divided according to value of a variable parameter, let it be
A, into 2 sets: A- and A+. Each of the sets would then be subject to the
procedure described above, with the already fixed parameter excluded. So
the analysis of the set A- might, for example, reveal that parameter B+
gives superior results no matter what the value of the other parameters
(here: only C left), whereas analysis of A+ might possibly reveal superior-
ity of B-. The point to observe is that fixing one binary variable reduces the
cardinality of the sample by half, thus twice as many algorithm iterations
will be required for the same cardinality of the analyzed sets. This kind
of analysis might reveal the more subtle interactions between parameters,
helpful in understanding why the algorithm works the way it does.

Parallel Experiments
In the limit, the extended procedure will lead to 2^K sets obtaining one
element per iteration, K being the number of binary parameters traced. The
sets so obtained can be subject to another statistical analysis, this time the
gains in computation coming from the fact that once generated, the 2^K
sets can be compared to a designated set, or even pair-wise, corresponding
to many experiments.

The statistics used in this case can again involve the binomial compari-
son or – unlike in the previous case – a test based on random sampling. In
the superior parameter detection mode, the divisions obtained for a single
parameter most likely do not have normal distribution, thus tests assuming
it, such as the t-test, are not applicable. Since the binomial test does not
make any such assumption it was used.
However, if the compared sets are built in one-element-per-iteration
fashion, where each iteration is assumed to be independent (or random
generator dependent) from the previous one, the sets can be considered
random samples. The fact that they originate from the same random
generator sequence forming the outcomes at each iteration can actually be
considered helpful in getting a more reliable comparison of the sets – with
differences due only to the performance of the variants, not to variation in
the sampling procedure. This aspect could be considered another advantage
of the parallel experiments. However, discussing the more advanced tests
utilizing this property is beyond the scope of the current paper.

Example of Actual Application


This section provides a description of a classifier development (Zemke, 1999a),
which inspired the parameter tuning and testing procedure. Since the
developed algorithm was (believed to be) novel, there were no clear guidelines
as to which, among the many small but important choices within the algorithm,
should be preferred. By providing more results for analysis, the proposed
testing approach helped both to find promising parameters and to clarify
some misconceptions about the algorithm's performance. Generat-
ing the data took approximately one week of computation, thus repeating
the run for the 13 variants considered would be impractical.

Algorithm
The designed classifier was an extension of the nearest neighbor algorithm,
with parameters indicating what constitutes a neighbor, which features to
look at, how to combine neighbor classifications etc. The parameters were
optimized by a genetic algorithm (GA) whose population explored their

combinations. The idea believed to be novel, involved taking – instead
of the best GA-evolved classifier – part of the final GA-population and
bagging (Breiman, 1996) the individual classifiers together into an ensemble
classifier. Trying the idea seemed worthwhile since bagging is known to
increase accuracy benefiting from the variation in the ensemble – exactly
what a (not over-converged) GA-population should offer.
The computationally intensive part was the GA search – evolving a
population of parameterized classifiers and evaluating them. This had to
be done no matter if one was interested just in the best classifier or in
a bigger portion of the population. As proposed, the tested algorithm
needs to be multi-variant traced for a number of iterations. Here, iteration
involved a fresh GA run, and yielded accuracies (on the test set) – one for
each variant traced.
The questions concerning bagging the GA population involved: which
individual classifiers to bag – all above-random or only some of them,
how to weight their vote – by single vote or according to accuracy of the
classifiers, how to solicit the bagged vote – by simple majority or if the
majority was above a threshold. The questions gave rise to 3 parameters,
described below, and their 3 ∗ 2 ∗ 2 = 12 combinations, listed in Table 3
indicating which parameter (No) takes what value (+).

1. This parameter takes 3 values, depending on which of the above-random
(fitness) accuracy classifiers from the final GA population are included
(fitness) accuracy classifiers from the final GA population are included
in the bagged classifier: all, only the upper half, or a random half
among the above-random.

2. This binary parameter distinguishes between an unweighted vote (+),
where each classifier adds 1 to its class, and a weighted vote (-), where
the class vote is incremented according to the classifier’s accuracy.

3. This binary parameter decides how the bagged ensemble decision is
reached – by taking the class with the biggest cumulative vote (+), or
(-) only when the majority exceeds that of the next class by more than 1/3
of the total votes, returning the bias of the training data otherwise.

No  Parameter setting         1  2  3  4  5  6  7  8  9  10 11 12
1   Upper half bag            +  +  +  +  -  -  -  -  -  -  -  -
1   All above-random bag      -  -  -  -  +  +  +  +  -  -  -  -
1   Half above-random bag     -  -  -  -  -  -  -  -  +  +  +  +
2   Unweighted vote           +  +  -  -  +  +  -  -  +  +  -  -
3   Majority decision         +  -  +  -  +  -  +  -  +  -  +  -

Figure 5.9: Settings for 12 parameter combinations.

Parameter Analysis
The parameter analysis can identify algorithm settings that give superior
performance, so they can be set to these values. The first parameter has 3
values which can be dealt with by checking if results for one of the values
are superior to both of the others. Table 4 presents the comparisons as
probabilities for erroneously deciding superiority of the left parameter set
versus one on the right. Thus, for example, in the first row comparison of
{1..4 } vs. {5..8}, which represent the different settings for parameter 1,
the error 0.965 by 128 iterations indicates that setting {1..4 } is unlikely
to be better than {5..8}. Looking at it the other way: {5..8} is more likely
to be better than {1..4 } with error[4] around 0.035 = 1 − 0.965. The setting
{0} stands for results by the reference non-bagged classifier – respective
GA run fittest. The results in Table 4 allow us to make some observations
concerning the parameters. The following conclusions are for results up to
128 iterations, with the results for the full trials of up to 361 iterations
included for comparison only.

1. There is no superior value for parameter 1 – such that it would outperform
all the other values.

2. Both settings for parameter 2 are comparable.


[4] The error probabilities of A- vs. A+ and A+ vs. A- do not add exactly to 1 for two reasons. First,
draws are possible, thus the former situation, S successes out of N trials, can lead to fewer than F = N − S
successes for the latter, so adding only the non-draw binomial probabilities would amount to less than
1. And second, even if there are no draws, both error binomial sums would involve a common factor
binomial(N, S) = binomial(N, N − S), making the complementary probabilities add to more than 1.
Thus for the analysis to be strict, the opposite-situation error should be computed from scratch.

No Parameter settings/Iterations 32 64 128 361
1 {1..4 } vs. {5..8} 0.46 0.95 0.965 0.9985
1 {1..4 } vs. {9..12 } 0.53 0.75 0.90 0.46
1 {5..8 } vs. {9..12 } 0.33 0.29 0.77 0.099
2 {1,2,5,6,9,10} vs. {3,4,7,8,11,12} 0.53 0.6 0.24 0.72
3 {1,3,5,7,9,11} vs. {2,4,6,8,10,12} 0.018 9E-5 3E-5 0
- {1} vs. {0} 0.0035 0.0041 0.013 1E-6
- {2} vs. {0} 0.19 0.54 0.46 0.91
- {3} vs. {0} 0.055 0.45 0.39 0.086
- {4} vs. {0} 0.11 0.19 0.33 0.12
- {5} vs. {0} 0.0035 3.8E-5 6.2E-5 0
- {6} vs. {0} 0.30 0.64 0.87 0.89
- {7} vs. {0} 0.055 0.030 0.013 1.7E-4
- {8} vs. {0} 0.025 0.030 0.02 0.0011
- {9} vs. {0} 0.055 7.8E-4 2.5E-4 0
- {10} vs. {0} 0.11 0.35 0.39 0.73
- {11} vs. {0} 0.19 0.64 0.39 0.085
- {12} vs. {0} 0.055 0.030 0.0030 0.0016

Figure 5.10: Experimental parameter setting comparisons.

3. Majority decision ({1,3,5,7,9,11}), for parameter 3, is clearly outperforming,
with confidence 99.99% by 64 iterations.

4. In the comparisons against the non-bagged {0}, settings 5 and 9 are
more accurate, at less than 0.1% error (by iteration 128), pointing out
superior parameter values.

Speed up
In this case the speed up of the aggregate experiments – as opposed to
individual pair-wise comparisons – comes from the fact that the most com-
putationally intensive part of the classification algorithm – the GA run –
does not involve the multiply-threaded variables. They come into play
only when the GA evolution is finished and different modes of bagging and
non-bagging are evaluated.
Exploring variants outside the inner loop can still benefit algorithms in
which multiple threading will have to be added to the loop thus increasing

the computational burden. In this case, the cost of exploring the core
variants should be fully utilized by carefully analyzing the influence of the
(many) post-core settings as not to waste the core computation due to
some unfortunate parameter choice afterwards.

Conclusion
This paper proposes a method for testing multiple parameter settings in
one experiment, thus saving on computation-time. This is possible by
simultaneously tracing processing for a number of parameters and, instead
of one, generating many results – for all the variants. The multiple data can
then be analyzed in a number of ways, such as by the binomial test used
here for superior parameters detection. This experimental approach might
be of interest to practitioners developing classifiers and fine-tuning them
for particular applications, or in cases when testing is computationally
intensive.
The current approach could be refined in a number of ways. First, finer
statistical framework could be provided taking advantage of the specific
features of the data generating process, thus providing crisper tests, possi-
bly at smaller sample size. Second, some standard procedures for dealing
with common classifiers could be elaborated, making the proposed devel-
opment process more straightforward.

On Developing Financial Prediction
System: Pitfalls and Possibilities
DMLL Workshop at ICML-2002, Australia, 2002

On Developing Financial Prediction System:
Pitfalls and Possibilities
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Published: Proceedings of DMLL Workshop at ICML-2002, 2002

Abstract A successful financial prediction system presents many challenges. Some
are encountered over and over again, and though an individual solution might be
system-specific, general principles still apply. Using them as a guideline might save
time and effort, boost results, as such promoting a project's success.
This paper remarks on prediction system development, stemming from the author's
experiences and published results. The presentation follows the stages in a prediction
system development: data preprocessing, prediction algorithm selection and boosting,
and system evaluation – with some commonly successful solutions highlighted.

Introduction
Financial prediction presents challenges encountered over and over again. The paper
highlights some of the problems and solutions. A predictor development
demands extensive experimentation: with data preprocessing and selection,
the prediction algorithm(s), a matching trading model, evaluation and tun-
ing – to benefit from the minute gains, but not fall into over-fitting. The
experimentation is necessary since there are no proven solutions, but ex-
periences of others, even failed, can speed the development.
The idea of financial prediction (and resulting riches) is appealing,
initiating countless attempts. In this competitive environment, if one
wants above-average results, one needs above-average insight and sophisti-
cation. Reported successful systems are hybrid and custom made, whereas

97
straightforward approaches, e.g. a neural network plugged to relatively
unprocessed data, usually fail (Swingler, 1994).
The individuality of a hybrid system offers chances and dangers. One
can bring together the best of many approaches, however the interaction
complexity hinders judging where the performance dis/advantage is coming
from. This paper provides hints for the major steps in a prediction system
development, based on the author's experiments and published results.
The paper assumes some familiarity with machine learning and financial
prediction. As a reference one could use (Hastie et al., 2001; Mitchell,
1997), including java code (Witten & Frank, 1999), applied to finance
(Deboeck, 1994; Kovalerchuk & Vityaev, 2000). Non-linear analysis (Kantz
& Schreiber, 1999a), in finance (Deboeck, 1994; Peters, 1991). Ensemble
techniques (Dietterich, 2000), in finance (Kovalerchuk & Vityaev, 2000).

Data Preprocessing
Before data is fed into an algorithm, it must be collected, inspected, cleaned
and selected. Since even the best predictor will fail on bad data, data
quality and preparation is crucial. Also, since a predictor can exploit only
certain data features, it is important to detect which data preprocess-
ing/presentation works best.

Visual inspection is invaluable. At first, one can look for: a trend – which
may need removing, the histogram shape – which may call for redistribution,
missing values and outliers, and any regularities. There are financial data
characteristics (Mantegna & Stanley, 2000) that differ from the normally-distributed,
aligned data assumed in the general data mining literature.
Outliers may require different considerations: 1) genuine big changes – of big interest
to prediction, such data could even be multiplied to promote recognition; 2) jumps
due to a change in how a quantity is calculated, e.g. stock splits – all previous data
could be re-adjusted, or a single outlier treated as a missing value; 3) outlier
regularities could signal a systematic error.

Fat tails – extreme values more likely as compared to the normal distribution – is an es-
tablished property of financial returns (Mantegna & Stanley, 2000). It can matter in
1) situations which assume a normal distribution, e.g. generating missing/surrogate
data w.r.t. the normal distribution will underestimate extreme values; 2) outlier de-
tection. If capturing the actual distribution is important, the data histogram can
be preferred to parametric models.

Time alignment – same date-stamp data may differ in the actual time as long as the
relationship is kept constant. The series originating the predicted quantity sets
the time – extra time entries in other series may be skipped, whereas missing in
other series may need to be restored. Alternatively, all series could be converted to
event-driven time scale, especially for intra-day data (Dacorogna et al., 2001).

Missing values can be dealt with by data mining methods (Han & Kamber, 2001;
Dacorogna et al., 2001). If a miss spoils a temporal relationship, restoration
is preferable to removal. Conveniently, all misses in the raw series are
restored for feature derivation, alignment etc., with any later instances
of undefined values skipped. If data restorations are numerous, a test
whether the predictor picks up the inserted bias is advisable.

Detrending removes the growth of a series. For stocks, indexes, and cur-
rencies converting into logarithms of subsequent (e.g. daily) returns does
the trick. For volume, dividing it by the average of the last k quotes, e.g.
yearly, can scale it down.

Noise, minimally at the price discretisation level, is prevalent; especially low
volume markets should be treated with suspicion. Discretisation of a series
into a few (< 10) categories (Gershenfeld & Weigend, 1993), along with noise
cleaning, could be evaluated against prediction quality. Simple cleaning: for
each series value, find its nearest neighbors based on surrounding values, and
then substitute the value by an average of the original and the corresponding
values from the neighbors (Kantz & Schreiber, 1999a). Other operations
limiting noise: averaging, instance multiplication, sampling – mentioned below.

Normalization. Discretization – mapping the original values to fewer (new)


ones – e.g. positive to 1 and other to -1 – is useful for noise reduction and
for nominal input predictors. Subsequent predictor training with input dis-
cretized into a decreasing number of values can estimate noise – prediction
accuracy could increase (Kohavi & Sahami, 1996) once the difference between
discretized values exceeds the noise, to decline later when too rough a
discretization ignores important data distinctions.
Redistribution – changing the frequency of some values in relation to
others – can better utilize available range, e.g. if daily returns were linearly
scaled to (-1, 1), majority would be around 0.
Normalization brings values to a certain range, minimally distorting
initial data relationships. The SoftMax norm increasingly squeezes extreme
values while linearly mapping the middle, e.g. the middle 95% of input values
could be mapped to [-0.95, 0.95], with the bottom and top 2.5% mapped
nonlinearly to (-1, -0.95) and (0.95, 1) respectively. Normalization should
precede feature selection, as non-normalized series may confuse the process.
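One possible realization of such a SoftMax-style mapping, assuming NumPy and a non-constant series; the exponential squashing of the tails is an assumption, as the text does not fix the exact form:

import numpy as np

def softmax_normalize(x, middle=0.95):
    # Map the middle `middle` fraction of values linearly onto
    # [-middle, middle]; squash the tails into (-1, -middle) and (middle, 1).
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [(1 - middle) / 2, (1 + middle) / 2])
    scale = (hi - lo) / 2
    out = np.empty_like(x)
    mid = (x >= lo) & (x <= hi)
    out[mid] = -middle + 2 * middle * (x[mid] - lo) / (hi - lo)
    up, down = x > hi, x < lo
    out[up] = middle + (1 - middle) * (1 - np.exp(-(x[up] - hi) / scale))
    out[down] = -middle - (1 - middle) * (1 - np.exp(-(lo - x[down]) / scale))
    return out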

Series to instances conversion is required by most learning algorithms, which
expect as input a fixed-length vector. It can be a delay vector derived from
a series, a basic technique in nonlinear analysis (Kantz & Schreiber, 1999a):
v_t = (series_t, series_{t-delay}, ..., series_{t-(D-1)*delay}). The delay can
be the smallest one giving zero autocorrelation when applied to the series.
Such vectors with the same time index t – coming from all input series – appended
give an instance, its coordinates referred to as features or attributes.
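A sketch of both steps, assuming aligned NumPy series; the zero-autocorrelation delay choice follows the text, while the function names are illustrative:

import numpy as np

def first_zero_autocorr_delay(series, max_delay=50):
    # Smallest delay at which the series' autocorrelation first drops to
    # (or below) zero -- one way to choose the delay-vector spacing.
    x = np.asarray(series, dtype=float) - np.mean(series)
    for d in range(1, max_delay):
        if np.corrcoef(x[:-d], x[d:])[0, 1] <= 0:
            return d
    return max_delay

def instance(all_series, t, dim=10, delay=1):
    # One instance: delay vectors with the same time index t from every
    # input series, appended into a single feature vector.
    return np.concatenate([[s[t - i * delay] for i in range(dim)]
                           for s in all_series])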

Data multiplication can be done on many levels. The frequency of a series
can be increased by adding (Fourier) interpolated points (Gershenfeld &
Weigend, 1993).
Instances can be cloned with some features supplemented with Gaussian
noise, 0-mean, deviation between the noise level already present in the
feature/series, and the deviation of that series. This can be useful when
only few instances are available for an interesting type, e.g. instances
with big return. Such data forces the predictor to look for important
characteristics ignoring noise – added and intrinsic. Also, by relatively
increasing the number of interesting cases, training will pay more attention
to their recognition.
Including more series can increase the number of features. A simple test of
what to include is to look for series significantly correlated with the predicted
one. It is more difficult to add non-numerical series; however, adding a text
filter for keywords in news can bring a substantial advantage.

Indicators are series derived from others, enhancing some features of in-
terest, such as trend reversal. Over the years traders and technical analysts
trying to predict stock movements developed the formulae (Murphy, 1999),
some later confirmed to carry useful information (Sullivan et al., 1999).
Feeding indicators into a prediction system is important due to 1) the averaging,
thus noise reduction, present in many indicator formulae, 2) providing
views of the data suitable for prediction. Common indicators follow.

MA, Moving Average, is the average of the past k values up to date. Exponential Moving
Average: EMA_n = weight * series_n + (1 − weight) * EMA_{n-1}.

Stochastic (Oscillator) places the current value relative to the high/low range in a pe-
riod: (series_n − low(k)) / (high(k) − low(k)), where low(k) is the lowest among the k
values preceding n; k often 14 days.

MACD, Moving Average Convergence Divergence, difference of short- and long-term
exponential moving averages, 8 and 17, or 12 and 26 days used.

ROC, Rate of Change, ratio of the current price to price k quotes earlier, k usually 5 or
10 days.

RSI, Relative Strength Index, relates growths to falls in a period. RSI can be computed
as the sum of positive changes (i.e. series_i − series_{i-1} > 0) divided by the sum of
all absolute changes, taking the last k quotes; k usually 9 or 14 days.
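Minimal sketches of a few of the indicators above, assuming a plain price sequence; parameter defaults follow the typical values mentioned, not a fixed convention:

import numpy as np

def ema(prices, weight=0.2):
    # EMA_n = weight * price_n + (1 - weight) * EMA_{n-1}
    out = [prices[0]]
    for p in prices[1:]:
        out.append(weight * p + (1 - weight) * out[-1])
    return np.array(out)

def stochastic(prices, k=14):
    # (price_n - low(k)) / (high(k) - low(k)) for the latest quote
    window = prices[-k:]
    return (prices[-1] - min(window)) / (max(window) - min(window))

def roc(prices, k=10):
    # Ratio of the current price to the price k quotes earlier
    return prices[-1] / prices[-1 - k]

def rsi(prices, k=14):
    # Sum of positive changes over the last k quotes divided by the sum
    # of absolute changes -- the definition given above
    changes = np.diff(prices[-(k + 1):])
    return changes[changes > 0].sum() / np.abs(changes).sum()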

Sampling. In my experiments with NYSE predictability, skipping the 0.5 fraction of
training instances with the lowest weight (i.e. weekly return) enhanced predictions,
as similarly reported (Deboeck, 1994). The distribution (for returns
approximated by lognormal) was such that the lowest-return half consti-
tuted only 0.2 of the cumulative return, and lowest 0.75 – 0.5 (Mantegna &
Stanley, 2000). The improvement could be due to skipping noise-dominated
small changes, and/or bigger changes ruled by a mechanism whose learn-
ing is distracted by the numerous small changes. Thus, while sampling, it
might be worth under-representing small weight instances, missing value-
filled, evident-outlier instances and older ones. The amount of data to
train a model can be estimated (Walczak, 2001).

Bootstrap – sampling with repetitions as many elements as in the original –
and deriving a predictor for each such sample, is useful for collecting
various statistics (LeBaron & Weigend, 1994), e.g. performance, also en-
semble creation or best predictor selection (e.g. via bumping), however not
without limits (Hastie et al., 2001).

Feature selection can make learning feasible, as, because of the curse of
dimensionality (Mitchell, 1997), long instances demand (exponentially) more
data. As always, the feature choice should be evaluated together with the
predictor, as assuming a feature is important because it worked well with
other predictors may mislead.

Principal Component Analysis (PCA) and – claimed better for stock data – Independent
Component Analysis (Back & Weigend, 1998) reduce the dimension by proposing
a new set of salient features.

Sensitivity Analysis trains a predictor on all features and then drops those least
influencing predictions. Many learning schemes internally signal important features,
e.g. (C4.5) decision trees use them first, neural networks assign them the highest
weights, etc.

Heuristics such as hill-climbing or genetic algorithms operating on binary feature
selection can be used not only to find salient feature subsets, but also – invoked
several times – to provide different sets for ensemble creation.

Predictability assessment allows one to concentrate on feasible cases (Hawawini
& Keim, 1995). Some of the tests below are simple non-parametric predictors –
prediction quality reflecting predictability, measured, e.g., by the ratio of the
standard error to the series standard deviation.

Linear methods measure correlation between predicted and feature series – significant
non-zero implying predictability (Tsay, 2002). Multiple features can be taken into
account by multivariate regression.

Nearest Neighbor (Mitchell, 1997) offers a powerful local predictor. It is distracted by
noisy/irrelevant features, but if this is ruled out, failure suggests that the most that
can be predicted are general regularities, e.g. an overall outcome probability.

Entropy measures information content, i.e. deviation from randomness (Molgedey &
Ebeling, 2000). This general measure, not demanding big amounts of data and
useful in discretisation or feature selection, is worth getting familiar with.

Compressibility – the ratio of the compressed to the original sequence length – shows how
regularities can be exploited by a compression algorithm (which could be the basis
of a predictor). An implementation: the series digitized to 4-bit values, packed in
pairs into a byte array and subjected to Zip compression (Feder et al., 1992).
Detrended Fluctuation Analysis (DFA) reveals long term correlations (self-similarity)
even in non-stationary time series (Vandewalle et al., 1997). DFA is more robust,
so it is preferred to Hurst analysis – a sensitive statistic of cycles whose proper
interpretation requires experience (Peters, 1991).
Chaos and Lyapunov exponent test short-term determinism, thus predictability (Kantz
& Schreiber, 1999a). However, the algorithms are noise-sensitive and require long
series, thus conclusions should be cautious.
Randomness tests, like chi-square, can assess the likelihood that the observed (digi-
tized) sequence is random. Such a test on patterns of consecutive digits could hint
at pattern presence or randomness.
Non-stationarity test can be implemented by dividing data into parts and computing
part i predictability based only on part j data. The variability of the measures
(visual inspection encouraged), such as standard deviation, assesses stationarity.

A battery of tests could include linear regression, DFA for long term
correlations, compressibility for entropy-based approach, Nearest Neighbor
for local prediction, and a non-stationarity test.
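A sketch of the compressibility test, using zlib as a stand-in for the Zip compression and quantile-based 4-bit digitization (both concrete choices are assumptions):

import zlib
import numpy as np

def compressibility(series, n_bins=16):
    # Digitize to 4-bit codes, pack two codes per byte, compress, and
    # return compressed/original length; lower ratios hint at structure.
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    codes = np.searchsorted(edges, series).astype(np.uint8)    # 0..15
    if len(codes) % 2:
        codes = np.append(codes, 0)                            # pad to even
    packed = (codes[0::2] << 4 | codes[1::2]).tobytes()
    return len(zlib.compress(packed)) / len(packed)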

Prediction Algorithms
Below, common learning algorithms (Mitchell, 1997) are discussed, pointing
out features important to financial prediction.

Linear methods, not the main focus here, are widely used in financial pre-
diction (Tsay, 2002). In my Weka (Witten & Frank, 1999) experiments,
Locally Weighted Regression (LWR) – a scheme weighting Nearest Neighbor
predictions – discovered regularities in NYSE data.[5] Also, Logistic – non-
linear regression for discrete classes – performed above-average and with
speed. As such, regression is worth trying, especially its schemes more spe-
cialized to the data (e.g. Logistic to discrete) and as a final optimization –
weighting other predictions (LWR).
[5] Unpublished, ongoing work.

Neural Network (ANN) – seems the method of choice for financial pre-
diction (Kutsurelis, 1998; Cheng et al., 1996). Backpropagation ANNs
present the problems of long training and guessing the net architecture.
Schemes training the architecture along with the weights could be preferred
(Hochreiter & Schmidhuber, 1997; Kingdon, 1997), limiting under-performance
due to a wrong (architecture) parameter choice. Note that a failure of an ANN
attempt, especially using a general-purpose package, does not mean prediction
is impossible. In my experiments, Voted Perceptron performance often compared
with that of an ANN; this could be a start, especially when speed is important,
such as in ensembles.

C4.5, ILP – generate decision trees/if-then rules – human understand-


able, if small. In my experiments with Progol (Mitchell, 1997) – an otherwise
successful rule-learner – applied to NYSE data, rules (resembling technical
ones) seldom emerged; Weka J48 (C4.5) tree-learner predictions did not perform
well; GA-evolved rules' performance was very sensitive to the 'right' background
predicates (Zemke, 1998). The conclusion is that small rule-based models cannot
express certain relationships nor perform well with noisy, at times inconsistent,
financial data (Kovalerchuk & Vityaev, 2000). Ensembles of decision trees can
make up for the problems, but
readability is usually lost. Rules can also be extracted from ANN, offering
accuracy and readability (Kovalerchuk & Vityaev, 2000).

Nearest Neighbor (NN) does not create a general model, but to predict,
it looks back for the most similar case(s) (Mitchell, 1997). Irrelevant/noisy
features disrupt the similarity measure, so pre-processing is worthwhile.
NN is a key technique in nonlinear analysis, which offers insights, e.g.
weighting more neighbors, efficient NN search (Kantz & Schreiber, 1999a).
Cross-validation (Mitchell, 1997) can also decide an optimal number of
kNN neighbors. Ensembles/bagging of NNs trained on different instance
samples usually do not boost accuracy, though trained on different feature
subsets they might.

Bayesian classifier/predictor first learns probabilities of how evidence sup-
ports outcomes, then used to predict a new case's outcome. Though
the simple scheme is robust to violating the ’naive’ independent-evidence
assumption, watching independence might pay off, especially as in decreas-
ing markets variables become more correlated than usual. The Bayesian
scheme might also combine ensemble predictions – more optimally than
majority voting.

Support Vector Machines (SVM) are a relatively new and powerful learner,
with attractive characteristics for time series prediction (Muller et al.,
1997). First, an SVM deals with multidimensional instances, actually the
more features the better – reducing the need for (possibly wrong) feature
selection. Second, it has few parameters, thus finding optimal settings can
be easier, with one of the parameters referring to the noise level the system
can handle.

Performance improvement
Most successful prediction systems are hybrid: several learning schemes coupled
together (Kingdon, 1997; Cheng et al., 1996; Kutsurelis, 1998; Kovalerchuk
& Vityaev, 2000), with predictions, indications of their quality, biases, etc., fed
into a (meta-learning) final decision layer. The hybrid architecture may
also stem from performance improving techniques:

Ensemble (Dietterich, 2000) is a number of predictors whose votes are put together
into the final prediction. The predictors, on average, are expected to be above random
and to make independent errors. The idea is that a correct majority offsets individual
errors, thus the ensemble will be correct more often than an individual predictor.
The diversity of errors is usually achieved by training a scheme, e.g. C4.5, on differ-
ent instance samples or features. Alternatively, different predictor types – like C4.5,
ANN, kNN – can be used or the predictor’s training can be changed, e.g. by choos-
ing the second best decision, instead of first, building C4.5 decision tree. Common
schemes include bagging, boosting and their combinations and Bayesian ensembles
(Dietterich, 2000). Boosting is particularly effective in improving accuracy.
Note: an ensemble is not a panacea for non-predictable data – it only boosts accu-
racy of already performing predictor. Also, readability, efficiency are decreased.

Genetic Algorithms (GAs) (Deboeck, 1994) explore novel possibilities, often not thought of by humans. Therefore, it is always worth keeping some decisions as parameters that can be (later) GA-optimized, e.g. feature preprocessing and selection, sampling strategy, predictor type and settings, trading strategy. GAs (typically) require a fitness function – reflecting how well a solution is doing. A common mistake is to define the fitness one way and to expect the solution to perform another way; e.g. if not only return but also variance is important, both factors should be incorporated into the fitness, as in the sketch below. Also, with more parameters and the GA's ingenuity it is easier to overfit the data, thus testing should be more careful.
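A hedged sketch of such a fitness function, rewarding mean return while penalizing variability (the Sharpe-like form and the returns() helper are assumptions for illustration):

import statistics

def fitness(candidate, data, returns, risk_free=0.0, eps=1e-9):
    # returns() is assumed to yield the per-period returns of the candidate strategy on the data.
    r = returns(candidate, data)
    mean_excess = statistics.mean(r) - risk_free
    return mean_excess / (statistics.pstdev(r) + eps)  # higher is better: return per unit of risk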

Local, greedy optimization can improve an interesting solution. It is worth combining with a global optimization, like GAs, which may get near a good solution without reaching it. If the parameter space is likely nonlinear, it is better to use a stochastic search, like simulated annealing, rather than a simple hill-climb.

Pruning properly applied can boost both 1) speed – by skipping unnecessary computation, and 2) performance – by limiting overfitting. Occam's razor – among equally performing models, prefer the simpler – is a robust criterion for selecting predictors, e.g. Network Regression Pruning (Kingdon, 1997) and MMDR (Kovalerchuk & Vityaev, 2000) successfully use it. In C4.5, tree pruning is an intrinsic part. In ANN, weight decay schemes (Mitchell, 1997) reduce towards 0 the connections not sufficiently promoted by training. In kNN, a few prototypes often perform better than referring to all instances – as mentioned, high-return instances could be candidates. In ensembles, if the final vote is weighted, as in AdaBoost (Dietterich, 2000), only the highest-weighted predictors matter.

Tabu, cache, incremental learning and gene GA can accelerate search, allowing more exploration, bigger ensembles etc. Tabu search prohibits re-visiting recent points – besides not duplicating computation, it forces the search to explore new areas. Caching stores computationally expensive results for a quick recall, e.g. (partial) kNN can be precomputed. Incremental learning only updates a model as new instances arrive, e.g. training an ANN could start with an ANN previously trained on similar data, speeding up convergence. Gene expression GAs optimize a solution's compact encoding (gene), instead of the whole solution, which is derived from the encoding for evaluation.
I use a mixture: optimizing genes stored in a tabu cache (logged and later scrutinized if necessary).

What if everything fails but the data seems predictable? There are still possibilities: more relevant data; playing with noise reduction/discretization; making the prediction easier, e.g. instead of return predicting volatility (and separately direction), or instead of a stock (which may require company data) predicting an index, or a stock in relation to the index; changing the horizon – prediction in 1 step vs. many; another market or trading model.

Trading model, given predictions, makes trading decisions, e.g. predicted up – long position, down – short, with more possibilities (Hellström & Holmström, 1998). Return is just one objective; others include minimizing variance, maximal loss (bankruptcy), risk (exposure), trades (commissions), taxes; the Sharpe ratio etc. A practical system employs precautions against the predictor's non-performance: monitoring recent performance and signaling if it is below the accepted/historic level. In non-stationary markets it is crucial to allow for market shifts beyond control – politics, disasters, the entry of a big player. If the shifts cannot be dealt with, they should at least be signaled before inflicting irreparable loss. This touches the subject of a bigger (money) management system, taking the predictions into account while hedging, but it is beyond the scope of this paper.

System Evaluation
Proper evaluation is critical to prediction system development. First, it has to measure exactly the effect of interest, e.g. trading return, as opposed to prediction accuracy. Second, it has to be sensitive enough to distinguish often minor gains. Third, it has to convince that the gains are not merely a coincidence.

Evaluate the right thing. Financial forecasts are often developed to support semi-automated trading (profitability), whereas the algorithms underlying those systems might have a different objective. Thus, it is important to test the system in the setting it is going to be used in – a trivial, but often missed, notion. Also, the evaluation data should be of exactly the same nature as planned for the real-life application, e.g. an index-futures trading system performed well with index data used as a proxy for the futures price, but real futures data degraded it. Some problems with common evaluation strategies (Hellström & Holmström, 1998) follow.

Accuracy – percentage of correct discrete (e.g. up/down) predictions; a common measure for discrete systems, e.g. ILP/decision trees. It values instances equally, disregarding both an instance's weight and the accuracies for different cases, e.g. a system might get a high score predicting the numerous small changes while missing the few big ones. Actually, some of the best-performing systems have lower accuracy than could be found for that data (Deboeck, 1994).

Square error – the sum of squared deviations from actual outputs – is a common measure in numerical prediction, e.g. ANN. It penalizes bigger deviations; however, if the sign is what matters, this might not be optimal, e.g. predicting -1 for -0.1 gets a bigger penalty than predicting +0.1, though the latter might trigger going long instead of short. Square error minimization is often an intrinsic part of an algorithm, such as ANN backpropagation, and changing it might be difficult. Still, many such predictors, e.g. trained on bootstrap samples, can be validated according to the desired measure and the best one picked.

Reliability – the predictor's confidence in its forecast – is equally important and as difficult to develop as the predictor itself (Gershenfeld & Weigend, 1993). A predictor will not always be confident – it should be able to express this to the trading counterpart, human or not, e.g. by an 'undecided' output. Not trading on dubious predictions is beneficial in many ways: lower errors, commissions, exposure. In my experiments optimizing the reliability requirement, stringent values emerged – why trade if the predicted move and confidence are low? Reliability can be assessed by comparing many predictions: coming from an ensemble, as well as made in one-step and multiple-step fashion.

Performance measure (Hellström & Holmström, 1998) should incorporate the predictor and the (trading) model it is going to benefit. Some points: Commissions need to be incorporated – many trading 'opportunities' simply disappear once commissions are included. Risk/variability – what is the value of even a high-return strategy if in the process one goes bankrupt? Data difficult to obtain in real time, e.g. volume, might mislead historic-data simulations.

Evaluation bias, resulting from the evaluation scheme and time series data, needs to be recognized. Evaluation similar to the intended operation can minimize the performance estimate bias, though different tests can be useful to estimate different aspects, such as return and variance.
N-cross validation – data divided into N disjoint parts, N − 1 for training and 1 for testing, error averaged over all N (Mitchell, 1997) – in the case of time series data underestimates error. Reason: in at least N − 2 out of the N train-and-test runs, training instances both precede and follow the test cases, unlike in actual prediction when only the past is known. For series, the window approach is more appropriate.

Window approach – a segment ('window') of consecutive instances is used for training and a following segment for testing, the windows sliding over all the data as statistics are collected. Often, to save training time, the test segment consists of many instances. However, more than 1 instance overestimates error, since the training window does not include the data directly preceding some tested cases. Since markets undergo regime changes in a matter of weeks, the test window should be no longer than that, or a fraction of the training window (< 20%). To speed up training for the next test window, the previous window's predictor could be used as the starting point while training on the next window, e.g. instead of starting with random ANN weights. A minimal sketch follows.
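A minimal walk-forward sketch of this evaluation, assuming numpy arrays and a make_model factory returning a fresh predictor (the names are illustrative assumptions):

def walk_forward(X, y, make_model, train_size=500, test_size=20):
    # Train on a window of consecutive instances, test on the segment that follows,
    # then slide both windows forward by the test size.
    scores, start = [], 0
    while start + train_size + test_size <= len(X):
        tr = slice(start, start + train_size)
        te = slice(start + train_size, start + train_size + test_size)
        model = make_model().fit(X[tr], y[tr])
        scores.append((model.predict(X[te]) == y[te]).mean())
        start += test_size
    return scores  # per-window accuracies (or returns, if that is the chosen measure)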

Evaluation data should include different regimes, markets, even data errors, and be plentiful. Dividing test data into segments helps to spot performance irregularities (for different regimes).
Overfitting a system to data is a real danger. Dividing data into disjoint sets is the first precaution: training, validation for tuning, and a test set for performance estimation. A pitfall may be that the sets are not as separated as they seem, e.g. predicting returns 5 days ahead, a set may end at day D, but that instance may contain the return for day D + 5, which falls into the next set. Thus data preparation and splitting should be careful.
Another pitfall is using the test set more than once. Just by luck, 1 out of 20 trials is 95% above average, 1 out of 100 is 99% above, etc. In multiple tests, the significance calculation must factor that in, e.g. if 10 tests are run and the best appears 99.9% significant, it really is only 0.999^10 ≈ 99% significant (Zemke, 2000).
Multiple use can be avoided, for the ultimate test, by taking data that was not available earlier. Another possibility is to test on similar, not tuned-for, data – without any tweaking until better results, only with predefined adjustments for the new data, e.g. switching the detrending preprocessing on.

Surrogate data is a useful concept in nonlinear system evaluation (Kantz & Schreiber, 1999a). The idea is to generate data sets sharing characteristics of the original data – e.g. permutations of a series have the same mean, variance etc. – and for each compute an interesting statistic, e.g. the return of a strategy. To compare the original series' statistic to those of the surrogates, there are 2 ways to proceed: 1) If the statistic is normally distributed, the usual one/two-sided test comparing to the surrogates' mean is used. 2) If no such assumption holds, the nonparametric rank test can be used: If α is the acceptable risk of wrongly rejecting the null hypothesis that the original series' statistic is lower (higher) than that of any surrogate, then 1/α − 1 surrogates are needed; if all give a higher (lower) statistic than the original series, the hypothesis can be rejected. Thus, if a predictor's error was lower on the original series than in all 19 runs on surrogates, we can be 95% sure it is up to something.
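A sketch of this rank test with permutation surrogates (error_fn is an assumed function returning the predictor's error on a series):

import random

def surrogate_test(series, error_fn, alpha=0.05, seed=0):
    # Reject the null hypothesis at risk alpha if the original series' error
    # is lower than on all 1/alpha - 1 shuffled surrogates.
    random.seed(seed)
    n_surrogates = int(round(1 / alpha)) - 1  # e.g. 19 for alpha = 0.05
    original_error = error_fn(series)
    for _ in range(n_surrogates):
        surrogate = list(series)
        random.shuffle(surrogate)  # a permutation keeps the mean, variance etc.
        if error_fn(surrogate) <= original_error:
            return False  # cannot reject: the result may be a coincidence
    return True  # significant at level alpha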

Non/Parametric tests. Most statistical tests (Hastie et al., 2001; Efron & Tibshirani, 1993) have preconditions. They often involve assumptions about sample independence and distributions – unfulfilled, they lead to unfounded conclusions. Independence is tricky to achieve, e.g. predictors trained on overlapping data are not independent. If the sampling distribution is unknown, as it usually is, it takes at least 30, better 100, observations for normal-distribution statistics.
If the sample is smaller than 100, nonparametric tests are preferable, with less scope for assumption errors. The downside is that they have less discriminatory power for the same sample size (Heiler, 1999).
A predictor should significantly win (nonparametric) comparisons with naive predictors, sketched below: 1) The majority predictor outputs the commonest value all the time; for stocks it could be the dominant up move, translating into the buy-and-hold strategy. 2) The repeat-previous predictor for the next value issues the (sign of the) previous one.
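The two baselines, sketched for a numeric return series (the sign convention is an assumption):

from collections import Counter

def majority_predictor(history):
    # Always output the most common past sign; for stocks usually 'up', i.e. buy and hold.
    signs = [1 if x >= 0 else -1 for x in history]
    return Counter(signs).most_common(1)[0][0]

def repeat_previous_predictor(history):
    # Output the sign of the most recent value.
    return 1 if history[-1] >= 0 else -1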

Sanity checks involve common sense (Gershenfeld & Weigend, 1993). Pre-
diction errors along the series should not reveal any structure, unless the
predictor missed something. Do predictions on surrogate (permuted) series
discover something? If valid, this is the bottom line for comparison with
prediction on the original series – is it significantly better?

Putting it all together


To make the paper less abstract, some of the author's choices in a NYSE index prediction system follow. The ongoing research extends an earlier system (Zemke, 1998). The idea is to develop a 5-day return predictor, later to support a trading strategy.

Data used consists of 30 years of daily data for 5 NYSE indexes and 4 volume series. The data is plotted and series visibly mimicking others are omitted. Missing values are filled by a nearest neighbor algorithm, and the 5-day return series to be predicted is computed. The index series are converted to logarithms of daily returns; the volumes are divided by lagged yearly averages. Additional series are derived, depending on the experiment: 10- and 15-day MA and ROC for the indexes. Then all series are Softmax-normalized to -1..1 and discretized to 0.1 precision. In between major preprocessing steps, series statistics are computed – number of NaNs, min and max values, mean, st. deviation, 1,2-autocorrelation, zip-compressibility, linear regression slope, DFA – tracing whether preprocessing does what is expected: removing NaNs, trend and outliers, but not the zip/DFA predictability. In the simplest approach, all series are then put together into instances with D = 3 and delay = 2. An instance's weight is the absolute 5-day return at the corresponding time, and the instance's class is the return's sign.
The predictor is one of the Weka (Witten & Frank, 1999) classifiers handling numerical data, 4-bit-coded into a binary string together with: which instance features to use, how much past data to train on (3, 6, 10, 15, 20 years) and what fraction of the lowest-weight instances to skip (0.5, 0.75, 0.85). Such strings are GA-optimized, with already-evaluated strings cached and prohibited from costly re-evaluation. Evaluation: a predictor is trained on past data and used to predict values in a disjoint window, 20% the size of the data, ahead of it; this is repeated 10 times with the windows shifted by the smaller window size. The average of the 10 period returns, less the 'always up' return and divided by the st. deviation of the 10 values, gives a predictor's fitness.

Final Remarks

Financial markets, as described by the multidimensional data presented to a prediction/trading system, are complex nonlinear systems – with subtleties and interactions difficult for humans to comprehend. This is why, once a system has been developed, tuned and proven to perform on (volumes of) data, there is no room for human 'adjustments', except for going through the whole development cycle again. Without stringent re-evaluation, performance is likely to suffer.
System development usually involves a number of recognizable steps: data preparation – cleaning, selecting, making data suitable for the predictor; prediction algorithm development and tuning – for performance on the quality of interest; evaluation – to see if the system indeed performs on unseen data. But since financial prediction is very difficult, extra insights are needed. The paper has tried to provide some: data-enhancing techniques, predictability tests, performance improvements, evaluation hints and pitfalls to avoid. Awareness of them will hopefully make predictions easier, or at least bring the realization that they cannot be obtained any more quickly.

Ensembles in Practice: Prediction,
Estimation, Multi-Feature and Noisy
Data
HIS-2002, Chile, 2002

Ensembles in Practice: Prediction,
Estimation, Multi-Feature and Noisy Data
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Published: Proceedings of HIS-2002, 2002

Abstract This paper addresses 4 practical ensemble applications: time series prediction, estimating accuracy, and dealing with multiple-feature and noisy data. The intent is to refer a practitioner to ensemble solutions exploiting the specificity of the application area.

Introduction
Recent years have seen a big interest in ensembles – putting several classifiers together to vote – and for a good reason. Even weak, by themselves not so accurate, classifiers can create an ensemble beating the best learning algorithms. Understanding why and when this is possible, and what the problems are, can lead to even better ensemble use. Many learning algorithms incorporate voting. Neural networks apply weights to inputs and a nonlinear threshold function to summarize the 'vote'. Nearest neighbor (kNN) searches for k prototypes for a classified case, and outputs the prototypes' majority vote. If the definition of an ensemble allows all members to classify but only one to output, then Inductive Logic Programming (ILP) is also an example.
A classifier can also be put into an external ensemble. Methods for generating and putting classifiers together have been prescribed, reporting accuracy above that of the base classifiers. But this success and generality of ensemble use does not mean that there are no special cases benefiting from a problem-related approach. This might be especially important in extreme cases, e.g. when it is difficult to obtain above-random classifiers due to noise – an ensemble will not help. In such cases, it takes more knowledge and experiments (ideally by others) to come up with a working solution, which probably involves more steps and ingenuity than a standard ensemble solution. This paper presents specialized ensembles in 4 areas: prediction, estimating accuracy, dealing with multiple-feature and noisy data. The examples have been selected from many reviewed papers, with clarity and generality (within their area) of the solution in mind. The idea of this paper is to provide a problem-indexed reference to the existing work, rather than to detail it.

Why Ensembles Work


An ensemble outperforms the base classifier for several reasons (Dietterich, 2000). First, given limited training data, a learning algorithm can find several classifiers performing equally well. An ensemble minimizes the risk of selecting the wrong one, as more can be incorporated and averaged. Second, the learning algorithm's outcome – a classifier – might be merely a local optimum in the algorithm's search. Ensemble construction restarts the algorithm from different points, avoiding the pitfall. Third, a number of even simple classifiers together can express more complex functions – better matching the data.
It is not difficult to make ensembles work – most of the methods that 'disturb' either the training data or the classifier construction result in an ensemble performing better than a single classifier. Ensembles have improved classification results for data repositories with different algorithms as the base learner, thus suggesting that the ensemble is behind the progress. Ensembles can be stacked on top of each other, adding benefits: e.g. if boosting is good at improving accuracy and bagging at reducing variance, bagging boosted classifiers, or the other way around, may be good at both, as experienced (Webb, 1998). Ensemble size is not crucial – even a small one benefits. Much of the reduction in error appears at 10-15 classifiers (Breiman, 1996), but AdaBoost and Arcing measurably improve their test-set error until around 25 classifiers (Opitz & Maclin, 1999) or more.
Increased computational cost is the first bad news. An ensemble of tens, perhaps hundreds, of classifiers takes that much more to train, classify and store. This can be alleviated by simpler base classifiers – e.g. decision stumps instead of trees – and pruning, e.g. skipping low-weight members in a weighted ensemble (Margineantu & Dietterich, 1997a). Overfitting can result when an ensemble does not merely model the training data from many (random) angles, but tries to fit its whims. Such a way to boost accuracy may work on noise-free data, but in the general case it is a recipe for overfitting (Sollich & Krogh, 1996). Readability loss is another consequence of voting classifiers. Rule-based decision trees and predicate-based ILP, having similar accuracy to e.g. neural networks (ANN) and kNN, were favored in some areas because of their human-understandable models. However, an ensemble of 100 such models – different to make the ensemble work, and possibly weighted – destroys the readability.
A note on vocabulary. Bias refers to the part of the classification error due to the central tendency, or most frequent classification, of a learner when trained on different sets; variance – the error part due to deviations from the central tendency (Webb, 1998). Stable learning algorithms – not that sensitive to changes in the training set – include kNN, regression and Support Vector Machines (SVM), whereas decision trees, ILP and ANN are unstable. Global learning creates a model for the whole data, later used (the model, not the data) to classify new instances, e.g. ANN, decision tree, whereas local algorithms refrain from creating such models, e.g. kNN, SVM. An overview of learning algorithms can be found in (Mitchell, 1997).

Common Ensemble Solutions

For an ensemble to increase accuracy, the member classifiers need to have: 1) independent, or better negatively correlated, errors (Ali & Pazzani, 1995), and 2) expected above-random accuracy. Ensemble methods usually do not check these assumptions; instead they prescribe how to generate compliant classifiers. In this sense, the methods are heuristics, found effective and scrutinized for various aspects, e.g. suitable learning algorithms, ensemble size, data requirements etc.
An ensemble explicitly fulfilling the assumptions might be even more effective: smaller – including only truly contributing classifiers, more accurate – taking validated above-random classifiers etc. Experiments confirm this (Liu & Yao, 1998). However, even more advanced methods often refer to features of common ensembles. There are many ways to ensure the ensemble's classifier diversity, e.g. by changing the training data or the classifier construction process – the most common methods are described below.

Different Training Instances


Different subsets of the training set lead to different classifiers, especially if the learning algorithm is unstable. The subsets could be drawn without or with replacement, i.e. allowing multiple copies of the same example. A common setting is a bootstrap – drawing as many elements as in the whole set, however with replacement, thus having on average 63.2% of the original set's elements, some repeated. Another possibility is to divide the training set into N disjoint subsets and systematically train on N − 1 of them, leaving each one out in turn. A more general way is to assign weights to the examples and change them before training a new classifier respecting the weights.

Bagging (Breiman, 1996) – classifiers trained on different bootstrap samples are put to a majority vote – the class issued by most wins. Bagging improves accuracy by promoting the average classification of the ensemble, thus reducing the influence of individual variances (Domingos, 1997), with bias mostly intact (Webb, 1998); it handles noise (up to 20%) well (Dietterich, 1998), but does not work at all with stable learning methods.

AdaBoost (Freund & Schapire, 1995) forms a committee by applying a learning algorithm to a training set whose distribution is changed, after generating each classifier, so as to stress frequently misclassified cases. While classifying, a member of the ensemble is weighted by a factor proportional to its accuracy. With little training data, AdaBoost performs better than Bagging (Dietterich, 1998); however, it may deteriorate if there is insufficient training data relative to the complexity of the base classifiers or if their training errors grow (Schapire et al., 1997).

Feature selection
Classifiers can be trained with different feature subsets. The selection can
be random or premeditated, e.g. providing a classifier with a selection
of informative, uncorrelated features. If all features are independent and
important, the accuracy of the restricted (feature-subset) classifiers will
decline, however putting them all together could still give a boost. Features
can also be preprocessed presenting different views of the data to different
classifiers.

Changing Output Classes


The output values can be assigned to 2 super-classes, e.g. A1, B1 – each covering several of the original class values – and a classifier is trained on the super-classes; then another selection A2, B2 is made and the next classifier trained, etc. When classifying, all ensemble members issue their super-classifications and a count is made of which original class appears most frequently – the final output.

Error Correcting Output Coding (ECOC) (Dietterich & Bakiri, 1991) is a multi-class classification method where each class is encoded as a string of binary code-letters, a codeword. Given a test instance, each of its code-letters is predicted, and the class whose codeword has the smallest Hamming distance to the predicted codeword is assigned. By reducing bias and variance, ECOC boosts global learning algorithms, but not local ones (Ricci & Aha, 1998) – in that case ECOC code-letter classifiers can be differentiated by providing them with a subset of features. In data with few classes (K < 6), extending the codeword length yields increased error reduction.

Randomization
Randomization could be inserted at many points, resulting in ensemble variety. Some training examples could be distorted, e.g. by adding 0-mean noise. Some class values could be randomized. The internal workings of the learning algorithm could be altered, e.g. by choosing a random decision among the 3 best during decision tree build-up.

Wagging (Bauer & Kohavi, 1998), a variant of bagging, requires a base
learner accepting training set weights. Instead of bootstrap samples, wag-
ging assigns random weights to instances in each training set, the original
formulation used Gaussian noise to vary the weights.

Different Classifier Types

Even when trained on the same data, classifiers such as kNN, neural net-
work, decision tree create models classifying new instances differently due
to different internal language, biases, sensitivity to noise etc. Learners
could also induce varied models due to different settings, e.g. network
architecture.

Bayesian ensemble uses k classifiers, obtained by any means, in the Bayes formula. The basis for the ensemble outcome are probabilities: the classes' priors and the conditionals for predicted/actual class pairs, for each classifier. The Bayes output, given k classifications, is then the class_p maximizing P(class_p) * P_1(predict_1 | class_p) * ... * P_k(predict_k | class_p). It can be viewed as the Naive Bayes Classifier (Mitchell, 1997) meta-applied to the ensemble. A sketch follows.
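A sketch of this combination; the prior and conditional probability tables are assumed to have been estimated beforehand, e.g. on validation data:

def bayes_combine(predictions, class_priors, conditionals):
    # predictions: the k classifiers' outputs for one instance
    # class_priors: {class: P(class)}
    # conditionals[k][(predicted, actual)]: P_k(predicted | actual)
    best_class, best_score = None, -1.0
    for c, prior in class_priors.items():
        score = prior
        for k, pred in enumerate(predictions):
            score *= conditionals[k].get((pred, c), 1e-9)  # small floor avoids zeroing the product
        if score > best_score:
            best_class, best_score = c, score
    return best_class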

Specialized Ensemble Use

The quest for the ultimate ensemble technique resembles the earlier efforts to find the 'best' learning algorithm, which discovered a number of similarly accurate methods, some somewhat better in specific circumstances, and usually further improved by problem-specific knowledge. Ensemble methods also show their strengths in different circumstances, e.g. no/data noise, un/stable learner etc. Problem specifics can be directly incorporated into a specialized ensemble. This section addresses four practical problem areas and presents ensemble adaptations. Though other problems in these areas might require an individual approach, the intention is to bring up some issues and worked-out solutions.

Time Series Prediction
Time series arise in any context in which data is linearly ordered, e.g. by time or distance. The index increment may be constant, e.g. 1 day, or not, as in the case of event-driven measurements, e.g. indicating a transaction time and its value. Series values are usually numeric, in a more general case – vectors of fixed length. Time series prediction is to estimate a future value, given the values up to date. There are different measures of success, the most common being accuracy in the case of nominal series values, and mean squared error in the case of numeric ones.
Series to instances conversion is required by most learning algorithms expecting as input a fixed-length vector. It can be a lag vector derived from the series, a basic technique in nonlinear analysis: v_t = (series_t, series_{t-lag}, ..., series_{t-(D-1)*lag}). Such vectors with the same time index t – coming from all input series – appended give an instance, its coordinates referred to as features. The lag vectors have motivation in Takens' embedding theorem (Kantz & Schreiber, 1999b), stating that a deterministic – i.e. to some extent predictable – series' dynamics is mimicked by the dynamics of the lag vectors, so e.g. if a series has a cycle – coming back to the same values – the lag vectors will have a cycle too. A minimal conversion sketch follows.
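A minimal conversion sketch; predicting the sign of the next value is just one possible target choice:

def embed(series, D=3, lag=2):
    # Build lag vectors v_t = (series_t, series_{t-lag}, ..., series_{t-(D-1)*lag})
    # and pair each with the sign of the following value as the class.
    X, y = [], []
    for t in range((D - 1) * lag, len(series) - 1):
        X.append([series[t - i * lag] for i in range(D)])
        y.append(1 if series[t + 1] >= 0 else -1)
    return X, y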

Embedding dimension D – the number of lagged series values used to model a series trajectory – according to the embedding theorem does not need to exceed 2d + 1, where d is the dimension of the series-generating system. In practice d is unknown, possibly infinite if the data is stochastic. D is usually arrived at by increasing the value until some measure – e.g. prediction accuracy – gets saturated. In theory – infinite data and no noise – the measure should stay the same even when D is increased further; in practice it does not, due to the curse of dimensionality etc. A smaller dimension allows more and closer neighborhood matches. An ensemble involving different dimensions could resolve the dilemma.

Embedding lag, according to Takens' theorem, should only be different from the system's cycle; in practice it is more restricted. Too small a lag makes the differences between the lagged values not informative enough to model the system's trajectory – imagine describing a yearly cycle by giving just several values separated seconds apart. Too big a lag misses the details and risks putting together weakly related values – as in the case of a yearly cycle sampled at a 123-month interval. Without advance knowledge of the data, a lag is preferred at either the first zero of the autocorrelation or the first minimum of the mutual information (Kantz & Schreiber, 1999b). However, those are only heuristics and an ensemble could explore a range of values, especially as theory does not favor any.

Prediction horizon – how far ahead to predict at a time – is another decision. A target 10-steps-ahead prediction can be made in 1 shot, in 2 iterated 5-ahead predictions, 5 iterated 2-ahead, or 10 iterated 1-ahead. A longer horizon makes the predicted quantity less corrupted by noise; a shorter one can be all that can be predicted, and iterated predictions can be corrected for their systematic errors as described below. An ensemble of different horizons could not only limit outliers, but also estimate the overall prediction reliability via the agreement among the individual predictions.
Converting a short-term predictor into a longer-term one can also be done using some form of metalearning/ensemble (Judd & Small, 2000). The method uses a second learner to discover the systematic errors of the (not necessarily very good, but above-average) short-term predictor as it is iterated. These corrections are then used when a longer-term prediction is issued, resulting in much better results. The technique also provides an indication of a feasible prediction horizon and is robust w.r.t. noisy series.

Series preprocessing – meaning global data preparation before the data is used for classification or prediction – can introduce a domain-specific data view, reduce noise and normalize, presenting the learning algorithm with more accessible data. E.g., in the analysis of financial series, so-called indicator series are frequently derived and consist of different moving averages and relative value measures within an interval (Zemke, 2002b). Preprocessing can precede, or be done at, learning time, e.g. as calls to background predicates.
The following system (Gonzalez & Diez, 2000) introduces general time series preprocessing predicates: relative increases, decreases, stays (within a range) and region: always, sometime, true percentage – testing if interval values belong to a range. The predicates, filled with values specifying the intervals and ranges, are the basis of simple classifiers – consisting of only one predicate. The classifiers are then subject to boosting, up to 100 iterations. The results are good, though noisy data causes some problems.

Initial conditions of the learning algorithm can differ for each ensemble member. Usually, the learning algorithm has some settings other than the input/output data features etc. In the case of ANN, these are the initial weights, architecture, learning speed, weight decay rate etc. For an ILP system – the background predicates and allowed complexity of clauses. For kNN – the k parameter and the weighting of the k neighbors w.r.t. distance: equal, linear, exponential. All can be varied.
An ANN example of different weight initialization for time series prediction follows (Naftaly et al., 1997). Nets of the same architecture are randomly initialized and assigned to ensembles built at 2 levels. First, the nets are grouped into ensembles of fixed size Q, and the results for the groups averaged at the second level. Initially Q = 1, and as Q increases the variance expectedly decreases. At Q = 20 the variance is similar to what could be extrapolated for Q = ∞. Besides suggesting a way to improve predictions, the study offers some interesting observations. First, the minimum of the ensemble predictor error is obtained at an ANN epoch that for a single net would already mean overfitting. Second, as Q increases, the test-set error curves w.r.t. epochs/training time become flatter, making it less crucial to stop training at the 'right' moment.

Different series involved in the prediction of a given one are another ensemble possibility. They might be series other than the one predicted, but supporting its prediction, which could be revealed by, e.g., a significant non-zero correlation. Or the additional series could be derived from the given one(s), e.g. according to the indicator formulae in financial prediction. Then all the series can be put together into the lag vectors – already described for one series – and presented to the learning algorithm. Different ensemble members can be provided with their own selection/preprocessing combination.
Selection of the delay vector lag and dimension, even for several input series, can be done as follows (Zemke, 1999b). For each series, the lag is set to a small value, and the dimension to a reasonable value, e.g. 2 and 10. Next, a binary vector, as long as the sum of the embedding dimensions for all series, is optimized by a Genetic Algorithm (GA). The vector, by its '1' positions, indicates which lagged values should be used, their number restricted to avoid the curse of dimensionality. The selected features are used to train a predictor whose performance/accuracy measures the vector's fitness, as sketched below. In the GA population no 2 identical vectors are allowed and, after a certain number of generations, the top-performing half of the last population is subject to majority vote/averaging of their predictions.
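A sketch of the bit-mask evaluation step only (the GA loop itself – mutation, crossover, the no-duplicates rule – is omitted; make_model and the arrays are assumptions):

import numpy as np

def mask_fitness(bits, X_train, y_train, X_val, y_val, make_model, max_features=10):
    # bits is the GA-evolved binary vector selecting lagged values across all series.
    cols = np.flatnonzero(bits)
    if len(cols) == 0 or len(cols) > max_features:  # curb the curse of dimensionality
        return 0.0
    model = make_model().fit(X_train[:, cols], y_train)
    return (model.predict(X_val[:, cols]) == y_val).mean()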

Multiple Features
Multiple features, running into hundreds or even thousands, naturally appear in some domains. In text classification, a word's presence may be considered a feature; in image recognition – a pixel's value; in chemical design – a component's presence and activity; or in a joint database the features may simply accumulate. Feature selection and extraction are the main dimensionality reduction schemes. In selection, a criterion, e.g. correlation, decides the feature choice for classification. Feature extraction, e.g. Principal Component Analysis (PCA), reduces dimensionality by creating new features. Sometimes it is impossible to find an optimal feature set, when several sets perform similarly. Because different feature sets represent different data views, simultaneous use of them can lead to a better classification.
Simultaneous use of different feature sets usually lumps feature vectors together into a single composite vector. Although there are several methods to form the vector, the use of such a joint feature set may result in the following problems: 1) Curse of dimensionality – the dimension of a composite feature vector becomes much higher than that of any component feature vector, 2) Difficulty in formation – it is often difficult to lump several different feature vectors together due to their diversified forms, 3) Redundancy – the component feature vectors are usually not independent of each other (Chen & Chi, 1998). The problems of relevant feature and example selection are interconnected (Blum & Langley, 1997).

Random feature selection for each ensemble classifier is perhaps the simplest method. It works if 1) the data is highly redundant – it does not matter much which features are included, as many carry similar information – and 2) the selected subsets are big enough to create an above-random classifier – finding that size may require some experimentation. Provided that, one may obtain better classifiers in random subspaces than in the original feature space, even before the ensemble application. In a successful experiment (Skurichina & Duin, 2001), the original dimensionality was 80 (actually 24-60), the subspaces – 10 features, randomly selected for a 100-classifier majority vote, as sketched below.
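A random-subspace sketch in the spirit of that experiment; the subspace size and the base learner are assumptions to be tuned for the data at hand:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, n_members=100, subspace=10, seed=0):
    # Each member is trained on its own random subset of feature columns.
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        cols = rng.choice(X.shape[1], size=subspace, replace=False)
        members.append((cols, DecisionTreeClassifier().fit(X[:, cols], y)))
    return members

def subspace_predict(members, x):
    votes = [m.predict(x[cols].reshape(1, -1))[0] for cols, m in members]
    return max(set(votes), key=votes.count)  # majority vote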

Feature synthesis creates new features, exposing important data characteristics to classifiers. Different feature preprocessing for different classifiers ensures their variety for an effective ensemble. PCA – creating orthogonal combinations of features, maximizing variance – is a common way to deal with multi-dimensional data. PCA's new features, the principal components, generated in a sequence of decreasing variability/importance, in different subsets or derived from different data, can be the basis of an ensemble.
In an experiment to automatically recognize volcanos in Mars satellite images (Asker & Maclin, 1997), PCA was applied to 15 ∗ 15 pixels = 225 feature images. A varying number, 6 − 16, of principal components, plus domain features – line filter values – were fed into 48 ANNs, making an ensemble reaching experts' accuracy. The authors conclude that the domain-specific features and PCA preprocessing were far more important than the choice of learning algorithm. Such a scheme seems suitable for cases when domain-specific features can be identified and detecting which other features contributed most is not important, since PCA mixes them all.

Ensemble-based feature selection reduces data dimensionality by observing which classifiers – based on which features – perform well. The features can then contribute to an even more robust ensemble. Sensitivity of a feature is defined as the change in the output variable when an input feature is changed within its allowable range (while holding all other inputs frozen at their median/average value) (Embrechts et al., 2001).
In in-silico drug design with QSAR, 100-1000 dependent features and only 50-100 instances present related challenges: how to avoid the curse of dimensionality, and how to maximize classification accuracy given the few instances yet many features. A reported solution is to bootstrap an (ANN) ensemble on all features, adding one random feature – with values uniformly distributed – to estimate the sensitivities of the features, and skip features less sensitive than the random one. The process is repeated until no further feature can be dropped, and the final ensemble is trained. This scheme allows important features to be identified.

Class-aware feature selection – input decimation – is based on the following. 1) Class is important for feature selection (but ignored, e.g., in PCA). 2) Different classes have different sets of informative features. 3) Retaining original features is more human-readable. Input decimation works as follows (Oza & Tumer, 2001). For each of the L classes, decimation selects a subset of features most correlated to the class and trains a separate classifier on those features. The L classifiers constitute an ensemble. Given a new instance, each of the classifiers is applied (to its respective features) and the class voted for by most is the output. Decimation reduces classification error by up to 90% over single classifiers and ensembles trained on all features, as well as ensembles trained on principal components. Ensemble methods such as bagging, boosting and stacking can be used in conjunction with decimation.

Accuracy Estimation
For many real-life problems, perfect classification is not possible. In addition to fundamental limits to classification accuracy arising from overlapping class densities, errors arise because of deficiencies in the classifier and the training data. Classifier-related problems such as an incorrect structural model, parameters, or learning regime may be overcome by changing or improving the classifier. However, errors caused by the data (finite training sets, mislabelled patterns) cannot be corrected during the classification stage. It is therefore important not only to design a good classifier, but also to estimate the limits to achievable classification rates. Such estimates determine whether it is worthwhile to pursue (alternative) classification schemes.
The Bayes error provides the lowest achievable error for a given classification problem. A simple Bayes error upper bound is provided by the Mahalanobis distance; however, it is not tight – it might be twice the actual error. The Bhattacharyya distance provides a better range estimate, but it requires knowledge of the class densities. The Chernoff bound tightens the Bhattacharyya upper estimate but is seldom used since it is difficult to compute (Tumer & Ghosh, 1996). The Bayes error can also be estimated non-parametrically from the errors of a nearest neighbor classifier, provided the training data is large, otherwise the asymptotic analysis might fail. Little work has been reported on direct estimation of the performance of classifiers (Bensusan & Kalousis, 2001) and on data complexity analysis for optimal classifier combination (Ho, 2001).

Bayes error estimation via an ensemble (Tumer & Ghosh, 1996) exploits the fact that this error is only data dependent, thus the same for all classifiers, each of which adds to it an extra error due to its specific limitations. By determining the amount of improvement obtained from an ensemble, the Bayes error can be isolated. Given the error of a single classifier E and of an averaging ensemble E_ensemble of N ρ-correlated classifiers, the Bayes error is: E_Bayes = (N E_ensemble − ((N − 1)ρ + 1)E) / ((N − 1)(1 − ρ)). The classifier correlation ρ is estimated by deriving the (binary) misclassification vector for each classifier, and then averaging the vectors' correlations. This can cause problems, as it treats classifiers equally, and is expensive if their number N is high. The correlation can, however, also be derived via mutual information, by averaging it between the classifiers and the ensemble as a fraction of the total entropy of the individual classifiers (Tumer et al., 1998). This yields an even better estimate of the Bayes error. A sketch follows.
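A sketch of the resulting estimate; the numbers in the comment are made up purely to illustrate the formula:

def bayes_error(E, E_ens, N, rho):
    # E: single-classifier error, E_ens: averaging-ensemble error,
    # N: ensemble size, rho: average pairwise classifier correlation.
    return (N * E_ens - ((N - 1) * rho + 1) * E) / ((N - 1) * (1 - rho))

# e.g. bayes_error(0.20, 0.17, 10, 0.4) gives roughly 0.14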

Noisy Data

There is little research specifically on ensembles for noisy data. This is an important combination, since most real-life data is noisy (in the broad sense of missing and corrupted data) and ensembles' success may partially come from reducing the influence of the noise through feature selection/preprocessing, bootstrap sampling etc.

Noise deteriorates a weighted ensemble, as the optimization of the combining weights overfits difficult, including noisy, examples (Sollich & Krogh, 1996). This is perhaps the basic result to bear in mind while developing/applying elaborate (weighted) ensemble schemes. To assess the influence of noise, controlled amounts of 5-30% of the input and output features were corrupted in an experiment involving Bagging, AdaBoost, Simple and Arcing ensembles of decision trees or ANNs (Opitz & Maclin, 1999). As the noise grew, the efficacy of the Simple and Bagging ensembles generally increased, while the Arcing and AdaBoost ensembles gained much less. As for ensemble size, with its increase the Bagging error rate did not increase, whereas AdaBoost's did.

Boosting in the presence of outliers can work, e.g., by allowing a fraction of examples to be misclassified, if this improves the overall (ensemble) accuracy. An overview of boosting performance on noisy data can be found in (Jiang, 2001). ν-Arc is the AdaBoost algorithm enhanced by a free parameter determining the fraction of allowable errors (Rätsch et al., 2000). In (toy) experiments on noisy data, ν-Arc performs significantly better than AdaBoost and comparably to SVM.

Coordinated ensemble specializes classifiers on different data aspects, e.g. so that classifiers appropriately misclassify outliers coming into their area, without the need to recognize outliers globally. Negatively correlated classifiers – making different (if need be), instead of independent, errors – build highly performing ensembles (Ali & Pazzani, 1995). This principle has been joined with coordinated training specializing classifiers on different data parts (Liu & Yao, 1998). The proposed ANN training rule – an extension of backpropagation – clearly outperforms standard ANN ensembles on noisy data, both in terms of accuracy and ensemble size.

Missing data is another aspect of 'noise' where a specialized ensemble solution can increase performance. Missing features can be viewed as data to be predicted – based on the non-missing attributes. One approach sorts all data instances according to how many features they miss: complete instances, missing 1, 2, etc. features. A missing feature is a target for an ensemble trained on all instances where the feature is present. The feature is then predicted, the repaired instance added to the data, and the whole process repeated, if needed, for other features (Conversano & Cappelli, 2000).

Removing mislabelled instances, with the cleaned data used for training, can improve accuracy. The problem is how to recognize a corrupted label, distinguishing it from an exceptional, but correct, case. Interestingly, as opposed to labels, cleaning corrupted attributes may decrease accuracy if a classifier trained on the cleaned data later classifies noisy instances. In one approach (Brodley & Friedl, 1996), all data was divided into N parts and an ensemble trained (by whatever ensemble-generating method) on N − 1 parts and used to classify the remaining part, done in turn for all parts, as sketched below. The voting method was consensus – only if the whole ensemble agreed on a class different from the actual one was the instance removed. Such a conservative approach is unlikely to remove correct labels, though it may still leave some misclassifications. Experiments have shown that using the cleaned data for training the final classifier (of whatever type) increased accuracy for 20-40% noise (i.e. corrupted labels), and left it the same for noise of less than 20%.
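A sketch of the consensus filter; build_ensemble and ensemble_votes stand for whatever ensemble-generating and voting method is used (assumed helpers):

import numpy as np

def consensus_filter(X, y, build_ensemble, ensemble_votes, n_parts=4):
    # Train on N-1 parts, classify the held-out part; remove an instance only
    # when the whole ensemble agrees on a class different from its label.
    keep = np.ones(len(X), dtype=bool)
    folds = np.array_split(np.arange(len(X)), n_parts)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        ensemble = build_ensemble(X[train_idx], y[train_idx])
        for i in test_idx:
            votes = ensemble_votes(ensemble, X[i])
            if len(set(votes)) == 1 and votes[0] != y[i]:
                keep[i] = False
    return X[keep], y[keep]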

Conclusion
Ensemble techniques, bringing together multiple classifiers for increased accuracy, have been intensively researched in the last decade. Most of the papers either propose a 'novel' ensemble technique, often a hybrid one bringing together features of several existing ones, or compare existing ensemble and classifier methods. This kind of presentation has 2 drawbacks. It is inaccessible to a practitioner with a specific problem in mind, since the literature is ensemble-method oriented, as opposed to problem oriented. It also gives the impression that there is the ultimate ensemble technique. A similar search for the ultimate machine learning algorithm proved fruitless. This paper concentrates on ensemble solutions in 4 problem areas: time series prediction, accuracy estimation, multiple-feature and noisy data. Published systems, often blending internal ensemble workings with some of the areas' specific problems, are presented, easing the burden of reinventing them.

Multivariate Feature Coupling and
Discretization
FEA-2003, Cary, US, 2003

Multivariate Feature Coupling and
Discretization
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Michal Rams6
Institut de Mathematiques de Bourgogne
Universite de Bourgogne
Dijon, France
M.Rams@impan.gov.pl

Published: Proceedings of FEA-2003, 2003

Abstract This paper presents a two-step approach to multivariate discretization, based on Genetic Algorithms (GA). First, subsets of informative and interacting features are identified – this is one outcome of the algorithm. Second, the feature sets are globally discretized, with respect to an arbitrary objective. We illustrate this by discretization for the highest classification accuracy of an ensemble diversified by the feature sets.

Introduction
Primitive data can be discrete, continuous or nominal. The nominal type merely lists the elements without any structure, whereas discrete and continuous data have an order – they can be compared. Discrete data differs from continuous in that it has a finite number of values. Discretization, digitization or quantization maps a continuous interval into one discrete value, the idea being that the projection preserves important distinctions. If all that matters, e.g., is a real value's sign, it could be digitized to {0, 1}: 0 for negative, 1 otherwise.
6
On leave from the Institute of Mathematics, Polish Academy of Sciences, Poland.

A data set has a data dimension of attributes or features – each holding a single type of values across all data instances. If attributes are the data columns, instances are the rows and their number is the data size. If one of the attributes is the class to be predicted, we are dealing with supervised data, versus unsupervised. The data description vocabulary carries over to the discretization algorithms. If an algorithm discretizing an attribute takes the class into account, it is supervised.
Most common univariate methods discretize one attribute at a time,
whereas multivariate methods consider interactions between attributes in
the process. Discretization is global if performed on the whole data set,
versus local if only part of the data is used, e.g. a subset of instances.
There are many advantages of discretized data. Discrete features are
closer to a knowledge level representation than continuous ones. Data can
be reduced and simplified, so it is easier to understand, use, and explain.
Discretization can make learning more accurate and faster and the re-
sulting hypotheses (decision trees, induction rules) more compact, shorter,
hence can be more efficiently examined, compared and used. Some learning
algorithms can only deal with discrete data (Liu et al., 2002).

Background
Machine learning and data mining aim at high accuracy, whereas most discretization algorithms promote accuracy only indirectly, by optimizing related metrics such as entropy or the chi-square statistic.
Univariate discretization algorithms are systemized and compared in (Liu et al., 2002). The best discretizations were supervised: the entropy-motivated Minimum Description Length Principle (MDLP) (Fayyad & Irani, 1993), and those based on the chi-square statistic (Liu & Setiono, 1997), later extended into a parameter-free version (Tay & Shen, 2002).
There is much less literature on multivariate discretization. A chi-square statistics approach (Bay, 2001) aims at discretizing data so its distribution is most similar to the original. Classification rules based on feature intervals can also be viewed as discretization, as done by (Kwedlo & Kretowski, 1999), who GA-evolve the rules. However, different rules may impose different intervals for the same feature.

Multivariate Considerations

Discretizing one feature at a time is computationally less demanding, though limiting. First, some variables are only important together, e.g., if the predicted class = sign(xy), only discretizing x and y in tandem can discover their significance; each alone can be inferred as unrelated to the class and even discarded.
Second, especially in noisy/stochastic data, a non-random feature may be only slightly above randomness, so it can still test as non-significant. Only grouping a number of such features can reveal their above-random nature.
Those considerations are crucial since discretization is an information-losing transformation – if the discretization algorithm cannot spot a regularity, it will discretize suboptimally, possibly corrupting the features or omitting them as irrelevant.
Besides, data mining applications proliferate beyond data in which each feature alone is informative. To exploit such data and the capabilities of advanced mining algorithms, the data preprocessing, including discretization, needs to be equally adequate.
Voting ensembles also pose demands. When data is used to train a number of imperfect classifiers, which together yield the final hypothesis, the aim of discretization should not be so much to perfect individual feature cut-points, but to ensure that the features as a whole carry as much information as possible – to be recaptured by the ensemble.

Discretization Measures

When discretization goes beyond the fixed interval length/frequency approach, it needs a measure guiding it through the search for salient features and their cut-points. The search strategy itself can have different implementations (Liu et al., 2002). However, let us concentrate on the score functions first.

Shannon conditional entropy (Shannon & Weaver, 1949) is commonly used to estimate the information gain of a cut-point, with the maximal-score point used, as in C4.5 (Quinlan, 1993). We encountered the problem that entropy has low discriminative power in some non-optimal equilibria, and as such does not provide a clear direction for how to get out of them.

Chi-square statistics assess how similar the discretized data is to the original (Bay, 2001). We experimented with chi-square as a secondary test, to further distinguish between nearly equal primary objectives. Eventually, we preferred the Renyi entropy, which has a similar quadratic formula, though one interpretable as accuracy. It can be linearly combined with the following accuracy measure.

Accuracy is rarely a discretization score function, though the complementary data inconsistency is an objective (Liu et al., 2002): For each instance, its features' discretized values create a pattern. If there are n_p instances of pattern p in the data, then inconsistency_p = n_p − majorityClass_p, where majorityClass_p is the count of the most numerous class among the pattern's instances. The data's totalInconsistency is the sum of the inconsistencies over all patterns. We define discretization accuracy = 1 − totalInconsistency/dataSize.

Number of discrete values is an objective to be minimized. We use the related splits: the number of patterns a feature set can maximally produce. The number for a feature set is computed by multiplying the numbers of discrete values introduced by each feature, e.g. if feature 1 is discretized into 3 values and features 2 and 3 into 4, splits = 3 ∗ 4 ∗ 4 = 48.

Our Measure
Accuracy alone is not a good discretization measure since a set of random
features may have high accuracy, as the number of splits and overfitting
grow. Also, some of the individual features in a set may induce accurate
predictions. We need a measure of the extra gain over that of contributing
features and overfitting. Such considerations led to the following.

Signal-to-Noise Ratio (SNR) expresses the accuracy gain of a feature set: SNR = accuracy / (1 − accuracy) = (dataSize − totalInconsistency) / totalInconsistency, i.e. the ratio of consistent to inconsistent pattern totals.
To correct for the accuracy induced by individual features, we normalize the SNR by dividing it by the SNR for all the features involved in the feature set, getting SNRn. SNRn > 1 indicates that a feature set predicts more than its individual features. A sketch of the measures follows.
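A sketch of these measures; patterns are assumed to be tuples of each instance's discretized feature values:

from collections import Counter, defaultdict

def discretization_accuracy(patterns, classes):
    # accuracy = 1 - totalInconsistency / dataSize, as defined above.
    by_pattern = defaultdict(list)
    for p, c in zip(patterns, classes):
        by_pattern[p].append(c)
    total_inconsistency = sum(len(cs) - Counter(cs).most_common(1)[0][1]
                              for cs in by_pattern.values())
    return 1 - total_inconsistency / len(patterns)

def snr(accuracy, eps=1e-9):
    return accuracy / (1 - accuracy + eps)

def snr_normalized(feature_set_accuracy, all_features_accuracy):
    # Divide by the SNR for all the features involved in the set, per the definition above.
    return snr(feature_set_accuracy) / snr(all_features_accuracy)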

Fitness for a GA individual, consisting of a feature set and its discretization, is provided by the SNRn. The GA population is sorted w.r.t. fitness, with a newly evaluated individual included only if it surpasses the current population's worst. If two individuals have the same fitness, further discretization preferences are used. Thus, a secondary objective is to minimize splits; individuals with splits > dataSize/40 are discarded. Next, the one with greater SNR is promoted. Eventually, the feature sets are compared in lexicographic order.

Two-Stage Algorithm
The approach uses Genetic Algorithms to select feature sets contributing
to predictability and to discretize the features. First, it identifies the different
feature subsets on the basis of their predictive accuracy. Second, with those
subsets fixed, all the features involved in them are globally fine-discretized.

Feature Coupling
This stage uses rough discretization to identify feature sets of above-random
accuracy, via GA fitness maximization. A feature appearing in different subsets
may have different discretization cut-points.
After each populationSize evaluations, the fittest individual’s feature set
becomes a candidate for the list of coupled feature sets. If its SNRn < 1,
the set is rejected. Otherwise, it is locally optimized: each feature in turn is
removed and the remaining subset evaluated for SNRn. If a subset measures no
worse than the original, it is recursively optimized. If the SNR (not normalized)
of the smallest such subset exceeds an acceptance threshold, the set joins the
coupled feature list.
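
One way to read this local optimization is the recursive reduction sketched
below; snrn_of is a scoring callback, and the traversal order is our choice, as
the text does not fix it:

def locally_optimize(feature_set, snrn_of):
    """Drop features one at a time as long as the reduced subset scores no worse
    (by SNRn) than the set it came from; recurse on the first such reduction."""
    base = snrn_of(feature_set)
    for f in sorted(feature_set):
        reduced = feature_set - {f}
        if reduced and snrn_of(reduced) >= base:
            return locally_optimize(reduced, snrn_of)
    return feature_set

# the resulting (smallest) subset joins the coupled feature list
# if its plain SNR exceeds the acceptance threshold discussed next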

An acceptance threshold represents a balance. A threshold too low will let
through random, overfitting feature subsets; one too high will reject genuine
subsets, or postpone their discovery due to the increased demands on fitness.
Through long experimentation, we arrived at the following procedure: a number
of random features are evaluated for SNR against the class attribute. To
simulate a feature set splitting the class into many patterns, each random
feature is split into dataSize/40 discrete values. The greatest SNR so obtained
defines the threshold. This has its basis in the surrogate data approach:
generate K data sets resembling the original and compute some statistic of
interest. Then, if the statistic on the actual data is greater (lower) than on
all the random sets, the chance of getting this by coincidence is less than 1/K.
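
A sketch of this surrogate threshold follows, reusing discretization_accuracy
and snr from the earlier sketches; the number of random features is a free
parameter here, not a value taken from the thesis:

import random

def acceptance_threshold(classes, n_surrogates=20, seed=0):
    """Largest SNR achieved by random features, each split into dataSize/40
    discrete values, when scored against the class attribute."""
    rnd = random.Random(seed)
    data_size = len(classes)
    n_values = max(2, data_size // 40)
    best = 0.0
    for _ in range(n_surrogates):
        feature = [rnd.randrange(n_values) for _ in range(data_size)]
        acc = discretization_accuracy([[v] for v in feature], classes)
        best = max(best, snr(acc))
    return best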

Once a feature subset joins the above-random list, all its subsets and
supersets in the GA population are mutated. The mutation is such as not
to generate any subset or superset of a set already in the list. This done,
the GA continues.

At the end of the GA run, single features are considered for addition to the
list of feature sets warranting predictability. The accuracy threshold for
accepting a feature is arrived at by collecting statistics on the accuracy of
permuted original features predicting the actual class. The features are
randomized in this way so as to preserve their distribution; e.g. the features
may happen to be binary, which should be respected when collecting the
statistics. The threshold accuracy is then given by the mean accuracy plus
the required number of standard deviations.
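
A sketch of this permutation threshold is given below; the number of
permutations and of standard deviations are illustrative parameters, not values
from the thesis, and discretization_accuracy is reused from the earlier sketch:

import random, statistics

def single_feature_threshold(discrete_feature, classes, n_perm=30, n_std=3.0, seed=0):
    """Shuffle the feature (preserving its value distribution), score it against
    the class, and return mean accuracy + n_std standard deviations."""
    rnd = random.Random(seed)
    shuffled = list(discrete_feature)
    accs = []
    for _ in range(n_perm):
        rnd.shuffle(shuffled)
        accs.append(discretization_accuracy([[v] for v in shuffled], classes))
    return statistics.mean(accs) + n_std * statistics.pstdev(accs)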

The discretization accuracy for a feature is computed by discretizing it,
unsupervised, into 20 equally frequent intervals. These discretization cut-points
are then used in the feature set search. Such restricted cut-points keep the
search space small for individuals optimizing both their active features and
the cut-point selection.
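
The initial equal-frequency cut-points can be obtained, for instance, as below
(a sketch; the thesis does not prescribe how ties are handled):

def equal_frequency_cutpoints(values, n_intervals=20):
    """Unsupervised cut-points splitting 'values' into (up to) n_intervals
    equally frequent intervals; duplicates arising from ties are dropped."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, n_intervals):
        cut = ordered[i * n // n_intervals]
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts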

Global Discretization
Once we have identified the coupled feature sets, the second optimization
can proceed. The user could provide the objective; we have attempted a
fine discretization of the selected features, in which each feature is assigned
only one set of cut-points. The fitness of such a discretization can be measured
in many ways, e.g., in the spirit of the Naive Bayesian Classifier, as the product
of the discretization accuracies for all the feature sets. The GA optimization
proceeds by exploring the cut-points, with the feature sets fixed.
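
As one reading of this fitness, the sketch below multiplies the discretization
accuracies of the coupled feature sets, reusing discretization_accuracy from the
earlier sketch; the discretize callback, mapping a data row to its discrete
pattern on a feature set, is our assumption:

def global_fitness(feature_sets, discretize, data, classes):
    """Product of discretization accuracies over all coupled feature sets."""
    fitness = 1.0
    for fs in feature_sets:
        rows = [discretize(row, fs) for row in data]
        fitness *= discretization_accuracy(rows, classes)   # from the earlier sketch
    return fitness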
The overall procedure provides:

Coupled features – sets inducing superior accuracy to that obtained by the
features individually.

Above-random features – all features that induce predictability individually.

Measure of predictability – expressed by the global discretization Bayesian
ensemble accuracy.

Implementation Details
GA-individual = active feature set + cut-point selection. The features are
a subset of all the features available, with no more than 4 selected at once.
The cut-points are indices into sorted threshold values, precomputed for each
data feature as the values at which the class changes (Fayyad & Irani, 1993).
Thus, the discretization of a value is the smallest index whose corresponding
threshold exceeds that value.
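
This value-to-index mapping is essentially a binary search over the sorted
thresholds, e.g.:

from bisect import bisect_right

def discretize_value(value, thresholds):
    """Smallest index whose threshold (strictly) exceeds the value; 'thresholds'
    must be sorted ascending. Values above the last threshold map to len(thresholds)."""
    return bisect_right(thresholds, value)

# discretize_value(0.42, [0.1, 0.4, 0.7]) == 2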
Although non-active features are not processed in an individual, their
thresholds are inherited from a predecessor. Once the corresponding feature
is mutated active, the retained threshold indices will be used. The motivation
is that even for non-active features the thresholds had already been optimized,
and as such have greater potential than randomly selected thresholds. This is
not a big overhead, as the feature threshold index lists are stored merely as
pointers to where they were created, and the constant-size representation
promotes simple genetic operators.

Genetic operators currently include mutation at two levels. The first mutation,
in stage 1 only, may alter the active feature selection by adding, deleting or
changing a feature. The second mutation does the same to the threshold
selection of an active feature: add, delete or change.
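
A possible shape of these operators, assuming an individual is held as a dict
with an 'active' feature set and a 'cuts' map from feature to threshold indices
(our representation, not necessarily the thesis code):

import random

def mutate(ind, n_features, n_thresholds, stage1=True, rnd=random):
    """Illustrative mutation at the two levels described above."""
    if stage1 and rnd.random() < 0.5:                     # level 1: active feature set
        op = rnd.choice(("add", "delete", "change"))
        if op == "add" and len(ind["active"]) < 4:
            ind["active"].add(rnd.randrange(n_features))
        elif op == "delete" and len(ind["active"]) > 1:
            ind["active"].remove(rnd.choice(sorted(ind["active"])))
        elif op == "change" and ind["active"]:
            ind["active"].remove(rnd.choice(sorted(ind["active"])))
            ind["active"].add(rnd.randrange(n_features))
    elif ind["active"]:                                   # level 2: threshold indices
        f = rnd.choice(sorted(ind["active"]))
        cuts = ind["cuts"].setdefault(f, [])
        op = rnd.choice(("add", "delete", "change"))
        if op == "add":
            cuts.append(rnd.randrange(n_thresholds[f]))
        elif op == "delete" and cuts:
            cuts.pop(rnd.randrange(len(cuts)))
        elif op == "change" and cuts:
            cuts[rnd.randrange(len(cuts))] = rnd.randrange(n_thresholds[f])
        ind["cuts"][f] = sorted(set(cuts))
    return ind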

Experiments

Since the advantage of a multivariate discretization over a univariate one lies
in the ability to identify group-informative features, it is misguided to compare
the two on the same data. The data could look random to the univariate
approach, or it could not need the multivariate search at all if single features
already warranted satisfactory predictability. In the latter case, a univariate
approach skipping all the multivariate considerations would be more appropriate
and efficient. A comparison to another multivariate discretization would require
the exact algorithm and data, which we do not have. Instead, we test our
method on synthetic data designed to identify the limitations of the approach.
On such data a univariate approach would completely fail.
The data is defined as follows. The data sets have dataSize = 8192 instances,
with values uniformly in (0,1). The class of an instance is the xor function
on subsequent groupings of classDim = 3 features: for half of the instances,
the class is the xor of features {0, 1, 2}, for another quarter the xor of
features {3,4,5}, for another one-eighth the xor of {6,7,8}, etc. The xor is
computed by multiplying the values involved, each minus 0.5, and returning 1
if the product > 0, otherwise 0. This data is undoubtedly artificial, but also
most difficult. In applications where the feature sets could be incrementally
discovered, e.g. {0,1} above random but {0,1,2} even better, we expect the
effectiveness of the algorithm to be higher than reported.
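
For reference, such data can be generated along the following lines – a sketch
under the stated parameters (dataSize = 8192, classDim = 3); the handling of the
final, smallest grouping is our guess, as the text does not spell it out:

import random

def make_xor_data(data_size=8192, data_dim=30, class_dim=3, seed=0):
    """Uniform features in (0,1); the class is the xor of features {0,1,2} for
    the first half of the instances, of {3,4,5} for the next quarter, of {6,7,8}
    for the next eighth, and so on while feature groups remain available."""
    rnd = random.Random(seed)
    data, classes = [], []
    for i in range(data_size):
        row = [rnd.random() for _ in range(data_dim)]
        group = 0
        boundary = data_size // 2
        limit = boundary
        while i >= limit and (group + 2) * class_dim <= data_dim:
            group += 1
            boundary //= 2
            limit += boundary
        start = group * class_dim
        product = 1.0
        for f in range(start, start + class_dim):
            product *= row[f] - 0.5          # xor via the sign of the product
        data.append(row)
        classes.append(1 if product > 0 else 0)
    return data, classes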
The tables below have been generated for the default setting: GA population
size 5000 and an allowed number of fitness evaluations of 100 000; only for
exploring dataDim was the number of evaluations increased to 250 000. Unless
otherwise indicated, the data dimension is 30 and the noise is 0. Note that
since the feature groupings are defined on diminishing parts of the data, the
rest effectively acts as noise. Data and class noise indicate the percentage
of randomly assigned data, respectively class, values after the class had been
computed on the non-corrupted data. The table results represent the
percentages of cases, out of 10 runs, when the sets {0,1,2} etc. were found.

Data noise       0.1   0.15   0.2   0.25   0.35
{0,1,2} found    100   100    100   100    20
{3,4,5} found    100   80     0     0      0

Class noise      0.05  0.1    0.15  0.2    0.25
{0,1,2} found    100   100    100   100    100
{3,4,5} found    100   100    60    0      0

Data dim         30    60     90    120    150
{0,1,2} found    100   100    100   100    60
{3,4,5} found    100   100    100   100    60
{6,7,8} found    80    40     40    0      0

Conclusion
The approach presented opens a number of interesting possibilities for data
mining applications. First, the algorithm detects informative feature groupings
even if they contribute only partially to the class definition and the noise
is strong. In more descriptive data mining, where it is important not only
to obtain good predictive models but also to present them in a readable form,
the discovery that a feature group contributes to predictability with a certain
accuracy is of value.
Second, the global discretization stage can easily be adjusted to a particular
objective. If this is prediction accuracy by another type of ensemble, or if only
10 features are to be involved, it can be expressed via the GA-fitness function
for the global discretization.

Appendix A

Feasibility Study on Short-Term


Stock Prediction

Feasibility Study on Short-Term Stock
Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

1997

Abstract This paper presents an experimental system predicting a stock exchange index
direction of change with up to 76 per cent accuracy. The period concerned varies from 1
to 30 days.
The method combines probabilistic and pattern-based approaches into one, highly
robust system. It first classifies the past of the time series involved into binary patterns
and then analyzes the most recent data pattern, probabilistically assigning a prediction
based on its similarity to past patterns.

Introduction
The objective of the work was to test whether short-term prediction of a stock
index is at all possible using simple methods and a limited dataset (Deboeck,
1994; Weigend & Gershenfeld, 1994). Several approaches were tried, both with
respect to the data format and to the algorithms. Details of the successful
setting follow.

Experimental Settings
The tests have been performed on 750 daily index quotes of the Polish stock
exchange, with the training data reaching another 400 sessions back. Index
changes obtained a binary characterization: 1 – for strictly positive changes,
0 – otherwise. Index prediction – the binary function of the change between
the current and a future value – was attempted for periods of 1, 3, 5, 10,
20 and 30 days ahead. Pattern learning took place up to the most recent
index value available (before the prediction period). Benchmark strategies are
presented to account for biases present in the data. A description of the
strategies used follows.

Always up assumes that the index always goes up.

Trend following assumes the same sign of index change to continue.

Patterns + trend following. Patterns of 7 subsequent binary index changes
are assigned a probability of correctly predicting a positive index change.
The process is carried out independently for a number of non-overlapping,
adjacent epochs (currently 2 epochs of 200 quotes each). The patterns
which consistently – with probability 50% or higher – predict the same
sign of change in all epochs are retained and subsequently used for
predicting the index change should the pattern occur. Since only some
patterns pass the filtering process, in cases when no pattern is available
the outcome is the same as in the Trend following method.
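
To make the pattern strategy concrete, a minimal sketch of how it could be
implemented – assuming the index history is already encoded as a list of 0/1
changes; the function names and exact bookkeeping are ours, not taken from the
original code:

from collections import defaultdict

def learn_patterns(binary_changes, pattern_len=7, epochs=2, epoch_len=200):
    """For each epoch, record how often a 7-bit pattern of past changes is
    followed by an up-move; keep patterns whose majority direction agrees
    across all epochs."""
    votes = []                                   # one dict of pattern -> [ups, total] per epoch
    start = max(0, len(binary_changes) - epochs * epoch_len)
    for e in range(epochs):
        counts = defaultdict(lambda: [0, 0])
        lo = start + e * epoch_len
        for t in range(lo + pattern_len, lo + epoch_len):
            pat = tuple(binary_changes[t - pattern_len:t])
            counts[pat][0] += binary_changes[t]  # up-moves following the pattern
            counts[pat][1] += 1
        votes.append(counts)
    retained = {}
    for pat in votes[0]:
        dirs = []
        for counts in votes:
            ups, total = counts.get(pat, (0, 0))
            if total == 0:
                break
            dirs.append(1 if ups / total >= 0.5 else 0)
        else:
            if len(set(dirs)) == 1:              # consistent direction in all epochs
                retained[pat] = dirs[0]
    return retained

def predict(binary_changes, patterns, pattern_len=7):
    """Use a retained pattern if present, otherwise fall back to trend following."""
    pat = tuple(binary_changes[-pattern_len:])
    return patterns.get(pat, binary_changes[-1])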

Results
The table presents portions of agreements on actual index changes and
those predicted by the strategies.
Prediction/quotes ahead   1     3     5     10    20    30
Always up                 0.48  0.51  0.51  0.52  0.56  0.60
Trend following           0.60  0.56  0.53  0.52  0.51  0.50
Patterns + trend          0.76  0.76  0.75  0.74  0.73  0.671
In all the periods considered, the patterns maintained predictability at levels
considerably higher than the benchmark methods. Despite this relatively simple
characterization of index behavior, patterns correctly predict the index move
in 3 out of 4 cases up to 20 sessions ahead. The Trend following strategy
diminishes from 60% accuracy to a random strategy at around 10 sessions,
and Always up gains strength at 20 quotes ahead, in accordance with the
general index appreciation.

Conclusions
The experiments show that short-term index prediction is indeed possible
(Haughen, 1997). However, as a complex, non-linear system, the stock exchange
requires a careful approach (Peters, 1991; Trippi, 1995). In earlier experiments,
when pattern learning took place only in epochs preceding the test period, or
when the epochs extended too far back, the resulting patterns were of little use.
This could be caused by shifting regimes (Asbrink, 1997) in the dynamic process
underlying the index values.
However, with only a short history relevant, the scope for inferring any useful
patterns, and so for prediction, is limited. A solution to this could be provided
by a hill-climbing method, such as genetic algorithms (Michalewicz, 1992;
Bauer, 1994), searching the space of (epoch-size * number-of-epochs *
pattern-complexity) so as to maximize the predictive power. Other ways of
increasing predictability include incorporating other data series and increasing
the flexibility of the pattern building process, which now only incorporates a
simple probability measure and logical conjunction.
Other interesting possibilities follow even from a short analysis of the successful
binary patterns: many of them point to the existence of short-period ‘waves’ in
the index. This could be further explored, e.g. by Fourier or wavelet analysis.

Finally, I only mention trials with the symbolic ILP system Progol, employed
for finding a logical expression generalizing positive index change patterns (up
to 10 binary digits long). The system failed to find any hypothesis in a number
of different settings and despite a rather exhaustive search (more than 20h of
computation on a SPARC 5 for the longer cases). I view the outcome as a result
of the system’s strong insistence on generating (only) compressed hypotheses
and its problems in dealing with partially inconsistent/noisy data.

Appendix B

Amalgamation of Genetic Selection


and Boosting
Poster GECCO-99, US, 1999

Amalgamation of Genetic Selection and
Boosting
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se

Published: poster at GECCO-99, 1999

Synopsis comes from research on financial time series prediction (Zemke,
1998). Initially 4 methods – ANN, kNN, Bayesian Classifier and GP – were
compared for accuracy, and the best, kNN, was scrutinized by GA-optimizing
its various parameters. However, the resulting predictors were often unstable.
This led to the use of bagging (Breiman, 1996) – a majority voting scheme
provably reducing variance. The improvement came at no computational cost –
instead of taking the best evolved kNN classifier (as defined by its parameters),
all those above a threshold voted on the class.
Next, a method similar to bagging, but acclaimed as better, was tried:
AdaBoost (Freund & Schapire, 1996), which works by creating a (weighted)
ensemble of classifiers – each trained on an updated distribution of examples,
with those misclassified by the previous ensemble getting more weight. A
population of classifiers was GA-optimized for minimal error on the training
distribution. Once the best individual exceeded a threshold, it joined the
ensemble. After the distribution, and thus fitness, update, the GA proceeded
with the same classifier population, effectively implementing data-classifier
co-evolution. However, as the distribution drifted from the (initial) uniform
one, GA convergence became problematic. The following scheme averts this by
re-building the GA population from the training set after each distribution
update. A classifier consists of a list of prototypes, one per class, and a
binary vector selecting the active features for 1-NN determination.

The algorithm extends an initially empty ensemble of 1-NN classifiers of the
above form.

1. Split the training examples into an evaluation set (15%) and a training set
(85%).
2. Build the GA classifier population by selecting prototypes for each class,
copying examples from the training set according to their probability
distribution. Each classifier also includes a random binary active-feature
vector.
3. Evolve the GA population until the criterion for the best classifier is met.
4. Add that classifier to the ensemble list, perform ensemble Reduce-Error
Pruning with Backfitting (Margineantu & Dietterich, 1997b) to maximize
its accuracy on the evaluation set, and check the ensemble-enlargement
end criterion.
5. If not at an end, update the training set distribution according to AdaBoost
and go to 2.
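
The loop can be sketched as follows – a minimal skeleton, not the poster’s
actual code: the evolve and prune callbacks stand for the GA search (steps 2–3)
and for the Reduce-Error Pruning of step 4, and the weight update follows the
standard AdaBoost.M1 scheme; all names are ours.

import random

def boost_with_ga(X, y, evolve, prune, max_rounds=20, seed=0):
    """Skeleton of steps 1-5 above. 'evolve(X, y, weights)' is assumed to return
    a 1-NN prototype classifier with a .predict(x) method; 'prune(ensemble,
    eval_set)' performs reduce-error pruning on the evaluation set."""
    rnd = random.Random(seed)
    idx = list(range(len(X)))
    rnd.shuffle(idx)
    n_eval = len(X) * 15 // 100
    eval_idx, train_idx = idx[:n_eval], idx[n_eval:]          # step 1
    w = {i: 1.0 / len(train_idx) for i in train_idx}          # training distribution
    ensemble = []                                             # list of (classifier, beta)
    for _ in range(max_rounds):
        clf = evolve(X, y, w)                                 # steps 2-3: GA search
        err = sum(w[i] for i in train_idx if clf.predict(X[i]) != y[i])
        if err == 0.0 or err >= 0.5:                          # degenerate cases
            break
        beta = err / (1.0 - err)
        ensemble.append((clf, beta))                          # vote weight: log(1/beta)
        ensemble = prune(ensemble, [(X[i], y[i]) for i in eval_idx])   # step 4
        for i in train_idx:                                   # step 5: AdaBoost update
            if clf.predict(X[i]) == y[i]:
                w[i] *= beta
        total = sum(w.values())
        for i in train_idx:
            w[i] /= total
    return ensemble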
The operators used in the GA search include: Mutation – changing a single
bit in the feature-select vector, or randomly changing an active feature value
in one of the classifier’s prototypes. Crossover, given 2 classifiers, involves
swapping either the feature-select vectors or the prototypes for one class.
Classifier fitness (to be minimized) is measured as its error on the training
set, i.e., as the sum of the probabilities of the examples it misclassifies. The
end criterion for classifier evolution is that at least half of the GA population
has below-random error. The end criterion for ensemble enlargement is that
its accuracy on the evaluation set is no longer growing. The algorithm draws
from several methods to boost performance:
• AdaBoost
• Pruning of ensembles
• Feature selection / small prototype sets to destabilize individual classifiers
(Zheng et al., 1998)
• GA-like selection and evolving of prototypes
• Redundancy in prototype vectors (Ohno, 1970) – only selected features
influence the 1-NN distance, but all are subject to evolution

Experiments indicate the robustness of the approach – an acceptable classifier
is usually found in an early generation, thus the ensemble grows rapidly.
Accuracies on the (difficult) financial data are fairly stable and, on average,
above those obtained by the methods from the initial study, but below their
peaks. Bagging such obtained ensembles has also been attempted, further
reducing variance but only minimally increasing accuracy.

Foreseen work includes pushing the accuracy further. Trials involving the UCI
repository are planned for wider comparisons. Refinement of the algorithm will
include: genetic operators (perhaps leading to many prototypes per class) and
end criteria. The intention is to promote rapid finding of (not perfect but)
above-random and diverse classifiers contributing to an accurate ensemble.
In summary, the expected outcome of this research is a robust general-purpose
system distinguished by generating a small set of prototypes, which nevertheless,
in ensemble, exhibits high accuracy and stable results.

Bibliography

Alex, F. (2002). Data mining and knowledge discovery with evolutionary


algorithms. Natural Computing Series. Springer.

Ali, K. M., & Pazzani, M. J. (1995). On the link between error correlation
and error reduction in decision tree ensembles (Technical Report ICS-
TR-95-38). Dept. of Information and Computer Science, UCI, USA.

Allen, F., & Karjalainen, R. (1993). Using genetic algorithms to find tech-
nical trading rules (Technical Report). The Rodney L. White Center for
Financial Research, The Wharton School, University of Pennsylvania.

Asbrink, S. (1997). Nonlinearities and regime shifts in financial time series.


Stockholm School of Economics.

Asker, L., & Maclin, R. (1997). Feature engineering and classifier selection:
A case study in Venusian volcano detection. Proc. 14th International
Conference on Machine Learning (pp. 3–11). Morgan Kaufmann.

Aurell, E., & Zyczkowski, K. (1996). Option pricing and partial hedging:
Theory of polish options. Applied Math. Finance.

Back, A., & Weigend, A. (1998). A first application of independent com-


ponent analysis to extracting structure from stock returns. Int. J. on
Neural Systems, 8(4), 473–484.

Bak, P. (1997). How nature works: the science of self organized criticality.
Oxford University Press.

Bauer, E., & Kohavi, R. (1998). An empirical comparison of voting classi-


fication algorithms: Bagging, boosting and variants. To be published.

Bauer, R. (1994). Genetic algorithms and investment strategies. an alter-
native approach to neural networks and chaos theory. New York: Wiley.
Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge
and Information Systems, 3, 491–512.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton
Univ. Press.
Bensusan, H., & Kalousis, A. (2001). Estimating the predictive accuracy
of a classifier (Technical Report). Department of Computer Science,
University of Bristol, UK.
Bera, A. K., & Higgins, M. (1993). Arch models: Properties, estimation
and testing. Journal of Economic Surveys, 7, 307–366.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples
in machine learning. Artificial Intelligence, 97, 245–271.
Bollerslev, T. (1986). Generalised autoregressive conditional heteroskedas-
ticity. Journal of Econometrics, 31, 307–327.
Boström, H., & Asker, L. (1999). Combining divide-and-conquer and separate-
and-conquer for efficient and effective rule induction. Proceedings of
the Ninth International Workshop on Inductive Logic Programming.
Springer.
Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis, forecast-
ing and control. Prentice Hall.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Brodley, C. E., & Friedl, M. A. (1996). Identifying and eliminating misla-
beled training instances. AAAI/IAAI, Vol. 1 (pp. 799–805).
Campbell, J. Y., Lo, A., & MacKinlay, A. (1997). The econometrics of
financial markets. Princeton University Press.
Chen, K., & Chi, H. (1998). A method of combining multiple probabilistic
classifiers through soft competition on different feature sets. Neurocom-
puting, 20, 227–252.

Cheng, W., Wagner, L., & Lin, C.-H. (1996). Forecasting the 30-year u.s.
treasury bond with a system of neural networks.
Cizeau, P., Liu, Y., Meyer, M., Peng, C.-K., & Stanley, H. E. (1997). Volatility
distribution in the S&P500 stock index. Physica A, 245.
Cont, R. (1999). Statistical properties of financial time series (Technical
Report). Ecole Polytechnique, F-91128, Palaiseau, France.
Conversano, C., & Cappelli, C. (2000). Incremental multiple imputation
of missing data through ensemble of classifiers (Technical Report). De-
partment of Mathematics and Statistics, University of Naples Federico II,
Italy.
Dacorogna, M. (1993). The main ingredients of simple trading models
for use in genetic algorithm optimization (Technical Report). Olsen &
Associates.
Dacorogna, M., Gencay, R., Muller, U., Olsen, R., & Pictet, O. (2001). An
introduction to high-frequency finance. Academic Press.
Deboeck, G. (1994). Trading on the edge. Wiley.
Dietterich, T. (1996). Statistical tests for comparing supervised learning
algorithms (Technical Report). Oregon State University, Corvallis, OR.
Dietterich, T. (1998). An experimental comparison of three methods for
constructing ensembles of decision trees: Bagging, boosting, and random-
ization. Machine Learning, ?, 1–22.
Dietterich, T., & Bakiri, G. (1991). Error-correcting output codes: A gen-
eral method of improving multiclass inductive learning programs. Pro-
ceedings of the Ninth National Conference on AI (pp. 572–577).
Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple
Classifier Systems (pp. 1–15).
Domingos, P. (1997). Why does bagging work? A Bayesian account and its impli-
cations. Proceedings of the Third International Conference on Knowledge
Discovery and Data Mining (pp. 155–158).

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chap-
man & Hall.
Embrechts, M., et al. (2001). Bagging neural network sensitivity analysis
for feature reduction in qsar problems. Proceedings INNS-IEEE Interna-
tional Joint Conference on Neural Networks (pp. 2478–2482).
Fama, E. (1965). The behavior of stock market prices. Journal of Business,
January, 34–105.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continous-
valued attributes for classification learning. Proc. of the International
Joint Conference on Artificial Intelligence (pp. 1022–1027). Morgan
Kaufmann.
Feder, M., Merhav, N., & Gutman, M. (1992). Universal prediction of
individual sequences. IEEE Trans. Information Theory, IT-38, 1258–
1270.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of
online learning and an application to boosting. Proceedings of the Second
European Conference on Machine Learning (pp. 23–37). Springer-Varlag.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting al-
gorithm. Machine Learning: Proceedings of the Thirteenth International
Conference.
Galitz, L. (1995). Financial engineering: Tools and techniques to manage
financial risk. Pitman.
Gershenfeld, N., & Weigend, S. (1993). The future of time series: Learning
and understanding. Addison-Wesley.
Gonzalez, C. A., & Diez, J. J. R. (2000). Time series classification by boost-
ing interval based literals. Inteligencia Artificial, Revista Iberoamericana
de Inteligencia Artificial, 11, 2–11.
Han, J., & Kamber, M. (2001). Data mining. concepts and techniques.
Morgan Kaufmann.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statis-
tical learning. data mining, inference and prediction. Springer.

Haughen, R. (1997). Modern investment theory. Prentice Hall.

Hawawini, G., & Keim, D. (1995). On the predictability of common stock


returns: World-wide evidence, chapter 17. North Holland.

Heiler, S. (1999). A survey on nonparametric time series analysis.

Hellström, T., & Holmström, K. (1998). Predicting the stock market (Tech-
nical Report). Univ. of Umeå, Sweden.

Ho, T. K. (2001). Data complexity analysis for classifier combination.


Lecture Notes in Computer Science, 2096, 53–.

Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computa-


tion, 9, 1–42.

Jiang, W. (2001). Some theoretical aspects of boosting in the presence of


noisy data. Proc. of ICML-2001.

Judd, K., & Small, M. (2000). Towards long-term prediction. Physica D,


136, 31–44.

Kantz, H., & Schreiber, T. (1999a). Nonlinear time series analysis. Cam-
bridge Univ. Press.

Kantz, H., & Schreiber, T. (1999b). Nonlinear time series analysis. Cam-
bridge Univ. Press.

Kingdon, J. (1997). Intelligent systems and financial forecasting. Springer.

Kohavi, R., & Sahami, M. (1996). Error-based and entropy-based dis-


cretization of continuous features. Proc. of Second Itn. Conf. on Knowl-
edge Discovery and Data Mining (pp. 114–119).

Kovalerchuk, B., & Vityaev, E. (2000). Data mining in finance: Advances


in relational and hybrid methods. Kluwer Academic.

Kutsurelis, J. (1998). Forecasting financial markets using neural networks:
An analysis of methods and accuracy.
Kwedlo, W., & Kretowski, M. (1999). An evolutionary algorithm using
multivariate discretization for decision rule induction. Principles of Data
Mining and Knowledge Discovery (pp. 392–397).
Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming: Techniques
and applications. Ellis Horwood.
LeBaron, B. (1993). Nonlinear diagnostics and simple trading rules for
high-frequency foreign exchange rates. In A. Weigend and N. Gershenfeld
(Eds.), Time series prediction: Forecasting the future and understanding
the past, 457–474. Reading, MA: Addison Wesley.
LeBaron, B. (1994). Chaos and forecastability in economics and finance.
Phil. Trans. Roy. Soc., 348, 397–404.
LeBaron, B., & Weigend, A. (1994). Evaluating neural network predictors
by bootstrapping. Proc. of Itn. Conf. on Neural Information Processing.
Lefèvre, E. (1994). Reminiscences of a stock operator. John Wiley & Sons.
Lequeux, P. (Ed.). (1998). The financial markets tick by tick. Wiley.
Lerche, H. (1997). Prediction and complexity of financial data (Technical
Report). Dept. of Mathematical Stochastic, Freiburg Univ.
Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An
enabling technique. Data Mining and Knowledge Discovery, 393–423.
Liu, H., & Setiono, R. (1997). Feature selection via discretization (Technical
Report). Dept. of Information Systems and Computer Science, Singapore.
Liu, Y., & Yao, X. (1998). Negatively correlated neural networks for clas-
sification.
Malkiel, B. (1996). Random walk down wall street. Norton.
Mandelbrot, B. (1963). The variation of certain speculative prices. Jour-
nal of Business, 36, 392–417.

Mandelbrot, B. (1997). Fractals and scaling in finance: Discontinuity and
concentration. Springer.

Mantegna, R., & Stanley, E. (2000). An introduction to econophysics:


Correlations and complexity in finance. Cambridge Univ. Press.

Margineantu, D., & Dietterich, T. (1997a). Pruning adaptive boosting


(Technical Report). Technical report: Oregon State University.

Margineantu, D., & Dietterich, T. (1997b). Pruning adaptive boosting


(Technical Report). Technical report: Oregon State University.

Michalewicz, Z. (1992). Genetic algorithms + data structures = programs.


Springer.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Molgedey, L., & Ebeling, W. (2000). Local order, entropy and predictabil-
ity of financial time series (Technical Report). Institute of Physics,
Humboldt-University Berlin, Germany.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation


Computing, Special issue on Inductive Logic Programming, 13, 245–286.

Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs.


Proceedings of the 1st Conference on Algorithmic Learning Theory (pp.
368–381). Ohmsma, Tokyo, Japan.

Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., &
Vapnik, V. (1997). Using support vector machines for time series predic-
tion.

Murphy, J. (1999). Technical analysis of the financial markets: A compre-


hensive guide to trading methods and applications. Prentice Hall.

Naftaly, U., Intrator, N., & Horn, D. (1997). Optimal ensemble averaging
of neural networks. Network, 8, 283–296.

Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag.

Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical
study. Journal of Artificial Intelligence Research, 169–198.
Ott, E. (1994). Coping with chaos. Wiley.
Oza, N. C., & Tumer, K. (2001). Dimensionality reduction through clas-
sifier ensembles. Instance Selection: A Special Issue of the Data Mining
and Knowledge Discovery Journal.
Peters, E. (1991). Chaos and order in the capital markets. Wiley.
Peters, E. (1994). Fractal market analysis. John Wiley & Sons.
Quinlan, R. (1993). C4.5: Programs for machine learning. Morgan Kauf-
mann.
Raftery, A. (1995). Bayesian model selection in social research, 111–196.
Blackwells, Oxford, UK.
Refenes, A. (Ed.). (1995). Neural networks in the capital markets. Wiley.
Ricci, F., & Aha, D. (1998). Error-correcting output codes for local learn-
ers. Proceedings of the 10th European Conference on Machine Learning.
Rätsch, G., Schölkopf, B., Smola, A., Müller, K.-R., Onoda, T., & Mika, S.
(2000). nu-arc: Ensemble learning in the presence of outliers.
Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a rec-
ommended approach. Data Mining and Knowledge Discovery, 1, 317–327.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting
the margin: a new explanation for the effectiveness of voting methods.
Proc. 14th International Conference on Machine Learning (pp. 322–330).
Morgan Kaufmann.
Shannon, C., & Weaver, W. (1949). The mathematical theory of commu-
nication. Urbana, Illinois: University of Illinois Press.
Skurichina, M., & Duin, R. P. (2001). Bagging and the random subspace
method for redundant feature spaces. Second International Workshop,
MCS 2001.

Sollich, P., & Krogh, A. (1996). Learning with ensembles: How overfitting
can be useful. Advances in Neural Information Processing Systems (pp.
190–196). The MIT Press.

Sullivan, R., Timmermann, A., & White, H. (1999). Data-snooping, technical


trading rule performance and the bootstrap. J. of Finance.

Sutcliffe, C. (1997). Stock index futures: Theories and international evi-


dences. International Thompson Business Press.

Swingler, K. (1994). Financial prediction, some pointers, pitfalls and com-


mon errors (Technical Report). Centre for Cognitive and Computational
Neuroscience, Stirling Univ., UK.

Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dy-


namical Systems and Turbulence, 898.

Tay, F., & Shen, L. (2002). A modified chi2 algorithm for discretization.
Knowledge and Data Engineering, 14, 666–670.

Trippi, R. (1995). Chaos and nonlinear dynamics in the financial markets.


Irwin.

Tsay, R. (2002). Analysis of financial time series. Wiley.

Tumer, K., Bollacker, K., & Ghosh, J. (1998). A mutual information based
ensemble method to estimate bayes error.

Tumer, K., & Ghosh, J. (1996). Estimating the bayes error rate through
classifier combining. International Conference on Pattern Recognition
(pp. 695–699).

Vandewalle, N., Ausloos, M., & Boveroux, P. (1997). Detrended fluctuation


analysis of the foreign exchange markets. Proc. Econophysics Workshop,
Budapest.

Walczak, S. (2001). An empirical analysis of data requirements for financial


forecasting with neural networks.

Webb, G. (1998). Multiboosting: A technique for combining boosting and
wagging (Technical Report). School of Computing and Mathematics,
Deakin University, Australia.
Weigend, A., & Gershenfeld, N. (1994). Time series prediction: Forecasting
the future and understanding the past. Addison-Wesley.
Witten, I., & Frank, E. (1999). Data mining: Practical machine learning
tools and techniques with java implementations. Morgan Kaufmann.
WSE (1995 onwards). Daily quotes.
http://yogi.ippt.gov.pl/pub/WGPW/wyniki/.
Zemke, S. (1998). Nonlinear index prediction. Physica A, 269, 177–183.
Zemke, S. (1999a). Amalgamation of genetic selection and bag-
ging. GECCO-99 Poster, www.genetic-algorithm.org/GECCO1999/phd-
www.html (p. 2).
Zemke, S. (1999b). Bagging imperfect predictors. ANNIE’99. ASME Press.
Zemke, S. (1999c). ILP via GA for time series prediction (Technical Report).
Dept. of Computer and System Sciences, KTH, Sweden.
Zemke, S. (2000). Rapid fine tuning of computationally intensive classifiers.
Proceedings of AISTA, Australia.
Zemke, S. (2002a). Ensembles in practice: Prediction, estimation, multi-
feature and noisy data. Proceedings of HIS-2002, Chile, Dec. 2002 (p. 10).
Zemke, S. (2002b). On developing a financial prediction system: Pitfalls
and possibilities. Proceedings of DMLL-2002 Workshop at ICML-2002,
Sydney, Australia.
Zemke, S., & Rams, M. (2003). Multivariate feature coupling and dis-
cretization. Proceedings of FEA-2003.
Zheng, Z., Webb, G., & Ting, K. (1998). Integrating boosting and stochastic
attribute selection committees for further improving the performance of
decision tree learning (Technical Report). School of Computing and
Mathematics, Deakin University, Geelong, Australia.

Zirilli, J. (1997). Financial prediction using neural networks. International
Thompson Computer Press.

