Doctoral Thesis
The Royal Institute of Technology
Department of Computer and Systems Sciences
December 2003
Doctoral Thesis
The Royal Institute of Technology, Sweden
ISBN 91-7283-613-X
Abstract
Hard problems force innovative approaches and attention to detail, their exploration
often contributing beyond the area initially attempted. This thesis investigates
the data mining process resulting in a predictor for numerical series. The series
experimented with come from financial data – usually hard to forecast.
One approach to prediction is to spot patterns in the past, when we already know
what followed them, and to test on more recent data. If a pattern is followed by
the same outcome frequently enough, we can gain confidence that it is a genuine
relationship.
Because this approach does not assume any special knowledge about the regularities or their form, the method is quite general – applicable to other time series, not just financial ones. However, the generality puts strong demands on the pattern detection, which must notice regularities in any of the many possible forms.
The thesis' quest for automated pattern-spotting involves numerous data mining and optimization techniques: neural networks, decision trees, nearest neighbors, regression, genetic algorithms and others. A comparison of their performance on stock exchange index data is one of the contributions.
As no single technique performed sufficiently well, a number of predictors were put together, forming a voting ensemble. The vote is diversified not only by different training data – as usually done – but also by the learning method and its parameters. An approach is also proposed for speeding up predictor fine-tuning.
The algorithm development goes further still: a prediction can only be as good as the training data, hence the need for good data preprocessing. In particular, new
multivariate discretization and attribute selection algorithms are presented.
The thesis also includes overviews of prediction pitfalls and possible solutions, as
well as of ensemble-building for series data with financial characteristics, such as
noise and many attributes.
The Ph.D. thesis consists of an extended background on financial prediction, 7
papers, and 2 appendices.
Acknowledgements
I would like to take the opportunity to express my gratitude to the many
people who helped me with the developments leading to the thesis. In
particular, I would like to thank Ryszard Kubiak for his tutoring and
support reaching back to my high-school days and beginnings of university
education, also for his help to improve the thesis. I enjoyed and appreciated
the fruitful exchange of ideas and cooperation with Michal Rams, to whom
I am also grateful for comments on a part of the thesis. I am also grateful to
Miroslawa Kajko-Mattsson for words of encouragement in the final months
of the Ph.D. efforts and for her style-improving suggestions.
In the early days of my research Henrik Boström stimulated my interest
in machine learning and Pierre Wijkman in evolutionary computation. I
am thankful for that and for the many discussions I had with both of
them. And finally, I would like to thank Carl Gustaf Jansson for being
such a terrific supervisor.
I am indebted to Jozef Swiatycki for all forms of support during the
study years. Also, I would like to express my gratitude to the computer
support people, in particular, Ulf Edvardsson, Niklas Brunbäck and Jukka
Luukkonen at DMC, and to other staff at DSV, in particular to Birgitta
Olsson for her patience with the final formatting efforts.
I dedicate the thesis to my parents who always believed in me.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Questions in Financial Prediction . . . . . . . . . . . . . . 2
1.2.1 Questions Addressed by the Thesis . . . . . . . . . 4
1.3 Method of the Thesis Study . . . . . . . . . . . . . . . . . 4
1.3.1 Limitations of the Research . . . . . . . . . . . . . 4
1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . 6
2 Extended Background 9
2.1 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Time Series Glossary . . . . . . . . . . . . . . . . . 10
2.1.2 Financial Time Series Properties . . . . . . . . . . 13
2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Data Integration . . . . . . . . . . . . . . . . . . . 15
2.2.3 Data Transformation . . . . . . . . . . . . . . . . . 16
2.2.4 Data Reduction . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Data Discretization . . . . . . . . . . . . . . . . . . 17
2.2.6 Data Quality Assessment . . . . . . . . . . . . . . . 18
2.3 Basic Time Series Models . . . . . . . . . . . . . . . . . . 18
2.3.1 Linear Models . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Limits of Linear Models . . . . . . . . . . . . . . . 19
2.3.3 Nonlinear Methods . . . . . . . . . . . . . . . . . . 20
2.3.4 General Learning Issues . . . . . . . . . . . . . . . 21
2.4 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . 23
2.5 System Evaluation . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Evaluation Data . . . . . . . . . . . . . . . . . . . 24
2.5.2 Evaluation Measures . . . . . . . . . . . . . . . . . 25
2.5.3 Evaluation Procedure . . . . . . . . . . . . . . . . . 25
2.5.4 Non/Parametric Tests . . . . . . . . . . . . . . . . 26
5 Bibliographical Notes 39
List of Thesis Papers
Stefan Zemke. Nonlinear Index Prediction. Physica A 269 (1999). 45

Stefan Zemke. ILP and GA for Time Series Prediction. Dept. of Computer and Systems Sciences Report 99-006. 57

Stefan Zemke. Bagging Imperfect Predictors. ANNIE'99, St. Louis, MO, US, 1999. 71

Stefan Zemke. Rapid Fine-Tuning of Computationally Intensive Classifiers. MICAI'2000, Mexico, 2000. LNAI 1793. 81

Stefan Zemke. On Developing Financial Prediction System: Pitfalls and Possibilities. DMLL Workshop at ICML-2002, Australia, 2002. 95
Chapter 1
Introduction
Predictions are hard, especially about the future. Niels Bohr and Yogi Berra
1.1 Background
As computers, sensors and information distribution channels proliferate,
there is an increasing flood of data. However, the data is of little use unless
it is analyzed and exploited. There is indeed little use in just gathering the
tell-tale signals of a volcano eruption, a heart attack, or a stock exchange
crash, unless they are recognized and acted upon in advance. This is where
prediction steps in.
To be effective, a prediction system requires good input data, good
pattern-spotting ability, and good evaluation of the discovered patterns, among other things.
The input data needs to be preprocessed, perhaps enhanced by domain
expert knowledge. The prediction algorithms can be provided by methods
from statistics, machine learning, analysis of dynamical systems, together
known as data mining – concerned with extracting useful information from
raw data. And predictions need to be carefully evaluated to see if they fulfill
criteria of significance, novelty, usefulness etc. In other words, prediction is
not an ad hoc procedure. It is a process involving a number of premeditated
steps and domains, all of which influence the quality of the outcome.
The process is far from automatic. A particular prediction task requires
experimentation to assess what works best. Part of the assessment comes
from intelligent but to some extent artful exploratory data analysis. If the
task is poorly addressed by existing methods, the exploration might lead
to the development of a new algorithm.
The thesis research follows that progression, starting from the question of
days-ahead predictability of stock exchange index data. The thesis work
and contributions consist of three developments. First, exploration of simple
methods of prediction, exemplified by the initial thesis papers. Second,
higher-level analysis of the development process leading to a successful predictor.
The process also supplements the simple methods with domain specifics
and advanced approaches such as elaborate preprocessing, ensembles, and chaos theory.
Third, the thesis presents new algorithmic solutions,
such as bagging a Genetic Algorithm population, parallel experiments for
rapid fine-tuning, and multivariate discretization.
Time series are common. Road traffic in cars per minute, heart beats
per minute, number of applications to a school every year and a whole
range of scientific and industrial measurements, all represent time series
which can be analyzed and perhaps predicted. Many prediction
tasks face similar challenges, such as how to decide which input series will
enhance prediction, how to preprocess them, or how to efficiently tune various
parameters. Although the thesis refers to financial data, most of the
work is applicable to other domains, if not directly then indirectly, by
pointing out possibilities and pitfalls in predictor development.
Meta-methods. What are the ways to improve the methods? Can meta-
heuristics successful in other domains, such as ensembles or pruning,
improve financial prediction?
Data. Can the amount and type of data needed for prediction be characterized?
1.2.1 Questions Addressed by the Thesis
The thesis addresses many of the questions, in particular the prediction
possibility, methods, meta-methods, data preprocessing, and the prediction
development process. More details on the contributions are provided
in the chapter Contributions of the Thesis Papers.
end up with a Ph.D. and as a millionaire, or without anything, should the
prediction attempts fail. This is too high a risk to take. This is why in my
research, after the initial head-on attempts, I took a more balanced path,
investigating prediction from the side: methods, data preprocessing etc.,
rather than prediction results per se.
Another criticism could address the omission or shallowness of experiments
involving some of the relevant methods. For instance, a researcher
devoted to Inductive Logic Programming could bring forward a new system
good at dealing with numerical/noisy series, or an econometrician
could point out the omission of linear methods. The reply could be: there
are too many possibilities for one person to explore, so it was necessary
to skip some. Even then, the interdisciplinary research demanded much
work, among other things, for:
pointed out, the objective was not to prove there is a profit possibility in the
predictions. This would involve not only commissions, but also a trading
model. A simple model would not fit the bill, so there would be a need
to investigate how predictions, together with general knowledge, trader’s
experience etc. merge into successful trading – a subject for another Ph.D.
Second, after commissions, the above-random gains would be much thinner,
demanding better predictions, more data, and more careful statistics to
spot the effect – perhaps too much for a pilot study.
The lack of experiments backing some of the thesis ideas is another
shortcoming. The research attempts to be practical, i.e. mostly experi-
mental, but there are tradeoffs. As ideas become more advanced, the path
from an idea to a reported evaluation becomes more involved. For instance,
to predict, one needs data preprocessing, often including discretization. So,
even having implemented an experimental predictor, it could not be
evaluated until the discretization was completed, pressing to describe just
the prediction part, without real evaluation. Computational demands also
grow – a notebook computer is no longer enough.
The rest of the initial chapters – preceding the thesis papers – are meant to
provide the reader with the papers' background, which is often only skimmed in them
for page-limit reasons. Thus, the Extended Background chapter goes
through the areas and issues involved in time series prediction
in the financial domain, one objective being to introduce the vocabulary.
The intention is also to present the breadth of the prediction area
and of my study of it, which perhaps will allow one to appreciate the effort
and knowledge behind the developments in this domain.
Then comes the Development of the Thesis chapter which, more or
less chronologically, presents the research advancement. In this tale one
can also see the many attempts that proved to be dead ends. As such, the
published positive results can be seen as the essence of a much larger body of work.
The next chapter, Contributions of the Thesis Papers, summarizes all the
thesis papers and their contributions. The summaries assume familiarity
with the vocabulary of the Extended Background chapter.
The rest of the thesis consists of the thesis papers, formatted for a common
appearance but otherwise quoted the way they were published. The thesis
ends with a common bibliography, resolving references for the introductory
chapters and all the included papers.
Chapter 2
Extended Background
2.1.1 Time Series Glossary
Stationarity of a series indicates that its mean value and arbitrary autocorrelations
are time invariant. Finance literature commonly assumes
that asset returns are weakly stationary. This can be checked, provided a
sufficient number of values; e.g., one can divide the data into subsamples and
check the consistency of the mean and autocorrelations (Tsay, 2002). Determining
whether a series has moved into a nonstationary regime is not trivial, let alone
deciding which of the series' properties still hold. Therefore, most
prediction systems, being based on past data, implicitly assume that
the predicted series is to a great extent stationary, at least with respect
to the invariants that the system may spot, which most likely go beyond
the mean and autocorrelations.
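As an illustration of the subsample check mentioned above, here is a minimal sketch (not from the thesis; plain NumPy assumed) that splits a return series into consecutive chunks and compares their means and low-order autocorrelations; strongly differing values would suggest nonstationarity:

```python
import numpy as np

def subsample_stationarity_check(returns, n_subsamples=4, max_lag=5):
    """Split returns into consecutive chunks and report mean, std and the
    first few autocorrelations of each chunk, for a rough consistency check."""
    chunks = np.array_split(np.asarray(returns, dtype=float), n_subsamples)
    report = []
    for chunk in chunks:
        c = chunk - chunk.mean()
        acf = [float(np.corrcoef(c[:-k], c[k:])[0, 1]) for k in range(1, max_lag + 1)]
        report.append({"mean": float(chunk.mean()), "std": float(chunk.std()), "acf": acf})
    return report

# synthetic example; real use would pass e.g. daily index returns
rng = np.random.default_rng(0)
for i, stats in enumerate(subsample_stationarity_check(rng.normal(0, 0.01, 1200))):
    print(i, round(stats["mean"], 5), [round(a, 2) for a in stats["acf"]])
```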
Deterministic and Nondeterministic Chaos. For a reader new to chaos, an
illustration of the theory applied to finances can be found in (Deboeck,
1994). A system is chaotic if its trajectory through state space is sensi-
tively dependent on the initial conditions, that is, if small differences are
magnified exponentially with time. This means that initially unobserv-
able fluctuations will eventually dominate the outcome. So, though the
process may be deterministic, it is unpredictable in the long run (Kantz
& Schreiber, 1999a; Gershenfeld & Weigend, 1993). Deterministic means
that given the same circumstances the transition from a state is always the
same.
The question of whether financial markets exhibit this kind of behavior is hotly
debated, and there are numerous publications supporting each view. The
deterministic chaos notion involves a number of issues. First, whether
markets react deterministically to events influencing prices, versus a more
probabilistic reaction. Second, whether magnified small changes indeed
eventually take over, which need not be the case; e.g., self-correction
could step in if a value drifts too far off the mark – overpriced or underpriced.
Financial time series have been analyzed in those respects; however, the
mathematical theory behind chaos often deals poorly with the noise prevalent
in financial data, making the results dubious.
Even a chaotic system can be predicted up to a point where magnified
disturbances dominate. The time when this happens depends inversely
on the largest Lyapunov exponent, a measure of divergence. It is an average
statistic – at any given time the process is likely to have different
divergence/predictability, especially if nonstationary. Beyond that horizon, prediction is
possible only in statistical terms – which outcomes are more likely, no matter
where we start. Weather – a chaotic system – is a good illustration:
despite global efforts in data collection, forecasts are precise only up to a few
days and in the long run offer only statistical views such as the average monthly
temperature. However, chaos is not to be blamed for all poor forecasts – it
has recently been observed that errors in weather forecasts initially grow
not exponentially but linearly, which points more to imprecise weather
models than to chaos at work.
Another exciting aspect of a chaotic system is its control. Since at times the
system is so sensitive to disturbances, a small influence applied at such a time can
profoundly alter the trajectory, provided that the system remains deterministic
for a while thereafter. So, potentially, a government or a speculator
who knew the rules could control the markets without a vast investment.
Modern pacemakers for the human heart – another chaotic system – work
by this principle, providing a small electrical impulse only when needed,
without constantly overwhelming the heart's electrical activity.
Still, it is unclear whether the markets are stochastic or deterministic, let alone
chaotic. A mixed view is also possible: markets are deterministic only in
part, so even short-term prediction cannot be fully accurate; or there
are pockets of predictability – markets, or market conditions, in which
the moves are deterministic, the rest being stochastic.
Takens' Theorem (Takens, 1981) states that we can reconstruct the dynamics
of a deterministic system – possibly multidimensional, in which each
state is a vector – by a long-enough observation of just one noise-free variable
of the system. Thus, given a series we can answer questions about
the dynamics of the system that generated it by examining the dynamics
in a space defined by delayed values of just that series. From this, we can
compute features such as the number of degrees of freedom and linking of
trajectories and make predictions by interpolating in the delay embedding
space. However, Takens' theorem holds for ideal mathematical measurement
functions, not the ones seen in the laboratory or the market: an asset price is
not a noise-free function. Nevertheless, the theorem supports experiments
with a delay embedding, which might yield useful models. In fact, they
often do (Deboeck, 1994).
Prediction, modeling, and characterization are three different goals of time series
analysis (Gershenfeld & Weigend, 1993): ”The aim of prediction is
to accurately forecast the short-term evolution of the system; the goal of
modeling is to find description that accurately captures features of the
long-term behavior. These are not necessarily identical: finding governing
equations with proper long-term properties may not be the most reliable
way to determine parameters for short-term forecasts, and a model that
is useful for short-term forecasts may have incorrect long-term properties.
Characterization attempts with little or no a priori knowledge to deter-
mine fundamental properties, such as the number of degrees of freedom of
a system or the amount of randomness.”
ing phenomenon is inconsistent with semi-strong form efficiency, and the
January effect is inconsistent even with weak form efficiency. Overall, the
evidence indicates that a great deal of information available at all levels is,
at any given time, reflected in stock prices. The market may not be easily
beaten, but it appears to be beatable, at least if you are willing to work at
it.”
Data frequency refers to how often series values are collected: hourly,
daily, weekly etc. Usually, if a financial series provides values on a daily
or longer basis, it is low frequency data; otherwise – when many intraday
quotes are included – it is high frequency. Tick-by-tick data includes all
individual transactions, and as such the event-driven time between data
points varies, creating a challenge even for such a simple calculation as correlation.
The minute market microstructure and massive data volume create
new problems and possibilities not dealt with by the thesis. The reader
interested in high frequency finance can start at (Dacorogna et al., 2001).
Data cleaning fills in missing values, smoothes noisy data, handles or re-
moves outliers, resolves inconsistencies. Missing values can be handled by
a generic method (Han & Kamber, 2001). Methods include skipping the
whole instance with a missing value, filling the gap with the mean or a new
'unknown' constant, or using inference, e.g. based on the most similar instances
or some Bayesian considerations.
Series data has another dimension – we do not want to spoil the temporal
relationship, thus data restoration is preferable to removal. The restoration
should also accommodate the time aspect and not use values too distant in time.
Noise is prevalent; especially low-volume markets should be treated
with suspicion. Noise reduction usually involves some form of averaging, or
putting a range of values into one bin, i.e. discretization.
If the data changes are numerous, a test of whether the predictor picks up the inserted
bias is advisable. This can be done by 'missing' some values from a random
series – or better, from permuted actual returns – and then restoring, cleaning
etc. the series as if it were genuine. If the predictor can subsequently predict
anything from this, after all, random series, too much structure has been
introduced (Gershenfeld & Weigend, 1993).
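A rough sketch of this sanity check, assuming NumPy; the trivial persistence predictor below is only a stand-in for the real predictor, and any clearly above-chance accuracy on the permuted-then-restored series would indicate structure introduced by the restoration:

```python
import numpy as np

def cleaning_bias_check(returns, miss_frac=0.05, seed=0):
    """Permute real returns (destroying any genuine structure), delete a
    random fraction, restore the holes by interpolation, then check whether
    a trivial predictor gains accuracy purely from the restoration step."""
    rng = np.random.default_rng(seed)
    r = rng.permutation(np.asarray(returns, dtype=float))
    holes = rng.random(r.size) < miss_frac
    idx = np.arange(r.size)
    cleaned = r.copy()
    cleaned[holes] = np.interp(idx[holes], idx[~holes], r[~holes])
    # persistence predictor: next sign = current sign
    pred_up, actual_up = cleaned[:-1] > 0, cleaned[1:] > 0
    return float(np.mean(pred_up == actual_up))     # should stay near 0.5

rng = np.random.default_rng(1)
print(cleaning_bias_check(rng.normal(0, 0.01, 1000)))
```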
Data integration combines data from multiple sources into a coherent store.
Time alignment can require attention when series come from different sources,
e.g. different time zones. Series-to-instances conversion is required by most
learning algorithms, which expect a fixed-length vector as input. It
can be done by the delay vector embedding technique. Delay vectors
with the same time index t – coming from all input series – appended together
give an instance (data point, example), its coordinates referred to as data
features, attributes or variables.
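A sketch of this series-to-instances conversion, assuming NumPy; the lag of one, 10 delayed values per series and a 5-step-ahead up/down target mirror the setup used later in the thesis papers, but are otherwise arbitrary illustrative choices:

```python
import numpy as np

def to_instances(series_list, lags=10, horizon=5):
    """Append delay vectors with the same time index t from all input series
    into one instance; the target is an up/down move of the first series
    `horizon` steps ahead."""
    arrs = [np.asarray(s, dtype=float) for s in series_list]
    n = min(len(a) for a in arrs)
    X, y = [], []
    for t in range(lags - 1, n - horizon):
        row = []
        for a in arrs:
            row.extend(a[t - lags + 1 : t + 1][::-1])   # (v_t, v_{t-1}, ...)
        X.append(row)
        y.append(1 if arrs[0][t + horizon] > arrs[0][t] else 0)
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
X, y = to_instances([rng.normal(size=300), rng.normal(size=300)])
print(X.shape, y.shape)       # (286, 20) (286,)
```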
Data transformation changes the values of a series to make them more suitable
for prediction. Detrending is one such common transformation, removing
the growth of a series, e.g. by working with successive value differences,
or by subtracting a trend (linear, quadratic etc.) interpolation. For stocks,
indexes, and currencies, converting to a series of returns does the trick.
For volume, dividing it by the average of the last k quotes, e.g. a yearly average,
can scale it down.
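Two hedged examples of these transformations (plain NumPy; the 250-quote window approximates a year of daily data and is an assumption, not a prescription):

```python
import numpy as np

def to_returns(prices):
    """Detrend a price series by converting it to simple returns:
    r_t = p_t / p_{t-1} - 1."""
    p = np.asarray(prices, dtype=float)
    return p[1:] / p[:-1] - 1.0

def scale_volume(volume, window=250):
    """Scale volume by the average of the last `window` quotes,
    so the long-term drift in trading activity is divided away."""
    v = np.asarray(volume, dtype=float)
    out = np.full(v.shape, np.nan)
    for t in range(window, v.size):
        out[t] = v[t] / v[t - window:t].mean()
    return out
```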
Indicators are series derived from others, enhancing some features of
interest, such as trend reversal. Over the years, traders and technical analysts
trying to predict stock movements developed the formulae (Murphy,
1999), some later confirmed to carry useful information (Sullivan et al.,
1999). Indicators can also reduce noise, due to the averaging in many of the
formulae. Common indicators include: Moving Average (MA), Stochastic
Oscillator, Moving Average Convergence Divergence (MACD), Rate of
Change (ROC), Relative Strength Index (RSI).
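For illustration, two of the simpler indicators from the list, computed from their common textbook definitions (exact parameterizations vary between sources; the window lengths here are arbitrary):

```python
import numpy as np

def moving_average(prices, window=20):
    """Simple Moving Average (MA) over `window` quotes."""
    p = np.asarray(prices, dtype=float)
    return np.convolve(p, np.ones(window) / window, mode="valid")

def rate_of_change(prices, period=10):
    """Rate of Change (ROC): percentage change over `period` quotes."""
    p = np.asarray(prices, dtype=float)
    return 100.0 * (p[period:] - p[:-period]) / p[:-period]
```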
Normalization brings values into a certain range while minimally distorting
the initial data relationships; e.g. the SoftMax norm increasingly squeezes
extreme values while mapping the middle 95% of values approximately linearly.
2.2.4 Data Reduction
Discretization maps similar values into one discrete bin, with the idea that
it preserves important information, e.g. if all that matters is a real value’s
sign, it could be digitized to {0; 1}, 0 for negative, 1 otherwise. Some
prediction algorithms require discrete data, sometimes referred to as nom-
inal. Discretization can improve predictions by reducing the search space,
reducing noise, and by pointing to important data characteristics. Un-
supervised approaches work by dividing the original feature value range
into a few equal-length or equal-data-frequency intervals; supervised approaches by
maximizing a measure involving the predicted variable, e.g. entropy or the
chi-square statistic (Liu et al., 2002).
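A sketch of the two unsupervised variants just mentioned, assuming NumPy; supervised discretization would instead place the cut points by optimizing a measure computed against the predicted variable:

```python
import numpy as np

def equal_width_bins(x, k=4):
    """Split the value range into k equal-length intervals."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]
    return np.digitize(x, edges)            # integer labels 0..k-1

def equal_frequency_bins(x, k=4):
    """Place cut points at quantiles, so bins hold roughly equal counts
    (duplicate quantiles can merge bins for very discrete data)."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, k + 1))[1:-1]
    return np.digitize(x, edges)
```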
Since discretization is an information-losing transformation, it should
be approached with caution, especially as most algorithms perform univariate
discretization – they look at one feature at a time, disregarding
that it may have (additional) significance only in the context of other features,
which would be preserved by multivariate discretization. For example,
if the predicted class = sign(xy), only discretizing x and y in tandem can
discover their significance; alone, x and y can each be inferred as unrelated to the
class and even disregarded! The multivariate approach is especially important
in financial prediction, where no single variable can be expected
to bring significant predictability (Zemke & Rams, 2003).
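A small numerical illustration of the class = sign(xy) example above, on synthetic data: examined one at a time, x and y each look unrelated to the class, while their signs taken together determine it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)
cls = np.sign(x * y)                         # class depends on x and y jointly

# univariate view: the class looks independent of each feature alone
for name, v in (("x", x), ("y", y)):
    p = np.mean(cls[v > 0] > 0)              # P(class = +1 | feature > 0)
    print(f"P(class=+1 | {name}>0) = {p:.3f}")   # close to 0.5

# bivariate view: the pair of signs recovers the class perfectly
print("sign(x)*sign(y) accuracy:", np.mean(cls == np.sign(x) * np.sign(y)))
```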
Most linear time series models descend from the AutoRegressive Mov-
ing Average (ARMA) and Generalized Autoregressive Conditional Het-
eroskedastic (GARCH) (Bollerslev, 1986) models, a summary of which follows
(Tsay, 2002).
r_t = φ_0 + Σ_{i=1}^{p} φ_i r_{t−i} + a_t − Σ_{j=1}^{q} θ_j a_{t−j}

a_t = σ_t ε_t,   σ_t² = α_0 + Σ_{i=1}^{m} α_i a²_{t−i} + Σ_{j=1}^{s} β_j σ²_{t−j}
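For concreteness, a sketch simulating the simplest instance of the above: an AR(1) mean equation with GARCH(1,1) errors. The parameter values are arbitrary illustrative choices, not estimates from any data discussed in the thesis:

```python
import numpy as np

def simulate_ar1_garch11(n=1000, phi0=0.0, phi1=0.1,
                         alpha0=1e-5, alpha1=0.08, beta1=0.90, seed=0):
    """r_t = phi0 + phi1*r_{t-1} + a_t,  a_t = sigma_t*eps_t,
    sigma_t^2 = alpha0 + alpha1*a_{t-1}^2 + beta1*sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    r = np.zeros(n)
    a_prev = 0.0
    var_prev = alpha0 / (1.0 - alpha1 - beta1)     # unconditional variance
    for t in range(1, n):
        var_t = alpha0 + alpha1 * a_prev**2 + beta1 * var_prev
        a_t = np.sqrt(var_t) * rng.standard_normal()
        r[t] = phi0 + phi1 * r[t - 1] + a_t
        a_prev, var_prev = a_t, var_t
    return r

print(simulate_ar1_garch11().std())    # series shows volatility clustering
```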
2.3.2 Limits of Linear Models
Modern econometrics increasingly shifts towards nonlinear models of risk
and return. Bera – actively involved in (G)ARCH research – remarked
(Bera & Higgins, 1993): ”a major contribution of the ARCH literature is
the finding that apparent changes in the volatility of economic time series
may be predictable and result from a specific type of nonlinear depen-
dence rather than exogenous structural changes in variables”. Campbell
further argued (Campbell et al., 1997): ”it is both logically inconsistent
and statistically inefficient to use volatility measures that are based on
the assumption of constant volatility over some period when the resulting
series moves through time.”
at a single variable at a time, or perform exhaustive search, e.g. ILP Progol.
These limit the applicability, especially in an area where data is voluminous
and unlikely to be in the form of simple rules. Additionally, ensembles –
putting a number of different predictors to vote – obstruct the acclaimed
human comprehension of the rules. However, the approach could be of use
in more regular domains, such as customer rating and perhaps fraud de-
tection. Rules can also be extracted from an ANN, or used together with
probabilities, making them more robust (Kovalerchuk & Vityaev, 2000).
Nearest Neighbor (kNN) does not create a general model; to predict,
it looks back for the k most similar cases. It can be distracted by noisy/irrelevant
features, but if this is ruled out, a failure of kNN suggests that the most that
can be predicted are general regularities, e.g. based on the output (conditional)
distribution.
Support Vector Machines (SVM) offer a relatively new and powerful learner,
having attractive characteristics for time series prediction (Muller et al.,
1997). First, the model deals with multidimensional instances, actually the
more features the better – reducing the need for (wrong) feature selection.
Second, it has few parameters, so finding optimal settings can be easier;
one parameter refers to the noise level the system can handle.
can be (later) GA-optimized. Evolutionary systems – another example of
evolutionary computation – work in a similar way to GAs, except that the
solution is coded as a real-valued vector, and optimized not only with respect
to the values but also to the optimization rate.
This is a common problem, for a number of reasons. First, the training
and testing data are often not well separated, so memorizing the common
part will give the predictor a higher score. Second, multiple trials might
be performed on the same data (split), so in effect the predictor coming
out will be best suited for exactly that data. Third, the predictor com-
plexity – number of internal parameters – might be too big for the number
of training instances, so the predictor learns even the unimportant data
characteristics.
Precautions against overfitting involve: good separation of training and
testing data, careful evaluation, use of ensembles averaging out individual
overfitting, and application of Occam's razor. In general,
overfitting is a difficult problem that must be approached individually. A
discussion of how to deal with it can be found in (Mitchell, 1997).
are expected to be above-random and to make independent errors. The idea is
that a correct majority offsets individual errors, thus the ensemble will be
correct more often than an individual predictor. The diversity of errors
is usually achieved by training a scheme, e.g. C4.5, on different instance
samples or features. Alternatively, different predictor types – like C4.5,
ANN, kNN – can be used. Common schemes include Bagging, Boosting,
Bayesian ensembles and their combinations (Dietterich, 2000).
Boosting initially assigns equal weights to all data instances and trains a
predictor, then it increases the weights of the misclassified instances, trains
the next predictor on the new distribution, and so on. The final prediction is a
weighted vote of the predictors obtained in this way. Boosting increasingly
pays attention to misclassified instances, which may lead to overfitting if
the instances are noisy.
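A generic sketch of this reweighting loop in the AdaBoost style, one common concrete form of boosting; `train` and `predict` are hypothetical stand-ins for any base learner that accepts instance weights and produces +/-1 labels:

```python
import numpy as np

def boost(train, predict, X, y, rounds=10):
    """AdaBoost-style loop: train on weighted data, up-weight misclassified
    instances, repeat; the ensemble votes with per-round weights `alphas`."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # start with equal weights
    models, alphas = [], []
    for _ in range(rounds):
        model = train(X, y, w)
        pred = predict(model, X)                # labels in {-1, +1}
        err = float(np.sum(w[pred != y]))
        if err == 0.0 or err >= 0.5:            # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this round
        w *= np.exp(-alpha * y * pred)          # increase misclassified weights
        w /= w.sum()
        models.append(model)
        alphas.append(alpha)
    return models, alphas
```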
Usually, prediction performance is compared against published results.
Although this approach has its problems, such as data overfitting and accidental
successes due to multiple (worldwide!) trials, it works well as long
as everyone uses the same data and evaluation procedure, so meaningful
comparisons are possible. However, when no agreed benchmark is avail-
able, as in the financial domain, another approach must be adopted. Since
the main question concerning financial data is whether prediction is at all
possible, it suffices to compare a predictor’s performance against the in-
trinsic growth of a series – also referred to as the buy and hold strategy.
Then a statistical test can judge if there is a significant improvement.
2.5.2 Evaluation Measures
Financial forecasts are often developed to support semi-automated trading
(profitability), whereas the algorithms used in those systems might
originally have different objectives. Accuracy – the percentage of correct discrete
(e.g. up/down) predictions – is a common measure for discrete systems,
e.g. ILP/decision trees. Squared error – the sum of squared deviations from
the actual outputs – is a common measure in numerical prediction, e.g. ANN.
A performance measure – incorporating both the predictor and the trading
model it is going to benefit – is preferable and should ideally measure
exactly what we are interested in, e.g. commission- and risk-adjusted return
(Hellström & Holmström, 1998), not just return. Actually, many systems'
'profitability' disappears once commissions are taken into account.
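Minimal forms of the three kinds of measures mentioned, assuming NumPy; the +1/0/-1 signal convention and the per-trade commission value are illustrative assumptions rather than anything prescribed in the thesis:

```python
import numpy as np

def accuracy(pred_up, actual_up):
    """Fraction of correct discrete (up/down) predictions."""
    return float(np.mean(np.asarray(pred_up) == np.asarray(actual_up)))

def squared_error(pred, actual):
    """Sum of squared deviations from the actual outputs."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sum((pred - actual) ** 2))

def net_return(signals, returns, commission=0.002):
    """Return of following +1/0/-1 signals on a return series,
    charging a commission whenever the position changes."""
    signals = np.asarray(signals, float)
    returns = np.asarray(returns, float)
    trades = np.abs(np.diff(signals, prepend=0.0))
    return float(np.sum(signals * returns - trades * commission))
```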
Surrogate data is a useful concept in system evaluation (Kantz &
Schreiber, 1999a). The idea is to generate data sets sharing characteristics
of the original data – e.g. permutations of a series have the same mean,
variance etc. – and for each compute a statistic of interest, e.g. the return of a
strategy. If α is the acceptable risk of wrongly rejecting the null hypothesis
that the original series' statistic is lower (higher) than that of any surrogate,
then 1/α − 1 surrogates are needed; if all give higher (lower) statistics than
the original series, the hypothesis can be rejected. Thus, if a predictor's
error is lower on the original series than in 19 runs on surrogates,
we can be 95% sure it was not a fluke.
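A sketch of the surrogate procedure as described: with α = 0.05 it generates 1/α − 1 = 19 permuted series and requires the statistic on the real series to beat all of them; the naive trend-following gain below is only an example statistic:

```python
import numpy as np

def surrogate_test(series, statistic, alpha=0.05, seed=0):
    """One-sided surrogate test: reject the null at level `alpha` if the
    statistic on the real series exceeds it on all 1/alpha - 1 permutations."""
    rng = np.random.default_rng(seed)
    n_surrogates = int(round(1.0 / alpha)) - 1            # 19 for alpha = 0.05
    real = statistic(series)
    surrogates = [statistic(rng.permutation(series)) for _ in range(n_surrogates)]
    return real > max(surrogates), real, surrogates

def naive_gain(returns):
    """Example statistic: gain of following yesterday's sign."""
    r = np.asarray(returns, dtype=float)
    return float(np.sum(np.sign(r[:-1]) * r[1:]))

rng = np.random.default_rng(1)
print(surrogate_test(rng.normal(0, 0.01, 1000), naive_gain)[0])   # likely False
```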
Chapter 3
Development of the Thesis
The experiments started with Inductive Logic Programming (ILP) –
learning logic programs by combining provided background predicates sup-
posedly useful in the domain in question. I used the then (in 1997) state-
of-the-art system, Progol, reported successful in other domains, such as
toxicology and chemistry. I provided the system with various financial in-
dicators; however, despite many attempts, no compressed rules were ever
generated. This could be due to the noise present in financial data and to the
rules, if any, being far from the compact form sought by an ILP system.
The initial failure reiterated the question: is financial prediction at all
possible, and if so, which algorithm works best? The failure of an otherwise
successful learning paradigm directed the search towards more original
methods. After many fruitless trials, some promising results started appearing,
with the unorthodox method briefly presented in the Feasibility
Study on Short-Term Stock Prediction, Appendix A. This method
looked for invariants in the predicted time series – not just patterns with
high predictive accuracy, but patterns that have above-random accuracy
in a number of temporally distinct time epochs, thus excluding those that
work perhaps well, but only for a time. The work went unpublished since
the trials were limited and in the early stages of my research I was encouraged
to use more established methods. However, it is interesting to note
that the method is similar to entropy-based compression schemes, which I
discovered later.
So I went on to evaluate standard machine learning, to see which of the
methods warrants further investigation. I tried: Neural Network, Nearest
Neighbor, Naive Bayesian Classifier and Genetic Algorithm (GA) evolved
rules. That research, presented and published as Nonlinear Index Prediction,
thesis paper 1, concludes that Nearest Neighbor (kNN) works
best. Some of the details not included in the paper made it into the report
ILP and GA for Time Series Prediction, thesis paper 2.
The success of kNN suggested that delay embedding and local prediction
work for my data, and so perhaps could be improved. However, when
I tried to GA-optimize the embedding parameters, the prediction results
were no better. If fine-tuning was not the way, perhaps averaging a number
of rough predictors would be. The majority voting scheme did indeed
improve the prediction accuracy. The resulting publication, Bagging
Imperfect Predictors, thesis paper 3, presents bagging results from Nonlinear
Index Prediction and an approach believed to be novel at that time:
bagging predictions from a number of classifiers evolved in one GA population.
Another spin-off from the success of kNN in Nonlinear Index Prediction –
implying the presence of determinism and perhaps a limited dimension of
the data – was a research proposal, Evolving Differential Equations for Dynamical
System Modeling. The idea behind this more extensive project is
to use a Genetic Programming-like approach, but instead of evolving programs,
to evolve differential equations, known as the best descriptive and
modeling tool for dynamical systems. That is what the theory says, but
finding equations that fit given data is not yet a solved task. The project
was stalled, awaiting financial support.
But coming back to the main thesis track: the GA experiments in Bagging
Imperfect Predictors were computationally intensive, as is often the case
while developing a new learning approach. This problem gave rise to an
idea of how to try a number of development variants at once, instead of one
by one, saving computation time. Rapid Fine-Tuning of Computationally
Intensive Classifiers, thesis paper 4, explains the technique,
together with some experimental guidelines.
The ensemble of GA individuals, as in Bagging Imperfect Predictors,
could further benefit from a more powerful classifier committee technique,
such as boosting. The published poster Amalgamation of Genetic Se-
lection and Boosting, Appendix B, highlights the idea.
literature, making me abandon that line of research.
However, while searching for the comparisons above, I had done quite
an extensive review. I selected the most practical and generally applicable
papers for Ensembles in Practice: Prediction, Estimation, Multi-Feature
and Noisy Data, a publication which addresses the four data issues
relevant to financial prediction, thesis paper 5.
Beyond the general algorithmic considerations, there are also the
tens of little decisions that need to be taken while developing a prediction
system, many leading to pitfalls. While reviewing descriptions of many
systems 'beating the odds', I realized that, although widely different, the
acclaimed successful systems share common characteristics, while the naive
systems – quite often manipulative in presenting their results – share common
mistakes. This led to thesis paper 6, On Developing Financial
Prediction System: Pitfalls and Possibilities, which is an attempt to
highlight some of the common solutions.
Financial data are generated in complex and interconnected ways. What
happens in Tokyo influences what happens in New York and vice versa.
For prediction this has several consequences. First, there are very many
data series to potentially take as inputs, creating data selection and curse
of dimensionality problems. Second, many of the series are interconnected,
in general, in nonlinear ways. Hence, an attempt to predict must identify
the important series and their interactions, having decided that the data
warrants predictability at all.
These considerations led me to a long investigation. Searching for a
predictability measure, I had the idea of using common Zip compression
to estimate entropy in a constructive way: if the algorithm could compress
(many interleaved) series, its internal workings could provide the basis for a
prediction system. But while reviewing references I found similar, more
mathematically grounded work, so I abandoned mine. Then I shifted attention
to uncovering multivariate dependencies, alongside a predictability measure,
by means of a weighted and GA-optimized Nearest Neighbor, which failed
(it worked, but only for up to 15 input data series, whereas I wanted the
method to work for more than 50 series).
Then came a multivariate discretization idea, initially based on Shannon
(conditional) entropy, later reformulated in terms of accuracy. After so
many false starts, the feat was quite spectacular, as the method was able
to spot multivariate regularities, involving only a fraction of the data, in
up to 100 series. To my knowledge, this is also the first (multivariate)
discretization with maximizing ensemble performance as an objective.
Multivariate Feature Coupling and Discretization is thesis paper
number 7.
Alongside the second part of the thesis work, I have steadily developed time series
prediction software incorporating my experience and expertise. However,
at the time the thesis went to print the system was not yet operational, so its
description is not included.
Chapter 4
Contributions of the Thesis Papers
techniques used to analyze complex preprocessed data, a common approach
in the earlier studies of financial data so much contributing to the Efficient
Market Hypothesis view.
Since, due to publisher space limits, only the main results of the GA-optimized
ILP were included in the earlier paper, this report presents some
details of these computationally intensive experiments (Zemke, 1999c). Although
the overall accuracy of LP on the index data was not impressive,
the attempts still have practical value in outlining the limits of otherwise
successful techniques. First, the initial experiments applying Progol – at that
time a ’state of the art’ Inductive Logic Programming system – show that
a learning system successful in some domains can fail in others. There
could be at least two reasons for this: a domain unsuitable for the learning
paradigm, or unskillful use of the system. Here, I only note that most of
the successful applications of Progol involve domains where a few rules hold
most of the time: chemistry, astronomy, (simple) grammars, whereas financial
prediction rules, if any, are softer. As for the unskillful use of
an otherwise capable system, the comment could be that such a system
would merely shift the burden from learning the theory implied by the provided
data to learning the system's 'correct usage', instead of lessening the burden
altogether. As such, one should be aware that machine learning is
still more of an art, demanding experience and experimentation, than
engineering, providing procedures for almost blindly solving a given
problem.
The second contribution of this paper exposes background predicate sen-
sitivity – exemplified by variants of equal. The predicate definitions can
have a substantial influence on the achieved results – again highlighting
the importance of an experimental approach and, possibly, a requirement
for nonlinear predicates. Third, since GA-evolved LP can be viewed as
an instance of Genetic Programming (GP), the results confirm that GP is
perhaps not the best vehicle for time series prediction. And fourth, a gen-
eral observation about GA-optimization and learning: while evolving LPs of
varying size, the best (by accuracy) programs usually emerged in GA experiments
with only a secondary fitness bonus for smaller programs, as opposed
to runs in which programs were penalized by their size. Actually, it
was interesting to note that the path to small and accurate programs
often led through much bigger programs which were subsequently
reduced – had the bigger programs not been allowed to appear in the
first place, the smaller ones would not have been found either. This observation,
together with the not-so-good generalization of the smallest programs, issues
a warning against blind application of Occam's Razor in evolutionary
computation.
4.4 Rapid Fine Tuning of Computationally Intensive
Classifiers
This publication (Zemke, 2000), a spin-off of the experiments carried out
for the previous paper, elaborates on a practical aspect applicable to almost
any machine learning system development, namely the rapid fine-tuning
of parameters for optimal performance. The results could be summarized
as follows. First, working on a specific difficult problem, as in the case of
index prediction, can lead to solutions and insights for more general problems,
and as such is of value beyond merely the domain of the primary
investigation. Second, the paper describes a strategy for simultaneous
exploration of many versions of a fine-tuned algorithm with different pa-
rameter choices. And third, a statistical analysis method for detection of
superior parameter settings is presented, which together with the earlier
point allows for rapid fine-tuning.
falling markets, which most likely would average the systems’ performance.
Such are some of the many pitfalls pointed out.
Third, the paper suggests some solutions to these pitfalls and to general
issues arising in prediction system development.
4.7 Multivariate Feature Coupling and Discretization
This paper (Zemke & Rams, 2003) presents a multivariate discretization
method based on Genetic Algorithms applied twice: first to identify important
feature groupings, second to perform the discretization maximizing
a desired function, e.g. the predictive accuracy of an ensemble built on
those groupings. The contributions could be summarized as follows.
First, as the title suggests, a multivariate discretization is provided,
presenting an alternative to the very few multivariate methods reported.
Second, feature grouping and ranking – the intermediate outcome of the
procedure – has a value in itself: it allows one to see which features are interrelated
and how much predictability they bring in, promoting feature
selection. Third, the second, global GA-optimization allows an arbitrary
objective to be maximized, unlike in other discretization schemes where
the objective is hard-coded into the algorithm. The objective exemplified
in the paper maximizes the goal of prediction, accuracy, whereas other
schemes often only indirectly attempt to maximize it via measures such as
entropy or the chi-square statistic. As a fourth contribution, to my knowledge
this is the first discretization to allow explicit optimization for an
ensemble. This forces the discretization to act on a global basis, not merely
searching for maximal information gain per selected feature (grouping) but
for all features viewed together. Fifth, the global discretization can also
yield a global estimate of the predictability of the data.
Chapter 5
Bibliographical Notes
Machine Learning
Machine Learning (Mitchell, 1997). As of now, I would regard this book
as the textbook for machine learning. It not only presents the main learning
paradigms – neural networks, decision trees, rule induction, nearest
neighbor, analytical and reinforcement learning – but also introduces
hypothesis testing and computational learning theory. As such, it balances
the presentation of machine learning algorithms with practical issues of
using them, and some theoretical aspects of their function. Future editions
of this otherwise excellent book could also consider more novel
approaches: support vector machines and rough sets.
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations (Witten & Frank, 1999). Using this book, and the
software package Weka behind it, could save time otherwise spent on implementing
the many learning algorithms. This book essentially provides
an extended user guide to the open-source code available online. The
Weka toolbox, in addition to more than 20 parameterized machine learning
methods, offers data preparation, hypothesis evaluation and some visualization
tools. A word of warning, though: most of the implementations are
straightforward and non-optimized, suitable for learning the nuts
and bolts of the algorithms rather than for large-scale data mining.
The Elements of Statistical Learning: Data Mining, Inference, and Pre-
diction (Hastie et al., 2001). This book, similar in broad scope to Machine
Learning (Mitchell, 1997), can be recommended for its more rigorous treatment
and some additional topics, such as ensembles.
Data Mining and Knowledge Discovery with Evolutionary Algorithms
(Alex, 2002). This could be a good introduction to practical applications
of evolutionary computation to various aspects of data mining.
Financial Prediction
Here, I present a selection of books introducing to various aspects of non-
linear financial time series analysis.
Data Mining in Finance: Advances in Relational and Hybrid Methods
(Kovalerchuk & Vityaev, 2000). This is an overview of some of the methods
used for financial prediction and of features such a prediction system should
have. The authors also present their system, supposedly overcoming many
of the common pitfalls. However, the book is somewhat short on the details
needed to re-evaluate some of the claims, but it is good as an overview.
Trading on the Edge (Deboeck, 1994). This is an excellent book of self-contained
chapters practically introducing the essence of neural networks,
chaos analysis, genetic algorithms and fuzzy sets, as applied to
financial prediction.
Neural Networks in the Capital Markets (Refenes, 1995). This collection
on neural networks for economic prediction highlights some of the practical
considerations in developing a prediction system. Many of the hints are
applicable to prediction systems based on other paradigms, not just
neural networks.
Fractal Market Analysis (Peters, 1994). In this book, I found most
interesting the chapters on various applications of Hurst (R/S) analysis.
Though this has not resulted in my immediately using that approach, it is
always good to know what self-similarity analysis can reveal about the
data at hand.
Nonlinear Analysis, Chaos
Nonlinear Time Series Analysis (Kantz & Schreiber, 1999a). As authors
can be divided into those who write what they know, and those who know
what they write about, this is definitely the latter case. I would recom-
mend this book, among other introductions to nonlinear time series, for
its readability, practical approach, examples (though mostly from physics),
and formulae with clearly explained meanings. I could easily convert
many of the algorithms described in the text into code.
Time Series Prediction: Forecasting the Future and Understanding the
Past (Weigend & Gershenfeld, 1994). A primer on nonlinear prediction
methods. The book, finalizing the Santa Fe Institute prediction compe-
tition, introduces time series forecasting issues and discusses them in the
context of the competition entries.
Coping with Chaos (Ott, 1994). This book, by a contributor to the
chaos theory, is a worthwhile read providing insights into aspects of chaotic
data analysis, prediction, filtering, control, with the theoretical motivations
revealed.
Finance, General
Modern Investment Theory (Haughen, 1997). A relatively easy-to-read
book systematically introducing current views on investments, though mostly
from an academic standpoint. This book also discusses the Efficient
Market Hypothesis.
Financial Engineering (Galitz, 1995). A basic text on what financial
engineering is about and what it can do.
Stock Index Futures (Sutcliffe, 1997). Mostly an overview work, providing
numerous references to research on index futures. I considered skimming
the book essential for insights into documented futures behavior, so as not to
reinvent the wheel.
A Random Walk down Wall Street (Malkiel, 1996) and Reminiscences
of a Stock Operator (Lefvre, 1994). Enjoyable leisure reads about the mechanics
of Wall Street. In some sense the books – presenting investment
activity in a wider historical and social context – also have great educational
value. Namely, they show the influence of subjective, not always
rational, drives on the markets, which as such perhaps cannot be fully
analyzed by rational methods.
Nonlinear Index Prediction
International Workshop on Econophysics and Statistical Finance, 1998.
Nonlinear Index Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Abstract Neural Network, K-Nearest Neighbor, Naive Bayesian Classifier and Genetic
Algorithm evolving classification rules are compared for their prediction accuracies on
stock exchange index data. The method yielding the best result, Nearest Neighbor, is
then refined and incorporated into a simple trading system achieving returns above index
growth. The success of the method hints at the plausibility of nonlinearities present in the
index series and, as such, at the scope for nonlinear modeling/prediction.
Introduction
Financial time series present a fruitful area for research. On one hand,
there are economists claiming that profitable prediction is not possible, as
voiced by the Efficient Market Hypothesis; on the other, there is growing
evidence of exploitable features of these series. This work describes a
prediction effort involving 4 Machine Learning (ML) techniques. The experiments
use the same data and lack unduly specializing adjustments, the
goal being a relative comparison of the basic methods. Only subsequently
is the most promising technique scrutinized.
Machine Learning (Mitchell, 1997) has been extensively applied to fi-
nances (Deboeck, 1994; Refenes, 1995; Zirilli, 1997) and trading (Allen
& Karjalainen, 1993; Bauer, 1994; Dacorogna, 1993). Nonlinear time se-
ries approaches (Kantz & Schreiber, 1999a) have also become commonplace
(Trippi, 1995; Weigend & Gershenfeld, 1994). The controversial notion of
(deterministic) chaos in financial data is important since the presence of a
chaotic attractor warrants partial predictability of financial time series –
in contrast to the random walk and Efficient Market Hypothesis (Fama,
1965; Malkiel, 1996). Some of the results supporting deviation from the
log-normal theory (Mandelbrot, 1997) and limited financial predictability
can be found in (LeBaron, 1993; LeBaron, 1994).
The Task
Some evidence suggests that markets with lower trading volume are eas-
ier to predict (Lerche, 1997). Since the task of the study is to compare
ML techniques, data from the relatively small and scientifically unexplored
Warsaw Stock Exchange (WSE) (Aurell & Zyczkowski, 1996) is used, with
the quotes, from the opening of the exchange in 1991, freely available on
the Internet. At the exchange, prices are set once a day (with intraday
trading introduced more recently). The main index, WIG, is a capital-
ization weighted average of all the stocks traded on the main floor, and
provides the time series used in this study.
The learning task involves predicting the relative index value 5 quotes
ahead, i.e., a binary decision whether the index value one trading week
ahead will be up or down in relation to the current value. The interpretation
of up and down is such that they are equally frequent in the data set,
with down also including small index gains. This facilitates detection of
above-random predictions – their accuracy, as measured by the proportion
of correctly predicted changes, is 0.5 + s, where s is the threshold for
the required significance level. For the data comprising 1200 index quotes,
the following table presents the s values for one-sided 95% significance,
assuming that 1200 − WindowSize data points are used for the accuracy
estimate.
Window size: 60 125 250 500 1000
Significant error: 0.025 0.025 0.027 0.031 0.06
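The tabulated values are reproduced, to rounding, by the one-sided normal approximation to the binomial, s = z·sqrt(0.25/n) with z ≈ 1.645 and n = 1200 − WindowSize test points; a quick check (an assumption about how the table was computed, but it matches closely):

```python
import math

def significance_threshold(n_test, z=1.645):
    """One-sided 95% threshold above 0.5 for a binary predictor evaluated
    on n_test points (normal approximation of the binomial)."""
    return z * math.sqrt(0.25 / n_test)

for window in (60, 125, 250, 500, 1000):
    print(window, round(significance_threshold(1200 - window), 3))
# approximately 0.024, 0.025, 0.027, 0.031, 0.058 - close to the table above
```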
Learning involves WindowSize consecutive index values. Index daily
(relative) changes are digitized by monotonically mapping them into 8
integer values, 1..8, such that each is equally frequent in the resulting series.
This preprocessing is necessary since some of the ML methods require
bounded and/or discrete values. The digitized series is then used to create
delay vectors of 10 values, with lag one. Such a vector (c_t, c_{t-1}, c_{t-2}, ..., c_{t-9})
is the sole basis for prediction of the index up/down value at time t + 5
w.r.t. the value at time t. Only vectors, and their matching predictions,
derived from index values falling within the current window are used for
learning.
The best generated predictor – achieving the highest accuracy on the window
cases – is then applied to the vector next to the last one in the window,
yielding a prediction for the index value just past the window. With the
accuracy estimate accumulating and the window shifting over all available
data points, the resulting prediction accuracies are presented in the tables
as percentages.
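A compressed sketch of the preprocessing and sliding-window evaluation just described (equal-frequency digitization into codes 1..8, delay vectors of 10 lagged values, prediction 5 quotes ahead); `fit_predict` is a stand-in for any of the compared learners, and the class balancing used in the paper is omitted for brevity:

```python
import numpy as np

def digitize_equal_freq(x, k=8):
    """Map values into k equally frequent integer codes 1..k."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))[1:-1]
    return np.digitize(x, edges) + 1

def walk_forward_accuracy(index, fit_predict, window=250, lags=10, horizon=5):
    """Learn on delay vectors built inside the window, predict the case just
    past the window, shift by one quote, and accumulate accuracy."""
    index = np.asarray(index, dtype=float)
    rel = np.zeros_like(index)
    rel[1:] = index[1:] / index[:-1] - 1.0            # change ending at quote t
    codes = np.concatenate(([1], digitize_equal_freq(rel[1:])))
    hits = total = 0
    for end in range(window, len(index) - horizon):
        X, y = [], []
        for t in range(end - window + lags, end - horizon):
            X.append(codes[t - lags + 1 : t + 1][::-1])
            y.append(int(index[t + horizon] > index[t]))
        pred = fit_predict(np.array(X), np.array(y),
                           codes[end - lags + 1 : end + 1][::-1])
        hits += int(pred == int(index[end + horizon] > index[end]))
        total += 1
    return hits / total
```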
Five layered network topologies have been tested. The topologies, as de-
scribed by the numbers of non-bias units in subsequent layers, are: G0:
10-1, G1: 10-5-1, G2: 10-5-3-1, G3: 10-8-5-1, G4: 10-20-5-1. Units in
the first layer represent the input values. The standard backpropagation (BP)
algorithm is used for learning weights, with the change values 1..8 linearly
scaled down to the [0.2, 0.8] range required by the sigmoid BP, up
denoted by 0.8, and down by 0.2.
The window examples are randomly assigned to either a training or a
validation set, comprising 80% and 20% of the examples respectively.
The training set is used by BP to update weights, while the validation set is used
to evaluate the network’s squared output error. The minimal error network
for the whole run is then applied to the example next to the window for
prediction. Prediction accuracies and some observations follow.
Window/Graph G0 G1 G2 G3 G4
60 56 - - - -
125 58 56 63 58 -
250 57 57 60 60 -
500 58 54 57 57 58
1000 - - - 61 61
• Prediction accuracy, without outliers, is in the significant 56 – 61%
range
• Accuracies seem to increase with window size, reaching above 60% for
bigger networks (G2 – G4); as such, the results could further improve
with more training data
longer (up to 1000 data-points), which is consistent with other studies
on stock returns (Haughen, 1997).
K-Nearest Neighbor
In this approach, the K window vectors most similar to the one being classified
are found. The most frequent class among the K vectors is then
returned as the classification. The standard similarity metric is the Euclidean
distance between vectors. Some results and comments follow.
Window/K 1 11 125
125 56 - -
250 55 53 56
500 54 52 54
1000 64 61 56
• Peak of 64%
The above table was generated for the Euclidean metric. However,
the peak of 64% accuracy (though for other Window/K combinations) has
also been achieved for the Angle and Manhattan metrics, indicating that
the result is not merely an outlier due to some idiosyncrasies of the data
and parameters.
among 1..8 – together with the predicate symbol – evolved through the
GA. The other genetic operator is a 2-point list crossover, applied to the
2 programs – lists of clauses.
The second argument of the N-th literal is the clause's N-th head argument,
which is unified with the N-th value in a delay vector. Applying the up
predicate to a delay vector performs the prediction. If the predicate succeeds,
the classification is up, and down otherwise. The fitness of a program is measured
as the proportion of window examples it correctly classifies. Upon
GA termination, the fittest program from the run is used to classify
the example next to the current window. Programs in a population have
different lengths – numbers of up clauses – limited by a parameter, as shown
in the following table.
• Bigger programs (number of clauses > 10) are very slow to converge
and result in erratic predictions
Clause fitness function / Window        60     125    250    500    1000
AllPos + Pos − 10^3 · Neg               54.8   50.3   51.7   51.9   53.2
AllPos + 10^3 · Pos − 10^6 · Neg        57.1   51.7   52.8   53.0   48.9
as above & ordinary equality            53.6   51.9   53.0   52.5   58.8
Kmin – minimal number of vectors required within a neighborhood to
warrant prediction, in [1, 20)
The parameters are optimized via a GA. The function maximized is the relative gain of an investment strategy that takes a long position in the index when the aggregate prediction says it will go up, a short position when it says down, and stays in cash if no prediction is warranted. The prediction period is 5 days and the investment continues for that period, after which a new prediction is made. An aggregate prediction is computed by adding all the weighted contributory predictions associated with valid neighbors. If some of the requirements, e.g. the minimal number of neighbors, fail – no prediction is issued.
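One possible reading of the maximized quantity is sketched below under simplifying assumptions: predictions holds +1 for up, -1 for down, or None when no prediction is warranted, period_returns holds the matching relative 5-day index changes, and shorting is modeled naively as the negative of the index change.

    def strategy_gain(predictions, period_returns):
        # long on predicted up, short on predicted down, cash when no prediction;
        # the result is the strategy's growth relative to the index growth
        capital = index = 1.0
        for signal, change in zip(predictions, period_returns):
            index *= 1.0 + change
            if signal is not None:
                capital *= 1.0 + signal * change
        return capital / index - 1.0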
The following tests have been run. Test1 computed average annual gain
over index growth during 4 years of trading: 33%. Test2 computed minimal
(out of 5 runs shifted by 1 day each) gain during the last year (ending on
Sept. 1, 1998): 28%. Test3 involved generating 19 sets of surrogate data –
permuted logarithmic change series – and checking if the gain on the real
series exceeds those for the surrogate series; the test failed – in 6 cases the
gain on the permuted data was bigger. However, assuming normality of distribution in the Test2 and Test3 samples, the two-sample t procedure yielded a 95% significant result (t = 1.91, df = 14, P < 0.05) that the Test2 gains are indeed higher than those for Test3.
(Footnote: The (logarithmic) average for Test3 was around 0, as opposed to the strictly positive results and average for Test1 and Test2 – this could be the basis for another surrogate test.)
Conclusion
The results show that some exploitable regularities do exist in the index
data and Nearest Neighbor is able to profit from them. All the other, definitely more elaborate, techniques fall short of the 64% accuracy achieved via Nearest Neighbor. One of the reasons could be the non-linearity of the problem in question: with only linear relations available, the logic program classifier rules would require a linear problem for good performance, while the nonlinear Neural Network performs somewhat better. On the other hand, the Nearest Neighbor approach can be viewed as generalizing only locally – with no linear structure imposed or assumed – and with the granularity set by the problem examples.
As further research, other data could be tested, independent tests for
nonlinearity performed (e.g. dimension and Lyapunov exponent estima-
tion) and the other Machine Learning methods refined as well.
ILP and GA for Time Series
Prediction
Dept. of Computer and Systems Sciences Report 99-006
ILP via GA for Time Series Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Abstract This report presents experiments using GA for optimizing Logic Programs
for time series prediction. Both strategies: optimizing the whole program at once, and
building it clause-by-clause are investigated. The set of background predicates stays the
same during all the experiments, though the influence of some variations is also observed.
Despite extensive trials, none of the approaches exceeded 60% accuracy, with 50% for a random strategy and 64% achieved by a Nearest Neighbor classifier on the same data. Some reasons for the weak performance are discussed, including the non-linearity of the problem and a too greedy search approach.
Introduction
Inductive Logic Programming
Inductive Logic Programming (ILP) (Muggleton & Feng, 1990) – automatic induction of logic programs, given a set of examples and background predicates – has shown successful performance in several domains (Lavarac & Dzeroski, 1994).
The usual setting for ILP involves providing positive examples of the re-
lationship to be learned, as well as negative examples for which the relation-
ship does not hold. The hypotheses are selected to maximize compression
or information gain, e.g., measured by the number of used literals/tests
or program clauses. The induced hypothesis, in the form of a logic pro-
gram (or easily converted to it), can be usually executed without further
modifications as a Prolog program.
The hypotheses are often found via covering, in which a clause succeeding on, or covering, some positive examples is discovered (e.g. by greedy local search) and added to the program; the covered positives are removed from the example set and clauses are added until the set is empty. Each clause should cover positive examples only, with all the negative examples excluded, though this can be relaxed, e.g., because of different noise handling schemes.
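Schematically, the covering loop can be rendered as below (a Python sketch; find_clause is a hypothetical stand-in for the greedy clause search, and each returned clause is assumed to expose a covers() test):

    def covering(positives, negatives, find_clause):
        # repeatedly find a clause covering some remaining positives (and excluding
        # negatives), add it to the program, and drop the positives it now covers
        program, remaining = [], list(positives)
        while remaining:
            clause = find_clause(remaining, negatives)
            program.append(clause)
            remaining = [p for p in remaining if not clause.covers(p)]
        return program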
As such, ILP is well suited for domains where a compact representation of the learned concept is possible and likely. Without further elaboration, it can be seen that most of the ILP success areas belong to such domains, where a concise, mathematical-like description is feasible and subsequently discovered.
The Task
The task attempted in this work consists in short term prediction of a time
series – a normalized version of stock exchange daily quotes. The normal-
ization involved monotonically mapping the daily changes to 8 values, 1..8,
ensuring that the frequency of those values is equal in the 1200 points con-
sidered. The value to be predicted is a binary up or down, referring to the
index value five steps ahead in the series. These classifications were again
made equally frequent (with down including also small index gains).
The normalization of the class data allows easy detection of above-random predictions – their accuracy is above 50% + s, where s is the threshold for the required significance level. If the level is one-sided 95% and the predictions are tested on all 1200 − WindowSize examples, then the significant deviations from 0.5 are as presented in the table.
Thus, predictions with accuracy above 0.56 are of interest, no matter what the window size. For an impression of the predictability of the time series: a Nearest Neighbor method yielded 64% accuracy (Zemke, 1998), with a Neural Network reaching similar results.
Window Significant error
60 0.025
125 0.025
250 0.027
500 0.031
1000 0.06
Figure 5.1: One-sided 95% significance level errors for the tests
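The tabulated errors agree with the usual normal approximation to the binomial proportion, s = 1.645 * sqrt(0.25 / N) for N = 1200 − WindowSize one-sided 95% tests; a small check under that assumption:

    import math

    def significance_threshold(window, total=1200, z=1.645):
        # one-sided 95% deviation from 0.5 for N = total - window test predictions
        n = total - window
        return z * math.sqrt(0.25 / n)

    for w in (60, 125, 250, 500, 1000):
        print(w, round(significance_threshold(w), 3))
    # prints roughly 0.024, 0.025, 0.027, 0.031, 0.058 - close to the table values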
Figure 5.2: Data format sample: number, class and 10-changes vector
Data Format
available examples. The final accuracy is the ratio of correct predictions
to all predictions made.
GA Program Learning
Common GA Settings
All the tests use the same GA module, with GA parameters constant for
all trials, unless indicated otherwise. Random individual generation, mu-
tation, crossover, fitness evaluation are provided as plug-ins to the module
and are described for each experiment setting.
The Genetic Algorithm uses a 2-member tournament selection strategy, with the fitter individual having the lower numerical fitness value (which can be negative or positive). The mutation rate is 0.1, with each individual mutated at most once before applying other genetic operators, the crossover rate is 0.3 (so offspring constitute 0.6 of the next population) and the population size is 100. Two-point (uniform) crossover is applied only to the top-level list in the individuals' representation. The number of generations is at least 5, no more than 30, and the run is additionally terminated if its best individual has not improved in the last 5 generations.
A provision is made for the shifted window learning to benefit from the already learned hypothesis, in an incremental learning fashion. This can be conveniently done by using a few (mutated) copies of the previous window's best hypothesis while initializing a new population, instead of a totally random initialization. This is done both to speed up convergence and to increase GA exploitation.
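A condensed sketch of a GA loop with the above settings (2-member tournament, mutation rate 0.1, crossover rate 0.3, population 100, early termination after 5 non-improving generations); the problem-specific plug-ins random_individual, mutate, crossover and fitness are assumed supplied, crossover returning two offspring, and lower fitness being fitter as in the text:

    import random

    def run_ga(random_individual, mutate, crossover, fitness,
               pop_size=100, min_gen=5, max_gen=30, patience=5, seeds=()):
        # `seeds` may hold (mutated) copies of the previous window's best hypothesis
        population = list(seeds)[:pop_size]
        population += [random_individual() for _ in range(pop_size - len(population))]

        def tournament():
            a, b = random.sample(population, 2)
            return a if fitness(a) < fitness(b) else b  # lower fitness is fitter

        best, best_fit, stale = None, float("inf"), 0
        for gen in range(max_gen):
            offspring = []
            while len(offspring) < int(0.3 * pop_size) * 2:    # crossover rate 0.3
                offspring.extend(crossover(tournament(), tournament()))
            rest = [tournament() for _ in range(pop_size - len(offspring))]
            population = [mutate(ind) if random.random() < 0.1 else ind  # mutation 0.1
                          for ind in offspring + rest]
            gen_best = min(population, key=fitness)
            if fitness(gen_best) < best_fit:
                best, best_fit, stale = gen_best, fitness(gen_best), 0
            else:
                stale += 1
            if gen + 1 >= min_gen and stale >= patience:        # early termination
                break
        return best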
with the predicate symbol – evolved through the GA. The second argument
of clause’s N-th literal is the value of the N-th head argument, which is
unified with the N-th change value in an example’s tuple.
Window/Clauses 5 10 50 100 200
250 60 - - - –
500 44 47 53 50 –
1000 48 50 50 38 44
Results The results for bigger program sizes and smaller windows are missing, since the amount of information required to code such programs would be comparable to that needed to memorize the examples, which could easily lead to overfitting instead of generalization.
Observations from over 50 GA runs follow.
• For bigger programs allowed (clause count more than 50; with population sizes tried up to 2000), convergence is very slow and the best program is often (randomly) created in the initial population
Learning Individual Clauses via GA
To limit the search space explosion, perhaps responsible for the poor performance of the previous trial, the next tests optimize individual clauses, added one by one to the program. In this more traditional ILP setting, the window up cases constitute the positive, and the down cases the negative examples.
Evaluation Details of the fitness function vary and will be described for
the individual tests. In general, the function promotes a single clause
covering maximal number of positive and no negative examples in the
current window. The variants include different sets of positives (all or yet
uncovered), different weights assigned to their counts and some changes in
the relations used.
Evaluation Fitness of a clause is defined as the difference Negatives − Positives, where Negatives is the count of all negatives covered by the clause, and Positives is the count of positives not yet covered by previously added clauses but covered by the current clause.
Termination and full positive coverage are ensured by iterating over all
positive examples, with each clause added covering at least one of them.
Some observations about the results follow.
• The only significant prediction is that for window size 60, but only
just
Window Accuracy
60 54.8
125 50.3
250 51.7
500 51.9
1000 53.2
Window Accuracy
60 57.1
125 51.7
250 52.8
500 53.0
1000 48.9
• The rest of the results hint at no prediction, giving overall poor performance
Window Accuracy
60 53.6
125 51.9
250 53.0
500 52.5
1000 58.8
Decision Trees
The prediction task was also attempted via the Spectre (Bostrom & L., 1999) system, a propositional learner with results equivalent to a decision tree classifier, equipped with hypothesis pruning and noise handling. I am grateful to Henrik Boström for the courtesy of actually running the test on the provided data. The results follow.
given positive and excluding negative examples by exhaustively (according to some restrictions) considering combinations of background predicates. In the trials, the system came up either with up( any) or the example set itself as the most compressed hypothesis, thus effectively offering no learning.
Conclusion
The overall results are not impressive – none of the approaches has exceeded the 60% accuracy level. The failure of the standard ILP systems (Progol and the decision tree learner) can be indicative of the inappropriateness of the locally greedy, compression/information-gain driven approach to this type of problem. The failure of evolving whole programs once more shows the difficulty of finding optima in very big search spaces.
Another factor is the set of predicates used. As compared with the
GA runs, Progol and Spectre tests missed the inequality relation. As the
introduction of ordinary equality or removal of inequality showed, even the
flexible GA search is very sensitive to the available predicates. This could
be an area for further exploration.
All the above, definitely more elaborate, techniques fall short of the results achieved via the Nearest Neighbor method. One of the reasons could be the non-linearity of the problem in question: with only linear relations available, all generalizations assume a linear problem for good performance. On the other hand, the Nearest Neighbor approach can be viewed as generalizing only locally, moreover with the granularity set by the problem examples themselves.
Bagging Imperfect Predictors
ANNIE’99, St. Louis, MO, US, 1999
Bagging Imperfect Predictors
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Presented: ANNIE’99.
Published: Smart Engineering System Design, ASME Press, 1999
Introduction
Financial time series prediction presents a difficult task with no single
method best in all respects, the foremost of which are accuracy (returns)
and variance (risk). In the Machine Learning area, ensembles of classifiers
have long been used as a way to boost accuracy and reduce variance. Fi-
nancial prediction could also benefit from this approach, however due to
the peculiarities of financial data the usability needs to be experimentally
confirmed.
This paper reports experiments applying bagging – a majority voting
scheme – to predictors for a stock exchange index. The predictors come
from efforts to obtain a single best predictor. In addition to observing bag-
ging induced changes in accuracies, the study also analyzes their influence
on potential monetary returns.
The following section provides an overview of bagging. Next, the settings for the base study generating the index predictions are described, and how the predictions are bagged in the current experiments. Finally, a more realistic trading environment is presented together with the results.
Bagging
Bagging Predictors
Results in this study involve bagging outcomes of 55 experiments run for
earlier research comparing predictions via Neural Network (ANN, 10 pre-
dictors), Nearest Neighbor (kNN, 29), Evolved Logic Programs (ILP, 16)
and Bayesian Classifier (not used in this study). More detailed description
of the methods can be found in (Zemke, 1998).
Experimental Settings
Some evidence suggests that markets with lower trading volume are easier
to predict (Lerche, 1997). Since the task of the earlier research was to
compare Machine Learning techniques, data from the relatively small and
unexplored Warsaw Stock Exchange (WSE) was used, with the quotes
freely available on the Internet (WSE, 1995 onwards). At the exchange,
prices are set once a day (with intraday trading introduced more recently).
The main index, WIG, a capitalization weighted average of stocks traded
on the main floor, provided the time series used in this study, with 1250
quotes since the formation of the exchange in 1991 to the comparative
research.
Index daily (log) changes were digitized via monotonically mapping
them into 8 integer values, 1..8, such that each was equally frequent in
the resulting series. The digitized series, {c}, was then used to create de-
lay vectors of 10 values, with lag one. Such a vector, (c_t, c_{t-1}, c_{t-2}, ..., c_{t-9}), was the sole basis for prediction of the index up/down value at time t + 5
w.r.t. the value at time t. Changes up and down have been made equally
frequent (with down including small index gains) for easier detection of
above-random predictors. Only delay vectors and their matching 5-day
returns derived from consecutive index values within a learning window
were used for learning. Windows of half-year, 1-year (250 index quotes),
2-years and 4 years were tested.
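The preprocessing just described might be sketched as follows (the quotes list is a made-up toy series; the digitization maps log changes to 1..8 so that each value is roughly equally frequent, and lag-one delay vectors of 10 values are then formed):

    import math

    def digitize_equifrequent(changes, bins=8):
        # assign 1..bins so that each code appears (roughly) equally often
        ranked = sorted(range(len(changes)), key=lambda i: changes[i])
        codes = [0] * len(changes)
        for rank, i in enumerate(ranked):
            codes[i] = 1 + rank * bins // len(changes)
        return codes

    def delay_vectors(codes, dim=10, lag=1):
        # vectors (c_t, c_{t-1}, ..., c_{t-9}), built with lag one
        return [codes[t - (dim - 1) * lag:t + 1:lag][::-1]
                for t in range((dim - 1) * lag, len(codes))]

    quotes = [100.0, 101.5, 100.8, 102.2, 103.0, 102.1, 104.0, 103.5,
              105.2, 104.8, 106.0, 107.1, 106.5, 108.0, 107.2]
    log_changes = [math.log(b / a) for a, b in zip(quotes, quotes[1:])]
    vectors = delay_vectors(digitize_equifrequent(log_changes))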
For each method, the predictor obtained for the window was then ap-
plied to the vector next to the last one in the window yielding up/down
prediction for the index value falling next to the window. With the coun-
ters for in/correct predictions accumulating as the window shifted over all
available data points, the resulting average accuracies for each method are
included in table 1, with accuracy shown as the percentage (%) of correctly
predicted up and down cases.
Estimating Returns
For estimating index returns induced by predictions, the 5-day index changes
have been divided into 8 equally frequent ranges, 1..8, with ranges 1..4 cor-
responding to down and 5..8 to up. Changes within each range obtained
values reflecting non-uniform distribution of index returns (Cizeau et al.,
1997). The near-zero changes 4 and 5 obtained value 1, changes 3 and 6 —
2, 2 and 7 — 4 and the extreme changes 1 and 8 — value 8.
Return is calculated as the sum of values corresponding to correct (up/down) predictions minus the values for incorrect predictions. To normalize, it is divided by the total sum of all values involved, thus ranging between −1 for null and 1 for full predictability. It should be noted that such a return is not equivalent to accuracy, which gives the same weight to all correct predictions.
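The measure can be sketched directly from this description (values holding the 1, 2, 4 or 8 assigned to each instance's 5-day change):

    def normalized_return(predictions, actuals, values):
        # sum of values for correct up/down predictions minus those for incorrect,
        # divided by the total sum: -1 for null and 1 for full predictability
        gained = sum(v for p, a, v in zip(predictions, actuals, values) if p == a)
        lost = sum(v for p, a, v in zip(predictions, actuals, values) if p != a)
        return (gained - lost) / float(gained + lost)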
The different learning methods, ILP, kNN and ANN, involved in this study offer the classification error independence required for bagging to work. Within each method's predictors, there is still variety due to the different training windows and parameters, such as background predicates for ILP, k values for kNN, and architectures for ANN.
In this context bagging is applied as follows: all selected predictors, e.g. those trained on a window of half a year – as for the first row of bagged results in table 1 – issue their predictions for an instance, with the majority class being the instance's bagged prediction. The predictor selections in table 1 are according to the learning method (columns): ILP, kNN, ANN, all of them, and according to training window size (rows), e.g. '4 & 2 & 1 year' – bagging predictions for all these window sizes.
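A minimal sketch of the bagged vote, with each selected predictor represented as a callable returning 'up' or 'down' for an instance:

    from collections import Counter

    def bagged_prediction(predictors, instance):
        # the majority class among the selected predictors is the bagged prediction
        votes = Counter(predict(instance) for predict in predictors)
        return votes.most_common(1)[0][0]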
With up to 1000 (4 years) – of the 1250 index points used for training –
the presented accuracies for the last 250 require 6% increase for a signifi-
cant improvement (one-sided, 0.05 error). Looking at the results, a number of observations can be attempted. First, accuracy increased – bagged accuracies exceed the average for each method. Second, poorly performing methods gain most, e.g. ILP (significantly) going up from a 56% average to 62% bagged accuracy. Third, overall, bagged predictors incorporating windows of 4 & 2 years achieve the highest accuracy. And fourth, return performance is positively correlated with bagged accuracy, with the highest returns for the highest accuracies.
Bagging GA Population
This section describes a trading application of bagged, GA-optimized Nearest Neighbor classifiers. As compared to the previously used Nearest Neighbor classifier, those in this section have additional parameters determining what constitutes a neighbor and are optimized for maximizing the return implied by their predictions; they also work on more extensive data, the choice of which is also parameterized. Some of the parameters follow (Zemke, 1998).
Active features – binary vector indicating features/coordinates in delay
vector included in neighbor distance calculation, max. 7 active
Neighborhood Radius – maximal distance up to which vectors are con-
sidered neighbors and used for prediction, in [0.0, 0.05)
Window size – limit how many past data-points are looked at while
searching for neighbors, in [60, 1000)
Kmin – minimal number of vectors required within a neighborhood to
warrant prediction, in [1, 20)
Predictions’ Variability – how much neighborhood vector’s predictions
can vary to justify a consistent common prediction, in [0.0, 1.0)
Prediction Variability Measure – how to compute the above measure
from the series of the individual predictions, as: standard deviation,
difference max − min between maximal and minimal value
Distance scaling – how contributory predictions are weighted in the
common prediction sum, as a function of neighbor distance, no-scaling:
1, linear: 1/distance, exponential: exp(−distance)
Conclusion
This study presents evidence that bagging multiple predictors can improve prediction accuracy for stock exchange index data. With the observation that returns are proportional to prediction accuracy, bagging is an interesting approach for increasing returns. This is confirmed by trading in a more realistic setting, with the returns of bagging significantly outperforming those of trading by the single best strategy.
Rapid Fine-Tuning of
Computationally Intensive Classifiers
MICAI’2000, Mexico, 2000. LNAI 1793
Rapid Fine-Tuning of Computationally
Intensive Classifiers
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Abstract This paper proposes a method for testing multiple parameter settings in one
experiment, thus saving on computation-time. This is possible by simultaneously tracing
processing for a number of parameters and, instead of one, generating many results –
for all the variants. The multiple data can then be analyzed in a number of ways, such
as by the binomial test used here for superior parameters detection. This experimental
approach might be of interest to practitioners developing classifiers and fine-tuning them
for particular applications, or in cases when testing is computationally intensive.
Keywords: Analysis and design, Classifier development and testing, Significance tests,
Parallel tests
Introduction
Evaluating a classifier and fine-tuning its parameters, especially when performed with non-optimal prototype code, often require lengthy computation. This paper addresses the issue of such experiments, proposing a scheme that speeds up the process in two ways: by allowing multiple classifier variants to be compared in a shorter time, and by speeding up the detection of superior parameter values.
The rest of the paper is organized as follows. First, a methodology of
comparing classifiers is described pointing out some pitfalls. Next, the
proposed method is outlined. And finally, an application of the scheme to
a real case is presented.
Basic Experimental Statistics
Comparing Outcomes
While testing 2 classifiers, one comes up with 2 sets of resulting accuracies. The question is then: do the observed differences indicate actual superiority of one approach, or could they arise randomly?
The standard statistical treatment for comparing 2 populations, the t-
test, came under criticism when applied in the machine learning settings
(Dietterich, 1996), or with multiple algorithms (Raftery, 1995). The test
assumes that the 2 samples are independent, whereas usually when two
algorithms are compared, this is done on the same data set so the inde-
pendence of the resulting accuracies is not strict. Another doubt can arise
when the quantities compared do not necessarily have normal distribution.
If one wants to compare two algorithms, A and B, then the binomial test is more appropriate. The experiment is to run both algorithms N times and to count the number of times S that A was better than B. If the algorithms were equal, i.e., P(A better than B in a single trial) = 0.5, then the probability of obtaining a difference of S or more amounts to the sum of binomial trials, P = 0.5, yielding between S and N successes. As S gets larger than N/2, the error of wrongly declaring A better than B decreases, allowing one to achieve the desired confidence level. Table 1 provides the minimal S differentials as a function of the number of trials N and the (I or II sided) confidence level.
The weaknesses of binomial tests for accuracies include: non-quantitative comparison – not showing how much one case is better than the other (e.g., as presented by their means); somewhat ambivalent results in the case of many draws – if the number of draws D >> S, should the relatively small number of successes decide which sample is superior; and non-obvious ways of comparing more than 2 samples, or samples of different cardinality (Salzberg, 1997).
Significance Level
Performing many experiments increases the odds that one will find ’sig-
nificant’ results where there is none. For example, an experiment at 95%
#Trials 95% I 95% II 99% I 99% II 99.9% I 99.9% II 99.99% I 99.99% II
5 5 - - - - - - -
6 6 6 - - - - - -
7 7 7 7 - - - - -
8 7 8 8 8 - - - -
16 12 13 13 14 15 15 16 16
32 22 22 23 24 25 26 27 28
64 40 41 42 43 45 46 47 48
128 74 76 78 79 82 83 86 87
256 142 145 147 149 153 155 158 160
512 275 279 283 286 292 294 299 301
1024 539 544 550 554 562 565 572 575
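The table entries can be reproduced by direct binomial summation; a small sketch for the one-sided columns (the two-sided columns correspond to halving the acceptable error):

    from math import comb

    def min_successes(n, alpha):
        # smallest S with P(X >= S) <= alpha for X ~ Binomial(n, 0.5), i.e. the
        # minimal number of wins allowing one to declare A better than B
        total, tail = 2 ** n, 0
        for s in range(n, -1, -1):
            tail += comb(n, s)
            if tail / total > alpha:
                return s + 1
        return 0

    print(min_successes(16, 0.05))   # 12, as in the 95% I column
    print(min_successes(16, 0.025))  # 13, as in the 95% II column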
To avoid spurious inferences, one is strongly advised to always aim at significance higher than the bottom-line 95%, easily obtained in tens of testing runs. However, more stringent tests also increase the possibility that one will omit some genuine regularities. One solution to this trade-off could be to first search for any results, accepting relatively low significance, and once something interesting is spotted, to rerun the test on more extensive data, aiming at a higher pass.
Tuning Parameters
A common practice involves multiple experiments in order to fine-tune optimal parameters for the final trial. Such a practice increases the chances of finding an illusory significance in two ways. First, it involves the effect, discussed above, of numerous tests on the same data. Second, it specializes the algorithm to perform on the (type of) data on which it is later tested.
To avoid this pitfall, first, each fine-tuning experiment involving the whole data should appropriately adjust the significance level of the whole series – in the way discussed. The second possibility requires keeping part of the data for testing and never using it at the fine-tuning stage, in which case the significance level must only be adjusted according to the number of trials on the test portion.
Proposed Method
Usually it is unclear, without a trial, how to set parameter values for optimal performance. Finding the settings is often done in a change-and-test manner, which is computationally intensive, both to check the many possible settings and to get enough results to be confident that any observed regularity is not merely accidental. The proposed approach to implementing the change-and-test routine can speed up both.
The key idea is to run many experiments simultaneously. For example,
if the tuned algorithm has 3 binary parameters A, B and C taking values
-/+, in order to decide which setting among A- B- C-, A- B- C+, ..., A+
B+ C+ to choose, all could be tried at once. This can be done by keeping
2 copies of all the variables influenced by parameter A: one variable set
representing the setting A- and the other – A+. Those 2 variable sets
could be also used in 2 ways – each with respect to processing required by
B- and B+ resulting in 4 variable sets representing the choices A- B-, A-
B+, A+ B- and A+ B+. And in the same manner, the C choice would
generate 8 sets of affected variables. Finally, as the original algorithm
produces one result, the modified multiple-variable version would produce
8 values per iteration.
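A toy rendering of the idea (Python; A, B, C are hypothetical binary parameters): the expensive core is computed once per iteration, while the cheap post-core step is evaluated for all 2^3 parameter combinations, yielding 8 outcomes instead of one.

    from itertools import product

    def traced_iteration(expensive_core, postprocess, data):
        # expensive_core(data): run once; postprocess(result, a, b, c): cheap variant
        core_result = expensive_core(data)
        outcomes = {}
        for a, b, c in product((False, True), repeat=3):  # A-/A+ x B-/B+ x C-/C+
            outcomes[(a, b, c)] = postprocess(core_result, a, b, c)
        return outcomes  # 8 results from a single costly core computation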
The details of the procedure, namely which variables need to be traced in multiple copies, depend on the algorithm in question. Though the process might seem to change the structure of the algorithm – using a data structure in the place of a single variable – once this step is properly implemented, it does not increase the conceptual complexity whether 2 or 10 variables are traced. Actually, with the use of any programming language allowing abstractions, such as an object-oriented language, it is easy to reveal the internal nature of the variables only where necessary – without the need for any major code changes where the modified variables are merely passed.
Handling the variable choices obviously increases the computational complexity of the algorithm; however, as will be shown on an example, the overhead can be negligible when the variable parameters concern choices outside the computationally intensive core of the algorithm, as is usually the case for fine-tuning.
...discussed (Salzberg, 1997). In order to collect the statistics, several iterations – applications of the algorithm – will usually be required, depending on the number of variable choices – so outcomes – at each iteration, and the required confidence. With 3 variable choices, each application allows 4 comparisons – in general, tracing K choices allows 2^(K−1).
This analysis can reveal if a certain parameter setting results in signifi-
cantly better performance. The same procedure, and algorithm outcomes,
can be used for all the parameters, here including also B and C, which
equally divide the outcomes into B- and B+, etc. Any decisive results
obtained in such a way indicate a strong superiority of a given parameter
value – regardless of the combinations of the other parameters. However,
in many cases the results cannot be expected to be so crisp – with the
influence of parameter values inter-dependent, i.e. which given parameter
value is optimal may depend on the configuration of the other parameters.
In that case the procedure can be extended, namely the algorithm out-
comes can be divided according to value of a variable parameter, let it be
A, into 2 sets: A- and A+. Each of the sets would then be subject to the
procedure described above, with the already fixed parameter excluded. So
the analysis of the set A- might, for example, reveal that parameter B+
gives superior results no matter what the value of the other parameters
(here: only C left), whereas analysis of A+ might possibly reveal superior-
ity of B-. The point to observe is that fixing one binary variable reduces the
cardinality of the sample by half, thus twice as many algorithm iterations
will be required for the same cardinality of the analyzed sets. This kind
of analysis might reveal the more subtle interactions between parameters,
helpful in understanding why the algorithm works the way it does.
Parallel Experiments
In the limit, the extended procedure will lead to 2^K sets obtaining one element per iteration, K – the number of binary parameters traced. Such obtained sets can be subject to another statistical analysis, this time the gains in computation coming from the fact that once generated, the 2^K sets can be compared to a designated set, or even pair-wise, corresponding to many experiments.
The statistics used in this case can again involve the binomial compari-
son or – unlike in the previous case – a test based on random sampling. In
the superior parameter detection mode, the divisions obtained for a single
parameter most likely do not have normal distribution, thus tests assuming
it, such as the t-test, are not applicable. Since the binomial test does not
make any such assumption it was used.
However, if the compared sets are built in a one-element-per-iteration fashion, where each iteration is assumed to be independent (or random-generator dependent) from the previous one, the sets can be considered random samples. The fact that they originate from the same random generator sequence forming the outcomes at each iteration can actually be considered helpful in getting a more reliable comparison of the sets – due only to the performance of the variants, and not to variation in the sampling procedure. This aspect could be considered another advantage
of the parallel experiments. However, discussing the more advanced tests
utilizing this property is beyond the scope of the current paper.
Algorithm
The designed classifier was an extension of the nearest neighbor algorithm,
with parameters indicating what constitutes a neighbor, which features to
look at, how to combine neighbor classifications etc. The parameters were
optimized by a genetic algorithm (GA) whose population explored their
combinations. The idea, believed to be novel, involved taking – instead of the best GA-evolved classifier – part of the final GA population and bagging (Breiman, 1996) the individual classifiers together into an ensemble classifier. Trying the idea seemed worthwhile since bagging is known to increase accuracy by benefiting from the variation in the ensemble – exactly what a (not over-converged) GA population should offer.
The computationally intensive part was the GA search – evolving a
population of parameterized classifiers and evaluating them. This had to
be done no matter if one was interested just in the best classifier or in
a bigger portion of the population. As proposed, the tested algorithm
needs to be multi-variant traced for a number of iterations. Here, iteration
involved a fresh GA run, and yielded accuracies (on the test set) – one for
each variant traced.
The questions concerning bagging the GA population involved: which individual classifiers to bag – all above-random or only some of them; how to weight their votes – by a single vote or according to the accuracy of the classifiers; how to solicit the bagged vote – by simple majority or only if the majority was above a threshold. The questions gave rise to 3 parameters, described below, and their 3 * 2 * 2 = 12 combinations, listed in Table 3, indicating which parameter (No) takes what value (+).
No  Parameter setting       1  2  3  4  5  6  7  8  9  10 11 12
1   Upper half bag          +  +  +  +  -  -  -  -  -  -  -  -
1   All above-random bag    -  -  -  -  +  +  +  +  -  -  -  -
1   Half above-random bag   -  -  -  -  -  -  -  -  +  +  +  +
2   Unweighted vote         +  +  -  -  +  +  -  -  +  +  -  -
3   Majority decision       +  -  +  -  +  -  +  -  +  -  +  -
Parameter Analysis
The parameter analysis can identify algorithm settings that give superior
performance, so they can be set to these values. The first parameter has 3
values which can be dealt with by checking if results for one of the values
are superior to both of the others. Table 4 presents the comparisons as
probabilities for erroneously deciding superiority of the left parameter set
versus one on the right. Thus, for example, in the first row comparison of
{1..4 } vs. {5..8}, which represent the different settings for parameter 1,
the error 0.965 by 128 iterations indicates that setting {1..4 } is unlikely
to be better than {5..8}. Looking at it the other way: {5..8} is more likely
to be better than {1..4} with an error around 0.035 = 1 − 0.965. The setting {0} stands for results by the reference non-bagged classifier – the respective GA run's fittest individual. The results in Table 4 allow us to make some observations concerning the parameters. The following conclusions are for results up to 128 iterations; the results for the full trials, up to 361 iterations, are included for comparison only.
No Parameter settings/Iterations 32 64 128 361
1 {1..4 } vs. {5..8} 0.46 0.95 0.965 0.9985
1 {1..4 } vs. {9..12 } 0.53 0.75 0.90 0.46
1 {5..8 } vs. {9..12 } 0.33 0.29 0.77 0.099
2 {1,2,5,6,9,10} vs. {3,4,7,8,11,12} 0.53 0.6 0.24 0.72
3 {1,3,5,7,9,11} vs. {2,4,6,8,10,12} 0.018 9E-5 3E-5 0
- {1} vs. {0} 0.0035 0.0041 0.013 1E-6
- {2} vs. {0} 0.19 0.54 0.46 0.91
- {3} vs. {0} 0.055 0.45 0.39 0.086
- {4} vs. {0} 0.11 0.19 0.33 0.12
- {5} vs. {0} 0.0035 3.8E-5 6.2E-5 0
- {6} vs. {0} 0.30 0.64 0.87 0.89
- {7} vs. {0} 0.055 0.030 0.013 1.7E-4
- {8} vs. {0} 0.025 0.030 0.02 0.0011
- {9} vs. {0} 0.055 7.8E-4 2.5E-4 0
- {10} vs. {0} 0.11 0.35 0.39 0.73
- {11} vs. {0} 0.19 0.64 0.39 0.085
- {12} vs. {0} 0.055 0.030 0.0030 0.0016
Speed up
In this case the speed up of the aggregate experiments – as opposed to
individual pair-wise comparisons – comes from the fact that the most com-
putationally intensive part of the classification algorithm – the GA run –
does not involve the multiply-threaded variables. They come into play
only when the GA evolution is finished and different modes of bagging and
non-bagging are evaluated.
Exploring variants outside the inner loop can still benefit algorithms in
which multiple threading will have to be added to the loop thus increasing
the computational burden. In this case, the cost of exploring the core
variants should be fully utilized by carefully analyzing the influence of the
(many) post-core settings, so as not to waste the core computation due to some unfortunate parameter choice afterwards.
Conclusion
This paper proposes a method for testing multiple parameter settings in
one experiment, thus saving on computation-time. This is possible by
simultaneously tracing processing for a number of parameters and, instead
of one, generating many results – for all the variants. The multiple data can
then be analyzed in a number of ways, such as by the binomial test used
here for superior parameters detection. This experimental approach might
be of interest to practitioners developing classifiers and fine-tuning them
for particular applications, or in cases when testing is computationally
intensive.
The current approach could be refined in a number of ways. First, a finer statistical framework could be provided, taking advantage of the specific features of the data generating process, thus providing crisper tests, possibly at a smaller sample size. Second, some standard procedures for dealing with common classifiers could be elaborated, making the proposed development process more straightforward.
On Developing Financial Prediction
System: Pitfalls and Possibilities
DMLL Workshop at ICML-2002, Australia, 2002
On Developing Financial Prediction System:
Pitfalls and Possibilities
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Introduction
Financial prediction presents challenges encountered over and over again. The paper highlights some of the problems and solutions. Predictor development demands extensive experimentation: with data preprocessing and selection, the prediction algorithm(s), a matching trading model, evaluation and tuning – to benefit from the minute gains, but not to fall into over-fitting. The experimentation is necessary since there are no proven solutions, but the experiences of others, even failed ones, can speed up the development.
The idea of financial prediction (and resulting riches) is appealing,
initiating countless attempts. In this competitive environment, if one
wants above-average results, one needs above-average insight and sophisti-
cation. Reported successful systems are hybrid and custom made, whereas straightforward approaches, e.g. a neural network plugged into relatively unprocessed data, usually fail (Swingler, 1994).
The individuality of a hybrid system offers chances and dangers. One can bring together the best of many approaches; however, the interaction complexity hinders judging where the performance dis/advantage is coming from. This paper provides hints for the major steps in prediction system development, based on the author's experiments and published results.
The paper assumes some familiarity with machine learning and financial
prediction. As a reference one could use (Hastie et al., 2001; Mitchell,
1997), including java code (Witten & Frank, 1999), applied to finance
(Deboeck, 1994; Kovalerchuk & Vityaev, 2000). Non-linear analysis (Kantz
& Schreiber, 1999a), in finance (Deboeck, 1994; Peters, 1991). Ensemble
techniques (Dietterich, 2000), in finance (Kovalerchuk & Vityaev, 2000).
Data Preprocessing
Before data is fed into an algorithm, it must be collected, inspected, cleaned
and selected. Since even the best predictor will fail on bad data, data
quality and preparation is crucial. Also, since a predictor can exploit only
certain data features, it is important to detect which data preprocess-
ing/presentation works best.
Fat tails – extreme values more likely as compared to the normal distribution – are an established property of financial returns (Mantegna & Stanley, 2000). They can matter 1) in situations which assume a normal distribution, e.g. generating missing/surrogate data w.r.t. the normal distribution will underestimate extreme values, and 2) in outlier detection. If capturing the actual distribution is important, the data histogram can be preferred to parametric models.
Time alignment – same date-stamp data may differ in the actual time, as long as the relationship is kept constant. The series originating the predicted quantity sets the time – extra time entries in other series may be skipped, whereas entries missing in other series may need to be restored. Alternatively, all series could be converted to an event-driven time scale, especially for intra-day data (Dacorogna et al., 2001).
Missing values can be dealt with by data mining methods (Han & Kamber, 2001; Dacorogna et al., 2001). If a miss spoils a temporal relationship, restoration is preferable to removal. Conveniently, all misses in the raw series are restored for feature derivation, alignment etc., skipping any later instances of undefined values. If data restorations are numerous, testing whether the predictor picks up the inserted bias is advisable.
Detrending removes the growth of a series. For stocks, indexes and currencies, converting into logarithms of subsequent (e.g. daily) returns does the trick. For volume, dividing it by the average of the last k quotes, e.g. yearly, can scale it down.
discretized values exceeds noise, to decline later after rough discretization
ignores important data distinctions.
Redistribution – changing the frequency of some values in relation to
others – can better utilize available range, e.g. if daily returns were linearly
scaled to (-1, 1), majority would be around 0.
Normalization brings values to a certain range, minimally distorting
initial data relationships. SoftMax norm increasingly squeezes extreme
values, linearly mapping middle, e.g. middle 95% input values could be
mapped to [-0.95, 0.95], with bottom and top 2.5% nonlinearly to (-1,-0.95)
and (0.95, 1) respectively. Normalization should precede feature selection,
as non-normalized series may confuse the process.
filter for keywords in news can bring substantial advantage.
Indicators are series derived from others, enhancing some features of interest, such as trend reversal. Over the years, traders and technical analysts trying to predict stock movements developed the formulae (Murphy, 1999), some later confirmed to carry useful information (Sullivan et al., 1999). Feeding indicators into a prediction system is important due to 1) the averaging, thus noise reduction, present in many indicator formulae, and 2) providing views of the data suitable for prediction. Common indicators follow.
MA, Moving Average, is the average of the past k values up to date. Exponential Moving Average: EMA_n = weight * series_n + (1 − weight) * EMA_{n−1}.
Stochastic (Oscillator) places the current value relative to the high/low range in a period: (series_n − low(k)) / (high(k) − low(k)), where low(k) is the lowest among the k values preceding n; k often 14 days.
ROC, Rate of Change, ratio of the current price to the price k quotes earlier, k usually 5 or 10 days.
RSI, Relative Strength Index, relates growths to falls in a period. RSI can be computed as the sum of positive changes (i.e. series_i − series_{i−1} > 0) divided by the sum of all absolute changes, taking the last k quotes; k usually 9 or 14 days.
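Minimal sketches of these formulae (plain Python on a list of quotes, most recent last; parameter defaults as mentioned above, and degenerate cases such as a flat high/low range are ignored):

    def moving_average(series, k):
        return sum(series[-k:]) / float(k)

    def ema(series, weight):
        value = series[0]
        for x in series[1:]:
            value = weight * x + (1 - weight) * value  # EMA_n = w*series_n + (1-w)*EMA_{n-1}
        return value

    def stochastic(series, k=14):
        window = series[-k - 1:-1]          # the k values preceding the current one
        low, high = min(window), max(window)
        return (series[-1] - low) / (high - low)

    def roc(series, k=10):
        return series[-1] / series[-1 - k]  # current price over price k quotes earlier

    def rsi(series, k=14):
        changes = [b - a for a, b in zip(series[-k - 1:], series[-k:])]
        gains = sum(c for c in changes if c > 0)
        return gains / sum(abs(c) for c in changes)  # positive over all absolute changes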
Bootstrap – sampling, with repetitions, as many elements as in the original – and deriving a predictor for each such sample, is useful for collecting various statistics (LeBaron & Weigend, 1994), e.g. performance, and also for ensemble creation or best predictor selection (e.g. via bumping), however not without limits (Hastie et al., 2001).
Feature selection can make learning feasible, since, because of the curse of dimensionality (Mitchell, 1997), long instances demand (exponentially) more data. As always, the feature choice should be evaluated together with the predictor, as assuming a feature is important because it worked well with other predictors may mislead.
Principal Component Analysis (PCA) and – claimed better for stock data – Independent Component Analysis (Back & Weigend, 1998) reduce dimension by proposing a new set of salient features.
Sensitivity Analysis trains a predictor on all features and then drops those least influencing predictions. Many learning schemes internally signal important features, e.g. a (C4.5) decision tree uses them first, neural networks assign them the highest weights etc.
Linear methods measure correlation between predicted and feature series – significant
non-zero implying predictability (Tsay, 2002). Multiple features can be taken into
account by multivariate regression.
Entropy measures information content, i.e. deviation from randomness (Molgedey & Ebeling, 2000). This general measure, not demanding big amounts of data and useful in discretisation or feature selection, is worth familiarizing oneself with.
Compressibility – the ratio of the compressed to the original sequence length – shows how regularities can be exploited by a compression algorithm (which could be the basis of a predictor). An implementation: the series digitized to 4-bit values, packed in pairs into a byte array, and subjected to Zip compression (Feder et al., 1992); a minimal sketch appears after this list of tests.
Detrended Fluctuation Analysis (DFA) reveals long term correlations (self-similarity) even in non-stationary time series (Vandewalle et al., 1997). DFA is more robust, so recommended over Hurst analysis – a sensitive statistic of cycles whose proper interpretation requires experience (Peters, 1991).
Chaos and Lyapunov exponent test short-term determinism, thus predictability (Kantz
& Schreiber, 1999a). However, the algorithms are noise-sensitive and require long
series, thus conclusions should be cautious.
Randomness tests, like chi-square, can assess the likelihood that the observed (digitized) sequence is random. Such a test on patterns of consecutive digits could hint at the presence or absence of patterns.
Non-stationarity test can be implemented by dividing data into parts and computing
part i predictability based only on part j data. The variability of the measures
(visual inspection encouraged), such as standard deviation, assesses stationarity.
A battery of tests could include linear regression, DFA for long term
correlations, compressibility for entropy-based approach, Nearest Neighbor
for local prediction, and a non-stationarity test.
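As flagged under Compressibility above, the zip-ratio measure can be sketched as follows (assuming the series has already been digitized to 4-bit values 0..15):

    import zlib

    def compressibility(digits):
        # pack two 4-bit values per byte, deflate, and compare lengths;
        # a ratio well below 1 suggests regularities a predictor might exploit
        if len(digits) % 2:
            digits = digits + [0]
        packed = bytes((a << 4) | b for a, b in zip(digits[0::2], digits[1::2]))
        return len(zlib.compress(packed, 9)) / float(len(packed))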
Prediction Algorithms
Below, common learning algorithms (Mitchell, 1997) are discussed, pointing out their features important to financial prediction.
Linear methods, not the main focus here, are widely used in financial prediction (Tsay, 2002). In my Weka (Witten & Frank, 1999) experiments, Locally Weighted Regression (LWR) – a scheme weighting Nearest Neighbor predictions – discovered regularities in NYSE data (unpublished, ongoing work). Also, Logistic – non-linear regression for discrete classes – performed above average and with speed. As such, regression is worth trying, especially its schemes more specialized to the data (e.g. Logistic to discrete) and, as a final optimization, weighting other predictions (LWR).
Neural Network (ANN) – seems the method of choice for financial prediction (Kutsurelis, 1998; Cheng et al., 1996). Backpropagation ANNs present the problems of long training and of guessing the net architecture. Schemes training the architecture along with the weights could be preferred (Hochreiter & Schmidhuber, 1997; Kingdon, 1997), limiting under-performance due to a wrong (architecture) parameter choice. Note, a failure of an ANN attempt, especially using a general-purpose package, does not mean prediction is impossible. In my experiments, Voted Perceptron performance often compared with that of an ANN; this could be a start, especially when speed is important, such as in ensembles.
Nearest Neighbor (NN) does not create a general model; to predict, it looks back for the most similar case(s) (Mitchell, 1997). Irrelevant/noisy features disrupt the similarity measure, so pre-processing is worthwhile. NN is a key technique in nonlinear analysis, which offers insights, e.g. weighting more neighbors, efficient NN search (Kantz & Schreiber, 1999a). Cross-validation (Mitchell, 1997) can also decide an optimal number of kNN neighbors. Ensembles of/bagging NNs trained on different instance samples usually do not boost accuracy, though training on different feature subsets might.
Bayesian classifier/predictor first learns probabilities of how evidence supports outcomes, then used to predict a new evidence's outcome. Though the simple scheme is robust to violating the 'naive' independent-evidence assumption, watching independence might pay off, especially as in decreasing markets variables become more correlated than usual. The Bayesian scheme might also combine ensemble predictions – more optimally than majority voting.
Support Vector Machines (SVM) are a relatively new and powerful learner, having attractive characteristics for time series prediction (Muller et al., 1997). First, they deal with multidimensional instances, actually the more features the better – reducing the need for (wrong) feature selection. Second, they have few parameters, thus finding optimal settings can be easier, one of the parameters referring to the noise level the system can handle.
Performance improvement
Most successful prediction systems are hybrid: several learning schemes coupled together (Kingdon, 1997; Cheng et al., 1996; Kutsurelis, 1998; Kovalerchuk & Vityaev, 2000). Predictions, indications of their quality, biases, etc., are fed into a (meta-learning) final decision layer. The hybrid architecture may also stem from performance improving techniques:
Ensemble (Dietterich, 2000) is a number of predictors whose votes are put together into the final prediction. The predictors, on average, are expected to be above-random and to make independent errors. The idea is that a correct majority offsets individual errors, thus the ensemble will be correct more often than an individual predictor. The diversity of errors is usually achieved by training a scheme, e.g. C4.5, on different instance samples or features. Alternatively, different predictor types – like C4.5, ANN, kNN – can be used, or the predictor's training can be changed, e.g. by choosing the second best decision, instead of the first, when building a C4.5 decision tree. Common schemes include bagging, boosting and their combinations, and Bayesian ensembles (Dietterich, 2000). Boosting is particularly effective in improving accuracy.
Note: an ensemble is not a panacea for non-predictable data – it only boosts the accuracy of an already performing predictor. Also, readability and efficiency are decreased.
Genetic Algorithms (GAs) (Deboeck, 1994) explore novel possibilities, often not thought
of by humans. Therefore, it is always worth keeping some decisions as parameters
that can be (later) GA-optimized, e.g., feature preprocessing and selection, sampling
strategy, predictor type and settings, trading strategy. GAs (typically) require a fit-
ness function – reflecting how well a solution is doing. A common mistake is to
define the fitness one way and to expect the solution to perform another way, e.g. if
not only return but also variance are important, both factors should be incorporated
into the fitness. Also, with more parameters and the GA's ingenuity it is easier to overfit the data, thus testing should be more careful.
Local, greedy optimization can improve an interesting solution. This is worth combining with a global optimization, like GAs, which may get near a good solution without reaching it. If the parameter space is likely nonlinear, it is better to use a stochastic search, like simulated annealing, rather than simple hill-climbing.
Pruning properly applied can boost both 1) speed – by skipping unnecessary computa-
tion, and 2) performance – by limiting overfitting. Occam’s razor – among equally
performing models, simpler preferred – is a robust criterion to select predictors, e.g.
Network Regression Pruning (Kingdon, 1997), MMDR (Kovalerchuk & Vityaev,
2000) successfully use it. In C4.5 tree pruning is an intrinsic part. In ANN, weight
decay schemes (Mitchell, 1997) reduce towards 0 connections not sufficiently pro-
moted by training. In kNN, often a few prototypes perform better than referring
to all instances – as mentioned, high return instances could be candidates. In en-
sembles, if the final vote is weighted, as in AdaBoost (Dietterich, 2000), only the
highest-weighted predictors matter.
Tabu, cache, incremental learning, gene GA can accelerate the search, allowing more exploration, bigger ensembles etc. Tabu search prohibits re-visiting recent points – besides not duplicating computation, it forces the search to explore new areas. Caching stores computationally expensive results for a quick recall, e.g. (partial) kNN can be precomputed. Incremental learning only updates a model as new instances arrive, e.g. training an ANN could start with an ANN previously trained on similar data, speeding up convergence. Gene expression GAs optimize a solution's compact encoding (gene), instead of the whole solution, which is derived from the encoding for evaluation.
I use a mixture: optimizing genes stored in a tabu cache (logged and later scrutinized if necessary).
What if everything fails but the data seems predictable? There are still
possibilities: more relevant data, playing with noise reduction/discretisation,
making the prediction easier, e.g. instead of return, predicting volatility
(and separately direction), or instead of stock (which may require company
data) predicting index, or stock in relation to index; changing the horizon –
prediction in 1 step vs. many; another market, trading model.
Trading model, given predictions, makes trading decisions, e.g. predicted up – long position, down – short, with more possibilities (Hellström & Holmström, 1998). Return is just one objective; others include: minimizing variance, maximal loss (bankruptcy), risk (exposure), trade (commissions), taxes; the Sharpe ratio etc. A practical system employs precautions against predictor non-performance: monitoring recent performance and signaling if it is below the accepted/historic level. It is crucial in non-stationary markets to allow for market shifts beyond control – politics, disasters, entry of a big player. If the shifts cannot be dealt with, they should at least be signaled before inflicting irreparable loss. This touches the subject of a bigger (money) management system, taking the predictions into account while hedging, but that is beyond the scope of this paper.
System Evaluation
Proper evaluation is critical to prediction system development. First, it has to measure exactly the interesting effect, e.g. trading return, as opposed to prediction accuracy. Second, it has to be sensitive enough to distinguish often minor gains. Third, it has to convince that the gains are not merely a coincidence.
Evaluate the right thing. Financial forecasts are often developed to support semi-automated trading (profitability), whereas the algorithms underlying those systems might have a different objective. Thus, it is important to test the system performing in the setting in which it is going to be used – a trivial, but often missed, notion. Also, the evaluation data should be of exactly the same nature as planned for the real-life application, e.g. an index-futures trading system performed well on index data used as a proxy for the futures price, but real futures data degraded it. Some problems with common evaluation strategies (Hellström & Holmström, 1998) follow.
Actually, some of the best-performing systems have lower accuracy than could be
found for that data (Deboeck, 1994).
Square error – sum of squared deviations from actual outputs – is a common measure
in numerical prediction, e.g. ANN. It penalizes bigger deviations, however if sign
is what matters this might not be optimal, e.g. predicting -1 for -0.1 gets bigger
penalty than predicting +0.1, though the latter might trigger going long instead
of short. Square error minimization is often an intrinsic part of an algorithm such
as ANN backpropagation, and changing it might be difficult. Still, many such
predictors, e.g. trained on bootstrap samples, can be validated according to the
desired measure and the best picked.
Performance measure (Hellström & Holmström, 1998) should incorporate the predictor and the (trading) model it is going to benefit. Some points: commissions need to be incorporated – many trading 'opportunities' simply disappear once commissions are included. Risk/variability – what is the value of even a high-return strategy if in the process one goes bankrupt? Data difficult to obtain in real time, e.g. volume, might mislead historical data simulations.
Evaluation bias resulting from the evaluation scheme and time series data,
needs to be recognized. Evaluation similar to the intended operation can
minimize performance estimate bias, though different tests can be useful
to estimate different aspects, such as return, variance.
N-cross validation – data divided into N disjoint parts, N − 1 for training and 1 for testing, error averaged over all N (Mitchell, 1997) – in the case of time series data underestimates error. Reason: in at least N − 2 out of the N train-and-test runs, training instances precede and follow the test cases, unlike in actual prediction when only the past is known. For series, a window approach is more appropriate.
However, a test window of more than 1 instance overestimates error, since the training window does
not include the data directly preceding some of the tested cases. Since markets undergo
regime changes in a matter of weeks, the test window should be no longer than that, or
a small fraction of the training window (< 20%). To speed up training for the next test window,
the previous window's predictor could be used as the starting point while training on
the next window, e.g. instead of starting with random ANN weights.
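A sketch of such a walk-forward evaluation (Python; fit_fn and its warm_start argument are assumptions standing in for e.g. an ANN that can be initialized with the previous window's weights):

    import numpy as np

    def walk_forward(instances, targets, fit_fn, train_size, test_size):
        # Slide a training window followed by a small, disjoint test window
        # through the series, so only past data is ever used for training.
        scores, model = [], None
        start = 0
        while start + train_size + test_size <= len(instances):
            tr = slice(start, start + train_size)
            te = slice(start + train_size, start + train_size + test_size)
            model = fit_fn(instances[tr], targets[tr], warm_start=model)
            pred = model.predict(instances[te])
            scores.append(np.mean(np.sign(pred) == np.sign(targets[te])))
            start += test_size                    # shift by the test window size
        return np.mean(scores), np.std(scores)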
Evaluation data should include different regimes, markets, even data errors,
and be plentiful. Dividing the test data into segments helps to spot
performance irregularities (for different regimes).
Overfitting a system to data is a real danger. Dividing data into disjoint
sets is the first precaution: training, validation for tuning, and a test set for
performance estimation. A pitfall may be that the sets are not as separated
as they seem; e.g. when predicting returns 5 days ahead, a set may end at day D,
but that instance contains the return for day D + 5, which falls into the next set.
Thus data preparation and splitting should be done carefully.
Another pitfall is using the test set more than once. Just by luck, 1 out
of 20 trials appears significant at the 95% level, 1 out of 100 at the 99% level, etc. In multiple
testing, the significance calculation must factor that in, e.g. if 10 tests are run
and the best appears 99.9% significant, it really is only 0.999^10 ≈ 99% significant (Zemke,
2000).
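The adjustment amounts to raising the single-test significance to the power of the number of tests, as in this small illustration:

    def adjusted_significance(p_single, n_tests):
        # If the best of n_tests independent tests appears significant at
        # level p_single, the joint significance is only p_single ** n_tests.
        return p_single ** n_tests

    print(adjusted_significance(0.999, 10))   # ~0.990, i.e. 99.9% over 10 tests is ~99%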
Multiple use can be avoided, for the ultimate test, by taking data that
was not available earlier. Another possibility is to test on similar, not tuned-for,
data – without any tweaking until better results appear, only with predefined
adjustments for the new data, e.g. switching the detrending preprocessing
on.
If α is the acceptable risk of wrongly rejecting the null hypothesis that
the original series' statistic is lower (higher) than that of any surrogate, then
1/α − 1 surrogates are needed; if all give a higher (lower) statistic than the
original series, then the hypothesis can be rejected. Thus, if a predictor's
error was lower on the original series than in 19 runs on surrogates, we can be
95% sure it is on to something.
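A sketch of this surrogate test, assuming a user-supplied statistic (e.g. a predictor's negated error) computed on a series:

    import numpy as np

    def surrogate_test(series, statistic, alpha=0.05, seed=0):
        # Permutation surrogates destroy temporal structure while keeping the
        # value distribution.  If the original series beats all 1/alpha - 1
        # surrogates, the null hypothesis is rejected at level alpha.
        rng = np.random.default_rng(seed)
        n_surrogates = int(round(1 / alpha)) - 1       # 19 for alpha = 0.05
        original = statistic(series)
        surrogates = [statistic(rng.permutation(series)) for _ in range(n_surrogates)]
        return all(original > s for s in surrogates)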
Sanity checks involve common sense (Gershenfeld & Weigend, 1993). Prediction
errors along the series should not reveal any structure; if they do, the
predictor missed something. Do predictions on a surrogate (permuted) series
discover something? If so, they set the baseline for comparison with
prediction on the original series – is it significantly better?
The data used consists of 30 years of daily NYSE data: 5 indexes and 4 volume
series. The data is plotted and some series visibly mimicking others are omitted.
Missing values are filled in by a nearest neighbor algorithm, and the 5-day
return series to be predicted is computed. The index series are converted to
logarithms of daily returns; the volumes are divided by lagged yearly averages.
Additional series are derived, depending on the experiment: 10- and 15-day MA
and ROC for the indexes. Then all series are Softmax normalized to -1..1 and
discretized to 0.1 precision. In between major preprocessing steps, series
statistics are computed: number of NaNs, min and max values, mean, st.
deviation, 1,2-autocorrelation, zip-compressibility, linear regression slope,
DFA – tracing whether preprocessing does what is expected – removing NaNs, trend and
outliers, but not the zip/DFA predictability. In the simplest approach, all
series are then put together into instances with D = 3 and delay = 2.
An instance's weight is the absolute 5-day return at the corresponding time, and the
instance's class is the return's sign.
The predictor is one of the Weka (Witten & Frank, 1999) classifiers handling
numerical data, 4-bit coded into a binary string together with: which of the
instance's features to use, how much past data to train on (3, 6, 10, 15, 20
years) and what part of the lowest-weight instances to skip (0.5, 0.75, 0.85).
Such strings are GA-optimized, with already evaluated strings cached and
prohibited from costly re-evaluation. Evaluation: a predictor is trained
on past data and used to predict values in a disjoint window, 20% of the size of
the data, ahead of it; this is repeated 10 times with the windows shifted by the
smaller window size. The average of the 10 period returns, less the 'always
up' return and divided by the st. deviation of the 10 values, gives a predictor's
fitness.
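The fitness computation itself is simple; a sketch under the assumption that the per-window returns and the benchmark return are already available:

    import numpy as np

    def predictor_fitness(period_returns, always_up_return):
        # period_returns: the predictor's return in each of the 10 test windows;
        # always_up_return: return of the 'always up' benchmark on the same data.
        returns = np.asarray(period_returns, dtype=float)
        return (returns.mean() - always_up_return) / returns.std()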
Final Remarks
the whole development cycle. Without stringent re-evaluation, performance
is likely to suffer.
System development usually involves a number of recognizable steps:
data preparation – cleaning, selecting, making data suitable for the predictor;
prediction algorithm development and tuning – for performance on
the quality of interest; evaluation – to see if the system indeed performs on
unseen data. But since financial prediction is very difficult, extra insights
are needed. The paper has tried to provide some: data enhancing techniques,
predictability tests, performance improvements, evaluation hints
and pitfalls to avoid. Awareness of them will hopefully make predictions
easier, or at least bring the realization that they cannot be achieved any quicker.
Ensembles in Practice: Prediction,
Estimation, Multi-Feature and Noisy
Data
HIS-2002, Chile, 2002
Ensembles in Practice: Prediction,
Estimation, Multi-Feature and Noisy Data
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Abstract This paper addresses 4 practical ensemble applications: time series prediction,
accuracy estimation, and dealing with multi-feature and noisy data. The intent is to refer
a practitioner to ensemble solutions exploiting the specificity of the application area.
Introduction
Recent years have seen great interest in ensembles – putting several classifiers
together to vote – and for a good reason. Even weak, by themselves not very
accurate, classifiers can create an ensemble beating the best learning algorithms.
Understanding why and when this is possible, and what the
problems are, can lead to even better ensemble use. Many learning algorithms
incorporate voting. Neural networks apply weights to inputs and a nonlinear
threshold function to summarize the 'vote'. Nearest neighbor (kNN)
searches for k prototypes for a classified case, and outputs the prototypes'
majority vote. If the definition of an ensemble allows that all members classify,
but only one outputs, then Inductive Logic Programming (ILP)
is also an example.
A classifier can also be put into an external ensemble. Methods for generating
and putting classifiers together have been prescribed, reporting
accuracy above that of the base classifiers. But this success and generality of
ensemble use does not mean that there are no special cases benefiting from a
problem-related approach. This might be especially important in extreme
cases, e.g. when it is difficult to obtain above-random classifiers due to
noise – an ensemble will not help. In such cases, it takes more knowledge
and experiments (ideally by others) to come up with a working solution,
which probably involves more steps and ingenuity than a standard ensemble
solution. This paper presents specialized ensembles in 4 areas: prediction,
accuracy estimation, and dealing with multi-feature and noisy data. The
examples have been selected from many reviewed papers, with clarity and
generality (within their area) of the solution in mind. The idea of this paper
is to provide a problem-indexed reference to the existing work, rather than
to detail it.
perhaps hundreds of classifiers takes that much more to train, classify and
store. This can be alleviated by simpler base classifiers – e.g. decision
stumps instead of trees – and pruning, e.g. skipping low-weight members in
a weighted ensemble (Margineantu & Dietterich, 1997a). Overfitting can
result when an ensemble does not merely model the training data from many
(random) angles, but tries to fit its whims. Such a way of boosting accuracy
may work on noise-free data, but in the general case this is a recipe for
overfitting (Sollich & Krogh, 1996). Loss of readability is another consequence
of voting classifiers. Rule-based decision trees and predicate-based ILP,
having accuracy similar to e.g. neural networks (ANN) and kNN, were
favored in some areas because of their human-understandable models. However,
an ensemble of 100 such models – different so as to make the ensemble work and
possibly weighted – destroys the readability.
A note on vocabulary. Bias refers to the classification error due to the
central tendency, or most frequent classification, of a learner when
trained on different sets; variance – the error due to deviations from the
central tendency (Webb, 1998). Stable learning algorithms, not very
sensitive to changes in the training set, include kNN, regression and Support
Vector Machines (SVM), whereas decision trees, ILP and ANN are unstable.
Global learning creates a model for the whole data, later used (the model,
not the data) to classify new instances, e.g. ANN, decision tree, whereas local
algorithms refrain from creating such models, e.g. kNN, SVM. An overview
of learning algorithms can be found in (Mitchell, 1997).
effective: smaller – including only truly contributing classifiers; more accurate
– taking validated above-random classifiers, etc. Experiments confirm
this (Liu & Yao, 1998). However, even more advanced methods often refer
to features of common ensembles. There are many ways to ensure an ensemble's
classifier diversity, e.g. by changing the training data or the classifier
construction process – the most common methods are described below.
Feature selection
Classifiers can be trained with different feature subsets. The selection can
be random or premeditated, e.g. providing a classifier with a selection
of informative, uncorrelated features. If all features are independent and
important, the accuracy of the restricted (feature-subset) classifiers will
decline; however, putting them all together could still give a boost. Features
can also be preprocessed to present different views of the data to different
classifiers.
Randomization
Randomization can be inserted at many points, resulting in ensemble
variety. Some training examples could be distorted, e.g. by adding 0-mean
noise. Some class values could be randomized. The internal workings of the
learning algorithm could be altered, e.g. by choosing a random decision
among the 3 best in decision tree build-up.
Wagging (Bauer & Kohavi, 1998), a variant of bagging, requires a base
learner accepting training set weights. Instead of bootstrap samples, wagging
assigns random weights to the instances in each training set; the original
formulation used Gaussian noise to vary the weights.
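A minimal sketch of that weight perturbation (the standard deviation below is illustrative, not the original paper's value):

    import numpy as np

    def wagging_weights(n_instances, sigma=2.0, seed=0):
        # Perturb unit instance weights with Gaussian noise, clipping at zero,
        # instead of drawing a bootstrap sample.
        rng = np.random.default_rng(seed)
        return np.maximum(0.0, 1.0 + rng.normal(0.0, sigma, n_instances))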
Even when trained on the same data, classifiers such as kNN, neural net-
work, decision tree create models classifying new instances differently due
to different internal language, biases, sensitivity to noise etc. Learners
could also induce varied models due to different settings, e.g. network
architecture.
The quest for the ultimate ensemble technique resembles the earlier efforts
to find the 'best' learning algorithm, which discovered a number of similarly
accurate methods, some of them better in specific circumstances,
and usually further improved by problem-specific knowledge. Ensemble
methods also show their strengths in different circumstances, e.g. noisy or
noise-free data, stable or unstable learner, etc. Problem specifics can be directly
incorporated into a specialized ensemble. This section addresses four practical
problem areas and presents ensemble adaptations. Though other problems
in these areas might require an individual approach, the intention is to bring up
some issues and worked-out solutions.
Time Series Prediction
Time series arise in any context in which data is linearly ordered, e.g. by
time or distance. The index increment may be constant, e.g. 1 day, or not,
as in the case of event-driven measurements, e.g. indicating a transaction
time and its value. Series values are usually numeric, or in the more general
case vectors of fixed length. Time series prediction is to estimate a future
value, given the values up to date. There are different measures of success, the
most common being accuracy – in the case of nominal series values – and mean
squared error – in the case of numeric ones.
Series-to-instances conversion is required by most learning algorithms, which expect
a fixed-length vector as input. It can be a lag vector derived from the
series, a basic technique in nonlinear analysis: v_t = (series_t, series_{t-lag}, ...,
series_{t-(D-1)*lag}). Such vectors with the same time index t – coming from
all input series – appended give an instance, its coordinates referred to as
features. The lag vectors have their motivation in Takens' embedding theorem
(Kantz & Schreiber, 1999b), stating that a deterministic – i.e. to some
extent predictable – series' dynamics is mimicked by the dynamics of the
lag vectors, so e.g. if a series has a cycle – coming back to the same values – the
lag vectors will have a cycle too.
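A sketch of the conversion for a single series, with dimension D and a fixed lag:

    import numpy as np

    def lag_vectors(series, D, lag):
        # Build delay vectors v_t = (s_t, s_{t-lag}, ..., s_{t-(D-1)*lag}).
        s = np.asarray(series, dtype=float)
        start = (D - 1) * lag
        return np.array([[s[t - d * lag] for d in range(D)]
                         for t in range(start, len(s))])

    # e.g. D = 3, lag = 2: the instance at time t is (s_t, s_{t-2}, s_{t-4}).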
system's trajectory – imagine describing a yearly cycle by giving just several values
separated seconds apart. Too big a lag misses the details and risks putting
together weakly related values – as in the case of a yearly cycle sampled
at 123-month intervals. Without advance knowledge of the data, a lag
is preferred either at the first zero of the autocorrelation or at the first minimum of the
mutual information (Kantz & Schreiber, 1999b). However, those are only
heuristics and an ensemble could explore a range of values, especially as
theory does not favor any.
series preprocessing predicates: relative increases, decreases, stays
(within a range), and region: always, sometime, true percentage – testing
whether interval values belong to a range. The predicates, filled with values
specifying the intervals and ranges, are the basis of simple classifiers – each consisting
of only one predicate. The classifiers are then subject to boosting for
up to 100 iterations. The results are good, though noisy data causes some
problems.
Initial conditions of the learning algorithm can differ for each ensemble
member. Usually, the learning algorithm has some settings, other than
input/output data features etc. In the case of ANN, it is the initial weights,
architecture, learning speed, weight decay rate etc. For an ILP system –
the background predicates, allowed complexity of clauses. For kNN – the k
parameter and weighting of the k neighbors w.r.t. distance: equal, linear,
exponential. All can be varied.
An ANN example of different weight initialization for time series prediction
follows (Naftaly et al., 1997). Nets of the same architecture are
randomly initialized and assigned to ensembles built at 2 levels. First, the
nets are grouped into ensembles of fixed size Q, and the results for the
groups are averaged at the second level. Starting from Q = 1, increasing Q
expectedly reduces the variance. At Q = 20 the variance is similar to what
could be extrapolated for Q = ∞. Besides suggesting a way to improve
predictions, the study offers some interesting observations. First, the
minimum of the ensemble predictor error is obtained at an ANN epoch that
for a single net would already mean overfitting. Second, as Q increases,
the test set error curves w.r.t. epochs/training time become flatter, making it
less crucial to stop training at the 'right' moment.
for one series – and presented to the learning algorithm. Different ensem-
ble members can be provided with their different selection/preprocessing
combination.
Selection of the delay vector lag and dimension, even for several input series, can
be done as follows (Zemke, 1999b). For each series, the lag is set to a
small value, and the dimension to a reasonable value, e.g. 2 and 10. Next, a
binary vector, as long as the sum of the embedding dimensions for all series,
is optimized by a Genetic Algorithm (GA). The vector, by its '1' positions,
indicates which lagged values should be used, their number restricted to
avoid the curse of dimensionality. The selected features are used to train
a predictor whose performance/accuracy measures the vector's fitness. In
the GA population no 2 identical vectors are allowed and, after a certain
number of generations, the top performing half of the last population is
subject to a majority vote/averaging of their predictions.
Multiple Features
Multiple features, running into hundreds or even thousands, naturally appear
in some domains. In text classification, a word's presence may be
considered a feature, in image recognition – a pixel's value, in chemical
design – a component's presence and activity, or in a joined database the
features may simply accumulate. Feature selection and extraction are the main
dimensionality reduction schemes. In selection, a criterion, e.g. correlation, decides
the feature choice for classification. Feature extraction, e.g. Principal Component
Analysis (PCA), reduces dimensionality by creating new features.
Sometimes it is impossible to find an optimal feature set, when several sets
perform similarly. Because different feature sets represent different data
views, their simultaneous use can lead to a better classification.
Simultaneous use of different feature sets usually lumps the feature vectors
together into a single composite vector. Although there are several
methods to form the vector, the use of such a joint feature set may result
in the following problems: 1) Curse of dimensionality – the dimension of a
composite feature vector becomes much higher than that of any component
feature vector, 2) Difficulty in formation – it is often difficult to lump several
different feature vectors together due to their diversified forms, 3) Redundancy
– the component feature vectors are usually not independent of each
other (Chen & Chi, 1998). The problems of relevant feature and example
selection are interconnected (Blum & Langley, 1997).
Random feature selection for each ensemble classifier is perhaps the simplest
method. It works if 1) the data is highly redundant – it does not matter
much which features are included, as many carry similar information – and 2)
the selected subsets are big enough to create an above-random classifier – finding
that size may require some experimentation. Provided that, one may
obtain better classifiers in random subspaces than in the original feature
space, even before the ensemble application. In a successful experiment
(Skurichina & Duin, 2001), the original dimensionality was 80 (actually
24-60), and subspaces of dimension 10 were randomly selected for a 100-classifier majority vote.
then contribute to an even more robust ensemble. Sensitivity of a feature
is defined as the change in the output variable when an input feature is
changed within its allowable range (while holding all other inputs frozen
at their median/average value) (Embrechts et al., 2001).
In in-silico drug design with QSAR, 100-1000 dependent features and
only 50-100 instances present related challenges: how to avoid the curse of
dimensionality, and how to maximize classification accuracy given the few
instances yet many features. A reported solution is to bootstrap an (ANN)
ensemble on all features plus one random feature – with uniformly distributed
values – to estimate the sensitivities of the features, and skip features less sensitive
than the random one. The process is repeated until no further feature can
be dropped, and the final ensemble is trained. This scheme allows identifying
the important features.
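A sketch of the elimination loop, with fit_ensemble and sensitivity as assumed stand-ins for the bootstrapped ANN ensemble and its sensitivity analysis:

    import numpy as np

    def probe_feature_selection(X, y, fit_ensemble, sensitivity, seed=0):
        # fit_ensemble(X, y) -> model; sensitivity(model, X, j) -> output change
        # when feature j sweeps its range, other inputs held at their medians.
        rng = np.random.default_rng(seed)
        keep = list(range(X.shape[1]))
        while True:
            probe = rng.uniform(X.min(), X.max(), size=(X.shape[0], 1))
            Xp = np.hstack([X[:, keep], probe])        # add one random feature
            model = fit_ensemble(Xp, y)
            sens = [sensitivity(model, Xp, j) for j in range(Xp.shape[1])]
            probe_sens = sens[-1]
            survivors = [f for f, s in zip(keep, sens[:-1]) if s > probe_sens]
            if len(survivors) == len(keep):            # nothing left to drop
                return keep
            keep = survivors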
Accuracy Estimation
For many real-life problems, perfect classification is not possible. In addition
to fundamental limits to classification accuracy arising from overlapping
class densities, errors arise because of deficiencies in the classifier and
the training data. Classifier-related problems such as an incorrect structural
model, parameters, or learning regime may be overcome by changing or
improving the classifier. However, errors caused by the data (finite train-
ing sets, mislabelled patterns) cannot be corrected during the classification
stage. It is therefore important not only to design a good classifier, but
also to estimate limits to achievable classification rates. Such estimates
determine whether it is worthwhile to pursue (alternative) classification
schemes.
The Bayes error provides the lowest achievable error for a given classification
problem. A simple Bayes error upper bound is provided by the
Mahalanobis distance; however, it is not tight – it might be twice the actual
error. The Bhattacharyya distance provides a better range estimate, but
it requires knowledge of the class densities. The Chernoff bound tightens
the Bhattacharyya upper estimate but is seldom used since it is difficult to
compute (Tumer & Ghosh, 1996). The Bayes error can also be estimated
non-parametrically from the errors of a nearest neighbor classifier, provided
the training data is large, otherwise the asymptotic analysis might fail.
Little work has been reported on direct estimation of the performance
of classifiers (Bensusan & Kalousis, 2001) and on data complexity analysis
for optimal classifier combination (Ho, 2001).
Bayes error estimation via an ensemble (Tumer & Ghosh, 1996) exploits
that the error is only data dependent, thus the same for all classifiers that
add to it extra error due to a specific classifier limitations. By determining
the amount of improvement obtained from an ensemble, the Bayes error
can be isolated. Given the error of a single classifier E, of an averaging
ensemble E_ensemble, of N ρ-correlated classifiers, the Bayes error stands:
E_Bayes = (N E_ensemble − ((N − 1)ρ + 1) E) / ((N − 1)(1 − ρ)). The classifier correlation ρ is estimated by
deriving the (binary) misclassification vector for each classifier, and then
averaging the vectors' correlations. This can cause problems, as it treats
the classifiers equally, and is expensive if their number N is high. The correlation
can, however, also be derived via mutual information, by averaging
it between the classifiers and the ensemble as a fraction of the total entropy
in the individual classifiers (Tumer et al., 1998). This yields an even better
estimate of the Bayes error.
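A direct computation of the estimate from the quantities above (the values in the comment are just an illustration):

    def bayes_error_estimate(E_single, E_ensemble, N, rho):
        # E_single: mean error of an individual classifier; E_ensemble: error of
        # the averaging ensemble of N classifiers with average correlation rho.
        return (N * E_ensemble - ((N - 1) * rho + 1) * E_single) / ((N - 1) * (1 - rho))

    # e.g. 10 classifiers, correlation 0.3, single error 0.20, ensemble error 0.15:
    # bayes_error_estimate(0.20, 0.15, 10, 0.3) -> about 0.12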
Noisy Data
extension of backpropagation – clearly outperforms standard ANN ensembles
on noisy data, both in terms of accuracy and ensemble size.
Removing mislabelled instances, with the cleaned data then used for training,
can improve accuracy. The problem is how to recognize a corrupted label,
distinguishing it from an exceptional, but correct, case. Interestingly, as
opposed to labels, cleaning corrupted attributes may decrease accuracy if
a classifier trained on the cleaned data later classifies noisy instances. In
one approach (Brodley & Friedl, 1996), all data was divided into N
parts and an ensemble trained (by whatever ensemble-generating method)
on N − 1 parts and used to classify the remaining part, done in turn for
all parts. The voting method was consensus – only if the whole ensemble
agreed on a class different from the actual one was the instance removed. Such
a conservative approach is unlikely to remove correct labels, though it may
still leave some misclassifications. Experiments have shown that using the
cleaned data for training the final classifier (of whatever type) increased
accuracy for 20 - 40% noise (i.e. corrupted labels), and left it the same for
noise below 20%.
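A sketch of such a consensus filter, with build_ensemble standing in for whatever ensemble-generating method is used:

    import numpy as np

    def consensus_filter(X, y, build_ensemble, n_parts=5):
        # build_ensemble(X, y) -> list of classifiers with .predict(X).
        # An instance is removed only if every member agrees on a class
        # different from the recorded label.
        keep = np.ones(len(y), dtype=bool)
        for test_idx in np.array_split(np.arange(len(y)), n_parts):
            train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
            ensemble = build_ensemble(X[train_idx], y[train_idx])
            preds = np.array([clf.predict(X[test_idx]) for clf in ensemble])
            unanimous = np.all(preds == preds[0], axis=0)
            wrong = preds[0] != y[test_idx]
            keep[test_idx[unanimous & wrong]] = False
        return X[keep], y[keep]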
Conclusion
Ensemble techniques, bringing together multiple classifiers for increased
accuracy, have been intensively researched in the last decade. Most of
the papers either propose a ’novel’ ensemble technique, often a hybrid one
bringing in features of several existing ones, or compare existing ensemble and classifier
methods. This kind of presentation has 2 drawbacks. It is inaccessible
to a practitioner with a specific problem in mind, since the literature is ensemble-method
oriented, as opposed to problem oriented. It also gives the
impression that there is an ultimate ensemble technique. A similar search
for the ultimate machine learning method proved fruitless. This paper concentrates
on ensemble solutions in 4 problem areas: time series prediction,
accuracy estimation, multi-feature and noisy data. Published systems,
often blending internal ensemble workings with some of the areas' specific
problems, are presented, easing the burden of reinventing them.
Multivariate Feature Coupling and
Discretization
FEA-2003, Cary, US, 2003
Multivariate Feature Coupling and
Discretization
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
Michal Rams6
Institut de Mathematiques de Bourgogne
Universite de Bourgogne
Dijon, France
M.Rams@impan.gov.pl
Abstract This paper presents a two-step approach to multivariate discretization, based
on Genetic Algorithms (GA). First, subsets of informative and interacting features are
identified – this is one outcome of the algorithm. Second, the feature sets are globally
discretized, with respect to an arbitrary objective. We illustrate this by discretization for
the highest classification accuracy of an ensemble diversified by the feature sets.
Introduction
Primitive data can be discrete, continuous or nominal. The nominal type
merely lists the elements without any structure, whereas discrete and continuous
data have an order – they can be compared. Discrete data differs from
continuous in that it has a finite number of values. Discretization, digitization
or quantization maps a continuous interval into one discrete value, the
idea being that the projection preserves important distinctions. If all that
matters, e.g., is a real value's sign, it could be digitized to {0, 1}: 0 for
negative, 1 otherwise.
6
On leave from the Institute of Mathematics, Polish Academy of Sciences, Poland.
A data set has a data dimension of attributes or features – each holding
a single type of value across all data instances. If the attributes are the data
columns, the instances are the rows and their number is the data size. If one of
the attributes is the class to be predicted, we are dealing with supervised
data, versus unsupervised. The data description vocabulary carries over
to the discretization algorithms. If an algorithm discretizing an attribute
takes the class into account, it is supervised.
Most common univariate methods discretize one attribute at a time,
whereas multivariate methods consider interactions between attributes in
the process. Discretization is global if performed on the whole data set,
versus local if only part of the data is used, e.g. a subset of instances.
There are many advantages of discretized data. Discrete features are
closer to a knowledge level representation than continuous ones. Data can
be reduced and simplified, so it is easier to understand, use, and explain.
Discretization can make learning more accurate and faster and the re-
sulting hypotheses (decision trees, induction rules) more compact, shorter,
hence can be more efficiently examined, compared and used. Some learning
algorithms can only deal with discrete data (Liu et al., 2002).
Background
Machine learning and data mining aim at high accuracy, whereas most
discretization algorithms promote accuracy only indirectly, by optimizing
related metrics such as entropy or the chi-square statistics.
Univariate discretization algorithms are systemized and compared in
(Liu et al., 2002). The best discretizations were supervised: the entropy-motivated
Minimum Description Length Principle (MDLP) (Fayyad & Irani,
1993), and one based on the chi-square statistics (Liu & Setiono, 1997), later
extended into a parameter-free version (Tay & Shen, 2002).
There is much less literature on multivariate discretization. A chi-square
statistics approach (Bay, 2001) aims at discretizing data so that its distribution
is most similar to the original. Classification rules based on feature intervals
can also be viewed as discretization, as done by (Kwedlo & Kretowski,
1999), who GA-evolve the rules. However, different rules may impose different
intervals for the same feature.
Multivariate Considerations
Discretization Measures
Shannon conditional entropy (Shannon & Weaver, 1949) is commonly used
to estimate the information gain of a cut-point, with the maximal-score point
used, as in C4.5 (Quinlan, 1993). We encountered the problem
that entropy has low discriminative power in some non-optimal equilibria,
and as such does not provide a clear direction for how to get out of them.
Chi-square statistics assess how similar the discretized data is to the original
(Bay, 2001). We experimented with chi-square as a secondary test, to
further distinguish between nearly equal primary objective values. Eventually, we
preferred the Renyi entropy, which has a similar quadratic formula, though it is
interpretable as accuracy. It can be linearly combined with the following
accuracy measure.
Our Measure
Accuracy alone is not a good discretization measure, since a set of random
features may achieve high accuracy as the number of splits and overfitting
grow. Also, some of the individual features in a set may by themselves induce accurate
predictions. We need a measure of the extra gain over that of the contributing
features and of overfitting. Such considerations led to the following.
Signal-to-Noise Ratio (SNR) expresses the accuracy gain of a feature set:
SNR = accuracy / (1 − accuracy) = (dataSize − totalInconsistency) / totalInconsistency,
i.e. the ratio of consistent to inconsistent pattern totals.
To correct for the accuracy induced by individual features, we normalize
the SNR by dividing it by the SNR for all the features involved in the
feature set, getting SNRn. SNRn > 1 indicates that a feature set predicts
more than its individual features.
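In code, with the exact aggregation of the individual-feature reference accuracy left as an open choice (an assumption here):

    def snr(accuracy):
        # SNR = accuracy / (1 - accuracy), the ratio of consistent to
        # inconsistent pattern totals.
        return accuracy / (1.0 - accuracy)

    def snr_from_counts(data_size, total_inconsistency):
        return (data_size - total_inconsistency) / total_inconsistency

    def snr_normalized(set_accuracy, individual_reference_accuracy):
        # SNRn > 1: the feature set predicts more than its individual features.
        return snr(set_accuracy) / snr(individual_reference_accuracy)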
Two-Stage Algorithm
The approach uses Genetic Algorithms to select feature sets contributing
to predictability and to discretize the features. First, it identifies different
feature subsets, on the basis of their predictive accuracy. Second, the
subsets fixed, all the features involved in them are globally fine-discretized.
Feature Coupling
This stage uses rough discretization to identify feature sets of above random
accuracy, via GA fitness maximization. A feature in different subsets may
have different discretization cut-points.
After every population-size evaluations, the fittest individual's feature
set is a candidate for the list of coupled feature sets. If its SNRn < 1,
the set is rejected. Otherwise, it is locally optimized: each feature in turn
is removed and the remaining subset evaluated for SNRn. If a subset
measures no worse than the original, it is recursively optimized. If the SNR
(not normalized) of the smallest subset exceeds an acceptance threshold,
the set joins the coupled feature list.
Once a feature subset joins the above-random list, all its subsets and
supersets in the GA population are mutated. The mutation is such as not
to generate any subset or superset of a set already in the list. This done,
the GA continues.
At the end of the GA run, single features are considered for the list of
feature sets warranting predictability. The accuracy threshold for accepting
a feature is arrived at by collecting statistics on the accuracy of permuted
original features predicting the actual class. The features are randomized
in this way so as to preserve their distribution; e.g. the features may happen
to be binary, which should be respected when collecting the statistics. Then the
threshold accuracy is given by the mean accuracy plus a required number
of standard deviations.
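A sketch of the threshold computation, with accuracy_fn standing in for the single-feature accuracy measure used here:

    import numpy as np

    def acceptance_threshold(feature, y, accuracy_fn, n_perm=100, n_std=2, seed=0):
        # Permuting the feature preserves its distribution (binary features
        # stay binary) while breaking any relation to the class.
        rng = np.random.default_rng(seed)
        accs = [accuracy_fn(rng.permutation(feature), y) for _ in range(n_perm)]
        return np.mean(accs) + n_std * np.std(accs)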
Global Discretization
Once we have identified the coupled feature sets, the second optimization
can proceed. The user can provide the objective. We have attempted a
fine discretization of the selected features, in which each feature is assigned
only one set of cut-points. The fitness of such a discretization can be measured in many
ways, e.g., in the spirit of the Naive Bayesian Classifier, as the product of
the discretization accuracies for all the feature sets. The GA optimization
proceeds by exploring the cut-points, with the feature sets fixed.
The overall procedure provides:
Implementation Details
GA-individual = active feature set + cut-point selection. The features are
a subset of all the features available, no more than 4 selected at once. The
cut-points are indices into sorted threshold values, precomputed for each
data feature as the values at which the class changes (Fayyad & Irani, 1993).
Thus, the discretization of a value is the smallest index whose corresponding
threshold exceeds that value.
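The index lookup itself reduces to a binary search over the sorted thresholds:

    import bisect

    def discretize(value, thresholds):
        # thresholds: sorted candidate cut-point values for one feature.
        # Returns the smallest index whose threshold exceeds the value
        # (len(thresholds) if none does).
        return bisect.bisect_right(thresholds, value)

    print(discretize(0.15, [0.0, 0.1, 0.3]))   # -> 2: 0.3 is the first threshold > 0.15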
Although non-active features are not processed in an individual, their
thresholds are inherited from a predecessor. Once the corresponding feature
is mutated active, the retained threshold indices will be used. The
motivation is that even for non-active features the thresholds had been optimized,
and as such have greater potential than a random threshold selection.
This is not a big overhead, as all feature threshold index
lists are merely pointers to the event when they were created, and the
constant-size representation promotes simple genetic operators.
Genetic operators currently include mutation at 2 levels. The first mutation,
in stage 1 only, may alter the active feature selection by adding a feature,
deleting one or changing one. The second mutation does the same to the threshold
selection of an active feature: add, delete or change.
Experiments
of randomly assigned data, respectively class, values after the class had
been computed on the non-corrupted data. The table results represent the
percentages of cases, out of 10 runs, when the sets {0,1,2} etc. were found.
Conclusion
The approach presented invokes a number of interesting possibilities for
data mining applications. First, the algorithm detects informative feature
groupings even if they contribute only partially to the class definition and
the noise is strong. In more descriptive data mining, where it is not only
important to obtain good predictive models but also to present them in
a readable form, the discovery that a feature group contributes to pre-
dictability with certain accuracy is of value.
Second, the global discretization stage can easily be adjusted to a particular
objective. Whether it is prediction accuracy of another type of ensemble, or
a requirement that only 10 features be involved, it can be expressed via the GA fitness
function for the global discretization.
Appendix A
Feasibility Study on Short-Term Stock
Prediction
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
1997
Abstract This paper presents an experimental system predicting a stock exchange index's
direction of change with up to 76 per cent accuracy. The period concerned varies from 1
to 30 days.
The method combines probabilistic and pattern-based approaches into one, highly
robust system. It first classifies the past of the time series involved into binary patterns
and then analyzes the recent data pattern and probabilistically assigns a prediction based
on its similarity to past patterns.
Introduction
The objective of the work was to test whether short-term prediction of a stock index
is at all possible using simple methods and a limited dataset (Deboeck,
1994; Weigend & Gershenfeld, 1994). Several approaches were tried, both
with respect to the data format and the algorithms. Details of the successful
setting follow.
Experimental Settings
The tests have been performed on 750 daily index quotes of the Polish
stock exchange, with the training data reaching another 400 sessions back.
Index changes were given a binary characterization: 1 – for strictly positive
changes, 0 – otherwise. Index prediction – the binary function of the change
between the current and a future value – was attempted for periods of 1, 3, 5, 10,
20 and 30 days ahead. Pattern learning took place up to the most recent
index value available (before the prediction period). Benchmark strategies
are presented to account for biases present in the data. A description of the
strategies used follows.
Results
The table presents the proportion of agreement between actual index changes and
those predicted by the strategies.
Prediction/quotes ahead    1      3      5      10     20     30
Always up                  0.48   0.51   0.51   0.52   0.56   0.60
Trend following            0.60   0.56   0.53   0.52   0.51   0.50
Patterns + trend           0.76   0.76   0.75   0.74   0.73   0.671
In all periods considered, patterns allowed predictability to be maintained at
levels considerably higher than the benchmark methods. Despite this relatively
simple characterization of index behavior, patterns correctly predict the index
move in 3 out of 4 cases up to 20 sessions ahead. The Trend following
strategy diminishes from 60% accuracy to a random strategy at around 10
sessions, and Always up gains strength at 20 quotes ahead, in accordance
with the general index appreciation.
Conclusions
The experiments show that short-term index prediction is indeed possible
(Haughen, 1997). However, as a complex, non-linear system, the stock
exchange requires a careful approach (Peters, 1991; Trippi, 1995). In earlier
experiments, when pattern learning took place only in epochs preceding
the test period, or when epochs extended too far back, the resulting patterns
were of little use. This could be caused by shifting regimes (Asbrink, 1997)
in the dynamic process underlying the index values.
However, with only a short history relevant, the scope for inferring any
useful patterns, and so for prediction, is limited. A solution to this could be provided
by a hill climbing method, such as genetic algorithms (Michalewicz,
1992; Bauer, 1994), in the space of (epoch-size * number-of-epochs *
pattern-complexity), so as to maximize the predictive power. Other ways of
increasing predictability include incorporating other data series and increasing
the flexibility of the pattern building process, which now only
incorporates a simple probability measure and logical conjunction.
Other interesting possibilities follow from even a short analysis of the successful
binary patterns: many of them point to the existence of short-period
'waves' in the index. This could be further explored e.g. by Fourier
or wavelet analysis.
Finally, I only mention trials with the symbolic ILP system Progol, employed
to find a logical expression generalizing positive index change
patterns (up to 10 binary digits long). The system failed to find any hypothesis
in a number of different settings, despite a rather exhaustive search
(more than 20h of computation on a SPARC 5 for the longer cases). I view the
outcome as a result of the system's strong insistence on generating
(only) compressed hypotheses, and of problems in dealing with partially
inconsistent/noisy data.
Appendix B
Amalgamation of Genetic Selection and
Boosting
Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email: steze@kth.se
1. Split the training examples into an evaluation set (15%) and a training set
(85%).
2. Build a GA classifier population by selecting prototypes for each class,
copying examples from the training set according to their probability
distribution. Each classifier also includes a random binary active-feature
vector.
3. Evolve the GA population until the criterion for the best classifier is met.
4. Add the classifier to the ensemble list, perform ensemble Reduce-Error
Pruning with Backfitting (Margineantu & Dietterich, 1997b) to maximize
its accuracy on the evaluation set. Check the ensemble enlargement
end criterion.
5. If not at an end, update the training set distribution according to AdaBoost
and go to 2.
The operators used in the GA search include: Mutation – changing a single
bit in the feature-select vector, or randomly changing an active feature
value in one of the classifier's prototypes. Crossover, given 2 classifiers, involves
swapping either the feature-select vectors or the prototypes for one class.
Classifier fitness (to be minimized) is measured as its error on the training
set, i.e., as the sum of the probabilities of the examples it misclassifies. The end criterion
for classifier evolution is that at least half of the GA population has
a below-random error. The end criterion for ensemble enlargement is that
its accuracy on the evaluation set is no longer growing. The algorithm draws
from several methods to boost performance:
• AdaBoost
• Pruning of ensembles
• Feature selection/small prototype set to destabilize individual classi-
fiers (Zheng et al., 1998)
• GA-like selection and evolving of prototypes
• Redundancy in prototype vectors (Ohno, 1970) – only selected fea-
tures influence the 1-NN distance, but all are subject to evolution
Experiments indicate the robustness of the approach – an acceptable classifier is
usually found in an early generation, thus the ensemble grows rapidly. Accuracies
on the (difficult) financial data are fairly stable and, on average, above
those obtained by the methods from the initial study, but below their
peaks. Bagging the ensembles so obtained has also been attempted, further
reducing variance but only minimally increasing accuracy.
Foreseen work includes pushing the accuracy further. Trials involving the UCI
repository are planned for wider comparisons. Refinement of the algorithms
will include: genetic operators (perhaps leading to many prototypes
per class) and end criteria. The intention is to promote rapid finding of
(not perfect but) above-random and diverse classifiers contributing to an
accurate ensemble.
In summary, the expected outcome of this research is a robust, general-purpose
system distinguished by generating a small set of prototypes that nevertheless,
in an ensemble, exhibits high accuracy and stable results.
Bibliography
Ali, K. M., & Pazzani, M. J. (1995). On the link between error correlation
and error reduction in decision tree ensembles (Technical Report ICS-
TR-95-38). Dept. of Information and Computer Science, UCI, USA.
Allen, F., & Karjalainen, R. (1993). Using genetic algorithms to find tech-
nical trading rules (Technical Report). The Rodney L. White Center for
Financial Research, The Wharton School, University of Pensylvania.
Asker, L., & Maclin, R. (1997). Feature engineering and classifier selection:
A case study in Venusian volcano detection. Proc. 14th International
Conference on Machine Learning (pp. 3–11). Morgan Kaufmann.
Aurell, E., & Zyczkowski, K. (1996). Option pricing and partial hedging:
Theory of polish options. Applied Math. Finance.
Bak, P. (1997). How nature works: the science of self organized criticality.
Oxford University Press.
Bauer, R. (1994). Genetic algorithms and investment strategies. an alter-
native approach to neural networks and chaos theory. New York: Wiley.
Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge
and Information Systems, 3, 491–512.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton
Univ. Press.
Bensusan, H., & Kalousis, A. (2001). Estimating the predictive accuracy
of a classifier (Technical Report). Department of Computer Science,
University of Bristol, UK.
Bera, A. K., & Higgins, M. (1993). ARCH models: Properties, estimation
and testing. Journal of Economic Surveys, 7, 307–366.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples
in machine learning. Artificial Intelligence, 97, 245–271.
Bollerslev, T. (1986). Generalised autoregressive conditional heteroskedas-
ticity. Journal of Econometrics, 31, 307–327.
Boström, H., & Asker, L. (1999). Combining divide-and-conquer and separate-
and-conquer for efficient and effective rule induction. Proceedings of
the Ninth International Workshop on Inductive Logic Programming.
Springer.
Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis, forecast-
ing and control. Prentice Hall.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Brodley, C. E., & Friedl, M. A. (1996). Identifying and eliminating misla-
beled training instances. AAAI/IAAI, Vol. 1 (pp. 799–805).
Campbell, J. Y., Lo, A., & MacKinlay, A. (1997). The econometrics of
financial markets. Princeton University Press.
Chen, K., & Chi, H. (1998). A method of combining multiple probabilistic
classifiers through soft competition on different feature sets. Neurocom-
puting, 20, 227–252.
Cheng, W., Wagner, L., & Lin, C.-H. (1996). Forecasting the 30-year u.s.
treasury bond with a system of neural networks.
Cizeau, P., Liu, Y., Meyer, M., Peng, C.-K., & Stanley, H. E. (1997). Volatility
distribution in the S&P500 stock index. Physica A, 245.
Cont, R. (1999). Statistical properties of financial time series (Technical
Report). Ecole Polytechnique, F-91128, Palaiseau, France.
Conversano, C., & Cappelli, C. (2000). Incremental multiple imputation
of missing data through ensemble of classifiers (Technical Report). De-
partment of Mathematics and Statistics, University of Naples Federico II,
Italy.
Dacorogna, M. (1993). The main ingredients of simple trading models
for use in genetic algorithm optimization (Technical Report). Olsen &
Associates.
Dacorogna, M., Gencay, R., Muller, U., Olsen, R., & Pictet, O. (2001). An
introduction to high-frequency finance. Academic Press.
Deboeck, G. (1994). Trading on the edge. Wiley.
Dietterich, T. (1996). Statistical tests for comparing supervised learning
algorithms (Technical Report). Oregon State University, Corvallis, OR.
Dietterich, T. (1998). An experimental comparison of three methods for
constructing ensembles of decision trees: Bagging, boosting, and random-
ization. Machine Learning, ?, 1–22.
Dietterich, T., & Bakiri, G. (1991). Error-correcting output codes: A gen-
eral method of improving multiclass inductive learning programs. Pro-
ceedings of the Ninth National Conference on AI (pp. 572–577).
Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple
Classifier Systems (pp. 1–15).
Domingos, P. (1997). Why does bagging work? A Bayesian account and its impli-
cations. Proceedings of the Third International Conference on Knowledge
Discovery and Data Mining (pp. 155–158).
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chap-
man & Hall.
Embrechts, M., et al. (2001). Bagging neural network sensitivity analysis
for feature reduction in qsar problems. Proceedings INNS-IEEE Interna-
tional Joint Conference on Neural Networks (pp. 2478–2482).
Fama, E. (1965). The behavior of stock market prices. Journal of Business,
January, 34–105.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continous-
valued attributes for classification learning. Proc. of the International
Joint Conference on Artificial Intelligence (pp. 1022–1027). Morgan
Kaufmann.
Feder, M., Merhav, N., & Gutman, M. (1992). Universal prediction of
individual sequences. IEEE Trans. Information Theory, IT-38, 1258–
1270.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of
online learning and an application to boosting. Proceedings of the Second
European Conference on Machine Learning (pp. 23–37). Springer-Varlag.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting al-
gorithm. Machine Learning: Proceedings of the Thirteenth International
Conference.
Galitz, L. (1995). Financial engineering: Tools and techniques to manage
financial risk. Pitman.
Gershenfeld, N., & Weigend, S. (1993). The future of time series: Learning
and understanding. Addison-Wesley.
Gonzalez, C. A., & Diez, J. J. R. (2000). Time series classification by boost-
ing interval based literals. Inteligencia Artificial, Revista Iberoamericana
de Inteligencia Artificial, 11, 2–11.
Han, J., & Kamber, M. (2001). Data mining. concepts and techniques.
Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statis-
tical learning. data mining, inference and prediction. Springer.
Hellström, T., & Holmström, K. (1998). Predicting the stock market (Tech-
nical Report). Univ. of Umeå, Sweden.
Kantz, H., & Schreiber, T. (1999a). Nonlinear time series analysis. Cam-
bridge Univ. Press.
Kantz, H., & Schreiber, T. (1999b). Nonlinear time series analysis. Cam-
bridge Univ. Press.
Kutsurelis, J. (1998). Forecasting financial markets using neural networks:
An analysis of methods and accuracy.
Kwedlo, W., & Kretowski, M. (1999). An evolutionary algorithm using
multivariate discretization for decision rule induction. Principles of Data
Mining and Knowledge Discovery (pp. 392–397).
Lavarac, N., & Dzeroski (1994). Inductive logic programming: Techniques
and applications. Ellis Horwood.
LeBaron, B. (1993). Nonlinear diagnostics and simple trading rules for
high-frequency foreign exchange rates. In A. Weigend and N. Gershenfeld
(Eds.), Time series prediction: Forecasting the future and understanding
the past, 457–474. Reading, MA: Addison Wesley.
LeBaron, B. (1994). Chaos and forecastability in economics and finance.
Phil. Trans. Roy. Soc., 348, 397–404.
LeBaron, B., & Weigend, A. (1994). Evaluating neural network predictors
by bootstrapping. Proc. of Itn. Conf. on Neural Information Processing.
Lefèvre, E. (1994). Reminiscences of a stock operator. John Wiley & Sons.
Lequeux, P. (Ed.). (1998). The financial markets tick by tick. Wiley.
Lerche, H. (1997). Prediction and complexity of financial data (Technical
Report). Dept. of Mathematical Stochastic, Freiburg Univ.
Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An
enabling technique. Data Mining and Knowledge Discovery, 393–423.
Liu, H., & Setiono, R. (1997). Feature selection via discretization (Technical
Report). Dept. of Information Systems and Computer Science, Singapore.
Liu, Y., & Yao, X. (1998). Negatively correlated neural networks for clas-
sification.
Malkiel, B. (1996). Random walk down wall street. Norton.
Mandelbrot, B. (1963). The variation of certain speculative prices. Jour-
nal of Business, 36, 392–417.
Mandelbrot, B. (1997). Fractals and scaling in finance: Discontinuity and
concentration. Springer.
Molgedey, L., & Ebeling, W. (2000). Local order, entropy and predictabil-
ity of financial time series (Technical Report). Institute of Physics,
Humboldt-University Berlin, Germany.
Muller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., &
Vapnik, V. (1997). Using support vector machines for time series predic-
tion.
Naftaly, U., Intrator, N., & Horn, D. (1997). Optimal ensemble averaging
of neural networks. Network, 8, 283–296.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical
study. Journal of Artificial Intelligence Research, 169–198.
Ott, E. (1994). Coping with chaos. Wiley.
Oza, N. C., & Tumer, K. (2001). Dimensionality reduction through clas-
sifier ensembles. Instance Selection: A Special Issue of the Data Mining
and Knowledge Discovery Journal.
Peters, E. (1991). Chaos and order in the capital markets. Wiley.
Peters, E. (1994). Fractal market analysis. John Wiley & Sons.
Quinlan, R. (1993). C4.5: Programs for machine learning. Morgan Kauf-
mann.
Raftery, A. (1995). Bayesian model selection in social research, 111–196.
Blackwells, Oxford, UK.
Refenes, A. (Ed.). (1995). Neural networks in the capital markets. Wiley.
Ricci, F., & Aha, D. (1998). Error-correcting output codes for local learn-
ers. Proceedings of the 10th European Conference on Machine Learning.
Rätsch, G., Schölkopf, B., Smola, A., Müller, K.-R., Onoda, T., & Mika, S.
(2000). nu-arc: Ensemble learning in the presence of outliers.
Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a rec-
ommended approach. Data Mining and Knowledge Discovery, 1, 317–327.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting
the margin: a new explanation for the effectiveness of voting methods.
Proc. 14th International Conference on Machine Learning (pp. 322–330).
Morgan Kaufmann.
Shannon, C., & Weaver, W. (1949). The mathematical theory of commu-
nication. Urbana, Illinois: University of Illinois Press.
Skurichina, M., & Duin, R. P. (2001). Bagging and the random subspace
method for redundant feature spaces. Second International Workshop,
MCS 2001.
Sollich, P., & Krogh, A. (1996). Learning with ensembles: How overfitting
can be useful. Advances in Neural Information Processing Systems (pp.
190–196). The MIT Press.
Tay, F., & Shen, L. (2002). A modified chi2 algorithm for discretization.
Knowledge and Data Engineering, 14, 666–670.
Tumer, K., Bollacker, K., & Ghosh, J. (1998). A mutual information based
ensemble method to estimate bayes error.
Tumer, K., & Ghosh, J. (1996). Estimating the bayes error rate through
classifier combining. International Conference on Pattern Recognition
(pp. 695–699).
Webb, G. (1998). Multiboosting: A technique for combining boosting and
wagging (Technical Report). School of Computing and Mathematics,
Deakin University, Australia.
Weigend, A., & Gershenfeld, N. (1994). Time series prediction: Forecasting
the future and understanding the past. Addison-Wesley.
Witten, I., & Frank, E. (1999). Data mining: Practical machine learning
tools and techniques with java implementations. Morgan Kaufmann.
WSE (1995 onwards). Daily quotes.
http://yogi.ippt.gov.pl/pub/WGPW/wyniki/.
Zemke, S. (1998). Nonlinear index prediction. Physica A, 269, 177–183.
Zemke, S. (1999a). Amalgamation of genetic selection and bag-
ging. GECCO-99 Poster, www.genetic-algorithm.org/GECCO1999/phd-
www.html (p. 2).
Zemke, S. (1999b). Bagging imperfect predictors. ANNIE’99. ASME Press.
Zemke, S. (1999c). Ilp via ga for time series prediction (Technical Report).
Dept. of Computer and System Sciences, KTH, Sweden.
Zemke, S. (2000). Rapid fine tuning of computationally intensive classifiers.
Proceedings of AISTA, Australia.
Zemke, S. (2002a). Ensembles in practice: Prediction, estimation, multi-
feature and noisy data. Proceedings of HIS-2002, Chile, Dec. 2002 (p. 10).
Zemke, S. (2002b). On developing a financial prediction system: Pitfalls
and possibilities. Proceedings of DMLL-2002 Workshop at ICML-2002,
Sydney, Australia.
Zemke, S., & Rams, M. (2003). Multivariate feature coupling and dis-
cretization. Proceedings of FEA-2003.
Zheng, Z., Webb, G., & Ting, K. (1998). Integrating boosting and stochastic
attribute selection committees for further improving the performance of
decision tree learning (Technical Report). School of Computing and
Mathematics, Deakin University, Geelong, Australia.
Zirilli, J. (1997). Financial prediction using neural networks. International
Thompson Computer Press.