
FACULTY OF NUCLEAR SCIENCES AND PHYSICAL ENGINEERING
CZECH TECHNICAL UNIVERSITY IN PRAGUE

OPTIMIZATION OF NEURAL NETWORKS ARCHITECTURES
USING GENETIC ALGORITHM

PHD THESIS

ROMAN KALOUS

PRAGUE, APRIL 2009

ABSTRACT

In this dissertation we describe the steps that led to our solution of the optimization of acyclic neural network architectures. Using cellular encoding we are able to represent a wide range of acyclic architectures within the NNSU tool, including some special classes (layered architectures, architectures with variables selections). We introduce a multi-factor weighted fitness function incorporating internal data transforms that unify the influence of each factor. The fitness defined this way, together with the recombination operators, completes the genetic algorithm. We also introduce a statistical indicator of the time to finding a solution. This indicator allows for an independent comparison of optimization approaches in terms of the number of steps needed to find an individual of the required quality. The comparison of times to finding a solution speaks for the genetic algorithm over random search techniques; this is due to the fact that the genetic algorithm searches for individuals simultaneously maximizing all factors.

Contents

Acknowledgements
Used symbols and notation
Introduction

1 Neural networks architectures
  1.1 The acyclic architecture of NNSU
  1.2 Existing approaches to acyclic architectures representation

2 IPCode: The representation of the acyclic architecture
  2.1 Representational scheme
  2.2 Read's codes and subcodes
  2.3 Instruction-parameters codes
  2.4 Implemented architectures

3 Properties of IPCodes
  3.1 Representational properties overview
  3.2 Discussion of IPCodes properties

4 Genetic algorithm defined on IPCodes
  4.1 General scheme of the genetic algorithm evolution
  4.2 NNSU tool in detail
  4.3 Fitness function
  4.4 Recombination operators
  4.5 Genetic algorithm
  4.6 The parameters controlling the transitional behaviour
  4.7 Genetic algorithm: simple runs

5 Real data application
  5.1 The data description
  5.2 Parameters settings
  5.3 Evolutionary optimization used in NNSU tool
  5.4 Evolutionary optimization versus random seeking
  5.5 Time to finding solutions

6 Results of the work
  6.1 Main goals of this work
  6.2 Achievements presented in this work
  6.3 Future work

Appendices

A Broader remarks on IPCodes and GA
  A.1 Properties of Codes
  A.2 Subcoding
  A.3 Relation between Codes and planted plane trees
  A.4 Cardinality of the Codes(N) set

Bibliography

List of Figures

1.1 Scheme of NNSU network
1.2 Plot of topology G
1.3 Plot of architecture H
1.4 CPU time spent on learning architectures
1.5 Topologies destroyed through AM recombination

2.1 Overall scheme of the IPCode representation
2.2 Members of sets 2N fulfilling the level-property
2.3 Graphs of two samples of Codes
2.3 Active node with ancestors and descendants
2.4 Application of the S instruction on the node 1
2.5 Application of the P instruction on the node 1
2.6 Node 1 is deleted according to the D instruction
2.7 Initial graph GINIT
2.8 Decoding example: the initial graph
2.9 Decoding example: node 1 was split into 1 and 2 via P instruction
2.10 Decoding example: node 1 was set up via E instruction
2.11 Decoding example: node 2 was split into 2 and 3 via S instruction
2.12 Decoding example: node 3 was set up via E instruction
2.13 Sample generic architecture
2.14 Samples of random architectures
2.15 Sample layered architecture
2.16 Samples of layered architectures
2.17 Architecture with variables selection
2.18 Architecture with variables selection

3.1 Phenotypes and genotypes within an evolutionary optimization environment
3.2 Two ways to construct the same simple architecture
3.2 Symmetrical architecture

4.1 Histogram graph for a NNSU response to the evaluation data
4.2 ROC curve sample for a NNSU response to the testing data
4.3 Graphs of the function T
4.4 Original fitness evaluations f of the individuals
4.5 Proportional and modified proportional selections
4.6 Mutation of architectures
4.7 Crossover on the architectures
4.8 Histogram of lengths of recombined SubIPCodes
4.9 Evolution of the fitness in the genetic algorithm with and without elitism
4.10 Generalization ratio and generalization difference
4.11 Two illustrations of a genetic algorithm run
4.12 Two further illustrations of a genetic algorithm run
4.13 3D-fitness and 2D-fitness graphs for graph size optimization
4.14 Block counts within architectures and lengths of the IPCodes

5.1 Ranking transformations for evaluation factors eHST and ePO
5.2 Ranking transformations for evaluation factors eROC and eMSE
5.3 Short run GA: 3D-fitness and 2D-fitness graphs
5.4 Short run GA: Comparison of the histogram and ROC quality
5.5 Short run RS: Comparison of the histogram and ROC quality
5.6 Short run: Visualisation of best performing NNSUs
5.7 Large run GA: Fitness evolutions of GA L01, GA L02, and GA L03 runs
5.8 Large run GA: Top architectures found by GA runs GA L01, GA L02, and GA L03
5.9 Large run RS: Top architectures found by RS runs RS L01, RS L02, and RS L03
5.10 Empirical CPDFs for quality levels qf = 0.9 and qf = 0.95
5.11 Empirical CPDFs for quality levels qf = 0.99 and qf = 0.995

A.1 Schema of a walk around of a tree
A.2 Equivalence of trees

List of Tables

2.1 Level-property checks for two different integer series
2.2 Subcode seeking
2.3 Subcodes and their lengths for a Code
2.4 Sample delFlag settings

3.1 Representational properties review

4.1 Core values of evaluation factors
4.2 Used types of functions
4.3 Variability of constructed selections
4.4 Generalization ratios and differences assigned to sample NNSU instances

5.1 Numbers of L, T, and E data
5.2 Fitness function design part of the overall GA parametrization table
5.3 Basic statistics on random search on 2,400 NNSUs
5.4 Short run RS: Best performing NNSUs according to the GL
5.5 Short run GA: Generalization factors and ...
5.6 Large run GA: GA L01 with memory
5.7 Large run GA: GA L02 without memory
5.8 Large run GA: GA L03 without memory, with diminished mutation
5.9 Large run: Comparison no. 1
5.10 Large run: Comparison no. 2
5.11 Large run: Comparison no. 3
5.12 Large run: List of best performing NNSUs

A.1 Illustrative numbers of |Codes(N)| and its cumulates

Acknowledgements

I was finishing this work for over two years. Together with the preliminary studies, coding, and testing of the related applications and modules, I have spent six years in the area of evolutionary optimization of graph structures. Now, after having passed through all the steps of research, analyses, and implementation, I am very glad to present the finalized thesis containing all relevant and reasonable outcomes of my work, and I feel I really need to express my thanks to all the colleagues and peers who helped me get this thesis done.

Firstly, I owe much to my wife Lucka for all the unyielding support during my active work and also during the times of my doubts and occasional periods of hesitation. I do remember how patiently and benevolently she dealt with the nights and other times I spent coding the modules or writing and reading the relevant documents, and I am very grateful for this.

From the academic sphere, I would like to thank my supervisor, František Hakl, for his openness and his active cooperation throughout the whole time I have been concerned with this work, and for the technical support he provided me. We prepared all the articles and posters together, and I realize that it was František's experienced contribution that added to the final quality of the outcomes.

Regarding the working group, I need to thank Marek Hlaváček primarily. Marek introduced me to the area of neural networks, and together we worked on the incorporation of the generalized evolutionary optimization; we undertook many analyses and experiments, and I am positive our discussions fruitfully moved the state of our experience with neural computation tasks. The second thanks goes to Jan Vachulka, who managed to implement the multi-platform architecture of our tools and in particular helped me to enhance the implementation of the modules for representations and the genetic algorithm. Jan was also one of the most willing proof-readers, whether it was a poster or a draft of the dissertation.

Finally, I have to express a huge appreciation to Pavel Plát, my colleague at the university, and later also a colleague in a risk management team during the period of our common employment. Pavel was and still is open to discussions on data mining and statistical issues. Throughout the talks with Pavel, we also touched on some of the questions related to the evolutionary optimization and neural networks, among others the ranking statistics and fitness transformations, and also the analysis of the structure of the ICodes set. In times when I could not identify the ICodes with the Catalan numbers, Pavel brought up a brand new way of proving the recurrent formula; I am quoting it in the appendix.


All the research, data analyses, and other parts of my post-graduate studies were more or less related to a single field of applied non-linear programming; in fact we (my colleagues and I) have used the model and also the implementation of Neural networks with switching units in many ways. Guided by our supervisor František Hakl, Marek Hlaváček and I revised and extended the model of the neural networks. Later on, with additional colleagues in our team, we enhanced the Neural networks with switching units tool into a multi-platform application with an extensible modularized architecture.

Thus, all the results listed in this work are naturally based on results of our NNSU tool (a tool for the use of Neural Networks with Switching Units), including the task specifications, output charts and graphs, neural networks visualizations, etc.

NNSU      http://www.cs.cas.cz/nnsu

Of course we did not start from scratch with all of our implementations. Many of the NNSU modules (reporting, visualizations) incorporate specialized scientific software. The vast majority of the plots of graph structures were prepared with the help of Graphviz (Graph Visualization Software). Where the requirements on the resulting graph plot were far from Graphviz's range, or rather far from my skills, Dia (Diagram Creation Program) was a very convenient alternative. The experimental data plots were produced by R (Language and Environment for Statistical Computing and Graphics). Also, as a flexible tool for many ad-hoc analyses and scripting, Octave (a high-level language, primarily intended for numerical computations) was in place.

Graphviz  http://www.graphviz.org
Dia       http://live.gnome.org/Dia
R         http://www.r-project.org
Octave    http://www.octave.org

The next helpful software packages that provided excellent working comfort during the preparation and typesetting of this book were the typesetting environment LaTeX (document preparation system), as well as the text editor gVim (advanced text editor). Thanks to the high flexibility of both of these I was given the great opportunity to present my thoughts exactly in the form I had imagined, and the writing of this work proceeded very smoothly.

LaTeX     http://www.latex-project.org
gVim      http://www.vim.org

Used symbols and notation

The following text will lead us through theoretical and practical issues, analyses, and outcomes of work with neural networks and their optimization. In most situations we will be comfortable with general text and we will try to avoid cumbersome formulas. We will use general notation throughout this text: small alphabetic literals will denote variables (either reals or integers), capital alphabetic literals will denote either sets or complex structures. These complex structures will be distinguished by the special fonts used.

In accord with the standard notation of numeric sets, the letter N will denote the set of integers and the letter R will denote the set of real numbers. General integers, e.g. summation indices, will mainly be written using the letters i, j, k, l, whereas integers with specialized purposes, e.g. entries of ICodes, will be denoted either a, b, c, d or by the capital letters K, L, M, N, P. The reals will mostly be printed as Greek letters. The Greek alphabet is also used when denoting some strings or vectors; the particular use will be clear from context.

We will also work with subintervals of the reals and we will denote these with the standard notation: (·, ·) for open intervals, [·, ·) and (·, ·] for half-closed intervals, and [·, ·] for closed intervals, respectively.
TYPESETTING

A       general sets, capitals printed with bold font
M       mappings, printed with calligraphic font
G       mappings used within NNSU tool, printed with sans serif font
P       special symbols and advanced structures, printed with math scripting font
code    source code or pseudo code variable, printed with verbatim style

MATRIX NOTATIONS

$A_{\cdot k}$    sum of the k-th column, $A_{\cdot k} = \sum_j A_{jk}$
$A_{j \cdot}$    sum of the j-th row, $A_{j \cdot} = \sum_k A_{jk}$

GRAPH THEORY

The main part of this book concerns graphs. We will denote graphs in a standard way, i.e. with capitals G (or possibly H) for a graph, V for the set of nodes, and E for the set of edges. Among the further terms, we frequently use $A_G(j)$ denoting the set of ancestors of a node j within a graph G, and $D_G(j)$ denoting the set of descendants of a node j within a graph G. The adjacency matrix of a graph G is denoted $A^G_{ij}$, with $i, j \le |V|$. The degree of a node j within a graph G is denoted $d_G(j)$.
ABBREVIATIONS

We will use abbreviations frequently, as they save space, allow for common syntax, and yet sufficiently represent the original meanings. In order to fully utilize the advantage of the abbreviations, we now list the ones that are supposed to be broadly known, to be sure that we do not miss the exact meaning. In addition, we list the abbreviations with specific meanings, which will also be introduced later in the relevant sections.

ABR        Accepted Background Rate, used in section 4.3
ASR        Accepted Signal Rate, used in section 4.3
ROC        Receiver Operating Characteristic, an indicator used in decision making/measuring in machine learning, used in section 4.3

CE         Cellular encoding, discussed in section 2.1
DAG        Directed Acyclic Graph, discussed in section 1.1
SSDAG      Single source single sink DAG, discussed in section 1.1
EDNA       Evolutionary Designed Neural Architectures, discussed in chapter 3
RS         Random search
GA         Genetic Algorithm
GP         Genetic Programming
GE         Grammatical Evolution
PST        Program Symbol Tree
WAV code   Walk Around Valency code, used in section 2.2

NN         Neural networks
NNSU       Neural Networks with Switching Units, briefly described in sections 1.1 and 4.2
CHAIN      Linear chain, a special type of neural networks topology, discussed in sections 1.1 and 2.3
NSU        Neuron with Switching Unit, discussed in sections 1.1 and 2.3
VS         Variable Selection, used in section 2.4

CPDF       Cumulative probability distribution function, utilized in section 5.5
PDF        Probability density function, utilized in section 5.5

Also, as a special type of abbreviation, we will use the following notation that simplifies the notation of minimums and maximums of any bounded variable.

MIN_K      Minimum value of variable K
MAX_K      Maximum value of variable K

Introduction

This thesis describes a combination of two phenomena, both originally inspired by natural systems. Genetic algorithms, the first of these, rank among very common optimization strategies; genetic algorithms are well established as performing highly in the field of non-linear tasks. The second phenomenon, neural networks, has proved very useful in solving problems of high complexity; the areas where neural networks succeed include pattern recognition, data classification, forecasting tasks, and many others.

GENETIC ALGORITHMS

The application of the biological model to optimization tasks in the form of evolutionary strategies appears very effective in situations where standard analytical approaches fail to succeed (either because the solution could not be analytically grasped and described, or simply because the exhaustive calculations become infeasible). The genetic algorithms allow for finding optima, or at least above-average solutions, of multidimensional and multi-modal functions. Genetic algorithms might also be conveniently used in cooperation with other optimization techniques, e.g. as one part of a structured optimization. In these situations genetic algorithms can provide acceptable initial solutions to analytical methods. The analytical optimization then searches for an absolute optimum starting from these initial points. Among other fields of application, genetic algorithms proved useful in the field of on-line tasks, where only solutions known to reach a given quality level are sought, not necessarily optimal ones, while the system is required to be quick.

The merger of rigorous mathematical visions and principles on one side and the rather unorganized and seemingly chaotic world of evolutionarily behaving nature on the other did not proceed straightforwardly, though. Despite the seemingly simple rules driving and controlling natural selection, it appears highly complicated to rigorously describe its transitional behaviour in precise mathematical language. The biggest hurdle is that any evolutionary process contains randomness at its very heart. This deeply rooted stochastic principle makes it infeasible to describe the particular path of the algorithm.
Due to the complicated analytical formalization of the genetic algorithm, it took a long time before this optimization method received serious attention from the mathematical community. The first turn came with the first computers, which allowed for the first simulations proving that genetic algorithms really work. When John H. Holland, a professor at the University of Michigan, released in 1975 his ground-breaking book Adaptation in Natural and Artificial Systems, in which he described and analyzed the behaviour of evolutionary systems and their search for optima, the genetic algorithms grew even more recognized and used. Holland introduced the important concept of schemata, a sort of explanatory unit in genetic algorithms; he also stated and proved a theorem giving the rules of schemata sampling between generations.
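
For reference, a commonly quoted form of Holland's schema theorem (reproduced here from the general genetic-algorithm literature, not from this thesis) bounds the expected number of individuals matching a schema H in the next generation:

$$\mathbb{E}\big[m(H, t+1)\big] \;\ge\; m(H, t)\,\frac{f(H)}{\bar{f}}\left[1 - p_c\,\frac{\delta(H)}{l - 1} - o(H)\,p_m\right],$$

where $m(H,t)$ is the number of individuals matching H in generation t, $f(H)$ their average fitness, $\bar{f}$ the population average fitness, $\delta(H)$ the defining length and $o(H)$ the order of the schema, $l$ the string length, and $p_c$, $p_m$ the crossover and mutation probabilities.
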
Since the 1990s, personal computers have provided satisfactory computational and data storage capacities. Evolutionary strategies are implemented in most computational systems, including MATLAB, Mathematica, and SAS. Together with the increased use of genetic algorithms grew the need for a detailed theoretical background. The main extension of the mathematical formalization was brought by Michael D. Vose, who gave the rules for the genetic algorithm as a dynamical system, i.e. not only for two succeeding generations, but for the overall path of the process.
For either theoretical or practical reasons, many variants of evolutionary strategies have evolved during the last fifty years; among others, the most important techniques are grammatical evolution, genetic programming, and simulated annealing. All of those are further researched, explored, and applied to various tests and tasks, which brings many improvements and solutions to complex tasks.

NEURAL NETWORKS

In this thesis, the main subject of optimization via genetic algorithms are neural networks, a very robust computational system in itself. Neural networks are another representative of phenomena with a biological background; the concept of neural networks first appeared in the early fifties and, thanks to computational feasibility, it has been getting ever more popular in the last twenty years in theoretical as well as practical fields of use.

The central neural systems are vital to all living organisms and they seem to work properly in their common environments of high complexity. The brain, the executive centre of the neural system, is capable of learning new situations; it proves to be a system we would describe as a very robust and generalized decision system, an effective data classifier and pattern recognizer. It also features an effective memory mechanism providing recalls of environments, patterns, and decisions. From the modelling point of view it is highly interesting that neural systems, consisting of huge amounts of neurons, exhibit a synergistic effect. That means the resulting power of the numbers of connected neurons is significantly higher than the direct summation of the powers of the particular neurons.
Models of artificial neural networks consist generally of basic units (artificial neurons) and connections between these artificial neurons. According to the natural motivation, the artificial neuron is a one-way transformation of incoming information into an outgoing signal. Even though there is only a single function to specify, there are many ways to choose a function realizing this transformation. In fact, only one rule regarding the artificial neuron's function is in place, deeming the artificial neuron to have one or more incoming inputs and only one (bounded) outgoing output.
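
As a minimal illustration of such a one-way transformation, the sketch below implements a generic artificial neuron in Python; the weights, bias, and the choice of a bounded activation are arbitrary placeholders and are not taken from the NNSU model described later.

    import math

    def artificial_neuron(inputs, weights, bias, activation=math.tanh):
        """One-way transformation: several incoming inputs, one bounded outgoing output."""
        # weighted sum of the incoming signals
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        # a bounded activation keeps the single output in a fixed range
        return activation(s)

    # example call with three inputs and made-up parameters
    print(artificial_neuron([0.2, -1.0, 0.5], weights=[0.4, 0.1, -0.3], bias=0.05))
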
The very basic type of neural network consists of only one artificial neuron. Still, even with this simple type of neural network, later extended into the well-known perceptron, we can produce results of surprisingly high quality in all fields of common application. The underlying artificial neuron model was first described by Warren McCulloch and Walter Pitts in 1943. A multi-layered perceptron, a neural network composed of more artificial neurons organized into successive layers, makes the responses even more robust and generalized.

During the second half of the 20th century, further evolution of specific types of neural networks became quite widespread. Today we recognize large numbers of various types of neural networks, such as cyclic and acyclic networks, feed-forward, self-organizing, fully connected, sparsely connected, and many others. Each type of neural network features its particular advantages, which make it usable and even successful in different specific tasks.

The artificial neuron carries the ground computational power of the artificial neural network; the connections between the neurons increase the level of robustness of the calculations and also heighten the generalization. In living brains the amount of neurons is incomparably higher; for instance, the number of neurons in the human brain is of the order of hundreds of billions. At such a high density of neurons, the multitudes of mutual connections even allow for memory. Networks consisting of more than one neuron are thus an approximation of the complex neural structure and are believed to bring about at least the robustness and generalization.

SYNERGISTIC COMBINATION OF GENETIC ALGORITHM AND NEURAL NETWORK PARADIGMS
The idea to optimize neural network architectures through the application of evolutionary strategies was originally initiated by work with the neural networks themselves. The neural networks of our concern are neural networks with switching units (NNSU); this type of neural network was designed to be trained using (single-pass, linear or logistic) regression, and is thus unusually fast. Naturally, the idea came up to experiment with a wider range of architectures and to try to identify those well-conditioned for high quality results.

We worked to define a general representation of the NNSU architecture which can be easily incorporated in some optimization process; a representation that would allow for convenient handling of the individual architectures. Moreover, we sought a representation that can be optimized via some evolutionary strategy.

The representation problem, in general, belongs to the known issues of today's modelling. As the structural complexity of the studied objects grows, the descriptive grammars need to adapt and get more and more specific. We can name some types of neural network architecture representations. In general, they are of two types: direct and indirect encodings. The direct encodings aren't suitable for recombination operators, since the recombination might disrupt the descriptive grammar. The indirect encodings are more flexible in this way, mostly because they are constructive rather than descriptive.

The representation used in our particular case is an indirect cellular encoding introduced by Frédéric Gruau. The cellular encoding is further transformed into so-called Read's codes, or linear codes. The codes are extended by instructions from cellular encodings and some parameters from the original architectures. The structure representing a neural network architecture is then called an instruction-parameter code.

The representation via instruction-parameter codes was implemented and we adjusted its form so that it could represent various architectures: generic architectures, layered architectures, and architectures with variables selections.

The construction of the genetic algorithm required not only a valid representation of the objects to be optimized, but also some evolution schema defining the way the particular steps of the algorithm will be iterated. The choice of a representation and the evolution schema is naturally related via the requirement of closure of the set of representations under the operators of the evolution schema; e.g. in the case of the commonly used techniques of genetic algorithms, the binary case is used, consisting of binary strings and an evolution schema using substring swapping.

The instruction-parameter codes, thanks to their linear structure, allow for recombination using subcode swapping. The recombination defined this way effectively iterates the search through the representation space and rounds off the definition of the genetic algorithm.
STRUCTURING OF THE THESIS

The thesis reveals all material parts of our research and improvements of the optimization of architectures. It is organized in the most intuitive way. It proceeds through the particular steps where, subsequently, neural networks with switching units (NNSU) are brought up together with representations of their architectures. Then the properties of the chosen representations are reviewed. As a next step, a formal genetic algorithm is constructed for representations of NNSU. In the main chapter of this thesis, the application of the genetic algorithm on real data is reported and compared to a random search strategy.

1. First chapter. In the first chapter, a general introduction of NNSU is given, together with its basic properties. A short review of adjacency matrices as structures from the representational point of view leads to an enumeration of the main requirements put on representations of architectures of NNSU.


2. Second chapter. The second chapter introduces IPCodes, the representation of NNSU, and reviews all achievable architectures, including the random architectures, layered architectures, and special architectures with variables selection.

3. Third chapter. The third chapter reviews the IPCodes properties in the wider context of encodings of graph structures.

4. Fourth chapter. The fourth chapter rounds up all the previous chapters as it subsequently defines every part of the environment of a genetic algorithm. It counts in the further revision and deepening of the NNSU properties, the fitness function, the recombination operators, and the genetic algorithm that ties all of its parts together.

5. Fifth chapter. The fifth chapter reveals the experimental results; it turns the previously defined structures and algorithms into practice. The testing data are described, as well as the tuning parameters of the genetic algorithm environment that adjust the process of the optimization. Then, the results of the tuned-up genetic algorithm are listed, with a comparison to an uncontrolled random search of equivalent time consumption.

6. Sixth chapter. The sixth chapter reports the main deliverables of the whole work; it summarizes the goals of the dissertation and compares the results against them.

7. Appendix. The last chapter, denoted as an appendix, returns to all of the previous chapters from a theoretical point of view that outgrew the dimensions of the particular parts. Systematization of the advanced IPCodes features appeared to be a very complex task, and it would not bring sufficient effect to mention the rather particular results each time they showed some relevancy. Still, there were some interesting points established about the space of IPCodes and the behaviour of the genetic algorithm; these are listed in the appendix.

This thesis brings up many questions considering architectures, the measure of quality of decision systems, etc. Some of these questions become answered, some of them remain open and wait for further revisions.

1  Neural networks architectures
Neural networks are today a well-recognized tool for non-linear programming tasks. This is a natural consequence of the development that both theoretical knowledge and computational tools have undergone. The technical equipment became sufficient for the wide use of artificial intelligence approaches, which initiated and also motivated development in the corresponding theory. The area of neural networks has grown quite broad; there are many types of neural networks constructed for general as well as for specific purposes. Together with the growing usage and the evolution of new types of neural networks, the corresponding terminology has grown into a variety of interpretations and specific definitions.

We will narrow down to a specific type of neural network, regarding the topology and also the learning procedures. We are speaking of the so-called neural networks with switching units (NNSU). Further on, our concern will turn to representations of NNSU, more specifically to the representation of their architectures.

1.1  THE ACYCLIC ARCHITECTURE OF NNSU

Throughout the following section we will subsequently describe the properties of the NNSU. We will try to emphasize the points that are relevant and important for the further chapters. Mainly, this will relate to the representational point of view.

1.1.1  NEURAL NETWORKS WITH SWITCHING UNITS

Speaking in terms of acyclic topology and acyclic architecture, we formalize the frequently used terms topology and architecture of a neural network (NN). Prior to these definitions, we present our specific type of NN as a whole and describe its main properties.

We speak of neural networks with switching units (NNSU). In deeper detail, we should say neural networks with neurons with switching units, because it is the neurons that possess a switching unit. We call these neurons neurons with switching unit (NSU). This specific type of NN was introduced in 1995 by Bitzan, Šmejkalová, and Kučera. Comprehensive information about this type of NN can be found in the originating article (Bitzan, Šmejkalová, Kučera, 1995).

The concept of NNSU was later improved and re-implemented by Marek Hlaváček. All significant steps are described in Hlaváček's diploma thesis (Hlaváček, 2002). Further articles about recent NNSU theory and applications are (Hakl, Hlaváček, and Kalous, 2003) and (Hlaváček, Kalous, 2003).

From the very beginning, the NNSU model was intended to be computationally undemanding, and it underwent many forms of implementation, starting from an early DOS application and ending with today's multi-platform modular NNSU tool. The recent implementation and computational updates of the NNSU tool are due to Jan Vachulka, who reviews the computational properties and enhancements in his diploma thesis (Vachulka, 2006).

Regarding the properties of NNSU, we should primarily turn to the main features of the NNSU and relate these to the wider context of neural network approaches. An illustrative decomposition of the NNSU structure is sketched in figure 1.1, where all the discussed properties are also displayed.
Block structure

The NNSU are composed of blocks. Where common neural networks usually incorporate a single kind of neural computation unit and the corresponding data flows between these units, the NNSU applies a two-level approach.

On the first level, the data flows are designed between individual blocks and should represent a reasonable composition of the elementary powers of each block.

On the second level, the internal structure of each block is designed between particular neurons with switching units. The ground idea is to utilize rather small subnetworks of similar properties on the second level.

Both the first-level network and the second-level subnetworks are technically feedforward neural networks. On the first level, we allow an arbitrary feedforward topology that is computationally feasible and does not lead to a generalization loss in data processing. In practice, we speak of networks counting tens of blocks and tens of data flows between the blocks.

On the second level, the blocks may consist of any valid feedforward network. For the time being, the most utilized type of network is a chain of linearly connected neurons. This elementary type of NN is called a linear chain (CHAIN); closer information about this elementary type of NN is found in (Hakl et al., 2003). We usually construct CHAINs counting up to ten neurons with switching units.
Acyclic topology

The internal NNSU topology features no implicit recursion and no feedback connections. This means that no feedback edges (an edge from a block to itself or to one of its predecessors) may appear in the topology. If some type of feedback is implemented, it needs to be a feature of a particular type of block (on the first level) or neuron (on the second level).¹

The fact that the NNSU topologies are acyclic determines the learning algorithm and the whole adaptive dynamics of the network; the NNSU are learnt via a single-pass learning algorithm.

We also limit the number of input and output blocks to one. Thanks to this, we work with standardized input and output interfaces in the NNSU and do not accept multi-component inputs or outputs beyond the standard structure.
Switching units

The first specific detail about the NSU is the inner structure. The neurons within the blocks in the NNSU structure contain switching units connected to the calculation units. The neuron components are two-fold.

The initial clustering unit. The clustering unit splits the input data into disjoint sets, the clusters.

Each cluster is then processed by a computational (neuron) unit. This unit is equivalent to a commonly used neuron.

The original intention with switching units was the enhancement of the data processing through individual treatment of homogeneous subsets of the data. This approach is kept, though the clusters are found independently in each NSU.
¹ Currently, Marek Hlaváček is working on NSU learning with feedback for time forecasting models; the initial results should be available during 2009.


Active and adaptive dynamics

The specific inner structure of the NNSU, and even of the NSU, demands corresponding adaptive and active dynamics. The choice made for the NNSU is that the switching unit simply assigns each data record to a specific computational unit, and each computational unit is then a generic neuron with a linear activation function ℝ^N → ℝ. The overall activation function of a NSU is a piecewise linear function ℝ^N → ℝ, not necessarily continuous. The NNSU does not incorporate sigmoids or any further mapping to {0, 1}.

The adaptive dynamics of the switching unit represents a clustering method. For the NNSU we picked the k-means algorithm; closer information is found in (Vachulka, 2006). The adaptive dynamics of the computational unit is a straightforward choice, as it encapsulates a linear model. For data classification tasks the NNSU utilizes logistic regression for the internal parameter setup.

Even though the k-means algorithm is iterative, we claim that the learning of NSUs is quick compared to other learning methods, such as backpropagation. We will later see that the computational time is kept in the order of tens of seconds.
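
To make the active dynamics concrete, the following sketch (our illustration only, assuming pre-computed cluster centres and per-cluster linear coefficients; it is not the NNSU tool implementation) shows how a single NSU maps a record: the switching unit picks the nearest cluster centre and the corresponding linear unit produces the output, which yields a piecewise linear, not necessarily continuous, function from ℝ^N to ℝ.

    import numpy as np

    class NSU:
        """Illustrative neuron with switching unit: nearest-centre switch plus per-cluster linear unit."""

        def __init__(self, centres, weights, biases):
            self.centres = np.asarray(centres)   # (k, N) cluster centres, e.g. from k-means
            self.weights = np.asarray(weights)   # (k, N) one linear model per cluster
            self.biases = np.asarray(biases)     # (k,)

        def switch(self, x):
            # switching unit: index of the nearest cluster centre
            return int(np.argmin(np.linalg.norm(self.centres - x, axis=1)))

        def activate(self, x):
            # piecewise linear activation R^N -> R; no sigmoid, no mapping to {0, 1}
            x = np.asarray(x, dtype=float)
            c = self.switch(x)
            return float(self.weights[c] @ x + self.biases[c])

    # toy NSU with two clusters in R^2; all numbers are made up
    nsu = NSU(centres=[[0.0, 0.0], [5.0, 5.0]],
              weights=[[1.0, -0.5], [0.2, 0.3]],
              biases=[0.1, -1.0])
    print(nsu.activate([0.5, 0.2]), nsu.activate([4.8, 5.1]))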

Figure 1.1: Simple scheme of the levels of structures in NNSU. A whole NNSU architecture consists of blocks (CHAINs CH1, ..., CH4); blocks consist of particular NSUs. The left-most part shows the whole architecture, the middle one depicts the block CHAIN 3 (3, 3, 2, 4) consisting of NSU1, ..., NSU4. The right-most part reveals the detailed structure of the 3rd NSU in the CHAIN, consisting of the SU and the neuron units N1 and N2.

All of the points mentioned above shape the resulting topologies and architectures. The first two points determine topologies, the second two relate to architectures. Before we move to the topologies review, let us emphasize once again that throughout this text the term neural network will stand for NNSU.

1.1.2  ACYCLIC TOPOLOGY

The topology of a neural network, in general, is usually described as an oriented graph. The further specification of the graph depends on the type of the network, its purpose, and many other factors. This way, we can meet cyclic topologies as well as acyclic ones, one-source or multi-source, single-component or multi-component, etc. For example, the cyclic topologies are useful when modelling time-dependent data: the learning procedure iterates over time while the time dimension is realized via a feedback connection.

The NNSU topology is viewed at the block level, i.e. the nodes of the graph correspond to the blocks and the edges between the nodes represent the data flows between the blocks. This is very clearly seen in figure 1.1. The NNSU are acyclic, and naturally the set of acyclic digraphs is used to describe the connections between particular blocks. The NNSU topology contains one input block (IN), one output block (OUT), and a set of data processing blocks between them.

In general, we assume the set of data processing blocks is non-empty; it is a straightforward demand in order to analyze the data in a thorough way. On the other hand, we do not exclude the empty topology. This is because the evolutionary process might recombine the topologies so that the empty topology emerges, and we do not want it to fall out of the solution set or the evolutionary process to stop because of this. The empty NNSU with no inner block provides low computational power, since only the last block performs a calculation, i.e. only a single regression runs in this case.

All the data flows are designed to start in IN and end in OUT. This type of graph is also known as a single-source single-sink graph or single-source single-sink network, written as s,t-graph. These networks usually describe flows from one location to another and they are frequently used in flow optimizations, e.g. for simulations of the impact of DDoS attacks on TCP/IP networks, traffic jams, etc.
The s,t-graph property actually means that for each node there exists a path from IN to this node and a path from this node to OUT. We can formulate this requirement via the condition that each node possesses at least one ancestor and at least one descendant, except for the IN node, which only has descendants, and the OUT node, which only has ancestors.

The set of nodes V of the graph G = (V, E) contains at least the IN and OUT nodes. The set of inner nodes, denoted V0, is either empty or isomorphic to the set of the first |V0| integers, V0 ≅ {1, ..., |V0|}. Letting the IN node correspond to 0 and the OUT node correspond to |V| − 1, we can say that ∀j ∈ V0: IN < j and j < OUT, and the sets V0 and V are applicable for indexing.

Figure 1.2: Plot of topology G.

As we have defined this indexing, we can directly introduce the adjacency matrix as a matrix A^G whose (i, j)-th entry is equal to 1 if an edge connects node i to node j, and equal to 0 if not. The adjacency matrix provides an effective descriptive structure of any graph and we will frequently use it. A sample adjacency matrix is seen in equation (1.3). The adjacency matrix features many properties, a few of them listed in the following.²

Returning to the first requirement on NNSU topologies, we are asserting that for each specific inner node there exists a path from the IN node to the OUT node that contains this inner node. We can formulate this in terms of the adjacency matrix by insisting that in each row (except the OUT row) there exists at least one non-zero entry, and in each column (except the IN column) there exists at least one non-zero entry.

² All of the propositions can be found in (Matoušek, Nešetřil, 1996), where they are described and proved in a very intuitive way.


Definition 1.1. An acyclic digraph G = (V, E), V = {IN, OUT} ∪ V0, V0 ⊂ ℕ, is called an acyclic topology if its adjacency matrix A^G fulfils the following conditions:

$$\forall k \in V \setminus \{\mathrm{OUT}\}: \quad \max_{l \in V} A^G_{k,l} > 0, \tag{1.1}$$

$$\forall k \in V \setminus \{\mathrm{IN}\}: \quad \max_{l \in V} A^G_{l,k} > 0. \tag{1.2}$$

The conditions (1.1) and (1.2) formulate the requirement about the existence of ancestors and descendants. The conditions are easy to verify: we simply go through the adjacency matrix and watch the column and row maximums, or the column and row summations $A^G_{\cdot,k}$ and $A^G_{k,\cdot}$. Once the row-sum vector possesses a zero at a position other than OUT, or the column-sum vector a zero at a position other than IN, the graph fails to be an acyclic topology.

The set size |V0| gives the number of blocks within the topology. In words, the empty topology contains 0 blocks, a non-empty topology contains |V0| blocks. As an example, let us take a look at the topology of the already mentioned network.
Example 1.1. Let us consider a graph G = (V, E), V = {IN, 1, 2, 3, OUT}, E = {{IN, 1}, {IN, 2}, {1, OUT}, {2, 3}, {3, OUT}}. We can see that the vectors of column and row summations look as $A^G_{\cdot,k} = (0, 1, 1, 1, 1)^T$ and $A^G_{k,\cdot} = (1, 1, 1, 1, 0)^T$. This proves that the adjacency matrix $A^G$ fulfils conditions (1.1) and (1.2). Thus, the graph G is a topology on 5 nodes with 3 inner nodes. The topology is shown in figure 1.2. The corresponding adjacency matrix is given as follows.

$$A^G = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \tag{1.3}$$
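
The check described above is mechanical; the short Python sketch below (our illustration, not part of the NNSU tool) verifies conditions (1.1) and (1.2) for the adjacency matrix of the topology G from example 1.1, with node 0 standing for IN and the last node for OUT.

    import numpy as np

    def satisfies_topology_conditions(A):
        """Conditions (1.1) and (1.2): every node except OUT has a descendant,
        every node except IN has an ancestor. Acyclicity itself is assumed here."""
        A = np.asarray(A)
        n = A.shape[0]
        rows_ok = all(A[k, :].max() > 0 for k in range(n - 1))     # (1.1), all rows but OUT
        cols_ok = all(A[:, k].max() > 0 for k in range(1, n))      # (1.2), all columns but IN
        return rows_ok and cols_ok

    # adjacency matrix of topology G from example 1.1 (node order IN, 1, 2, 3, OUT)
    A_G = [[0, 1, 1, 0, 0],
           [0, 0, 0, 0, 1],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0]]
    print(satisfies_topology_conditions(A_G))   # True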

1.1.3  ACYCLIC ARCHITECTURE

The topology works perfectly as a graph representation. It covers the topological structure, all the nodes and the connections between them. Thus, it can stand for any block structure of the NNSU. Still, it does not carry any further information about the structure of the particular blocks, nor the additional parameters of the blocks and data flows. It is a straightforward extension to insert all the descriptions into the node labels and work with labelled graphs. Adding these labels to the topology, we construct a suitable description of the NNSU which is enough to capture its architecture.

The labels may comprise parameters together with the general descriptions of each block and all particular information. The parameter set may be considered in quite a general way depending on the particular situation, e.g. the type of activation function, the parameters of the activation function, etc. For now, we will assume these to lie in some parameter set PS. For the IN and OUT blocks there are no parameters needed, because they only pass the data to the successive blocks. These blocks simply won't be assigned any vectors, or will be assigned a zero vector for convenient notation.
Definition 1.2. A labelled acyclic digraph G = (V, E, d), V = {IN, OUT} ∪ V0, V0 ⊂ ℕ, d_j ∈ PS, j ∈ V, d_IN = 0, d_OUT = 0, is called an acyclic architecture if its adjacency matrix A^G fulfils conditions (1.1) and (1.2) in definition 1.1.


In the following example we will show a simple architecture. Actually, it is the extension of the topology listed in example 1.1. Here, PS is simply the set of integer vectors of variable length.
Example 1.2. Again, let us consider the graph H = (V, E, d), V = {IN, 1, 2, 3, OUT}, E = {{IN, 1}, {IN, 2}, {1, OUT}, {2, 3}, {3, OUT}}, d = (0, (2, 3, 5), (5, 3, 3), (3, 3, 2, 4), 0)^T, as an extension of the topology G from example 1.1. According to definition 1.2 above, this is an acceptable architecture. The architecture is shown in figure 1.3, and the corresponding adjacency matrix and vector of parameters are given as follows.

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad d = \begin{pmatrix} 0 \\ (2, 3, 5) \\ (5, 3, 3) \\ (3, 3, 2, 4) \\ 0 \end{pmatrix}$$
This architecture defines a NNSU instance that consists of 3 blocks.
Figure 1.3: Plot of architecture H.

The block assigned as 1 consists of a CHAIN of length 3 with the particular numbers of clusters within the NSUs set to 2, 3, and 5. The incoming data flow comes from the input block IN; the outgoing data flow goes to the output block OUT.

The block assigned as 2 consists of a CHAIN of length 3 with the particular numbers of clusters within the NSUs set to 5, 3, and 3. The incoming data flow comes from the input block IN; the outgoing data flow goes to block 3.

The block assigned as 3 consists of a CHAIN of length 4 with the particular numbers of clusters within the NSUs set to 3, 3, 2, and 4. The incoming data flow comes from block 2; the outgoing data flow goes to the output block OUT.
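
As an illustration of how an acyclic architecture in the sense of definition 1.2 can be held in memory, the sketch below pairs the adjacency matrix of H with the block labels; the class and field names are our own choice for this example and do not come from the NNSU tool.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class AcyclicArchitecture:
        """Labelled acyclic digraph: adjacency matrix plus one label per node.
        Node 0 is IN, the last node is OUT; their labels stay empty (the zero labels)."""
        adjacency: List[List[int]]
        labels: List[Tuple[int, ...]]   # e.g. numbers of clusters in the CHAIN of each block

        def block_count(self) -> int:
            # inner nodes only, i.e. |V0|
            return len(self.labels) - 2

    # architecture H from example 1.2
    H = AcyclicArchitecture(
        adjacency=[[0, 1, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0]],
        labels=[(), (2, 3, 5), (5, 3, 3), (3, 3, 2, 4), ()],
    )
    print(H.block_count())   # 3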

The architecture defined this way represents a general form of an architecture. The number of nodes and their mutual connections may rise to higher or lower amounts, and the inner structure of the blocks may differ as well. Structures defined this way formally represent the world of NNSU architectures. The set of architectures will be the generalization used for all the descriptions when speaking of the NNSU instances, for all the domain and range spaces of the recombination operators, for the range spaces of functions randomly sampling NNSU, and in any other place where needed.

Definition 1.3. The set of all acyclic topologies on N nodes is denoted AT(N), the set of all acyclic architectures on N nodes is denoted AA(N). The set of all acyclic topologies on at most MAXN nodes is denoted AT_MAXN, and the set of all acyclic architectures on at most MAXN nodes is denoted AA_MAXN.
Within the architecture, the nodes represent blocks together with their inner structure; further, the architecture contains the number and design of the connections between the blocks. That is, the complexity of the individual blocks' calculations and also the particular data flows between the blocks are subject to the architecture design. We need to control both the number of nodes and the number of connections in order to keep the computational complexity within acceptable or required bounds.

In figure 1.4 we can see the growth of CPU time consumption depending on the number of nodes and the number of edges within an architecture. Neither the number of blocks nor the number of data flows seems to indicate a super-linear CPU time dependence. We see that with a demand of an NNSU learnt within 10 seconds we can use up to 40 blocks with approximately a hundred data flows.

Figure 1.4: CPU time spent on learning architectures of a given number of NSUs and number of data flows. (Left: learning procedure duration in seconds versus the number of blocks within a NNSU architecture; right: learning procedure duration in seconds versus the number of edges within a NNSU architecture.)

1.2  EXISTING APPROACHES TO ACYCLIC ARCHITECTURES REPRESENTATION

The overall goal of our work is to design a working evolutionary scheme that would optimize the architectures through an efficient representation. We will briefly show that the intuitive pick of a representation, i.e. the adjacency matrix of the architecture graph, is not enough for our purposes.

In the second subsection we will discuss the existing approaches to graph representation and we will shortly introduce the representation selected for the NNSU tool.

1.2.1  THE ADJACENCY MATRIX IS NOT AN ACCEPTABLE REPRESENTATION

The first step to be made when defining a genetic algorithm is to find a suitable representation: a representation that decodes to the studied object in a flexible way and that can be recombined with another one as well. The adjacency matrix complies implicitly with the first of the two properties mentioned. It is rather the lack of the latter, an acceptable recombination scheme, that makes adjacency matrices impractical for the representational purpose and thus also for the genetic algorithm environment.

The topologies and architectures are defined as acyclic digraphs, in the case of architectures as acyclic digraphs with additional labels. We will unveil some of the reasons why the adjacency matrix could barely work as a representation within a GA environment. The reasoning does not formalize any general inability of genetic algorithms based on adjacency matrices; it rather points out some of the complications we faced and why we rejected adjacency matrices as acceptable representations.


In the following paragraphs we schematically review the two main drawbacks in a short example depicting the construction of a recombination scheme against which the adjacency matrix is not closed. The most common definition of a recombination scheme is based on the mutual swapping of subparts between the parent individuals. Following that, let us think of a simple recombination of two topologies as swapping (randomly chosen) columns and rows. This recombination would, in general, recombine edges between nodes or, in terms of the NN, the data flows between neurons/blocks.

Even such a trivial approach to recombination carries a few limitations. Firstly, we can see that for matrices of different dimensions we have to handle redundant positions in the columns and rows originating from the larger matrix, as well as missing positions in the columns and rows originating from the smaller one. This inconvenience with handling matrices of different sizes is what we call a lack of scalability.
Even if we count on matrices of the same dimensionality only, we will easily find a pair of topologies and their columns which produce a graph that is not a topology. In particular, consider C, D ∈ AT(5), given by the following matrices.

$$A^C = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad A^D = \begin{pmatrix} 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \tag{1.4}$$
Then, let $k^C = 3$, $k^D = 2$ denote the columns and rows to be swapped. Now, we will look at C′, D′ being C, D having swapped the $k^C$-th row and column with the $k^D$-th. Formally,

$$\begin{aligned}
A^{C'}_{ij} &= A^C_{ij},\ i, j \in \{1, 2, 4, 5\}, & A^{D'}_{ij} &= A^D_{ij},\ i, j \in \{1, 3, 4, 5\},\\
A^{C'}_{i3} &= A^D_{i2},\ i \in \{1, \dots, 5\}, & A^{D'}_{i2} &= A^C_{i3},\ i \in \{1, \dots, 5\},\\
A^{C'}_{3j} &= A^D_{2j},\ j \in \{1, \dots, 5\}, & A^{D'}_{2j} &= A^C_{3j},\ j \in \{1, \dots, 5\}.
\end{aligned} \tag{1.5}$$

The adjacency matrices $A^{C'}$, $A^{D'}$ would then look as follows.

$$A^{C'} = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad A^{D'} = \begin{pmatrix} 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \tag{1.6}$$

It is seen that the fourth column of $A^{C'}$ sums to zero, $A^{C'}_{\cdot,4} = 0$, and thus C′ ∉ AT(5). The particular topologies are depicted in figure 1.5: the graph C′ is not single-source, i.e. it possesses one node for which there does not exist a path from IN to it.
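
The failure can be reproduced mechanically. The sketch below (our illustration; index handling is 0-based, unlike the 1-based notation in (1.5)) swaps the chosen row and column between the two adjacency matrices and re-checks conditions (1.1) and (1.2):

    import numpy as np

    def swap_row_col(A_src, A_other, k_src, k_other):
        """Return A_src with its k_src-th row and column replaced by the
        k_other-th row and column of A_other, as in the swap (1.5)."""
        A = np.asarray(A_src).copy()
        B = np.asarray(A_other)
        A[k_src, :] = B[k_other, :]
        A[:, k_src] = B[:, k_other]   # the overlapping entry follows the column rule
        return A

    def satisfies_topology_conditions(A):
        A = np.asarray(A)
        n = A.shape[0]
        return (all(A[k, :].max() > 0 for k in range(n - 1))       # (1.1), all rows but OUT
                and all(A[:, k].max() > 0 for k in range(1, n)))   # (1.2), all columns but IN

    A_C = np.array([[0, 1, 1, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 1, 0],
                    [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]])
    A_D = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1],
                    [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]])

    A_C2 = swap_row_col(A_C, A_D, 2, 1)   # k_C = 3, k_D = 2 in the 1-based notation above
    A_D2 = swap_row_col(A_D, A_C, 1, 2)
    print(satisfies_topology_conditions(A_C2), satisfies_topology_conditions(A_D2))   # False True
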
The reason why this recombination would not work for adjacency matrices is the difference in how the recombination and the adjacency matrix approach the substructuring. The recombination assumes that the subpart is somehow related to the rows and columns of the adjacency matrix, whereas the adjacency matrix does not necessarily say anything about substructuring, not even in terms of its columns or rows. A sub-matrix of the adjacency matrix does not always relate to a subgraph of the represented topology (architecture). The adjacency matrix does not exhibit modularity.

To summarize: for adjacency matrices the recombination scheme is complicated to define for matrices of different dimensions (insufficient scalability), and further it is not clear how to incorporate a substructuring ability into the adjacency matrices (insufficient modularity). The adjacency matrices do intuitively describe the structure of a topology, but they are not suitable for representational purposes due to the lack of scalability and modularity.

Figure 1.5: From left to right: the topologies C and D, then the resulting C' which is not a topology due to the node 3, and D' which is.

1.2.2 Effective acyclic architectures representations

We may ask what properties make a descriptive structure an acceptable representation as well, and what the most flexible structure would be. As early as 1990, the first models of architectures began to appear. The very first models were rather structure-oriented, i.e. they represented the structure of architectures in a similar way as the adjacency matrix. More advanced approaches, based on developmental rules, were found soon afterwards. Let us now have a short review of the two main approaches to the representation of architectures.
Direct encodings

The very first approaches were direct encodings working as a simple serialized storage of nodes, edges, and parameters. The edge encodings, one of which is GENITOR by Whitley (Whitley, 1990), decompose the adjacency matrix and the parameters (weights) into a single vector. GENITOR allowed for recombination, but with the significant constraint that a maximal architecture had to be defined in advance, so that the edge indicators only assigned the weight of a specific edge. An extension of GENITOR was later proposed that incorporated weight vectors of variable lengths; the length control was not an integral part of the representation though, it was driven by the setup of the computational program.
The node encodings approach the problem by working with nodes only. Even the node-based encodings have to narrow down to specific architectures and data flows. Also, the recombination operators need to be adjusted into a form quite different from the standard swapping and interchanging of representational fragments. Representatives of the node encodings are the representations by Schiffmann (Schiffmann et al., 1990, 1992, 1993), or GANNet by White (White, 1993).
Other types are the layer-based (Mandischer, 1993) and pathway-based (Jacob, Rehder, 1993) encodings, which incorporate specific substructures as encoded entities, either sequences of layers or the IN-OUT paths. These encodings are even more structure-specific than the node/edge encodings, and they also require specialized recombination operators.
No matter whether we use a node-encoded or an edge-encoded technique, the main constraint of direct encodings remains: the encodings are not scalable enough; either the number of nodes or the number and positions of edges have to be fixed. The next lacking feature is modularity; it is not straightforward to tell which subparts of the representation correspond to subparts of the resulting neural network architecture. This naturally limits the performance of the evolutionary process, mainly its recombination features.
Indirect encodings

The lack of scalability and modularity led to a brand new approach to architectures representation. Halfway between direct and indirect encodings we would find Koza's genetic programming technique, which encoded a fixed architecture within parameter trees.
The initial work in the indirect encodings area is by Kitano, who introduced simple rewriting schemes for matrices. This method brought the first advantage over direct encodings, namely in terms of scalability: the rewriting rules allow for the representation of architectures of arbitrary size. In (Kitano, 1990), a decoding mechanism is described: all 2 × 2 matrices are assigned to terminals a, b, . . . , p, and for the non-terminals A, B, . . . and the starting symbol S the rewriting rules are given. The encoding consists of the starting symbol and a number of non-terminals and terminals, for instance in the form SABAgdcabbaf. The recombination operators are applied in the form known from standard evolutionary optimization on strings, with the exception that the splitting position can affect neither the starting symbol nor the non-terminals.
A highly advanced approach was introduced by Frederic Gruau; in the same way as Koza uses tree structures for the purposes of genetic programming, Gruau utilizes trees for program symbol storage and manipulation. The encoding is called cellular encoding (CE). The instructions that build the resulting architecture are stored within a program symbol tree; as the program symbol tree is traversed down, the instructions and additional parameters are read and applied to the emerging architecture. The CE provides both modularity and scalability, and it also easily implements recombination operators via subtree operations.
Context free grammars approach

A further extension of the encodings is a significant relaxation into so-called context free grammars (CFG). The CFG approach merges the rewriting rules and genetic programming into a single method: using special rewriting rules, the CFG creates valid representations, e.g. instances of CE, from ordinary integer series. The representation then decodes into a final phenotype. This method doubtlessly stands as the most advanced one regarding the complexity of the representation decoder.
Application of the CFG rounds off the original scheme of the genetic algorithm: the representations are integer strings and recombination might proceed in the standard way; the only difference is that the decoding procedure is largely advanced itself and produces a highly complex resulting structure of a NN architecture.
Within this text, we won't incorporate the CFG techniques; for the use of the NNSU tool we prepared an indirect encoding based on Gruau's cellular encoding.

2 Instruction-Parameter Code: The representation of the acyclic architecture

We concluded at the end of the introductory chapter that the adjacency matrices are useful for descriptive rather than representational purposes. On one hand, they firmly fit the topologies and encompass their structure in detail. On the other hand, they do not reflect the scalability of architectures nor support their modularity, which casts the adjacency matrices out of the recombination scheme.
The representation introduced in this chapter combines two methods: the Program Symbol Trees (PST) approach introduced by Frederic Gruau (Gruau, 1994), and Read's linear codes (Codes) by Ronald C. Read (Read et al., 1972). The PST technique is used as a representation of the architectures; the Codes technique then transcodes the PST tree structure into linear integer series. Using the combination of these two techniques we derive an indirect representation in the form of multivariate integer series, called the Instruction-Parameter Code.
Initially, we look over the representational relation between architectures and PSTs, and between PSTs and IPCodes. In the second section, we approach the structural content of the representation: the Code and the sets of Codes. The level-property, the Codes' most important feature, is discussed together with its direct applications: the subcoding ability and random generating. The subcodes, their seeking and their random sampling are shown. Also, the possible recombination abilities are formally defined and described in examples.
After we finish the work with the Codes and their properties, we move to the connection with instructions in the third section. That builds up an Instruction Code, the representation of the topology. In this section, we introduce and analyze the constructive algorithm that grows the acyclic topologies. In the fourth section we add the internal parameters as the last part of the representation and we construct the Instruction-Parameter Code. In an illustrative example we construct an architecture by a step-by-step decoding of a sample Instruction-Parameter Code.
In the last section we outline the architecture classes represented via Instruction-Parameter Codes. We list three classes of architectures that differ in their topological structure. This rounds up the development of the representational dimension of the Instruction-Parameter Codes. The evolutionary optimization point of view comes to discussion in the following chapters.

2.1 Representational scheme

The mechanism that successfully represents architectures was introduced by Frederic Gruau in his doctoral thesis (Gruau, 1994). The representations proposed by Gruau are called Cellular Encodings. This encoding stores the building instructions (and possible additional parameters) in a tree structure called a Program Symbol Tree (PST). Sometimes, these are also called Encoding Trees.

The tree structure conveniently stores the strings of the building instructions and provides easy control over the order of their use. The decoding procedure proceeds by parsing through the PST and executing the particular program symbols, applying them to particular nodes within the graph being emerged.
For the purposes of the NNSU tool, we decided to further decompose the whole PST content into a linear code that carries the structural as well as the informational entries of the PST. We can say we transformed the PST into a program symbol code. We call these program symbol codes Instruction-Parameter Codes, or shortly IPCodes.
This additional transformation step from the PST to the linear codes was made for two main reasons. Firstly, within the NNSU tool, these codes are preferred due to a straight and more transparent implementation into the NNSU environment. Secondly, there was the intention to design a representation that is as close as possible to the string representations; thus, the linear form was the natural choice. As we will see later, the structural complexity could not be decreased significantly though.
As soon as we establish the linear equivalents of the PSTs, the PST can be formally excluded and we can work only with the direct relations between architectures and IPCodes. In accord with the PST contents, the three ground dimensions of the IPCodes are as follows (see also the IPCode in figure 2.1).
- A skeleton of the IPCode, a coding structure called the Code;
- constructive program symbols, called instructions, composing a structure called the ICode; and
- all relevant descriptive parameters.
In the subsequent sections we will turn to all three dimensions. Afterwards, we will finalize
the detailed structure of the IPCode. Now, let us move to the first part, the Code.

2.2 Read's codes and subcodes

The PST within the cellular encoding scheme is transcoded into a multivariate integer series so that the resulting sequence equivalently corresponds to the original PST. The structural part of the PST, the tree structure, is coded within the first dimension, while the other dimensions comprise the program symbols.
The correspondence between the trees' structure and their codes is captured by a series of inequalities and one equation, called as a whole the level-property. The construction of the sequence itself is carried out via PST traversing. The ground idea is that the sequences and PSTs correspond to each other in the sense that any integer series fulfilling the level-property can be transcoded into (a tree part of) a PST and, conversely, any (tree part of a) PST is rewritten into an integer series that fulfils the level-property.
The linear codes are shown to be equivalent to the trees and, as a result, we are allowed to focus on the Codes as our primary structure from the very beginning. Naturally, we will return to the PST-Codes correspondence whenever it is illustrative and helpful.


Figure 2.1: Overall scheme of the IPCode representation (the IPCode with its Read's code, instructions and parameters, the corresponding PST of the cellular encoding, and the resulting NNSU architecture). The instructions are applied subsequently and build the resulting architecture. It does not matter whether we use a tree or a linear code to store the information.

2.2.1 Codes definition

The linear coding of tree structures is set out by Ronald C. Read in the book (Read et al., 1972), as a part of a wide introduction to the coding of unlabelled trees. In this text, Read outlines various ways of coding trees, e.g. the bottom-up valency codes or the walk-around valency (WAV) codes. For the NNSU tool we chose the WAV codes; we will refer to these WAV codes simply as Codes.
Read thoroughly depicts the types of trees for which the linear codes build the representation: the so-called planted plane trees. The planted plane trees are rooted trees with a strong isomorphism induced by a fixed embedding of their drawing in a plane. The PST structures from the grammatical evolution adhere to this specific isomorphism and are, in fact, planted plane trees. This is because in the case of a PST it highly depends on the order in which the particular program symbols are read and processed; in terms of traversing the tree and the order of the visited nodes this is exactly what it means for the tree to be a planted plane tree.
The original definition of the Code by Read proceeds from the trees to the linear code, depicting the relating condition called the level-property. We will now turn directly to the level-property that ties the planted plane trees and the Codes together. The level-property is a criterion asserting that a sequence is a linear code of some tree. Conversely, any tree encoded into a WAV code fulfils the level-property. The following definition of the level-property is kept in a general form, holding for any integer series.
Definition 2.1. A finite sequence $\{a_j\}_{j=1}^{N}$, $N \in \mathbb{N}$, $a_j \in \{0\} \cup \mathbb{N}$, $j \in \{1, \dots, N\}$, fulfils the level-property if
$$
\sum_{j=1}^{k} a_j > k - 1, \quad k \in \{1, \dots, N-1\}, \qquad \text{and} \qquad \sum_{j=1}^{N} a_j = N - 1.
\tag{2.1}
$$

Consider, for instance, the sequences A = (4, 0, 1, 1, 0, 0, 0) and B = (4, 0, 1, 2, 0, 2, 0). The two tables in table 2.1 document the particular equation and inequalities of the level-property for A and B respectively. We can read that A fulfils the level-property while B does not; for B the last equation is broken. Thus, B cannot correspond to any PST whereas A can.

k   first k members of A     Σ_{j=1}^k a_j   [Σ_{j=1}^k a_j > k−1]   [Σ_{j=1}^7 a_j = 6]
1   4                         4               1
2   4 0                       4               1
3   4 0 1                     5               1
4   4 0 1 1                   6               1
5   4 0 1 1 0                 6               1
6   4 0 1 1 0 0               6               1
7   4 0 1 1 0 0 0             6                                       1

k   first k members of B     Σ_{j=1}^k b_j   [Σ_{j=1}^k b_j > k−1]   [Σ_{j=1}^7 b_j = 6]
1   4                         4               1
2   4 0                       4               1
3   4 0 1                     5               1
4   4 0 1 2                   7               1
5   4 0 1 2 0                 7               1
6   4 0 1 2 0 2               9               1
7   4 0 1 2 0 2 0             9                                       0

Table 2.1: Level-property checks for two different integer series. The logical values in the last two columns are calculated using the indicator [·].

The reasons why the B series does not fulfil the level-property are straightforward: the series has to contain non-zero values arranged in such an order that the partial sums grow faster than the number of summed members. At the same time, they cannot overgrow a certain level (the last equation of the level-property). The B sequence already failed to meet the final equation on its fourth position, where the partial sum reached the number 7.
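The level-property is easy to check mechanically. The following short Python sketch (our illustration, not NNSU code) tests it for the two example sequences.

```python
# Illustration: check the level-property from definition 2.1.
def fulfils_level_property(seq):
    n, partial = len(seq), 0
    for k, a in enumerate(seq, start=1):
        partial += a
        if k < n and not partial > k - 1:   # strict inequalities for k < N
            return False
    return partial == n - 1                  # the final equality for k = N

A = (4, 0, 1, 1, 0, 0, 0)
B = (4, 0, 1, 2, 0, 2, 0)
print(fulfils_level_property(A))   # True
print(fulfils_level_property(B))   # False: the total is 9 instead of N - 1 = 6
```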
Now we are going to define Codes as sequences fulfilling the level-property. Even though the level-property can be assigned to an arbitrary finite sequence of non-negative integers, we will restrict ourselves to integer series with {0, 2}-valued entries only. Gruau's PSTs are binary planted plane trees and the corresponding WAV codes are exactly {0, 2}-valued. Thus, the two-valued entries are a suitable base for the Codes construction.
As we suppose the Codes only contain the values 0 or 2, utilizing the last equation of the level-property we can see that the length of such Codes must be odd (N equals the sum of the 2s plus one). Thus, we will consider N ∈ {2k − 1 | k ∈ ℕ}.
Definition 2.2. For $N \in \{2k-1 \mid k \in \mathbb{N}\}$ we define the set of Codes of length $N$ as
$$
\mathrm{Codes}(N) \stackrel{\mathrm{def}}{=} \left\{ \{a_j\}_{j=1}^{N} \,\middle|\, a_j \in \{0,2\},\ \{a_j\}_{j=1}^{N} \text{ fulfils the level-property} \right\},
\tag{2.2}
$$
and the set of all Codes and the set of all Codes with maximum length MAXN respectively as
$$
\mathrm{Codes} \stackrel{\mathrm{def}}{=} \bigcup_{\substack{N = 2k-1 \\ k \in \mathbb{N}}} \mathrm{Codes}(N)
\qquad \text{and} \qquad
\mathrm{Codes}_{\mathrm{MAXN}} \stackrel{\mathrm{def}}{=} \bigcup_{N \le \mathrm{MAXN}} \mathrm{Codes}(N).
\tag{2.3}
$$

The level-property (a series of inequalities and one equality) utilised within definition (2.2) is an important key to the inner structure of Codes. Not only does it state whether a given sequence of non-negative integers is a Code or not, it also provides easy random generating of an arbitrary Code, and it defines the subcode of any Code at any position k ∈ {1, . . . , N}. We will examine the subcoding effect first, together with the recombination of Codes. Then we will present an algorithm for the random sampling of Codes.

Figure 2.2: Members of the sets 2^N fulfilling the level-property: 1 out of 2 (N = 1), 1 out of 8 (N = 3), 2 out of 32 (N = 5), 5 out of 128 (N = 7), 14 out of 512 (N = 9), 42 out of 2048 (N = 11), 132 out of 8192 (N = 13), and 429 out of 32768 (N = 15). Each set is represented by a full bar, each member that meets the level-property is plotted in black according to its position in 2^N, and the width of the bar is relative to the size of 2^N.

In the appendix, section A.3, we add further figures regarding the relation between the Codes and the tree structures; we depict the WAV code construction from a tree, and we also discuss the trees' isomorphism.

2.2.2 Subcoding

The subcoding speaks of an essential feature of Codes. In words it means that there exist sub-sequences of a Code that compose another Code. More precisely, for each position within the Code there is a unique subsequence that fulfils the level-property.
The subcoding structures within Codes are directly related to subtrees within the PST structures. This parallel leads to the propositions that the subcode exists for any position within the Code and that it is defined uniquely.
In the literature where the Codes are discussed (Read et al., 1972; Matousek, Nesetril, 1996), the Codes are constructed via concatenation of the Codes corresponding to subtrees. Under this approach, the subcoding is a natural consequence of the construction. Nevertheless, we prepared the proof of the existence and uniqueness of the subcodes on an independent basis, utilizing the level-property. In appendix section A.2 we propose the full version of the proof of the following lemma. It is not as straightforward as in the tree case; during the course of the proof it is felt that the level-property is a highly complex set of conditions. This higher complexity appears as a cost of the linear structure of Codes.
Lemma 2.1. Let N ∈ {2k − 1 | k ∈ ℕ}. For any Code A ∈ Codes(N) and any position k ∈ {1, . . . , N} there exists a subsequence {a_j}_{j=k}^{k+M−1} of length M ∈ {1, . . . , N − k + 1} that fulfils the level-property. This subsequence is defined uniquely.


The lemma asserts that the subcode exists at each position of any Code and it also provides its length M. Using these, we define the subcode as an official term.
Definition 2.3. Let N ∈ {2k − 1 | k ∈ ℕ}, A ∈ Codes(N), and k ∈ {1, . . . , N}. The subsequence {a_j}_{j=k}^{k+M−1}, M ∈ {1, . . . , N − k + 1}, is called a subcode on position k, denoted sub(A, k).
As an example, let us now look at the Code
A = {a_j}_{j=1}^{13} = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0).
The subcode on the 7th position, i.e. (0), is a Code of length 1. Similarly, the subcode on the 10th position, i.e. (2, 0, 0), is a Code of length 3. We may continue showing that on each position we can find a valid Code. For A these Codes are listed in table 2.2.
Original code A = {a_j}_{j=1}^{13}:   2  0  2  2  2  0  0  2  0  2  0  0  0

Position k = 1
  l − 1:                    0  1  2  3  4  5  6  7   8   9   10  11  12
  Σ_{j=1}^{l} a_{k−1+j}:    2  2  4  6  8  8  8  10  10  12  12  12  12
  sub(A, 1) = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0)

Position k = 7
  l − 1:                    0
  Σ_{j=1}^{l} a_{k−1+j}:    0
  sub(A, 7) = (0)

Position k = 10
  l − 1:                    0  1  2
  Σ_{j=1}^{l} a_{k−1+j}:    2  2  2
  sub(A, 10) = (2, 0, 0)

Table 2.2: Subcode seeking. For each position k the first row shows the value of l − 1, which is the right side of the (in)equality in the level-property, the second row shows the partial sum Σ_{j=1}^{l} a_{k−1+j}, which is the left side, and the third row shows the subcode. The subcode is found as soon as the left side equals the right one.
For each position of a Code there is one subcode found; table 2.2 shows the process of seeking the subcodes on positions 1, 7, and 10. We may ask how the particular subcodes look for an arbitrary Code and for arbitrary positions.
The formulas for the subcodes' lengths cannot be stated for each position in general, but there are some positions for which we can identify the subcodes exactly. For an arbitrary A ∈ Codes(N), A = {a_j}_{j=1}^N, the subcode on the first position is the original code:
sub(A, 1) = A, |sub(A, 1)| = N.
Further, for all k ∈ {1, . . . , N} where a_k = 0 the subcode on position k is a trivial code of length 1:
∀k ∈ {1, . . . , N}, a_k = 0: sub(A, k) = (0), and |sub(A, k)| = 1.
The remaining positions, however, cannot be easily assigned such general rules regarding the lengths of subcodes. Of course, we can always enumerate the subcodes and their lengths for a specific Code instance.
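The seeking procedure from table 2.2 translates directly into code; the following Python sketch (our illustration with a hypothetical helper name, not the NNSU implementation) enumerates the subcode at every position.

```python
# Illustration: find the subcode of a Code at position k (1-based), as in table 2.2.
def find_subcode(code, k):
    partial = 0
    for l, a in enumerate(code[k - 1:], start=1):
        partial += a
        if partial == l - 1:               # left side meets the right side: subcode found
            return code[k - 1:k - 1 + l]
    raise ValueError("no subcode found; the input is not a valid Code")

A = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0)
print(find_subcode(A, 7))    # (0,)
print(find_subcode(A, 10))   # (2, 0, 0)
print([len(find_subcode(A, k)) for k in range(1, 14)])
# [13, 1, 11, 9, 3, 1, 1, 5, 1, 3, 1, 1, 1], the lengths listed in table 2.3
```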
As another evidence of the advanced complexity of the inner structure of Codes we can
turn to the properties of all subcodes lengths within a single Code. We will see that the

2.2.

R EAD S

17

CODES AND SUBCODES

subcodes do not appear regularly within a given Code. We might expect that the observed
subcodes lengths will be all odd numbers between 1 and N . The truth is quite contrary.
Except for the trivial cases mentioned in the previous paragraph we cannot state for sure
that the subcode of a particular length can be found in the original Code. For instance, in
table 2.2.2 we can see that no Code of length 7 is found within the Code of length 13. This
is important point since we come to situations later, when the subcodes of the desired length
are sought and we need to know that in some cases they wont be found.
j    a_j   sub(A, j)                                  |sub(A, j)|
1    2     2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0       13
2    0     0                                           1
3    2     2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0             11
4    2     2, 2, 0, 0, 2, 0, 2, 0, 0                   9
5    2     2, 0, 0                                     3
6    0     0                                           1
7    0     0                                           1
8    2     2, 0, 2, 0, 0                               5
9    0     0                                           1
10   2     2, 0, 0                                     3
11   0     0                                           1
12   0     0                                           1
13   0     0                                           1

Table 2.3: Subcodes and their lengths for the Code A = {a_j}_{j=1}^{13} = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0).

2.2.3 Recombination abilities of Codes

Now that the subcodes are identified for an arbitrary position in any Code, we take a look at their main use. We have already observed that as long as a subcode grows sufficiently fast, so that the level-property holds at each position within the subcode, it does not matter what the exact order of zeroes and non-zeroes of the subcode is.
We may ask what changes we can perform on a subcode without destroying the whole Code. The answer is that we can apply much more general changes than a reordering of the subcode. For instance, we can cut off the whole subcode, put an arbitrary valid Code in its place, and we obtain a Code again. This openness to structural changes within Codes is the main point that allows for recombination, which is the very heart of any genetic algorithm.
We will utilize an operator formalizing the operations on subcodes. We define a binary operator mapping two Codes onto a single resulting Code. There is one additional parameter to this operator, designating the position at which the subcode swapping proceeds.
Definition 2.4. Let $A \in \mathrm{Codes}$, $A = \{a_j\}_{j=1}^{N}$, and $B \in \mathrm{Codes}$, $B = \{b_i\}_{i=1}^{M}$, $k \in \{1, \dots, |A|\}$, and $N_{\mathrm{sub}} = |\mathrm{sub}(A,k)|$. The left-addition on position $k$, denoted $\oplus_k$, is defined as follows.
$$
A \oplus_k B = \{c_l\}_{l=1}^{P},
$$
where $P = N - N_{\mathrm{sub}} + M$ and
$$
c_l = \begin{cases}
a_l, & l = 1, \dots, k-1, \\
b_{l-k+1}, & l = k, \dots, k+M-1, \\
a_{l-M+N_{\mathrm{sub}}}, & l = k+M, \dots, P.
\end{cases}
\tag{2.4}
$$

Figure 2.3: Graphs of two sample Codes. The plotted cumulative sums show that (2, 2, 2, 0, 0, 0, 0) and (2, 2, 0, 2, 0, 0, 0) meet the level-property, while (2, 2, 2, 2, 0, 0, 0) and (2, 0, 2, 0, 0, 0, 0) fail to meet it; saying that a Code grows sufficiently fast means that its plot does not cross the basic slope.

In the preceding definition of ⊕_k we did not discuss the range space of ⊕_k exactly. Naturally, we need ⊕_k to map to Codes. The following lemma states that this holds. Again, the intuitive comparison with the PST indicates that ⊕_k swaps the subtrees on the k-th node. The detailed proof utilizing the level-property is found in appendix section A.2.
Lemma 2.2. Let A ∈ Codes(N), A = {a_j}_{j=1}^N, B ∈ Codes(M), B = {b_i}_{i=1}^M, and k ∈ {1, . . . , N} be any position in A. The sequence {c_l}_{l=1}^P = A ⊕_k B fulfils the level-property, i.e. A ⊕_k B ∈ Codes(N − |sub(A, k)| + M).

Example 2.1. Let us take a look at a simple application of ⊕_k. Consider the Codes A = (2, 0, 2, 0, 0) and B = (2, 2, 0, 0, 0). Then
A ⊕_3 B = (2, 0, 2, 2, 0, 0, 0).   (2.5)
The subcode on the 3rd position, i.e. (2, 0, 0), is replaced with B = (2, 2, 0, 0, 0). The length of the resulting code A ⊕_3 B is 5 − 3 + 5 = 7, as stated in lemma 2.2. Even though we work with two Codes of length 5, |A| = 5 and |B| = 5, the resulting Code A ⊕_3 B is of length 7. This indicates that the length of Codes may, in general, vary through applications of the ⊕_k operator.
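A compact Python sketch of the left-addition (our illustration, reusing the find_subcode and fulfils_level_property helpers from the earlier sketches; it is not the NNSU implementation):

```python
# Illustration: left-addition A (+)_k B replaces the subcode of A at position k with B.
def left_addition(A, B, k):
    sub_len = len(find_subcode(A, k))          # length of sub(A, k)
    return A[:k - 1] + tuple(B) + A[k - 1 + sub_len:]

A = (2, 0, 2, 0, 0)
B = (2, 2, 0, 0, 0)
print(left_addition(A, B, 3))                           # (2, 0, 2, 2, 0, 0, 0), as in (2.5)
print(fulfils_level_property(left_addition(A, B, 3)))   # True, as lemma 2.2 asserts
```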

2.2.4 Random sampling of Codes

Once we run the genetic algorithm, we have to be able to fill the initial population with randomly picked representations. Thus, we need to specify how to sample Codes at random. We will utilize the level-property for the random sampling process. We simply choose the desired length N of the resulting Code, then we set a counter L to 1, and proceed with algorithm 1, which subsequently fills the values a_L of the Code A = {a_L}_{L=1}^N.

Algorithm 1 Random generating of the Codes
1: Generate a value a_L^tmp as a binomial variable from {0, 2} at random, e.g. with probability p = 0.5.
2: Level-property check. If Σ_{j=1}^{L−1} a_j + a_L^tmp > L − 1, proceed to Step 3, else go back to Step 1.
3: Set a_L = a_L^tmp.
4: if Σ_{j=1}^{L} a_j = N − 1 then
5:    Set a_j = 0 for j = L + 1, . . . , N and finish the algorithm.
6: else
7:    Increment L by 1 and go to Step 1.
8: end if

Once the condition in the fourth step of the algorithm is reached, no more members of the generated Code can be set to a non-zero value. This is because the particular entries of a Code are only valued with 0 or 2 and thus the partial sums form an increasing sequence. The condition in step 4 is reached only once during the whole generating process. As soon as we find an a_L that rounds off the partial sum to N − 1, we simply set the remaining members to zero and finalize the procedure.
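A direct Python transcription of algorithm 1 (our sketch; the fulfils_level_property helper from the earlier sketch is reused only to verify the output) could look as follows.

```python
import random

# Sketch of algorithm 1: sample a random Code of odd length N.
def random_code(N, p=0.5):
    assert N % 2 == 1, "Codes have odd length"
    code, partial = [], 0
    for L in range(1, N + 1):
        while True:
            candidate = 2 if random.random() < p else 0
            if partial + candidate > L - 1:     # level-property check at position L
                break
        code.append(candidate)
        partial += candidate
        if partial == N - 1:                    # no further non-zero members allowed
            code.extend([0] * (N - L))
            break
    return code

sample = random_code(13)
print(sample, fulfils_level_property(sample))   # e.g. [2, 2, 0, 2, ...] True
```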
Example 2.2. In this example we only lay out a list of ten randomly generated Codes. The lengths of the Codes vary from 11 to 27. Regarding the distribution of the zeros and non-zeros, we can see Codes with highly alternating 0s and 2s (the second one and the fourth one) as well as the opposite, a Code with totally polarized 0s and 2s (the eighth one).
(2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0),
(2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0),
(2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 0, 0),
(2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0),
(2, 0, 2, 2, 0, 2, 2, 2, 2, 0, 2, 0, 2, 2, 0, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0),
(2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0),
(2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
(2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0),
(2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0),
(2, 0, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0).

2.3 Instruction-parameters codes

After unveiling all of the needed details of Codes, we finally get to the point where we start looking at the particular instructions and their role in the representational mechanism. In this section we carry the instructions over from the PST scheme to the Codes structure.
This approach leads to the growing of nodes within a graph in a way very similar to the PST structures. The order of the instruction applications and their relation to the nodes is controlled via the Codes structure here. The final graph is reached as soon as all instructions have been applied to the graph structure.
In order to describe the representational scheme fully and present the decoding process clearly, we must specify its three main stages.


- The way the instructions grow the particular nodes. These properties will be discussed first. The application of a particular instruction is local and takes only one node into account. Thus, it is enough to give a general description of growing a node with some ancestors and some descendants for each considered instruction.
- The order of instruction application. Once we know how to grow each particular node with the appropriate instruction, we need to state the way the set of instructions is applied. At each step of the graph construction, the node being processed is assigned an instruction and grown according to it.
- The initial graph. The only remaining point needed is the initial structure. Mostly, the trivial topology is used; there is no restriction against starting with a more advanced graph either.

2.3.1 Instructions

Let us look at the instructions now. As we already said, any instruction is supposed to grow one individual node. Thus, we will study a node, labelled 1, which has some ancestors, say A1, A2, A3, and descendants, say D1, D2, with the corresponding incoming and outgoing edges.
Note that in the figures the node 1 is also tinted, which indicates that this node is active and the instruction is applied on this particular node. The order of nodes assigned to the instruction processing is given by a nodes stack; that is, the node 1 has been popped from this nodes stack.

Building instructions

For each instruction we will pass through an illustrative application on the node 1 and then we will formulate the general rule that describes the application of the particular instruction for an arbitrary node with a general set of ancestors and descendants.
We illustrate the serializing instruction S first. This instruction is graphically sketched in the following figure.


Figure 2.4: Application of the S instruction on the node 1.


The S instruction is interpreted as splitting the node serial-wise, i.e. the node 1 is rewritten with two succeeding nodes, labelled 1' and 1''. The node 1' becomes an ancestor of 1''. That means all the edges from the ancestors A1, A2, A3 are connected to 1', then one (new) edge is connected from 1' to 1'', and finally all edges go from 1'' to the descendants D1, D2. Both nodes 1' and 1'' are pushed onto the nodes stack. The following rule describes the general way of the application of the S instruction.
Rule 2.3 (Application of the S instruction). For an active node j ∈ V in a graph G = (V, E) with ancestors A_G(j) and descendants D_G(j), the S instruction causes the following actions. Add a new node: V(G) = V(G) ∪ {j_new}, and set up the edges so that the nodes from D_G(j) become descendants of the new node and the new node becomes the (only) descendant of j: E(G) = (E(G) ∪ {(j, j_new)} ∪ {(j_new, l) | l ∈ D_G(j)}) \ {(j, l) | l ∈ D_G(j)}. Both nodes are stored in the nodes stack.
The way we find the particular value of j_new may be quite arbitrary. We only need to assert that the index j_new ∉ V. In the NNSU implementation we set j_new to the lowest unused integer:
j_new = min{n ∈ ℕ | n > l for all l ∈ V(G)}.   (2.6)
The second instruction we will illustrate is the parallelizing instruction P. Again, we look at the sketch of its action.
on the sketch of its action.


Figure 2.5: Application of the P instruction on the node 1.


The P instruction is interpreted as splitting the node parallel-wise, i.e. the node is rewritten into two equivalent nodes labelled 1' and 1'', while all the edges from the ancestors A1, A2, A3 are connected to both 1' and 1'', and all edges go from both 1' and 1'' to the descendants D1, D2. Both nodes 1' and 1'' are pushed onto the nodes stack. The following rule describes the general way of the application of the P instruction.
Rule 2.4 (Application of the P instruction). For a node j ∈ V in a graph G = (V, E) with ancestors A_G(j) and descendants D_G(j), the P instruction causes the following actions. Add a new node: V(G) = V(G) ∪ {j_new}, and set up the edges such that the new node has the same ancestors and descendants: E(G) = E(G) ∪ {(l, j_new) | l ∈ A_G(j)} ∪ {(j_new, l) | l ∈ D_G(j)}. Both nodes are stored in the nodes stack.
The value of j_new is set as in equation (2.6). The instructions S and P are called building instructions because they cause the growing of the particular node. The set of these instructions is denoted BI:
BI = {S, P}, or in a numeric notation BI = {1, 2}.   (2.7)
The numerical notation is tightly related to the implementation and might appear in examples produced by the NNSU application, where the output values are taken unformatted directly from the raw data reports.


Terminating instructions

So far, we have specified how to grow the nodes within a graph. The nodes of the graph being built cannot be split endlessly and the nodes stack cannot grow without bound. We need to specify instructions that use the active node popped from the stack and do not push any node back onto it. A terminating instruction forces the node to stop its growing and to undergo some additional self-adjusting actions. Then the processing moves to another node popped from the nodes stack.
Besides the trivial instruction causing the node to stay unchanged, we can find instructions to delete a node leaving its ancestors connected to its descendants, to delete some incoming edges, to delete some outgoing edges, etc. In addition, in the case of architectures, which are the final structures of our concern, the terminating instruction provides further information regarding the parameters and the final design of the particular block.
The basic terminating instruction is the E instruction. This instruction causes the active node popped from the nodes stack simply not to be split any further. The ancestors and descendants are kept and all edges are kept unchanged. The following rule describes the general way of the application of the E instruction.
Rule 2.5 (Application of the E instruction). For a node j ∈ V in a graph G = (V, E) with ancestors A_G(j) and descendants D_G(j), the E instruction causes no actions. No node is stored back to the nodes stack.
The last instruction is the deleting instruction D. An illustrative application is seen in the following figure.


Figure 2.6: Node 1 with ancestors A1, A2, A3 and descendants D1, D2 is deleted according to the D instruction.
The instruction D is interpreted as removing the node while keeping the connections. This means the node 1 is deleted and edges are connected from all ancestors to all descendants. The following rule describes the general way of the application of the D instruction. Again, no nodes are pushed back onto the nodes stack.
Rule 2.6 (Application of the D instruction). For a node j ∈ V in a graph G = (V, E) with ancestors A_G(j) and descendants D_G(j), the D instruction causes the following actions. Remove the node: V(G) = V(G) \ {j}, and set up the edges such that all the ancestors are connected to all descendants: E(G) = (E(G) ∪ {(l, k) | l ∈ A_G(j), k ∈ D_G(j)}) \ ({(l, j) | l ∈ A_G(j)} ∪ {(j, l) | l ∈ D_G(j)}). No node is stored back to the nodes stack.
The shown instructions are called terminating instructions because they cause the node to stop its growing. The set of these instructions is denoted TI:
TI = {E, D}, or in a numeric notation TI = {11, 16}.   (2.8)


There is one point we should note about the application of the D instruction. Technically, the deleting action on cells inverts, in some aspects, the actions of the building instructions. We might imagine that adding an S instruction followed by one D instruction and one E instruction leads to an IPCode decoding to the same architecture. This property is called genotypic multiplicity and it is discussed in the next chapter.
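The four rules translate into a few lines of code. The sketch below (our illustration with hypothetical helper names; the real NNSU data structures differ) applies the instructions to a graph stored as a set of nodes V and a set of directed edges E.

```python
# Illustration of rules 2.3-2.6; V is a set of nodes, E a set of directed edges (pairs).
def ancestors(E, j):    return {i for (i, k) in E if k == j}
def descendants(E, j):  return {k for (i, k) in E if i == j}

def new_node(V):
    # lowest unused integer label, in the spirit of equation (2.6)
    return max([n for n in V if isinstance(n, int)], default=0) + 1

def apply_S(V, E, j):                      # rule 2.3: serial split
    jn, desc = new_node(V), descendants(E, j)
    V.add(jn)
    E -= {(j, l) for l in desc}
    E |= {(j, jn)} | {(jn, l) for l in desc}
    return [j, jn]                         # both nodes go back onto the stack

def apply_P(V, E, j):                      # rule 2.4: parallel split
    jn = new_node(V)
    V.add(jn)
    E |= {(a, jn) for a in ancestors(E, j)} | {(jn, d) for d in descendants(E, j)}
    return [j, jn]

def apply_E(V, E, j):                      # rule 2.5: terminate, no change
    return []

def apply_D(V, E, j):                      # rule 2.6: delete the node, keep connectivity
    anc, desc = ancestors(E, j), descendants(E, j)
    V.discard(j)
    E -= {(a, j) for a in anc} | {(j, d) for d in desc}
    E |= {(a, d) for a in anc for d in desc}
    return []

V, E = {"IN", 1, "OUT"}, {("IN", 1), (1, "OUT")}   # the initial graph G_INIT
stack = apply_S(V, E, 1)
print(sorted(E, key=str))                  # [('IN', 1), (1, 2), (2, 'OUT')]
```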

2.3.2 The order of instructions and the initial graph

The order in which the instructions are applied is coded in the Code structure and it is equivalent to a walk-around of the binary planted plane tree. The building instructions are assigned to the inner nodes of the tree and the terminating instructions are assigned to the leaves of the tree.
Having specified the set of particular instructions, we are able to encode the order in which they are applied. Before we set out the IPCode definition, which combines these two structures, we introduce a graph on which the series of instructions is applied. In the NNSU tool, we use the trivial graph G_INIT = ({IN, 1, OUT}, {(IN, 1), (1, OUT)}), with node 1 being set as active. The graph is depicted in figure 2.7. We can see in the figure that for whatever series of instructions, in arbitrary order, we always obtain an s,t-graph as a result.¹
On the other hand, if we required, in some specific case, to represent only a specific subpart of a fixed graph, we would accordingly define an initial graph with the only starting node set as active.

Figure 2.7: Initial graph G_INIT.

Now we are ready to put the Codes and the instructions together and define the representation. In the same way as we distinguished topologies and architectures, we firstly define an encoding that decodes into a topology and then we define an encoding that represents an architecture. We extend the Codes structure by adding the building and terminating instructions. Each position within a Code determines what type of instruction might be assigned to it.
Definition 2.5 (The ICode). Let $N \in \{2k-1 \mid k \in \mathbb{N}\}$, $\{a_j\}_{j=1}^{N} \in \mathrm{Codes}(N)$. The ICode P of length $N$ is defined as
$$
\mathrm{P} = \{(a_j, \beta_j)\}_{j=1}^{N},
\tag{2.9}
$$
where $\forall j \in \{1, \dots, N\}$: $a_j = 2 \Rightarrow \beta_j \in \mathrm{BI}$ and $a_j = 0 \Rightarrow \beta_j \in \mathrm{TI}$. The set of all ICodes of length $N$ is denoted $\mathrm{ICodes}(N)$. The set of all ICodes and the set of all ICodes with maximum length MAXN are defined respectively as
$$
\mathrm{ICodes} \stackrel{\mathrm{def}}{=} \bigcup_{\substack{N = 2k-1 \\ k \in \mathbb{N}}} \mathrm{ICodes}(N)
\qquad \text{and} \qquad
\mathrm{ICodes}_{\mathrm{MAXN}} \stackrel{\mathrm{def}}{=} \bigcup_{N \le \mathrm{MAXN}} \mathrm{ICodes}(N).
\tag{2.10}
$$

The ICode is a two-variate series; the first part carries the structural information and the second part comprises the information about all the instructions. Thanks to the first part, the various terms and properties of Codes stay valid even for ICodes, mainly the subcoding (sub-ICodes are written as SubICodes) and also the recombination: the ⊕_k operator applies intuitively to ICodes.
¹The trivial case of G = ({IN, OUT}, {(IN, OUT)}) might appear after a too high number of D instructions is used. This is also accepted as a topology since such graphs can arise as results of the recombination operators.


Example 2.3. A simple example of an ICode would look as follows.
P = ((2, 2), (2, 1), (0, 11), (2, 2), (2, 2), (0, 11), (0, 11), (2, 2), (0, 11), (2, 1), (0, 11), (0, 11), (0, 11))   (2.11)
The length of this code is |P| = 13. The SubICode on the 7th position, sub(P, 7) = ((0, 11)), is an ICode of length 1; the SubICode on the 4th position, sub(P, 4) = ((2, 2), (2, 2), (0, 11), (0, 11), (2, 2), (0, 11), (2, 1), (0, 11), (0, 11)), is an ICode of length 9.
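The decoding of an ICode is a pre-order walk over the Code driven by a stack of active nodes. The following Python sketch (our illustration; it reuses the apply_* helpers from the sketch in the instructions subsection and the numeric codes 1 = S, 2 = P, 11 = E, 16 = D) outlines the main loop.

```python
# Illustration: decode an ICode into a graph using the apply_* helpers sketched earlier.
APPLY = {1: apply_S, 2: apply_P, 11: apply_E, 16: apply_D}

def decode_icode(icode):
    V, E = {"IN", 1, "OUT"}, {("IN", 1), (1, "OUT")}   # initial graph G_INIT
    stack = [1]                                         # node 1 is active
    for a_j, instr in icode:                            # a_j is implied by the instruction type
        j = stack.pop()                                 # next active node
        pushed = APPLY[instr](V, E, j)                  # grow or terminate it
        stack.extend(reversed(pushed))                  # process the first new node next
    return V, E

V, E = decode_icode([(2, 2), (0, 11), (2, 1), (0, 11), (0, 11)])   # P, E, S, E, E
print(sorted(E, key=str))
# [('IN', 1), ('IN', 2), (1, 'OUT'), (2, 3), (3, 'OUT')]
```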
Random generating of ICodes

The process of random generating of ICodes extends the procedure for Codes by the generation of the instructions. We must take special care of the D instruction: the deleting D instructions may appear only to such an extent that the resulting topology still contains the intended number of blocks.
Let us suppose that the desired number of blocks is M and that this number is drawn at random from the interval given by the minimum block count MINBC and the maximum block count MAXBC. The deleting instruction D reduces the resulting number of blocks by the number of its occurrences within the ICode. As we primarily need to control the resulting size of the topology, i.e. the number of blocks within the topology, we prepare the number of deleted blocks MD as a random number from 0, . . . , MAXBC, add this number to the random number of non-deleted blocks M, and obtain the needed length of an ICode that contains MD deleting instructions and leads to a topology on M blocks. We simply calculate the length of the underlying ICode as 2(M + MD) − 1.
As the random generating proceeds, we must assert that the ICode will contain exactly MD deleting instructions D. To do this, we create a copy of the instructions vector of the ICode, {delFlag_j}_{j=1}^{M+MD}, and we prepare the positions of the D instructions in it.
The random generating of the delFlag arrays should produce uniformly placed D instructions. The array is of length M + MD and we need to set MD flags to D. We utilize the simple algorithm 2 for this purpose.
Algorithm 2 Construction of the delFlag array
1: Set delFlag_j to E for all j ∈ {1, . . . , M + MD}
2: Set Mdone = 0
3: repeat
4:    Generate a position Mtmp from 1, . . . , M + MD at random.
5:    if delFlag_Mtmp is already set to D then
6:       Return to Step 4.
7:    else
8:       Set delFlag_Mtmp to D.
9:       Set Mdone = Mdone + 1
10:   end if
11: until Mdone = MD
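In Python, the rejection loop of algorithm 2 can be replaced by sampling the MD positions without replacement; the following sketch (ours, using the numeric codes 11 = E and 16 = D) is an equivalent shortcut rather than a literal transcription.

```python
import random

# Sketch: place M_D deleting instructions uniformly among M + M_D terminating slots.
def make_del_flags(M, M_D):
    flags = [11] * (M + M_D)                        # 11 = E everywhere to start with
    for pos in random.sample(range(M + M_D), M_D):  # M_D distinct positions, no rejection loop
        flags[pos] = 16                             # 16 = D
    return flags

print(make_del_flags(4, 2))   # e.g. [11, 16, 11, 11, 16, 11]
```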
After the array delFlag is constructed, we may go through the main random sampling algorithm 3. The ICode resulting from this algorithm is of length 2(M + MD) − 1 and it decodes into a topology consisting of M blocks.
Example 2.4. Let us suppose we want to generate an ICode that decodes into a topology with block count 2, and we want the ICode to contain one D instruction. We construct an ICode of length 2(2 + 1) − 1 = 5, set the delFlag array to (11, 16, 11) and follow the general decoding procedure. The ICode with the underlying delFlag array is shown in table 2.4.


Algorithm 3 Random sampling of ICodes
1: Set mL = 1, dFL = 1.
2: Generate a value a_mL^tmp as a binomial variable from {0, 2} at random.
3: if Σ_{j=1}^{mL−1} a_j + a_mL^tmp > mL − 1 then
4:    Proceed to Step 8.
5: else
6:    Go back to Step 2.
7: end if
8: Set a_mL = a_mL^tmp.
9: if a_mL = 2 then
10:   Generate the value β_mL as a binomial variable from {1, 2} at random.
11: else
12:   Set β_mL = delFlag_dFL, and increment the delFlag counter by 1, dFL = dFL + 1.
13: end if
14: if Σ_{j=1}^{mL} a_j = N − 1 then
15:   for j = mL + 1, . . . , N do
16:      Set a_j = 0
17:   end for
18:   for k = dFL, . . . , M + MD do
19:      Set β_{mL+1+(k−dFL)} = delFlag_k
20:   end for
21:   Finish the algorithm.
22: else
23:   Increment mL by 1 and go to Step 2.
24: end if
delFlag              11                 16         11
ICode     (2, 2)     (0, 11)    (2, 1)  (0, 16)    (0, 11)

Table 2.4: Sample delFlag settings.

The randomly generated ICodes originating from real runs of the NNSU tool might look as
follows.
((2, 2), (0, 11), (2, 1), (2, 1), (0, 16), (0, 16), (2, 1), (0, 16), (2, 2), (2, 2),
(0, 11), (2, 1), (0, 11), (0, 16), (2, 1), (2, 1), (0, 16), (0, 11), (0, 16)),
((2, 2), (0, 16), (2, 2), (0, 11), (2, 2), (0, 16), (2, 1), (0, 16), (2, 2), (0, 11),
(2, 2), (0, 16), (0, 16)),
((2, 2), (0, 11), (2, 1), (0, 16), (2, 2), (0, 11), (2, 1), (0, 16), (2, 1), (0, 16),
(2, 2), (2, 2), (0, 16), (2, 2), (0, 16), (0, 16), (0, 16)),
((2, 2), (0, 11), (2, 1), (0, 11), (2, 2), (0, 16), (2, 2), (0, 16), (2, 2), (0, 16),
(0, 16)),
((2, 2), (0, 16), (2, 2), (2, 2), (0, 16), (2, 1), (2, 2), (2, 1), (2, 1), (0, 11), (2, 2),
(0, 16), (2, 1), (2, 1), (0, 16), (0, 16), (0, 16), (2, 2), (0, 11),
(0, 11), (2, 1), (2, 1), (0, 16), (0, 16), (0, 16), (0, 16), (0, 16)),


((2, 1), (0, 16), (2, 2), (0, 11), (2, 2), (0, 16), (2, 2), (0, 16), (2, 2), (2, 2),
(0, 16), (2, 2), (2, 2), (0, 16), (0, 11), (0, 16), (0, 11)),
((2, 2), (2, 2), (2, 1), (0, 16), (0, 11), (2, 1), (0, 16), (2, 1), (2, 2), (2, 1),
(2, 2), (2, 2), (2, 1), (0, 16), (2, 2), (2, 1), (2, 2), (0, 11), (0, 16),
(0, 16), (0, 11), (0, 11), (0, 16), (0, 11), (0, 16), (0, 16), (0, 11)),
((2, 2), (0, 16), (2, 2), (2, 2), (2, 2), (0, 11), (0, 11), (0, 11), (2, 1), (0, 11),
(2, 1), (0, 16), (2, 2), (2, 2), (0, 16), (0, 16), (2, 2), (2, 1), (0, 16),
(0, 11), (0, 16)),
((2, 2), (2, 2), (2, 1), (2, 1), (2, 1), (2, 2), (2, 2), (0, 11), (0, 16), (0, 11),
(0, 16), (0, 16), (0, 16), (0, 11), (0, 11)),
((2, 2), (2, 1), (2, 1), (0, 16), (2, 1), (2, 2), (0, 16), (0, 11), (2, 1), (2, 2),
(0, 16), (0, 16), (0, 16), (0, 16), (0, 11)).
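Each of these ICodes decodes into a topology whose block count can be read off directly: every E entry (numeric code 11) contributes one block, while every D entry (16) removes its node again. A tiny sketch of this count (ours):

```python
# Sketch: the number of blocks an ICode decodes into equals the number of E entries (11).
def block_count(icode):
    return sum(1 for a, instr in icode if instr == 11)

print(block_count([(2, 2), (0, 11), (2, 1), (0, 16), (0, 11)]))   # 2 blocks, one node deleted
```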
This finishes the description of the topology coding. Naturally, we lean towards extending this methodology to architecture coding; we will work this out in the following section.

2.3.3 Instruction-parameter codes

The final step to round off the definition of the representation is to extend the ICode structure with information coding the particular parameters of the blocks. For each block we will use descriptions given as members of the parameters set PS.
Definition 2.6 (The IPCode). Let $N \in \{2k-1 \mid k \in \mathbb{N}\}$. The IPCode P of length $N$ is defined as
$$
\mathrm{P} = \{(a_j, \beta_j, d_j)\}_{j=1}^{N},
\tag{2.12}
$$
where $\{(a_j, \beta_j)\}_{j=1}^{N}$ form an ICode and $\forall j \in \{1, \dots, N\}$: $d_j \in \mathrm{PS}$. The set of all IPCodes of length $N$ is denoted $\mathrm{IPCodes}(N)$. The set of all IPCodes and the set of all IPCodes of maximum length MAXN are respectively defined as
$$
\mathrm{IPCodes} \stackrel{\mathrm{def}}{=} \bigcup_{\substack{N = 2k-1 \\ k \in \mathbb{N}}} \mathrm{IPCodes}(N)
\tag{2.13}
$$
and
$$
\mathrm{IPCodes}_{\mathrm{MAXN}} \stackrel{\mathrm{def}}{=} \bigcup_{N \le \mathrm{MAXN}} \mathrm{IPCodes}(N).
\tag{2.14}
$$

The d_j's may represent any general form of parameters of the final block within an architecture, and thus PS in the previous definition describes a general set of these parameters. Also, the d_j's assigned to all instructions except E don't carry any information; thus, PS should always contain a zero vector. In our particular case the PS set contains parameters that describe the number of clusters in the neurons of a particular block of a NNSU. Vectors of integers suffice for this purpose, as the CHAINs may possess different lengths and internal structure.
We will denote the minimum and maximum number of neurons per block (NSUC) as MINNSUC, MAXNSUC, and the minimum and maximum number of used clusters throughout all neurons (ClC) as MINClC, MAXClC. The set PS is a set of vectors of length between MINNSUC and MAXNSUC, taking values in {MINClC, . . . , MAXClC}:
PS = {0} ∪ {MINClC, . . . , MAXClC}^(MAXNSUC − MINNSUC).   (2.15)
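A block description d_j is therefore just a short integer vector; a minimal sketch of sampling one (ours, assuming uniform choices and drawing the vector length between the two NSUC bounds as the surrounding text describes):

```python
import random

# Sketch: sample a block description d_j from PS.
def sample_description(MIN_NSUC, MAX_NSUC, MIN_ClC, MAX_ClC):
    length = random.randint(MIN_NSUC, MAX_NSUC)                         # neurons in the block
    return [random.randint(MIN_ClC, MAX_ClC) for _ in range(length)]    # clusters per neuron

print(sample_description(2, 6, 1, 10))   # e.g. [3, 7, 8, 2]
```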

2.3.4 Random generating of IPCodes

The IPCodes add one more dimension to each fragment of the ICode series. For each terminating instruction we must generate the descriptions of the particular block. This means that when filling in the descriptions vector d_j (cf. step 12 in algorithm 3) we simply sample some vector of descriptions from the set PS at random. In the second situation (cf. step 19 in algorithm 3) no values of d_j are needed because of the deleting D instruction. The whole process of IPCode random sampling is shown in algorithm 4.
Algorithm 4 Generic algorithm for random sampling of IPCodes
1: Generate MD between MINBlC and MAXBlC at random. Then generate M at random so that Mfinal = M − MD also lies between MINBlC and MAXBlC.
2: for j = 1 to Mfinal do
3:    Generate block descriptions d_j^tmp ∈ PS.
4: end for
5: Build the IPCode of length N = 2(Mfinal + MD) − 1 with the descriptions d_j^tmp.
6: Set mL = 1, dFL = 1.
7: Generate a value a_mL^tmp as a binomial variable from {0, 2} at random.
8: if Σ_{j=1}^{mL−1} a_j + a_mL^tmp > mL − 1 then
9:    Proceed to Step 13.
10: else
11:   Go back to Step 7.
12: end if
13: Set a_mL = a_mL^tmp.
14: if a_mL = 2 then
15:   Generate the value β_mL as a binomial variable from {1, 2} at random.
16: else
17:   Set β_mL = delFlag_dFL.
18:   Set d_mL = d_dFL^tmp, and increment the delFlag counter by 1, dFL = dFL + 1.
19: end if
20: if Σ_{j=1}^{mL} a_j = N − 1 then
21:   for j = mL + 1, . . . , N do
22:      Set a_j = 0
23:   end for
24:   for k = dFL, . . . , Mfinal + MD do
25:      Set β_{mL+1+(k−dFL)} = delFlag_k
26:      Set d_{mL+1+(k−dFL)} = 0
27:   end for
28:   Finish the algorithm.
29: else
30:   Increment mL by 1 and go to Step 7.
31: end if


Example 2.5. In the following, we list random samples of IPCodes as we use them within common work in the NNSU tool. The structures are becoming slightly complex, but hopefully still illustrative.
((2, 2, 0), (0, 11, (3, 1, 3, 4, 1, 10)), (2, 1, 0), (2, 1, 0), (0, 16, 0),
(0, 16, 0), (2, 1, 0), (0, 16, 0), (2, 2, 0), (2, 2, 0), (0, 11, (6, 0, 2, 3, 4, 5, 6, 2, 8, 3)),
(2, 1, 0), (0, 11, (5, 1, 2, 3, 4, 5, 3, 4, 5, 6)), (0, 16, 0), (2, 1, 0), (2, 1, 0), (0, 16, 0),
(0, 11, (4, 0, 2, 5, 6, 2, 8, 8)), (0, 16, 0)),
((2, 2, 0), (0, 16, 0), (2, 2, 0), (0, 11, (5, 1, 3, 4, 5, 6, 2, 4, 7)), (2, 2, 0),
(0, 16, 0), (2, 1, 0), (0, 16, 0), (2, 2, 0), (0, 11, (3, 0, 1, 6, 2, 8, 3)), (2, 2, 0),
(0, 16, 0), (0, 16, 0)),

((2, 2, 0), (0, 11, (6, 0, 2, 3, 4, 5, 6, 3, 4, 6, 7)), (2, 1, 0), (0, 16, 0),
(2, 2, 0), (0, 11, (5, 1, 2, 3, 4, 6, 1, 10)), (2, 1, 0), (0, 16, 0), (2, 1, 0), (0, 16, 0),
(2, 2, 0), (2, 2, 0), (0, 16, 0), (2, 2, 0), (0, 16, 0), (0, 16, 0), (0, 16, 0)),
((2, 2, 0), (0, 11, (3, 2, 4, 6, 2, 10, 9)), (2, 1, 0), (0, 11, (5, 0, 2, 3, 5, 6, 1, 2)),
(2, 2, 0), (0, 16, 0), (2, 2, 0), (0, 16, 0), (2, 2, 0), (0, 16, 0), (0, 16, 0)),
Before we pass to the short review of the architectures handled by IPCodes, we dedicate a short subsection to a description of the whole process of IPCode decoding.

2.3.5 Example: decoding an IPCode

Now we will pass through a simple IPCode decoding, a step-by-step construction of the architecture from an IPCode. Within this scope, we will use the literal form of the instructions for clearer orientation.
We use the following IPCode (cf. example 2.3): P = ((2, P, 0), (0, E, (2, 3, 5)), (2, S, 0), (0, E, (5, 3, 3)), (0, E, (3, 3, 2, 4))). The decoding starts with the initial graph G_INIT.
Step 1
State. The node 1 is active and the instruction element (2, P, 0) is assigned to it. In other words, the instruction P with no additional parameters is applied, as seen in figure 2.8.
Action. The instruction P causes the node 1 to split parallel-wise according to rule 2.4. The set V gets extended by a new entry, V = V ∪ {2}, and the set of edges E gets extended by two new entries, E = E ∪ {(IN, 2), (2, OUT)}. The active node is set to 1.

Figure 2.8: Decoding example: the initial graph.

Step 2
State. The node 1 was split into 1 and 2 via the instruction P. The node 1 is active and the instruction element (0, E, (2, 3, 5)) is assigned to it. The instruction E with the additional parameters (2, 3, 5) applies to node 1, as seen in figure 2.9.
Action. The instruction E causes the node 1 to stop its further growing and sets its internal parameters to (2, 3, 5). The active node is set to 2.

Figure 2.9: Decoding example: node 1 was split into 1 and 2 via the P instruction.

Step 3
State. The node 1 was set up via the instruction E and the parameters (2, 3, 5). The node 2 is active and the instruction element (2, S, 0) is assigned to it. That is, the instruction S with no additional parameters is applied to node 2, as in figure 2.10.
Action. The instruction S causes the node 2 to split serial-wise according to rule 2.3. The set V gets extended by a new entry, V = V ∪ {3}; the set of edges E gets extended by two new entries, E = E ∪ {(2, 3), (3, OUT)}, while one entry is removed, E = E \ {(2, OUT)}. The active node is set to 2.

Figure 2.10: Decoding example: node 1 was set up via the E instruction and the parameters (2, 3, 5).

Step 4
State. The node 2 was split into 2 and 3 via the instruction S. The node 2 is active and the instruction element (0, E, (5, 3, 3)) is assigned to it, i.e. the instruction E with the additional parameters (5, 3, 3) is applied to node 2, as seen in figure 2.11.
Action. The instruction E causes the node 2 to stop its further growing and sets its internal parameters to (5, 3, 3). The active node is set to 3.

Figure 2.11: Decoding example: node 2 was split into 2 and 3 via the S instruction.

Step 5
State. The node 2 was set up via the instruction E and the parameters (5, 3, 3). The node 3 is active and the instruction element (0, E, (3, 3, 2, 4)) is assigned to it, i.e. the instruction E with the additional parameters (3, 3, 2, 4) is applied to node 3, as seen in figure 2.12.
Action. The instruction E causes the node 3 to stop its further growing and sets its internal parameters to (3, 3, 2, 4). No further node is assigned as active.

Figure 2.12: Decoding example: node 3 was set up via the E instruction and the parameters (3, 3, 2, 4).

Step 6
State. The node 3 was set up via the instruction E and the parameters (3, 3, 2, 4). The last figure shows the final architecture. There are no more active nodes and no more instructions to be processed.

2.4 Implemented architectures

An architecture based on a specific IPCode is constructed according to the program symbol order coded within this IPCode. If we modify the order, allow for some specific instructions only, or apply whatever change to the IPCode, we directly influence the shape of the resulting architecture. We can conveniently incorporate various constraints to get control over the constructed architectures.
Once we decide to work only with a narrowed class of IPCodes, we may choose to introduce either structural or instructional restrictions, or a combination of these. The structural constraints influence only the order of the instructions applied, whereas the instructional rules impose restrictions on the particular types of instructions used. For example, if we permit only the use of the {P, D, E} instructions within an IPCode class, the resulting architectures are single-layered architectures with full edges; where D is applied, the edge goes directly between IN and OUT.
When testing and implementing the IPCode structures as architecture representations within the NNSU tool, we experimented with various types of such additional parametrizations of IPCodes, and of course with their structure. We carried out IPCode structure restrictions, limited their length, and examined various instruction combinations. Some of the architectures


we constructed turned out useful for data classification tasks, some of them we created and tested simply for comparative purposes. The three most important architecture types are mentioned within the rest of the section: generic network architectures, layered architectures, and architectures with variables selection.
- Generic architectures. Simply a direct incorporation of IPCodes with no additional constraints added.
- Layered architectures. A well-known type of architecture; in this case we wanted to prove whether it is representable with IPCodes.
- Architectures with variables selection. This architecture is an example of an intentionally designed architecture for special purposes.
We mentioned that the IPCodes can be further modified using additional rules and narrowed down to represent specific classes of architectures. From the representational point of view this is a very flexible feature. On the other hand, we still have to bear in mind that the complexity of the representation grows this way and we naturally lose one of the requirements with it. As the representations of layered architectures and also of the architectures with variables selection contain further rules that assert the correct form of the resulting architecture, these rules have to be incorporated in the recombination schemes of the evolutionary optimization.
This, unfortunately, might bring additional complexity and possibly unavoidable hurdles into the evolutionary optimization design. Particularly, it might make the recombination rules inapplicable, or applicable with incorrect results. For now, we can use the random IPCodes for both representational and recombination purposes, and the layered architectures and architectures with variables selection for representational purposes. Let us see the architectures and look over their main features.

2.4.1

G ENERIC

ARCHITECTURES

The class of generic architectures contains architectures that are constructed from IPCodes without any internal constraints. We denote these architectures with the abbreviation GAA. The term generic indicates that the class GAA is the set of all architectures representable via IPCodes. In accord with the notation introduced in the architectures section (1.1), we write GAA(N) for generic architectures with N blocks and GAA(MAX_N) for generic architectures with at most MAX_N blocks.
The architectures from GAA can be immediately incorporated into the process of evolutionary optimization. The subcoding and recombination abilities of IPCodes assert the validity of the IPCodes inner structure, and thus we keep working within the GAA class. We will see in the following subsections that the incorporation into the evolutionary scheme is more complex for the other classes of architectures, represented by IPCodes with additional inner constraints.

Figure 2.13: Sample generic architecture.

RANDOM SAMPLING OF GENERIC ARCHITECTURES

We work with general IPCodes, and thus they are sampled at random exactly as described in algorithm 4 in section 2.3.4. It depends only on the particular situation (limits on computational time, error tolerance, required generalization level) what size and what density of data flows we decide to construct the architectures with.
Samples of randomly generated NNSU instances are shown in figure 2.14. The number of blocks ranges from 5 to 20, the number of data flows is arbitrary, and the deleting effect is quite strong as we allowed for MAX_BC occurrences of the D instruction.
Figure 2.14: Samples of random architectures. We can see various types, including pure single-layered architectures. We mostly appreciate the multilayered architectures with interconnections between layers; these inter-layer connections are representable using the D instruction.
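Algorithm 4 itself is not reproduced here; the following Python sketch is written only in its spirit, as a minimal recursive sampler of generic IPCodes in which the instruction probabilities, the block-size limits, and all helper names are our own assumptions.

    import random

    def sample_generic_ipcode(max_blocks=10, p_branch=0.5, p_delete=0.2,
                              max_clusters=4, max_chain_len=4):
        """Sample a generic IPCode: a prefix-ordered list of (arity, instruction, parameter) triplets.

        S and P have arity 2 (two subcodes follow them), E and D have arity 0.
        max_blocks caps the number of terminal triplets and hence the number of blocks.
        """
        budget = [max_blocks]

        def random_description():
            # a block description d: cluster counts of the CHAIN neurons
            return tuple(random.randint(1, max_clusters)
                         for _ in range(random.randint(1, max_chain_len)))

        def grow():
            if budget[0] > 1 and random.random() < p_branch:
                budget[0] -= 1
                op = random.choice(['S', 'P'])           # serial or parallel split
                return [(2, op, 0)] + grow() + grow()
            if random.random() < p_delete:
                return [(0, 'D', 0)]                     # delete the current node
            return [(0, 'E', random_description())]      # terminate the node with a description

        return grow()

    ipcode = sample_generic_ipcode()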

2.4.2 LAYERED ARCHITECTURES

Layered architectures, denoted LAA, are a well-established type of architecture within the NN world; for instance, the most typical representative, the multilayered perceptron, is based on a layered architecture. The original motivation for the multilayered perceptron construction was to utilize the maximum achievable number of combinations of data flows that proceed between individual layers. For our purposes, within the NNSU scope, one of the reasons for testing this type of architecture was to prove whether IPCodes are able to represent this class of architectures, and also what specific powers the LAA class can produce.
For the design of layered architectures we compose the CHAIN blocks into layers. Let us assume we are constructing a layered architecture with K layers, K ∈ N, and for each layer j ∈ {1, . . . , K} we place M_j blocks (M_j ∈ N). Blocks in the same layer are supposed to simultaneously process data from all blocks of the preceding layer, i.e. the graph of the layered architecture is given by the set of nodes (grouped into clusters by their layer)

{IN, 1, . . . , M_1, M_1 + 1, . . . , M_1 + M_2, M_1 + M_2 + 1, . . . , \sum_{j=1}^{K} M_j, OUT},

together with the set of edges comprising a full bipartite graph between each two successive layers. The IN and OUT nodes stand for trivial layers; all layers between them are so-called hidden layers.

Figure 2.15: Sample layered architecture.
The edges of the full bipartite graphs between layers are organized in the following way.
Edges between the IN node and the first hidden layer connect the IN node to the nodes 1, . . . , M_1.
Edges between successive hidden layers: for all l = 1, . . . , K − 1 there goes an edge from every node in layer l, i.e. the nodes \sum_{j=1}^{l-1} M_j + 1, . . . , \sum_{j=1}^{l} M_j, to every node in layer l + 1, i.e. the nodes \sum_{j=1}^{l} M_j + 1, . . . , \sum_{j=1}^{l+1} M_j.
From every node in layer K, i.e. the nodes \sum_{j=1}^{K-1} M_j + 1, . . . , \sum_{j=1}^{K} M_j, there goes an edge to OUT.

A sample 3-layered architecture is shown in figure 2.15.
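To make the edge layout concrete, the following sketch (an illustration of the rules above, not part of the NNSU implementation) builds the node and edge sets of a layered architecture from the layer sizes M_1, . . . , M_K.

    def layered_graph(layer_sizes):
        """Build the node and edge sets of a layered architecture.

        layer_sizes -- [M_1, ..., M_K], the number of blocks per layer.
        Blocks are numbered 1..sum(M_j); 'IN' and 'OUT' are the trivial layers.
        """
        layers, first = [], 1
        for m in layer_sizes:           # layer l holds blocks sum(M_1..M_{l-1})+1 .. sum(M_1..M_l)
            layers.append(list(range(first, first + m)))
            first += m

        nodes = ['IN'] + [b for layer in layers for b in layer] + ['OUT']
        edges = [('IN', b) for b in layers[0]]                        # IN -> first hidden layer
        for l in range(len(layers) - 1):                              # full bipartite graph between layers
            edges += [(u, v) for u in layers[l] for v in layers[l + 1]]
        edges += [(b, 'OUT') for b in layers[-1]]                     # last hidden layer -> OUT
        return nodes, edges

    # example: a 3-layered architecture with (M_1, M_2, M_3) = (3, 3, 2) as in figure 2.15
    nodes, edges = layered_graph([3, 3, 2])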


IPCODES SETTINGS FOR LAYERED ARCHITECTURES

The IPCode decoding into a layered architecture intuitively reflects this structure: the code divides into parts that grow the required number of layers and parts that grow the individual layers.
We put further requirements on both the Code part and the instructions part of the IPCode to obtain an IPCode that decodes into a layered architecture. The following skeleton of an IPCode decodes into a K-layered architecture with M_l blocks in the l-th layer. The blocks of one layer are placed in parallel with the P instruction, and the required number of layers is obtained by applying the S instruction K − 1 times. The skeleton IPCode is listed in the following equation.

(2, S, 0)(2, P, 0)(0, E, d_1) . . . (2, P, 0)(0, E, d_{M_1 - 1})(0, E, d_{M_1})
(2, S, 0)(2, P, 0)(0, E, d_{M_1 + 1}) . . . (2, P, 0)(0, E, d_{M_1 + M_2 - 1})(0, E, d_{M_1 + M_2})
...                                                                                        (2.16)
(2, P, 0)(0, E, d_{(\sum_{j=1}^{K-1} M_j) + 1}) . . . (2, P, 0)(0, E, d_{(\sum_{j=1}^{K} M_j) - 1})(0, E, d_{\sum_{j=1}^{K} M_j})

This condition is quite restrictive since it fixes the layout of the Code part of an IPCode. We also generate the block descriptions d_1, . . . , d_{\sum_{j=1}^{K} M_j}.

SUBCODING OF LAYERED ARCHITECTURES IPCODES

The layered architectures decompose into layers and into individual blocks within each layer. In the same way the IPCode is constructed by building the first layer and all its blocks, the second layer and all of its blocks, and so on up to the last layer, the subcodes easily identify the first layer, the second layer, up to the last one. With a specific SubIPCode we can identify any single layer; we can also identify a layer together with all of its descendant layers.
On the other hand, we are not able to identify arbitrary groups of layers. Starting from the beginning of any IPCode for a layered architecture, any S instruction bears a SubIPCode that decodes into the layer of the same order (counted from the top) and all layers below it.
RANDOM SAMPLING OF LAYERED ARCHITECTURES

When working with layered architectures, we determine the minimum and maximum acceptable number of layers MIN_K, MAX_K, and the minimum and maximum acceptable number of blocks in each layer MIN_M, MAX_M. Then we generate an IPCode representing an architecture with MIN_K ≤ K ≤ MAX_K layers and with MIN_M ≤ M_l ≤ MAX_M blocks in the l-th layer. The algorithm for sampling the IPCodes according to the given MIN_K, MAX_K is shown as algorithm 5.
Algorithm 5 Generic algorithm for random sampling of IPCodes decoding into layered architectures
1: Generate K between MIN_K and MAX_K at random (e.g. uniformly).
2: for l = 1 to l = K do
3:   Generate M_l between MIN_M and MAX_M at random (e.g. uniformly).
4:   for j = 1 to j = M_l do
5:     Sample the block description d_{(\sum_{i=1}^{l-1} M_i) + j} ∈ PS at random.
6:   end for
7: end for
8: Compose the IPCode according to the template 2.16 with the descriptions d_j.

The architectures decoded from IPCodes sampled using algorithm 5 are valid layered NNSU instances. Illustrative examples of architectures built this way are shown in figure 2.16.
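A minimal Python sketch of algorithm 5 follows; the assumed block descriptions (tuples of cluster counts) and helper names are ours, and the skeleton 2.16 is emitted as one flat list of triplets.

    import random

    def sample_layered_ipcode(min_k, max_k, min_m, max_m,
                              max_clusters=4, max_chain_len=4):
        """Sample an IPCode following the skeleton 2.16, decoding into a layered architecture."""
        def random_description():
            return tuple(random.randint(1, max_clusters)
                         for _ in range(random.randint(1, max_chain_len)))

        k = random.randint(min_k, max_k)                      # step 1: number of layers
        code = []
        for l in range(k):
            m = random.randint(min_m, max_m)                  # step 3: number of blocks in layer l
            if l < k - 1:
                code.append((2, 'S', 0))                      # serialize this layer with the layers below
            descriptions = [random_description() for _ in range(m)]
            for d in descriptions[:-1]:                       # put the m blocks of the layer in parallel
                code += [(2, 'P', 0), (0, 'E', d)]
            code.append((0, 'E', descriptions[-1]))
        return code

    ipcode = sample_layered_ipcode(min_k=2, max_k=4, min_m=2, max_m=5)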


Figure 2.16: Samples of layered architectures.

2.4.3 ARCHITECTURES WITH VARIABLES SELECTION

Within the NNSU framework we worked mainly with the generic architectures and the architectures with variables selection. The main motivation for this was to test whether the genetic algorithm process could handle dimensions other than the graph structure and the inner block design.
We worked to extend the descriptions so that they described not only the structure of the CHAIN in each block. Our intention was to add to the existing data flow an additional block controlling which variables are filtered for further processing. These variable filters then undergo the evolutionary optimization together with the remaining parts of the IPCodes, which allows the GA to seek the best performing weight for the variables and, at the same time, significant combinations of these variables.
The variables selection (VS) introduces a special type of block used in the NNSU tool. This block consists of a single neuron which selects the specified variables of the input data and passes them to its output. This block is used at the start of the data flows, followed by standard blocks. Figure 2.17 shows an architecture with three variable selectors.
The topology of a VS architecture is constructed so that the data are separately processed in K parallel branches. Thus, it contains K blocks with a VS neuron at the beginning of each data flow, followed by separate groups of data processing blocks. We worked with two VS architectures depending on the complexity level of the whole architecture.
Simple VS architecture. Each group consists of just one data processing block (CHAIN). This is the computationally less expensive version. For example, it might be used during the phase of seeking the best performing combination of variables, as the rather small data processing parts do not spend much computational time and also do not tend to get overlearned.
Advanced VS architecture. Each data flow is processed with a subnetwork which, in general, represents a sovereign generic architecture. These subnetworks have independently set up architectures and also independently picked variables to process. This type of architecture, consisting of larger sub-architectures with the variable filters already set, is expected to perform with higher robustness on already analyzed data.

Figure 2.17: Architecture with variables selection.
IPCODES SETTINGS FOR ARCHITECTURES WITH VS

Again, we conveniently use the substructuring of IPCodes for the design of generic architectures with VS. Let us suppose we build an architecture consisting of K separate data flows with variable selectors F^k, k ∈ {1, . . . , K}. The VS block design additionally requires the dimensionality of the data used; we denote it DD. The value of DD is needed in order to correctly build the filter.
Each filter within a VS block specifies which variables will be output for further data processing. First, we identify the total number of variables filtered within each VS block, 1 ≤ L_k ≤ DD, k ∈ {1, . . . , K}. Then each filter is designed as a vector of distinct values F^k = (f^k_1, . . . , f^k_{L_k}), with f^k_j ∈ {1, . . . , DD} for j ∈ {1, . . . , L_k} and f^k_i ≠ f^k_j for i ≠ j.
After the filtering VS blocks are assigned their filters F^k, we fill in the remaining SubIPCodes representing the sub-architectures for the separate data flows. These SubIPCodes then serialize the VS block and the data processing part, either a single block or a random sub-architecture. The general template for an IPCode representing a VS architecture looks as follows.

(2, P, 0) , . . . , (2, P, 0)
initial splitting of the data flows

1
(2, S, 0) , 0, E, F , SubIPCode1
(2, S, 0) , 0, E, F 2 , SubIPCode2
..
(2.17)
.
particular VS blocks
..
.
and sub-architectures

k
(2, S, 0) , 0, E, F , SubIPCodek
For instance, the IPCode that decodes into the architecture shown in figure 2.17 is given as

(2, P, 0) , (2, P, 0) ,
(2, S, 0) , (0, E, (2, 3, 5, 7)) ,
(2, S, 0) , (0, E, (3, 2)) , (2, S, 0) , (2, P, 0) , (0, E, (3, 2, 3, 3)) ,
(0, E, (4, 4)) , (0, E, (3, 4, 2)) ,
(2, S, 0) , (0, E, (1, 2, 3, 4, 6, 7)) ,
(2, S, 0) , (0, E, (3, 3, 4)) , (2, S, 0) , (2, P, 0) , (0, E, (2, 3)) ,
(0, D, 0) , (0, E, (2, 3, 3)) ,
(2, S, 0) , (0, E, (3, 5)) ,
(0, E, (3, 4, 4)) .
SUBCODING OF ARCHITECTURES WITH VS

The construction of the IPCode for architectures with VS coheres with its substructures. Thus, we are able to identify SubIPCodes within the IPCode that correspond to any sub-architecture. Looking at the template for IPCodes decoding to architectures with VS, we see that choosing a position within SubIPCode_k we directly choose the corresponding SubIPCode; similarly, a position with any S instruction identifies a SubIPCode together with its preceding VS block.
The positions with P instructions cause the initial growing of the parallel branches. The structure of the IPCode is recursive, thus the VS block with filter F^1 is on the right side of the decoded graph while the last one, F^K, is on the left. The first P instruction is the first position of the whole IPCode; the corresponding SubIPCode is the whole IPCode and the sub-architecture is thus the whole architecture. The SubIPCode at the second P instruction excludes the last clause (2, S, 0), (0, E, F^K), SubIPCode_K and thus decodes into all branches (with their VS blocks) but the left-most one. In a similar way we would find that each further P position drops one more trailing clause, i.e. the SubIPCode at the k-th P position selects the first K − k + 1 branches.
In the same way as for layered architectures, we can identify SubIPCodes down to complete detail on blocks, but only certain subsets of branches, not arbitrary ones.
RANDOM SAMPLING OF ARCHITECTURES WITH VS

For the random sampling of the architectures with VS we specify the minimum MIN_K and maximum MAX_K between which we want to spread the number of parallel branches that filter and further process the data. We also need to specify the data dimensionality DD. Then we run algorithm 6, which composes the IPCode template 2.17 and samples the filters F^l and block descriptions d_l at random.
Algorithm 6 Generic algorithm for random sampling of IPCodes decoding into architectures with VS
1: Generate K between MIN_K and MAX_K at random (e.g. uniformly).
2: for l = 1 to l = K do
3:   Sample the filter F^l at random.
4:   if Simple VS architecture then
5:     Sample a trivial SubIPCode_l with description d_l at random.
6:   else
7:     Sample a generic IPCode SubIPCode_l at random.
8:   end if
9: end for
10: Compose the IPCode with the filters F^l and the SubIPCodes SubIPCode_l according to the template 2.17.


Figure 2.18: Architecture with variables selection. The rectangular VS blocks in the first layer create the individual data flows and filter the variables into them; the blocks in the parallel branches are prefixed B1 and B2. The dimension of the data is DD = 7. We can see that the first VS block filters only the second, third, fifth and seventh columns of the data, whereas the third VS block filters the fifth column only.

3
Properties of instruction-parameter codes
Some properties of IPCodes as representations were known a priori, as they were requirements when designing the IPCodes structure. Our intentions so far went rather along the representational line; in this chapter we start to involve the evolutionary aspects.
We constructed IPCodes as structures providing comfortable handling of architectures of neural networks and, in addition, we anticipated that they are flexible for mutual recombination, which should ease the incorporation of IPCodes into the genetic algorithm environment. We are now going to briefly review the features of IPCodes, covering not only their representational powers but also the appropriate recombination abilities, encoding and decoding complexity, and some others. All of these features will be listed in detail and, as a result, they will provide us with descriptions and information about IPCodes with respect to general criteria. Using the terms of common characteristics we gain better insight and easier comparison with other types of representations.
Throughout this chapter we study a set of properties that K. Balakrishnan and V. Honavar gathered and unified in their work (Balakrishnan, Honavar, 1995). This paper lists all relevant properties that characterise representations of architectures of neural networks used in the so-called evolutionary designed neural architectures (EDNA) framework. The authors summarize various qualities that representations of neural network architectures should or might possess if they are to be effectively applied in the field of massive evaluation tasks (such as random search, or any evolutionary optimization). Balakrishnan and Honavar try to encompass the most complete set of the properties, together with a discussion of these.
The first section contains the list of the properties by Balakrishnan and Honavar. We try to adjust the terms to the terminology of architectures and IPCodes where possible, and for each point we append a short comment about fulfilling the particular criterion. We also give a resume reviewing the specific impact on the representations used in the NNSU tool. In the second section, we summarize all obtained results and notes about the representations and we list a revision table that depicts the IPCodes as representations from the point of view of Balakrishnan and Honavar.

3.1 REPRESENTATIONAL PROPERTIES OVERVIEW

The properties of representations listed within the EDNA scope are described with specific terminology connected with neural networks and also with evolutionary strategies and representations. The connection of these two paradigms leads to definitions of terms such as encoding complexity, closure, compactness, and many others. The purpose of EDNA is mainly revisionary and comparative; in our case the aim is to find the properties of IPCodes and compare them with other encodings. Thus, we will try to extend the exact formulations onto the NNSU field where possible. We first list the basic EDNA terms: phenotype, genotype, environment, and learning procedure.

Figure 3.1: Generic sketch of the connection of phenotypes and genotypes within an evolutionary optimization. These two structures are tied together by the evolutionary process, which decodes each genotype into a phenotype, assigns a quality measure to it, and then recombines the underlying genotypes so that a new genotype of possibly higher quality is built. (The figure is taken from (Balakrishnan, Honavar, 1995) and adjusted so that it contains NNSU related terms and objects.)
PHENOTYPE
A phenotype denotes a general entity that is subject to optimization with respect to some quality function. A phenotype might be a person or a company looking for a tax optimization, or centrally controlled city traffic lights requiring a switching setup that minimizes traffic jams, or any other situation that requires adjusting its parts so that a measurable profit increases.
In our case the phenotypes are acyclic architectures of neural networks with switching units used within the scope of the NNSU tool. We already denoted the set of architectures AA; this set, and also the set of architectures with a limited number of blocks AA(N), will be our particular sets of phenotypes.
GENOTYPE
A genotype is a modeling structure for a phenotype or, equivalently, its representation. Mostly, the representation is also called an encoding, meaning that the representation emerges by some kind of process coding the original phenotype. We can think of multidimensional vectors describing particular tax flows, labelled graphs representing the city traffic plan, or generally various strings over (rather small) alphabets.
The space of genotypes representing the architectures is the set of IPCodes. Generally, IPCodes are series of integer triplets with some additional inner constraints. The alphabets for all three positions in the triplets are quite small: for the first position we have {0, 2} for the ICode part, {S, P, E, D} for the instructions part, and {1, . . . , MAX_ClC}^{MAX_NSUC} for the clusters-per-block setups.
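As a small concrete illustration of this alphabet (our own sketch, not a data structure of the NNSU implementation), one element of an IPCode can be modelled as a triplet of an arity flag, an instruction symbol, and a parameter:

    from dataclasses import dataclass
    from typing import Tuple, Union

    Description = Tuple[int, ...]            # cluster counts per block, each in 1..MAX_ClC

    @dataclass(frozen=True)
    class IPCodeTriplet:
        arity: int                           # 0 for the terminals E and D, 2 for S and P
        instruction: str                     # one of 'S', 'P', 'E', 'D'
        parameter: Union[int, Description]   # a block description d for E, 0 otherwise

    # the single-block IPCode (0, E, d) with d = (2, 3)
    single_block = [IPCodeTriplet(0, 'E', (2, 3))]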
ENVIRONMENT, LEARNING PROCEDURE
The phenotypes, with their behaviour and performance measured according to a given quality function, are under the influence of some environment E they are put in. This influence might stretch from indirect to direct, and might also stand for a complex composition of multiple factors or simply a single real parameter. In the case of the NNSU tool, this environment is represented by the learning data and the learning procedure which sets up the internal parameters of the neural networks with switching units.
The learning procedure L sets up the internal parameters of the NNSU instances, taking into account the conditions of the environment E. In the NNSU case the learning procedure is a regression model combined with cluster analysis. The clustering procedure splits the learning data into disjoint subsets, followed by regression applied on these disjoint sets.
ACCEPTABLE ARCHITECTURE, SOLUTIONS SET
The next term discussed by Balakrishnan and Honavar is the acceptable architecture. An acceptable architecture is simply a valid architecture in terms of the definition of architecture in 1.2. We know that the architectures for NNSU networks are required to be single source single stock acyclic digraphs, with acceptable values of the parameters for each block. The set of acceptable architectures is thus the set AA or, if we decide to limit the maximum number of blocks by MAX_N, the set AA(MAX_N).
The last term concerns the solutions set, i.e. the subsets of AA that contain the top quality members, the so-called solution architectures. The expectation of the evolutionary optimization of NNSU architectures is to find the best performing NNSU among the achievable ones. In this regard, we assume there exists some solution set within AA (or AA(MAX_N)); still, the solution architectures are unknown even in the case of the sample data used, and we are not able to identify them explicitly unless we perform an exhaustive search of AA (AA(MAX_N)).
Now we move to the properties themselves. There will be nine properties reviewing the IPCodes and their representational powers from different points of view. Some of the properties also involve evolutionary optimization aspects that have not been discussed yet, but the results should be intuitive and clear.

3.1.1 COMPLETENESS

Completeness reflects the primary concern we might have when building a new representation. We naturally need to check whether the representation is able to encompass all of the phenotypes, or at least an acceptably large subset. The following formulation suitably fits our terminology, and without any significant modifications we can directly move to the discussion of whether the IPCodes are or are not a complete representation.
Definition 3.1 (Completeness). A representation is complete if every neural architecture in the
solution set can be constructed (in principle) in the system.
Technically, in our case the solution set is a subset of AA, and if we can prove that any architecture H ∈ AA can be constructed, in principle, using the IPCodes framework, we are permitted to claim that the IPCodes are a complete representation. The question is whether the set of instructions {S, P, E, D}, used within the scope of the NNSU tool, is able to produce any achievable architecture.
Gruau argues in his work that the cellular encoding with the full set of instructions, i.e. {S, P, E, D} plus W for WAIT and R for RECURSION, is complete. We are not utilizing the W and R instructions and thus we cannot claim directly that this restricted cellular encoding is also complete.
If we narrow down to the class of generic architectures GAA or layered architectures LAA, as defined in subsections 2.4.1 and 2.4.2, we can say IPCodes are complete with respect to generic or layered architectures. For both types we provided a general template for an IPCode that decodes into the particular architecture, either generic or layered (with an arbitrary number of layers and number of blocks within each layer).
On the other hand, considering the whole AA set, we cannot give a convincing proof of the universal representational abilities of IPCodes, and thus we regard the IPCodes as incomplete with respect to the AA set.
Resume: We cannot hold IPCodes as a complete representation with respect to the whole AA set, as we utilize a narrowed instruction set. Nevertheless, for the special classes of architectures GAA ⊂ AA and LAA ⊂ AA, the IPCodes proved complete.

3.1.2 CLOSURE

Definition 3.2 (Closure). A representation is completely closed if every genotype decodes to an architecture.
The closure property should rather be called representational closure since it speaks of the closure of the decoding process as a mapping. Should we produce any valid IPCode, we wonder whether it decodes into an architecture or into some invalid structure.
The set IPCodes is completely closed since each P ∈ IPCodes decodes to some architecture. None of the instructions {S, P, E, D} breaks the properties that build the initial graph into the final architecture, and neither does any combination of them. Thus, no repetitive calls of these instructions can take the quality of being an architecture away from the initial graph during the building.
In fact, it is the closure together with the subcoding of the IPCodes that makes their use so advantageous. We can moreover say that any recombination of two IPCodes is an IPCode again, and any IPCode decodes into a valid architecture. This is the mechanism enabling the definition of the recombination schemes, crossover and mutation. This type of closure, with respect to recombination mappings, is called recombinational closure.
Resume: IPCodes are representationally closed as well as recombinationally closed; both of these features are deeply utilized throughout the evolutionary environment implementation within the NNSU tool.

3.1.3 COMPACTNESS

Definition 3.3 (Compactness). Suppose two IPCodes P1 and P2 , both decode to the same architecture H, then P1 is said to be more compact than P2 if P1 occupies less space than P2 , i.e. if
|P1 | < |P2 |.
At first sight, it might seem that there is no compactness provided in the representation via IPCodes. Note that using the deleting instruction D we can easily construct two different IPCodes P1 ≠ P2 decoding to the same architecture. The difference lies in redundant combinations of building and deleting instructions: both the IPCodes P1 = (0, E, d) and P2 = ((2, S, 0), (0, E, d), (0, D, 0)) decode into the same block with configuration d. The IPCode P1 is more compact than P2 as |P1| < |P2|. The way these IPCodes decode into the single-block architecture is shown in figure 3.2.
Any two IPCodes that differ in a similar way as P1 and P2 will decode into the same architecture. We can also find more complex IPCodes that decode into the same architecture, though they do not equal each other. Choosing the IPCode with minimal size from all those decoding to the same architecture, we get the most compact representation. We will later speak of classes of equivalence for IPCodes that represent the same architecture as of representationally equivalent IPCodes.
The redundancy of IPCodes, caused by different IPCodes decoding into a single architecture, does not necessarily lead to a lack of effectiveness of the representation. It rather enriches the variance of the representations set. If we come to prefer the compact solutions against the less compact ones, we may achieve this by adding an evaluation factor that reflects the size of the representation into the fitness function, and thus gain an automatic search for the most compact solutions. A quality function modified in this way is usually called a fitness function with parsimony.
Resume: The compactness works well on IPCodes; moreover, incorporating the parsimony, the search might be adjusted in order to prefer the more compact solutions.
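A minimal sketch of such a parsimony term follows; the penalty weight and the function names are assumptions of ours, not the fitness actually used in the NNSU tool (that one is defined in chapter 4).

    def fitness_with_parsimony(raw_fitness, ipcode, alpha=0.01):
        """Discount the raw fitness by a penalty that grows with the IPCode length.

        raw_fitness -- quality of the decoded, trained NNSU (higher is better)
        ipcode      -- the genotype, a sequence of instruction-parameter triplets
        alpha       -- assumed weight of the parsimony pressure
        """
        return raw_fitness / (1.0 + alpha * len(ipcode))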


Figure 3.2: Two ways to construct the same simple architecture. The first row depicts the decoding of the IPCode ((2, S, 0), (0, E, d), (0, D, 0)) with a redundant deleting instruction; the second row shows the shortest way to achieve the same resulting architecture using the IPCode (0, E, d).

3.1.4 SCALABILITY

The term scalability is not as easy to formulate as the previous ones. The main goal is to describe the changes in the representation size, together with the decoding time, depending on the changes of size of the phenotype. If we consider a change in the size of some architecture, this property describes the change in the size of the underlying IPCode.
Definition 3.4 (Scalability). Let H_{N_B, N_C} = (V, E, D) denote an architecture H with the number of blocks N_B = |V| and the number of data flows (connections between blocks) N_C = |E|. The representation is O(K) size- (time-) scalable with respect to blocks if the addition of one block to the architecture H_{N_B, N_C} requires an increase in the size (decoding time) of the corresponding IPCode by O(K), where K is some function of N_B and N_C.
The representation is O(K) size- (time-) scalable with respect to data flows if the addition of one data flow to the architecture H_{N_B, N_C} requires an increase in the size (decoding time) of the corresponding IPCode by O(K), where K is some function of N_B and N_C.
SCALABILITY WITH RESPECT TO BLOCKS
Adding a single new block behind, before, or beside a specific block is worked out via the serializing or paralleling instructions. We need to find the specific terminating triplet (0, E, d) that finalizes the desired block and replace it, say in the case of preceding it by a new block, by ((2, S, 0), (0, E, d_new), (0, E, d)). Thus, the elementary scalability with respect to blocks is constant, i.e. O(1).
We need to note that we silently assumed that we use the most compact IPCode out of the representationally equivalent ones that would produce the additional block. There also exist IPCodes, able to carry out the identical extension by a block, that would enlarge the original IPCode by a multiple of the original IPCode's length.
A more important point is that this type of architecture extension was the simplest case. Once we decide to add a block into an existing architecture, we need to specify where to place it and also how it will be incorporated into the existing data flows. Merging the new block with a more advanced connection structure would require larger operations on the IPCode. The lack of a reverse mapping from architecture to IPCode prevents us from stating precisely the scope of changes that would reflect the required change in the architecture.
SCALABILITY WITH RESPECT TO EDGES
Adding a new data flow into the architecture is a very legitimate step when working with NNSU. Again, it highly depends on where exactly we decide to place the data flow. Also in this case we would need to identify the induced SSDAG hyper-graph and analyse what changes in the corresponding SubIPCode would be needed. Both the induced SSDAG hyper-graph and the SubIPCode updates might lead to a significantly different SubIPCode.
Resume: The IPCodes are both time- and size-O(1)-scalable with respect to the addition of a single block into the architecture. All other types of scalability, including the addition of more complex block structures or of data flows, are subject to the irreversible decoding process; the scalability might stretch to a higher-order dependence in these cases.

3.1.5 MULTIPLICITY

Definition 3.5 (Multiplicity). A representation is said to exhibit genotypic multiplicity if multiple genotypes decode to an identical phenotype, i.e. the decoding function is a many-to-one
mapping from the space of genotypes to the corresponding phenotypic space.
A representation is said to exhibit phenotypic multiplicity if different instances of the same
genotype can decode to different phenotypes, i.e. the decoding function is a one-to-many mapping
of genotypes into phenotypes.
GENOTYPIC MULTIPLICITY
As for the genotypic multiplicity, we can return to the paragraph where we discussed the compactness of IPCodes. We listed an example of two IPCodes that decoded into the same architecture; that means we found an occurrence of genotypic multiplicity.
There is another source of the genotypic multiplicity. It is called the permutation problem and is discussed e.g. in (Hasan, Islam, and Murase, 2004). The permutation problem is caused by the symmetric effect of the P instruction. For example, the two IPCodes P1 = ((2, P, 0), (0, E, d1), (0, E, d2)) and P2 = ((2, P, 0), (0, E, d2), (0, E, d1)) decode into the equivalent architecture, with the two parallel blocks d1 and d2 placed between IN and OUT. The graphs would be differently embedded in the plane but with no effect on the final calculations and data flows.
In general, interchanging the two SubIPCodes after the P instruction does not affect the resulting architecture. Strictly speaking it does, but we do not care that much about where a block lies in the architecture graph; the important point is that the data flows remain uninfluenced.


In the mentioned article by Hasan et al., a solution to the permutation problem is suggested. The suggested approach unifies all the equivalent encodings by prescribing the order of the instructions succeeding any P instruction. We did not incorporate this technique in the IPCodes paradigm used in the NNSU tool, and it is the second source of multiplicity.
PHENOTYPIC MULTIPLICITY
With phenotypic multiplicity the situation is the opposite: each IPCode always decodes into a unique architecture. Thus, no phenotypic multiplicity can be in place in the case of IPCodes representations.
Resume: The IPCodes exhibit genotypic multiplicity. The advantage is the diversity of IPCodes; the drawback is that no unique representation of a given NNSU can be stated. As regards phenotypic multiplicity, it does not appear in the IPCodes case; the IPCode always decodes to the same architecture.

3.1.6 ONTOGENETIC PLASTICITY

Definition 3.6 (Ontogenetic plasticity). A representation exhibits ontogenetic plasticity if the determination of the phenotype corresponding to a given genotype is influenced by the environment.
Balakrishnan and Honavar claim that ontogenetic plasticity may appear as a result of either environment-sensitive developmental processes or learning processes. In the NNSU case, we excluded the internal parameters of the neural networks from the architecture concept. The internal parameters are stochastic variables and as such may vary according to the specific program compilation and runtime conditions. Except for these internal parameters, no other content of an NNSU instance would change. Thus, the decoding process deterministically produces the identical architecture for a given IPCode.
Resume: The IPCodes representation does not possess any ontogenetic plasticity. We can consider this a rather positive quality, because any additional dependence might introduce sensitivity, and thus even unstable behaviour, into the evolutionary process.

3.1.7 MODULARITY

Modularity is a very interesting quality of any encoding; we can find both advantages and drawbacks to it. Modularity seeks repetitive occurrences of the same sub-architectures and checks how many times these fragments appear in the encoding structure. Modular encodings are supposed to maintain the architectures in a more efficient way, mainly architectures with a higher number of blocks (nodes) and with higher structural granularity.
Definition 3.7 (Modularity). Suppose an architecture H includes several instances of a sub-architecture G; then the encoding of H is modular if it codes G only once (with instructions to copy it which would be understood by the decoding process).
The definition of modularity says, in other words, that the encoding should incorporate instructions with symbolic linking. These links would enable binding the decoding process to the (single) part decoding the sub-architecture.
In the IPCodes mechanism the links are not implemented, and no modularity is provided. Moreover, modularity does not necessarily imply a wider class of represented architectures, as it brings no new building instruction. We might speak of compressing the encoding rather than increasing the representational powers.


Resume: The NNSU tool does not seek the most optimal encodings that would represent architectures with the most economic storage. The IPCodes contain no recursion and they provide no modularity.

3.1.8 REDUNDANCY

Definition 3.8 (Redundancy). An evolutionary environment is said to exhibit
genotypic redundancy if the genotype contains redundant genes,
decoding redundancy if the decoding process reads the genotype more than once,
phenotypic redundancy if the architecture contains redundant elements.
Redundancy is strongly present in biological systems. As a typical example of genotypic redundancy, let us recall that a standard genotype contains all chromosomes doubled, with alleles inherited from both parents.
Balakrishnan and Honavar reason that redundancy often contributes to the robustness of the system in the face of failure of components or processes, meaning that the phenotypes could show better performance even in extreme conditions of the environment.
The use of NNSU assumes a stable environment that influences neither the decoding processes, nor the learning procedures, nor any of the other performed actions. Thus, the IPCodes representational mechanism does not provide redundancy of any type. The networks themselves will not work properly if any of their subparts fails, which means that even phenotypic redundancy is not present in the NNSU tool.
Resume: The IPCodes representation does not exhibit genotypic redundancy; the architectures do not provide phenotypic redundancy; the evolutionary environment uses one-way decoding procedures.

3.2 DISCUSSION OF IPCODES PROPERTIES

We went through the nine properties defined by Balakrishnan and Honavar; each of these properties speaks of a specific quality of the IPCodes representation. We tried to give a short resume on each point and in the end we found that some of the features are met by IPCodes and some are not. The situations where the lack of the particular feature is a real drawback, with respect to the use within the NNSU tool context, are actually rare: we can identify only one such situation.
The consideration of the various properties of a NN should be connected with the particular task the NN is used for. Different tasks may lead to different requirements on the qualities of the NN, and thus on the qualities of their representations. For instance, Balakrishnan and Honavar discuss the situation of artificial NN controllers for robots operating in unexplored and possibly hazardous environments. In this regard they state which of the specific properties mentioned above should be met by the NN representations in order to reach the most efficient representation and evolutionary environment.
The main scope of the NNSU tool lies in analytical use; the NNSU networks are successfully applied in areas of data classification and in time-series analyses. All the mentioned properties seem well balanced in this respect.
Both representational and recombinational closure allow for unbounded IPCodes sampling and mutual recombination without moving out of the represented AA set.
Compactness might, when required, incorporate a search for the most compact solutions.

Property                   Satisfied   NNSU point of view
Completeness               No          Drawback
Closure                    Yes         Plus
Compactness                Yes         Plus
Scalability                Yes         Plus
Genotypic multiplicity     Yes         Neutral
Phenotypic multiplicity    No          Plus
Ontogenetic plasticity     No          Neutral
Modularity                 No          Neutral
Redundancy                 No          Neutral

Table 3.1: Properties review. The pros and cons are denoted for each property. Where the decision is not clear, a neutral statement is given.

Phenotypic uniqueness (as phenotypic multiplicity does not occur) asserts that the NNSU provides always the same performance whenever the instance of NNSU is built from the particular IPCode.

4
Genetic algorithm defined on
instruction-parameter codes
Throughout all the previous chapters we were mainly concerned with the structural properties of NNSU architectures and with all the related questions. We successively introduced the topologies and architectures of NNSU networks, then the ICodes and IPCodes as representations of the topologies and architectures. We demonstrated the convenience of a generalized representation and we proved the ability of IPCodes to represent various classes of architectures. We showed in examples and various comparisons that the representational power of IPCodes provides comfortable and flexible work with architectures. This chapter subsequently leads to a definition of the whole genetic algorithm (GA) environment that optimizes architectures as the base structure for the NNSU instances.
We will now subsequently discuss all the parts relevant to the incorporation of the genetic algorithm for NNSU. We will go through the specific parametrization of NNSU, then we will define the quality function called fitness, and then we will proceed to the recombination operators. Finally, all these components will be tied into a GA environment.
In the first section, we overview the scheme of the genetic algorithm we decided to use for the NNSU tool; we list the reasoning for this step and also stress the main commonalities and differences our approach exhibits compared to established techniques. In the second section, the NNSU networks with their inner structure are reviewed in deeper detail. In the third section, the specific fitness function is described. In the fourth section we describe the recombination operators. The last section reviews all the previous steps and utilizes all components and terms to define the whole genetic algorithm. The framework of the genetic algorithm is given, and all the particular specific components are described.

4.1 GENERAL SCHEME OF THE GENETIC ALGORITHM EVOLUTION

The genetic algorithm optimizing NNSU architectures has been defined as a more or less standard genetic algorithm, with appropriate modifications incorporated where needed. Let us recall that the genetic algorithm works as an iterative algorithm that searches through some search space. In the NNSU case the search space is the set of all architectures, that is, the GAA set. The search proceeds in subsets called populations; members of populations are called individuals or genotypes, also encodings or representations. The decoded genotypes are then phenotypes. The evolution, i.e. the iteration step between populations, is carried out using recombination operators (sometimes also called evolutionary operators).
The recombination operators compose a new population based on information obtained from the existing one. The members of the actual population are sampled using the selection operator, mutually recombined using crossover, individually modified using mutation, and


placed into the new population. The particular iterations of populations are called generations.
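The loop just described can be sketched as follows; the helper names (decode_and_train, fitness, select, crossover, mutate) are placeholders for the NNSU-specific components defined later in this chapter, and the probabilities are illustrative only.

    import random

    def genetic_algorithm(initial_population, generations,
                          decode_and_train, fitness, select, crossover, mutate,
                          p_crossover=0.8, p_mutation=0.1):
        """Generic GA loop over IPCode genotypes (all helpers are assumed, injected callables)."""
        population = list(initial_population)
        for _ in range(generations):
            # evaluation: decode each IPCode into an NNSU, train it, measure its quality
            scores = [fitness(decode_and_train(ipcode)) for ipcode in population]

            new_population = []
            while len(new_population) < len(population):
                parent_a = select(population, scores)
                parent_b = select(population, scores)
                child = (crossover(parent_a, parent_b)
                         if random.random() < p_crossover else parent_a)
                if random.random() < p_mutation:
                    child = mutate(child)
                new_population.append(child)
            population = new_population
        return population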
So far, we discussed representations as only one part of the genetic algorithm framework. The important enhancements we need to incorporate in addition to the representational properties relate mainly to the recombinational aspects and the fitness evaluation. Both of these components are highly complex due to the non-trivial structure of the optimized objects, the architectures of the NNSU.
GENETIC PROGRAMMING APPROACHES WITH NON-TRIVIAL REPRESENTATIONS
In tasks of enhanced complexity, as we observe in NNSU, the advanced genetic algorithm approaches appear successful; these techniques are usually called genetic programming (GP). In genetic programming, the phenotypes are general functions and the genotypes are their representations. Genetic programming was introduced in the 1980s by Cramer and, in parallel, by Schmidhuber (Cramer, 1985; Schmidhuber, 1987). The breakthrough proposition came in Koza's book (Koza, 1992); this book is widely considered a ground publication in genetic programming.
According to the official web page of genetic programming¹, genetic programming ranks as one of the techniques in the wider scope of so-called genetic and evolutionary computation.
Genetic programming is an automated method for creating a working computer program from a high-level problem statement of a problem. Genetic programming starts
from a high-level statement of what needs to be done and automatically creates a
computer program to solve the problem.
The technique of genetic programming is one of the techniques of the field of genetic
and evolutionary computation which, in turn, includes techniques such as genetic
algorithms, evolution strategies, evolutionary programming, grammatical evolution, and machine code (linear genome) genetic programming.
In more detail, the GP approach utilizes the representational powers of grammar tree structures. The original phenotype is decomposed into basic elements, and both these elements and the decomposition information are stored in a tree; for instance, the reverse Polish notation stores an algebraic expression in a binary tree with the operators in the inner nodes and the operands in the leaves.
The GP environment iterates the search through the space of trees of (usually) limited size. The fitness evaluation of each tree constructs the stored functional formula and evaluates it on a testing data set. According to the individual fitness evaluations, the selection probabilities are built in the very standard way. The recombination operators then utilize sub-tree swapping as an extension of the sub-string swapping known from the binary-string genetic algorithm.
The grammatical evolution mentioned in the quotation above stands for an additional extension of the GP approach, as regards both representational powers and implementation steps. The grammatical evolution approach is not used in the NNSU tool because of its highly relaxed representation and the consequent higher computational demands and slower convergence.
GENETIC PROGRAMMING AND NEURAL NETWORKS
The incorporation of GP techniques into neural computation is a straightforward application of GP, as these techniques are able to handle the evolution of advanced representational structures. As already mentioned in subsection 1.2.2, there are many implementations incorporating either direct or indirect encodings. The main scheme of the evolution does not change significantly, though.
¹ http://www.genetic-programming.org/


The difference comes with the fitness evaluation of the phenotypes. Compared to the standard GP progress, with the evaluation of neural networks arises the question of incorporating additional learning/optimization where GP is not sufficient or efficient enough. From this point of view we distinguish two main approaches.
Full optimization of all neural network parameters. That is, the GP maintains all parameters of a neural network, including the interconnections design, the weights, and the internal set-up of the neurons. The fully optimized neural network structures are usually considered in situations where the inter-node connections are univariate and may be assigned a weight (either binary or real). In these situations, the GP searches for both the optimal topology and the weights. In particular, this method may be utilised to save computational time compared to the heavily time-consuming gradient methods.
Hybrid optimization. The GP optimizes part of the parameters, usually the underlying architecture, and some other method optimizes the remaining ones, usually the internal parameters of each neuron. According to the treatment of the out-of-GP optimized parameters, we recognize the Lamarckian strategy (the out-of-GP optimized parameter values are stored back into the representation and recombined into the next generations) and the Baldwin effect (the genotype is recombined with the original values of the parameters; this approach assumes that only the predispositions for learning/optimization are driven by the evolution).
GENETIC PROGRAMMING AND NNSU
Given the nature of the NNSU type of neural networks, we sought a specific genetic programming technique that would fit the NNSU phenotypes and the requirements on the optimization of the underlying architectures. The three following paragraphs summarize the NNSU-specific properties that drove the approach selection.

Baldwin effect
Each NNSU instance is taught on the training data set: the switching units are set up using the k-means clustering algorithm, and the neuron units are adjusted via a generalized linear model. The internal parameters, set up via these specific methods, are not stored into the IPCode. Thus, we speak of the Baldwin effect in the NNSU case.

No edge weighting
In contrast to common genetic programming praxis, where inter-node connections represent univariate data flows and are usually assigned weights, the NNSU incorporates inter-node (inter-block) connections as multivariate data flows. NNSU assigns no specific modification to the data flow, i.e. no weighting, no filtering. All of these transformations are carried out inside the blocks, through application of either an NSU or a block with variables selection.

No maximum architecture
Considering the data flows structure and its impacts, NNSU are supposed rather to use a smaller number of data flows and to identify the most efficient combinations of these data flows. Through genetic programming defined on NNSU, we intended to seek sparse architectures with general connections passing over more than a single layer, bounded neither in the number of blocks nor in the number of data flows. We will find later that the recombination of IPCode structures may, in general, lead to architectures with an unlimited number of blocks, and as a result also of inter-block edges. In praxis, we limit the realistic size of an NNSU. This protects us from an unrealistic rise in the computational time spent on NNSU evaluation, as well as

from generalization loss of the NNSU. Note that this is not a limitation of the model; it is
rather its parametrization.
In terms of the previous paragraphs, we constructed a special genetic algorithm environment, as a GP instance, that allows for IPCode treatment and recombination, is able to learn and evaluate the decoded NNSUs in each iteration, and is not tied to fixed architectures nor to a fixed number of nodes/blocks.
We decided to incorporate Gruau's cellular encoding. The CE implicitly works with the Baldwin effect as well as with unlimited architectures. Based on Gruau's CE methodology, the IPCodes are an indirect encoding incorporated into a GA environment. Each IPCode, once it gets decoded into an NNSU instance, is further learned and then evaluated with a fitness. With IPCodes, we adjusted the representation for non-weighted multidimensional data flows.
The other framework recently used in neural computation, known as NeuroEvolution of Augmenting Topologies (NEAT) by Stanley and Miikkulainen (Stanley, Miikkulainen, 2002), was not available to us during the design of the NNSU GA. NEAT appears highly efficient in various tasks, including advanced real-world tasks (pole balancing, real-time adaptive game behaviour). Still, there are two main reasons why NEAT would not work for NNSU, both relating to the representational technique.
NEAT uses direct encodings. As seen in (Stanley, Miikkulainen, 2002), page 102, NEAT argues for the incorporation of linear genomes:
We chose direct encoding for NEAT because, as Braun and Weisbrod argue, indirect encoding requires more detailed knowledge of genetic and neural mechanisms. In other words, because indirect encodings do not map directly to their phenotypes, they can bias the search in unpredictable ways.
NEAT supposes a maximum topology for its runs. This is, partly, a consequence of the use of the direct encoding pointed out above. NEAT utilizes layered topologies with input, hidden and output nodes. Even though NEAT allows for inter-layer connections, we saw in subsection 2.4.2 that the layered architectures LAA are only a subset of all representable architectures GAA.
Now that we know we will construct a genetic algorithm environment with CE-based representations in the form of IPCodes, we go on to introduce each specific part of the GA.

4.2 NNSU TOOL IN DETAIL

The NNSU tool is being successfully applied in areas of data analysis, mainly for data separation and classification tasks, and in time series predictions.
The early reports regarding the evolution of neural networks with switching units are cited in (Bitzan, Smejkalová, Kučera, 1995); the modifications and improvements are described in (Hakl, Hlaváček, and Kalous, 2002a), (Hakl, Hlaváček, and Kalous, 2002b), (Hakl, Hlaváček, and Kalous, 2002c) and in (Hakl et al., 2003); the implementation-related matters together with further model analyses are laid down in (Vachulka, 2006).
As regards real applications, the projects focused on data separation have been active for a longer period and our team is more experienced in this area. We have participated with CERN on particle identification tasks since 1995; the analyses and outcomes are found in (Hakl, Jiřina, 1999) and (Hakl, Jiřina, Richter-Wąs, 2005). The next project the NNSU co-operated on was the R. Bock project of gamma-hadron identification, see (Bock, 2004).


The time series analysis use is meanwhile under development, the methods are undergoing many experiments, and it would be too early to incorporate the evolutionary optimization and all of its further parametrisation when the current situation is not analysed enough. For closer information, see the works of Marek Hlaváček (Hlaváček, 2004) and (Hlaváček, 2008) (in preparation).
The content of this work and the scope of the analyses are focused on the NNSU tool as a data classifier. This determines the way we construct the data for processing, the evaluation factors for the NNSU instances, and also the definition of the fitness function.

4.2.1 ANALYSED DATA

In the data separation case, the data arise as observations of a two-state system. The observed data correspond to vectors of real numbers, i.e. the observations are multivariate variables. The two states are called signal (a positive state) and background (a negative state).
The type of the observed state is assigned with a data type flag. The data type flag is either equal to −1, which designates the background state, or to 1, which designates the signal state. We also define a function t returning for each data record D a value from the set {−1, 1}. The data type is also called the teaching flag, and so is the function t.
We assume the set of all observed data D to be of dimension P and count (number of records) N. The data set is distinctly composed of the signal data and the background data:

D = D_S ∪ D_B,  D_S ∩ D_B = ∅,  N = N_S + N_B = |D_S| + |D_B|.     (4.1)

In the common work with the NNSU tool, without any further optimization, we use two types of data: learning data D^L and evaluation data D^E. The learning data allow for setting up the internal parameters of the NNSU system. The evaluation data provide an independent test of the performance: the results of an NNSU system are compared to the known data types, and at the same time the overlearning is verified. When the genetic algorithm optimization is applied, we might think of it as a procedure optimizing the architectures on the evaluation data.
We need to provide an additional data set which allows for an independent assessment of the optimized architectures. This data set is called the testing set and it is used for testing the generalization of the genetic algorithm optimization performed on the evaluation data. The testing sets are indexed with the T superscript.
All the data types (L, E, T) are composed of the two mentioned types of data, signal and background, i.e.

D^j = D^j_S ∪ D^j_B,  D^j_S ∩ D^j_B = ∅,  j ∈ {L, E, T},

so that for the record counts of the particular sets we have N^j = N^j_S + N^j_B = |D^j_S| + |D^j_B|, j ∈ {L, E, T}.
D ATA AS RANDOM VARIABLES
Note that the data are considered as observations of a real system, or at least as observations
of some sampling procedure which is simulating the real system. Thus, the data are handled
as random variables distributed according to their probability distribution function, either FS
for signal or FB for background. The data are assumed to be independent observations from
these probability distribution functions.

4.2.2

NNSU

AS A MAPPING

The NNSU is mapping each data record to real number.


NN : RP R.
The NNSU tool is used for data separation. We separate data produced by two-state system
and we see that classifier with a binary response would be of better use. Thus, we compose

54

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

the mapping NN and some classifier C : R {1, 1}, which assigns correspondence either
to background (1) or signal (1) to the outputs of the NNSU.
For instance, the simple threshold model can be used. The response of the classifier is
considered as signal if some threshold is reached, as background in other case.

1
if x
C (x) =
1 otherwise
In = 0 case we have a basic classifier Cb = C0 . The Cb classifier is used as a default classifier
in the NNSU tool. The next types of classifiers used in the NNSU tool are reviewed, analyzed,
and compared in work (Vachulka, 2006). Some of them are also used when working with
NNSU tool, and we will list those out in the following sections.
Once the classifier and the NNSU are composed, the resulting mapping is then called
NNSU classifier and it is assigned as
NNC : RP {1, 1}, NNC = C NN.

4.2.3

T HE

(4.2)

HISTOGRAMS

For the response to any data where the data type flag is known, we may construct a separate
sample histogram graph for each data type, i.e. create two histograms of values NN(DS ) and
NN(DB ). These histograms visualise the empirical probability distribution of the response of
the NNSU tool. In general, histograms are quite common statistical tool, detailed informations regarding histograms can be found in (Bishop, 1997), chapter 2.
Because we are trying to build comparable histograms for both types of data, we will
assume that both the sample histograms HS , HB are calculated with identical bins {cj }M
j=1 .
The histograms are assumed to be normalized because the numbers of signal and background
events may differ. Thanks to the normalization we always get equal measure of both types
of events:
M
X

HS (cj ) =

j=1

M
X

HB (cj ) = 1.

(4.3)

j=1

A sample histogram is seen in figure 4.1. We can see that the mean of the signal-response
distribution is closer to the 1 value whereas the mean of the background-response distribution
is closer to the 1 value.

4.3

F ITNESS

FUNCTION

The fitness function is a quality measure which works as a main component of the genetic
algorithm optimization. The individuals are optimized according to criteria evaluated by the
fitness function during the genetic algorithm run.
The fitness function of one individuala
NNSU instanceis defined as a linear function
P
of evaluation factors, i.e. in form f = k wk ek . The evaluation factors ek allow for definition of particular quality functions narrowed on a specific part of NNSU performance, and
the linear structure introduces transparent mixing of these subparts into a complex quality
measure. The following evaluation factors are used in NNSU case:
Histogram ratio evaluation eHST ,

Posterior probabilities evaluation eP O ,

ROC curve evaluation eROC ,

Mean square error evaluation eM SE .

4.3.

F ITNESS

FUNCTION

55

The fitness is thus written as


f = wHST eHST + wROC eROC + wP O eP O + wM SE eM SE .

(4.4)

We suppose that both the factors e and weights w are non-negative, and thus also the
resulting fitness f is non-negative. This will be important in later sections where we will
discuss the advanced use of fitness values.
In the next subsections we will describe and specify in deeper detail each of the evaluation
factors. We will speak about the motivation and the particular definitions. We also list basic
properties of the factors. In the last subsection we will return to the formula 4.4 defining the
mixing of factors. We will look at the factor transformations that lead to normalized factors
and consistent system of weights.

4.3.1

H ISTOGRAM

RATIO EVALUATION

The histogram ratio evaluation is one of the common techniques used in the data separation
field. The Cb classifier is used, the quality of separation is measured by volumes of histograms
on the left and on the right side of 0 which is a threshold of the basic classifier Cb .
The basic classifier divides its domain, i.e. reals, into two intervals, (, 0), (0, +).
These two intervals represent signal and background. This still gives too much space for the
NNSU to assign the data correctly but impractically for any further analytical work or for
visual presentation.
In the NNSU tool we use narrowed intervals: background window for background data:
BW = [1.1, 0.3], and signal window for signal data: SW = [0.3, 1.1]. Intuitively, the more
volume of data records correctly assigned as signal in the SW, the better, the less of data
records correctly assigned as background in the SW, the better. And similarly for the BW.
Using the notation of sample histograms HS , HB , and the bin centers {cj }M
j=1 , the histogram ratio evaluation is calculated as follows.
P
P
cj SW HS (cj )
cj BW HB (cj )
+P
(4.5)
eHST = P
cj BW (HS (cj ) + HB (cj ))
cj SW (HS (cj ) + HB (cj ))
The eHST evaluation factor focuses on the subintervals SW and BW and reflects the quality of the separation on these intervals. Optimizing this factor will lead to NNSUs that concentrate correctly assigned background on BW and correctly assigned signal on SW.

4.3.2

ROC

CURVE EVALUATION

The ROC curve (Receiver Operating Characteristic curve) is a widely used tool in the field of
data separation. The underlying theory originally appeared in signal theory, see in (Green,
Swets, 1964), an illustrative review is in (Heeger, 1997). The use of ROC curve in machine
learning, pattern recognition, or data separation is nicely reviewed in slides by (Orallo, 2004)
and in article by (Bengio, Mariethoz, Keller, 2005).
The evaluation factor eROC used in the NNSU tool is based upon the ROC curve, thus
we will firstly show the way the ROC curve is calculated from the NNSU data and results.
Using the previously defined NNSU terminology we introduce the terms of ROC theorytrue
positive rate (T P R), false negative rate (F N R), false positive rate (F P R), and true negative
rate (T N R). Note that we dont put any requirements on the classifier in this case.

TPR =

N
1 X
[ NNC(Dk ) = 1 ].[ t(Dk ) = 1 ]
NS
j=1

(4.6)

56

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

Histogram---sample

-2

(x10 )

BG
SG

Events/bin

0
-1.0

-0.5

0.0

BW

SW

0.5

1.0

NNSU output

Figure 4.1: Histogram graph for a NNSU response to the evaluation data. The blue solid
curve shows histogram of response to signal data HS , and the red dashed curve shows histogram of response to background data HB . The intervals used for evaluation factors BW
and SW are sketched at the x-axis.

FNR =

N
1 X
[ NNC(Dk ) = 0 ].[ t(Dk ) = 1 ]
NS

(4.7)

N
1 X
[ NNC(Dk ) = 0 ].[ t(Dk ) = 0 ]
NB

(4.8)

N
1 X
[ NNC(Dk ) = 1 ].[ t(Dk ) = 0 ]
NB

(4.9)

j=1

FPR =

j=1

TNR =

j=1

In the NNSU terminology the F N R is assigned as accepted background rate (ABR) and T P R
as accepted signal rate (ASR). The ROC curve is given as mapping assigning the ASR as a
function of ABR, i.e. mapping assigning interval [0, 1] to interval [0, 1], ROC : [0, 1] [0, 1].
ROC(y) = max{ASR|ABR y}.

(4.10)

The slope of the ROC curve shows how much of signal are we able to assign having
reached given level of badly assigned background.
In the NNSU case we use the threshold classifier C and in this case NNC(Dk ) = 1 if
NN(Dk ) . The zero level of accepted signal/background corresponds to = . As
the decreases the level of accepted responsesfor both signal and background events
increases. We then simply identify the level for which the number of background data records
reach the required level, i.e. ABR = y.

4.3.

F ITNESS

57

FUNCTION

In general, the best shape of ROC curve is a flat curve that starting at point (0, 1) and ending in point (1, 1) which is designating that all signal data was accepted while no background
was.
The quality factor base on ROC curve might be defined in many ways.
In general, we
R
may choose any integral of the ROC curve. In scoring applications, the (ROC(y) y)dy is
used, known as the Ginis coefficient. In the NNSU case we use a discrete integral. Actually,
the definition of evaluation factor based on the ROC curve comes from the cooperation on
the R. Bocks project (Bock, 1998). It is defined as follows.
eROC = ROC(0.01) + ROC(0.02) + ROC(0.05)

(4.11)

+ROC(0.1) + ROC(0.2).

The sample ROC curve is seen in figure 4.2, the points 0.01, 0.02, 0.05, 0.1, and 0.2 are
emphasized with gray crosses.
ROC curve for NNSU 00307 00019---sample
1.0
ROC curve

Accepted signal rate

0.8

0.6

0.4

0.2

0.0
0.0

0.2

0.4

0.6

0.8

1.0

Accepted background rate

Figure 4.2: ROC curve sample for a NNSU response to the testing data. In the eROC evaluation the growth of this curve is watched on interval [0, 0.2]. The values in points 0.01, 0.02,
0.05, 0.1, and 0.2 are summed into eP O evaluation factor. The points are emphasized with
grey crosses on the ROC curve.
Optimizing this factor will lead to the concentration of ASR over the ABR in the area
where ABR is less than 0.2.

4.3.3

P OSTERIOR

PROBABILITIES EVALUATION

The posterior probabilities evaluation factor is calculated using a specific NNSU classifier response. The non-trivial classifier CPO is used, it is built upon the Bayesian posterior theorem.

58

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

Closer informations and properties regarding this classifier are found in work (Vachulka,
2006). The evaluation factor calculates the ratio between the correctly assigned data and all
the processed data.
eP O

N
1 X
=
[ CPO (NN(Dj )) = t(Dj ) ]
N

(4.12)

j=1

The classifier works individually with each NNSU response and thus there is no guarantee
that the areas of equally classified data records will lie in a contiguous intervals. On one
hand we may gain better performance, on the other hand we loose the visual interpretation
of the results.
Optimizing this factor will lead to the NNSU supports better classification using posterior
probabilities.

4.3.4

M EAN

SQUARE ERROR EVALUATION

The mean square error evaluation factor is well known loss function:
M SE =

N
X

(NNC(Dk ) t(Dk ))2 .

k=1

All of the evaluation factors are constructed as increasing with the quality of the NNSU.
Thus we simply take root square of the M SE inversion, and define the zero value in the
singular point of M SE = 0.
(
1
M SE > 0
M SE
(4.13)
eM SE =
0
M SE = 0
Optimizing this factor will lead to the NNSU with lower uniform error.

4.3.5

C OMPOSED

FITNESS

The fitness function as a quality measure is a function of a NNSU instance and data.
f = f (NN, D).

(4.14)

For a given NNSU instance, the fitness might vary depending on different data. Since the
learning process sets up the inner parameters of the NNSU depending on learning data,
the different learning data lead to different inner parameters and thus to a different output
quality. Similarly, for a given data, the fitness might vary depending on different NNSU
instances as two different instances of NNSU give different outputs. We assume the data D
are fixed for some specific task, and the NNSU instances are taken from some suitable set of
architecture.
Technically, the fitness is considered as a linear function of 4 evaluation factorshistogram
based factor (HST ), ROC curve based factor (ROC), factor based on posterior classifier
quality (P O), and factor based on mean square error (M SE) as written in 4.4.
Further, we will set up the weights wHST , wROC , wP O , and wM SE to control the influence
of the particular factors. For instance, if we want to emphasize the eHST and eP O factors,
and we simply set the weights to wHST = 1, wROC = 0.1, wP O = 1, and wM SE = 0.1. We
want the weights to work independently on underlying data and NNSU instance. It means

4.3.

F ITNESS

59

FUNCTION

that the assignment of weight equal to 1.0 to a factor is interpreted with the same influence
as to any other factor. We need some kind of normalization to factors that would assert our
requirement and prepare factors before they are put together using the linear function 4.4.
This factor preprocessing is described in the following headings.
E VALUATION FACTORS NORMALIZATION
The evaluation factors are considered as observations of random variables representing the
factors are positive and bounded. The basic characteristics of the factors are shown in table
4.1.
Evaluation Factor
eHST
eROC
eP O
eM SE

Min
7.20
2.08
0.65
1.27

Max
8.11
2.50
0.72
1.32

Mean
7.70
2.38
0.68
1.30

Std
0.1670
0.0687
0.0126
0.0098

Table 4.1: Core values of evaluation factors. This is the sample from calibration run on data
introduced in following chapters. The statistics are based on 150 measurements, i.e. we
constructed 150 NNSU instances, we performed the analysis on the data, and measured the
factors evaluations.
The mean of a factor represents a global influenceeach factor possesses with influence
depending on its mean, e.g. in-between two factor the one with higher mean will gain higher
importance. The variance of a factor represents local influencefor two factors with equal
mean the one with higher variance will more intensively influence the resulting values of
fitness defined as a linear sum.
The factors have to be of relevant values so that the weight system can work consistently.
Naturally, the term relevant has to be precised to get the requirements on factors. Then, the
transformations leading to factors satisfying the requirements have to be found.
Regarding the moments of factors, we require the mean to be comparable. With the term
comparable we mean that the factors possess with equal global influence and individual local
influence.
E ej E ek , j, k {HST, ROC, P O, M SE}

(4.15)

Thus we are seeking a transformation ee = T (e) allowing that the transformed factors
have equal means. Preferably, the variances need not to get equal after transform.
S CALING TRANSFORM
First type of transform is a scaling transform. The transformed factor is simply the original
factor denominated by its mean.
Ts (e) =

e
Ee

(4.16)

e
For moments of the transformed factor ee = Ts (e) we have E ee = 1 and VAR ee = VAR
.
(E e)2
The factor is transformed so the resulting mean is equal to 1 and the variance is scaled
proportionally to the mean of the original factor. Thus, the scaling transform modifies both
moments and speaking of the factor influence, the scaling transform modifies global and
local influence at once.

R ANKING TRANSFORM
The second transform is a ranking transform. The distribution of the factor is transformed

60

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

into distribution on interval [0, 1]. The properties of the resulting distribution might be set,
we might assume the uniform one for now.
j M 1
M
Let {eref
j }j=1 be the system of reference ranking system of factors, and {uj = M }j=1 . The
ranking transform is defined as follows.

ref

0 e e1
ref
Tr (e) =
uk ek < e eref
k+1 , k {1, . . . , M 1}

1 e > eref
M

(4.17)

I MPLEMENTATION OF THE TRANSFORMS


For both transform functions unifying the factors we need to get a sample factors observations
using which we estimate either the mean of the evaluation factor E e or the whole reference
M
ranking system {eref
j }j=1 . This sample must be calculated under same conditions we will
put on the optimization task, i.e. the observation must originate from the specific data, the
underlying NNSU architectures must comply with requirements we will be using within the
optimization.
In order to get the observations we simply generate enough NNSU instances {NNj }M
j=1 ,
learn them, test, and evaluate on the data with resulting factors ekl , k {HST , ROC, P O,
M SE}, l = 1, . . . , M .
For scaling transform Ts we calculate the estimate of the mean as
M
1 X
ekl , k {HST, ROC, P O, M SE}.
k =
M

(4.18)

l=1

For ranking transform Tr we calculate the reference ranking system as


eref
kl = ek(l) , k {HST, ROC, P O, M SE}, l = 1, . . . , M.

(4.19)

W EIGHTS SETTINGS
Having the transformed factors with equal means (E eej = E eek , j, k {HST , ROC, P O,
M SE}), we could define fitness as following weighted sum.
f=

wHST e]
^
HST + wROC e]
ROC + wP O eg
P O + wM SE e
M SE
.
wHST + wROC + wP O + wM SE

(4.20)

The fitness defined in this way has mean equal to the mean of the factors

Ef = E
=

wHST e]
^
HST + wROC e]
ROC + wP O eg
P O + wM SE e
M SE
wHST + wROC + wP O + wM SE

wHST E e]
^
HST + wROC E e]
ROC + wP O E eg
P O + wM SE E e
M SE
wHST + wROC + wP O + wM SE

= E e]
HST
That is, for a scaling transform E f = 1, and for ranking transform E f = 0.5.

(4.21)

4.4.

R ECOMBINATION

4.4

OPERATORS

R ECOMBINATION

61

OPERATORS

Afterwards the individual properties of the NNSU structures were through, and the quality
function was defined and described, we may proceed to the next step. The next step will be
the recombination schemes.
The operators play a vital role in the model of evolutionary algorithms because their definition determines the transitional behaviour of the system. The operators defined for IPCodes
are intuitively adopted from the cellular encoding theory, i.e. swapping of the subparts of the
program symbol trees. The operations on subtrees in cellular encoding are formulated for
SubIPCodes easily due to level-property equivalently defined on SubIPCodes.
The genetic algorithm in the NNSU tool uses standard three operators selection, mutation,
and crossover.

4.4.1

S ELECTION

A selection operator controls which individuals that will be given the opportunity to appear
in the following a generation. During the selection, the fitness evaluations of the individuals
in the population are evaluated and according to their quality, the members of the population
are picked up for recombination. The selection is thus a mapping
S : P IPCodes, P 2IPCodes .
The selection works in two stepsit builds up a selection distribution s over the current population and then it samples the members of this population at random according to the selection distribution s. Thanks to the random sampling part, the mapping is non-deterministic.
The selection distribution s is a discrete probability distribution over all positions in population and, in general, it is based on the fitness evaluations of the individuals in the population. We have many ways how to define the particular selection distribution s construction.
In the NNSU case, the following two types of the selections are implemented.
Proportional selection. The proportional selection is derived in such a way that the
particular entries sj are proportional to the entry fj .
Ranking selection. The ranking selection is built upon a mapping assigning increasing
probability sj to better ranked positions fj .
P ROPORTIONAL SELECTION
The proportional selection is defined in a straightforward form as a normalized vector of
fitness values.
spj = PR

j=1 fj

fj , j {1, . . . , R }.

(4.22)

The proportional selection defined this elementary way leads to levelled probability distribution (see table 4.3 and figure 4.5), and thus modified proportional selection is constructed.
Firstly, the normalized value from interval [0, 1] is calculated as
tj =

fj minf
, where
maxf

minf = min{fj , j {1, . . . , R }}, maxf = max{fj , j {1, . . . , R }}.

62

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

Then, the modified proportional selection is defined so that it puts emphasis the distances
from the centre of the unit interval. This can be easily achieved using a function T that
non-linear on intervals [0, 0.5) and [0.5, 1]. We use the following one.
( 1

2 . (1 (1 2t) ) , for t [0, 0.5)


T (t, ) =
,
(4.23)
1
) , for t [0.5, 1]
.
(1
+
(2t

1)
2
sp,mod
=
j

1
T (tj , ), j {1, . . . , R }
N

(4.24)

where N is a normalizing factor defined as


N =

R
X

T (tj , ).

(4.25)

j=1

In the NNSU tool, the parameter is set to = 1/3. The selection sp,mod emphasizes the
differences between the quality of each individual much more distinctively. This effect is seen
in figure 4.5 and in table 4.3.
R ANKING SELECTION
The ranking selection assigns the values according to the rank of the individuals fitness
evaluations. The higher rank of the individual, the higher value of the particular selection.
Considering any strictly increasing function over R positions, the ranking selection is given
as
1
, j {1, . . . , R },
(4.26)
srj =
N (j)
where is the mapping that orders the sequence of the fitness evaluations, i.e. f(j) = f(j) ,
j {1, . . . , R }. The fitness evaluation is not necessarily injective and nor is the mapping.
This is the reason normalizing factor N is used; it is calculated as
N =

R
X

(4.27)

(j) .

2.0

1.0

j=1

1.5

Logarithmic
Linear
Exponential

0.0

The T transformation function


used in modified proportional selection
0.2
0.4
0.6
0.8
t

0.0

0.0

0.2

0.5

0.4

T(t)

RHO(x)
1.0

0.6

0.8

=1 3
=1
=3

1.0

0.0

0.2

0.4

0.6

0.8

1.0

Figure 4.3: The graphs of the function T and the three types of functions.
In the NNSU tool, three types of functions are worked out: logarithmic, linear, and
exponential. The corresponding formulas are listed in the table 4.2 and shown in figure 4.5.
The particular selections are the assigned as sr,log , sr,lin , and sr,exp .

4.4.

R ECOMBINATION

Type
Logarithmic
Linear

63

OPERATORS

Formula
j
log
j = log(1 + log(1 + R )), j {1, . . . , R}
j
lin
j = R , j {1, . . . , R}
j

exp
= e R 1, j {1, . . . , R}
j

exponential

Table 4.2: Used types of functions. The graphs are shown in the right figure in 4.4.
Example 4.1. We will demonstrate particular selection types for a vector f = (4.19, 4.23, 4.17,
4.18, 4.16, 4.19, 4.16, 4.30), which means we have fitness of population of size R = 8. The
particular selections calculated according to formulas 4.22, 4.23, and 4.26 are listed in the
following table. For a better illustration we can see figure 4.5.
sp
sp,mod
sr,log
sr,lin
sr,exp

(0.12, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13)


(0.09, 0.13, 0.08, 0.09, 0.08, 0.10, 0.08, 0.34)
(0.14, 0.18, 0.10, 0.12, 0.04, 0.16, 0.07, 0.19)
(0.14, 0.19, 0.08, 0.11, 0.03, 0.17, 0.06, 0.22)
(0.13, 0.21, 0.07, 0.10, 0.02, 0.17, 0.04, 0.26)

The resulting selection probability distributions should assign higher probability to better performing individuals and lower probability to less performing individuals which means that the
probability distributions have a high variance or using the distance from the uniform distribution
on 8 positions U8 = (0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125).
These two terms show us the variability of a particular selection. The proportional selection
is considerably close to U8 and on the figure 4.5 we can see that it is levelled.

Selection Standard
Type
Deviation
p
s
0.001
sp,mod
0.090
r,log
s
0.052
sr,lin
0.068
sr,exp
0.084

3.6

3.8

Fitness
4.0
4.2

4.4

Fitness mean = 4.196475


Fitness std = 0.046332

Fitness value
1

4
5
Individual

Figure 4.4: Original fitness evaluations f of the individuals. We can see that the values do not differ
significantly.

4.4.2

Distance
from U8
0.004
0.237
0.137
0.180
0.222

Table 4.3: Variability of constructed selections. The proportional selection shows


low variability.

M UTATION

Mutation is a genetic operator that modifies subpart of representation. For example in binary
case it simply inverts all of affected bits. The advanced structure of IPCodes does not allow
this inversion of elements. In case of IPCodes, the mutation is described as randomly changing some subpart with another. Naturally, the resulting structure needs to be representation

G ENETIC

ALGORITHM DEFINED ON

0.4

CHAPTER 4.

0.4

64

ranking, log
ranking, lin
ranking, exp

0.0

0.0

Selection prob. distribution


0.1
0.2
0.3

Selection prob. distribution


0.1
0.2
0.3

proportional
proportional, mod

IPC ODES

Index

Index

Figure 4.5: The left figure shows the proportional and modified proportional selections calculated from the fitness f . The right figure shows the ranking selections calculated from the
fitness f , the logarithmic ranking selection exhibits the least portion of difference emphasis,
the exponential ranking selection the greatest.
again. Mutation is thus considered as a function M mapping an IPCode onto another IPCode
M : IPCodes IPCodes,
and it is carried out as a change of SubIPCode on randomly picked position. Let P
IPCodes. The mutation Q = M(P) proceeds as listed in algorithm 7.
Algorithm 7 Generic algorithm for mutation
1: Choose randomly position k {1, . . . , |P|}. This can be done according to an arbitrary
distribution on {1, . . . , |P|}, e.g. uniform.
2: Generate IPCode Qtmp at random.
3: Substitute the P-SubIPCode on position k with Qtmp :
M(P) = P k Qtmp .
The length of M(P) is then bounded as

min Qtmp , |P| |M(P)| Qtmp + |P| 1.

(4.28)

In case the first position is chosen at step 1 of the mutation mechanism, the whole IPCode
is interchanged with new random IPCode while the lower bound is reached. On the other
hand picking up the positions with zero entries means growing the IPCode (the upper bound
is reached, the resulting IPCode is of length that is greater or equal of the mutated one).
Example 4.2. We will proceed mutation of an IPCode. Let P be the IPCode from example 2.3
extended by labels {L1, . . . , L13}, 8 the randomly picked position, and Qtmp = ((2, 2, K1),
(0, 11, K2), (0, 11, K3)) the random IPCode. The resulting IPCode is created as follows (the
interchanged SubIPCodesare emphasised).
tmp =
M(P)
= P 8 Q
= (2, 2, L1), (2, 1, L2), (0, 11, L3), (2, 2, L4), (2, 2, L5), (0, 11, L6),
(0, 11, L7), (2,2,L8),(0,11,L9),(2,1,L10),(0,11,L11),(0,11,L12),

(0, 11, L13) 8 ((2, 2, K1), (0, 11, K2), (0, 11K3)) =

= (2, 2, L1), (2, 1, L2), (0, 11, L3), (2, 2, L4), (2, 2, L5), (0, 11, L6),

(0, 11, L7), (2,2,K1),(0,11,K2),(0,11,K3), (0, 11, L13) .

4.4.

R ECOMBINATION

65

OPERATORS

The resulting IPCode is assigned as Q and we will use it in following examples. In this case, the
mutation maps member of IPCodes(13) to a member of IPCodes(11). In real, the represented
architecture is mutated as shows following figure.
IN

IN

1
L3

2
L13

exchanged subgraph
6
L11

1
L3

5
L9

exchanged subgraph
3
L6

3
L6

7
L12

2
L13

4
L7

6
K3

5
K2

4
L7

OUT

OUT

Figure 4.6: The mutation alters blocks 5, 6, 7. The particular swapped subgraphs are
pointed out using frames.

4.4.3

C ROSSOVER

Crossover is the operator recombining subparts of two representation and as a result it creates
one new representation. In general, it is considered as a mapping with prototype
C : IPCodes IPCodes IPCodes.
The crossover scheme on IPCodes is given as interchange of SubIPCodes on given positions. Let P1 , P2 IPCodes. The crossover Q = C(P1 , P2 ) proceeds as in 8. The size bounds
for cross-overed IPCodes is written as
1 |C(P1 , P2 )| |P1 | + |P2 |.

(4.29)

Again, let us look over a simple example of the IPCodes crossing over.
Example 4.3. Let P1 = P, P2 = Q (here with labels assigned with K) from the mutation
example, k1 = 8 the randomly picked position in P1 , and k2 = 3 the randomly picked position
in P2 . Further, let K = 2. The crossover proceeds as written in the following equation. The
SubIPCodes are emphasised.
C(P1 , P2 )
= Qtmp
2

= (2, 2, K1), (2, 1, K2), (0,11,K3), (2, 2, K4), (2, 2, K5), (0, 11, K6),

(0, 11, K7), (2, 2, K8), (0, 11, K9), (0, 11, K10), (0, 11, K11) 3

(2, 2, L8), (0, 11, L9), (2, 1, L10), (0, 11, L11), (0, 11, L12)

66

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

Algorithm 8 Generic algorithm for crossover


1: Choose randomly positions k1 {1, . . . , |P1 |}, and k2 {1, . . . , |P2 |}. This can be done
according to an arbitrary distributions on {1, . . . , |P1 |}, {1, . . . , |P2 |}, e.g. uniform.
tmp
2: Create temporary IPCode Q1
via substituting P1 -SubIPCode on position k1 with P2 SubIPCode on position k2 :
Qtmp
= P1 k1 sub (P2 , k2 ) .
1
3:

Create temporary IPCode Qtmp


via substituting the P2 -SubIPCode on position k2 with
2
P1 -SubIPCode on position k1 :
Qtmp
= P2 k2 sub (P1 , k1 ) .
2

4:

Generate binomial variable K {1, 2} at random with probability P = 0.5 assigning


which temporary IPCode is used as a result of crossing over, and set
C(P1 , P2 ) = Qtmp
K .

= (2, 2, K1), (2, 1, K2), (2,2,L8),(0,11,L9),(2,1,L10),(0,11,L11),(0,11,L12),


(2, 2, K4), (2, 2, K5), (0, 11, K6), (0, 11, K7), (2, 2, K8),

(0, 11, K9)(0, 11, K10), (0, 11, K11)

We can see, for instance, that the first parent has interchanged a trivial SubIPCode. The
visualisation of the crossover on architectures is drawn in figure 4.7.

4.5

G ENETIC

ALGORITHM

Now that all needed sub-parts of the genetic algorithm are described and prepared, we can
move to the genetic algorithm itself and round off its definition. We put all the mentioned
structures together, make them an effective use in the whole and build a genetic algorithm
that works with architectures.
We already said that the genetic algorithm is, in principle, an iterative procedure that
searches through the search space. The search proceeds through examination of special
subsets of the search space, so-called populations. The populations are are initially sampled
at random. The succeeding populations candidates are sampled using selection rules and
recombined using crossover and mutation. The preparation of next populations for a given
number times, or possibly until some level of fitness quality is reached.
The particular iterations of populations are called generations. The generations are indexed starting from the first recombined population. The number of representations in the
population is set to R . The number of generations to be build is assigned as V . As soon as the
genetic algorithm is finished, we obtain a fitness matrixan R (V + 1) matrix (counting in
the initial population) revealing the performance path of the algorithm. This fitness matrix
is a main documentation of a particular genetic algorithm run and its performance.
The search space for the genetic algorithm is from the very beginning supposed as a
set of IPCodes, either with bounded length, IPCodes(N ), or without unlimited length,
IPCodes. The genetic algorithm decodes the genotypes from the actual population into
NNSU instances, assigns them with evaluation factors e and finally with fitness f . Then,

4.6.

T HE

67

PARAMETERS CONTROLLING THE TRANSITIONAL BEHAVIOUR

IN

1
L3

2
L13

exchanged subgraph
6
L11

5
L9

IN

exchanged subgraph

3
L6

7
L12

4
L7

3
L11

OUT

7
K9

IN

1
L9

4
L12

8
K10

5
K6

2
K11

6
K7

exchanged subgraph
1
K3

3
K6

4
K7

5
K9

2
K11

OUT

6
K10

OUT

Figure 4.7: Crossover on the architectures. In the lower architecture the framed subgraph is
swapped with the framed subgraph from the upper architecture.
the genetic algorithm uses the operators selection S, mutation M, and crossover C to construct the next generation of genotypes. This is performed V -times. The general scheme is
described as listed in algorithm 9.
Afterwards the genetic algorithm is finished, the best performing architectures, in form
of IPCodes, are assigned for further use. The IPCodes can be stored and reused whenever
needed. It would depend on particular use the specific number of genotypes to keep or
to proceed to additional testing. The actual implementation of the NNSU tool stores all
genotypes that emerge during the genetic algorithm run, so they can easily be located from
the fitness matrix, loaded and re-used.

4.6

T HE

PARAMETERS CONTROLLING THE TRANSITIONAL BEHAVIOUR

The size of any architecture decoded from an IPCode depends on number of building instructions within the IPCode, and thus on the overall length of the IPCode. And, of course on
number of D instructions within this IPCode. Once the number of blocks in the architecture
exceeds certain level, we may expect that the learning procedures will spend more computa-

68

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

Algorithm 9 Generic scheme for a genetic algorithm


1: Generate the initial population of a given size R . Each IPCode within initial population is
sampled at random from the IPCodes set, according to initial setup (IPCodes lengths,
internal parameters, etc.).
2: Evaluate the initial population. Assign each genotype within the initial population with
evaluation factors e. and then with a fitness evaluation f .
3: for 1 to V do
4:
Set up the selection distribution s according to the given type of selection distribution
construction, and the fitness values of the individuals in actual population f .
5:
for 1 to R do
6:
Selection. Using the selection S pick up two representations P1 , P2 from current
population at random.
7:
Crossover. With probability PC perform crossover C on P1 , P2 , i.e. calculate Qtmp
K =
C(P1 , P2 ), and set
(
Qtmp
with prob. PC
K
QC =
(4.30)
PK
with prob. 1 PC
8:

Mutation. With probability PM mutate with mutation M the result of crossover QC .


(
M(QC ) with prob. PM
QC,M =
(4.31)
QC
with prob. 1 PM

Filling the next population. Put the representation QC,M into the next population.
10:
end for
11:
Evaluate the next population. Assign each genotype within the next population with
evaluation factors e. and then with a fitness evaluation f .
12: end for
9:

tional time to proceed and, according to the neural networks experience, the topology with
higher number of NSU will tend to get more and easily over learnt.
During the whole run of the genetic algorithm there are only two moments when the
IPCodes inner parts may become altered and the lengths of the whole IPCodes might possibly
change. These moments are the following.
Initial population construction, the step 1 of the algorithm 9. Here, the length of each
generated IPCode P is bounded as
2MINBC 1 |P| 2MAXBC 1.

(4.32)

Recombination, the steps 7 and 8 of the algorithm 9. At each recombination step,


the IPCode inserted into new population is a result of a recombination which might
influence its length.
The initial random sampling of IPCodes is bounded as described in 4.32. The lengths of
IPCodes after recombination are controlled in two ways. Either by setting a global limit of
the IPCodes lengths throughout all generations of populations, or by setting the bounds for
one recombination step.
At each iteration step of the genetic algorithm, some new IPCodes are constructed and
put into new population. The way the IPCodes vary depends on the detailed definition of the

4.6.

T HE

PARAMETERS CONTROLLING THE TRANSITIONAL BEHAVIOUR

69

crossover and mutation mappings. The particular SubIPCodes represent compact subparts of
the IPCodes and they correspond to closed subgraphs in the architecture. Controlling these
SubIPCodes, their size and possibly other properties, we drive the components of architectures; we can achieve better understanding and also handling of the runs of genetic algorithm
by analysing the additional constraints we put on the IPCodes during the genetic algorithm
computation.
We will introduce only scale-oriented constraints to IPCodes structure. The global lengths
constraints would assert that the whole genetic algorithm process would not follow some
divergent path. The limits put on the recombination operators, on the other hand, allow for
detailed work with intensity of the genotypes recombination.

IPC ODE

LENGTH BOUNDS

From the inequalities in 4.28


the upper bounds for lengths of mutated SubIPCodes

describing
we have |M(P)| |P| + Qtmp . The Qtmp is randomly generated according to the algorithm
3 so we can write |M(P)| |P| + 2MAXBC 1. For the two cross-overed IPCodes we have
similar inequality bounding the growth of their lengths. |C(P)| |P1 | + |P2 |.
|C(P1 , P2 )| |P1 | + |P2 |, and |M(P)| |P| + 2MAXBC 1
In the initial population we have all IPCodes shorter than 2MAXBC 1. If we look at
the recombination and suppose that both crossover and mutation proceeded, we have O =
M(C(P1 , P2 )) and thus
|O| = |M(C(P1 , P2 ))|
|C(P1 , P2 )| + 2MAXBC 1
|P1 | + |P2 | + 2MAXBC 1
3 (2MAXBC 1)
The resulting upper bound corresponds to the common sense that the IPCode might,
at most, grow by two additional IPCodes of maximum length during crossover and mutation. During the whole run of the genetic algorithm we might get an IPCode of length
3V (2MAXBC 1). It as a question of number of generations V and initial maximum block
count MAXBC we may allow that might influence the maximum achievable length of IPCodes
during a single run of a genetic algorithm.
In order to tackle the undesired IPCodes growth during the successive recombinations, we
implemented two main limitsthe global limit capping the maximum length of any IPCode
throughout the whole genetic algorithm run, and the limit for each recombination step. The
first limit is carried out via maximum allowed IPCode length a maxIPCodeLen constant; the
latter is controlled by maximum recombined SubIPCode length via maxRecLen constant.
The maxIPCodeLen would relate to maximum acceptable number of blocks within the
initial population MAXBC , maxIPCodeLen = 2 MAXBC 1. In case we would require the
initial population to be limited by MAXBC and also that the number of blocks might double or
triple during the run, the maxIPCodeLen is set to the specific value according to actual needs.
The maxRecLen value assigns the maximum recombined length within each recombination
That is, any temporary IPCode Qtmp in mutation can not exceed this length,
tmpstep.

Q maxRecLen, as well as the swapped subcodes in crossover have to be of lengths


shorter than maxRecLen too, |sub (P, k1 )| maxRecLen, |sub (P, k2 )| maxRecLen.
This way we stabilise the discrepancy of the lengths of the IPCodes. At the same time we
restrict on certain part of the possible search space only.

G ENETIC

IPC ODES

maxRecLen unlimited
maxRecLen set to 7

0.2

maxRecLen = 7

0.1

0.04

Density

0.06

0.3

0.08

maxIPCodeLen unlimited
maxIPCodeLen set to 49

0.02

Density

ALGORITHM DEFINED ON

0.4

CHAPTER 4.

0.10

70

0.0

0.00

maxIPCodeLen = 49

20

40

60

80

100

120

10

IPCodes' Lengths

20

30

40

50

60

Recombined Length

Figure 4.8: The left figure shows histogram of lengths of recombined SubIPCodes with no
limit of maxRecLen. The right figure shows histogram of lengths of recombined SubIPCodes
with maxRecLen = 7.

4.6.1

R ECOMBINATION

ON

Codes(N )

SET

Almost immediately after introducing the two mitigating factors maxIPCodeLen and maxRecLen we faced occasional program hang-ups.The reason was that these bounds put too tight
conditions to be always fulfilled during recombination. The first parameter caps the total
length of the Code requiring that all Codes must lie in CodesN , thus also offspring emerging
during the recombination process are required to have length bounded by maxIPCodeLen.
The maxRecLen limits the lengths of interchanged SubICodes during the recombination itself.
If we apply these constraints on the recombination template, the straight formula for
crossover
(

with prob. PC
PK lK sub P(3K) , l(3K)
, where
C (P1 , P2 ) =
PK
with prob. 1 PC

K {1, 2}, lK {1, . . . , |PK |}; l(3K) {1, . . . , P(3K) }


extends by additional ones.
(

PK lK sub P(3K) , l(3K)


C (P1 , P2 ) =
PK

with prob. PC
with prob. 1 PC

, where

K {1, 2}, lK {1, . . . , |PK |}; l(3K) {1, . . . , P(3K) } subject to


|C (P1 , P2 )| maxIPCodeLen, and

(4.33)

|sub (PK , lK )| maxRecLen, and sub P(3K) , l(3K) maxRecLen

(4.34)

That is, we have to oversee three additional inequalities in 4.33 and 4.34. This set of equations and inequalities may disrupt the solutions set of the original equation system.
Think, for example, the two Codes A = (2, 2, 2, 0, 0, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 0, 0), and
B = (2, 2, 2, 0, 2, 0, 0, 2, 0, 0, 2, 0, 2, 0, 0) and the values of transitional parameters of maxIPCodeLen =

4.6.

T HE

PARAMETERS CONTROLLING THE TRANSITIONAL BEHAVIOUR

71

19, maxRecLen = 5. The length of the codes are |A| = 19, |B| = 15. Once we decide to recombine A on some position with subcode of B of length 5, |sub (B, l2 ) = 5|, we realize by soon
that there are no SubICodes allowing for proper recombination. There are only SubICodes
of length 1 and 3 within A which is not enough to keep resulting Code under maxIPCodeLen.
The SubICodes solution set narrowed and we have to specify the conditions accordingly.
The existence of solution is asserted only in case the positions of the SubICodes lK , l(3K)
are sought as early as all the possible SubICodes are identified within P1 , P2 . We gather the
admissible positions (AP) of existing subcodes within both IPCodes.
\
AP =
{l {1, . . . , |PK |} | |sub (PK , l)| maxRecLen }
K{1,2}

Algorithm 10 Solving recombinations with length constraints


1: Sample the random positions l1 within the Code A1 so that
l1 AP, and |sub (A, l1 )| maxRecLen
2:

Sample the random positions l2 within the Code A2 so that


l2 AP, |sub (A2 , l2 )| maxRecLen,
|A1 | |sub (A1 , l1 )| + |sub (A2 , l2 )| maxIPCodeLen, and
|A2 | |sub (A2 , l2 )| + |sub (A1 , l1 )| maxIPCodeLen

3:
4:

Sample K {1, 2} with uniform probability (p = 0.5).


Recombine A1 and A2 as

Aout = AK lK sub A(3K) , l(3K)

4.6.2

E LITISM THE

MEMORY OF THE ITERATIVE SYSTEM

The next parameter extending the abilities of genetic algorithm is elitism and it is literally
introducing memory to it. It directly affects the way the populations move through the
search space. It can ease the initial testing runs of the algorithm and ease convergence in
tough tasks. On the other hand, it might also cause the local or pre-mature convergence.
The optimization process, as it is defined in the previous paragraph, lacks memory in the
sense that any of the representation from the actual population need not to appear in the
next population. Naturally, during the optimization process the best performing individuals
are of concern. As scheduled in algorithm 9, new population is filled with individuals. The
individual either emerge being recombined or just copied from the members of the current
population.
The elitism simply asserts that the best performing genotype is kept and copied into next
generation. The genetic algorithm is provided with a memory between its particular steps.
The number of the copied genotypes might vary from one to the whole population, depending

72

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

on how large memory we want. Naturally, the more individuals we keep when moving to
the next generation the slower movement is induced within the whole context of the genetic
algorithm. As the genetic algorithm keeps a given portion of its genotypes unchanged, the
whole population is limited in choosing of its next position within the search space. It is also
possible that the whole search process gets stuck at some local optimum and become unable
to find any over-performing genotypes at some other position in the search space.
This is why the elitism should be used after conscious review of all conditions. Typical
use are indicative runs that disclose the behaviour of the genetic algorithm, or in situations
when standard genetic algorithm appears as hard to converge.
Algorithm 11 Genetic algorithm with elitism
1: Generate the initial population of a given size R . Each IPCode within initial population is
sampled at random from the IPCodes set, according to initial setup (IPCodes lengths,
internal parameters, etc.).
2: Evaluate the initial population. Assign each genotype within the initial population with
evaluation factors e. and then with a fitness evaluation f .
3: for 1 to V do
4:
Set up the selection distribution s according to the given type of selection distribution
construction, and the fitness values of the individuals in actual population f .
5:
for 1 to R new do
6:
Selection. Using the selection S pick up two representations P1 , P2 from current
population at random.
7:
Crossover. With probability PC perform crossover C on P1 , P2 , i.e. calculate Qtmp
K =
C(P1 , P2 ), and set
(
Qtmp
with prob. PC
C
K
Q =
(4.35)
PK
with prob. 1 PC
8:

Mutation. With probability PM mutate with mutation M the result of crossover QC .


(
M(QC ) with prob. PM
C,M
Q
=
(4.36)
QC
with prob. 1 PM

9:

Filling the next population. Put the representation QC,M into the next population.
end for
for k = R new + 1 to R do
Copy (k R new )th best performing genotype from the actual population to the next
population.
end for
Evaluate the next population. Assign each genotype within the next population with
evaluation factors e. and then with a fitness evaluation f .
end for

10:
11:
12:
13:
14:
15:

Regarding our particular NNSU toolbox case, the memory of the genetic algorithm was
implemented for two main reasons. Firstly, it was an extension and a natural improvement of
the original NNSU genetic algorithm implementation. This form used so-called queue model
where only two candidates were recombined at each step of the genetic algorithm, i.e. we
used a genetic algorithm with strong elitism. Secondly, we did not want to leave the concept
of elitism at all and keep the maximum flexibility.

4.6.

T HE

73

PARAMETERS CONTROLLING THE TRANSITIONAL BEHAVIOUR

Thus, we introduced the factor assigning the number of genotypes to be put in each
next population R new . Using this, we can effectively control the level of elitism. Once the
R new < R , the first R R new best performing individuals in the current population are copied
into the next one. Due to this, the maximum fitness in each population is non-decreasing
function of the generations. This behaviour is seen in figure 4.9. Also, we can read that in
the first 50 generations, the genetic algorithm with memory exhibits faster fitness growth
while the genetic algorithm without memory proves better convergence in the remaining
generations.

0.8

0.6

0.6

Fitness

Fitness

0.8

0.4
0.2

0.4
0.2

25

25
20
Me 15
mb
ers 10

5
0

10

30
20
tions
Genera

40

50

20
Me 15
mb
ers 10

5
0

10

30
20
tions
Genera

40

50

Figure 4.9: The left figure shows the evolution of the fitness in the genetic algorithm run
without memory. The right figure shows the evolution of the fitness in the genetic algorithm
run with memory.

4.6.3

G ENERALIZATION

ISSUE

The next question regarding finer tuning of genetic algorithm behaviour is a question of
generalization. It is directly tied with the fact that we are working with phenotypes of neural
networksthe optimized structures are architectures and their performance level depends
also on particular data used for training and evaluation for each individual NNSU instance.
Any neural network exhibits the ability to perform consistently on particular tasks, no matter
what specific data sample is processed. This consistently high performance on perturbed data
is considered as high level of generalization. Contrary, once a network ceases to perform
consistently on different data sets sampled from the same underlying process, it is said to be
over-learned (to a particular data sample) and is losing its generalization.
The over-learning issue is detected via evaluation on separate data samples (L and E data
introduced in subsection 4.2.1). With high portion of data to process it does not make no
constraints to count on with a special testing set. The NNSU instance is trained via L data and
verified via E data, the (generalization) testing data cannot step into the genetic algorithm
process since it can not influence the learning process nor the evolutionary one. And based
on the E data only, the calculated fitness automatically would not catch any indication of
generalization level.
The genetic algorithm optimizes the architectures according to information contained in
fitness and as such it is a second-level optimization that uses both training and evaluating

74

CHAPTER 4.

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

data as learning. As a verification that genetic algorithms do not disrupt the generalization
of underlying NNSUs we use a third type of data, the testing data (T data introduced also
in subsection 4.2.1). The best performing phenotypes produced by genetic algorithm are
re-tested on T data and the resulting fitness is compared to the fitness on E data. This way
we can prove the real generalization of the NNSU instances.
The f E and f T , respectively,
would
assign fitness
gained
on E and T data: f E =

f E L D(P), DL , DE and f T = f T L D(P), DL , DT . We define a generalization ratio (GR) of an IPCode decoded into architecture and learnt on L data as a simple ratio of
fitness calculated on T data denominated by fitness calculated on E data.
fT
=p
fE

(4.37)

In case of ranking fitness where the fitness is bounded by 1, or any other type with finite
maximum value (max f E ), we assign a generalization difference (GD) of the following form:

(4.38)
= f T max f E (f E f T )
The average (equal-weighted) of the GR and GD will exhibit a scalar assessment of the
NNSUs generalization level (GL) :
1
1
= +
2
2
Both the GD and GR fulfil the general requirements we would put on any other function that
should correctly measure the generalization. For a two-variate real function h : R2 7 R the
requirements are
the first variable is always higher than the second2 (written using f E and f T we say
f E f T );
h(0, 0) = 0;
h(0, t) = 0, for t R;
for any t R, the function h(, t) is (strictly) decreasing in first variable (the closer the
f E drops to f T , the better);
for any s R, the function h(s, ) is (strictly) increasing in second variable (the closer
the f T grows to f E , the better).
The fitness is constructed so that its evaluation is always a non-negative
real number.
p
Thus the GR is also non-negative and takes value between 0 and max f E . The GD values
stretch between 0 and the max f E . Sample results obtained during a genetic algorithm run
are listed in table 4.4.
NNSU
00000
00000
00000
00014
00023
00040
2

00000.net
00015.net
00023.net
00001.net
00010.net
00010.net

f E (NNSU)
0.745
0.660
0.699
0.992
0.991
0.995

f T (NNSU)
0.346
0.382
0.449
0.904
0.803
0.884

(NNSU)
0.401
0.470
0.537
0.908
0.806
0.887

(NNSU)
0.208
0.276
0.337
0.825
0.652
0.787

(NNSU)
0.305
0.373
0.437
0.866
0.729
0.837

This is rather an assumption coming out of the fact that the learning procedure optimizes the internal parameters of a NNSU instance and the resulting internal set up is optimal, meaning that for any other parameters
evaluation and learning data proposed, the NNSU cannot produce response of higher quality.

4.7.

G ENETIC

ALGORITHM SIMPLE RUNS

75

Generalization difference

1.0

1.0

Generalization ratio

0.8

0.8

0.6

0.6

0.4
0.2
0.0
1.0

Fit

0.6

on

0.2
0.0
1.0

Fit

0.8

ne
ss

0.4

0.4

`T

'd

0.4

0.2

ata

0.6

0.0 0.0

0.2

Fitnes

0.8

' data
on `E

1.0

0.8

ne
s

s o0.6
n ` 0.4
T' 0.2
da
ta

0.4

0.0 0.0

0.2

s on
Fitnes

0.6

0.8

1.0

ta
`E' da

Figure 4.10: Generalization ratio and generalization difference .


Table 4.4: Generalization ratios and differences assigned to sample NNSU instances.
The first three networks originate from the initial population, the last three networks
emerged during recombination.

The table showing GR and GD will be appended to each results of any genetic algorithm
run since it provides additional information regarding the successful optimization together
with the generalization level information.

4.7

G ENETIC

ALGORITHM SIMPLE RUNS

Before we proceed to the next chapter where we will discuss the real use of genetic algorithms on real data, we take a short review of simple runs of genetic algorithm without
complex data, nor structured fitness. We rather show how the run of a genetic algorithm
should look, we will outline the basic outputs of the genetic algorithm together with graphs
and reporting charts. The following two examples would provide us with an initial insight
into runs of genetic algorithms, a general scheme of genetic algorithms behaviour, without
specific optimization performance being at focus.
In order to show a simple form of a genetic algorithm based on IPCodes we must bear in
mind that the search space is a set of graphs, not a usual multidimensional RN . We need to
adjust the form of fitness function to it. We will use two basic forms: a random fitness, and
a fitness reflecting size of a graph (the number of blocks).
The sizing of the genetic algorithmnumbers of populations R and generations V do
not play no significant role for now. Within the following illustrative examples we would not
arrange a large scale scenarios. Also the probabilities of crossing over and mutating of the
genotypes are set to more or less common values (PC = 0.8, and PM = 0.1).

76

CHAPTER 4.

4.7.1

R ANDOM

G ENETIC

ALGORITHM DEFINED ON

IPC ODES

FITNESS

The most trivial case is to let the quality of a phenotype to be randomly distributed. Fitness
evaluator was set up to return randomly distributed values from interval [0, 20]. The genetic
algorithm then, only recombines the representations without gaining any significant increase
of fitness. The genetic algorithm was set up with population of size R = 10 and number of
generations V = 150.
A very illustrative description of the run of the genetic algorithm is a graph of evolution
of the fitness from the initial population to the last one, with generations in one axis and
the particular members of each population in second. The graph is called 3D-fitness graph.
A two dimensional version of the fitness evolution shows only generations with maximum
and average fitness over each population. The graph is called 2D-fitness graph. Both of the
mentioned graphs are shown in figure 4.11. We can see in the 3D-fitness graph that the fitness really does not show no trend as it is randomly sampled independently on the particular
phenotype. As expected, the values stretch from 0 to 20, with no significant structure.

2D-FITNESS |10 INDIV|150 GEN|


20
15

fitness

FITNESS

15
10

10

5
10

mean fitness
min fitness
max fitness

ME
6
MB
ER
S

150
100

4
50

2
0

ATIO
GENER

NS

0
0

50

100

150

generations

Figure 4.11: Two illustrations of a genetic algorithm run, a 3D-fitness (left) and 2D-fitness
(right) graphs. In this special case of random fitness, the values stretch from 0 to 20.
The next two descriptive graphs show average, minimum and maximum values of number
of blocks within architectures, and the IPCodes lengths across the populations. These graphs,
see figure 4.12, provide us with additional information since the size of the architectures nor
the representations length are not active components of fitness (the lengths limits are rather
part of the genetic algorithm environment). We can read how did the IPCodes lengths evolve
during the run, and we can also judge the involvement of D instructions from the differences
between IPCodes lengths and number of blocks in architectures.
In this specific random fitness case, we can read that the fitness value is fully independent
on the design of the particular IPCodes in populations. The genetic algorithm does not
obtain no information for any reasonable architecture handling and thus it behaves in a
pure chaotic way. By accident, starting with the eighth generation, the IPCodes contain
only D instruction. The decoded phenotypes are only two-blocks (IN, OUT), whereas the
underlying IPCodes encode 15 to 25 building instructions that are all cancelled out by the
Ds.

4.7.

G ENETIC

ALGORITHM SIMPLE RUNS

77
2D-REPRESENTATION LENGTHS |10 INDIV|150 GEN|

2D-BLOCK COUNTS WITHIN NN |10 INDIV|150 GEN|


6
40

representation length

block count

30

20

10

2
0

mean fitness
min fitness
max fitness

50

100

mean fitness
min fitness
max fitness

150

50

100

150

generations

generations

Figure 4.12: Two further illustrations of a genetic algorithm run: 2D graphs showing the
minimum, maximum, and average block counts within architectures (left), and the
lengths of the IPCodes (right). Note the seemingly contradictory evolution of the representation lengths and the number of blocks within architectures, caused by the random fitness.

4.7.2 OPTIMIZATION OF NETWORK SIZE

The second illustrative run of the genetic algorithm works with a simple fitness defined
as the number of nodes within the architecture (counting in the IN and OUT nodes):

    f(D(P)) = |D(P)| = \sum_{j \in V_{D(P)}} d(j)
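Such a size-based fitness reduces to counting the vertices of the decoded graph. The sketch below only illustrates the idea; the adjacency-list representation and the function name are assumptions, not the actual NNSU tool interface.

```python
def network_size_fitness(architecture):
    """Fitness of a decoded architecture = number of nodes, IN and OUT included.

    `architecture` is assumed to be a dict mapping every node to the list of its
    successors (an adjacency-list view of the acyclic block graph).
    """
    return len(architecture)

# A minimal acyclic architecture: IN -> 10 -> 11 -> OUT plus a skip edge IN -> 11.
example = {"IN": ["10", "11"], "10": ["11"], "11": ["OUT"], "OUT": []}
assert network_size_fitness(example) == 4
```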

The genetic algorithm was again set up with a population of size R = 10; the number of generations was set only to V = 20, which later proved sufficient. The probabilities of crossing
over and mutating the genotypes are set to more or less common values (PC = 0.8 and
PM = 0.1).
Once we provided the genetic algorithm with a quality criterion, articulated in terms of a
fitness function, it appears to perform well regarding the optimization of the genotypes' structure.
We can see the 3D and 2D fitness graphs in figure 4.13. The 3D-fitness graph is no longer
hairy and lacking any trend; quite the opposite, it grows as new generations proceed. In
the 2D-fitness graph we can see that also the span between the minimum and maximum fitness
reached within individual populations narrows, which documents the tight convergence of
the genetic algorithm.
The 2D-fitness graph mainly reveals whether the genetic algorithm is able to find and
effectively recombine better performing genotypes between generations; we simply check
whether the maximum fitness exhibits an increasing trend. The 3D-fitness graph provides
a more illustrative representation of the genetic algorithm optimization since we can read the
step-by-step evolution. The use of the 3D-fitness graph might become cumbersome as the
dimensionality of the genetic algorithm grows.
The block counts and IPCode lengths exactly copy the fitness: the first naturally, because the number of blocks equals the fitness, and the IPCode lengths were in strict correlation with the block counts (no use of the D instruction).


Figure 4.13: 3D-fitness and 2D-fitness graphs for graph size optimization.

Figure 4.14: The block counts within architectures, and the lengths of the IPCodes. In this
run, these were in tight accord.

5
Real data application
Throughout the previous chapters we introduced all the parts needed to define the genetic
algorithm optimizing NNSU architectures. At the end of the previous chapter we defined the
genetic algorithm itself, discussing all its internal and external parameters.
This chapter concerns the application of the genetic algorithm discussed in the previous chapter. The genetic algorithm is a part of the NNSU tool and serves various
purposes. It might be run on any type of data that the NNSU processes, e.g.
some benchmark data as described in (Vachulka, 2006). As for real data classifications,
the NNSU tool is mostly applied to high-energy physics data, see (Hakl et al., 2002a) and
(Hakl, Jirina, Richter-Was, 2005). As we want to focus on demonstrating the properties of the
genetic algorithm on a closed task, we will use only one type of data so that the results and
reports are not parametrized by the data type.
In the first section we will give a short walk through the data, a brief description of
the data used for the testing purposes and creating result reports. We will review the data
dimensionality, the data sample sizes, etc. In the second section we will focus on the genetic
algorithm parameters settings. We will set up intervals of acceptable values for all genetic
algorithm properties and parameters, and the weights of evaluation factors, and we will
discuss the number of individuals and generations to use.
The following sections list the reports from the genetic algorithm usage in the NNSU tool:
the comparison of random and evolutionary seeking, and also a demo of a common usage
of the genetic algorithm in the NNSU tool.

5.1 THE DATA DESCRIPTION

The main usage of the NNSU, for the time being, is data separation, and this task is also
the main subject of the tests throughout this chapter. All the results listed hereunder are
calculated on a specific data set. We describe this data in the first subsection; its properties
are listed in the second one. These are data generated by simulations of particle physics
processes; these simulations, among other purposes, contribute to the preparation
of the real particle collider at the CERN centre in Geneva. Although these processes are still
in the simulation phase, the data proved to be the right choice as they thoroughly tested the
limits of the NNSU.
The data are, in general, separable by the NNSU, but due to their complexity there is
still some room left for further improvements of particular networks. For instance, we
can observe a slight loss of generalization when switching from the evaluation to the testing data sets.
As we will see, the results produced so far seem very promising and we might expect a successful
application of the NNSU tool on the raw data from real particle collisions.
5.1.1 THE HIGH ENERGY PHYSICS DATA

The data originate from the field of high energy physics in the CERN ATLAS research. See
some of the ATLAS NOTES (ATL-PHYS-TDR, 1999) or (ATLAS Workshop, 2005). More
specifically, we speak about hadronic tau identification. The following text in this subsection
is a citation from the ATLAS NOTE (Hakl, Jirina, Richter-Was, 2005). It contains a description
of the background of the data creation, and the descriptions of the particular variables.
Identification of hadronic tau decays will be the key to the possible Higgs boson
discovery in the wide range of the MSSM parameter space (ATL-PHYS-TDR, 1999).
The h/H/A to tau-tau and charged-Higgs to tau-nu channels are promising in the mass range spanning
from roughly 100 GeV to 800 GeV. The sensitivity increases with large tan beta and
decreases with rising mass of the Higgs boson. The Higgs to tau-tau decays will give access
to the Standard Model and light Minimal Supersymmetric Standard Model Higgs
boson observability around mH = 120 GeV, with the Higgs boson produced by vector-boson fusion (Asai et al., 2003). The hadronic tau identification is also very important
in searching for supersymmetric particles, particularly at high tan beta values (ATLAS
Workshop, 2005).
(quoting from (Hakl, Jirina, Richter-Was, 2005))
Hadronic tau identification in ATLAS has been studied for several years as a key benchmark
signature for optimizing the detector design and, presently, for optimizing the performance of the
final reconstruction algorithms. The studied task is the reconstruction and identification of true
and fake hadronic taus, here for the qq to Z to tau-tau, qq to W to tau-nu, and QCD dijet
events. Two exclusive features of hadronic tau decays are explored: the single-prong and three-prong signature, the presence of only charged hadronic energy (one or three charged pions), and of only
neutral pure electromagnetic energy (neutral pions).
The same signal and background samples, as discussed in (Richter-Was and Szymocha,
2005), are used to evaluate the performance of the proposed methods. As signal (tau S), we
consider reconstructed tau candidates from tau decays in pp to W and pp to Z
events. As background (tau B), we consider tau candidates from the QCD shower in the same pp
to W, pp to Z events and in QCD dijet events (sample with p_T^hard > 35 GeV).
For the classification procedure, calorimetric observables as described in detail in (Richter-Was and Szymocha, 2005) are used. Separately we optimize the identification procedure for
single-prong (1P tau) and three-prong (3P tau) candidates. The 1P tau is seeded by the leading
hadronic track at the vertex. The 3P tau is seeded by the bary-center
of three nearby tracks. The calorimetric observables are calculated from the energy deposition
in cells within a distance R = 0.2 from the seed.
The following calorimetric and tracking variables are used to build the discriminating observables:
1. Track transverse momentum of the leading track, p_T^track (or the scalar sum of track transverse momenta in the case of 3P tau candidates).
2. Electromagnetic radius of the tau-candidate, Rem.
3. Number of strips, Nstrips, i.e. strips with energy deposition above a certain threshold.
4. The width of the energy deposition in strips, Wstrips.
5. The fraction of the transverse energy deposited, fracET^R12, in the 0.1 < R < 0.2 radius with respect to the total energy in the cone R = 0.2. Cells belonging to all layers of the calorimeter are used.
6. The ratio of the energy deposited in the hadronic calorimeter, ET^chrgHAD, and the track transverse momentum, ET^chrgHAD / p_T^track (or the sum of transverse momenta in the case of 3P tau candidates).
7. The ratio of the energy deposited in the calorimeters in a ring 0.2 < R < 0.4 with respect to the total energy deposited in a cone R < 0.4, ET^chrgEM / ET^calo and ET^chrgHAD / ET^calo.
The variables above are used directly, without any assumptions on the possible correlations.

5.1.2 DATA PROPERTIES

We see that the data are 7-dimensional, consisting of the track transverse momentum, the electromagnetic radius, the number of strips, the width of the energy deposition, one fraction of the transverse energy, and two ratios of energy. Signal data (tau S) represent tau decays in pp to W
and pp to Z events, and background data (tau B) represent the QCD shower in the same
pp to W, pp to Z events and in QCD dijet events.
All variables are continuous, except for the number of strips Nstrips, which is categorical.
Still, from the NNSU point of view this does not make any difference: the data pre-processing
within each NSU normalizes each data variable before it is input into the regression.
The data are simulated, as we can read in (Richter-Was and Szymocha, 2005), and
each data record is assigned a data type flag specifying whether the observation
corresponds to a signal or background event. We need to split the data into L, E, and T tiers; when
preparing the data for the NNSU task runs we decided for a distribution with higher emphasis
on the L data. The numbers of the created sets are listed in the following table.

                         N        NS       NB
    Learning data set    6 821    2 860    3 961
    Evaluation data set  3 812    1 906    1 906
    Testing data set     3 812    1 906    1 906

Table 5.1: Numbers of L, T, and E data. The total number of data records is assigned as N, the number of signal data as NS, the number of background data as NB.
The L data are important for adjusting the internal parameters of a NNSU. Thus, this set is created
as large as possible, counting 6 821 records; it was actually made of all data we could
afford for learning purposes. The numbers of signal and background records are not equal here because of
the underlying simulation process. The unbalanced size of the learning set is not unusual
and we are confident it does not influence the learning process. In other words, we simply
use all information contained in the sample.
In the evaluation and testing procedures we do not need as many data records and we simply
reduce the number of background data records to be equal to the signal ones. The L data
overlap neither the E data nor the T data. The E and T data have one third of the data records
in common. At this number of data records the statistical significance of the evaluating
tests is not biased by low numbers of observations.
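A sketch of such a split is given below. It is only an illustration under stated assumptions: the records are addressed by index, the class balancing of the E and T sets is omitted, and the splitting code of the NNSU tool itself is not reproduced here; only the set sizes and the one-third overlap between E and T described above are mimicked.

```python
import random

def split_indices(n_records, n_learning, n_eval, overlap_fraction=1/3, seed=0):
    """Return a disjoint learning set and evaluation/testing sets that share
    `overlap_fraction` of their records, mimicking the L/E/T split above."""
    rng = random.Random(seed)
    idx = list(range(n_records))
    rng.shuffle(idx)
    learning = idx[:n_learning]
    rest = idx[n_learning:]                      # never used for learning
    n_shared = int(n_eval * overlap_fraction)    # records common to E and T
    shared = rest[:n_shared]
    unique = rest[n_shared:]
    n_unique = n_eval - n_shared
    evaluation = shared + unique[:n_unique]
    testing = shared + unique[n_unique:2 * n_unique]
    return learning, evaluation, testing

L, E, T = split_indices(n_records=6821 + 2 * 3812, n_learning=6821, n_eval=3812)
print(len(L), len(E), len(T), len(set(E) & set(T)))   # 6821 3812 3812 1270
```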

5.2 PARAMETERS SETTINGS

We already outlined all the parameters that are essential for the whole run of the genetic
algorithm to proceed correctly and bring contributive results. We did not, however, discuss any specific values of the parameters to be set up so that we gain optimal behaviour of
all concerned parts of the genetic algorithm. The parameters need to be, and usually are,
adjusted as soon as we apply the genetic algorithm to specific data. We must address all
the connected components, such as the underlying data, the requirements put on the NNSU size and complexity, specific requirements on the outputs, etc., and then approach the final design of the genetic
algorithm.
The parameters we will subsequently adjust are split into four groups that correspond to
the four sub-parts of any genetic algorithm we are to work with. Within these four groups
we find nineteen parameters. The four groups relate to the NNSU instances, the fitness
function design, the recombination operators settings, and the advanced parameters for transitional behaviour. Let us now look at the parameters in more detail and specify their setup
with respect to the standard use of the NNSU tool on the hadronic tau data.

5.2.1 NNSU PARAMETERS SETTING

We already mentioned in section 2.3, dedicated to IPCodes, that the labels within the IPCode
structure describe the inner structure of NNSU blocks; we specified the parameter set PS in
equation 2.15. The vectors from PS are bounded within intervals given by the bounds MINBC,
MAXBC for the numbers of blocks within any architecture, MINNSUC, MAXNSUC defining the inner
structure of blocks, and MINClC, MAXClC for the particular numbers of NSUs within blocks.
The recommended numbers of blocks, NSUs, and clusters range rather among lower
values. This is done mainly for two reasons. First, we try to keep to the ground idea of evolutionary
algorithms that the genotype subparts, the so-called building blocks, should be small and
compact rather than big and structured; this ensures that the building blocks are universally
recombined. From the neural networks point of view, the blocks should contain a low number
of NSUs and be light in performance. Second, the concentration of computational units within
a NNSU might lead to a decrease of the generalisation level. The over-learning effect is always a
threat, especially since the lengths of the IPCodes, and thus the size of the NNSU instances, might
grow during the course of the genetic algorithm.
The optimal design of the NNSU is discussed in thorough detail in the work by Vachulka
(Vachulka, 2006). Based on the numbers recommended in this work we will use the
following set-up throughout all of the tests in this book.

    MINBC = 3      MAXBC = 25
    MINNSUC = 1    MAXNSUC = 3
    MINClC = 2     MAXClC = 10

These bounds induce a space counting 2^23 · 2^3 · 2^9 = 2^35 states, which is approximately
34.3 billion. With an average learning and testing CPU time of the order of tens of seconds, this
would represent thousands of years of undistributed calculations. And we have not yet
counted in the possible inner configurations of blocks, nor the combinations of their interconnections.
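For orientation, the arithmetic behind this estimate is reproduced below; the 10-second learn-and-test time per instance is an assumption within the order of magnitude quoted above.

```python
states = 2**23 * 2**3 * 2**9          # = 2**35 combinations induced by the bounds
seconds_per_instance = 10             # assumed average learning-and-testing time
total_seconds = states * seconds_per_instance
years = total_seconds / (365 * 24 * 3600)
print(f"{states:,} states, roughly {years:,.0f} years of sequential computation")
# -> 34,359,738,368 states, roughly 10,895 years of sequential computation
```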

5.2.2 FITNESS FUNCTION DESIGN

In order to correctly adjust the fitness function we simply followed the steps listed in the implementation paragraph in sub-section 4.3.5. We needed to prepare a sample of evaluation factor
responses for some random NNSUs on the hadronic tau data. We decided to create a sample response of 150 NNSUs constructed according to 150 random IPCodes. The summarizing
statistics on the sample we obtained are laid out in the table below.
We showed this table in section 4.3 only as an illustration, but it is now that
it comes to real use. Based on the 150 constructed NNSU instances,
learned on the hadronic data and evaluated with the particular evaluation factors, we calculated
the means and also the empirical estimates of the evaluation factor distributions, and used them to
set up the transform functions Ts and Tr.
    Evaluation Factor    Min     Max     Mean    Std
    eHST                 7.20    8.11    7.70    0.1680
    ePO                  0.65    0.72    0.68    0.0126
    eROC                 2.08    2.50    2.38    0.0687
    eMSE                 1.27    1.32    1.30    0.0098

We can see that on the tau data we obtained sample means ranging from 0.68 for ePO to 7.70
for eHST; the most volatile factor is eHST with a standard deviation of 0.17.
We should stress that the count of 150 distinct evaluations does not mean
that we exhaust the measuring powers of this benchmark after the evaluation of another 150
NNSU instances. The observed statistics are order-independent and the maxima and minima
are realized on different NNSUs. Our optimization seeks a NNSU that would reach or
approach the maxima of all factors at once.
SCALING TRANSFORM
Using the mean estimates from the calibrating observations listed above we directly set up the
denominators used in the scaling transform Ts (equation 4.16):

    ẽHST = eHST / 7.70,    ẽPO = ePO / 0.68,    ẽROC = eROC / 2.38,    ẽMSE = eMSE / 1.30
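As a minimal sketch (not the NNSU tool's code), the scaling transform amounts to dividing each raw evaluation factor by its calibration mean; the dictionary keys below simply name the four factors from the table above.

```python
# Calibration means of the evaluation factors, taken from the table above.
CALIBRATION_MEANS = {"eHST": 7.70, "ePO": 0.68, "eROC": 2.38, "eMSE": 1.30}

def scaling_transform(factors):
    """Scaling transform Ts: divide each raw evaluation factor by its mean."""
    return {name: value / CALIBRATION_MEANS[name] for name, value in factors.items()}

# A NNSU whose raw factors equal the calibration means is mapped onto all ones.
print(scaling_transform({"eHST": 7.70, "ePO": 0.68, "eROC": 2.38, "eMSE": 1.30}))
```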

RANKING TRANSFORM
The scaling transformation incorporates the means of the calibrating observations. The ranking transform uses the whole empirical distribution of each evaluation factor to create the
ranking levels. The ranking levels are mapped onto the unit interval [0, 1], i.e. we use the
images of the empirical distribution function assigned to each evaluation factor. Figures
5.1 and 5.2 show the empirical distributions for each factor, together with Q-Q plots against
a fitted normal distribution.
We can see in the figures that the distributions are symmetric and quite close to the normal
distribution. The Q-Q plots also provide the Kolmogorov-Smirnov statistics. The most deviant
is the distribution of the ePO evaluation factor, which appears to have its upper tail bounded.
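A minimal sketch of the ranking transform is shown below, assuming the 150 calibration values of a factor are available as a plain list; the empirical distribution function then maps any new factor value onto [0, 1]. This is an illustration of the principle, not the thesis implementation.

```python
import bisect

def make_ranking_transform(calibration_values):
    """Ranking transform Tr: return the empirical CDF of the calibration sample."""
    sorted_values = sorted(calibration_values)
    n = len(sorted_values)

    def rank(value):
        # Fraction of calibration observations not exceeding `value`.
        return bisect.bisect_right(sorted_values, value) / n

    return rank

# Example with a made-up calibration sample for one factor.
rank_eROC = make_ranking_transform([2.08, 2.20, 2.31, 2.38, 2.44, 2.50])
print(rank_eROC(2.38))   # 0.666..., since 4 of the 6 calibration values are <= 2.38
```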
WEIGHTS SETTINGS
Once we have the transform functions adjusted, we can move on to the definition of individual
weights for each evaluation factor. The transformed factors are weighted according to the
specific use, the user's requirements, and also with respect to the merit of the task to be solved.

[Figure: empirical distributions of eHST (mean 7.70, std 0.169; KS test D = 0.06, p-value = 0.69) and ePO (mean 0.68, std 0.013; KS test D = 0.06, p-value = 0.65), each shown with factor thresholds, factor rankings, and a Q-Q plot against the fitted normal distribution.]

Figure 5.1: Ranking transformations for evaluation factors eHST and ePO.


For the hadronic tau data the weighting in the NNSU tool was driven by the explanatory power of each evaluation factor; we also reflected the direct relevance to the hadronic
tau data task. The weights arrangement is listed in the following table for each factor, with the
appropriate weight value and a brief reasoning for it.

wHST = 1.0   The histogram-based evaluation factor captures very closely the requirement of separation of the different event types and their concentration in the areas of BW and SW. We set the full 100 % weight to the eHST factor.

wPO = 0.5    The posterior evaluation factor reflects very specific powers of the neural network; especially, it is able to incorporate the level of generalization. On the other hand, the ePO factor does not support the continuous character of BW and SW. We set a 50 % weight to the ePO factor.

wROC = 1.0   The ROC evaluation factor is mainly focused on the SW area and consistently reflects the ASR to ABR ratio. We set a 100 % weight to the eROC factor.

wMSE = 0.1   The MSE evaluation factor is rather a general quality measure that might be useful as an indication in the analysis phase of a project. It does not add any specific feature to the hadronic tau data task. We set a 10 % weight to the eMSE factor.

[Figure: empirical distributions of eROC (mean 2.38, std 0.069; KS test D = 0.06, p-value = 0.56) and eMSE (mean 1.30, std 0.010; KS test D = 0.07, p-value = 0.54), each shown with factor thresholds, factor rankings, and a Q-Q plot against the fitted normal distribution.]

Figure 5.2: Ranking transformations for evaluation factors eROC and eMSE.


The final group of fitness function parameters specifies the transform function used and
the weight values. The weights for the evaluation factors are always used as listed; the
transform function may change between Ts and Tr.

    Transform function: Ts
    wHST = 1.0    wPO = 0.5    wROC = 1.0    wMSE = 0.1

Table 5.2: Fitness function design part of the overall genetic algorithm parametrization table.
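Under these settings the composite fitness is simply a weighted sum of the transformed evaluation factors. The sketch below only illustrates the aggregation step and assumes the transformed factors are already available; whether and how the sum is normalized is left to the tool.

```python
WEIGHTS = {"eHST": 1.0, "ePO": 0.5, "eROC": 1.0, "eMSE": 0.1}

def composite_fitness(transformed_factors, weights=WEIGHTS):
    """Aggregate the transformed evaluation factors into a single scalar value."""
    return sum(weights[name] * transformed_factors[name] for name in weights)

# With the ranking transform every transformed factor lies in [0, 1], so the raw
# weighted sum lies in [0, 2.6]; dividing by sum(WEIGHTS.values()) would map it
# back onto [0, 1] if a normalized fitness is required.
print(composite_fitness({"eHST": 0.9, "ePO": 0.8, "eROC": 0.95, "eMSE": 0.5}))
```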

5.2.3 RECOMBINATION OPERATORS SETTINGS AND ADVANCED PARAMETERS FOR TRANSITIONAL BEHAVIOUR

The transitional behaviour of the genetic algorithm is driven by the triplet of operators of
selection, mutation, and crossover. These operators determine the way the IPCodes evolve
over each generation, how distinctive the selection will be, and what the intensity of
the genotype modifications will be. Also, the dimensionality of the genetic algorithm environment
represents an important question: the number of genotypes within a population, together
with the number of generations, and the memory specification R^new (if used), are significant
to the resulting path of the genetic algorithm process. We can choose large populations
replicated only a few times over the search space, or a narrow set which traverses the search
space through many subsequent steps.


We need to adjust the global limit on the overall IPCode length; this limit, together with the limit for recombination operations, helps to control and mitigate the high volatility of the IPCode lengths between the generations. As we set the maximum number of blocks within an architecture to
MAXBC = 25, the maxIPCodeLen should not be lower than 2 · 25 − 1 = 49. So we set this
minimal length to 50; in some situations it might be set to higher values. The maxRecLen
parameter is set to a low value, so that we ensure that only small parts of IPCodes
undergo the recombination.
The maximum recombined length of maxRecLen = 7 allows only for swapping of 5 different types of ICode structures. A typical set up of the recombination part of the genetic
algorithm will look as follows.
    PC = 0.8                     PM = 0.1
    Selection type: s^{r,lin}
    R = 80                       R^new = 70
    V = 300
    maxIPCodeLen = 50
    maxRecLen = 7
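For illustration, a generic linear ranking selection is sketched below. It is not necessarily the s^{r,lin} operator defined earlier in the thesis, only a common textbook form with the same flavour: the selection probability grows linearly with the fitness rank rather than with the fitness value itself.

```python
import random

def linear_ranking_selection(population, fitnesses, n_selected, rng=random):
    """Select n_selected individuals with probability proportional to their rank
    (the worst individual gets rank 1, the best gets rank len(population))."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return rng.choices(population, weights=ranks, k=n_selected)

population = ["A", "B", "C", "D"]
fitnesses = [0.2, 0.9, 0.5, 0.7]
print(linear_ranking_selection(population, fitnesses, n_selected=4))
```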

Now we are through all the main parametrizations of the genetic algorithm environment,
and we can move on to the first report from the experiments.

5.3 EVOLUTIONARY OPTIMIZATION USED IN NNSU TOOL

Our first review of real experiments introduces a short-term genetic algorithm run; in this form,
the genetic algorithm appears in the NNSU tool. We will show that this is reasonable, as the
genetic algorithm does find over-average NNSU instances, technically of a better quality level
than a random search. We also show that for a high required generalization level, the
genetic algorithm behaves better compared to a random search approach.

5.3.1 MOTIVATION

One of the main uses of the IPCodes, and also of the evolutionary optimization, is its incorporation in the NNSU tool as an integral part providing effective work with architectures. The
low CPU-time demands of the NNSU learning procedures suggest using larger architectures, possibly with more complex edge connections. We claimed in figure 1.4 in section
1.1 that even a network built of thirty NSUs with an average number of data flows of
one hundred is learnt within ten seconds. With such a high number of feasible architectures at hand, it
becomes almost infeasible to offer all achievable architectures, e.g. in a combo box of
the application.
Naturally, any user might as well input any architecture in an XML file, or design the architectures by hand or in some external editor. Filling in the adjacency matrix and all
the required descriptions by hand into an XML-tagged file will sooner or later annoy any user,
all the more when the computational feasibility opens the counts of blocks to, at least, two-digit
numbers. Also, our attempts to use a graphical architecture editor turned out rather cumbersome,
as drawing and connecting tens of blocks would be closer to another random generating process than
to a serious design of an architecture.
We decided to use the already existing approach of IPCodes, to relax the particular structure or data flow map of the acceptably performing architecture, and to let the application seek
for such an architecture itself.

For any specific data, the program runs an acceptably small
genetic algorithm task and returns a narrowed-down set of NNSU instances. Thanks to the short optimization process, these NNSUs are supposed to outperform any randomly sampled NNSU
for the specific data set. Naturally, they can be stored for later use.
The advantage of such an advanced technique would be proven by the following three
subjects:
- the ability to find solution architectures (for example in comparison with random search);
- an acceptable level of generalization of the solution architectures; and
- a low portion of time required to find solution architectures (time to finding solution).

5.3.2 DESCRIPTION

For a comparison we simply run a genetic algorithm and compare the results and outcomes
to the results obtained from a random search based on an equal number of NNSU instances. That
is, we randomly generate as many IPCodes as appear during the run of the genetic
algorithm, i.e. R · V IPCodes, decode them into NNSU instances, and learn and evaluate them
on the hadronic tau data.
We run a genetic algorithm with the number of genotypes R = 30 and the number of generations to proceed V = 80. Thus, the genetic algorithm processes 2,400 genotypes in the whole. The other parameters of the genetic algorithm are listed in the table below.

GA S01 parameters setup
    MINBC = 3      MAXBC = 25
    MINNSUC = 1    MAXNSUC = 3
    MINClC = 2     MAXClC = 10
    Transform function: Tr
    wHST = 1.0   wPO = 0.5   wROC = 1.0   wMSE = 0.1
    PC = 0.8     PM = 0.1
    Selection type: s^{r,lin}
    R = 30       R^new = 27
    V = 80
    maxIPCodeLen = 91
    maxRecLen = 25

The number of architectures evaluated within the random search was set to 2,400 as well, and the parameters for BC, NSUC, and ClC are the same.
The calculation of the genetic algorithm required approximately 90 minutes to finish, whereas the random search took 110 minutes. This difference might be caused by a dependency present in the genetic algorithm calculation: the average number of blocks within an architecture declined during the first 20 generations of the genetic algorithm run and stabilized around 10 blocks, while the average number of blocks during the run of the random search is not influenced and stays around 3 + (25 − 3)/2 = 14 blocks. The higher average number of blocks may lead, in the end, to higher computational time demands.

5.3.3 RESULTS

The overall results are shown in the fitness evolution for the genetic algorithm, figure 5.3; the summary of the random search is listed in the table below. We can see that both the genetic algorithm and
the random search reached individuals of over 90 per cent quality. The genetic algorithm ended
slightly better as for the overall quality (more than 0.95); the best performing instance
did not appear as the best generalizing one, though the genetic algorithm found NNSUs with both high quality (more than 0.9) and high generalization.
The maximum fitness within the random population was 0.967. In table 5.4 we can
see that the NNSU u001186 with the maximum fitness 0.967 ranked as sixth due to a weaker response on T data.

    Number of evaluated NNSU instances    2,400
    Minimum fitness                       0.019
    Maximum fitness                       0.967
    Fitness mean                          0.404
    Fitness std                           0.218

Table 5.3: Basic statistics on random search on 2,400 NNSUs.

At the same time, the NNSU u000591 with the third E data rank provides
a high quality response on T data, which makes it the first one according to GR.
The progress of the genetic algorithm is seen in figure 5.3 (both the 3D-fitness and 2D-fitness graphs). We can see that the number of generations was slightly overstated because
the process reached the maximum fitness already in the 31st generation, i.e. within 38 % of its
overall course. Moreover, the most powerful member appeared already in the 14th generation.
    NNSU instance    fE      fT      (generalization indicators)
    u000591          0.916   0.808   0.844   0.721   0.783
    u001433          0.947   0.801   0.824   0.685   0.754
    u002380          0.917   0.773   0.808   0.662   0.735
    u002232          0.923   0.773   0.805   0.657   0.731
    u001186          0.967   0.750   0.762   0.587   0.675
    u002295          0.921   0.677   0.705   0.512   0.609

Table 5.4: Short run RS: Best performing NNSUs according to the GL.
The overall comparison of the genetic algorithm and the random search comes in table 5.5. The
top two NNSU instances found by GA clearly overtake the top two NNSU instances found by RS, both according to the fitness fE, with fE(00014 00001) = 0.992 and fE(00031 00047) = 0.995 higher
than fE(u000591) = 0.916 and fE(u001433) = 0.947, and according to the generalization level, with 0.866 and 0.837 for the GA instances against 0.783 and 0.753 for the RS instances.
EVALUATION FACTORS
The histogram and ROC curve graphs for the best performing NNSU instances, 00014 00001 for the genetic algorithm and u000591 for the random search, are shown in figures 5.4 and
5.5 respectively. For both NNSUs there is a slight deterioration of the response quality on T
data compared to E data; the difference is more visible for the NNSU optimized by the genetic
algorithm. The fitness performance, together with the generalization indicators, assigns better results
to the NNSU optimized by the genetic algorithm, as seen from table 5.5 where the two
best-performing NNSUs from each approach are listed.
The reason is that the genetic algorithm incorporates the feedback from the evaluation
factors and seeks more fitting individuals. For instance, let us look at the histogram charts
in figure 5.4 for 00014 00001 and figure 5.5 for u000591. The optimized NNSU does reflect the
BW and SW intervals and selectively maps its responses into their areas. We can see in figure
5.5 that the signal data response of the random NNSU (u000591) produces a peak in [0.1, 0.3];
the optimized NNSU (00014 00001) exhibits the same area of the signal response more
flattened.


Figure 5.3: Short run GA: 3D-fitness and 2D-fitness graphs for a short optimization of
30 members by 80 generations. The span of the 3 remembered individuals (R^new = 27)
between the generations is visible at the end of each population line.
    Method           NNSU instance   fE      fT      (generalization indicators)
    GA               00014 00001     0.992   0.904   0.908   0.825   0.866
    GA               00031 00047     0.995   0.884   0.887   0.787   0.837
    Random search    u000591         0.916   0.808   0.844   0.721   0.783
    Random search    u001433         0.947   0.801   0.824   0.685   0.753

Table 5.5: Short run GA: Generalization factors for the two best-performing NNSUs from the genetic algorithm and random search runs.

On the other hand, if we look at the optimized NNSU's ROC curves, we can see a slight loss of
performance on T data. The generalization indicators possess comparable values for
both the optimized and the random NNSUs. Considering that the optimized NNSUs profit from
the higher eHST evaluation factor, we can say that the random NNSUs show a slightly higher
level of generalization.
TIME TO FINDING A SOLUTION
Let us now look at how fast the genetic algorithm and random search approaches can be. For
now, we will not consider the fact that the random search did not find NNSU instances of quality
comparable to the genetic algorithm, and we will compare the powers of each method individually:
the best individual found by GA is compared to the best individual found by RS, the second best
individual in GA to the second best individual in RS, et cetera.
The genetic algorithm managed to find the best-performing NNSU in the 14th generation,
that is, the individual was found within 15 · R + 2 = 15 · 30 + 2 = 452 evaluations.(1) The
random search spent 592 NNSU evaluations before it found its best-performing one.
That is, GA was 592 − 452 = 140 evaluations faster; compared to the overall number of
evaluations, this represents approximately 5.8 % of saved calculation time. Also in finding the
second best-performing individual the genetic algorithm was faster than the random search,
with 32 · 30 + 48 = 1008 evaluations against 1434, now by 17.8 %.
(1) We count in the evaluation of the initial 0-th generation, and the members within a population are
internally indexed starting from 0.

Figure 5.4: Short run GA: Comparison of the histogram and ROC quality of the NNSU
00014 00001 found by the short run of the genetic algorithm. Each graph shows results
obtained on E data and on T data.
Figure 5.5: Short run RS: Comparison of the histogram and ROC quality of the NNSU
u000591 found by random search. Each graph shows results obtained on E data and on T
data.
Figure 5.6: Short run: Visualisation of the architecture of the best performing NNSU throughout all the populations of the genetic algorithm, the 00014 00001, on the left; and the best
performing NNSU throughout the random search, the u000591 on the right.

5.3.4 RESUMÉ

The genetic algorithm, in the presented form of rather small scenarios, is an established
method within the NNSU tool, providing a suitable choice for fast seeking of well-performing
architectures with a high level of generalization. The use of the genetic algorithm naturally extends the integral use of IPCodes as architecture representations.
We observed that within a two-hour calculation the genetic algorithm found higher
quality NNSU instances, with a higher generalization level. Finally, the number of evaluated
NNSU instances was lower for the genetic algorithm than in the random search case.

5.4 EVOLUTIONARY OPTIMIZATION VERSUS RANDOM SEEKING

The short genetic algorithm scenarios described in the previous section showed clearly
that the genetic algorithm is able to find solution architectures and, moreover, it is able to
find them by evaluating a lower number of NNSU instances. We will now undertake a few
larger scenarios, using both the genetic algorithm and random search approaches, and compare
the results to see whether the genetic algorithm remains convenient to use even for
larger tasks. Each task will evaluate up to 10,000 NNSU instances, which will allow us to see
whether the advantageous features of the genetic algorithm are retained for larger scenarios. The
ten thousand evaluations require approximately 12 hours of computational time.
After we are through all the particular results, comparisons and graphs, we will also
make the notion of the speed of finding a solution precise. A generalization of the term faster, speaking
of finding a suitable solution, would be very helpful in any comparison of the GA against RS
(and possibly other approaches). Within this section we will come up with a convergence
speed ranking tool based on estimates of the cumulative probability distribution function (CPDF)
of finding an individual of a given performance.

5.4.1 EXPERIMENTS OVERVIEW

The first two genetic algorithm runs proceed in 50 × 200 scenarios, i.e. with R = 50
and V = 200, evaluating 10,050 NNSUs. The first was set up with a five-genotype memory
(R^new = 45) to see a guided convergence, the second was set up without memory (R^new =
R). The detailed set-up tables are listed in tables 5.6 and 5.7. Note that with the higher number
of generations, and thus more frequent recombinations, we also raised the maxIPCodeLen
bound. The maxIPCodeLen was set to 251; still, the number of blocks
peaked around 40 and the calculation was not prolonged significantly.
The third run uses a 150 × 70 scenario, i.e. R = 150 and V = 70, evaluating 10,200 NNSUs.
The details on the parametrization of this GA run are listed in table 5.8.

5.4.2 RESUMÉ
After the successful performance in the rather small scenarios, the GA approach appears as efficient
in larger scenarios as well.
- The genetic algorithm exhibits a higher concentration of more generalized NNSUs: only one of the top three NNSU instances per run was below a 70 % generalization level, compared to five such cases for RS.
- In a total listing of the 15 overall top NNSUs, ranked according to the generalization level, 10, i.e. 66.7 %, were produced during the GA optimization.

GA L01 parameters setup
    MINBC = 3      MAXBC = 25
    MINNSUC = 1    MAXNSUC = 3
    MINClC = 2     MAXClC = 10
    Transform function: Tr
    wHST = 1.0   wPO = 0.5   wROC = 1.0   wMSE = 0.1
    PC = 0.9     PM = 0.1
    Selection type: s^{r,lin}
    R = 50       R^new = 45
    V = 200
    maxIPCodeLen = 251
    maxRecLen = 25

Table 5.6: Large run genetic algorithm: GA L01 with memory.

GA L02 parameters setup
    MINBC = 3      MAXBC = 25
    MINNSUC = 1    MAXNSUC = 3
    MINClC = 2     MAXClC = 10
    Transform function: Tr
    wHST = 1.0   wPO = 0.5   wROC = 1.0   wMSE = 0.1
    PC = 0.9     PM = 0.1
    Selection type: s^{r,lin}
    R = 50       R^new = 50
    V = 200
    maxIPCodeLen = 251
    maxRecLen = 25

Table 5.7: Large run genetic algorithm: GA L02 without memory.

GA L03 parameters setup
    MINBC = 3      MAXBC = 25
    MINNSUC = 1    MAXNSUC = 3
    MINClC = 2     MAXClC = 10
    Transform function: Tr
    wHST = 1.0   wPO = 0.5   wROC = 1.0   wMSE = 0.1
    PC = 0.9     PM = 0.05
    Selection type: s^{r,lin}
    R = 150      R^new = 150
    V = 70
    maxIPCodeLen = 251
    maxRecLen = 25

Table 5.8: Large run genetic algorithm: GA L03 without memory, with diminished mutation.

    Method           NNSU instance   fE      fT      (generalization indicators)
    GA (GA L01)      00139 00039     0.995   0.902   0.904   0.818   0.861
                     00111 00024     0.997   0.837   0.838   0.703   0.771
                     00197 00036     1.000   0.789   0.789   0.622   0.705
    Random search    u008598         0.955   0.876   0.896   0.806   0.851
    (RS L01)         u006828         0.951   0.757   0.777   0.611   0.694
                     u002728         0.961   0.752   0.768   0.596   0.682

Table 5.9: Large run: Comparison no. 1. The genetic algorithm reached better performing instances (fE = 0.995 for GA against fE = 0.955 for RS), and also a higher level of generalization (0.861 for GA against 0.851 for RS).


Figure 5.7: Large run GA: Fitness evolutions of GA L01, GA L02, and GA L03 runs.

    Method           NNSU instance   fE      fT      (generalization indicators)
    GA (GA L02)      00123 00044     1.000   0.914   0.914   0.836   0.875
                     00087 00013     0.999   0.838   0.839   0.703   0.771
                     00049 00017     1.000   0.767   0.767   0.588   0.677
    Random search    u007361         0.971   0.786   0.797   0.640   0.719
    (RS L02)         u009572         0.955   0.765   0.783   0.620   0.701
                     u004860         0.957   0.738   0.754   0.576   0.665

Table 5.10: Large run: Comparison no. 2. The genetic algorithm reached better performing instances (fE = 1.000 for GA against fE = 0.971 for RS), and also a higher level of generalization (0.875 for GA against 0.719 for RS).

    Method           NNSU instance   fE      fT      (generalization indicators)
    GA (GA L03)      00037 00049     0.996   0.954   0.956   0.914   0.935
                     00039 00013     1.000   0.893   0.893   0.798   0.845
                     00047 00018     0.994   0.941   0.944   0.892   0.918
    Random search    u008304         0.984   0.805   0.811   0.661   0.736
    (RS L03)         u001804         0.985   0.773   0.779   0.609   0.694
                     u007903         0.951   0.738   0.757   0.581   0.669

Table 5.11: Large run: Comparison no. 3. The genetic algorithm reached better performing instances (fE = 1.000 for GA against fE = 0.985 for RS), and also a higher level of generalization (0.935 for GA against 0.736 for RS).

    Method   NNSU instance   fE      No.     fT      (generalization indicators)
    GA       00037 00049     0.996   1,950   0.954   0.956   0.914   0.935
    GA       00047 00018     0.994   2,419   0.941   0.944   0.892   0.918
    GA       00123 00044     1.000   6,245   0.914   0.914   0.836   0.875
    GA       00139 00039     0.995   7,040   0.902   0.904   0.818   0.861
    RS       u008598         0.955   8,599   0.876   0.896   0.806   0.851
    GA       00039 00013     1.000   2,014   0.893   0.893   0.798   0.845
    GA       00036 00042     0.997   1,893   0.885   0.886   0.785   0.835
    GA       00026 00007     0.995   1,358   0.837   0.839   0.705   0.772
    GA       00087 00013     0.999   4,414   0.838   0.839   0.703   0.771
    GA       00111 00024     0.997   5,625   0.837   0.838   0.703   0.771
    RS       u008304         0.984   8,305   0.805   0.811   0.661   0.736
    RS       u007361         0.971   7,362   0.786   0.797   0.640   0.719
    GA       00197 00036     1.000   9,937   0.789   0.789   0.622   0.705
    RS       u009572         0.955   9,573   0.765   0.783   0.620   0.701
    RS       u006828         0.951   6,829   0.757   0.777   0.611   0.694

Table 5.12: Large run: List of best performing NNSUs.



Figure 5.8: Large run GA: Top architectures found by genetic algorithm runs GA L01,
GA L02, and GA L03.


Figure 5.9: Large run RS: Top architectures found by random search runs RS L01, RS L02,
and RS L03.

5.5 TIME TO FINDING SOLUTIONS

We have already mentioned and compared the ability of a method (GA, RS) to find solutions within
a low number of evaluations, and thus in a shorter computational time. Tables 5.9, 5.10, and
5.11 again reveal that the genetic algorithm managed to reach top quality instances within
a shorter time.

5.5.1 DEFINITION OF THE T VARIABLE

The time each optimization method spends before it reaches an IPCode, or a set of IPCodes,
that decodes into an architecture of a given quality level, we will call the time to finding a solution
architecture and denote it T. In our case, the term time stands for the number of evaluated NNSU instances, neglecting other CPU time and disk space consuming activities. The
quality of a NNSU instance is naturally calculated in terms of the NNSU's fitness values, which
are constructed as quality measures. With high convenience we use the known limits of a
ranking fitness: as the ranking fitness values lie between 0 and 1, we may directly set out our
requirement on the achieved quality level qf, e.g. the level qf = 0.93.
The T variable represents the number of NNSUs that are to be evaluated before
a phenotype with fitness higher than the given quality level qf is found. T is a random variable
distributed according to some, unknown, distribution function FT. This distribution is highly
parametrized: among other factors we know that the specific data used, together with all
parameters of the underlying NNSU instances, and also the required quality level qf, influence
the shape of FT. Thus, we do not seek FT as a parametric distribution; we simply
estimate it as an empirical cumulative distribution function. The empirical estimate is entirely
satisfactory for our purposes since we only read the shape of the CPDF curve and the
empirical quantiles (50 %, 70 %, and 80 %).
For the quality level qf we build a cumulative probability distribution function (CPDF) for
the T variable:

    F_T^{q_f}(t) = P(T ≤ t)

[Figure 5.10 readings, quality level 0.900: 50 % quantile GA: 81, RS: 66; 70 % GA: 131, RS: 92; 80 % GA: 138, RS: 94. Quality level 0.950: 50 % GA: 252, RS: 470; 70 % GA: 324, RS: 890; 80 % GA: 335, RS: 894.]

The empirical estimate F̂_T^{q_f} is calculated using the unbiased Kaplan-Meier estimator (Kaplan E.,
Meier P. (1972)), also called a product-limit estimator. The particular estimates for the quality
levels qf = 0.9, 0.95, 0.99, and 0.995 are shown in the figures below (5.10, 5.11).
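A minimal sketch of the product-limit estimate of F_T^{q_f} is given below, assuming each experiment contributes one observation: the number of evaluations at which the quality level was first reached, or a right-censored value equal to the total number of evaluations if the level was never reached. This is an illustration of the estimator only, not the code used to produce the figures.

```python
def kaplan_meier_cdf(times, reached):
    """Product-limit (Kaplan-Meier) estimate of P(T <= t).

    times   : number of evaluations per experiment (event or censoring time)
    reached : True if the quality level was reached, False if the run is censored
    Returns a list of (t, F(t)) points at the observed event times.
    """
    data = sorted(zip(times, reached))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = sum(1 for tt, ev in data if tt == t and ev)
        removed = sum(1 for tt, _ in data if tt == t)
        if events:
            survival *= 1.0 - events / n_at_risk
            curve.append((t, 1.0 - survival))
        n_at_risk -= removed
        i += removed
    return curve

# Example: four runs reached the level after the listed evaluations, two never did.
print(kaplan_meier_cdf([300, 450, 800, 1200, 2000, 2500],
                       [True, True, True, True, False, False]))
```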


Figure 5.10: Empirical CPDFs for quality levels qf = 0.9 and qf = 0.95.


[Figure 5.11 readings, quality level 0.990: 50 % quantile GA: 1047, RS: 11497; 70 % GA: 1804, RS: 15082; 80 % GA: 2132, RS: Inf. Quality level 0.995: 50 % GA: 2482, RS: Inf; 70 % GA: 4727, RS: Inf; 80 % GA: 12015, RS: Inf.]

Figure 5.11: Empirical CPDFs for quality levels qf = 0.99 and qf = 0.995.

5.5.2 RESULTING FIGURES

The observations for the estimates were collected from the listed experiments (all short and
long), and also from further experiments we ran during the testing phase. In fact, any observation gained from an experiment is very valuable since we can only use one observation per
experiment and each such run consumes tens of hours. The results shown in figures 5.10 and 5.11 are
based on 21 observations for GA and 16 for RS.
Each figure shows the CPDF calculated for a given quality level; the solid green line depicts the CPDF for the genetic algorithm, the dashed blue line the one for the random search. The
horizontal grey dashed lines mark the probability levels of 50, 70, and 80 percent. On
the right side the quantiles are listed for GA and RS. In case a method did not reach the
specific quality level in any of the experiments, the quantile is set to infinity, meaning that the
method would need to spend many more evaluations than in any of the experiments so
far. For GA the maximum number of evaluations was 18,000, for RS it was 25,000.
We can read from the figures that for the lower quality level, qf = 0.9, and higher probabilities of success, the RS outperforms the GA, being able to find a solution with probability
70 and 80 percent, respectively, by roughly 50 individuals fewer. For all other cases, the genetic algorithm proves to be the faster approach, even more so with higher demands on the quality level qf and
on the probability of success. We would say that for a lower level of required quality, qf = 0.9,
the random search seeks the optima comparably to the genetic algorithm, both of them
fast. As the quality level rises above 0.95, the GA takes over significantly. This advantage
in reaching better quality levels favours the genetic algorithm over the random search. In general,
the difference lies in the way the GA utilizes the information from previous generations
and incorporates it into the newly built generations.
For qf = 0.9, the random search provides a faster solution (with 80 % confidence), seeking
solution networks in 94 evaluations whereas the genetic algorithm needs 138 evaluations; this represents a 32 per cent calculation saving. With qf = 0.95 the genetic algorithm
takes over with 335 evaluations against 894 evaluations of the random search (80 % confidence).
A further increase of the qf level only emphasises the strengths of the genetic algorithm. For
qf = 0.99 the random search would not reach the 80 per cent confidence of finding a suitable
NNSU, while the genetic algorithm requires 2,132 evaluations to do so. For 70 % confidence the
genetic algorithm advances with 1,804 evaluations against the random search's 15,802.
Finally, for qf = 0.995 the random search process would not assert even 50 % confidence.
At the same time the genetic algorithm keeps up with an acceptable 12,015 evaluations for 80 %
confidence, and 4,727 for 70 %.

5.5.3 RESUMÉ

The results listed in this book comprise 3 runs of the genetic algorithm and 3 runs of the random
search. In addition, the CPDF estimates and the time-to-convergence tables and charts cover another 40 runs of at least comparable dimensions, which represents some 20 full days of
calculation. We carefully considered all experiments and results, and the output presented is indeed as
economic as possible.
We hope that this final section, together with the previous one, provides rich and convincing enough results showing that the genetic algorithm is a flexible optimization technique able
to succeed in a complex search space of structured entities, with a highly complicated fitness
landscape.
- The genetic algorithm adapts to the quality measures specified within an optimization task. In most figures we could observe a significant optimization along the evaluation factors, mainly the adaptation to the BW and SW intervals in the histogram factor eHST.
- The genetic algorithm keeps the generalization powers of the NNSU. The generalization ratio, and also both of its components, did not exhibit any decrease or discrepancy for the optimized NNSU instances. We should also note that the random search does not disrupt the generalization powers either.
- The genetic algorithm is fast in terms of finding solution NNSUs with the least number of evaluated instances. With a required quality level above qf = 0.95 the genetic algorithm does significantly speed up the search process in comparison with random evaluations.

6
Results of the work
This chapter concludes the dissertation and summarizes all steps accomplished during the
work on the implementation of the NNSU architectures optimization. This work naturally mixed
the existing theory and findings with new thoughts and adjustments we needed to incorporate into the models in order to get all the parts of the NNSU GA working properly and, if
possible, as efficiently as possible.
In the detailed paragraphs so far, the parts of the re-used theory and the new contributions are
not distinguished in the deepest detail. In the following review we assign all original contributions clearly within the paragraphs in the page margins.

6.1 MAIN GOALS OF THIS WORK

The dissertation describes all steps we needed to tackle in order to make the optimization
of the architectures of neural networks with switching units (NNSU) work properly. This addressed
the following points:
- define a flexible representation of the NNSU architecture that would allow for further optimization;
- design an evolutionary environment of a genetic algorithm that would allow for the identification of NNSUs with top quality responses;
- implement and operate the GA environment within the existing NNSU tool as a plug-in;
- perform a short research comparison of the quality results of the evolutionary process and some existing approaches (random search).
Firstly, we needed to deeply understand the principles of the NNSU networks; then we
went on to extend the structure of the acyclic architectures and to approach the search and
optimization techniques that would allow for finding architectures with the greatest powers.
The result of our work is a finalized genetic algorithm module flexibly incorporated into
the NNSU tool. The GA module was developed and tested during the dissertation period; all
relevant descriptions, discussions, and figures resulting from all of its parts are presented in
the thesis.

6.2 ACHIEVEMENTS PRESENTED IN THIS WORK

6.2.1 ARCHITECTURES DEFINITION

The very first step for us was to define the NNSU architectures. The suitable definitions are
listed in section 1.1, including the s,t-graph definition in definition 1.2 that captures the
NNSU architecture well in all important regards.

[Original contribution] The acyclic architecture then represents an acyclic, multi-layered (with arbitrary inter-layer connections), non-recurrent neural network. The architecture describes the block structure of the NNSU, and the term architecture stands both for the graph component of the NNSU and for the further parameters of the structural details of each block.
It is important to recall here that the parameters of the neurons with switching units, i.e.
the clustering centres and the regression coefficients, are found by standard methods, namely clustering analysis and generalized least squares. These parameters are not directly touched by the
evolutionary optimization as the applied methods are the most efficient ones.

6.2.2 ARCHITECTURES REPRESENTATION

The representation of the architectures is the first step to be made to incorporate the genetic
algorithm. For the purposes of the NNSU representation we incorporated a genetic programming approach which uses program symbol trees (PST), initially described by
Frederic Gruau, and combined it with the technique of encoding trees (more specifically planted plane
trees) by Ronald C. Read.
[Original contribution] We transformed the PST into linear strings (of variable length) and added the structural parameters that are assigned to the architecture. This way we defined an Instruction-parameter code, an IPCode.
With the standard binary or integer strings we could only represent a linear CHAIN of
the neurons with switching units. Utilizing the IPCodes we achieved a representation of quite
a broad range of architectures, though we did not prove that all acceptable architectures are
representable via IPCodes.
[Original contribution] IPCodes with no additional constraints on the instructions or the Code structure represent the class of generic architectures GAA. The modified IPCodes then represent the class of layered
architectures LAA, and the special type of architectures with variables selection VSAA. For
each class of architectures we proposed an IPCode template that asserts the desired architecture structure and form, and we also described the algorithm for random sampling from the
particular set of architectures.

6.2.3 ARCHITECTURES OPTIMIZATION

The architectures optimization rounds off our task of finding architectures suitable for use
within the NNSU tool. In order to get the whole optimization working, we needed to define
the quality function and also the recombination schemes.
Original
We defined a structured fitness function based on four evaluation factorshistogram racontribu- tio evaluation eHST , ROC curve evaluation eROC , posterior probabilities evaluation eP O , and
tion
an root mean square error based evaluation eM SE . The composed fitness then aggregates
the values of individual evaluation factors into a scalar value. We introduced transformations, namely the scaling transform Ts and the ranking transform Tr , that allow for balanced
contribution of each evaluation factor.
The two out of three recombination operators, mutation and cross-over, follow natural
extension of the PST operations and realize SubIPCodes swapping. The third one, selection
operator, was adjusted so that it would distinctively identify individuals with significantly
Original different underlying fitness. We implemented the ranking selection with three internal
contribu- functions (logarithmic, linear, exponential), and a modified proportional selection with emtion
phasized variance.
Having designed, implemented and tested all of the components mentioned above, we
have put those together into an environment for evolutionary optimization. The genetic algorithm we constructed applies a standardized template of iterative search through controlled

6.3.

F UTURE

101

WORK

sampling and recombination of individuals. The genetic algorithm designed for NNSU makes
naturally use of these variables and features the following extensions:
- Generalized multi-factor fitness; the most popular root mean square error is just one of the evaluation factors; moreover, in the NNSU case, this factor exhibited significantly low variance, as seen in table 4.1.
- Distinctive selection operator; we modified the standard proportional selection so that it heightened its distinctive powers, as seen in figure 4.5; for better flexibility we implemented the ranking selection as well, which exhibits high distinction between differently performing individuals.
- Memory; for common runs of the NNSU tool we prepared a model that copies a non-zero number of individuals into new generations. This guides the convergence but might lead to a local minimum only. The comparison of the GA runs with and without memory is seen in figure 4.9; a minimal sketch of this copying step is given after this list.
- Generalization criteria; the final original contribution were the generalization figures. Technically, the genetic algorithm may get over-learned on the data; the generalization factors GR and GD, defined in 4.37 and 4.38 respectively, allow for identification of a GA run meeting the required quality level while keeping the out-of-sample performance at an acceptable level at the same time.
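The following Python fragment sketches the memory (elitism) step mentioned above; the function names and the fixed memory size are hypothetical and only indicate how a non-zero number of best individuals can be copied into the next generation.

```python
def next_generation(population, fitness, breed, memory_size=2):
    """Build a new generation while copying the best `memory_size` individuals.

    `population` is a list of individuals, `fitness` their fitness values, and
    `breed(population, fitness, n)` any routine producing n offspring via
    selection, cross-over and mutation.  This is an illustrative sketch, not
    the NNSU tool's actual interface.
    """
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    elites = [population[i] for i in ranked[:memory_size]]      # the "memory"
    offspring = breed(population, fitness, len(population) - memory_size)
    return elites + offspring
```

Any concrete `breed` routine can plug in the selection and SubIPCode-swapping operators described above.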

6.2.4 Results

If we compare the times to finding solution for the genetic algorithm and the random search techniques, we conclude that in the case of a lower level of required quality (approx. 90 %) the methods are quite on par, cf. the left figure in 5.10. For higher levels of required quality the genetic algorithm outperforms the random search, as seen in figure 5.11.
Using the genetic algorithm we manage to find solutions with a level of required quality of up to 100 %, and moreover, the genetic algorithm manages to find them faster, i.e. with a lower time to finding solution. This is due to the fact that the genetic algorithm searches for individuals simultaneously maximizing all factors.

6.3 Future work

This thesis brings up many questions regarding architectures, the measure of quality of decision systems, etc. Some of these questions are settled, some remain open and wait for further revisions. The main point is a further weakening of the representational structure and a move from encodings to grammars and from the genetic algorithm to grammatical evolution.

6.3.1 Representation extension

A very interesting goal would be to settle the representational aspects of the IPCodes, mainly the question of the completeness of the IPCodes set.
Another inspiring work would then be the analysis of IPCodes with an extended set of building instructions, as proposed by Ondrej Pokorny in (Pokorny, 2008).

6.3.2 Grammatical evolution

Grammatical Evolution (GE) can be used to generate programs in any language, using a
genetic algorithm to control what production rules are used in a Backus Naur Form (BNF)
grammar definition.
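The core of the GE genotype-to-phenotype mapping can be sketched in a few lines of Python; the toy BNF grammar, codon values and symbol names below are purely illustrative and have nothing to do with the NNSU instruction set.

```python
# A minimal sketch of the grammatical evolution (GE) mapping over a toy grammar.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["*"]],
    "<var>":  [["x"], ["y"]],
}

def ge_map(codons, start="<expr>", max_steps=50):
    """Map an integer string onto a sentence: each codon picks (modulo the
    number of alternatives) which production rewrites the leftmost
    non-terminal; codons wrap around when exhausted."""
    sentence, i = [start], 0
    for _ in range(max_steps):
        nonterminals = [j for j, s in enumerate(sentence) if s in GRAMMAR]
        if not nonterminals:
            return "".join(sentence)
        j = nonterminals[0]
        rules = GRAMMAR[sentence[j]]
        choice = rules[codons[i % len(codons)] % len(rules)]
        sentence[j:j + 1] = choice
        i += 1
    return None   # derivation did not terminate within max_steps

print(ge_map([2, 1, 3, 0, 4, 1]))   # yields a small arithmetic expression over x, y
```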


A Broader remarks on IPCodes and GA
All the described findings, expectations, experiments, and observations quoted so far in the text were tied to the overall idea of a compact and comprehensive review of the work we undertook and experienced with IPCode structures, NNSU networks and their use.
The chapters and sections listed do not solve the optimization of NNSU architectures completely; during the working and testing periods we found many interesting and inspiring sub-quests and issues. Some of them were already described and discussed in the text as they were closely related to the main content; for instance, we carried out an in-depth analysis of the fitness design and we examined the IPCodes' representational powers. In some situations we simply integrated the feature without deeper explanation, such as the IPCodes subcoding and structural complexity, and some of the features were not mentioned at all.
In the following appendices we turn to the areas that are still worth mentioning and that add to the comprehensive view of the work with NNSU structures via IPCodes.
We will focus on the additional properties of the Codes, and thus also of the ICodes and IPCodes, that speak of the highly complex and structured shape of the sets we worked with. The search space IPCodes(N) exhibits, in contrast to usual GA search spaces, many differences, including incompleteness, in the sense that not all binary strings of a given length necessarily belong to it, and scalability, meaning that its members possess variable lengths. We will look in deeper detail at the algebraic properties and formalisms of the Codes and IPCodes and describe the background reasoning behind the incompleteness and scalability issues.
We will also take a look at the Codes subcoding feature. Within the main text we did not treat the exact proof and we referred to this appendix for a rigorous explanation of this property. We describe the proof of the subcoding in terms of the level-property and the Codes hereunder.
Also, we will examine the way the WAV codes are built from a binary tree, and especially we will point out the specific isomorphism used for the binary trees within the PST structures. Finally, we will present a short excursion on the exact size of the IPCodes(N) set. This will relate the IPCodes to the problem of Catalan numbers and thus to various existing algebraic structures.

A.1 Properties of Codes

Even though the Codes might look like more or less trivial structures at first sight, the real work with Codes is not as straightforward. We should recognize that the level-property is a set of (in)equalities whose degrees of freedom grow with the length of the Code. In the following sections we will list the Codes features we observed during our work with them.
For convenience we will refer to the sums $\sum_{j=1}^{k} a_j$, $k \in \{1,\dots,N\}$, as k-partial sums; in the case of $k = N$ we refer to the N-partial sum as the complete sum.
A.1.1 Zero and non-zero entries of a Code

The first basic remark shows that the number of non-zeros within a Code is always an even number and can be stated exactly. The k-partial sums are rewritten using an indicator¹ as follows:
$$\sum_{j=1}^{k} a_j = \sum_{j=1}^{k} 2\,[a_j = 2] = 2\,\bigl|\{a_j \mid a_j = 2,\ j \in \{1,\dots,k\}\}\bigr|. \tag{A.1}$$

Following the last equation of the level-property in definition 2.1 and the previous equation (A.1), we gain
$$\sum_{j=1}^{N} a_j = 2\,\bigl|\{a_j \mid a_j = 2,\ j \in \{1,\dots,N\}\}\bigr| = N - 1,$$
which gives the number of non-zero entries of a Code: $(N-1)/2$. We can divide by 2 here since we know that $N$ is odd. The number of zero entries of a Code is complementary to $N$ and thus equal to $(N+1)/2$. Without further proofs we can claim the following corollary.
Corollary A.1. Let $N \in \{2k-1 \mid k \in \mathbb{N}\}$. For arbitrary $A \in Codes(N)$, $A = \{a_j\}_{j=1}^{N}$, the counts of non-zero and zero entries are given as follows:
$$\bigl|\{a_j \mid a_j = 2,\ j \in \{1,\dots,N\}\}\bigr| = \frac{N-1}{2}, \qquad \bigl|\{a_j \mid a_j = 0,\ j \in \{1,\dots,N\}\}\bigr| = \frac{N+1}{2}.$$
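The level-property and the counts from corollary A.1 are easy to verify mechanically; the following Python sketch does exactly that for a few short sequences (the third one violates the level-property).

```python
def is_code(seq):
    """Check the level-property: every k-partial sum exceeds k - 1 for k < N,
    and the complete sum equals len(seq) - 1."""
    partial = 0
    for k, a in enumerate(seq, start=1):
        partial += a
        if k < len(seq) and partial <= k - 1:
            return False
    return partial == len(seq) - 1

# Codes(5) and the counts from Corollary A.1: (N-1)/2 twos, (N+1)/2 zeros.
for code in [(2, 2, 0, 0, 0), (2, 0, 2, 0, 0), (2, 0, 0, 2, 0)]:
    print(code, is_code(code), code.count(2), code.count(0))
```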

A.1.2 Fixed positions
Let us now turn to the set of inequalities in the level-property and examine some additional properties of the Codes(N) set. The k-partial sums always add up to even integers, as seen in equation (A.1). With increasing k, the k-partial sums simply grow by 2 or stay at the existing level. Clearly, the first entry of a Code must be a 2 to satisfy the inequality in the level-property for $k = 1$: $\sum_{j=1}^{1} a_j = a_1$, which is required to be greater than 0. Hence, $a_1 = 2$. The second position already allows for both 0 and 2.
Next, let us move to the last entries of a Code. According to the last equation in the level-property, the complete sum is equal to $\sum_{j=1}^{N} a_j = N - 1$. If the last entry were non-zero, the $(N-1)$-partial sum would equal $N - 3$, which does not meet the inequality in the level-property for $k = N-1$: $\sum_{j=1}^{N-1} a_j = N - 3$, which is not greater than $N - 2$. Thus, $a_N = 0$, and the $(N-1)$-partial sum is equal to $N - 1$. One entry back, if the entry $(N-1)$ were non-zero, the $(N-2)$-partial sum would equal $N - 3$, which still does not fit the inequality in the level-property for $k = N-2$: $\sum_{j=1}^{N-2} a_j = N - 3$, which is not greater than $N - 3$. Thus the entry $(N-1)$ also holds a zero.
As a result of this we see that the two last partial sums are equal to the complete sum:
$$\sum_{j=1}^{N-2} a_j = N - 1, \qquad \sum_{j=1}^{N-1} a_j = N - 1, \tag{A.2}$$

and the last two entries of any Code are zero-valued. These conclusions only apply to Codes of length $N \ge 3$; the only single-entry Code is given as $(0)$ and has only a single k-partial sum, which is identical to the complete sum. Thus, we start with Codes of length $N \ge 3$ in the following corollary.
¹The indicator $[expr]$: if $expr$ may be true or false, $[expr]$ returns 1 if $expr$ is true, and 0 otherwise.


Corollary A.2. Let $N \in \{2k+1 \mid k \in \mathbb{N}\}$. For arbitrary $A \in Codes(N)$, $A = \{a_j\}_{j=1}^{N}$, three entries possess fixed values: $a_1 = 2$, $a_{N-1} = 0$, and $a_N = 0$. Thus, $\sum_{j=1}^{N-2} a_j = N - 1$ and $\sum_{j=1}^{N-1} a_j = N - 1$.
This way, members of Codes(N) have in fact $N - 3$ degrees of freedom. On $N - 3$ positions both 0 and 2 might appear, of course with respect to the level-property. The total number of all possible combinations of zeros and non-zeros within a Code is nevertheless rather complicated to find; we will go over the details in section A.4.

A.2 Subcoding

Within the main text, in the chapter on IPCodes, we introduced and utilized the subcoding feature that allowed for recombination and thus for the implementation of the genetic algorithm. The subcoding is quite intuitive when approaching the coding structures from Gruau's point of view. For the PST, the subcoding applies to the underlying trees and simply means that any node within a tree is a root of its subtree. For IPCodes we reformulate this in terms of a valid SubIPCode for each position within an IPCode.
We will now provide the full proof of the subcoding in terms of IPCodes, and more precisely of their first dimension, the Codes. Before we move to the proof itself, we shortly recall the level-property features once again. We have to keep in mind that the level-property represents a variable number of inequalities, depending on the length of the particular Code. For a Code of length 5 we have 4 inequalities and 1 equation, for a Code of length 11 there are already 10 inequalities and 1 equation, and so on. Thus, the proof partly utilizes the induction principle on the number of valid inequalities for a Code's subsequence.
Lemma A.3. Let $N \in \{2k-1 \mid k \in \mathbb{N}\}$. For any Code $A \in Codes(N)$ and position $k \in \{1,\dots,N\}$ there exists a subsequence $\{a_j\}_{j=k}^{k+M-1}$ of length $M \in \{1,\dots,N-k+1\}$ that fulfils the level-property. This subsequence is defined uniquely.
Proof. We firstly show that the subcode always exists; then we prove that it is unique. Let $k \in \{1,\dots,N\}$ be fixed.
We start with the existence of the subcode. Let $B$ represent the sequence constructed from the Code $A$ shifted by $k$ positions, $\{b_j\}_{j=1}^{N-k+1} = \{a_{(k-1)+j}\}_{j=1}^{N-k+1}$. The proof will proceed by partial induction on $B$'s entries.
The entry $b_1$ evaluates only as a 0 or a 2. If $b_1 = 0$, it meets the level-property as a trivial Code and the subcode (of length 1) is found, being $B' = (0)$. Let $b_1 = 2$; note here that, thanks to corollary A.2, we are certain that $k \notin \{N, N-1\}$. We will firstly show that for each $l' \in \{1,\dots,N-k+1\}$ the $l'$-partial sum of the sequence $B$ either reaches the whole level-property or at least its inequalities part. For $l' = 1$ the situation is described above; let $l' \in \{1,\dots,N-k\}$ and suppose that the inequalities part is fulfilled up to $l'$, that is
$$\sum_{j=1}^{m} b_j > m - 1 \quad \text{for all } m \in \{1,\dots,l'\}.$$
Now, $b_{l'+1}$ evaluates only as a 0 or a 2. Once $\sum_{j=1}^{l'} b_j = l'$ and $b_{l'+1} = 0$, the level-property is met and a subcode of length $l'+1$ is found, being $B' = \{b_j\}_{j=1}^{l'+1}$. Otherwise, $\sum_{j=1}^{l'} b_j > l'$, and then both $b_{l'+1} = 0$ and $b_{l'+1} = 2$ lead to another inequality $\sum_{j=1}^{l'+1} b_j > l'$.


For now we can say that any subsequence starting at position $k$, which is now less than $N-1$ as $b_1 = 2$, either fulfils the level-property or at least its inequalities part. Now we demonstrate that the point where the sequence $B$ meets the whole level-property must lie within the positions $\{1,\dots,N-k+1\}$. Let us suppose the opposite, i.e. that we have a sequence $\{b_j\}_{j=1}^{N-k+1}$ that fulfils only the inequalities, especially the last one:
$$\sum_{j=1}^{N-k+1} b_j > N - k.$$

We will now return back to the original Code $A$ and rewrite the previous inequality in terms of $A$'s entries: $\sum_{j=1}^{N-k+1} b_j = \sum_{j=(k-1)+1}^{N} a_j$. Thus, we have $\sum_{j=(k-1)+1}^{N} a_j > N - k$. We add the initial $(k-1)$-partial sum of $A$, which by definition gives $\sum_{j=1}^{k-1} a_j > k - 2$, and since we work with integers we have²
$$\sum_{j=1}^{k-1} a_j + \sum_{j=(k-1)+1}^{N} a_j > N - 1.$$
This is a contradiction. As a result we found that, starting at any position $k \in \{1,\dots,N\}$, there exists a subsequence $\{a_j\}_{j=k}^{k+M-1}$, $M \in \{1,\dots,N-k+1\}$, that fulfils the level-property.
The uniqueness of the subcode found is shown using the contradiction technique. Suppose we have two subcodes $sA_1 \ne sA_2$ at position $k$. These subcodes can differ in lengths only; let us suppose $|sA_1| > |sA_2|$. The $l$-partial sums for $l = |sA_2|$ are written as follows:
$$\sum_{j=1}^{l} sA_{1,j} > l - 1, \qquad\text{and}\qquad \sum_{j=1}^{l} sA_{2,j} = l - 1. \tag{A.3}$$
In fact, the $l$-partial sums are summing the same numbers, since $sA_{1,j} = sA_{2,j} = a_{k+j-1}$ for $j = 1,\dots,l$. Thus
$$\sum_{j=1}^{l} a_{k+j-1} > l - 1, \qquad\text{and}\qquad \sum_{j=1}^{l} a_{k+j-1} = l - 1. \tag{A.4}$$
These cannot hold at the same time, which brings us to the contradiction. Thus, any subcode within a Code at a specific position $k$ is unique.
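A direct computational counterpart of lemma A.3 is sketched below in Python: starting at a given (1-based) position, the scan stops at the first point where the partial sum equals the current length minus one. It assumes the input already is a valid Code.

```python
def subcode(code, k):
    """Return the unique subcode of `code` starting at 1-based position k
    (Lemma A.3): scan forward until the partial sum first equals length - 1."""
    partial = 0
    for length, a in enumerate(code[k - 1:], start=1):
        partial += a
        if partial == length - 1:          # level-property equation satisfied
            return code[k - 1:k - 1 + length]
    raise ValueError("not a valid Code")

A = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0)      # the Code from figure A.1
print(subcode(A, 3))   # the subcode rooted at position 3
print(subcode(A, 13))  # the trivial subcode (0,) at the last position
```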
The advanced use of subcoding is incorporated in the left-addition operator $\oplus_k$. Within the main text we only referred to a lemma assuring closure of the Codes set against $\oplus_k$; here we state the lemma together with the proof, again utilizing the level-property.
Technically, we will split the proof into two subsequent stages. We will utilize the decomposition of the addition $A \oplus_k B = (A \oplus_k (0)) \oplus_k B$. As a first step, we will show that $A' = A \oplus_k (0)$ lies in Codes, i.e. that swapping any subcode for a trivial one leads to a valid Code. Then we move to the addition of $B$ and show that $A' \oplus_k B$ lies in Codes, i.e. that the left-addition of a Code at the position of a trivial subcode also leads to a valid Code. Put together, this claims that swapping any subcode for any valid Code leads to a valid Code.
Lemma A.4. Let $A \in Codes(N)$, $A = \{a_j\}_{j=1}^{N}$, and $k \in \{1,\dots,N\}$ any position in $A$. The sequence $\{c_l\}_{l=1}^{P} = A \oplus_k (0)$ fulfils the level-property, i.e.
$$A \oplus_k (0) \in Codes(N - |sub(A,k)| + 1). \tag{A.5}$$
²For $a > a'$, $a, a' \in \mathbb{N}$ and $b > b'$, $b, b' \in \mathbb{N}$, it holds that $a + b > a' + b' + 1$.

Proof. By definition of the left-addition (definition 2.4) we have $|A'| = N - N' + 1$, where $N' = |sub(A,k)|$. We should note that in the case $a_k = 0$ we simply have $A' = A$ and the lemma trivially holds. We will consider $a_k = 2$, and thus also $k < N - 1$, in the following.
The level-property for $A'$ is rewritten as
$$\sum_{j=1}^{l} a'_j > l - 1, \quad l \in \{1,\dots,N-N'\}, \qquad\text{and}\qquad \sum_{j=1}^{N-N'+1} a'_j = N - N'. \tag{A.6}$$
The inequalities hold for $l < k$, for which the $l$-partial sums are identical to those of $A$. Let now $l = k$; we know that $k < N - 1$. If $a'_k$ is the last position after removing $sub(A,k)$, then $k = N - N' + 1$ and we only check the last equation of the level-property (A.6):
$$\sum_{j=1}^{N-N'+1} a'_j = \sum_{j=1}^{N} a_j - \sum_{j=k}^{k+N'-1} a_j = N - 1 - (N' - 1) = N - N'.$$
That is, the level-property is met.
If $a'_k$ is an inner position, then it must hold that $\sum_{j=1}^{k} a'_j > k - 1$. Otherwise we would come to a contradiction: since $\sum_{j=1}^{k-1} a'_j > k - 2$, the sum $\sum_{j=1}^{k} a'_j$ would necessarily equal $k - 1$, i.e. the level-property for $\{a'_j\}_{j=1}^{k}$ would hold and $a'_k$ would be the last position.
Now, let $l \in \{k+1,\dots,N-N'\}$ be an arbitrary position higher than $k$. For all but the last position we show that the $l$-partial sum inequalities hold, and for the complete sum we show the equality. Firstly the inequalities. We rewrite the $l$-partial sum for $A'$ as
$$\sum_{j=1}^{l} a'_j = \sum_{j=1}^{k-1} a_j + 0 + \sum_{j=k+N'}^{l+N'-1} a_j = \sum_{j=1}^{l+N'-1} a_j - \sum_{j=k}^{k+N'-1} a_j. \tag{A.7}$$
Let us write the partial sum for the original Code $A$ and position $l + N' - 1$:
$$\sum_{j=1}^{l+N'-1} a_j > l + N' - 2.$$
Now we subtract the part $\sum_{j=k}^{k+N'-1} a_j$, which corresponds to the removed subcode. Since this is a valid Code, we know that $\sum_{j=k}^{k+N'-1} a_j = N' - 1$. Hence
$$\sum_{j=1}^{l+N'-1} a_j - \sum_{j=k}^{k+N'-1} a_j > l - 1.$$
According to equation (A.7), the left side equals the $l$-partial sum of $A'$ and thus
$$\sum_{j=1}^{l} a'_j > l - 1.$$


Now, only the last equation from (A.6) remains. Again, we utilize equation (A.7) for $l = N - N' + 1$:
$$\sum_{j=1}^{N-N'+1} a'_j = \sum_{j=1}^{N} a_j - \sum_{j=k}^{k+N'-1} a_j = N - 1 - (N' - 1) = N - N'.$$
As a result, in any situation that might show up when swapping a subcode for a trivial Code, the resulting sequence always fulfils the level-property and thus belongs to Codes.
The second step is to prove that left-adding a valid Code at a position with a zero entry composes a valid Code. We will formulate the lemma for any position with a zero entry, which gives the claim its general level.
Lemma A.5. Let $A \in Codes(N)$, $A = \{a_j\}_{j=1}^{N}$, $B \in Codes(M)$, $B = \{b_i\}_{i=1}^{M}$, and $k \in \{1,\dots,N\}$ a position in $A$ for which $a_k = 0$. Then the sequence $\{c_l\}_{l=1}^{P} = A \oplus_k B$ fulfils the level-property, i.e.
$$A \oplus_k B \in Codes(N - 1 + M). \tag{A.8}$$
Proof. Recalling definition 2.4, the sequence $C = \{c_l\}_{l=1}^{P}$ is constructed as $c_j = a_j$ for $j = 1,\dots,k-1$, $c_k = b_1,\dots,c_{k+M-1} = b_M$, and $c_{k+M+l-1} = a_{k+l}$ for $l = 1,\dots,N-k$. We will subsequently check $C$ for the inequalities and the final equation of the level-property.
For $l \in \{1,\dots,k-1\}$ the situation is straightforward, as $\sum_{j=1}^{l} c_j = \sum_{j=1}^{l} a_j$, for which we know that $\sum_{j=1}^{l} a_j > l - 1$. Let $l \in \{k,\dots,k+M-1\}$. The $l$-partial sum is split into two parts:
$$\sum_{j=1}^{l} c_j = \sum_{j=1}^{k} a_j + \sum_{j=1}^{l-k+1} b_j.$$
Now we utilize the fact that $a_k = 0$. The $(k-1)$-partial sum of $A$ equals the $k$-partial sum, for which it holds that $\sum_{j=1}^{k} a_j > k - 1$. Thus
$$\sum_{j=1}^{k} a_j + \sum_{j=1}^{l-k+1} b_j > (k - 1) + (l - k), \qquad\text{and}\qquad \sum_{j=1}^{l} c_j > l - 1.$$

Next, we consider $l \in \{k+M,\dots,(N-1)+M-1\}$. The $l$-partial sum is split into three parts, two of which put together a partial sum of $A$:
$$\sum_{j=1}^{l} c_j = \sum_{j=1}^{k} a_j + \sum_{j=1}^{M} b_j + \sum_{j=k+1}^{l-M+1} a_j = \sum_{j=1}^{l-M+1} a_j + M - 1.$$


We have also directly used the complete sum of $B$ (equal to $M - 1$). Now, we simply use the level-property inequality for the $(l-M+1)$-partial sum of $A$:
$$\sum_{j=1}^{l-M+1} a_j + M - 1 > (l - M) + (M - 1),$$
and we see that
$$\sum_{j=1}^{l} c_j > l - 1.$$

Finally, let $l = N - 1 + M$. The complete sum of $C$ equals the total of the complete sums of $A$ and $B$, which directly gives
$$\sum_{j=1}^{N-1+M} c_j = \sum_{j=1}^{N} a_j + \sum_{j=1}^{M} b_j = (N - 1) + (M - 1).$$
This means the level-property is fulfilled for $C$ and thus $C \in Codes(N - 1 + M)$.
As a result of the two previous lemmas, we have the whole claim that the Codes are closed against the left-addition $\oplus_k$.
Lemma A.6. Let $A \in Codes(N)$, $A = \{a_j\}_{j=1}^{N}$, $B \in Codes(M)$, $B = \{b_i\}_{i=1}^{M}$, and $k \in \{1,\dots,N\}$ any position in $A$. The sequence $\{c_l\}_{l=1}^{P} = A \oplus_k B$ fulfils the level-property, i.e.
$$A \oplus_k B \in Codes(N - |sub(A,k)| + M). \tag{A.9}$$
Proof. The proof of this lemma relies only on the composition of the two preceding ones. We rewrite the left-addition $A \oplus_k B$ as $A \oplus_k B = (A \oplus_k (0)) \oplus_k B = A' \oplus_k B$ and simply apply the first lemma (A.4) to the first element $A' = A \oplus_k (0)$. The lemma claims that $A' \in Codes(N - |sub(A,k)| + 1)$; the length of the Code $A'$ thus equals $N - |sub(A,k)| + 1$. Then, due to the second lemma (A.5), we have that $A' \oplus_k B$ lies in Codes with length $N - |sub(A,k)| + 1 - 1 + M = N - |sub(A,k)| + M$. This means that $A \oplus_k B \in Codes(N - |sub(A,k)| + M)$, which rounds off the proof.
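The left-addition itself reduces to a simple splice once the subcode length is known; the following Python sketch illustrates it and checks the length claim of lemma A.6. The function names are illustrative; the thesis defines the operator formally in definition 2.4.

```python
def subcode_length(code, k):
    """Length of the unique subcode of a valid Code starting at 1-based position k."""
    partial = 0
    for length, a in enumerate(code[k - 1:], start=1):
        partial += a
        if partial == length - 1:
            return length
    raise ValueError("not a valid Code")

def left_add(A, k, B):
    """Left-addition: replace the subcode of A rooted at position k by the Code B.

    The result has length len(A) - |sub(A, k)| + len(B), in line with Lemma A.6.
    """
    n = subcode_length(A, k)
    return A[:k - 1] + tuple(B) + A[k - 1 + n:]

A = (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0)
B = (2, 0, 0)
C = left_add(A, 2, B)        # swap the trivial subcode at position 2 for B
print(C, len(C) == len(A) - 1 + len(B))
```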

A.3 Relation between Codes and planted plane trees

We have incorporated the IPCode representations into the genetic algorithm environment based on the PST model. The structural form of the binary planted plane tree (a skeleton of a PST) was transformed into a Code (a skeleton of an IPCode). We moved forward without revealing more on the relationship between Codes and planted plane trees. We can now turn to two of the common points: the description of the walk-around of a planted plane tree, and the illustration of the special planted plane tree isomorphism.

A.3.1 Assigning a tree with a Code

Firstly, we will look closely at the procedure assigning any tree its Code. We used a technique introduced by Read et al. in (Read et al., 1972). This procedure is based on the so-called walk-around of a tree and is also known as tree traversing. Read calls the resulting code a Walk-Around Valency code (WAV code). Since these are identical, in our case we use directly the term Code.
The walk-around technique simply moves along the tree, as indicated in figure A.1, and writes down the number of descendants of each yet unvisited node. We start in the root of the tree and move to the left-most nodes until we reach a leaf. From a leaf we return back to the closest node which has some unvisited descendant. Again, we move to the left-most of those and repeat the same scheme until all nodes have been visited. This is also the moment we return back to the root node. Each time we meet an unvisited node, we note the number of its descendants.
The constructed sequence results in a Code. We can find in Read's article (Read et al., 1972, p. 174) that any walk-around valency code fulfils the level-property. Thus, the walk-around valency code is a Code.
Figure A.1: Schema of a walk-around of a tree. The tree is walked around as indicated by the dashed oriented line. After the whole tree is traversed, the sequence (2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0) is built.
In a very similar way as we build the Code of a tree, we can construct a series of labels (for instance instructions and parameters), if they are appended to the nodes. This way we would directly derive an IPCode from a PST.
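A walk-around of this kind is a plain preorder traversal that records the number of children of every node. The Python sketch below reproduces the Code of figure A.1; the nested-tuple representation of the tree is chosen only for illustration.

```python
def walk_around_code(tree):
    """Walk-around valency code of a planted plane tree.

    A tree is written here as a nested tuple of its children (a leaf is the
    empty tuple).  The code is the preorder sequence of child counts, exactly
    the Code used in the text.
    """
    code = [len(tree)]
    for child in tree:              # children in their fixed left-to-right order
        code.extend(walk_around_code(child))
    return code

# The tree from figure A.1, built bottom-up; node names refer to the position
# at which each node is read during the walk-around.
leaf = ()
n5  = (leaf, leaf)
n10 = (leaf, leaf)
n8  = (leaf, n10)
n4  = (n5, n8)
n3  = (n4, leaf)
tree = (leaf, n3)                   # the root
print(walk_around_code(tree))       # -> [2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0]
```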

A.3.2 Trees isomorphism

The planted plane trees are trees of a special kind. Their specificity lies, technically, in a stricter isomorphism. This isomorphism also appears if we consider the qualities of the walk-around traversing as a mapping of trees onto Codes.
In the case of the standard tree isomorphism, we would easily find that for different drawings of a tree we might get different series, since the tree traverse would proceed in a different way. Take, for instance, the two trees on the left in figure A.2; the trees are assigned Codes that do not equal each other. The left tree has Code (2, 0, 2, 0, 0) whereas the right one has Code (2, 2, 0, 0, 0). That is, the integer series differ while the trees, as standard trees, do not. The walk-around traversing would be a one-to-many mapping.
The specific tree isomorphism for which the classes induced by the isomorphism uniquely fit the Codes is the isomorphism of planted plane trees: the planted plane trees are considered as isomorphic if there is an orientation-preserving homeomorphism of the plane onto itself which maps one tree onto the other (Read et al., 1972, p. 156). In other words, the planted plane trees are said to be isomorphic to each other if their drawing in the plane is the same. No symmetries nor rotations around the root node apply in this isomorphism.

Figure A.2: Equivalence of trees. The standard isomorphism is too weak to distinguish trees the same way as Codes. The planted plane tree isomorphism (PPT) works well.
This form of isomorphism, denoted PPT, is the right approach to trees in terms of their representational powers. The PSTs based on planted plane trees are uniquely decoded and thus the constructive procedure uniquely corresponds to the PST. Nevertheless, if we recall the phenotypic multiplicity from chapter 3, we can see that the uniqueness cannot hold between the final phenotypes and PSTs (or IPCodes).

A.4 Cardinality of the Codes(N) set

Both in the main text and within the two latter sections we slightly touched the issue of the number of Codes of a given length. We already know that on certain positions fixed numbers must occur; what about the remaining positions? Could, for instance, a 2 appear on the 4th position within a Code of length 7? The answer is: it depends. It indeed depends on the other entries and also on the length of the Code. See the Code starting with (2, 2, 2, ., ., ., .); the last two entries are zeros by corollary A.2, and the other two as well, because the Code has to fit the level-property. The Code is fully defined by these three initial entries, being (2, 2, 2, 0, 0, 0, 0).
There is a complex dependence between the Code's entries induced by the inequalities of the level-property. Answering the question of an occurrence of a number on a certain position within a Code would, in a wider context, relate to the number of all possible combinations of the positions within a Code and to the overall number of all achievable Codes. This would reveal the total size of the Codes(N) set.
During the genetic optimization the Codes are recombined and their length might vary. In the implemented solutions we defined a global upper bound maxIPCodeLen for the representation's length in order to limit the resulting NNSU complexity. If we restrict ourselves to the Code skeleton of the IPCode and focus only on the Codes part, we can say that the genetic algorithm generally works with a search space of the following form:
$$Codes^{maxIPCodeLen} = \bigcup_{N \le maxIPCodeLen} Codes(N). \tag{A.10}$$
The cardinality of this set is easily derived from the cardinality of the set Codes(N). Thus, we have another motivation to find out the Codes(N) size.
For the initial lengths, we can work out the cardinalities using full evaluation. As N equals 1, 3, or 5, we directly have Codes(1) = {(0)}, Codes(3) = {(2, 0, 0)}, and Codes(5) = {(2, 2, 0, 0, 0), (2, 0, 2, 0, 0)}. The full evaluation is feasible up to lengths of approximately 20, which consumed more than 24 hours of MATLAB calculation. The initial
terms of the sequence 1, 1, 2, 5, 14, 42, 132 indicate the A000108 sequence on The On-Line Encyclopedia of Integer Sequences³. The sequence is indeed identical with the Catalan numbers⁴; the $n$th Catalan number is given directly in terms of binomial coefficients by $C_n = \binom{2n}{n}/(n+1)$. The Catalan numbers $C_n$ appear in many different areas in a number of different ways (as seen in (Catalan number on Wolfram mathworld)). The Catalan number $C_{n-1}$ gives
- the number of binary bracketings of n letters; this is the original Catalan's problem;
- the solution to the ballot problem;
- the number of states possible in an n-flexagon;
- the number of different diagonals possible in a frieze pattern with n + 1 rows;
- the number of ways of forming an n-fold exponential;
- the number of rooted planar binary trees with n internal nodes; this is exactly our concern;
- the number of mountains which can be drawn with n upstrokes and n downstrokes;
- the number of non-crossing handshakes possible across a round table between n pairs of people;
- the number of ways n + 1 factors can be completely parenthesized;
- the number of different ways a convex polygon with n + 2 sides can be cut into triangles by connecting vertices with straight lines.
We can evaluate the number of different Codes of length N as the Catalan number $C_n$ for $n = (N-1)/2$.
We can find many ways to prove the formula for the Catalan number, many of them listed in (Catalan number on WIKI), but it appears most appropriate to quote the original solution suggested by my colleague during the multiple analyses of the Codes recurrence scheme and formulas. At that time we did not even guess we should relate Catalan numbers to it; the recurrent formula and the presented form of the proof are by my colleague Pavel Plat.
³See the following link: www.research.att.com/~njas/sequences.
⁴The Catalan numbers form a sequence of natural numbers that occur in various counting problems, often involving recursively defined objects. They are named for the Belgian mathematician Eugene Charles Catalan (1814-1894); see section 10.4 in (Matousek, Nesetril, 1996) or (Catalan number on WIKI).

Theorem A.7. Let $N = 2k - 1$, $k \in \{4, 5, 6, \dots\}$. For the Codes(N) cardinality it holds that
$$|Codes(N)| = \binom{N-3}{\frac{N-3}{2}} - \binom{N-3}{\frac{N-7}{2}}. \tag{A.11}$$
Proof. Since we know that any Code contains a non-zero entry on the first position and two zeros on the last positions (cor. A.2), we rewrite the Code's variable part as $A' = \{a'_j\}_{j=1}^{N'}$, where $N' = N - 3$ and $a'_j = a_{j+1} - 1$ for $j = 1,\dots,N'$. Only these cut series are of interest, and as such they can be viewed as series of randomly, independently sampled entries (with probability $\tfrac{1}{2}$) of $\{-1, +1\}$ values. So we may define a symmetric discrete random walk through the partial sums $S_k = \sum_{j=1}^{k} a'_j$. The level-property is rewritten for the random walk as
$$S_k > -2, \quad k \in \{1,\dots,N'-1\}, \qquad S_{N'} = 0. \tag{A.12}$$
The size $|Codes(N)|$ corresponds to the number of all acceptable random walks $S_{N'}$. The term acceptable means fulfilling the properties listed in (A.12). Then the probability that an acceptable random walk is found is written as $P_{N'} = P\bigl(S_{N'} = 0,\ \min_{k\in\{1,\dots,N'-1\}} S_k > -2\bigr)$. Once this probability is found, the desired cardinality is given as $2^{N'} P_{N'}$.

$$P_{N'} = P\Bigl(S_{N'} = 0,\ \min_{k\in\{1,\dots,N'-1\}} S_k > -2\Bigr) = P(S_{N'} = 0) - P\Bigl(S_{N'} = 0,\ \min_{k\in\{1,\dots,N'-1\}} S_k \le -2\Bigr). \tag{A.13}$$

At this moment the mirroring effect is used. The idea behind it says, generally, that considering a random walk represented by $S_k$, $k \in \{1,\dots,N'\}$, starting from any position $l \in \{1,\dots,N'-1\}$, the probability $P(S_{l+j} = S_l + D)$ is equal to $P(S_{l+j} = S_l - D)$ for any $j \in \{1,\dots,N'-l\}$ and $D = 0, 1, 2, \dots$
Now let us have a look at the second term in equation (A.13). Since either $S_1 = 1$ or $S_1 = -1$, and we insist that the minimum of the partial sums $S_k$ is less than or equal to $-2$, there exists $k \in \{1,\dots,N'-1\}$ for which $S_k = -2$. Starting from $k$, the probability that 0 is reached at $N'$ is, by the mirroring effect, the same as the probability that $-4$ is reached:
$$P\Bigl(S_{N'} = 0,\ \min_{k\in\{1,\dots,N'-1\}} S_k \le -2\Bigr) = P\Bigl(S_{N'} = -4,\ \min_{k\in\{1,\dots,N'-1\}} S_k \le -2\Bigr).$$

And thus
$$P_{N'} = P(S_{N'} = 0) - P(S_{N'} = -4) = \frac{1}{2^{N'}}\binom{N'}{\frac{N'}{2}} - \frac{1}{2^{N'}}\binom{N'}{\frac{N'-4}{2}}.$$
Since $N' = N - 3$, the latter equals
$$\frac{1}{2^{N-3}}\binom{N-3}{\frac{N-3}{2}} - \frac{1}{2^{N-3}}\binom{N-3}{\frac{N-7}{2}}.$$
Multiplying these terms by $2^{N-3}$ results in the desired formula.
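The closed formula (A.11), the Catalan number $C_{(N-1)/2}$ and a brute-force enumeration can be cross-checked for small lengths; the short Python sketch below does so (the brute force fixes the three entries from corollary A.2 and enumerates the remaining $N-3$ positions).

```python
from math import comb
from itertools import product

def is_code(seq):
    """Level-property check (see section A.1)."""
    partial = 0
    for k, a in enumerate(seq, start=1):
        partial += a
        if k < len(seq) and partial <= k - 1:
            return False
    return partial == len(seq) - 1

def cardinality_formula(N):
    """|Codes(N)| via (A.11), valid for odd N >= 7."""
    return comb(N - 3, (N - 3) // 2) - comb(N - 3, (N - 7) // 2)

def catalan(n):
    return comb(2 * n, n) // (n + 1)

for N in (7, 9, 11, 13):
    brute = sum(is_code((2,) + middle + (0, 0))
                for middle in product((0, 2), repeat=N - 3))
    print(N, cardinality_formula(N), catalan((N - 1) // 2), brute)
```

All three columns agree for these lengths, which is exactly the content of theorem A.7 together with the Catalan-number identification above.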
Using the cardinality formula (A.11) we can evaluate $|Codes(N)|$ for $N \ge 7$. The summarizing cardinalities of Codes(N) and the cumulative counts $|Codes^N|$ are provided in table A.1. For higher Code lengths we can approximate the asymptotic Catalan number growth using Stirling's formula as $C_n \sim 4^n / (n^{3/2}\sqrt{\pi})$. Based on this we can see that both the Codes(N) size and the size of the cumulative union exhibit an exponential rate of growth.
Even for Code lengths under 25 the search space counts up to 300 thousand elements. Extended by the combination of all possible instructions and block descriptions (a factor of approximately $10^6$), it reaches the level of billions of combinations, which by far exceeds the level of feasible exhaustive search. With an observed average CPU time of 4 seconds per evaluation, the billions of evaluations would demand years of permanent calculation.

Code length N   Block number n   |Codes(N)|            |Codes^N|
1               0                1                     1
3               1                1                     2
5               2                2                     4
7               3                5                     9
9               4                14                    23
11              5                42                    65
13              6                132                   197
15              7                429                   626
17              8                1,430                 2,056
19              9                4,862                 6,918
21              10               16,796                23,714
23              11               58,786                82,500
25              12               208,012               290,512
27              13               742,900               1,033,412
29              14               2,674,440             3,707,852
31              15               9,694,845             13,402,697
33              16               35,357,670            48,760,367
35              17               129,644,790           178,405,157
37              18               477,638,700           656,043,857
39              19               1,767,263,190         2,423,307,047
41              20               6,564,120,420         8,987,427,467
43              21               24,466,267,020        33,453,694,487
45              22               91,482,563,640        124,936,258,127
47              23               343,059,613,650       467,995,871,777
49              24               1,289,904,147,324     1,757,900,019,101
51              25               4,861,946,401,452     6,619,846,420,553

Table A.1: Illustrative numbers of |Codes(N)| and its cumulative counts |Codes^N| for topologies up to 25 blocks (N = 1, 3, 5, ..., 51). The block number n also gives the order of the Catalan number C_n, n = (N - 1)/2.

Bibliography
ASAI S. ET AL. (2003). Prospects for the search for a Standard Model Higgs boson in ATLAS using vector boson fusion, ATLAS Scientific Note SN-ATLAS-2003-024.

ATLAS COLLABORATION (1999). ATLAS Performance and Physics Technical Design Report (May 1999), ATLAS TDR 15, CERN/LHCC/99-15.

ATLAS PHYSICS WORKSHOP, Rome (2005). See presentations at http://agenda.cern.ch/fullAgenda.php?ida=a044738.

BALAKRISHNAN K., HONAVAR V. (1995). Properties of Genetic Representations of Neural Architectures, Iowa State University, Ames.

BISHOP C. M. (1997). Neural Networks for Pattern Recognition, ISBN 0-19-853864-2, Clarendon Press, Oxford.

BITZAN P., ŠMEJKALOVÁ J., KUČERA M. (1995). Neural networks with switching units, In: Neural Network World, vol. 4, pp. 515-526.

BENGIO S., MARIÉTHOZ J., KELLER M. (2005). The Expected Performance Curve, Proceedings of the ICML 2005 workshop on ROC Analysis in Machine Learning, Bonn, Germany.

BOCK R. K. (1998). Data Analysis BriefBook, http://rkb.home.cern.ch/rkb/AN16pp/node185.html.

BOCK R. K., CHILINGARIAN A., GAUG M., HAKL F., HENGSTEBECK T., JIŘINA M., KLASCHKA J., KOTRČ E., SAVICKÝ P., TOWERS S., VAICILIUS A., WITTEK W. (2004). Methods for Multidimensional Event Classification: A Case Study using Images from a Cherenkov Gamma-Ray Telescope, In: Nuclear Instruments and Methods in Physics Research A, Vol. 516, 2004, pp. 511-528, ISSN: 0168-9002.

CATALAN NUMBER ON WIKIPEDIA. See at http://en.wikipedia.org/wiki/Catalan_number.

CRAMER N. L. (1985). A Representation for the Adaptive Generation of Simple Sequential Programs, in Proceedings of an International Conference on Genetic Algorithms and the Applications, Grefenstette, John J. (ed.), Carnegie Mellon University.

CATALAN NUMBER ON WOLFRAM MATHWORLD. See at http://mathworld.wolfram.com/CatalanNumber.html.

FILLER T. (2006). On the design of graphs using grammatical evolution, Supervised research work, Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague.

GREEN D. M., SWETS J. A. (1964). Signal Detection Theory and Psychophysics, John Wiley and Sons, New York, USA.

GRUAU F. (1994). Neural Network Synthesis using Cellular Encoding and The Genetic Algorithm, Doctor Thesis, École Normale Supérieure de Lyon.

GRUAU F., WHITLEY R. (1993). Adding learning to the cellular development of neural networks: Evolution and the Baldwin effect, In: Evolutionary Computation 1, pp. 213-233.

HAKL F., HLAVÁČEK M., KALOUS R. (2002). Application of Neural Networks Optimized by Genetic Algorithms to Higgs Boson Search, In: Computational Science, pp. 554-563, Ed: Sloot P. M. A., Tan C. J. K., Dongarra J. J., Hoekstra A. G., Vol. 3, Workshop Papers, Berlin, Springer 2002, ISBN: 3-540-43594-8, ISSN: 0302-9743, Lecture Notes in Computer Science 2331, Held: ICCS 2002, International Conference, Amsterdam, NL, 02.04.21-02.04.24.

HAKL F., HLAVÁČEK M., KALOUS R. (2002b). Application of Neural Networks Optimized by Genetic Algorithms to Higgs Boson Search, The 6th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings, edited by Callaos N., Margenstern M., Sanchez B., pp. 55-59, Orlando: IIIS, 2002, ISBN 980-07-8150-1.

HAKL F., HLAVÁČEK M., KALOUS R. (2002c). Mbb Distribution of Subsets of Higgs Boson Decay Events Defined via Neural Networks, Neural Network World, 2002, pp. 559-571, ISSN 1210-0552.

HAKL F., HLAVÁČEK M., KALOUS R. (2003). Application of Neural Networks to Higgs Boson Search, Nuclear Instruments and Methods in Physics Research A, Vol. 502, 2003, pp. 489-491, ISSN: 0168-9002.

HAKL F., JIŘINA M., RICHTER-WĄS E. (2005). Hadronic taus identification using artificial neural network, ATLAS NOTE, ATL-COM-PHYS-2005-044, CERN.

HAKL F., JIŘINA M. (1999). Using GMDH Neural Net and Neural Net with Switching Units to Find Rare Particles, Artificial Neural Nets and Genetic Algorithms, edited by Dobnikar A., Steele N., Pearson D., Albrecht R., pp. 52-58, Springer-Verlag, 1999, ISBN 3-211-83364-1.

HASAN M. M., ISLAM M. M., MURASE K. (2004). A Permutation Problem Free Modified Cellular Encoding for Evolving Artificial Neural Networks, Proceedings of CIMCA 2004, edited by M. Mohammadian, ISBN: 1740881885.

HEEGER D. (1997). Signal Detection Theory, http://www.cns.nyu.edu/~david/ftp/handouts/sdt-advanced.pdf.

HOLLAND J. (1992). Adaption in Natural and Artificial Systems, ISBN: 0-262-58111-6, A Bradford Book, The MIT Press, London.

HLAVÁČEK M. (2002). Návrh a analýza neuronových sítí s přepínacími jednotkami pro studium procesů rozpadu elementárních částic s využitím možností genetické optimalizace (Design and analysis of NNSU for study of elementary particle decays, with respect to possible genetic optimization), Diploma thesis, Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague. In Czech.

HLAVÁČEK M. (2004). Structured Neural Networks in monetary policy.

HLAVÁČEK M. (2008). Time series predictions using NNSU, PhD Thesis, in preparation.

HLAVÁČEK M., ČADA J., HAKL F. (2005). The Application of Structured Feedforward Neural Networks to the Modelling of the Daily Series of Currency in Circulation, International Conference ICNC 2005 /1./, Changsha, pp. 1234-1246, ISSN: 0302-9743.

HLAVÁČEK M., KALOUS R. (2003). Structured Neural Networks, In: PhD Conference, pp. 25-33, Ed: Hakl F., MatFyzPress, 2003, ISBN: 80-86732-16-9, Prague. In Czech.

HUSSAIN T. (1997). Cellular Encoding: Review and Critique, Technical Report, Queen's University.

JACOB CH., REHDER J. (1993). Evolution of neural net architectures by a hierarchical grammar-based genetic system, In: Proceedings of the International Joint Conference on Neural Networks and Genetic Algorithms, pp. 72-79, Innsbruck.

KALOUS R., HAKL F. (2004). Evolutionary Operators on Neural Networks Architecture, Proceedings of the International Conference on Computing, Communications and Control Technologies: CCCT '04, Austin, Texas, USA.

KALOUS R. (2004). Evolutionary operators on ICodes, Proceedings of the IX. PhD Conference Doktorandský den '04, Institute of Computer Science, Academy of Sciences of the Czech Republic, edited by F. Hakl, ISBN: 80-86732-30-4, MatfyzPress, Prague.

KAPLAN E., MEIER P. (1972). Nonparametric estimation from incomplete observations, In: Journal of the American Statistical Association, pp. 457-481.

KITANO H. (1990). Designing Neural Networks Using Genetic Algorithms with Graph Generation Systems, Complex Systems, No. 4, pp. 461-476.

KOZA J. R., RICE J. P. (1991). Genetic Generation of Both the Weight and Architecture for a Neural Network, Proceedings of the International Joint Conference on Neural Networks, Vol. II, pp. 397-404, IEEE.

KOZA J. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press.

KVASNIČKA V., POSPÍCHAL J., TIŇO J. (2000). Evolučné algoritmy (Evolutionary Algorithms), STU Bratislava. In Slovak.

MANDISCHER M. (1993). Representation and Evolution of Neural Networks, Proceedings of the International Joint Conference on Neural Networks and Genetic Algorithms, pp. 643-649, Innsbruck.

MATOUŠEK J., NEŠETŘIL J. (1996). Kapitoly z diskrétní matematiky (Chapters from Discrete Mathematics), ISBN: 80-246-0084-6, MatfyzPress, Prague. In Czech.

MCCULLOCH W. S., PITTS W. (1943). A logical calculus of the ideas immanent in nervous activity, In: Bulletin of Mathematical Biophysics, 5, pp. 115-133.

MILLER G. F., TODD P. M., HEGDE S. U. (1989). Designing Neural Networks Using Genetic Algorithms, In: Proceedings of the Third International Conference on Genetic Algorithms, pp. 379-384, Ed: Schaffer J. D.

ORALLO J. H. (2004). Classifier Evaluation in Data Mining: ROC Analysis, Universidad Politécnica de Valencia, http://www.dsic.upv.es/~jorallo/Albacete/ROCAnalysis-1.6-English.pdf.

POKORNÝ O. (2008). Graph encodings and their application in genetic optimization, Bachelor thesis, Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague. In Czech.

READ R. C. (1972). The coding of various kinds of unlabeled trees, In: Graph Theory and Computing, pp. 153-182, Ed.: Read R. C., Library of Congress Catalog Card Number: 74-187228, Academic Press, New York.

RICHTER-WĄS E., SZYMOCHA T. (2005). Hadronic tau identification with track based approach: the Z -> tau tau and di-jet events from DC1 data samples, ATLAS Physics Note ATL-PHYS-2005-005.

ROSCA J. P., BALLARD D. H. (1996). Discovery of subroutines in genetic programming, In: Advances in Genetic Programming 2, chapter 9, Ed.: Angeline P., Kinnear Jr. K. E., MIT Press, Cambridge.

SCHIFFMANN W., JOOST M., WERNER R. (1990). Performance Evaluation of Evolutionary Created Neural Network Topologies, Parallel Problem Solving from Nature 2, pp. 292-296, Ed.: H. P. Schwefel, R. Maenner, Springer Verlag.

SCHIFFMANN W., JOOST M., WERNER R. (1992). Synthesis and Performance Analysis of Neural Network Architectures, Technical Report 16/1992, University of Koblenz, Germany, ftp://archive.cis.ohio-state.edu/pub/neuroprose/schiff.nnga.ps.Z.

SCHIFFMANN W., JOOST M., WERNER R. (1993). Application of Genetic Algorithms to the Construction of Topologies for Multilayer Perceptrons, Proceedings of the International Joint Conference on Neural Networks and Genetic Algorithms, pp. 675-682, Innsbruck.

SCHMIDHUBER J. (1987). Evolutionary principles in self-referential learning, Diploma thesis, Institut f. Informatik, Tech. Univ. Munich.

ŠÍMA J., NERUDA R. (1996). Teoretické otázky neuronových sítí (Theoretical questions of neural networks), ISBN: 80-85863-18-9, MatfyzPress, Prague. In Czech.

STANLEY K. O., MIIKKULAINEN R. (2002). Evolving Neural Networks through Augmenting Topologies, In: Evolutionary Computation, Volume 10, Number 2, pp. 99-127, Massachusetts Institute of Technology.

VACHULKA J. (2006). Monitorování procesu učení neuronových sítí s přepínacími jednotkami (Monitoring of the learning process of the NNSU), Diploma thesis, Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Prague. In Czech.

VOSE M. D. (1999). The Simple Genetic Algorithm: Foundations and Theory, ISBN: 0-262-22058-X, MIT Press, London.

WHITE D. W. (1993). GANNet: A Genetic Algorithm for Searching Topology and Weight Spaces in Neural Network Design, Dissertation at the University of Maryland.

WHITLEY D., STARKWEATHER T., BOGART C. (1990). Genetic algorithms and neural networks: optimizing connections and connectivity, Parallel Computing 14, pp. 347-361, North-Holland.
