The Programmer’s Guide To Theory

Great ideas explained

First Edition

Mike James

I/O Press
I Programmer Library
Copyright © 2019 IO Press

All rights reserved. This book or any portion thereof may not be reproduced
or used in any manner whatsoever without the express written permission of
the publisher except for the use of brief quotations in a book review.

Mike James The Programmer’s Guide To Theory


1st Edition
ISBN Paperback: 9781871962437
First Printing, 2019
Revision 0

Published by IO Press www.iopress.info


In association with I Programmer www.i-programmer.info

The publisher recognizes and respects all marks used by companies and
manufacturers as a means to distinguish their products. All brand names and
product names mentioned in this book are trade marks or service marks of
their respective companies and our omission of trade marks is not an attempt
to infringe on the property of others.

Preface
Computer science, specifically the theory of computation, deserves to be
better known even among non-computer scientists. The reason is simply that
it is full of profound thoughts and ideas. It contains some paradoxes that
reveal the limits of human knowledge. It provides ways to reason about
information and randomness that are understandable without the need to
resort to abstract math. It is the very physical reality of computation that
makes it such a “solid” thing to reason about and yet doing so reveals
paradoxical results that are as interesting as anything philosophy has to offer.
In fact, computational theory has a claim to being the new philosophy of the
digital age.
And yet one of the barriers to learning the great ideas it contains is that
computer scientists have little choice but to present what they know in
mathematical form. Many books on the subject adopt the dry theorem-proof
format and this does nothing to inspire understanding and enthusiasm. I’m
not saying that theorems and proofs are unnecessary. We need mathematical
rigor to be sure what we know is what we know, but there is no need for it in
an exposition that explains the basic ideas.
My intention is that if you read this book you will understand the ideas well
enough that, if you have the mathematical prowess, you should be able to
construct proofs of your own, or failing this, understand proofs presented in
other books. This is not an academic textbook but a precursor to reading an
academic textbook.
Finally I hope that my reader will enjoy the ideas and be able to understand
them well enough not to make the common mistakes when applying them to
other areas such as philosophy, questions of consciousness and even moral
logic – which so often happens.
My grateful thanks are due to Sue Gee and Kay Ewbank for their input in
proof-reading the manuscript. As with any highly technical book there are
still likely to be mistakes, hopefully few in number and small in importance,
and if you spot any please let me know.

Mike James
November 2019

This book is based on a series of informal articles in a number of
magazines and on the I Programmer website: www.i-programmer.info

To keep informed about forthcoming titles visit the I/O Press website:
www.iopress.info.

This is also where you will find errata and update information, and where you
can provide feedback to help improve future editions.

Table of Contents
Chapter 1 11
What Is Computer Science?
Computable?......................................................................................12
Hard or Easy?.....................................................................................13
Understanding Computational Theory............................................15
The Roadmap....................................................................................15

Part I What Is Computable?

Chapter 2 19
What Is Computation?
Turing Machines...............................................................................19
Tapes and Turing Machines.............................................................20
Infinite or Simply Unlimited............................................................23
The Church-Turing Thesis...............................................................24
Turing-Complete...............................................................................25
Summary...........................................................................................27
Chapter 3 29
The Halting Problem
The Universal Turing Machine........................................................29
What Is Computable? The Halting Problem.....................................30
Reduction..........................................................................................32
The Problem With Unbounded........................................................33
Non-Computable Numbers...............................................................36
Summary...........................................................................................38
Chapter 4 39
Finite State Machines
A Finite History.................................................................................39
Representing Finite State Machines.................................................40
Finite Grammars................................................................................41
Grammar and Machines....................................................................43
Regular Expressions..........................................................................44
Other Grammars................................................................................45
Turing Machines...............................................................................47
Turing Machines and Finite State Machines...................................49
Turing Thinking................................................................................50
Summary...........................................................................................54

Chapter 5 55
Practical Grammar
Backus-Naur Form - BNF..................................................................55
Extended BNF....................................................................................57
BNF in Pictures - Syntax Diagrams..................................................58
Why Bother?......................................................................................59
Generating Examples........................................................................59
Syntax Is Semantics..........................................................................60
Traveling the Tree.............................................................................62
A Real Arithmetic Grammar.............................................................62
Summary...........................................................................................63
Chapter 6 65
Numbers, Infinity and Computation
Integers and Rationals.......................................................................65
The Irrationals...................................................................................67
The Number Hierarchy.....................................................................68
Aleph-Zero and All That...................................................................70
Unbounded Versus Infinite..............................................................71
Comparing Size.................................................................................72
In Search of Aleph-One....................................................................73
What Is Bigger than Aleph-Zero?.....................................................74
Finite But Unbounded “Irrationals”.................................................76
Enumerations....................................................................................76
Enumerating the Irrationals..............................................................78
Aleph-One and Beyond.....................................................................79
Not Enough Programs!......................................................................80
Not Enough Formulas!......................................................................82
Transcendental and Algebraic Irrationals........................................83
π - the Special Transcendental.........................................................84
Summary...........................................................................................85
Chapter 7 87
Kolmogorov Complexity and Randomness
Algorithmic Complexity...................................................................87
Kolmogorov Complexity Is Not Computable...................................89
Compressability.................................................................................90
Random and Pseudo Random...........................................................91
Randomness and Ignorance..............................................................92
Pseudo Random.................................................................................93
True Random.....................................................................................94
Summary...........................................................................................96

Chapter 8 97
Algorithm of Choice
Zermelo and Set Theory...................................................................97
The Axiom of Choice........................................................................98
To Infinity and..................................................................................98
Choice and Computability..............................................................100
Non-Constructive............................................................................101
Summary.........................................................................................103
Chapter 9 105
Gödel’s Incompleteness Theorem
The Mechanical Math Machine......................................................106
The Failure Axioms.........................................................................108
Gödel’s First Incompleteness Theorem..........................................108
End of the Dream.............................................................................111
Summary.........................................................................................112
Chapter 10 113
Lambda Calculus
What is Computable?......................................................................113
The Basic Lambda...........................................................................114
Reduction........................................................................................115
Reduction As Function Evaluation................................................116
More Than One Parameter..............................................................117
More Than One Function...............................................................118
Bound, Free and Names..................................................................118
Using Lambdas................................................................................119
The Role of Lambda In Programming.............................................121
Summary.........................................................................................122

Part II Bits, Codes and Logic
Chapter 11
Information Theory
Surprise, Sunrise!............................................................................126
Logs..................................................................................................127
Bits...................................................................................................129
A Two-Bit Tree................................................................................130
The Alphabet Game........................................................................131
Compression....................................................................................131
Channels, Rates and Noise..............................................................132
More Information - Theory.............................................................133
Summary.........................................................................................134
Chapter 12 135
Coding Theory – Splitting the Bit
Average Information.......................................................................135
Make it Equal...................................................................................137
Huffman Coding..............................................................................138
Efficient Data Compression............................................................140
Summary.........................................................................................142
Chapter 13 143
Error Correction
Parity Error......................................................................................143
Hamming Distance..........................................................................144
Hypercubes......................................................................................146
Error Correction...............................................................................147
Real Codes.......................................................................................147
Summary.........................................................................................149
Chapter 14 151
Boolean Logic
Who was George Boole?..................................................................151
Boolean Logic .................................................................................152
Truth Tables....................................................................................152
Practical Truth Tables.....................................................................153
From Truth Tables to Electronic Circuits......................................154
Logic in Hardware...........................................................................155
Binary Arithmetic............................................................................156
Sequential Logic..............................................................................158
De Morgan's Laws............................................................................159
The Universal Gate..........................................................................160
Logic, Finite State Machines and Computers................................162
Summary.........................................................................................163

Part III Computational Complexity

Chapter 15 167
How Hard Can It Be?
Orders..............................................................................................167
Problems and Instances..................................................................169
Polynomial Versus Exponential Time............................................170
A Long Wait.....................................................................................171
Where do the Big Os Come From?..................................................172
Finding the Best Algorithm............................................................174
How Fast can you Multiply?...........................................................174
Prime Testing..................................................................................175
Summary.........................................................................................178
Chapter 16 179
Recursion
Ways of Repeating Things..............................................................180
Self-Reference..................................................................................181
Functions as Objects.......................................................................182
Conditional recursion.....................................................................183
Forward and Backward Recursion.................................................185
What Use is Recursion?..................................................................186
A Case for Recursion -The Binary Tree..........................................187
Nested Loops...................................................................................188
The Paradox of Self-Reference........................................................190
Summary.........................................................................................191
Chapter 17 193
NP Versus P Algorithms
Functions and Decision Problems..................................................193
Non-Deterministic Polynomial Problems.......................................194
Co-NP...............................................................................................195
Function Problems..........................................................................196
The Hamiltonian Problem..............................................................197
Boolean Satisfiability......................................................................198
NP-Complete and Reduction..........................................................199
Proof That SAT Is NP-Complete.....................................................199
NP-Hard...........................................................................................201
What if P = NP?..............................................................................202
Summary.........................................................................................204

Chapter 1

What Is Computer Science?

Computer science is no more about computers than astronomy is about
telescopes.
Edsger Dijkstra
What can this possibly mean? Computer science has to be about computers,
doesn’t it? While computers are a central part of computer science, they are
not its subject of study. It is what they compute that is of more interest than
the actual method of computation.
Computer science as practiced and taught in higher educational
establishments is a very broad topic and it can include many practical things,
but there is a pure form of the subject which is only interested in questions
such as what can be computed, what makes something difficult or easy to
compute, and so on. This is the interpretation of computer science that we are
going to explore in the rest of this book.
So what is this computing and what sort of questions are there about it?
Today we nearly all have a rough idea of what computing is. You start from
some data, do something to it and end up with an answer, or at least
something different. This process can be deeply hidden in what we are doing.
For example, a word processor performs computation, but exactly
what it is doing is very complex and to the average user very obscure. To keep
things simple we can say that computation is just the manipulation of data in
precise and reproducible ways.
At a more practical level, computations are rarely just “games with symbols”.
They usually involve finding the answer to some real world problem or
implementing a task. For example, can you find me the solution to a set of
equations? What about finding the best route between a number of towns, or
perhaps how best to pack a suitcase? These are all potential candidates for
computation. We start off with some data on the initial configuration of the
problem and then we perform manipulations that eventually get us the
answer we are looking for. Sometimes we demand an exact answer and
sometimes we can make do with an answer that is good enough.

You can see that this has the potential to become increasingly complex even
though we have attempted to keep things simple. Most of the time we are
interested in exact answers, even though these may be more than we need in
practice. The reason is that we are interested in what can be achieved in
theory as this guides us as to what is possible in practice. So, what sort of
questions can we have about computation?

Computable?
The first and most obvious is – what can we compute?
You might think that this also has an obvious answer – surely everything is
computable. We might not know how to compute something, but this is just a
matter of discovering how to do it. In principle, everything is computable in
the sense that every question has an answer and every computation has a
result. This was a thought that occurred to mathematicians in the nineteenth
century. The German mathematician, David Hilbert (1862-1943), for example,
believed in the automation of the whole of mathematics. A machine would
perform all of the proofs and mathematicians would become secondary. It
was thought impossible that there could be anything that wasn’t computable.
This deeply held belief turns out not to be true.
There are questions and computations that are in principle impossible - not
just things that cannot be computed because they take too long, but things
that are in principle non-computable and questions that are undecidable.
This was, and is, one of the big shocks of computer science and one of the
reasons for learning more about it. The fact that there exist things that cannot
be computed is deeply surprising and it tells us something about the world
we live in. However, it is important to really understand the nature of these
non-computable results. The conditions under which they occur are subtle
and may sometimes be considered unrealistic. What this means is that the
existence of the non-computable is a pure result worthy of consideration, but
perhaps not too much practical concern. In this respect it is much like
infinity – it’s an interesting idea but you can’t actually order an infinity of
anything or perform an infinite number of operations. You can head in that
direction, but you can’t actually get there. As you will see, the idea of infinity
in all its subtle forms plays a big role in the theory of computation.
There are lots of instances, all of them clearly wrong, of non-computable things
being used to argue that real world computations are impossible. For
example, it was argued that a robot could not kill a human because of a
famous non-computable question called the halting problem. I don’t advise
anyone to take this result seriously if pursued by a homicidal robot.
If we are to avoid silly erroneous applications of the idea, it is important to
understand what non-computability actually means and how it arises, rather
than to assume that it means what its superficial description seems to imply.

There are other aspects of non-computability that might have more to tell us
about the practical nature of the universe, or more precisely our description
of the universe, but these are much more difficult to explore. We generally
believe that when you do a computation that results in a number then every
possible number is a candidate for the result. Put another way, it is easy to
believe that every number has its day as the result of some program that
produces it as a result of a computation. There are very simple arguments that
indicate that this is not so. In fact the vast majority of numbers cannot
possibly be the result of a computation for the simple reason that there are
more numbers than programs that can produce them. This is perhaps an
even more shocking conclusion than the discovery of non-computable
problems. We take it for granted that there are programs that can produce
numbers like π or e to any precision, but this type of number is relatively
rare.
An even deeper reason is the possibility that these numbers lack the
regularity to allow a definition that is shorter than they are when written out.
That is, π is an infinite sequence of digits, but we can write down a program
involving a few tens of characters that will produce these digits one after
another. Most numbers do not seem to permit their digit representation to be
condensed in this way. This is also a hint that perhaps ideas of randomness
and information play a role in what is computable. The majority of numbers
have digit sequences that are so close to random that they do not allow a
program to capture the inherent pattern of digits because there is no such
pattern. This is an area we still do not understand well and it relates to subtle
mathematical ideas such as the axiom of choice, transfinite numbers and so
on.
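To see roughly why such compressible numbers are rare, here is a back-of-the-envelope counting sketch of my own, anticipating ideas from Chapter 7 and written as a few lines of Python. There are 2^n binary strings of length n, but fewer than 2^(n-c+1) binary programs that are at least c bits shorter, so only a tiny fraction of the strings can have a much shorter description:

n, c = 1000, 10                         # string length and required saving in bits
strings = 2 ** n                        # all binary strings of length n
short_programs = 2 ** (n - c + 1) - 1   # all binary programs at least c bits shorter
print(short_programs / strings)         # about 0.002 - most strings have no shorter description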

Hard or Easy?
This sort of absolute non-computability is interesting, but often from a more
philosophical than practical point of view. Another slightly different form of
the question of computability is how hard the computation is. In this case we
are not looking for things that cannot be computed at any cost, only things
which cost so much they are out of our reach. This is a much more practical
definition of what is computable and what is non-computable and, in
particular, modern cryptography is based on problems which are difficult
enough to be regarded as practically non-computable.
This is the area of computational complexity theory and it relates to how
time, or resources in general, scale with a problem. To make this work we
need some measure of the size of a problem. Usually there is a natural
measure, but it isn’t difficult to prove that the actual way that we measure the
size of a problem isn’t that important and doesn’t alter our conclusions. For
example, if I give you a list of numbers and ask you to add them up then it is
clear that the time it takes depends on the number of values, i.e. the length of
the list. The longer the list, the longer it takes. Now suppose that I set the task
of multiplying each possible pair of numbers, including any repeats. The time
this takes goes up as the length of the list squared. You can see that the
multiplication task is much harder than the sum task and that the length of
list you can sum is much longer than the list you can multiply. This is how
we measure the difficulty of tasks.
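As a rough Python sketch of the two tasks just described, summing needs a single pass over the list, while forming every pair needs a full pass for each element – roughly n steps against n squared:

def sum_list(values):
    total = 0
    for v in values:          # about n additions
        total += v
    return total

def all_pair_products(values):
    products = []
    for a in values:          # n passes...
        for b in values:      # ...each of n multiplications
            products.append(a * b)
    return products

Doubling the length of the list doubles the work of the first function, but quadruples the work of the second.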
There are lots of tasks that are easy in the sense that the time, or other
resources, they need to complete increases relatively slowly, so we can process
large examples. Other tasks are much more demanding and quickly reach a size
where no amount of computer availability could complete them in a
reasonable time.
What qualifies as a reasonable time depends on the application and can vary
from a few years to the expected lifetime of the universe. You would probably
agree that a problem that has a computation time equal to the lifetime of the
universe is effectively non-computable. It might be possible in theory, but it
isn’t possible in practice. Such problems are the foundation for modern
cryptography.
You might say that how accessible these difficult computations are depends
on the computer hardware you have. Double the speed of your computer and
you halve the time it takes. This is true, but half the lifetime of the universe is
still too long and you only have to increase the size of the problem by a little
and again your improved computer cannot finish the computation in a
reasonable time. The discussion may be about efficiency, but it isn’t much
affected by the power of the hardware at your disposal. It isn’t so much about
the time anything takes on real hardware, more about how that time scales
with the size of the problem. In some cases the time scales in proportion to
the size of the problem and in other cases it increases much faster than this.
Notice that difficult problems might be effectively non-computable, but
unlike real non-computable problems they do have a solution. In many cases,
if someone gives you a solution to the problem then you can check that it is
correct very quickly. This is a strange class of computational tasks that take a
long time to find a solution, but are very quick to check if someone gives you
a solution. These are the so-called “NP problems”.
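As an illustration – my own example, not one the book develops – take the subset sum problem: given a list of numbers, is there a subset that adds up to some target? A naive search may have to try every one of the 2^n subsets, but checking a proposed answer is a single pass:

from itertools import combinations

def find_subset(numbers, target):          # slow: exhaustive search over all subsets
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

def check_subset(subset, target):          # fast: just add up the proposed answer
    return sum(subset) == target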
“Easy problems” are usually called “P problems” and today the greatest single
unsolved question in computer science is if NP problems really are P
problems in disguise. That is, does NP = P? If so, we simply haven’t yet noticed
that there are fast ways of solving all the difficult problems that are easy to check.
This isn’t a small theoretical difficulty, as modern cryptography depends on
the fact that NP≠P, and there is a million dollar prize for proving the
conjecture, one way or the other.

One of the big difficulties in all of this is in proving that you have a method
that is as efficient as it is possible to be. If you find that it takes too long to
solve a problem then perhaps the problem really is difficult or perhaps we
have just not thought hard enough about how to do the computation.
Computer science is intriguing because of the possibility of finding a fast
method of working something out that previously the best minds were unable
to find. It is worth saying that this does happen and computations that were
previously thought to take an impractical amount of time have been
simplified to make them accessible.

Understanding Computational Theory


I hope that the meaning of Dijkstra’s quote at the start of this chapter is
becoming clearer. We can examine computation without getting bogged down
in the design or use of a computer to actually perform the computation. This
is the theory of computation and it does inform the practice of computation,
but it isn’t dependent on any practical implementation.
In many ways this area of computer science is more like mathematics than
anything else and it is possible to pursue mathematical theorem proof-style
explanations of the subject. Many books do exactly this, but while this
approach is rigorous and essential for progress, it doesn’t help develop
intuition about what is actually being described, let alone being proved. In
this book you mostly won’t find theorems and the only “proofs” are
explanations that hopefully make a proof seem possible and reasonable. If
you have a mathematical background then converting your intuitions into a
full proof will be an interesting exercise and so much more rewarding than
spending much more time converting a theorem proof presentation into
intuitions.
In most cases I have tried to present how to think about the main ideas in
computability rather than present watertight proofs. As a result this is not a
computer science textbook, but it will help you read and master any of the
excellent textbooks on the subject.

The Roadmap
This book has three parts, the first of which is about theoretical computation.
It looks, in Chapter 2, at what constitutes a model that captures the essence
of computation and at the best known example, the Turing machine. We move
on, in Chapter 3, to consider what is logically computable. This involves
proof by contradiction, which is always troubling because it is difficult to see
exactly what is causing the problem. In this case, we examine the best known
non-computable problem – the halting problem – and find out what makes it
impossible and what changes are needed to make it possible. In Chapter 4 we
take a step back to a slightly simpler machine, the finite state machine, which
isn’t as powerful as a Turing machine, but is much closer to what is
practically possible. The interesting thing is that there is a hierarchy of
machines that result in a hierarchy of language grammars. Chapter 5 looks in
more detail at the idea of grammar and how it is useful in computing to
convey, not just syntax, but also meaning.
One of the problems in studying computer science is that many of its ideas
make use of mathematical concepts. In Chapter 6 we look in detail at
numbers, the basis of much computation, and at the different types of
infinity. Rather than a pure math approach to these ideas, we take an
algorithmic approach to make the ideas real. Chapter 7 takes us into an area
that is often overlooked – algorithmic complexity theory. This is easy enough
to understand, but has some strange consequences. Even more overlooked is
the subject of Chapter 8, the axiom of choice which deserves to be better
known as it has connections to algorithmic information and non-
computability. Chapter 9 takes us into the ideas of Gödel’s incompleteness
theorem, which is very much connected to non-computability, but in a
different arena. Finally in Chapter 10, the section ends with a look at Lambda
calculus, which is an alternative expression of what it is to compute.
Part II is about more practical, rather than logical, matters. What is a bit forms
the subject matter of Chapter 11 on classical information theory. Chapter 12
expands this to look at coding theory, which is at the heart of implementing
compression algorithms. Chapter 13 is about error correcting codes. Even
though we know about bits and information, we don’t yet know about bits
and logic, which is what Chapter 14 is all about.
Part III looks at the ideas of computational complexity theory. This isn’t about
studying things that are logically non-computable but things that are
practically non-computable. This is about algorithms that simply take so long
to run that the answers they would provide are as far out of our reach as the
answer to the halting problem.
terms of how the time scales with the problem size. Chapter 16 looks at
recursion, which is fundamental to the longest running algorithms. Finally,
Chapter 17 is about NP and P, two fundamental and related classes of
algorithms. The class P is the set of algorithms that are computable in
reasonable time for reasonable size problems. The set NP is the set of
algorithms that are checkable in reasonable time for reasonable size problems.
These two are very similar and yet so different that there is a $1 million prize
on offer for proving whether they are the same or different.

Part I

What Is Computable?

Chapter 2 What Is Computation? 19


Chapter 3 The Halting Problem 29
Chapter 4 Finite State Machines 39
Chapter 5 Practical Grammar 55
Chapter 6 Numbers, Infinity and Computation 65
Chapter 7 Kolmogorov Complexity 87
Chapter 8 The Algorithm of Choice 97
Chapter 9 Gödel’s Incompleteness Theorem 105
Chapter 10 The Lambda calculus 113

Chapter 2

What Is Computation?

Today we might simply say that computation is what a computer does. Until
recently, however, there were no computing machines. If you wanted a
computation performed then you had little choice but to ask a human to do it.
So in a sense, computing is what humans do with whatever help they can get
– slide rule, calculator or even digital computer. But we still don’t really know
what a computer does or what a computation actually is.

Turing Machines
You have probably heard of Alan Turing. He is often cited as being one of the
Bletchley Park, UK, code breakers who shortened World War II, but he is
equally, if not more, important for his fundamental work in computing.

Alan Mathison Turing, 1912-1954
Turing constructed an abstract “thought” model of computing that contained
the essence of what a human computer does. In it, he supposed that the
human computer had an unlimited supply of paper to write on and a book of
rules telling them how to proceed. He reasoned that the human would read
something from the paper, then look it up in the book which would tell them
what to do next. As a result the “computer” would write something on the
paper and move to the next rule to use. It is worth noting that at the time
Turing was working on this, the term “computer” did usually refer to a
human who was employed to perform calculations using some sort of
mechanical desk calculator.

A human seems like a simple model of computation, but it has many vague
and undefined parts that make it difficult to reason with. For example, what
operations are allowed, where can the computer write on the paper, where
can they read from and so on. We need to make the model more precise.
The first simplification is that the paper is changed into a more restrictive
paper tape that can only be written to or read from in one location, usually
called a “cell”. The tape can be moved one place left or right. This seems
limiting, but you can see that with enough moves anything can be written on
the tape. The second simplification is that the rule book has a very simple
form. It tells the computer what to write on the tape, which direction to move
the tape in, and which rule to move to next.

One of the interesting things about the Turing machine is that you can
introduce variation on how it works, but you usually end up with something
with the same computing power. That is, tinkering with the definition doesn’t
provide the computer with the ability to compute something it previously
couldn’t. It sometimes alters the speed of the computation but it introduces
nothing new that could not have been computed before.
This means that we can settle on a Turing machine being something like the
description of the human with a paper tape and a book of rules and what can
be computed using this setup isn’t changed by the exact way we define it.
This is usually expressed as the fact that what is computable is robust to the
exact formulation of the computing device.
However, if we are all going to understand what is going on, then we might as
well settle on one precise mathematical definition of what a Turing machine
is, but we need to remember that our conclusions don't really depend on the
exact details.

Tapes and Turing Machines


In most definitions a Turing machine is a finite state machine with the ability
to read and write data to a tape. Obviously to understand this definition we
need to look at what a finite state machine is. The idea is so important we will
return to it in Chapter 4, but for the moment we can make do with an
intuitive understanding.

Essentially a finite state machine consists of a number of states. When a
symbol, a character from some alphabet say, is input to the machine it
changes state in such a way that the next state depends only on the current
state and the input symbol. That is, the current state becomes the next state
depending on the symbol read from the tape.
You can represent a finite state machine in a form that makes it easier to
understand and think about. All you have to do is draw a circle for every state
and arrows that show which state follows for each input symbol. For
example, the finite state machine in the diagram has three states. If the
machine is in state 1, an A moves it to state 3 and a B moves it to state 2.
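Since the diagram itself isn’t reproduced here, a small sketch of the same machine as a lookup table may help. Only the transitions out of state 1 are described above; the rest are invented purely to complete the example:

next_state = {
    (1, "A"): 3, (1, "B"): 2,   # the transitions described in the text
    (2, "A"): 1, (2, "B"): 3,   # assumed for illustration only
    (3, "A"): 2, (3, "B"): 1,   # assumed for illustration only
}

state = 1
for symbol in "ABBA":           # feed the machine a string of input symbols
    state = next_state[(state, symbol)]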

A Turing machine is a finite state machine that has an unlimited supply of
paper tape that it can write on and read back from. There are many
formulations of a Turing machine that vary the way the tape is used, but
essentially the machine reads a symbol from the tape, to be used as an input
to the finite state machine. This takes the input symbol and, according to it
and the current state, does three things:
1. Prints something on the tape
2. Moves the tape right or left by one cell
3. Changes to a new state

To give you an example of a simple Turing machine, consider the rules:
1. If tape reads blank write a 0, move right and move to state 2
2. If tape reads blank write a 1, move right and move to state 1
The entire tape is assumed to be blank when it is started and so the Turing
machine writes:
010101010101...
where the ... means "goes on forever".
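As a quick check that nothing more than table lookup is going on, here is a minimal Python simulation of the two rules above – a sketch of my own, not part of any formal definition. The tape is a dictionary so that it can grow without limit:

tape = {}                    # cell position -> symbol; missing cells count as blank
state, pos = 1, 0            # start in state 1 at cell 0

for _ in range(12):          # run a few steps; the machine itself never halts
    if state == 1 and tape.get(pos, "blank") == "blank":
        tape[pos] = "0"      # rule 1: write 0, move right, go to state 2
        pos, state = pos + 1, 2
    elif state == 2 and tape.get(pos, "blank") == "blank":
        tape[pos] = "1"      # rule 2: write 1, move right, go to state 1
        pos, state = pos + 1, 1

print("".join(tape[i] for i in range(len(tape))))   # prints 010101010101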
Notice that the basic operations of a Turing machine are very simple
compared to what you might expect a human computer to perform. For
example, you don’t have arithmetic operations. You can’t say, “add the
number on the tape to 42”, or anything similarly sophisticated. All you can do
is read symbols, write symbols and change state.
At this point you might think that this definition of a Turing machine is
nothing like a human reading instructions from a book, but it is. You can
build complex operations like addition simply by manipulating symbols at
the very basic level. You can, for example, create a Turing machine that will
add two numbers written on a tape and write the resulting sum onto the tape.
It would be a very tedious task to actually create and specify such a Turing
machine, but it can be done. Once you have a Turing machine that can add
two numbers, you can use this to generate other operations. Again tedious,
but very possible.
The whole point is that a Turing machine is the ultimate reduction of
computation. It does hardly anything – read a symbol, write a symbol and
change state – but this is enough to build up all of the other operations you
need to do something that looks like computation. If you attempt to make it
any simpler then it likely won’t do the job.
“Everything should be made as simple as possible, but no simpler.”
Albert Einstein
A Turing machine can also perform a special action – it can stop or halt – and
surprisingly it is this behavior that attracts a great deal of attention. For
example, a Turing machine is said to recognize a sequence of symbols written
on the tape if it is started on the tape and halts in a special state called a final
state. What is interesting about this idea, as we will discover in more detail
later, is that there are sequences that a Turing machine can recognize that a
finite state machine, i.e. a Turing machine without a tape, can’t. For example,
a finite state machine can recognize a sequence that has three As followed by
three Bs, AAABBB, and so can a Turing machine. But only a Turing machine
can recognize a sequence that has an arbitrary number of As followed by the
same number of Bs. That is, a Turing machine is more powerful than a finite
state machine because it can count without limit.

At first this seems very odd, but a little thought reveals it to be quite
reasonable, obvious even!
To discover if the number of Bs is the same as the number of As you have to
do something that is equivalent to counting them, or at least matching them
up, and given there can be arbitrarily many input A symbols even before you
ever see a B you need unlimited storage to count or match them up. The finite
state machine doesn't have unlimited storage but the Turing machine does in
the form of its tape!
This is a very important observation, and one we will return to, but it
deserves some explanation before we move on.

Infinite or Simply Unlimited


Infinity is a difficult concept and one about which it is very easy to make
mistakes when reasoning. You might think that in the tangible, physical world of
computers there would be no need for infinity, but in the theoretical world of
computer science infinity figures large, very large. Many of the paradoxes and
their conclusions depend on infinity in some form or another and while
proofs and discussions often cover up this fact, it is important that you know
where the infinities are buried.
What we have just discovered is that a Turing machine is more powerful than
a finite state machine because it has unlimited storage in the form of a paper
tape. At this point we could fall to arguing the impossibility of the Turing
machine because its tape can’t be infinitely long. Some definitions of a Turing
machine do insist that the tape is infinite, but this is more than is required.
The machine doesn't, in fact, need an infinite tape, just one that isn't limited
in length, and this is a subtle difference.
Imagine if you will that Turing machines are produced with a very long, but
finite, tape and if the machine ever runs out of space you can just order some
more tape. This is the difference between the tape actually being infinitely
long or just unlimited. Mathematicians refer to this idea as finite but
unbounded. That is, at any given time the tape has a very definite finite
length, but you never run out of space to write new symbols. This is a slightly
softer and more amenable form of infinity because you never actually get to
infinity, you just head towards it as far as you need to go.
Some definitions of a Turing machine state that the tape is infinite. In most
cases this amounts to the same thing, because the Turing machine only uses as
much tape as it needs – the tape is effectively used as if it were finite but unbounded. This
idea is subtle and it is discussed in more detail in Chapter 6.

The Church-Turing Thesis
You cannot ignore the fact that a Turing machine is very simple. It certainly
isn’t a practical form of computation, but that isn’t the intent. A Turing
machine is the most basic model of computation possible and the intent is to
use it to see what is computable, not how easy it is to compute. This is
summarized in the Church-Turing thesis, named in honor of Alonzo Church
whose work was along the same lines as that of Turing.

Alonzo Church
1903–1995
Church invented the Lambda calculus, see Chapter 10, as a mathematical
embodiment of what computation is. If you like, Turing stripped the physical
computer down to its barest simplicity and Church did the same thing for
mathematics. The Church-Turing thesis is a thesis rather than a conjecture or
a theorem because it most likely cannot ever be proved. It simply states:
● Anything that can be computed, can be computed by a Turing
machine.
Notice that it doesn’t say anything about how fast. It doesn’t matter if it takes
a Turing machine several lifetimes of the universe to get the answer, it is still
computable.
The Turing machine is thus the gold standard for saying which questions
have answers and which don’t – which are decidable and which are not.
Notice also that a consequence of the Church-Turing thesis is that there
cannot be a computing device that is superior to a Turing machine. That is,
the Church-Turing thesis rules out the existence of a super-Turing machine.
Indeed about the only way you could falsify the Church-Turing thesis would
be to find a super-Turing machine – needless to say, to date, no-one has.
There are many suggestions for alternative computational systems – Post’s
production system; Lambda calculus; recursive functions; and many
variations on the basic Turing machine – but they all are equivalent to the
Turing machine described above. In the jargon they are Turing equivalent.

You might ask, why bother with these alternative systems if they are
equivalent to a Turing machine, and the answer is that in some cases they are
simpler to use for the problem in hand. In most cases, however, the Turing
machine is obviously the most direct, almost physical, interpretation of what
we mean by “compute”.
The latest star of computing, the quantum computer, is often claimed to have
amazing powers to compute things that other computers cannot, but it is easy
to prove that even a quantum computer is Turing-equivalent. A standard non-
quantum computer can simulate anything a quantum computer can do – it is
just much slower. Thus a quantum computer is Turing-equivalent, but it is
much faster – so much faster that it does enable things to be computed that
would otherwise simply take too long. In the real world time does matter.
A computational system is Turing equivalent if it can simulate a Turing
machine – i.e. it can do what a Turing machine can – and if a Turing machine
can simulate it – i.e. a Turing machine can do what it can.

Turing-Complete
This idea that every computational system is going to be at most as good as a
Turing machine leads on to the idea of a system being Turing-complete.
There are systems of computation, regular expressions for example as we’ll
see later, that look powerful and yet they cannot compute things a Turing
machine can. These, not quite full programming systems, are interesting in
their own right for the light they shine on the entire process of computing.
Any computer system that can compute everything a Turing machine can is
called Turing-complete. This is the badge that makes it a fully grown up
computer, capable of anything. Of course, it might not be efficient, and hence
too slow for real use, but in principle it is a full computing system.
One of the fun areas of computer science is identifying systems that are
Turing-complete by accident – unintended Turing-completeness. So much
the better if they are strange and outlandish and best of all if they seem
unreasonable. For example, the single x86 machine instruction MOV is
Turing-complete on its own without any other instructions. Some of the more
outlandish Turing-complete systems include knitting, sliding block puzzles,
shunting trains and many computer and card games.
Sometimes identifying Turing-completeness can be a sign that things have
gone too far. For example, markup languages such as HTML or XML, which
are designed only to format a document, would be considered dangerous if
they were proved Turing-complete. The reason is simply that having a
Turing-complete language where you don’t need one is presenting an
unnecessary opportunity for a hacker to break in. You might think that simple
languages accidentally becoming Turing-complete is unlikely, but CSS – a
style language – has been proved to be Turing-complete.

Turing-completeness may be more than just fun, however. When you look at
the physical world you quickly discover examples of Turing-completeness.
Some claim that this is a consequence of the Universe being a computational
system and its adherents want to found a new subject called digital physics.
Mostly this is just playing with words, as clearly the Universe is indeed a
computational system and this is generally what our computers are trying to
second guess. This is not to say that there are no good questions about natural
Turing-completeness that await an answer, but its existence in such
abundance is not at all surprising.

Summary

● To investigate what computation is we need a model of computation that is
simple enough to reason about.

● A Turing machine is a simplified abstraction of a “human” computer working
with pen and paper and a book of instructions.

● A Turing machine is composed of a finite state machine, which corresponds to
the book of instructions, and an unlimited supply of paper tape, which
corresponds to the pen and paper.

● This model of computation is the subject of the Church-Turing thesis which
roughly states that anything that can be computed can be computed by a
Turing machine.

● The fact that the tape is unlimited is perhaps the most important feature of a
Turing machine. If you make the tape a fixed length then much of the
computing power is lost.

● There are many alternative formulations of the Turing machine and they all
amount to the same thing. The definition of a Turing machine is robust
against small changes.

● There are also many alternative computational systems to the Turing machine,
but they are all equivalent to it. This gives rise to the idea of Turing
equivalence – a machine that can compute everything a Turing machine can
and vice-versa.

● It also gives rise to the idea of a Turing-complete system, something that is
Turing equivalent.

● Unintentional or accidental Turing-completeness is often observed and is
interesting and amusing.

● Turing-completeness is also very common in natural systems and this isn’t so
mysterious when you consider that the real world has to perform, in some
form, the computations that our computers perform when they simulate real
world problems.

Chapter 3

The Halting Problem

Now that we have the Church-Turing thesis and the prime model of computation,
the Turing machine, it is time to ask what computations are out of reach. If
you have not encountered these ideas before then prepare to be shocked and
puzzled. However paradoxical it may all sound, it is, perhaps at the expense
of some mysticism, very easy to understand. First we have to look at the
second good idea Turing had – the universal Turing machine.

The Universal Turing Machine


Suppose you have a Turing machine that computes something. It consists of a
finite state machine, an initial state and an initialized tape. These three things
completely define the Turing machine.
The key idea in creating a universal Turing machine (UTM) is to notice that
the information that defines a specific Turing machine can be coded onto a
tape. The description of the finite state machine can be written first as a table
of symbols representing states and transitions to states, i.e. it is a lookup table
for the next state given the current state and the input symbol. The
initialization information can then be written on the tape and the remainder
of the tape can be blank and used as the work area for the universal Turing
machine.
Now all we have to do is to design a Turing machine that reads the beginning
of the tape and uses this as a “lookup table” for the behavior of the finite state
machine. It can then use another special part of the tape to record the
machine’s current state. It can read the symbols from the original machine’s
initial tape, read the machine definition part of the tape to look up what it
should do next, and write the new state out to the working area.
What we have just invented is a Turing machine that can simulate any other
Turing machine – i.e. a universal Turing machine. Just in case you hadn’t
noticed, a universal Turing machine is just a programmable computer and the
description on the tape of another Turing machine is a program.
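In modern terms the trick is that the machine to be simulated is just data. A toy version of the idea – my own sketch, not Turing’s actual construction – is a single general-purpose interpreter that will run any machine handed to it as a table, such as the 0/1 printer of the previous chapter:

def run(table, tape, state, steps):
    pos = 0
    for _ in range(steps):
        symbol = tape.get(pos, "blank")
        if (state, symbol) not in table:          # no matching rule: halt
            break
        write, move, state = table[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
    return tape

printer = {(1, "blank"): ("0", "R", 2),           # the machine to simulate is a table...
           (2, "blank"): ("1", "R", 1)}
run(printer, {}, 1, 10)                           # ...and run() plays the universal machine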
In that the universal Turing machine simulates any specific Turing machine
in software, it is perhaps the first example of a virtual machine. You might
think that this is obvious, but such ideas were not commonplace when Turing
thought it up and implemented it. Even today you should be slightly
surprised that such a simple device – a Turing machine – is capable of
simulating any other Turing machine you can think of. Of course, if it wasn’t
then the Church-Turing Thesis would have been disproved.
If you read about the universal Turing machine in another book then you
might encounter a more complete description including the structure of the
finite state machine and the coding used to describe the Turing machine to be
simulated. All of this fine detail had to be done once to make sure it could be
done, but you can see that it is possible in general terms without having to deal
with the details. Be grateful someone has checked them out, but don’t worry
about it too much.

What Is Computable? The Halting Problem


At this point it all seems very reasonable – anything that can be computed
can be computed by a specific Turing machine. Even more generally,
anything that can be computed can be computed by a universal Turing
machine, a very particular Turing machine, by virtue of it being able to
simulate any other Turing machine. All you have to do is give the universal
Turing machine a description of the specific Turing machine you are
interested in on a tape and let it run. The universal Turing machine will then
do whatever the specific Turing machine did, or would do if you actually
built it.
This all seems innocent, but there is a hidden paradox waiting to come out of
the machinery and attack – and again this was the reason Turing invented all
of this! It is known as the “halting problem”. This is just the problem of
deciding if a given Turing machine will eventually halt on a specific tape.
In principle it looks reasonably easy. You examine the structure of the finite
state part of the machine and the symbols on its tape and work out if the
result halts or loops infinitely. Given that we claim that anything that can be
computed can be computed by a Turing machine, let’s suppose that there is
just such a machine called “D”. D will read the description of any Turing
machine, “T”, and its initial data from its tape and will decide if T ever stops.
It is important to note that D is a single fixed machine that is supposed to
work for all T. This rules out custom designing a machine that works for a
particular type of T or a machine that simply guesses – it has to be correct all
of the time.

There are various proofs that the halting problem isn’t computable, but the
one that reveals what is going on most clearly is one credited to Christopher
Strachey, a British computer scientist. To make the proof easier to follow, it
is better to describe what the Turing machines do rather than give their
detailed construction. You should be able to see that such Turing machines
can be constructed.

Christopher Strachey
1916-1975
Suppose we have a Turing machine, called halt, that solves the halting
problem for a machine, described by tape T, working on tape t:
halt(T,t)
if T halts on t then state is true
else state is false
gives true if the machine described by tape T working on tape t halts and
false otherwise, where true and false can be coded as any two states when
the Turing machine stops.
Now consider the following use of the halt Turing machine to create another
Turing machine:
paradox(P)
if halt(P,P)==true loop forever
else stop
Nothing is wrong with this Turing machine, despite its name. It takes a
description of a machine in the form of a tape P and runs halt(P,P) to
discover if machine P halts on tape P i.e. halts on its own description as data.
If the machine halts then paradox doesn’t, and if the machine doesn’t halt then paradox does.
This is a little silly but there is nothing wrong with the construction.
Now consider:
halt(paradox,paradox)
Now our hypothetical halting machine has to work out if paradox halts on its
own tape.

Suppose it does halt, then halt(paradox,paradox) is true, but in paradox the
call to halt(P,P) is the same because P=paradox, so it has to be true as well
and hence it loops for ever and doesn’t halt.
Now suppose it doesn’t halt. By the same reasoning halt(paradox,paradox)
is false and the if in paradox isn’t taken and paradox stops. Thus, if paradox
halts, it doesn’t and if paradox doesn’t halt, it does – so it is indeed well
named and it is a paradox and cannot exist.
The only conclusion that can be drawn is that halt cannot exist and the
halting problem is undecidable.
At this point you probably want to start to pick your way through this
explanation to find the flaw. I have to admit that I have cut a few corners in
the interests of simplicity, but you can trust me that any defects can be
patched up without changing the overall paradox.
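If it helps to see the construction in a more familiar form, here is the same
argument as a sketch in Python-like code. The function halts is the
hypothetical halting decider – nothing like it can actually be written, which is
exactly the point:

def paradox(program):
    if halts(program, program):   # the hypothetical decider
        while True:               # loop forever
            pass
    else:
        return                    # stop at once

# Asking halts(paradox, paradox) now has no consistent answer: if it
# returns true then paradox(paradox) loops forever, and if it returns
# false then paradox(paradox) stops – either way halts gave the wrong
# answer, so halts cannot exist.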
Notice that a key step in constructing this paradox is the use of halt within
the program you are feeding to halt. This is self-reference and, like many
self-references, it results in a paradox. For example, the well known barber
conundrum is a paradox:
The barber is the "one who shaves all those, and those only, who do not
shave themselves". The question is, does the barber shave himself?
It is impossible to build a Turing machine that can solve the halting problem
– and given we believe that anything that can be computed can be computed
by a Turing machine we now have the unpleasant conclusion that we have
found a problem that has no solution – it is undecidable.
We have found something that is beyond the limits of computation.

Reduction
As soon as you have one, the flood gates open and there are lots of non-
computable problems:
Does machine T halt on input tape TT?
Does machine T ever print a particular symbol?
Does T ever stop when started on a blank tape?
and so on. All of these allow the construction of a paradox and hence they are
non-computable.
Once you have a specimen non-computable problem you can prove other
problems non-computable by showing that the new problem is the same as
the old one. More precisely, if a solution to problem A can be used to solve
problem B, we say that B can be reduced to A. So if we can show that a
solution to A could be used to solve the halting problem, i.e. that the halting
problem can be reduced to A, then A is also undecidable.

This is a standard technique in computability and complexity theory,
reduction: while it is often difficult to prove something from scratch, it is
usually much easier to show that it is equivalent to something you have
already proved.
For example, the Busy Beaver problem is simple enough to state. Find the
maximum number of steps an n-state Turing machine that eventually halts
can make, i.e. find the function BB(n) which gives that maximum number of
steps. If this function were computable you could use it to put a bound on
the number of steps any halting machine could take and so create the
function halt(T,t). If you had BB(n) you would know that any n-state
machine that halt(T,t) has to process either halts within BB(n) steps or
loops forever, so simulating it for BB(n) steps settles the question. We have
reduced the halting problem to BB(n) and therefore BB(n) is non-computable.
Arguing the other way, there are some well known math problems that have
solutions if the halting problem does. For example, Goldbach’s conjecture is
that every even number greater than 2 can be written as the sum of two
primes. If the halting problem were decidable you could use halt(T,t) to
solve the Goldbach conjecture. Simply create a Turing machine G that checks
each even number to see if it is representable as the sum of two primes and
halts if it finds a counter example. Now we compute halt(G,t) and if it
returns true
then G halts and the Goldbach conjecture is false. If it returns false then no
counter example exists and the Goldbach conjecture is true. The fact that
halt is non-computable rules out this approach to proving many
mathematical conjectures that involve searching an infinite number of
examples to find a counter example.
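To make the idea concrete, here is a Python stand-in for the machine G. The
helper is_prime and the search loop are just the obvious implementations,
not anything taken from elsewhere, and as far as anyone knows the search
runs forever:

def is_prime(k):
    if k < 2:
        return False
    return all(k % d != 0 for d in range(2, int(k ** 0.5) + 1))

def goldbach_searcher():
    n = 4
    while True:
        # halt only if n is a counter example to the conjecture
        if not any(is_prime(p) and is_prime(n - p) for p in range(2, n - 1)):
            return n
        n += 2

# If halt existed, asking whether goldbach_searcher halts would settle
# the conjecture without doing the infinite search.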

The Problem With Unbounded


It is very important to note that hidden within all these descriptions is the use
of the fact that a Turing machine has a finite but unbounded tape. We never
ask if it is impossible to construct the paradoxical machine because of a
resource constraint. In the real world the memory is finite and all Turing
machines are bounded. That is, there is a maximum length of tape.
The key point is that any Turing machine with a tape of length n, a symbol set
of size s and x states (in the finite state machine) can be simulated by a finite
state machine with a = sⁿx states.
We will return to this in the next chapter, but you should be able to see that a
resource-limited Turing machine has a finite set of states – a combination of
the states of the finite state machine and the states the tape can be in. Hence
it is itself just a finite state machine.
Why does this matter? Because for any finite state machine the halting
problem is computable.
The reason is that if you wait enough time for the finite state machine with a
states to complete a+1 steps it has either already halted or entered one of the a
states more than once and hence it will loop forever and never halt. This is
often called the pigeon hole principle. If you have a things and you pick a+1
of them you must have picked one of them twice.
There is even a simple algorithm for detecting the repetition of a state that
doesn't involve recording all of the states visited.
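One such algorithm is the “tortoise and hare” cycle detection usually credited
to Robert Floyd. Here is a sketch in Python for a deterministic machine whose
complete configuration is a single value, with step giving the next
configuration and is_halt testing for a halting one – both assumed supplied
by whoever builds the machine. Only two configurations are ever stored:

def will_halt(start, step, is_halt):
    slow = fast = start
    while True:
        for _ in range(2):            # the hare takes two steps...
            if is_halt(fast):
                return True           # it reached a halting configuration
            fast = step(fast)
        slow = step(slow)             # ...the tortoise takes one
        if slow == fast:
            return False              # a configuration repeated, so it loops forever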
It is the unbounded tape that makes the Turing machine more powerful than
a finite state machine, because the tape means there is no limit to the number
of overall states – controller plus tape – that the machine can be in.
The fact that the Turing machine has an unbounded tape is vital to the
construction of the halting paradox. You might now ask where the
unbounded nature of the Turing machine comes into the construction of the
paradoxical machine? After all we never actually say “assuming the tape is
unlimited” in the construction. The answer is that we keep on creating ever
bigger machine descriptions from each initial machine description without
really noticing.
An informal description is not difficult and it is possible to convert it into a
formal proof.
Suppose we have a bound b on the size of the tape we can construct – that is
the tape cannot be bigger than b locations. Immediately we have the result
that the halting problem can be solved for these machines as they have a
finite number of states and hence are just finite state machines.
Let’s ignore this easy conclusion for the moment and instead construct a
machine that implements halt(T,t) and then evaluate
halt(paradox,paradox). This has to evaluate paradox(paradox), which in
turn has to evaluate halt(paradox,paradox), which in turn has to … You can
see that what we have is an infinite regression, each step trying to evaluate
halt(paradox,paradox). This is fine if the tape, or more generally the
resources available, are unbounded and the evaluations can go on forever.
Notice that the use of halt in paradox is an example of recursion – a subject
we will return to in Chapter 16.
What happens if the Turing machine is bounded to b tape locations?
Then at some point it will run out of tape and the evaluation of
halt(paradox,paradox) will fail and the whole set of evaluations will
complete without ever having evaluated halt(paradox,paradox). That is, an
error stops the computation without a result. In this case no paradox exists
and we haven’t proved that halt(p,t) cannot exist, only that if it does exist
there is a limit to the size of machine it can process.
The halting problem can only be solved by a machine that uses a tape bigger
than that of any machine in the class it solves the problem for, and this implies
it cannot solve the problem for itself or for any machine constructed by adding
to its tape.
You need a bit of infinity to make the halting problem emerge.

Even if you expand the conditions so that the computing machine can have
memory bounded by some function of the size of the input, the halting
problem is still decidable.
As soon as you have an unbounded set then there are questions you can ask
that have no answer.
If you still find this strange consider the following question. We have an
arbitrary, but bounded by b, set of things – is it possible to compute a value n
that is larger than the number of things in the set? Easy, set n≥b and you have
the required number.
Now change the conditions so that the set is finite but unbounded. Can you
find a value of n? The answer is no. As the set is unbounded, the question is
undecidable. Any value of n that you try can be exceeded by increasing the
size of the set. In this sense the size of an unbounded set is non-computable
by definition as it has no upper bound. The situation with the halting
paradox is very similar.
Another way of thinking of this is that the machine that solves the halting
problem has to be capable of having a tape that is larger than any machine it
solves the problem for and yet the machine corresponding to the paradox
machine has an even larger tape.
Just like n in the unbounded set, a Turing machine with the largest tape
doesn’t exist as you can always make a bigger one.
For the record:
● All real physical computing devices have bounded memory and
therefore are finite state machines, not Turing machines, and hence
they are not subject to the undecidability of the halting problem.
● A Turing machine has an unbounded tape and this is crucial in the
proof of the undecidability of the halting problem, even if this isn’t
made clear in the proof.
Having said this, it is important to realize that for a real machine the number
of states could be so large that the computation is impractical. But this is not
the same as the absolute ban imposed by the undecidability of the halting
problem for Turing machines.
There are many other undecidable propositions and they all share the same
sort of structure and depend on the fact that an unbounded tape means there
is no largest size Turing machine and no restriction on what a Turing
machine can process as long as it can just ask for more tape.
Notice that humans are also finite state machines – we do not have unlimited
memory – and hence there is no real difference, from the point of view of
computer science, between a robot and a human.

Non-Computable Numbers
You can comfort yourself with the thought that these non-computable
problems are a bit esoteric and not really practical – but there is more and this
time it is, or possibly is, practical. There is another type of non-computability
which relates to numbers and which is covered in detail in Chapters 6 and 7.
However, while we are on the topic of what cannot be computed, it is worth
giving a brief overview of the non-computable numbers and how we know
they exist.
We are all very smug about the ability to compute numbers such as π to
millions of decimal places, it seems to be what computers were invented for.
The bad news is that there are non-computable numbers. To see that this is
the case, think of each Turing machine tape as being a number – simply read
the binary number you get from a coding of its symbols on the tape that
describes it to a universal Turing machine. So to each tape there is an integer
and to each integer there is a tape. We have constructed a one-to-one
numbering of all the possible Turing machines. Now consider the number R
which starts with a decimal point and has a 1 in decimal position n if the nth
Turing tape represents a machine that halts when run on U, the universal
Turing machine, and a 0 if it doesn’t halt.
Clearly there is no Turing machine that computes R because this would need
a solution to the halting problem and there isn’t one. R is a non-computable
number, but again you can comfort yourself with the idea that it is based on
an unbounded computing machine and so isn’t really practical. Perhaps non-
computable numbers are all of this sort and not really part of the real world.
The really bad news is that it is fairly easy to prove that there are far more
non-computable numbers than there are computable numbers. Just think
about how many real numbers there are and how many tapes – there is only a
countable infinity of computable numbers as there is only a countable infinity
of tapes, but there is an uncountable infinity of real numbers in total, see
Chapter 6 if you are unclear what countable and uncountable are all about.
Put even more simply - there are more real numbers than there are programs
to compute them. Most of the world is outside of our computing reach…
What does this mean?
Some people are of the opinion that the fault lies with the real numbers. They
perhaps don’t correspond to any physical reality. Some think that it has
something deep to do with information and randomness. Some of these ideas
are discussed in more detail later.

There seem to be two distinct types of non-computable numbers.
The first are those whose existence we infer because there just aren’t enough
programs to do the computing; the second are those that are based on the
undecidability of some problem. The first type are inherently inaccessible as
the proof that they exist is non-constructive. It doesn’t give you an example of
such a number, or point to a particular algorithm that fails to compute it. It
just states that there are many non-computable numbers. Any attempt that
you make to exhibit such a number is doomed to failure because to succeed
would turn it into a computable number.
The second type of non-computable numbers are based on constructing a
number from a non-decidable problem. The simplest example is the number
given earlier where the nth bit is 0 if the nth Turing machine doesn’t halt and
1 if it does. These non-computable numbers are very different and there is a
sense in which they are computable if you make the Turing machine’s tape
bounded.

Summary
● The universal Turing machine can simulate any other Turing machine
by reading a description of it on its tape plus its starting tape.
● The halting problem is the best known example of a non-computable
or undecidable problem. All you need to do to solve it is create a
Turing machine that implements halt(p,t) which is true if and only
if the machine described by tape p halts on tape t.
● Notice that the halting problem has to be solved for all Turing
machines, not just a subset.
● If halt(p,t) exists and is computable for all p and t then it is possible
to use it to construct another machine paradox(p) which halts if p
loops on p and loops if p halts on p. This results in a paradox when
you give halt the problem of deciding if halt(paradox,paradox) halts
or not.
● The halting problem is an archetypal undecidable problem and it can
be used to prove that many other problems are undecidable by
showing that a solution to them would also solve the halting problem,
i.e. by reducing the halting problem to them.
● If the halting problem were decidable we could answer many
mathematical questions such as the Goldbach conjecture by creating a
Turing machine that searches for counter examples and then using
halt to discover if a counter example existed.
● The undecidability of the halting problem for Turing machines has its
origin in the unboundedness of the tape.
● A Turing machine with a bounded tape is a finite state machine f for
which halt(f) is computable.
● The reason that the proof that halt is undecidable fails if the tape is
bounded is that it involves a non-terminating recursion, which
requires an infinite tape.
● All real computers, humans and robots are subject to finite memory
and hence are finite state machines and are not subject to the
undecidability of the halting problem.
● It is possible to create numbers that are not computable using
undecidable problems such as halt(p,t) – for example, the number
whose nth bit is 1 if the nth Turing machine halts and 0 if it doesn’t.
Such numbers are based on the unboundedness of the Turing machine.
● There is a class of non-computable numbers that in a sense are even
more non-computable. There are only a countable number of Turing
machines, but an uncountable number of real numbers. Therefore
most real numbers do not have programs that compute them.

Chapter 4

Finite State Machines

We have already met the finite state machine as part of the Turing machine,
but now we need to consider it in its own right. There is a sense in which
every real machine is a finite state machine of one sort or another while
Turing machines, although theoretically interesting, are just theoretical.
We know that if the Church-Turing thesis is correct, Turing machines can
perform any computation that can be performed. This isn’t the end of the
story, however, because we can learn a lot about the nature of computation by
seeing what even simpler machines can do. We can gain an understanding of
how hard a computation is by asking what is the least that is needed to
perform the computation.
As already stated the simplest type of computing machine is called a ‘finite
state machine’ and as such it occupies an important position in the hierarchy
of computation.

A Finite History
A finite state machine consists of a fixed number of states. When a symbol, a
character from some alphabet say, is input to the machine it changes state in
such a way that the next state depends only on the current state and the input
symbol.
Notice that this is more sophisticated than you might think: inputting the
same symbol doesn’t always produce the same behavior or result, because the
state may have changed.
● The new state depends only on the current state and the input symbol.
You can also have the machine output a character as a result of changing state
or you can consider each state to have some sort of action associated with it.
What this means is that the entire history of the machine is summarized in its
current state. All of the inputs since the machine was started determine its
current state and thus the current state depends on the entire history of the
machine. However, all that matters for its future behavior is the state that it is
in and not how it reached this state. This means that a finite state machine
can "forget" aspects of its history that it deems irrelevant to its future.

Before you write off the finite state machine as so feeble as to be not worth
considering as a model of computation, it is worth pointing out that in
addition to being able to "forget" irrelevant aspects of its history it can record
as many as it needs. As you can have as many states as you care to invent, the
machine can record arbitrarily long histories. Suppose some complex
machine has a long and perhaps complex set of histories which determine
what it will do next. It is still a finite state machine because all you need is a
state for each of the possible past histories and the finite state machine can
respond just like the seemingly complex machine. In this case the state that
you find the machine in is an indication of its complete past history and
hence can determine what happens next.
Because a finite state machine can represent any history and, by regarding the
change of state as a response to the history, a reaction, it has been argued
that it is a sufficient model of human behavior. Unless you know of some way
that a human can have an unbounded history or equivalently an unbounded
memory then this seems to be an inescapable conclusion - humans are finite
state machines.

Representing Finite State Machines


You can represent a finite state machine in a form that makes it easier to
understand and think about. All you have to do is draw a circle for every state
and arrows that show which state follows for each input symbol. For
example, the finite state machine in the diagram below has three states. If the
machine is in state 1 then an A moves it to state 2 and a B moves it to state 3.

This really does make the finite state machine look very simple and you can
imagine how, as symbols are applied to it, it jumps around between states.

This really is a simple machine but its simplicity can be practically useful.
There are some applications which are best modeled as a finite state machine.
For example, many communications protocols, such as USB, can be defined
by a finite state machine diagram showing what happens as different pieces
of information are input. You can even write, or obtain, a compiler that will
take a finite state machine’s specification and produce code that behaves
correctly.
Many programming problems are most easily solved by actually
implementing a finite state machine. You set up an array, or other data
structure, which stores the possible states and you implement a pointer to the
location that is the current state. Each state contains a lookup table that
shows what the next state is, given an input symbol. When a symbol is read
in, your program simply has to look it up in the table and move the pointer to
the new state. This is a very common approach to organizing games.
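As a rough sketch of the idea in Python, here is a table-driven version of the
three-state A/B machine pictured earlier. Only the transitions out of state 1
were described, so the entries for states 2 and 3 are invented purely for
illustration:

transitions = {
    1: {"A": 2, "B": 3},     # from the description of the diagram
    2: {"A": 1, "B": 3},     # hypothetical entries for the other states
    3: {"A": 2, "B": 1},
}

def run(symbols, start=1):
    state = start                          # the pointer to the current state
    for symbol in symbols:
        state = transitions[state][symbol] # one table lookup per input
    return state

print(run("AB"))    # 1 -> 2 -> 3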

Finite Grammars
The practical uses of finite state machines are reason enough to be interested in
them. Every programmer should know about finite state machines and
shouldn't be afraid of implementing them as solutions to problems. However,
the second good reason is perhaps more important - but it does depend on
your outlook. Finite state machines are important because they allow us to
explore the theory of computation. They help us discover what resources are
needed to compute particular types of problem. In particular, finite state
machines are deeply connected with the idea of grammars and languages that
follow rules.
If you define two of the machine’s states as special – a starting and a finishing
state – then you can ask what sequence of symbols will move it from the
starting to the finishing state. Any sequence that does this is said to be
‘accepted’ by the machine. Equally you can think of the finite state machine
as generating the sequence by outputting the symbols as it moves from state
to state. That is, a list of state changes obeyed in order, from the start to the
finish state, generates a particular string of symbols. Any string that can be
generated in this way will also be accepted by the machine.
The point is that the simplicity, or complexity, of a sequence of symbols is
somehow connected to the simplicity or complexity of the finite state
machine that accepts it.

So we now have a way to study sequences of symbols and ask meaningful
questions. As a simple example, consider the finite state machine given below
with state 1 as start and state 3 as finish – what sequences does it accept?
Assuming that A and B are the only two symbols available, it is clear from the
diagram that any sequence like BABAA is accepted by it.

A finite machine accepts a set of sequences


In general, the machine will accept all sequences that can be described by the
computational grammar, see Chapter 5.
1) <S1> → B<S3>|A#
2) <S3> → A<S1>|B<S3>
A computational grammar is a set of rules that specify how you can change
the symbols that you are working with. The | symbol means select one of the
versions of the rule. You can use <null> to specify the starting state and # to
specify the final state; in this case we have used <S1> as the starting state.
You can have many hours of happy fun trying to prove that this grammar
parses the same sequences as the finite state machine accepts.
To see that it does, just try generating a sequence:
● Start with <S1> and apply rule 1 to get B<S3>. You could have
selected A# instead and that would be the end of the sequence.
● Use rule 2 to get BA<S1>. You have replaced <S3> by A<S1>
● Use rule 1 to get BAB<S3>
● Use rule 2 to get BABB<S3>
● Use rule 2 to get BABBA<S1>
You can carry on using rule 1 and 2 alternately until you get bored and decide
to use the A# alternative of rule 1 giving something like BABBAA#.

Grammar and Machines
Now you can start to see why we have moved on from machines to consider
grammar. The structure of the machine corresponds to sequences that obey a
given grammar. The question is which grammars correspond to finite state
machines? More to the point, can you build a finite state machine that will
accept any family of sequences? In other words, is a finite state machine
sufficiently powerful to generate or accept sequences generated by any
grammar?
The answer is fairly easy to discover by experimenting with a few sequences.
It doesn’t take long to realize that a finite state machine cannot recognize a
palindromic sequence. That is, if you want to recognize sequences like
AAABAAA, where the number of As on either side of the B has to be the same,
then you can’t use a finite state machine to do the job.
If you want to think about this example for a while you can learn a lot. For
example, it indicates that a finite state machine cannot count without limit.
You may also be puzzled by the earlier statement that you can build a finite
state machine that can "remember" any past history. Doesn't this statement
contradict the fact it cannot recognize a palindromic sequence?
Not really. You can build a finite state machine that accepts AAABAAA and
palindromes up to this size, but it won't recognize AAAABAAAA as a similar
sequence because it is bigger than the size you designed the machine to work
with. Any finite state machine that you build will have a limit on the number
of repeats it can recognize and so you can always find a palindromic
sequence that it can't recognize.
The point here isn't that you can't build a finite state machine that can
recognize palindromes, but that you can't do it so that it recognizes all
palindromes of any size. If you think back to Chapter 3, you might notice that
this is another argument about bounded and unbounded sequences. If you
want to recognize palindromes of any size you need a Turing machine, or
equivalent, and an unbounded storage facility.
A finite state machine can count, but only up to a fixed maximum.
For some this doesn't quite capture the human idea of a palindrome. In our
heads we have an algorithm that doesn't put any limits on the size - it is a
Turing algorithm rather than a finite state algorithm. The definition of a
palindrome doesn't include a size limit, but for a practical machine that
accepts palindromes there has to be one – we tend to ignore this in our
thinking. More on this idea later.

There are lots of other examples of sequences that cannot be recognized by a
finite state machine but can be parsed or specified by a suitable grammar.
What this means is that finite state machines correspond to a restricted type
of grammar. The type of sequence that can be recognized by a finite state
machine is called a ‘regular sequence’ - and, yes, this is connected to the
regular expressions available in so many programming languages.
A grammar that can parse or generate a regular sequence is called a ‘regular’
or ‘type 3 grammar’ and its rules have the form:
<non-terminal1> → symbol <non-terminal2 >
or
<non-terminal1> → symbol

Regular Expressions
Most computer languages have a regular expression facility that allows you to
specify a string of characters you are looking for. There is usually an OR
symbol, often |, so A|B matches A or B. Then there is a quantifier usually *
which means zero or more. So A* means A, AA, AAA and so on or the null
string. There are also symbols which match whole sets of characters. For
example, \d specifies any digit, \w specifies a “word” character – a letter,
digit or underscore – and \s specifies any white space character. There are
usually more symbols so you can match complicated patterns but these three
enable you to match a lot of patterns easily. For example, a file name ending
in .txt is specified as:
\w*.txt
A file name ending in .txt or .bak as:
\w*.txt|\w*.bak
A name starting with A and ending with Z as:
A\w*Z
and so on.
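Here is a rough sketch of the same patterns using Python’s re module. One
detail glossed over above is that in most regex engines a bare “.” matches any
character at all, so a literal dot is normally written “\.”:

import re

print(bool(re.fullmatch(r"\w*\.txt", "notes.txt")))           # True
print(bool(re.fullmatch(r"\w*\.txt|\w*\.bak", "old.bak")))    # True
print(bool(re.fullmatch(r"A\w*Z", "AtoZ")))                   # True
print(bool(re.fullmatch(r"A\w*Z", "Atoz")))                   # False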
As already mentioned, regular expressions usually allow many more types of
specifiers, far too many to list here. The key point is that a regular expression
is another way of specifying a string that a finite state machine can recognize,
i.e. it is a regular grammar and this means there are limits on what you can do
with it. In particular, you generally can’t use it to parse a real programming
language and there is no point in trying.
It is also worth knowing that the "*" operator is often called the Kleene star
operator after the logician Stephen Kleene. It is used generally in
programming and usually means "zero or more". For example z* means zero
or more z characters.

Other Grammars
Now that we know that regular grammars, finite state machines and the
sequences, or languages, that they work with do not include everything, the
next question is, what else is there? The answer is that there is a hierarchy of
machines and grammars, each one slightly more powerful than the last.

Avram Noam Chomsky


1928-
This hierarchy was first identified by linguist Noam Chomsky who was
interested in the grammar of natural language. He wanted to find a way to
describe, analyze and generate language and this work is still under
development today.
So what is the machine one step more powerful than a finite state machine?
The answer is a ‘pushdown machine’. This is a finite state machine with the
addition of a pushdown stack or Last In First Out (LIFO) stack.

On each state transition the machine can pop a symbol off the top of the stack
or push the input symbol onto it. The transition that the machine makes is
also determined by the current input symbol and the symbol on the top of the
stack. If you are not familiar with stacks then it is worth saying that pushing
a symbol onto the top of the stack pushes everything already on the stack
down one place and popping a symbol off the top of the stack makes
everything move up by one.

At first sight the pushdown machine doesn’t seem to be any more powerful
than a finite state machine – but it is. The reason it is more powerful is that,
while its stack only ever holds a finite amount of data, its memory is finite but
unbounded and can grow to deal with anything you can throw at it. That is, we
don't add in the realistic condition that the stack can only store a maximum
number of symbols. If you recall, it is the unbounded nature of the tape that
gives a Turing machine its extra powers. For example, a pushdown stack
machine can accept palindromes of the type AAABAAA where the number of As
on each side of the B have to be the same. It simply counts the As on both
sides by pushing them on the stack and then popping them off again after the
B.
Have a look at the pushdown machine below – it recognizes palindromes of
the type given above. If you input a string like AAABAAA to it then it will end
up in the finish state and as long as you have used up the sequence, i.e. there
are no more input symbols, then it is a palindrome. If you have symbols left
over or if you end up in state 4 it isn’t a palindrome.

A palindrome detector (TOS =Top Of Stack)
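The same idea as a sketch in Python – a checker for strings of the form
AA…ABA…A with equal numbers of As on each side, using a stack that is
simply assumed to be able to grow as far as needed:

def accepts(sequence):
    stack = []
    symbols = iter(sequence)
    for symbol in symbols:
        if symbol == "B":
            break
        if symbol != "A":
            return False
        stack.append(symbol)          # push the leading As
    else:
        return False                  # the loop never found a B
    for symbol in symbols:
        if symbol != "A" or not stack:
            return False
        stack.pop()                   # match each trailing A with a pushed one
    return not stack                  # accepted only if the stack is empty

print(accepts("AAABAAA"))   # True
print(accepts("AAABAA"))    # False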


So a pushdown machine is more powerful than a finite state machine. It also
corresponds to a more powerful grammar – a ‘context-free grammar’.
A grammar is called context-free or ‘type 2’ if all its rules are of the form:
<non-terminal> → almost anything
The key feature of a context-free grammar is that the left-hand side of the rule
is just a non-terminal class.

For example, the rule:
A<S1> → A<S1>A
isn’t context-free because there is a terminal symbol on the left. However, the
rule:
<S1> → A<S1>A
is context-free and you can see that this is exactly what you need to parse or
generate a palindrome like ABA with the addition of:
<S1> → B
which is also context-free.

Turing Machines
You can probably guess that, if the language corresponding to a pushdown
machine is called context-free, the next step up is going to be ‘context-
sensitive’. This is true, but the machine that accepts a context-sensitive
sequence is a little more difficult to describe. Instead of a stack, the machine
needs a ‘tape’, which stores the input symbols. It can read or write the tape
and then move it to the left or the right. Yes, it’s a Turing machine!
A Turing machine is more powerful than a pushdown machine because it can
read and write symbols in any order, i.e. it isn’t restricted to LIFO order.
However, it is also more powerful for another, more subtle, reason. A
pushdown machine can only ever use an amount of storage that is
proportional to the length of its input sequence, but a machine with a tape
that can move in any direction can, in principle, use any amount of storage.
For example, the machine could simply write a 1 on the tape and move one
place left, irrespective of what it finds on the tape. Such a machine would
create an unbounded sequence of 1s on the tape and so use an unbounded
amount of storage.
It turns out that a Turing machine is actually too powerful for the next step
up the ladder. What we need is a Turing machine that is restricted to using
just the portion of tape that its input sequence is written on, a ‘linear bounded
machine’. You can think of a linear bounded machine as a Turing machine
with a short tape or a full Turing machine as a linear bounded machine that
has as long a tape as it needs. The whole point of this is that a linear bounded
machine accepts context-sensitive languages defined by context-sensitive
grammars that have rules of the form:
anything1 → anything2
but with the restriction that the sequence on the output side of the rule is at
least as long as the input side.

Consider:
A<S1> → A<S1>A
as an example of a context-sensitive rule. Notice that you can think of it as
the rule:
<S1> → <S1>A
but only applied when an A comes before <S1> and hence the name “context-
sensitive”.
A full Turing machine can recognize any sequence produced by any grammar
– generally called a ‘phrase structured’ grammar. In fact, as we already know,
a Turing machine can be shown to be capable of computing anything that can
reasonably be called computable and hence it is the top of the hierarchy as it
can implement any grammar. If there was a grammar that a Turing machine
couldn't cope with then the Church-Turing Thesis wouldn't hold. Notice that
a phrase structured grammar is just a context-sensitive grammar that can also
make sequences shrink as well as grow.

The languages, the grammar and the machines


That’s our final classification of languages and the power of computing
machines. Be warned, there are lots of other equivalent, or nearly equivalent,
ways of doing the job. The reason is, of course, that there are many different
formulations of computing devices that have the same sort of power. Each
one of these seems to produce a new theory expressed in its own unique
language. There are also finer divisions of the hierarchy of computation, not
based on grammars and languages. However, this is where it all started and it
is enough for the moment.

Turing Machines and Finite State Machines
We already know that a Turing machine is more powerful than a finite state
machine, however, this advantage is often over-played. A Turing machine is
only more powerful because of its unbounded memory. Any algorithm that a
Turing machine actually implements has a bounded tape if the machine
stops. Obviously, if the machine stops it can only have executed a finite
number of steps and so the tape has a bounded length. Thus any computation
that a bounded Turing machine can complete can be performed by a finite
state machine. The subtle point is that the finite state machine is different in
each case. There is no single finite state machine that can do the work of any
bounded Turing machine that you care to select but once you have selected
such a Turing machine it can be converted into a finite state machine.
As explained in the previous chapter, a Turing machine that uses a tape of
length n, a symbol set of size s and x states (in its finite state controller) can be
simulated by a finite state machine with a = sⁿx states. Proving this is very
simple.
The tape can be considered as a string of length n and each cell can have one
of the s symbols. Thus, the total number of possible strings is sⁿ. For example,
if we have a tape of length 2 and a symbol set consisting of 0,1 then we have
the following possible tapes [0|0] [0|1] [1|0] [1|1] and you can see that
this is 2², i.e. 4. For another example, a tape of length 3 on two symbols has
the following states:
[0|0|0] [0|0|1] [0|1|0] [0|1|1]
[1|0|0] [1|0|1] [1|1|0] [1|1|1]
i.e. 2³ = 8 states.
For each state of the tape, the Turing machine’s finite state machine
controller can be in any of x states, making a total of sⁿx states. This means
you can construct a finite state machine using sⁿx states arranged so that the
transition from one state to another copies the changes in the Turing
machine’s internal state and the state of its tape. Thus the finite state machine
is exactly equivalent to the Turing machine.
What you cannot do is increase the number of states in the finite state
machine to mimic any addition to the Turing machine’s tape - but as the tape
is bounded we don't have to.
It is sometimes argued that modern computers have such large storage
capacities that they are effectively unbounded – that is, a big computer is like
a Turing machine. This isn’t correct, as the transition from finite state machine
to Turing machine isn’t gradual – it is all or nothing.
The halting problem for a Turing machine is logically not decidable. The
halting problem for a finite state machine or a bounded Turing machine is
always decidable, in theory. It may take a while in practice, but there is a
distinct difference between the inaccessibility of the halting problem for a
Turing machine and the practical difficulties encountered in the finite state
machine case.
What it is reasonable to argue is that for any bounded machine a finite state
machine can do the same job. So the language and grammar hierarchy
discussed in the last section vanishes when everything has to be finite. A
finite state machine can recognize a type 0, 1 or 2 language as long as the
length of the string is limited.

Turing Thinking
One of the interesting things about Turing machines versus finite state
machines is the way we write practical programs or, more abstractly, the way
we formulate algorithms. We nearly always formulate algorithms as if we
were working with a Turing machine.
For example, consider the simple problem of matching parentheses. This is
not a task a finite state machine can perform because the grammar that
generates the “language” is context-free:
1. <S> → (<S>)
2. <S> → <S><S>
3. <S> → #
where <S> stands for the string built up so far.
You can prove that this is the grammar of balanced parentheses or you can
just try a few examples and see that this is so.
For example, starting with S="" i.e. the null string, rule 1 gives:
()
Applying rule 2 gives:
()()
Applying rule 1 gives:
(()())
and finally rule 3 terminates the string.
Let’s create an algorithm that checks to make sure that a string of parentheses
is balanced. The most obvious one is to use a stack in which for each symbol:
if symbol is )
    if the stack is empty then stop
    else pop the stack
else push ( on stack
When the last symbol has been processed, the stack is empty if the
parentheses are balanced. If the stack isn't empty then the parentheses are
unbalanced. If there is an attempt to pop an empty stack while still processing
symbols the parentheses are unbalanced.
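A quick sketch of this algorithm in Python – the pushes and pops are simply
assumed never to run out of memory:

def balanced(text):
    stack = []
    for symbol in text:
        if symbol == ")":
            if not stack:
                return False      # attempt to pop an empty stack
            stack.pop()
        else:
            stack.append("(")     # push ( for each left parenthesis
    return not stack              # balanced only if nothing is left over

print(balanced("(()())()"))       # True
print(balanced("(()()))("))       # False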

If you try it you should be convinced that this works. Take (()())(), then the
stack operations are:
string      operation   stack
(()())()    push (      (
()())()     push (      ((
)())()      pop         (
())()       push (      ((
))()        pop         (
)()         pop         empty
()          push (      (
)           pop         empty  balanced
Now try (()()))(:
string      operation   stack
(()()))(    push (      (
()()))(     push (      ((
)()))(      pop         (
()))(       push (      ((
)))(        pop         (
))(         pop         empty
)(          pop         empty  unbalanced
This algorithm is an example of Turing thinking. The algorithm ignores any
issue of what resources are needed. It is assumed that pushes and pops will
never fail for lack of resources. The algorithm is a Turing machine style
unbounded algorithm that will deal with a string of any size.
Now consider a finite machine approach to the same problem. We need to set
up some states to react to each ( and each ) such that if the string is balanced
we end in a final state. In the machine below we start from state 1 and move
to new states according to whether we have a left or right parenthesis. State 5
is an error halt state and state 0 is an unbalanced halt state.
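One machine that behaves this way can be written down as a transition table
– a sketch in Python, with states 1 to 4 recording nesting depths 0 to 3, state 5
the error halt and state 0 the unbalanced halt:

table = {
    1: {"(": 2, ")": 0},
    2: {"(": 3, ")": 1},
    3: {"(": 4, ")": 2},
    4: {"(": 5, ")": 3},
}

def check(text):
    state = 1
    for symbol in text:
        state = table[state][symbol]
        if state in (0, 5):
            break                 # halt states: unbalanced or out of memory
    return state                  # 1 = balanced, 0 = unbalanced, 5 = error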

Consider (())()
(())() state 1 → state 2
())() state 2 → state 3
))() state 3 → state 2
)() state 2 → state 1
() state 1 → state 2
) state 2 → state 1
and as we finish in state 1, the string is balanced.
(()))( state 1 → state 2
()))( state 2 → state 3
)))( state 3 → state 2
))( state 2 → state 1
)( state 1 → state 0
as we finish in state 0, the string is unbalanced.
Finally consider:
(((()))) state 1 → state 2
((()))) state 2 → state 3
(()))) state 3 → state 4
()))) state 4 → state 5
as we finish in state 5, an out of memory error has occurred.
The difference between the two approaches is that one doesn’t even think
about resources and uses the pattern in the data to construct an algorithm
that can process any amount of data given the resources. The second deals
with a problem with an upper bound – a nesting of more than three left
parentheses cannot be processed.
Of course, this distinction is slightly artificial in that we could impose a
bound on the depth of the stack and include a test for overflow and stop the
algorithm. Would this be finite state thinking or just Turing thinking with a
fix for when things go wrong? The point is that we construct pure algorithms
that tend not to take resource limitations into account – we think Turing,
even if we implement finite state. There is also a sense in which Turing
thinking captures the abstract essence of the algorithm that goes beyond a
mere mechanism to implement it.
The human mind can see algorithms that always work in principle even if in
practice they have limits. It is the computer science equivalent of the
mathematician conceiving of a line as having no thickness, of a perfect circle
or indeed something that has no end.

Summary
● The finite state machine is simple and less powerful than other
machines with unbounded storage such as the Turing machine.
● Finite state machines accept sequences generated by a regular
grammar.
● Regular expressions found in most programming languages are a
practical implementation of a regular grammar.
● The pushdown machine is slightly more powerful than a finite state
machine and it accepts sequences generated by a context-free
grammar.
● A full Turing machine can accept sequences generated by a phrase
structured grammar.
● The different types of machine and grammars form a hierarchy of
increasingly complex languages.
● Although a Turing machine is more powerful than a finite state
machine, in practice all machines are finite state machines.
● A Turing machine with a tape limited to n cells is equivalent to a
finite state machine with sⁿx states, where s is the number of symbols
it uses and x is the number of states in its controller.
● Even though Turing machines are an abstraction and do not exist, we
tend to create algorithms as if they did, making no assumptions about
resource limits. If we do think of resource limits they are
afterthoughts bolted on to rescue our inherently unbounded
algorithms.
● Algorithms that are unbounded are the pure thought of computer
science.

Chapter 5

Practical Grammar

This chapter is a slight detour from the subject of what is computable. We
have discovered that different types of computing model are equivalent to
different types of grammar and the languages they produce. What we haven’t
looked at so far is the impact this practical problem has had on theoretical
computer science. Many hours are spent studying grammar, how to specify it
and how to use it. What is often ignored is what practical relevance grammar
has to the meaning of a language? Surely the answer is none whatsoever!
After all, grammar is about syntax and semantics is about meaning. This isn’t
quite true. A grammar that doesn’t reflect meaning is a useless theoretical
construct.
There is an argument that the arithmetic expression is the whole reason that
grammar and parsing methods were invented, but once you have the theory
of how to do it you might as well use it for the entire structure of a language.

Backus-Naur Form - BNF


In the early days of high-level languages the only really difficult part was the
translation of arithmetic expressions to machine code and much of the theory
of grammar was invented to deal with just this problem.

John Warner Backus


1924 - 2007

At the core of this theory, and much used in the definition of programming
languages, is BNF, or Backus Normal Form. Because Donald Knuth pointed
out that it really isn't a normal form, in the sense of providing a unique
normalized grammar, it is often known as Backus-Naur Form after John
Backus the inventor of Fortran, and Peter Naur one of the people involved in
creating Algol 60. Fortran was the first "high level" computer language and its
name derives from FORmula TRANslation. The intention was to create a
computer language that would allow programmers to write something that
looked like an arithmetic expression or formula. For example, A=B*2+C is a
Fortran expression. Working out what exactly had to be done from a formula
is not as easy as you might think and other languages ducked the issue by
insisting that programmers wrote things out explicitly - "take B multiply it by
two and add C". This was the approach that the language Cobol took and it
was some time before it added the ability to use formulas.
Not only isn't Backus Normal Form a normal form, there isn't even a single
standard notation for BNF, and everyone feels free to make up their own
variation on the basic theme. However, it is always easy enough to
understand and it is very similar to the way we have been describing
grammar to this point.
For example, using "arrow notation" you might write:
<additive expression> → <variable> + <variable>
You can read this as saying that an additive expression is formed by taking a
variable, a plus sign and another variable. This rule only fits expressions like
A+B, and not A+B+C, but you get the general idea.
Quantities in angle brackets like:
<additive expression>
are called “non-terminal” symbols because they are defined by a BNF rule and
don’t appear in the final language. The plus sign on the other hand is a
“terminal” symbol because it isn't further defined by a BNF rule.
You might notice that there is a slight cheat going on in that the non-terminal
<variable> was replaced by a terminal in the examples, but without a rule
allowing this to happen. Well, in proper BNF, you have to have a rule that
defines every non-terminal in terms of terminal symbols, but in the real world
this becomes so tedious that we often don’t bother and rely instead on
common sense. In this case the missing rule is something like:
<variable> → A|B|C|D etc .. |Z
where the vertical bar is read as “OR”. Notice that this defines <variable> as
just a single character - you need a slightly more complicated rule for a full
multicharacter variable name.

If you want to define a variable that was allowed to have more than a one-
letter name you might use:
1. <variable> → <variable> <letter> | <letter>
2. <letter> → A|B|C|D etc .. |Z
This is the BNF equivalent of a repeating operation. It says that a variable is
either a variable followed by a letter or a letter. For example, to use this
definition you start with:
<variable>
and use rule 1 to replace it by:
<variable><letter>
then use rule 2 to replace <letter> by A to give:
<variable>A
Next we use rule 1 to replace <variable> to get:
<variable><letter>A
and so on building up the variable name one character at a time.
Boring isn’t it? But it is an ideal way to tell a computer what the rules of a
language are.
Just to check that you are following – why does rule 1 include the alternative
|<letter> ?
The reason is that this is the version of the rule we use to stop a variable
name growing forever, because once it is selected the next step is to pick a
letter and then we have no non-terminals left in the result. Also notice that
the BNF definition of a variable name cannot restrict its length. Rules like
“fewer than 8 characters” have to be imposed as notes that are supplied
outside of the BNF grammar.

Extended BNF
To make repeats easier you will often find that an extended BNF notation is
used where curly brackets are used to indicate that the item can occur as
many times as you like – including zero. In this notation:
<variable> → <letter> {<letter>}
means “a variable is a letter followed by any number of letters”. Another
extension is the use of square brackets to indicate an optional item. For
example:
<integer constant> → [+|-] <digit> {<digit>}
means that you can have a plus or minus sign, or no sign at all, in front of a
sequence of at least one digit. The notation isn't universal by any means, but
something like it is often used to simplify BNF rules.

If you know regular expressions you might recognize some of the conventions
used to specify repeated or alternative patterns. This is no accident. Both
regular expressions and BNF are capable of parsing regular languages. The
difference is that BNF can do even more. See the previous chapter for a
specification of what sorts of grammars are needed to model different
complexities of language and machines.

BNF In Pictures - Syntax Diagrams


You will also encounter BNF in picture form. Syntax diagrams or "railroad
diagrams" are often used as an easy way to understand and even use BNF
rules. The idea is simple - each diagram shows how a non-terminal can be
constructed from other elements. Alternatives are shown as different
branches of the diagram and repeats are loops. You construct a legal string by
following a path through the diagram and you can determine if a string is
legal by finding a path though the diagram that it matches. For example, the
following syntax diagram is how the classic language Pascal defines a variable
name:

It corresponds to the rules:
<identifier> → <letter> | <identifier><letterordigit>
<letterordigit> → <letter>|<digit>
If you want to see more syntax diagrams simply see if your favorite language
has a list of them - although it has to be admitted that their use has declined.
Here is a classic example in the 1973 document defining Pascal:

Why Bother?
What have we got so far?
● A grammar is a set of rules which defines the types of things that you
can legitimately write in a given language.
This is very useful because it allows designers of computer languages to
define the language unambiguously.
If you look at the definition of almost any modern language – C, C++, C#,
Java, Python, Ruby and so on - you will find that it is defined using some
form of BNF. For example, if you lookup the grammar of C++ you will
discover that an expression is defined as:
expression:
assignment-expression
expression , assignment-expression
Which in our BNF notation would be:
<expression> → <assignment-expression>|
<expression> , <assignment-expression>
As already mentioned, sometimes the BNF has to be supplemented by a note
restricting what is legal.
This use of BNF in defining a language is so obviously useful that most
students of computer science and practicing programmers immediately take it
to heart as a good method. Things really only seem to begin to go wrong when
parsing methods are introduced. Most people just get lost in the number and
range of complexity of the methods introduced and can’t see why they should
bother with any of it.

Generating Examples
You can think of the BNF rules that define a language as a way of generating
examples of the language or of testing to see if a statement is a valid example
of the language. Generating examples is very easy. It’s rather like playing
dominoes with production rules. You pick a rule and select one of the parts of
the rule separated by “OR” and then try and find a rule that starts with any
non-terminal on the right. You replace the non-terminal items until you have
a symbol string with no non-terminal items – this is a legal example of the
language. It has to be because you generated it using nothing but the rules
that define the language.
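As a sketch of this generation game in Python, here is a tiny random generator
for the balanced-parentheses grammar from the previous chapter. The depth
cap is only there to keep the example strings short – it is not part of the
grammar:

import random

# <S> → (<S>) | <S><S> | #   written as lists of symbols
grammar = {"S": [["(", "S", ")"], ["S", "S"], []]}

def generate(symbol="S", depth=0):
    if symbol not in grammar:
        return symbol                                 # a terminal needs no expanding
    rules = grammar[symbol] if depth < 8 else [[]]    # force termination eventually
    production = random.choice(rules)                 # pick one alternative at random
    return "".join(generate(part, depth + 1) for part in production)

print(generate())    # e.g. (()()) – legal by construction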
Working the other way, i.e. from an example of the language to a set of rules
that produce it, turns out to be more difficult. Finding the production rules is
called “parsing” the expression and which parsing algorithms work well can
depend on the nature of the grammar. That is, particularly simple grammars
can be parsed using particularly simple and efficient methods. At this point
the typical course in computer science spends a great deal of time explaining
the different parsing algorithms available and a great many students simply
cannot see the wood for the trees – syntax trees that is!
While the study of parsing methods is important, you only understand why
you are doing it when you discover what parsing can do for you. Obviously if
you can construct a legal parse of an expression then the expression is
grammatical and this is worth knowing. For example, any reasonable
grammar of arithmetic expressions would soon throw out /3+*5 as nonsense
because there is no set of rules that create it.
This is valuable but there is more. If you have parsed an expression the way
that it is built up shows you what it means. This is a function of grammar that
is usually ignored because grammar is supposed to be only to do with syntax,
which is supposed to have nothing to do with meaning or semantics. In
English, for example, you can have syntax without meaning;
“The green dream sleeps furiously.”
This is a perfectly good English sentence from the point of view of grammar,
but, ignoring poetry for the moment, it doesn’t have any meaning. It is pure
syntax without any semantics.
This disassociation of syntax and semantics is often confused with the idea
that syntax conveys no meaning at all and this is wrong. For example, even in
our nonsense sentence you can identify that something “the dream” has the
property “green” and it is doing something, i.e. “sleeping”, and that it is doing
it in a certain way, i.e. “furiously”. This is a lot of information that is brought
to you courtesy of the syntax and while the whole thing may not make sense
the syntax is still doing its best to help you get at the meaning.

Syntax Is Semantics
So in natural languages syntax carries with it some general indications of
meaning. The same is true of the grammar of a programming language.
Consider a simple arithmetic expression:
3+2*5
As long as you know the rules of arithmetic, you will realize that you have to
do the multiplication first. Arithmetic notation is a remarkably sophisticated
mini-language, which is why it takes time to learn in school and why
beginners make mistakes.
Implementing this arithmetic expression as a program is difficult because you
can't simply read it from left to right and implement each operation as you
meet it. That is, 3+2*5 isn't (3+2)*5 i.e. doing the operations in the order
presented 3+2 and then *5. As the multiplication has a higher priority it is
3+(2*5) which written in order of operations is 2*5+3.

A simple grammar for this type of expression, leaving out the obvious detail,
might be:
<expression> → <expression> + <expression> |
<expression> * <expression>
This parses the expression perfectly, but it doesn’t help with the meaning of
the expression because there are two possible ways that the grammar fits:
<expression> → <expression 1> + <expression 2>
3 + 2*5
or
<expression> → <expression 1> * <expression 2>
3+2 * 5
These are both perfectly valid parses of the expression as far as the grammar
is concerned, but only the first parsing groups the expressions together in a
way that is meaningful. We know that the 2 and 5 should group as a unit of
meaning and not the 3 and the 2, but this grammar gives rise to two possible
syntax trees

We need to use a grammar that reflects the meaning of the expression.


For example, the grammar:
<expression> → <multi-expression> + <multi-expression>
<multi-expression> → <value> * <value> | <value>
also allows 3+2*5 to be legal but it only fits in one way:
<expression> → <multi-expression> + <multi-expression>
3 + 2*5
<multi-expression> → <value> * <value>
2 * 5
This means that this particular grammar only gives the syntax tree that
corresponds to the correct grouping of the arithmetic operators and their
operands. It contains the information that multiplication groups together
more strongly than addition. In this case we have a grammar that reflects the
semantics or meaning of the language and this is vital if the grammar is going
to help with the translation.

There may be many grammars that generate a language and any one of these
is fine for generating an expression or proving an expression legal, but when
it comes to parsing we need a grammar that means something. Perhaps it isn't
true to say syntax is semantics, but you really have to pick the syntax that
reflects the semantics.

Traveling the Tree


Now that we have said the deeply unfashionable thing that syntax is not isolated from semantics, we can see why we bother to use a grammar analyzer within a compiler. Put simply, a syntax tree or its equivalent can be
used to generate the machine code or intermediate code that the expression
corresponds to. The syntax tree can be considered as a program that tells you
how to evaluate the expression. For example, a common method of generating
code from a tree is to walk all its nodes using a “depth first” algorithm. That
is, visit the deepest nodes first and generate an instruction corresponding to
the value or operator stored at each node.
The details of how to do this vary but you can see the general idea in this
diagram:

So now you know. We use grammar to parse expressions, to make syntax trees, to generate the code.
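To make the idea concrete, here is a minimal sketch in Python - not the author's code, and the tuple-based node layout is just an assumption for the example - that walks a syntax tree deepest nodes first and emits instructions for an imaginary stack machine:

# A sketch (not from the book): generate stack-machine style "code" from a
# syntax tree by a depth-first, post-order walk. The node layout is an
# assumption: an operator node is a tuple (op, left, right), a value is a number.
def generate(node, code):
    if isinstance(node, tuple):      # an operator node
        op, left, right = node
        generate(left, code)         # visit the deepest nodes first
        generate(right, code)
        code.append(f"APPLY {op}")   # the operator comes after its operands
    else:                            # a value node
        code.append(f"PUSH {node}")
    return code

# The tree for 3+2*5 with the * grouping more tightly than the +
tree = ("+", 3, ("*", 2, 5))
print(generate(tree, []))
# ['PUSH 3', 'PUSH 2', 'PUSH 5', 'APPLY *', 'APPLY +']

Feeding the emitted instructions to a stack machine evaluates the expression in exactly the order the tree dictates.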

A Real Arithmetic Grammar


You might like to see what a real grammar for simple arithmetic expressions
looks like:
<Exp> → <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> → <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> → x | y | ...
Of course, this simplified grammar leaves out proper variable names and operators other than +, -, * and /, but it is a reasonable start.
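To see how such a grammar drives a parser, here is a rough recursive descent sketch in Python. It is not the book's code: <Factor> is limited to single digits, there are no parentheses, and the left-recursive rules are handled by the usual trick of looping over the operators instead.

# A sketch of a recursive descent evaluator shaped like the grammar above.
def parse(text):
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def factor():                 # <Factor> -> a single digit (an assumption)
        nonlocal pos
        ch = text[pos]
        pos += 1
        return int(ch)

    def term():                   # <Term> -> <Factor> {(*|/) <Factor>}
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = text[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def exp():                    # <Exp> -> <Term> {(+|-) <Term>}
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = text[pos]; pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    return exp()

print(parse("3+2*5"))   # 13 - the * binds more tightly than the +

Because <Term> sits below <Exp> in the grammar, the multiplication is grouped before the addition without any extra precedence rules.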

Summary
● Grammar is not just related to the different machines that characterize the languages it describes. Grammars are generally useful in computer science.

● The most common method of describing a grammar is some variation on BNF, or Backus-Naur Form. It is important to realize that there is no standard for this, but the variations are usually easy to understand.

● Extended BNF adds the ability to describe language forms that are specified as regular expressions.

● Often syntax diagrams are easier to understand than pure BNF.

● A grammar is often said to be to do with the syntax of a language and nothing at all to do with semantics, the meaning. This isn’t completely true.

● To be useful, a grammar has to reveal the meaning of a language and it has to be compatible with that meaning.

● With a suitable grammar you can work backwards from the sequence of symbols to the rules that created them. This is called parsing.

● There are many parsing algorithms and in general simple grammars have simple parsing methods.

● If a grammar is compatible with the meaning of a language then parsing can be used to construct a syntax tree which shows how the components group together.

● A syntax tree can have many uses, but the main one in computer science is to generate a sequence of machine instructions that corresponds to the original sequence of symbols. This is how a compiler works.

Chapter 6

Numbers, Infinity and Computation

Infinity is a concept that mathematicians are supposed to understand and it is a key, if often hidden, component of computer science arguments and formal
proofs. Not understanding the nature of infinity will lead you to conclude all
sorts of very wrong, and even silly, results. So let’s discover what infinity is
all about and how to reason with it. It isn’t so difficult and it is a lot of fun.
As infinity is often said to be “just” a very big number, first we need some
background on what exactly numbers are. In particular, we need to know
something about what sort of numbers there are. There is a complexity hierarchy of numbers, starting from the simplest and probably the most obvious, and working up through the rationals and irrationals to, finally, the complex numbers.
I'm not going to detour into the history of numbers, fascinating though it is,
and I’m not going to go into the details of the mathematical theory. What is
important is that you have an intuitive feel for each of the types of number,
starting with the simple numbers that we all know.

Integers and Rationals


All you really need to know is that in the beginning were the whole or natural
numbers. The natural, or counting, numbers are just the ones we need to
count things and in most cases these don't include zero, which was quite a late invention. After all, who, apart from programmers, counts zero things?
The integers are what you get when you add zero and negative numbers. You
need the negatives to solve problems like "What do I add to 5 to get 3?" and
zero appears as a result of "What is 5-5?".
You can motivate all of the different types of number by the need to solve
equations of particular types. So the integers are the solution to equations
like:
x+a=b
where a and b are natural numbers and x is sometimes a natural number and
sometimes the new sort of number - a negative integer. Notice that the
definition of the new type of number, an integer, only makes use of numbers
we already know about, the natural numbers.

Notice that we can immediately see that the natural numbers and the integers
are infinite in the usual sense of the term – there is always another integer.
You can also see that if you were asked to count the number of integers in a
set you could do this. The integers are naturally countable. If you are surprised that something being countable is worthy of comment, you need to know that there are things that you cannot count – see later.
Next in the number hierarchy come the fractions, the numbers which fill in
all the spaces on the number line. That is, if you draw a line between two
points then each point will correspond to an integer or a fraction on some
measuring system. The fractional numbers are usually known as the rationals
because they are a ratio of two integers. That is, the rationals are solutions of
equations like:
x*a=b
where again a and b are what we already know about, i.e. integers, and x is
sometimes an integer and sometimes the new type of number – a rational.
The definition of the new type of number once again only involves the old
type of number.

A small part of the number line – to every rational number there is a point, but is the reverse true? Is there a number for every point?
Notice that if you pick any two points on the line then you can always find
another fractional point between them. In fact, if you pick any two points you
can find an infinity of fractions that lie between them. This is quite unlike the
integers where there most certainly isn't an integer between every pair of
integers - for example, what is the integer between 1 and 2? Also notice that
this seems to be a very different sort of infinity to the countable infinity of the
integers. It can be considered to be a "small" infinity in that there are an
infinite number of rational points in any small section of the line. Compare
this to the "big" infinity of the integers which extends forever. You can hold
an infinite number of rationals in your hand, whereas it would be impossible
to hold an infinity of integers in any container, however big.
The infinity of the rationals seems to be very different from the infinity of the
integers, but is it really so different?

The Irrationals
Given the integers and the fractions, is there a number for every point on the number line and a point for every number? No, but this fact is far from obvious. If you disagree then you have been exposed to too much math - not
in itself a bad thing but it can sometimes make it harder to see the naive
viewpoint.
It was followers of Pythagoras who discovered a shocking truth - that there
were points on the number line that didn't correspond either to integers or
fractions.
The square root of two, √2, is one such point. You can find the point that
corresponds to √2 simply by drawing a unit square and drawing the diagonal
- now you have a line segment that is exactly √2 long and thus lines of that
length exist and are easy enough to construct. This means you can find a
point on the line that is √2 from the zero point quite easily. The point that is
√2 exists, but is there a rational number that labels it?

So far, so much basic math but, and this is the big but, it can be proved (fairly easily) that there can be no fraction, i.e. no number of the form a/b, that is √2. You don't need to know the proof to trust its result, but it is very simple. Skip it if you want to.
Suppose that you can express √2 as a rational number in its lowest terms, i.e. all factors canceled out, as a/b. Then (a/b)² is 2 by definition and hence a² = 2b².
This implies that a² is even and hence a has to be even (as odd numbers squared are still odd).
Suppose a = 2c for some integer c. Then 4c² = 2b², or b² = 2c², and hence b² is also even and hence b is even. Thus b = 2d for some integer d.
Now we can write a/b = 2c/2d, which means that a/b isn't in its lowest form as we can now remove a factor of 2 to get c/d.
As you can reduce any fraction to its lowest form by canceling common factors, there can be no a/b that squares to give 2.

What this means is that if we only have the integers and the fractions or the
rationals (because they are a ratio of two integers) then there are points on the
line that do not correspond to a number - and this is unacceptable.
The solution was that we simply expanded what we regard as a number and
the irrational numbers were added to the mix. But what exactly are irrationals
and how can you write one down?
This is a question that occupied mathematicians for a long time and it took a
while to get it right. Today we think of an irrational number as a value with
an infinite sequence of digits after the decimal point. This sequence of digits
goes on forever and can never fall into a repeating pattern because if it does it
is fairly easy to show that the value is a rational. That is, in decimal:
0.33333333333333…
repeating forever isn't an irrational number because it repeats and it is exactly 1/3, i.e. it is rational.
Notice that there is already something strange going on because, while we have an intellectual solution to what an irrational is, we still can't write one down. How could you, as the digits never end and never fall into a repeating pattern? This means that, in general, writing an irrational down isn't going to be a practical proposition. And yet I can write down √2 and this seems to specify the number precisely.
Also, what sort of equations do irrationals solve? A partial answer is that they are solutions to equations like:
xᵃ = b
Again a and b are rational, with b positive, and x is sometimes the new class of number, i.e. an irrational. The problem is that only a relatively small proportion of the irrationals come from the roots of polynomials, as we shall see.
Now we have integers like 1 or -2, fractions like 5/6 and irrationals like √2.
From the point of view of computing this is where it all gets very messy, but
first we are now ready to look at the subtlety of infinity.

The Number Hierarchy


There is a standard mathematical procedure for defining and expanding the
number system. Starting out with a set of numbers that are well defined, you
look at simple equations involving those numbers. In each case you discover
that some of the equations don't have a solution that is a number in the set of
numbers you started with. To make these solutions available you add a new
type of number to create a larger set of numbers. You repeat this until you
end up with a set of numbers for which all the equations you can think up
have a solution in the same set of numbers. That is, as far as equations go you
now have a complete set of numbers.
This sounds almost magical and the step that seems to hold the magic is
when you add something to the existing set of numbers to expand its scope. It
seems almost crazy that you can simply invent a new type of number. In fact
it isn't magical because it is the equation that defines all of the properties of
the new numbers. For example, you can't solve x² = 2 using nothing but rational numbers. So you invent a symbol, √2, to represent the new number and you can use it in calculations, e.g. √2*2+3/4, as if it was an ordinary number, simply working with the symbol as if it was a perfectly valid number. You can even occasionally remove the new symbol using the fact that √2*√2 = 2, which is, of course, given by the equation that √2 solves. If you can get rid of all of the uses of the new symbol so much the better, but if you can't then at the end of the calculation you replace the new symbol with a numeric approximation that gets you as close to the true answer as you need.
To summarize, the equation that forces each expansion of the number system is:

natural numbers:  x + a = b  leads to the integers
integers:         x * a = b  leads to the rationals
rationals:        xᵃ = b, e.g. x² = 2, leads to the irrationals and so the reals
reals:            x² = -1  leads to the complex numbers

In each case the equation listed is the one that leads to the next class of numbers, i.e. the one for which no solution exists in the current class of numbers.
Notice that the complex numbers are complete in the sense that we don’t
need to invent any more types of number to get the solutions of all equations
that involve the complex numbers. The complex numbers are very important,
but from our point of view we only need to understand the hierarchy up to
the irrationals.

Aleph-Zero and All That
There are different orders of infinity. The usual infinity, which can be
considered the smallest infinity, is usually called aleph-zero, aleph-naught or
aleph-null and is written:
ℵ₀
As this is aleph-zero and not just aleph, you have probably guessed that there are others in the series - and yes, there are: aleph-one, aleph-two and so on are all different orders of infinity, probably an aleph-zero of them.
What is this all about? Surely there is just infinity and that's that? Certainly
not, and the existence of different orders of infinity or the "transfinite"
numbers is an idea that plays an important role in computer science and the
way we think about algorithms in general. Even if you are not going to
specialize in computer science and complexity theory, it is part of your
intellectual growth to understand one of the deepest and most profound
theories of the late 19th and early 20th centuries. It was the German
mathematician Georg Cantor who constructed the theory of the transfinite numbers, a task that was said to have driven him mad. In fact, contemplation of the transfinite is a matter of extreme logic and mathematical precision rather than lunacy.

Georg Cantor
1845-1918

Unbounded Versus Infinite
Programmers meet infinity sooner and more often than the average person.
OK, mathematicians meet it all the time, but perhaps not in the reality of an
infinite loop.
Question: How long does an infinite loop last?
Answer: As long as you like.
This highlights a really important idea. We really don't have an infinite loop,
what we have is a finite but unbounded loop. If you have read the earlier
chapters this will be nothing new, but there is a sense in which it is very
much the programmer’s form of infinity. Something that is finite, but
unbounded, is on its way to being infinite, but it isn't actually there yet so we
don’t have to be too philosophical.
Consider again the difference between the following two sets:
U = the set of counting numbers less than n
N = the set of natural numbers, i.e. 0, 1, 2, 3, …
Set U is finite for any n, but it is unbounded in that it can be as big as you like
depending on n. On the other hand, N is the real thing - it is an infinite set.
It is comparable to the difference between someone offering you a very big
stick to hold - maybe one that reaches the orbit of the moon, the sun or perhaps the nearest star - and a stick that is really infinitely long.
A stick that is finite but "as big as you like" is similar to the sort of sticks that
you know. For one thing it has an end. An infinite stick doesn't have an end –
it is something new and perhaps confined to the realm of theory.
The same arguments hold for U and N. The set U always has a largest member
but N has no largest member. Infinity is different from finite but unbounded.
When you write an "infinite loop" you are really just writing a finite but
unbounded loop - some day it will stop.
Often finite but unbounded is all we need to prove theorems or test ideas and
it avoids difficulties. For example, a finite but unbounded tape in a Turing
machine does have a last cell, but an infinite tape doesn't. A Turing machine
working with tape that doesn't have an end is more than just a mechanical
problem – how do you move it? Indeed can it move? You can avoid this
particular problem by moving the reading/writing head, which is finite, but
this doesn’t solve all the conceptual difficulties of a tape that is actually
infinite.
In the early days of computer science there was a tendency to prefer “finite
but unbounded”, but as the subject became increasingly math-based, true
infinity has replaced it in most proofs. This is a shame as the programmer’s
infinity is usually all that is needed.

Comparing Size
The basic idea of the size of something is that if two sets are the same size
then we can put them into one-to-one correspondence. Clearly if this is the
case then the two sets are of the same size as there can't be any "things"
hiding in one set that make it bigger than the other set.
This definition seems innocent enough, and it is, but you have to trust it and
the conclusions you reach when you use it to compare infinite sets. For
example, suppose you take the set Z of all integers - this is clearly an infinite set. Now take the set O of all the odd integers - this set seems to have only half the number of things in it.
But wait - if you set up the 1:1 correspondence between Z and O such that each n in Z is associated with 2n+1 in O, we have to conclude that O and Z have the same number of elements - they are both sets with an infinity of elements.
This is the source of the reasonably well-known story featuring Hotel Hilbert, named in honor of the German mathematician David Hilbert. The hotel has an
infinite number of rooms and on this night they are all full when a tour bus
arrives with n new guests. The passengers are accommodated by the simple
rule that the existing guest in room m moves to room m+n - thus clearing the
first n rooms. If n were 10 then the person in room 1 would move to room 11,
room 2 to room 12 and so on. Notice that in this case the pigeonhole principle doesn’t apply. For a hotel with a finite number of rooms, r say, putting more than r people into the rooms would imply that at least one room held two or more people. This is not the case when the number of rooms is infinite, as we can always make more space.
From a programmer’s point of view the question really is “how long do the
new guests wait before moving into their rooms?” If the move is made
simultaneously then all guests come out into the corridor in one time step
and move to their new room at time step two. The new guests then move into
their new accommodation in one more time step. So the move is completed in
finite time. However, if the algorithm is to move guests one per time step you
can see that the move will never be completed in finite time because the guest
in room 1 would have to wait for the guest in room 1+n to move into room 2+n
and so on. It takes infinite time to move infinite guests one guest at a time.
Suppose a coach carrying an infinite number of new guests arrives. Can they
be accommodated? Yes, but in this case the move is from room n to room 2n.
This frees all odd number rooms and there is an infinity of these. Once again
if this is performed simultaneously it takes finite time but if the move is one
guest at a time it takes infinity to complete.
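As a small illustration, and not from the book, the two room-shuffling rules can be written as functions; of course we can only ever print the first few reassignments:

# A sketch of the two Hotel Hilbert moves (rooms are numbered 1, 2, 3, ...).
def move_for_finite_bus(room, n):
    """Existing guest in room m moves to room m + n, freeing rooms 1 to n."""
    return room + n

def move_for_infinite_bus(room):
    """Existing guest in room m moves to room 2m, freeing every odd room."""
    return 2 * room

for m in range(1, 6):
    print(m, "->", move_for_finite_bus(m, 10), "or", move_for_infinite_bus(m))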
What about Hotel Hilbert with a finite but unbounded number of rooms?
In this case when a finite number of new guests appear the same move occurs
but now we have to add n rooms to the hotel. OK, this might take a few months, but the new guests are accommodated in finite time even if the
current guests move one at a time. Finite but unbounded is much more
reasonable. This argument also works if a finite but unbounded number of
new guests appear. Things only go wrong when the coach turns up with an infinite number of guests. In this case it really does take infinity to build the new rooms – a task that is, practically speaking, never completed.
In short, as the hotel shows, you can add any number of elements to an infinite set, or take any number away - even an infinite number - and still have an infinite set.
Put more formally, the union of a finite or infinite number of infinite sets is
still infinite.

My favorite expression of this fact is:


Aleph-null bottles of beer on the wall,
Aleph-null bottles of beer,
You take one down, and pass it around,
Aleph-null bottles of beer on the wall.

In Search of Aleph-One
Is there an infinity bigger than the infinity of the integers?
At first you might think that the rationals, i.e. numbers like a/b where a and b
are integers, form a bigger set - but they don't. It is fairly easy, once you have
seen how, to arrange a one-to-one assignment between integers and rational
fractions.
If you have two sets A and B then the Cartesian product A X B of the two sets
is the set of all pairs (a,b) with a taken from A and b taken from B.
If A and B are the set of all positive integers you can consider all the pairs
(a,b) as the co-ordinates of a point in a grid with integer co-ordinates.

Now simply start at the origin and traverse the grid in a diagonal pattern
assigning a count as you go:
0->(0,0), 1->(1,0), 2->(0,1), 3->(2,0), 4->(1,1)
and so on.
Clearly we have a one-to-one mapping of the integers to the co-ordinate pairs
and so the co-ordinate pairs have the same order of infinity as the natural
numbers. This also proves that the Cartesian product, i.e. the set of all pairs, of two countably infinite sets is the same size as the sets themselves.
You can modify the proof slightly to show that the set of rationals a/b, with b not equal to 0, is also just the standard infinity. You can see that, in fact, using this enumeration we count some rationals more than once.
Basically, if you can discover an algorithm that counts the number of things
in a set, then that set is countable and an infinite countable set has order of
infinity aleph-zero.
Two questions for you before moving on.
1. Can you write a for loop that enumerates the set (a,b) or
equivalently give a formula i→(a,b)?
2. Why do we have to do the diagonal enumeration? Why can’t you
just write two nested loops which count a from 0 to infinity and
b from 0 to infinity?
See later in the chapter for answers.

What Is Bigger Than Aleph-Zero?


Consider the real numbers, that is the rational numbers plus the irrationals.
The reals are the set of infinite decimal fractions - you can use other
definitions and get to the same conclusions. That is, a real number is
something of the form:
integer. infinite sequence of digits
e.g.
12.34567891234567...
The reals include all of the types of numbers except of course the complex
numbers. If the fractional part is 0 we have the integers. If the fractional part
repeats we have a rational. If the factional part doesn’t repeat we have an
irrational.
You also only have to consider the set of reals in the interval [0,1] to find a
set that has more elements than the integers.
How to prove that this is true?

Cantor invented the diagonal argument for this and it has been used ever
since to prove things like Gödel's incompleteness theorem, the halting
problem and more. It has become a basic tool of logic and computer science.
To make things easy let's work in binary fractions. Suppose there is an
enumeration that gives you all the binary fractions between 0 and 1. We don't
actually care exactly what this enumeration is as long as we can use it to
build up a table of numbers:
1 0.010101110111101...
2 0.101110111010111...
3 0.101101001011001...
and so on. All we are doing is writing down the enumeration i -> a binary
real number for i an integer.
If this enumeration exists then we have proved that there are as many reals as
integers and vice versa and so the reals have an order of infinity that is aleph-
zero i.e. they are countable.
If this is all true the enumeration contains all of the reals as long as you keep
going. If we can find a real number that isn't in the list then we have proved
that it isn't a complete enumeration and there are perhaps more reals than
integers.
Now consider the following argument.
Take the first bit after the binary point of the first number and start a new
number with its logical Boolean NOT - that is, if the bit is 0 use 1 and if it is 1
use 0. Next take the second bit from the second number and use its NOT for
the second bit of the new number. In fact, you can see that the new number is
simply the logical NOT of the diagonal of the table. In this case:
s=0.110...
This new number s is not equal to the first number in the table because it has
been constructed to differ in the first bit. It is not equal to the second number
because it has been constructed to be different in the second bit, and so on. In
fact, you can see that s differs in the nth bit after the binary point from the
nth number in the table.
We have no choice but to conclude that s isn't in the table and so there is a real
number that isn’t in the supposedly complete enumeration. In other words,
there can be no such complete enumeration because if you present me with
one I can use it to create a number that it doesn't include.
You cannot count the real numbers and there are more real numbers than
there are integers.
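A sketch of the diagonal construction in Python, not the book's code; any program can only ever look at a finite corner of the table, but the rule applied to each row is the same:

# Given rows of binary digits (the digits after the binary point of the
# supposedly complete enumeration), build a number that differs from the
# i-th row in its i-th bit.
def diagonal(table):
    return "0." + "".join("1" if row[i] == "0" else "0"
                          for i, row in enumerate(table))

table = ["010101110111101",
         "101110111010111",
         "101101001011001"]
print(diagonal(table))   # 0.110 - differs from row i at bit i, as in the text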

Can you see why the argument fails if the fractions are not infinite
sequences of bits?

Finite But Unbounded “Irrationals”
Suppose we repeat the arguments for a sequence of n digits that are finite but
unbounded in n. Is the size of this set of unbounded “irrationals” aleph-one?
The answer is no, the size of this set of “irrationals” is still aleph-zero. You can discover that it isn’t aleph-one by simply trying to prove that it is using the diagonal argument. For any n the diagonal number has more digits than n and
hence isn’t included in the table of irrationals of size n. Thus the diagonal
proof fails.
Consider the sets Sₙ of all “irrationals” on n digits, i.e. Sₙ is the set of sequences of length n. It is obvious that each set has a finite and countable number of elements. Now consider the set S, which is the union of all of the sets Sₙ. This is clearly countable, as the union of countably many countable sets is always countable, and it is infinite – it is aleph-zero in size. You can convert this loose argument into something more precise. You can also prove it by giving an enumeration based on listing the contents of each set in order of n. You have to have the set S∞, i.e. the set of infinite sequences of digits, to get to aleph-one.

Enumerations
To see this more clearly you only have to think a little about iteration or
enumeration. This section considers this mathematics from a programmer’s
viewpoint.
Put simply, anything you can count out is enumerable.
Integers are enumerable - just write a for loop.
for i=0 to 10
display i
next i
which enumerates the integers 0 to 10.
Even an infinite number of integers is enumerable, it’s just the for loop would
take forever to complete:
for i=0 to infinity
display i
next i
OK, the loop would never end, but if you wait long enough you will see any
integer you care to pick displayed. In this sense, the integers, all of them, are
enumerable. The point is that for any particular integer you select the wait is
finite.
When it comes to enumerating infinity, the state of getting to whatever item
we are interested in if we wait long enough is the best we can hope for. An
infinite enumeration cannot be expected to end.

The next question is: are the rationals enumerable? At first look the answer
seems to be no because of the following little problem. Suppose you start your
for loop at 0. The next integer is 1 but what is the next rational? The answer is
that there isn't one. You can get as close as you like to zero by making the
fraction smaller and smaller – 1/10, 1/100, 1/1000 and so on. There is no
next fraction (technically the fractions are "dense") and this seems to make
writing a for loop to list them impossible.
However, this isn't quite the point. We don't need to enumerate the fractions
in any particular order we just need to make sure that as long as the loop
keeps going every rational number will eventually be constructed. So what
about:
for n=0 to infinity
  for i=0 to n
    display (n-i)/i
  next i
next n
This may seem like a strange transformation but it corresponds to the “zig
zag” enumeration of the rationals given earlier:

You can see that n is the column number. When n is 4, say, we need to
enumerate (4,0), (3,1), (2,2), (1,3) and (0,4) and you should be able to
confirm that this is what the nested for loops do. Also ignore the fact that we
are listing improper rationals of the form (a/0) - you cannot divide by zero.
You should be able to see that if you wait long enough every fraction will
eventually be displayed and in addition all of the possible combinations of
(a,b) will eventually be produced.
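For something you can actually run, here is a rough Python equivalent of the loops above. It is only a sketch: it skips the improper a/0 pairs and runs for a fixed number of diagonals rather than forever.

# Each diagonal n yields the pairs (n,0), (n-1,1), ..., (0,n); b = 0 is skipped.
def rationals(diagonals):
    for n in range(diagonals):          # "to infinity" in the pseudo code
        for i in range(n + 1):
            a, b = n - i, i
            if b != 0:                  # ignore the improper a/0 pairs
                yield a, b

print([f"{a}/{b}" for a, b in rationals(5)])
# ['0/1', '1/1', '0/2', '2/1', '1/2', '0/3', '3/1', '2/2', '1/3', '0/4']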
We can agree that even an infinite set is enumerable as long as you are
guaranteed to see every element if you wait long enough and so the rationals
are enumerable. Notice that by adopting the zig-zag order we have transformed what looked like two infinite loops into a single infinite loop containing a finite inner loop. This idea generalizes to n infinite loops, but it is more complicated.

Enumerating the Irrationals
Fractions, and hence the rationals, are enumerable. The next question is - are
the irrationals enumerable? At first sight it seems not, by the same argument
that there is no "next" irrational number, but as we have seen this argument
fails with the rationals because they can be transformed to a zig-zag order and
hence they are enumerable.
However, first we have another problem to solve - how do we write down the
irrationals? It is clear that you cannot write them down as anything like a/b,
so how do you do it? After lots of attempts at trying to find a way of
representing irrational numbers the simplest way of doing the job is to say
that an irrational number is represented exactly by an infinite sequence of
digits, for example, 3.14159… , where the ... means "goes on for ever". You
can see that a single irrational number is composed of an enumerable
sequence of digits that can be produced by:
for i=0 to infinity
display the ith digit of √2
next i
You should probably begin to expect something different given that now each number is itself an enumerable sequence of digits, whereas the integers and rationals are simply enumerable as a set.
What about the entire set of irrational numbers – is it enumerable? This is
quite a different problem because now you have a nested set of for loops that
both go on to infinity:
for i=0 to infinity
for j=0 to infinity
display jth digit of the ith irrational
next j
next i
The problem here is that you never get to the end of the inner loop so the
outer loop never steps on. In other words, no matter what the logic of the
program looks like, it is, in fact, functionally equivalent to:
i=0
for j=0 to infinity
display jth digit of the ith irrational
next j
The original nested loops only appear to be an enumeration because they look
like the finite case. In fact, the program only ever enumerates the first
irrational number you gave it. It’s quite a different sort of abstraction to the
enumeration of the rationals and, while it isn't a proof that you can't do it, it
does suggest what the problem might be. As the inner loop never finishes, the
second loop never moves on to another value of i and there are huge areas of
the program that are logically reachable but are never reached because of the
infinite inner loop.

It is as if you can have one infinite for loop and see results, all the results if
you wait forever - but a pair of nested infinite loops is non-computable
because there are results you will never see, no matter how long you wait.
The point is that you can't enumerate a sequence of infinite sequences. You
never finish enumerating the first one, so you never get to the second.
As already stated, this isn't a proof but you can construct a proof out of it to
show that there is no effective enumeration of the irrationals. However, this
isn't about proof, it's about seeing why an infinite sequence of infinite
sequences is different from just an infinite sequence and you can derive
Cantor’s diagonal argument from it. Now let’s return to the alephs.

Aleph-One and Beyond


The size of the real numbers is called aleph-one and it is the size of the
continuum – that model of space where between each point there is not only
another point but an infinity of points. Aleph-zero is the sort of infinity we
usually denote by ∞ and so aleph-one is bigger than our standard sort of
countable infinity. This all seems a little shocking - we now have two
infinities.
You can show that aleph-one behaves in the same way as aleph-zero in the
sense that you can take lots of elements away, even an infinity of them, and there can still be aleph-one elements left. After the previous discussion, you
can also see why the reals are not effectively enumerable and lead to a new
order of infinity – they need two infinite loops. Of course, the problem with
these loops is that the inner loop never ends so the outer one never gets to
step on to the next real number in the enumeration.
This raises the question of what we did to get from aleph-zero to aleph-one
and can we repeat it to get to aleph-two? The answer is yes. If two sets A and B
are aleph-zero in size then we already know that the usual set operations give sets that are still aleph-zero in size: A U B, i.e. A union B, which is everything in both sets – often said as “an infinity plus an infinity is the same infinity” – and A X B, the set of all pairs obtained by taking one element from A and one from B (the Cartesian product). However, there is another operation
that we haven't considered - the power set.
If you have a set of elements A, then the power set of A, usually written 2ᴬ, is the set of all subsets of A including the empty set and A itself. So if A = {a,b} the power set is {∅, {a}, {b}, A}. You can see why it is called the power set, as a set with two things in it gives rise to a power set with four things, i.e. 2² = 4. This is generally true and if A has n elements its power set has 2ⁿ elements. Notice that this is a much bigger increase than the other set operations give, i.e. the union of two n-element sets has at most 2n elements and A X A has n² elements.

The power set really does seem to change gear when it comes to increasing the number of elements. So much so that if A has aleph-zero elements then it can be proved that 2ᴬ, i.e. the power set, has aleph-one elements. And, yes, if A has aleph-one elements its power set has aleph-two elements, and so on.
● In general, if A has aleph-n elements, the power set 2ᴬ has aleph-(n+1) elements. There is an infinity of orders of infinity.
Notice also that the reals, R, are related to the power set of the integers, just as
the rationals, Q, are related to the Cartesian product of the integers, i.e. R = 2^Z and Q = Z X Z, with suitable definitions and technical adjustments.
The set of alephs is called the transfinite numbers and it is said that this is
what sent Cantor crazy but you should be able to see that it is perfectly
logical and there is no madness within.
So finally, are there aleph-zero, i.e. a countable infinity, of transfinite
numbers or are there more? The answer is we don't know and it all hinges on
the answer to the question "is there a set with an order of infinity between
aleph-zero and aleph-one".
That there isn't is the so-called continuum hypothesis, and it forms the first of Hilbert's famous list of 23 problems. Here we get into some very deep mathematics and an area of active research.
Not Enough Programs!
Now we come to the final knockout blow for computation and it is based on
the infinities. How many programs are there? Any program is simply a
sequence of bits and any sequence of bits can be read as a binary number, a
binary integer to be exact. Thus, given any program, you have a binary
number corresponding to it. Equally, given any binary number you can regard
it as a program - it might not do anything useful, but that doesn't spoil the
principle:
● Every program is an integer and every integer is a program.
This correspondence is an example of a Gödel numbering and it is often
encountered in computer science.
This means that the set of programs is the same as the set of integers and we realize immediately that the set of programs is enumerable. All I have to do is write a for loop that counts from 1 to 2ⁿ and I have all the programs that can be stored in n bits. Even if you let this process go to infinity the number of programs is enumerable. It is infinite, but it is enumerable and so there are aleph-zero programs.
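One possible encoding, given here only as an illustration and not as the book's scheme, is to read the program's text as one big binary number and back again:

# A sketch of a Gödel-style numbering: a program's source is a sequence of
# bytes, and a sequence of bytes is just a (large) binary integer.
def program_to_number(source: str) -> int:
    return int.from_bytes(source.encode("utf-8"), "big")

def number_to_program(n: int) -> str:
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big").decode("utf-8", errors="replace")

n = program_to_number('print("hello")')
print(n)                       # one integer per program
print(number_to_program(n))    # and back again: print("hello")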
This is a very strange idea. The Gödel numbering implies that if you start
generating such a numbering eventually you will reach a number that is
Windows or Linux or Word or whatever program you care to name. This has occasionally been suggested as a programming methodology, but only as a joke.

You can repeat the argument with Turing machines rather than programs.
Each Turing machine can be represented as a tape to be fed into a universal
Turing machine. The tape is a sequence of symbols that can be binary coded
and read as a single number. We have another Gödel numbering and hence
the set of Turing machines is enumerable and if we allow the tape to be finite
but unbounded there are aleph-zero Turing machines. The important point is
that the number of irrational numbers isn't enumerable, it is aleph-one, which
means there are more irrational numbers than there are programs.
So what? Think about the task of computing each possible number. That is,
imagine each number has a program that computes it and displays it. You can
do it for the integers and you can do it for the rationals, but you can't do it for
the irrationals and there are more irrational numbers than there are programs,
so many, in fact most, irrational numbers don't have programs that compute
them. We touched on this idea in Chapter 3, but now we have the benefit of
knowing about the orders of infinity.
Numbers like √2, π and e clearly do have programs that compute them, so not
all irrational numbers are non-computable - but the vast majority are non-
computable. To express it another way:
● Most of the irrational numbers do not have programs that
compute them.
What does this mean? Well, if you think about it, a program is a relatively
short, regular construct and if it generates an irrational number then
somehow the information in that number must be about the same as the
information in the program that generates it. That is, computable numbers are
regular in some complicated sense, but a non-computable number is so
irregular that you can't compress its structure into a program. This leads on to
the study of algorithmic information theory which is another interesting area
of computer science full of strange ideas and even stranger conclusions, see
Chapter 7.
Notice that you can’t get around this conclusion by somehow extending the
way you count programs. You can’t, for example, say there are an aleph-zero
of programs that run on computer A and another aleph-zero that run on
computer B so we have more programs. This is wrong because aleph-zero
plus aleph-zero programs is still just aleph-zero programs. To avoid the
problem you would have to invent a computer that works with real numbers as its programs, so that there are aleph-one of them – and you can probably see the circular argument here.

Not Enough Formulas!
If you are now thinking that programs to compute numbers aren't that
important, let's take one final rephrasing of the idea. Consider a mathematical
formula. It is a finite sequence of symbols drawn from an enumerable set of symbols and as such the set of all formulas is itself enumerable. This means that there are more
irrational numbers than there are mathematical formulas. In other words, if
you want to be dramatic, most of the numbers in mathematics are out of the
reach of mathematics.
There are some subtle points that I have to mention to avoid the mistake of
inventing "solutions" to the problem. Surely you can write an equation like
x=a where a is an irrational number and so every irrational number has its
equation in this trivial sense. The problem here is that you cannot write this
equation down without writing an infinite sequence of digits, i.e. the numeric
specification for a.
Mathematical formulas can't include explicit irrational numbers unless they
are assigned a symbol like π or e and these irrationals, the ones we give a
symbol to, are exactly the ones that have formulas that define them. That is,
the computable irrational numbers are the ones we often give names to. They
are the ones that have specific properties that enable us to reason about them.
For example, I can use the symbol √2 to stand as the name of an irrational number, but this is more than a name because I can do exact arithmetic with the symbol:
(√2(1+√2))² = 2(1+2√2+√2√2) = 2(1+2√2+2) = 2(3+2√2) = 6+4√2
We repeatedly use the fact that √2√2=2 and it is this property, and the ability
to compute an approximation to √2, that makes the symbol useful. This is not
the case for irrational numbers in general and we have no hope of making it
so. That is, a typical irrational number requires an infinite sequence of
symbols to define it. For a few we can replace this by a finite symbol, like √2,
that labels its properties and this allows it to be used in computation and
provides a way of approximating the infinite sequence to any finite number of
places.
Also notice that if you allow irrational numbers within a program or a
formula then the set of programs or formulas is no longer enumerable. What
this means is that if you extend what you mean by computing to include
irrational numbers within programs, then the irrationals become computable.
So far this idea hasn't proved particularly useful because practical computation
is based on enumerable sets.

Transcendental and Algebraic Irrationals
This lack of programs to compute all the numbers also spills over into
classical math. Long before we started to worry about computability
mathematicians had run into the problem.
The first irrational numbers that were found were roots of simple polynomials
- polynomials with rational coefficients. If you think about it for a moment
you will see that a polynomial is a sort of program for working something out
and finding the roots of a polynomial is also an algorithmic procedure. A
polynomial is something like:
axⁿ + bxⁿ⁻¹ + cxⁿ⁻² + …
For example:
2x⁴ + 1/2x³ + 8x² + 9x + 13
is a polynomial of degree 4.
A few minutes thinking should convince you that there are only an
enumerable set of polynomials. You can encode a polynomial as a sequence:
(a,b,c.. )
where the a, b, c are the coefficients of the powers of x. The length of the
sequence is finite but unbounded. As in the earlier argument, each of these sets, for a fixed n, is countable and hence the union of all of these sets is countable and hence there are aleph-zero polynomials. If you don’t like this argument, write a program that enumerates them - but you have to use something like the zig-zag enumeration used for the rationals.
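Here is a sketch of such an enumeration in Python, restricted to integer coefficients for simplicity - clearing denominators doesn't change a polynomial's roots. Every polynomial turns up once the bound k is large enough, so the set is countable.

# Enumerate polynomials as coefficient tuples (a, b, c, ...) standing for
# ax^n + bx^(n-1) + ...: for k = 1, 2, 3, ... list every tuple whose degree
# and absolute coefficients are at most k, skipping repeats.
from itertools import product

def polynomials(max_k):
    seen = set()
    for k in range(1, max_k + 1):
        for degree in range(k + 1):
            for coeffs in product(range(-k, k + 1), repeat=degree + 1):
                if coeffs[0] != 0 and coeffs not in seen:  # non-zero leading coefficient
                    seen.add(coeffs)
                    yield coeffs

print(list(polynomials(1)))
# [(-1,), (1,), (-1, -1), (-1, 0), (-1, 1), (1, -1), (1, 0), (1, 1)]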
As the set of polynomials is countable, and each polynomial has at most n
distinct roots, the set of all roots is also countable and hence the set of
irrationals that are roots of polynomials is enumerable. That is, there are
aleph-zero irrationals that are the roots of polynomials. Such irrational
numbers are called algebraic numbers; they are enumerable and as such they are a "small" subset of the set of all of the irrationals, which is aleph-one in size and hence
much bigger. That is, most irrationals are not algebraic numbers. We call this
bulk of the irrational numbers transcendental numbers and they are not the
roots of any finite polynomial with rational coefficients.
So what are transcendental numbers? Most of them don't have names, but the
few that do are well known to us - π and e are perhaps the best known. The
transcendental numbers that we can work with have programs that allow us
to compute the digit sequence to any length we care to. In this sense the non-
computable numbers are a subset of the transcendental numbers.
That is, in this sense the computable numbers are the integers, the rationals, the algebraic irrationals and those transcendentals that are "regular enough" to have ways of computing approximations.

The situation is very strange when you try to think about it in terms of
computational complexity. The algebraic numbers are the roots of
polynomials and in this sense they are easy computable numbers. The
transcendentals are not the roots of polynomials and hence a harder class of
things to compute. Indeed most of them are non-computable, but some of
them are, and some, like π, are fairly easy to compute.

π - the Special Transcendental


π is the most fascinating of transcendental numbers because it has so many
ways of computing its digits. What formulas can you find that give a good
approximation to π? Notice that an infinite series for π gives an increasingly
accurate result as you compute more terms of the series. So if you want the
56th digit of π you have to compute all the digits up to the 56th as well as the
digit of interest - or do you? There is a remarkable formula - the Bailey–
Borwein–Plouffe formula - which can provide the nth binary digit of π
without having to compute any others. Interestingly, this is the original
definition of a computable number given by Turing – there exists a program
that prints the nth digit of the number after a finite time.
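For the record, the BBP formula writes π as the sum over k ≥ 0 of (1/16ᵏ)(4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)). The sketch below is not the digit-extraction algorithm itself, which needs some extra modular arithmetic; it just sums the first few terms to show how quickly the series converges:

# Summing the first few terms of the Bailey-Borwein-Plouffe series for pi.
from math import pi

def bbp(terms):
    total = 0.0
    for k in range(terms):
        total += (1 / 16**k) * (4 / (8*k + 1) - 2 / (8*k + 4)
                                - 1 / (8*k + 5) - 1 / (8*k + 6))
    return total

print(bbp(10))   # agrees with math.pi to around 12 decimal places
print(pi)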
The digits of π cannot be random - we compute them rather than throw dice for them - but they are pseudo random. It is generally said that if you
enumerate π for long enough then you will eventually find every number you
care to specify and if you use a numeric code you will eventually find any
text you specify. So π contains the complete works of Shakespeare, or any
other text you care to mention; the complete theory of QFT; and the theory of
life, the universe and everything. However, this hasn't been proved. A number whose digits contain every finite sequence, each occurring as often as you would expect by chance, is called normal, and we have never proved that π is normal.
All of this is pure math, but it is also computer science because it is about
information, computational complexity and more – see the next chapter.

Summary
● There is a hierarchy of number types starting from the counting numbers and expanding to the integers, the rationals, the irrationals and finally the complex numbers. Each expansion is due to the need to solve valid equations in one number type that don’t have a solution in that type.

● Two sets are the same size if it is possible to match their elements up one to one.

● The size of the set of integers – the usual meaning of infinity – is aleph-zero. The rationals can be put into one-to-one correspondence with the integers and so there are aleph-zero rationals.

● To find a set with more elements than aleph-zero you need to move to the irrationals. The irrationals cannot be enumerated and there are aleph-one of them.

● The union of sets of size aleph-zero is a set of size aleph-zero. The Cartesian product of sets of size aleph-zero is a set of size aleph-zero. It is only when you take the power set 2ᴬ that you get a set of size aleph-one.

● The number of programs, or equivalently Turing machines, is aleph-zero. Therefore there are not enough programs to compute all of the irrationals – the majority are non-computable numbers.

● Similarly there are not enough formulas for all the irrationals and in this sense they are beyond the reach of mathematics.

● The irrationals that are roots of polynomials are called algebraic. However, as there are only aleph-zero polynomials, most of the irrationals are not the roots of polynomials and these are called transcendental.

● The majority of transcendentals are non-computable but some, including numbers like π, are not only computable, but seem to be very easy to compute.

Chapter 7

Kolmogorov Complexity and Randomness

At the end of the previous chapter we met the idea that if you want to
generate an infinite anything then you have to use a finite program and this
seems to imply that whatever it is that is being computed has to have
some regularity. This idea is difficult to make precise, but that doesn’t mean it
isn’t an important one. It might be the most important idea in the whole of
computer science because it explains the relationship between the finite and
the infinite and it makes clear what randomness is all about.

Algorithmic Complexity
Suppose I give you a string like 111111... which goes on for one hundred ones
in the same way. The length of the string is 100 characters, but you can write
a short program that generates it very easily:
repeat 100
print "1"
end repeat
Now consider the string "231048322087232.." and so on for one hundred
digits. This is supposed to be a random string - it isn't, because I typed it in by hitting number keys as best I could - but even so you would be hard pressed to create a program that could print it that was shorter than it is. In other words,
there is no way to specify this random-looking string other than to quote it.
This observation of the difference between these two strings is what leads to
the idea of Kolmogorov, or algorithmic, complexity. The first string can be
generated by a program with roughly 30 characters, and so you could say it
has 30 bytes of information, but the second string needs a program of at least a hundred characters to quote the number as a literal and so it has 100 bytes of information.
You can already see that this is a nice idea, but it has problems. Clearly the
number of bytes needed for a program that prints one hundred ones isn't a
well-defined number - it depends on the programming language you use.
However, in any programming language we can define the Kolmogorov
complexity as the length of the smallest program that generates the string in
question.
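To make this concrete, here is a rough illustration in Python; the program lengths are only stand-ins for the true Kolmogorov complexity which, as just explained, depends on the language chosen.

# The length of a Python program that prints each string stands in, very
# roughly, for its Kolmogorov complexity.
ones = "1" * 100
prog_for_ones = 'print("1" * 100)'          # stays short however many ones there are

digits = "231048322087232"                  # stands in for the 100 typed digits
prog_for_digits = f'print("{digits}")'      # no shorter than quoting the digits

print(len(ones), len(prog_for_ones))        # the program is much shorter than the string
print(len(digits), len(prog_for_digits))    # the program is slightly longer than the string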

Andrey Kolmogorov was a Russian mathematician credited with developing
this theory of information but it was based on a theorem of Ray Solomonoff
which was later rediscovered by Gregory Chaitin - hence the theory is often
called Solomonoff-Kolmogorov-Chaitin complexity theory.

Andrey Nikolaevich Kolmogorov


1903 -1987
Obviously one way around this problem is to measure the complexity by the size of a Turing machine that generates the sequence, but even
this can result in slightly different answers depending on the exact definition
of the Turing machine. However, in practice the Turing machine description
is the one preferred.
So complexity is defined with respect to a given description language – often
a Turing machine. The fact that you cannot get an exact absolute measure of
Kolmogorov complexity is irritating but not a real problem as any two
measures can be shown to differ by a constant.
The Kolmogorov complexity of a string is just the length of the smallest program that generates it.
For infinite strings things are a little more interesting because, if you don't
have a program that will generate the string, you essentially don't have the
string in any practical sense. That is, without a program that generates the
digits of an infinite sequence you can't actually define the string. This is also
the connection between irrational numbers and non-computable numbers.
As explained in the previous chapter, an irrational number is an infinite
sequence of digits. For example:
2.31048322087232 ...
where the ... means carry on forever.
Some irrationals have programs that generate them and as such their
Kolmogorov complexity is a lot less than infinity.

However, as there are only a countable number of programs and there are an
uncountable number of irrationals – see the previous chapter - there has to be
a lot of irrational numbers that don't have programs that generate them and
hence that have an infinite Kolmogorov complexity. Put simply, there aren't
enough programs to compute all of the irrationals and hence most irrationals
have an infinite Kolmogorov complexity. To be precise, there is an aleph-
zero, or a countable infinity, of irrational numbers that have Kolmogorov
complexity less than infinity and an aleph-one, or an uncountable infinity of
them, that have a Kolmogorov complexity of infinity.
A key theorem in algorithmic complexity is:
● There are strings that have arbitrarily large Kolmogorov complexity.
If this were not so we could generate the aleph-one set of infinite strings
using an aleph-zero set of programs.
The irrationals that have a smaller than infinite Kolmogorov complexity are
very special, but there are an infinity of these too. In a sense these are the
"nice" irrationals - numbers like π and e - that have interesting programs that
compute them to any desired precision. How would you count the numbers
that had a less than infinite Kolmogorov complexity?
Simple - just enumerate all of the programs you can create by listing their
machine code as a binary number. Not all of these programs would generate a
number, indeed most of them wouldn't do anything useful, but among this
aleph-zero of programs you would find the aleph-zero of "nice" irrational
numbers.
Notice that included among the nice irrationals are some transcendentals. A
transcendental is a number that isn't the root of any finite polynomial
equation. Any number that is the root of a finite polynomial is called
algebraic. Clearly, for a number that is the root of a finite polynomial, i.e. not
transcendental but algebraic, you can specify it by writing a program that
solves the polynomial. For example, √2 is an irrational, but it is algebraic and
so it has a program that generates it and hence it’s a nice irrational.

Kolmogorov Complexity Is Not Computable


The second difficulty inherent in the measure of Kolmogorov complexity is
that, given a random-looking string, you can't really be sure that there isn't a
simple program that generates it. This situation is slightly worse than it seems
because you can prove that the Kolmogorov complexity of a string is itself a
non-computable function. That is, there is no program (or function) that can
take a string and output its Kolmogorov complexity. The proof of this, in the
long tradition of such non-computable proofs, is proof by contradiction.

If you assume that you have a program that will work out the Kolmogorov
complexity of any old string, then you can use this to generate the string
using a smaller program and hence you get a contradiction. To see that this is
so, suppose we have a function Kcomplex(S) which will return the
Kolmogorov complexity of any string S. Now suppose we use this in an
algorithm:
for I = 1 to infinity
  for all strings S of length I
    if Kcomplex(S) > K then return S
  next S
next I
You can see the intent - for each length of string, test each string until one with complexity greater than K is found - and there seems to be nothing wrong with this. Now suppose that the size of this program is N, then any string it
generates has a Kolmogorov complexity of N or less. If you set K to N or any
value larger than N you immediately see the problem. Suppose the algorithm
returns S with a complexity greater than K, but S has just been produced by an
algorithm that is N in size and hence string S has a complexity of N or less - a
contradiction.
This is similar to the reasonably well known Berry paradox:
“The smallest positive integer that cannot be defined in fewer than
twenty English words”
Consider the statement to be a program that specifies the number n which is
indeed the smallest positive integer that cannot be defined in fewer than
twenty English words. Then it cannot be the number as it has just been
defined in 14 words. Notice that once again we have a paradox courtesy of
self-reference. However, this isn't to say that all is lost.

Compressibility
You can estimate the Kolmogorov complexity for any string fairly easily. If a
string is of length L and you run it through a compression algorithm you get a
representation which is L-C in length where C is the number of characters
removed in the compression. You can see that this compression factor is
related to the length of a program that generates the string, i.e. you can
generate the string from a description that is only L-C characters plus the
decompression program.
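As a rough, practical illustration (my own, not a formal part of the theory), a general-purpose compressor such as zlib gives an upper bound on how compressible a string is:

import os, zlib

regular = b"ab" * 5000          # a highly regular string of length 10000
random_ = os.urandom(10000)     # 10000 random bytes

for name, s in [("regular", regular), ("random", random_)]:
    comp = zlib.compress(s, 9)
    print(name, "length", len(s), "compressed length", len(comp))

The regular string compresses to a few dozen bytes, while the random bytes don't compress at all - the "compressed" version typically comes out slightly longer than the original.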
Any string which cannot be compressed by even one character is called
incompressible. There have to be incompressible strings because, by another
counting argument, there are 2^n strings of length n but only 2^n - 1 strings of
length less than n. That is, there aren't enough shorter strings to represent all
of the strings of length n.
Again we can go further and prove that most strings aren't significantly
compressible, i.e. the majority of strings have a high Kolmogorov complexity.

The theory says that if you pick a string of length n at random then the
probability that it cannot be compressed by c or more characters is given by
1 - 2^(1-c) + 2^(-n).
Plotting this probability against c for a string length of 50, you can see at once
that most strings are fairly incompressible once you reach a c of about 5.
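A few values from this formula, worked out in Python for n = 50, show how quickly incompressibility takes over:

n = 50
for c in range(1, 9):
    p = 1 - 2**(1 - c) + 2**(-n)
    print(c, round(p, 4))
# c = 2 gives 0.5, c = 5 gives roughly 0.94, c = 8 roughly 0.992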

Armed with all of these ideas you can see that a string can be defined as
algorithmically random if there is no program shorter than it that can
generate it - and most strings are algorithmically random.

Random and Pseudo Random


The model of computation that we have been examining up to this point is
that of the mechanistic universe. Our computations are just what the universe
does. The planets orbit the sun according to the law of gravity and we can
take this law and create a computer program that simulates the phenomenon.
In principle, we can perform the simulation to any accuracy that we care to
select. The planets orbiting the sun and the computational model will be
arbitrarily close to one another – closer as the accuracy increases. It all seems
to be a matter of decimal places and how many we want our computations to
work with. This gives the universe a mechanistic feel. The planets are where
they are today because of where they were yesterday and where they were
1000 years earlier and so on back to the start of the universe. If you could give
the position of every atom at time zero then you could run the model or the
real universe and everything would be where it is today.
This is the model of the universe that dominated science until quite recently
and was accepted despite it containing some disturbing ideas. If the universe
really is a deterministic machine there can be no “free will”. You cannot
choose to do something because it’s all a matter of the computation
determining what happens next.
The idea of absolute determinism reduces the universe to nothing but a
playing out of a set of events that were set to happen by the initial conditions
and the laws that govern their time development. In this universe there is no
room for randomness.

Randomness and Ignorance
The standard idea of probability and randomness is linked with the idea that
there is no way that we can predict what is going to happen even in principle.
We have already mentioned the archetypal example of supposed randomness
– the toss of a coin. We summarize the situation by saying that there is a 50%
chance of it landing on either of its sides as if its dynamics were controlled by
nothing but randomness. Of course this is not the case – the coin obeys precise
laws that in principle allow you to predict exactly which side it will land on.
You can take the initial conditions of the coin and how it is about to be
thrown into the air and use the equations of Newtonian mechanics to predict
exactly how it will land on the table.
In principle, this can be done with certainty and there is no need to invoke
randomness or probability to explain the behavior of the coin. Of course, in
practice we cannot know all of the initial conditions and so a probability
model is used to describe certain features of the exact dynamics of the
system. That is, the coin tends to come down equally on both sides unless
there is something in its makeup that forces it to come down more on one
side than the other for a range of initial conditions.
You need to think about this for a little while, but it soon becomes clear that a
probability model is an excuse for what we don’t know – but still very useful
nonetheless. What we do is make up for our lack of deterministic certainty by
summarizing the dynamics by the observed frequency that we get any
particular result. For example, in the case of the coin we could throw it 100
times and count the number of heads and the number of tails. For a fair coin
we would expect this to be around 50% of each outcome. This is usually
summarized by saying that the probability of getting heads is 0.5. Of course,
after 100 throws it is unlikely that we would observe exactly 50 heads and
this is allowed for in the theory. What we expect, however, is that as the
number of throws increases the proportion gets closer and closer to 50% and
the deviations away from 50% become smaller and smaller. It takes quite a lot
of math (look up the Law of Large Numbers) to make these intuitions exact,
but this is the general idea.
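A quick simulation shows the same thing - the observed proportion of heads drifts towards 0.5 as the number of tosses grows. The "coin" here is, of course, itself a pseudo random number generator:

import random

random.seed(1)
heads = 0
for n in range(1, 100_001):
    heads += random.randint(0, 1)       # count 1 as heads
    if n in (100, 1_000, 10_000, 100_000):
        print(n, "tosses, proportion of heads:", heads / n)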
You can, if you want to, stop the theory at this point and just live with the
idea that some systems are too complex to predict and the relative frequency
of each outcome is all you can really work with. However, you can also move
beyond observed frequencies by making arguments about symmetry. You can
argue that as one side of a coin is like the other, and the throwing of the coin
shows no preference for one side or another, then by symmetry you would
expect 50% of each outcome. Another way of saying this is that the physics is
indifferent to the coin’s sides and the side of a coin cannot affect the
dynamics. Compare this to a coin with a small something stuck to one side.
Now there isn’t the symmetry we used before and the dynamics does take
account of the side of the coin. Hence we cannot in this case conclude that
the relative frequency of each side is 50%.

Pseudo Random
When you work with probabilities it is all too easy to lose sight of the fact that
the phenomena you are describing aren’t really random – they are
deterministic but beyond your ability to model accurately. Computers are
often used to generate random numbers, and at first this seems to be a
contradiction. How can a program generate random numbers when its
dynamics are completely determined – more determined than the toss of a
coin, say? We usually reserve the term pseudo random for numbers generated
in this way but they are no more pseudo random than the majority of other
numbers that we regard as truly random.
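For example, here is a minimal linear congruential generator, one of the oldest ways of producing pseudo random numbers. The constants are the widely quoted Numerical Recipes values; treat the whole thing as a sketch rather than a generator you should rely on:

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    # Each output is completely determined by the previous one.
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m            # scaled into [0, 1)

gen = lcg(42)
print([round(next(gen), 3) for _ in range(5)])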
Notice that as pseudo random numbers are generated by a program much
smaller than the generated sequence they are most certainly not
algorithmically random in the sense of Kolmogorov. It is also true that an
algorithmically random sequence might not be so useful as a sequence of
random numbers because they might have irregularities that allow the
prediction of the next value. For example, an algorithmically random
sequence might have more sixes, say, than any other digit. It would still be
algorithmically random because this doesn’t imply that you can write a
shorter program to generate it. What is important about pseudo random
numbers is that they are not predictable in a statistical sense, rather than
there being no smaller program that generates the sequence. Statistical
prediction is based on detecting patterns that occur more often than they
should for a random sequence. In the example given above, the occurrence of
the digit six more often than any other digit would mean that guessing that
the next digit was six would be correct more often than chance.
When we use the term pseudo random we mean that in principle all digits
occur with equal frequencies, all pairs of digits occur with equal frequencies,
all triples of digits occur with equal frequencies and so on. A pseudo random
sequence that satisfies this set of conditions allows no possibility of
predicting what might come next as every combination is equally likely. In
practice, pseudo random numbers don’t meet this strict condition and we
tend to adopt a less stringent requirement depending on what the numbers
are going to be used for. The key idea is that the sequence should not give a
supposed adversary a good chance of guessing what comes next. For example,
pseudo random numbers used in a game can be of relatively low quality as
the player has little chance of analyzing the sequence and so finding a
weakness. Pseudo random numbers needed for cryptography have to be of a
much higher quality as you can imagine that a nation state might put all of
the computer power it has into finding a pattern.
Pseudo random numbers are number sequences that satisfy a condition of
non-predictability given specified resources. In this sense what is pseudo
random is a relative term.

True Random
If you have followed this reasoning you should be happy to accept the fact
that there is no random, only pseudo random. What would a truly random
event look like? It would have to differ from deterministic randomness by not
being deterministic at all. In other words, there would have to be no theory
that could predict the outcome, even in principle.
A fairly recent discovery, chaos, is often proposed as an example of
something that can stand in for randomness. The idea is that systems exist
that are so sensitive to their initial conditions that predicting their behavior
rapidly becomes impractical. If the throw of a coin were a chaotic system
then the behavior of the coin would be very different for starting states that
differed by very little. If you knew the initial state of the coin and this
allowed you to predict that it would come down heads then, for a chaotic
system, a tiny change in that initial state would make it come down tails.
Given that you cannot in practice know the initial state perfectly then the
chaotic dynamics gives you no choice but to resort to probabilities.
A chaotic system, however, is still not in principle unpredictable. In this case
adding more and more information still makes the predictions more accurate.
This is not true randomness even though it is very interesting at a practical
and theoretical level.
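The classic toy example is the logistic map. The arithmetic below is completely deterministic, yet two starting values that differ by one part in a billion soon follow completely different trajectories:

x, y = 0.4, 0.4 + 1e-9          # two almost identical starting states
for step in range(1, 41):
    x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    if step % 10 == 0:
        print(step, round(x, 6), round(y, 6))

By around step 30 the two trajectories have nothing to do with each other, even though no randomness has been used anywhere.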
The question of whether or not there are truly random processes, i.e. ones
that are even in principle not predictable, is considered open by some people,
and answered by quantum mechanics for others. What is so special about
quantum mechanics? The answer is that it has probability built into it. The
equations of quantum mechanics don’t predict the exact state of a system,
only the probability that it will be in any one of a number of states. If a coin
was governed by quantum mechanics then all that the equations could say is
that there is a 50% chance of it being heads and 50% chance of it being tails.
Even given all of the information about the quantum coin and its initial state,
the dynamics of the system is inherently probabilistic.
If quantum mechanics is correct, to paraphrase Einstein, “God does play dice
with the universe”.
If you don’t find this disturbing, and Einstein certainly did, consider that
when the quantum coin lands heads or tails you cannot ask about the
deterministic sequence of events that led to the outcome. If there was such a
sequence, it could have been used to predict the outcome and probability
wouldn’t enter into it. To say that a coin lands heads without anything on the
way to the outcome actually determining this is very strange. For a real coin
we think that the initial throw, the air currents and how far it travels before
landing determine its final configuration, but this cannot be so for a quantum
coin. A quantum event does not have an explanation, only a probability.

Some find this so disturbing that they simply postulate that, like our
supposedly random physical coin, we are simply missing some information
in the quantum case as well. However, to date no complete theory has been
shown to give the same predictions as standard quantum mechanics and so
far these predictions have been correct.
So, if you want true randomness you need to use a quantum system?
No, in most cases this isn’t necessary; pseudo random is more than good
enough. It is interesting to note that even a supposedly quantum-based
random number generator is difficult to create as it is all too easy for the
system to be biased by non-quantum effects. For example, a random number
generator based on radiation can be biased by non-quantum noise in the
detector.
This idea that there is no causal sequence leading up to the outcome is very
similar to the idea of algorithmic randomness. There is no program smaller
than the sequence that generates it and, in a sense, this means that there is no
deterministic explanation for it. The program is the deterministic sequence,
an explanation if you like for the infinite sequence it generates.
To see this, consider how we could create an algorithmically random number
generator.
We would need to write a finite program that generated the sequence but, as
the sequence is algorithmically random, there is no such program. You could
write an infinite program, but this would be equivalent to simply quoting the
digits in the sequence. But how can you quote an infinite sequence of digits?
How can you determine the next digit? There seems to be no possibility of a
causal way of selecting the next digit because if there was you could use it to
construct a shorter program. In this sense algorithmically random sequences
are not deterministic. Perhaps this is the key to the difference between
pseudo random and physically random numbers. This leads us on to the
subject of the next chapter, “the axiom of choice”.
The annoying thing about Kolmogorov complexity is that it is easy to
understand and yet in any given case you really can't measure it in any
absolute sense. Even so, it seems to say a lot about the nature of the physical
world and our attempts to describe it. You will notice that at the core of this
understanding is the idea of a program that generates the sequence of
behavior. It is tempting to speculate that many of the problems we have with,
say, modern physics are due to there simply not being enough programs to go
around or, what amounts to the same thing, too many random sequences.

Summary
● The algorithmic, or Kolmogorov, complexity of a string of symbols is
the size of the smallest program that will generate it.

● Kolmogorov complexity clearly depends on the machine used to host
the program, but it is defined up to a constant which allows for the
variation in implementation.

● Most irrationals don’t have programs that generate them and hence
their Kolmogorov complexity is infinite.

● Kolmogorov complexity isn’t computable in the sense that there isn’t a
single function or Turing machine that will return the complexity of
an arbitrary string.

● A C-compressible string can be reduced by C symbols by a
compression program.

● A string that cannot be reduced by even one symbol is said to be
incompressible. Such strings have to exist by a simple counting
principle. This means that the majority of strings have a high
Kolmogorov complexity.

● A string with a high Kolmogorov complexity is algorithmically
random.

● Most random numbers are pseudo random in that they are
theoretically predictable if not practically predictable. This includes
examples of systems that are usually considered to be truly random,
such as the toss of a coin. Clearly, with enough data, you can predict
which face the coin will come down.

● The only example of true randomness is provided by quantum
mechanics where the randomness is built into the theory – there is no
way of predicting the outcome with more information.

Chapter 8

Algorithm of Choice

After looking at some advanced math concerning infinity, you can start to
appreciate that algorithms, or programs, have a lot to do with math. The fact
that the number of programs is smaller than the number of numbers gives rise
to many mathematical phenomena. One particularly exotic mathematical
idea, known as the axiom of choice, isn’t often discussed in computer science
circles and yet it probably deserves to be better known. However, this said, if
you want to stay close to a standard account of computer science, you are free
to skip this chapter.
Before we get started, I need to say that this isn't a rigorous mathematical
exposition of the axiom of choice. What it is attempting to do is to give the
idea of the "problem" to a non-mathematician, i.e. an average programmer. I
also need to add that while there is nothing much about the axiom of choice
that a practical programmer needs to know to actually get on with
programming it, it does have connections with computer science and
computability. The axiom of choice may not be a particularly practical sort of
mathematical concern, but it is fascinating and it is controversial.

Zermelo and Set Theory


The axiom of choice was introduced by Ernst Zermelo in 1904 to prove what
seems a very reasonable theorem.

Ernst Zermelo
1871-1953

In set theory you often find that theorems that seem to state the obvious
actually turn out to be very difficult to prove. In this case the idea that needed
a proof was the "obvious" fact that every set can be well ordered. That is, there
is an order relation that can be applied to the elements so that every non-empty
subset has a least element under the order. This is Zermelo's well-ordering theorem.
To prove that this is the case Zermelo had to invent the axiom of choice. It
now forms one of the axioms of Zermelo-Fraenkel set theory which is,
roughly speaking, the theory that most would recognize as standard set
theory. There are a lot of mathematical results that depend on the axiom of
choice, but notice it is an axiom and not a deduction. That is, despite
attempts to prove it from simpler axioms, no proof has ever been produced.
Mathematicians generally distinguish between Zermelo-Fraenkel set theory
without the axiom of choice, ZF, and the bigger theory with it, ZFC, in which
more things can be proved.

The Axiom of Choice


So exactly what is the axiom of choice? It turns out to be surprisingly simple:
"The axiom of choice allows you to select one element from each set in a
collection of sets"
- yes, it really is that simple.
A countable collection of sets is just an enumeration of sets S_i for a range of
values of i. The axiom of choice says that for each and every i you can pick an
element from the set S_i. It is so obvious that it hardly seems worth stating as
an axiom, but it has a hidden depth.
Another way of formulating the axiom of choice is to say that for any
collection of sets there exists a choice function f which selects one element
from each set, i.e. f(S_i) is an element of S_i.
Notice that if you have a collection of sets that comes with a natural choice
function then you don't need the axiom of choice. The fact you have an
explicit choice function means that you have a mechanism for picking one
element for each set. The axiom of choice is a sort of reassurance that even if
you don't have an explicit choice function one exists - you just don't know
what it is. Another way to look at this is that the axiom of choice says that a
collection of sets for which there is no choice function doesn't exist.
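For a finite collection you can always write the choice function down. A toy Python version (entirely my own illustration) might pick the smallest element of each set - the point of the axiom is that for infinite collections of arbitrary sets no such explicit rule need exist:

collection = [{3, 7, 9}, {2}, {10, 4}]          # a finite collection of non-empty sets
choice = {frozenset(s): min(s) for s in collection}
print(choice)    # one chosen element per set, picked by an explicit rule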

To Infinity and...
So where is this hidden depth that makes this obvious axiom so
controversial? The answer is, as it nearly always is in these cases, infinity. If
you have a finite collection of sets then you can prove that there is a choice
function by induction and you don't need the axiom of choice as it is a
theorem of standard set theory.

Things start to get a little strange when we work with infinite collections of
sets. In this case, even if the collection is a countable infinity of sets, you
cannot prove that there is a choice function for any arbitrary collection and
hence you do need the axiom of choice. That is, to make set theory carry on
working you have to assume that there is a choice function in all cases.
In the case where you have a non-countable infinity of sets then things are
more obviously difficult. Some non-countable collections do have obvious
choice functions and others don't. The only way to see what the difficulties
are is to look at some simple examples.
First consider the collection of all closed finite intervals of the reals, i.e.
intervals like the set of points x satisfying a ≤ x ≤ b or [a,b] in the usual
notation. Notice that the interval a to b includes its end-points, i.e. a and b are
in the set due to the use of less-than-or-equals. This is an infinite and
uncountable collection of sets, but there is an obvious choice function. All
you have to do is define F([a,b]) as the mid-point of the interval. Given you
have found a choice function there is no need to invoke the axiom of choice.
Now consider a collection of sets that sound innocent enough - the collection
of all subsets of the reals. It is clear that you can't use the mid-point of each
subset as a choice function because not every subset would have a mid-point.
You need to think of this collection of subsets as including sets that are
arbitrary collections of points. Any rule that you invent for picking a point is
bound to encounter sets that don't have a point with that property (note: this
isn't a rigorous argument). You can even go a little further. If you invent a
choice function, simply consider the sets that don’t have that point – they
must exist as this is the set of all subsets!
Now you can start to see the difficulty in supplying a choice function. No
matter what algorithmic formulation you invent, some sub-sets just won’t
have a point that satisfies the specification. You cannot, for example, use the
smallest value in the sub-set to pick a point because open sets, sets like a<x<b
or (a,b) don't necessarily have a smallest value using the usual less than
relationship. The reason is that, unlike closed intervals, open intervals don’t
include their end points. What this means is that you can get closer and
closer to a but you never reach a within the set. There is no smallest value
because if you give me a proposed smallest value, s, there is always a smaller
value between a and s.
So now the real problem is made clear - this is a problem to do with
computability and this is why it is relevant to computer science. If there is no
choice function for the collection of all subsets of the reals, then the axiom of
choice fails, and so do all of the theorems that are proved using it. The axiom
of choice simply asserts that there is such a function - it doesn't tell you how
to find it. At the moment it looks as if the weight of mathematical opinion is
that we can't find a choice function for the totality of sub-sets of the reals,
even though the axiom of choice insists that one exists.

Choice and Computability
If this sounds familiar from what you know of computer science then it
should. The axiom of choice is roughly speaking asserting that the choice
function is computable or realizable or whatever you want to call it even if we
don't know what it is. In many cases where something is non-computable we
can blame it on the fact that there aren't enough programs to go around. For
example, there are only a countable number of programs, but there are an
uncountable number of real numbers - hence the majority of real numbers
aren't computable.
In this case though, the problem seems to be that the variability of the sub-
sets is such that there isn't sufficient regularity for a program to specify one
point in each subset. You might think that the solution is to allow computable
functions that return an arbitrary point in the set, but it could be that there
are sets so complex that there is no formal rule that can pick out such a point.
For example, consider a set that is composed of just non-computable numbers
- how can the choice function return one of them?
When you realize the connection between the axiom of choice and
computability it seems less obvious that you can always exercise an arbitrary
choice. Many mathematicians seem to satisfy themselves with a vague (or
sometimes precise) idea that in situations like this the choice function could
somehow be an arbitrary "pick one point of each set" - it doesn't matter what
it is, just pick one. You can see that in simple cases the choice function is
doing something we might be able to name - the mid-point, the smallest
value, etc, but this new arbitrary choice function is more like a random
selection.
British philosopher, logician and mathematician Bertrand Russell tried to
explain this idea by the example of socks and shoes. If you have an infinite
collection of pairs of shoes then the choice function could be either "pick the
right shoe" or "pick the left shoe" with no need for the axiom of choice. Now
consider a similar collection of pairs of socks. There is no obvious choice
function because there is no way to distinguish between the socks making up
the pair. In this case the axiom of choice has to be invoked to say that you can
indeed pick a sock from each pair.
The sock example sounds reasonable, but is an arbitrary choice really a
computable function? In the case of socks it seems reasonable, but in the case
of all sub-sets of the reals it isn't as obvious. Consider for a moment the
number line between 0 and 1, i.e. the closed set [0,1]. There are points that
have “names” like ½, √2, and so on. However, as we have already revealed a
number of times – the majority of the points don’t have a finite label. How
can you then pick a point? Zoom in till you see a few points and then pick
one of them “at random”? If you zoom in you always see an infinity of points.
In fact, you always see an aleph-one of points. How can you pick one of
something that forms a continuum? Only by giving a program or function that
specifies the point can you “pick a point”. Of course, the program or function
is a finite label for the point, so we have made no progress at all.
There are also many deep resonances here with other subjects. Do the pairs of
shoes and socks remind you of the behavior of fermions and bosons? If you
are a physicist, they do. The inability to provide a computable function
without using a random choice in the case of the socks says something about
algorithmic information, but what exactly isn't obvious.
You can now also see why the axiom of choice comes into the proof of
Zermelo's well-ordering theorem. If you can find a function that is irregular
enough to pick an arbitrary point in each set then presumably there is no
reason we can’t find a function that orders each set. In fact the two ideas, the
axiom of choice and the well-ordering theorem, are equivalent – they both
pick points from arbitrary sets.

Non-Constructive
In mathematics the axiom of choice is usually invoked to suppose that there
is a choice function without having explicitly to supply one - and this leads to
conclusions that, while they might not be undisputed paradoxes, can be very
troubling. The problem is that just insisting that a procedure can be done
without explaining how it can be done isn't constructive. It leads to
mathematics with objects and procedures which provably exist without
giving the exact algorithm to create such an object.
The best known is the Banach-Tarski paradox, which says that if you take a
3D solid unit sphere you can "carve it up" into a finite number of pieces and
then using rotations and translations reassemble it into two identical 3D solid
unit spheres. We seem to have created twice the volume of stuff without
actually creating anything at all. The pieces are constructed using the axiom
of choice but obviously without giving the choice function used to do the job.
If you think that doubling the volume of stuff is simply a crazy, illogical, idea
then you are forgetting some of the strange properties of infinity. Take an
infinite set of the integers and split it into a set of even and a set of odd
integers - both sets are also infinite. So you start with an infinite set and split
it in two and you have two infinite sets - perfectly logical. In the case of the
Banach-Tarski paradox the sets aren't of infinite extent, i.e. they are bounded,
but they are continuous and hence have aleph-one points. In the same way
that the countable set can be split into two parts, each infinite, the ball can
be split into two parts, each with aleph-one points inside. However, to do this
you have to be able to select an aleph-one number of points from an aleph-
one number of sets, i.e. a choice function must exist.

If you think that this is just silly and proves that the axiom of choice isn't
valid, then you need to know just how many results in less controversial
mathematics depend on it. For example, that every Hilbert space has an
orthonormal basis, the Hahn-Banach theorem, that additive groups on R and
C are isomorphic, the theorem that every countable union of countable sets is
countable, and so on. They are all fairly basic properties that we expect to be
true. These are usually held up as justification that the axiom of choice is
reasonable, but it has to be admitted that if the axiom of choice is false all of
these important results can remain true by limiting them to situations where
the axiom isn’t required.
At the end of the day the axiom of choice is more illuminating about the
properties of infinite sets than anything else. Mathematics makes use of it, but
there is a certain caution about the theorems that rely on it. As far as
computer science is concerned it seems to be another aspect of computability.
It seems much more reasonable that the choice function cannot exist in all
cases when viewed like this.

Summary
● The axiom of choice isn’t usually included as part of computer
science. It is considered more part of pure math than computation.
However, it is related to the idea of computability and therefore
deserves consideration.

● The axiom of choice simply says that in all circumstances it is
possible to select one thing from each set in a collection. Equivalently
you can postulate the existence of a choice function which gives you
one element from each set. If the choice function is non-computable
the axiom of choice is false.

● There are lots of examples where the axiom of choice is obviously true
and a few well known examples where it is far from obvious. The
problems arise when the sets in question are uncountable and there
are an uncountable number of them.

● If you try to select a single point from each of the subsets of the real
numbers you can see the problem: whatever rule you select, there is
no guarantee that every subset contains a suitable point. This may be
regarded as another example of not enough functions/programs to
compute everything that is aleph-one.

● The solution of “pick a point at random” doesn’t really help because it
challenges what it means to “pick a point”. The only meaningful sense
in which you can pick a point is to give a description, or program, that
lets you find it. Failing this you have to label it in some way and there
are neither enough programs nor labels.

● If the axiom of choice is false many reasonable theorems are also false.
If it is true then so is the Banach-Tarski paradox, that you can take a
3D sphere and rearrange its points to create two such spheres of the
same size.

Chapter 9

Gödel’s Incompleteness Theorem

We make the assumption that computers, logic and scientific thought can go
anywhere we care to take them - there are no limits to what we can work out.
However, we have already seen that there are some serious problems – Turing
machines can’t compute everything, nearly all numbers are non-computable
and the axiom of choice is paradoxical. What is more, these difficulties
extend to the heart of mathematics. The most famous result, and one that
predates Turing’s result, is Gödel’s incompleteness theorem and it was the
first hint that computation limits what math can do.
If I was to tell you that I had a mathematical theorem that so far didn’t have a
proof you’d just assume that the mathematicians hadn’t really been working
hard enough on the problem. For example, the issue of Fermat’s theorem and
its proof dragged on for centuries and then suddenly it was all solved by some
new math and some older math cleverly applied.
Fermat's last theorem is a negative sort of theorem. It states that there are no
positive integers a, b and c that satisfy a^n + b^n = c^n for n>2 (there are lots of examples of
a, b and c for n=2 or 1). It almost cries out for you to disprove it by finding a
counter example, i.e. an a, b, c that do satisfy the relationship for some n>2.
But the proof is much harder than you could ever possibly imagine.
Suppose for a moment that the halting problem was computable. Then we
could solve Fermat’s last theorem easily. Write a program that searches a, b, c
and n in a “zig-zag” pattern so that if we wait long enough we will see any
a, b, c and n we care to specify and test each case to see if a^n + b^n = c^n. If it finds
a, b, c and n that satisfy this relation we have a counter example. Of course,
in any practical case we would potentially search forever, but if the halting
problem is decidable all we have to do is submit the equivalent Turing
machine tape to the halting machine and if it pronounces that it halts we
have a counter example and Fermat’s last theorem is disproved. If it says that
machine never halts, then Fermat’s last theorem is proved. Notice that it is
the second case that is really important because it relieves us from the need
for a search that lasts forever. But wait a minute – the theorem would have been
proved with no need for an infinite search. Logical proof really is this
powerful!
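Here is a sketch of the zig-zag search in Python. Every combination of a, b, c and n is eventually reached by increasing a single bound, so a counter example, if one existed, would eventually be found:

from itertools import count

def search_fermat():
    for bound in count(3):                      # 3, 4, 5, ...
        for n in range(3, bound + 1):
            for a in range(1, bound + 1):
                for b in range(1, bound + 1):
                    for c in range(1, bound + 1):
                        if a**n + b**n == c**n:
                            return a, b, c, n   # a counter example

# search_fermat() never returns - but only the proof, not the program,
# tells us that.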

The four-color problem similarly dragged on for years, but it was
solved in a completely new way and a computer generated a large part of the
proof. The four-color theorem states that you only need at most four colors to
color in a map so that no two adjacent regions have the same color (where
adjacent means sharing a boundary not just a corner). Again the four-color
theorem sets your mind working to see if you can find a map where four
colors aren't enough to ensure that no two adjacent regions share the same
color. In the case of the four-color theorem the computer’s accurate logic was
used to go through a long list of tests that would have been impossible for a
human to complete in a reasonable time.
Humans plus computers seem to be capable of working things out in a way
that is precise and efficient and surely there cannot be anything that is
beyond reach? The fact is it can be proved that there are truths that are
indeed beyond the reach of logic. This is essentially what Kurt Gödel proved
in 1931. The idea isn’t a difficult one but it is often used to justify some very
strange conclusions most of which aren’t justified at all! It is important, and
fun, to find out a little more about what it really means.

Kurt Friedrich Gödel
1906-1978

The Mechanical Math Machine


First we need to look a little more closely at how we really expect the world
to work. Back at the turn of the 20th century mathematics had moved on from
being a “creative” subject to trying to be precise and logical. The idea was that
you stated your axioms – the things you regarded as true – and then you used
these rules to derive new truths. This was the big project – the axiomatization
of the whole of mathematics. If it could be achieved then we would be in a
position where we could say what was true with certainty – as long as the
initial axioms were OK.

Bertrand Arthur William Russell
1872-1970

Alfred North Whitehead
1861-1947
Notice that we are talking about some very fundamental things that you might
think could be taken for granted. For example, Alfred North Whitehead and
Bertrand Russell produced a huge book (Principia Mathematica published in
three volumes in 1910, 1912 and 1913) that took several hundred pages to get
to a proof that 1+1=2! The point is that this was very detailed work building
from very simple initial axioms which were so basic that no one could
dispute them. The fact that 1+1 was 2 wasn't just an observation of physics,
but a logical necessity given the initial axioms.
The axiomatization of mathematics also potentially reduced math to a
mechanistic shuffling of symbols that a computer could happily be left to do.
You didn’t even need to give the theorems and proofs any interpretation in
the real world. You could regard it as a game with symbols that you moved
around according to the rules. The new arrangements are theorems and the
way that you got the new arrangements from the existing ones are the proofs.
Many mathematicians rejected this soulless brand of mathematics, but they
still adopted the basic way of working and regarded a proof as a sequence of
legal steps that takes us from a known truth to a new truth. The freedom and
creativity was what came before the proof. You spend time thinking and
experimenting to find a way to put together a logical argument from axioms to
a new theorem.

The Failure of the Axioms
What could possibly be wrong with this idea? It seems so reasonable that
what we know should be extended in this way using nothing but the force of
pure logic. There really isn’t any alternative to the idea of logical proof that
sane people find acceptable. It also seems entirely reasonable to suppose that
applying this method will eventually get us to any statement that is true.
What is more, you might as well give a computer the task of shuffling the
symbols and seeing what is true. Yes, math is a matter for mechanical
computation.
Consider now the problems discussed at the start – Fermat’s last theorem and
the four-color theorem – their solutions are just a matter of computing time.
Now we come to the downfall of the method. You would assume that given a
properly formed arrangement of symbols that the computer, or the human,
could eventually work out either a proof of the theorem or an anti-proof, i.e. a
proof that the negation of the theorem is true and thus the theorem is false. It
might take a long time, but in principle you could work out if the statement
was true or false. This is the assumption of completeness and consistency that we suppose will
hold for any reasonable or useful system of reasoning.
● A theorem has to be either true or false – there is no middle ground
and it isn’t reasonable to have proof that it is both.
This sounds so obvious that there is no need to really think about it too
deeply, but it is also equally obvious that a statement (let alone a proof or a
theorem) has to be either true or false.
Now consider:
This sentence is false
You can begin to see that things are not quite so simple. If the statement is
true it is false and if it is false it has to be true. Notice it is self-reference that
is causing the problem. This is the paradox of Epimenides (named after the
Cretan philosopher Epimenides of Knossos who was alive around 600 BC),
and it is the key to Kurt Gödel’s proof that axiomatic systems aren’t complete,
in the sense that there are theorems that cannot be proved true or false within
the system. There may be a larger system, i.e. one with more axioms, in which
these theorems are provable, but that’s almost cheating.

Gödel’s First Incompleteness Theorem


The argument that Gödel used is very simple. Let’s suppose that there is a
machine M that can do the job we envisaged for a complete and consistent system. That is,
the machine can prove or disprove any theorem that we feed to it. It is
programmed with the axioms of the system and it can use them to provide
proofs. Suppose we ask for the program for the machine written in the same
logic used for the theorems that the machine proves. After all, any computer
can be reduced to logic and all we are requesting is the logical expression that
represents the design of the computer – call this PM for "Program for M". Now
we take PM and construct a logical expression, which we’ll call statement X
which says:
“The machine constructed according to PM will never prove this
statement is true”
and ask the machine to prove or disprove X.
Consider the results. If the machine says “X is true” then X is false. If the
machine says “X is false” then X is true.

You see what the problem is here and you can do this sort of trick with any
proof machine that is sufficiently powerful to accept its own description as a
theorem. Any such machine, and the system of logic on which it is based,
must, by its very nature, be incomplete in that there are statements that can be
written using the system that it cannot be used to prove either true or false.
In other words, there are three types of statement in the system – those that are
provably true, those that are provably false and those that are undecidable
using the axioms that we have at our disposal.
Is it possible for a machine to be powerful enough to attempt to prove a
statement that involves its own description? Perhaps it is in the nature of
machines that they cannot cope with their own description.
The really clever part of Gödel’s work was to find that any axiomatic system
powerful enough to describe the integers, i.e. powerful enough to describe
simple arithmetic, has this property. That is, arithmetic isn’t a complete
axiomatic theory, which means that there are statements about the integers that
have no proof or disproof within the theory of arithmetic.

So now think about the 300-year search for a proof for Fermat’s last theorem.
Perhaps it wasn’t that the mathematicians weren’t trying hard enough.
Perhaps there was no proof. As it turns out we now know that a proof exists,
but there was a long time when it was a real possibility that the theorem was
undecidable. This example also indicates what it means to be true but without
a proof. It could be that the axioms of arithmetic are not sufficient to prove
some unproven theorem. That is, there is no sequence of steps that takes the
axioms from their raw form to a proof. However, the theorem may well be
true in that there are no integers that satisfy the equation. You cannot even
prove it by searching for a counter example because, as the search is infinite,
you can never conclude that just because you haven’t found a counter
example one does not exist. The fact that the theorem is true would just be
beyond our reach.
You clearly now understand the idea, but do you believe it? To make sure you
do, consider the following problem: there seem to be lots of paired primes, i.e.
primes that differ by two: (3,5), (5,7), (11,13) and so on. It is believed that
there are an infinite number of such pairs - the so-called "twin prime
conjecture" - but so far there is no proof. So what are the possibilities? You examine
number pairs moving over larger and larger integers and occasionally you
meet paired primes. Presumably either you keep on meeting them or
there comes a point when you don’t. In other words, the theorem is either
true or false. And as it is true or false presumably there should be a proof of
this.
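You can, of course, search - the snippet below finds the twin primes below 100 - but no finite amount of searching can settle the conjecture either way:

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print([(p, p + 2) for p in range(2, 100) if is_prime(p) and is_prime(p + 2)])
# [(3, 5), (5, 7), (11, 13), (17, 19), (29, 31), (41, 43), (59, 61), (71, 73)]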
If you have taken Gödel’s theorem to heart you should know that this doesn’t
have to be the case. The integers go on forever and you can’t actually decide
the truth of the theorem by looking at the integers. How far do you have to go
without seeing a pair of primes to be sure that you aren’t ever going to see
another pair? How many pairs do you have to see to know that you are going
to keep seeing them? There is no answer to either question. In the same way
why do you assume that there is a finite number of steps in a proof that will
determine the answer. Why should the infinite be reducible to the finite?
Only because we have grown accustomed to mathematics performing this
miracle.
In the case of the twin prime conjecture recent progress has proved that there
are infinitely many pairs of primes separated by at most N, where N is less
than 70 million. This doesn't mean that N is 2, however. Collaborative work by
a lot of mathematicians has reduced the upper limit (at the end of 2014) to 246,
which seems like progress, but attempts to reduce it further have not been
successful. Can the bound be pushed down as far as 2 or is there really no
proof?
Repeatedly mathematics proves things using axiomatic logic that otherwise
would require an infinite search, but we are not guaranteed that this is
possible in all cases. There may well be, and in fact there have to be, cases
where no finite proof exists. This is what Gödel’s theorem really is all about.
There are statements that are undecidable. If you add additional axioms to
the system, the statements that were undecidable might well become
decidable but there will still be valid statements that are undecidable in the
larger system. Indeed, every time you expand the axioms you increase the
number of statements that are decidable and also the number that are undecidable.

End of the Dream


It’s as if mathematics at the turn of the 20th century was seeking the ultimate
theory of everything and Gödel proved that this just wasn’t possible. So far so
good, or bad depending on your point of view. You may even recognize some
of this theory as very similar to the theory of Turing machines and non-
computability, in which case it might not be too much of a shock to you.
However, at the time they were thought up both Gödel’s and Turing’s ideas
were revolutionary and they were both regarded with suspicion and dismay.
It was thought to be the end of the dream: mathematics was limited.
Mathematics wasn’t perfect and in fact every area of mathematics contained
its limitations. Today you will find it argued that Gödel’s theorem proves that
God exists. You will find it argued that Gödel’s theorem proves that human
thought goes beyond logic. The human mind is capable of seeing truth that
mathematics cannot prove. It is also argued that it limits artificial
intelligence, because there are things that any machine cannot know and
hence also proves that human intelligence is special because it can know
what the machine cannot.
If you think about it, Gödel’s theorem proves none of this. It doesn’t even
suggest that any of this is the case. Gödel’s theorem doesn’t deal with
probabilities and what we believe, only with the limitations of finite systems in
proving assertions about the infinite. Sometimes the infinite is regular enough
to allow something to be proved. Sometimes, in fact most of the time, it isn’t.
But important though this is, we live in a finite personal universe and we
don’t demand perfect proof. We go with the flow, guess and accept good
probabilities as near certainties.
And if you eliminate the infinite, Gödel's theorem doesn't hold.

Summary
● Mathematical proof can be considered a game in which symbols are
manipulated according to a set of axioms and rules to produce new
theorems. It was thought that this implied that math could be
mechanized and reduced to an algorithm.

● Many mathematical questions can be answered by an infinite search
procedure. Not finding a counter example doesn’t prove anything and
the power of math is in providing finite proofs for infinite searches.

● Gödel proved that in any system of logic that was powerful enough to
include arithmetic there were theorems that were true but for which
a proof using the axioms of the system did not exist.

● The incompleteness theorem was revolutionary and put a limit on
what could be done with axiomatic systems. It has also been used,
incorrectly, to prove many things such as the existence of God and the
inability of AI to equal real intelligence.

● What the incompleteness theorem is saying is that there are some
truths that cannot be established by a finite procedure, i.e. a proof.

● If you expand the axioms to create a larger logical system then it is
possible that what was unprovable becomes provable. The only
problem is that new unprovable theorems are introduced.

Chapter 10

Lambda Calculus

Turing’s approach via a machine model has a great deal of appeal to a
programmer and to a computer scientist but, as we have just seen in the
previous chapter, mathematicians have also considered what is computable
using arguments based on functions, sets and logic. You may also recall that
the central tenet of computer science is the Church-Turing thesis and Alonzo
Church was a mathematician/logician who tackled computability from a
much more abstract point of view. He invented Lambda calculus as a way of
investigating what is and what is not computable and this has lately become
better known because of the introduction of lambda functions within many
computer languages. This has raised the profile of the Lambda calculus so
now is a good time to find out what it is all about.

What is Computable?
Lambda calculus is an attempt to be precise about what computation actually
is. It is a step on from pure logic, but it isn't as easy to understand as the more
familiar concept of a Turing machine.
A Turing machine defines the act of computing in terms that we understand
at a reasonable practical level - a finite state machine and a single tape that
can be read and written. The concept is about the most stripped-down piece
of hardware that you can use to compute something. If something isn't Turing
computable then it isn't computable by any method we currently know.
Lambda calculus is an alternative to the "hardware" approach of a Turing
machine and it too seeks to specify a simple system that is powerful enough
to compute anything that can be computed. One way of putting this is that
the Lambda calculus is equivalent to a Turing machine and vice versa. This is
the Church-Turing thesis - that every function that can be computed can be
computed by a Turing Machine or the Lambda calculus. There are a number
of alternative formulations of computation but they are all equivalent to a
Turing Machine or Lambda calculus.
Lambda calculus started out as an attempt to create a logical foundation for
the whole of mathematics, but this project failed due to the paradoxes that are
inherent in any such attempt. However, a cut-down version of the idea
proved to be a good model for computation.

The Basic Lambda
The most difficult thing to realize about Lambda calculus is that it is simply a
game with symbols. If you think about it all of computation is just this - a
game with symbols. Any meaning that you care to put on the symbols may be
helpful to your understanding but it isn't necessary for the theory.
In the case of Lambda calculus, first we have the three rules that define what
a lambda expression is:
1. Any "variable", e.g. x, is a Lambda Expression, LE
2. If t is a LE then (λx.t) is a LE where x is a variable
3. If t and s are LEs then (t s) is also a LE.
The use of the term "variable" is also a little misleading in that you don't use it
to store values such as numbers. It is simply a symbol that you can
manipulate. The parentheses are sometimes dropped if the meaning is clear
and many definitions leave them out. Rule 2 is called an abstraction and Rule
3 is called an application.
If you prefer we can define a simple grammar for a lambda expression, <LE>:
<LE> → variable | (λ variable. <LE>)|(<LE> <LE>)
Notice that what we have just defined is a simple grammar for a computer
language. That really is all it is, as we haven't attempted to supply any
meaning or semantics to the lambda expression.
Examples of lambda expressions along with an explanation of their structure:
y Rule 1
x Rule 1
(x y) Rule 3
z Rule 1
(λx.(x z)) Rule 2
((λx.(x z))(a b)) Rule 1, 2 and 3
To make sure you follow, let's parse the final example.
The first term is:
(λx. <LE>)
with
<LE>=(x z)
Of course, x and z are LEs by Rule 1 and the second term is just (a b),
which is again an LE by Rule 3. The two terms fit together as an LE, again by
Rule 3.

A rough syntax tree has the application of Rule 3 at the root, with the
abstraction (λx.(x z)) as its left branch and the application (a b) as its right
branch; the abstraction node λx in turn has the application (x z) as its body.

The parentheses are a nuisance so we usually work with the following
conventions:
1. Outermost parentheses are dropped - a b not (a b)
2. Terms are left associative - so a b c means ((a b) c)
3. The expression following a dot is taken to be as long as possible -
λx.xz is taken to mean λx.(x z) not (λx. x) (z)
In practice it is Convention 2 that is most difficult to get used to at first,
because it implies that a b c is three separate lambda expressions not one
consisting of three symbols. If in doubt add parentheses and notice that
grouping of terms does matter as expressed in the basic grammar by the
presence of brackets. Without the brackets the grammar is ambiguous. Also
notice that the names of the variables don't carry any information. You can
change them all in a consistent manner and the lambda expression doesn't
change. That is, two lambda expressions that differ only in the names used
for the variables are considered to be the same.

Reduction
So far we have a simple grammar that generates lots of strings of characters,
but they don't have any meaning. At this point it is tempting to introduce
meanings for what we are doing - but we don't have to. We can still regard the
whole thing as playing with symbols. If this sounds like an empty thing to do,
you need to remember that this is exactly what a Turing machine, or any
computer, does. The symbols don't have any intrinsic meaning. When you

write a program to compute the digits of π, say, you may think that you are
doing something real, but the digits are just symbols and you add the extra
interpretation that they are numbers in a place value system.
There are a number of additional rules added to the bare syntax of the lambda
expression that can be used to reduce one expression to another. The most
important to understand is probably beta reduction. It says that if you have a
lambda expression of the form:
(λx.t) s
then you can reduce it to a new lambda expression by replacing every
occurrence of x in t by s and erasing the λx. You can think of this as a simple
string operation if you like.
For example:
(λx. x z)(a b)
is of the form given with t = x z and s = a b and so by beta reduction we
have:
((a b) z)
or dropping the parentheses:
a b z
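To see that beta reduction really is just symbol shuffling, here is a minimal sketch in Python. The representation - variables as strings, ("lam", x, body) for an abstraction and ("app", f, a) for an application - is my own choice, and variable capture is ignored to keep things short:

def subst(term, var, value):
    # Replace free occurrences of var in term by value.
    if isinstance(term, str):                      # a variable
        return value if term == var else term
    if term[0] == "lam":
        _, x, body = term
        if x == var:                               # var is re-bound here, stop
            return term
        return ("lam", x, subst(body, var, value))
    _, f, a = term                                 # an application
    return ("app", subst(f, var, value), subst(a, var, value))

def beta(term):
    # One beta reduction at the head of the term, if it has the right form.
    if isinstance(term, tuple) and term[0] == "app" and \
            isinstance(term[1], tuple) and term[1][0] == "lam":
        _, (_, x, body), arg = term
        return subst(body, x, arg)
    return term

# (λx. x z)(a b) reduces to (a b) z
expr = ("app", ("lam", "x", ("app", "x", "z")), ("app", "a", "b"))
print(beta(expr))    # ('app', ('app', 'a', 'b'), 'z')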

Reduction as Function Evaluation


Starting from:
(λx. x z)(a b)
by our new rule we can replace the x in x z with s, i.e. a b. The resulting
new reduced expression is:
a b z
You can think of this as a function application where the function is:
(λx. x z)
and λx can be thought of as defining the function's parameter as being x and
the expression following the dot is the function's body.

Hence when you write:
(λx.x z) (a b)
the function is evaluated by having the parameter x set to (a b) in the
function body.

The only thing odd about this is that we don't usually write the parameter
values after the function without brackets in this way. This, however, fits in
with the way most functional languages write function application, with the
argument simply placed after the function.
A very common example is the identity function:
(λx.x)
which, if you think about its form, simply substitutes the expression you
apply it to for x which gives you the original expression again.
For example given:
(λx. x)(a b)
then reduction means substitute (a b) for x in the function body which
gives:
(a b)

More Than One Parameter


You can have functions with more than one parameter by using more than
one lambda. For example:
(λz. (λx. x z)) (a b) (c d)
which means set z to the first expression (a b) and then set x to the second
expression giving:
(c d) (a b)
It is usual to abbreviate such multiple parameter functions by just writing a
single lambda for all the parameters.
For example:
(λz. (λx. x z)) = λz x. x z
which means z and x are the parameters for the function body x z.

More Than One Function
Notice that you can have multiple function applications in a single
expression. For example:
(λx.x)(λy.y)(a b)
which is just two applications of the identity function, is equivalent to:
((λx.x)(λy.y)) (a b)
as applications are left associative. Applying the inner lambda, i.e. replacing x
by (λy.y) reduces to:
(λy.y)(a b)
which is:
(a b)
after the second reduction.
You can repeatedly reduce a lambda expression until there are no more
lambdas, or until there are no more arguments to supply values to the
lambdas. Notice that it makes it easier to think about reduction to liken it to
function evaluation, but you don't have to. You can simply see it as a game
with symbols.
If you think of a Turing machine then what is initially written on the tape is a
set of symbols that is converted to a final output as the machine runs. This is
computation. In the case of lambda expressions you start with an initial
expression and apply functions to it using beta reduction, which produces a
new string of symbols. This too is computation.

Bound, Free and Names


Variables that occur within lambda expressions next to a lambda are called
bound variables while the others are free variables. You can think of bound
variables as the parameters of the function. Now that we have beta reduction
defined you can see that there is a sense in which it doesn't matter what you
call a bound variable. For example:
λx. x
is the identity function because (λx. x) e gives e. Obviously:
λy. y
is also the identity function.
So we can change the name of a parameter without changing the function.
This is alpha reduction (or alpha conversion or alpha-renaming). Exactly how
you do the renaming and how you combine expressions that use the same
names for variables can get a little complicated. The safest rule to follow is to
change bound variable names so that they are different from any free

118
variables in another expression and to make sure that free variables have
distinct names in different expressions. For example don't write:
(λx.x)x
Instead use:
(λy.y)x
The rules for how to deal with name clashes can seem complicated at first,
but if you remember that each lambda expression is an entity in its own right
and any variable names do not carry across from one entity to another, you
should find it fairly easy. You can think of bound variables as a function's
local variables if it helps.

Using Lambdas
You can now see what lambda expressions are all about, even if some of the
fine detail might be mysterious, but what next? The point is that lambdas
provide a form of universal computation using just the symbols and the rules
- no extra theoretical baggage needs to be brought into the picture.
This is where it gets far too theoretical for some practical programmers. For
example, we don't just introduce integer values that can be stored in
variables. Instead we invent symbols, well lambda expressions, that behave
just like the integers. If you have never seen this sort of thing done before it
can seem very complicated, but keep in mind that all we need are some
symbols that behave like the integers.
One possible set of such symbols is:
0 = λsz. z
1 = λsz. s(z)
2 = λsz.s(s(z))
3 = λsz.s(s(s(z)))
and so on.
You can see the general structure - we have a function of s and z and the
number of applications of s in the body of the function grows by one each
time. In general the symbol that represents n applies s to z exactly n times.
OK, so what have these strange lambda expressions got to do with integers?
The answer is that for a symbol to represent an integer it has to obey the basic
properties of the integers - or rather we have to find some other symbols that
behave in a way that mirrors the properties of the integers, i.e. addition,
subtraction and so on. One of the basic properties of the integers, more
fundamental than arithmetic, is the successor function which gives you the
very next integer, or in more familiar terms:
succ(n) = n+1

119
It is fairly easy to create a successor function as a lambda expression:
Succ = λabc.b(a b c)
This may look like a mystery at first, but try it out. Apply Succ to the 0
symbol:
Succ 0 = λabc.b(a b c) (λsz. z)
The first reduction is to substitute (λsz. z) for a:
Succ 0 = λabc.b(a b c) (λsz. z)
= λbc.b((λsz. z) b c)
In this expression we can make two more reductions as the inner function
(λsz. z) applied to b gives (λz. z), which applied to c reduces to give c:
Succ 0 = λabc.b(a b c) (λsz. z) =
λbc.b((λsz. z) b c) =
λbc.b((λz. z) c) =
λbc.b(c)
That is:
Succ 0 = λabc.b(abc) (λsz. z)=
λbc.b(c) =
λsz.s(z) =
1
Notice that we can change the parameter names to s and z without changing
the lambda expression.
You can try the same calculation for other integer symbols and you will find
that it works. The reason it works is that the Succ function as defined adds
one more occurrence of the b variable to the integer. From here you can
continue to invent functions that add, multiply and so on. You can also
define Booleans and logical operations, recursion, conditionals and
everything you need to build a computational system. For most of us this is
too low-level and abstract to follow through. After all, if it takes this long to
create the integers, how long would it take to create the rest?
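If you want to convince yourself that these symbol games really do behave like the integers, you can transcribe them into any language with first-class functions. Here is a sketch in Python; the helper to_int is not part of the lambda calculus, it is just a way of peeking at the result as an ordinary number:

# Church numerals - a direct transcription of λsz. s(...s(z)...)
zero = lambda s: lambda z: z
one  = lambda s: lambda z: s(z)
two  = lambda s: lambda z: s(s(z))

# The successor function Succ = λabc. b(a b c)
succ = lambda a: lambda b: lambda c: b(a(b)(c))

def to_int(n):
    """Peek at a Church numeral by applying it to 'add one' and 0."""
    return n(lambda x: x + 1)(0)

print(to_int(zero))              # 0
print(to_int(succ(zero)))        # 1
print(to_int(succ(succ(two))))   # 4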
So our account of the Lambda calculus finishes here, but you should be able
to see that this simple system is capable of simulating a Turing machine, and
that a Turing machine can be constructed that carries out lambda reduction.
The two are equivalent as far as computation goes.

120
The Role of Lambda In Programming
Now we get to a difficult part. Why is this very theoretical idea turning up in
highly practical programming languages? It turns up in functional languages
because the ideas fit together - Lambda calculus is a functional programming
language. In other languages it turns up because it provides a simple way of
defining anonymous functions that can be passed to other functions.
In a language that has functions as first class objects, i.e. where functions are
objects and can be passed as parameters to other functions, there is no real
need for anything like a lambda expression. Even languages as simple as
JavaScript don't really need to call anything a lambda expression, because all
functions are first class objects and an anonymous function is just one you
don't bother assigning to a variable.
In languages such as Java and C# functions only occur as members of objects.
There aren't such things as free floating functions that don't belong to some
object, hence there cannot be anything like an anonymous function. To avoid
adding functions as objects in their own right you have to introduce
something that looks like a lambda expression with a function head that
defines the parameters and a function body that defines what to do with the
parameters to obtain a result. Notice that any lambda expression defined as
part of a programming language goes well beyond what we have discussed
here - in particular the variables store complex data types and the standard
operations are defined. While it is true that all of these things could be
reduced to formal lambda expressions, this isn't of much interest to the
practicing programmer, nor should it be.
So the lambda expressions that you meet in object-oriented languages really
don't have that much to do with the Lambda calculus, and not that much to
do with the grammar of lambda expressions. They are just anonymous
functions that you can pass to other functions and a way of avoiding the need
to implement functions as objects.

121
Summary
● Turing invented a machine definition of what computation is all
about, whereas Church invented a logical system that captured the
essence of computation – Lambda calculus.

● Lambda calculus is Turing-equivalent or the Turing machine is


lambda-equivalent depending on your point of view – they both have
the same computational power.

● The Church-Turing thesis is that anything that can be computed can


be done so either by a Turing machine or by Lambda calculus.

● A basic lambda term can be defined by a simple grammar.

● Computation in Lambda calculus is mainly via beta reduction, which


is analogous to function evaluation.

● This can be extended to multiple parameters and to multiple


“functions”, making Lambda calculus more powerful than you might
expect.

● Lambda calculus can be used to model the integers and from here
arithmetic and once you have arithmetic the rest follows and it can be
used to do everything a Turing machine or a real computer can.

● For reasons that have little to do with computational theory, lambda


expressions have become known as a way to include simple anonymous
functions in programming languages which otherwise don't support them.

122
Part II

Bits, Codes and Logic

Chapter 11 Information Theory 126


Chapter 12 Coding Theory – Splitting the Bit 135
Chapter 13 Error Correction 143
Chapter 14 Boolean Logic 151

123
Chapter 11

Information Theory

There has been an undercurrent in a lot of the earlier discussion about


algorithms only being able to capture the regularity in infinite things. If
something infinite doesn’t have enough regularity then it cannot be generated
by a finite algorithm. There is something about information and complexity
that we are just not making quite explicit. These ideas are part of a subject
known as computational information, which has been discussed in Chapter 7,
but there is another theory of information that is based on how often
something happens. It is this theory of information that gives rise to the idea
of the bit as the basic unit of information and leads to questions such as:
How much information does a bit carry?
What is this "information" stuff anyway?
The answers are all contained in the subject called, unsurprisingly,
Information Theory, which was invented by one man a surprisingly short
time ago.
Let us try an easier question. How much data can you store on a 1GByte
drive? The answer is obvious - 1GByte of course! But what exactly does this
mean? How much music, photos, video or books does this equate to? What
does a Gigabyte mean in terms of the real world? To answer these questions
in any meaningful way we need a theory of information and more importantly
we need to define a way to measure information.
Until quite recently it wasn't even clear that something called information
existed. For a long time we just muddled along and put things down in books,
listened to music and didn't really think about how it all worked. It was only
when we started to send messages between distant points that it really became
apparent that information was a quantity that we might measure.
Today we know that the amount of information in something is measured in
bits and this gives you the amount of storage you need to store it and the
amount of communications bandwidth we need to transmit it. This much is
easy to say, but it raises a whole set of interesting questions.
You can take any set of symbols – the word “the” or the letter “A”, for
example - and specify the amount of information it contains in bits. How does
this work? How can you say how many bits of information the letter “A”
contains?

125
This is all a little mysterious, although we use the same ideas every day. For
example, if you have ever compressed a file then you are making use of
information theory. The most amazing thing about information theory is that
it is a relatively recent discovery (1948), and it is almost entirely the work of
one man, Claude Shannon, who more-or-less simply wrote it all down one
day.

Claude Shannon
1916 - 2001

Surprise, Sunrise!
Suppose I tell you that the sun rose this morning. Surprised? I hope not.
There is a sense in which a statement such as this conveys very little
information. You don’t need me to tell you it, you already know it. In this
sense, if you were already 100% certain that the sun rose this morning you
could even say that me telling you so carried no information at all!
On the other hand, if I told you that the sun wasn’t going to rise tomorrow
then you would presumably be shocked at this unlikely news – never mind
about your horror or any other emotion, that’s for psychologists to sort out.
We are only interested in the information content of the statement, not its
implications for the end of the world. From this point of view it seems
reasonable to relate the information in a message to the probability of
receiving it. For example, if I am to tell you whether or not the sun rose each
morning then on the mornings that I tell you it did you get little information
compared to any morning I tell you it didn't. The relative likelihood of a
message is related to the amount of information it contains.
This was the key insight that got Shannon started on information theory. He
reasoned like this:
If the probability of message A is p then we can assume that the
information contained in the message is some function of p – call it I(p)
say.

126
What properties does I(p) have to have? Suppose you are sent message A –
now you have I(p)’s worth of information. If you are then sent another
message, B with probability q, you have I(p)+I(q) of information.
But suppose A and B are sent at the same time. The probability of getting both
messages at the same time is p times q. For example, when flipping a coin, the
probability of getting a head is 0.5, whereas the probability of getting two
heads is 0.5 times 0.5, i.e. 0.25. As it really doesn’t matter if message A is
received first and then message B or if they arrive together – the information
they provide should be the same. So we are forced to the conclusion that:
I(pq)=I(p)+I(q)
If you know some simple math you will recognize the basic property of
logarithms. That is:
log(xy)=log(x)+log(y)
What all this means is that if you accept that the information carried by a
symbol or a message is a function of its probability then that function has to
be a log or something like a log. The reason is that when things happen
together their probabilities multiply, but the information they carry has to
add.
What Shannon did next was to decide that the information in a message A
which occurs with probability p should be:
I(A)=-log2(p)
where the logarithm is taken to the base 2 instead of the more usual base 10
or base e.
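As a quick check of the definition, here is the measure applied to a few probabilities using Python's math.log2:

import math

def information(p):
    """Information, in bits, carried by a message that occurs with probability p."""
    return -math.log2(p)

print(information(0.5))     # 1.0 bit - a fair coin toss
print(information(1/26))    # about 4.7 bits - one letter from a 26-letter alphabet
print(information(0.999))   # about 0.0014 bits - "the sun rose this morning"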

Logs
(Skip this section if you are happy with all things logarithmic).
Part of the problem here is the use of a negative logarithm, so a brief reminder
of what logs are all about is sure to help. The log of a number x to a base b is
just the power that you have to raise b to in order to get x. That is, if:
b^y = x
then y is the log of x to the base b.
For logs to the base 2 we simply have to find the number that we raise 2 by to
get the specified value.

127
So log2(4)=2 because 2^2=4 and continuing in this way you can see that:
n   log2(n)
1   0   (2^0=1)
2   1   (2^1=2)
4   2   (2^2=4)
8   3   (2^3=8)
16  4   (2^4=16)
The only problem you might have is with 2^0=1, which is so just because it
makes overall sense.
Clearly powers of two have nice simple logs to the base 2 and values in
between powers of two have logs which aren't quite so neat. For example,
log2(9) is approximately 3.17 because 2^3.17 is almost 9.
We need to know one more thing about logs and this is how to take the log of
a fraction. If you raise a number to a negative power then that’s the same as
raising 1 over the number to the same power. For example:
2^-1 = 1/2^1 = 1/2, i.e. 0.5.
The reason this works is that powers add and you can deduce what a negative
power must be by looking at expressions like:
2^+1 x 2^-1 = 2^(1-1) = 2^0 = 1
and you can see from this that:
2^-1 = 1/2^1 = 1/2
and more generally:
a^-b = 1/a^b
This also means that logs of fractions are negative and, given that
probabilities are never greater than one, log2(p), where p is a probability, is
negative – hence the reason for including the minus sign in the definition.
Put simply, the minus is there to make the log of probabilities positive.
n      log2(n)   -log2(n)
1/2     -1         1
1/4     -2         2
1/8     -3         3
1/16    -4         4
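If you want to check any of these values, a few lines of Python will do it:

import math

print(math.log2(4))       # 2.0
print(math.log2(9))       # 3.1699... - the "approximately 3.17" mentioned above
print(math.log2(1/2))     # -1.0 - logs of fractions are negative
print(-math.log2(1/16))   # 4.0 - the minus sign makes it positive again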

128
Bits
Now that we have the mathematics out of the way, we need to return to look
again at the idea of information as measured by I(A)=-log2(p).
You may have noticed that powers of two seem to figure in the calculation of
information and this seems to be very natural and appropriate when it comes
to computers. However, to see how deep and natural the connection is we can
play a simple game.
Suppose we flip a fair coin and you have to find out if it landed as a head or a
tail. It’s a very boring game because you can simply ask me and I will tell you,
but some information has flowed between us and we can attempt to use our
new measure of information to find out how much. Before I answer the
question, the two results “Heads” or “Tails” are equally possible and so their
probabilities are simply p=0.5 and q=0.5 respectively. So when I give you the
answer the amount of information I pass to you is:
I(answer)=-log2(.5)=-(-1)=1
This is quite interesting because in this binary split between two equally
likely alternatives I am passing you 1 unit of information.
We can extend this a little. Suppose two coins are tossed and you have to ask
questions to find out if the result was (Head, Head), (Head, Tail), (Tail, Head)
or (Tail, Tail), but you can only ask about one coin at a time.

You would ask one question and get 1 unit of information and then ask your
second question and get another unit of information. As information is
additive you now have 2 units of information. Whenever you encounter a
binary tree of this sort the number of units of information that you get by
knowing which outcome occurred is just the number of levels that the tree
has. In other words, for two coins there are two levels and getting to the
bottom of the tree gets you 2 units of information. For three coins you get 3

129
units of information and so on. Notice that this isn’t entirely obvious because
for two coins you have 4 possible outcomes and for three coins you have 8:
Coins Outcomes Information
1 2 1
2 4 2
3 8 3
4 16 4
5 32 5
It seems that if you find out which of 32 possible outcomes occurred you only
gain 5 units of information – but then you only needed to ask five binary
questions!
You could of course do the job less efficiently. You could ask 32 questions of
the type “Did you get HTHTH?” until you get the right answer but this is up to
you. If you want to gather 5 units of information in an inefficient way then
that’s your choice and not a reflection of the amount of information you
actually gain.
Notice that this is also the first hint that information theory may tell you
something about optimal ways of doing things. You can choose to be
inefficient and ask 32 questions or you can do things in the most efficient
way possible and just ask 5.

A Two-Bit Tree
Now consider the task of representing the outcome of the coin-tossing in a
program. Most programmers would say let the first coin be represented by a
single bit, 0 for tails 1 for head. Then the two-coin situation can be
represented by two bits 00, 01, 10, and 11. The three-coin situation can be
represented by three bits 000, 001, 010, and so on. You can see that this looks
a lot like the way information increases with number of coins. If you have n
coins you need n bits to represent a final outcome and you have n units of
information when you discover the final outcome.
The connection should by now be fairly obvious.
The unit of information is the bit and if a symbol occurs with probability p
then it carries –log2(p) bits of information and needs a minimum of –log2(p)
bits of information to represent or code it. You can of course use more bits to
code it if you want to, but –log2(p) is the very smallest number you can use
without throwing away information.
Notice that this conclusion is very deep. There is a connection between the
probability of a symbol occurring and the amount of information it contains
and this determines the smallest number of bits needed to represent it. We go
from probabilities to bits - which is not an obvious connection before you
make it.

130
The Alphabet Game
At this point you should be thinking that this is all very reasonable, but how
does it work in more complicated situations? It is time for a very small simple
example to show that it all does make good sense. Consider a standard
alphabet of 26 letters. Let’s for the moment assume that all of the letters are
equally likely. Then the probability of getting any letter is 1/26 and the
information contained in any letter is –log2(1/26) = 4.7 bits.
What this means is that you need 5 bits to encode this standard alphabet and
a single letter carries just less than 5 bits of information.
Notice that in theory we can find a code that only uses 4.7 bits per letter on
average and this is the sort of question that information theorists spend their
lives working out. For most of us a storage scheme that uses 5 bits is good
enough.
As a sanity check, you can also see that five bits really are enough to code 26
letters or other symbols. If you run through the list of possibilities – 00000,
00001, 00010, and so on you will find that there are 32 possible five-bit codes
and so, as predicted, we have more than enough bits for the job. The guessing
game analogy still works. If I think of a letter at random then how many
questions do you have to ask to discover which letter it is? The simple
minded answer is 26 at most - “is it an A?”, “is it a B?” and so on. But by taking
a slightly more clever approach the same job can be done in five questions at
most. Can you work out what they are?
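One possible set of questions is simply a binary search - each question halves the set of letters that are still possible. A sketch of the strategy, assuming the letter is chosen from the 26 upper-case letters:

import string

def guess(secret):
    """Find the secret letter using only yes/no questions of the form
    'is your letter in this half of the remaining letters?'"""
    letters = list(string.ascii_uppercase)   # 26 equally likely possibilities
    questions = 0
    while len(letters) > 1:
        half = letters[:len(letters) // 2]
        questions += 1
        if secret in half:                    # the yes/no answer
            letters = half
        else:
            letters = letters[len(letters) // 2:]
    return letters[0], questions

print(guess("Z"))   # ('Z', 5) - never more than five questions are needed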

Compression
Information theory has lots of interesting applications in computing and one
worth knowing about is “coding theory”.
A good few years ago at the start of the personal computer revolution a self-
appointed guru announced that he had invented a “data compression
machine” – in hardware. These were days well before zipping files was
common and the idea of data compression was strange, exotic and very, very
desirable – after all the largest disks at the time were only 180Kbytes!
Compression is fine but this particular guru claimed that he could get an 80%
compression ratio and all the data could be fed through a second time to get a
compounded compression rate of 80% of the 80%.
Such ideas are in the same category as perpetual motion machines and faster
than light travel but without information theory you can’t see why this
particular claim is nonsense. Compression works by finding the code that
represents the data most efficiently. If the data D contains I(D) bits of
information and it is currently represented by more bits than I(D) you can
compress it by finding the optimal code. That’s what programs that zip files
attempt to do – they scan the data and work out a code custom made for it.

131
The details of the code are stored along with the data in a “dictionary” but
even with this overhead the savings made are enough to provide large space
savings in most cases.
Now why can’t you repeat the procedure to get twice the savings?
It should be obvious now with information theory. You can’t do it again
because once you have coded the information using an optimal code –you
can’t improve on perfection! So the smallest number of bits needed to
represent any data is the measure of the information it contains.
It really is that easy.
But how do you find the optimal code?
Find out in our introduction to Coding Theory in Chapter 12.

Channels, Rates and Noise


Before information theory we had little idea how much information a
communication channel could carry. Ideas such as sending TV data over
medium wave radio transmitters, or down a standard phone line sounded
reasonable until you work out the amount of information involved and the
capacity of the channel. It was this problem that really motivated Shannon to
invent information theory – after all he was employed by a telephone
company. He wanted to characterize the information capacity of a
communication channel.
Such questions are also the concern of information theory and here we have
to consider noise as well as information. The key theorem in this part of
information theory, the channel capacity theorem, says that if you have an
analog communication channel C with bandwidth B, i.e. it can carry
frequencies within the bandwidth, measured in Hz and subject to noise, then
the capacity of the channel in bits per second is:
C = B log2(1+S/N)
where S/N is the signal to noise ratio, i.e. the ratio of the signal power to the
noise power. For example, if the bandwidth is 4 kHz and the signal to noise
ratio is 100, which is typical of an analog phone connection then:
C = 4000 log2(1+100)= 26632.8 ≈ 26 kbits/s
If this is really the specification of the phone line, then this is the maximum
rate at which you can send data. If you try to send data faster than this then
errors will occur to reduce the achieved rate to 26 kbits/s. You can see that as
you increase the noise the channel capacity drops. Intuitively what this
means is that as the noise gets louder you have to repeat things to ensure that
the signal gets through. Alternatively you could shout louder and increase the
signal to noise ratio. Finally you could increase the bandwidth. This is the
reason you need to use high frequency radio to send a lot of data.

132
If you are transmitting using a 1 GHz signal it is easy to get a 1 MHz
bandwidth. If you are transmitting at 10 MHz then 1 MHz is a sizable chunk
of the available spectrum.
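The arithmetic of the phone-line example is easy to reproduce:

import math

def capacity(bandwidth_hz, signal_to_noise):
    """Shannon channel capacity in bits per second."""
    return bandwidth_hz * math.log2(1 + signal_to_noise)

# A 4 kHz analog phone line with a signal-to-noise ratio of 100:
print(capacity(4000, 100))    # about 26633 bits/s, i.e. roughly 26 kbits/s
# Halving the bandwidth or making the noise louder reduces the capacity:
print(capacity(2000, 100))    # about 13316 bits/s
print(capacity(4000, 10))     # about 13838 bits/s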

More Information - Theory


From here information theory develops, becoming even more interesting and
leading on to all sorts of ideas - how can you achieve the channel capacity
with suitable codes, error correction, the Nyquist rate, reconstructing signals
from samples and so on.
Information theory also provides a model that enables you to analyze many
different types of system. In psychology you can work out how much
information is being processed by the brain. Information is also related to
entropy which is a natural quantity in many physical systems and this makes
the connection between physics and computer science.
You can find out more from any book on information theory and there is still
a lot of theoretical work to be done.

133
Summary
● Before information theory we had little idea how much information
any signal or set of symbols contained.

● The key observation that led Shannon to his formulation of


information theory was that the amount of information was related to
the degree of surprise at receiving the message.

● After some consideration, it becomes clear that the amount of


information in a message A is I(A)=-log2(p) where p is the probability
of receiving the message.

● The unit of information is the bit and another way to understand


information is as the number of binary questions you have to ask to
determine a particular outcome.

● Data compression relies on using the most efficient code to represent


the data.

● Although information theory is applied to digital systems, it was


originally invented to deal with analog systems such as telephone
lines and radio channels.

● The key theorem from classical information theory is the channel


capacity theorem, which relates the bandwidth and signal to noise
ratio to the information capacity of a channel.

134
Chapter 12

Coding Theory – Splitting the Bit

In the previous chapter we learned that if a message or symbol occurs with


probability p then the information contained in that symbol or message is
-log2(p) bits. For example, consider the amount of information contained in a
single letter from an alphabet. Assuming there are twenty-six letters in an
alphabet and assuming they are all used equally often the probability of
receiving any one particular letter is 1/26 which gives -log2(1/26)=4.7 bits
per letter.
It is also obvious from other elementary considerations that five bits are more
than enough to represent 26 symbols. Quite simply, five bits allow you to
count up to more than 26 and so you can assign one letter of the alphabet to
each number. In fact, five bits is enough to represent 32 symbols because you
can count up to 31, i.e. 11111.
Using all five bits you can associate A with zero and Z with 25 and the values
from 26 to 31 are just wasted. It seems a shame to have wasted bits because
we can’t quite find a way to use 4.7 bits – or can we? Can we split the bit?

Average Information
Before we go on to consider “splitting the bit”, we need to look more carefully
at the calculation of the number of bits of information in one symbol from an
alphabet of twenty-six. Clearly the general case is that a symbol from an
alphabet of n symbols has information equal to –log2(n) bits, assuming that
every symbol has equal probability.
Of course what is wrong with this analysis is that the letters of the alphabet
are not all equally likely to occur. For example, if you don’t receive a letter “e”
you would be surprised – count the number in this sentence. However, your
surprise at seeing the letter “z” should be higher – again count the number in
some other sentence than this one!
Empirical studies can provide us with the actual rates that letters occur – just
count them in a book for example. If you add the “space” character to the set
of symbols, now a total of 27, you will discover that it is by far the most
probable character with a probability of 0.185 of occurring, followed by “e” at
0.100, “t” at 0.080 and so on down to “z” which has a probability of only
0.0005. It makes you wonder why we bother with “z” at all!

135
But, given that “z” is so unlikely, its information content is very high:
-log2(0.0005) = 10.96 bits. So, “z” contains nearly eleven bits of information
compared to just over three bits for “e”.
If you find this argument difficult to follow just imagine that you are placing
bets on the next symbol to appear. You expect the next symbol to be a space
character so you are very surprised when it turns out to be a “z”. Your betting
instincts tell you what the expected behavior of the world is and when
something happens that isn't expected you gain some information.
We can define an average information that we expect to get from a single
character taken from the alphabet.
If the ith symbol occurs with probability pi then the information it contains is
–log2(pi) bits and the average information contained in one symbol, averaged
over all symbols is:
–p1log2(p1) –p2log2(p2) … –pnlog2(pn)
You can see that this is reasonable because each symbol occurs with
probability pi and each time it occurs it provides –log2(pi) bits of
information. Notice that while “z” carries 11 bits of information, the fact that
it doesn’t occur very often means that its contribution to the average
information is only about 0.0055 bits.
Applying this formula to the letters of the alphabet, plus space, and their
actual probabilities gives an average information of 4.08 bits per symbol,
which should be compared to log2(27) = 4.76 bits per symbol for an alphabet
of 27 equally likely characters. Notice that for 27 equally likely characters the
average information per symbol is the same as the information in any single
character, 4.76 bits – to see why try working it out.
No coding method can represent messages using fewer than the number of
bits given by their average information. Inefficient coding methods can, and
do, use more bits than the theoretical minimum. This waste of bits is usually
called redundancy because the extra bits really are unnecessary, i.e.
redundant. Redundancy isn't always a waste as it can be used to detect and
even correct errors in the messages, as we will see the next chapter.
There is also a nice theorem which says that the average information per
symbol is largest if all of the symbols are equally likely. This means that if
any of the symbols of an alphabet are used more or less often than others, the
average information per symbol is lower. It is this observation that is the key
to data compression techniques.
In general the average information, also known as the entropy, is given by:
E = -∑i pi log2(pi)

where the sum is over all the symbols that are in use.
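A small sketch of the calculation, using the five-symbol probabilities that reappear in the Huffman example later in this chapter and the 27-character alphabet discussed above:

import math

def entropy(probabilities):
    """Average information per symbol, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.1, 0.15, 0.2, 0.25, 0.3]))   # about 2.23 bits per symbol
print(entropy([1/27] * 27))                   # about 4.75 bits - 27 equally likely symbols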

136
Make It Equal
If you have a set of symbols then the most efficient way of using them is to
make sure that the average amount of information per symbol is maximized.
From our discussion so far, it should now be clear that the way to do this is to
ensure that each symbol is used equally often – but how? The answer is to
use a code that forces the symbols to occur equally often, even if they don’t
want to!
Consider the standard alphabet and divide it into two groups of letters so that
the probability of a letter belonging to either group is the same, i.e. 0.5. To do
this we have to include a mix of likely and not-so-likely letters to get the right
probability. With this division of the alphabet we can begin to construct our
code by using a 0 to indicate that the letter is in the first group and a 1 to
indicate that it is in the second group. Clearly the first bit of our code has an
equal chance of being a 0 or a 1 and we can continue on in this way by sub-
dividing each of the two groups into equally likely subsets and assigning 0 to
one of them and 1 to the other:

Now when we receive a set of data bits we use them one after another to find
which group the letter is in much like a well known “yes/no” question and
answer game. Each group is equally likely and hence each symbol used in the
code is also equally likely (only 8 letters shown in the diagram).
This code was invented by Shannon and Fano and named after them. If you
can construct such a code it is optimal in the sense that each symbol in the
code carries the maximum amount of information. The problem is that you
can’t always divide the set into two equally probable groups – not even
approximately!
There is a better code.

137
Huffman Coding
The optimal code for any set of symbols can be constructed by assigning
shorter codes to symbols that are more probable and longer codes to less
commonly occurring symbols. The way that this is done is very similar to the
binary division used for Shannon-Fano coding, but instead of trying to create
groups with equal probability we are trying to put unlikely symbols at the
bottom of the “tree”. The way that this works is that we sort the symbols into
order of increasing probability and select the two most unlikely symbols and
assign these to a 0/1 split in the code. The new group consisting of the pair of
symbols is now treated as a single symbol with a probability equal to the sum
of the probabilities and the process is repeated.
This is called Huffman coding, after its inventor, and it is the optimal code
that we have been looking for. For example, suppose we have the five
symbols A, B, C, D, E with probabilities 0.1, 0.15, 0.2, 0.25, 0.3 respectively:

A B C D E
0.1 0.15 0.2 0.25 0.3

The first stage groups A and B together because these are the least often
occurring symbols. The probability of A or B occurring is 0.25 and now we
repeat the process treating A/B as a single symbol:

The first stage of coding


Now the symbols with the smallest probability are C and the A/B pair which
gives another split and a combined A/B/C symbol with a probability of 0.45.
Notice we could have chosen C and D as the least likely, giving a different, but
just as good, code.

138
The second stage
The two symbols that are least likely now are D and E with a combined
probability of 0.55. This also completes the coding because there are now
only two groups of symbols and we might as well combine these to produce
the finished tree.

The final step

139
This coding tree gives the most efficient representation of the five letters
possible. To find the code for a symbol you simply move down the tree
reading off the zeros and ones as you go until you arrive at the symbol.
To decode a set of bits that has just arrived you start at the top of the tree and
take each branch in turn according to whether the bit is a 0 or a 1 until you
run out of bits and arrive at the symbol. Notice that the length of the code
used for each symbol varies depending on how deep in the tree the symbol is.
The theoretical average information in a symbol in this example is about
2.23 bits - this is what you get if you work out the average information
formula given earlier. If you try to code B you will find that it corresponds to
111, i.e. three bits, and it corresponds to moving down the far right hand
branch of the tree. If you code D you will find it corresponds to 00, i.e. the far
left hand branch on the tree. In fact each remaining letter is coded as either a
two- or three-bit code and guess what? If the symbols occur with their
specified probabilities, the average length of code used is 2.25 bits - only just
above the theoretical minimum.
So we have indeed split the bit! The code we are using averages 2.25 bits to
send a symbol.
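The whole construction is easy to automate. Here is a sketch using Python's heapq module; because of ties in the probabilities it may build a slightly different tree from the one in the diagrams but, as noted above, such a tree is just as good and has the same average code length:

import heapq

def huffman(symbol_probs):
    """Build a Huffman code: a dict mapping each symbol to a string of bits."""
    # Each heap entry is (probability, tie_breaker, {symbol: code_so_far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(symbol_probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)    # the two least likely groups...
        p1, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in group0.items()}
        merged.update({s: "1" + code for s, code in group1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))   # ...become one new group
        counter += 1
    return heap[0][2]

probs = {"A": 0.1, "B": 0.15, "C": 0.2, "D": 0.25, "E": 0.3}
codes = huffman(probs)
print(codes)   # A and B get three-bit codes, C, D and E two-bit codes
print(sum(probs[s] * len(code) for s, code in codes.items()))   # 2.25 bits on average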
Notice that there are some problems with variable length codes in that it is
more difficult to store them because you need to indicate how many bits are
in each group of bits. The most common way of overcoming this is to use
code words that have a unique sequence of initial bits. This wastes some code
words, but it still generally produces a good degree of data compression.

Efficient Data Compression


If you have some data stored say on disk then it is unlikely to be stored using
an efficient code. After all the efficient code would depend on the
probabilities that each symbol occurred with and this is not something taken
into account in simple standard codings. What this means is that almost any
file can be stored in less space if you switch to an optimal code.
So now you probably think that data compression programs build Huffman
codes for the data on a disk? They don’t because there are other
considerations than achieving the best possible data compression, such as
speed of coding and decoding. However, what they do is based on the
principle of the Huffman code. They scan through data and look for patterns
of bits that occur often. When they find one say “01010101” they record it in a
table and assign it a short code 11 say. Now whenever the code 11 occurs in
the coded data this means 01010101, i.e. 8 bits are now represented by 2. As
the data is scanned and repeating patterns are found the table, or
“dictionary”, is built up and sections of the data are replaced by shorter
codes.

140
This is how the data compression is achieved, but when the file is stored back
on disk its dictionary has to be stored along with it. In practice data is
generally so repetitive that the coded file plus its dictionary is much smaller
than the original. There are even schemes called "data deduping" that build a
system-wide dictionary and apply it to everything in a storage system. If every
document starts in the same way with a standard heading and ends with a
legal statement then this produces huge compression ratios.
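You can see this dictionary-style compression at work with Python's zlib module, which combines this kind of pattern matching with Huffman coding:

import zlib, os

# Highly repetitive data compresses dramatically...
repetitive = b"Dear customer, thank you for your order. " * 100
print(len(repetitive), len(zlib.compress(repetitive)))
# ...the compressed version is a tiny fraction of the original size.

# Data that is already random, i.e. already carries the maximum information
# per bit, hardly compresses at all - in fact it comes out slightly bigger.
random_data = os.urandom(len(repetitive))
print(len(random_data), len(zlib.compress(random_data)))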

141
Summary
● Information theory predicts the number of bits required to represent a
message, which sometimes is a fractional quantity.

● In particular, if we consider average information contained in a


message, a fractional number of bits will be used to represent it.

● No coding method can use fewer bits than the average information,
but inefficient coding methods can use many more bits.

● The average information in a message is maximum when all of the


symbols are equally likely.

● We can get close to the number of bits in the average information


using the Shannon-Fano coding algorithm which attempts to split the
symbols into groups that are close to being equally likely.

● The Shannon-Fano coding is good, but in most cases it isn't optimal.


To get an optimal code we need to use the Huffman coding algorithm
where variable length codes are used. The shortest codes represent the
most likely symbols.

● Finding codes that represent messages using a smaller number of bits


is generally not done using theoretically optimal methods. The reason is that
speed of coding is important. The most common method of
compressing data is to find repeating patterns and represent them as
entries in a dictionary.

142
Chapter 13

Error Correction

Error correcting codes are essential to computing and communications. At


first they seem a bit like magic - how can you possibly not only detect an
error but correct it as well? In fact, it turns out to be very easy to understand
the deeper principles.
The detection and correction of errors is a fundamental application of coding
theory. Without modern error correcting codes the audio CD would never
have worked. It would have had so many clicks, pops and missing bits due to
the inevitable errors in reading the disc, that you just wouldn't have been able
to listen to it. Photos sent back from space craft wouldn't be viable without
error correcting codes. Servers often make use of ECC memory modules
where ECC stands for Error Correcting Code and secure data storage wouldn't
be secure without the use of ECC. Finally the much-loved QR Code has four
possible levels of error correction with the highest level allowing up to 30% of
the code to be damaged:

This QR code can still be read due to error correcting codes


So how does error detection and correction work?

Parity Error
There is a classic programmer's joke that you have probably already heard:
Programmer buys parrot. Parrot sits on programmer’s shoulder and
says “pieces of nine, pieces of nine,..”. When asked why it isn’t saying
the traditional “pieces of eight” the programmer replies, “It’s a parroty
error!”
Parity error checking was the first error detection code and it is still used
today. It has the advantage of being simple to understand and simple to
implement. It can also be extended to more advanced error detection and
correction codes.

143
To understand how parity checking works consider a seven-bit item of data:
0010010
If this was stored in some insecure way, then if a single bit was changed from
a 0 to a 1 or vice versa you would never know. If, however, instead of storing
seven bits, we store eight, with the eighth – the parity bit – set to make the total
number of ones odd or even, you can see that a single-bit change is
detectable.
If you select odd parity then the eight bits are:
1 0010010
i.e. the parity bit is a 1, and if any of the bits changes then the parity changes
from odd to even and you know there has been a bit error.
Parity checking detects an error in a single bit but misses any errors that flip
two bits, because after any even number of bit changes the parity is still the
same.
This type of error checking is used when there is a fairly small probability of
one bit being changed and hence an even smaller probability of two bits being
changed.
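A sketch of adding and checking an odd parity bit, treating the data as a string of 0s and 1s:

def add_odd_parity(bits):
    """Prepend a parity bit chosen so that the total number of ones is odd."""
    parity = "0" if bits.count("1") % 2 == 1 else "1"
    return parity + bits

def parity_ok(word):
    """True if the stored word still has an odd number of ones."""
    return word.count("1") % 2 == 1

word = add_odd_parity("0010010")
print(word)                   # 10010010 - the parity bit is the leading 1
print(parity_ok(word))        # True - nothing has changed
print(parity_ok("10110010"))  # False - a single flipped bit is detected
print(parity_ok("10110110"))  # True - but a two-bit error slips through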
Hamming Distance
When you first meet parity error detection it all seems very simple, but it
seems like a “one-off” rather than a general principle. How, for example, do
you extend it to detect a two-bit or three-bit error? In fact, parity checking is
the simplest case of a very general principle but you have to think about it all
in a slightly different way to see this.
When a bit is changed at random by noise you can think of the data word as
being moved a small distance away from its true location. A one-bit change
moves it a tiny amount, a two-bit change moves it further and so on. The
more bits that are changed the further away the data word is from its original
true location. A simple measure of this distance between two data words is to
count the number of bits that they differ by. For example, the two data words
011 and 110 are two units apart because they differ in two places – the first
and last bits. This is called the Hamming distance after R. W. Hamming who
did much of the early work into error detection and correction.
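Counting the bits two words differ by is a one-liner:

def hamming_distance(a, b):
    """The number of bit positions in which two equal-length words differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(hamming_distance("011", "110"))   # 2 - they differ in the first and last bits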

Richard Hamming
1915-1998

144
What has Hamming distance got to do with parity checking? Simple - if you
take a valid data word which has a parity bit associated with it and change a
single bit then you have a data word which is one Hamming unit of distance
away from the original. Every valid code word has an invalid code word one
unit away from it.
To imagine this it is easier to think of a three-bit code. In this case you can
draw a cube to represent the location of each possible code word.

A code cube
If we treat all even parity words as valid and odd parity words as invalid, you
can see at once that a code such as 000 is surrounded by invalid codes one
unit away, e.g. 001, 010 and 100. If you move two units away then you reach
valid codes again. What is more, every valid code is surrounded by a cluster
of invalid codes one unit away. In other words, a single-bit error always
moves a valid code to an invalid code and hence we detect the error.

Light colored corners are valid codes – dark are invalid

145
Hypercubes
This may seem like a very complicated way of thinking about a very simple
code, but it opens up the possibility of constructing more advanced codes.
For example, the key to the parity checking code is that every valid code is
surrounded by invalid codes at one unit’s distance and this is why one-bit
errors are detected. To detect two-bit errors all we need to do is increase the
radius of the "sphere" of invalid codes from one unit to two units.
In the three-bit code case this leaves us with only two valid codes, 000 and
111. Anything else is an invalid code that was generated by changing one or
two bits from the original code words. Notice that 000 and 111 are three units
away from each other. Now we have a code that can detect a one- or two-bit
error with certainty, but, of course, it can’t detect a three-bit error!

A two-bit error detecting code cube


Problems start to happen when you increase the number of bits in the code.
If the code has n bits then it labels the 2n corners of an n-dimensional
hypercube. All we have to do to make a code that detects m-bit errors is to
surround each valid code word with invalid codes out to a distance of m
units.
Although it might seem like strange terminology, you can refer to the
surrounding group of invalid codes as a “sphere of radius m” - after all they
are all within the same distance of the valid center code. It is also clear that
error detection carries an increasing overhead. A three-bit code that detects
one-bit errors has four valid codes and four invalid codes. A three-bit code
that detects two-bit errors has only two valid codes and six invalid ones.
Error detecting codes don’t carry information efficiently – you have to use
more bits than strictly necessary for the information content. This is called
“adding redundancy” and follows the general principle of information theory
that in the presence of noise adding redundancy increases reliability.

146
Error Correction
To make the small, but amazing, step from error detection to error correction
we need to construct an error correction code, something that we have, in
fact, already done! Consider the three-bit code that can detect up to two-bit
errors. In this code the only valid words are 000 and 111. Now consider what
it means if you retrieve 001. This is clearly an illegal code and so an error has
occurred, but if you assume that only a one-bit error has occurred then it is
also obvious that the true code word was 000.
Why? Because the only valid code word within 1 unit’s distance of 001 is
000. To get to 111 you have to change two bits. In other words, you pick the
valid code closest to the invalid code you have received.

The light colored incorrect code word is closest to 000


What this means is that a code that has invalid words surrounding it, forming
a sphere of radius two, can detect errors in up to two bits and can correct one-
bit errors. The general principle isn't difficult to see. Create a code in which
each valid code word is surrounded by a sphere of invalid words of radius m
and you can detect up to m-bit errors and correct errors in up to m/2 bits,
rounded down - one bit in our radius-two example. The correction algorithm
is simply to assign any incorrect code to the closest correct code.
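For the tiny 000/111 code the "pick the closest valid word" rule is easy to write down. A sketch, reusing the Hamming distance function from earlier:

VALID_CODES = ["000", "111"]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def correct(received):
    """Decode by choosing the valid code word closest to the received word."""
    return min(VALID_CODES, key=lambda code: hamming_distance(code, received))

print(correct("001"))   # 000 - a one-bit error has been corrected
print(correct("110"))   # 111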

Real Codes
Of course this isn’t how error detecting/correcting codes are implemented in
practice! It is the theory but making it work turns out to be much more
difficult. The additional problems are many. For example, to have a high
probability of dealing with errors you have to use a large number of bits. You
can make any unreliable storage or transmission medium as reliable as you
like by throwing enough bits at it. This means that you might have to use a
huge table of valid and invalid code words, which is just not practical.

147
In addition, whatever code you use it generally has to be quick to implement.
For example, if an error occurs on a CD or DVD then the electronics corrects
it in real time so that the data flow is uninterrupted. All this means that real
codes have to be simple and regular.
There is also the small problem of burst errors. Until now we have been
assuming that the probability of a bit being affected by noise is the same no
matter where it is in the data word. In practice noise tends to come in bursts
that affect a group of bits. What we really want our codes to do is protect
against burst errors of m bits in a row rather than m bits anywhere in the
word. Such burst error correcting codes are more efficient but how to create
them is a difficult problem.
For example, if you want to use a code that can detect a burst error of b bits
then you can make use of a Cyclic Redundancy Check (CRC) of b
additional bits. The CRC is computed in a simple way as a function of the
data bits, and checking the resulting data word is equally simple, but the
theory of how you arrive at a particular CRC computation is very advanced
and to understand it would take us into mathematical areas such as Galois
theory, linear spaces and so on.
The same sort of theory can be extended to create error-correcting codes
based on CRC computations, but if you want to learn more about error
detection and correction codes then you have to find out first about modern
mathematics.
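You don't need the Galois theory just to use a CRC - most languages provide one ready-made. Python's zlib.crc32, for example, computes a standard 32-bit CRC:

import zlib

data = b"an important message"
checksum = zlib.crc32(data)     # 32 extra bits stored or sent along with the data

# Later, recompute the CRC and compare - any burst error of up to 32 bits,
# and almost all longer ones, will change the checksum.
received = b"an imp0rtant message"
print(zlib.crc32(received) == checksum)   # False - the corruption is detected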

148
Summary
● If you are prepared to use more bits than are necessary to send a
message, you can arrange for errors to be detected and even corrected.

● The first error detection scheme we encounter is usually parity. An


extra bit is included with the data to make the data either have an odd
number of ones or an even number of ones. If a single bit changes
then the odd or even property is changed and the error can be
detected.

● A parity check can only detect a single bit error. Changing two bits
returns the data to the original parity.

● The Hamming distance between two items of data is simply the


number of bits by which they differ.

● Another way to look at the parity check is to notice that an error in a


valid code word changes it into an invalid code word just one unit
away.

● This idea can be generalized to detect errors in more than one bit. If
each valid code word is surrounded by invalid code words that differ
by m bits, then we can detect errors that change up to m bits.

● Error correction extends this idea so that each invalid code is within m
bits of only a single valid code. This single valid code is the only
one that could have resulted in the invalid code by a change of up to m
bits and hence we can correct the error by selecting it as the correct code.

● In practice error detection and correction codes have to be fast and


easy to use. The error has to be detectable and correctable by a simple
computation, usually implemented in hardware.

● To design codes of this type requires a lot of modern mathematical


ideas.

149
Chapter 14

Boolean Logic

You might be thinking that this is a little late to be introducing ideas about
logic. Computers are created using logic gates, so why wait so long to discuss
them? The truth of the matter is that historically logic and computers did not
really fit together. Early computers were thought of as calculating machines
and it still is difficult to see where logic fits into rotating gear wheels counting
off integers. When Turing was involved in building an early computer, a real
Turing machine, he tended to think in terms of artificial neurons rather than
logic gates.
Even today we tend to be over-simplistic about logic and its role in
computation and understanding the world and even George Boole, the man
who started it all, was a bit over the top with the titles of his books on the
subject - The Mathematical Analysis of Logic and An Investigation of the Laws
of Thought.
Boole’s work certainly set modern logic off on the right road, but it certainly
wasn’t anything to do with something as general as the “laws of thought”.
Even today we have no clear idea what laws govern thought and if we did the
whole subject of artificial intelligence would be a closed one. What Boole did
to be recognized as the father of modern information technology was to come
up with an idea that was at the same time revolutionary and simple.
Who was George Boole?
A contemporary of Charles Babbage, whom he briefly met, George Boole is
these days credited as being the "forefather of the information age".

George Boole 1815 - 1864

151
An Englishman by birth, in 1849 he became the first professor of mathematics
in Ireland’s newly founded Queen’s College which is now University College,
Cork.
He died at the age of 49 in 1864 and his work might never have had an impact
on computer science without Claude Shannon, see Chapter 11, who 70 years
later recognized the relevance for engineering of Boole’s symbolic logic. As a
result, Boole’s thinking has become the practical foundation of digital circuit
design and the theoretical grounding of the digital age.

Boolean Logic
Boolean logic is very easy to explain and to understand. You start off with the
idea that some statement P is either true or false, it can't be anything in
between (this is called the law of the excluded middle). Then you can form
other statements, which are true or false, by combining these initial
statements together using the fundamental operators AND, OR and NOT.
Exactly what a "fundamental" operator is forms an interesting question in its
own right, something we will return to later when we ask how many logical
operators do we actually need?
The way that all this works more or less fits in with the way that we use these
terms in English. For example, if P is true then NOT(P) is false. So, if “today is
Monday” is true then “NOT(today is Monday)” is false. We often translate the
logical expression into English as “today is not Monday” and this makes it
easier to see that it is false if today is indeed Monday. The problem with this
sort of discussion is that it very quickly becomes convoluted and difficult to
follow and this is part of the power of Boolean logic. You can write down
arguments clearly in symbolic form.

Truth Tables
The rules for combining expressions are usually written down as tables listing
all of the possible outcomes. These are called truth tables and for the three
fundamental operators these are:
P Q  P AND Q      P Q  P OR Q      P  NOT P
F F     F         F F     F        F    T
F T     F         F T     T        T    F
T F     F         T F     T
T T     T         T T     T

152
Notice that while the Boolean AND is the same as the English use of the term,
the Boolean OR is a little different. When you are asked would you like tea
OR coffee you are not expected to say yes to both! In the Boolean case,
however, “OR” most certainly includes both. When P is true and Q is true the
combined expression (P OR Q) is also true.
There is a Boolean operator that corresponds to the English use of the term
“or” and it is called the “Exclusive OR” written as XOR or EOR. Its truth table
is:
P Q P XOR Q
F F F
F T T
T F T
T T F

and this one really would stop you having both tea and coffee at the same
time (notice the last line is True XOR True = False).

Practical Truth Tables


All this seems very easy but what value does it have? It most certainly isn’t a
model for everyday reasoning except at the most trivial “tea or coffee” level.
We do use Boolean logic in our thinking, but only at the most trivially
obvious level. However, if you start to design machines that have to respond
to the outside world in even a reasonably complex way then you quickly
discover that Boolean logic is a great help.
For example, suppose you want to build a security system which only works
at night and responds to a door being opened. If you have a light sensor you
can treat this as giving off a signal that indicates the truth of the statement:
P = it is daytime
Clearly NOT(P) is true at night and we have our first practical use for Boolean
logic! Similarly the door opening sensor can be considered to correspond to
true if the door is open:
Q = door open
What we really want is something that works out the truth of the statement:
R = burglary in progress
in terms of P and Q.
A little thought soon leads to the solution:
R = NOT(P) AND Q

153
That is, the truth of “burglary in progress” is given by:

P Q NOT(P) NOT(P)AND Q
F F T F
F T T T
T F F F
T T F F

From this you should be able to see that the alarm only goes off when it is
night-time and a door opens.
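Written as code the whole alarm is one line, and a couple of loops reproduce the truth table (the names here are just for this example):

def burglary_in_progress(is_daytime, door_open):
    """R = NOT(P) AND Q - the alarm condition."""
    return (not is_daytime) and door_open

for p in (False, True):          # P = it is daytime
    for q in (False, True):      # Q = door open
        print(p, q, burglary_in_progress(p, q))
# Only P = False, Q = True - night-time with the door open - gives True.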

From Truth Tables to Electronic Circuits


Not only does this little demonstration illustrate why Boolean logic is useful
in designing such systems, it also explains why electronics circuits to perform
Boolean logic are commonplace. You can buy integrated circuits that perform
AND, OR and NOT and many other combinations of these operations in a
single easy to use package. Where is the “truth” in this? Clearly Boolean logic
works with “true” and “false” and this is certainly how Boole himself thought
about it, but you can work with the same system of operators and results no
matter what you call the two states - up/down, on/off or zero/one.
In the electronics case it is more natural to think of high and low voltage as
representing the natural states that we are working with. An explanation that
involves Boolean logic often sounds more appropriate when expressed in
terms that belong to the underlying subject matter but it is still the same
Boolean logic.
What has all of this got to do with computers? The answer is two-fold. The
first connection is that computers contain electronic circuitry that behaves in
much the same way as the burglar alarm. In general, Boolean logic helps
when you need to design a circuit that has to give an output only when
certain combinations of inputs are present. Such a circuit uses “combinatorial
logic” and there are lots of them inside a computer. You can specify the
behavior of any piece of combinatorial logic using a truth table. The only
difference is that in most cases the tables are very large and converting them
into efficient hardware is quite a difficult job.
Logic in Hardware
What makes Boolean logic so central to computers and a wide range of
electronic devices is that logic is easy to implement. A logic "gate" is a piece of
electronics that takes in some inputs and produces an output according to
one of the logic tables. Usually the inputs are either high or low and this is
interpreted as 1 and 0 or “true” and “false”. For example, a NOT gate has a
single input and its output is high when its input is low, and vice versa. You
can create more complicated logic circuits by wiring up gates.
The standard symbols for the three basic logic gates - NOT, AND and OR plus
the XOR are:
[Figure: the standard symbols for the NOT, AND, OR and XOR gates]
In general, a small circle on a connection indicates a negation or a NOT. For
example, the symbols for NAND, NOR and XNOR, the negation of XOR, are:
[Figure: the symbols for the NAND, NOR and XNOR gates]
Note that there are also European standard symbols, which are different.
In many ways this is electronics made easy. You don't need to know anything
about transistors or other components to build a logic circuit. Just work out
what the logical expression is and wire gates up to implement it. For example,
our burglar alarm has the logical expression:
R = NOT(P) AND Q
So to implement it in hardware we need a NOT gate and an AND gate:
[Figure: a NOT gate and an AND gate wired to give R = NOT(P) AND Q]
Real logic circuits are much more complicated than this and can make use of
hundreds, or even thousands, of gates.

Binary Arithmetic
The second connection that Boolean logic has with computers is actually just
a special case of the first, but it is very important and it often confuses people.
Boolean logic can be used to implement binary arithmetic. Notice that there is
no suggestion that binary arithmetic and Boolean logic are the same thing -
they aren’t. Binary arithmetic is just an example of a place value system for
representing values. That is, we humans work in base 10 and computers find
it easier to work in base two, because they don’t have ten fingers. However,
when you do binary arithmetic you follow standard rules that determine how
you should combine bits together to produce result bits. These rules can be
expressed as combinatorial logic, in other words a truth table.
You can easily construct a table showing how to add two bits together to give
a result and a carry to the next place:

A B Result Carry
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

If you don't know how to add two bits together then simply accept the table
as your instructions for how to do it.
You can see that this is just another truth table and the combinatorial logic
needed for it can be produced in the usual way. Actually this is only a half
adder - yes this is the real technical term - in that it doesn’t add a carry bit
that might have been generated by a previous pair of bits.

The table for a “full adder”, this too is the correct technical term, is:

A B C Result Carry
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1

The truth table can be converted into hardware:
[Figure: a full adder built from logic gates]
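If you want to play with the adders without building hardware, the two truth tables can be written directly as Boolean expressions. The following Python sketch (an illustration, not taken from the book) uses the standard formulas - XOR for the result bit, AND and OR for the carries - and printing every input combination reproduces the full adder table:

def half_adder(a, b):
    # result = a XOR b, carry = a AND b
    return a ^ b, a & b

def full_adder(a, b, c):
    # two half adders plus an OR to combine the carries
    s1, c1 = half_adder(a, b)
    result, c2 = half_adder(s1, c)
    return result, c1 | c2

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            result, carry = full_adder(a, b, c)
            print(a, b, c, result, carry)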
So combinatorial circuits are the main connection between Boolean logic and
computer hardware, but there is more to computer hardware than
combinatorial logic.
Occasionally you will hear computers referred to as just one huge piece of
Boolean logic, but this is an overstatement. Large chunks are just AND, OR
and NOT gates but there are also huge chunks that are not describable in
terms of pure Boolean logic.

Sequential Logic
Computers also include sequential logic, which includes an element of time.
For example, a flip-flop, again perfectly good jargon, is a circuit that changes
state in the same way as a pendulum “flips” from side to side. The odd thing
is that you can make a flip-flop from two NOT gates and most sequential logic
can be understood in terms of combinations of logic gates. However, this is an
incomplete explanation and requires something beyond Boolean logic. For
example, Boolean logic cannot cope with the statement:
A = NOT A
If A is true then NOT A is false and so A must be false, but this means that
NOT A is true, which means that A must be true, which means, and so on…
So from the point of view of Boolean logic this is nonsense.
Now imagine that on one side of a blackboard you write “A is true” and
on the other side you write “NOT A is true”. Now when you read “A is true”
you are happy with this. When you turn the blackboard over you see “Not A
is true” and so conclude A is false. As you keep flipping the board over the
value of A flips from true to false and back again. This is, of course, just the
paradox of self-reference as introduced and used in earlier chapters. The big
difference is that now it isn't a paradox as you can include time in the system.
If you think that each time you turn the blackboard over we create a new state
then there is no paradox - when the board is one way we have true, the other
way we have false.
This turning over of the blackboard is the extra part of the theory needed to
deal with sequential logic - it shows how time changes things. So computer
hardware is not just one large statement in Boolean logic, which just happens
to be true or false. Its state changes with every tick of its system clock. Even if
you eliminate the system clock and only look at combinatorial logic, time still
enters the picture. When the inputs to a combinatorial circuit change there is
a period when the state of the system changes from its initial to its final state -
a transient state change. For example, in an n-bit full adder we have to allow
time for the carry generated at the low order bits to propagate to the higher
order bits. Even a seemingly static logic circuit is in fact dynamic.
As a small example consider the SR (Set/Reset) latch made from two NOR
gates:
[Figure: an SR latch built from two cross-coupled NOR gates]
If you look at the truth table for this arrangement you will see that it isn't like
earlier tables:

S R Q
0 0 No change
0 1 0
1 0 1
1 1 Unstable

If both S and R are at zero then the output doesn't change - it is latched. If you
set R to one then the output Q goes low and it is reset. If you set S to one then
the output goes high and it is set. If you put a one on S and R then we appear
to have a paradox because we can't set and reset the latch at the same time. In
fact what happens is that the latch either goes to one or zero depending on
how fast signals propagate through it and the exact timing of the inputs. The
SR latch is useful because it "remembers" its last state - was it set or reset? It is
a very simple memory component.
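You can imitate this time-dependent behavior in software. The sketch below (a Python illustration of my own, not the book's) evaluates the two cross-coupled NOR gates over and over until the outputs stop changing, which is a crude stand-in for signals propagating around the feedback loop:

def nor(a, b):
    return int(not (a or b))

def sr_latch(s, r, q, qbar):
    # keep re-evaluating the two NOR gates until the outputs settle
    for _ in range(10):
        new_q, new_qbar = nor(r, qbar), nor(s, q)
        if (new_q, new_qbar) == (q, qbar):
            break
        q, qbar = new_q, new_qbar
    return q, qbar

print(sr_latch(1, 0, 0, 1))    # set:       Q goes to 1   -> (1, 0)
print(sr_latch(0, 0, 1, 0))    # no change: Q stays at 1  -> (1, 0)
print(sr_latch(0, 1, 1, 0))    # reset:     Q goes back to 0 -> (0, 1)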
Even simple computers are more complicated than simple static Boolean logic
would suggest. In the real world you cannot ignore time and the way the
system develops over time, something that Boolean logic leaves out.

De Morgan's Laws
Boolean logic is fundamental to the design of computer hardware even if it
isn’t the whole story. The same holds true for programming. A program also
needs the element of time built into it to make it work and this takes it
beyond the bounds of simple Boolean logic. However, there are times when
simple common sense reasoning lets even the best programmer down. For
example, a common task is making a decision using an IF statement:
IF (A>0 AND A<10) THEN do something
The bracket following the IF is almost pure Boolean logic and the something
only gets done if it works out to be true. So far, so simple. So simple, in fact,
that many programmers decide that they don’t need to know anything about
Boolean logic at all; a serious error. Consider how you would change this
example so that the “something” was done when the condition was false. The
simplest way is to write:
IF NOT(A>0 AND A<10) THEN do something
However, this offends some programmers who want to rewrite the condition
not using the NOT. Try it and see what you come up with.
A common mistake is to use:
IF (A<=0 AND A>=10) THEN do something
This isn’t correct and if you don’t believe it simply write out the truth table
for both. The correct NOT of the condition is:
IF (A<=0 OR A>=10) THEN do something
The switch from AND to OR shocks many experienced programmers and it is a
source of many programming errors.
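If you are in any doubt, it is quicker to let the machine check than to argue. A small Python sketch (illustrative only) compares the correct negation with the common mistake for a range of values of A:

for A in range(-5, 16):
    original = A > 0 and A < 10
    correct  = A <= 0 or A >= 10     # the NOT of the original condition
    mistake  = A <= 0 and A >= 10    # the common mistake - it is never true
    assert correct == (not original)
    if mistake != (not original):
        print("A =", A, ": the AND version gives the wrong answer")

The assert never fires, while the AND version is wrong for every value of A outside the range 1 to 9.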
My advice is that if you can write a logical condition in a natural form, but
you really need the NOT of it then just write NOT in front of it! Otherwise learn
De Morgan’s laws which make the fundamental connection between
expressions that involve AND and OR:
NOT(A AND B) = NOT(A) OR NOT(B)
and
NOT(A OR B) = NOT(A) AND NOT(B)
You can see that the first law changes an expression that involves AND into
one that involves OR and the second does the opposite. You can see this more
clearly if the laws are written as:
A AND B = NOT(NOT(A) OR NOT(B))
and
A OR B = NOT(NOT(A) AND NOT(B))
You can see that you can use the first to replace A AND B by a more
complicated expression involving OR and NOT. The second can be used to
replace A OR B by an expression involving AND and NOT.
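Because there are only four combinations of truth values, De Morgan's laws can be checked exhaustively in a couple of lines. A minimal Python check (not from the book):

for A in (False, True):
    for B in (False, True):
        assert (not (A and B)) == ((not A) or (not B))
        assert (not (A or B))  == ((not A) and (not B))
print("De Morgan's laws hold for all four combinations of A and B")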

The Universal Gate
One final and slightly deep thought. How many operators do you need to
implement Boolean logic? The obvious answer is three - AND, OR and NOT, but
this isn't correct. The correct answer is that you only need one operation.
Clearly it can't be just any operation. For example, it is impossible to make an
OR from combinations of AND - try it. However, if you pick an operation that
includes an element of the NOT operator, then you can make anything you like
using it. For example, while the AND operator isn't universal, the NAND, i.e. NOT
AND, is. The truth table for NAND is just NOT(P AND Q):

P Q P NAND Q
F F T
F T T
T F T
T T F

To make a NOT all you have to do is write:
P NAND P = NOT(P)

P P P NAND P
F F T
T T F

Once you have a NOT you can make an AND:
NOT(P NAND Q) = P AND Q
The only operation that remains is to make an OR out of NAND and NOT and the
solution to this is one of De Morgan’s laws:
NOT(P AND Q)=NOT(P) OR NOT(Q)
A little work soon gives:
NOT(P) NAND NOT(Q)=P OR Q
So with a NAND operator you can build a NOT, an AND and an OR. This means
that a single operator does everything that the usual three can do.
In practical terms this means that every circuit in a computer could be built
using just a NAND gate. In practice, though, electronic engineers like to use a
variety of gates because it is simpler.
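Here is a small Python sketch (my own illustration) that builds NOT, AND and OR out of a single nand() function, following the constructions above, and checks them against Python's built-in operators:

def nand(p, q):
    return not (p and q)

def NOT(p):
    return nand(p, p)

def AND(p, q):
    return nand(nand(p, q), nand(p, q))     # NOT(P NAND Q)

def OR(p, q):
    return nand(nand(p, p), nand(q, q))     # NOT(P) NAND NOT(Q)

for p in (False, True):
    assert NOT(p) == (not p)
    for q in (False, True):
        assert AND(p, q) == (p and q)
        assert OR(p, q) == (p or q)
print("NOT, AND and OR all built from NAND alone")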
NAND isn't the only universal operator. You might guess that NOR, i.e. NOT(P OR
Q), is also universal and indeed it is. But what about XOR? There isn't a hint of a
NOT gate here. However, what about P XOR True?
P True P XOR True
F T T
T T F

Yes, P XOR True = NOT P, but that is as far as XOR can take you. Anything built
from XOR gates and constants just computes the XOR of some of its inputs,
possibly negated, and AND is not a function of that form, so XOR is not a
universal operator - of the two-input operators only NAND and NOR are. It is
something to think about that the whole of logic can be built from just one
operation.
Logic, Finite State Machines and Computers
If you take into account the way a Boolean circuit can change with time, then
you have to come to the conclusion that you can build any computational
machine using just one type of logic gate. Using nothing but a universal gate
you can build combinatorial logic and active devices such as flip-flops and
hence the components of memory.
You can see that you can build a finite state machine from nothing but
universal logic gates. You can even build a Turing machine, as long as you
simulate its tape using memory built from universal logic gates. Of course, in
practice, the machine’s memory would be finite and bounded and so you
have really just another finite state machine.
It is remarkable that something capable of so much complex behavior as a
computer can be built from the most basic of components - the universal logic
gate.
Summary
● George Boole invented a system of logic that encapsulated the way
true and false changed when combined using AND, OR and NOT.

● Boolean logic can be represented by listing the results of all inputs in
a table or as an equivalent formula.

● Boolean operators and more generally Boolean functions can be
implemented as electronic modules usually called "gates".

● You can think of the two values, true and false, as any two states. In
particular you can think of them as one and zero, i.e. binary.

● Boolean operators can be used to implement binary arithmetic and
other binary functions.

● Real logic gates need time to change state and this fact can be made
use of to implement devices that aren't simply static logic functions.

● A computer is built using logic gates that change state, usually under
the control of a clock pulse.

● Logic is used in programming but it isn't as easy as it seems. In
particular De Morgan's Laws summarize the relationship between
expressions involving AND and OR.

● A universal gate or operator is one that can be used to implement any
Boolean function. There are a number of universal gates and they all
involve negation.

● It is possible to build a complete computer using nothing but multiple
universal gates.
Part III

Computational Complexity

Chapter 15 How Hard Can It Be? 167
Chapter 16 Recursion 179
Chapter 17 NP v P 193
Chapter 15

How Hard Can It Be?

So far we have been looking at the pure logic of computation – what can and
cannot be computed. This has mostly been about the way infinity stops us
from completing a computation. Usually it involves arguments that lead us to
realize that we cannot make progress in a computation because there is a
step that involves an infinity of operations or there simply aren’t enough
programs to do the job. In the real world, we don’t actually meet infinity, so
while these results are deep they generally don’t have any effect on what we
can actually do. For example, arguing that some programming tool cannot
detect an infinite loop because of the halting problem is clearly nonsense,
even if you do occasionally encounter it as a serious argument. In the real
world what matters is how difficult a problem is with real world resources.
Computational complexity is the study of how hard a problem is in terms of
the resources it consumes. Notice that this is not the study of a single
problem, but a set of similarly formulated problems. For example, you might
be able to solve a particular quadratic equation, but what we are interested in
is the solution to all quadratic equations.
You might think that how hard a problem is would be of practical interest,
but not much theoretical interest. You might expect that problems would
show a smooth variation in how hard they are and how much computing
resources they require. It turns out that there are distinct classes of difficulty
that result from the very nature of algorithms. The easiest of these classes can
still present us with problems that are effectively out of our reach, but the
hardest present us with problems that fairly certainly will always be beyond
our computational powers. This in itself has practical consequences.

Orders
The easy part of complexity is already known to most programmers as how
long it takes to complete a task as the number of elements, or some other
measure of the size of a problem, increases. You most often come across this
idea in connection with evaluating sorting and searching methods, but it
applies equally well to all algorithms. If you have a sorting method that takes
1 second to sort 10 items, how long does it take to sort 11, 12, and so on,
items?
More precisely, what does the graph of sort time against n, the number of
items, look like? You can also ask the same question for how much memory
an algorithm takes and this is usually called its space complexity to contrast
with the time complexity that we are looking at.
If you plot the graphs of time against n for a range of sorting methods then
you notice something surprising. For small n some sorting methods will be
better than others and you can alter which method is best by "tinkering" with
it, but as n gets larger your tinkering gets swamped by the real characteristic
of the method. For small n the overheads of getting started on the problem
tend to dominate. It is only when the problem is large enough that it shows its
real nature, as this chart indicates:
[Figure: running time against n for several sorting methods]
It is how fast the graph increases with n that matters, because if one curve is
rising more steeply than another it doesn't matter how they start out - there
will always be a value of n that makes the first curve bigger. This rate of
increase in time with n is usually described by giving the rough rate of
increase in terms of powers of n.
For example, a graph that increases like a straight line is said to be of order n,
usually written O(n), and a graph that increases like n squared is said to be
order n squared or O(n^2) and so on. This is usually known as “Big O” notation
and its exact definition looks complicated if you don’t like math, and limits in
particular. Essentially, if you say something is O(f(n)) then for big n it
looks like, i.e. is close to, f(n).
One of the confusing things about Big O notation is that its precise definition
doesn't always fit in with the way it is used in computer science.
Mathematically Big O means "increases no faster than".
For example, if an algorithm takes time n^2 to process n items then you can
correctly write that it is O(n^100) or O(n^3) or anything that increases faster than
n^2. Strictly, Big O notation just provides an upper bound without insisting
that the upper bound is "tight", i.e. as small as possible. This is generally not
the way Big O is used as it is taken to mean that the function in the Big O is a
tight upper bound and in this case we would only say that O(n^2) was an
appropriate expression of the algorithm's running time. This is sloppy but it is
common practice.
Obviously if you have a choice of methods then you might think that you
should prefer the lowest order method you can find, but this is only true if
you are interested in large values of n. For small values of n an elaborate
method that proves faster for large values of n may not do as well. In
mathematical terms for small n the constants and lower order powers might
matter, but as n gets bigger the highest power term grows faster than the rest
and eventually dominates the result. So we ignore all but the highest power
or fastest growing term and throw away all the constants to report the order of
an algorithm. In the real world these additional terms are often important and
we have to put them back in. In particular when comparing algorithms at a
fine level the lower terms and even constants are useful.

Problems and Instances
In most cases we are interested in a general type of problem rather than a
specific instance of a problem. For example, sorting a list of numbers is a
problem and sorting a particular list of numbers is an instance of the problem.
It might be that a particular instance is easy compared to a typical instance of
the problem. For example, you might be lucky enough to be given a sorted list
of numbers. Some instances might be harder than a typical instance, but
while these variations in difficulty might be interesting they are not what we
generally study when we want to characterize the difficulty of a problem.
When you say how much time, or other resources, a problem takes, you can
characterize its average behavior, its worst case behavior or, less often, its
best case behavior. For example, the Quicksort algorithm takes on average
O(n log n). The best case complexity is also O(n log n). However, there are
instances where it can take O(n^2), which represents worst case complexity. In
the case of another sorting algorithm, bubble sort, the average and worst case
complexity is O(n^2) but the best case complexity is O(n).
Although in most cases we are interested in just the average or worst case
complexity, there is an interest in what happens as a problem instance
changes from being easy to difficult. If you can find a parameter that
characterizes the easiness of a problem then you can study how the
complexity changes with that parameter. For example, in the case of sorting
you could investigate the complexity versus a measure of how sorted the
starting list is. What is interesting is to determine if there is a "phase change"
in the difficulty of the problem. A phase change is terminology that comes
from physics and it refers to when a collection of atoms suddenly changes
from solid to liquid or to gas. Some problems exhibit the same sort of
phenomenon when they suddenly change from being easy to being more
difficult with the variation in some parameter. For the moment, however, we
will concentrate on average complexity.
Polynomial Versus Exponential Time
Most algorithms run in something like O(n^x) where x is hopefully a small
number like 1,2,3… Such methods are said to run in polynomial time and
notice that O(n), which is often called linear time, is also polynomial time
with x=1.
There are, however, algorithms that run slower than any polynomial time you
care to pick. This is a bit strange. Surely a big enough value of x will always
do the trick? That is, if n^2 isn’t slow enough, change to n^100 or n^1000, each one
so much slower than the previous. Such a strategy doesn’t work. This is
because there are algorithms that run in exponential time, i.e. O(a^n), and it
isn't difficult to show that no matter what value of x you choose there is a
value of n such that a^n is bigger than n^x. It is the n that is important and not
the a that is used and generally we regard O(a^n) to be exponential time, no
matter what a is.
A couple of things to notice: the difference between a^n and n^x is that in
the first a fixed value is raised to an ever-increasing power, i.e. a^1, a^2, a^3, and
so on, but in the second an ever-increasing value is raised to a fixed power,
e.g. 1^4, 2^4, 3^4, and so on. Seen in this light it isn't surprising that a^n always
overtakes n^x for a fixed x. What this means is that a polynomial-time
algorithm may take a long time to run, but it is nowhere near as bad as an
exponential-time algorithm.
The set of all algorithms that run in polynomial time is called, not
unreasonably, P and the question of whether an algorithm is in P or not is an
important and sometimes difficult to answer question, as discussed in
Chapter 17. The algorithms that increase more steeply than polynomial time
are generally referred to as exponential-time algorithms although this is a
misuse of the jargon because many run in a time that has nothing to do with
exponentiation. Often saying that an algorithm runs in exponential time is a
shorthand for saying it isn't in P but is exponential or worse. For example, the
traveling salesman's problem, i.e. find the shortest route between n cities, has
an obvious solution which runs in O(n!).
(Note: n! = n*(n-1)*(n-2)*...*1, e.g. 5! = 5*4*3*2*1)
This is a factorial time algorithm, but it is still often referred to as requiring
exponential time. Notice that this too involves multiplying increasing
numbers of terms, just like exponentiation. Also notice that O(n^n) is slower
than O(n!).
Whatever you call them, algorithms that are in P are usually regarded as
being reasonable and those not in P are considered unreasonable.
The jargon is slightly murky, but we say that anything that takes exponential
time or worse is exponential. More accurately, exponential time is generally
defined to be O(a^poly(n)) where a is a constant and poly(n) is any polynomial
in n. As already discussed O(n!) is factorial time and it increases faster than
exponential time and hence is generally worse. You will also hear O(n)
referred to as linear time, O(n log n) as quasilinear time because it doesn't
increase much faster than linear, and O(a^poly(log n)) as quasi-polynomial
time because it doesn't increase much faster than polynomial time. There are
others and you can invent more.
Finally, notice that the constants in these definitions largely don't matter and
it is usual to take a to be 2 for no reason other than that we tend to like base 2
calculations. It also doesn't matter what the base of the log is. Although base 2
is common, all log functions increase equally fast for big enough n. A few of
the most common orders of complexity, in order of increasing run time, are:
O(n) < O(n log n) < O(poly(n)) < O(2^poly(n)) < O(n!) < O(n^n)

A Long Wait
You may not think that the order of an algorithm matters too much, but it
makes an incredible difference. For example, if you have four algorithms,
O(n), O(n log2 n), O(n^2) and O(2^n) respectively, the times for various n are:

n      O(n)    O(n log2 n)   O(n^2)       O(2^n)
10     10      33            100          1024
100    100     664           10000        1,267,650,600,228,229,401,496,703,205,376
500    500     4483          250000       a value with 150 digits
1000   1000    9966          1000000      a value with 300 digits
5000   5000    61438         25000000     a value with 1500 digits

For a more practical interpretation of these figures consider them as times in
seconds. When you first develop an algorithm, testing on 10 items reveals a
spread of times of 10 seconds to about 17 minutes using the different orders.
Moving on to 5000 cases the difference is 1.4 hours for the O(n) algorithm to
around three quarters of a year for O(n^2)! However, for the exponential
algorithm the time is a number of years with well over a thousand digits -
unimaginably longer than the age of the universe.
The point is that there is a very big difference, a qualitative difference,
between algorithms that scale like a power and those that scale exponentially. This is why
classifying algorithms is important - it really does tell us something deep
about the type of problem.
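The numbers in the table above are easy to reproduce for yourself. A short Python fragment (an illustration, not from the book) computes each column - the n log2 n values come out within a rounding error of the table - and makes the point that 2^n outgrows everything else almost immediately:

import math

for n in (10, 100, 500, 1000, 5000):
    print(n, n, round(n * math.log2(n)), n * n, 2 ** n)

The last column for n = 5000 is a number with just over 1500 digits, which is where the "value with 1500 digits" entry comes from.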
Where do the Big Os Come From?
It is interesting to ask what makes one algorithm O(n), another O(n^2) and so
on? The answer lies in the loops. If you have a single loop that processes a list
of n elements then it isn't difficult to see that it runs in time O(n). That is:
For 1 to n
do something
Next
takes a time proportional to n.
If you nest one loop inside another you get O(n^2):
For 1 to n
For 1 to n
do something
Next
Next
You can continue this argument for O(n^x). In other words, the power gives
you the maximum depth of nested loops in the program and this is the case
no matter how well you try to hide the fact that the loops are nested!
You could say that an O(n^x) program is equivalent to x nested loops. This isn't
a precise idea - you can always replace loops with the equivalent recursion,
for example, but it sometimes helps to think of things in this way.
Another way to think of this is that an O(n) algorithm examines the list of
possibilities or values just once. An O(n^2) algorithm examines the entire list
once for each item in the list. An O(n^3) algorithm examines the entire list
once for each pair of items in the list and so on.
As well as simple powers of n, algorithms are often described as O(log n) and
O(n log n). These two, and variations on them, crop up so often because of
the way some algorithms work by repeatedly "halving" a task. This is also
often referred to as a "divide and conquer" algorithm. For example, if you
want to search for a value in a sorted list you can repeatedly divide the list
into two - a portion that certainly doesn't contain the value and a portion that
does. This is the well known binary search. If this is new to you then look it
up because it is one of the few really fundamental algorithms.
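For reference, here is a minimal binary search in Python (a sketch of the standard algorithm, not code from the book). Each pass through the loop halves the part of the list that could still contain the value, which is exactly where the log2 n comes from:

def binary_search(items, target):
    # items must already be sorted; returns an index or None
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1          # discard the lower half
        else:
            hi = mid - 1          # discard the upper half
    return None

print(binary_search([1, 3, 7, 9, 12, 15], 9))    # 3
print(binary_search([1, 3, 7, 9, 12, 15], 8))    # None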
Where does log n come into this? Well ask yourself how many repeated
divisions of the list are necessary to arrive at a single value. If n is 2 then the
answer is obviously one division, if n is 4 then there are two, if n is 8 then
there are three and so on. In general if n = 2^m then it takes m divisions to
reach a single element.
As the log to the base 2 of a number is just the power that you have to raise 2
by to get that number, i.e.
log2(n) = log2(2^m) = m
you can see that log2(n) is just the number of times that you can divide n
items before reaching a single item.
Hence log2(n) and n log2(n) tend to crop up in the order of clever algorithms
that work by dividing the data down. Again, you could say that an O(log2 n)
algorithm is equivalent to a repeated halving down to one value and
O(n log2 n) is equivalent to repeating the partitioning n times to get the
solution.
As this sort of division is typical of recursion, you also find O(log2 n) as the
mark of a recursive algorithm. When a recursive algorithm runs in O(n^x) it is
simply being used as a way of covering up the essentially iterative nature of
the algorithm!
What about exponential-time algorithms? Consider for a moment O(a^n)
where a^1 = a corresponds to the work needed to process a problem of size 1
and a^2 corresponds to the work to process a problem of size 2 and so on. This
looks like, first, a single loop doing a fixed amount of work and then like a
pair of nested loops doing the same amount of work. Similarly O(a^n) looks
like n nested loops doing some work.
Notice that in this case a is the average time that one of the loops takes. That
is, you get an exponential-time algorithm when it involves a variable number
of nested loops. The bigger the n the more nested loops and this is why it
slows down so quickly with increasing n. A variable number of loops in an
algorithm may be something you are not familiar with. In practice, the easiest
approach to implementing a variable number of loops is to use recursion. As
discussed in the next chapter, recursion is really all about variable numbers
of loops and exponential-time performance.
What sort of problem needs a variable number of loops? Consider the simple
problem of displaying all of the values in a list of length n:
For i=1 to n
display ith value in list
Next
Now consider the problem of displaying every pair of values, the first taken
from one list and the second from another:
For i=1 to n
For j= 1 to n
display ith value in list1 and jth in list2
Next
Next
Now consider the same problem for three lists - you need three nested loops. If
I ask you to display all of the combinations of items from n lists, with values
taken from each list, you can see that this is going to need n nested for loops.
How do you write this variable number of loops? The best answer is that you
don’t and use recursion instead, see the next chapter.
Exponential time usually occurs with problems of this sort where you have to
look at every combination of possible items and this needs a variable number
of nested loops.
Finding the Best Algorithm
We have reached a stage where the reasoning undergoes a subtle change that
is often neglected in textbooks. Up until now we have been talking about
algorithms belonging to P, but we would really like to answer a slightly more
general question. Is it possible to find an algorithm that achieves some result
in polynomial time or is such a thing impossible? This shifts the emphasis
from finding the order of any given algorithm to finding the order of the best
algorithm for the job, which is much more difficult.
You can also ask what the best algorithm is if you specify your computing
resources - a Turing machine, for example, or a finite state machine and so
on. There are algorithms that run in P for some classes of machine but not for
others. This is where things can get complicated and subtle. For the moment
we will ignore the type of computer used for the algorithm, apart from saying
that it is a standard machine with random access memory - a RAM machine.
It is interesting to note, however, that algorithms that run in sub-polynomial
time can change their complexity when run on different types of machine.

How Fast Can You Multiply?
So the really interesting questions are not about how fast any particular
algorithm is, but what is the fastest algorithm for any particular task. For
example, it isn't difficult to show that multiplying two numbers together
using the usual “shift and add” algorithm is O(n^2) where n is the number of
bits used to represent the numbers. This means that multiplication is
certainly in P, but is O(n^2) the best we can do? Is there an O(n) algorithm?
Certainly you can't produce an algorithm that is faster than O(n) because you
at least have to look at each bit, but can you actually reach O(n)? Before you
spend a lot of time on the problem, I'd better tell you that it is difficult and
the best practical algorithm that anyone has come up with is the odd-looking
O(n log n log log n), which is a lot better than O(n^2) but still not as good as
O(n). A recent paper has proposed an O(n log n) algorithm, but for it to be
worth using n has to be a huge value, well beyond anything reasonable, see
Galactic Algorithms in Chapter 17. Is there a faster practical algorithm? Who
knows? You would either need a proof that said that it was impossible to
perform multiplication faster than the stated algorithm or you would need to
demonstrate a faster algorithm.
You can see that in a very deep sense the order of the best algorithm tells you
a lot about the nature of the task being performed. Is it just an easy task that
happens to look difficult in this form? Or is it irreducibly difficult? This
relates back to Kolmogorov complexity and finding the smallest program that
generates a sequence. In this case we need the fastest program.
Prime Testing
The example of multiplication is easy to understand and it is easy to prove
that it is in P by describing an O(n^2) algorithm. So, while we might want to
argue about how fast it can be done, there is no question that it is in P.
Now consider an equally innocent looking problem - proving that a number is
prime (that is, it has no factors other than itself and 1). For smallish numbers this is a trivial job. For
example, 12 isn't a prime because it can be divided by 2 and 3 to name just
two candidates, but 13 is prime because it has no factors.
The simplest method of proving that any number, z say, is prime is to try to
divide it by 2, 3 and then all the odd numbers up to SQRT(z). If you analyze
this method you will discover that it is O(2^n) where n is the number of bits
used to represent the number and this is exponential and so not in P, i.e.
prime testing using this algorithm isn't in P.
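The trial-division test is only a few lines of Python (a sketch, not the book's code). The loop runs roughly SQRT(z) times, which grows exponentially with the number of bits used to represent z - hence the exponential behavior:

import math

def is_prime_by_trial_division(z):
    if z < 2:
        return False
    if z % 2 == 0:
        return z == 2
    for d in range(3, math.isqrt(z) + 1, 2):   # odd trial divisors up to SQRT(z)
        if z % d == 0:
            return False
    return True

print(is_prime_by_trial_division(13), is_prime_by_trial_division(12))   # True False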
But this doesn't prove that testing for primes is or isn't in P because there
might be a better way of doing the job. But given the exponential complexity
of factoring a number it seems very reasonable that a polynomial time
algorithm for testing primality isn’t possible.
However, recently the problem was solved and now we know that testing a
number for primality is in P – i.e. there is a polynomial time algorithm that
will tell you if a number is prime or not. This wasn’t an easy result and we
still don’t know if there is an even faster algorithm. The current fastest
algorithm is O((log n)^6), which is sub-polynomial in n, i.e. polynomial in the number of digits of n.
You should be able to see that the problem is that if you can find an algorithm
for a task that runs in polynomial time then you have proved that the problem
is in P, but if you can't it might be that you just haven't looked hard enough.
In this case we tackled proving primality by factoring, which was exponential
time, but it turned out that there was a very non-obvious algorithm that could
prove primality or not without factoring the number.
It is also worth thinking for a moment about what this means for the related
task of factoring a number. To prove that a number is non-prime you simply
have to find a factor and searching for a factor seems to require exponential
time. If the number is a prime then you have to search through all of the
possible factors to prove that it doesn't have one. At first sight proving a
number is a prime seems harder than proving that a number isn't a prime and
yet we have just claimed that proving a number prime is in P. This is very
strange as the property of being a prime is defined in terms of factors and yet
somehow you can prove it without finding any factors at all. All the proof of
primality gives you as far as factoring is concerned is a clear decision on how
many factors there are, i.e. zero for a prime and at least two for a non-prime.
It clearly tells you nothing about what the factors are and this is good from
the point of view of many cryptographic procedures that rely on the fact that
factoring is hard.
To summarize: finding factors appears to be exponential in time, but the question "does
this number have zero or more than zero factors” is answerable in polynomial
time.
The route that we took to find an algorithm for testing a number for primality
in polynomial time is also interesting. At first we invented probabilistic tests
for primality, i.e. tests that only prove that a number is prime with a given
probability. This is an unusual idea that deserves explaining further.
Currently the most useful of the probabilistic tests is Rabin’s Strong
Pseudoprimality test.
If the number we want to test is n, assumed odd, first we have to find d and s
such that n - 1 = 2^s * d where d is also odd (this is easy to do, try it). Next we
pick an x smaller than n and form the sequence:
x^d, x^(2d), x^(4d), x^(8d), ... , x^(2^s * d) = x^(n-1)
all calculated modulo n.
Once you have the sequence of values – called a “Miller-Rabin sequence” –
you can be sure that if n is a prime the sequence starts with a 1 or has a -1
before the end of the sequence. Once the sequence reaches a 1 or a -1, all of
the values that follow are 1, because squaring 1 or -1 modulo n gives 1, so you
can state the condition as:
For a prime the Miller-Rabin sequence is either all ones or it has a minus one
followed by all ones.
This sounds conclusive, but some non-prime numbers also have sequences
that satisfy this condition. Note that if the sequence doesn't satisfy the
condition you have a certain proof that the number isn't a prime. It is the case
where the condition is satisfied that is difficult.
At this point it looks like a hopeless task to say any more about the number if
the sequence satisfies the condition. It might be prime or it might not. There
is a proof that if n isn't prime then three quarters of the x values that you
could pick don't have sequences that satisfy the condition.
To be more exact, if the number isn't prime the test has a 75% probability of
not starting with a 1 or containing a -1. So after one test that satisfies the
condition you can conclude that there is a 25% chance that the number isn't a
prime. After another test that satisfies the condition, the chance that the
number isn't prime is 6.25% or, putting it the other way, the probability that
it is prime is 93.75%. If you keep testing you can make the probability as
close to one as you want.
For example, if you get sequences that satisfy the condition with 100 random
values then the probability that you have a non-prime number is (1/4)^100,
which is roughly 1/10^61. This is exceedingly unlikely and so you are safe in
concluding that the number is indeed prime.
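The strong pseudoprimality test translates almost directly into Python. The sketch below (mine, for illustration) uses the built-in pow(x, e, n) to compute the modular powers; running it with, say, 20 random values of x gives the kind of overwhelming probability just described:

import random

def strong_pseudoprime_test(n, x):
    # one round of the test for an odd n > 2; False means n is certainly composite
    d, s = n - 1, 0
    while d % 2 == 0:            # find d and s with n - 1 = 2^s * d and d odd
        d //= 2
        s += 1
    v = pow(x, d, n)             # first term of the Miller-Rabin sequence
    if v == 1 or v == n - 1:
        return True
    for _ in range(s - 1):
        v = pow(v, 2, n)         # square to get the next term of the sequence
        if v == n - 1:
            return True
    return False                 # the condition failed, so n has a factor

def probably_prime(n, rounds=20):
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    return all(strong_pseudoprime_test(n, random.randrange(2, n - 1))
               for _ in range(rounds))

print(probably_prime(13), probably_prime(15), probably_prime(2**61 - 1))

The choice of 20 rounds is arbitrary; each extra round that passes multiplies the chance of a composite slipping through by at most a quarter.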
Such is the strange reasoning behind randomized testing for primes, and
many other randomized algorithms. They never produce certainty, but what
is the practical difference between an error probability of 1/10^61 and a certain proof?
There is more chance of a machine malfunction giving you the wrong answer
computing a certain proof of primality than the Strong Pseudoprimality test
failing!
The existence of a probability-based algorithm that was in P was an
indication that there might just be a deterministic algorithm in P. This turned
out to be true. Interestingly the probability-based test is still in use because in
practice it is so much quicker than the deterministic test for moderate n. A
great deal of computer science is only useful as theory even when it appears
to be practical.
Summary
● Algorithms scale in different ways as the number of items they have
to deal with changes. For small values details of implementation
matter, but as the number gets bigger they reveal their true nature.

● The way algorithms scale is summed up in the "Big O" notation where
O(f(n)) means that for large n the algorithm scales like f(n).

● We are mostly concerned with average performance, but worst and
best case performance are also of interest.

● Many algorithms run in O(n^x), so-called polynomial time, and these
can be considered to be "reasonable" algorithms, even if for any
particular algorithm the time to compute it might be out of reach.

● Algorithms that run in O(a^n) or worse are called exponential time and
increase their demands so quickly that they quickly become
impractical.

● The difference between polynomial time and exponential time is so
large that it is qualitative.

● The way algorithms scale results from their programmatic structure.
Polynomial algorithms are equivalent to a fixed number of nested
loops. Exponential algorithms are equivalent to a variable number of
nested loops.

● Proving that an algorithm is as fast as it possibly could be is very
difficult and in general we have to regard the best known algorithm as
just an upper bound on the difficulty of the problem - a better
algorithm may be waiting to be discovered.

● An example of this discovery of faster algorithms is proving primality.
Linked with factoring it was first thought to be exponential and then a
polynomial time probabilistic test was extended to a more complex
deterministic test.
Chapter 16

Recursion

Recursion is interesting from the point of view of complexity theory and
computer science in general because it is the natural way of implementing
algorithms that take exponential time. This isn't because recursion is in any
sense bad - it's just a more advanced form of the flow of control that makes
itself useful when a problem needs it. One of the advantages of recursion is
that it can be used to implement "divide and conquer" algorithms that are
sub-polynomial in execution time.
Recursion is not an essential concept in the sense that anything you can
achieve using it you can achieve without it. If this were not the case then the
Church-Turing thesis would be false. On the other hand, recursion is so
powerful that it can be used as the basis for the very definition of what
computation is. You can tell that attitudes to recursion vary greatly from "why
do I need this" to "this is all I need", but the fact that it is the subject of a well-
known joke makes it harder to ignore.
Definition of recursion: Recursion - see Recursion.
Recursion is, loosely speaking, self-reference.
There are some who just take to this idea and there are others, the majority,
who always feel a little unsure that they have really understood how it all
works. Recursion is important because it provides a way of implementing a
type of algorithm that would otherwise be difficult.
In many textbooks recursion is introduced as an advanced method and a few
examples are given. Sometimes you get the impression that the topic has been
included just for the reason that it is in other books. Usually the examples are
problems that are already stated in recursive form and in this case a recursive
implementation seems natural. What is generally overlooked is what
recursion is good for. What makes a problem suitable for a recursive
implementation?
The answer is surprisingly simple and it reveals the link between recursion
and algorithms that have exponential runtimes.
Ways of Repeating Things
Before we move on to recursion we need to look at the most basic ways of
repeating something, if only to make sure we are using the same vocabulary.
When you first learn to program one of the key ideas is writing a set of
instructions that repeats. For example, to print “Hello” repeatedly I’d use a
loop:
Do
Print “Hello”
Loop
You can think of the Loop as an instruction to go back to the Do and repeat all
of the instructions between the two all over again. In this case you just go
round and round the loop forever printing Hello each time.
This is an "infinite” loop, or more accurately according to earlier chapters, a
finite but unbounded loop. However, most loops in sets of instructions are
finite loops which repeat a number of times.
There are two broad types of finite loop, enumeration and conditional. The
difference is that an enumeration loop repeats a given number of times, i.e.
"print hello five times" is an enumeration loop, as is the familiar for loop
found in most programming languages.
A conditional loop keeps on repeating until a final end result is achieved, i.e.
"stir the coffee until the sugar has dissolved" is a conditional loop, as are the
familiar while or until loops found in most programming languages.

To state the obvious, a loop is called a loop, because it loops around the text
of a program. There is a very deep sense in which every programmer thinks of
a loop as a circular shape and the only differences between loops is how you
get out of a loop. In principle, you only need a conditional loop because this
can be used to implement an enumeration loop. In this sense a conditional
loop is universal, much like the universal logic gate.
Self-Reference
So far so good, but what has all of this got to do with something far less
familiar – recursion? Recursion is just another way of repeating things. It’s
simple at first but it has a habit of becoming complicated just when you think
you are getting used to it.
Imagine that instead of simply using a command to print a message we write
a function (or a procedure), i.e. a self-contained module that does the job and
perhaps more, called Greetings:
Function Greetings()
lines of instructions
Print “Hello”
Return
Whenever you use its name the set of instructions is obeyed. So:
Call Greetings()
results in all of the instructions in the function being obeyed. In this case all
that happens is that Hello is printed.
This is really all we need to move on to recursion. Consider the following
small change:
Function Greetings()
lines of instructions
Print “Hello”
Call Greetings()
Return
What does this do? If you start it off with a Call Greetings() in some other
part of the program you start the execution of the list of instructions it
contains. Everything is normal until you get to the Call Greetings()
instruction, which starts the whole thing going again.
We have an infinite loop built using the trick of having a function call itself!
This is the essence of recursion, but in this simple form it just looks like an
alternative way of creating a loop. However, creating a loop using self-
reference quickly becomes so much more and it’s all to do with the way self-
reference automatically stores the state of things each time you call the
function.
Functions as Objects
To see the difference between recursion and simple looping it helps to
consider that a function is like an object which is created each time you
make use of it. This is only partially true in most programming languages in
that each time a function is called it exists in a new "call context" which only
approximates to a "completely new copy" of the function. However, while this
idea of function as object isn't universally correct, thinking about things in
this way can make recursion much easier to understand. For example,
consider the difference between:
Do
Call Greet1
Loop
and:
Function Greetings()
lines of instructions
Print “Hello”
Call Greetings
Return
In the case of the loop a single "object" or context is brought into existence
and then is destroyed when the loop ends. That is, at any moment in the life
of the program there is a single Greet1 in existence.
Now compare this to the second version of the repeat. The first time the
Greetings function is called a single Greetings object comes into existence. It
does its stuff and then, before it comes to an end, it does another Call
Greetings command. Now there are two Greetings objects in existence.
You can probably see where this is going. Each time the function self-
references, a new instance of the object comes into existence and, if we wait
long enough, the result is an infinite collection of objects all waiting for the
next one to finish its task.

Recursion therefore builds up a collection of objects as it spirals its way to
infinity.
Conditional Recursion
You might be beginning to see why recursion is so much more powerful than
iteration. Recursion can, by the power of self-reference, bring into existence
as many copies of the function as are needed to complete a task and this
includes any variables etc that are created by the function. Of course, the sort
of recursion we are looking at isn’t of much use because it is infinite
recursion – it just goes on.
Just like an infinite loop, infinite recursion can be tamed by adding a
condition that stops it happening. This gives us “finite” or “conditional”
recursion. You might think that there is nothing new here, but you would be
wrong and again this is another one of those “recursive” complications that is
designed to catch you out just when you think you understand. To show you
what I mean consider the following small function:
Function counter(ByValue i)
if i == 3 then Return
i = i + 1
Call counter(i)
Print i
Return
The ByValue is included to ensure that the parameter is passed by value. This
is the default for most programming languages, but we need to make sure that
this is the case for the recursion to work properly. It has to be passed by value
because this is the only way the new copy of the function gets its own copy of
the parameter which it can act on without modifying any earlier copies. You
can think of this as demanding that each function object that is created has to
be a completely new copy and not inherit anything from the previous
function.
Recursion becomes complicated when you start to use parameters passed by
reference or global variables. To keep things simple, always use local
variables and pass by value within recursive functions. This ensures that
each copy of the function that is created is a unique object that doesn't share
anything with the other instances of the function.
Call counter(0) gets things going. The function first tests to see if the
counter, i.e. i, is 3. If it is then the function ends. If it isn’t, i has 1 added to
it. Then counter is called again.
Clearly this is a finite recursion as eventually counter will be called with a
value of 3 and the recursion will come to an end. In this case we have four
recursions, and four instances of the counter function. The reason it is four
and not three instances of the function is that we need a final instance to test
for i==3 which exits without printing i.
Ask yourself what is printed? If your answer is 1,2,3 then you need to think a
little more carefully and look at the diagram below. Nothing gets printed until
i reaches 3 when the next copy of the counting function comes to an end and
the final copy continues on to print the value of i. Then it comes to an end
and the previous instance of the object gets to carry on and print its value of i
and so on back to the first copy. You should now be able to see that what is
printed is 3,2,1.
[Figure: the four instances of counter() and the order in which the values are printed]
It is helpful to think of recursion as having a forward phase when the
"instances" of the function are being created and a backward phase when the
functions are being destroyed. You could say that recursion spirals its way
out and then spirals back in the opposite direction, but this might be taking
things too far for some! Even so there is a sense in which the shape of the
flow of control for a recursion is a spiral that you work your way along and
then work your way back along when the recursion comes to an end.
Forward and Backward Recursion
In general a recursive object looks something like:
Function xxx(parameters)
list of instructions 1
if condition then Call xxx
list of instructions 2
Return
The first list of instructions is obeyed on the way up the spiral, i.e. as the
instances of the function are being built, and the second list is obeyed on the
way back down the spiral as the instances are being destroyed. You can also
see the sense in which the first list occurs in the order you would expect, but
the second is obeyed in the reverse order that the functions are called in.
This is one of the reasons that recursion is more than just a simple loop. A
loop just goes in one direction, but recursion goes up the chain of calls and
then back down again.
Sometimes a recursive function has only the first or the second list of
instructions and these are easier to analyze and implement. For example a
recursive function that doesn't have a second list doesn't need the copies
created on the way up the recursion to be kept because they aren't made any
use of on the way down! This is usually called tail recursion because the
recursive call to the function is the very last instruction in the function, i.e. in
the tail end of the function. Many languages can implement tail recursion
more efficiently than general recursion because they can throw away the state
of the function before the recursive call, so saving stack space.
So tail recursion is good because it can be efficiently implemented, but you
can see that limiting yourself to tail recursion is missing some of the power of
recursion because it only has a forward order of execution like a simple loop.

What Use is Recursion?
You now understand what recursion is in terms of flow of control, but you
might well be still wondering where it becomes necessary? The problem is
that in many simple examples recursion isn’t even helpful, let alone
necessary. A common example is working out a factorial or similar
mathematical formula given in recursive form:
F(n) = n*F(n-1) with F(1) = 1
This gives:
F(2) = 2*F(1) = 2*1
F(3) = 3*F(2) = 3*2*1
and so on.
To implement this as a recursion we simply need to copy the mathematical
definition:
Function F(ByValue n)
if n == 1 then Return 1
Return n*F(n-1)
You can see how this works. When you call the function with F(3) it
evaluates 3*F(2), which in turn evaluates 2*F(1). The if statement causes
F(1) to return 1, which unwinds the recursion to give 2*1, and finally 3*2*1.
Notice that this isn’t tail recursion.
You can see this more clearly if the function is written as:
Function F(ByValue n)
if n == 1 Then Return 1
temp = F(n-1)
temp = n*temp
Return temp
You can now clearly see that the calculation happens after the recursive call.
That is, the calculation is performed on the way back down the recursion.
It is possible to write this using tail recursion but it isn’t as easy:
Function F(ByValue n, ByValue product)
if (n == 1) Then Return product
Return F(n - 1, product * n)
The complication is that, as we now have to do the computation on the way
up the recursion, we have to use an extra parameter to pass it to the next
recursive call.
However, the simplest way of computing the factorial avoids recursion
altogether:
product = 1
for i = 2 to n
product = product*i
next i
Computing the factorial as a direct loop is simple, understandable and it
works. You don't need recursion to implement a factorial function. Many
simple examples of recursion are like this – slightly contrived!

A Case for Recursion - The Binary Tree
Let’s look at an example of recursion in action where its use really simplifies
things because it is natural and necessary. Suppose you have a binary tree,
i.e. data that starts at a root node and where each node has a left and a right
child node. Your problem is to write a program that starts at the root node
and prints the name of every node in the tree. You could write this program
without recursion, but compared to the recursive version it would be very
complicated.
For a recursive approach let’s use a Left(node) function to return the left
child of the current node and Right(node) to return the right node. We can
now write a recursive tree list as:
Function list_tree(node)
if node = nothing then Return
print node
list_tree(Left(node))
list_tree(Right(node))
Return
This looks almost too easy – where is the work actually done?
If you start it off with:
list_tree(root)
where root is the start of the tree, the recursive calls work their way down
the left branch from the root node and then the left branch of the right child
of the root node. It’s so easy it seems crazy to contemplate any other way of
doing the job! It is this sort of example that makes programmers fall in love
with recursion - provided they really understand it.
Just to make sure you understand what is going on, work out what happens if
you move the print node to the end of the routine. What happens if you
change Left for Right and vice versa?
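A runnable version of the same idea in Python (a sketch, using nested tuples of the form (name, left, right) as a stand-in for whatever node type you actually have):

# each node is (name, left_child, right_child); None marks a missing child
tree = ("A",
        ("B", ("D", None, None), ("E", None, None)),
        ("C", None, ("F", None, None)))

def list_tree(node):
    if node is None:
        return
    name, left, right = node
    print(name)              # move this below the two calls and see what changes
    list_tree(left)
    list_tree(right)

list_tree(tree)              # prints A B D E C F, one name per line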
The reason why this tree listing works so well as a recursion is that the data is
recursive in its basic nature. For example, try this definition of a binary tree:
A binary tree is a node with a binary tree as its left child and a binary
tree as its right child.
You can see that this is an explicitly recursive description of a binary tree and
this is the reason why recursion fits so well with this particular data
structure.
There have even been programming methodologies that make use of the idea
that the algorithm is always derived from the data structure and this
"recursive data needs recursive algorithms" is just a special case.
More fundamental is the fact that without recursion this example needs a
variable number of for loops.

Nested Loops
There is a deeper reason why recursion is often easier to use, but more
difficult to understand. Many problems require loops within loops, or nested
loops, to work. As long as you know the number of loops that have to be
nested then there is no problem and everything works. But some problems need a
variable number of nested loops and these are the ones that are more easily
solved using recursion.
You can see this in the case of listing the binary tree. If the tree is of a
specified depth then you can list it by writing a loop for each level of the tree
nested within the first.
for each node for the first level
for each child node for the second level
for each child child node for the third level
and so on. Of course, if you don’t know how many levels the tree has you
can’t finish the program, but you can do it recursively with no such problem.
It is as if you are trying to say if there are n levels in the tree you nest n loops.

For this reason, you can say that recursion is equivalent to a variable number
of nested loops and it’s useful whenever this sort of structure is needed.
An even simpler example is printing tuples, which was the example given in
the previous chapter. You can easily write a program that prints i,j for values
from 1 to 10
for i= 1 To 10
for j= 1 To 10
print(i,j)
next j
next i
You would have no problem in extending this to printing i,j and k. In fact, no
matter how many values were required, you could write the program using
the required number of nested loops. However, if the number of values
required is only specified after the program starts running, then you don't
know beforehand how many nested loops to use and you can't do it. Well,
you can, but it’s a lot more difficult and you would probably need to use a
stack or some similar data structure to store the index variables.
Again the problem is a lot easier to express as a recursion:
Function tuple(ByValue N)
if N == 0 then Return
for i = 1 To 10
print(i)
tuple(N-1)
next i
The parameter N is used to “count” the number of loops needed. When it
reaches zero we have enough and the function just returns. Otherwise it starts
a for loop and then calls tuple(N-1) which starts yet another for loop or
returns if the job is done.
In other words, if you find that you have a problem where you want to write
an unknown number of nested loops then you can solve the problem most
easily using recursion.
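As a concrete illustration, here is a Python sketch of the same idea; the prefix parameter is my own addition so that complete tuples are printed rather than one value per level:

def tuples(n, prefix=()):
    # n counts how many "nested loops" are still needed
    if n == 0:
        print(prefix)
        return
    for i in range(1, 11):
        tuples(n - 1, prefix + (i,))

tuples(2)   # prints (1, 1), (1, 2), ... (10, 10) - the same pairs as two nested loops
tuples(3)   # behaves like three nested loops, without writing them out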
The fact that recursion is often equivalent to nested loops is the reason that
some programs take exponential time, see the previous chapter for a full
explanation. If the recursion is implementing n nested loops then it is
of order O(a^n), where a is the average number of times each loop repeats.

The Paradox of Self-Reference
Recursion is an example of self-reference and this is at the heart of many
paradoxes. Self-reference is fine unless it leads to a contradiction.
We have already considered, in Chapter 9, the paradox created by:
This sentence is false.
which if you assume it is true asserts its own falsity, and if you assume it is
false asserts its own truth. As suggested in Chapter 14, this type of problem
goes away if you allow time into the picture. Then the truth state of the
sentence can flip-flop from true to false and back again. If you don't allow
time into the picture then you have to conclude that the sentence doesn't
have a truth value.
As we’ve already discovered, self-reference is also hidden in many of the
important theorems of computer science. The halting problem only occurs if
the Turing machine can be applied to itself and this in turn implies that the
Turing machine is unbounded. As soon as you put a bound on the size of the
Turing machine that the halting problem can be solved for, the machine
cannot be applied to itself. In this case the self-reference, and hence the
contradiction, needs infinite resources.
Another, and perhaps clearer, way of looking at the problem is that we are
constructing infinite recursions and then asking for the result of the
computation. For example, consider the function:
flip-flop(state)
flip-flop(NOT state)
Return
Now what is the value of flip-flop(true)? This is just like the "this sentence
is false" paradox in that the value of the computation flip-flops from true to
false and so on. In this case the time element is explicit and the fact that this
is an infinite recursion means you never get a final result. The question of the
final truth value is undecidable. It is only when the recursion is infinite that
we have a problem. There is a sense in which all such paradoxes are about
the result of infinite recursion.
Self-reference plays a central part in philosophy and some types of
mysticism. It can be raised to a universal principle or something deep in the
machine. Computer science mystics tend to refer to it as a "strange loop". Our
consciousness seems to be recursive – I’m observing myself observe myself –
and so on. Of course, this is just the tip of the iceberg - I am part of the universe
observing itself. You can see that there is plenty of scope for invoking the
mystic. If you would like to discover more about the fascinating
philosophical aspects of recursion and “strange loops” then there is no better
book on the subject than Gödel, Escher, Bach: An Eternal Golden Braid, by
Douglas R. Hofstadter.

Summary
● The simplest way of repeating anything is to use a loop and the shape
of the flow of control is indeed just a loop.

● There are conditional loops and enumeration loops, but the only type
of loop you need is the conditional loop.

● Recursion, or self-reference, is another way of repeating things. It isn't
a necessary construct, as it can be emulated using loops, but it is very
useful.

● Infinite recursion is the equivalent of an infinite loop.

● The most important idea in recursion is that a complete copy of a
function is brought into existence with each recursive call.

● You can think of the flow of control in recursion as a spiral moving up
each time the function is called and then spiraling down as each
function completes.

● The block of code before the recursive call is executed on the way up
the spiral and the block of code after the recursive call is executed on
the way down the spiral.

● If the block of code after the recursive call is eliminated we have tail
recursion, which is efficient to implement.

● Recursion can be useful when implementing an algorithm that has a
common recursive definition. It is more necessary, however, when the
data structure is itself recursive.

● Recursion is the easiest way to implement a variable number of nested
loops and this is also the reason why some recursive algorithms are
exponential.

● Many paradoxes arise from contradictory self-reference, which can be
thought of as infinite recursions. The result of an infinite recursion is
undecidable and the paradoxes of computer science can be seen as
infinite recursions.

● Recursion is an example of self-reference, which lends itself easily to
mysticism and is often called a strange loop. Human consciousness is
the result of us observing ourselves which, in turn, is an example of the
universe observing itself, a strange loop that goes beyond the scope of
this book.

Chapter 17

NP Versus P Algorithms

The difference between polynomial time and exponential time algorithms is
relatively easy to understand, but once you start classifying algorithms you
can't help but discover other interesting classifications. Things are more
subtle than just how long it takes to do a computation - you can also ask
questions about how hard it is to verify a supposed answer, and this leads
us on to the class of problems called NP, perhaps currently the most
interesting of all.

Functions and Decision Problems


There are two general types of algorithm - function evaluation and decision
problems. A function evaluation problem is just that – given a function f and
a value x, find the result y = f(x). This may sound limited but if you allow the
function to be general enough it encompasses just about everything. For
example, a program that gives you the nth digit of π is just evaluating the
function Pi(n), i.e. the function that gives you the nth digit of π.
Alternatively you can perform a computation to answer a question – a
decision problem. For example, is 9 the sixth digit of π? You can answer this
question by computing Pi(6) and comparing the result to 9. As 9 is the sixth
digit of π, the answer is “true”.
As well as being able to convert a decision problem into a function
evaluation, you can generally convert a function evaluation into a decision
problem by asking if each possible result is a solution. For example, is Pi(6)
zero, is Pi(6) one and so on.
This all seems simple enough and not particularly interesting. It seems that
we are just describing what we already have in terms of algorithms in slightly
different ways. If it is hard to find the hundredth digit of π, then it is going to
be hard to answer the question “is the hundredth digit of π a 6”. However,
there are problems for which this is not the case and these are interesting.

Non-Deterministic Polynomial Problems
There are problems that are difficult to solve, but easy to check if you are
provided with a confirming example. For example, if I give you a set of
numbers and the decision problem, “Is there a subset that sums to zero?”,
then to find out you have to examine each possible subset until you find one
that sums to zero, and this brute-force procedure is not polynomial. In fact, it takes exponential
time, as the number of subsets you have to examine is O(2^n).
Now suppose I give you a subset that I claim sums to zero, how long would it
take for you to verify that claim? All you have to do is sum the single set of
numbers and thus verification is in P. The subset that settles the matter is
usually called a witness. In this case the witness is for a “yes” answer to the
problem. You can also think of the witness and its verification as a proof that
the answer is “yes”.
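To see the asymmetry concretely, here is a rough Python sketch, not anything from the text: the brute-force search looks at all 2^n subsets, while checking a claimed witness is a single sum:

from itertools import combinations

def has_zero_subset(numbers):
    # brute force: examine every non-empty subset - exponential in len(numbers)
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == 0:
                return True
    return False

def verify_witness(subset):
    # polynomial (in fact linear) time check of a claimed witness
    return len(subset) > 0 and sum(subset) == 0

numbers = [3, -9, 14, 7, 2, -5]
print(has_zero_subset(numbers))       # True: e.g. the subset (3, 2, -5)
print(verify_witness((3, 2, -5)))     # True - one addition settles it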
So we have a decision problem with no known polynomial-time algorithm, but for which, given a
witness, verification is in P. This is an example of a problem in NP, Non-deterministic
Polynomial time. The reason such problems are called Non-deterministic Polynomial is
that another definition of an NP problem is that it can be solved in
polynomial time by a non-deterministic Turing machine - one that has
multiple alternatives at each step. You can think of it as either taking all of
the alternatives or being lucky enough to take the correct branch at each step.
For example, a non-deterministic Turing machine for the subset sum
problem would examine each subset in parallel and return the first one
that summed to zero. As this is equivalent to checking a witness, it is clear
that it can be done in polynomial time.
A deterministic Turing machine verifying a witness in polynomial time is the
same as a non-deterministic Turing machine solving the problem in
polynomial time.
The distinctive feature of NP problems is that it may be difficult to find a
solution, but it is easy to check a proposed one. As any problem in P can
provide its own witness, i.e. a solution found in polynomial time can be verified in
polynomial time, it is obvious that P is contained in NP, i.e. P ⊆ NP.
Also notice that for an NP problem for which we can check a solution in
polynomial time, we can find a solution in at most exponential time by
checking all possible solutions.
There is a subtle point here. If the witness is to be checked by a Turing
machine in polynomial time, it cannot grow in size faster than polynomial.
Putting this another way, as the time to process the witness depends on its
size, polynomial time implies polynomial growth in the size of the witness. If
we write the witness in binary using m bits, then m has to be a polynomial in the
problem size, m = poly(n). Thus there can only be 2^m = 2^poly(n) witnesses and
each can be checked in polynomial time.

So solving the problem by checking every witness would take at most
exponential time. That is, an NP problem is at most exponential to solve.
So we now have, writing Exp for the class of problems that can be solved in
exponential time:
P ⊆ NP ⊆ Exp
and we don't know whether either inclusion is proper, i.e. is P=NP or is
NP=Exp? (We do know that P≠Exp, so they cannot both be equalities.)
The big question, and it is one of the $1 million Millennium prize questions,
is whether P = NP:
Is it possible that there is a polynomial algorithm that solves problems
that have a polynomial time verification?
Most computer scientists hold the opinion that this isn’t possible, i.e. P≠NP,
but without a proof there is no certainty.

Co-NP
The complement of a decision problem is the same problem re-worded to
swap the “yes” and “no” aspects of the decision, so “there is a subset that
sums to zero” becomes “there is no subset that sums to zero”. If a decision
problem is in NP then its complement is by definition in a class called co-NP.
Just being in co-NP says nothing about whether or not there is a witness for
the problem, just that there is a witness for the complement of the problem.
Another unsolved problem is whether NP=co-NP, that is, is every problem in
co-NP also in NP and vice-versa. It is generally accepted that the two are not
equal, that is there are problems in co-NP that are not in NP, but there is no
proof of this as yet.
Consider the co-NP problem, “there is no subset that sums to zero”. How
could you provide a witness of this? There is no point in offering up a subset
that sums to a non-zero value, as that doesn’t prove that there
isn’t some other subset that does sum to zero. You might think at this point that
this settles NP ≠ co-NP, but it is not a proof.
There are examples of problems that were thought not to be in NP but later
were proved to be in NP. For example, consider the problem of proving that a
number x is composite, i.e. not a prime. To do this you need to test for all
possible factors by dividing and in Chapter 15 this was stated to be O(2n)
where n is the number of bits. However, you can provide a witness that x is
composite by simply giving one of its factors. This can be verified in
polynomial time and hence composite is in NP.

Now consider the complementary problem, given a number x prove that it is
not composite i.e. it is a prime. This is in co-NP as it is the complement of a
problem in NP.
Originally it was thought that there was no witness for primality, but then
one was found. This put prime testing in NP, which means that composite testing, as the
complement of a problem in NP, is in co-NP as well. As explained in Chapter
14, eventually prime testing was shown to be in P and hence so is composite
testing.
Is there a witness for the complement of the subset sum problem? It seems unlikely,
but who knows - perhaps there could be a bound on the sums
obtainable from a set that makes a zero sum impossible. If it could be proved
that no such witness existed, we would have the result NP ≠ co-NP.

Function Problems
One of the disadvantages of high profile classes like NP is that many function
problems that sound as if they might reduce to a decision problem in NP
don’t. For example, the problem of finding the shortest path that visits every node of a
network clearly isn't a suitable candidate for NP. Why not? Because you
cannot supply a witness that verifies that a path is the shortest in polynomial
time. If I provide you with a path that I claim is shortest, you still have to
compare it to all the other possible paths to prove that it is, which is clearly
not in polynomial time.
Compare this to the seemingly similar problem of proving that a given path is
not the shortest - i.e. you don't actually have to find the shortest path, just
prove that the one given is or is not the shortest. This is NP because all that
has to be done is to provide an example path, i.e. one path, that is shorter.
There is no requirement to compare it to every possible path, just to one that
is actually shorter. Thus, given a path A it clearly isn't the shortest if you are
given the extra piece of information that path B is shorter, i.e. B is a witness
that A isn't the shortest path.
Perhaps the best known “difficult” problem is the TSP – Traveling Salesman
Problem. Given a list of cities simply find the shortest route that visits each
city and returns to the starting point. Clearly this is not in NP because no
witness verifiable in polynomial time can be found. If I give you a candidate
path you cannot verify it is the shortest without searching the entire solution
space for the shortest. The witness doesn't help.
As before, we can turn it into an NP problem by changing the question to “is
there a route shorter than L”. This is in NP, but it isn’t the full TSP problem.
We will return to these not-quite NP problems later, in the context of NP-hard
problems.
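Here is a rough Python sketch of why the decision version just described is in NP: given a claimed route as a witness, verification is only a length calculation and a comparison. The little distance matrix is invented for the example:

def verify_tour(dist, tour, L):
    n = len(dist)
    # the witness must visit every city exactly once
    if sorted(tour) != list(range(n)):
        return False
    # add up the length of the round trip, returning to the start
    length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return length < L

# a made-up 4-city distance matrix
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(verify_tour(dist, [0, 1, 3, 2], 25))   # True: 2 + 4 + 3 + 9 = 18 < 25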

The Hamiltonian Problem
Another well known example is the Hamiltonian. A graph (another name for
a network) is said to be “Hamiltonian” if there is a path through it visiting
each node exactly once.
The graph in the diagram below is Hamiltonian because there exists a path
that visits each node exactly once - shown on the right.

The following should be fairly obvious to you now:


1. Is proving that a graph is Hamiltonian in P?
Not as far as anyone knows - the obvious method takes exponential time,
examining each path in turn until you find one that is Hamiltonian.
2. Is proving that a graph is not Hamiltonian in P?
Apparently not, because you have to check every path to prove that a
suitable one doesn't exist and so this must take longer on average than
(1).
3. Is proving that a graph is Hamiltonian in NP?
Yes. The witness is a path that visits each node exactly once. Once
you have the description of such a path, checking that each node
appears exactly once and that consecutive nodes are joined by edges
is trivial and obviously in P - see the sketch after this list.
4. Is proving that a graph is not Hamiltonian in NP?
Probably not because you can't give a piece of information that allows
someone to check that a suitable path doesn't exist. You have no
choice but to check them all and that's clearly the same as (2).
5. Is proving that a graph is not Hamiltonian in co-NP?
As it is the complement of the original problem, which is in NP, it is
in co-NP.
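To make point 3 concrete, here is a rough Python sketch of checking a Hamiltonian-path witness in polynomial time; the little graph is my own example, not the one in the book's diagram:

def verify_hamiltonian_path(edges, nodes, path):
    # every node must appear exactly once
    if sorted(path) != sorted(nodes):
        return False
    # consecutive nodes in the path must be joined by an edge
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((path[i], path[i + 1])) in edge_set
               for i in range(len(path) - 1))

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
print(verify_hamiltonian_path(edges, nodes, ["A", "B", "C", "D"]))   # True
print(verify_hamiltonian_path(edges, nodes, ["B", "A", "C", "D"]))   # also True
print(verify_hamiltonian_path(edges, nodes, ["A", "D", "B", "C"]))   # False - no A-D edge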

Boolean Satisfiability
One of the NP problems that you need to know about is Boolean satisfiability, or
SAT. First, however, meet CircuitSAT, which is closely related and slightly
easier to work with. If you put together a Boolean circuit – a collection of
gates connected together, see Chapter 14 – then you can ask whether it is
possible to find a set of inputs that makes the output true. This sounds like an
easy problem, but with n inputs there are 2^n possible combinations of
true/false to try, so the obvious approach is exponential. However, it is
fairly obvious that the problem is in NP: if you are provided with a witness that
satisfies the logic, you can verify it simply by applying it to the inputs and
evaluating the circuit.
For example, a circuit built to implement:
(a AND b AND c) OR (d AND e)
can be verified by a witness a=1, b=1, c=1, d=x and e=x where x means “don’t
care”.
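A rough Python sketch of the two sides of CircuitSAT for this little circuit, written by me for illustration: brute force tries all 2^n input combinations, while checking a witness is a single evaluation. The circuit function just encodes the formula above:

from itertools import product

def circuit(a, b, c, d, e):
    # the example circuit: (a AND b AND c) OR (d AND e)
    return (a and b and c) or (d and e)

def brute_force_sat(f, n):
    # exponential: try every one of the 2^n input combinations
    for inputs in product([False, True], repeat=n):
        if f(*inputs):
            return inputs
    return None

def verify(f, witness):
    # polynomial: one evaluation of the circuit
    return f(*witness)

print(brute_force_sat(circuit, 5))                        # first satisfying input found
print(verify(circuit, (True, True, True, False, False)))  # True - the witness a=b=c=1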
What is important about CircuitSAT is that it is a very simple problem and, as
we will discover in the next section, it is NP-complete. The better known
k-SAT problem is similar to CircuitSAT but cast as formulas rather than
gates. In k-SAT each of the gates is restricted to having just k inputs. As a
Boolean formula, a k-SAT also has to be in a particular form - Conjunctive
Normal Form or CNF - where the formula is the AND of a set of clauses which
contain k variables ORed together. For example, a 3-SAT in CNF form is:
(a OR NOT b OR c) AND (d OR e OR f) AND (NOT a OR c OR NOT e)
and so on.
Notice that you can still have as many variables in total as you like, but each
clause can only contain k of them, for example this 2-SAT formula:
(a OR b) AND (c OR NOT d)
The laws of Boolean algebra permit you to rewrite any general Boolean
expression in CNF form, which means that CNF is as expressive as any
general Boolean formula.
What is interesting is that 2-SAT can be solved in polynomial time, in fact in
linear time, while for 3-SAT and above no polynomial algorithm is known. This is
surprising, but most of the work in finding a satisfying input has been done in
expressing the logical function in 2-SAT form. The reason why 2-SAT being
in P doesn't imply that k-SAT or CircuitSAT is in P is that converting a general
formula into 2-SAT form can cause an exponential blow-up, when it is possible at all.
There are many interesting questions you can ask about Boolean circuits and
this is an area of active research, but CircuitSAT and 3-SAT are all we need
before moving on to consider NP-complete.

NP-Complete and Reduction
The discovery of NP-complete problems was one of the big breakthroughs in
computational theory. The Cook–Levin theorem (1971) states that the Boolean
satisfiability problem is NP-complete. What does this mean?
A problem is NP-complete if it is in NP and if every problem in NP can be
transformed into it in polynomial time.
A polynomial transformation, or reduction, is simply an algorithm that runs in
polynomial time and converts instances of one problem into instances of another,
so that a solution to the second problem gives you a solution to the first. For
example, suppose I have a problem P1 that I can solve in polynomial time. Then
if I have a problem P2 and I can apply a polynomial reduction from P2 to P1, it
must be that P2 is solvable in at most polynomial time. The reason is simply that
putting the reduction of P2 to P1 together with the polynomial solution to P1
gives you a polynomial-time solution to P2.
This does not mean that there might not be a better algorithm for P2, but you
have provided a polynomial algorithm, so P2 has to be polynomial or better.
In general, if P1 has complexity C1, which is polynomial or worse, and you
can polynomially reduce P2 to P1, then P2 is at worst C1.
In the case of NP-complete problems, every problem in NP can be polynomial-reduced
to them. What this implies is that if you can find a polynomial
algorithm for any NP-complete problem then there is a polynomial algorithm
for all NP problems and hence NP=P. This is the reason NP-complete
problems are the focus of study for people interested in proving NP=P. Notice
that this approach cannot be so easily used to prove NP≠P.
Notice that not all problems in NP are NP-complete and thus there are some
that you can find a polynomial solution for without affecting the status of the
rest. In particular, problems in P are also in NP and, assuming P≠NP, no problem in
P can be NP-complete.

Proof That SAT Is NP-Complete


It turned out to be difficult to find the first NP-complete problem, but it was
demonstrated in 1971 that SAT was NP-complete. When you first hear about
this, a natural thought is how can so many diverse problems be shown to be
related to SAT? Surely someone could invent a problem tomorrow that was in
NP and was so different to the rest that it wasn’t related to SAT? To
understand how an NP-complete problem can capture the essence of NP, we
need to look at the proof that SAT is NP-complete.
The nature of the proof is interesting because it relies on the existence of a
non-deterministic Turing machine that solves the NP problem we are
considering. The proof is very detailed and quite difficult to follow in depth,
but you can get a general idea how it works as long as you are prepared to
take some things on trust.

Suppose we have a Turing machine that verifies a particular NP problem –
this Turing machine has to exist by the definition of NP and by the Church-
Turing thesis. The input to the Turing machine is a witness, which we have
already seen cannot grow faster than a polynomial function of the problem
size. The Turing machine takes the witness as its input and, after a
polynomial time, outputs a true or false verdict on the witness.
It can be shown that the Turing machine can be converted into an equivalent
Boolean circuit. The circuit would possibly be big, complicated and messy,
but it can be done. The basic idea is that we can build the Turing machine out of
logic gates, which is fairly obviously possible, but proving it means giving the
details of the construction, and that is what makes the proof long and complicated.
This means that we now have a Boolean circuit whose inputs are bits that
represent the witness and whose output is the verdict. Notice that a good
witness “satisfies” the Boolean circuit in the sense of the previous section.
Now suppose that you can solve the CircuitSAT problem in polynomial time.
A satisfying input for the particular Boolean circuit that verifies the witness could
then be found in polynomial time. This means we can not just verify a witness in
polynomial time, we can find one in polynomial time, and finding a witness
in polynomial time is the same as solving the original problem. That is, we
have found a solution to the original problem in polynomial time and the NP
problem is in P.
Thus CircuitSAT, and hence k-SAT for k>2, is NP-complete: any NP problem
can be reduced to it in polynomial time.
Once we have proved SAT is NP-complete, we can prove other problems are
NP-complete by giving a polynomial reduction from SAT to the new problem,
and checking that the new problem is in NP. Using this
technique, lots of NP-complete problems have been discovered – 3-SAT,
Subset Sum, Hamiltonian and so on. Currently there are more than 3000
known NP-complete problems and they are often so easy to prove that
journals have stopped publishing new results.
There is a sense in which NP-complete problems are the “hardest” in NP
because, whatever complexity class they are in, the other NP problems have
to be in the same class or easier. However, recall that NP problems can be at
most exponential and hence there are harder problems that are not in NP.
So far most of the NP problems we know of are either NP-complete or they
are in P. The best-known unknown is factoring, which is in NP, but not
proved to be NP-complete nor known to be in P. To show that factoring was NP-complete
you would have to give a polynomial reduction from SAT, or some other
NP-complete problem, to factoring - or, equivalently, show how the Turing machine
that verifies a general NP problem could be converted into a factoring problem.

There is a theorem, Ladner's theorem, that if P≠NP there have to be problems in NP
that are neither NP-complete nor in P. Another theorem states that if NP≠co-NP then P≠NP.
If you discover that a problem is in NP but not NP-complete then you might
try to find a polynomial algorithm for it as finding one has no effect on the
P=NP question. If you find that a problem is NP-complete then you probably
shouldn’t waste time on trying to find a polynomial algorithm as this would
imply P=NP and most people think that it is almost certain that P≠NP.
Finally if someone does prove P≠NP then we can conclude that there are no
easy solutions for the NP-complete problems.
NP-Hard
You might think that we have exhausted the topic of NP problems, but there
is another class of problems related to NP-complete problems. NP-hard
problems are like NP-complete problems in that all NP problems can be
polynomial-reduced to them, but they are not themselves necessarily in NP.
That is, an NP-complete problem is an NP-hard problem that is in NP, so
NP-complete problems are a subset of NP-hard problems. As long as P≠NP,
the picture is that P sits inside NP, the NP-complete problems are the part of
NP that is also NP-hard, and the NP-hard problems extend beyond NP.
Just as problems in NP cannot be harder than the NP-complete problems,
problems in NP cannot be harder than NP-hard problems. A subtle point is
that NP-hard problems don’t have to be decision problems.
Clearly the subset sum problem is NP-hard as it is NP-complete and all NP-
complete problems are trivial examples of NP-hard problems. However, there
are problems that are NP-hard and not in NP.
The best known and easiest to understand is the full Traveling Salesman
Problem (TSP). It has already been shown that this cannot be in NP as there is
no witness for the problem of finding the shortest path. If you are given a path
that is supposed to be the shortest, you still have to compute all of the paths
to check that it is the shortest. However, the decision version of the problem –
is there a route shorter than a specified length L? – is in NP, as a witness is
easily verified in polynomial time. Not only this, but it can be proved to be
NP-complete.

Now think about how a solution to the full TSP helps with the decision TSP
problem. If we know that the solution to the full TSP is S, the length of
the shortest route, then the answer to the decision problem is easy: if L>S the
answer is “yes” and if L≤S it is “no”. So the NP-complete decision problem
can be polynomial-reduced to the full TSP and, since every NP problem can be
polynomial-reduced to an NP-complete problem, every NP problem can be
polynomial-reduced to the full TSP. The conclusion is that the full TSP
problem is NP-hard, but not NP-complete, as it isn't in NP.
The relationship between the NP-complete decision problem and the NP-hard
full TSP problem is typical. Generally, to convert an NP-complete problem
into a problem that is NP-hard but no longer in NP, you simply have to find a
way to force a complete search of the solution space.
For example, consider the subset sum problem. If we change it so that instead
of searching for a set that sums to zero we are searching for the number of
subsets that sum to zero then we have an NP-hard problem because now we
have to search the entire solution space. This is an example of a very common
class of problems obtained from decision problems by changing an "is there"
question into a "how many" question. You can see at once that this small
change has removed the problem from NP. If I offer you a witness – n subsets that
sum to zero – you can verify each of them, but you cannot confirm
that there is or is not another subset that sums to zero without trying all of the
possibilities. The witness doesn’t help very much.
To show that it is NP-hard we have to show that its
solution can be used to solve a related NP-complete problem in polynomial
time. This is fairly easy. Solve the modified problem and, if the count is greater than zero,
you know that there is at least one subset that sums to zero – we have a solution
to the original subset sum problem and the transformation is clearly
polynomial. This modified problem, finding the number of subsets, is therefore
NP-hard, but it isn’t in NP and so it isn’t NP-complete.
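A rough brute-force Python sketch of the counting variant and the reduction back to the decision problem, written purely for illustration; the point is only how one answer yields the other:

from itertools import combinations

def count_zero_subsets(numbers):
    # the NP-hard counting version: how many non-empty subsets sum to zero?
    count = 0
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == 0:
                count += 1
    return count

def has_zero_subset(numbers):
    # the NP-complete decision version, answered via the count
    return count_zero_subsets(numbers) > 0

numbers = [3, -9, 14, 7, 2, -5]
print(count_zero_subsets(numbers))   # 3: (3, 2, -5), (-9, 7, 2) and (-9, 14, -5)
print(has_zero_subset(numbers))      # True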
What if P = NP?
The conclusion seems to be that some problems that don't appear to belong to P are
really, really difficult, while others, the NP set, contain an easy way out,
verification, and aren't quite as difficult.
One of the most important revolutions in cryptography depends on just this
distinction. In cryptography a function that is made easier to work out when
you have some extra information is called a "trap door" function and this is
closely related to the idea of a witness in NP. Using a trap door function you
can implement "public key" coding. For example, if I choose two very large
prime numbers p1 and p2 I can multiply them together to give m. Then I can
quite happily make m public in the knowledge that the problem of finding p1
and p2 from m needs, as far as anyone knows, an algorithm that isn't in P; that is,
it takes a long time to factor m.
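A toy Python illustration of the asymmetry, using absurdly small primes of my own choosing; real public key systems such as RSA use the factors in a far more elaborate way than this sketch shows:

p1, p2 = 10007, 10009          # two primes, known only to me
m = p1 * p2                    # the public value: trivially easy to compute
print(m)                       # 100160063

def factor(m):
    # recovering p1 and p2 from m by trial division - the hard direction;
    # for numbers hundreds of digits long this is utterly impractical
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d, m // d
        d += 1
    return m, 1

print(factor(m))               # (10007, 10009) - but only because m is tiny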

If I also use a coding method that makes use of m, the public key, but needs a
knowledge of the factors of m to decode, I have a cryptographic system that
anyone can use to send me messages, but only I can read. This all only works
if factoring is in NP but not in P, and this hasn't been proved.
There is a whole family of encryption methods based on NP algorithms, not
just public key systems. For example, the controversial DES standard is based
on an NP algorithm and, again, if someone can find a polynomial time
algorithm for the same task they have access to a great deal of data that the
owners assume to be secure!
So it is generally assumed that a proof that P=NP would bring many systems
based on the two not being equal crashing down. Of course, this isn't necessarily true.
Suppose I give you a perfect proof that P=NP; what could you conclude?
You can now assume that every problem in NP has a polynomial algorithm.
However, you are not promised a low-order polynomial algorithm. It could be
that all of the problems in NP-complete have polynomial algorithms which
are still not practical for computation in a reasonable time. For example, even
though there is a polynomial algorithm for testing a prime, practical methods
still tend to use random tests to give a verdict to a high degree of probability.
This idea was formalized by Richard Lipton and Ken Regan as "galactic
algorithms". The idea is that an algorithm might be asymptotically faster, but the break-even
point where it beats the alternatives might be so huge that you would
only gain an advantage for data sets the size of the galaxy. For example, the
fastest known way to multiply two numbers is to take their Fourier transform in 1729
dimensions and this algorithm is O(n log n), which is much better than the
alternatives. However, it only overcomes the cost of doing the transform
when the numbers have more than 2^d bits, where d = 1729^12, roughly 7×10^38.
That many bits is vastly more than the number of atoms in the universe, and hence
finding two numbers for it to multiply in O(n log n) time really is a galactic
proposition.
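You can check the size of that threshold for yourself; this little computation is mine, not anything from the Lipton–Regan discussion:

d = 1729 ** 12
print(d)             # roughly 7.1 x 10**38
print(len(str(d)))   # 39 - so d itself has 39 digits
# the algorithm only wins for numbers with more than 2**d bits,
# a count that dwarfs the roughly 10**80 atoms in the observable universe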
The point here is that the algorithm does achieve O(n log n), and so
multiplication of two numbers really is an O(n log n) operation, but not in
practice. The same sort of situation could arise with respect to NP algorithms.
If a galactic polynomial-time algorithm for an NP-complete problem were discovered then we would have
to conclude that P=NP, but not in any practical sense. Of course, such a
discovery would be important because it would change the way we have to
view the world and it would start a drive towards finding algorithms that
were less galactic.
A proof that P≠NP, or especially that P=NP, would certainly have big theoretical
consequences, but for the practical implications we would have to wait and see the
details.

Summary
● We generally recognize two types of algorithm - function evaluation
and decision problems. However, function evaluation can be
converted into a decision problem.
● NP decision problems are easy to check if you have a proposed
solution, a witness, but otherwise may be hard to solve. Any problem
for which a witness can be checked in polynomial time is in NP.
● A problem in NP needs, at most, exponential time and thus
P ⊆ NP ⊆ Exp.
● Co-NP problems are derived from NP problems by changing the
decision around. For example, "is there a subset that sums to zero" is
in NP, whereas "there is no subset that sums to zero" is in co-NP. It
isn't known if all co-NP problems are also in NP.
● The Boolean satisfiability problem, CircuitSAT, is an archetypal NP
problem - find the inputs to a Boolean circuit that produces a true
output.
● There are many variants on the more general SAT problem. The k-
SAT problem restricts the form of the Boolean function to clauses of
k inputs ORed together, which are then ANDed with the other
clauses. What is surprising is that 2-SAT is in P, but k-SAT for k>2 is
NP-complete.
● One of the most important theorems in computer science is that
k-SAT for k>2 or CircuitSAT is NP-complete. That is, any problem in
NP can be reduced to SAT. What this means is that if a polynomial
algorithm exists for any NP-complete problem then there is a
polynomial algorithm for all NP problems and P=NP.
● NP-hard problems are problems to which every NP problem can be
reduced, but they are not necessarily in NP themselves. If you were to find a polynomial solution to any NP-hard
problem then P=NP. Problems in NP cannot be harder than NP-hard
problems.
● The consequences of proving P=NP are often assumed to be serious –
cryptographic codes would suddenly be insecure, for example.
However, the concept of a "galactic" algorithm puts this idea in
perspective. A galactic algorithm is one that only achieves its
asymptotic performance for galactically large problem sizes. If
P=NP but only for a galactic polynomial algorithm, then nothing has
changed from a practical point of view.

Index

2-SAT..........................................................................................................198
3-SAT..........................................................................................................198

Alan Turing...................................................................................................19
aleph-null bottles of beer.............................................................................73
aleph-one......................................................................................................79
aleph-zero.........................................................................................70, 81, 89
algebraic numbers........................................................................................83
Algol 60.........................................................................................................56
algorithmic, complexity...............................................................................87
algorithmically random................................................................................91
Alonzo Church............................................................................................113
AND.............................................................................................................152
arithmetic expression...................................................................................60
arrow notation..............................................................................................56
Average Information...................................................................................135
axiom of choice.......................................................................................13, 97
axiomatic systems.......................................................................................108
axiomatization............................................................................................106

Backus...........................................................................................................56
Backus Normal Form....................................................................................56
Backus-Naur Form........................................................................................56
Bailey–Borwein–Plouffe...............................................................................84
Banach-Tarski paradox...............................................................................101
bandwidth...................................................................................................132
Berry paradox...............................................................................................90
Bertrand Russell.................................................................................100, 107
beta reduction.............................................................................................116
Big O............................................................................................................168
binary arithmetic........................................................................................156
binary fraction..............................................................................................75
binary tree...........................................................................................129, 187
bit................................................................................................................130
Bletchley Park...............................................................................................19
BNF.........................................................................................................56, 59
Boolean logic.......................................................................................152, 158
bosons.........................................................................................................101

bound variable............................................................................................118
bounded........................................................................................................49
burst errors..................................................................................................148
Busy Beaver..................................................................................................33

C++..............................................................................................................59
Cantor............................................................................................................75
Cartesian product...................................................................................73, 79
channel capacity theorem..........................................................................132
chaos.............................................................................................................94
Charles Babbage..........................................................................................151
Christopher Strachey,...................................................................................31
Church-Turing Thesis................................................24, 29, 48, 30, 113, 200
CircuitSAT..........................................................................................198, 200
Claude Shannon.................................................................................126, 152
closed finite intervals...................................................................................99
closed set.....................................................................................................100
CNF.............................................................................................................198
co-NP...........................................................................................................195
Cobol.............................................................................................................56
Coding Theory............................................................................................132
combinatorial logic.............................................................................154, 158
complex numbers.........................................................................................69
compression..........................................................................................90, 131
computability........................................................................................13, 113
computable...................................................................................................12
computable number.....................................................................................84
computation..................................................................................................12
computational complexity...................................................................13, 167
computational grammar...............................................................................42
computer science..........................................................................................11
conditional..................................................................................................180
Conditional recursion.................................................................................183
Conjunctive Normal Form..........................................................................198
context-free..............................................................................................46,50
context-sensitive...........................................................................................47
continuum.....................................................................................................79
continuum hypothesis..................................................................................80
Cook–Levin theorem...................................................................................199
countable................................................................................................66, 74
CRC.............................................................................................................148
cryptography...........................................................................14, 93, 175, 202

CSS................................................................................................................25
Cyclic Redundancy Checksum..................................................................148

data compression........................................................................................140
De Morgan’s laws........................................................................159, 160, 161
decidable.................................................................................................24, 49
decision problems......................................................................................193
depth first......................................................................................................62
determinism..................................................................................................91
deterministic.................................................................................................94
diagonal argument........................................................................................75
dictionary....................................................................................................140
Dijkstra..........................................................................................................11
divide and conquer.....................................................................................179
Donald Knuth...............................................................................................56
e...................................................................................................13, 81, 83, 89
ECC memory...............................................................................................143
Efficient Data Compression........................................................................140
entropy........................................................................................................136
enumeration....................................................................................75, 76, 180
EOR.............................................................................................................153
Epimenides.................................................................................................108
equations.......................................................................................................65
Error correcting codes................................................................................143
error correction...........................................................................................147
Exclusive OR...............................................................................................153
exponential.................................................................................................195
exponential time.................................................................170, 173, 189, 193

factorial.......................................................................................................186
factorial time...............................................................................................171
false.............................................................................................................154
Fano.............................................................................................................137
Fermat’s theorem........................................................................................105
fermions......................................................................................................101
finite but unbounded.............................................................................23, 33
finite collection.............................................................................................98
finite state algorithm....................................................................................43
finite state machine......................................................20, 33, 39, 49, 50, 113
flip-flop.......................................................................................................158
Fortran...........................................................................................................56
free variables...............................................................................................118

full adder.....................................................................................................157
function.......................................................................................................182
function evaluation....................................................................................193
function problems......................................................................................196

galactic algorithms..............................................................................174, 203


Galois theory...............................................................................................148
Georg Cantor.................................................................................................70
George Boole...............................................................................................151
Gödel.....................................................................................................75, 108
Gödel numbering..........................................................................................80
Gödel’s incompleteness theorem...............................................................105
Goldbach’s conjecture..................................................................................33
grammar....................................................................................42, 48, 55, 115
green dream..................................................................................................60

half adder....................................................................................................157
halt................................................................................................................22
halting problem........................................................................12, 30, 33, 190
Hamiltonian................................................................................................197
Hamming distance......................................................................................144
history...........................................................................................................40
Hotel Hilbert.................................................................................................72
Huffman coding..........................................................................................138
hypercube...................................................................................................146

incompressible..............................................................................................90
induction.......................................................................................................98
infinite recursion........................................................................................190
infinity..............................................................12, 23, 65, 66, 68, 70, 98, 108
information...........................................................................................36, 125
Information Theory............................................................................125, 126
integer...................................................................................................65, 119
irrational.........................................................................68, 78, 81, 82, 88, 89
iteration.........................................................................................................76

k-SAT..................................................................................................198, 200
Kleene star....................................................................................................44
Kolmogorov complexity.......................................................................87, 174
Kurt Gödel...................................................................................................106

Lambda calculus...................................................................................24, 113
Lambda Expression....................................................................................114
Law of Large Numbers..................................................................................92
law of the excluded middle........................................................................152
LIFO..............................................................................................................45
linear bounded machine..............................................................................47
linear time...................................................................................................171
logarithm.....................................................................................127, 171, 172
logic gate.............................................................................................151, 155
lookup table............................................................................................29, 41
loops....................................................................................................172, 180

matching parentheses...................................................................................50
Millennium prize........................................................................................195
Miller-Rabin sequence................................................................................176
mini-language...............................................................................................60
multiplication.............................................................................................174

NAND..........................................................................................................161
natural numbers...........................................................................................65
Naur...............................................................................................................56
negative integer.............................................................................................65
nested loops................................................................................................188
neurons.......................................................................................................151
Newtonian mechanics..................................................................................92
Noam Chomsky............................................................................................45
non-computable..............................................................................12, 89, 100
non-computable numbers............................................................................36
Non-deterministic Polynomial time..........................................................194
non-deterministic Turing machine............................................................194
non-terminal.................................................................................................56
NOR.............................................................................................................159
NOT.............................................................................................................152
NP..........................................................................................................14, 194
NP = P.........................................................................14, 195, 199
NP-complete...........................................................................199, 201
NP-hard.....................................................................................201
NP≠P..........................................................................14, 195, 199
Nyquist rate.................................................................................................133

object...........................................................................................................182
odd parity....................................................................................................144
open set.........................................................................................................99
OR................................................................................................................152

P.....................................................................................................................14
palindrome..............................................................................................43, 46
paper tape.....................................................................................................20
parameter....................................................................................................117
parity...........................................................................................................144
parse..................................................................................................42, 44, 62
parsing.....................................................................................................55, 59
Pascal............................................................................................................58
phase change..............................................................................................170
phrase structured grammar..........................................................................48
pigeon hole principle.............................................................................34, 72
polynomial time.........................................................................170, 174, 193
polynomials..................................................................................................83
Post’s production system..............................................................24
power set.......................................................................................................79
primality.............................................................................................175, 196
prime...........................................................................................................175
Principia Mathematica...............................................................................107
priority..........................................................................................................60
probabilistic test.........................................................................................176
programmer’s infinity...................................................................................71
programming methodologies.....................................................................188
proof............................................................................................................108
proof by contradiction..................................................................................89
pseudo random.............................................................................................93
public key...................................................................................................202
pushdown machine................................................................................45, 47
Pythagoras.....................................................................................................67

QR Code......................................................................................................143
quantum computer.......................................................................................25
quantum mechanics.....................................................................................94
quasilinear..................................................................................................171
Quicksort.....................................................................................................169

Rabin’s Strong Pseudoprimality.................................................................176


railroad diagrams..........................................................................................58

random................................................................................13, 36, 87, 91, 100
rationals..................................................................................................66, 73
real numbers...........................................................................................36, 74
recursion.....................................................................................173, 179, 181
recursion is a spiral....................................................................................184
recursive functions.......................................................................................24
reduction...............................................................................................33, 199
redundancy.................................................................................................136
regular expressions...........................................................................25, 44, 58
regular sequence...........................................................................................44
roots of polynomials.....................................................................................68

SAT.....................................................................................................198, 199
satisfiability................................................................................................198
self-reference......................................................................108, 158, 179, 190
semantics................................................................................................60, 61
sequential logic...........................................................................................158
set theory.......................................................................................................98
Shannon......................................................................................................137
Shannon-Fano.............................................................................................138
signal to noise.............................................................................................132
space complexity........................................................................................168
sphere of radius m......................................................................................146
splitting the bit...........................................................................................135
square root of two.........................................................................................67
SR (Set/Reset) latch....................................................................................159
stack......................................................................................................45, 189
state.............................................................................................................158
Stephen Kleene.............................................................................................44
strange loop.................................................................................................190
Strong Pseudoprimality test.......................................................................177
sub-polynomial...................................................................................175, 179
subset sum problem...........................................................................201, 202
successor.....................................................................................................119
super-Turing machine..................................................................................24
symmetry......................................................................................................92
syntax diagram..............................................................................................58
syntax tree.......................................................................................61, 62, 115

tail recursion...............................................................................................186
terminal.........................................................................................................56
time complexity..........................................................................................168
transcendental numbers...............................................................................83
transfinite..........................................................................................13, 70, 80
Traveling Salesman Problem.............................................................196, 201
true..............................................................................................................154
truth tables..................................................................................................152
TSP......................................................................................................196, 201
tuples...........................................................................................................189
Turing.............................................................................................50, 84, 151
Turing algorithm...........................................................................................43
Turing equivalent.........................................................................................24
Turing machine.....................................20, 24, 29, 39, 47, 49, 81, 88, 111
Turing thinking.............................................................................................51
Turing-complete...........................................................................................25
twin prime conjecture................................................................................110

unbounded..................................................................................43, 46, 71, 72


undecidable............................................................................................32, 35
unintended Turing-completeness................................................................25
union.............................................................................................................79
Universal Gate............................................................................................160
universal operator.......................................................................................161
universal Turing machine............................................................................29
Universe........................................................................................................26
upper bound...............................................................................................169
USB...............................................................................................41

virtual machine.............................................................................................29

well-ordering theorem..........................................................................98, 101


Whitehead...................................................................................................107
witness................................................................................................194, 200

XOR.............................................................................................................161

Zermelo.........................................................................................................97
Zermelo-Fraenkel set theory........................................................................98
zig-zag order..................................................................................................77

More Titles Of Interest
Programmer's Python: Everything is an Object:
Something Completely Different by Mike James
ISBN: 978-1871962581
This book sets out to explain the deeper logic in
the approach that Python 3 takes to classes and
objects. The subject is, roughly speaking,
everything to do with the way Python implements
objects. That is, in order of sophistication,
metaclass; class; object; attribute; and all of the
other facilities such as functions, methods and the
many “magic methods” that Python uses to make
it all work. This is a fairly advanced book in the
sense that you are expected to know basic Python.
However, it tries to explain the ideas using the
simplest examples possible. As long as you can
write a Python program, and you have an idea
what object-oriented programming is about, it should all be understandable
and, as important, usable.

Programmer's Guide To Kotlin by Mike James
ISBN: 978-1871962536
Kotlin is attracting attention as “a better Java”,
especially since Google backed it as a language
for Android development. In this book Mike
James introduces Kotlin to programmers. You
don't have to be an expert programmer in Java or
any other language, but you need to know the
basics of programming and using objects. While
Kotlin is similar to Java, and you can pick up
much of the language as you go along, a deeper
understanding will enable you to create better
and more robust programs. As with all languages
there are some subtle areas where an
understanding of how things work makes all the
difference.

Just JavaScript: An Idiomatic Approach by Ian Elliot
ISBN: 978-1871962574
This book is an attempt to understand
JavaScript for what it really is – a very different
language that should not be compared to Java
or dismissed as simply a scripting language. It
looks at the ideas that originally motivated the
JavaScript approach and also at the additions
over time that have produced modern
JavaScript/ECMAScript. It isn’t a complete
introduction to JavaScript and isn’t for the
complete beginner to programming. It has
been written for those who are familiar with
the basic constructs used in any programming
language and have already encountered
JavaScript.
After reading it, you will have an understanding of how and why JavaScript is
unique and of the way in which you can exploit its strengths.

Fundamental C: Getting Closer To The Machine by Harry Fairhead
ISBN: 978-1871962604
C is a good language to learn. It was designed to
do a very different job from most modern
languages, and the key to understanding it is not
just to understand the code, but to see how it relates
to the hardware.
Harry Fairhead takes an approach that is close
to the hardware, introducing addresses,
pointers, and how things are represented using
binary. An important idea is that everything is a
bit pattern and what it means can change. As a
C developer you need to think about the way
data is represented, and this book encourages
this. It emphasizes the idea of modifying how a
bit pattern is treated using type punning and
unions. This power brings with it the scourge of the C world – undefined
behavior – which is ignored in many books on C. Here, not only is it
acknowledged, it is explained together with ways to avoid it.

