
The Anatomy of the LAONA Search Engine

Change history: Draft 14 2016; Draft Oct 8 2016; Draft Oct 7 2016; Draft Oct 5 2016; Draft Sept 30 2016; Draft Sept 16 2016; Draft Aug 29 2016.

Caesar Ogole
Final version will look significantly different

1.0 Introduction
The layered arrangement (Figure 1) was first created in February 2016. It was adapted later in June 2016 and incorporated as an integral part of the message content delivered to potential users by the automated notification system. To users of the data repository, the layered representation continues to be a useful technique for rendering abstractions of the information details, and in effect its original goal- aiding human perceptual visualization of the voluminous LAONA knowledge base- was achieved. However, challenges still remain for users who try to navigate through the data, especially when they need to quickly locate specific segments of it.
Figure 1: Screenshot showing the content of an email message sent from the automated notification system. First created, with tests completed, on June 3 2016, the automated system can be thought of as the first level of automation. The system had been automatically sending out notification messages in batches on a daily basis from the day the tests were completed until recently, when the notifications were halted for some (non-technical) reasons.

Two years after a cross-section of potential users had had access and the opportunity to use the data repository, whose content continued to be built on a rolling basis, it had become apparent that a large part of the data files made available in Portable Document Format (PDF) was getting cumbersome for users- mostly because of its length (1292 pages for Part 1 and 1428 pages for Part 2). Downloading and navigating through the raw, lengthy files, even enriched with internal structures such as hyperlinked pages, is not user-friendly. This is in spite of the fact that the PDF documents do come with an embedded (inbuilt) search and scrolling facility, among other useful features. The PDF file format was preferred for display purposes because the format preserves some of the important original properties of the data. For example, some email senders preferred using certain font properties to emphasize their points- and indeed the messages bearing such features may have some psychological effects on readers/recipients, at least over time. It is important that the data preserves these original properties whenever possible. Examples abound throughout the texts; Figure 2 (a) and (b) are simply two of them. Prolonged exposure to such textual creations, with their twisted logic and all else behind the scenes, was enough to cause potentially dangerous effects on mind and body (e.g. hallucinations, nausea), at least on a temporary basis throughout the course of that time- speaking from personal experience.

Figure 2(a) Figure 2(b)

Figure 2: Examples of some of the textual creations and contents that are part of the repository. Portable Document Format (PDF), as opposed to a plain text data format, was preferred for display because PDF preserves certain signal characteristics, so that reviewers may also infer other effects from the type of data we have in the repository.
2.0 Understanding where I am coming from: some background story

2.1 Where and when my ‘career in Computer Science’ began: inside the grass-thatched house
in Abia, Uganda

There is a story that I find particularly interesting about how the name "Java"- the programming language- came about. The creators of the programming language, led by James Gosling, initially proposed and temporarily used other names, such as "Oak", but later settled on the name "Java". The name "Java" was chosen to reflect the team's liking for Java coffee. This is some of the historical background that fascinated me when I first read about the origin of the Java programming language in 2002. Since then, the Java programming language has evolved in many ways. For example, in the early 2000s some of the language capabilities (libraries and application programming interfaces, APIs) were still under development. That's why, when you read through my undergraduate capstone project, you will find a section where I provided a description of "Project Swing"; the name part "Project" was apparently used in the most recent references at the time to indicate that "Swing", an enhancement of the language's Abstract Windowing Toolkit (AWT), was still under development. Since then, the language has evolved and is now rich in libraries/APIs. Also, Java and related technologies are no longer maintained by Sun Microsystems (as indicated in my undergraduate capstone report) but by Oracle [Oracle acquired Sun in 2010 or so].

However, before I got to learn all of this- and that was before I joined university- I had made use of my long high-school (senior six) vacation to read through general literature on computers and computer systems. The articles that I read, about 30 to 40 pages, covered an introduction to computer organization (hardware) and peripheral devices, computer networks, and computer software. The articles were pages printed out from www.msn.com and www.about.com, and I believe at least one set of the articles was from the Encarta Encyclopedia. The articles were given to me by one of my brothers, who thought I needed such literature to kill the occasional boredom during my vacation in the village. Interestingly, when I was reading through the articles, I had not had adequate exposure to the physical computers themselves. Not that I had not used a computer at all. The first time I really used a computer was sometime in 1995/1996, when I was asked by one of my brothers- the same one who gave me the articles- to look through some rows of data displayed on a laptop computer screen and cross-check that the records matched those on printed sheets of paper. That use of the computer was limited to the arrow (up and down) keys of the keyboard, for I had been warned against tampering with other parts of the machine. That was during my secondary school holidays (in Kampala city) when I was in senior one or senior two. Back at school the next semester, in a college dormitory, I zealously narrated to one of my roommates (also a classmate) that I had used a computer, but to my disappointment, he stubbornly dismissed my claim. "You must be talking about television!", he retorted. That was sometime in 1995/1996.
Anyway, my senior six vacation chores in the year 2000 included herding a couple of domestic animals
like goats and cows in the nearby farms, a thing that I should have found enjoyable. However, I did not
like the routine work very much not only because I was not accustomed to it, but mostly because one of
the cows (out of the six) was not friendly to me. [I had grown up not looking after animals since we
really didn’t have any when I was growing up. I was told we had many animals the time I was born but
they were later raided/stolen by cattle rustlers]. The particular unfriendly cow could unpredictably start
chasing and knocking people down, especially those that she was not acquainted with, and I was
definitely one of those who did not manage to tame her for the entire time I was charged with looking
after them. One particular incident remains vivid in my memory. On that day, while I was tending the
animals as they grazed in the grassland about two miles away on the farm, the unfriendly cow charged
at me and started chasing me. As I ran away from the animal, a wild snake of quite a size recoiled just underneath my foot, making a hissing sound. I was wearing loose sandals, and I had to abandon them in the grasslands so I could run faster. I ran like a sprinter without looking back over my
shoulders until I reached home and told my father how the cow and the snake charged at me. My father
went and took care of the animals, although the sandals were never found again. My peers generally
had no such troubles. My phobia of such [wild] animals is one of the reasons I jumped, almost screaming, when I was out hiking with some friends here in the USA sometime in 2011. I had almost stepped on a snake, albeit a dead one!

Anyway, as I continued reading the articles about computers in my grass-thatched hut, I could only rely
on mental pictures I created from the descriptions to piece together how some of the computer
components such as the mouse, floppy disks, etc were connected to each other. I had not seen them
physically. I could only imagine how the computer system components were interconnected, although
there were a couple of basic diagrammatic illustrations provided in the articles. I had no access to the
Internet nor had I used the Internet before. There were no libraries (of books) to consult in the village. I
was only reading about computers at a relative level of detail for the first time: what Local Area Networks (LANs) were and how the Internet is an example of a Wide Area Network (WAN). In that set of
articles, I read about RISC and CISC processors as minimal microprocessor models. Intel was cited as one
of the manufacturers of microprocessor chips (such as the Pentium). In the set of articles, I read about
evolution of programming languages: from the low-level assembler, to BASIC, to FORTRAN, to C, among
others. I read about how a computer program (or set of instructions) is used to fetch and move data or
operands (a bit or groups of bits such as bytes and nibbles) from one memory location (register) to
another within a microprocessor, in the von Neumann architecture- for example. In that set of articles, a
description of the difference between Random Access Memory and Auxiliary Memory was provided. In
that set of articles, I read about Sir Tim Berners-Lee inventing hypertext- what really made the World Wide Web possible. The article explained the difference between the Internet and the World Wide Web
(WWW), or simply “the Web”. The articles talked about operating systems, giving examples such as
Windows and Macintosh. The article described the difference between system software and application
software. The articles described what browsers were, giving examples such as Mosaic and Netscape and
Internet Explorer. Excellent, high-quality, motivating content and writing style- so I read and read it over again throughout the four-month vacation! The major issue that remained was that I could not see
what I was reading about! In one set of the articles, the language style was free-style. “Just type in the
text into the search engine input text field and hit the return key- and presto! The search engine will
display the results for you”. [During the vacation, I was also practicing Maths, Physics and Chemistry for I
knew it was easy to forget all that I had learned if I didn’t rehearse them. I also read a couple of novels. I
also coached local students at Senior Four level in mathematics. That was also the time Victor Ochen was working at a radio station in Lira Town, 23 miles away- and I was told he would often send me greetings live during his early morning (5 am or so) radio program, referring to me as "Aguc me Abia". I
did not get to hear his many greetings directly, except twice, as I really never used to turn on the radio in the early mornings. I am not a morning person.]

Fortunately, in the grass-thatched hut (that served as my residential quarter) during the vacation, there
was a blackboard and white chalk onto which I could transfer my mental pictures, by drawing the
diagrams on the blackboard, say, of how data packets were fetched from cache and manipulated by
arithmetic and logic units of the CPU.

Figure 3: Year: 2000/2001. High-school vacation: A somewhat look-alike of the hut that stood in
our compound, not far from the main house. I was occupying this hut as I read for the first time articles covering general
introduction to computers. When I think of the origin of names such as Java or Oak, I am reminded of the small hut in which I
was drawing pictures of the computer system in an attempt to complete the missing links between imagination and reality. The
hut, or “ot lum” or “lai” could as well be a name for a technology product just like the “oak” was chosen as a name for a
programming language!
Figure 4: Year 2000: What a scene of looking after cattle- my three-month occupation- looked like.

2.2 How reading general literature on computers during my Senior Six vacation proved to be
very useful during my undergraduate studies

So, when I joined Makerere University the next year (after the vacation), I was very curious to learn more about computer processors, software, browsers, operating systems, firmware, etc. that I had only read about (in "theory"). [The type of reading that I had done during vacation is what some people refer to as reading "only theory". Never mind; "theory" does have other meanings.] So, anyway, I was not disappointed with the theoretical background after I started formal courses at the university. I learnt more about computer architecture, operating systems, programming and many other concepts- both in theory and, most times, in practice. It turns out that "Computer Literacy"- a course unit that introduced first-year students to computer packages such as Microsoft Word, PowerPoint, Excel, web browsing, file/folder management, etc.- was a little more challenging (and not very interesting, I have to admit) for me. I was terrible at typing- and for some reason, I did not improve much on that aspect. I am still awkward at typing. Yet the other course units were much more fun!

So let me pick an example: one that exemplifies my path to understanding principles and heuristics for
search engine development.

1. In 2000, during my senior six long vacation, I read general literature that covered, among other topics, how the Internet works. Examples of search engines given at the time included AltaVista, Google and Infoseek, among others. I had not had a chance to really use the Internet.

2. In 2001/2002, I joined university and was formally introduced to computers. I learned introductory theories in computer science and their practice (through practicums), e.g. computer programming, computer networks, and operating systems such as UNIX, among others.
3. In 2003, one of the course units I took as part of my undergraduate training was "Distributed System Development". Mr. Jonathan Kizito was the instructor for this course. The simplified problem statement in the Distributed System Development project was: "How do I write a set of computer programs that communicate over a computer network when the setting is such that one set of programs is resident on one computer and another set of programs is resident on another computer- with the two (or more) computer nodes separated, potentially, by large geographical distances?" Mr. Jonathan Kizito was perhaps one of the most enthusiastic instructors, making sure he carried out hands-on demonstrations involving live tweaking of computer program code in the lecture room. Since I was an inexperienced programmer, I found Mr. Kizito's approach more motivating and closer to what I had longed for from my earlier instructors who, for the most part, left it to the students to do programming tasks without proper assistance.

But there was one issue with Distributed System Development: for a beginner (like me at the time), learning the new concepts necessary to carry out the implementation meant a steep learning curve! So even though Mr. Kizito had done a fine job demonstrating in class how one program module would request the service of another via some distributed system protocol, I still encountered difficulty comprehending the concepts afterwards, when I had to complete the course assignment on my own. The concepts to be grasped in order to complete the assignment were Remote Method Invocation (RMI) and/or Remote Procedure Calls (RPC)- Java-based and C-based protocols for communication between networked programs, respectively. Even though I had become fairly proficient in Java programming at the time, the concepts for network programming seemed a little more advanced (than what I had learnt from my favorite tutorial site). The tutorial site did not cover distributed system development. [At the time, I would use the terms network programming and client-server programming interchangeably- to mean distributed system development.] That's one of the times I had to look for another reference that contained details of distributed systems using Java- since I had opted to complete the assignment using RMI. (I was not good at programming in C, so RPC was not the route to take.)

Fortunately, there was a comprehensive reference book I could use: Core Java II by Cay S. Horstmann and Gary Cornell. [There was also Core Java I, but "Distributed Objects" (Figure 5) was covered in the more advanced Part II.] There were a couple of copies of this book at the Department of Computer Science library, and they were rarely borrowed, so I could find them any time I needed them. But for that particular homework, which had to be done within a week or two, I concentrated on the chapter "Distributed Objects"- and the concepts were not easy to grasp. I had to burn some midnight oil in Mitchell Dining Hall so I could pick up the important underlying ideas, ranging from descriptions of standards such as the Object Management Group (OMG)'s Common Object Request Broker Architecture (CORBA). [Mitchell Hall is one of the students' halls of residence at Makerere University, where I was resident.] I do not remember many details about CORBA, but at the time my effort was fruitful in that I partially understood the chapter's content- just enough for me to adapt the code we were given in order to create software objects that communicated over a logical network. [By logical, I mean that, in our practical setting, the two objects (programs) were resident on one physical machine but on two different networks, meaning that for them to communicate and request the service of the other, the programs needed a protocol such as RMI.] In fact, I borrowed and kept this textbook on my bookshelf in the dormitory (Mitchell Hall) for an extended period of time (about two semesters), not because I needed it all that time, but for yet another reason: my roommate, who also happened to be my classmate from secondary school (three years earlier), was a veterinary medicine student- also in his third year. Veterinary students, including my roommate, would sometimes pompously go about carrying voluminous textbooks, and in his section of the bookshelf in our shared room in Mitchell Hall he had a big textbook entitled "Blood". To make my side of the shelf also look "cool", I kept the rather voluminous "Core Java II" textbook and other textbooks such as "Software Architecture", even though I wasn't really using them most of the time.

Figure 5: Year 2003: A chapter in the book Core Java II by Cay S. Horstmann and Gary Cornell that I (as a beginner in 2003) found particularly useful for understanding distributed systems architecture. Service Oriented Architectures (SOA), which I would later get to learn about in 2007 during the graduate-level course "Advanced Web Technologies", build on concepts such as those described in CORBA- enabling the use of marshalled, heterogeneous software components across distributed networks, while taking into account important factors such as availability, load balancing and security (where applicable). In fact, the MTN Me2U service, of which Mr. Kizito was the lead developer in the early 2000s, is a [potential] technology artifact whose development methodology could serve as a quintessential example of the SOA paradigm.

Little did I know that after graduating with my Bachelor's degree, I was going to find Mr. Kizito (already) in the Netherlands- for, unknown to me, soon after teaching me the course, he was among the people selected for the M.S. Computer Science program in the Netherlands (a year before I joined the same Netherlands-based program). The others who went with Mr. Kizito were Mr. John Businge (who, at the time he was selected, was a teaching assistant at Mbarara University), Ms. Doreen Tuheirwe and Ms. Proscovia Olango. All these people, including Mr. Kizito, had graduated earlier from the BSc (flat) program between 2001 and 2004, with the exception of Olango, who came from a Bachelor of Library and Information Science (BLIS) background. In fact, Doreen formally graduated the same day I did (in early 2005), although she had already left for the Netherlands in/around Sept 2004. Kizito and Businge, I would later learn from them, had graduated in 2001 and 2002, respectively. Julius (my undergraduate classmate) graduated in Oct 2005, after he (Julius) and I were already in the Netherlands. I do not know when Olango had graduated from Makerere University.
The 2003 Distributed Systems Development course unit was Kizito's maiden and only set of lectures at Makerere University before he went to the Netherlands. Before that- he would later let me know when I joined them in the Netherlands- he had been working for a Kampala-based technology company called Digital Solutions. In fact, Kizito let me know that he was the lead engineer for the Me-To-You (Me2U) service developed by Digital Solutions. [The Me2U service, then available through Uganda's Mobile Telecommunications Network (MTN), allowed mobile phone users to share up to a limited amount of talk airtime directly from one phone to another.] I could not doubt Mr. Kizito when he told me that he was the main brain behind the Me2U service, although I, for one, had only heard about the service; I had not had an opportunity to use it. For the one year I was with him in the Netherlands, Mr. Kizito was, to the Ugandan group in the Netherlands, the go-to guy for software-related issues, not only because of his technical expertise but mostly because he maintained a library/collection of software packages such as Linux distributions, anti-virus tools, etc. In essence, Mr. Kizito's kindness and technical expertise reminded me, to an extent, of what Mr. Ashaba, Mr. Ntale and Mr. Oluka were to me and others earlier during my undergraduate studies at Makerere University, Uganda. While in the Netherlands, I also had an opportunity to attend the same class for a couple of course units, such as Mobile Software and Technology for System Realization, with the group (the trio- well, including Julius Kidubuka), although in some cases different individuals completed the course units' projects at different times due to varying individual schedules.

Some of the details above are for completeness, but they are important for the coherence of this particular story. For example, while I quickly found Mr. Kizito and Mr. Businge quite easy to interface with, I, on the other hand, had to interact quite reservedly with other people (such as Tushabe), for the fact that they had been my (former) teachers during my undergraduate studies still pre-occupied me. Also, some people wonder why I have to talk about distributed system development one moment and machine learning algorithms the next. Where is the specialization? Do you want to be a jack of all trades, a master of none? The simple answer is: that's what computer science is, ideally. Recall the technology stack? One has to be a master of all, ideally. [By this statement, I am not downplaying the importance of having in place an organized team in a typical software technology project when resources allow. "Division of labor" may be applied, potentially on a rotational basis. For the group work to succeed, every team member has to have a good idea of what the others are doing.] There are scenarios where an algorithm by itself is not useful. Algorithms need to be applied to data, and the results of processing rendered to a user in a usable manner. Components of the algorithm and/or data and/or output interfaces may be distributed across computer networks, and for one component to make a request for the service of other components, some protocol(s) for communication is needed. Knowledge of distributed systems concepts becomes indispensable. Design of a system that takes into consideration the fact that components may require access to services offered by other components across a network is part of the notion that led to the paradigm of Service Oriented Architecture. The distributed components (and hence services) need to be available (sometimes securely) whenever they are requested by other components. These considerations lead to other ideas, such as redundancy, among others. Indeed, some of these are characteristics of server computers (or simply servers).

4. But I need to say this: regarding "other considerations" (such as the "security" referred to in 3 above), I should state how the course "Distributed Systems" further facilitated my understanding of the significance of some important concepts in the Object Oriented computing paradigm. I got to appreciate OO concepts such as "encapsulation" better- for, when software object properties (state variables) and behavior (methods) are declared, say, as "private", then we know that such a visibility modifier is, in fact, a security feature- in which case (for this example) the method or variable declared using the visibility modifier "private" can only be accessed from within the same object. Only methods and state variables declared "public" are accessible from external objects, potentially distributed across networks! (A short illustrative sketch of this point is given after this list.) So distributed system development compels one to have at least a fairly firm grasp of these [new] important concepts. Visibility of software objects is simply one example, but it is a critically important layer of software security! [There are other visibility modifiers, such as "protected".] But computer security presents problems of an elusive nature, and personally, I do not believe there is a system that is (anywhere close to) 100% secure. The example I gave, for instance, assumes that there is no other workaround to negotiate access to, say, a "private" state of an object (assuming all other factors are held constant and an object's variable access is the only threat to security). That is, I am assuming the technology (or programming/scripting language) does not have another latent method for accessing a private method or state value from outside the object. I am not an expert in computer security. I did undertake some short training in 2011 on the topic of computer security, the content of which included discussion of threat models and ways of averting some attacks, but the threat/attack "models" often discussed are not exhaustive! I had an opportunity to carry out some hands-on work on basic analysis of traffic across a network using third-party tools- and again, the assumption was that the analysis tool is superb!

Let me give an example to illustrate the issue of elusiveness in computer security. Let’s say two
“secure” software objects are located on two nodes on a network. “Secure” in the sense
discussed above: only the state values and/or methods declared public can be accessed outside
the class instance (object) by another remote object, i.e. across a network. The request to
access the properties of the other remote object is brokered through some standard mechanism
(protocol) known to the nodes in which the communicating software objects are resident; and
the response sent by the remote object back to the requestor is conveyed using a similar
protocol. Here is the point I am trying to convey in simple terms: the channel through which
the two objects communicate across a network (be it wired or wireless) presents an opportunity
for security breach in itself- and this has led to coining of terminologies such as “man-in-the-
middle-attack”, etc.
[Diagram: Computer 1 and Computer 2 communicate over a channel, with a man-in-the-middle positioned on that channel.]

Figure 6: An illustration of the opportunity for a breach of security in a situation where two parties are communicating over a network.

Of course, a lot of effort has been put into research to find mechanisms to obfuscate the content of messages sent across networks so that even if the "man-in-the-middle" gets a message, it is twisted in some way- making it unintelligible to him. Encryption algorithms are used for this purpose. The intended recipient then uses a decryption algorithm to convert/unscramble the ciphertext back to the plaintext. Think of [implementations of] encryption and decryption algorithms as examples of the (private) methods of the communicating software objects (from the OO paradigm) in the distributed system- i.e. the encryption and decryption algorithms can only be seen by members of the same object, not from outside the object; they are a shared secret between the communicating parties in the network, but each object has its own local copy. I was introduced to security algorithms "theoretically" during my undergraduate studies in 2003- in the course unit "Cryptology and Coding Theory"- and the quite voluminous reference textbook we used discussed a range of encryption and decryption algorithms, such as 'block ciphers', etc. The course instructor, Mr. Barongo, once came to class with a page-long number series, which he explained was an example of encrypted text- the result of converting plaintext into ciphertext using the RSA algorithm! [RSA, in this case, is an acronym attributed to the inventors of the RSA algorithm- Rivest-Shamir-Adleman- not the Republic of South Africa.] [Like most of my instructors, I didn't know Mr. Barongo at a personal level, but Barongo, a year after teaching me the course "Cryptology and Coding Theory", would go on to ask me to assist him in grading coursework for Introductory Statistics- another course unit he was teaching at the Institute of Statistics and Applied Economics (ISAE). He had prepared a rubric for me for the purpose. One of his students at ISAE (taking the course unit I was grading), Mr. Dickson Wanglobo, had been my classmate and a close 'math buddy' at secondary school three years earlier- and he was residing only a couple of doors away in my residence hall at Makerere University. The assignment of grading the coursework was quite involved because there were many students, but I accepted it because I also needed some money, and it was a refresher for me in basic statistics.]

Anyway, the incident [in cyber security] of a well-equipped and determined man in the middle (adversary) intercepting encrypted data being transmitted across the channel and working backwards (through cryptanalysis) to the plaintext of the message content is not impossible, though it is often very difficult with the newer, advanced security algorithms. Technologies alone can't address all issues related to computer/internet security. Organizational policies- enforcing adherence to certain standard operating procedures (SOPs)- are important in reducing technology security loopholes. It's one of those controversial issues in computer science and technology. In some cases, it is not worth the effort investing in protecting certain types of information. In other cases, securing certain types of information is critically important! That's why organizational data is often classified as "public", "confidential", etc.- but the initial classification process cannot be automated; it has to be done by human beings!

The concept of "availability" in Service Oriented Computing could mean a trade-off is made among other resources (such as storage and communication channels) to ensure that when one aspect gets broken, another is available for continuity of service! [Of course, physical security is also important.] In a nutshell, the assignment for the course "Distributed Systems" reinforced my practical understanding of some of the concepts that I had only read about (in theory).

5. So the path- or story, so to speak- of how I gained enough knowledge to create a search engine continues: In 2007, I had a unique opportunity to take another course unit called Advanced Web Technologies, which was essentially the distributed system development that I had done during my undergraduate studies, except that it contained more advanced concepts. Still, the course included discussion of RMI and RPC before delving deeper into Service Oriented Architectures and the application of graph theory in advanced web technologies such as search engines. In fact, the final project for the course Advanced Web Technologies was to develop something similar to Google's PageRank algorithm, but since it was very time-demanding, I only made a simple proof of concept. From my understanding, distributed system development requires one to keep abreast of the latest technologies in addition to understanding the fundamentals. Not many people I have met are that well-rounded. But Prof. Marco Aiello, who taught the course Advanced Web Technologies at the University of Groningen in 2007, was an excellent professor of distributed systems! Julius, John, Doreen and Jonathan had already left the Netherlands (for Uganda), and so it was only Mr. Moses Matovu (who had joined/started in 2006, one year after me) and I who had the chance to attend in-class lectures by Prof. Aiello. This set of lectures was also probably Prof. Aiello's maiden set of lectures (for this course, at least) at the University of Groningen.
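As a footnote to the encapsulation point in item 4 above, here is a minimal sketch of the idea. It is written in TypeScript rather than Java (the language the original discussion refers to), but TypeScript borrows the same access modifiers, so the illustration carries over; the class and member names are hypothetical and are not taken from any actual LAONA code.

// A minimal illustration of visibility modifiers acting as a first layer of security.
class SecureMessageStore {
  // 'private' members are reachable only from code inside this class.
  private secretKey: string = "local-shared-secret";
  private messages: string[] = [];

  // 'public' members form the only surface exposed to external (possibly remote) callers.
  public addMessage(cipherText: string): void {
    this.messages.push(cipherText);
  }

  public messageCount(): number {
    return this.messages.length;
  }
}

const store = new SecureMessageStore();
store.addMessage("0x3af9");             // allowed: public method
console.log(store.messageCount());      // allowed: public method
// console.log(store.secretKey);        // compile-time error: 'secretKey' is private

As the text above cautions, this is only protection within the language; it assumes the runtime offers no latent workaround for reaching private state from outside the object.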

Much of the principles and heuristics that I have applied to develop the LAONA search engine are an extension of what I learnt during the courses Distributed Systems (undergraduate, 2003) and Advanced Web Technologies (MS, 2007). Yet other concepts- for example, the graph theory applications that were important topics in the course Advanced Web Technologies- were merely extensions of my undergraduate capstone. The text processing algorithms for the LAONA search engine were adapted from the basic code that I wrote for the course Machine Learning (for Natural Language Processing, 2006), and much credit is given to the course instructor, Prof. Jorg Tiedemann, who, at the time, was a post-doc researcher at the University of Groningen. Indispensable to this task was knowledge of some fundamentals of Natural Language Processing (such as regular expressions) that I gained from the graduate-level course "Natural Language Understanding" (2007)- from an equally great professor, Gertjan van Noord. Of course, I did study pattern recognition and other advanced machine learning algorithms (such as neural networks and SVMs)- courses taught by Prof. Michael Biehl (2005-2007).

3.0 The Need for Web Interface for LAONA Repository


3.1 Accessibility to data is an issue encountered at various levels

3.1.1 To facilitate accessibility to data through automated notifications sent out on a regular, periodic basis

So, first things first: aspects of accessibility of the LAONA data were, on one front, addressed through the automated notification system (Figure 1), and that undertaking was to ensure that [potential] users were constantly reminded of the availability of the data on a regular (e.g. weekly) basis. Delivery of this service was somewhat imposed on some of the recipients at that stage, as some recipients considered the notifications unsolicited. There were a couple of reactions against the development. Still, that was a successful demonstration of the (technical) feasibility of automation at that level. That is, the voluminous data was made accessible to the users by bringing it right into their email inboxes!

3.1.2 To further improvements on accessibility to relevant data by presenting data in a format in which data segments are organized in sorted (time) order

Chronological arrangement of data segments may also be regarded as yet another step that was [already] undertaken towards addressing the issue of data accessibility, for data presented in chronological order can be thought of as sorted. Sorting (or having presorted data) facilitates more efficient search- in this case, search in time order. Data in sorted form also lessens the difficulty of getting to (and understanding) the immediate prior and posterior contexts of an event, once an event is isolated. Layer 3 (Part 1 and Part 2) consists of email data arranged in time order, with incidents of overlap documented with inline subtitles using keywords such as "Meanwhile…". The subtitles and other annotations that were not originally part of the email data were added during the manual sorting process. Initially, the email data were all arranged in a large document using a word processor (Microsoft Word, to be specific) and enrichments such as subtitles and hyperlinks added, before the long Word documents were converted to PDF files. The first part of a chronologically arranged PDF file, a 650-page document, was created and shared with participant users on December 31 2013.
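To make concrete why presorting by time helps, here is a small sketch (not taken from the actual LAONA code) of the kind of lookup that sorted order makes possible: a binary search over message timestamps, which finds the first message on or after a given date in a logarithmic number of steps instead of scanning every entry. The type and function names are hypothetical.

// Hypothetical illustration: locating a date in a time-sorted list of messages.
interface MessageRecord {
  sentAt: Date;     // timestamp of the email
  page: number;     // page of the compiled document where it appears
}

// Returns the index of the first record sent on or after `target`,
// assuming `records` is sorted by `sentAt` in ascending order.
function firstOnOrAfter(records: MessageRecord[], target: Date): number {
  let lo = 0;
  let hi = records.length;               // search window [lo, hi)
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (records[mid].sentAt.getTime() < target.getTime()) {
      lo = mid + 1;                      // everything up to mid is too early
    } else {
      hi = mid;                          // mid could still be the answer
    }
  }
  return lo;                             // may equal records.length if none match
}

The same idea is what lets a returning reader skip ahead quickly: with the pages in time order, narrowing down to a date range never requires touching most of the document.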

Throughout the time from 2012 onward (since data in this repository started coming in, on a rolling basis), most people had not realized the great potential that having readily accessible, highly accurate historical data has for facilitating learning. In fact, the initial motivation behind creating the first part of the chronologically organized data, including the first 650-page document that covered the period from Dec 2012 to Dec 2013, was not borne out of the idea or anticipation that the document contents could turn out to be of importance for anything more than a short-term purpose. The step that I took on December 31 2013 by broadcasting a sketch of the chronologically arranged data was a desperate attempt on my part to offer a sort of crash course on the potential dangers of reasoning/acting based on information from unreliable sources and/or incomplete information- which I assumed was responsible for the never-ending, devastating, war-like mess that had spread all over the place at the time- and also partly because I found myself at the center of it all, for better or worse.

Some reactions were not in favor of the development, however. In fact, there was no explicit expression of support for the new development that I had initiated. I suppose some people had different interests, but there is/was also the problem of some participants lacking the capability to carry out independent analysis; nonage was pervasive. Some participants were not competent enough to read and comprehend well on their own, not only because of issues related to semi-literacy on the part of such people- who were often emotionally charged, unsuspecting participants- but also because the situation was compounded by tense and very confusing episodes in an artificial atmosphere largely and deliberately orchestrated by a subgroup of participants (demagogues) who seemed to have been harboring preplanned goals and interests prior to the time most people consider the genesis of the problems. The result is that even people who were able to read and write fairly well (including most of the key demagogues themselves) ended up in a state of confusion- with their mental and emotional states characterized by envy, perplexity and paranoia- battleground states, if you like.

With continued improvements in accessibility and in the formats in which the historical data was presented, and with the inclusion of derivative enrichments (e.g. subtitles) that made understanding the past events much easier for everyone, most people eventually started to embrace use of the data about past events in order to make more informed contributions. Still, this did not change things overnight- but it generated a few "wows" and "this is unbelievable!" here and there- enough to change the course somewhat, though only gradually. Most of the damage done could not be undone. An elaborate example of derivative enrichments is the set of analyses presented in Layer 4 and Layer 5, derived from the preprocessed, well-organized Layer 3 and the layers below it, combined with relevant data obtained from external sources such as news media. Here is one way in which sorting facilitates efficient learning: returning users (people who are not first-time users of the data system) could easily get to the desired page segments, at or near the pages they had visited before, by jumping/skipping (e.g. by scrolling faster through) some segments/pages of the documents, since the pages were organized in time order.

3.1.3 To further improvements on data accessibility by providing a capability to display relevant segments of data in smaller, manageable portions through a web-based user interface

Even someone who is not a first-time user of the data system may not recall the time point(s) when certain episodes occurred, but he/she may recall some salient features (e.g. keywords) that were uttered during the episodes that he/she is interested in reviewing. The PDF does come with an inbuilt search facility that users can leverage by entering a keyword, after which the user is taken to the page that contains the keyword. However, the usefulness of the inbuilt PDF search function is weighed down by the sheer volume of the various parts of the data. (Think of the experience of scrolling through thousands of pages every time you want to go to one particular page. Your computer is likely to freeze!) Note also that some sections of the documents are enriched with multiple structures such as hyperlinks, and the inbuilt PDF search facility does not dig into the contents of the document segments to which the hyperlinks point. Navigation through the repository (via the PDF's inbuilt search facility) ceases to be a pleasant experience, especially when the parts are presented to a user in large chunks (e.g. Layer 3 Part 1 of the repository consists of 1292 pages and Layer 3 Part 2 consists of 1428 pages).

But there are also limitations that will soon become clear. And there is also the issue of portions of the data being available in formats other than PDF. The issue of accessibility therefore remained only partially resolved. An effort to make the data (subsystem) more usable with respect to these unresolved aspects of accessibility led to the conception of a new idea: create a web interface with controls/components that easily allow access to the various parts and categories of data. The web interface would allow for retrieval and display of relevant data in smaller-sized portions in a manner that enhances user experience, in effect significantly reducing the download time and the unpleasant lengthy scrolling that would otherwise affect user experience negatively if the documents were presented to users in thousands of pages. To achieve this objective, the web interface would contain graphical user interface (GUI) components that enable a user to input minimal information that the web-based system would then use as a guideline for selecting and displaying segments of the data on the data display panel of the GUI. The development effort for this system can be thought of as comprising: (i) "front-end layer development"- the creation of graphical components that users can interact with in order to retrieve and display specific segments of data; (ii) the creation of "back-end layer intelligence"- the mechanisms that interpret what the user has asked for and subsequently carry out the necessary on-the-fly data preprocessing; and (iii) the creation of linkages between these layers.
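To make the three-part breakdown above more concrete, here is a minimal front-end sketch, written in TypeScript for a browser page. Every element id, function name and URL in it is hypothetical- it only illustrates the kind of linkage between a GUI control, a back-end request and a display panel that the text describes, not the actual LAONA implementation.

// Hypothetical wiring between the GUI layer and the back-end layer.
async function handleSearchClick(): Promise<void> {
  const input = document.getElementById("search-box") as HTMLInputElement;
  const panel = document.getElementById("display-panel") as HTMLElement;

  const query = input.value.trim();
  if (query.length === 0) {
    return; // nothing to search for
  }

  // (ii) Back-end intelligence: the server interprets the query and selects a segment.
  const response = await fetch("/api/search?q=" + encodeURIComponent(query));
  const segmentHtml = await response.text();

  // (i) Front-end layer: render the returned segment in the display panel.
  panel.innerHTML = segmentHtml;
}

// (iii) Linkage: connect the GUI control to the handler.
document.getElementById("search-button")?.addEventListener("click", () => {
  void handleSearchClick();
});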

3.1.4 To further improvements on accessibility by facilitating comprehensibility through inline semantic enrichments

Another great feature illustrated in Part 1 of Layer 3 was the presentation of the time intervals between successive emails. A script was written to compute the length of time between any two dates, breaking the result down into a number of days, hours and minutes. The length of time between successive messages can be important if a reviewer is interested in getting an idea of the "velocity" of the message signals, or in how long it took for a particular response signal to be provided, or when we are interested in computing other relevant statistics. From the statistics, for example, one could infer the performance of certain respondents/participants under tense, fast-moving settings- with caveats in place to account for any relevant missing data. There are limits to the extent to which semantic enrichments such as subtitles, hyperlinks and computed time intervals can be included in the documents: too much embedded meta-data can lead to clutter, in effect causing confusion and defeating the purpose of the enrichment.

Let's look at an example of missing data. At the times the email data were being generated, the email communications were more often than not accompanied by behind-the-scenes, highly emotive telephone conversations. However, data from the telephone communications were not captured in the repository. With extra effort, a reviewer could, however, try to infer at least the nature of some of the missing behind-the-scenes communications. Also, the settings of the environment and its conditions (e.g. weather, peer pressure), which could have had some influence on the nature of the signals generated by some participants, could not be captured- but a reviewer could try to infer some of those pieces of information from the dates and the intensities at which a given group of signals was generated. For example, December is winter in the USA, so one could tell that cold weather could have contributed to certain behavior of a person, while email responses (exchanges) provided in succession at time intervals as short as 10 minutes, 30 minutes or even a couple of hours could serve as an indicator of how intense the situation was during the indicated period- though measurements made in this way need to be undertaken with great care.
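The original script that computed these intervals is not shown in this document; the short TypeScript sketch below merely illustrates the computation described above- taking two timestamps and breaking the difference down into days, hours and minutes. The function name and the sample dates are made up for the example.

// Break the interval between two timestamps into days, hours and minutes.
function timeBetween(earlier: Date, later: Date): { days: number; hours: number; minutes: number } {
  const totalMinutes = Math.floor((later.getTime() - earlier.getTime()) / 60000);
  const days = Math.floor(totalMinutes / (24 * 60));
  const hours = Math.floor((totalMinutes % (24 * 60)) / 60);
  const minutes = totalMinutes % 60;
  return { days, hours, minutes };
}

// Example: two successive (hypothetical) email timestamps.
const sent = new Date("2013-06-01T21:05:00Z");
const reply = new Date("2013-06-03T09:35:00Z");
console.log(timeBetween(sent, reply)); // { days: 1, hours: 12, minutes: 30 }

An interval as short as a few minutes between successive replies is the kind of "velocity" signal the text refers to.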

But while I was preoccupied with the problem of producing reconstructed past episodes that, in addition to preserving all the minute details, also included the enrichments discussed above, I was, on the other hand, also worried about the possibility that a reviewer of such a reconstruction might find himself or herself drowning in the bewildering torment and madness that some of the key and unsuspecting volunteer participants trapped in the shenanigans found themselves in for the extended period when the streaming signals were generated. My worry was, however, negated by another thought: for such damage to reproduce itself (e.g. for another reviewer to go crazy) simply because a reconstruction is of a set of events that had caused serious changes in the mental and emotional states of others, and all else that accompanied it, such a reconstruction would have to be real- that is, it would have to include (nearly) all the events, reproduced in reality under very similar if not the exact same conditions. Such a realization would also occur only under the assumption that such reviewers possess psychological states (e.g. levels of empathy, ingenuity) similar to those of the participants who were affected by the same experiences during the actual events. Only under such circumstances would such post-event realizations occur. The fact that other parts of the messages and event records were not captured- because they could not be captured- dismisses any claim that making the data we have in the repository available to reviewers could cause the same damage that was experienced during the time of the actual events.

3.2 The search engine as part of the web-based system

A new functionality that allows searching of the data through the newly built user interface was needed. In the new web interface, the issue of the length/size/volume of the data would be addressed by retrieving and displaying only one segment/page of data at a time, with neighboring segments made easily reachable through readily available components of the GUI, such as the "next page" and "back page" buttons, or through the arrow (hot) keys. A search function allows users to easily locate sections (e.g. a page or pages) of the repository (e.g. a document) containing a search word or phrase. The first version was implemented and tested on Aug 20 2016, and an announcement to that effect was relayed accordingly (see Figure 7)!
Figure 7: Screenshot showing the content of a message I (manually) sent out to a group of potential end-users on August 20 to relay to them the latest milestone in the search engine development. By this point, the functionality of the engine had grown to include a capability for multi-word search!

4.0 Under the hood: search engine development design decisions


Section 3 contains the arguments for the need to improve accessibility to the historical data and, consequently, the actions taken to do so. One step taken to improve timely access to relevant segments of the data was to build a web interface. A key component of the web-based system would be a search engine tailored for this data repository. Indeed, the search engine component of the system was successfully built, tested and made available to users on August 20 2016 (see Figure 7). In this section, we discuss the techniques employed in creating the search engine.

4.1 Retrieving and loading search data into memory, invoking the search function/method and some
preprocessing of the search key

The data to be accessed is stored on a remote data server (Amazon Web Services, a cloud computing service from Amazon Inc.). A user gains access to the data by launching a web interface through a web browser.

Some analogies between the operation of a microcomputer and how a search engine works will make it
easier to understand the problem.
4.1.1 Microcomputer vs. the Search engine

Microcomputer

i. When a computer is started (by powering it on), it uses firmware – embedded programs- to
perform some self-diagnostic tests to make sure all the necessary components are functioning
properly.
ii. After it is determined that all the necessary components are functioning properly, the operating
system (system programs), application programs and data needed by the programs are loaded
from the hard disk (and/or other auxiliary storage device) into the working memory (called
Random Access Memory, RAM). The data remains in the RAM as long as the computer is
running. The data and programs in the RAM are lost when the computer is turned off.

LAONA search engine

The LAONA search engine was (partly) designed to work in a similar manner:

i. When the web-based application is launched, the web-based system performs some self-diagnostic tests to ensure that the environment is suitable for running the application. For example, tests are (or may be) carried out to determine if the client browser version and settings allow display of the application's graphical user interface components.

ii. After it is determined that the environment supports running the application, the necessary application programs and data are loaded into the working memory. In this case, the working-memory equivalent of a microcomputer's RAM is provided by the browser. As is the case with RAM, the size of the browser memory is limited.
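A minimal sketch of this launch sequence is shown below, in TypeScript for a browser environment. It is not the LAONA code itself; the feature checks, file names and function names are illustrative assumptions about what steps (i) and (ii) could look like.

// Step (i): simple self-diagnostic checks of the client environment.
function environmentIsSuitable(): boolean {
  const hasFetch = typeof fetch === "function";            // needed to retrieve data
  const hasJson = typeof JSON !== "undefined";             // needed to parse index files
  const hasDom = typeof document !== "undefined";          // needed to render the GUI
  return hasFetch && hasJson && hasDom;
}

// Step (ii): load the precomputed index data into the browser's working memory.
let indexedLayerPart1: Record<string, number[]> = {};

async function initializeApplication(): Promise<void> {
  if (!environmentIsSuitable()) {
    alert("This browser does not support the search application.");
    return;
  }
  // Hypothetical location of the precomputed index for Layer 3 Part 1.
  const response = await fetch("/data/indexed-layer3-part1.json");
  indexedLayerPart1 = await response.json();               // kept in memory for the session
}

void initializeApplication();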

4.1.1 The size of RAM matters and so does that of any working memory

In the case of a microcomputer, the size of the working memory is limited, and yet that size matters, for it has a bearing on how much data can be kept available for quick access. That is, when programs and the data needed for current use are stored in the same RAM, it is much faster for a program to access the data in the RAM than to access data from an external device (such as the hard disk, or even across a network). Likewise, the size of the browser memory is important. In the case of the LAONA search engine, the data and programs that need to be loaded into the working (browser) memory when a user launches the application from his/her (client) computer are retrieved from remote persistent storage on a dedicated computer (called a server). A server is a specialized, dedicated computer that is available for access by client computers 24/7. Communication between the client computer and the server is facilitated by network communication protocols.
4.1.2 What data and programs are loaded into memory?

Because the size of the working memory is relatively small, a rule of thumb followed in the design of the search engine was to make sure that the size of the data that needs to be loaded into the working memory is as small as possible- by any means necessary- while preserving the integrity of the original data. It will become clearer in subsequent sections how, through a technique called indexing, the size of the data loaded into memory is made much smaller than it would otherwise be if the original data were loaded into memory. It is this much smaller, pre-computed indexed data that is loaded into the working memory of the client computer as part of the initialization when a user launches the web-based application. Line 5 and Line 10 of Code snippet 1 show the main initialization functions loadDoc() and loadIndexedLayerPart1() that are invoked as the application is launched. Having precomputed data in place where possible presents performance benefits in that less time is taken by the program to relay responses to user requests, since the program may only have to perform a look-up from files instead of executing a time-consuming procedure in real time while a user waits for the results. Since the term RAM is reserved for the temporary working memory of a system such as a microcomputer, we cannot use RAM to refer to the working memory of distributed systems like that of the search engine.
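The indexing technique itself is covered later in the document; as a rough illustration of why the indexed data is so much smaller, consider the shape such an index could take- a map from each (normalized) keyword to the list of page numbers on which it occurs. The structure and names below are assumptions for illustration, not the actual LAONA index format.

// A hypothetical inverted index: keyword -> page numbers where the keyword appears.
type PageIndex = Record<string, number[]>;

// Build a tiny index from already-split page texts (pages are 1-based here).
function buildIndex(pages: string[]): PageIndex {
  const index: PageIndex = {};
  pages.forEach((pageText, i) => {
    const words = pageText.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
    for (const word of new Set(words)) {          // record each word once per page
      (index[word] ??= []).push(i + 1);
    }
  });
  return index;
}

// The client only needs this compact index in memory, not the full 1292-page text.
const demoIndex = buildIndex(["hello world", "world peace", "hello again"]);
console.log(demoIndex["world"]); // [1, 2]
console.log(demoIndex["hello"]); // [1, 3]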

Terms such as "transient memory" or "transitory memory" are sometimes used to refer to the temporary working memory in the case of distributed systems. The memory blocks may be resident on the same physical computer or they may be distributed across a network. Another way to think of these transient memories (other than RAM) is as the lower-level class objects (instantiations of classes, in the object-oriented sense) that exist during the runtime of a program. As a matter of fact, class objects are nothing but chunks of memory holding programs (methods) and/or data (the values of instance variables). Each memory object has a unique identifier and is accessible to a programmer through its reference variables. An object- an instantiation of a class- is alive as long as a reference variable points to it. Once an object's reference variable is destroyed (say, by reusing it to point to another object of the same or a parent class, or, in some cases, after part of the program has finished execution), the object disappears- and the chunk of memory is made available for reuse through a memory-management technique called garbage collection. Of course, transient memories may contain several such objects during the session when the application is running, and these objects are lost when the application quits. You will encounter terms such as "plurality of memories" to refer to such a phenomenon- although the phrase is more commonly used in a distributed systems context. The fact is that even in RAM there are often many (i.e. a plurality of) memory objects in existence when an application is running.
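As a small aside, the object-lifetime point above can be seen directly in the language used for the sketches in this document; in TypeScript/JavaScript, too, an object becomes eligible for garbage collection once nothing references it any longer. The snippet below is illustrative only, with made-up names and values.

// An object stays alive while at least one reference variable points to it.
let session: { user: string; pagesVisited: number[] } | null =
  { user: "reviewer-1", pagesVisited: [12, 57, 203] };

// Reusing (or clearing) the only reference makes the old object unreachable;
// the runtime's garbage collector may then reclaim that chunk of memory.
session = null;

// A new object occupies a fresh chunk of memory with its own identity.
session = { user: "reviewer-2", pagesVisited: [] };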

The decision to have data preloaded in the working memory at the time the application is launched assumes that all of the data and program modules loaded into memory on the client computer will be needed during that session of the application. However, it is not the case that a user will always proceed to use the search functionality every time he/she has the application running. For example, he/she may only be interested in using other functionality through interface controls such as the "next" button, the "previous" button or "search by page number". These use-case scenarios require neither the preloaded indexed data nor the elaborate text search program module- even though they are already loaded in memory. So it seems a section of memory is redundant (and "wasted") under such scenarios. Yes, that is true, but the merits of already having the data and the programs in memory outweigh the demerits of that redundancy in the event that a need arises for the user to use the elaborate search function. In the unpredictable but likely event that a user chooses to use the "search by keywords" functionality and the indexed files are not preloaded, the application is rendered almost unusable, since, on average, it may take a considerable amount of time to retrieve the data across a network. Note that under the current design, which allows for preloading at initialization, the only data retrieved across the network are data segments (single pages of the document). Retrieval and display of search results (pages) occurs by direct access after the preloaded client-side in-memory programs, using the preloaded client-side in-memory data (indexed files), have determined the page number(s) that match the conditions specified in the query! The search criteria are discussed in Section 4.2.

4.1.3 Creating a fully automated system using * * * * *

The process of generating the precomputed files, in this case, is sometimes referred to as offline
processing because of its time-consuming nature, which compels designers to have the preprocessing
carried out well in advance and not at runtime. The intermediate output of the preprocessing is then
made available to the application at runtime. Of course, in a fully automated system, this offline
processing would be time-triggered, or the trigger could be fired by the arrival of new data, or the like-
meaning the index files need to be updated. The offline preprocessing would be carried out
automatically on the server side. Our [system] implementation is currently semi-automated- meaning
that (behind-the-scenes, i.e. server-side) updates to the indexed files need to be triggered manually
when new pages are added, inserted or deleted from the repository. The one-time use of *****, applied
to generate the current index files on a cloud computing server, already serves as a proof of concept of
the idea proposed for full automation.
Code snippet 1: The decisions of when and which data portions to load into working memory (from auxiliary persistent storage)
for processing are important factors that have to be taken into consideration.
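
The original snippet is not reproduced here; the following hypothetical Java sketch only illustrates the kind of policy the caption describes- the bulky index files are fetched eagerly at launch and held in working memory, while individual pages are fetched lazily, on demand. The file names and URL layout below are assumptions, not the actual server layout.

import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class PreloadPolicy {
    private List<String> uniqueWordIndex;    // one line per unique word (see Table 1)
    private List<String> fullyIndexedCorpus; // the corpus with words replaced by index values

    private static String download(String url) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            return new String(in.readAllBytes());
        }
    }

    // Called once when the web application is launched: preload the index data.
    public void initialize(String baseUrl) throws Exception {
        uniqueWordIndex = Arrays.asList(download(baseUrl + "/unique_word_index.txt").split("\n"));
        fullyIndexedCorpus = Arrays.asList(download(baseUrl + "/fully_indexed_corpus.txt").split("\n"));
    }

    // A single page (one small PDF) is retrieved across the network only when it is needed.
    public byte[] fetchPage(String baseUrl, int pageNumber) throws Exception {
        try (InputStream in = new URL(baseUrl + "/pages/page_" + pageNumber + ".pdf").openStream()) {
            return in.readAllBytes();
        }
    }
}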

4.2 The Search Criteria

4.2.1 Searching by page numbers: design for direct access to the requested segment

Two search criteria were identified: (i) Searching by specifying page numbers (ii) Searching by specifying
keyword phrase

Searching by page numbers is a use-case scenario in which we envision a user specifying a page number
within a valid range and the search engine proceeding to return (i.e. display) the requested page. To
understand this clearly, let’s look at the structure of a typical document to be searched. Let’s pick the
Layer 3 Part 1 document, which consists of 1292 pages (in PDF format).
Figure 8: An example corpus document (Layer 3 Part 1) in PDF.

Figure 9: PDF to plain text conversion: one big file in PDF format is converted to one big file in text
format. This long text file (super string) is the output of the PDF to text conversion process. It is in turn
fed in as an input to the text processor! One problem with the text version is that the line properties of
some texts are not preserved by the conversion process. Page number numeric texts are also mixed up
with other texts. (The term “super string” as used in this context is coined to mean an object containing
a very long string as the value of its instance variable. It is not related to superstring theory in physics.)

The first technical challenge we face is that our search algorithm is a text processor only, that is, the
program routine accepts and processes input presented in text format. A text is a sequence of
alphanumeric characters. What that means is that the original PDF file first has to be converted into text.
This preprocessing step was successfully done using some UNIX utilities.

However, the PDF to text conversion of the 1292-page file (Figure 8) into a text file (Figure 9) creates an
output file that contains distortions of the original text line properties of the original file (Figure 8), and
some sections of the text that indicated page numbers in the original PDF file end up getting mixed up
(concatenated) with other strings of characters during the conversion. Figure 9 shows what the resulting
file looked like. What these distortions mean, partly, is that even though the text file is in the format
acceptable by our algorithm, our text processor (algorithm module) would encounter lots of trouble
locating the beginning and end of pages in an attempt to find and display a segment of the document in
accordance with a user request- in this first search by page number scenario. This is especially because
the numeric character strings that served as page number indicators in the original PDF document are
now mixed up with other strings of characters in the new text file. Moreover, in the original PDF files,
there also exist other numeric characters that are not page numbers themselves. What this means is
that the problem of isolating page numbers from a mixed-up section of the texts in the conversion
output text file is further complicated, for we can no longer go by the assumption that any numeric
pattern is a page number.

To address the issues discussed above, the idea conceived and implemented was to first split the original
1292-page PDF document into 1292 individual PDF files (where each file, and its content, corresponds to
a page from the original PDF document). Each of the 1292 PDF files was then converted to a text file
(using the UNIX facility discussed above). Figure 10 illustrates what this conversion process looks like.
Figure 10: The large 1292-page document in PDF was segmented/split into 1292 individual pages. This preprocessing step yields
several performance benefits.

Now, we have 1292 text files besides the 1292 PDF files from which the text files were generated. We
have to note, however, that inherent in most of the 1292 text files are the sorts of distortions of text
line properties discussed earlier. These distortions, however, will not have significant impact when the
individual 1292 text files are used for search purposes. (Important note: In fact, in the search by page
number scenario, we do not need the individual 1292 text files. The individual text files, as we will see in
subsequent sections, are important in the second scenario, when a user opts to search the corpus by
specifying a search string pattern. Two different interfaces are at a user’s disposal to take in user input in
each case.)

All we need in the search by page number scenario are the individual PDF files. For each PDF file
(corresponding to a page in the original document), a mechanism to access it directly is provided. In our
case, these 1292 PDF files are stored on a remote file server and a user can access each file using a
unique URI. Each file is named such that its name string contains a unique number (in this case, ranging
from 1 to 1292) corresponding to its page number in the original whole 1292-page PDF document. This
name string is then made part of the URL, and in a typical user scenario (search by page number), the
unique number entered by the user is used to identify and retrieve the particular file/page whose URI
contains a match of that page number. This direct access to a document segment/page greatly improves
search efficiency. The strategy eliminates the need to [have a processor that would otherwise have to]
search through the segments/pages starting from the beginning of the document (page 1) and
proceeding sequentially until the desired page is found. Direct access is akin to the strategy used by
random access memories (RAMs). The term “random access” is used to mean that any memory location
(segment) can be accessed directly: any byte of memory can be accessed without touching the
preceding bytes. The difference is that in the case of the LAONA search engine, the data and programs
are retrieved from a remote server into the client memory for display, while in the case of a
microcomputer, programs and data are often stored on a local disk- and loaded into the RAM.

If we had not had the 1292-page document split into individual PDF pages, we would have to go the
route of a time-consuming search for the requested page beginning from page 1 every time a user
makes a request by specifying a page number. And in this case, the program would probably have to use
the long text document (Figure 9) and find a way of isolating page numbers from the string of characters.
The one-time offline preprocessing task of splitting the whole document into individual PDF files and of
naming the files in a manner that facilitates direct access is thus cost-effective- and improves user
experience during actual use of the system! [The same technique was applied to Layer 3 Part 2- which
consists of 1428 pages.]
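
A hedged sketch of the page-number-to-URI mapping described above follows; the host name and path are placeholders, not the actual server layout.

public class PageLocator {
    private static final String BASE = "https://files.example.org/laona/layer3/part1/";
    private static final int MAX_PAGE = 1292; // Layer 3 Part 1

    // The requested page number is embedded in the file name, so the page can be
    // retrieved directly- no sequential scan through the earlier pages is needed.
    public static String uriForPage(int pageNumber) {
        if (pageNumber < 1 || pageNumber > MAX_PAGE) {
            throw new IllegalArgumentException("Page number out of range: " + pageNumber);
        }
        return BASE + "page_" + pageNumber + ".pdf"; // e.g. page 57 -> .../page_57.pdf
    }
}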

4.2.2 Searching by keywords

The benefits of direct access to file segments discussed in the preceding section are not limited to search
by page numbers only. However, a second scenario of searching by keywords requires that the text
processor have the capability of searching for matching text patterns in the actual corpus. In the search
by keywords, the page number(s) of the segment(s) that contain(s) the keyword/phrase is/are returned.
For each page number returned, the actual page(s) corresponding to the page number(s) is/are
retrieved from the file server by direct access- one at a time. This section discusses the case where the
search proceeds by going deep into the texts, unlike in the search by page number case where only the
filenames of the segmented files were sufficient to complete the search. [Note: we use the term direct
access in this case since random access is applied to RAMs.] However, a strategy often applied to aid
efficient search in the search by word/phrase scenario is to convert the occurrences of word tokens into
corresponding index values. Indexing, as it is called, replaces a word with a numeric value. This strategy
saves storage space, and its benefit cannot be overstated, especially in a case such as ours where
transient memories are limited in size. Think of a word such as anti-disestablishment, with 21 characters,
occurring in a text, say, 7 times, interleaved with other words:
anti-disestablishment word2 anti-disestablishment word4 anti-disestablishment word6
anti-disestablishment word8 anti-disestablishment word10 anti-disestablishment word12
anti-disestablishment cholesteryloxycarbonyl

Indexing would yield the following output:

24 3 24 8 24 6 24 9 24 11 24 4 24 32

The 21 characters in the word anti-disestablishment have been replaced by its corresponding index
value, 24, and this process translates directly to a significant amount of storage space saved- on average.
And since it is the index values that are used during real-time client-side processing, the processing
speed is greatly improved, because operations involving numbers (primitive data types) are generally
less time consuming than string (object) operations. Well, that’s not exactly the case here, since our
implementation treats a numeric pattern as a string pattern during comparisons. Important note:
Indexing is applied to the 1292 individual text files- creating one long file of indexes in which the pages
are delimited by some unique numerical patterns. More about indexing is discussed in the next section.
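
As a minimal sketch of the indexing idea (the production indexer also handles page delimiters and the word variants discussed later; the token handling below is simplified):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimpleIndexer {
    // Replace each word token by a numeric index value, assigning new values in the
    // order of first occurrence (as in the unique word index file).
    public static List<Integer> index(List<String> tokens, Map<String, Integer> dictionary) {
        List<Integer> indexed = new ArrayList<>();
        for (String token : tokens) {
            Integer value = dictionary.get(token);
            if (value == null) {
                value = dictionary.size() + 1; // first occurrence gets the next free index
                dictionary.put(token, value);
            }
            indexed.add(value);
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        List<Integer> out = index(List.of("hello", "who", "are", "you", "who"), dict);
        System.out.println(dict); // {hello=1, who=2, are=3, you=4}
        System.out.println(out);  // [1, 2, 3, 4, 2]
    }
}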

4.2.2.1 Handling single-word search keys

Let’s be clear about why the program is designed (co-designed) to handle single-word and multi-word
search keys separately. The design goal has to do with efficiency: an attempt to optimize usage of
working memory space and computation/search time. Table 1 below shows what the content of the
unique word index file looks like.

Index   Word           Page numbers for pages containing the word
1       hello          8 10 100 205 1292
2       you            22 69 201 800 907
3       wow            2 9 506
4       ditto          9
5       who            17 77 152 405 708 1001
6       egregious      201 306
7       are            1 2 8 22 101 999
8       (hello         88 704
9       “are           32 91 303
11      innuendo       8 10
12      “who!          872
13      anchor         1172
14      .              1 2 4 5 6 7 8 9 23 15 76 23 98 123
15      ?              34 41 3
16      !              45 632
17      position(s)    542
Table 1: An example of a precomputed unique word index file consisting of the first 17 unique word tokens. In the case of Layer
3 Part 1 and Part 2, the unique words appearing in the table rows are listed in the order in which they occur in the corpus. The
first column contains unique index values from 1 through 17 (the number of unique words). In single-word search, the index
values are not important because the search proceeds on a look-up basis using the column 2 values. The third column of the
table contains the page number(s) for the page(s) in which the corresponding word is found. The page numbers are separated
by a single space character.

Of course, in the case of the LAONA repository, the index runs from 1 up to about 35,000 for each part
(Part 1 and Part 2 of Layer 3). What that means is that there are just about 35,000 unique words in that
large sub-repository (e.g. Part 1) - counting word variants discussed in the later sections of this paper.
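
A schematic look-up for the single-word case might read as follows, assuming each line of the preloaded index file has the form shown in Table 1 (index value, word, then the page numbers); the production routine additionally expands the search word into its variants.

import java.util.ArrayList;
import java.util.List;

public class SingleWordLookup {
    // Single-word search as a plain look-up: find the row whose second field equals the
    // search word and return the page numbers listed in the remaining fields.
    public static List<Integer> pagesFor(String searchWord, List<String> uniqueWordIndexLines) {
        List<Integer> pages = new ArrayList<>();
        for (String line : uniqueWordIndexLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 3 && parts[1].equals(searchWord)) {
                for (int i = 2; i < parts.length; i++) {
                    pages.add(Integer.parseInt(parts[i]));
                }
                break; // each word occurs on exactly one row of the index file
            }
        }
        return pages;
    }
}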

4.2.2.2 Handling punctuations in the word tokens

There is an important problem to be noted when generating the unique word list from the thousands of
pages of the original documents. To illustrate the non-triviality of this problem, let’s pick an example
from the table: you may have already noticed that in the second column above, the word hello appears
more than once, the second one, (hello, preceded by an opening bracket (. Similarly, there are two
variants of the word who, the second one containing the character-prefix “ (opening quotation mark)
and a character-suffix ! (exclamation mark). The inbuilt PDF search facility, too, would fail to recognize
that these variants are of the same word token! How do these extra variants of words come in? Why
can’t we leave them out and deal with only the main one- the one that does not contain extra characters?
Providing a response to such questions is what software engineers, computer scientists or indeed all
scientists refer to as making design decisions. Design decisions may vary at various levels from one
person to another depending on the way they choose to make representations of the solution
method/approach, or model. Yes, it is possible that there may be two or more cost-effective solution
methodologies to a given problem instance!

One reason we have to pay attention to such extraneous characters is that the algorithm (text processor)
that builds the unique word list carries out comparisons at character level, so hello is not the same as
(hello. This difference cannot be ignored since every unique word may have many such variants
throughout the texts (corpora). For example, the word hello may be encountered at various parts of the
documents in the following variants:

hello (hello hello) (hello) “hello hello” “hello” ‘hello hello’ ‘hello’ [hello hello] [hello] {hello hello}
{hello} >hello hello> hello+

and possibly others. (In our case, we have identified up to 53 such variants for every unique word. This
does not mean that all of them occur in the corpora or are isolated during tokenization and included in
the unique word lists; only those that appear in the corpora are. Still, we need to identify all of them for
reasons that will become apparent when the search engine algorithm is discussed in depth in the later
sections.) Of course, some of the common and standard characters (such as end-of-sentence markers,
e.g. ? . !) were handled at the level of tokenization, but they may still crop up in the resulting word
tokens for various reasons, for example, when a word in a corpus appears with repeated !!! or ?!, :-, or
the like. So, we may still have to include some of the standard end-of-sentence character markers among
the list of 53 characters.
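
To make the idea concrete, here is a hedged sketch of how a variant list might be generated for a given word. The full production list of 53 variants is not reproduced; the wrapping characters below are only a sample chosen for illustration.

import java.util.ArrayList;
import java.util.List;

public class VariantGenerator {
    // A sample of the punctuation characters observed attached to word tokens in the corpora.
    private static final String[] PREFIXES = {"", "(", "[", "{", "\"", "'"};
    private static final String[] SUFFIXES = {"", ")", "]", "}", "\"", "'", "?", "!", "."};

    // Wrap the word with each prefix/suffix combination to obtain its variants,
    // e.g. hello, (hello, hello), "hello", hello?, ...
    public static List<String> variantsOf(String word) {
        List<String> variants = new ArrayList<>();
        for (String p : PREFIXES) {
            for (String s : SUFFIXES) {
                variants.add(p + word + s);
            }
        }
        return variants;
    }
}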

4.2.2.3 Understanding the sources of non-standard punctuations in word tokens

At another level, we have to recognize that while such additional complexities could in principle be
avoided by making sure the corpora are free of such non-standard punctuation, we cannot simply
eliminate word tokens laden with such punctuation, since our corpora are derived from discourses in
which contributors come from various backgrounds with various writing styles, and in any case, the
writings/wordings were a blend of formal and informal scripts characterized by language tones ranging
from happy to furious, gentle to profane, etc. Perhaps I am one of those who contributed the most by
popularizing the use of such bracketed sentence structures, so I know pretty well why such non-standard
characters may crop up as part of the words. [One of the reasons is that, coming from a computer
science background, I am accustomed to writing expressions in which (balanced) parentheses, among
others, are used to define the precedence of operators. I was also introduced to post-fix and infix
operations in a Computer Architecture course unit in my first year of undergraduate Computer Science.
Writing and evaluating bracketed expressions in Algorithms and Data Structures soon became second
nature to me, although, when [I am] writing formal reports, I am cognizant of the difficulty they impose
on comprehensibility.]

4.2.2.4 Offline processing versus real-time processing

So, in recognizing such occurrences of variants of punctuated words, we are in a position to make a
decision as to whether we need to process such structures at the time we are creating the unique word
list above (a sort of offline preprocessing for a semi-automated system), or whether we should wait and
process them during the runtime of the application when a new search key is presented. And we realize
that there are trade-offs! How we carry out the preprocessing offline has a bearing on how preprocessing
has to be done during application runtime. This is the idea that was being conveyed in “In its original
formulation (Vapnik, 1979) the method is presented with a set of labeled data instances and the SVM
training algorithm aims to find a hyperplane that separates the dataset into a discrete predefined
number of classes in a fashion consistent with the training examples.” [1].

In this case, one such design decision made is to treat each of the word variants as a separate token and
include its occurrences in the unique word list file. Not only does this decision lessen the difficulty of
computation during runtime, but it also improves the performance (in terms of accuracy) of related
modules.
4.2.2.5 Output of one module of a model may be input to another module of the same model

For example, let’s consider the task of extending the functionality from searching only for keywords
entered by a user to one with the capability to carry out auto-complete (i.e. making suggestions as/when
search keys are entered). In order to implement this extra functionality, the file of the entire corpora,
with its content replaced by index values, needs to be updated to include the state path. An example
will illustrate how the word states are generated:
Code snippet 1: A simplified subroutine to generate the state path. This fragment of code was written on April 19 2016 as an
enhancement of the computer program initially written (by this same author) and used for coursework in machine learning for
natural language processing (2006). In this simplified code, there are only two end-of-sentence markers: period (.) and question
mark (?). Consideration has been given to including other sentence delimiters such as the exclamation mark (!) in the production
version.
Table 3: An example corpus:

Hello who are you? Ditto innuendo. Wow! Position(s). Who anchor? Egregious “are who!? Ditto (hello.

Table 4: Unique word index file corresponding to the corpus in Table 3.

Table 5: A fully indexed file corresponding to the corpus file in Table 3:

1 5 2 15 4 11 14 3 16 17 5 13 15 6 9 12 15 4 8 14

Here, the indices are delimited by space characters. Note that the index values in column 1 of Table 4
are used to replace the occurrences of the corresponding words in the running text (corpus file) (Table 3).
It is these two files (Table 4 and Table 5) that are preloaded into the client-side working memory from
the (remote) server when a user launches the web application.

The state path is indicated below as a sequence of positions of words (with counting beginning from 1),
on a per-sentence basis:
Code snippet 2: For each of the words in the corpus, this code fragment (function) returns the frequency of the word in all the
possible states. The unmodified code was written in 2006 as part of a program for text processing (machine learning).

Code snippet 3: The unmodified code was written in 2006 as part of a program code for text processing (machine learning)
        1                 2                 …    N-1                 N
1       COUNT(S1, S1)     COUNT(S1, S2)     …    COUNT(S1, SN-1)     COUNT(S1, SN)
2       COUNT(S2, S1)     COUNT(S2, S2)     …    COUNT(S2, SN-1)     COUNT(S2, SN)
…       …                 …                 …    …                   …
N-1     COUNT(SN-1, S1)   COUNT(SN-1, S2)   …    COUNT(SN-1, SN-1)   COUNT(SN-1, SN)
N       COUNT(SN, S1)     COUNT(SN, S2)     …    COUNT(SN, SN-1)     COUNT(SN, SN)

Table 2: Pair-wise counts of states. For the example above, the value would be 5

This new file, in turn, will be fed in as an input to compute the probabilities of pair-wise co-occurrence
of words. The precomputed probability values are used to determine which words are displayed as auto-
complete suggestions as and when a user types the search key in the search word input text field (which
is part of the front-end web graphical user interface). Such values, required by the program module to
compute the desired (intermediate) results, are called parameters. Note that in this example, the module
for predicting words through precomputed probabilities is now part of the enhanced model chosen for
developing the search engine. [This enhanced functionality has not yet been implemented as part of the
current version of the LAONA search engine at the time of preparing this report.] The probabilities are
thus called model parameters. Note that in this example, the values of the probabilities are computed
from counts of pair-wise occurrences of words. We could generate the values from a table of counts of
consecutive occurrences of three or more words (instead of only two), but those cases would be very
rare. What this means is that most of the computed probability values would be zero, hence
auto-suggestions/predictions would not be displayed, except only seldom. This would defeat the purpose
of the enhancement. Instead, unnecessary computation overhead would be incurred. Yet it seems as if
even pair-wise selection does not guarantee yielding the optimal values for the parameters. Yes, [the
choice of] pair-wise counts produces only fair estimates for the probabilities- fair for practical purposes-
which is the reason this process is sometimes referred to as model parameter estimation! The (many)
segments, e.g. pages, parts of layers, etc., have different word co-occurrence characteristics, so we can
only come up with estimates that yield fairly good prediction across the entire corpus, and pair-wise
consideration has been found to work better, on average, relative to other considerations. This is an
example where going by biased decisions (or simply bias) is a good thing in science. Such a bias is an
intelligent/informed one.
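
A minimal sketch of the pair-wise (bigram) counting and parameter estimation described above, assuming the fully indexed corpus is available as a plain sequence of word index values:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramEstimator {
    // Estimate P(next word | current word) from pair-wise counts over the indexed corpus.
    public static Map<Integer, Map<Integer, Double>> estimate(List<Integer> indexedCorpus) {
        Map<Integer, Map<Integer, Integer>> pairCounts = new HashMap<>();
        Map<Integer, Integer> firstCounts = new HashMap<>();

        for (int i = 0; i + 1 < indexedCorpus.size(); i++) {
            int a = indexedCorpus.get(i);
            int b = indexedCorpus.get(i + 1);
            firstCounts.merge(a, 1, Integer::sum);
            pairCounts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
        }

        // Convert the counts to estimated model parameters: relative frequencies.
        Map<Integer, Map<Integer, Double>> probabilities = new HashMap<>();
        pairCounts.forEach((a, followers) -> {
            Map<Integer, Double> row = new HashMap<>();
            followers.forEach((b, count) -> row.put(b, count / (double) firstCounts.get(a)));
            probabilities.put(a, row);
        });
        return probabilities;
    }
}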
Parameters of the enhanced predictive model

Code snippet 3: The unmodified code was written in 2006 as part of a program code for text processing (machine learning)

Code snippet 4: I wrote this (unmodified) code snippet in 2006; part of a program code for text processing (machine learning)
Code snippet 5: I wrote this (unmodified) code snippet in 2006; part of a program code for text processing (machine learning)

[The parameters we are referring to here are the initial probabilities, transition probabilities and the
emission probabilities, among others]. The point being conveyed is that if we had not taken into
consideration the occurrences of the variants of the words, then we would have even poorer estimates
of the counts. Our program will have to aggregate the counts for each of the words under consideration
with respect to the variants identified above. [In programming, we distinguish between formal and
actual parameters].

In fact, the offline computation is critically important, and yet it is so computationally (time) consuming
that we cannot afford to do it during runtime. For example, creating a fully indexed file for Part 1 of
Layer 3 of the LAONA repository involves running a computer program routine which executes for up to
four hours. Indexing Part 2 of Layer 3 takes about the same execution time. Table 5 shows what a fully
indexed file looks like. [How much time would it take to generate an index file corresponding to corpora
as large as a billion pages? Such are the issues faced by search engine giants such as Google, Yahoo, Bing,
etc. But they solved it already!]
Part of the indexing problem is determining the beginnings and ends of the segments of the document.
The segments, in our case, refer to pages in the document.

Code snippet 6:

Code snippet 7:

4.2.2.6 Handling multi-word search keys

In the case of single-word search keys, the program only needs to perform a look-up on the
precomputed unique word list file (Table 7), iterating through the second column, searching for a match
for the single keyword. All the pages that contain match(es) to the keyword, if any, are listed in the
concatenated strings that are the values in the third column of the matching row(s) of the same file.
However, it is a different scenario when the search key comprises two or more words. In a multi-word
search scenario, we are interested in locating the position(s) in the document (i.e. corpus) where the
two (or more) words in the search phrase occur in the order given. For example, if the search phrase is
civil interventions, then we are interested in getting pages that contain the word civil followed by
interventions. We (that is, our programs) have to take into account the fact that there are variants of
each of the words that make up the multi-word search phrase. For example, variants of civil
interventions include “civil interventions, “civil” interventions, “civil “interventions, “civil
interventions”, civil” interventions”, “civil” “interventions?”, among others. We have to account for all
possible combinations. Similarly, a three-word search phrase such as who are you has variants
including who are you?, “who are you”, “who” are! you”. Such punctuations are quite common in a
corpus such as ours, which is characterized by texts from informal conversations, with some of the weird
string characters injected into the texts by stubborn contributors, others as a result of attempts to
communicate to a section of the audience in coded language, and others due to common errors- aside
from those that are correctly placed.

So, in order to handle all possible combinations of word variants, our program was written to contain a
snippet that processes multi-word search requests, treating them as a separate case from the single-word
scenario, although some subroutines (e.g. to generate all identified word variants) are reused. Below is
a code snippet: for now, let’s limit the number of words in a multi-word search, skLength, to 5, but
that limit will be lifted when enhancements are made. The program method accepts the search string as
input (via its formal parameter). The subroutine employs regular expressions to split the search phrase
not only into word tokens but also into variants of the word tokens, using delimiters such as space
characters (as discussed before).

Code snippet 8:
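
The original snippet is not reproduced here. As a hedged stand-in, the following sketch splits a multi-word search phrase into word tokens (limited to five words, as noted above) and expands each token into a sample of its variants; the production routine covers the full list of about 53 forms per word.

import java.util.ArrayList;
import java.util.List;

public class MultiWordSplitter {
    private static final String[] PREFIXES = {"", "\"", "("};
    private static final String[] SUFFIXES = {"", "\"", ")", "?", "!"}; // sample only

    // Sample variant expansion for a single word token.
    private static List<String> variantsOf(String word) {
        List<String> variants = new ArrayList<>();
        for (String p : PREFIXES) {
            for (String s : SUFFIXES) {
                variants.add(p + word + s);
            }
        }
        return variants;
    }

    // Split the search phrase on space characters, cap the number of words at 5,
    // and build one row of variants per word (row 0 .. row N-1).
    public static List<List<String>> tokenizeAndExpand(String searchPhrase) {
        String[] words = searchPhrase.trim().split("\\s+");
        int skLength = Math.min(words.length, 5);

        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < skLength; i++) {
            rows.add(variantsOf(words[i]));
        }
        return rows;
    }
}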
Figure 9 shows how who are you is handled.

Figure 9: An example illustrating how the engine processes who are you. The words of the search phrase
are aligned as rows: Row 0 holds who, Row 1 holds are, and Row 2 holds you together with its variants
(you, “you, you”, ” you”, …).

 Variants of you include “you”, “you, you”, ‘you, you?, ... (up to 53)
 Variants of are include “are”, “are, are”, ‘are, are?, ... (up to 53)
 Variants of who include “who”, “who, who”, ‘who, who?, ... (up to 53)

In a general case where the number of words in the search phrase is N, the last row, when the single
terms of the search phrase are aligned vertically, would be numbered N-1.
Row N-1

1. Start by processing row N-1

Code snippet 9:

Then, after the traversals, at row=N-1, the 2D-array:

M[N-1][0]=you

M[N-1][1]=”you

M[N-1][2]=you”

M[N-1][3]=”you”

M[N-1][5]=[you

M[N-1][6]=(you

M[N-1][53]=you?
Figure 10: The row N-1 and row N-2 iterations for the example who are you. At the row N-1 iteration, the
variants of you (you, “you, you”, “you”, …) are generated; at the row N-2 iteration, each of the (up to 53)
variants of are is combined with the variants of you produced in the previous step.
Processing Row 0

Figure 11: Processing row 0 for the example who are you. Each of the (up to 53) variants of who is
combined with every combination of the variants of are and you produced by the earlier row iterations,
yielding the full set of candidate multi-word patterns.

Row 0 is of particular interest to us, and so it must be handled in a special way. It contains all the
information about the search that we are carrying out. Again, we use regular expressions to split the
resulting (vectorized) string into the desired tokens as indicated in the code snippet (Figure 11).
The meaning of “base case”: row N-1

To get to row 0, however, we have to start from the base case (row N-1).

It is worthwhile to share from my personal experience how I got to learn of the term “base case” as used
in this context. While I was working to design this algorithm (in the sense of writing out a pseudo-code),
I drew out a sketch similar to what you have seen here, but at first, I could not immediately make out a
clear picture of how the row iterations had to be done. It took me a couple of hours, spanning a couple
of days, to figure out the ideas for the procedure depicted in the diagram (Figure XXX). However, from the
beginning, it seemed to me that the problem would most likely be solvable using a recursive procedure,
for:

(i) When searching for a single word (in the case where the search key presented is single
word), the problem is simple even though the algorithm has to handle the 53 or so
variants of the single word. In this case, there would be only one row.
(ii) However, when the search key is a phrase with two or more words, it becomes non-
trivial: not only is the order of the word tokens in the search phrase important, but
when we have to factor in the 53 or so variants, the problem cannot be solved in
one “row step”. In rows N-1 and below (i.e. row N-1, row N-2, … row 0), we are also faced
with the problem of considering ALL possible combinations of word variants from the
current row back to the beginning row (row N-1).

In recursion, a simple case that can be solved in one step is called the base case. I first got to learn
about recursive algorithms as part of “Introduction to Computer Programming” in 2002/2003-
thanks to yet another enthusiastic instructor at undergraduate level, who, unfortunately, taught me
for not more than a couple of months and left to pursue his interests elsewhere. Mr. Richard Obore
was one young man who taught me “Introduction to Computer Programming” course unit. Mr.
Jonathan Kizito came in a year later. Mr Obore was clearly smart. His emphasis on swapping and
recursion during the short time he lectured our class is something I keep remembering whenever I
encounter programming problems that demand solutions with such program constructs. In fact, to
me, it was Mr. Obore who popularized the use of some programming jargon, particularly the
term “snippet”, although I had encountered the rare term in the popular textbook by Kernighan
and Ritchie, “The C Programming Language”, a year earlier. [This book was quite popular among first
and second year students. Students would make use of any of the commercial photocopying bureaus
located everywhere around the university campus to make photocopies of chapters of this book by
Kernighan and Ritchie. Interestingly, while I was in New York, I had, in 2012, a builder housemate, a
man from New Jersey who narrated to me how his father and either Kernighan or Ritchie
(one of the two, I forget who) were close colleagues at Bell Labs many years ago.]

That Mr. Obore could use terms such as “snippet” quite “normally” during lectures inspired me to
adopt usage of the terms in my daily encounters as well. Mr. Obore was a great resource in
that he would make comparisons of different programming languages- e.g. Java vs. C (by way of live
snippets of code- on the whiteboard), in essence resolving some of the puzzles that clouded our
thoughts since we (students) were all at beginner’s level. During one of those lectures, Mr. Obore
went on to explain the difference between the compilation steps of a C program and those of a Java
program. With a Java program, he went on to make a contrast:

Mr. Obore: The Java source code first has to be converted into a byte code by a Java compiler and
the bytecode is executed by Java Virtual Machine- unlike a C program which is compiled into a
machine specific code

Mr. Ogole: Machine code is in 0’s and 1’s format, and of course, the source code is plain text format.
How can I visualize Java bytecode?

Mr. Obore: Hmmm… what’s your name?

Mr. Ogole: Caesar Ogole

[Mr. Obore, in response to my question, explained that Java bytecode was an intermediate form of
code that’s meaningless to human understanding].

That response was good enough in that it helped dissipate the perplexity on my part (and perhaps other
students in the class found it useful). However, questions still lingered- as to what a bytecode looked like.
Some students in my class called Mr. Obore a “moving compiler”. I have a feeling the name “moving
compiler” was started by one classmate called Timothy, because he was really fond of Mr. Obore. That
was in early 2003. I did not get to see Mr. Obore again when he left until sometime in 2008 when I ran
into him at Kisementi, Kampala. I introduced myself to him, and reminded him of what a good
lecturer he had been and of how our class missed him.

Anyway, it was Mr. Obore again that introduced to us the term “base case” in programming- when he
talked about recursion as an alternative to iterative algorithms in certain problem cases. [The term base
case in recursion, by the way, is related to the concept employed in mathematical induction- which is
first introduced at Advanced Level Secondary School (Pure Mathematics)]. However, at that time, most
people had forgotten their A-Level Mathematics. My continuous practice during my Senior Six vacation
in Math, Chemistry and Physics put me at a vantage point: I had working knowledge of the mathematics,
which is the reason I could score 100% in some class tests such as “Computational Mathematics I & II”
during my first year of undergraduate studies, to the amazement of some classmates.

Recursion in programming became much clearer to me a little after Obore’s lectures, especially when I
reached the chapter “Recursion” on the tutorial site which I was following systematically (and
concurrently, as part of my private studies). Below are some lines I would like to echo about recursion:

Two Parts to Recursion:

 If the problem is easy, solve it immediately.


 If the problem can't be solved immediately, divide it into easier problems, then solve the easier
problems.
“A base case is a problem that can be solved immediately. Once you have reached the base case, you can
work your way back up to the original problem.”

Note, however, that our implementation of the search engine algorithm is not exactly recursive.
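
To make the notion of a base case concrete in this setting, here is a hypothetical recursive sketch (not the production code, which, as noted, is not exactly recursive): the base case is the last row (row N-1), which can be answered in one step, and every other row combines its own variants with the combinations already produced for the rows below it.

import java.util.ArrayList;
import java.util.List;

public class RowCombiner {
    // rows.get(i) holds the variants of the i-th word of the search phrase.
    public static List<List<String>> combineFrom(List<List<String>> rows, int row) {
        List<List<String>> result = new ArrayList<>();
        if (row == rows.size() - 1) {
            // Base case: the last row (row N-1) is solved immediately- each variant of
            // the last word is, by itself, a complete combination.
            for (String variant : rows.get(row)) {
                List<String> single = new ArrayList<>();
                single.add(variant);
                result.add(single);
            }
            return result;
        }
        // Otherwise, prepend each variant of this row to every combination built from
        // the rows below it, working back up towards row 0.
        List<List<String>> tails = combineFrom(rows, row + 1);
        for (String variant : rows.get(row)) {
            for (List<String> tail : tails) {
                List<String> combo = new ArrayList<>();
                combo.add(variant);
                combo.addAll(tail);
                result.add(combo);
            }
        }
        return result;
    }
}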

In a few cases during normal conversations, I have found myself using terminologies whose meanings
are derived from technical disciplines. “Kernel” was one of them. “Snippet” is another. Perhaps I would
not have easily known and/or used these terms at all had it not been for my background in
Computer Science.

Another example which will help me clarify certain things is the terminology “pre-emption” which I
remember I used recently in some mundane context – such as “pre-emptive talk”. I first encountered
“pre-emption” as a technical concept when I undertook a formal course in Operating System Design
(undergraduate 2002). Terms such as “round-robin”, “pre-emption”, “context-switching”, “multi-
tasking”, “time-sharing” are used to describe functions of Operating Systems or how the algorithms that
make up an Operating System works. To understand clearly what a pre-emptive scheme is, one has to
get into details of the definite algorithmic descriptions of the steps involved in “pre-emption”. In fact,
my fascination with MIT (the Massachusetts Institute of Technology) originated not from any other thing
but from a textbook, a standard reference textbook for Operating Systems (OS) that I read during my
undergraduate course for the unit Operating Systems Design (2002). A section of the book described
how MIT pioneered the development of Operating Systems and went on to cite some computer program
that had been left to run for a record length of time. (I suppose that section was describing a function of
some kinds of Operating Systems, such as those for server computers.) And according to the textbook, that
advancement took place sometime (way back) in the 1960s or so! I do not remember the title of the
textbook nor its author(s)- it had some dinosaur pictures on its hard cover design- but I was fascinated
because it was overwhelming to imagine how a program built from little pieces of algorithms described
in the form of “round-robin”, “pre-emptive algorithms”, etc. could execute for such a long
time! If I could program an operating system, then I could write any application program, I thought. I
shared what I had read with a couple of classmates and that’s how MIT soon became an iconic
technology reference, to some of us. It is from the same OS textbook that I came to understand that
(some of) those little pieces of programs when in execution are called processes. Each process has a
unique identifier called Process ID (PID). That’s why I thought it was funny when I saw PID also being
used elsewhere to mean pelvic inflammatory disease.

In fact, when I was writing the [this] search engine code, especially the aspect that selectively retrieves
and loads portions of (indexed) data into working memory, I knew the task I was undertaking was not
anything novel, for I could recall that one of the functions of an operating system is memory
management- and that implied that people who wrote system programs such as Operating Systems had,
in some domains, already solved some of these problems before. Likewise, when I was working on level 1
automation (for sending out notifications/reminders- see Figure 1), I knew I had studied, to some level of
detail, batch processing as one of the approaches used by an Operating System to manage
computational resources- although I had not had an opportunity to carry out an implementation of a
batch processor myself. Exactly how batch processing is implemented depends on several factors- I knew.
So, anyway, we now know the origin of the term “base case”- from my background:

Code snippet 10:


Code snippet 11:

What it means for long search words

Building of large vectors

Let me make an informal observation, based on my personal experience: it is much more tedious and
uninteresting to write down an exact textual description of an algorithm than to write actual code.
Providing such a description, e.g. for the search engine algorithm, presents a challenge and reminds me
of the sort of puzzlement I encountered when I was taking a formal introduction to Computer
Architecture and Organization. Reading a popular textbook for Computer Architecture, I would wonder
how I could commit to memory procedural descriptions such as:

“When the status counter increments, the mem bit is toggled and the max flag is reset”.

Such descriptions were pervasive- a marked characteristic of such books. But I just found an explanation
that confirms my thoughts:

“Two measures of complexity are the length of message required to describe a given phenomenon and
the length of the evolutionary history of that phenomenon. On a certain view, that makes a [Jackson
Pollock] painting complex by the first measure, simple by the second, whereas a smooth pebble on a
beach is simple by the first and complex by the second. The simplicity sought in science might be
achieved by reducing the length of the descriptive message – encapsulation in an equation, for example.
But could there be an inverse relationship between the degree of simplicity and the degree of
approximation that results? Of course, it would be nice if everything turned out simple, could be made
amenable to simple description. But some things might be better or more adequately explained in their
complexity – biological systems come into mind”.- an excerpt from a book by John Brockman.

I would also add (certain) computational models to the list. In fact, the summary above is probably
close to what I had wanted to say with regard to Occam’s razor, which I got to learn of in 2005: “No
positing plurality without necessity”.

Time and space complexities as one means of describing complexity of algorithms

- Time complexity
...

- Space complexity


Returning approximate matches: permutations

(multi-layered )- layered in all directions

Figure 12: The row-by-row combination of the word variants (who, are, you), as in Figure 11, here
considered for returning approximate matches (permutations), layered in all directions.
5. When we can’t do anything else but exhaustive search! Naïve thoughts on
possible applications
Let’s look at this excerpt, taken from [1], last paragraph, page 4: SVMs have also been used for
feature selection. Pal (2006) investigated methods for feature selection based on SVMs. Citing the
unreasonably large computational requirements as a major disadvantage of exhaustive search
methods in practical applications, the researchers justified the use of a non-exhaustive search
procedure in selecting features with high discriminating power from large search spaces. SVM-
based methods combined with GA were compared with the random forest feature selection method
in land cover classification problems with hyperspectral data and small benefits were identified.
Zhang and Ma (2009) addressed the issue of feature selection in SVM approaches. They
implemented a modified recursive SVM approach to classify hyperspectral AVIRIS data. The
reduced dimensionality returned slightly better results, however their method has higher
computational demands compared with others. On the same subject, Archibald and Fann (2007)
provided an interesting integration of feature selection within the SVM classification approach. They
achieved comparable accuracy while significantly reducing the computational load.

My interest, in this case, is in exhaustive search methods in practical applications:

 feature selection
 unreasonably large computational requirements as a major disadvantage of
 high discriminating power from large search spaces
 modified recursive SVM approach to classify hyperspectral AVIRIS data

Update functions and the rebirth of mathematics

Mathematical basis

A search operation in which the search key is a single word (e.g. civil) requires that the
algorithm create an intermediate output which is a list of all the possible variants
corresponding to that word:

The list, a one-dimensional array, would look like this:

{civil, “civil, civil”, “civil”, civil?, civil!...}

This conversion of the search key to its variants is done in real time as the search is performed,
as described in Section XXX (handling single-word keys).

In the case where at least two words are specified as a search phrase (e.g. civil interventions),
we are faced with the problem of handling multi-word search scenarios (as described in Section
XXX). Each of the words in the search key has 53 variants.
Let us assume, for simplicity, that each of the words has only 3 possible word forms (variants):

civil: {civil, “civil, “civil”}

interventions: { interventions, “interventions, “interventions”}

Recall that in the multi-word search scenario, we use a preloaded fully indexed file- the content
of which is the numeric index values that replace all the word tokens in the original
corpus (see Table 5). Let’s assume the index values (from the unique index files) are such that:

Word                Index value
“civil”             2
“civil              3
civil               1
“interventions”     5
interventions”      6
interventions       4

Then, the string array C= {civil, “civil, “civil”}, after indexing, would be replaced by C= {2, 3, 1},
string array I= {interventions, “interventions, “interventions”} after indexing, would be replaced
by I= {5, 6, 4}.

Still, for our model, the problem scenario in the search task involves creating an intermediate
matrix/array of all combinations (specifically, ordered pairs) of variants of the two words:

{(2,5), (2,6), (2,4), (3,5), (3,6), (3,4), (1,5), (1,6), (1,4)}

The set of ordered pairs above, obtained from C and I and denoted by C X I, is called the Cartesian
Product (CP) of C and I. Therefore, mathematically speaking, one of the tasks in handling a
multi-word search problem can be thought of as the problem of finding the Cartesian Product
of sets, where the elements of the sets C and I (or of any fixed number of sets) correspond to
the variants of the individual words in the search phrase.

Mathematical notation:

C X I = {(c, i) | c ∈ C and i ∈ I} = {(2,5), (2,6), (2,4), (3,5), (3,6), (3,4), (1,5), (1,6), (1,4)}
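
In code, the Cartesian product of the two indexed variant sets from the example above could be formed as follows (a hedged sketch, using the assumed index values C = {2, 3, 1} and I = {5, 6, 4}):

import java.util.ArrayList;
import java.util.List;

public class CartesianProduct {
    // Form every ordered pair (a, b) with a taken from c and b taken from i.
    public static List<int[]> of(int[] c, int[] i) {
        List<int[]> product = new ArrayList<>();
        for (int a : c) {
            for (int b : i) {
                product.add(new int[] {a, b});
            }
        }
        return product;
    }

    public static void main(String[] args) {
        int[] C = {2, 3, 1}; // indexed variants of civil
        int[] I = {5, 6, 4}; // indexed variants of interventions
        for (int[] pair : of(C, I)) {
            System.out.println("(" + pair[0] + "," + pair[1] + ")");
        }
        // Prints the nine ordered pairs of C X I: (2,5), (2,6), (2,4), (3,5), ...
    }
}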

So we can say that for our search engine to return true (i.e. FOUND), it is necessary that the
search key, e.g. (2, 5)- corresponding to “civil” “interventions”- be a member of the set formed
by the Cartesian Product of C and I. The term necessary is used because not all the patterns
corresponding to the ordered pairs in the Cartesian Product (CP) may be present in the text being
searched- but when a search returns FOUND, the found pattern must be an ordered pair that is a
member of the Cartesian Product of the sets of variants of the individual words constituting the
search phrase. Consider the following text as the corpus:

Page 2

Other great words characterizing civil interventions in times of crises included, “I think that the Secretary
General made her call based on what she believed was one of the tools she had in her arsenal to make
decisions…”

Page 1292

I am sure a lot of us were turned off, but the people who were turned off all the time should also not claim
that they are saints for they had ample opportunity to intervene to make things better by offering appropriate
pieces of advice, - and those "civil interventions" would constitute or stand out as our "quotable" lines
among the many "not funny and not exciting" ones. Basically, anyone who simply kept quiet or did not
contribute much -claiming to have been turned off - will not carry the title of a saint. Let's be clear about.
that! Those people will fall under the category of blank answer sheets turned in by an examinee.

Example corpus with sample texts picked from Layer 3 Part 1: A search of this corpus for civil and
interventions- in the given order- would return true (i.e. FOUND): the matching patterns are civil interventions and
“civil interventions”, which correspond to elements (1,4) and (3,6) of the CP.
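
A hedged sketch of the membership test described above: scan the fully indexed text of a page (a sequence of index values) for two consecutive values matching any ordered pair of the Cartesian product.

import java.util.List;

public class PairMatcher {
    // A page "contains" the search phrase if some ordered pair from the Cartesian
    // product occurs as consecutive index values in that page's indexed text,
    // e.g. (1,4) for: civil interventions, or (3,6) for: "civil interventions".
    public static boolean found(List<Integer> indexedPage, List<int[]> cartesianProduct) {
        for (int i = 0; i + 1 < indexedPage.size(); i++) {
            for (int[] pair : cartesianProduct) {
                if (indexedPage.get(i) == pair[0] && indexedPage.get(i + 1) == pair[1]) {
                    return true;
                }
            }
        }
        return false;
    }
}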

Subsets

All the ordered pairs, such as those in the set {(3,6),(1,4)}, that return FOUND when a
search is carried out on this example corpus must also be elements of the CP. The set S =
{(3,6),(1,4)} is said to be a subset of the set CP = {(2,5), (2,6), (2,4), (3,5), (3,6), (3,4), (1,5), (1,6), (1,4)},
and this is denoted by S ⊂ CP. If S has fewer elements than the set CP, then S is said to be a proper subset.

Definition: For any sets A and S, A is a subset of S if, and only if, every member of A is a member
of S. We denote that A is a subset of S by A ⊂ S

I quoted that definition for one reason: to illustrate how some of the important concepts, when
formally considered, can be described by what some mathematicians call “elegant” or
“beautiful” mathematical representations such as sets and functions. Sets, Logic and Functions
form the mathematical basis of computing. This particular one was taken from lecture notes for
the course “Mathematical Basis for Computing- 2009”- by Prof. Howard Blair of Syracuse
University.
There are similar definitions of: superset, dyadic relations, functions, etc.

The point I am trying to convey is that these seemingly simple fundamentals of mathematics
that we started learning at lower primary school are very useful building blocks for
creating a way of seeing things clearly (as noted in the lecture materials referenced). In fact,
formal mathematical definitions such as that above are among the first things I encountered
during my first year of undergraduate study, in the course “Computational Mathematics I” (2002). Mr.
Stuart Katungi- a good mathematician (and computer scientist), I should say- rekindled my
interest in maths when he started off my very first semester at the university with really good
teaching. Yet another dedicated teacher, his style bore a very striking resemblance to the notes
provided in “Mathematical Basis for Computing”.

But my goal of discussing the mathematical basis of computing in this context is one, and an
important one: functions…

Definition 2: A dyadic relation from set A to set B is a subset of A X B.

In relation to the search engine model, we can say that the set S = {(3,6), (1,4)} that contains the
FOUND elements, with S being a subset of C X I, is described as a dyadic relation.

Mathematics as a language

On Feb 26 2014, I posted a rather informal essay on Climate Colab. Here is one part of my argument for
the need for languages that offer vocabulary describing concepts as succinctly as possible: “Currently,
many dialects in African continent lack the expressive power to describe technical jargons such as
‘carbon dioxide’ and ‘global warming’, but with multi-disciplinary effort, new words can be coined and
adoption of usage accelerated by a platform such as the one we are proposing here. How can one know
a concept that has no word to describe it?”

When one gets to think of such problems related to languages as described above, aside from issues
related to climate change, then the need for the language of mathematics exemplified above in the form
of sets becomes apparent. In section 4.2.2.6 Handling multi-word search keys, I stated that we
need to account for “all possible combinations” of word variants. But the “all possible
combinations” we were referring to (in that section) actually were only those combinations that
take into account word tokens arranged in a certain order. The select group of ordered
elements in the case of a search engine- as we noted- can best be described (more succinctly)
using well-constructed terms such as “Cartesian product” and the like, coupled with methods
for symbolic representation. An exact description in natural language would often be quite
lengthy and prone to errors- with variations being encountered from one person to another. When
a person develops mastery of a language such as mathematics which allows for succinct
representations of concepts in certain domains, then ambiguity- a problem inherent in all
natural languages (and dialects)- becomes less of a problem. Ambiguities present a major
setback in using natural languages, for example, in describing reality as accurately as possible.
At first, mathematical language and symbolism appear to be pompous, until you encounter a
situation when you really need them!

An excerpt from the textbook “The Politics of Hate”- a textbook by Prof. Emeritus John Weiss (Anti-
Semitism, History and Holocaust in Modern Europe)- presents some historical insight into why science and
mathematics were more “portable” than language/literature in the past- and which seems to offer some
plausible explanation for even the current state of affairs.
- Fig. An excerpt from the book, “The Politics of Hate”- a textbook by Prof. Emeritus
John Weiss

Design to allow integration of other web services


Applications
- Numerous applications have been cited in the review paper [1]
- But here are a couple more

1. Creating an up-to-date language dictionary: use this model to create all possible
combinations of words
- Cannot be done manually
- Use a corpus to eliminate combinations of word phrases of chosen lengths that have
not appeared in use- assuming the corpus contains all word tokens as well as word
phrases of that language.

2. Image object recognition
- Cannot be done manually
- Security of humans and of property, including of the computer itself (though this is often separated
from computer security)

Compare and contrast: image object recognition vs. natural language processing

An image of an object may vary in size, shape, color and texture depending on several factors, for
example camera settings, orientation, atmospheric conditions, etc. This is analogous to the problem in
natural language processing where word tokens have many variants, e.g. (hello and hello, where the
sources of the extraneous characters were various, as described in “4.2.2.3 Understanding the sources of
non-standard punctuations in word tokens”.

f1, f2, f3 are mathematical functions or such symbolic manipulations

 Could be any mathematical transformation functions:


 images from UCI repository
4. Materials science and drug discovery

Materials science and Drug discovery


Background of my knowledge of chemistry...

Source: National Geographic (last accessed on Oct 15)

What is element 115’s name?


Representation of the periodic table

Goal: - to search for all possible combinations.

Scenario 0: How are new chemical elements discovered/created? The National Geographic (NG)
article describes the process.

Scenario 1: Suppose I were an expert – and I knew the nature of each of the chemical
elements. …

Scenario 2: Suppose I were naïve but I have worked hard enough and gathered all these
elements just like I gathered all the unique words –

…and I have also proceeded to make an effort to arrange them in the order presented in the
periodic table…just like I arranged them in unique word list … Compare tools in natural
language processing and tools in chemistry

Also suppose I have worked hard enough to the point of studying each of these elements (e.g.
meaning of words and phrases including special occurrences such as idioms in the case of
natural language processing… )- know the nature of all possible combinations of these
elements under certain conditions (temperature, humidity, ratio of each constituent element..)..
In natural language processing… combinations include phrases that make meanings..

**TOOLS VARY, BUT REASONING SAME

But let’s look at the difficulty of the problems from a lay man’s perspective

1. Let’s say, the problem in materials science is: how do I find the strongest material…?

A naïve approach is to go about testing each of the combinations of the materials under varying
conditions until you find the strongest material! Here is the challenge: the number of distinct
building-block materials may be large but finite, yet the number of conditions under which each
of the combinations has to be tested may also be very large, if not infinite, and the number of
possible combinations infinitely large! How long would it take to test all the possible combinations
before the hardest material is found? And how about the risks associated with
undertaking the processes?

[I have the privilege to rub shoulders with some of the leading and ambitious scientists and
engineers working at the frontiers of these problems, but you see, I did not get to ask them
questions of such a nature although I had opportunities to do so. The reason is that I was too
pre-occupied with such problems in my own domain. Maybe you want to ask scientists such
questions.]
2. Let’s say, the problem in drug discovery is: how do I find a cure for a
certain ailment (disease) such as HIV/AIDS, cancer…?
Again, let’s assume that the problem of finding a “cure” reduces to the problem of finding a suitable
combination (or a finite number of combinations) of the chemical elements from the periodic table, as
discussed before. So the first challenge is to build an infrastructure that facilitates speedy search for
combination(s) that will “arrest the virus”, “chock it” and “block its multiplication”- in the case where
the cause of the disease is confirmed to be a virus, to give an example. I have borrowed these terms
(“arrest the virus”, “chock it” and “block its multiplication”) from Ogwal’s opinions posted on a blog
sometime back, on December 15, 2012. So let’s assume that the structure (bonding- which includes
properties and behavior), or “biochemical structure” as Ogwal calls it, of this virus is well known. A
computer scientist with a background in object oriented programming may want to think of
a virus as an object (an instantiation of a class). Recall Section 2.2, How reading general literature
on computers during my Senior Six vacation proved to be very useful during my undergraduate studies,
where I discussed “elusiveness in computer security”. The HIV-object analogy makes sense because, like
HIV viruses, most objects also change states during runtime/lifetime. [String objects are immutable.]

So, if I wanted to negate the change induced by a change in the state of an object, I would have to know
all the possible states that the object could be in during its lifetime. And I would develop a
method/treatment capable of dealing with the object in any of those distinct states, on a case-by-case
basis. This problem is similar in nature to the problem I referred to earlier as "elusiveness in computer
security". It is also similar to the problem in materials science (my naïve take) of searching for the
strongest material, where there are so many possible combinations and conditions to search through.
Each time a new combination is created, tests have to be carried out to compare its strength with the
previous one; only after tests have been carried out for all the possible combinations can we posit that
a given combination yields the strongest material, for example. So it seems there are no solutions to
such problems unless it is possible to search for and test each (proposed) solution, one that will "arrest
the virus", "chock it" and "block its multiplication", and this requires that there be a method to deal with
each of the possible states that the HIV virus can be in. This problem, like the computer security problem
and the materials science problem, requires an exhaustive search, administering a remedy to each state
with the goal of stopping it at that state.
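
As a rough object-oriented sketch of that idea, assuming an invented set of states and "remedies" (they are not drawn from any real treatment, nor from the LAONA code), one would enumerate every state the object can be in and attach a handler to each, so that no reachable state goes unhandled:

// Illustrative sketch only: if every possible state of an object is
// known in advance, a remedy can be prepared for each state so that
// the object is dealt with no matter which state it is found in.
public class StateRemedy {
    enum State { DORMANT, ACTIVE, MUTATED }                  // hypothetical, assumed-complete state list

    static String remedyFor(State s) {
        switch (s) {
            case DORMANT: return "monitor";
            case ACTIVE:  return "block multiplication";
            case MUTATED: return "re-characterize, then treat";
            default:      throw new IllegalStateException("unhandled state: " + s);
        }
    }

    public static void main(String[] args) {
        for (State s : State.values()) {                     // exhaustive: one remedy per state
            System.out.println(s + " -> " + remedyFor(s));
        }
    }
}

The weakness is exactly the one described above: the approach only works if the enumeration of states really is exhaustive.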
Our LAONA search engine is, of course, yet another live, working example that demonstrates the
feasibility of the idea of exhaustive search, particularly where we want to search for all possible
combinations (including returning occurrences of phrases of all possible lengths derived from the
search phrase). Returning FOUND for occurrences of phrases of every possible length from the search
phrase is the equivalent of finding all the possible combinations, or compounds, formed from the
chemical elements summarized in the periodic table. Enhancing the search engine by going deeper,
isolating and returning FOUND only for meaningful combinations of words, would be analogous to
isolating only the chemical compounds that are useful. Of course, in the case of the search engine, I
need some prior mechanism in place to tell that a given combination of words is meaningful; in the
case of biochemistry, I need a way of testing that a given combination is useful. This is where finding a
cure for certain diseases becomes a nightmare:
(i) Exhaustive search is not practical, because each compound (treatment) found has to be
tested, and the search space is very large (effectively infinite if we naïvely consider all
possible combinations of the periodic elements under all possible physical conditions). Unlike
in the case of a search engine, where calculations take place very fast, the search for a
"cure" is painfully slow, because testing involves living things and huge sacrifices! Would
parallelization, an arrangement where different labs test different combinations, help speed
up the search? That requires very systematic collaboration in which the methods applied are
consistent (refer to parallel processing in Layer 4). From the search engine perspective,
think of processing each data point (generating the updates corresponding to a given word
variant) in a given row step in parallel in order to speed up the process in the multi-word
search scenario; a sketch after this list illustrates the idea.
(ii) The assumption is that the list of elements in the periodic table is exhaustive. What does a
new discovery, such as that of element 115, say about that assumption?
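
Tying this back to the search engine, the following Java sketch, written purely as an illustration and not as the LAONA implementation, enumerates every contiguous sub-phrase of a multi-word query, of every possible length, and reports FOUND for the ones that occur in a (hypothetical) text; the sub-phrases are checked in parallel, in the spirit of the lab-per-combination idea in (i):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: for a multi-word query, generate every
// contiguous sub-phrase of every possible length and report FOUND for
// the ones that occur in the text. Checking the sub-phrases in parallel
// mirrors the "different labs test different combinations" idea above.
public class AllLengthPhraseSearch {
    public static void main(String[] args) {
        String text  = "the path to creating the LAONA search engine"; // hypothetical document text
        String query = "creating the search engine";                   // hypothetical search phrase
        String[] words = query.split("\\s+");

        List<String> subPhrases = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {           // every starting word
            StringBuilder phrase = new StringBuilder();
            for (int end = start; end < words.length; end++) {         // every possible length
                if (end > start) {
                    phrase.append(' ');
                }
                phrase.append(words[end]);
                subPhrases.add(phrase.toString());
            }
        }

        subPhrases.parallelStream()                                     // check sub-phrases in parallel
                  .filter(text::contains)
                  .forEachOrdered(p -> System.out.println("FOUND: " + p));
    }
}

For a query of n words there are only n(n+1)/2 such sub-phrases, so this "all possible lengths" search stays tractable, which is precisely what the chemical and biological searches above do not.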

Note: This is a layperson's perspective. It has very little to do with my experience working in the
biopharmaceutical industry, though I gained invaluable experience there and have come to appreciate
the role of technology in that industry.

LAONA Creation - HIV analogy (sketch)

But let's not even go far. Think of the creation of LAONA. That is perhaps the perfect analogy for the HIV
puzzle, but fortunately, it has been solved for those who read through and understood the content! And
by no one else but myself, at huge sacrifice! So I am in a better position to describe the problem. One
could think of the documentation as the "LAONA pathogenesis". Pathogenesis, I gather, refers to the
manner in which a disease develops and spreads. Indeed, the subtitle of the LAONA documentation
(Layer 3 Part 1 and Layer 3 Part 2), "The Path to creating of LAONA…", combined with the very first
subtitles in Part 1, "In the beginning..." and "The genesis of …", first applied on Dec 31 2013, matches
that description of pathogenesis. The LAONA pathogenesis has been captured to a very high degree of
accuracy. Looked at this way, LAONA is a disease that anyone who joined the group would contract,
gradually, by association or by being confronted with well-packaged, highly destructive lies. To survive,
one has to continually fight off the lies, but the result is that the lies, like the HIV virus, evolve from one
form to another. The remedies have to be dynamic in nature too. "Killing white blood cells" seems to
have the literal meaning of "destroying healthy logic", and producing more immunity (more white blood
cells) seems to parallel the process of generating more and more positive knowledge to counter the
negativities.

But the LAONA documentation is more than "LAONA pathogenesis", as noted. It contains the remedy as
well, in that hitherto unknown methods of addressing the problems have been uncovered. And the
LRA/KONY problem, which has already been solved, falls under the same category.

Why would immunologists and virologists hold conflicting views on the HIV puzzle? Aren't they coming
from the same biology and chemistry background? Note that I have not said anything about the root
cause of HIV/AIDS, and I do not think digging into conventional archives to study the history of the AIDS
disease would tell us, with a high degree of certainty, the root cause of the disease.

Ogwal's blog post may hold some water, though it would be good if he presented evidence to
substantiate his claims. Anyone can make claims, but those claims must be substantiated for them to be
believed.

To be continued

List of tech heroes

So the list of my tech heroes is endless! End users are among them, for without them,
technologies are useless!

[1] Giorgos Mountrakis, Jungho Im, and Caesar Ogole, "Support vector machines in remote sensing: A
review".

Solution

CS of periodic table

AI of periodic table

- Cs6[Al2O6] has been prepared for the first time by annealing intimate mixtures of
Cs2O and CsAlO2

MITCSAIL

etc, etc, etc, etc…

PROBLEMS SOLVED. YOU WILL READ THE REST IN JOURNALS IN THE FUTURE

CSAI.
