MASARYK UNIVERSITY
FACULTY OF INFORMATICS
A Spell Checker
for Esperanto
BACHELOR THESIS
Marek Blahuš
Brno, May 2008

Declaration
I hereby declare that this paper is my original authorial work, which I have worked out on
my own. All sources, references and literature used or excerpted during the elaboration of
this work are properly cited and listed in complete reference to the due source.
Supervisor: RNDr. Petr Sojka, Ph.D.
Acknowledgement
I would like to thank RNDr. Petr Sojka, Ph.D., the supervisor of my bachelor thesis, for his
comments, suggestions and the time he spent helping me with this work.
I would also like to thank Dr. Petr Chrdle, CSc., owner of the KAVA-PECH publishing
house, who has provided me with a complimentary copy of the Plena ilustrita vortaro de
Esperanto 2005. I am grateful to Dr. Ludvic Lazar Zamenhof, the initiator of Esperanto.
Abstract
This thesis provides a brief overview of spell checking software and describes the process
of constructing a spell checker for the Esperanto language and its implementation as a
dictionary (i.e. an affix file and a word list) for the Hunspell spell checker. The word list is
an adaptation of word roots taken from the renowned Esperanto dictionary PIV. Recognition
of morphologically complex words, which are common in Esperanto due to its agglutinative
nature, is made possible by the affix file, which has been built from the ready-made
morpheme segmentation of word derivations appearing in the same source. The rules derived
in this process have been improved by a semantic classification of all roots involved, for
which a system has been created based on corpus analysis and several specialized
dictionaries, in combination with knowledge of the capability of each affix to accept roots
from different semantic classes, acquired from the PMEG reference grammar. The resulting
spell checker is a working proof of concept, to be further improved and integrated into the
grammar checker project of the E@I organization.
Abstrakto
Tiu ĉi disertaĵo donas koncizan trarigardon de literumkontrola programaro kaj priskribas la
procezon de konstruado de literumkontrolilo por la lingvo Esperanto kaj ties implementon
forme de vortaro (t.e. afiksa dosiero kaj vortlisto) por la literumkontrolilo Hunspell. La
vortlisto estas adaptaĵo de vortradikoj venantaj de la renoma Esperanto-vortaro PIV.
Rekono de morfologie kompleksaj vortoj, kiuj estas oftaj en Esperanto pro ties aglutina
ĥaraktero, estas ebla pro la afiksa dosiero kiu konstruiĝis surbaze de preta morfemstrukturigo
de vortderivaĵoj aperantaj en la sama fonto. Reguloj ekestintaj en tiu procezo estas plibonigitaj
per semantika klasado de ĉiuj engaĝitaj radikoj, por kio kreiĝis sistemo bazita sur
tekstara analizo kaj kelkaj fakvortaroj, kombine kun scioj pri akceptemo de ĉiu afikso al
radikoj el diversaj semantikaj klasoj, akiritaj de la gramatika manlibro PMEG. La rezultinta
literumkontrolilo estas funkcianta koncept-pruvo, plibonigota kaj integrota en la
gramatikkontrolilan projekton de la organizo E@I.
Keywords
spell checker, Esperanto, Hunspell, corpus, morphology, semantic classification
Ŝlosilvortoj
literumkontrolilo, Esperanto, Hunspell, tekstaro, morfologio, semantika klasado
Table of Contents
1 Introduction
2 Overview of Existing Software
2.1 Kontrolu Literumadon
2.2 Esperantilo
2.3 Ispell
2.4 GNU Aspell
2.5 MySpell
2.6 Hunspell
2.6.1 Structure of Hunspell Data Files
2.7 Overview Table
3 A Hunspell Dictionary for Esperanto
3.1 Strengths and Weaknesses of the Existing Dictionaries
3.2 Adopting a Suitable Approach to Esperanto Morphology
3.2.1 The Structure of Words in Esperanto
3.2.2 System for a Semantic Classification of Stems
3.3 DFD Diagram for Dictionary Construction
3.4 Compiling a New Word List
3.4.1 Identifying Relevant Sources for the Word List
3.4.2 Moore Machine for Semantic Classification
3.4.3 Implementation of the Semantic Classification
3.5 Compiling a New Set of Affix Rules
3.5.1 Dictionary-Based Word Derivation System
3.5.2 Implementation of Esperanto Morphology in Hunspell
3.6 Integration in OpenOffice.org and the E@I Grammar Checker
4 Evaluation of the Newly Constructed Dictionary
5 Conclusion
Bibliography
Appendix A: The 16 Rules of Esperanto Grammar
Appendix B: An Overview of Esperanto Affixes
Chapter 1
Introduction
The development of computer technologies and the internet has had a strong impact on the
world in which we live, often introducing major changes into certain fields of human
activity and into human communities, and sometimes even giving brand new potential to
tools which we already had before. Such is also the case of Esperanto, the international
auxiliary language created in 1887 by Dr. Ludvic Lazar Zamenhof, and of its worldwide
community of speakers, which, according to some sources (Gordon, 2005, the article about
Esperanto), is estimated to number up to 2 million people, spread across 115 countries of
the world from South America through Europe to eastern Asia.
The impact of these new technologies on the Esperanto community, a language-based
diaspora, has been mainly positive: the borders between countries seem to be disappearing,
geographical distances are becoming less of a burden, and there are more opportunities for
maintaining international contacts. Esperanto speakers and students can meet each other
more easily, and documents, music, literature and other resources in the language are
easier to find. Never before has it been possible to address such a wide audience at once,
in Esperanto as in other languages. There are hundreds of thousands of webpages in
Esperanto on the internet, and writing an article for the Esperanto Wikipedia or for one’s
own online diary does not take serious effort. Sending an e-mail message to an online
discussion group is much easier than mailing a letter to the editor of a magazine or
sending out bulk mail on one’s own, and people are taking advantage of this.
However, alongside all these positive aspects, negative consequences are emerging as
well. The language capabilities of an average Esperantist have probably not improved
much over recent decades, but more and more published Esperanto texts are now written
by average Esperantists, with little or no attention to their language level, and thus the
quality of Esperanto texts accessible to the public is declining.
On the other hand, technology does not only create the problem; it also gives us the
tools to solve it. In 2007, a group of young people from E@I (Education@Internet, an
international organization promoting the use of the internet among Esperanto speakers) came
up with the idea of creating a language checking software package for Esperanto, intended
both for students who still need to cultivate their language skills and for skilled
Esperantists who type pages of text a day and whose mistakes usually come merely from a
lack of attention. A good spell and grammar checker would be useful in every kind of text
processor, e-mail client and web browser, and everywhere on the internet where texts in
Esperanto are being written (forums, chats, blogs). This bachelor thesis is a part of that
project, and its goal is to design a proof-of-concept realization of the spell checking part
of software that would fulfil such needs.
Chapter 2
Overview of Existing Software
Automatic spell checking by computer has a history of several decades. This chapter
provides an overview of several existing spell checkers which fall into one of three groups:
those exclusively dedicated to spell-checking Esperanto texts; those which support multiple
languages and include Esperanto support by default or as an optional downloadable
dictionary module; and those which at least provide a fitting environment for the creation
of such a module, which may not have been developed so far only because the software is
recent, or because the present community of Esperanto-speaking NLP software developers
lacks interest in it or is unfamiliar with it.
The aim of this chapter is to underline the differences between the various existing
solutions, in order to learn about the good and bad aspects of the work done so far in
this field, and to identify a tool which would provide a good starting point for the new
tool which is to be created. That may mean merely drawing loose inspiration from a
particular solution’s advantages, implementing an Esperanto spell checking dictionary for
an existing piece of software, or simply devising a more efficient version in case a
fitting Esperanto dictionary is already available.
2.1 Kontrolu Literumadon

Kontrolu Literumadon (Esperanto for “Check Spelling”; Lendon, 1992) in its version 1.0,
created in 1992 by Klivo Lendon from Canada using Prolog and distributed as shareware,
is probably one of the very first spell checkers for Esperanto ever developed. It is a
dedicated, stand-alone, non-suggesting spell checker and provides an MS-DOS pseudo-graphic
interface for spell-checking plain text and WordPerfect 5.1 files named on the command line.
Upon execution, the program displays the content of the file and highlights in color the
words which it did not recognize. There is no text-editing or result-outputting functionality.
Since the program was created before the advent of Unicode, a large part of the author’s
effort went into implementing support for those characters of the Esperanto alphabet
which are not present in the basic ASCII character set.[1] The correct appearance of these
characters is secured by rendering text in graphic mode. For recognizing those characters
in the input files, the program understands the common x-convention, circumflex-convention
and Zamenhof-convention, and additionally provides a user-editable character set file
called supersgn.inf.
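To illustrate what comprehension of the x-convention involves, the following is a minimal Python sketch, not Lendon's Prolog implementation. It assumes the usual x-convention mapping (cx, gx, hx, jx, sx, ux standing for ĉ, ĝ, ĥ, ĵ, ŝ, ŭ), which is not spelled out in this chapter; the names X_CONVENTION and from_x_convention are hypothetical.

```python
# Hypothetical mapping of x-convention digraphs to Esperanto letters
# (assumed here; the thesis lists the conventions in Appendix A).
X_CONVENTION = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ",
    "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
}


def from_x_convention(text: str) -> str:
    """Replace x-convention digraphs with the corresponding Unicode letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        letter = X_CONVENTION.get(pair.lower())
        if letter is not None:
            # Keep the capitalisation of the base letter (e.g. "Cx" -> "Ĉ").
            out.append(letter.upper() if pair[0].isupper() else letter)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(from_x_convention("Cxu vi sxatas ehxosxangxon cxiujxauxde?"))
```

A spell checker of that era would apply such a normalisation before looking words up, so that one dictionary serves all input conventions.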
2.2 Esperantilo
Esperantilo (Esperanto for “Tool for Esperanto”; Trzewik, 2003), started in 2003 as
EsperantoEdit, is a stand-alone, multi-platform UTF-8 text editor with special linguistic
functions for Esperanto, including a spell checker, a grammar checker and machine
translation. It is distributed as free software under the GNU GPL, maintained by Artur
Trzewik from Germany and programmed using XOTcl and XOTclIDE, a set of rather uncommon
programming tools. The spell-checking capabilities of Esperantilo rely on an internal
dictionary, and can additionally use existing Hunspell dictionaries or launch Aspell
externally.
[1] See Appendix A for the Esperanto alphabet and the conventions used for representing its
non-ASCII characters in circumstances where an 8-bit character encoding is imposed.
The internal dictionary of Esperantilo consists of a word list of approx. 60,000 word forms
and a list of approx. 9,000 word roots. Spell checking can thus work on two levels: words
which were not found in the word list but may still be considered valid derivations within
the word root system are marked in green, and words which were neither found in the word
list nor recognized as valid derivations are marked in red. For each unknown word the
program provides a set of suggestions.
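The two-level scheme can be sketched roughly as follows. This is only an illustration of the idea, not Esperantilo's actual algorithm: the tiny WORD_FORMS and ROOTS sets, the short list of endings and the function name classify are all assumptions made for the example, and real derivation checking would also have to handle affixes and compounds.

```python
# Hypothetical miniature data; the real lists hold tens of thousands of items.
WORD_FORMS = {"hundo", "hundoj", "bela"}
ROOTS = {"hund", "bel", "kat"}
ENDINGS = ("oj", "aj", "o", "a", "e", "i")  # a small subset of Esperanto endings


def classify(word: str) -> str:
    """Return 'known', 'derivable' (green) or 'unknown' (red)."""
    if word in WORD_FORMS:
        return "known"
    for ending in ENDINGS:
        # Accept root + ending as a plausible derivation.
        if word.endswith(ending) and word[:-len(ending)] in ROOTS:
            return "derivable"
    return "unknown"


for w in ("hundo", "katoj", "hundx"):
    print(w, classify(w))
```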
2.3 Ispell
Ispell (Gorin, Willisson, Kuenning, 1971) is the first in a line of spell checkers which
continues with Aspell, MySpell and Hunspell. It first emerged in 1971, in connection with
the appearance of Unix, and was intended to serve the text processing applications of this
operating system, developed since 1971 by Bell Labs. Ispell was originally written in PDP-10
assembly language by R. E. Gorin, and later ported to the C programming language by Pace
Willisson of MIT. During its evolution, Ispell introduced several innovations, including the
generalized affix description system, which has since been imitated by other spell checkers
such as MySpell, and the programmatic interface for the Emacs text editor, a pioneering
attempt at separating the spell-checking functionality into an external module usable by
other applications. Ispell’s weaknesses include its inability to spell-check texts in
character sets other than basic ASCII (which renders it usable for a very limited set of
languages, particularly those of Western Europe) and its weak correction-suggesting system,
which simply proposes words within a Damerau-Levenshtein distance of 1. These were among
the reasons which supported the later emergence of GNU Aspell as Ispell’s successor. Ispell
is, however, still being maintained, and its current version, International Ispell by Geoff
Kuenning, already supports Unicode and is available under the BSD license.
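A suggestion system based on a Damerau-Levenshtein distance of 1 can be sketched in a few lines. The Python code below is an illustrative reconstruction of the general technique (generate every string reachable by one deletion, transposition, substitution or insertion, then keep those found in the dictionary), not Ispell's C implementation; the names ALPHABET, edits1 and suggest and the sample dictionary are hypothetical.

```python
# The 28 letters of the Esperanto alphabet, used to generate candidates.
ALPHABET = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"


def edits1(word: str) -> set:
    """All strings within Damerau-Levenshtein distance 1 of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word: str, dictionary: set) -> list:
    """Dictionary words that are one edit away from the misspelled word."""
    return sorted(edits1(word) & dictionary)


print(suggest("hudno", {"hundo", "hunto", "kato"}))
```

The sketch also shows why the approach is limited: a misspelling with two independent errors yields no suggestion at all, and candidates are not ranked by likelihood.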
There is a 70,000-item Esperanto word list for Ispell (Pokrovskij, 1997), created by Sergio
Pokrovskij from Russia. This word list was probably the first of its kind, and was later
adapted and used for many other purposes, including the spell checkers which followed
Ispell. Among modern applications which still support Ispell and make use of this word
list, UniRed (an abbreviation of “Unikoda Redaktilo”, meaning “Unicode Editor” in
Esperanto), created by Yuri Finkel of Russia, is worth mentioning.
2.4 GNU Aspell

GNU Aspell (Atkinson, 1998), currently maintained by Kevin Atkinson of the USA as part
of the GNU software system and distributed as free software under the GNU LGPL, is
GNU’s standard spell checker and was first published in 1998 with the aim of eventually
replacing Ispell. The main improvements were better support for spell-checking English
(an advanced suggestion system based on English pronunciation rules) and better memory
management (for example, GNU Aspell can use shared memory for dictionaries when several
Aspell processes are open at once). Later versions have also taken steps towards further
internationalization of the software, including built-in support for UTF-8 (without
having to use a special dictionary) and an effort to respect the current locale setting.
GNU Aspell is written in C++, can be compiled on all Unix-like operating systems as well
as on Microsoft Windows, and can be used either as a library or as a stand-alone
command line program.
GNU Aspell maintains backward compatibility with Ispell. It can be used with virtually
any program that expects Ispell, since it is capable of simulating Ispell’s behavior when
run through a pipe. Although Aspell’s compiled dictionary format is completely different
from that of Ispell, virtually all old Ispell dictionaries have been converted so that they
can be used with Aspell. This was also the case for the Esperanto word list by Sergio
Pokrovskij.
2.5 MySpell
MySpell (Hendricks, 2000) was formerly the spell checker library included in Writer, the
word processor of the OpenOffice.org office suite. MySpell’s main developer, Kevin
Hendricks of Canada, created it in C++, with assistance from Kevin Atkinson (GNU Aspell’s
maintainer), with the aim of integrating various open source spell checkers and adding a
spell-checking capability to OpenOffice.org (a project started in 2000). For every locale
(a combination of language and geographic territory), MySpell can store separate files for
spelling, hyphenation and a thesaurus. The spell-checking routine uses a word list file
(.dic) together with an affix file (.aff), in a manner similar to that introduced by
Ispell, to support languages with a rich affix system.
Important applications using MySpell include AbiWord, a word processor, and Mozilla
Thunderbird and Mozilla Firefox, the e-mail client and web browser of the Mozilla
Foundation. OpenOffice.org itself, however, replaced MySpell with Hunspell in version
2.0.2. The same is expected to happen with Thunderbird and Firefox when they reach
version 3. There are Esperanto word list and affix files created by Dmitri Gabinski of
Belarus, based on the older dictionary of Sergio Pokrovskij (Gabinski, Pokrovskij 2003).
2.6 Hunspell
Hunspell (Németh, 2005a, 2005b, 2005c) is an open source spell checker based on
MySpell, created and maintained by László Németh of Hungary, written in C++ and
distributed as a stand-alone program or a library under a GPL/LGPL/MPL tri-license. It has
been designed especially for languages with rich morphology and a complex system of word
compounding, originally for Hungarian. Its dictionary format is backward-compatible with
that of MySpell, but Hunspell has the extra capability of working with UTF-8 encoded
dictionaries. The flags which name affix classes in Hunspell may also be UTF-8 characters,
which raises the maximum number of affix classes in a dictionary to 65,535. Major
improvements in Hunspell’s spell checking algorithm include support for circumfixes,
twofold suffix stripping and recursive compound rules.
Hunspell was started in 2005, and on March 8, 2006 it replaced MySpell as the default
spell checker in OpenOffice.org (starting from version 2.0.2). The same switch was
requested in a bug report for Mozilla Firefox and Mozilla Thunderbird in 2005, and
since 2008 Hunspell has replaced MySpell in the beta versions of the upcoming Mozilla
Firefox 3.
There is no Esperanto spell checking dictionary dedicated to Hunspell and making use of
its improved support for agglutinative languages, among which Esperanto is often counted.
However, thanks to backward compatibility, the MySpell adaptation of the Ispell Esperanto
dictionary by Sergio Pokrovskij may be used with any software which employs Hunspell as
its spell checker. This is currently the case with OpenOffice.org, for instance.
2.6.1 Structure of Hunspell Data Files

Spell checking dictionaries used by Hunspell consist of two data files. The first is a word
list file containing words of the language; the second is an “affix file” defining the
meaning of the special flags used in the word list and in the word compounding patterns
found in the affix file itself. These flags name affix classes. Depending on the character
of the dictionary, affix classes may refer to actual affixes or even to each individual
morpheme. They may also be used to mark words in the word list with regard to word
compounding. Sometimes, however, they have little or no relation to morphology and serve
merely as an instrument of data compression (through the creation of a “pseudo-affix”
class for word forms which happen to start or end in the same letters). In extreme cases,
affix classes may not be used at all.
Hunspell’s word list files may be recognized by the .dic extension. The first line of the
file contains the approximate word count, followed by the word list itself, with one
word per line. Each word may be followed by a slash (“/”) and one or more flags which
represent the affix classes the word can accept or attributes related to word compounding.
Optionally, a field of morphological information may follow after a tab character or a space.
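The layout just described is simple enough to read with a few lines of code. The following Python sketch is only an illustration under stated assumptions: it treats every character after the slash as one single-character flag, ignores the optional morphological field, and uses a made-up three-entry sample whose words and flags (O, A) are hypothetical rather than taken from any real dictionary.

```python
def parse_dic(text: str):
    """Parse .dic-style content into (declared_count, {word: set_of_flags})."""
    lines = text.splitlines()
    declared_count = int(lines[0].strip())  # first line: approximate word count
    entries = {}
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        # Drop the optional morphological field after the first tab or space.
        head = line.split(None, 1)[0]
        word, _, flags = head.partition("/")
        entries[word] = set(flags)  # assumes single-character flags
    return declared_count, entries


SAMPLE = "3\nhund/OA\nkaj\nbel/A\tpo:adj\n"  # hypothetical sample content
count, entries = parse_dic(SAMPLE)
print(count, entries)
```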
An affix file (.aff extension) has a somewhat more complex structure. It is a collection of
instructions, each on one line, which describe the meaning of the affix classes present in
the word list and set various options which influence the behavior of the spell checker,
such as the character encoding, the character set used in suggestions, the assumed keyboard
layout, explicit lists of frequent misspellings and easily confused letters, etc. For each
affix class, a set of rules is defined which describes the possible derivations the affix
may produce. The following is an excerpt from the US English Hunspell dictionary en_US
which defines the creation of past tense forms of regular English verbs:
SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y
The first line is the affix class header, which states the option name (PFX or SFX for a
prefix or suffix rule, respectively), the flag which denotes the affix class in the word list,
whether the affix may be combined with affixes of the opposite type on the same root
(Y if such cross products are allowed), and the number of rules which follow.
Each affix rule consists of a repetition of the option name and the affix class flag, followed
by the characters which get stripped from the beginning or end of the word when the affix
is applied (or a zero if nothing gets stripped), the affix itself, and the condition under which
it may be applied (a regular expression-like string, or a dot if there is no special
condition). The first rule, for instance, applies to the verb “breathe” (past tense
“breathed”), the second one to the transition from “fly” to “flied”, the third one to
“work” → “worked”, and the fourth one to “play” → “played”.
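The way these four rules select and transform a word can be mimicked with a short Python sketch. It is an illustration of the rule semantics described above, not Hunspell's own code: the condition is assumed to behave like a regular expression anchored at the end of the word, and the names SFX_D and apply_suffix are hypothetical.

```python
import re

# (strip, add, condition) triples copied from the SFX D excerpt above.
SFX_D = [
    ("0", "d", "e"),
    ("y", "ied", "[^aeiou]y"),
    ("0", "ed", "[^ey]"),
    ("0", "ed", "[aeiou]y"),
]


def apply_suffix(word: str, rules):
    """Apply the first rule whose condition matches the end of the word."""
    for strip, add, cond in rules:
        # A dot means "no special condition"; otherwise match at the word end.
        if cond != "." and not re.search("(?:" + cond + ")$", word):
            continue
        if strip == "0":
            base = word
        elif word.endswith(strip):
            base = word[:-len(strip)]
        else:
            continue
        return base + add
    return None  # no rule applies


for verb in ("breathe", "fly", "work", "play"):
    print(verb, "->", apply_suffix(verb, SFX_D))
```

Note that the rules are purely mechanical: they produce “flied” from “fly” regardless of whether that form is the one a speaker would expect, so it is the word list that decides which words carry the D flag at all.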
2.7 Overview Table

In this table, selected properties of the discussed spell checking software are presented:
Property         | Kontrolu Literumadon    | Esperantilo    | Ispell             | GNU Aspell          | MySpell            | Hunspell
-----------------+-------------------------+----------------+--------------------+---------------------+--------------------+----------------------------
Type             | spell checker           | text editor    | spell checker      | spell checker       | spell checker      | spell checker
Target languages | Esperanto               | Esperanto      | multiple languages | multiple languages  | multiple languages | multiple; stress on Hungarian
Author           | Klivo Lendon            | Artur Trzewik  | R. E. Gorin        | Kevin Atkinson      | Kevin Hendricks    | László Németh
Started          | 1992                    | 2003           | 1971               | 1998                | 2000               | 2005
Source code      | Prolog                  | XOTcl          | PDP-10, C          | C++                 | C++                | C++
Platform         | MS-DOS                  | Windows, Linux | Unix               | Unix, Windows       | Unix, Windows      | Unix, Windows
License          | shareware               | GPL            | BSD                | GPL                 | BSD                | GPL, LGPL, MPL
Operation mode   | stand-alone             | stand-alone    | stand-alone        | stand-alone, library | library           | stand-alone, library
Input            | plain text, WordPerfect | plain text     | plain text         | plain text          | plain text         | plain text
Output           | display only            | plain text     | plain text         | plain text          | plain text         | plain text
UTF-8            | no, only surrogates     | yes            | yes, not originally | yes, not originally | no                | yes
Suggestions      | no                      | yes            | yes                | yes                 | yes                | yes
Table 2.1: Overview table of existing spell checking software and its properties.
Chapter 3
A Hunspell Dictionary for Esperanto
Reading through the previous chapter, one can easily see that spell checking is a living
field of computer software which has been progressing since its very origin, today perhaps
more rapidly than ever before. The emergence of enhanced algorithms and new spell
checkers, however, also creates a demand for up-to-date spell checking dictionaries, if the
new technology is to be useful to users in practice. Recently, the rise of Hunspell as the
default spell checker in OpenOffice.org and Mozilla software has been particularly
noteworthy. Hunspell introduces some features that its predecessors lacked, particularly
with regard to the processing of agglutinative languages with a rich morphology. This fact,
together with the fact that E@I has recently started a project for an Esperanto grammar
checker which should also include a spell checking component, constitutes an especially
good opportunity for a new Hunspell dictionary for Esperanto to be constructed. Such a
dictionary may try to benefit from the new features of Hunspell, as well as make use of
some recent resources such as larger corpora, and thus represent an update of the existing
MySpell dictionary for Esperanto, which still carries with it the limitations of the
original Ispell dictionary it is based on. For these reasons, Hunspell was chosen as the
spell checking engine used to implement the dictionary that is the subject of this paper.
3.1 Strengths and Weaknesses of the Existing Dictionaries

The present Esperanto spell checking dictionary for MySpell is poorly documented. It
consists of a word list of 19,342 items and an affix file with 58 affix classes (34 prefixes
and 24 suffixes) and a total of 2,426 affix rules, listed in alphabetical order without any
comments. There is a README file for the dictionary, but it limits itself to stating the
names of the authors (Pokrovskij as author of the original word list and Gabinski as the
one who adapted it for MySpell) and a reference to the GNU General Public License.
Fortunately, the older work by Pokrovskij may be used to understand the employed system
of affix classes, since the MySpell dictionary evidently uses the same affix classes and
the flags assigned to them, although I have not been able to get in contact with Dmitri
Gabinski to confirm this supposition. A list of the Ispell dictionary’s affix classes and
their meaning can be found as comments in the respective affix file, and indeed, this list of
26 affix classes is a subset of the affix classes used by the MySpell dictionary. Additionally,
the MySpell dictionary introduces several new ones, evidently in an attempt to simulate
word compounding (they include prefixes such as numerals or possessive pronouns, which
can enter a compound in Esperanto, such as “trikapa”, meaning “three-headed”, from “tri”
and “kapa”); compounding in this form is not present in the work of Pokrovskij.
In any case, the restrictions of MySpell, which supports only single-byte flags for affix
classes, seem ill-suited to an agglutinative language like Esperanto, whose principles of
word building would require a much higher number of available affix classes if they were to
be described in full detail. The addition of new affix classes in the MySpell dictionary is a
clear demonstration of this demand; however, only Hunspell, with its capability of
handling UTF-8 (two-byte) flags, seems to provide enough room for such a project.
It may seem that another advantage of using Hunspell for spell checking Esperanto texts is
its novel capability of twofold suffix stripping (the possibility of defining suffixes for
suffixes, i.e. one extra level of root modification), but this has proven to be of limited
usefulness, as explained later in this text.
Definitely worth noting is Hunspell’s capability of word compounding, which could
substitute for the prefixes added by Gabinski to his MySpell dictionary. MySpell supports
basic compounding as well, but it cannot be controlled well enough to be of use for the
Esperanto dictionary. Hunspell’s improvements include a set of three options (to which
affix flags are assignable) for basic compounding rules (COMPOUNDBEGIN, COMPOUNDMIDDLE
and COMPOUNDEND), which, however, seem to be quickly becoming obsolete with the
introduction of yet another instruction, COMPOUNDRULE, which makes it possible to define
complex word compounding rules in Hunspell. The Esperanto dictionary which is to be
constructed should try to make use of these features as well.
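The idea behind rule-based compounding can be illustrated with a simplified model in Python. This is a sketch under loud assumptions, not Hunspell's implementation and not the rule set of the dictionary built in this thesis: each word is assumed to carry exactly one single-character flag, a compound rule is modelled as a regular expression over the sequence of flags, and the lexicon, the flags N (numeral) and A (adjective) and the rule "NN*A" are all hypothetical.

```python
import re

# Hypothetical lexicon: word -> single compounding flag.
LEXICON = {"tri": "N", "du": "N", "kapa": "A", "kolora": "A"}
RULE = "NN*A"  # hypothetical rule: one or more numerals followed by an adjective


def segmentations(word: str, lexicon):
    """Yield every way of splitting `word` into lexicon entries."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in segmentations(word[i:], lexicon):
                yield [head] + rest


def is_valid_compound(word: str, lexicon, rule: str) -> bool:
    """True if some split into two or more parts has a flag string matching the rule."""
    pattern = re.compile(rule)
    for parts in segmentations(word, lexicon):
        if len(parts) < 2:
            continue
        flags = "".join(lexicon[p] for p in parts)
        if pattern.fullmatch(flags):
            return True
    return False


print(is_valid_compound("trikapa", LEXICON, RULE))  # "tri" + "kapa"
print(is_valid_compound("kapatri", LEXICON, RULE))  # wrong order of parts
```

In this model “trikapa” is accepted without being listed anywhere, which is the effect Gabinski had to approximate with dedicated prefix classes.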
As far as the sources of Esperanto linguistic data used for building the dictionary are
concerned, there is also a short description in Pokrovskij’s Ispell files, in the Esperanto
README file legumin.l3. In it, Pokrovskij first regrets the absence of recursive affix
stripping in Ispell (which forced him to introduce some complex affixes which do not
exist on their own in Esperanto grammar, and which limits what he could achieve with his
tool) and later he mentions the uselessness of Ispell’s basic compounding system, as I
have also noted above. Further in the file, he mentions the primary sources for his word
list (the ready-made compounds from PIV, and several texts he himself had checked up to
that time) and then goes into detail about his own preferences with regard to certain
parts of the Esperanto lexicon and how they influenced his acceptance or rejection of
certain words. This thorough inspection of the word list and the manual edits made to
improve it seem worth imitating in the new spell checker as well.
In a personal communication on May 12, 2008, Pokrovskij confirmed to me that he was
still developing his Ispell dictionary, although the lack of interest from the public had
kept him from updating its public distribution very frequently. He also added that
he had never heard of Hunspell before, suggested of his own accord that it could be a better
solution for an agglutinative language such as Esperanto, and even admitted that if he
were starting his work on an Esperanto spell checker now, he would certainly seriously
consider using Hunspell for the purpose.[2] However, right thereafter, he added that the
reported problems with its integration in the text editor Emacs make it of little use to
him.
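[2] “Eble hunspell pli bone konvenus al la kunmetema lingvo kiel Esperanto; se mi estus
komenconta mian laboron super literumilo por Esperanto, mi certe serioze konsiderus tion.”
(“Perhaps hunspell would be better suited to a compounding language like Esperanto; if I
were about to start my work on a spell checker for Esperanto, I would certainly seriously
consider it.”) – Sergio Pokrovskij in an e-mail to Marek Blahuš on May 12, 2008.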
3.2 Adopting a Suitable Approach to Esperanto Morphology
In order to implement a spell checker exploiting the traits of Esperanto morphology, it is
first necessary to adopt a suitable approach to this topic, which would make it possible to
produce a set of rules defining all the possible derivations which are considered valid in the
language.
The project of an international language, then called “Lingvo Internacia” and later named
“Esperanto”, was first published by its initiator, Dr. L. L. Zamenhof, in 1887. The “Unua
Libro” (“First Book”) included the Lord’s Prayer, some Bible verses, a letter, poetry,
sixteen rules of grammar[3] and 900 roots of vocabulary. In 1905, the sixteen rules were
republished, along with a “universal dictionary” and a collection of exercises, as the
“Fundamento de Esperanto” (“Foundation of Esperanto”; Zamenhof, 1905). The Fundamento is
still considered a norm by most contemporary Esperanto speakers. Further evolution
of the language is now being observed and controlled by the “Akademio de Esperanto”
(“Academy of Esperanto”).
Zamenhof had spent about ten years working on his language project, but since he had little
scientific background in linguistics, a thorough linguistic analysis of the language was yet
to be done. Early attempts in this field were made, for instance, by René de Saussure
(brother of Ferdinand de Saussure), who explored word formation in Esperanto in particular
(Saussure, 1910). Probably the most significant and complete up-to-date description of
Esperanto grammar (including morphology) is given by Bertilo Wennergren in PMEG (Wennergren,
2005). An English description (with particular stress on word building), partially based on
PMEG, has been given by Jiří Hana in his master’s thesis (Hana, 1998).
Taking the word building principles described in PMEG as valid, it seems possible to adopt
an approach to Esperanto morphology that enables us to construct a Hunspell affix file and a
corresponding word list reflecting the derivational principles actually occurring in the
usage of the language.
3.2.1 The Structure of Words in Esperanto

The word building principles described in PMEG are said to be in agreement with earlier
decisions of the Akademio de Esperanto on the same topic (“Aktoj de la Akademio
1963-1967”, pp. 69–70). A description of these principles in English may also be found in
Hana (Hana, 1998, pp. 33–45). The Academy has observed that roots in Esperanto may be
divided into three categories, according to the part of speech inherent to them. In
accordance with the word endings typical for each part of speech,4 these categories are
sometimes called A-roots (adjectival), I-roots (verbal) and O-roots (nominal).
Apart from roots, there are three other elements in the Esperanto word building system:
3
These original sixteen rules may be found in Appendix A of this paper.
4
See Appendix A for details on Esperanto endings for different parts of speech.
• affixes (both prefixes and suffixes), which in fact are a subset of specific, very pro-
ductive roots, usually short in letters
• inflectional affixes and endings, capable of expressing the category (part of speech)
in common words, the number and the accusative in adjectives and nouns, the tense
and mood in verbs, and the active and passive participles
• primitive words, which are roots that do not require any category ending to form a
word, although the addition of the ending may be possible
One of the significant observations of the Academy was that any root can accept any cate-
gory ending, provided “the formed word has some meaning”. For example the root “rapid”
is inherently adjectival, so the base word form it produces is “rapida” (meaning “quick”),
but it can also accept the noun ending and produce “rapido” (meaning “speed”) or together
with the verbal ending constitute the word form “rapidi” (meaning “to hurry”).
If the endings on their own are not enough to express a meaning, this can be achieved using
affixes. PMEG lists a total of 10 official prefixes and 31 suffixes in Esperanto. These
are attached to the root and can either add to its meaning or change it. Several affixes of
the same type may appear one after another.
New words, however, may also be formed as compounds of two or more roots. For example, the
above-mentioned root “rapid” together with the root “trajn” (meaning “train”) and the noun
ending can produce the compound “rapidtrajno” (meaning “express train”; literally “quick
train”). The effects such composition has on the semantics of the roots and on the overall
meaning of the compound differ from case to case and are guided by a system of rules, which,
however, is outside the scope of this work and unnecessary for our purpose. Furthermore, the
difference between a root and an affix is often not clear either. In addition, when
compounding, a connector letter, which is either “a”, “o”, “i” or even “e” (the adverbial
ending), is sometimes preserved at the compound boundary, for example “skribotablo” =
“writing desk” from “skrib” = “write” and “tabl” = “table” (this issue of the connector,
sometimes regarded as the conservation of endings, is intentionally rather neglected in Hana
1998, where it is called “inserted o”).
In order to maintain clarity and to propose a system in which every Esperanto word could
be constructed from a relatively small set of elements, I have decided to introduce the
following word structure system and terminology, which represents a slightly modified
version of the structure mentioned above, and to keep using it throughout the rest of this
text:
A word in Esperanto, for the purpose of the developed spell checker, is a compound
consisting of one or several compound parts, with an optional connecting letter between any
pair of neighbouring compound parts, and optionally terminated by an ending. Each compound
part consists of exactly one stem and any number of affixes placed around it. Written in the
form of a regular expression, the structure looks like this:
(affix* stem affix* connector?)* affix* stem affix* ending?
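The pattern above can be tried out with a small sketch. The morpheme inventory below is a tiny hypothetical sample chosen for illustration (it is not the word list or affix set of the actual dictionary), and Python is used only as a convenient notation; the regular expression itself is assembled directly from the structure given above.

```python
import re

# Tiny illustrative morpheme inventory (an assumption, not the real word list).
STEMS = ["san", "dom", "ŝajn", "mult", "kost", "rapid", "trajn"]
AFFIXES = ["mal", "ul", "ej"]
CONNECTORS = ["a", "e", "i", "o"]
ENDINGS = ["ajn", "ojn", "aj", "oj", "an", "on", "a", "e", "i", "o"]

def alt(items):
    """Build a non-capturing alternation, longest alternatives first."""
    ordered = sorted(map(re.escape, items), key=len, reverse=True)
    return "(?:" + "|".join(ordered) + ")"

# One compound part: affix* stem affix*
part = f"{alt(AFFIXES)}*{alt(STEMS)}{alt(AFFIXES)}*"

# Whole word: (affix* stem affix* connector?)* affix* stem affix* ending?
WORD = re.compile(f"(?:{part}{alt(CONNECTORS)}?)*{part}{alt(ENDINGS)}?")

def is_well_formed(word):
    """True if the whole word can be segmented according to the pattern."""
    return WORD.fullmatch(word.lower()) is not None
```

With this sample inventory, `is_well_formed("malsanulejdomo")` and `is_well_formed("rapidtrajno")` succeed, while `is_well_formed("skribotablo")` fails only because “skrib” and “tabl” are missing from the sample stem list. The sketch checks structure alone; it does not restrict which affixes may combine with which stems.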
All elements used in this word formation are called morphemes and each of them belongs
to exactly one of the following categories: stems, affixes, endings and connectors. Each
morpheme is unambiguously classifiable into one of these categories simply according to the
letters it consists of; the only exception is that every connector is homonymous with an
ending, but the two can be unambiguously distinguished by the position they occupy in the
word.
Compared to the system used by PMEG and described above, all PMEG roots are stems in my
system and all PMEG affixes are affixes in my system as well. Most of the inflectional
affixes and endings are treated as endings (and there is always at most one ending, so a
complex such as “ajn”, made up of “a”, “j” and “n” to denote an accusative plural adjective,
represents a single ending for me); only the participles (“int”, “ant”, “ont”, “it”, “at”,
“ot”) are handled in the same way as affixes, since they may appear not only at the end of
words but also inside them. The primitive words, which I call “function words”, are all
considered stems.
The following are several illustrations of my system of word structuring, with morpheme
types and inner compound boundaries shown:
malsanulejdomo:5
MAL – SAN – UL – EJ | DOM | O
affix – stem – affix – affix | stem | ending
ŝajnmultekosta:6
ŜAJN | MULT | E | KOST | A
stem | stem | connector | stem | ending
3.2.2 System for a Semantic Classification of Stems

If we are to define rules for permitted combinations of affixes and stems, it is necessary
to observe the conditions under which affixes accept certain stems (or vice versa) while
they do not produce a meaningful combination with others. If we succeed in understanding
this system, it is possible to create a spell checker which would recognize not only all
words found in the sources it was based on, but also valid compounds and combinations of
stems and affixes which may be atypical, but are still perfectly correct and usable in a
context which the author of the dictionary simply did not think about. Also, implementing
such a system of stem classification could result in a significantly shorter word list file,
since all the possible combinations of stems and affixes would be described just by means of
affix classes and not listed individually.
5
meaning “a hospital house”, literally compounded as “the house [DOM+O] of the place [EJ] of the person(s)
[UL] of the condition opposite [MAL] to healthy [SAN]”
6
meaning “seemingly expensive”, literally compounded as “costly [KOST+A] in the way [+E] of amount
[MULT] of a seeming character [ŜAJN]”
Semantic classification of stems and its usefulness for implementing rules of Esperanto
morphology has already been touched upon by Hana in his morphological analyzer (Hana, 1998,
pp. 54–55). He shows two examples of using two-level morphology rules to restrict the usage
of some morphemes, in particular the prefixes “bo” (which can precede only a very limited
set of family-related stems) and “pra” (which has two different meanings, and thus, in
addition to the stems accepted by “bo”, also accepts other stems). Later, in the conclusion
of his paper (Hana, 1998, p. 65), he declares classification of stems a workable approach
which would be worth wider application, but explains that he has not pursued it extensively,
since such a classification would be very time-consuming.
However, exploiting data coming from grammar references, dictionaries and corpus analy-
ses, it should be possible to pursue such a classification without the need for manual tag-
ging of each and every stem in the word list. It is among the goals of this work to try out
this approach and see if an automated semantic classification of Esperanto stems is feasible
and the data coming out of it useful for the construction of a spell checking dictionary.
In order to devise an algorithm for semantic classification, a suitable system of semantic
classes first has to be identified. This is to be done on the basis of the known traits of
all Esperanto affixes, since it is exactly the affixes and their relation to word stems
which we are later going to control using rules that make use of the classification. A
sufficiently good description of the traits of Esperanto affixes has been given by
Wennergren in PMEG (Wennergren, 2005, chapters 38.2 and 38.3). He also discusses the topic
of semantic classification of stems, but provides only a provisional classification,
declared as incomplete (Wennergren, 2005, chapter 37.1). The part of it concerning humans
and animals and the problem of inherent gender in Esperanto stems is discussed in another
chapter of PMEG (Wennergren, 2005, chapter 4.2) and, in a slightly updated form, in a
separate document (Wennergren, 2008).
In spite of the overall incompleteness of Wennergren’s work on semantic classification, it
seems possible to take his classification as the basis of the one to be designed in this
chapter, so I present here an English translation of his semantic classes as described in
the sources cited above:
• Some stems refer to humans, persons, e.g.:
AMIK, TAJLOR, INFAN, PATR, SINJOR, VIR...
• Other stems refer to animals, e.g.:
ĈEVAL, AZEN, HUND, BOV, FIŜ, KOK, PORK...
• Other stems are plants, e.g.:
ARB, FLOR, ROZ, HERB, ABI, TRITIK...
• Some stems are tools, e.g.:7
KRAJON, BROS, FORK, MAŜIN, PINGL, TELEFON...
• Many stems are names of actions, e.g.:
DIR, FAR, LABOR, MOV, VEN, FRAP, LUD...
• Other stems are names of traits or qualities, e.g.:
BEL, BON, GRAV, RUĜ, VARM, ĜUST, PRET...
7
This category was later not identified as necessary in my explorations, since I have not found any affix
restrictive to this particular semantic class, so I have not included it in my proposal for semantic
classification of Esperanto stems.
Additionally, the stems referring to humans and animals may be sorted into masculine stems,
feminine stems and gender-neutral stems. The cited sources give an extensive list of
examples for these categories, and particularly complete lists for some tricky combinations
(e.g. masculine stems referring to animals, such as “taŭro” for “bull”).
Wennergren’s description of the Esperanto affix system lists the 10 official prefixes and 31
suffixes in Esperanto8 and includes detailed information on their possible use, with exam-
ples. In some cases, the semantic class of stems combinable with the particular affix is de-
scribed very precisely, in other cases the description is unfortunately somewhat fuzzy (as is
often also the actual usage of the affix-stem combination in question).
I have attempted to derive a list of distinguishable semantic classes from the description
of affixes as well and combined the two sets. Using information collected from the cited
sources, the given lists of examples and my own knowledge of the language, I have come up
with a semantic classification system of Esperanto stems, which is shown as a Venn diagram
in Figure 3.1. The class of objects, which is especially complex, is shown in detail in the
bottom part of the figure.
8
For a list of these affixes and short explanation of their meaning, see Appendix B.
Figure 3.1: Proposed system for a semantic classification of Esperanto stems.
According to the needs of each affix, a list of flags denoting particular semantic classes or
their combinations has been derived from this classification. It is assumed that assigning
such flags to Esperanto stems in a semantic classification process should be sufficient for
later creation of rules controlling the possible combinations of stems and affixes. Each flag
has been assigned a single letter for easier reference, and a mnemonic for easier orientation.
A full list of these flags is shown in Table 3.1, with the mnemonic marked in bold.
Flag  Description
A     attribute stems, having the A-ending in their base word form
B     animals [“bestoj” in Esperanto]
C     common gender in beings (animals and persons)
F     female gender in persons
I     action stems, having the I-ending in their base word form
J     place stems, producing adverbs of spatial meaning [in Esperanto “ejo” = “place”]
K     plants [“kreskaĵoj” in Esperanto]
L     antonym-producing stems, which accept the prefix “mal”
M     male gender in beings (animals and persons)
N     numbers (numerals and some other stems of amount)
O     object stems, having the O-ending in their base word form
P     persons
T     transitive stems, producing transitive verbs
V     function words, which may appear without an ending [“funkcivortoj”]
Y     family relationships
Table 3.1: Semantic flags derived from the proposed system of semantic classification.
Finally, a reference table has been built, containing a random sample of manually
classified stems plus a carefully chosen group of stems selected to cover the remaining
semantic classes. It will later serve as a check list for the automated classification
process which is to be implemented. The table shows for each stem and semantic flag whether
the flag has been assigned to the stem or not. These data are presented in Table 3.2.
Stem       A B C F I K L M N O P R T X Y
ABON       - - - - - - L - - - - - T - -
ANANAS     - - - - - K - - - O - R - - -
ANAS       - B C - - - - - - O - R - - -
BOV        - B C - - - - - - O - R - - -
CENT       - - - - - - - - N O - - - - -
CENTR      - - - - - - L - - O - - - - -
DAM        - - - F - - - - - O P R - - -
INFORM     - - - - I - - - - - - - T - -
KOTIZ      - - - - I - - - - - - - - - -
LAND       - - - - - - - - - O - - - - -
OFIC       - - - - - - - - - O - - - - -
PAG        - - - - I - - - - - - - T - -
PATR       - - - - - - - M - O P R - X Y
REĜISOR    - - C - - - - - - O P R - - -
REPREZENT  - - - - I - - - - - - - T - -
SAM        A - - - - - L - - - - - - - -
TAŬR       - B - - - - - M - O - R - - -
TEMP       - - - - - - - - - O - - - - -
VER        - - - - - - L - - O - - - - -
Table 3.2: A check list with a manually classified selection of stems.
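The way such a check list could be used to verify an automated classification can be sketched as follows. This is only an illustration in Python: the four rows are copied from Table 3.2, while the function name `compare` and the shape of the `classified` input (a mapping from stem to a string of flags) are assumptions made for the example.

```python
# A few rows of the check list, stem -> expected flags (copied from Table 3.2).
CHECK_LIST = {
    "ANAS": "BCOR",
    "CENT": "NO",
    "PATR": "MOPRXY",
    "SAM": "AL",
}

def compare(check_list, classified):
    """Return {stem: (missing_flags, unexpected_flags)} for stems that differ.

    `classified` is assumed to map each stem to the flags produced by the
    automated process; stems absent from it count as having no flags.
    """
    report = {}
    for stem, expected in check_list.items():
        got = set(classified.get(stem, ""))
        missing = set(expected) - got
        unexpected = got - set(expected)
        if missing or unexpected:
            report[stem] = ("".join(sorted(missing)),
                            "".join(sorted(unexpected)))
    return report
```

An empty report means the automated result agrees with the manual classification for every stem in the check list.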
3.3 DFD Diagram for Dictionary Construction

Having created a system for semantic classification of stems, it is possible to start design-
ing an algorithm which would collect Esperanto stems, make use of this system to semanti-
cally classify them, and thereafter generate rules describing all the permitted ways in which
they may be combined with each other and with affixes and endings in order to produce
valid word forms of the language. The flow of linguistic data through the dictionary con-
struction process is depicted in a data flow diagram in Figure 3.2.
[Figure 3.2: diagram not reproduced. Its labeled elements are: Dictionary headwords, Semantic
classification, Classified stems, Affix rules generation, Hunspell dictionary, Function
words, Corpus data, Specialized dictionaries, Ready-made morpheme segmentations.]
Figure 3.2: Data flow diagram for linguistic data in the dictionary construction process.
3.4 Compiling a New Word List

An essential part of a good spell checker is a solid word list, whose words are used, either
on their own or combined into compounds according to additional compounding rules, for
checking the words received from the user as input. Various word lists (general word lists,
proper names, terminology specific to some profession, unofficial words) have also been
compiled for Esperanto, notably the separate word lists distributed with the Esperanto
dictionary for Ispell (Pokrovskij, 1997), which give the user a limited possibility to
influence the properties of the spell checking process. In our proof-of-concept dictionary
for Hunspell, however, we are not going into such details for now, and will concentrate on
compiling a single, though large, word list of general vocabulary.
3.4.1 Identifying Relevant Sources for the Word List

Plena ilustrita vortaro de Esperanto 2005 (“The Complete Illustrated Dictionary of
Esperanto”, PIV) is the latest version of a renowned monolingual dictionary of Esperanto,
first compiled in 1970 by a large team of Esperanto linguists. There is an electronic list
of the words found in the dictionary (Grimley Evans, 2005), downloadable under a free
license from the internet. The list consists of a total of 46,890 lexical units, of which
about 16,780 are head words (stems); the others are derivatives of these. Because of its
high number of words, good coverage of general vocabulary (plus a significant amount of
rather specialized terms), easy availability and general recognition in the Esperanto
community, PIV, or more precisely its electronic version, is a good candidate for the
dictionary that will provide the base for our spell checking word list. It is also useful
for obtaining information about the inherent part of speech of each stem, because the form
with the corresponding ending is always the one which appears as the head word in the
dictionary, while the derived forms are listed after it, if at all.
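How the inherent part of speech might be read off a head word can be sketched in a few lines. The sketch is in Python and rests on assumptions: head words are taken to be plain strings ending in “a”, “i” or “o”, the flag letters follow Table 3.1, and function words (which need no ending) are assumed to have been separated out beforehand, since a short word such as “la” would otherwise be misread.

```python
# Category ending of the head word -> semantic flag (letters as in Table 3.1).
ENDING_TO_FLAG = {"a": "A", "i": "I", "o": "O"}

def stem_and_flag(headword):
    """Split a head word into (stem, flag), or return None if it has no
    recognizable category ending. Function words are not handled here."""
    if len(headword) < 2:
        return None
    flag = ENDING_TO_FLAG.get(headword[-1])
    if flag is None:
        return None
    return headword[:-1], flag
```

For instance, `stem_and_flag("rapida")` gives `("rapid", "A")` and `stem_and_flag("domo")` gives `("dom", "O")`.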
Apart from the most common simple adjectival, verbal and nominal stems, there are also
stems which are function words, which means they may appear in a text on their own, without
an ending. Some function words also have an adjectival, verbal or nominal character and may
thus be used in word building; other function words lack this capability. Since it is more
difficult to extract function words from PIV, another source is going to be used for
obtaining a list of them. ESPSOF, a software package for analyzing and proofreading
Esperanto texts, currently developed by Toon Witkam (Witkam, 2008), includes a list of 357
function words, classified into many different categories. For our purpose, only the
distinction between inflected function words (which may accept an ending) and uninflected
ones (which may only stand on their own) is important. Another advantage of Witkam’s list
of function words is that it includes some word forms which are not actually function words,
but are regarded as such. This is, for instance, the case of all personal pronouns, which
hardly ever get combined with other stems and at the same time are very likely to be
erroneously recognized within compounds, because of their extremely short length (two or
three letters). The need to set aside personal pronouns and similar words (including the few
word forms they can produce by accepting an ending, such as the possessive form, plural and
accusative) from the rest of the inflected stems and to regard them as uninflected function
words, along with several other possible improvements of Esperanto morphological analysis,
has been mentioned by Witkam in his paper from the 2006 GIL conference (Witkam, 2007a).
Witkam’s ESPSOF, whose dictionary is also based on PIV, includes yet another type of useful
extra data: information, for every verbal stem, on whether the verbs it produces are
transitive or intransitive. This kind of information is present in the paper version of PIV
as well, but was lost during its conversion into the electronic version. Witkam has managed
to substitute it with the same kind of data coming from a reliable Esperanto-Japanese
dictionary (Hirotaka, Ono, 1997), which I am going to make use of as well. The advantage of
such a source, apart from its reliability, is the fact that it completely covers the domain
of words which enter the semantic classification process, i.e. the PIV words.
Corpus analysis seems to be a source of linguistic data especially worth exploring for the
purpose of compiling the word list. There are several professionally compiled corpora for
Esperanto, the largest of them being the 18.5-million-position Esperanto corpus created by
Eckhard Bick (Bick, 2007). Corpora present the language in the form in which it is actually
used, which often differs from the way it is described in dictionaries and grammar
references. And if the goal of our spell checker is to make users aware of their mistakes
rather than to force them to switch to a particular language style, we should make sure the
linguistic data we build the spell checker on are as realistic as possible. Some semantic
classes may be derived directly from corpus analysis: when searching for the set of stems
which accept a certain affix, the easiest solution is, of course, to look up that affix in a
corpus and extract the stem set from the result. This works perfectly, for example, for the
“mal” prefix, which produces a meaning opposite to that of the stem it modifies. This
method, however, unfortunately cannot be used in all cases, since very often an affix can be
accepted by two or more semantic classes, which themselves are difficult to distinguish from
one another. Also, not every affix is productive enough to have a high number of occurrences
in a corpus, and not every affix is recognizable reliably enough in a corpus which is not
morphologically tagged.
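The corpus lookup for the “mal” prefix could look roughly like the following Python sketch. It is a simplification under stated assumptions: the corpus is given as a list of tokens, the ending pattern covers only the common noun, adjective, adverb and verb endings, and a candidate counts only if the remainder after stripping the ending is a known stem. Derived forms such as “maljunulo” are therefore not recognized, which illustrates the limits of working with an untagged corpus.

```python
import re
from collections import Counter

# Simplified set of endings: a/o with optional plural j and accusative n,
# adverbial e, infinitive i, finite verb forms as/is/os/us, imperative u.
ENDINGS = re.compile(r"(?:[ao]j?n?|e|i|[aiou]s|u)$")

def mal_stems(tokens, known_stems):
    """Count known stems that occur directly after the prefix 'mal'."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        if not word.startswith("mal"):
            continue
        rest = ENDINGS.sub("", word[3:])   # drop the prefix, then the ending
        if rest in known_stems:
            counts[rest] += 1
    return counts
```

Stems collected this way would be candidates for the L flag of Table 3.1; a frequency threshold could be applied to the counts to filter out accidental matches.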
In order to be able to properly recognize all classes of the proposed semantic
classification system, it is necessary to employ some specialized dictionaries as well. A
particularly demanding field is the recognition of stems denoting animals, plants and human
beings. For example, only these classes of stems may accept the Esperanto suffix “id”, used
for forming words for an offspring or descendant. This suffix is, however, not frequent
enough for a satisfactory number of combinable stems to be identified by corpus analysis.
That is why it is necessary to make use of specialized word lists during the semantic
classification process. A significant contribution in this field was made by the recently
deceased Wouter F. Pilger, who had been maintaining a set of “provisional personal lists”
from the fields of botany, zoology and ornithology (Pilger, 1982, 1992, 1996a, 1996b, 1996c,
1997). Sometimes the words given in them are too scientific and do not cover the everyday
vocabulary (such as listing a dog under the zoological word “kaniso” rather than the common
word “hundo”), but these gaps may be filled by adding the vocabulary from several lernu!
word lists (lernu!, 2002), which, in contrast, attempt to present Esperanto students with
the most common vocabulary for each subject. Together, these word lists, if properly
adapted, provide a solid base for recognizing the semantic classes of plants and animals.
Other specialized word lists used as an aid for the semantic classification are a list of
professions, functions and ranks (Worsten, 2003), and several small closed vocabularies,
such as the vocabulary of family-relationship stems, extracted from PMEG (Wennergren,
2005).
However, whenever a stem appears in a specialized word list but is not present in the
original PIV-derived word list, it is thrown away, since such a stem would certainly not be
fully classifiable. This measure also guarantees that only stems from a controlled
vocabulary (those listed in PIV), and no other words, may get recognized by the spell
checker. This may be welcomed by some users and it certainly helps to maintain a certain
quality of the produced dictionary, even if it may have the side effect of not recognizing
some rather specialized vocabulary.
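The filtering rule just described amounts to an intersection with the PIV stem set. A minimal Python sketch, in which the function name and the sample stems are hypothetical:

```python
def keep_known(specialized_stems, piv_stems):
    """Split a specialized list into stems present in the PIV-derived list
    (kept) and stems absent from it (thrown away), preserving input order."""
    piv = set(piv_stems)
    kept = [s for s in specialized_stems if s in piv]
    dropped = [s for s in specialized_stems if s not in piv]
    return kept, dropped
```

Returning the dropped stems as well makes it easy to inspect how much specialized vocabulary is lost by this measure.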
3.4.2 Moore Machine for Semantic Classification

The process of semantic classification of all stems is a complex one, because of all the
different possible semantic classes that have to be taken into account and the number of
sources that must be combined in order to achieve a good classification. Figure 3.3 depicts
the whole classification process each stem must go through, including the names of the
tools that are used in each step of classification, the flags which are assigned to the
stems in these steps, and the way in which each step of the classification is connected to
the rest of the process. The diagram shown is a Moore machine, a finite state automaton in
which the outputs (semantic flags) are determined by the current state alone. If a
classification step is performed in a state, the state is labeled with an abbreviation
denoting the source engaged, and there are transitions from this state to two or more other
states, one for each possible result of the classification step in question. A state from
which there are no transitions to other states is a final state. The flags output while
walking through the automaton are accumulated, and when a final state is reached, the
resulting output is the set of semantic flags for the particular stem. Thus, the set of all
possible combinations of output semantic flags (a total of 85) may be obtained simply by
enumerating all possible passes through this automaton.
[Figure 3.3: diagram not reproduced. It shows the states q0 to q30. Classification states are
labeled with their data sources (ESPSOF, PIV, PMEG, PMEG+corpus, corpus, Pilger+lernu,
PMEG+worsten). Transitions are labeled with classification results (function word /
non-function word, inflected / uninflected, place / non-place, attribute / action / object,
number / non-number, transitive / non-transitive, family / non-family, antonym-producing /
non-antonym-producing, plant / non-plant, animal / non-animal, person / non-person, common /
non-common, male / non-male, female / non-female). The output of each state is either λ or
one or more semantic flags.]
Figure 3.3: Moore machine describing the process of automated semantic classification.
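The mechanics of such a Moore machine can be illustrated with a deliberately small sketch in Python. The fragment below is hypothetical: its state names, outputs and stand-in data sets are assumptions made for the example and do not reproduce the automaton of Figure 3.3. It only shows how flags attached to states are accumulated along a pass, and how all possible flag combinations can be enumerated.

```python
# Hypothetical fragment; states, outputs and tests are illustrative only.
OUTPUT = {"s0": "", "s_obj": "O", "s_act": "I", "s_fam": "YP", "s_end": ""}

FAMILY = {"patr", "frat"}      # stand-in for a PMEG-derived list
VERBAL = {"pag", "inform"}     # stand-in for PIV head-word data

# Classification step performed in each non-final state (returns a label).
STEP = {
    "s0": lambda stem: "action" if stem in VERBAL else "object",
    "s_obj": lambda stem: "family" if stem in FAMILY else "non-family",
}
# Transition taken for each (state, label) pair.
NEXT = {
    ("s0", "action"): "s_act",
    ("s0", "object"): "s_obj",
    ("s_obj", "family"): "s_fam",
    ("s_obj", "non-family"): "s_end",
}

def classify(stem, start="s0"):
    """Walk the automaton for one stem and return the accumulated flags."""
    state, flags = start, ""
    while True:
        flags += OUTPUT[state]       # Moore machine: output depends on state only
        if state not in STEP:        # no outgoing transitions: final state
            return flags
        state = NEXT[(state, STEP[state](stem))]

def all_flag_sets(state="s0", flags=""):
    """Enumerate the flag combinations of all passes through the automaton."""
    flags += OUTPUT[state]
    if state not in STEP:
        return {flags}
    result = set()
    for (src, _label), dst in NEXT.items():
        if src == state:
            result |= all_flag_sets(dst, flags)
    return result
```

In this toy fragment, `classify("patr")` yields `"OYP"`, and `all_flag_sets()` yields the three combinations `"I"`, `"O"` and `"OYP"`; applied to the full automaton, the same kind of enumeration would produce the combinations mentioned in the text.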
3.4.3 Implementation of the Semantic Classification
The above described system of semantic classification has been implemented by means of a
set of Linux shell scripts which make use particularly of the programs sed, grep, and the
utilities from the textutils package (such as cut, join, sort, uniq, wc).
Each of the steps (i.e. states labeled with an abbreviation of some data source) of the
automaton shown in Figure 3.3 performs one step of the classification process: it uses the
particular data source to classify the processed stem and decides which transition should be
followed next (and thus also whether a flag should be output and, if so, which one).
In practice, however, the automaton is not run on each stem from the word list individually.
Instead, the whole word list goes through a batch process which iterates through all
classification steps associated with states of the automaton and, for each step, runs the
respective classifying program at once on all stems in the word list to which it is
applicable. The steps may be iterated through, for instance, in the order that corresponds
to the numbering of the states in the figure, but it may be proved that the order can be
chosen arbitrarily as long as no path from the initial step to the step which is to be
performed contains a step that has not been performed so far. This condition guarantees that
the program realizing a given step will have at its disposal all the information about the
classified stem from the previous steps it might require. Moreover, since all paths forming
cycles in the graph may be distinguished from each other by the presence of a certain flag
or set of flags in the output, it is possible at every moment to determine from the output
flags by which series of steps a particular stem has been classified so far.
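The ordering condition can be checked mechanically. The Python sketch below assumes that the automaton is given as a list of (source, destination) transitions and that `order` lists every step exactly once; the step names in the usage note are hypothetical. Requiring each step to come after all of its direct predecessors is enough, because the same requirement then holds, by induction, for every step on any path leading to it.

```python
def is_valid_order(order, edges):
    """Return True if every step is run only after all steps that precede it
    on some path, i.e. if `order` is a topological order of the graph."""
    position = {step: i for i, step in enumerate(order)}
    for src, dst in edges:
        if src not in position or dst not in position:
            return False             # a step of the graph is never run
        if position[src] >= position[dst]:
            return False             # dst would run before its predecessor
    return True
```

For a small diamond-shaped graph with the transitions a→b, a→c, b→d and c→d, the orders a, b, c, d and a, c, b, d are both accepted, while a, b, d, c is rejected because d would run before c.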
Each program performing a single classification step implements a universal interface, so
that the programs may be easily lined up in a batch and swapped arbitrarily, as long as the
condition given above is fulfilled. The word list which is being classified is stored in a
text file, one item per line, each consisting of the text of the stem itself, a slash (“/”)
and a list of flags that have been assigned to that stem so far. When run, a program
performing a step of the classification receives a copy of this partially flagged word list
as an input file, inspects it and generates an output file containing those stems which it
has changed, each followed after the slash by the newly added flag(s). After every such
step, a merging program joins the two files, updating the main word list file with the new
flags from the output file.
For example, the partially flagged word list may at some moment look like this:
dom/JO
patr/O
Then, if the program assigned to the step in state “q10”, whose task is to make use of
PMEG to determine if a stem describes a family relationship, is run, it produces the follow-
ing result:
patr/YP
This is due to the fact that “patr” (the stem for “father”) describes a family relationship
(and thus is given the flag “Y” for “family” as well as “P” for “person”), while the other
stem, “dom” (meaning “house”), does not describe a family relationship and thus is left
untouched.
Finally, the merging program updates the main word list with the information from this
output file, after which the main word list looks like this:
dom/JO
patr/OYP
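The merging step of this example can be sketched as follows. The thesis implementation uses shell scripts; the Python version below is only a stand-in that illustrates the file format, with hypothetical function names and with the files represented as lists of lines.

```python
def parse(lines):
    """Parse 'stem/FLAGS' lines into an ordered {stem: flags} mapping."""
    entries = {}
    for line in lines:
        stem, _slash, flags = line.strip().partition("/")
        if stem:
            entries[stem] = flags
    return entries

def merge(main_lines, step_output_lines):
    """Append the flags from a step's output file to the main word list,
    skipping flags the stem already carries."""
    main = parse(main_lines)
    for stem, new_flags in parse(step_output_lines).items():
        old = main.get(stem, "")
        main[stem] = old + "".join(f for f in new_flags if f not in old)
    return [f"{stem}/{flags}" for stem, flags in main.items()]
```

Calling `merge(["dom/JO", "patr/O"], ["patr/YP"])` returns `["dom/JO", "patr/OYP"]`, matching the word list shown above.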
The problem of homonyms is somewhat complicated, since these are not distinguished by the
algorithm. If a homonym occurs, such as when determining the inherent part of speech by
means of PIV in state “q5” for the stem “bar”, which is shared by the basic forms of both the
verbal “bari” (“to obstruct”) and the nominal “baro” (“bar”, the unit of pressure), then it
is assigned two otherwise exclusive flags in the same step (becoming bar/IO). This may
produce a combination of flags which could never occur in a single unambiguous stem, but it
usually does little harm, since in the following steps it is still possible to recognize the
series of steps by which the stem has been classified, although there may be some
“superfluous” flags left, from the point of view of a step-performing program.
The only real problem with homonyms seems to emerge when a stem is assigned a flag due
to the traits of one of its meanings, but this flag is relevant also to the other meaning, for
which the presence of this flag may or may not have been determined so far. This is the
case of an attribute/object homonym for instance, whose meanings gets split in the “q5”
state and the attribute and the object meanings may later be directed to the “q7” and the
“q15” states, respectively, which both decide about the number character of the stem, as-
signing it or not the N flag. Now it may happen that one of the meanings has a number
character and the other one has not, but as soon as both of them “meet again” in the follow-
ing state which is “q26”, this difference can not been identified anymore. Probably, either
the restructuring of the automaton or the introduction of an unambiguous set of flags (i.e.
one flag may be generated only in one single state) would help to solve the problem. In
practice, however, this is of little harm, since there are especially few homonyms in the Es-
peranto vocabulary (there are even Esperanto speakers who think there should be no
homonyms in the language at all), and even if a homonym appears and a dubious event like
the one described above occurs, the presence or not of the particular flag usually does not
have any impact on the decision processes which occur in the steps that follow (for
example, it is not important whether a stem has a number character while it is being
classified for the capability of producing antonyms in “q26”).
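The loss of information at the point where both meanings “meet again” can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each meaning of a homonym carries its own set of flags and that only their union is stored for the stem; the flag letter A for the attribute meaning and all function names are hypothetical and are not taken from the classification program itself.

```python
# Hypothetical per-meaning flags of an attribute/object homonym.
# A = attribute meaning (assumed letter), O = object meaning, N = number character.
meanings_case_1 = {"attribute": {"A", "N"}, "object": {"O"}}
meanings_case_2 = {"attribute": {"A"}, "object": {"O", "N"}}


def merged_flags(meanings):
    """Return the single flag set stored for the stem (union over all meanings)."""
    merged = set()
    for flags in meanings.values():
        merged |= flags
    return merged


def per_meaning_flags(meanings):
    """Alternative bookkeeping: tag every flag with the meaning that produced it."""
    return {(name, flag) for name, flags in meanings.items() for flag in flags}


# Both cases collapse to the same merged set, so a later state cannot tell
# which of the two meanings actually has the number character.
same_when_merged = merged_flags(meanings_case_1) == merged_flags(meanings_case_2)
same_when_tagged = per_meaning_flags(meanings_case_1) == per_meaning_flags(meanings_case_2)
print(same_when_merged, same_when_tagged)
```

Keeping the flags tagged by meaning is only one conceivable remedy; the restructuring of the automaton mentioned above would be another.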
3.5 Compiling a New Set of Affix Rules
After a word list has been compiled and semantically classified, it is possible to create a set
of rules which makes use of this classification by requiring that certain affixes be combined
only with stems from certain semantic classes.
3.5.1 Dictionary-Based Word Derivation System

One possible approach would be to try to produce the rules manually, following theories of
Esperanto word building that describe which affixes can be combined with which kind of
stems (as we have seen in PMEG) and also make attempts at describing the Esperanto com-
pounding system in general, using different terminologies and approaches. This process,
however, would be particularly difficult, if feasible at all, because there is still no generally
accepted overall theory on the compounding in Esperanto and the agglutinative character of
the language makes compounds very frequent and varied, although the speakers usually do
not have problems understanding each other, since circumstances like context and their
own national languages seem to make comprehension relatively easy.
Instead of working out rules by hand, particularly if we want to concentrate on the lan-
guage as it is actually being used, the usage of a dictionary or a corpus seems a suitable
idea once again. There is presently no large enough Esperanto corpus which would contain
information on the morphological structure of its words, but we may observe that Esperan-
to dictionaries such as PIV are actually full of compounds. Toon Witkam, who has written
on the topic of deriving Esperanto morphology from dictionaries (Witkam, 2007), notes
that in PIV compounds constitute two thirds of all lexical units (although he also considers
words with a single stem but at least one suffix added to it a compound, which differs from
my usage in this work). In his ESPSOF software package (Witkam, 2008), he makes use of
morphologically analyzed words from PIV, for a lot of which he had to add the structure
information himself, since the dictionary does not provide full morpheme structure for all
the words it lists. Once the morpheme structure of a large number of words is known, however,
it may be used in morphological analysis, both as a direct model for known words and
as a guide for analyzing words which are not present in the dictionary but are probably
compounds of words which may be found there.
Witkam’s goal for ESPSOF, however, is to “construct a general word compound analyzer,
which would work without any semantic knowledge on the text or the world” (Witkam,
2007, translation from Esperanto is mine). But as we have developed a system for semantic
classification of stems, we now have the possibility to try to approach the problem of word
derivation and compounding using this kind of information as well.
Taking Witkam’s list of ready-made morpheme segmentations, we may put it through an
affix and stem recognition process, which tries to match the morphemes with either affixes
or stems from the word list we have compiled. Those words in which we have successfully
recognized all components (which should be the case with most of them, since those words
all come from PIV, which itself is the source of our word list), may serve as morphology
models, since, together with the semantic classes that may be assigned to the stems found
in them, they give us hints about what the possible combinations of stems and affixes in Es-
peranto are.
An analysis of Witkam’s morpheme structures provides interesting information on the
frequency of different affix combinations in Esperanto. In his database of approximately
33,000 ready morpheme segmentations of word forms from PIV, there are 632 different
combinations of affixes and stem positions (independent of what the particular stems are).
The frequencies of the combinations, except for the first two, approximately follow
Zipf’s law. The twenty most common morpheme structures in Esperanto are listed in
Table 3.3. Note that the word endings were irrelevant to the analysis, and that an asterisk (*)
denotes a position for a stem (in contrast with an affix).
Frequency Structure
12,970 *
6,005 */*
1,386 */o/*
800 */aĵ
795 */ad
642 */ist
631 */ig
619 */ec
505 */iĝ
417 */il
400 */ul
376 */ej
359 */et
318 */ism
317 */*/ig
311 mal/*
254 */*/iĝ
220 */uj
215 */ar
206 */*/il
Table 3.3: Most common morpheme structures in Esperanto.
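How such a frequency table may be obtained from slash-separated segmentations can be sketched in Python as follows. This is only an illustration under stated assumptions: the small affix sets, the sample segmentations and the function name structure are hypothetical, and the actual analysis worked with the full affix inventory and Witkam’s complete database.

```python
from collections import Counter

# Hypothetical subsets; the real lists would come from the compiled word list.
PREFIXES = {"mal", "dis", "ek"}
SUFFIXES = {"ig", "iĝ", "ist", "ec", "il", "ul", "ej", "et", "ad", "aĵ"}
ENDINGS = {"o", "a", "e", "i"}


def structure(segmentation):
    """Map a segmentation such as 'mal/san/a' to a structure such as 'mal/*'."""
    morphemes = segmentation.split("/")
    # Word endings are irrelevant to the analysis, so a final ending is dropped.
    if len(morphemes) > 1 and morphemes[-1] in ENDINGS:
        morphemes = morphemes[:-1]
    last = len(morphemes) - 1
    parts = []
    for position, morpheme in enumerate(morphemes):
        is_prefix = position == 0 and last > 0 and morpheme in PREFIXES
        is_suffix = position > 0 and morpheme in SUFFIXES
        # An ending between two other morphemes is kept as a linking vowel.
        is_link = 0 < position < last and morpheme in ENDINGS
        parts.append(morpheme if is_prefix or is_suffix or is_link else "*")
    return "/".join(parts)


# Illustrative sample only; not an excerpt from the actual database.
SAMPLE = ["dom/o", "san/a", "mal/san/a", "san/ig/i",
          "vapor/ŝip/o", "vapor/o/ŝip/o", "bel/ec/o", "mal/bel/a"]

counts = Counter(structure(s) for s in SAMPLE)
for name, frequency in counts.most_common():
    print(frequency, name)
```

Every morpheme that is not recognized as an affix or a linking vowel is treated as a stem position here, which is a simplification of the recognition process described above.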
Yet if we group the word forms having the same morpheme structure together and classify
the stems in them using our system of semantic classification, we may get a list of stem se-
mantic classifications for each morpheme structure, from which information about the pre-
vailing semantic class for each stem position could be retrieved and used to produce a mor-
phological rule describing the semantic conditions for formation of words which follow the
particular morpheme structure.
For example, inspecting the semantic analysis of stems in 70 word forms of the morpheme
structure “dis/*”, it is possible to notice that the stem position in this structure is mostly (in
53 cases) occupied by a verbal transitive stem (semantic flags IT). In all other cases (except
for one rather erroneous one), the stem in the stem position is a common noun (semantic
flag O). Eventually, the analysis of this particular morpheme structure may result in
producing two new word building rules, namely that a combination of “dis” and a stem
having either the O flag or the IT flags is a valid compound part in Esperanto and should
be recognized as such by the constructed spell checker.
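The step from observed stem classifications to word building rules can be sketched in Python. The counts below follow the “dis/*” example (53 stems flagged IT, one outlier, the remaining 16 flagged O), but the threshold min_share, the placeholder flag A for the outlier and the function name derive_rules are assumptions made for this illustration only.

```python
from collections import Counter


def derive_rules(structure, stem_classes, min_share=0.1):
    """Return (structure, flags) pairs for every class that is frequent enough.

    min_share is an assumed cut-off used to discard rare, probably
    erroneous classifications.
    """
    counts = Counter(stem_classes)
    total = sum(counts.values())
    return [(structure, flags)
            for flags, n in counts.most_common()
            if n / total >= min_share]


# 70 word forms of "dis/*": 53 transitive verbal stems, 16 common nouns and
# one outlier (its flag "A" is only a placeholder).
observed = ["IT"] * 53 + ["O"] * 16 + ["A"] * 1
rules = derive_rules("dis/*", observed)
print(rules)
```

With these numbers, the sketch keeps the classes IT and O and drops the single outlier, which corresponds to the two rules described above.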
3.5.2 Implementation of Esperanto Morphology in Hunspell

After performing the analysis of Esperanto morpheme structures and the related combina-
tions of affixes and semantic classes, the last part of the task of creating a spell checker for
Esperanto was to implement the word list and the resulting rules in the form of a Hunspell
dictionary.
Several approaches have been tried to represent the created semantic classification and
word building rules exploiting the morphology capabilities of Hunspell’s dictionary files.
The most straightforward ones, unfortunately, have been found infeasible because of
some limitations of the Hunspell software. The last approach tried, however, has been
successful and provided an actual method of implementing the derived Esperanto morphology
rules in the form of a Hunspell spell checking dictionary.
The main problem with a Hunspell implementation of a complex word building system,
which is typical for Esperanto, has been its capabilities of compounding and affixation,
which were found too limited – paradoxically, for a spell checker which boasts the best
support for morphology ever. The main causes of the failure of the first several attempts at
implementing the Esperanto morphology in Hunspell were:
• The conflict between Hunspell’s old and new systems for defining permitted compound
structure: The set of COMPOUNDFIRST, COMPOUNDMIDDLE, COMPOUNDEND and
COMPOUNDFLAG instructions is useful for defining local compound
rules (such as that a word ending may appear only as the last element of a word
form and never alone), while the newer COMPOUNDRULE instruction, which takes
a regular expression as its argument, may be used for defining the global structure of
a compound (such as when trying to define word structure in a manner similar to
what I have done in chapter 3.2.1 of this work). The problem is that these two sets
of instructions may not be used together, so one can actually opt for just one of
them, and neither of them seems powerful enough at this time (although there
are promises of COMPOUNDRULE becoming more powerful in the future and
eventually replacing the old instructions).
• A hidden limitation of Hunspell’s noted twofold suffix stripping. It actually
works (it is possible to assign an affix class to another affix), but it is limited only to
the last compound part of the word form (or the first one, in the case of right-to-left
languages). And, still, it is just a twofold stripping, so it cannot be used to freely
implement the Esperanto morphology, where combinations of three affixes (especially
if we count word endings among them) are relatively frequent.
These problems with making use of Hunspell’s prominent support for complex morphology
have forced me to look for a less smooth solution, which, however, would at least
be capable of expressing the systems of rules and stem classification I have developed.
Finally, I settled on using solely word compounds (by means of the COMPOUNDRULE
instruction) for implementing the Esperanto morphology, with not even a touch of
Hunspell’s actual affix system.
Thus, a feasible approach for implementing the created morphology rules is to produce a
set of regular expressions (limited to the asterisk and question mark operators, however)
which describe each of the identified rules, using a single character to refer to either a suf-
fix or a stem semantic flag or semantic flag group. All the affixes, word endings and all
stems used must be put in the word list file and marked with necessary Hunspell flags (“af-
fix classes”) so that these flags may be used in these regular expressions. In order to keep
things simple, I have kept the same system of flags I have developed for semantic classifi-
cation, introducing new ones where necessary, for example if a created morphology rule
expects a semantic flag combination in a position while Hunspell can accept only one flag
per compound part in the COMPOUNDRULE regular expression.
An example implementation of the rules for the “dis/*” morpheme structure discussed in
the previous chapter, for a sample dictionary of just several words, could look as follows:
a/z
i/z
o/z
dis/d
don/ITx
est/I
sem/O

COMPOUNDRULE dOz
COMPOUNDRULE dxz
Here, the word list file (the top part) defines morphemes and their flags, and the excerpt
from the affix file (the bottom part) shows how word building rules are written for the pre-
fix “dis”. The flags for the morphemes “dis”, “don”, “est” and “sem” come directly from
semantic classification of stems, but a new auxiliary flag x has been introduced, which is
merely a replacement for the flag combination IT which, since not a single flag, can not be
present in a compound rule on its own. The first compound rule describes the case when a
nominal stem follows “dis” in a word, ended by a word ending (the flag z here marks a
sample subset of possible Esperanto word endings in the word list). The second compound
rule describes the case when “dis” is followed by a verbal transitive stem, followed by a
word ending. Using these rules, the spell checker recognizes the following word forms:
“dissema”, “dissemi”, “dissemo”, “disdona”, “disdoni”, “disdono”.
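The effect of the two compound rules can be imitated with a short Python sketch. It is not Hunspell’s actual algorithm: it models only plain flag sequences (no asterisk or question mark operators), ignores every other Hunspell setting such as a minimum length of compound parts, and the names WORD_LIST, RULES, matches and accepted are hypothetical.

```python
# Sample word list from the text: morpheme -> set of single-character flags.
WORD_LIST = {
    "a": {"z"}, "i": {"z"}, "o": {"z"},
    "dis": {"d"},
    "don": {"I", "T", "x"},
    "est": {"I"},
    "sem": {"O"},
}
RULES = ["dOz", "dxz"]


def matches(word, rule):
    """True if word splits into morphemes carrying the rule's flags, in order."""
    if not rule:
        return word == ""
    flag, rest = rule[0], rule[1:]
    for morpheme, flags in WORD_LIST.items():
        if flag in flags and word.startswith(morpheme):
            if matches(word[len(morpheme):], rest):
                return True
    return False


def accepted(word):
    """True if at least one compound rule describes the whole word."""
    return any(matches(word, rule) for rule in RULES)


for candidate in ["dissema", "disdoni", "disesti"]:
    print(candidate, accepted(candidate))
```

In this sketch “disesti” is rejected because “est” carries neither the O flag nor the auxiliary x flag, which mirrors the intention of the rules above.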
3.6 Integration in OpenOffice.org and the E@I Grammar Checker
The above text shows that it is actually possible to develop a spell checking dictionary for
Esperanto using Hunspell and that, in spite of some of Hunspell’s limitations, it is capable of
expressing semantically based rules of Esperanto morphology, thus bringing some new
technology into the field of Esperanto spell checking which would be worth trying out in
practice.
Since this Esperanto spell checker is being developed in the scope of a broader grammar
checker project called “Lingvohelpilo” (“Language Helper”) by the E@I organization, the
first idea about its possible field of use is naturally this system, whose development has
started in the beginning of 2008 and is supposed to last for a full year. The spell checker
should be an integral part of the project and in fact the first component which would be
processing the text after the user has submitted it and later forwarding it, along with a list
of unrecognized words and suggested corrections for each of them, to a separate component
dedicated to grammar checking. The two processes could be connected with a pipeline, and
some modifications to the Hunspell code are also foreseeable, so that the grammar checker
receives the spell checker’s output in a form which is most convenient for it. Theoretically,
it could also be possible to overcome some of Hunspell’s limitations on word building
described above by creating a modified version of the Hunspell source code, but this is yet
to be discussed and its feasibility explored.
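One conceivable form of such a connection, given here only as a sketch, is to read the Ispell-compatible pipe output which Hunspell prints when started with the -a option. In that format, a line starting with “&” reports a misspelled word with suggestions and a line starting with “#” reports a word without suggestions. The function name parse_pipe_output and the sample lines are made up for this illustration and do not come from the Lingvohelpilo project.

```python
def parse_pipe_output(lines):
    """Collect (word, offset, suggestions) for every unrecognized word.

    Handled line shapes:
      "& WORD COUNT OFFSET: SUGGESTION, SUGGESTION, ..."
      "# WORD OFFSET"
    All other lines (banner, "*" for correct words, blank lines) are skipped.
    """
    unrecognized = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("& "):
            head, _, tail = line.partition(": ")
            _, word, _count, offset = head.split(" ")
            unrecognized.append((word, int(offset), tail.split(", ")))
        elif line.startswith("# "):
            _, word, offset = line.split(" ")
            unrecognized.append((word, int(offset), []))
    return unrecognized


# Made-up sample of what the spell checker process might print for one line.
sample = [
    "@(#) banner line printed at start-up",
    "*",
    "& dissemmi 2 4: dissemi, disdoni",
    "# xyzzo 14",
    "",
]
result = parse_pipe_output(sample)
print(result)
```

In a real pipeline, the lines would be read from the standard output of the spell checker process, and the resulting list would be handed over to the grammar checking component.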
A more straightforward and immediately possible use of the new Esperanto dictionary is
to use it with OpenOffice.org Writer, the open source text processor which originally gave
birth to MySpell. Since OpenOffice.org has been supporting Hunspell for quite a long time
already, it is very easy to use the newly created dictionary with any recent version of this
office package. Actually, it is enough to replace the two old Esperanto dictionary files in
the OpenOffice.org directory with the new ones, restart the program and the new dictionary
is working in the application.
For a more user-friendly process of installation, however, it would be worthwhile to get the new
dictionary officially accepted by the OpenOffice.org developer community, so that it could
be downloadable from the project’s website, or even eventually become the package’s main
spell checking dictionary for Esperanto.
At the same time, if enough attention were paid and some additional steps taken, it
should also be possible to get the dictionary distributed automatically with each
download of the Esperanto localization of OpenOffice.org. The present situation,
unfortunately, is that the spell checking dictionary has to be downloaded separately, after
installing the office package, apparently because of problems with licensing (OpenOffice.org
requires all components of its package to be triple-licensed, while the present Esperanto
dictionary is available only under GNU GPL). Such a recognition of the dictionary, however,
seems to be a somewhat time-consuming process and not so much related to computer
programming anymore, so, however worthwhile it is, it may probably be considered
out of the scope of this work.
Chapter 4
Evaluation of the Newly Constructed Dictionary
The dictionary whose construction has been described in this work is intended to be a
proof of concept, so it still has many flaws which should be addressed in
the upcoming development stages. These known problems and omissions include:
• Presently, Hunspell provides suggestions only for single items from the word list,
not for compounds. This is especially annoying since, due to the way in which the
dictionary files were created, most Esperanto words are now represented as compounds.
Either another working approach to implementing the constructed morphological
rules should be found, or the Hunspell code should be modified, so that the
program not only recognizes valid and invalid words, but also provides suggestions
in case of a misspelling.
• The issue of punctuation and non-alphabetical characters in general should be dealt
with. A hyphen, for example, is sometimes used within an Esperanto compound
word in order to give a hint on the word’s structure and achieve more clarity (e.g. to
distinguish between “sent-ema” = “sensitive” and “sen-tema” = “without a topic”).
Hyphens can be defined as word characters in Hunspell’s data files and used in the
same manner as letters, but some external programs, such as OpenOffice.org, have
their own hyphenation algorithms, which means the hyphen does not even get
delivered to the spell checker.
• A common challenge in spell checkers is the domain of proper names. This prob-
lem, although worth attention, has not been touched by the proof-of-concept spell
checker implementation described in this work. In order to develop a fully-featured
Esperanto spell checking dictionary, however, a thorough analysis of the Esperanto
system of proper nouns will have to be performed and its results implemented.
Special care should be given to the suffixes “ĉj” and “nj”, which are used to form
intimate forms of given names in Esperanto. They are unique in Esperanto, since they
are the only affixes which virtually require the preceding stem to be shortened, producing
forms such as “Joĉjo” from “Johano” (the male suffix) and “Manjo” from
“Maria” (the female suffix). Perhaps it would be useful to employ Hunspell’s actual
suffix system in order to implement this behavior.
• Last but not least, the semantic classification system realized in the scope of
this work may be further developed and improved. The output of the semantic
classification process for the check list given in chapter 3.2 has shown quite satisfying
results (the only error being the classification of “reĝisoro” = “a director” as a neutral
noun, since the stem was not present in the used word list of professions). However,
it still may be the case that many other words now get classified incorrectly,
probably often ending up in this very neutral noun category.
Chapter 5
Conclusion
This thesis has discussed the topic of spell checking, with particular emphasis on spell checking
of texts in the Esperanto language. An overview of major spell checking engines, their ca-
pabilities and history has been presented, including several tools dedicated to Esperanto.
The structure of Hunspell data files has been discussed in more detail. Existing Esperanto
dictionaries for MySpell and Ispell have been described and communication has been es-
tablished with the author of the latter one. Following an assessment of strengths and weak-
nesses of these dictionaries and discussion of reachable improvements, a plan for the con-
struction of a new Hunspell dictionary for Esperanto has been presented. Foundations of
Esperanto morphology have been discussed in order to adopt an approach to it that could be
made use of during the construction of the dictionary. Due to specific traits of Esperanto
word building, semantic classification of stems from its vocabulary has been identified as
an especially useful precondition for this activity; a classification scheme has been
proposed based on research in a grammar reference, and a complex system implementing this
classification has been planned and realized, making use of various sources of linguistic
data including dictionaries and corpora. An analysis has been performed on a corpus of
morphological segmentations of Esperanto words and word building rules employing se-
mantic classes have been created based on results of the analysis. A dictionary-based word
derivation system implementing these rules has been realized in Hunspell, making it
possible to utilize the acquired information on Esperanto morphology in the form of a spell
checking dictionary, which has been successfully tested in OpenOffice.org as a proof of concept
and shall now be improved and integrated in the grammar checker project of the E@I
organization, which shall be released in 2009.
The created Hunspell dictionary for Esperanto may be accessed online at the following
website: http://nlp.fi.muni.cz/~xblah/bc/
Bibliography
There are several essential publications in the present-day Esperanto movement which are so
common that they are well known just by their abbreviations. In this work, I follow this
practice and refer to the “Plena manlibro de Esperanta gramatiko” (Wennergren, 2005)
simply as “PMEG”, and to the “Plena ilustrita vortaro de Esperanto 2005” simply as “PIV”.
Aktoj de la Akademio 1963-1967. 2a eldono. Rotterdam : Akademio de Esperanto, 2007.
Oficiala Bulteno de la Akademio de Esperanto; no. 9. Text in Esperanto. Available from
WWW: <http://www.akademio-de-esperanto.org/aktoj/aktoj1/>.
ATKINSON, Kevin. GNU Aspell [online]. 0.50. SourceForge.net, 1998, 2002-08-21
[accessed 2008-05-21]. Text in English. Available from WWW: <http://aspell.net/>.
BICK, Eckhard. Tagging and Parsing an Artificial Language: An Annotated Web-Corpus
of Esperanto. In Proceedings of Corpus Linguistics 2007. Birmingham : University of
Birmingham, 2007. Text in English. Available from WWW:
<http://beta.visl.sdu.dk/~eckhard/pdf/CorpusLinguistics2007_esp.pdf>.
GABINSKI, Dmitri, POKROVSKIJ, Sergio. Esperanta literumilo por OpenOffice.org [online].
2003, 2003-12-23 [accessed 2008-05-22]. Text in Esperanto. Available from WWW:
<http://www.esperanto.pisem.net/literumilo.html>.
GLEDHILL, Christopher. The Grammar of Esperanto: A corpus-based description. Editor
U.J. Lüders. München / Newcastle : LINCOM Europa, 1998. 100 pp. Text in English.
ISBN 3895862177.
GORDON, Raymond G., Jr. (ed.). Ethnologue : Languages of the World. Fifteenth edition.
Dallas : SIL International, 2005. 1272 pp. Text in English. Available from WWW:
<http://www.ethnolugue.com/>. ISBN 155671159X.
GORIN, R. E., WILLISSON, Pace, KUENNING, Geoff. International Ispell [online].
3.3.02. 1971, 2005-06-11 [accessed 2008-05-21]. Text in English. Available from WWW:
<http://ficus-www.cs.ucla.edu/geoff/ispell.html>.
GRIMLEY EVANS, Edmund. Kapvortoj de PIV [online]. Versio 1.4. 2005-10-21
[accessed 2008-05-20]. Text in Esperanto. Available from the Internet Archive:
<http://web.archive.org/web/20061012192331/rano.org/pivkap>.
HANA, Jiří. Two-level morphology of Esperanto [master thesis]. Prague, 1998. 85 pp.
Charles University Prague, Faculty of Mathematics and Physics. Master thesis supervisor
RNDr. Jan Hajič, Ph.D. Text in English. Available from WWW:
<http://www.ling.ohio-state.edu/~hana/esr/thesis.pdf>.
HENDRICKS, Kevin. MySpell [online]. 3.0. 2000 [accessed 2008-05-21]. Text in English.
Available from WWW: <http://lingucomponent.openoffice.org/MySpell-3.zip>.
HIROTAKA, Masaaki, ONO, Takao. Plena Elektronika Vorto-Listo Esperanto-Japana
[online]. 1-a eldono. Jokohamo : 1997, 1997-01-06 [accessed 2008-05-20]. Text in Japanese
and Esperanto. Available from WWW: <http://www.s-w.co.jp/~taon/dentan/index.html>.
LENDON, Klivo. Kontrolu Literumadon [online]. 1.0. Oakville, Ontario, Canada : 1992,
1992-09-28 [accessed 2008-05-21]. Text in English and Esperanto. Available from FTP:
<ftp://garbo.uwasa.fi/pc/linguistics/kl100.zip>.
lernu!. Learning / Words / World learning / By topic [online]. E@I, 2002 [accessed
2008-05-20]. Text in English. Available from WWW:
<http://en.lernu.net/lernado/vortoj/vortlernado/index.php?id=37>.
NÉMETH, László. Hunspell : open source spell checking, stemming, morphological analysis
and generation under GPL, LGPL or MPL licenses [online]. 1.2.2. SourceForge.net,
2005a, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from WWW:
<http://hunspell.sourceforge.net/>.
NÉMETH, László. Hunspell − format of Hunspell dictionaries and affix files [online].
2005b, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from WWW:
<http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754>.
NÉMETH, László. Hunspell − spell checker, stemmer and morphological analyzer
[online]. 2005c, 2008-04-12 [accessed 2008-05-20]. Text in English. Available from
WWW: <http://sourceforge.net/docman/display_doc.php?docid=90720&group_id=143754>.
PILGER, Wouter F. Provizora privata listo de komunlingvaj nomoj de plantoj de
nordokcidenta Eŭropo. Lelystad : Vulpo-libroj, 1982. 72 pp. Text in Esperanto. Available
from WWW: <http://www.xs4all.nl/~pilger/wfp/pplp-u8.htm>. ISBN 9070074311.
PILGER, Wouter F. Provizora privata listo de nomoj de bestoj : Mamuloj. Lelystad :
Vulpo-libroj, 1992. 72 pp. Text in Esperanto. Available from WWW:
<http://www.xs4all.nl/~pilger/wfp/mamulu8.htm>. ISBN 9070074354.
PILGER, Wouter F. Birdonomoj en Esperanto por Hejma Vortaro [online]. c1996a,
2002-05-10 [accessed 2008-05-20]. Text in Esperanto. Available from WWW:
<http://www.xs4all.nl/~pilger/wfp/birdhvu8.htm>.
PILGER, Wouter F. Komunlingvaj nomoj de Eŭropaj birdoj [online]. c1996b, 2002-02-20
[accessed 2008-05-20]. Text in Esperanto. Available from WWW:
<http://www.xs4all.nl/~pilger/wfp/birdeou8.htm>.
PILGER, Wouter F. Provizora privata listo de nomoj de bestoj : Insektoj [online]. c1996c,
<http://www.xs4all.nl/~pilger/wfp/insektu8.htm>.
PILGER, Wouter F. Provizora privata listo de nomoj de legomoj en Nord-Okcidenta
Eŭropo [online]. c1997, 1997-10-27 [accessed 2008-05-20]. Text in Esperanto. Available
from WWW: <http://www.xs4all.nl/~pilger/wfp/legomou8.htm>.
Plena ilustrita vortaro de Esperanto 2005. Editor Gaston Waringhien. Paris : SAT, 2005.
1265 pp. Text in Esperanto. ISBN 2950243282.
POKROVSKIJ, Sergio. Vortaro por ISpell [online]. c1997, 2002-03-12 [accessed 2008-05-22].
Text in Esperanto, English. Available from WWW:
<http://www.esperanto.mv.ru/Download/ispell/ispelleo.tgz>.
SAUSSURE, René de. La construction logique des mots en Espéranto : réponse a des cri-
tiques. Genève : Par Antido, 1910. 83 pp. Text in French.
TRZEWIK, Artur. Esperantilo – text editor with particular Esperanto functions, spell and
grammar checking and machine translation [online]. 0.982. [2003] [accessed 2008-05-21].
Text in English. Available from WWW: <http://www.xdobry.de/esperantoedit/index_en.html>.
WENNERGREN, Bertilo. Plena manlibro de Esperanta gramatiko. El Cerrito : ELNA,
2005. 696 pp. Text in Esperanto. Available from WWW: <http://bertilow.com/pmeg/>.
ISBN 0939785072.
WENNERGREN, Bertilo. Seksa signifo de vortoj kaj radikoj en Esperanto [online].
<http://bertilow.com/seksaj_vortoj/index.html>.
Wikipedia contributors. Esperanto vocabulary. In Wikipedia, The Free Encyclopedia [online].
2008-05-17 [accessed 2008-05-21]. Text in English. Available from WWW:
<http://en.wikipedia.org/w/index.php?title=Esperanto_vocabulary&oldid=212992148>.
WITKAM, Toon. Automatische Morphemanalyse in Esperanto macht Komposita besser
lesbar auf dem Bildschirm. In BLANKE, Detlev. Esperanto heute – Wie aus einem Projekt
eine Sprache wurde : Beiträge der 16. Jahrestagung der Gesellschaft für Interlinguistik
e.V., 1.-3. Dezember 2006 in Berlin. Berlin : GIL, 2007a. Text in German. ISSN
1432-3567.
WITKAM, Toon. La ekscito de vortstatistiko : Kiel krudforta kunmet-analizo kompletigas
tekstkontrolon. Utrecht : [s.n.], [2007b]. 32 pp. Text in Esperanto.
WITKAM, Toon. ESPSOF : Esperanto-Softvaro por Vindozo. versio 0.8. Utrecht : [s.n.],
2008. 7 pp. plus Microsoft Office files. Text in Esperanto, Dutch.
Worsten. Multlingva vortaro pri profesioj, funkcioj kaj rangoj [online]. Wrocław : c2003,
2008-03 [accessed 2008-05-20]. Text in Esperanto, Czech, German, English, Spanish,
French, Italian, Polish. Available from WWW: <http://worsten.org/vortaroj/profesioj.htm>.
ZAMENHOF, L. Fundamento de Esperanto. Warszawa : [s.n.], 1905. Text in Esperanto,
French, English, German, Russian, Polish. Available from WWW:
<http://www.akademio-de-esperanto.org/fundamento/>.
Appendix A
The 16 Rules of Esperanto Grammar
The grammar of Esperanto has been analyzed many times and there are different theories
and opinions as to how the language should be approached from the linguistic point of
view. The most renowned contemporary Esperanto grammar reference is PMEG (Wennergren,
2005), but there have recently also been novel approaches to the topic, such as the
description by Gledhill (Gledhill, 1998), which is based on a modern method of corpus
analysis. The very first description of Esperanto grammar, in 16 rules, however incomplete
and inexpert it is, was given by the author of the language himself, Dr. L. L. Zamenhof,
when the language was published in 1887. It later became a part of the so-called “Fundamento
de Esperanto” (Zamenhof, 1905), which later proved to be an efficient tool for
preventing the language from splitting into dialects. The following is a copy of the English
version of the original 16 rules of Esperanto grammar, provided here so that an uninformed
reader may get a basic idea of the structure of Esperanto:
A) The alphabet
Aa (a as in “last”), Bb (b as in “be”), Cc (ts as in “wits”), Ĉĉ (ch as in “church”), Dd (d as
in “do”), Ee (a as in “make”), Ff (f as in “fly”), Gg (g as in “gun”), Ĝĝ (j as in “join”), Hh
(h as in “half”), Ĥĥ (strongly aspirated h, “ch” in “loch”), Ii (i as in “marine”), Jj (y as in
“yoke”), Ĵĵ (z as in “azure”), Kk (k as in “key”), Ll (l as in “line”), Mm (m as in “make”),
Nn (n as in “now”), Oo (o as in “not”), Pp (p as in “pair”), Rr (r as in “rare”), Ss (s as in
“see”), Ŝŝ (sh as in “show”), Tt (t as in “tea”), Uu (u as in “bull”), Ŭŭ (u as in “mount” –
used in diphthongs), Vv (v as in “very”), Zz (z as in “zeal”).
Remark: If it be found impraticable to print works with the diacritical signs (^, ˘), the letter
h may be substituted for the sign (^), and the sign (˘), may be altogether omitted. [9]
[9] This surrogate system for representing some characters of the Esperanto alphabet is
sometimes called the “Zamenhof-convention”, but it is not very popular. Nowadays, the most
popular system for writing Esperanto letters on a computer when Unicode support is missing is
the “x-convention”, which places an “x” after the Latin alphabet version of the particular
letter. This has been promoted as more practical, since the difference between “ŭ” and “u”
does not disappear as in the case of the Zamenhof-convention, and also words written using
this surrogate alphabet still appear in proper order when sorted alphabetically. Yet another
system is the “circumflex-convention”, which puts the circumflex symbol (“^”) after the letter.
B) Parts of Speech
1. The Article
There is no indefinite, and only one definite, article, la, for all genders, numbers, and cases.
2. Substantives
Substantives are formed by adding o to the root. For the plural, the letter j must be added to
the singular. There are two cases: the nominative and the objective (accusative). The root
with the added o is the nominative, the objective adds an n after the o. Other cases are
formed by prepositions; thus, the possessive (genitive) by de, “of”; the dative by al, “to”,
the instrumental (ablative) by kun, “with”, or other preposition as the sense demands. E. g.
root patr, “father”; la patr'o, “the father”; la patr'o'n, “the father” (objective), de la patr'o,
“of the father”; al la patr'o, “to the father”; kun la patr'o, “with the father”; la patr'o'j, “the
fathers”; la patr'o'j'n, “the fathers” (obj.), por la patr'o'j, “for the fathers”.
3. Adjectives
Adjectives are formed by adding a to the root. The numbers and cases are the same as in
substantives. The comparative degree is formed by prefixing pli (more); the superlative by
plej (most). The word “than” is rendered by ol, e. g. pli blanka ol neĝo, “whiter than snow”.
4. Numerals
The cardinal numerals do not change their forms for the different cases. They are: unu (1),
du (2), tri (3), kvar (4), kvin (5), ses (6), sep (7), ok (8), naŭ (9), dek (10), cent (100), mil
(1000). The tens and hundreds are formed by simple junction of the numerals, e. g. 533 =
kvin'cent tri'dek tri. Ordinals are formed by adding the adjectival a to the cardinals, e. g.
unu'a, “first”; du'a, “second”, etc. Multiplicatives (as “threefold”, “fourfold”, etc.) add obl,
e. g. tri'obl'a, “threefold”. Fractionals add on, as du'on'o, “a half”; kvar'on'o, “a quarter”.
Collective numerals add op, as kvar'op'e, “four together”. Distributive prefix po, e. g., po
kvin, “five apiece”. Adverbials take e, e. g., unu'e, “firstly”, etc.
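As an illustration only (it is not part of the original 16 rules), the “simple junction of the numerals” described in rule 4 can be sketched in Python for the range 1 to 999. The function name cardinal is hypothetical, and the apostrophes which mark morpheme boundaries in the rule text are omitted.

```python
UNITS = ["", "unu", "du", "tri", "kvar", "kvin", "ses", "sep", "ok", "naŭ"]


def cardinal(n):
    """Return the Esperanto cardinal numeral for 1 <= n <= 999."""
    if not 1 <= n <= 999:
        raise ValueError("only 1..999 are covered by this sketch")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    words = []
    if hundreds:
        # "cent" alone means 100; multiples are joined, e.g. "kvincent".
        words.append(("" if hundreds == 1 else UNITS[hundreds]) + "cent")
    if tens:
        # "dek" alone means 10; multiples are joined, e.g. "tridek".
        words.append(("" if tens == 1 else UNITS[tens]) + "dek")
    if units:
        words.append(UNITS[units])
    return " ".join(words)


print(cardinal(533))
```

For 533 the sketch yields the form given in the rule, written without apostrophes.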
5. Pronouns
The personal pronouns are: mi, “I”; vi, “thou”, “you”; li, “he”; ŝi, “she”; ĝi, “it”; si, “self”;
ni, “we”; ili, “they”; oni, “one”, “people”, (French “on”). Possessive pronouns are formed
by suffixing to the required personal, the adjectival termination. The declension of the pro-
nouns is identical with that of substantives. E. g. mi, “I”; mi'n, “me” (obj.); mi'a, “my”,
“mine”.
6. Verbs
The verb does not change its form for numbers or persons, e. g. mi far'as, “I do”; la patr'o
far'as, “the father does”; ili far'as, “they do”. The present tense ends in as, e. g. mi far'as,
“I do”. The past tense ends in is, e. g. li far'is, “he did”. The future tense ends in os, e. g. ili
far'os, “they will do”. The subjunctive mood ends in us, e. g. ŝi far'us, “she may do”. The
imperative mood ends in u, e. g. ni far'u, “let us do”. The infinitive mood ends in i, e. g.
far'i, “to do”. There are two forms of the participle in the international language, the
changeable or adjectival, and the unchangeable or adverbial. The present participle active
ends in ant, e. g. far'ant'a, “he who is doing”; far'ant'e, “doing”. The past participle active
ends in int, e. g. far'int'a, “he who has done”; far'int'e, “having done”. The future participle
active ends in ont, e. g. far'ont'a, “he who will do”; far'ont'e, “about to do”. The present
participle passive ends in at, e. g. far'at'e, “being done”. The past participle passive ends in
it, e. g. far'it'a, “that which has been done”; far'it'e, “having been done”. The future partici-
ple passive ends in ot, e. g. far'ot'a, “that which will be done”; far'ot'e, “about to be done”.
All forms of the passive are rendered by the respective forms of the verb est (to be) and the
participle passive of the required verb; the preposition used is de, “by”. E. g. ŝi est'as
am'at'a de ĉiu'j, “she is loved by every one”.
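The verb endings and participle suffixes listed in this rule form a small closed table, which the following Python sketch encodes directly. The names `conjugate`, `participle` and `passive` are hypothetical helpers chosen for illustration.

```python
# Endings taken from the rule above.
FINITE_ENDINGS = {
    "present": "as", "past": "is", "future": "os",
    "subjunctive": "us", "imperative": "u", "infinitive": "i",
}
PARTICIPLE_SUFFIXES = {
    ("active", "present"): "ant", ("active", "past"): "int",
    ("active", "future"): "ont", ("passive", "present"): "at",
    ("passive", "past"): "it", ("passive", "future"): "ot",
}


def conjugate(root: str, form: str) -> str:
    """Root + verbal ending; the form is the same for all persons and numbers."""
    return root + FINITE_ENDINGS[form]


def participle(root: str, voice: str, tense: str, ending: str = "a") -> str:
    """Root + participle suffix + 'a' (adjectival) or 'e' (adverbial)."""
    return root + PARTICIPLE_SUFFIXES[(voice, tense)] + ending


def passive(root: str, tense: str, participle_tense: str = "present") -> str:
    """Form of est (to be) followed by the passive participle of the verb."""
    return conjugate("est", tense) + " " + participle(root, "passive", participle_tense)


print(conjugate("far", "future"))   # compare ili far'os
print(passive("am", "present"))     # compare ŝi est'as am'at'a
```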
7. Adverbs
Adverbs are formed by adding e to the root. The degrees of comparison are the same as in
adjectives, e. g., mi'a frat'o kant'as pli bon'e ol mi, “my brother sings better than I”.
8. Prepositions
All prepositions govern the nominative case.
C) General Rules
9. Pronunciation
Every word is to be read exactly as written, there are no silent letters.
10. Accent
The accent falls on the last syllable but one, (penultimate).
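A minimal Python sketch of this rule follows. It assumes that every a, e, i, o or u forms a syllable of its own and that j and ŭ never do; this assumption is not stated in the rule itself. The function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: j and ŭ are not syllable nuclei


def stressed_vowel_index(word: str) -> int:
    """Index of the vowel of the penultimate syllable (or of the only vowel)."""
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    if not positions:
        raise ValueError("word contains no vowel")
    return positions[-2] if len(positions) >= 2 else positions[0]


def mark_stress(word: str) -> str:
    """Return the word with the stressed vowel written in upper case."""
    i = stressed_vowel_index(word)
    return word[:i] + word[i].upper() + word[i + 1:]


print(mark_stress("vaporŝipo"))
print(mark_stress("familio"))
```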
11. Compounds
Compound words are formed by the simple junction of roots, (the principal word standing
last), which are written as a single word, but, in elementary works, separated by a small
line ('). Grammatical terminations are considered as independent words. E. g. vapor'ŝip'o,
“steamboat” is composed of the roots vapor, “steam”, and ŝip, “a boat”, with the substanti-
val termination o.
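The segmented notation used in elementary works lends itself to a simple mechanical treatment. The Python sketch below splits a word such as vapor'ŝip'o at the small lines and picks out the principal (last) root. The set `ENDINGS` is collected from the grammatical terminations named in the rules above, and the function name `analyse` is hypothetical. Note that the sketch treats every piece that is not an ending as a root, so it does not tell affixes apart from roots.

```python
# Grammatical terminations named in rules 2 to 7.
ENDINGS = {"o", "a", "e", "i", "as", "is", "os", "us", "u", "j", "n"}


def analyse(marked: str) -> dict:
    """Split a word written with ' separators into roots and terminations."""
    pieces = marked.split("'")
    roots = [p for p in pieces if p not in ENDINGS]
    return {
        "word": "".join(pieces),                  # ordinary spelling
        "roots": roots,
        "principal": roots[-1] if roots else None,  # principal word stands last
        "endings": [p for p in pieces if p in ENDINGS],
    }


print(analyse("vapor'ŝip'o"))
```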
12. Negative
If there be one negative in a clause, a second is not admissible.
13. Direction
In phrases answering the question “where?” (meaning direction), the words take the termi-
nation of the objective case; e. g. kie'n vi ir'as? “where are you going?”; dom'o'n, “home”;
London'o'n, “to London”, etc.
14. The Indefinite Preposition

Every preposition in the international language has a definite fixed meaning. If it be neces-
sary to employ some preposition, and it is not quite evident from the sense which it should
be, the word je is used, which has no definite meaning; for example, ĝoj'i je tio, “to rejoice
over it”; rid'i je tio, “to laugh at it”; enu'o je la patr'uj'o, “a longing for one’s fatherland”.
In every language different prepositions, sanctioned by usage, are employed in these dubi-
ous cases, in the international language, one word, je, suffices for all. Instead of je, the ob-
jective without a preposition may be used, when no confusion is to be feared.
15. New Words
The so-called “foreign” words, i. e. words which the greater number of languages have de-
rived from the same source, undergo no change in the international language, beyond con-
forming to its system of orthography. – Such is the rule with regard to primary words,
derivatives are better formed (from the primary word) according to the rules of the interna-
tional grammar, e. g. teatr'o, “theatre”, but teatr'a, “theatrical”, (not teatrical'a), etc.
16. Elision
The a of the article, and final o of substantives, may be sometimes dropped euphoniae gra-
tia, e. g. de l’ mond'o for de la mond'o; Ŝiller’ for Ŝiller'o; in such cases an apostrophe
should be substituted for the discarded vowel.
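A spell checker has to recognise elided forms as variants of the full words. The following Python sketch restores the dropped vowel for a single token. It is a simplified illustration only: the function name `restore_elision` is hypothetical, and the sketch assumes that any token ending in an apostrophe other than l’ is a substantive with an elided o.

```python
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophe


def restore_elision(token: str) -> str:
    """Expand l' to la and a substantive with elided final o to its full form."""
    if token and token[-1] in APOSTROPHES:
        stem = token[:-1]
        if stem.lower() == "l":
            return "la"
        return stem + "o"
    return token  # nothing was elided


print(restore_elision("l\u2019"))
print(restore_elision("Ŝiller\u2019"))
```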
Appendix B
An Overview of Esperanto Affixes
This list covers some of the affixes that form an essential part of Esperanto morphology. It
is a slight modification of the list found in the English Wikipedia (Wikipedia Contributors,
2008) and is provided here for reference, should the discussion of a particular affix in the
text of this work be unclear.
Affix | Meaning | Examples
-aĉ | pejorative (expresses a poor opinion of the object or action) | skribaĉi (to scrawl, from 'write'); veteraĉo (foul weather); domaĉo (a hovel); rigardaĉi (to gape at, from 'look at')
-ad | imperfective aspect (frequent, repeated, or continual action); as a noun, an action or process | kuradi (to keep on running); parolado (a speech); adi (to carry on)
-aĵ | a concrete manifestation | manĝaĵo (food, from 'eat'); novaĵo (news, novelty)
-an | a member, follower, participant, inhabitant | kristano (a Christian); marksano (a Marxist); usonano (a US American) [cf. amerikano (a continental American)]
-ar | a collective group | arbaro (a forest, from 'tree'); vortaro (a dictionary, from 'word' [a set expression]); homaro (humanity, from 'human' [a set expression; 'crowd, mob' is homamaso])
-ĉj | masculine affectionate form; the root is truncated | Joĉjo (Jack); paĉjo (daddy); fraĉjo (bro)
-ebl | possible | kredebla (believable); videbla (visible)
-ec | an abstract quality | amikeco (friendship); boneco (goodness); italeca (Italianesque)
-eg | augmentative; sometimes pejorative connotations when used with people | domego (a mansion); virego (a giant); librego (a tome); varmega (boiling hot); ridegi (to guffaw)
-ej | a place characterized by the root (not used for toponyms) | lernejo (a school, from 'to learn'); vendejo (a store, from 'to sell'); juĝejo (a court, from 'to judge'); kuirejo (a kitchen, from 'to cook'); hundejo (a kennel, from 'dog'); senakvejo (a desert, from 'without water')
-em | having a propensity, tendency | ludema (playful); parolema (talkative); kredema (credulous)
-end | mandatory | pagenda (payable); legendaĵo (required reading)
-er | the smallest part | ĉenero (a link, from 'chain'); fajrero (a spark, from 'fire'); neĝero (a snowflake, from 'snow'); kudrero (a stitch, from 'sew'); ero (a crumb etc.)
-estr | a leader, boss | lernejestro (a school principal); urbestro (a mayor, from 'city'); centestro (a centurion, from 'hundred')
-et | diminutive; sometimes affectionate connotations when used with people | dometo (a hut); libreto (a booklet); varmeta (lukewarm); rideti (to smile)
-id | an offspring, descendant | katido (a kitten); reĝido (a prince, from 'king'); arbido (a sapling, from 'tree'); izraelido (an Israelite)
-ig | to make, to cause (transitivizer/causative) | mortigi (to kill, from 'die'); purigi (to clean); konstruigi (to have built)
-iĝ | to become (intransitivizer/inchoative/middle voice) | amuziĝi (to enjoy oneself); naskiĝi (to be born); ruĝiĝi (to blush, from 'red')
-il | an instrument | ludilo (a toy, from 'play'); tranĉilo (a knife, from 'cut'); helpilo (a remedy, from 'help')
-in | female | bovino (a cow); patrino (a mother); studentino (a co-ed)
-ind | worthy of | memorinda (memorable); kredinda (credible); fidinda (dependable, trustworthy)
-ing | a holder, sheath | glavingo (a scabbard, from 'sword'); kandelingo (a candle-holder); dentingo (a tooth socket)
-ism | a doctrine, system (as in English) | komunismo (Communism); kristanismo (Christianity)
-ist | person professionally or avocationally occupied with an idea or activity (a narrower use than in English) | instruisto (teacher); dentisto (dentist); abelisto (a beekeeper); komunisto (a communist)
-nj | feminine affectionate form; the root is truncated | Jonjo (Joanie); panjo (mommy); anjo (granny)
-obl | multiple | duobla (double); trioble (triply)
-on | fraction | duona (half [of]); centono (one hundredth)
-op | collective numeral | duope (by twos); gutope (drop by drop)
-uj | a (loose) container, country (archaic when referring to a political entity), a tree of a certain fruit (archaic) | monujo (a purse, from 'money'); Anglujo (England [Anglio in current usage]); Kurdujo (Kurdistan, the Kurdish lands); pomujo (apple tree [now pomarbo])
-ul | a person characterized by the root | junulo (a youth); sanktulo (a saint, from 'holy'); aboculo (a beginning reader, from aboco "ABC's"); aĉulo (a wretch, from the suffix aĉ); tiamulo (a contemporary, from 'then')
-um | undefined ad hoc suffix (used sparingly) | kolumo (a collar, from 'neck'); krucumi (to crucify, from 'cross'); malvarmumo (a cold, from 'cold'); plenumi (to fulfill, from 'full'); brakumi (to hug, from 'arm'); dekstrume (clockwise, from 'right')
bo- | relation by marriage, in-law | bopatro (a father-in-law); boedzino (a sister-wife)
dis- | separation, scattering | disĵeti (to throw about); dissendi (to distribute); disatomi (to split by atomic fission)
ek- | perfective aspect (beginning, sudden, or momentary action) | ekbrili (to flash); ekami (to fall in love); ekkrii (to cry out); ekde (inclusive 'from'); ek! (hop to!)
eks- | former, ex- | eksedzo (an ex-husband); eksbovo (a steer [jokingly, from 'bull']); Eks la estro! (Down with our leader!)
fi- | shameful, nasty | fihomo (a wicked person); fimensa (foul-minded); fivorto (a profane word); Fi al vi! (Shame on you!)
ge- | both sexes together | gepatroj (parents); gesinjoroj (ladies and gentlemen); la geZamenhofoj (the Zamenhofs); gelernejo (a coeducational school); geiĝi (to pair up, to mate)
mal- | antonym | malgranda (small); malriĉa (poor); malino (a male [jokingly]); maldekstrume (counter-clockwise)
mis- | incorrectly, awry | misloki (to misplace); misakuzi (to wrongly accuse); misfamiga (disparaging, from fama 'well-known' and the causative suffix -ig)
pra- | great-(grand-), primordial, proto- | praavo (a great-grandfather); prapatro (a forefather); prabesto (a prehistoric beast); prahindeŭropa (Proto-Indo-European)
re- | over again, back again | resendi (to send back); rekonstrui (to rebuild); reaboni (to renew a subscription); rebrilo (reflection, glare, from 'shine'); reira bileto (a return ticket, from iri 'to go')
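To show how the affixes in this overview combine with roots and endings, here is a minimal Python sketch that assembles a word from its parts and rejects affixes missing from the list. It is not the derivation system implemented in the affix file; the function name `derive` is hypothetical, and the truncating suffixes -ĉj and -nj are left out because they do not attach by simple concatenation.

```python
# Affixes from the overview above (without the truncating -ĉj and -nj).
SUFFIXES = {
    "aĉ", "ad", "aĵ", "an", "ar", "ebl", "ec", "eg", "ej", "em", "end",
    "er", "estr", "et", "id", "ig", "iĝ", "il", "in", "ind", "ing",
    "ism", "ist", "obl", "on", "op", "uj", "ul", "um",
}
PREFIXES = {"bo", "dis", "ek", "eks", "fi", "ge", "mal", "mis", "pra", "re"}


def derive(root: str, suffixes=(), ending: str = "", prefixes=()) -> str:
    """Concatenate prefixes + root + suffixes + ending, checking the affixes."""
    for p in prefixes:
        if p not in PREFIXES:
            raise ValueError(f"unknown prefix: {p}")
    for s in suffixes:
        if s not in SUFFIXES:
            raise ValueError(f"unknown suffix: {s}")
    return "".join(prefixes) + root + "".join(suffixes) + ending


print(derive("lern", ["ej", "estr"], "o"))       # compare lernejestro
print(derive("dekstr", ["um"], "e", ["mal"]))    # compare maldekstrume
```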