Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
thus we do not need to keep all entries in the lex- In the second and third cases, the reduction
icon but only those entries which are basic (moti- of the lexicon can be observed. The pattern
vating) word forms for the neighbouring entries of komunismus P derives the word forms by ex-
the found n-tuples. changing the string at the end of the basic (moti-
The entries that are thereby eliminated from the vating) word form:
lexicon can be constructed according to the new smus → zace, sticky, sta, smus
pattern from the basic (motivating) word form.
and the corresponding change of the attributes of
The original word forms can be determined algo-
the constructed tag.
rithmically and their original lexical meaning can
For an implementation of these WD patterns, a
be inferred as well.
parallel with Finite State Automata (FST) is use-
We will reduce the lexicon using the description
ful. The property of chaining (Roche and Schabes,
of the WD process which yields the predictable
1997) is very suitable here – it allows us to build
changes in the semantics of the derived entries.
WD patterns as hierarchical modules. This prop-
The WD process can be illustrated by the Fig. 1.
erty makes it possible to limit the duplicity of the
It can be seen that the sub-entries humanizace (hu-
stored information and increase their lucidity.
manisation), humanisticky (humanistically ADV),
humanistický (humanistic ADJ), humanistčin (hu- 7.1 WD Relation Mining
manist’s FEM POSS ADJ), humanistka (human- We explained how to extend the morphological
ist FEM), humanistův (humanist’s MAS POSS database employing the regular changes of word
ADJ), humanista (humanist MAS), humanismus forms that can be observed in the course of the
(humanism), can be assigned: WD processes (Osolsobě et al., 2002). We have
1. either to the respective infl. paradigms: shown that if the WD processes are described by
humanizace:růže
humanisticky:otrocky the rules it is possible to reduce our stem dictio-
humanistický:otrocký nary and eventually to obtain a dictionary of roots.
humanistčin:matčin To make the process of searching for the dis-
humanistka:matka
humanistův:otcův crete description of WD processes simpler we
humanista:husita have implemented an algorithm that looks up the
humanismus:komunismus
relations between the strings corresponding to the
2. or to:
humanizace:růže individual entries in the morphological database.
humanistický:otrocký P The input for the algorithm is a description of
humanista:husita P
humanismus:komunismus the variations of the individual word forms to-
3. or to a deriv. pattern (meta-pattern): gether with conditions placed on the attributes of
humanismus:komunismus P the respective grammatical tags.
růže otrocky
humanizace y .. humanisticky
..................
.
........
otrocký P
... humanistick⊕ý ........ý..........
...
.. ...
... otrocký
zace.......
.......
.. sticky ............ humanistický
.
... ..
.. ......
... ...... matčin
...
...
čin .........
humanistčin
............ ..
komunismus P matka P
humani⊕smus humanist⊕ka
....... ........
...
... ........ sta
.......... ka ..........
........ ka
........ .
...
........ ........ ..........
... ........ . .
...
...
...
...smus husita P ův
.............................
otcův matka
...
...
... humanist⊕a a
............
...............
humanistův humanistka
... .....
...
.........
husita
.
humanista
komunismus
humanismus
To describe the variations of the word forms we We know that Ai,j are constants and $j vari-
will use: ables which can take values from Σ∗ . For an arbi-
• variables $1, $2, . . . (values ∈ Σ∗ ), trary string Si , a regular grammar can be written
• constants / “affixes” : A1,1 , A1,2 , . . . ∈ Σ∗ (see Eq. 2).
• concatenation operator ⊕
• strings Si ∈ {Ai,1 , Ai,2 , . . . $1, $2, . . .}∗ S → Ai,1 $1N1 $1| . . . |$m → E
• conditions – constraints on the values of N1 → Ai,2 $2N2 E → a|aE|b|bE| . . .
given attributes, eventually a determination ...
whether the given word form has to be Nm → Ai,m+1 where E ∈ Σ∗
present in the database: C1 , C2 , . . . (2)
It can be seen that for each string Si a non-
7.1.1 Input deterministic transducer can be constructed that
The task assumes: takes a word form on the input and on the output,
it produces a set of all acceptable evaluations of
• n . . . number of the word forms searched for,
the variables $1 . . . $m, i. e. a set (possibly empty)
• n-tuple: (S1 , C1 ) . . . (Sn , Cn ).
of the m-tuples of members of set Σ∗ .
The Si strings should be written in such a way 7.1.2 The Algorithm
so as not to contain the pairs constant – constant,
variable – variable standing adjacent. First we have to select the pairs (Si , Ci ) for
which the requirement in the condition states that
• two neighbouring constants can be merged
their corresponding word forms have to occur in
into one
the database. Those word forms, strings and pairs
• two variables can be separated by constant
will be called located. The word forms that we
(empty string).
can determine from the located ones after substi-
• if the variable at beginning is required, or at
tution values for the variables in the strings will be
the end of the string, then we set Ai,1 , or Ai,m
labelled as inferred.
≡
We can speak here about free and bound occur-
Each string Si thus can be given without loosing rences of the variables. Free variables will be de-
any generality in the following way: termined during the computation of the same au-
tomaton in which they take place. Bound vari-
Si ≡ Ai,1 ⊕ $1 ⊕ Ai,2 ⊕ $2⊕ ables are dependent on the computation of other
(1) automata. The values are instantiated for bound
Ai,3 ⊕ . . . ⊕ $m ⊕ Ai,m+1
variables before the computing the automaton in 8 The First Results from our Data
which the variables occur. Thus we can work with
them in a given automaton as with the constants – Table 5 displays the individual steps taken dur-
this simplifies the automaton. ing forming the respective WD nest. The step A
(see Table 5) consisted in the derivation of mas-
When a given word form is accepted by the
culine possessives using suffix -ův. It is obvious
transducer (for string Si ) we obtain the respective
that this derivation is regular, the number of lem-
evaluation variables included in Si as an output. If
mata has not changed – all of them have been as-
the same variables occur also in other strings they
signed to the paradigm for the possessives otcův
can be substituted (instantiated) by the values.
(father’s). In the step B the gender of noun is
Thus step by step we will construct the respec- changed from masculine to feminine using suf-
tive FS automata for located strings Si using the fix -ka. Moreover, in this step the paradigms
instantiation of the variables. If the automaton neumětel a Kocáb nM have been removed. Also
does not contain any free variables it is obvious the number of lemmata assigned to the paradigm
that the respective pair is inferred (it can be lo- učitel (teacher) has been reduced to half, i. e. from
cated at the same time) – these will be labelled as 908 to 454. This means that according to our data
inferred+located). (our morphological database) half of the agentives
The order in which the individual automata will cannot form the feminine counterpart and this re-
be applied can be optimised. A certain part of the sult can be expected to be confirmed by examin-
state space being searched can be eliminated in ad- ing a larger corpus. The step C is again regu-
vance based on conditions Ci , i. e. it is enough lar – it consists in the derivation of feminine pos-
to search/eliminate entries associated with the pat- sessives using suffix -in with a number of lem-
terns which guarantee/eliminate some attributes of mata not being changed. In the step D the ad-
the tag. jectives are formed by means of suffix -ský and
We suppose that by means of located strings all this process is less regular. From the possible 454
the variables used in the inferred strings can be in- lemmata belonging to the paradigm učitel the ad-
stantiated in such a way that we will be able to de- jectives are derived only from 113+21+16=150.
termine correctly inferred word forms relying only Moreover, these adjectives split into three adjec-
on the knowledge of the located word forms, i. e. tive paradigms pražský (Prague), společenský (so-
in the cases where the inferred strings do not con- cial) and kremžský (Crems) depending on whether
tain free variables. In the opposite case the algo- they form a comparative and adverb or not. The
rithm has to stop prematurely. following step is again regular – it involves the
The optimisation will determine the order in derivation of adverbs from adjectives by shorten-
which the individual automata containing free ing the last vowel from ý to y. It can be seen
variables will be applied. that from the adjectives belonging to the paradigm
kremžský such adverbs cannot be formed at all.
We will start with the first automaton following
The step E is irregular as well and it involves the
the order determined by the optimisation. Step by
derivation of the nouns from the respective adjec-
step we will go through all the entries and then
tives by replacing the suffix -ský for -stvı́.
for all possible evaluations we will instantiate the
variables and continue with searching the entries
9 Conclusions
acceptable for the next automaton (according to
the given ordering), i. e. we look for the next ele- The purpose of the paper is to show how se-
ment of the respective n-tuple. lected word derivation relations in Czech can be
If we succeed in the instantiation of all vari- described using the morphological analyser ajka
ables and determine all inferred word forms and if and the program I par which works with the
all inferred+located word forms are found in the Czech morphological database. The Czech data
database then the currently determined n-tuple can necessary for this description are: stem dictionary
be sent to the output. used by ajka containing 385,066 Czech stems
908 učitel,otcův 454 učitel,otcův,matka 454 učitel,otcův,matka,matčin
3 obyvatel,otcův 2 obyvatel,otcův,matka 2 obyvatel,otcův,matka,matčin
A B
2 přı́tel,otcův 2 přı́tel,otcův,matka 2 přı́tel,otcův,matka,matčin
→ →
1 neumětel,otcův
1 Kocáb nM,otcův
C↓
113 učitel,otcův,matka,matčin,pražský,pražsky 113 učitel,otcův,matka,matčin,pražský
21 učitel,otcův,matka,matčin,společenský,společensky D 21 učitel,otcův,matka,matčin,společenský
← 16 učitel,otcův,matka,matčin,kremžský
2 obyvatel,otcův,matka,matčin,pražský,pražsky 2 obyvatel,otcův,matka,matčin,pražský
E↓
46 učitel,otcův,matka,matčin,pražský,pražsky,stavenı́
19 učitel,otcův,matka,matčin,společenský,společensky,stavenı́