
DOKUZ EYLÜL UNIVERSITY
GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES

DEVELOPMENT OF A METHOD TO
DETERMINE ROOT AND SUFFIXES FOR
TURKISH WORDS TO GENERATE LARGE
SCALE TURKISH CORPUS

by
Özlem VARLIKLAR

July, 2005
İZMİR

A Thesis Submitted to the
Graduate School of Natural and Applied Sciences of Dokuz Eylül University
In Partial Fulfillment of the Requirements for the Degree of Master of Science
in Computer Engineering, Computer Engineering Program

M.Sc THESIS EXAMINATION RESULT FORM

We have read the thesis entitled "DEVELOPMENT OF A METHOD TO
DETERMINE ROOT AND SUFFIXES FOR TURKISH WORDS TO
GENERATE LARGE SCALE TURKISH CORPUS" completed by ÖZLEM
VARLIKLAR under the supervision of ASSOCIATE PROFESSOR DR. YALÇIN
ÇEBİ and we certify that in our opinion it is fully adequate, in scope and in quality,
as a thesis for the degree of Master of Science.

Assoc. Prof. Dr. Yalçın ÇEBİ
Supervisor

Prof. Dr. R. Alp KUT
(Jury Member)

Asst. Prof. Dr. Zafer DİCLE
(Jury Member)

Prof. Dr. Cahit HELVACI
Director
Graduate School of Natural and Applied Sciences

ACKNOWLEDGEMENTS

I would like to thank my advisor, Assoc. Prof. Dr. Yalçın ÇEBİ, for giving me the
opportunity to study Natural Language Processing and for his advice, support and
help in completing my thesis.

I would also like to thank Asst. Prof. Dr. Gökhan DALKILIÇ, who shared his ideas
about Natural Language Processing during the writing and development phases of the
thesis; and my colleague Ufuk DEMİR, who encouraged me during the writing of the
thesis.

I owe special thanks to my parents and my fiancé Cenk AKTAŞ for their support,
patience and encouragement.

Özlem VARLIKLAR

DEVELOPMENT OF A METHOD TO DETERMINE ROOT AND SUFFIXES
FOR TURKISH WORDS TO GENERATE LARGE SCALE TURKISH
CORPUS

ABSTRACT

To determine the morphological properties of a language, a corpus that represents
the language is needed. If a large scale Turkish corpus covering all the properties
of the language is available, some word-based statistical properties of Turkish can
also be investigated.
This study explains how a large scale, comprehensive, understandable and
easy-to-use Turkish corpus should be generated, identifies an appropriate method
for generating it, and presents an efficient method for determining the stems, roots
and suffixes of the words that form the corpus.
To generate the large scale Turkish corpus, texts amounting to almost 130 million
words were collected from the Internet: newspapers, novels and stories, and
subtitles of films, all written in Turkish. Lists of Turkish stems, roots,
abbreviations and suffixes were obtained. The abbreviation list and the rules
generated for sentence boundary detection were stored in XML files; these files gave
successful results in sentence boundary detection. After this step, sentences were
split into words, and the types of the words were found to help find the correct
root of each word. The stems of the words were determined by using the lists of
stems and inflectional suffixes. The roots and derivational suffixes of the stems
found were then determined by using the lists of roots and derivational suffixes.
All results, including paragraphs, sentences, words, roots and suffixes, were stored
in an XML structure specialized for NLP applications to make these applications
easier. The only drawback of the XML structure is that it needs a large amount of
disk space. To avoid being affected by this drawback, all XML files were loaded into
the computer's memory at the beginning of the corpus generation process. This made
the steps of generating the large scale Turkish corpus very fast and effective.
Keywords: Natural language processing, corpus, large scale Turkish corpus,
morphological analysis, determining stem and root.

BÜYÜK ÖLÇEKLİ TÜRKÇE DERLEM OLUŞTURMAK İÇİN TÜRKÇE
KELİMELERİN KÖK VE EKLERİNİ BELİRLEMEK İÇİN YÖNTEM
GELİŞTİRME

ÖZ

To determine the morphological properties of a language, a corpus that can
represent the properties of the language is required. This corpus must be large
enough to allow analysis techniques to be applied to it easily. No large scale corpus
study covering all the properties of Turkish has been carried out before. With such
a corpus, the word-based statistical properties of Turkish can also be examined.
The aim of this study is to determine the most appropriate method for developing
a large scale, comprehensive, understandable and easy-to-use Turkish corpus, and to
develop an efficient method for determining the stems, roots and suffixes of the
words used to build this corpus.
To build the large scale Turkish corpus, approximately 130 million words of
written Turkish, consisting of some newspapers, novels and stories, and subtitles of
Turkish films, were obtained over the Internet. Lists of Turkish stems, roots,
suffixes and abbreviations were obtained. The abbreviation list and the sentence
boundary rule list were created in XML. The abbreviation and rule lists were used in
the application developed for detecting sentence boundaries, and successful results
were obtained. The sentences found were split into words; after the stems of the
words were found by using the root and inflectional suffix lists, their roots were
determined by using the root list and the derivational suffix list. All results
obtained were stored as paragraphs, sentences, words, roots and suffixes in an XML
structure specialized for Natural Language Processing (NLP) applications. The only
known disadvantage of the XML structure is that the files are large. For this
reason, all XML files are loaded into memory before processing starts. This allowed
the corpus generation steps to be carried out very quickly and effectively.
Keywords: Natural language processing, corpus, large scale Turkish corpus,
morphological analysis, stem and root determination.

CONTENTS

THESIS EXAMINATION RESULT FORM
ABSTRACT
ÖZ
LIST OF TABLES
LIST OF FIGURES

CHAPTER ONE - INTRODUCTION

CHAPTER TWO - CORPUS AND LARGE SCALE CORPUS
2.1 Corpus
2.1.1 English Corpora
2.1.1.1 Brown Corpus
2.1.1.2 British National Corpus (BNC)
2.1.1.3 The Bank of English
2.1.1.4 English Gigaword
2.1.1.5 American National Corpus
2.1.2 Turkish Corpora
2.1.2.1 Koltuksuz Corpus
2.1.2.2 YTÜ Corpus
2.1.2.3 Dalkılıç Corpus
2.1.2.4 METU Turkish Corpus
2.1.2.5 TurCo Turkish Corpus
2.1.3 Corpora of Other Languages
2.1.3.1 The Czech National Corpus (CNC)
2.1.3.2 Croatian National Corpus
2.1.3.3 PAROLE
2.1.3.4 French Corpus
2.1.3.5 COSMAS (Corpus Search Management Analysis System)
2.2 Large Scale Corpus

CHAPTER THREE - PREVIOUS WORKS
3.1 Morphological Parsing in Other Languages
3.2 Stem and Root Finding Algorithms for Turkish
3.2.1 AF Algorithm
3.2.2 LM Algorithm
3.2.3 Identified Maximum Match (IMM) Algorithm
3.2.4 Solak and Oflazer's Approach
3.2.5 Root Reaching Method without Dictionary
3.2.6 Extended Finite State Approach
3.2.7 FindStem Algorithm
3.3 Sentence Boundary Detection

CHAPTER FOUR - PROPOSED SYSTEM
4.1 Sentence Boundary Detection
4.2 Examination of Type of Words
4.3 Description of the Method for Finding Roots
4.3.1 Finding Stems and Inflectional Suffixes
4.3.2 Finding Roots and Derivational Suffixes
4.4 Generate Large Scale Turkish Corpus
4.4.1 Data in the Corpus
4.4.2 Definition of the Corpus Structure

CHAPTER FIVE - CONCLUSION

REFERENCES
ABBREVIATIONS
APPENDICES
A. The List of the Novels & Stories in Corpus
B. Turkish Alphabet
B.1 Lowercase Letters
B.2 Uppercase Letters
C. Turkish Language Specialities
C.1 Vowel Harmony
C.2 Consonant Harmony
C.3 Root Deformations

LIST OF TABLES

Table 2.1 NOW, file sizes and distribution in TurCo
Table 2.2 NOW, NODW and DWUR in TurCo
Table 3.1 Example of flags
Table 3.2 The affix-verbs in Turkish
Table 4.1 The meanings of the characters in the sentence boundary rule list
Table 4.2 The number of stems in the lists
Table 4.3 The number of roots in the noun and verb lists
Table 4.4 The number of files, NOW, file sizes, and distribution (%) of data

LIST OF FIGURES

Figure 2.1 An unannotated example of a raw BNC text
Figure 3.1 Finite state machine of Table 3.2
Figure 3.2 The main finite state machine
Figure 3.3 Links and inflectional groups
Figure 3.4 Dependency links in an example Turkish sentence
Figure 4.1 Block diagram of algorithm for generating corpus
Figure 4.2 The rule list for sentence boundary detection
Figure 4.3 Example of abbreviation list in XML file
Figure 4.4 Example of sentences in XML file
Figure 4.5 List of noun stems in Turkish
Figure 4.6 List of adjective stems in Turkish
Figure 4.7 (a) Sample of noun roots. (b) Sample of verb roots
Figure 4.8 Sample of stems and roots in "from noun to noun" suffixes list
Figure 4.9 Sample of stems and roots in "from verb to noun" suffixes list
Figure 4.10 Block diagram of algorithm for finding root step in generating corpus
Figure 4.11 Schema definition for the mapping XML file
Figure 4.12 Sample configuration of XML file
Figure 4.13 Sample schema definition for input files
Figure 4.14 Sample valid XML file for processing

CHAPTER ONE
INTRODUCTION

"Natural Language" is the language naturally used by humans. "Natural Language
Processing" (NLP) is a research area that serves many different purposes and is
becoming steadily more popular. In this area, computers are used to process
natural language, both in academic research and for commercial purposes.

NLP can be defined as the construction of a computing system that processes and
understands natural language. The word "understand" in this definition can be
clarified as follows: the observable behavior of the system must make us
assume that it is doing internally the same, or very similar, things that we do when
we understand language (Güngördü, 1993).

The structure determination process covers two main topics, morphological
analysis and statistical analysis:

Morphological analysis is the investigation of the morphological status of
words, such as determining word types (verb, noun, adjective, etc.) and
analyzing the parts of words (root, suffix or prefix).

Statistical analysis can be done in two ways, on letters and on words. Letter
analysis covers analyses applied to letters, such as consonant and vowel
placements, letter n-gram frequencies, and relationships between letters such
as their positions relative to each other. Word analysis covers analyses
applied to words, such as the number of letters in a word, the order of the
letters in a word, word n-gram frequencies, and word order in a sentence.
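
As a minimal sketch of the letter and word analyses named above, the following Python fragment counts letter n-grams and word n-grams and derives a vowel/consonant placement pattern. The function names and the sample text are illustrative assumptions, not part of the thesis software, and the input is assumed to be already lowercase.

```python
from collections import Counter

# Lowercase Turkish vowels. The sketch assumes lowercase input, because
# Python's str.lower() does not apply Turkish casing rules (I -> ı, İ -> i).
VOWELS = set("aeıioöuü")


def letter_ngrams(text, n):
    """Count letter n-grams inside each word; n-grams never span a space."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def word_ngrams(text, n):
    """Count sequences of n consecutive words."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def vowel_consonant_pattern(word):
    """Map each letter to V (vowel) or C (consonant)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)


# Hypothetical sample text: four inflected forms of "ev" (house).
sample = "ev evde evler evlerde"
letter_bigrams = letter_ngrams(sample, 2)
word_bigrams = word_ngrams(sample, 2)
pattern = vowel_consonant_pattern("evlerde")
```

Here `letter_bigrams` and `word_bigrams` are plain frequency tables, so the same two functions cover any n by changing the second argument.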

Morphological classification is made according to the word structures of natural
languages. Turkish is an "agglutinative language" like Finnish, Hungarian, Quechua
and Swahili: new words are formed by adding suffixes to the end of roots (see
Appendix C). In Turkish, grammatical rules determine which suffixes may follow
which others and in what order. Through this concatenation the meanings of words
are changed or extended. Suffix concatenation can result in relatively long words,
which are frequently equivalent to a whole sentence in English (e.g.
Osmanlılaştıramadıklarımızdansınız).
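
To illustrate suffix concatenation, the sketch below rebuilds the example word step by step. The segmentation into a root and nine suffixes is the commonly cited one and is an assumption here, not taken from the thesis. The code only joins fixed surface forms and does not model vowel or consonant harmony.

```python
# Assumed segmentation of the example word (commonly cited, not from the thesis).
ROOT = "Osman"
SUFFIXES = ["lı", "laş", "tır", "ama", "dık", "lar", "ımız", "dan", "sınız"]


def build_word(root, suffixes):
    """Return every intermediate form produced by appending suffixes in order."""
    forms = [root]
    for suffix in suffixes:
        forms.append(forms[-1] + suffix)
    return forms


forms = build_word(ROOT, SUFFIXES)
for form in forms:
    print(form)
```

Each printed line is one suffix longer than the previous one, which shows how a single root grows into a sentence-length word.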

NLP is used in:
Speech synthesis: although this may not at first sight appear very 'intelligent',
the synthesis of natural-sounding speech is technically complex and almost
certainly requires some 'understanding' of what is being spoken to ensure, for
example, correct intonation.
Speech recognition: basically the reduction of continuous sound waves to
discrete words.
Natural language understanding: here treated as moving from isolated words
(either written or determined via speech recognition) to 'meaning'. This may
involve complete model systems or 'front-ends', driving other programs by
NL commands.
Natural language generation: generating appropriate NL responses to
unpredictable inputs.
Machine translation (MT): translating one NL into another. (Coxhead, 2002)
Database applications: NLP helps the user access a database through the
familiarity and flexibility of natural language. It is also used in expert
systems for explanation generation, drawing on knowledge of the syntax and
semantics of a fragment of natural language.
Spelling correction: Spelling correctors are generally word-based, but recently
there have been many studies on syntax-based spelling correctors. A word-based
spelling corrector for Turkish was developed by Solak.

For these NLP applications, a corpus is generated and used. Detailed information
about corpora is given in the following chapters.

Nowadays, a large scale corpus is needed for every language in order to apply
analyses to the language and obtain reliable results about its properties. For
Turkish, too, it is very important to generate a large scale corpus, and to
generate such a corpus it is very important to determine the stems, roots and
suffixes of the words correctly.

The main goal of this study is to generate a large scale Turkish corpus and,
while generating it, to develop an appropriate method that finds the roots and
suffixes of Turkish words efficiently. All steps in generating the corpus were
examined and implemented. The general concepts of corpora and previous work on
generating corpora and on determining the stems, roots and suffixes of words are
also presented.

This thesis is divided into 5 chapters. Chapter 1 introduces the thesis and explains
briefly why it was written. Chapter 2 gives the definition of a corpus and describes
some corpora prepared in English, Turkish and other languages. Chapter 3 briefly
explains some previous work on the morphological analysis of Turkish; Turkish
stemming and root determination algorithms and work on sentence boundary
detection in the literature are introduced with their main characteristics.
Chapter 4 gives a detailed explanation of all steps of the proposed system for
generating a large scale Turkish corpus. Finally, Chapter 5 presents the conclusion.

CHAPTER TWO
CORPUS AND LARGE SCALE CORPUS

"Corpus is a collection of linguistic data, either written texts or a transcription of
recorded speech, which can be used as a starting-point of linguistic description or as
a means of verifying hypotheses about a language." (Crystal, 1991).

"A collection of naturally occurring language text, chosen to characterize a state
or variety of a language." (Sinclair, 1991).

A corpus can be defined as a special database that is created from texts, is used
in the Natural Language Processing area, and supports specialized operations such
as finding and separating words quickly.

An ideal corpus is large and representative of the language. However, there is a
trade-off between quality (being representative) and quantity (being large). A
representative corpus has samples of all of the language. A large corpus contains
a very large amount of data and can also be used in NLP. Corpora can also be
divided into two types, "balanced" and "unbalanced". A large corpus is unbalanced.
A corpus can be balanced by taking samples of all the different topics in a
language, such as technical language, medicine and spoken language, which makes
the corpus representative of the language. However, it is very difficult to take
equal, small samples from different areas into a corpus. Instead, an unbalanced
corpus may be generated and may serve better, because it will contain many words
from all areas of the language. For letter analysis, small corpora are enough
(Dalkılıç, 2001), but for word analysis large corpora are needed. For some unusual
words, an unbalanced corpus can be more powerful than a balanced corpus.

Some general analyses can be applied to a corpus, such as n-gram analysis, Number
of Different Words (NODW) and Different Word Usage Ratio (DWUR), which also give
the general characteristics of the corpus.

N-gram analysis is one of the common statistical methods carried out on a corpus.
By using n-grams, language model probabilities can be estimated and used in speech
recognition systems (Nadas, 1984). N-gram analysis can be used in correcting words
by detecting non-words. It can also be combined with pattern matching to detect
strings that do not appear in a given word list. It is also useful for OCR (Optical
Character Recognition) (Kukich, 1992), and for data compression and encryption. In
addition, missing words in a given text can be estimated by calculating word
n-grams.
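
The idea of estimating a missing word from word n-grams can be sketched with a simple bigram model. This is a minimal illustration using plain relative frequencies with no smoothing; the function names and the toy sentences are assumptions, not the method of any of the cited works.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """For each word, count which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def bigram_probability(following, prev, nxt):
    """Relative-frequency estimate of P(nxt | prev); 0.0 if prev is unseen."""
    counts = following.get(prev, Counter())
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0


def guess_next(following, prev):
    """Most frequent word seen after prev, or None if prev is unseen."""
    counts = following.get(prev, Counter())
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical toy corpus.
model = train_bigrams(["bu bir kitap", "bu bir ev", "bu bir kitap"])
p_kitap = bigram_probability(model, "bir", "kitap")
best = guess_next(model, "bir")
```

With a real corpus the same counts would normally be smoothed, since unseen word pairs otherwise receive probability zero.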

2.1 Corpus
Many corpora have been created for different languages. Some of them are
representative, and some are large (Church & Mercer, 1993). Different analyses can
be done on a corpus, such as different word usage statistics and n-gram analysis
for letters (Shannon, 1951) and words (Jurafsky & Martin, 2000). Processes such as
character recognition, cryptanalysis and spelling correction (Church & Gale, 1991)
can be carried out by using a corpus in NLP applications.

Some examples of corpora in different languages are given in the following
sections.

2.1.1 English Corpora
2.1.1.1 Brown Corpus
This corpus was first assembled in 1963-1964 at Brown University. In 1964, it
had 1 million words with 61,805 different words; in a later edition in 1992, the
new Brown corpus had 583 million words with 293,181 different words (Jurafsky &
Martin, 2000).

2.1.1.2 British National Corpus (BNC)
The British National Corpus is a very large (over 100 million words) corpus of
modern English, both spoken and written. However, non-British English and foreign
language words do occur in the corpus (BNC: What is the BNC, (n.d.)). 90% of the
BNC is a written part including extracts from newspapers, journals, academic books,
and school and university essays; the 10% spoken part includes a large amount of
unscripted informal conversation. It is a project of Oxford University Press
together with some other members. It was completed in 1994 and released in
February 1995 (British National Corpus (BNC), (n.d.)). An unannotated example of a
raw BNC text is shown in the following figure.

Figure 2.1 An unannotated example of a raw BNC text.

<bncDoc id=BDFX8 n=093802>
<header type=text creator='natcorp' status=new update=1994-07-13>
<fileDesc>
<titStmt>
<title>
General Practitioners Surgery -- an electronic transcription
</title>
<respStmt>
<resp> Data capture and transcription </resp>
<name> Longman ELT </name>
</respStmt>
</titStmt>
<ednStmt n=1> Automatically-generated header </ednStmt>
<extent kb=7 words=128> </extent>
<pubStmt>
<respStmt>
<resp> Archive site </resp>
<name> Oxford University Computing Services </name>
</respStmt>
<address>
13 Banbury Road, Oxford OX2 6NN U.K.
...
Internet mail: natcorp@ox.ac.uk
</address>
<idno type=bnc n=093802> 093802 </idno>
<avail region=world status=unknown>
Exact conditions of use not currently known to
the archiving agency.
...
Distribution of any part of the corpus must
include a copy of the corpus header.
</avail>
<date value=1994-07-13> 1994-07-13 </date>
</pubStmt>
<srcDesc>
<recStmt>
<rec type=DAT>
</rec>
</recStmt>
</srcDesc>
</fileDesc>
<profDesc>
<creation date='?'> Origination/creation date not known </creation>
<partics>
<person age=X educ=0 flang=EN-GBR id=PS22T n=W0001 sex=m soc=AB>
...
</person>
<person id=FX8PS000 n=W0000> ... </person>
<person id=FX8PS001 n=W0002> ... </person>
</partics>
...
...
</bncDoc>

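The header in Figure 2.1 uses SGML-style markup with unquoted attribute values (for example words=128), so a strict XML parser would reject it. As a rough sketch only, attributes can be read from a single tag line with a regular expression. The function name is hypothetical, and the approach assumes one opening tag per line as in the figure.

```python
import re

# Attribute value may be single-quoted, double-quoted, or bare (SGML style).
ATTR = re.compile(r"""(\w+)=(?:'([^']*)'|"([^"]*)"|([^\s>]+))""")


def tag_attributes(line):
    """Return (tag name, {attribute: value}) for the opening tag on a line.

    Returns (None, {}) when the line does not start with an opening tag.
    """
    match = re.match(r"\s*<(\w+)([^>]*)>", line)
    if not match:
        return None, {}
    name, rest = match.groups()
    attrs = {}
    for key, single, double, bare in ATTR.findall(rest):
        attrs[key] = single or double or bare
    return name, attrs


# Two lines copied from the figure.
extent = tag_attributes("<extent kb=7 words=128> </extent>")
header = tag_attributes(
    "<header type=text creator='natcorp' status=new update=1994-07-13>"
)
```

Applied to the `<extent>` line, this recovers the file size in kilobytes and the word count recorded for the sample.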
2.1.1.3 The Bank of English
The Bank of English is a collection of samples of modern English language held
on computer for analysis of words, meanings, grammar and usage. In linguistics and
lexicography such a collection is termed a corpus (The Bank of English - Terms &
Conditions, (n.d.)).

The Bank of English was launched in 1991 by COBUILD (a division of
HarperCollins Publishers) and The University of Birmingham. Since 1980
COBUILD, which is based within the School of English at Birmingham University,
has been collecting a corpus of texts on computer for dictionary compilation and
language study. In 1991 HarperCollins decided on a major initiative to increase the
scale of the corpus to 200 million words, to form the basic data resource for a new
generation of authoritative language reference publications.

It had 450 million words with over half a million different words in January 2002,
and it continues to grow with the constant addition of new material. It contains
both speech and writing. The written part contains books, newspapers, magazines,
letters, etc., and the spoken part includes speech from BBC World Service radio
broadcasts and American National Public Radio, meetings, conversations, etc. The
data are collected either from electronic sources or by scanning books. The
collection of text was started in 1980 (The Bank of English, (n.d.)).

2.1.1.4 English Gigaword
It is an English corpus with 1,756,504,000 words and 4,111,240 documents. It
is a product of the Linguistic Data Consortium. It includes data from Agence France
Press English Service, Associated Press Worldstream English Service, The New
York Times Newswire Service and Xinhua News Agency English Service. It is sold for
$2,500 (English Gigaword, (n.d.)).

2.1.1.5 American National Corpus
The ANC aims to contain a core corpus of at least 100 million words,
including both written and spoken (transcript) data, comparable across genres to
the BNC. The genres in the ANC are expanded to include "new" types of language data
that have become available in recent years, such as web blogs and web pages, chats,
email, and rap music lyrics. In addition to the core 100 million words, the ANC will
include an additional component of potentially several hundreds of millions of
words, chosen to provide both the broadest and largest selection of data possible.
In fall 2003, the ANC produced its First Release of over 11 million words of
American English (The ANCProject, (n.d.)).

2.1.2 Turkish Corpora
Some Turkish corpora are listed below:

Koltuksuz Corpus
YTÜ Corpus
Dalkılıç Corpus
METU Turkish Corpus
TurCo Turkish Corpus

There are also other corpora for Turkish, like ~2.2M Words (Güngör, 1995).

2.1.2.1 Koltuksuz Corpus
It is one of the Turkish corpora used for letter statistics and for finding out
some of the characteristics of the Turkish language. It has 6,095,457 characters and
is formed of 24 novels and stories by 22 different authors. These novels and stories
were transferred from books into digital form by a data entry group (Koltuksuz,
1995).

2.1.2.2 YTÜ Corpus
This corpus was created for a morphology-based data compression study. It has
4,263,847 characters from 14 different documents: 3 novels, 1 PhD thesis, 1
transcription and 9 articles (Diri, 2000).

2.1.2.3 Dalkılıç Corpus
Dalkılıç Corpus: It was created, like the Koltuksuz corpus, for letter statistics
and for defining the characteristics of the Turkish language. It has 1,473,738
characters from the Hürriyet newspaper web archive (main page, 01/01/1998 -
06/01/1998, and authors, 01/01/1998 - 06/30/1998) (Dalkılıç, 2001).

Dalkılıç Corpus: It is the combination of some of the previous Turkish corpora
(Koltuksuz, YTÜ and Dalkılıç corpora), with a size of 11,749,977 characters
(Dalkılıç & Dalkılıç, 2001).

2.1.2.4 METU Turkish Corpus
It is a collection of 2 million words of post-1990 written Turkish samples (METU
Corpus, (n.d.)).

2.1.2.5 TurCo Turkish Corpus
It is the first known corpus created for word statistics. It has a size of
362.449 MB and contains 50,111,828 words. TurCo is nearly half the size of the
British National Corpus, but not as big as English Gigaword.

TurCo consists of text data taken from 11 different websites, and of Turkish
novels and stories by more than 100 authors. Most of TurCo (98.11%) was collected
from websites; 1.89% of the corpus is novels and stories.

To make TurCo larger and include more words, it was not balanced, and each
document in the corpus has a different size, as shown in Table 2.1.

Table 2.1 NOW, file sizes and distribution in TurCo

Site #  Web Sites                   NOW         Corpora Files' Sizes¹ (MB)  Distribution (%)
1       www.tbmm.gov.tr             23,396,817  170.747                     46.69
2       www.stargazete.com.tr       9,746,093   69.103                      19.45
3       www.hurriyet.com.tr         9,415,716   69.140                      18.79
4       Turkish novels and stories  4,668,306   33.571                      1.89
5       www.die.gov.tr              948,116     6.387                       9.32
6       www.arabul.com              753,571     4.994                       1.50
7       www.pcmagazine.com.tr       527,757     3.722                       1.05
8       www.bilimteknoloji.com.tr   203,620     1.450                       0.41
9       www.abgs.gov.tr²            160,562     1.249                       0.32
10      www.lazland.com             135,519     0.954                       0.27
11      www.yeniasir.com.tr         96,857      0.707                       0.19
12      www.pankitap.com            58,894      0.425                       0.12
TOTAL                               50,111,828  362.449                     100.00

For TurCo, the Number of Words (NOW), Number of Different Words (NODW) and
Different Word Usage Ratio (DWUR) were calculated and are shown in Table 2.2. The
sum of the NODW values of all sites is 1,235,056, but some words are repeated
across different sites. When these repeated words are counted only once over the
whole of TurCo, the NODW of TurCo is 686,804. According to this result, the DWUR
of TurCo is 1.37%.

¹ Includes only Turkish alphabet and space character
² Turkey's National Program for the European Union

Table 2.2 NOW, NODW and DWUR in TurCo

Site #  NOW         NOW Ratio (%)  NODW       NODW Ratio (%)  DWUR (%)
1       23,396,817  46.69          342,544    27.74           1.46
2       9,746,093   19.45          255,024    20.65           2.62
3       9,415,716   18.79          99,432     8.05            1.06
4       4,668,306   9.32           309,030    25.02           6.62
5       948,116     1.89           20,760     1.68            2.19
6       753,571     1.50           42,208     3.42            5.60
7       527,757     1.05           46,743     3.78            8.86
8       203,620     0.41           29,228     2.37            14.35
9       160,562     0.32           13,103     1.06            8.16
10      135,519     0.27           37,057     3.00            27.34
11      96,857      0.19           25,294     2.05            26.11
12      58,894      0.12           14,633     1.18            24.85
Total   50,111,828  100.00         1,235,056  100.00          2.74
TurCo   50,111,828                 686,804                    1.37
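
The three measures can be sketched in a few lines of Python. The text does not state the DWUR formula explicitly; the sketch assumes DWUR = NODW / NOW x 100, which reproduces the per-site values and the TurCo row of Table 2.2. The token lists are hypothetical.

```python
def corpus_stats(tokens):
    """Return (NOW, NODW, DWUR %) for a list of word tokens.

    Assumes DWUR = NODW / NOW * 100, inferred from the values in Table 2.2.
    """
    now = len(tokens)
    nodw = len(set(tokens))
    dwur = 100.0 * nodw / now if now else 0.0
    return now, nodw, dwur


# Two hypothetical "sites" that share some words.
site_a = "bir ev bir kitap".split()
site_b = "bir kitap ve bir kalem".split()

stats_a = corpus_stats(site_a)
stats_b = corpus_stats(site_b)
stats_all = corpus_stats(site_a + site_b)

# Summing per-site NODW counts shared words more than once; counting
# different words over the merged corpus does not.
summed_nodw = stats_a[1] + stats_b[1]
merged_nodw = stats_all[1]

# Check of the assumed formula against the TurCo row and the site 1 row.
turco_dwur = round(100.0 * 686_804 / 50_111_828, 2)
site1_dwur = round(100.0 * 342_544 / 23_396_817, 2)
```

The gap between `summed_nodw` and `merged_nodw` is the same effect that reduces the NODW of TurCo from 1,235,056 to 686,804.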

2.1.3 Corpora of Other Languages
2.1.3.1 The Czech National Corpus (CNC)
The Czech National Corpus (CNC) is a non-commercial, academic project
focused on building up a large computer-based corpus, containing mainly written
Czech.

The idea of the CNC was first mentioned in 1991 in a statement of intent signed
by 8 signatories, representatives of the following institutions: Faculty of
Philosophy, Charles University; Faculty of Mathematics and Physics, Charles
University; Masaryk University; Palacký University; and the Institute of Czech
Language, Academy of Sciences (Klimova, 1996, CZECH NATIONAL CORPUS
(CNC)).

It has synchronic and diachronic parts. Parts of the synchronic section include:
Databases and dictionaries (electronic databases and dictionaries), SYN2000
(a balanced, representative corpus of contemporary written Czech containing about 100
million words) and ORAL (spoken Czech). Parts of the diachronic section include the
bank of diachronic Czech (2,000,000 words of transcribed texts; 100,000 words of
transliterated texts; 200,000 words of dialect texts) (The Czech National Corpus
(CNC), (n.d.)).

2.1.3.2 Croatian National Corpus

It has 30 million words and 9,156,446 tokens as of February 24th, 2003. It includes
older and contemporary texts and is available through the Croatian Academic Research
Network (Croatian National Corpus, (n.d.)).

2.1.3.3 PAROLE

It is a multilingual corpus. The languages involved in the PAROLE corpora are:
Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek,
Irish, Italian, Norwegian, Portuguese and Swedish. It has 20,000 entries per
language. All the texts date from after 1970.

2.1.3.4 French Corpus

It has 20,093,099 words from books (3,267,409 words from CD-ROM),
newspapers (13,856,763 words from the newspaper Le Monde), periodicals (942,963
words from HERMES and CNRS-Infos) and other sources (2,025,964 words) (French Corpus,
(n.d.)).

2.1.3.5 COSMAS (Corpus Search Management Analysis System)

It is a German corpus of 1,903,000,000 running words. "Running" means that
new words are added each day. Only 1,181 million words are available to the public
because of copyright restrictions. It is a product of the "Institut für Deutsche Sprache",
Mannheim (COSMAS, German Corpus, (n.d.)).

2.2 Large Scale Corpus
A large and representative corpus is very important for a language. Word
analysis requires large corpora, but a large corpus need not be an unbalanced
one. A combination of large and representative (balanced and unbalanced) corpora
is needed so that it can be used effectively in research areas such as speech
recognition and spell checking.

No existing Turkish corpus is large enough to support efficient analysis, so
a large-scale Turkish corpus must be created. With such a corpus, statistical
properties of the Turkish language that depend on words can be investigated
easily and used in these areas.


CHAPTER THREE
PREVIOUS WORKS

In natural language processing, different methods have been developed and
implemented to make morphological analysis more efficient. The main part of the
analysis is finding the "correct" root or stem of a word, where "correct" means the
meaning intended by the user. Turkish has many ambiguities
(e.g., yaz/noun vs. yaz/verb). Because of such difficulties, researchers try to
find the root nearest to the real one instead of the correct root.

Morphological parsing algorithms may be divided into two classes: affix
stripping and root-driven analysis methods (Solak and Oflazer, 1993). Both methods
have been used in the history of morphological parsing.

Both classes have advantages and disadvantages. In the root-driven
approach, the stem of the word must first be found in a lexicon before the
morphological analysis starts. The most popular morphological analyzers, such as
PC-KIMMO (Antworth, 1990) and AMPLE (Weber, 1988), use the root-driven approach, and
their customized versions for different languages confirm the method's success.
Root-driven methods are also widely used in studies on Turkish, although
affix stripping methods have been developed for other agglutinative languages
with successful results. The major disadvantage of the root-driven approach is the
cost of the search required to find the stem: examining each
subpart is very time-consuming, especially for languages
in which words can appear in very long forms. In the affix
stripping approach, on the other hand, the search is relatively fast because only
affixes are searched for.

3.1 Morphological Parsing in Other Languages
For ancient Greek, Packard's parser proceeds by stripping affixes off the word
and then attempting to look up the remainder in a lexicon. The parse is deemed a
success only if the lexicon contains an entry that matches the remainder and is
compatible with the stripped-off affixes.

Brodda and Karlsson apply a similar method to the analysis of Finnish, an
agglutinative language, but without any lexicon of roots. Suffixes are stripped off
from the end of the word until no more can be removed, and what is left is assumed
to be the root.

Sagvall devised a morphological analyzer for Russian which first looks in a
lexicon for a root matching an initial substring of the word. It then uses grammatical
information stored in the lexical entry to determine which suffixes may
follow.

Three approaches to the morphological parsing of agglutinative languages
were developed independently in the early 1980s: for Quechua (R. Kasper, 1982),
for Finnish (Koskenniemi, 1983) and for Turkish (Hankamer, 1984). These three
approaches are identical. They all proceed from left to right, like Sagvall's parser.
Roots that match initial substrings of the word are sought in the lexicon, and the
grammatical category of the root determines which class of suffixes may follow. When a
suffix in the permitted class is found to match a further substring of the word, the
grammatical information in the lexical entry for that suffix determines once again
which class of suffixes may follow. If the end of the word can be reached by iterating
this process, and if the last suffix analyzed is one which may end a word, the parse is
successful.

3.2 Stem and Root Finding Algorithms for Turkish
Some of the methods that determine the root or stem of Turkish words are
examined below:

1. AF Algorithm
2. Longest-Match (L-M) Algorithm
3. Identified Maximum Match (IMM) Algorithm
4. FindStem Algorithm
5. Solak and Oflazer's Approach
6. Root Reaching Method without Dictionary
7. Extended Finite State Approach
3.2.1 AF Algorithm
The AF algorithm was developed by Solak and Can in 1994. It works with a
lexicon of actively used Turkish stems in which each record is described
by 64 tags. The word is looked up in the lexicon iteratively, pruning one letter
from the right at each step. If the word matches any of the root words,
morphological analysis is performed for that root. If any of the surface forms
produced corresponds to the word at hand, the root word is assumed to be an
eligible stem for that word. Solak and Can did not distinguish a root word from a
stem, perhaps because root words may be viewed as special cases of stems:
a root is a stem that neither contains any morpheme nor is a
compound word. The process is repeated until the word is reduced to a single letter.
Here is the algorithm:

1. Remove suffixes that are attached to the word with punctuation marks.
2. Search for the word in the dictionary.
3. If a matching root is found, add the word to the root words list.
4. If the word has been reduced to a single letter: if the root words list is empty,
go to step 6; if it has at least one element, go to step 7.
5. Remove the last letter from the word and go to step 2.
6. Add the searched word to the not-found records and exit.
7. Get the next root word from the root words list.
8. Apply morphological analysis to the root word.
9. If the result of the morphological analysis is positive, add the root word to
the stems list.
10. If any elements remain in the root words list, go to step 7.
11. Choose all stems in the stems list as stems of the word.
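
The steps above can be sketched roughly as follows. This is a minimal illustration, not Solak and Can's implementation: the lexicon, the `KNOWN_ENDINGS` set and the `toy_analysis` function are hypothetical stand-ins for the real lexicon and morphological analysis, and the apostrophe handling in step 1 is an assumption.

```python
def af_stems(word, lexicon, analyze):
    """Return every lexicon entry that is a prefix of `word` and passes analysis."""
    word = word.split("'")[0]              # step 1 (assumed: suffixes follow an apostrophe)
    root_words = []
    candidate = word
    while True:
        if candidate in lexicon:           # steps 2-3
            root_words.append(candidate)
        if len(candidate) == 1:            # step 4
            break
        candidate = candidate[:-1]         # step 5
    if not root_words:                     # step 6: caller records the word as not found
        return None
    # steps 7-11: keep every root accepted by the morphological analysis
    return [root for root in root_words if analyze(root, word)]


# Hypothetical toy data standing in for the lexicon and the analyzer
LEXICON = {"göz", "gözlük"}
KNOWN_ENDINGS = {"", "ler", "lükler"}


def toy_analysis(root, word):
    return word[len(root):] in KNOWN_ENDINGS


stems = af_stems("gözlükler", LEXICON, toy_analysis)
print(stems)
```

With this toy data both "gözlük" and "göz" are returned, which illustrates why the algorithm yields several candidate stems rather than a single one.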

This algorithm finds all possible stems of the word, and the stems found give rise
to many other stems; e.g., the root göz (eye) is the source of derivation of roughly
150 stems with totally different meanings. So this algorithm is far from finding
the "correct" stem.

3.2.2 LM Algorithm
The Longest-Match (L-M) algorithm was developed by Kut et al. in 1995. It is based
on word search over a lexicon that covers Turkish word stems and their possible
variants. Here is the algorithm:

2. Search for the word in the dictionary.
3. If a matching root is found, go to step 5.
4. If the word has been reduced to a single letter, go to step 6. Otherwise, remove the
last letter from the word and go to step 2.
5. Choose the root found as the stem and go to step 7.
6. Add the searched word to the not-found records.
7. Exit.
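
A minimal sketch of steps 2 to 7, assuming the lexicon is a plain set of strings. The function name and the three-entry lexicon are hypothetical; the example word "edebilecek" and the roots "edebi" and "edep" are the ones discussed for FindStem later in this chapter.

```python
def longest_match(word, lexicon):
    """Return the longest prefix of `word` found in `lexicon`, or None."""
    candidate = word
    while candidate:
        if candidate in lexicon:      # steps 2, 3 and 5
            return candidate
        if len(candidate) == 1:       # steps 4 and 6: nothing left to remove
            return None
        candidate = candidate[:-1]    # step 4: drop the last letter
    return None


stem = longest_match("edebilecek", {"edebi", "edep", "et"})
print(stem)
```

The first match wins, so the result here is "edebi" even though the word cannot be derived from that root.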

This algorithm takes the first matching stem, starting from the whole word
and removing letters one by one from the end. In addition, the lexicon used may not
contain all possible stems. So this algorithm, too, is far from finding the "correct"
root or stem.

3.2.3 Identified Maximum Match (IMM) Algorithm
This algorithm was developed by Köksal in 1975. It is a left-to-right parsing
algorithm that tries to find the maximum-length substring present in a root
lexicon. If a match is found, the remaining part of the word is treated as the
suffixes; this part is searched in a dictionary of suffix morpheme forms, and
morphemes are identified one by one until nothing remains.

The result of this process may not be correct; in such cases all of the steps
are repeated after removing one character from the substring found.

3.2.4 Solak and Oflazer's Approach
Solak and Oflazer used a dictionary of 23,000 words based on the
Turkish Writing Guide (Solak and Oflazer, 1993). The words are
kept in sorted order in a sequential array so that they can be searched quickly.
Each entry of the dictionary contains a Turkish root word and a series of flags
showing certain properties of that word. If the bit corresponding to a certain flag is
set for an entry, the word has the property represented by that flag. 64
different flags are reserved for each entry, but only 41 flags are used. Some of
the flags are shown in the following table.
Table 3.1 Examples of flags
Flag    Property of the word for which this flag is set                                                                                        Examples
CLNONE  Belongs to neither of the two main root classes                                                                                        RAĞMEN, VE
CLISIM  Is a nominal root                                                                                                                      BEYAZ, OKUL
CLFIIL  Is a verbal root                                                                                                                       SEV, GEZ
ISOA    Is a proper noun                                                                                                                       AYŞE, TÜRK
ISOC    Is a proper noun which has a homonym that is not a proper noun                                                                         MISIR, SEVGİ
ISSAYI  Is a numeral                                                                                                                           BİR, KIRK
ISKI    Is a nominal root which can directly take the relative suffix KI                                                                       BERİ, ÖBÜR
ISSD    Is a nominal root ending with a consonant which is softened when a suffix beginning with a vowel is attached                           AMAÇ, PARMAK, PSİKOLOG
ISSDD   Is a nominal root ending with a consonant which has a homonym whose final consonant is softened when a suffix beginning with a vowel is attached  ADET, KALP

The root of the word is searched for in the dictionary using a maximal match
algorithm. First the whole word is searched for in the dictionary; if it is
found, the word has no suffixes and does not need to be parsed. If not, a
letter is removed from the right and the remaining substring is searched for. This
step is repeated until the root is found. If no root has been found by the time the
first letter of the word is reached, the word's structure is regarded as incorrect.
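
The flag-based dictionary and the maximal match lookup might be sketched as below. The flag names and example roots come from Table 3.1; the bit positions, the five-entry dictionary and the helper names `lookup` and `maximal_match` are assumptions made for illustration, and binary search via `bisect` is only one way to exploit the sorted array.

```python
from bisect import bisect_left
from enum import IntFlag


class Flag(IntFlag):
    # Bit positions are arbitrary here; the original layout is not given.
    CLNONE = 1 << 0
    CLISIM = 1 << 1
    CLFIIL = 1 << 2
    ISOA = 1 << 3
    ISSAYI = 1 << 4


# (root, flags) pairs kept sorted by root so that binary search applies
DICTIONARY = sorted([
    ("BEYAZ", Flag.CLISIM),
    ("GEZ", Flag.CLFIIL),
    ("KIRK", Flag.ISSAYI),
    ("OKUL", Flag.CLISIM),
    ("SEV", Flag.CLFIIL),
])
ROOTS = [root for root, _ in DICTIONARY]


def lookup(root):
    """Binary search for `root`; return its flags or None."""
    i = bisect_left(ROOTS, root)
    if i < len(ROOTS) and ROOTS[i] == root:
        return DICTIONARY[i][1]
    return None


def maximal_match(word):
    """Try the whole word first, then drop letters from the right."""
    for end in range(len(word), 0, -1):
        flags = lookup(word[:end])
        if flags is not None:
            return word[:end], flags
    return None  # no root found: the word's structure is treated as incorrect


result = maximal_match("OKULLAR")
print(result)
```

A single flag can then be tested with `Flag.CLISIM in flags`, which mirrors checking whether the corresponding bit is set.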

To obtain reliable results from this parser, all of the rules and their
exceptions must be implemented. But it is not possible to capture all of the rules
and exceptions of the Turkish language.

3.2.5 Root Reaching Method without Dictionary
This method was developed by Cebiroğlu and Adalı. The authors claim, and show, that
the analysis of a Turkish word into its root and suffixes can be formulated. The
suffixes that can be attached to a word root are divided into groups, and finite state
machines are formed by formulating the order of the suffixes in each of these groups. A
main machine is formed by combining these group-specific machines. In the
morphological analysis performed with the main machine, the word root is obtained by
removing the suffixes from the end of the word towards the start. The following
abbreviations are used in the suffixes:

U: ı, i, u, ü
A: a, e
D: d, t
C: c, ç
I: ı, i
(): optional letters

Example: "-cU" can be -cı, -ci, -cu, -cü

The morphological rules can be expressed as finite state machines. To reach
the root of the word, these rules are interpreted from right to left, from the
last suffix towards the beginning of the word. Separate but interdependent modules
are developed for each suffix set.

The following table shows the affix-verbs in Turkish, which are treated as one set.

Table 3.2 The affix-verbs in Turkish
1  (y)Um    6  m     11  cAsInA
2  sUn      7  n     12  (y)DU
3  (y)Uz    8  k     13  (y)sA
4  sUnUz    9  nUz   14  (y)mUş
5  lAr      10 DUr   15  (y)ken

The following figure shows a finite state machine that implements this table:

[Figure: finite state machine with states A to H; the transitions are labeled with the suffix numbers of Table 3.2]

Figure 3.1 Finite state machine of Table 3.2.

For example, the word "çalışkan-mış-sınız" is examined by this finite state
machine. The affix -sUnUz moves the machine from state A to state B, and the affix
-(y)mUş moves it from state B to state F. No move from state F is possible with the
remaining affix -n, so the process stops. Because F is a final state,
"çalışkan" is accepted as the possible root. But the actual root is "çalış-".
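
The right-to-left stripping in this example can be sketched as follows. Only the two transitions named in the text (A to B on sUnUz, B to F on (y)mUş) are encoded, so this is a deliberately partial machine; the function names are hypothetical, the word is written with its assumed Turkish spelling "çalışkanmışsınız", and the template expansion ignores vowel harmony, so it over-generates surface forms.

```python
from itertools import product

# Archiphonemes from the abbreviation list of the method
ARCHIPHONEMES = {"U": "ıiuü", "A": "ae", "D": "dt", "C": "cç", "I": "ıi"}


def surface_forms(template):
    """Expand a suffix template such as '(y)mUş' into candidate spellings."""
    slots, i = [], 0
    while i < len(template):
        ch = template[i]
        if ch == "(":                          # optional letters: present or absent
            close = template.index(")", i)
            slots.append(["", template[i + 1:close]])
            i = close + 1
            continue
        slots.append(list(ARCHIPHONEMES.get(ch, ch)))
        i += 1
    return {"".join(parts) for parts in product(*slots)}


# Partial machine: only the transitions spelled out in the text
TRANSITIONS = {"A": [("sUnUz", "B")], "B": [("(y)mUş", "F")]}
FINAL_STATES = {"F"}


def strip_suffixes(word):
    """Strip suffixes from the end; return (remainder, ended_in_final_state)."""
    state = "A"
    while True:
        for template, target in TRANSITIONS.get(state, []):
            matches = [f for f in surface_forms(template)
                       if word.endswith(f) and len(f) < len(word)]
            if matches:
                word = word[:-len(max(matches, key=len))]
                state = target
                break
        else:                                  # no transition applies: stop
            return word, state in FINAL_STATES


print(strip_suffixes("çalışkanmışsınız"))
```

As in the text, the machine stops in state F with "çalışkan" as the candidate root, because no dictionary is consulted to check whether a shorter root exists.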

New finite state machines are implemented for every set, such as the affixes used
for nouns and for verbs. In the end they are all combined into one finite state
machine, and the roots are found. The following figure shows the main finite state
machine.

Figure 3.2 The main finite state machine

However, in this approach a finite state machine could not be built for the
derivational suffixes, because the order of these suffixes in Turkish cannot be
captured by rules.

3.2.6 Extended Finite State Approach
This algorithm was developed by Oflazer. In this approach, a Turkish word is
represented as a sequence of inflectional groups (IGs), separated by ^DBs denoting
derivation boundaries, in the following general form:

root+Infl_1^DB+Infl_2^DB+...^DB+Infl_n

where each Infl_i denotes the relevant inflectional features, including the
part of speech, for the root or for any of the derived forms. For instance, the
derived determiner "sağlamlaştırdığımızdaki" ((the thing existing) at the time we
caused (something) to become strong) would be represented as:

sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Adj+PastPart+P1sg^DB+Noun+Zero+A3sg+Pnon+Loc^DB+Det

This word has 6 IGs:

1. sağlam+Adj
2. +Verb+Become
3. +Verb+Caus+Pos
4. +Adj+PastPart+P1sg
5. +Noun+Zero+A3sg+Pnon+Loc
6. +Det
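
Splitting an analysis into its IGs amounts to cutting the string at each derivation boundary. The sketch below assumes the analysis is written with "+" between features and "^DB" at derivation boundaries; the function name is illustrative.

```python
def inflectional_groups(analysis: str) -> list[str]:
    """Split a morphological analysis at its derivation boundaries (^DB)."""
    return analysis.split("^DB")


analysis = ("sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos"
            "^DB+Adj+PastPart+P1sg^DB+Noun+Zero+A3sg+Pnon+Loc^DB+Det")
igs = inflectional_groups(analysis)
for number, ig in enumerate(igs, start=1):
    print(number, ig)
```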

A sentence is then represented as the sequence of the IGs that make up its
words. When a word is considered as a sequence of IGs, syntactic relation links
emanate only from the last IG of a (dependent) word and land on one of the IGs of the
(head) word to its right (with minor exceptions), as exemplified in the following
figure:

Figure 3.3 Links and inflectional groups

With minor exceptions, the dependency links between the IGs, when drawn above
the IG sequence, do not cross. The following figure shows a dependency tree for a
sentence laid on top of the words segmented along IG boundaries.

The last line shows the final POS for each word.
Figure 3.4 Dependency links in an example Turkish sentence

The approach relies on augmenting the input with "channels" that (logically)
reside above the IG sequence and on "laying" links representing dependency relations in
these channels. The parser operates in a number of iterations: at each iteration,
a new empty channel is "stacked" on top of the input, and any possible links
are established using these channels, until no new links can be added. The channel
symbol 0 indicates that the channel segment is not used, while 1 indicates that the
channel is used by a link that starts at some IG on the left and ends at some IG on the
right, that is, the link is just crossing over the IG. If a link starts from an IG (ends on
an IG), a start (stop) symbol denoting the syntactic relation is used on the right
(left) side of the IG. The syntactic relations (along with the symbols used) that are
encoded in the parser are the following:

S (Subject), O (Object), M (Modifier, adv/adj), P (Possessor), C (Classifier),
D (Determiner), T (Dative Adjunct), L (Locative Adjunct), A (Ablative Adjunct),
I (Instrumental Adjunct).

3.2.7 FindStem Algorithm
FindStem was developed by Sever and Bitirim (Sever & Bitirim, 2003). The
algorithm contains a pre-processing step that converts all letters of the word to
lowercase and separates the letters after a punctuation mark in the word. It has three
components: "Find the Root", "Morphological Analysis" and "Choose the Stem".

In the "Find the Root" component, all possible roots of the examined word are found
by starting with the first character of the word and searching the lexicon
for this item. The next character is then appended to the item and the lexicon
search is repeated. This continues until the item becomes equal to the examined
word or until the system determines that the lexicon contains no more relevant roots
for the examined word. These roots and the production rules are then used to
derive the examined word. In the lexicon, the type information for every root word and
the possible root changes (when a root word combines with a suffix) are coded for use
in the morphological analysis. When a root and a suffix combine in Turkish, two
kinds of alteration of the root can occur: a change of the last vowel (e.g.
ara-arıyor) or the last consonant (e.g. kitap-kitabı) of the root word, and the
dropping of a middle vowel (e.g. oğul-oğlum). Lexicons are used in
stemming algorithms for Turkish to handle such situations.

A morphological analyzer is used in the "Morphological Analysis" component. In
Turkish a number of rules determine the form and order of
suffixation; derivational suffixes are used to change word meanings. Whether a
derivational suffix can be added to the end of a word is determined by the word type
(this information is coded in the lexicon for every word). Applying this procedure
finds all possible stems. Consider the examined word "edebilecek".
The longest possible roots retrieved from the lexicon are "edebi" and "edep". The
L-M and Identified Maximum Match (IMM) algorithms, which assign a stem by
matching the examined word with the longest root words, would select "edebi" as
output. But the examined word "edebilecek" cannot be produced from
this root; the morphological analysis procedure detects this.

In the last component, "Choose the Stem", the word stem is chosen from
the derivations in the derivations list.

Here is the algorithm:

2. Find all possible roots of the word in a lexicon and add them to the root
words list.
3. If the root words list is empty, add the word to the not-found records and exit.
4. Get the next root word from the root words list.
5. Apply morphological analysis to the root word.
6. After the morphological analysis, add the derivations formed to the derivations
list.
7. If any elements remain in the root words list, go to step 4.
8. Choose the word stem from the derivations in the derivations
list.
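
A toy sketch of these steps for the word "edebilecek". Everything in it is hypothetical: the lexicon maps surface forms of roots to dictionary forms (including the softened variants "edeb" and "ed"), the suffix set is a tiny sample, `derivable` stands in for the real morphological analysis, and taking the first surviving derivation is only a placeholder for the selection step, whose criteria are not described here.

```python
# Surface form of a root -> dictionary form (hypothetical entries)
LEXICON = {"edebi": "edebi", "edep": "edep", "edeb": "edep", "et": "et", "ed": "et"}
SUFFIXES = {"ebil", "ecek", "im", "ler"}  # hypothetical sample of suffix forms


def find_roots(word):
    """Step 2: scan left to right and collect every prefix found in the lexicon."""
    return [word[:i] for i in range(1, len(word) + 1) if word[:i] in LEXICON]


def derivable(rest):
    """Stand-in for morphological analysis: can `rest` be split into known suffixes?"""
    if not rest:
        return True
    return any(rest.startswith(s) and derivable(rest[len(s):]) for s in SUFFIXES)


def find_stem(word):
    roots = find_roots(word)
    if not roots:                       # step 3: caller records the word as not found
        return None
    derivations = [LEXICON[r] for r in roots if derivable(word[len(r):])]  # steps 4-7
    return derivations[0] if derivations else None  # step 8 (placeholder selection)


print(find_roots("edebilecek"), find_stem("edebilecek"))
```

In this toy setting "edebi" and "edep" are found as candidate roots but rejected, because the rest of the word cannot be built from the sample suffixes, while "et" survives.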

This algorithm finds all possible stems of the word and eliminates those that
are not in the derivations list. So this algorithm is better, but still far from
finding the "correct" stem.

3.3 Sentence Boundary Detection
For many natural language processing tasks, identifying sentence boundaries is
one of the most important prerequisites. Many available natural language processing
tools do not detect sentence boundaries reliably.

A list of end-of-sentence punctuation marks (e.g. ".", "!") can be used to find
the end of a sentence, but it is not sufficient on its own: a period can also be
used in an abbreviation, as a decimal point, in e-mail addresses, etc. Some examples
are shown below:

She comes here by 5 p.m. on Saturday evening.
At 5 p.m. I have to go to the hospital.

Because end-of-sentence characters are used in different situations, ambiguities
arise. Such ambiguity is the main problem of sentence boundary detection, and
to date there is no work, for Turkish or any other language, that resolves all
ambiguities of this kind.


CHAPTER FOUR
PROPOSED SYSTEM

A corpus can be thought of as a collection of texts gathered according to
particular principles for some particular purpose.

Generating a corpus involves several steps. These steps allow Turkish words to be
identified appropriately and make the corpus more functional. The steps of the
solution are as follows:

1. Sentence boundary detection
2. Examination of the types of words
3. Finding the stem and inflectional suffixes
4. Finding the root and derivational suffixes
5. Generating the large-scale Turkish corpus

First, the sentences in the text file are determined according to the rules
introduced in the following sections, and then these sentences are split into
words. After splitting, the words are examined, their types are found, and then
the stems and inflectional suffixes of these words are found. Finding types and
finding stems is an iterative process, because in Turkish it is not possible to find
the type of a word without knowing its stem, and vice versa.

After the stem of the word is found, the root and derivational suffixes are
separated, and then all data is stored in the corpus.

A summary of how the large-scale corpus is generated is depicted in the
following figure.

[Figure: block diagram with the blocks "Find all sentences", "Get a sentence", "Split sentence into words", "Examine type of word", "Split word into stem & inflectional suffixes", "Split stem into root & derivational suffixes", "Write in corpus" and the decision "End of sentence" (No: "Get other word"; Yes)]

Figure 4.1 Block diagram of the algorithm for generating the corpus

4.1 Sentence Boundary Detection
The first step in generating the corpus is finding the sentences. Turkish sentences
generally end with known punctuation marks such as ".", "...", "!" and "?".

Finding the end of a sentence is a complex process. As in other languages, there
are ambiguities in Turkish. For example:

Uluslar, bu ekonomik buhran sonucunda 2. Dünya Savaşı'nı yaşamıştır.
Bu sezon kaybedilen maç sayısı 2. Dünya Kupası'na katılma şansı azalıyor.

In the first sentence the "." character marks an ordinal number, but in the second
sentence it indicates the end of a sentence. In both cases the "." is followed by the
same word, which begins with an uppercase letter. So there is an ambiguity in finding
the end of the sentence.

In this work, to find the end of a sentence, a rule list is created first and stored
in XML format.

The rules are written as groups of three characters (e.g. "L.L"). The dot in the
middle of the group stands for the end-of-sentence characters. The left character
describes the first character of the word before the punctuation, and the
right character describes the first character of the word after the
punctuation. The meanings of the characters are shown in Table 4.1.

Figure 4.2 The rule list for sentence boundary detection

<rules>
<rule EOS="False">L.L</rule>
<rule EOS="True">L.U</rule>
<rule EOS="True">L.#</rule>
<rule EOS="True">?.'</rule>
<rule EOS="True">?."</rule>
<rule EOS="True">?.(</rule>
<rule EOS="True">?.)</rule>
<rule EOS="True">?.-</rule>
<rule EOS="True">?./</rule>
<rule EOS="False">U.L</rule>
<rule EOS="False">?.,</rule>
<rule EOS="False">#.L</rule>
<rule EOS="False">#.'</rule>
<rule EOS="False">#."</rule>
<rule EOS="False">#.(</rule>
<rule EOS="False">#.)</rule>
<rule EOS="False">#.-</rule>
<rule EOS="False">#.,</rule>
<rule EOS="False">#.#</rule>
<rule EOS="False">#.U</rule>
</rules>

Table 4.1 The meanings of the characters in the sentence boundary rule list
Character  Meaning
.          EOS punctuation (. ... ! ?)
L          Lowercase
U          Uppercase
#          Number
?          Any character
-          -
,          ,
(          (
)          )
/          /
"          "
'          '

These rules are intended to make finding the end of a sentence easier. However,
while the rules were being created, some difficulties arose from the particularities
of the Turkish language, and these difficulties have not yet been solved.
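
Applying the rule list can be sketched as a table lookup on the character classes to the left and right of the punctuation. The sketch below embeds a few of the rules from the XML listing; the function names, the precedence of an exact left class over the "?" wildcard and the `None` result when no rule applies are assumptions, since the matching order is not described.

```python
import xml.etree.ElementTree as ET

# A subset of the rule list, in the same XML layout
RULES_XML = """<rules>
<rule EOS="False">L.L</rule>
<rule EOS="True">L.U</rule>
<rule EOS="True">L.#</rule>
<rule EOS="True">?.(</rule>
<rule EOS="False">#.L</rule>
<rule EOS="False">#.U</rule>
</rules>"""


def load_rules(xml_text):
    """Map (left class, right class) to True/False for end of sentence."""
    rules = {}
    for rule in ET.fromstring(xml_text).iter("rule"):
        left, _, right = rule.text          # e.g. "L.U" -> "L", ".", "U"
        rules[(left, right)] = rule.get("EOS") == "True"
    return rules


def char_class(ch):
    """Classify a character using the symbols of the rule list."""
    if ch.isdigit():
        return "#"
    if ch.islower():
        return "L"
    if ch.isupper():
        return "U"
    return ch                               # punctuation stands for itself


def is_end_of_sentence(rules, before, after):
    """`before`/`after`: first characters of the words around the punctuation."""
    left, right = char_class(before), char_class(after)
    for key in ((left, right), ("?", right)):   # assumed: exact match before wildcard
        if key in rules:
            return rules[key]
    return None                             # no rule applies


rules = load_rules(RULES_XML)
print(is_end_of_sentence(rules, "7", "y"))  # "75. yılı"
print(is_end_of_sentence(rules, "k", "T"))  # "kutlandı. Tahta"
print(is_end_of_sentence(rules, "2", "D"))  # "2. Dünya"
```

The last call shows the limitation discussed in the text: the rule "#.U" gives the same answer for "2. Dünya" whether or not the period actually ends the sentence.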

As an example, some ambiguities are shown below:

Cumhuriyetimizin 75. yılı coşkuyla kutlandı.
Tahta çıkan IV. Murat emirler yağdırdı.
Olimpiyatlar için uzun zamandır çalışan Ahmet koşuda 2. Uzun atlamada
ise ancak 4. olabildi.
A. Mehmet YILDIZ size uğradı.
Alfabenin ilk harfi A. Mehmet'e bunu öğretmeniz gerekiyor.

For abbreviations that cause ambiguity in sentences, an XML file was created,
and the abbreviation list was collected in this file as shown in the following figure.

Figure 4.3 Example of the abbreviation list in the XML file

<abbrevations>
<abbr>A</abbr>
<abbr>AA</abbr>
<abbr>AAFSE</abbr>
<abbr>AAM</abbr>
<abbr>AB</abbr>
<abbr>ABD</abbr>
<abbr>ABS</abbr>
<abbr>ADSL</abbr>
<abbr>AET</abbr>
...
<abbr>HABITAT</abbr>
<abbr>HAVAS</abbr>
<abbr>HDD</abbr>
<abbr>hek</abbr>
...
<abbr>zf</abbr>
<abbr>zm</abbr>
<abbr>ZMO</abbr>
<abbr>zool</abbr>
</abbrevations>

The abbreviation and rule lists were written into two files in a standard format,
separate from the main program, so that users can change these files easily and
independently of the program.

Using the abbreviation and rule lists, the texts were split into sentences, and the
output was again written in XML format, as shown in the following figure.

Figure 4.4 Example of sentences in the XML file

<File OriginalName="ID_396984_M.txt">
<Paragraph Index="0">
<Sentence Index="0">
<Word Index="0"> Sigara </Word>
<Word Index="1"> kullanımının </Word>
<Word Index="2"> azalması </Word>
<Word Index="3"> konusunda </Word>
...
<Word Index="22"> mercilere </Word>
<Word Index="23"> başvurularımız </Word>
<Word Index="24"> oldu </Word>
</Sentence>
</Paragraph>
<Paragraph Index="1">
<Sentence Index="0">
<Word Index="0"> Prof. Dr.</Word>
<Word Index="1"> Tuncer</Word>
...
<Word Index="24"> gerektiğini</Word>
<Word Index="25"> bildirdi</Word>
</Sentence>
<Sentence Index="1">
<Word Index="0"> Tuncer</Word>
<Word Index="1"> aksi </Word>
<Word Index="2"> taktirde </Word>
...
<Word Index="17"> ifade </Word>
<Word Index="18"> etti </Word>
</Sentence>
...
</Paragraph>
...
</File>

As described in the following sections, the word types, roots and suffixes of the
words were added to this structure.
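
Writing the File, Paragraph, Sentence and Word nesting could look like the sketch below, which uses Python's standard `xml.etree.ElementTree` module. The element and attribute names follow the sample output; the function name, the nested-list input format and the file name "example.txt" are assumptions.

```python
import xml.etree.ElementTree as ET


def build_file_element(original_name, paragraphs):
    """paragraphs: list of paragraphs, each a list of sentences, each a list of words."""
    file_el = ET.Element("File", OriginalName=original_name)
    for p_idx, sentences in enumerate(paragraphs):
        par_el = ET.SubElement(file_el, "Paragraph", Index=str(p_idx))
        for s_idx, words in enumerate(sentences):
            sent_el = ET.SubElement(par_el, "Sentence", Index=str(s_idx))
            for w_idx, word in enumerate(words):
                word_el = ET.SubElement(sent_el, "Word", Index=str(w_idx))
                word_el.text = word
    return file_el


# One paragraph containing two short sentences
doc = build_file_element(
    "example.txt",
    [[["Sigara", "kullanımının"], ["Tuncer", "bildirdi"]]],
)
print(ET.tostring(doc, encoding="unicode"))
```

Word types, roots and suffixes could later be attached to each Word element, for example as further attributes or child elements; the layout actually used for that is not shown in this section.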
4.2 Examination of Type of Words
Finding types and finding stems is an iterative process, because in Turkish it is
not possible to find the type of a word without knowing its stem, and vice versa.

After the sentence was split into its words, it was processed in two steps:

1. Determining the stems of the words
2. Determining the available types of the words

First, all words in the sentence and their stems are determined by
morphological analysis.

In the second step, the probable types of the words (noun, adjective, ...) were
determined using an electronic dictionary. This makes the search space smaller, and
the word types that are not possible can be eliminated. In addition, stems whose
types do not suit the word's place in the sentence were eliminated.
4.3 Description of the Methods for Finding Roots
Finding the roots of words is a very important part of generating a corpus.
Finding the roots takes two steps:

1. Finding Stems and Inflectional Suffixes
2. Finding Roots and Derivational Suffixes
4.3.1 Finding Stems and Inflectional Suffixes
Finding the stems clears the words of their inflectional suffixes, so that
the words take the form in which they appear in the dictionary.

Several methods for determining stems were considered and attempted.
The first method was to produce all probable words in Turkish. This algorithm
would work as follows:

1. Concatenate the inflectional suffixes to the end of the roots to create different
stems.
2. Write all words to a file.
3. Look up the examined word in the file created in the previous step to
determine its root and inflectional suffixes.

In theory, this algorithm makes finding stems and roots fast. But technical
limitations made the implementation impossible; for example, operating systems did
not allow creating or appending to a file larger than 4 GB.

This method was then modified to perform the concatenation separately for each
letter of the alphabet and to write the results into separate files according to the
first letter of the words. This would give 29 files for Turkish words. But this method
would need too many resources and would perform poorly, and the same technical
limitations made its implementation impossible as well, so it was not used.

In the method finally developed to determine the stems of words, the stems
are stored in lists according to word type, and the inflectional suffixes are
stored in a separate list in their probable combinations. The two lists are then
searched at the same time, and the stem and suffixes are determined.
Some sample lists are shown in the following figures.

Figure 4.5 List of noun stems in Turkish
ab
aba
aba
abacı
abacılık
abadi
aba güreşi
abajur
abajurcu
abajurculuk
abaküs
abandırma
abani
abanma
abanoz
abanozgiller
abanozlaşma
abartı
abartıcı
abartıcılık
...

Figure 4.6 List of adjective stems in Turkish
...
belgili
belgin
belgisiz
beli bükük
beliğ
belirgin
belirli
belirli belirsiz
belirsiz
belirtik
belirtili
belirtisiz
belkili
belli
belli
belli başlı
belli belirsiz
...

The number of stems in each of the lists, which are organized by word type, is shown
in the following table:

Table 4.2 The number of stems (NOS) in the lists
Word type     NOS
Noun          44835
Verb          6483
Adjective     11128
Conjunction   30
Preposition   116
Adverb        2551
Pronoun       81
Interjection  297

This method is expected to work much faster than the previous approaches.
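
Searching the stem list and the suffix-combination list together can be sketched as trying every split point of the word. The two sets below are tiny hypothetical samples, and the function name is illustrative; with hash-based sets each membership test takes roughly constant time, which is one plausible reason such a lookup would be fast.

```python
NOUN_STEMS = {"abajur", "abajurcu", "okul"}                # hypothetical sample
SUFFIX_COMBINATIONS = {"", "lar", "ları", "da", "larda"}   # hypothetical sample


def split_stem(word, stems=NOUN_STEMS, suffixes=SUFFIX_COMBINATIONS):
    """Return every (stem, suffix combination) pair that reproduces `word`."""
    return [(word[:i], word[i:])
            for i in range(len(word), 0, -1)
            if word[:i] in stems and word[i:] in suffixes]


print(split_stem("okullarda"))
```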

4.3.2 Finding Roots and Derivational Suffixes
Previous work on finding roots was investigated, and it was seen that
two different methods were generally used (Köksal, 1975; Solak & Can, 1994; Solak
& Oflazer, 1993). Both methods have advantages and disadvantages. The
methods were:

1. Examining the word from beginning to end and finding the root in a
dictionary.
2. Examining the word from end to beginning, finding the root in a
dictionary and eliminating the suffixes.

In the first method, the letters of the examined word are taken one by one from
the beginning, and each resulting substring is checked against the dictionary.
The first match found in the dictionary is taken as the
root of the word. However, the first match is not always the root, as with the
Turkish word "bilek": this method finds the root
"bil-", which is not the real root.

In the second method, the letters of the examined word are taken one by one from
the end towards the beginning, and the resulting string is checked against the suffix
dictionary to determine the suffixes. At the same time, the remaining part of the word
is checked against the dictionary to determine the root. The aim is to determine
the real root and suffixes efficiently. But this method also has
disadvantages: the root of the word may not be found correctly, and the
suffix lists differ between sources, so this method is not efficient.

In this work, aIter the probable types and stems are Iound, the roots are Iound
Irom the root list by using suIIixes list (Korkmaz, 2003). All oI the roots are
separated into two Iiles; 'noun and 'verb. Words in all types except 'verb were
stored into 'noun list. The sample part oI the list oI roots is shown in the Iollowing
Iigure.

Figure 4.7 (a) Sample of noun roots. (b) Sample of verb roots.
(a)          (b)
abad         aban-
abajur       abart-
abaks        abra-
abandone     aci-
aban         a-
abanoz       ada-
abaso        agin-
abat         agla-
Abaza        ag-
abazan       agna-
Abbas        agri-
abd          ak-
...          ...

All suffixes are separated into four files according to their function: suffixes
used for deriving 'from noun to noun', 'from noun to verb', 'from verb to noun'
and 'from verb to verb'. Samples of the lists of stems, roots and suffixes are
shown in the following figures.


Figure 4.8 Sample of stems and roots in the 'from noun to noun' suffixes list.
Figure 4.9 Sample of stems and roots in the 'from verb to noun' suffixes list.
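
One way to picture the two root lists and the four derivational suffix lists is
as lookup tables keyed by word type. The sketch below rests on assumptions: the
names ROOTS, DERIVATIONAL_SUFFIXES and split_stem and the few sample entries are
hypothetical and stand in for the real lists, which are far larger.

```python
# Hypothetical miniature versions of the two root lists and four suffix lists.
ROOTS = {"noun": {"göz", "su"}, "verb": {"bil", "sev"}}
DERIVATIONAL_SUFFIXES = {
    ("noun", "noun"): {"lük", "cu"},
    ("noun", "verb"): {"la"},
    ("verb", "noun"): {"gi"},
    ("verb", "verb"): {"dir"},
}


def split_stem(stem, stem_type):
    """Return every (root, root_type, suffix) that derives `stem` of type `stem_type`."""
    candidates = []
    for (source_type, target_type), suffixes in DERIVATIONAL_SUFFIXES.items():
        if target_type != stem_type:
            continue  # this list derives a different word type
        for suffix in suffixes:
            root = stem[: len(stem) - len(suffix)]
            if stem.endswith(suffix) and root in ROOTS[source_type]:
                candidates.append((root, source_type, suffix))
    return sorted(candidates)


print(split_stem("gözlük", "noun"))  # prints: [('göz', 'noun', 'lük')]
print(split_stem("bilgi", "noun"))   # prints: [('bil', 'verb', 'gi')]
```

Knowing the type of the stem in advance restricts the search to two of the four
suffix lists, which is the point of determining the word type first.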

The root-finding step in generating the large-scale corpus is illustrated in the
following figure.

Figure 4.10 Block diagram of the algorithm for the root-finding step in
generating the corpus
[Block diagram; the boxes read:]
Examine type of word
Split word into stems & inflectional suffixes by using the list of stems
generated according to the determined type, e.g. verb
Split stems into roots & derivational suffixes by using the lists of suffixes
and roots according to the determined type, e.g. verb
Write all found roots and suffixes into corpus
Generate stems by using the roots and suffixes lists
Store stems in lists according to their types

There are 16203 noun roots and 738 verb roots in the lists. The numbers of roots
per initial letter in the noun and verb lists are shown in the following table:

Table 4.3 The number of roots in the noun and verb lists
Letter   Number of Noun Roots   Number of Verb Roots
A        1101                   43
B        839                    60
C        284                    4
Ç        344                    45
D        631                    66
E        535                    29
F        596                    12
G        447                    32
H        770                    16
I        71                     12
İ        706                    22
J        73                     0
K        1712                   106
L        377                    0
M        1994                   2
N        370                    0
O        267                    0
Ö        60                     19
P        836                    16
R        438                    0
S        1106                   99
Ş        328                    6
T        1346                   37
U        85                     22
Ü        80                     12
V        283                    3
Y        256                    74
Z        268                    1
Total    16203                  738

The roots found were stored in an XML structure to generate the large-scale
corpus. This XML structure needs a very large amount of memory, which is the
only drawback of this method. The problem was solved by carrying out all
processing in memory, by using pointers, etc.

In addition, by applying statistical and morphological analysis techniques such
as n-gram analysis, the number of candidate roots can be decreased and the real
root of the word can be found.
4.4 Generate Large Scale Turkish Corpus
One of the biggest problems in NLP work appears when storing words in large
databases, retrieving them and running analyses on these databases. Databases
use a variety of algorithms for storage and retrieval, but for specialized work
such as natural language processing the performance of these algorithms is not
sufficient.

Databases were not used in the previous works; instead, specialized structures
were used, and some statistical analyses were applied on the corpus (Çebi &
Dalkılıç, 2004).

To generate the corpus, an XML structure was used to solve these kinds of
problems. As stated before, its only drawback is that it uses a large amount of
memory.
4.4.1 Data in the Corpus
The data used to generate the corpus is shown in the following table.
Table 4.4 The number of files, NODW, file sizes, and distribution (%) of data
Web Sites                     Number of Files   NODW       File Size (MB)
www.netgazete.com.tr          483428            1006692    400
www.aksam.com.tr              13934             345440     45.7
www.tercuman.com.tr           11704             467746     42.7
www.yeniasir.com.tr           13609             240672     64.5
Subtitles                     6105              606704     152.0
Turkish novels and stories    240               10279162   77.9
Total                         529020            12946416   782.8

The list of the Turkish novels and stories is given in Appendix A.

4.4.2 Definition of the Corpus Structure
A database system stores any kind of data and allows users to process this data
declaratively using predefined query languages such as Structured Query
Language. One of the main concerns of a database system is performance, which is
an important criterion when evaluating it. Each database system has many
different algorithms to store and access data; B Tree and Hashing algorithms are
the most commonly used ones. Because database systems are general-purpose
systems, they are not well suited to Natural Language Processing, so a specific
solution is needed for this specific aim. The developed system is not a database
system; it is specific to Natural Language Processing operations.

The system processes a data file in XML format, together with a schema
definition in XSD that is used to validate the given XML file. For the key
column mappings between the implemented code and the data file(s), the system
needs another XML file. The schema definition for this mapping XML file is shown
in the following figure.

Figure 4.11 Schema definition for the mapping XML file
<?xml version="1.0"?>
<xs:schema id="mappings"
targetNamespace="http://tempuri.org/CodeXMLMapping.xsd"
xmlns:mstns="http://tempuri.org/CodeXMLMapping.xsd"
xmlns="http://tempuri.org/CodeXMLMapping.xsd"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
attributeFormDefault="qualified"
elementFormDefault="qualified">
<xs:element name="mappings" >
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element name="map" nillable="false">
<xs:complexType>
<xs:simpleContent >
<xs:extension base="xs:string">
<xs:attribute name="Key" form="unqualified"
use="required" type="xs:string" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>


Figure 4.12 Sample configuration of XML file
<?xml version="1.0" encoding="utf-8"?>
<mappings>
  <map Key="FileElement">file</map>
  <map Key="TitleAttribute">Title</map>
  <map Key="AuthorAttribute">Author</map>
  <map Key="URLAttribute">URL</map>
  <map Key="PublisherAttribute">Publisher</map>
  <map Key="FileTypeAttribute">FileType</map>
  <map Key="ParagraphElement">paragraph</map>
  <map Key="ParagraphIndexAttribute">ParagraphIndex</map>
  <map Key="SentenceElement">sentence</map>
  <map Key="SentenceIndexAttribute">SentenceIndex</map>
  <map Key="WordElement">word</map>
  <map Key="WordIndexAttribute">WordIndex</map>
  <map Key="WordValueElement">value</map>
  <map Key="WordStemElement">stem</map>
  <map Key="WordTypeElement">type</map>
</mappings>

For this configuration, a valid XML schema definition for the input files should
be as shown in Figure 4.13.
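
How such a mapping file might be consumed by the implemented code can be
sketched as follows. This is an assumption for illustration only: the thesis
does not show the loading code, the function name load_mappings is hypothetical,
and only three of the keys of Figure 4.12 are repeated.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mapping file; only three of the keys are repeated here.
MAPPING_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<mappings>
  <map Key="WordElement">word</map>
  <map Key="WordValueElement">value</map>
  <map Key="WordStemElement">stem</map>
</mappings>"""


def load_mappings(xml_bytes):
    """Return a dict from mapping key to the element or attribute name in the data file."""
    root = ET.fromstring(xml_bytes)
    return {entry.attrib["Key"]: entry.text for entry in root.findall("map")}


mappings = load_mappings(MAPPING_XML)
print(mappings["WordStemElement"])  # prints: stem
```

With such a table the code can refer to a fixed key such as WordStemElement,
while the element name actually used in the data files stays configurable.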


Figure 4.13 Sample schema definition for input files
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="urn:Common-Schema" elementFormDefault="qualified"
    targetNamespace="urn:Common-Schema">
  <xs:element name="Common">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="file">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="paragraph" minOccurs="0"
                  maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sentence" minOccurs="0"
                        maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="word" minOccurs="0"
                              maxOccurs="unbounded">
                            <xs:complexType>
                              <xs:sequence>
                                <xs:element name="value"
                                    type="xs:string" minOccurs="0" />
                                <xs:element name="stem"
                                    type="xs:string" minOccurs="0" />
                                <xs:element name="type"
                                    type="xs:nonNegativeInteger" minOccurs="0" />
                              </xs:sequence>
                              <xs:attribute name="WordIndex"
                                  form="unqualified" type="xs:nonNegativeInteger" />
                            </xs:complexType>
                          </xs:element>
                        </xs:sequence>
                        <xs:attribute name="SentenceIndex"
                            form="unqualified" type="xs:nonNegativeInteger" />
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute name="ParagraphIndex"
                      form="unqualified" type="xs:nonNegativeInteger" />
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="Title" form="unqualified"
                type="xs:string" />
            <xs:attribute name="Author" form="unqualified"
                type="xs:string" />
            <xs:attribute name="URL" form="unqualified"
                type="xs:string" />
            <xs:attribute name="Publisher" form="unqualified"
                type="xs:string" />
            <xs:attribute name="FileType" form="unqualified"
                type="xs:nonNegativeInteger" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

A valid XML file that can be used for processing is shown in the following
figure. The stem and the type of each word are added in this file.

Figure 4.14 Sample valid XML file for processing
<?xml version="1.0" encoding="utf-8" ?>
<file Title="Politik Güvenlik" Author="Ömer Çelik"
    URL="www.sabah.com.tr" Publisher="Sabah Gazetesi"
    FileType="2">
  <paragraph ParagraphIndex="0">
    <sentence SentenceIndex="0">
      <word WordIndex="0">
        <value>Uluslararası</value>
        <stem>Uluslararası</stem>
        <type>1</type>
      </word>
      <word WordIndex="1">
        <value>toplantılarda</value>
        <stem>toplantı</stem>
        <type>65</type>
      </word>
      <word WordIndex="2">
        <value>güvenliğin</value>
        <stem>güvenlik</stem>
        <type>1</type>
      </word>
      ....
    </sentence>
  </paragraph>
  <paragraph ParagraphIndex="1">
    <sentence SentenceIndex="0">
      ....
      <word WordIndex="...">
        <value>gündemlerinin</value>
        <stem>gündem</stem>
        <type>65</type>
      </word>
      <word WordIndex="...">
        <value>olmaması</value>
        <stem>ol</stem>
        <type>385</type>
      </word>
      <word WordIndex="...">
        <value>gerekir</value>
        <stem>gerek</stem>
        <type>8</type>
      </word>
    </sentence>
  </paragraph>
  ....
</file>
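
Reading such a file back is straightforward with a standard XML parser. The
sketch below is illustrative and not the system described in this thesis: the
function name iter_words is hypothetical, the sample is shortened to two words,
and the element names are hard-coded instead of being taken from the mapping
file.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the layout of Figure 4.14 (two words only).
SAMPLE = """<file Title="Politik Güvenlik" FileType="2">
  <paragraph ParagraphIndex="0">
    <sentence SentenceIndex="0">
      <word WordIndex="0"><value>Uluslararası</value><stem>Uluslararası</stem><type>1</type></word>
      <word WordIndex="1"><value>toplantılarda</value><stem>toplantı</stem><type>65</type></word>
    </sentence>
  </paragraph>
</file>"""


def iter_words(xml_text):
    """Yield (paragraph, sentence, word index, value, stem, type) for every word."""
    root = ET.fromstring(xml_text)
    for paragraph in root.iter("paragraph"):
        p = int(paragraph.get("ParagraphIndex"))
        for sentence in paragraph.iter("sentence"):
            s = int(sentence.get("SentenceIndex"))
            for word in sentence.iter("word"):
                yield (p, s, int(word.get("WordIndex")),
                       word.findtext("value"), word.findtext("stem"),
                       int(word.findtext("type")))


for record in iter_words(SAMPLE):
    print(record)
```

Each record carries the position of the word together with its stem and type,
which is the information needed for later statistical analyses.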

CHAPTER FIVE
CONCLUSION

NLs are ambiguous in speech, grammar, meaning, etc. To resolve local ambiguity,
humans employ not only a detailed knowledge of the language itself (its sounds,
the rules about sound combinations, its grammar and lexicon, word meanings, and
the meanings derived from word combinations and orderings), but also a large and
detailed knowledge of the world and the ability to infer what a speaker meant
even if he or she did not actually say it. These are the factors that make NLs
so difficult to process by computer. Nevertheless, developing technology
requires that languages be processed by computers. Several techniques have been
tried to solve these problems, but none of them has achieved 100 percent
success.

Machine learning or statistical analysis may be used to resolve ambiguities in
sentence boundary detection, where rules cannot decide whether a position is the
end of a sentence or not. Words in Turkish have such ambiguities, too. For
example, the root of 'gölge' is 'gölge', but there is also a word 'göl' in
Turkish; algorithms have not been able to find the real root and report both
words as roots. Also, when a root and a suffix are combined in Turkish, two
alterations of the root may occur: a change of the last vowel (e.g. ara-arıyor)
or consonant (e.g. kitap-kitabı) of the root, and the drop of a middle vowel
(e.g. oğul-oğlum). Such situations prevent root-finding algorithms from working
properly. In this thesis, these problems were addressed by determining the word
type before trying to find the root. After the word type was found according to
the place of the word in the sentence, the root list of this word type was used
to determine the root. This technique solved some of the problems. Even though
it sometimes produced more than one root, it worked better than the other
algorithms. Other methods were investigated, and none of them was seen to have
100 percent success in finding the correct stem or root of a word. Finding the
'real' root is very difficult because the language is agglutinative and has
ambiguities. Again, this problem may be solved by using machine learning and
statistical analyses.
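
The effect of the consonant alteration on an exact dictionary lookup can be
illustrated briefly. The sketch is an assumption for illustration and is not
part of the method of this thesis: the names HARDEN, dictionary_candidates and
lookup_root are hypothetical, and only the softening of a final p, ç, t or k is
covered.

```python
# Reverse map for final-consonant softening: surface letter -> dictionary letter.
HARDEN = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}


def dictionary_candidates(surface_stem):
    """Return the surface stem and, if its last letter may be softened, the hardened form."""
    candidates = [surface_stem]
    if surface_stem and surface_stem[-1] in HARDEN:
        candidates.append(surface_stem[:-1] + HARDEN[surface_stem[-1]])
    return candidates


def lookup_root(surface_stem, roots):
    """Return the first candidate found in `roots`, or None."""
    for candidate in dictionary_candidates(surface_stem):
        if candidate in roots:
            return candidate
    return None


roots = {"kitap"}
print("kitab" in roots)             # prints: False (exact lookup fails)
print(lookup_root("kitab", roots))  # prints: kitap
```

Removing the suffix from 'kitabı' leaves 'kitab', which is not in the root list;
only a lookup that also tries the unsoftened form recovers 'kitap'.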

The number of possible words generated by adding suffixes is practically
infinite. As such, a finite-size lexicon for Turkish would miss a significant
percentage of Turkish words. This makes lexicon-based text recognition
approaches unsuitable for Turkish and other agglutinative languages. In this
thesis, the stems were generated from the root and suffix lists, taking only the
roots that begin with the same letter as the examined word. This made the search
space smaller and made generating the stems by adding suffixes feasible in
practice. As an alternative to this method, finite state machines are considered
suitable for Turkish because it is a rule-based language. However, not all rules
can be known, because the language grows fast and the rules are constantly
modified and extended.
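
The restriction to roots that share the first letter with the examined word can
be sketched as a simple index. The function names and the small root list below
are assumptions made for illustration; the real root lists are those described
in Chapter 4.

```python
from collections import defaultdict


def index_by_first_letter(roots):
    """Group the (non-empty) roots by their first letter."""
    index = defaultdict(list)
    for root in roots:
        index[root[0]].append(root)
    return index


def candidate_roots(word, index):
    """Roots that start with the same letter as `word` and are a prefix of it."""
    return [root for root in index.get(word[0], []) if word.startswith(root)]


# Hypothetical root list; only the roots filed under "b" are compared with the word.
index = index_by_first_letter(["bil", "bilek", "ara", "at"])
print(candidate_roots("bilekler", index))  # prints: ['bil', 'bilek']
```

Only one bucket of the index is scanned per word, which is how the search space
shrinks.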

The most important point about working on a large corpus is that analysing it
requires much CPU and memory power. It was seen that programs using databases
such as MySQL are too slow for this kind of operation because of the general
nature of databases. In this thesis, to avoid these drawbacks of databases, an
XML structure is used to generate the corpus. It allows any word to be retrieved
quickly by using the memory of the computer, but its known drawback is that it
needs too much space on the disk drive. By using specific and suitable
algorithms, much time can be saved and more detailed analyses can be done than
in databases.

In the future, in order to make the analyses more effective, the corpus size can
be increased by adding new sources including novels, technical papers, written
reports, theses, etc. At the same time, by classifying these documents,
different corpora for different fields can be generated, e.g. a medical corpus,
an engineering corpus, etc.

REFERENCES
Antworth, E. (1990). PC-KIMMO: A two-level processor for morphological analysis.
TX: Summer Institute of Linguistics, Dallas.
BNC. What is BNC. (n.d.). Retrieved March 3, 2005, from
http://www.natcorp.ox.ac.uk
British National Corpus (BNC). (n.d.). Retrieved March 3, 2005, from
http://www.hcu.ox.ac.uk/BNC/what/index.html
Brodda, B., & Karlsson, F. (1980). An experiment with morphological analysis of
Finnish. Papers from the Institute of Linguistics, University of Stockholm,
Publication 40, Stockholm.
Cebiroğlu, G., & Adalı, E. (2002). Root reaching method without dictionary.
Istanbul Technical University Computer Engineering Department, Istanbul, Turkey.
Church, K., & Gale, W. (1991). Probability scoring for spelling correction.
Statistics and Computing, 93-103.
Church, K., & Mercer, R. (1993). Introduction to the Special Issue on
Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1),
1-24.
COSMAS, German Corpus. (n.d.). Retrieved January 5, 2005, from
http://corpora.ids-mannheim.de/~cosmas/
Coxhead, P. (2002). An Introduction to Natural Language Processing (NLP).
Retrieved June 26, 2005, from www.cs.bham.ac.uk
Croatian National Corpus. (n.d.). Retrieved January 15, 2005, from
http://www.hnk.ffzg.hr/corpus.htm
Crystal, D. (1991). A Dictionary of Linguistics and Phonetics, Blackwell, 3rd
Edition.

Çebi, Y., & Dalkılıç, G. (2004). Turkish Word N-gram Analyzing Algorithms for a
Large Scale Turkish Corpus - TurCo, ITCC 2004, IEEE International Conference
on Information Technology, Vol:2, pp. 236-240.
Dalkılıç, G. (2001). Some Statistical Properties of Contemporary Printed Turkish
and A Text Compression Application. MSc Thesis. International Computing
Institute, Ege University.
Dalkılıç, M.E., & Dalkılıç, G. (2001). Some Measurable Language Characteristics
of Printed Turkish. Proc. of the XVI. International Symposium on Computer and
Information Sciences, 217-224.
Diri, B. (2000). A Text Compression System Based on the Morphology of Turkish
Language. Proc. of the XV International Symposium on Computer and
Information Sciences, 12-23.
English Gigaword. (n.d.). Retrieved January 5, 2005, from
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
French Corpus. (n.d.). Retrieved January 5, 2005, from
http://www.elda.fr/cata/text/W0020.html, 02/01/2003.
Güngördü, Z. (1993). A lexical-functional grammar for Turkish. MSc Thesis.
Computer Engineering Department, Bilkent University, Ankara.
Hankamer, J. (1984). Turkish generative morphology and morphological parsing,
Second International Conference on Turkish Linguistics, Istanbul.
Hankamer, J. (1989). Morphological parsing and lexicon. Lexical Representation
and Process, MIT Press.
Jurafsky, D., & Martin, J.H. (2000). Speech and Language Processing, Prentice
Hall, 193-199.
Kasper, R., & Weber, D. (1982). Users reference manual for the Cs Quechua
adaptation program. Occasional Publications in Academic Computing, (8,9),
Summer Institute of Linguistic, Inc.
Klimova, J. (1996). CZECH NATIONAL CORPUS (CNC). Retrieved June 28, 2005,
from http://www.ling.ohio-state.edu/~dm/events/EastWest96/cnc.html
Korkmaz, Z. (2003). Türkiye Türkçesi Grameri, TDK, Ankara.
Koskenniemi, K. (1983). Two-level morphology. University of Helsinki, Department
of General Linguistics, Publication No. 11, Helsinki, Finland.
Köksal, A. (1975). Automatic Morphological Analysis of Turkish. Hacettepe
University, Ankara, Turkey.
Kukich, K. (1992). Technique for automatically correcting words in text.
Periodical Issue Article of ACM Press, 377-439.
Kut, A., Alpkoçak, A., & Özkarahan, E. (1995). Bilgi bulma sistemleri için
otomatik Türkçe dizinleme yöntemi. Bilişim Bildirileri, Dokuz Eylül University,
Izmir, Turkey.
METU Corpus. (n.d.). Retrieved October 20, 2004, from
http://www.ii.metu.edu.tr/~corpus/corpus.html
Nadas, A. (1984). Estimation of probabilities in the language model of the IBM
speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 32(4), 859-861.
Oflazer, K. (1999). Dependency Parsing with an Extended Finite State Approach.
Department of Computer Engineering, Bilkent University, Ankara, Turkey.
Packard, D. (1973). Computer-assisted morphological analysis of Ancient Greek.
Computational and Mathematical Linguistics. Proceedings of the International
Conference on Computational Linguistics, Pisa, Leo S. Olschki, Firenze, 343-355.
Sagvall, A. (1973). A system for automatic inflectional analysis implemented for
Russian, Data Linguistica 8, Almquist and Wiksell, Stockholm.

Sinclair, J. (1991). Corpus Concordance, Collocation. OUP.
Sever, H., & Bitirim, Y. (2003). FindStem: Analysis and evaluation of a Turkish
stemming algorithm. Department of Computer Engineering, Başkent University,
Ankara, 06530 Turkey; Department of Computer Engineering, Eastern
Mediterranean University, Famagusta, T.R.N.C.
Shannon, C.E. (1951). Prediction and Entropy of Printed English. The Bell System
Technical Journal, 30(1), 50-64.
Solak, A. (1991). Design and Implementation of a spelling checker for Turkish,
Department of Comp. Eng. and Information Sciences, Bilkent Unv., Ankara,
Türkiye.
Solak, A., & Oflazer, K. (1993). Design and Implementation of a Spelling Checker
for Turkish, Department of Computer Engineering and Information Science,
Bilkent Unv., Ankara, Türkiye.
Solak, A., & Can, F. (1994). Effects of stemming on Turkish text retrieval,
Technical report BUCEIS-94-20, Bilkent University, Ankara, Turkey.
The ANC Project. (n.d.). Retrieved January 5, 2005, from
http://americannationalcorpus.org
The Bank of English. (n.d.). Retrieved March 10, 2005, from
http://www.cobuild.collins.co.uk/boeinfo.html
The Bank of English - Terms & Conditions. (n.d.). Retrieved March 10, 2005, from
http://www.titania.bham.ac.uk/docs/svenguide.html
The Czech National Corpus (CNC). (n.d.). Retrieved January 15, 2005, from
http://ucnk.ff.cuni.cz/english/index.html
Weber, D. J., Black, H. A., & McConnel, S. R. (1988). AMPLE: a tool for
exploring morphology, TX: Summer Institute of Linguistics, Dallas.

ABBREVIATIONS

The following acronyms have been used in this thesis:

DWUR    Different Word Using Ratio
IMM     Identified Maximum Match
MT      Machine Translation
NL      Natural Language
NLP     Natural Language Processing
NOS     Number Of Stems
NODW    Number Of Different Words
NOW     Number Of Words

APPENDICES
A. The List of the Novels & Stories in Corpus
NAME of NOVEL / STORY AUTHOR SITE NOW
2001 A.C.Clarke http://www.pdaturk.com/ 56868
2010 A.C.Clarke http://www.pdaturk.com/ 55226
2061 A.C.Clarke http://www.pdaturk.com/ 43824
1.Dogu Halklari Kurultayi Nurer Ugurlu http://www.pdaturk.com/ 62279
365 neri Ellison&Barnett http://www.pdaturk.com/ 39243
Access Anonim http://www.pdaturk.com/ 15609
Adamlik Dini Vural Yayincilik http://www.kitap.perisi.com/ 11705
Ag Yklemesi Anonim http://www.pdaturk.com/ 11623
Agir Roman Metin Kaan http://www.pdaturk.com/ 27635
Alacakaranlikta Tonio Kroger http://www.kitaplik.com 24914
Aldatmak Ahmet Altan http://www.pdaturk.com/ 44314
Alice... Lewis Carroll http://www.kelepirkitap.com 19258
Ankara Aniti Augustus http://www.pdaturk.com/ 18967
Apartman Emile Zola http://www.pdaturk.com/ 49039
Atatrk ve Inn John Grew http://www.pdaturk.com/ 31821
Atatürk, bir milletin yeniden doğuşu Kinross http://www.1001kitap.com/ 58878
Atatürk'ün Anadoluya Gönderiliş... Baki Iz 25110
Ates Deniz Margaret Weis http://www.kitap.perisi.com/
Atinalilarin Devleti Aristoteles http://www.maximumbilgi.com 18697
Atuan Mezarlari Ursula Kroeber LeGuin http://www.pdaturk.com/ 34437
Avalon'un Sisleri-By Ustasi M.Z.Bradley http://www.pdaturk.com/ 86054
Avrupa ile Asya Arasi... Nurer Ugurlu http://www.pdaturk.com/ 27510
Bakir Atli Puskin http://www.pdaturk.com/ 8737
Bartleby Herman Melville http://www.pdaturk.com/ 11816
Baskasinin Karisi Dostoyevski http://www.kitaplik.com 15031
Bassiz At Paul Berna http://www.haberbilgi.com 27666
Beden Dili Baki Evkarali http://www.pdaturk.com/ 24544
Beden Egitimi Giyasettin Demirhan http://www.meb.gov.tr/ 18000
Bedwyr'in Kilici R.A.Salvatore http://www.pdaturk.com/ 63584

Belki de Gerekten Istiyorsun Murat Glsoy http://www.altkitap.com 21000
Bella'nin lm Georges Simenon http://www.pdaturk.com/ 34190
Beni Duyuyor musun? Leyla Navaro http://www.pdaturk.com/ 32534
Benito Cereno Herman Belville http://www.pdaturk.com/ 26302
Beyaz Geceler Dostoyevski http://www.pdaturk.com/ 26623
Beyaz Gemi Cengiz Aytmatov http://www.pdaturk.com/ 37027
Bilim Is Basinda Lenihan http://www.1001kitap.com/ 9347
Binbogalar EIsanesi Yasar Kemal http://www.pdaturk.com/ 67344
Bir iIt Yrek Marln Morgan http://www.kitap.perisi.com/ 39981
Bir Laboratuvar Romansi Adnan Kurt http://www.altkitap.com 23451
Bozkirda Maksim Gorki http://www.pdaturk.com/ 17965
Bozkirda Bir Kral Lear Turgenyev http://www.pdaturk.com/ 19073
Btn ykleri-2 Sabahattin Ali http://www.pdaturk.com/ 71500
Bycler Krali http://www.pdaturk.com/ 168434
CANDIDE YA DA IYIMSERLIK Voltaire 24599
Cazname Tunel Glsoy http://www.altkitap.com 46506
Cazname-2 Tunel Glsoy http://www.altkitap.com 49050
Cemile Cengiz Aytmatov http://www.pdaturk.com/ 22620
Cemile Orhan Kemal http://www.pdaturk.com/ 30046
CHP Tzk chp http://izmir.chp.org.tr 14238
alikusu R.Nuri Gntekin http://www.pdaturk.com/ 94267
arlinin ikolata Fabrikasi Roald Dahl http://www.1001kitap.com/ 21893
ocuk ve Ergen Gelisimi Bekir Onur http://www.pdaturk.com/ 109516
l Gezegeni Dune Frank Herbert http://www.kitap.perisi.com/ 157194
zm Plani Annan http://www.pdaturk.com/ 39054
Daragacinda 3 Iidan Nihat Behram http://www.pdaturk.com/ 32008
Deccal F.Nietzsche http://www.pdaturk.com/ 24603
Degirmen, Kagni, Ses Sabahattin Ali http://www.pdaturk.com/ 62943
Degirmenimden Mektuplar Alphonso Daudet http://www.pdaturk.com/ 35311
Degisim Franz KaIka http://www.pdaturk.com/ 15706
Deliligin Daglarinda H.P.LovecraIt http://www.pdaturk.com/ 31996
Denemeler Montaigne http://www.pdaturk.com/ 50746
Denizden Gelen Lezzet Anonim http://ekitap.8m.com/ 15888
Devlet Adami Platon http://www.pdaturk.com/ 19071

Devrim Tarihi ve Toplumbilim Açısından Atatürk Emre Kongar 104852
Dil gretim... Anonim http://www.pdaturk.com/ 18508
Dinde Siyasal Islam Tekeli Nur Serter http://www.pdaturk.com/ 40070
Disi Kurdun Ryalari Cengiz Aytmatov http://www.pdaturk.com/ 86819
Dogudaki Hayalet Pierre Loti http://www.pdaturk.com/ 22369
Doktor Faustus Cristopher Marlowe http://www.pdaturk.com/ 10926
Dost Kazanma Anonim http://www.pdaturk.com/ 29319
Dvs Kulb - 1 Chuck Palahniuk http://www.pdaturk.com/ 18126
Dvs Kulb - 2 Chuck Palahniuk http://www.pdaturk.com/ 18960
Duvar J.Paul Sartre http://www.pdaturk.com/ 46800
Dsnyorum yleyse Vurun Ilhan Seluk http://www.pdaturk.com/ 32203
Ecco Homo F. Nietzsche http://www.ayrinti.net/nietzsche 23329
EIendi ile Usagi Lev Tolstoy http://www.pdaturk.com/ 19727
Egitim Politikamiz Mahmut Adem http://www.pdaturk.com/ 27403
Ejderha Mizragi Ejderha Mizragi http://www.ankira.com/ 38755
Ekonomi Karisik http://www.IilozoI.tripod.com 8000
ElI Yildizi M. Weis&T.Hickman http://www.kitapperisi.com 88974
Emek... chp http://www.chp.org.tr 25389
Empedokles Holderlin http://www.pdaturk.com/ 19774
Enternasyonel Sule Bucak http://www.chp.org.tr 29250
Erzurum... Puskin http://www.kelepirkitap.com 30642
Evrim Kurami ve Bagnazlik Cemal Yildirim http://www.1001kitap.com/ 43245
Excel 2000 Kitapik Hakki cal http://www.pdaturk.com/ 21372
Faust Goethe http://www.pdaturk.com/ 12306
FelseIe Tarihi Karisik http://www.IilozoI.tripod.com 20409
FelseIenin Baslangi Ilkeleri Georges Politzer http://www.pdaturk.com/ 47165
FelseIi Kavramlar mer Sevingl http://www.kitaplik.com 19844
Fen gretimi Fitnat Kaptan http://www.meb.gov.tr/ 17102
Gelin Birlik Olalim Harun Yahya http://www.harunyahya.net 41865
Gelisim Psikolojisi Bekir Onur http://www.pdaturk.com/ 66439
Genlik Projesi chp http://www.chp.org.tr 5755
Gilgamis Destani MuzaIIer Ramazanoglu http://www.pdaturk.com/ 16355
Gz Ucuyla Dean Koontz http://www.pdaturk.com/ 181675

Glsn ve Unutusun Kitabi Milan Kundera http://www.pdaturk.com/ 57485
Gnes lkesi Tommaso Campanella http://www.pdaturk.com/ 23682
Gnmz Basininda Kadin(lar) Leyla Simsek http://www.altkitap.com 37280
Gzel Konusma Anonim http://www.pdaturk.com/ 25853
Harry Potter 4 J.K.Rowling http://www.pdaturk.com/ 150000
Harry Potter-FelseIe Tasi J.K.Rowling http://www.pdaturk.com/ 56271
Harry Potter-Sirlar Odasi J.K.Rowling http://www.pdaturk.com/ 67309
Harry Potter-Zümrüdüanka Yoldaşlığı J.K.Rowling 201465
Hastahane mit Sahin http://www.kaliteoIisi.com 30909
HAYATIM HARBIDEN ROMAN Mehmet Kartal 66844
Hayatin Kkleri Mahlon Hoagland http://www.pdaturk.com/ 25621
Hayvan Mezarligi Stephen King http://www.kitap.perisi.com/ 87072
HedeI Trkiye Oktay Sinanoglu http://www.kitap.perisi.com/ 60768
Huzur A.Hamdi Tanpinar http://www.pdaturk.com/ 99636
Hrriyet'in Ilani Tarik Tunaya http://www.pdaturk.com/ 17377
Iphigeni Tauris'te Goethe http://www.pdaturk.com/ 13075
Irk ve Irkilik Dsncesi Alaeddin Senel http://www.pdaturk.com/ 41482
Icra ve IIlas Kanunu Serhat Yener http://www.pdaturk.com/ 62539
Idealizm, Matrix Felsefesi ve Maddenin Gerçeği Harun Yahya http://www.harunyahya.net 16781
Ikizlerin Sinavi M.Weis&T.Hickman http://www.pdaturk.com/ 73097
Ilham Veren ykler Anonim http://www.pdaturk.com/ 20048
İlköğretimde Matematik Öğretimi Yaşar Baykul http://www.meb.gov.tr/ 35500
Ilkgretimde lme Yasar Baykul http://www.meb.gov.tr/ 12858
Imparatorluk Isaac Asimov http://www.pdaturk.com/ 55362
Ince Memed Yasar Kemal http://www.kitaplik.com 86795
Insan Insana Dogan Cceloglu http://www.pdaturk.com/ 64187
Internet Hakki cal http://www.pdaturk.com/ 21715
Isa Gelecek Harun Yahya http://www.harunyahya.net 18719
Isa`nin Gelis Alametleri Harun Yahya http://www.harunyahya.net 49054
Isyan A.Altan http://www.kitap.perisi.com/ 106059
Jimi Hendrix CURTIS KNIGHT http://www.pdaturk.com/ 45513

Jules Amcam Guy De Maupassant http://www.pdaturk.com/ 28379
Kayigim Rosinha http://www.pdaturk.com/ 40253
Kemalizm Sonrasında Türk Kadını Nurer Uğurlu 14399
Kili yarasi gibi A.Altan http://www.kitap.perisi.com/ 72969
Kirmizi Isikta Yrmek Erdal Atabek http://www.pdaturk.com/ 36772
Kirmizi Kpek Louis de Bernieres http://www.pdaturk.com/ 20929
Kizil Glge Kizil Glge http://www.ankira.com/ 12773
Kiralik Konak Y.Kadri Karaosmanoglu http://www.pdaturk.com/ 54124
Kitiaranin Oglu M.Weis&T.Hickman http://www.pdaturk.com/ 25043
Knulp Herman Hesse http://www.pdaturk.com/ 21454
Konusan KaItan KALMAN MIKSZATH http://www.pdaturk.com/ 24400
Konusmalar KonIyus http://www.kelepirkitap.com 24141
Korkun Bir Gece Anton ehov http://www.pdaturk.com/ 26615
Kral, Bilge ve Soytari http://www.pdaturk.com/ 44360
Kristal Parasi R.A.Salvatore http://www.pdaturk.com/ 84332
Kur'an http://www.kelepirkitap.com 113513
Kurtulus Savasi Sirasinda... B.Georghes Gaulis http://www.pdaturk.com/ 29996
Kutsal Kitap Harun Yahya http://www.harunyahya.net 80675
Kuzularin Sessizligi http://www.kitap.perisi.com/ 45989
Kk Dnyam LatiI Erdogan http://www.pdaturk.com/ 43979
Kltrn ABC'si Bozkurt Gven http://www.pdaturk.com/ 23552
Linux Nasil HOWTOs http://www.linux.org.tr/ 25717
Luthien'in Kumari R.A.Salvatore http://www.pdaturk.com/ 75465
LtIen Beni Anla Ipek Ongun http://www.pdaturk.com/ 69162
Macbeth Shakespeare http://www.pdaturk.com/ 15304
Mektuplar Platon http://www.pdaturk.com/ 15705
Memleketin Birinde Aziz Nesin http://www.haberbilgi.com 27425
Mezeler Anonim http://www.pdaturk.com/ 26350
Miras R.A.Salvatore http://www.pdaturk.com/ 76563
Miskinler Tekkesi R.Nuri Gntekin http://www.pdaturk.com/ 48939
Mukaddes Ankara'dan Mektuplar Kadriye Hüseyin 20886
Mzik gretimi Ali Uan http://www.meb.gov.tr/ 28071
Nadja Andre Breton http://www.pdaturk.com/ 21272

Nkleer Enerji... AriI Knar http://ekitap.8m.com/ 11246
Okul ncesi Egitim Sengl Gen http://www.meb.gov.tr/ 20732
Oligarsi Vladimir Putin http://www.pdaturk.com/ 17000
grenmenin Olusumu Tlay stndag http://www.meb.gov.tr/ 25000
l Erkek Kuslar Inci Aral http://www.pdaturk.com/ 79782
P.Nikitin Ekonomi Politik http://www.pdaturk.com/ 90416
PAL SOKAGI OCUKLARI Ferenc Molnar http://www.pdaturk.com/ 38865
Peter Schemihl A.VON CHAMISSO http://www.pdaturk.com/ 16688
Pis Morugun Notlari Charles Bukowski http://www.pdaturk.com/ 38684
Pollyanna-1 ELEANOR H. PORTER http://www.pdaturk.com/ 93755
Psikolojik Danışma ve Rehberlik Hasan Tan http://www.meb.gov.tr/ 87000
Rama ile Bulusma Atrhur Clarke http://www.kitap.perisi.com/ 55053
Ramses:Isigin Oglu Christian |acq http://www.pdaturk.com/ 75271
Ramses-3 Kades Savasi Christian |acq http://www.pdaturk.com/ 71623
RavenloIt Chrstie Golden http://www.kitap.perisi.com/ 85354
RavenloIt 2 Chrstie Golden http://www.ankira.com/ 53704
Rehberlik ve Danisma Anonim http://www.pdaturk.com/ 81718
Resim Egitimi Hulusi Sezer http://www.meb.gov.tr/ 15205
Rudin Turgenyev http://www.pdaturk.com/ 35239
Sagduyu Jean Meslier http://www.agnostic.com.tr.tc 47823
Sari Odanin Esrari Gaston Leroux http://www.pdaturk.com/ 47320
Satran zerine Cabaplanca http://www.pdaturk.com/ 25031
Savasta Ne Yaptin Baba? Can Dndar http://www.pdaturk.com/ 20055
Sendika Sendika Duyuru http://www.sendika.org/ 148266
Sessiz Bir lm SIMONE DE BEAUVOIR http://www.pdaturk.com/ 20384
Sevil Berberi BEAUMARCHAIS http://www.pdaturk.com/ 15308
Sezar ve Kleopatra Bernard Shaw http://www.pdaturk.com/ 27602
Shannara`nin Kilici Terry Brooks http://www.kitap.perisi.com/ 153746
Simyaci Paolo Coelho http://www.pdaturk.com/ 28252
Siyasal Sistemler Taner Kislali http://www.1001kitap.com/ 56135
Son Ibni Sirac'in Servenleri CHATEAUBRIAND http://www.pdaturk.com/ 11192
Sorgulayan Denemeler Bertrand Russell http://www.pdaturk.com/ 52199
Sosyoloji Karisik http://www.IilozoI.tripod.com 13713
Sylev Atatrk http://turkbilim26.sitemynet.com 187435

Suun Pii Mehmet Kartal http://www.pdaturk.com/ 23548
Suda Yan Ateste Bogul Charles Bukowski http://www.pdaturk.com/ 18190
Sudaki iz http://www.kitap.perisi.com/ 46279
Seker Portakali J.Mauro de Vasconcelos http://www.pdaturk.com/ 29946
Simdiki ocuklar Harika Aziz Nesin http://www.pdaturk.com/ 31627
T.C. Anayasasi Anonim http://www.pdaturk.com/ 18871
Talat Pasa'nin Hatiralari Talat Pasa http://www.pdaturk.com/ 31941
Tanzimat-i Hayriye Devri E.Ziya Karal http://www.pdaturk.com/ 26326
Tarih Karisik http://www.IilozoI.tripod.com 32000
Tirpan Fakir Baykurt http://www.pdaturk.com/ 89225
Tom Sawyer Mark Twain http://www.pdaturk.com/ 19190
Top Oynayan Kedi Magazasi H.De Balzac http://www.pdaturk.com/ 16689
Toplum Kalkinmasi RiIat Miser http://www.pdaturk.com/ 11739
Toprak Cengiz Aytmatov http://www.pdaturk.com/ 29930
Totem ve Tabu-1 Freud http://www.pdaturk.com/ 18389
Totem ve Tabu-2 Freud http://www.pdaturk.com/ 20239
Tük.Toplumu ve Dünyanın Geleceği Alan Durning 31003
Trk Ceza Kanunu Anonim http://www.pdaturk.com/ 103683
Tyrann Isaac Asimov http://www.pdaturk.com/ 49156
Ugursuz Miras HoIImann http://www.pdaturk.com/ 23951
Unutulmus Diyarlar Unutulmus Diyarlar http://www.ankira.com/ 8125
Kisa Oyun Luigi Pirandello http://www.pdaturk.com/ 10297
yk Gogol http://www.pdaturk.com/ 18977
VakiI ve Dnya Isaac Asimov http://www.pdaturk.com/ 104069
Werther Goethe http://www.pdaturk.com/ 30578
Yaban Y.Kadri Karaosmanoğlu http://www.maximumbilgi.com 53180
Yaban rdegi Henrik Ibsen http://www.pdaturk.com/ 26827
Yahudiler Lessing http://www.pdaturk.com/ 10731
Yalniz Gezerin Dslemleri J.J.Rousseau http://www.pdaturk.com/ 24850
Yaprak Dkm R.Nuri Gntekin http://www.pdaturk.com/ 28574
Yasak Iliski Barbara Taylor http://www.pdaturk.com/ 29163
Yazlik Dns Corlo Goldini http://www.pdaturk.com/ 15917

Yemek Tarifleri Anonim http://www.pdaturk.com/ 34116
Yeraltından Notlar Dostoyevski http://www.pdaturk.com/ 30025
Yerdeniz Büyücüsü Ursula Kroeber LeGuin http://www.pdaturk.com/ 46537
Yıldızların Zamanı Alan Lightman http://www.pdaturk.com/ 27150
Yöneticinin Kılavuzu Coleman&Barrie http://www.pdaturk.com/ 27209
Yön-Kitap Sendika Duyuru http://www.sendika.org/ 14942
Yukarı Mahalle John Steinbeck http://www.maximumbilgi.com 38848
Yüksek Denetim Kurumları Ihsan Gren http://www.tcmb.gov.tr/ 37228
Yüzüklerin Efendisi 1 - Yüzük Kardeşliği J.R.Tolkien http://www.kitap.perisi.com/ 139161
Yüzüklerin Efendisi 2 - İki Kule
Yüzüklerin Efendisi 3 - Kralın Dönüşü
Yüzyıllık Yalnızlık G.Marquez http://www.pdaturk.com/ 79582
Zeytindağı Falih Rıfkı Atay http://www.pdaturk.com/ 29394


B. Turkish Alphabet
B.1 Lowercase Letters
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 a  b  c  ç  d  e  f  g  ğ  h  ı  i  j  k  l  m  n  o

19 20 21 22 23 24 25 26 27 28 29
 ö  p  r  s  ş  t  u  ü  v  y  z

Consonants: {b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z}
Vowels: {a, e, ı, i, o, ö, u, ü}

B.2 Uppercase Letters
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 A  B  C  Ç  D  E  F  G  Ğ  H  I  İ  J  K  L  M  N  O

19 20 21 22 23 24 25 26 27 28 29
 Ö  P  R  S  Ş  T  U  Ü  V  Y  Z

Consonants: {B, C, Ç, D, F, G, Ğ, H, J, K, L, M, N, P, R, S, Ş, T, V, Y, Z}
Vowels: {A, E, I, İ, O, Ö, U, Ü}
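
The tables above can be written down as a small lookup module. The following is a minimal sketch, not part of the thesis implementation; the names LOWER, UPPER, tr_lower, tr_upper and letter_number are hypothetical. It uses explicit case maps because the generic case conversion of most languages pairs "I" with "i", whereas Turkish pairs "I" with "ı" and "İ" with "i".

```python
# Turkish alphabet in dictionary order (Appendix B); position = index + 1.
LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

VOWELS = set("aeıioöuü")
CONSONANTS = set(LOWER) - VOWELS

# Explicit case maps: str.lower() would turn "I" into "i", which is wrong
# for Turkish, where "I" pairs with dotless "ı" and "İ" pairs with "i".
_TO_LOWER = dict(zip(UPPER, LOWER))
_TO_UPPER = dict(zip(LOWER, UPPER))


def tr_lower(text):
    """Lowercase text using the Turkish letter pairs; other characters pass through."""
    return "".join(_TO_LOWER.get(ch, ch) for ch in text)


def tr_upper(text):
    """Uppercase text using the Turkish letter pairs; other characters pass through."""
    return "".join(_TO_UPPER.get(ch, ch) for ch in text)


def letter_number(letter):
    """Return the 1-based position of a letter in the Turkish alphabet."""
    return LOWER.index(tr_lower(letter)) + 1
```

With these definitions, tr_lower("IŞIK") gives "ışık" and letter_number("Ö") gives 19, matching the numbering in the tables.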

C. Turkish Language Specialities

Turkish is an agglutinative language, like Finnish and Hungarian. It belongs to the
southwestern group of the Turkic family. Turkic languages are traditionally placed in
the Ural-Altaic language family. In agglutinative languages, words are formed by
combining root words and morphemes. Word structures can grow by the addition of
morphemes. Morphemes added to a stem can convert the word from a nominal to a
verbal structure or vice versa.

Turkish has a very productive morphology. A word consists of a root to which
several suffixes are attached. A very large number of words can be produced from
the same root with suffixes, so the lexicon may grow to an unmanageable size.

A popular example of Turkish word formation is:
OSMANLILAŞTIRAMAYABİLECEKLERİMİZDENMİŞSİNİZCESİNE

This can be broken down into morphemes:
OSMANLI+LAŞ+TIR+AMA+(Y)ABİL+ECEK+LER+İMİZ+DEN+MİŞ+SİNİZ+CESİNE

In this example, one word in Turkish corresponds to a full sentence in English.
It can be translated into English as "as if you were of those whom we
might consider not converting into an Ottoman".

There are 29 letters in the Turkish alphabet. Eight of them are vowels and
twenty-one are consonants (see Appendix B).

Turkish has more vowels than many other languages. Turkish vowels can be
classified along three dimensions according to their properties:

Front or back,
Rounded or unrounded,
High or low

The vowels can be partitioned in detail as follows:

Back vowels: {a, ı, o, u}
Front vowels: {e, i, ö, ü}
Front unrounded vowels: {e, i}
Front rounded vowels: {ö, ü}
Back unrounded vowels: {a, ı}
Back rounded vowels: {o, u}
High vowels: {ı, i, u, ü}
Low unrounded vowels: {a, e}
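
The partition above follows from three binary features per vowel. As an illustrative sketch (the VowelFeatures type and the select helper are hypothetical names, not taken from the thesis), each vowel can be stored once with its features and every group in the list derived by filtering:

```python
from typing import NamedTuple, Optional


class VowelFeatures(NamedTuple):
    front: bool    # front (True) or back (False)
    rounded: bool  # rounded (True) or unrounded (False)
    high: bool     # high (True) or low (False)


# One entry per Turkish vowel; the groups listed above are derived from these.
VOWEL_FEATURES = {
    "a": VowelFeatures(front=False, rounded=False, high=False),
    "e": VowelFeatures(front=True, rounded=False, high=False),
    "ı": VowelFeatures(front=False, rounded=False, high=True),
    "i": VowelFeatures(front=True, rounded=False, high=True),
    "o": VowelFeatures(front=False, rounded=True, high=False),
    "ö": VowelFeatures(front=True, rounded=True, high=False),
    "u": VowelFeatures(front=False, rounded=True, high=True),
    "ü": VowelFeatures(front=True, rounded=True, high=True),
}


def select(front: Optional[bool] = None,
           rounded: Optional[bool] = None,
           high: Optional[bool] = None) -> set:
    """Return the vowels matching every feature that is not None."""
    wanted = {"front": front, "rounded": rounded, "high": high}
    return {
        vowel
        for vowel, feats in VOWEL_FEATURES.items()
        if all(value is None or getattr(feats, name) == value
               for name, value in wanted.items())
    }
```

For example, select(front=False) yields the back vowels and select(rounded=False, high=False) yields the low unrounded vowels.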

Turkish word formation uses a number of phonetic harmony rules. When a suffix
is appended to a stem, vowels and consonants change in certain ways.

C.1 Vowel Harmony
Vowel harmony is the best-known morphophonemic process in Turkish and its
most interesting and distinctive feature. It is a left-to-right process that
operates sequentially from syllable to syllable. Vowel harmony forces
certain vowels in suffixes to agree with the last vowel of the stem or root they are
affixed to. When suffixes are affixed to a stem, their vowels change according to the
vowel harmony rules: the first vowel of the suffix changes according to the last
vowel of the stem. Vowel harmony consists of two assimilations:

1. Palatal assimilation

This is called "major vowel harmony". It is common to almost all Turkic
languages. This assimilation concerns the front/back feature of the vowels.
The back vowels are the set {a, ı, o, u} and the front vowels are the set
{e, i, ö, ü}.

If the vowel of the first morpheme in a word is back, then the vowels of the
following morphemes are back, e.g. askı + lar

"lar" is a plural suffix. "ler", the other form of the plural suffix, is not used, because the
vowels of the stem are back vowels.

If the vowel of the first morpheme in a word is front, then the vowels of the
following morphemes are front, e.g. ev + ler

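Long vowels (long a, o and u) generally occur in words of foreign origin. Such
words can take the front form of a suffix even though the vowel is written as a
back vowel.
Examples: saatler (saat + ler), with a long a
goller (gol + ler), with a long o
usuller (usul + ler), with a long u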
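
Palatal assimilation for the plural suffix can be sketched as follows. This is a minimal illustration under stated assumptions, not the method developed in the thesis: input is assumed to be lowercase, the function names are hypothetical, and the loanword examples above (saat, gol, usul) are handled through a small hand-made exception set, since their behavior cannot be derived from spelling alone.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

# Assumption: loanwords that take the front form of the suffix are simply
# listed; only the three examples from the text are included here.
FRONT_EXCEPTIONS = {"saat", "gol", "usul"}


def last_vowel(word):
    """Return the last vowel of a lowercase word, or None if it has none."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def pluralize(stem):
    """Attach "lar" or "ler" according to the last vowel of the stem."""
    if stem in FRONT_EXCEPTIONS:
        return stem + "ler"
    vowel = last_vowel(stem)
    if vowel is None:
        raise ValueError("stem has no vowel: %r" % stem)
    return stem + ("lar" if vowel in BACK_VOWELS else "ler")
```

Calling pluralize("askı") and pluralize("ev") reproduces the two examples given for back and front stems.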

2. Labial assimilation

This is called "minor vowel harmony". This assimilation concerns the
rounded/unrounded feature of the vowels.
Examples: göl + ün
usul + ün (with a long u)
topal + ın
defter + im
saat + im (with a long a)
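
Both assimilations together select one of four high vowels for a suffix. The sketch below assumes a hypothetical notation in which the capital letter "I" in a suffix template stands for the harmonizing high vowel, and a force_front flag marks loanwords such as usul and saat that take front suffixes; neither convention comes from the thesis.

```python
VOWELS = set("aeıioöuü")
FRONT = set("eiöü")
ROUNDED = set("oöuü")

# (front, rounded) of the stem's last vowel -> high vowel used in the suffix
HIGH_VOWEL = {
    (False, False): "ı",
    (True, False): "i",
    (False, True): "u",
    (True, True): "ü",
}


def high_vowel_for(stem, force_front=False):
    """Pick the suffix vowel that agrees with the last vowel of the stem."""
    for ch in reversed(stem):
        if ch in VOWELS:
            front = force_front or ch in FRONT
            return HIGH_VOWEL[(front, ch in ROUNDED)]
    raise ValueError("stem has no vowel: %r" % stem)


def attach(stem, template, force_front=False):
    """Attach a suffix template such as "In" or "Im" to a lowercase stem."""
    return stem + template.replace("I", high_vowel_for(stem, force_front))
```

For instance, attach("göl", "In") and attach("topal", "In") give the rounded and unrounded forms from the examples, while attach("saat", "Im", force_front=True) covers the loanword case.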

C.2 Consonant Harmony

Consonant harmony is another basic aspect of Turkish phonology. Turkish
consonants can be classified into two main groups: voiceless and voiced.
The voiceless consonants are {ç, f, h, k, p, s, ş, t}. The voiced
consonants are {b, c, d, g, ğ, j, l, m, n, r, v, y, z}.
Consonant harmony rules are not easy to formulate because of the irregular behavior of
borrowed and native words. Some of the consonant harmony rules in Turkish are:

If the last letter of the word is one of the voiceless consonants ("p", "ç", "t", "k"), then
it changes to the corresponding voiced consonant ("b", "c", "d", "ğ") when certain suffixes are attached.

o "p" changes to "b" ( kita(p)b + ım ).
o "d" changes to "t" ( ta(d)t + tık ), but not every "d" changes, such as
"önad", "soyad", etc.
o "k" changes to "ğ" ( aya(k)ğ + ın ).
o "ç" changes to "c" ( ağa(ç)c + ın ), but not every "ç" changes, such as
"göç", "aç + ı", etc.

If a suffix starts with "d" and the last consonant of the stem is one of
{ç, f, h, k, p, s, ş, t}, "d" is replaced with "t", e.g.
yulaftan (yulaf + dan)

If the last consonant of the stem is one of {ç, f, h, k, p, s, ş}
and the suffix begins with "c", then "c" is realized as "ç", e.g.
yaşça (yaş + ca)

If "k" is at the end of the stem and is preceded by "n", then "k" becomes
"g", e.g. çelen(k)g + e

There are some exceptions to this rule, e.g. "bank".


If the final character of the stem is "g" and the suffix begins with a vowel,
then "g" becomes "ğ" in words of foreign origin, e.g. analo(g)ğ + a

There are also some exceptions to this rule, e.g. "lig", "pedagog", etc.

If the final character of the stem is "g" and the suffix begins with a
consonant, then "g" does not become "ğ", e.g. bumerang + tan

If the final character of the stem is a vowel and the suffix begins with a
vowel, then "y" is inserted after the stem, e.g. akarsu + y + unuz

When certain suffixes are affixed, the last consonant is doubled in words of
Arabic or Persian origin, e.g. zam + m + ı

Some words of Arabic origin that end with a vowel are exceptions to the
general rule, e.g. camii alongside camisi

Many words have this property, e.g. "mevki", "cami",
"terfi", "zayi", "ikna", "merci", etc.
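
Several of the consonant rules in this section can be combined into one joining function. The following is a simplified sketch under explicit assumptions, not the thesis algorithm: the function name join_stem_suffix and the softens flag are hypothetical, the flag stands in for the lexical knowledge of which stems soften (for example "göç" does not), the full voiceless set is used for both the "d" and the "c" rule, and the rules for "g", consonant doubling and the Arabic-origin exceptions are left out.

```python
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")

# Stem-final voiceless consonant -> voiced counterpart before a vowel.
SOFTEN = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}
# Suffix-initial voiced consonant -> voiceless counterpart after a voiceless stem.
HARDEN = {"d": "t", "c": "ç"}


def join_stem_suffix(stem, suffix, softens=True):
    """Join a lowercase stem and suffix, applying a subset of the rules above.

    softens is an assumption of this sketch: the caller states whether the
    stem is one whose final consonant changes, since this is lexical.
    """
    if suffix[0] in VOWELS:
        if stem[-1] in VOWELS:
            # vowel + vowel: insert the buffer consonant "y"
            return stem + "y" + suffix
        if softens and stem[-1] in SOFTEN:
            # "k" preceded by "n" becomes "g" rather than "ğ"
            voiced = "g" if stem.endswith("nk") else SOFTEN[stem[-1]]
            return stem[:-1] + voiced + suffix
        return stem + suffix
    if stem[-1] in VOICELESS and suffix[0] in HARDEN:
        return stem + HARDEN[suffix[0]] + suffix[1:]
    return stem + suffix
```

As a usage example, join_stem_suffix("ayak", "ın") produces "ayağın" and join_stem_suffix("yulaf", "dan") produces "yulaftan", while join_stem_suffix("göç", "ü", softens=False) leaves the stem unchanged.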

C.3 Root Deformations
Turkish roots normally do not change. There are, however, some exceptional
cases in which the root is deformed:

A root change is observed in personal pronouns.
Examples: ben → bana
sen → sana

The wide vowel at the end of the stem is narrowed when the suffix "yor" comes
after verbs ending with "a" or "e", e.g. kapıyor (kapa + ı + yor)


When a suffix beginning with a vowel comes after certain nouns that have
a vowel from {ı, i} in their last syllable, this vowel drops. This generally occurs
in nouns designating parts of the human body, e.g. ağzımız (ağız + ı + mız)

When the passive suffix "ıl" / "il" is affixed to certain verbs whose last
vowel is "ı" or "i", this vowel drops, e.g. ayrıl (ayır + ıl)

If a plural suffix is affixed to a compound word, the plural suffix comes
before the possessive suffix at the end of the stem.
Example: gözyaşı + lar → gözyaşları (not gözyaşılar)
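
The root deformations in this section can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than the thesis implementation: the function names are hypothetical, the stems that drop a vowel are given as a hand-made set containing only the examples from the text, and the compound is assumed to end in a single-vowel possessive suffix.

```python
VOWELS = set("aeıioöuü")
BACK = set("aıou")
ROUNDED = set("oöuü")

# Irregular dative forms of the personal pronouns.
DATIVE_PRONOUNS = {"ben": "bana", "sen": "sana"}
# Assumption: stems that lose their last vowel are listed, not predicted.
DROPPING_STEMS = {"ağız", "ayır"}


def high_vowel(reference):
    """Return the high vowel that harmonizes with the reference vowel."""
    table = {(True, False): "ı", (False, False): "i",
             (True, True): "u", (False, True): "ü"}
    return table[(reference in BACK, reference in ROUNDED)]


def narrow_before_yor(stem):
    """kapa -> kapıyor: the final "a" or "e" narrows before "yor"."""
    if stem[-1] not in "ae":
        raise ValueError("only stems ending in a or e are handled")
    base = stem[:-1]
    # harmonize with the preceding vowel, or with the dropped one if none
    ref = next((ch for ch in reversed(base) if ch in VOWELS), stem[-1])
    return base + high_vowel(ref) + "yor"


def attach_with_vowel_drop(stem, suffix):
    """ağız + ımız -> ağzımız: drop the last stem vowel for listed stems."""
    if stem in DROPPING_STEMS and suffix[0] in VOWELS:
        pos = max(i for i, ch in enumerate(stem) if ch in VOWELS)
        stem = stem[:pos] + stem[pos + 1:]
    return stem + suffix


def pluralize_compound(compound):
    """gözyaşı -> gözyaşları: the plural goes before the possessive vowel."""
    base = compound[:-1]  # strip the single-vowel possessive suffix
    ref = next(ch for ch in reversed(base) if ch in VOWELS)
    return base + ("lar" + "ı" if ref in BACK else "ler" + "i")
```

For example, narrow_before_yor("kapa") returns "kapıyor" and pluralize_compound("gözyaşı") returns "gözyaşları".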
