Dokuz Eylül University, Graduate School of Natural and Applied Sciences. Development of a Method to Determine Root and Suffixes for Turkish Words. İzmir.
DEVELOPMENT OF A METHOD TO DETERMINE ROOT AND SUFFIXES FOR TURKISH WORDS TO GENERATE LARGE SCALE TURKISH CORPUS
by Özlem VARLIKLAR
July, 2005 İZMİR
A Thesis Submitted to the Graduate School of Natural and Applied Sciences of Dokuz Eylül University In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering, Computer Engineering Program
M.Sc THESIS EXAMINATION RESULT FORM
We have read the thesis entitled "DEVELOPMENT OF A METHOD TO DETERMINE ROOT AND SUFFIXES FOR TURKISH WORDS TO GENERATE LARGE SCALE TURKISH CORPUS" completed by ÖZLEM VARLIKLAR under supervision of ASSOCIATE PROFESSOR DR. YALÇIN ÇEBİ and we certify that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.
Assoc. Prof. Dr. Yalçın ÇEBİ
Supervisor
Prof. Dr. R. Alp KUT (Jury Member)
Asst. Prof. Dr. Zafer DİCLE (Jury Member)
Prof. Dr. Cahit HELVACI
Director
Graduate School of Natural and Applied Sciences
ACKNOWLEDGEMENTS
I would like to thank my advisor, Assoc. Prof. Dr. Yalçın ÇEBİ, for offering me the opportunity to study Natural Language Processing and for his advice, support and help in completing my thesis.
I would also like to thank Asst. Prof. Dr. Gökhan DALKILIÇ, who shared his ideas about Natural Language Processing during the writing and development phases of the thesis, and my colleague Ufuk DEMİR, who encouraged me during the writing of the thesis.
I give special thanks to my parents and my fiancé Cenk AKTAŞ for their support, patience and encouragement.
Özlem VARLIKLAR
DEVELOPMENT OF A METHOD TO DETERMINE ROOT AND SUFFIXES FOR TURKISH WORDS TO GENERATE LARGE SCALE TURKISH CORPUS
ABSTRACT
To determine the morphological properties of a language, a corpus that represents the language must be generated. With a large scale Turkish corpus that covers all properties of the language, statistical properties of Turkish that depend on words can also be investigated. This study explains how a large scale, comprehensive, understandable and easy-to-use Turkish corpus should be generated, determines an appropriate method for generating it, and determines an efficient method for finding the stems, roots and suffixes of the words that form the corpus. To generate the large scale Turkish corpus, texts totalling almost 130 million words were collected from the Internet: newspapers, novels and stories, and film subtitles written in Turkish. Lists of stems, roots, abbreviations and suffixes for Turkish were obtained. The abbreviation list and the rules generated for sentence boundary detection were stored in XML files, which gave successful results in sentence boundary detection. After this step, sentences were split into words, and the types of the words were found to help find the correct root of each word. The stems of the words were determined using the lists of stems and inflectional suffixes. The roots and derivational suffixes of the stems found were then determined using the lists of roots and derivational suffixes. All results, including paragraphs, sentences, words, roots and suffixes, were stored in an XML structure specialized for NLP applications to make such applications easier. The only drawback of the XML structure is that it needs too much disk space. To avoid being affected by this drawback, all XML files were loaded into the computer's memory at the beginning of the corpus generation process. This made the steps of generating the large scale Turkish corpus very fast and effective.
Keywords: Natural language processing, corpus, large scale Turkish corpus, morphological analysis, determining stem and root.
BÜYÜK ÖLÇEKLİ TÜRKÇE DERLEM OLUŞTURMAK İÇİN TÜRKÇE KELİMELERİN KÖK VE EKLERİNİ BELİRLEMEK İÇİN YÖNTEM GELİŞTİRME
ÖZ
To determine the morphological properties of languages, a corpus that can represent the properties of the language is required. This corpus must be large enough to allow analysis techniques to be applied to it easily. No large scale corpus study capable of covering all properties of Turkish has been carried out before. With such a corpus, the statistical properties of Turkish that depend on words can also be examined. The aim of this study is to determine the most appropriate method for developing a large scale, comprehensive, understandable and easy-to-use Turkish corpus, and to develop an efficient method for determining the stems, roots and suffixes of the words used to build this corpus. To build the large scale Turkish corpus, written Turkish text of approximately 130 million words, consisting of some newspapers, novels and stories, and the subtitles of Turkish films, was obtained over the Internet. Lists of Turkish stems, roots, suffixes and abbreviations were obtained. The abbreviation list and the sentence boundary rule list were created in XML. The application developed for sentence boundary detection used the abbreviation and rule lists and produced successful results. The detected sentences were split into words; after the stems of the words were found using the root and inflectional suffix lists, their roots were determined using the root list and the derivational suffix list. All results were stored as paragraphs, sentences, words, roots and suffixes in an XML structure specialized for Natural Language Processing (NLP) applications. The only known disadvantage of the XML structure is the large file size. For this reason, all XML files are loaded into memory before processing begins. This made it possible to carry out the corpus generation steps very quickly and effectively.
Keywords: Natural language processing, corpus, large scale Turkish corpus, morphological analysis, stem and root determination.
CONTENTS
THESIS EXAMINATION RESULT FORM
ABSTRACT
ÖZ
LIST OF TABLES
LIST OF FIGURES

CHAPTER ONE - INTRODUCTION

CHAPTER TWO - CORPUS AND LARGE SCALE CORPUS
2.1 Corpus
2.1.1 English Corpora
2.1.1.1 Brown Corpus
2.1.1.2 British National Corpus (BNC)
2.1.1.3 The Bank of English
2.1.1.4 English Gigaword
2.1.1.5 American National Corpus
2.1.2 Turkish Corpora
2.1.2.1 Koltuksuz Corpus
2.1.2.2 YTÜ Corpus
2.1.2.3 Dalkılıç Corpus
2.1.2.4 METU Turkish Corpus
2.1.2.5 TurCo Turkish Corpus
2.1.3 Corpora of Other Languages
2.1.3.1 The Czech National Corpus (CNC)
2.1.3.2 Croatian National Corpus
2.1.3.3 PAROLE
2.1.3.4 French Corpus
2.1.3.5 COSMAS (Corpus Search Management Analysis System)
2.2 Large Scale Corpus

CHAPTER THREE - PREVIOUS WORKS
3.1 Morphological Parsing in Other Languages
3.2 Stem and Root Finding Algorithms for Turkish
3.2.1 AF Algorithm
3.2.2 LM Algorithm
3.2.3 Identified Maximum Match (IMM) Algorithm
3.2.4 Solak and Oflazer's Approach
3.2.5 Root Reaching Method without Dictionary
3.2.6 Extended Finite State Approach
3.2.7 FindStem Algorithm
3.3 Sentence Boundary Detection

CHAPTER FOUR - PROPOSED SYSTEM
4.1 Sentence Boundary Detection
4.2 Examination of Type of Words
4.3 Description of the Method for Finding Roots
4.3.1 Finding Stems and Inflectional Suffixes
4.3.2 Finding Roots and Derivational Suffixes
4.4 Generate Large Scale Turkish Corpus
4.4.1 Data in the Corpus
4.4.2 Definition of the Corpus Structure

CHAPTER FIVE - CONCLUSION

REFERENCES
ABBREVIATIONS
APPENDICES
A. The List of the Novels & Stories in Corpus
B. Turkish Alphabet
B.1 Lowercase Letters
B.2 Uppercase Letters
C. Turkish Language Specialities
C.1 Vowel Harmony
C.2 Consonant Harmony
C.3 Root Deformations
LIST OF TABLES
Table 2.1 NOW, files' size and distribution in TurCo
Table 2.2 NOW, NODW and DWUR in TurCo
Table 3.1 Example of flags
Table 3.2 The affix-verbs in Turkish
Table 4.1 The meanings of the characters in the sentence boundary rule list
Table 4.2 The number of stems in the lists
Table 4.3 The number of roots in the noun and verb lists
Table 4.4 The number of files, NOW, files' sizes, and distribution (%) of data
LIST OF FIGURES
Figure 2.1 An unannotated example of a raw BNC text
Figure 3.1 Finite state machine of Table 3.2
Figure 3.2 The main finite state machine
Figure 3.3 Links and inflectional groups
Figure 3.4 Dependency links in an example Turkish sentence
Figure 4.1 Block diagram of algorithm for generating corpus
Figure 4.2 The rule list for sentence boundary detection
Figure 4.3 Example of abbreviation list in XML file
Figure 4.4 Example of sentences in XML file
Figure 4.5 List of noun stems in Turkish
Figure 4.6 List of adjective stems in Turkish
Figure 4.7 (a) Sample of noun roots. (b) Sample of verb roots
Figure 4.8 Sample of stems and roots in "from noun to noun" suffixes list
Figure 4.9 Sample of stems and roots in "from verb to noun" suffixes list
Figure 4.10 Block diagram of algorithm for finding root step in generating corpus
Figure 4.11 Schema definition for the mapping XML file
Figure 4.12 Sample configuration of XML file
Figure 4.13 Sample schema definition for input files
Figure 4.14 Sample valid XML file for processing
CHAPTER ONE INTRODUCTION
"Natural language" is the language naturally used by humans. "Natural Language Processing" (NLP) is a research area that serves many different purposes and is steadily becoming more popular. In this area, computers are used to process natural language, both in academic research and for commercial purposes.
NLP can be defined as the construction of a computing system that processes and understands natural language. The word "understand" in this definition can be clarified as follows: the observable behavior of the system must make us assume that it is doing internally the same, or very similar, things that we do when we understand language (Güngördü, 1993).
The structure determination process covers two main topics: Morphological Analysis and Statistical Analysis:
Morphological analysis is the investigation of the morphological status of words, such as determining word types (verb, noun, adjective, etc.) and analyzing the parts of words (root, suffix or prefix).
Statistical analysis can be done in two ways: on letters and on words. Analyses applied to letters, such as consonant and vowel placement, letter n-gram frequencies, and relationships between letters (for example, their positions relative to each other), are called Letter Analysis. Analyses applied to words, such as the number of letters in a word, the order of the letters in a word, word n-gram frequencies, and word order in a sentence, are called Word Analysis.
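The two kinds of n-gram frequency mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis; the function names `letter_ngrams` and `word_ngrams` and the toy sample text are assumptions made only for illustration.

```python
from collections import Counter


def letter_ngrams(text, n):
    """Count letter n-grams inside each word; no n-gram spans a space."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def word_ngrams(text, n):
    """Count sequences of n consecutive words."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Toy input, assumed to be already lowercased and stripped of punctuation.
# (Python's str.lower() is not Turkish-aware: it maps "I" to "i", not "ı".)
sample = "ali okula gitti ali eve gitti"

letter_bigrams = letter_ngrams(sample, 2)   # Letter Analysis
word_bigrams = word_ngrams(sample, 2)       # Word Analysis

print(letter_bigrams.most_common(3))
print(word_bigrams.most_common(3))
```

Counting letter n-grams per word, rather than over the raw character stream, is one possible design choice; a study that treats the space as a symbol would count over the whole string instead.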
Morphological classification is made according to the word structures of natural languages. Turkish is classified as an "agglutinative language", like Finnish, Hungarian, Quechua and Swahili, in which new words are formed by adding suffixes to the end of roots (see Appendix C). In Turkish, grammatical rules govern which suffixes may follow which others and in what order. Through this concatenation the meaning of a word is changed or extended. Suffix concatenation can result in relatively long words, which are frequently equivalent to a whole sentence in English (e.g. Osmanlılaştıramadıklarımızdansınız).
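As a small illustration of suffix concatenation, the example word can be rebuilt from a hand-written list of morphemes. The segmentation and the short glosses below are an assumption made for this sketch, not output of the method developed in this thesis, and the base "Osmanlı" is left unanalyzed.

```python
# Hand-written segmentation, for illustration only (an assumption, not
# the output of any analyzer). Each entry is (surface form, rough gloss).
MORPHEMES = [
    ("Osmanlı", "base word: Ottoman"),
    ("laş", "become"),
    ("tır", "causative"),
    ("ama", "not able to"),
    ("dık", "participle"),
    ("lar", "plural"),
    ("ımız", "our (1st person plural possessive)"),
    ("dan", "ablative: from, among"),
    ("sınız", "you are (2nd person plural)"),
]


def concatenate(morphemes):
    """Join the surface forms in order, as agglutination does."""
    return "".join(surface for surface, _gloss in morphemes)


word = concatenate(MORPHEMES)
print(word, len(MORPHEMES) - 1, "suffixes")
```

A root and suffix finder has to solve the inverse problem: given only the surface word, recover such a list, which is why ordered suffix lists are needed.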
NLP is used in:
Speech synthesis: although this may not at first sight appear very "intelligent", the synthesis of natural-sounding speech is technically complex and almost certainly requires some "understanding" of what is being spoken to ensure, for example, correct intonation.
Speech recognition: basically the reduction of continuous sound waves to discrete words.
Natural language understanding: here treated as moving from isolated words (either written or determined via speech recognition) to "meaning". This may involve complete model systems or "front-ends", driving other programs by NL commands.
Natural language generation: generating appropriate NL responses to unpredictable inputs.
Machine translation (MT): translating one NL into another (Coxhead, 2002).
Database applications: the familiarity and flexibility of natural language help the user while accessing a database. NLP is also used in expert systems for explanation generation, drawing on knowledge of the syntax and semantics of a fragment of natural language.
Spelling correction: spelling correctors are word-based, but there have recently been many studies on syntax-based spelling correctors. A word-based spelling corrector for Turkish has been developed by Solak.
For these NLP applications, a corpus is generated and used. Detailed information about corpora is given in the following chapters.
Nowadays, as noted above, a large scale corpus is needed for every language in order to apply analyses to the language and obtain reliable results about its properties. For Turkish, too, generating a large scale corpus is very important. To generate such a corpus, the stem, root and suffixes of each word must be determined correctly.
The main goal of this study is to generate a large scale Turkish corpus and, while doing so, to develop an appropriate method that finds the roots and suffixes of Turkish words efficiently. All steps in generating the corpus were examined and implemented. In addition, the general concepts of corpora and previous work on generating corpora and on determining the stems, roots and suffixes of words are presented.
This thesis is divided into 5 chapters. Chapter 1 introduces the thesis and explains briefly why it was written. Chapter 2 defines the term corpus and describes some corpora prepared in English, Turkish and other languages. Chapter 3 briefly explains previous work on the morphological analysis of Turkish; Turkish stemming and root determination algorithms and the work on sentence boundary detection in the literature are introduced with their main properties. Chapter 4 gives a detailed explanation of all the steps of the proposed system for generating a large scale Turkish corpus. Finally, the last chapter presents the conclusion.
CHAPTER TWO CORPUS AND LARGE SCALE CORPUS
"Corpus is a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language." (Crystal, 1991).
"A collection of naturally occurring language text, chosen to characterize a state or variety of a language." (Sinclair, 1991).
A corpus can be defined as a special database that is created from texts, is used in the Natural Language Processing area, and supports specialized operations such as finding and separating words quickly.
An ideal corpus is large and representative of the language. However, there is a trade-off between quality (representativeness) and quantity (size). A representative corpus contains samples of all of the language. A large corpus contains a very large amount of data and can also be used in NLP. Corpora can also be divided into two types: "balanced" and "unbalanced". A large corpus is "unbalanced". A corpus can be balanced by taking samples from all the different areas of a language, such as technical vocabulary, medicine and spoken language, which makes the corpus "representative" of the language. However, it is very difficult to take equal, small samples from different areas into a corpus. Instead, an unbalanced corpus may be generated and may serve better, because it will contain many words from every area of the language. For letter analysis, small corpora are sufficient (Dalkılıç, 2001), but for word analysis large corpora are needed. For some rare words, an unbalanced corpus can be more powerful than a balanced one.
Some general analyses can be applied to a corpus, such as n-gram analysis, Number of Different Words (NODW) and Different Word Usage Ratio (DWUR), which also give the general characteristics of the corpus.
N-gram analysis is one of the common statistical methods carried out on a corpus. By using n-grams, language model probabilities can be estimated and used in speech recognition systems (Nadas, 1984). N-gram analysis can be used in correcting words by detecting non-words. It can also be combined with pattern matching so that strings that do not appear in a given word list can be detected. It is also useful for OCR (Optical Character Recognition) (Kukich, 1992) and can be used for data compression and encryption. In addition, missing words in a given text can be estimated by calculating word n-grams.
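How word n-grams can be used to estimate a missing word may be sketched with a maximum-likelihood bigram model. This is a generic textbook technique, shown under the assumption of a tiny hand-made training text; the function names and the three toy sentences are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def train_bigrams(sentences):
    """Collect counts needed for P(next | previous) = c(prev, next) / c(prev)."""
    previous_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        previous_counts.update(words[:-1])          # words that have a successor
        bigram_counts.update(zip(words, words[1:]))
    return previous_counts, bigram_counts


def next_word_probability(previous_counts, bigram_counts, previous, candidate):
    """Maximum-likelihood estimate; 0.0 if `previous` was never seen."""
    if previous_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, candidate)] / previous_counts[previous]


def guess_missing_word(bigram_counts, previous):
    """Return the most frequent successor of `previous`, or None."""
    followers = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == previous}
    return max(followers, key=followers.get) if followers else None


# Toy training text (an assumption for illustration, not corpus data).
toy_sentences = ["ali okula gitti", "ayşe okula gitti", "ali okula geldi"]
prev_counts, bi_counts = train_bigrams(toy_sentences)

p = next_word_probability(prev_counts, bi_counts, "okula", "gitti")
guess = guess_missing_word(bi_counts, "okula")
print(p, guess)
```

With real data, unseen word pairs receive probability zero under this estimate, which is one reason large corpora matter for word-level analysis.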
2.1 Corpus
Many corpora have been created for different languages. Some of them are representative, and some are large (Church & Mercer, 1993). Using a corpus, different analyses can be carried out, such as different word usage statistics and n-gram analysis for letters (Shannon, 1951) and words (Jurafsky & Martin, 2000). Processes such as character recognition, cryptanalysis and spelling correction (Church & Gale, 1991) can be carried out using a corpus in NLP applications.
Some examples of corpora in different languages are given in the following sections.
2.1.1 English Corpora
2.1.1.1 Brown Corpus
This corpus was first assembled in 1963-1964 at Brown University. In 1964, it had 1 million words with 61,805 different words; in a later edition in 1992, the new Brown corpus had 583 million words with 293,181 different words (Jurafsky & Martin, 2000).
2.1.1.2 British National Corpus (BNC)
The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. However, non-British English and foreign language words do occur in the corpus (BNC: What is the BNC, (n.d.)). 90% of the BNC is written text, including extracts from newspapers, journals, academic books, and school and university essays; the 10% spoken part includes a large amount of unscripted informal conversation. The BNC is a project of Oxford University Press together with some other members. It was completed in 1994 and released in February 1995 (British National Corpus (BNC), (n.d.)). An unannotated example of a raw BNC text is shown in the following figure.
Figure 2.1 An unannotated example of a raw BNC text.
<bncDoc id=BDFX8 n=093802>
<header type=text creator='natcorp' status=new update=1994-07-13>
  <fileDesc>
    <titStmt>
      <title> General Practitioners Surgery -- an electronic transcription </title>
      <respStmt>
        <resp> Data capture and transcription </resp>
        <name> Longman ELT </name>
      </respStmt>
    </titStmt>
    <ednStmt n=1> Automatically-generated header </ednStmt>
    <extent kb=7 words=128> </extent>
    <pubStmt>
      <respStmt>
        <resp> Archive site </resp>
        <name> Oxford University Computing Services </name>
      </respStmt>
      <address> 13 Banbury Road, Oxford OX2 6NN U.K. ... Internet mail: natcorp@ox.ac.uk </address>
      <idno type=bnc n=093802> 093802 </idno>
      <avail region=world status=unknown> Exact conditions of use not currently known to the archiving agency. ... Distribution of any part of the corpus must include a copy of the corpus header. </avail>
      <date value=1994-07-13> 1994-07-13 </date>
    </pubStmt>
    <srcDesc>
      <recStmt> <rec type=DAT> </rec> </recStmt>
    </srcDesc>
  </fileDesc>
  <profDesc>
    <creation date='?'> Origination/creation date not known </creation>
    <partics>
      <person age=X educ=0 flang=EN-GBR id=PS22T n=W0001 sex=m soc=AB> ... </person>
      <person id=FX8PS000 n=W0000> ... </person>
      <person id=FX8PS001 n=W0002> ... </person>
    </partics>
    ...
    ...
</bncDoc>
2.1.1.3 The Bank of English
The Bank of English is a collection of samples of modern English language held on computer for analysis of words, meanings, grammar and usage. In linguistics and lexicography such a collection is termed a corpus (The Bank of English - Terms & Conditions, (n.d.)).
The Bank of English was launched in 1991 by COBUILD (a division of HarperCollins Publishers) and The University of Birmingham. Since 1980 COBUILD, which is based within the School of English at Birmingham University, has been collecting a corpus of texts on computer for dictionary compilation and language study. In 1991 HarperCollins decided on a major initiative to increase the scale of the corpus to 200 million words, to form the basic data resource for a new generation of authoritative language reference publications.
It had 450 million words with over half a million different words in January 2002, and it continues to grow with the constant addition of new material. It contains both speech and writing. The written part contains books, newspapers, magazines, letters, etc., and the spoken part includes speech from BBC World Service radio broadcasts and American National Public Radio, meetings, conversations, etc. The data are collected either from electronic sources or by scanning books. The collection of text was started in 1980 (The Bank of English, (n.d.)).
2.1.1.4 English Gigaword
It is an English corpus of 1,756,504,000 words and 4,111,240 documents, produced by the Linguistic Data Consortium. It includes data from the Agence France Press English Service, the Associated Press Worldstream English Service, The New York Times Newswire Service and the Xinhua News Agency English Service. It is sold for $2,500 (English Gigaword, (n.d.)).
2.1.1.5 American National Corpus
The ANC aims to contain a core corpus of at least 100 million words, including both written and spoken (transcript) data, comparable across genres to the BNC. The genres in the ANC are expanded to include "new" types of language data that have become available in recent years, such as web blogs and web pages, chats, email, and rap music lyrics. In addition to the core 100 million words, the ANC will include an additional component of potentially several hundred million words, chosen to provide both the broadest and largest selection of data possible. In fall 2003, the ANC produced its First Release of over 11 million words of American English (The ANCProject, (n.d.)).
2.1.2 Turkish Corpora
Some Turkish corpora are listed below:
Koltuksuz Corpus
YTÜ Corpus
Dalkılıç Corpus
METU Turkish Corpus
TurCo Turkish Corpus
There are also other corpora for Turkish, such as "2.2M Words" (Güngör, 1995).
2.1.2.1 Koltuksuz Corpus
It is one of the Turkish corpora used for letter statistics and for finding some of the characteristics of the Turkish language. It has 6,095,457 characters and is formed of 24 novels and stories by 22 different authors. These novels and stories were entered into digital form from books by a data entry group (Koltuksuz, 1995).
2.1.2.2 YTÜ Corpus
This corpus was created for a morphology-based data compression study. It has 4,263,847 characters from 14 different documents: 3 novels, 1 PhD thesis, 1 transcription and 9 articles (Diri, 2000).
2.1.2.3 Dalkılıç Corpus
Dalkılıç Corpus: It was created for letter statistics and to define the characteristics of the Turkish language, like the Koltuksuz corpus. It has 1,473,738 characters from the Hurriyet newspaper web archive (01/01/1998 - 06/01/1998 main page and 01/01/1998 - 06/30/1998 authors) (Dalkılıç, 2001).
Dalkılıç Corpus: It is the combination of some of the previous Turkish corpora (the Koltuksuz, YTÜ and Dalkılıç corpora), with a size of 11,749,977 characters (Dalkılıç & Dalkılıç, 2001).
2.1.2.4 METU Turkish Corpus
It is a collection of 2 million words of post-1990 written Turkish samples (METU Corpus, (n.d.)).
2.1.2.5 TurCo Turkish Corpus
TurCo is the first known corpus created for word statistics. It has a size of 362.449 MB and contains 50,111,828 words. TurCo is nearly half the size of the British National Corpus, but not as big as English Gigaword.
TurCo consists of text data taken from 11 different websites, and of novels and stories in Turkish by more than 100 authors. Most of TurCo (98.11%) was collected from websites; 1.89% of the corpus is novels and stories.
To make TurCo larger and include more words, it is not balanced, and each document in the corpus has a different size, as shown in Table 2.1.
Table 2.1 NOW, files' size and distribution in TurCo

Site #  Web Sites                   NOW         Corpora Files' Sizes¹ (MB)  Distribution (%)
1       www.tbmm.gov.tr             23,396,817  170.747                     46.69
2       www.stargazete.com.tr       9,746,093   69.103                      19.45
3       www.hurriyet.com.tr         9,415,716   69.140                      18.79
4       Turkish novels and stories  4,668,306   33.571                      1.89
5       www.die.gov.tr              948,116     6.387                       9.32
6       www.arabul.com              753,571     4.994                       1.50
7       www.pcmagazine.com.tr       527,757     3.722                       1.05
8       www.bilimteknoloji.com.tr   203,620     1.450                       0.41
9       www.abgs.gov.tr²            160,562     1.249                       0.32
10      www.lazland.com             135,519     0.954                       0.27
11      www.yeniasir.com.tr         96,857      0.707                       0.19
12      www.pankitap.com            58,894      0.425                       0.12
        TOTAL                       50,111,828  362.449                     100.00
In TurCo, the Number of Words (NOW), Number of Different Words (NODW) and Different Word Usage Ratio (DWUR) are calculated and shown in Table 2.2. The sum of the NODW values over all sites is 1,235,056, but some words are repeated across different sites. After these repeated words are identified and NODW is recalculated over the whole of TurCo, the NODW of TurCo is 686,804. According to this result, the DWUR of TurCo is 1.37%.
¹ Includes only Turkish alphabet and space character
² Turkey's National Program for the European Union
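The DWUR figure reported for TurCo can be reproduced with a few lines of Python. The formula used here, NODW divided by NOW and expressed as a percentage, is an assumption that is consistent with the reported numbers (686,804 different words in 50,111,828 words); the helper names and the toy sentence are hypothetical.

```python
def different_word_usage_ratio(now, nodw):
    """DWUR as a percentage: different words per 100 running words.

    Assumed definition: 100 * NODW / NOW.
    """
    if now <= 0:
        raise ValueError("NOW must be positive")
    return 100.0 * nodw / now


def corpus_stats(text):
    """Return (NOW, NODW, DWUR) for whitespace-separated text."""
    words = text.split()
    now = len(words)            # Number of Words
    nodw = len(set(words))      # Number of Different Words
    return now, nodw, different_word_usage_ratio(now, nodw)


# Figures reported for TurCo in the text above.
turco_dwur = different_word_usage_ratio(50_111_828, 686_804)
print(round(turco_dwur, 2))

# Toy text (an assumption for illustration, not corpus data).
toy_now, toy_nodw, toy_dwur = corpus_stats("ali okula gitti ali eve gitti")
print(toy_now, toy_nodw, round(toy_dwur, 2))
```

Because a word that occurs on several sites must be counted only once, the corpus-wide NODW has to be computed over the union of all word sets rather than by summing the per-site values.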
2.1.3 Corpora of Other Languages
2.1.3.1 The Czech National Corpus (CNC)
The Czech National Corpus (CNC) is a non-commercial, academic project focused on building up a large computer-based corpus, containing mainly written Czech.
The idea of the CNC was first mentioned in 1991 in a statement of intent signed by 8 signatories, representatives of the following institutions: Faculty of Philosophy, Charles University; Faculty of Mathematics and Physics, Charles University; Masaryk University; Palacký University; and the Institute of Czech Language, Academy of Sciences (Klimova, 1996, CZECH NATIONAL CORPUS (CNC)).
It has synchronic and diachronic parts. Some parts of the synchronic component are: databases and dictionaries (electronic databases and dictionaries), SYN2000 (a balanced, representative corpus of contemporary written Czech containing about 100 million words) and ORAL (spoken Czech). Some parts of the diachronic component are: the bank of diachronic Czech (2,000,000 words of transcribed texts; 100,000 words of transliterated texts; 200,000 words of dialect texts) (The Czech National Corpus (CNC), (n.d.)).
2.1.3.2 Croatian National Corpus
It has 30 million words and 9,156,446 tokens as of February 24th, 2003. It includes older and contemporary texts. It is available through the Croatian Academic Research Network (Croatian National Corpus, (n.d.)).
2.1.3.3 PAROLE
It is a multilingual corpus. The languages involved in the PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish. It has 20,000 entries per language. All the texts date from after 1970.
2.1.3.4 French Corpus
It has 20,093,099 words from books (3,267,409 words from CD-ROM), newspapers (13,856,763 words from the Le Monde newspaper), periodicals (942,963 words from HERMES and CNRS-Infos), etc. (2,025,964 words) (French Corpus, (n.d.)).
COSMAS is a German corpus with 1,903,000,000 running words. 'Running' means that new words are added each day. Only 1,181 million words are available to the public because of copyright restrictions. It is a product of the Institut für Deutsche Sprache, Mannheim (COSMAS, German Corpus, (n.d.)).
2.2 Large Scale Corpus
Having a large and representative corpus is very important for a language. Large corpora are needed for word analysis, but this does not mean that the corpus has to be unbalanced. A corpus that combines large size with representativeness (balanced and unbalanced parts) is needed so that it can be used effectively in research areas such as speech recognition and spell checking.
There is no Turkish corpus big enough to make efficient analysis on it, so a large scale Turkish corpus must be created. If there is such a corpus, statistical properties of the Turkish language that depend on words can be investigated easily and can be used in such areas.
CHAPTER THREE PREVIOUS WORKS
In natural language processing, different methods have been developed and implemented to make morphological analysis more efficient. The main part of the analysis is finding the 'correct' root or stem of a word, where 'correct' means the meaning the user intends. There are many ambiguities in Turkish (e.g., yaz as a noun vs. yaz as a verb). Because of such difficulties, researchers try to find the root nearest to the real one instead of the correct root.
Morphological parsing algorithms may be divided into two classes: affix stripping and root-driven analysis methods (Solak and Oflazer, 1993). Both methods have been used in the history of morphological parsing.
Both of these classes have advantages and disadvantages. In the root-driven approach, the stem of the word must first be found in a lexicon before the morphological analysis starts. The most popular morphological analyzers, such as PC-KIMMO (Antworth, 1990) and AMPLE (Weber, 1988), use the root-driven approach and confirm the method's success with their customized versions for different languages. Root-driven methods are also widely used in the studies done for Turkish. The major disadvantage of this approach is the cost of the search required to find the stem: examining each subpart of the word is very time consuming, especially for languages in which words can appear in very long forms. For other agglutinative languages, however, some affix stripping methods have been developed and successful results were achieved. In the affix stripping approach, the search is relatively fast because it is done only for affixes.
3.1 Morphological Parsing in Other Languages
For ancient Greek, Packard's parser proceeds by stripping affixes off the word and then attempting to look up the remainder in a lexicon. The parse is deemed a success only if there is an entry in the lexicon that matches the remainder and is compatible with the stripped-off affixes.
Brodda and Karlsson apply a similar method to the analysis of Finnish, an agglutinative language, but without any lexicon of roots. Suffixes are stripped off from the end of the word until no more can be removed, and what is left is assumed to be the root.
Sagvall devised a morphological analyzer for Russian which first looks in a lexicon for a root matching an initial substring of the word. It then uses grammatical information stored in the lexical entry to determine what possible suffixes may follow.
Three approaches to morphological parsing of agglutinative languages were developed independently in the early 1980s: for Quechua (R. Kasper, 1982), for Finnish (Koskenniemi, 1983) and for Turkish (Hankamer, 1984). The three approaches are essentially identical. They all proceed from left to right, like Sagvall's parser. Roots that match initial substrings of the word are sought in the lexicon, and the grammatical category of the root determines what class of suffixes may follow. When a suffix in the permitted class is found to match a further substring of the word, grammatical information in the lexical entry for that suffix determines once again what class of suffixes may follow. If the end of the word can be reached by iterating this process, and if the last suffix analyzed is one which may end a word, the parse is successful.
3.2 Stem and Root Finding Algorithms for Turkish
Some of the methods that determine the root or stem of words in Turkish are investigated below:

1. AF Algorithm
2. Longest-Match (L-M) Algorithm
3. Identified Maximum Match (IMM) Algorithm
4. Solak and Oflazer's Approach
5. Root Reaching Method without Dictionary
6. Extended Finite State Approach
7. FindStem Algorithm

3.2.1 AF Algorithm
The AF algorithm was developed by Solak and Can in 1994. The algorithm works with a lexicon that keeps actively used stems for Turkish, in which each record is described with 64 tags. The searched word is iteratively looked up in the lexicon from right to left, pruning one letter at each step. If the word matches any of the root words, morphological analysis is done for that word. If any of the surface forms produced corresponds to the word at hand, the root word is assumed to be an eligible stem for that word. Solak and Can did not distinguish a root word from a stem. This may be because root words can be viewed as special cases of stems, in the sense that a root is a stem that neither contains any morpheme nor is a compound word. The process is repeated until the word is reduced to a single letter. Here is the algorithm:
1. Remove the suffixes that are attached with punctuation marks from the word.
2. Search for the word in the dictionary.
3. If a matching root is found, add the word to the root words list.
4. If the word has been reduced to a single letter: if the root words list is empty, go to step 6; if it has at least one element, go to step 7.
5. Remove the last letter from the word and go to step 2.
6. Add the searched word to the unfound records and exit.
7. Get a root word from the root words list.
8. Apply morphological analysis to the root word.
9. If the result of the morphological analysis is positive, add the root word to the stems list.
10. If there are any elements left in the root words list, go to step 7.
11. Choose all stems in the stems list as word stems.
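The steps above can be sketched as follows. This is only an illustration under stated assumptions: the lexicon is a plain set of strings, the apostrophe is taken as the punctuation mark that attaches suffixes, and `can_derive` is a hypothetical callable standing in for the morphological analysis of steps 8 and 9; none of these names come from Solak and Can.

```python
from typing import Callable, List, Set


def af_candidate_stems(word: str, lexicon: Set[str],
                       can_derive: Callable[[str, str], bool]) -> List[str]:
    """Return every lexicon entry found while pruning `word` from the right
    that the (hypothetical) morphological analysis accepts."""
    surface = word.split("'")[0]      # step 1 (assumption: apostrophe-attached suffixes)
    roots: List[str] = []
    current = surface
    while current:                    # steps 2-5
        if current in lexicon:        # step 3: remember every matching root
            roots.append(current)
        if len(current) == 1:         # step 4: single letter reached
            break
        current = current[:-1]        # step 5: prune the last letter
    # Steps 7-11: keep the roots accepted by the analysis.
    # An empty result corresponds to the "unfound" record of step 6.
    return [root for root in roots if can_derive(root, surface)]


toy_lexicon = {"göz", "gözlük", "gözlükçü"}
print(af_candidate_stems("gözlükçüler", toy_lexicon, lambda root, w: True))
# ['gözlükçü', 'gözlük', 'göz']
```

With a permissive analysis, every matching prefix is returned, which illustrates why the method yields many candidate stems for a single word.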
This algorithm finds all possible stems of the word, and the stems found give rise to too many other stems; e.g., the root göz (eye) is a source of derivation for roughly 150 stems, which have totally different meanings. So, this algorithm is far from finding the 'correct' stem.
3.2.2 L-M Algorithm
The Longest-Match (L-M) algorithm was developed by Kut et al. in 1995. It is based on word search logic over a lexicon that covers Turkish word stems and their possible variants. Here is the algorithm:
1. Remove the suffixes that are attached with punctuation marks from the word.
2. Search for the word in the dictionary.
3. If a matching root is found, go to step 5.
4. If the word has been reduced to a single letter, go to step 6. Otherwise, remove the last letter from the word and go to step 2.
5. Choose the found root as the stem and go to step 7.
6. Add the searched word to the unfound records.
7. Exit.
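A minimal sketch of these steps is given below. It assumes a lexicon held as a set of strings and treats the apostrophe as the punctuation mark of step 1; the function name is illustrative and not taken from Kut et al.

```python
from typing import Optional, Set


def longest_match_stem(word: str, lexicon: Set[str]) -> Optional[str]:
    """Return the longest prefix of `word` found in `lexicon`, or None if unfound."""
    current = word.split("'")[0]      # step 1 (assumption: apostrophe-attached suffixes)
    while current:
        if current in lexicon:        # steps 2, 3 and 5: first match is the stem
            return current
        if len(current) == 1:         # step 4: single letter, nothing found
            return None               # step 6: caller records the word as unfound
        current = current[:-1]        # step 4: remove the last letter
    return None


print(longest_match_stem("edebilecek", {"edebi", "edep"}))  # edebi
```

For the word edebilecek the sketch returns edebi, the longest lexicon entry that is a prefix of the word, even though the word cannot be derived from that root.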
This algorithm returns the first stem that matches, starting from the whole word and removing letters one by one from its end. The lexicon used may also not include all possible stems. So, this algorithm is also far from finding the 'correct' root or stem.
3.2.3 Identified Maximum Match (IMM) Algorithm
This algorithm was developed by Köksal in 1975. It is a left-to-right parsing algorithm. It tries to find the maximum-length substring that is present in a root lexicon. If a match is found, the remaining part of the word is considered to be the suffixes; this part is searched in a dictionary of suffix morpheme forms, and morphemes are identified one by one until nothing remains.
The result of these steps may not be correct; in such cases all of the steps are repeated after removing one character from the found substring.
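The idea can be sketched as below. This is an illustration only: the root lexicon and the suffix dictionary are assumed to be plain sets of surface strings, suffix segmentation tries longer forms first, and the names `imm_parse` and `segment_suffixes` are hypothetical rather than Köksal's.

```python
from typing import List, Optional, Set, Tuple


def segment_suffixes(rest: str, suffixes: Set[str]) -> Optional[List[str]]:
    """Split `rest` into known suffix forms, or return None if that is impossible."""
    if not rest:
        return []
    for end in range(len(rest), 0, -1):           # try longer suffix forms first
        if rest[:end] in suffixes:
            tail = segment_suffixes(rest[end:], suffixes)
            if tail is not None:
                return [rest[:end]] + tail
    return None


def imm_parse(word: str, roots: Set[str],
              suffixes: Set[str]) -> Optional[Tuple[str, List[str]]]:
    """Find the longest root whose remainder can be identified as suffixes."""
    for end in range(len(word), 0, -1):           # maximum-length substring first
        root = word[:end]
        if root in roots:
            parts = segment_suffixes(word[end:], suffixes)
            if parts is not None:
                return root, parts
            # otherwise fall through: retry with a shorter substring
    return None


print(imm_parse("kitaplarımda", {"kit", "kitap"}, {"lar", "ım", "da"}))
# ('kitap', ['lar', 'ım', 'da'])
```

When the remainder after the longest root cannot be segmented, the loop continues with the next shorter candidate, which corresponds to repeating the steps with one character removed.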
3.2.4 Solak and Oflazer's Approach
Solak and Oflazer used a dictionary of 23,000 words based on the Turkish Writing Guide as the source (Solak and Oflazer, 1993). The words are kept in sorted order in a sequential array to allow fast search. Each entry of the dictionary contains a root word in Turkish and a series of flags showing certain properties of that word. If the bit corresponding to a certain flag is set for an entry, the word has the property represented by that flag. 64 different flags are reserved for each entry, but only 41 flags have been used. Some of the flags are shown in the following table.

Table 3.1 Example of flags

Flag | Property of the word for which this flag is set | Examples
CL_NONE | Belongs to none of the two main root classes | RAĞMEN, VE
CL_ISIM | Is a nominal root | BEYAZ, OKUL
CL_FIIL | Is a verbal root | SEV, GEZ
IS_OA | Is a proper noun | AYŞE, TÜRK
IS_OC | Is a proper noun which has a homonym that is not a proper noun | MISIR, SEVGİ
IS_SAYI | Is a numeral | BİR, KIRK
IS_KI | Is a nominal root which can directly take the relative suffix KI | BERİ, ÖBÜR
IS_SD | Is a nominal root ending with a consonant which is softened when a suffix beginning with a vowel is attached | AMAÇ, PARMAK, PSİKOLOG
IS_SDD | Is a nominal root ending with a consonant which has a homonym whose final consonant is softened when a suffix beginning with a vowel is attached | ADET, KALP
The root of the word is searched in the dictionary using a maximal match algorithm. In this algorithm, first the whole word is searched in the dictionary; if it is found, the word has no suffixes and does not need to be parsed. If not, a letter is removed from the right and the remaining substring is searched. This step is repeated until a root is found. If no root is found even though the first letter of the word has been reached, the word's structure is accepted as incorrect.
In order to obtain reliable results from this parser, all of the rules and their exceptions must be implemented. But it is not possible to obtain all rules and exceptions of the Turkish language.
3.2.5 Root Reaching Method without Dictionary
This method was developed by Cebiroğlu and Adalı. It is claimed and shown that the analysis of a Turkish word into its root and suffixes can be formulated. The suffixes that can be attached to a word root are divided into groups, and finite state machines are formed by formulating the order of suffixes for each of these groups. A main machine is formed by combining these group-specific machines. In the morphological analysis done using the main machine, the word root is obtained by extracting the suffixes from the end of the word towards its start. Here are the abbreviations that are used in suffixes:
U: ı, i, u, ü
A: a, e
D: d, t
C: c, ç
I: ı, i
(): optional letters

Example: '-cU' can be -cı, -ci, -cu, -cü
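The abbreviations can be read as a small template language. The sketch below expands a suffix template into all of its surface forms, assuming the letter sets U = ı, i, u, ü; A = a, e; D = d, t; C = c, ç; I = ı, i and treating a parenthesised letter as optional. The names `ARCHIPHONEMES` and `expand` are illustrative. The expansion ignores vowel harmony with the preceding syllable, so it over-generates for any given word.

```python
from itertools import product
from typing import List

# Assumed letter sets for the capital-letter abbreviations.
ARCHIPHONEMES = {"U": "ıiuü", "A": "ae", "D": "dt", "C": "cç", "I": "ıi"}


def expand(template: str) -> List[str]:
    """Expand a suffix template such as 'cU' or '(y)DU' into surface forms."""
    slots: List[List[str]] = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "(":                                  # optional letter(s)
            close = template.index(")", i)
            slots.append(["", template[i + 1:close]])
            i = close + 1
            continue
        # A capital abbreviation expands to its letter set; other letters stay as-is.
        slots.append(list(ARCHIPHONEMES.get(ch, ch)))
        i += 1
    return ["".join(choice) for choice in product(*slots)]


print(expand("cU"))  # ['cı', 'ci', 'cu', 'cü']
```

A template with an optional letter, such as (y)DU, expands to sixteen forms: with or without y, two choices for D and four for U.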
The morphological rules can be expressed with finite state machines. To reach the root of the word, these rules are interpreted from right to left, from the end of the word to its beginning. For all sets, different modules are developed, dependent on each other.
The following table shows the affix-verbs in Turkish, which are treated as one set.
Table 3.2 The affix-verbs in Turkish

No. | Affix | No. | Affix | No. | Affix
1 | (y)Um | 6 | m | 11 | cAsInA
2 | sUn | 7 | n | 12 | (y)DU
3 | (y)Uz | 8 | k | 13 | (y)sA
4 | sUnUz | 9 | nUz | 14 | (y)mUş
5 | lAr | 10 | DUr | 15 | (y)ken
The following figure shows a finite state machine that implements this table:
Figure 3.1 Finite state machine of Table 3.2 (states A to H; transitions are labelled with the affix numbers of Table 3.2).
For example, the word 'çalışkan-mış-sınız' is examined by this finite state machine. The affix -sUnUz moves the machine from state A to state B, and the affix -(y)mUş moves it from state B to state F. When the next candidate affix, -n, is tried from state F, no transition is possible, so the process stops. Because F is a final state, the possible root is accepted as 'çalışkan'. But the actual root is 'çalış-'.
For every set of affixes, such as those used for nouns and verbs, a new finite state machine is implemented. They are all combined into one finite state machine at the end, and the roots are found. The following figure shows the main finite state machine.
Figure 3.2 The main finite state machine
However, in this approach a finite state machine for the list of derivational suffixes could not be built, because in Turkish the ordering of these suffixes cannot be reduced to rules.
3.2.6 Extended Finite State Approach
This algorithm was developed by Oflazer. In this approach, a Turkish word is represented as a sequence of inflectional groups (IGs), separated by ^DBs denoting derivation boundaries, in the following general form:
root+Infl1^DB+Infl2^DB+...^DB+Infln

where Infli denotes the relevant inflectional features, including the part-of-speech for the root or any of the derived forms. For instance, the derived determiner 'sağlamlaştırdığımızdaki' ((the thing existing) at the time we caused (something) to become strong) would be represented as:

sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Adj+PastPart+P1sg^DB+Noun+Zero+A3sg+Pnon+Loc^DB+Det
This word has 6 IGs:
1. sağlam+Adj
2. +Verb+Become
3. +Verb+Caus+Pos
4. +Adj+PastPart+P1sg
5. +Noun+Zero+A3sg+Pnon+Loc
6. +Det
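The decomposition into IGs can be reproduced mechanically. The sketch below assumes that features are joined with '+' and that derivation boundaries are written '^DB'; the function name `split_igs` is illustrative and is not part of Oflazer's system.

```python
from typing import List, Tuple


def split_igs(analysis: str) -> Tuple[str, List[List[str]]]:
    """Split an analysis string into its root and a list of IG feature lists."""
    groups = [group.strip("+").split("+") for group in analysis.split("^DB")]
    root = groups[0][0]           # the first item of the first group is the root
    groups[0] = groups[0][1:]     # the remaining items are the root's features
    return root, groups


analysis = ("sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos"
            "^DB+Adj+PastPart+P1sg^DB+Noun+Zero+A3sg+Pnon+Loc^DB+Det")
root, igs = split_igs(analysis)
print(root, len(igs))  # sağlam 6
```

The last element of the returned list holds the features of the final IG, which is the one from which dependency links emanate.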
A sentence would then be represented as a sequence of the IGs making up the words. When a word is considered as a sequence of IGs, syntactic relation links only emanate from the last IG of a (dependent) word and land on one of the IGs of the (head) word on the right (with minor exceptions), as exemplified in the following figure:
Figure 3.3 Links and inflectional groups
With minor exceptions, the dependency links between the IGs, when drawn above the IG sequence, do not cross. The following figure shows a dependency tree for a sentence laid on top of the words segmented along IG boundaries.
The last line shows the final POS for each word.
Figure 3.4 Dependency links in an example Turkish sentence
The approach relies on augmenting the input with "channels" that (logically) reside above the IG sequence and "laying" links representing dependency relations in these channels. The parser operates in a number of iterations: at each iteration of the parser, a new empty channel is "stacked" on top of the input, and any possible links are established using these channels, until no new links can be added. The channel symbol 0 indicates that the channel segment is not used, while 1 indicates that the channel is used by a link that starts at some IG on the left and ends at some IG on the right, that is, the link is just crossing over the IG. If a link starts from an IG (ends on an IG), then a start (stop) symbol denoting the syntactic relation is used on the right (left) side of the IG. The syntactic relations (along with the symbols used) that are encoded in the parser are the following:
S (Subject), O (Object), M (Modifier, adv/adj), P (Possessor), C (Classifier), D (Determiner), T (Dative Adjunct), L (Locative Adjunct), A (Ablative Adjunct), I (Instrumental Adjunct).
3.2.7 FindStem Algorithm
FindStem was developed by Sever and Bitirim (Sever & Bitirim, 2003). This algorithm contains a pre-processing step that converts all letters of the word to lower case and singles out the letters after the punctuation mark in the word. It has three components: Find the Root, Morphological Analysis and Choose the Stem.
In the Find the Root component, all possible roots of an examined word are found by starting with the first character of the examined word and searching the lexicon for this item. Then the next character is appended to the item and the lexicon search is repeated. This operation continues until the item becomes equal to the examined word or until the system determines that there are no more relevant roots for the examined word in the lexicon. Then, these roots and the production rules are used to derive the examined word. In the lexicon, the type information for every root word and its possible root changes (when a root word combines with a suffix) is coded for use in morphological analysis. When a root and a suffix are combined in Turkish, two alterations of the root word structure can occur: a change of the last vowel (e.g. ara-arıyor) or consonant (e.g. kitap-kitabı) of the root word, and the drop of a middle vowel (e.g. oğul-oğlum). To handle such situations, lexicons are used in stemming algorithms for Turkish.
A morphological analyzer is used in the Morphological Analysis component. In the Turkish language there are a number of rules that determine the form and order of suffixation; the derivational suffixes are used for changing word meanings. Which derivational suffixes can be added to the end of a word is determined by the word type (this information is coded into the lexicon for every word). If this procedure is applied, all possible stems can be found. Consider the word edebilecek as an examined word. The longest possible roots retrieved from the lexicon are edebi and edep. According to the L-M and Identified Maximum Match (IMM) algorithms, which assign a stem by matching the examined word with the longest root words, 'edebi' will be selected as output. But it is not possible to produce the examined word, edebilecek, from this root; this can be established through the morphological analysis procedure.
In the last component, Choose the Stem, the word stem is chosen by a selection among the derivations in the derivations list.
Here is the algorithm:
1. Remove the suffixes that are attached with punctuation marks from the word.
2. Find all possible roots of the word in a lexicon and add them to the root words list.
3. If the root words list is empty, add the word to the unfound records and exit.
4. Get a root word from the root words list.
5. Apply morphological analysis to the root word.
6. After the morphological analysis, add the derivations formed to the derivations list.
7. If there are any elements left in the root words list, go to step 4.
8. Choose the word stem by a selection among the derivations in the derivations list.
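Steps 1 to 7 can be sketched as follows. This is a simplified illustration, not the implementation of Sever and Bitirim: the lexicon is assumed to map each surface variant of a root (including altered forms such as ed for et) to its dictionary form, `can_derive` is a hypothetical stand-in for the morphological analyzer, and the selection rule of step 8 is left to the caller because it is not specified here.

```python
from typing import Callable, Dict, List


def find_stem_candidates(word: str, lexicon: Dict[str, str],
                         can_derive: Callable[[str, str], bool]) -> List[str]:
    """Return the roots from which `word` can be derived (the derivations list)."""
    surface = word.split("'")[0]                  # step 1 (assumption: apostrophe)
    derivations: List[str] = []
    for end in range(1, len(surface) + 1):        # step 2: grow the item letter by letter
        prefix = surface[:end]
        if prefix in lexicon:
            root = lexicon[prefix]
            if can_derive(root, surface[end:]):   # steps 4-6: keep derivable roots only
                derivations.append(root)
    return derivations                            # empty list: unfound record (step 3)


# Toy data: surface variants mapped to roots, and a toy derivation check.
toy_lexicon = {"ed": "et", "edeb": "edep", "edebi": "edebi"}
toy_rules = {"et": {"ebilecek"}}
toy_check = lambda root, rest: rest in toy_rules.get(root, set())
print(find_stem_candidates("edebilecek", toy_lexicon, toy_check))  # ['et']
```

In this toy setting the longer candidates edebi and edep are rejected because the remaining letters cannot be derived from them, which mirrors the edebilecek example in the text.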
This algorithm finds all possible stems of the word and eliminates those that are not in the derivations list. So this algorithm is better, but it is also far from finding the 'correct' stem.
3.3 Sentence Boundary Detection
For many natural language processing tasks, identifying sentence boundaries is one of the most important prerequisites. Many available natural language processing tools do not perform a reliable detection of sentence boundaries.
Using only a list of end-of-sentence punctuation marks (e.g. '.', '!') is not sufficient to find the end of a sentence, because a period can also be used in an abbreviation, as a decimal point, in e-mail addresses, etc. Some examples are shown below:
She comes here by 5 p.m. on Saturday evening.
At 5 p.m. I have to go to the hospital.
Because end-of-sentence characters are used in different situations, ambiguities appear. Such ambiguity is the main problem of sentence boundary detection, and until now there has not been any work, for Turkish or any other language, that resolves all ambiguities of this kind.
CHAPTER FOUR PROPOSED SYSTEM
A corpus can be thought of as a collection of texts gathered according to particular principles for some particular purpose.
There are several steps in generating a corpus. These steps allow Turkish words to be determined appropriately and make the corpus more functional. The steps of the solution are as follows:
1. Sentence boundary detection
2. Examination of types of words
3. Finding stem and inflectional suffixes
4. Finding root and derivational suffixes
5. Generate large scale Turkish corpus
First, the sentences in the text file are determined according to the rules introduced in the following sections, and then these sentences are split into words. After splitting, the words are examined, their types are found, and then the stems and inflectional suffixes of these words are found. Finding types and finding stems form an iterative process, because in Turkish it is not possible to find the type of a word without knowing its stem, or vice versa.
After the stem of the word is found, the root and derivational suffixes are split, and then all data is stored in the corpus.
A summarized illustration of generating the large scale corpus is given in the following figure.
Figure 4.1 Block diagram of the algorithm for generating the corpus (blocks: Find all sentences; Get a sentence; Split sentence into words; Examine type of word; Split word into stem and inflectional suffixes; Split stem into root and derivational suffixes; Write in corpus; decision "End of sentence": if no, get other word).
4.1 Sentence Boundary Detection
The first step in generating the corpus is finding sentences. Turkish sentences generally end with known punctuation marks such as '.', '...', '!' and '?'.
The process of finding the end of sentences is very complex. As in other languages, there are ambiguities in finding the end of a sentence in Turkish. For example:
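Uluslar, bu ekonomik buhran sonucunda 2. Dünya Savaşı'nı yaşamıştır. (As a result of this economic depression, nations lived through the 2nd World War.)
Bu sezon kaybedilen maç sayısı 2. Dünya Kupası'na katılma şansı azalıyor. (The number of matches lost this season is 2. The chance of taking part in the World Cup is decreasing.)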
In the first sentence, the '.' character marks an ordinal number, but in the second sentence it indicates the end of a sentence. In both cases the '.' is followed by the same word, which begins with an uppercase letter. So there is an ambiguity in the process of finding the end of a sentence.
In this work, to find the end of a sentence, a rule list is created first and stored in XML format.
Figure 4.2 The rule list for sentence boundary detection
Each rule in the XML file is a group of three characters (e.g. 'L.L'). The dot character in the middle of the group stands for the end-of-sentence characters. The left character describes the first character of the word before the punctuation, and the right character describes the first character of the word after the punctuation. The meanings of the characters are shown in the following table.
Table 4.1 The meanings of the characters in the sentence boundary rule list

Character | Meaning
. | EOS punctuation (. ... ! ?)
L | Lowercase
U | Uppercase
# | Number
? | Any character
- | -
, | ,
( | (
) | )
/ | /
' | '
These rules are intended to make finding the end of a sentence easier. However, while the rules were being created, some difficulties appeared because of particular features of the Turkish language, and these difficulties have not been solved yet.
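The way such a triple can be applied is sketched below. This is an illustration only: the contents of the actual rule list are not reproduced here, so `boundary_rules` is a hypothetical set of triples that are treated as sentence ends, and the function names are assumptions. Each token ending in an EOS punctuation is classified by the first character of the word before the punctuation and the first character of the following word.

```python
from typing import List, Set

EOS = {".", "!", "?"}


def char_class(ch: str) -> str:
    """Map a character to the symbols L, U or #, or to itself."""
    if ch.isdigit():
        return "#"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return ch


def pattern(before: str, after: str) -> str:
    """Build the triple for the words before and after an EOS punctuation."""
    return char_class(before[0]) + "." + char_class(after[0])


def split_sentences(text: str, boundary_rules: Set[str]) -> List[str]:
    """Split `text` at EOS punctuation whose triple is in `boundary_rules`."""
    tokens = text.split()
    sentences: List[str] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok[-1] in EOS:
            is_last = i + 1 == len(tokens)
            if is_last or pattern(tok, tokens[i + 1]) in boundary_rules:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Uluslar bu buhran sonucunda 2. Dünya Savaşı'nı yaşamıştır. Bu sezon iyi geçti."
print(split_sentences(text, {"L.U"}))
```

With the hypothetical rule set {"L.U"}, the text is split after yaşamıştır. but not after 2., whose triple is #.U. The same triple #.U also arises when a sentence really ends in a number, which is exactly the ambiguity described above.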
As an example, some ambiguities are shown below:
Cumhuriyetimizin 75. yılı coşkuyla kutlandı.
Tahta çıkan IV. Murat emirler yağdırdı.
Olimpiyatlar için uzun zamandır çalışan Ahmet koşuda 2. Uzun atlamada ise ancak 4. olabildi.
A. Mehmet YILDIZ size uğradı.
Alfabenin ilk harfi A. Mehmet'e bunu öğretmeniz gerekiyor.
For abbreviations that cause ambiguity in sentences, an XML file was created, and the abbreviation list was placed in this file as shown in the following figure.
Figure 4.3 Example of abbreviation list in XML file
The abbreviation and rule lists were written into two files in a standard format, separate from the main program, to allow users to change these files easily and independently of the program.
By using these abbreviation and rule lists, the texts were split into sentences, and the output was again written in XML format as shown in the following figure.
As described in the following sections, word types, roots and suffixes of the words were added to this structure.
4.2 Examination of Type of Words
Finding types and finding stems form an iterative process, because in Turkish it is not possible to find the type of a word without knowing its stem, or vice versa.
After the sentence was split into its words, the words were evaluated in two steps:
1. Determining the stems of the words
2. Determining the available types of the words

First, all words in the sentence and their stems are determined by using morphological analysis.
In the second step, the probable types of the words (noun, adjective, etc.) were determined by making use of an electronic dictionary. In this way the working space becomes smaller, and the word types that are not available can be eliminated. In addition, stems whose types do not suit their place in the sentence were eliminated.
4.3 Description of the Methods for Finding Roots
Finding the roots of the words is a very important part of generating a corpus. Finding the roots takes two steps:

1. Finding Stems and Inflectional Suffixes
2. Finding Roots and Derivational Suffixes

4.3.1 Finding Stems and Inflectional Suffixes
By finding stems, the words are cleared of their inflectional suffixes, and this process makes the words look as they do in the dictionary.
Several methods for determining stems were considered and attempts were made to implement them. The first method was to produce all probable words in Turkish. This algorithm would work as follows:

1. Concatenate the inflectional suffixes to the end of the roots to create different stems.
2. Write all words into a file.
3. Find the examined word in the file created in the previous step to determine its root and inflectional suffixes.

In theory, this algorithm makes finding stems and roots fast. However, technical limitations made the implementation impossible; for example, the operating system did not allow creating and appending to a file larger than 4 GB.
After that, this method was improved by performing the concatenation separately for each letter in the alphabet and writing the results in separate files according to the first letter of the words. This would give 29 files for Turkish words. However, this method would need too many resources and degrade performance, and the same technical limitations as in the previous method made the implementation impossible, so it was not used.
In the last method that was found and developed to determine the stems of words, the stems are stored in lists according to word types, and the inflectional suffixes are stored in a different list in their probable combinations. Then the search is made in the two lists at the same time, and the stem and suffixes are determined. Some sample lists are shown in the following figures.
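The two-list search can be sketched as follows. This is a minimal illustration under stated assumptions: the stem lists are sets keyed by word type, the inflectional suffix combinations are stored as ready-made strings, and the stem is taken to appear unchanged in the word (sound changes at the stem boundary are ignored). The function name and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def split_stem(word: str, stems_by_type: Dict[str, Set[str]],
               suffix_combinations: Set[str]) -> List[Tuple[str, str, str]]:
    """Return (word type, stem, inflectional suffixes) for every split in which
    the left part is a known stem and the right part is a known combination."""
    results: List[Tuple[str, str, str]] = []
    for word_type, stems in stems_by_type.items():
        for end in range(len(word), 0, -1):       # longest candidate stem first
            stem, rest = word[:end], word[end:]
            if stem in stems and (rest == "" or rest in suffix_combinations):
                results.append((word_type, stem, rest))
    return results


toy_stems = {"noun": {"abajur"}, "adjective": {"belirli"}}
toy_suffixes = {"lar", "da", "larda"}
print(split_stem("abajurlarda", toy_stems, toy_suffixes))
# [('noun', 'abajur', 'larda')]
```

Because both lookups are set membership tests, each candidate split costs a constant number of lookups, and a word of n letters needs at most n splits per word type.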
Figure 4.5 List of noun stems in Turkish (sample: ... ab, aba, aba, abacı, abacılık, abadi, aba güreşi, abajur, abajurcu, abajurculuk, abaküs, abandırma, abani, abanma, abanoz, abanozgiller, abanozlaşma, abartı, abartıcı, abartıcılık ...)
Figure 4.6 List of adjective stems in Turkish (sample: ... belgili, belgin, belgisiz, beli bükük, beliğ, belirgin, belirli, belirli belirsiz, belirsiz, belirtik, belirtili, belirtisiz, belkili, belli, belli başlı, belli belirsiz ...)
The number of stems in the lists that are specialized by word type is shown in the following table:
Table 4.2 The number of stems (NOS) in the lists

Word Type | NOS
Noun | 44835
Verb | 6483
Adjective | 11128
Conjunction | 30
Preposition | 116
Adverb | 2551
Pronoun | 81
Interjection | 297
This method will work much faster than the methods of previous works.
4.3.2 Finding Roots and Derivational Suffixes
The previous works on finding roots were investigated, and it was seen that generally two different methods were used (Köksal, 1975; Solak & Can, 1994; Solak & Oflazer, 1993). Both methods have advantages and disadvantages. The methods used were:

1. Examining the word from the beginning to the end, and finding the root in a dictionary.
2. Examining the word from the end to the beginning, finding the root in a dictionary and eliminating suffixes.

In the first method, the letters of the examined word are taken one by one from the beginning, and each resulting substring is checked against the dictionary. The first match found in the dictionary is said to be the root of the word. However, the first match is not always the root, as with the Turkish word 'bilek'. With this method, the root of this word may be found as 'bil-', but this is not its real root.
In the second method, the letters of the examined word are taken one by one from the end towards the beginning, and the string obtained is checked in the suffixes dictionary to determine the suffixes. At the same time, the remaining part of the word is checked in the dictionary to determine the root. These steps aim to determine the real root and suffixes in an efficient way. But this method also has disadvantages. The root of the word cannot always be found correctly. Also, the suffixes differ between sources, so this method is not efficient.
In this work, after the probable types and stems are found, the roots are found from the root list by using the suffixes list (Korkmaz, 2003). All of the roots are separated into two files, 'noun' and 'verb'. Words of all types except 'verb' were stored in the 'noun' list. A sample part of the list of roots is shown in the following figure.
Figure 4.7 (a) Sample of noun roots: abad, abajur, abaks, abandone, aban, abanoz, abaso, abat, Abaza, abazan, Abbas, abd ... (b) Sample of verb roots: aban-, abart-, abra-, acı-, aç-, ada-, ağın-, ağla-, ağ-, ağna-, ağrı-, ak- ...
All of the suffixes are separated into four files according to their function: suffixes that derive 'from noun to noun', 'from noun to verb', 'from verb to noun' and 'from verb to verb'. A sample part of the list of stems, roots and suffixes is shown in the following figures.
Figure 4.8 Sample of stems and roots in the 'from noun to noun' suffixes list.
Figure 4.9 Sample of stems and roots in the 'from verb to noun' suffixes list.
The finding root step in generating the large scale corpus is illustrated in the following figure.
Figure 4.10 Block diagram of the algorithm for the finding root step in generating the corpus (blocks: Examine type of word; Split word into stems and inflectional suffixes by using the list of the stems generated according to the type determined, e.g. verb; Split stems into roots and derivational suffixes by using the lists of the suffixes and roots according to the type determined, e.g. verb; Write all found roots and suffixes into corpus; Generate stems by using the roots and suffixes lists; Store stems in lists according to their types).
There are 16203 noun roots and 738 verb roots in the lists. The numbers of roots by initial letter in the noun and verb lists are shown in the following table:
Table 4.3 The number of roots in the noun and verb lists

Letter | Number of Noun Roots | Number of Verb Roots
A | 1101 | 43
B | 839 | 60
C | 284 | 4
Ç | 344 | 45
D | 631 | 66
E | 535 | 29
F | 596 | 12
G | 447 | 32
H | 770 | 16
I | 71 | 12
İ | 706 | 22
J | 73 | 0
K | 1712 | 106
L | 377 | 0
M | 1994 | 2
N | 370 | 0
O | 267 | 0
Ö | 60 | 19
P | 836 | 16
R | 438 | 0
S | 1106 | 99
Ş | 328 | 6
T | 1346 | 37
U | 85 | 22
Ü | 80 | 12
V | 283 | 3
Y | 256 | 74
Z | 268 | 1
Total | 16203 | 738
The found roots were stored in the XML structure to generate the large-scale corpus. This XML structure needs a very large amount of memory, which is the only drawback of this method. This problem was solved by carrying out all processing in memory, by using pointers, etc.
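As a rough illustration of keeping the found roots in an XML structure that lives entirely in memory, the following sketch uses Python's standard xml.etree.ElementTree module. The element and attribute names (corpus, word, root, suffix, form, type) and the helper add_word are hypothetical; the actual schema of the thesis is not reproduced here.

```python
import xml.etree.ElementTree as ET


def add_word(corpus, surface, word_type, analyses):
    """Append one word and its candidate analyses to the in-memory tree.

    analyses is a list of (root, [suffixes]) pairs; all names are illustrative.
    """
    word = ET.SubElement(corpus, "word", form=surface, type=word_type)
    for root, suffixes in analyses:
        candidate = ET.SubElement(word, "root", form=root)
        for suffix in suffixes:
            ET.SubElement(candidate, "suffix").text = suffix
    return word


# The whole corpus is one element tree held in memory.
corpus = ET.Element("corpus")
add_word(corpus, "evler", "noun", [("ev", ["ler"])])

# Serialize only when the corpus has to be written out.
xml_text = ET.tostring(corpus, encoding="unicode")
```

Holding the tree in memory makes lookups fast but, as noted above, the memory cost grows with the size of the corpus.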
In addition, by applying statistical and morphological analysis techniques such as n-gram analysis, the number of candidate roots can be decreased and the real root of the word can be found.

4.4 Generate Large Scale Turkish Corpus

One of the biggest problems in NLP work arises when storing words in large databases, retrieving them, and running analyses on such a database. Database systems use various algorithms for storage and retrieval, but for specialized work such as natural language processing the performance of these algorithms is not sufficient.
Databases were not used in the previous works; instead, specialized structures were used, and some statistical analyses were applied on the corpus (Çebi & Dalkılıç, 2004).
An XML structure was used for generating the corpus in order to solve this kind of problem. As noted before, its only drawback is its large memory use.

4.4.1 Data in the Corpus

The data used to generate the corpus is shown in the following table.

Table 4.4 The number of files, NODW, files' sizes, and distribution (%) of data

Web Sites                     Number of Files   NODW       Files' Sizes (MB)
www.netgazete.com.tr          483428            1006692    400
www.aksam.com.tr              13934             345440     45.7
www.tercuman.com.tr           11704             467746     42.7
www.yeniasir.com.tr           13609             240672     64.5
Subtitles                     6105              606704     152.0
Turkish novels and stories    240               10279162   77.9
Total                         529020            12946416   782.8
The list of the Turkish novels and stories is shown in Appendix A.
4.4.2 Definition of the Corpus Structure

A database system stores any kind of data and allows users to process this data declaratively using predefined query languages (such as Structured Query Language). One of the main problems of a database system is performance, and performance is a good criterion for evaluating a database. Each database system has many different algorithms to store and access data; B-tree and hashing algorithms are the most commonly used ones. Database systems are general-purpose systems, which is why they cannot be used for Natural Language Processing. A specific approach is therefore needed for this specific aim. The developed system is not a database system; it is specific to Natural Language Processing operations.
This system processes a data file in XML format and a schema definition in XSD to validate the given XML file. For key column mappings between the implemented code and the data file(s), the system needs another XML file. The schema definition for this mapping XML file is shown in the following figure.
NLs are ambiguous in speech, grammar, meaning, etc. To resolve local ambiguity, humans employ not only a detailed knowledge of the language itself (its sounds, the rules about sound combinations, its grammar and lexicon, together with word meanings and the meanings derived from word combinations and orderings) but also a large and detailed knowledge of the world, the ability to infer what a speaker meant even if he or she did not actually say it, and so on. These are the factors that make NLs so difficult to process by computer. Nevertheless, with developing technology, languages need to be processed by computers. Several techniques have been tried to solve these problems, but none of them has achieved 100 percent success.
Machine learning or statistical analysis may be used to resolve ambiguities in sentence boundary detection, where rules alone cannot decide whether a given point is the end of a sentence. Words have such ambiguities in Turkish, too. For example, the root of 'gölge' is 'gölge', but there is also a word 'göl' in Turkish; algorithms have not been able to find the real root and report both words as roots. Also, when a root and a suffix are combined in Turkish, two kinds of alteration of the root can occur: a change of the last vowel (e.g. ara → arıyor) or of the last consonant (e.g. kitap → kitabı) of the root, and the drop of a middle vowel (e.g. oğul → oğlum). Such situations prevent root-finding algorithms from working properly. In this thesis, these problems were addressed by determining the word type before trying to find the root. After the word type was found according to the word's place in the sentence, the root list of that word type was used to determine the root. This technique solved some of the problems. Even though it sometimes produced more than one root, it worked better than the other algorithms it was compared with. Other methods were investigated, and it was seen that none of them achieves 100 percent success in finding the correct stem or root of a word. Finding the 'real' root is a very difficult process because the language is agglutinative and has ambiguities. This problem, again, may be solved by using machine learning and statistical analyses.
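The ambiguity and the consonant alteration described above can be illustrated with a small sketch. It is not the algorithm of the thesis: it simply looks up every prefix of a word in a root list of the determined type, and additionally tries the prefix with its final consonant hardened (b → p, c → ç, d → t, ğ → k). The function name, the tiny root list and the hardening table are assumptions made for the example; the middle-vowel drop (oğul → oğlum) is deliberately not handled.

```python
# Voiced final consonant -> voiceless consonant it may have come from.
HARDEN = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}


def candidate_roots(word, roots):
    """Return all listed roots that can start the word, longest first."""
    found = []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        variants = [prefix]
        # Undo softening only when something follows the prefix.
        if end < len(word) and prefix[-1] in HARDEN:
            variants.append(prefix[:-1] + HARDEN[prefix[-1]])
        for variant in variants:
            if variant in roots and variant not in found:
                found.append(variant)
    return found


noun_roots = {"göl", "gölge", "kitap"}
print(candidate_roots("gölgeler", noun_roots))  # ['gölge', 'göl']
print(candidate_roots("kitabı", noun_roots))    # ['kitap']
```

The first call returns two candidates, which is exactly the situation in which additional statistical evidence would be needed to pick the real root.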
The number of possible words generated by adding suffixes is practically infinite. As such, a finite-size lexicon for Turkish would miss a significant percentage of Turkish words. This makes lexicon-based text recognition approaches unsuitable for Turkish and other agglutinative languages. In this thesis, the stems were generated by using the root and suffix lists, taking only the roots that begin with the same letter as the examined word. This reduced the search space and made generating the stems by adding suffixes feasible and usable. As an alternative to this method, finite state machines are considered suitable for Turkish because it is a rule-based language. However, not all rules can be known, because the language grows fast and the rules are constantly modified and extended.
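The first-letter restriction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the roots and suffixes are toy lists, the function names are hypothetical, and the sketch concatenates roots and suffixes without applying any harmony rule, so it overgenerates.

```python
from collections import defaultdict


def index_by_first_letter(roots):
    """Group roots by their first letter so that lookups touch one bucket."""
    index = defaultdict(list)
    for root in roots:
        index[root[0]].append(root)
    return index


def generate_stems(word, index, suffixes):
    """Generate root+suffix stems only from roots sharing the word's first letter."""
    bucket = index.get(word[0], [])
    return {root + suffix for root in bucket for suffix in suffixes}


index = index_by_first_letter(["göz", "gül", "kitap"])
stems = generate_stems("gözlük", index, ["lük", "cü"])
# Only the 'g' bucket is expanded; 'kitap' is never combined with a suffix.
```

With the root counts of Table 4.3, such a bucket holds at most 1994 noun roots (letter M) instead of all 16203.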
The most important consideration when working on a large corpus is that its analysis requires a great deal of CPU power and memory. It was seen that programs using databases such as MySQL are too slow for this kind of operation because of the general-purpose nature of databases. In this thesis, in order not to be affected by these drawbacks of databases, an XML structure was used to generate the corpus. It allows any word to be retrieved quickly by using the memory of the computer, but its known drawback is that it needs too much space on the disk drive. By using specific and suitable algorithms, a great deal of time can be saved, and more detailed analyses can be done in databases.
In the future, in order to make the analyses more effective, the corpus size can be increased by adding new sites containing novels, technical papers, written reports, theses, etc. At the same time, by classifying these documents, different corpora for different fields can be generated, e.g. a medical corpus, an engineering corpus, etc.
REFERENCES

Antworth, E. (1990). PC-KIMMO: A two-level processor for morphological analysis. TX: Summer Institute of Linguistics, Dallas.
BNC. What is BNC. (n.d.). Retrieved March 3, 2005, from http://www.natcorp.ox.ac.uk
British National Corpus (BNC). (n.d.). Retrieved March 3, 2005, from http://www.hcu.ox.ac.uk/BNC/what/index.html
Brodda, B., & Karlsson, F. (1980). An experiment with morphological analysis of Finnish. Papers from the Institute of Linguistics, University of Stockholm, Publication 40, Stockholm.
Cebiroğlu, G., & Adalı, E. (2002). Root reaching method without dictionary. Istanbul Technical University Computer Engineering Department, Istanbul, Turkey.
Church, K., & Gale, W. (1991). Probability scoring for spelling correction. Statistics and Computing, 93-103.
Church, K., & Mercer, R. (1993). Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1), 1-24.
COSMAS, German Corpus. (n.d.). Retrieved January 5, 2005, from http://corpora.ids-mannheim.de/~cosmas/
Coxhead, P. (2002). An Introduction to Natural Language Processing (NLP). Retrieved June 26, 2005, from www.cs.bham.ac.uk
Croatian National Corpus. (n.d.). Retrieved January 15, 2005, from http://www.hnk.ffzg.hr/corpus.htm
Crystal, D. (1991). A Dictionary of Linguistics and Phonetics, Blackwell, 3rd Edition.
Çebi, Y., & Dalkılıç, G. (2004). Turkish Word N-gram Analyzing Algorithms for a Large Scale Turkish Corpus - TurCo, ITCC 2004, IEEE International Conference on Information Technology, Vol. 2, pp. 236-240.
Dalkılıç, G. (2001). Some Statistical Properties of Contemporary Printed Turkish and A Text Compression Application. MSc Thesis. International Computing Institute, Ege University.
Dalkılıç, M.E., & Dalkılıç, G. (2001). Some Measurable Language Characteristics of Printed Turkish. Proc. of the XVI. International Symposium on Computer and Information Sciences, 217-224.
Diri, B. (2000). A Text Compression System Based on the Morphology of Turkish Language. Proc. of the XV. International Symposium on Computer and Information Sciences, 12-23.
English Gigaword. (n.d.). Retrieved January 5, 2005, from http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
French Corpus. (n.d.). Retrieved January 5, 2005, from http://www.elda.fr/cata/text/W0020.html, 02/01/2003.
Güngördü, Z. (1993). A lexical-functional grammar for Turkish. MSc Thesis. Computer Engineering Department, Bilkent University, Ankara.
Hankamer, J. (1984). Turkish generative morphology and morphological parsing, Second International Conference on Turkish Linguistics, Istanbul.
Hankamer, J. (1989). Morphological parsing and lexicon. Lexical Representation and Process, MIT Press.
Jurafsky, D., & Martin, J.H. (2000). Speech and Language Processing, Prentice Hall, 193-199.
Kasper, R., & Weber, D. (1982). User's reference manual for the C's Quechua adaptation program. Occasional Publications in Academic Computing, (8,9), Summer Institute of Linguistics, Inc.
Klimova, J. (1996). Czech National Corpus (CNC). Retrieved June 28, 2005, from http://www.ling.ohio-state.edu/~dm/events/EastWest96/cnc.html
Korkmaz, Z. (2003). Türkiye Türkçesi Grameri, TDK, Ankara.
Koskenniemi, K. (1983). Two-level morphology. University of Helsinki, Department of General Linguistics, Publication No. 11, Helsinki, Finland.
Köksal, A. (1975). Automatic Morphological Analysis of Turkish. Hacettepe University, Ankara, Turkey.
Kukich, K. (1992). Technique for automatically correcting words in text. Periodical Issue Article of ACM Press, 377-439.
Kut, A., Alpkoçak, A., & Özkarahan, E. (1995). Bilgi bulma sistemleri için otomatik Türkçe dizinleme yöntemi. Bilişim Bildirileri, Dokuz Eylül University, Izmir, Turkey.
METU Corpus. (n.d.). Retrieved October 20, 2004, from http://www.ii.metu.edu.tr/~corpus/corpus.html
Nadas, A. (1984). Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(4), 859-861.
Oflazer, K. (1999). Dependency Parsing with an Extended Finite State Approach. Department of Computer Engineering, Bilkent University, Ankara, Turkey.
Packard, D. (1973). Computer-assisted morphological analysis of Ancient Greek. Computational and Mathematical Linguistics: Proceedings of the International Conference on Computational Linguistics, Pisa, Leo S. Olschki, Firenze, 343-355.
Sagvall, A. (1973). A system for automatic inflectional analysis implemented for Russian, Data Linguistica 8, Almquist and Wiksell, Stockholm.
Sinclair, J. (1991). Corpus, Concordance, Collocation. OUP.
Sever, H., & Bitirim, Y. (2003). FindStem: Analysis and evaluation of a Turkish stemming algorithm. Department of Computer Engineering, Baskent University, Ankara, 06530 Turkey; Department of Computer Engineering, Eastern Mediterranean University, Famagusta, T.R.N.C.
Shannon, C.E. (1951). Prediction and Entropy of Printed English. The Bell System Technical Journal, 30(1), 50-64.
Solak, A. (1991). Design and Implementation of a spelling checker for Turkish, Department of Comp. Eng. and Information Sciences, Bilkent Unv., Ankara, Türkiye.
Solak, A., & Oflazer, K. (1993). Design and Implementation of a Spelling Checker for Turkish, Department of Computer Engineering and Information Science, Bilkent Unv., Ankara, Türkiye.
Solak, A., & Can, F. (1994). Effects of stemming on Turkish text retrieval, Technical report BUCEIS-94-20, Bilkent University, Ankara, Turkey.
The ANC Project. (n.d.). Retrieved January 5, 2005, from http://americannationalcorpus.org
The Bank of English. (n.d.). Retrieved March 10, 2005, from http://www.cobuild.collins.co.uk/boeinfo.html
The Bank of English - Terms & Conditions. (n.d.). Retrieved March 10, 2005, from http://www.titania.bham.ac.uk/docs/svenguide.html
The Czech National Corpus (CNC). (n.d.). Retrieved January 15, 2005, from http://ucnk.ff.cuni.cz/english/index.html
Weber, D. J., Black, H. A., & McConnel, S. R. (1988). AMPLE: a tool for exploring morphology, TX: Summer Institute of Linguistics, Dallas.
ABBREVIATIONS

The following acronyms have been used in this thesis:

DWUR   Different Word Using Ratio
IMM    Identified Maximum Match
MT     Machine Translation
NL     Natural Language
NLP    Natural Language Processing
NOS    Number Of Stems
NODW   Number Of Different Words
NOW    Number Of Words
54 APPENDICES A. The List oI the Novels & Stories in Corpus NAME of NOVEL / STORY AUTHOR SITE NOW 2001 A.C.Clarke http://www.pdaturk.com/ 56868 2010 A.C.Clarke http://www.pdaturk.com/ 55226 2061 A.C.Clarke http://www.pdaturk.com/ 43824 1.Dogu Halklari Kurultayi Nurer Ugurlu http://www.pdaturk.com/ 62279 365 neri Ellison&Barnett http://www.pdaturk.com/ 39243 Access Anonim http://www.pdaturk.com/ 15609 Adamlik Dini Vural Yayincilik http://www.kitap.perisi.com/ 11705 Ag Yklemesi Anonim http://www.pdaturk.com/ 11623 Agir Roman Metin Kaan http://www.pdaturk.com/ 27635 Alacakaranlikta Tonio Kroger http://www.kitaplik.com 24914 Aldatmak Ahmet Altan http://www.pdaturk.com/ 44314 Alice... Lewis Carroll http://www.kelepirkitap.com 19258 Ankara Aniti Augustus http://www.pdaturk.com/ 18967 Apartman Emile Zola http://www.pdaturk.com/ 49039 Atatrk ve Inn John Grew http://www.pdaturk.com/ 31821 Atatrk,bir milletin yeniden dogusu Kinross http://www.1001kitap.com/ 58878 Atatrk'n Anadoluya Gnderilis... Baki Iz http://www.pdaturk.com/ 25110 Ates Deniz Margaret Weis http://www.kitap.perisi.com/ Atinalilarin Devleti Aristoteles http://www.maximumbilgi.com 18697 Atuan Mezarlari Ursula Kroeber LeGuin http://www.pdaturk.com/ 34437 Avalon'un Sisleri-By Ustasi M.Z.Bradley http://www.pdaturk.com/ 86054 Avrupa ile Asya Arasi... Nurer Ugurlu http://www.pdaturk.com/ 27510 Bakir Atli Puskin http://www.pdaturk.com/ 8737 Bartleby Herman Melville http://www.pdaturk.com/ 11816 Baskasinin Karisi Dostoyevski http://www.kitaplik.com 15031 Bassiz At Paul Berna http://www.haberbilgi.com 27666 Beden Dili Baki Evkarali http://www.pdaturk.com/ 24544 Beden Egitimi Giyasettin Demirhan http://www.meb.gov.tr/ 18000 Bedwyr'in Kilici R.A.Salvatore http://www.pdaturk.com/ 63584
55 APPENDICE A (Cont`d.) Belki de Gerekten Istiyorsun Murat Glsoy http://www.altkitap.com 21000 Bella'nin lm Georges Simenon http://www.pdaturk.com/ 34190 Beni Duyuyor musun? Leyla Navaro http://www.pdaturk.com/ 32534 Benito Cereno Herman Belville http://www.pdaturk.com/ 26302 Beyaz Geceler Dostoyevski http://www.pdaturk.com/ 26623 Beyaz Gemi Cengiz Aytmatov http://www.pdaturk.com/ 37027 Bilim Is Basinda Lenihan http://www.1001kitap.com/ 9347 Binbogalar EIsanesi Yasar Kemal http://www.pdaturk.com/ 67344 Bir iIt Yrek Marln Morgan http://www.kitap.perisi.com/ 39981 Bir Laboratuvar Romansi Adnan Kurt http://www.altkitap.com 23451 Bozkirda Maksim Gorki http://www.pdaturk.com/ 17965 Bozkirda Bir Kral Lear Turgenyev http://www.pdaturk.com/ 19073 Btn ykleri-2 Sabahattin Ali http://www.pdaturk.com/ 71500 Bycler Krali http://www.pdaturk.com/ 168434 CANDIDE YA DA IYIMSERLIK Voltaire http://www.pdaturk.com/ 24599 Cazname Tunel Glsoy http://www.altkitap.com 46506 Cazname-2 Tunel Glsoy http://www.altkitap.com 49050 Cemile Cengiz Aytmatov http://www.pdaturk.com/ 22620 Cemile Orhan Kemal http://www.pdaturk.com/ 30046 CHP Tzk chp http://izmir.chp.org.tr 14238 alikusu R.Nuri Gntekin http://www.pdaturk.com/ 94267 arlinin ikolata Fabrikasi Roald Dahl http://www.1001kitap.com/ 21893 ocuk ve Ergen Gelisimi Bekir Onur http://www.pdaturk.com/ 109516 l Gezegeni Dune Frank Herbert http://www.kitap.perisi.com/ 157194 zm Plani Annan http://www.pdaturk.com/ 39054 Daragacinda 3 Iidan Nihat Behram http://www.pdaturk.com/ 32008 Deccal F.Nietzsche http://www.pdaturk.com/ 24603 Degirmen, Kagni, Ses Sabahattin Ali http://www.pdaturk.com/ 62943 Degirmenimden Mektuplar Alphonso Daudet http://www.pdaturk.com/ 35311 Degisim Franz KaIka http://www.pdaturk.com/ 15706 Deliligin Daglarinda H.P.LovecraIt http://www.pdaturk.com/ 31996 Denemeler Montaigne http://www.pdaturk.com/ 50746 Denizden Gelen Lezzet Anonim http://ekitap.8m.com/ 15888 Devlet Adami Platon http://www.pdaturk.com/ 19071
56 APPENDICE A (Cont`d.) Devrim Tarihi ve Toplumbilim Aisindan Atatrk Emre Kongar http://www.pdaturk.com/ 104852 Dil gretim... Anonim http://www.pdaturk.com/ 18508 Dinde Siyasal Islam Tekeli Nur Serter http://www.pdaturk.com/ 40070 Disi Kurdun Ryalari Cengiz Aytmatov http://www.pdaturk.com/ 86819 Dogudaki Hayalet Pierre Loti http://www.pdaturk.com/ 22369 Doktor Faustus Cristopher Marlowe http://www.pdaturk.com/ 10926 Dost Kazanma Anonim http://www.pdaturk.com/ 29319 Dvs Kulb - 1 Chuck Palahniuk http://www.pdaturk.com/ 18126 Dvs Kulb - 2 Chuck Palahniuk http://www.pdaturk.com/ 18960 Duvar J.Paul Sartre http://www.pdaturk.com/ 46800 Dsnyorum yleyse Vurun Ilhan Seluk http://www.pdaturk.com/ 32203 Ecco Homo F. Nietzsche http://www.ayrinti.net/nietzsche 23329 EIendi ile Usagi Lev Tolstoy http://www.pdaturk.com/ 19727 Egitim Politikamiz Mahmut Adem http://www.pdaturk.com/ 27403 Ejderha Mizragi Ejderha Mizragi http://www.ankira.com/ 38755 Ekonomi Karisik http://www.IilozoI.tripod.com 8000 ElI Yildizi M. Weis&T.Hickman http://www.kitapperisi.com 88974 Emek... chp http://www.chp.org.tr 25389 Empedokles Holderlin http://www.pdaturk.com/ 19774 Enternasyonel Sule Bucak http://www.chp.org.tr 29250 Erzurum... Puskin http://www.kelepirkitap.com 30642 Evrim Kurami ve Bagnazlik Cemal Yildirim http://www.1001kitap.com/ 43245 Excel 2000 Kitapik Hakki cal http://www.pdaturk.com/ 21372 Faust Goethe http://www.pdaturk.com/ 12306 FelseIe Tarihi Karisik http://www.IilozoI.tripod.com 20409 FelseIenin Baslangi Ilkeleri Georges Politzer http://www.pdaturk.com/ 47165 FelseIi Kavramlar mer Sevingl http://www.kitaplik.com 19844 Fen gretimi Fitnat Kaptan http://www.meb.gov.tr/ 17102 Gelin Birlik Olalim Harun Yahya http://www.harunyahya.net 41865 Gelisim Psikolojisi Bekir Onur http://www.pdaturk.com/ 66439 Genlik Projesi chp http://www.chp.org.tr 5755 Gilgamis Destani MuzaIIer Ramazanoglu http://www.pdaturk.com/ 16355 Gz Ucuyla Dean Koontz http://www.pdaturk.com/ 181675
57 APPENDICE A (Cont`d.) Glsn ve Unutusun Kitabi Milan Kundera http://www.pdaturk.com/ 57485 Gnes lkesi Tommaso Campanella http://www.pdaturk.com/ 23682 Gnmz Basininda Kadin(lar) Leyla Simsek http://www.altkitap.com 37280 Gzel Konusma Anonim http://www.pdaturk.com/ 25853 Harry Potter 4 J.K.Rowling http://www.pdaturk.com/ 150000 Harry Potter-FelseIe Tasi J.K.Rowling http://www.pdaturk.com/ 56271 Harry Potter-Sirlar Odasi J.K.Rowling http://www.pdaturk.com/ 67309 Harry Potter-Zmrdanka Yoldasligi J.K.Rowling http://www.pdaturk.com/ 201465 Hastahane mit Sahin http://www.kaliteoIisi.com 30909 HAYATIM HARBIDEN ROMAN Mehmet Kartal http://www.pdaturk.com/ 66844 Hayatin Kkleri Mahlon Hoagland http://www.pdaturk.com/ 25621 Hayvan Mezarligi Stephen King http://www.kitap.perisi.com/ 87072 HedeI Trkiye Oktay Sinanoglu http://www.kitap.perisi.com/ 60768 Huzur A.Hamdi Tanpinar http://www.pdaturk.com/ 99636 Hrriyet'in Ilani Tarik Tunaya http://www.pdaturk.com/ 17377 Iphigeni Tauris'te Goethe http://www.pdaturk.com/ 13075 Irk ve Irkilik Dsncesi Alaeddin Senel http://www.pdaturk.com/ 41482 Icra ve IIlas Kanunu Serhat Yener http://www.pdaturk.com/ 62539 Idealizm, Matrix FelseIesi ve Maddenin Geregi Harun Yahya http://www.harunyahya.net 16781 Ikizlerin Sinavi M.Weis&T.Hickman http://www.pdaturk.com/ 73097 Ilham Veren ykler Anonim http://www.pdaturk.com/ 20048 Ilkgretimde Matematik gretimi Yasar Baykul http://www.meb.gov.tr/ 35500 Ilkgretimde lme Yasar Baykul http://www.meb.gov.tr/ 12858 Imparatorluk Isaac Asimov http://www.pdaturk.com/ 55362 Ince Memed Yasar Kemal http://www.kitaplik.com 86795 Insan Insana Dogan Cceloglu http://www.pdaturk.com/ 64187 Internet Hakki cal http://www.pdaturk.com/ 21715 Isa Gelecek Harun Yahya http://www.harunyahya.net 18719 Isa`nin Gelis Alametleri Harun Yahya http://www.harunyahya.net 49054 Isyan A.Altan http://www.kitap.perisi.com/ 106059 Jimi Hendrix CURTIS KNIGHT http://www.pdaturk.com/ 45513
58 APPENDICE A (Cont`d.) Jules Amcam Guy De Maupassant http://www.pdaturk.com/ 28379 Kayigim Rosinha http://www.pdaturk.com/ 40253 Kemalizm Sonrasinda Trk Kadini Nurer Ugurlu http://www.pdaturk.com/ 14399 Kili yarasi gibi A.Altan http://www.kitap.perisi.com/ 72969 Kirmizi Isikta Yrmek Erdal Atabek http://www.pdaturk.com/ 36772 Kirmizi Kpek Louis de Bernieres http://www.pdaturk.com/ 20929 Kizil Glge Kizil Glge http://www.ankira.com/ 12773 Kiralik Konak Y.Kadri Karaosmanoglu http://www.pdaturk.com/ 54124 Kitiaranin Oglu M.Weis&T.Hickman http://www.pdaturk.com/ 25043 Knulp Herman Hesse http://www.pdaturk.com/ 21454 Konusan KaItan KALMAN MIKSZATH http://www.pdaturk.com/ 24400 Konusmalar KonIyus http://www.kelepirkitap.com 24141 Korkun Bir Gece Anton ehov http://www.pdaturk.com/ 26615 Kral, Bilge ve Soytari http://www.pdaturk.com/ 44360 Kristal Parasi R.A.Salvatore http://www.pdaturk.com/ 84332 Kur'an http://www.kelepirkitap.com 113513 Kurtulus Savasi Sirasinda... B.Georghes Gaulis http://www.pdaturk.com/ 29996 Kutsal Kitap Harun Yahya http://www.harunyahya.net 80675 Kuzularin Sessizligi http://www.kitap.perisi.com/ 45989 Kk Dnyam LatiI Erdogan http://www.pdaturk.com/ 43979 Kltrn ABC'si Bozkurt Gven http://www.pdaturk.com/ 23552 Linux Nasil HOWTOs http://www.linux.org.tr/ 25717 Luthien'in Kumari R.A.Salvatore http://www.pdaturk.com/ 75465 LtIen Beni Anla Ipek Ongun http://www.pdaturk.com/ 69162 Macbeth Shakespeare http://www.pdaturk.com/ 15304 Mektuplar Platon http://www.pdaturk.com/ 15705 Memleketin Birinde Aziz Nesin http://www.haberbilgi.com 27425 Mezeler Anonim http://www.pdaturk.com/ 26350 Miras R.A.Salvatore http://www.pdaturk.com/ 76563 Miskinler Tekkesi R.Nuri Gntekin http://www.pdaturk.com/ 48939 Mukaddes Ankara'dan Mektuplar Kadriye Hseyin http://www.pdaturk.com/ 20886 Mzik gretimi Ali Uan http://www.meb.gov.tr/ 28071 Nadja Andre Breton http://www.pdaturk.com/ 21272
59 APPENDICE A (Cont`d.) Nkleer Enerji... AriI Knar http://ekitap.8m.com/ 11246 Okul ncesi Egitim Sengl Gen http://www.meb.gov.tr/ 20732 Oligarsi Vladimir Putin http://www.pdaturk.com/ 17000 grenmenin Olusumu Tlay stndag http://www.meb.gov.tr/ 25000 l Erkek Kuslar Inci Aral http://www.pdaturk.com/ 79782 P.Nikitin Ekonomi Politik http://www.pdaturk.com/ 90416 PAL SOKAGI OCUKLARI Ferenc Molnar http://www.pdaturk.com/ 38865 Peter Schemihl A.VON CHAMISSO http://www.pdaturk.com/ 16688 Pis Morugun Notlari Charles Bukowski http://www.pdaturk.com/ 38684 Pollyanna-1 ELEANOR H. PORTER http://www.pdaturk.com/ 93755 Psikolojik Danisma ve Rehberlik Hasan Tan http://www.meb.gov.tr/ 87000 Rama ile Bulusma Atrhur Clarke http://www.kitap.perisi.com/ 55053 Ramses:Isigin Oglu Christian |acq http://www.pdaturk.com/ 75271 Ramses-3 Kades Savasi Christian |acq http://www.pdaturk.com/ 71623 RavenloIt Chrstie Golden http://www.kitap.perisi.com/ 85354 RavenloIt 2 Chrstie Golden http://www.ankira.com/ 53704 Rehberlik ve Danisma Anonim http://www.pdaturk.com/ 81718 Resim Egitimi Hulusi Sezer http://www.meb.gov.tr/ 15205 Rudin Turgenyev http://www.pdaturk.com/ 35239 Sagduyu Jean Meslier http://www.agnostic.com.tr.tc 47823 Sari Odanin Esrari Gaston Leroux http://www.pdaturk.com/ 47320 Satran zerine Cabaplanca http://www.pdaturk.com/ 25031 Savasta Ne Yaptin Baba? 
Can Dndar http://www.pdaturk.com/ 20055 Sendika Sendika Duyuru http://www.sendika.org/ 148266 Sessiz Bir lm SIMONE DE BEAUVOIR http://www.pdaturk.com/ 20384 Sevil Berberi BEAUMARCHAIS http://www.pdaturk.com/ 15308 Sezar ve Kleopatra Bernard Shaw http://www.pdaturk.com/ 27602 Shannara`nin Kilici Terry Brooks http://www.kitap.perisi.com/ 153746 Simyaci Paolo Coelho http://www.pdaturk.com/ 28252 Siyasal Sistemler Taner Kislali http://www.1001kitap.com/ 56135 Son Ibni Sirac'in Servenleri CHATEAUBRIAND http://www.pdaturk.com/ 11192 Sorgulayan Denemeler Bertrand Russell http://www.pdaturk.com/ 52199 Sosyoloji Karisik http://www.IilozoI.tripod.com 13713 Sylev Atatrk http://turkbilim26.sitemynet.com 187435
60 APPENDICE A (Cont`d.) Suun Pii Mehmet Kartal http://www.pdaturk.com/ 23548 Suda Yan Ateste Bogul Charles Bukowski http://www.pdaturk.com/ 18190 Sudaki iz http://www.kitap.perisi.com/ 46279 Seker Portakali J.Mauro de Vasconcelos http://www.pdaturk.com/ 29946 Simdiki ocuklar Harika Aziz Nesin http://www.pdaturk.com/ 31627 T.C. Anayasasi Anonim http://www.pdaturk.com/ 18871 Talat Pasa'nin Hatiralari Talat Pasa http://www.pdaturk.com/ 31941 Tanzimat-i Hayriye Devri E.Ziya Karal http://www.pdaturk.com/ 26326 Tarih Karisik http://www.IilozoI.tripod.com 32000 Tirpan Fakir Baykurt http://www.pdaturk.com/ 89225 Tom Sawyer Mark Twain http://www.pdaturk.com/ 19190 Top Oynayan Kedi Magazasi H.De Balzac http://www.pdaturk.com/ 16689 Toplum Kalkinmasi RiIat Miser http://www.pdaturk.com/ 11739 Toprak Cengiz Aytmatov http://www.pdaturk.com/ 29930 Totem ve Tabu-1 Freud http://www.pdaturk.com/ 18389 Totem ve Tabu-2 Freud http://www.pdaturk.com/ 20239 Tk.Toplumu ve Dnyanin Gelecegi Alan Durning http://www.pdaturk.com/ 31003 Trk Ceza Kanunu Anonim http://www.pdaturk.com/ 103683 Tyrann Isaac Asimov http://www.pdaturk.com/ 49156 Ugursuz Miras HoIImann http://www.pdaturk.com/ 23951 Unutulmus Diyarlar Unutulmus Diyarlar http://www.ankira.com/ 8125 Kisa Oyun Luigi Pirandello http://www.pdaturk.com/ 10297 yk Gogol http://www.pdaturk.com/ 18977 VakiI ve Dnya Isaac Asimov http://www.pdaturk.com/ 104069 Werther Goethe http://www.pdaturk.com/ 30578 Yaban Y.Kadri Karaosmanoglu http://www.maximumbilgi.co m 53180 Yaban rdegi Henrik Ibsen http://www.pdaturk.com/ 26827 Yahudiler Lessing http://www.pdaturk.com/ 10731 Yalniz Gezerin Dslemleri J.J.Rousseau http://www.pdaturk.com/ 24850 Yaprak Dkm R.Nuri Gntekin http://www.pdaturk.com/ 28574 Yasak Iliski Barbara Taylor http://www.pdaturk.com/ 29163 Yazlik Dns Corlo Goldini http://www.pdaturk.com/ 15917
61 APPENDICE A (Cont`d.) Yemek TariIleri Anonim http://www.pdaturk.com/ 34116 Yeraltindan Notlar Dostoyevski http://www.pdaturk.com/ 30025 Yerdeniz Bycs Ursula Kroeber LeGuin http://www.pdaturk.com/ 46537 Yildizlarin Zamani Alan Lightman http://www.pdaturk.com/ 27150 Yneticinin Kilavuzu Coleman&Barrie http://www.pdaturk.com/ 27209 Yn-Kitap Sendika Duyuru http://www.sendika.org/ 14942 Yukari Mahalle John Steinbech http://www.maximumbilgi.co m 38848 Yksek Denetim Kurumlari Ihsan Gren http://www.tcmb.gov.tr/ 37228 Yzklerin EIendisi 1 - Yzk Kardesligi J.R.Tolkien http://www.kitap.perisi.com/ 139161 Yzklerin EIendisi 2 - Iki Kule J.R.Tolkien http://www.kitap.perisi.com/ 127339 Yzklerin EIendisi 3 - Kralin Dns J.R.Tolkien http://www.kitap.perisi.com/ 59621 Yzyillik Yalnizlik G.Marquez http://www.pdaturk.com/ 79582 ZeytinDagi Falih RiIki Atay http://www.pdaturk.com/ 29394
B. Turkish Alphabet

B.1 Lowercase Letters

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 a  b  c  ç  d  e  f  g  ğ  h  ı  i  j  k  l  m  n  o

19 20 21 22 23 24 25 26 27 28 29
 ö  p  r  s  ş  t  u  ü  v  y  z

Consonants: {b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z}
Vowels: {a, e, ı, i, o, ö, u, ü}

B.2 Uppercase Letters

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 A  B  C  Ç  D  E  F  G  Ğ  H  I  İ  J  K  L  M  N  O

19 20 21 22 23 24 25 26 27 28 29
 Ö  P  R  S  Ş  T  U  Ü  V  Y  Z

Consonants: {B, C, Ç, D, F, G, Ğ, H, J, K, L, M, N, P, R, S, Ş, T, V, Y, Z}
Vowels: {A, E, I, İ, O, Ö, U, Ü}
C. Turkish Language Specialities

Turkish is an agglutinative language, like Finnish and Hungarian. It belongs to the southwestern group of the Turkic family. Turkic languages are in the Uralic-Altaic language family. In agglutinative languages, words are formed by combining root words and morphemes. Word structures can grow by the addition of morphemes. Morphemes added to a stem can convert the word from a nominal to a verbal structure or vice versa.

Turkish has a very productive morphology. A root is combined with several suffixes, so a very high number of words can be produced from the same root. The lexicon may therefore grow to an unmanageable size.
A popular example of Turkish word formation is:
OSMANLILAŞTIRAMAYABİLECEKLERİMİZDENMİŞSİNİZCESİNE

This can be broken down into morphemes:
OSMANLI+LAŞ+TIR+AMA+(Y)ABİL+ECEK+LER+İMİZ+DEN+MİŞ+SİNİZ+CESİNE

In this example, one word in Turkish corresponds to a full sentence in English. It can be translated into English as 'as if you were of those whom we might consider not converting into an Ottoman'.
There are 29 letters in the Turkish language. Eight of them are vowels and twenty-one of them are consonants (see Appendix B).

Turkish has more vowels than many other languages. The vowels of Turkish can be classified in three groups according to their properties:

Front and back,
Rounded and unrounded,
High or low

The vowels can be partitioned in detail as below:

Back vowels: {a, ı, o, u}
Front vowels: {e, i, ö, ü}
Front unrounded vowels: {e, i}
Front rounded vowels: {ö, ü}
Back unrounded vowels: {a, ı}
Back rounded vowels: {o, u}
High vowels: {ı, i, u, ü}
Low unrounded vowels: {a, e}
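The partition above can be written down as three independent features per vowel. The sketch below is only an illustration; the set and function names are made up for the example, and treating 'o' and 'ö' as low vowels is an assumption that follows from the high/low split rather than from an explicit list in the text.

```python
BACK = set("aıou")
FRONT = set("eiöü")
ROUNDED = set("oöuü")
HIGH = set("ıiuü")


def describe_vowel(vowel):
    """Return the (backness, rounding, height) features of a Turkish vowel."""
    if vowel not in BACK | FRONT:
        raise ValueError(f"not a Turkish vowel: {vowel!r}")
    return (
        "back" if vowel in BACK else "front",
        "rounded" if vowel in ROUNDED else "unrounded",
        "high" if vowel in HIGH else "low",
    )


print(describe_vowel("ü"))  # ('front', 'rounded', 'high')
print(describe_vowel("a"))  # ('back', 'unrounded', 'low')
```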
Turkish word formation uses a number of phonetic harmony rules. When a suffix is appended to a stem, vowels and consonants change in certain ways.

C.1 Vowel Harmony

Vowel harmony is the best-known morphophonemic process in Turkish and its most interesting and distinctive feature. It is a left-to-right process that operates sequentially from syllable to syllable. Vowel harmony forces certain vowels in suffixes to agree with the last vowel of the stem or root they are affixed to. When a suffix is affixed to a stem, its vowels change according to the vowel harmony rules: the first vowel of the suffix changes according to the last vowel of the stem. Vowel harmony consists of two assimilations:
1. Palatal assimilation

This is called 'major vowel harmony'. It is common to almost all Turkic languages. This assimilation concerns the front/back feature of the vowels. The back vowels are the set {a, ı, o, u} and the front vowels are the set {e, i, ö, ü}.

If the vowel of the first morpheme in a word is a back vowel, then the vowels of the following morphemes are back vowels, e.g. askı + lar

'lar' is a plural suffix. 'ler', the other form of the plural suffix, is not used, because the vowels of the stem are back vowels.

If the vowel of the first morpheme in a word is a front vowel, then the vowels of the following morphemes are front vowels, e.g. ev + ler
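Palatal assimilation for the plural suffix can be sketched in a few lines. The function names are hypothetical, and the sketch assumes regular native words; it does not model exceptional words in which the harmony rule is not followed.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def last_vowel(word):
    """Return the last vowel of the word, or None if it has no vowel."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def plural(stem):
    """Attach 'lar' after a back vowel and 'ler' after a front vowel."""
    vowel = last_vowel(stem)
    if vowel is None:
        raise ValueError(f"no vowel in stem: {stem!r}")
    return stem + ("lar" if vowel in BACK_VOWELS else "ler")


print(plural("askı"))  # askılar
print(plural("ev"))    # evler
```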
Long vowels are 'â', 'ô', 'û'. These vowels generally occur in words of foreign origin.
Examples: saâtler (saatler), gôller (goller), usûller (usuller)

2. Labial assimilation

This is called 'minor vowel harmony'. This assimilation concerns the rounded/unrounded feature of the vowels.
Examples: l n, usul ün (usûl ün), topal ın, defter im, saat im (saât im)
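Taken together, the two assimilations select one of four high vowels for suffixes such as the possessive. The sketch below is a minimal illustration under stated assumptions: the placeholder letter 'I' in the suffix template and the function name harmonize are conventions invented for the example, and words with long vowels such as 'saat' are not handled.

```python
# Last stem vowel -> high vowel with the same backness and rounding.
HIGH_VOWEL = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def harmonize(stem, template):
    """Fill the placeholder 'I' in the suffix template and attach it to the stem."""
    for ch in reversed(stem):
        if ch in HIGH_VOWEL:
            return stem + template.replace("I", HIGH_VOWEL[ch])
    raise ValueError(f"no vowel in stem: {stem!r}")


print(harmonize("defter", "Im"))  # defterim
print(harmonize("topal", "In"))   # topalın
print(harmonize("okul", "Im"))    # okulum
```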
66 C.2 Consonant Harmonv
Consonant harmony is another basic aspect oI Turkish phonology. Consonants oI Turkish phonology can be classiIied into two main groups. These are voiceless and voiced. Voiceless consonants are ', I, h, k, p, s, s, t}. Voiced consonants are 'b, c, d, g, g, j, l, m, n, r, v, y, z}. Consonant harmony rules doesn`t Iormulate easily because oI irregular character oI borrowed and native words. There are some consonant harmony rules in Turkish:
II the end oI the word is one the voiceless consonants ('p, , t, k) then it changes to a corresponding voiced consonants ('b, c, d, g).
o 'p changes to 'b ( kitab im ). o 'd changes to 't ( ta(d)t tik ), but not every 'd changes, such as 'nad, 'soyad, etc. o 'k changes to 'g ( aya(k)g in ). o ' changes to 'c ( aga()c in ), but not every ' changes, such as 'g, 'a i, etc.
If a suffix starts with 'd' and the last consonant of the stem is one of {ç, f, h, k, p, s, ş, t}, 'd' is replaced with 't', e.g. yulaftan (yulaf + dan)
If the last consonant of the stem is one of {ç, f, h, k, p, s, ş} and the suffix begins with 'c', then 'c' is realized as 'ç', e.g. yaşça (yaş + ca)
If 'k' is at the end of the stem and is preceded by 'n', then 'k' becomes 'g', e.g. çelenk + e → çelenge
There are some exceptions to this rule, e.g. 'bank'.
If the final character of the stem is 'g' and the suffix begins with a vowel, then 'g' becomes 'ğ' in words of foreign origin, e.g. analog + a → analoğa
There are also some exceptions to this rule, e.g. 'lig', 'pedagog', etc.
If the final character of the stem is 'g' and the suffix begins with a consonant, then 'g' does not become 'ğ', e.g. bumerang + tan
If the final character of the stem is a vowel and the suffix begins with a vowel, then 'y' is inserted after the stem, e.g. akarsu + y + unuz
When certain suffixes are affixed to words of Arabic or Persian origin, the last consonant is doubled, e.g. zam + m + ı
In some Arabic-origin words ending in a vowel, the inserted consonant drops, as an exception to the general rule, e.g. camii (alongside camisi)
Many words have this property, e.g. 'mevki', 'cami', 'terfi', 'zayi', 'ikna', 'merci', etc.
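The stem-final voicing rules and the suffix-initial 'd'/'c' rules above can be combined in one attachment function. The following is a sketch under stated assumptions rather than the thesis implementation: `attach` and `NON_ALTERNATING` are hypothetical names, the exception set is a tiny illustrative sample, the same voiceless set (including 't') is used for both the 'd' and the 'c' rule, and the foreign-word 'g' rule, 'y' insertion, and consonant doubling are left out because they need lexical information.

```python
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")
STEM_FINAL_VOICING = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}
SUFFIX_INITIAL_DEVOICING = {"d": "t", "c": "ç"}

# Illustrative sample of stems that keep their final consonant (assumption).
NON_ALTERNATING = frozenset({"göç", "saç", "bank"})


def attach(stem, suffix, exceptions=NON_ALTERNATING):
    """Join stem and suffix, applying the consonant rules sketched above."""
    if not stem or not suffix:
        return stem + suffix
    final, first = stem[-1], suffix[0]
    if first in VOWELS:
        # Stem-final p, ç, t, k are voiced before a vowel-initial suffix.
        if final in STEM_FINAL_VOICING and stem not in exceptions:
            voiced = STEM_FINAL_VOICING[final]
            # 'k' preceded by 'n' becomes 'g' instead of 'ğ'.
            if final == "k" and stem[-2:-1] == "n":
                voiced = "g"
            stem = stem[:-1] + voiced
    elif first in SUFFIX_INITIAL_DEVOICING and final in VOICELESS:
        # Suffix-initial d/c becomes t/ç after a voiceless consonant.
        suffix = SUFFIX_INITIAL_DEVOICING[first] + suffix[1:]
    return stem + suffix


for stem, suffix in [("kitap", "ım"), ("ayak", "ın"), ("çelenk", "e"),
                     ("yulaf", "dan"), ("yaş", "ca")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

Because the irregular words cannot be predicted from their form, the sketch takes the exception set as a parameter; in practice it would be loaded from a root lexicon.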
C.3 Root Deformations

Turkish roots normally do not change. There are, however, some exceptional cases in which a root is deformed:
Root deformation is observed in personal pronouns. Examples: ben → bana, sen → sana
The wide vowel at the end of the stem is narrowed when the suffix 'yor' comes after a verb ending in 'a' or 'e', e.g. kapıyor (kapa + ı + yor)
When a suffix beginning with a vowel comes after certain nouns that have the vowel 'ı' or 'i' in their last syllable, this vowel drops. This generally occurs in nouns designating parts of the human body, e.g. ağzımız (ağız + ı + mız)
When the passive suffix 'ıl' / 'il' is affixed to some verbs whose last vowel is 'ı' or 'i', this vowel drops, e.g. ayrıl (ayır + ıl)
If a plural suffix is affixed to a compound word, the suffix comes before the possessive suffix at the end of the stem. Example: gözyaşı + lar → gözyaşları (not gözyaşılar)
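Three of the deformations above (narrowing before 'yor', vowel drop, and plural placement in compounds) can be sketched as small string functions. This is an illustrative sketch only: all function names are hypothetical, the lexicon of vowel-dropping stems must be supplied by the caller, and `pluralize_compound` assumes the caller already knows that the word is a compound ending in a bare possessive vowel. The pronoun forms (ben → bana) are irregular and would simply be listed.

```python
BACK = set("aıou")
ROUNDED = set("oöuü")
VOWELS = set("aeıioöuü")


def last_vowel(text):
    """Return the last vowel in text, or None if there is none."""
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    return None


def narrow_vowel(reference):
    """Pick ı, i, u or ü so that it harmonizes with the reference vowel."""
    if reference in BACK:
        return "u" if reference in ROUNDED else "ı"
    return "ü" if reference in ROUNDED else "i"


def add_yor(verb_stem):
    """Attach 'yor', narrowing a stem-final wide vowel (a, e)."""
    if verb_stem[-1] in "ae":
        body = verb_stem[:-1]
        # Harmonize with the preceding vowel; fall back to the dropped
        # vowel itself for one-syllable stems such as "de".
        reference = last_vowel(body) or verb_stem[-1]
        return body + narrow_vowel(reference) + "yor"
    if verb_stem[-1] in VOWELS:
        return verb_stem + "yor"
    # Consonant-final stem: insert a harmonizing narrow vowel.
    return verb_stem + narrow_vowel(last_vowel(verb_stem)) + "yor"


def attach_with_vowel_drop(stem, suffix, lexicon):
    """Drop the last-syllable vowel of listed stems before a vowel."""
    if (stem in lexicon and suffix[:1] in VOWELS
            and len(stem) >= 3 and stem[-2] in VOWELS):
        stem = stem[:-2] + stem[-1]
    return stem + suffix


def pluralize_compound(compound):
    """Place the plural before the final possessive vowel of a compound."""
    if compound[-1] not in "ıiuü":
        raise ValueError("expected a compound ending in a possessive vowel")
    base = compound[:-1]
    back = last_vowel(base) in BACK
    return base + ("lar" if back else "ler") + ("ı" if back else "i")


print(add_yor("kapa"))
print(attach_with_vowel_drop("ağız", "ımız", {"ağız"}))
print(pluralize_compound("gözyaşı"))
```

The vowel drop is lexicon-driven because only some nouns and verbs undergo it; a stem such as "kız" keeps its vowel (kızımız).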