
School of Information Technology, NUM, Mongolia Thai Computational Linguistics Laboratory

Linguistic Issues in Mongolian Language Technology

Purev Jamai
School of Information Technology
National University of Mongolia

purev@num.edu.mn

The School of Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development (Aug 2006)

Outline
• Introduction
• Linguistic characteristics of Mongolian language
• Mongolian scripts
• Computing standards and language technology
• Future plan


Introduction > Mongolia

• Population: 2.5 million
• Capital city: Ulaanbaatar (1 million)
• Main religion: Buddhism (95%)
• Area: 1,565,000 sq. km
• Climate: from -30 °C to +30 °C


Introduction > National University of Mongolia

• Established: 1942
• Total students: ~10,000
• 12 faculties
• School of IT
  • Established: 2002
  • Total students: ~500


Introduction > ICT Background in Mongolia

• Current situation:
  • All 21 town centers (countryside) are connected to each other by a 10 Mbps VSAT network.
  • Almost all of the 9 ISPs are joint ventures with companies from the USA, Japan, Korea, and Russia.
• Local ISP structure:
  • Fiber connection: China – Mongolia – Russia
  • Satellite (USA, Hong Kong, Intelsat)


Introduction > ICT Background in Mongolia

• Internet users: an estimated 2% of the population.
• 80% use dial-up connections.
• Web hosting: approx. 1,500 running.
• 30 companies work in software development.
• 11 of these companies work on software outsourcing (usually to Japan).
• ICT is concentrated in Ulaanbaatar.
• Poor telephone, PC, and Internet penetration.
• Lack of online content in the local language.


Linguistic characteristics of Mongolian language > Introduction to Mongolian

• Belongs to the Ural-Altaic family of languages
• Most closely related to the Tungusic and Turkic languages
• Main Mongolian dialects:
  • Halh (the modern Mongolian language)
  • Buriad
  • Oirat
  • Tsahar
  • Harchin
  • Ordos
  • Others


Linguistic characteristics of Mongolian language > Introduction to Mongolian

• Approximately 5 million speakers:
  • In Mongolia (2.5 million)
  • In China (1.5 million)
  • In Afghanistan (?)
  • In Russia (0.5 million)


Linguistic characteristics of Mongolian language > Introduction to Mongolian

• Chief characteristics of the Mongolian language:
  • Subject-Object-Verb (SOV) word order
  • Absence of grammatical gender
  • Vowel harmony
  • Agglutination
  • Modifiers always precede the modified word (head).
  • Mongolian nominalized verbs are much more active syntactically.
  • There is hardly any difference between nouns and adjectives.
  • There is a very limited plural system.


Linguistic characteristics of Mongolian language > Phonology

• 47 phonemes
  • 14 vowels: 7 short, 7 long
  • 33 consonants


Linguistic characteristics of Mongolian language > Phonology

• SYLLABLES (C = consonant, V = vowel)
• Genuine Mongolian words use the following syllable structures:
  • V syllables
  • VC syllables
  • CV syllables
  • CVC syllables


Linguistic characteristics of Mongolian language > Phonology

• Through the Cyrillic script, the following artificial syllables came into use:
  • VCC syllables
  • CVCC syllables
  • VCCC syllables
  • CVCCC syllables
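The native and Cyrillic-era syllable shapes listed on the Phonology slides can be checked mechanically. The following is a minimal Python sketch, not part of the original talk. The function name `classify_syllable_shape` and the labels it returns are illustrative assumptions, and the input is an abstract C/V template rather than real Mongolian text.

```python
# Syllable templates taken from the slides (C = consonant, V = vowel).
NATIVE_SHAPES = {"V", "VC", "CV", "CVC"}
CYRILLIC_ERA_SHAPES = {"VCC", "CVCC", "VCCC", "CVCCC"}


def classify_syllable_shape(shape: str) -> str:
    """Classify a C/V template as 'native', 'cyrillic-era', or 'unattested'.

    The three labels are hypothetical names chosen for this sketch.
    """
    shape = shape.upper()
    if not shape or set(shape) - {"C", "V"}:
        raise ValueError("shape must be a non-empty string of C and V")
    if shape in NATIVE_SHAPES:
        return "native"
    if shape in CYRILLIC_ERA_SHAPES:
        return "cyrillic-era"
    return "unattested"


if __name__ == "__main__":
    for template in ("V", "CVC", "CVCC", "CCV"):
        print(template, "->", classify_syllable_shape(template))
```

Note that every listed shape has exactly one vowel and at most one initial consonant; the Cyrillic-era shapes differ only in allowing two or three final consonants.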


Linguistic characteristics of Mongolian language > Segmentation

• Mongolian words are relatively easy to detect in text, since a space is placed between them.
• A sentence begins with a capital letter.
• A full stop (.) marks the end of a sentence.
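The segmentation conventions above (spaces between words, a full stop at the end of a sentence) suggest a very simple baseline segmenter. The sketch below is an illustration under those assumptions only; the function names `split_sentences` and `split_words` are hypothetical, and the approach ignores abbreviations, decimal numbers, and other uses of the full stop.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after each full stop that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence: str) -> list[str]:
    """Split a sentence on whitespace and strip surrounding punctuation."""
    words = (token.strip(".,!?;:") for token in sentence.split())
    return [word for word in words if word]


if __name__ == "__main__":
    # Example sentences: "I read a book. He/she is at home."
    sample = "Би ном уншсан. Тэр гэртээ байна."
    for sentence in split_sentences(sample):
        print(split_words(sentence))
```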


Linguistic characteristics of Mongolian language > Word structure

• Mongolian is agglutinative in word structure: grammatical features are indicated by adding suffixes to roots or stems.
• Word structures can be long.
• Mongolian words are subject to many morpho-phonemic rules and exceptions, which make morphemic analysis difficult.
• The order of the structural constituents of a Mongolian word:


Linguistic characteristics of Mongolian language > Word structure

• Example of a "train" describing the word-building process (Ref: 10):


Linguistic characteristics of Mongolian language > Morphology

• Mongolian morphology consists of:
  • Noun
  • Prowords
  • Verb
  • Adwords
  • Postposition
  • Conjunction
  • Interjection


Linguistic characteristics of Mongolian language > Morphology > Noun

• Characteristics of nouns:
  • Nouns have no grammatical gender
  • Nouns are declined with 8 case suffixes
  • The nominative case has no suffix
  • The plural form is never used if plurality is already clear from the context
• Plurality can be expressed by:
  • numerals
  • repeated words
  • verbs
  • quantitative words
  • abstract ideas


Linguistic characteristics of Mongolian language > Morphology > Verb

• No person or number suffixes
• Voice and aspect have many more possibilities than in English
• Mood has more subgroups than in English
• There are no irregular verbs in Mongolian
• Some vowels are dropped when a suffix is added; in other cases a vowel must be inserted.


Linguistic characteristics of Mongolian language > Morphology > Verb

• Verb structure:
• There are almost 200 verb suffixes


Linguistic characteristics of Mongolian language > Syntax

• Subject-Object-Verb (SOV) order
• The order of phrases may be changed freely
• The predicate is obligatory
• Postpositions rather than prepositions
• Gender is not distinguished
• Nouns can be used as predicates
• The functions of some English prepositions, such as of, from, and to, are performed in Mongolian by case suffixes.
• The functions of other English prepositions (e.g. for, before, against) are performed by postpositions, which follow the word they govern.

Linguistic characteristics of Mongolian language > Syntax > Syntactic Pattern

S → (NP) VP
NP → (D) (ADJ*) N
VP → (NP) (ADV*) V (AUX)
PP → NP P
CP → S (C)

AUX: Auxiliary verb
P: Postposition
CP: Complement sentence
C: Conjunction
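Because the rewrite rules above are not recursive as written, they can be flattened into regular expressions over part-of-speech tags. The sketch below is one possible illustration, not the author's parser. It assumes the input is already a list of tags using the slide's symbols (D, ADJ, N, ADV, V, AUX, P, C), and the helper name `matches` is hypothetical.

```python
import re

# Each tag is followed by one space so patterns always consume whole tags.
NP = r"(?:D )?(?:ADJ )*N "                 # NP -> (D) (ADJ*) N
VP = rf"(?:{NP})?(?:ADV )*V (?:AUX )?"     # VP -> (NP) (ADV*) V (AUX)
S = rf"(?:{NP})?{VP}"                      # S  -> (NP) VP
PP = rf"{NP}P "                            # PP -> NP P
CP = rf"{S}(?:C )?"                        # CP -> S (C)


def matches(rule: str, tags: list[str]) -> bool:
    """Return True if the whole tag sequence is licensed by the rule."""
    encoded = "".join(tag + " " for tag in tags)
    return re.fullmatch(rule, encoded) is not None


if __name__ == "__main__":
    print(matches(S, ["N", "N", "V"]))   # subject, object, verb
    print(matches(S, ["N", "V", "N"]))   # verb-medial order is rejected
    print(matches(PP, ["N", "P"]))       # noun followed by a postposition
```

The verb-final requirement of the VP rule is what rejects the second sequence, which mirrors the SOV order described on the Syntax slide.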


Mongolian scripts > Introduction

• The Mongols have created or used at least 10 scripts, including:
  Phagspa Script (13th century),
  Tod Script (17th century),
  Soyombo Script (17th century),
  Vagindra Script (early 20th century).
• Some obsolete scripts: Phagspa script


Mongolian scripts > Introduction

• Today, the Mongolian language uses two official scripts:
  • The (new) Cyrillic Mongolian Script (Cyrillic Script for short)
  • The (old) Mongolian Script


Mongolian scripts > (old) Mongolian Script

• Adapted in the 12th century from the Uighur alphabet, which derives from Sogdian letters of Aramaic origin; used in Mongolia until 1941
• Replaced first by the Latin alphabet (in 1931), and then by Cyrillic
• In 1990, the (old) Mongolian script was restored to official use by the government.
• In the meantime, the (old) Mongolian alphabet continued to be used in Inner Mongolia, a part of China.

Mongolian scripts > (old) Mongolian Script

• The traditional (old) Mongolian characters are allocated a block of 56 characters in the Unicode range U+1800 to U+18AF
• There are 7 basic vowels and 27 consonants
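The code point range quoted above makes it straightforward to test whether a character belongs to the traditional Mongolian block. The following Python sketch uses only the standard `unicodedata` module; the helper names `in_mongolian_block` and `describe` are assumptions made for this illustration.

```python
import unicodedata

# Range quoted on the slide: 1800 to 18AF (hexadecimal), inclusive.
MONGOLIAN_BLOCK = range(0x1800, 0x18AF + 1)


def in_mongolian_block(ch: str) -> bool:
    """Return True if the single character lies in U+1800..U+18AF."""
    return ord(ch) in MONGOLIAN_BLOCK


def describe(text: str) -> list[tuple[str, str, bool]]:
    """List (code point, Unicode name, in-block flag) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), in_mongolian_block(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # One traditional Mongolian letter and one Cyrillic letter for contrast.
    for row in describe("\u1820\u0430"):
        print(row)
```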


Mongolian scripts > (old) Mongolian Script

• Notable features of this alphabet are the following:
  • It is a phonemic alphabet with separate letters for consonants and vowels.
  • It is written vertically from top to bottom, with columns running from left to right.
  • The letters have a number of different shapes; the choice depends on the position of a letter in a word and on which letter follows it.


Mongolian scripts > the Cyrillic Script

• The (new) Cyrillic Mongolian Script
• The Mongolian Cyrillic alphabet slightly modifies the Russian alphabet by adding 2 letters: Өө /ö/ and Үү /ü/
• 35 letters (35 small, 35 capital):
  • 13 vowels, 20 consonants, and 2 signs
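The letter counts above can be reproduced by starting from the 33-letter Russian alphabet and adding the two extra letters. The sketch below is illustrative only. The placement of ө after о and ү after у follows the usual Mongolian alphabet order, and counting й among the vowels is an assumption made here so that the totals match the slide's 13 vowels, 20 consonants, and 2 signs.

```python
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33 letters

# Assumption: й is counted with the vowels to reach the slide's total of 13.
VOWELS = set("аэиоуөүяеёюый")
SIGNS = set("ъь")


def build_mongolian_alphabet() -> str:
    """Insert ө after о and ү after у in the Russian alphabet."""
    letters = []
    for ch in RUSSIAN_ALPHABET:
        letters.append(ch)
        if ch == "о":
            letters.append("ө")
        elif ch == "у":
            letters.append("ү")
    return "".join(letters)


def count_classes(alphabet: str) -> dict[str, int]:
    """Count vowels, signs, and the remaining letters (consonants)."""
    vowels = sum(ch in VOWELS for ch in alphabet)
    signs = sum(ch in SIGNS for ch in alphabet)
    return {"vowels": vowels, "consonants": len(alphabet) - vowels - signs, "signs": signs}


if __name__ == "__main__":
    alphabet = build_mongolian_alphabet()
    print(len(alphabet), alphabet.upper())
    print(count_classes(alphabet))
```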


Mongolian scripts > Main Differences Between Mongolian and Cyrillic Scripts


Computing standards and Language technology > Character set and encoding

• For the Cyrillic character set:
  • Microsoft code page 866 and Windows code page 1251
  • ISO 8859-5 and KOI8
• Keyboard drivers are available for ASCII and Unicode
• The character set is available on Microsoft and Linux platforms
• Many fonts for Mongolian Cyrillic are available


Computing standards and Language technology > Keyboard

• Microsoft Windows XP provides built-in, Cyrillic-based support for the Mongolian keyboard


Computing standards and Language technology > Keyboard

• Microsoft Windows XP provides support for the Mongolian (Cyrillic) locale (mn/MN)
• Accessing Google with the Mongolian locale


Computing standards and Language technology > Interface Terminology Translation

• A Mongolian Linux OS named Soyombo, GNOME 2.2, KDE 3.1, and the Mozilla Firefox internet browser have been completely and successfully developed by the OpenMN research group.
• The team also translated 100% of the GNOME 2.2 GUI into Mongolian, but GNOME only partially supports the Mongolian language.
• GNOME 9.0 is almost finished (about 95%), and the latest version of OpenOffice is being translated by Newcom Systems Inc.
• Translation of IT terms is in progress. The project covers roughly 10 thousand terms.


Computing standards and Language technology > Status of Advanced Applications

• A Mongolian Unicode keyboard driver is supported by GNOME version 2.14
• Charset Converter V1.0 is an online tool that converts Cyrillic Mongolian text in Win1251 encoding to Cyrillic Mongolian Unicode text and vice versa
• Converter from Win1251 to Unicode (UTF-8) on the Unix/Linux platform: a program that converts Mongolian cp1251-encoded text to and from UTF-8 encoded text
• Concordancer for a memory-based dictionary of Mongolian toponymy
• The first version of the online English-Mongolian dictionary was built in 2001 (vocabulary expanded to 5,000/6,000 English and Mongolian words)
• Mongolian spell checker for Microsoft Word: 15,000 root words and around 120 different suffixes
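The two Win1251 converters listed above are not shown in the slides, so the following Python sketch only illustrates the general idea. Standard code page 1251 has no slots for Ө and Ү. This sketch assumes a legacy-font convention in which the code points of Є, є, Ї, ї were reused for Ө, ө, Ү, ү. That mapping, and the function names, are assumptions and may not match the tools named above.

```python
# Assumed legacy-font convention: four cp1251 letters stand in for Ө ө Ү ү.
LEGACY_TO_UNICODE = str.maketrans({"Є": "Ө", "є": "ө", "Ї": "Ү", "ї": "ү"})
UNICODE_TO_LEGACY = str.maketrans({"Ө": "Є", "ө": "є", "Ү": "Ї", "ү": "ї"})


def win1251_to_utf8(data: bytes) -> bytes:
    """Decode cp1251 bytes, restore Ө/Ү, and re-encode as UTF-8."""
    text = data.decode("cp1251").translate(LEGACY_TO_UNICODE)
    return text.encode("utf-8")


def utf8_to_win1251(data: bytes) -> bytes:
    """Decode UTF-8 bytes, map Ө/Ү to their stand-ins, and encode as cp1251."""
    text = data.decode("utf-8").translate(UNICODE_TO_LEGACY)
    return text.encode("cp1251")


if __name__ == "__main__":
    legacy = "їзэг".encode("cp1251")  # legacy spelling of "үзэг" (pen)
    print(win1251_to_utf8(legacy).decode("utf-8"))
```

A side effect of this convention is that genuine Ukrainian Є or Ї in the input would also be rewritten, so the sketch suits only text known to be Mongolian.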


Future plan > Priority Tasks

• We have attempted some research on:
  • Modelling Mongolian word structure with a finite-state machine (PC-KIMMO based)
  • Constructing a prototype structure for a Mongolian language corpus
  • Analyzing Mongolian sentence structure with Tree Adjoining Grammars
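To make the first task above concrete, here is a toy finite-state analyzer for noun forms, written in plain Python. It is not PC-KIMMO and has no two-level rules: the tiny lexicon, the state names, and the function `analyze` are all hypothetical, and the machine ignores vowel harmony and vowel dropping, so it accepts some forms that are not valid Mongolian.

```python
# Toy lexicon (illustrative only): two roots, two plural and four case suffixes.
ROOTS = ("ном", "гэр")                 # "book", "home"
PLURALS = ("ууд", "үүд")
CASES = ("ын", "ийн", "оос", "ээс")    # genitive and ablative variants

# state -> list of (next state, morphemes that license the transition)
TRANSITIONS = {
    "START": [("ROOT", ROOTS)],
    "ROOT": [("PLURAL", PLURALS), ("CASE", CASES)],
    "PLURAL": [("CASE", CASES)],
    "CASE": [],
}
ACCEPTING = {"ROOT", "PLURAL", "CASE"}


def analyze(word: str, state: str = "START") -> list[list[tuple[str, str]]]:
    """Return every segmentation of word that ends in an accepting state."""
    if not word:
        return [[]] if state in ACCEPTING else []
    parses = []
    for next_state, morphemes in TRANSITIONS[state]:
        for morpheme in morphemes:
            if word.startswith(morpheme):
                for rest in analyze(word[len(morpheme):], next_state):
                    parses.append([(next_state, morpheme)] + rest)
    return parses


if __name__ == "__main__":
    for form in ("ном", "номууд", "номуудын", "гэрээс"):
        print(form, analyze(form))
```

A realistic analyzer would need the morpho-phonemic rules mentioned on the Word structure slide, which is where a two-level system such as PC-KIMMO comes in.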


Future plan > Priority Tasks

• Problems:
  • The Mongolian language is not always as well defined as computational modeling requires, so it is necessary to build fundamental research resources.
  • Our attempts at computer-aided Mongolian language processing have not yet produced the expected outcome, because such work takes time and costs money.
  • Our researchers are also not yet well experienced in this kind of study.
  • We lack skilled human resources.
• The goal of our research proposal for the coming years is to build a corpus of the Mongolian language that is freely usable and easily exploitable for computer-based processing of Mongolian.


Thank you for your attention!

Acknowledgement to Dr. Virach Sornlertlamvanich for his support and co-operation
