depend on language models that are trained on text. This survey paper describes how to normalize Hindi text using standard rules. We studied papers on text normalization for different languages such as Bangla, Chinese and French, and for Hindi as well. The different papers used different techniques.

Therefore, apart from Indian-language words, we should also be able to handle proper names and English words transliterated in Indian languages, since they form a substantial percentage of words. For example, we found widely used spelling variations for the Hindi word `angrezi', as shown below.
Text Segmentation: Some written languages, like Chinese, Japanese and Thai, do not have single-word boundaries, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.

Word Sense Disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context.

Hindi Text Normalization: Many words in the Hindi language are written in different ways. We work on the normalization of those Hindi words.

Syntactic Ambiguity: The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information. Specific problem components of syntactic ambiguity include sentence boundary disambiguation.

1.1 Main cause behind Hindi Text Normalization:-

Normalization is required for Hindi text because of the Devanāgarī (देवनागरी) script. Devanāgarī is an abugida script used to write several Indo-Aryan languages, including Sanskrit, Hindi, Gujarati, Marathi, Sindhi, Bihari, Bhili, Marwari, Konkani, Bhojpuri, Pahari (Garhwali and Kumaoni), Santhali, Nepali, Newari, Tharu and sometimes Kashmiri and Romani. The Devanāgarī writing system can be called an abugida, as each consonant has an inherent vowel (a), which can be changed with the different vowel signs. Devanāgarī is written from left to right. A top line linking the characters is thought to represent the line of the page, with characters historically being written under the line. In Sanskrit, words were originally written together without spaces, so that the top line was unbroken, although there were some exceptions to this rule. The break of the top line primarily marks breath groups. In modern languages, word breaks are used. When reading Sanskrit written in Devanāgarī, the pronunciation is completely unambiguous. Similarly, any word in Sanskrit is considered to be written in only one manner (discounting modern typesetting variations in depicting conjunct forms). However, for modern languages, certain conventions have been made (e.g. truncating the vowel form of the last consonant of a word while speaking, even as it continues to be written in full form). There are also some modern conventions for writing English words in Devanāgarī.

2. NEED OF TEXT NORMALIZATION

2.1 Normalization of Non-standard Words

Real text contains a variety of "non-standard" token types, such as digit sequences; words, acronyms and letter sequences in all capitals; mixed-case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and e-mail addresses. Many of these kinds of elements are pronounced according to principles that are quite different from the pronunciation of ordinary words. Furthermore, many items have more than one plausible pronunciation, and the correct one must be disambiguated from context: "IV" could be "four", "fourth", "the fourth", or "I.V.". Normalizing or rewriting such text using ordinary words is an important issue for several applications; more sophisticated text normalization will be an important tool for utilizing the vast amounts of on-line text resources. Normalized text is likely to be of specific benefit in information extraction applications.

2.2 Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English and many other languages which are pronounced differently based on context. For example, "IV" could be "four", "fourth", "the fourth", or "I.V.".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective.
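The kinds of non-standard tokens listed above (digit sequences, Roman numerals, mixed-case words, URLs, e-mail addresses) can be detected with simple pattern matching. The sketch below is ours, not taken from any of the surveyed systems, and the class names are illustrative; note that a tag such as ROMAN still leaves the actual reading ("four" vs. "fourth" vs. "I.V.") to be disambiguated from context.

```python
import re

# Illustrative regular expressions for coarse non-standard-word (NSW) classes.
# Order matters: e.g. "IV" should be tried as a Roman numeral before it can
# fall through to the all-capitals class.
PATTERNS = [
    ("URL",     re.compile(r"^(?:https?://|www\.)\S+$")),
    ("EMAIL",   re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")),
    ("ROMAN",   re.compile(r"^[IVXLCDM]+$")),
    ("DIGITS",  re.compile(r"^\d+$")),
    ("MIXED",   re.compile(r"^[A-Za-z0-9]*[a-z][A-Z][A-Za-z0-9]*$")),  # WinNT, SunOS
    ("ALLCAPS", re.compile(r"^[A-Z]{2,}$")),
    ("WORD",    re.compile(r"^[A-Za-z]+$")),
]

def classify(token):
    """Return the first matching coarse class for a token, or OTHER."""
    for tag, pattern in PATTERNS:
        if pattern.match(token):
            return tag
    return "OTHER"
```

For example, `classify("WinNT")` yields MIXED and `classify("1960")` yields DIGITS; a second, context-sensitive stage would then choose the expansion.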
For Bangla text normalization, a lexical analyser was designed to tokenize each NSW by regular expressions using the tool JFlex. A tag was assigned to each token according to its semiotic class. The outputs of the tokenization are then used in the next step, i.e. the token expander: according to the assigned tag, token verbalization and disambiguation are performed by the token expander. In semiotic class identification, a set of semiotic classes belonging to the Bangla language was identified. To do this, a news corpus, a forum and a blog were selected, and two steps were followed to identify the semiotic classes: (i) a Python [4] script was used to identify the semiotic classes in the news corpus, and these were manually checked against the forum and blog; (ii) a set of rules was defined according to the context of homographs or ambiguous tokens. The result is a set of semiotic classes in Bangla text, as shown in Table 1.

Yuxiang Jia, Dezhi Huang, Wu Liu, Yuan Dong, Shiwen Yu, Halia Wang (2008) develop a taxonomy of NSWs on the basis of a large-scale Chinese corpus, and propose a two-stage NSW disambiguation strategy: Finite State Automata (FSA) for initial classification and Maximum Entropy (ME) classifiers for subclass disambiguation. Typical methods for text normalization are based on handcrafted rules, but such hand-crafted rules are difficult to write, maintain and adapt to new domains. On the other hand, for homograph disambiguation, many machine learning methods have been employed and have shown their advantages. Decision trees and decision lists are used in English and Hindi text normalization [1]. The text normalization approach proposed in their paper does not need a word segmentation process: finite state automata detect NSWs in the real text and make an initial classification, and then maximum entropy classifiers are used for further classification. The process flow is outlined in Fig. 1.
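As a toy illustration of this two-stage idea (our sketch, not the authors' code), the first stage below uses a regular expression in place of a finite-state acceptor to flag a four-digit NUM token, and the second stage uses a small maximum-entropy-style (log-linear) scorer over context features to pick the subclass. The features and weights are hand-set for illustration; a real ME classifier would estimate them from a labelled corpus.

```python
import math
import re

# Stage 1: finite-state-style acceptor (here a regex) detects 4-digit tokens.
FOUR_DIGITS = re.compile(r"^\d{4}$")

# Stage 2: log-linear scorer. Feature/label weights are illustrative only.
WEIGHTS = {
    ("prev=in", "YEAR"): 2.0,      # "in 1960" suggests the year reading
    ("prev=of", "CARDINAL"): 1.5,  # "of 1960" suggests the cardinal reading
    ("bias", "CARDINAL"): 0.5,
}

def subclassify(token, prev_word):
    """Return YEAR or CARDINAL for a 4-digit token, else None."""
    if not FOUR_DIGITS.match(token):
        return None
    feats = (f"prev={prev_word}", "bias")
    scores = {
        label: sum(WEIGHTS.get((f, label), 0.0) for f in feats)
        for label in ("YEAR", "CARDINAL")
    }
    total = sum(math.exp(s) for s in scores.values())
    probs = {label: math.exp(s) / total for label, s in scores.items()}
    return max(probs, key=probs.get)
```

Running `subclassify("1960", "in")` picks YEAR, while `subclassify("1960", "of")` picks CARDINAL, mirroring how the second stage resolves what the first stage only detects.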
The final module converts each NSW into standard words: the input of this module is the NSW itself and its class tag, and the output is its corresponding Chinese words. The conversion is a one-to-one correspondence, and finite state transducers are applicable here. The paper makes an extensive investigation of Chinese text normalization. An NSW taxonomy is developed based on a large-scale corpus. After a systematic analysis of the taxonomy, a two-stage NSW classification strategy is proposed: finite state automata for initial classification and maximum entropy classifiers for further classification. Experimental results show that this approach achieves good performance and generalizes well to new domains. In addition, the approach is character-based, with no need for a word segmentation preprocess. However, some errors occur in the experiments, such as number sequence errors [7].

Text normalization may also be treated in a similar way to Machine Translation: the tools and algorithms developed for Machine Translation may be used for text normalization, with the "spoken language" being treated as the target language [10].

Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng, Tie-Jun Zhao (2007) describe the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization, they mean converting 'informally inputted' text into a canonical form, by eliminating 'noises' in the text and detecting paragraph and sentence boundaries. Previously, text normalization issues were often undertaken in an ad-hoc fashion or studied separately; traditionally, text normalization is viewed as an engineering issue and is conducted in a more or less ad-hoc manner. Their paper first gives a formalization of the entire problem and then proposes a unified tagging approach to perform the task using Conditional Random Fields (CRF).

K. Panchapagesan, Partha Pratim Talukdar, N. Sridhar Krishna, Kalika Bali, A.G. Ramakrishnan (2004) proposed a novel approach to text normalization, wherein tokenization and initial token classification are combined into one stage, followed by a second level of token sense disambiguation [2]. The architecture of the proposed approach is shown in Figure 1. Tokenization and initial token classification are performed using a lexical analyzer that is derived from various token definitions in the form of regular expressions. For the second level of token sense disambiguation, the category of each token needs to be identified. This is accomplished by the lexical analyser itself when there is no ambiguity arising from the format(s) of the token; in case of ambiguity, the token with its possible token types is output to facilitate further disambiguation. Identification of the token category involves a high degree of ambiguity. For example, '1960' could be of the type 'Year' or of the type 'Cardinal Number', and '1.25' could be of the type 'Float' or of the type 'Time'. Disambiguation is generally handled by hand-crafted context-dependent rules; the authors instead used decision-tree-based data-driven techniques to address this. When a token is input to the tree for disambiguation, a decision is made by traversing the tree starting from the root node, taking the paths that satisfy the conditions at intermediate nodes, until a leaf node is reached. Decision lists are a special class of decision trees; they can be used for representing a wide range of classifiers. A decision list can be viewed as a hierarchy of rules: when a classification is needed, the first rule in the hierarchy is addressed; if that rule fails to classify, the next rule is addressed, and so on. They are basically if-then-else statements [3]. For example:

if condition1 (x) is true then output = output1 (x)
else if condition2 (x) is true then output = output2 (x)
. . .
else output = default_output (x)

Figure 3.3 Stages involved in Text Normalization.

Steve Atwell, Hahn Koo, Liam Moran, and Tae-Jin Yoon (2004) describe the implementation of a system in the programming language Python that normalizes texts into a form that resembles how a human might read them out loud. Classified ads are the target domain of the normalization system. For a given input text, a sequence of letters can either be treated as a single token or as a group of subtokens that can be split further. For example, character sequences are usually divided into subtokens using the presence of commas, hyphens, or slashes within them. The algorithm in the Perl splitter is that the input texts are first bracketed into groups, and then the texts are split at split points while keeping the groups intact. Next is the classifier, which operates in two stages. The first stage handles all tags except for ASWD, EXPN, and LSEQ, which were combined into a single ALPHA tag. If a token is tagged as ALPHA, the alphabetic classifier classifies the token as one of ASWD, EXPN, or LSEQ. The classifier generates features for a token using the two preceding tokens, the token itself, and the two following tokens. The general classifier has 128 features. Other features include the length of the token and context tokens, whether the target and context tokens have proper-name capitalization, the number of splits the splitter would make on the target token, and some very basic context disambiguation scoring. The result is not satisfactory due to the performance of the classifier, since the low recall rate evidently lowers the accuracy rate [4].

Gilles Adda, Martine Adda-Decker, Jean-Luc Gauvain, Lori Lamel (1997) describe a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition, different lexical coverages and language model perplexities can be measured, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with the corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization used to create language models for the recognition experiments with read newspaper texts was based on these findings. The best system configuration obtained an 11.2% word error rate in the AUPELF 'French-speaking' speech recognizer evaluation test held in February 1997. Two large French dictionaries were used: BDLEX and DELAF [6].

4. RULES USED FOR NORMALIZING TEXT

Unicode contains 10 consonant characters with nukta (a dot under the consonant) and one nukta character itself. We delete all occurrences of the nukta character and replace all consonants with nuktas by their corresponding base consonant character. This would equate words like the ones shown below.
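The nukta rule described above can be sketched in a few lines of Python (our sketch, not the survey's code). Unicode NFD decomposition splits a precomposed nukta consonant such as ज़ (U+095B) into the base consonant ज (U+091C) plus the combining nukta (U+093C), after which every nukta can simply be dropped:

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA

def remove_nukta(text):
    """Delete nukta signs and fold nukta consonants onto their base consonant."""
    # NFD splits precomposed characters like U+095B (za) into U+091C (ja) + nukta.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace(NUKTA, "")
    # Recompose whatever combining sequences remain.
    return unicodedata.normalize("NFC", stripped)
```

For example, remove_nukta folds ज़ (za) to ज (ja), so spelling variants that differ only in the nukta are mapped to the same form.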
REFERENCES

[4]. Steve Atwell, Hahn Koo, Liam Moran, and Tae-Jin Yoon, "Text Normalization in Python", www.linguistics.uiuc.edu/grads/moran/papers/TextNorm.pdf.