OCR Errors
Due to noise, the following errors may occur:

Reject error
The machine reading process may not be able to recognize a character.

Substitution error
OCR may recognize a character incorrectly.

Character fusion
Two or more character images merge to appear as a single connected component.

Character fragmentation
A character image is fragmented into more than one sub-image.

Initially, the bulk of the text data is divided into a series of separate words. An inbuilt analyzer, i.e., a morphological analyzer, then uses separate dictionaries to access the root word and the suffix that follows it. A relationship must be established between the different varieties of root words and their suffixes; to do this, a mapping function is necessary. The validity of a word is checked using the morphological analyzer, and the type of error is identified, i.e., whether the word has a correct root and an incorrect suffix, an incorrect root and a correct suffix, or a correct root with a non-matching suffix. These errors are handled individually, and the incorrect words are resolved by suitable solutions. Words which are misinterpreted are corrected with the help of the user by giving suitable suggestions. The drawback of this system is that it fails when implemented on OCR output text: it cannot efficiently handle OCR-specific cases like character fusion, character fragmentation etc.
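The root-and-suffix validity check described above can be sketched as follows. The ROOTS and SUFFIXES sets and the classify helper are hypothetical toy stand-ins for the analyzer's separate root and suffix dictionaries, not the actual system:

```python
# Sketch of the morphological validity check described above.
# ROOTS and SUFFIXES are hypothetical toy lexicons (romanized Kannada)
# standing in for the analyzer's separate dictionaries.
ROOTS = {"mane", "huDuga", "shAle"}
SUFFIXES = {"galu", "yalli", "ge", ""}   # "" allows a bare root

def classify(word):
    """Try every (root, suffix) split of the word and report which
    half is valid, mirroring the error types named in the text."""
    for i in range(len(word), 0, -1):
        root, suffix = word[:i], word[i:]
        if root in ROOTS and suffix in SUFFIXES:
            return "valid"
    if any(word.startswith(r) for r in ROOTS):
        return "correct root, incorrect suffix"
    if any(word.endswith(s) for s in SUFFIXES if s):
        return "incorrect root, correct suffix"
    return "incorrect root and suffix"

print(classify("manegalu"))   # root "mane" + suffix "galu" -> valid
print(classify("manexy"))     # correct root, incorrect suffix
```

A real analyzer would also consult the mapping function to reject root/suffix pairs that cannot combine.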
1.3 Spell Checker
A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools, or they can be part of larger applications like search engines, translators etc. A simple spell checker performs the following tasks:

Scanning and extracting the words contained in the text.

Matching the typed words against correctly written words, including special symbols, hyphens etc., which is the important step.

Handling morphology, for which a language-dependent algorithm is required. Even English requires a spell checker to handle similar word forms, including plurals, verbal forms etc., so processing these steps for other languages will be a complicated issue.

Dictionary lookup method
The dictionary lookup method compares the words in the input file with the correct words in a dictionary. This method has the advantage over plain OCR of inspecting letters which are ambiguous; but on a large scale it leads to size overhead, the calculation of probability becomes complex, and the cost of searching grows.

N-gram approach
An N-gram approach is an arrangement of text in sequential order over different items such as phonemes, graphemes, letters, words and even pairs of words. Unigram, bigram and trigram are its varieties. In general, an N-gram is a type of predictive model, built on the Markov assumption, that guesses the subsequent item from the preceding n-1 items; it is a probabilistic language model approach.
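The N-gram idea can be sketched as a bigram model: counts of adjacent word pairs give an estimate of the probability of the next word given the previous one. The toy corpus here is illustrative only:

```python
from collections import Counter

# Toy bigram model: P(next | prev) estimated by relative frequency,
# in the spirit of the Markov-style predictive model described above.
corpus = "the cat sat on the mat the cat ran".split()

pair_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent pairs
prev_counts = Counter(corpus[:-1])              # counts of left contexts

def bigram_prob(prev, nxt):
    """Relative-frequency estimate of P(nxt | prev)."""
    if prev_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, nxt)] / prev_counts[prev]

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" 2 times out of 3
```

A spell checker can use such probabilities to rank correction candidates by how well they fit the surrounding words.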
Design
Related Work
The literature survey reveals that most of the research
works on Kannada Spell Checker focus on normal text wh ile
some efforts have been made in other languages like Pun jabi,
Hindi etc., for OCR text . But, no work related to OCR spell
checkers are reported for Kannada directly. So me works on
OCR Spell Checker in other languages and Kannada spell
checker are reported here.
This review discusses common spell checking approaches and the problems that may occur during the spell checking process. There are two common approaches for implementing a spell checker: the dictionary lookup method and the N-gram approach.
Comparison
Compare each word with a standard dictionary and check the validity of the word by using the minimum edit distance algorithm.
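A minimal sketch of this comparison step, assuming a toy dictionary set as a stand-in for the full Kannada word list; words failing the lookup would then be passed on to edit-distance correction:

```python
# Toy dictionary lookup: flag input words not found in the word list.
# DICTIONARY is a hypothetical stand-in for the full Kannada lexicon.
DICTIONARY = {"rAmanu", "kADige", "hOdanu", "mane"}

def find_invalid(words):
    """Return the words that fail the dictionary lookup."""
    return [w for w in words if w not in DICTIONARY]

print(find_invalid(["rAmanu", "kAdige", "hOdanu"]))  # ['kAdige']
```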
Minimum Edit Distance
This is the Levenshtein distance, in which two strings or words are compared to measure their similarity or dissimilarity. The operations used are substitution, insertion and deletion, applied to convert one word into another; the minimum number of such operations needed to convert one string into the other is computed. In NLP this supports automatic spelling correction: candidate words are drawn from a standard dictionary, and the most suitable one is chosen by selecting the lowest distance to the word formed.
Ex: When the two strings INTENTION and EXECUTION are considered, the minimum edit distance between them is 5, i.e., a minimum of 5 operations is required to change INTENTION into EXECUTION.
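The distance can be sketched with the standard dynamic-programming formulation, using unit cost for insertion, deletion and substitution, which reproduces the INTENTION/EXECUTION distance of 5:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming: dp[i][j] is the
    minimum number of insertions, deletions and substitutions needed
    to turn s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]

print(edit_distance("INTENTION", "EXECUTION"))  # 5
```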
The words in the dictionary which have an edit distance less than or equal to 3 are suggested for a given misspelled word.

Romanization
Romanization is the process of converting written text from a specific script to Roman script. Romanization includes the following methods:

Transliteration, for representing written text.

Transcription, for representing the spoken word.

A combination of both transliteration and transcription.

Ex: 1. ಅವನು in Romanized format is written as avanu.
2. ಸ್ನೇಹ in Romanized format is written as snEha.
In this tool we read line by line from an input file, and each line is Romanized to English.
Ex: ರಾಮನು ಕಾಡಿಗೆ ಹೋದನು is Romanized to English as rAmanu kAdige hodanu.

Tokenization
Tokenization is the process of forming a set of tokens that are meaningful elements of the text, such as words, phrases and symbols.
Ex: Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.
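The tokenization step can be sketched with a simple regex-based splitter. The pattern here is an assumption (word characters only); a real tool would also handle Kannada script ranges and punctuation rules:

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

print(tokenize("This is a spell checker for Kannada OCR."))
# ['This', 'is', 'a', 'spell', 'checker', 'for', 'Kannada', 'OCR']
```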
Conclusion
In this project we have imp lemented spell checker for
Kannada OCR. Fro m this project we learnt various tools to
implement spell checker. After this project, we understood
various problems occur during text processing; also we got to
know how to tackle these problems. Although there were lots
of problems during Kannada text processing we understood a
major way to imp lement a Kannada Spell Checker for
Kannada OCR. As this project is a first attempt for
implementing spell checker for Kannada OCR, we hop e our
project serves as platform fo r beginners to understand various
aspects of spell checker for Kannada OCR.
Future Work
There can be several future work proposed, some of them
involving are improving the performance while others can be
built on top of the work done here. Here are some o f the works
we believe can be performed :
The methods can be improved to achieve better
efficiency.
A larger dictionary with set of huge words can be
used.
The methods can be used to separate root word and
affix word to improve the performance.
Work can be elaborated for semantic errors as well.
Further, the work can be extended by applying the
mu lti-threaded approach in the spell checker tool.
References