Joydeep Sengupta
B.Tech, CSE, SoA University
Bhubaneswar, Odisha, India
Abstract: Optical Character Recognition (OCR) is a document image analysis method in which a scanned digital image
containing machine-printed or handwritten script is input to a system that translates it into an editable,
machine-readable digital text format. OCR has been a topic of interest for researchers around the globe over
the past decade, and the number of research papers involving OCR is increasing steadily. Efficient algorithms have
increased the speed and accuracy of character recognition. A substantial amount of work has been done on foreign
languages such as English and Chinese, but very few papers address Indian languages, barring a few for Hindi
and Bengali. Hence our research work was directed towards the development of a novel algorithm for offline typewritten
Odia character recognition using template matching.
Keywords: Odia Script; Character Recognition; Matching; Templates; Odia Unicode
I. INTRODUCTION
Optical Character Recognition is the process of translating images of handwritten or printed text into a format understood
by machines for the purposes of editing, indexing/searching, and reducing storage size [1]-[3]. The first step in OCR
is to go back to the roots of the language and study the individual characters that make it up. Each
character is unique in many ways, and if we can extract the unique features of an individual character, we can train the
computer on that particular character. Each character has a different set of features that can be used when comparing it
with a test character; in this way we can make the computer recognize a character. Our study focuses on
Odia character recognition. Odia is one of the oldest languages and is the official language of the state of Odisha in
India. The Odia language consists of 50 different characters, of which 12 are vowels and the rest are consonants.
Character recognition in this language is particularly
difficult because there are many similar-looking characters and the combined characters are very difficult to segregate.
To make recognition easier, we have developed an algorithm. The recognition of the characters and numerals of a language
is a challenging problem because of the variations due to different font sizes and the different types of variation introduced
during writing. Character recognition (CR) can be broadly classified into two groups:
On-line character recognition systems
Off-line character recognition systems
Online Character Recognition systems:
Online character recognition is the real-time recognition of characters. Online systems have better timing
information for recognizing characters. Online character recognition also avoids the
initial step of locating the characters, since it directly captures the writing along with the order of the strokes ([4], [5], [6]).
Offline Character Recognition:
Offline character recognition involves the automatic conversion of text in an image into letter codes. In this type of
character recognition, typewritten characters are usually scanned as images, converted into greyscale or binary images,
and then fed to the recognition algorithm. Offline character recognition is a more challenging task than online recognition, since in this type
of recognition we have no control over the medium and devices ([4], [5], [6]).
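The conversion of a scanned greyscale image into a binary image can be sketched as follows. This is only a minimal illustration in Python, not the implementation used in this work: the fixed threshold of 128, the function name binarize, and the toy image patch are all assumptions made for the example.

```python
def binarize(gray, threshold=128):
    """Convert a greyscale image (rows of 0-255 intensities) to a binary image.

    Pixels darker than `threshold` are treated as ink (1), all others as
    background (0). The threshold value is an assumed, illustrative choice.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Hypothetical 3x4 greyscale patch: dark values are strokes, light values are paper.
gray_patch = [
    [250, 240, 30, 245],
    [248, 20, 25, 250],
    [252, 35, 240, 251],
]

binary_patch = binarize(gray_patch)
for row in binary_patch:
    print(row)
```

A fixed global threshold is the simplest choice; scans with uneven illumination would likely need an adaptive threshold instead.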
Traditionally, OCR techniques are classified into:
Template based approach
Feature based approach
In the template-based approach, an unknown pattern is superimposed on the ideal template pattern, and the degree of
correlation between the two is used to decide the classification. Early OCR systems employed only the template
approach, but it becomes ineffective in the presence of noise, changes in handwriting, etc. Template matching is a
trainable process, as the template characters can be changed. Modern systems therefore combine it with feature-based approaches
to obtain better results. The feature-based approach derives properties (features) from the test patterns and employs them in
a more sophisticated classification model, as described in Section IV.
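The "degree of correlation" between an unknown pattern and a template can be illustrated with a normalized correlation coefficient. The Python sketch below is an assumption about how such a score might be computed; the paper does not specify the exact measure, and the function name and the 3x3 toy images are hypothetical.

```python
import math


def correlation(pattern, template):
    """Pearson correlation between two equally sized images (lists of rows).

    Returns a value in [-1, 1]; 1 means the pattern matches the template
    exactly. Returns 0.0 if either image is constant (zero variance).
    """
    a = [pixel for row in pattern for pixel in row]
    b = [pixel for row in template for pixel in row]
    if len(a) != len(b):
        raise ValueError("pattern and template must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return covariance / math.sqrt(var_a * var_b)


# Hypothetical binary template and its inverse (ink and background swapped).
template = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
inverted = [[1 - pixel for pixel in row] for row in template]

same_score = correlation(template, template)
inverse_score = correlation(inverted, template)
```

Under this measure an identical pattern scores close to 1 and an inverted one close to -1, so the template with the highest score would be chosen as the class.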
2015, IJARCSSE All Rights Reserved
Ballav et al., International Journal of Advanced Research in Computer Science and Software Engineering 5(6),
June- 2015, pp. 823-829
Objective
The objective of the project is to develop a technique that can efficiently recognize typewritten characters of the Odia
language. Our main emphasis is on the template matching part, where the input character is directly matched with a set of
prototype characters representing each possible class. Further, we concentrate our work on Unicode mapping, which
defines a uniform way of encoding multilingual text. The ultimate goal of character recognition is to simulate
human reading capabilities.
Organization
The rest of the paper is organized as follows. Section II describes the motivation of the project. Section III explains the Odia
language and data collection. The major steps in character recognition are discussed in Section IV.
Implementation details and the proposed framework are presented in Section V. The experimental results are discussed in Section
VI, and finally the conclusion and future work are given in Section VII.
II. MOTIVATION
A large amount of work has been done in the field of OCR, but very little research has been done for the Odia language.
Odisha has a rich heritage of manuscripts and novels that need to be preserved in the Odia language and Odia
script, and these are in the process of being lost due to the lack of Odia OCR systems. The basic need for text recognition is
the automatic recognition of alphabetic characters and numerals by computers. OCR systems have been developed for some foreign
languages, and attempts have been made for some Indian languages and scripts such as Devanagari and Tamil. We are therefore
making an attempt to develop an OCR system for the typewritten Odia language.
III. ODIA LANGUAGE AND DATA COLLECTION
India is a multi-lingual and multi-script country, and Odia is one of the popular languages of India, used mainly
in the state of Odisha. The Odia script, in which the Odia (Oriya) language is written, developed from the Kalinga script, one of
the many descendants of the Brahmi script of ancient India. As in other Indian scripts, the
concept of upper and lower case is absent in Odia. The alphabet of the modern Odia script consists of 12 vowels and 41
consonants. Of the 12 independent vowels, 11 have dependent forms (i.e., all except the first vowel). These
characters are called basic characters; they, together with the Odia numerals and their corresponding English numerals, are
shown in Fig.1.1, Fig.1.2 and Fig.1.3. The Odia script is written from left to right. In Odia script a vowel following a
consonant takes a modified shape. Depending on the vowel, its modified shape is placed at the left, right, both left and
right, or bottom of the consonant. These modified shapes are called modifiers or matra, as shown in Fig.1.4. A consonant
or a vowel following a consonant sometimes takes a compound orthographic shape, which we call a compound
character. Compound characters can be combinations of two consonants as well as of a consonant and a vowel. There are
more than 200 compound characters in Odia script [7]; in this paper we consider the recognition of off-line
typewritten Odia basic characters using template matching with Unicode. Some similarly shaped characters, shown in Fig.1.5,
cause difficulties and make the recognition system more complex when a higher recognition rate is sought.
Through character recognition, the character symbols of a language are transformed into symbolic representations such as
ASCII or Unicode. The basic problem is to assign the digitized character to its symbolic class. The Unicode values of Odia
characters are shown in Fig.8.
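The assignment of a recognized class to its Unicode symbol can be sketched as a simple lookup. The Python example below is illustrative only: the class labels ("a", "aa", "ka") and function names are hypothetical, and only three letters are listed. The code points themselves come from the Unicode block U+0B00 to U+0B7F, whose official character names use the older spelling "ORIYA".

```python
import unicodedata

# Odia digits 0-9 occupy the contiguous range U+0B66..U+0B6F.
ODIA_DIGIT_ZERO = 0x0B66

# Hypothetical class labels mapped to a few Odia letter code points.
LETTER_CLASS_TO_CODE_POINT = {
    "a": 0x0B05,   # ORIYA LETTER A (first vowel)
    "aa": 0x0B06,  # ORIYA LETTER AA
    "ka": 0x0B15,  # ORIYA LETTER KA (first consonant)
}


def digit_class_to_text(digit_class):
    """Return the Odia numeral character for a recognized digit class 0-9."""
    if not 0 <= digit_class <= 9:
        raise ValueError("digit class must be between 0 and 9")
    return chr(ODIA_DIGIT_ZERO + digit_class)


def labels_to_text(labels):
    """Join recognized letter class labels into a Unicode string."""
    return "".join(chr(LETTER_CLASS_TO_CODE_POINT[label]) for label in labels)


recognized = labels_to_text(["ka", "aa"]) + digit_class_to_text(3)
for character in recognized:
    print(f"U+{ord(character):04X}", unicodedata.name(character))
```

A full system would need one table entry per basic character and additional handling for matras and compound characters, which this sketch does not attempt.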
Template matching is the process of finding the location of a sub-image, called a template, inside an image.
It differs from other techniques in that no features are actually extracted. Instead, the matrix containing the image of the
input character is directly matched with a set of prototype characters representing each possible class. The distance
between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to
the pattern. The technique is simple and easy to implement in hardware, and it has been used in many commercial OCR
machines. However, it is sensitive to noise and style variations and has no way of handling rotated
characters. In addition, only a small number of possible postures can be recognized; if the application requires a large posture set,
template matching will not work well.
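The matching rule described above (compute the distance to every prototype and assign the class of the best match) can be sketched as follows. This Python example assumes binary images of equal size and uses the Hamming distance; the paper does not name the distance it uses, and the prototype labels and 3x3 images are hypothetical stand-ins for real character templates.

```python
def hamming_distance(image_a, image_b):
    """Count the pixels at which two equally sized binary images differ."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def classify(pattern, prototypes):
    """Return (label, distance) of the prototype closest to `pattern`.

    `prototypes` maps a class label to its binary template image.
    """
    scored = (
        (label, hamming_distance(pattern, template))
        for label, template in prototypes.items()
    )
    return min(scored, key=lambda pair: pair[1])


# Hypothetical toy prototypes; real templates would be full character images.
prototypes = {
    "vertical_bar": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal_bar": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}

# A vertical bar with one noisy pixel in the bottom-right corner.
noisy_pattern = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]

best_label, best_distance = classify(noisy_pattern, prototypes)
print(best_label, best_distance)
```

Because every pixel is compared position by position, a shifted or rotated character would produce a large distance even to its own prototype, which is the sensitivity noted in the text.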
V. METHODOLOGY AND PROPOSED ALGORITHM
The steps of the proposed algorithm for Odia Optical Character Recognition (OOCR) are implemented in MATLAB
(R2010a, 64-bit), following the block diagram shown in Fig.2.
Database Creation:
Initially, we created a database of all character images of the Odia script, each of size 50×50 pixels.
Data Acquisition:
Through the scanning process, a digital image of the original document is captured. Scanned images are then stored in
an image file format such as BMP, JPG or GIF, as shown in Fig.3.
VI. RESULTS
No standardized test sets exist for character recognition, and because the performance of an OCR system is highly
dependent on the quality of the input, it is difficult to evaluate and compare different systems. Still, recognition
rates are often reported, usually as the percentage of characters correctly classified. According to the results in Fig.
8.3 and Fig. 8.4, only the 2nd vowel is not matched properly.
\[ \%\,\text{Accuracy} = \frac{\text{No. of characters found correctly}}{\text{Total no. of patterns}} \times 100 \]
\[ \%\,\text{Accuracy} = \frac{46}{47} \times 100 = 97.87\% \]
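The accuracy figure can be checked with a few lines of Python. This is only a worked restatement of the formula with the reported counts (46 correct out of 47 patterns); the function name is an assumption made for the example.

```python
def accuracy_percent(correct, total):
    """Percentage of patterns classified correctly."""
    if total <= 0:
        raise ValueError("total number of patterns must be positive")
    return 100.0 * correct / total


# Counts reported in this section: 46 of 47 characters recognized correctly.
reported_accuracy = round(accuracy_percent(46, 47), 2)
print(reported_accuracy)
```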
To illustrate the recognition accuracy for Odia characters, typewritten text images in different fonts and sizes were
tested with the OCR algorithm in MATLAB (R2010a, 64-bit), and performance was measured on these
samples, as shown in Fig.8.1 and Fig. 8.2.