
IMPLEMENTATION OF A STATISTICAL BASED ARABIC CHARACTER RECOGNITION SYSTEM

Anthony Cheung, Mohammed Bennamoun, and Neil W. Bergmann

Space Centre for Satellite Navigation,
QUT, GPO Box 2434, Brisbane, Qld 4001, Australia.
Email: h.cheung@student.qut.edu.au

ABSTRACT
Character recognition systems have contributed tremendously to the automation of data entry processes. Arabic characters are widely used in Arabic countries. If optical character recognition (OCR) systems were available for Arabic characters, they would be very useful and would have high commercial value. Therefore, we have implemented a statistical based Arabic OCR system, and we present the techniques used in this paper. The system consists of image acquisition, preprocessing, segmentation, feature extraction and statistical classification. The developed system is window-based software that can interact with a scanner and can recognize a single font of Arabic script with 85% accuracy at about 16 characters/second.

1. Introduction
A character recognition system is designed for the machine replication of human reading. It involves techniques for recognizing both printed and handwritten characters from a text image. The idea of an OCR system was first conceived in 1929 [1]. In the middle of the 1940s, the first OCR system appeared with the invention of the digital computer [2]. A commercial OCR machine was available in the 1950s. Since then, a large number of research papers and reports have appeared, and many new recognition techniques have been developed. OCR systems improve human-machine interaction in many applications, including office automation, cheque verification, and a large variety of banking, business and data entry applications [1], [2].

OCR systems for most of the popular scripts (e.g. English, Chinese, and Japanese) have already been well developed and are available commercially. However, due to the technical difficulties introduced by the cursive nature of Arabic script, Arabic OCR systems have not been as well developed as the others. On the other hand, there are over one billion Arabic script users in the world, so a fast and accurate Arabic OCR system has great commercial potential. This paper presents such a system. The techniques used are described in Section 2. The results and discussion are presented in Section 3. The conclusion is given in Section 4.

2. The Arabic OCR System
The characteristics of Arabic characters and the cursive nature of this script do not allow the direct application of the algorithms used to classify the characters of other languages. Figure 1 shows the character set of Arabic. Some of the characteristics of Arabic characters are mentioned in [3], [4] and are summarized below. Figure 2 illustrates the variation of the Arabic characters' shape depending on their positions in the word.

Figure 1 Arabic characters in all forms. (EF end form, MF middle form, BF beginning form, and IF isolated form)

i. Arabic is a cursive type language, which is written from right to left, and recognition proceeds in the same direction.
ii. Each character has two to four different forms, depending on its position in the word or subword. It

1997 IEEE TENCON - Speech and Image Technologies for Computing and Telecommunications 53 1
increases the classes to be recognized from 28 to 100.
iii. Some characters can only appear at the beginning or at the end of a word or sub-word. An Arabic word could have one or more sub-words. This is due to the fact that some characters are not connectable from the left side with the succeeding character.
iv. Most characters (17 out of 28) have a dot, two dots, or zigzags associated with the character and these can be above, below, or inside the character.
v. There are only three zigzags that represent vowels. Other vowels are represented by diacritics in the form of over-scores or under-scores. The use of diacritics is limited to the cases where the word is foreign or where the pronunciation requires a stress.
vi. Some characters may overlap vertically (without touching each other) within a word.
vii. There are no upper or lower cases in Arabic.

Because of the above characteristics, Arabic character segmentation and recognition are far more difficult than the recognition of Latin or Chinese characters.

Figure 2 An Arabic word.

The implemented system involves five image processing stages: image acquisition, preprocessing, segmentation, feature extraction and classification (see Figure 3).

Figure 3 The Arabic OCR system.

2.1 Image Acquisition and Preprocessing
A document is quantized by a scanner in space and amplitude (i.e. image sampling and gray-level quantization) to acquire a digitized representation of it [5]. This is controlled by the user interface of the system.

Preprocessing is the process of enhancing the acquired image to increase the ease of feature extraction and to compensate for the eventual poor quality of the scanned documents. Binarization, noise elimination and thinning [6]-[10] have also been implemented.

2.2 Segmentation
Segmentation is a crucial step of OCR systems as it extracts meaningful regions for analysis. A poor segmentation process produces mis-recognition or rejection. It is especially important for Arabic OCR systems due to the cursive nature of Arabic script and the fact that some Arabic words overlap vertically. Page layout analysis and character separation are used to segment characters from the preprocessed image.

In the page layout analysis process, a horizontal projection profile of the document image is plotted. The segmentation between lines of text is determined by scanning through the profile from the first row. If the difference in the number of black pixels between two rows is larger than a predefined threshold, a new line of text is indicated. The next large variation in the number of black pixels between another two rows indicates the bottom of the line. Words are segmented from a line-segmented image by a similar method, except that a vertical projection profile is plotted instead.

Depending on the position of the characters in a word, Arabic script has four types of shapes for the same character as mentioned in Section 2. Therefore, we have applied Amin’s algorithm [11] to segment characters from a word. This technique has been used in many cases, e.g. [11]-[13]. In his character separation algorithm, the vertical projection of a word is plotted. The connectivity point will show the least sum of the average value ($M_c$), where

$$M_c = \frac{1}{N_c} \sum_{i=1}^{N_c} C_i \tag{1}$$

$N_c$ is the number of columns of the word image and $C_i$ is the number of black pixels of the $i$th column. Hence, each part showing a value less than $M_c$ is segmented into a different character. However, if the histogram produced from the vertical projection does not agree with the following rules, the character remains unsegmented.
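The projection-profile steps of Section 2.2 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the image is assumed to be a binary list of rows (1 for a black pixel), the function names `segment_lines`, `connectivity_regions` and `runs` are hypothetical, and the row-difference test of the page layout analysis is simplified to a threshold crossing on the profile itself. Only the column-average test of equation (1) is shown; the additional rules on peak distances are not checked here.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def horizontal_profile(image: Image) -> List[int]:
    """Number of black pixels in each row."""
    return [sum(row) for row in image]


def vertical_profile(image: Image) -> List[int]:
    """Number of black pixels in each column."""
    return [sum(column) for column in zip(*image)]


def runs(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive True values."""
    found = []
    start = None
    for index, flag in enumerate(flags):
        if flag and start is None:
            start = index
        elif not flag and start is not None:
            found.append((start, index))
            start = None
    if start is not None:
        found.append((start, len(flags)))
    return found


def segment_lines(image: Image, threshold: int = 0) -> List[Tuple[int, int]]:
    """Row ranges of text lines.

    Assumption: a text line starts where the row count rises above
    `threshold` and ends where it falls back to it or below.
    """
    profile = horizontal_profile(image)
    return runs([count > threshold for count in profile])


def connectivity_regions(word: Image) -> Tuple[float, List[Tuple[int, int]]]:
    """Column average M_c of equation (1) and the column ranges below it.

    The ranges are only candidate connection regions between characters.
    """
    profile = vertical_profile(word)
    if not profile:
        return 0.0, []
    mean = sum(profile) / len(profile)  # M_c = (1 / N_c) * sum of C_i
    return mean, runs([count < mean for count in profile])
```

Applying `segment_lines` to a page image gives the top and bottom row of each text line, and `connectivity_regions` applied to a single word image gives the column ranges whose black-pixel count is below the average. Using the same `runs` helper on a vertical profile would give a comparable word segmentation within a line.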

$$d_i \le \frac{d_L}{3} \tag{2}$$

where $d_i$ is the distance between the $i$th peak and the $(i+1)$th peak, and $d_L$ is the total width of the character. By examining the Arabic characters, the distance between peaks does not exceed 1/3 of the width of an Arabic character. Moreover, at the end of a word or sub-word, the following rule is applied:

$$L_{i+1} > 1.5 \times L_i \tag{3}$$

where $L_i$ is the $i$th peak in the histogram.

2.3 Feature Extraction
The end result of the image acquisition, preprocessing, and segmentation is an array of numbers that represents the character in some way. In the general case, however, the matching of these numbers to a template may be too time consuming and not flexible enough. Therefore, feature extraction is essential in character recognition systems. The feature extraction process uses a set of measurements that represent unique features to describe the character. These measurements may then be represented in the feature space for classification. Seven second-order moments, $\eta_{pq}$, described in [14], are used to calculate the features of each segmented Arabic character. They are:

$$\phi_1 = \eta_{20} + \eta_{02} \tag{4}$$

$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \tag{5}$$

$\eta_{pq}$ is the normalized $(p+q)$th order central moment and is equal to:

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p+q}{2} + 1$$

$\mu_{pq}$ is the central moment and can be expressed as follows:

$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^p (y - \bar{y})^q \rho(x, y)$$

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$, and $m_{pq}$ is the $(p+q)$th order geometric moment of a digital density distribution function $\rho(x, y)$, which can be expressed as follows:

$$m_{pq} = \sum_{x} \sum_{y} x^p y^q \rho(x, y)$$

where $p, q = 0, 1, 2, \ldots$; $\rho(x, y)$ is the gray-level value of a pixel at $(x, y)$.

2.4 Statistical Classification
The classification process is carried out at the final stage to recognize the characters. The classification process assigns an input character to one or more pre-specified classes, based on the extracted features and their analysis. In our system, the minimum distance is obtained by calculating the sum of squared errors between the seven moment descriptors of the input and the moments of the 100 characters stored in a database.

3. Experimental Results and Discussion
We have fully implemented the Arabic OCR system described above and provided a friendly graphical window-based user interface. In that system, document images are acquired through a scanner by calling a command from the system's menu. After the recognition, the users are able to edit their documents through the system as well. Many images have been tested. It has been shown that the system has an accuracy of 85% and is able to run in real time at around 16 char/sec. Some example results are shown in Figures 4 and 5.

Figure 4 (a) the original document (b) the recognized result with the system menu.

The major problem that we have encountered is the character separation through the use of the vertical

projection technique described in Section 2.2. Some of the characters could not be segmented as those characters are horizontally overlapped with each other. Moreover, under our experimental investigation, Amin’s algorithm was not always satisfied during the segmentation of characters. It might be because this algorithm is only suitable for some font types.

4. Conclusion and Future Work
A user-friendly statistical based Arabic optical character recognition system was implemented. Many tests were carried out and showed that the system is able to recognize Arabic texts with an accuracy of 85%. It is also a real time system which recognizes documents at around 16 char/sec. In our Arabic OCR system, the text image is first preprocessed to enhance the quality of the image. Then it is divided into paragraphs, lines are then extracted, and each word is decomposed into its constituent characters using the algorithm described in [11]. Feature extraction and statistical classification are then carried out to identify the Arabic characters. The accuracy of the system is enormously affected by the segmentation process.

We have developed an Arabic word segmentation method which can successfully segment vertically overlapped Arabic words [15]. This process should increase the segmentation accuracy of Arabic characters. Currently, we are investigating an improved character separation technique, and believe that it will increase the recognition accuracy of our system.

Figure 5 (a) the original document (b) the recognized result with the system menu.

References
[1] S. Mori, C. Y. Suen and K. Yamamoto, “Historical Review of OCR Research and Development,” Proc. IEEE, Vol. 80, No. 7, pp. 1029-1058, 1992.
[2] V. K. Govindan and A. P. Shivaprasad, “Character Recognition - A Review,” Pattern Recognition, Vol. 23, No. 7, pp. 671-683, 1990.
[3] I. S. Abuhaiba, S. A. Mahmoud and R. J. Green, “Cluster Number Estimation and Skeleton Refining Algorithms for Arabic Characters,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 519-530, Oct 1991.
[4] K. M. Jambi, “Arabic Character Recognition: Many Approaches and One Decade,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 501-509, Oct 1991.
[5] W. Niblack, “An Introduction to Digital Image Processing,” Prentice Hall International.
[6] A. Cheung, M. Bennamoun and N. W. Bergmann, “A New Thinning Algorithm for Arabic Characters,” ISAS, Caracas, Venezuela, Oct 1997.
[7] C. J. Hilditch, “Comparison of Thinning Algorithms on a Parallel Processor,” Image and Vision Computing, Vol. 1, No. 3, pp. 115-132, 1983.
[8] L. Lam and C. Y. Suen, “An Evaluation of Parallel Thinning Algorithms for Character Recognition,” IEEE Trans. on PAMI, Vol. 17, No. 1, Jan 1995.
[9] P. S. P. Wang and Y. Y. Zhang, “A Fast and Flexible Thinning Algorithm,” IEEE Trans. on Computers, Vol. 38, No. 5, pp. 741-745, 1989.
[10] S. Suzuki and K. Abe, “Binary Picture Thinning by an Iterative Parallel Two Sub-cycle Operations,” Pattern Recognition, Vol. 10, No. 3, pp. 297-307, 1987.
[11] A. Amin, “Recognition of Arabic Handprinted Mathematical Formulae,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 532-542, Oct 1991.
[12] A. Amin and G. Masini, “Machine Recognition of Multifonts Printed Arabic Texts,” Proc. 8th Inter. Conf. on Pattern Recognition (Paris, France), pp. 392-395, 1986.
[13] A. Amin and J. F. Mari, “Machine Recognition and Correction of Printed Arabic Text,” IEEE Trans. on System, Man, and Cybernetics, Vol. 19, No. 5, pp. 1300-1306, 1989.
[14] F. El-Khaly and M. A. Sid-Ahmed, “Machine Recognition of Optically Captured Machine Printed Arabic Text,” Pattern Recognition, Vol. 23, No. 11, pp. 1207-1214, 1990.
[15] A. Cheung, M. Bennamoun, and N. W. Bergmann, “A New Word Segmentation Algorithm for Arabic Script,” DICTA’97: The 4th Conference on Digital Imaging Computing Techniques and Applications, Auckland, New Zealand, 1997 (to appear).
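As a supplementary illustration of the feature extraction and minimum-distance classification described in Sections 2.3 and 2.4, the following sketch computes the first two moment descriptors, equations (4) and (5), and picks the stored class with the smallest sum of squared errors. It is a minimal sketch under stated assumptions rather than the system's actual code: the image is assumed to be a list of rows of pixel values, the normalization exponent (p+q)/2 + 1 is the standard one for normalized central moments and is assumed here, only two of the seven descriptors are computed, and the function names and the template labels are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Image = List[List[float]]  # pixel values rho(x, y); x is the column, y the row


def geometric_moment(image: Image, p: int, q: int) -> float:
    """m_pq: sum over all pixels of x^p * y^q * rho(x, y)."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def central_moment(image: Image, p: int, q: int) -> float:
    """mu_pq: geometric moment taken about the centroid (x_bar, y_bar)."""
    m00 = geometric_moment(image, 0, 0)
    if m00 == 0:
        raise ValueError("image contains no ink")
    x_bar = geometric_moment(image, 1, 0) / m00
    y_bar = geometric_moment(image, 0, 1) / m00
    return sum(
        ((x - x_bar) ** p) * ((y - y_bar) ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def normalized_moment(image: Image, p: int, q: int) -> float:
    """eta_pq = mu_pq / mu_00 ** gamma, with gamma = (p + q) / 2 + 1 (assumed)."""
    gamma = (p + q) / 2 + 1
    return central_moment(image, p, q) / central_moment(image, 0, 0) ** gamma


def moment_features(image: Image) -> Tuple[float, float]:
    """First two descriptors only: phi_1 (equation 4) and phi_2 (equation 5)."""
    eta20 = normalized_moment(image, 2, 0)
    eta02 = normalized_moment(image, 0, 2)
    eta11 = normalized_moment(image, 1, 1)
    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return phi1, phi2


def classify(features: Sequence[float],
             templates: Dict[str, Sequence[float]]) -> str:
    """Label of the stored template with the smallest sum of squared errors."""
    def squared_error(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=squared_error)
```

In use, `templates` would hold one stored feature vector per character class, and each segmented character image would be passed through `moment_features` and then `classify`. Because the descriptors are built from central moments, shifting a character inside its image does not change them.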

