
Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, October 20-22, 2004, Hong Kong

ORAN: A BASIS FOR AN ARABIC OCR SYSTEM

Abdelmalek Zidouri
King Fahd University of Petroleum and Minerals, P.O. Box 1360, Dhahran 31261, Saudi Arabia
Email: malel

ABSTRACT

In this paper we present a system called ORAN (Offline Recognition of Arabic characters and Numerals). This system is based on a method called modified MCR (Minimum Covering Run) expression for document images. Using the correspondence between binary images and bipartite graphs, the MCR expression can be found by constructing a minimum covering or maximum matching in the corresponding graph. We use the structural information obtained from this expression to describe the strokes of characters according to some extracted features. These are obtained after a zoning scheme, where the baseline is detected and the line of text is divided into four zones. Reference prototypes for the system are built according to a structural description of characters in some model documents. By this method we overcome the problem of segmentation that is inherent to Arabic characters even when they are machine printed or typed. Simple matching is performed for the candidate characters to reference prototypes. A recognition rate of more than 97% is achieved.

1. INTRODUCTION

Character recognition has witnessed a lot of success for Roman alphabet based languages; however, this area still needs thorough investigation for Arabic script based languages. Researchers deal with this problem in two main distinctive domains. One has only the image or picture of the text and is referred to as off-line recognition. The other is where the writer's or pen's movements can be registered and known to the system, and is referred to as on-line recognition.
These in turn can be divided into two main areas: that of handwriting and that of printed or typed text.

Text holds a major part of the information in documents; however, automatically recognizing the characters presents challenging tasks, especially for languages like Arabic, which is written cursively (connected) even when machine printed. Recognition algorithms developed for other languages are not readily applicable to Arabic. There is always the problem of segmentation of words into individual characters to be recognized. However, segmentation and recognition are very much related and one can lead to the other.

Research in the area of character recognition, although one of the oldest since the advent of computers, is still an open field for researchers due to the challenging nature of perception and recognition. Human understanding of shapes and patterns is associated with a huge amount of information that we gather during our interaction with these patterns or shapes. A human can recognize a mug or its picture quite easily from whatever angle it is seen. We can read different fonts and styles even when characters are broken or overlapping. It is a difficult task to automatically mimic the perception of the human eye. Nevertheless, optical character recognition (OCR) has advanced to a great extent for languages other than Arabic [2-5]. However, Arabic and languages using the Arabic script, like Urdu or Farsi, have received relatively less attention in this field [6]. This is due in part to the specificities of the Arabic language. A state of the art of Arabic character recognition research throughout the last two decades can be found in [7].

ORAN is a contribution in this field. It is a system based on a method called modified MCR (Minimum Covering Run) [1]. Parts of this work have already been published. Here we emphasize how ORAN can be the basis of a complete OCR for Arabic and can be extended to other languages. Fig.
1 shows the flow chart of the complete system. Detailed information about MCR and the algorithms used can be found in some of our previous publications [8-10]. This method of representing binary images by a minimum number of horizontal and vertical runs is used for stroke extraction as features for recognition. A description of our approach and an outline of this segmentation-free technique used for recognition of printed cursive Arabic text are given.

Arabic characters, being connected even when printed or typed, need to be segmented for recognition. We overcome this problem of segmentation by expressing the characters in terms of small parts called strokes. The word "stroke" is used here to mean such "parts" as the four curved segments composing a character zero: an "O" or a similarly shaped pattern would be represented by 2 vertical and 2 horizontal "strokes". A character "C", or a similar curved pattern at the end of many Arabic characters, will be represented by 1 vertical and 2 horizontal strokes, and so on. These strokes are labeled by a labeling procedure. The recognition is then achieved by matching candidate character shapes to reference prototypes built from a training set of documents.

Fig. 1. ORAN system: parsing a document and performing the text block character recognition.

2. MCR REPRESENTATION OF ARABIC SCRIPT

The MCR technique is useful for several tasks in document image analysis and understanding. We use it successfully for parsing document patterns into different categories (pictorial, graphical and textual) [11], and for extraction of tables from documents [12]. The structural information obtained is used in the ORAN system as a preprocessing step for character recognition. When the modified MCR expression is obtained for a document image, the strokes are divided into overlapping and non-overlapping parts.
The non-overlapping parts of a pattern are labeled and ordered with respect to their absolute position in the document image. This is done in a top-down, left-to-right priority, to follow the Arabic way of writing. A labeling process of the parts obtained provides the dynamic information from a static image, needed for character recognition.

Arabic (like Farsi or Urdu) is written from right to left, cursively; most of its characters are connected even when typed or machine printed. Arabic characters take different shapes (two shapes for six of them and four shapes for the remaining 22 characters) depending on their position in a word; see Table 1. They are written following a baseline. Most of the information is contained above the baseline. If optical character recognition systems are available for the Arabic language, they will be of very high commercial value. Arabic is now widely used in the information technology sector in more than 20 countries. At present the Automatic Page Reader from Sakhr, "Al-Qari' Al-Ali", and the Arabic version of Omnipage are the only available commercial systems for the Arabic language.

Table 1. Arabic alphabet in their different shapes: isolated (I), at the beginning (B), in the middle (M) and at the end (F) of a word.

Our approach to Arabic cursive writing, based on the MCR expression and its modified version, is a new one. We consider Arabic like any other stroke-like language. To our knowledge this is the first time that this kind of approach is being attempted for Arabic. This has been made possible by the structural features that the modified MCR expression offers to represent text and tabular components in binary document images.
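As an illustration of the correspondence between binary images and bipartite graphs, the following Python sketch computes a plain minimum covering of runs for a small pattern. It is not the modified MCR expression used in ORAN, whose details are given in [8-10]. The function names are hypothetical, and the choice of an augmenting-path matching followed by König's construction is an assumption about one way the minimum covering may be obtained from a maximum matching. Each maximal horizontal run and each maximal vertical run of black pixels is a vertex, and each black pixel is an edge joining the two runs that contain it.

```python
def label_runs(img):
    """Give every black pixel the index of its horizontal and vertical run."""
    rows, cols = len(img), len(img[0])
    h_id = [[-1] * cols for _ in range(rows)]
    v_id = [[-1] * cols for _ in range(rows)]
    n_h = n_v = 0
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            if c > 0 and img[r][c - 1]:        # continue the run on the left
                h_id[r][c] = h_id[r][c - 1]
            else:                              # start a new horizontal run
                h_id[r][c] = n_h
                n_h += 1
            if r > 0 and img[r - 1][c]:        # continue the run above
                v_id[r][c] = v_id[r - 1][c]
            else:                              # start a new vertical run
                v_id[r][c] = n_v
                n_v += 1
    return h_id, v_id, n_h, n_v


def mcr(img):
    """Return a minimum set of runs (horizontal, vertical) covering all black pixels."""
    h_id, v_id, n_h, n_v = label_runs(img)
    adj = [set() for _ in range(n_h)]          # horizontal run -> vertical runs
    for r, row in enumerate(img):
        for c, px in enumerate(row):
            if px:
                adj[h_id[r][c]].add(v_id[r][c])
    adj = [sorted(s) for s in adj]

    # Maximum matching by repeated augmenting-path search.
    match_v = [-1] * n_v                       # vertical run -> matched horizontal run

    def augment(h, seen):
        for v in adj[h]:
            if v not in seen:
                seen.add(v)
                if match_v[v] == -1 or augment(match_v[v], seen):
                    match_v[v] = h
                    return True
        return False

    for h in range(n_h):
        augment(h, set())

    # Koenig's construction: alternating search from unmatched horizontal runs.
    match_h = [-1] * n_h
    for v, h in enumerate(match_v):
        if h != -1:
            match_h[h] = v
    stack = [h for h in range(n_h) if match_h[h] == -1]
    seen_h, seen_v = set(stack), set()
    while stack:
        h = stack.pop()
        for v in adj[h]:
            if v not in seen_v:
                seen_v.add(v)
                h2 = match_v[v]
                if h2 != -1 and h2 not in seen_h:
                    seen_h.add(h2)
                    stack.append(h2)
    cover_h = [h for h in range(n_h) if h not in seen_h]
    cover_v = sorted(seen_v)
    return cover_h, cover_v, h_id, v_id


# A "T"-shaped pattern: one horizontal bar and one vertical stem.
img = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
cover_h, cover_v, h_id, v_id = mcr(img)   # cover_h == [0], cover_v == [2]
```

For this pattern the covering consists of one horizontal run (the bar) and one vertical run (the stem), which corresponds to the intuitive decomposition into two strokes.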
Fig. 2 shows a simple binary pattern, its MCR run representation and the corresponding bipartite graph. These parts are described according to 8 topological features. These are obtained after a zoning scheme, where the line of text is detected and divided into four zones: an upper zone, a middle zone, a baseline zone and a lower zone.

Fig. 2. (a) A simple binary pattern, (b) corresponding bipartite graph, (c) horizontal and vertical strokes by the modified MCR expression.

This zoning scheme has been decided because most of the information from the shape of the body of Arabic characters lies in the upper portion of the characters. Also, most stress marks (dots of characters and a zigzag called "hamza") occur effectively above the writing line that we call the baseline.

3. FEATURES EXTRACTION

In the framework of structural approaches to OCR, most methods are based on representation, feature extraction, description and classification. For cursive writing it is necessary to overcome the complicated problem of letter separation. In our novel approach with MCR representation we are able to segment words or sub-words into characters at the same time as the recognition is achieved. Word segmentation (separation of text into words and sub-words) is mostly achieved by the connected component information from the labeling process. The other information is from the prototypes themselves, as Arabic characters have a special shape at the end of a word or a sub-word. Unfortunately there is no way to tell a word from a sub-word, which could be composed of as few as one character, unless high-level recognition is sought, such as the use of a dictionary or a spell checker.

The feature vectors built from the character pseudo-segmentation into parts provide the dynamic information from the static image, needed for offline character recognition.
The recognition is performed from left to right, to be in a natural way that allows for an eventual link to a speech synthesizer machine.

Eight topological features are extracted from the structural information obtained. These are: geometrical features (length, width); type of stroke (horizontal or vertical); direction (left and right from the center); and relative position with respect to the baseline. In addition we associate with each stroke a region label {rgn} and a label for the number of connected components in that stroke {con}. Although strokes are divided into only two types, namely horizontal or vertical, we still keep trace of the original curvature and slant in the direction or angle information. The recognition stage is then achieved by simple matching of a candidate character on a scanned document to a prototype in the reference database.

We match a candidate character $C$ to a prototype $P$ having the same number of strokes $k$. All the prototypes are visited in this process, and if for a prototype $P = (S_{p_1}, \ldots, S_{p_k}, \text{connection\_rule})$ there is a candidate character $C = (S_{c_1}, \ldots, S_{c_k}, \text{connection\_rule})$ such that

$$\forall S_p \in P,\ \exists S_c \in C : (f_p \equiv f_c) \cap (\text{Connection\_Rule\_Match}),$$

where $f_p$ is a relationship to, or value of, one of the 8 features of a stroke in the prototype, and $f_c$ the corresponding feature of a stroke in the candidate character, then the candidate character shape $C$ is matched to the prototype $P$.
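The matching rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ORAN implementation: the class `Shape`, the feature keys (`typ`, `pos`, `len`), the zone names and the connection rule values (`open`, `closed`) are all hypothetical. A prototype feature is either a required value or a predicate standing for "a relationship to" a value. The check only requires that every prototype stroke has some compatible candidate stroke; it does not enforce a one-to-one pairing of strokes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Shape:
    """A character shape: a list of stroke feature records plus a connection rule."""
    strokes: List[Dict[str, object]]
    connection_rule: str


def feature_ok(spec, value):
    # A prototype feature is either a predicate (a relationship) or a plain value.
    return spec(value) if callable(spec) else spec == value


def stroke_ok(proto_stroke, cand_stroke):
    # Every feature constrained by the prototype stroke must hold in the candidate.
    return all(
        name in cand_stroke and feature_ok(spec, cand_stroke[name])
        for name, spec in proto_stroke.items()
    )


def matches(proto, cand):
    if len(proto.strokes) != len(cand.strokes):        # same number of strokes k
        return False
    if proto.connection_rule != cand.connection_rule:  # connection rule must match
        return False
    # For every prototype stroke there exists a compatible candidate stroke.
    return all(
        any(stroke_ok(ps, cs) for cs in cand.strokes)
        for ps in proto.strokes
    )


def recognise(cand, prototypes):
    """Visit all prototypes and return the names of those the candidate matches."""
    return [name for name, proto in prototypes.items() if matches(proto, cand)]


# Hypothetical prototypes following the stroke counts given in the Introduction:
# a "C"-like curve has 1 vertical and 2 horizontal strokes, an "O" has 2 and 2.
prototypes = {
    "C-like": Shape(
        [
            {"typ": "v", "len": lambda n: n >= 8},
            {"typ": "h", "pos": "upper"},
            {"typ": "h", "pos": "lower"},
        ],
        "open",
    ),
    "O-like": Shape(
        [{"typ": "v"}, {"typ": "v"}, {"typ": "h"}, {"typ": "h"}],
        "closed",
    ),
}

candidate = Shape(
    [
        {"typ": "h", "pos": "upper", "len": 9},
        {"typ": "v", "pos": "middle", "len": 12},
        {"typ": "h", "pos": "lower", "len": 10},
    ],
    "open",
)

result = recognise(candidate, prototypes)   # ["C-like"]
```

Visiting every prototype, as the text describes, keeps the procedure simple; the stroke-count test rejects most prototypes before any feature comparison is made.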
