
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

DOCUMENT PRESENTATION ENGINE FOR INDIAN OCR


A DOCUMENT LAYOUT ANALYSIS APPLICATION
Umesh Kumar, Jagdish Raheja

ABSTRACT

Today office automation is spreading across every field. Everybody uses computers for fast data processing and for maintaining large amounts of data, but there is also a need for previously processed data that exists only as printed documents. There are two ways to reuse such old data: the first is to retype the data into the computer, and the second is to scan the document and use OCR (Optical Character Recognition) to convert the document image into editable text. More than 1000 languages and 14 scripts are used by 112 million people in India [1, 2], so there is a need for an OCR system for Indian scripts, which is under development. In OCR, the document is scanned and then noise cleaning, skew detection and correction, text/non-text classification, text line detection and segmentation, word segmentation, character segmentation and identification, and output file generation are performed. The main contribution of this paper is a method for maintaining the layout of the document. At present, OCR systems produce a text file as output without maintaining the layout of the document. OCR is an error-prone process, and the errors remaining in OCRed text can cause serious problems in reading and understanding if the output does not follow the original image representation. In this paper the use of XML to generate an OpenOffice document file is presented, following the Open Document standard approved by OASIS (Organization for the Advancement of Structured Information Standards) on Feb. 1, 2007 [3]. The main feature of the proposed solution is that it is script independent, so it can be applied to all Indian scripts.

Keywords: OCRed Document, Indian Scripts, Document layout.

1.0 INTRODUCTION

In computer science, text and images are the main sources of information. A human can understand information only if it is presented well: if data is presented with images in a well-organized manner, a reader can understand it better, while a poor presentation obscures the information. The OCR process is error prone, and it is time consuming and expensive to manually proofread OCR results. The errors remaining in OCRed text can cause serious problems in reading and understanding if the output does not follow the original image representation. Representing the document after OCR is therefore an important task: a document becomes hard to read and understand if its layout is not maintained as in the original image. Present systems scan the document image and place the text and images one after another without maintaining the layout [2]. This paper briefly discusses the processing steps of OCR and then discusses the proposed system in detail. The following figure gives an overview of the OCR process.

Figure 1: Flow of the OCR process

As the above figure shows, the first step of OCR is two-tone conversion, which converts the image into a binary image; then skew is detected and corrected. Noise cleaning is performed on the skew-corrected image. A text/non-text classification technique classifies the document image into text and image regions. The image part is extracted from the document image for further processing, and the remaining text part is passed to text line detection. After the text lines are detected, each line is processed, individual characters are generated, and the final output file is produced. But at this stage there is a problem: the output file does not maintain the layout of the document image. The main contribution of this paper is a method for maintaining the layout of the output document.

©Informatics '09, UM 2009 RDT4 - 88

2.0 PROCESSING OF OCR

OCR is a combination of multiple processes, as shown in figure 1 above. The first process is to acquire the document image using a scanner or a camera. The acquired image may contain many colors, but OCR processes a binary image, so a global threshold value is generated and the image is converted to the corresponding binary image. Then the noise is removed from the input image; the most commonly used approach, morphological component removal, is applied here. Figure 2 shows the input image before and after noise removal.

Figure 2: Noisy image, Noise cleaned Image, Skewed Image and Skew Corrected Image
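The two-tone conversion described above can be sketched as follows. The paper does not state how the global threshold is chosen, so the mean-based threshold here (and the use of NumPy) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def binarize(gray, threshold=None):
    """Convert a grayscale image (2-D array, values 0-255) to binary.

    If no threshold is given, the global mean is used as a simple
    global threshold (an assumption for illustration; the paper does
    not specify its thresholding method).
    """
    if threshold is None:
        threshold = gray.mean()
    return (gray < threshold).astype(np.uint8)  # 1 = ink, 0 = background

# A tiny synthetic "document": dark text pixels on a light page.
page = np.full((4, 4), 220, dtype=np.uint8)
page[1, 1:3] = 30          # two "ink" pixels
binary = binarize(page)    # only the two dark pixels map to 1
```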

After noise removal, the resulting image is input to skew detection and correction. Skew is caused by improper alignment of the document paper during scanning: if the document is not placed properly on the scanner, the scanned image may be skewed. A skewed image may cause the text in the document image to go undetected, so it is necessary to detect the skew and correct it. Figure 2 above shows a skewed image and the resulting image after correction. After this preprocessing (noise removal and skew correction), the image is segmented into two categories, text and non-text, by the text/non-text classification module. Since document images may contain text, images and tables, this module takes images and tables as non-text areas and the remaining parts as text areas [13, 14]. At this stage the non-text areas are extracted from the image and stored, and the remaining part of the image is used to detect the text. Figure 3 shows the input image, the text/non-text classification and the extracted text area. A red boundary marks the non-text areas of the image.

Figure 3: An input image, text/non-text classification, and extracted text area
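The skew detection step described above can be approximated with a projection-profile search. The paper does not name its skew technique, so this shear-based NumPy sketch is an assumption: for each candidate angle, columns are shifted vertically and the angle whose horizontal projection has the highest variance (ink concentrated into few rows) is chosen.

```python
import numpy as np

def skew_angle(binary, angles=np.arange(-5, 5.5, 0.5)):
    """Estimate the skew (in degrees) of a binary image (1 = ink).

    A projection-profile method, assumed for illustration: shearing
    the columns by the true skew flattens the text lines, which makes
    the row-sum profile maximally peaked (highest variance).
    """
    h, w = binary.shape
    best, best_score = 0.0, -1.0
    for a in angles:
        # Vertical shift of each column approximating a small rotation.
        shift = (np.arange(w) * np.tan(np.radians(a))).round().astype(int)
        profile = np.zeros(h)
        for x in range(w):
            rows = (np.arange(h) + shift[x]) % h   # wrap for simplicity
            np.add.at(profile, rows, binary[:, x])
        score = profile.var()
        if score > best_score:
            best, best_score = a, score
    return best

# A perfectly horizontal synthetic text line should yield zero skew.
page = np.zeros((30, 200), dtype=np.uint8)
page[15, :] = 1
```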

Figure 3 above shows the complete text part of the document image, which is used to detect the text. After classification, each text block is identified and the text lines are detected, as shown in the following image for a single text block.


Figure 5: Detected text lines.

After the text lines are detected, word segmentation, character segmentation, template matching and character replacement are performed. Standard techniques exist for these processes and are discussed elsewhere [5, 12], so a detailed discussion is not given here; for the sake of completeness of the OCR process, only an introduction follows. Word segmentation is performed using a basic feature of the script: the white space between words. Each word is segmented and passed on for further processing, namely character segmentation, which uses the same basic feature of the scripts, the white space or gap between characters. The output of this process is a set of individual character images, and for each one the matching character is searched for and substituted.
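The white-space feature used for word segmentation can be sketched as a scan over the vertical projection of a binary text-line image. The minimum gap width and the NumPy representation are illustrative assumptions; the same scan with a smaller gap threshold would give character segmentation.

```python
import numpy as np

def segment_words(line, min_gap=3):
    """Split a binary text-line image (1 = ink) into word slices.

    Columns whose vertical projection is zero are background; a run of
    at least `min_gap` empty columns is treated as an inter-word space
    (the gap threshold is an assumption for illustration).
    Returns (start, end) column ranges, end exclusive.
    """
    proj = line.sum(axis=0)
    words, start, gap = [], None, 0
    for x, v in enumerate(proj):
        if v > 0:
            if start is None:
                start = x        # first ink column of a new word
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:   # the space is wide enough: close the word
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:        # word running to the edge of the line
        words.append((start, len(proj) - gap))
    return words

# Two 5-column "words" separated by a 5-column space.
line = np.zeros((5, 20), dtype=np.uint8)
line[:, 0:5] = 1
line[:, 10:15] = 1
```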

3.0 OVERVIEW OF PROBLEM

The present OCR system available for Indian scripts is able to convert a document image into editable text and produce a text file, as discussed above. It performs the preprocessing, classifies the text and non-text areas, detects the text in the segmented text areas, and generates the equivalent text. The editable text and images are then placed in the text file. This works fine and causes no problem if the document image is single column. But what happens if the document image is multicolumn, as shown in the following image?

Figure 6: Multicolumn Document Image (left) and corresponding disorder blocks representation (right)

As shown in figure 6 (left), there are five text blocks and two image blocks. If block 1, then blocks 3, 2, 4 and 5 are processed one by one and placed in the output file, the result is readable; but if this sequence is not followed, or an image is not placed at its proper location, the document is not easy to understand. So it is important to maintain the layout of the document as it was in the input document image. Figure 6 (right) represents the output file of the OCR systems available now, and it illustrates a serious problem of document understanding. The document image in figure 6 (left) is understandable because the reader knows which column to read after finishing one column, but in figure 6 (right) the document presentation is lost, so the reader is not able to understand it. This paper presents a simple approach that takes the text output of the OCR together with the block coordinate information and produces an output file in the odt format in which the layout is maintained as it was in the input document image.

4.0 PROPOSED SOLUTION

The problem of layout retention for OCRed document images is solved using the Open Document Format specification v1.1, approved by OASIS on Feb. 1, 2007. It specifies the XML support needed to generate the document and handle the document elements. The document contains the document root, document metadata, body elements and document types, application settings, scripts, font face declarations, styles, page styles and layout [3]. Using Java, a simple parser is designed which defines the document elements in XML and compresses them into the document file. An Open Document Text file is generated as output, which can be viewed on Linux as well as on Windows using OpenOffice [3] and Microsoft Office (using plug-ins).

Algorithm Flow Chart


1. Read the block information file. If there are no blocks, exit.
2. Create the support files.
3. Select the font for the specific language.
4. Create the content.xml file.
5. Write the root information in the content file.
6. Read the block coordinate information.
7. If (image != 0) then
       include the element meta-data for image and text
   else if (text != 0) then
       include the element meta-data for text.
8. For I = 0 to N        // N is the number of image blocks
   8.1 Read the coordinate information.
   8.2 Insert the image, its image tag and the coordinate information.
9. For I = 0 to M        // M is the number of text blocks
   9.1 Read the coordinate information.
   9.2 Put the tag for a text frame with the coordinate information.
   9.3 Insert the text in the tag written above.
10. Write the closing tag in the content file.
11. Generate the output file using the jar utility.

The above algorithm and flow chart represent how the document is generated and how the elements are placed in it. Every document contains the document root tag, which holds meta-data and static information; each XML file has specific root tags giving information such as the version, date and type of document, and a footer tag. The user can select the language of the OCR because the technique proposed here is script independent and works for all the languages. The next step depends on the block coordinate information file, which determines whether each block is an image block or a text block. If the document is not blank, the declaration data is written in the content file after the root information; this declaration data defines whether there are image blocks or text blocks. The document elements are then inserted into the document using their coordinate information. For each image, the drawing process is repeated using its coordinate values, its width and height, and the path of the image to insert. Once the image blocks are inserted, text frames are inserted for the text information and the text is written inside each text frame tag; this is repeated for each available text block. Finally, the document leaf tag, the "footer tag", is included, which is static and the same for all documents.
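The content.xml generation and jar/zip packaging (steps 4, 10 and 11 above) can be sketched with Python's standard library. The authors' Java parser is not available, so this is only a hedged sketch: the element and namespace names follow the ODF v1.1 specification, while the package is deliberately minimal (one positioned text frame; no styles, metadata or settings parts, which a complete document would also carry).

```python
import zipfile

ODT_MIME = "application/vnd.oasis.opendocument.text"

# Minimal content.xml with a single absolutely positioned text frame.
CONTENT = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
  xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
  office:version="1.1">
 <office:body><office:text>
  <text:p>
   <draw:frame svg:x="{x}cm" svg:y="{y}cm"
               svg:width="{w}cm" svg:height="{h}cm">
    <draw:text-box><text:p>{text}</text:p></draw:text-box>
   </draw:frame>
  </text:p>
 </office:text></office:body>
</office:document-content>
"""

MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
  xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
 <manifest:file-entry manifest:media-type="{mime}" manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="text/xml"
                      manifest:full-path="content.xml"/>
</manifest:manifest>
""".format(mime=ODT_MIME)

def write_odt(path, text, x, y, w, h):
    """Package one positioned text frame into a minimal .odt archive."""
    with zipfile.ZipFile(path, "w") as z:
        # The mimetype entry must come first and be stored uncompressed.
        z.writestr("mimetype", ODT_MIME, compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/manifest.xml", MANIFEST)
        z.writestr("content.xml",
                   CONTENT.format(text=text, x=x, y=y, w=w, h=h))
```

In the paper's engine the same idea is repeated once per image and text block before the archive is closed.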

5.0 TESTING AND RESULTS

This section takes a reference input image and presents the results step by step.


Figure 7: A document image which will be OCRed.

Preprocessing is performed on the image, the blocks are classified as text and non-text, and the block coordinate information is stored in a file. The text blocks are further processed and the equivalent text is produced. The document image given above has four image blocks and three text blocks; their corresponding block coordinates, converted to the document size, are given below.

image 4
5734.05 10919.968 17416.018 17105.884
6755.892 151.892 17393.92 4012.946
249.936 154.94 6470.904 4121.912
228.092 4129.024 10721.086 10816.082
text 3
10870.946 4215.892 17375.886 1234.43
202.946 10997.946 5592.826 345.765
305.054 17347.946 17445.99 4567.87
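The coordinate values above appear to be in hundredths of a millimetre (e.g. 17416 ≈ 17.4 cm, a plausible page width). Assuming that unit and a known scan resolution, neither of which the paper states, the pixel coordinates from the classification step could be converted as:

```python
def px_to_hmm(px, dpi=300):
    """Convert a pixel coordinate at the given scan resolution to
    hundredths of a millimetre. Both the unit and the 300 dpi default
    are assumptions for illustration; the paper does not state them.
    """
    return px / dpi * 25.4 * 100  # inches -> mm -> 1/100 mm
```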

There are four sets of coordinate values for the image blocks and three sets for the text blocks, as listed above. Using these coordinate values, the text frames are drawn, the text is inserted from the text files, and the image paths are defined together with their coordinate values, height and width. After all the elements of the document are defined in the content file, the jar/zip utility is used to compress all the files into the file name.odt. The output file, opened in OpenOffice, has the correct layout, as shown in the following image.

Figure 8: Output of the Document Presentation Engine process


6.0 CONCLUSION AND FUTURE WORKS

The approach presented in this paper has been implemented and tested with different Indian scripts as well as the Roman script. It works very well, and its performance depends on the accuracy of the text/non-text classification process. Although it works well, much remains to be done: the OCR gives its output as plain text and does not recognize whether the text is bold, italic, underlined, or a heading, so many such features remain to be implemented in the OCR.

REFERENCES

[1] http://tdil.mit.gov.in/resource_centre.html
[2] http://indiansaga.com/languages/index.html
[3] http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.pdf
[4] Thomas M. Breuel, "High Performance Document Layout Analysis", Volume 3, pp. 61-64, 2002.
[5] Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava, R. M. K. Sinha, "Document Layout Analysis and its Application in OCR", 10th IEEE International Enterprise Distributed Object Computing Conference Workshops, 2006.
[6] Jignesh Dholakia, Atul Negi, S. Rama Mohan, "Zone Identification in Gujarati Text", Proceedings of the 2005 8th International Conference on Document Analysis and Recognition, IEEE, 2005.
[7] Suryaprakash Kompalli, Sankalp Nayak, Srirangaraj Setlur, "Challenges in OCR of Devanagari Documents", Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 1520-5263/05, IEEE, 2005.
[8] Pavlidis, T. and Zhou, J., "Page Segmentation and Classification", Graphical Models and Image Processing, vol. 54, pp. 484-496.
[9] Kolcz, A., Alspector, J., Augusteijn, M., Carlson, R., Popescu, G. V. [G. Viorel], "A Line-Oriented Approach to Word Spotting in Handwritten Documents", PAA(3), No. 2, 2000, pp. 153-168.
[10] Likforman-Sulem, L. [Laurence], Zahour, A. [Abderrazak], Taconet, B. [Bruno], "Text Line Segmentation of Historical Documents: a Survey", IJDAR(9), No. 2-4, April 2007, pp. 123-138.
[11] Udo Miletzki, "Character Recognition in Practice Today and Tomorrow", 0-8186-7898-4/97, IEEE, 1997.
[12] Tao Hong and Sargur N. Srihari, "Representing OCRed Documents in HTML", IEEE, 1997.
[13] Z. Shi, S. Setlur, and V. Govindaraju, "Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map", Eighth International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, pp. 794-798.
[14] Umesh Kumar, "Text Line Detection in Multicolumn Document Image and Document Presentation Engine", M.Tech Thesis, GGSIPU, New Delhi, 2008.
