Optical Character Recognition on a Grid Infrastructure

Dana Petcu (1), Silviu Panica (1), Doina Banciu (2), Viorel Negru (1), Andrei Eckstein (1)
(1) Computer Science Department, West University of Timisoara, Romania
(2) National Institute for Research and Development in Informatics, Romania

Abstract
The current capacity to translate paper documents quickly and accurately into
machine-readable form using optical character recognition technology widens the
opportunities for document searching and storage, as well as for automated
document processing. A fast response when translating large collections of
image-based electronic documents into structured electronic documents is still
a problem. The availability of a large number of processing units in Grid
environments and of free optical character recognition tools can be exploited
to produce a fast translation. Following this idea, several experiments
concerning optical character recognition were performed on a Grid
infrastructure, and their results are reported in this paper. These results
encourage further development of systems for document image analysis using
Grid technologies.
1. Introduction
Document image analysis (DIA) is concerned with the conversion of documents in
paper form into electronic formats. Optical character recognition (OCR) is a
fundamental part of DIA. While DIA takes care of the general problem of
recognizing and giving semantics to the graphical components of an input
document, OCR takes care of deriving the meaning of the characters from their
bit-mapped images. Usually, the paper documents are scanned, and the resulting
document images are processed via document layout analysis and OCR. This
transforms them into a structured electronic format similar to that generated
by electronic authoring tools. The conversion of paper documents into
electronic formats is an on-going task at world scale. On the one hand, there
is a large number of documents and books still in paper format that need to be
converted. On the other hand, many documents, like legacy documents, are still
published in paper format for a variety of reasons (see the overview [2]). The
range of applications of OCR has been expanding recently. For example, mobile
electronic devices such as cell phones and digital cameras are capable of
acquiring images at a sufficiently high resolution to facilitate OCR. Some
mobile phones with OCR capabilities are already available on the market today.
Given the low processing power of these devices, it is desirable to have
high-speed OCR systems that can be used on these machines, or to access remote
OCR systems.
In this context, the potential and promise of bringing together image
collections in open, distributed, flexible document recognition frameworks is
immense [4]. A special case is that of large-scale book image collections [11].
Recognizing this fact, the Romanian cooperation project SINRED (National System
for the Management of Digital Resources in Science and Technology based on Grid
Structures, 2005-2008) intends to: design and implement a virtual digital
library based on the facilities of the current public libraries; design the
methods and methodologies for creating a uniform system at the national level
in the field of documentation based on digital documents; analyze the
advantages and opportunities offered by Grid technologies in the field of
digital content; adopt international approaches for library networking that are
compatible with the national standards; design and implement a unique portal
for accessing the digital information; define the procedures for building
digital databases according to national and international regulations. This
paper describes the first results of SINRED concerning the use of Grid
technologies for document image analysis. Section 2 presents a short overview
of the previous efforts to improve the response time of OCR systems. The
experiments performed in a Grid environment are reported in Section 3.
2. State-of-the-art concerning OCR systems
The OCR problem is well suited to parallelization or
distribution: the most computationally intensive part of
the problem is recognizing individual characters and this
requires access to particular parts of the scanned bitmap
data. Moreover, the processing associated with characters
or groups of characters can be done independently.
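
The independence of the per-page (or per-region) work can be illustrated with a minimal task-farm sketch. It assumes a hypothetical `recognize` callable that maps one page image to its text; neither that name nor the helper `recognize_pages` comes from any of the systems surveyed here. A thread pool is used on the assumption that the recognizer spends its time in an external OCR process rather than in Python code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def recognize_pages(
    pages: Sequence[str],
    recognize: Callable[[str], str],
    workers: int = 4,
) -> List[str]:
    """Apply `recognize` to every page independently, keeping page order.

    `recognize` is a hypothetical stand-in for a real OCR call.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the calls concurrently but yields results in input
        # order, so output i always corresponds to input page i.
        return list(pool.map(recognize, pages))


if __name__ == "__main__":
    # Toy recognizer for illustration only: upper-cases the file name.
    print(recognize_pages(["page1.pnm", "page2.pnm"], str.upper, workers=2))
```

Because no task reads another task's data, the same structure carries over to partitioning by region, line, or character; only the unit handed to `recognize` changes.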
There are several ways in which, in the last thirty years,
the task of recognizing a document was partitioned for pa-
rallel or distributed processing: by algorithm phase, by
scanned page, by page region, by line or by character. The
Third International Conference on Automated Production of Cross Media Content for Multi-channel Distribution
0-7695-3030-3/07 $25.00 2007 IEEE
DOI 10.1109/AXMEDIS.2007.23
Authorized licensed use limited to: Sri Vasavi Engineering College. Downloaded on December 21, 2009 at 00:25 from IEEE Xplore. Restrictions apply.
MLOCR system [1], for example, was designed in the mid-90s for real-time
processing of printed characters on letters. The system was configured in a
master-slave mode, and different phases (such as separation, rotation, etc.)
are performed by different processors. Similarly, on a transputer machine from
the same period, the implementation presented in [5] splits the postcodes up
into individual characters and processes them independently within the farm
that implements the preprocessing and character classification stages. OCHRE-P
[7] was a prototype OCR system modelled on a task farming paradigm, in which
each task acted on a line of text. Parallelism at the text-line level was
discussed more recently in [20]. A modern client-server solution was proposed
in [16], starting from the remark that a recognition system can be logically
subdivided into three components: one capturing data, one classifying data,
and a knowledge base holding models of the words or characters known to the
system. In [16] the data capture occurs locally on a client machine and
recognition is delegated to a set of remote servers. The exchange of online
data requires a protocol, and central to this is a format in which to send data
to a remote server (an XML-based format is currently used).
With the increase of computing power, the need to improve the accuracy of OCR
has led to the approach of combining classifiers, at the cost of increased
processing. Simple classifier fusion methods such as minimum, maximum, average,
median, and majority voting have been tested. The paper [3] proposes a cascade
architecture for combining classifiers. Both the separate classifiers and the
cascade classifiers are good candidates for limited parallelism or
distribution. The availability of a large number of idle workstations in
computer labs was recently exploited as an opportunity to produce searchable
PDF files for material that exists only as image files [17]: using 100
workstations available approximately 12 hours a day, 7 million searchable PDF
documents were generated from 42 million TIF page images, with a peak
productivity of about 1 million pages per day. The workstation application
determined which TIF files were available for processing by querying and
updating a server-based SQL database, and processed page images one at a time,
returning text and searchable PDF files to the server.
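
The database-driven work distribution described for [17] can be sketched as a shared queue table that each workstation polls. The sketch below is only an illustration under stated assumptions: SQLite stands in for the server-based SQL database, and the table name `pages`, the status values, and the function names are hypothetical, not taken from [17].

```python
import sqlite3
from typing import Iterable, Optional


def create_queue(conn: sqlite3.Connection, paths: Iterable[str]) -> None:
    """Create the (hypothetical) queue table and register pending pages."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "path TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO pages (path, status) VALUES (?, 'pending')",
        [(p,) for p in paths],
    )


def claim_next_page(conn: sqlite3.Connection) -> Optional[str]:
    """Atomically claim one pending page, or return None if none is left."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # cannot select and claim the same page.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT path FROM pages WHERE status = 'pending' "
            "ORDER BY path LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE pages SET status = 'processing' WHERE path = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


def mark_done(conn: sqlite3.Connection, path: str) -> None:
    """Record that a claimed page has been processed."""
    conn.execute("UPDATE pages SET status = 'done' WHERE path = ?", (path,))
```

The connection is expected to be opened with `isolation_level=None`, so that the explicit `BEGIN IMMEDIATE` and `COMMIT` statements control the transaction. A worker would then loop on `claim_next_page`, process the page, and call `mark_done`.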
The OCRGrid [12, 13] is a platform for distributed and cooperative OCR systems
that allows end users to search for and use OCR servers over networks.
Moreover, a toolkit was built to deploy secure Web-based OCRs. The document
image is sent from the client's computer to the toolkit via a Web server. Every
OCR server has a specification file in XML, which is written by the server
administrator. The file describes the specifications of the server, including
the location (URL) of the server, the name of the OCR engine, the supported
languages, the document types, etc. The portal server has a robot program for
collecting the specification files automatically and periodically from the OCR
servers. The robot analyzes each specification file in XML and updates the
database entries. A simple search program picks from the database the OCR
servers that match the client's needs and shows the search results. The client
gathers the results and performs the recognition based on majority logic. A
recent Web-based OCR system is freely available for research purposes at [8].
The user is supposed to be a human. The current trend in information technology
is to build service-oriented architectures in which software components can
replace the human user. In this context, a Web service for OCR is proposed by
[15].
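
The majority logic applied by the client to the results gathered from several OCR servers can be illustrated with a simplified sketch. It assumes that the outputs are already aligned character by character and have equal length, which real OCR outputs generally are not; the function name `majority_vote` is hypothetical and is not part of OCRGrid.

```python
from collections import Counter
from typing import Sequence


def majority_vote(outputs: Sequence[str]) -> str:
    """Combine aligned OCR outputs by voting on each character position."""
    if not outputs:
        raise ValueError("at least one OCR output is required")
    length = len(outputs[0])
    if any(len(text) != length for text in outputs):
        raise ValueError("outputs must be aligned to the same length")
    voted = []
    for column in zip(*outputs):
        # most_common(1) returns the most frequent character; on a tie,
        # the character encountered first (earliest engine) is kept.
        voted.append(Counter(column).most_common(1)[0][0])
    return "".join(voted)


if __name__ == "__main__":
    print(majority_vote(["Grid", "Gr1d", "Grid"]))
```

With three engines, a misreading by one engine is outvoted by the other two, at the cost of running every engine on every page.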
Current initiatives such as Google Book Search [11] and the Open Content
Alliance [19] are advanced efforts to digitize millions of books. The OCRopus
engine recently proposed by Google [10], intended for high-throughput,
high-volume document conversion efforts, is based on two research projects: a
high-performance handwriting recognizer developed in the mid-90s and novel
high-performance layout analysis methods (the code will be available at the
end of 2007 or the beginning of 2008).
Since OCR technology has matured to the point where the available recognition
software delivers negligible error rates per page for high-quality typewritten
text, efforts are now concentrated on more difficult tasks: recognizing
important content, including both its semantic and structural aspects; creating
flexible and modular document recognition systems that recognize a vast
diversity of fonts, symbols, tables, and languages; and identifying links to
other documents in order to group diverse content. In [2] it is suggested that
paper documents should be scanned, but that they should be kept on-line in
image-based form, and that OCR results should be viewed as an annotation of the
document image, not as the definitive representation. Extensive work is being
carried out on creating standards (like XML or SVG) for semantic and high-level
representations of documents.
3. Experiments on a Grid infrastructure
The aim of our experiments is to show that the response time of a
state-of-the-art OCR system applied to a large collection of image-based
electronic documents can be considerably improved if Grid technologies are
involved. In our first experiments, which are reported here, we consider the
Grid architecture to be a computational Grid. In particular, the tests were
performed on the European SEE-Grid infrastructure [21], with around 750 CPUs.
While commercial OCR systems provide very low error rates in the character
recognition process, their use on a Grid infrastructure is not an option. Free
OCR systems should be used. When a gLite-based infrastructure like SEE-Grid is
used, the selected OCR system should work on Linux. There are several free OCR
systems available on Linux platforms, such as gOCR, Ocrad, ocre, ClaraOCR, and
OCRchie. We selected gOCR [9] for several reasons: (a) a recently released
version; (b) command-line operation; (c) an executable that is portable without
special libraries; (d) good reviews in the literature about its quality; (e)
the availability of libgocr, a library with the functionality needed to develop
an OCR engine (for further experiments, like collaborative engines). The test
pages, containing only text, were scanned at 600 dpi. Four sets of scanned
pages of different qualities were considered. Each set contains 11 pages from a
book: (1) a book from 1993 published in Germany; (2) a book from 1996 published
in Romania using a photocopy technique; (3) a book from 1999 published in
Germany in the LNCS series; (4) a book from 2003 published in the UK containing
Harry Potter stories. The images were stored as grayscale GIF images. Samples
of these pages, as well as the text recognized by gOCR, are shown in Figure 1.
Giftopnm, part of Netpbm [18], was used to convert a GIF file into a PNM image,
the classical input of gOCR. As expected, the error rate of gOCR depends on the
quality of the input. This error rate is high for the second test set and is
acceptable for the other test sets (see Table 1). While ten years ago applying
an OCR system to a page was a time-consuming task (in [6], for example, for a
stream of input pages of 1135 characters, the average time taken to compute a
single page was 18160 sec, and a transputer reduced it to 450 sec), currently
the text can be obtained in a few seconds. For the test pages, gOCR responded
in an interval between 9 and 50 seconds. We noticed that the response times and
the error rate of the recognized text are far from optimal: the system from
[8], for example, allows faster and more accurate OCR, despite being a
Web-based system. But gOCR is still the best candidate for Grid-enabling the
OCR of the books. Applying parallelism or distribution at line, region, or
character level is not under discussion, since the times obtained by applying
current OCR systems to a full page are too short.

Figure 1. Sample from the test pages
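
The per-page conversion chain (GIF to PNM with giftopnm, then PNM to text with gOCR) can be sketched as follows. This is a minimal illustration, not the script used in the experiments: the exact options passed to gOCR are not stated in the text, so the `-i` and `-o` options shown here are an assumption about a typical invocation, and the function names are hypothetical. Both tools must be installed for `ocr_gif_page` to run.

```python
import subprocess
from pathlib import Path
from typing import List


def conversion_command(gif_path: Path) -> List[str]:
    """Command that writes the PNM version of a GIF image to stdout."""
    return ["giftopnm", str(gif_path)]


def ocr_command(pnm_path: Path, text_path: Path) -> List[str]:
    """Command that runs gOCR on a PNM image (assumed -i/-o usage)."""
    return ["gocr", "-i", str(pnm_path), "-o", str(text_path)]


def ocr_gif_page(gif_path: Path) -> Path:
    """Convert one GIF page to PNM, run gOCR on it, return the text file."""
    pnm_path = gif_path.with_suffix(".pnm")
    text_path = gif_path.with_suffix(".txt")
    # giftopnm writes the converted image to standard output,
    # so its output is redirected into the PNM file.
    with open(pnm_path, "wb") as pnm_file:
        subprocess.run(conversion_command(gif_path), stdout=pnm_file, check=True)
    subprocess.run(ocr_command(pnm_path, text_path), check=True)
    return text_path
```

Keeping the command construction separate from the execution makes it possible to inspect or log the commands without having the tools installed.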
A script was built to launch, in parallel on the SEE-Grid infrastructure, the
tasks of applying gOCR to the test images from a set. The packages that were
sent remotely using gLite facilities included the gOCR executable, the Netpbm
library, and the GIF images. As expected, there is a considerable speedup of
the response time of processing each test set (see Table 2), despite the fact
that the local workstation on which the test sets were processed sequentially
is more powerful than the individual workstations from the SEE-Grid
infrastructure where the tasks were run. As the books were not completely
transformed into image-based documents, the times needed for processing them
entirely were only estimated. The estimation takes into account a possible
temporary limitation of the number of processing units in the SEE-Grid
environment (128 CPUs from the 750 CPUs that are available in SEE-Grid).

Table 1. Rates of recognizing characters

Test set                                   1       2       3       4
Number of characters per page Mean      4407    3007    2822    2050
                              Min.      3624    2925    2226    1707
                              Max.      4820    3088    3196    2271
Percent of recognized chars   Mean     97.74   66.48   98.10   96.07
per page                      Min.     97.53   64.72   97.76   93.62
                              Max.     97.95   68.23   98.61   97.71
Number of words per page      Mean       746     467     434     352
                              Min.       650     457     352     307
                              Max.       814     486     487     381
Percent of recognized words   Mean     93.68   18.51   92.91   83.46
per page                      Min.     92.38   15.07   92.19   74.02
                              Max.     95.38   22.98   93.75   89.43

Table 2. Response times (s) and speedup

Test set                                              1        2        3        4
Time on user machine  Total time/set              339.5    400.9   128.79    99.52
                      Mean time/page              30.87    36.45    11.71     9.05
                      Pages of the book              14      272      181      766
                      Expected time/book            432     9913     2119     6929
Time in Grid          User time/set               69.66    60.05    22.65    11.75
                      Mean time/page              50.84    47.21    16.14     9.77
                      Min. time/page              28.31    16.42     9.96     6.79
                      Expected time/book             71      121       46       72
Speedup               Per set, 11 CPUs             4.81     6.62     5.59     8.30
                      Expected/book,               6.11    81.92    46.07    96.36
                      max. 128 CPUs
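
The whole-book estimates in Table 2 can be related to the measured values with a short calculation. The sketch below is one plausible reading, stated as an assumption: the expected sequential time per book is taken as the mean time per page on the user machine multiplied by the number of pages of the book, and the expected speedup as that time divided by the expected Grid time. The formula behind the expected Grid time is not given in the text, so those values are copied from Table 2 unchanged. Recomputing from the rounded table entries gives values close to, but not always identical with, the reported ones.

```python
from typing import NamedTuple


class SetTiming(NamedTuple):
    total_local_s: float  # sequential time for the 11 test pages (Table 2)
    book_pages: int       # pages of the whole book (Table 2)
    grid_book_s: float    # expected Grid time for the whole book (Table 2)


PAGES_PER_SET = 11  # each test set contains 11 pages

TABLE_2 = {
    1: SetTiming(339.5, 14, 71),
    2: SetTiming(400.9, 272, 121),
    3: SetTiming(128.79, 181, 46),
    4: SetTiming(99.52, 766, 72),
}


def expected_local_book_time(timing: SetTiming) -> float:
    """Mean sequential time per page, scaled to the whole book."""
    return timing.total_local_s / PAGES_PER_SET * timing.book_pages


def expected_book_speedup(timing: SetTiming) -> float:
    """Estimated sequential book time divided by the expected Grid time."""
    return expected_local_book_time(timing) / timing.grid_book_s


if __name__ == "__main__":
    for test_set, timing in TABLE_2.items():
        print(
            test_set,
            round(expected_local_book_time(timing)),
            round(expected_book_speedup(timing), 2),
        )
```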
Furthermore, a user interface has been built: the gOCR component of GOC, the
Grid Operations Center. GOC is a Grid portal operating on the authors' academic
site, which ensures the interface between the Grid and the gOCR component, on
one side, and the user (human or code), on the other side. GOC should be seen
as a Grid user interface that delivers access to the Grid infrastructure using
components specialized in solving one or more types of problems. For example,
in our case the gOCR component is a component that makes the gOCR program
available on the Grid. GOC has three main parts (Figure 2): (1) the
authentication component, which ensures user authentication to the portal and
also, using digital certificates, authorization to the Grid infrastructure; (2)
the Grid engine component, which makes possible the connection to the Grid
infrastructure using specific Grid API components (in our case, the gLite API);
(3) the gOCR component, which delivers a Grid-enabled interface to the gOCR
analyzer.
Figure 2. GOC-gOCR: functionality; interface

The authentication component is in charge of user authentication and
authorization, using either password authentication (if a local Grid
infrastructure is used) or certificate-based authentication (if a global Grid
infrastructure, like SEE-Grid, is used). It also provides user authorization
for certain components: each user is allowed to use only the specific
components assigned to them. The Grid engine component makes the connection
between the user interface portal and the Grid infrastructure using the API
specific to that infrastructure. In our case we developed this portal for the
gLite infrastructure used in the SEE-Grid project. To submit a job to a gLite
infrastructure the user must do the following: (a) obtain a digitally signed
gLite VOMS certificate for access to a gLite infrastructure UI (User
Interface); (b) create a JDL (Job Description Language) file for the
application in order to launch it on the Grid; (c) submit the JDL file to the
Grid using the gLite user utilities available on the gLite UI; (d) check the
status of the job; (e) retrieve the job output using the gLite user utilities
available on the gLite UI. Through the gOCR component the user can: register
new job sessions; upload the image files that need to be analyzed; use the gOCR
code or their own particular code (the executable code, the list of parameters,
and other external files can be uploaded). If all the above steps are completed
successfully, the gOCR component proceeds to create a package (adapting the
gOCR code or the user's own code for the Grid) and then, at the user's command,
sends this package to the Grid engine component. After the Grid job has run
successfully, the user can send a retrieve command to obtain the output from
the Grid and, finally, can download the retrieved output along with the log
file (useful in the case of errors). The GOC user interface can be found at
http://ui01.info.uvt.ro:8080/nGOC/, and the test sets and scripts at
http://web.info.uvt.ro/petcu/Grid-gOCR.zip.
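
Step (b) of the job submission procedure, creating the JDL file, can be sketched as a small generator. The attribute names (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`) are standard JDL attributes, but the file layout, the wrapper script `run_gocr.sh`, and the function `make_jdl` are hypothetical and do not reproduce the package actually built by the gOCR component. The submission, status, and retrieval utilities of steps (c) to (e) are not named in the text and are therefore left out.

```python
from typing import Sequence


def _jdl_list(items: Sequence[str]) -> str:
    """Format a sequence of strings as a JDL list: {"a", "b"}."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def make_jdl(
    executable: str,
    arguments: str,
    input_files: Sequence[str],
    output_files: Sequence[str],
) -> str:
    """Build the text of a simple JDL job description."""
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        # Files shipped with the job to the worker node.
        f"InputSandbox = {_jdl_list(input_files)};",
        # Files brought back when the job output is retrieved.
        f"OutputSandbox = {_jdl_list(['std.out', 'std.err', *output_files])};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: one page, shipped with the tools it needs.
    print(
        make_jdl(
            "run_gocr.sh",
            "page01.gif",
            ["run_gocr.sh", "gocr", "giftopnm", "page01.gif"],
            ["page01.txt"],
        )
    )
```

One job description per page keeps the jobs independent, which matches the page-level distribution used in the experiments.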
4. Conclusions and future directions
The first experiments done in the frame of SINRED revealed that a computational
Grid infrastructure can be efficiently used to speed up the translation of
large sets of image-based documents into structured documents that are easy to
discover, search, and process. The next step will consist in experimenting with
combined classifiers on a computational Grid infrastructure. Taking into
consideration the current evolution of Grid architectures towards
service-oriented architectures, the design of Web services wrapping existing
OCR systems, other document analysis tools, or combined classifiers, as well
as of Web services written using Java OCR [14], is under discussion.
Acknowledgment. This work is supported by RO-CEEX-I 729/2005 SINRED and
RO-CEEX-II 5919/2006 GRAI.
References
[1] D. Andrews, R. Brown, C. Caldwell, et al., A Parallel Architecture for Performing Real Time Multi-Line Optical Character Recognition, in Procs. 25th SSST, 1993, pp. 533-536.
[2] T.M. Breuel, The Future of Document Imaging in the Era of Electronic Documents, in Procs. DAS VI, 2005.
[3] K. Chellapilla, M. Shilman, P. Simard, Combining Multiple Classifiers for Faster Optical Character Recognition, in Procs. DAS VII, Springer, LNCS 3872, 2006, pp. 358-367.
[4] G.S. Choudhury, T. DiLauro, R. Ferguson, M. Droettboom, I. Fujinaga, Document Recognition for a Million Books, D-Lib Magazine, Vol. 12, No. 3, 2006, www.dlib.org/dlib/march06/03contents.html
[5] A. Cuhadar, A.C. Downton, Scalable Parallel Processing Design for Real Time Handwritten OCR, in Procs. 12th IAPR Inter. Conf. Pattern Recognition, Vol. 3, 1994, pp. 339-341.
[6] M. Danelutto, S. Pelagatti, R. Ravazzolo, A. Riaudo, Parallel OCR in P3L: a case study, in High-Performance Computing and Networking, LNCS 1067, Springer, 1996, pp. 1017-1019.
[7] M. Forbes, OCHRE-P - Optical Character Recognition in Parallel, Technical report EPCC-SS95-06, 1995.
[8] Free OCR, www.123dox.com
[9] gOCR: Optical Character Recognition, 2006, sourceforge.net/projects/jocr/
[10] Google, Ocropus, 2007, code.google.com/p/ocropus/
[11] Google, Books Library Project, 2007, books.google.com/googlebooks/library.html
[12] H. Goto, OCRGrid: A Platform for Distributed and Cooperative OCR Systems, in Procs. 18th ICPR, Vol. 2, 2006, pp. 982-985.
[13] H. Goto, A Platform for Web-Based OCR Systems with Server Search Function, in Procs. DAS VII, LNCS 3872, 2006, pp. 13-16.
[14] JavaOCR, 2004, www.javaocr.com/
[15] LeadTools, Optical Character Recognition Web Service, 2007, www.leadtools.com/SDK/WEB-SERVICES/OCR-SERVICE/default.htm
[16] A.P. Lenaghan, R.R. Malyan, XPEN: An XML Based Format for Distributed Online Handwriting Recognition, in Procs. 7th DAR, 2003, pp. 1270-1274.
[17] R. Mason, H. Schmidt, R. Trott, Down on the OCR Farm: How We Produced Searchable PDFs for 7 Million Documents in a Student Computer Lab, in Procs. 5th JCDL, 2005, pp. 391-391.
[18] Netpbm - graphics tools and converters, 2006, sourceforge.net/projects/netpbm
[19] Open Content Alliance, OCA, 2006, www.opencontentalliance.org
[20] G.J. Rama, A. Ramakrishnan, D. Gupta, Parallel Processing in OCR - A Multithreaded Approach, in Procs. Tamil Internet, 2002, pp. 107-110.
[21] South-Eastern European Grid, 2006, www.see-grid.eu