Abstract: Recently, the world has seen a giant leap in the field of mobile computing. Many
applications that were conventionally based on desktop computers, workstations and specialized
hardware are now being incorporated for mobile devices as well. Optical Character Recognition
being one of them, this paper explores various challenges faced to develop such a technology on
a mobile platform in contrast with conventional platforms such as an optical scanner. The paper
focuses on pre-processing steps required to correct the acquired images through cameras and
mobile devices in order to use them with traditional feature extraction and recognition software.
The paper also explores some solutions and workarounds as adopted by OCR engines including
Tesseract, OCR Droid and ABBYY. The paper lists numerous valuable references in the field and
may serve as a repository for the same.
Keywords: Optical Character Reader, Mobile, Camera, Pre-processing, Document Analysis,
OCR.
Atishay Jain
Akriti Dubey
Rachit Gupta
Nitin Jain
Pooja Tripathi
Associate Professor
ISSN 2319-9725
May, 2013 www.ijirs.com Vol 2 Issue 5
International Journal of Innovative Research and Studies Page 87
1. Introduction:
Optical Character Recognition may be defined as the electronic analysis of an image in order
to identify the areas containing textual information and extracting/recognizing the text from
the given image.
Conventionally, scanners have been used to acquire images of the documents. The document
images are scanned using flat-bed, sheet-fed or mounted imaging devices. Analysis of images
acquired by scanners and the problems associated with them are constrained by their inherent
properties. Scanners have been used to acquire images of printed or handwritten paper
documents such as legal documents, journal papers, cheques and forms, and for archiving books.
A substantial portion of these documents is text, though figures and pictures may also be
present. If the scanned images are of good resolution, processing and feature extraction follow with ease.
A resolution of 300dpi or more is generally adequate for OCR. Optical Character Recognition
in scanned documents and their related problems have been widely addressed in literature.
Recently, with the increasing availability of digital cameras and mobile devices at affordable
prices, the community is showing an increased interest in Camera based OCR. These devices
can also be used in areas where the use of scanners is not possible, e.g. recognising textual
information in a scene, reading of cargo container codes with uneven surfaces[1]. These
devices along with their totally different way of image acquisition pose a new set of
challenges. The lighting conditions may not be proper, the imaged surface may be textured or
non-planar, and the document plane and image plane may not be parallel, leading to
perspective distortion. We have frequently used business cards in this paper to demonstrate
the challenges, because of their familiarity. However, it must be noted that a document may
refer to any scene from which text can be extracted, including licence plates of cars[2], codes
written on shipping containers[1], labels on cylindrical bottles, etc. The challenges are
discussed in detail in section 3.
2. Steps Involved in OCR:
The process of OCR generally involves the following 3 steps:
Figure 1: Flowchart showing steps involved in OCR
2.1. Pre-Processing:
The raw image is generally not suitable for the recognition task and needs to be enhanced:
the text should contrast with the background (black on white), should be projected to a
uniform size, such as 12 pt at 300 dpi, and characters should be sharpened wherever possible.
The image is subjected to some preliminary processing steps to make it usable for further steps
in processing and character recognition. Pre-processing aims to produce data that are easy for
the OCR systems to operate upon accurately. It mainly involves one or more of the following
steps:
2.1.1. Noise Reduction:
Compared to optical scanners, digital cameras are inherently more prone to noise due to their
mode of operation. CCD/CMOS sensors suffer from dark-current and read-out noise. In low-
light conditions, or when the sensor receives less light due to a smaller aperture, more
noise is observed. High shutter speeds and high ISO settings also contribute to increased noise. Blurring
may be introduced due to motion.
Contrast stretching, as introduced by Kuo and Ranganath[3], may be used for enhancing color
and greyscale images. More advanced and sophisticated methods are explained by Taylor and
Dance [4] and Chen et al. [5] in their respective publications.
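As a concrete illustration, a generic percentile-based linear contrast stretch can be sketched as follows. This is a minimal sketch, not the specific method of Kuo and Ranganath[3]; the percentile limits are illustrative assumptions:

```python
import numpy as np

def contrast_stretch(img, low_pct=2, high_pct=98):
    """Linearly map the [low_pct, high_pct] percentile range of
    intensities onto the full [0, 255] range, clipping outliers."""
    lo, hi = np.percentile(img, [low_pct, high_pct])
    out = (img.astype(float) - lo) / (hi - lo) * 255.0
    return np.clip(out, 0, 255).astype(np.uint8)

# A low-contrast ramp occupying only the 50..150 intensity range.
faded = np.arange(50, 151, dtype=np.uint8).reshape(1, 101)
stretched = contrast_stretch(faded)
```

After stretching, the dark end of the ramp maps to 0 and the bright end to 255, so thin strokes become easier to separate from the background.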
If low-resolution cameras are used, character strokes may appear very thin in the image and
blend with the background, so a binarization technique performed on such an image
may not detect the stroke as part of the foreground. Speckle artefacts, on the other hand,
may be removed by a morphological opening operation. Noise reduction and image enhancement are therefore of prime
importance before binarization.
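Morphological opening (an erosion followed by a dilation with the same structuring element) can be sketched in pure NumPy; the 3x3 square structuring element and the toy image below are illustrative assumptions:

```python
import numpy as np

def erode(img, k=3):
    """Binary erosion with a k x k square structuring element."""
    pad = k // 2
    p = np.pad(img, pad, constant_values=False)
    out = np.ones_like(img)
    for dy in range(k):
        for dx in range(k):
            out &= p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def dilate(img, k=3):
    """Binary dilation with a k x k square structuring element."""
    pad = k // 2
    p = np.pad(img, pad, constant_values=False)
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out |= p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def opening(img, k=3):
    """Opening removes foreground components smaller than the element."""
    return dilate(erode(img, k), k)

# Toy binary image: a 4x4 "stroke" plus two isolated speckle pixels.
img = np.zeros((8, 8), dtype=bool)
img[2:6, 2:6] = True
img[0, 7] = True
img[6, 0] = True
clean = opening(img)
```

The single-pixel speckles are eroded away and never restored, while the 4x4 stroke, being larger than the structuring element, survives intact.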
2.1.2. Skew and perspective correction:
Perspective distortion in imaging using cameras is inevitable and its correction is of prime
importance for OCR applications. Skew detection and correction is the first step in the
document analysis and understanding pipeline. Correcting these distortions is
very important, because they have a direct effect on the reliability and efficiency of the
segmentation and feature extraction stages. Noise and variation in document resolution
and type remain the two main challenges facing skew detection and correction
methods.
2.1.3. Binarization:
It is the process of converting the acquired image, which is typically either a color or a
grayscale image, to a bi-level image. Every pixel is categorized either as a foreground or a
background pixel[6][7][8][9][10][12]. Thresholding is the most popular binarization method,
and can be applied in the following ways.
Global thresholding: The entire document image is processed, and an estimate is calculated
based on the pixel counts such that pixels having higher and lower values than the estimate
are classified into different classes, viz. foreground and background. All pixels are
categorized using global characteristics of the image. Otsu's global thresholding
algorithm[8] is a well-known and frequently used example.
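A minimal sketch of Otsu's method, which exhaustively searches for the threshold maximizing the between-class variance; the synthetic two-level image is an illustrative assumption:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: pick the threshold maximizing the between-class
    variance w0*w1*(mu0 - mu1)^2 of background vs. foreground."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Synthetic bimodal image: dark "text" at 20, bright "paper" at 200.
img = np.concatenate([np.full(200, 20), np.full(200, 200)]).astype(np.uint8)
t = otsu_threshold(img)
binary = img >= t   # True = background (bright) pixels
```

On a clean bimodal histogram the chosen threshold falls between the two modes and separates text from paper exactly.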
Local adaptive thresholding[9][10]: Different threshold values are estimated for each pixel
of the image based on local characteristics.
Mehmet Sezgin et al.[6] categorize thresholding methods into six groups according to the
information they exploit.
2.2. Feature Extraction:
In feature extraction stage, each character is represented as a feature vector, which becomes
its identity. During the training stage, feature vectors of characters are prepared as
templates, which later serve to match the features of acquired symbol images against the
respective templates. Feature extraction methods analyse the input document image and
select a set of features that uniquely identifies and classifies the character. The objective of
feature extraction is to extract a set of features, which will maximize the recognition rate with
the least amount of elements. Due to the inherent nature of handwriting being highly variable
and imprecise, obtaining these features such that extraction will generate a similar feature set
for different instances of the same character is a difficult task. Sometimes multiple templates
of features of the same character are prepared, in order to account for the variability in
handwriting. Shunji Mori [27] explores in detail the various feature extraction methods along
with their mathematical treatment.
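As one concrete example of a simple feature, zoning divides the character bitmap into a grid of cells and uses the foreground-pixel density of each cell as the feature vector. The sketch below is illustrative and not one of the specific methods surveyed by Mori [27]:

```python
import numpy as np

def zoning_features(glyph, zones=4):
    """Split a binary glyph into zones x zones cells; the feature vector
    is the foreground-pixel density of each cell."""
    h, w = glyph.shape
    feats = []
    for i in range(zones):
        for j in range(zones):
            cell = glyph[i * h // zones:(i + 1) * h // zones,
                         j * w // zones:(j + 1) * w // zones]
            feats.append(cell.mean())
    return np.array(feats)

# An 8x8 glyph resembling a vertical bar (e.g. the digit "1").
glyph = np.zeros((8, 8))
glyph[:, 3:5] = 1
f = zoning_features(glyph)
```

Because each cell reports only a density, small positional jitter between two instances of the same character changes the vector only slightly, which is exactly the tolerance handwriting variability demands.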
2.3. Recognition (Matching and Discrimination):
This phase includes the algorithm for recognising characters; the image of every character
must be converted to the appropriate character code. The features of the acquired character
are matched against the features of the template characters and discriminated into a
particular character code. The recognition algorithm may also produce a variety of outputs
for a given input character image and provide a probabilistic analysis of the same. For
instance, recognition of an image of the character "l" can produce the codes "l", "1", "/"
and "\" and list the probability of occurrence of each.
Shunji Mori [27] lists and explains in detail many methods of character recognition at the basic
level, and the book can be used as a valuable reference.
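A minimal sketch of template matching with a probabilistic ranking, in the spirit of the "l"/"1" example above; the Euclidean distance measure, the softmax scoring, and the 2-D feature templates are illustrative assumptions rather than any engine's actual method:

```python
import numpy as np

def classify(feature, templates):
    """Rank template labels by a softmax over negative Euclidean
    distances, giving a probability-like score for each candidate."""
    labels = list(templates)
    d = np.array([np.linalg.norm(feature - templates[k]) for k in labels])
    scores = np.exp(-d)
    scores /= scores.sum()
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]

# Hypothetical 2-D feature templates for three easily confused symbols.
templates = {"l": np.array([1.0, 0.0]),
             "1": np.array([0.9, 0.1]),
             "/": np.array([0.0, 1.0])}
ranked = classify(np.array([0.98, 0.02]), templates)
```

The full ranked list, rather than only the top candidate, is what allows a later dictionary or language-model stage to resolve ambiguous shapes.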
3. Challenges to OCR of Camera Acquired Images:
OCR techniques assume high-resolution, high-quality images with a simple structure (high-
contrast text and background) for good-quality recognition. Scanners perform very well
for the above-mentioned tasks. Unfortunately, due to the typical nature of cameras and the
imaging environment, it is not always possible to assure the above. Many discrepancies may arise,
which are explained as follows:
3.1. Uneven Lighting Conditions:
A camera has far less control over the lighting of an image than a scanner, where lighting is
uniform; hence discrepancies such as shadows and reflections can creep into the images.
Further problems arise when on-camera flash is used. As Fisher
[11] found, if on-camera flash is used, the centre of the view is the brightest, and the lighting
decays outward. Use of flash may also result in the sensor being over-exposed or cause
flooding of light.
Due to the immense variation in light that may be present in a camera-acquired image, global
thresholding methods for binarization are not ideal (refer Fig. 2). A better alternative is to use
adaptive thresholding[7][9][10][12]. For example, fig. 3 uses adaptive thresholding.
Figure 2: a) Business card showing uneven lighting conditions, b) A result of Otsu's global
thresholding algorithm [8] clearly showing missing text in light areas, particularly at the
bottom-centre portion, c) A result of manually thresholding the image in an attempt to
recover the missing text. Source: Own work.
Trier and Taxt compared 11 locally adaptive thresholding techniques and concluded that
Niblack's method is the most effective for general images[12]. The OCR Droid engine[13] uses
the Sauvola algorithm for thresholding[10]. Both Niblack's and Sauvola's algorithms compute
thresholds individually for each pixel using information from the local neighbourhood of that
pixel. Both perform well and give excellent results even for documents severely degraded
by uneven lighting conditions. J. He et al.[14] compared both of the above
algorithms for thresholding historical documents against the ABBYY OCR engine[30] and
concluded that their proposed improvements to Niblack's and Sauvola's algorithms
performed slightly better than the originals and the commercial ABBYY OCR engine.
Thresholding using both the algorithms on above business card is shown in fig 3.
Figure 3: a) A result of Niblack's local adaptive thresholding algorithm[9] on Fig 2.a) with
neighbourhood size 140x140 pixels. The missing text has appeared, but a lot of artefacts have
emerged in the background. b) Artefacts removed using a dilation process with a circular
structuring element of radius 3 pixels. c) A result of Sauvola's local adaptive thresholding
algorithm[10] on Fig 2.a) with neighbourhood size 140x140 pixels. Source: Own work.
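A straightforward sketch of Sauvola's per-pixel threshold T = m·(1 + k·(s/R − 1)), where m and s are the local mean and standard deviation; the parameter values k = 0.5 and R = 128 are common defaults, used here as assumptions, and the synthetic test image is illustrative:

```python
import numpy as np

def sauvola_mask(img, w=15, k=0.5, R=128.0):
    """Per-pixel Sauvola threshold T = m*(1 + k*(s/R - 1)), with m, s the
    mean and standard deviation of the w x w neighbourhood.  A real
    implementation would use integral images; the loop is for clarity."""
    h, wd = img.shape
    pad = w // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros((h, wd), dtype=bool)
    for y in range(h):
        for x in range(wd):
            win = p[y:y + w, x:x + w]
            m, s = win.mean(), win.std()
            out[y, x] = img[y, x] > m * (1 + k * (s / R - 1))
    return out  # True = background (bright) pixels

# Bright "paper" at 200 with a small dark "text" patch at 50.
img = np.full((20, 20), 200, dtype=np.uint8)
img[8:12, 8:12] = 50
mask = sauvola_mask(img)
```

Because the threshold adapts to the local mean, a shadowed region of the card simply lowers its own threshold instead of swallowing the text, which is why this family of methods handles uneven lighting well.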
3.2. Skewness (Rotation):
When OCR input is taken from a hand-held camera or other imaging device whose
perspective is not fixed like a scanner's, text lines may be skewed from their original
orientation. Feeding such a rotated image to the OCR engine produces extremely poor
results. Many methods to deskew the document image have appeared in the literature,
viz. the Hough transform, projection profiles, Fourier transform methods, the RAST
algorithm, etc.
In the Hough transform, each point (x, y) in the Cartesian coordinate system is mapped to a
sinusoidal curve in the (ρ, θ) parameter space:

ρ = x cos θ + y sin θ
The skew angle is calculated on the basis that the density of points in the transform space
is maximum at the skew angle. Y. Lee [28] describes a method to detect and correct the skew
angle in a printed business form in his patent.
A horizontal projection profile is a histogram of the number of dark pixels along the
horizontal scan lines of a text image; when the text lines are horizontal, the profile shows
sharp, well-separated peaks, so the skew angle can be estimated as the rotation that
maximizes this sharpness.
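The idea above can be sketched as follows: for each candidate angle, the columns are sheared vertically and the variance of the resulting profile is measured, the angle giving the spikiest profile being taken as the skew. The shear approximation (valid for small angles) and the candidate range are illustrative assumptions:

```python
import numpy as np

def profile_variance(img, angle_deg):
    """Shear each column vertically by x*tan(angle) and return the
    variance of the horizontal projection profile; correctly aligned
    text lines give a spiky profile with high variance."""
    h, w = img.shape
    shift = np.round(np.tan(np.radians(angle_deg)) * np.arange(w)).astype(int)
    shift -= shift.min()
    prof = np.zeros(h + shift.max() + 1)
    for x in range(w):
        prof[shift[x]:shift[x] + h] += img[:, x]
    return prof.var()

def estimate_skew(img, angles):
    """Pick the candidate angle whose profile is sharpest."""
    return max(angles, key=lambda a: profile_variance(img, a))

# Synthetic page with three horizontal "text lines".
page = np.zeros((40, 40))
page[10, :] = page[20, :] = page[30, :] = 1
angle = estimate_skew(page, range(-5, 6))
```

Any nonzero shear smears each text line across several profile bins, lowering the variance, so the search converges on the true orientation.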
OCR Droid uses the RAST algorithm, a very robust skew detection and correction algorithm
based on a branch-and-bound text-line finding method, for skew detection and
auto-rotation[15][16].
J.H. Park et al [17] propose a method of skew correction for business cards on PDA, wherein
local adaptive thresholding is applied to the images first. They introduce a process called
stripe generation, wherein adjacent characters and strings are merged and clusters are formed.
Only those clusters having size and eccentricity larger than a certain value are used as stripes.
The skew angle is calculated by first computing the directional angles of the stripes using
central moments and then averaging them.
3.3. Tilting (Perspective Distortion):
As opposed to scanners, where the plane of the document is always parallel to the plane of
the sensor, in document image acquisition with a mobile camera the sensor plane may not
always be parallel to the document plane. As a result, tilted images are captured, wherein
text closer to the camera appears bigger and text farther from the camera appears smaller.
A perspective distortion is observed, which lowers the recognition rate and accuracy if the
recognizer is not tolerant to perspective distortion[18].
OCR Droid takes advantage of the orientation sensors embedded in mobile devices. It detects
the tilt of the phone and prevents users from capturing images when distortion would occur.
It also allows users to align the plane of the camera with that of the document. This is,
however, a preventive technique rather than a corrective one.
Jagannathan and Jawahar demonstrate in their paper[19] the intelligent use of commonly
available clues for perspective rectification. One of the algorithms they propose takes into
account the rectangular border of the document. If the image is distorted, the rectangle
appears as a general quadrilateral. If the positions of its vertices are calculated, the
perspective can be corrected using an equation like:
x' = (h11·x + h12·y + h13) / (h31·x + h32·y + h33)
y' = (h21·x + h22·y + h23) / (h31·x + h32·y + h33)

Where: (x, y) are coordinates of pixels in the acquired image
(x', y') are coordinates of the corresponding pixels in the approximation of the original document
And [