DEPARTMENT
OF
COMPUTER SCIENCE AND ENGINEERING
SEMINAR REPORT ON
“DOCUMENT TEXT SEGMENTATION”
SUBMITTED BY
NAME USN
ANKITA KULKARNI 2BL16CS013
B.L.D.E. A’S V.P. Dr. P.G. HALAKATTI COLLEGE OF ENGINEERING
AND TECHNOLOGY BIJAPUR – 586 103
CO-ORDINATORS
Prof. D. Y. Ijeri
Prof. S. S. Patil
ACKNOWLEDGEMENT
While presenting this Seminar on “Document Text Segmentation”, I feel that it is my
duty to acknowledge the help rendered to me by various persons.
Firstly, I thank God for showering his blessings on me. I am grateful to V.P. Dr.
P. G. Halakatti College of Engineering and Technology for providing me a congenial
atmosphere to carry out the seminar successfully.
I wish to express my sincere gratitude to Dr. Pushpa B. Patil, Head of the Department, for
providing me an opportunity to do this seminar.
I am thankful to my guide, Prof. Sujata Desai, for her constant support, ranging from
explaining the concepts and patiently answering even the most trivial and basic
questions to helping me solve the issues in the actual work. She has been a remarkable
source of inspiration and constant motivation. Her guidance and nurturing greatly helped
me strengthen my technical knowledge and skills.
Finally, I convey my appreciation to the CSE department for providing guidance,
support and timely assistance throughout the seminar.
VISVESVARAYA TECHNOLOGICAL UNIVERSITY JNANA
SANGAMA, BELAGAVI
DECLARATION
I, Ankita Kulkarni, a student of Eighth Semester B.E., Computer Science and Engineering,
B.L.D.E.A's V.P. Dr P.G.H College of Engineering and Technology, declare that the Seminar
Report titled “Document Text Segmentation” has been carried out by me and submitted in
partial fulfillment of the course requirements for the award of the degree of Bachelor of
Engineering in Computer Science and Engineering of Visvesvaraya Technological University,
Belagavi, during the academic year 2019-2020.
Place: Vijayapur
Date:
ABSTRACT
Segmentation of text from image documents has many important applications, such as
document retrieval, object identification and detection of vehicle licence plates, and it has
been a popular research field in recent years. Text information in natural scene images serves
as an important clue for many image-based applications such as scene understanding, content-
based image retrieval, assistive navigation and automatic geocoding. However, locating text
against a complex background with multiple colours is a challenging task. Two different
techniques using morphological operations are proposed to find text strings in natural
scene images.
The system employs the Symlet wavelet and 2-means classification for segmentation of
text from image documents, and uses morphological operations such as dilation and
erosion in post-processing. The proposed method for text segmentation from image documents
has been implemented in MATLAB. A method for the segmentation of text regions from
compound images is also proposed. The Haar Discrete Wavelet Transform (DWT) is employed;
the resulting detail-component sub-bands contain both text edges and non-text edges. The
Sobel edge detector is applied to each sub-image to extract the strong edges, and the resulting
edges are combined into an edge map using a weighted OR operator. The edge map is
binarized using a threshold, and a morphological dilation operation is performed on the
processed edge map. The projection profiles are then analysed to localize the text regions.
Finally, a threshold is applied, which results in the segmentation of the real text regions
from the image.
CONTENTS
1. INTRODUCTION
2. LITERATURE SURVEY
3. MOTIVATION
7. EXPERIMENTAL RESULTS
8. APPLICATIONS
CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
Text line extraction, or segmentation, is an important problem that does not have a
universally accepted solution in the context of automatic handwritten document recognition
systems. Text characteristics can vary in font, size, shape, style, orientation, alignment,
texture, colour, contrast and background information. These variations make the process of
word detection complex and difficult.
Text information extraction (TIE) is a technology used to extract text content from
still images and image sequences and turn it into machine-editable text. The three main
steps of TIE are text localization, text segmentation and text recognition. To increase the
recognition rate, the localized text regions must usually be segmented from the image and
converted into binary images before being sent to an OCR system for recognition. The
accuracy of text segmentation can strongly affect the text recognition rate. Scene text easily
suffers from uneven lighting and complex backgrounds, so achieving text segmentation under
such conditions is challenging. Many techniques for text segmentation have been surveyed,
and many other methods have been proposed in recent years. The recent techniques for text
segmentation can be classified into stroke-based methods, colour-based methods and
graph-based methods.
Stroke-based methods search the stroke-like structures in text images using a stroke
filter. Considering the strokes as the intrinsic text characteristics, a definition of text is given
first and then a stroke filter is designed to get the response of text strokes. After the text
colour polarity is determined, a local region growing procedure is performed to refine the
binarized stroke response map. This approach mainly makes use of the transitional colour
between the strokes and the adjacent background in embedded video text images.
Unfortunately, this approach may not work for scene text as there is usually no transitional
colour between scene text characters and the adjacent background.
Segmentation algorithms based on hue, on hue and lightness, and on lightness alone have
been applied, respectively, to three classified types of text regions. For text images of low
quality, however, this approach is not effective enough. Li utilized graph theory to
handle multi-polarity text segmentation. In their model, the corresponding intensity map of a
colour image was represented as an undirected, weighted graph by treating each pixel
group at a gray level as a node.
CHAPTER 2
LITERATURE SURVEY
6. “Segmentation and Extraction of Text from Curved Text Lines using Image
Processing Approach” by Monika A. Shejwal, Sangita D. Bharkad (2017)
Camera-captured images containing text often have curved text lines because of
distortions caused by page curl and the viewing angle of the camera. It is therefore
necessary that, while scanning a document, the text be straight and the words properly
aligned. However, text line segmentation of curled text is a difficult step for dewarping
techniques. This paper presents a method based on image processing algorithms for
segmentation and extraction of characters from curled text lines in document
images.
Digitization of documents is an important method for increasing their quality and
compatibility. Nowadays, instead of scanning documents, people widely use camera-captured
document images, because cameras are readily available at low cost and embedded in all
mobile devices. This gives quick, non-contact document imaging. However, the quality of
document images captured by a camera is often poor because of camera perspective
distortion, non-uniform shading, image blurring, character smearing (due to low resolution)
and lighting variations. So document image analysis plays a vital role in extracting
information from document images. Extraction of straight text lines from document images
is easy compared to curved text lines; the text of a document becomes curved because of
camera perspective and other distortions.
In this paper, we tried to tackle the obstacles that occur in binarization of camera-captured
images, using image processing algorithms for segmentation of the text lines of document
images.
Business forms usually do not consist of images or graphics but contain text within
specified areas or boxes. Text extraction from specific business forms is based on prior
knowledge about the location and length of the text, while properties of the boxes, such as
height and width, are sometimes used to determine the location of the text.
Other documents like newspapers, magazines and technical journals have a specific layout
structure, and text extraction is based on prior knowledge of the location of text elements
such as headers, footers and page numbers. Since these documents sometimes contain non-
text regions, RLSA and PPC are used to segment the regions, and heuristic rules are applied
to classify the resulting text and non-text regions, since the layout structure of most of these
documents is known. Multi-stage thresholding, histograms and heuristics for block
segmentation and classification, and interactive user input for the desired block have all
been applied. One approach performs horizontal dilation with a fixed line length so that the
lines can be extracted completely, and applies heuristic rules for text and non-text region
classification. Region growing and analysis with heuristic rules is also applied where a
generic model of the paper layout is known, as are connected component generation and
heuristics.
CHAPTER 4
The text regions in a document image can be detected by either region-based or texture-
based methods. These methods are relatively independent of changes in text size and
orientation, but have difficulties with complex images with non-uniform backgrounds: for
example, if a text string touches a graphical object in the original image, they may form one
connected component in the resultant binary image.
In the region based approach, we consider each pixel in the image and assign it to a
particular region or object. This approach is basically divided into two subcategories: edge
based and connected component based.
Basically the idea behind the edge-based algorithms is that the edges of text symbols are
typically stronger than those of noise, textured-background and other graphical items. In
these top-down techniques, a binary edge image is first generated using an edge detector, and
then adjacent edges are connected by applying morphological operations or other algorithms
such as run-length smoothing.
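Run-length smoothing, mentioned above as one way to connect adjacent edges, can be sketched in a few lines. The following is a minimal pure-Python illustration (not the authors' MATLAB implementation); the function name and threshold value are illustrative:

```python
def rlsa_horizontal(row, threshold):
    """Horizontal run-length smoothing: fill runs of background (0)
    no longer than `threshold` that lie between foreground (1) pixels."""
    out = row[:]
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Fill the gap only if it is bounded by foreground on both sides
            if 0 < i and j < n and (j - i) <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
print(rlsa_horizontal(row, 3))  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

Short gaps between edge pixels (such as the spaces between adjacent characters) are filled, while the long gap is preserved, so characters merge into candidate text lines.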
Connected components of the resultant image are the candidate text areas, as each one
represents either several merged lines or a graphical item. Then, each component is
decomposed into smaller regions by analysing its vertical and horizontal projection profiles,
and finally each of the small regions satisfying certain heuristic constraints is labelled as text.
Edge-based methods are fast and can detect text in complex backgrounds, but are restricted
to detecting only horizontally or vertically aligned text strings.
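The projection-profile analysis described above can also be sketched briefly. This is a hedged pure-Python illustration, not the paper's code; the function names and the `min_count` parameter are illustrative:

```python
def horizontal_projection(binary):
    """Count foreground pixels in each row of a binary image."""
    return [sum(row) for row in binary]

def text_rows(profile, min_count):
    """Group consecutive rows whose profile meets min_count into (start, end) bands."""
    bands, start = [], None
    for i, v in enumerate(profile):
        if v >= min_count and start is None:
            start = i
        elif v < min_count and start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands

img = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
prof = horizontal_projection(img)  # [0, 3, 4, 0, 2]
print(text_rows(prof, 2))          # [(1, 2), (4, 4)]
```

Rows with many foreground pixels form bands separated by empty rows, which is how a merged component can be decomposed into individual text lines.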
The literature on text segmentation is broad in scope, but there appears to be very little
work on applying machine learning techniques to this problem. A text segmentation
algorithm should have adaptation and learning capability, but a learner usually needs much
time and training data to achieve satisfactory results, which restricts its practicality. To
overcome these problems, M. M. Haji and S. D. Katebi give a simple procedure for
generating training data from manually segmented images and then apply a Naive Bayes
Classifier (NBC), which is fast in both the training and application phases.
CHAPTER 5
ARCHITECTURE OF THE SYSTEM
[Block diagram: Symlet Image → Block Processing → K-means Clustering → Mask Image]
A. Image Acquisition
Image acquisition is the process of collecting images for text and picture
segmentation in image documents. We have used scanned images for text and picture
segmentation in image documents.
B. Image pre-processing
The aim of pre-processing is to improve image data so that it removes undesired distortions
and/or it enhances image features that are relevant for further processing.
Image processing can identify shades, colours and relationships in a picture that cannot be
perceived by the human eye. It is used to solve identification problems, such as in forensic
medicine or in creating weather maps from satellite pictures, and it deals with images in
bitmapped graphics format that have been scanned in or captured with digital cameras.
Colour images are converted to grey-level images using the formula
Y = 0.299R + 0.587G + 0.114B.
C. Symlet Wavelet
The wavelet transform provides a multi-resolution representation of an image. It has
become quite popular in recent years owing to its huge number of applications in various
fields, such as telecommunications, geophysics and astrophysics, and in computer vision it
enables the detection, analysis and recognition of image features and properties over varying
ranges of scale.
Symlet wavelets are a family of wavelets: a modified version of Daubechies wavelets with
increased symmetry. The Symlet wavelet of order n is defined for any positive integer n; its
scaling function and wavelet function have compact support of length 2n, and the wavelet
function has n vanishing moments.
The Symlet wavelet can be used with the discrete wavelet transform (DWT), a
mathematical tool for signal analysis and image processing. With the wavelet transform, an
image can be decomposed into a multiresolution representation in which
every portion has distinct frequency and spatial properties.
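To make the decomposition concrete, here is a one-level 2-D Haar DWT (the Haar wavelet is the simplest member of the Daubechies family and is also used by the second method in this report). This is a hedged pure-Python sketch using the unnormalised average/difference variant, not the MATLAB implementation; function names are illustrative:

```python
def haar_dwt2(img):
    """One-level 2-D Haar DWT (average/difference variant, without the
    1/sqrt(2) normalisation). Returns the (LL, LH, HL, HH) sub-bands,
    each half the size of the even-sized input image."""
    def step(v):
        # 1-D Haar step: pairwise averages (low-pass) and differences (high-pass)
        lo = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
        hi = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
        return lo, hi

    # Filter every row into low-pass and high-pass halves
    row_lo, row_hi = [], []
    for r in img:
        lo, hi = step(r)
        row_lo.append(lo)
        row_hi.append(hi)

    def filter_columns(mat):
        # Apply the same 1-D step down each column, then re-transpose
        cols = [step(list(c)) for c in zip(*mat)]
        lo = [list(r) for r in zip(*[c[0] for c in cols])]
        hi = [list(r) for r in zip(*[c[1] for c in cols])]
        return lo, hi

    LL, LH = filter_columns(row_lo)
    HL, HH = filter_columns(row_hi)
    return LL, LH, HL, HH

LL, LH, HL, HH = haar_dwt2([[4, 2], [2, 0]])
print(LL, HH)  # [[2.0]] [[0.0]]
```

The LL sub-band is a smoothed, half-resolution approximation, while LH, HL and HH are the detail sub-bands where text edges show up strongly.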
D. Block processing
Images can either be too large to load into memory, or else they can be loaded into memory
but then be too large to process. Therefore block processing is more useful to automatically
divide the input image into blocks of the user- specified size, process each block individually
and then reassembles each block results into the output image.
If we want to divide an image into blocks and process each block individually, the
function blkproc is used that allow to process distinct blocks.
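The divide/process/reassemble pattern behind blkproc can be sketched language-independently. The following is a minimal pure-Python illustration under the assumption that the image dimensions are multiples of the block size; the names `block_process` and `local_binarize` are illustrative, not part of any library:

```python
def block_process(img, bh, bw, fn):
    """Apply fn to each bh x bw block of img and reassemble the results.
    Assumes the image dimensions are multiples of the block size."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(0, h, bh):
        for c in range(0, w, bw):
            block = [row[c:c + bw] for row in img[r:r + bh]]
            result = fn(block)  # fn must return a block of the same size
            for i in range(bh):
                for j in range(bw):
                    out[r + i][c + j] = result[i][j]
    return out

def local_binarize(block):
    """Example block operation: threshold the block at its own mean."""
    mean = sum(sum(r) for r in block) / (len(block) * len(block[0]))
    return [[1 if v > mean else 0 for v in r] for r in block]

print(block_process([[1, 2], [3, 4]], 2, 2, local_binarize))  # [[0, 0], [1, 1]]
```

Because each block is processed independently, only one block needs to be in working memory at a time, which is exactly the advantage described above.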
E. K-means Clustering
The K-mean algorithm is an iterative technique that is used to partition an image into K
clusters. The basic algorithm is:
1. Pick K cluster centres, either randomly or based on some heuristic.
2. Assign each pixel in the image to the cluster that minimizes the distance between the
pixel and the cluster centre
3. Re-compute the cluster centre by averaging all of the pixels in the cluster
4. Repeat steps 2 and 3 until convergence is attained (e.g. no pixels change clusters)
The k-means algorithm is an iterative algorithm that gains its name from its method of
operation: it clusters observations into k groups, where k is provided as an input parameter.
We used 2-means classification in our implementation, with one group of white pixels and a
second group of black pixels.
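The 2-means case of the steps above can be sketched on a flat list of grey values. This is an illustrative pure-Python sketch, not the MATLAB implementation; the initialisation from the darkest and brightest pixels and the iteration cap are assumptions:

```python
def two_means(pixels, iters=20):
    """Binary (k = 2) k-means on a flat list of grey values.
    Returns (labels, centres); label 1 marks the brighter cluster."""
    c0, c1 = min(pixels), max(pixels)  # initial centres: darkest and brightest
    for _ in range(iters):
        # Step 2: assign each pixel to its nearest centre
        labels = [0 if abs(p - c0) <= abs(p - c1) else 1 for p in pixels]
        # Step 3: re-compute each centre as its cluster's mean
        g0 = [p for p, l in zip(pixels, labels) if l == 0]
        g1 = [p for p, l in zip(pixels, labels) if l == 1]
        new0 = sum(g0) / len(g0) if g0 else c0
        new1 = sum(g1) / len(g1) if g1 else c1
        if (new0, new1) == (c0, c1):  # Step 4: stop at convergence
            break
        c0, c1 = new0, new1
    return labels, (c0, c1)

pixels = [10, 12, 8, 200, 205, 198]
labels, centres = two_means(pixels)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The label image itself is the binary mask: dark pixels fall into one cluster and bright pixels into the other, separating text from background.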
F. Post Processing
Post processing attempts to increase the quality of a mask image. Post processing is
performed with the help of Morphology. The morphological operations are dilation and
erosion. Dilation adds pixels to the boundaries of objects in an image, while erosion removes
pixels on object boundaries. The number of pixels added or removed from the objects in an
image depends on the size and shape of the structuring element used to process the image.
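For reference, binary dilation and erosion with a symmetric structuring element can be written out directly. This is a hedged pure-Python sketch (the real implementation would use MATLAB's morphology functions); the cross-shaped structuring element is just an example:

```python
def dilate(img, se):
    """Binary dilation: a pixel is set if the structuring element,
    centred on it, overlaps any foreground pixel."""
    h, w = len(img), len(img[0])
    sh, sw = len(se), len(se[0])
    oy, ox = sh // 2, sw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(sh):
                for j in range(sw):
                    yy, xx = y + i - oy, x + j - ox
                    if se[i][j] and 0 <= yy < h and 0 <= xx < w and img[yy][xx]:
                        out[y][x] = 1
    return out

def erode(img, se):
    """Binary erosion: a pixel survives only if the structuring element,
    centred on it, fits entirely inside the foreground."""
    h, w = len(img), len(img[0])
    sh, sw = len(se), len(se[0])
    oy, ox = sh // 2, sw // 2
    out = [[1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(sh):
                for j in range(sw):
                    yy, xx = y + i - oy, x + j - ox
                    if se[i][j] and not (0 <= yy < h and 0 <= xx < w and img[yy][xx]):
                        out[y][x] = 0
    return out

se = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]        # cross-shaped structuring element
dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(dilate(dot, se))  # [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
```

Dilation grows the single foreground pixel into the shape of the structuring element, illustrating how it adds pixels to object boundaries while erosion removes them.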
A. Pre-processing
If the input image is a colour image, its RGB components are combined to give the
intensity image Y as follows:
Y = 0.299R + 0.587G + 0.114B
Image Y is then processed with discrete wavelet transform.
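The Y = 0.299R + 0.587G + 0.114B conversion above (the standard ITU-R BT.601 luma weighting) translates directly into code. A minimal sketch; `to_grey_image` is an illustrative helper name:

```python
def rgb_to_grey(r, g, b):
    """Weighted luma conversion (ITU-R BT.601 coefficients),
    matching the formula Y = 0.299R + 0.587G + 0.114B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def to_grey_image(rgb_img):
    """Apply the conversion to a 2-D image of (R, G, B) tuples."""
    return [[rgb_to_grey(r, g, b) for (r, g, b) in row] for row in rgb_img]

print(rgb_to_grey(0, 0, 0))  # 0.0
```

The weights sum to 1, so pure white maps to full intensity; green contributes most because the human eye is most sensitive to it.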
MATLAB:
When it comes to image processing, few tools are more flexible than MATLAB. Its name
stands for “Matrix Laboratory”; it is a fourth-generation high-level programming language
developed by MathWorks (U.S.), created in 1984 with the objective of providing an interactive
environment for computation, visualisation and programming, and written in C, C++ and Java.
MATLAB is used in every facet of computational mathematics. The mathematical calculations
where it is most commonly used include matrix and array manipulation, linear algebra,
algebraic equations, statistics, calculus, integration and transforms. It has a wide range of
applications, including signal processing, image and video processing, control systems,
computer vision and AI, and it is the most popular software in the field of Digital Image
Processing. However, it is not open source; a user has to pay for a licensed MATLAB
interpreter.
In OCR processing, the scanned-in image or bitmap is analysed for light and dark areas in
order to identify each alphabetic letter or numeric digit. When a character is recognized, it is
converted into an ASCII code. Special circuit boards and computer chips designed expressly
for OCR are used to speed up the recognition process.
OCR is often used as a “hidden” technology, powering many well-known systems and
services in our daily life. Less well known, but just as important, use cases for OCR
technology include data entry automation, indexing documents for search engines, automatic
number plate recognition, and assisting blind and visually impaired persons.
OCR technology has proven immensely useful in digitising historic newspapers and texts that
have now been converted into fully searchable formats, which has made accessing those
earlier texts easier and faster.
CHAPTER 7
EXPERIMENTAL RESULTS
We have developed segmentation of text and pictures from image documents with the
help of the Symlet wavelet and 2-means classification using MATLAB R2009a. Below, we
consider some images and illustrate the segmentation of text and pictures.
Fig 7.9: Segmented text
Fig 7.10: Original image
8.4. Thresholding
Thresholding is the simplest method of image segmentation. From a grayscale image,
thresholding can be used to create binary images.
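Thresholding as described above can be sketched in one line of logic. A minimal pure-Python illustration; the threshold value 128 and the convention that bright pixels map to 1 are assumptions:

```python
def binarize(grey, t):
    """Global thresholding: pixels at or above t map to 1 (white), others to 0."""
    return [[1 if v >= t else 0 for v in row] for row in grey]

grey = [[12, 200], [180, 30]]
print(binarize(grey, 128))  # [[0, 1], [1, 0]]
```

In practice the threshold is often chosen from the image histogram rather than fixed in advance, but the binarization step itself is this simple.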
The development of an efficient method for detecting all types of graphics and text in
any orientation in real-life documents is a challenging process. Our proposed method for
segmentation of text in image documents extracts text from images efficiently by applying
the Symlet wavelet.
A method of text extraction from images is proposed using the Haar Discrete Wavelet
Transform, the Sobel edge detector, a weighted OR operator, thresholding and the
morphological dilation operator. These mathematical tools are integrated to detect text
regions in complicated images. The proposed method is robust to the language and font size
of the text, and it can also decompose blocks containing multi-line text into single text lines.
According to the experimental results, the proposed method proves to be efficient for
extracting text regions from images.
2. M. Pietikäinen and O. Okun, “Text Extraction from Grey Scale Page Images by Simple
Edge Detectors”, Proc. of the 12th Scandinavian Conf. on Image Analysis, Bergen, Norway,
pp. 628-635, 11-14 June 2001.
3. Jie Xi, Xian-Sheng Hua, Xiang-Rong Chen, et al., “A Video Text Detection and
Recognition System”, Proc. of ICME 2001, Waseda University, Japan, pp. 1080-1083,
August 2001.
6. Shulan Deng and Shahram Latifi, "Fast Text Segmentation Using Wavelet for Document
Processing", Proc. of the 4th WAC, ISSCI, IFMIP, Maui, Hawaii, USA, pp. 739-744, 11-15
June 2000.
10. Punam Thakare, “A Study of Image Segmentation and Edge Detection Techniques”,
International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397,
Vol. 3, No. 2, Feb 2011.
11. Danial Md Nor, Rosli Omar, M. Zarar M. Jenu, Jean-Marc Ogier, “Image Segmentation
and Text Extraction: Application to the Extraction of Textual Information in Scene Images”,
ISASM 2011.
12. Neha Gupta, V. K. Banga, “Image Segmentation for Text Extraction”, ICEECE'2012,
April 28-29, 2012.
13. Mao, W., Chung, F., Lam, K., and Siu, W., “Hybrid Chinese/English Text Detection in
Images and Video Frames”, Proceedings of the International Conference on Pattern
Recognition, 2002, Vol. 3, pp. 1015-1018.
14. Qixiang Ye, Wen Gao, Weiqiang Wang and Wei Zeng, “A Robust Text Detection
Algorithm in Images and Video Frames”, ICICS-PCM 2003, 15-18 December 2003,
Singapore.