A system for spotting words in scanned document images in three scripts, Devanagari, Arabic and Latin, is described. The three main components of the system are a word segmenter, a shape-based matcher for words and a search interface. The user gives a query, which can be either a word image or text. Candidate words in the documents are retrieved and ranked, where the ranking criterion is a similarity score between the query and the candidate words based on global word shape features. This makes the word spotting technique independent of the script used. The performance of the system is seen to be better for printed text than for handwritten text. For handwritten English, a precision of 60% was obtained at a recall of 50%. An alternative approach comprising prototype selection and word matching, which yields better performance for handwritten documents, is also discussed. For printed Sanskrit documents, a precision as high as 90% was obtained at a recall of 50%.

1 Introduction

Word spotting is a content-based information retrieval task to find relevant words within a repository of scanned document images. The retrieval of matching words finds applications in several areas such as forensics, historical documents and personal records. In these applications it is of interest to retrieve words from a large database of documents based on the visual and

Spotting handwritten words in documents written in the Latin alphabet has received considerable attention [1], [2], [3]. Pseudo 2-D hidden Markov models are used to represent keywords in [1], hierarchical shape models are used in [2], and in [3] profile-based holistic shape features are extracted from a line or word image and matched using dynamic time warping (DTW). A comparison of seven methods showed that the highest precision was obtained by a method based on profiles and DTW [4]. A word shape based method was shown to perform better than the DTW method in terms of both efficiency and effectiveness [5]: over 11% better in precision and recall, and 893 times faster.

This paper discusses the performance of the word shape method presented in [5] when applied to other scripts (Devanagari, Arabic) and to languages written in these scripts, such as Sanskrit/Hindi and Arabic/Urdu. The paper also compares retrieval performance across these scripts and between printed and handwritten documents. The word spotting technique presented has been found to be effective for all three languages under consideration (Sanskrit, Arabic, English), and for both printed and scanned handwritten documents. The search capability of our system comes with an interactive graphical user interface that allows searches (i) within the current document, (ii) from another opened document, or (iii) from a collection of preprocessed, indexed documents.
S. N. Srihari, H. Srinivasan, C. Huang and S. Shetty, "Spotting Words in Latin, Devanagari and Arabic Scripts,"
Vivek: Indian Journal of Artificial Intelligence, vol.16, no.3, pp. 2-9, 2006.
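The ranking described above, where candidate words are ordered by a similarity score against the query and retrieval quality is reported as precision at a fixed recall, can be sketched as follows. This is a minimal illustration, not the system's implementation. It assumes that each word image has already been reduced to a binary feature vector, and it uses one common form of correlation similarity for binary vectors. The function names (`correlation_similarity`, `rank_candidates`, `precision_at_recall`) and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def correlation_similarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Correlation-style similarity in [0, 1] between two binary vectors.

    s11/s00 count positions where both bits are 1/0; s10/s01 count
    positions where the bits disagree. This form is an assumption.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = math.sqrt((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10))
    if denom == 0:
        return 0.5  # degenerate case: at least one vector is constant
    return 0.5 + (s11 * s00 - s10 * s01) / (2.0 * denom)


def rank_candidates(
    query: Sequence[int],
    candidates: Sequence[Tuple[str, Sequence[int]]],
) -> List[Tuple[str, float]]:
    """Return (word_id, score) pairs sorted by decreasing similarity."""
    scored = [(word_id, correlation_similarity(query, feats))
              for word_id, feats in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def precision_at_recall(ranked_relevance: Sequence[bool],
                        recall_level: float) -> float:
    """Precision at the first rank where recall reaches recall_level."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("no relevant items in the ranked list")
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            if hits / total_relevant >= recall_level:
                return hits / rank
    raise ValueError("recall_level must be at most 1.0")


if __name__ == "__main__":
    query = [1, 0, 1, 0]
    candidates = [("w1", [0, 1, 0, 1]), ("w2", [1, 0, 1, 0]), ("w3", [1, 0, 0, 0])]
    for word_id, score in rank_candidates(query, candidates):
        print(word_id, round(score, 3))
```

In this sketch a figure such as "60% precision at 50% recall" corresponds to the value `precision_at_recall(flags, 0.5)` returns for one query, where `flags` marks which ranked results are true matches. How per-query values are aggregated is not specified here.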
The word spotting method presented here involves indexing the documents in two steps: (i) line and word segmentation, and (ii) feature extraction, in which global word shape features known as GSC features [6] are extracted for each of the words in the documents. The similarity between word images is measured using a correlation similarity measure. The system described in this paper includes functionality from the CEDARABIC system [7] and from the CEDAR-FOX system, which was developed for forensic examination [8], [9].

2.1 Preprocessing

In the image preprocessing stage, the scanned image of each document is subjected to several processing functions that lead to the segmentation of the document into lines and of each line into words. The number of lines and words, and their positions in the respective document, are stored along with the document image. Figure 1 shows the line and word segmentation for handwritten English, printed Devanagari and handwritten Arabic text. Line segmentation is performed using a clustering method. Word segmentation is formulated as a classification problem: whether or not the gap between two adjacent connected components in a line is a word gap. An artificial neural network with features characterizing the connected components was used to make this decision.

Figure 1: Segmentation examples where segmented words are coloured differently: (a) handwritten English, (b) printed Sanskrit and (c) handwritten Arabic.
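The gap-classification view of word segmentation can be illustrated with a small sketch. The text uses an artificial neural network over connected-component features. The example below replaces it with a much simpler stand-in, a threshold on the horizontal gap relative to the median gap in the line, purely to show the structure of the decision. The `Box` type, the `ratio` parameter and its default value are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Box:
    """Horizontal extent of one connected component's bounding box."""
    left: int
    right: int


def split_line_into_words(boxes: List[Box], ratio: float = 1.5) -> List[List[Box]]:
    """Group the connected components of one text line into words.

    A gap between neighbouring components is labelled a word gap when it
    exceeds ratio * median gap. This threshold rule is only a stand-in
    for a trained classifier.
    """
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b.left)
    # Overlapping components get a gap of zero.
    gaps = [max(0, nxt.left - cur.right) for cur, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        return [ordered]
    threshold = ratio * median(gaps)
    words: List[List[Box]] = [[ordered[0]]]
    for gap, box in zip(gaps, ordered[1:]):
        if gap > threshold:
            words.append([box])    # word gap: start a new word
        else:
            words[-1].append(box)  # intra-word gap: extend current word
    return words


if __name__ == "__main__":
    line = [Box(0, 10), Box(12, 20), Box(40, 50), Box(52, 60), Box(90, 100)]
    for word in split_line_into_words(line):
        print([(b.left, b.right) for b in word])
```

A learned classifier would replace the `gap > threshold` test with a prediction from features of the two neighbouring components. The grouping loop around it would stay the same.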
3.1 Dataset

3.1.1 Latin (English)

The data sets used for Word Spotting in English are subsets of a document image collection consisting of 3000 samples written by 1000

3.1.3 Arabic

A document collection was prepared with 10 different writers, each contributing 10 different full page documents in handwritten Arabic. Each
[Figure: plot with a Precision axis; values not recoverable]

Each of the words tested occurred at least 5 times
the selected documents. We have also calculated
[Plot: x-axis "Number of writers", ticks 3 to 8; y-axis ticks 45 to 60]

Figure 10: Precision at 50% recall is plotted against different numbers of writers used for training. As more writers are used for training, the precision at a given recall increases.

4 Conclusion

Content-based search in documents using Latin, Devanagari and Arabic scripts has been described here. Word images were used for the retrieval of document data in each of these languages. The performance of the system in these various types of searches has been presented as precision-recall curves and the ranks of the returned results. A very high precision can be obtained for spotting words from printed documents. A two-step approach involving a query expansion step yields promising results for spotting words in handwritten documents.

and Applications, 2(3), pp. 153–168, 2000.

[4] R. Manmatha and T. M. Rath, “Indexing of handwritten historical documents-recent progress,” in Symposium on Document Image Understanding Technology (SDIUT), pp. 77–85, 2003.

[5] B. Zhang, S. N. Srihari, and C. Huang, “Word image retrieval using binary features,” in Document Recognition and Retrieval XI, SPIE, San Jose, CA, 2004.

[6] B. Zhang and S. N. Srihari, “Binary vector dissimilarity measures for handwriting identification,” in Document Recognition and Retrieval X, SPIE, Bellingham, WA, 5010, pp. 28–38, 2003.

[7] S. N. Srihari, H. Srinivasan, P. Babu, and C. Bhole, “Handwritten Arabic word spotting using the CEDARABIC document analysis system,” in Proc. Symposium on Document Image Understanding Technology (SDIUT-05), pp. 123–132, 2005.

[8] S. N. Srihari, S.-H. Cha, H. Arora, and S. Lee, “Individuality of handwriting,” in Journal of Forensic Sciences, 47(4), pp. 856–872, 2002.