
OPTICAL CHARACTER RECOGNITION (OCR)

Contents

Introduction
Stages in OCR
MATLAB Implementation
Steps in MATLAB Implementation
Android Implementation
Advantages
Applications
Conclusion
References

INTRODUCTION
Motivation: Text detection and recognition have many relevant applications in automatic indexing and information retrieval, such as document indexing, content-based image retrieval, and car license plate recognition, which in turn open up the possibility of more advanced systems.

OCR: OCR is the mechanical or electronic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable text.

Aims and Objectives


OCR has two objectives:

Segmentation: Separate the text region into its individual characters.

Recognition: Recognize each character in the detected text region using a suitable algorithm.

The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters.

STAGES IN OCR
TRAINING: Pre-processing -> Feature Extraction -> Model Estimation

TESTING: Pre-processing -> Feature Extraction -> Classification

PRE-PROCESSING
The raw data is subjected to a number of preliminary processing steps to make it usable in the descriptive stages of character analysis. Pre-processing aims to produce data on which OCR systems can operate accurately. The main objectives of pre-processing are:

Binarization
Noise reduction
Stroke width normalization
Skew correction
Slant removal

BINARIZATION

Binarization (thresholding) refers to the conversion of a gray-scale image into a binary image. Two categories of thresholding are:

Global - picks one threshold value for the entire document image, often based on an estimate of the background level from the intensity histogram of the image.
Adaptive (local) - uses a different threshold for each pixel according to local area information.
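The two thresholding categories can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: Otsu's method is assumed here as one common way to pick a global threshold from the intensity histogram, and a local-mean rule is assumed for the adaptive case. The function names and the `radius` and `offset` parameters are hypothetical.

```python
def otsu_threshold(gray):
    """Global threshold (0-255) that maximizes the between-class variance.

    Assumption: Otsu's method stands in for "estimate from the histogram".
    `gray` is a list of rows of integer intensities in 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize_global(gray, threshold):
    """One threshold for the whole image: 1 if brighter than it, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in gray]


def binarize_adaptive(gray, radius=1, offset=0):
    """Per-pixel threshold: the mean of the surrounding window minus `offset`."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] > local_mean - offset else 0
    return out
```

Both functions mark bright pixels as 1. For dark text on a light page the result would still need to be complemented so that character pixels are the on pixels.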

Noise Reduction and Normalization

Noise reduction improves the quality of the document image. Normalization greatly reduces the data size, and thinning extracts the shape information of the characters. Two main approaches to noise reduction:

Filtering (masks)
Morphological operations (erosion, dilation, etc.)
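Both approaches can be illustrated with a 3 X 3 neighbourhood. The sketch below assumes a median mask for filtering and a morphological opening (erosion followed by dilation) for removing isolated specks; the slides do not say which filter or structuring element was used, so these choices and all function names are assumptions.

```python
def _window(img, y, x):
    """Values in the 3 x 3 neighbourhood of (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]


def median_filter(img):
    """Filtering with a mask: replace each pixel by the median of its window."""
    def median(values):
        ordered = sorted(values)
        return ordered[len(ordered) // 2]
    return [[median(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(binary):
    """A pixel stays on only if every pixel in its window is on."""
    return [[1 if all(_window(binary, y, x)) else 0
             for x in range(len(binary[0]))]
            for y in range(len(binary))]


def dilate(binary):
    """A pixel turns on if any pixel in its window is on."""
    return [[1 if any(_window(binary, y, x)) else 0
             for x in range(len(binary[0]))]
            for y in range(len(binary))]


def opening(binary):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(binary))
```

Opening with a 3 X 3 window also removes strokes thinner than three pixels, so in practice the window size has to be matched to the stroke width of the text.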

6/10/13

FEATURE EXTRACTION
In the feature extraction stage, each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features that maximizes the recognition rate with the fewest elements. Because handwriting is highly variable and imprecise, obtaining these features is a difficult task.
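As one concrete example of a compact feature vector, the sketch below computes zoning features: the character image is divided into a grid of zones and each zone contributes the fraction of its pixels that are on. Zoning is an assumption chosen for illustration; the slides do not name a specific feature set, and `zoning_features` with its parameters is a hypothetical helper.

```python
def zoning_features(glyph, zones_y=2, zones_x=2):
    """Feature vector of on-pixel densities, one value per zone.

    `glyph` is a binary character image (list of rows of 0/1).
    The vector has zones_y * zones_x elements, ordered row by row.
    """
    h, w = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones_y):
        y0, y1 = zy * h // zones_y, (zy + 1) * h // zones_y
        for zx in range(zones_x):
            x0, x1 = zx * w // zones_x, (zx + 1) * w // zones_x
            area = (y1 - y0) * (x1 - x0)
            on = sum(glyph[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(on / area if area else 0.0)
    return features
```

A 2 X 2 grid turns any character image into just four numbers, which shows the trade-off described above: fewer elements are cheaper to classify but discriminate less between similar characters.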

MODEL ESTIMATION
Given labelled sets of features for many characters, where the labels correspond to the particular classes that the characters belong to, we wish to estimate a statistical model for each character class.
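A minimal sketch of model estimation, assuming the simplest statistical model: each class is summarized by the per-feature mean and variance of its training vectors (a Gaussian with diagonal covariance). The model family and the name `estimate_models` are assumptions, not something the slides specify.

```python
from collections import defaultdict


def estimate_models(samples):
    """Estimate one (mean, variance) model per character class.

    `samples` is an iterable of (label, feature_vector) pairs.
    Returns {label: (mean_vector, variance_vector)}.
    """
    grouped = defaultdict(list)
    for label, vector in samples:
        grouped[label].append(vector)

    models = {}
    for label, vectors in grouped.items():
        n = len(vectors)
        dim = len(vectors[0])
        mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
        variance = [sum((v[k] - mean[k]) ** 2 for v in vectors) / n
                    for k in range(dim)]
        models[label] = (mean, variance)
    return models
```

For example, `estimate_models([("A", [1.0, 0.0]), ("A", [3.0, 0.0])])` yields a single model for class "A" whose mean is the average of the two training vectors.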

CLASSIFICATION
According to Tou and Gonzalez, the principal function of a pattern recognition system is to yield decisions concerning the class membership of the patterns with which it is confronted. In the context of an OCR system, the recognizer is confronted with a sequence of feature patterns from which it must determine the character classes.
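The decision step can be sketched with a nearest-mean rule: each feature pattern is assigned to the class whose mean vector is closest in Euclidean distance. This rule is an assumption chosen for brevity; the slides do not commit to a particular classifier, and the function names and example class means are hypothetical.

```python
import math


def classify(features, class_means):
    """Return the label whose mean vector is nearest to `features`."""
    return min(class_means,
               key=lambda label: math.dist(features, class_means[label]))


def classify_sequence(feature_sequence, class_means):
    """Classify a sequence of feature patterns into a string of characters."""
    return "".join(classify(f, class_means) for f in feature_sequence)
```

With `class_means = {"A": [1.0, 0.0], "B": [0.0, 1.0]}`, the pattern `[0.9, 0.2]` lies closer to the mean of "A" and is labelled accordingly.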

MATLAB IMPLEMENTATION

Flowchart: Preprocess -> Segmentation -> Recognition

Snapshot of MATLAB Application

Make Template
To create template.mat for use in classification:

36 images of characters, each of size 60 X 55
Stored as a matrix of size 55 X 60 X 36
Saved as template.mat


Preprocess
Raw Image
Noise Filter
Binarize
Resizing
Bounding
Complementing

Preprocessed Image

Segmentation - Connected Components

Character segmentation involves the following steps:

Scan the image from left to right to find an on pixel.
When an on pixel is found, all on pixels connected to it are extracted and segmented as one character.
The process is repeated until it reaches the right end of the image.
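The scan described above can be sketched as a column-by-column search with a flood fill. This is a Python illustration rather than the MATLAB code behind the slides; 8-connectivity is assumed, and `segment_characters` and `bounding_box` are hypothetical names.

```python
from collections import deque


def segment_characters(binary):
    """Split a binary image into connected components, ordered left to right.

    `binary` is a list of rows of 0/1 where 1 is an on pixel.
    Returns a list of components, each a list of (row, column) pixels.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for x in range(w):              # scan columns from left to right
        for y in range(h):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # New on pixel found: collect every on pixel connected to it.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of one component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return (min(rows), min(cols), max(rows), max(cols))
```

Because components are discovered in order of their leftmost column, the output order matches the reading order of a single text line. Characters made of several disconnected parts, such as "i", would come out as separate components and need to be merged afterwards.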

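Corr2
corr2 computes the two-dimensional correlation coefficient r between two matrices i and j of the same size:

$$ r = \frac{\sum_m \sum_n (i_{mn} - \bar{i})(j_{mn} - \bar{j})}{\sqrt{\left(\sum_m \sum_n (i_{mn} - \bar{i})^2\right)\left(\sum_m \sum_n (j_{mn} - \bar{j})^2\right)}} $$

where $\bar{i}$ is the mean of the input matrix i and $\bar{j}$ is the mean of the input matrix j. In general r lies between -1 and 1. A value of 1 means that i and j show exactly the same pattern, while a value of 0 means that they are not correlated at all.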
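To make the coefficient concrete, here is a small pure-Python analogue of MATLAB's corr2. It is a sketch for illustration only: the Python function is hypothetical and merely mirrors the MATLAB name, and returning NaN for a constant input is a choice made here for the zero-denominator case.

```python
import math


def corr2(a, b):
    """Two-dimensional correlation coefficient of two equally sized matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same size")
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                            * sum((y - mean_y) ** 2 for y in ys))
    # A constant matrix has zero variance, so r is undefined.
    return numerator / denominator if denominator else float("nan")
```

Comparing a matrix with itself gives 1, comparing it with its complement gives -1, and unrelated patterns give values near 0.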

Recognition - Template Correlations


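temp = templates(:,:,j);        % j-th template image
in = chars(:,:,i);              % i-th segmented character (source image)
allCorrs(j) = corr2(temp, in);  % correlation of the character with template j

Figure: a source image compared with three template images; the resulting allCorrs(j) values are 0.82011, 0.57395 and 0.43850.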
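The recognition step then reduces to picking the template with the highest correlation. The sketch below is a Python illustration with three tiny hypothetical 3 X 3 templates instead of the 36 templates of the MATLAB application; `correlation` and `recognize` are assumed names, and treating a constant image as correlation 0 is a simplification made here.

```python
import math


def correlation(a, b):
    """Correlation coefficient of two equally sized binary images."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                            * sum((y - mean_y) ** 2 for y in ys))
    return numerator / denominator if denominator else 0.0


def recognize(char_image, templates, labels):
    """Return the best-matching label and the correlation with every template."""
    all_corrs = [correlation(template, char_image) for template in templates]
    best = max(range(len(all_corrs)), key=all_corrs.__getitem__)
    return labels[best], all_corrs


# Hypothetical miniature templates for the characters I, L and T.
TEMPLATES = [
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
]
LABELS = ["I", "L", "T"]

# A slightly damaged "L" (one pixel missing) is still matched to "L".
label, scores = recognize([[1, 0, 0], [1, 0, 0], [1, 1, 0]], TEMPLATES, LABELS)
```

Taking the maximum always returns some label, so a practical system would probably also reject matches whose best correlation falls below a chosen threshold.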

Android Implementation

We built the same OCR application for Android devices, named MyOCR, using Tesseract, an open-source OCR library sponsored by Google.

Tesseract Background:
Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
Came neck and neck with Caere and XIS in the 1995 UNLV test.
Never used in an HP product.
Open sourced in 2005.
Now at: http://code.google.com/p/tesseract-ocr
Highly portable.

Tesseract OCR Architecture

ADVANTAGES
Increases efficiency
Recovers valuable space
Eliminates the need for retyping
Greater accessibility

APPLICATIONS
Document reading machines used for banking applications
Automatic address reading for mail sorting
Data entry
Process automation
Aid for the blind
Automatic number-plate readers

Other Applications

Text Entry
Page readers for text entry, mainly used in Office Automation

Typical errors in OCR


Variations in shape
Due to serifs and style variations.

Deformations
Caused by broken characters, smudged characters and speckle.

Variations in spacing
Due to subscripts, superscripts, skew, and variable spacing.

Mixture of text and graphics

Future needs
The need for constrained OCR will decrease
Omnifont OCR systems
Recognition of manually produced documents
Recognition of entire words instead of individual characters

REFERENCES

http://www.uri.edu/~hansenj/projects/ele585/OCR/

J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1974.

M. Szmurlo, Master's Thesis, Oslo, May 1995 (users.info.unicaen.fr/~szmurlo/papers/masters/master.thesis.ps.gz).

THANK YOU
Special Thanks To: Google.com, Mathworks.com
