
Imaging and OCR

K.T.Anuradha
National Centre for Science Information
Indian Institute of Science
Bangalore – 560 012
(E-Mail: anu@ncsi.iisc.ernet.in)

15-20 April 2002 Imaging and OCR PI-3 1

Goals of This Presentation


• To give an overview of the imaging and Optical Character Recognition (OCR) process

What Will You Learn?
• You will get an overview of the imaging and OCR process
• What you need to do in the lab:
   • Scan some specific documents and, using the OCR software installed, convert the scanned images to text


Historical Perspective

• M. Sheppard's invention, GISMO, a robot reader-writer, in 1951
• J. Rainbow developed a prototype machine in 1954
   • Able to read uppercase typewritten output at the “fantastic” speed of one character per minute
• IBM, Recognition Equipment, Inc., Farrington, Control Data, and Optical Scanning Corp. marketed OCR systems by 1967
• NASA used imaging systems to enhance and manipulate satellite images
Historical Perspective
• Several standards were developed
   • Character Set for Optical Character Recognition (OCR-A). ANSI X3.17-81
   • Character Set for Optical Character Recognition (OCR-B). ANSI X3.49-75
   • Paper Used in Optical Character Recognition Systems. ANSI X3.62-87
   • Optical Character Recognition (OCR) Inks. ANSI X3.86-80
   • Optical Character Recognition (OCR) Character Position. ANSI X3.93-81


Applications

• Industries and institutions in which control of large amounts of paperwork is critical
   • Banking, credit card and insurance industries
• The medical community
   • To capture, store and transmit radiology images
• Libraries and archives
   • For conservation and preservation of vulnerable documents and for the provision of access to source documents
Glossary
• Glyph – the image of a character rendered in pixels.
• Raster – the scanned image created by a kinescope (a CRT, Cathode Ray Tube, such as that used in computer displays)
• Text image – the content of a text record, often the contents of a page of text.
• Pixels (PICture ELements) or pels (Picture ELements) – image sample areas that are almost always square. Arranged in a grid, pixels form a raster image. Scanning a page of a paper or microform document creates a digital image that is a raster of pixels.

More about Pixels


• All pixels are identical in size and arrangement.
• All pixels are processed the same way.
• All pixels are scanned, displayed, and printed the same way.
• Each pixel has a location and a colour.
   • Both are given as numbers.
   • Location: latitude and longitude (its row and column in the grid)
   • Colour: amount of red, green and blue
   • Maximum on all three is white; minimum on all three is black
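
As a minimal sketch of the points above, a raster can be modelled in Python as a grid of (red, green, blue) triples addressed by row and column. The names `make_raster`, `WHITE` and `BLACK`, and the 0-255 range per channel (8 bits per channel), are illustrative assumptions and are not specified in the slides.

```python
# Illustrative sketch: a raster as a grid of RGB pixels.
# Assumption: 8 bits per channel, so each channel runs from 0 to 255.
WHITE = (255, 255, 255)  # maximum on all three channels
BLACK = (0, 0, 0)        # minimum on all three channels


def make_raster(width, height, colour=WHITE):
    """Return a height x width grid in which every pixel starts as `colour`."""
    return [[colour for _ in range(width)] for _ in range(height)]


raster = make_raster(width=4, height=3)
raster[1][2] = BLACK  # location = (row 1, column 2); colour = black

for row in raster:
    # Print '#' for black pixels and '.' for every other colour.
    print("".join("#" if pixel == BLACK else "." for pixel in row))
```

Every pixel is stored and handled in exactly the same way; only the two numbers that describe it (where it is, what colour it has) differ.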

Bit-Mapped Images
• A bit-mapped image is a raster of pixels.
• Printed as a raster.
• Can be created by raster scanning.
• Can be created by a RIP (Raster Image Processor) in a printer.


How many shades


• Five main types of image shades
   • 1-bit black and white (bi-tonal): no shades between black and white
   • 4-bit gray scale: 16 shades of gray
   • 8-bit gray scale: 256 shades of gray
   • 8-bit colour: each pixel can be one of 256 colours
   • 24-bit colour: 16.8 million colours
• 32- and 42-bit colour: not widely used; chosen mainly by photographers
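
The shade counts above follow from the number of bits stored per pixel: n bits can encode 2 to the power n distinct values. A short Python check, where the function name `shades` is only an illustrative choice:

```python
def shades(bits_per_pixel):
    """Number of distinct values that fit in the given number of bits."""
    return 2 ** bits_per_pixel


for bits in (1, 4, 8, 24):
    print(f"{bits:>2}-bit: {shades(bits):,} possible values per pixel")
```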
Resolution
• The number of dots per inch (dpi) determines the resolution
• The higher the dpi, the larger the file size
• A 1-bit black-and-white image at 100 dpi requires 10 Kb of storage, and a 24-bit colour image at 400 dpi requires 475 Kb of storage
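
How resolution and bit depth drive storage can be sketched as follows. The slide does not state the image dimensions or whether compression is applied, so this sketch does not try to reproduce its figures; the 8.5 x 11 inch page and the function name `uncompressed_size_bytes` are assumptions made purely for illustration.

```python
import math


def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed raster scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px * bits_per_pixel / 8)


# Assumed page size: 8.5 x 11 inches (not stated in the slides).
for dpi, bits in ((100, 1), (400, 24)):
    size = uncompressed_size_bytes(8.5, 11, dpi, bits)
    print(f"{dpi} dpi, {bits}-bit: {size:,} bytes uncompressed")
```

Because the pixel count grows with the square of the dpi, doubling the resolution quadruples the uncompressed size before bit depth is even taken into account.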


Image Transmission and Access


• On the Net via a standard protocol such as TCP/IP
• Transferring a single archival image over a 56 Kbps line requires about 18 minutes; a thumbnail transfers within seconds. The LAN should support 10 Mbps to 100 Mbps
• A 19-inch colour monitor that supports a resolution of 1024 by 768 is ideal.
• Printers range from desktop monochrome laser printers at 300 to 600 dpi to the more expensive gray scale and colour laser printers
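
The transfer times above can be approximated by dividing the image size in bits by the line speed. In this sketch the two image sizes are hypothetical values chosen for illustration, and protocol overhead and latency are ignored, so real transfers would take somewhat longer.

```python
def transfer_seconds(size_bytes, line_bits_per_second):
    """Ideal transfer time in seconds, ignoring protocol overhead and latency."""
    return size_bytes * 8 / line_bits_per_second


# Hypothetical image sizes, for illustration only.
ARCHIVAL_BYTES = 7_500_000
THUMBNAIL_BYTES = 25_000

for label, rate in (("56 Kbps line", 56_000), ("10 Mbps LAN", 10_000_000)):
    archival = transfer_seconds(ARCHIVAL_BYTES, rate)
    thumbnail = transfer_seconds(THUMBNAIL_BYTES, rate)
    print(f"{label}: archival {archival / 60:.1f} min, thumbnail {thumbnail:.2f} s")
```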
Types of images
• Thumbnail
   • Lets the viewer judge whether to view the full image; requires about 10-35 Kb of storage space per image
• Service
   • Designed to convey information; typically compressed, requiring up to 300 Kb per image
• Archival
   • Uncompressed image free of the artifacts resulting from compression; the highest-quality images require several Mb each

Indexing of Images

• Images are indexed so that they can be identified and retrieved
   • E.g. purchase order number, policy number, account number, profile number, ISSN
• The MARC format for bibliographic records has some limitations in indexing images
• Two alternatives to MARC are Dublin Core and EAD (Encoded Archival Description)
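
A minimal sketch of what indexing buys you: each image carries a small metadata record, and an index maps a field value to the images that have it. The records and file names below are hypothetical; the keys `identifier`, `title` and `subject` are borrowed from Dublin Core element names, but this is not a full Dublin Core implementation.

```python
from collections import defaultdict

# Hypothetical metadata records, one per scanned image.
RECORDS = [
    {"identifier": "scan-0001.tif", "title": "Invoice, March", "subject": "invoice"},
    {"identifier": "scan-0002.tif", "title": "Purchase order", "subject": "purchase order"},
    {"identifier": "scan-0003.tif", "title": "Invoice, April", "subject": "invoice"},
]


def build_index(records, field):
    """Map each value of `field` to the identifiers of the images carrying it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["identifier"])
    return dict(index)


by_subject = build_index(RECORDS, "subject")
print(by_subject["invoice"])  # every image indexed under the subject "invoice"
```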
Image formats
• Raster
   • Bit-mapped graphics composed of coloured dots.
   • Common formats include .tiff (Tagged Image File Format: basis for all image files), .jpg (Joint Photographic Experts Group, for gray line images), .gif (for colour images), .mpg (Moving Picture Experts Group), .bmp, .pdf
   • Images are edited in paint and photo programs like Adobe Photoshop and MetaCreations Painter
• Vector
   • Mathematically defined, with coded instructions that define the angles and relationships between every line in the image.
   • Common vector formats include .wmf and .cgm
   • Images are edited in drawing programs like Adobe Illustrator and CorelDRAW.

Image formats: uses and advantages

• Raster
   • Uses: continuous-tone images, e.g. photographs; on the web, where no vector formats are currently supported
   • Advantages: the only format that will show the smooth gradients and subtle detail necessary in photographic images; allows colour correction much more easily than vector images
• Vector
   • Uses: logos with a few solid colours that need to be shown at a variety of sizes; creating specialized text effects; 3D and CAD programs
   • Advantages: resolution independent; smooth curves; small file sizes

Image capture interfaces

• IDE
   • Widely used, low cost, poorest seek time
• SCSI
   • Faster seek time, costs more, 40-160 Mb/sec
• USB (Universal Serial Bus)
   • Ease of setup, 15 Mb/sec
• IEEE 1394
   • Initially developed by Apple, 3.2 Gb/sec, not supported by all PCs


Image Drivers

An image driver is required for an image capture device to communicate with software applications. Two standards are available:
• ISIS
   • Proprietary product developed by Pixel Translations
• TWAIN
   • Developed and designed by the TWAIN Working Group, which in 1999 adopted TWAIN 1.7 as the current standard

Selecting Imaging System
• The choice of imaging system depends on the type of application
   • Workflow or transaction processing systems: focus on processing documents and automating the process; capture and store images without alteration. E.g. purchase orders, invoices, credit card charges and insurance policies
   • Storage and retrieval systems: store and retrieve large numbers of documents in a variety of types and formats; capture and enhance them to facilitate readability. E.g. the medical and library communities

Types of Imaging System

• Drum scanners: high-end scanners
   • Use photomultipliers
   • Expensive and sensitive devices
• Flatbed scanners
   • Ideal for odd-sized images
• Sheetfed scanners
   • Can scan only loose sheets
   • Compact in size and easy to install
• Handheld scanners
   • Provide portability and functionality at low cost

What, Why and When of OCR
• OCR makes it possible to scan printed, typewritten or handwritten text (numerals, letters or symbols) and/or convert the scanned image to a computer-processable format, such as plain text, a Word document or an Excel spreadsheet, which can be edited, used or reused in other documents
• It uses raster images

What, Why and When of OCR

• OCR is used when recreating a document in electronic form by hand would take more time
• The converted text files take less space than the original image file and can be indexed
• Bridges the gap between the paperless and the papered

How of OCR

• It has three components:
   • Image scanner, OCR hardware/software, output interface


How of OCR

• The scanner has four components:
   • A detector, an illumination source, a scan lens and a document transport
• OCR hardware/software performs three operational steps:
   • Document analysis, character recognition, contextual processing
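
The three operational steps can be illustrated with a deliberately tiny Python sketch. Everything in it is an assumption made for illustration: the 3 x 5 glyph templates, the toy page, the two-word lexicon, and the choice of simple template matching plus dictionary lookup. It is not how any particular OCR product works.

```python
import difflib

# Hypothetical 3x5 glyph templates ('#' = ink, '.' = background).
TEMPLATES = {
    "D": ("##.", "#.#", "#.#", "#.#", "##."),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}

# Toy scanned page meant to read "DOT": the D is smudged so that it looks
# exactly like an O, and the T carries one stray pixel of noise.
PAGE = (
    "###.###.###",
    "#.#.#.#..#.",
    "#.#.#.#.##.",
    "#.#.#.#..#.",
    "###.###..#.",
)

LEXICON = ["DOT", "CAT"]  # hypothetical dictionary for contextual processing


def segment(page):
    """Document analysis: split the page into glyphs at blank columns."""
    width = len(page[0])
    glyphs, start = [], None
    for col in range(width + 1):
        blank = col == width or all(row[col] == "." for row in page)
        if not blank and start is None:
            start = col
        elif blank and start is not None:
            glyphs.append(tuple(row[start:col] for row in page))
            start = None
    return glyphs


def hamming(a, b):
    """Count the pixels at which two equally sized glyphs differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise(glyph):
    """Character recognition: pick the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))


def contextual_fix(word, lexicon):
    """Contextual processing: snap the raw result to a close dictionary word."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=0.6)
    return matches[0] if matches else word


raw = "".join(recognise(glyph) for glyph in segment(PAGE))
print("raw recognition:", raw)
print("after contextual processing:", contextual_fix(raw, LEXICON))
```

The noisy T is still matched correctly because it remains closest to its own template, while the smudged D is misread at the character level and only recovered once the dictionary is consulted.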


How of OCR

• Output interface
   • Allows character recognition results to be electronically transferred into the domain that uses the results

Types of OCRs
• Two types of OCRs
   • Task-specific readers
   • General-purpose readers
• Task-specific readers
   • Read only specific documents: bank cheques, mail addresses
   • Used primarily for high-volume applications which require high system throughput: assigning ZIP codes to letter mail; reading data entered in forms, e.g. tax forms; automatic accounting procedures used in processing utility bills

Types of OCRs

• General-purpose page readers
   • High-end OCR (usually for offices)
      • Speed and accuracy are important
      • Format preservation
      • Good proofreading solutions
   • Low-end OCR (usually for home use)
      • Speed is not required
      • Proofreading is done manually

Factors affecting OCR quality

• Scanner quality
• Scan resolution
• Type of printed document, whether laser printer output or photocopy
• Paper quality
• Fonts used in the text
• Linguistic complexities
• Dictionary used


Evaluating OCRs

• Neat interface
• Easy-to-use wizards
• Accurate recognition
• Scan resolution setting (600 dpi is advisable)
• Time taken from scanning to delivery of the final product
• Enhanced usability of the product
• Ability to modify the scan settings
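
One way to put a number on "accurate recognition" is to compare the OCR output against a hand-checked reference text. The sketch below uses character-level edit distance (Levenshtein distance) for this; the metric and the names `edit_distance` and `character_accuracy` are one possible choice made for illustration, not something prescribed by the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recognised correctly, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# Hypothetical OCR output with two typical confusions (O/0 and i/l).
reference = "Optical Character Recognition"
ocr_output = "0ptical Character Recognltion"
print(f"character accuracy: {character_accuracy(reference, ocr_output):.1%}")
```

Running the same reference pages through each candidate product gives a like-for-like accuracy figure to set beside the usability criteria listed above.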
Summarizing

• We learnt the basics of imaging systems and images
• The different steps involved in the OCR technique and scanning
• Conversion of raster images to text using OCR techniques
• Types of imaging systems and OCR software
• Evaluation of imaging systems and OCR software


References

• Web sites:
   • www.archivebuilders.com
   • Sunsite.berkeley.edu
   • www.cedar.buffalo.edu/Publications/TechReps/OCR/ocr.htm
   • navigatela.lacity.org/samples/start/
• Journals:
   • Chip, July 2000
   • PCQuest, product review column

Questions?
Comments?
Discussions?
(Please fill in the feedback form)
Thank You!
