Sei sulla pagina 1di 8

Document Management System:A Notion Towards

Paperless Office
Mahendra K. Ugale Shweta J. Patil
Dept. of Computer Science & Engineering Dept. of Computer Science & Engineering
MGM's Jawaharlal Nehru Engineering College, MGM's Jawaharlal Nehru Engineering College,
Aurangabad, (M.S), India. Aurangabad, (M.S), India.
Email-mahendraugale@jnec.ac.in Email-patilshweta10@gmail.com

Dr. Vijaya B. Musande


Dept. of Computer Science & Engineering
MGM's Jawaharlal Nehru Engineering College,
Aurangabad, (M.S), India.
Email- vijaya.musande@gmail.com

Abstract— Paperless Document Management System Document Management System


is used to eliminate the losses that businesses suffer Document management systems are generally filing cabinets
because of physical paper files and filing systems. that provide architecture for organizing all digital documents.
This Paper addresses some of the technologies that These systems work mainly with scanners,
Which convert paper documents into digital formats. Through
are helping professionals shift toward a paperless
search engines, document management systems allow for
business world. quick retrieval to any document or file.
Document management system is commenced as a notion to
A DMS based on organizing digital documents to lower handling of paper. EDMS is capable to handle
search and store documents and to reduce paper. document handling processes and to manage the flow of
Most of the workplace consists a variety of information. EDMS has different parts of capacity in terms of
documents having a mixture of handwritten and storage, archiving, administration, flow control of DMS
printed text. The detection of such documents is a reports to encourage work processes. In any case, intricacy
crucial task for OCR developers. This paper and cost of operation of EDMS still force issues for being
describes different steps for processing different moderate for SME's [2].
documents using scanning, tagging, and indexing for
effective data retrieval with OCR and Indexing A DMS includes five components:
techniques.
I. Input
Keywords: Paperless, Document Management System, A decent scanner will make putting paper documents into your
PC easily. Likewise, documents can be input utilizing photo,
OCR, Tagging, Indexing, Information Retrieval. which is considered as softcopy.

I. INTRODUCTION II. Storage


The storage system provides a capacity to store documents. A
The physical Paper process used handling of a physical
decent storage system will oblige changing documents,
document, from a shelf or filing cabinet or cupboard. Filing
expanding data volumes.
systems require a large amount of physical space and having
inefficiencies in searching for papers. Cost required
III. Indexing
maintaining such type of physical documents is also large.
Organizations that use paper-based processes also having
The index framework will make a composed documenting
security risks. A paperless office should be considered as an
framework and makes future searching or retrieval
important project within organizations and establish an
successfully. A decent indexing framework will make existing
initiative with good environmental practices as a contribution
procedures and systems more successful.
to sustainable development [1].

978-1-5090-4264-7/17/$31.00 ©2017 IEEE


217
IV. Retrieval • Indexing of PDF Documents-Collect all PDF documents
The retrieval framework utilizes fundamental data about the to be indexed into one or more folders.
documents along with an index and related content to recover • Searching of PDF Documents-Searches can be done by
pictures stored in a system.A productive retrieval system will specifying single string, or multiple search strings or by
make finding the required archives speedier and using patterns.
simpler.output should be Eco-friendly.
In this paper, an emphasis is given on last three phases i.e.
Documents are viewable to only authenticated personnel, tagging of PDF documents, indexing of PDF documents and
whether in the workplace, in various areas, or over the searching of PDF documents. The aim is to improve
Internet. information retrieval from PDF documents by reducing time
using effective tagging and indexing.
V. Retention
In record accessing system document retention schedule must
be characterized. This permits the documents that came to
end of their maintenance calendar to be distinguished and
decimated. By overwriting distinctive records number of times
the reports are successfully "destroyed."

Architecture of Document Management System


Fig-2 Tagging of Documents

Fig-3 Indexing of Documents.

II. RELATED WORK

A. Overview of different OCR Software


• Nuance Omni Page Ultimate
Initially used OCR software is Nuance OmniPage Ultimate.
When its “OCR & train” process starts, OmniPage recognizes
an entire page and then brings up a window with a grid of all
doubtful characters. The user then goes through and corrects
any character that OmniPage has incorrectly identified, after
which the program OCRs the page and returns results. This
approach may work well for texts in English (or one of the
other nine languages that OmniPage supports), but it fails for
language like Devanagari [9].

Fig-1 Document Management System Testing Observation-


o Intermediate file (.opd) is working inefficiently, It
takes time to load input file and to save output file, It
Electronic documents such as PDF's are becoming doesn’t support large file,Working behavior of
increasingly popular as we move further towards the notion of
software is dynamic.
the document management system [13].For a paperless office,
it is necessary to reform administrative processes by the The recovery system is not appropriate, Retrieval of
organization [6]. data is not satisfactory, and Re-editing of the
document is tedious.
Different Phases
• Scanning of Documents-Normally scanning at 300 dpi is • Abbyy Finerader
recommended and Maximum dpi limit can be up to 600. ABBYY FineReader is an OCR framework that examined
• Tagging of PDF Documents (Scanned Documents)-Read document picture records including advanced photographs i.e.
page images, Analyze page images, Recognize the digital photos into an editable format. [10].
contents of the image.

218
Testing Observation- o Better graphics management,
o Simple GUI . o document flattening .[11]
o Faster and Accurate Recognition.
o Supports Most Of The World's Languages. Testing Observation-
o Difficult User Interface
• SimpleOCR o It missed out some instances while searching.
This is freeware OCR application that gives flexible o Retrieval of data is not satisfactory.
accuracy for those who just want to convert a few no of o Accuracy for Information retrieval is less.
pages using OCR. Developers can use SimpleOCR with • Adobe Reader
their custom applications. A document that consists of scanned images of document is a
consisting of scanned images of text which is inaccessible
• Simple Index because the content of the document is a graphic
Simple Index is designed to enable to quickly organize representation not searchable text.
multiple documents in batches, extract data from text
Scanned images of text must be converted into reachable and
and barcodes with OCR, and then use that data to
readable text using OCR. [12]
organize the files automatically.
Testing Observation-
Functionalities of Tested OCR Softwares o Simple User Interface.
o It gives all correct instances while searching.
Table I- Functionalities of Tested OCR Softwares
o Retrieval of data is satisfactory.
o Accuracy for content retrieval is more.
Nuance
Criterion Simple ABBYY Omni- Simple
OCR Fine- page Index Table II- Functionalities of Tested Indexing Softwares
Reader Ultimate
Scanner
Driver
TWAIN TWAIN TWAIN TWAIN Criterion Nuance PDF Adobe Reader
Supported
Converter
Table/
Spreadsheet ✘ ✓ ✓ ✘ Professional
Recognition User friendly No Yes
Searchable
PDF ✘ ✓ ✓ ✘
Output Retrieval Not satisfactory Good
PDF
Password ✘ ✓ ✓ ✘ Indexing Time Less More
Support
Vertical Text
Recognition
✘ ✓ ✓ ✘ Accuracy Not satisfactory Good
Barcode
Recognition
✘ ✓ ✓ ✘
Image Pre-
✘ ✓ ✓ ✘ Based on above comparison of indexing software it is
processing
observed that Adobe Reader software is better for indexing
Hot
Folder
✘ ✓ ✘ ✘ PDF documents.
Language
Supported
3 189 137 179 III. SYSTEM IMPLEMENTATION
Installation Desktop/
Desktop Desktop Desktop
Network
For Tagging of PDF documents, the proposed framework
Based on above comparison of tested OCR software it is depends on the implementation of OCR with help of Asprise
observed that Abbyy Fine reader OCR software is SDK. For Indexing of PDF archives, the proposed framework
depends on implementation with the help of Hoot SDK.
competitively better.
The input for the OCR system is scanned document image. To
B. Different Indexing Software
perform character recognition , the application needs to go
Through three steps-
• Nuance PDF Converter Professional
o It provides an effective document conversion to PDF. • Optical Scanner
It also offers Character images can be acquired from a scanner or any other
o one-click scanning to PDF, electronic source. It will be stored in any image format and that
Advanced PDF search capabilities, image may be a color, gray scale, but the actual processing will
o Accurate PDF to Microsoft Excel conversion, take place on binary images [3].
o Enhanced multimedia support,

219
• Preprocessing from document photo and yield in formats like plain text,
The resulting image from the scanning process may contain XML and accessible PDF [7].
noise. Depending on the resolution of the scanner, the
characters may be broken. These defects cause poor Asprise C# .NET OCR (optical character recognition) and
recognition rates that can be eliminated by using a barcode recognition SDK offers an elite API library for you to
preprocessing develop your C# .NET applications (Windows applications,
Silverlight, ASP.NET web service applications, ActiveX
• Segmentation controls, and so on.) with the usefulness of extracting content
Given an input image, identify individual glyphs (basic units and barcode information from examined reports.
representing one or more characters, usually contiguous).
Highlights
• Feature Extraction: • High Accuracy
From each glyph image, extract features to be used as input Low resolution documents can be easily recognize by Asprise
of ANN. This is the most critical part of this approach. [3] OCR.

• Classification: • Format retention:


There are no of classification or recognition approaches that Input document’s Text layouts are preserved effectively.
are as follows-
o Template Matching, • High Speed:
o Neural networks, High Speed to perform recognition effectively in short time
o Decision tree based
o Bayesian-based, • Ease of utilization:
o Support Vector Machines SVM[3]. Complex are expelled from Asprise OCR SDK

This paper utilized the toolkits of Hoot that used for • Barcode Recognition:
information retrieval. Alongside letters and numbers Asprise OCR can perceive
practically every sort of barcode.
• Preprocessing Module: You can choose to recognize characters or barcodes or
Before utilizing Hoot we have to preprocess the available both.
text documents. The principle part of preprocessing is to
change full-width characters into half-width characters. So
as to better show the utilization of Hoot, this will divide the Algorithm of tagging of documents-
large documents into small documents and then assign a This flowchart explains different steps for tagging of
unique ID number to each document. documents.

• Indexing Module:

In the wake of Preprocessing you can utilize Hoot to process


significant data.

1. Index Creation for handling documents


2. Build a query object.
3. Pursuit or search in an index.

• Searching Module:

After indexing process, framework will build up a search class


which gives two methodologies, Index search approach is
utilized to search indexing which is constructed by Hoot.
Nonetheless, string search approach used to seek information.
To begin with, give the pursuit path then parse the string and
create query object to search data.

A. Tagging of PDF Documents using Asprise.


Asprise OCR SDK is a commercial optical character
recognition and barcode identification recognition SDK
library that gives an API to perceive the text and barcodes

220
Information Retrieval in HOOT-

The Important part of indexing module is to the speedup of


recovery; the scanning module is principally utilized for
cooperating with users[5].

An index is made out of various sections, a section is a mix of


documents, a document is a combination of fields.
a Once documents are built and analyzed, next step is to
index them so that this document can be retrieved based on
certain keys instead of whole contents of the document.
Indexing process is similar to indexes which are at the end of
a book where common words are shown with their page
numbers so that these words can be tracked quickly instead of
searching the complete book.

Working of HOOT:
• Set the index storage, it will store all the index
information.
• Select Specific location or folder where you need to
index for content.
• Load Hoot: It will stack Hoot so you can inquiry a
current record using existing index.
• Begin Indexing : It will stack Hoot and begin indexing
the folder you indicated.
• Stop Indexing : It will come dynamic after you have
begun indexing so you can stop the procedure.
• Count Words : It will demonstrate the quantity of words
in Hoots lexicon .
• Save Index : It will save index in memory to disk.
• Free memory : It will call the interior free memory
method on Hoot
• You can look for content; it will demonstrate the count
and time taken..
• To open the document simply double clicks on the
record way in the listbox.
Fig 4 --- Algorithm of Tagging of Documents Indexing Technique-
B. Indexing of PDF Documents using Hoot.
There are number of popular information retrieval (IR)
Hoot is a framework that considered as
indexing techniques are available; the technique which
o Part of the analysis engine,
o It can likewise give a straightforward however used by Hoot is called Inverted Index.
intense interface with the goal that individuals can • Inverted index
helpfully and rapidly build up the crawler.
o Hoot is the littlest utilization of information retrieval An inverted index is a special index which stores the words to
using highly compact storage inverted WAH bitmap bitmap conversion data.
index and data recovery is the way toward hunting
down words in a piece of content. In an ordinary index you would store what words are in a
o A Text indexing engine specific document, an inverted index is the inverse given a
o A Query engine. word 'x' what records have this word is stored.
Features-
o Quick operating speed. A bitmap index is an index based record numbers to the
o Incredibly little code estimate. archive. Bitmap indexes were initially additionally utilized as
a part of Information Retrieval (IR), however, are today
o Uses WAH compacted BitArrays to store data.
mostly replaced by inverted index.
o Multi-threaded execution.
o Tiny estimate just 38kb DLL
WAH [15] is one of a few presented compression techniques.
o Highly optimized storage.
Despite the fact that there are plans with more minimized
indexes, WAH supports effective query preparing.

221
Table IV- Accuracy of Correctly recognized Pages for Automated and Manual
This joined with the way that FastBit is transparently tagging
accessible persuades the utilization of WAH-compacted
bitmap lists. No of Tested Accuracy of
Criteria Pages Correctly Recognized
An inverted index comprises of a dictionary of the distinct values (Printed Pages (%)
of the attribute, with pointers to inverted lists that reference tuples
and Nuance Abby
with the given value through tuple identifiers (TIDs). To decrease Omni- Fine-
both space use and the I/O prerequisites in query processing, the Handwritten)
page reader
inverted lists are frequently compacted [14].
Ultimat
e
IV. EXPERIMENTAL RESULTS Automated 80% 85%
There is training data of organization containing Tagging
Approximately 15 Lakhs pages which need to be tagged by 100
using Abbyy Fine Reader. The technique is applied in a Manual 85% 95%
sequence of reading page image, analyze page image and Tagging
preprocess page images to recognize page contents
correctly.Initially, we considered 1181 scanned pdf documents
i.e Handwritten, Typed and Printed for tagging by considering The dataset we considered tagged pages for indexing
some fixed keyword using Abbyy Fine Reader and Omnipage considering some fixed keyword. Here we consider 20
Ultimate and obtain the following result. keywords in a single page for searching after indexing in
Nuance PDF Converter Professional and Adobe Reader. The
Table III- Result Analysis of Tagging Softwares result is presented for four cases having 20 keywords per
page and obtain the following result.
Table V- Result Analysis of Indexing Softwares
Nuance Abbyy Fine Retrieval
Tools Used Omnipage Reader Nuance PDF Accuracy of
Retrieval
Adobe Accuracy
Ultimate Software Converter
Reader
Nuance PDF
of Adobe
Used Professional Converter
Reader
Professional
(%)
Keyword Recognition (%)
Keyword
Rate
150 190 Searching
16 Keywords
19
80% 95%
(Out of 200 tagged Rate Keywords
keyword) For Page 1
Keyword
Accuracy of Searching Searching 18
16 Keywords 80% 90%
( Manually and Rate Keywords
75% 95% For Page 2
Automatically) Keyword
(%) Searching
17 Keywords
19
85% 95%
Rate Keywords
Size 349 MB 349 MB For Page 3
Before Tagging (1181 Pages) (1181 Pages) Keyword
Size Searching 16
14 Keywords 70% 80%
55 MB 60 MB Rate Keywords
After Tagging For Page 4
Time Required To Load More Than 2 Less Than 1 Average
Retrieval
Pages Hours Minute Accuracy
78.75% 90%
Time to Perform OCR (%)
4 Hours 3 Hours
(Automatically)
For testing purpose of tagging using Asprise OCR different
cases are considered-

Case 1- Keyword recognition rate for plain text page with


300 dpi resolution and plain text page with 150 dpi
resolution.
Case 2- Keyword recognition rate from Tabular data
document.
Case 3- Barcode Recognition rate from document.

Following Table V shows experimental results of Tagging


using Asprise OCR by considering various criteria.

222
Following Table VI shows experimental results of Indexing V. CONCLUSION
using Hoot for effective retrieval. Maximum features are available in Abbyy Finereader than
OmniPage Ultimate. Working performance of Abbyy is better
Table VI-Experimental Results of Tagging
than Omnipage because of an easy interface. Keyword
Recognition rate of Handwritten and printed document is
better in Abbyy Finereader. The accuracy of correctly
Cases Criteria Nature of Asprise OCR searched documents by Abbyy FineReader is 95%.Keyword
Page SDK
Recognition rate is better of Printed or Typed Document if
Automated Tagging is used. Keyword Recognition rate is
Keyword
Recognition 356/358 better for Handwritten Document if Manual Tagging is used.
Rate (Testing is done using an evaluation copy of OmniPage and
Accuracy of Abbyy Finereader.)
Recognized
99%
Keyword In this paper implementation of tagging of a document is
(%) based on Asprise OCR SDK. Keyword Recognition Rate
Size Plain Text
Before Page with 300 1.12 MB from plain text page having 358 Keywords with good
Tagging dpi resolution Resolution is greater than keyword recognition rate from
Size 2kb plain text page having 358 Keywords with low Resolution.
After Tagging (rtf file) The accuracy of Recognized Keyword from Plain Text Page
Case-1 Time
Required To
with 300 dpi resolution is 99% in Asprise OCR SDK. Plain
10 Sec Text Page with 150 dpi resolution i.e. low resolution is
Recognize 358
Keywords 59.60%. Content formats or Text layouts on input
Keyword documents are saved in asprise OCR SDK. Adjacent to
Recognition 153/358
Rate Plain Text Characters (letters and numbers), Asprise OCR can perceive
Accuracy of Page with 150 each sort of barcode.
Recognized dpi resolution
59.60%
Keyword This paper also proposes an effective information retrieval
(%)
indexing method using the evaluation copy of Adobe Reader.
Recognition
of 2 rows and All keywords After indexing the documents, retrieval is much faster than
Tabular data simple tagged documents. Though PDF documents tagged
Case-2 3 columns are recognized
document. with identifying keywords to enhance searchabilty, indexing
with with layout.
Keywords gives a better solution to retrieve information efficiently. It
Barcode Correctly reduces a time for information retrieval. Thus indexing needs
Barcode Data
Case-3 Recognition Recognized to be carried out after tagging of documents to increase
Document
rate contents retrieval accuracy.

The work extensively has tested the same using Nuance PDF
Converter Professional also. Retrieval accuracy is also more in
Table VII Experimental Results of Indexing Adobe Reader. The accuracy of correctly retrieved documents
by Adobe Reader is 90%.
Hoot is a full-text indexing engine toolkit written in VB.net,
Retrieval multi-user support access, rapidly visit indexing time. This
Accuracy of paper provides detail analysis of Hoot, indexing and
Software Used Hoot
Hoot
(%)
searching. The accuracy of correctly retrieved documents
Keyword Searching Rate using Hoot is 91.25%. The experimental results show that
19 Keywords 95% Hoot is efficient for information retrieval.
For Page 1
Keyword Searching Rate
19 Keywords 95% ACKNOWLEDGMENT
For Page 2
Keyword Searching Rate
17 Keywords 85%
For Page 3
Keyword Searching Rate We would like to acknowledge the support and input from our
18 Keywords 90%
For Page 4 Project Guides Dr. Vijaya B Musande for their constant
Average Retrieval Accuracy
91.25% encouragement and guidance during completion of work.
(%)

223
REFERENCES [7] ra-Dinora ORANTES-JIMÉNEZ, Alejandro ZAVALA GALINDO,
Graciela VÁZQUEZ-ÁLVAREZ, Paperless Office: a new proposal for
[1] Sandra-Dinora ORANTES-JIMÉNEZ, Alejandro ZAVALA GALINDO, AND INFORMATICS VOLUME 13 - NUMBER 3 - YEAR 2015
Graciela VÁZQUEZ-ÁLVAREZ, Paperless Office: a new proposal for
organizations, ISSN: 1690-4524 SYSTEMICS, CYBERNETICS AND
INFORMATICS VOLUME 13 - NUMBER 3 - YEAR 2015 [8] Hang Thu Pho, Torben Tambo, “Integrated Management Systems and
Workflow-Based Electronic Document Management: An Empirical
Study” , Journal of Industrial Engineering and Management, pp. 194-
[2] Hang Thu Pho, Torben Tambo, “Integrated Management Systems and 217, January 2014.
Workflow-Based Electronic Document Management: An Empirical
Study” , Journal of Industrial Engineering and Management, pp. 194- [9] M Swamy Das,Ram Mohan Rao Kovvur, “Evaluation of Neural Based
217, January 2014. Feature Extraction Methods for Printed Telugu OCR System”,
Advances in Computer Science and Information Technology (ACSIT)
[3] M Swamy Das,Ram Mohan Rao Kovvur, “Evaluation of Neural Based ,Volume 2, Number 11,pp. 85-90,April- June, 2015.
Feature Extraction Methods for Printed Telugu OCR System”,
Advances in Computer Science and Information Technology (ACSIT)
,Volume 2, Number 11,pp. 85-90,April- June, 2015. [10] Y. C. Li and H. F. Ding, “Research and Application of Full-Text Search
Engine Based on Lucene,” Computer Technology and Development,
Vol. 20, No. 2, 2010, pp.4-56.
[4] Y. C. Li and H. F. Ding, “Research and Application of Full-Text Search
Engine Based on Lucene,” Computer Technology and Development,
Vol. 20, No. 2, 2010, pp.4-56.
[5] FreeOCR Url [Online].Available: http://www.paperfile.net/ [11] FreeOCR Url [Online].Available: http://www.paperfile.net/
[6] Asprise OCR Url [Online]
Available: https://en.wikipedia.org/wiki/Asprise_OCR Asprise OCR Url [Online] Available:
https://en.wikipedia.org/wiki/Asprise_OCR
[9] The Abbyy Finereader Website. [Online]. [12]
Available:https://www.abbyy.com/finereader/
[10] The Nuance Omnipage Ultimate Website. [Online]. Available:
[13]
https://www.nuance.com.
[11] The The Omnipage Reader website. [Online]. Available [9] The A
http://www.nuance.com/
[12] The Adobe Reader website. [Online]. Available:
http://www.adobe.com/

[13] Jennifer Pearson,”Supporting Effective User Navigation in Digital


Documents”, CHI-2010,Atlanta,GA,USA,April 10- 15,2010.

[14] Truls A. Bjorklund and Nils Grimsmo, Johannes Gehrke ,


Oystein Torbjornsen, Inverted Indexes vs. Bitmap Indexes in
Decision Support Systems, ACM, CIKM’09, November 2–6,
2009, Hong Kong, China.

[15] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices


with effcient compression, ACM Trans. Database Syst., 2006.

224

Potrebbero piacerti anche