AWS Textract 2019 0312 MCL Slide Deck

Extract Text and Data from Any Document with No Prior ML Experience
Kashif Imran
Sr. Solutions Architect, AWS AI
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Documents Processing
• Amazon Textract
• Overview
• Amazon Textract APIs
• Demo
• Reference Architectures
• How to Get Started
Documents are important
Primary tool of record keeping, communicating, collaborating, and transacting
• Finance
• Insurance
• Real estate
• Accounting
• Tax management
• Medical
• Legal
• Business management
• Education
• And many more…
16.3M US mortgage applications ($2.1T) in 2016
About 240M W2 tax forms will be processed for FY2018 in the US
Need for processing documents
• Search and discovery
• Compliance and control
• Business process automation
How documents are processed today
• Manual processing
• Optical Character Recognition (OCR)
• Rules and template-based extraction
Challenges for processing documents
Manual processing
• Expensive
• Error-prone
• Time-consuming
Manual processing
Variable output
Inconsistent results
Needs multiple human reviews for consensus
Output
1. Exempt is true
2. 28 is true
3. CPP/QPP is true
4. RPC/RRQ is true
Optical Character Recognition (OCR)
No multi-column detection
No rotated text detection (not shown)
No stylized font detection (not shown)
Output
Extract data quickly & No code templates to
accurately maintain
Optical Character Recognition (OCR)
OCR reads left to right, ignoring table structure
Output
Start Date End Date Employer Name Position Held Reason for leaving
1/15/2009 6/30/2013 Any Company Head Baker Family Relocated
Rules and template-based extraction
• Limited by accuracy of OCR
• Significant development and management overhead
• Templates are brittle
Rules and template-based extraction
The well-known W2 US tax form has 100s of variants each year
It looks easy, but …
…not a single corresponding pixel value in common
Introducing Amazon Textract
Extract text and data from virtually any document
Amazon Textract features
Text extraction Table extraction Form extraction
Amazon Textract – Text Extraction
Amazon Textract - Text Extraction
Blocks: PAGE, LINE, WORD
is washed by waves, and cooled
Amazon Textract—Text Extraction API
DetectDocumentText
Request
Name            Description
Document        Blob or Amazon S3 object

Response
Name            Description
Blocks          List of blocks identified from the document
ID              Unique ID of the unit
Relationships   CHILD
Block type      PAGE, LINE, WORD
Pages           Contains number of pages in the document
DetectDocumentText
Request Response
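The response shape above can be read with a few lines of Python. This is a minimal sketch, assuming the boto3 `textract` client and its `detect_document_text` call; the `sample` dictionary is a hand-written fragment in the documented shape, not real service output, and the helper names are illustrative.

```python
def detect_text(bucket, name):
    """Call DetectDocumentText on an S3 object (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing helper below works without it

    client = boto3.client("textract")
    return client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": name}}
    )


def lines_of_text(response):
    """Return the text of every LINE block, in the order the service returned them."""
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


# Hand-written fragment for illustration only (assumed shape, not real output).
sample = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "is washed by waves, and cooled",
         "Confidence": 99.1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "is", "Confidence": 99.5},
        {"Id": "w2", "BlockType": "WORD", "Text": "washed", "Confidence": 98.9},
    ],
}

print(lines_of_text(sample))
```

With real credentials, `lines_of_text(detect_text("textract-demo-images", "simple_text_document.jpg"))` would follow the same path.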
Amazon Textract—
Text Extraction simplified
Multi-column detection
Output
Extract data quickly & accurately
No code or templates to maintain
Demo
Amazon Textract – Table Extraction
Amazon Textract—Table Extraction
Blocks: PAGE, TABLE, CELL

For each 'block' you get:
• Text
• Confidence score
• Block relationships (e.g. cells within a table)
Amazon Textract—Table Extraction API
AnalyzeDocument with "TABLES" as FeatureTypes parameter

Name            Description
FeatureTypes    TABLES
ID              Unique ID of the unit
Relationships   CHILD
Block type      PAGE, TABLE, CELL

AnalyzeDocument - Tables
Request Response
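To show how TABLE, CELL, and CHILD relationships fit together, here is a minimal sketch that rebuilds rows from a block list. It assumes each CELL block carries 1-based `RowIndex` and `ColumnIndex` fields and lists its WORD blocks as CHILD ids, as in the generally available API; the sample blocks are hand-written.

```python
def table_rows(blocks):
    """Rebuild every TABLE block as a list of rows, each row a list of cell strings."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        return [i for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD" for i in rel["Ids"]]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell in (by_id[i] for i in child_ids(table)):
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def _cell(cell_id, row, col, word_ids):
    # Helper for building the hand-written sample below.
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


def _word(word_id, text):
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1", "w2"]), _cell("c2", 1, 2, ["w3", "w4"]),
    _cell("c3", 2, 1, ["w5"]), _cell("c4", 2, 2, ["w6"]),
    _word("w1", "Start"), _word("w2", "Date"), _word("w3", "End"), _word("w4", "Date"),
    _word("w5", "1/15/2009"), _word("w6", "6/30/2013"),
]

print(table_rows(sample_blocks))
```

Keying cells by (row, column) keeps the words grouped by cell, so a reader of the output never sees the left-to-right interleaving that plain OCR produces.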
Amazon Textract—
Table Extraction simplified
Table recognized
Words grouped by cell
Output {
Start Date: 1/15/2009
End Date: 6/30/2013
Employer Name: Any Company
Position Held: Head Baker
Reason for leaving: Family relocated
}
Demo
Amazon Textract – Form Extraction
Amazon Textract—Form Extraction
Blocks: PAGE, KEY_VALUE_SET

For each 'block' of your document:
• Form field name (key) and field value (value) association
• Confidence score
• Page number
• Block relationships
Amazon Textract—Forms Extraction API
AnalyzeDocument with "FORMS" as FeatureTypes parameter

Name            Description
FeatureTypes    FORMS
ID              Unique ID of the unit
Relationships   KEY, VALUE, CHILD
Block type      PAGE, KEY_VALUE_SET

AnalyzeDocument - Forms
Request Response
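The key and value association can be followed in code. The sketch below is an illustration under assumptions taken from the generally available API rather than from this deck: a KEY_VALUE_SET block is marked as key or value through an `EntityTypes` field, a key points at its value through a relationship of type VALUE, and checkboxes arrive as SELECTION_ELEMENT blocks with a `SelectionStatus`. The sample blocks are hand-written.

```python
def key_value_pairs(blocks):
    """Map each form key's text to its value's text."""
    by_id = {b["Id"]: b for b in blocks}

    def related(block, rel_type):
        return [by_id[i] for rel in block.get("Relationships", [])
                if rel["Type"] == rel_type for i in rel["Ids"]]

    def text_of(block):
        parts = []
        for child in related(block, "CHILD"):
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Render a checkbox as "True" or "False".
                parts.append(str(child["SelectionStatus"] == "SELECTED"))
        return " ".join(parts)

    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            values = related(block, "VALUE")
            pairs[text_of(block)] = " ".join(text_of(v) for v in values)
    return pairs


def _kv(block_id, entity, child_ids, value_id=None):
    # Helper for building the hand-written sample below.
    rels = [{"Type": "CHILD", "Ids": child_ids}] if child_ids else []
    if value_id:
        rels.append({"Type": "VALUE", "Ids": [value_id]})
    return {"Id": block_id, "BlockType": "KEY_VALUE_SET",
            "EntityTypes": [entity], "Relationships": rels}


sample_form = [
    _kv("k1", "KEY", ["w1"], "v1"), _kv("v1", "VALUE", ["w2"]),
    _kv("k2", "KEY", ["w3"], "v2"), _kv("v2", "VALUE", []),      # empty value
    _kv("k3", "KEY", ["w4"], "v3"), _kv("v3", "VALUE", ["s1"]),  # checkbox
    {"Id": "w1", "BlockType": "WORD", "Text": "First:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Middle:"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Male:"},
    {"Id": "s1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
]

print(key_value_pairs(sample_form))
```

A key whose value block has no children comes back as an empty string, which a caller can treat as an empty field.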
Amazon Textract—
Form Extraction simplified
Logical groupings captured
Relationships captured
Glyphs captured
Output
Full Name:
First: John
Middle: X
Last: Doe
Date of Birth:
MM: 01
DD: 01
YYYY: 1971
Gender:
Male: True
Female: False
Demo
Amazon Textract
Sync and async
Sync: supports single-page documents such as images (e.g., mobile capture)
Async: for multi-page documents, up to 3,000 pages
Asynchronous APIs
• StartDocumentTextDetection
• StartDocumentAnalysis
Request Response
Asynchronous APIs
• GetDocumentTextDetection
• GetDocumentAnalysis
Request
Response
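The start/get pair can be driven with a simple polling loop. This is a sketch under assumptions: the client is a boto3 `textract` client, `get_document_text_detection` returns a `JobStatus` of IN_PROGRESS until the job ends, and large results are paged with `NextToken`. The `FakeClient` class is a hypothetical stand-in so the loop can run without AWS access.

```python
import time


def wait_for_text(client, job_id, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll GetDocumentTextDetection until the job ends, then collect every page of blocks."""
    for _ in range(max_polls):
        page = client.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

    # Anything other than SUCCEEDED (for example FAILED) is treated as an error here.
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"job {job_id} ended with status {page['JobStatus']}")

    blocks = list(page.get("Blocks", []))
    while "NextToken" in page:
        page = client.get_document_text_detection(
            JobId=job_id, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return blocks


class FakeClient:
    """Hypothetical stand-in: in progress once, then two pages of results."""

    def __init__(self):
        self.calls = 0

    def get_document_text_detection(self, JobId, NextToken=None):
        self.calls += 1
        if self.calls == 1:
            return {"JobStatus": "IN_PROGRESS"}
        if NextToken is None:
            return {"JobStatus": "SUCCEEDED", "NextToken": "t1",
                    "Blocks": [{"BlockType": "LINE", "Text": "page 1"}]}
        return {"JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "page 2"}]}


fake = FakeClient()
texts = [b["Text"] for b in wait_for_text(fake, "job-123", sleep=lambda s: None)]
print(texts)
```

In a deployed system the SNS notification channel can replace polling; the loop is only the simplest way to see the two calls work together.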
API Reference – AWS CLI
• DetectDocumentText
aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"simple_text_document.jpg"}}'
• AnalyzeDocument - Forms
aws textract analyze-document --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"employmentapp.png"}}' --feature-types "FORMS"
• AnalyzeDocument - Tables
aws textract analyze-document --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"DenseTextwithTable.png"}}' --feature-types "TABLES"
API Reference – AWS CLI
• StartDocumentTextDetection
aws textract start-document-text-detection --document-location '{"S3Object":{"Bucket":"textract-demo-images","Name":"AfterVisitSummaryExample.pdf"}}' --notification-channel '{"SNSTopicArn":"arn:aws:sns:us-east-1:<aws-account-id>:SNSDemoTest","RoleArn":"arn:aws:iam::<aws-account-id>:role/SNSFullAccess"}'
• GetDocumentTextDetection
aws textract get-document-text-detection --max-results 5 --job-id <job-id>
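The JSON arguments in these CLI calls map directly onto keyword arguments in an SDK. As a sketch, assuming the boto3 `textract` client, the helpers below build the same requests as plain dictionaries; the function names are illustrative, and the feature-type check is limited to the two types covered in this deck.

```python
def s3_object(bucket, name):
    """The S3 reference used by both the sync and async calls."""
    return {"S3Object": {"Bucket": bucket, "Name": name}}


def analyze_document_request(bucket, name, feature_types):
    """Keyword arguments for AnalyzeDocument (sync, single-page)."""
    unknown = set(feature_types) - {"TABLES", "FORMS"}
    if unknown:
        raise ValueError(f"unsupported feature types: {sorted(unknown)}")
    return {"Document": s3_object(bucket, name), "FeatureTypes": list(feature_types)}


def start_text_detection_request(bucket, name, sns_topic_arn, role_arn):
    """Keyword arguments for StartDocumentTextDetection (async, multi-page)."""
    return {
        "DocumentLocation": s3_object(bucket, name),
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


forms_request = analyze_document_request(
    "textract-demo-images", "employmentapp.png", ["FORMS"]
)
print(forms_request)

# With credentials configured, the request could be sent like this:
#   import boto3
#   response = boto3.client("textract").analyze_document(**forms_request)
```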
Amazon Textract
Under the hood
Text Extraction: OCR reimagined
Orientation
Structure variability
Document variability
Beyond OCR: Segmentation and rectification
Photometric
Beyond OCR: Segmentation and rectification
Geometric
Beyond OCR: Table and cell detection
Understand document structure and context to find tables
Understand cells even without explicit boundaries
Variable-sized rows and columns
Beyond OCR: Field name (key) and value Extraction
Detect phrases or groups of words
Output
Full Name:
First: John
Middle: X
Last: Doe
Date of Birth:
MM: 01
DD: 01
YYYY: 1971
Gender:
Male: True
Female: False
Beyond OCR: Inferring key/value association
Detect structures of the same form without templates
Key and value association
Infer empty values
Output
Full Name:
First: John
Middle: null
Last: Doe
Date of Birth:
MM: 01
DD: 01
YYYY: 1971
Gender:
Male: True
Female: False
Reference Architectures
Reference architecture—Index and search documents
Input: Uploaded document images such as tax forms, credit applications, or medical notes
Amazon S3: Uploaded documents are stored in data lake
AWS Lambda: A Lambda function is triggered to initiate document analysis using the Hieroglyph API
Amazon Textract: Automatically extract text, including key-value pairs and tables
Output: Perform contextual search on millions of documents or integrate data into your document management system
Amazon Elasticsearch Service: Extracted data and confidence scores are indexed to enable document search
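The Lambda step in this architecture might look like the sketch below. It assumes the function is triggered by an S3 event notification and starts analysis through the boto3 `start_document_analysis` call; the optional `client` parameter, the bucket name, and the `_RecordingClient` class are hypothetical additions that let the handler run without AWS access.

```python
from urllib.parse import unquote_plus


def handler(event, context, client=None):
    """Start a Textract analysis job for every object in an S3 event."""
    if client is None:
        import boto3  # available in the Lambda Python runtime

        client = boto3.client("textract")

    job_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        job_ids.append(response["JobId"])
    return {"JobIds": job_ids}


class _RecordingClient:
    """Hypothetical stand-in that records requests instead of calling AWS."""

    def __init__(self):
        self.requests = []

    def start_document_analysis(self, **kwargs):
        self.requests.append(kwargs)
        return {"JobId": f"job-{len(self.requests)}"}


sample_event = {"Records": [
    {"s3": {"bucket": {"name": "my-data-lake"},
            "object": {"key": "tax+forms/w2.pdf"}}},
]}
fake = _RecordingClient()
result = handler(sample_event, None, client=fake)
print(result)
```

A second function, not shown, would collect the finished job's blocks and index the extracted text and confidence scores for search.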
Reference architecture—Extract for NLP
Input: Uploaded document images of medical notes, explanation of benefits, and patient forms
Amazon S3: Uploaded documents are stored in S3
Amazon Textract: Automatically extract words and lines of text, and tables
NLP: Use natural language processing to extract insights from medical documents
Amazon Elasticsearch Service: Easily search through extracted data and text insights
Output: Discover medical insights to improve patient care
Quickly turn extracted text/data into actionable insights
Reference architecture—Form capture
Input: Customer uses mobile app to capture a photo of an employment application form
Amazon Textract: The Amazon Textract API is integrated into the end-user application to automatically extract text from the form and auto-populate the form fields
Customer: Customers experience real-time capture of their information by taking a photo instead of manual data entry
Application Database: User submitted data is loaded into a database
Amazon Textract
Launch customers
Amazon Textract
Benefits
Amazon Textract
Pricing
Per 100 pages or images processed                Up to 1M pages/month    1M+ pages/month
Text Detection                                   $0.15                   $0.06
Table Extraction (Text Detection included)       $1.50                   $1.00
Key-Value Detection (Text Detection included)    $5.00                   $4.00
All                                              $6.50                   $5.00
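As a worked example of the table, the sketch below estimates a monthly bill. It assumes the second column applies only to pages beyond the first million in a month (marginal tiers) and ignores the free tier; both are assumptions, not statements from the slide. The feature names used as dictionary keys are illustrative.

```python
from decimal import Decimal

# Prices per 100 pages, copied from the table: (first 1M pages/month, beyond 1M).
PRICES = {
    "text": (Decimal("0.15"), Decimal("0.06")),
    "tables": (Decimal("1.50"), Decimal("1.00")),
    "forms": (Decimal("5.00"), Decimal("4.00")),
    "all": (Decimal("6.50"), Decimal("5.00")),
}
TIER_LIMIT = 1_000_000  # pages per month charged at the first rate


def monthly_cost(feature, pages):
    """Estimated monthly cost in USD for one feature, assuming marginal tiers."""
    first_rate, over_rate = PRICES[feature]
    first = min(pages, TIER_LIMIT)
    over = max(pages - TIER_LIMIT, 0)
    return (first * first_rate + over * over_rate) / 100


print(monthly_cost("text", 250_000))     # 250,000 pages of text detection: 375 USD
print(monthly_cost("forms", 1_500_000))  # 1.5M pages of key-value detection: 70,000 USD
```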
Amazon Textract
Free Tier
Features                                 Free for first three months
Text Detection                           Up to 1,000 pages or images
Table Detection,
Key-Value Detection,
Text, Table, and Key-Value Detection     Up to 100 pages or images
Textract
Regions
• US East (Ohio)
• US East (N. Virginia)
• US West (Oregon)
• EU (Ireland)
Amazon Textract
Preview
LEARN MORE or SIGN UP
https://pages.awscloud.com/textract-preview.html
Resources
• Amazon Textract Documentation

• Detecting and Analyzing Text in Single-Page Documents
• Detecting and Analyzing Text in Multi-Page Documents
• Extract Key-Value Pairs from a Form Document
• Export Tables into a CSV
Thank you!
