
Extract Text and Data from Any Document with No Prior ML Experience
Kashif Imran
Sr. Solutions Architect, AWS AI

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda

• Document Processing
• Amazon Textract
• Overview
• Amazon Textract APIs
• Demo

• Reference Architectures
• How to Get Started
Documents are important
Primary tool of record keeping, communicating, collaborating, and transacting

• Finance
• Insurance
• Real estate
• Accounting
• Tax management
• Medical
• Legal
• Business management
• Education
• And many more…

16.3M US mortgage applications ($2.1T) in 2016

About 240M W2 tax forms will be processed for FY2018 in the US

Need for processing documents

• Search and discovery
• Compliance and control
• Business process automation

How documents are processed today

• Manual processing
• Optical Character Recognition (OCR)
• Rules and template-based extraction

Challenges for processing documents
Manual processing

• Expensive
• Error-prone
• Time-consuming

Challenges for processing documents
Manual processing

• Variable output
• Inconsistent results
• Needs multiple human reviews for consensus

Output
1. Exempt is true
2. 28 is true
3. CPP/QPP is true
4. RPC/RRQ is true

Challenges for processing documents
Optical Character Recognition (OCR)

• No multi-column detection
• No rotated text detection (not shown)
• No stylized font detection (not shown)

Output
Extract data quickly & No code templates to
accurately maintain

Challenges for processing documents
Optical Character Recognition (OCR)

OCR reads left to right, ignoring table structure

Output
Start Date End Date Employer Name Position Held Reason for leaving
1/15/2009 6/30/2013 Any Company Head Baker Family Relocated

Challenges for processing documents
Rules and template-based extraction

• Limited by accuracy of OCR
• Significant development and management overhead
• Templates are brittle

Challenges for processing documents
Rules and template-based extraction
The well-known W2 US tax form has hundreds of variants each year

It looks easy, but …

…not a single corresponding pixel value in common

Introducing Amazon Textract
Extract text and data from virtually any document

Amazon Textract features

• Text extraction
• Table extraction
• Form extraction

Amazon Textract – Text Extraction

Amazon Textract - Text Extraction

Blocks: PAGE, LINE, WORD

is washed by waves, and cooled

Amazon Textract - Text Extraction API
DetectDocumentText

Request
Name            Description
Document        Blob or Amazon S3 object

Response
Name            Description
Blocks          List of blocks identified from the document
ID              Unique ID of the unit
Relationships   CHILD
Block type      PAGE, LINE, WORD
Pages           Contains number of pages in the document

DetectDocumentText

Request Response
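
To make the PAGE, LINE, WORD hierarchy concrete, here is a minimal sketch of walking a DetectDocumentText response. It assumes the response is a dictionary with a `Blocks` list whose entries carry `Id`, `BlockType`, `Text`, and `Relationships`. The sample response is hand-written for illustration rather than real service output, and `child_ids` and `lines_with_words` are hypothetical helper names.

```python
# Hand-written sample in the shape of a DetectDocumentText response.
# Real responses also carry geometry and a confidence score per block.
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"Id": "page-1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
        {"Id": "line-1", "BlockType": "LINE",
         "Text": "is washed by waves, and cooled",
         "Relationships": [{"Type": "CHILD",
                            "Ids": ["w1", "w2", "w3", "w4", "w5", "w6"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "is"},
        {"Id": "w2", "BlockType": "WORD", "Text": "washed"},
        {"Id": "w3", "BlockType": "WORD", "Text": "by"},
        {"Id": "w4", "BlockType": "WORD", "Text": "waves,"},
        {"Id": "w5", "BlockType": "WORD", "Text": "and"},
        {"Id": "w6", "BlockType": "WORD", "Text": "cooled"},
    ],
}


def child_ids(block):
    """Return the IDs listed under the block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def lines_with_words(response):
    """Return (line text, [word texts]) for every LINE block, in order."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        words = [by_id[i]["Text"] for i in child_ids(block)
                 if by_id[i]["BlockType"] == "WORD"]
        result.append((block["Text"], words))
    return result


for line_text, words in lines_with_words(sample_response):
    print(line_text, words)
```

With boto3, a response of this shape would typically come from the Textract client's `detect_document_text` call, passing either image bytes or an S3 object as `Document`.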

Amazon Textract - Text Extraction simplified

Multi-column detection

Output
Extract data quickly & accurately
No code or templates to maintain

Demo

Amazon Textract – Table Extraction

Amazon Textract—Table Extraction

Blocks: PAGE, TABLE, CELL


For each 'block' you get:
• Text
• Confidence score
• Block relationships (e.g. cells within a table)

Amazon Textract - Table Extraction API
AnalyzeDocument with "TABLES" as the FeatureTypes parameter

Request
Name            Description
Document        Blob or Amazon S3 object
FeatureTypes    TABLES

Response
Name            Description
Blocks          List of blocks identified from the document
ID              Unique ID of the unit
Relationships   CHILD
Block type      PAGE, TABLE, CELL
Pages           Contains number of pages in the document

AnalyzeDocument - Tables

Request Response
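
The TABLE and CELL blocks can be turned back into rows and columns. The sketch below assumes each CELL block carries `RowIndex` and `ColumnIndex` (1-based) and lists its words through a CHILD relationship. The sample blocks are hand-written from the employment-history example in this deck, and `table_rows` plus the small block constructors are hypothetical helpers.

```python
def word(block_id, text):
    """Build a minimal WORD block for the hand-written sample."""
    return {"Id": block_id, "BlockType": "WORD", "Text": text}


def cell(block_id, row, column, word_ids):
    """Build a minimal CELL block for the hand-written sample."""
    return {"Id": block_id, "BlockType": "CELL",
            "RowIndex": row, "ColumnIndex": column,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    cell("c1", 1, 1, ["w1", "w2"]), cell("c2", 1, 2, ["w3", "w4"]),
    cell("c3", 2, 1, ["w5"]), cell("c4", 2, 2, ["w6"]),
    word("w1", "Start"), word("w2", "Date"),
    word("w3", "End"), word("w4", "Date"),
    word("w5", "1/15/2009"), word("w6", "6/30/2013"),
]


def child_ids(block):
    """Return the IDs listed under the block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def table_rows(table, by_id):
    """Rebuild one TABLE block as a list of rows of cell text."""
    cells = {}
    for cell_id in child_ids(table):
        block = by_id[cell_id]
        if block["BlockType"] != "CELL":
            continue
        text = " ".join(by_id[i]["Text"] for i in child_ids(block)
                        if by_id[i]["BlockType"] == "WORD")
        cells[(block["RowIndex"], block["ColumnIndex"])] = text
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    # Cells missing from the response are rendered as empty strings.
    return [[cells.get((r, c), "") for c in range(1, column_count + 1)]
            for r in range(1, row_count + 1)]


by_id = {block["Id"]: block for block in sample_blocks}
rows = table_rows(by_id["t1"], by_id)
print(rows)
```

Because each row is a plain list of strings, the result can be handed to Python's `csv.writer` to export the table.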

Amazon Textract - Table Extraction simplified

Table recognized

Words grouped by cell

Output {
Start Date: 1/15/2009
End Date: 6/30/2013
Employer Name: Any Company
Position Held: Head Baker
Reason for leaving: Family relocated
}

Demo

Amazon Textract – Form Extraction

Amazon Textract—Form Extraction

Blocks: PAGE, KEY_VALUE_SET


For each 'block' of your document:
• Form field name (key) and field value (value) association
• Confidence score
• Page number
• Block relationships

Amazon Textract - Forms Extraction API
AnalyzeDocument with "FORMS" as the FeatureTypes parameter

Request
Name            Description
Document        Blob or Amazon S3 object
FeatureTypes    FORMS

Response
Name            Description
Blocks          List of blocks identified from the document
ID              Unique ID of the unit
Relationships   KEY, VALUE, CHILD
Block type      PAGE, KEY_VALUE_SET
Pages           Contains number of pages in the document

AnalyzeDocument - Forms

Request Response
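
A form response links each key to its value through relationships. The sketch below assumes KEY_VALUE_SET blocks carry an `EntityTypes` list containing `KEY` or `VALUE`, that a key block points to its value block through a VALUE relationship, and that both list their words through CHILD relationships. The sample blocks are hand-written, and `related_ids`, `text_of`, and `form_fields` are hypothetical helper names.

```python
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                           {"Type": "CHILD", "Ids": ["w3"]}]},
        # A value block with no words: the field was left empty on the form.
        {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
        {"Id": "w1", "BlockType": "WORD", "Text": "First:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "John"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Middle:"},
    ]
}


def related_ids(block, relation_type):
    """Return the IDs listed under relationships of the given type."""
    ids = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] == relation_type:
            ids.extend(relationship["Ids"])
    return ids


def text_of(block, by_id):
    """Join the block's child words, or return None if it has none."""
    words = [by_id[i]["Text"] for i in related_ids(block, "CHILD")
             if by_id[i]["BlockType"] == "WORD"]
    return " ".join(words) if words else None


def form_fields(response):
    """Map each form key's text to its value text (None when empty)."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value = None
        for value_id in related_ids(block, "VALUE"):
            value = text_of(by_id[value_id], by_id)
        fields[text_of(block, by_id)] = value
    return fields


print(form_fields(sample_response))
```

Selection elements such as checkboxes are not handled in this sketch; only word text is collected.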

Amazon Textract - Form Extraction simplified

• Logical groupings captured
• Relationships captured
• Glyphs captured

Output
Full Name:
First: John
Middle: X
Last: Doe

Date of Birth:
MM: 01
DD: 01
YYYY: 1971

Gender:
Male: True
Female: False

Demo

Amazon Textract
Sync and async

• Sync: supports single-page documents such as images (e.g., mobile capture)
• Async: for multi-page documents, up to 3,000 pages

Asynchronous APIs
• StartDocumentTextDetection
• StartDocumentAnalysis

Request Response

Asynchronous APIs
• GetDocumentTextDetection
• GetDocumentAnalysis

Request

Response
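
The asynchronous flow (start a job, then fetch results by job ID) can be sketched as a polling loop. This is an illustration under assumptions: `get_results` stands in for a client call such as boto3's `get_document_text_detection` or `get_document_analysis`, the response is assumed to carry `JobStatus`, `Blocks`, and an optional `NextToken`, and the fake client is hypothetical so the example runs without AWS access.

```python
import time


def collect_blocks(get_results, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Wait for the job to finish, then gather blocks from every result page."""
    while True:
        page = get_results(JobId=job_id)
        status = page["JobStatus"]
        if status != "IN_PROGRESS":
            break
        sleep(poll_seconds)
    if status != "SUCCEEDED":
        raise RuntimeError(f"job {job_id} ended with status {status}")
    blocks = list(page.get("Blocks", []))
    # Results are paginated: follow NextToken until it is absent.
    while page.get("NextToken"):
        page = get_results(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks


def make_fake_get_results():
    """Hypothetical stand-in: first call is IN_PROGRESS, then two result pages."""
    state = {"calls": 0}

    def get_results(JobId, NextToken=None):
        state["calls"] += 1
        if state["calls"] == 1:
            return {"JobStatus": "IN_PROGRESS"}
        if NextToken is None:
            return {"JobStatus": "SUCCEEDED",
                    "Blocks": [{"Id": "a"}], "NextToken": "t1"}
        return {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "b"}]}

    return get_results


blocks = collect_blocks(make_fake_get_results(), "job-123",
                        sleep=lambda seconds: None)
print([block["Id"] for block in blocks])
```

Polling keeps the sketch short. When a notification channel is configured on the start call, waiting for the SNS completion message avoids repeated status requests.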
API Reference – AWS CLI

• DetectDocumentText
aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"simple_text_document.jpg"}}'

• AnalyzeDocument - Forms
aws textract analyze-document --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"employmentapp.png"}}' --feature-types "FORMS"

• AnalyzeDocument - Tables
aws textract analyze-document --document '{"S3Object":{"Bucket":"textract-demo-images","Name":"DenseTextwithTable.png"}}' --feature-types "TABLES"

API Reference – AWS CLI

• StartDocumentTextDetection
aws textract start-document-text-detection \
  --document-location '{"S3Object":{"Bucket":"textract-demo-images","Name":"AfterVisitSummaryExample.pdf"}}' \
  --notification-channel '{"SNSTopicArn":"arn:aws:sns:us-east-1:<aws-account-id>:SNSDemoTest","RoleArn":"arn:aws:iam::<aws-account-id>:role/SNSFullAccess"}'

• GetDocumentTextDetection
aws textract get-document-text-detection --max-results 5 --job-id <job-id>
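
The same requests can be expressed as keyword arguments for an SDK call. The sketch below only builds the parameter dictionaries that mirror the CLI JSON above; it makes no AWS call. The helper names are hypothetical, and the feature-type check is limited to the two values named in this deck.

```python
import json


def s3_object(bucket, name):
    """Describe a document stored in Amazon S3."""
    return {"S3Object": {"Bucket": bucket, "Name": name}}


def analyze_document_params(bucket, name, feature_types):
    """Parameters for a synchronous AnalyzeDocument request."""
    allowed = {"TABLES", "FORMS"}  # the feature types covered in this deck
    unknown = set(feature_types) - allowed
    if unknown:
        raise ValueError(f"unsupported feature types: {sorted(unknown)}")
    return {"Document": s3_object(bucket, name),
            "FeatureTypes": list(feature_types)}


def start_text_detection_params(bucket, name, sns_topic_arn, role_arn):
    """Parameters for an asynchronous StartDocumentTextDetection request."""
    return {
        "DocumentLocation": s3_object(bucket, name),
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn,
                                "RoleArn": role_arn},
    }


forms_params = analyze_document_params(
    "textract-demo-images", "employmentapp.png", ["FORMS"])
print(json.dumps(forms_params, indent=2))

# With boto3 installed and credentials configured, the dictionary would be
# passed as keyword arguments, for example:
#   boto3.client("textract").analyze_document(**forms_params)
```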

Amazon Textract
Under the hood

Text Extraction: OCR reimagined

Orientation

Text Extraction: OCR reimagined

Structure variability

Text Extraction: OCR reimagined

Document variability

Beyond OCR: Segmentation and rectification

Photometric

Beyond OCR: Segmentation and rectification

Geometric

Beyond OCR: Table and cell detection

Understand document structure and context to find tables

Beyond OCR: Table and cell detection

Understand cells even without explicit boundaries

Beyond OCR: Table and cell detection

Variable-sized rows and columns

Beyond OCR: Field name (key) and value extraction

Detect phrases or groups of words

Output
Full Name:
First: John
Middle: X
Last: Doe
Date of Birth:
MM: 01
DD: 01
YYYY: 1971
Gender:
Male: True
Female: False

Beyond OCR: Inferring key/value association

Detect structures of the same form without templates

Beyond OCR: Inferring key/value association

Key and value association

Beyond OCR: Inferring key/value association

Infer empty values

Output
Full Name:
First: John
Middle: null
Last: Doe
Date of Birth:
MM: 01
DD: 01
YYYY: 1971
Gender:
Male: True
Female: False

Reference Architectures

Reference architecture—Index and search documents

• Input: Uploaded document images such as tax forms, credit applications, or medical notes
• Amazon S3: Uploaded documents are stored in data lake
• AWS Lambda: A Lambda function is triggered to initiate document analysis using the Amazon Textract API
• Amazon Textract: Automatically extract text, including key-value pairs and tables
• Output: Perform contextual search on millions of documents or integrate data into your document management system
• Amazon Elasticsearch Service: Extracted data and confidence scores are indexed to enable document search
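
The Lambda step in this architecture can be sketched as follows. This is a minimal illustration assuming the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys). The bucket name, object key, and search-document field names are hypothetical, and the Textract and Elasticsearch calls are left as a comment so the example runs offline.

```python
from urllib.parse import unquote_plus


def documents_from_s3_event(event):
    """Turn an S3 event into Textract-style S3Object document locations."""
    documents = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        documents.append({"S3Object": {
            "Bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "Name": unquote_plus(s3["object"]["key"]),
        }})
    return documents


def to_search_document(name, blocks):
    """Keep line text and confidence scores for indexing (field names assumed)."""
    return {
        "document": name,
        "lines": [{"text": b["Text"], "confidence": b["Confidence"]}
                  for b in blocks if b["BlockType"] == "LINE"],
    }


def lambda_handler(event, context):
    """Entry point: list the uploaded documents that need analysis."""
    documents = documents_from_s3_event(event)
    # A real function would start a Textract job for each document here and
    # index the extracted blocks into the search service afterwards.
    return {"documents": documents}


sample_event = {"Records": [{"s3": {
    "bucket": {"name": "my-data-lake"},
    "object": {"key": "tax+forms/w2%282018%29.png"},
}}]}
print(lambda_handler(sample_event, None))
```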

Reference architecture—Extract for NLP

• Input: Uploaded document images of medical notes, explanation of benefits, and patient forms
• Amazon S3: Uploaded documents are stored in S3
• Amazon Textract: Automatically extract words and lines of text, and tables
• NLP: Use natural language processing to extract insights from medical documents
• Amazon Elasticsearch Service: Easily search through extracted data and text insights
• Output: Discover medical insights to improve patient care

Quickly turn extracted text/data into actionable insights

Reference architecture—Form capture

• Input: Customer uses mobile app to capture a photo of an employment application form
• Amazon Textract: The Amazon Textract API is integrated into the end-user application to automatically extract text from the form and auto-populate the form fields
• Customer Application: Customers experience real-time capture of their information by taking a photo instead of manual data entry
• Database: User-submitted data is loaded into a database

Amazon Textract
Launch customers

Amazon Textract
Benefits

Amazon Textract
Pricing

Per 100 pages or images processed               Up to 1M pages/month   1M+ pages/month
Text Detection                                  $0.15                  $0.06
Table Extraction (Text Detection included)      $1.50                  $1.00
Key-Value Detection (Text Detection included)   $5.00                  $4.00
All                                             $6.50                  $5.00
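
As a worked example of the pricing table, the sketch below estimates a monthly bill. The rates are copied from the table above. Two points are assumptions not stated on the slide: the lower rate applies only to pages beyond the first 1M in a month, and the free tier is ignored. The feature keys and `monthly_cost` are hypothetical names.

```python
from decimal import Decimal

# USD per 100 pages: (rate up to 1M pages/month, rate beyond 1M pages/month).
RATES = {
    "text": (Decimal("0.15"), Decimal("0.06")),
    "tables": (Decimal("1.50"), Decimal("1.00")),
    "forms": (Decimal("5.00"), Decimal("4.00")),
    "all": (Decimal("6.50"), Decimal("5.00")),
}
TIER_LIMIT = 1_000_000  # pages per month billed at the first rate


def monthly_cost(feature, pages):
    """Estimate the monthly cost in USD for a number of processed pages."""
    first_rate, next_rate = RATES[feature]
    first_tier_pages = min(pages, TIER_LIMIT)
    next_tier_pages = max(pages - TIER_LIMIT, 0)
    # Assumption: each tier's rate applies only to the pages inside that tier.
    total = first_tier_pages * first_rate + next_tier_pages * next_rate
    return total / 100


print(monthly_cost("text", 250_000))
print(monthly_cost("forms", 1_500_000))
```

Under these assumptions, 250,000 pages of text detection come to $375, and 1.5M pages of key-value detection come to $70,000 (1M pages at $5.00 per 100 plus 500,000 pages at $4.00 per 100).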

Amazon Textract
Free Tier
Features                                  Free for first three months
Text Detection                            Up to 1,000 pages or images
Table Detection;                          Up to 100 pages or images
Key-Value Detection;
Text, Table, and Key-Value Detection

Textract
Regions

• US East (Ohio)
• US East (N. Virginia)
• US West (Oregon)
• EU (Ireland)

Amazon Textract
Preview

LEARN MORE or SIGN UP
https://pages.awscloud.com/textract-preview.html

Resources

• Amazon Textract Documentation


• Detecting and Analyzing Text in Single-Page Documents
• Detecting and Analyzing Text in Multi-Page Documents
• Extract Key-Value Pairs from a Form Document
• Export Tables into a CSV

Thank you!
