OCR service on AWS | Textract

Amit Singh Rathore
Jun 17, 2019

Recently I was involved in a project that used OCR, and the discussion turned to offering OCR as a service. Around that time I thought of giving Amazon Textract a try. Amazon Textract is a service that automatically extracts text and data from scanned documents. So I picked up one of the handwritten slides on Aurora Serverless and tried it out. To be honest, the results were impressive.

Image I used to test the service:

Output from Textract:

It's not perfect, but who is!! This was pretty easy using the console: simply upload the document and run the analysis. But I was more interested in doing it in a serverless way, so I used Lambda to call the Textract API and S3 to store the document. I wrote the Python code below for the Lambda.
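The original snippet is not reproduced here, so what follows is only a rough sketch of what such a handler might look like, not the exact code from this project. It assumes the Lambda is triggered by an S3 upload event and uses Textract's synchronous `detect_document_text` call. The helper name `extract_lines` is made up for illustration.

```python
import json
import urllib.parse


def extract_lines(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def lambda_handler(event, context):
    # boto3 is imported inside the handler so extract_lines can be
    # exercised locally without AWS dependencies.
    import boto3

    # Assumption: the function is invoked by an S3 "object created" event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Ask Textract to read the document directly from S3.
    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    return {"statusCode": 200, "body": json.dumps(extract_lines(response))}
```

The function's execution role would also need permission to read the object from S3 and to call Textract.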

But I was not able to run it with the default boto3 that ships with Lambda, since that version did not yet support Textract. To overcome this, I created a Lambda layer with the latest boto3 and attached it to the function. Here is the output of the Lambda.
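As a side note on the boto3 issue above: one quick way to confirm whether the boto3 in a given runtime knows about Textract is to ask the session for its list of services. This is a small sketch, and the helper name `has_textract` is hypothetical. For the layer itself, Python packages need to sit under a top-level `python/` directory inside the layer zip so that Lambda adds them to the import path.

```python
def has_textract(session=None):
    """Return True if the given boto3 session knows the Textract service."""
    if session is None:
        # Imported lazily so the check can also be run against a stub session.
        import boto3

        session = boto3.session.Session()
    return "textract" in session.get_available_services()
```

Logging `has_textract()` once at the top of the handler shows whether the layer's boto3 or the bundled one is being picked up.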

Hope it was helpful.

Written by Amit Singh Rathore

Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML
