
OCR service on AWS | Textract

Recently I worked on a project that used OCR, and the discussion turned to offering OCR as a service. That made me want to give Amazon Textract a try. Amazon Textract is a service that automatically extracts text and data from scanned documents. So I picked one of the handwritten slides on Aurora Serverless and tried it out. To be honest, the results were impressive.

Image I used to test the service:

Image for post

Output from Textract:

Image for post

It's not perfect, but who is! This was pretty easy using the console: simply upload the document and run the analysis. But I was more interested in doing it in a serverless way, so I used Lambda to call the Textract API and S3 to store the document. I wrote the Lambda function in Python.
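A minimal sketch of what such a handler could look like is given here. It assumes the function is triggered by an S3 upload event and that plain text detection (`detect_document_text`) is enough; the helper name `extract_lines` and the shape of the return value are illustrative choices, not something Textract requires.

```python
import json
import urllib.parse


def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def lambda_handler(event, context):
    # boto3 is imported here so extract_lines can be tried out without AWS access.
    import boto3

    # Assumption: the function is invoked by an S3 "object created" event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    textract = boto3.client("textract")
    # Synchronous text detection on a document that already sits in S3.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    lines = extract_lines(response)
    return {"statusCode": 200, "body": json.dumps({"lines": lines})}
```

Textract returns the detected text as a flat list of blocks (pages, lines, words), so keeping only the `LINE` blocks gives a readable transcript of the slide. The function's execution role also needs permission to call Textract and to read the object from the bucket.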

But I was not able to run it with the default boto3 that comes with Lambda, since that version did not support Textract. To overcome this, I created a Lambda layer with the latest boto3 and attached it to the function. Here is the output of the Lambda function.

Image for post
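If you run into the same boto3 problem, one quick way to check which boto3 a function actually loads, and whether it knows about Textract, is a small diagnostic handler like the sketch below. The helper name `describe_sdk` and the returned keys are made up for illustration.

```python
def describe_sdk(version, available_services):
    """Summarise which boto3 version is loaded and whether it knows Textract."""
    return {
        "boto3_version": version,
        "textract_supported": "textract" in available_services,
    }


def lambda_handler(event, context):
    # Imported inside the handler so describe_sdk can be tried without boto3 installed.
    import boto3

    # get_available_services lists the service names this boto3 build can create clients for.
    services = boto3.Session().get_available_services()
    return describe_sdk(boto3.__version__, services)
```

Running this once without the layer and once with it should show the version change and `textract_supported` turning true.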

Hope it was helpful.

