Long-Running Jobs in Step Functions with Systems Manager and Lambda

Recently I needed to come up with a serverless approach to run push-based jobs on EC2 that take 30+ minutes to finish. I decided to use Systems Manager (SSM), as it lets me run commands or scripts without even logging in to the box. Since the solution was going to be used by different teams and orchestration was also needed, I settled on Step Functions, which uses Lambda to make the calls to SSM. Below is the architecture:

[Image: architecture diagram]

Let's go through the design one component at a time, starting with the Step Functions state machine. Here is an overview of it:

[Image: state machine overview]

Definition of the state machine:

[Image: state machine definition]
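The definition above was posted as a screenshot. As a rough sketch of what a start, wait, check, and branch loop can look like in Amazon States Language, here is one possible shape, built as a Python dict and printed as JSON. The state names, the `$.command` result path, the 60-second wait, and the placeholder ARNs are all assumptions for illustration, not the original definition.

```python
import json


def build_state_machine(caller_arn, status_arn, wait_seconds=60):
    """Return an Amazon States Language definition as a dict.

    State names and paths are illustrative assumptions.
    """
    status_path = "$.command.Status"
    # SSM statuses that mean "not finished yet, poll again".
    still_running = ["Pending", "InProgress", "Delayed", "Cancelling"]
    return {
        "Comment": "Run a long SSM command and poll until it finishes",
        "StartAt": "StartCommand",
        "States": {
            "StartCommand": {
                "Type": "Task",
                "Resource": caller_arn,      # Caller Lambda
                "ResultPath": "$.command",   # keep the original input too
                "Next": "Wait",
            },
            "Wait": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "CheckStatus",
            },
            "CheckStatus": {
                "Type": "Task",
                "Resource": status_arn,      # status Lambda
                "InputPath": "$.command",
                "ResultPath": "$.command",
                "Next": "IsFinished",
            },
            "IsFinished": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": status_path,
                        "StringEquals": "Success",
                        "Next": "Succeeded",
                    },
                    {
                        "Or": [
                            {"Variable": status_path, "StringEquals": s}
                            for s in still_running
                        ],
                        "Next": "Wait",
                    },
                ],
                # Anything else (Failed, TimedOut, Cancelled) ends in failure.
                "Default": "Failed",
            },
            "Succeeded": {"Type": "Succeed"},
            "Failed": {
                "Type": "Fail",
                "Error": "CommandFailed",
                "Cause": "The SSM command did not finish successfully",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace with the ARNs of your own Lambda functions.
    definition = build_state_machine("CALLER_LAMBDA_ARN", "STATUS_LAMBDA_ARN")
    print(json.dumps(definition, indent=2))
```

Because the waiting happens in a Wait state rather than inside a Lambda, no function has to stay alive for the 30+ minutes the job takes.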

Code for the Caller Lambda (which starts the SSM command):

[Image: Caller Lambda code]
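Since the code above was posted as a screenshot, here is a minimal sketch of a Caller Lambda using boto3's `send_command` with the `AWS-RunShellScript` document. The event keys (`InstanceId`, `Commands`, `ExecutionTimeout`) and the optional `ssm` argument used for testing are assumptions of this sketch, not the original code.

```python
def _ssm_client():
    # Imported lazily so the module can be loaded (and unit tested)
    # on machines where boto3 is not installed.
    import boto3
    return boto3.client("ssm")


def lambda_handler(event, context, ssm=None):
    """Start a shell command on one EC2 instance through SSM.

    Assumed event shape (hypothetical):
    {"InstanceId": "i-...", "Commands": ["..."], "ExecutionTimeout": "7200"}
    """
    ssm = ssm or _ssm_client()
    response = ssm.send_command(
        InstanceIds=[event["InstanceId"]],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": event["Commands"],
            # The document's default execution timeout is 3600 seconds;
            # pass a larger value for jobs that run longer than an hour.
            "executionTimeout": [str(event.get("ExecutionTimeout", "3600"))],
        },
        Comment="Started from Step Functions",
    )
    command = response["Command"]
    # Return only plain strings: the full response contains datetime
    # objects, which a Lambda cannot serialize to JSON as-is.
    return {
        "CommandId": command["CommandId"],
        "InstanceId": event["InstanceId"],
        "Status": command["Status"],
    }
```

The returned `CommandId` and `InstanceId` are what a later invocation needs to look up the command's status.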

A variation of the same Lambda that runs the document from S3:

[Image: Lambda code for the S3 variation]
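The screenshot of this variation is not reproduced here, so the following is only one plausible reading: the script lives in S3 and is run through the `AWS-RunRemoteScript` document, which downloads it and executes a command line. The event keys (`InstanceId`, `ScriptUrl`, `CommandLine`) are hypothetical names chosen for this sketch.

```python
import json


def _ssm_client():
    # Imported lazily so the module can be loaded where boto3 is absent.
    import boto3
    return boto3.client("ssm")


def lambda_handler(event, context, ssm=None):
    """Run a script stored in S3 on one EC2 instance through SSM.

    Assumed event shape (hypothetical):
    {"InstanceId": "i-...",
     "ScriptUrl": "https://s3.amazonaws.com/<bucket>/<key>",
     "CommandLine": "job.sh arg1"}
    """
    ssm = ssm or _ssm_client()
    response = ssm.send_command(
        InstanceIds=[event["InstanceId"]],
        DocumentName="AWS-RunRemoteScript",
        Parameters={
            "sourceType": ["S3"],
            # sourceInfo is a JSON string, not a nested dict.
            "sourceInfo": [json.dumps({"path": event["ScriptUrl"]})],
            "commandLine": [event["CommandLine"]],
        },
    )
    command = response["Command"]
    return {
        "CommandId": command["CommandId"],
        "InstanceId": event["InstanceId"],
        "Status": command["Status"],
    }
```

For this to work, the instance profile of the EC2 instance needs read access to the S3 object, because the download happens on the instance and not in the Lambda.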

When SSM receives the command, it returns a command ID and a status in the response. But since the command keeps running for more than 30 minutes, a second Lambda invocation is needed later to get its status. To get the status of the command from SSM, I wrote a second Lambda.
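A minimal sketch of such a status Lambda, assuming it receives the `CommandId` and `InstanceId` returned when the command was started (the event keys and the optional `ssm` test argument are assumptions, not the original code):

```python
def _ssm_client():
    # Imported lazily so the module can be loaded where boto3 is absent.
    import boto3
    return boto3.client("ssm")


def lambda_handler(event, context, ssm=None):
    """Look up the current status of an SSM command on one instance.

    Assumed event shape (hypothetical):
    {"CommandId": "...", "InstanceId": "i-..."}
    """
    ssm = ssm or _ssm_client()
    invocation = ssm.get_command_invocation(
        CommandId=event["CommandId"],
        InstanceId=event["InstanceId"],
    )
    # Pass the identifiers through so the next poll has what it needs.
    return {
        "CommandId": event["CommandId"],
        "InstanceId": event["InstanceId"],
        "Status": invocation["Status"],
        "StatusDetails": invocation.get("StatusDetails", ""),
    }
```

Each call returns immediately, so the Lambda stays well inside its execution time limit no matter how long the job itself runs.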

I use the status reference below to decide whether to poll again:

[Image: SSM command status reference]
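The status table itself was an image. As a sketch of the polling decision, the function below groups the status values documented for SSM command invocations into "keep waiting", "done", and "failed". The grouping and the `next_action` name are my assumptions and may differ from the original table.

```python
# Status values returned by SSM for a command invocation.
STILL_RUNNING = {"Pending", "InProgress", "Delayed", "Cancelling"}
SUCCEEDED = {"Success"}
FAILED = {"Cancelled", "TimedOut", "Failed"}


def next_action(status):
    """Map an SSM invocation status to what the poller should do next."""
    if status in STILL_RUNNING:
        return "wait"      # not finished yet: wait and poll again
    if status in SUCCEEDED:
        return "succeed"   # finished cleanly: stop polling
    # Known failure states and anything unexpected both stop the loop,
    # so an unknown status can never cause endless polling.
    return "fail"


if __name__ == "__main__":
    for status in ("Pending", "InProgress", "Success", "TimedOut"):
        print(status, "->", next_action(status))
```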

Execution and results:

[Images: execution and results]

Hope this was informative and helpful.
