Long-Running Jobs in Step Functions with Systems Manager and Lambda
Recently I needed a serverless approach for running push-based jobs on EC2 that take 30+ minutes to finish. I decided to implement it with Systems Manager (SSM), since it lets me run commands or scripts without even logging in to the box. Because the solution was going to be used by different teams and orchestration was also needed, I settled on Step Functions, which uses Lambda to make the calls to SSM. Below is the architecture:
Let's go through the design one component at a time. First is the Step Function. Here is an overview of the state machine.
Definition of the state machine:
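As a rough illustration of the polling loop described here, the definition can be sketched as a Python dictionary that serializes to Amazon States Language JSON. The state names, the `$.status` field, the 60-second wait, and the Lambda ARNs are all assumptions made for this sketch, not values taken from the actual deployment.

```python
import json

# Hypothetical Lambda ARNs; substitute the ARNs of your own functions.
CALLER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ssm-caller"
POLLER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ssm-poller"

# Statuses that mean the command has not finished yet (assumed grouping).
IN_FLIGHT_STATUSES = ["Pending", "InProgress", "Delayed", "Cancelling"]


def build_definition(caller_arn, poller_arn, wait_seconds=60):
    """Build a start -> wait -> poll -> choice loop as an ASL-shaped dict."""
    choices = [
        {"Variable": "$.status", "StringEquals": "Success", "Next": "CommandSucceeded"}
    ]
    # Any in-flight status loops back to the wait state for another poll.
    for status in IN_FLIGHT_STATUSES:
        choices.append(
            {"Variable": "$.status", "StringEquals": status, "Next": "WaitForCommand"}
        )
    return {
        "Comment": "Run a long SSM command and poll until it finishes",
        "StartAt": "StartCommand",
        "States": {
            "StartCommand": {"Type": "Task", "Resource": caller_arn, "Next": "WaitForCommand"},
            "WaitForCommand": {"Type": "Wait", "Seconds": wait_seconds, "Next": "GetCommandStatus"},
            "GetCommandStatus": {"Type": "Task", "Resource": poller_arn, "Next": "CheckStatus"},
            # Anything that is neither Success nor in flight is treated as a failure.
            "CheckStatus": {"Type": "Choice", "Choices": choices, "Default": "CommandFailed"},
            "CommandSucceeded": {"Type": "Succeed"},
            "CommandFailed": {
                "Type": "Fail",
                "Error": "CommandFailed",
                "Cause": "SSM command did not finish successfully",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(CALLER_ARN, POLLER_ARN), indent=2))
```

The printed JSON can be pasted into the state machine editor. Because no execution is kept busy while the Wait state is active, the loop can run well past the 15-minute Lambda limit.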
Code for the caller Lambda (which starts the SSM command):
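A minimal sketch of such a caller, assuming the job is a list of shell commands run through the `AWS-RunShellScript` document. The event keys (`instance_id`, `commands`) and the returned field names are hypothetical and only need to match whatever the state machine passes along.

```python
def start_command(ssm, instance_id, commands, execution_timeout=3600):
    """Send shell commands to one instance and return what the poller needs."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": commands,
            # Document parameters are lists of strings; value is in seconds.
            "executionTimeout": [str(execution_timeout)],
        },
    )
    command = response["Command"]
    return {
        "command_id": command["CommandId"],
        "instance_id": instance_id,
        "status": command["Status"],
    }


def lambda_handler(event, context):
    # boto3 is imported here so start_command can be tested with a stub client.
    import boto3

    ssm = boto3.client("ssm")
    return start_command(ssm, event["instance_id"], event["commands"])
```

The handler returns right after SSM accepts the command, so the Lambda itself finishes in well under a second regardless of how long the job runs.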
A variation of the same Lambda to run the document from S3:
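One way this variation might look, assuming "from S3" means a script stored in a bucket and executed through the `AWS-RunRemoteScript` document. The event keys (`instance_id`, `s3_url`, `command_line`) are hypothetical names chosen for this sketch.

```python
import json


def start_s3_script(ssm, instance_id, s3_url, command_line):
    """Download a script from S3 onto the instance and run it via SSM."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunRemoteScript",
        Parameters={
            "sourceType": ["S3"],
            # sourceInfo is a JSON string holding the object's URL.
            "sourceInfo": [json.dumps({"path": s3_url})],
            # Command executed on the instance after the download.
            "commandLine": [command_line],
        },
    )
    command = response["Command"]
    return {
        "command_id": command["CommandId"],
        "instance_id": instance_id,
        "status": command["Status"],
    }


def lambda_handler(event, context):
    # Imported lazily so start_s3_script can be exercised with a stub client.
    import boto3

    ssm = boto3.client("ssm")
    return start_s3_script(
        ssm, event["instance_id"], event["s3_url"], event["command_line"]
    )
```

The instance profile needs read access to the bucket for the download to succeed.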
When SSM receives the command, it returns a command ID and a status in the response. But since the command keeps running for more than 30 minutes, a second Lambda invocation is needed to get the status. I wrote the following Lambda to fetch the command status from SSM:
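A minimal sketch of a status-polling Lambda, assuming the event carries the `command_id` and `instance_id` fields returned by the caller (both field names are hypothetical). It passes the same shape back so the state machine can loop on it.

```python
def get_status(ssm, command_id, instance_id):
    """Look up the current status of one command invocation on one instance."""
    result = ssm.get_command_invocation(
        CommandId=command_id,
        InstanceId=instance_id,
    )
    return {
        "command_id": command_id,
        "instance_id": instance_id,
        # e.g. Pending, InProgress, Success, Failed, TimedOut
        "status": result["Status"],
    }


def lambda_handler(event, context):
    # Imported lazily so get_status can be exercised with a stub client.
    import boto3

    ssm = boto3.client("ssm")
    return get_status(ssm, event["command_id"], event["instance_id"])
```

Polling immediately after `send_command` can fail because the invocation may not be registered yet, so it helps to have a wait step between the two Lambdas.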
I use the following status reference to decide on the next poll:
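The polling decision can be sketched as a small lookup over the status values that SSM reports for a command invocation. The grouping below is an assumption made for illustration (in particular, treating `Cancelling` as still in flight), and the action names `wait`, `succeed`, and `fail` are hypothetical.

```python
# Statuses reported by SSM for a command invocation, grouped by what the
# poller should do next. The grouping is an assumption of this sketch.
IN_FLIGHT = frozenset({"Pending", "InProgress", "Delayed", "Cancelling"})
SUCCEEDED = frozenset({"Success"})
FAILED = frozenset({"Cancelled", "TimedOut", "Failed"})


def next_action(status):
    """Map an SSM status to 'wait', 'succeed', or 'fail'."""
    if status in IN_FLIGHT:
        return "wait"
    if status in SUCCEEDED:
        return "succeed"
    # Known failure statuses and anything unrecognized both stop the loop.
    return "fail"


if __name__ == "__main__":
    for s in sorted(IN_FLIGHT | SUCCEEDED | FAILED):
        print(f"{s:<12} -> {next_action(s)}")
```

Treating unrecognized statuses as failures keeps the state machine from polling forever if SSM ever returns a value the sketch does not know about.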
Execution and result time.
Hope this was informative and helpful.