On one of the projects I worked on, I ran into a lot of issues with AWS limits. They fall into three main kinds.
Soft Limits: These can be increased by raising a support ticket with AWS. E.g. the limit for load balancers in a region is 20. You simply raise a limit increase request in the Support Center by filling out this form.
After it has been processed, you will receive a satisfying email like the one below.
Hard Limits: These can’t be increased. E.g. there is a hard limit of 60 parameters in a CloudFormation template.
Gray Area: Unlike the previous two kinds, these limits are not documented by AWS. They mostly relate to API throttling (not API Gateway; these are service APIs like describeStack and describeInstance). The maximum API request rates vary across regions. Each API call type (Describe, Create, Delete, Register, etc.) has a different throttling limit, and these rate limits are adjusted dynamically every second in every region. If you contact AWS to find out the limits or baseline, you may receive a response with the actual numbers (if you are lucky enough). In most cases you will get a response like this.
We throttle API request for each account on a per-Region basis to help the performance of the service. However, the exact API limits for the APIs are internal to AWS and thus, I will not be able to let you know of the exact limit for you specifically.
In a few cases you might be offered a paid option, as with SSM parameters.
So how do we fix these?
Well, by implementing proper back-off and retry logic in the client making the call. It is really helpful, and it will work in most cases. I have observed a 4–5x performance improvement when a proper strategy is implemented. I also observed a slight difference in performance between SDKs (Java vs Python), which is due to their different default back-off strategies. In Java we can set our own back-off and retry behaviour with the following settings:
withRetryPolicy → (backoffStrategy, RetryCondition, maxErrorRetry)
In Boto3 we have the concept of a Config:
Config( retries = dict( max_attempts = NUMBER ) )
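To make the back-off and retry idea concrete, here is a minimal sketch of exponential back-off with jitter, written without any SDK so it runs on its own. ThrottledError, backoff_delay and call_with_retries are hypothetical names made up for illustration, and the base delay, cap and attempt count are assumed values, not the defaults of the Java SDK or Boto3.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an SDK throttling error (e.g. Request Limit Exceeded)."""


def backoff_delay(attempt, base=0.5, cap=20.0):
    """Upper bound of the wait before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(func, max_attempts=5, base=0.5, cap=20.0, sleep=time.sleep):
    """Call func(); on ThrottledError wait a jittered, exponentially growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: a random wait in [0, delay] spreads parallel callers apart
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
```

Wrapping a throttled call would then look like call_with_retries(lambda: describe_something()), where describe_something is whatever client call you are making. In practice the SDK does this for you once the retry settings are in place; the sketch only shows the mechanism. The random jitter matters when many Lambdas are throttled at the same moment, since without it they would all retry at the same moment too.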
To test this, I created a setup with one state machine having 6 Lambdas, all calling the describeInstanceAttribute API on 4 instances. Initially I was able to run ~120 concurrent executions successfully, which means around 2,800 API calls (120 × 6 × 4 = 2,880). The job completed in around 320 seconds. When I increased the number of parallel executions, it started to give Request Limit Exceeded errors. I then modified the SDK MaxRetryAttempts and BackOffStrategy, and also added retry logic at the Step Functions level with BackoffRate set to 2. With the new settings I was able to run 400 concurrent executions, which comes to almost 10k API calls (400 × 6 × 4 = 9,600). The job finished in around 580 seconds. Introducing a one-second gap between the six Lambda invocations could further increase the number of executions that can run in parallel.
Also, you need to be cautious with a few services like S3 and DynamoDB, where the settings are a little bit different.
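For the Step Functions part, the retry logic is a Retry block on the Task state. The sketch below builds such a block as a Python dict and prints the wait schedule it implies. Only BackoffRate = 2 comes from the test described above; the ErrorEquals, IntervalSeconds and MaxAttempts values are assumptions for illustration, and retry_waits is a hypothetical helper.

```python
import json

# Retry block for a Task state in Amazon States Language.
# Only BackoffRate = 2 is from the test above; the other values are assumed.
retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 1,
    "MaxAttempts": 5,
    "BackoffRate": 2.0,
}


def retry_waits(policy):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate**n."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** n
        for n in range(policy["MaxAttempts"])
    ]


print(json.dumps({"Retry": [retry_policy]}, indent=2))
print(retry_waits(retry_policy))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With a rate of 2 the wait doubles on every retry, so a throttled Lambda backs off quickly instead of hammering the API again straight away.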