SRE Checklist for AWS Lambda
A collection of best practices to verify for AWS Lambda.
Ensure that the Lambda function is provisioned with adequate resource configuration
Each Lambda function has three main resource settings:
- Memory
- Ephemeral storage
- Timeout
Setting these values either too high or too low can cause issues.
- Over-provisioned memory increases cost and can mask memory leaks and inefficient code paths.
- An overly long timeout lets inefficient code run longer, which in turn increases cost.
- Memory or timeout values that are too low can cause the function to fail for lack of resources.
Solution
- Ensure that each deployed Lambda function has been tuned based on AWS Lambda Power Tuning results.
- AWS Lambda Power Tuning Reference | GitHub repo
- Adjust function timeouts to actual needs by monitoring runtime. AWS CloudWatch Logs shows the average runtime; set the timeout above this average, with enough headroom for the slowest legitimate invocations.
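The timeout adjustment can be sketched as a small calculation over observed durations. This is a minimal sketch, assuming the durations have already been exported from CloudWatch Logs. The function name `suggest_timeout_seconds`, the 1.5x headroom factor, and the choice to base the value on the slowest observed run are illustrative assumptions, not part of any AWS API.

```python
import math
import statistics


def suggest_timeout_seconds(durations_ms, headroom=1.5, lambda_max_seconds=900):
    """Suggest a timeout in whole seconds from observed durations (milliseconds).

    headroom is an assumed safety factor; 900 s is the Lambda maximum timeout.
    """
    if not durations_ms:
        raise ValueError("need at least one observed duration")
    average_ms = statistics.mean(durations_ms)
    slowest_ms = max(durations_ms)
    # Base the timeout on the slowest observed run so that normal
    # invocations are not cut off, then round up to whole seconds.
    timeout = math.ceil(slowest_ms * headroom / 1000)
    timeout = min(max(timeout, 1), lambda_max_seconds)
    return {
        "average_ms": average_ms,
        "slowest_ms": slowest_ms,
        "timeout_seconds": timeout,
    }


if __name__ == "__main__":
    print(suggest_timeout_seconds([120.0, 180.0, 2400.0]))
```

Comparing `average_ms` with `slowest_ms` also shows whether a few outliers are driving the timeout, which is usually worth investigating before raising the value.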
Ensure optimal concurrency controls
Without concurrency control, Lambda functions can consume excessive resources and risk throttling. Throttling in Lambda happens when requests exceed the set concurrency limit. Uncontrolled concurrency can lead to scaling issues and increased costs during traffic spikes.
Ignoring concurrency settings can also affect other functions, because all functions in the same AWS account and Region share one concurrency pool.
Solution
- Set concurrency limits to avoid resource contention and ensure predictable performance.
- Request an account concurrency quota increase from AWS Support if needed; this is often granted at no extra cost.
- Use reserved concurrency to restrict the number of function instances.
- Monitor concurrency and execution usage metrics in AWS CloudWatch.
- Track the Throttles and ConcurrentExecutions metrics in CloudWatch.
- Use CloudWatch to monitor and alert on throttling occurrences and message buildup in DLQs.
- Regularly adjust concurrency settings based on function performance and load requirements to balance cost and resource availability.
CloudWatch Logs Insights query for throttling:
fields @timestamp, @message
| filter @message like /THROTTLING/
| sort @timestamp desc
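Before applying reserved concurrency, it can help to check that the planned reservations still leave room for every other function in the account. The sketch below assumes an account limit of 1000 concurrent executions and relies on Lambda keeping at least 100 executions unreserved. The function names and numbers are hypothetical; verify the actual quota for your account.

```python
def check_reserved_plan(reserved_by_function, account_limit=1000, min_unreserved=100):
    """Check that planned reserved concurrency leaves enough unreserved capacity.

    account_limit is an assumed regional quota; min_unreserved reflects the
    capacity Lambda keeps available for functions without a reservation.
    """
    total_reserved = sum(reserved_by_function.values())
    unreserved = account_limit - total_reserved
    return {
        "total_reserved": total_reserved,
        "unreserved": unreserved,
        "valid": unreserved >= min_unreserved,
    }


if __name__ == "__main__":
    # Hypothetical functions and reservations.
    plan = {"orders-api": 300, "email-sender": 50}
    print(check_reserved_plan(plan))
```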
Accounting for Cold Start
- The first invocation in a new execution environment must create and initialize that environment, which can delay the response, especially for infrequently invoked functions.
- Functions attached to a VPC can see longer cold starts due to ENI provisioning.
- Large deployment packages can also increase cold start duration.
Solution
- Check CloudWatch Logs for initDuration and duration:
fields @timestamp, @initDuration, @duration
| filter @message like /REPORT/
| stats avg(@initDuration) as initLatency, avg(@duration) as runtime by bin(1m)
- Enable provisioned concurrency in the Lambda settings.
- Provisioned concurrency helps reduce cold starts for critical functions. Avoid overusing it, since it is billed while enabled whether or not it serves traffic.
- Pre-warm the Lambda function shortly before expected invocations.
- Use SnapStart for Java functions.
- If the Lambda function does not access any resources inside a VPC, it does not need to run inside a VPC.
- For VPC-bound functions, configure private subnets and optimize security groups by limiting rules.
- Minimize package size by keeping dependencies to essentials only.
- Use Lambda layers for shared code. Moving dependencies into layers reduces the main deployment package size, which can shorten cold starts and function load times.
- Ensure the Lambda function uses only company-published layers.
- For highly latency-sensitive applications, split functions into smaller services to avoid triggering cold starts in non-essential functions.
- Decompose complex functions into microservices by creating smaller, dedicated Lambda functions for each task, focusing provisioned concurrency on latency-sensitive services only to minimize cold start impact.
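To judge how much cold starts matter for a function, the REPORT lines from its logs can also be summarized offline. The sketch below assumes the standard REPORT line layout, in which `Init Duration` appears only on cold starts. The helper name `summarize_cold_starts` is hypothetical.

```python
import re

# "Duration:" that is not part of "Billed Duration:" or "Init Duration:".
DURATION_RE = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def summarize_cold_starts(log_lines):
    """Count invocations and cold starts from Lambda REPORT log lines."""
    durations, inits = [], []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        duration = DURATION_RE.search(line)
        if duration is None:
            continue
        durations.append(float(duration.group(1)))
        init = INIT_RE.search(line)
        if init is not None:  # present only when the environment was initialized
            inits.append(float(init.group(1)))
    invocations = len(durations)
    return {
        "invocations": invocations,
        "cold_starts": len(inits),
        "cold_start_ratio": len(inits) / invocations if invocations else 0.0,
        "avg_init_ms": sum(inits) / len(inits) if inits else 0.0,
    }
```

A high `cold_start_ratio` on a latency-sensitive function suggests provisioned concurrency may be worth its cost; a low ratio suggests it may not be.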
Ensure configuration values are not hard-coded
Hard-coding configuration values in code reduces flexibility and increases maintenance effort. Hard-coded values can also introduce security risks, especially for sensitive information like API keys.
Solution
- Use environment variables to pass configuration values, making it easier to change settings without modifying code.
- If necessary, encrypt sensitive values with AWS Key Management Service (KMS).
- Use AWS Systems Manager Parameter Store for configuration values.
- Use AWS Secrets Manager for secure, centralized secrets/credentials management.
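Reading configuration from environment variables can look like the sketch below. The variable names `TABLE_NAME`, `LOG_LEVEL`, and `REQUEST_TIMEOUT_SECONDS` and their defaults are hypothetical. Secrets should still come from Secrets Manager or Parameter Store rather than plain environment variables.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    table_name: str
    log_level: str
    request_timeout_seconds: float


def load_settings(env=None):
    """Build settings from environment variables (names are illustrative)."""
    env = os.environ if env is None else env
    try:
        table_name = env["TABLE_NAME"]  # required, no sensible default
    except KeyError:
        raise RuntimeError("TABLE_NAME environment variable is required") from None
    return Settings(
        table_name=table_name,
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "3")),
    )
```

Loading the settings once at module import, outside the handler, means warm invocations reuse them, and a missing required value fails fast at initialization instead of mid-request.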
Insufficient Error Handling and Logging
Missing error handling can cause functions to fail silently, making issues harder to diagnose. Failing to log important function activity can complicate debugging. Logs can be scattered if not centralized, leading to less oversight.
Solution
- Send function logs to AWS CloudWatch Log groups.
- Set log retention period for the log group.
- Avoid unnecessary log levels. Only send required logs.
- Regularly review logs for trends or recurring errors, and set up alarms based on error code patterns.
- If needed, enable X-Ray tracing, CodeGuru Profiler, and Lambda Insights.
- Always use try-catch blocks to handle errors, and log concise, machine-readable error messages (codes).
- Use structured logging to capture essential function events and errors.
- Use JSON and include essential data such as request IDs, timestamps, status codes, and error messages to track each event accurately.
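The error handling and structured logging points above can be combined in a handler like the sketch below. `process`, the `order_id` field, and the error codes are hypothetical placeholders; `aws_request_id` is the request ID attribute of the Lambda context object.

```python
import json
from datetime import datetime, timezone


def log_event(level, message, request_id, **fields):
    """Emit one JSON log line; Lambda forwards stdout to CloudWatch Logs."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "request_id": request_id,
        **fields,
    }
    line = json.dumps(record, default=str)
    print(line)
    return line


def process(event):
    """Hypothetical business logic: requires an order_id field."""
    return {"order_id": event["order_id"]}


def handler(event, context):
    request_id = getattr(context, "aws_request_id", "unknown")
    try:
        result = process(event)
        log_event("INFO", "order processed", request_id, status_code=200)
        return {"statusCode": 200, "body": json.dumps(result)}
    except KeyError as exc:
        # Expected client error: log a machine-readable code and return 400.
        log_event("ERROR", "missing field", request_id,
                  status_code=400, error_code="MISSING_FIELD", detail=str(exc))
        return {"statusCode": 400, "body": json.dumps({"error": "MISSING_FIELD"})}
    except Exception as exc:
        # Unexpected error: log it, then re-raise so Lambda records the failure.
        log_event("ERROR", "unhandled error", request_id,
                  status_code=500, error_code="UNHANDLED", detail=repr(exc))
        raise
```

Re-raising unexpected errors keeps them visible in the Errors metric and lets asynchronous retries and DLQs do their job, instead of failing silently.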
Missing Tags
Ensure all Lambda functions have defined tags. Tags streamline chargeback and access control: they can be used in IAM policy conditions and to attribute costs to a cost center.
Ensure the least privilege principle for Lambda
Lambda functions have two kinds of policy attached:
- Execution role: defines what the Lambda function can do.
- Resource policy: defines which actions can be taken on the Lambda function, and by whom.
Overly permissive IAM policies increase vulnerability to data breaches.
- Overly permissive IAM policies increase vulnerability to data breaches.
Solution
- Implement least privilege by defining specific permissions instead of wildcards like “s3:*”.
- Use IAM conditions and resource policies to limit access based on specific criteria.
- Ensure iam:PassRole on the Lambda execution role is limited.
- Ensure only authorized personnel can invoke the function.
- Ensure only authorized personnel can update the Lambda function.
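A simple pre-deployment check can flag wildcard grants in an execution role policy. This is a minimal sketch that looks only at Allow statements with literal wildcards; it is not a substitute for tools such as IAM Access Analyzer. The bucket name and the helper `find_wildcards` are hypothetical.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def find_wildcards(policy):
    """Return findings for Allow statements that use wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action}")
        for resource in _as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource {resource}")
    return findings


# Hypothetical policies: one broad, one scoped to a single action and bucket.
BROAD_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-reports-bucket/*",
    }],
}

if __name__ == "__main__":
    print(find_wildcards(BROAD_POLICY))
    print(find_wildcards(SCOPED_POLICY))
```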
Ensure Function Lifecycle Management
Lambda offers versioning. We should ensure that Lambda functions have versions and aliases. Versions and aliases allow us to update with minimal downtime, and we can even split traffic between versions.
Not using function versioning can lead to difficulties in managing updates and rollbacks.
Solution
- Always enable versioning to differentiate between the development, production, and testing stages.
- Use aliases to simplify version management and enable blue/green or canary deployments.
- Regularly clean up old versions for efficient function management.
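A canary deployment with aliases sends a small share of traffic to the new version while the alias still points at the stable one. The sketch below only builds the request parameters; the function name, alias, and version numbers are hypothetical. The keys follow the Lambda `UpdateAlias` API, where `RoutingConfig.AdditionalVersionWeights` maps a second version to its traffic share.

```python
def canary_alias_update(function_name, alias, stable_version, canary_version, canary_weight):
    """Build UpdateAlias parameters that route a share of traffic to a canary version."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    return {
        "FunctionName": function_name,
        "Name": alias,
        # The alias keeps pointing at the stable version...
        "FunctionVersion": stable_version,
        # ...while this share of invocations goes to the canary version.
        "RoutingConfig": {"AdditionalVersionWeights": {canary_version: canary_weight}},
    }


if __name__ == "__main__":
    params = canary_alias_update("orders-api", "prod", "5", "6", 0.1)
    print(params)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("lambda").update_alias(**params)`. Rolling back means calling the same API with an empty `AdditionalVersionWeights` mapping.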
Optimize VPC Settings
- Use VPC endpoints for services like S3 and DynamoDB to reduce latency.
- Avoid routing Lambda traffic through NAT gateways in Lambda VPCs; a global proxy can be used instead.
Ensure Dead Letter Queues (DLQs)
A DLQ lets us handle failed events. For asynchronous invocations or event source mappings, we can configure a destination to receive events that the function failed to process. Omitting DLQs or destinations for failed asynchronous invocations can lead to data loss.
DLQs are specialized queues used to capture and handle failed events that Lambda functions cannot process successfully, such as with throttling, execution errors, or concurrency limits.
Solution
- Use DLQs to capture failed events, providing a fallback for unprocessed messages.
- Set up DLQs to ensure reliable processing and reduce unnecessary retries.
- Configure alerts on DLQ usage to monitor for any increase in failures.
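Alerting on DLQ usage can be expressed as a CloudWatch alarm on the queue depth. The sketch below builds parameters for the CloudWatch `PutMetricAlarm` API, assuming the DLQ is an SQS queue. The queue name, the SNS topic ARN, and the threshold of zero are hypothetical choices.

```python
def dlq_alarm_params(queue_name, alarm_topic_arn, threshold=0):
    """Build PutMetricAlarm parameters that fire when the DLQ holds messages."""
    return {
        "AlarmName": f"{queue_name}-dlq-not-empty",
        "AlarmDescription": "Messages are accumulating in the Lambda DLQ",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # An empty, idle queue should not put the alarm into ALARM state.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }


if __name__ == "__main__":
    params = dlq_alarm_params(
        "orders-dlq", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
    )
    print(params["AlarmName"])
```

With boto3 available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.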
Use AWS Config to validate resource configuration compliance
AWS Config offers managed rules that check whether Lambda functions adhere to the defined configuration and mark a resource as non-compliant when its configuration deviates.
- LAMBDA_CONCURRENCY_CHECK
- LAMBDA_DLQ_CHECK
- LAMBDA_FUNCTION_PUBLIC_ACCESS_PROHIBITED
- LAMBDA_FUNCTION_SETTINGS_CHECK
- LAMBDA_INSIDE_VPC
- LAMBDA_VPC_MULTI_AZ_CHECK
Miscellaneous
- Choose the optimal function runtime, and check whether the Graviton / ARM architecture works for the function; this can reduce the cost of the Lambda function.
- Offload orchestration work to specialized services like Step Functions.
- Don't invoke functions directly; put SQS/SNS in front of them.
- For functions integrated with API Gateway, verify that they complete in under 29 seconds, the default integration timeout.
- Set up anomaly alerts on Lambda invocations.
- Validate triggers: only the desired triggers should be configured.
- Destinations: ensure the Lambda function has a destination configured.
- Layers: ensure the Lambda function uses only company-published layers.