K8s for Data Engineer — Container exit codes
Exit codes, their meaning, and how to handle them
Exit codes are used by container engines to report the reasons for container termination, providing valuable insights into the root causes of pod failures. Understanding exit codes is important for better troubleshooting and maintaining the health of your apps.
In this guide, we will explore the significance of exit codes and how to interpret them in the context of Kubernetes.
The Container Lifecycle
To better understand the causes of container failure, let’s discuss the lifecycle of a container first. Taking Docker as an example — at any given time, a Docker container can be in one of several states:
- Created — the Docker container is created but not started yet (this is the status after running docker create, but before actually running the container)
- Up — the Docker container is currently running. This means the operating system process managed by the container is running. A container enters this state when you use docker start or docker run.
- Paused — the container process was running, but Docker purposely paused the container. Typically this happens when you run the docker pause command.
- Exited — the Docker container has been terminated, usually because the container’s process was killed
When a container reaches the Exited status, Docker will report an exit code in the logs, to inform you what happened to the container that caused it to shut down.
Container exit codes
Container exit codes are used by container engines to indicate the reasons for container termination. When a container terminates, it reports why it was terminated through an exit code. Understanding these exit codes can help in diagnosing the root cause of pod failures.
Exit codes serve as a way to inform the user, operating system, and other applications about why the process was terminated. Each code is a number ranging from 0 to 255.
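Codes from 1 to 124 generally have application-specific meanings. Codes 125 to 127 are used by the container engine or shell to report that a command could not be run, and codes above 128 indicate that the process was terminated by a signal: the exit code is 128 plus the signal number (for example, 137 = 128 + 9 for SIGKILL).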
Understanding these exit codes is essential for troubleshooting and resolving issues in Kubernetes clusters, nodes, containers, or pods. By identifying the exit code, one can take appropriate steps to diagnose and fix the underlying problems.
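The ranges above can be turned into a small lookup helper. This is a minimal sketch, assuming a Linux host; describe_exit_code is a hypothetical helper name, not part of Docker or Kubernetes. It treats 125 to 128 individually, as the sections later in this guide do, and reads any code above 128 as 128 plus a signal number.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Classify a container exit code (hypothetical helper, for illustration)."""
    if not 0 <= code <= 255:
        raise ValueError(f"exit codes range from 0 to 255, got {code}")
    special = {
        0: "clean exit",
        125: "container engine could not run the container",
        126: "command could not be invoked",
        127: "command not found",
        128: "invalid argument to exit",
    }
    if code in special:
        return special[code]
    if code < 125:
        return "application-specific error"
    # Above 128: the process was ended by signal number (code - 128).
    signum = code - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"  # number not defined on this platform
    return f"terminated by {name}"


for code in (0, 1, 127, 137, 143):
    print(code, "->", describe_exit_code(code))
```

A helper like this is only a first triage step; the logs still tell you why the application or the host behaved that way.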
Interpreting Common Container Exit Codes:
Exit Code 0 (Purposefully Stopped)
Exit code 0 denotes a clean exit without any errors. It is often deliberate, initiated by developers or automated processes. When a container reports this exit code, it implies that the foreground process completed its task successfully, or that it handled a stop request and shut down cleanly on its own.
Exit Code 1 (Application Error or Invalid Reference)
Exit code 1 typically arises from application errors or misconfigurations within the container environment, such as runtime exceptions, missing dependencies, or other failures the application process could not recover from. There are two common causes:
- An application error — this could be a simple programming error in code run by the container, such as “divide by zero”, or a more advanced error related to the runtime environment, such as the Java or Python runtime
- An invalid reference — this means the image specification refers to a file that does not exist in the container image
What to do if a container is terminated with Exit Code 1?
- Check the container log to see if one of the files listed in the image specification could not be found. If this is the issue, correct the image specification to point to the correct path and filename.
- If you cannot find an incorrect file reference, check the container logs for an application error, and debug the library that caused the error.
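To see how an application error becomes Exit Code 1, the sketch below runs two tiny Python snippets as child processes and reads their exit codes. It assumes the process is a Python program; exit_code_of is a hypothetical helper name used only for this illustration.

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stderr=subprocess.DEVNULL,  # hide the traceback; we only want the code
    )
    return result.returncode


print("clean run:", exit_code_of("pass"))
print("divide by zero:", exit_code_of("1 / 0"))  # unhandled ZeroDivisionError
```

The interpreter exits with 1 on any unhandled exception, so the traceback in the container log is what tells you which error it was.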
Exit Code 125 (Command Execution Issue)
Exit Code 125 indicates that the command used to run the container did not execute successfully, so the container itself never started. In Docker, this is the code reported when docker run itself fails. This might occur for various reasons, including incorrect command syntax (such as an undefined flag), insufficient permissions, or an incompatibility between the container engine and the host. Detailed examination of container logs and runtime environments is essential to pinpoint the root cause of this issue.
What to do if a container is terminated with Exit Code 125?
- Check if the command used to run the container uses the proper syntax
- Check if the user running the container, or the context in which the command is executed in the image specifications, has sufficient permissions to create containers on the host
- If your container engine provides other options for running a container, try them. For example, in Docker, try docker start instead of docker run
- Test if you can run other containers on the host using the same username or context. If not, reinstall the container engine, or resolve the underlying compatibility issue between the container engine and the host setup
Exit Code 126 (Command Invocation Issue)
A container receiving Exit Code 126 indicates that the command specified in its execution environment could not be invoked successfully. This failure typically stems from missing dependencies or incompatible runtime environments required for command execution. Troubleshooting this issue involves examining the container’s environment variables, ensuring proper installation of dependencies, and verifying compatibility with the runtime environment. The most common causes are a permission problem or a command file that is not executable.
What to do if a container is terminated with Exit Code 126?
- Check the container logs to see which command could not be invoked
- Try running the container specification without the command to ensure you isolate the problem
- Troubleshoot the command to ensure you are using the correct syntax and all dependencies are available
- Correct the container specification and retry running the container
Exit Code 127 (Command Not Found)
Exit Code 127 signals that a command referenced in the container’s specification is not found within the container’s filesystem. This could occur due to various reasons, such as a missing executable file, an incorrect command path, or a typo in the command name. Identifying and rectifying these discrepancies requires a thorough inspection of the container’s filesystem and environment configuration. In practice, the usual culprits are a typo or a $PATH that does not include the command’s directory.
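Exit Codes 126 and 127 are easy to confuse, so here is a small sketch that reproduces both through a POSIX shell. It assumes a Linux or macOS host with sh available; the script name job.sh and the missing command name are made up for the demonstration.

```python
import os
import shlex
import subprocess
import tempfile


def shell_exit_code(command: str) -> int:
    """Run a command line through sh and return the shell's exit code."""
    result = subprocess.run(["sh", "-c", command], stderr=subprocess.DEVNULL)
    return result.returncode


def demo_126_and_127() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "job.sh")  # hypothetical script name
        with open(script, "w") as fh:
            fh.write("#!/bin/sh\necho hello\n")
        os.chmod(script, 0o644)  # readable, but no execute bit set
        not_executable = shell_exit_code(shlex.quote(script))
    missing = shell_exit_code("no-such-command-for-this-demo")
    return not_executable, missing


print(demo_126_and_127())
```

The file exists in the first case, which is why the shell reports 126 rather than 127. Adding the execute bit (chmod +x) in the image build is the usual fix for that case.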
Exit Code 128 (Invalid Argument to Exit)
Exit Code 128 means that code within the container triggered an exit command but did not provide a valid exit code. Exit codes must be integers between 0 and 255, so a process that tries to exit with, for example, 3.5 may be reported with Exit Code 128. Unlike Exit Code 0, this is not a clean termination: it points to a bug in how the application or one of its libraries reports its exit status.
What to do if a container is terminated with Exit Code 128?
- Check the container logs to identify which library caused the container to exit.
- Identify where the offending library uses the exit command, and correct it to provide a valid exit code.
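The sketch below does not reproduce Exit Code 128 itself. It only illustrates why an exit status outside 0 to 255 is a problem: on a POSIX system only the low 8 bits survive, so the value the parent sees is not the value the program asked for. Both helper names are hypothetical.

```python
import subprocess
import sys


def observed_exit_code(requested: int) -> int:
    """Ask a child Python process to exit with `requested`; return what we see."""
    snippet = f"import sys; sys.exit({requested})"
    return subprocess.run([sys.executable, "-c", snippet]).returncode


def checked_exit_code(value) -> int:
    """Return `value` if it is an integer in 0..255, otherwise raise ValueError."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"exit code must be an integer, got {value!r}")
    if not 0 <= value <= 255:
        raise ValueError(f"exit code must be between 0 and 255, got {value}")
    return value


print(observed_exit_code(3))    # in range, reported as requested
print(observed_exit_code(300))  # out of range, wrapped by the operating system
```

Validating the value before exiting, as checked_exit_code does, is one way to make sure the code in the logs is the one the application intended.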
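Exit Code 134 — Abnormal Termination (SIGABRT)
Exit Code 134 means that the container’s process aborted itself with a SIGABRT signal (134 = 128 + 6), typically by calling abort() or failing an assertion. In Spark, it almost always means the job ran out of memory.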
What to do if a container is terminated with Exit Code 134?
- Check the logs: Check Spark’s logs for more details about the program failure.
- Increase memory: If your program requires more memory to run properly, you can solve the problem by increasing the memory limit. You can use the --driver-memory and --executor-memory parameters to set the program’s memory limit.
- Optimize: You can try to reduce memory usage by optimizing the program. For example, use more efficient algorithms, reduce memory allocation, etc.
- Check your code: Check your code for potential memory leaks or other issues.
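To make the link between SIGABRT and the number 134 concrete, the sketch below aborts a child process on purpose. It assumes a POSIX host. Python’s subprocess module reports death by signal as a negative number, so the hypothetical helper container_style_code converts it to the 128 + N form that container engines show.

```python
import signal
import subprocess
import sys


def container_style_code(returncode: int) -> int:
    """Convert subprocess's negative 'killed by signal N' value to 128 + N."""
    return 128 - returncode if returncode < 0 else returncode


def run_aborting_child() -> int:
    """Start a child that calls abort() and return subprocess's return code."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


rc = run_aborting_child()
print(rc, "->", container_style_code(rc))
```

The same conversion applies to the other signal-based codes in this guide, such as 137, 139, and 143.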
Exit Code 137 — Immediate Termination (SIGKILL)
Exit Code 137 means that the container has received a SIGKILL signal from the host operating system (137 = 128 + 9). This signal instructs a process to terminate immediately, with no grace period. It can be triggered automatically by the host, usually because the container exceeded its memory limit or the host ran out of memory. Hitting a CPU limit normally leads to throttling rather than a kill.
What to do if a container is terminated with Exit Code 137?
- Check logs on the host to see what happened before the container termination, and whether it previously received a SIGTERM signal (graceful termination) before receiving SIGKILL
- If there was a prior SIGTERM signal, check if your container process handled SIGTERM and was able to gracefully terminate
- If there was no SIGTERM and the container reported an OOMKilled error, troubleshoot memory issues on the host.
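In Kubernetes, the OOMKilled reason is recorded in the pod status, so the check can be scripted. The sketch below assumes you have the pod as a dictionary, for example parsed from the output of kubectl get pod with -o json. The sample pod is trimmed and invented for illustration, and find_oom_killed is a hypothetical helper name.

```python
def find_oom_killed(pod: dict) -> list:
    """Return names of containers whose current or last termination was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # count each container once
    return names


# Invented sample data, shaped like a trimmed pod status.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "spark-executor",
                "state": {"running": {}},
                "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            },
            {
                "name": "sidecar",
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}

print(find_oom_killed(sample_pod))
```

If a container shows Exit Code 137 without the OOMKilled reason, the kill probably came from somewhere else, such as a grace period that ran out after SIGTERM.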
Spark-specific guide:
- Increase the driver/executor memory
- Add more Spark partitions
- Increase the number of shuffle partitions
- Reduce the number of cores per executor
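Most of the settings in this list can be passed on the spark-submit command line. The sketch below only builds the argument list, so it runs without Spark installed. build_spark_submit_args is a hypothetical helper, and the default values (4g, 8g, 2 cores, 400 shuffle partitions) are placeholders to tune for your own job, not recommendations.

```python
def build_spark_submit_args(
    app: str,
    driver_memory: str = "4g",      # placeholder value
    executor_memory: str = "8g",    # placeholder value
    executor_cores: int = 2,        # fewer cores means fewer tasks sharing memory
    shuffle_partitions: int = 400,  # placeholder value
) -> list:
    """Build a spark-submit command with the memory-related settings above."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--conf", f"spark.sql.shuffle.partitions={shuffle_partitions}",
        app,
    ]


print(" ".join(build_spark_submit_args("job.py")))
```

The number of partitions of a specific dataset is set in the job code itself, for example with repartition, rather than on the command line.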
Exit Code 139 — Segmentation Fault (SIGSEGV)
Exit Code 139 means that the container received a SIGSEGV signal from the operating system. This indicates a segmentation error — a memory violation, caused by a container trying to access a memory location to which it does not have access. This can also be due to file I/O issues.
There are three common causes of SIGSEGV errors:
- Coding error — container process did not initialize properly, or it tried to access memory through a pointer to previously freed memory
- Incompatibility between binaries and libraries — the container process runs a binary file that is not compatible with a shared library, and thus may try to access inappropriate memory addresses
- Hardware incompatibility or misconfiguration — if you see multiple segmentation errors across multiple libraries, there may be a problem with memory subsystems on the host or a system configuration issue
What to do if a container is terminated with Exit Code 139?
- Check if the container process handles SIGSEGV. On both Linux and Windows, you can handle a container’s response to segmentation violations. For example, the container can collect and report a stack trace
- If you need to further troubleshoot SIGSEGV, you may need to set the operating system to allow programs to run even after a segmentation fault occurs, to allow for investigation and debugging. Then, try to intentionally cause a segmentation violation and debug the library causing the issue
- If you cannot replicate the issue, check the memory subsystems on the host and troubleshoot the memory configuration
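As one example of collecting a stack trace on SIGSEGV, Python ships a faulthandler module that prints the Python traceback when the interpreter crashes. The sketch below assumes a POSIX host and a Python process. It triggers a segmentation fault on purpose in a child process by reading memory address 0 through ctypes, then shows what the parent can observe.

```python
import signal
import subprocess
import sys

# Child program: enable faulthandler, then read from address 0 to crash.
CHILD = "import ctypes, faulthandler; faulthandler.enable(); ctypes.string_at(0)"


def run_segfaulting_child() -> tuple:
    """Run the crashing child; return (return code, captured stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


returncode, stderr = run_segfaulting_child()
print("return code:", returncode)  # negative signal number on POSIX
print(stderr)                      # traceback written by faulthandler
```

In a container, the same effect can be had without code changes by starting the interpreter with the -X faulthandler option, so the traceback ends up in the container log next to Exit Code 139.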
Spark-specific guide:
- This error might occur if a Spark job is executed on a problematic DataNode.
- Check the permissions of the YARN log directory.
Exit Code 143 (Graceful Termination)
Exit Code 143 means that the container received a SIGTERM signal from the operating system, which asks the container to gracefully terminate, and the container succeeded in gracefully terminating (otherwise you will see Exit Code 137). This exit code can be:
- Triggered by the container engine stopping the container, for example when using the docker stop or docker-compose down commands
- Triggered by Kubernetes setting a pod to Terminating status, and giving containers a grace period (30 seconds by default) to gracefully shut down
What to do if a container is terminated with Exit Code 143?
- Check host logs to see the context in which the operating system issued the SIGTERM signal.
- If you are using Kubernetes, check the kubelet logs to see if and when the pod was shut down.
In general, Exit Code 143 does not require troubleshooting. It means the container was properly shut down after being instructed to do so by the host.