An inference container, even when running with its API ports exposed, can harbor issues where the application inside is not functioning correctly. The application could have crashed, become unresponsive, or failed to load the model properly. For these scenarios, health checks become essential.
Health checks provide a way for Docker (and container orchestrators like Kubernetes) to periodically verify that your application is not just running, but is actually healthy and capable of serving requests. If a container fails its health check, Docker can report its status as unhealthy, allowing monitoring systems or orchestrators to take corrective action, such as restarting the container.
Docker provides the HEALTHCHECK instruction in the Dockerfile to define how a container's health should be assessed. Its basic syntax is:
HEALTHCHECK [OPTIONS] CMD command
Or, to disable any health check inherited from a base image:
HEALTHCHECK NONE
The CMD command is executed inside the container. Docker interprets the exit status of this command:
- 0: Success - the container is healthy.
- 1: Unhealthy - the container is not working correctly.
- 2: Reserved - do not use this exit code.

For a typical inference API built with Flask or FastAPI, a common strategy is to expose a simple endpoint, like /health or /ping, specifically for health checks. The HEALTHCHECK command can then query this endpoint.
You can use tools available within the container, like curl or wget, to check if the API endpoint responds successfully (HTTP status 2xx or 3xx).
# Example using curl (ensure curl is installed in the image)
# Assumes the API runs on port 8000
HEALTHCHECK --interval=15s --timeout=3s --start-period=5s --retries=3 \
CMD curl --fail http://localhost:8000/health || exit 1
In this example:

- curl --fail http://localhost:8000/health attempts to fetch the /health endpoint.
- --fail tells curl to return a non-zero exit code on server errors (HTTP 4xx or 5xx), which signals failure.
- || exit 1 ensures that if curl itself fails (e.g., cannot connect), the command still exits with 1 (unhealthy).

Relying solely on a successful HTTP response might not be enough. The web server could be running, but perhaps the ML model failed to load, or a required resource is unavailable. A better approach is to implement logic within your /health endpoint to perform internal checks.
Here's a minimal example using FastAPI:
# main.py (FastAPI example)
from fastapi import FastAPI, HTTPException
import os

app = FastAPI()

# Placeholder: Simulate loading a model
model_loaded = os.path.exists("./model.pkl")

@app.get("/health")
def health_check():
    # Add more checks here: database connection, model status, etc.
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ok"}

@app.post("/predict")
def predict(data: dict):
    # Prediction logic...
    if not model_loaded:
        raise HTTPException(status_code=500, detail="Prediction service unavailable")
    # ... actual prediction ...
    return {"prediction": "some_result"}

# Add other endpoints as needed
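Before baking this into an image, you can exercise the endpoint locally the same way the health check will. A quick sketch, assuming uvicorn is installed and main.py is in the current directory:

# Start the API locally, then probe it as the health check would
uvicorn main:app --port 8000 &
curl --fail http://localhost:8000/health
# Prints {"status":"ok"} and exits 0 when model.pkl exists;
# curl exits non-zero on the 503 returned when it does not.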
Now, the HEALTHCHECK command querying /health will only succeed if the endpoint returns a 200 OK status, which, in this case, requires model_loaded to be true.
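Note that curl must be present inside the image for the earlier check to work; slim Python base images do not include it by default. One alternative, sketched below, probes the endpoint with Python's standard library instead, so no extra package needs to be installed:

# Health check using only the Python standard library (no curl required).
# urlopen raises on connection errors and on HTTP 4xx/5xx responses,
# so Python exits with status 1, which Docker treats as unhealthy.
HEALTHCHECK --interval=15s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"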
The HEALTHCHECK instruction allows several options to control its behavior:
- --interval=DURATION (default: 30s): Specifies the time to wait between running health checks.
- --timeout=DURATION (default: 30s): Sets the maximum time allowed for the health check command to complete before it's considered failed.
- --start-period=DURATION (default: 0s): Provides a grace period for the container to initialize before the first health check failure counts towards the maximum number of retries. This is useful for applications that take some time to start up.
- --retries=N (default: 3): Defines the number of consecutive health check failures required to mark the container as unhealthy.

Choosing appropriate values depends on your application. A simple API might use shorter intervals and timeouts, while a service loading a large model might need a longer start-period, as sketched below.
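For instance, a service that spends a minute or two loading a large model at startup might be configured like this (the durations are illustrative, not prescriptive):

# Generous grace period for slow model loading; values are illustrative
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
    CMD curl --fail http://localhost:8000/health || exit 1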
Let's integrate a health check into a Dockerfile for a FastAPI inference service:
# Use an appropriate Python base image
FROM python:3.9-slim
WORKDIR /app
# Install dependencies (ensure curl is included if using it for health check)
RUN apt-get update && apt-get install -y curl --no-install-recommends && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code and model
COPY ./app /app
# Copy an example model file
COPY ./model.pkl /app/model.pkl
# Expose the port the API runs on
EXPOSE 8000
# Define the health check
# Check every 10 seconds after an initial 5-second grace period
HEALTHCHECK --interval=10s --timeout=3s --start-period=5s --retries=3 \
CMD curl --fail http://localhost:8000/health || exit 1
# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
After building an image with a HEALTHCHECK instruction and running a container from it, you can monitor its health status.
The docker ps command will show the status, including the health state (e.g., (healthy), (unhealthy), (starting)) after a short delay.
$ docker ps
CONTAINER ID   IMAGE                     COMMAND                   CREATED          STATUS                    PORTS                    NAMES
a1b2c3d4e5f6   my-inference-api:latest   "uvicorn main:app --h…"   15 seconds ago   Up 14 seconds (healthy)   0.0.0.0:8000->8000/tcp   upbeat_raman
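You can also filter on the health state directly, which is convenient for scripting simple monitors:

# List only containers that are currently failing their health checks
$ docker ps --filter "health=unhealthy"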
For more detailed information, including the output of the last health check, use docker inspect:
$ docker inspect --format='{{json .State.Health}}' a1b2c3d4e5f6
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2023-10-27T10:30:00.123Z",
      "End": "2023-10-27T10:30:00.456Z",
      "ExitCode": 0,
      "Output": " % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100    15  100    15    0     0    458      0 --:--:-- --:--:-- --:--:--   468\n{\"status\":\"ok\"}"
    }
    # ... more log entries
  ]
}
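To watch health transitions as they happen, you can subscribe to the Docker daemon's event stream; containers with a HEALTHCHECK emit health_status events:

# Stream health state transitions for running containers
$ docker events --filter "event=health_status"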
By implementing health checks, you create more resilient containerized inference services. They provide a clear signal about the application's operational status, enabling automated systems to manage container lifecycles effectively and ensuring your ML models are consistently available to serve predictions.