Once you've built your inference container and exposed the necessary ports for your API, how do you ensure the service inside the container is actually working correctly? A container might be running, but the application within it could have crashed, become unresponsive, or failed to load the model properly. This is where health checks become essential.
Health checks provide a way for Docker (and container orchestrators like Kubernetes) to periodically verify that your application is not just running, but is actually healthy and capable of serving requests. If a container fails its health check, Docker can report its status as unhealthy, allowing monitoring systems or orchestrators to take corrective action, such as restarting the container.
Docker provides the HEALTHCHECK instruction in the Dockerfile to define how a container's health should be assessed. Its basic syntax is:
HEALTHCHECK [OPTIONS] CMD command
Or, to disable any health check inherited from a base image:
HEALTHCHECK NONE
The CMD command is executed inside the container. Docker interprets the exit status of this command:
- 0: Success - the container is healthy.
- 1: Unhealthy - the container is not working correctly.
- 2: Reserved - do not use this exit code.

For a typical inference API built with Flask or FastAPI, a common strategy is to expose a simple endpoint, like /health or /ping, specifically for health checks. The HEALTHCHECK command can then query this endpoint.
You can use tools available within the container, like curl or wget, to check whether the API endpoint responds successfully (HTTP status 2xx or 3xx).
# Example using curl (ensure curl is installed in the image)
# Assumes the API runs on port 8000
HEALTHCHECK --interval=15s --timeout=3s --start-period=5s --retries=3 \
CMD curl --fail http://localhost:8000/health || exit 1
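If curl is not available in the image and you would rather not install it, the Python interpreter already present in the container can perform an equivalent check. The following is just one alternative sketch, assuming the same port and /health endpoint as above:

HEALTHCHECK --interval=15s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

Because urllib.request.urlopen raises an exception on connection failures and on HTTP error responses, the command exits with a non-zero code whenever the endpoint is unreachable or unhealthy.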
In the curl-based check:

- curl --fail http://localhost:8000/health attempts to fetch the /health endpoint.
- --fail tells curl to return a non-zero exit code on HTTP error responses (4xx or 5xx), which signals failure.
- || exit 1 ensures that if curl itself fails (e.g., cannot connect), the command still exits with 1 (unhealthy).

Relying solely on a successful HTTP response might not be enough. The web server could be running, but perhaps the ML model failed to load, or a required resource is unavailable. A better approach is to implement logic within your /health endpoint to perform internal checks.
Here's a minimal example using FastAPI:
# main.py (FastAPI example)
from fastapi import FastAPI, HTTPException
import os

app = FastAPI()

# Placeholder: Simulate loading a model
model_loaded = os.path.exists("./model.pkl")

@app.get("/health")
def health_check():
    # Add more checks here: database connection, model status, etc.
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ok"}

@app.post("/predict")
def predict(data: dict):
    # Prediction logic...
    if not model_loaded:
        raise HTTPException(status_code=500, detail="Prediction service unavailable")
    # ... actual prediction ...
    return {"prediction": "some_result"}

# Add other endpoints as needed
Now, the HEALTHCHECK command querying /health will only succeed if the endpoint returns a 200 OK status, which in this case requires model_loaded to be true.
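A flag like model_loaded only confirms that the model file was found. If you want the check to confirm the model can actually serve a request, the /health handler can run a cheap dummy prediction. The sketch below shows one way to do that; the pickled scikit-learn style model, its predict method, and the (1, 4) input shape are illustrative assumptions, not part of the example above:

# main.py variant with a deeper health check (sketch)
import pickle

import numpy as np
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Load the model once at startup; leave it as None if loading fails.
try:
    with open("./model.pkl", "rb") as f:
        model = pickle.load(f)
except (OSError, pickle.UnpicklingError):
    model = None

@app.get("/health")
def health_check():
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    try:
        # Run a cheap dummy prediction to confirm the model object is usable.
        # The (1, 4) input shape is an assumption for illustration only.
        model.predict(np.zeros((1, 4)))
    except Exception as exc:
        raise HTTPException(status_code=503, detail=f"Model check failed: {exc}") from exc
    return {"status": "ok"}

Keep such checks inexpensive, since they run on every health check interval.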
The HEALTHCHECK instruction allows several options to control its behavior:
- --interval=DURATION (default: 30s): Specifies the time to wait between running health checks.
- --timeout=DURATION (default: 30s): Sets the maximum time allowed for the health check command to complete before it's considered failed.
- --start-period=DURATION (default: 0s): Provides a grace period for the container to initialize before health check failures count towards the maximum number of retries. This is useful for applications that take some time to start up.
- --retries=N (default: 3): Defines the number of consecutive health check failures required to mark the container as unhealthy.

Choosing appropriate values depends on your application. A simple API might use shorter intervals and timeouts, while a service loading a large model might need a longer start-period.
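For instance, a container that spends a minute or two loading model weights before it can respond might use settings along these lines (the exact durations are illustrative, not prescriptive):

# Allow up to 2 minutes of startup before failures count, then check every 30s
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
  CMD curl --fail http://localhost:8000/health || exit 1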
Let's integrate a health check into a Dockerfile for a FastAPI inference service:
# Use an appropriate Python base image
FROM python:3.9-slim
WORKDIR /app
# Install dependencies (ensure curl is included if using it for health check)
RUN apt-get update && apt-get install -y curl --no-install-recommends && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code and model
COPY ./app /app
# Example model file (Dockerfile does not support trailing comments on instruction lines)
COPY ./model.pkl /app/model.pkl
# Expose the port the API runs on
EXPOSE 8000
# Define the health check
# Check every 10 seconds after an initial 5-second grace period
HEALTHCHECK --interval=10s --timeout=3s --start-period=5s --retries=3 \
CMD curl --fail http://localhost:8000/health || exit 1
# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
After building an image with a HEALTHCHECK instruction and running a container from it, you can monitor its health status.
The docker ps command will show the status, including the health state (e.g., (healthy), (unhealthy), or (starting)) after a short delay.
$ docker ps
CONTAINER ID   IMAGE                     COMMAND                   CREATED          STATUS                    PORTS                    NAMES
a1b2c3d4e5f6   my-inference-api:latest   "uvicorn main:app --h…"   15 seconds ago   Up 14 seconds (healthy)   0.0.0.0:8000->8000/tcp   upbeat_raman
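When several containers are running, you can also filter by health state to spot problems quickly:

$ docker ps --filter "health=unhealthy"

This lists only containers whose most recent health checks have failed.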
For more detailed information, including the output of the last health check, use docker inspect:
$ docker inspect --format='{{json .State.Health}}' a1b2c3d4e5f6
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2023-10-27T10:30:00.123Z",
      "End": "2023-10-27T10:30:00.456Z",
      "ExitCode": 0,
      "Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 15 100 15 0 0 458 0 --:--:-- --:--:-- --:--:-- 468\n{\"status\":\"ok\"}"
    }
    # ... more log entries
  ]
}
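If you only need the current state, for example in a deployment script that waits for the service to come up, you can extract just the status field (using the same container ID as above):

$ docker inspect --format='{{.State.Health.Status}}' a1b2c3d4e5f6
healthy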
By implementing health checks, you create more resilient containerized inference services. They provide a clear signal about the application's operational status, enabling automated systems to manage container lifecycles effectively and ensuring your ML models are consistently available to serve predictions.