In this exercise we set up a basic monitoring pipeline that uses MLflow to track metrics and Grafana to visualize them in near real-time. It demonstrates how these tools can work together: MLflow provides detailed logging for analysis and model lineage, while Grafana offers operational dashboarding, typically powered by a time-series database such as Prometheus.

For this practice, we assume you have a working Python environment with `mlflow`, `prometheus_client`, and `Flask` installed (`pip install mlflow prometheus_client Flask`), and Docker installed for running Prometheus and Grafana.

### Scenario

We'll simulate a simple prediction service. For each prediction request, we want to:

1. Log the predicted value with MLflow for historical tracking and analysis.
2. Expose the current average prediction value as a metric that Prometheus can scrape.
3. Visualize this average prediction value over time in Grafana.

### Step 1: Instrumenting a Simulated Service

First, let's create a simple Python script (`prediction_service.py`) that simulates making predictions and logs metrics using both MLflow and the Prometheus client library. We'll use Flask to create a minimal web endpoint that Prometheus can scrape.

```python
import mlflow
import time
import random
from threading import Thread

from flask import Flask, Response
from prometheus_client import Gauge, CollectorRegistry, generate_latest

# --- Configuration ---
MLFLOW_TRACKING_URI = "http://127.0.0.1:5000"  # Default local MLflow tracking server URI
EXPERIMENT_NAME = "Production Monitoring Simulation"
SERVICE_PORT = 8080  # Port for the Flask app exposing Prometheus metrics

# --- MLflow Setup ---
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Ensure the experiment exists and look up its ID
try:
    mlflow.set_experiment(EXPERIMENT_NAME)  # Creates the experiment if it doesn't exist
    client = mlflow.tracking.MlflowClient()
    experiment_id = client.get_experiment_by_name(EXPERIMENT_NAME).experiment_id
except Exception as e:
    print(f"Could not connect to the MLflow tracking server at {MLFLOW_TRACKING_URI} "
          f"or create the experiment. Please ensure it's running. Error: {e}")
    experiment_id = None  # Handle the case where the MLflow server isn't running

# --- Prometheus Setup ---
# Use a dedicated registry to manage our metrics
prometheus_registry = CollectorRegistry()

# Define a Gauge metric to track the average prediction value
avg_prediction_gauge = Gauge(
    'average_prediction_value',
    'Average value of predictions made by the service',
    registry=prometheus_registry
)

# --- Flask App for Prometheus Metrics ---
app = Flask(__name__)


@app.route('/metrics')
def metrics():
    """Exposes Prometheus metrics."""
    return Response(generate_latest(prometheus_registry), mimetype='text/plain')


# --- Simulation Logic ---
def simulate_predictions():
    """Simulates making predictions and logging metrics."""
    print(f"Starting prediction simulation. Logging to MLflow experiment: '{EXPERIMENT_NAME}'")
    prediction_values = []
    max_history = 50  # Keep a rolling window for the average calculation

    with mlflow.start_run(experiment_id=experiment_id, run_name="Simulated Production Run") as run:
        print(f"MLflow Run ID: {run.info.run_id}")
        mlflow.log_param("simulation_type", "average_prediction_monitoring")

        step = 0
        while True:
            # Simulate a new prediction (e.g., score, probability, regression output)
            # with a value that drifts upwards over time
            base_value = 50
            drift = step * 0.1           # Gradual drift upwards
            noise = random.gauss(0, 5)   # Add some noise
            prediction = base_value + drift + noise
            prediction = max(0, min(100, prediction))  # Clamp between 0 and 100

            # Log the individual prediction to MLflow
            mlflow.log_metric("prediction_value", prediction, step=step)

            # Update the rolling window used for the average
            prediction_values.append(prediction)
            if len(prediction_values) > max_history:
                prediction_values.pop(0)

            # Calculate the average and update the Prometheus Gauge
            current_avg = sum(prediction_values) / len(prediction_values)
            avg_prediction_gauge.set(current_avg)

            # Optionally, log the average to MLflow as well
            mlflow.log_metric("average_prediction_value_gauge", current_avg, step=step)

            print(f"Step {step}: Prediction={prediction:.2f}, Current Avg={current_avg:.2f}")

            step += 1
            time.sleep(5)  # Wait 5 seconds between simulated predictions


if __name__ == '__main__':
    # Expose the /metrics endpoint from a background thread so that the gauge
    # updated by the simulation loop lives in the same process that Prometheus
    # scrapes. (Running the Flask app as a separate process would give it its
    # own, never-updated copy of the gauge.)
    metrics_thread = Thread(
        target=lambda: app.run(host='0.0.0.0', port=SERVICE_PORT,
                               debug=False, use_reloader=False),
        daemon=True,
    )
    metrics_thread.start()

    # Run the simulation loop in the main thread
    simulate_predictions()
```
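Before wiring this up, it helps to know what Prometheus will actually see. `generate_latest` renders every metric in the registry in the Prometheus text exposition format, so the `/metrics` endpoint returns something like the sample below (the numeric value is made up for illustration; because we use a dedicated `CollectorRegistry`, only our gauge appears):

```text
# HELP average_prediction_value Average value of predictions made by the service
# TYPE average_prediction_value gauge
average_prediction_value 53.7
```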
Before running:

1. Start the MLflow tracking server in a separate terminal:

   ```bash
   mlflow ui
   ```

   This usually starts the server at http://127.0.0.1:5000.

2. Save the script as `prediction_service.py` and run it:

   ```bash
   python prediction_service.py
   ```

   The script serves the `/metrics` endpoint from a background thread and then enters the simulation loop, so a single process both updates the gauge and exposes it. You should now be able to open http://localhost:8080/metrics and see the `average_prediction_value` metric.

### Step 2: Setting up Prometheus

Now we need Prometheus to scrape the `/metrics` endpoint. Create a simple `prometheus.yml` configuration file:

```yaml
# prometheus.yml
global:
  scrape_interval: 10s  # Scrape targets every 10 seconds

scrape_configs:
  - job_name: 'prediction_service'
    static_configs:
      # For Docker Desktop on Mac/Windows:
      - targets: ['host.docker.internal:8080']
      # On Linux, use the host's IP on the Docker bridge network instead, e.g.:
      # - targets: ['172.17.0.1:8080']
      # If running Prometheus directly on the host (not in Docker):
      # - targets: ['localhost:8080']
```

Note on `host.docker.internal`: this special DNS name resolves to the host machine's IP address from within a Docker container on Docker Desktop (Mac/Windows). If you are using Docker on Linux, you may need to use the host's IP address on the Docker bridge network (often 172.17.0.1) or configure networking differently. If you run Prometheus directly on the host (not in Docker), use `localhost:8080`.
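If you would rather avoid these networking caveats, one optional alternative (not part of the step-by-step flow below) is to run Prometheus and Grafana together with Docker Compose. The sketch below assumes the same images, ports, and `prometheus.yml` used in the following steps; on a shared Compose network, containers can reach each other by service name, so Grafana can later use `http://prometheus:9090` as its data source URL.

```yaml
# docker-compose.yml (optional alternative to the individual `docker run` commands below)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # the config file created above
  grafana:
    image: grafana/grafana-oss
    ports:
      - "3000:3000"
```

Start both with `docker compose up -d`; if you take this route, skip the individual `docker run` commands in Steps 2 and 3.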
Run Prometheus using Docker, mounting your configuration file:

```bash
docker run -d --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

You should now be able to access the Prometheus UI at http://localhost:9090. Check the "Targets" page to confirm that Prometheus is successfully scraping your `prediction_service` endpoint. You can also query the metric `average_prediction_value` in the Prometheus expression browser.

### Step 3: Setting up Grafana

Run Grafana using Docker:

```bash
docker run -d --name grafana \
  -p 3000:3000 \
  grafana/grafana-oss
```

Access Grafana at http://localhost:3000 (the default login is admin/admin; you'll be prompted to change the password).

Add a data source:

1. Go to Configuration (gear icon) -> Data Sources.
2. Click "Add data source".
3. Select "Prometheus".
4. Set the HTTP URL to your Prometheus instance. If Grafana and Prometheus share a Docker network on which container names resolve (for example, a user-defined network or the Compose setup sketched above), use `http://prometheus:9090`. If Grafana runs on the host and Prometheus is in Docker with port 9090 published, use `http://localhost:9090`. If both containers were started with the plain `docker run` commands above, the Grafana container cannot resolve the name `prometheus` on the default bridge network; on Docker Desktop, `http://host.docker.internal:9090` usually works instead.
5. Click "Save & Test". You should see a success message.

### Step 4: Creating a Grafana Dashboard

Create the dashboard:

1. Go to the Dashboards section (four squares icon) -> New -> New Dashboard.
2. Click "Add visualization".

Configure the panel:

1. Select your "Prometheus" data source.
2. In the "Metrics browser" or query field, enter the metric name: `average_prediction_value`. Grafana should fetch the data and display a time-series graph.
3. Customize the panel title (e.g., "Average Prediction Value Over Time") and adjust visualization settings (line width, color, axis labels) as desired.
4. Click "Apply" or "Save" to add the panel to your dashboard, then save the dashboard itself with a meaningful name.

You should now see a graph of the average prediction value, updating roughly every 10 seconds (based on the Prometheus scrape interval) and reflecting the upward drift we introduced in the simulation.

Figure: Simulated average prediction value over time, as seen in a Grafana panel (x-axis: time, y-axis: average prediction value). The `average_prediction_value` metric shows an upward trend due to the simulated drift.
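Once the raw gauge is plotted, you can experiment with PromQL in the panel's query field. The two queries below are illustrative sketches: `avg_over_time` and `delta` are standard PromQL functions, but the window sizes are arbitrary choices.

```promql
# Smooth the gauge over a 5-minute window
avg_over_time(average_prediction_value[5m])

# Change in the average over the last 10 minutes: a rough drift-rate signal
delta(average_prediction_value[10m])
```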
### Discussion

This exercise provides a foundational setup. We used:

- MLflow to log individual prediction metrics (`prediction_value`) and, optionally, configuration parameters and aggregated metrics (`average_prediction_value_gauge`). This creates a detailed historical record tied to specific runs, which is invaluable for debugging, auditing, and retraining analysis. You can explore these runs in the MLflow UI.
- Prometheus to scrape the operational metric exposed by the service (`average_prediction_value`). Prometheus is designed for efficient time-series storage and querying, making it well suited to high-frequency monitoring.
- Grafana to query Prometheus and visualize the operational metrics in near real-time dashboards. Grafana also provides alerting capabilities based on these metrics.

In a more complex production scenario:

- The prediction service would likely be containerized and deployed (e.g., on Kubernetes).
- Metrics collection might involve dedicated agents or libraries that integrate more deeply with the serving framework.
- Prometheus scraping would be configured via service discovery mechanisms rather than static targets.
- Grafana dashboards would be more sophisticated, potentially combining metrics from multiple sources (infrastructure, application performance, model metrics) and incorporating alerting rules, for example alerting if the average prediction value shifts too quickly or crosses certain thresholds (see the rule-file sketch at the end of this section).
- The link between MLflow's detailed logs and Grafana's operational view might involve periodically exporting aggregated data from MLflow (or its backend database) into the time-series database, or using unique identifiers logged in both systems to correlate issues.

This hands-on example illustrates how distinct tools can be combined into a layered monitoring strategy that addresses both immediate operational health and longer-term model performance analysis.
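As a concrete illustration of the alerting rules mentioned above, the following Prometheus rule file is a hedged sketch: the metric name matches our service, but the thresholds, windows, and durations are arbitrary values you would tune for your own model, and the file would need to be referenced from `prometheus.yml` via a `rule_files:` entry (with Alertmanager or Grafana handling notification routing).

```yaml
# alert_rules.yml -- illustrative sketch; thresholds and durations are arbitrary
groups:
  - name: prediction_service_alerts
    rules:
      # Fires when the rolling average sits above an expected ceiling
      - alert: AveragePredictionTooHigh
        expr: average_prediction_value > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average prediction value above 70 for 5 minutes"
          description: "Possible drift: the rolling average prediction exceeded the expected range."

      # Fires when the average moves unusually fast
      - alert: AveragePredictionDriftingFast
        expr: abs(delta(average_prediction_value[30m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average prediction value moved more than 10 points within 30 minutes"
```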