This exercise puts you in the role of an MLOps engineer tasked with scrutinizing a cloud bill for optimization opportunities. You will learn to cross-reference cost data with operational metrics to derive meaningful, actionable findings.
Our scenario is based on a simplified but representative Cloud Cost and Usage Report (CUR). These reports are the ground truth for all spending, containing granular, hourly data for every resource. A raw CUR can contain millions of rows, making direct analysis impractical. Your first task is always to aggregate and filter this data to find the signal in the noise.
Assume you have received a monthly CUR for an ML team. After some initial processing, a few lines might look like this:
| line_item_usage_start_date | product_name | line_item_usage_type | line_item_resource_id | user:project | user:job_id | line_item_unblended_cost |
|---|---|---|---|---|---|---|
| 2023-10-15T10:00:00Z | Amazon EC2 | USW2-GPU-Instance:g5.48xlarge | i-0abcdef1234567890 | project-atlas | train-exp-23a | 8.16 |
| 2023-10-15T11:00:00Z | Amazon EC2 | USW2-GPU-Instance:g5.48xlarge | i-0abcdef1234567890 | project-atlas | train-exp-23a | 8.16 |
| 2023-10-15T10:00:00Z | Amazon S3 | USW2-Storage-Bytes-Hour | atlas-dataset-bucket | project-atlas | | 0.0000004 |
| 2023-10-16T04:00:00Z | Amazon EC2 | USW2-BoxUsage:t3.medium | i-fedcba0987654321 | project-ganymede | inference-api | 0.0416 |
The presence of custom tags like user:project and user:job_id is a result of good governance and is essential for effective cost attribution. Without them, it is nearly impossible to assign costs to specific activities.
Before examining individual jobs, you need a high-level view. The first step is to aggregate total costs by service. This helps you understand the primary cost drivers for the entire platform. Using a tool like Pandas in Python, a SQL query in AWS Athena, or your cloud provider's cost explorer, you can generate a summary.
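A minimal Pandas sketch of this first aggregation. The DataFrame here is a hand-constructed stand-in built from the sample CUR rows above; in practice you would load the full report from S3 or query it with Athena:

```python
import pandas as pd

# Hypothetical sample of processed CUR rows (columns match the sample table)
cur = pd.DataFrame({
    "product_name": ["Amazon EC2", "Amazon EC2", "Amazon S3", "Amazon EC2"],
    "line_item_unblended_cost": [8.16, 8.16, 0.0000004, 0.0416],
})

# Total cost per service, largest first, to surface the primary cost drivers
by_service = (
    cur.groupby("product_name")["line_item_unblended_cost"]
       .sum()
       .sort_values(ascending=False)
)
print(by_service)
```

The same one-line `groupby` scales from this toy sample to millions of CUR rows.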
A breakdown of monthly costs. Compute services, especially Amazon EC2, represent the largest portion of the expenditure.
The chart makes it obvious that compute is the dominant cost, with EC2 being the largest contributor. While S3 storage and data transfer are not negligible, any significant optimization effort must begin with compute.
Now, focus on the $18,500 EC2 bill. Using the user:project and line_item_usage_type columns, we can create a more detailed breakdown. This helps attribute costs to specific teams and the types of instances they use, which often indicates the workload type (e.g., GPU instances for training, CPU instances for data processing or inference).
A sunburst chart illustrating cost attribution. Project Atlas is the highest spender, primarily on g5 GPU instances for model training.
This view shows that "Project Atlas" is responsible for the majority of the cost, with a heavy investment in g5 instances. This is our next target for analysis.
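The drill-down works the same way, grouping on the tag and usage-type columns together. Again the rows are illustrative stand-ins taken from the sample data, not a real export:

```python
import pandas as pd

# Hypothetical EC2 rows carrying the governance tags from the CUR sample
ec2 = pd.DataFrame({
    "user:project": ["project-atlas", "project-atlas", "project-ganymede"],
    "line_item_usage_type": [
        "USW2-GPU-Instance:g5.48xlarge",
        "USW2-GPU-Instance:g5.48xlarge",
        "USW2-BoxUsage:t3.medium",
    ],
    "line_item_unblended_cost": [8.16, 8.16, 0.0416],
})

# Cost attributed to each project and the instance types it runs
breakdown = (
    ec2.groupby(["user:project", "line_item_usage_type"])["line_item_unblended_cost"]
       .sum()
       .sort_values(ascending=False)
)
print(breakdown)
```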
A high cost for a training job is not inherently bad if it produces a valuable model. The problem arises from waste, which can be quantified by looking at two factors: job failures and resource underutilization.
To do this, you must augment the CUR data with metadata from your ML platform. Let's assume you have logs that provide the final status of each training job (Succeeded, Failed) and the average GPU utilization percentage over its runtime.
| Job ID | Instance Type | Hours | Raw Cost (USD) | Job Status | Avg. GPU Utilization |
|---|---|---|---|---|---|
| train-exp-23a | g5.48xlarge | 100 | 816 | Succeeded | 92% |
| train-exp-24b | g5.48xlarge | 150 | 1224 | Failed | 88% |
| train-exp-25 | g5.12xlarge | 200 | 544 | Succeeded | 45% |
| train-exp-26 | p4d.24xlarge | 50 | 1638 | Succeeded | 95% |
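The join behind this combined view can be sketched as follows. Both DataFrames are hand-built from the figures in the tables above; the column names (`job_id`, `status`, `avg_gpu_util`) are assumed, since the exact schema of your ML platform's logs will vary:

```python
import pandas as pd

# Per-job cost rolled up from the CUR (values from the tables above)
cost_by_job = pd.DataFrame({
    "job_id": ["train-exp-23a", "train-exp-24b", "train-exp-25", "train-exp-26"],
    "raw_cost_usd": [816, 1224, 544, 1638],
})

# Hypothetical job metadata exported from the ML platform's logs
job_logs = pd.DataFrame({
    "job_id": ["train-exp-23a", "train-exp-24b", "train-exp-25", "train-exp-26"],
    "status": ["Succeeded", "Failed", "Succeeded", "Succeeded"],
    "avg_gpu_util": [0.92, 0.88, 0.45, 0.95],
})

# Augment billing data with operational outcomes
combined = cost_by_job.merge(job_logs, on="job_id")

# Money spent on jobs that produced nothing of value
failed_cost = combined.loc[combined["status"] == "Failed", "raw_cost_usd"].sum()
print(f"Cost of failed jobs: ${failed_cost}")
```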
This combined view is far more informative than the billing report alone.
- train-exp-24b: This job consumed $1,224 of GPU time and produced nothing of value, a direct financial loss. The root cause might be a code bug, a misconfigured environment, or insufficient fault tolerance.
- train-exp-25: This job succeeded, but its GPU utilization was a mere 45%. The GPUs sat idle for more than half the runtime, likely waiting for data from a slow storage source or for the CPU to complete pre-processing. So while the job succeeded, over half of its $544 cost was waste due to an I/O or CPU bottleneck.

Now we can apply the EffectiveCost formula from the chapter introduction:
For train-exp-25, the effective cost is not $544. It is closer to $544 / (1.0 × 0.45) ≈ $1,209. This figure represents what the job would have cost if you were paying only for the portion of the resources that were actively used. Your goal is to get the EffectiveCost as close to the RawCost as possible.
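This calculation can be expressed as a small helper. The function below assumes, following the worked example, that the formula divides raw cost by the product of a success factor (1.0 for a successful job, since a failed job has no useful output) and average utilization:

```python
def effective_cost(raw_cost: float, success_factor: float, avg_utilization: float) -> float:
    """EffectiveCost = RawCost / (success_factor * avg_utilization).

    A job that succeeded (factor 1.0) but used its GPUs only 45% of the
    time effectively cost more than twice its raw bill.
    """
    return raw_cost / (success_factor * avg_utilization)

# train-exp-25: $544 raw, succeeded, 45% average GPU utilization
print(round(effective_cost(544, 1.0, 0.45)))  # ≈ 1209
```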
Based on this analysis, you can now move from observation to action. You can provide the Project Atlas team with specific, data-driven recommendations.
- Investigate Training Failures: train-exp-24b failed after incurring a cost of $1,224. The team should identify the root cause and add safeguards so that expensive runs do not fail without producing value.
- Address GPU Underutilization: train-exp-25 showed only 45% GPU utilization, indicating a severe I/O or CPU bottleneck in the data pipeline.
- Optimize Storage Costs: while secondary to compute, S3 storage and data transfer are not negligible and warrant a review.
- Improve Governance: enforce the user:project and user:job_id tags. Some resources lacked these, making their costs hard to attribute.

This structured process transforms a raw billing report from a simple accounting document into a strategic tool for engineering improvement. It is the fundamental loop of FinOps: measure your spending, attribute it to specific activities, analyze the efficiency of that spending, and implement changes to improve it.