This exercise puts you in the role of an MLOps engineer tasked with scrutinizing a cloud bill for optimization opportunities. You will learn to cross-reference cost data with operational metrics to derive meaningful, actionable findings.
Our scenario is based on a simplified but representative Cloud Cost and Usage Report (CUR). These reports are the ground truth for all spending, containing granular, hourly data for every resource. A raw CUR can contain millions of rows, making direct analysis impractical. Your first task is always to aggregate and filter this data to find the signal in the noise.
Assume you have received a monthly CUR for an ML team. After some initial processing, a few lines might look like this:
| line_item_usage_start_date | product_name | line_item_usage_type | line_item_resource_id | user:project | user:job_id | line_item_unblended_cost |
|---|---|---|---|---|---|---|
| 2023-10-15T10:00:00Z | Amazon EC2 | USW2-GPU-Instance:g5.48xlarge | i-0abcdef1234567890 | project-atlas | train-exp-23a | 8.16 |
| 2023-10-15T11:00:00Z | Amazon EC2 | USW2-GPU-Instance:g5.48xlarge | i-0abcdef1234567890 | project-atlas | train-exp-23a | 8.16 |
| 2023-10-15T10:00:00Z | Amazon S3 | USW2-Storage-Bytes-Hour | atlas-dataset-bucket | project-atlas | | 0.0000004 |
| 2023-10-16T04:00:00Z | Amazon EC2 | USW2-BoxUsage:t3.medium | i-fedcba0987654321 | project-ganymede | inference-api | 0.0416 |
The presence of custom tags like user:project and user:job_id is a result of good governance and is essential for effective cost attribution. Without them, it is nearly impossible to assign costs to specific activities.
Before examining individual jobs, you need a high-level view. The first step is to aggregate total costs by service. This helps you understand the primary cost drivers for the entire platform. Using a tool like Pandas in Python, a SQL query in AWS Athena, or your cloud provider's cost explorer, you can generate a summary.
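A minimal Pandas sketch of this first aggregation. The DataFrame here is a hand-constructed stand-in built from the sample CUR rows above; in practice you would load the full report from S3 or query it with Athena:

```python
import pandas as pd

# Hypothetical sample of processed CUR rows (columns match the sample table)
cur = pd.DataFrame({
    "product_name": ["Amazon EC2", "Amazon EC2", "Amazon S3", "Amazon EC2"],
    "line_item_unblended_cost": [8.16, 8.16, 0.0000004, 0.0416],
})

# Total cost per service, largest first, to surface the primary cost drivers
by_service = (
    cur.groupby("product_name")["line_item_unblended_cost"]
       .sum()
       .sort_values(ascending=False)
)
print(by_service)
```

The same one-line `groupby` scales from this toy sample to millions of CUR rows.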
A breakdown of monthly costs. Compute services, especially Amazon EC2, represent the largest portion of the expenditure.
The chart makes it obvious that compute is the dominant cost, with EC2 being the largest contributor. While S3 storage and data transfer are not negligible, any significant optimization effort must begin with compute.
Now, focus on the $18,500 EC2 bill. Using the user:project and line_item_usage_type columns, we can create a more detailed breakdown. This helps attribute costs to specific teams and the types of instances they use, which often indicates the workload type (e.g., GPU instances for training, CPU instances for data processing or inference).
A sunburst chart illustrating cost attribution. Project Atlas is the highest spender, primarily on g5 GPU instances for model training.
This view shows that "Project Atlas" is responsible for the majority of the cost, with a heavy investment in g5 instances. This is our next target for analysis.
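The drill-down works the same way, grouping on the tag and usage-type columns together. Again the rows are illustrative stand-ins taken from the sample data, not a real export:

```python
import pandas as pd

# Hypothetical EC2 rows carrying the governance tags from the CUR sample
ec2 = pd.DataFrame({
    "user:project": ["project-atlas", "project-atlas", "project-ganymede"],
    "line_item_usage_type": [
        "USW2-GPU-Instance:g5.48xlarge",
        "USW2-GPU-Instance:g5.48xlarge",
        "USW2-BoxUsage:t3.medium",
    ],
    "line_item_unblended_cost": [8.16, 8.16, 0.0416],
})

# Cost attributed to each project and the instance types it runs
breakdown = (
    ec2.groupby(["user:project", "line_item_usage_type"])["line_item_unblended_cost"]
       .sum()
       .sort_values(ascending=False)
)
print(breakdown)
```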
A high cost for a training job is not inherently bad if it produces a valuable model. The problem arises from waste, which can be quantified by looking at two factors: job failures and resource underutilization.
To do this, you must augment the CUR data with metadata from your ML platform. Let's assume you have logs that provide the final status of each training job (Succeeded, Failed) and the average GPU utilization percentage over its runtime.
| Job ID | Instance Type | Hours | Raw Cost (USD) | Job Status | Avg. GPU Utilization |
|---|---|---|---|---|---|
| train-exp-23a | g5.48xlarge | 100 | 816 | Succeeded | 92% |
| train-exp-24b | g5.48xlarge | 150 | 1224 | Failed | 88% |
| train-exp-25 | g5.12xlarge | 200 | 544 | Succeeded | 45% |
| train-exp-26 | p4d.24xlarge | 50 | 1638 | Succeeded | 95% |
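The join behind this combined view can be sketched as follows. Both DataFrames are hand-built from the figures in the tables above; the column names (`job_id`, `status`, `avg_gpu_util`) are assumed, since the exact schema of your ML platform's logs will vary:

```python
import pandas as pd

# Per-job cost rolled up from the CUR (values from the tables above)
cost_by_job = pd.DataFrame({
    "job_id": ["train-exp-23a", "train-exp-24b", "train-exp-25", "train-exp-26"],
    "raw_cost_usd": [816, 1224, 544, 1638],
})

# Hypothetical job metadata exported from the ML platform's logs
job_logs = pd.DataFrame({
    "job_id": ["train-exp-23a", "train-exp-24b", "train-exp-25", "train-exp-26"],
    "status": ["Succeeded", "Failed", "Succeeded", "Succeeded"],
    "avg_gpu_util": [0.92, 0.88, 0.45, 0.95],
})

# Augment billing data with operational outcomes
combined = cost_by_job.merge(job_logs, on="job_id")

# Money spent on jobs that produced nothing of value
failed_cost = combined.loc[combined["status"] == "Failed", "raw_cost_usd"].sum()
print(f"Cost of failed jobs: ${failed_cost}")
```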
This combined view is far more informative than the billing report alone.
- train-exp-24b: This job consumed $1,224 of GPU time and produced nothing of value, a direct financial loss. The root cause might be a code bug, a misconfigured environment, or insufficient fault tolerance.
- train-exp-25: This job succeeded, but its GPU utilization was a mere 45%. The GPUs sat idle for more than half the runtime, likely waiting for data from a slow storage source or for the CPU to complete pre-processing. So while the job succeeded, over half of its $544 cost was waste due to an I/O or CPU bottleneck.

Now we can apply the EffectiveCost formula from the chapter introduction:
For train-exp-25, the effective cost is not $544. It is closer to $544 / (1.0 × 0.45) ≈ $1,209. This figure represents what the job would have cost if you were paying only for the portion of the resources that were actively used. Your goal is to get the EffectiveCost as close to the RawCost as possible.
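This calculation can be expressed as a small helper. The function below assumes, following the worked example, that the formula divides raw cost by the product of a success factor (1.0 for a successful job, since a failed job has no useful output) and average utilization:

```python
def effective_cost(raw_cost: float, success_factor: float, avg_utilization: float) -> float:
    """EffectiveCost = RawCost / (success_factor * avg_utilization).

    A job that succeeded (factor 1.0) but used its GPUs only 45% of the
    time effectively cost more than twice its raw bill.
    """
    return raw_cost / (success_factor * avg_utilization)

# train-exp-25: $544 raw, succeeded, 45% average GPU utilization
print(round(effective_cost(544, 1.0, 0.45)))  # ≈ 1209
```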
Based on this analysis, you can now move from observation to action. You can provide the Project Atlas team with specific, data-driven recommendations.
- Investigate Training Failures: train-exp-24b failed after incurring a cost of $1,224. The team should identify the root cause and add safeguards so that expensive runs do not fail without producing value.
- Address GPU Underutilization: train-exp-25 showed only 45% GPU utilization, indicating a severe I/O or CPU bottleneck in the data pipeline.
- Optimize Storage Costs: while secondary to compute, S3 storage and data transfer are not negligible and warrant a review.
- Improve Governance: enforce the user:project and user:job_id tags. Some resources lacked these, making their costs hard to attribute.

This structured process transforms a raw billing report from a simple accounting document into a strategic tool for engineering improvement. It is the fundamental loop of FinOps: measure your spending, attribute it to specific activities, analyze the efficiency of that spending, and implement changes to improve it.