Optimizing an MPP system requires transparency into how data settles on disk. While the query optimizer handles execution paths, the physical arrangement of data determines the volume of I/O operations required. Examining the storage metadata of a distributed table quantifies the effectiveness of partition pruning, and diagnostic commands measure clustering health and identify scenarios where the logical data ordering does not align with physical storage, leading to excessive scanning.

## Inspecting Partition Metrics

When data is ingested into systems like Snowflake or BigQuery, it is divided into immutable blocks, or micro-partitions. The system maintains a metadata header for each block containing the minimum and maximum values of every column. Query efficiency depends on how little these min-max ranges overlap from one block to the next.

To evaluate this, we look at the clustering depth. This metric represents the average number of partitions that overlap for a specific column range. If a table has a depth of 1, the data is perfectly sorted; a query filtering on that column scans the theoretical minimum number of blocks. A high depth implies significant overlap, forcing the engine to scan multiple partitions to satisfy a single predicate.

Mathematically, if we define a set of partitions $P$ where each partition $p_i$ covers the range $[\text{min}_i, \text{max}_i]$ for a target column, the overlap count for a specific value $v$ is:

$$ \mathrm{Overlap}(v) = \sum_{i=1}^{|P|} \mathbb{1}(\text{min}_i \leq v \leq \text{max}_i) $$

A high summation value across the dataset indicates poor storage efficiency.

We can visualize the difference between a healthy storage profile and a fragmented one using the following diagram. Note how the overlapping ranges in the fragmented state force the query engine to read three files to find the value 50, whereas the optimized state requires only one.

*Comparison of partition range overlaps in fragmented versus optimized storage layouts.*

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style="filled", fontname="Helvetica", fontsize=10, color="#dee2e6"];
    edge [fontname="Helvetica", fontsize=9, color="#868e96"];

    subgraph cluster_0 {
        label = "Fragmented Storage (High Depth)";
        style = filled;
        color = "#e9ecef";
        node [fillcolor="#bac8ff"];
        F1 [label="Partition 1\nRange: 0-60"];
        F2 [label="Partition 2\nRange: 40-90"];
        F3 [label="Partition 3\nRange: 20-100"];
        Query [label="Query: WHERE id = 50", shape=ellipse, fillcolor="#ffc9c9"];
        Query -> F1 [label="Scans", style=solid];
        Query -> F2 [label="Scans", style=solid];
        Query -> F3 [label="Scans", style=solid];
    }

    subgraph cluster_1 {
        label = "Optimized Storage (Low Depth)";
        style = filled;
        color = "#e9ecef";
        node [fillcolor="#69db7c"];
        O1 [label="Partition 1\nRange: 0-30"];
        O2 [label="Partition 2\nRange: 31-60"];
        O3 [label="Partition 3\nRange: 61-100"];
        QueryOpt [label="Query: WHERE id = 50", shape=ellipse, fillcolor="#ffc9c9"];
        QueryOpt -> O1 [label="Pruned", style=dashed];
        QueryOpt -> O2 [label="Scans", style=solid, penwidth=2];
        QueryOpt -> O3 [label="Pruned", style=dashed];
    }
}
```
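To make the depth metric concrete, the sketch below counts overlapping ranges for a single probe value. The tables `partition_ranges(min_id, max_id)` and `probe_values(probe_value)` are hypothetical stand-ins for the engine's internal min-max metadata, which real systems do not expose as ordinary tables.

```sql
-- Sketch only: partition_ranges and probe_values are hypothetical tables
-- standing in for the engine's internal min/max metadata.

-- Overlap(50): how many partition ranges contain the probe value 50.
SELECT COUNT(*) AS overlap_depth
FROM partition_ranges
WHERE 50 BETWEEN min_id AND max_id;

-- Averaging that count over a set of probe values approximates the table's
-- average clustering depth for the column.
SELECT AVG(depth) AS approx_average_depth
FROM (
    SELECT p.probe_value, COUNT(*) AS depth
    FROM probe_values p
    JOIN partition_ranges r
      ON p.probe_value BETWEEN r.min_id AND r.max_id
    GROUP BY p.probe_value
) AS d;
```

Run against the ranges in the diagram, the first query returns 3 for the fragmented layout and 1 for the optimized one.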
## Diagnostic Commands and Output Analysis

In a practical environment, you verify these profiles using system functions. For this exercise, we assume a table named `orders_history` containing several terabytes of timestamped transactional data.

To inspect the storage profile in Snowflake, you execute the clustering information function. This command does not scan the data itself; it reads only the metadata layer, making it fast and cost-efficient.

```sql
SELECT SYSTEM$CLUSTERING_INFORMATION('orders_history', '(event_date)');
```

The output is a JSON object describing the health of the physical storage. Understanding its key fields is necessary for performance tuning:

- `total_partition_count`: the total number of micro-partitions comprising the table.
- `average_depth`: the average number of partitions that overlap on the same data range. A value significantly greater than 1 suggests the table needs re-sorting (clustering).
- `partition_depth_histogram`: a distribution showing how many partitions fall into each overlap-depth bucket (single-value buckets up to 16, then 32, 64, and so on).

If the majority of partitions sit in high-depth buckets (depth > 16), your queries are effectively performing full table scans even when filters are applied.

## Analyzing Scan Metrics

The ultimate test of a storage profile is the ratio of data scanned to data returned. We can analyze this relationship by executing a control query and inspecting the query profile statistics.

Consider a query filtering for a single day:

```sql
SELECT * FROM orders_history WHERE event_date = '2023-11-15';
```

In an ideal MPP architecture, the system identifies the specific micro-partitions containing `2023-11-15` and ignores the rest. If the table contains one year of data and is properly clustered by date, the scan ratio should be approximately:

$$ \mathrm{Ratio} \approx \frac{1}{365} $$

If the query profile instead shows that 100% of the partitions were scanned, the data was likely ingested in a non-sequential order (e.g., backfilled historical data mixed with real-time streams), rendering the metadata min-max pointers useless for pruning.

The following chart demonstrates the impact of clustering on scan volume as query selectivity changes. Notice that for highly selective queries (fetching small slices of data), a disorganized storage profile results in disproportionately higher I/O costs.

*Data scanned (GB) versus query selectivity percentage for clustered and unclustered tables.*

```json
{
  "layout": {
    "title": "Scan Efficiency: Clustered vs Unclustered",
    "xaxis": {"title": "Query Selectivity (% of rows requested)"},
    "yaxis": {"title": "Data Scanned (GB)"},
    "template": "simple_white"
  },
  "data": [
    {
      "x": [1, 5, 10, 20, 50, 100],
      "y": [10, 50, 100, 200, 500, 1000],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "Perfectly Clustered",
      "line": {"color": "#40c057"}
    },
    {
      "x": [1, 5, 10, 20, 50, 100],
      "y": [800, 850, 900, 950, 980, 1000],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "Unclustered (High Overlap)",
      "line": {"color": "#fa5252"}
    }
  ]
}
```

## Interpreting the Results

In the "Unclustered" scenario above, the I/O curve is nearly flat: requesting 1% of the data costs almost as much as requesting 100% of it. This behavior signals a failure in partition pruning.

To address it, you must force a rewrite of the underlying storage. In most MPP systems, this is achieved with commands that rewrite the table segments in sorted order:

```sql
-- Manual re-clustering strategy
INSERT OVERWRITE INTO orders_history
SELECT * FROM orders_history
ORDER BY event_date;
```

By enforcing an `ORDER BY` during the write operation, you align the physical storage with the logical query patterns.
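To confirm the effect of the rewrite, it helps to project the report's fields into columns so the before and after numbers can sit side by side in a dashboard or monitoring job. A minimal sketch, assuming Snowflake's `PARSE_JSON` and variant path syntax; adjust the field names if your platform reports them differently.

```sql
-- Sketch: flatten the clustering report into columns.
-- PARSE_JSON and the colon-path / :: cast syntax are Snowflake-specific.
WITH report AS (
    SELECT PARSE_JSON(
        SYSTEM$CLUSTERING_INFORMATION('orders_history', '(event_date)')
    ) AS j
)
SELECT
    j:total_partition_count::NUMBER AS total_partitions,
    j:average_depth::FLOAT          AS average_depth,
    j:partition_depth_histogram     AS depth_histogram
FROM report;
```

Capturing this output on a schedule makes clustering decay visible long before it shows up as query latency.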
After the rewrite, running `SYSTEM$CLUSTERING_INFORMATION` again should report a significantly lower `average_depth`.

It is important to note that maintaining a tight storage profile incurs compute costs: every re-sort consumes resources. You should therefore define clustering keys only on columns that appear frequently in high-volume `WHERE` clauses or `JOIN` conditions. Continuous background optimization (Auto-Clustering) is available on some platforms, but it must be monitored so that maintenance costs do not exceed the savings gained in query performance.
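Where the filter column is stable enough to justify ongoing maintenance, declaring a clustering key shifts the re-sorting work to the platform. The sketch below uses Snowflake syntax; the cost check assumes the `INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY` table function and its `credits_used` column, so verify the names against your platform's documentation before relying on it.

```sql
-- Declare a clustering key so background Auto-Clustering maintains the
-- sort order as new data arrives (Snowflake syntax).
ALTER TABLE orders_history CLUSTER BY (event_date);

-- Assumed cost check: credits consumed by background re-clustering over the
-- last 7 days, to weigh against the query-time savings.
SELECT start_time, end_time, credits_used, num_rows_reclustered
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
    TABLE_NAME       => 'ORDERS_HISTORY'
));
```

If the credits reported here consistently exceed the compute saved by pruning, suspending the clustering key or clustering on a coarser expression (such as a date truncation) is a common remedy.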