To understand how a compiler optimizes a neural network, you must look at the intermediate code it generates. Just as a software engineer reads assembly or bytecode to debug low-level performance issues, an ML engineer must inspect the Intermediate Representation (IR) to understand how a high-level model is translated into executable instructions. This process moves the focus from the mathematical definition of the model to its structural implementation.

## Reading the Textual Representation

Most modern ML compilers, including TVM and the MLIR infrastructure, provide a way to dump the IR as text. While the internal data structure is a graph, the textual representation usually resembles a typed assembly language or a restricted subset of Python. This format is designed to be human-readable while capturing all the explicit details that the compiler needs, such as data types, shapes, and memory scopes.

Consider a simple operation in a framework like PyTorch: a linear layer followed by a ReLU activation. In Python, this is a concise function call. In the IR, it becomes a sequence of explicit operations. The compiler transforms the implicit logic of Python into a format often based on Static Single Assignment (SSA), where every variable is assigned exactly once.

Below is an example of what this textual IR might look like for a matrix multiplication followed by an element-wise addition and activation. Note the explicit type annotations (such as `float32`) and tensor shapes.

```
def @main(%input: Tensor[(1, 128), float32],
          %weight: Tensor[(64, 128), float32],
          %bias: Tensor[(64), float32]) -> Tensor[(1, 64), float32] {
  %0 = nn.dense(%input, %weight, units=64);
  %1 = nn.bias_add(%0, %bias);
  %2 = nn.relu(%1);
  return %2;
}
```

In this representation, you can observe several details that are hidden in standard framework code:

- **Function Signature:** The entry point `@main` strictly defines input names and types.
- **Explicit Handles:** Every intermediate result is given a temporary handle (e.g., `%0`, `%1`). This makes the data dependency chain explicit.
- **Operator Attributes:** The `nn.dense` operation includes attributes like `units=64` directly in the call, removing ambiguity about the configuration.
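If you want to produce a dump like this yourself, the snippet below is a minimal sketch using TVM's Relay API (assuming TVM is installed). It builds the same dense, bias-add, ReLU chain and prints the module; the exact formatting varies between TVM versions, but the structure matches the listing above.

```python
# Minimal sketch (assuming TVM is installed): build the dense -> bias_add -> relu
# chain with the Relay API and dump its SSA-style textual IR.
import tvm
from tvm import relay

inp    = relay.var("input",  shape=(1, 128),  dtype="float32")
weight = relay.var("weight", shape=(64, 128), dtype="float32")
bias   = relay.var("bias",   shape=(64,),     dtype="float32")

out = relay.nn.dense(inp, weight, units=64)   # matrix multiplication
out = relay.nn.bias_add(out, bias)            # element-wise bias addition
out = relay.nn.relu(out)                      # activation

func = relay.Function([inp, weight, bias], out)
mod = tvm.IRModule.from_expr(func)
print(mod)   # prints the textual IR with explicit types, shapes, and attributes
```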
## Visualizing Dataflow and Dependencies

While text is useful for inspecting attributes, it can be difficult to trace complex dependencies in a large model. Visualizing the IR as a graph helps verify the overall topology of the network. In this visualization, nodes represent computations, and directed edges represent the flow of tensors.

When inspecting the graph structure, you look for connectivity issues. For example, you might verify that a branch in the network merges back correctly or that an operator fusion pass has successfully combined two nodes into one.

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style="filled", fontname="Helvetica", fontsize=12, margin=0.2];
    edge [fontname="Helvetica", fontsize=10, color="#868e96"];

    input    [label="Input\nTensor[(1, 128)]",   fillcolor="#e9ecef", color="#adb5bd"];
    weight   [label="Weight\nTensor[(64, 128)]", fillcolor="#e9ecef", color="#adb5bd"];
    bias     [label="Bias\nTensor[(64)]",        fillcolor="#e9ecef", color="#adb5bd"];

    dense    [label="nn.dense",    fillcolor="#a5d8ff", color="#228be6", shape=component];
    bias_add [label="nn.bias_add", fillcolor="#a5d8ff", color="#228be6", shape=component];
    relu     [label="nn.relu",     fillcolor="#a5d8ff", color="#228be6", shape=component];

    output   [label="Output\nTensor[(1, 64)]",   fillcolor="#b2f2bb", color="#40c057"];

    input -> dense;
    weight -> dense;
    dense -> bias_add;
    bias -> bias_add;
    bias_add -> relu;
    relu -> output;
}
```

*Graph representation of a dense layer followed by bias addition and ReLU activation, showing data dependencies between operators.*

In the diagram above, the flow of data dictates the execution order. The `nn.dense` node depends on both Input and Weight. The compiler uses this dependency information to determine which operations can run in parallel and which must be serialized. If two nodes do not share a path of dependencies, the compiler is free to schedule them on different streams or threads.

## Analyzing Tensor Metadata

A significant part of inspecting the IR involves verifying tensor metadata. Unlike Python variables, which can change type or shape dynamically, IR variables are strictly typed.

### Shape Information

The compiler tracks the shape of every tensor at every stage. When you inspect the IR, you will see shape tuples like `(1, 128)`.

- **Static Shapes:** If the dimensions are fixed integers, the compiler can pre-allocate memory buffers and unroll loops efficiently.
- **Dynamic Shapes:** If you see dimensions represented as variables (e.g., `(batch_size, 128)` or `(?, 128)`), the compiler must generate code that calculates dimensions at runtime. This often results in slower code because optimizations like vectorization become more difficult.

### Data Types (Dtypes)

Frameworks often default to 32-bit floats (`float32`). However, hardware accelerators may prefer 16-bit floats (`float16`) or 8-bit integers (`int8`). Inspecting the IR allows you to confirm that type casting (quantization) has occurred where expected. If you intend to run a model in `float16` but the IR shows `float32` operations, you have identified a performance bottleneck.
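One way to check this metadata programmatically, shown here as a minimal sketch rather than a full compiler workflow, is with `torch.fx`: trace the linear + ReLU model from earlier, propagate shapes and dtypes through the graph, and flag nodes that still compute in `float32`.

```python
# Minimal sketch using torch.fx: annotate every graph node with shape/dtype
# metadata, then scan for nodes that still run in float32.
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

class DenseRelu(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

traced = symbolic_trace(DenseRelu())
ShapeProp(traced).propagate(torch.randn(1, 128))  # fills node.meta["tensor_meta"]

for node in traced.graph.nodes:
    meta = node.meta.get("tensor_meta")
    if meta is None:
        continue
    print(f"{node.name:10s} shape={tuple(meta.shape)} dtype={meta.dtype}")
    if meta.dtype == torch.float32:
        print(f"  note: {node.name} still runs in float32")
```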
## Common IR Components

When reading an IR dump from tools like TVM, XLA, or TorchInductor, you will encounter specific structural elements. Understanding these components helps you navigate the output file.

- **Modules:** The top-level container. A module holds global definitions, constants (like pre-trained weights), and function definitions.
- **Blocks:** A sequence of instructions that execute linearly. Control flow operations like `if` statements or loops create boundaries between blocks. In deep learning graphs, the structure is often a single large block of operations unless the model contains control flow (like Recurrent Neural Networks).
- **Allocations:** In lower-level IRs, you might see explicit memory allocation instructions (e.g., `alloc`). This indicates that the compiler has moved from a pure graph view to a memory-managed view.

The following chart illustrates the hierarchy of an IR module, distinguishing between the high-level definitions and the operational instructions.

*Treemap visualization ("Hierarchy of an ML Compiler IR Module") showing the structural hierarchy of a typical IR module: the module contains Global Variables (Parameters and Weight Constants) and Functions (a Main Function holding Operations and Metadata, plus Helper Functions).*

## Debugging Optimization Failures

The primary reason to inspect IR is to diagnose why an optimization failed. Compilers rely on pattern matching to apply transformations. If the IR structure does not match the expected pattern, the optimization is skipped.

For instance, a compiler might support "Conv2d + ReLU" fusion. This means it looks for a convolution node immediately followed by a ReLU node. If the IR inspection reveals an intermediate cast operation or a reshape node between the convolution and the ReLU, the pattern match will fail, and the fusion will not happen. By reading the IR, you can identify this intervening node and potentially modify your model definition to remove it, enabling the optimization.

Another common scenario involves broadcasting. If you perform an element-wise addition between tensors of different ranks, the compiler inserts broadcast operations. These can sometimes be expensive or inhibit other optimizations. Inspecting the IR reveals exactly where implicit broadcasting occurs, allowing you to fix shapes in the source model to be explicit and efficient.

## From Inspection to Action

Once you can read the structure, you can verify whether the compiler interprets your model's intentions correctly.

- Did the constant folding pass work? Check if constant sub-graphs are replaced by a single constant node (see the sketch at the end of this section).
- Is the layout correct? Check if convolution inputs are NCHW or NHWC based on your target hardware requirements.
- Are unused branches removed? Verify that dead code elimination has removed nodes that do not contribute to the final output.

Mastering IR inspection bridges the gap between model design and hardware execution, providing the visibility needed to tune performance effectively.
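As a closing, concrete example of this inspection-to-action loop, the sketch below (again assuming TVM's Relay API; printed output varies by version) builds a small function with a foldable constant sub-graph, applies the `FoldConstant` pass, and prints the IR before and after so you can confirm the pass actually fired.

```python
# Minimal sketch (assuming TVM): verify constant folding by diffing the IR
# before and after the pass. The multiply of two constants should collapse
# into a single literal in the optimized module.
import tvm
from tvm import relay

x = relay.var("input", shape=(1, 64), dtype="float32")
scale = relay.multiply(relay.const(2.0, "float32"), relay.const(3.0, "float32"))
out = relay.multiply(x, scale)

mod = tvm.IRModule.from_expr(relay.Function([x], out))
print(mod)       # the constant multiply still appears as its own node here

folded = relay.transform.FoldConstant()(mod)
print(folded)    # the constant sub-graph should now appear as a single literal
```

The same before/after comparison works for other checks in the list above, for example by applying `relay.transform.DeadCodeElimination()` and confirming that unused nodes disappear from the printed module.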