High-level differences between TensorFlow's static graphs and PyTorch's dynamic approach are discussed, along with torch.Tensor and autograd. Common TensorFlow operations are translated into PyTorch, and PyTorch's handling of automatic differentiation is demonstrated. The material covers tensor creation, manipulation, and computing gradients with autograd.

Let's start by ensuring you have PyTorch imported. Typically, you'll also want NumPy, as it's a common partner in many data science workflows.

    import torch
    import numpy as np

Tensor Manipulations: A Comparative Workout

If you're coming from TensorFlow, you're already familiar with the idea of a tensor. PyTorch's torch.Tensor will feel quite similar to tf.Tensor in many respects, but with some syntactic and behavioral differences, especially concerning mutability and the define-by-run nature of operations.

Creating Tensors

Creating tensors in PyTorch is straightforward. You can create them from Python lists or NumPy arrays, or initialize them with specific values.

From existing data (Python lists or NumPy arrays):

    # Python list
    data_list = [[1, 2], [3, 4]]
    pt_tensor_from_list = torch.tensor(data_list, dtype=torch.float32)
    print("From list:\n", pt_tensor_from_list)

    # NumPy array
    data_numpy = np.array([[5., 6.], [7., 8.]])
    pt_tensor_from_numpy = torch.from_numpy(data_numpy)  # Shares memory with the NumPy array
    print("From NumPy array (shares memory):\n", pt_tensor_from_numpy)

    # To create a copy that doesn't share memory:
    pt_tensor_copied_from_numpy = torch.tensor(data_numpy)
    print("From NumPy array (copied):\n", pt_tensor_copied_from_numpy)

This is analogous to tf.constant(data_list) or tf.convert_to_tensor(data_numpy) in TensorFlow.
A key difference is that torch.from_numpy() creates a CPU tensor that shares memory with the NumPy array, while torch.tensor() always copies the data.

Tensors with specific values (zeros, ones, random):

    # Zeros tensor
    zeros_tensor = torch.zeros(2, 3)  # Shape (2, 3)
    print("Zeros tensor:\n", zeros_tensor)

    # Ones tensor
    ones_tensor = torch.ones(2, 3, dtype=torch.int16)  # Specify dtype
    print("Ones tensor:\n", ones_tensor)

    # Random tensor (uniform distribution between 0 and 1)
    rand_tensor = torch.rand(2, 3)
    print("Random tensor (uniform):\n", rand_tensor)

    # Random tensor (standard normal distribution)
    randn_tensor = torch.randn(2, 3)
    print("Random tensor (normal):\n", randn_tensor)

These are direct counterparts to tf.zeros(), tf.ones(), tf.random.uniform(), and tf.random.normal(). You can also create a tensor that matches another tensor's properties (shape, dtype, device) using torch.zeros_like(existing_tensor) or torch.rand_like(existing_tensor).

Tensor Attributes

Accessing tensor attributes like shape, data type, and device is very similar to TensorFlow.

    my_tensor = torch.rand(3, 4, device='cpu')  # Explicitly on CPU
    print(f"Shape: {my_tensor.shape}")  # or my_tensor.size()
    print(f"Data type: {my_tensor.dtype}")
    print(f"Device: {my_tensor.device}")

tensor.shape is an alias for tensor.size() in PyTorch.

Basic Operations

Arithmetic operations and matrix multiplications will feel familiar.

Element-wise operations:

    a = torch.tensor([[1., 2.], [3., 4.]])
    b = torch.ones(2, 2) * 2

    # Addition
    print("Addition (a + b):\n", a + b)
    print("Addition (torch.add(a, b)):\n", torch.add(a, b))

    # Multiplication (element-wise)
    print("Multiplication (a * b):\n", a * b)
    print("Multiplication (torch.mul(a, b)):\n", torch.mul(a, b))

    # In-place operations (modify the tensor directly)
    c = torch.tensor([[1., 1.], [1., 1.]])
    c.add_(b)  # Note the trailing underscore for in-place
    print("In-place addition (c.add_(b)):\n", c)

PyTorch operations often have an in-place version denoted by a trailing underscore (e.g., add_(), mul_()). An in-place operation modifies the tensor directly, which can be more memory-efficient but requires care because it can overwrite data needed elsewhere. TensorFlow tensors are immutable, so operations always create new tensors.

Matrix Multiplication:

    mat1 = torch.randn(2, 3)
    mat2 = torch.randn(3, 4)

    # Matrix multiplication
    product = torch.matmul(mat1, mat2)
    print("Matrix product (torch.matmul):\n", product)

    # Alternatively, using the @ operator (Python 3.5+)
    product_at = mat1 @ mat2
    print("Matrix product (@ operator):\n", product_at)

The syntax mirrors tf.matmul() and the @ operator in TensorFlow when working with eager tensors.

Indexing, Slicing, Joining, Mutating

PyTorch supports standard NumPy-style indexing and slicing.

    x = torch.arange(1, 10).reshape(3, 3)
    print("Original tensor x:\n", x)

    # First row
    print("First row: ", x[0, :])

    # Second column
    print("Second column: ", x[:, 1])

    # Sub-tensor
    print("Sub-tensor (x[1:, 1:]):\n", x[1:, 1:])

    # Conditional indexing (masking)
    mask = x > 5
    print("Elements greater than 5:\n", x[mask])

Joining tensors is done with torch.cat() (concatenate) and torch.stack().

    t1 = torch.zeros(2, 3)
    t2 = torch.ones(2, 3)

    # Concatenate along dimension 0 (rows)
    cat_dim0 = torch.cat((t1, t2), dim=0)
    print("Concatenated along dim 0 (rows):\n", cat_dim0)
    print("Shape:", cat_dim0.shape)  # torch.Size([4, 3])

    # Concatenate along dimension 1 (columns)
    cat_dim1 = torch.cat((t1, t2), dim=1)
    print("Concatenated along dim 1 (columns):\n", cat_dim1)
    print("Shape:", cat_dim1.shape)  # torch.Size([2, 6])

    # Stack (creates a new dimension)
    stacked_tensors = torch.stack((t1, t2), dim=0)
    print("Stacked along new dim 0:\n", stacked_tensors)
    print("Shape:", stacked_tensors.shape)  # torch.Size([2, 2, 3])

This is similar to tf.concat() and tf.stack().

Reshaping Tensors

Changing the shape of a tensor is a common operation. PyTorch offers view() and reshape().

tensor.view(): Returns a new tensor with the same data but a different shape.
The new shape must be compatible with the original number of elements. Importantly, view() never copies: the returned tensor shares the underlying data, so the tensor's memory layout must be compatible with the requested shape (this is always the case for contiguous tensors). If it is not, call .contiguous() first.

tensor.reshape(): This is more flexible. It returns a view when possible and otherwise creates a copy (e.g., if the tensor is not contiguous and the new shape cannot be expressed as a view).

    original = torch.arange(12.)  # Creates a 1D tensor: [0., 1., ..., 11.]
    print("Original:", original)

    # Using view
    view_tensor = original.view(3, 4)
    print("View (3,4):\n", view_tensor)

    # Modifying view_tensor will affect original, and vice versa, because they share data
    view_tensor[0, 0] = 99.
    print("Original after modifying view:", original)

    # Using reshape (may or may not be a view)
    reshaped_tensor = original.reshape(2, 6)
    print("Reshaped (2,6):\n", reshaped_tensor)

    # Transpose
    transposed_tensor = view_tensor.t()  # t() only works on tensors with at most 2 dimensions
    print("Transposed (view_tensor.t()):\n", transposed_tensor)

    # For general N-D transpose, use permute
    permuted_tensor = view_tensor.permute(1, 0)  # Swaps dimensions 0 and 1
    print("Permuted (view_tensor.permute(1,0)):\n", permuted_tensor)

TensorFlow's tf.reshape() is similar to torch.reshape(). tf.transpose() with a perm argument is akin to torch.permute().

NumPy Bridge

PyTorch tensors on the CPU and NumPy arrays can share their underlying memory locations, so changes in one will be reflected in the other.

    # PyTorch tensor to NumPy array
    pt_tensor = torch.ones(5)
    numpy_array = pt_tensor.numpy()
    print("NumPy array from PyTorch tensor:\n", numpy_array)

    pt_tensor.add_(1)  # In-place addition
    print("PyTorch tensor after add_:", pt_tensor)
    print("NumPy array after PyTorch tensor modified:", numpy_array)  # NumPy array also changes!
    # NumPy array to PyTorch tensor
    np_array = np.array([1, 2, 3, 4, 5])
    torch_tensor_from_np = torch.from_numpy(np_array)
    print("PyTorch tensor from NumPy array:\n", torch_tensor_from_np)

    np.add(np_array, 1, out=np_array)  # In-place addition in NumPy
    print("NumPy array after modification:", np_array)
    print("PyTorch tensor after NumPy array modified:", torch_tensor_from_np)  # PyTorch tensor also changes!

This tight integration is very convenient. Note that .numpy() only works on CPU tensors: calling it on a GPU tensor raises an error, so move the tensor to the CPU first with .cpu(), which makes a copy. TensorFlow's .numpy() method on tf.Tensor objects provides similar functionality for eager tensors, creating a NumPy array copy.

Device Management (CPU/GPU)

Moving tensors between devices (like CPU and GPU) is a fundamental operation.

    # Check if GPU is available
    if torch.cuda.is_available():
        device = torch.device("cuda")  # A CUDA device object
        cpu_device = torch.device("cpu")
        print(f"Running on {device}")

        # Create a tensor on CPU then move to GPU
        tensor_cpu = torch.randn(2, 2)
        print("Tensor on CPU:", tensor_cpu)
        tensor_gpu = tensor_cpu.to(device)  # or tensor_cpu.cuda()
        print("Tensor on GPU:", tensor_gpu)

        # Create a tensor directly on GPU
        tensor_direct_gpu = torch.randn(2, 2, device=device)
        print("Tensor directly on GPU:", tensor_direct_gpu)

        # Move back to CPU
        tensor_back_to_cpu = tensor_gpu.to(cpu_device)  # or tensor_gpu.cpu()
        print("Tensor back on CPU:", tensor_back_to_cpu)

        # Note: Operations between tensors on different devices will raise an error.
        # For example, tensor_cpu + tensor_gpu would fail.
        # They must be on the same device.
        try:
            result = tensor_cpu + tensor_gpu
        except RuntimeError as e:
            print(f"\nError trying to operate on tensors on different devices: {e}")
    else:
        device = torch.device("cpu")
        print("CUDA not available, running on CPU.")
        tensor_cpu = torch.randn(2, 2)  # Operations will default to CPU

This is analogous to using with tf.device('/GPU:0'): in TensorFlow for placing operations. In PyTorch, tensor.cuda() and tensor.cpu() also work, but to() is the more general, device-agnostic method. PyTorch requires all tensors in an operation to be on the same device.

Mini-Exercise 1: Tensor Gymnastics

1. Create a 3x4 tensor A filled with random numbers from a uniform distribution between 0 and 10.
2. Create a 4x2 tensor B filled with the value 2.
3. Calculate the matrix product C = A @ B.
4. What is the shape of C? Print it.
5. Extract the second column of C.
6. If a GPU is available, move C to the GPU. Print its device. Then move it back to CPU and print its device again.

Solution (try it yourself first!)

    # 1. Create tensor A
    A = torch.rand(3, 4) * 10
    print("Tensor A:\n", A)

    # 2. Create tensor B
    B = torch.full((4, 2), 2, dtype=torch.float32)  # B must be float to match A's dtype in matmul
    print("Tensor B:\n", B)

    # 3. Calculate C = A @ B
    C = A @ B
    print("Tensor C (A @ B):\n", C)

    # 4. Shape of C
    print("Shape of C:", C.shape)

    # 5. Extract second column of C
    second_column_C = C[:, 1]
    print("Second column of C:\n", second_column_C)

    # 6. Move C to GPU and back (if available)
    if torch.cuda.is_available():
        gpu_device = torch.device("cuda")
        cpu_device = torch.device("cpu")
        print(f"Initial device of C: {C.device}")
        C_gpu = C.to(gpu_device)
        print(f"Device of C after moving to GPU: {C_gpu.device}")
        C_cpu_again = C_gpu.to(cpu_device)
        print(f"Device of C after moving back to CPU: {C_cpu_again.device}")
    else:
        print("CUDA not available. Skipping GPU transfer part.")

Automatic Differentiation with autograd

PyTorch's autograd package is the engine for automatic differentiation.
If you're familiar with TensorFlow's tf.GradientTape, you'll find autograd serves a similar purpose but fits naturally with PyTorch's define-by-run nature.

When a tensor's requires_grad attribute is set to True, autograd tracks every operation performed on it. When you finish your computation, you can call .backward() on a scalar output (typically your loss), and autograd computes the gradients of that scalar with respect to every leaf tensor that has requires_grad=True and contributed to it.

The requires_grad Attribute

By default, tensors you create do not track gradients:

    x = torch.tensor([1.0, 2.0, 3.0])
    print(f"x.requires_grad: {x.requires_grad}")  # False by default

To enable gradient tracking, set requires_grad=True at creation or later:

    # At creation
    w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)
    print(f"w.requires_grad: {w.requires_grad}")

    # Or modify in-place (for leaf tensors that don't have a grad_fn)
    x.requires_grad_(True)
    print(f"x.requires_grad after modification: {x.requires_grad}")

Parameters of a torch.nn.Module (which we'll see in the next chapter) have requires_grad=True by default.

Computing Gradients: backward() and .grad

Let's see a simple example. Suppose we have a function $y = \sum_i (w_i \cdot x_i + b)^2$.
We want to compute $\frac{\partial y}{\partial w_i}$ and $\frac{\partial y}{\partial b}$. By the chain rule, $\frac{\partial y}{\partial w_i} = 2 (w_i \cdot x_i + b) \, x_i$ and $\frac{\partial y}{\partial b} = 2 \sum_i (w_i \cdot x_i + b)$, which is what autograd should reproduce.

    # Inputs (leaf nodes, not requiring gradients for this example)
    x = torch.tensor([1.0, 2.0, 3.0])

    # Parameters (we want gradients for these)
    w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)
    b = torch.tensor(0.1, requires_grad=True)

    # Forward pass: operations are tracked
    z = w * x + b       # Element-wise multiplication and addition
    y = z.pow(2).sum()  # Square and sum (y is a scalar)
    print(f"y: {y.item()}")

    # Backward pass: compute gradients
    y.backward()

    # Gradients are accumulated in the .grad attribute of the tensors
    print(f"Gradients for w (dy/dw): {w.grad}")
    print(f"Gradient for b (dy/db): {b.grad}")

    # x does not have .grad because requires_grad was False
    print(f"x.grad: {x.grad}")  # Will be None

In TensorFlow, this would be analogous to:

    # TensorFlow equivalent
    # x_tf = tf.constant([1.0, 2.0, 3.0])
    # w_tf = tf.Variable([0.5, 0.1, -0.2])
    # b_tf = tf.Variable(0.1)
    #
    # with tf.GradientTape() as tape:
    #     z_tf = w_tf * x_tf + b_tf
    #     y_tf = tf.reduce_sum(tf.pow(z_tf, 2))
    #
    # dy_dw_tf, dy_db_tf = tape.gradient(y_tf, [w_tf, b_tf])
    # print(f"TF dy/dw: {dy_dw_tf}")
    # print(f"TF dy/db: {dy_db_tf}")

The core idea is the same: define a computation, then ask the framework to compute gradients. PyTorch's backward() is called on the output tensor, and gradients populate the .grad attribute of the input tensors that required them.

Gradient Accumulation

One important behavior to note: gradients are accumulated in PyTorch. If you call backward() multiple times, the new gradients are added to the existing values in the .grad attribute.

    q = torch.tensor(2.0, requires_grad=True)
    out1 = q * q
    out1.backward()  # Computes d(out1)/dq = 2*q = 4.0
    print(f"q.grad after first backward: {q.grad}")  # tensor(4.)
    out2 = q * q * q
    out2.backward()  # Computes d(out2)/dq = 3*q^2 = 12.0
    # q.grad will now be 4.0 (from previous) + 12.0 (from current) = 16.0
    print(f"q.grad after second backward (accumulated): {q.grad}")

This is why, in a typical training loop, you must explicitly zero out the gradients before each call to backward(), using optimizer.zero_grad() or manually with tensor.grad.zero_().

    # Manually zeroing gradients
    if q.grad is not None:
        q.grad.zero_()
    print(f"q.grad after zeroing: {q.grad}")

Stopping Gradient Tracking

Sometimes you want to perform operations without autograd tracking them, for example during model evaluation (inference) or when updating weights manually.

torch.no_grad() context manager:

    print(f"w.requires_grad before no_grad: {w.requires_grad}")  # True
    with torch.no_grad():
        print("Inside torch.no_grad():")
        y_eval = (w * x + b).sum()  # Operations here won't be tracked
        print(f"  y_eval.requires_grad: {y_eval.requires_grad}")  # False
        # w.requires_grad is still True, but ops on it within this block don't build a graph
        print(f"  w.requires_grad inside no_grad: {w.requires_grad}")
    print(f"w.requires_grad after no_grad: {w.requires_grad}")  # True

This is useful for speeding up computations and reducing memory usage when gradients are not needed.

tensor.detach(): This creates a new tensor that shares the same data but is detached from the computation history. It won't require gradients.

    detached_w = w.detach()
    print(f"detached_w.requires_grad: {detached_w.requires_grad}")  # False
    # In-place changes to detached_w also change w, because they share storage,
    # but operations on detached_w won't contribute to w.grad.

Mini-Exercise 2: Autograd Exploration

Consider the function $z = (a \cdot b) + \sin(c)$.

1. Initialize a, b, and c as scalar torch.Tensors with values a=2.0, b=3.0, c=0.0 (radians). Ensure they all require gradients.
2. Compute z.
3. Call backward() on z.
4. Print the gradients $\frac{\partial z}{\partial a}$, $\frac{\partial z}{\partial b}$, and $\frac{\partial z}{\partial c}$.
5. Manually calculate what these gradients should be and verify your results. (Hint: $\frac{d}{dx} \sin(x) = \cos(x)$)

Solution (try it yourself first!)

    # 1. Initialize tensors
    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(3.0, requires_grad=True)
    c = torch.tensor(0.0, requires_grad=True)  # 0.0 radians

    # 2. Compute z
    z = (a * b) + torch.sin(c)
    print(f"z = {z.item()}")

    # 3. Call backward()
    z.backward()

    # 4. Print gradients
    print(f"dz/da: {a.grad}")
    print(f"dz/db: {b.grad}")
    print(f"dz/dc: {c.grad}")

    # 5. Manual verification:
    # dz/da = b = 3.0
    # dz/db = a = 2.0
    # dz/dc = cos(c) = cos(0.0) = 1.0
    print("\nManual verification:")
    print(f"Expected dz/da: 3.0, Got: {a.grad.item()}")
    print(f"Expected dz/db: 2.0, Got: {b.grad.item()}")
    print(f"Expected dz/dc: cos(0) = 1.0, Got: {c.grad.item()}")

A Glimpse of Optimization

While we'll cover optimizers in torch.optim extensively in Chapter 4, autograd is the foundation. Here's a very basic manual optimization step:

    x_val = torch.tensor([2.0], requires_grad=True)
    y_target = torch.tensor([10.0])
    learning_rate = 0.1

    print(f"Initial x: {x_val.item()}")

    for i in range(5):  # Perform 5 optimization steps
        # Define a simple model and loss
        y_pred = x_val * 3 + 1  # Our "model"
        loss = (y_pred - y_target)**2

        # Zero out previous gradients (if any)
        if x_val.grad is not None:
            x_val.grad.zero_()

        # Compute gradients of loss w.r.t. x_val
        loss.backward()

        # Update x_val using gradient descent (manual step)
        # We use torch.no_grad() because this update shouldn't be part of gradient tracking
        with torch.no_grad():
            x_val -= learning_rate * x_val.grad

        # The printed loss and grad were computed before this step's update
        print(f"Step {i+1}: x = {x_val.item():.4f}, loss = {loss.item():.4f}, grad = {x_val.grad.item():.4f}")

In this loop, we calculate a loss, compute gradients using loss.backward(), and then manually update x_val in the direction that reduces the loss. The with torch.no_grad(): block ensures that the weight update itself is not tracked by autograd. This simple loop illustrates the core mechanics that torch.optim will automate for us.

This hands-on tour should give you a good feel for PyTorch's tensor operations and its autograd system. You've seen how to create and manipulate tensors, move them between devices, and, importantly, how to compute gradients. As we move into building models with torch.nn, these fundamental skills will be essential.
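
To make the link to torch.optim concrete, here is a minimal sketch of the same toy problem (fit x so that 3*x + 1 is close to 10) written with torch.optim.SGD. It assumes plain SGD with no momentum or weight decay, so each step should match the manual update x = x - lr * grad. The variable names simply mirror the manual loop, and the losses list is an illustrative addition for inspecting progress.

```python
import torch

# Same starting point and target as the manual loop.
x_val = torch.tensor([2.0], requires_grad=True)
y_target = torch.tensor([10.0])

# Assumption: plain SGD (no momentum), which performs x <- x - lr * x.grad.
optimizer = torch.optim.SGD([x_val], lr=0.1)

losses = []  # illustrative: record the loss at each step
for step in range(5):
    optimizer.zero_grad()                    # clear gradients from the previous step
    y_pred = x_val * 3 + 1                   # the same one-parameter "model"
    loss = ((y_pred - y_target) ** 2).sum()  # reduce to a scalar for backward()
    loss.backward()                          # populate x_val.grad
    optimizer.step()                         # apply the update without tracking it
    losses.append(loss.item())
    print(f"Step {step + 1}: x = {x_val.item():.4f}, loss = {loss.item():.4f}")
```

Compared with the manual version, optimizer.zero_grad() replaces the explicit x_val.grad.zero_() call, and optimizer.step() replaces the update inside torch.no_grad().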