Alright, you've seen the high-level differences between TensorFlow's static graphs and PyTorch's dynamic approach, and you've been introduced to torch.Tensor and autograd. This practice section is designed to solidify your understanding by translating common TensorFlow operations into PyTorch and then exploring how PyTorch handles automatic differentiation. We'll cover tensor creation, manipulation, and then get into computing gradients with autograd.
Let's start by ensuring you have PyTorch imported. Typically, you'll also want NumPy, as it's a common partner in many data science workflows.
import torch
import numpy as np
If you're coming from TensorFlow, you're already familiar with the idea of a tensor. PyTorch's torch.Tensor will feel quite similar to tf.Tensor in many respects, but with some syntactic and behavioral differences, especially concerning mutability and the define-by-run nature of operations.
Creating tensors in PyTorch is straightforward. You can create them from Python lists or NumPy arrays, or initialize them with specific values.
From existing data (Python lists or NumPy arrays):
# Python list
data_list = [[1, 2], [3, 4]]
pt_tensor_from_list = torch.tensor(data_list, dtype=torch.float32)
print("From list:\n", pt_tensor_from_list)
# NumPy array
data_numpy = np.array([[5., 6.], [7., 8.]])
pt_tensor_from_numpy = torch.from_numpy(data_numpy) # Shares memory with numpy array
print("From NumPy array (shares memory):\n", pt_tensor_from_numpy)
# To create a copy that doesn't share memory:
pt_tensor_copied_from_numpy = torch.tensor(data_numpy)
print("From NumPy array (copied):\n", pt_tensor_copied_from_numpy)
This is analogous to tf.constant(data_list) or tf.convert_to_tensor(data_numpy) in TensorFlow. A key difference is that torch.from_numpy() creates a CPU tensor that shares memory with the NumPy array, so a change to one is visible in the other, while torch.tensor() always copies the data.
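To make the share-versus-copy distinction concrete, here is a minimal sketch. The variable names are illustrative only, and it assumes PyTorch's default dtype has not been changed from float32.

```python
import numpy as np
import torch

arr = np.array([[5., 6.], [7., 8.]])    # NumPy defaults to float64

shared = torch.from_numpy(arr)           # shares memory with arr
copied = torch.tensor(arr)               # independent copy

arr[0, 0] = -1.0                         # mutate the NumPy array in place

print(shared[0, 0].item())               # -1.0: the shared tensor sees the change
print(copied[0, 0].item())               # 5.0: the copy does not
print(shared.dtype, copied.dtype)        # both keep NumPy's float64
print(torch.tensor([[1., 2.]]).dtype)    # Python floats become float32 by default
```

Note the dtype difference: tensors built from a float64 NumPy array stay float64, so you may want an explicit dtype=torch.float32 when mixing them with tensors created from Python lists.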
Tensors with specific values (zeros, ones, random):
# Zeros tensor
zeros_tensor = torch.zeros(2, 3) # Shape (2, 3)
print("Zeros tensor:\n", zeros_tensor)
# Ones tensor
ones_tensor = torch.ones(2, 3, dtype=torch.int16) # Specify dtype
print("Ones tensor:\n", ones_tensor)
# Random tensor (uniform distribution between 0 and 1)
rand_tensor = torch.rand(2, 3)
print("Random tensor (uniform):\n", rand_tensor)
# Random tensor (normal distribution)
randn_tensor = torch.randn(2, 3)
print("Random tensor (normal):\n", randn_tensor)
These are direct counterparts to tf.zeros(), tf.ones(), tf.random.uniform(), and tf.random.normal(). You can also create a tensor that takes its properties (shape, dtype, device) from another tensor using torch.zeros_like(existing_tensor) or torch.rand_like(existing_tensor).
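As a small illustration of the *_like functions, the sketch below builds new tensors from a template tensor. The names template, zeros, noise, and as_f32 are made up for this example.

```python
import torch

template = torch.ones(2, 3, dtype=torch.float64)

zeros = torch.zeros_like(template)                         # same shape, dtype, device
noise = torch.rand_like(template)                          # uniform samples in [0, 1)
as_f32 = torch.zeros_like(template, dtype=torch.float32)   # override only the dtype

print(zeros.shape, zeros.dtype, zeros.device)
print(noise.dtype)
print(as_f32.dtype)
```

This is handy when you need a buffer that matches an existing tensor without repeating its shape and device by hand.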
Accessing tensor attributes like shape, data type, and device is very similar to TensorFlow.
my_tensor = torch.rand(3, 4, device='cpu') # Explicitly on CPU
print(f"Shape: {my_tensor.shape}") # or my_tensor.size()
print(f"Data type: {my_tensor.dtype}")
print(f"Device: {my_tensor.device}")
The tensor.size() method returns the same torch.Size object as the tensor.shape attribute in PyTorch.
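A few related accessors are worth knowing alongside shape. The following is a brief sketch using only standard tensor methods; the variable names are illustrative.

```python
import torch

t = torch.rand(3, 4)

print(t.size() == t.shape)       # True: both are torch.Size([3, 4])
print(t.size(0), t.shape[0])     # 3 3: length of a single dimension
print(t.dim(), t.numel())        # 2 12: number of dimensions, total elements

rows, cols = t.shape             # torch.Size unpacks like a tuple
print(rows, cols)                # 3 4
```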
Arithmetic operations and matrix multiplications will feel familiar.
Element-wise operations:
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.ones(2, 2) * 2
# Addition
print("Addition (a + b):\n", a + b)
print("Addition (torch.add(a, b)):\n", torch.add(a, b))
# Multiplication (element-wise)
print("Multiplication (a * b):\n", a * b)
print("Multiplication (torch.mul(a, b)):\n", torch.mul(a, b))
# In-place operations (modify the tensor directly)
c = torch.tensor([[1.,1.],[1.,1.]])
c.add_(b) # Note the underscore for in-place
print("In-place addition (c.add_(b)):\n", c)
PyTorch operations often have an in-place version denoted by a trailing underscore (e.g., add_(), mul_()). This modifies the tensor directly and can be more memory-efficient, but it requires care because it can overwrite data needed elsewhere. TensorFlow tensors are immutable (mutable state lives in tf.Variable), so operations create new tensors.
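The "overwrite data needed elsewhere" risk is easiest to see with two names that refer to the same tensor. This is a minimal sketch; alias is just an illustrative name.

```python
import torch

a = torch.ones(3)
alias = a            # same tensor object, no copy is made

a.add_(1)            # in-place: every reference sees the change
print(alias)         # tensor([2., 2., 2.])

a = a + 1            # out-of-place: `a` is rebound to a new tensor
print(alias)         # still tensor([2., 2., 2.])
print(a)             # tensor([3., 3., 3.])
```

The out-of-place form behaves like TensorFlow's immutable tensors, while the in-place form changes storage that other variables may still be reading.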
Matrix Multiplication:
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 4)
# Matrix multiplication
product = torch.matmul(mat1, mat2)
print("Matrix product (torch.matmul):\n", product)
# Alternatively, using the @ operator (Python 3.5+)
product_at = mat1 @ mat2
print("Matrix product (@ operator):\n", product_at)
This is identical in syntax to tf.matmul() and the @ operator in TensorFlow when working with eager tensors.
PyTorch supports standard NumPy-style indexing and slicing, including integer indices, slices, and boolean masks.
x = torch.arange(1, 10).reshape(3, 3)
print("Original tensor x:\n", x)
# First row
print("First row: ", x[0, :])
# Second column
print("Second column: ", x[:, 1])
# Sub-tensor
print("Sub-tensor (x[1:, 1:]):\n", x[1:, 1:])
# Conditional indexing (Masking)
mask = x > 5
print("Elements greater than 5:\n", x[mask])
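One detail carries over from NumPy: basic slicing returns a view into the original tensor, while boolean-mask indexing returns a new tensor. A short sketch, with variable names chosen for illustration:

```python
import torch

x = torch.arange(1, 10).reshape(3, 3)

row = x[0, :]                    # basic slice: a view into x
row[0] = 100                     # writes through to x
print(x[0, 0].item())            # 100

picked = x[x > 5]                # boolean mask: a new 1-D tensor (a copy)
picked[0] = -1                   # does not touch x
print((x == -1).any().item())    # False
```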
Joining tensors is done with torch.cat() (concatenate along an existing dimension) and torch.stack() (join along a new dimension).
t1 = torch.zeros(2,3)
t2 = torch.ones(2,3)
# Concatenate along dimension 0 (rows)
cat_dim0 = torch.cat((t1, t2), dim=0)
print("Concatenated along dim 0 (rows):\n", cat_dim0)
print("Shape:", cat_dim0.shape) # torch.Size([4, 3])
# Concatenate along dimension 1 (columns)
cat_dim1 = torch.cat((t1, t2), dim=1)
print("Concatenated along dim 1 (columns):\n", cat_dim1)
print("Shape:", cat_dim1.shape) # torch.Size([2, 6])
# Stack (creates a new dimension)
stacked_tensors = torch.stack((t1, t2), dim=0)
print("Stacked along new dim 0:\n", stacked_tensors)
print("Shape:", stacked_tensors.shape) # torch.Size([2, 2, 3])
This is similar to tf.concat() and tf.stack().
Changing the shape of a tensor is a common operation. PyTorch offers view() and reshape().
tensor.view(): Returns a new tensor with the same data but a different shape. The new shape must be compatible with the original number of elements. Importantly, view() never copies: the returned tensor shares the underlying data, so the requested shape must be expressible over the tensor's existing memory layout. This always holds for contiguous tensors; if the tensor is not contiguous, you might need to call .contiguous() first.
tensor.reshape(): This is more flexible. It returns a view if possible and otherwise creates a copy (e.g., if the tensor is not contiguous and the new shape cannot be expressed over its current layout).
original = torch.arange(12.) # Creates a 1D tensor: [0., 1., ..., 11.]
print("Original:", original)
# Using view
view_tensor = original.view(3, 4)
print("View (3,4):\n", view_tensor)
# Modifying view_tensor will affect original, and vice-versa, because they share data
view_tensor[0,0] = 99.
print("Original after modifying view:", original)
# Using reshape (may or may not be a view)
reshaped_tensor = original.reshape(2, 6)
print("Reshaped (2,6):\n", reshaped_tensor)
# Transpose
transposed_tensor = view_tensor.t() # t() only accepts tensors with at most 2 dimensions
print("Transposed (view_tensor.t()):\n", transposed_tensor)
# For general N-D transpose, use permute
permuted_tensor = view_tensor.permute(1, 0) # Swaps dimensions 0 and 1
print("Permuted (view_tensor.permute(1,0)):\n", permuted_tensor)
TensorFlow's tf.reshape() is similar to torch.reshape(). tf.transpose() is akin to torch.permute().
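The difference between view() and reshape() only shows up on non-contiguous tensors, such as the result of a transpose. The sketch below is one way to observe it; the flag view_failed is an illustrative name.

```python
import torch

base = torch.arange(6).reshape(2, 3)
transposed = base.t()                  # shape (3, 2), a non-contiguous view

print(transposed.is_contiguous())      # False

view_failed = False
try:
    transposed.view(6)                 # view() may not copy, and the layout is incompatible
except RuntimeError:
    view_failed = True
print(view_failed)                     # True

flat_copy = transposed.reshape(6)              # reshape() falls back to a copy
flat_view = transposed.contiguous().view(6)    # explicit copy, then a view of that copy
print(flat_copy)                       # tensor([0, 3, 1, 4, 2, 5])
```

If you do not care whether the result shares memory, reshape() is the safer default; reach for view() when you want an error rather than a silent copy.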
PyTorch tensors on the CPU and NumPy arrays can share their underlying memory locations, so changes in one will reflect in the other.
# PyTorch tensor to NumPy array
pt_tensor = torch.ones(5)
numpy_array = pt_tensor.numpy()
print("NumPy array from PyTorch tensor:\n", numpy_array)
pt_tensor.add_(1) # In-place addition
print("PyTorch tensor after add_:", pt_tensor)
print("NumPy array after PyTorch tensor modified:", numpy_array) # NumPy array also changes!
# NumPy array to PyTorch tensor
np_array = np.array([1, 2, 3, 4, 5])
torch_tensor_from_np = torch.from_numpy(np_array)
print("PyTorch tensor from NumPy array:\n", torch_tensor_from_np)
np.add(np_array, 1, out=np_array) # In-place addition in NumPy
print("NumPy array after modification:", np_array)
print("PyTorch tensor after NumPy array modified:", torch_tensor_from_np) # PyTorch tensor also changes!
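This tight integration is very convenient, but it applies only to CPU tensors. Calling .numpy() directly on a GPU tensor raises an error; copy it to the CPU first with .cpu().numpy(), which produces an array that no longer shares memory with the GPU tensor. TensorFlow's .numpy() method on tf.Tensor objects provides similar functionality for eager tensors, returning the tensor's value as a NumPy array.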
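Conversion to NumPy can also fail for tensors that autograd is tracking. The following sketch shows a conservative conversion pattern; to_numpy is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch

def to_numpy(t):
    # Hypothetical helper: drop autograd tracking, move to CPU, then convert.
    # .cpu() is a no-op for tensors that already live on the CPU.
    return t.detach().cpu().numpy()

t = torch.ones(3, requires_grad=True)

try:
    t.numpy()                    # tracked tensors cannot be converted directly
    direct_ok = True
except RuntimeError:
    direct_ok = False

arr = to_numpy(t)
print(direct_ok)                 # False
print(arr)                       # [1. 1. 1.]
```

Chaining .detach().cpu() first works whether or not the tensor requires gradients and regardless of which device it is on.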
Moving tensors between devices (like CPU and GPU) is a fundamental operation.
# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda") # A CUDA device object
    cpu_device = torch.device("cpu")
    print(f"Running on {device}")
    # Create a tensor on CPU then move to GPU
    tensor_cpu = torch.randn(2, 2)
    print("Tensor on CPU:", tensor_cpu)
    tensor_gpu = tensor_cpu.to(device) # or tensor_cpu.cuda()
    print("Tensor on GPU:", tensor_gpu)
    # Create a tensor directly on GPU
    tensor_direct_gpu = torch.randn(2, 2, device=device)
    print("Tensor directly on GPU:", tensor_direct_gpu)
    # Move back to CPU
    tensor_back_to_cpu = tensor_gpu.to(cpu_device) # or tensor_gpu.cpu()
    print("Tensor back on CPU:", tensor_back_to_cpu)
    # Note: Operations between tensors on different devices will raise an error.
    # For example, tensor_cpu + tensor_gpu would fail.
    # They must be on the same device.
    try:
        result = tensor_cpu + tensor_gpu
    except RuntimeError as e:
        print(f"\nError trying to operate on tensors on different devices: {e}")
else:
    device = torch.device("cpu")
    print("CUDA not available, running on CPU.")
    tensor_cpu = torch.randn(2, 2) # Operations will default to CPU
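This is analogous to using with tf.device('/GPU:0'): in TensorFlow for placing operations (older TensorFlow eager tensors also exposed tensor.gpu()/tensor.cpu() methods, which have since been deprecated). In PyTorch, tensor.cuda() and tensor.cpu() are shorthand for specific moves, though the to() method is the more general and idiomatic choice. PyTorch requires all tensors taking part in an operation to be on the same device.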
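A common way to avoid scattering if/else branches through your code is to pick the device once and pass it everywhere. This is a minimal sketch of that pattern; weights and inputs are illustrative names.

```python
import torch

# Choose the device once; fall back to CPU when CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights = torch.randn(4, 2, device=device)   # created directly on `device`
inputs = torch.randn(3, 4).to(device)        # created on CPU, then moved

outputs = inputs @ weights                   # both operands share a device
print(outputs.device, outputs.shape)
```

The same script then runs unchanged on machines with or without a GPU.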
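1. Create a 3x4 tensor A filled with random numbers from a uniform distribution between 0 and 10.
2. Create a 4x2 tensor B filled with the integer value 2.
3. Compute the matrix product C = A @ B.
4. What is the shape of C? Print it.
5. Extract the second column of C.
6. If a GPU is available, move C to the GPU. Print its device. Then move it back to CPU and print its device again.
Solution (try it yourself first!)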
# 1. Create tensor A
A = torch.rand(3, 4) * 10
print("Tensor A:\n", A)
# 2. Create tensor B
B = torch.full((4, 2), 2, dtype=torch.float32) # Ensure B is float for matmul with A
print("Tensor B:\n", B)
# 3. Calculate C = A @ B
C = A @ B
print("Tensor C (A @ B):\n", C)
# 4. Shape of C
print("Shape of C:", C.shape)
# 5. Extract second column of C
second_column_C = C[:, 1]
print("Second column of C:\n", second_column_C)
# 6. Move C to GPU and back (if available)
if torch.cuda.is_available():
    gpu_device = torch.device("cuda")
    cpu_device = torch.device("cpu")
    print(f"Initial device of C: {C.device}")
    C_gpu = C.to(gpu_device)
    print(f"Device of C after moving to GPU: {C_gpu.device}")
    C_cpu_again = C_gpu.to(cpu_device)
    print(f"Device of C after moving back to CPU: {C_cpu_again.device}")
else:
    print("CUDA not available. Skipping GPU transfer part.")
autograd
PyTorch's autograd package is the engine for automatic differentiation. If you're familiar with TensorFlow's tf.GradientTape, you'll find autograd serves a similar purpose, but instead of recording inside an explicit tape context it tracks operations as they run, in keeping with PyTorch's define-by-run nature.
When a tensor's requires_grad attribute is set to True, autograd starts tracking all operations on it. When you finish your computation, you can call .backward() on a scalar output (typically your loss), and autograd automatically computes the gradients of this scalar with respect to all leaf tensors that had requires_grad=True and contributed to it.
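The "scalar output" requirement matters in practice. As a sketch of what happens otherwise, and two common ways around it (variable names are illustrative):

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = w * 2                          # non-scalar output, shape (3,)

scalar_only = False
try:
    out.backward()                   # an implicit gradient exists only for scalars
except RuntimeError:
    scalar_only = True
print(scalar_only)                   # True

# Option 1: reduce to a scalar first
(w * 2).sum().backward()
print(w.grad)                        # tensor([2., 2., 2.])

# Option 2: supply the upstream gradient explicitly
w.grad = None                        # clear the result of option 1
(w * 2).backward(gradient=torch.ones(3))
print(w.grad)                        # tensor([2., 2., 2.])
```

Passing a vector of ones as the upstream gradient is equivalent to summing the output before calling backward().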
requires_grad Attribute
By default, tensors you create do not track gradients:
x = torch.tensor([1.0, 2.0, 3.0])
print(f"x.requires_grad: {x.requires_grad}") # False by default
To enable gradient tracking, set requires_grad=True at creation or later:
# At creation
w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)
print(f"w.requires_grad: {w.requires_grad}")
# Or modify in-place (for leaf tensors that don't have a grad_fn)
x.requires_grad_(True)
print(f"x.requires_grad after modification: {x.requires_grad}")
Parameters of torch.nn.Module (which we'll see in the next chapter) automatically have requires_grad=True.
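It can help to inspect how autograd classifies tensors. In the sketch below, w is a leaf tensor (created directly by the user) and z is the result of a tracked operation; the names are illustrative.

```python
import torch

w = torch.tensor([0.5, 0.1], requires_grad=True)   # created by the user: a leaf
z = w * 3                                           # result of a tracked operation

print(w.is_leaf, w.grad_fn)          # True None
print(z.is_leaf, z.requires_grad)    # False True
print(z.grad_fn is not None)         # True: z remembers how it was computed

z.sum().backward()
print(w.grad)                        # tensor([3., 3.])
```

Only leaf tensors such as w get a populated .grad by default; intermediate results like z carry a grad_fn that links them back into the graph.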
backward() and .grad
Let's see a simple example. Suppose we have a function $y = \sum_i (w_i \cdot x_i + b)^2$. We want to compute $\frac{\partial y}{\partial w_i}$ and $\frac{\partial y}{\partial b}$.
# Inputs (leaf nodes, not requiring gradients for this example)
x = torch.tensor([1.0, 2.0, 3.0])
# Parameters (we want gradients for these)
w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
# Forward pass: operations are tracked
z = w * x + b # Element-wise multiplication and addition
y = z.pow(2).sum() # Square and sum (y is a scalar)
print(f"y: {y.item()}")
# Backward pass: compute gradients
y.backward()
# Gradients are accumulated in the .grad attribute of the tensors
print(f"Gradients for w (dy/dw): {w.grad}")
print(f"Gradient for b (dy/db): {b.grad}")
# x does not have .grad because requires_grad was False
print(f"x.grad: {x.grad}") # Will be None
In TensorFlow, this would be analogous to:
# TensorFlow equivalent
# x_tf = tf.constant([1.0, 2.0, 3.0])
# w_tf = tf.Variable([0.5, 0.1, -0.2])
# b_tf = tf.Variable(0.1)
#
# with tf.GradientTape() as tape:
# z_tf = w_tf * x_tf + b_tf
# y_tf = tf.reduce_sum(tf.pow(z_tf, 2))
#
# dy_dw_tf, dy_db_tf = tape.gradient(y_tf, [w_tf, b_tf])
# print(f"TF dy/dw: {dy_dw_tf}")
# print(f"TF dy/db: {dy_db_tf}")
The core idea is the same: define a computation, then ask the framework to compute gradients. PyTorch's backward() is called on the output tensor, and gradients populate the .grad attribute of the leaf tensors that required them.
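You can check autograd's result against the derivatives worked out by hand. For the function in this example, the chain rule gives $\frac{\partial y}{\partial w_i} = 2 (w_i x_i + b)\, x_i$ and $\frac{\partial y}{\partial b} = 2 \sum_i (w_i x_i + b)$. The sketch below recomputes the example and compares; expected_dw and expected_db are names introduced here.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w * x + b
y = z.pow(2).sum()
y.backward()

# Hand-derived gradients, computed without tracking
with torch.no_grad():
    expected_dw = 2 * z * x          # dy/dw_i = 2 (w_i x_i + b) x_i
    expected_db = 2 * z.sum()        # dy/db   = 2 * sum_i (w_i x_i + b)

print(torch.allclose(w.grad, expected_dw))   # True
print(torch.allclose(b.grad, expected_db))   # True
```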
One important behavior to note: gradients are accumulated in PyTorch. If you call backward() multiple times, the new gradients are added to the existing values in the .grad attribute.
q = torch.tensor(2.0, requires_grad=True)
out1 = q * q
out1.backward() # Computes d(out1)/dq = 2*q = 4.0
print(f"q.grad after first backward: {q.grad}") # tensor(4.)
out2 = q * q * q
out2.backward() # Computes d(out2)/dq = 3*q^2 = 12.0
# q.grad will now be 4.0 (from previous) + 12.0 (from current) = 16.0
print(f"q.grad after second backward (accumulated): {q.grad}")
This is why, in a typical training loop, you must explicitly zero out the gradients before each call to backward(), either with optimizer.zero_grad() or manually with tensor.grad.zero_().
# Manually zeroing gradients
if q.grad is not None:
    q.grad.zero_()
print(f"q.grad after zeroing: {q.grad}")
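Accumulation is not arbitrary: because differentiation is linear, two backward passes that add into .grad give the same result as one backward pass over the summed expression. A small sketch of that equivalence, with accumulated as an illustrative name:

```python
import torch

q = torch.tensor(2.0, requires_grad=True)

# Two separate backward passes accumulate into q.grad
(q * q).backward()                   # contributes 2*q   = 4
(q * q * q).backward()               # contributes 3*q^2 = 12
accumulated = q.grad.clone()

# Reset, then differentiate the summed expression in a single pass
q.grad = None
(q * q + q * q * q).backward()

print(accumulated, q.grad)           # both are tensor(16.)
```

This is also why accumulation is sometimes used on purpose, for example to sum gradients over several small batches before one update.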
Sometimes you want to perform operations without autograd tracking them, for example, during model evaluation (inference) or when updating weights manually.
torch.no_grad() context manager:
print(f"w.requires_grad before no_grad: {w.requires_grad}") # True
with torch.no_grad():
    print("Inside torch.no_grad():")
    y_eval = (w * x + b).sum() # Operations here won't be tracked
    print(f" y_eval.requires_grad: {y_eval.requires_grad}") # False
    # w.requires_grad is still True, but ops on it within this block don't build graph
    print(f" w.requires_grad inside no_grad: {w.requires_grad}")
print(f"w.requires_grad after no_grad: {w.requires_grad}") # True
Because no graph is built and no intermediate results are saved for a backward pass, this reduces memory usage and speeds up computation when gradients are not needed.
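torch.no_grad() can also be applied as a function decorator, which is convenient for inference helpers. In the sketch below, predict is a hypothetical function made up for illustration.

```python
import torch

w = torch.tensor([0.5, 0.1, -0.2], requires_grad=True)

@torch.no_grad()
def predict(x):
    # Hypothetical inference helper: nothing computed here is tracked
    return (w * x).sum()

x = torch.tensor([1.0, 2.0, 3.0])
out_eval = predict(x)
out_train = (w * x).sum()            # same computation, tracked as usual

print(out_eval.requires_grad, out_eval.grad_fn)   # False None
print(out_train.requires_grad)                    # True
```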
tensor.detach():
This creates a new tensor that shares the same data but is detached from the computation history. It won't require gradients.
detached_w = w.detach()
print(f"detached_w.requires_grad: {detached_w.requires_grad}") # False
# detached_w shares storage with w, so an in-place change to detached_w also changes w's values,
# but operations on detached_w are not tracked and won't contribute to w.grad.
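Two consequences of detach() are worth seeing side by side: gradients do not flow through the detached tensor, and it still shares storage with the original. A minimal sketch, with variable names chosen for illustration:

```python
import torch

a = torch.tensor(3.0, requires_grad=True)

# Treat one factor as a constant by detaching it
y = a * a.detach()
y.backward()
print(a.grad)                # tensor(3.) rather than 6., since only one factor is tracked

# The detached tensor still shares storage with `a`
d = a.detach()
d.fill_(5.0)                 # in-place write through the detached tensor
print(a.item())              # 5.0
```

If you need an independent, untracked copy instead, a.detach().clone() avoids the shared storage.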
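Consider the function $z = (a \cdot b) + \sin(c)$.
1. Initialize a, b, and c as scalar torch.Tensors with values a=2.0, b=3.0, c=0.0 (radians). Ensure they all require gradients.
2. Compute z.
3. Call .backward() on z.
4. Print the gradients of z with respect to a, b, and c.
5. Verify the results manually.
Solution (try it yourself first!)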
# 1. Initialize tensors
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(0.0, requires_grad=True) # 0.0 radians
# 2. Compute z
z = (a * b) + torch.sin(c)
print(f"z = {z.item()}")
# 3. Call backward()
z.backward()
# 4. Print gradients
print(f"dz/da: {a.grad}")
print(f"dz/db: {b.grad}")
print(f"dz/dc: {c.grad}")
# 5. Manual verification:
# dz/da = b = 3.0
# dz/db = a = 2.0
# dz/dc = cos(c) = cos(0.0) = 1.0
print("\nManual verification:")
print(f"Expected dz/da: 3.0, Got: {a.grad.item()}")
print(f"Expected dz/db: 2.0, Got: {b.grad.item()}")
print(f"Expected dz/dc: cos(0) = 1.0, Got: {c.grad.item()}")
While we'll cover optimizers in torch.optim extensively in Chapter 4, autograd is the foundation. Here's a very basic manual optimization step:
x_val = torch.tensor([2.0], requires_grad=True)
y_target = torch.tensor([10.0])
learning_rate = 0.1
print(f"Initial x: {x_val.item()}")
for i in range(5): # Perform 5 optimization steps
    # Define a simple model and loss
    y_pred = x_val * 3 + 1 # Our "model"
    loss = (y_pred - y_target)**2
    # Zero out previous gradients (if any)
    if x_val.grad is not None:
        x_val.grad.zero_()
    # Compute gradients of loss w.r.t. x_val
    loss.backward()
    # Update x_val using gradient descent (manual step)
    # We use torch.no_grad() because this update shouldn't be part of gradient tracking
    with torch.no_grad():
        x_val -= learning_rate * x_val.grad
    print(f"Step {i+1}: x = {x_val.item():.4f}, loss = {loss.item():.4f}, grad = {x_val.grad.item():.4f}")
In this loop, we calculate a loss, compute gradients using loss.backward(), and then manually update x_val in the direction that reduces the loss. The with torch.no_grad(): block ensures that the weight update operation itself is not tracked by autograd. This simple loop illustrates the core mechanics that torch.optim will automate for us.
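For comparison, here is a sketch of the same five steps written with torch.optim.SGD. It is a preview under the same toy setup (same starting value, target, and learning rate), not a full treatment of optimizers.

```python
import torch

x_val = torch.tensor([2.0], requires_grad=True)
y_target = torch.tensor([10.0])
optimizer = torch.optim.SGD([x_val], lr=0.1)

for step in range(5):
    optimizer.zero_grad()                            # replaces the manual grad.zero_()
    loss = ((x_val * 3 + 1 - y_target) ** 2).sum()   # same "model" and loss, reduced to a scalar
    loss.backward()
    optimizer.step()                                 # replaces the manual no_grad update
    print(f"Step {step + 1}: x = {x_val.item():.4f}, loss = {loss.item():.4f}")
```

Plain SGD with no momentum applies exactly the update x -= lr * grad, so this loop should follow the same trajectory as the manual version and move x toward 3, where 3x + 1 equals the target of 10.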
This hands-on tour should give you a good feel for PyTorch's tensor operations and its autograd system. You've seen how to create and manipulate tensors, move them between devices, and, importantly, how to compute gradients. As we move into building models with torch.nn, these fundamental skills will be essential.
© 2025 ApX Machine Learning