Alright, let's put the concepts from this chapter into practice. We'll build a complete image data pipeline using tf.data, starting from image file paths on disk and ending with batches of preprocessed image tensors ready to be fed into a Keras model. This hands-on exercise will solidify your understanding of creating datasets, applying transformations, and optimizing the pipeline for performance.
We assume you have a collection of images organized into directories, perhaps by class. For this example, let's imagine a structure like:
data/
    train/
        class_a/
            image1.jpg
            image2.jpg
            ...
        class_b/
            image3.jpg
            image4.jpg
            ...
    validation/
        class_a/
            image101.jpg
            ...
        class_b/
            image102.jpg
            ...
Our goal is to create a tf.data.Dataset that reads these JPG files, decodes them, resizes them to a uniform size, applies some basic augmentation, shuffles and batches them, and uses prefetching.
First, ensure you have TensorFlow imported. We'll also import the os module, although tf.data.Dataset.list_files often suffices for file discovery.
import tensorflow as tf
import os
import matplotlib.pyplot as plt # For visualization later
# Define some constants
IMG_HEIGHT = 128
IMG_WIDTH = 128
BATCH_SIZE = 32
BUFFER_SIZE = 1000 # Shuffle buffer size; shuffle() needs a concrete positive size (AUTOTUNE applies to map() and prefetch())
The first step is to get a list of all the image files we want to process. The tf.data.Dataset.list_files function is perfect for this. It accepts a file pattern (including wildcards * and **) and returns a dataset of matching file paths. Setting shuffle=True is generally a good idea for the training set.
# Adjust the path pattern to match your dataset structure
train_image_pattern = "data/train/*/*.jpg"
# Create the initial dataset of file paths
list_ds = tf.data.Dataset.list_files(train_image_pattern, shuffle=True)
# Let's see a few file paths
for f in list_ds.take(3):
    print(f.numpy())
Note: shuffle=True in list_files shuffles the file paths before they are processed. We will add another shuffling step later, after processing the images.
Now, we need functions to load, decode, resize, and optionally augment the images.
a) Load and Decode:
This function takes a file path tensor, reads the file content, decodes it as a JPEG image, and resizes it. We also normalize the pixel values to the range [0, 1].
def decode_img(img_path):
    # Read the raw file content
    img = tf.io.read_file(img_path)
    # Decode the JPEG file to a uint8 tensor (channels=3 for RGB)
    img = tf.image.decode_jpeg(img, channels=3)
    # Convert the image to floats in the range [0, 1]
    img = tf.image.convert_image_dtype(img, tf.float32)
    # Resize the image to the desired size
    return tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
b) (Optional) Augmentation: Let's create a simple augmentation function that randomly flips the image horizontally. More complex augmentations (brightness, contrast, rotation) can be added here.
def augment_img(image):
    image = tf.image.random_flip_left_right(image)
    # Add more augmentations here if desired
    # image = tf.image.random_brightness(image, max_delta=0.2)
    return image
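For instance, a slightly richer version might perturb brightness and contrast as well. This is only a sketch: the name augment_img_extended and the specific ranges (max_delta=0.2, contrast in [0.8, 1.2]) are illustrative choices you would tune for your data.

def augment_img_extended(image):
    # Horizontal flip, as before
    image = tf.image.random_flip_left_right(image)
    # Randomly perturb brightness and contrast (ranges are illustrative)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # Brightness/contrast shifts can push values outside [0, 1]; clip them back
    return tf.clip_by_value(image, 0.0, 1.0)

Clipping after these ops keeps the pixel range consistent with the [0, 1] normalization performed in decode_img.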
c) Combined Processing Function: It's often convenient to combine decoding and augmentation (if used) into a single processing function that will be mapped over the dataset. If your dataset structure includes labels (e.g., encoded in the path), you would also extract the label here. For simplicity, we'll focus only on the image pipeline.
def process_path(file_path):
    img = decode_img(file_path)
    # Apply augmentation during training
    # (for validation/test sets, you might skip this step)
    img = augment_img(img)
    return img  # In a real scenario, you'd return (img, label)
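If you do need labels, one common sketch (assuming, as in our directory layout, that the class name is the parent directory of each file) extracts them from the path string. The helper names get_label and process_path_with_label here are hypothetical:

def get_label(file_path):
    # Split the path into components, e.g. ["data", "train", "class_a", "image1.jpg"]
    parts = tf.strings.split(file_path, os.path.sep)
    # The class name is the parent directory of the file
    return parts[-2]

def process_path_with_label(file_path):
    label = get_label(file_path)
    img = augment_img(decode_img(file_path))
    return img, label

You could then compare the string label against a tensor of known class names (e.g., label == class_names) to obtain a one-hot or integer encoding.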
We use the map() transformation to apply our process_path function to each file path in the dataset. Using num_parallel_calls=tf.data.AUTOTUNE allows tf.data to automatically tune the level of parallelism for potentially faster processing.
# Apply the processing function to each item in the dataset
# Use AUTOTUNE to let TensorFlow optimize parallel processing
processed_ds = list_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
At this point, processed_ds is a dataset where each element is a processed image tensor (resized, normalized, and augmented).
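As a quick sanity check, you can peek at a single element to confirm its shape and dtype:

# Each element should be a float32 tensor of shape (IMG_HEIGHT, IMG_WIDTH, 3)
for img in processed_ds.take(1):
    print(img.shape, img.dtype)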
To prepare the dataset for model training, we need to shuffle the processed images, batch them, and prefetch.
- shuffle(): Randomizes the order of elements. A buffer size larger than the dataset size ensures perfect shuffling, but can use a lot of memory. A common practice is to use a reasonably large buffer (e.g., 1000).
- batch(): Groups elements into batches. drop_remainder=True can sometimes be useful if the model requires fixed batch sizes.
- prefetch(): Overlaps the data preprocessing and model execution. It prepares subsequent batches while the current batch is being processed by the model, significantly reducing I/O wait times. tf.data.AUTOTUNE lets TensorFlow choose an optimal prefetch buffer size.

The order matters: shuffling before batching ensures that elements are randomized across batches each epoch. Prefetching should generally be the last step.
def configure_for_performance(ds):
    ds = ds.shuffle(buffer_size=BUFFER_SIZE)  # shuffle() requires a fixed positive buffer size
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds

final_ds = configure_for_performance(processed_ds)
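To see the effect of these settings, a rough timing loop like the hypothetical benchmark helper below can compare pipeline variants (for example, with and without prefetch); absolute numbers will depend heavily on your disk and CPU:

import time

def benchmark(ds, num_batches=50):
    # Iterate over a fixed number of batches, discarding the data,
    # so the elapsed time mostly reflects the input pipeline itself
    start = time.perf_counter()
    for _ in ds.take(num_batches):
        pass
    print(f"{num_batches} batches in {time.perf_counter() - start:.2f}s")

# benchmark(final_ds)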
Let's retrieve one batch from our final dataset and check its shape and value range.
# Take one batch from the dataset
image_batch = next(iter(final_ds))
# Check the shape - should be (BATCH_SIZE, IMG_HEIGHT, IMG_WIDTH, 3)
print(f"Batch shape: {image_batch.shape}")
# Check the value range - should be roughly [0, 1]
print(f"Min value: {tf.reduce_min(image_batch).numpy()}")
print(f"Max value: {tf.reduce_max(image_batch).numpy()}")
# Display the first image in the batch
plt.figure(figsize=(6, 6))
plt.imshow(image_batch[0])
plt.title("Sample Image from Batch")
plt.axis("off")
plt.show()
If the shape is (32, 128, 128, 3) (given our constants) and the values are between 0 and 1, our pipeline is working correctly! The image displayed should look like one of your input images, possibly flipped due to augmentation.

This final_ds object is now ready to be passed directly to Keras's model.fit:
# Assume 'model' is a compiled Keras model
# model.fit(final_ds, epochs=10)
# Similarly for evaluation:
# validation_ds = ... build a similar pipeline for validation data ...
# model.evaluate(validation_ds)
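As a sketch of what that validation pipeline might look like (assuming the directory layout shown earlier), decode and resize exactly as for training, but skip augmentation and shuffling so evaluation is deterministic:

# Deterministic pipeline for evaluation: no augmentation, no shuffling
val_list_ds = tf.data.Dataset.list_files("data/validation/*/*.jpg", shuffle=False)
validation_ds = (val_list_ds
                 .map(decode_img, num_parallel_calls=tf.data.AUTOTUNE)
                 .batch(BATCH_SIZE)
                 .prefetch(buffer_size=tf.data.AUTOTUNE))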
You have successfully built an efficient image data pipeline using tf.data! This pipeline handles file discovery, parallel image loading and preprocessing, augmentation, shuffling, batching, and prefetching, all decoupled from the model training loop itself. This approach is scalable and essential for working with large image datasets in TensorFlow. If needed for supervised learning tasks, you can adapt the process_path function to include label extraction based on your directory structure or file naming conventions, as sketched earlier.