Build a complete image data pipeline using `tf.data`, starting from image file paths on disk and ending with batches of preprocessed image tensors ready to be fed into a Keras model. This hands-on exercise will help solidify your understanding of creating datasets, applying transformations, and optimizing the pipeline for performance.

We assume you have a collection of images organized into directories, perhaps by class. For this example, let's imagine a structure like:

```
data/
  train/
    class_a/
      image1.jpg
      image2.jpg
      ...
    class_b/
      image3.jpg
      image4.jpg
      ...
  validation/
    class_a/
      image101.jpg
      ...
    class_b/
      image102.jpg
      ...
```

Our goal is to create a `tf.data.Dataset` that reads these JPG files, decodes them, resizes them to a uniform size, applies some basic augmentation, shuffles and batches them, and uses prefetching.

## Prerequisites

First, ensure you have TensorFlow imported. We'll also use the `os` module, although `tf.data.Dataset.list_files` often suffices.

```python
import tensorflow as tf
import os
import matplotlib.pyplot as plt  # For visualization later

# Define some constants
IMG_HEIGHT = 128
IMG_WIDTH = 128
BATCH_SIZE = 32
# shuffle() needs a concrete buffer size; tf.data.AUTOTUNE applies to map/prefetch
BUFFER_SIZE = 1000
```

## Step 1: Create a Dataset of File Paths

The first step is to get a list of all the image files we want to process. The `tf.data.Dataset.list_files` function is perfect for this. It accepts a file pattern (including wildcards `*` and `**`) and returns a dataset of matching file paths. Setting `shuffle=True` is generally a good idea for the training set.

```python
# Adjust the path pattern to match your dataset structure
train_image_pattern = "data/train/*/*.jpg"

# Create the initial dataset of file paths
list_ds = tf.data.Dataset.list_files(train_image_pattern, shuffle=True)

# Let's see a few file paths
for f in list_ds.take(3):
    print(f.numpy())
```

Note: `shuffle=True` in `list_files` shuffles the file paths before they are processed. We will add another shuffling step later, after the images have been processed.

## Step 2: Define Processing Functions

Now we need functions to load, decode, resize, and optionally augment the images.

**a) Load and decode.** This function takes a file path tensor, reads the file content, decodes it as a JPEG image, and resizes it. We also normalize the pixel values to the range [0, 1].

```python
def decode_img(img_path):
    # Read the raw file content
    img = tf.io.read_file(img_path)
    # Decode the JPEG file to a uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)  # channels=3 for RGB
    # Convert image to floats in the range [0, 1]
    img = tf.image.convert_image_dtype(img, tf.float32)
    # Resize the image to the desired size
    return tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
```

**b) (Optional) Augmentation.** Let's create a simple augmentation function that randomly flips the image horizontally. More complex augmentations (brightness, contrast, rotation) can be added here; see the sketch after this block.

```python
def augment_img(image):
    image = tf.image.random_flip_left_right(image)
    # Add more augmentations here if desired
    # image = tf.image.random_brightness(image, max_delta=0.2)
    return image
```
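As an illustration, here is a minimal sketch of a richer augmentation function. It assumes pixel values are already in [0, 1] (as produced by `decode_img`), and the name `augment_img_extended` is hypothetical, not part of the pipeline built in this exercise:

```python
def augment_img_extended(image):
    image = tf.image.random_flip_left_right(image)
    # Randomly perturb brightness and contrast
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # Brightness/contrast shifts can push values outside [0, 1]; clip them back
    return tf.clip_by_value(image, 0.0, 1.0)
```

Note that `tf.image` itself only offers 90-degree rotations (`tf.image.rot90`); arbitrary-angle random rotation typically requires a Keras preprocessing layer such as `tf.keras.layers.RandomRotation`.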
**c) Combined processing function.** It's often convenient to combine decoding and augmentation (if used) into a single processing function that will be mapped over the dataset. If your dataset structure includes labels (e.g., encoded in the path), you would also extract the label here; for simplicity, we'll focus only on the image pipeline.

```python
def process_path(file_path):
    img = decode_img(file_path)
    # Apply augmentation during training
    # For validation/test sets, you might skip this step
    img = augment_img(img)
    return img  # In a real scenario, you'd return (img, label)
```

## Step 3: Map the Processing Function

We use the `map()` transformation to apply our `process_path` function to each file path in the dataset. Passing `num_parallel_calls=tf.data.AUTOTUNE` lets `tf.data` automatically tune the level of parallelism for potentially faster processing.

```python
# Apply the processing function to each item in the dataset
# Use AUTOTUNE to let TensorFlow optimize parallel processing
processed_ds = list_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
```

At this point, `processed_ds` is a dataset where each element is a processed image tensor (resized, normalized, and augmented).

## Step 4: Configure for Performance and Training

To prepare the dataset for model training, we need to shuffle the processed images, batch them, and prefetch:

- `shuffle()`: Randomizes the order of elements. A buffer size larger than the dataset size ensures perfect shuffling but can use a lot of memory, especially here, where the buffer holds decoded image tensors rather than file paths. A common practice is to use a reasonably large fixed buffer (e.g., 1000).
- `batch()`: Groups elements into batches. `drop_remainder=True` can sometimes be useful if the model requires fixed batch sizes.
- `prefetch()`: Overlaps data preprocessing and model execution. It prepares subsequent batches while the current batch is being processed by the model, significantly reducing I/O wait times. `tf.data.AUTOTUNE` lets TensorFlow choose an optimal prefetch buffer size.

The order matters: shuffling before batching ensures that elements are randomized across batches each epoch. Prefetching should generally be the last step.

```python
def configure_for_performance(ds):
    ds = ds.shuffle(buffer_size=BUFFER_SIZE)  # A reasonably large fixed buffer
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds

final_ds = configure_for_performance(processed_ds)
```

## Step 5: Inspect the Output

Let's retrieve one batch from our final dataset and check its shape and value range.

```python
# Take one batch from the dataset
image_batch = next(iter(final_ds))

# Check the shape - should be (BATCH_SIZE, IMG_HEIGHT, IMG_WIDTH, 3)
print(f"Batch shape: {image_batch.shape}")

# Check the value range - should be roughly [0, 1]
print(f"Min value: {tf.reduce_min(image_batch).numpy()}")
print(f"Max value: {tf.reduce_max(image_batch).numpy()}")

# Display the first image in the batch
plt.figure(figsize=(6, 6))
plt.imshow(image_batch[0])
plt.title("Sample Image from Batch")
plt.axis("off")
plt.show()
```

If the shape is `(32, 128, 128, 3)` (given our constants) and the values are between 0 and 1, our pipeline is working correctly. The displayed image should look like one of your input images, possibly flipped due to augmentation.

## Step 6: Using the Pipeline with model.fit

This `final_ds` object is now ready to be passed directly to Keras:

```python
# Assume 'model' is a compiled Keras model
# model.fit(final_ds, epochs=10)

# Similarly for evaluation:
# validation_ds = ... build a similar pipeline for validation data ...
# model.evaluate(validation_ds)
```
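To make "build a similar pipeline" concrete, here is a minimal sketch of a matching validation pipeline. The path pattern and the names `val_list_ds` and `validation_ds` are assumptions based on the directory layout above; note that we skip augmentation and shuffling, since evaluation should be deterministic:

```python
# File paths for the validation split (pattern assumed from the layout above)
val_list_ds = tf.data.Dataset.list_files("data/validation/*/*.jpg", shuffle=False)

validation_ds = (
    val_list_ds
    .map(decode_img, num_parallel_calls=tf.data.AUTOTUNE)  # decode only; no augment_img
    .batch(BATCH_SIZE)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
```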
You have successfully built an efficient image data pipeline using `tf.data`! This pipeline handles file discovery, parallel image loading and preprocessing, augmentation, shuffling, batching, and prefetching, all decoupled from the model training loop itself. This approach is scalable and essential for working with large image datasets in TensorFlow.

For supervised learning tasks, you can adapt the `process_path` function to include label extraction based on your directory structure or file naming conventions, as in the sketch below.
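Here is a minimal sketch of label extraction for the directory layout assumed above, where each image's class is the name of its parent directory. The names `class_names` and `process_path_with_label` are illustrative, not part of the pipeline built earlier:

```python
# Assumed class folders under data/train/ (adapt to your dataset)
class_names = ["class_a", "class_b"]

def process_path_with_label(file_path):
    # Split the path into components; the second-to-last is the class directory
    parts = tf.strings.split(file_path, os.path.sep)
    one_hot = parts[-2] == class_names   # boolean vector, True at the class index
    label = tf.argmax(one_hot)           # integer-encode the label
    img = decode_img(file_path)
    return img, label

# Map as before; each element is now an (image, label) pair
labeled_ds = list_ds.map(process_path_with_label, num_parallel_calls=tf.data.AUTOTUNE)
```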