All Courses

Applying Transformations: map()

Once you have a tf.data.Dataset object, whether created from tensors, NumPy arrays, or files like TFRecords, the next step is often to preprocess the data. You might need to normalize numerical features, resize images, decode byte strings, or apply other custom logic to prepare the data for your model. The tf.data API provides the map() transformation for exactly this purpose.

The map(map_func, num_parallel_calls=None) method applies a given function, map_func, to each element of the dataset, returning a new dataset with the transformed elements. This is analogous to the map function in standard Python or operations in libraries like Pandas, but designed to work efficiently within the TensorFlow ecosystem.

Basic Usage

Let's start with a simple example. Suppose we have a dataset of numerical values and we want to normalize them by squaring each element.

import tensorflow as tf
import numpy as np

# Create a dataset from a NumPy array
numeric_dataset = tf.data.Dataset.from_tensor_slices(np.arange(1, 6, dtype=np.float32))

# Define a simple mapping function (can be a lambda or a regular function)
def square_element(x):
  return x * x

# Apply the map transformation
squared_dataset = numeric_dataset.map(square_element)

# Iterate through the transformed dataset to see the results
print("Original Dataset:")
for element in numeric_dataset:
  print(element.numpy())

print("\nSquared Dataset:")
for element in squared_dataset:
  print(element.numpy())

Output:

Original Dataset:
1.0
2.0
3.0
4.0
5.0

Squared Dataset:
1.0
4.0
9.0
16.0
25.0

The function passed to map() receives a single argument representing one element from the dataset (which might be a single tensor, or a tuple/dictionary of tensors, depending on how the dataset was constructed). It should return the transformed element(s). TensorFlow automatically traces this function to build a graph for efficient execution.

Applying TensorFlow Operations

For performance, it's best to use TensorFlow operations (tf.*) inside your mapping function whenever possible. This allows TensorFlow to integrate the preprocessing steps directly into the computational graph, potentially running them on accelerators like GPUs or TPUs and avoiding costly data transfers between Python and the TensorFlow runtime.

Consider decoding and resizing images stored as byte strings (a common scenario when reading from TFRecord files):

# Assume we have a dataset where each element is a scalar string tensor
# containing JPEG encoded image data. (We'll simulate this here)
# In a real scenario, this might come from tf.data.TFRecordDataset
jpeg_strings = [
    tf.random.uniform(shape=(), minval=0, maxval=255, dtype=tf.int32), # Simulate some bytes
    tf.random.uniform(shape=(), minval=0, maxval=255, dtype=tf.int32)
]
# A real image string would be much longer and structured.
# We'll use dummy strings for demonstration.
dummy_image_dataset = tf.data.Dataset.from_tensor_slices(["dummy_jpeg1", "dummy_jpeg2"])

def parse_and_resize_image(jpeg_string):
  # Pretend tf.io.decode_jpeg works on these dummies.
  # In reality, you'd have actual JPEG bytes.
  # image = tf.io.decode_jpeg(jpeg_string, channels=3)

  # Simulate decoding to a random tensor shape
  image = tf.random.uniform(shape=(200, 200, 3), maxval=256, dtype=tf.int32)
  image = tf.cast(image, tf.float32) # Cast to float for processing

  # Resize the image
  image = tf.image.resize(image, [128, 128])

  # Normalize pixel values (example)
  image = image / 255.0
  return image

# Apply the preprocessing function using map
processed_image_dataset = dummy_image_dataset.map(parse_and_resize_image)

# Check the shape of the first element
for img in processed_image_dataset.take(1):
    print("Processed image shape:", img.shape)
    # print("Sample pixel value:", img.numpy()[0,0,0]) # Value would be between 0 and 1

Output (shape will be consistent, value will vary):

Processed image shape: (128, 128, 3)

Here, tf.image.decode_jpeg, tf.image.resize, and standard arithmetic operations are all TensorFlow ops, ensuring efficient graph-based execution.

Parallelizing the Transformation

Preprocessing can sometimes be computationally intensive, especially for complex operations or large data elements like high-resolution images. If your preprocessing steps for different elements are independent, you can parallelize the map transformation to speed things up.

The num_parallel_calls argument controls how many elements are processed concurrently. Setting it to tf.data.AUTOTUNE allows TensorFlow to dynamically adjust the level of parallelism based on available CPU resources during runtime, which is often the recommended approach.

# Applying the image processing function with parallel calls
processed_image_dataset_parallel = dummy_image_dataset.map(
    parse_and_resize_image,
    num_parallel_calls=tf.data.AUTOTUNE
)

# Iteration would yield the same results, but potentially faster
# for img in processed_image_dataset_parallel.take(1):
#     print("Processed image shape (parallel):", img.shape)

Using tf.data.AUTOTUNE lets the tf.data runtime find a good level of parallelism, balancing throughput with resource usage without requiring manual tuning.

This diagram illustrates how map with num_parallel_calls processes multiple dataset elements concurrently using the provided function.

Handling Python Logic with `tf.py_function`

While TensorFlow operations are preferred, sometimes you need to use arbitrary Python code within your mapping function, perhaps to call a library that doesn't have TensorFlow equivalents (e.g., certain audio processing or complex string manipulation libraries). In such cases, you can wrap your Python function using tf.py_function.

tf.py_function allows you to embed non-TensorFlow Python code into your TensorFlow graph. However, be aware of the implications:

Performance: It introduces overhead due to switching between the TensorFlow runtime and the Python interpreter. It breaks graph serialization (you can't easily save a model graph that relies heavily on py_function). Parallelism might be limited by Python's Global Interpreter Lock (GIL).
Serialization: Models using tf.py_function might not be easily exportable to environments without a Python interpreter (like TensorFlow Lite or TensorFlow Serving in some configurations).

Use tf.py_function sparingly and only when necessary.

import cv2 # Example: Using OpenCV, a non-TF library

def python_based_processing(tensor_path):
  # Convert tensor path (bytes) to string
  path_str = tensor_path.numpy().decode('utf-8')
  # Use OpenCV (or any Python library)
  # img = cv2.imread(path_str)
  # processed_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # Simulate some processing returning a numpy array
  processed_data = np.array([len(path_str)], dtype=np.int32) # Example: length of path
  return processed_data

# Assume dataset contains string tensors representing file paths
paths_dataset = tf.data.Dataset.from_tensor_slices(["/path/to/file1.txt", "/path/to/file2.txt"])

def tf_wrapper_func(tensor_path):
  # Define the output type and shape for TensorFlow
  result = tf.py_function(
      func=python_based_processing,
      inp=[tensor_path],       # Input tensor(s)
      Tout=tf.int32            # Output TensorFlow dtype(s)
  )
  # Set the shape explicitly if known, as py_function loses shape info
  result.set_shape([1]) # Example output shape
  return result

# Map using the wrapped function
processed_py_dataset = paths_dataset.map(tf_wrapper_func)

for item in processed_py_dataset:
    print(item.numpy())

Output:

[18]
[18]

The map transformation is a fundamental tool for preparing your data within a tf.data pipeline. By applying TensorFlow operations efficiently, potentially in parallel, you can create clean, performant preprocessing stages that integrate smoothly with model training. Remember to use TensorFlow ops where possible and resort to tf.py_function only when external Python libraries are essential.

Was this section helpful?

Applying Transformations: map()

Basic Usage

Applying TensorFlow Operations

Parallelizing the Transformation

Handling Python Logic with tf.py_function

Handling Python Logic with `tf.py_function`