Now that we understand the benefits of using the tf.data
API, let's look at the fundamental first step: creating a tf.data.Dataset
object. The API provides convenient methods to ingest data from common structures you likely already use, such as in-memory tensors, NumPy arrays, or even standard Python generators. This flexibility allows you to start building pipelines quickly, regardless of your data's initial format.
The most straightforward way to create a dataset is when your data already exists in memory as TensorFlow tensors or NumPy arrays. This is common for smaller datasets that fit comfortably into your machine's RAM or when working with data you've just generated or loaded using other libraries.
The primary function for this is tf.data.Dataset.from_tensor_slices(). This function takes tensors as input and creates a dataset where each element corresponds to a slice across the first dimension of the input tensors.
Let's see a simple example using a NumPy array:
import tensorflow as tf
import numpy as np
# Example NumPy array
numpy_data = np.arange(10)
print(f"Original NumPy array: {numpy_data}")
# Create a dataset from the NumPy array
dataset_from_numpy = tf.data.Dataset.from_tensor_slices(numpy_data)
print("\nDataset elements:")
# Iterate through the dataset to see the slices
for element in dataset_from_numpy:
    # Each element is a tf.Tensor
    print(element.numpy())
Output:
Original NumPy array: [0 1 2 3 4 5 6 7 8 9]
Dataset elements:
0
1
2
3
4
5
6
7
8
9
As you can see, from_tensor_slices
treated the 1D NumPy array [0, 1, ..., 9]
as a collection of 10 individual elements, creating a dataset that yields each number one by one.
The same applies to TensorFlow tensors:
# Example TensorFlow tensor
tensor_data = tf.range(5, 10)
print(f"Original TensorFlow tensor: {tensor_data.numpy()}")
# Create a dataset from the tensor
dataset_from_tensor = tf.data.Dataset.from_tensor_slices(tensor_data)
print("\nDataset elements:")
for element in dataset_from_tensor:
    print(element.numpy())
Output:
Original TensorFlow tensor: [5 6 7 8 9]
Dataset elements:
5
6
7
8
9
A significant use case for from_tensor_slices
is handling paired data, like features and corresponding labels. You can pass a tuple (or dictionary) of tensors or NumPy arrays, and tf.data
will slice them together, ensuring alignment.
# Example features (e.g., measurements) and labels (e.g., categories)
features = np.array([[1, 2], [3, 4], [5, 6]])
labels = np.array([0, 1, 0])
print(f"Features:\n{features}")
print(f"Labels: {labels}")
# Create a dataset from a tuple of NumPy arrays
dataset_features_labels = tf.data.Dataset.from_tensor_slices((features, labels))
print("\nDataset elements (feature, label pairs):")
for feature_element, label_element in dataset_features_labels:
    print(f"Feature: {feature_element.numpy()}, Label: {label_element.numpy()}")
Output:
Features:
[[1 2]
[3 4]
[5 6]]
Labels: [0 1 0]
Dataset elements (feature, label pairs):
Feature: [1 2], Label: 0
Feature: [3 4], Label: 1
Feature: [5 6], Label: 0
Notice how each element yielded by the dataset is now a tuple containing one slice from the features
array and the corresponding slice from the labels
array. This structure is exactly what you often need for supervised learning tasks.
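The same alignment applies if you pass a dictionary instead of a tuple: each element becomes a dictionary mapping the same keys to slices. Here is a minimal sketch reusing the features and labels arrays from above; the key names "measurements" and "category" are just illustrative:
# Slicing a dictionary of arrays yields one dictionary per element;
# each value is sliced along its first dimension, keeping rows aligned
dataset_dict = tf.data.Dataset.from_tensor_slices(
    {"measurements": features, "category": labels}
)

for element in dataset_dict:
    # Each element is a dict with the same keys as the input
    print({name: value.numpy() for name, value in element.items()})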
It's important to distinguish from_tensor_slices from tf.data.Dataset.from_tensors(). The latter creates a dataset with a single element, which is the input tensor itself, rather than slicing it.
# Using from_tensors
single_element_dataset = tf.data.Dataset.from_tensors(features)
print("\nDataset created with from_tensors:")
for element in single_element_dataset:
    print("Element shape:", element.shape)
    print(element.numpy())
Output:
Dataset created with from_tensors:
Element shape: (3, 2)
[[1 2]
[3 4]
[5 6]]
Use from_tensors
when you want to treat the entire tensor structure as one item in your dataset, perhaps for later batching or processing. Use from_tensor_slices
when you want to iterate over the individual rows (or slices along the first dimension) of your data.
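To make the contrast concrete, here is a small sketch (reusing the features array from above) that compares how many elements each constructor produces:
# from_tensor_slices: one element per row of `features`
sliced = tf.data.Dataset.from_tensor_slices(features)

# from_tensors: the whole array becomes a single element
whole = tf.data.Dataset.from_tensors(features)

print("from_tensor_slices elements:", sliced.cardinality().numpy())  # 3
print("from_tensors elements:", whole.cardinality().numpy())         # 1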
Sometimes your data isn't readily available as a tensor or NumPy array. It might be generated on-the-fly, read from a source not directly supported by TensorFlow (like a custom file format or database), or require complex Python logic for creation or preprocessing that's difficult to express purely in TensorFlow operations. In such cases, you can use a Python generator function.
The tf.data.Dataset.from_generator() method bridges the gap between Python code execution and the TensorFlow graph. It allows you to wrap a Python generator function and turn its yielded items into a tf.data.Dataset.
Here's the basic structure:

1. Define a Python generator function (one that uses yield) to produce your data items one by one.
2. Describe the shape and dtype of each yielded item via the output_signature argument, typically defined with tf.TensorSpec.
3. Call tf.data.Dataset.from_generator(), passing your generator function and the output_signature.

Let's look at an example where a generator produces sequences of incrementing numbers:
import tensorflow as tf
import numpy as np  # The generator yields NumPy arrays

# 1. Define the Python generator
def count_generator(stop):
    """Generates sequences [0], [0, 1], [0, 1, 2], ... up to stop."""
    for i in range(1, stop + 1):
        # Yield a sequence as a NumPy array
        sequence = np.arange(i)
        yield sequence
# 2. Define the output signature
# Since the sequences have variable length, we use None for the shape dimension
output_signature = tf.TensorSpec(shape=(None,), dtype=tf.int64)
# 3. Create the dataset
# We use a lambda to pass the 'stop' argument to our generator
stop_value = 5
dataset_from_generator = tf.data.Dataset.from_generator(
    lambda: count_generator(stop_value),  # Use a lambda for arguments
    output_signature=output_signature
)
print("\nDataset elements from generator:")
for element in dataset_from_generator:
    print(element.numpy())
Output:
Dataset elements from generator:
[0]
[0 1]
[0 1 2]
[0 1 2 3]
[0 1 2 3 4]
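As an aside, the variable length of these elements matters when you later batch them: a plain batch call cannot stack tensors of different shapes, so you would typically reach for padded_batch instead. A minimal sketch, reusing dataset_from_generator from above:
# Pad every sequence in a batch to the length of the longest sequence in it
padded = dataset_from_generator.padded_batch(2)

for batch in padded:
    print(batch.numpy())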
Key points about from_generator:

- output_signature: This is mandatory and critical. tf.TensorSpec(shape=..., dtype=...) precisely describes each yielded element. Use None for dimensions with variable sizes. If your generator yields multiple items (like features and labels), provide a tuple or dictionary of tf.TensorSpec objects, as in the sketch after this list.
- Performance: Because from_generator involves running Python code (which can be slower than native TensorFlow operations and subject to Python's Global Interpreter Lock), it might not be as performant as from_tensor_slices or reading from TFRecord files, especially for simple transformations. However, for complex data generation or reading from unsupported sources, it's an invaluable tool. TensorFlow runs the generator code within a tf.py_function internally.
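For instance, a generator that yields (feature, label) pairs can be described with a tuple of specs. Here is a minimal sketch; the pair_generator function, its fixed-length float features, and its scalar integer labels are illustrative, not part of the example above:
# Hypothetical generator yielding (feature, label) pairs
def pair_generator():
    yield np.array([0.1, 0.2], dtype=np.float32), 0
    yield np.array([0.3, 0.4], dtype=np.float32), 1

paired_dataset = tf.data.Dataset.from_generator(
    pair_generator,
    output_signature=(
        tf.TensorSpec(shape=(2,), dtype=tf.float32),  # fixed-length feature vector
        tf.TensorSpec(shape=(), dtype=tf.int32),      # scalar integer label
    ),
)

for feature, label in paired_dataset:
    print(f"Feature: {feature.numpy()}, Label: {label.numpy()}")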
Using from_tensor_slices for in-memory data and from_generator for custom Python-based data sources provides a solid foundation for creating tf.data.Dataset objects, paving the way for the powerful transformations and optimizations we'll explore next.