Now that we understand the benefits of using the tf.data
API, let's look at the fundamental first step: creating a tf.data.Dataset
object. The API provides convenient methods to ingest data from common structures you likely already use, such as in-memory tensors, NumPy arrays, or even standard Python generators. This flexibility allows you to start building pipelines quickly, regardless of your data's initial format.
The most straightforward way to create a dataset is when your data already exists in memory as TensorFlow tensors or NumPy arrays. This is common for smaller datasets that fit comfortably into your machine's RAM or when working with data you've just generated or loaded using other libraries.
The primary function for this is tf.data.Dataset.from_tensor_slices(). This function takes tensors as input and creates a dataset where each element corresponds to a slice across the first dimension of the input tensors.
Let's see a simple example using a NumPy array:
import tensorflow as tf
import numpy as np
# Example NumPy array
numpy_data = np.arange(10)
print(f"Original NumPy array: {numpy_data}")
# Create a dataset from the NumPy array
dataset_from_numpy = tf.data.Dataset.from_tensor_slices(numpy_data)
print("\nDataset elements:")
# Iterate through the dataset to see the slices
for element in dataset_from_numpy:
    # Each element is a tf.Tensor
    print(element.numpy())
Output:
Original NumPy array: [0 1 2 3 4 5 6 7 8 9]
Dataset elements:
0
1
2
3
4
5
6
7
8
9
As you can see, from_tensor_slices
treated the 1D NumPy array [0, 1, ..., 9]
as a collection of 10 individual elements, creating a dataset that yields each number one by one.
The same applies to TensorFlow tensors:
# Example TensorFlow tensor
tensor_data = tf.range(5, 10)
print(f"Original TensorFlow tensor: {tensor_data.numpy()}")
# Create a dataset from the tensor
dataset_from_tensor = tf.data.Dataset.from_tensor_slices(tensor_data)
print("\nDataset elements:")
for element in dataset_from_tensor:
    print(element.numpy())
Output:
Original TensorFlow tensor: [5 6 7 8 9]
Dataset elements:
5
6
7
8
9
A significant use case for from_tensor_slices
is handling paired data, like features and corresponding labels. You can pass a tuple (or dictionary) of tensors or NumPy arrays, and tf.data
will slice them together, ensuring alignment.
# Example features (e.g., measurements) and labels (e.g., categories)
features = np.array([[1, 2], [3, 4], [5, 6]])
labels = np.array([0, 1, 0])
print(f"Features:\n{features}")
print(f"Labels: {labels}")
# Create a dataset from a tuple of NumPy arrays
dataset_features_labels = tf.data.Dataset.from_tensor_slices((features, labels))
print("\nDataset elements (feature, label pairs):")
for feature_element, label_element in dataset_features_labels:
    print(f"Feature: {feature_element.numpy()}, Label: {label_element.numpy()}")
Output:
Features:
[[1 2]
[3 4]
[5 6]]
Labels: [0 1 0]
Dataset elements (feature, label pairs):
Feature: [1 2], Label: 0
Feature: [3 4], Label: 1
Feature: [5 6], Label: 0
Notice how each element yielded by the dataset is now a tuple containing one slice from the features
array and the corresponding slice from the labels
array. This structure is exactly what you often need for supervised learning tasks.
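The same alignment applies if you pass a dictionary instead of a tuple: each element becomes a dictionary mapping the same keys to slices. Here is a minimal sketch reusing the features and labels arrays from above; the key names "measurements" and "category" are just illustrative:
# Slicing a dictionary of arrays yields one dictionary per element;
# each value is sliced along its first dimension, keeping rows aligned
dataset_dict = tf.data.Dataset.from_tensor_slices(
    {"measurements": features, "category": labels}
)

for element in dataset_dict:
    # Each element is a dict with the same keys as the input
    print({name: value.numpy() for name, value in element.items()})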
It's important to distinguish from_tensor_slices from tf.data.Dataset.from_tensors(). The latter creates a dataset with a single element, which is the input tensor itself, rather than slicing it.
# Using from_tensors
single_element_dataset = tf.data.Dataset.from_tensors(features)
print("\nDataset created with from_tensors:")
for element in single_element_dataset:
    print("Element shape:", element.shape)
    print(element.numpy())
Output:
Dataset created with from_tensors:
Element shape: (3, 2)
[[1 2]
[3 4]
[5 6]]
Use from_tensors
when you want to treat the entire tensor structure as one item in your dataset, perhaps for later batching or processing. Use from_tensor_slices
when you want to iterate over the individual rows (or slices along the first dimension) of your data.
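To make the contrast concrete, here is a small sketch (reusing the features array from above) that compares how many elements each constructor produces:
# from_tensor_slices: one element per row of `features`
sliced = tf.data.Dataset.from_tensor_slices(features)

# from_tensors: the whole array becomes a single element
whole = tf.data.Dataset.from_tensors(features)

print("from_tensor_slices elements:", sliced.cardinality().numpy())  # 3
print("from_tensors elements:", whole.cardinality().numpy())         # 1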
Sometimes your data isn't readily available as a tensor or NumPy array. It might be generated on-the-fly, read from a source not directly supported by TensorFlow (like a custom file format or database), or require complex Python logic for creation or preprocessing that's difficult to express purely in TensorFlow operations. In such cases, you can use a Python generator function.
The tf.data.Dataset.from_generator() method bridges the gap between Python code execution and the TensorFlow graph. It allows you to wrap a Python generator function and turn its yielded items into a tf.data.Dataset.
Here's the basic structure:

1. Define a Python generator function (one that uses yield) to produce your data items one by one.
2. Describe the shape and dtype of each yielded item via the output_signature argument, typically defined with tf.TensorSpec.
3. Call tf.data.Dataset.from_generator(), passing your generator function and the output_signature.

Let's look at an example where a generator produces sequences of incrementing numbers:
import tensorflow as tf
import numpy as np  # The generator yields NumPy arrays

# 1. Define the Python generator
def count_generator(stop):
    """Generates sequences [0], [0, 1], [0, 1, 2], ... up to stop."""
    for i in range(1, stop + 1):
        # Yield a sequence as a NumPy array
        sequence = np.arange(i)
        yield sequence
# 2. Define the output signature
# Since the sequences have variable length, we use None for the shape dimension
output_signature = tf.TensorSpec(shape=(None,), dtype=tf.int64)
# 3. Create the dataset
# We use a lambda to pass the 'stop' argument to our generator
stop_value = 5
dataset_from_generator = tf.data.Dataset.from_generator(
    lambda: count_generator(stop_value),  # Use a lambda for arguments
    output_signature=output_signature
)
print("\nDataset elements from generator:")
for element in dataset_from_generator:
    print(element.numpy())
Output:
Dataset elements from generator:
[0]
[0 1]
[0 1 2]
[0 1 2 3]
[0 1 2 3 4]
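As an aside, the variable length of these elements matters when you later batch them: a plain batch call cannot stack tensors of different shapes, so you would typically reach for padded_batch instead. A minimal sketch, reusing dataset_from_generator from above:
# Pad every sequence in a batch to the length of the longest sequence in it
padded = dataset_from_generator.padded_batch(2)

for batch in padded:
    print(batch.numpy())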
Key points about from_generator:

- output_signature: This is mandatory and critical. tf.TensorSpec(shape=..., dtype=...) precisely describes each yielded element. Use None for dimensions with variable sizes. If your generator yields multiple items (like features and labels), provide a tuple or dictionary of tf.TensorSpec objects, as in the sketch after this list.
- Performance: Because from_generator involves running Python code (which can be slower than native TensorFlow operations and subject to Python's Global Interpreter Lock), it might not be as performant as from_tensor_slices or reading from TFRecord files, especially for simple transformations. However, for complex data generation or reading from unsupported sources, it's an invaluable tool. TensorFlow runs the generator code within a tf.py_function internally.
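For instance, a generator that yields (feature, label) pairs can be described with a tuple of specs. Here is a minimal sketch; the pair_generator function, its fixed-length float features, and its scalar integer labels are illustrative, not part of the example above:
# Hypothetical generator yielding (feature, label) pairs
def pair_generator():
    yield np.array([0.1, 0.2], dtype=np.float32), 0
    yield np.array([0.3, 0.4], dtype=np.float32), 1

paired_dataset = tf.data.Dataset.from_generator(
    pair_generator,
    output_signature=(
        tf.TensorSpec(shape=(2,), dtype=tf.float32),  # fixed-length feature vector
        tf.TensorSpec(shape=(), dtype=tf.int32),      # scalar integer label
    ),
)

for feature, label in paired_dataset:
    print(f"Feature: {feature.numpy()}, Label: {label.numpy()}")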
Using from_tensor_slices for in-memory data and from_generator for custom Python-based data sources provides a solid foundation for creating tf.data.Dataset objects, paving the way for the powerful transformations and optimizations we'll explore next.