Now that we've discussed methods for creating synthetic tabular data, let's put theory into practice. In this hands-on section, we'll generate a simple synthetic dataset with several columns, mimicking a basic customer transaction table. We will focus on the technique of generating values for each column independently, drawing from statistical distributions or predefined rules, as covered earlier in this chapter. This exercise uses Python along with the popular pandas and numpy libraries, common tools for data manipulation.
First, ensure you have Python installed. You'll also need the pandas and numpy libraries. If you don't have them installed, you can typically install them using pip:
pip install pandas numpy
Once installed, let's import them into our Python script or notebook:
import pandas as pd
import numpy as np
print("Libraries imported successfully!")
Let's imagine we need a dataset representing customer transactions. We'll create a table with the following columns:
- CustomerID: A unique identifier for each customer (integer).
- Age: The age of the customer (integer).
- ProductCategory: The category of the product purchased (text/categorical).
- PurchaseAmount: The amount spent in the transaction (float/numerical).
We'll aim to generate 100 rows of synthetic data for this structure.
We will generate the data for each column independently based on simple rules or distributions.
For CustomerID, we can simply create a sequence of unique integers from 1 up to the number of rows we want (100 in this case).
num_rows = 100
customer_ids = np.arange(1, num_rows + 1)
# Display the first 5 generated IDs
print(customer_ids[:5])
Let's assume customer ages roughly follow a normal distribution. We can use numpy to sample ages from a normal distribution, centering them around 35 with a standard deviation of 10. Since age must be positive and typically an integer, we'll take the absolute value and convert to integers.
# Sample ages from a normal distribution (mean=35, std_dev=10)
np.random.seed(42) # for reproducible results
ages_float = np.random.normal(loc=35, scale=10, size=num_rows)
# Ensure ages are positive and convert to integers
ages = np.abs(ages_float).astype(int)
# Ensure a minimum age, e.g., 18
ages = np.maximum(ages, 18)
# Display the first 5 generated ages
print(ages[:5])
Note that np.abs() and .astype(int) alone could still produce implausibly young customers, so we clamp the values with np.maximum(ages, 18), assuming transactions come from adults. Setting np.random.seed(42) makes the example reproducible.
For ProductCategory, we'll define a list of possible categories and randomly choose from this list for each row. We can also assign probabilities to make some categories more frequent than others.
categories = ['Electronics', 'Clothing', 'Groceries', 'Home Goods']
# Assign probabilities: Electronics (30%), Clothing (25%), Groceries (25%), Home Goods (20%)
category_probabilities = [0.30, 0.25, 0.25, 0.20]
# Generate categories based on probabilities
product_categories = np.random.choice(categories, size=num_rows, p=category_probabilities)
# Display the first 5 generated categories
print(product_categories[:5])
Let's generate PurchaseAmount. We can use a uniform distribution for simplicity, assuming purchases range from $5 to $500.
# Generate purchase amounts from a uniform distribution between 5 and 500
purchase_amounts = np.random.uniform(low=5.0, high=500.0, size=num_rows)
# Round to 2 decimal places for currency representation
purchase_amounts = np.round(purchase_amounts, 2)
# Display the first 5 generated amounts
print(purchase_amounts[:5])
Now that we have generated data for each column, we can combine them into a pandas DataFrame. A DataFrame is essentially a table, perfect for our structured data.
# Create a dictionary to hold our data
synthetic_data = {
'CustomerID': customer_ids,
'Age': ages,
'ProductCategory': product_categories,
'PurchaseAmount': purchase_amounts
}
# Create the DataFrame
synthetic_df = pd.DataFrame(synthetic_data)
# Display the first few rows of the synthetic table
print("First 5 rows of the synthetic table:")
print(synthetic_df.head())
Let's quickly inspect the table we've created using some basic pandas functions.
# Get information about the columns and data types
print("\nTable Information:")
synthetic_df.info()
# Get basic statistics for numerical columns
print("\nBasic Statistics:")
print(synthetic_df.describe())
# Get value counts for the categorical column
print("\nProduct Category Counts:")
print(synthetic_df['ProductCategory'].value_counts())
You should see output confirming the data types (integer for ID and Age, object/string for Category, float for Amount) and summary statistics like mean, min, max for the numerical columns, along with the counts for each product category. Notice how the category counts roughly align with the probabilities we specified.
Visualizations can help us quickly understand the distributions in our synthetic data. Let's create a histogram for the Age column.
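The plotting code isn't shown on the original page; as a minimal sketch, assuming matplotlib is installed (pip install matplotlib), the histogram could be produced like this:
# Minimal sketch, assuming matplotlib is available
import matplotlib.pyplot as plt
plt.hist(synthetic_df['Age'], bins=15, edgecolor='black')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()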
Figure: Histogram showing the frequency of different age groups in the synthetic dataset. The distribution peaks around the mid-30s, as expected from our generation method.
Let's also visualize the distribution of PurchaseAmount.
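Again as a sketch, reusing the plt import from above:
# Minimal sketch of the second histogram
plt.hist(synthetic_df['PurchaseAmount'], bins=20, edgecolor='black')
plt.title('Distribution of Purchase Amount')
plt.xlabel('Purchase Amount ($)')
plt.ylabel('Frequency')
plt.show()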
Figure: Histogram showing the frequency of different purchase amount ranges. The distribution appears relatively uniform, as we sampled from a uniform distribution.
Congratulations! You've successfully generated a simple synthetic tabular dataset using Python. We defined a structure, generated data for each column independently using numpy for both numerical data (sampling from distributions) and categorical data (sampling from a list with probabilities), and assembled it into a pandas DataFrame.
This method is straightforward but has limitations. Most significantly, generating columns independently means we haven't explicitly modeled any relationships between columns (e.g., older customers might tend to buy certain categories more, or purchase amounts might differ across categories). In real data, such correlations often exist. The section on "Preserving Basic Column Correlations" introduced this challenge, and more advanced techniques (beyond the scope of this introductory course) aim to capture these dependencies.
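To make this limitation concrete, here is a hypothetical sketch of one simple way to introduce a dependency: drawing PurchaseAmount from a different range per ProductCategory. The per-category ranges below are illustrative assumptions, not values from this chapter.
# Hypothetical illustration: make PurchaseAmount depend on ProductCategory.
# The per-category (low, high) ranges are illustrative assumptions.
category_ranges = {
    'Electronics': (50.0, 500.0),
    'Clothing': (15.0, 150.0),
    'Groceries': (5.0, 80.0),
    'Home Goods': (20.0, 300.0),
}
# Draw each amount from the range matching that row's category
dependent_amounts = np.round(
    [np.random.uniform(*category_ranges[cat]) for cat in product_categories], 2
)
synthetic_df['PurchaseAmount'] = dependent_amounts
Average purchase amounts now differ by category, a simple form of column dependency; richer relationships require the more advanced techniques mentioned above.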
This practical exercise provides a foundation. You can experiment by changing the distributions, probabilities, or adding more columns. This simple synthetic table could now potentially be used for basic testing of data loading pipelines or simple model training scenarios, keeping its limitations in mind. The next chapters will delve into generating other types of data, like images, and discuss how to evaluate the quality of the synthetic data we create.
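For instance, to use the table in a data loading test, you could write it to a CSV file (the filename here is just an example):
# Save the synthetic table for downstream testing
synthetic_df.to_csv('synthetic_transactions.csv', index=False)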