Okay, let's roll up our sleeves and put the methods we've discussed into practice. This section will guide you through generating simple synthetic data using Python, a common language for machine learning tasks. We'll use the NumPy library, which is fundamental for numerical operations in Python. If you haven't used it before, don't worry. We'll keep things straightforward.
Before we start generating data, you need a Python environment with the NumPy library installed. If you have Python set up (perhaps through an Anaconda distribution or directly), you can typically install NumPy using pip, Python's package installer. Open your terminal or command prompt and type:
pip install numpy
For organizing and displaying our data nicely, especially when combining different types, the Pandas library is very helpful. Let's install that too:
pip install pandas
Now, let's import these libraries in our Python script or interactive session (like a Jupyter notebook):
import numpy as np
import pandas as pd
# Set a seed for reproducibility (optional, but good practice)
# This makes the random numbers predictable for demonstration purposes.
np.random.seed(42)
Setting a seed ensures that every time you run the code, you get the exact same sequence of "random" numbers. This is useful for debugging and sharing examples.
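To see reproducibility in action, here is a quick sketch (using an arbitrary seed of 123): resetting the seed before each draw produces identical sequences.

```python
import numpy as np

# First draw with a fixed seed
np.random.seed(123)
first_run = np.random.uniform(0, 1, size=5)

# Reset to the same seed and draw again
np.random.seed(123)
second_run = np.random.uniform(0, 1, size=5)

print(np.array_equal(first_run, second_run))  # True: the sequences match exactly
```

Without the second `np.random.seed(123)` call, the second draw would continue the random stream and produce different numbers.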
We learned that sampling from statistical distributions is a common way to generate numerical data. Let's try generating two types: uniformly distributed and normally distributed data.
Imagine we want to simulate product prices that are equally likely to be any value between $20 and $100. We can use NumPy's random.uniform function.
# Generate 10 synthetic product prices between 20 and 100
num_samples = 10
min_price = 20.0
max_price = 100.0
uniform_prices = np.random.uniform(low=min_price, high=max_price, size=num_samples)
# Let's round them to 2 decimal places for realism
uniform_prices = np.round(uniform_prices, 2)
print("Generated Uniform Prices:")
print(uniform_prices)
This will output an array of 10 prices, like [34.9, 89.01, 80.49, 65.34, 26.38, 43.61, 80.13, 73.8, 84.47, 73.6]. Each number had an equal chance of appearing anywhere within the $20 to $100 range.
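A quick sanity check (a sketch with an arbitrary seed, separate from the example above): with a large sample, every value should fall inside the requested range, and the sample mean should sit near the midpoint of $60.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for this check
many_prices = np.random.uniform(low=20.0, high=100.0, size=10_000)

# Every sample must land inside [20, 100)
print(many_prices.min() >= 20.0, many_prices.max() < 100.0)

# The sample mean should be close to the midpoint, (20 + 100) / 2 = 60
print(round(many_prices.mean(), 1))
```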
Now, let's simulate something that clusters around an average value, like customer ages. Suppose the average age (μ) is 35 years, with a standard deviation (σ) of 8 years. This means most ages will be close to 35, but some will be younger or older. We use random.normal.
# Generate 10 synthetic customer ages (mean=35, std_dev=8)
num_samples = 10
mean_age = 35.0
std_dev_age = 8.0
normal_ages = np.random.normal(loc=mean_age, scale=std_dev_age, size=num_samples)
# Ages are usually whole numbers, so let's round and ensure they are non-negative
normal_ages = np.round(normal_ages).astype(int)
normal_ages = np.maximum(normal_ages, 0) # Ensure no negative ages
print("\nGenerated Normal Ages:")
print(normal_ages)
This might output ages like [39 32 41 49 32 31 44 37 34 32]. Notice how most values are relatively close to 35.
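We can confirm that the generated data actually follows the parameters we chose. With enough samples, the empirical mean and standard deviation converge on the values passed to random.normal (a sketch with an arbitrary seed):

```python
import numpy as np

np.random.seed(1)  # arbitrary seed for this check
sample = np.random.normal(loc=35.0, scale=8.0, size=100_000)

# With many samples, the estimates converge on the chosen parameters
print(round(sample.mean(), 1))  # close to 35
print(round(sample.std(), 1))   # close to 8
```

This kind of check is a good habit: it verifies that your generation code matches the statistical assumptions you intended.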
Let's visualize the distribution of a larger sample of these ages using a histogram.
# Generate 1000 ages for a better view of the distribution
lots_of_ages = np.random.normal(loc=mean_age, scale=std_dev_age, size=1000)
lots_of_ages = np.round(lots_of_ages).astype(int)
lots_of_ages = np.maximum(lots_of_ages, 0)
# We'll prepare data for a Plotly chart
hist_ages, bin_edges = np.histogram(lots_of_ages, bins=15)
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])
Histogram showing the frequency of synthetic ages generated using a normal distribution (mean=35, std dev=8). Notice the bell shape, characteristic of a normal distribution, centered around 35.
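If you don't have a plotting library handy, the binned counts from np.histogram can also be inspected directly. Here is a rough text rendering of the same distribution (an illustrative sketch, not part of the Plotly workflow above):

```python
import numpy as np

np.random.seed(42)
ages = np.round(np.random.normal(loc=35.0, scale=8.0, size=1000)).astype(int)
ages = np.maximum(ages, 0)

counts, edges = np.histogram(ages, bins=15)

# Print a crude text histogram: one '#' per 10 samples in each bin
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = '#' * (count // 10)
    print(f"{left:5.1f}-{right:5.1f} | {bar}")
```

The longest bars cluster near 35, giving a rough bell shape even in plain text.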
Now let's generate some categorical data.
Suppose we want to assign a product category: 'Electronics', 'Clothing', or 'Home Goods'. Maybe 'Electronics' is most common (50% chance), 'Clothing' less common (30%), and 'Home Goods' the least (20%). We can use random.choice.
# Define categories and their probabilities
categories = ['Electronics', 'Clothing', 'Home Goods']
probabilities = [0.5, 0.3, 0.2] # Must sum to 1.0
# Generate 10 synthetic categories based on probabilities
num_samples = 10
random_categories = np.random.choice(categories, size=num_samples, p=probabilities)
print("\nGenerated Random Categories:")
print(random_categories)
This might output: ['Clothing' 'Home Goods' 'Electronics' 'Electronics' 'Electronics' 'Clothing' 'Electronics' 'Electronics' 'Electronics' 'Clothing']. You'd expect 'Electronics' to appear roughly 5 times, 'Clothing' 3 times, and 'Home Goods' 2 times, though with only 10 samples, there will be random variation.
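With a much larger sample, the observed frequencies settle close to the probabilities we specified. A sketch (arbitrary seed, 10,000 draws):

```python
import numpy as np

np.random.seed(7)  # arbitrary seed for this check
categories = ['Electronics', 'Clothing', 'Home Goods']
probabilities = [0.5, 0.3, 0.2]

# Draw many samples and count how often each category appears
draws = np.random.choice(categories, size=10_000, p=probabilities)
values, counts = np.unique(draws, return_counts=True)
for value, count in zip(values, counts):
    print(value, round(count / len(draws), 2))
```

You should see proportions near 0.5, 0.3, and 0.2, illustrating the law of large numbers at work.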
Sometimes, one piece of data depends on another. Let's create a 'HighValue' flag (True/False) based on the uniform_prices we generated earlier. Let's say any product priced over $75 is considered high value.
# Apply a rule based on the generated prices
high_value_flags = uniform_prices > 75.0
print("\nGenerated High Value Flags (based on price > $75):")
print(high_value_flags)
This will output an array of True or False values corresponding to each price in uniform_prices, e.g., [False True True False False False True False True False].
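The same rule-based idea extends beyond booleans. For instance, np.where can map the condition to readable string labels in one vectorized step (a hypothetical 'tier' label, not part of the dataset we're building; the sample prices are made up for illustration):

```python
import numpy as np

prices = np.array([34.90, 89.01, 80.49, 65.34, 26.38])

# Map the boolean condition to string labels in one vectorized step
tiers = np.where(prices > 75.0, 'Premium', 'Standard')
print(tiers)  # ['Standard' 'Premium' 'Premium' 'Standard' 'Standard']
```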
Often, we want to combine different generated features into a structured dataset, like a table. The Pandas library is excellent for this. Let's combine our prices, ages, categories, and high-value flags into a DataFrame.
# Ensure all arrays have the same length (10 samples in our case)
assert len(uniform_prices) == num_samples
assert len(normal_ages) == num_samples
assert len(random_categories) == num_samples
assert len(high_value_flags) == num_samples
# Create a dictionary to hold our synthetic data
synthetic_data = {
'Price': uniform_prices,
'CustomerAge': normal_ages,
'Category': random_categories,
'HighValue': high_value_flags
}
# Create a Pandas DataFrame
synthetic_df = pd.DataFrame(synthetic_data)
print("\nCombined Synthetic Dataset (First 5 rows):")
print(synthetic_df.head())
This will display a neat table:
Price CustomerAge Category HighValue
0 34.90 39 Clothing False
1 89.01 32 Home Goods True
2 80.49 41 Electronics True
3 65.34 49 Electronics False
4 26.38 32 Electronics False
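Once the data is in a DataFrame, Pandas makes quick sanity checks easy. Here is a self-contained sketch that rebuilds a similar dataset and inspects it with describe() and value_counts() (the exact numbers depend on the order of random draws, so they may differ from the table above):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'Price': np.round(np.random.uniform(20.0, 100.0, size=10), 2),
    'CustomerAge': np.maximum(
        np.round(np.random.normal(35.0, 8.0, size=10)).astype(int), 0),
    'Category': np.random.choice(['Electronics', 'Clothing', 'Home Goods'],
                                 size=10, p=[0.5, 0.3, 0.2]),
})
df['HighValue'] = df['Price'] > 75.0

# Summary statistics for the numeric columns
print(df[['Price', 'CustomerAge']].describe())

# How many products fall in each category?
print(df['Category'].value_counts())
```

describe() reports count, mean, standard deviation, and quartiles, which is a fast way to confirm your generated columns look the way you intended.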
You've now created your first basic synthetic dataset! You generated numerical data using statistical distributions and categorical data using both random choice with probabilities and simple rules. You also combined these into a structured format.
These techniques form the foundation for generating more complex synthetic data. As you progress, you'll encounter methods that can better capture relationships between variables and generate more realistic data, but the core ideas of defining generation processes, sampling, and applying rules remain central.
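As a small preview of capturing relationships between variables, NumPy's random.multivariate_normal can draw two correlated features at once. This sketch uses made-up parameters (income loosely tied to age, with a target correlation of 0.6):

```python
import numpy as np

np.random.seed(3)  # arbitrary seed

# Mean age 35 (std 8), mean income 50,000 (std 12,000),
# with covariance encoding a correlation of 0.6 between them
means = [35.0, 50_000.0]
cov = [[8.0**2,                8.0 * 12_000.0 * 0.6],
       [8.0 * 12_000.0 * 0.6,  12_000.0**2]]

samples = np.random.multivariate_normal(means, cov, size=5_000)
ages, incomes = samples[:, 0], samples[:, 1]

# The sample correlation should be close to the 0.6 we encoded
print(round(np.corrcoef(ages, incomes)[0, 1], 2))
```

Unlike the independent columns we generated earlier, these two features move together, which is often essential for realistic synthetic data.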
© 2025 ApX Machine Learning