Okay, we've discussed the general ideas behind generating data using statistical distributions and rule-based systems. Now let's put those ideas into practice by creating some simple numerical synthetic data. Numerical data, representing quantities or measurements (like age, temperature, or price), is a fundamental building block in many datasets.
One of the most straightforward ways to generate synthetic numerical data is by drawing samples from well-understood statistical distributions. Think of a distribution as a blueprint describing the likelihood of different numerical values occurring.
Imagine you need to generate customer ages for a simulation, but you only know the ages range from 18 to 80, and you have no reason to believe any age within that range is more likely than another. The uniform distribution is perfect for this. It assigns equal probability to all values within a specified range [a,b].
To generate numbers from a uniform distribution, you need to define the minimum value (a) and the maximum value (b). Then, you can randomly pick numbers within this interval, where each number has the same chance of being selected.
For example, to generate 5 synthetic ages between 18 and 80:
# Python example
import random

min_age = 18
max_age = 80
number_of_samples = 5

synthetic_ages = []
for _ in range(number_of_samples):
    # Ages are whole numbers, so draw integers directly.
    # randint includes both endpoints; int(random.uniform(...)) would
    # truncate and almost never produce max_age.
    age = random.randint(min_age, max_age)
    synthetic_ages.append(age)

print(synthetic_ages)
# Possible output: [45, 22, 71, 58, 30]
If we generated many points this way, they would be spread out fairly evenly across the 18-80 range.
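One way to check that claim is to draw a large sample and count how many values land in equal-width bins. The sketch below is only an illustration: the seed, the sample size, and the 9-year bin width are arbitrary choices, and it uses `random.randint` so that both 18 and 80 can occur.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

min_age, max_age = 18, 80
ages = [rng.randint(min_age, max_age) for _ in range(63_000)]

# 18..80 covers 63 whole ages, which split into 7 bins of 9 ages each.
bin_width = 9
counts = Counter((age - min_age) // bin_width for age in ages)

for bin_index in sorted(counts):
    low = min_age + bin_index * bin_width
    high = low + bin_width - 1
    print(f"{low}-{high}: {counts[bin_index]}")
```

With a uniform source, each bin should hold roughly one seventh of the sample (about 9,000 values here), with small random differences between bins.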
What if you need to generate data that clusters around a central value? For instance, simulating the heights of adult males, which tend to hover around an average height. The normal distribution (often called the Gaussian distribution or bell curve) is ideal here.
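To use a normal distribution, you need two parameters: the mean (μ), which sets the central value the data clusters around, and the standard deviation (σ), which controls how widely the values spread around that center.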
Let's generate 5 synthetic heights (in cm) assuming a mean (μ) of 175 cm and a standard deviation (σ) of 7 cm.
# Python example
import random

mean_height = 175
std_dev_height = 7
number_of_samples = 5

synthetic_heights = []
for _ in range(number_of_samples):
    # Generate a random number from a normal distribution
    height = random.gauss(mean_height, std_dev_height)
    # Round to a reasonable precision, e.g., one decimal place
    synthetic_heights.append(round(height, 1))

print(synthetic_heights)
# Possible output: [178.2, 166.5, 175.1, 184.0, 172.9]
Most generated heights will be close to 175 cm, with fewer values appearing further away. Generating many values would produce the characteristic bell shape.
A histogram showing simulated heights generated using a normal distribution with a mean of 175 cm and a standard deviation of 7 cm. Most values cluster around 175 cm.
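The clustering can also be checked numerically rather than visually. The following sketch, which assumes an arbitrary seed and a sample size of 10,000, draws heights with the same parameters and reports the sample mean, the sample standard deviation, and the share of values within one standard deviation of the mean. For a normal distribution that share is expected to be roughly 68%.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable

mean_height = 175
std_dev_height = 7
heights = [rng.gauss(mean_height, std_dev_height) for _ in range(10_000)]

sample_mean = statistics.mean(heights)
sample_std = statistics.stdev(heights)

# Fraction of samples between 168 cm and 182 cm (mean +/- one std dev)
within_one_sd = sum(
    abs(h - mean_height) <= std_dev_height for h in heights
) / len(heights)

print(f"sample mean:    {sample_mean:.2f}")
print(f"sample std dev: {sample_std:.2f}")
print(f"within one std dev: {within_one_sd:.1%}")
```

The sample statistics will not match the parameters exactly, but they should move closer to 175 and 7 as the sample size grows.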
Other distributions exist (like exponential for waiting times, or Poisson for counts), but uniform and normal are excellent starting points for generating simple numerical data.
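As a brief illustration of one of these, the sketch below samples waiting times from an exponential distribution using `random.expovariate` from Python's standard library. The 4-minute average wait is a made-up value for the example, and the seed is arbitrary. Note that `expovariate` takes the rate (1 divided by the mean), not the mean itself.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable

mean_wait_minutes = 4.0          # hypothetical average gap between arrivals
rate = 1.0 / mean_wait_minutes   # expovariate expects the rate, i.e. 1 / mean

waits = [rng.expovariate(rate) for _ in range(10_000)]

print([round(w, 1) for w in waits[:5]])
print(f"sample mean:   {statistics.mean(waits):.2f}")
print(f"sample median: {statistics.median(waits):.2f}")
```

Unlike the normal distribution, the values are never negative and are skewed: many short waits and a few long ones, so the median sits below the mean.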
Sometimes, numerical data follows specific patterns or depends on other values. Rule-based systems allow you to define these patterns explicitly.
The simplest rule generates sequential numbers. This is common for creating unique identifiers (IDs).
# Python example
start_id = 1001
number_of_records = 5

synthetic_ids = []
for i in range(number_of_records):
    synthetic_ids.append(start_id + i)

print(synthetic_ids)
# Output: [1001, 1002, 1003, 1004, 1005]
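Identifiers in real datasets are often strings with a fixed prefix and zero padding rather than bare integers. The helper below is a hypothetical sketch of that variation: the name `make_ids`, the `CUST-` prefix, and the six-digit width are all illustrative assumptions, not part of any particular system.

```python
def make_ids(start, count, prefix="CUST-", width=6):
    """Return `count` sequential IDs as zero-padded strings."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


synthetic_ids = make_ids(1001, 5)
print(synthetic_ids)
# Output: ['CUST-001001', 'CUST-001002', 'CUST-001003', 'CUST-001004', 'CUST-001005']
```

Because the numbers come from a single `range`, the IDs are unique by construction.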
You might need to generate a value based on another synthetic (or real) value. For example, calculating a total_price based on a base_price and a tax_amount.
# Python example
# Assume base_prices were generated earlier, e.g., using a distribution
base_prices = [50.0, 120.0, 75.5]
tax_rate = 0.08  # 8% tax

synthetic_totals = []
for price in base_prices:
    tax_amount = price * tax_rate
    total_price = price + tax_amount
    synthetic_totals.append(round(total_price, 2))

print(synthetic_totals)
# Output: [54.0, 129.6, 81.54]
Rules can also involve conditions. Perhaps sensor readings depend on the time of day.
# Python example - simplified
import random

# Assume 'hour_of_day' is generated or known (0-23)
hours = [6, 14, 22]  # Morning, Afternoon, Night

synthetic_temperatures = []
for hour in hours:
    if 6 <= hour < 12:  # Morning
        # Cooler temperature range
        temp = random.uniform(15, 20)
    elif 12 <= hour < 18:  # Afternoon
        # Warmer temperature range
        temp = random.uniform(22, 28)
    else:  # Evening/Night
        # Cooling down
        temp = random.uniform(18, 22)
    synthetic_temperatures.append(round(temp, 1))

print(synthetic_temperatures)
# Possible output: [17.8, 25.1, 19.5]
These rule-based methods allow you to embed specific logic or relationships directly into your synthetic data generation process.
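To show how the pieces fit together, here is a minimal sketch that builds a few complete records using a sequential ID, a randomly drawn hour, the time-of-day temperature rule, and a price with the 8% tax rule. The record layout, the 20 to 150 price range, the helper name `temperature_for`, and the seed are assumptions made for illustration only.

```python
import random

rng = random.Random(7)  # fixed seed (arbitrary) so the run is repeatable
TAX_RATE = 0.08         # same 8% rate as in the tax example


def temperature_for(hour, rng):
    """Conditional rule: pick a temperature range based on the hour (0-23)."""
    if 6 <= hour < 12:       # Morning
        low, high = 15, 20
    elif 12 <= hour < 18:    # Afternoon
        low, high = 22, 28
    else:                    # Evening/Night
        low, high = 18, 22
    return round(rng.uniform(low, high), 1)


records = []
for i in range(5):
    hour = rng.randint(0, 23)                        # distribution: uniform integer
    base_price = round(rng.uniform(20.0, 150.0), 2)  # distribution: hypothetical range
    records.append({
        "id": 1001 + i,                                       # rule: sequential ID
        "hour": hour,
        "temperature": temperature_for(hour, rng),            # rule: conditional
        "base_price": base_price,
        "total_price": round(base_price * (1 + TAX_RATE), 2), # rule: derived value
    })

for record in records:
    print(record)
```

Each field is either sampled from a distribution or derived by a rule from a field generated before it, which keeps the relationships between columns consistent within every record.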
By combining sampling from distributions and applying rules, you can start to create numerical synthetic data that reflects simple patterns and characteristics you might expect to find in real datasets. The next section will explore similar techniques for generating categorical data.
© 2025 ApX Machine Learning