One straightforward way to create artificial data is by drawing numbers from well-understood mathematical patterns called statistical distributions. Think of a distribution as a rule that describes how likely different outcomes are. If we can describe a feature in our real data using one of these rules, or if we want our synthetic data to follow a specific pattern, we can use the distribution to generate new data points.
Let's look at two fundamental distributions you'll often encounter.
Imagine rolling a standard six-sided die. Assuming it's a fair die, each number from 1 to 6 has an equal chance (1/6) of appearing on any given roll. This is the essence of a uniform distribution: every possible value within a defined range is equally likely.
When we generate data from a continuous uniform distribution, we typically specify a minimum value (a) and a maximum value (b). The process then randomly selects numbers between a and b, where any number in that range has the same probability of being chosen.
For example, if we need to generate synthetic data representing the percentage completion of a task, and we have no reason to believe any percentage is more likely than another, we could use a uniform distribution between 0% and 100%. Or, if we wanted to simulate customer satisfaction scores on a scale of 1 to 5, assuming any score is equally probable (a simplified assumption!), we could sample uniformly from the integers 1, 2, 3, 4, 5.
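Both cases above can be sketched with NumPy. This is a minimal example, assuming NumPy's `Generator` interface; the seed is arbitrary and only makes the output reproducible:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Continuous uniform: task-completion percentages between 0 and 100.
percentages = rng.uniform(low=0.0, high=100.0, size=5)

# Discrete uniform: satisfaction scores from 1 to 5 inclusive.
# Note: the upper bound of integers() is exclusive, hence 6.
scores = rng.integers(low=1, high=6, size=5)

print(percentages)
print(scores)
```

`uniform()` handles the continuous case, while `integers()` covers sampling equally likely whole numbers.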
Here's a visual example. We generated 1000 random numbers using a uniform distribution between 0 and 10. Notice how the bars in the histogram are roughly the same height, indicating that numbers in each small interval occurred about equally often.
A histogram showing frequencies of values sampled from a uniform distribution between 0 and 10. Each range of values has a similar frequency count.
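The "roughly equal bars" observation can be checked numerically rather than visually. As a sketch (seed chosen arbitrarily), we can bin 1000 uniform samples into 10 equal-width intervals and inspect the counts:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.uniform(low=0.0, high=10.0, size=1000)

# Count how many samples fall into each of 10 equal-width bins.
counts, edges = np.histogram(samples, bins=10, range=(0.0, 10.0))
print(counts)  # each bin holds roughly 100 of the 1000 samples
```

Each count hovers around 100, with small random fluctuations, which is exactly what the flat histogram shows.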
Many natural phenomena tend to cluster around an average value. Think about the heights of adult humans: most people are close to the average height, while extremely tall or short individuals are rare. This pattern is often described by the normal distribution, also known as the Gaussian distribution or the "bell curve".
The normal distribution is defined by two parameters: the mean (μ), which locates the center of the bell, and the standard deviation (σ), which controls how spread out the values are around that center.
When we generate data from a normal distribution, we specify the desired mean (μ) and standard deviation (σ). The generation process then produces numbers that are most likely to be near μ. Values become progressively less likely the further they are from the mean, following the characteristic bell shape.
For instance, if we know that the average score on a standardized test is 100 with a standard deviation of 15, and we believe the scores follow a normal distribution, we can generate synthetic test scores using these parameters. Most generated scores will be close to 100, with fewer scores around 85 or 115, and even fewer further out.
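The test-score example can be sketched the same way. This assumes NumPy's `Generator.normal()`; with enough samples, the empirical mean and standard deviation land close to the parameters we asked for:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic standardized-test scores: mean 100, standard deviation 15.
scores = rng.normal(loc=100.0, scale=15.0, size=10_000)

print(scores.mean())  # close to 100
print(scores.std())   # close to 15
```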
Here's a visualization of 1000 numbers generated from a normal distribution with a mean (μ) of 5 and a standard deviation (σ) of 1.5. You can clearly see the bell shape, where values near 5 are most frequent.
A histogram showing frequencies of values sampled from a normal distribution with μ=5 and σ=1.5. Frequencies are highest near the mean and decrease further away, forming a bell shape.
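The bell shape has a well-known quantitative signature: roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two. A quick sketch with the same μ=5, σ=1.5 parameters confirms this:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
samples = rng.normal(loc=5.0, scale=1.5, size=10_000)

# Fraction of samples within one and two standard deviations of the mean.
within_1sd = np.mean(np.abs(samples - 5.0) <= 1.5)
within_2sd = np.mean(np.abs(samples - 5.0) <= 3.0)

print(within_1sd)  # roughly 0.68
print(within_2sd)  # roughly 0.95
```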
Sampling from statistical distributions is a fundamental technique in synthetic data generation: it is simple and fast, its parameters (such as a and b, or μ and σ) give direct control over the properties of the generated data, and the statistical behavior of the output is mathematically well understood.
While powerful for simple cases, generating data solely based on individual distributions has limitations. Real-world datasets often have complex relationships between different features (columns). For example, age and income might be related; generating them independently using separate distributions might miss this connection. We'll explore ways to handle such dependencies when we discuss generating tabular data in the next chapter.
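The age-and-income limitation is easy to demonstrate. In this sketch (the means and spreads are made-up illustrative numbers), sampling the two features from separate, independent distributions produces essentially zero correlation between them, even though real age and income data are usually related:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical features sampled independently of each other.
age = rng.normal(loc=40.0, scale=10.0, size=10_000)
income = rng.normal(loc=50_000.0, scale=12_000.0, size=10_000)

# Independent sampling yields a correlation near zero, so any
# real-world relationship between the features is lost.
corr = np.corrcoef(age, income)[0, 1]
print(round(corr, 3))  # close to 0
```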
For now, the key takeaway is that statistical distributions provide a foundational way to generate synthetic data points with specific, controllable properties based on mathematical rules. Python's NumPy library offers functions such as numpy.random.uniform() and numpy.random.normal() to sample from these distributions easily, which you'll get to try in the hands-on practical section.
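As a final sketch, here are the module-level functions named above alongside their equivalents on the newer `Generator` interface, which NumPy's documentation recommends for new code:

```python
import numpy as np

# Legacy module-level functions:
u = np.random.uniform(low=0.0, high=10.0, size=3)
n = np.random.normal(loc=5.0, scale=1.5, size=3)

# Equivalent calls through the newer Generator interface:
rng = np.random.default_rng()
u2 = rng.uniform(low=0.0, high=10.0, size=3)
n2 = rng.normal(loc=5.0, scale=1.5, size=3)

print(u, n, u2, n2)
```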
© 2025 ApX Machine Learning