Okay, let's dive into one of the most straightforward ways to create synthetic tabular data: sampling rows directly from an existing dataset. Imagine you have a table of real data, like customer information or sensor readings. Row sampling techniques use this existing table as a blueprint to build a new, synthetic one.
This approach is particularly useful when you need a dataset that closely mirrors the overall structure and statistical properties of your original data, perhaps for testing software, creating anonymized versions for sharing, or simply generating a larger dataset with similar characteristics.
The simplest method is sampling with replacement. Think of your original dataset as a large bag filled with numbered balls, where each ball represents a row. To generate one synthetic row, you draw a ball at random, copy the corresponding row into your new table, and then put the ball back in the bag.
Because you put the ball back each time (sampling "with replacement"), the same row from the original dataset can appear multiple times in your synthetic dataset. If your original dataset has N rows and you want a synthetic dataset with M rows, you simply repeat this draw-copy-replace step M times.
Here's a small illustration:
Original Data (3 rows):

| ID | FeatureA | FeatureB |
|----|----------|----------|
| 1  | 10       | Red      |
| 2  | 15       | Blue     |
| 3  | 12       | Red      |
Possible Synthetic Data (sampled with replacement, size 4):

| ID | FeatureA | FeatureB |
|----|----------|----------|
| 2  | 15       | Blue     |
| 1  | 10       | Red      |
| 2  | 15       | Blue     |
| 3  | 12       | Red      |
Notice row 2 was selected twice.
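In code, this is a one-liner with pandas. Here is a minimal sketch; the DataFrame contents simply mirror the toy example above, and the `random_state` value is arbitrary:

```python
import pandas as pd

# Original data: 3 rows, matching the toy example above
original = pd.DataFrame({
    "ID": [1, 2, 3],
    "FeatureA": [10, 15, 12],
    "FeatureB": ["Red", "Blue", "Red"],
})

# Draw M = 4 rows with replacement; the same original row may appear more than once
synthetic = original.sample(n=4, replace=True, random_state=42)

print(synthetic.reset_index(drop=True))
```

Because `replace=True`, the requested sample size can exceed the number of original rows.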
Advantages:

- Very simple to implement and computationally cheap.
- The synthetic dataset can be any size, including larger than the original.
- Every synthetic row is a real row, so column values and the relationships between columns within a row remain valid.

Disadvantages:

- Original rows are repeated verbatim, so no genuinely new data points are created.
- Copying real rows offers little privacy protection on its own.
- Small samples can over- or under-represent rare subgroups purely by chance.
Another option is sampling without replacement. This time, when you pick a ball (a row) from the bag, you don't put it back. This means each original row can appear at most once in your synthetic dataset.
This technique is typically used to create a smaller, random subset of your original data. For example, you might use it to create a training or testing split for a machine learning model. It's generally not used to create a synthetic dataset that's larger than the original, because you'd run out of unique rows to sample.
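With pandas, the only change is the `replace` flag. A short sketch, reusing the same toy DataFrame:

```python
import pandas as pd

original = pd.DataFrame({
    "ID": [1, 2, 3],
    "FeatureA": [10, 15, 12],
    "FeatureB": ["Red", "Blue", "Red"],
})

# Draw 2 of the 3 rows without replacement; each original row appears at most once
subset = original.sample(n=2, replace=False, random_state=0)
print(subset)

# Requesting more rows than exist is not possible without replacement:
# original.sample(n=5, replace=False)  # raises ValueError
```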
What if your dataset has specific subgroups that are important, but some are much smaller than others? For instance, imagine a customer dataset where 95% are standard customers and 5% are high-value customers. If you use simple random sampling (with replacement), a small synthetic sample might accidentally miss the high-value customers or include very few, not reflecting their true proportion or importance.
Stratified sampling addresses this. The process involves:

1. Splitting the original data into groups (strata) based on the important attribute, for example customer type.
2. Sampling rows (usually with replacement) from each stratum separately, in proportion to that stratum's share of the original data.
3. Combining the per-stratum samples into the final synthetic dataset.
This ensures that the synthetic dataset preserves the relative proportions of the defined subgroups from the original data.
Diagram: process flow for stratified sampling, ensuring representation of different customer types.
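Below is a minimal sketch of this idea using pandas `groupby` plus per-group sampling. The column name `customer_type` and the 95/5 split are hypothetical, chosen to mirror the example above; note that rounding the per-stratum counts can shift the total by a row or two:

```python
import pandas as pd

# Toy data: 95% "standard" customers, 5% "high_value"
original = pd.DataFrame({
    "customer_type": ["standard"] * 95 + ["high_value"] * 5,
    "spend": list(range(100)),
})

def stratified_sample(df, strata_col, n_rows, seed=0):
    """Sample n_rows with replacement, preserving each stratum's proportion."""
    pieces = []
    for _, group in df.groupby(strata_col):
        # Number of synthetic rows this stratum should contribute
        n_group = round(n_rows * len(group) / len(df))
        pieces.append(group.sample(n=n_group, replace=True, random_state=seed))
    return pd.concat(pieces).reset_index(drop=True)

synthetic = stratified_sample(original, "customer_type", n_rows=40)
print(synthetic["customer_type"].value_counts(normalize=True))  # ~0.95 / 0.05
```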
While the methods above copy entire rows, a slightly more sophisticated approach involves sampling a row and then modifying some of its values slightly. For example, you could:

- Add a small amount of random noise to numerical columns (for instance, jitter an age or a price by a few percent).
- Replace a categorical value with another value observed in the same column, chosen at random or from similar rows.
- Swap the values of a column between two sampled rows.
This starts to create data points that didn't exist in the original dataset, potentially offering better privacy and novelty. However, making these modifications requires careful consideration to ensure the resulting data remains realistic and useful. This technique acts as a bridge towards more complex generative models that learn the underlying patterns in the data rather than just copying rows.
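Here is a small sketch of this sample-and-perturb idea: draw rows with replacement, then jitter a numeric column with Gaussian noise and occasionally swap the categorical value. The noise scale and swap probability are arbitrary illustrative choices, not recommendations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

original = pd.DataFrame({
    "FeatureA": [10.0, 15.0, 12.0],
    "FeatureB": ["Red", "Blue", "Red"],
})

# Step 1: sample rows with replacement, as before
synthetic = original.sample(n=5, replace=True, random_state=1).reset_index(drop=True)

# Step 2: perturb the numeric column with small Gaussian noise
noise = rng.normal(loc=0.0, scale=0.5, size=len(synthetic))
synthetic["FeatureA"] = synthetic["FeatureA"] + noise

# Step 3 (optional): occasionally replace the categorical value with another observed one
mask = rng.random(len(synthetic)) < 0.2
synthetic.loc[mask, "FeatureB"] = rng.choice(original["FeatureB"].unique(), size=mask.sum())

print(synthetic)
```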
It's important to remember the main limitation: basic row sampling techniques (without modification) primarily replicate the data you already have. They copy existing rows, either exactly or by ensuring group proportions are maintained. They don't generate fundamentally new patterns or relationships that weren't present in the original dataset. If your goal is to generate data that explores possibilities beyond what's explicitly in your source data, you'll need the more advanced methods discussed later.
However, for tasks like creating balanced test sets, quickly scaling up a dataset while preserving its basic structure, or generating simple anonymized versions, row sampling provides an accessible and effective starting point.