Alongside generating numerical data, we often need to create synthetic categorical data. Categorical data represents characteristics that fall into distinct groups or labels, rather than numerical values. Think of product types ('Electronics', 'Clothing', 'Groceries'), user status ('Active', 'Inactive', 'New'), or simple classifications ('Yes', 'No'). Generating this type of data is fundamental for many machine learning tasks, such as classification problems.
Before you can generate categorical data, you must first define the set of possible categories. This set represents all the unique values the categorical feature can take. For example, if you're simulating customer survey responses to a question about satisfaction, your categories might be: 'Very Satisfied', 'Satisfied', 'Neutral', 'Dissatisfied', 'Very Dissatisfied'.
The most straightforward way to generate a categorical value is to simply pick one category at random from your defined set, where each category has an equal chance of being selected. This is analogous to rolling a fair die where each face (category) has a 1/N probability of appearing, with N being the total number of categories.
This approach is useful when you have no specific information about the expected frequency of each category or when you want a perfectly balanced representation for initial testing.
Let's say our categories are ['Red', 'Green', 'Blue']. Uniform random sampling means each time we generate a data point, 'Red' has a 1/3 chance, 'Green' has a 1/3 chance, and 'Blue' has a 1/3 chance.
Example (Python-like pseudocode):
import random
categories = ['Red', 'Green', 'Blue']
# Generate one random category
random_category = random.choice(categories)
print(random_category) # Output could be 'Red', 'Green', or 'Blue'
# Generate 10 random categories
synthetic_data = [random.choice(categories) for _ in range(10)]
print(synthetic_data)
# Example Output: ['Green', 'Blue', 'Green', 'Red', 'Blue', 'Blue', 'Red', 'Green', 'Green', 'Red']
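To confirm that uniform sampling really does produce roughly equal counts, you can tally a larger sample with collections.Counter. A small sketch (exact counts vary from run to run):

```python
import random
from collections import Counter

categories = ['Red', 'Green', 'Blue']

# Draw a larger sample so the proportions are easier to interpret
sample = [random.choice(categories) for _ in range(3000)]

counts = Counter(sample)
for category in categories:
    share = counts[category] / len(sample)
    print(f"{category}: {counts[category]} ({share:.1%})")
```

Each share should hover near 33%, and it drifts closer to exactly 1/3 as the sample size grows.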
Often, categories in real data don't appear with equal frequency. For instance, in a dataset of online transactions, the category 'Completed' might be far more common than 'Failed' or 'Refunded'. To mimic this, you can assign specific probabilities to each category. The key requirement is that the probabilities for all categories must sum up to 1.0 (or 100%).
You then sample from the categories according to these assigned weights or probabilities. Categories with higher probabilities will appear more frequently in the generated data.
Suppose we want to generate data for user types, but we expect most users to be 'Active'. We could define probabilities like this: 'Active' with probability 0.7, 'Inactive' with 0.2, and 'New' with 0.1.
Sum of probabilities = 0.7 + 0.2 + 0.1 = 1.0.
Example (Python-like pseudocode):
import random
categories = ['Active', 'Inactive', 'New']
probabilities = [0.7, 0.2, 0.1]
# Generate one random category based on probabilities
weighted_category = random.choices(categories, weights=probabilities, k=1)[0]
print(weighted_category) # Output is most likely 'Active'
# Generate 100 random categories
synthetic_data = random.choices(categories, weights=probabilities, k=100)
# Count occurrences to check distribution (approximate)
from collections import Counter
print(Counter(synthetic_data))
# Example Output: Counter({'Active': 68, 'Inactive': 22, 'New': 10})
We can visualize the target probabilities versus the frequencies obtained from a generated sample.
Comparison of the desired probability distribution and the actual frequency distribution from a sample of 100 generated points using weighted sampling. The sample frequencies closely approximate the target probabilities.
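As a lightweight alternative to a plotted chart, the same comparison can be made in code by normalizing the sample counts and printing them beside the target probabilities. A sketch (sampled frequencies are random and will differ slightly each run):

```python
import random
from collections import Counter

categories = ['Active', 'Inactive', 'New']
probabilities = [0.7, 0.2, 0.1]

# Draw a sample using the weighted distribution
sample = random.choices(categories, weights=probabilities, k=1000)
counts = Counter(sample)

# Print target probability next to the observed sample frequency
print(f"{'Category':<10}{'Target':>8}{'Sampled':>9}")
for category, target in zip(categories, probabilities):
    sampled = counts[category] / len(sample)
    print(f"{category:<10}{target:>8.2f}{sampled:>9.3f}")
```

With 1,000 samples the observed frequencies typically land within a few percentage points of the targets.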
Sometimes, the category of a data point depends on the values of other features. This relates back to the rule-based systems mentioned earlier. You can define explicit rules that assign a category based on conditions met by other generated data.
This method allows you to introduce simple relationships between different columns in your synthetic dataset, making it potentially more realistic than generating each column independently.
Examples:
IF Age < 18 THEN Age Group = 'Child'
IF 18 <= Age < 65 THEN Age Group = 'Adult'
IF Age >= 65 THEN Age Group = 'Senior'
IF Purchase History = 'Electronics' THEN Recommended Product Type = 'Accessories'
IF Purchase History = 'Books' THEN Recommended Product Type = 'Stationery'
Rule-based generation requires defining the logic connecting different pieces of data. While simple rules are easy to implement, complex dependencies might require more sophisticated approaches.
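For rules that are simple one-to-one mappings, like the purchase-history rules above, a dictionary lookup is often cleaner than a chain of if/elif branches. A minimal sketch (the 'Other' fallback label is an assumption, not part of the original rules):

```python
# Map a purchase-history category directly to a recommended product type
RECOMMENDATION_RULES = {
    'Electronics': 'Accessories',
    'Books': 'Stationery',
}

def recommend_product_type(purchase_history):
    # Fall back to a hypothetical 'Other' label for unmapped categories
    return RECOMMENDATION_RULES.get(purchase_history, 'Other')

print(recommend_product_type('Electronics'))  # Accessories
print(recommend_product_type('Books'))        # Stationery
```

Adding a new rule then means adding one dictionary entry rather than editing control flow.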
Example (Python-like pseudocode combining numerical and rule-based categorical):
import random
def generate_age_group(age):
    if age < 18:
        return 'Child'
    elif 18 <= age < 65:
        return 'Adult'
    else:
        return 'Senior'

synthetic_data = []
for _ in range(5):
    # Generate a random age (numerical)
    age = random.randint(10, 80)
    # Generate age group based on the age (rule-based categorical)
    age_group = generate_age_group(age)
    synthetic_data.append({'Age': age, 'Age Group': age_group})

print(synthetic_data)
# Example Output:
# [{'Age': 45, 'Age Group': 'Adult'},
# {'Age': 12, 'Age Group': 'Child'},
# {'Age': 71, 'Age Group': 'Senior'},
# {'Age': 25, 'Age Group': 'Adult'},
# {'Age': 66, 'Age Group': 'Senior'}]
These basic methods (uniform sampling, weighted sampling, and rule-based assignment) provide a foundation for generating simple categorical data. While they might not capture all the complexities of real-world distributions and dependencies, they are essential starting points for creating basic synthetic datasets for testing, development, or augmenting limited real data. As you progress, you'll encounter more advanced techniques that build upon these fundamentals.
© 2025 ApX Machine Learning