In this practical exercise, we generate a simple synthetic dataset with several columns, mimicking a basic customer transaction table. The focus is on creating the values for each column independently, drawing from statistical distributions or predefined rules. We'll use Python together with the pandas and numpy libraries, common tools for data manipulation.

## Setting Up Your Environment

First, ensure you have Python installed. You'll also need the pandas and numpy libraries. If you don't have them installed, you can typically install them using pip:

```bash
pip install pandas numpy
```

Once installed, let's import them into our Python script or notebook:

```python
import pandas as pd
import numpy as np

print("Libraries imported successfully!")
```

## Defining the Structure of Our Synthetic Table

Let's imagine we need a dataset representing customer transactions. We'll create a table with the following columns:

- `CustomerID`: a unique identifier for each customer (integer).
- `Age`: the age of the customer (integer).
- `ProductCategory`: the category of the product purchased (text/categorical).
- `PurchaseAmount`: the amount spent in the transaction (float/numerical).

We'll aim to generate 100 rows of synthetic data for this structure.

## Generating Data for Each Column

We will generate the data for each column independently, based on simple rules or distributions.

### 1. Generating Customer IDs

For `CustomerID`, we can simply create a sequence of unique integers from 1 up to the number of rows we want (100 in this case).

```python
num_rows = 100
customer_ids = np.arange(1, num_rows + 1)

# Display the first 5 generated IDs
print(customer_ids[:5])
```

### 2. Generating Customer Ages

Let's assume customer ages roughly follow a normal distribution. We can use numpy to sample ages from a normal distribution centered around 35 with a standard deviation of 10. Since age must be positive and is typically an integer, we'll take the absolute value and convert to integers.

```python
# Sample ages from a normal distribution (mean=35, std_dev=10)
np.random.seed(42)  # for reproducible results
ages_float = np.random.normal(loc=35, scale=10, size=num_rows)

# Ensure ages are positive and convert to integers
ages = np.abs(ages_float).astype(int)

# Ensure a minimum age, e.g., 18
ages = np.maximum(ages, 18)

# Display the first 5 generated ages
print(ages[:5])
```

Note that `np.abs().astype(int)` alone can still produce implausibly young customers, so we clamp ages to a minimum of 18 with `np.maximum(ages, 18)`, assuming transactions come from adults. Setting `np.random.seed(42)` makes the example reproducible.

### 3. Generating Product Categories

For `ProductCategory`, we'll define a list of possible categories and randomly choose from it for each row. We can also assign probabilities to make some categories more frequent than others.

```python
categories = ['Electronics', 'Clothing', 'Groceries', 'Home Goods']

# Assign probabilities: Electronics (30%), Clothing (25%), Groceries (25%), Home Goods (20%)
category_probabilities = [0.30, 0.25, 0.25, 0.20]

# Generate categories based on probabilities
product_categories = np.random.choice(categories, size=num_rows, p=category_probabilities)

# Display the first 5 generated categories
print(product_categories[:5])
```
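In a sample of only 100 rows, the realized category frequencies will only approximate the probabilities passed to `np.random.choice`. As a quick sanity check (a small addition to the walkthrough, using only the numpy calls already imported above), we can compare the empirical frequencies against the targets:

```python
# Compare empirical category frequencies with the target probabilities
values, counts = np.unique(product_categories, return_counts=True)
for value, count in zip(values, counts):
    print(f"{value}: observed {count / num_rows:.2f}")
```

The observed fractions should sit near the specified probabilities, and they drift closer as `num_rows` grows.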
### 4. Generating Purchase Amounts

Let's generate `PurchaseAmount`. We can use a uniform distribution for simplicity, assuming purchases range from $5 to $500.

```python
# Generate purchase amounts from a uniform distribution between 5 and 500
purchase_amounts = np.random.uniform(low=5.0, high=500.0, size=num_rows)

# Round to 2 decimal places for currency representation
purchase_amounts = np.round(purchase_amounts, 2)

# Display the first 5 generated amounts
print(purchase_amounts[:5])
```

## Assembling the Synthetic Table

Now that we have generated data for each column, we can combine them into a pandas DataFrame. A DataFrame is essentially a table, a perfect fit for our structured data.

```python
# Create a dictionary to hold our data
synthetic_data = {
    'CustomerID': customer_ids,
    'Age': ages,
    'ProductCategory': product_categories,
    'PurchaseAmount': purchase_amounts
}

# Create the DataFrame
synthetic_df = pd.DataFrame(synthetic_data)

# Display the first few rows of the synthetic table
print("First 5 rows of the synthetic table:")
print(synthetic_df.head())
```

## Inspecting the Generated Table

Let's quickly inspect the table we've created using some basic pandas functions.

```python
# Get information about the columns and data types
print("\nTable Information:")
synthetic_df.info()

# Get basic statistics for numerical columns
print("\nBasic Statistics:")
print(synthetic_df.describe())

# Get value counts for the categorical column
print("\nProduct Category Counts:")
print(synthetic_df['ProductCategory'].value_counts())
```

You should see output confirming the data types (integer for `CustomerID` and `Age`, object/string for `ProductCategory`, float for `PurchaseAmount`) and summary statistics such as mean, min, and max for the numerical columns, along with the counts for each product category. Notice how the category counts roughly align with the probabilities we specified.

## Visualizing the Synthetic Data

Visualizations can help us quickly understand the distributions in our synthetic data. Let's create a histogram for the `Age` column.

*Figure: Distribution of Synthetic Customer Ages (x-axis: Age, y-axis: Count). Histogram showing the frequency of different age groups in the synthetic dataset. The distribution peaks around the mid-30s, as expected from our generation method.*
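The original chart is interactive, but it's easy to reproduce locally. A minimal sketch using matplotlib (an assumption on our part; matplotlib isn't used elsewhere in this exercise and may need its own `pip install matplotlib`) could look like the following, and an analogous call reproduces the purchase-amount chart below:

```python
import matplotlib.pyplot as plt

# Histogram of the synthetic ages; the bin count is an arbitrary choice
plt.hist(synthetic_df['Age'], bins=15, color='#228be6', edgecolor='white')
plt.title('Distribution of Synthetic Customer Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
```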
Let's also visualize the distribution of `PurchaseAmount`.

*Figure: Distribution of Synthetic Purchase Amounts (x-axis: Purchase Amount ($), y-axis: Count). Histogram showing the frequency of different purchase amount ranges. The distribution appears relatively uniform, as we sampled from a uniform distribution.*

## Summary and Next Steps

Congratulations! You've successfully generated a simple synthetic tabular dataset using Python. We defined a structure, generated data for each column independently with numpy, sampling the numerical columns from distributions and the categorical column from a list with assigned probabilities, and assembled everything into a pandas DataFrame.

This method is straightforward but has limitations. Most significantly, generating columns independently means we haven't modeled any relationships between columns: older customers might tend to buy certain categories more often, or purchase amounts might differ across categories. In real data, such correlations often exist. The section on "Preserving Basic Column Correlations" introduced this challenge, and more advanced techniques aim to capture these dependencies; the sketch at the end of this section shows one simple way to introduce such a dependency.

This practical exercise provides a foundation. You can experiment by changing the distributions, probabilities, or adding more columns. Keeping its limitations in mind, this simple synthetic table could now be used for basic testing of data loading pipelines or for simple model-training scenarios. The next chapters will move on to generating other types of data, such as images, and discuss how to evaluate the quality of the synthetic data we create.
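To make the independence limitation concrete, here is one illustrative sketch (our own example, not the method from the correlations section) that ties `PurchaseAmount` to `ProductCategory` by sampling each amount around a per-category mean; the means and spread are invented purely for demonstration:

```python
# Hypothetical per-category mean purchase amounts (illustrative values only)
category_means = {
    'Electronics': 300.0,
    'Clothing': 120.0,
    'Groceries': 60.0,
    'Home Goods': 150.0,
}

# Sample each amount from a normal distribution centered on its row's category mean
dependent_amounts = np.array([
    np.random.normal(loc=category_means[category], scale=30.0)
    for category in product_categories
])

# Keep the original $5 floor and round for currency representation
dependent_amounts = np.round(np.clip(dependent_amounts, 5.0, None), 2)

# Mean purchase amount now differs by category, unlike the independent version
dependent_df = pd.DataFrame({
    'ProductCategory': product_categories,
    'PurchaseAmount': dependent_amounts,
})
print(dependent_df.groupby('ProductCategory')['PurchaseAmount'].mean())
```

The marginal category frequencies are unchanged, but the amount column now depends on the category, which is exactly the kind of structure the independent-column approach cannot produce.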