Alright, let's put the techniques you've learned in this chapter into practice. Theory is valuable, but applying these skills to realistic scenarios solidifies your understanding. In this hands-on section, we'll work through a typical data wrangling workflow, simulating the process of gathering data from different sources, cleaning it up, performing transformations, and combining it into a single, analysis-ready dataset.
Imagine we're working with data related to city park usage. We have information about park visitors from a survey (simulated database export), data on specific events held in parks (simulated API response), and some basic details about the parks themselves (perhaps scraped from a city website or provided as a simple file). Our goal is to integrate these disparate sources into a unified dataset suitable for later analysis, such as understanding which events attract more visitors or identifying demographic patterns in park usage.
First, ensure you have the necessary Python libraries installed and import them. We'll primarily use pandas for data manipulation. In a live workflow you might also use requests for API interaction, BeautifulSoup for web scraping, and sqlalchemy for database access; here we simulate those acquisition steps by loading local files.
import pandas as pd
import numpy as np
import json # If loading from JSON file
# import requests # If interacting with a live API
# from bs4 import BeautifulSoup # If doing live web scraping
# from sqlalchemy import create_engine # If connecting to a SQL database
# For this example, we'll assume data is loaded from local files
# simulating the outputs of acquisition steps.
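For reference, here is roughly how the same three sources might be acquired live. This is only a sketch under assumed names: the endpoint URL, database path, and table name are placeholders, so the block is disabled by default.

# Optional sketch of live acquisition. The endpoint URL, SQLite path, and table
# name below are placeholder assumptions, so the block is disabled by default.
USE_LIVE_SOURCES = False

if USE_LIVE_SOURCES:
    import requests
    from sqlalchemy import create_engine

    # API: fetch event data as JSON
    events_data = requests.get('https://example.com/api/park-events', timeout=10).json()

    # Database: read the visitor survey table into a DataFrame
    engine = create_engine('sqlite:///city_parks.db')
    visitors_df = pd.read_sql('SELECT * FROM visitor_survey', engine)

    # Web page: read an HTML table of park details (requires lxml or html5lib)
    parks_df = pd.read_html('https://example.com/parks')[0]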
We'll start by loading our simulated datasets. Let's assume we have:

- visitors.csv: Contains survey responses with visitor demographics and visit frequency.
- events.json: A JSON file detailing events held in parks, including park ID, event type, and estimated attendance.
- parks.csv: Basic information about each park, like name, ID, and size (acres).

# Load visitor survey data (simulating database query result)
try:
visitors_df = pd.read_csv('visitors.csv')
print("Visitor data loaded successfully.")
# Display first few rows and info
print(visitors_df.head())
print(visitors_df.info())
except FileNotFoundError:
print("Error: visitors.csv not found. Please ensure the file is in the correct directory.")
# Create dummy data if file not found, so rest of code can run
visitors_df = pd.DataFrame({
'visitor_id': range(1, 101),
'age': np.random.randint(18, 75, 100),
'gender': np.random.choice(['M', 'F', 'NB', None], 100, p=[0.45, 0.45, 0.05, 0.05]),
'visit_frequency': np.random.choice(['Daily', 'Weekly', 'Monthly', 'Rarely', None], 100, p=[0.1, 0.3, 0.4, 0.15, 0.05]),
'primary_park_id': np.random.randint(1, 6, 100)
})
visitors_df.loc[::10, 'age'] = np.nan # Introduce some missing ages
# Load event data (simulating API response)
try:
with open('events.json', 'r') as f:
events_data = json.load(f)
events_df = pd.json_normalize(events_data, record_path='events')
print("\nEvent data loaded successfully.")
# Display first few rows and info
print(events_df.head())
print(events_df.info())
except FileNotFoundError:
print("Error: events.json not found. Creating dummy data.")
events_df = pd.DataFrame({
'event_id': [f'evt{i:03}' for i in range(1, 21)],
'park_id': np.random.randint(1, 6, 20),
'event_type': np.random.choice(['Concert', 'Market', 'Festival', 'Sports'], 20),
'attendance_est': np.random.randint(50, 500, 20),
'event_date': pd.to_datetime(pd.date_range(start='2023-06-01', periods=20, freq='3D'))
})
events_df.loc[[3, 15], 'attendance_est'] = -1 # Introduce problematic data
# Load park details (simulating scraped or simple file data)
try:
parks_df = pd.read_csv('parks.csv')
print("\nPark data loaded successfully.")
# Display first few rows and info
print(parks_df.head())
print(parks_df.info())
except FileNotFoundError:
print("Error: parks.csv not found. Creating dummy data.")
parks_df = pd.DataFrame({
'Park ID': range(1, 6),
'Park Name': [f'Park {chr(64+i)}' for i in range(1, 6)],
'Size (acres)': [25.5, 102.0, 15.8, 55.1, 220.9],
'Has Playground': [True, True, False, True, True]
})
Immediately, we might notice inconsistencies or issues from the .info() and .head() outputs:

- The park identifier column is named differently in each source (primary_park_id vs. park_id vs. Park ID).
- Some data types need correction (for example, event_date might be an object rather than a datetime).
- Missing values (NaN, None) are present in visitors_df.
- There is at least one illogical value (a negative attendance estimate in events_df).
- Column names in parks_df have spaces and parentheses.

Let's address the issues identified above.
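Before making fixes, it can help to confirm these observations programmatically. The short loop below is a convenience sketch of ours (not part of the chapter's datasets); it prints the column names, dtypes, and missing-value counts for each DataFrame.

# Quick audit of all three DataFrames: column names, dtypes, and missing counts.
for label, df in [('visitors', visitors_df), ('events', events_df), ('parks', parks_df)]:
    print(f"\n--- {label} ---")
    print("columns:", list(df.columns))
    print("dtypes:\n", df.dtypes)
    print("missing values:\n", df.isnull().sum())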
Consistent column naming makes merging and analysis much easier. We'll convert names to lowercase and replace spaces/special characters with underscores.
# Clean visitor column names (already mostly clean)
visitors_df.columns = visitors_df.columns.str.lower().str.replace(' ', '_')
# Clean event column names
events_df.columns = events_df.columns.str.lower().str.replace(' ', '_')
# Clean park column names
parks_df.columns = parks_df.columns.str.lower().str.replace(' ', '_').str.replace('[()]', '', regex=True)
# Rename columns for consistency before merging
parks_df = parks_df.rename(columns={'park_id': 'id', 'park_name': 'name'}) # Shorten to 'id'/'name'; we re-align the key name before merging
visitors_df = visitors_df.rename(columns={'primary_park_id': 'park_id'}) # Align key columns
# Assuming events_df 'park_id' is already consistent with visitors_df
print("\nCleaned column names:")
print("Visitors:", visitors_df.columns)
print("Events:", events_df.columns)
print("Parks:", parks_df.columns)
Ensure columns have the appropriate data types. For instance, event_date
should be a datetime object.
# Correct data types if needed (example assumes event_date was loaded as object)
if 'event_date' in events_df.columns and events_df['event_date'].dtype == 'object':
events_df['event_date'] = pd.to_datetime(events_df['event_date'], errors='coerce')
print("\nCorrected event_date dtype:", events_df['event_date'].dtype)
# Check other types using visitors_df.info(), events_df.info(), parks_df.info()
# Correct other types as necessary, e.g., ensuring IDs are integers if appropriate
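One concrete example: if the key columns end up with different integer types, or become floats because of missing values, casting them to a common type keeps the later merges clean. This sketch uses pandas' nullable Int64 dtype (an assumption about how you want to store IDs) so any missing IDs survive the cast.

# Align the park ID key columns on a common integer dtype before merging.
# 'Int64' (capital I) is pandas' nullable integer type, so NaN IDs are preserved.
for df in (visitors_df, events_df):
    if 'park_id' in df.columns:
        df['park_id'] = df['park_id'].astype('Int64')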
Address obviously incorrect values, like the negative attendance estimate. We might replace it with NaN
or use a domain-specific rule.
# Handle illogical values like negative attendance
print(f"\nEvents with invalid attendance before correction: {len(events_df[events_df['attendance_est'] < 0])}")
events_df.loc[events_df['attendance_est'] < 0, 'attendance_est'] = np.nan
print(f"Events with invalid attendance after correction: {len(events_df[events_df['attendance_est'] < 0])}")
Now, let's apply strategies for missing data (NaN
values).
print("\nMissing values before handling:")
print("Visitors:\n", visitors_df.isnull().sum())
print("Events:\n", events_df.isnull().sum())
print("Parks:\n", parks_df.isnull().sum())
# Strategy for visitors_df:
# - age: Impute with the median age
median_age = visitors_df['age'].median()
visitors_df['age'] = visitors_df['age'].fillna(median_age)
print(f"\nImputed missing 'age' with median: {median_age}")
# - gender: Impute with 'Unknown' or the mode, depending on analysis needs
mode_gender = visitors_df['gender'].mode()[0] # The mode is an alternative fill value
visitors_df['gender'] = visitors_df['gender'].fillna('Unknown') # Here we impute with 'Unknown'
print("Filled missing 'gender' with 'Unknown'")
# - visit_frequency: Impute with the mode
mode_freq = visitors_df['visit_frequency'].mode()[0]
visitors_df['visit_frequency'] = visitors_df['visit_frequency'].fillna(mode_freq)
print(f"Imputed missing 'visit_frequency' with mode: {mode_freq}")
# Strategy for events_df:
# - attendance_est: Impute with the median attendance for that event type
events_df['attendance_est'] = events_df.groupby('event_type')['attendance_est'].transform(lambda x: x.fillna(x.median()))
# Handle cases where an entire group might be NaN
events_df['attendance_est'] = events_df['attendance_est'].fillna(events_df['attendance_est'].median())
print(f"Imputed missing 'attendance_est' using median within event_type or overall median.")
print("\nMissing values after handling:")
print("Visitors:\n", visitors_df.isnull().sum())
print("Events:\n", events_df.isnull().sum())
Let's apply scaling to the visitor age and park size_acres columns. We'll use Min-Max scaling as discussed in the chapter: $x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$.
# Min-Max Scale visitor age
min_age = visitors_df['age'].min()
max_age = visitors_df['age'].max()
visitors_df['age_scaled'] = (visitors_df['age'] - min_age) / (max_age - min_age)
print("\nApplied Min-Max scaling to 'age'.")
print(visitors_df[['age', 'age_scaled']].head())
# Min-Max Scale park size
min_size = parks_df['size_acres'].min()
max_size = parks_df['size_acres'].max()
parks_df['size_acres_scaled'] = (parks_df['size_acres'] - min_size) / (max_size - min_size)
print("\nApplied Min-Max scaling to 'size_acres'.")
print(parks_df[['name', 'size_acres', 'size_acres_scaled']].head())
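scikit-learn's MinMaxScaler applies the same formula, and because it remembers the fitted minimum and maximum it can later apply identical scaling to new data. A short equivalent sketch, assuming scikit-learn is available:

# Equivalent scaling with scikit-learn (assumes scikit-learn is installed).
from sklearn.preprocessing import MinMaxScaler

age_scaler = MinMaxScaler()
visitors_df['age_scaled'] = age_scaler.fit_transform(visitors_df[['age']]).ravel()

size_scaler = MinMaxScaler()
parks_df['size_acres_scaled'] = size_scaler.fit_transform(parks_df[['size_acres']]).ravel()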
# Example: Plot distribution of scaled age
import plotly.express as px
fig_age_hist = px.histogram(visitors_df, x='age_scaled', nbins=10, title='Distribution of Scaled Visitor Age')
fig_age_hist.update_layout(bargap=0.1, xaxis_title='Scaled Age', yaxis_title='Count')
# Display the chart (in a notebook environment or save to HTML)
# fig_age_hist.show()
Histogram showing the distribution of visitor ages after imputation and Min-Max scaling. The ages now fall within the 0 to 1 range.
Finally, let's combine these cleaned datasets into a single DataFrame. We can merge visitors_df
with parks_df
on the park ID, and potentially merge event data as well, though that might require more complex aggregation depending on the analysis goal. For now, let's merge visitor and park information.
# Ensure the key columns have the same name and compatible types
# We already renamed 'park_id' in visitors_df and 'id' in parks_df earlier.
# Let's make sure parks_df key is also named 'park_id' for the merge.
parks_df = parks_df.rename(columns={'id': 'park_id'})
# Merge visitors with park details
# Use left merge to keep all visitors, even if their park_id isn't in parks_df (handle potential mismatches)
visitor_park_merged_df = pd.merge(visitors_df, parks_df, on='park_id', how='left')
print("\nMerged visitor and park data:")
print(visitor_park_merged_df.head())
print(visitor_park_merged_df.info())
# Check if any park details are missing after the merge (indicates park_id mismatch or missing parks)
print(f"\nRows with missing park info after merge: {visitor_park_merged_df['name'].isnull().sum()}")
# You could perform further merges, for example, aggregating event attendance per park and merging that in.
# Example: Aggregate events data
park_event_summary = events_df.groupby('park_id').agg(
total_events=('event_id', 'count'),
avg_attendance=('attendance_est', 'mean')
).reset_index()
# Merge event summary with the main DataFrame
final_df = pd.merge(visitor_park_merged_df, park_event_summary, on='park_id', how='left')
# Fill NaN for parks with no events in our summary
final_df['total_events'] = final_df['total_events'].fillna(0)
final_df['avg_attendance'] = final_df['avg_attendance'].fillna(0) # Or another appropriate value
print("\nFinal merged data with event summary:")
print(final_df.head())
print(final_df.info())
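As a final sanity check on the join, pandas can flag rows that failed to find a match. The diagnostic re-merge below (a sketch of ours, purely for inspection) adds a _merge indicator column showing which visitors had no corresponding park record:

# Diagnostic re-merge: '_merge' shows 'both' for matched rows and 'left_only'
# for visitors whose park_id has no entry in parks_df.
check_df = pd.merge(visitors_df, parks_df, on='park_id', how='left', indicator=True)
print(check_df['_merge'].value_counts())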
You have now successfully practiced the core components of data acquisition and preparation covered in this chapter:

- Loading data from multiple simulated sources (CSV and JSON).
- Standardizing column names and correcting data types.
- Handling invalid values and imputing missing data.
- Scaling numeric features with Min-Max scaling.
- Merging the cleaned sources into a single, analysis-ready DataFrame.
This final_df
DataFrame is now significantly cleaner, more consistent, and structured appropriately for the next steps in the data science workflow: feature engineering and modeling, which we will cover in the subsequent chapters. Remember that data preparation is often iterative; you might revisit these steps as you understand the data better during analysis and modeling.
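If you want to hand this prepared dataset to the next stage without re-running the wrangling steps, you can persist it to disk; the filename below is just an illustrative choice.

# Save the prepared dataset for the feature engineering and modeling chapters.
final_df.to_csv('park_usage_prepared.csv', index=False)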