Working through a typical data wrangling workflow involves simulating the process of gathering data from different sources, cleaning it up, performing transformations, and combining it into a single, analysis-ready dataset. This approach provides an opportunity to apply data wrangling skills in realistic scenarios.Imagine we're working with data related to city park usage. We have information about park visitors from a survey (simulated database export), data on specific events held in parks (simulated API response), and some basic details about the parks themselves (perhaps scraped from a city website or provided as a simple file). Our goal is to integrate these disparate sources into a unified dataset suitable for later analysis, such as understanding which events attract more visitors or identifying demographic patterns in park usage.Setting Up Your EnvironmentFirst, ensure you have the necessary Python libraries installed and import them. We'll primarily use pandas for data manipulation, requests (or simulate file loading) for API interaction, and potentially libraries like BeautifulSoup for scraping (or simulate file loading) and sqlalchemy for database interaction (or simulate with pandas reading CSV/SQLite).import pandas as pd import numpy as np import json # If loading from JSON file # import requests # If interacting with a live API # from bs4 import BeautifulSoup # If doing live web scraping # from sqlalchemy import create_engine # If connecting to a SQL database # For this example, we'll assume data is loaded from local files # simulating the outputs of acquisition steps.Step 1: Load and Inspect Initial DatasetsWe'll start by loading our simulated datasets. Let's assume we have:visitors.csv: Contains survey responses with visitor demographics and visit frequency.events.json: A JSON file detailing events held in parks, including park ID, event type, and estimated attendance.parks.csv: Basic information about each park, like name, ID, and size (acres).# Load visitor survey data (simulating database query result) try: visitors_df = pd.read_csv('visitors.csv') print("Visitor data loaded successfully.") # Display first few rows and info print(visitors_df.head()) print(visitors_df.info()) except FileNotFoundError: print("Error: visitors.csv not found. Please ensure the file is in the correct directory.") # Create dummy data if file not found, so rest of code can run visitors_df = pd.DataFrame({ 'visitor_id': range(1, 101), 'age': np.random.randint(18, 75, 100), 'gender': np.random.choice(['M', 'F', 'NB', None], 100, p=[0.45, 0.45, 0.05, 0.05]), 'visit_frequency': np.random.choice(['Daily', 'Weekly', 'Monthly', 'Rarely', np.nan], 100, p=[0.1, 0.3, 0.4, 0.15, 0.05]), 'primary_park_id': np.random.randint(1, 6, 100) }) visitors_df.loc[::10, 'age'] = np.nan # Introduce some missing ages # Load event data (simulating API response) try: with open('events.json', 'r') as f: events_data = json.load(f) events_df = pd.json_normalize(events_data, record_path='events') print("\nEvent data loaded successfully.") # Display first few rows and info print(events_df.head()) print(events_df.info()) except FileNotFoundError: print("Error: events.json not found. Creating dummy data.") events_df = pd.DataFrame({ 'event_id': [f'evt{i:03}' for i in range(1, 21)], 'park_id': np.random.randint(1, 6, 20), 'event_type': np.random.choice(['Concert', 'Market', 'Festival', 'Sports'], 20), 'attendance_est': np.random.randint(50, 500, 20), 'event_date': pd.to_datetime(pd.date_range(start='2023-06-01', periods=20, freq='3D')) }) events_df.loc[[3, 15], 'attendance_est'] = -1 # Introduce problematic data # Load park details (simulating scraped or simple file data) try: parks_df = pd.read_csv('parks.csv') print("\nPark data loaded successfully.") # Display first few rows and info print(parks_df.head()) print(parks_df.info()) except FileNotFoundError: print("Error: parks.csv not found. Creating dummy data.") parks_df = pd.DataFrame({ 'Park ID': range(1, 6), 'Park Name': [f'Park {chr(64+i)}' for i in range(1, 6)], 'Size (acres)': [25.5, 102.0, 15.8, 55.1, 220.9], 'Has Playground': [True, True, False, True, True] }) Immediately, we might notice inconsistencies or issues from the .info() and .head() outputs:Column names might differ in case or naming convention (e.g., primary_park_id vs. park_id vs. Park ID).Data types might need correction (e.g., event_date might be an object).Missing values (NaN, None) are present in visitors_df.Potentially problematic data exists (e.g., negative attendance in events_df).Column names in parks_df have spaces and parentheses.Step 2: Data Cleaning and StandardizationLet's address the issues identified above.Standardize Column NamesConsistent column naming makes merging and analysis much easier. We'll convert names to lowercase and replace spaces/special characters with underscores.# Clean visitor column names (already mostly clean) visitors_df.columns = visitors_df.columns.str.lower().str.replace(' ', '_') # Clean event column names events_df.columns = events_df.columns.str.lower().str.replace(' ', '_') # Clean park column names parks_df.columns = parks_df.columns.str.lower().str.replace(' ', '_').str.replace('[()]', '', regex=True) # Rename columns for consistency before merging parks_df = parks_df.rename(columns={'park_id': 'id', 'park_name': 'name'}) # Example if original was 'Park ID' visitors_df = visitors_df.rename(columns={'primary_park_id': 'park_id'}) # Align columns # Assuming events_df 'park_id' is already consistent with visitors_df print("\nCleaned column names:") print("Visitors:", visitors_df.columns) print("Events:", events_df.columns) print("Parks:", parks_df.columns)Correct Data TypesEnsure columns have the appropriate data types. For instance, event_date should be a datetime object.# Correct data types if needed (example assumes event_date was loaded as object) if 'event_date' in events_df.columns and events_df['event_date'].dtype == 'object': events_df['event_date'] = pd.to_datetime(events_df['event_date'], errors='coerce') print("\nCorrected event_date dtype:", events_df['event_date'].dtype) # Check other types using visitors_df.info(), events_df.info(), parks_df.info() # Correct other types as necessary, e.g., ensuring IDs are integers if appropriateHandle Erroneous DataAddress obviously incorrect values, like the negative attendance estimate. We might replace it with NaN or use a domain-specific rule.# Handle illogical values like negative attendance print(f"\nEvents with invalid attendance before correction: {len(events_df[events_df['attendance_est'] < 0])}") events_df.loc[events_df['attendance_est'] < 0, 'attendance_est'] = np.nan print(f"Events with invalid attendance after correction: {len(events_df[events_df['attendance_est'] < 0])}")Step 3: Handle Missing ValuesNow, let's apply strategies for missing data (NaN values).print("\nMissing values before handling:") print("Visitors:\n", visitors_df.isnull().sum()) print("Events:\n", events_df.isnull().sum()) print("Parks:\n", parks_df.isnull().sum()) # Strategy for visitors_df: # - age: Impute with the median age median_age = visitors_df['age'].median() visitors_df['age'].fillna(median_age, inplace=True) print(f"\nImputed missing 'age' with median: {median_age}") # - gender: Impute with 'Unknown' or the mode, depending on analysis needs mode_gender = visitors_df['gender'].mode()[0] # Use mode if appropriate visitors_df['gender'].fillna('Unknown', inplace=True) # Or impute with 'Unknown' print(f"Filled missing 'gender' with 'Unknown'") # - visit_frequency: Impute with the mode mode_freq = visitors_df['visit_frequency'].mode()[0] visitors_df['visit_frequency'].fillna(mode_freq, inplace=True) print(f"Imputed missing 'visit_frequency' with mode: {mode_freq}") # Strategy for events_df: # - attendance_est: Impute with the median attendance for that event type events_df['attendance_est'] = events_df.groupby('event_type')['attendance_est'].transform(lambda x: x.fillna(x.median())) # Handle cases where an entire group might be NaN events_df['attendance_est'].fillna(events_df['attendance_est'].median(), inplace=True) print(f"Imputed missing 'attendance_est' using median within event_type or overall median.") print("\nMissing values after handling:") print("Visitors:\n", visitors_df.isnull().sum()) print("Events:\n", events_df.isnull().sum())Step 4: Data TransformationLet's apply scaling to the visitor age and park size_acres. We'll use Min-Max scaling as discussed in the chapter: $x_{scaled} = (x - \min(x)) / (\max(x) - \min(x))$.# Min-Max Scale visitor age min_age = visitors_df['age'].min() max_age = visitors_df['age'].max() visitors_df['age_scaled'] = (visitors_df['age'] - min_age) / (max_age - min_age) print("\nApplied Min-Max scaling to 'age'.") print(visitors_df[['age', 'age_scaled']].head()) # Min-Max Scale park size min_size = parks_df['size_acres'].min() max_size = parks_df['size_acres'].max() parks_df['size_acres_scaled'] = (parks_df['size_acres'] - min_size) / (max_size - min_size) print("\nApplied Min-Max scaling to 'size_acres'.") print(parks_df[['name', 'size_acres', 'size_acres_scaled']].head()) # Example: Plot distribution of scaled age import plotly.express as px fig_age_hist = px.histogram(visitors_df, x='age_scaled', nbins=10, title='Distribution of Scaled Visitor Age') fig_age_hist.update_layout(bargap=0.1, xaxis_title='Scaled Age', yaxis_title='Count') # Display the chart (in a notebook environment or save to HTML) # fig_age_hist.show() # For web output, print the JSON representation: # print(fig_age_hist.to_json(pretty=False)) # This can be long; generate a concise version for display # Generate Plotly JSON for web display chart_json_output = ''' ```plotly {"layout": {"title": {"text": "Distribution of Scaled Visitor Age"}, "bargap": 0.1, "xaxis": {"title": {"text": "Scaled Age"}}, "yaxis": {"title": {"text": "Count"}}}, "data": [{"type": "histogram", "x": [0.2982456140350877, 0.8421052631578947, 0.19298245614035087, 0.7192982456140351, 0.47368421052631576, 0.8596491228070176, 0.0, 0.9122807017543859, 0.49122807017543857, 0.49122807017543857, 0.2807017543859649, 0.5614035087719298, 0.6491228070175439, 0.15789473684210525, 0.05263157894736842, 0.6666666666666666, 0.9473684210526315, 0.3157894736842105, 0.3333333333333333, 0.7017543859649122, 0.49122807017543857, 0.6491228070175439, 0.40350877192982454, 0.3508771929824561, 0.7894736842105263, 0.10526315789473684, 0.7543859649122807, 0.17543859649122806, 0.2982456140350877, 0.08771929824561403, 0.49122807017543857, 0.7017543859649122, 0.017543859649122806, 0.8771929824561403, 0.7368421052631579, 0.3333333333333333, 0.42105263157894735, 0.9298245614035088, 0.12280701754385964, 0.631578947368421, 0.49122807017543857, 0.8245614035087719, 0.21052631578947367, 0.43859649122807015, 0.45614035087719296, 0.6140350877192983, 0.2807017543859649, 0.8070175438596491, 0.19298245614035087, 0.5087719298245614, 0.49122807017543857, 0.07017543859649122, 0.22807017543859648, 0.2631578947368421, 0.49122807017543857, 0.8947368421052632, 0.5789473684210527, 0.3684210526315789, 0.03508771929824561, 0.9824561403508771, 0.49122807017543857, 0.14035087719298245, 0.6842105263157894, 0.24561403508771928, 0.49122807017543857, 1.0, 0.9649122807017544, 0.5964912280701754, 0.49122807017543857, 0.2631578947368421, 0.49122807017543857, 0.5263157894736842, 0.7719298245614035, 0.38596491228070173, 0.8947368421052632, 0.49122807017543857, 0.543859649122807, 0.49122807017543857, 0.6491228070175439, 0.49122807017543857, 0.03508771929824561, 0.8771929824561403, 0.21052631578947367, 0.6140350877192983, 0.49122807017543857, 0.9122807017543859, 0.5614035087719298, 0.7368421052631579, 0.5087719298245614, 0.14035087719298245, 0.24561403508771928, 0.3684210526315789, 0.10526315789473684, 0.49122807017543857, 0.3157894736842105, 0.5964912280701754, 0.8596491228070176, 0.40350877192982454, 0.5263157894736842], "nbinsx": 10}]}''' print(chart_json_output){"layout": {"title": {"text": "Distribution of Scaled Visitor Age"}, "bargap": 0.1, "xaxis": {"title": {"text": "Scaled Age"}}, "yaxis": {"title": {"text": "Count"}}}, "data": [{"type": "histogram", "x": [0.2982456140350877, 0.8421052631578947, 0.19298245614035087, 0.7192982456140351, 0.47368421052631576, 0.8596491228070176, 0.0, 0.9122807017543859, 0.49122807017543857, 0.49122807017543857, 0.2807017543859649, 0.5614035087719298, 0.6491228070175439, 0.15789473684210525, 0.05263157894736842, 0.6666666666666666, 0.9473684210526315, 0.3157894736842105, 0.3333333333333333, 0.7017543859649122, 0.49122807017543857, 0.6491228070175439, 0.40350877192982454, 0.3508771929824561, 0.7894736842105263, 0.10526315789473684, 0.7543859649122807, 0.17543859649122806, 0.2982456140350877, 0.08771929824561403, 0.49122807017543857, 0.7017543859649122, 0.017543859649122806, 0.8771929824561403, 0.7368421052631579, 0.3333333333333333, 0.42105263157894735, 0.9298245614035088, 0.12280701754385964, 0.631578947368421, 0.49122807017543857, 0.8245614035087719, 0.21052631578947367, 0.43859649122807015, 0.45614035087719296, 0.6140350877192983, 0.2807017543859649, 0.8070175438596491, 0.19298245614035087, 0.5087719298245614, 0.49122807017543857, 0.07017543859649122, 0.22807017543859648, 0.2631578947368421, 0.49122807017543857, 0.8947368421052632, 0.5789473684210527, 0.3684210526315789, 0.03508771929824561, 0.9824561403508771, 0.49122807017543857, 0.14035087719298245, 0.6842105263157894, 0.24561403508771928, 0.49122807017543857, 1.0, 0.9649122807017544, 0.5964912280701754, 0.49122807017543857, 0.2631578947368421, 0.49122807017543857, 0.5263157894736842, 0.7719298245614035, 0.38596491228070173, 0.8947368421052632, 0.49122807017543857, 0.543859649122807, 0.49122807017543857, 0.6491228070175439, 0.49122807017543857, 0.03508771929824561, 0.8771929824561403, 0.21052631578947367, 0.6140350877192983, 0.49122807017543857, 0.9122807017543859, 0.5614035087719298, 0.7368421052631579, 0.5087719298245614, 0.14035087719298245, 0.24561403508771928, 0.3684210526315789, 0.10526315789473684, 0.49122807017543857, 0.3157894736842105, 0.5964912280701754, 0.8596491228070176, 0.40350877192982454, 0.5263157894736842], "nbinsx": 10}]}Histogram showing the distribution of visitor ages after imputation and Min-Max scaling. The ages now fall within the 0 to 1 range.Step 5: Merging DatasetsFinally, let's combine these cleaned datasets into a single DataFrame. We can merge visitors_df with parks_df on the park ID, and potentially merge event data as well, though that might require more complex aggregation depending on the analysis goal. For now, let's merge visitor and park information.# Ensure the columns have the same name and compatible types # We already renamed 'park_id' in visitors_df and 'id' in parks_df earlier. # Let's make sure parks_df is also named 'park_id' for the merge. parks_df = parks_df.rename(columns={'id': 'park_id'}) # Merge visitors with park details # Use left merge to keep all visitors, even if their park_id isn't in parks_df (handle potential mismatches) visitor_park_merged_df = pd.merge(visitors_df, parks_df, on='park_id', how='left') print("\nMerged visitor and park data:") print(visitor_park_merged_df.head()) print(visitor_park_merged_df.info()) # Check if any park details are missing after the merge (indicates park_id mismatch or missing parks) print(f"\nRows with missing park info after merge: {visitor_park_merged_df['name'].isnull().sum()}") # You could perform further merges, for example, aggregating event attendance per park and merging that in. # Example: Aggregate events data park_event_summary = events_df.groupby('park_id').agg( total_events=('event_id', 'count'), avg_attendance=('attendance_est', 'mean') ).reset_index() # Merge event summary with the main DataFrame final_df = pd.merge(visitor_park_merged_df, park_event_summary, on='park_id', how='left') # Fill NaN for parks with no events in our summary final_df['total_events'].fillna(0, inplace=True) final_df['avg_attendance'].fillna(0, inplace=True) # Or another appropriate value print("\nFinal merged data with event summary:") print(final_df.head()) print(final_df.info())Conclusion of PracticeYou have now successfully practiced the core components of data acquisition and preparation covered in this chapter:Loaded data simulating different sources (database, API, file/scrape).Inspected data for quality issues.Standardized column names for consistency.Corrected data types and handled erroneous entries.Applied strategies for handling missing values (imputation).Performed data transformations (Min-Max scaling).Merged data from multiple sources into a unified DataFrame.This final_df DataFrame is now significantly cleaner, more consistent, and structured appropriately for the next steps in the data science workflow: feature engineering and modeling, which we will cover in the subsequent chapters. Remember that data preparation is often iterative; you might revisit these steps as you understand the data better during analysis and modeling.