Often, the raw data you load into a Pandas DataFrame isn't immediately ready for visualization. You might have more columns than needed for a specific plot, or you might only want to visualize a subset of the rows that meet certain criteria. Preparing your data involves selecting the relevant pieces and filtering out the noise, ensuring your plots communicate the intended message clearly.
DataFrames can contain many columns, but a single plot usually only visualizes relationships between two or three variables. You can easily select the columns you need using bracket notation.
To select a single column (which returns a Pandas Series), use single brackets:
# Assume 'df' is your DataFrame
product_names = df['ProductName']
sales_data = df['Sales']
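A selected Series can be passed straight to a plotting call. As a minimal sketch (assuming matplotlib is installed and df has been loaded as above), here is a quick histogram of the 'Sales' column:
import matplotlib.pyplot as plt
# Plot the distribution of the 'Sales' Series directly
df['Sales'].plot(kind='hist', bins=20, title='Distribution of Sales')
plt.xlabel('Sales')
plt.show()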
To select multiple columns (which returns a new DataFrame), pass a list of column names inside the brackets, which is why you see two sets of brackets:
# Select 'ProductName' and 'Sales' columns
product_sales_df = df[['ProductName', 'Sales']]
# Display the first few rows of the new DataFrame
print(product_sales_df.head())
This product_sales_df now contains only the data needed for, say, a bar chart of sales per product, making it easier to pass to plotting functions.
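Drawing that bar chart from the two-column DataFrame might look like this (a minimal sketch, assuming matplotlib is available):
import matplotlib.pyplot as plt
# Bar chart of sales per product, built directly from the selected columns
product_sales_df.plot(kind='bar', x='ProductName', y='Sales', legend=False)
plt.ylabel('Sales')
plt.title('Sales per Product')
plt.show()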
Just as you might only need specific columns, you often only want to visualize rows that meet certain conditions. This is frequently called boolean indexing or conditional selection. You create a condition (which evaluates to True or False for each row) and use it inside the brackets to filter the DataFrame.
Example 1: Simple Condition
Let's say you have a DataFrame df with sales data and you only want to plot products with sales greater than 1000.
# Create a boolean Series: True for rows where Sales > 1000, False otherwise
high_sales_condition = df['Sales'] > 1000
# Use the boolean Series to filter the DataFrame
high_sales_df = df[high_sales_condition]
# Or, more concisely:
high_sales_df = df[df['Sales'] > 1000]
# Now, high_sales_df contains only rows meeting the condition
print(high_sales_df.head())
You can then use high_sales_df for plotting.
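For instance, the filtered DataFrame can be handed to a plotting call just like the full one (a minimal sketch, assuming the same 'ProductName' and 'Sales' columns as above and that matplotlib is installed):
import matplotlib.pyplot as plt
# Plot only the products that met the high-sales condition
high_sales_df.plot(kind='bar', x='ProductName', y='Sales', legend=False)
plt.title('Products with Sales > 1000')
plt.ylabel('Sales')
plt.show()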
Example 2: Multiple Conditions
You can combine conditions using logical operators: & for AND, | for OR. It's important to wrap each individual condition in parentheses () because of Python's operator precedence rules: & and | bind more tightly than comparison operators such as == and >, so without the parentheses the expression is grouped incorrectly and typically raises an error.
Suppose you want to visualize data for products in the 'Electronics' category and with sales above 500.
# Condition 1: Category is 'Electronics'
condition1 = df['Category'] == 'Electronics'
# Condition 2: Sales are greater than 500
condition2 = df['Sales'] > 500
# Combine conditions using AND (&)
electronics_high_sales_df = df[condition1 & condition2]
# Or, directly (note the parentheses around each condition):
electronics_high_sales_df = df[(df['Category'] == 'Electronics') & (df['Sales'] > 500)]
print(electronics_high_sales_df)
Similarly, to select 'Electronics' or products with sales above 1500:
# Combine conditions using OR (|)
relevant_products_df = df[(df['Category'] == 'Electronics') | (df['Sales'] > 1500)]
print(relevant_products_df)
Filtering allows you to focus your visualizations on specific segments of your data, making your plots more targeted and insightful.
Real-world datasets often contain missing values, represented in Pandas as NaN (Not a Number). Plotting functions might behave unexpectedly or produce errors when they encounter NaNs.
A simple strategy for dealing with missing data before plotting is to remove rows that have missing values in the columns you intend to plot. The .dropna() method is useful here.
# Assume we want to plot 'Sales' vs 'Profit'
# Check for missing values in these columns
print(df[['Sales', 'Profit']].isnull().sum())
# Drop rows where *either* 'Sales' or 'Profit' is NaN
cleaned_df = df.dropna(subset=['Sales', 'Profit'])
# Verify missing values are handled (output should be 0 for these columns)
print(cleaned_df[['Sales', 'Profit']].isnull().sum())
Now, cleaned_df can be used for plotting without issues caused by missing values in the 'Sales' or 'Profit' columns. Be aware that removing rows means discarding data, which might not always be the best approach, but it's a reasonable starting point for basic visualization preparation.
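If dropping rows would discard too much data, one common alternative (sketched here only for comparison) is to fill missing values instead, for example with each column's median:
# Fill missing 'Sales' and 'Profit' values with the column median instead of dropping rows
filled_df = df.copy()
filled_df['Sales'] = filled_df['Sales'].fillna(filled_df['Sales'].median())
filled_df['Profit'] = filled_df['Profit'].fillna(filled_df['Profit'].median())
# Output should now be 0 for both columns
print(filled_df[['Sales', 'Profit']].isnull().sum())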
Let's combine these ideas. Imagine you have a DataFrame customer_df with columns 'Age', 'City', and 'SpendingScore'. You want to create a scatter plot of 'Age' vs 'SpendingScore', but only for customers from 'New York' who are younger than 40.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample DataFrame (replace with loading your actual data)
data = {'Age': [25, 45, 35, 50, 22, 38, 60],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'New York', 'Chicago', 'New York'],
        'SpendingScore': [75, 50, 85, 40, 90, 45, 60]}
customer_df = pd.DataFrame(data)
# 1. Filter the data
filtered_customers = customer_df[(customer_df['City'] == 'New York') & (customer_df['Age'] < 40)]
# Display the filtered data
print("Filtered Data for Plotting:")
print(filtered_customers)
# 2. Create the plot using the filtered data
plt.figure(figsize=(8, 5)) # Set figure size
sns.scatterplot(data=filtered_customers, x='Age', y='SpendingScore')
plt.title('Spending Score vs. Age (New York Customers < 40)')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.grid(True) # Add grid for readability
plt.show()
By first filtering the DataFrame to include only the rows of interest (filtered_customers), the subsequent scatter plot specifically visualizes the relationship within that subgroup. Attempting to plot the entire customer_df and then mentally filtering would be much less effective.
Preparing your data using Pandas selection and filtering techniques is a common preliminary step in the data visualization workflow. It ensures that your plots are based on the correct subset of your data, leading to clearer and more accurate insights.