Often, the raw data you load into a Pandas DataFrame isn't immediately ready for visualization. You might have more columns than needed for a specific plot, or you might only want to visualize a subset of the rows that meet certain criteria. Preparing your data involves selecting the relevant pieces and filtering out the noise, ensuring your plots communicate the intended message clearly.
DataFrames can contain many columns, but a single plot usually only visualizes relationships between two or three variables. You can easily select the columns you need using bracket notation.
To select a single column (which returns a Pandas Series), use single brackets:
# Assume 'df' is your DataFrame
product_names = df['ProductName']
sales_data = df['Sales']
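A selected Series can be passed straight to a plotting call. As a minimal sketch (assuming matplotlib is installed and df has been loaded as above), here is a quick histogram of the 'Sales' column:
import matplotlib.pyplot as plt
# Plot the distribution of the 'Sales' Series directly
df['Sales'].plot(kind='hist', bins=20, title='Distribution of Sales')
plt.xlabel('Sales')
plt.show()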
To select multiple columns (which returns a new DataFrame), pass a list of column names inside the brackets, which is why you see two sets of brackets:
# Select 'ProductName' and 'Sales' columns
product_sales_df = df[['ProductName', 'Sales']]
# Display the first few rows of the new DataFrame
print(product_sales_df.head())
This product_sales_df now contains only the data needed for, say, a bar chart of sales per product, making it easier to pass to plotting functions.
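Drawing that bar chart from the two-column DataFrame might look like this (a minimal sketch, assuming matplotlib is available):
import matplotlib.pyplot as plt
# Bar chart of sales per product, built directly from the selected columns
product_sales_df.plot(kind='bar', x='ProductName', y='Sales', legend=False)
plt.ylabel('Sales')
plt.title('Sales per Product')
plt.show()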
Just as you might only need specific columns, you often only want to visualize rows that meet certain conditions. This is frequently called boolean indexing or conditional selection. You create a condition (which evaluates to True or False for each row) and use it inside the brackets to filter the DataFrame.
Example 1: Simple Condition
Let's say you have a DataFrame df with sales data and you only want to plot products with sales greater than 1000.
# Create a boolean Series: True for rows where Sales > 1000, False otherwise
high_sales_condition = df['Sales'] > 1000
# Use the boolean Series to filter the DataFrame
high_sales_df = df[high_sales_condition]
# Or, more concisely:
high_sales_df = df[df['Sales'] > 1000]
# Now, high_sales_df contains only rows meeting the condition
print(high_sales_df.head())
You can then use high_sales_df for plotting.
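For instance, the filtered DataFrame can be handed to a plotting call just like the full one (a minimal sketch, assuming the same 'ProductName' and 'Sales' columns as above and that matplotlib is installed):
import matplotlib.pyplot as plt
# Plot only the products that met the high-sales condition
high_sales_df.plot(kind='bar', x='ProductName', y='Sales', legend=False)
plt.title('Products with Sales > 1000')
plt.ylabel('Sales')
plt.show()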
Example 2: Multiple Conditions
You can combine conditions using logical operators: & for AND, | for OR. It's important to wrap each individual condition in parentheses () because of Python's operator precedence rules: & and | bind more tightly than comparison operators such as == and >, so without the parentheses the expression is grouped incorrectly and typically raises an error.
Suppose you want to visualize data for products in the 'Electronics' category and with sales above 500.
# Condition 1: Category is 'Electronics'
condition1 = df['Category'] == 'Electronics'
# Condition 2: Sales are greater than 500
condition2 = df['Sales'] > 500
# Combine conditions using AND (&)
electronics_high_sales_df = df[condition1 & condition2]
# Or, directly (note the parentheses around each condition):
electronics_high_sales_df = df[(df['Category'] == 'Electronics') & (df['Sales'] > 500)]
print(electronics_high_sales_df)
Similarly, to select 'Electronics' or products with sales above 1500:
# Combine conditions using OR (|)
relevant_products_df = df[(df['Category'] == 'Electronics') | (df['Sales'] > 1500)]
print(relevant_products_df)
Filtering allows you to focus your visualizations on specific segments of your data, making your plots more targeted and insightful.
Real-world datasets often contain missing values, represented in Pandas as NaN (Not a Number). Plotting functions might behave unexpectedly or produce errors when they encounter NaNs.
A simple strategy for dealing with missing data before plotting is to remove rows that have missing values in the columns you intend to plot. The .dropna() method is useful here.
# Assume we want to plot 'Sales' vs 'Profit'
# Check for missing values in these columns
print(df[['Sales', 'Profit']].isnull().sum())
# Drop rows where *either* 'Sales' or 'Profit' is NaN
cleaned_df = df.dropna(subset=['Sales', 'Profit'])
# Verify missing values are handled (output should be 0 for these columns)
print(cleaned_df[['Sales', 'Profit']].isnull().sum())
Now, cleaned_df can be used for plotting without issues caused by missing values in the 'Sales' or 'Profit' columns. Be aware that removing rows means discarding data, which might not always be the best approach, but it's a reasonable starting point for basic visualization preparation.
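If dropping rows would discard too much data, one common alternative (sketched here only for comparison) is to fill missing values instead, for example with each column's median:
# Fill missing 'Sales' and 'Profit' values with the column median instead of dropping rows
filled_df = df.copy()
filled_df['Sales'] = filled_df['Sales'].fillna(filled_df['Sales'].median())
filled_df['Profit'] = filled_df['Profit'].fillna(filled_df['Profit'].median())
# Output should now be 0 for both columns
print(filled_df[['Sales', 'Profit']].isnull().sum())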
Let's combine these ideas. Imagine you have a DataFrame customer_df with columns 'Age', 'City', and 'SpendingScore'. You want to create a scatter plot of 'Age' vs 'SpendingScore', but only for customers from 'New York' who are younger than 40.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample DataFrame (replace with loading your actual data)
data = {'Age': [25, 45, 35, 50, 22, 38, 60],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'New York', 'Chicago', 'New York'],
        'SpendingScore': [75, 50, 85, 40, 90, 45, 60]}
customer_df = pd.DataFrame(data)
# 1. Filter the data
filtered_customers = customer_df[(customer_df['City'] == 'New York') & (customer_df['Age'] < 40)]
# Display the filtered data
print("Filtered Data for Plotting:")
print(filtered_customers)
# 2. Create the plot using the filtered data
plt.figure(figsize=(8, 5)) # Set figure size
sns.scatterplot(data=filtered_customers, x='Age', y='SpendingScore')
plt.title('Spending Score vs. Age (New York Customers < 40)')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.grid(True) # Add grid for readability
plt.show()
By first filtering the DataFrame to include only the rows of interest (filtered_customers), the subsequent scatter plot specifically visualizes the relationship within that subgroup. Attempting to plot the entire customer_df and then mentally filtering would be much less effective.
Preparing your data using Pandas selection and filtering techniques is a common preliminary step in the data visualization workflow. It ensures that your plots are based on the correct subset of your data, leading to clearer and more accurate insights.