While selecting data by labels (.loc
) or integer positions (.iloc
) is useful, often you need to filter your data based on the values within the columns themselves. For instance, you might want to find all rows where the 'Age' column is greater than 30, or where the 'City' column is 'New York'. This process of selecting data based on conditions is known as boolean indexing.
At its core, boolean indexing involves creating a boolean Series (a Series containing only True
or False
values) based on a condition, and then using that Series to select rows from the DataFrame.
Let's start with a simple DataFrame. Imagine we have data about individuals, including their age:
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Age': [24, 27, 22, 32, 29, 18],
'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
'Salary': [70000, 80000, 65000, 90000, 85000, 60000]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 New York 65000
3 David 32 Chicago 90000
4 Eve 29 Los Angeles 85000
5 Frank 18 Chicago 60000
Now, suppose we want to find out which individuals are older than 25. We can apply a comparison directly to the 'Age' column:
condition = df['Age'] > 25
print(condition)
Output:
0 False
1 True
2 False
3 True
4 True
5 False
Name: Age, dtype: bool
As you can see, Pandas performs the comparison (> 25
) element-wise for the entire 'Age' Series, resulting in a new Series of boolean values. True
indicates the condition is met for that row, and False
indicates it is not.
This boolean Series acts like a filter. You can pass it directly inside the square brackets []
of the DataFrame (or .loc
) to select only the rows where the condition evaluated to True
:
# Select rows where Age > 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)
Output:
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Chicago 90000
4 Eve 29 Los Angeles 85000
Only the rows corresponding to True
in the boolean Series (indices 1, 3, and 4) are returned.
This works similarly for other types of conditions, like string comparisons:
# Select rows where City is 'New York'
in_new_york = df[df['City'] == 'New York']
print(in_new_york)
Output:
Name Age City Salary
0 Alice 24 New York 70000
2 Charlie 22 New York 65000
Real-world data analysis often requires filtering based on multiple criteria simultaneously. You can combine boolean conditions using logical operators:
&
: Element-wise logical AND|
: Element-wise logical OR~
: Element-wise logical NOTImportant: When combining conditions, you must enclose each individual condition in parentheses ()
due to Python's operator precedence rules. Using Python's standard and
, or
, not
operators will result in an error or unexpected behavior because they work on the truthiness of entire objects, not element-wise.
&
)Let's find individuals who live in 'Los Angeles' and are older than 25:
# Condition 1: City is 'Los Angeles'
# Condition 2: Age > 25
# Combine with &
combined_condition_and = (df['City'] == 'Los Angeles') & (df['Age'] > 25)
print(df[combined_condition_and])
Output:
Name Age City Salary
1 Bob 27 Los Angeles 80000
4 Eve 29 Los Angeles 85000
Only Bob and Eve satisfy both conditions.
|
)Now, let's find individuals who live in 'Chicago' or have a salary greater than 80000:
# Condition 1: City is 'Chicago'
# Condition 2: Salary > 80000
# Combine with |
combined_condition_or = (df['City'] == 'Chicago') | (df['Salary'] > 80000)
print(df[combined_condition_or])
Output:
Name Age City Salary
3 David 32 Chicago 90000
4 Eve 29 Los Angeles 85000
5 Frank 18 Chicago 60000
David and Frank live in Chicago. Eve doesn't live in Chicago, but her salary is greater than 80000. All three are included.
~
)To select rows where a condition is not met, use the tilde ~
operator. Let's find everyone who does not live in 'New York':
# Condition: City is 'New York'
# Negate with ~
not_in_new_york = df[~(df['City'] == 'New York')]
print(not_in_new_york)
Output:
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Chicago 90000
4 Eve 29 Los Angeles 85000
5 Frank 18 Chicago 60000
.loc
Boolean indexing works seamlessly with the .loc
accessor as well. This can be particularly useful and readable, especially when you want to select specific columns based on row conditions:
# Select Name and Salary for people older than 25
older_names_salaries = df.loc[df['Age'] > 25, ['Name', 'Salary']]
print(older_names_salaries)
Output:
Name Salary
1 Bob 80000
3 David 90000
4 Eve 85000
Here, the first argument to .loc
is the boolean condition selecting the rows, and the second argument specifies the columns to retrieve for those rows.
True
/False
).[]
or .loc
to filter the DataFrame.&
(AND), |
(OR), and ~
(NOT).()
when using &
or |
. Example: (condition1) & (condition2)
..loc
Integration: Boolean arrays work very effectively with .loc
for selecting rows and specific columns simultaneously based on conditions.Conditional selection is a fundamental technique for subsetting your data. It allows you to isolate the specific observations that meet your criteria, forming the basis for more focused analysis and manipulation tasks.
© 2025 ApX Machine Learning