Now that you understand the fundamental Pandas data structures, Series and DataFrame, the next logical step is learning how to access specific parts of this data. Selecting subsets of data (rows, columns, or individual cells) is a foundational operation for nearly all data analysis and manipulation tasks. Pandas provides several optimized methods for indexing and selection, letting you select either by label or by integer position.
[]
The most basic way to select data is using the square bracket operator []. Its behavior depends on what you pass inside the brackets and whether you're working with a Series or a DataFrame.
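For the Series case, a minimal sketch may help. The Series `s` and its labels here are made up for illustration and are not part of the weather data used in the rest of this section.

```python
import pandas as pd

# Hypothetical Series with string labels (illustrative data only)
s = pd.Series([25.3, 26.1, 24.8], index=['mon', 'tue', 'wed'])

single = s['tue']             # one label -> a scalar value
several = s[['mon', 'wed']]   # list of labels -> a Series
by_label = s['mon':'tue']     # label slice -> the end label is included
by_position = s[0:2]          # integer slice -> positions 0 and 1

print(single)
print(list(several.index))
print(list(by_label.index))
print(list(by_position.index))
```

On a Series, [] works on the index labels, whereas on a DataFrame a string inside [] refers to a column name.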
Selecting Columns in a DataFrame: Passing a single column name (string) or a list of column names selects one or more columns.
import pandas as pd
import numpy as np
# Assume df is a pre-defined DataFrame like:
data = {
'Temperature': [25.3, 26.1, 24.8, 27.0, 23.9, 25.5, 26.8],
'Humidity': [65, 68, 62, 70, 60, 66, 69],
'Pressure': [1012, 1010, 1015, 1009, 1013, 1011, 1008]
}
index_labels = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
'2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'])
df = pd.DataFrame(data, index=index_labels)
# Select a single column (returns a Series)
temperatures = df['Temperature']
print(type(temperatures))
# Output: <class 'pandas.core.series.Series'>
print(temperatures)
# Output:
# 2023-01-01 25.3
# 2023-01-02 26.1
# 2023-01-03 24.8
# 2023-01-04 27.0
# 2023-01-05 23.9
# 2023-01-06 25.5
# 2023-01-07 26.8
# Name: Temperature, dtype: float64
# Select multiple columns (returns a DataFrame)
weather_subset = df[['Temperature', 'Pressure']]
print(type(weather_subset))
# Output: <class 'pandas.core.frame.DataFrame'>
print(weather_subset.head(3))
# Output:
# Temperature Pressure
# 2023-01-01 25.3 1012
# 2023-01-02 26.1 1010
# 2023-01-03 24.8 1015
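The difference between passing a string and passing a list matters even for a single column. The sketch below uses a small stand-in frame, `df_small` (a name chosen only for this illustration), to show that a one-element list still returns a DataFrame, and that an unknown column name raises a KeyError.

```python
import pandas as pd

# Small illustrative frame; df_small is a hypothetical name for this sketch
df_small = pd.DataFrame({
    'Temperature': [25.3, 26.1, 24.8],
    'Humidity': [65, 68, 62],
})

as_series = df_small['Temperature']    # string -> Series
as_frame = df_small[['Temperature']]   # one-element list -> DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)

# A column name that does not exist raises KeyError
try:
    df_small['Wind']
except KeyError as exc:
    print("KeyError:", exc)
```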
Selecting Rows (Slicing) in a DataFrame: If you pass an integer slice (like 0:3), [] selects rows by position rather than columns. This can be confusing because [] is otherwise column-oriented for DataFrames.
# Select the first 3 rows
first_three_days = df[0:3]
print(first_three_days)
# Output:
# Temperature Humidity Pressure
# 2023-01-01 25.3 65 1012
# 2023-01-02 26.1 68 1010
# 2023-01-03 24.8 62 1015
Using [] for row selection via slicing works, but for clarity and to avoid confusion, especially when dealing with integer indices, Pandas offers more explicit methods: .loc and .iloc.
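The confusion with integer indices is easiest to see on a frame whose labels are themselves integers. The following is a small sketch with a made-up frame, `df_int`, whose index starts at 1, so labels and positions never coincide.

```python
import pandas as pd

# Hypothetical frame whose index labels are the integers 1..4
df_int = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[1, 2, 3, 4])

plain = df_int[1:3]          # [] with an integer slice: positions 1 and 2
by_label = df_int.loc[1:3]   # .loc: labels 1 through 3, end label included
by_pos = df_int.iloc[1:3]    # .iloc: positions 1 and 2, end position excluded

print(list(plain.index))     # [2, 3]
print(list(by_label.index))  # [1, 2, 3]
print(list(by_pos.index))    # [2, 3]

# Single-row access differs in the same way
print(df_int.loc[1, 'value'])   # row labelled 1 -> 10
print(df_int.iloc[1, 0])        # row at position 1 -> 20
```

The same numbers inside the brackets return different rows depending on the indexer, which is why spelling out .loc or .iloc makes the intent unambiguous.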
.loc
The .loc indexer is used for selection primarily based on labels (index names and column names). It provides a very explicit way to select data.
Syntax: df.loc[row_label(s), column_label(s)]
Selecting Rows:
- df.loc['2023-01-02'] (returns a Series)
- df.loc[['2023-01-02', '2023-01-05']] (returns a DataFrame)
- df.loc['2023-01-02':'2023-01-04'] (returns a DataFrame). Notice that label-based slicing includes the end label.
Selecting Rows and Columns:
- df.loc['2023-01-03', 'Humidity']
- df.loc[['2023-01-03', '2023-01-06'], ['Temperature', 'Pressure']]
- df.loc['2023-01-02':'2023-01-04', 'Humidity':'Pressure']
- df.loc[:, ['Temperature', 'Humidity']] (using : selects all rows)
- df.loc[['2023-01-05', '2023-01-07'], :] (using : selects all columns)
# Example using .loc
# Single row by label
print("Row for 2023-01-02:")
print(df.loc['2023-01-02'])
# Output:
# Row for 2023-01-02:
# Temperature 26.1
# Humidity 68.0
# Pressure 1010.0
# Name: 2023-01-02 00:00:00, dtype: float64
# Slice of rows by label (inclusive)
print("\nRows from 2023-01-02 to 2023-01-04:")
print(df.loc['2023-01-02':'2023-01-04'])
# Output:
# Rows from 2023-01-02 to 2023-01-04:
# Temperature Humidity Pressure
# 2023-01-02 26.1 68 1010
# 2023-01-03 24.8 62 1015
# 2023-01-04 27.0 70 1009
# Specific rows and columns
print("\nTemp and Pressure for specific dates:")
print(df.loc[['2023-01-03', '2023-01-06'], ['Temperature', 'Pressure']])
# Output:
# Temp and Pressure for specific dates:
# Temperature Pressure
# 2023-01-03 24.8 1015
# 2023-01-06 25.5 1011
# All rows, specific columns
print("\nHumidity and Pressure columns:")
print(df.loc[:, ['Humidity', 'Pressure']].head(3))
# Output:
# Humidity and Pressure columns:
# Humidity Pressure
# 2023-01-01 65 1012
# 2023-01-02 68 1010
# 2023-01-03 62 1015
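Two of the .loc forms listed earlier, the single cell and the slice on both axes, can be sketched as follows. The frame `df_loc` is a shortened copy of the example data (first four days only), rebuilt here so the snippet runs on its own; the date 2023-02-01 is an arbitrary label chosen because it is absent from the index.

```python
import pandas as pd

# First four days of the example data, rebuilt so this sketch is self-contained
df_loc = pd.DataFrame(
    {
        'Temperature': [25.3, 26.1, 24.8, 27.0],
        'Humidity': [65, 68, 62, 70],
        'Pressure': [1012, 1010, 1015, 1009],
    },
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']),
)

# Single cell: one row label and one column label -> a scalar
cell = df_loc.loc['2023-01-03', 'Humidity']
print(cell)  # 62

# Slices on both axes: both end labels are included
block = df_loc.loc['2023-01-02':'2023-01-04', 'Humidity':'Pressure']
print(block.shape)  # (3, 2)

# A single label that is not in the index raises KeyError
try:
    df_loc.loc['2023-02-01']
except KeyError as exc:
    print("KeyError:", exc)
```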
.iloc
The .iloc indexer is used for selection primarily based on integer positions (from 0 to length−1, like standard Python lists). It ignores index labels and column names.
Syntax: df.iloc[row_position(s), column_position(s)]
Selecting Rows:
- df.iloc[1] (returns a Series corresponding to the second row)
- df.iloc[[1, 4]] (returns a DataFrame with the second and fifth rows)
- df.iloc[1:4] (returns a DataFrame with rows at positions 1, 2, and 3). Notice that integer-based slicing excludes the end position, consistent with Python's slicing behavior.
Selecting Rows and Columns:
- df.iloc[2, 1] (cell at the 3rd row, 2nd column)
- df.iloc[[2, 5], [0, 2]] (3rd and 6th rows, 1st and 3rd columns)
- df.iloc[1:4, 0:2] (rows 1-3, columns 0-1)
- df.iloc[:, [0, 1]] (1st and 2nd columns)
- df.iloc[[4, 6], :] (5th and 7th rows)
# Example using .iloc
# Single row by position (second row)
print("Row at index position 1:")
print(df.iloc[1])
# Output:
# Row at index position 1:
# Temperature 26.1
# Humidity 68.0
# Pressure 1010.0
# Name: 2023-01-02 00:00:00, dtype: float64
# Slice of rows by position (end position excluded)
print("\nRows at index positions 1 through 3 (end position 4 excluded):")
print(df.iloc[1:4])
# Output:
# Rows at index positions 1 through 3 (end position 4 excluded):
# Temperature Humidity Pressure
# 2023-01-02 26.1 68 1010
# 2023-01-03 24.8 62 1015
# 2023-01-04 27.0 70 1009
# Specific rows and columns by position
print("\nCells at specific positions:")
print(df.iloc[[2, 5], [0, 2]]) # Rows 2, 5 and Columns 0, 2
# Output:
# Cells at specific positions:
# Temperature Pressure
# 2023-01-03 24.8 1015
# 2023-01-06 25.5 1011
# Slice rows and columns by position
print("\nSlice of rows and columns:")
print(df.iloc[1:4, 0:2]) # Rows 1-3, Columns 0-1
# Output:
# Slice of rows and columns:
# Temperature Humidity
# 2023-01-02 26.1 68
# 2023-01-03 24.8 62
# 2023-01-04 27.0 70
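Because .iloc follows Python's positional conventions, negative positions and out-of-range slices behave as they do for lists. The sketch below uses a small made-up frame, `df_pos`, with arbitrary string labels to emphasize that .iloc never consults them.

```python
import pandas as pd

df_pos = pd.DataFrame(
    {'Temperature': [25.3, 26.1, 24.8], 'Humidity': [65, 68, 62]},
    index=['a', 'b', 'c'],  # hypothetical labels; .iloc ignores them
)

last_row = df_pos.iloc[-1]     # negative positions count from the end
last_two = df_pos.iloc[-2:]    # the last two rows
tolerant = df_pos.iloc[1:10]   # a slice running past the end is truncated

print(last_row['Humidity'])    # 62
print(list(last_two.index))    # ['b', 'c']
print(len(tolerant))           # 2

# A single out-of-range position, unlike a slice, raises IndexError
try:
    df_pos.iloc[10]
except IndexError as exc:
    print("IndexError:", exc)
```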
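Perhaps the most powerful selection method involves using boolean (True/False) arrays or Series to filter data based on conditions.
1. Create a boolean Series by applying a condition to a column, for example df['Temperature'] > 25.0.
2. Pass the boolean Series to [] or .loc[]. Only rows where the boolean Series is True will be selected. (.iloc[] accepts a boolean NumPy array or list, but not a boolean Series, so convert the mask with .to_numpy() before using it there.)
# Example using Boolean Indexing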
# 1. Create a condition: Temperatures above 25.0 degrees
high_temp_condition = df['Temperature'] > 25.0
print("Boolean condition Series:")
print(high_temp_condition)
# Output:
# Boolean condition Series:
# 2023-01-01 True
# 2023-01-02 True
# 2023-01-03 False
# 2023-01-04 True
# 2023-01-05 False
# 2023-01-06 True
# 2023-01-07 True
# Name: Temperature, dtype: bool
# 2. Apply the condition using []
high_temp_days = df[high_temp_condition]
print("\nDays with Temperature > 25.0 (using []):")
print(high_temp_days)
# Output:
# Days with Temperature > 25.0 (using []):
# Temperature Humidity Pressure
# 2023-01-01 25.3 65 1012
# 2023-01-02 26.1 68 1010
# 2023-01-04 27.0 70 1009
# 2023-01-06 25.5 66 1011
# 2023-01-07 26.8 69 1008
# Apply the condition using .loc (preferred for clarity)
high_temp_days_loc = df.loc[high_temp_condition]
print("\nDays with Temperature > 25.0 (using .loc):")
print(high_temp_days_loc)
# Output: (Identical to above)
# Selecting specific columns based on a condition
high_temp_humidity = df.loc[df['Temperature'] > 26.0, 'Humidity']
print("\nHumidity on days with Temperature > 26.0:")
print(high_temp_humidity)
# Output:
# Humidity on days with Temperature > 26.0:
# 2023-01-02 68
# 2023-01-04 70
# 2023-01-07 69
# Name: Humidity, dtype: int64
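One detail worth checking for yourself is how .iloc treats boolean masks: it rejects a boolean Series but accepts the underlying NumPy array. The following sketch uses a three-row stand-in frame, `df_mask` (a name chosen for this illustration), to show both behaviors.

```python
import pandas as pd

df_mask = pd.DataFrame(
    {'Temperature': [25.3, 26.1, 24.8], 'Humidity': [65, 68, 62]},
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
)
mask = df_mask['Temperature'] > 25.0   # boolean Series aligned on the index

via_loc = df_mask.loc[mask]            # .loc accepts the Series directly

# .iloc refuses a boolean Series because it carries index labels
try:
    df_mask.iloc[mask]
except (ValueError, NotImplementedError) as exc:
    print("iloc rejected the Series:", exc)

# Converting the mask to a plain NumPy array makes it purely positional
via_iloc = df_mask.iloc[mask.to_numpy()]
print(via_loc.equals(via_iloc))        # True
```

In practice .loc is usually the more natural choice for boolean filtering, since the mask and the frame are aligned by label.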
You can combine multiple conditions using logical operators: & for AND, | for OR, and ~ for NOT. Remember to enclose individual conditions in parentheses due to Python's operator precedence rules.
# Combining conditions: Temperature > 25 AND Humidity < 68
complex_condition = (df['Temperature'] > 25.0) & (df['Humidity'] < 68)
print("\nComplex condition (Temp > 25.0 AND Humidity < 68):")
print(complex_condition)
# Output:
# Complex condition (Temp > 25.0 AND Humidity < 68):
# 2023-01-01 True
# 2023-01-02 False
# 2023-01-03 False
# 2023-01-04 False
# 2023-01-05 False
# 2023-01-06 True
# 2023-01-07 False
# dtype: bool
filtered_data = df.loc[complex_condition]
print("\nFiltered data based on complex condition:")
print(filtered_data)
# Output:
# Filtered data based on complex condition:
# Temperature Humidity Pressure
# 2023-01-01 25.3 65 1012
# 2023-01-06 25.5 66 1011
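The | and ~ operators follow the same pattern as &. As a brief sketch, the example data is rebuilt below under the name `df_logic` (chosen for this illustration) so the snippet runs on its own; the thresholds are arbitrary.

```python
import pandas as pd

# Same values as the example data, rebuilt so this sketch is self-contained
df_logic = pd.DataFrame(
    {
        'Temperature': [25.3, 26.1, 24.8, 27.0, 23.9, 25.5, 26.8],
        'Humidity': [65, 68, 62, 70, 60, 66, 69],
        'Pressure': [1012, 1010, 1015, 1009, 1013, 1011, 1008],
    },
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
                          '2023-01-04', '2023-01-05', '2023-01-06',
                          '2023-01-07']),
)

# OR: cool days or low-pressure days
either = df_logic.loc[(df_logic['Temperature'] < 24.5) | (df_logic['Pressure'] < 1010)]
print(either.index.strftime('%Y-%m-%d').tolist())
# ['2023-01-04', '2023-01-05', '2023-01-07']

# NOT: invert a condition (days where Humidity is not >= 65)
not_humid = df_logic.loc[~(df_logic['Humidity'] >= 65)]
print(not_humid.index.strftime('%Y-%m-%d').tolist())
# ['2023-01-03', '2023-01-05']
```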
These selection methods are not just for reading data; they can also be used to modify data in place. By placing the selection on the left side of an assignment operator, you can update specific parts of your DataFrame or Series.
# Setting values
# Create a copy to avoid modifying the original df
df_copy = df.copy()
# Set Humidity to 75 for a specific date using .loc
df_copy.loc['2023-01-03', 'Humidity'] = 75
print("\nDataFrame after setting Humidity on 2023-01-03:")
print(df_copy.loc['2023-01-01':'2023-01-04'])
# Output:
# DataFrame after setting Humidity on 2023-01-03:
# Temperature Humidity Pressure
# 2023-01-01 25.3 65 1012
# 2023-01-02 26.1 68 1010
# 2023-01-03 24.8 75 1015 # <-- Changed value
# 2023-01-04 27.0 70 1009
# Set Temperature to 0 for all rows where Pressure > 1012 using boolean indexing
df_copy.loc[df_copy['Pressure'] > 1012, 'Temperature'] = 0
print("\nDataFrame after setting Temp based on Pressure:")
print(df_copy)
# Output:
# DataFrame after setting Temp based on Pressure:
# Temperature Humidity Pressure
# 2023-01-01 25.3 65 1012
# 2023-01-02 26.1 68 1010
# 2023-01-03 0.0 75 1015 # <-- Changed value
# 2023-01-04 27.0 70 1009
# 2023-01-05 0.0 60 1013 # <-- Changed value
# 2023-01-06 25.5 66 1011
# 2023-01-07 26.8 69 1008
# Set multiple columns for the first row using .iloc
df_copy.iloc[0, [0, 1]] = [20.0, 50] # Set Temp and Humidity for the first row
print("\nDataFrame after setting multiple values using .iloc:")
print(df_copy.head(3))
# Output:
# DataFrame after setting multiple values using .iloc:
# Temperature Humidity Pressure
# 2023-01-01 20.0 50 1012 # <-- Changed values
# 2023-01-02 26.1 68 1010
# 2023-01-03 0.0 75 1015
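The same assignment patterns apply to a Series, which the examples above do not show. Here is a minimal sketch on a made-up Series, `temps`, with hypothetical string labels.

```python
import pandas as pd

# Hypothetical Series used only to illustrate assignment
temps = pd.Series([25.3, 26.1, 24.8], index=['a', 'b', 'c'])

temps.loc['b'] = 30.0            # set by label
temps.iloc[0] = 20.0             # set by position
temps.loc[temps > 25.0] = 0.0    # set wherever a condition holds

print(temps.tolist())            # [20.0, 0.0, 24.8]
```

Each assignment goes through a single indexer call on the object being modified, which keeps it clear which object is updated.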
Choosing between .loc and .iloc
- Use .loc when you need to select data based on index labels or column names. It's generally more explicit and less prone to errors if your index labels are meaningful (like dates or IDs).
- Use .iloc when you need to select data based on integer positions, regardless of the index labels or column names. This is useful when you need the Nth row or column, or when dealing with default integer indices (0, 1, 2...).
- Avoid [] for row slicing if possible; prefer .loc or .iloc for clarity. Using [] primarily for column selection is common and acceptable.
- Be careful when the index labels are themselves integers: df.loc[0] would select the row with label 0, while df.iloc[0] would select the first row by position. This potential ambiguity is a strong reason to prefer .loc and .iloc over simple [] indexing for anything beyond basic column selection.
Mastering these data indexing and selection techniques is fundamental for effective data manipulation with Pandas. You'll use them constantly to filter, subset, inspect, and modify your data as you prepare it for analysis and machine learning models.
© 2025 ApX Machine Learning