Dropping Columns and Rows

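Removing unnecessary or problematic data is a common step in data preparation. You may need to drop entire rows (observations) or entire columns (features) from a DataFrame. For example:

A column may contain redundant information that is already present in another column.
A column may contain too many missing values to be imputed reliably.
Specific rows may represent outliers or errors that you want to exclude from your analysis.
Rows or columns may be irrelevant to the specific question you are trying to answer.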

Pandas provides the versatile drop() method for removing both rows and columns.

Dropping Rows by Index Label

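To drop one or more rows, pass the index labels of those rows to the drop() method. Passing them through the index parameter makes it explicit that you are operating on rows.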

Let's start with a sample DataFrame:

import pandas as pd
import numpy as np

data = {'StudentID': ['S101', 'S102', 'S103', 'S104', 'S105', 'S106'],
        'Math': [85, 92, 78, 88, 95, np.nan],
        'Science': [90, 88, 94, 85, 79, np.nan],
        'History': [76, 81, 85, 70, 88, 60],
        'Notes': ['Good', 'Excellent', np.nan, 'Needs Improvement', 'Good', 'Incomplete']}
df = pd.DataFrame(data)
df.set_index('StudentID', inplace=True)  # Use StudentID as the index

print("Original DataFrame:")
print(df)

This produces:

Original DataFrame:
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S103       78.0     94.0       85                NaN
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good
S106        NaN      NaN       60         Incomplete

Suppose we want to drop student S106 because their record is incomplete. We can do this with the drop() method:

# Drop a single row by index label
df_dropped_row = df.drop(index='S106')

print("\nDataFrame after dropping row 'S106':")
print(df_dropped_row)

Output:

DataFrame after dropping row 'S106':
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S103       78.0     94.0       85                NaN
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good

Note that df.drop() by default returns a new DataFrame with the specified row removed. The original DataFrame df is unchanged.

To drop multiple rows, pass a list of index labels:

# Drop multiple rows by index label
df_dropped_rows = df.drop(index=['S103', 'S106'])

print("\nDataFrame after dropping rows 'S103' and 'S106':")
print(df_dropped_rows)

Output:

DataFrame after dropping rows 'S103' and 'S106':
           Math  Science  History              Notes
StudentID
S101       85.0     90.0       76               Good
S102       92.0     88.0       81          Excellent
S104       88.0     85.0       70  Needs Improvement
S105       95.0     79.0       88               Good
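One detail worth knowing when dropping by label: by default, drop() raises a KeyError if any requested label is missing from the index. The sketch below uses a cut-down version of the student table and a hypothetical label 'S999' (not part of the original example) to show that behavior, along with the errors='ignore' option that skips missing labels.

```python
import numpy as np
import pandas as pd

# Cut-down version of the student table, for illustration only
df = pd.DataFrame(
    {'Math': [85, 92, np.nan], 'History': [76, 81, 60]},
    index=pd.Index(['S101', 'S102', 'S106'], name='StudentID'),
)

# 'S999' is a hypothetical label that does not exist in the index
try:
    df.drop(index=['S106', 'S999'])
except KeyError as exc:
    print(f"KeyError: {exc}")

# errors='ignore' drops the labels that exist and silently skips the rest
df_safe = df.drop(index=['S106', 'S999'], errors='ignore')
print(df_safe.index.tolist())  # ['S101', 'S102']
```

Silently ignoring missing labels can hide typos, so the default (raising an error) is usually the safer choice unless you expect some labels to be absent.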

Dropping Columns

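Dropping columns works much the same way, except that you provide column names instead of index labels. The preferred way to indicate that you are operating on columns is the columns parameter.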

Suppose our analysis does not need the Notes column. We can drop it:

# Drop a single column by name
df_dropped_col = df.drop(columns='Notes')

print("\nDataFrame after dropping the 'Notes' column:")
print(df_dropped_col)

Output:

DataFrame after dropping the 'Notes' column:
           Math  Science  History
StudentID
S101       85.0     90.0       76
S102       92.0     88.0       81
S103       78.0     94.0       85
S104       88.0     85.0       70
S105       95.0     79.0       88
S106        NaN      NaN       60

Again, this operation returns a new DataFrame. The original df still contains the Notes column.

To drop multiple columns, pass a list of column names:

# Drop multiple columns by name
df_dropped_cols = df.drop(columns=['Math', 'Notes'])

print("\nDataFrame after dropping 'Math' and 'Notes' columns:")
print(df_dropped_cols)

Output:

DataFrame after dropping 'Math' and 'Notes' columns:
           Science  History
StudentID
S101          90.0       76
S102          88.0       81
S103          94.0       85
S104          85.0       70
S105          79.0       88
S106           NaN       60
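The list of column names does not have to be typed by hand. As a rough sketch of the "too many missing values" case mentioned at the start of this section, the list can be computed and then passed to drop(). The small table and the 40% cutoff below are assumptions chosen purely for illustration, not a recommended threshold.

```python
import numpy as np
import pandas as pd

# Small illustrative table; the values are made up for this sketch
df = pd.DataFrame({
    'Math': [85, np.nan, np.nan, 88],
    'Science': [90, 88, np.nan, 85],
    'History': [76, 81, 85, 70],
})

# Fraction of missing values in each column
missing_share = df.isna().mean()

# Hypothetical cutoff: drop columns where more than 40% of values are missing
threshold = 0.4
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

df_reduced = df.drop(columns=cols_to_drop)
print(cols_to_drop)                 # ['Math']
print(df_reduced.columns.tolist())  # ['Science', 'History']
```

Here Math is missing in 2 of 4 rows (50%), so it is the only column above the cutoff.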

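In older code you may see axis=1 used to indicate column removal (for example, df.drop('Notes', axis=1)). This works, but the columns parameter (for example, df.drop(columns='Notes')) is generally considered more readable and explicit. Likewise, axis=0 corresponds to dropping rows (the default), but the index parameter is clearer.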
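To make the equivalence concrete, the following minimal sketch compares the axis-based and keyword-based forms on a two-row table (a reduced version of the example data, assumed here only to keep the block self-contained).

```python
import pandas as pd

# Reduced two-row table, for illustration only
df = pd.DataFrame(
    {'Math': [85, 92], 'Notes': ['Good', 'Excellent']},
    index=pd.Index(['S101', 'S102'], name='StudentID'),
)

# Columns: axis=1 versus the columns keyword
by_axis = df.drop('Notes', axis=1)
by_keyword = df.drop(columns='Notes')
print(by_axis.equals(by_keyword))  # True

# Rows: axis=0 (the default) versus the index keyword
rows_by_axis = df.drop('S102', axis=0)
rows_by_keyword = df.drop(index='S102')
print(rows_by_axis.equals(rows_by_keyword))  # True
```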

Modifying a DataFrame In Place

If you are sure you want to modify the original DataFrame directly rather than create a new one, you can pass inplace=True.

# Make a copy to modify in place
df_copy = df.copy()
print("\nOriginal df_copy (first 3 rows):")
print(df_copy.head(3))

# Drop the 'History' column in place
return_value = df_copy.drop(columns='History', inplace=True)

print("\ndf_copy after dropping 'History' inplace (first 3 rows):")
print(df_copy.head(3))
print(f"\nReturn value when inplace=True: {return_value}")

Output:

Original df_copy (first 3 rows):
           Math  Science  History      Notes
StudentID
S101       85.0     90.0       76       Good
S102       92.0     88.0       81  Excellent
S103       78.0     94.0       85        NaN

df_copy after dropping 'History' inplace (first 3 rows):
           Math  Science      Notes
StudentID
S101       85.0     90.0       Good
S102       92.0     88.0  Excellent
S103       78.0     94.0        NaN

Return value when inplace=True: None

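Note that with inplace=True, drop() modifies the DataFrame directly and returns None. Use inplace=True with care, because it permanently changes that DataFrame and the dropped data cannot be recovered from it. In general, especially while learning or exploring data, the default behavior (returning a modified copy) is safer because it lets you keep the original data.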
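As a sketch of the default, copy-returning pattern, the result of drop() can simply be assigned to a new name. The example below also passes index and columns in the same call to remove a row and a column at once. The three-row table and the name df_clean are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Reduced version of the student table, for illustration only
df = pd.DataFrame(
    {
        'Math': [85, 92, np.nan],
        'History': [76, 81, 60],
        'Notes': ['Good', 'Excellent', 'Incomplete'],
    },
    index=pd.Index(['S101', 'S102', 'S106'], name='StudentID'),
)

# Drop one row and one column in a single call; df itself is left untouched
df_clean = df.drop(index='S106', columns='Notes')

print(df_clean.shape)  # (2, 2)
print(df.shape)        # (3, 3)
```

Because the original df is still available, a mistaken drop can be undone simply by starting again from df.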

Dropping rows and columns with drop() is a basic step in refining a dataset, letting you focus on the data most relevant to your analysis.
