Once you can add or remove columns and handle missing data, the next frequent task in data preparation is modifying the information within existing columns. You might need to perform calculations based on current values, standardize text formats, change data types, or apply custom transformations. Pandas provides several flexible ways to achieve this.
The simplest way to modify a column is often through direct arithmetic or logical operations. Because Pandas is built on NumPy, these operations are usually vectorized, meaning they are applied element-wise and efficiently without explicit loops.
For example, imagine you have a DataFrame with temperatures in Celsius and you want to convert them to Fahrenheit.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {'City': ['London', 'Paris', 'Tokyo'],
        'Temp_C': [12, 15, 8]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Convert 'Temp_C' to Fahrenheit and store the result in a new column
df['Temp_F'] = df['Temp_C'] * 9/5 + 32
print("\nDataFrame with Fahrenheit column added:")
print(df)
# Or, overwrite the original column (use with caution)
# df['Temp_C'] = df['Temp_C'] * 9/5 + 32
# df = df.rename(columns={'Temp_C': 'Temp_F'}) # Rename if overwriting
This vectorized approach works for addition (+), subtraction (-), multiplication (*), division (/), exponentiation (**), and other standard mathematical operations.
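The same element-wise behavior extends to comparisons and logical operators. The following is a minimal sketch that reuses the temperature data from above; the column names `Is_Mild` and `Above_Avg` are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Paris', 'Tokyo'],
                   'Temp_C': [12, 15, 8]})

# Comparisons are vectorized too and return a boolean Series.
# Combine conditions with & (and) or | (or), not the Python keywords.
df['Is_Mild'] = (df['Temp_C'] >= 10) & (df['Temp_C'] < 20)

# A column can also be compared against a scalar derived from itself
df['Above_Avg'] = df['Temp_C'] > df['Temp_C'].mean()

print(df)
```

The parentheses around each comparison are required, because `&` binds more tightly than `>=` and `<` in Python.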
apply()
Sometimes, the transformation you need is more complex than simple arithmetic. You might have a custom function you want to apply to each element in a column (a Pandas Series). The apply() method is useful here.
Let's say we want to categorize the temperatures into 'Cold', 'Mild', or 'Warm'.
# Define a function to categorize temperature
def categorize_temp(temp):
    if temp < 10:
        return 'Cold'
    elif 10 <= temp < 20:
        return 'Mild'
    else:
        return 'Warm'
# Apply the function to the 'Temp_C' column
df['Temp_Category'] = df['Temp_C'].apply(categorize_temp)
print("\nDataFrame with Temperature Category:")
print(df)
You can also use apply() with lambda functions for shorter, one-off operations:
# Example: Calculate square of temperature using a lambda function
df['Temp_C_Squared'] = df['Temp_C'].apply(lambda x: x**2)
print("\nDataFrame with Temperature Squared:")
print(df)
apply() is flexible, but it calls a Python function once per element, so it is usually slower than vectorized operations or more specialized functions when those exist for your task.
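As an illustration of a "more specialized function", the temperature categorization above can also be expressed with `pd.cut`, which bins numeric values without a per-element Python call. This is a sketch of one possible alternative, not the only way to do it; the bin edges are chosen to mirror `categorize_temp`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Paris', 'Tokyo'],
                   'Temp_C': [12, 15, 8]})

# Bins: [-inf, 10) -> Cold, [10, 20) -> Mild, [20, inf) -> Warm.
# right=False makes each bin include its left edge, matching the
# `temp < 10` and `10 <= temp < 20` tests in categorize_temp.
df['Temp_Category'] = pd.cut(
    df['Temp_C'],
    bins=[-np.inf, 10, 20, np.inf],
    labels=['Cold', 'Mild', 'Warm'],
    right=False,
)

print(df)
```

Note that `pd.cut` returns a column with the `category` dtype rather than plain strings.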
replace()
If you need to substitute specific values within a column, the replace() method is very convenient. You can replace a single value or provide a dictionary to replace multiple values simultaneously.
# Sample DataFrame with categorical data
data_cat = {'ID': [101, 102, 103, 104],
            'Grade': ['A', 'B', 'C', 'B'],
            'Status': ['Pass', 'Pass', 'Fail', 'Pass']}
df_cat = pd.DataFrame(data_cat)
print("Original Categorical DataFrame:")
print(df_cat)
# Replace 'Fail' with 'Did Not Pass' in the 'Status' column
df_cat['Status'] = df_cat['Status'].replace('Fail', 'Did Not Pass')
print("\nDataFrame after replacing 'Fail':")
print(df_cat)
# Replace multiple grades using a dictionary
grade_map = {'A': 'Excellent', 'B': 'Good', 'C': 'Fair'}
df_cat['Grade_Desc'] = df_cat['Grade'].replace(grade_map)
print("\nDataFrame after replacing multiple grades:")
print(df_cat)
map() for Value Substitution
Similar to replace(), the map() method on a Series can be used for substituting each value based on a dictionary (or another Series, or a function). It's particularly useful when you want to map all existing values to new ones based on a predefined correspondence. Values not found in the mapping dictionary will become NaN.
# Sample Series
s = pd.Series(['cat', 'dog', 'rabbit', 'cat'])
print("Original Series:")
print(s)
# Map animal names to sounds
animal_sounds = {'cat': 'meow', 'dog': 'bark'}
sounds = s.map(animal_sounds)
print("\nSeries after mapping (unknown values become NaN):")
print(sounds)
While replace() targets specific values and leaves everything else unchanged, map() is often used for a more complete transformation based on a lookup, where every value is expected to have an entry.
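The difference is easiest to see by running both methods on the same Series with the same dictionary. The sketch below reuses the animal example; the `fillna(s)` fallback at the end is one common way to keep original values for unmatched entries and is an addition to the original example.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'rabbit', 'cat'])
animal_sounds = {'cat': 'meow', 'dog': 'bark'}

# replace(): values missing from the dictionary are left unchanged
replaced = s.replace(animal_sounds)

# map(): values missing from the dictionary become NaN
mapped = s.map(animal_sounds)

# Fallback: fill the NaN entries with the original values (aligned by index)
mapped_with_fallback = s.map(animal_sounds).fillna(s)

print(replaced.tolist())
print(mapped.tolist())
print(mapped_with_fallback.tolist())
```

Here 'rabbit' survives `replace()` untouched, turns into NaN under `map()`, and is restored by the `fillna(s)` fallback.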
.str
Pandas provides a powerful set of string processing methods accessible via the .str accessor on Series containing strings (object dtype). This lets you apply common string operations to an entire column at once, without writing a loop.
# Sample DataFrame with text data
data_text = {'Product': ['Apple iPhone 14', 'SAMSUNG GALAXY S23', 'google pixel 7'],
             'Code': ['APL-14', 'SAM-S23', 'GGL-PX7']}
df_text = pd.DataFrame(data_text)
print("Original Text DataFrame:")
print(df_text)
# Convert 'Product' names to lowercase
df_text['Product_Lower'] = df_text['Product'].str.lower()
# Split 'Code' into 'Brand' and 'Model'
df_text[['Brand_Code', 'Model_Code']] = df_text['Code'].str.split('-', expand=True)
# Check if 'Product' contains 'galaxy' (case-insensitive)
df_text['Has_Galaxy'] = df_text['Product'].str.contains('GALAXY', case=False)
print("\nDataFrame after string manipulations:")
print(df_text)
Other common .str methods include startswith(), endswith(), replace(), len(), strip(), get(), and many more. These are incredibly useful for cleaning and standardizing text data.
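A short sketch of a few of these methods in action follows. The stray whitespace in the sample codes and the new column names are assumptions made for illustration; they are not part of the earlier example.

```python
import pandas as pd

# Hypothetical codes with stray whitespace, as often seen in raw data
df_text = pd.DataFrame({'Code': [' APL-14', 'SAM-S23 ', 'GGL-PX7']})

# strip() removes leading and trailing whitespace
df_text['Code'] = df_text['Code'].str.strip()

# startswith() returns a boolean Series
df_text['Is_Samsung'] = df_text['Code'].str.startswith('SAM')

# replace() substitutes substrings; regex=False treats '-' as a literal
df_text['Code_Underscore'] = df_text['Code'].str.replace('-', '_', regex=False)

# len() gives the number of characters in each string
df_text['Code_Length'] = df_text['Code'].str.len()

# get() picks the character at a given position
df_text['First_Char'] = df_text['Code'].str.get(0)

print(df_text)
```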
astype()
Sometimes, data is loaded with an incorrect data type. For example, numbers might be read as strings (object type), or you might want to convert floats to integers after handling missing values. The astype() method allows you to change the data type of a column.
# Sample DataFrame with mixed types
data_types = {'ID': ['101', '102', '103'],
              'Value': [20.5, 15.0, 33.8],
              'Category': ['X', 'Y', 'X']}
df_types = pd.DataFrame(data_types)
print("Original DataFrame with dtypes:")
print(df_types)
print(df_types.dtypes)
# Convert 'ID' from object (string) to integer
df_types['ID'] = df_types['ID'].astype(int)
# Convert 'Value' from float to integer (truncates decimal part)
df_types['Value_Int'] = df_types['Value'].astype(int)
# Convert 'Category' to a more memory-efficient 'category' dtype
df_types['Category'] = df_types['Category'].astype('category')
print("\nDataFrame after changing dtypes:")
print(df_types)
print(df_types.dtypes)
Be careful when changing types. If a conversion is not possible (e.g., trying to convert a string like 'hello' to an integer), Pandas will raise an error. Ensure your data is suitable for the target type before using astype(). For conversions involving potential errors or missing values, you might need to clean the data first or use a function like pd.to_numeric() with errors='coerce', which turns unconvertible values into NaN.
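The following sketch shows how `pd.to_numeric` with `errors='coerce'` behaves on a column containing one bad value, and two possible ways to get integers afterwards. The sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value that is not a number
raw = pd.Series(['101', '102', 'hello', '104'])

# errors='coerce' converts unparseable entries to NaN instead of raising.
# The result is float64, because NaN cannot be stored in a plain int column.
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)

# Option 1: drop the missing entries, then convert to int
as_int = numeric.dropna().astype(int)

# Option 2: keep the gap by using the nullable integer dtype 'Int64'
as_nullable_int = numeric.astype('Int64')

print(as_int.tolist())
print(as_nullable_int)
```

Which option fits depends on whether the rows with invalid values should be kept; dropping them changes the length of the Series.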
Modifying existing columns is a fundamental part of data wrangling. By combining direct operations, apply(), replace(), string methods, and type conversions, you gain significant control over shaping your data into the format needed for analysis.
© 2025 ApX Machine Learning