After loading data and addressing obviously missing values, the next significant step involves refining the dataset's structure and content. Real-world data frequently contains inconsistencies, requires format adjustments, or needs new features derived from existing ones. This section covers essential Pandas techniques for cleaning and transforming your data, making it suitable for analysis and machine learning model inputs.
We'll focus on applying functions, modifying column names, handling duplicate entries, replacing values, binning continuous data, manipulating string data, and converting data types.
Pandas provides flexible ways to apply functions across your data. The .apply() method is particularly useful for applying a function along an axis of a DataFrame (rows or columns) or element-wise on a Series.
Let's assume we have a DataFrame df with temperature data in Celsius and want to convert it to Fahrenheit.
import pandas as pd
import numpy as np
data = {'Location': ['City A', 'City B', 'City C'],
        'Temp_Celsius': [25.0, 18.5, 30.2]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Define a conversion function
def celsius_to_fahrenheit(c):
return (c * 9/5) + 32
# Apply the function to the 'Temp_Celsius' column (a Series)
df['Temp_Fahrenheit'] = df['Temp_Celsius'].apply(celsius_to_fahrenheit)
print("\nDataFrame with Fahrenheit:")
print(df)
# Using a lambda function for the same task
df['Temp_Fahrenheit_lambda'] = df['Temp_Celsius'].apply(lambda c: (c * 9/5) + 32)
print("\nDataFrame with Fahrenheit (using lambda):")
print(df)
You can also use .apply() across rows or columns of a DataFrame. For instance, calculating the range of temperatures if we had min/max values per row:
# Example DataFrame with min/max temps
data_range = {'Location': ['City A', 'City B', 'City C'],
              'Min_Temp_C': [15.0, 12.5, 22.8],
              'Max_Temp_C': [30.0, 21.0, 35.5]}
df_range = pd.DataFrame(data_range)
print("\nDataFrame with Temp Ranges:")
print(df_range)
# Apply a function across rows (axis=1)
df_range['Temp_Range'] = df_range.apply(lambda row: row['Max_Temp_C'] - row['Min_Temp_C'], axis=1)
print("\nDataFrame with Calculated Range:")
print(df_range)
While .apply() is flexible, vectorized operations using NumPy functions or built-in Pandas methods are generally faster for simple arithmetic or logical operations on entire columns. Use .apply() when more complex, custom logic is required row-wise or column-wise.
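As a quick sketch for comparison, the same Fahrenheit conversion can be written as a vectorized expression on the df defined above, avoiding a per-element Python function call:
# Vectorized alternative to the .apply() conversion above
df['Temp_Fahrenheit_vec'] = df['Temp_Celsius'] * 9/5 + 32
print("\nVectorized Fahrenheit conversion:")
print(df[['Temp_Celsius', 'Temp_Fahrenheit_vec']])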
For element-wise operations specifically on a Series, .map() can be convenient, especially for substituting values based on a dictionary or applying a simple function. For element-wise operations across an entire DataFrame, .applymap() exists (renamed to DataFrame.map() in recent Pandas versions), but it's often less performant than .apply() or vectorized approaches.
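For instance, .map() with a dictionary substitutes matching values, which is handy for recoding categories. Here is a small sketch using the df from above with an illustrative mapping (values without an entry in the dictionary become NaN):
# Map each location to a made-up region using a dictionary
region_map = {'City A': 'North', 'City B': 'South', 'City C': 'East'}
df['Region'] = df['Location'].map(region_map)
print("\nLocations mapped to regions:")
print(df[['Location', 'Region']])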
Clear and consistent column names are important for code readability and data interpretation. The .rename() method is the standard way to change column names or index labels.
# Continuing with the temperature DataFrame
print("\nOriginal Columns:", df.columns)
# Rename specific columns using a dictionary
df_renamed = df.rename(columns={'Temp_Celsius': 'Celsius', 'Temp_Fahrenheit': 'Fahrenheit'})
print("\nRenamed Columns:", df_renamed.columns)
print(df_renamed)
# Renaming can also be done inplace
# df.rename(columns={'Temp_Celsius': 'Celsius'}, inplace=True)
You can similarly rename index labels using the index argument within .rename(). If you need to make one of the columns the DataFrame's index, use .set_index('ColumnName').
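As a brief sketch of both operations on the temperature DataFrame from earlier:
# Make 'Location' the index, then rename one of its labels
df_indexed = df.set_index('Location')
print("\nIndexed by Location:")
print(df_indexed)
print("\nAfter renaming an index label:")
print(df_indexed.rename(index={'City A': 'City Alpha'}))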
Duplicate rows can skew analysis and model training. Pandas makes identifying and removing them straightforward.
data_duplicates = {'col1': ['A', 'B', 'C', 'A', 'B', 'D'],
                   'col2': [1, 2, 3, 1, 2, 4]}
df_dup = pd.DataFrame(data_duplicates)
print("\nDataFrame with Duplicates:")
print(df_dup)
# Check for duplicate rows (returns a boolean Series)
print("\nAre rows duplicates?")
print(df_dup.duplicated())
# By default, keeps the first occurrence
print("\nKeeping first duplicate:")
print(df_dup.drop_duplicates(keep='first'))
# Keep the last occurrence
print("\nKeeping last duplicate:")
print(df_dup.drop_duplicates(keep='last'))
# Drop all rows that are duplicates
print("\nDropping all duplicates:")
print(df_dup.drop_duplicates(keep=False))
# Check duplicates based on a subset of columns
df_dup['col3'] = [10, 20, 30, 15, 25, 40] # Add another column
print("\nDataFrame with added column:")
print(df_dup)
print("\nDuplicates based on 'col1' and 'col2':")
print(df_dup.drop_duplicates(subset=['col1', 'col2'], keep='first'))
The keep parameter controls which duplicate (if any) to retain, and the subset argument lets you define duplicates based only on specific columns.
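Before dropping anything, it is often useful to count how many rows are flagged as duplicates; a quick sketch using the same df_dup:
# Count rows considered duplicates based on 'col1' and 'col2'
n_dups = df_dup.duplicated(subset=['col1', 'col2']).sum()
print(f"\nDuplicate rows (by col1/col2): {n_dups}")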
Often, you need to replace specific values within your data. This could be for correcting data entry errors, standardizing categories, or mapping numerical codes to labels. The .replace() method is versatile for this.
s = pd.Series([1, 2, 3, 1, 4, 5, -99]) # -99 might be a placeholder for missing
print("\nOriginal Series:")
print(s)
# Replace a single value
print("\nReplace -99 with NaN:")
print(s.replace(-99, np.nan))
# Replace multiple values using lists
print("\nReplace 1 and 2 with 0:")
print(s.replace([1, 2], 0))
# Replace using a dictionary for mapping
print("\nReplace 1 with 100, 2 with 200:")
print(s.replace({1: 100, 2: 200}))
# Can also use regex=True for pattern matching in strings
df_text = pd.DataFrame({'codes': ['A-101', 'B-202', 'C-303', 'A-101-extra']})
print("\nDataFrame with Codes:")
print(df_text)
print("\nReplace prefix using regex:")
# Replace 'A-', 'B-', 'C-' at the beginning (^) of the string with ''
print(df_text['codes'].replace(to_replace=r'^[ABC]-', value='', regex=True))
.replace() is powerful for targeted value substitution across Series or DataFrames.
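On a DataFrame, a nested dictionary restricts replacements to particular columns; a small sketch with illustrative data:
# Column-specific replacements: {'column': {old_value: new_value}}
df_status = pd.DataFrame({'status': ['ok', 'fail', 'ok'],
                          'score': [1, -99, 3]})
print("\nColumn-specific replacement:")
print(df_status.replace({'status': {'fail': 'error'}, 'score': {-99: np.nan}}))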
Sometimes it's useful to convert continuous numerical data into discrete bins or categories. This process, known as discretization or binning, can help simplify models or capture non-linear effects. Pandas offers pd.cut and pd.qcut.
pd.cut segments data into bins based on value ranges. You can specify the number of bins (equal width) or provide explicit bin edges.
ages = pd.Series([22, 35, 58, 19, 41, 65, 30, 28, 50, 72])
print("\nOriginal Ages:")
print(ages)
# Bin ages into 4 equal-width bins
age_bins = pd.cut(ages, bins=4)
print("\nAges binned into 4 intervals:")
print(age_bins)
print("\nValue counts per bin:")
print(age_bins.value_counts())
# Define custom bin edges and labels
bin_edges = [0, 18, 35, 60, 100]
bin_labels = ['Youth', 'Young Adult', 'Middle Aged', 'Senior']
age_bins_labeled = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=False) # right=False means bins are [a, b)
print("\nAges with custom bins and labels:")
print(age_bins_labeled)
print("\nValue counts for custom bins:")
print(age_bins_labeled.value_counts())
pd.qcut segments data based on quantiles, meaning each bin will have approximately the same number of data points.
# Bin ages into 4 quantile-based bins (quartiles)
age_qbins = pd.qcut(ages, q=4)
print("\nAges binned into quartiles:")
print(age_qbins)
print("\nValue counts per quartile bin:")
print(age_qbins.value_counts())
# Quantile bins with labels
qbin_labels = ['Q1', 'Q2', 'Q3', 'Q4']
age_qbins_labeled = pd.qcut(ages, q=4, labels=qbin_labels)
print("\nAges with quartile labels:")
print(age_qbins_labeled)
print("\nValue counts per labeled quartile:")
print(age_qbins_labeled.value_counts())
Chart: distribution of individuals across the custom-defined age groups.
Binning is a common feature engineering technique used before feeding data into certain types of machine learning models.
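For example, a labeled bin column can be one-hot encoded with pd.get_dummies before being passed to a model that expects numeric inputs; a minimal sketch continuing from age_bins_labeled above:
# One-hot encode the labeled age groups into indicator columns
age_features = pd.get_dummies(age_bins_labeled, prefix='age_group')
print("\nOne-hot encoded age groups:")
print(age_features.head())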
Text data often requires significant cleaning. Pandas Series have a special .str accessor that provides vectorized string manipulation functions, mirroring Python's built-in string methods but operating efficiently on the entire Series.
# Sample Series with messy string data
text_data = pd.Series([' Apple ', 'banana', ' Orange', ' GRAPES! ', 'Pineapple '])
print("\nOriginal Text Series:")
print(text_data)
# Convert to lowercase
print("\nLowercase:")
print(text_data.str.lower())
# Remove leading/trailing whitespace
print("\nStripped Whitespace:")
print(text_data.str.strip())
# Chain operations
print("\nStripped and Uppercase:")
print(text_data.str.strip().str.upper())
# Check if string contains a pattern
print("\nContains 'apple' (case-insensitive):")
print(text_data.str.lower().str.contains('apple'))
# Replace parts of strings
print("\nReplace '!' with '':")
print(text_data.str.replace('!', '', regex=False)) # Use regex=False for literal replacement
# Split strings
print("\nSplit by space (yields lists):")
print(text_data.str.strip().str.split(' '))
# Extract patterns using regex (requires capturing group)
product_codes = pd.Series(['SKU-123-A', 'SKU-456-B', 'ITEM-789-C', 'SKU-000-D'])
print("\nProduct Codes:")
print(product_codes)
print("\nExtract numeric part:")
# Extract one or more digits (\d+) after 'SKU-'
print(product_codes.str.extract(r'SKU-(\d+)-'))
The .str accessor is essential for cleaning categorical features stored as text, preparing text for natural language processing, or extracting structured information from unstructured strings.
Ensuring data is stored with the correct data type (integer, float, boolean, category, datetime, etc.) is important for memory efficiency and operational correctness. You can check types using the .dtypes attribute and convert types using the .astype() method.
df_types = pd.DataFrame({'ID': ['1', '2', '3', '4'],
                         'Value': ['10.5', '20.1', '30.9', '40.0'],
                         'Category': ['A', 'B', 'A', 'C'],
                         'Count': [5, 10, 15, 20]})
print("\nOriginal DataFrame with mixed types:")
print(df_types)
print("\nOriginal Data Types:")
print(df_types.dtypes)
# Convert ID from object (string) to integer
df_types['ID'] = df_types['ID'].astype(int)
# Convert Value from object (string) to float
df_types['Value'] = df_types['Value'].astype(float)
# Convert Category to a dedicated 'category' type (memory efficient for few unique strings)
df_types['Category'] = df_types['Category'].astype('category')
# Convert Count (already int) to float
df_types['Count'] = df_types['Count'].astype(float)
print("\nDataFrame after Type Conversion:")
print(df_types)
print("\nNew Data Types:")
print(df_types.dtypes)
# Handling potential errors during conversion
s_error = pd.Series(['1', '2', 'three', '4'])
print("\nSeries with non-numeric value:")
print(s_error)
# errors='coerce' turns unconvertible values into NaN here (pd.to_datetime would produce NaT)
numeric_s = pd.to_numeric(s_error, errors='coerce')
print("\nConverted to numeric with coercion:")
print(numeric_s)
Using the appropriate data types, especially category for low-cardinality strings or smaller integer subtypes (int8, int16, etc.) where applicable, can significantly reduce the memory footprint of your DataFrames. The pd.to_numeric, pd.to_datetime, and pd.to_timedelta functions offer more specialized conversion options, particularly errors='coerce' for gracefully handling values that cannot be converted.
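As a small illustration of both points, continuing with the df_types DataFrame from above:
# Inspect per-column memory usage; the 'category' column stores repeated strings compactly
print("\nMemory usage (bytes) per column:")
print(df_types.memory_usage(deep=True))
# Downcast the integer ID column to the smallest sufficient integer subtype
df_types['ID'] = pd.to_numeric(df_types['ID'], downcast='integer')
print("\nID dtype after downcasting:", df_types['ID'].dtype)
# Parse date strings, coercing anything unparseable to NaT
dates = pd.to_datetime(pd.Series(['2024-01-15', 'not a date', '2024-03-02']), errors='coerce')
print("\nParsed dates with coercion:")
print(dates)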
These cleaning and transformation techniques form a core part of the data preparation process. Mastering them allows you to reshape raw data into a clean, structured format ready for the next stages of analysis, such as grouping, aggregation, and merging, which we will explore subsequently.