Often, the raw data you load doesn't contain all the information you need for your analysis. You might need to calculate new values based on existing data or add categorical labels. Pandas makes it straightforward to add new columns to a DataFrame.
The most direct way to add a new column is by assigning data to a column name that doesn't yet exist. Think of it like adding a new entry to a dictionary, but where the "value" is typically a Series, array, or a single value that gets broadcast across all rows.
Let's start with a simple DataFrame:
import pandas as pd
import numpy as np
data = {'col_A': [10, 20, 30, 40],
'col_B': [5, 15, 25, 35]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Original DataFrame:
col_A col_B
0 10 5
1 20 15
2 30 25
3 40 35
If you want to add a new column where every row has the same value (a scalar value), you can simply assign it:
# Add a column 'source' with a constant string value
df['source'] = 'dataset_1'
# Add a column 'version' with a constant numeric value
df['version'] = 1.0
print("\nDataFrame after adding constant columns:")
print(df)
DataFrame after adding constant columns:
col_A col_B source version
0 10 5 dataset_1 1.0
1 20 15 dataset_1 1.0
2 30 25 dataset_1 1.0
3 40 35 dataset_1 1.0
Pandas automatically broadcasts the single value 'dataset_1' and the number 1.0 to fill all the rows in the new columns 'source' and 'version', respectively.
A very common operation is to create a new column from calculations on one or more existing columns. Because Pandas operations are generally vectorized (like NumPy), a single expression applies to the whole column at once, so you don't need to write an explicit loop over the rows.
Let's add a column 'col_C' that is the sum of 'col_A' and 'col_B':
# Create 'col_C' by adding 'col_A' and 'col_B'
df['col_C'] = df['col_A'] + df['col_B']
print("\nDataFrame after adding calculated column 'col_C':")
print(df)
DataFrame after adding calculated column 'col_C':
col_A col_B source version col_C
0 10 5 dataset_1 1.0 15
1 20 15 dataset_1 1.0 35
2 30 25 dataset_1 1.0 55
3 40 35 dataset_1 1.0 75
The addition happens element-wise for each row. You can use any standard arithmetic operators (+, -, *, /, %) or more complex functions from libraries like NumPy.
For instance, let's add another column 'col_D' which is 'col_A' divided by 10:
# Create 'col_D' by dividing 'col_A' by 10
df['col_D'] = df['col_A'] / 10
print("\nDataFrame after adding calculated column 'col_D':")
print(df)
DataFrame after adding calculated column 'col_D':
col_A col_B source version col_C col_D
0 10 5 dataset_1 1.0 15 1.0
1 20 15 dataset_1 1.0 35 2.0
2 30 25 dataset_1 1.0 55 3.0
3 40 35 dataset_1 1.0 75 4.0
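The same pattern works with NumPy functions, and it can also produce the kind of categorical label mentioned at the start of this section. The following is a minimal sketch on a fresh copy of the two original columns; the names `df_np`, `sqrt_A`, and `size_label`, and the threshold of 20, are illustrative assumptions rather than part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same two starting columns as the example
df_np = pd.DataFrame({'col_A': [10, 20, 30, 40],
                      'col_B': [5, 15, 25, 35]})

# Element-wise square root of 'col_A' using a NumPy function
df_np['sqrt_A'] = np.sqrt(df_np['col_A'])

# Categorical label: 'high' where col_B exceeds 20 (an arbitrary
# threshold chosen for illustration), otherwise 'low'
df_np['size_label'] = np.where(df_np['col_B'] > 20, 'high', 'low')

print(df_np)
```

With these values, the first two rows are labeled 'low' and the last two 'high'.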
You can also add a new column by assigning a Python list or a NumPy array. The main requirement is that the length of the list or array must match the number of rows in the DataFrame (its index length).
# Add a column 'col_E' using a Python list
new_values_list = [100, 200, 300, 400]
df['col_E'] = new_values_list
# Add a column 'col_F' using a NumPy array
new_values_np = np.array([0.1, 0.2, 0.3, 0.4])
df['col_F'] = new_values_np
print("\nDataFrame after adding columns from list and NumPy array:")
print(df)
DataFrame after adding columns from list and NumPy array:
col_A col_B source version col_C col_D col_E col_F
0 10 5 dataset_1 1.0 15 1.0 100 0.1
1 20 15 dataset_1 1.0 35 2.0 200 0.2
2 30 25 dataset_1 1.0 55 3.0 300 0.3
3 40 35 dataset_1 1.0 75 4.0 400 0.4
If the length doesn't match, Pandas will raise a ValueError.
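To see the length check in action, the sketch below assigns a three-element list to a four-row DataFrame and catches the resulting error. The names `df_demo` and `too_short` are hypothetical, and the exact wording of the error message may differ between Pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical four-row frame, mirroring the shape of the example above
df_demo = pd.DataFrame({'col_A': [10, 20, 30, 40]})

error_message = None
try:
    # Three values for four rows: the lengths do not match
    df_demo['too_short'] = [1, 2, 3]
except ValueError as exc:
    error_message = str(exc)
    print(f"ValueError: {error_message}")

# The failed assignment does not add the column
print(df_demo.columns.tolist())
```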
You can also assign an existing Pandas Series to create a new column. When doing this, Pandas aligns the data based on the index of the Series and the DataFrame. If the indices match, the values are placed accordingly. If the indices don't perfectly align, rows in the DataFrame that don't have a matching index in the Series will get a missing value (NaN) in the new column.
Let's create a Series with a slightly different index:
s = pd.Series([500, 600, 700], index=[1, 2, 4]) # Note index: 1, 2, 4
print("\nSeries 's' to be added:")
print(s)
# Add 'col_G' using Series 's'
df['col_G'] = s
print("\nDataFrame after adding column 'col_G' from Series 's':")
print(df)
Series 's' to be added:
1 500
2 600
4 700
dtype: int64
DataFrame after adding column 'col_G' from Series 's':
col_A col_B source version col_C col_D col_E col_F col_G
0 10 5 dataset_1 1.0 15 1.0 100 0.1 NaN
1 20 15 dataset_1 1.0 35 2.0 200 0.2 500.0
2 30 25 dataset_1 1.0 55 3.0 300 0.3 600.0
3 40 35 dataset_1 1.0 75 4.0 400 0.4 NaN
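Notice how col_G has values only at indices 1 and 2, matching the Series s. Indices 0 and 3 in the DataFrame don't exist in s, so they received NaN. The value at index 4 in s was dropped because the DataFrame has no index 4. Because NaN is a floating-point value, the integers from s are stored as floats (500.0, 600.0) in col_G. This index alignment is fundamental to Pandas: values are matched by label rather than by position, so data that arrives in a different order or with missing rows still lands in the correct rows.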
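Alignment by label also means the order of the Series does not matter. The sketch below, using the hypothetical names `df_align` and `s_rev`, assigns a Series whose index is listed in reverse. It then contrasts that with assigning the underlying NumPy array, which carries no index and is therefore placed by position.

```python
import pandas as pd

df_align = pd.DataFrame({'col_A': [10, 20, 30, 40]})

# Same labels as the DataFrame index, but listed in reverse order
s_rev = pd.Series([400, 300, 200, 100], index=[3, 2, 1, 0])

# Assigning the Series aligns on index labels: label 0 receives 100
df_align['by_label'] = s_rev

# Converting to a NumPy array drops the index, so values go in by position
df_align['by_position'] = s_rev.to_numpy()

print(df_align)
```

Here `by_label` ends up as 100, 200, 300, 400, while `by_position` keeps the reversed order 400, 300, 200, 100. Converting to an array is one way to opt out of alignment when the values are already known to be in row order.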
Being able to add new columns, especially derived ones, is an important step in data preparation. It allows you to engineer new features, normalize data, or simply structure information more effectively for analysis or modeling.
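As one small illustration of a derived feature, the sketch below rescales a column to the range 0 to 1 (min-max normalization) and stores the result in a new column. The names `df_feat` and `col_A_scaled` are assumptions made for this example, and min-max scaling is only one of several possible normalization choices.

```python
import pandas as pd

# Hypothetical frame reusing the 'col_A' values from this section
df_feat = pd.DataFrame({'col_A': [10, 20, 30, 40]})

col_min = df_feat['col_A'].min()
col_max = df_feat['col_A'].max()

# Min-max normalization: the smallest value maps to 0, the largest to 1
df_feat['col_A_scaled'] = (df_feat['col_A'] - col_min) / (col_max - col_min)

print(df_feat)
```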
© 2025 ApX Machine Learning