While concatenation using pd.concat
is useful for stacking datasets with similar structures, often you'll need to combine DataFrames based on the values in one or more common columns or indices. This is analogous to how you might join tables in a relational database like SQL. Pandas provides the powerful pd.merge()
function for these operations.
pd.merge()
allows you to combine rows from two DataFrames based on shared values in specified columns, often called "keys". Think of it like looking up information in one table based on an identifier found in another table.
Let's imagine we have two simple DataFrames: one containing employee information and another containing department information.
import pandas as pd
# Employee data
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'dept_id': [10, 20, 10, 30]
})
# Department data
departments = pd.DataFrame({
'dept_id': [10, 20, 40],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
Output:
Employees DataFrame:
employee_id name dept_id
0 101 Alice 10
1 102 Bob 20
2 103 Charlie 10
3 104 David 30
Departments DataFrame:
dept_id dept_name
0 10 Engineering
1 20 Marketing
2 40 Sales
Both DataFrames share a dept_id
column. We can use this column as the "key" to merge the two DataFrames, adding the dept_name
to the employee information.
The simplest way to merge is to specify the DataFrames and the column(s) to merge on using the on
parameter:
# Merge employees and departments based on 'dept_id'
merged_df = pd.merge(employees, departments, on='dept_id')
print("\nMerged DataFrame (Default - Inner Join):")
print(merged_df)
Output:
Merged DataFrame (Default - Inner Join):
employee_id name dept_id dept_name
0 101 Alice 10 Engineering
1 103 Charlie 10 Engineering
2 102 Bob 20 Marketing
Notice a few things:
dept_name
is now associated with the corresponding employees based on dept_id
.dept_id
30 doesn't exist in the departments
DataFrame.employees
DataFrame has dept_id
40.By default, pd.merge()
performs an inner join. This means it only keeps rows where the key (dept_id
in this case) exists in both DataFrames. We'll explore other join types shortly.
What if the key columns have different names in the two DataFrames? For example, suppose the department key in the employees
DataFrame was named department_code
instead of dept_id
.
# Modified employee data with different key name
employees_alt = pd.DataFrame({
'employee_id': [101, 102, 103, 104],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'department_code': [10, 20, 10, 30] # Renamed column
})
print("Employees DataFrame (Alternative):")
print(employees_alt)
print("\nDepartments DataFrame:")
print(departments)
Output:
Employees DataFrame (Alternative):
employee_id name department_code
0 101 Alice 10
1 102 Bob 20
2 103 Charlie 10
3 104 David 30
Departments DataFrame:
dept_id dept_name
0 10 Engineering
1 20 Marketing
2 40 Sales
In this situation, you can't use the on
parameter directly. Instead, you use left_on
and right_on
to specify the key column name for the left (employees_alt
) and right (departments
) DataFrames, respectively.
# Merge using left_on and right_on
merged_alt_keys = pd.merge(employees_alt, departments,
left_on='department_code',
right_on='dept_id')
print("\nMerged DataFrame (Different Key Names):")
print(merged_alt_keys)
Output:
Merged DataFrame (Different Key Names):
employee_id name department_code dept_id dept_name
0 101 Alice 10 10 Engineering
1 103 Charlie 10 10 Engineering
2 102 Bob 20 20 Marketing
The result is the same as before, but notice that both key columns (department_code
and dept_id
) are included in the merged DataFrame. You might want to drop one of them after the merge if it's redundant.
You can also merge based on multiple columns. Simply provide a list of column names to the on
parameter (or use left_on
and right_on
with lists of column names). The merge operation will then only combine rows where all specified key columns match between the DataFrames.
# Example DataFrames with multiple potential keys
df_left = pd.DataFrame({
'key1': ['A', 'B', 'B', 'C'],
'key2': [1, 2, 1, 2],
'left_val': [10, 20, 30, 40]
})
df_right = pd.DataFrame({
'key1': ['B', 'C', 'C', 'D'],
'key2': [1, 2, 3, 1],
'right_val': [100, 200, 300, 400]
})
print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)
# Merge on both key1 and key2
merged_multi = pd.merge(df_left, df_right, on=['key1', 'key2'])
print("\nMerged on Multiple Keys (Inner Join):")
print(merged_multi)
Output:
Left DataFrame:
key1 key2 left_val
0 A 1 10
1 B 2 20
2 B 1 30
3 C 2 40
Right DataFrame:
key1 key2 right_val
0 B 1 100
1 C 2 200
2 C 3 300
3 D 1 400
Merged on Multiple Keys (Inner Join):
key1 key2 left_val right_val
0 B 1 30 100
1 C 2 40 200
Only the rows where both key1
and key2
matched in df_left
and df_right
were kept.
The pd.merge()
function is a fundamental tool for integrating data from different sources based on shared information. Understanding how to specify keys is the first step. Next, we'll look at the different types of merges you can perform.
© 2025 ApX Machine Learning