Combining DataFrames based on the values in common columns or indices is a frequent requirement, analogous to joining tables in a relational database like SQL. While concatenation, often performed using pd.concat, is useful for stacking datasets with similar structures, these value-based combinations require a different approach. Pandas provides the pd.merge() function for these operations.
pd.merge() allows you to combine rows from two DataFrames based on shared values in specified columns, often called "keys". Think of it like looking up information in one table based on an identifier found in another table.
Let's imagine we have two simple DataFrames: one containing employee information and another containing department information.
import pandas as pd
# Employee data
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'dept_id': [10, 20, 10, 30]
})
# Department data
departments = pd.DataFrame({
'dept_id': [10, 20, 40],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
Output:
Employees DataFrame:
employee_id name dept_id
0 101 Alice 10
1 102 Bob 20
2 103 Charlie 10
3 104 David 30
Departments DataFrame:
dept_id dept_name
0 10 Engineering
1 20 Marketing
2 40 Sales
Both DataFrames share a dept_id column. We can use this column as the "key" to merge the two DataFrames, adding the dept_name to the employee information.
The simplest way to merge is to specify the DataFrames and the column(s) to merge on using the on parameter:
# Merge employees and departments based on 'dept_id'
merged_df = pd.merge(employees, departments, on='dept_id')
print("\nMerged DataFrame (Default - Inner Join):")
print(merged_df)
Output:
Merged DataFrame (Default - Inner Join):
employee_id name dept_id dept_name
0 101 Alice 10 Engineering
1 103 Charlie 10 Engineering
2 102 Bob 20 Marketing
Notice a few things:
dept_name is now associated with the corresponding employees based on dept_id.dept_id 30 doesn't exist in the departments DataFrame.employees DataFrame has dept_id 40.By default, pd.merge() performs an inner join. This means it only keeps rows where the (dept_id in this case) exists in both DataFrames. We'll explore other join types shortly.
What if the columns have different names in the two DataFrames? For example, suppose the department key in the employees DataFrame was named department_code instead of dept_id.
# Modified employee data with different name
employees_alt = pd.DataFrame({
'employee_id': [101, 102, 103, 104],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'department_code': [10, 20, 10, 30] # Renamed column
})
print("Employees DataFrame (Alternative):")
print(employees_alt)
print("\nDepartments DataFrame:")
print(departments)
Output:
Employees DataFrame (Alternative):
employee_id name department_code
0 101 Alice 10
1 102 Bob 20
2 103 Charlie 10
3 104 David 30
Departments DataFrame:
dept_id dept_name
0 10 Engineering
1 20 Marketing
2 40 Sales
In this situation, you can't use the on parameter directly. Instead, you use left_on and right_on to specify the column name for the left (employees_alt) and right (departments) DataFrames, respectively.
# Merge using left_on and right_on
merged_alt_keys = pd.merge(employees_alt, departments,
left_on='department_code',
right_on='dept_id')
print("\nMerged DataFrame:")
print(merged_alt_keys)
Output:
Merged DataFrame (Different Names):
employee_id name department_code dept_id dept_name
0 101 Alice 10 10 Engineering
1 103 Charlie 10 10 Engineering
2 102 Bob 20 20 Marketing
The result is the same as before, but notice that both columns (department_code and dept_id) are included in the merged DataFrame. You might want to drop one of them after the merge if it's redundant.
You can also merge based on multiple columns. Simply provide a list of column names to the on parameter (or use left_on and right_on with lists of column names). The merge operation will then only combine rows where all specified columns match between the DataFrames.
# Example DataFrames with multiple potential keys
df_left = pd.DataFrame({
'key1': ['A', 'B', 'B', 'C'],
'key2': [1, 2, 1, 2],
'left_val': [10, 20, 30, 40]
})
df_right = pd.DataFrame({
'key1': ['B', 'C', 'C', 'D'],
'key2': [1, 2, 3, 1],
'right_val': [100, 200, 300, 400]
})
print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)
# Merge on both key1 and key2
merged_multi = pd.merge(df_left, df_right, on=['key1', 'key2'])
print("\nMerged on Multiple Keys (Inner Join):")
print(merged_multi)
Output:
Left DataFrame:
key1 key2 left_val
0 A 1 10
1 B 2 20
2 B 1 30
3 C 2 40
Right DataFrame:
key1 key2 right_val
0 B 1 100
1 C 2 200
2 C 3 300
3 D 1 400
Merged on Multiple Keys (Inner Join):
key1 key2 left_val right_val
0 B 1 30 100
1 C 2 40 200
Only the rows where both key1 and key2 matched in df_left and df_right were kept.
The pd.merge() function is a fundamental tool for integrating data from different sources based on shared information. Understanding how to specify keys is the first step. Next, we'll look at the different types of merges you can perform.
Was this section helpful?
pd.merge, explaining its parameters and behavior.pd.merge and its connection to SQL-style joins, with illustrative examples.© 2026 ApX Machine LearningEngineered with