While concatenation using pd.concat is useful for stacking datasets with similar structures, often you'll need to combine DataFrames based on the values in one or more common columns or indices. This is analogous to how you might join tables in a relational database like SQL. Pandas provides the powerful pd.merge() function for these operations.

pd.merge() allows you to combine rows from two DataFrames based on shared values in specified columns, often called "keys". Think of it like looking up information in one table based on an identifier found in another table.

The Core Idea: Merging on Keys

Let's imagine we have two simple DataFrames: one containing employee information and another containing department information.

import pandas as pd

# Employee data
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})

# Department data
departments = pd.DataFrame({
    'dept_id': [10, 20, 40],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

Output:

Employees DataFrame:
   employee_id     name  dept_id
0          101    Alice       10
1          102      Bob       20
2          103  Charlie       10
3          104    David       30

Departments DataFrame:
   dept_id    dept_name
0       10  Engineering
1       20    Marketing
2       40        Sales

Both DataFrames share a dept_id column. We can use this column as the "key" to merge the two DataFrames, adding the dept_name to the employee information.

The simplest way to merge is to specify the DataFrames and the column(s) to merge on using the on parameter:

# Merge employees and departments based on 'dept_id'
merged_df = pd.merge(employees, departments, on='dept_id')

print("\nMerged DataFrame (Default - Inner Join):")
print(merged_df)

Output:

Merged DataFrame (Default - Inner Join):
   employee_id     name  dept_id    dept_name
0          101    Alice       10  Engineering
1          103  Charlie       10  Engineering
2          102      Bob       20    Marketing

Notice a few things:

The dept_name is now associated with the corresponding employees based on dept_id.
David (employee_id 104, dept_id 30) is missing from the result. Why? Because dept_id 30 doesn't exist in the departments DataFrame.
The Sales department (dept_id 40) is also missing. Why? Because no employee in the employees DataFrame has dept_id 40.

By default, pd.merge() performs an inner join. This means it only keeps rows where the key (dept_id in this case) exists in both DataFrames. We'll explore other join types shortly.

Specifying Different Key Column Names

What if the key columns have different names in the two DataFrames? For example, suppose the department key in the employees DataFrame was named department_code instead of dept_id.

# Modified employee data with different key name
employees_alt = pd.DataFrame({
    'employee_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'department_code': [10, 20, 10, 30] # Renamed column
})

print("Employees DataFrame (Alternative):")
print(employees_alt)
print("\nDepartments DataFrame:")
print(departments)

Output:

Employees DataFrame (Alternative):
   employee_id     name  department_code
0          101    Alice               10
1          102      Bob               20
2          103  Charlie               10
3          104    David               30

Departments DataFrame:
   dept_id    dept_name
0       10  Engineering
1       20    Marketing
2       40        Sales

In this situation, you can't use the on parameter directly. Instead, you use left_on and right_on to specify the key column name for the left (employees_alt) and right (departments) DataFrames, respectively.

# Merge using left_on and right_on
merged_alt_keys = pd.merge(employees_alt, departments,
                           left_on='department_code',
                           right_on='dept_id')

print("\nMerged DataFrame (Different Key Names):")
print(merged_alt_keys)

Output:

Merged DataFrame (Different Key Names):
   employee_id     name  department_code  dept_id    dept_name
0          101    Alice               10       10  Engineering
1          103  Charlie               10       10  Engineering
2          102      Bob               20       20    Marketing

The result is the same as before, but notice that both key columns (department_code and dept_id) are included in the merged DataFrame. You might want to drop one of them after the merge if it's redundant.

Merging on Multiple Keys

You can also merge based on multiple columns. Simply provide a list of column names to the on parameter (or use left_on and right_on with lists of column names). The merge operation will then only combine rows where all specified key columns match between the DataFrames.

# Example DataFrames with multiple potential keys
df_left = pd.DataFrame({
    'key1': ['A', 'B', 'B', 'C'],
    'key2': [1, 2, 1, 2],
    'left_val': [10, 20, 30, 40]
})

df_right = pd.DataFrame({
    'key1': ['B', 'C', 'C', 'D'],
    'key2': [1, 2, 3, 1],
    'right_val': [100, 200, 300, 400]
})

print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)

# Merge on both key1 and key2
merged_multi = pd.merge(df_left, df_right, on=['key1', 'key2'])

print("\nMerged on Multiple Keys (Inner Join):")
print(merged_multi)

Output:

Left DataFrame:
  key1  key2  left_val
0    A     1        10
1    B     2        20
2    B     1        30
3    C     2        40

Right DataFrame:
  key1  key2  right_val
0    B     1        100
1    C     2        200
2    C     3        300
3    D     1        400

Merged on Multiple Keys (Inner Join):
  key1  key2  left_val  right_val
0    B     1        30        100
1    C     2        40        200

Only the rows where both key1 and key2 matched in df_left and df_right were kept.

The pd.merge() function is a fundamental tool for integrating data from different sources based on shared information. Understanding how to specify keys is the first step. Next, we'll look at the different types of merges you can perform.