Understanding Merge Types (Joins)

When you combine DataFrame objects using pd.merge(), the way rows are matched and included in the final result depends heavily on the type of merge or join you perform. This behavior is controlled by the how parameter within the pd.merge() function. Think of these merge types as different rules for deciding which rows to keep based on whether the join keys match in the input DataFrame objects.

Let's set up two simple DataFrame objects to illustrate these different join types. Imagine we have one DataFrame with employee information and another with their project assignments.

import pandas as pd

# DataFrame with employee names
employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

# DataFrame with project assignments
projects = pd.DataFrame({
    'EmpID': [103, 104, 105, 106],
    'Project': ['Zeus', 'Apollo', 'Athena', 'Poseidon']
})

print("Employees DataFrame:")
print(employees)
print("\nProjects DataFrame:")
print(projects)

Running this gives us:

Employees DataFrame:
   EmpID     Name
0    101    Alice
1    102      Bob
2    103  Charlie
3    104    David

Projects DataFrame:
   EmpID   Project
0    103      Zeus
1    104    Apollo
2    105    Athena
3    106  Poseidon

Notice that EmpID 103 and 104 exist in both DataFrame objects. EmpID 101 and 102 are only in employees, while EmpID 105 and 106 are only in projects. The EmpID column is our common column, often called the join key, which pd.merge() will use by default if not specified otherwise.

Inner Join (`how='inner'`)

An inner join is the default merge type in Pandas. It combines the two DataFrame objects and keeps only the rows where the join key (EmpID in our case) exists in both the left and the right DataFrame objects. Rows with keys present in only one of the DataFrame objects are discarded. It essentially finds the intersection of the keys.

# Perform an inner join
inner_join_df = pd.merge(employees, projects, on='EmpID', how='inner')

print("Inner Join Result:")
print(inner_join_df)

The output shows only the employees present in both tables:

Inner Join Result:
   EmpID     Name Project
0    103  Charlie    Zeus
1    104    David  Apollo

An inner join retains only the common EmpID values (103, 104) found in both the Employees and Projects datasets.

Outer Join (`how='outer'`)

An outer join, sometimes called a full outer join, keeps all rows from both DataFrame objects. If a row from one DataFrame doesn't have a matching key in the other, the columns from the other DataFrame are filled with NaN (Not a Number) values. This join type finds the union of the keys.

# Perform an outer join
outer_join_df = pd.merge(employees, projects, on='EmpID', how='outer')

print("Outer Join Result:")
print(outer_join_df)

The result includes all employees and all projects:

Outer Join Result:
   EmpID     Name   Project
0  101.0    Alice       NaN
1  102.0      Bob       NaN
2  103.0  Charlie      Zeus
3  104.0    David    Apollo
4  105.0      NaN    Athena
5  106.0      NaN  Poseidon

Notice how Alice (101) and Bob (102) have NaN in the Project column because they weren't in the projects DataFrame. Similarly, projects assigned to EmpID 105 and 106 have NaN in the Name column.

An outer join retains all EmpID values from both datasets, using NaN where a corresponding match is not found in the other dataset.

Left Join (`how='left'`)

A left join keeps all rows from the left DataFrame (the first one passed to pd.merge(), which is employees in our examples) and includes matching rows from the right DataFrame (projects). If a key from the left DataFrame does not exist in the right DataFrame, the columns from the right DataFrame are filled with NaN. Keys that exist only in the right DataFrame are discarded.

# Perform a left join
left_join_df = pd.merge(employees, projects, on='EmpID', how='left')

print("Left Join Result:")
print(left_join_df)

This keeps all employees and adds project information where available:

Left Join Result:
   EmpID     Name Project
0    101    Alice     NaN
1    102      Bob     NaN
2    103  Charlie    Zeus
3    104    David  Apollo

All EmpIDs from employees (101, 102, 103, 104) are present. Since 101 and 102 don't have matching entries in projects, their Project values are NaN. EmpIDs 105 and 106 from projects are excluded because they are not in the left DataFrame (employees).

A left join retains all EmpID values from the left dataset (Employees) and includes matching data from the right dataset (Projects). Non-matches from the right are filled with NaN.

Right Join (`how='right'`)

A right join is the mirror image of a left join. It keeps all rows from the right DataFrame (projects) and includes matching rows from the left DataFrame (employees). If a key from the right DataFrame does not exist in the left DataFrame, the columns from the left DataFrame are filled with NaN. Keys that exist only in the left DataFrame are discarded.

# Perform a right join
right_join_df = pd.merge(employees, projects, on='EmpID', how='right')

print("Right Join Result:")
print(right_join_df)

This keeps all projects and adds employee information where available:

Right Join Result:
   EmpID     Name   Project
0  103.0  Charlie      Zeus
1  104.0    David    Apollo
2  105.0      NaN    Athena
3  106.0      NaN  Poseidon

All EmpIDs from projects (103, 104, 105, 106) are present. Since 105 and 106 don't have matching entries in employees, their Name values are NaN. EmpIDs 101 and 102 from employees are excluded because they are not in the right DataFrame (projects).

A right join retains all EmpID values from the right dataset (Projects) and includes matching data from the left dataset (Employees). Non-matches from the left are filled with NaN.

Choosing the correct merge type is an important step in data preparation. It depends entirely on what information you need in your final combined dataset. Do you only care about entries present in both sources (inner)? Do you need a complete picture, including entries from only one source (outer)? Or are you primarily interested in augmenting one specific dataset with information from another (left or right)? Understanding these join types allows you to precisely control how your data is combined.

Was this section helpful?

References

pandas.merge, pandas development team, 2024 - Official documentation for the pd.merge() function, providing a complete reference for its parameters and merge types.
Python for Data Analysis, Wes McKinney, 2022 (O'Reilly Media) - A foundational book for Pandas data analysis, featuring comprehensive discussions and examples of combining DataFrames with merge operations.
Python Data Science Handbook, Jake VanderPlas, 2016 (O'Reilly Media) - An accessible guide to data science in Python, explaining Pandas data manipulation, including examples of various DataFrame merge types.

Understanding Merge Types (Joins)

Inner Join (how='inner')

Outer Join (how='outer')

Left Join (how='left')

Right Join (how='right')

Inner Join (`how='inner'`)

Outer Join (`how='outer'`)

Left Join (`how='left'`)

Right Join (`how='right'`)