When you combine DataFrame objects using pd.merge(), the way rows are matched and included in the final result depends heavily on the type of merge or join you perform. This behavior is controlled by the how parameter within the pd.merge() function. Think of these merge types as different rules for deciding which rows to keep based on whether the join keys match in the input DataFrame objects.
Let's set up two simple DataFrame objects to illustrate these different join types. Imagine we have one DataFrame with employee information and another with their project assignments.
import pandas as pd
# DataFrame with employee names
employees = pd.DataFrame({
'EmpID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
# DataFrame with project assignments
projects = pd.DataFrame({
'EmpID': [103, 104, 105, 106],
'Project': ['Zeus', 'Apollo', 'Athena', 'Poseidon']
})
print("Employees DataFrame:")
print(employees)
print("\nProjects DataFrame:")
print(projects)
Running this gives us:
Employees DataFrame:
EmpID Name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 David
Projects DataFrame:
EmpID Project
0 103 Zeus
1 104 Apollo
2 105 Athena
3 106 Poseidon
Notice that EmpID 103 and 104 exist in both DataFrame objects. EmpID 101 and 102 are only in employees, while EmpID 105 and 106 are only in projects. The EmpID column is our common column, often called the join key, which pd.merge() will use by default if not specified otherwise.
how='inner')An inner join is the default merge type in Pandas. It combines the two DataFrame objects and keeps only the rows where the join key (EmpID in our case) exists in both the left and the right DataFrame objects. Rows with keys present in only one of the DataFrame objects are discarded. It essentially finds the intersection of the keys.
# Perform an inner join
inner_join_df = pd.merge(employees, projects, on='EmpID', how='inner')
print("Inner Join Result:")
print(inner_join_df)
The output shows only the employees present in both tables:
Inner Join Result:
EmpID Name Project
0 103 Charlie Zeus
1 104 David Apollo
An inner join retains only the common
EmpIDvalues (103, 104) found in both the Employees and Projects datasets.
how='outer')An outer join, sometimes called a full outer join, keeps all rows from both DataFrame objects. If a row from one DataFrame doesn't have a matching key in the other, the columns from the other DataFrame are filled with NaN (Not a Number) values. This join type finds the union of the keys.
# Perform an outer join
outer_join_df = pd.merge(employees, projects, on='EmpID', how='outer')
print("Outer Join Result:")
print(outer_join_df)
The result includes all employees and all projects:
Outer Join Result:
EmpID Name Project
0 101.0 Alice NaN
1 102.0 Bob NaN
2 103.0 Charlie Zeus
3 104.0 David Apollo
4 105.0 NaN Athena
5 106.0 NaN Poseidon
Notice how Alice (101) and Bob (102) have NaN in the Project column because they weren't in the projects DataFrame. Similarly, projects assigned to EmpID 105 and 106 have NaN in the Name column.
An outer join retains all
EmpIDvalues from both datasets, usingNaNwhere a corresponding match is not found in the other dataset.
how='left')A left join keeps all rows from the left DataFrame (the first one passed to pd.merge(), which is employees in our examples) and includes matching rows from the right DataFrame (projects). If a key from the left DataFrame does not exist in the right DataFrame, the columns from the right DataFrame are filled with NaN. Keys that exist only in the right DataFrame are discarded.
# Perform a left join
left_join_df = pd.merge(employees, projects, on='EmpID', how='left')
print("Left Join Result:")
print(left_join_df)
This keeps all employees and adds project information where available:
Left Join Result:
EmpID Name Project
0 101 Alice NaN
1 102 Bob NaN
2 103 Charlie Zeus
3 104 David Apollo
All EmpIDs from employees (101, 102, 103, 104) are present. Since 101 and 102 don't have matching entries in projects, their Project values are NaN. EmpIDs 105 and 106 from projects are excluded because they are not in the left DataFrame (employees).
A left join retains all
EmpIDvalues from the left dataset (Employees) and includes matching data from the right dataset (Projects). Non-matches from the right are filled withNaN.
how='right')A right join is the mirror image of a left join. It keeps all rows from the right DataFrame (projects) and includes matching rows from the left DataFrame (employees). If a key from the right DataFrame does not exist in the left DataFrame, the columns from the left DataFrame are filled with NaN. Keys that exist only in the left DataFrame are discarded.
# Perform a right join
right_join_df = pd.merge(employees, projects, on='EmpID', how='right')
print("Right Join Result:")
print(right_join_df)
This keeps all projects and adds employee information where available:
Right Join Result:
EmpID Name Project
0 103.0 Charlie Zeus
1 104.0 David Apollo
2 105.0 NaN Athena
3 106.0 NaN Poseidon
All EmpIDs from projects (103, 104, 105, 106) are present. Since 105 and 106 don't have matching entries in employees, their Name values are NaN. EmpIDs 101 and 102 from employees are excluded because they are not in the right DataFrame (projects).
A right join retains all
EmpIDvalues from the right dataset (Projects) and includes matching data from the left dataset (Employees). Non-matches from the left are filled withNaN.
Choosing the correct merge type is an important step in data preparation. It depends entirely on what information you need in your final combined dataset. Do you only care about entries present in both sources (inner)? Do you need a complete picture, including entries from only one source (outer)? Or are you primarily interested in augmenting one specific dataset with information from another (left or right)? Understanding these join types allows you to precisely control how your data is combined.
Was this section helpful?
© 2026 ApX Machine LearningEngineered with