Pandas offers pd.merge for combining DataFrames, providing powerful, SQL-like flexibility based on columns or indices. Complementing this, Pandas also provides the .join() method, specifically optimized for combining DataFrames based on their index labels, or for joining an index to a column. This method offers an efficient way to perform specific kinds of merges, particularly left joins on indices, which are very common.
By default, DataFrame.join() attempts to join with another DataFrame (or multiple DataFrames) using their indices. It performs a left join unless specified otherwise, meaning it keeps all rows from the calling DataFrame (the "left" one) and includes matching data from the "right" DataFrame based on the index.
Let's set up two simple DataFrames to illustrate:
import pandas as pd
# Left DataFrame
left_df = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
# Right DataFrame
right_df = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print("Left DataFrame:")
print(left_df)
print("\nRight DataFrame:")
print(right_df)
# Join left_df with right_df on their indices (default: left join)
joined_df = left_df.join(right_df)
print("\nJoined DataFrame (left join on index):")
print(joined_df)
Output:
Left DataFrame:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
Right DataFrame:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
Joined DataFrame (left join on index):
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
Notice how the resulting joined_df retains all index labels from left_df (K0, K1, K2). For K0 and K2, which exist in both DataFrames' indices, the corresponding values from right_df (C0, D0, C2, D2) are included. For K1, which is only in left_df, the columns from right_df (C and D) are filled with NaN (Not a Number), representing missing data. The index K3 from right_df is not included because it wasn't present in left_df and we performed a left join.
The .join() method isn't limited to joining index-on-index. You can join the index of the calling DataFrame (left) to one or more columns in the passed DataFrame (right) using the on parameter.
# Right DataFrame with a column instead of index
right_df_on_col = pd.DataFrame({'key': ['K0', 'K2', 'K3'],
'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']})
print("Left DataFrame:")
print(left_df)
print("\nRight DataFrame:")
print(right_df_on_col)
# Join left_df's index with right_df_on_col's column
joined_on_col = left_df.join(right_df_on_col.set_index('key'), on=None)
# Alternatively, and more directly with .join's 'on' parameter (works on the *other* DataFrame):
# This is less common usage for .join(), pd.merge is often clearer here.
# However, if you call join on right_df_on_col:
# joined_on_col = right_df_on_col.join(left_df, on='key') # Joins right's 'key' col with left's index
# The most common pattern using .join for this is setting the index first:
joined_on_col_idiom = left_df.join(right_df_on_col.set_index('key'))
print("\nJoined DataFrame (left index on right column):")
print(joined_on_col_idiom)
Output:
Left DataFrame:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
Right DataFrame (with column):
key C D
0 K0 C0 D0
1 K2 C2 D2
2 K3 C3 D3
Joined DataFrame (left index on right column):
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
In this example, we first used set_index('key') on right_df_on_col to make the key column its index, allowing a standard index-on-index join with left_df. This is a very frequent pattern when using .join(). The result is identical to the first example because the logic becomes the same: join based on matching index values (K0, K1, K2 from left_df against the new index of right_df_on_col).
While left_df.join(other, on='col_in_other') is possible, it's often less readable than using pd.merge(left_df, other, left_index=True, right_on='col_in_other') or the set_index approach shown above. The primary strength of .join() lies in its simplicity for index-based operations.
howJust like pd.merge, the .join() method accepts a how argument to control the type of join. The options are:
'left': (Default) Keep all keys from the left DataFrame.'right': Keep all keys from the right DataFrame.'outer': Keep all keys from both DataFrames.'inner': Keep only keys found in both DataFrames.Let's see how these affect the outcome using our original left_df and right_df:
# Inner Join
inner_join = left_df.join(right_df, how='inner')
print("\nInner Join (keys in both):")
print(inner_join)
# Outer Join
outer_join = left_df.join(right_df, how='outer')
print("\nOuter Join (keys in either):")
print(outer_join)
# Right Join
right_join = left_df.join(right_df, how='right')
print("\nRight Join (keys in right):")
print(right_join)
Output:
Inner Join (keys in both):
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
Outer Join (keys in either):
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
Right Join (keys in right):
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
Observe the differences:
inner: Only K0 and K2 appear, as they are the only index labels present in both left_df and right_df.outer: All unique index labels from both DataFrames (K0, K1, K2, K3) are present. NaN fills in where data is missing from one of the original DataFrames.right: All index labels from right_df (K0, K2, K3) are kept. Data from left_df is included where the index matches; otherwise, NaN is used (as seen for K3).Here's a diagram illustrating a left join:
Representation of a
left_df.join(right_df)operation. All index keys from the left table are preserved. Matching keys from the right table bring their data; non-matching keys result in NaN values for the right table's columns.
If the DataFrames you are joining have columns with the same name (other than the index or columns being joined on), .join() will raise an error. To resolve this, you can use the lsuffix and rsuffix parameters to append distinguishing suffixes to the overlapping column names from the left and right DataFrames, respectively.
# Another right DataFrame, but with a conflicting column name 'B'
right_overlap = pd.DataFrame({'B': ['B_r0', 'B_r2', 'B_r3'], # Overlapping column 'B'
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print("Left DataFrame:")
print(left_df)
print("\nRight DataFrame with Overlap:")
print(right_overlap)
# Attempting join without suffixes (will cause error)
# joined_overlap_error = left_df.join(right_overlap) # Raises ValueError
# Using suffixes to resolve overlap
joined_overlap_fixed = left_df.join(right_overlap, lsuffix='_left', rsuffix='_right')
print("\nJoined DataFrame with Suffixes:")
print(joined_overlap_fixed)
Output:
Left DataFrame:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
Right DataFrame with Overlap:
B D
K0 B_r0 D0
K2 B_r2 D2
K3 B_r3 D3
Joined DataFrame with Suffixes:
A B_left B_right D
K0 A0 B0 B_r0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 B_r2 D2
As you can see, the original B column from left_df is now B_left, and the B column from right_overlap is now B_right in the final result.
You can join more than two DataFrames at once by passing a list of DataFrames to .join(). The join operation happens sequentially from left to right based on the index.
# A third DataFrame
other_df = pd.DataFrame({'E': ['E1', 'E2', 'E4']}, index=['K1', 'K2', 'K4'])
print("Left DF:")
print(left_df)
print("\nRight DF:")
print(right_df)
print("\nOther DF:")
print(other_df)
# Join left_df with right_df and other_df (default: left joins)
multi_join = left_df.join([right_df, other_df])
print("\nMulti-Join Result (left joins):")
print(multi_join)
# Example with outer joins
multi_join_outer = left_df.join([right_df, other_df], how='outer')
print("\nMulti-Join Result (outer joins):")
print(multi_join_outer)
Output:
Left DF:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
Right DF:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
Other DF:
E
K1 E1
K2 E2
K4 E4
Multi-Join Result (left joins):
A B C D E
K0 A0 B0 C0 D0 NaN
K1 A1 B1 NaN NaN E1
K2 A2 B2 C2 D2 E2
Multi-Join Result (outer joins):
A B C D E
K0 A0 B0 C0 D0 NaN
K1 A1 B1 NaN NaN E1
K2 A2 B2 C2 D2 E2
K3 NaN NaN C3 D3 NaN
K4 NaN NaN NaN NaN E4
The default left join keeps only the indices from the initial DataFrame (left_df). The outer join combines all indices from all DataFrames involved.
In summary, DataFrame.join() provides a concise syntax primarily for index-based joins. While pd.merge offers more general-purpose merging capabilities, .join() is often quicker and more readable when your combination logic relies heavily on the DataFrame indices, especially for the common case of performing a left join. Remember that .join() internally uses pd.merge, so the underlying logic is consistent.
Was this section helpful?
DataFrame.join method, explaining its parameters, behavior, and examples.pandas.merge function, which underpins DataFrame.join and offers broader combination capabilities.© 2026 ApX Machine LearningEngineered with