Data analysis frequently involves combining information from multiple sources. Whether you have customer details in one file and their order history in another, or experimental results spread across several tables, you'll need ways to integrate these disparate datasets. Pandas provides powerful and flexible functions for combining Series and DataFrames: concat
, merge
, and join
. Understanding how and when to use each is fundamental for effective data preparation.
This section focuses on these core combining operations. We'll look at simple stacking of data, database style joins based on common columns, and index based joining.
pd.concat
The pd.concat
function performs concatenation operations along an axis. It's useful for simply stacking multiple Series or DataFrames either vertically (row wise) or horizontally (column wise).
Let's create a couple of simple DataFrames:
import pandas as pd
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({
'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])
df3 = pd.DataFrame({
'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])
The simplest use of pd.concat
is to combine a list of DataFrames. By default, it concatenates row wise (axis=0
):
result_rows = pd.concat([df1, df2, df3])
print(result_rows)
# Output:
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
Notice that the indexes from the original DataFrames are preserved. If the original DataFrames had overlapping index values, the resulting DataFrame would also have duplicate index values. Often, you don't need the original index. You can create a new default integer index using the ignore_index=True
argument:
result_ignore_index = pd.concat([df1, df2, df3], ignore_index=True)
print(result_ignore_index)
# Output:
# A B C D
# 0 A0 B0 C0 D0
# 1 A1 B1 C1 D1
# 2 A2 B2 C2 D2
# 3 A3 B3 C3 D3
# 4 A4 B4 C4 D4
# 5 A5 B5 C5 D5
# 6 A6 B6 C6 D6
# 7 A7 B7 C7 D7
# 8 A8 B8 C8 D8
# 9 A9 B9 C9 D9
# 10 A10 B10 C10 D10
# 11 A11 B11 C11 D11
You can also concatenate column wise by setting axis=1
. In this case, Pandas aligns data based on the index. If DataFrames have different indexes, the result will contain the union of the indexes, introducing NaN
values where data is missing.
Let's create a DataFrame to concatenate horizontally:
df4 = pd.DataFrame({
'E': ['E0', 'E1', 'E2', 'E3'],
'F': ['F0', 'F1', 'F2', 'F3']
}, index=[0, 1, 2, 3])
result_cols = pd.concat([df1, df4], axis=1)
print(result_cols)
# Output:
# A B C D E F
# 0 A0 B0 C0 D0 E0 F0
# 1 A1 B1 C1 D1 E1 F1
# 2 A2 B2 C2 D2 E2 F2
# 3 A3 B3 C3 D3 E3 F3
If the indexes don't align perfectly:
df5 = pd.DataFrame({
'G': ['G2', 'G3', 'G4', 'G5']
}, index=[2, 3, 4, 5]) # Different index than df1
result_cols_misaligned = pd.concat([df1, df5], axis=1)
print(result_cols_misaligned)
# Output:
# A B C D G
# 0 A0 B0 C0 D0 NaN
# 1 A1 B1 C1 D1 NaN
# 2 A2 B2 C2 D2 G2
# 3 A3 B3 C3 D3 G3
# 4 NaN NaN NaN NaN G4
# 5 NaN NaN NaN NaN G5
Here, pd.concat
takes the union of the indexes (0 through 5) and fills missing values with NaN
.
pd.merge
For more complex, database style joining operations based on the values in common columns (keys), Pandas provides the pd.merge
function. This is analogous to SQL JOIN
operations. It aligns rows from two DataFrames based on one or more shared keys.
Let's set up two DataFrames for merging:
left = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})
right = pd.DataFrame({
'key': ['K0', 'K1', 'K2', 'K4'], # Note K4 instead of K3
'C': ['C0', 'C1', 'C2', 'C4'],
'D': ['D0', 'D1', 'D2', 'D4']
})
print("Left DataFrame:\n", left)
print("\nRight DataFrame:\n", right)
# Output:
# Left DataFrame:
# key A B
# 0 K0 A0 B0
# 1 K1 A1 B1
# 2 K2 A2 B2
# 3 K3 A3 B3
# Right DataFrame:
# key C D
# 0 K0 C0 D0
# 1 K1 C1 D1
# 2 K2 C2 D2
# 3 K4 C4 D4
The primary argument for controlling the merge is how
, which specifies the type of join:
inner
(default): Use intersection of keys from both frames. Only keys found in both DataFrames are included.outer
: Use union of keys from both frames. Includes all keys found in either DataFrame, filling missing values with NaN
.left
: Use keys from the left frame only. Includes all keys from the left DataFrame, matching against the right.right
: Use keys from the right frame only. Includes all keys from the right DataFrame, matching against the left.By default, pd.merge
looks for columns with the same name in both DataFrames and uses them as the join keys.
# Inner join (default)
inner_merged = pd.merge(left, right, on='key')
print("\nInner Merged:\n", inner_merged)
# Output:
# Inner Merged:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
Only keys 'K0', 'K1', and 'K2', which are present in both left
and right
, appear in the result. 'K3' from left
and 'K4' from right
are excluded.
# Outer join
outer_merged = pd.merge(left, right, on='key', how='outer')
print("\nOuter Merged:\n", outer_merged)
# Output:
# Outer Merged:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# 3 K3 A3 B3 NaN NaN
# 4 K4 NaN NaN C4 D4
All keys from both DataFrames are included. Missing values are filled with NaN
.
# Left join
left_merged = pd.merge(left, right, on='key', how='left')
print("\nLeft Merged:\n", left_merged)
# Output:
# Left Merged:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# 3 K3 A3 B3 NaN NaN
All keys from the left
DataFrame ('K0', 'K1', 'K2', 'K3') are present. Since 'K3' doesn't exist in right
, the columns from right
('C', 'D') are filled with NaN
for that row.
# Right join
right_merged = pd.merge(left, right, on='key', how='right')
print("\nRight Merged:\n", right_merged)
# Output:
# Right Merged:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# 3 K4 NaN NaN C4 D4
All keys from the right
DataFrame ('K0', 'K1', 'K2', 'K4') are present. Since 'K4' doesn't exist in left
, the columns from left
('A', 'B') are filled with NaN
for that row.
Visualization of keys included in different
pd.merge
join types (how
argument). Common keys (K0, K1, K2) are always included except when missing from the preserving side of a left/right join. Unique keys (K3 in Left, K4 in Right) are included based on the join type.
Merging on Different Key Column Names:
If the key columns have different names in the two DataFrames, you can specify them using left_on
and right_on
:
left_diff_key = pd.DataFrame({
'lkey': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})
right_diff_key = pd.DataFrame({
'rkey': ['K0', 'K1', 'K2', 'K4'],
'C': ['C0', 'C1', 'C2', 'C4'],
'D': ['D0', 'D1', 'D2', 'D4']
})
merged_diff_keys = pd.merge(left_diff_key, right_diff_key, left_on='lkey', right_on='rkey', how='inner')
print("\nMerged on Different Key Names:\n", merged_diff_keys)
# Output:
# Merged on Different Key Names:
# lkey A B rkey C D
# 0 K0 A0 B0 K0 C0 D0
# 1 K1 A1 B1 K1 C1 D1
# 2 K2 A2 B2 K2 C2 D2
Note that both key columns ('lkey' and 'rkey') are present in the result.
Merging on Index:
You can also merge based on the DataFrame index instead of columns by setting left_index=True
or right_index=True
.
left_idx = left.set_index('key')
right_idx = right.set_index('key')
print("\nLeft with Key as Index:\n", left_idx)
print("\nRight with Key as Index:\n", right_idx)
merged_on_index = pd.merge(left_idx, right_idx, left_index=True, right_index=True, how='inner')
print("\nMerged on Index:\n", merged_on_index)
# Output:
# Left with Key as Index:
# A B
# key
# K0 A0 B0
# K1 A1 B1
# K2 A2 B2
# K3 A3 B3
# Right with Key as Index:
# C D
# key
# K0 C0 D0
# K1 C1 D1
# K2 C2 D2
# K4 C4 D4
# Merged on Index:
# A B C D
# key
# K0 A0 B0 C0 D0
# K1 A1 B1 C1 D1
# K2 A2 B2 C2 D2
You can even mix index and column merges (left_index=True
, right_on='col'
or left_on='col'
, right_index=True
).
Handling Overlapping Column Names:
If both DataFrames have columns with the same name (that are not the join keys), merge
automatically adds suffixes (_x
, _y
by default) to distinguish them. You can customize these suffixes using the suffixes
argument.
left_overlap = pd.DataFrame({
'key': ['K0', 'K1'],
'Value': [1, 2]
})
right_overlap = pd.DataFrame({
'key': ['K0', 'K1'],
'Value': [3, 4]
})
merged_overlap = pd.merge(left_overlap, right_overlap, on='key')
print("\nMerged with Overlapping Column Names (default suffixes):\n", merged_overlap)
# Output:
# Merged with Overlapping Column Names (default suffixes):
# key Value_x Value_y
# 0 K0 1 3
# 1 K1 2 4
merged_custom_suffix = pd.merge(left_overlap, right_overlap, on='key', suffixes=('_left', '_right'))
print("\nMerged with Custom Suffixes:\n", merged_custom_suffix)
# Output:
# Merged with Custom Suffixes:
# key Value_left Value_right
# 0 K0 1 3
# 1 K1 2 4
DataFrame.join
Pandas DataFrames also have an instance method join
, which provides a convenient way to merge primarily based on indexes. It can also join on columns of the calling DataFrame to the index of the passed DataFrame. Under the hood, it often uses pd.merge
.
Using the index based DataFrames from before (left_idx
, right_idx
):
# Equivalent to pd.merge(left_idx, right_idx, left_index=True, right_index=True, how='inner')
joined_inner = left_idx.join(right_idx, how='inner')
print("\nInner Joined with DataFrame.join:\n", joined_inner)
# Output:
# Inner Joined with DataFrame.join:
# A B C D
# key
# K0 A0 B0 C0 D0
# K1 A1 B1 C1 D1
# K2 A2 B2 C2 D2
# Left join is the default for join()
joined_left_default = left_idx.join(right_idx) # how='left' is default
print("\nLeft Joined with DataFrame.join (default):\n", joined_left_default)
# Output:
# Left Joined with DataFrame.join (default):
# A B C D
# key
# K0 A0 B0 C0 D0
# K1 A1 B1 C1 D1
# K2 A2 B2 C2 D2
# K3 A3 B3 NaN NaN
# Outer join
joined_outer = left_idx.join(right_idx, how='outer')
print("\nOuter Joined with DataFrame.join:\n", joined_outer)
# Output:
# Outer Joined with DataFrame.join:
# A B C D
# key
# K0 A0 B0 C0 D0
# K1 A1 B1 C1 D1
# K2 A2 B2 C2 D2
# K3 A3 B3 NaN NaN
# K4 NaN NaN C4 D4
You can also join a DataFrame's column(s) with another DataFrame's index using the on
argument in join
.
# Use original 'left' DataFrame (with 'key' as a column)
# Join 'left' on its 'key' column with 'right_idx' (which has 'key' as index)
joined_on_col = left.join(right_idx, on='key', how='inner')
print("\nJoined DataFrame column to other DataFrame index:\n", joined_on_col)
# Output:
# Joined DataFrame column to other DataFrame index:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
join
also accepts the lsuffix
and rsuffix
arguments (similar to suffixes
in merge
) to handle overlapping column names when not joining on index.
concat
, merge
, and join
pd.concat
when you want to stack DataFrames vertically or horizontally without complex alignment based on column values. It's great for combining datasets with the same structure or adding columns side by side based on matching indexes.pd.merge
for the most flexibility in combining DataFrames based on common column values or indexes, similar to SQL joins. It offers explicit control over join types (inner
, outer
, left
, right
) and key specifications (on
, left_on
, right_on
, left_index
, right_index
). This is often the workhorse for combining relational data.DataFrame.join
as a convenient shorthand, primarily when joining based on indexes. It defaults to a left join and can be slightly cleaner syntactically than merge
for index to index or column to index joins.Mastering these combining operations is essential for preparing data that might initially reside in separate files or tables. By effectively concatenating, merging, and joining DataFrames, you can construct the unified datasets needed for further analysis and machine learning model training.
© 2025 ApX Machine Learning