Alright, let's put the concepts of combining DataFrames into practice. This section provides hands-on examples using pd.concat
, pd.merge
, and the .join()
method. We'll work with small, clear datasets to illustrate how each technique works.
First, make sure you have Pandas imported. We'll use the conventional alias pd
.
import pandas as pd
import numpy as np # Often used alongside Pandas
To demonstrate concatenation and merging, let's create a few simple DataFrames.
# DataFrames for concatenation examples
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2']},
index=[0, 1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5'],
'C': ['C3', 'C4', 'C5']},
index=[3, 4, 5])
df3 = pd.DataFrame({'D': ['D0', 'D1', 'D2'],
'E': ['E0', 'E1', 'E2'],
'F': ['F0', 'F1', 'F2']},
index=[0, 1, 2])
# DataFrames for merging/joining examples
left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'L_val': ['L0', 'L1', 'L2', 'L3']})
right_df = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
'R_val': ['R0', 'R1', 'R4', 'R5']})
# DataFrames for index-based joining
left_join_df = pd.DataFrame({'L_val': ['L0', 'L1', 'L2']},
index=pd.Index(['K0', 'K1', 'K2'], name='key'))
right_join_df = pd.DataFrame({'R_val': ['R0', 'R1', 'R4']},
index=pd.Index(['K0', 'K1', 'K4'], name='key'))
Let's look at our initial DataFrames:
print("--- df1 ---")
print(df1)
print("\n--- df2 ---")
print(df2)
print("\n--- df3 ---")
print(df3)
print("\n--- left_df ---")
print(left_df)
print("\n--- right_df ---")
print(right_df)
print("\n--- left_join_df ---")
print(left_join_df)
print("\n--- right_join_df ---")
print(right_join_df)
pd.concat
Concatenation is like stacking or gluing DataFrames together.
This is the default behavior of pd.concat
. It stacks DataFrames vertically, aligning columns by name.
# Concatenate df1 and df2 vertically (axis=0 is default)
vertical_concat = pd.concat([df1, df2])
print("--- Vertical Concatenation (Default) ---")
print(vertical_concat)
Notice how df2
is appended below df1
. The indices from the original DataFrames are preserved. If index uniqueness is important or you prefer a clean 0-based index, use ignore_index=True
.
# Concatenate df1 and df2, ignoring original index
vertical_concat_new_index = pd.concat([df1, df2], ignore_index=True)
print("\n--- Vertical Concatenation (ignore_index=True) ---")
print(vertical_concat_new_index)
To combine DataFrames side-by-side based on their index, use axis=1
.
# Concatenate df1 and df3 horizontally (axis=1)
horizontal_concat = pd.concat([df1, df3], axis=1)
print("--- Horizontal Concatenation (axis=1) ---")
print(horizontal_concat)
Here, df1
and df3
are combined based on matching index labels (0, 1, 2). If indices didn't align perfectly, Pandas would introduce NaN
values for missing matches (an outer join behavior along the index).
pd.merge
Merging combines DataFrames based on the values in common columns, similar to SQL joins.
An inner merge keeps only the rows where the key exists in both DataFrames.
# Perform an inner merge on the 'key' column
inner_merge = pd.merge(left_df, right_df, on='key', how='inner') # 'how=inner' is default
print("--- Inner Merge ---")
print(inner_merge)
Only keys 'K0' and 'K1', which are present in both left_df
and right_df
, appear in the result.
An outer merge keeps all rows from both DataFrames. Where a key doesn't exist in one of the DataFrames, NaN
values are introduced for the columns originating from that DataFrame.
# Perform an outer merge on the 'key' column
outer_merge = pd.merge(left_df, right_df, on='key', how='outer')
print("\n--- Outer Merge ---")
print(outer_merge)
Keys 'K0', 'K1', 'K2', 'K3', 'K4', and 'K5' are all present. Notice the NaN
values where keys are missing in one of the original DataFrames (e.g., 'K2' and 'K3' have NaN
for R_val
, 'K4' and 'K5' have NaN
for L_val
).
A left merge keeps all rows from the left DataFrame and includes matching rows from the right DataFrame. If a key from the left DataFrame doesn't exist in the right, NaN
is used for the right DataFrame's columns.
# Perform a left merge on the 'key' column
left_merge = pd.merge(left_df, right_df, on='key', how='left')
print("\n--- Left Merge ---")
print(left_merge)
All keys from left_df
('K0', 'K1', 'K2', 'K3') are present. 'K2' and 'K3' have NaN
for R_val
because they are not in right_df
. Keys only present in right_df
('K4', 'K5') are excluded.
A right merge is the opposite of a left merge. It keeps all rows from the right DataFrame and includes matching rows from the left.
# Perform a right merge on the 'key' column
right_merge = pd.merge(left_df, right_df, on='key', how='right')
print("\n--- Right Merge ---")
print(right_merge)
All keys from right_df
('K0', 'K1', 'K4', 'K5') are present. 'K4' and 'K5' have NaN
for L_val
. Keys only present in left_df
('K2', 'K3') are excluded.
To help visualize these joins, consider the keys as sets:
Visual representation of different join types based on keys 'K0'-'K5'. Filled blue circles indicate keys included in the result for each join type, assuming
left_df
keys are K0-K3 andright_df
keys are K0, K1, K4, K5.
If the key columns have different names in the DataFrames, use left_on
and right_on
.
# Temporarily rename 'key' in right_df for demonstration
right_df_renamed = right_df.rename(columns={'key': 'common_key'})
print("\n--- right_df_renamed ---")
print(right_df_renamed)
# Merge using left_on and right_on
merge_diff_names = pd.merge(left_df, right_df_renamed,
left_on='key', right_on='common_key',
how='inner')
print("\n--- Merge with Different Key Names (left_on, right_on) ---")
print(merge_diff_names)
The result includes both original key columns (key
from left_df
and common_key
from right_df_renamed
). You might want to drop one afterwards.
.join
The .join()
method is a convenient way to perform merges based on the DataFrame indices. It defaults to a left join.
# Perform a default (left) join on the index
left_index_join = left_join_df.join(right_join_df, how='left') # 'how=left' is default
print("--- Left Index Join (Default .join()) ---")
print(left_index_join)
This keeps all index values from left_join_df
('K0', 'K1', 'K2') and adds the corresponding R_val
from right_join_df
. 'K2' gets NaN
for R_val
as it's not in right_join_df
's index.
You can specify other join types using the how
parameter, just like with pd.merge
.
# Perform an outer join on the index using .join()
outer_index_join = left_join_df.join(right_join_df, how='outer')
print("\n--- Outer Index Join (.join(how='outer')) ---")
print(outer_index_join)
This includes all index values from both DataFrames ('K0', 'K1', 'K2', 'K4'), filling missing values with NaN
.
You can also use .join()
to join a DataFrame's index with a column in another DataFrame by setting the joining column as the index temporarily in the other DataFrame, or by using the on
parameter within .join()
.
# Join left_df (using 'key' column) with right_join_df (using its index)
# We can use the 'on' parameter in join to specify the column in left_df
column_index_join = left_df.join(right_join_df, on='key', how='inner')
print("\n--- Joining Column ('key' in left_df) to Index (right_join_df) ---")
print(column_index_join)
This practical session covered the primary ways to combine DataFrames in Pandas. Concatenation stacks data, while merging and joining combine data based on shared keys (in columns or indices) using different logical rules (inner, outer, left, right). Experiment with these techniques on your own datasets to solidify your understanding.
© 2025 ApX Machine Learning