One of the most straightforward ways to combine DataFrames is by concatenation. Think of it like stacking pieces of paper (your DataFrames) either one on top of the other (appending rows) or side-by-side (adding columns). Pandas provides the pd.concat()
function for this purpose.
The pd.concat()
function takes a list or sequence of Series or DataFrame objects as its primary argument and intelligently sticks them together.
By default, pd.concat()
stacks DataFrames vertically, meaning it appends rows from the second DataFrame to the end of the first, the third to the end of the second, and so on. This happens along axis=0
.
Let's create two simple DataFrames to illustrate:
import pandas as pd
# Create the first DataFrame
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=[0, 1])
# Create the second DataFrame
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']},
index=[2, 3])
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Concatenate along rows (default behavior)
result_rows = pd.concat([df1, df2])
print("\nConcatenated DataFrame (Rows):")
print(result_rows)
Output:
DataFrame 1:
A B
0 A0 B0
1 A1 B1
DataFrame 2:
A B
2 A2 B2
3 A3 B3
Concatenated DataFrame (Rows):
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
As you can see, df2
's rows were appended directly below df1
's rows. The columns were aligned automatically because both DataFrames had the same column names ('A' and 'B').
Notice the index in the result_rows
DataFrame: [0, 1, 2, 3]
. In this case, the original indices were unique and combined nicely. However, what if the original DataFrames had overlapping index labels?
Let's modify df2
to have overlapping indices with df1
:
# Create df1 again
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=[0, 1])
# Create df2 with overlapping index
df2_overlap = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']},
index=[1, 2]) # Index 1 overlaps with df1
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2 (Overlapping Index):")
print(df2_overlap)
# Concatenate with overlapping indices
result_overlap = pd.concat([df1, df2_overlap])
print("\nConcatenated (Overlapping Index):")
print(result_overlap)
print("\nIndex of the result:")
print(result_overlap.index)
Output:
DataFrame 1:
A B
0 A0 B0
1 A1 B1
DataFrame 2 (Overlapping Index):
A B
1 A2 B2
2 A3 B3
Concatenated (Overlapping Index):
A B
0 A0 B0
1 A1 B1
1 A2 B2 # Duplicate index label 1
2 A3 B3
Index of the result:
Index([0, 1, 1, 2], dtype='int64')
By default, pd.concat()
preserves the original indices, even if this results in duplicates (like index 1
in the example above). While Pandas allows duplicate indices, they can sometimes make selecting data by label (.loc
) ambiguous or lead to unexpected behavior in later operations.
If you don't need to preserve the original indices and prefer a clean, unique index for the resulting DataFrame, you can use the ignore_index=True
argument:
# Concatenate with overlapping indices, ignoring original index
result_ignore_index = pd.concat([df1, df2_overlap], ignore_index=True)
print("\nConcatenated (Ignoring Index):")
print(result_ignore_index)
print("\nIndex of the result (Ignored):")
print(result_ignore_index.index)
Output:
Concatenated (Ignoring Index):
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
Index of the result (Ignored):
RangeIndex(start=0, stop=4, step=1)
Setting ignore_index=True
discards the original indices and assigns a new default integer index (a RangeIndex
) to the concatenated DataFrame.
You can also concatenate DataFrames side-by-side by specifying axis=1
. This aligns DataFrames based on their index labels and places their columns next to each other.
Let's create two DataFrames with the same index but different columns:
# Create df1 again
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=[0, 1])
# Create df3 with different columns but same index
df3 = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=[0, 1])
print("DataFrame 1:")
print(df1)
print("\nDataFrame 3:")
print(df3)
# Concatenate along columns (axis=1)
result_cols = pd.concat([df1, df3], axis=1)
print("\nConcatenated DataFrame (Columns):")
print(result_cols)
Output:
DataFrame 1:
A B
0 A0 B0
1 A1 B1
DataFrame 3:
C D
0 C0 D0
1 C1 D1
Concatenated DataFrame (Columns):
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
Here, pd.concat
used the index (0
and 1
) to align the rows and placed the columns of df3
next to the columns of df1
.
Below is a diagram illustrating the difference between concatenating along axis=0
(rows) and axis=1
(columns).
Diagram illustrating vertical (axis=0) and horizontal (axis=1) concatenation.
What happens when concatenating DataFrames that don't perfectly align on the axis not being concatenated? For example, concatenating side-by-side (axis=1
) when the DataFrames have different row indices, or stacking vertically (axis=0
) when they have different columns.
This is controlled by the join
parameter, which works similarly to database joins:
join='outer'
(Default): Takes the union of the indices (or columns). It keeps all labels from both DataFrames. If a label exists in one DataFrame but not the other, the missing values are filled with NaN
(Not a Number).join='inner'
: Takes the intersection of the indices (or columns). It keeps only the labels that are present in both DataFrames.Let's see this with column concatenation (axis=1
) where the DataFrames have different indices:
# df1 has index [0, 1]
print("DataFrame 1:")
print(df1)
# Create df4 with different index [1, 2]
df4 = pd.DataFrame({'C': ['C1', 'C2'],
'D': ['D1', 'D2']},
index=[1, 2])
print("\nDataFrame 4:")
print(df4)
# Concatenate columns with outer join (default)
result_outer = pd.concat([df1, df4], axis=1, join='outer') # or just pd.concat([df1, df4], axis=1)
print("\nConcatenated Columns (Outer Join):")
print(result_outer)
# Concatenate columns with inner join
result_inner = pd.concat([df1, df4], axis=1, join='inner')
print("\nConcatenated Columns (Inner Join):")
print(result_inner)
Output:
DataFrame 1:
A B
0 A0 B0
1 A1 B1
DataFrame 4:
C D
1 C1 D1
2 C2 D2
Concatenated Columns (Outer Join):
A B C D
0 A0 B0 NaN NaN
1 A1 B1 C1 D1
2 NaN NaN C2 D2
Concatenated Columns (Inner Join):
A B C D
1 A1 B1 C1 D1
0
, 1
, and 2
(the union of [0, 1]
and [1, 2]
). Since index 0
only exists in df1
, the 'C' and 'D' columns have NaN
for that row. Similarly, since index 2
only exists in df4
, the 'A' and 'B' columns have NaN
for that row. Index 1
exists in both, so all columns have values.1
, as it's the only index label present in both df1
and df4
.The same join
logic applies when concatenating rows (axis=0
) using DataFrames with different column names. An outer join would include all columns from both DataFrames, filling missing values with NaN
, while an inner join would only keep columns common to both.
Concatenation using pd.concat
is a fundamental tool for assembling larger datasets from smaller pieces. Understanding how to control the axis, handle indices, and manage alignment with joins is important for combining your data correctly.
© 2025 ApX Machine Learning