Practice Combining DataFrames

Alright, let's put the concepts of combining DataFrames into practice. This section provides hands-on examples using pd.concat, pd.merge, and the .join() method. We'll work with small, clear datasets to illustrate how each technique works.

First, make sure you have Pandas imported. We'll use the conventional alias pd.

import pandas as pd
import numpy as np # Often used alongside Pandas

Preparing Sample DataFrames

To demonstrate concatenation and merging, let's create a few simple DataFrames.

# DataFrames for concatenation examples
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5']},
                   index=[3, 4, 5])

df3 = pd.DataFrame({'D': ['D0', 'D1', 'D2'],
                    'E': ['E0', 'E1', 'E2'],
                    'F': ['F0', 'F1', 'F2']},
                   index=[0, 1, 2])

# DataFrames for merging/joining examples
left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                        'L_val': ['L0', 'L1', 'L2', 'L3']})

right_df = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                         'R_val': ['R0', 'R1', 'R4', 'R5']})

# DataFrames for index-based joining
left_join_df = pd.DataFrame({'L_val': ['L0', 'L1', 'L2']},
                             index=pd.Index(['K0', 'K1', 'K2'], name='key'))

right_join_df = pd.DataFrame({'R_val': ['R0', 'R1', 'R4']},
                              index=pd.Index(['K0', 'K1', 'K4'], name='key'))

Let's look at our initial DataFrames:

print("--- df1 ---")
print(df1)
print("\n--- df2 ---")
print(df2)
print("\n--- df3 ---")
print(df3)
print("\n--- left_df ---")
print(left_df)
print("\n--- right_df ---")
print(right_df)
print("\n--- left_join_df ---")
print(left_join_df)
print("\n--- right_join_df ---")
print(right_join_df)

Practice: Concatenation with `pd.concat`

Concatenation is like stacking or gluing DataFrames together.

Vertical Concatenation (Stacking Rows)

This is the default behavior of pd.concat. It stacks DataFrames vertically, aligning columns by name.

# Concatenate df1 and df2 vertically (axis=0 is default)
vertical_concat = pd.concat([df1, df2])

print("--- Vertical Concatenation (Default) ---")
print(vertical_concat)

Notice how df2 is appended below df1. The indices from the original DataFrames are preserved. If index uniqueness is important or you prefer a clean 0-based index, use ignore_index=True.

# Concatenate df1 and df2, ignoring original index
vertical_concat_new_index = pd.concat([df1, df2], ignore_index=True)

print("\n--- Vertical Concatenation (ignore_index=True) ---")
print(vertical_concat_new_index)

Horizontal Concatenation (Adding Columns)

To combine DataFrames side-by-side based on their index, use axis=1.

# Concatenate df1 and df3 horizontally (axis=1)
horizontal_concat = pd.concat([df1, df3], axis=1)

print("--- Horizontal Concatenation (axis=1) ---")
print(horizontal_concat)

Here, df1 and df3 are combined based on matching index labels (0, 1, 2). If indices didn't align perfectly, Pandas would introduce NaN values for missing matches (an outer join behavior along the index).

Practice: Merging with `pd.merge`

Merging combines DataFrames based on the values in common columns, similar to SQL joins.

Inner Merge (Default)

An inner merge keeps only the rows where the key exists in both DataFrames.

# Perform an inner merge on the column
inner_merge = pd.merge(left_df, right_df, on='key', how='inner') # 'how=inner' is default

print("--- Inner Merge ---")
print(inner_merge)

Only keys 'K0' and 'K1', which are present in both left_df and right_df, appear in the result.

Outer Merge

An outer merge keeps all rows from both DataFrames. Where a key doesn't exist in one of the DataFrames, NaN values are introduced for the columns originating from that DataFrame.

# Perform an outer merge on the column
outer_merge = pd.merge(left_df, right_df, on='key', how='outer')

print("\n--- Outer Merge ---")
print(outer_merge)

Keys 'K0', 'K1', 'K2', 'K3', 'K4', and 'K5' are all present. Notice the NaN values where keys are missing in one of the original DataFrames (e.g., 'K2' and 'K3' have NaN for R_val, 'K4' and 'K5' have NaN for L_val).

Left Merge

A left merge keeps all rows from the left DataFrame and includes matching rows from the right DataFrame. If a key from the left DataFrame doesn't exist in the right, NaN is used for the right DataFrame's columns.

# Perform a left merge on the column
left_merge = pd.merge(left_df, right_df, on='key', how='left')

print("\n--- Left Merge ---")
print(left_merge)

All keys from left_df ('K0', 'K1', 'K2', 'K3') are present. 'K2' and 'K3' have NaN for R_val because they are not in right_df. Keys only present in right_df ('K4', 'K5') are excluded.

Right Merge

A right merge is the opposite of a left merge. It keeps all rows from the right DataFrame and includes matching rows from the left.

# Perform a right merge on the column
right_merge = pd.merge(left_df, right_df, on='key', how='right')

print("\n--- Right Merge ---")
print(right_merge)

All keys from right_df ('K0', 'K1', 'K4', 'K5') are present. 'K4' and 'K5' have NaN for L_val. Keys only present in left_df ('K2', 'K3') are excluded.

To help visualize these joins, consider the keys as sets:

Visual representation of different join types based on keys 'K0'-'K5'. Filled blue circles indicate keys included in the result for each join type, assuming left_df keys are K0-K3 and right_df keys are K0, K1, K4, K5.

Merging on Different Column Names

If the columns have different names in the DataFrames, use left_on and right_on.

# Temporarily rename 'key' in right_df for demonstration
right_df_renamed = right_df.rename(columns={'key': 'common_key'})

print("\n--- right_df_renamed ---")
print(right_df_renamed)

# Merge using left_on and right_on
merge_diff_names = pd.merge(left_df, right_df_renamed,
                            left_on='key', right_on='common_key',
                            how='inner')

print("\n--- Merge with Different Names (left_on, right_on) ---")
print(merge_diff_names)

The result includes both original columns (key from left_df and common_key from right_df_renamed). You might want to drop one afterwards.

Practice: Index-Based Joining with `.join`

The .join() method is a convenient way to perform merges based on the DataFrame indices. It defaults to a left join.

# Perform a default (left) join on the index
left_index_join = left_join_df.join(right_join_df, how='left') # 'how=left' is default

print("--- Left Index Join (Default .join()) ---")
print(left_index_join)

This keeps all index values from left_join_df ('K0', 'K1', 'K2') and adds the corresponding R_val from right_join_df. 'K2' gets NaN for R_val as it's not in right_join_df's index.

You can specify other join types using the how parameter, just like with pd.merge.

# Perform an outer join on the index using .join()
outer_index_join = left_join_df.join(right_join_df, how='outer')

print("\n--- Outer Index Join (.join(how='outer')) ---")
print(outer_index_join)

This includes all index values from both DataFrames ('K0', 'K1', 'K2', 'K4'), filling missing values with NaN.

You can also use .join() to join a DataFrame's index with a column in another DataFrame by setting the joining column as the index temporarily in the other DataFrame, or by using the on parameter within .join().

# Join left_df (using column) with right_join_df (using its index)
# We can use the 'on' parameter in join to specify the column in left_df
column_index_join = left_df.join(right_join_df, on='key', how='inner')

print("\n--- Joining Column to Index (right_join_df) ---")
print(column_index_join)

This practical session covered the primary ways to combine DataFrames in Pandas. Concatenation stacks data, while merging and joining combine data based on shared keys (in columns or indices) using different logical rules (inner, outer, left, right). Experiment with these techniques on your own datasets to solidify your understanding.

Was this section helpful?

Hands-on practical: Combining Datasets