After cleaning and reshaping your data, arranging it in a meaningful order is often a necessary step. Sorting allows you to view data from lowest to highest value, alphabetically, or based on custom criteria, making patterns easier to spot and specific entries simpler to locate. Pandas provides flexible methods to sort data based on either the index labels or the actual values within columns.

Sorting by Index

Sometimes, you need to arrange your data based on its row or column labels (the index). This is particularly useful if the index represents a time series, an ordered category, or simply needs to be in alphabetical or numerical order. The sort_index() method handles this.

Let's start with a simple DataFrame:

import pandas as pd
import numpy as np

data = {'col_b': [4, 7, 1, 8, 5],
        'col_a': ['apple', 'banana', 'orange', 'apple', 'banana'],
        'col_c': [10.0, np.nan, 20.0, 10.0, 15.0]}
df = pd.DataFrame(data, index=['R3', 'R1', 'R5', 'R2', 'R4'])

print("Original DataFrame:")
print(df)

Original DataFrame:
    col_b   col_a  col_c
R3      4   apple   10.0
R1      7  banana    NaN
R5      1  orange   20.0
R2      8   apple   10.0
R4      5  banana   15.0

Notice the row index (R3, R1, R5, etc.) is not in alphabetical order. To sort the DataFrame rows by their index labels:

df_sorted_by_index = df.sort_index()

print("\nDataFrame sorted by row index (ascending):")
print(df_sorted_by_index)

DataFrame sorted by row index (ascending):
    col_b   col_a  col_c
R1      7  banana    NaN
R2      8   apple   10.0
R3      4   apple   10.0
R4      5  banana   15.0
R5      1  orange   20.0

By default, sort_index() sorts in ascending order. To sort in descending order, use the ascending=False argument:

df_sorted_by_index_desc = df.sort_index(ascending=False)

print("\nDataFrame sorted by row index (descending):")
print(df_sorted_by_index_desc)

DataFrame sorted by row index (descending):
    col_b   col_a  col_c
R5      1  orange   20.0
R4      5  banana   15.0
R3      4   apple   10.0
R2      8   apple   10.0
R1      7  banana    NaN
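
One caveat with string labels like these: they are compared character by character, so a label such as R10 would sort before R2. The sketch below uses the key argument of sort_index() (available in pandas 1.1 and later) to sort by the numeric part of each label instead. The labels_df frame and its labels are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical frame whose string labels contain multi-digit numbers
labels_df = pd.DataFrame({'value': [1, 2, 3]}, index=['R10', 'R2', 'R1'])

# Default behaviour: labels are compared as strings, so 'R10' precedes 'R2'
lexicographic = labels_df.sort_index()

# key receives the whole Index and must return an Index-like of the same length;
# here the leading 'R' is stripped and the remainder is compared as an integer
natural = labels_df.sort_index(key=lambda idx: idx.str[1:].astype(int))

print(list(lexicographic.index))
print(list(natural.index))
```

The key function only affects the comparison; the labels in the result are the original strings.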

You can also sort by the column index (the column names) by specifying axis=1:

df_sorted_by_columns = df.sort_index(axis=1)

print("\nDataFrame sorted by column index (ascending):")
print(df_sorted_by_columns)

DataFrame sorted by column index (ascending):
     col_a  col_b  col_c
R3   apple      4   10.0
R1  banana      7    NaN
R5  orange      1   20.0
R2   apple      8   10.0
R4  banana      5   15.0

Like many Pandas operations, sort_index() returns a new sorted DataFrame by default. If you want to modify the original DataFrame directly, use the inplace=True argument. Be cautious when using inplace=True, as it overwrites your original data structure.

df_copy = df.copy() # Work on a copy to preserve original df
df_copy.sort_index(inplace=True)

print("\nCopy after in-place sort by index:")
print(df_copy)

Copy after in-place sort by index:
    col_b   col_a  col_c
R1      7  banana    NaN
R2      8   apple   10.0
R3      4   apple   10.0
R4      5  banana   15.0
R5      1  orange   20.0
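
If you would rather avoid inplace=True, a common alternative is to assign the result back to a name. The short sketch below, using a made-up scores frame, shows that sort_index() without inplace leaves the original untouched until you reassign it.

```python
import pandas as pd

# Hypothetical frame with an unsorted index
scores = pd.DataFrame({'score': [3, 1, 2]}, index=['c', 'a', 'b'])

# Without inplace=True the call returns a new, sorted DataFrame
scores_sorted = scores.sort_index()

# The original keeps its row order
order_before_reassign = list(scores.index)
print(order_before_reassign)
print(list(scores_sorted.index))

# Reassigning the name has the same visible effect as an in-place sort
scores = scores.sort_index()
print(list(scores.index))
```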

Sorting by Values

More frequently, you'll want to sort your DataFrame based on the values in one or more columns. The sort_values() method handles this. Its by argument, which is required for a DataFrame, specifies the column name (or list of column names) to sort by.

Let's sort our original DataFrame df based on the values in col_b:

df_sorted_by_col_b = df.sort_values(by='col_b')

print("\nDataFrame sorted by 'col_b' (ascending):")
print(df_sorted_by_col_b)

DataFrame sorted by 'col_b' (ascending):
    col_b   col_a  col_c
R5      1  orange   20.0
R3      4   apple   10.0
R4      5  banana   15.0
R1      7  banana    NaN
R2      8   apple   10.0

Again, the default sorting order is ascending. Use ascending=False for descending order:

df_sorted_by_col_b_desc = df.sort_values(by='col_b', ascending=False)

print("\nDataFrame sorted by 'col_b' (descending):")
print(df_sorted_by_col_b_desc)

DataFrame sorted by 'col_b' (descending):
    col_b   col_a  col_c
R2      8   apple   10.0
R1      7  banana    NaN
R4      5  banana   15.0
R3      4   apple   10.0
R5      1  orange   20.0

You can sort by multiple columns by passing a list of column names to the by argument. Pandas will sort by the first column in the list, then use the second column to break ties, and so on.

Let's sort by col_a (alphabetically) and then by col_b (numerically) for rows with the same col_a value:

df_sorted_by_multi = df.sort_values(by=['col_a', 'col_b'])

print("\nDataFrame sorted by 'col_a' then 'col_b' (ascending):")
print(df_sorted_by_multi)

DataFrame sorted by 'col_a' then 'col_b' (ascending):
    col_b   col_a  col_c
R3      4   apple   10.0
R2      8   apple   10.0
R4      5  banana   15.0
R1      7  banana    NaN
R5      1  orange   20.0

Notice how rows with 'apple' are together, sorted by col_b (4 then 8), and rows with 'banana' are together, sorted by col_b (5 then 7).

You can also specify different sorting orders for each column when sorting by multiple columns. Pass a list of booleans to the ascending argument, corresponding to the list passed to by.

Let's sort by col_a ascending and col_b descending:

df_sorted_by_multi_mixed = df.sort_values(by=['col_a', 'col_b'], ascending=[True, False])

print("\nDataFrame sorted by 'col_a' (asc) then 'col_b' (desc):")
print(df_sorted_by_multi_mixed)

DataFrame sorted by 'col_a' (asc) then 'col_b' (desc):
    col_b   col_a  col_c
R2      8   apple   10.0
R3      4   apple   10.0
R1      7  banana    NaN
R4      5  banana   15.0
R5      1  orange   20.0

Now, for 'apple', the row with col_b=8 comes before the row with col_b=4. For 'banana', the row with col_b=7 comes before col_b=5.
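
Sorting on custom criteria, rather than on the raw column contents, is possible through the key argument of sort_values() (pandas 1.1 and later). As a minimal sketch, assuming a made-up fruits frame with mixed-case names, the example below compares the default string ordering with a case-insensitive one.

```python
import pandas as pd

# Hypothetical data with mixed-case strings
fruits = pd.DataFrame({'name': ['banana', 'apple', 'Cherry'],
                       'qty': [3, 5, 2]})

# Default string ordering places uppercase letters before lowercase ones
default_order = fruits.sort_values(by='name')

# key is applied to each column named in `by`; it receives a Series
# and must return a Series of the same length to compare instead
case_insensitive = fruits.sort_values(by='name', key=lambda col: col.str.lower())

print(list(default_order['name']))
print(list(case_insensitive['name']))
```

The key function changes only how rows are compared; the values stored in the column are left as they were.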

Handling Missing Values During Sorting

What happens to missing values (NaN) when sorting? By default, sort_values() places NaN values at the end of the sorted output, regardless of whether the sort is ascending or descending. You can control this behavior using the na_position argument, which accepts either 'first' or 'last'.

Let's sort by col_c (which contains a NaN) and explicitly place the NaN first:

df_sorted_nan_first = df.sort_values(by='col_c', na_position='first')

print("\nDataFrame sorted by 'col_c', NaN first:")
print(df_sorted_nan_first)

DataFrame sorted by 'col_c', NaN first:
    col_b   col_a  col_c
R1      7  banana    NaN
R3      4   apple   10.0
R2      8   apple   10.0
R4      5  banana   15.0
R5      1  orange   20.0
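
To illustrate the point that NaN placement does not depend on sort direction, here is a small sketch on a made-up readings frame. It sorts in descending order twice: once with the default na_position and once with na_position='first'.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and no ties
readings = pd.DataFrame({'temp': [10.0, np.nan, 20.0, 15.0]},
                        index=['a', 'b', 'c', 'd'])

# Descending sort with the default na_position='last': NaN still ends up last
desc_default = readings.sort_values(by='temp', ascending=False)

# Descending sort with the NaN row moved to the top
desc_nan_first = readings.sort_values(by='temp', ascending=False,
                                      na_position='first')

print(list(desc_default.index))
print(list(desc_nan_first.index))
```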

As with sort_index(), sort_values() also accepts the inplace=True argument to modify the DataFrame directly.
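
As the outputs above show, sort_values() keeps each row's original index label. If a fresh 0-to-n-1 index is preferable after sorting, the ignore_index=True argument (pandas 1.0 and later) provides it. The prices frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical data with string index labels
prices = pd.DataFrame({'price': [30, 10, 20]}, index=['x', 'y', 'z'])

# Default: rows move, but each keeps its original label
kept_labels = prices.sort_values(by='price')

# ignore_index=True discards the old labels and renumbers the rows
renumbered = prices.sort_values(by='price', ignore_index=True)

print(list(kept_labels.index))
print(list(renumbered.index))
```

Whether to renumber depends on whether the labels carry meaning; for an index that identifies rows, keeping the labels is usually the safer choice.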

Sorting is a fundamental operation for organizing and understanding your data. Whether arranging rows by index labels or ordering them based on column contents, the sort_index() and sort_values() methods provide the necessary tools for bringing structure to your DataFrames.