While the Pandas Series represents a single column of data, the DataFrame is the primary structure you'll use for most tabular data analysis. Think of it as a spreadsheet or an SQL table within your Python environment. It's a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
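As a quick illustration of those three properties, here is a minimal sketch on a small, hypothetical two-row table (the names `df_demo` and `Passed` are made up for this example):

```python
import pandas as pd

# Hypothetical two-row table, used only to illustrate the description above
df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 92]})

# Two-dimensional: one axis for rows, one for columns
print(df_demo.ndim)

# Size-mutable: a new column can be added in place
df_demo['Passed'] = df_demo['Score'] >= 80
print(df_demo.shape)

# Heterogeneous: each column carries its own data type
print(df_demo.dtypes)
```

Each column is stored with a single data type, but different columns in the same DataFrame are free to differ.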
Let's look at the common ways to construct a DataFrame. You'll typically create them from existing data structures like Python dictionaries or NumPy arrays, or by reading data from files (which is covered in the next chapter).
One of the most frequent methods is using a Python dictionary where the keys are the desired column names and the values are lists (or NumPy arrays) containing the data for each column. All lists or arrays used as values must have the same length: each one becomes a column, and the elements at the same position across them form a row. If the lengths differ, Pandas raises a ValueError.
import pandas as pd
import numpy as np
# Data as a dictionary of lists
data = {
    'StudentID': ['S001', 'S002', 'S003', 'S004'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 88]
}
# Create DataFrame
df_from_dict_list = pd.DataFrame(data)
print(df_from_dict_list)
This code produces the following DataFrame:
  StudentID     Name  Score
0      S001    Alice     85
1      S002      Bob     92
2      S003  Charlie     78
3      S004    David     88
Notice how Pandas automatically assigned a default integer index (0, 1, 2, 3) to the rows. The dictionary keys (StudentID, Name, Score) became the column labels.
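The equal-length requirement is easy to check for yourself. The sketch below uses a hypothetical `bad_data` dictionary in which one list is deliberately too short, and catches the resulting error rather than letting it stop the script:

```python
import pandas as pd

# Hypothetical data: 'Score' has three values, the other columns have four
bad_data = {
    'StudentID': ['S001', 'S002', 'S003', 'S004'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78],
}

error_message = None
try:
    pd.DataFrame(bad_data)
except ValueError as exc:
    # Pandas refuses to build a DataFrame from columns of unequal length
    error_message = str(exc)
    print(f"ValueError: {exc}")
```

The exact wording of the message may vary between Pandas versions, so it is safer to rely on the exception type than on the text.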
You can achieve the same result using NumPy arrays as the dictionary values:
# Data as a dictionary of NumPy arrays
data_np = {
    'StudentID': np.array(['S001', 'S002', 'S003', 'S004']),
    'Name': np.array(['Alice', 'Bob', 'Charlie', 'David']),
    'Score': np.array([85, 92, 78, 88])
}
# Create DataFrame
df_from_dict_np = pd.DataFrame(data_np)
print(df_from_dict_np)
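When the input is a dictionary, the `columns` and `index` arguments of the constructor can also be used to pick a subset of keys or to replace the default integer index. The following is a small sketch; the row labels `r1` to `r4` are invented for illustration:

```python
import numpy as np
import pandas as pd

data_np = {
    'StudentID': np.array(['S001', 'S002', 'S003', 'S004']),
    'Name': np.array(['Alice', 'Bob', 'Charlie', 'David']),
    'Score': np.array([85, 92, 78, 88]),
}

# Keep only two of the dictionary's keys, in a chosen order
df_subset = pd.DataFrame(data_np, columns=['Name', 'Score'])

# Replace the default 0..3 index with hypothetical row labels
df_labeled = pd.DataFrame(data_np, index=['r1', 'r2', 'r3', 'r4'])

print(df_subset)
print(df_labeled.loc['r2', 'Name'])
```

The number of index labels has to match the length of the arrays, just as the arrays have to match each other.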
Another common pattern involves a list where each element is a dictionary. In this case, each dictionary represents a row in the resulting DataFrame, and the keys within the dictionaries become the column labels. Pandas also handles dictionaries that are missing some keys: it fills those positions with NaN (Not a Number), Pandas' default marker for missing data.
# Data as a list of dictionaries
list_of_dicts = [
    {'StudentID': 'S001', 'Name': 'Alice', 'Score': 85},
    {'StudentID': 'S002', 'Name': 'Bob', 'Age': 21},  # Note: 'Score' is missing, 'Age' is extra
    {'StudentID': 'S003', 'Name': 'Charlie', 'Score': 78, 'Age': 22},
    {'StudentID': 'S004', 'Name': 'David', 'Score': 88}
]
# Create DataFrame
df_from_list_dict = pd.DataFrame(list_of_dicts)
print(df_from_list_dict)
The output demonstrates how Pandas handles the inconsistent keys:
  StudentID     Name  Score   Age
0      S001    Alice   85.0   NaN
1      S002      Bob    NaN  21.0
2      S003  Charlie   78.0  22.0
3      S004    David   88.0   NaN
Pandas inferred all possible column names (StudentID, Name, Score, Age) from the keys present in the dictionaries. Where a key was missing for a specific row (dictionary), NaN was inserted. Also, notice that the Score and Age columns were automatically assigned a floating-point data type (float64) because NaN is technically a float value.
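You can confirm the float conversion by inspecting the column's data type. If whole-number scores should stay integers despite the gaps, one option is Pandas' nullable `Int64` type. The sketch below uses a shortened, hypothetical version of the list above:

```python
import pandas as pd

# Shortened, hypothetical version of the list of dictionaries
rows = [
    {'StudentID': 'S001', 'Score': 85},
    {'StudentID': 'S002'},               # 'Score' is missing here
    {'StudentID': 'S003', 'Score': 78},
]
df = pd.DataFrame(rows)

# The missing value forces the column to float64
print(df['Score'].dtype)

# Count the missing entries
n_missing = int(df['Score'].isna().sum())
print(n_missing)

# Nullable integer type: keeps whole numbers and marks the gap as <NA>
score_nullable = df['Score'].astype('Int64')
print(score_nullable)
```

Note the capital "I" in `'Int64'`: the lowercase `'int64'` is the ordinary NumPy integer type, which cannot hold missing values.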
If your data already exists as a 2D NumPy array, you can directly convert it into a DataFrame. By default, Pandas assigns integer labels to both columns and rows (the index). However, you can explicitly provide labels using the columns and index arguments.
# A 2D NumPy array
np_array = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
# Create DataFrame with default labels
df_from_np_default = pd.DataFrame(np_array)
print("DataFrame with default labels:")
print(df_from_np_default)
# Create DataFrame with custom labels
df_from_np_custom = pd.DataFrame(
    np_array,
    columns=['Col_A', 'Col_B', 'Col_C'],
    index=['Row_X', 'Row_Y', 'Row_Z']
)
print("\nDataFrame with custom labels:")
print(df_from_np_custom)
Output:
DataFrame with default labels:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

DataFrame with custom labels:
       Col_A  Col_B  Col_C
Row_X      1      2      3
Row_Y      4      5      6
Row_Z      7      8      9
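Custom labels are useful because you can then select data by name instead of by position. A brief sketch, rebuilding the labeled DataFrame from above under the shorter name `df`:

```python
import numpy as np
import pandas as pd

np_array = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
df = pd.DataFrame(
    np_array,
    columns=['Col_A', 'Col_B', 'Col_C'],
    index=['Row_X', 'Row_Y', 'Row_Z'],
)

# Shape is (rows, columns)
print(df.shape)                    # (3, 3)

# Select a single value by row label and column label
print(df.loc['Row_Y', 'Col_B'])    # 5

# Select a whole column by its label
print(df['Col_C'].tolist())        # [3, 6, 9]
```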
Figure: How dictionary keys and lists map to the columns and data within a Pandas DataFrame. The row index is generated automatically if not specified.
You can also construct a DataFrame from a dictionary where the values are Pandas Series objects. This works similarly to using a dictionary of lists, except that Pandas aligns the data on the index labels of each Series rather than by position.
# Create two Series
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([15, 25, 35, 45], index=['a', 'b', 'c', 'd']) # Note different index
# Create DataFrame from a dictionary of Series
df_from_series = pd.DataFrame({'Col1': s1, 'Col2': s2})
print(df_from_series)
Output:
   Col1  Col2
a  10.0    15
b  20.0    25
c  30.0    35
d   NaN    45
Pandas automatically aligns the data based on the index labels ('a', 'b', 'c', 'd'). Since s1 didn't have an index label 'd', NaN was introduced in Col1 for that row, which is also why Col1 is displayed as floats while Col2 remains integers.
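Alignment works on labels, not on position. To make that visible, the sketch below pairs `s1` with a hypothetical Series `s3` that has the same labels in a different order:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Hypothetical Series: same labels as s1, but listed in a different order
s3 = pd.Series([300, 100, 200], index=['c', 'a', 'b'])

df_aligned = pd.DataFrame({'Col1': s1, 'Col3': s3})
print(df_aligned)

# Each value of s3 lands in the row matching its label
print(df_aligned.loc['a', 'Col3'])   # 100
print(df_aligned.loc['c', 'Col3'])   # 300
```

If `s3` were a plain list instead of a Series, its values would be placed by position and 300 would end up in the first row.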
These methods provide flexible ways to create DataFrames programmatically. Once you have a DataFrame, the next step is usually to inspect its properties and contents, which we will cover next. Remember that loading data directly from files like CSV or Excel is also a very common way to get data into a DataFrame, explored in Chapter 6.
© 2025 ApX Machine Learning