Okay, you've successfully created your first Pandas Series and DataFrames. That's a great start! But just creating them isn't enough. Before you start manipulating or analyzing data, you need to understand what you're working with. How large is the DataFrame? What kind of data does each column hold? Are there missing values? Pandas provides several convenient attributes and methods to quickly inspect your DataFrame and get answers to these questions. Let's look at the most common ones.
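The examples in this section assume a DataFrame named df is already loaded. The sample outputs shown later come from a 150-row flower-measurements dataset; if you want to follow along without that file, here is a small illustrative stand-in with the same columns (the values below are made up, so your numbers will differ):
import numpy as np
import pandas as pd

# A tiny illustrative DataFrame with the same columns as the examples below.
# The values are invented; the section's sample outputs come from a 150-row dataset.
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
    'sepal_width':  [3.5, 3.0, 3.3, 2.7],
    'petal_length': [1.4, np.nan, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 2.5, 1.9],
    'species':      ['setosa', 'setosa', 'virginica', 'virginica'],
})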
Often, you don't need to see the entire dataset at once, especially if it has thousands or millions of rows. The head() and tail() methods are perfect for getting a quick glimpse of your data's structure and content.
- head(): Returns the first n rows of the DataFrame. By default, it returns the first 5 rows.
- tail(): Returns the last n rows of the DataFrame. By default, it returns the last 5 rows.
# Assuming 'df' is your DataFrame
# Display the first 5 rows
print(df.head())
# Display the first 3 rows
print(df.head(n=3)) # or df.head(3)
# Display the last 5 rows
print(df.tail())
# Display the last 2 rows
print(df.tail(n=2)) # or df.tail(2)
Using head() and tail() is useful for verifying that data loaded correctly and for getting an initial feel for the column names and the types of values they contain.
To find out exactly how many rows and columns your DataFrame has, use the shape attribute. It doesn't require parentheses () because it's an attribute (a characteristic of the object) rather than a method (an action).
# Get the dimensions (rows, columns)
dimensions = df.shape
print(f"DataFrame dimensions: {dimensions}")
print(f"Number of rows: {dimensions[0]}")
print(f"Number of columns: {dimensions[1]}")
The shape attribute returns a tuple where the first element is the number of rows and the second is the number of columns. Knowing the dimensions is fundamental for understanding the scale of your dataset.
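Because shape is just a tuple, you can also unpack it directly into two variables. A minimal sketch (the variable names are arbitrary):
# Unpack the (rows, columns) tuple into separate variables
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")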
The info() method provides a concise summary of a DataFrame. This is one of the most useful inspection methods. It tells you:
- The number of rows (entries) in the index.
- The number of columns and each column's name.
- The count of non-null values in each column.
- The data type (dtype) of each column.
- The approximate memory usage of the DataFrame.
# Display a concise summary of the DataFrame
df.info()
Running df.info() might produce output like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  148 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Look closely at the "Non-Null Count" column. Comparing this count to the total number of entries (150 in this example) is a quick way to spot columns with missing data (like petal_length, which has 148 non-null values, indicating 2 missing values). It also clearly shows the data type Pandas inferred for each column (float64 for numbers with decimals, object often for strings).
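If you want those missing-value counts as numbers rather than reading them off the info() output, one option is to compare the total row count with the per-column non-null counts. A short sketch, not part of the output above:
# Non-null count per column, returned as a Series
non_null_counts = df.count()
# Subtracting from the total number of rows gives the missing count per column
missing_per_column = len(df) - non_null_counts
print(missing_per_column)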
For columns containing numerical data, the describe() method calculates several common summary statistics. This gives you a quick quantitative overview of the distribution of your data.
# Generate descriptive statistics for numerical columns
summary_stats = df.describe()
print(summary_stats)
The output typically includes:
- count: The number of non-null values.
- mean: The average value.
- std: The standard deviation, measuring the spread of the data.
- min: The minimum value.
- 25%: The first quartile (25th percentile).
- 50%: The median, or second quartile.
- 75%: The third quartile.
- max: The maximum value.
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    148.000000   150.000000
mean       5.843333     3.057333      3.758108     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
By default, describe() only includes numerical columns. You can modify its behavior to include other types:
- df.describe(include='object'): Shows summary statistics for columns with object dtype (often strings). These include the count, the number of unique values (unique), the most frequent value (top), and its frequency (freq).
- df.describe(include='all'): Attempts to include statistics for all columns, mixing numerical and object summaries where appropriate.
To get just the names of the columns, use the columns attribute.
# Get the column labels
column_names = df.columns
print(column_names)
This returns an Index object containing the column names. This is useful if you need to check the exact spelling of a column name or iterate through the columns programmatically.
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
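For example, here is a quick sketch of both uses (the 'species' name matches the sample output above):
# Check whether a column exists before using it
if 'species' in df.columns:
    print("Found the 'species' column")

# Iterate over the column names programmatically
for name in df.columns:
    print(name)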
Similarly, the index attribute gives you the labels for the rows.
# Get the row labels (index)
row_labels = df.index
print(row_labels)
This returns another Index object. By default, DataFrames get a RangeIndex (integers starting from 0), but the index can be other things, like dates or specific labels, which you'll encounter later.
RangeIndex(start=0, stop=150, step=1)
These inspection tools (head, tail, shape, info, describe, columns, index) are your first line of defense when encountering a new dataset. They help you quickly orient yourself, understand the basic structure and content, and identify potential issues like missing data or incorrect data types before you proceed with more complex data cleaning, transformation, or analysis tasks. Make it a habit to use them whenever you create or load a DataFrame.
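One way to build that habit is to wrap these calls in a small helper you run on every new DataFrame. The function below is just an illustrative sketch (quick_look is a made-up name, not a Pandas function):
def quick_look(df, n=5):
    """Print a quick first-pass summary of a DataFrame."""
    print(f"Shape: {df.shape}")
    print(df.head(n))
    df.info()
    print(df.describe())

quick_look(df)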