Imagine preparing ingredients for a recipe, only to find some items missing from your pantry. Similarly, when working with data, you'll often encounter datasets where some information is simply not there. These gaps are known as missing values.
A missing value represents information that should exist for a specific observation (a row) and variable (a column) but is absent. Think of it as an empty cell in your spreadsheet or database table where you expected to find a piece of data.
In the world of data analysis, especially when using tools like Python with libraries such as pandas and NumPy, missing data isn't usually represented by a literal blank space. Instead, you'll commonly encounter specific placeholders:
NaN (Not a Number): This is the most frequent representation you'll see in numerical data within pandas DataFrames and NumPy arrays. NaN is a special floating-point value used to signify undefined or unrepresentable results, which serves well for indicating missing numeric data.
None: Python's built-in object representing the absence of a value. You might find None used in columns that contain mixed data types or primarily strings, although pandas often converts None to NaN in numerical columns for consistency.
NULL: This is the standard way databases (like SQL databases) represent missing information. When you import data from a database, these NULL values are often converted to NaN or None by your data loading tools.
Placeholders: Sometimes, data collection systems use specific codes (like -1, 999, "missing", or "" (an empty string)) to denote missing information. While these look like regular data, they represent missingness. Identifying and handling these requires extra care because tools won't automatically recognize them as missing unless you explicitly tell them to.
Here's a small example illustrating how missing values might look in a table:
Name | Age | Score | City |
---|---|---|---|
Alice | 25 | 88 | New York |
Bob | NaN | 76 | London |
Charlie | 30 | NaN | NaN |
David | 22 | 95 | San Francisco |
In this table, Bob's age is missing (NaN), Charlie's score is missing (NaN), and Charlie's city is also missing (NaN).
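As a rough sketch, the example table can be rebuilt as a pandas DataFrame and inspected with isna(). The variable names here are illustrative choices, not part of the original lesson.

```python
import numpy as np
import pandas as pd

# Recreate the example table, using np.nan for each missing entry.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, 22],
    "Score": [88, 76, np.nan, 95],
    "City": ["New York", "London", np.nan, "San Francisco"],
})

# isna() returns a same-shaped table of booleans: True where a value is missing.
missing_mask = df.isna()
print(missing_mask)

# Names of the rows that contain at least one missing value.
rows_with_missing = df.loc[missing_mask.any(axis=1), "Name"].tolist()
print(rows_with_missing)
```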
Figure: Counts of present and missing values for the 'Age', 'Score', and 'City' columns from the example table above.
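The counts described here could be computed along the following lines. This is a minimal sketch that rebuilds the example table; the summary layout is an assumption rather than the exact calculation behind the original chart.

```python
import numpy as np
import pandas as pd

# Same example table as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, np.nan, 30, 22],
    "Score": [88, 76, np.nan, 95],
    "City": ["New York", "London", np.nan, "San Francisco"],
})

columns = ["Age", "Score", "City"]

# Summing booleans counts the True values in each column.
summary = pd.DataFrame({
    "present": df[columns].notna().sum(),
    "missing": df[columns].isna().sum(),
})
print(summary)
```

Each of the three columns has three present values and one missing value, which matches the table.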
As mentioned in the chapter introduction, these missing values aren't just cosmetic issues. They pose significant challenges. For example, arithmetic involving NaN often results in NaN, rendering the calculation useless.
Understanding what missing values are and how they manifest in your data is the essential first step before you can effectively address them, which is precisely what we'll cover in the upcoming sections.
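To make the point about NaN in calculations concrete, here is a small sketch using the ages from the example table. It also shows why the wording is "often" rather than "always": NumPy's plain sum propagates NaN, while pandas reductions skip missing values by default.

```python
import numpy as np
import pandas as pd

ages = np.array([25, np.nan, 30, 22])

# A plain NumPy sum propagates NaN: one missing value poisons the result.
numpy_total = ages.sum()
print(numpy_total)

# np.nansum ignores NaN entries explicitly.
nan_aware_total = np.nansum(ages)
print(nan_aware_total)

# pandas reductions skip NaN by default (skipna=True).
pandas_total = pd.Series(ages).sum()
print(pandas_total)

# Passing skipna=False restores the propagating behavior.
strict_total = pd.Series(ages).sum(skipna=False)
print(strict_total)
```

Silently skipping missing values is convenient, but it also means a result can look valid while being based on fewer observations than expected.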
© 2025 ApX Machine Learning