As we discussed in the chapter introduction, computers treat data differently based on its assigned type. Performing mathematical calculations requires data to be in a numeric format. If your dataset contains numbers stored as text (strings), like '100' or '98.6', you won't be able to directly use them for addition, subtraction, averaging, or more complex analysis. Trying to calculate 5 + '10' will likely result in an error or an unexpected outcome, not the 15 you intended. This section focuses on converting such text-based numbers into proper numeric types: integers and floats.
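As a small illustration in Python (the language used later in this section), mixing an integer with a numeric string fails, while adding two numeric strings silently concatenates them. The variable names here are illustrative only.

```python
# Adding an int to a string raises TypeError in Python.
try:
    total = 5 + '10'
except TypeError as exc:
    total = None
    print(f"Cannot add int and str: {exc}")

# Two strings are concatenated rather than summed.
joined = '5' + '10'
print(joined)  # 510

# After an explicit conversion, the arithmetic works as intended.
fixed_total = 5 + int('10')
print(fixed_total)  # 15
```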
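Before converting, let's quickly clarify the two primary numeric types you'll encounter: integers, which are whole numbers without a fractional part (such as 100, 0, or -5), and floats (floating-point numbers), which can carry a decimal part (such as 98.6 or 5.99).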
Imagine a column representing product prices, but the values are stored as strings like '5.99', '12.00', '€8.50'. You cannot directly calculate the average price or total sales value from these text entries. Similarly, sorting might happen alphabetically ('12.00' might come before '5.99') instead of numerically. Converting these strings into a numeric format (like float) is essential to perform these operations correctly.
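The sorting problem can be sketched in plain Python. The list below is a hypothetical sample; the '€8.50' entry is written without its currency symbol, since the symbol would have to be stripped before float() could parse it.

```python
# Hypothetical price strings without currency symbols.
price_strings = ['5.99', '12.00', '8.50']

# String sorting compares character by character, so '1' < '5' < '8'.
sorted_as_text = sorted(price_strings)
print(sorted_as_text)  # ['12.00', '5.99', '8.50']

# Converting to float gives numeric order and allows arithmetic.
price_floats = [float(p) for p in price_strings]
sorted_as_numbers = sorted(price_floats)
print(sorted_as_numbers)  # [5.99, 8.5, 12.0]

average_price = sum(price_floats) / len(price_floats)
print(round(average_price, 2))  # 8.83
```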
Most data analysis tools provide functions to attempt this conversion. The general idea is to instruct the tool to read the string, interpret it as a number, and store it in the appropriate numeric type (integer or float).
Let's consider using Python with the popular pandas library as an example. Pandas provides a function called pd.to_numeric(), which is specifically designed for this task.
Basic Conversion
Suppose you have a pandas Series (a column from a DataFrame) named prices_text containing strings:
0 '5.99'
1 '12.00'
2 '8.50'
Name: prices_text, dtype: object
The dtype: object usually indicates strings or mixed types in pandas. To convert this to numeric, you would use:
import pandas as pd

# Assuming 'prices_text' is your pandas Series of numeric strings
numeric_prices = pd.to_numeric(prices_text)
print(numeric_prices)
The output would look like this:
0 5.99
1 12.00
2 8.50
Name: prices_text, dtype: float64
Notice the dtype: float64. Pandas recognized the decimal points and chose the float type. If the strings represented whole numbers, like '100', '25', '0', pd.to_numeric() would likely choose an integer type (int64).
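A minimal sketch of that type inference follows. The Series names are hypothetical, and dtype=object is set explicitly so the input mirrors the object column shown earlier.

```python
import pandas as pd

# Hypothetical Series of whole-number strings; dtype=object mirrors the
# string column shown earlier.
counts_text = pd.Series(['100', '25', '0'], dtype=object, name='counts_text')

counts = pd.to_numeric(counts_text)
print(counts.dtype)  # int64

# A single decimal value is enough to make the whole result float64.
mixed_text = pd.Series(['100', '25.5', '0'], dtype=object)
mixed = pd.to_numeric(mixed_text)
print(mixed.dtype)  # float64
```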
Handling Potential Problems
What happens if the column contains values that cannot be interpreted as numbers? Common examples include currency symbols ($, €), thousands separators (1,000), and text entries or placeholders (N/A, Missing, 5 units).
If you try to convert a column containing such values directly, pd.to_numeric() will by default stop and raise an error, because it doesn't know how to handle the non-numeric entry.
For example, trying to convert ['5.99', '$12.00', '8.50'] would fail on the second element.
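The failure can be observed with a try/except block. This is a sketch; the variable names are illustrative and the exact wording of the error message depends on the pandas version.

```python
import pandas as pd

# The values from the example above; '$12.00' is not a valid number.
bad_prices = pd.Series(['5.99', '$12.00', '8.50'], dtype=object)

try:
    pd.to_numeric(bad_prices)  # errors='raise' is the default behaviour
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print(f"Conversion failed: {exc}")
```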
Using errors='coerce'
A very common strategy is to tell the conversion function to replace any problematic values with a special marker for missing data, often represented as NaN (Not a Number). In pandas, you achieve this using the errors='coerce' argument:
import pandas as pd

# Example Series with problematic values
mixed_values = pd.Series(['100', '55.5', 'N/A', '2,000', '-5'], name='values')

# Attempt conversion, coercing errors to NaN
numeric_values = pd.to_numeric(mixed_values, errors='coerce')
print(numeric_values)
The output would be:
0 100.0
1 55.5
2 NaN # 'N/A' becomes NaN
3 NaN # '2,000' (with comma) becomes NaN
4 -5.0
Name: values, dtype: float64
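Here's what happened: '100', '55.5', and '-5' are valid numeric strings and were converted normally. 'N/A' cannot be read as a number, so errors='coerce' turned it into NaN. '2,000' contains a comma, which pd.to_numeric() does not parse, so it also became NaN.
Note that even though some original values were whole numbers ('100', '-5'), the entire column's data type becomes float64. This is because NaN itself is technically a float value, so its presence forces the column to be float to accommodate it.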
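The effect of NaN on a column's type can be checked directly. The following sketch uses NumPy's np.nan and hypothetical Series names.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary floating-point value.
print(isinstance(np.nan, float))  # True

whole_numbers = pd.Series([100, -5])
print(whole_numbers.dtype)  # int64

# Adding a missing value forces the column to float64.
with_missing = pd.Series([100, np.nan, -5])
print(with_missing.dtype)  # float64
```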
Using errors='coerce' is often a good first step because it performs the conversion for valid numbers and flags the problematic entries as missing data (NaN). You can then decide how to handle these NaN values (e.g., investigate the original data, clean the strings further, or use imputation techniques covered in Chapter 2).
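One way to investigate and clean further is sketched below. The sample data is hypothetical, and removing commas assumes they are thousands separators rather than decimal separators, which should be checked against your own data.

```python
import pandas as pd

raw = pd.Series(['100', '55.5', 'N/A', '2,000', '-5', '$12.00'], dtype=object)

# First pass: find entries that were present but failed to convert.
first_pass = pd.to_numeric(raw, errors='coerce')
failed = raw[first_pass.isna() & raw.notna()]
print(failed.tolist())  # ['N/A', '2,000', '$12.00']

# Strip commas and dollar signs as literal characters, then convert again.
cleaned = raw.str.replace(',', '', regex=False).str.replace('$', '', regex=False)
second_pass = pd.to_numeric(cleaned, errors='coerce')
print(second_pass.tolist())  # [100.0, 55.5, nan, 2000.0, -5.0, 12.0]
```

Only 'N/A' remains as NaN after the second pass, which matches the intent of treating it as missing data.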
When converting, should you aim for an integer or a float? Integers suit values that are always whole numbers, such as counts. Floats are needed when values can have a decimal part, and they are what pandas falls back to when NaN values might be introduced during conversion. You may need an additional step after pd.to_numeric() (like using .astype(pd.Int64Dtype()), which supports integers alongside missing values) if you want integers despite having NaNs. For introductory purposes, letting pandas default to float when NaNs are present is often simplest.
Correctly converting data to numeric types is a foundational step. It unlocks the ability to perform calculations, comparisons, and quantitative analysis, moving you closer to extracting meaningful insights from your data. Remember to always inspect the data types (.dtype) before and after conversion to ensure the process worked as expected.
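As a sketch of checking .dtype at each stage, including the nullable integer route mentioned above (the Series name is hypothetical):

```python
import pandas as pd

readings = pd.Series(['100', 'N/A', '-5'], dtype=object)
print(readings.dtype)  # object

# Coerce first; the NaN forces float64.
as_float = pd.to_numeric(readings, errors='coerce')
print(as_float.dtype)  # float64

# Cast to pandas' nullable integer type; the missing entry becomes <NA>.
as_nullable_int = as_float.astype(pd.Int64Dtype())
print(as_nullable_int.dtype)  # Int64
```

This cast assumes every non-missing value is a whole number; a value such as 55.5 cannot be safely cast to Int64 and would raise an error.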
© 2025 ApX Machine Learning