Let's put the concepts from this chapter into practice. We've learned about different data types and why ensuring columns have the correct type is significant for analysis. Now, we'll walk through a common scenario where you receive data with mixed or incorrect types and apply the techniques discussed to fix them.
We'll use a small, hypothetical dataset representing product orders. Imagine this data came from combining different sources or manual entry, leading to inconsistencies. We will use the pandas library, a standard tool for data manipulation in Python. If you haven't used it before, pandas provides structures called DataFrames, which are like tables, and Series, which are like columns, along with functions to work with them.
First, let's create our sample DataFrame. In a real project, you would load this from a file (like a CSV), but here we'll define it directly for clarity.
import pandas as pd
import numpy as np # We need numpy for NaN
data = {
'OrderID': [1, 2, 3, 4, 5],
'Product': ['Apple', 'Banana', 'Milk', 'Bread', 'Apple'],
'Category': ['Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit'],
'Quantity': ['10', '5', '2', '1', '5'], # Stored as strings
'Unit_Price': ['$0.50', '$0.30', '$3.5O', '$2.50', '$0.50'], # Strings with symbols and an error
'Order_Date': ['01/15/2023', '01/16/2023', '01/16/2023', '2023-01-17', '01/18/2023'] # Mixed formats stored as strings
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Before making changes, let's check the data types pandas assigned to each column. The info() method is excellent for this, providing types and non-null counts.
print("\nInitial Data Types:")
df.info()
You'll likely see output similar to this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OrderID 5 non-null int64
1 Product 5 non-null object
2 Category 5 non-null object
3 Quantity 5 non-null object
4 Unit_Price 5 non-null object
5 Order_Date 5 non-null object
dtypes: int64(1), object(5)
memory usage: 368.0+ bytes
Notice that OrderID is correctly identified as an integer (int64), but Product, Category, Quantity, Unit_Price, and Order_Date are all listed as object. While Product and Category being objects (which usually means strings in pandas) might be acceptable initially, Quantity, Unit_Price, and Order_Date definitely need correction for proper analysis (e.g., calculating total cost or analyzing sales over time).
Let's tackle Quantity and Unit_Price.
1. Quantity: This column contains numbers stored as strings. We can convert it directly using pd.to_numeric.
df['Quantity'] = pd.to_numeric(df['Quantity'])
print("\nData types after converting Quantity:")
print(df.dtypes)
The output should now show Quantity as int64 or float64.
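To see why this conversion matters, compare summing the column before and after. With strings, pandas concatenates; with numbers, it adds. A minimal illustrative sketch:

```python
import pandas as pd

# Quantities stored as strings, as in the original data
s = pd.Series(['10', '5', '2', '1', '5'])
print(s.sum())                 # '105215' -- string concatenation, not addition

# After conversion, arithmetic behaves as expected
print(pd.to_numeric(s).sum())  # 23
```

This kind of silent concatenation is exactly the bug that proper typing prevents.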
2. Unit_Price: This column is trickier. It contains dollar signs ($) and potentially other non-numeric characters (like the letter 'O' instead of zero in '$3.5O'). First, we need to clean the strings by removing the $ sign. We can use the .str.replace() method available for pandas Series with string data.
df['Unit_Price'] = df['Unit_Price'].str.replace('$', '', regex=False)
print("\nUnit_Price after removing '$':")
print(df['Unit_Price'])
Now, try converting to numeric. What happens with the '$3.5O' entry?
# Attempt direct conversion (will likely cause an error)
# df['Unit_Price'] = pd.to_numeric(df['Unit_Price']) # Uncommenting this would raise an error
# Use errors='coerce' to handle invalid values
df['Unit_Price'] = pd.to_numeric(df['Unit_Price'], errors='coerce')
print("\nUnit_Price after conversion with errors='coerce':")
print(df['Unit_Price'])
print("\nData types after converting Unit_Price:")
print(df.dtypes)
Using errors='coerce' tells pandas to replace any value that cannot be converted to a number with NaN (Not a Number), which represents missing data. You'll see NaN where '$3.5O' used to be. This is useful because it prevents the entire conversion from failing. You would then need to decide how to handle this new missing value, perhaps by investigating the source or using an imputation technique discussed in Chapter 2. For now, we'll leave it as NaN. The Unit_Price column should now be of type float64.
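As a sketch of the follow-up you might do later, you can locate the rows that coercion turned into NaN and, for example, fill them with the column median (one simple imputation choice among several, shown here on a standalone Series mirroring our price data):

```python
import pandas as pd

# Prices after cleaning and coercion; index 2 was '$3.5O' and became NaN
prices = pd.Series([0.50, 0.30, None, 2.50, 0.50], name='Unit_Price')

# Find the rows where coercion produced a missing value
print(prices[prices.isna()])

# One simple option: fill with the median of the valid prices
filled = prices.fillna(prices.median())
print(filled)
```

Whether imputation, dropping the row, or fixing the source is appropriate depends on the analysis; the point is that NaN makes the problem visible and addressable.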
The Order_Date column contains dates stored as strings, and even has mixed formats. pandas has a powerful pd.to_datetime function that can often figure out the format automatically.
df['Order_Date'] = pd.to_datetime(df['Order_Date'], format='mixed')  # format='mixed' (pandas 2.0+) parses each entry individually
print("\nData types after converting Order_Date:")
print(df.dtypes)
If the formats were more complex or ambiguous, you might need to specify the format string using the format argument in pd.to_datetime. For example, if all dates were like '01/15/2023', you could use pd.to_datetime(df['Order_Date'], format='%m/%d/%Y'). For genuinely mixed formats like ours, recent pandas versions (2.0 and later) require format='mixed', which parses each entry individually; older versions attempted to infer the format element by element. The Order_Date column should now be datetime64[ns].
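A short sketch of the explicit-format path, using the '%m/%d/%Y' pattern mentioned above on a standalone Series, plus the .dt accessor that becomes available once a column is a true datetime:

```python
import pandas as pd

# Uniformly formatted dates, so an explicit format string applies
dates = pd.Series(['01/15/2023', '01/16/2023', '01/18/2023'])

# Explicit format: strict, fast, and unambiguous
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

# Datetime columns unlock the .dt accessor for time-based analysis
print(parsed.dt.day_name())  # Sunday, Monday, Wednesday
print(parsed.dt.month)       # 1, 1, 1
```

An explicit format also fails loudly on malformed entries instead of guessing, which is often what you want in a pipeline.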
The Category column contains text representing distinct groups ('Fruit', 'Dairy', 'Bakery'). While leaving it as object (string) is fine, converting it to the category data type can be more memory-efficient and sometimes speeds up operations, especially when there are many duplicate values (like 'Fruit' appearing multiple times).
df['Category'] = df['Category'].astype('category')
print("\nFinal Data Types after all conversions:")
df.info()
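To verify the memory claim, you can compare memory_usage(deep=True) for the same values stored as object versus category. With only five rows the difference is negligible, so this sketch repeats the values to make the effect visible:

```python
import pandas as pd

# 5,000 rows built from 3 unique labels, mimicking a heavily repeated column
values = ['Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit'] * 1000

as_object = pd.Series(values, dtype='object')
as_category = pd.Series(values, dtype='category')

# deep=True counts the actual string storage, not just the pointers
print(as_object.memory_usage(deep=True))    # one stored string per row
print(as_category.memory_usage(deep=True))  # small integer codes + 3 labels
```

Internally, a categorical stores each row as a small integer code plus a single copy of each unique label, which is why the saving grows with repetition.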
The info() output now reflects the changes: Quantity is numeric, Unit_Price is numeric (float) with one missing value introduced by coercion, Order_Date is datetime, and Category is categorical.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OrderID 5 non-null int64
1 Product 5 non-null object # Still object (string), which is fine
2 Category 5 non-null category
3 Quantity 5 non-null int64
4 Unit_Price 4 non-null float64 # Note: 4 non-null due to coercion
5 Order_Date 5 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 593.0+ bytes # Exact figures vary by pandas version; the category saving grows with more rows
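With the types corrected, the calculations that motivated the cleanup now work. A sketch computing a per-order line total on standalone Series matching our cleaned columns (note how the NaN price simply propagates, which is usually the behavior you want):

```python
import pandas as pd
import numpy as np

# Post-cleanup values, as produced by the conversions above
quantity = pd.Series([10, 5, 2, 1, 5])
unit_price = pd.Series([0.50, 0.30, np.nan, 2.50, 0.50])

# Element-wise arithmetic now works; the NaN propagates into the result
line_total = quantity * unit_price
print(line_total)
print(line_total.sum())  # NaN rows are skipped by default: 11.5
```

Had Quantity and Unit_Price stayed as strings, this multiplication would have raised an error rather than producing totals.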
Here's a visual comparison of the data types before and after our corrections:
[Chart: count of columns for each data type present in the DataFrame, before and after applying type corrections.]
In this hands-on section, we:
- Inspected the initial data types with df.info().
- Converted a string column (Quantity) to numeric using pd.to_numeric().
- Cleaned a string column (Unit_Price) by removing unwanted characters ($) using .str.replace().
- Used pd.to_numeric(errors='coerce') to handle invalid entries gracefully by turning them into NaN.
- Converted a string column (Order_Date) into a proper datetime format using pd.to_datetime().
- Converted a text column (Category) into a memory-efficient category type using .astype('category').

Correcting data types is a fundamental step in data preprocessing. It ensures that your data is stored appropriately, preventing errors and enabling accurate calculations, comparisons, and analysis in subsequent steps. By applying these techniques, you make your data ready for visualization, statistical analysis, or machine learning model training.
© 2025 ApX Machine Learning