As we discussed in the chapter introduction, ensuring your data columns have the correct data type is fundamental for reliable analysis. Performing mathematical operations on numbers stored as text, or trying to sort dates chronologically when they're just strings, often leads to errors or incorrect results. Imagine trying to calculate 5+10; the outcome is very different if the '5' is treated as the text character '5' instead of the numerical value 5. Before we can fix these issues, we first need to accurately determine the current data types assigned to each column in our dataset.
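To make the 5+10 example concrete, here is a minimal sketch in plain Python (no pandas needed). The variable names are illustrative only. It shows that the same `+` operator joins text values but adds numeric ones, and that sorting follows the same split.

```python
# Text values: "+" concatenates the two strings
text_result = "5" + "10"

# Numeric values: "+" performs arithmetic
numeric_result = 5 + 10

print(text_result)     # 510
print(numeric_result)  # 15

# Sorting has the same problem: strings are compared character by character
text_sorted = sorted(["10", "9", "100"])
numeric_sorted = sorted([10, 9, 100])

print(text_sorted)     # ['10', '100', '9']
print(numeric_sorted)  # [9, 10, 100]
```

Mixing the two kinds of value, as in `"5" + 10`, raises a `TypeError` in Python, which is the "error" half of "errors or incorrect results".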
Fortunately, most data analysis tools provide simple ways to inspect these types. If you're working with Python and the pandas library, which is very common for data manipulation, you can easily check the data types after loading your data.
Consider a small dataset containing information about products. We can load this into a pandas DataFrame and then inspect the data types using the `.dtypes` attribute.
```python
import io

import pandas as pd

# Sample data representing content loaded from a file
data = """ProductID,ProductName,Price,StockCount,LastUpdated,IsActive
101,Widget A,"50.99",500,2023-10-26,True
102,Gadget B,"120.50",150,2023-10-25,False
103,Thingamajig,"15.00"," N/A",2023-10-26,True
104,Doohickey,"75.25",200,2023-10-27,True
"""

# Use io.StringIO to simulate reading from a CSV file
df = pd.read_csv(io.StringIO(data))

# Display the data types of each column
print(df.dtypes)
```
Executing this code will typically print output resembling this:

```
ProductID        int64
ProductName     object
Price          float64
StockCount      object
LastUpdated     object
IsActive          bool
dtype: object
```

Recent pandas versions that use the dedicated string dtype by default may show `str` instead of `object` for text columns; the interpretation below is otherwise the same.

Let's interpret this output to understand the current state of our data:

- `int64`: This signifies 64-bit integers, meaning whole numbers. In our example, `ProductID` has been correctly identified as a column of integers.
- `float64`: This signifies 64-bit floating-point numbers (numbers with decimals, like 3.14 or 98.6). `Price` lands here even though its values are enclosed in quotes (e.g., `"50.99"`) in the sample data. In a CSV file, quotes only delimit a field, and with default settings `read_csv` strips them before inferring the type. The column would come back as `object` only if the fields contained something non-numeric, such as a currency symbol.
- `object`: This is a common data type in pandas, typically indicating that the column contains strings (text). It can also mean the column has a mix of different data types. Notice that `ProductName`, `StockCount`, and `LastUpdated` are all classified as `object`.
  - `ProductName` is expected to be text, so `object` is appropriate here.
  - `StockCount` contains numbers but also includes the non-numeric string `" N/A"`. Because of the leading space, pandas does not recognize it as a missing-value marker, so it falls back to the general `object` type for the whole column, preventing numerical operations.
  - `LastUpdated` contains dates, but they are currently stored as simple text strings.
- `bool`: This represents Boolean values, which can be either `True` or `False`. The `IsActive` column was correctly inferred as boolean.

You might also encounter other data types, such as:

- `datetime64[ns]`: Pandas' specific type for date and time values (the `[ns]` indicates nanosecond precision). Text columns are not converted to this type automatically; they need explicit conversion.
- `category`: A specialized type for columns with a limited, fixed set of unique values (like 'Low', 'Medium', 'High').

The `object` data type often serves as a signal that a column requires attention. While suitable for genuine text fields, its appearance in columns expected to be numeric, date/time, or boolean usually indicates an underlying issue. This could be due to:
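- Numbers stored as text or carrying symbols (`'123'`, `'$5.00'`).
- Dates written as text, often in inconsistent formats (`'10/26/2023'`, `'26-Oct-2023'`).
- Boolean values stored as text (`'True'`, `'false'`, `'Y'`, `'N'`).
- Numbers combined with units or other text (`'5 apples'`, `'10 kg'`).
- Placeholders for missing values (`'N/A'`, `'missing'`, `'?'`).

Here's a summary table illustrating common scenarios where inferred data types might be incorrect: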
| Example Column Content | Likely Inferred Type | Desired/Correct Type | Common Reason for Mismatch |
|---|---|---|---|
| `'101'`, `'102'`, `'103'` | `object` | `int64` | Numbers stored or read as text. |
| `'99.9'`, `'15.5'`, `'NA'` | `object` | `float64` | Mix of number-like strings and non-numeric text. |
| `'$10.50'`, `'€20.00'` | `object` | `float64` | Currency symbols prevent direct numeric reading. |
| `'2023-11-01'`, `'01/11/2023'` | `object` | `datetime64[ns]` | Dates stored as text strings. |
| `'Yes'`, `'No'`, `'TRUE'` | `object` | `bool` | Boolean concepts stored as various text strings. |
| `1`, `2`, `3` (Postal Codes) | `int64` | `object` / `category` | Numeric codes that shouldn't be used in math. |
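When a column that should be numeric shows up as `object`, it helps to see exactly which values are responsible. The snippet below is a rough sketch of one way to do that. The Series is a hypothetical stand-in for the `StockCount` column, and `pd.to_numeric` with `errors="coerce"` is used here only to locate the offending entries, not to change the data.

```python
import pandas as pd

# Hypothetical column mirroring StockCount from the product example
stock = pd.Series(["500", "150", " N/A", "200"], name="StockCount")

# Attempt a numeric parse; values that cannot be parsed become NaN
parsed = pd.to_numeric(stock, errors="coerce")

# Keep the original values that were present but failed to parse
offenders = stock[parsed.isna() & stock.notna()]

print(offenders.tolist())
```

Here the only entry reported is `" N/A"`, which is the value that keeps the column from being read as numbers.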
Using functions or attributes like `.dtypes` is the crucial first step in diagnosing data type problems. By examining the output, you can pinpoint which columns are not stored in their appropriate format. Once identified, you're ready to apply the conversion techniques covered in the next sections to ensure your data is structured correctly for meaningful analysis.
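Besides `.dtypes`, pandas has a couple of other inspection helpers that can speed up this first pass. The following is a minimal sketch, assuming a small hypothetical CSV in the style of the product data. `DataFrame.info()` prints dtypes alongside non-null counts, and `select_dtypes` narrows the DataFrame to the columns that were not inferred as numeric or boolean, which are the ones worth a closer look.

```python
import io

import pandas as pd

# Hypothetical sample in the style of the product data
csv_text = """ProductID,ProductName,StockCount,IsActive
101,Widget A,500,True
102,Gadget B, N/A,False
"""
df = pd.read_csv(io.StringIO(csv_text))

# Compact overview: column names, non-null counts, and dtypes
df.info()

# Columns that pandas did not infer as numeric or boolean
suspects = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(suspects)
```

In this sample, `ProductName` is genuine text, so only `StockCount` in the resulting list points to a real problem. Deciding which flagged columns actually need conversion is still a judgment call.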
© 2025 ApX Machine Learning