As mentioned in the chapter introduction, understanding the type of data stored in each column is fundamental. Computers and software tools need this information to perform operations correctly. Imagine trying to calculate the average price if the prices are stored as text like '$19.99' instead of as numbers, or trying to sort events chronologically if dates are just treated as plain text strings like 'Jan 1st'. Using the wrong data type can lead to errors, incorrect calculations, and misleading results.
Let's look at the most common data types you'll encounter when working with datasets:
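Numeric types represent numerical values and are essential for mathematical calculations. They include integers (whole numbers such as 10, -5, 0, and 1024) and floating-point numbers (values with a fractional part such as 3.14, -0.5, 98.6, and 2.71828).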
You can perform arithmetic operations like addition, subtraction, multiplication, and division on numeric types. Adding 5 + 10 gives 15 when both values are numeric, but if the values are stored as text ('5' and '10'), the operation might fail or produce an unexpected result such as the concatenated text '510'.
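As a minimal sketch of the difference, the following plain Python snippet adds the same values as numbers and as text. The variable names and prices are hypothetical and chosen only for illustration; other tools may handle mixed types differently.

```python
# Hypothetical values, chosen only for illustration.
as_numbers = 5 + 10      # both operands are integers, so + adds them
as_text = "5" + "10"     # both operands are strings, so + concatenates them

print(as_numbers)  # 15
print(as_text)     # 510

# Mixing text and a number raises an error in Python instead of guessing.
try:
    "5" + 10
except TypeError as exc:
    mixed_error = type(exc).__name__
print(mixed_error)  # TypeError

# Converting the text to a numeric type first restores normal arithmetic.
converted = int("5") + 10
text_prices = ["19.50", "20.50"]  # prices stored as text (hypothetical)
average_price = sum(float(p) for p in text_prices) / len(text_prices)
print(converted)      # 15
print(average_price)  # 20.0
```

The explicit `int(...)` and `float(...)` calls are the simplest form of the type conversion that the rest of this chapter builds on.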
Strings represent textual data. They are sequences of characters enclosed in quotes (single ' ' or double " "). Anything can be represented as a string, including names, addresses, descriptions, codes, or even numbers that you don't intend to use in calculations (like ZIP codes or ID numbers).
Examples: 'Hello World', "Data Science", '123 Main St', "ID-9876", 'True' (note: the word 'True', not the boolean value), '2023-10-26' (a date represented as text).
While strings can contain digits, they are treated as text characters, not numerical values. Mathematical operations typically don't apply directly to strings in a numerical sense.
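To illustrate why digits stored as text behave differently from numbers, here is a small Python sketch. The ZIP code and the list of values are made up for the example.

```python
# A hypothetical ZIP code kept as text: the leading zero is preserved.
zip_code = "02134"
zip_as_number = int(zip_code)  # converting to a number drops the leading zero
print(zip_code)       # 02134
print(zip_as_number)  # 2134

# Digits stored as text are compared character by character, not by value.
digits_as_text = ["10", "9", "100"]
sorted_as_text = sorted(digits_as_text)
sorted_as_numbers = sorted(digits_as_text, key=int)
print(sorted_as_text)     # ['10', '100', '9']
print(sorted_as_numbers)  # ['9', '10', '100']

# The word 'True' is text, not the boolean value True.
word_equals_boolean = ("True" == True)
print(word_equals_boolean)  # False
```

This is why identifiers such as ZIP codes are usually best left as strings, while values you intend to compare or compute with should be converted to a numeric type.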
Booleans represent truth values, indicating one of two states: true or false. They are fundamental in logic, comparisons, and control flow.
Examples: True, False. These often result from comparisons (e.g., is price > 100?) or represent binary states (e.g., is_subscribed, email_verified).
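The sketch below shows booleans produced by a comparison, using hypothetical prices. It also shows one pitfall in Python specifically: converting the text 'False' with `bool()` does not give the boolean False, because any non-empty string counts as true. The `parse_flag` helper is an illustrative assumption, not a standard function.

```python
prices = [50, 120, 250]  # hypothetical prices

# A comparison yields a boolean for each value.
is_over_100 = [price > 100 for price in prices]
print(is_over_100)  # [False, True, True]

# Booleans count as 1 (True) and 0 (False), so summing counts the matches.
count_over_100 = sum(is_over_100)
print(count_over_100)  # 2

# Pitfall: any non-empty string is truthy, including the text 'False'.
naive = bool("False")
print(naive)  # True


def parse_flag(text):
    """Hypothetical helper: map the text 'true'/'false' to a real boolean."""
    mapping = {"true": True, "false": False}
    return mapping[text.strip().lower()]


parsed = parse_flag("False")
print(parsed)  # False
```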
Date and time types are specialized types that represent dates, times, or both. Storing dates and times in a proper datetime format allows for chronological sorting, calculating durations, extracting components (like year, month, day, hour), and performing time-based analysis. If dates are stored as strings ('October 26, 2023', '26/10/2023'), these operations become much harder or impossible without conversion.
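As a rough sketch using Python's standard `datetime` module, the snippet below contrasts sorting dates as text with sorting them as real dates. The sample dates and the day/month/year format are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical dates stored as day/month/year text.
raw_dates = ["26/10/2023", "05/01/2024", "15/03/2023"]

# Sorted as text, the order follows the characters, not the calendar.
sorted_as_text = sorted(raw_dates)
print(sorted_as_text)  # ['05/01/2024', '15/03/2023', '26/10/2023']

# Parsing into datetime objects makes chronological sorting possible.
parsed = [datetime.strptime(d, "%d/%m/%Y") for d in raw_dates]
sorted_as_dates = sorted(parsed)
iso_dates = [d.strftime("%Y-%m-%d") for d in sorted_as_dates]
print(iso_dates)  # ['2023-03-15', '2023-10-26', '2024-01-05']

# Components and durations are available once the type is correct.
first, last = sorted_as_dates[0], sorted_as_dates[-1]
first_year, first_month = first.year, first.month
span_days = (last - first).days
print(first_year, first_month)  # 2023 3
print(span_days)                # 296
```

Note that the format string must match how the text is actually written; a different layout such as 'October 26, 2023' would need a different format.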
Examples: 2023-10-26 (date), 14:30:00 (time), 2023-10-26 14:30:00 (datetime).
While sometimes represented initially as strings, categorical data represents variables that belong to a fixed, limited number of categories or groups. Examples include user ratings ('Low', 'Medium', 'High'), product types ('Electronics', 'Clothing', 'Groceries'), or survey responses ('Agree', 'Neutral', 'Disagree'). Recognizing these can sometimes optimize storage and analysis, though for basic cleaning, handling them often involves ensuring the string representation is consistent.
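For categorical values, making the string representation consistent might look like the following sketch. The raw ratings are invented for illustration, and trimming whitespace plus normalizing capitalization is only one possible cleaning rule.

```python
from collections import Counter

# Hypothetical ratings with inconsistent spacing and capitalization.
raw_ratings = ["Low", "medium ", "HIGH", " low", "Medium"]
allowed = ("Low", "Medium", "High")  # the fixed set of categories, in order

# Trim whitespace and normalize capitalization so equal labels match.
cleaned = [rating.strip().capitalize() for rating in raw_ratings]
print(cleaned)  # ['Low', 'Medium', 'High', 'Low', 'Medium']

# Anything outside the allowed set needs a closer look.
unexpected = [rating for rating in cleaned if rating not in allowed]
print(unexpected)  # []

# With consistent labels, counting per category becomes reliable.
counts = Counter(cleaned)
print(counts["Low"], counts["Medium"], counts["High"])  # 2 2 1

# The category order is a property of the data, not of the alphabet.
ordered = sorted(set(cleaned), key=allowed.index)
print(ordered)  # ['Low', 'Medium', 'High']
```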
A classification of common data types found in datasets.
Understanding these fundamental types is the first step. In the following sections, we'll see how to check the current data types in your dataset and, more importantly, how to convert columns to their correct type to ensure your data is ready for reliable analysis.
© 2025 ApX Machine Learning