Before we can analyze data or build machine learning models, we first need to understand the type of data we're dealing with. Think of data types as different languages spoken by your information. Just like you need the right translator for a language, you need the right statistical methods and visualization tools for each data type. Applying a method designed for numerical measurements to categorical labels often doesn't make sense and can lead to incorrect conclusions.
In statistics and machine learning, data generally falls into two main categories: Categorical and Numerical. Let's break these down.
Categorical data represents characteristics or labels used to group items. These are qualitative descriptions. We can further divide categorical data into two types:
Nominal data consists of categories that do not have an inherent order or ranking. You can count them, but you can't logically order them from lowest to highest.
With nominal data, operations like calculating an average are meaningless. What's the average of "Red" and "Blue"? It doesn't compute. However, we can count the frequency of each category (e.g., how many red items vs. blue items).
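As a minimal sketch of frequency counting for nominal data, the snippet below tallies a small, made-up list of color labels using Python's standard library. The `colors` list is purely illustrative.

```python
from collections import Counter

# Hypothetical nominal observations: labels with no inherent order.
colors = ["Red", "Blue", "Red", "Green", "Blue", "Red"]

# Counting occurrences is a valid operation for nominal data.
counts = Counter(colors)

# The most frequent category (the mode) is a meaningful summary.
most_common_color, most_common_count = counts.most_common(1)[0]

print(counts)
print(most_common_color, most_common_count)
```

The mode is used here because it depends only on counts, not on any ordering or arithmetic over the labels.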
Ordinal data represents categories that do have a meaningful order or ranking, but the intervals between the categories might not be equal or quantifiable.
While there's an order, arithmetic operations like addition or averaging are still generally inappropriate for ordinal data because the differences between ranks aren't precisely defined. We can, however, determine the median (the middle value when ordered) or find percentiles.
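One way to find the median of ordinal data is to map each category to its rank, sort, and pick the middle observation. The sketch below assumes a hypothetical three-level scale (`Low`, `Medium`, `High`) and uses `statistics.median_low`, which always returns an actual data point, so no ranks are averaged.

```python
import statistics

# Hypothetical ordered scale, listed from lowest to highest.
levels = ["Low", "Medium", "High"]
rank = {level: i for i, level in enumerate(levels)}

# Made-up survey responses on that scale.
responses = ["Low", "High", "Medium", "Medium", "High", "Low", "Medium"]

# Convert labels to ranks only to order them, not to do arithmetic on them.
ranks = sorted(rank[r] for r in responses)

# median_low picks a middle element rather than averaging two ranks.
median_level = levels[statistics.median_low(ranks)]

print(median_level)
```

The numeric ranks serve only as a sorting key; the result is reported as a category, which avoids implying that the gaps between levels are equal.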
Numerical data represents quantities that can be measured or counted. These are quantitative values. Numerical data also comes in two flavors:
Discrete data consists of countable values, usually integers, with distinct gaps between the possible values. A count can be 2 or 3, for example, but not 2.5.
You can perform arithmetic operations (addition, subtraction, averaging) on discrete data.
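As a small illustration, the sketch below sums and averages a made-up list of counts. The variable name `children_per_household` is hypothetical. Note that the average of discrete values need not itself be a possible value.

```python
from statistics import mean

# Hypothetical discrete data: number of children in six households.
children_per_household = [0, 2, 1, 3, 2, 2]

# Addition and averaging are both meaningful for counts.
total = sum(children_per_household)
average = mean(children_per_household)

print(total)
print(average)  # a fractional value, even though every observation is a whole number
```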
Continuous data can take on any value within a given range. These values are typically measured rather than counted. In theory they can be arbitrarily precise; in practice, precision is limited by the measuring instrument.
Arithmetic operations are perfectly valid for continuous data. This is the type of data most commonly associated with statistical measurements like mean, variance, etc.
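A minimal sketch of these measurements, using made-up height readings and the standard library's `statistics` module. Here `variance` computes the sample variance; `pvariance` would give the population variance instead.

```python
from statistics import mean, variance

# Hypothetical continuous data: heights measured in centimetres.
heights_cm = [170.2, 165.5, 180.1, 175.0]

# The mean and sample variance are well defined for continuous measurements.
avg_height = mean(heights_cm)
var_height = variance(heights_cm)

print(avg_height)
print(var_height)
```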
Here is a simple way to visualize the hierarchy of data types:
Figure: A diagram illustrating the main categories and sub-types of data commonly encountered.
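Understanding these distinctions is fundamental because the data type determines which statistical methods and visualization tools are appropriate, and applying the wrong ones can lead to incorrect conclusions.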
When you load data using libraries like Pandas (which we'll cover briefly later in this chapter), it often attempts to infer the data types automatically. However, it's always good practice to verify these inferred types.
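The sketch below shows one way to check what pandas inferred, using a small made-up DataFrame (all column names are hypothetical). It also shows how a text column holding ordinal labels could be converted to an ordered categorical type, since pandas does not infer category order on its own.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "shirt_size": ["Small", "Large", "Medium", "Small"],
    "num_children": [0, 2, 1, 3],
    "height_cm": [170.2, 165.5, 180.1, 175.0],
    "is_member": [True, False, True, False],
})

# Inspect the type pandas inferred for each column.
print(df.dtypes)

# Programmatic checks can be more reliable than reading the printout.
is_int = pd.api.types.is_integer_dtype(df["num_children"])
is_float = pd.api.types.is_float_dtype(df["height_cm"])
is_bool = pd.api.types.is_bool_dtype(df["is_member"])

# Text labels are not inferred as categorical, so declare the order explicitly.
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
df["shirt_size"] = df["shirt_size"].astype(size_type)

# With an ordered categorical, order-based operations such as min() work.
smallest = df["shirt_size"].min()
print(smallest)
```

The exact type reported for text columns can differ between pandas versions, which is one more reason to inspect the inferred types rather than assume them.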
Pandas typically uses these common dtype objects:
object: Often used for text or mixed-type columns. Usually represents categorical data (both nominal and ordinal).
int64: Represents integer values. Often corresponds to discrete numerical data.
float64: Represents floating-point (decimal) numbers. Often corresponds to continuous numerical data.
bool: Represents Boolean (True/False) values. This is a type of categorical data.
category: A specific Pandas type optimized for categorical data (can represent nominal or ordinal).
datetime64: Represents date and time values.
Being able to identify the type of data you have is the first step towards meaningful analysis and effective machine learning model building. In the following chapters, we'll see how the methods we use change based on whether we're working with categorical or numerical data.
© 2025 ApX Machine Learning