Data is the raw material of data science. Think of it as the collection of facts, figures, observations, symbols, or descriptions that we gather to analyze. Before we can extract insights or build models, we first need to understand what we're working with.
At its core, data represents pieces of information about the world. But in the context of data science, we often think of data as anything that can be recorded, stored, and processed digitally. It's not just numbers in a spreadsheet, although that's a common form; data can take surprisingly diverse forms.
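Consider a simple example: tracking sales in a small online store. The data might include the customer's name, the item purchased, its price, the date and time of the purchase, and the customer's location.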
Each piece of information, like "Alice Smith", "Laptop Bag", "49.99", "2023-10-26 14:30:05", "New York", is a data point or observation. When we collect many such related observations together, we typically organize them into a dataset. Often, this dataset takes the form of a table, where rows represent individual records (like a single purchase) and columns represent different attributes or characteristics (like customer name, item, price).
CustomerID | Name        | Item       | Price | PurchaseDate
-----------|-------------|------------|-------|--------------------
101        | Alice Smith | Laptop Bag | 49.99 | 2023-10-26 14:30:05
102        | Bob Johnson | USB Cable  | 12.50 | 2023-10-26 15:01:22
101        | Alice Smith | Mouse      | 25.00 | 2023-10-27 09:15:10
...        | ...         | ...        | ...   | ...
A simple tabular dataset representing customer purchases.
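To make the row-and-column idea concrete, here is a minimal sketch of the same table in plain Python. Representing the dataset as a list of dictionaries is an assumption made for illustration (a real project might use a spreadsheet, a database, or a dataframe library instead), and the variable names `purchases`, `columns`, `items`, and `timestamps` are hypothetical.

```python
from datetime import datetime

# Each dictionary is one record (a row); each key is an attribute (a column).
# The values are copied from the three rows shown in the table.
purchases = [
    {"CustomerID": 101, "Name": "Alice Smith", "Item": "Laptop Bag",
     "Price": 49.99, "PurchaseDate": "2023-10-26 14:30:05"},
    {"CustomerID": 102, "Name": "Bob Johnson", "Item": "USB Cable",
     "Price": 12.50, "PurchaseDate": "2023-10-26 15:01:22"},
    {"CustomerID": 101, "Name": "Alice Smith", "Item": "Mouse",
     "Price": 25.00, "PurchaseDate": "2023-10-27 09:15:10"},
]

# Column names are the keys shared by every record.
columns = list(purchases[0].keys())

# Reading down one column gives every value of a single attribute.
items = [row["Item"] for row in purchases]

# The timestamps are stored as text; parsing them gives values we can compare.
timestamps = [
    datetime.strptime(row["PurchaseDate"], "%Y-%m-%d %H:%M:%S")
    for row in purchases
]

print(columns)
print(items)
print(len(purchases), "records,", len(columns), "attributes")
```

Note that even in this tiny dataset the attributes have different kinds of values: whole numbers, text, decimal prices, and timestamps.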
It's important to distinguish raw data from information. Raw data, like the number 101, might not mean much on its own. It only becomes information when we give it context. Knowing that 101 refers to the CustomerID for "Alice Smith" makes it meaningful. Data science often involves transforming raw data into useful information and, ultimately, into actionable insights.
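The difference between a raw value and information can be sketched in a few lines of Python. This is only an illustration: the `customers` lookup and the `describe` helper are hypothetical names introduced here, and the two entries come from the example table.

```python
# On its own, this is just a number with no meaning attached.
raw_value = 101

# Context: a hypothetical lookup from CustomerID to customer name.
customers = {101: "Alice Smith", 102: "Bob Johnson"}


def describe(customer_id, lookup):
    """Turn a bare ID into a statement by attaching the context we have."""
    name = lookup.get(customer_id)
    if name is None:
        # Without context, the value stays uninterpreted raw data.
        return f"{customer_id} (no context available)"
    return f"CustomerID {customer_id} refers to {name}"


print(describe(raw_value, customers))
print(describe(999, customers))
```

The same number produces a meaningful statement only when the lookup supplies its context; an ID with no matching entry remains just a number.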
Understanding what constitutes data is the essential first step. Recognizing its varied forms prepares you to think about how to structure, clean, and analyze it, which are topics we will cover next.
© 2025 ApX Machine Learning