Okay, let's put theory into practice. You've learned about finding data sources and the importance of preparing data. Now, we'll walk through the fundamental steps of actually loading a dataset and taking a first look at it. Think of this as opening the box after receiving a package – you want to make sure everything is there and get a general idea of the contents before you start working with it.
For this exercise, imagine we have a simple dataset stored in a common format, like a Comma Separated Values (CSV) file. CSV files are plain text files where data is organized in rows, and the values within each row are separated by commas. Let's say our file is named `simple_sales.csv` and contains basic information about product sales.
Here's what the raw data inside `simple_sales.csv` might look like:
Product,Category,Price,QuantitySold
Apple,Fruit,0.50,150
Banana,Fruit,0.30,250
Carrot,Vegetable,0.20,180
Broccoli,Vegetable,1.50,90
Orange,Fruit,0.60,120
This is a typical structure: the first line is a header row that names the columns (`Product`, `Category`, `Price`, `QuantitySold`), and each line after it is one record whose values are separated by commas.
The first action is to "load" or "import" this data into whatever environment you use for analysis. This could be a spreadsheet program (like Microsoft Excel or Google Sheets) or a data analysis tool or library (like pandas in Python). The steps are conceptually the same whichever tool you choose.
The conceptual process involves pointing your tool at the `simple_sales.csv` file and letting it parse the text into rows and columns.
After this step, the data is no longer just text in a file; it's structured within your analysis environment, ready for inspection.
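As a minimal sketch of the loading step, the snippet below uses Python's standard `csv` module. The names `RAW_CSV` and `load_rows` are hypothetical and exist only for this illustration, and the file contents are embedded as a string so the example runs without the file on disk. Libraries such as pandas offer a comparable one-step loader (`read_csv`).

```python
import csv
import io

# Stand-in for the contents of simple_sales.csv so the sketch runs without the file.
RAW_CSV = """Product,Category,Price,QuantitySold
Apple,Fruit,0.50,150
Banana,Fruit,0.30,250
Carrot,Vegetable,0.20,180
Broccoli,Vegetable,1.50,90
Orange,Fruit,0.60,120
"""


def load_rows(source):
    """Parse CSV text from a file-like object into a list of dicts keyed by column name."""
    # DictReader treats the first line as the header row.
    return list(csv.DictReader(source))


rows = load_rows(io.StringIO(RAW_CSV))
# With the real file this would be:
#   with open("simple_sales.csv", newline="") as f:
#       rows = load_rows(f)

print(len(rows), "records loaded")           # 5 records loaded
print(rows[0]["Product"], rows[0]["Price"])  # Apple 0.50
```

Note that every value is still a string at this point; the loader has only given the text a row-and-column structure.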
Once the data is loaded, the immediate next step is to perform some basic checks. This helps confirm that the data loaded correctly and gives you a first feel for its content.
Most tools provide a way to look at the beginning, or "head", of the dataset. This usually shows the first 5 or 10 rows.
Looking at the head of our `simple_sales` data would show something like:
Product | Category | Price | QuantitySold |
---|---|---|---|
Apple | Fruit | 0.50 | 150 |
Banana | Fruit | 0.30 | 250 |
Carrot | Vegetable | 0.20 | 180 |
Broccoli | Vegetable | 1.50 | 90 |
Orange | Fruit | 0.60 | 120 |
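Why do this? A quick look at the first rows confirms that the data loaded correctly: the column names ended up in the header, and each value sits under the right column.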
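Peeking at the head can be sketched in a few lines. The helper `head` and the constant `RAW_CSV` below are hypothetical names for this illustration, which again uses only the standard library and an embedded copy of the data. In pandas, `DataFrame.head()` plays this role.

```python
import csv
import io
from itertools import islice

# Stand-in for the contents of simple_sales.csv.
RAW_CSV = """Product,Category,Price,QuantitySold
Apple,Fruit,0.50,150
Banana,Fruit,0.30,250
Carrot,Vegetable,0.20,180
Broccoli,Vegetable,1.50,90
Orange,Fruit,0.60,120
"""


def head(source, n=5):
    """Return the first n records without reading the rest of the file."""
    reader = csv.DictReader(source)
    return list(islice(reader, n))


first_rows = head(io.StringIO(RAW_CSV), n=3)
for row in first_rows:
    print(row["Product"], row["Category"], row["Price"], row["QuantitySold"])
# Apple Fruit 0.50 150
# Banana Fruit 0.30 250
# Carrot Vegetable 0.20 180
```

Reading only the first few records keeps this check fast even when the file is far larger than our five-row example.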
It's useful to know the size of your dataset: how many rows and how many columns it has. Our tiny example has 5 rows (not counting the header) and 4 columns.
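Why do this? Knowing the size lets you confirm that everything is there, so that no rows or columns were lost during loading.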
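One possible way to check the dimensions is sketched below, assuming the same embedded copy of the data (`RAW_CSV` is a hypothetical name). It also flags records whose field count differs from the header, which is a common sign of a parsing problem. In pandas, the `DataFrame.shape` attribute reports the same two numbers.

```python
import csv
import io

# Stand-in for the contents of simple_sales.csv.
RAW_CSV = """Product,Category,Price,QuantitySold
Apple,Fruit,0.50,150
Banana,Fruit,0.30,250
Carrot,Vegetable,0.20,180
Broccoli,Vegetable,1.50,90
Orange,Fruit,0.60,120
"""

reader = csv.reader(io.StringIO(RAW_CSV))
header = next(reader)    # first line: column names
records = list(reader)   # remaining lines: one list of values per record

n_rows, n_cols = len(records), len(header)
print(n_rows, "rows x", n_cols, "columns")  # 5 rows x 4 columns

# File line numbers (header is line 1) of records with the wrong number of fields.
ragged = [i for i, rec in enumerate(records, start=2) if len(rec) != n_cols]
print(ragged)  # []
```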
Look closely at the column headers and the data within them.
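- `Product`: Contains text strings (names of products). This seems like qualitative data.
- `Category`: Contains text strings (types of products). Also qualitative.
- `Price`: Contains numbers with decimals (currency values). This is quantitative (specifically, continuous) data.
- `QuantitySold`: Contains whole numbers (counts). This is quantitative (specifically, discrete) data.
Why do this? The type of data in a column determines which operations make sense on it: quantitative columns can be summed or averaged, while qualitative columns are better suited to counting and grouping.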
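The type check can also be approximated in code. The sketch below is a deliberately simple heuristic and not how any particular tool infers types: the hypothetical helper `infer_type` tries to read every value in a column as an integer, then as a decimal number, and falls back to text. pandas exposes its own inferred types through `DataFrame.dtypes`.

```python
import csv
import io

# Stand-in for the contents of simple_sales.csv.
RAW_CSV = """Product,Category,Price,QuantitySold
Apple,Fruit,0.50,150
Banana,Fruit,0.30,250
Carrot,Vegetable,0.20,180
Broccoli,Vegetable,1.50,90
Orange,Fruit,0.60,120
"""


def infer_type(values):
    """Classify a column of strings as 'integer', 'decimal', or 'text'."""
    for label, cast in (("integer", int), ("decimal", float)):
        try:
            for value in values:
                cast(value)      # raises ValueError if the value does not fit
            return label         # every value converted cleanly
        except ValueError:
            continue             # try the next, more permissive type
    return "text"


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
column_types = {
    name: infer_type([row[name] for row in rows])
    for name in rows[0]
}
print(column_types)
# {'Product': 'text', 'Category': 'text', 'Price': 'decimal', 'QuantitySold': 'integer'}
```

A column you expect to be numeric showing up as text is a useful early warning, since it usually means some value in that column could not be read as a number.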
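By performing these simple loading and inspection steps, we've loaded the data into a structured form, looked at its first rows, checked its dimensions, and identified the type of data in each column.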
This practical step is the gateway to data preparation. Having loaded and initially inspected the data, you're now better equipped to move on to the next stages discussed in this chapter, such as handling missing values (though our simple example has none) or identifying outliers, which are necessary before performing any meaningful analysis.
© 2025 ApX Machine Learning