Scikit-learn comes equipped with several standard datasets that are invaluable for getting started, testing algorithms, and comparing results. Instead of needing to find and format your own data immediately, you can use these readily available datasets to practice the library's functionality. These datasets are accessible through the sklearn.datasets module.
load_* Functions

For smaller, classic datasets often embedded within the library itself, Scikit-learn provides load_* functions. These functions return a special object, often referred to as a "Bunch", which acts like a dictionary containing the data and its metadata. Let's load the famous Iris dataset, a common benchmark for classification tasks.
from sklearn.datasets import load_iris
import pandas as pd
# Load the Iris dataset
iris_data = load_iris()
# The returned object acts like a dictionary
print(f"Keys in the iris dataset object: {list(iris_data.keys())}")
The output will typically show keys like data, target, frame, target_names, DESCR, feature_names, filename, and data_module. Let's examine the most important attributes:
- data: This contains the features (the measurements) as a NumPy array. Each row represents a sample (a flower), and each column represents a feature (sepal length, sepal width, petal length, petal width).
- target: This is a NumPy array containing the labels or target values for each sample. In the Iris dataset, these correspond to the species of iris (0, 1, or 2).
- feature_names: A list of strings indicating the name of each feature column in data.
- target_names: A list of strings indicating the name of each target class in target.
- DESCR: A string containing a detailed description of the dataset. It's always a good idea to print and read this.
- frame: (Optional, depending on arguments) A Pandas DataFrame containing both data and target, often with proper column names.

Let's inspect some of these attributes:
# Print the feature names
print(f"Feature names: {iris_data.feature_names}")
# Print the target names
print(f"Target names: {iris_data.target_names}")
# Print the first 5 rows of the data (features)
print(f"\nFirst 5 rows of data:\n{iris_data.data[:5]}")
# Print the first 5 target labels
print(f"\nFirst 5 targets: {iris_data.target[:5]}")
# Print a portion of the dataset description
print(f"\nDataset Description (partial):\n{iris_data.DESCR[:500]}...")
For convenience, especially if you prefer working with Pandas (as covered in the prerequisites), you can often load the data directly into a DataFrame. The load_* functions usually accept an as_frame=True argument.
# Load the Iris dataset as a Pandas DataFrame
iris_df = load_iris(as_frame=True).frame
# Display the first 5 rows of the DataFrame
print("\nIris dataset as a Pandas DataFrame (first 5 rows):")
print(iris_df.head())
This DataFrame conveniently combines the features and the target column with meaningful names.
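If you only need the feature matrix and target array (for example, to pass them straight to an estimator), the loaders also accept a return_X_y=True argument. Below is a minimal sketch; the variable names X and y are just conventional choices.

# Load only the feature matrix and target vector, skipping the Bunch object
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 samples, 4 features
print(y.shape)  # (150,): one label per sample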
Other commonly used load_* functions include:

- load_digits(): Handwritten digits dataset for classification.
- load_diabetes(): Diabetes dataset for regression.
- load_breast_cancer(): Breast cancer dataset for binary classification.

Besides the embedded load_* datasets, sklearn.datasets also provides:
- fetch_* functions: These download larger, real-world datasets from the internet (e.g., fetch_california_housing, fetch_olivetti_faces). They work similarly to load_* functions but require an internet connection for the initial download.
- make_* functions: These generate synthetic datasets according to specified parameters (e.g., make_classification, make_regression, make_blobs). They are extremely useful for controlled experiments and for understanding algorithm behavior under specific data characteristics; a short sketch follows at the end of this section.

Exploring these built-in datasets is an excellent way to become familiar with different types of machine learning problems and the data formats Scikit-learn expects before moving on to your own custom data. In the practical exercise that follows, you'll verify your installation by loading one of these datasets.
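As a quick illustration of the make_* generators, here is a minimal sketch using make_classification. The parameter values below are arbitrary choices for demonstration, not recommended defaults.

from sklearn.datasets import make_classification

# Generate a small synthetic binary classification problem
X_synth, y_synth = make_classification(
    n_samples=200,     # number of samples to generate
    n_features=4,      # total number of features
    n_informative=2,   # features that actually carry class signal
    n_redundant=0,     # no redundant linear combinations of informative features
    random_state=42,   # reproducible output
)
print(X_synth.shape)   # (200, 4)
print(y_synth[:10])    # first ten class labels (0s and 1s)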