Machine learning algorithms learn patterns from data. To work effectively with Scikit-learn, it's important to understand how the library expects this data to be formatted. Building on the consistent API design discussed previously, Scikit-learn primarily operates on numerical data stored in specific structures.
The input data, often called features, samples, or instances, is typically represented as a two-dimensional array-like structure. By convention, this features matrix is referred to as X, and it has the following characteristics:
Shape: X is (n_samples, n_features).
n_samples: The number of individual observations or data points in your dataset (rows).
n_features: The number of characteristics or attributes measured for each sample (columns).
Data type: Scikit-learn expects numerical data, usually floating-point numbers (numpy.float64). While some algorithms can handle integers, converting your data to floats is often a safe default. Handling categorical (non-numeric) features requires specific preprocessing steps, which we will cover in Chapter 4.
Let's look at a simple example. Imagine a dataset with 3 samples and 2 features (e.g., height and weight).
Using NumPy:
import numpy as np
# 3 samples, 2 features
X_np = np.array([
    [170.0, 75.0],  # Sample 1: Height=170, Weight=75
    [165.5, 68.2],  # Sample 2: Height=165.5, Weight=68.2
    [181.2, 88.9]   # Sample 3: Height=181.2, Weight=88.9
])
print(f"Shape of NumPy array: {X_np.shape}")
# Output: Shape of NumPy array: (3, 2)
print(f"Data type: {X_np.dtype}")
# Output: Data type: float64
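If your raw measurements arrive as integers, converting them to floats keeps the data in the form most estimators expect. The snippet below is a minimal sketch of that conversion; the integer array is invented purely for illustration.
# Hypothetical measurements recorded as whole numbers
X_int = np.array([
    [170, 75],
    [166, 68],
    [181, 89]
])
# Convert to 64-bit floats, the dtype Scikit-learn works with most naturally
X_float = X_int.astype(np.float64)
print(X_float.dtype)
# Output: float64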
Using Pandas:
import pandas as pd
# 3 samples, 2 features
X_pd = pd.DataFrame({
    'Height (cm)': [170.0, 165.5, 181.2],
    'Weight (kg)': [75.0, 68.2, 88.9]
})
print("Pandas DataFrame:")
print(X_pd)
print(f"\nShape of DataFrame: {X_pd.shape}")
# Output:
# Pandas DataFrame:
# Height (cm) Weight (kg)
# 0 170.0 75.0
# 1 165.5 68.2
# 2 181.2 88.9
#
# Shape of DataFrame: (3, 2)
# Scikit-learn often works with the underlying NumPy representation
print(f"\nUnderlying NumPy array shape: {X_pd.values.shape}")
# Output: Underlying NumPy array shape: (3, 2)
print(f"Underlying NumPy array dtype: {X_pd.values.dtype}")
# Output: Underlying NumPy array dtype: float64
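The .values attribute used above works, but pandas also offers DataFrame.to_numpy(), which is a more explicit way to obtain the same underlying array. A short sketch, continuing from X_pd above:
# Equivalent, more explicit way to get the underlying NumPy array
X_arr = X_pd.to_numpy()
print(X_arr.shape)  # (3, 2)
print(X_arr.dtype)  # float64
In practice you can also pass the DataFrame itself to most estimators, and Scikit-learn will extract the numerical array for you.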
In supervised learning (Chapters 2 and 3), besides the features X, we also need the target variable, which represents the value we want to predict. By convention, the target variable is referred to as y, and it has the following characteristics:
Shape: y is typically a one-dimensional array of shape (n_samples,). It's important that the number of samples in y exactly matches the number of samples in X.
Regression: y should be a numeric array (usually floats).
Classification: y can be numeric (integers representing classes, e.g., 0, 1, 2) or contain strings/objects representing class labels (e.g., 'spam', 'not spam'). Scikit-learn often internally converts non-numeric labels to integers.
Continuing our example, let's add a target variable, perhaps indicating whether each person plays basketball (1 for yes, 0 for no), making this a classification task.
Using NumPy:
# Target for the 3 samples
y_np = np.array([1, 0, 1]) # Sample 1: Yes, Sample 2: No, Sample 3: Yes
print(f"Shape of target array: {y_np.shape}")
# Output: Shape of target array: (3,)
print(f"Data type: {y_np.dtype}")
# Output: Data type: int64
Using Pandas:
# Target for the 3 samples
y_pd = pd.Series([1, 0, 1], name='Plays Basketball')
print("Pandas Series:")
print(y_pd)
print(f"\nShape of Series: {y_pd.shape}")
# Output:
# Pandas Series:
# 0 1
# 1 0
# 2 1
# Name: Plays Basketball, dtype: int64
#
# Shape of Series: (3,)
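As noted above, classification targets do not have to be integers. The sketch below uses a small, made-up set of string labels and shows how LabelEncoder from sklearn.preprocessing can map them to integers explicitly, if you prefer to control that conversion yourself rather than leave it to the estimator.
from sklearn.preprocessing import LabelEncoder
# Made-up string class labels for the 3 samples
y_str = ['plays', 'does not play', 'plays']
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_str)
print(y_encoded)
# Output: [1 0 1]
print(encoder.classes_)
# Output: ['does not play' 'plays']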
| Component | Convention | Typical Structure | Shape | Data Type (Common) |
|---|---|---|---|---|
| Features | X | 2D Array/DataFrame | (n_samples, n_features) | float64 |
| Target Variable | y | 1D Array/Series | (n_samples,) | float64 (Regression); int or object (Classification) |
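To see these conventions in action, the following sketch feeds the arrays built earlier into an estimator. LogisticRegression is used here only as a convenient stand-in; any classifier accepts X and y in the same way.
from sklearn.linear_model import LogisticRegression
# Both inputs follow the expected conventions:
# X_np has shape (n_samples, n_features), y_np has shape (n_samples,)
assert X_np.shape[0] == y_np.shape[0]
model = LogisticRegression()
model.fit(X_np, y_np)  # accepted because the shapes and dtypes line up
# Predicting for one new sample: note the input is still 2D (1 sample, 2 features)
print(model.predict([[175.0, 80.0]]))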
The key takeaway is that Scikit-learn expects your data primarily as numerical arrays with specific shapes: X
as a 2D array where rows are samples and columns are features, and y
(for supervised learning) as a 1D array holding the target values for each sample. While Pandas DataFrames add convenience, understanding the underlying (n_samples, n_features)
structure is fundamental. Adhering to these conventions ensures compatibility with the wide range of tools available in the library. In the next section, we'll see how to load some pre-packaged datasets that already follow these conventions.