Machine learning algorithms learn patterns from data. To work effectively with Scikit-learn, understanding how the library expects data to be formatted is important. With its consistent API design, Scikit-learn primarily operates on numerical data stored in specific structures.
The input data is typically represented as a two-dimensional array-like structure in which each row is a sample (also called an instance or observation) and each column is a feature. By convention, this features matrix is referred to as X.
The shape of X is expected to be (n_samples, n_features), where:

- n_samples: The number of individual observations or data points in your dataset (rows).
- n_features: The number of characteristics or attributes measured for each sample (columns).

The values in X are generally expected to be numeric, typically floating-point (numpy.float64). While some algorithms can handle integers, converting your data to floats is often a safe default. Handling categorical (non-numeric) features requires specific preprocessing steps, which we will cover in Chapter 4.

Let's look at a simple example. Imagine a dataset with 3 samples and 2 features (e.g., height and weight).
Using NumPy:
import numpy as np
# 3 samples, 2 features
X_np = np.array([
[170.0, 75.0], # Sample 1: Height=170, Weight=75
[165.5, 68.2], # Sample 2: Height=165.5, Weight=68.2
[181.2, 88.9] # Sample 3: Height=181.2, Weight=88.9
])
print(f"Shape of NumPy array: {X_np.shape}")
# Output: Shape of NumPy array: (3, 2)
print(f"Data type: {X_np.dtype}")
# Output: Data type: float64
Using Pandas:
import pandas as pd
# 3 samples, 2 features
X_pd = pd.DataFrame({
'Height (cm)': [170.0, 165.5, 181.2],
'Weight (kg)': [75.0, 68.2, 88.9]
})
print("Pandas DataFrame:")
print(X_pd)
print(f"\nShape of DataFrame: {X_pd.shape}")
# Output:
# Pandas DataFrame:
# Height (cm) Weight (kg)
# 0 170.0 75.0
# 1 165.5 68.2
# 2 181.2 88.9
#
# Shape of DataFrame: (3, 2)
# Scikit-learn often works with the underlying NumPy representation
print(f"\nUnderlying NumPy array shape: {X_pd.values.shape}")
# Output: Underlying NumPy array shape: (3, 2)
print(f"Underlying NumPy array dtype: {X_pd.values.dtype}")
# Output: Underlying NumPy array dtype: float64
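The advice about converting data to floats and keeping X two-dimensional can be sketched as follows. This is a minimal illustration, not a required preprocessing step: the arrays `X_int` and `ages` are hypothetical and are not part of the height/weight dataset, and only the standard NumPy methods `astype` and `reshape` are used.

```python
import numpy as np

# Hypothetical integer-valued features: 3 samples, 2 features
X_int = np.array([
    [170, 75],
    [165, 68],
    [181, 88],
])

# Convert to float64, which the text describes as a safe default
X_float = X_int.astype(np.float64)
print(X_float.dtype)   # float64
print(X_float.shape)   # (3, 2)

# A single feature stored as a 1D array has shape (n_samples,).
# Reshape it into one column so it matches (n_samples, n_features).
ages = np.array([25.0, 32.0, 47.0])  # hypothetical single feature
X_single = ages.reshape(-1, 1)
print(X_single.shape)  # (3, 1)
```

Passing `-1` to `reshape` lets NumPy infer the number of rows, so the same line works for any number of samples.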
In supervised learning (Chapters 2 and 3), besides the features X, we also need the target variable, which represents the value we want to predict. By convention, the target variable is referred to as y.
- Shape: y is typically a one-dimensional array of shape (n_samples,). It's important that the number of samples in y exactly matches the number of samples in X.
- Regression: y should be a numeric array (usually floats).
- Classification: y can be numeric (integers representing classes, e.g., 0, 1, 2) or contain strings/objects representing class labels (e.g., 'spam', 'not spam'). Scikit-learn often internally converts non-numeric labels to integers.

Continuing our example, let's add a target variable, perhaps indicating whether each person plays basketball (1 for yes, 0 for no) - a classification task.
Using NumPy:
# Target for the 3 samples
y_np = np.array([1, 0, 1]) # Sample 1: Yes, Sample 2: No, Sample 3: Yes
print(f"Shape of target array: {y_np.shape}")
# Output: Shape of target array: (3,)
print(f"Data type: {y_np.dtype}")
# Output: Data type: int64 (the default integer type can differ by platform and NumPy version)
Using Pandas:
# Target for the 3 samples
y_pd = pd.Series([1, 0, 1], name='Plays Basketball')
print("Pandas Series:")
print(y_pd)
print(f"\nShape of Series: {y_pd.shape}")
# Output:
# Pandas Series:
# 0 1
# 1 0
# 2 1
# Name: Plays Basketball, dtype: int64
#
# Shape of Series: (3,)
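Two of the points about y can be checked directly: that X and y have the same number of samples, and what a mapping from string labels to integers looks like. The sketch below uses only NumPy. The `y_labels` array is a hypothetical string version of the target, and `np.unique` is used purely to illustrate such a mapping; it is not a claim about how any particular Scikit-learn estimator encodes labels internally.

```python
import numpy as np

X = np.array([[170.0, 75.0], [165.5, 68.2], [181.2, 88.9]])
y = np.array([1, 0, 1])

# X has one row per sample and y has one entry per sample
assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"

# Hypothetical string labels for the same three samples
y_labels = np.array(["yes", "no", "yes"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the integer position of each original label within that sorted list
classes, y_encoded = np.unique(y_labels, return_inverse=True)
print(classes)    # ['no' 'yes']
print(y_encoded)  # [1 0 1]
```

Because the distinct labels are sorted, 'no' maps to 0 and 'yes' maps to 1, which happens to match the 0/1 coding used in the example.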
| Component | Convention | Typical Structure | Shape | Data Type (Common) |
|---|---|---|---|---|
| Features | X | 2D Array/DataFrame | (n_samples, n_features) | float64 |
| Target Variable | y | 1D Array/Series | (n_samples,) | float64 (Regression); int or object (Classification) |
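To see these shapes in use, here is a minimal sketch that passes X and y to an estimator. The choice of `KNeighborsClassifier` with `n_neighbors=1` is an assumption made only for illustration; this section does not prescribe a model, and three samples are far too few for a meaningful one.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Features: shape (n_samples, n_features) = (3, 2)
X = np.array([[170.0, 75.0], [165.5, 68.2], [181.2, 88.9]])
# Target: shape (n_samples,) = (3,)
y = np.array([1, 0, 1])

# Illustrative estimator choice (not from the original text)
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

# predict takes a 2D array and returns one prediction per row
pred = clf.predict(X)
print(pred.shape)  # (3,)
```

The prediction array has the same 1D shape as y, one value per row of the input.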
The main takeaway is that Scikit-learn expects your data primarily as numerical arrays with specific shapes: X as a 2D array where rows are samples and columns are features, and y (for supervised learning) as a 1D array holding the target values for each sample. While Pandas DataFrames add convenience, understanding the underlying (n_samples, n_features) structure is fundamental. Adhering to these conventions ensures compatibility with the wide range of tools available in the library. In the next section, we'll see how to load some pre-packaged datasets that already follow these conventions.