As your machine learning projects grow, chaining together data loading, preprocessing, model training, and evaluation steps can become cumbersome. Manually managing each stage is prone to errors and makes it difficult to reproduce results consistently. MLJ.jl provides a way to define and manage these sequences using pipelines, effectively turning a complex workflow into a single, manageable object.
At the heart of building pipelines in MLJ.jl is the @pipeline macro. This powerful tool allows you to define a sequence of operations, or even more complex graphs of operations, that culminate in a final model. Let's explore how to construct these pipelines, starting with simple linear sequences and moving to more intricate setups for handling diverse data types.
The most straightforward pipeline involves a linear sequence of steps. For instance, you might want to standardize your features and then feed them into a classification model. If all your features are numerical and require the same standardization, the pipeline is simple.
Consider a dataset X (features) and y (target) where X contains only continuous numerical features. We can build a pipeline that first standardizes X and then trains a K-Nearest Neighbors classifier.
using MLJ
using DataFrames
using Random # for reproducibility
# Load necessary model types
Standardizer = @load Standardizer pkg=MLJModels
KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels # KNN models live in NearestNeighborModels.jl
# Generate some sample numeric data
Random.seed!(42)
X_numeric = DataFrame(A = rand(10), B = rand(10), C = rand(10))
y_target = coerce(rand(["Class1", "Class2"], 10), Multiclass)
# Construct a linear pipeline
linear_pipe = @pipeline(Standardizer, KNNClassifier(K=3))
# This pipeline is now a composite model. You can train it like any other MLJ model:
mach_linear = machine(linear_pipe, X_numeric, y_target)
fit!(mach_linear, verbosity=0)
# And make predictions:
y_pred_linear = predict(mach_linear, X_numeric)
# info(y_pred_linear[1]) # To see the type of prediction
In this linear_pipe, data flows from X_numeric into the Standardizer. The output of the Standardizer (the scaled data) then becomes the input to the KNNClassifier, which is trained on this scaled data together with the original y_target. The result is a clean, encapsulated workflow.
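Because the pipeline is itself a model, it works with the rest of the MLJ toolbox. As a quick sketch (using the data defined above), you could cross-validate the whole pipeline in one call; here predict_mode converts the classifier's probabilistic predictions into labels so that accuracy can be computed:
evaluate(linear_pipe, X_numeric, y_target,
         resampling=CV(nfolds=3, shuffle=true, rng=42),
         operation=predict_mode,
         measure=accuracy,
         verbosity=0)
Each fold refits every pipeline component from scratch, so no information leaks from the validation rows into the preprocessing steps.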
Real-world datasets often contain a mix of feature types, such as numerical and categorical columns. Each type might require different preprocessing. For example, numerical features often benefit from scaling (like standardization), while categorical features typically need to be one-hot encoded.
MLJ's @pipeline macro handles this common scenario elegantly. When you list several transformers (unsupervised models) before a final supervised model, each transformer in the chain acts only on the columns whose scientific types (scitypes) it knows how to handle, passing the remaining columns through unchanged. By the time the data reaches the supervised model, every feature has been converted to a numeric form.
Let's illustrate this with the classic Iris dataset, modified to include a categorical feature.
import RDatasets
# Load the remaining transformer type (Standardizer and KNNClassifier were loaded earlier)
OneHotEncoder = @load OneHotEncoder pkg=MLJModels
# Prepare Iris data
iris = RDatasets.dataset("datasets", "iris")
X_iris = select(iris, Not(:Species)) # Features
y_iris = iris.Species # Target
# Make one feature categorical for demonstration
Random.seed!(123) # for reproducibility
X_iris.PetalType = coerce(rand(["Short", "Medium", "Long"], nrow(X_iris)), Multiclass)
# Original features like SepalLength, PetalLength are Continuous.
# PetalType is now Multiclass.
# Ensure target is also Multiclass
y_iris_cat = coerce(y_iris, Multiclass)
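# Optional sanity check (schema is re-exported by MLJ): this lists each
# column's scientific type, which determines which transformer acts on it
schema(X_iris)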
# Construct the pipeline for heterogeneous data.
# Standardizer rescales the Continuous features, passing others through.
# OneHotEncoder then encodes the Multiclass (Finite) features, passing
# the already-scaled numeric columns through untouched.
hetero_pipe = @pipeline(Standardizer, OneHotEncoder, KNNClassifier(K=5))
# Train this pipeline
mach_hetero = machine(hetero_pipe, X_iris, y_iris_cat)
fit!(mach_hetero, verbosity=0)
# Make predictions
y_pred_hetero = predict(mach_hetero, X_iris)
# first(y_pred_hetero, 5) # Show first 5 predictions
In hetero_pipe:

- The Standardizer is fitted using only the Continuous features of X_iris (SepalLength, SepalWidth, PetalLength, PetalWidth) and rescales them; the Multiclass column PetalType passes through unchanged.
- The OneHotEncoder is fitted on the Standardizer's output. It encodes the Multiclass (more generally, Finite) feature PetalType into numeric indicator columns, while the already-scaled continuous columns pass through untouched.
- The resulting all-numeric table is passed to the KNNClassifier along with y_iris_cat for training.

This automatic dispatch on feature scitypes is a significant convenience. The flow can be summarized as follows:
Input features X flow first through the Standardizer (which rescales the continuous columns) and then through the OneHotEncoder (which encodes the multiclass columns), yielding a fully numeric X_transformed. This table is combined with the target y to train the KNNClassifier and produce predictions ŷ.
This ability to define preprocessing for different feature types within a single pipeline structure is extremely useful for maintaining clean and reproducible machine learning code.
Once a pipeline like linear_pipe or hetero_pipe is defined, it behaves like any other MLJ model. You create a machine by binding the pipeline to your data:
mach = machine(my_pipeline, X, y)
Then you call fit! on the machine:
fit!(mach, verbosity=0)
This fit! call trains all components of the pipeline in the correct order. For hetero_pipe, it first fits the Standardizer and uses it to rescale the Continuous columns of X (other columns pass through unchanged). It then fits the OneHotEncoder on that intermediate table, encoding the categorical columns. Finally, it fits the KNNClassifier on the resulting all-numeric table and y.
Predictions are made as usual:
y_predictions = predict(mach, X_new)
When predict is called, X_new passes through the same fitted transformations in sequence (standardization of its numeric columns, then one-hot encoding of its categorical columns) before being handed to the fitted KNNClassifier to generate predictions.
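For example, with a hypothetical new observation that follows the same column schema (note that categorical columns should carry the same levels seen in training):
X_new = DataFrame(SepalLength=[5.1], SepalWidth=[3.5],
                  PetalLength=[1.4], PetalWidth=[0.2],
                  PetalType=categorical(["Short"], levels=["Short", "Medium", "Long"]))
y_new_prob = predict(mach_hetero, X_new)       # probabilistic predictions
y_new_label = predict_mode(mach_hetero, X_new) # point predictions (class labels)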
You can also inspect the fitted_params of a pipeline machine to see the learned parameters of each component:
fp = fitted_params(mach_hetero)
Here fp.standardizer contains the learned means and standard deviations, fp.one_hot_encoder the details of the encoding, and fp.k_n_n_classifier the state of the trained KNN model.
Constructing pipelines this way ensures that your entire workflow, from raw features to model predictions, is encapsulated. This simplifies deployment and tuning (you can tune the hyperparameters of any pipeline component), and it guarantees that the same preprocessing steps are applied consistently during training, evaluation, and prediction. While we've focused on the common @pipeline(Transformer1, ..., TransformerN, Model) structure, MLJ's learning networks offer even more flexibility for custom pipeline architectures, which you can explore as your needs become more specialized.
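As one example of that tuning, here is a hedged sketch of searching over the K hyperparameter of the pipeline's classifier. The nested name :(k_n_n_classifier.K) mirrors the field names seen in fitted_params above; check fieldnames(typeof(hetero_pipe)) if your MLJ version generates different component names:
r = range(hetero_pipe, :(k_n_n_classifier.K), lower=1, upper=15)
tuned_pipe = TunedModel(model=hetero_pipe,
                        tuning=Grid(resolution=15),
                        resampling=CV(nfolds=3, shuffle=true, rng=123),
                        range=r,
                        measure=log_loss)
mach_tuned = machine(tuned_pipe, X_iris, y_iris_cat)
fit!(mach_tuned, verbosity=0)
fitted_params(mach_tuned).best_model # the pipeline with the best K found
Because the search re-runs the entire pipeline on each resampling fold, the preprocessing is refitted alongside every candidate K, keeping the evaluation honest.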