As your machine learning projects grow, chaining together data loading, preprocessing, model training, and evaluation steps can become cumbersome. Manually managing each stage is prone to errors and makes it difficult to reproduce results consistently. MLJ.jl provides a way to define and manage these sequences using pipelines, effectively turning a complex workflow into a single, manageable object.
At the core of building pipelines in MLJ.jl is the @pipeline macro. This powerful tool allows you to define a sequence of operations, or even more complex graphs of operations, that culminate in a final model. Let's explore how to construct these pipelines, starting with simpler linear sequences and moving to more intricate setups for handling diverse data types.
The most straightforward pipeline involves a linear sequence of steps. For instance, you might want to standardize your features and then feed them into a classification model. If all your features are numerical and require the same standardization, the pipeline is simple.
Consider a dataset X (features) and y (target) where X contains only continuous numerical features. We can build a pipeline that first standardizes X and then trains a K-Nearest Neighbors classifier.
using MLJ
using DataFrames
using Random # for reproducibility
# Load the KNN model type. (Standardizer and OneHotEncoder are built-in
# MLJModels transformers re-exported by MLJ, so no @load is needed for them.)
KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels
# Generate some sample numeric data
Random.seed!(42)
X_numeric = DataFrame(A = rand(10), B = rand(10), C = rand(10))
y_target = coerce(rand(["Class1", "Class2"], 10), Multiclass)
# Construct a linear pipeline
linear_pipe = @pipeline(Standardizer, KNNClassifier(K=3))
# This pipeline is now a composite model. You can train it like any other MLJ model:
mach_linear = machine(linear_pipe, X_numeric, y_target)
fit!(mach_linear, verbosity=0)
# And make predictions:
y_pred_linear = predict(mach_linear, X_numeric)
# info(y_pred_linear[1]) # To see the type of prediction
In this linear_pipe, data flows from X_numeric into the Standardizer. The output of the Standardizer (scaled data) then becomes the input for the KNNClassifier. The KNNClassifier is trained using this scaled data and the original y_target. This is a clean, encapsulated workflow.
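To make this flow concrete, here is a minimal sketch of the same workflow wired by hand with two separate machines; the pipeline simply automates this sequence (variable names here are illustrative only):
# Manual equivalent of linear_pipe, for illustration
stand_mach = machine(Standardizer(), X_numeric)
fit!(stand_mach, verbosity=0)
X_scaled = transform(stand_mach, X_numeric)   # standardized features
knn_mach = machine(KNNClassifier(K=3), X_scaled, y_target)
fit!(knn_mach, verbosity=0)
y_pred_manual = predict(knn_mach, X_scaled)   # same flow as linear_pipe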
"Datasets often contain a mix of feature types, such as numerical and categorical columns. Each type might require different preprocessing. For example, numerical features often benefit from scaling (like standardization), while categorical features typically need to be one-hot encoded."
MLJ's @pipeline macro handles this common scenario elegantly. When you list multiple transformers (unsupervised models) before a final supervised model, MLJ intelligently applies each transformer to the appropriate parts of the input data (based on their scientific types, or scitypes) and then concatenates their outputs before feeding them to the supervised model.
Let's illustrate this with the classic Iris dataset, modified to include a categorical feature.
import RDatasets
# Prepare Iris data
iris = RDatasets.dataset("datasets", "iris")
X_iris = select(iris, Not(:Species)) # Features
y_iris = iris.Species # Target
# Make one feature categorical for demonstration
Random.seed!(123) # for reproducibility
X_iris.PetalType = coerce(rand(["Short", "Medium", "Long"], nrow(X_iris)), Multiclass)
# Original features like SepalLength, PetalLength are Continuous.
# PetalType is now Multiclass.
# Ensure target is also Multiclass
y_iris_cat = coerce(y_iris, Multiclass)
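# Before constructing the pipeline, it is worth checking the scientific
# types MLJ will use to route features to each transformer:
schema(X_iris)
# The four measurement columns report Continuous; PetalType reports Multiclass.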
# Construct the pipeline for heterogeneous data
# Standardizer will act on Continuous features.
# OneHotEncoder will act on Multiclass (Finite) features.
# Their outputs are automatically concatenated before going to KNNClassifier.
hetero_pipe = @pipeline(Standardizer, OneHotEncoder, KNNClassifier(K=5))
# Train this pipeline
mach_hetero = machine(hetero_pipe, X_iris, y_iris_cat)
fit!(mach_hetero, verbosity=0)
# Make predictions
y_pred_hetero = predict(mach_hetero, X_iris)
# first(y_pred_hetero, 5) # Show first 5 predictions
In hetero_pipe:
1. Standardizer is fitted using only the Continuous features from X_iris (SepalLength, SepalWidth, PetalLength, PetalWidth) and then transforms them.
2. OneHotEncoder is fitted using only the Multiclass (more generally, Finite) features from the original X_iris (here, PetalType) and transforms that feature into numerical columns.
3. The outputs of the Standardizer (the scaled numerical features) and the OneHotEncoder (the one-hot encoded features) are automatically concatenated column-wise (hcat).
4. The concatenated table, together with y_iris_cat, is used to train the KNNClassifier.
This automatic handling of feature subsets and concatenation is a significant convenience. The flow can be visualized as follows:
The diagram shows how the input features X are processed in parallel by Standardizer (for continuous features) and OneHotEncoder (for multiclass features). Their outputs are concatenated (hcat) to form X_transformed, which is then used with the target y to train the KNNClassifier and make predictions ŷ.
This ability to define preprocessing for different feature types within a single pipeline structure is extremely useful for maintaining clean and reproducible machine learning code.
Once a pipeline like linear_pipe or hetero_pipe is defined, it behaves like any other MLJ model. You create a machine by binding the pipeline to your data:
mach = machine(my_pipeline, X, y)
Then, you fit! the machine:
fit!(mach, verbosity=0)
This fit! call will train all components of the pipeline in the correct order. For hetero_pipe, it first fits the Standardizer on the numeric parts of X, then fits the OneHotEncoder on the categorical parts of X. It then transforms X using these fitted transformers, concatenates the results, and finally fits the KNNClassifier on this processed data and y.
Predictions are made as usual:
y_predictions = predict(mach, X_new)
When predict is called, X_new goes through the same fitted transformations (standardization of its numeric parts, one-hot encoding of its categorical parts, followed by concatenation) before being passed to the fitted KNNClassifier to generate predictions.
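Because the pipeline is a single model, the entire workflow, preprocessing included, can also be resampled and scored in one call. Here is a minimal sketch using the Iris pipeline from above; the choice of 3-fold cross-validation and log_loss is just one reasonable option:
# Cross-validate the whole pipeline; the transformers are refit in every fold
e = evaluate(hetero_pipe, X_iris, y_iris_cat,
             resampling=CV(nfolds=3, shuffle=true, rng=123),
             measure=log_loss, verbosity=0)
e.measurement  # vector containing the estimated log loss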
You can also inspect the fitted_params of a pipeline machine to see the learned parameters for each component:
fp = fitted_params(mach_hetero)
fp.standardizer would contain the learned means and standard deviations, fp.one_hot_encoder the details of the encoding, and fp.knn_classifier the state of the trained KNN model.
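If you are unsure which field names the macro generated for the components, you can list them directly:
keys(fitted_params(mach_hetero))  # e.g. (:standardizer, :one_hot_encoder, :knn_classifier)
Because a pipeline is itself a model, nested hyperparameters of its components can be tuned like any other. Below is a minimal sketch of grid-searching the classifier's K, assuming the component field is indeed named knn_classifier (check with the keys call above):
# Tune a nested hyperparameter of the pipeline (field name assumed, see above)
r = range(hetero_pipe, :(knn_classifier.K), lower=1, upper=15)
tuned_pipe = TunedModel(model=hetero_pipe, tuning=Grid(),
                        resampling=CV(nfolds=3), ranges=r, measure=log_loss)
mach_tuned = machine(tuned_pipe, X_iris, y_iris_cat)
fit!(mach_tuned, verbosity=0)
fitted_params(mach_tuned).best_model  # pipeline instance with the best K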
Constructing pipelines this way ensures that your entire workflow, from raw features to model predictions, is encapsulated. This simplifies model deployment, tuning (as you can tune hyperparameters of pipeline components), and ensures that the same preprocessing steps are consistently applied during training, evaluation, and prediction. While we've focused on the common @pipeline(Transformer1, ..., TransformerN, Model) structure, MLJ's learning networks offer even more flexibility for custom pipeline architectures, which you can explore as your needs become more specialized.