This practical exercise will guide you through creating, training, evaluating, and managing a complete machine learning pipeline using MLJ.jl. We'll build upon the concepts discussed earlier in this chapter, demonstrating how pipelines can streamline your workflow, from initial data processing to model deployment. By encapsulating these steps, pipelines enhance the organization and reproducibility of your machine learning projects.
We'll work with a familiar dataset and progressively build our pipeline, illustrating each step with Julia code. You'll see how MLJ.jl makes it straightforward to connect different components, such as data scalers and classifiers, into a single, cohesive unit.
First, ensure you have the necessary Julia packages installed. For this exercise, we primarily need MLJ for the core framework and models, and DataFrames for data handling if your data isn't already in a compatible format. We will use the Iris dataset, which is readily available through MLJ.
Let's start by loading the packages and the dataset:
using MLJ
using DataFrames
using Random # For reproducibility
# Set a random seed for reproducibility
Random.seed!(123)
# Load the Iris dataset
X, y = @load_iris; # X is a DataFrame, y is a categorical vector
# Display the first few rows of features and the target
first(X, 3)
3×4 DataFrame
Row │ sepal_length sepal_width petal_length petal_width
│ Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2
2 │ 4.9 3.0 1.4 0.2
3 │ 4.7 3.2 1.3 0.2
first(y, 3)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"setosa"
"setosa"
"setosa"
The Iris dataset consists of 4 numerical features and a 3-class categorical target. We will now split this data into training and testing sets.
# Split data into 70% training and 30% testing
train_rows, test_rows = partition(eachindex(y), 0.7, shuffle=true, rng=Random.GLOBAL_RNG);
X_train = X[train_rows, :];
y_train = y[train_rows];
X_test = X[test_rows, :];
y_test = y[test_rows];
A pipeline consists of a sequence of operations. For our Iris classification task, we'll use two main components:
- A Standardizer to scale the numerical features.
- A DecisionTreeClassifier to perform the classification.

Let's load these model types from MLJ's model registry.
Standardizer = @load Standardizer pkg=MLJModels
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
With our components defined, we can now assemble them into a pipeline. MLJ.jl's Pipeline constructor (which replaces the deprecated @pipeline macro) provides a flexible way to define these structures. We'll create a simple linear pipeline where data flows from the Standardizer to the DecisionTreeClassifier.
# Define the pipeline structure: standardize, then classify
pipe_model = Pipeline(
    scaler = Standardizer(),
    classifier = DecisionTreeClassifier(max_depth=3, rng=Random.GLOBAL_RNG), # Set rng for the tree
    operation = predict_mode # We want direct class predictions
)
Here, scaler and classifier are the names we've given to our steps. Setting operation = predict_mode makes the pipeline deterministic, so that calling predict on the fitted pipeline returns class labels directly (e.g., "setosa") rather than probabilities. If you need probabilities instead, omit this option: the default, operation = predict, leaves the pipeline probabilistic.
The defined pipeline structure can be visualized to understand its flow:
Data flow in our simple Iris classification pipeline. Input features are first standardized, then fed into a decision tree classifier to produce predictions.
Now, we create a machine from our pipeline model and the training data, then fit it.
# Create a machine from the pipeline model and data
mach = machine(pipe_model, X_train, y_train);
# Fit the machine (trains the entire pipeline)
fit!(mach, verbosity=0);
When fit! is called on a pipeline machine:
1. The Standardizer (named scaler) is trained on X_train.
2. X_train is transformed using the fitted Standardizer.
3. The DecisionTreeClassifier (named classifier) is trained on the transformed X_train and y_train.

The learned parameters for each step are stored within the machine. You can inspect the fitted parameters of individual components if needed using fitted_params(mach).
To assess our pipeline's performance, we use cross-validation. MLJ's evaluate! function handles this.
# Define evaluation metrics
acc = Accuracy()
f1_micro = micro_f1score # Micro-averaged F1; good for multiclass, averages over samples
# Perform 5-fold cross-validation
evaluation_results = evaluate!(mach,
resampling=CV(nfolds=5, shuffle=true, rng=Random.GLOBAL_RNG),
measures=[acc, f1_micro],
verbosity=1);
println(evaluation_results)
This will output the mean and standard deviation of accuracy and F1-score across the cross-validation folds. For example, you might see something like:
┌──────────────┬───────────┬────────────────┐
│ measure      │ operation │ mean ± 1.96*SE │
├──────────────┼───────────┼────────────────┤
│ Accuracy     │ predict   │ 0.943 ± 0.063  │
│ MicroF1Score │ predict   │ 0.943 ± 0.063  │
└──────────────┴───────────┴────────────────┘
Report.measurements:
┌──────────────┬────────────────────────────┬───────────────────────────────────────────────┐
│ │ 1.96*SE │ mean │
├──────────────┼────────────────────────────┼───────────────────────────────────────────────┤
│ Accuracy │ 0.06323891395980072 │ 0.9428571428571428 │
│ MicroF1Score │ 0.06323891395980072 │ 0.9428571428571428 │
└──────────────┴────────────────────────────┴───────────────────────────────────────────────┘
Report.per_fold:
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ │ fold 1 │ fold 2 │ fold 3 │ fold 4 ⋯
├──────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────┤
│ Accuracy │ 0.9047619047619048 │ 0.9523809523809523 │ 1.0 │ 0.9047619 ⋯
│ MicroF1Score │ 0.9047619047619048 │ 0.9523809523809523 │ 1.0 │ 0.9047619 ⋯
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
1 column omitted
Example output from evaluate!. The mean column under Report.measurements shows the average performance across folds.
After training and evaluation, you can use the fitted pipeline to make predictions on unseen data, like our X_test.
# Make predictions on the test set
y_pred = predict(mach, X_test);
# Calculate accuracy on the test set
test_accuracy = accuracy(y_pred, y_test)
println("Test set accuracy: $(round(test_accuracy, digits=3))")
# You can also view some predictions
first(y_pred, 5)
This would output the test accuracy, for instance: Test set accuracy: 0.956, and the first few predictions.
A significant advantage of pipelines is the ability to save the entire trained workflow and load it later for inference or further analysis. MLJ uses Julia's standard serialization for this; recent versions conventionally use a .jls extension (older versions used the JLSO-based .jlso format).
# Save the fitted machine (which contains the trained pipeline)
MLJ.save("iris_pipeline_machine.jls", mach);
# To demonstrate loading, let's create a new machine by loading from the file
mach_loaded = machine("iris_pipeline_machine.jls");
# Make predictions with the loaded machine
y_pred_loaded = predict(mach_loaded, X_test);
# Verify that the predictions are the same
println("Predictions from original and loaded machine are identical: $(y_pred == y_pred_loaded)")
This confirms that the loaded machine behaves identically to the original one, preserving all learned parameters and the pipeline structure. This is extremely useful for deploying models into production or sharing reproducible results.
Throughout this practical, we've touched upon aspects of reproducibility:
- Managing dependencies with Julia's package manager (Pkg.jl) ensures that your Project.toml and Manifest.toml files capture the exact versions of packages used. This is fundamental for others (or your future self) to recreate the environment. This was covered in detail in Chapter 1.
- Setting random seeds with Random.seed!(integer), or passing rng arguments to functions like partition and CV and to models with stochastic components (e.g., DecisionTreeClassifier, many ensemble methods), is important for getting consistent results across runs. We used Random.GLOBAL_RNG after seeding it, or passed a specific integer where rng arguments were available (see the sketch below).
- Saving the fitted machine object captures the state of your trained pipeline, allowing for exact replication of its predictive behavior.

By consistently applying these practices, you can significantly improve the reliability and trustworthiness of your machine learning experiments.
This hands-on exercise has demonstrated the core workflow of creating a pipeline with Standardizer and DecisionTreeClassifier components, training it, evaluating its performance, making predictions, and managing the trained pipeline by saving and loading it. MLJ.jl's pipeline system, built around the Pipeline constructor, provides a powerful and intuitive way to manage even more complex machine learning workflows, which you might encounter as you tackle more advanced problems.