Once your data is clean, the next step is often to transform it. Raw, clean data isn't always in the best format for machine learning algorithms. Different algorithms have different expectations about the input data. For instance, some algorithms are sensitive to the scale of input features, while others require all inputs to be numerical. This section covers three common and important data transformation techniques in Julia: scaling numerical features, encoding categorical features, and binning continuous data.
Many machine learning algorithms perform better or converge faster when numerical input features are on a similar scale. For example, algorithms that compute distances between data points, like k-Nearest Neighbors (k-NN) and Support Vector Machines (SVMs), or algorithms that use gradient descent, like linear regression and neural networks, can be sensitive to feature scaling. If features have different ranges (e.g., one feature from 0 to 1, another from 0 to 1,000,000), the feature with the larger range can dominate the calculations.
Standardization rescales features so they have a mean (μ) of 0 and a standard deviation (σ) of 1. The transformation is given by:
$$X_{\text{scaled}} = \frac{X - \mu}{\sigma}$$
This method is widely used and is particularly useful when the data follows a Gaussian (normal) distribution. It's generally less sensitive to outliers than min-max scaling.
In Julia, you can use the Standardizer model from MLJ.jl to perform this transformation. MLJ.jl (Machine Learning in Julia) is a comprehensive framework for machine learning in Julia, and its models typically follow a fit!/transform pattern.
Let's see an example. First, ensure you have MLJ and DataFrames loaded:
using MLJ, DataFrames
import StableRNGs.StableRNG # for reproducibility
# Sample data
rng = StableRNG(123)
# Values are converted to Float64 so the columns get the Continuous scitype,
# which is what Standardizer acts on (integer columns have scitype Count and are skipped)
X_df = DataFrame(
    age = float.(rand(rng, 25:65, 10)),
    income = float.(rand(rng, 30000:150000, 10))
)
# Convert to a table MLJ can use
X = MLJ.table(X_df)
Now, let's apply standardization to the age and income features:
# Initialize the Standardizer model
scaler_model = Standardizer()
# Wrap the model in a machine and fit it to the data
scaler_machine = machine(scaler_model, X)
fit!(scaler_machine)
# Transform the data
X_scaled = MLJ.transform(scaler_machine, X)
# X_scaled is now a table with standardized features
# You can convert it back to a DataFrame to inspect
X_scaled_df = DataFrame(X_scaled)
println(X_scaled_df)
Output:
10×2 DataFrame
Row │ age income
│ Float64 Float64
─────┼──────────────────────
1 │ -0.0706283 1.2155
2 │ 1.48319 -0.893144
3 │ -1.62449 -1.28456
4 │ -1.03332 0.370213
5 │ 1.12564 1.33237
6 │ 0.686526 -1.07185
7 │ -0.854546 0.781878
8 │ 0.865303 0.16541
9 │ 0.108148 -0.80376
10 │ -0.525828 0.18795
The Standardizer in MLJ automatically identifies continuous features for standardization.
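To see what was learned during fitting, you can inspect the machine's fitted parameters; the exact structure of the returned named tuple may vary between MLJ versions:
# Per-feature means and standard deviations learned by the Standardizer
fitted_params(scaler_machine)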
To visualize the effect, consider two features that start on very different scales: after standardization, both are brought to a comparable scale and centered around zero.
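If you want to reproduce such a comparison yourself, here is a minimal sketch using Plots.jl and Statistics (both assumed to be available; the synthetic data is arbitrary):
using Plots, Statistics

# Two synthetic features on very different scales
f1 = randn(200) .* 2 .+ 10          # roughly centered at 10, spread ~2
f2 = randn(200) .* 500 .+ 10_000    # roughly centered at 10,000, spread ~500

to_zscores(v) = (v .- mean(v)) ./ std(v)

p_before = scatter(f1, f2; xlabel="feature 1", ylabel="feature 2",
                   title="Before scaling", legend=false)
p_after  = scatter(to_zscores(f1), to_zscores(f2); xlabel="feature 1", ylabel="feature 2",
                   title="After standardization", legend=false)
plot(p_before, p_after, layout=(1, 2))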
Min-Max scaling rescales features to a fixed range, usually [0, 1]. The transformation is:
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
This is useful when you need data within a bounded interval or when the algorithm doesn't assume any specific distribution of the data. However, it can be sensitive to outliers because $X_{\min}$ and $X_{\max}$ are used in the calculation.
While MLJ's core doesn't have a direct MinMaxScaler like some other frameworks, you can implement one yourself or use models from extended MLJ ecosystems (such as MLJScikitLearnInterface.jl). For instructional purposes, let's see how you might apply this to a single feature using basic Julia functions:
# Example for a single feature (e.g., income from X_df)
feature_to_scale = X_df.income

function min_max_scale(col)
    min_val = minimum(col)
    max_val = maximum(col)
    if min_val == max_val  # Avoid division by zero if all values are the same
        return zeros(length(col))
    end
    return (col .- min_val) ./ (max_val - min_val)
end

income_min_max_scaled = min_max_scale(feature_to_scale)
# println(income_min_max_scaled)
This scales the income feature to the [0, 1] range. For a full table and integration into MLJ workflows, you would typically wrap such logic into a custom transformer or use a package that provides this functionality. Note that MLJ's ContinuousEncoder is not a substitute here: it converts categorical and count features to the Continuous scitype, but it does not rescale values to a [0, 1] interval.
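If you simply want to scale every numeric column of a DataFrame this way, a minimal sketch reusing the min_max_scale helper defined above might look like this:
# Apply min_max_scale to each numeric column; mapcols transforms a DataFrame column-wise
X_min_max_df = mapcols(col -> eltype(col) <: Real ? min_max_scale(col) : col, X_df)
# println(X_min_max_df)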
A common practice in machine learning is to fit scalers (or any preprocessor) only on the training data. Then, use the fitted scaler to transform the training data, validation data, and any new test data. This prevents information from the validation or test sets from "leaking" into the training process, which could lead to overly optimistic performance estimates. MLJ's machine abstraction helps manage this correctly when used with data partitions.
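As a concrete illustration, here is a minimal sketch (the split proportion and random seed are arbitrary) that fits a Standardizer on the training rows only and reuses it on held-out rows:
# Split the earlier X_df into training and test rows (70/30)
train, test = partition(1:nrow(X_df), 0.7, shuffle=true, rng=StableRNG(42))

X_train = MLJ.table(X_df[train, :])
X_test  = MLJ.table(X_df[test, :])

# Fit the scaler on the training rows only
scaler = machine(Standardizer(), X_train)
fit!(scaler)

# Reuse the training statistics for both sets; no information leaks from the test rows
X_train_scaled = MLJ.transform(scaler, X_train)
X_test_scaled  = MLJ.transform(scaler, X_test)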
Machine learning algorithms typically require numerical input. Categorical features, such as 'color' (with values like 'Red', 'Green', 'Blue') or 'city' (e.g., 'New York', 'London', 'Tokyo'), need to be converted into a numerical representation.
Julia's CategoricalArrays.jl package is often used to handle categorical data. MLJ.jl integrates well with it, and you'll often use the coerce function to ensure your categorical columns have the correct scientific type (e.g., Multiclass or OrderedFactor) before applying encoders.
# Sample categorical data
X_cat_df = DataFrame(
    id = 1:5,
    color = ["Red", "Green", "Blue", "Green", "Red"],
    grade = ["A", "C", "B", "A", "D"]  # Assume this is ordinal
)
# Coerce 'color' to Multiclass (nominal) and 'grade' to OrderedFactor (ordinal)
X_cat_coerced = coerce(X_cat_df, :color => Multiclass, :grade => OrderedFactor)
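To confirm the coercion, you can inspect the scientific types of the coerced table (schema is re-exported by MLJ):
# Check the scitypes after coercion
schema(X_cat_coerced)
# :color should now be Multiclass{3}, :grade OrderedFactor{4}, and :id Count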
One-Hot Encoding is a common technique for nominal categorical features (where categories have no inherent order). It creates a new binary (0 or 1) feature for each unique category.
For example, if a 'color' feature has categories 'Red', 'Green', 'Blue', each observation is replaced by a three-element indicator vector:
'Red' → [1, 0, 0]
'Green' → [0, 1, 0]
'Blue' → [0, 0, 1]
This avoids implying any ordinal relationship between categories. The main drawback is that it can significantly increase the number of features (dimensionality) if a categorical variable has many unique values.
MLJ provides OneHotEncoder for this:
# Assume X_cat_coerced from previous example
encoder_model = OneHotEncoder()
encoder_machine = machine(encoder_model, MLJ.table(X_cat_coerced))
fit!(encoder_machine)
X_encoded = MLJ.transform(encoder_machine, MLJ.table(X_cat_coerced))
X_encoded_df = DataFrame(X_encoded)
# println(X_encoded_df) # Inspect the one-hot encoded output
The OneHotEncoder transforms features with Multiclass scitype; whether OrderedFactor features such as grade are also one-hot encoded is controlled by its ordered_factor keyword, so you can leave them for a different treatment (e.g., by ContinuousEncoder). Non-categorical features, such as the id column (scitype Count here), are passed through unchanged. You can also restrict encoding to particular features using OneHotEncoder(features=[:feature1, :feature2]).
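For instance, a minimal sketch that encodes only the color column and drops the last indicator level to avoid a redundant column (features, drop_last and ordered_factor are standard OneHotEncoder keywords):
# Encode only :color; drop_last removes one redundant indicator column per feature
color_encoder = OneHotEncoder(features=[:color], drop_last=true, ordered_factor=false)
color_machine = machine(color_encoder, MLJ.table(X_cat_coerced))
fit!(color_machine)
DataFrame(MLJ.transform(color_machine, MLJ.table(X_cat_coerced)))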
For ordinal categorical features (where categories have a natural order, like 'Low', 'Medium', 'High' or education levels), assigning a numerical rank (e.g., 0, 1, 2) can be appropriate. This is often called label encoding or integer encoding.
In MLJ, if a feature is correctly coerced to OrderedFactor, the ContinuousEncoder model can be used to convert these ordered categories into numerical values: each level is replaced by its integer code (1, 2, 3, ... following the level order), represented as a Continuous (floating-point) value.
# Using X_cat_coerced, where 'grade' is OrderedFactor
# By default the levels of a coerced string column are sorted: "A" < "B" < "C" < "D".
# To impose a different order, reorder the levels after coercion, e.g.:
# levels!(X_cat_coerced.grade, ["D", "C", "B", "A"])

# ContinuousEncoder maps each OrderedFactor level to its integer code (as a Float64)
ord_encoder_model = ContinuousEncoder()
ord_encoder_machine = machine(ord_encoder_model, MLJ.table(X_cat_coerced))
fit!(ord_encoder_machine)
X_ord_encoded = MLJ.transform(ord_encoder_machine, MLJ.table(X_cat_coerced))
X_ord_encoded_df = DataFrame(X_ord_encoded)
# println(X_ord_encoded_df.grade)
The ContinuousEncoder transforms OrderedFactor features into Continuous ones via their integer codes, and by default it also one-hot encodes any Multiclass features so that the whole output table is Continuous. A Standardizer, by contrast, only rescales features that are already Continuous and leaves categorical columns alone. In practice, it's often about composing these transformers in a pipeline: encode first, then scale.
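For example, a minimal sketch of such a composition using the |> pipeline syntax supported by recent versions of MLJ (pipelines are covered properly in Chapter 5):
# Encode categorical features to Continuous, then standardize everything
prep_pipe = ContinuousEncoder() |> Standardizer()
prep_machine = machine(prep_pipe, MLJ.table(X_cat_coerced))
fit!(prep_machine)
X_prepped_df = DataFrame(MLJ.transform(prep_machine, MLJ.table(X_cat_coerced)))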
Binning, or discretization, involves converting continuous numerical features into discrete categorical ones (bins). This can be useful, for example, to dampen the effect of small fluctuations or outliers, to let simple models capture non-linear relationships, or to satisfy algorithms that expect categorical inputs.
Equal-width binning divides the range of the feature into a specified number of bins, each having the same width. The width of each bin is $(X_{\max} - X_{\min}) / N_{\text{bins}}$.
MLJ offers UnivariateDiscretizer for discretizing a single continuous feature into n_classes ordered categories. Note, however, that it derives its bin boundaries from quantiles of the training data, so its bins are closer to equal-frequency than strictly equal-width; for strict equal-width bins you can compute the edges yourself, as sketched after the example below.
# Sample continuous data (the Float64 'age' feature from X_df)
age_feature = X_df.age

# Discretize into 3 ordered classes; UnivariateDiscretizer operates on a single
# continuous vector (not a table) and picks bin boundaries from quantiles of the data
binner_model = UnivariateDiscretizer(n_classes=3)  # n_classes specifies the number of bins
binner_machine = machine(binner_model, age_feature)
fit!(binner_machine)

age_binned = MLJ.transform(binner_machine, age_feature)
age_binned_df = DataFrame(age = age_binned)
# println(age_binned_df.age)  # shows a CategoricalValue (bin) for each original age
The output age_binned_df.age will contain categorical values representing the bins.
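If you specifically need equal-width bins, a minimal hand-rolled sketch using CategoricalArrays.cut (assuming 3 bins over the observed range) could look like this:
using CategoricalArrays

# Equal-width edges over the observed range; cut assigns each value to its interval
n_bins = 3
lo, hi = extrema(age_feature)
edges = collect(range(lo, hi; length = n_bins + 1))
age_equal_width = cut(age_feature, edges; extend = true)
# println(levels(age_equal_width))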
Here's a visual representation of how a continuous feature's distribution changes after equal-width binning:
The first histogram shows the original distribution of ages. The second shows the same data grouped into three equal-width bins.
Equal-frequency binning divides the continuous feature into bins such that each bin contains approximately the same number of observations. The bin edges are determined by quantiles (e.g., quartiles for 4 bins, deciles for 10 bins). This method can be more effective for skewed data distributions.
As noted above, MLJ's UnivariateDiscretizer already chooses its bins by quantiles, so it gives approximately equal-frequency bins out of the box. If you want explicit control over the edges, you can use functions from StatsBase.jl (like quantile to find bin edges) and then apply these cuts yourself, or look into packages like FeatureTransforms.jl, which offer additional discretization options.
For example, using StatsBase.quantile to define edges:
using StatsBase
# Using age_feature from before
quantiles = quantile(age_feature, [0, 0.25, 0.5, 0.75, 1.0]) # For 4 bins (quartiles)
# You would then use these `quantiles` as custom bin edges.
# This often requires manual mapping or a custom transformer in MLJ.
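One simple way to apply those edges is with CategoricalArrays.cut (a sketch; duplicate quantile values are removed first so the breaks are strictly increasing):
using CategoricalArrays

# Assign each age to its quartile interval, giving approximately equal-frequency bins
age_quartile_binned = cut(age_feature, unique(quantiles); extend = true)
# println(age_quartile_binned)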
The number of bins is an important parameter. Too few bins can lead to loss of information, while too many bins might not provide the desired smoothing or generalization. The choice often depends on domain knowledge, experimentation, or heuristics like Sturges' formula or Freedman-Diaconis rule, though these are just starting points.
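As a quick illustration, these rules of thumb are simple to compute directly (a sketch; both give starting points only, and the Freedman-Diaconis width assumes a nonzero interquartile range):
# Sturges' formula and the Freedman-Diaconis rule for n observations
n = length(age_feature)
sturges_bins = ceil(Int, log2(n)) + 1

iqr = quantile(age_feature, 0.75) - quantile(age_feature, 0.25)  # interquartile range
fd_width = 2 * iqr / n^(1/3)
fd_bins = ceil(Int, (maximum(age_feature) - minimum(age_feature)) / fd_width)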
Data transformations like scaling, encoding, and binning are fundamental steps in preparing data for machine learning. In MLJ.jl, these transformations are typically encapsulated as models and can be integrated into larger machine learning pipelines. This allows you to define a sequence of preprocessing steps along with your final predictive model, ensuring that data is consistently transformed during training, evaluation, and prediction. You'll learn more about constructing such pipelines in Chapter 5.
Remember that the specific transformations needed will depend on your data and the machine learning algorithms you plan to use. Always apply transformations learned from the training set consistently to any new data to maintain integrity and avoid data leakage.