With a solid grasp of feature engineering principles, attention now turns to applying these techniques in Julia. Crafting new, insightful features from your existing dataset is often a game-changer for machine learning model performance. Julia, especially when paired with the DataFrames.jl
package, offers a powerful and expressive environment for this critical step in the data preparation pipeline.
The art of feature engineering involves transforming raw data into a format that better represents the underlying problem to your learning algorithms. Here are common techniques and how to implement them in Julia:
Raw numerical features can often be transformed to capture more complex relationships or to fit model assumptions better.
Polynomial Features: If you suspect a non-linear relationship between a feature and the target variable, polynomial features can be beneficial. For a feature x, you might create x^2, x^3, and so on. In Julia, this is straightforward using element-wise operations on DataFrame columns.
using DataFrames
# Assume df is an existing DataFrame with a column :Age
# df = DataFrame(Age = [25, 30, 35, 40, 45])
# To create an Age_Squared feature:
# df.Age_Squared = df.Age .^ 2
This adds a new column Age_Squared to your DataFrame, where each value is the square of the corresponding Age value.
Interaction Features: These are created by combining two or more features, typically through multiplication or addition. They can capture synergistic effects where the combined impact of features is different from their individual effects. For features x1 and x2, an interaction feature could be x1⋅x2.
# Assume df has columns :Price and :Quantity
# df = DataFrame(Price = [10, 20, 15], Quantity = [2, 1, 3])
# To create a Total_Cost feature (Price * Quantity):
# df.Total_Cost = df.Price .* df.Quantity
Beyond simple one-hot or label encoding (covered in data transformation), you can derive more sophisticated features from categorical data.
Frequency Encoding: This technique replaces each category with its frequency or count in the dataset. It can be useful if the prevalence of a category is informative.
using DataFrames
# Assume df has a :City column
# df = DataFrame(City = ["London", "Paris", "London", "Tokyo", "Paris", "London"])
# Calculate frequencies
city_frequencies = combine(groupby(df, :City), nrow => :City_Frequency)
# Join back to the original DataFrame
# df = leftjoin(df, city_frequencies, on = :City)
After this operation, df will have a new City_Frequency column (e.g., for "London", the value would be 3).
Target Encoding (Mean Encoding): This powerful technique replaces a category with the average value of the target variable for that category. For example, if you're predicting house prices, you might replace a city category with the average house price in that city.
Caution: Target encoding carries a high risk of data leakage if not implemented carefully. The encoding should be derived only from the training set and then applied to the validation/test sets. For reliable results, it's often performed within a cross-validation loop. Due to the complexity of doing it correctly, a detailed implementation is not covered in this section, but it's an important technique to be aware of for more advanced modeling.
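That said, the core idea of fitting the encoding on the training data only can be sketched briefly. This is a minimal illustration, not a leakage-safe cross-validated implementation; the column names and data below are hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical training and test splits
train = DataFrame(City = ["London", "Paris", "London", "Tokyo"],
                  Price = [300.0, 250.0, 320.0, 400.0])
test  = DataFrame(City = ["Paris", "London", "Berlin"])

# Compute the mean target per category on the TRAINING data only
city_means = combine(groupby(train, :City), :Price => mean => :City_TargetMean)

# Apply the learned encoding to the test set; unseen categories become `missing`
test = leftjoin(test, city_means, on = :City)

# Fall back to the overall training mean for unseen categories (e.g., "Berlin")
overall_mean = mean(train.Price)
test.City_TargetMean = coalesce.(test.City_TargetMean, overall_mean)
```

The key point is that `city_means` is computed exclusively from `train`; a production version would repeat this inside each fold of a cross-validation loop.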
If your dataset includes date or time columns, extracting components can create valuable features. The Dates module in Julia is indispensable here.
using DataFrames, Dates
# Assume df has a :TransactionDate column (e.g., Vector{Date})
# df = DataFrame(TransactionDate = [Date(2023,1,15), Date(2023,1,20), Date(2023,2,10)])
# Extract month
df.Transaction_Month = month.(df.TransactionDate)
# Extract day of the week (1=Monday, 7=Sunday)
df.Transaction_DayOfWeek = dayofweek.(df.TransactionDate)
# Extract year
df.Transaction_Year = year.(df.TransactionDate)
# You can also create boolean features, e.g., Is_Weekend
# df.Is_Weekend = dayofweek.(df.TransactionDate) .>= 6
These new features can capture seasonality, trends, or day-specific patterns.
While full-scale natural language processing (NLP) is a broad topic, you can extract simple yet effective features from text data:
using DataFrames
# Assume df has a :CommentText column
# df = DataFrame(CommentText = ["Great product!", "Not satisfied.", "Excellent service and quality."])
# Text length
df.Comment_Length = length.(df.CommentText)
# Word count (simple version)
df.Comment_WordCount = length.(split.(df.CommentText)) # split by space
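Beyond length and word count, simple indicator and count features can also be derived with Base string functions. A small sketch, reusing the same example data; the keyword list and column names here are illustrative choices, not fixed conventions:

```julia
using DataFrames

# Same hypothetical comments as above
df = DataFrame(CommentText = ["Great product!", "Not satisfied.", "Excellent service and quality."])

# Keyword indicator: does the comment mention one of these (illustrative) positive terms?
positive_terms = ["great", "excellent"]
df.Mentions_Positive = [any(occursin(t, lowercase(c)) for t in positive_terms) for c in df.CommentText]

# Count of exclamation marks as a crude intensity signal
df.Exclamation_Count = [count(==('!'), c) for c in df.CommentText]
```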
DataFrames.jl is central to feature engineering in Julia. Its functions allow for flexible and efficient column manipulation.
transform and transform!: These functions are highly useful for adding new columns based on existing ones. transform! modifies the DataFrame in place, while transform returns a new DataFrame. You can apply functions to columns, often using ByRow for row-wise operations or broadcasting.
using DataFrames
# df = DataFrame(A = [1, 2, 3], B = [4, 5, 6])
# Using transform! to add a new column C = A + B
# transform!(df, [:A, :B] => ByRow(+) => :C)
# Using transform to create a new DataFrame with an additional column D = A * 2
# df_new = transform(df, :A => ByRow(x -> x * 2) => :D)
Custom Functions: For more complex feature logic, define your own Julia functions and then apply them.
using DataFrames
# df = DataFrame(Value = [10, 25, 5, 40])
function categorize_value(v)
if v < 10
return "Low"
elseif v < 30
return "Medium"
else
return "High"
end
end
# Apply custom function to create a new :Value_Category column
# df.Value_Category = categorize_value.(df.Value)
# Or using transform:
# transform!(df, :Value => ByRow(categorize_value) => :Value_Category_Transform)
The following diagram shows how different types of raw features can be used to engineer new, potentially more useful features.
Original data features are processed through various engineering techniques in Julia to generate new features that can enhance model understanding and performance.
Let's walk through a small example combining some of these techniques. Suppose we have a dataset of product sales.
using DataFrames, Dates
# Initial DataFrame
sales_df = DataFrame(
ProductID = ["A01", "B02", "A01", "C03"],
SalePrice = [19.99, 25.00, 18.50, 99.99],
UnitsSold = [10, 5, 12, 2],
SaleDate = [Date(2023, 3, 10), Date(2023, 3, 12), Date(2023, 4, 1), Date(2023, 4, 5)]
)
println("Original DataFrame:")
println(sales_df)
# 1. Create an interaction feature: TotalRevenue
sales_df.TotalRevenue = sales_df.SalePrice .* sales_df.UnitsSold
# 2. Extract SaleMonth from SaleDate
sales_df.SaleMonth = month.(sales_df.SaleDate)
# 3. Create a feature: Price_Per_Unit (if meaningful, otherwise skip)
# For this example, let's assume SalePrice is already per unit.
# If SalePrice was for a bundle, dividing by UnitsSold might create such a feature.
# 4. Frequency encode ProductID (as a proxy for product popularity in this batch)
product_counts = combine(groupby(sales_df, :ProductID), nrow => :Product_SaleCount)
sales_df = leftjoin(sales_df, product_counts, on = :ProductID)
println("\nDataFrame with Engineered Features:")
println(sales_df)
Output of the example:
Original DataFrame:
4×4 DataFrame
 Row │ ProductID  SalePrice  UnitsSold  SaleDate
     │ String     Float64    Int64      Date
─────┼───────────────────────────────────────────
   1 │ A01            19.99         10  2023-03-10
   2 │ B02            25.0           5  2023-03-12
   3 │ A01            18.5          12  2023-04-01
   4 │ C03            99.99          2  2023-04-05

DataFrame with Engineered Features:
4×7 DataFrame
 Row │ ProductID  SalePrice  UnitsSold  SaleDate    TotalRevenue  SaleMonth  Product_SaleCount
     │ String     Float64    Int64      Date        Float64       Int64      Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ A01            19.99         10  2023-03-10        199.9           3                  2
   2 │ A01            18.5          12  2023-04-01        222.0           4                  2
   3 │ B02            25.0           5  2023-03-12        125.0           3                  1
   4 │ C03            99.99          2  2023-04-05        199.98          4                  1
This example demonstrates how new columns like TotalRevenue, SaleMonth, and Product_SaleCount are added, each potentially providing new signals for a machine learning model.
A few practical notes: prefer broadcasting (dot syntax like df.A .* df.B) or functions from DataFrames.jl that operate on entire columns, as this is generally more performant than iterating row by row in user code. Feature engineering also fits into the broader modeling workflow, for example through MLJ.jl, and newly created numerical features might themselves require scaling or normalization.
By applying these techniques in Julia, you can significantly improve the quality of your input data, creating a path for more accurate machine learning models. The practical section later in this chapter will provide an opportunity to apply these skills to a dataset.
© 2025 ApX Machine Learning