This section serves as a practical application of the methods discussed earlier in the chapter. We will take a raw dataset, apply cleaning techniques, and then engineer new features to make it more suitable for machine learning models. This hands-on exercise uses DataFrames.jl for data manipulation, along with other helpful Julia packages.
For this exercise, we'll work with a small, representative dataset of product sales. Imagine this data comes from an e-commerce platform. It includes product information, sales figures, and customer reviews. Our goal is to clean this data and derive new, potentially predictive features.
First, let's ensure we have the necessary Julia packages. You'll primarily need DataFrames.jl, CSV.jl (for parsing data, though we'll define the dataset inline here), Statistics.jl for calculations like the mean, CategoricalArrays.jl for binning, and Plots.jl for visualization (the plots in this section appear as described figures rather than rendered images).
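If any of these are missing from your environment, a one-time setup might look like the sketch below. Statistics ships with Julia's standard library, so it usually doesn't need installing.

using Pkg
Pkg.add(["DataFrames", "CSV", "CategoricalArrays", "Plots"])  # install once per environment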
using DataFrames, CSV, Statistics, CategoricalArrays # Plots.jl for actual plotting
Let's define our data. In a typical scenario, you would load this from a CSV file using CSV.read("your_file.csv", DataFrame). For this example, we'll create the DataFrame directly:
data_string = """
ProductID,Category,Price,UnitsSold,DiscountApplied,ReviewScore
1,Electronics,799.99,15,0.1,4.5
2,Books,24.50,120,,3.8
3,Electronics,1200.00,5,0.15,4.9
4,Home Goods,49.95,60,0.05,4.2
5,Books,19.99,200,0.0,
6,Electronics,99.99,30,0.07,2.5
7,Home Goods,89.00,25,0.1,9.9
8,Books,32.00,80,0.02,4.1
"""
df = CSV.read(IOBuffer(data_string), DataFrame, missingstring=["", "NA"])  # missingstring (singular) is the current CSV.jl keyword
println("Initial DataFrame (first 5 rows):")
println(first(df, 5))
println("\nDescription of the DataFrame:")
println(describe(df, :eltype, :mean, :median, :nmissing))
The output from describe immediately highlights a few things:

- DiscountApplied has one missing value; its type is Union{Missing, Float64}.
- ReviewScore also has one missing value and is of type Union{Missing, Float64}.
- Price and UnitsSold appear complete.

Clean data is foundational for reliable machine learning. We'll tackle missing values and potential outliers.
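Before fixing anything, you can also confirm the missing counts programmatically. A small sketch using DataFrames' mapcols:

# Count missing entries in every column at once
println(mapcols(col -> count(ismissing, col), df))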
We have missing data in DiscountApplied and ReviewScore. Different strategies can be applied.

Imputing DiscountApplied:

For DiscountApplied, a missing value might indicate no discount was applied, or it could be a data entry error. With only a small number of missing values, imputing the mean or median is a common approach for numerical data. Let's use the mean. The skipmissing function is helpful here to calculate statistics on the available data.
mean_discount = mean(skipmissing(df.DiscountApplied))
println("Mean discount (ignoring missings): ", round(mean_discount, digits=2))
# Impute missing values using coalesce
df.DiscountApplied = coalesce.(df.DiscountApplied, mean_discount)
println("\nDataFrame after imputing DiscountApplied (showing relevant columns):")
println(select(df, [:ProductID, :DiscountApplied, :ReviewScore]))
println("\nMissing values check after imputation:")
println(describe(df, :nmissing)) # Check nmissing for DiscountApplied
The coalesce function is useful here; it returns its first argument that is not missing. By broadcasting it over the original column and the calculated mean, we replace missing entries with mean_discount.
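To see the behavior in isolation, coalesce simply picks the first non-missing argument:

coalesce(missing, 0.07)  # returns 0.07 (falls back to the second argument)
coalesce(0.1, 0.07)      # returns 0.1 (first argument is not missing)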
Handling Missing ReviewScore:

For ReviewScore, a missing value means we don't know the customer's opinion. Depending on the downstream task and the amount of missing data, we could impute it (e.g., with the median review score) or remove the affected rows. If review quality is very important and imputation might introduce bias, removal can be the safer choice, provided it doesn't discard too much data. For this example, let's remove rows where ReviewScore is missing. We will operate on a new DataFrame, df_cleaned, to preserve df.
df_cleaned = dropmissing(df, :ReviewScore)
println("\nDataFrame after dropping rows with missing ReviewScore:")
println(df_cleaned)
println("\nMissing values check for df_cleaned:")
println(describe(df_cleaned, :nmissing))
Now, df_cleaned has no missing values in ReviewScore. We lost one row of data (ProductID 5).
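If dropping rows were too costly, median imputation is a reasonable alternative. A minimal sketch (df_imputed is a hypothetical name, not used later in this section):

median_score = median(skipmissing(df.ReviewScore))  # median over the available scores
df_imputed = copy(df)                               # keep the original df untouched
df_imputed.ReviewScore = coalesce.(df_imputed.ReviewScore, median_score)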
Outliers are data points that differ significantly from other observations. They can skew statistical analyses and degrade model performance. The ReviewScore column had an entry of 9.9 in our original data, which looks suspicious if scores are typically on a 1-5 scale.
Let's visualize ReviewScore from df_cleaned (which still contains the 9.9 value before capping) using a box plot to help spot this.
A box plot of review scores. The point at 9.9 is visibly distant from the others, suggesting it might be an outlier or an error.
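Beyond eyeballing a plot, a common numeric heuristic is the 1.5×IQR rule. A minimal sketch (the 1.5 multiplier is a convention, not a law):

# Flag scores far outside the interquartile range
q1, q3 = quantile(df_cleaned.ReviewScore, [0.25, 0.75])
iqr = q3 - q1
is_outlier = (df_cleaned.ReviewScore .< q1 - 1.5 * iqr) .| (df_cleaned.ReviewScore .> q3 + 1.5 * iqr)
println(df_cleaned[is_outlier, [:ProductID, :ReviewScore]])  # shows the 9.9 row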
Assuming valid review scores should fall between 1 and 5, we can cap the ReviewScore values. The clamp function is well suited to this.
df_cleaned.ReviewScore = clamp.(df_cleaned.ReviewScore, 1.0, 5.0)  # broadcast clamp caps each score to [1.0, 5.0]
println("\nDataFrame after capping ReviewScore (1.0 to 5.0):")
println(select(df_cleaned, [:ProductID, :ReviewScore]))
println("\nDescription of ReviewScore after capping:")
println(describe(df_cleaned, cols=:ReviewScore))
The score 9.9 is now adjusted to 5.0, the maximum of our valid range.
Feature engineering is the process of using domain knowledge to create new features from existing data that make machine learning algorithms work better.
Creating a Revenue Feature

Our dataset has Price, UnitsSold, and DiscountApplied. We can combine these to calculate the actual revenue generated by each product sale. The formula for revenue could be: Revenue = Price × UnitsSold × (1 − DiscountApplied).
df_cleaned.Revenue = df_cleaned.Price .* df_cleaned.UnitsSold .* (1.0 .- df_cleaned.DiscountApplied)
println("\nDataFrame with new 'Revenue' feature:")
println(select(df_cleaned, [:ProductID, :Price, :UnitsSold, :DiscountApplied, :Revenue]))
Notice the use of . for broadcasted element-wise operations, which is idiomatic Julia for working with arrays and DataFrame columns. This new Revenue column could be a very informative target variable for regression tasks or a useful feature for other models.
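For transformation pipelines, the same computation can be written with DataFrames' transform! and ByRow. An equivalent sketch:

# Same Revenue computation in pipeline style; overwrites the :Revenue column
transform!(df_cleaned, [:Price, :UnitsSold, :DiscountApplied] =>
    ByRow((p, u, d) -> p * u * (1 - d)) => :Revenue)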
Binning Price into Categories

Sometimes, continuous numerical features are more useful when converted into categorical bins. For example, we can categorize products as Low, Medium, or High priced. This is done using the cut function from CategoricalArrays.jl.
price_bins = [0, 50, 500, Inf] # Intervals: [0, 50), [50, 500), [500, Inf] -- cut uses left-closed bins
price_labels = ["Low", "Medium", "High"]
df_cleaned.PriceCategory = cut(df_cleaned.Price, price_bins, labels=price_labels, extend=true)
println("\nDataFrame with 'PriceCategory' feature:")
println(select(df_cleaned, [:ProductID, :Price, :PriceCategory, :Revenue]))
The extend=true argument tells cut to widen the outermost intervals when needed, so the minimum and maximum of the data are always covered. This PriceCategory feature might reveal patterns in sales or reviews related to price segments.
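Since cut returns an ordered CategoricalArray, the levels carry a meaningful order, which you can verify:

println(levels(df_cleaned.PriceCategory))     # ["Low", "Medium", "High"]
println(isordered(df_cleaned.PriceCategory))  # true; comparisons respect Low < Medium < High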
Flag Feature from Category

We can create a specific feature that flags whether a product belongs to a certain category. For instance, an IsElectronic feature.
df_cleaned.IsElectronic = (df_cleaned.Category .== "Electronics")
println("\nDataFrame with 'IsElectronic' feature:")
println(select(df_cleaned, [:ProductID, :Category, :IsElectronic]))
This boolean feature can be directly used by many machine learning algorithms.
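If you wanted one flag per category rather than a single indicator, a small generalization sketch (the column names, such as IsHomeGoods, are generated on the fly and are not used later):

# One boolean column per category, e.g. IsElectronics, IsBooks, IsHomeGoods
for cat in unique(df_cleaned.Category)
    df_cleaned[!, Symbol("Is", replace(cat, " " => ""))] = df_cleaned.Category .== cat
end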
Let's look at our fully processed DataFrame and visualize some of the new features.
println("\nFinal processed DataFrame (df_cleaned):")
println(df_cleaned)
Now, let's visualize the distribution of our new Revenue feature.
Histogram showing the distribution of the Revenue feature. Most products have revenues in the lower range, with one product having a significantly higher revenue.
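If you want to render this histogram yourself, a minimal sketch with Plots.jl (assuming it is installed; bin count is a stylistic choice):

using Plots
histogram(df_cleaned.Revenue, bins=5, legend=false,
    xlabel="Revenue", ylabel="Number of products")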
And the counts for our PriceCategory feature:
price_category_counts = combine(groupby(df_cleaned, :PriceCategory), nrow => :Count)  # by() was removed from DataFrames.jl; groupby/combine is the current idiom
# For the plot, we need actual values for x and y
# PriceCategory values in df_cleaned: High, Low, High, Low, Medium, Medium, Low
# Counts: Low: 3, Medium: 2, High: 2
Bar chart illustrating the number of products in each defined price category: Low, Medium, and High.
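The corresponding bar chart can be drawn from price_category_counts, again assuming Plots.jl is loaded as in the previous sketch:

bar(string.(price_category_counts.PriceCategory), price_category_counts.Count,
    legend=false, xlabel="PriceCategory", ylabel="Count")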
In this hands-on section, we've walked through a typical data preparation workflow:

- Imputed missing DiscountApplied values with the column mean.
- Removed rows with missing ReviewScore values.
- Capped an outlier in the ReviewScore column.
- Engineered Revenue from existing sales data.
- Binned Price into PriceCategory.
- Created an IsElectronic flag from the Category column.

These steps have transformed our raw dataset into a more structured and potentially more informative format, ready for the subsequent stages of a machine learning project, such as model training and evaluation. Effective data preparation is often an iterative process, and the techniques shown here provide a solid foundation for handling various data quality issues and enriching your datasets.