The process of preparing data for machine learning models is as important as selecting the model itself. Imperfections in your dataset, such as missing information or anomalous data points, can significantly degrade model performance or lead to misleading conclusions. This section focuses on two common data quality issues, missing values and outliers, and how to address them using Julia's powerful data manipulation capabilities, primarily with DataFrames.jl.
Missing data is a frequent occurrence in datasets. Values might be absent due to errors in data collection, non-applicability of a feature to certain records, or various other reasons. In Julia, missing values are represented by the special missing object, and DataFrames.jl is designed to handle missing values gracefully.
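A few Base Julia behaviors are worth knowing before working with missing values in DataFrames: missing propagates through arithmetic, skipmissing skips over it, and coalesce substitutes for it.

```julia
x = [1, 2, missing]           # Vector{Union{Missing, Int64}}
s_all  = sum(x)               # missing: it propagates through arithmetic
s_some = sum(skipmissing(x))  # 3: skipmissing ignores the missing entries
filled = coalesce(missing, 0) # 0: coalesce returns its first non-missing argument
```

These three building blocks underlie most of the missing-data handling shown below.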
Before you can handle missing values, you need to find them. DataFrames.jl offers several ways to inspect your data for missing entries.
Consider a sample DataFrame:
using DataFrames, Statistics
df = DataFrame(
ID = 1:5,
Age = [25, 30, missing, 22, 28],
Salary = [50000, missing, 65000, 48000, missing],
Department = ["HR", "Engineering", "HR", missing, "Marketing"]
)
julia> df
5×4 DataFrame
 Row │ ID     Age      Salary   Department
     │ Int64  Int64?   Int64?   String?
─────┼──────────────────────────────────────
   1 │     1       25    50000  HR
   2 │     2       30  missing  Engineering
   3 │     3  missing    65000  HR
   4 │     4       22    48000  missing
   5 │     5       28  missing  Marketing
You can check for missing values in a specific column or the entire DataFrame using the ismissing function, which works element-wise:
julia> ismissing.(df.Age)
5-element BitVector:
0
0
1
0
0
julia> ismissing.(Matrix(df))
5×4 BitMatrix:
 0  0  0  0
 0  0  1  0
 0  1  0  0
 0  0  0  1
 0  0  1  0
(Broadcasting ismissing directly over a DataFrame returns a DataFrame of Bool values; converting to a Matrix first yields the compact BitMatrix shown here.)
To get a summary of missing values per column, the describe function is very useful:
julia> describe(df, :nmissing)
4×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼──────────────────────
1 │ ID 0
2 │ Age 1
3 │ Salary 2
4 │ Department 1
This output clearly shows that the Age column has one missing value, Salary has two, and Department has one.
Once identified, you have several strategies to handle missing values. The choice depends on the nature of the data, the extent of missingness, and the requirements of your machine learning model.
One of the simplest approaches is to remove rows or columns that contain missing values.
Listwise Deletion (Removing Rows): If a row has one or more missing values, the entire row is discarded. The dropmissing function in DataFrames.jl facilitates this.
df_no_missing_rows = dropmissing(df)
julia> df_no_missing_rows
1×4 DataFrame
Row │ ID Age Salary Department
│ Int64 Int64 Int64 String
─────┼──────────────────────────────────
1 │ 1 25 50000 HR
You can also specify columns to consider for dropping rows:
df_no_missing_age_salary = dropmissing(df, [:Age, :Salary])
Caution: While straightforward, listwise deletion can lead to a substantial loss of data if missing values are widespread, potentially introducing bias if the missingness is not random.
Column Deletion: If a column has a very high percentage of missing values, it might provide little information and could be dropped entirely.
# Suppose 'Salary' column is deemed to have too many missings
df_no_salary_column = select(df, Not(:Salary))
This should be done judiciously, as you might discard potentially useful information.
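As an illustrative sketch (the DataFrame and the 50% cutoff here are arbitrary choices, not a fixed rule), columns whose missing fraction exceeds a threshold can be identified and dropped programmatically:

```julia
using DataFrames

df = DataFrame(ID = 1:4,
               Notes = [missing, missing, missing, "ok"])  # 75% missing

threshold = 0.5  # drop any column with more than 50% missing values
keep = [name for name in names(df) if count(ismissing, df[!, name]) / nrow(df) <= threshold]
df_reduced = select(df, keep)  # only :ID survives
```

This keeps the decision explicit and auditable rather than hard-coding column names.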
Imputation involves filling in missing values with estimated or calculated substitutes. This preserves your sample size but can introduce other biases if not done carefully.
Let's impute missing values in our df:
For the Age column (numerical), let's use the mean:
mean_age = mean(skipmissing(df.Age)) # skipmissing creates an iterator over non-missing values
df_imputed_age = deepcopy(df) # Work on a copy
df_imputed_age.Age = coalesce.(df_imputed_age.Age, round(Int, mean_age)) # coalesce replaces missing with a value
julia> mean_age
26.25
julia> df_imputed_age
5×4 DataFrame
 Row │ ID     Age    Salary   Department
     │ Int64  Int64  Int64?   String?
─────┼──────────────────────────────────────
   1 │     1     25    50000  HR
   2 │     2     30  missing  Engineering
   3 │     3     26    65000  HR           <- Imputed Age
   4 │     4     22    48000  missing
   5 │     5     28  missing  Marketing
The coalesce.(vector, value) call is a convenient way to replace all missing entries in vector with value.
For the Salary column (numerical), let's use the median:
median_salary = median(skipmissing(df.Salary))
df_imputed_salary = deepcopy(df_imputed_age) # Continue from previous imputation
df_imputed_salary.Salary = Float64.(coalesce.(df_imputed_salary.Salary, median_salary)) # convert so the column is uniformly Float64
julia> median_salary
50000.0
julia> df_imputed_salary
5×4 DataFrame
 Row │ ID     Age    Salary   Department
     │ Int64  Int64  Float64  String?
─────┼──────────────────────────────────────
   1 │     1     25  50000.0  HR
   2 │     2     30  50000.0  Engineering  <- Imputed Salary
   3 │     3     26  65000.0  HR
   4 │     4     22  48000.0  missing
   5 │     5     28  50000.0  Marketing    <- Imputed Salary
For the Department column (categorical), let's use the mode:
# Calculate mode for Department
department_counts = Dict{String, Int}()
for dept in skipmissing(df.Department)
department_counts[dept] = get(department_counts, dept, 0) + 1
end
mode_department = findmax(department_counts)[2] # findmax returns (value, key) for dicts
df_imputed_all = deepcopy(df_imputed_salary)
df_imputed_all.Department = coalesce.(df_imputed_all.Department, mode_department)
julia> mode_department
"HR"
julia> df_imputed_all
5×4 DataFrame
 Row │ ID     Age    Salary   Department
     │ Int64  Int64  Float64  String
─────┼─────────────────────────────────────
   1 │     1     25  50000.0  HR
   2 │     2     30  50000.0  Engineering
   3 │     3     26  65000.0  HR
   4 │     4     22  48000.0  HR           <- Imputed Department
   5 │     5     28  50000.0  Marketing
Using transform! or combine from DataFrames.jl can also offer more concise ways to perform these operations, especially when grouped. The MLJ.jl package, which we'll encounter later, also provides dedicated FillImputer tools for more streamlined imputation within machine learning pipelines.
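As an illustrative sketch of the grouped approach (the DataFrame and column names here are hypothetical), each missing Salary can be filled with the mean of that record's own Department:

```julia
using DataFrames, Statistics

df = DataFrame(Department = ["HR", "HR", "Engineering", "Engineering"],
               Salary = [50000, missing, 65000, missing])

# Within each department, replace missing salaries with that group's mean
df_grouped = transform(groupby(df, :Department),
                       :Salary => (s -> coalesce.(s, mean(skipmissing(s)))) => :Salary_filled)
```

Group-wise imputation often produces more plausible values than a single global mean, since salaries typically vary systematically across departments.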
Considerations for Imputation: Imputing values changes your dataset. While it allows you to retain data, it can reduce variance and potentially distort relationships between variables. It's often a good practice to create an additional binary indicator column that flags whether a value was originally missing, as the fact of missingness itself can sometimes be informative for a model.
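A minimal sketch of the indicator-column idea, shown with plain vectors (the fill value 27 stands in for a precomputed mean):

```julia
age = [25, missing, 30]
age_was_missing = ismissing.(age)  # record which entries were missing: [false, true, false]
age_imputed = coalesce.(age, 27)   # then fill them: [25, 27, 30]
```

The indicator vector can be attached to the DataFrame as its own column, letting a downstream model learn from the pattern of missingness itself.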
Outliers are data points that differ significantly from other observations. They can arise from measurement errors, experimental issues, or genuinely extreme natural variations. Outliers can unduly influence statistical analyses and machine learning models, particularly those sensitive to variance like linear regression or models using squared error loss functions.
Detecting outliers often involves a combination of visualization and statistical methods.
Visual tools are excellent for getting an initial sense of potential outliers.
Box Plots: Box plots are particularly effective for visualizing the distribution of numerical data and highlighting outliers.
Let's consider a sample dataset for a feature Measurement:
measurements = [22, 24, 25, 26, 28, 29, 30, 32, 33, 35, 55, 23, 27, 31, 60]
The values 55 and 60 appear to be higher than the bulk of the data.
Box plot indicating potential outliers in the Measurement data. Points outside the "whiskers" are often considered outliers.
Histograms and Scatter Plots: Histograms can show isolated bars at the extremes, and scatter plots can reveal points that lie far from the general cluster of data.
Z-score: The Z-score measures how many standard deviations a data point is from the mean. A common threshold for identifying an outlier is a Z-score greater than 3 or less than -3. The formula is: Z=(x−μ)/σ where x is the data point, μ is the mean, and σ is the standard deviation.
data_b = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
mean_b = mean(data_b)
std_b = std(data_b)
z_scores_b = (data_b .- mean_b) ./ std_b
julia> z_scores_b
6-element Vector{Float64}:
 -0.46356
 -0.43575
 -0.40793
 -0.38012
 -0.35231
  2.03966
In this small, skewed dataset, the Z-score for 100.0 is only about 2.04, below the common |Z| > 3 threshold. This illustrates a weakness of the Z-score method: an extreme value inflates both the mean and the standard deviation, partially masking itself. For small or skewed samples, a lower threshold of |Z| > 2 or 2.5 is sometimes used instead.
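As a quick sketch, the threshold can be applied directly with a boolean mask (the cutoff of 2.0 is chosen to suit this small sample):

```julia
using Statistics

data = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
z = (data .- mean(data)) ./ std(data)   # standardize each point
flagged = data[abs.(z) .> 2.0]          # keep only points beyond the threshold
```

Here only 100.0 exceeds the threshold, so flagged contains just that value.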
Interquartile Range (IQR) Method: This method defines outliers based on the spread of the middle 50% of the data.
q1 = quantile(data_b, 0.25)
q3 = quantile(data_b, 0.75)
iqr_val = q3 - q1
lower_bound = q1 - 1.5 * iqr_val
upper_bound = q3 + 1.5 * iqr_val
outliers_b_iqr = [x for x in data_b if x < lower_bound || x > upper_bound]
julia> q1
11.25
julia> q3
13.75
julia> iqr_val
2.5
julia> lower_bound
7.5
julia> upper_bound
17.5
julia> outliers_b_iqr
1-element Vector{Float64}:
 100.0
(Statistics.quantile interpolates linearly between order statistics: with the 6 sorted values [10.0, 11.0, 12.0, 13.0, 14.0, 100.0], the 0.25 quantile falls at index 1 + 5 × 0.25 = 2.25, a quarter of the way between 11.0 and 12.0, giving 11.25; likewise the 0.75 quantile falls at index 4.75, between 13.0 and 14.0, giving 13.75.)
The value 100.0 is correctly identified as an outlier by the IQR method.
The approach to handling outliers depends heavily on their cause and the goals of your analysis.
Deletion: If you are confident that an outlier is due to a data entry error or measurement fault, removing it might be appropriate. However, this should be done with caution, as removing genuine extreme values can lead to a loss of important information.
# Suppose df_with_salary_outlier has a :Salary column containing the outlier;
# reuse the IQR bounds computed above: lower_bound = 7.5, upper_bound = 17.5
df_cleaned = filter(row -> lower_bound <= row.Salary <= upper_bound, df_with_salary_outlier)
Transformation: Applying mathematical transformations like log, square root, or reciprocal can compress the range of the data, reducing the skewness caused by outliers. For example, if data is right-skewed, a log transformation (log.(data)) can make the distribution more symmetric. This topic is covered more broadly in data transformation, but it's relevant here.
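For instance, a log transform sharply compresses the relative spread of the skewed sample used above:

```julia
data = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
logged = log.(data)

raw_spread = maximum(data) / minimum(data)        # 10.0: the outlier dominates
log_spread = maximum(logged) / minimum(logged)    # 2.0:  far less extreme after the transform
```

Remember that any transformation must be applied consistently to training and new data, and predictions may need to be transformed back to the original scale.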
Capping/Winsorizing: This involves replacing outliers with the nearest "acceptable" value. For example, values above the upper bound (Q3+1.5×IQR) could be capped at this upper bound, and values below the lower bound (Q1−1.5×IQR) could be capped at the lower bound.
# Assume df_outliers has a numeric column :B with values [10.0, 11.0, 12.0, 100.0, 13.0],
# and reuse the IQR bounds from above: lower_bound = 7.5, upper_bound = 17.5
df_capped = deepcopy(df_outliers)
df_capped.B = map(x -> x > upper_bound ? upper_bound : x, df_capped.B)
df_capped.B = map(x -> x < lower_bound ? lower_bound : x, df_capped.B)
# df_capped.B is now [10.0, 11.0, 12.0, 17.5, 13.0]
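Base Julia's clamp expresses the same two-sided capping in a single broadcast:

```julia
b = [10.0, 11.0, 12.0, 100.0, 13.0]
b_capped = clamp.(b, 7.5, 17.5)   # values below 7.5 or above 17.5 are pulled to the bounds
```

This is both shorter and harder to get wrong than two separate map passes.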
Binning (Discretization): Grouping numerical data into discrete bins can sometimes mitigate the effect of outliers, as the outlier will fall into an extreme bin but its exact high value won't disproportionately affect the model.
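A minimal binning sketch using Base's searchsortedlast (the bin edges here are arbitrary illustrations):

```julia
edges = [0.0, 20.0, 40.0, 60.0]            # sorted bin boundaries
values = [10.0, 25.0, 33.0, 100.0]
bins = [searchsortedlast(edges, v) for v in values]   # bin index for each value
```

The outlier 100.0 simply lands in the topmost bin alongside any other large value, so its exact magnitude no longer influences the feature.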
Treat as Missing: Sometimes, outliers might be treated as missing values, and then an appropriate imputation technique can be applied.
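A short sketch combining both ideas, reusing the 17.5 upper bound from the IQR example above:

```julia
using Statistics

data = [10.0, 11.0, 12.0, 100.0, 13.0]
# Replace values beyond the bound with missing, then impute with the median of the rest
masked = [x > 17.5 ? missing : x for x in data]
data_filled = coalesce.(masked, median(skipmissing(masked)))
```

This converts the outlier problem into the missing-value problem solved earlier in this section, so any imputation strategy from above applies.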
Keep Them: If outliers represent genuine, albeit rare, phenomena that are important for your problem (e.g., fraudulent transactions in fraud detection), they should be kept and possibly analyzed separately. Using models with outlier resistance might also be a strategy.
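A quick demonstration of why robust statistics resist outliers: appending one extreme value drags the mean far more than the median.

```julia
using Statistics

clean = [10.0, 11.0, 12.0, 13.0, 14.0]
dirty = vcat(clean, 100.0)

mean(clean), mean(dirty)      # 12.0 vs roughly 26.7: the mean shifts dramatically
median(clean), median(dirty)  # 12.0 vs 12.5: the median barely moves
```

Models and loss functions built on medians or absolute errors inherit this resistance, which is why they are often preferred when genuine extreme values must be kept.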
Data cleaning is an iterative process. It requires careful examination of the data, understanding of the domain, and thoughtful consideration of the impact of each cleaning step on subsequent analyses and model building. There's no one-size-fits-all solution, and the best approach often involves a combination of techniques tailored to your specific dataset and objectives.
© 2025 ApX Machine Learning