Having covered the foundational aspects of Julia for machine learning, including its environment, type system, and core data structures, it's time to put this knowledge into practice. This section guides you through the essential steps of setting up a project-specific environment and performing fundamental data operations using Julia's arrays and the DataFrames.jl package. These skills are the building blocks for more complex data processing and model building tasks ahead.
Before we begin, ensure you have Julia installed and accessible. You can typically start a Julia session by typing julia in your terminal or command prompt. This will open the Julia Read-Eval-Print Loop (REPL), an interactive environment where you can execute Julia code.
$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version x.y.z (YYYY-MM-DD)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>
For any machine learning project, managing dependencies is important. Julia's built-in package manager, Pkg.jl, helps you create isolated environments for your projects. Let's create one for this exercise.
Enter Pkg mode: In the Julia REPL, type ] to switch to the Pkg REPL. The prompt will change to pkg>.
Activate an environment: To create and activate an environment in your current working directory (e.g., a folder for this course's exercises), use the activate command.
pkg> activate .
This command makes the current directory the active project. Julia creates the Project.toml and Manifest.toml files there the first time you add a package, if they don't already exist. Project.toml lists your direct dependencies, while Manifest.toml records the exact versions of all dependencies, ensuring reproducibility.
Add necessary packages: For this hands-on, we'll need DataFrames.jl for working with tabular data and CSV.jl for reading Comma Separated Values files (though we'll use it minimally here).
pkg> add DataFrames CSV
This command downloads the packages and their dependencies, updating your project files.
Check the status: You can see the packages installed in your current environment with st (short for status).
pkg> st
This will list CSV and DataFrames among any other packages.
Exit Pkg mode: Press the Backspace key at the empty pkg> prompt to return to the standard julia> REPL prompt.
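The same steps can also be scripted through Pkg's functional API, which may be convenient in setup scripts or notebooks. The sketch below wraps them in a helper; the name setup_project is hypothetical (it is not part of Pkg.jl), and the Pkg.add call needs network access when it runs.

```julia
using Pkg

# Hypothetical helper (not part of Pkg.jl): mirrors the Pkg REPL steps above.
function setup_project(dir::AbstractString, packages::Vector{String})
    Pkg.activate(dir)   # equivalent to `pkg> activate <dir>`
    Pkg.add(packages)   # equivalent to `pkg> add ...`; downloads from the registry
    Pkg.status()        # equivalent to `pkg> st`
end
```

Calling setup_project(".", ["DataFrames", "CSV"]) from the project folder should reproduce the interactive session shown above.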
Now, with our environment ready, let's load DataFrames for use in our session. (Note: CSV will be used later when we touch upon file loading.)
julia> using DataFrames
Arrays are fundamental for numerical computation in Julia. Let's explore some basic operations. Remember that Julia uses 1-based indexing, meaning the first element of an array is at index 1.
You can create vectors (1D arrays) and matrices (2D arrays) easily.
# Creating a vector (1D array)
vector_a = [10, 20, 30, 40, 50]
println("Vector A: ", vector_a)
# Creating a vector with a specific element type
vector_b = Float64[1.5, 2.5, 3.5]
println("Vector B (Float64): ", vector_b)
# Creating a matrix (2D array)
# Spaces separate elements in a row, semicolon starts a new row
matrix_1 = [1 2 3; 4 5 6]
println("Matrix 1:\n", matrix_1)
# Creating a matrix initialized with zeros
# zeros(Type, rows, cols)
matrix_zeros = zeros(Int, 2, 3) # A 2x3 matrix of integer zeros
println("Matrix of zeros:\n", matrix_zeros)
# Creating a matrix of ones
matrix_ones = ones(2, 2) # Type defaults to Float64 if not specified
println("Matrix of ones:\n", matrix_ones)
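To confirm what each of these constructors produced, you can query an array's element type and shape. The following is a small sketch that recreates a few of the arrays above so it runs on its own.

```julia
vector_a = [10, 20, 30, 40, 50]
vector_b = Float64[1.5, 2.5, 3.5]
matrix_1 = [1 2 3; 4 5 6]

println(eltype(vector_b))                 # Float64
println(length(vector_a))                 # 5
println(size(matrix_1))                   # (2, 3): 2 rows, 3 columns
println(ndims(matrix_1))                  # 2
println(eltype(ones(2, 2)))               # Float64
println(eltype(zeros(Int, 2, 3)) == Int)  # true
```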
Accessing individual elements or sub-sections (slices) of arrays is straightforward.
# Accessing elements (1-based indexing)
println("First element of vector_a: ", vector_a[1]) # Output: 10
println("Element at row 2, col 3 of matrix_1: ", matrix_1[2, 3]) # Output: 6
# Slicing
# Get elements from index 2 to 4 from vector_a
slice_vector_a = vector_a[2:4]
println("Slice of vector_a (2nd to 4th): ", slice_vector_a) # Output: [20, 30, 40]
# Get the first row of matrix_1
row_1 = matrix_1[1, :] # The colon ':' means all elements in that dimension
println("First row of matrix_1: ", row_1) # Output: [1, 2, 3]
# Get the second column of matrix_1
col_2 = matrix_1[:, 2]
println("Second column of matrix_1: ", col_2) # Output: [2, 5]
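One detail worth knowing when slicing: indexing with a range creates a new array, so modifying the slice leaves the original untouched. If you want a window onto the original data instead, the @view macro provides one. A minimal sketch, recreating vector_a so it runs on its own:

```julia
vector_a = [10, 20, 30, 40, 50]

copied = vector_a[2:4]        # indexing with a range allocates a new array
copied[1] = -1
println(vector_a)             # [10, 20, 30, 40, 50] (unchanged)

viewed = @view vector_a[2:4]  # a view shares memory with vector_a
viewed[1] = -1
println(vector_a)             # [10, -1, 30, 40, 50]

println(vector_a[end])        # 50; `end` is the last valid index
```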
Julia excels at numerical operations. For element-wise operations on arrays, Julia uses a "dot" syntax for broadcasting.
# Element-wise addition with a scalar (broadcasting)
vector_plus_5 = vector_a .+ 5 # Add 5 to each element of vector_a
println("Vector A + 5: ", vector_plus_5) # Output: [15, 25, 35, 45, 55]
matrix_plus_10 = matrix_1 .+ 10 # Add 10 to each element of matrix_1
println("Matrix 1 + 10:\n", matrix_plus_10)
# Element-wise multiplication between two arrays of the same size
vector_mult = vector_a .* [2, 2, 2, 2, 2]
println("Vector A element-wise multiplied by 2s: ", vector_mult)
matrix_b = [10 20 30; 40 50 60]
element_wise_prod = matrix_1 .* matrix_b # Must be same dimensions
println("Element-wise product of matrix_1 and matrix_b:\n", element_wise_prod)
The dot . before an operator (like .+, .*) tells Julia to apply that operation to each element of the array or between corresponding elements of two arrays. This is a powerful feature for writing concise and efficient numerical code.
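The dot syntax also works with ordinary functions (written after the function name), and the @. macro adds dots to every call and operator in an expression. As an illustration, the sketch below rescales vector_a to the range 0 to 1, a common preprocessing step; the variable names are chosen for this example only.

```julia
vector_a = [10, 20, 30, 40, 50]

# A dot after a function name broadcasts the function over each element
roots = sqrt.([1.0, 4.0, 9.0])           # [1.0, 2.0, 3.0]

# Min-max scaling written with explicit dots
lo, hi = extrema(vector_a)               # (10, 50)
scaled = (vector_a .- lo) ./ (hi - lo)   # [0.0, 0.25, 0.5, 0.75, 1.0]

# The same computation with @., which dots every operator for you
scaled_macro = @. (vector_a - lo) / (hi - lo)
println(scaled == scaled_macro)          # true
```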
DataFrames.jl is Julia's primary tool for working with tabular data, much like Pandas in Python or data.frames in R.
You can create a DataFrame in several ways, for instance, from collections of arrays or dictionaries.
# Ensure DataFrames is loaded (if you haven't already in your session)
# using DataFrames
# Create a DataFrame from named columns (vectors)
df_contacts = DataFrame(
    ID = [1, 2, 3, 4],
    Name = ["Alice", "Bob", "Charlie", "Diana"],
    Age = [30, 24, 39, 28],
    City = ["New York", "Paris", "London", "Berlin"]
)
println("DataFrame 'df_contacts':")
display(df_contacts) # display() renders the DataFrame as a formatted table
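For comparison, here is a sketch of two other construction routes mentioned above: a dictionary that maps column names to vectors, and a vector of named tuples with one entry per row. The two-row data is made up for illustration.

```julia
using DataFrames

# From a dictionary mapping column names to vectors
df_from_dict = DataFrame(Dict("ID" => [1, 2], "Name" => ["Alice", "Bob"]))

# From a vector of named tuples, one per row
df_from_rows = DataFrame([(ID = 1, Name = "Alice"), (ID = 2, Name = "Bob")])

println(size(df_from_dict))  # (2, 2)
println(size(df_from_rows))  # (2, 2)
```

A Dict has no inherent key order, so when column order matters the keyword or named-tuple forms are the safer choice.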
Once you have a DataFrame, you'll want to inspect its structure and content.
# Get the dimensions (rows, columns)
println("Size of df_contacts: ", size(df_contacts)) # e.g., (4, 4)
# Get column names
println("Column names: ", names(df_contacts))
# Show the first few rows
println("First 2 rows of df_contacts:")
display(first(df_contacts, 2))
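# Get summary statistics for each column
println("Summary statistics for df_contacts:")
display(describe(df_contacts))
The describe function reports the mean, minimum, median, maximum, number of missing values, and element type of each column by default. Statistics that do not apply to a column, such as the mean of a text column, are reported as nothing.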
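If you only need a few statistics, describe accepts the ones you want as symbols, and the cols keyword restricts it to particular columns. A minimal sketch, using a reduced copy of the contacts data so it runs on its own:

```julia
using DataFrames

df = DataFrame(
    Name = ["Alice", "Bob", "Charlie", "Diana"],
    Age = [30, 24, 39, 28],
)

# Only the mean, minimum, and maximum of the Age column
age_stats = describe(df, :mean, :min, :max, cols = :Age)
display(age_stats)

println(nrow(df), " rows, ", ncol(df), " columns")  # 4 rows, 2 columns
```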
You can select columns or rows based on various criteria.
# Selecting a single column (returns a Vector)
ages_vector = df_contacts.Age
println("Ages (as Vector): ", ages_vector)
# Another way to select a single column: the row selector `:` returns a copy as a Vector
cities_vector = df_contacts[:, :City]
# For the stored column itself, without copying: df_contacts[!, :City]
# For a single column as a DataFrame, pass a vector of column names: df_contacts[:, [:City]]
cities_as_df_column = df_contacts[:, [:City]]
println("Cities (as DataFrame):")
display(cities_as_df_column)
# Selecting multiple columns (returns a new DataFrame)
name_and_city_df = df_contacts[:, [:Name, :City]]
println("Names and Cities:")
display(name_and_city_df)
# Selecting rows by index
first_two_rows = df_contacts[1:2, :]
println("First two rows by index:")
display(first_two_rows)
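The difference between the `!` and `:` row selectors is easy to check directly. The sketch below uses a small made-up DataFrame and also shows select, a function-style alternative to indexing for picking columns.

```julia
using DataFrames

df = DataFrame(Name = ["Alice", "Bob"], Age = [30, 24])

println(df[!, :Age] === df.Age)       # true: `!` returns the stored column itself
println(df[:, :Age] === df.Age)       # false: `:` returns a copy
println(df[:, [:Age]] isa DataFrame)  # true: a vector of names yields a DataFrame

names_only = select(df, :Name)        # a new DataFrame with just the Name column
display(names_only)
```

Working on a copy is the safer default; reach for `!` when you deliberately want changes to affect the DataFrame's own column.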
Filtering rows based on conditions is a common operation.
# People older than 30
older_than_30 = df_contacts[df_contacts.Age .> 30, :] # Note the broadcasting dot for comparison
println("Contacts older than 30:")
display(older_than_30)
# Contacts from New York using the filter function
# filter(column => condition_function, dataframe)
from_new_york = filter(:City => city -> city == "New York", df_contacts)
println("Contacts from New York (using filter):")
display(from_new_york)
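Conditions can also be combined. As a sketch under the same contact data (recreated here so the example runs on its own), the first form builds a Boolean mask with the element-wise `.&` operator, and the second uses subset with ByRow, which keeps rows where every listed condition holds.

```julia
using DataFrames

df = DataFrame(
    Name = ["Alice", "Bob", "Charlie", "Diana"],
    Age = [30, 24, 39, 28],
    City = ["New York", "Paris", "London", "Berlin"],
)

# Boolean mask: the parentheses are required because .& binds tighter than .>
mask_result = df[(df.Age .> 25) .& (df.City .!= "London"), :]

# subset: each pair is column => row-wise predicate; all must be true
subset_result = subset(df, :Age => ByRow(>(25)), :City => ByRow(!=("London")))

println(mask_result.Name)    # ["Alice", "Diana"]
println(subset_result.Name)  # ["Alice", "Diana"]
```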
While we created df_contacts programmatically, you'll frequently work with data from external files, such as CSVs. The CSV.jl package, which we added to our environment, is used for this. Here's a quick look at how you would load a CSV file:
# Content of a file named 'sample_data.csv':
# product_id,category,price
# P101,Electronics,299.99
# P102,Books,24.50
# P103,Electronics,49.00
# To load this file (assuming it's in your working directory):
# using CSV # Make sure CSV.jl is loaded
# df_products = CSV.read("sample_data.csv", DataFrame)
# display(df_products)
We provide this as an example of common practice. Detailed data loading and saving techniques will be covered in Chapter 2. For now, mastering operations on programmatically created DataFrames is sufficient.
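If you would like to try the load step without preparing a file by hand, the sketch below writes the sample content to a temporary file and reads it back. The temporary path is an assumption made for this demo; in a real project you would point CSV.read at your own file.

```julia
using CSV, DataFrames

csv_text = """
product_id,category,price
P101,Electronics,299.99
P102,Books,24.50
P103,Electronics,49.00
"""

# Write the sample content to a temporary location for the demo
path = joinpath(mktempdir(), "sample_data.csv")
write(path, csv_text)

df_products = CSV.read(path, DataFrame)
display(df_products)
println(size(df_products))          # (3, 3)
println(eltype(df_products.price))  # Float64
```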
In this hands-on section, you've taken important first steps, including using Pkg.jl to manage a project environment and add necessary packages like DataFrames.jl and CSV.jl.
These fundamental operations with arrays and DataFrames are skills you will use repeatedly in any Julia-based machine learning project. They form the groundwork for the more advanced data preparation, model building, and evaluation techniques you'll encounter in subsequent chapters. With these basics in hand, you are well-equipped to proceed to more intricate data handling tasks.
© 2025 ApX Machine Learning