Having discussed the significance of version control with Git in managing data engineering projects, let's put theory into practice. This section provides hands-on exercises with the basic Git commands you'll use frequently. Consistent use of version control is a hallmark of professional software development, and it's equally important in data engineering for tracking changes to code, configurations, and sometimes even documentation.
To follow along, you'll need Git installed on your system. If you haven't installed it yet, you can find instructions on the official Git website (https://git-scm.com/downloads). Once installed, open your preferred command-line interface (Terminal on macOS/Linux, Git Bash or Command Prompt/PowerShell on Windows).
The first step in tracking a project with Git is to initialize a repository. A Git repository is essentially a hidden directory (.git) within your project folder where Git stores all the history and metadata for your project.
mkdir my-data-project
cd my-data-project
git init
You should see output similar to:
Initialized empty Git repository in /path/to/your/my-data-project/.git/
Now, my-data-project is a Git repository. Any files you add here can be tracked.
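To confirm that initialization worked, you can ask Git directly rather than hunting for the hidden folder by hand. The following is a minimal sketch. It assumes a POSIX-style shell (Bash, or Git Bash on Windows) and works in a throwaway directory created with mktemp, an assumption made here so the example never touches a real project.

```shell
# Work in a throwaway directory so nothing in a real project is touched
repo="$(mktemp -d)/my-data-project"
mkdir -p "$repo"
cd "$repo"

git init --quiet

# The hidden .git directory now sits inside the (otherwise empty) project
ls -a

# Ask Git whether the current directory is inside a working tree
inside="$(git rev-parse --is-inside-work-tree)"
echo "inside work tree: $inside"
```

Running git rev-parse --is-inside-work-tree outside any repository fails with an error, which makes it a handy check in scripts.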
The core Git workflow involves making changes to your files, adding those changes to a "staging area," and then committing them permanently (with a message) to the repository's history.
Create a file: Let's create a simple text file to simulate a project asset, like a script or notes.
# On Linux/macOS
echo "Initial configuration for data pipeline" > config.txt
# On Windows (Command Prompt)
echo Initial configuration for data pipeline > config.txt
# On Windows (PowerShell)
"Initial configuration for data pipeline" | Out-File -Encoding UTF8 config.txt
Check the status: Use git status to see what Git knows about your project files.
git status
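The output will look something like this (your branch may be called master instead of main, depending on your Git version and its init.defaultBranch setting):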
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
config.txt
nothing added to commit but untracked files present (use "git add" to track)
Git sees config.txt but tells you it's "untracked." This means Git isn't monitoring it for changes yet.
Stage the file: Use git add to move the file to the staging area. This tells Git you want to include the current version of this file in the next commit.
git add config.txt
Tip: You can stage all new and modified files in the current directory and its subdirectories with git add . (note the period). Use it with care so that you don't accidentally stage files you never meant to track.
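One way to apply that caution is to preview what git add . would pick up before running it for real. The sketch below uses the --dry-run option of git add. The second file, notes.txt, is a hypothetical name added only for this illustration, and the temporary directory is likewise an assumption.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Hypothetical project files, created only for this illustration
echo "Initial configuration for data pipeline" > config.txt
echo "scratch notes" > notes.txt

# --dry-run (-n) lists what would be staged without changing the staging area
preview="$(git add --dry-run .)"
echo "$preview"

# Nothing has been staged, so both files are still reported as untracked (??)
status="$(git status --short)"
echo "$status"
```

If the preview lists something you don't want in the repository, stage files individually by name instead.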
Check status again:
git status
The output now changes:
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: config.txt
Git now shows config.txt under "Changes to be committed." It's staged.
Commit the changes: Use git commit to save the staged changes to the repository history. The -m flag allows you to provide a descriptive commit message directly.
git commit -m "Add initial pipeline configuration file"
You'll see output confirming the commit:
[main (root-commit) abc1234] Add initial pipeline configuration file
1 file changed, 1 insertion(+)
create mode 100644 config.txt
The abc1234 part is the abbreviated form of the unique identifier (hash) for this commit. The value on your machine will differ.
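If git commit stops and asks you to tell it who you are, Git has no author identity configured yet. The sketch below sets one for a single repository and then prints the commit hash in both its full and abbreviated forms. The name and email are placeholders, and the temporary directory is an assumption made to keep the example self-contained.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Placeholder identity, stored for this repository only (no --global)
git config user.name "Your Name"
git config user.email "your.email@example.com"

echo "Initial configuration for data pipeline" > config.txt
git add config.txt
git commit --quiet -m "Add initial pipeline configuration file"

# Full commit hash, and the abbreviated form shown in the commit output
full_hash="$(git rev-parse HEAD)"
short_hash="$(git rev-parse --short HEAD)"
echo "full:  $full_hash"
echo "short: $short_hash"
```

Adding --global to the two git config commands would store the identity for every repository under your user account instead.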
Check status one last time:
git status
The output should indicate a clean working directory:
On branch main
nothing to commit, working tree clean
To see the history of commits you've made, use the git log command.
git log
This will display a detailed list of commits, starting with the most recent:
commit abc1234567890defabcdef1234567890abcdef12 (HEAD -> main)
Author: Your Name <your.email@example.com>
Date: Sun Sep 17 10:30:00 2023 -0700
Add initial pipeline configuration file
Tip: For a more compact view, try git log --oneline:
git log --oneline
Output:
abc1234 (HEAD -> main) Add initial pipeline configuration file
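Beyond --oneline, git log accepts options that limit or reshape its output. The sketch below builds a two-commit history and then shows a few of them. The batch_size line and the second commit message are hypothetical, and the identity values are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Name"
git config user.email "your.email@example.com"

echo "Initial configuration for data pipeline" > config.txt
git add config.txt
git commit --quiet -m "Add initial pipeline configuration file"

# Hypothetical second change, so the history has more than one entry
echo "batch_size=500" >> config.txt
git add config.txt
git commit --quiet -m "Set batch size"

# Show only the most recent commit, with a summary of the files it touched
git log -n 1 --stat

# Custom one-line format: abbreviated hash, author name, subject
git log --pretty=format:'%h | %an | %s'
echo

# Count the commits reachable from HEAD
commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count"
```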
Often, projects generate files you don't want to track with Git, such as log files, temporary files, large data files, or sensitive information like API keys. You can tell Git to ignore specific files or patterns by creating a .gitignore file in your repository's root directory.
Create a .gitignore file:
# On Linux/macOS
touch .gitignore
# On Windows (Command Prompt) - Creates an empty file
type nul > .gitignore
# On Windows (PowerShell)
New-Item .gitignore -ItemType File
Edit the .gitignore file: Add patterns for files/directories to ignore. Open .gitignore in a text editor and add the following lines:
# Ignore log files
*.log
# Ignore files in a temporary directory
temp/
# Ignore sensitive credentials
credentials.json
Each line specifies a pattern. * is a wildcard that matches any run of characters within a file or directory name, and a trailing / limits the pattern to directories.
Stage and commit .gitignore: Since .gitignore is part of your project's configuration, you should add and commit it like any other project file.
git add .gitignore
git commit -m "Add .gitignore file"
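Now, if you create a file named app.log or a directory named temp, git status will not list them as untracked files. Keep in mind that .gitignore only applies to untracked files: a file Git already tracks keeps being tracked even if a pattern matches it.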
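To check that your patterns behave as intended, git check-ignore reports whether a given path is ignored and, with -v, which pattern matched it. The sketch below recreates the three patterns from this section in a temporary repository. The files app.log, temp/scratch.csv, and the temporary directory itself are hypothetical stand-ins for real project files.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# The same three patterns used in this section
printf '%s\n' '*.log' 'temp/' 'credentials.json' > .gitignore

# Hypothetical files for the illustration
echo "pipeline started" > app.log
mkdir temp
echo "scratch" > temp/scratch.csv
echo "Initial configuration for data pipeline" > config.txt

# -v reports which .gitignore line matched each ignored path
git check-ignore -v app.log temp/scratch.csv

# Only .gitignore and config.txt should show up as untracked
status="$(git status --short)"
echo "$status"
```

git check-ignore exits with a non-zero status when none of the given paths is ignored, so it can also be used as a condition in scripts.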
Instead of starting a new project, you often begin by working on an existing one stored on a remote hosting service like GitHub, GitLab, or Bitbucket. You get a local copy of the repository using git clone.
Find a repository URL: Go to a service like GitHub and find a public repository you're interested in. Look for a "Clone" or "Code" button, which will provide a URL (usually ending in .git).
Clone the repository: Navigate to a directory outside your my-data-project folder where you want to place this new project. Then run the clone command:
# Example using a hypothetical URL
git clone https://github.com/some-user/some-repo.git
Git will download the entire project history and create a new directory named some-repo (or whatever the repository name is). You can then cd some-repo and start working with it using the same status, add, commit, and log commands. Cloning automatically sets up a connection (called a "remote," named origin by default) to the original repository URL.
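You can see that connection with git remote -v. The sketch below avoids depending on any real hosted project by cloning from a local path, which git clone accepts in place of a URL. The directory names source-repo and cloned-repo are hypothetical, and the identity values are placeholders.

```shell
work="$(mktemp -d)"
cd "$work"

# Build a small source repository to stand in for a hosted one
git init --quiet source-repo
cd source-repo
git config user.name "Your Name"
git config user.email "your.email@example.com"
echo "Initial configuration for data pipeline" > config.txt
git add config.txt
git commit --quiet -m "Add initial pipeline configuration file"
cd "$work"

# Clone it; a local path works in place of an https:// URL
git clone --quiet source-repo cloned-repo
cd cloned-repo

# List the remotes that the clone configured (fetch and push URLs)
git remote -v

# The history came along with the files
git log --oneline

remote_name="$(git remote)"
echo "remote: $remote_name"
```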
These commands (init, status, add, commit, log, clone, and using .gitignore) form the foundation of using Git. Practice them by making more changes to your config.txt file, adding new files, staging, and committing them. Observe the output of git status and git log at each step. Mastering these basics is essential for effective collaboration and managing the evolution of your data engineering projects.
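One round of that practice loop might look like the sketch below. It also uses git diff, a command not covered above, to show the exact lines that changed before they are staged. The retries line and the second commit message are made-up examples, and the identity values are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Name"
git config user.email "your.email@example.com"

echo "Initial configuration for data pipeline" > config.txt
git add config.txt
git commit --quiet -m "Add initial pipeline configuration file"

# Modify the tracked file (the retries setting is a made-up example value)
echo "retries=3" >> config.txt

# status now reports the file as modified; diff shows the exact change
git status --short
git diff

# Stage and commit the change, then review the history
git add config.txt
git commit --quiet -m "Add retry setting"
git log --oneline

# An empty short status means the working tree is clean again
after="$(git status --short)"
echo "pending changes: '$after'"
```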
© 2025 ApX Machine Learning