Practice basic Git commands essential for managing data engineering projects. These hands-on exercises cover the Git commands used frequently. Consistent use of version control is a hallmark of professional software development, and it's equally important in data engineering for tracking changes to code, configurations, and sometimes even documentation.

To follow along, you'll need Git installed on your system. If you haven't installed it yet, you can find instructions on the official Git website (https://git-scm.com/downloads). Once installed, open your preferred command-line interface (Terminal on macOS/Linux, Git Bash or Command Prompt/PowerShell on Windows).

Initializing a Repository

The first step in tracking a project with Git is to initialize a repository. A Git repository is essentially a hidden directory (.git) within your project folder where Git stores all the history and metadata for your project.

Create a project directory: Let's make a new directory for this practice session.

    mkdir my-data-project

Navigate into the directory:

    cd my-data-project

Initialize Git: Tell Git to start tracking this directory.

    git init

You should see output similar to:

    Initialized empty Git repository in /path/to/your/my-data-project/.git/

Now, my-data-project is a Git repository.
Any files you add here can be tracked.

Checking Status, Staging, and Committing Changes

The core Git workflow involves making changes to your files, adding those changes to a "staging area," and then committing them (with a message) to the repository's history.

Create a file: Let's create a simple text file to simulate a project asset, like a script or notes.

    # On Linux/macOS
    echo "Initial configuration for data pipeline" > config.txt

    # On Windows (Command Prompt)
    echo Initial configuration for data pipeline > config.txt

    # On Windows (PowerShell)
    "Initial configuration for data pipeline" | Out-File -Encoding UTF8 config.txt

Check the status: Use git status to see what Git knows about your project files.

    git status

The output will look something like this:

    On branch main

    No commits yet

    Untracked files:
      (use "git add <file>..." to include in what will be committed)
            config.txt

    nothing added to commit but untracked files present (use "git add" to track)

Git sees config.txt but tells you it's "untracked." This means Git isn't monitoring it for changes yet. (If your output says master instead of main, your Git installation uses a different default branch name; the commands in this exercise work the same way.)

Stage the file: Use git add to move the file to the staging area. This tells Git you want to include the current version of this file in the next commit.

    git add config.txt

Tip: You can stage all modified and new files in the current directory and its subdirectories using git add . (note the period). Use it with care so that you don't accidentally stage files you don't intend to track.

Check status again:

    git status

The output now changes:

    On branch main

    No commits yet

    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
            new file:   config.txt

Git now shows config.txt under "Changes to be committed." It's staged.

Commit the changes: Use git commit to save the staged changes to the repository history.
The -m flag allows you to provide a descriptive commit message directly.

    git commit -m "Add initial pipeline configuration file"

You'll see output confirming the commit:

    [main (root-commit) abc1234] Add initial pipeline configuration file
     1 file changed, 1 insertion(+)
     create mode 100644 config.txt

The abc1234 part is the beginning of the unique identifier (hash) for this commit. Your hash will be different.

Check status one last time:

    git status

The output should indicate a clean working directory:

    On branch main
    nothing to commit, working tree clean

Viewing Commit History

To see the history of commits you've made, use the git log command.

    git log

This will display a detailed list of commits, starting with the most recent:

    commit abc1234567890defabcdef1234567890abcdef (HEAD -> main)
    Author: Your Name <your.email@example.com>
    Date:   Sun Sep 17 10:30:00 2023 -0700

        Add initial pipeline configuration file

Tip: For a more compact view, try git log --oneline:

    git log --oneline

Output:

    abc1234 (HEAD -> main) Add initial pipeline configuration file

Ignoring Files

Often, projects generate files you don't want to track with Git, such as log files, temporary files, large data files, or sensitive information like API keys. You can tell Git to ignore specific files or patterns by creating a .gitignore file in your repository's root directory.

Create a .gitignore file:

    # On Linux/macOS
    touch .gitignore

    # On Windows (Command Prompt) - Creates an empty file
    type nul > .gitignore

    # On Windows (PowerShell)
    New-Item .gitignore -ItemType File

Edit the .gitignore file: Add patterns for files/directories to ignore. Open .gitignore in a text editor and add the following lines:

    # Ignore log files
    *.log

    # Ignore files in a temporary directory
    temp/

    # Ignore sensitive credentials
    credentials.json

Each line specifies a pattern. Lines starting with # are comments. The * character is a wildcard.
A / at the end indicates a directory.

Stage and commit .gitignore: Since .gitignore is part of your project's configuration, you should add and commit it like any other project file.

    git add .gitignore
    git commit -m "Add .gitignore file"

Now, if you create a file named app.log or a directory named temp, git status will not list them as untracked files.

Cloning an Existing Repository

Instead of starting a new project, you often begin by working on an existing one stored on a remote hosting service like GitHub, GitLab, or Bitbucket. You get a local copy of the repository using git clone.

Find a repository URL: Go to a service like GitHub and find a public repository you're interested in. Look for a "Clone" or "Code" button, which will provide a URL (usually ending in .git).

Clone the repository: Navigate to a directory outside your my-data-project folder where you want to place this new project. Then run the clone command:

    # Example using a URL
    git clone https://github.com/some-user/some-repo.git

Git will download the entire project history and create a new directory named some-repo (or whatever the repository name is). You can then cd some-repo and start working with it using the same status, add, commit, and log commands. Cloning automatically sets up a connection (called a "remote," usually named origin) to the original repository URL.

These commands (init, status, add, commit, log, clone) and the .gitignore file form the foundation of using Git. Practice them by making more changes to your config.txt file, adding new files, staging, and committing them. Observe the output of git status and git log at each step. Mastering these basics is essential for effective collaboration and managing the evolution of your data engineering projects.
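
To check your understanding, the local part of the exercise (everything except cloning) can be replayed as a single script. The following is a minimal sketch, assuming a POSIX shell such as Terminal on macOS/Linux or Git Bash on Windows. The scratch directory created with mktemp, the identity "Practice User" / practice@example.com, and the file temp/scratch.txt are placeholders introduced for illustration and are not part of the exercise itself.

```shell
#!/bin/sh
# Sketch: replay init -> add -> commit -> .gitignore in a throwaway directory.
set -eu

workdir=$(mktemp -d)   # scratch location (assumption of this sketch)
cd "$workdir"

# "git init <dir>" creates the directory and initializes it in one step.
git init -q my-data-project
cd my-data-project

# Repository-local identity (placeholder values) so that committing works
# even on a machine with no global Git configuration.
git config user.name "Practice User"
git config user.email "practice@example.com"

# First commit: the configuration file.
echo "Initial configuration for data pipeline" > config.txt
git add config.txt
git commit -q -m "Add initial pipeline configuration file"

# Second commit: the ignore rules from the exercise.
printf '%s\n' '*.log' 'temp/' 'credentials.json' > .gitignore
git add .gitignore
git commit -q -m "Add .gitignore file"

# Create files that the ignore rules should hide from git status.
echo "noise" > app.log
mkdir temp
echo "scratch" > temp/scratch.txt

# --porcelain prints one line per changed or untracked path, nothing if clean.
status_output=$(git status --porcelain)
commit_count=$(git rev-list --count HEAD)

echo "commits: $commit_count"
echo "status: ${status_output:-clean}"
```

If each step behaves as described in the exercise, the script prints "commits: 2" and "status: clean", because app.log and everything under temp/ match the patterns in .gitignore. Running git check-ignore -v app.log inside the repository shows which pattern caused a given file to be ignored.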