Structured Query Language, universally known as SQL (often pronounced "sequel" or "S-Q-L"), is the standard language for interacting with relational databases. As a data engineer, you'll find that databases are central storage locations for the data you manage, and SQL is your primary tool for communicating with them. Think of it as the universal remote control for accessing and manipulating data stored in tables.
You've already learned about different storage systems, including relational databases (like PostgreSQL, MySQL, SQL Server) in Chapter 4. These databases organize data into tables, which look similar to spreadsheets, with rows representing records and columns representing attributes or features of those records. SQL provides a structured way to ask these databases questions (querying) and tell them how to change the data they hold (manipulating).
Data rarely stays static. As a data engineer, you will constantly need to:
SQL provides specific commands designed precisely for these tasks. While Chapter 4 touched upon defining database structures (Data Definition Language or DDL, like CREATE TABLE
), this section focuses on the commands used to work with the data inside those tables (Data Manipulation Language or DML).
The core operations for manipulating data in SQL are often referred to by the acronym CRUD (Create, Read, Update, Delete), though in SQL terms, they map primarily to INSERT
, SELECT
, UPDATE
, and DELETE
. Let's look at each one.
SELECT
StatementThe SELECT
statement is used to retrieve data from one or more tables. It's likely the SQL command you'll use most often.
Selecting Specific Columns: You can specify which columns you want to see.
SELECT user_id, email, registration_date
FROM users;
This query retrieves only the user_id
, email
, and registration_date
columns from a table named users
.
Selecting All Columns: The asterisk (*
) is a wildcard that selects all columns.
SELECT *
FROM users;
This retrieves every column for all rows in the users
table.
Filtering Data with WHERE
: Often, you only want rows that meet certain criteria. The WHERE
clause lets you filter the results.
SELECT user_id, email
FROM users
WHERE city = 'New York';
This query retrieves the user_id
and email
for only those users whose city
column has the value 'New York'.
You can use various operators in the WHERE
clause, such as =
, >
, <
, >=
, <=
, !=
(not equal), LIKE
(pattern matching), and combine conditions using AND
and OR
.
INSERT INTO
StatementWhen new data needs to be added to a table, you use the INSERT INTO
statement.
Syntax: You specify the table, the columns you're providing data for, and the corresponding values.
INSERT INTO users (user_id, email, city, registration_date)
VALUES (101, 'new.user@example.com', 'London', '2023-10-26');
This adds a new row to the users
table with the specified values for user_id
, email
, city
, and registration_date
. The order of values must match the order of the specified columns.
UPDATE
StatementTo change existing data within a table, you use the UPDATE
statement.
Syntax: You specify the table, the column(s) to change (SET
), and crucially, which row(s) to change (WHERE
).
UPDATE users
SET email = 'updated.email@example.com'
WHERE user_id = 101;
This command finds the row where user_id
is 101 and changes the value in the email
column for that specific row.
Important: Always use a WHERE
clause with UPDATE
. If you omit it, you will accidentally update the specified column(s) for every single row in the table, which is rarely intended and can cause significant data corruption.
DELETE FROM
StatementTo remove entire rows from a table, you use the DELETE FROM
statement.
Syntax: You specify the table and which rows to remove using the WHERE
clause.
DELETE FROM users
WHERE user_id = 101;
This command removes the row (or rows) where the user_id
is 101.
Important: Like UPDATE
, always be cautious and use a WHERE
clause with DELETE
. Omitting the WHERE
clause will delete all rows from the table. Double-check your WHERE
condition before executing a DELETE
statement.
These basic SQL manipulation commands are workhorses for data engineers:
SELECT
queries with WHERE
clauses to check for null values, incorrect formats, or outliers.UPDATE
.INSERT
statements to populate small dimension or lookup tables.DELETE
can be used to remove duplicate records or rows identified as erroneous after validation.SELECT
), potentially transforming it within the database (UPDATE
), loading it elsewhere, or removing processed staging data (DELETE
).While this introduction covers the fundamentals, SQL offers much more power, including combining data from multiple tables (JOIN
), summarizing data (using aggregate functions like COUNT
, SUM
, AVG
with GROUP BY
), and complex filtering. However, mastering SELECT
, INSERT
, UPDATE
, and DELETE
provides a solid foundation for interacting with the vast amounts of data stored in relational databases, making SQL an indispensable tool in your data engineering toolkit.
© 2025 ApX Machine Learning