Now that we've discussed the different shapes data can take (structured, semi-structured, unstructured) and where it often comes from, we need to consider where this data actually lives. Simply collecting data isn't enough; it needs to be stored in an organized way so that it can be accessed, managed, and analyzed effectively. This is where databases come in.
Think of a database as a digital filing cabinet, but much more powerful and organized. It's a system designed specifically for storing, retrieving, and managing collections of data. Data engineers spend a significant amount of time interacting with databases, ensuring data flows into them correctly and can be pulled out efficiently for various uses.
There are many types of databases, but for now, we'll focus on two broad categories that you'll encounter frequently: Relational Databases and NoSQL Databases.
Relational databases have been the workhorse of data storage for decades. They organize data into tables, which look a lot like spreadsheets. Each table consists of rows and columns.
Customers
, Products
, or Orders
).customer_id
, first_name
, email
for the Customers
table).The "relational" part comes from the ability to link data between tables using common columns, known as keys. For example, an Orders
table might have a customer_id
column that links each order back to the specific customer who placed it in the Customers
table.
A simple visualization showing two related tables:
Customers
andOrders
, linked bycustomer_id
. PK stands for Primary Key (unique identifier for a row in its table), and FK stands for Foreign Key (a key used to link to another table).
Relational databases enforce a predefined structure, called a schema. Before you can store data, you must define the tables, the columns in each table, the data type for each column (e.g., integer, text, date), and the relationships between tables. This structure ensures data consistency and integrity.
To interact with these databases, you typically use SQL (Structured Query Language). SQL is the standard language for querying, inserting, updating, and deleting data in relational databases. For example, to retrieve all information about a customer with ID 1, you might write:
SELECT *
FROM Customers
WHERE customer_id = 1;
Common Examples: PostgreSQL, MySQL, SQL Server, Oracle Database.
When are they used? Relational databases excel when data consistency is very important, the data structure is well-defined and doesn't change constantly, and complex queries involving multiple tables are needed. Think financial transactions, inventory systems, or user account management.
As applications evolved, especially with the rise of the internet, web applications, and big data, new challenges emerged. Sometimes data didn't fit neatly into tables, or the sheer volume and speed of data overwhelmed traditional relational systems. This led to the development of NoSQL databases.
NoSQL stands for "Not Only SQL". It's an umbrella term for databases that provide different ways of storing and retrieving data compared to the relational model. They often offer more flexibility in terms of data structure and are designed to scale out easily across many servers.
There are several types of NoSQL databases, each suited for different kinds of problems:
A significant difference from relational databases is that many NoSQL databases have a flexible schema or are schema-less. This means you don't necessarily need to define the exact structure of your data upfront, allowing for easier handling of varied or evolving data.
When are they used? NoSQL databases are often chosen for applications needing high scalability (handling massive amounts of data or users), flexibility in data format, or very high-speed read/write operations. Use cases include real-time analytics, content management systems, mobile applications, and storing data from IoT devices.
Here’s a summary of the main differences:
Key differences between typical Relational (SQL) and NoSQL databases. Note that these are generalizations, and specific databases might have variations.
The choice between a relational and a NoSQL database isn't always clear-cut, and sometimes both are used within the same larger system (a pattern called polyglot persistence). The decision depends heavily on the specific requirements of the application:
Understanding these fundamental database types is essential. As a data engineer, you'll work with systems that pull data from, and load data into, various databases. Knowing their characteristics helps in designing efficient data pipelines and choosing appropriate storage solutions, which we'll discuss more in later chapters.
© 2025 ApX Machine Learning