Data exists in various forms, such as structured, semi-structured, and unstructured, originating from diverse sources. Simply collecting this data is not enough; it requires organized storage for effective access, management, and analysis. Databases fulfill this crucial role.
Think of a database as a digital filing cabinet, but much more powerful and organized. It's a system designed specifically for storing, retrieving, and managing collections of data. Data engineers spend a significant amount of time interacting with databases, ensuring data flows into them correctly and can be pulled out efficiently for various uses.
There are many types of databases, but for now, we'll focus on two broad categories that you'll encounter frequently: Relational Databases and NoSQL Databases.
Relational databases have been the workhorse of data storage for decades. They organize data into tables, which look a lot like spreadsheets. Each table consists of rows and columns.
Customers, Products, or Orders).customer_id, first_name, email for the Customers table).The "relational" part comes from the ability to link data between tables using common columns, known as keys. For example, an Orders table might have a customer_id column that links each order back to the specific customer who placed it in the Customers table.
A simple visualization showing two related tables:
CustomersandOrders, linked bycustomer_id. PK stands for Primary Key (unique identifier for a row in its table), and FK stands for Foreign Key (a key used to link to another table).
Relational databases enforce a predefined structure, called a schema. Before you can store data, you must define the tables, the columns in each table, the data type for each column (e.g., integer, text, date), and the relationships between tables. This structure ensures data consistency and integrity.
To interact with these databases, you typically use SQL (Structured Query Language). SQL is the standard language for querying, inserting, updating, and deleting data in relational databases. For example, to retrieve all information about a customer with ID 1, you might write:
SELECT *
FROM Customers
WHERE customer_id = 1;
Common Examples: PostgreSQL, MySQL, SQL Server, Oracle Database.
When are they used? Relational databases excel when data consistency is very important, the data structure is well-defined and doesn't change constantly, and complex queries involving multiple tables are needed. Think financial transactions, inventory systems, or user account management.
As applications evolved, especially with the rise of the internet, web applications, and big data, new challenges emerged. Sometimes data didn't fit neatly into tables, or the sheer volume and speed of data overwhelmed traditional relational systems. This led to the development of NoSQL databases.
NoSQL stands for "Not Only SQL". It's an umbrella term for databases that provide different ways of storing and retrieving data compared to the relational model. They often offer more flexibility in terms of data structure and are designed to scale out easily across many servers.
There are several types of NoSQL databases, each suited for different kinds of problems:
A significant difference from relational databases is that many NoSQL databases have a flexible schema or are schema-less. This means you don't necessarily need to define the exact structure of your data upfront, allowing for easier handling of varied or evolving data.
When are they used? NoSQL databases are often chosen for applications needing high scalability (handling massive amounts of data or users), flexibility in data format, or very high-speed read/write operations. Use cases include real-time analytics, content management systems, mobile applications, and storing data from IoT devices.
Here’s a summary of the main differences:
Key differences between typical Relational (SQL) and NoSQL databases. Note that these are generalizations, and specific databases might have variations.
The choice between a relational and a NoSQL database isn't always clear-cut, and sometimes both are used within the same larger system (a pattern called polyglot persistence). The decision depends heavily on the specific requirements of the application:
Understanding these fundamental database types is essential. As a data engineer, you'll work with systems that pull data from, and load data into, various databases. Knowing their characteristics helps in designing efficient data pipelines and choosing appropriate storage solutions, which we'll discuss more in later chapters.
Was this section helpful?
© 2026 ApX Machine LearningEngineered with