Software engineering teams have long relied on Application Performance Monitoring (APM) to track uptime and latency. Data engineering, however, presents a different challenge: a system can be perfectly operational (servers up, DAGs green, tasks completing successfully) while the data itself is garbage. A successful task execution does not guarantee accurate data.

Data observability bridges this gap. It applies the principles of DevOps observability to data pipelines, allowing engineers to infer the internal quality of data from its external signals. Unlike unit testing, which asserts known conditions before deployment, observability detects unknown issues in production. It relies on three fundamental signal types: metrics, logs, and traces (often manifested as lineage in data contexts).

## The Three Signals of Data Health

To construct a monitoring system, we must capture specific signals from the data infrastructure. These signals allow us to answer three different questions: "Is there a problem?", "Where is the problem?", and "Why did it happen?"

### Metrics

Metrics are numeric representations of data measured over intervals. In a data context, these are time-series values that track the shape and flow of information. They are efficient to store and easy to query for alerting purposes. Common data metrics include:

- **Operational metrics:** Job duration, compute cost, and slot usage.
- **Quality metrics:** Row counts (volume), null percentages, and distinct value counts.
- **SLA metrics:** Latency and time-since-last-update (freshness).

### Logs

Logs are immutable, timestamped records of discrete events. While metrics tell you that a trend is changing, logs provide the granular details required to debug the event. When a transformation fails, the logs contain the exception stack trace or the specific SQL error code returned by the warehouse.

### Traces and Lineage

In microservices, traces track a request as it hops between services. In data engineering, this concept translates to Data Lineage. Lineage maps the relationships between upstream sources, transformation jobs, and downstream tables. It provides the context required to understand the blast radius of an incident.
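As a minimal sketch of how these three signals can be captured around a single pipeline step, the example below wraps a hypothetical `load_orders` function and records a log event, two quality metrics, and a lineage edge. Every name here is invented for illustration, and the in-memory `metrics` and `lineage_edges` stores stand in for whatever logging, metrics, and lineage backends you actually run.

```python
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl.orders")

# In-memory stand-ins for real metric and lineage backends (assumed for this sketch).
metrics: list[dict] = []
lineage_edges: list[tuple[str, str]] = []


def load_orders() -> list[dict]:
    """Hypothetical extract step; replace with a real source query."""
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": None}]


def run_step(source: str, destination: str) -> None:
    start = time.monotonic()
    try:
        rows = load_orders()

        # Metric: volume and null rate for the loaded batch.
        metrics.append({
            "table": destination,
            "ts": datetime.now(timezone.utc).isoformat(),
            "row_count": len(rows),
            "null_amount_pct": sum(r["amount"] is None for r in rows) / max(len(rows), 1),
            "duration_s": time.monotonic() - start,
        })

        # Lineage: record the upstream -> downstream dependency.
        lineage_edges.append((source, destination))

        # Log: a discrete, timestamped event describing what happened.
        logger.info("Loaded %d rows from %s into %s", len(rows), source, destination)
    except Exception:
        logger.exception("Load failed for %s -> %s", source, destination)
        raise


run_step("source_db.orders", "warehouse.analytics.orders")
```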
The following diagram illustrates how these signals converge to form an observability layer over the physical infrastructure.

```dot
digraph G {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled,rounded", fontname="Arial", fontsize=10, color="#dee2e6"];
  edge [fontname="Arial", fontsize=9, color="#868e96"];

  subgraph cluster_infra {
    label="Data Infrastructure";
    style=dashed; color="#adb5bd"; fontcolor="#adb5bd";
    source [label="Source DB", fillcolor="#e7f5ff", color="#74c0fc"];
    etl [label="ETL Job (Python/SQL)", fillcolor="#e7f5ff", color="#74c0fc"];
    warehouse [label="Data Warehouse", fillcolor="#e7f5ff", color="#74c0fc"];
    source -> etl [label="Extract"];
    etl -> warehouse [label="Load"];
  }

  subgraph cluster_signals {
    label="Observability Signals";
    style=dashed; color="#adb5bd"; fontcolor="#adb5bd";
    logs [label="Logs\n(Errors, Events)", fillcolor="#fff0f6", color="#faa2c1"];
    metrics [label="Metrics\n(Volume, Freshness)", fillcolor="#fff0f6", color="#faa2c1"];
    lineage [label="Lineage\n(Dependencies)", fillcolor="#fff0f6", color="#faa2c1"];
  }

  etl -> logs [style=dotted];
  etl -> metrics [style=dotted];
  warehouse -> metrics [style=dotted];
  etl -> lineage [style=dotted];

  platform [label="Observability Platform\n(Alerting & RCA)", fillcolor="#f3f0ff", color="#b197fc", shape=component];
  logs -> platform;
  metrics -> platform;
  lineage -> platform;
}
```

*Signals flow from the physical infrastructure into the observability platform for analysis.*

## Detecting Anomalies with Metrics

The primary mechanism for proactive alerting is anomaly detection on collected metrics. We generally focus on three specific dimensions: Freshness, Volume, and Schema.

### Freshness

Freshness measures the time elapsed since a dataset was last updated; it effectively tracks the latency of your pipeline. If a job that usually completes at 08:00 UTC does not finish until 09:30 UTC, the data is "stale" for 90 minutes.

Freshness checks are important because they detect silent failures. A scheduled DAG might simply fail to trigger due to a scheduler malfunction. In this scenario, no error logs are generated because no code ran. Only a freshness monitor observing the destination table would catch this silence.

### Volume

Volume refers to the size of the data being processed, usually measured in row count or bytes. Sudden drops in volume often indicate upstream data loss or API failures, while sudden spikes might indicate duplicate data ingestion.

Because data volume naturally fluctuates (e.g., lower traffic on weekends), static thresholds often generate false positives. Instead, we use statistical baselines. By calculating the Z-score, we can determine how far the current volume $V_t$ deviates from the moving average $\mu$ relative to the standard deviation $\sigma$:

$$Z = \frac{V_t - \mu}{\sigma}$$

An alert is triggered when $|Z| > k$, where $k$ is a sensitivity threshold (typically 2 or 3).
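The rule above translates directly into code. Below is a minimal sketch, assuming daily row counts are already available as a Python list; a production monitor would pull them from a metrics store and typically use a rolling window rather than a fixed history.

```python
from statistics import mean, stdev


def volume_anomaly(history: list[int], current: int, k: float = 3.0) -> bool:
    """Flag the current row count if its Z-score against the historical baseline exceeds k."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is worth flagging.
        return current != mu
    z = (current - mu) / sigma
    return abs(z) > k


# Daily row counts for the previous week (illustrative values).
history = [1010, 1045, 990, 1025, 1000, 1040, 1010]

print(volume_anomaly(history, 1020))  # False: within the expected band
print(volume_anomaly(history, 450))   # True: a drop like the October 5th example below
```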
The chart below demonstrates a volume anomaly where the row count drops significantly below the expected historical band.

```json
{"layout": {"title": {"text": "Volume Anomaly Detection", "font": {"size": 14}}, "xaxis": {"title": "Date", "showgrid": false}, "yaxis": {"title": "Row Count", "showgrid": true, "gridcolor": "#f1f3f5"}, "plot_bgcolor": "white", "margin": {"t": 40, "b": 40, "l": 50, "r": 20}, "showlegend": true, "legend": {"orientation": "h", "y": -0.2}},
 "data": [{"type": "scatter", "name": "Expected Range (2σ)", "x": ["2023-10-01", "2023-10-02", "2023-10-03", "2023-10-04", "2023-10-05", "2023-10-06", "2023-10-07"], "y": [1100, 1150, 1080, 1120, 1100, 1150, 1100], "line": {"width": 0}, "fill": "tonexty", "fillcolor": "rgba(233, 236, 239, 0.5)", "showlegend": false},
 {"type": "scatter", "name": "Baseline", "x": ["2023-10-01", "2023-10-02", "2023-10-03", "2023-10-04", "2023-10-05", "2023-10-06", "2023-10-07"], "y": [1000, 1050, 980, 1020, 1000, 1050, 1000], "line": {"color": "#adb5bd", "dash": "dot"}},
 {"type": "scatter", "name": "Actual Volume", "x": ["2023-10-01", "2023-10-02", "2023-10-03", "2023-10-04", "2023-10-05", "2023-10-06", "2023-10-07"], "y": [1010, 1045, 990, 1025, 450, 1040, 1010], "line": {"color": "#339af0", "width": 2}},
 {"type": "scatter", "name": "Anomaly", "x": ["2023-10-05"], "y": [450], "mode": "markers", "marker": {"color": "#fa5252", "size": 10}}]}
```

*A significant deviation from the historical baseline triggers an anomaly alert on October 5th.*

### Schema

Schema drift occurs when the structure of the data changes. This includes columns being added, removed, or renamed, as well as data type changes (e.g., an integer field becoming a string).

While schema evolution is natural in agile development, unmanaged schema changes are a leading cause of broken pipelines. Observability systems monitor the information schema of warehouses or the structure of JSON blobs in data lakes. When a change is detected, the system compares the new schema against a known registry or the previous state to determine whether the change is breaking.

## The Role of Metadata and Lineage

Metrics tell you that something is wrong, but lineage tells you where it broke and who is affected.

Lineage is constructed by parsing query logs or by integrating with orchestration tools such as Airflow. It builds a directed acyclic graph (DAG) of your data assets. When a freshness alert triggers on a dashboard, lineage allows you to traverse the graph backwards (upstream) to find the root cause. Conversely, if a source table is corrupted, lineage allows you to traverse forwards (downstream) to notify the relevant stakeholders.

Effective observability requires combining these signals. A metric alert triggers an incident. Lineage isolates the broken component. Logs reveal the error message. This workflow reduces Mean Time to Resolution (MTTR) and transforms data governance from a manual oversight task into an automated engineering discipline.
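To make the lineage step of that workflow concrete, here is a minimal sketch of upstream (root cause) and downstream (blast radius) traversal over a lineage graph stored as an adjacency list. The table names and the graph itself are invented for illustration; a real platform would assemble this graph from query logs or orchestrator metadata.

```python
from collections import deque

# Lineage as an adjacency list: upstream asset -> downstream assets (illustrative names).
lineage = {
    "source_db.orders": ["warehouse.stg_orders"],
    "warehouse.stg_orders": ["warehouse.fct_orders"],
    "warehouse.fct_orders": ["dashboard.revenue", "ml.churn_features"],
}

# Reverse the edges so we can also walk upstream from an affected asset.
reverse = {}
for upstream, downstreams in lineage.items():
    for downstream in downstreams:
        reverse.setdefault(downstream, []).append(upstream)


def traverse(graph: dict, start: str) -> list[str]:
    """Breadth-first walk returning every asset reachable from `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# Root cause analysis: walk upstream from the stale dashboard.
print(traverse(reverse, "dashboard.revenue"))
# ['warehouse.fct_orders', 'warehouse.stg_orders', 'source_db.orders']

# Blast radius: walk downstream from a corrupted source table.
print(traverse(lineage, "source_db.orders"))
# ['warehouse.stg_orders', 'warehouse.fct_orders', 'dashboard.revenue', 'ml.churn_features']
```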