Data quality is frequently misunderstood as a subjective measure of "goodness." In a production engineering context, quality must be quantifiable. We decompose quality into distinct technical dimensions so that vague complaints about "bad data" can be turned into actionable engineering tasks. This decomposition allows us to apply specific testing strategies and assertion logic to each aspect of the dataset.

The four primary dimensions we focus on in data engineering are Accuracy, Completeness, Consistency, and Validity. Each dimension addresses a specific failure mode in the data lifecycle and requires different implementation patterns in Python or SQL.

## Accuracy

Accuracy measures the degree to which data correctly describes the real-world object or event it represents. In engineering terms, accuracy is the ratio of correct values to total values:

$$ \text{Accuracy} = \frac{\sum_{i=1}^{n} I(x_i = \text{truth})}{n} $$

where $I$ is an indicator function that returns 1 if the value matches the truth and 0 otherwise.

Testing for accuracy is difficult because "truth" is often external to the system. However, we can approximate accuracy through referential validation and range constraints.

For example, if a temperature sensor reports a value of 5000°C for a standard server room, the data is technically a valid integer (validity), but it is almost certainly inaccurate. We catch this by asserting reasonable bounds based on domain knowledge.

**Example: Range-based accuracy assertion in SQL**

```sql
SELECT COUNT(*) AS inaccurate_records
FROM server_logs
WHERE cpu_temperature_celsius < 10
   OR cpu_temperature_celsius > 100;
```

While this does not guarantee that the exact temperature is correct, it filters out values that are physically implausible, thereby increasing the probability that the dataset is accurate.

## Completeness

Completeness is often reduced to "checking for nulls," but it encompasses more: it measures whether all required data is present. This manifests in two ways:

- **Attribute completeness:** Are specific fields populated? (e.g., does every user have an email address?)
- **Record completeness:** Is the total volume of data consistent with expectations? (e.g., did we receive all hourly logs?)

Missing data introduces bias into downstream machine learning models and breaks aggregation logic in reports.

*Figure: Data routing based on completeness checks. Records flow from the Source Stream through the Ingestion Layer to a Completeness Check; records with the required values present continue to the Data Warehouse, while null or missing records are routed to a Dead Letter Queue. This routing prevents sparse records from polluting the warehouse.*

To test for record completeness without manual intervention, we often compare row counts against historical averages. If a pipeline usually processes 10,000 rows per hour, a drop to 500 rows indicates a completeness failure, even if those 500 rows are perfectly formatted.
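As a concrete illustration of this pattern, the sketch below compares the latest hourly row count against a recent historical mean. The function name, the 50% threshold, and the sample counts are illustrative assumptions rather than a standard; real pipelines usually tune the threshold per source and account for seasonality.

```python
import pandas as pd

def check_record_completeness(hourly_counts: pd.Series, current_count: int,
                              min_ratio: float = 0.5) -> bool:
    """Return True if the current hourly row count is at least `min_ratio`
    of the recent historical average (illustrative threshold)."""
    historical_avg = hourly_counts.mean()
    return current_count >= min_ratio * historical_avg

# Hypothetical history for a pipeline that usually lands ~10,000 rows per hour
history = pd.Series([9800, 10250, 10010, 9950, 10120])
print(check_record_completeness(history, 10050))  # True  -> volume looks normal
print(check_record_completeness(history, 500))    # False -> completeness failure
```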
## Consistency

Consistency ensures that data values do not contradict each other, either within the same record or across different datasets. It is the enforcement of logical rules and relationships.

Internal consistency checks validate that fields within a single row make sense together. For instance, `start_date` must always be less than or equal to `end_date`:

$$ t_{\text{start}} \leq t_{\text{end}} $$

Cross-system consistency (referential integrity) is harder to manage in distributed systems. It asks whether the `user_id` in the clickstream logs corresponds to a valid `user_id` in the `users` table. In relational databases, foreign keys enforce this; in data lakes, we must write assertions to check for orphaned records.

**Example: Internal consistency check in Python**

```python
def check_timestamp_consistency(df):
    """Return True if no event in the DataFrame ends before it starts."""
    inconsistent = df[df['event_end'] < df['event_start']]
    return len(inconsistent) == 0
```

## Validity

Validity is strictly about syntax and format. Unlike accuracy (which asks "is this true?"), validity asks "does this follow the rules?" A phone number can be valid (correct format) but inaccurate (wrong number).

Validity checks rely heavily on schema enforcement and regular expressions. Common validity constraints include:

- **Data types:** ensuring a string is not passed where an integer is expected.
- **Format patterns:** verifying emails match `^[\w\.-]+@[\w\.-]+\.\w+$`.
- **Allowed values:** ensuring a `status` column only contains "ACTIVE", "INACTIVE", or "PENDING".

Automated validity testing is the first line of defense in an ingestion pipeline. If data does not meet the schema definition, it should be rejected immediately, before it causes serialization errors in downstream formats such as Parquet or Avro.

*Figure: Frequency of data quality failures by dimension (incident count): Validity 145, Completeness 89, Consistency 42, Accuracy 15.*

Validity failures are typically the most frequent but the easiest to catch programmatically, while accuracy failures are rarer but harder to detect.

## The Interaction of Dimensions

These dimensions are not mutually exclusive; a single data point can fail several of them simultaneously. Distinguishing between them, however, helps in root cause analysis:

- If the format is wrong, it is a Validity issue (check the producer's serialization).
- If the format is right but the field is empty, it is a Completeness issue (check the upstream API response).
- If the data exists but contradicts other data, it is a Consistency issue (check the synchronization logic).
- If the data looks perfect but is factually wrong, it is an Accuracy issue (check the sensor or input source).

By categorizing failures into these dimensions, engineers can write targeted unit tests that provide clear error messages, significantly reducing the time required to debug production pipelines.
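To make that closing point concrete, here is a minimal sketch of what dimension-tagged checks might look like for a hypothetical users table. The column names (`email`, `status`, `signup_date`, `last_login`) are assumptions for illustration; the email regex and the allowed status set reuse the examples from the Validity section.

```python
import pandas as pd

ALLOWED_STATUSES = {"ACTIVE", "INACTIVE", "PENDING"}
EMAIL_PATTERN = r"^[\w\.-]+@[\w\.-]+\.\w+$"

def run_quality_checks(df: pd.DataFrame) -> list:
    """Return one descriptive failure message per violated check."""
    failures = []

    # Completeness: required attribute must be populated
    missing_email = df["email"].isna()
    if missing_email.any():
        failures.append(f"Completeness: {missing_email.sum()} rows missing email")

    # Validity (format): populated emails must match the expected pattern
    malformed = df["email"].notna() & ~df["email"].astype(str).str.match(EMAIL_PATTERN)
    if malformed.any():
        failures.append(f"Validity: {malformed.sum()} rows have malformed emails")

    # Validity (allowed values): status must come from the reference set
    bad_status = ~df["status"].isin(ALLOWED_STATUSES)
    if bad_status.any():
        failures.append(f"Validity: unexpected status values {set(df.loc[bad_status, 'status'])}")

    # Consistency: a user cannot log in before signing up
    contradiction = df["last_login"] < df["signup_date"]
    if contradiction.any():
        failures.append(f"Consistency: {contradiction.sum()} rows with last_login before signup_date")

    return failures
```

In a test runner such as pytest, each message could back its own assertion, so a failing dimension is visible directly in the test report rather than buried in a generic "bad data" error.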