Effective data governance relies on metadata: you cannot govern data you cannot describe. Manual spreadsheet tracking does not scale to modern platforms, so governance must be applied programmatically. This section covers strategies for classifying and tagging data directly within the pipeline.
Data classification is the process of categorizing data based on its sensitivity and business value. Tagging is the technical act of attaching those labels to data objects (tables, columns, or streams) in a form that downstream systems can interpret.
Before writing code to tag data, you must establish a machine-readable taxonomy. A common mistake in data platforms is allowing free-text tags, which leads to inconsistency (e.g., using "PII", "Personal", and "Sensitive" to mean the same thing). Instead, we define a strict enumeration of sensitivity levels that maps directly to infrastructure configuration.
A standard technical taxonomy often follows a four-tier model:

- L1 (Public): data that may be shared externally without restriction.
- L2 (Internal): business data not intended for public release.
- L3 (Confidential): sensitive data, such as PII, restricted to specific roles.
- L4 (Restricted): highly sensitive data requiring the strictest controls.
These levels are not abstract; they dictate the storage and encryption parameters. For instance, an L4 tag might trigger a policy that forces column-level encryption at rest, while an L1 tag might allow the data to be exported to a public bucket.
Classification levels map directly to technical enforcement mechanisms such as encryption and access control lists.
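As a sketch of this mapping, the enforcement parameters for each level can be expressed as a simple lookup table. The field names here (column_encryption, public_export, mask_by_default) are illustrative assumptions, not a real warehouse API:

```python
# Illustrative mapping from sensitivity level to enforcement settings.
# Field names are assumptions for this sketch, not a vendor API.
ENFORCEMENT_POLICIES = {
    "L1": {"column_encryption": False, "public_export": True,  "mask_by_default": False},
    "L2": {"column_encryption": False, "public_export": False, "mask_by_default": False},
    "L3": {"column_encryption": True,  "public_export": False, "mask_by_default": True},
    "L4": {"column_encryption": True,  "public_export": False, "mask_by_default": True},
}

def policy_for(sensitivity_tag: str) -> dict:
    """Resolve enforcement settings for a tag like 'sensitivity:L3'."""
    level = sensitivity_tag.split(":")[-1]
    # Unknown levels fall back to the most restrictive policy (deny by default).
    return ENFORCEMENT_POLICIES.get(level, ENFORCEMENT_POLICIES["L4"])
```

Keeping this mapping in one place means a policy change (for example, requiring masking for L2) is a one-line edit rather than a hunt across pipeline code.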
In a scalable architecture, we employ three primary strategies to apply these tags: Explicit Definition, Automated Discovery, and Lineage Propagation.
The most deterministic method is declaring tags alongside schema definitions. When using tools like Terraform, dbt, or SQLAlchemy, you define the sensitivity of a column in the configuration file. This follows the GitOps practice: the classification is version-controlled and reviewed before deployment.
Consider a configuration file for a data table. Instead of just defining the data type, we append a policy_tags attribute.
# table_schema.yaml
columns:
  - name: user_id
    type: string
    description: "Primary key for user"
    tags:
      - sensitivity:L2
      - domain:identity
  - name: email_address
    type: string
    tags:
      - sensitivity:L3
      - pii:true
When the deployment pipeline runs, it parses this YAML and applies the tags to the cloud data warehouse (e.g., Snowflake Object Tagging or Google BigQuery Policy Tags). This ensures that new tables are born classified.
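The parsing step can be sketched as follows. The SQL syntax mirrors Snowflake-style object tagging and should be adapted for your warehouse; the schema dict is what a YAML parser (e.g., yaml.safe_load) would return for the file above:

```python
# Sketch of the deployment step that turns a parsed schema definition into
# warehouse tag statements. The ALTER syntax follows Snowflake object tagging;
# adapt it for your target warehouse.
from typing import Dict, List

def tag_statements(table_name: str, schema: Dict) -> List[str]:
    """Generate one ALTER statement per (column, tag) pair."""
    statements = []
    for column in schema["columns"]:
        for tag in column.get("tags", []):
            key, _, value = tag.partition(":")
            statements.append(
                f"ALTER TABLE {table_name} MODIFY COLUMN {column['name']} "
                f"SET TAG {key} = '{value}'"
            )
    return statements

# The parsed form of table_schema.yaml above:
schema = {
    "columns": [
        {"name": "user_id", "type": "string",
         "tags": ["sensitivity:L2", "domain:identity"]},
        {"name": "email_address", "type": "string",
         "tags": ["sensitivity:L3", "pii:true"]},
    ]
}
```

Because the statements are generated from the version-controlled schema file, a pull request review of the YAML is effectively a review of the resulting warehouse policy.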
Explicit definition relies on human discipline. To catch sensitive data that engineers might miss, we implement automated scanners. These scanners sample data during the ingestion phase or run on a schedule against stored data.
We utilize regular expressions (Regex) and logic checks to identify patterns associated with sensitive information. If a scanner detects a pattern matching a credit card number or an email address in a column labeled "description," it automatically applies a provisional L3 or L4 tag.
The logic for a basic scanner involves iterating through schemas and validating sample rows.
import re
from typing import List

# Regex patterns for common sensitive data types
PATTERNS = {
    "email": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$",
    "ssn": r"^\d{3}-\d{2}-\d{4}$",
    "ipv4": r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$",
}

def scan_sample_data(samples: List[str]) -> List[str]:
    """
    Analyzes a list of data samples and returns detected tags.
    """
    detected_tags = []
    match_counts = {k: 0 for k in PATTERNS}
    threshold = 0.8  # 80% of non-null data must match to apply tag

    valid_samples = [s for s in samples if s is not None]
    if not valid_samples:
        return []

    for sample in valid_samples:
        for tag_type, pattern in PATTERNS.items():
            if re.match(pattern, str(sample)):
                match_counts[tag_type] += 1

    # Calculate confidence and apply tags
    for tag_type, count in match_counts.items():
        if (count / len(valid_samples)) >= threshold:
            detected_tags.append(f"detected_{tag_type}")

    return detected_tags
This function calculates the ratio of matches to total non-null rows. We use a threshold (0.8, or 80%) rather than a single match to avoid false positives caused by dirty data.
Data changes form as it moves through pipelines. A raw table containing L3 data might be joined with an L2 table to create a new derived view. A governance system utilizes lineage to propagate tags downstream.
The logic for propagation generally follows a "high-water mark" principle: the sensitivity of a derived asset is equal to the highest sensitivity level of its upstream dependencies.
Mathematically, if a derived column y is a function of input columns x_1, ..., x_n, its sensitivity level is:

S(y) = max(S(x_1), S(x_2), ..., S(x_n))
If you join a public dataset (L1) with a confidential dataset (L3), the resulting dataset must be treated as confidential (L3) unless an explicit de-identification function (like hashing or masking) is applied during the transformation.
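A minimal sketch of the high-water mark rule, assuming levels are encoded as the strings "L1" through "L4" (the post-masking level returned for de-identified data is an illustrative assumption; real policies vary):

```python
# High-water mark propagation: a derived asset inherits the highest
# sensitivity among its upstream inputs.
LEVEL_ORDER = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}

def propagate_sensitivity(upstream_levels, deidentified=False):
    """Return the sensitivity level of a derived asset.

    deidentified=True models an explicit hashing/masking step, the only
    case where the derived level may drop below the high-water mark.
    The resulting "L2" is an assumption for this sketch.
    """
    highest = max(upstream_levels, key=LEVEL_ORDER.__getitem__)
    if deidentified and LEVEL_ORDER[highest] > LEVEL_ORDER["L2"]:
        return "L2"
    return highest
```

A lineage-aware catalog would call this at every transformation node, so tags flow downstream without manual re-classification of each derived view.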
Tags must live where the query engine can see them. Storing tags in an external spreadsheet creates a disconnect between policy and enforcement. Modern data warehouses allow storing key-value pairs directly on the table objects.
When you execute a query, the engine inspects these native tags to determine if the requesting user has the necessary clearance. For example, a dynamic masking policy might check the tag on a column at query time:
-- Logic applied by the database engine at runtime.
-- tag_value() is illustrative; actual tag-lookup functions vary by warehouse.
CASE
    WHEN current_role() IN ('ANALYST_FULL') THEN email_address
    WHEN tag_value(email_address, 'sensitivity') = 'L3' THEN '***MASKED***'
    ELSE email_address
END
This creates a self-enforcing system. The scanner detects the data, applies the tag, and the database engine enforces the policy associated with that tag.
In this automated workflow, a scanner identifies sensitivity, updates the catalog, and informs the policy engine, which restricts or masks data at query time.
Automated tagging is never perfect. A "false positive" occurs when a scanner tags a product ID as a credit card number, potentially blocking access unnecessarily. A "false negative" occurs when sensitive data slips through untagged.
To manage this, we implement a "quarantine and review" workflow. When a scanner detects a new high-sensitivity tag (L3 or L4) on a previously low-sensitivity column, it should not immediately block production traffic if that traffic is critical. Instead, it should alert the data owner and potentially flag the dataset as "Needs Review."
However, for new datasets entering the system, the default posture should be "deny by default." Until a dataset is scanned and classified, it should be treated as Restricted (L4). This ensures that a failure in the scanning mechanism results in a closed door rather than an open data leak.
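These two rules (quarantine-and-review on escalation, deny-by-default for unscanned data) can be sketched as a small decision function. The state names and the choice to keep serving at the existing tag while flagging for review are illustrative assumptions:

```python
# Illustrative decision logic for the quarantine-and-review workflow.
# State names ("unscanned", "needs_review", "classified") are assumptions.
HIGH_SENSITIVITY = {"L3", "L4"}

def effective_sensitivity(current_tag, detected_tag=None):
    """Resolve (level_to_enforce, review_state) for a column.

    current_tag: the tag already on the column, or None if never classified.
    detected_tag: the tag a scanner just proposed, or None.
    """
    # Deny by default: unscanned, unclassified data is treated as Restricted.
    if current_tag is None and detected_tag is None:
        return "L4", "unscanned"
    # Escalation from low to high sensitivity: keep the existing tag so
    # critical traffic is not blocked, but flag for human review and alerting.
    if (detected_tag in HIGH_SENSITIVITY
            and current_tag is not None
            and current_tag not in HIGH_SENSITIVITY):
        return current_tag, "needs_review"
    return detected_tag or current_tag, "classified"
```

The key asymmetry: unknown data fails closed (L4), while a disputed escalation on known data fails open but loudly, routing the decision to the data owner.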