Effective data governance relies on metadata: you cannot govern data you cannot describe. Manual spreadsheet tracking does not scale to modern platforms, so governance must be applied programmatically. This section covers strategies for classifying and tagging data directly within the pipeline.
Data classification is the process of categorizing data based on its sensitivity and business value. Tagging is the technical act of attaching those labels to data objects (tables, columns, or streams) in a form that downstream systems can interpret.
Before writing code to tag data, you must establish a machine-readable taxonomy. A common mistake in data platforms is allowing free-text tags, which leads to inconsistency (e.g., using "PII", "Personal", and "Sensitive" to mean the same thing). Instead, we define a strict enumeration of sensitivity levels that maps directly to infrastructure configuration.
A standard technical taxonomy often follows a four-tier model:

- L1 (Public): data that may be shared externally without restriction.
- L2 (Internal): business data not intended for public release.
- L3 (Confidential): sensitive data, such as PII, restricted to specific roles.
- L4 (Restricted): highly sensitive data requiring the strictest controls.
These levels are not abstract; they dictate the storage and encryption parameters. For instance, an L4 tag might trigger a policy that forces column-level encryption at rest, while an L1 tag might allow the data to be exported to a public bucket.
Classification levels map directly to technical enforcement mechanisms such as encryption and access control lists.
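As a sketch of this mapping, the enforcement parameters for each level can be expressed as a simple lookup table. The field names here (column_encryption, public_export, mask_by_default) are illustrative assumptions, not a real warehouse API:

```python
# Illustrative mapping from sensitivity level to enforcement settings.
# Field names are assumptions for this sketch, not a vendor API.
ENFORCEMENT_POLICIES = {
    "L1": {"column_encryption": False, "public_export": True,  "mask_by_default": False},
    "L2": {"column_encryption": False, "public_export": False, "mask_by_default": False},
    "L3": {"column_encryption": True,  "public_export": False, "mask_by_default": True},
    "L4": {"column_encryption": True,  "public_export": False, "mask_by_default": True},
}

def policy_for(sensitivity_tag: str) -> dict:
    """Resolve enforcement settings for a tag like 'sensitivity:L3'."""
    level = sensitivity_tag.split(":")[-1]
    # Unknown levels fall back to the most restrictive policy (deny by default).
    return ENFORCEMENT_POLICIES.get(level, ENFORCEMENT_POLICIES["L4"])
```

Keeping this mapping in one place means a policy change (for example, requiring masking for L2) is a one-line edit rather than a hunt across pipeline code.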
In a scalable architecture, we employ three primary strategies to apply these tags: Explicit Definition, Automated Discovery, and Lineage Propagation.
The most deterministic method is declaring tags alongside schema definitions. When using tools like Terraform, dbt, or SQLAlchemy, you define the sensitivity of a column in the configuration file. This follows the GitOps practice: the classification is version-controlled and reviewed before deployment.
Consider a configuration file for a data table. Instead of just defining the data type, we append a policy_tags attribute.
# table_schema.yaml
columns:
  - name: user_id
    type: string
    description: "Primary key for user"
    tags:
      - sensitivity:L2
      - domain:identity
  - name: email_address
    type: string
    tags:
      - sensitivity:L3
      - pii:true
When the deployment pipeline runs, it parses this YAML and applies the tags to the cloud data warehouse (e.g., Snowflake Object Tagging or Google BigQuery Policy Tags). This ensures that new tables are born classified.
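The parsing step can be sketched as follows. The SQL syntax mirrors Snowflake-style object tagging and should be adapted for your warehouse; the schema dict is what a YAML parser (e.g., yaml.safe_load) would return for the file above:

```python
# Sketch of the deployment step that turns a parsed schema definition into
# warehouse tag statements. The ALTER syntax follows Snowflake object tagging;
# adapt it for your target warehouse.
from typing import Dict, List

def tag_statements(table_name: str, schema: Dict) -> List[str]:
    """Generate one ALTER statement per (column, tag) pair."""
    statements = []
    for column in schema["columns"]:
        for tag in column.get("tags", []):
            key, _, value = tag.partition(":")
            statements.append(
                f"ALTER TABLE {table_name} MODIFY COLUMN {column['name']} "
                f"SET TAG {key} = '{value}'"
            )
    return statements

# The parsed form of table_schema.yaml above:
schema = {
    "columns": [
        {"name": "user_id", "type": "string",
         "tags": ["sensitivity:L2", "domain:identity"]},
        {"name": "email_address", "type": "string",
         "tags": ["sensitivity:L3", "pii:true"]},
    ]
}
```

Because the statements are generated from the version-controlled schema file, a pull request review of the YAML is effectively a review of the resulting warehouse policy.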
Explicit definition relies on human discipline. To catch sensitive data that engineers might miss, we implement automated scanners. These scanners sample data during the ingestion phase or run on a schedule against stored data.
We utilize regular expressions (Regex) and logic checks to identify patterns associated with sensitive information. If a scanner detects a pattern matching a credit card number or an email address in a column labeled "description," it automatically applies a provisional L3 or L4 tag.
The logic for a basic scanner involves iterating through schemas and validating sample rows.
import re
from typing import List

# Regex patterns for common sensitive data types
PATTERNS = {
    "email": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$",
    "ssn": r"^\d{3}-\d{2}-\d{4}$",
    "ipv4": r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$",
}

def scan_sample_data(samples: List[str]) -> List[str]:
    """
    Analyzes a list of data samples and returns detected tags.
    """
    detected_tags = []
    match_counts = {k: 0 for k in PATTERNS}
    threshold = 0.8  # 80% of non-null data must match to apply tag

    valid_samples = [s for s in samples if s is not None]
    if not valid_samples:
        return []

    for sample in valid_samples:
        for tag_type, pattern in PATTERNS.items():
            if re.match(pattern, str(sample)):
                match_counts[tag_type] += 1

    # Calculate confidence and apply tags
    for tag_type, count in match_counts.items():
        if (count / len(valid_samples)) >= threshold:
            detected_tags.append(f"detected_{tag_type}")

    return detected_tags
This function calculates the ratio of matches to total non-null rows. We use a threshold (0.8, or 80%) rather than a single match to avoid false positives caused by dirty data.
Data changes form as it moves through pipelines. A raw table containing L3 data might be joined with an L2 table to create a new derived view. A governance system utilizes lineage to propagate tags downstream.
The logic for propagation generally follows a "high-water mark" principle: the sensitivity of a derived asset is equal to the highest sensitivity level of its upstream dependencies.
Mathematically, if a derived column y is a function of input columns x_1, ..., x_n, its sensitivity level is:

S(y) = max(S(x_1), S(x_2), ..., S(x_n))
If you join a public dataset (L1) with a confidential dataset (L3), the resulting dataset must be treated as confidential (L3) unless an explicit de-identification function (like hashing or masking) is applied during the transformation.
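A minimal sketch of the high-water mark rule, assuming levels are encoded as the strings "L1" through "L4" (the post-masking level returned for de-identified data is an illustrative assumption; real policies vary):

```python
# High-water mark propagation: a derived asset inherits the highest
# sensitivity among its upstream inputs.
LEVEL_ORDER = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}

def propagate_sensitivity(upstream_levels, deidentified=False):
    """Return the sensitivity level of a derived asset.

    deidentified=True models an explicit hashing/masking step, the only
    case where the derived level may drop below the high-water mark.
    The resulting "L2" is an assumption for this sketch.
    """
    highest = max(upstream_levels, key=LEVEL_ORDER.__getitem__)
    if deidentified and LEVEL_ORDER[highest] > LEVEL_ORDER["L2"]:
        return "L2"
    return highest
```

A lineage-aware catalog would call this at every transformation node, so tags flow downstream without manual re-classification of each derived view.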
Tags must live where the query engine can see them. Storing tags in an external spreadsheet creates a disconnect between policy and enforcement. Modern data warehouses allow storing key-value pairs directly on the table objects.
When you execute a query, the engine inspects these native tags to determine if the requesting user has the necessary clearance. For example, a dynamic masking policy might check the tag on a column at query time:
-- Logic applied by the database engine at runtime.
-- tag_value() is illustrative; actual tag-lookup functions vary by warehouse.
CASE
    WHEN current_role() IN ('ANALYST_FULL') THEN email_address
    WHEN tag_value(email_address, 'sensitivity') = 'L3' THEN '***MASKED***'
    ELSE email_address
END
This creates a self-enforcing system. The scanner detects the data, applies the tag, and the database engine enforces the policy associated with that tag.
In this automated workflow, a scanner identifies sensitivity, updates the catalog, and informs the policy engine, which restricts or masks data at query time.
Automated tagging is never perfect. A "false positive" occurs when a scanner tags a product ID as a credit card number, potentially blocking access unnecessarily. A "false negative" occurs when sensitive data slips through untagged.
To manage this, we implement a "quarantine and review" workflow. When a scanner detects a new high-sensitivity tag (L3 or L4) on a previously low-sensitivity column, it should not immediately block production traffic if that traffic is critical. Instead, it should alert the data owner and potentially flag the dataset as "Needs Review."
However, for new datasets entering the system, the default posture should be "deny by default." Until a dataset is scanned and classified, it should be treated as Restricted (L4). This ensures that a failure in the scanning mechanism results in a closed door rather than an open data leak.
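These two rules (quarantine-and-review on escalation, deny-by-default for unscanned data) can be sketched as a small decision function. The state names and the choice to keep serving at the existing tag while flagging for review are illustrative assumptions:

```python
# Illustrative decision logic for the quarantine-and-review workflow.
# State names ("unscanned", "needs_review", "classified") are assumptions.
HIGH_SENSITIVITY = {"L3", "L4"}

def effective_sensitivity(current_tag, detected_tag=None):
    """Resolve (level_to_enforce, review_state) for a column.

    current_tag: the tag already on the column, or None if never classified.
    detected_tag: the tag a scanner just proposed, or None.
    """
    # Deny by default: unscanned, unclassified data is treated as Restricted.
    if current_tag is None and detected_tag is None:
        return "L4", "unscanned"
    # Escalation from low to high sensitivity: keep the existing tag so
    # critical traffic is not blocked, but flag for human review and alerting.
    if (detected_tag in HIGH_SENSITIVITY
            and current_tag is not None
            and current_tag not in HIGH_SENSITIVITY):
        return current_tag, "needs_review"
    return detected_tag or current_tag, "classified"
```

The key asymmetry: unknown data fails closed (L4), while a disputed escalation on known data fails open but loudly, routing the decision to the data owner.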