Practice: Implementing Access Policies

Managing access controls through manual database commands or user interface clicks inevitably leads to configuration drift. Reliable data engineering environments define access policies as code objects and apply them programmatically. This ensures that permissions are version-controlled, testable, and consistent across environments.

In this practice section, we will build a Python-based policy engine. This engine will take a user's role and a dataset's metadata as input, then dynamically generate a secure SQL query. This approach, often called "Dynamic Views" or "Query Rewriting," is a fundamental pattern for implementing scalable governance.

Defining Policy Objects

The first step is moving policy definitions out of the database and into a configuration structure. We use a dictionary structure here, but in production, this might be a YAML file managed in git. We define roles and map them to permissible data tags.

This model relies on the intersection of user attributes (roles) and data attributes (tags). If a user's role grants access to a specific tag, they can query the associated columns.

from typing import Dict, List, Optional
from dataclasses import dataclass

# Define the allowed access levels for specific tags
# A user with 'analyst' role can only see 'public' and 'internal' data
ACCESS_POLICIES = {
    "analyst": {"allowed_tags": {"public", "internal"}},
    "engineer": {"allowed_tags": {"public", "internal", "confidential"}},
    "admin": {"allowed_tags": {"public", "internal", "confidential", "pii"}}
}

@dataclass
class ColumnMetadata:
    name: str
    tag: str
    dtype: str

# Define our table schema with governance tags
users_table_schema = [
    ColumnMetadata(name="user_id", tag="public", dtype="int"),
    ColumnMetadata(name="email", tag="pii", dtype="string"),
    ColumnMetadata(name="country", tag="public", dtype="string"),
    ColumnMetadata(name="salary_band", tag="confidential", dtype="string"),
    ColumnMetadata(name="created_at", tag="internal", dtype="timestamp"),
]

The Policy Enforcement Engine

Once the definitions are established, we need a mechanism to enforce them. We will implement a function that acts as a transpiler: it reads the schema and the requestor's role, then constructs a SQL statement.

This logic follows a simple set operation. Let $R$ be the set of tags allowed for a role, and let $t_c$ be the tag associated with column $c$ . The column is selected as-is if:

$t_c \in R$

If the condition is not met, we have two architectural choices: exclude the column entirely or mask the data (return NULL or a hash). Masking is often preferred in production analytics to prevent code breakage when a column suddenly "disappears" from a view schema.

Flow of the policy engine combining role definitions and schema metadata to produce compliant SQL.

The following code implements this logic. It iterates through the schema and determines whether to expose the raw column or a masked literal.

def generate_secure_query(table_name: str, schema: List[ColumnMetadata], role: str) -> str:
    """
    Generates a SELECT statement with dynamic masking based on RBAC.
    """
    if role not in ACCESS_POLICIES:
        raise ValueError(f"Role '{role}' is not defined in policies.")

    allowed_tags = ACCESS_POLICIES[role]["allowed_tags"]
    select_clauses = []

    for col in schema:
        if col.tag in allowed_tags:
            # User has access, select the raw column
            select_clauses.append(col.name)
        else:
            # User lacks access, mask the data
            # Casting NULL ensures the view schema remains consistent
            select_clauses.append(f"CAST(NULL AS {col.dtype}) AS {col.name}")

    query_cols = ", ".join(select_clauses)
    return f"SELECT {query_cols} FROM {table_name};"

# Example Usage:
print("--- Analyst View ---")
print(generate_secure_query("users", users_table_schema, "analyst"))

print("\n--- Admin View ---")
print(generate_secure_query("users", users_table_schema, "admin"))

Analyzing the Output

When you run the script above, you will observe how the governance logic alters the resulting SQL.

For the Analyst, who only has public and internal access, the output masks the PII and Confidential columns:

SELECT user_id, 
       CAST(NULL AS string) AS email, 
       country, 
       CAST(NULL AS string) AS salary_band, 
       created_at 
FROM users;

For the Admin, who has all permissions, the query retrieves all data transparently:

SELECT user_id, email, country, salary_band, created_at FROM users;

Implementing Policy-as-Code in Infrastructure

While generating strings is useful for understanding the logic, in a real environment, you apply this using infrastructure-as-code tools.

View Generation: Use the function above within a migration script (e.g., Alembic for Python or a dbt macro) to create restricted views named users_analyst or users_admin.
Row-Level Security (RLS): Most modern warehouses (Snowflake, PostgreSQL, BigQuery) support RLS. Instead of CAST(NULL...), your Python policy engine can generate the specific CREATE ROW POLICY statements required by your vendor.
CI/CD Integration: By defining policies in Python, you can write unit tests for your governance. You can assert that a specific role never sees a column tagged pii before the code is ever deployed to production.

This programmatic approach decouples the definition of security (the dictionary) from the implementation of security (the SQL generation). This separation allows you to update policies centrally without manually rewriting hundreds of SQL queries.

Was this section helpful?

References

PostgreSQL: Documentation: 16: 5.8. Row Security Policies, PostgreSQL Global Development Group, 2025 - Official documentation explaining how to implement Row-Level Security (RLS) in PostgreSQL, a practical mechanism for enforcing access policies directly within the database based on query rewriting.
Data Governance: The Definitive Guide, Evren Eryurek, Uri Gilad, Valliappa Lakshmanan, Anita Kibunguchy-Grant, Jessi Ashdown, 2021 (O'Reilly Media) - A comprehensive guide to data governance principles and practices, including strategies for data access, security, and policy implementation in modern data environments.