As you build more sophisticated applications, you'll quickly discover that prompts are not static strings. They are living assets that need to be tested, refined, and managed just like any other part of your codebase. Embedding multi-line prompt strings directly into your application logic can lead to what we might call "prompt spaghetti," making your code hard to read, maintain, and improve.
In a team setting, this problem is magnified. A prompt engineer might need to update a prompt's wording, a product manager might want to test a new instruction set, and a developer needs to integrate these changes without introducing bugs. A systematic approach to managing and versioning prompts is essential for building scalable and reliable LLM applications.
Managing prompts as first-class assets provides several significant advantages: prompt content can change without touching application code, performance metrics stay attached to the exact version that produced them, new variants can be tested side by side, and a problematic change can be rolled back to a known-good version.
The first step toward better management is to treat each prompt not just as a string, but as a versioned object with associated metadata. This encapsulates the prompt's template, its purpose, and any performance metrics tied to it.
You can create and register different versions of a prompt. For instance, let's define two versions of a prompt designed to review code. Version 1.0 is simple and direct, while Version 2.0 is more structured to guide the LLM's output.
from kerb.prompt import create_version, register_prompt
# Version 1.0: A simple, direct approach
v1 = create_version(
    name="code_reviewer",
    version="1.0",
    template="""Review this code and provide feedback:
{{code}}
Focus on bugs and improvements.""",
    description="Direct and concise code review prompt.",
    metadata={"tokens_avg": 50, "response_quality": 0.75}
)
# Version 2.0: More structured with specific guidelines
v2 = create_version(
    name="code_reviewer",
    version="2.0",
    template="""You are an expert code reviewer. Analyze the following code:
{{code}}
Provide feedback on:
1. Code correctness and bugs
2. Performance optimizations
3. Best practices and style
4. Security considerations
Format your response with clear sections.""",
    description="Structured code review with explicit guidelines.",
    metadata={"tokens_avg": 120, "response_quality": 0.85}
)
# Register both versions for later use
register_prompt(v1)
register_prompt(v2)
Here, each prompt is identified by a name ("code_reviewer") and a version string. The metadata dictionary is a powerful feature for tracking performance. You can store metrics like average token usage, latency, user feedback scores, or evaluation results directly with the prompt version that produced them.
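Because these metrics live on the version object itself, application code can use them directly, for example to prefer the higher-quality prompt but fall back to the cheaper one when the token budget is tight. A minimal sketch, continuing from the registration code above and assuming the objects returned by create_version expose the same .version and .metadata attributes used later in this section:

TOKEN_BUDGET = 100  # illustrative threshold, not a library setting

# Use the tracked metrics to make a runtime decision: prefer the higher-quality
# prompt, but fall back to the cheaper variant when the token budget is tight.
candidate = v2 if v2.metadata["tokens_avg"] <= TOKEN_BUDGET else v1
print(f"Selected version {candidate.version} (avg tokens: {candidate.metadata['tokens_avg']})")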
Once prompts are registered, you can retrieve a specific version from anywhere in your application using its name and version number. This decouples the prompt's content from the application logic.
The following code retrieves version 2.0 of our code_reviewer prompt and renders it with a code snippet, preparing it for an LLM call.
from kerb.prompt import get_prompt
# Retrieve version 2.0 of the "code_reviewer" prompt
prompt_v2 = get_prompt("code_reviewer", "2.0")
code_sample = "def add(x, y):\n return x + y * 2"
# Render the prompt template with the code
rendered_prompt = prompt_v2.render({"code": code_sample})
print(rendered_prompt)
This approach makes your code cleaner and more flexible. To switch to a different prompt version, you only need to change the version string in the get_prompt call, without touching the surrounding application logic.
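In practice, the version string often comes from configuration rather than being hardcoded, so switching versions becomes a deployment setting rather than a code change. A small sketch, where the PROMPT_VERSION environment variable is an arbitrary name chosen for illustration:

import os

from kerb.prompt import get_prompt

# Read the desired prompt version from configuration so a rollout or rollback
# is a config change rather than a code change.
active_version = os.environ.get("PROMPT_VERSION", "2.0")

prompt = get_prompt("code_reviewer", active_version)
rendered = prompt.render({"code": "def add(x, y):\n    return x + y * 2"})
print(rendered)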
As your prompt library grows, you'll need tools to inspect what's available. You can list all registered versions for a given prompt name.
from kerb.prompt import list_versions
versions = list_versions("code_reviewer")
print(f"Available versions of 'code_reviewer': {versions}")
# Expected output: Available versions of 'code_reviewer': ['1.0', '2.0']
This is useful for building internal dashboards or validation tools that let your team see all available prompt variants.
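For example, a simple inventory report can combine list_versions with get_prompt, again assuming the .metadata attribute shown later in this section:

from kerb.prompt import get_prompt, list_versions

# Build a small inventory of every registered variant and its tracked quality,
# the kind of summary an internal dashboard might display.
for version in list_versions("code_reviewer"):
    prompt = get_prompt("code_reviewer", version)
    quality = prompt.metadata.get("response_quality", "n/a")
    print(f"code_reviewer v{version}: response_quality={quality}")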
To analyze the differences between versions, you can use the compare_versions function. It provides a summary of each version's template, description, and metadata, which is valuable for debugging or deciding which version to use.
from kerb.prompt import compare_versions
comparison = compare_versions("code_reviewer")
# The output is a dictionary detailing each version's properties.
# For example, to see the description of version 1.0:
print(comparison['versions']['1.0']['description'])
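You can also inspect the comparison programmatically, for instance to confirm that two versions genuinely differ before testing them against each other. The sketch below assumes each entry in comparison['versions'] carries the 'template' field mentioned above:

# Sanity-check that the two templates actually differ before investing in an
# A/B test between them.
t1 = comparison["versions"]["1.0"]["template"]
t2 = comparison["versions"]["2.0"]["template"]
print("Templates are identical" if t1 == t2 else "Templates differ")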
Hardcoding version numbers is not always practical, especially when you want to test or deploy new prompts dynamically. A more flexible approach is to select a version programmatically, based on a defined strategy.
Figure: The lifecycle of a prompt, from drafting and registration to testing, analysis, and final deployment in production.
The select_version function supports several strategies for programmatic selection:
random: Selects a version at random. This is the foundation for A/B testing, allowing you to distribute user requests across different prompt versions to compare their performance in a live environment.
latest: Selects the most recently registered version. This is useful in development environments where you always want to be using the newest prompt.
best_performing: Selects the version with the highest score on a specified metadata metric. This allows you to automatically deploy the prompt that has proven to be the most effective based on your own evaluations.
Here is how you might use these strategies in practice.
from kerb.prompt import select_version
# For A/B testing, randomly select a version for each request
selected_for_ab_test = select_version("code_reviewer", strategy="random")
print(f"A/B Test: Using version {selected_for_ab_test.version}")
# In a development environment, always use the latest version
latest_version = select_version("code_reviewer", strategy="latest")
print(f"Development: Using latest version {latest_version.version}")
# For production, select the version with the highest 'response_quality'
best_version = select_version(
    "code_reviewer",
    strategy="best_performing",
    metric="response_quality"
)
print(f"Production: Using best version {best_version.version} (Quality: {best_version.metadata['response_quality']})")
By combining metadata with a selection strategy, you can build a system that automatically promotes the best-performing prompts, creating a powerful feedback loop for continuous improvement.
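Sketched end to end, that loop looks like the following: evaluate a candidate template offline, register it with its measured score, and let the best_performing strategy promote it automatically. The 0.88 score stands in for the output of your own evaluation harness, and the version 2.1 candidate below is purely illustrative.

from kerb.prompt import create_version, register_prompt, select_version

# A candidate template produced during prompt iteration.
candidate_template = """You are an expert code reviewer. Analyze the following code:
{{code}}
List bugs first, then suggest improvements, citing line numbers."""

# Placeholder for a real offline evaluation run against your own test set.
measured_quality = 0.88

# Register the candidate with its measured score attached.
candidate = create_version(
    name="code_reviewer",
    version="2.1",
    template=candidate_template,
    description="Candidate emphasizing bug-first, line-cited feedback.",
    metadata={"tokens_avg": 110, "response_quality": measured_quality},
)
register_prompt(candidate)

# Production selection now promotes the candidate automatically, because its
# response_quality exceeds the 0.85 recorded for version 2.0.
best = select_version("code_reviewer", strategy="best_performing", metric="response_quality")
print(f"Promoted version: {best.version}")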
As you integrate prompt management into your workflow, consider these best practices:
Use semantic versioning: adopt a major.minor.patch scheme (e.g., 2.1.0) for your prompts. Increment the major version for breaking changes, the minor version for new features or significant improvements, and the patch version for minor tweaks and bug fixes. If you ever order these version strings yourself, compare them numerically, as sketched below.
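One caveat when sorting versions outside the library's latest strategy: compare them as versions, not as plain strings. A sketch using the third-party packaging library:

from packaging.version import Version

versions = ["1.0", "2.0", "2.1.0", "10.0.0"]

# Plain string sorting would place "10.0.0" before "2.0"; semantic comparison
# orders the major, minor, and patch components numerically.
print(sorted(versions, key=Version))  # ['1.0', '2.0', '2.1.0', '10.0.0']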