As you build more sophisticated applications, you'll quickly discover that prompts are not static strings. They are living assets that need to be tested, refined, and managed just like any other part of your codebase. Embedding multi-line prompt strings directly into your application logic can lead to what we might call "prompt spaghetti," making your code hard to read, maintain, and improve.
In a team setting, this problem is magnified. A prompt engineer might need to update a prompt's wording, a product manager might want to test a new instruction set, and a developer needs to integrate these changes without introducing bugs. A systematic approach to managing and versioning prompts is essential for building scalable and reliable LLM applications.
Managing prompts as first-class assets provides several significant advantages: prompt content can change without touching application code, performance metrics stay attached to the exact version that produced them, new variants can be tested side by side, and a problematic change can be rolled back to a known-good version.
The first step toward better management is to treat each prompt not just as a string, but as a versioned object with associated metadata. This encapsulates the prompt's template, its purpose, and any performance metrics tied to it.
You can create and register different versions of a prompt. For instance, let's define two versions of a prompt designed to review code. Version 1.0 is simple and direct, while Version 2.0 is more structured to guide the LLM's output.
from kerb.prompt import create_version, register_prompt
# Version 1.0: A simple, direct approach
v1 = create_version(
    name="code_reviewer",
    version="1.0",
    template="""Review this code and provide feedback:
{{code}}
Focus on bugs and improvements.""",
    description="Direct and concise code review prompt.",
    metadata={"tokens_avg": 50, "response_quality": 0.75}
)
# Version 2.0: More structured with specific guidelines
v2 = create_version(
    name="code_reviewer",
    version="2.0",
    template="""You are an expert code reviewer. Analyze the following code:
{{code}}
Provide feedback on:
1. Code correctness and bugs
2. Performance optimizations
3. Best practices and style
4. Security considerations
Format your response with clear sections.""",
    description="Structured code review with explicit guidelines.",
    metadata={"tokens_avg": 120, "response_quality": 0.85}
)
# Register both versions for later use
register_prompt(v1)
register_prompt(v2)
Here, each prompt is identified by a name ("code_reviewer") and a version string. The metadata dictionary is a powerful feature for tracking performance. You can store metrics like average token usage, latency, user feedback scores, or evaluation results directly with the prompt version that produced them.
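Because these metrics live on the version object itself, application code can use them directly, for example to prefer the higher-quality prompt but fall back to the cheaper one when the token budget is tight. A minimal sketch, continuing from the registration code above and assuming the objects returned by create_version expose the same .version and .metadata attributes used later in this section:

TOKEN_BUDGET = 100  # illustrative threshold, not a library setting

# Use the tracked metrics to make a runtime decision: prefer the higher-quality
# prompt, but fall back to the cheaper variant when the token budget is tight.
candidate = v2 if v2.metadata["tokens_avg"] <= TOKEN_BUDGET else v1
print(f"Selected version {candidate.version} (avg tokens: {candidate.metadata['tokens_avg']})")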
Once prompts are registered, you can retrieve a specific version from anywhere in your application using its name and version number. This decouples the prompt's content from the application logic.
The following code retrieves version 2.0 of our code_reviewer prompt and renders it with a code snippet, preparing it for an LLM call.
from kerb.prompt import get_prompt
# Retrieve version 2.0 of the "code_reviewer" prompt
prompt_v2 = get_prompt("code_reviewer", "2.0")
code_sample = "def add(x, y):\n return x + y * 2"
# Render the prompt template with the code
rendered_prompt = prompt_v2.render({"code": code_sample})
print(rendered_prompt)
This approach makes your code cleaner and more flexible. To switch to a different prompt version, you only need to change the version string in the get_prompt call, without touching the surrounding application logic.
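In practice, the version string often comes from configuration rather than being hardcoded, so switching versions becomes a deployment setting rather than a code change. A small sketch, where the PROMPT_VERSION environment variable is an arbitrary name chosen for illustration:

import os

from kerb.prompt import get_prompt

# Read the desired prompt version from configuration so a rollout or rollback
# is a config change rather than a code change.
active_version = os.environ.get("PROMPT_VERSION", "2.0")

prompt = get_prompt("code_reviewer", active_version)
rendered = prompt.render({"code": "def add(x, y):\n    return x + y * 2"})
print(rendered)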
As your prompt library grows, you'll need tools to inspect what's available. You can list all registered versions for a given prompt name.
from kerb.prompt import list_versions
versions = list_versions("code_reviewer")
print(f"Available versions of 'code_reviewer': {versions}")
# Expected output: Available versions of 'code_reviewer': ['1.0', '2.0']
This is useful for building internal dashboards or validation tools that let your team see all available prompt variants.
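For example, a simple inventory report can combine list_versions with get_prompt, again assuming the .metadata attribute shown later in this section:

from kerb.prompt import get_prompt, list_versions

# Build a small inventory of every registered variant and its tracked quality,
# the kind of summary an internal dashboard might display.
for version in list_versions("code_reviewer"):
    prompt = get_prompt("code_reviewer", version)
    quality = prompt.metadata.get("response_quality", "n/a")
    print(f"code_reviewer v{version}: response_quality={quality}")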
To analyze the differences between versions, you can use the compare_versions function. It provides a summary of each version's template, description, and metadata, which is valuable for debugging or deciding which version to use.
from kerb.prompt import compare_versions
comparison = compare_versions("code_reviewer")
# The output is a dictionary detailing each version's properties.
# For example, to see the description of version 1.0:
print(comparison['versions']['1.0']['description'])
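You can also inspect the comparison programmatically, for instance to confirm that two versions genuinely differ before testing them against each other. The sketch below assumes each entry in comparison['versions'] carries the 'template' field mentioned above:

# Sanity-check that the two templates actually differ before investing in an
# A/B test between them.
t1 = comparison["versions"]["1.0"]["template"]
t2 = comparison["versions"]["2.0"]["template"]
print("Templates are identical" if t1 == t2 else "Templates differ")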
Hardcoding version numbers is not always practical, especially when you want to test or deploy new prompts dynamically. A more flexible approach is to select a version programmatically, based on a defined strategy.
Figure: The lifecycle of a prompt, from drafting and registration to testing, analysis, and final deployment in production.
The select_version function supports several strategies for programmatic selection:
random: Selects a version at random. This is the foundation for A/B testing, allowing you to distribute user requests across different prompt versions to compare their performance in a live environment.
latest: Selects the most recently registered version. This is useful in development environments where you always want to be using the newest prompt.
best_performing: Selects the version with the highest score on a specified metadata metric. This allows you to automatically deploy the prompt that has proven to be the most effective based on your own evaluations.
Here is how you might use these strategies in practice.
from kerb.prompt import select_version
# For A/B testing, randomly select a version for each request
selected_for_ab_test = select_version("code_reviewer", strategy="random")
print(f"A/B Test: Using version {selected_for_ab_test.version}")
# In a development environment, always use the latest version
latest_version = select_version("code_reviewer", strategy="latest")
print(f"Development: Using latest version {latest_version.version}")
# For production, select the version with the highest 'response_quality'
best_version = select_version(
    "code_reviewer",
    strategy="best_performing",
    metric="response_quality"
)
print(f"Production: Using best version {best_version.version} (Quality: {best_version.metadata['response_quality']})")
By combining metadata with a selection strategy, you can build a system that automatically promotes the best-performing prompts, creating a powerful feedback loop for continuous improvement.
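Sketched end to end, that loop looks like the following: evaluate a candidate template offline, register it with its measured score, and let the best_performing strategy promote it automatically. The 0.88 score stands in for the output of your own evaluation harness, and the version 2.1 candidate below is purely illustrative.

from kerb.prompt import create_version, register_prompt, select_version

# A candidate template produced during prompt iteration.
candidate_template = """You are an expert code reviewer. Analyze the following code:
{{code}}
List bugs first, then suggest improvements, citing line numbers."""

# Placeholder for a real offline evaluation run against your own test set.
measured_quality = 0.88

# Register the candidate with its measured score attached.
candidate = create_version(
    name="code_reviewer",
    version="2.1",
    template=candidate_template,
    description="Candidate emphasizing bug-first, line-cited feedback.",
    metadata={"tokens_avg": 110, "response_quality": measured_quality},
)
register_prompt(candidate)

# Production selection now promotes the candidate automatically, because its
# response_quality exceeds the 0.85 recorded for version 2.0.
best = select_version("code_reviewer", strategy="best_performing", metric="response_quality")
print(f"Promoted version: {best.version}")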
As you integrate prompt management into your workflow, consider these best practices:
Use semantic versioning: adopt a major.minor.patch scheme (e.g., 2.1.0) for your prompts. Increment the major version for breaking changes, the minor version for new features or significant improvements, and the patch version for minor tweaks and bug fixes. If you ever order these version strings yourself, compare them numerically, as sketched below.
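One caveat when sorting versions outside the library's latest strategy: compare them as versions, not as plain strings. A sketch using the third-party packaging library:

from packaging.version import Version

versions = ["1.0", "2.0", "2.1.0", "10.0.0"]

# Plain string sorting would place "10.0.0" before "2.0"; semantic comparison
# orders the major, minor, and patch components numerically.
print(sorted(versions, key=Version))  # ['1.0', '2.0', '2.1.0', '10.0.0']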